Voice AI Testing Guide: Speech-to-Text and TTS Pipelines
Voice AI has moved from novelty to infrastructure. Transcription powers meeting summarization, accessibility tooling, and customer support automation. Text-to-speech drives voice assistants, audiobooks, and IVR systems. Both are production-critical, and both break in ways that traditional QA has no framework for handling.
This guide covers how to build a testing strategy for speech-to-text (STT) and text-to-speech (TTS) pipelines — from unit testing API calls to running Word Error Rate evaluations in CI.
Why Voice AI Testing Is Different
A REST API either returns the right JSON or it doesn't. A speech model returns something that is approximately right with some probability. That probabilistic nature changes everything about how you test.
Key differences from standard API testing:
- Output is not deterministic: the same audio file can produce slightly different transcriptions across model versions
- Quality is continuous, not binary: at 95% word accuracy a 200-word transcript contains about 10 errors, while at 99% it contains about 2, and that gap matters in production
- Test data is domain-specific: a model that works well on American English news broadcasts may fail on Scottish accents or technical jargon
- Failures are often silent: the API returns 200 with a plausible-sounding transcription that is simply wrong
The goal of voice AI testing is not just "does it run" but "does it produce output of acceptable quality."
Core Metrics
Before writing tests, define what success looks like.
Word Error Rate (WER) is the primary metric for STT quality:
WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
A WER of 5% means 1 in 20 words is wrong. Acceptable WER depends on your use case: real-time captioning tolerates higher WER than medical transcription.
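To make the formula concrete, here is a minimal sketch that counts each error type with a word-level edit distance. The function name `wer_breakdown` and the sample sentences are illustrative, and the only normalization is lowercasing and whitespace splitting; a maintained library such as jiwer, used in the integration tests later in this guide, is the better choice for real evaluations.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Count word-level substitutions, deletions, and insertions via edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total_edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match: no new edit
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total edit count first, so min() picks the cheapest path
            cost[i][j] = min(substitution, deletion, insertion)

    total, subs, dels, ins = cost[len(ref)][len(hyp)]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": total / len(ref),
    }


result = wer_breakdown(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown box jumps over lazy dog today",
)
print(result)
```

In this made-up example the hypothesis has one substitution ("fox" to "box"), one deletion ("the"), and one insertion ("today") against a 9-word reference, so WER is 3 / 9, about 33%. Note that insertions mean WER can exceed 100%.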
Character Error Rate (CER) is useful for languages where word boundaries are ambiguous, or when evaluating fine-grained errors in proper nouns.
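As a rough illustration of why CER helps with proper nouns, the sketch below computes a character-level Levenshtein distance divided by the reference length. The function name `char_error_rate` is illustrative, and no normalization is applied, so casing differences count as errors.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j characters of the hypothesis
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


cer = char_error_rate("HelpMeTest", "HelpMetest")
print(f"CER: {cer:.0%}")
```

At the word level this single-token hypothesis is simply wrong (one error out of one word), while CER shows that only one of ten characters differs.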
Mean Opinion Score (MOS) is the standard for TTS quality: listeners rate naturalness on a 1–5 scale. Automated MOS estimation tools like NISQA or MOSNet can stand in for human raters in CI, though their scores only approximate human judgment.
Real-Time Factor (RTF) measures speed: RTF = processing_time / audio_duration. An RTF below 1.0 means faster than real-time, which is typically required for live transcription.
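RTF can be measured by timing the transcription call and dividing by the clip length. The sketch below assumes the call is wrapped in a zero-argument callable; `real_time_factor` is an illustrative helper name, and the `time.sleep` call is only a stand-in for a real transcription request.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[], object], audio_duration_s: float) -> float:
    """Time one transcription call and return processing_time / audio_duration."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s


# Stand-in for a real API call: pretend a 1-second clip takes ~50 ms to process
rtf = real_time_factor(lambda: time.sleep(0.05), audio_duration_s=1.0)
print(f"RTF: {rtf:.2f}")
```

For a remote API the measured time includes network latency and upload time, so it is worth averaging over several runs before comparing against a budget.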
Building Audio Test Fixtures
Good fixtures are the foundation. Audio test fixtures must:
- Cover the acoustic conditions you expect in production
- Have ground-truth transcriptions to compare against
- Be version-controlled alongside your code
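The ground-truth requirement is easy to lose track of as a fixture set grows. A minimal sketch of a consistency check follows, assuming `.wav` audio files and `.txt` transcripts that share a file stem; the function name and directory arguments are illustrative.

```python
from pathlib import Path


def find_unpaired_fixtures(audio_dir: str, transcript_dir: str) -> dict:
    """Report audio files without a transcript and transcripts without audio."""
    audio = {p.stem for p in Path(audio_dir).glob("*.wav")}
    transcripts = {p.stem for p in Path(transcript_dir).glob("*.txt")}
    return {
        "missing_transcripts": sorted(audio - transcripts),
        "orphan_transcripts": sorted(transcripts - audio),
    }
```

Running a check like this as an ordinary unit test keeps fixture drift from surfacing later as a confusing WER failure.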
A minimal fixture set for an English-language transcription service:
tests/
  fixtures/
    audio/
      clean_studio.wav        # ideal conditions baseline
      office_background.wav   # ambient noise
      phone_call_8khz.wav     # compressed/low-quality audio
      accented_speech.wav     # non-native speaker
      technical_jargon.wav    # domain-specific vocabulary
      fast_speech.wav         # high speaking rate
      silence.wav             # edge case: empty input
      very_short.wav          # edge case: < 1 second
    transcripts/
      clean_studio.txt
      office_background.txt
      phone_call_8khz.txt
      # ... matching files
Generate audio fixtures programmatically where possible:
import numpy as np
import soundfile as sf


def create_silence_fixture(duration_seconds=0.5, sample_rate=16000):
    """Create a near-silence audio fixture for edge case testing."""
    samples = np.zeros(int(duration_seconds * sample_rate))
    # Add tiny noise floor to avoid "empty input" errors in some APIs
    samples += np.random.normal(0, 0.001, samples.shape)
    sf.write("tests/fixtures/audio/silence.wav", samples, sample_rate)


def add_background_noise(clean_audio_path, noise_level_db=-20):
    """Add calibrated background noise to a clean recording."""
    audio, sr = sf.read(clean_audio_path)
    noise = np.random.normal(0, 1, audio.shape)
    # Scale noise RMS to the target level relative to the signal (-20 = 20 dB below)
    signal_rms = np.sqrt(np.mean(audio**2))
    noise_rms = signal_rms * (10 ** (noise_level_db / 20))
    noise = noise * (noise_rms / np.sqrt(np.mean(noise**2)))
    return audio + noise, sr
Unit Testing STT API Calls
The first layer of testing covers your integration code, not the model itself.
import pytest
from unittest.mock import patch, MagicMock

# TranscriptionClient is your own API wrapper; import it from your package.
# These tests assume it is synchronous and sends requests via httpx.Client.post
# (transcribe() is called without await, so httpx.AsyncClient would not fit).


def test_transcription_client_sends_correct_format():
    """Verify our client sends audio in the format the API expects."""
    with patch("httpx.Client.post") as mock_post:
        mock_post.return_value = MagicMock(
            status_code=200,
            json=lambda: {
                "text": "hello world",
                "confidence": 0.98,
                "words": [
                    {"word": "hello", "start": 0.0, "end": 0.4},
                    {"word": "world", "start": 0.5, "end": 0.9},
                ],
            },
        )
        client = TranscriptionClient(api_key="test-key")
        result = client.transcribe("tests/fixtures/audio/clean_studio.wav")

        # Verify the audio was sent as a multipart upload. httpx builds the
        # multipart/form-data Content-Type header itself when files= is passed,
        # so check the files argument rather than the headers (assumes the
        # client uploads via files=).
        call_kwargs = mock_post.call_args.kwargs
        assert "files" in call_kwargs

        # Verify we handle the response correctly
        assert result.text == "hello world"
        assert result.confidence == 0.98
        assert len(result.words) == 2


def test_transcription_client_handles_rate_limit():
    """Verify the client retries on 429 responses until it succeeds."""
    with patch("httpx.Client.post") as mock_post:
        mock_post.side_effect = [
            MagicMock(status_code=429, headers={"Retry-After": "1"}),
            MagicMock(status_code=429, headers={"Retry-After": "2"}),
            MagicMock(status_code=200, json=lambda: {"text": "hello", "confidence": 0.95}),
        ]
        client = TranscriptionClient(api_key="test-key", max_retries=3)
        result = client.transcribe("tests/fixtures/audio/clean_studio.wav")

        assert result.text == "hello"
        assert mock_post.call_count == 3
WER-Based Integration Tests
Integration tests run against the real API and measure output quality:
import os

import jiwer
import pytest


def calculate_wer(reference: str, hypothesis: str) -> float:
    """Calculate Word Error Rate between reference and hypothesis."""
    transformation = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords(),
    ])
    # jiwer 3.0+ names this argument reference_transform (older releases: truth_transform)
    return jiwer.wer(
        reference,
        hypothesis,
        reference_transform=transformation,
        hypothesis_transform=transformation,
    )


@pytest.mark.integration
class TestTranscriptionQuality:
    FIXTURES = [
        ("clean_studio.wav", "clean_studio.txt", 0.05),  # max 5% WER
        ("office_background.wav", "office_background.txt", 0.12),  # 12% WER acceptable
        ("phone_call_8khz.wav", "phone_call_8khz.txt", 0.15),
    ]

    @pytest.mark.parametrize("audio_file,transcript_file,max_wer", FIXTURES)
    def test_transcription_quality(self, audio_file, transcript_file, max_wer):
        client = TranscriptionClient(api_key=os.environ["TRANSCRIPTION_API_KEY"])
        with open(f"tests/fixtures/transcripts/{transcript_file}") as f:
            reference = f.read().strip()
        result = client.transcribe(f"tests/fixtures/audio/{audio_file}")
        wer = calculate_wer(reference, result.text)
        assert wer <= max_wer, (
            f"WER {wer:.1%} exceeds threshold {max_wer:.1%} for {audio_file}. "
            f"Reference: '{reference[:100]}...'\n"
            f"Got: '{result.text[:100]}...'"
        )
Testing TTS Pipelines
TTS testing has two layers: does it produce audio, and is the audio good?
import io
import os

import numpy as np
import soundfile as sf

# TTSClient is your own API wrapper; synthesize() is assumed to return WAV bytes.


def get_audio_duration(audio_bytes):
    """Return the duration in seconds of in-memory WAV audio."""
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes))
    return len(audio) / sample_rate


class TestTTSPipeline:
    def test_tts_produces_valid_audio(self):
        """Basic sanity: TTS returns non-empty, valid WAV audio."""
        client = TTSClient(api_key=os.environ["TTS_API_KEY"])
        audio_bytes = client.synthesize("Hello, this is a test.")

        # Decode from memory to verify it's valid audio (no temp file to clean up)
        audio, sample_rate = sf.read(io.BytesIO(audio_bytes))
        assert len(audio) > 0, "TTS returned empty audio"
        assert sample_rate in [8000, 16000, 22050, 44100, 48000], f"Unexpected sample rate: {sample_rate}"
        assert not np.all(audio == 0), "TTS returned silence"

        # Check duration is reasonable for the input text
        duration_seconds = len(audio) / sample_rate
        assert 0.5 < duration_seconds < 10, f"Unexpected duration: {duration_seconds:.1f}s"

    def test_tts_ssml_prosody_control(self):
        """Verify SSML prosody tags are processed (not read aloud)."""
        client = TTSClient(api_key=os.environ["TTS_API_KEY"])
        normal_audio = client.synthesize("Hello world")
        slow_audio = client.synthesize('<speak><prosody rate="slow">Hello world</prosody></speak>')
        normal_duration = get_audio_duration(normal_audio)
        slow_duration = get_audio_duration(slow_audio)

        # Slow speech should be at least 20% longer
        assert slow_duration > normal_duration * 1.2, (
            f"SSML prosody rate='slow' had no effect. "
            f"Normal: {normal_duration:.2f}s, Slow: {slow_duration:.2f}s"
        )
CI Pipeline Integration
Run WER tests in CI without blocking fast unit tests:
# .github/workflows/voice-ai-quality.yml
name: Voice AI Quality Gates
on:
  schedule:
    - cron: "0 6 * * *"  # Daily at 06:00 UTC
  workflow_dispatch:
jobs:
  wer-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install jiwer soundfile pytest
      - name: Run WER quality tests
        env:
          TRANSCRIPTION_API_KEY: ${{ secrets.TRANSCRIPTION_API_KEY }}
        run: pytest tests/integration/test_stt_quality.py -v --tb=short
      - name: Upload quality report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: wer-report
          # Assumes the test run writes this file; the pytest command above does not create it by itself
          path: test-results/wer-report.json
Monitoring Voice AI in Production
Testing before release isn't enough — model updates from upstream providers can silently degrade quality. Set up continuous monitoring that runs transcription on a fixed set of reference files and alerts when WER crosses a threshold.
HelpMeTest health checks make this straightforward: schedule a test that transcribes your reference audio, computes WER, and fails if it exceeds your threshold. You get email or Slack alerts without managing your own monitoring infrastructure.
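Whichever scheduler runs it, the core of such a check is small. The sketch below assumes WER has already been computed for each reference file; `WerResult`, `find_regressions`, `format_alert`, and `run_check` are illustrative names, and the sample values are invented. It returns a process exit code so that any scheduler or health-check runner can treat a nonzero result as a failure.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    fixture: str      # reference audio file name
    wer: float        # WER measured in this run
    threshold: float  # maximum acceptable WER for this fixture


def find_regressions(results: list[WerResult]) -> list[WerResult]:
    """Return the results whose WER exceeds the configured threshold."""
    return [r for r in results if r.wer > r.threshold]


def format_alert(regressions: list[WerResult]) -> str:
    """Build a short, human-readable alert body."""
    lines = [f"{r.fixture}: WER {r.wer:.1%} > {r.threshold:.1%}" for r in regressions]
    return "WER regression detected\n" + "\n".join(lines)


def run_check(results: list[WerResult]) -> int:
    """Return an exit code: 0 if all fixtures pass, 1 if any regressed."""
    regressions = find_regressions(results)
    if regressions:
        print(format_alert(regressions))
        return 1
    return 0


# Invented sample values for illustration only
sample = [
    WerResult("clean_studio.wav", wer=0.04, threshold=0.05),
    WerResult("phone_call_8khz.wav", wer=0.21, threshold=0.15),
]
exit_code = run_check(sample)
```

Keeping the threshold per fixture mirrors the integration tests, where noisy or low-bandwidth audio is allowed a higher WER than studio recordings.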
Common Failure Modes to Test
Build specific tests for failure modes you know happen in production:
- Proper nouns: "HelpMeTest" transcribed as "help me test" or "help metest"
- Numbers: "1,234" vs "one thousand two hundred thirty-four"
- Homophone confusion: "their/there/they're"
- Sentence boundary detection: run-on sentences from fast speech
- Empty audio: silence, noise-only, or very short clips
- Large files: audio over the API's size limit
- Unsupported formats: MP4, OGG, or non-standard WAV encodings
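Several of these failure modes, proper nouns and number formatting in particular, are hidden by WER once text is lowercased and stripped of punctuation. One way to catch them is a targeted assertion that specific terms appear exactly as written. The sketch below is one possible approach; `missing_terms` is an illustrative helper, and the matching is deliberately case-sensitive.

```python
import re


def missing_terms(transcript: str, required_terms: list[str]) -> list[str]:
    """Return the required terms that do not appear as whole tokens in the transcript."""
    missing = []
    for term in required_terms:
        # Lookarounds stop the term from matching inside a longer word
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if not re.search(pattern, transcript):
            missing.append(term)
    return missing


print(missing_terms("Welcome to help me test", ["HelpMeTest"]))
print(missing_terms("The total is 1,234 units", ["1,234"]))
```

The first call reports the brand name as missing because the transcript split it into three words; the second finds the number in the expected format and reports nothing.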
Conclusion
Voice AI testing requires a different mindset than traditional API testing. The output is probabilistic and domain-specific, which means threshold-based quality metrics replace pass/fail assertions for the output itself — while standard unit testing still covers your integration code.
Start with a small, high-quality fixture set. Measure WER on your most important acoustic conditions. Gate deployments on those thresholds. Run daily monitoring to catch upstream model regressions. That's the complete picture for production voice AI quality assurance.