Voice AI Testing Guide: Speech-to-Text and TTS Pipelines

Voice AI has moved from novelty to infrastructure. Transcription powers meeting summarization, accessibility tooling, and customer support automation. Text-to-speech drives voice assistants, audiobooks, and IVR systems. Both are production-critical, and both break in ways that traditional QA has no framework for handling.

This guide covers how to build a testing strategy for speech-to-text (STT) and text-to-speech (TTS) pipelines — from unit testing API calls to running Word Error Rate evaluations in CI.

Why Voice AI Testing Is Different

A REST API either returns the right JSON or it doesn't. A speech model returns something that is approximately right with some probability. That probabilistic nature changes everything about how you test.

Key differences from standard API testing:

  • Output is not deterministic — the same audio file can produce slightly different transcriptions across model versions
  • Quality is continuous, not binary (95% word accuracy means roughly 1 word in 20 is wrong, 99% means roughly 1 in 100), and that difference matters in production
  • Test data is domain-specific — a model that works well on American English news broadcasts may fail on Scottish accents or technical jargon
  • Failures are often silent — the API returns 200 with a plausible-sounding transcription that is simply wrong

The goal of voice AI testing is not just "does it run" but "does it produce output of acceptable quality."

Core Metrics

Before writing tests, define what success looks like.

Word Error Rate (WER) is the primary metric for STT quality:

WER = (Substitutions + Deletions + Insertions) / Total Words in Reference

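A WER of 5% means roughly 1 error for every 20 words in the reference. Because insertions count as errors, WER can exceed 100% when the hypothesis contains many extra words. Acceptable WER depends on your use case: real-time captioning tolerates higher WER than medical transcription.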
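
As a worked illustration of the formula, the sketch below counts word-level edits with a plain Levenshtein alignment. The function name `word_error_rate` and the sample sentences are made up for this example, and the only normalization is lowercasing and whitespace splitting; a library such as jiwer does the same job with configurable normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # cost of deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please schedule the meeting for nine thirty tomorrow"
hypothesis = "please schedule a meeting for nine thirty"
# One substitution ("the" -> "a") and one deletion ("tomorrow") over 8 reference words
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 0.250
```

Dividing by the reference length is why a hypothesis with many inserted words can score above 1.0.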

Character Error Rate (CER) is useful for languages where word boundaries are ambiguous, or when evaluating fine-grained errors in proper nouns.

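Mean Opinion Score (MOS) is the standard subjective measure of TTS quality: listeners rate naturalness on a 1–5 scale. Automated MOS predictors such as NISQA or MOSNet can stand in for human raters in CI, though their scores are estimates and are better suited to tracking relative changes than to reporting absolute quality.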

Real-Time Factor (RTF) measures speed: RTF = processing_time / audio_duration. An RTF below 1.0 means faster than real-time, which is typically required for live transcription.
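
A minimal sketch of measuring RTF around a single call, assuming you already know the clip's duration. `measure_rtf` and `fake_transcribe` are hypothetical names used only for this illustration; substitute your own client call for the stand-in.

```python
import time
from typing import Callable


def measure_rtf(transcribe: Callable[[str], object], audio_path: str,
                audio_duration_s: float) -> float:
    """Return processing_time / audio_duration for one transcription call."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s


def fake_transcribe(path: str) -> str:
    # Stand-in for a real client call; sleeps briefly to simulate work
    time.sleep(0.05)
    return "hello world"


rtf = measure_rtf(fake_transcribe, "clean_studio.wav", audio_duration_s=2.0)
print(f"RTF: {rtf:.3f}")
```

For a hosted API the elapsed time includes upload, queueing, and network latency, so it is worth averaging over several calls before comparing against a threshold.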

Building Audio Test Fixtures

Good fixtures are the foundation. Audio test fixtures must:

  1. Cover the acoustic conditions you expect in production
  2. Have ground-truth transcriptions to compare against
  3. Be version-controlled alongside your code

A minimal fixture set for an English-language transcription service:

tests/
  fixtures/
    audio/
      clean_studio.wav          # ideal conditions baseline
      office_background.wav     # ambient noise
      phone_call_8khz.wav       # compressed/low-quality audio
      accented_speech.wav       # non-native speaker
      technical_jargon.wav      # domain-specific vocabulary
      fast_speech.wav           # high speaking rate
      silence.wav               # edge case: empty input
      very_short.wav            # edge case: < 1 second
    transcripts/
      clean_studio.txt
      office_background.txt
      phone_call_8khz.txt
      # ... matching files

Generate audio fixtures programmatically where possible:

import numpy as np
import soundfile as sf

def create_silence_fixture(duration_seconds=0.5, sample_rate=16000, seed=0):
    """Create a near-silence audio fixture for edge case testing."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the fixture reproducible
    samples = np.zeros(int(duration_seconds * sample_rate))
    # Add tiny noise floor to avoid "empty input" errors in some APIs
    samples += rng.normal(0, 0.001, samples.shape)
    sf.write("tests/fixtures/audio/silence.wav", samples, sample_rate)

def add_background_noise(clean_audio_path, noise_level_db=-20, seed=0):
    """Add white noise at noise_level_db relative to the signal RMS (-20 gives a 20 dB SNR)."""
    audio, sr = sf.read(clean_audio_path)
    rng = np.random.default_rng(seed)
    noise = rng.normal(0, 1, audio.shape)
    # Scale noise to target dB below signal
    signal_rms = np.sqrt(np.mean(audio**2))
    noise_rms = signal_rms * (10 ** (noise_level_db / 20))
    noise = noise * (noise_rms / np.sqrt(np.mean(noise**2)))
    # Clip so the mix stays in the valid float range when written back to WAV
    return np.clip(audio + noise, -1.0, 1.0), sr
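
Fixtures that contain no speech (tones, silence, very short clips) can also be produced with the standard library alone. The sketch below writes a 0.3-second, 8 kHz sine tone that could serve as a `very_short.wav` edge case. `write_tone_fixture` and its defaults are assumptions for this example, and it writes to a temporary directory rather than `tests/fixtures/audio/`.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone_fixture(path: str, duration_s: float = 0.3, sample_rate: int = 8000,
                       frequency_hz: float = 440.0, amplitude: float = 0.3) -> int:
    """Write a mono 16-bit PCM sine tone and return the number of frames written."""
    n_frames = round(duration_s * sample_rate)
    frames = bytearray()
    for n in range(n_frames):
        sample = amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames


out_path = os.path.join(tempfile.mkdtemp(), "very_short.wav")
written = write_tone_fixture(out_path)
print(f"wrote {written} frames to {out_path}")
```

Because the samples are computed rather than recorded, the file is identical on every run, which makes it easy to regenerate instead of committing the binary.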

Unit Testing STT API Calls

The first layer of testing covers your integration code, not the model itself.

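import pytest
from unittest.mock import patch, MagicMock

# TranscriptionClient is your own wrapper around the STT API; import it from your package.
# These tests assume it calls httpx.AsyncClient.post internally while exposing a
# synchronous transcribe(). patch() substitutes an AsyncMock for async methods.

def test_transcription_client_sends_correct_format():
    """Verify our client sends audio in the format the API expects."""
    with patch("httpx.AsyncClient.post") as mock_post: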
        mock_post.return_value = MagicMock(
            status_code=200,
            json=lambda: {
                "text": "hello world",
                "confidence": 0.98,
                "words": [
                    {"word": "hello", "start": 0.0, "end": 0.4},
                    {"word": "world", "start": 0.5, "end": 0.9}
                ]
            }
        )
        
        client = TranscriptionClient(api_key="test-key")
        result = client.transcribe("tests/fixtures/audio/clean_studio.wav")
        
        # Verify the API was called with correct content type
        call_kwargs = mock_post.call_args.kwargs
        assert call_kwargs["headers"]["Content-Type"].startswith("multipart/form-data")
        
        # Verify we handle the response correctly
        assert result.text == "hello world"
        assert result.confidence == 0.98
        assert len(result.words) == 2

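def test_transcription_client_handles_rate_limit():
    """Verify the client retries on 429 responses until a call succeeds."""
    # If the client really sleeps for Retry-After, patch its sleep call too to keep this fast
    with patch("httpx.AsyncClient.post") as mock_post: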
        mock_post.side_effect = [
            MagicMock(status_code=429, headers={"Retry-After": "1"}),
            MagicMock(status_code=429, headers={"Retry-After": "2"}),
            MagicMock(status_code=200, json=lambda: {"text": "hello", "confidence": 0.95})
        ]
        
        client = TranscriptionClient(api_key="test-key", max_retries=3)
        result = client.transcribe("tests/fixtures/audio/clean_studio.wav")
        
        assert result.text == "hello"
        assert mock_post.call_count == 3
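
The rate-limit test above assumes a client that retries on 429, but the article does not show that client. One possible shape for the retry loop is sketched below. `post_with_retry`, `RateLimitError`, and the `send` callable are hypothetical names, and the response only needs `status_code` and `headers` attributes. The sleep function is injectable so tests never wait for real.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Raised when the API keeps answering 429 after all retries."""


def post_with_retry(send: Callable[[], object], max_retries: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep):
    """Call send() until it returns a non-429 response or retries run out."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # Prefer the server's Retry-After hint in seconds; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        try:
            delay = float(retry_after)
        except (TypeError, ValueError):
            # Header missing, or given as an HTTP date instead of seconds
            delay = base_delay_s * (2 ** attempt)
        sleep(delay)
    raise RateLimitError(f"still rate limited after {max_retries} retries")
```

Passing `sleep=delays.append` in a test records the requested delays without pausing, which lets you assert on the backoff schedule as well as the call count.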

WER-Based Integration Tests

Integration tests run against the real API and measure output quality:

import os

import jiwer
import pytest

def calculate_wer(reference: str, hypothesis: str) -> float:
    """Calculate Word Error Rate between reference and hypothesis."""
    transformation = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords()
    ])
    # jiwer 3.0+ names this argument reference_transform; truth_transform is the older name
    return jiwer.wer(
        reference,
        hypothesis,
        reference_transform=transformation,
        hypothesis_transform=transformation
    )

@pytest.mark.integration
class TestTranscriptionQuality:
    
    FIXTURES = [
        ("clean_studio.wav", "clean_studio.txt", 0.05),    # max 5% WER
        ("office_background.wav", "office_background.txt", 0.12),  # 12% WER acceptable
        ("phone_call_8khz.wav", "phone_call_8khz.txt", 0.15),
    ]
    
    @pytest.mark.parametrize("audio_file,transcript_file,max_wer", FIXTURES)
    def test_transcription_quality(self, audio_file, transcript_file, max_wer):
        client = TranscriptionClient(api_key=os.environ["TRANSCRIPTION_API_KEY"])
        
        with open(f"tests/fixtures/transcripts/{transcript_file}") as f:
            reference = f.read().strip()
        
        result = client.transcribe(f"tests/fixtures/audio/{audio_file}")
        wer = calculate_wer(reference, result.text)
        
        assert wer <= max_wer, (
            f"WER {wer:.1%} exceeds threshold {max_wer:.1%} for {audio_file}. "
            f"Reference: '{reference[:100]}...'\n"
            f"Got: '{result.text[:100]}...'"
        )

Testing TTS Pipelines

TTS testing has two layers: does it produce audio, and is the audio good?

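import io
import os

import numpy as np
import soundfile as sf

# TTSClient is your own wrapper around the TTS API; import it from your package.

def get_audio_duration(audio_bytes):
    """Duration in seconds of encoded audio bytes (helper used by the tests below)."""
    audio, sample_rate = sf.read(io.BytesIO(audio_bytes))
    return len(audio) / sample_rate

class TestTTSPipeline:
    
    def test_tts_produces_valid_audio(self):
        """Basic sanity: TTS returns non-empty, valid WAV audio."""
        client = TTSClient(api_key=os.environ["TTS_API_KEY"])
        audio_bytes = client.synthesize("Hello, this is a test.")
        
        # Decode in memory (no temp file to clean up); sf.read raises on invalid audio
        audio, sample_rate = sf.read(io.BytesIO(audio_bytes))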
        
        assert len(audio) > 0, "TTS returned empty audio"
        # 24000 Hz is included because it is a common TTS output rate
        assert sample_rate in [8000, 16000, 22050, 24000, 44100, 48000], f"Unexpected sample rate: {sample_rate}"
        assert not np.all(audio == 0), "TTS returned silence"
        
        # Check duration is reasonable for the input text
        duration_seconds = len(audio) / sample_rate
        assert 0.5 < duration_seconds < 10, f"Unexpected duration: {duration_seconds:.1f}s"
    
    def test_tts_ssml_prosody_control(self):
        """Verify SSML prosody tags are processed (not read aloud)."""
        client = TTSClient(api_key=os.environ["TTS_API_KEY"])
        
        normal_audio = client.synthesize("Hello world")
        slow_audio = client.synthesize('<speak><prosody rate="slow">Hello world</prosody></speak>')
        
        normal_duration = get_audio_duration(normal_audio)
        slow_duration = get_audio_duration(slow_audio)
        
        # Slow speech should be at least 20% longer
        assert slow_duration > normal_duration * 1.2, (
            f"SSML prosody rate='slow' had no effect. "
            f"Normal: {normal_duration:.2f}s, Slow: {slow_duration:.2f}s"
        )

CI Pipeline Integration

Run WER tests in CI without blocking fast unit tests:

# .github/workflows/voice-ai-quality.yml
name: Voice AI Quality Gates

on:
  schedule:
    - cron: "0 6 * * *"  # Daily at 06:00 UTC (GitHub cron schedules run in UTC)
  workflow_dispatch:

jobs:
  wer-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install jiwer soundfile numpy httpx pytest

      - name: Run WER quality tests
        env:
          TRANSCRIPTION_API_KEY: ${{ secrets.TRANSCRIPTION_API_KEY }}
        # --junitxml writes a results file so the upload step has something to publish
        run: pytest tests/integration/test_stt_quality.py -v --tb=short --junitxml=test-results/junit.xml
      
      - name: Upload quality report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: wer-report
          path: test-results/
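
The workflow uploads a report from `test-results/`, but none of the tests shown earlier writes a WER report. One way to produce it is a small helper that the test session calls after collecting per-fixture results. The sketch below is an assumption about what such a report might contain; `write_wer_report` and the field names are hypothetical, and the numbers are illustrative rather than measured.

```python
import json
import tempfile
from pathlib import Path


def write_wer_report(results: list[dict], path: str = "test-results/wer-report.json") -> dict:
    """Write per-fixture WER results plus a short summary to a JSON file."""
    report = {
        "fixtures": results,
        "failed": [r["audio_file"] for r in results if r["wer"] > r["max_wer"]],
        "mean_wer": sum(r["wer"] for r in results) / len(results) if results else None,
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create test-results/ if missing
    out.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report


# Illustrative values only, not real measurements
results = [
    {"audio_file": "clean_studio.wav", "wer": 0.04, "max_wer": 0.05},
    {"audio_file": "office_background.wav", "wer": 0.15, "max_wer": 0.12},
]
report_path = Path(tempfile.mkdtemp()) / "test-results" / "wer-report.json"
report = write_wer_report(results, str(report_path))
print(report["failed"])  # ['office_background.wav']
```

Keeping the raw per-fixture numbers in the artifact makes it possible to chart WER over time instead of only seeing pass or fail.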

Monitoring Voice AI in Production

Testing before release isn't enough — model updates from upstream providers can silently degrade quality. Set up continuous monitoring that runs transcription on a fixed set of reference files and alerts when WER crosses a threshold.

HelpMeTest health checks make this straightforward: schedule a test that transcribes your reference audio, computes WER, and fails if it exceeds your threshold. You get email or Slack alerts without managing your own monitoring infrastructure.
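
Whatever runs the schedule, the check itself can stay small: compare the latest WER per reference file against a stored baseline and report anything that has drifted. The sketch below assumes both are plain dictionaries keyed by file name. `check_wer_regression` and the 0.02 tolerance are hypothetical choices for this example, and the sample numbers are illustrative.

```python
def check_wer_regression(baseline: dict[str, float], current: dict[str, float],
                         max_increase: float = 0.02) -> list[str]:
    """Return alert messages for fixtures whose WER rose by more than max_increase."""
    alerts = []
    for name, base_wer in sorted(baseline.items()):
        if name not in current:
            alerts.append(f"{name}: no current measurement")
            continue
        increase = current[name] - base_wer
        if increase > max_increase:
            alerts.append(
                f"{name}: WER {base_wer:.1%} -> {current[name]:.1%} (+{increase:.1%})"
            )
    return alerts


# Illustrative values only, not real measurements
baseline = {"clean_studio.wav": 0.03, "phone_call_8khz.wav": 0.11}
current = {"clean_studio.wav": 0.035, "phone_call_8khz.wav": 0.16}
for alert in check_wer_regression(baseline, current):
    print(alert)
```

Comparing against a baseline rather than a fixed ceiling catches a provider update that makes things noticeably worse while still staying under the release threshold.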

Common Failure Modes to Test

Build specific tests for failure modes you know happen in production:

  • Proper nouns: "HelpMeTest" transcribed as "help me test" or "help metest"
  • Numbers: "1,234" vs "one thousand two hundred thirty-four"
  • Homophone confusion: "their/there/they're"
  • Sentence boundary detection: run-on sentences from fast speech
  • Empty audio: silence, noise-only, or very short clips
  • Large files: audio over the API's size limit
  • Unsupported formats: MP4, OGG, or non-standard WAV encodings
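
For the proper-noun case in particular, overall WER can hide the problem, since one wrong brand name barely moves the score. A targeted check is sketched below: it classifies how a term appears in the transcript. `classify_term` and its three labels are assumptions made for this illustration.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify_term(term: str, hypothesis: str) -> str:
    """Classify how a proper noun shows up in a transcript.

    Returns "exact", "spacing_variant" (right letters, wrong word breaks), or "missing".
    """
    term_tokens = _tokens(term)
    if not term_tokens:
        raise ValueError("term must contain at least one letter or digit")
    hyp_tokens = _tokens(hypothesis)
    n = len(term_tokens)
    # Exact: the term's tokens appear as a contiguous run in the transcript
    if any(hyp_tokens[i:i + n] == term_tokens for i in range(len(hyp_tokens) - n + 1)):
        return "exact"
    # Spacing variant: same characters once all word breaks are removed
    if "".join(term_tokens) in "".join(hyp_tokens):
        return "spacing_variant"
    return "missing"


print(classify_term("HelpMeTest", "we monitor it with HelpMeTest"))    # exact
print(classify_term("HelpMeTest", "we monitor it with help me test"))  # spacing_variant
print(classify_term("HelpMeTest", "we monitor it with held my vest"))  # missing
```

The spacing check ignores word boundaries, so short terms can match by accident inside longer words; it is best reserved for distinctive names.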

Conclusion

Voice AI testing requires a different mindset than traditional API testing. The output is probabilistic and domain-specific, which means threshold-based quality metrics replace pass/fail assertions for the output itself — while standard unit testing still covers your integration code.

Start with a small, high-quality fixture set. Measure WER on your most important acoustic conditions. Gate deployments on those thresholds. Run daily monitoring to catch upstream model regressions. Together, these steps cover the core of production voice AI quality assurance.
