Audio and Speech Eval Frameworks: WER, BLEURT, and MOS Scoring in CI

Speech AI systems — transcription APIs, TTS engines, voice assistants — produce outputs that don't fit a binary pass/fail test. The question is never "did it produce output" but "is the output good enough." Answering that question consistently, automatically, and in CI requires evaluation frameworks purpose-built for audio and speech.

This guide covers the three most important evaluation approaches: Word Error Rate (WER) for transcription, BLEURT for natural language quality, and Mean Opinion Score (MOS) for speech synthesis. We'll set up each one with working code and show how to gate CI pipelines on them.

Word Error Rate (WER)

WER is the standard metric for automatic speech recognition (ASR) quality. It compares a reference transcript (ground truth) against the hypothesis (what the model produced) and counts the minimum number of word-level edits needed to transform one into the other.

WER = (Substitutions + Deletions + Insertions) / Total Reference Words

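A WER of 0.05 means the number of word-level errors equals 5% of the reference word count. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. General-purpose models like Whisper typically achieve 3–5% WER on clean English. Accented speech or technical vocabulary can push this to 15–25%.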
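To make the formula concrete, here is a minimal pure-Python sketch that counts substitutions, deletions, and insertions with a word-level Levenshtein alignment. It is an illustration only: `word_error_counts` is a hypothetical helper, it does no normalization, and when several alignments tie, its split between error types may differ from what a library such as jiwer reports.

```python
def word_error_counts(reference: str, hypothesis: str) -> dict:
    """Count word-level edits between reference and hypothesis (illustrative)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # table[i][j] = (total_edits, substitutions, deletions, insertions)
    # needed to turn ref[:i] into hyp[:j].
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match costs nothing
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitute = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insert = (t + 1, s, d, n + 1)
            table[i][j] = min(substitute, delete, insert)

    total, subs, dels, ins = table[len(ref)][len(hyp)]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": total / len(ref),
    }


result = word_error_counts("the cat sat on the mat", "the cat sit on mat")
print(result)
```

For this pair the sketch finds one substitution ("sat" to "sit") and one deletion ("the") over six reference words, a WER of about 33%. A one-word reference scored against a three-word hypothesis gives a WER of 3.0, which is why WER is not bounded by 100%.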

Installing jiwer

pip install jiwer

Basic WER calculation

import jiwer

def compute_wer(reference: str, hypothesis: str) -> float:
    """
    Compute Word Error Rate after normalizing both strings.
    Normalization prevents trivial differences (case, punctuation)
    from inflating the error rate.
    """
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords()
    ])
    
    return jiwer.wer(
        reference,
        hypothesis,
        reference_transform=transform,  # called truth_transform before jiwer 3.0
        hypothesis_transform=transform
    )

# Example
reference = "The meeting starts at three thirty in the afternoon."
hypothesis = "The meeting starts at 3:30 in the afternoon."

wer = compute_wer(reference, hypothesis)
print(f"WER: {wer:.1%}")  # WER: 22.2% ("three thirty" vs "330": 1 substitution + 1 deletion over 9 words)
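Number formatting like "3:30" versus "three thirty" is a transcription style difference rather than a recognition error, so it is common to normalize both strings before scoring. The following is a minimal sketch of that idea in plain Python. The `SPOKEN_FORMS` table and `normalize_for_wer` are hypothetical names, and a real table would need to cover the numbers, times, and abbreviations that appear in your own data.

```python
import re

# Hypothetical, project-specific mapping from written forms to spoken forms.
SPOKEN_FORMS = {
    "3:30": "three thirty",
    "dr.": "doctor",
}


def normalize_for_wer(text: str) -> str:
    """Lowercase, expand known written forms, strip punctuation, squeeze spaces."""
    text = text.lower()
    for written, spoken in SPOKEN_FORMS.items():
        # Match whole tokens only, so "3:30" is not rewritten inside "13:30"
        pattern = rf"(?<!\S){re.escape(written)}(?!\w)"
        text = re.sub(pattern, spoken, text)
    text = re.sub(r"[^\w\s]", "", text)  # drop remaining punctuation
    return " ".join(text.split())


reference = normalize_for_wer("The meeting starts at three thirty in the afternoon.")
hypothesis = normalize_for_wer("The meeting starts at 3:30 in the afternoon.")
print(reference == hypothesis)  # True
```

Passing both normalized strings to the WER function then scores the pair as a perfect match. Whether that is desirable depends on the product: if the written form matters to users, keep the stricter comparison.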

WER Diagnostic Breakdown

Use jiwer.process_words for debugging which specific words are wrong:

def wer_diagnostic(reference: str, hypothesis: str) -> dict:
    """Return WER plus the specific errors."""
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords()
    ])
    
    output = jiwer.process_words(
        reference,
        hypothesis,
        reference_transform=transform,
        hypothesis_transform=transform
    )
    
    return {
        "wer": output.wer,
        "mer": output.mer,  # Match Error Rate
        "wil": output.wil,  # Word Information Lost
        "substitutions": output.substitutions,
        "deletions": output.deletions,
        "insertions": output.insertions,
        "alignments": str(output.alignments)
    }

pytest Integration

# tests/integration/test_transcription_quality.py
import pytest
import jiwer
import os

QUALITY_BENCHMARKS = [
    # (audio_fixture, reference_text, max_wer)
    ("clean_studio_16k.wav", "tests/fixtures/transcripts/clean_studio.txt", 0.05),
    ("office_ambient_noise.wav", "tests/fixtures/transcripts/office.txt", 0.12),
    ("phone_quality_8k.wav", "tests/fixtures/transcripts/phone.txt", 0.18),
    ("strong_accent.wav", "tests/fixtures/transcripts/accent.txt", 0.20),
    ("fast_speech.wav", "tests/fixtures/transcripts/fast.txt", 0.10),
]

@pytest.mark.integration
@pytest.mark.parametrize("audio_file,reference_file,max_wer", QUALITY_BENCHMARKS)
def test_wer_threshold(audio_file, reference_file, max_wer, transcription_client):
    with open(reference_file) as f:
        reference = f.read().strip()
    
    hypothesis = transcription_client.transcribe(f"tests/fixtures/audio/{audio_file}")
    
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords()
    ])
    
    wer = jiwer.wer(
        reference,
        hypothesis,
        reference_transform=transform,  # called truth_transform before jiwer 3.0
        hypothesis_transform=transform
    )
    
    assert wer <= max_wer, (
        f"\n{'='*60}\n"
        f"WER THRESHOLD EXCEEDED: {audio_file}\n"
        f"{'='*60}\n"
        f"WER:       {wer:.1%} (max: {max_wer:.1%})\n"
        f"Reference: {reference[:120]}\n"
        f"Got:       {hypothesis[:120]}\n"
    )

BLEURT for Semantic Quality

WER is word-exact. With "The patient has high blood pressure" as the reference, the hypothesis "The patient has hypertension" gets a WER of 0.5 (one substitution and two deletions over six reference words), even though it is semantically correct. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) is a learned metric that captures semantic similarity.

BLEURT is slower and heavier than WER, so use it selectively — for summarization quality, paraphrase correctness, or cases where semantic meaning matters more than exact word choice.

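pip install evaluate
# The evaluate wrapper also needs Google's bleurt package:
pip install git+https://github.com/google-research/bleurt.git
# BLEURT needs the checkpoint (downloaded on first load):
python -c "import evaluate; evaluate.load('bleurt', config_name='bleurt-20')"
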
import evaluate

bleurt = evaluate.load("bleurt", config_name="bleurt-20")

def compute_bleurt(references: list[str], predictions: list[str]) -> list[float]:
    """
    Compute BLEURT scores for a batch. Higher is better.
    BLEURT-20 scores fall roughly between 0 and 1; older checkpoints are
    not calibrated to a fixed range. Treat 0.5 as a rough starting point
    for "good" and calibrate the cutoff on your own data.
    """
    results = bleurt.compute(predictions=predictions, references=references)
    return results["scores"]

# Example: semantic paraphrase scoring
references = [
    "The patient has elevated blood pressure.",
    "The meeting was postponed until next week."
]
predictions = [
    "The patient has hypertension.",               # Semantically equivalent
    "The meeting got pushed to the following week." # Semantically equivalent
]

scores = compute_bleurt(references, predictions)
for ref, pred, score in zip(references, predictions, scores):
    print(f"Score: {score:.3f} | Ref: {ref[:40]} | Pred: {pred[:40]}")

Using BLEURT in CI

BLEURT is expensive for large batches. Cache results and only re-evaluate when fixtures change:

import hashlib
import json
from pathlib import Path

import pytest

BLEURT_CACHE_FILE = Path(".bleurt-cache.json")

def cached_bleurt_score(reference: str, hypothesis: str) -> float:
    """Cache BLEURT scores by content hash to avoid re-computation."""
    cache_key = hashlib.md5(f"{reference}|||{hypothesis}".encode()).hexdigest()
    
    cache = {}
    if BLEURT_CACHE_FILE.exists():
        with open(BLEURT_CACHE_FILE) as f:
            cache = json.load(f)
    
    if cache_key in cache:
        return cache[cache_key]
    
    scores = compute_bleurt([reference], [hypothesis])
    cache[cache_key] = scores[0]
    
    with open(BLEURT_CACHE_FILE, "w") as f:
        json.dump(cache, f)
    
    return scores[0]

@pytest.mark.integration
def test_meeting_summary_bleurt():
    """Meeting summary should be semantically close to reference summary."""
    from myapp.summarizer import summarize_transcript
    
    with open("tests/fixtures/transcripts/board_meeting.txt") as f:
        transcript = f.read()
    
    with open("tests/fixtures/expected_summaries/board_meeting.txt") as f:
        reference_summary = f.read()
    
    generated_summary = summarize_transcript(transcript)
    score = cached_bleurt_score(reference_summary, generated_summary)
    
    assert score >= 0.4, (
        f"BLEURT score {score:.3f} < 0.4 threshold\n"
        f"Reference: {reference_summary[:200]}\n"
        f"Generated: {generated_summary[:200]}"
    )
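The caching idea above can also be written so that the scorer is injected, which makes the cache testable without loading BLEURT. This is a sketch under stated assumptions: `make_cached_scorer` is a hypothetical helper, and including the checkpoint name in the key is a design choice (so that switching checkpoints does not reuse stale scores), not something BLEURT requires.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def make_cached_scorer(
    score_fn: Callable[[str, str], float],
    cache_file: Path,
    checkpoint: str = "bleurt-20",
) -> Callable[[str, str], float]:
    """Wrap score_fn(reference, hypothesis) with a JSON-file cache."""
    # Load the cache once instead of re-reading the file on every call
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    def scorer(reference: str, hypothesis: str) -> float:
        # JSON-encode the parts so no separator can collide with the text,
        # and include the checkpoint so a model change invalidates old scores
        payload = json.dumps([checkpoint, reference, hypothesis])
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = float(score_fn(reference, hypothesis))
            cache_file.write_text(json.dumps(cache))
        return cache[key]

    return scorer
```

With the `compute_bleurt` function from earlier, the wrapper could be built as `make_cached_scorer(lambda r, h: compute_bleurt([r], [h])[0], Path(".bleurt-cache.json"))`. In CI the cache file only helps if it is persisted between runs, for example through the CI system's cache mechanism.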

MOS Scoring for TTS Quality

Mean Opinion Score (MOS) is the gold standard for text-to-speech quality assessment. Traditional MOS requires human raters on a 1–5 scale. Automated MOS estimation — using models like NISQA, MOSNet, or UTMOS — approximates human ratings and enables CI integration.
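For reference, a human MOS is simply the mean of the listener ratings, usually reported with a confidence interval. Here is a minimal sketch, assuming integer ratings on the 1–5 scale and a normal approximation for the 95% interval. `mean_opinion_score` is a hypothetical helper name and the ratings are made up for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = float(statistics.mean(ratings))
    # Normal approximation: 1.96 standard errors on either side of the mean
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


ratings = [4, 5, 3, 4, 4, 5, 4, 3]  # made-up listener ratings for one clip
mos, ci = mean_opinion_score(ratings)
print(f"MOS: {mos:.2f} ± {ci:.2f}")  # MOS: 4.00 ± 0.52
```

The width of the interval is a useful reminder when setting thresholds: with only a handful of raters, differences of a few tenths of a MOS point are within the noise.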

pip install nisqa
# or
pip install torch torchaudio  # for UTMOS

Using NISQA for Automated MOS

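import subprocess
import sys
import tempfile

def estimate_mos_nisqa(audio_path: str) -> float:
    """
    Estimate MOS score using NISQA model.
    Returns estimated MOS in range [1.0, 5.0].
    """
    # NISQA CLI interface. The module path, model name, and output format
    # here are this guide's assumptions about the installed package; the
    # upstream NISQA repository documents a run_predict.py script that takes
    # the same flags, so adjust the command and parsing to your installation.
    with tempfile.TemporaryDirectory() as output_dir:
        result = subprocess.run([
            sys.executable, "-m", "nisqa.predict",
            "--mode", "predict_file",
            "--pretrained_model", "nisqa",
            "--deg", audio_path,
            "--output_dir", output_dir,
        ], capture_output=True, text=True)

    # Fail loudly instead of trying to parse the output of a crashed process
    if result.returncode != 0:
        raise RuntimeError(f"NISQA exited with {result.returncode}: {result.stderr}")

    # Parse output
    for line in result.stdout.split("\n"):
        if "NISQA_MOS" in line:
            parts = line.strip().split()
            return float(parts[-1])

    raise ValueError(f"Could not parse MOS from NISQA output: {result.stdout}")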

# Alternative: a lightweight signal heuristic for CI (not a learned model)
def estimate_mos_simple(audio_path: str, sample_rate: int = 22050) -> float:
    """
    Simplified MOS proxy using signal quality heuristics.
    Not as accurate as NISQA but faster for CI use.
    Returns a score approximating MOS range.
    """
    import soundfile as sf
    import numpy as np
    
    audio, sr = sf.read(audio_path)
    
    # Heuristic 1: Signal-to-noise ratio proxy
    signal_power = np.mean(audio**2)
    noise_floor = np.percentile(np.abs(audio), 5)**2
    snr = 10 * np.log10(signal_power / (noise_floor + 1e-10))
    
    # Heuristic 2: Clipping detection
    clipping_ratio = np.mean(np.abs(audio) > 0.99)
    
    # Map to approximate MOS range (very rough)
    mos_proxy = min(5.0, max(1.0, 1.0 + (snr / 20) * 3.0 - clipping_ratio * 2.0))
    return mos_proxy

MOS Quality Gates in pytest

import pytest

@pytest.mark.integration
class TestTTSQuality:
    
    TTS_TEST_CASES = [
        ("Hello, how can I help you today?", 3.5),          # Simple greeting
        ("Your appointment is on March 15th at 2:30 PM.", 3.5),  # With numbers/dates
        ("The total is $1,247.93 including tax.", 3.5),     # Currency
        ("Dr. Smith will see you now.", 3.5),               # Titles
    ]
    
    @pytest.mark.parametrize("text,min_mos", TTS_TEST_CASES)
    def test_mos_meets_threshold(self, text, min_mos, tts_client, tmp_path):
        audio_bytes = tts_client.synthesize(text)
        
        audio_file = tmp_path / "output.wav"
        audio_file.write_bytes(audio_bytes)
        
        mos = estimate_mos_nisqa(str(audio_file))
        
        assert mos >= min_mos, (
            f"MOS {mos:.2f} < threshold {min_mos} for text: '{text}'"
        )
    
    def test_mos_consistent_across_runs(self, tts_client, tmp_path):
        """TTS quality should be stable — high variance indicates instability."""
        text = "The quick brown fox jumps over the lazy dog."
        scores = []
        
        for i in range(3):
            audio_bytes = tts_client.synthesize(text)
            audio_file = tmp_path / f"output_{i}.wav"
            audio_file.write_bytes(audio_bytes)
            scores.append(estimate_mos_nisqa(str(audio_file)))
        
        # Range (max - min) across runs, not the statistical variance
        spread = max(scores) - min(scores)
        assert spread < 0.3, (
            f"MOS spread {spread:.2f} too high across runs. "
            f"Scores: {scores}"
        )

CI Pipeline Integration

# .github/workflows/speech-eval.yml
name: Speech Quality Evaluation

on:
  schedule:
    - cron: "0 6 * * *"  # Daily
  workflow_dispatch:

jobs:
  wer-evaluation:
    name: WER Quality Gates
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install jiwer pytest
      - name: Run WER tests
        env:
          ASR_API_KEY: ${{ secrets.ASR_API_KEY }}
        run: |
          pytest tests/integration/test_transcription_quality.py \
            -v --tb=short \
            --junitxml=results/wer-results.xml
      # Jobs run on separate runners, so hand the report over as an artifact
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: wer-results
          path: results/wer-results.xml

  mos-evaluation:
    name: MOS Quality Gates
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install torch torchaudio nisqa pytest soundfile
      - name: Run MOS tests
        env:
          TTS_API_KEY: ${{ secrets.TTS_API_KEY }}
        run: |
          pytest tests/integration/test_tts_quality.py \
            -v --tb=short \
            --junitxml=results/mos-results.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: mos-results
          path: results/mos-results.xml

  publish-results:
    needs: [wer-evaluation, mos-evaluation]
    runs-on: ubuntu-latest
    if: always()
    steps:
      # Checkout provides scripts/aggregate_quality_metrics.py
      - uses: actions/checkout@v4
      # Fetch the reports uploaded by the two evaluation jobs into results/
      - uses: actions/download-artifact@v4
        with:
          path: results
          merge-multiple: true

      - name: Collect quality metrics
        run: |
          python scripts/aggregate_quality_metrics.py \
            --wer-results results/wer-results.xml \
            --mos-results results/mos-results.xml \
            --output results/quality-summary.json

      - uses: actions/upload-artifact@v4
        with:
          name: quality-report
          path: results/quality-summary.json
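The workflow calls `scripts/aggregate_quality_metrics.py`, which this guide does not show. As a rough sketch of what its core could look like, the functions below read the JUnit XML reports written by `--junitxml` and produce a JSON summary of pass and fail counts. The function names and the shape of the summary are assumptions, and the command-line wrapper that would parse `--wer-results`, `--mos-results`, and `--output` is left out.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_junit(xml_path: Path) -> dict:
    """Sum the test counters over every <testsuite> element in a report."""
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    root = ET.parse(xml_path).getroot()
    # iter() also yields the root itself when the root is a <testsuite>
    for suite in root.iter("testsuite"):
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    totals["passed"] = (
        totals["tests"] - totals["failures"] - totals["errors"] - totals["skipped"]
    )
    return totals


def build_summary(wer_xml: Path, mos_xml: Path, output: Path) -> dict:
    """Combine both reports and write them to one JSON file."""
    summary = {
        "wer": summarize_junit(wer_xml),
        "mos": summarize_junit(mos_xml),
    }
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(summary, indent=2))
    return summary
```

A summary built from test counts only shows how many gates passed. To track the metric values themselves over time, the tests would need to record them, for example as properties in the JUnit report, which this sketch does not attempt.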

Setting Quality Thresholds

Use historical data to set initial thresholds, then tighten them as your system matures:

Condition                  WER Target   Notes
Clean, studio audio        ≤ 5%         Baseline capability
Office background noise    ≤ 12%        Expected production degradation
Phone-quality audio        ≤ 18%        Compressed codecs add errors
Non-native speaker         ≤ 20%        Accent-specific models perform better
Technical jargon           ≤ 10%        Domain adaptation helps significantly

TTS Context                   MOS Target   Notes
General assistant responses   ≥ 3.5        Acceptable naturalness
Audiobook narration           ≥ 4.0        Higher bar for long-form
Real-time voice call          ≥ 3.0        Latency often trades off quality
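One way to derive an initial threshold from historical data is to take a high percentile of past WER measurements for a fixture and add a small margin, so that normal run-to-run noise does not fail the build. The sketch below uses a nearest-rank percentile. The function name, the 95th percentile, the 0.01 margin, and the sample history are all illustrative assumptions rather than recommended values.

```python
import math


def wer_threshold_from_history(
    history: list[float],
    percentile: float = 95.0,
    margin: float = 0.01,
) -> float:
    """Nearest-rank percentile of past WER values plus a fixed margin."""
    if not history:
        raise ValueError("history must contain at least one WER value")
    ordered = sorted(history)
    # Nearest-rank method: smallest value with at least `percentile` percent
    # of the observations at or below it
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return round(ordered[rank - 1] + margin, 4)


# Made-up WER values from ten earlier runs on the same audio fixture
past_runs = [0.041, 0.038, 0.045, 0.043, 0.040, 0.052, 0.039, 0.044, 0.042, 0.046]
print(wer_threshold_from_history(past_runs))  # 0.062
```

Tightening the gate later then amounts to lowering the percentile or the margin once the history shows the system has become more stable.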

For continuous production monitoring, HelpMeTest lets you schedule these evaluations and alert on threshold violations — eliminating the need to manage your own cron infrastructure for quality monitoring.

Conclusion

WER, BLEURT, and MOS give you the evaluation vocabulary for different speech AI tasks. WER is fast and essential for transcription. BLEURT catches semantic errors WER misses. MOS quantifies what your users actually hear. Running all three in CI turns subjective "sounds good" judgments into objective, version-controlled quality gates.
