Audio and Speech Eval Frameworks: WER, BLEURT, and MOS Scoring in CI
Speech AI systems — transcription APIs, TTS engines, voice assistants — produce outputs that don't fit a binary pass/fail test. The question is never "did it produce output" but "is the output good enough." Answering that question consistently, automatically, and in CI requires evaluation frameworks purpose-built for audio and speech.
This guide covers the three most important evaluation approaches: Word Error Rate (WER) for transcription, BLEURT for natural language quality, and Mean Opinion Score (MOS) for speech synthesis. We'll set up each one with working code and show how to gate CI pipelines on them.
Word Error Rate (WER)
WER is the standard metric for automatic speech recognition (ASR) quality. It compares a reference transcript (ground truth) against the hypothesis (what the model produced) and counts the minimum number of word-level edits needed to transform one into the other.
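WER = (Substitutions + Deletions + Insertions) / Total Reference Words

A WER of 0.05 means the hypothesis needs 5 word-level edits per 100 reference words. Because insertions count as errors while the denominator is the reference length, WER can exceed 100%. General-purpose models like Whisper typically achieve 3–5% WER on clean English. Accented speech or technical vocabulary can push this to 15–25%.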
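To make the formula concrete, here is a minimal sketch of WER computed from scratch with a word-level Levenshtein distance. The function name `word_error_rate` is illustrative and the normalization is deliberately crude (lowercasing and whitespace splitting only); it is not how jiwer is implemented, just the same definition written out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER via word-level edit distance (illustrative sketch, not jiwer's code)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion over six reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Two insertions over one reference word: WER above 100%
print(word_error_rate("hello", "hello there friend"))
```

In practice a library such as jiwer is preferable because it also handles normalization and alignment output; the sketch only shows where the three error counts come from.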
Installing jiwer
pip install jiwer

Basic WER calculation
import jiwer

def compute_wer(reference: str, hypothesis: str) -> float:
    """
    Compute Word Error Rate after normalizing both strings.
    Normalization prevents trivial differences (case, punctuation)
    from inflating the error rate.
    """
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords(),
    ])
    # jiwer 3.x names this argument reference_transform;
    # the older truth_transform spelling is deprecated.
    return jiwer.wer(
        reference,
        hypothesis,
        reference_transform=transform,
        hypothesis_transform=transform,
    )

# Example
reference = "The meeting starts at three thirty in the afternoon."
hypothesis = "The meeting starts at 3:30 in the afternoon."
wer = compute_wer(reference, hypothesis)
# About 22%: "three thirty" vs "330" is one substitution plus one
# deletion over 9 reference words (number formatting counts as errors)
print(f"WER: {wer:.1%}")

WER Diagnostic Breakdown
Use jiwer.process_words for debugging which specific words are wrong:
def wer_diagnostic(reference: str, hypothesis: str) -> dict:
    """Return WER plus the specific errors."""
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords(),
    ])
    output = jiwer.process_words(
        reference,
        hypothesis,
        reference_transform=transform,
        hypothesis_transform=transform,
    )
    return {
        "wer": output.wer,
        "mer": output.mer,  # Match Error Rate
        "wil": output.wil,  # Word Information Lost
        "substitutions": output.substitutions,
        "deletions": output.deletions,
        "insertions": output.insertions,
        "alignments": str(output.alignments),
    }

pytest Integration
# tests/integration/test_transcription_quality.py
import pytest
import jiwer

QUALITY_BENCHMARKS = [
    # (audio_fixture, reference_transcript_path, max_wer)
    ("clean_studio_16k.wav", "tests/fixtures/transcripts/clean_studio.txt", 0.05),
    ("office_ambient_noise.wav", "tests/fixtures/transcripts/office.txt", 0.12),
    ("phone_quality_8k.wav", "tests/fixtures/transcripts/phone.txt", 0.18),
    ("strong_accent.wav", "tests/fixtures/transcripts/accent.txt", 0.20),
    ("fast_speech.wav", "tests/fixtures/transcripts/fast.txt", 0.10),
]

# transcription_client is a fixture you supply (for example in conftest.py)
@pytest.mark.integration
@pytest.mark.parametrize("audio_file,reference_file,max_wer", QUALITY_BENCHMARKS)
def test_wer_threshold(audio_file, reference_file, max_wer, transcription_client):
    with open(reference_file) as f:
        reference = f.read().strip()
    hypothesis = transcription_client.transcribe(f"tests/fixtures/audio/{audio_file}")
    transform = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords(),
    ])
    # reference_transform is the jiwer 3.x name (formerly truth_transform)
    wer = jiwer.wer(
        reference,
        hypothesis,
        reference_transform=transform,
        hypothesis_transform=transform,
    )
    assert wer <= max_wer, (
        f"\n{'='*60}\n"
        f"WER THRESHOLD EXCEEDED: {audio_file}\n"
        f"{'='*60}\n"
        f"WER: {wer:.1%} (max: {max_wer:.1%})\n"
        f"Reference: {reference[:120]}\n"
        f"Got: {hypothesis[:120]}\n"
    )

BLEURT for Semantic Quality
WER is word-exact. Scoring the transcription "The patient has hypertension" against the reference "The patient has high blood pressure" gives a WER of 0.5 (one substitution and two deletions over six reference words), even though the two sentences mean the same thing. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) is a learned metric that captures semantic similarity.
BLEURT is slower and heavier than WER, so use it selectively — for summarization quality, paraphrase correctness, or cases where semantic meaning matters more than exact word choice.
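One way to apply BLEURT selectively is a two-stage gate: run the cheap WER check first and only call the expensive semantic metric when WER fails. The sketch below is an assumption about how such a gate could look, not part of any library; `two_stage_gate`, `GateResult`, and both thresholds are hypothetical names and values, and the two scoring functions are passed in so the logic can be exercised without loading a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    passed: bool
    wer: float
    semantic_score: Optional[float]  # None when the semantic metric was skipped


def two_stage_gate(
    reference: str,
    hypothesis: str,
    wer_fn: Callable[[str, str], float],
    semantic_fn: Callable[[str, str], float],
    max_wer: float = 0.10,       # hypothetical threshold
    min_semantic: float = 0.5,   # hypothetical threshold
) -> GateResult:
    """Pass on low WER; otherwise fall back to the slower semantic score."""
    wer = wer_fn(reference, hypothesis)
    if wer <= max_wer:
        # Cheap check passed, so the expensive model is never invoked
        return GateResult(passed=True, wer=wer, semantic_score=None)
    score = semantic_fn(reference, hypothesis)
    return GateResult(passed=score >= min_semantic, wer=wer, semantic_score=score)


# Stub scorers stand in for jiwer and BLEURT in this demonstration
result = two_stage_gate(
    "the patient has high blood pressure",
    "the patient has hypertension",
    wer_fn=lambda ref, hyp: 0.5,
    semantic_fn=lambda ref, hyp: 0.8,
)
print(result)
```

Whether a paraphrase should be allowed to rescue a high-WER transcript depends on the product: for verbatim transcription it usually should not, while for summaries or intent extraction it often should.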
pip install evaluate
# The evaluate wrapper also needs Google's bleurt package:
pip install git+https://github.com/google-research/bleurt.git
# Download the checkpoint ahead of time:
python -c "import evaluate; evaluate.load('bleurt', config_name='bleurt-20')"

import evaluate

bleurt = evaluate.load("bleurt", config_name="bleurt-20")

def compute_bleurt(references: list[str], predictions: list[str]) -> list[float]:
    """
    Compute BLEURT scores for a batch.
    With the BLEURT-20 checkpoint, scores fall roughly between 0 and 1
    (occasionally outside that range), and higher is better.
    This guide treats scores above 0.5 as good quality; scores are not
    comparable across checkpoints, so validate the cutoff on your own data.
    """
    results = bleurt.compute(predictions=predictions, references=references)
    return results["scores"]

# Example: semantic paraphrase scoring
references = [
    "The patient has elevated blood pressure.",
    "The meeting was postponed until next week.",
]
predictions = [
    "The patient has hypertension.",  # Semantically equivalent
    "The meeting got pushed to the following week.",  # Semantically equivalent
]
scores = compute_bleurt(references, predictions)
for ref, pred, score in zip(references, predictions, scores):
    print(f"Score: {score:.3f} | Ref: {ref[:40]} | Pred: {pred[:40]}")

Using BLEURT in CI
BLEURT is expensive for large batches. Cache results and only re-evaluate when fixtures change:
import hashlib
import json
from pathlib import Path

import pytest

BLEURT_CACHE_FILE = Path(".bleurt-cache.json")

def cached_bleurt_score(reference: str, hypothesis: str) -> float:
    """Cache BLEURT scores by content hash to avoid re-computation."""
    # MD5 is used only as a cache key here, not for security
    cache_key = hashlib.md5(f"{reference}|||{hypothesis}".encode()).hexdigest()
    cache = {}
    if BLEURT_CACHE_FILE.exists():
        with open(BLEURT_CACHE_FILE) as f:
            cache = json.load(f)
    if cache_key in cache:
        return cache[cache_key]
    # compute_bleurt is the helper defined in the previous section
    scores = compute_bleurt([reference], [hypothesis])
    cache[cache_key] = scores[0]
    with open(BLEURT_CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return scores[0]

@pytest.mark.integration
def test_meeting_summary_bleurt():
    """Meeting summary should be semantically close to reference summary."""
    from myapp.summarizer import summarize_transcript
    with open("tests/fixtures/transcripts/board_meeting.txt") as f:
        transcript = f.read()
    with open("tests/fixtures/expected_summaries/board_meeting.txt") as f:
        reference_summary = f.read()
    generated_summary = summarize_transcript(transcript)
    score = cached_bleurt_score(reference_summary, generated_summary)
    assert score >= 0.4, (
        f"BLEURT score {score:.3f} < 0.4 threshold\n"
        f"Reference: {reference_summary[:200]}\n"
        f"Generated: {generated_summary[:200]}"
    )

MOS Scoring for TTS Quality
Mean Opinion Score (MOS) is the gold standard for text-to-speech quality assessment. Traditional MOS requires human raters on a 1–5 scale. Automated MOS estimation — using models like NISQA, MOSNet, or UTMOS — approximates human ratings and enables CI integration.
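For reference, the human-rated MOS that these models approximate is simply the mean of listener ratings on the 1–5 scale, usually reported with a confidence interval. The following is a minimal sketch under stated assumptions: `mean_opinion_score` is an illustrative name, the ratings are made-up sample data, and the interval uses a normal approximation (a t-distribution would be more appropriate for small panels).

```python
import math
import statistics


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.fmean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings from eight listeners for a single utterance
mos, half_width = mean_opinion_score([4, 5, 3, 4, 4, 5, 4, 3])
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

The width of that interval is a useful reminder when choosing automated thresholds: a predicted MOS that moves by less than the human rating noise is probably not a meaningful regression.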
pip install nisqa
# or
pip install torch torchaudio  # for UTMOS

Using NISQA for Automated MOS
import subprocess
import sys
import tempfile

def estimate_mos_nisqa(audio_path: str) -> float:
    """
    Estimate MOS score using NISQA model.
    Returns estimated MOS in range [1.0, 5.0].

    Assumes an install that exposes a nisqa.predict module and prints a
    NISQA_MOS line. The upstream NISQA repository is run through its
    run_predict.py script instead, so adjust the command to your setup.
    """
    # NISQA CLI interface; the temp directory is removed when the block exits
    with tempfile.TemporaryDirectory() as output_dir:
        result = subprocess.run([
            sys.executable, "-m", "nisqa.predict",
            "--mode", "predict_file",
            "--pretrained_model", "nisqa",
            "--deg", audio_path,
            "--output_dir", output_dir,
        ], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"NISQA failed: {result.stderr}")
    # Parse output
    for line in result.stdout.split("\n"):
        if "NISQA_MOS" in line:
            parts = line.strip().split()
            return float(parts[-1])
    raise ValueError(f"Could not parse MOS from NISQA output: {result.stdout}")

# Alternative: use a lighter model for CI
def estimate_mos_simple(audio_path: str) -> float:
    """
    Simplified MOS proxy using signal quality heuristics.
    Not as accurate as NISQA but faster for CI use.
    Returns a score approximating MOS range.
    """
    import soundfile as sf
    import numpy as np
    audio, _ = sf.read(audio_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # mix multi-channel audio down to mono
    # Heuristic 1: Signal-to-noise ratio proxy
    signal_power = np.mean(audio**2)
    noise_floor = np.percentile(np.abs(audio), 5)**2
    # Small constants avoid log(0) and division by zero on silent input
    snr = 10 * np.log10((signal_power + 1e-10) / (noise_floor + 1e-10))
    # Heuristic 2: Clipping detection
    clipping_ratio = np.mean(np.abs(audio) > 0.99)
    # Map to approximate MOS range (very rough)
    mos_proxy = min(5.0, max(1.0, 1.0 + (snr / 20) * 3.0 - clipping_ratio * 2.0))
    return float(mos_proxy)

MOS Quality Gates in pytest
# tests/integration/test_tts_quality.py
import pytest

# estimate_mos_nisqa is the helper defined above; import it from wherever
# you keep it. tts_client is a fixture you supply (for example in conftest.py).

@pytest.mark.integration
class TestTTSQuality:
    TTS_TEST_CASES = [
        ("Hello, how can I help you today?", 3.5),  # Simple greeting
        ("Your appointment is on March 15th at 2:30 PM.", 3.5),  # With numbers/dates
        ("The total is $1,247.93 including tax.", 3.5),  # Currency
        ("Dr. Smith will see you now.", 3.5),  # Titles
    ]

    @pytest.mark.parametrize("text,min_mos", TTS_TEST_CASES)
    def test_mos_meets_threshold(self, text, min_mos, tts_client, tmp_path):
        audio_bytes = tts_client.synthesize(text)
        audio_file = tmp_path / "output.wav"
        audio_file.write_bytes(audio_bytes)
        mos = estimate_mos_nisqa(str(audio_file))
        assert mos >= min_mos, (
            f"MOS {mos:.2f} < threshold {min_mos} for text: '{text}'"
        )

    def test_mos_consistent_across_runs(self, tts_client, tmp_path):
        """TTS quality should be stable; a wide spread indicates instability."""
        text = "The quick brown fox jumps over the lazy dog."
        scores = []
        for i in range(3):
            audio_bytes = tts_client.synthesize(text)
            audio_file = tmp_path / f"output_{i}.wav"
            audio_file.write_bytes(audio_bytes)
            scores.append(estimate_mos_nisqa(str(audio_file)))
        # Range (max minus min), not statistical variance
        spread = max(scores) - min(scores)
        assert spread < 0.3, (
            f"MOS spread {spread:.2f} too high across runs. "
            f"Scores: {scores}"
        )

CI Pipeline Integration
# .github/workflows/speech-eval.yml
name: Speech Quality Evaluation
on:
  schedule:
    - cron: "0 6 * * *"  # Daily at 06:00 UTC
  workflow_dispatch:
jobs:
  wer-evaluation:
    name: WER Quality Gates
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install jiwer pytest
      - name: Run WER tests
        env:
          ASR_API_KEY: ${{ secrets.ASR_API_KEY }}
        run: |
          pytest tests/integration/test_transcription_quality.py \
            -v --tb=short \
            --junitxml=results/wer-results.xml
      # Each job runs on a fresh runner, so pass the report on as an artifact
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: wer-results
          path: results/wer-results.xml
  mos-evaluation:
    name: MOS Quality Gates
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install torch torchaudio nisqa pytest soundfile
      - name: Run MOS tests
        env:
          TTS_API_KEY: ${{ secrets.TTS_API_KEY }}
        run: |
          pytest tests/integration/test_tts_quality.py \
            -v --tb=short \
            --junitxml=results/mos-results.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: mos-results
          path: results/mos-results.xml
  publish-results:
    needs: [wer-evaluation, mos-evaluation]
    runs-on: ubuntu-latest
    if: always()
    steps:
      - uses: actions/checkout@v4  # needed for scripts/
      - uses: actions/download-artifact@v4
        with:
          path: results
          merge-multiple: true
      - name: Collect quality metrics
        run: |
          python scripts/aggregate_quality_metrics.py \
            --wer-results results/wer-results.xml \
            --mos-results results/mos-results.xml \
            --output results/quality-summary.json
      - uses: actions/upload-artifact@v4
        with:
          name: quality-report
          path: results/quality-summary.json

Setting Quality Thresholds
Use historical data to set initial thresholds, then tighten them as your system matures:
| Condition | WER Target | Notes |
|---|---|---|
| Clean, studio audio | ≤ 5% | Baseline capability |
| Office background noise | ≤ 12% | Expected production degradation |
| Phone-quality audio | ≤ 18% | Compressed codecs add errors |
| Non-native speaker | ≤ 20% | Accent-specific models perform better |
| Technical jargon | ≤ 10% | Domain adaptation helps significantly |

| TTS Context | MOS Target | Notes |
|---|---|---|
| General assistant responses | ≥ 3.5 | Acceptable naturalness |
| Audiobook narration | ≥ 4.0 | Higher bar for long-form |
| Real-time voice call | ≥ 3.0 | Latency often trades off quality |
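The advice to seed thresholds from historical data can be sketched as taking a high quantile of past WER results and adding a small safety margin. This is one possible heuristic rather than an established rule; `suggest_max_wer`, the 95% quantile, and the 0.01 margin are all assumptions, and the history values are invented for illustration.

```python
import math


def suggest_max_wer(history: list[float], quantile: float = 0.95,
                    margin: float = 0.01) -> float:
    """Suggest a WER ceiling: a high quantile of past runs plus a margin."""
    if not history:
        raise ValueError("history must contain at least one past WER value")
    ordered = sorted(history)
    # Nearest-rank quantile: smallest value with at least `quantile`
    # of past runs at or below it
    rank = max(1, math.ceil(quantile * len(ordered)))
    return round(ordered[rank - 1] + margin, 4)


# Invented WER results from ten earlier nightly runs on the same fixture
past_runs = [0.04, 0.05, 0.045, 0.06, 0.05, 0.048, 0.052, 0.047, 0.049, 0.051]
print(suggest_max_wer(past_runs))
```

Tightening then amounts to lowering the margin or the quantile as results stabilize. For MOS, which is a floor rather than a ceiling, the mirror image applies: a low quantile minus a margin.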
For continuous production monitoring, HelpMeTest lets you schedule these evaluations and alert on threshold violations — eliminating the need to manage your own cron infrastructure for quality monitoring.
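Whichever scheduler runs the evaluations, alerting needs a machine-readable summary of each run. The workflow calls `scripts/aggregate_quality_metrics.py` without showing it, so the following is only a guess at what its core might do: count passed, failed, and skipped cases in a pytest JUnit XML report. The function name `summarize_junit` and the output keys are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Count test outcomes in a JUnit XML report (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Reports may have a <testsuites> wrapper or a bare <testsuite> root
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    total = failed = skipped = 0
    failed_tests = []
    for suite in suites:
        for case in suite.iter("testcase"):
            total += 1
            if case.find("failure") is not None or case.find("error") is not None:
                failed += 1
                failed_tests.append(case.get("name"))
            elif case.find("skipped") is not None:
                skipped += 1
    return {
        "total": total,
        "failed": failed,
        "skipped": skipped,
        "failed_tests": failed_tests,
    }


# Small hand-written report used only to demonstrate the function
sample = """<testsuites>
  <testsuite name="pytest" tests="3">
    <testcase name="test_wer_threshold[clean]"/>
    <testcase name="test_wer_threshold[phone]"><failure message="WER"/></testcase>
    <testcase name="test_wer_threshold[accent]"><skipped/></testcase>
  </testsuite>
</testsuites>"""
print(json.dumps(summarize_junit(sample), indent=2))
```

A full script would read the two report paths from its command-line arguments and write the combined dictionary to the `--output` file.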
Conclusion
WER, BLEURT, and MOS give you the evaluation vocabulary for different speech AI tasks. WER is fast and essential for transcription. BLEURT catches semantic errors WER misses. MOS quantifies what your users actually hear. Running all three in CI turns subjective "sounds good" judgments into objective, version-controlled quality gates.