Testing Embedding Models: Regression, Benchmarking, and Drift Detection
Embedding models are the foundation of vector search, RAG pipelines, and semantic applications. When you switch embedding models, update to a new version, or change preprocessing, the entire downstream system is affected. Without a testing strategy, you may not notice degraded retrieval quality until users complain.
This guide covers testing embedding models: correctness tests, regression tests across model versions, benchmark datasets, and detecting embedding drift in production.
What Can Go Wrong with Embedding Models
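Model version updates: A different model or model version produces different vectors for the same text; for example, embeddings from OpenAI's text-embedding-ada-002 and text-embedding-3-small are not interchangeable. If you update without re-indexing, new query vectors are compared against old document vectors and similarity search returns incorrect results.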
Preprocessing changes: Lowercasing, stripping punctuation, or truncating text before embedding changes the output vectors. This is invisible unless you test it.
Model drift: Self-hosted embedding models can drift when fine-tuned on new data. The model behaves differently, but your retrieval quality tests don't run automatically.
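Dimension mismatches: Switching from a 384-dim model to a 1536-dim model without updating your vector store typically causes insert or query errors, and can fail silently if vectors are padded, truncated, or stored without validation.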
Semantic regression: A new model may perform better on general text but worse on your specific domain. General benchmarks don't catch domain-specific regression.
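Several of the failure modes above can be caught before any query runs by storing a small manifest next to the index and comparing it with the configuration used at query time. The following is a minimal sketch under stated assumptions, not a feature of any particular vector store: `IndexManifest`, `preprocessing_fingerprint`, `check_compatible`, and the probe strings are all hypothetical names introduced here for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical probe strings; choose ones that exercise your own preprocessing.
PROBE_TEXTS = ("Hello, World!", "  MiXeD case\tand   spaces ", "naive cafe 123")


def preprocessing_fingerprint(
    preprocess: Callable[[str], str],
    probes: Sequence[str] = PROBE_TEXTS,
) -> str:
    """Hash the preprocessed probes; a change in preprocessing changes the hash."""
    digest = hashlib.sha256()
    for text in probes:
        digest.update(preprocess(text).encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent outputs cannot merge
    return digest.hexdigest()


@dataclass(frozen=True)
class IndexManifest:
    """Metadata saved alongside a vector index (assumed structure)."""
    model_name: str
    dimension: int
    preprocessing_hash: str


def check_compatible(index: IndexManifest, query: IndexManifest) -> List[str]:
    """Return a list of mismatches; an empty list means the two are compatible."""
    problems = []
    if index.model_name != query.model_name:
        problems.append(
            f"model mismatch: index={index.model_name!r}, query={query.model_name!r}"
        )
    if index.dimension != query.dimension:
        problems.append(
            f"dimension mismatch: index={index.dimension}, query={query.dimension}"
        )
    if index.preprocessing_hash != query.preprocessing_hash:
        problems.append("preprocessing changed since the index was built")
    return problems
```

A check like this could run at service startup or as a test, failing fast when the query-side model, dimension, or preprocessing no longer matches what the index was built with.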
Testing Embedding Correctness
Start with property-based tests that don't require a golden dataset:
# tests/embeddings/test_embedding_properties.py
import pytest
import numpy as np
from sentence_transformers import SentenceTransformer


@pytest.fixture(scope="session")
def model():
    return SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # The small epsilon guards against division by zero for all-zero vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


class TestEmbeddingProperties:
    """Property-based tests that should hold for any good embedding model.

    The similarity thresholds below are model-specific; calibrate them for your model.
    """

    def test_identical_texts_produce_identical_embeddings(self, model):
        """The same text should always produce the same embedding."""
        text = "Software testing best practices"
        emb1 = model.encode(text)
        emb2 = model.encode(text)
        np.testing.assert_allclose(
            emb1, emb2, atol=1e-5,
            err_msg="Same text should produce identical embeddings"
        )

    def test_similar_texts_have_high_cosine_similarity(self, model):
        """Paraphrases and synonyms should have cosine similarity > 0.7."""
        similar_pairs = [
            ("Python unit testing", "Writing Python tests"),
            ("Machine learning model", "ML model"),
            ("Containerize the application", "Dockerize the app"),
            ("End-to-end test automation", "Automated browser testing"),
        ]
        for text_a, text_b in similar_pairs:
            emb_a = model.encode(text_a)
            emb_b = model.encode(text_b)
            similarity = cosine_similarity(emb_a, emb_b)
            assert similarity > 0.7, (
                f"Expected similarity > 0.7 for semantically similar texts:\n"
                f" '{text_a}'\n '{text_b}'\n Got: {similarity:.3f}"
            )

    def test_dissimilar_texts_have_low_cosine_similarity(self, model):
        """Unrelated texts should have cosine similarity < 0.3."""
        dissimilar_pairs = [
            ("Python software testing", "Chocolate cake recipe"),
            ("Vector database architecture", "Medieval history"),
            ("CI/CD pipeline setup", "Ocean ecology"),
        ]
        for text_a, text_b in dissimilar_pairs:
            emb_a = model.encode(text_a)
            emb_b = model.encode(text_b)
            similarity = cosine_similarity(emb_a, emb_b)
            assert similarity < 0.3, (
                f"Expected similarity < 0.3 for unrelated texts:\n"
                f" '{text_a}'\n '{text_b}'\n Got: {similarity:.3f}"
            )

    def test_embedding_dimensions_are_consistent(self, model):
        """All texts should produce embeddings with the same dimension."""
        texts = [
            "Short",
            "A much longer piece of text with many words and complex structure",
            "123 456 789",
            "!@#$%^&*()",
            "",  # Edge case: empty string
        ]
        embeddings = [model.encode(t) for t in texts]
        dimensions = [len(e) for e in embeddings]
        assert len(set(dimensions)) == 1, (
            f"Inconsistent dimensions: {dict(zip(texts, dimensions))}"
        )

    def test_embeddings_are_normalized(self, model):
        """With normalize_embeddings=True, embeddings should have unit norm,
        so that a plain dot product equals cosine similarity."""
        texts = ["Test text 1", "Test text 2", "Test text 3"]
        for text in texts:
            emb = model.encode(text, normalize_embeddings=True)
            norm = np.linalg.norm(emb)
            assert abs(norm - 1.0) < 1e-5, (
                f"Embedding norm for '{text}' is {norm:.6f}, expected 1.0"
            )

    def test_triangle_inequality_holds(self, model):
        """Loose transitivity heuristic (not a strict triangle inequality):
        if A≈B and B≈C, then A should be somewhat similar to C."""
        a = "Python testing tools"
        b = "Software quality assurance testing"
        c = "QA automation frameworks"
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        emb_c = model.encode(c, normalize_embeddings=True)
        sim_ab = cosine_similarity(emb_a, emb_b)
        sim_bc = cosine_similarity(emb_b, emb_c)
        sim_ac = cosine_similarity(emb_a, emb_c)
        # A and C should be at least somewhat related since both are related to B
        assert sim_ac > 0.4, (
            f"A-C similarity ({sim_ac:.3f}) too low given A-B={sim_ab:.3f} and B-C={sim_bc:.3f}"
        )

Semantic Benchmark Datasets
Build a domain-specific benchmark dataset to test retrieval quality on your actual use case:
# tests/embeddings/test_semantic_benchmarks.py
import pytest
import numpy as np
from dataclasses import dataclass
from typing import List

# The `model` fixture and `cosine_similarity` helper are defined in the property-test
# module above. To reuse them here, move the fixture to conftest.py and the helper
# to a shared module that this file imports.


@dataclass
class SemanticBenchmarkCase:
    """A ranked list of texts: ranked_texts[0] should be most similar to the anchor."""
    anchor: str
    ranked_texts: List[str]  # Most similar first
    description: str


# Domain-specific benchmark for a software testing knowledge base
BENCHMARK_CASES = [
    SemanticBenchmarkCase(
        anchor="How to write unit tests in Python",
        ranked_texts=[
            "Python unit testing tutorial with pytest",  # Most similar
            "Automated software testing guide",  # Related
            "Python programming language overview",  # Loosely related
            "Database backup and recovery strategies",  # Unrelated
        ],
        description="Python testing query should rank testing-specific content highest",
    ),
    SemanticBenchmarkCase(
        anchor="RAG pipeline retrieval quality evaluation",
        ranked_texts=[
            "Evaluating retrieval augmented generation recall",  # Most similar
            "Machine learning model evaluation metrics",  # Related
            "Python performance optimization tips",  # Weakly related
            "Renaissance art history overview",  # Unrelated
        ],
        description="RAG evaluation query should rank RAG-specific content highest",
    ),
    SemanticBenchmarkCase(
        anchor="Kubernetes pod memory limits configuration",
        ranked_texts=[
            "Setting resource limits for Kubernetes containers",  # Most similar
            "Docker container resource management",  # Related
            "Cloud infrastructure scaling strategies",  # Loosely related
            "Cooking vegetables in a pressure cooker",  # Unrelated
        ],
        description="K8s query should rank infrastructure content highest",
    ),
]


class TestSemanticBenchmarks:
    def test_ranking_correctness(self, model):
        """Model should rank texts in the expected semantic order."""
        failures = []
        for case in BENCHMARK_CASES:
            anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
            text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
            similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
            # Verify each text is more similar than the next
            for i in range(len(similarities) - 1):
                if similarities[i] <= similarities[i + 1]:
                    failures.append(
                        f"{case.description}\n"
                        f" Expected rank {i} > rank {i+1}\n"
                        f" '{case.ranked_texts[i]}' ({similarities[i]:.3f})\n"
                        f" ≤\n"
                        f" '{case.ranked_texts[i+1]}' ({similarities[i+1]:.3f})"
                    )
        assert not failures, "\n\n".join(failures)

    def test_top_1_accuracy(self, model):
        """The most similar text should rank first for all benchmark cases."""
        errors = []
        for case in BENCHMARK_CASES:
            anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
            text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
            similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
            top_ranked_index = int(np.argmax(similarities))
            if top_ranked_index != 0:
                errors.append(
                    f"{case.description}\n"
                    f" Top-1 was index {top_ranked_index}: '{case.ranked_texts[top_ranked_index]}'\n"
                    f" Expected index 0: '{case.ranked_texts[0]}'"
                )
        assert not errors, "\n\n".join(errors)

Regression Testing Across Model Updates
When you update your embedding model, test that the new model doesn't regress on your benchmark:
# tests/embeddings/test_model_regression.py
import json
import numpy as np
import pytest
from pathlib import Path
from dataclasses import dataclass, asdict

# `cosine_similarity` and `BENCHMARK_CASES` are defined in the earlier test modules;
# import them from wherever you keep them (for example a shared helper module).

BASELINE_PATH = Path("tests/baselines/embedding_model_metrics.json")


@dataclass
class ModelMetrics:
    model_name: str
    avg_similar_cosine: float
    avg_dissimilar_cosine: float
    top1_accuracy: float
    ranking_accuracy: float


def compute_model_metrics(model) -> ModelMetrics:
    similar_pairs = [
        ("Python testing", "Writing Python tests"),
        ("Docker containers", "Containerization"),
        ("CI/CD pipeline", "Continuous integration"),
    ]
    dissimilar_pairs = [
        ("Python testing", "Medieval history"),
        ("Docker containers", "Cooking recipes"),
    ]

    similar_scores = []
    for a, b in similar_pairs:
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        similar_scores.append(cosine_similarity(emb_a, emb_b))

    dissimilar_scores = []
    for a, b in dissimilar_pairs:
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        dissimilar_scores.append(cosine_similarity(emb_a, emb_b))

    # Compute benchmark accuracy
    correct_top1 = 0
    correct_ranking = 0
    total_rankings = 0
    for case in BENCHMARK_CASES:
        anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
        text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
        similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
        if np.argmax(similarities) == 0:
            correct_top1 += 1
        # Count adjacent pairs that appear in the expected order
        for i in range(len(similarities) - 1):
            if similarities[i] > similarities[i + 1]:
                correct_ranking += 1
            total_rankings += 1

    return ModelMetrics(
        model_name=str(model),
        avg_similar_cosine=float(np.mean(similar_scores)),
        avg_dissimilar_cosine=float(np.mean(dissimilar_scores)),
        top1_accuracy=correct_top1 / len(BENCHMARK_CASES),
        ranking_accuracy=correct_ranking / total_rankings,
    )


class TestModelRegression:
    def test_metrics_do_not_regress(self, model):
        """Current model metrics should not regress below baseline."""
        current = compute_model_metrics(model)
        if not BASELINE_PATH.exists():
            # First run: save baseline
            BASELINE_PATH.parent.mkdir(parents=True, exist_ok=True)
            BASELINE_PATH.write_text(json.dumps(asdict(current), indent=2))
            pytest.skip("Saved baseline; run again to test against it")
        baseline = ModelMetrics(**json.loads(BASELINE_PATH.read_text()))
        # Allow 5% relative tolerance (5 points absolute for top-1 accuracy)
        assert current.avg_similar_cosine >= baseline.avg_similar_cosine * 0.95, (
            f"avg_similar_cosine regressed: {baseline.avg_similar_cosine:.3f} → {current.avg_similar_cosine:.3f}"
        )
        assert current.top1_accuracy >= baseline.top1_accuracy - 0.05, (
            f"top1_accuracy regressed: {baseline.top1_accuracy:.3f} → {current.top1_accuracy:.3f}"
        )
        assert current.ranking_accuracy >= baseline.ranking_accuracy * 0.95, (
            f"ranking_accuracy regressed: {baseline.ranking_accuracy:.3f} → {current.ranking_accuracy:.3f}"
        )

    def test_dissimilar_similarity_does_not_increase(self, model):
        """Dissimilar text scores should not increase (would indicate worse discrimination)."""
        current = compute_model_metrics(model)
        if not BASELINE_PATH.exists():
            pytest.skip("No baseline to compare against")
        baseline = ModelMetrics(**json.loads(BASELINE_PATH.read_text()))
        # Dissimilar scores should stay low. Use an absolute tolerance: a multiplicative
        # factor misbehaves when the baseline is close to zero or negative.
        assert current.avg_dissimilar_cosine <= baseline.avg_dissimilar_cosine + 0.05, (
            f"avg_dissimilar_cosine increased: {baseline.avg_dissimilar_cosine:.3f} → {current.avg_dissimilar_cosine:.3f}"
        )

Detecting Embedding Drift
Embedding drift occurs when embeddings for the same text change over time — due to model updates, preprocessing changes, or non-deterministic inference:
# tests/embeddings/test_drift_detection.py
import hashlib
import json
import numpy as np
import pytest
from pathlib import Path

# `cosine_similarity` is the helper defined in the property-test module.

EMBEDDING_SNAPSHOT_PATH = Path("tests/baselines/embedding_snapshots.json")

ANCHOR_TEXTS = [
    "HelpMeTest is a cloud-based QA testing platform",
    "Playwright end-to-end browser automation",
    "Robot Framework test automation library",
    "Vector database similarity search",
    "Continuous integration and deployment pipeline",
]


def fingerprint_embedding(embedding: np.ndarray, precision: int = 4) -> str:
    """Create a stable fingerprint of an embedding for drift detection."""
    # Adding 0.0 turns -0.0 into 0.0 so both hash to the same bytes
    rounded = np.round(embedding, precision) + 0.0
    return hashlib.sha256(rounded.tobytes()).hexdigest()


class TestEmbeddingDrift:
    def test_embeddings_match_baseline_snapshots(self, model):
        """Embeddings should not change between runs (deterministic model)."""
        current_embeddings = {}
        for text in ANCHOR_TEXTS:
            emb = model.encode(text, normalize_embeddings=True)
            current_embeddings[text] = emb.tolist()

        if not EMBEDDING_SNAPSHOT_PATH.exists():
            # Save baseline
            EMBEDDING_SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
            EMBEDDING_SNAPSHOT_PATH.write_text(json.dumps(current_embeddings, indent=2))
            pytest.skip("Saved embedding snapshots; run again to test drift")

        baseline = json.loads(EMBEDDING_SNAPSHOT_PATH.read_text())
        drifted = []
        for text, current_emb in current_embeddings.items():
            if text not in baseline:
                continue
            baseline_emb = np.array(baseline[text])
            current_arr = np.array(current_emb)
            # Compute drift as max absolute difference
            max_diff = float(np.max(np.abs(baseline_emb - current_arr)))
            if max_diff > 1e-4:
                drifted.append(f"'{text[:50]}': max_diff={max_diff:.6f}")
        assert not drifted, (
            f"Embedding drift detected for {len(drifted)} texts:\n" + "\n".join(drifted)
        )

    def test_similar_relationship_preserved_after_model_change(self, model):
        """If model is updated, similar relationships should be preserved."""
        if not EMBEDDING_SNAPSHOT_PATH.exists():
            pytest.skip("No baseline to compare against")
        baseline = json.loads(EMBEDDING_SNAPSHOT_PATH.read_text())

        # Verify that the ranking of texts is preserved even if exact values changed
        anchor_text = ANCHOR_TEXTS[0]
        if anchor_text not in baseline:
            pytest.skip("Anchor text not in baseline")
        baseline_anchor = np.array(baseline[anchor_text])

        # Sort texts by similarity to anchor using baseline
        baseline_sims = {
            text: cosine_similarity(baseline_anchor, np.array(emb))
            for text, emb in baseline.items()
            if text != anchor_text
        }
        baseline_ranking = sorted(baseline_sims, key=baseline_sims.get, reverse=True)

        # Sort using current model
        current_anchor = model.encode(anchor_text, normalize_embeddings=True)
        current_sims = {
            text: cosine_similarity(current_anchor, model.encode(text, normalize_embeddings=True))
            for text in baseline.keys()
            if text != anchor_text
        }
        current_ranking = sorted(current_sims, key=current_sims.get, reverse=True)

        # Top-3 should match
        assert baseline_ranking[:3] == current_ranking[:3], (
            f"Ranking changed after model update:\n"
            f" Before: {baseline_ranking[:3]}\n"
            f" After: {current_ranking[:3]}"
        )

Testing Model Upgrade Path
When upgrading your embedding model, you need to re-index all documents. Test the migration:
# tests/embeddings/test_model_migration.py
import pytest
import numpy as np

# `vector_store`, `old_model`, `new_model`, and `test_corpus` are project-specific
# fixtures, and `measure_recall_at_k` is a project-specific helper; none of them is
# defined in this guide.


def test_re_indexing_does_not_break_retrieval(vector_store, old_model, new_model, test_corpus):
    """After re-indexing with a new model, retrieval quality should not regress."""
    # Index with old model
    old_embeddings = old_model.encode([doc.text for doc in test_corpus], normalize_embeddings=True)
    vector_store.upsert_batch(
        ids=[doc.id for doc in test_corpus],
        embeddings=old_embeddings.tolist(),
    )
    # Measure recall with old model
    old_recall = measure_recall_at_k(vector_store, old_model, test_corpus, k=3)

    # Re-index with new model. If the two models have different output dimensions,
    # the index or collection has to be recreated before this upsert.
    new_embeddings = new_model.encode([doc.text for doc in test_corpus], normalize_embeddings=True)
    vector_store.upsert_batch(
        ids=[doc.id for doc in test_corpus],
        embeddings=new_embeddings.tolist(),
    )
    # Measure recall with new model (must use same model for queries!)
    new_recall = measure_recall_at_k(vector_store, new_model, test_corpus, k=3)

    # New model recall should be at least as good as old (within 5%)
    assert new_recall >= old_recall * 0.95, (
        f"Recall regressed after model update: {old_recall:.3f} → {new_recall:.3f}"
    )

Continuous Monitoring
Run embedding quality tests on a schedule, not just on deployment:
# .github/workflows/embedding-quality.yml
name: Embedding Quality Check
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday at 02:00 UTC
  workflow_dispatch:
jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: read  # needed by actions/checkout
      issues: write   # needed to open the alert issue
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements-test.txt
      - name: Run embedding quality tests
        run: pytest tests/embeddings/ -v --tb=short
      - name: Alert on regression
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Embedding quality regression detected',
              labels: ['ai-quality', 'alert'],
            })

Summary
Testing embedding models across their lifecycle:
- Property tests: Identical texts → identical vectors; similar texts → high cosine; dissimilar texts → low cosine
- Benchmark datasets: Domain-specific ranked text lists to measure top-1 accuracy and ranking accuracy
- Regression testing: Save metric baselines and compare on every model update
- Drift detection: Snapshot reference embeddings and alert on deviation
- Migration testing: Verify re-indexing with a new model preserves retrieval quality
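The migration test earlier measures retrieval quality through a `measure_recall_at_k` helper that this guide leaves undefined. As a rough illustration of the underlying metric, here is a minimal standard-library sketch, assuming embeddings are plain lists of floats and relevance judgments are available as a mapping from query ID to a set of relevant document IDs. The names `cosine`, `top_k_ids`, and `recall_at_k` are hypothetical, and a real helper would query your vector store instead of scanning every document.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_ids(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Brute-force nearest neighbours: IDs of the k most similar documents."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(
    query_vecs: Dict[str, Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean over queries of (relevant docs found in the top k) / (relevant docs)."""
    scores = []
    for query_id, query_vec in query_vecs.items():
        expected = relevant.get(query_id, set())
        if not expected:
            continue  # skip queries without relevance judgments
        hits = expected.intersection(top_k_ids(query_vec, doc_vecs, k))
        scores.append(len(hits) / len(expected))
    return sum(scores) / len(scores) if scores else 0.0
```

Computing this value once with the old model's vectors and once with the new model's vectors, on the same queries and relevance judgments, gives the two numbers that a migration test compares.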
The key insight is that embedding model testing is not about the model in isolation — it's about whether the model's output enables your downstream system (vector search, RAG pipeline, recommendations) to work correctly for your specific use case.