Testing Embedding Models: Regression, Benchmarking, and Drift Detection

Embedding models are the foundation of vector search, RAG pipelines, and semantic applications. When you switch embedding models, update to a new version, or change preprocessing, the entire downstream system is affected. Without a testing strategy, you may not notice degraded retrieval quality until users complain.

This guide covers testing embedding models: correctness tests, regression tests across model versions, benchmark datasets, and detecting embedding drift in production.

What Can Go Wrong with Embedding Models

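Model version updates: A new version of a hosted embedding model can produce different vectors than the previous one for the same text. Vectors from different models are not comparable (for example, OpenAI's text-embedding-ada-002 and text-embedding-3-small), so if you update without re-indexing, similarity searches compare incompatible vectors and return wrong results.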

Preprocessing changes: Lowercasing, stripping punctuation, or truncating text before embedding changes the output vectors. This is invisible unless you test it.

Model drift: Self-hosted embedding models can drift when fine-tuned on new data. The model behaves differently, but nothing flags the change unless your retrieval quality tests run automatically.

Dimension mismatches: Switching from a 384-dim model to a 1536-dim model without updating your vector store causes insert or query errors at best, and silent failures in stores that do not validate dimensions.

Semantic regression: A new model may perform better on general text but worse on your specific domain. General benchmarks don't catch domain-specific regression.
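
Several of these failure modes come down to querying an index with a different model or preprocessing than the one that built it. One way to catch that early is to store a small metadata record next to the index and compare it at query time. The sketch below assumes such a record exists; `IndexMetadata`, `check_compatible`, and the field names are hypothetical and not part of any vector store's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexMetadata:
    """Hypothetical record saved alongside a vector index when it is built."""
    model_name: str
    dimension: int
    preprocessing_version: str


def check_compatible(index_meta: IndexMetadata, query_meta: IndexMetadata) -> List[str]:
    """Return a list of human-readable mismatches (empty means compatible)."""
    problems = []
    if index_meta.model_name != query_meta.model_name:
        problems.append(
            f"model mismatch: index={index_meta.model_name!r}, "
            f"query={query_meta.model_name!r}"
        )
    if index_meta.dimension != query_meta.dimension:
        problems.append(
            f"dimension mismatch: index={index_meta.dimension}, "
            f"query={query_meta.dimension}"
        )
    if index_meta.preprocessing_version != query_meta.preprocessing_version:
        problems.append(
            f"preprocessing mismatch: index={index_meta.preprocessing_version!r}, "
            f"query={query_meta.preprocessing_version!r}"
        )
    return problems


# Same model and dimension, but the preprocessing changed after indexing.
index_meta = IndexMetadata("all-MiniLM-L6-v2", 384, "lowercase-v1")
query_meta = IndexMetadata("all-MiniLM-L6-v2", 384, "lowercase-v2")
problems = check_compatible(index_meta, query_meta)
```

Failing fast on a non-empty `problems` list turns a quiet quality regression into an explicit error at startup or in CI.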

Testing Embedding Correctness

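Start with property-based tests that don't require a golden dataset. The similarity thresholds used below (0.7, 0.3, 0.4) are starting points that depend on the model, so calibrate them against your own model before relying on them: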

# tests/embeddings/test_embedding_properties.py
import pytest
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List

@pytest.fixture(scope="session")
def model():
    return SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class TestEmbeddingProperties:
    """Property-based tests that should hold for any good embedding model."""
    
    def test_identical_texts_produce_identical_embeddings(self, model):
        """The same text should always produce the same embedding."""
        text = "Software testing best practices"
        
        emb1 = model.encode(text)
        emb2 = model.encode(text)
        
        np.testing.assert_allclose(
            emb1, emb2, atol=1e-5,
            err_msg="Same text should produce identical embeddings"
        )

    def test_similar_texts_have_high_cosine_similarity(self, model):
        """Paraphrases and synonyms should have cosine similarity > 0.7."""
        similar_pairs = [
            ("Python unit testing", "Writing Python tests"),
            ("Machine learning model", "ML model"),
            ("Containerize the application", "Dockerize the app"),
            ("End-to-end test automation", "Automated browser testing"),
        ]
        
        for text_a, text_b in similar_pairs:
            emb_a = model.encode(text_a)
            emb_b = model.encode(text_b)
            similarity = cosine_similarity(emb_a, emb_b)
            
            assert similarity > 0.7, (
                f"Expected similarity > 0.7 for semantically similar texts:\n"
                f"  '{text_a}'\n  '{text_b}'\n  Got: {similarity:.3f}"
            )

    def test_dissimilar_texts_have_low_cosine_similarity(self, model):
        """Unrelated texts should have cosine similarity < 0.3."""
        dissimilar_pairs = [
            ("Python software testing", "Chocolate cake recipe"),
            ("Vector database architecture", "Medieval history"),
            ("CI/CD pipeline setup", "Ocean ecology"),
        ]
        
        for text_a, text_b in dissimilar_pairs:
            emb_a = model.encode(text_a)
            emb_b = model.encode(text_b)
            similarity = cosine_similarity(emb_a, emb_b)
            
            assert similarity < 0.3, (
                f"Expected similarity < 0.3 for unrelated texts:\n"
                f"  '{text_a}'\n  '{text_b}'\n  Got: {similarity:.3f}"
            )

    def test_embedding_dimensions_are_consistent(self, model):
        """All texts should produce embeddings with the same dimension."""
        texts = [
            "Short",
            "A much longer piece of text with many words and complex structure",
            "123 456 789",
            "!@#$%^&*()",
            "",  # Edge case: empty string
        ]
        
        embeddings = [model.encode(t) for t in texts]
        dimensions = [len(e) for e in embeddings]
        
        assert len(set(dimensions)) == 1, (
            f"Inconsistent dimensions: {dict(zip(texts, dimensions))}"
        )

    def test_embeddings_are_normalized(self, model):
        """With normalize_embeddings=True, embeddings should have unit norm (so a plain dot product equals cosine similarity)."""
        texts = ["Test text 1", "Test text 2", "Test text 3"]
        
        for text in texts:
            emb = model.encode(text, normalize_embeddings=True)
            norm = np.linalg.norm(emb)
            
            assert abs(norm - 1.0) < 1e-5, (
                f"Embedding norm for '{text}' is {norm:.6f}, expected 1.0"
            )

    def test_triangle_inequality_holds(self, model):
        """Loose transitivity heuristic (not the metric triangle inequality): if A≈B and B≈C, A should be somewhat similar to C."""
        a = "Python testing tools"
        b = "Software quality assurance testing"
        c = "QA automation frameworks"
        
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        emb_c = model.encode(c, normalize_embeddings=True)
        
        sim_ab = cosine_similarity(emb_a, emb_b)
        sim_bc = cosine_similarity(emb_b, emb_c)
        sim_ac = cosine_similarity(emb_a, emb_c)
        
        # A and C should be at least somewhat related since both are related to B
        assert sim_ac > 0.4, (
            f"A-C similarity ({sim_ac:.3f}) too low given A-B={sim_ab:.3f} and B-C={sim_bc:.3f}"
        )
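
The fixed thresholds in these tests are model-dependent. One way to pick them is to look at the scores your current model actually produces for known similar and dissimilar pairs and leave some headroom. The sketch below assumes you have already computed those cosine scores; `suggest_thresholds` and its `margin` parameter are hypothetical names, and the scores in the example are made up.

```python
from typing import List, Tuple


def suggest_thresholds(
    similar_scores: List[float],
    dissimilar_scores: List[float],
    margin: float = 0.05,
) -> Tuple[float, float]:
    """Suggest (similar_floor, dissimilar_ceiling) from observed cosine scores.

    The floor sits `margin` below the lowest similar-pair score and the ceiling
    sits `margin` above the highest dissimilar-pair score. Raises ValueError if
    the two groups are too close for the thresholds to stay apart.
    """
    if not similar_scores or not dissimilar_scores:
        raise ValueError("both score lists must be non-empty")
    lowest_similar = min(similar_scores)
    highest_dissimilar = max(dissimilar_scores)
    if lowest_similar - highest_dissimilar <= 2 * margin:
        raise ValueError(
            "similar and dissimilar scores are too close; "
            "review the pairs or the model before setting thresholds"
        )
    return lowest_similar - margin, highest_dissimilar + margin


# Made-up scores for illustration only, not measurements from any model.
floor, ceiling = suggest_thresholds([0.81, 0.74, 0.69], [0.12, 0.05, 0.21])
```

The error case is deliberate: if the two groups nearly overlap, no pair of thresholds will make the tests meaningful.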

Semantic Benchmark Datasets

Build a domain-specific benchmark dataset to test retrieval quality on your actual use case:

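# tests/embeddings/test_semantic_benchmarks.py
# Note: the `model` fixture and cosine_similarity() are defined in
# test_embedding_properties.py. To reuse them here, move the fixture into
# tests/embeddings/conftest.py and the helper into a shared module you import.
import pytest
import numpy as np
from dataclasses import dataclass
from typing import List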

@dataclass
class SemanticBenchmarkCase:
    """A ranked list of texts: text[0] should be most similar to the anchor."""
    anchor: str
    ranked_texts: List[str]  # Most similar first
    description: str

# Domain-specific benchmark for a software testing knowledge base
BENCHMARK_CASES = [
    SemanticBenchmarkCase(
        anchor="How to write unit tests in Python",
        ranked_texts=[
            "Python unit testing tutorial with pytest",           # Most similar
            "Automated software testing guide",                   # Related
            "Python programming language overview",               # Loosely related
            "Database backup and recovery strategies",            # Unrelated
        ],
        description="Python testing query should rank testing-specific content highest",
    ),
    SemanticBenchmarkCase(
        anchor="RAG pipeline retrieval quality evaluation",
        ranked_texts=[
            "Evaluating retrieval augmented generation recall",   # Most similar
            "Machine learning model evaluation metrics",          # Related
            "Python performance optimization tips",               # Weakly related
            "Renaissance art history overview",                   # Unrelated
        ],
        description="RAG evaluation query should rank RAG-specific content highest",
    ),
    SemanticBenchmarkCase(
        anchor="Kubernetes pod memory limits configuration",
        ranked_texts=[
            "Setting resource limits for Kubernetes containers",  # Most similar
            "Docker container resource management",               # Related
            "Cloud infrastructure scaling strategies",            # Loosely related
            "Cooking vegetables in a pressure cooker",            # Unrelated
        ],
        description="K8s query should rank infrastructure content highest",
    ),
]

class TestSemanticBenchmarks:
    def test_ranking_correctness(self, model):
        """Model should rank texts in the expected semantic order."""
        failures = []
        
        for case in BENCHMARK_CASES:
            anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
            text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
            
            similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
            
            # Verify each text is more similar than the next
            for i in range(len(similarities) - 1):
                if similarities[i] <= similarities[i + 1]:
                    failures.append(
                        f"{case.description}\n"
                        f"  Expected rank {i} > rank {i+1}\n"
                        f"  '{case.ranked_texts[i]}' ({similarities[i]:.3f})\n"
                        f"    ≤\n"
                        f"  '{case.ranked_texts[i+1]}' ({similarities[i+1]:.3f})"
                    )
        
        assert not failures, "\n\n".join(failures)

    def test_top_1_accuracy(self, model):
        """The most similar text should rank first for all benchmark cases."""
        errors = []
        
        for case in BENCHMARK_CASES:
            anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
            text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
            
            similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
            top_ranked_index = np.argmax(similarities)
            
            if top_ranked_index != 0:
                errors.append(
                    f"{case.description}\n"
                    f"  Top-1 was index {top_ranked_index}: '{case.ranked_texts[top_ranked_index]}'\n"
                    f"  Expected index 0: '{case.ranked_texts[0]}'"
                )
        
        assert not errors, "\n\n".join(errors)
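
The strict adjacent-order check above fails on any single swap, which can be noisy for items that are legitimately close. A softer alternative is to score the fraction of all correctly ordered pairs, not only adjacent ones. The sketch below is one way to compute that; `pairwise_ranking_accuracy` is a hypothetical helper and the scores are illustrative.

```python
from itertools import combinations
from typing import Sequence


def pairwise_ranking_accuracy(similarities: Sequence[float]) -> float:
    """Fraction of pairs (i, j) with i < j where similarities[i] > similarities[j].

    `similarities` is ordered by expected rank, most similar first, so 1.0 means
    the model reproduced the expected order and 0.0 means it reversed it.
    """
    pairs = list(combinations(range(len(similarities)), 2))
    if not pairs:
        raise ValueError("need at least two similarities to compare")
    correct = sum(1 for i, j in pairs if similarities[i] > similarities[j])
    return correct / len(pairs)


# Illustrative scores: the two middle items are swapped relative to expectation,
# so five of the six pairs are still in the right order.
score = pairwise_ranking_accuracy([0.82, 0.41, 0.47, 0.03])
```

A threshold on this score (for example, requiring at least 0.8 per case) tolerates one near-tie while still catching a model that ranks unrelated text near the top.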

Regression Testing Across Model Updates

When you update your embedding model, test that the new model doesn't regress on your benchmark:

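# tests/embeddings/test_model_regression.py
# Note: BENCHMARK_CASES and cosine_similarity() come from the earlier test modules;
# import them from wherever you keep shared test helpers.
import json
import numpy as np
import pytest
from pathlib import Path
from dataclasses import dataclass, asdict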

BASELINE_PATH = Path("tests/baselines/embedding_model_metrics.json")

@dataclass
class ModelMetrics:
    model_name: str
    avg_similar_cosine: float
    avg_dissimilar_cosine: float
    top1_accuracy: float
    ranking_accuracy: float

def compute_model_metrics(model) -> ModelMetrics:
    similar_pairs = [
        ("Python testing", "Writing Python tests"),
        ("Docker containers", "Containerization"),
        ("CI/CD pipeline", "Continuous integration"),
    ]
    
    dissimilar_pairs = [
        ("Python testing", "Medieval history"),
        ("Docker containers", "Cooking recipes"),
    ]
    
    similar_scores = []
    for a, b in similar_pairs:
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        similar_scores.append(cosine_similarity(emb_a, emb_b))
    
    dissimilar_scores = []
    for a, b in dissimilar_pairs:
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        dissimilar_scores.append(cosine_similarity(emb_a, emb_b))
    
    # Compute benchmark accuracy
    correct_top1 = 0
    correct_ranking = 0
    total_rankings = 0
    
    for case in BENCHMARK_CASES:
        anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
        text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
        similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
        
        if np.argmax(similarities) == 0:
            correct_top1 += 1
        
        for i in range(len(similarities) - 1):
            if similarities[i] > similarities[i + 1]:
                correct_ranking += 1
            total_rankings += 1
    
    return ModelMetrics(
        model_name=str(model),
        avg_similar_cosine=float(np.mean(similar_scores)),
        avg_dissimilar_cosine=float(np.mean(dissimilar_scores)),
        top1_accuracy=correct_top1 / len(BENCHMARK_CASES),
        ranking_accuracy=correct_ranking / total_rankings,
    )

class TestModelRegression:
    def test_metrics_do_not_regress(self, model):
        """Current model metrics should not regress below baseline."""
        current = compute_model_metrics(model)
        
        if not BASELINE_PATH.exists():
            # First run: save baseline
            BASELINE_PATH.parent.mkdir(parents=True, exist_ok=True)
            BASELINE_PATH.write_text(json.dumps(asdict(current), indent=2))
            pytest.skip("Saved baseline — run again to test against it")
        
        baseline = ModelMetrics(**json.loads(BASELINE_PATH.read_text()))
        
        # Allow 5% relative tolerance on the cosine and ranking metrics,
        # and 0.05 absolute tolerance on top-1 accuracy
        assert current.avg_similar_cosine >= baseline.avg_similar_cosine * 0.95, (
            f"avg_similar_cosine regressed: {baseline.avg_similar_cosine:.3f} → {current.avg_similar_cosine:.3f}"
        )
        
        assert current.top1_accuracy >= baseline.top1_accuracy - 0.05, (
            f"top1_accuracy regressed: {baseline.top1_accuracy:.3f} → {current.top1_accuracy:.3f}"
        )
        
        assert current.ranking_accuracy >= baseline.ranking_accuracy * 0.95, (
            f"ranking_accuracy regressed: {baseline.ranking_accuracy:.3f} → {current.ranking_accuracy:.3f}"
        )

    def test_dissimilar_similarity_does_not_increase(self, model):
        """Dissimilar text scores should not increase (would indicate worse discrimination)."""
        current = compute_model_metrics(model)
        
        if not BASELINE_PATH.exists():
            pytest.skip("No baseline to compare against")
        
        baseline = ModelMetrics(**json.loads(BASELINE_PATH.read_text()))
        
        # Dissimilar scores should stay low. Use an absolute tolerance: these scores
        # sit near zero and can be negative, where a relative factor misbehaves.
        assert current.avg_dissimilar_cosine <= baseline.avg_dissimilar_cosine + 0.05, (
            f"avg_dissimilar_cosine increased: {baseline.avg_dissimilar_cosine:.3f} → {current.avg_dissimilar_cosine:.3f}"
        )
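
Relative tolerances (multiplying a baseline by a factor such as 0.95 or 1.10) behave poorly when the baseline is near zero or negative, which is common for dissimilar-pair cosine scores. A minimal sketch of a comparison helper that uses an absolute tolerance and handles both metric directions follows; `has_regressed` and its parameters are hypothetical names, not part of the tests above.

```python
def has_regressed(
    baseline: float,
    current: float,
    higher_is_better: bool = True,
    abs_tol: float = 0.05,
) -> bool:
    """Return True if `current` is worse than `baseline` by more than `abs_tol`."""
    if higher_is_better:
        return current < baseline - abs_tol
    return current > baseline + abs_tol


# Higher-is-better metric (e.g. average similar-pair cosine).
small_drop = has_regressed(0.80, 0.78)
large_drop = has_regressed(0.80, 0.70)

# Lower-is-better metric with a slightly negative baseline
# (e.g. average dissimilar-pair cosine).
small_rise = has_regressed(-0.02, 0.01, higher_is_better=False)
large_rise = has_regressed(-0.02, 0.10, higher_is_better=False)
```

With a negative baseline such as -0.02, a check like `current <= baseline * 1.10` would reject -0.021 even though the change is negligible; the absolute form avoids that.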

Detecting Embedding Drift

Embedding drift occurs when embeddings for the same text change over time — due to model updates, preprocessing changes, or non-deterministic inference:

# tests/embeddings/test_drift_detection.py
# Note: cosine_similarity() is the helper defined in test_embedding_properties.py.
import hashlib
import json
import numpy as np
import pytest
from pathlib import Path

EMBEDDING_SNAPSHOT_PATH = Path("tests/baselines/embedding_snapshots.json")

ANCHOR_TEXTS = [
    "HelpMeTest is a cloud-based QA testing platform",
    "Playwright end-to-end browser automation",
    "Robot Framework test automation library",
    "Vector database similarity search",
    "Continuous integration and deployment pipeline",
]

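def fingerprint_embedding(embedding: np.ndarray, precision: int = 4) -> str:
    """Create a fingerprint of an embedding for exact-match drift checks.

    Rounding absorbs most floating-point noise, but a value that sits on a rounding
    boundary can still flip the hash, so prefer the tolerance-based comparison below
    when comparing across hardware or library versions.
    """
    # Fix the dtype and add 0.0 so that -0.0 and 0.0 hash identically.
    rounded = np.round(np.asarray(embedding, dtype=np.float32), precision) + 0.0
    return hashlib.sha256(rounded.tobytes()).hexdigest()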

class TestEmbeddingDrift:
    def test_embeddings_match_baseline_snapshots(self, model):
        """Embeddings should not change between runs (deterministic model)."""
        current_embeddings = {}
        
        for text in ANCHOR_TEXTS:
            emb = model.encode(text, normalize_embeddings=True)
            current_embeddings[text] = emb.tolist()
        
        if not EMBEDDING_SNAPSHOT_PATH.exists():
            # Save baseline
            EMBEDDING_SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
            EMBEDDING_SNAPSHOT_PATH.write_text(
                json.dumps(current_embeddings, indent=2)
            )
            pytest.skip("Saved embedding snapshots — run again to test drift")
        
        baseline = json.loads(EMBEDDING_SNAPSHOT_PATH.read_text())
        
        drifted = []
        for text, current_emb in current_embeddings.items():
            if text not in baseline:
                continue
            
            baseline_emb = np.array(baseline[text])
            current_arr = np.array(current_emb)
            
            # Compute drift as max absolute difference
            max_diff = float(np.max(np.abs(baseline_emb - current_arr)))
            
            if max_diff > 1e-4:
                drifted.append(f"'{text[:50]}': max_diff={max_diff:.6f}")
        
        assert not drifted, (
            f"Embedding drift detected for {len(drifted)} texts:\n" + "\n".join(drifted)
        )

    def test_similar_relationship_preserved_after_model_change(self, model):
        """If model is updated, similar relationships should be preserved."""
        if not EMBEDDING_SNAPSHOT_PATH.exists():
            pytest.skip("No baseline to compare against")
        
        baseline = json.loads(EMBEDDING_SNAPSHOT_PATH.read_text())
        
        # Verify that the ranking of texts is preserved even if exact values changed
        anchor_text = ANCHOR_TEXTS[0]
        
        if anchor_text not in baseline:
            pytest.skip("Anchor text not in baseline")
        
        baseline_anchor = np.array(baseline[anchor_text])
        
        # Sort texts by similarity to anchor using baseline
        baseline_sims = {
            text: cosine_similarity(baseline_anchor, np.array(emb))
            for text, emb in baseline.items()
            if text != anchor_text
        }
        baseline_ranking = sorted(baseline_sims, key=baseline_sims.get, reverse=True)
        
        # Sort using current model
        current_anchor = model.encode(anchor_text, normalize_embeddings=True)
        current_sims = {
            text: cosine_similarity(current_anchor, model.encode(text, normalize_embeddings=True))
            for text in baseline.keys()
            if text != anchor_text
        }
        current_ranking = sorted(current_sims, key=current_sims.get, reverse=True)
        
        # Top-3 should match
        assert baseline_ranking[:3] == current_ranking[:3], (
            f"Ranking changed after model update:\n"
            f"  Before: {baseline_ranking[:3]}\n"
            f"  After:  {current_ranking[:3]}"
        )
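
Element-wise comparison only works when the old and new embeddings have the same dimension. When a model change alters the dimension, one option is to compare the similarity structure within each snapshot instead. The sketch below is dependency-free and assumes snapshots are dictionaries from text to vector, like the JSON baseline above; `similarity_structure_drift` is a hypothetical helper.

```python
import math
from itertools import combinations
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_structure_drift(
    baseline: Dict[str, List[float]],
    current: Dict[str, List[float]],
) -> float:
    """Largest change in pairwise cosine similarity between two snapshots.

    Similarities are computed only within each snapshot, so the two snapshots
    may come from models with different dimensions.
    """
    shared = sorted(set(baseline) & set(current))
    if len(shared) < 2:
        raise ValueError("need at least two shared texts")
    return max(
        abs(cosine(baseline[a], baseline[b]) - cosine(current[a], current[b]))
        for a, b in combinations(shared, 2)
    )


# Toy vectors: `same_structure` has a different dimension but identical
# pairwise relationships; `changed` collapses "a" and "b" onto each other.
baseline = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
same_structure = {"a": [0.0, 1.0, 0.0], "b": [0.0, 0.0, 1.0], "c": [0.0, 1.0, 1.0]}
changed = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [1.0, 1.0]}

no_drift = similarity_structure_drift(baseline, same_structure)
big_drift = similarity_structure_drift(baseline, changed)
```

A small value means the new model preserves the relationships between your anchor texts even if every individual vector changed.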

Testing Model Upgrade Path

When upgrading your embedding model, you need to re-index all documents. Test the migration:

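# tests/embeddings/test_model_migration.py
# Note: vector_store, old_model, new_model, and test_corpus are fixtures you supply,
# and measure_recall_at_k is a helper you write; none of them is defined in this guide.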
import pytest
import numpy as np

def test_re_indexing_does_not_break_retrieval(vector_store, old_model, new_model, test_corpus):
    """After re-indexing with a new model, retrieval quality should not regress."""
    # Index with old model
    old_embeddings = old_model.encode([doc.text for doc in test_corpus], normalize_embeddings=True)
    vector_store.upsert_batch(
        ids=[doc.id for doc in test_corpus],
        embeddings=old_embeddings.tolist(),
    )
    
    # Measure recall with old model
    old_recall = measure_recall_at_k(vector_store, old_model, test_corpus, k=3)
    
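    # Re-index with new model. If its dimension differs from the old model's,
    # most vector stores need a new collection rather than an in-place upsert.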
    new_embeddings = new_model.encode([doc.text for doc in test_corpus], normalize_embeddings=True)
    vector_store.upsert_batch(
        ids=[doc.id for doc in test_corpus],
        embeddings=new_embeddings.tolist(),
    )
    
    # Measure recall with new model (must use same model for queries!)
    new_recall = measure_recall_at_k(vector_store, new_model, test_corpus, k=3)
    
    # New model recall should be at least as good as old (within 5%)
    assert new_recall >= old_recall * 0.95, (
        f"Recall regressed after model update: {old_recall:.3f} → {new_recall:.3f}"
    )
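
The migration test relies on a recall helper that is not shown. As a rough illustration of what such a helper computes, here is a brute-force Recall@K over precomputed vectors. It is a sketch under assumptions: `recall_at_k`, its signature, and the `relevant` mapping from query ID to relevant document IDs are hypothetical, and a real version would query your vector store instead of scanning every document.

```python
import math
from typing import Dict, List, Set


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    query_vectors: Dict[str, List[float]],
    doc_vectors: Dict[str, List[float]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean over queries of (relevant docs found in the top k) / (relevant docs)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    per_query = []
    for query_id, query_vec in query_vectors.items():
        expected = relevant.get(query_id, set())
        if not expected:
            continue  # A query without labelled relevant docs cannot be scored.
        ranked = sorted(
            doc_vectors,
            key=lambda doc_id: _cosine(query_vec, doc_vectors[doc_id]),
            reverse=True,
        )
        hits = expected & set(ranked[:k])
        per_query.append(len(hits) / len(expected))
    if not per_query:
        raise ValueError("no queries with labelled relevant documents")
    return sum(per_query) / len(per_query)


# Toy data for illustration only.
docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
queries = {"q1": [1.0, 0.05], "q2": [0.1, 1.0]}
relevant = {"q1": {"d1", "d2"}, "q2": {"d3"}}

recall_1 = recall_at_k(queries, docs, relevant, k=1)
recall_2 = recall_at_k(queries, docs, relevant, k=2)
```

Computing the same metric with query and document vectors from the old model, then from the new one, gives the two numbers the migration test compares.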

Continuous Monitoring

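Run embedding quality tests on a schedule, not just on deployment. The baseline files under tests/baselines/ must be committed to the repository, otherwise each scheduled run writes a fresh baseline and skips the comparison: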

# .github/workflows/embedding-quality.yml
name: Embedding Quality Check
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday at 2am
  workflow_dispatch:

jobs:
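  test:
    runs-on: ubuntu-latest
    permissions:
      contents: read   # needed by actions/checkout
      issues: write    # needed to open the alert issue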
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - run: pip install -r requirements-test.txt
      
      - name: Run embedding quality tests
        run: pytest tests/embeddings/ -v --tb=short
      
      - name: Alert on regression
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Embedding quality regression detected',
              labels: ['ai-quality', 'alert'],
            })

Summary

Testing embedding models across their lifecycle:

  • Property tests: Identical texts → identical vectors; similar texts → high cosine; dissimilar texts → low cosine
  • Benchmark datasets: Domain-specific ranked text lists to measure top-1 accuracy and ranking accuracy
  • Regression testing: Save metric baselines and compare on every model update
  • Drift detection: Snapshot reference embeddings and alert on deviation
  • Migration testing: Verify re-indexing with a new model preserves retrieval quality

The key insight is that embedding model testing is not about the model in isolation — it's about whether the model's output enables your downstream system (vector search, RAG pipeline, recommendations) to work correctly for your specific use case.
