Testing Embedding Models: Regression, Benchmarking, and Drift Detection

HelpMeTest

18 May 2026 — 8 min read

Embedding models are the foundation of vector search, RAG pipelines, and semantic applications. When you switch embedding models, update to a new version, or change preprocessing, the entire downstream system is affected. Without a testing strategy, you may not notice degraded retrieval quality until users complain.

This guide covers testing embedding models: correctness tests, regression tests across model versions, benchmark datasets, and detecting embedding drift in production.

What Can Go Wrong with Embedding Models

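Model version updates: A new model version generally produces different vectors than the previous one for the same text (for example, OpenAI's text-embedding-3-small and the older text-embedding-ada-002 are not interchangeable). If you switch without re-indexing, queries and stored documents are embedded in different vector spaces, and similarity search returns incorrect results.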

Preprocessing changes: Lowercasing, stripping punctuation, or truncating text before embedding changes the output vectors. This is invisible unless you test it.

Model drift: Self-hosted embedding models change when they are fine-tuned on new data. The model behaves differently, but nothing flags it unless your retrieval quality tests run automatically.

Dimension mismatches: Switching from a 384-dim model to a 1536-dim model without updating your vector store typically causes insert or query errors, and can fail silently if some layer pads or truncates vectors.

Semantic regression: A new model may perform better on general text but worse on your specific domain. General benchmarks don't catch domain-specific regression.
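
Two of these failure modes can be caught with very small checks before any retrieval test runs. The sketch below is illustrative only: `check_dimension` and `preprocessing_similarity` are hypothetical helper names, and `encode` stands for any callable that maps a string to a vector (for example `model.encode` from sentence-transformers).

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_dimension(embedding: Vector, expected_dim: int) -> None:
    """Fail loudly when a vector does not match the index dimension."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions, index expects {expected_dim}"
        )


def preprocessing_similarity(
    encode: Callable[[str], Vector],
    preprocess: Callable[[str], str],
    texts: List[str],
) -> Dict[str, float]:
    """Cosine similarity between each raw text and its preprocessed form."""
    return {t: cosine(encode(t), encode(preprocess(t))) for t in texts}
```

Passing `str.lower` as `preprocess` and asserting that every score stays above a threshold chosen for your data makes a lowercasing change visible in CI rather than in production.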

Testing Embedding Correctness

Start with property-based tests that don't require a golden dataset:

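# tests/embeddings/test_embedding_properties.py
import pytest
import numpy as np
from sentence_transformers import SentenceTransformer

# The fixture and the cosine_similarity helper are reused by the other test files in this guide.
# In a real project, put the fixture in tests/embeddings/conftest.py so pytest shares it.
@pytest.fixture(scope="session")
def model():
    return SentenceTransformer("all-MiniLM-L6-v2")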

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class TestEmbeddingProperties:
    """Property-based tests that should hold for any good embedding model."""
    
    def test_identical_texts_produce_identical_embeddings(self, model):
        """The same text should always produce the same embedding."""
        text = "Software testing best practices"
        
        emb1 = model.encode(text)
        emb2 = model.encode(text)
        
        np.testing.assert_allclose(
            emb1, emb2, atol=1e-5,
            err_msg="Same text should produce identical embeddings"
        )

    def test_similar_texts_have_high_cosine_similarity(self, model):
        """Paraphrases and synonyms should have cosine similarity > 0.7."""
        similar_pairs = [
            ("Python unit testing", "Writing Python tests"),
            ("Machine learning model", "ML model"),
            ("Containerize the application", "Dockerize the app"),
            ("End-to-end test automation", "Automated browser testing"),
        ]
        
        for text_a, text_b in similar_pairs:
            emb_a = model.encode(text_a)
            emb_b = model.encode(text_b)
            similarity = cosine_similarity(emb_a, emb_b)
            
            assert similarity > 0.7, (
                f"Expected similarity > 0.7 for semantically similar texts:\n"
                f"  '{text_a}'\n  '{text_b}'\n  Got: {similarity:.3f}"
            )

    def test_dissimilar_texts_have_low_cosine_similarity(self, model):
        """Unrelated texts should have cosine similarity < 0.3."""
        dissimilar_pairs = [
            ("Python software testing", "Chocolate cake recipe"),
            ("Vector database architecture", "Medieval history"),
            ("CI/CD pipeline setup", "Ocean ecology"),
        ]
        
        for text_a, text_b in dissimilar_pairs:
            emb_a = model.encode(text_a)
            emb_b = model.encode(text_b)
            similarity = cosine_similarity(emb_a, emb_b)
            
            assert similarity < 0.3, (
                f"Expected similarity < 0.3 for unrelated texts:\n"
                f"  '{text_a}'\n  '{text_b}'\n  Got: {similarity:.3f}"
            )

    def test_embedding_dimensions_are_consistent(self, model):
        """All texts should produce embeddings with the same dimension."""
        texts = [
            "Short",
            "A much longer piece of text with many words and complex structure",
            "123 456 789",
            "!@#$%^&*()",
            "",  # Edge case: empty string
        ]
        
        embeddings = [model.encode(t) for t in texts]
        dimensions = [len(e) for e in embeddings]
        
        assert len(set(dimensions)) == 1, (
            f"Inconsistent dimensions: {dict(zip(texts, dimensions))}"
        )

    def test_embeddings_are_normalized(self, model):
        """With normalize_embeddings=True, embeddings should have unit norm (so a dot product equals cosine similarity)."""
        texts = ["Test text 1", "Test text 2", "Test text 3"]
        
        for text in texts:
            emb = model.encode(text, normalize_embeddings=True)
            norm = np.linalg.norm(emb)
            
            assert abs(norm - 1.0) < 1e-5, (
                f"Embedding norm for '{text}' is {norm:.6f}, expected 1.0"
            )

    def test_related_texts_are_transitively_similar(self, model):
        """Loose transitivity check (not a true triangle inequality): if A≈B and B≈C, then A should be somewhat similar to C."""
        a = "Python testing tools"
        b = "Software quality assurance testing"
        c = "QA automation frameworks"
        
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        emb_c = model.encode(c, normalize_embeddings=True)
        
        sim_ab = cosine_similarity(emb_a, emb_b)
        sim_bc = cosine_similarity(emb_b, emb_c)
        sim_ac = cosine_similarity(emb_a, emb_c)
        
        # A and C should be at least somewhat related since both are related to B
        assert sim_ac > 0.4, (
            f"A-C similarity ({sim_ac:.3f}) too low given A-B={sim_ab:.3f} and B-C={sim_bc:.3f}"
        )
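
The 0.7 and 0.3 thresholds above are reasonable starting points, but absolute cosine values vary between models, so fixed cut-offs may need recalibrating after a model change. One hedged alternative is to assert on the gap between the two groups instead. The helper below is a sketch, not part of any library: `separation_margin` is a hypothetical name, and it takes already-computed similarity scores so it works with any model.

```python
from typing import Sequence


def separation_margin(
    similar_scores: Sequence[float], dissimilar_scores: Sequence[float]
) -> float:
    """Gap between the weakest similar pair and the strongest dissimilar pair.

    A positive margin means every similar pair scored higher than every
    dissimilar pair; a negative margin means the two groups overlap.
    """
    if not similar_scores or not dissimilar_scores:
        raise ValueError("Both score lists must be non-empty")
    return min(similar_scores) - max(dissimilar_scores)
```

For example, `separation_margin([0.82, 0.75], [0.10, 0.21])` is roughly 0.54, and a test could require the margin to stay above a value picked for your data.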

Semantic Benchmark Datasets

Build a domain-specific benchmark dataset to test retrieval quality on your actual use case:

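# tests/embeddings/test_semantic_benchmarks.py
import pytest
import numpy as np
from dataclasses import dataclass
from typing import List

# Assumes the `model` fixture lives in conftest.py and that sibling test modules
# are importable (pytest's default import mode); adjust the path to your layout.
from test_embedding_properties import cosine_similarity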

@dataclass
class SemanticBenchmarkCase:
    """A ranked list of texts: text[0] should be most similar to the anchor."""
    anchor: str
    ranked_texts: List[str]  # Most similar first
    description: str

# Domain-specific benchmark for a software testing knowledge base
BENCHMARK_CASES = [
    SemanticBenchmarkCase(
        anchor="How to write unit tests in Python",
        ranked_texts=[
            "Python unit testing tutorial with pytest",           # Most similar
            "Automated software testing guide",                   # Related
            "Python programming language overview",               # Loosely related
            "Database backup and recovery strategies",            # Unrelated
        ],
        description="Python testing query should rank testing-specific content highest",
    ),
    SemanticBenchmarkCase(
        anchor="RAG pipeline retrieval quality evaluation",
        ranked_texts=[
            "Evaluating retrieval augmented generation recall",   # Most similar
            "Machine learning model evaluation metrics",          # Related
            "Python performance optimization tips",               # Weakly related
            "Renaissance art history overview",                   # Unrelated
        ],
        description="RAG evaluation query should rank RAG-specific content highest",
    ),
    SemanticBenchmarkCase(
        anchor="Kubernetes pod memory limits configuration",
        ranked_texts=[
            "Setting resource limits for Kubernetes containers",  # Most similar
            "Docker container resource management",               # Related
            "Cloud infrastructure scaling strategies",            # Loosely related
            "Cooking vegetables in a pressure cooker",            # Unrelated
        ],
        description="K8s query should rank infrastructure content highest",
    ),
]

class TestSemanticBenchmarks:
    def test_ranking_correctness(self, model):
        """Model should rank texts in the expected semantic order."""
        failures = []
        
        for case in BENCHMARK_CASES:
            anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
            text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
            
            similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
            
            # Verify each text is more similar than the next
            for i in range(len(similarities) - 1):
                if similarities[i] <= similarities[i + 1]:
                    failures.append(
                        f"{case.description}\n"
                        f"  Expected rank {i} > rank {i+1}\n"
                        f"  '{case.ranked_texts[i]}' ({similarities[i]:.3f})\n"
                        f"    ≤\n"
                        f"  '{case.ranked_texts[i+1]}' ({similarities[i+1]:.3f})"
                    )
        
        assert not failures, "\n\n".join(failures)

    def test_top_1_accuracy(self, model):
        """The most similar text should rank first for all benchmark cases."""
        errors = []
        
        for case in BENCHMARK_CASES:
            anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
            text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
            
            similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
            top_ranked_index = np.argmax(similarities)
            
            if top_ranked_index != 0:
                errors.append(
                    f"{case.description}\n"
                    f"  Top-1 was index {top_ranked_index}: '{case.ranked_texts[top_ranked_index]}'\n"
                    f"  Expected index 0: '{case.ranked_texts[0]}'"
                )
        
        assert not errors, "\n\n".join(errors)
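
Strict pairwise ordering can be brittle when two adjacent texts are nearly tied. A rank correlation gives a graded score per case instead of a pass/fail. The sketch below computes Spearman's rank correlation between the expected order (index 0 most similar) and the order implied by the similarities. `spearman_vs_expected` is a hypothetical helper, and the closed-form formula it uses assumes there are no tied scores.

```python
from typing import Sequence


def spearman_vs_expected(similarities: Sequence[float]) -> float:
    """Spearman correlation between expected rank (list position) and model rank.

    similarities[i] is the anchor's similarity to the text expected at rank i.
    Returns 1.0 for a perfect ordering and -1.0 for a fully reversed one.
    Assumes no ties and at least two items.
    """
    n = len(similarities)
    if n < 2:
        raise ValueError("Need at least two similarities to rank")
    # Model rank of each position: 0 for the highest similarity, 1 for the next, ...
    order = sorted(range(n), key=lambda i: similarities[i], reverse=True)
    model_rank = {pos: rank for rank, pos in enumerate(order)}
    d_squared = sum((i - model_rank[i]) ** 2 for i in range(n))
    return 1.0 - (6.0 * d_squared) / (n * (n * n - 1))
```

A benchmark test could then require, say, an average correlation above a floor chosen for your domain, while still keeping the strict top-1 check for the most important result.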

Regression Testing Across Model Updates

When you update your embedding model, test that the new model doesn't regress on your benchmark:

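# tests/embeddings/test_model_regression.py
import json
import numpy as np
import pytest
from pathlib import Path
from dataclasses import dataclass, asdict

# Reuses helpers defined earlier in this guide; adjust import paths to your layout.
from test_embedding_properties import cosine_similarity
from test_semantic_benchmarks import BENCHMARK_CASES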

BASELINE_PATH = Path("tests/baselines/embedding_model_metrics.json")

@dataclass
class ModelMetrics:
    model_name: str
    avg_similar_cosine: float
    avg_dissimilar_cosine: float
    top1_accuracy: float
    ranking_accuracy: float

def compute_model_metrics(model) -> ModelMetrics:
    similar_pairs = [
        ("Python testing", "Writing Python tests"),
        ("Docker containers", "Containerization"),
        ("CI/CD pipeline", "Continuous integration"),
    ]
    
    dissimilar_pairs = [
        ("Python testing", "Medieval history"),
        ("Docker containers", "Cooking recipes"),
    ]
    
    similar_scores = []
    for a, b in similar_pairs:
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        similar_scores.append(cosine_similarity(emb_a, emb_b))
    
    dissimilar_scores = []
    for a, b in dissimilar_pairs:
        emb_a = model.encode(a, normalize_embeddings=True)
        emb_b = model.encode(b, normalize_embeddings=True)
        dissimilar_scores.append(cosine_similarity(emb_a, emb_b))
    
    # Compute benchmark accuracy
    correct_top1 = 0
    correct_ranking = 0
    total_rankings = 0
    
    for case in BENCHMARK_CASES:
        anchor_emb = model.encode(case.anchor, normalize_embeddings=True)
        text_embs = model.encode(case.ranked_texts, normalize_embeddings=True)
        similarities = [cosine_similarity(anchor_emb, te) for te in text_embs]
        
        if np.argmax(similarities) == 0:
            correct_top1 += 1
        
        for i in range(len(similarities) - 1):
            if similarities[i] > similarities[i + 1]:
                correct_ranking += 1
            total_rankings += 1
    
    return ModelMetrics(
        model_name=str(model),
        avg_similar_cosine=float(np.mean(similar_scores)),
        avg_dissimilar_cosine=float(np.mean(dissimilar_scores)),
        top1_accuracy=correct_top1 / len(BENCHMARK_CASES),
        ranking_accuracy=correct_ranking / total_rankings,
    )

class TestModelRegression:
    def test_metrics_do_not_regress(self, model):
        """Current model metrics should not regress below baseline."""
        current = compute_model_metrics(model)
        
        if not BASELINE_PATH.exists():
            # First run: save baseline
            BASELINE_PATH.parent.mkdir(parents=True, exist_ok=True)
            BASELINE_PATH.write_text(json.dumps(asdict(current), indent=2))
            pytest.skip("Saved baseline — run again to test against it")
        
        baseline = ModelMetrics(**json.loads(BASELINE_PATH.read_text()))
        
        # Allow roughly 5% tolerance on each metric (relative for cosine and ranking, absolute for top-1)
        assert current.avg_similar_cosine >= baseline.avg_similar_cosine * 0.95, (
            f"avg_similar_cosine regressed: {baseline.avg_similar_cosine:.3f} → {current.avg_similar_cosine:.3f}"
        )
        
        assert current.top1_accuracy >= baseline.top1_accuracy - 0.05, (
            f"top1_accuracy regressed: {baseline.top1_accuracy:.3f} → {current.top1_accuracy:.3f}"
        )
        
        assert current.ranking_accuracy >= baseline.ranking_accuracy * 0.95, (
            f"ranking_accuracy regressed: {baseline.ranking_accuracy:.3f} → {current.ranking_accuracy:.3f}"
        )

    def test_dissimilar_similarity_does_not_increase(self, model):
        """Dissimilar text scores should not increase (would indicate worse discrimination)."""
        current = compute_model_metrics(model)
        
        if not BASELINE_PATH.exists():
            pytest.skip("No baseline to compare against")
        
        baseline = ModelMetrics(**json.loads(BASELINE_PATH.read_text()))
        
        # Dissimilar scores should stay low. Use an absolute tolerance: a relative one
        # (baseline * 1.10) misbehaves when the baseline is near zero or negative.
        assert current.avg_dissimilar_cosine <= baseline.avg_dissimilar_cosine + 0.05, (
            f"avg_dissimilar_cosine increased: {baseline.avg_dissimilar_cosine:.3f} → {current.avg_dissimilar_cosine:.3f}"
        )
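
The assertions above mix relative and absolute tolerances inline. One way to keep them in a single place is a small comparison helper that returns every regression at once, so a failing run reports all degraded metrics rather than only the first. This is a sketch under assumptions: `find_regressions` and the two tolerance tables are hypothetical, and the tolerance values are placeholders to tune.

```python
from typing import Dict, List

# Metrics where a higher value is better, with an absolute tolerance for each.
HIGHER_IS_BETTER = {
    "avg_similar_cosine": 0.03,
    "top1_accuracy": 0.05,
    "ranking_accuracy": 0.05,
}
# Metrics where a lower value is better.
LOWER_IS_BETTER = {"avg_dissimilar_cosine": 0.03}


def find_regressions(baseline: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Return one message per metric that moved in the wrong direction beyond its tolerance."""
    problems = []
    for name, tol in HIGHER_IS_BETTER.items():
        if current[name] < baseline[name] - tol:
            problems.append(f"{name} dropped: {baseline[name]:.3f} -> {current[name]:.3f}")
    for name, tol in LOWER_IS_BETTER.items():
        if current[name] > baseline[name] + tol:
            problems.append(f"{name} rose: {baseline[name]:.3f} -> {current[name]:.3f}")
    return problems
```

With the `ModelMetrics` dataclass above, a test could call `find_regressions(asdict(baseline), asdict(current))` and assert that the returned list is empty. Absolute tolerances also behave sensibly when a baseline cosine is close to zero or negative.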

Detecting Embedding Drift

Embedding drift occurs when embeddings for the same text change over time — due to model updates, preprocessing changes, or non-deterministic inference:

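# tests/embeddings/test_drift_detection.py
import hashlib
import json
import numpy as np
import pytest
from pathlib import Path

# Reuses the helper defined earlier in this guide; adjust the import path to your layout.
from test_embedding_properties import cosine_similarity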

EMBEDDING_SNAPSHOT_PATH = Path("tests/baselines/embedding_snapshots.json")

ANCHOR_TEXTS = [
    "HelpMeTest is a cloud-based QA testing platform",
    "Playwright end-to-end browser automation",
    "Robot Framework test automation library",
    "Vector database similarity search",
    "Continuous integration and deployment pipeline",
]

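def fingerprint_embedding(embedding: np.ndarray, precision: int = 4) -> str:
    """Create a coarse fingerprint of an embedding for quick equality checks.

    Values near a rounding boundary can flip the hash, so the tests below
    compare vectors with a tolerance rather than relying on this fingerprint.
    """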
    rounded = np.round(embedding, precision)
    return hashlib.sha256(rounded.tobytes()).hexdigest()

class TestEmbeddingDrift:
    def test_embeddings_match_baseline_snapshots(self, model):
        """Embeddings should not change between runs (deterministic model)."""
        current_embeddings = {}
        
        for text in ANCHOR_TEXTS:
            emb = model.encode(text, normalize_embeddings=True)
            current_embeddings[text] = emb.tolist()
        
        if not EMBEDDING_SNAPSHOT_PATH.exists():
            # Save baseline
            EMBEDDING_SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
            EMBEDDING_SNAPSHOT_PATH.write_text(
                json.dumps(current_embeddings, indent=2)
            )
            pytest.skip("Saved embedding snapshots — run again to test drift")
        
        baseline = json.loads(EMBEDDING_SNAPSHOT_PATH.read_text())
        
        drifted = []
        for text, current_emb in current_embeddings.items():
            if text not in baseline:
                continue
            
            baseline_emb = np.array(baseline[text])
            current_arr = np.array(current_emb)
            
            # Compute drift as max absolute difference
            max_diff = float(np.max(np.abs(baseline_emb - current_arr)))
            
            if max_diff > 1e-4:
                drifted.append(f"'{text[:50]}': max_diff={max_diff:.6f}")
        
        assert not drifted, (
            f"Embedding drift detected for {len(drifted)} texts:\n" + "\n".join(drifted)
        )

    def test_similar_relationship_preserved_after_model_change(self, model):
        """If model is updated, similar relationships should be preserved."""
        if not EMBEDDING_SNAPSHOT_PATH.exists():
            pytest.skip("No baseline to compare against")
        
        baseline = json.loads(EMBEDDING_SNAPSHOT_PATH.read_text())
        
        # Verify that the ranking of texts is preserved even if exact values changed
        anchor_text = ANCHOR_TEXTS[0]
        
        if anchor_text not in baseline:
            pytest.skip("Anchor text not in baseline")
        
        baseline_anchor = np.array(baseline[anchor_text])
        
        # Sort texts by similarity to anchor using baseline
        baseline_sims = {
            text: cosine_similarity(baseline_anchor, np.array(emb))
            for text, emb in baseline.items()
            if text != anchor_text
        }
        baseline_ranking = sorted(baseline_sims, key=baseline_sims.get, reverse=True)
        
        # Sort using current model
        current_anchor = model.encode(anchor_text, normalize_embeddings=True)
        current_sims = {
            text: cosine_similarity(current_anchor, model.encode(text, normalize_embeddings=True))
            for text in baseline.keys()
            if text != anchor_text
        }
        current_ranking = sorted(current_sims, key=current_sims.get, reverse=True)
        
        # Top-3 should match
        assert baseline_ranking[:3] == current_ranking[:3], (
            f"Ranking changed after model update:\n"
            f"  Before: {baseline_ranking[:3]}\n"
            f"  After:  {current_ranking[:3]}"
        )
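
Max absolute difference flags any change but does not say how large the shift is in terms of similarity. As a complementary, hedged sketch, the helper below reports the cosine distance between each baseline vector and its current counterpart. `drift_report` is a hypothetical name, the 0.01 threshold is a placeholder, and only texts present in both snapshots are compared.

```python
import math
from typing import Dict, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("Cannot compare a zero vector")
    return 1.0 - dot / norm


def drift_report(
    baseline: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    threshold: float = 0.01,
) -> Dict[str, float]:
    """Map each drifted text to its cosine distance from the baseline vector."""
    report = {}
    for text, old_vec in baseline.items():
        if text not in current:
            continue  # only compare texts present in both snapshots
        distance = cosine_distance(old_vec, current[text])
        if distance > threshold:
            report[text] = distance
    return report
```

Because `cosine_distance` raises on a dimension mismatch, the same report also surfaces a model swap that changed the vector size instead of silently comparing truncated vectors.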

Testing Model Upgrade Path

When upgrading your embedding model, you need to re-index all documents. Test the migration:

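# tests/embeddings/test_model_migration.py
# vector_store, old_model, new_model, test_corpus, and measure_recall_at_k are
# project-specific fixtures and helpers that you provide (for example in conftest.py).
import pytest
import numpy as np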

def test_re_indexing_does_not_break_retrieval(vector_store, old_model, new_model, test_corpus):
    """After re-indexing with a new model, retrieval quality should not regress."""
    # Index with old model
    old_embeddings = old_model.encode([doc.text for doc in test_corpus], normalize_embeddings=True)
    vector_store.upsert_batch(
        ids=[doc.id for doc in test_corpus],
        embeddings=old_embeddings.tolist(),
    )
    
    # Measure recall with old model
    old_recall = measure_recall_at_k(vector_store, old_model, test_corpus, k=3)
    
    # Re-index with new model (if the dimensions differ, recreate the collection first)
    new_embeddings = new_model.encode([doc.text for doc in test_corpus], normalize_embeddings=True)
    vector_store.upsert_batch(
        ids=[doc.id for doc in test_corpus],
        embeddings=new_embeddings.tolist(),
    )
    
    # Measure recall with new model (must use same model for queries!)
    new_recall = measure_recall_at_k(vector_store, new_model, test_corpus, k=3)
    
    # New model recall should be at least as good as old (within 5%)
    assert new_recall >= old_recall * 0.95, (
        f"Recall regressed after model update: {old_recall:.3f} → {new_recall:.3f}"
    )
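
The test above relies on a `measure_recall_at_k` helper and on fixtures that this guide does not define. One possible shape is sketched below. Everything here is an assumption made for illustration: the `Doc.query` field (a query the document is expected to answer), the `InMemoryVectorStore` class and its `search` method, and the brute-force cosine search. A real vector store client will have its own query API.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Doc:
    id: str
    text: str
    query: str  # hypothetical: a query this document is expected to answer


class InMemoryVectorStore:
    """Brute-force stand-in for a real vector store (illustration only)."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def upsert_batch(self, ids: Sequence[str], embeddings: Sequence[Sequence[float]]) -> None:
        for doc_id, emb in zip(ids, embeddings):
            self._vectors[doc_id] = list(emb)  # overwrites existing ids

    def search(self, query_vector: Sequence[float], k: int) -> List[str]:
        def score(vec: List[float]) -> float:
            dot = sum(x * y for x, y in zip(query_vector, vec))
            norm = math.sqrt(sum(x * x for x in query_vector)) * math.sqrt(
                sum(y * y for y in vec)
            )
            return dot / norm if norm else 0.0

        ranked = sorted(self._vectors, key=lambda i: score(self._vectors[i]), reverse=True)
        return ranked[:k]


def measure_recall_at_k(vector_store, model, corpus: Sequence[Doc], k: int = 3) -> float:
    """Fraction of documents whose own query retrieves them within the top k."""
    if not corpus:
        raise ValueError("Corpus must not be empty")
    hits = 0
    for doc in corpus:
        query_vector = model.encode(doc.query)
        if doc.id in vector_store.search(query_vector, k):
            hits += 1
    return hits / len(corpus)
```

With exactly one relevant document per query, this recall@k is the same as a hit rate. If your queries have several relevant documents, the helper would need a set of expected ids per query instead.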

Continuous Monitoring

Run embedding quality tests on a schedule, not just on deployment:

# .github/workflows/embedding-quality.yml
name: Embedding Quality Check
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday at 02:00 UTC
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      issues: write  # needed for the alert step to open an issue
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - run: pip install -r requirements-test.txt
      
      - name: Run embedding quality tests
        run: pytest tests/embeddings/ -v --tb=short
      
      - name: Alert on regression
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Embedding quality regression detected',
              labels: ['ai-quality', 'alert'],
            })

Summary

Testing embedding models across their lifecycle:

Property tests: Identical texts → identical vectors; similar texts → high cosine; dissimilar texts → low cosine
Benchmark datasets: Domain-specific ranked text lists to measure top-1 accuracy and ranking accuracy
Regression testing: Save metric baselines and compare on every model update
Drift detection: Snapshot reference embeddings and alert on deviation
Migration testing: Verify re-indexing with a new model preserves retrieval quality

The key insight is that embedding model testing is not about the model in isolation — it's about whether the model's output enables your downstream system (vector search, RAG pipeline, recommendations) to work correctly for your specific use case.
