MLOps

Testing ML Experiments with Weights & Biases: Assertions, Alerts, and CI Checks

HelpMeTest

20 May 2026 — 4 min read

Weights & Biases (W&B) tracks ML experiments — but it also provides the infrastructure to assert training quality, detect regressions, and gate deployments on metric thresholds. This post covers how to use W&B as a testing tool, not just a logging tool.

W&B as a Testing Tool

Beyond logging, W&B enables:

Metric assertions — fail a run if accuracy drops below a threshold
Automated alerts — notify on Slack/email when metrics cross boundaries
Model registry gates — require passing eval metrics before promoting to production
CI integration — use the W&B API to query past runs and compare

Basic Run with Assertions

import wandb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Thresholds — treat these as your test spec
ACCURACY_THRESHOLD = 0.90
F1_THRESHOLD = 0.88

run = wandb.init(
    project="iris-classifier",
    config={
        "n_estimators": 100,
        "max_depth": 5,
        "random_state": 42,
    }
)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(**wandb.config)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

accuracy = accuracy_score(y_test, preds)
f1 = f1_score(y_test, preds, average="weighted")

wandb.log({"accuracy": accuracy, "f1": f1})

# Assertions — fail the run if quality gates are not met
assert accuracy >= ACCURACY_THRESHOLD, (
    f"Accuracy {accuracy:.3f} below threshold {ACCURACY_THRESHOLD}"
)
assert f1 >= F1_THRESHOLD, (
    f"F1 {f1:.3f} below threshold {F1_THRESHOLD}"
)

print(f"All checks passed: accuracy={accuracy:.3f}, f1={f1:.3f}")
wandb.finish()
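
One caveat with bare assert statements: Python removes them when it runs with the -O flag, so a gate built only on assert can silently stop gating. Below is a minimal sketch of an alternative that raises an explicit exception and reports every failed threshold at once. The names QualityGateError and check_thresholds are hypothetical helpers for illustration, not part of W&B.

```python
class QualityGateError(RuntimeError):
    """Raised when one or more metrics fall below their thresholds."""


def check_thresholds(metrics: dict, thresholds: dict) -> None:
    """Raise QualityGateError listing every metric below its minimum.

    Both arguments map metric name -> float. A metric that is missing
    from `metrics` counts as a failure.
    """
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing (minimum {minimum})")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum}")
    if failures:
        raise QualityGateError("; ".join(failures))


# Passing case: returns None and raises nothing.
check_thresholds({"accuracy": 0.95, "f1": 0.91}, {"accuracy": 0.90, "f1": 0.88})
```

In the training script, a call such as check_thresholds({"accuracy": accuracy, "f1": f1}, {"accuracy": ACCURACY_THRESHOLD, "f1": F1_THRESHOLD}) could stand in for the two assert statements.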

Querying Previous Runs for Regression Detection

Use the W&B API to compare a new run against historical baselines:

import wandb

api = wandb.Api()

def get_best_accuracy(project: str, metric: str = "accuracy") -> float:
    """Return the best metric value from all completed runs."""
    runs = api.runs(
        project,
        filters={"state": "finished"},
        order=f"-summary_metrics.{metric}",
    )
    if not runs:
        return 0.0
    return runs[0].summary.get(metric, 0.0)

def assert_no_regression(
    current_accuracy: float,
    project: str,
    tolerance: float = 0.02,
):
    """Fail if the current run is worse than the best historical run by more than tolerance."""
    best = get_best_accuracy(project)
    min_acceptable = best - tolerance
    assert current_accuracy >= min_acceptable, (
        f"Regression detected: current={current_accuracy:.3f}, "
        f"best={best:.3f}, tolerance={tolerance}"
    )

# In your training script:
assert_no_regression(accuracy, project="my-team/iris-classifier")
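
The comparison logic can also be kept separate from the API call, which makes it testable without network access. The sketch below works on any iterable of run-like objects (anything with a summary mapping) and skips runs that never logged the metric. The names best_metric and is_regression are hypothetical, and the SimpleNamespace objects are stand-ins for real W&B runs.

```python
import math
from types import SimpleNamespace
from typing import Iterable, Optional


def best_metric(runs: Iterable, metric: str = "accuracy") -> Optional[float]:
    """Return the highest finite value of `metric` across runs, or None."""
    values = []
    for run in runs:
        value = run.summary.get(metric)
        # Skip runs that never logged the metric or logged a non-numeric/NaN value.
        if isinstance(value, (int, float)) and math.isfinite(value):
            values.append(float(value))
    return max(values) if values else None


def is_regression(current: float, best: Optional[float], tolerance: float = 0.02) -> bool:
    """True when `current` is more than `tolerance` below `best`."""
    if best is None:  # no baseline yet, so nothing to regress against
        return False
    return current < best - tolerance


# Stand-in runs for illustration; real ones would come from api.runs(...).
fake_runs = [
    SimpleNamespace(summary={"accuracy": 0.93}),
    SimpleNamespace(summary={}),
    SimpleNamespace(summary={"accuracy": 0.96}),
]
baseline = best_metric(fake_runs)
print(baseline, is_regression(0.90, baseline), is_regression(0.95, baseline))
```

Returning None when there is no baseline, rather than 0.0, keeps "no history yet" distinguishable from "the best run scored zero".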

pytest Integration

Write W&B checks as pytest tests for CI:

# tests/test_training_quality.py
import pytest
import wandb

PROJECT = "my-team/iris-classifier"
MIN_ACCURACY = 0.90
MIN_F1 = 0.88
MAX_REGRESSION = 0.02  # allow 2% drop from historical best

@pytest.fixture(scope="session")
def api():
    return wandb.Api()

@pytest.fixture(scope="session")
def latest_run(api):
    runs = api.runs(
        PROJECT,
        filters={"state": "finished"},
        order="-created_at",
    )
    assert len(runs) > 0, "No completed runs found"
    return runs[0]

def test_accuracy_above_threshold(latest_run):
    accuracy = latest_run.summary.get("accuracy")
    assert accuracy is not None, "No accuracy metric in latest run"
    assert accuracy >= MIN_ACCURACY, (
        f"Accuracy {accuracy:.3f} < threshold {MIN_ACCURACY}"
    )

def test_f1_above_threshold(latest_run):
    f1 = latest_run.summary.get("f1")
    assert f1 is not None, "No f1 metric in latest run"
    assert f1 >= MIN_F1, f"F1 {f1:.3f} < threshold {MIN_F1}"

def test_no_accuracy_regression(api, latest_run):
    all_runs = api.runs(
        PROJECT,
        filters={"state": "finished"},
        order="-summary_metrics.accuracy",
    )
    best_accuracy = all_runs[0].summary.get("accuracy", 0)
    current_accuracy = latest_run.summary.get("accuracy", 0)
    assert current_accuracy >= best_accuracy - MAX_REGRESSION, (
        f"Regression: current={current_accuracy:.3f}, best={best_accuracy:.3f}"
    )

def test_training_time_acceptable(latest_run):
    duration = latest_run.summary.get("_wandb", {}).get("runtime", 0)
    assert duration < 3600, f"Training took {duration}s, max is 3600s"
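
The latest_run fixture takes the most recent finished run in the project, which in a busy project may not be the run the current CI job produced. One possible way around this, sketched here under assumptions, is to have the training step export its own run path and have the tests prefer it. The environment variable QUALITY_GATE_RUN_PATH and the helper resolve_run_path are hypothetical names, not W&B features.

```python
import os
from typing import Mapping, Optional

RUN_PATH_ENV = "QUALITY_GATE_RUN_PATH"  # hypothetical variable name


def resolve_run_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return an "entity/project/run_id" path from the environment, or None."""
    path = env.get(RUN_PATH_ENV, "").strip()
    if not path:
        return None
    parts = path.split("/")
    # Expect exactly three non-empty components.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"{RUN_PATH_ENV} must look like entity/project/run_id, got {path!r}"
        )
    return path


print(resolve_run_path({RUN_PATH_ENV: "my-team/iris-classifier/abc123"}))
```

Inside the fixture, a non-None result could be passed to api.run(path), with the "latest finished run" query kept as the fallback.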

Automated Alerts

Configure alerts in W&B to notify on threshold violations:

import wandb

run = wandb.init(project="production-model")

# Log metrics during training.
# train_one_epoch() and evaluate() stand in for your own training code.
for epoch in range(100):
    loss = train_one_epoch()
    val_loss = evaluate()
    wandb.log({"loss": loss, "val_loss": val_loss, "epoch": epoch})

    # Programmatic alert if validation loss diverges from training loss
    if val_loss > loss * 1.5:
        wandb.alert(
            title="Validation loss diverging",
            text=f"val_loss={val_loss:.4f} is more than 50% above train_loss={loss:.4f} at epoch {epoch}",
            level=wandb.AlertLevel.WARN,
            wait_duration=600,  # seconds; suppress repeat alerts with this title
        )

Configure alert destinations (Slack, email) in W&B settings under Alerts.
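
A single noisy epoch can trip a ratio check like the one above. A minimal sketch of one way to reduce false alarms is to require the condition to hold for several consecutive epochs before alerting. DivergenceMonitor is a hypothetical helper, and the ratio and patience values are illustrative rather than recommended defaults.

```python
class DivergenceMonitor:
    """Signal once when val_loss exceeds ratio * train_loss for `patience` epochs in a row."""

    def __init__(self, ratio: float = 1.5, patience: int = 3):
        self.ratio = ratio
        self.patience = patience
        self._streak = 0      # consecutive diverging epochs seen so far
        self._fired = False   # whether the current streak already signalled

    def update(self, train_loss: float, val_loss: float) -> bool:
        """Record one epoch; return True only on the epoch that should alert."""
        if val_loss > train_loss * self.ratio:
            self._streak += 1
        else:
            # Recovery resets the streak and re-arms the monitor.
            self._streak = 0
            self._fired = False
        if self._streak >= self.patience and not self._fired:
            self._fired = True
            return True
        return False


monitor = DivergenceMonitor(ratio=1.5, patience=2)
signals = [monitor.update(1.0, v) for v in [2.0, 2.0, 2.0, 1.0, 2.0, 2.0]]
print(signals)
```

In a training loop, the wandb.alert(...) call would move inside an `if monitor.update(loss, val_loss):` branch.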

Model Registry Gates

Enforce a quality gate before promoting a model: look up the run that logged the model artifact, and add the production alias only if its accuracy meets the threshold:

import wandb

api = wandb.Api()

def promote_if_quality_passes(
    artifact_name: str,
    version: str,
    project: str,
    accuracy_threshold: float = 0.92,
):
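    artifact = api.artifact(f"{project}/{artifact_name}:{version}")
    run = artifact.logged_by()  # the run that produced this artifact, if any
    if run is None:
        raise ValueError(f"No producing run found for {artifact_name}:{version}")
    accuracy = run.summary.get("accuracy", 0)

    if accuracy >= accuracy_threshold:
        # Add the "production" alias to this artifact version
        artifact.aliases.append("production")
        artifact.save()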
        print(f"Promoted {artifact_name}:{version} to production (accuracy={accuracy:.3f})")
    else:
        raise ValueError(
            f"Model not promoted: accuracy={accuracy:.3f} < {accuracy_threshold}"
        )

promote_if_quality_passes(
    "iris-classifier",
    "v3",
    "my-team/iris-classifier",
)

CI Pipeline

# .github/workflows/ml-quality-gate.yml
name: ML Quality Gate

on:
  push:
    paths:
      - "train.py"
      - "model/**"

jobs:
  train-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install wandb scikit-learn pytest

      - name: Train model
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
        run: python train.py

      - name: Run quality gate tests
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
        run: pytest tests/test_training_quality.py -v

Sweep Testing (Hyperparameter Validation)

Test that your hyperparameter sweep produces at least one configuration above the quality bar:

import wandb

api = wandb.Api()

# Named assert_* rather than test_* so pytest does not try to collect it
# and resolve sweep_id as a fixture.
def assert_sweep_produced_valid_model(sweep_id: str, min_accuracy: float = 0.90):
    sweep = api.sweep(sweep_id)
    best_run = sweep.best_run(order="accuracy")
    assert best_run is not None, f"Sweep {sweep_id} has no runs"
    accuracy = best_run.summary.get("accuracy", 0)
    assert accuracy >= min_accuracy, (
        f"Best sweep run only achieved accuracy={accuracy:.3f}, "
        f"minimum required is {min_accuracy}"
    )

assert_sweep_produced_valid_model("my-team/iris-classifier/sweep-abc123")
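
Checking only the single best run says little about how robust the search space is. As a rough sketch, the same check can report every run that clears the bar. passing_runs is a hypothetical helper that accepts any iterable of run-like objects with name and summary attributes. Passing it the runs of a real sweep, for example api.sweep(sweep_id).runs, is an assumption about how you would wire it up.

```python
from types import SimpleNamespace
from typing import Iterable, List


def passing_runs(
    runs: Iterable, metric: str = "accuracy", minimum: float = 0.90
) -> List[str]:
    """Return the names of runs whose `metric` is at least `minimum`."""
    names = []
    for run in runs:
        value = run.summary.get(metric)
        # Runs that crashed before logging the metric are ignored.
        if isinstance(value, (int, float)) and value >= minimum:
            names.append(run.name)
    return names


# Stand-in sweep runs for illustration only.
fake_sweep_runs = [
    SimpleNamespace(name="run-a", summary={"accuracy": 0.94}),
    SimpleNamespace(name="run-b", summary={"accuracy": 0.81}),
    SimpleNamespace(name="run-c", summary={}),
    SimpleNamespace(name="run-d", summary={"accuracy": 0.90}),
]
good = passing_runs(fake_sweep_runs)
print(f"{len(good)} of {len(fake_sweep_runs)} runs passed: {good}")
```

A gate could then require a minimum count of passing configurations instead of just one.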

Key Takeaways

Add explicit assert statements in training scripts to fail runs on quality regressions
Use the W&B API in pytest to query historical runs and detect metric degradation
Configure wandb.alert() for runtime notifications when training goes wrong
Gate model promotion with the Model Registry using alias-based quality checks
Run W&B-backed pytest tests in CI after every model training job
