Reproducible Science: Validating Notebook Outputs with nbval
Reproducibility is the foundation of scientific computing. A notebook that produces a result should produce the same result when re-executed—same inputs, same code, same environment. nbval enforces this by re-running a notebook and comparing each cell's output to the recorded output in the .ipynb file. Any difference fails the test.
This makes nbval the right tool for analysis notebooks, report generators, and data validation pipelines where the outputs are the deliverable and drift is a defect.
Installing nbval
pip install pytest nbval

nbval works as a pytest plugin.
Basic Usage
Save a reference notebook with expected outputs (run it once to completion in Jupyter), then validate it with:
pytest --nbval notebooks/data-analysis.ipynb

nbval re-executes each cell and compares its output to the stored output. Any difference is a test failure:
notebooks/data-analysis.ipynb::Cell 4 PASSED
notebooks/data-analysis.ipynb::Cell 5 PASSED
notebooks/data-analysis.ipynb::Cell 6 FAILED
Cell 6:
Expected: "Accuracy: 0.924"
Got: "Accuracy: 0.919"This failure tells you that something in the code, data, or environment changed between when the notebook was last run and now.
The --nbval-lax Flag
Strict mode (--nbval) compares the output of every cell, so any platform-dependent fragment in the text (a memory address, a timestamp, a progress bar) can fail the run. For most use cases, --nbval-lax is more appropriate:

pytest --nbval-lax notebooks/

--nbval-lax still executes every cell and fails on execution errors, but it only compares outputs for cells marked with a # NBVAL_CHECK_OUTPUT comment (shown below); unmarked cells are checked for successful execution only.

Use --nbval-lax as the default, marking the cells whose output is the deliverable, and switch to --nbval when you want every cell's output compared.
Cell-Level Skip Directives
Some cells produce outputs that cannot be reproduced exactly: timestamps, random numbers without fixed seeds, progress bars, interactive widgets. Mark these cells so nbval either skips them or ignores their output:
Skip a cell entirely
Add a comment at the top of the cell:
# NBVAL_SKIP
import datetime
print(f"Report generated at: {datetime.datetime.now()}")nbval skips output comparison for this cell.
Ignore a cell's output
The cell executes (so it can produce side effects), but its output is not compared:
# NBVAL_IGNORE_OUTPUT
from tqdm import tqdm
for i in tqdm(range(100)):
    pass

Raise an error if output changes (explicit check)
# NBVAL_CHECK_OUTPUT
accuracy = evaluate_model(X_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")The NBVAL_CHECK_OUTPUT directive is the default—cells without any directive are checked. You only need this comment when you want to make the intent explicit.
Handling Float Tolerances
Floating-point outputs vary slightly across platforms, Python versions, and hardware. A model accuracy of 0.9231 might be 0.9232 on a different machine. Strict comparison would fail; you want comparison within a tolerance.
nbval has no built-in numeric tolerance, and it does not interpret stored cell outputs as regex patterns; comparison is plain text equality. (The # NBVAL_RAISES_EXCEPTION directive is unrelated to tolerances: it marks a cell that is expected to raise an exception.) What nbval does support is regex-based sanitizing: the stored output and the freshly generated output are both run through the same substitutions before they are compared, which lets you collapse values that are only "close enough" to a common placeholder. That mechanism is the sanitize file.
Sanitize Files for Regex Matching
A sanitize file defines regex substitutions that nbval applies to both the stored and the re-run output before comparing them. Create it at nbval_sanitize.cfg:
[floats]
# Replace floating point numbers with a fixed placeholder
regex: \d+\.\d{3,}
replace: FLOAT

[memory_addresses]
# Replace memory addresses
regex: 0x[0-9a-f]+
replace: 0xADDR

[timestamps]
# Replace datetime output
regex: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
replace: TIMESTAMP

The file is parsed as an INI file: each rule lives in its own section with a regex: key and a replace: key. Run with the sanitize file:
pytest --nbval-lax --sanitize-with=nbval_sanitize.cfg notebooks/

With this configuration, Accuracy: 0.9231 and Accuracy: 0.9229 are both normalized to Accuracy: FLOAT before comparison. The test checks that an accuracy value appears, not that it matches exactly.
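To make the normalization concrete, here is a small standalone sketch (plain Python, not nbval internals) of the substitute-then-compare idea the sanitize file implements:

import re

# The same substitutions the sanitize file describes, applied by hand.
RULES = [
    (r"\d+\.\d{3,}", "FLOAT"),    # floating point numbers
    (r"0x[0-9a-f]+", "0xADDR"),   # memory addresses
]

def normalize(text):
    # Run every substitution over the output text.
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

stored = "Accuracy: 0.9231"  # output saved in the .ipynb
fresh = "Accuracy: 0.9229"   # output from the re-run
assert normalize(stored) == normalize(fresh)  # both become "Accuracy: FLOAT"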
Reproducibility Patterns
For nbval to work, the notebook must produce deterministic output. Apply these patterns to every notebook that will be output-validated.
Fix random seeds
# Cell 1: Always the first cell in validated notebooks
import numpy as np
import random
import os
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
os.environ['PYTHONHASHSEED'] = str(RANDOM_SEED)
# For PyTorch
try:
    import torch
    torch.manual_seed(RANDOM_SEED)
    torch.backends.cudnn.deterministic = True
except ImportError:
    pass

# For TensorFlow/Keras
try:
    import tensorflow as tf
    tf.random.set_seed(RANDOM_SEED)
except ImportError:
    pass

Pin library versions
# NBVAL_IGNORE_OUTPUT
import pandas as pd
import numpy as np
import sklearn
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scikit-learn: {sklearn.__version__}")Use NBVAL_IGNORE_OUTPUT so version bumps don't fail the test—but record them for debugging.
Use fixed dates
# Instead of:
df = df[df['date'] >= pd.Timestamp.today() - pd.Timedelta(days=30)]
# Use:
REFERENCE_DATE = pd.Timestamp('2026-01-01')
df = df[df['date'] >= REFERENCE_DATE - pd.Timedelta(days=30)]

Fix data ordering
# Sort outputs so comparison is order-independent
df_result = df_result.sort_values('id').reset_index(drop=True)

Creating Reference Notebooks
Before running nbval, you must create a reference notebook—a fully executed notebook with the expected outputs recorded. The workflow:
- Set all random seeds, fix dates, and sort outputs in the notebook
- Kernel → Restart and Run All in Jupyter
- Verify the outputs look correct
- Save the notebook (File → Save)
- Commit the notebook including outputs to version control
The .ipynb file now contains the reference outputs. nbval re-runs the notebook and compares to these stored values.
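If you prefer the command line to the Jupyter menu, the same run-to-completion step can be done with nbconvert (assuming Jupyter is installed in the same environment as the notebook's dependencies):

jupyter nbconvert --to notebook --execute --inplace notebooks/reports/monthly-summary.ipynb

The --inplace flag overwrites the notebook file with its freshly executed outputs, which is exactly what a reference notebook needs.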
Updating reference outputs: When you intentionally change a computation, re-run the notebook in Jupyter (or with nbconvert as above), verify the new outputs are correct, and commit the updated .ipynb. nbval has no flag to regenerate stored outputs automatically; re-executing and saving the notebook is the update step, and it forces you to look at the new outputs before they become the reference.
pytest Configuration
# pytest.ini (use a [tool.pytest.ini_options] section in pyproject.toml)
[pytest]
addopts = --nbval-lax
testpaths = notebooks/reports notebooks/validation

Run all validated notebooks:
pytest

Run a specific notebook:
pytest notebooks/reports/monthly-summary.ipynb

CI Integration
# .github/workflows/notebook-validation.yml
name: Notebook Output Validation

on:
  push:
    paths:
      - 'notebooks/reports/**'
      - 'notebooks/validation/**'
      - 'src/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: pip
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest nbval
      - name: Validate notebook outputs
        run: |
          pytest --nbval-lax \
            --sanitize-with=nbval_sanitize.cfg \
            --nbval-cell-timeout=120 \
            notebooks/reports/ \
            notebooks/validation/ \
            -v

What nbval Catches
nbval is best at catching:
- Silent numerical drift: A refactored function produces slightly different results. The notebook runs without error but the numbers changed.
- Broken pipelines: A data transformation that used to produce 10,000 rows now produces 9,847. The notebook runs, but the output is wrong.
- Library behavior changes: A library upgrade changes how values are formatted or rounded.
- Missing data: A data pipeline that reads from a source that changed schema or dropped records.
nbval does not catch:
- Logic errors that happen to produce the same output (these require correctness tests)
- Performance regressions (these require benchmarking)
- Errors in skipped cells: a cell marked NBVAL_SKIP is never executed, so a bug in it goes unnoticed
Combining nbval with nbmake
For a complete notebook test strategy, run both tools:
# First: verify all notebooks execute without error
pytest --nbmake notebooks/
<span class="hljs-comment"># Then: verify report notebooks produce expected outputs
pytest --nbval-lax notebooks/reports/

In CI, make the nbmake step a prerequisite for the nbval step, as in the sketch below. A notebook that fails execution will also fail nbval, but nbmake produces a cleaner error message for the execution failure case.
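One way to express that ordering in GitHub Actions is two sequential steps in the same job; steps run in order, so if the nbmake step fails, the nbval step never runs. A sketch, assuming nbmake is installed alongside pytest and nbval:

      - name: Notebooks execute without error (nbmake)
        run: pytest --nbmake notebooks/
      - name: Report notebooks produce expected outputs (nbval)
        run: pytest --nbval-lax --sanitize-with=nbval_sanitize.cfg notebooks/reports/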
Summary
nbval enforces reproducibility by comparing notebook outputs on every run. The setup process is: fix all sources of nondeterminism (random seeds, dates, sort orders), run the notebook to completion, commit the reference outputs. For floats and platform-dependent output, use a sanitize file with regex patterns. Run with --nbval-lax and mark the cells whose output matters with # NBVAL_CHECK_OUTPUT, or run strict --nbval to compare every cell. The result is a test suite that fails when your data pipeline produces different numbers—before users notice the discrepancy.