Reproducible Science: Validating Notebook Outputs with nbval
Reproducibility is the foundation of scientific computing. A notebook that produces a result should produce the same result when re-executed—same inputs, same code, same environment. nbval enforces this by re-running a notebook and comparing each cell's output to the recorded output in the .ipynb file. Any difference fails the test.
This makes nbval the right tool for analysis notebooks, report generators, and data validation pipelines where the outputs are the deliverable and drift is a defect.
Installing nbval
pip install pytest nbval

nbval works as a pytest plugin.
Basic Usage
Save a reference notebook with expected outputs (run it once to completion in Jupyter), then validate it with:
pytest --nbval notebooks/data-analysis.ipynb

nbval re-executes each cell and compares its output to the stored output. Any difference is a test failure:
notebooks/data-analysis.ipynb::Cell 4 PASSED
notebooks/data-analysis.ipynb::Cell 5 PASSED
notebooks/data-analysis.ipynb::Cell 6 FAILED
Cell 6:
Expected: "Accuracy: 0.924"
Got: "Accuracy: 0.919"This failure tells you that something in the code, data, or environment changed between when the notebook was last run and now.
The --nbval-lax Flag
Strict mode (--nbval) compares the output of every cell, so any platform-dependent fragment in the text (a memory address, a timestamp, a progress bar) can fail the run. For most use cases, --nbval-lax is more appropriate:

pytest --nbval-lax notebooks/

--nbval-lax still executes every cell and fails on execution errors, but it only compares outputs for cells marked with a # NBVAL_CHECK_OUTPUT comment (shown below); unmarked cells are checked for successful execution only.

Use --nbval-lax as the default, marking the cells whose output is the deliverable, and switch to --nbval when you want every cell's output compared.
Cell-Level Skip Directives
Some cells produce outputs that cannot be reproduced exactly: timestamps, random numbers without fixed seeds, progress bars, interactive widgets. Mark these cells so nbval either skips them or ignores their output:
Skip a cell entirely
Add a comment at the top of the cell:
# NBVAL_SKIP
import datetime
print(f"Report generated at: {datetime.datetime.now()}")nbval skips output comparison for this cell.
Ignore a cell's output
The cell executes (so it can produce side effects), but its output is not compared:
# NBVAL_IGNORE_OUTPUT
from tqdm import tqdm
for i in tqdm(range(100)):
    pass

Raise an error if output changes (explicit check)
# NBVAL_CHECK_OUTPUT
accuracy = evaluate_model(X_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")The NBVAL_CHECK_OUTPUT directive is the default—cells without any directive are checked. You only need this comment when you want to make the intent explicit.
Handling Float Tolerances
Floating-point outputs vary slightly across platforms, Python versions, and hardware. A model accuracy of 0.9231 might be 0.9232 on a different machine. Strict comparison would fail; you want comparison within a tolerance.
nbval has no built-in numeric tolerance, and it does not interpret stored cell outputs as regex patterns; comparison is plain text equality. (The # NBVAL_RAISES_EXCEPTION directive is unrelated to tolerances: it marks a cell that is expected to raise an exception.) What nbval does support is regex-based sanitizing: the stored output and the freshly generated output are both run through the same substitutions before they are compared, which lets you collapse values that are only "close enough" to a common placeholder. That mechanism is the sanitize file.
Sanitize Files for Regex Matching
A sanitize file defines regex substitutions that nbval applies to both the stored and the re-run output before comparing them. Create it at nbval_sanitize.cfg:
[floats]
# Replace floating point numbers with a fixed placeholder
regex: \d+\.\d{3,}
replace: FLOAT

[memory_addresses]
# Replace memory addresses
regex: 0x[0-9a-f]+
replace: 0xADDR

[timestamps]
# Replace datetime output
regex: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
replace: TIMESTAMP

The file is parsed as an INI file: each rule lives in its own section with a regex: key and a replace: key. Run with the sanitize file:
pytest --nbval-lax --sanitize-with=nbval_sanitize.cfg notebooks/

With this configuration, Accuracy: 0.9231 and Accuracy: 0.9229 are both normalized to Accuracy: FLOAT before comparison. The test checks that an accuracy value appears, not that it matches exactly.
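To make the normalization concrete, here is a small standalone sketch (plain Python, not nbval internals) of the substitute-then-compare idea the sanitize file implements:

import re

# The same substitutions the sanitize file describes, applied by hand.
RULES = [
    (r"\d+\.\d{3,}", "FLOAT"),    # floating point numbers
    (r"0x[0-9a-f]+", "0xADDR"),   # memory addresses
]

def normalize(text):
    # Run every substitution over the output text.
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

stored = "Accuracy: 0.9231"  # output saved in the .ipynb
fresh = "Accuracy: 0.9229"   # output from the re-run
assert normalize(stored) == normalize(fresh)  # both become "Accuracy: FLOAT"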
Reproducibility Patterns
For nbval to work, the notebook must produce deterministic output. Apply these patterns to every notebook that will be output-validated.
Fix random seeds
# Cell 1: Always the first cell in validated notebooks
import numpy as np
import random
import os
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
os.environ['PYTHONHASHSEED'] = str(RANDOM_SEED)
# For PyTorch
try:
    import torch
    torch.manual_seed(RANDOM_SEED)
    torch.backends.cudnn.deterministic = True
except ImportError:
    pass

# For TensorFlow/Keras
try:
    import tensorflow as tf
    tf.random.set_seed(RANDOM_SEED)
except ImportError:
    pass

Pin library versions
# NBVAL_IGNORE_OUTPUT
import pandas as pd
import numpy as np
import sklearn
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scikit-learn: {sklearn.__version__}")Use NBVAL_IGNORE_OUTPUT so version bumps don't fail the test—but record them for debugging.
Use fixed dates
# Instead of:
df = df[df['date'] >= pd.Timestamp.today() - pd.Timedelta(days=30)]
# Use:
REFERENCE_DATE = pd.Timestamp('2026-01-01')
df = df[df['date'] >= REFERENCE_DATE - pd.Timedelta(days=30)]

Fix data ordering
# Sort outputs so comparison is order-independent
df_result = df_result.sort_values('id').reset_index(drop=True)

Creating Reference Notebooks
Before running nbval, you must create a reference notebook—a fully executed notebook with the expected outputs recorded. The workflow:
- Set all random seeds, fix dates, and sort outputs in the notebook
- Kernel → Restart and Run All in Jupyter
- Verify the outputs look correct
- Save the notebook (File → Save)
- Commit the notebook including outputs to version control
The .ipynb file now contains the reference outputs. nbval re-runs the notebook and compares to these stored values.
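If you prefer the command line to the Jupyter menu, the same run-to-completion step can be done with nbconvert (assuming Jupyter is installed in the same environment as the notebook's dependencies):

jupyter nbconvert --to notebook --execute --inplace notebooks/reports/monthly-summary.ipynb

The --inplace flag overwrites the notebook file with its freshly executed outputs, which is exactly what a reference notebook needs.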
Updating reference outputs: When you intentionally change a computation, re-run the notebook in Jupyter (or with nbconvert as above), verify the new outputs are correct, and commit the updated .ipynb. nbval has no flag to regenerate stored outputs automatically; re-executing and saving the notebook is the update step, and it forces you to look at the new outputs before they become the reference.
pytest Configuration
# pytest.ini (use a [tool.pytest.ini_options] section in pyproject.toml)
[pytest]
addopts = --nbval-lax
testpaths = notebooks/reports notebooks/validation

Run all validated notebooks:
pytest

Run a specific notebook:
pytest notebooks/reports/monthly-summary.ipynb

CI Integration
# .github/workflows/notebook-validation.yml
name: Notebook Output Validation

on:
  push:
    paths:
      - 'notebooks/reports/**'
      - 'notebooks/validation/**'
      - 'src/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: pip
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest nbval
      - name: Validate notebook outputs
        run: |
          pytest --nbval-lax \
            --sanitize-with=nbval_sanitize.cfg \
            --nbval-cell-timeout=120 \
            notebooks/reports/ \
            notebooks/validation/ \
            -v

What nbval Catches
nbval is best at catching:
- Silent numerical drift: A refactored function produces slightly different results. The notebook runs without error but the numbers changed.
- Broken pipelines: A data transformation that used to produce 10,000 rows now produces 9,847. The notebook runs, but the output is wrong.
- Library behavior changes: A library upgrade changes how values are formatted or rounded.
- Missing data: A data pipeline that reads from a source that changed schema or dropped records.
nbval does not catch:
- Logic errors that happen to produce the same output (these require correctness tests)
- Performance regressions (these require benchmarking)
- Errors in skipped cells: a cell marked NBVAL_SKIP is never executed, so a bug in it goes unnoticed
Combining nbval with nbmake
For a complete notebook test strategy, run both tools:
# First: verify all notebooks execute without error
pytest --nbmake notebooks/
<span class="hljs-comment"># Then: verify report notebooks produce expected outputs
pytest --nbval-lax notebooks/reports/

In CI, make the nbmake step a prerequisite for the nbval step, as in the sketch below. A notebook that fails execution will also fail nbval, but nbmake produces a cleaner error message for the execution failure case.
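One way to express that ordering in GitHub Actions is two sequential steps in the same job; steps run in order, so if the nbmake step fails, the nbval step never runs. A sketch, assuming nbmake is installed alongside pytest and nbval:

      - name: Notebooks execute without error (nbmake)
        run: pytest --nbmake notebooks/
      - name: Report notebooks produce expected outputs (nbval)
        run: pytest --nbval-lax --sanitize-with=nbval_sanitize.cfg notebooks/reports/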
Summary
nbval enforces reproducibility by comparing notebook outputs on every run. The setup process is: fix all sources of nondeterminism (random seeds, dates, sort orders), run the notebook to completion, commit the reference outputs. For floats and platform-dependent output, use a sanitize file with regex patterns. Run with --nbval-lax and mark the cells whose output matters with # NBVAL_CHECK_OUTPUT, or run strict --nbval to compare every cell. The result is a test suite that fails when your data pipeline produces different numbers—before users notice the discrepancy.