How to Test Jupyter Notebooks: nbmake, testbook, and nbval
Jupyter notebooks are the primary artifact for data science work—model training, data exploration, report generation. Yet most teams treat them as write-only documents. Notebooks break silently when dependencies update, when data shapes change, or when code paths that ran during development are skipped during the final run. Testing notebooks is not complicated; it just requires knowing which tool fits which use case.
This guide covers the three main tools for notebook testing and when to reach for each.
The Problem with Untested Notebooks
A notebook that passes a visual review can still fail in production because:
- Cell execution order: Notebooks developed interactively often contain cells that depend on state created by cells below them. They work in development because variables persist in the kernel, but fail when run top-to-bottom in CI.
- Hidden state: Variables defined in deleted cells still exist in the kernel. The notebook runs fine until the kernel is restarted.
- Output drift: A model evaluation notebook records accuracy of 94%. Six months later, the same notebook runs and produces 87%. Nobody notices because no test checks the output.
- Broken imports: A library update changes an API. The notebook fails on the second cell, but the error is buried in CI logs.
The Three Tools
nbmake: Execution Testing
nbmake is a pytest plugin that discovers and executes notebooks. It verifies that every cell executes without raising an exception. This is the lowest bar—and the most important one to meet first.
```bash
pip install pytest nbmake
pytest --nbmake notebooks/
```

nbmake treats each notebook as a test case: a notebook passes if it executes from top to bottom without error, and fails if any cell raises an exception.
Use nbmake when:
- You want to verify notebooks are not broken by dependency updates
- You want to enforce top-to-bottom executability
- You're starting from zero notebook testing
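Because nbmake runs on top of pytest, large suites can be parallelized with pytest-xdist, which nbmake's documentation supports directly; a minimal sketch:

```bash
pip install pytest-xdist
# Run notebooks in parallel, one pytest worker per CPU core.
pytest --nbmake -n auto notebooks/
```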
testbook: Unit Testing
testbook lets you write pytest tests that inject code into specific cells of a running notebook kernel, call functions defined in the notebook, and assert the results. It's notebook testing with the same granularity as function-level unit tests.
```bash
pip install testbook
```

Use testbook when:
- The notebook contains functions you want to test with different inputs
- You want to test error handling without running the full notebook
- You need to mock external dependencies (databases, APIs)
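A minimal sketch of a testbook test, assuming a notebook `notebooks/analysis.ipynb` that defines a function `normalize` (both names are hypothetical):

```python
# tests/test_analysis_notebook.py
from testbook import testbook

# execute=True runs the whole notebook once before the test body executes.
@testbook("notebooks/analysis.ipynb", execute=True)
def test_normalize(tb):
    # Get a reference to the function defined inside the notebook kernel.
    normalize = tb.ref("normalize")
    # Arguments and return values cross the kernel boundary,
    # so they must be JSON-serializable.
    assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]
```

tb.inject() can similarly run arbitrary code in the kernel before your assertions, which is one way to patch out database or API clients.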
nbval: Output Validation
nbval reruns a notebook and compares the output of each cell to the recorded output in the .ipynb file. If the output changes, the test fails.
```bash
pip install pytest nbval
pytest --nbval notebooks/data-validation.ipynb
```

Use nbval when:
- The notebook is a report or data validation step where output consistency matters
- You want to detect regressions in model output or computed values
- You have deterministic computations that should always produce the same result
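Outputs that contain inherently volatile fragments (timestamps, object addresses) can still be validated by normalizing them with a sanitization file, passed to nbval via --sanitize-with. A minimal sketch, where the regex is an assumption about what your outputs contain:

```ini
# sanitize.cfg -- use with: pytest --nbval notebooks/report.ipynb --sanitize-with sanitize.cfg
[regex1]
regex: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
replace: TIMESTAMP
```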
CI Integration
All three tools work with standard pytest and integrate into CI the same way:
```yaml
# .github/workflows/notebook-tests.yml
name: Notebook Tests
on:
  push:
    paths:
      - 'notebooks/**'
      - 'src/**'
      - 'requirements.txt'
jobs:
  test-notebooks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt pytest nbmake testbook nbval
      - name: Test notebook execution
        run: pytest --nbmake notebooks/
      - name: Validate notebook outputs
        run: pytest --nbval-lax notebooks/reports/
```

The --nbval-lax flag still fails the run if any cell raises an error, but it only compares outputs for cells explicitly marked with a `# NBVAL_CHECK_OUTPUT` comment, so cells with volatile output don't produce spurious failures.
Choosing the Right Tool
| Goal | Tool |
|---|---|
| Notebooks execute without error | nbmake |
| Functions produce correct output | testbook |
| Cell outputs haven't changed | nbval |
| All of the above | Use all three |
In practice, most teams start with nbmake (cheapest to set up), add nbval to report notebooks where output consistency is business-critical, and use testbook for notebooks that contain non-trivial data processing logic.
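A layered setup then looks something like this (the paths are illustrative):

```bash
pytest --nbmake notebooks/               # every notebook must execute cleanly
pytest --nbval-lax notebooks/reports/    # report outputs must not drift
pytest tests/notebooks/                  # testbook unit tests for notebook functions
```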
Common Pitfalls
Hardcoded file paths: Notebooks often contain paths like /home/username/data/. In CI, these paths don't exist. Note that `__file__` is not defined inside a notebook kernel, so `pathlib.Path(__file__).parent` only works in imported modules; for notebooks, prefer environment variables or paths relative to the working directory.
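One pattern that works in both environments is reading the data location from an environment variable with a local fallback; a sketch (variable and file names are hypothetical):

```python
import os
from pathlib import Path

# CI sets DATA_DIR explicitly; locally we fall back to ./data
# relative to the directory the notebook is launched from.
DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))
sales = DATA_DIR / "sales.csv"
```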
GPU dependencies: Notebooks developed on GPU machines fail in CI because the GPU is not available. Mock GPU calls or run notebook tests on GPU runners if the computation is essential to test.
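For PyTorch notebooks (an assumption; other frameworks have equivalents), a device fallback in the first cell keeps the notebook executable on CPU-only runners:

```python
import torch

# Use the GPU when present, otherwise fall back to CPU so CI can still run.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(8, 16, device=device)  # all later tensors/models should use `device`
```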
Random seed inconsistency: Machine learning notebooks that use random operations need a fixed seed to produce deterministic output for nbval. Always call np.random.seed() and random.seed() at the top of notebooks that will be output-validated.
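In practice that means a cell like this at the very top of the notebook:

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# If the notebook uses a framework with its own RNG
# (e.g. torch.manual_seed), seed that here as well.
```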
Timeouts: Long-running notebooks (e.g. model training) should not be part of the CI notebook test suite. Extract the training logic to a separate Python module and test that instead, keeping only the notebook's data loading and evaluation cells under test.
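Slow notebooks can be excluded by path, and a per-cell timeout catches anything that slips through; a sketch assuming a hypothetical notebooks/training/ directory:

```bash
# Cap each cell at 5 minutes and skip the training notebooks entirely.
pytest --nbmake --nbmake-timeout=300 notebooks/ --ignore=notebooks/training
```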
Summary
Three tools, three jobs: nbmake catches broken notebooks, testbook enables unit testing inside notebooks, and nbval guards against output drift. Start with nbmake for execution testing in CI, and add the other tools where the notebook's role in your pipeline demands higher confidence.