Mutation Score vs Code Coverage: Which Metric Actually Measures Test Quality?
Teams routinely hit 90% code coverage targets while shipping code with significant test gaps. The coverage metric is satisfied. The tests are weak. Mutation score reveals what coverage hides — but it comes with its own blind spots.
This post is a direct comparison: what each metric measures, what each misses, and how to use them together.
What Code Coverage Measures
Code coverage (line, branch, statement) answers: which parts of the code executed during tests?
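
    def calculate_tax(income: float, married: bool) -> float:
        if income > 100_000:
            rate = 0.37 if not married else 0.32
        elif income > 40_000:
            rate = 0.22
        else:
            rate = 0.12
        return income * rate

A single test that calls calculate_tax(120_000, False) gets you to:
- Roughly 57% line coverage (4 of 7 executable lines, as a line-based tool such as coverage.py would count them; the 0.22 and 0.12 brackets never run)
- Even lower branch coverage: only the true side of the first if is taken, the elif is never evaluated, and the married 0.32 arm shares a line with 0.37, so line coverage cannot show that it was skipped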
The coverage number tells you which code paths haven't been exercised at all. That's useful — uncovered code is definitely undertested.
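To make "which lines executed" concrete, here is a minimal sketch built on Python's standard sys.settrace hook. The helper name executed_lines is invented for this illustration, and real tools such as coverage.py are far more careful. It only shows that the single and married calls touch exactly the same lines, so a line-based report cannot tell them apart.

```python
import sys


def calculate_tax(income: float, married: bool) -> float:
    if income > 100_000:
        rate = 0.37 if not married else 0.32
    elif income > 40_000:
        rate = 0.22
    else:
        rate = 0.12
    return income * rate


def executed_lines(func, *args):
    """Return the lines of func that ran, as offsets from its def line.

    Hypothetical helper for illustration; not part of any coverage tool.
    """
    code = func.__code__
    hits = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore every other function
        if event == "line":
            hits.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return hits


single = executed_lines(calculate_tax, 120_000, False)
married = executed_lines(calculate_tax, 120_000, True)
low = executed_lines(calculate_tax, 30_000, False)

print("single :", sorted(single))
print("married:", sorted(married))
print("same lines for single and married:", single == married)
```

The offsets are relative to the def line, so 2 is the conditional-expression line and 4 and 6 are the two lower brackets.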
But covered code is not necessarily well-tested.
What Code Coverage Misses
Consider this test, which covers every line on the high-income path:

    def test_high_income_single():
        result = calculate_tax(120_000, False)
        assert result is not None  # this assertion is nearly useless

Every line on that path runs, and the assertion is true for any number. If you changed 0.37 to 0.03, this test would still pass. If you inverted the condition so married people pay more, this test would still pass.
Coverage tells you the code ran. It says nothing about whether the test actually verifies correct behavior.
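A quick way to convince yourself is to plant both bugs by hand and see which assertion notices. The function calculate_tax_buggy below is a deliberately broken copy written for this sketch, not code from a real project.

```python
def calculate_tax_buggy(income: float, married: bool) -> float:
    # Two planted bugs: the top single rate is 0.03 instead of 0.37,
    # and the married check is inverted.
    if income > 100_000:
        rate = 0.03 if married else 0.32
    elif income > 40_000:
        rate = 0.22
    else:
        rate = 0.12
    return income * rate


result = calculate_tax_buggy(120_000, False)

weak_check_passes = result is not None                  # the "nearly useless" assertion
precise_check_passes = abs(result - 44_400.0) < 0.01    # 120_000 * 0.37

print("weak check passes:   ", weak_check_passes)
print("precise check passes:", precise_check_passes)
```

The weak check accepts the broken function; only the comparison against the expected amount rejects it.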
What Mutation Score Measures
Mutation score answers: if I introduced a bug, would my tests catch it?
A mutation testing tool would apply changes to calculate_tax and run the test:
- Change 0.37 to 0.38 → would the test fail? No. Mutation survives.
- Change income > 100_000 to income >= 100_000 → would the test fail? No. Mutation survives.
- Change not married to married → would the test fail? No. Mutation survives.
- Delete return income * rate → would the test fail? Yes. The function now returns None, which is exactly what the is-not-None assertion rejects. Mutation killed.
The test covers every line on the path it exercises, yet it kills only the crudest mutation.
A test with precise assertions kills all of these:

    import pytest

    def test_high_income_single():
        result = calculate_tax(120_000, False)
        assert result == pytest.approx(44_400.0)  # 120_000 * 0.37

    def test_high_income_married():
        result = calculate_tax(120_000, True)
        assert result == pytest.approx(38_400.0)  # 120_000 * 0.32

    def test_boundary_100k():
        # Exactly 100_000 is still in the 0.22 bracket; a >= mutant would return 37_000
        assert calculate_tax(100_000, False) == pytest.approx(22_000.0)

    def test_middle_bracket():
        result = calculate_tax(60_000, False)
        assert result == pytest.approx(13_200.0)  # 60_000 * 0.22

    def test_low_income():
        result = calculate_tax(30_000, False)
        assert result == pytest.approx(3_600.0)  # 30_000 * 0.12

Now line coverage reaches 100%, and mutation score approaches 100%.
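The comparison can be reproduced with a toy harness. This is a minimal sketch, assuming mutants are produced by plain text replacement on the function source; real tools mutate the syntax tree and generate many more mutants. The names MUTATIONS, build, weak_suite, precise_suite and mutation_score are all invented for the example.

```python
import math

SOURCE = '''
def calculate_tax(income, married):
    if income > 100_000:
        rate = 0.37 if not married else 0.32
    elif income > 40_000:
        rate = 0.22
    else:
        rate = 0.12
    return income * rate
'''

# (original text, mutated text); each original appears exactly once in SOURCE
MUTATIONS = [
    ("0.37", "0.38"),
    ("income > 100_000", "income >= 100_000"),
    ("not married", "married"),
    ("return income * rate", "return None"),
]


def build(source):
    """Compile a source string and return its calculate_tax function."""
    namespace = {}
    exec(source, namespace)
    return namespace["calculate_tax"]


def weak_suite(tax):
    assert tax(120_000, False) is not None


def precise_suite(tax):
    assert math.isclose(tax(120_000, False), 44_400.0)
    assert math.isclose(tax(120_000, True), 38_400.0)
    assert math.isclose(tax(100_000, False), 22_000.0)  # boundary stays in 0.22
    assert math.isclose(tax(60_000, False), 13_200.0)
    assert math.isclose(tax(30_000, False), 3_600.0)


def mutation_score(suite):
    """Fraction of mutants for which the suite raises (fails)."""
    killed = 0
    for original, mutated in MUTATIONS:
        mutant = build(SOURCE.replace(original, mutated))
        try:
            suite(mutant)
        except Exception:  # a failed assertion or a crash both count as a kill
            killed += 1
    return killed / len(MUTATIONS)


# Sanity check: both suites pass against the unmodified function
weak_suite(build(SOURCE))
precise_suite(build(SOURCE))

weak_score = mutation_score(weak_suite)
precise_score = mutation_score(precise_suite)
print(f"weak suite:    {weak_score:.0%}")
print(f"precise suite: {precise_score:.0%}")
```

With only these four hand-picked mutants, the weak suite catches just the mutant that removes the return value, while the precise suite catches all four.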
The Relationship Between the Two Metrics
Coverage and mutation score are correlated but not equivalent:
| Coverage | Mutation Score | What it means |
|---|---|---|
| Low | Low | Undertested — obvious gaps |
| High | Low | Covered but not verified — tests exist but don't assert |
| Low | High | Rare — tests that are very precise but miss code paths |
| High | High | Well-tested |
Low coverage almost always implies low mutation score (you can't kill a mutation on a line no test reaches). High coverage doesn't imply high mutation score — you can execute every line with useless assertions.
The goal is both. High coverage ensures you exercise all paths. High mutation score ensures those tests actually verify behavior.
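The table can be written as a small lookup. This is only a sketch: the function name diagnose and the single 80% cutoff for "high" are assumptions made for the example, not values the metrics themselves define.

```python
def diagnose(coverage: float, mutation_score: float, high: float = 80.0) -> str:
    """Map a (coverage, mutation score) pair, in percent, to a table row."""
    covered = coverage >= high
    verified = mutation_score >= high
    if covered and verified:
        return "well-tested"
    if covered:
        return "covered but not verified"
    if verified:
        return "precise tests that miss code paths (rare)"
    return "undertested"


print(diagnose(coverage=92, mutation_score=35))
print(diagnose(coverage=95, mutation_score=88))
```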
Where Mutation Score Falls Short
Equivalent mutations: Some code changes don't change behavior. A tool that changes i++ to ++i in a loop header produces equivalent code. If this mutation survives, it's not a test gap — it's a false positive. Good tools filter these, but not perfectly.
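The i++ example comes from C-style languages. A Python analogue, chosen here purely for illustration, is a comparison mutant in a "pick the smaller value" helper: when the two arguments are equal, either arm returns the same value, so no test can tell the mutant from the original.

```python
def smaller(a: int, b: int) -> int:
    return a if a < b else b


def smaller_mutant(a: int, b: int) -> int:
    # Mutation: < became <=. When a == b both arms return the same value.
    return a if a <= b else b


values = range(-5, 6)
is_equivalent_on_sample = all(
    smaller(a, b) == smaller_mutant(a, b) for a in values for b in values
)
print("mutant indistinguishable on the sample:", is_equivalent_on_sample)
```

A surviving mutant like this one lowers the score without pointing at a missing test.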
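Computational cost: Run naively, 500 mutations mean 500 runs of your full test suite, so roughly 500x the time. Most tools cut this down by running only the tests that cover the mutated line and stopping at the first failure, but for large projects it is still hours, not minutes.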
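To put rough numbers on that, here is a back-of-the-envelope estimate. The 60-second suite and the covering_fraction parameter (the average share of suite time spent in tests that reach a given mutated line) are assumptions for this sketch, not measurements.

```python
def naive_hours(num_mutants: int, suite_seconds: float) -> float:
    """Worst case: the whole suite runs once per mutant."""
    return num_mutants * suite_seconds / 3600


def selective_hours(num_mutants: int, suite_seconds: float,
                    covering_fraction: float) -> float:
    """Assumes each mutant only triggers the tests that cover its line."""
    return naive_hours(num_mutants, suite_seconds) * covering_fraction


print(f"naive:     {naive_hours(500, 60):.1f} h")
print(f"selective: {selective_hours(500, 60, 0.1):.1f} h")
```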
Not a complete picture: Mutation testing is still automated analysis. It can't test whether your system behaves correctly under real conditions — it can only test whether your existing tests would detect specific small changes. Tests that are systematically wrong (all tests assert the wrong expected value) will have a high mutation score but low correctness.
Scope limitations: Mutation testing is most effective on pure logic. It's less useful for I/O code, UI code, and integration tests where behavioral equivalence is harder to establish.
Using Both Metrics Together
The most useful signal combines both:

    High coverage + High mutation score = Well-tested code
    High coverage + Low mutation score  = Tests exist but don't verify
    Low coverage  + Any mutation score  = Uncovered paths

Practical workflow:
- Use coverage to find completely untested code paths — write tests to cover them
- Use mutation score on your critical modules to find tests that exist but don't assert correctly
- Fix both — write tests that cover paths AND make precise assertions
Quality gate example:
Jest (jest.config.js):

    module.exports = {
      coverageThreshold: {
        global: { lines: 80, branches: 75 },
      },
    };

Stryker (stryker.conf.json):

    {
      "thresholds": { "high": 80, "low": 60, "break": 50 }
    }

Both gates must pass. Coverage ensures breadth. Mutation score ensures depth.
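How the two gates combine can be sketched in a few lines. This assumes the usual reading of Stryker-style thresholds: high and low only color the report, while a score below break fails the run. The function names mutation_band and passes_quality_gate are invented for the example; in practice Jest and Stryker enforce their own thresholds.

```python
def mutation_band(score: float, high: float = 80, low: float = 60,
                  break_at: float = 50) -> str:
    """Classify a mutation score (percent) into a reporting band."""
    if score < break_at:
        return "fail"      # below break: the build is rejected
    if score < low:
        return "danger"
    if score < high:
        return "warning"
    return "ok"


def passes_quality_gate(lines: float, branches: float,
                        mutation_score: float) -> bool:
    """Both gates must pass: coverage for breadth, mutation score for depth."""
    coverage_ok = lines >= 80 and branches >= 75
    mutation_ok = mutation_band(mutation_score) != "fail"
    return coverage_ok and mutation_ok


print(passes_quality_gate(lines=92, branches=80, mutation_score=45))
print(passes_quality_gate(lines=92, branches=80, mutation_score=55))
```

The first call fails on the mutation gate despite comfortable coverage, which is the "covered but not verified" case.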
Practical Thresholds
What score to target depends on the code:
Core business logic (pricing, auth, validation, state machines):
- Coverage: 90%+ lines, 85%+ branches
- Mutation score: 80%+
Service/application layer (orchestration, API handlers):
- Coverage: 80%+ lines
- Mutation score: 70%+
Infrastructure code (config, migrations, seed data):
- Coverage: basic smoke test
- Mutation score: not worth running
UI components:
- Coverage: verify key behaviors
- Mutation score: usually not worth the cost
Starting from a Low Score
If your current mutation score is 30% and you want to get to 70%, the path is:
- Run mutation testing on your most critical 300–500 lines of code
- Sort survived mutations by impact — focus on critical and high-impact mutations first
- Write tests that kill the highest-value survived mutations
- Re-run to confirm kills
- Set a threshold at your current score + 5% to prevent regression
- Raise the threshold incrementally as you improve
Don't try to jump from 30% to 70% in one sprint. Pick the survived mutations in your most risky code, kill those, and build from there.
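The prioritize-then-ratchet loop might look like the sketch below. The Survivor record, its 1 to 3 impact ranking, and the reading of "+ 5%" as five percentage points capped at the target are all assumptions made for illustration. Mutation tools report survivors in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Survivor:
    location: str      # file and line of the surviving mutation
    description: str   # what the tool changed
    impact: int        # hypothetical team-assigned rank, 3 = most critical


def prioritize(survivors: list[Survivor]) -> list[Survivor]:
    """Highest-impact survivors first."""
    return sorted(survivors, key=lambda s: s.impact, reverse=True)


def next_threshold(current_score: float, step: float = 5.0,
                   target: float = 70.0) -> float:
    """Raise the gate a little above the current score, never past the target."""
    return min(current_score + step, target)


# Made-up survivors for the example
survivors = [
    Survivor("pricing.py:42", "> changed to >=", impact=3),
    Survivor("logging.py:10", "string literal emptied", impact=1),
    Survivor("auth.py:88", "not removed", impact=3),
    Survivor("report.py:17", "+ changed to -", impact=2),
]

ordered = prioritize(survivors)
for s in ordered:
    print(s.impact, s.location, s.description)

print(next_threshold(30.0), next_threshold(68.0))
```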
The Right Mental Model
Code coverage measures code exploration. Mutation score measures assertion quality.
You need both, but if you're choosing where to invest in test improvement first:
- Uncovered critical paths: fix with coverage-focused testing
- Covered but weakly asserted critical paths: fix with mutation-score-focused testing
A team that optimizes for coverage tends toward tests that execute every line with minimal assertions. A team that optimizes for mutation score tends toward tests that precisely verify behavior on the paths they cover. The second team has fewer tests but better ones.
The ideal is specific tests that cover all meaningful paths and make precise assertions at every step. Mutation testing is the forcing function that reveals where that bar isn't being met.