Mutation Score vs Code Coverage: Which Metric Actually Measures Test Quality?

HelpMeTest

24 May 2026 — 4 min read

Teams routinely hit 90% code coverage targets while shipping code with significant test gaps. The coverage metric is satisfied. The tests are weak. Mutation score reveals what coverage hides — but it comes with its own blind spots.

This post is a direct comparison: what each metric measures, what each misses, and how to use them together.

What Code Coverage Measures

Code coverage (line, branch, statement) answers: which parts of the code executed during tests?

def calculate_tax(income: float, married: bool) -> float:
    if income > 100_000:
        rate = 0.37 if not married else 0.32
    elif income > 40_000:
        rate = 0.22
    else:
        rate = 0.12
    return income * rate

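A test that calls calculate_tax(120_000, False) runs only the first bracket:

Line coverage: the if check, the first rate assignment, and the return run; the elif and else bodies never do. That is roughly 4 of 7 executable lines (about 57%), with the exact figure depending on how your tool counts def and else lines.
Branch coverage: 1 of the 4 if/elif outcomes (25%). The untested 0.32 alternative shares a line with 0.37, so a line-based tool reports that line as fully covered.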

The coverage number tells you which code paths haven't been exercised at all. That's useful — uncovered code is definitely undertested.
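To make "which lines executed" concrete, here is a minimal sketch of line tracing built on Python's sys.settrace. The executed_lines helper is a hypothetical name for illustration; real coverage tools do far more (branch tracking, reporting, exclusions), so treat this as a toy model of the idea rather than how any particular tool works.

```python
import sys


def calculate_tax(income: float, married: bool) -> float:
    if income > 100_000:
        rate = 0.37 if not married else 0.32
    elif income > 40_000:
        rate = 0.22
    else:
        rate = 0.12
    return income * rate


def executed_lines(func, *args):
    """Return line offsets (relative to the def line) that ran during func(*args)."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test
        if frame.f_code is code and event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return sorted(hit)


# Offsets: 1 = if, 2 = first rate assignment, 7 = return
print(executed_lines(calculate_tax, 120_000, False))  # [1, 2, 7]
```

Note that offset 2 is reported as executed even though only the 0.37 side of the conditional expression ran. That is the kind of gap a line-based number cannot show.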

But covered code is not necessarily well-tested.

What Code Coverage Misses

Consider this test, which executes every line on the high-income path:

def test_high_income_single():
    result = calculate_tax(120_000, False)
    assert result is not None  # this assertion is nearly useless

Every line on that path runs. The assertion is true for any number. If you changed 0.37 to 0.03, this test would still pass. If you swapped the two rates so that married people pay more, this test would still pass.

Coverage tells you the code ran. It says nothing about whether the test actually verifies correct behavior.

What Mutation Score Measures

Mutation score answers: if I introduced a bug, would my tests catch it?

A mutation testing tool would apply changes to calculate_tax and run the test:

Change 0.37 to 0.38 → would the test fail? No. Mutation survives.
Change income > 100_000 to income >= 100_000 → would the test fail? No. Mutation survives.
Change not married to married → would the test fail? No. Mutation survives.
Delete return income * rate → would the test fail? Yes. The function now returns None, so assert result is not None fails. Mutation killed, but a missing return value is the only kind of bug this assertion can catch.

The test covers every line on the high-income path and kills only one of these four mutations.

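Tests with precise assertions kill all of these:

import pytest

# calculate_tax is assumed to be importable from the module under test

def test_high_income_single():
    result = calculate_tax(120_000, False)
    assert result == pytest.approx(44_400.0)  # 120_000 * 0.37

def test_high_income_married():
    result = calculate_tax(120_000, True)
    assert result == pytest.approx(38_400.0)  # 120_000 * 0.32

def test_boundary_100k():
    # Exactly 100_000 must stay in the 22% bracket; this kills the > to >= mutation.
    # Comparing the two sides with != would not: 100_000 and 100_001 differ under both versions.
    assert calculate_tax(100_000, False) == pytest.approx(22_000.0)
    assert calculate_tax(100_001, False) == pytest.approx(37_000.37)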

def test_middle_bracket():
    result = calculate_tax(60_000, False)
    assert result == pytest.approx(13_200.0)  # 60_000 * 0.22

def test_low_income():
    result = calculate_tax(30_000, False)
    assert result == pytest.approx(3_600.0)   # 30_000 * 0.12

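Now line coverage is 100%, and mutation score approaches 100%. The 40_000 boundary would need the same kind of exact-value test as the 100_000 one to kill the remaining comparison mutations.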
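The difference between the two styles of test can be checked with a hand-rolled mutation run. This is only a sketch: the mutants are produced by plain text replacement on the source, and the names MUTATIONS, load, kills, and score are made up for illustration. Real tools mutate the syntax tree and generate many more mutants. The boundary check pins the exact value at 100_000, because a != comparison between the two sides of the boundary would not notice a switch to >=.

```python
SOURCE = """
def calculate_tax(income, married):
    if income > 100_000:
        rate = 0.37 if not married else 0.32
    elif income > 40_000:
        rate = 0.22
    else:
        rate = 0.12
    return income * rate
"""

# (description, text to find, replacement) for the value and operator mutations above
MUTATIONS = [
    ("0.37 -> 0.38", "0.37", "0.38"),
    ("> 100_000 -> >= 100_000", "income > 100_000", "income >= 100_000"),
    ("not married -> married", "if not married", "if married"),
]


def load(source):
    """Compile a source string and return its calculate_tax function."""
    namespace = {}
    exec(source, namespace)
    return namespace["calculate_tax"]


def close(a, b):
    return abs(a - b) < 1e-6


def weak_suite(tax):
    assert tax(120_000, False) is not None


def precise_suite(tax):
    assert close(tax(120_000, False), 44_400.0)
    assert close(tax(120_000, True), 38_400.0)
    assert close(tax(100_000, False), 22_000.0)  # boundary stays in the 22% bracket


def kills(suite, mutant):
    """A mutant is killed when the suite fails (or errors) against it."""
    try:
        suite(mutant)
    except Exception:
        return True
    return False


def score(suite):
    mutants = [load(SOURCE.replace(old, new)) for _, old, new in MUTATIONS]
    return sum(kills(suite, m) for m in mutants), len(mutants)


# Both suites pass against the unmodified function
weak_suite(load(SOURCE))
precise_suite(load(SOURCE))

print(score(weak_suite))     # (0, 3)
print(score(precise_suite))  # (3, 3)
```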

The Relationship Between the Two Metrics

Coverage and mutation score are correlated but not equivalent:

Coverage	Mutation Score	What it means
Low	Low	Undertested — obvious gaps
High	Low	Covered but not verified — tests exist but don't assert
Low	High	Rare — tests that are very precise but miss code paths
High	High	Well-tested

Low coverage almost always implies low mutation score (you can't kill a mutation on a line no test reaches). High coverage doesn't imply high mutation score — you can execute every line with useless assertions.
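The arithmetic behind that claim is simple. The sketch below uses hypothetical counts and two hypothetical helper functions; tools differ in how they treat timeouts, errors, and ignored mutants, so the exact formula in your tool may vary.

```python
def mutation_score(killed: int, survived: int, no_coverage: int) -> float:
    """Percentage of all mutants detected; mutants on uncovered lines count as undetected."""
    total = killed + survived + no_coverage
    return 100.0 * killed / total if total else 0.0


def covered_only_score(killed: int, survived: int) -> float:
    """Same ratio, but ignoring mutants that no test ever reached."""
    covered = killed + survived
    return 100.0 * killed / covered if covered else 0.0


# Hypothetical module: 100 mutants, but tests reach the lines of only 30 of them
print(mutation_score(killed=27, survived=3, no_coverage=70))  # 27.0
print(covered_only_score(killed=27, survived=3))              # 90.0
```

The "low coverage, high mutation score" row of the table only shows up when the score is computed over covered code alone, as in the second function.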

The goal is both. High coverage ensures you exercise all paths. High mutation score ensures those tests actually verify behavior.

Where Mutation Score Falls Short

Equivalent mutations: Some code changes don't change behavior. A tool that changes i++ to ++i in a loop header produces equivalent code. If this mutation survives, it's not a test gap — it's a false positive. Good tools filter these, but not perfectly.
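A Python analogue of an equivalent mutation, using a hypothetical helper: len() never returns a negative number, so changing == 0 to <= 0 cannot change behavior, and no test can kill it.

```python
def first_or_default(items, default=None):
    if len(items) == 0:
        return default
    return items[0]


def first_or_default_mutant(items, default=None):
    if len(items) <= 0:  # mutated from ==; len() is never negative, so nothing changes
        return default
    return items[0]


# The two versions agree on every input we try, including the empty list
for n in range(4):
    sample = list(range(n))
    print(first_or_default(sample, "empty") == first_or_default_mutant(sample, "empty"))
```

A survivor like this lowers the reported score without pointing at a missing test, which is why surviving mutants need a human look before they become a to-do list.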

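Computational cost: In the worst case, running your full test suite against 500 mutations takes about 500 times as long as a single run. Tools cut this down by running only the tests that reach the mutated line and by stopping at the first failure, but for large projects it is still often hours, not minutes.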
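A back-of-the-envelope estimate makes the cost visible. The suite duration, test counts, and per-test timing below are invented numbers, and both helpers are hypothetical.

```python
def worst_case_hours(mutants: int, suite_seconds: float) -> float:
    """Every mutant triggers a full suite run."""
    return mutants * suite_seconds / 3600


def selective_hours(mutants: int, tests_per_mutant: int, seconds_per_test: float) -> float:
    """Each mutant runs only the tests that reach the mutated line."""
    return mutants * tests_per_mutant * seconds_per_test / 3600


print(worst_case_hours(500, 90))                 # 12.5
print(round(selective_hours(500, 20, 0.5), 2))   # 1.39
```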

Not a complete picture: Mutation testing is still automated analysis. It can't test whether your system behaves correctly under real conditions — it can only test whether your existing tests would detect specific small changes. Tests that are systematically wrong (all tests assert the wrong expected value) will have a high mutation score but low correctness.
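A small illustration of a systematically wrong test, with a hypothetical discount function. The expected value was copied from the buggy output, so the check passes on the bug and still kills mutants of it.

```python
def apply_discount(price: float, percent: float) -> float:
    return price * (1 - percent / 10)  # bug: should be percent / 100


def apply_discount_mutant(price: float, percent: float) -> float:
    return price * (1 + percent / 10)  # mutant: - changed to +


def check_discount(fn) -> bool:
    """The 'test': a 5% discount on 200 should give 190, but it expects 100."""
    return fn(200.0, 5) == 100.0


print(check_discount(apply_discount))         # True: passes against the buggy code
print(check_discount(apply_discount_mutant))  # False: the mutant is killed
```

The mutation report for this function looks healthy, and the function is still wrong.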

Scope limitations: Mutation testing is most effective on pure logic. It's less useful for I/O code, UI code, and integration tests where behavioral equivalence is harder to establish.

Using Both Metrics Together

The most useful signal combines both:

High coverage + High mutation score = Well-tested code
High coverage + Low mutation score  = Tests exist but don't verify
Low coverage  + Any mutation score  = Uncovered paths

Practical workflow:

Use coverage to find completely untested code paths — write tests to cover them
Use mutation score on your critical modules to find tests that exist but don't assert correctly
Fix both — write tests that cover paths AND make precise assertions

Quality gate example:

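// jest.config.js
module.exports = {
  coverageThreshold: {
    global: { lines: 80, branches: 75 },
  },
};

// stryker.config.mjs
export default {
  // high and low only color the report; the run fails below break
  thresholds: { high: 80, low: 60, break: 50 },
};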

Both gates must pass. Coverage ensures breadth. Mutation score ensures depth.

Practical Thresholds

What score to target depends on the code:

Core business logic (pricing, auth, validation, state machines):

Coverage: 90%+ lines, 85%+ branches
Mutation score: 80%+

Service/application layer (orchestration, API handlers):

Coverage: 80%+ lines
Mutation score: 70%+

Infrastructure code (config, migrations, seed data):

Coverage: basic smoke test
Mutation score: not worth running

UI components:

Coverage: verify key behaviors
Mutation score: usually not worth the cost
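If you want these targets enforced rather than remembered, they can live in a small lookup that a CI step consults. This is a sketch under assumptions: the tier names, the THRESHOLDS table, and gate_failures are hypothetical, and only the two tiers with numeric targets are included. Infrastructure and UI code have no numeric gate here.

```python
from typing import Optional

# Minimum percentages per tier; None means "no numeric gate for this metric"
THRESHOLDS: dict[str, dict[str, Optional[float]]] = {
    "core": {"lines": 90, "branches": 85, "mutation": 80},
    "service": {"lines": 80, "branches": None, "mutation": 70},
}


def gate_failures(tier: str, measured: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in THRESHOLDS[tier].items():
        if minimum is None:
            continue
        value = measured.get(metric)
        if value is None or value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return failures


print(gate_failures("core", {"lines": 92, "branches": 88, "mutation": 61}))
# ['mutation: 61 < 80']
print(gate_failures("service", {"lines": 85, "mutation": 72}))
# []
```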

Starting from a Low Score

If your current mutation score is 30% and you want to get to 70%, the path is:

Run mutation testing on your most critical 300–500 lines of code
Sort survived mutations by impact — focus on critical and high-impact mutations first
Write tests that kill the highest-value survived mutations
Re-run to confirm kills
Set a threshold at your current score + 5% to prevent regression
Raise the threshold incrementally as you improve
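Steps 2, 5, and 6 can be sketched in a few lines. Everything here is an assumption for illustration: the Survivor record, the impact labels, and the ratchet rule (raise the gate by 5 points only once the measured score has cleared the next level) are one possible reading of the list above, not a feature of any particular tool.

```python
from dataclasses import dataclass

IMPACT_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Survivor:
    location: str
    mutation: str
    impact: str  # assigned by the team; hypothetical labels


def prioritize(survivors: list[Survivor]) -> list[Survivor]:
    """Most dangerous survivors first; ties keep their original order."""
    return sorted(survivors, key=lambda s: IMPACT_ORDER[s.impact])


def next_threshold(measured: float, current: float, step: float = 5.0, target: float = 70.0) -> float:
    """Raise the gate by `step` once the score clears the next level, never past `target`."""
    if measured >= current + step:
        return min(current + step, target)
    return current


survivors = [
    Survivor("cart.py:12", "+ -> -", "low"),
    Survivor("pricing.py:40", "> -> >=", "critical"),
    Survivor("auth.py:7", "and -> or", "high"),
]
print([s.location for s in prioritize(survivors)])
# ['pricing.py:40', 'auth.py:7', 'cart.py:12']
print(next_threshold(measured=42.0, current=30.0))  # 35.0
print(next_threshold(measured=33.0, current=30.0))  # 30.0
```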

Don't try to jump from 30% to 70% in one sprint. Pick the survived mutations in your most risky code, kill those, and build from there.

The Right Mental Model

Code coverage measures code exploration. Mutation score measures assertion quality.

You need both, but if you're choosing where to invest in test improvement first:

Uncovered critical paths: fix with coverage-focused testing
Covered but weakly asserted critical paths: fix with mutation-score-focused testing

A team that optimizes for coverage tends toward tests that execute every line with minimal assertions. A team that optimizes for mutation score tends toward tests that precisely verify behavior on the paths they cover. The second team has fewer tests but better ones.

The ideal is specific tests that cover all meaningful paths and make precise assertions at every step. Mutation testing is the forcing function that reveals where that bar isn't being met.
