AI Test Generation: How LLMs Write Unit Tests


Large language models can generate unit tests from source code, docstrings, or natural language descriptions. They excel at happy-path tests and boilerplate reduction but miss edge cases without explicit prompting. This guide explains how LLM-based test generation works, which tools use it, and how to integrate it into a real workflow.

Key Takeaways

LLMs generate tests from context, not from running code. They analyze function signatures, docstrings, and examples; unless a tool adds an execution step, they don't run your code. This means they can write plausible-looking tests that miss real runtime behavior.

Always review generated tests before committing. AI-generated tests can be syntactically correct but semantically wrong. A test that always passes proves nothing.

Prompt with examples of failure, not just success. LLMs produce better edge-case tests when you explicitly ask: "Also generate tests for null input, empty arrays, and concurrent access."

Generated tests work best as a starting draft. Use them to eliminate boilerplate, then review and extend. Treat AI output as a junior dev's first pass.

Test coverage isn't test quality. LLMs can hit 90% line coverage with tests that catch almost nothing. Focus on meaningful assertions, not coverage numbers.

The Problem AI Test Generation Solves

Writing unit tests is one of the most frequently skipped parts of software development. Most developers understand the value of tests; the problem is that writing a comprehensive test suite for existing code is slow, repetitive, and unrewarding work.

A function with five code paths needs five test cases. A class with ten methods might need fifty tests. Multiply that across a codebase with thousands of functions and the backlog becomes paralyzingly large.

AI test generation attacks this backlog directly. Give an LLM a function, and it returns a test file. The developer reviews it, fixes anything wrong, and commits. What might have taken two hours of test writing takes twenty minutes of review.

The question isn't whether this is useful — it clearly is. The question is: how does it work, where does it fail, and how do you use it without introducing false confidence into your test suite?

How LLMs Generate Tests

Large language models don't understand code in the way a compiler does. They recognize patterns. Given enough training examples of functions paired with their tests, they learn the mapping between "function that validates email" and "test suite with valid email, invalid email, empty string, and null cases."

The generation process works in three phases:

Context extraction. The model reads the function signature, parameter types, docstrings, inline comments, and any examples. It looks at what other functions in the same file do. Some tools also read the test files that already exist to match style.

Pattern matching. The model identifies the function's category: pure function, method with side effects, async operation, class constructor, etc. It applies patterns it learned during training for that category.

Test synthesis. The model produces test code that calls the function with specific inputs and asserts specific outputs. The inputs are chosen based on common patterns for that input type — boundary values for numbers, empty/single/many cases for collections, and so on.
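As a rough illustration of the context-extraction phase, the sketch below gathers a function's signature and docstring and folds them into a prompt. The function `validate_email`, the helper `build_test_prompt`, and the prompt wording are all assumptions made up for this example; real tools collect far more context and may work quite differently.

```python
import inspect


def validate_email(address: str) -> bool:
    """Return True if address looks like a valid email address."""
    # Hypothetical function under test, deliberately simplistic.
    return "@" in address and "." in address.split("@")[-1]


def build_test_prompt(func) -> str:
    """Hypothetical helper: collect static context about func for an LLM prompt."""
    signature = inspect.signature(func)  # parameter names and type hints
    docstring = inspect.getdoc(func) or "(no docstring)"
    return (
        f"Write pytest unit tests for `{func.__name__}{signature}`.\n"
        f"Docstring: {docstring}\n"
        "Cover typical inputs, boundary values, and documented errors."
    )


prompt = build_test_prompt(validate_email)
print(prompt)
```

Note that nothing here executes `validate_email`: the prompt is built purely from static information, which is why generated tests can look right while asserting the wrong behavior.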

The result looks like this in practice. Given this Python function:

def calculate_discount(price: float, discount_pct: float) -> float:
    """Apply percentage discount to price. Returns discounted price."""
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError("Discount must be between 0 and 100")
    return price * (1 - discount_pct / 100)

A well-prompted LLM produces:

import pytest
from pricing import calculate_discount

class TestCalculateDiscount:
    def test_zero_discount_returns_full_price(self):
        assert calculate_discount(100.0, 0) == 100.0

    def test_full_discount_returns_zero(self):
        assert calculate_discount(100.0, 100) == 0.0

    def test_partial_discount_applied_correctly(self):
        assert calculate_discount(200.0, 25) == 150.0

    def test_negative_discount_raises_value_error(self):
        with pytest.raises(ValueError):
            calculate_discount(100.0, -5)

    def test_discount_over_100_raises_value_error(self):
        with pytest.raises(ValueError):
            calculate_discount(100.0, 101)

    def test_zero_price_returns_zero(self):
        assert calculate_discount(0, 50) == 0.0

    def test_floating_point_precision(self):
        result = calculate_discount(10.0, 33.33)
        assert abs(result - 6.667) < 0.001

That's seven useful tests generated in seconds. Not perfect — you'd want to check if negative prices are allowed, what happens with very large prices, and whether the function handles integer inputs. But it's a solid starting point that a developer can review and extend in minutes rather than writing from scratch.
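The gaps listed above could be closed with a few hand-written tests along these lines. This is a sketch, not a specification: the function says nothing about negative prices, so the second test only pins down what the code currently does, and the test names are made up for illustration.

```python
import math


def calculate_discount(price: float, discount_pct: float) -> float:
    """Apply percentage discount to price. Returns discounted price."""
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError("Discount must be between 0 and 100")
    return price * (1 - discount_pct / 100)


def test_integer_inputs_are_accepted():
    # Ints are not rejected, and true division yields a float result.
    assert calculate_discount(200, 25) == 150.0


def test_negative_price_is_not_validated():
    # Assumption: documents current behavior, which may or may not be intended.
    assert math.isclose(calculate_discount(-100.0, 10), -90.0)


def test_very_large_price_stays_finite():
    result = calculate_discount(1e300, 50)
    assert math.isclose(result, 5e299, rel_tol=1e-12)
```

If negative prices should be rejected, the second test is the one to flip into an expected `ValueError`, which is exactly the kind of decision a reviewer has to make.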

What LLMs Do Well

Boundary value tests. LLMs are trained on enough test code to know that boundaries matter. They reliably generate zero, negative, maximum, and empty cases for numeric and collection inputs.

Error path tests. If a function has explicit error handling (raises an exception, returns an error code), LLMs will usually generate tests for those paths.

Boilerplate elimination. Setup, teardown, fixture creation — LLMs handle the structural repetition that makes test writing tedious. They match the test framework syntax (pytest, Jest, JUnit, RSpec) and generate idiomatic code.

Docstring coverage. If your function has examples in the docstring, LLMs use them. Python's doctest format works especially well as model context.

Happy-path completeness. For functions with clear inputs and outputs, LLMs reliably cover the main scenarios. You won't miss the obvious cases.
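To make the docstring point concrete, here is a minimal sketch of a function whose docstring carries doctest examples. The function `slugify` is hypothetical. The standard library's `doctest.DocTestParser` is used only to show that the examples are already structured input/output pairs, which is the kind of context a model can pick up on.

```python
import doctest


def slugify(title: str) -> str:
    """Convert a title to a URL slug.

    >>> slugify("Hello World")
    'hello-world'
    >>> slugify("  Trim me  ")
    'trim-me'
    """
    # Hypothetical function, used only for illustration.
    return "-".join(title.lower().split())


# Each doctest example is a (call, expected output) pair.
examples = doctest.DocTestParser().get_examples(slugify.__doc__)
pairs = [(ex.source.strip(), ex.want.strip()) for ex in examples]
for call, expected in pairs:
    print(f"{call} -> {expected}")
```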

Where LLMs Fail

Concurrency and timing. Tests for race conditions, deadlocks, and timing-sensitive behavior require understanding of thread scheduling and system state. LLMs write sequential tests for concurrent code and miss the issues that actually matter.

External dependencies. LLMs often generate tests that call real external services — databases, APIs, file systems — when they should use mocks. The tests work on the developer's machine, fail in CI, and produce false failures that erode trust in the test suite.

Business logic edge cases. LLMs don't know your domain. A discount calculation for a US retail system needs to handle edge cases specific to US tax law, coupon stacking rules, and loyalty program interactions. LLMs generate generic edge cases, not domain-specific ones.

State-dependent behavior. Functions whose behavior depends on previous calls (mutable objects, database state, file system state) require tests that carefully manage that state. LLMs often generate tests that work in isolation but fail when run as a suite.

Performance assertions. LLMs rarely generate performance tests unless asked. If a function must return in under 50ms, that assertion won't appear in generated tests unless the requirement is stated in the prompt or in the code itself.

Integration points. The most valuable tests often test the interaction between components. LLMs generate unit tests (one function at a time) and miss the integration failures that actually reach production.
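The state-dependent failure mode is easy to reproduce. In the sketch below, the `Cart` class is a hypothetical stand-in for any mutable object. A module-level instance leaks state between calls, while tests that build a fresh instance stay independent of execution order.

```python
class Cart:
    """Hypothetical stateful object, used only for this illustration."""

    def __init__(self):
        self.items = []

    def add(self, item: str) -> int:
        self.items.append(item)
        return len(self.items)


# Shared state: the second call sees what the first one left behind.
shared_cart = Cart()
first_result = shared_cart.add("apple")   # 1
second_result = shared_cart.add("pear")   # 2, so "assert ... == 1" would fail here


# Isolated state: each test creates its own Cart, so order does not matter.
def test_add_apple_isolated():
    cart = Cart()
    assert cart.add("apple") == 1


def test_add_pear_isolated():
    cart = Cart()
    assert cart.add("pear") == 1
```

A generated test that asserts `== 1` against the shared instance would pass when run alone and fail in a full suite run, which matches the pattern described above.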

Tools That Use LLM Test Generation

Several tools have built LLM-based test generation into developer workflows:

GitHub Copilot generates tests inline as you type. It works best when you open a test file and write the first test manually — Copilot uses your style as context and generates the rest. It struggles with complex setup and mocking.

CodiumAI (now Qodo) is purpose-built for test generation. It analyzes entire files, identifies testable behaviors, and generates comprehensive test suites. It runs the generated tests and iterates until they pass — a meaningful improvement over pure generation.

Tabnine generates test suggestions as you type, similar to Copilot but with better support for enterprise environments and private model deployment.

Amazon CodeWhisperer (now part of Amazon Q Developer) includes test generation as part of its code suggestion engine, with particularly strong support for Java and Python.

ChatGPT/Claude work well for test generation when you explicitly provide the function and describe what you want tested. The lack of IDE integration means copy-paste overhead, but the flexibility to prompt for specific scenarios makes up for it.

Effective Prompting for Test Generation

The quality of generated tests depends heavily on how you prompt. Weak prompts produce generic tests. Specific prompts produce useful ones.

Include the full context: Don't paste just the function. Include the file, the imports, the class it belongs to, and any related data structures.

Ask for specific failure modes explicitly:

Generate unit tests for this function. Include:
- Happy path with typical inputs
- All documented error conditions
- Null/undefined/empty inputs for each parameter
- Boundary values (min, max, zero)
- Any domain-specific edge cases you can infer from the function name

Ask for mocks when dependencies exist:

The function calls database.find() — mock it using unittest.mock.patch.
Do not make real database calls.

Specify the test style:

Use pytest fixtures. Group tests in classes by scenario.
Follow the Arrange-Act-Assert pattern.

Request explanatory comments for non-obvious cases:

Add a one-line comment before each test explaining what scenario it covers
and why it matters.
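For the mocking prompt in particular, the output you are hoping for looks roughly like the sketch below. The `Database` class, the `database` object, and `get_user_name` are hypothetical stand-ins invented for this example. Only `unittest.mock.patch.object` and `assert_called_once_with` are real standard-library APIs.

```python
from unittest.mock import patch


class Database:
    """Hypothetical dependency; the real method would hit a database."""

    def find(self, user_id):
        raise RuntimeError("real database call attempted")


database = Database()


def get_user_name(user_id):
    """Hypothetical function under test."""
    record = database.find(user_id)
    return record["name"] if record else None


def test_get_user_name_uses_mocked_find():
    # Arrange: replace database.find with a mock returning a canned record.
    with patch.object(database, "find", return_value={"name": "Ada"}) as mock_find:
        # Act
        name = get_user_name(42)
    # Assert: correct result, and the mock (not the real method) was called.
    assert name == "Ada"
    mock_find.assert_called_once_with(42)
```

The final assertion on the mock matters: without it, a test can pass even when the patch targets the wrong object and the code path never touches the mock.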

Integrating AI Test Generation Into Your Workflow

The most effective integration isn't "generate tests for all new code." It's targeted: use AI generation for specific situations where the ratio of effort to value is highest.

Legacy code coverage. When adding tests to existing code with no test coverage, AI generation is invaluable. The function behavior is already defined; you just need tests that document it. Generate a full suite, review it, run it, fix it.

New function first draft. When writing a new function, generate tests before implementing it. Use TDD with AI assistance: describe the function in the docstring, generate tests, then implement until the tests pass.

Code review assistance. When reviewing a PR, paste changed functions into an AI tool and ask "what tests are missing here?" The AI flags scenarios the author didn't think of.

Documentation through tests. If a function is underdocumented, generate tests and keep the ones that reveal the function's actual behavior. The test names become living documentation.

Quality Gates for Generated Tests

Before committing AI-generated tests, apply these checks:

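Run them. The first filter is obvious: do they pass? Generated tests often include incorrect assertions about return values, and running them immediately surfaces these. When a test fails, decide whether the assertion or the code is wrong before changing either.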

Check for tautological assertions. A test that calls a function and asserts the result equals the same function call proves nothing. Look for patterns like assert calculate(x) == calculate(x).

Verify mock behavior. If the generated test mocks a dependency, check that the mock is actually used — that the function under test calls the mock, not a real dependency.

Add at least one test the LLM didn't generate. This forces you to think about the function yourself, which often reveals the scenario the AI missed.

Check hardcoded expected values. LLMs sometimes generate tests like assert result == 3.141592653589793 based on pattern matching rather than calculation. Verify these values are actually correct.
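The tautology check can be partly automated. The sketch below uses Python's `ast` module to flag `assert` statements whose two sides are the same expression. The helper name `find_tautological_asserts` is an assumption for this example, and the check only catches the literal `assert f(x) == f(x)` pattern, not tautologies hidden behind intermediate variables.

```python
import ast


def find_tautological_asserts(source: str) -> list[int]:
    """Return line numbers of asserts that compare an expression to itself."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and len(test.comparators) == 1
            and isinstance(test.ops[0], ast.Eq)
            # ast.dump omits line/column info by default, so identical
            # expressions produce identical dumps.
            and ast.dump(test.left) == ast.dump(test.comparators[0])
        ):
            flagged.append(node.lineno)
    return sorted(flagged)


sample = (
    "def test_a():\n"
    "    assert calculate(2) == calculate(2)\n"
    "\n"
    "def test_b():\n"
    "    assert calculate(2) == 4\n"
)
flagged_lines = find_tautological_asserts(sample)
print(flagged_lines)
```

The source is only parsed, never executed, so this can run over generated test files before anything is imported.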

The Right Mental Model

AI test generation is most useful when you treat it as a collaborator that does the mechanical work, not an oracle that tells you whether your code is correct.

The LLM writes the boilerplate. You write the judgment. It generates the structure; you add the domain knowledge. It covers the obvious cases; you add the edge cases specific to your system.

Used this way, AI test generation meaningfully reduces the time cost of maintaining a comprehensive test suite. The tests it produces aren't perfect, but they're far better than no tests — and they shift the work from "generate from scratch" to "review and improve," which is both faster and more likely to actually happen.

For teams using HelpMeTest, AI-generated unit tests pair well with browser-level end-to-end tests that validate user-facing behavior. The unit tests catch function-level regressions; the E2E tests catch integration failures. Neither replaces the other.

Summary

LLMs generate unit tests by pattern-matching against training data of function-test pairs. They reliably produce boundary tests, error path tests, and boilerplate. They miss concurrency issues, domain-specific edge cases, integration failures, and performance requirements.

The best workflow uses AI generation as a first draft: generate, run, review, extend. Prompt specifically for the failure modes that matter. Treat coverage numbers as a floor, not a goal.

AI test generation doesn't replace understanding your code. It removes the excuse for not testing it.
