Test Estimation Techniques: Story Points, Function Points, and How to Present Estimates to PMs
Test estimation is notoriously difficult because the effort required to test something depends on factors that are often unknown at estimation time — complexity, environment stability, defect rate, and whether the feature actually works as described. This guide covers four practical estimation techniques, when to apply each, and how to present estimates to product managers in a way that builds trust rather than creating conflict.
Key Takeaways
- No estimation technique is perfectly accurate; the goal is to be calibrated, not precise
- Three-point estimation (optimistic / most likely / pessimistic) gives PMs useful range information, not false precision
- Story points work best when QA and development estimate together — separate estimation breaks the conversation
- Function point analysis is useful for large, well-defined projects; too heavyweight for agile sprints
- Always present estimates with explicit assumptions — a changed assumption invalidates the estimate, and PMs need to know that
Why Test Estimation Is Hard
Test estimation is harder than development estimation for a structural reason: developers estimate the work of building something known, while testers estimate the work of finding problems in something unknown.
The number of bugs in a feature is not predictable at estimation time. The stability of the test environment is not predictable. Whether the feature will be ready to test when the sprint schedule says it will — not predictable. A QA engineer estimating a testing effort is making assumptions about all of these unknowns simultaneously.
This does not mean estimation is impossible. It means estimates should carry explicit uncertainty, and the techniques used should be chosen to surface that uncertainty rather than paper over it.
Technique 1: Story Points for Testing
Story points are a relative estimation unit that measures effort, complexity, and uncertainty together. They originated in development estimation but apply equally well to testing.
How It Works
Story points assign a number (typically from a Fibonacci sequence: 1, 2, 3, 5, 8, 13, 21) to each testing task. The number is relative to a reference task. If "testing a CRUD form with five fields and no integrations" is a 3, then "testing a multi-step checkout flow with payment integration" might be a 13.
The advantage of story points over hours is that they decouple the size of the work from calendar time. A 13-point task takes a senior QA engineer less calendar time than a junior one, but its size relative to the reference task is the same for both. This makes velocity calculations more stable over time.
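As a rough illustration of how velocity is derived from points rather than hours, the sketch below averages the points completed over recent sprints. The sprint history, the 55-point backlog, and the `rolling_velocity` helper are all hypothetical, not taken from any particular tool.

```python
from statistics import mean

def rolling_velocity(completed_points: list[int], window: int = 3) -> float:
    """Average story points completed over the most recent `window` sprints."""
    if not completed_points:
        raise ValueError("need at least one completed sprint")
    return mean(completed_points[-window:])

# Hypothetical QA story points completed in the last five sprints.
history = [21, 18, 24, 20, 22]
velocity = rolling_velocity(history)   # averages the last three sprints
sprints_needed = 55 / velocity         # hypothetical 55-point testing backlog
print(f"velocity={velocity:.1f} points/sprint, sprints needed={sprints_needed:.1f}")
```

A short window reacts quickly to team changes; a longer one smooths out a single bad sprint. Which is better depends on how stable the team is.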
Applying Story Points to Testing
Define your reference point first. Pick a testing task the team has done before and agrees is "medium" complexity. Assign it a baseline point value (3 or 5 works well).
Factors that increase story point estimates for testing:
- Multiple integrations to verify
- New feature with no existing test infrastructure
- Unclear or incomplete requirements
- Known unstable test environment
- Feature involves third-party services (payment processors, external APIs)
- Multiple user roles with different access patterns
- Mobile and desktop both need testing
- No existing regression suite to build on
Factors that decrease story point estimates:
- Feature is a well-understood variation of something tested before
- Strong unit test coverage already exists
- Stable test environment with good test data
- Single user role, single platform
QA Story Points in Planning Poker
The most effective use of story points for testing is to include QA in planning poker alongside developers. When QA estimates at the same time as development, two things happen:
- The development estimate and QA estimate can be compared — a large gap prompts discussion about why
- QA clarifying questions during estimation (what about mobile? what about the payment failure path?) improve requirement quality before work begins
This is valuable enough that many teams run planning poker with both a development estimate and a separate QA estimate on each ticket.
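One way to make the "large gap prompts discussion" rule concrete is to flag tickets where the two estimates sit more than one step apart on the point scale. This is only a sketch: the ticket keys, the one-step threshold, and the `needs_discussion` helper are assumptions chosen for illustration.

```python
# The point scale used in this guide.
SCALE = [1, 2, 3, 5, 8, 13, 21]

def needs_discussion(dev_points: int, qa_points: int, max_steps: int = 1) -> bool:
    """True when the estimates are more than `max_steps` apart on the scale."""
    gap = abs(SCALE.index(dev_points) - SCALE.index(qa_points))
    return gap > max_steps

# Hypothetical tickets: (development estimate, QA estimate).
tickets = {
    "CHECKOUT-1": (5, 13),
    "PROFILE-2": (3, 3),
    "SEARCH-3": (8, 5),
}
flagged = [key for key, (dev, qa) in tickets.items() if needs_discussion(dev, qa)]
print(flagged)  # only CHECKOUT-1 is two steps apart
```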
Technique 2: Function Point Analysis
Function point analysis (FPA) was developed by Allan Albrecht at IBM in the 1970s and measures software size in terms of functionality delivered to the user. It is more rigorous than story points and produces estimates that are more consistent across teams, but it requires more setup and is better suited to larger, well-defined projects.
What Gets Counted
Function points count five types of functional components:
| Component | Description | Example |
|---|---|---|
| External Input (EI) | User inputs that modify internal data | Form submission, file upload |
| External Output (EO) | Outputs generated for the user | Reports, exported files |
| External Inquiry (EQ) | Input/output pairs with no data modification | Search results, read-only queries |
| Internal Logical File (ILF) | Data maintained by the application | User table, order table |
| External Interface File (EIF) | Data maintained by external applications | Payment processor records, external API |
Each component is rated as simple, average, or complex, and weighted accordingly. The total function point count is used to estimate testing effort.
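A minimal sketch of the counting step follows. The weights are the commonly published IFPUG unadjusted weights (IFPUG labels the ratings low, average, and high); check them against the counting manual your organization uses. The component inventory is made up for illustration.

```python
# Commonly published IFPUG unadjusted weights; verify against your manual.
WEIGHTS = {
    "EI":  {"simple": 3, "average": 4,  "complex": 6},
    "EO":  {"simple": 4, "average": 5,  "complex": 7},
    "EQ":  {"simple": 3, "average": 4,  "complex": 6},
    "ILF": {"simple": 7, "average": 10, "complex": 15},
    "EIF": {"simple": 5, "average": 7,  "complex": 10},
}

def unadjusted_function_points(components: list[tuple[str, str, int]]) -> int:
    """Sum weight * count over (component type, rating, count) entries."""
    return sum(WEIGHTS[kind][rating] * count for kind, rating, count in components)

# Hypothetical inventory for a small feature set.
components = [
    ("EI", "average", 6),   # 6 * 4  = 24
    ("EO", "simple", 3),    # 3 * 4  = 12
    ("EQ", "average", 4),   # 4 * 4  = 16
    ("ILF", "average", 2),  # 2 * 10 = 20
    ("EIF", "simple", 1),   # 1 * 5  = 5
]
total = unadjusted_function_points(components)
print(total)  # 77
```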
FPA Testing Effort Formula
A common rule of thumb (from industry benchmarks) is that testing effort runs at 25–40% of development effort. If development is estimated at 100 function points with an average of 8 hours per function point (800 hours), testing effort is approximately 200–320 hours.
More granular benchmarks by function point type are available from organizations like ISBSG and IFPUG, though these require access to historical project data to be useful.
When to Use FPA
FPA is most useful when:
- The project scope is large and well-defined (a complete system or major subsystem)
- You need to compare testing effort across projects consistently
- You are providing a fixed-price bid for QA services
- Historical function point data is available for calibration
It is not practical for individual sprint tasks or exploratory testing estimation.
Technique 3: Analogy-Based Estimation
Analogy-based estimation uses historical data from similar past tasks to estimate current tasks. It is the most intuitive technique and the one most experienced QA engineers use implicitly, even when they call it something else.
How It Works
- Identify the task to be estimated
- Find one or more completed tasks that are similar
- Note the similarities and differences
- Adjust the historical effort up or down based on the differences
- Produce the estimate
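The adjust-and-estimate steps above can be sketched as a scale factor applied to the reference task plus a buffer range for anything new. The numbers below are invented for illustration, and `estimate_by_analogy` is a hypothetical helper; in practice the scale factor is a judgment call rather than a measured quantity.

```python
def estimate_by_analogy(
    reference_hours: float,
    scale_factor: float,
    buffer_low: float = 0.0,
    buffer_high: float = 0.0,
) -> tuple[float, float]:
    """Scale a past task's actual effort, then add a buffer range for unknowns."""
    base = reference_hours * scale_factor
    return base + buffer_low, base + buffer_high

# Hypothetical: a similar task took 8 hours, the new one is about 1.5x the size,
# and an unfamiliar integration adds 2 to 4 hours of buffer.
low, high = estimate_by_analogy(8, 1.5, buffer_low=2, buffer_high=4)
print(f"{low:.0f} to {high:.0f} hours")  # 14 to 16 hours
```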
Example:
"Last sprint we tested the user profile settings page: four fields, one image upload, no integrations. That took 6 hours including regression. This sprint we're testing the company settings page: eight fields, a logo upload, and an SSO integration. The SSO integration is new territory for us. I'd estimate 15–17 hours: double the fields roughly doubles the basic test effort to about 12 hours, plus a 3–5 hour buffer for the SSO integration complexity."
Building Your Reference Library
Analogy-based estimation becomes more accurate as you accumulate a reference library of past estimates and actual outcomes. After each sprint, record:
- What was tested
- Estimated effort
- Actual effort
- Key factors that caused deviation
A simple spreadsheet works. After 6–12 sprints, this data becomes the most reliable estimating resource a team has.
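If the library later outgrows a spreadsheet, the same four fields map onto a small record type. This is a sketch only: the field names, sprint labels, and sample rows are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EstimateRecord:
    sprint: str
    what_was_tested: str
    estimated_hours: float
    actual_hours: float
    deviation_cause: str = ""

    @property
    def ratio(self) -> float:
        """Actual divided by estimated; above 1.0 means the task ran over."""
        return self.actual_hours / self.estimated_hours

# Hypothetical entries recorded after two sprints.
library = [
    EstimateRecord("S14", "profile settings page", 6, 6),
    EstimateRecord("S15", "company settings page with SSO", 12, 18, "SSO sandbox outages"),
]

def find_similar(records: list[EstimateRecord], keyword: str) -> list[EstimateRecord]:
    """Naive lookup: past tasks whose description mentions the keyword."""
    return [r for r in records if keyword.lower() in r.what_was_tested.lower()]

for record in find_similar(library, "settings"):
    print(record.sprint, record.what_was_tested, f"ratio={record.ratio:.2f}")
```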
Technique 4: Three-Point Estimation
Three-point estimation, in the weighted form used by PERT (Program Evaluation and Review Technique), addresses the problem of uncertainty by explicitly encoding it in the estimate. Instead of a single number, you provide three:
- O (Optimistic): the best case, assuming everything goes smoothly
- M (Most Likely): the realistic estimate, assuming typical conditions
- P (Pessimistic): the worst case, accounting for plausible obstacles
The weighted average formula is:
E = (O + 4M + P) / 6
The standard deviation is:
SD = (P - O) / 6
Worked Example
Testing a new checkout flow:
- Optimistic (O): 12 hours — environment stable, feature works as specified, no blockers
- Most Likely (M): 20 hours — one or two integration issues, some exploratory time on edge cases
- Pessimistic (P): 36 hours — test environment instability, payment sandbox issues, significant defects requiring retest cycles
E = (12 + 4×20 + 36) / 6 = (12 + 80 + 36) / 6 = 128 / 6 ≈ 21.3 hours
SD = (36 - 12) / 6 = 4 hours
So you'd present this as: "Estimated 21 hours, ±4 hours, assuming [specific conditions]."
This is dramatically more honest than saying "20 hours" as if that number were precise.
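The two formulas are easy to wrap in a helper so the arithmetic is not redone by hand each sprint. The function name and the input check are assumptions; the numbers are the checkout example from this section.

```python
def pert_estimate(optimistic: float, most_likely: float, pessimistic: float) -> tuple[float, float]:
    """Return (expected, standard deviation) using E = (O + 4M + P) / 6 and SD = (P - O) / 6."""
    if not optimistic <= most_likely <= pessimistic:
        raise ValueError("expected optimistic <= most_likely <= pessimistic")
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    sd = (pessimistic - optimistic) / 6
    return expected, sd

expected, sd = pert_estimate(12, 20, 36)
print(f"Estimated {expected:.1f} hours, plus or minus {sd:.0f} hours")  # 21.3 and 4
```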
How to Present Estimates to Product Managers
The estimate is only half of the conversation. How you present it determines whether PMs trust your estimates or negotiate against them reflexively.
Lead With Assumptions
Never present a number without its assumptions. "12 hours to test the checkout flow" invites negotiation. "12 hours to test the checkout flow, assuming the staging environment is stable, test payment cards are set up, and the feature enters testing in the first half of the sprint" is a professional estimate with auditable conditions.
If an assumption is violated — the feature is not ready until day 4 of a 5-day sprint — the estimate changes. PMs need to understand that the estimate and its assumptions are a package deal.
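One lightweight way to keep the number and its assumptions together is to store them in a single record and only ever render them as a unit. The `Estimate` class below is a hypothetical sketch, not a feature of any tracker.

```python
from dataclasses import dataclass, field

@dataclass
class Estimate:
    task: str
    low_hours: float
    high_hours: float
    assumptions: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render the estimate so the assumptions always travel with the number."""
        line = f"{self.low_hours:g} to {self.high_hours:g} hours to test {self.task}"
        if self.assumptions:
            line += ", assuming " + "; ".join(self.assumptions)
        return line

estimate = Estimate(
    task="the checkout flow",
    low_hours=16,
    high_hours=24,
    assumptions=[
        "the staging environment is stable",
        "test payment cards are set up",
        "the feature enters testing in the first half of the sprint",
    ],
)
print(estimate.summary())
```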
Use Ranges, Not Point Estimates
Point estimates communicate false precision. A QA engineer who says "18 hours" sounds more certain than one who says "16–22 hours," but the 18-hour figure is no more accurate; it only hides the uncertainty.
Present three-point estimates or ranges. "Between 16 and 24 hours depending on environment stability" is an honest representation of reality. PMs who plan around point estimates are setting themselves up for schedule failures — help them plan with ranges.
Explain the Risk Distribution
Using three-point estimation, you can tell PMs not just what the likely outcome is but what the probability distribution looks like. "We expect 20 hours, but there is about a 15% chance this runs over 30 hours if the payment sandbox has issues — we had problems with it in March."
This kind of communication lets PMs make informed decisions about schedule buffers rather than having them imposed after the fact.
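To put a rough number on a tail risk, one common simplification is to treat the estimate as approximately normal, with the PERT expected value and standard deviation. That is an assumption, not a property of real testing effort, and figures quoted to a PM should also reflect history (such as past sandbox incidents) that this approximation ignores. The sketch uses the checkout example's E ≈ 21.3 and SD = 4.

```python
import math

def overrun_probability(expected: float, sd: float, threshold: float) -> float:
    """P(actual effort > threshold) under a normal approximation."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    z = (threshold - expected) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

expected, sd = 128 / 6, 4.0
threshold = expected + sd  # one standard deviation above the expected value
p_over = overrun_probability(expected, sd, threshold)
print(f"Chance of exceeding {threshold:.1f} hours: {p_over:.0%}")  # roughly 16%
```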
Separate Testing Effort from Testing Time
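Effort (hours of work) and time (calendar days) are different. 20 hours of testing effort from one engineer who gets 5–6 focused testing hours a day takes 3–4 calendar days, and splitting the same 20 hours across 4 people shortens the calendar time only as far as the work can run in parallel. Clarify which you are reporting to avoid confusion.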
Also clarify: does your estimate include defect retest cycles? Bug filing time? Test planning? Documentation? Scope the estimate explicitly so PMs are not surprised when "20 hours of testing" turns into 30 hours of total QA activity.
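A small conversion helper makes the effort-versus-time distinction explicit. It assumes the work parallelizes perfectly across testers and that each person gets a fixed number of focused testing hours per day; both are simplifications, and the 5-hour default is a placeholder rather than a benchmark.

```python
import math

def calendar_days(effort_hours: float, people: int = 1, focused_hours_per_day: float = 5.0) -> int:
    """Convert effort to whole calendar days, assuming perfectly parallel work."""
    if people < 1 or focused_hours_per_day <= 0:
        raise ValueError("people must be >= 1 and focused_hours_per_day > 0")
    return math.ceil(effort_hours / (people * focused_hours_per_day))

# Hypothetical: the same 30 hours of effort, staffed two different ways.
print(calendar_days(30, people=1))  # 6 days for one tester
print(calendar_days(30, people=2))  # 3 days for two testers
```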
Improving Estimation Accuracy Over Time
The only reliable path to better estimates is feedback loops. After each sprint or release:
- Compare estimated effort to actual effort for each test area
- Identify the factors that caused the largest deviations
- Update your reference library with new data
- Adjust your estimation heuristics for next time
Teams that do this consistently will find their estimates converge toward accuracy over 6–12 months. Teams that skip retrospectives on estimation stay at the same accuracy level indefinitely.
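The comparison step can be as simple as an average actual-to-estimated ratio per test area, applied as a correction to the next raw estimate. The history below is invented, and a plain mean is only one possible heuristic; a median or a recency-weighted average may suit noisy data better.

```python
from collections import defaultdict
from statistics import mean

def calibration_factors(history: list[tuple[str, float, float]]) -> dict[str, float]:
    """Mean actual/estimated ratio per test area from (area, estimated, actual) rows."""
    ratios: dict[str, list[float]] = defaultdict(list)
    for area, estimated, actual in history:
        ratios[area].append(actual / estimated)
    return {area: mean(values) for area, values in ratios.items()}

# Hypothetical history: (test area, estimated hours, actual hours).
history = [
    ("checkout", 20, 26),
    ("checkout", 16, 20),
    ("profile", 6, 6),
]
factors = calibration_factors(history)
raw_estimate = 18
adjusted = raw_estimate * factors["checkout"]
print(f"checkout factor={factors['checkout']:.3f}, adjusted estimate={adjusted:.1f} hours")
```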
How Continuous Testing Changes the Estimation Equation
One underappreciated benefit of continuous test automation is that it changes what you need to estimate. When critical path tests run automatically on every deploy via a tool like HelpMeTest, you are no longer estimating "regression testing" as a sprint task — it is handled automatically.
This means QA estimation in sprints becomes smaller and more focused: you are estimating only the new test coverage needed for new features, not the maintenance of coverage for existing ones. Over time, as more of the critical path is automated and monitored continuously, sprint QA estimates shrink, velocity improves, and the estimation conversations with PMs become more predictable.
The goal is not to make estimation unnecessary — it is to reduce the amount of testing that needs to be estimated at all.