BDD Best Practices: Living Documentation That Teams Actually Use

Most teams that adopt BDD get the tools right — Cucumber, SpecFlow, Behave — and get the practices wrong. They write Gherkin that only QA engineers understand, run scenarios that only engineers maintain, and end up with feature files that diverge from the actual application within weeks.

This guide covers the practices that make BDD sustainable: collaboration patterns, writing guidelines, living documentation workflows, and the warning signs that your BDD adoption is going off the rails.

What Living Documentation Actually Means

"Living documentation" is the central promise of BDD: your test scenarios are the specification, and they're always current because they fail when the implementation diverges.

This only works if:

  1. The scenarios describe real behaviour in language stakeholders understand
  2. The scenarios are run frequently enough to catch regressions quickly
  3. The scenarios are owned by the whole team, not just QA

When any of these breaks down, you get "dead documentation" — Gherkin files that describe the application as it was designed, not as it works now.

The Three Amigos

The three amigos session is the core BDD collaboration practice. Before implementing a feature, three perspectives meet to define what "done" looks like:

  • Business / Product: what problem are we solving and why?
  • Developer: how might this break? what edge cases exist?
  • QA / Tester: what scenarios will verify this works correctly?

The output is a set of concrete examples that everyone agrees on. Those examples become Gherkin scenarios.

Running a Three Amigos Session

Keep it short (30-60 minutes per feature). The product person opens with the user story. Then the group works through:

  1. Happy path — the ideal flow when everything works
  2. Edge cases — inputs at boundaries, optional fields, empty states
  3. Error scenarios — invalid data, network failures, permission violations

Write examples on a whiteboard or shared doc using Given-When-Then informally. Don't type Gherkin during the session — capture the intent, write the formal syntax later.

A good example from the whiteboard:

Given: customer has a premium subscription
When: they download a report
Then: the full report downloads immediately

A bad example (skips context):

Given: user exists
When: button clicked
Then: stuff happens

Who Writes the Final Feature File?

It doesn't matter, as long as it's reviewed by all three roles before it's committed. Many teams have QA write the first draft, then share it with the product owner and developer for a quick async review.

Writing Rules That Stick

The Rule of One

Each scenario tests exactly one behaviour. If you find yourself using "and then" in a scenario title, you probably have two scenarios.

Too broad:

Scenario: User can register, verify email, and log in

Right:

Scenario: New user can create an account
Scenario: Registration sends a verification email
Scenario: Verified user can log in

Broad scenarios obscure which behaviour failed when the test breaks.
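
A cheap way to find candidates for splitting is to scan scenario titles for conjunctions. The following is a minimal sketch, assuming feature files live under a directory you pass in; `find_compound_titles` is a hypothetical helper name, and a match is only a hint for review, since some titles legitimately contain "and".

```bash
#!/usr/bin/env bash
# Hypothetical helper: print scenario titles that contain a comma or " and ",
# which often signals more than one behaviour in a single scenario.
# Output format is grep's usual path:line:text.
find_compound_titles() {
  local features_dir="$1"
  grep -rnE --include='*.feature' \
    '^[[:space:]]*Scenario( Outline)?:.*(,| and )' "$features_dir"
}
```

Calling `find_compound_titles features/` lists each suspicious title with its file and line number, so it can be reviewed by hand.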

Declarative Over Imperative

This is the most common Gherkin mistake. Imperative Gherkin describes UI mechanics; declarative Gherkin describes business intent.

Imperative (avoid):

When I click the "Menu" icon in the top right
And I hover over "Account Settings"
And I click "Subscription" in the dropdown
And I click the "Upgrade" button
And I select the "Pro" option from the dropdown
And I click "Confirm upgrade"

Declarative (prefer):

When I upgrade my account to Pro

The step definition handles the mechanics. The scenario describes what the user is doing, not how they move through the UI.

Declarative scenarios are more stable — they survive UI redesigns. When you change from a dropdown to a modal, you update one step definition, not twenty scenarios.

Domain Language Over Technical Language

Write in the language of your business, not your tech stack.

Technical (avoid):

When the POST request to /api/v2/subscriptions returns 200
Then the user record in the database has plan_id = 3

Domain language (prefer):

When I upgrade to the Pro plan
Then my account should show "Pro" as the current plan

If your stakeholders can't read a Gherkin file and understand what it means, you've lost the collaboration value of BDD.

Scenario Independence

Every scenario must be able to run in isolation and produce the same result regardless of what ran before it.

Bad — depends on shared state:

Scenario: Create a user
  When I create user "alice@example.com"
  Then the user list should have 1 entry

Scenario: Delete the created user
  When I delete "alice@example.com"
  Then the user list should be empty

Good — each scenario sets up its own state:

Scenario: Create a user
  Given no users exist in the system
  When I create user "alice@example.com"
  Then the user list should have 1 entry

Scenario: Delete a user
  Given user "alice@example.com" exists in the system
  When I delete "alice@example.com"
  Then "alice@example.com" should no longer appear in the user list

Dependent scenarios cause cascading failures and make parallelisation impossible.
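
Hidden ordering dependencies tend to surface when scenarios run in a different order than usual. Here is a minimal sketch, assuming a Maven project whose Cucumber-JVM version supports the `cucumber.execution.order` property; `run_scenarios_shuffled` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the suite in random order so that scenarios relying
# on state left behind by an earlier scenario fail visibly.
# Assumes Cucumber-JVM with support for the cucumber.execution.order property.
run_scenarios_shuffled() {
  mvn test -Dcucumber.execution.order=random "$@"
}
```

Extra arguments are passed through to Maven, for example `run_scenarios_shuffled -Dcucumber.filter.tags="@smoke"`.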

Meaningful Scenario Names

Scenario names appear in test reports. They're what teams read when a build fails.

Vague:

Scenario: Test error handling
Scenario: User login edge case
Scenario: API failure

Specific:

Scenario: Login form shows "Account locked" after 5 failed attempts
Scenario: Upload fails gracefully when file exceeds 10MB limit
Scenario: Dashboard loads within 3 seconds for accounts with 1000+ records

The test name should tell you exactly what broke, without having to open the file.

Organising Feature Files

One Feature File Per User Story or Feature

Don't try to put everything in one file. Each file should cover one coherent user journey.

features/
├── authentication/
│   ├── login.feature
│   ├── registration.feature
│   └── password-reset.feature
├── shopping-cart/
│   ├── add-items.feature
│   ├── remove-items.feature
│   └── discount-codes.feature
└── checkout/
    ├── address.feature
    ├── payment.feature
    └── order-confirmation.feature

Tag Strategy

Tags are your primary mechanism for filtering which scenarios run when. Decide on a tag taxonomy before you have 200 feature files:

Tag           Meaning
@smoke        Critical path: runs on every commit (fast)
@regression   Full regression suite: runs nightly or pre-release
@wip          Work in progress: excluded from CI
@api          API tests: no browser needed
@slow         Known slow tests: excluded from PR gates
@flaky        Known flaky tests: excluded, tracked separately

Document your tag taxonomy in a README in the features/ directory.
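
A taxonomy tends to drift unless something checks it. The sketch below lists tags that appear in feature files but are not in the agreed set. It assumes tags sit on their own lines above `Feature` or `Scenario`, and `find_unknown_tags` and `ALLOWED_TAGS` are hypothetical names, not part of any Cucumber tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list tags used in feature files that are not in the taxonomy.
# Override ALLOWED_TAGS in the environment to match your own README.
ALLOWED_TAGS="${ALLOWED_TAGS:-@smoke @regression @wip @api @slow @flaky}"

find_unknown_tags() {
  local features_dir="$1"
  # Tag lines start (after optional indentation) with "@"; split them into one tag per line.
  grep -rhE --include='*.feature' '^[[:space:]]*@' "$features_dir" \
    | tr -s '[:space:]' '\n' \
    | grep -E '^@' \
    | sort -u \
    | while read -r tag; do
        case " $ALLOWED_TAGS " in
          *" $tag "*) ;;          # known tag, nothing to report
          *) echo "$tag" ;;       # unknown tag, print it for review
        esac
      done
}
```

Running `find_unknown_tags features/` in CI and failing the build on non-empty output is one possible way to keep the README and the feature files in step.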

Living Documentation Workflows

Generating Published Documentation

Tools that turn feature files into stakeholder-readable documentation:

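Pickles (free; a .NET tool that works on any Gherkin feature files, including those in Java projects):

# Example using a Java project's feature directory
pickles --feature-directory=src/test/resources/features \
        --output-directory=docs/living-doc \
        --documentation-format=html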

SpecFlow LivingDoc (free CLI):

# -t expects the TestExecution.json written by the LivingDoc plugin during the test run
livingdoc test-assembly MyApp.Tests.dll \
  -t TestExecution.json \
  --output LivingDoc.html

Allure (any framework, rich UI):

# Generate report after test run
allure generate allure-results -o allure-report
allure open allure-report

Publishing to Confluence or Notion

Many teams publish the living documentation to their wiki after each CI run:

# GitHub Actions example
- name: Generate living doc
  run: livingdoc test-assembly MyApp.Tests.dll -t TestExecution.json -o docs/living-doc.html

- name: Upload to Confluence
  # CONFLUENCE_TOKEN must hold the base64-encoded "email:api_token" pair
  run: |
    curl -X POST \
      -H "Authorization: Basic $CONFLUENCE_TOKEN" \
      -H "X-Atlassian-Token: nocheck" \
      -F "file=@docs/living-doc.html" \
      "https://your-org.atlassian.net/wiki/rest/api/content/${PAGE_ID}/child/attachment"

When stakeholders can see "last run: 14 minutes ago — 247 passing, 0 failing" in their wiki, they start trusting the tests. That trust is what makes BDD worthwhile.
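
That trust depends on the published page being recent. A possible guard, sketched under the assumption that the report is a local file and that your `find` supports `-mmin` (GNU and BSD versions do); `report_is_fresh` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical guard: fail if the generated report is missing or older than MAX_MINUTES.
# Usage: report_is_fresh docs/living-doc.html 60
report_is_fresh() {
  local report="$1" max_minutes="$2"
  [ -f "$report" ] || { echo "missing report: $report" >&2; return 1; }
  # find prints the path only when it was last modified more than max_minutes ago.
  if [ -n "$(find "$report" -mmin +"$max_minutes")" ]; then
    echo "stale report: $report is older than $max_minutes minutes" >&2
    return 1
  fi
}
```

A publishing step could call this before uploading, so that an old report is never pushed to the wiki as if it were current.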

Anti-Patterns to Avoid

The Testing Pyramid Inversion

BDD scenarios should cover acceptance criteria, not unit logic. If you have hundreds of Gherkin scenarios testing individual validation rules, you've inverted the testing pyramid. Unit tests are cheaper and faster for that level of detail.

Use BDD for:

  • User journey acceptance criteria
  • Cross-service integration points
  • Business rules that cut across multiple components

Use unit tests for:

  • Individual function logic
  • Validation rules
  • Edge cases at the component level

Gherkin as a DSL for Developers

# This is not BDD — it's unit tests wearing a costume
Scenario: calculateDiscount returns 10.0 when tier is "gold" and amount is 100
  Given the discount calculator is initialized
  When calculateDiscount is called with tier "gold" and amount 100
  Then the return value should be 10.0

If a scenario is only understandable by someone who has read the source code, it's not doing BDD — it's translating unit tests into an unnecessarily verbose format.

Background Overload

A Background with 8 steps is a sign that the feature file covers too many different concerns. When you need that much setup for every scenario, the feature is too broad.

# Too much Background
Background:
  Given I am logged in as an admin
  And the database has been reset
  And test products have been loaded
  And the feature flag "new-checkout" is enabled
  And I have navigated to the admin dashboard
  And I have expanded the "Products" section
  And I have clicked "Add New Product"
  And the form has loaded

Split into focused feature files, each with a 2-3 step Background at most.
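
The Background limit can be checked mechanically. This is a minimal sketch using awk, assuming conventional English Gherkin keywords at the start of each line; `check_background_size` is a hypothetical helper, and the threshold is whatever your team agrees on.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report Background sections with more than MAX steps.
# Usage: check_background_size MAX file.feature [file.feature ...]
# Exits with status 1 if any Background is over the limit.
check_background_size() {
  local max_steps="$1"; shift
  awk -v max="$max_steps" '
    # Close the current Background, warning if it was too long.
    function report() {
      if (in_bg && steps > max) {
        printf "%s: Background has %d steps (max %d)\n", file, steps, max
        bad = 1
      }
      in_bg = 0
    }
    FNR == 1 { report(); file = FILENAME }   # new file: close any open Background
    /^[ \t]*Background:/ { report(); in_bg = 1; steps = 0; next }
    /^[ \t]*(Feature|Rule|Scenario|Scenario Outline|Example):/ { report(); next }
    in_bg && /^[ \t]*(Given|When|Then|And|But|\*) / { steps++ }
    END { report(); exit (bad ? 1 : 0) }
  ' "$@"
}
```

For example, `check_background_size 3 features/checkout/*.feature` prints one line per oversized Background and returns a non-zero status that a CI step can act on.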

Scenario Outline Abuse

Scenario Outline is for running the same behaviour with different data. It's not for creating variations with fundamentally different outcomes:

# Wrong — these are different behaviours, not data variations
Scenario Outline: Various user actions
  Given I am a "<role>" user
  When I do "<action>"
  Then I should see "<result>"

  Examples:
    | role  | action          | result                    |
    | admin | delete user     | user deleted message      |
    | user  | delete user     | permission denied message |
    | admin | export data     | download starts           |
    | user  | export data     | upgrade prompt            |

Write these as separate named scenarios with clear intent. The outline table obscures what's actually being tested.

Metrics for BDD Health

Track these to know if your BDD adoption is working:

Scenario count per quarter — should grow with the feature set, not faster. Explosive growth often means test duplication or testing at the wrong level.

Average scenario execution time — keep the smoke suite under 10 minutes. Slow suites get disabled.

Flaky test rate — track scenarios with intermittent failures separately. A flaky test is worse than no test because it erodes trust in the whole suite.

Time from three amigos session to first passing scenario — long gaps mean the process is broken somewhere.

Last time a stakeholder read a feature file — if it's been months, living documentation has become dead documentation.
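
The first and third metrics can be approximated straight from the feature files. A rough sketch, assuming the `@flaky` tag from the taxonomy above marks intermittent scenarios; `bdd_metrics` is a hypothetical helper, and it counts tag lines, so a feature-level `@flaky` tag counts once rather than once per scenario.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print scenario count and flaky-tag rate for a features directory.
bdd_metrics() {
  local features_dir="$1" total flaky rate=0
  # Count scenario headers (Scenario, Scenario Outline, Example).
  total=$(grep -rhE --include='*.feature' \
    '^[[:space:]]*(Scenario|Scenario Outline|Example):' "$features_dir" \
    | wc -l | tr -d ' ')
  # Count tag lines that carry @flaky; "|| true" keeps a zero count from failing the call.
  flaky=$(grep -rhE --include='*.feature' '^[[:space:]]*@' "$features_dir" \
    | grep -cE '(^|[[:space:]])@flaky([[:space:]]|$)' || true)
  # Integer percentage; skipped when there are no scenarios to avoid dividing by zero.
  if [ "$total" -gt 0 ]; then rate=$(( flaky * 100 / total )); fi
  echo "scenarios=$total flaky=$flaky flaky_rate_pct=$rate"
}
```

Logging the output of `bdd_metrics features/` on each nightly run gives a simple trend line without any extra tooling.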

Integrating with Your Workflow

Pre-push: Run Smoke Tests

#!/bin/bash
# Save as .git/hooks/pre-push and make it executable (chmod +x)
mvn test -Dcucumber.filter.tags="@smoke" --fail-at-end

PR Gate: Run Regression Suite

# GitHub Actions
- name: Regression Tests
  run: mvn test -Dcucumber.filter.tags="@regression and not @slow"

Nightly: Full Suite Including Slow Tests

# Scheduled workflow
- name: Full Suite
  run: mvn test -Dcucumber.filter.tags="not @wip and not @flaky"
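
The three tag expressions above are easy to let drift apart when they are copied into hooks and workflow files. One option, sketched here with a hypothetical `tag_expression_for` helper and stage names chosen for illustration, is to keep them in a single script that every stage calls.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a pipeline stage to its Cucumber tag expression,
# so hooks and CI jobs share one definition.
tag_expression_for() {
  case "$1" in
    smoke)   echo '@smoke' ;;
    pr)      echo '@regression and not @slow' ;;
    nightly) echo 'not @wip and not @flaky' ;;
    *)       echo "unknown stage: $1" >&2; return 1 ;;
  esac
}
```

A job would then run `mvn test -Dcucumber.filter.tags="$(tag_expression_for pr)"`, and changing the policy for a stage means editing one line.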

Summary

BDD's value is collaboration, not just automation. The Gherkin syntax is almost secondary — what matters is that three perspectives (business, development, testing) agree on concrete examples before implementation starts, and that those examples stay accurate over time.

The practices that make this work:

  • Three amigos sessions before every feature
  • Declarative Gherkin in domain language
  • Independent, focused scenarios with precise names
  • Published living documentation that stakeholders actually read
  • Tag strategies that keep the smoke suite fast

Done right, BDD doesn't just catch regressions — it changes how teams communicate about requirements. That's the return on the investment.
