Rasa NLU Testing Guide: pytest-rasa, Stories, and Custom Actions
Rasa testing has three distinct layers: NLU model accuracy tests using rasa test nlu, conversation flow tests using rasa test core with story files, and unit tests for custom actions using pytest with mocked trackers. Each layer catches different categories of bugs.
Rasa Open Source gives you full control over your NLU pipeline and dialog management — and full responsibility for testing it. Unlike hosted services, there's no Rasa cloud console to run test cases in. You own the entire testing stack. Done right, this is an advantage: you can integrate NLU accuracy assertions, story coverage, and custom action unit tests into the same CI pipeline as your application code.
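For orientation, these are the files this guide references (the layout is a suggestion, not a Rasa requirement):

actions/
  action_check_availability.py
  action_cancel_booking.py
scripts/
  check_nlu_thresholds.py
tests/
  test_data/nlu.yml
  test_stories.yml
  test_actions.py
  test_e2e_conversations.py
models/
results/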
Layer 1: NLU Model Testing with rasa test nlu
The rasa test nlu command evaluates your trained NLU model against a test dataset and produces precision/recall/F1 metrics per intent and entity. It's your primary tool for catching NLU regressions.
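If you don't already maintain a separate test set, rasa data split nlu can carve one out of your training data (by default it writes to a train_test_split/ directory); the 0.8 fraction below is just an example:

# Create an 80/20 train/test split of your NLU training data
rasa data split nlu --training-fraction 0.8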
Structure your test data in tests/test_data/nlu.yml:
nlu:
- intent: book_table
  examples: |
    - I want to book a table
    - reserve a spot for dinner
    - can I get a reservation for two
    - table for 4 on Friday
    - need a table tomorrow evening
- intent: cancel_booking
  examples: |
    - cancel my reservation
    - I need to cancel
    - forget the booking
    - please cancel
    - I changed my mind about dinner

Run evaluation:
rasa test nlu \
  --nlu tests/test_data/nlu.yml \
  --model models/ \
  --out results/nlu/

This writes intent_report.json, DIETClassifier_report.json, and entity_report.json to the output directory. Inspect these for low-performing intents.
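intent_report.json follows the scikit-learn classification-report layout: one metrics object per intent plus aggregate entries. An abridged, made-up example of the shape the threshold script below expects:

{
  "book_table": {"precision": 0.92, "recall": 0.88, "f1-score": 0.90, "support": 25},
  "cancel_booking": {"precision": 0.81, "recall": 0.76, "f1-score": 0.78, "support": 17},
  "accuracy": 0.86,
  "macro avg": {"precision": 0.87, "recall": 0.82, "f1-score": 0.84, "support": 42},
  "weighted avg": {"precision": 0.88, "recall": 0.86, "f1-score": 0.87, "support": 42}
}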
Asserting NLU Thresholds in CI
The raw test command exits 0 even when F1 is 0.5. You need a script that reads the JSON report and fails the build when any intent falls below your threshold:
# scripts/check_nlu_thresholds.py
import json
import sys

MIN_F1 = 0.85
MIN_PRECISION = 0.80
MIN_RECALL = 0.80

with open("results/nlu/intent_report.json") as f:
    report = json.load(f)

failures = []
for intent, metrics in report.items():
    if intent in ("accuracy", "macro avg", "weighted avg"):
        continue
    f1 = metrics.get("f1-score", 0)
    precision = metrics.get("precision", 0)
    recall = metrics.get("recall", 0)
    support = metrics.get("support", 0)
    if support == 0:
        continue  # Skip intents with no test examples
    if f1 < MIN_F1:
        failures.append(f" {intent}: F1={f1:.2f} (min {MIN_F1}), support={support}")
    if precision < MIN_PRECISION:
        failures.append(f" {intent}: precision={precision:.2f} (min {MIN_PRECISION})")
    if recall < MIN_RECALL:
        failures.append(f" {intent}: recall={recall:.2f} (min {MIN_RECALL})")

if failures:
    print("NLU threshold violations:")
    for failure in failures:
        print(failure)
    sys.exit(1)

print(f"All intents pass thresholds (F1>={MIN_F1}, P>={MIN_PRECISION}, R>={MIN_RECALL})")

Integrate into your CI:
rasa test nlu --nlu tests/test_data/nlu.yml --model models/ --out results/nlu/
python scripts/check_nlu_thresholds.py

Layer 2: Story Testing with rasa test core
Story files define expected conversation flows. rasa test core replays these stories against your trained model and reports failed predictions. Stories live in tests/test_stories.yml:
stories:
- story: happy path - book table with all params
  steps:
  - user: |
      I want to book a table for two on Friday at 7pm
    intent: book_table
    entities:
    - party_size: "2"
    - date: "friday"
    - time: "19:00"
  - action: action_check_availability
  - slot_was_set:
    - availability: true
  - action: utter_confirm_booking

- story: booking - missing date, elicitation
  steps:
  - user: |
      book a table for two
    intent: book_table
    entities:
    - party_size: "2"
  - action: utter_ask_date
  - user: |
      tomorrow
    intent: provide_date
    entities:
    - date: "tomorrow"
  - action: utter_ask_time

- story: cancellation flow
  steps:
  - user: |
      cancel my reservation
    intent: cancel_booking
  - action: utter_ask_cancellation_reference
  - user: |
      reservation 12345
    intent: provide_reference
    entities:
    - reference_number: "12345"
  - action: action_cancel_booking
  - action: utter_cancellation_confirmed

Run story tests:
rasa test core \
  --stories tests/test_stories.yml \
  --model models/ \
  --out results/core/

The output includes failed_test_stories.yml — a file containing only the stories that failed, with the actual vs expected action at the point of divergence. This is extremely useful for debugging.
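Each failed story is written back out with a comment marking the step where the prediction diverged, along these lines (illustrative, using one of the stories above):

stories:
- story: booking - missing date, elicitation
  steps:
  - intent: book_table
  - action: utter_ask_date  # predicted: utter_ask_time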
Layer 3: Custom Action Unit Testing with pytest
Custom actions are where your business logic lives, and they're the most important layer to unit test. The Rasa SDK's CollectingDispatcher captures whatever an action dispatches, and you can construct Tracker objects directly rather than mocking them.
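One note before the tests: they're async, so pytest needs the pytest-asyncio plugin (installed in the CI pipeline at the end of this guide). A minimal configuration, assuming you keep a pytest.ini at the project root:

# pytest.ini
[pytest]
asyncio_mode = strict
testpaths = tests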
# tests/test_actions.py
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from rasa_sdk import Tracker
from rasa_sdk.executor import CollectingDispatcher

from actions.action_check_availability import ActionCheckAvailability
from actions.action_cancel_booking import ActionCancelBooking


def make_tracker(slots=None, intent="book_table", entities=None):
    """Build a minimal Rasa Tracker for testing."""
    slots = slots or {}
    entities = entities or []
    return Tracker(
        sender_id="test-user",
        slots=slots,
        latest_message={
            "intent": {"name": intent, "confidence": 0.99},
            "entities": entities,
            "text": "test input",
        },
        events=[],
        paused=False,
        followup_action=None,
        active_loop=None,
        latest_action_name=None,
    )


@pytest.mark.asyncio
async def test_check_availability_when_available():
    tracker = make_tracker(slots={"date": "2026-06-15", "time": "19:00", "party_size": 2})
    dispatcher = CollectingDispatcher()
    domain = {}
    with patch("actions.action_check_availability.check_restaurant_availability") as mock_check:
        mock_check.return_value = True
        action = ActionCheckAvailability()
        events = await action.run(dispatcher, tracker, domain)
        assert any(
            e.get("event") == "slot" and e.get("name") == "availability" and e.get("value") is True
            for e in events
        ), "Expected availability slot to be set to True"
        assert len(dispatcher.messages) == 0  # Action sends no message when a table is available


@pytest.mark.asyncio
async def test_check_availability_when_unavailable():
    tracker = make_tracker(slots={"date": "2026-06-20", "time": "19:00", "party_size": 2})
    dispatcher = CollectingDispatcher()
    with patch("actions.action_check_availability.check_restaurant_availability") as mock_check:
        mock_check.return_value = False
        action = ActionCheckAvailability()
        events = await action.run(dispatcher, tracker, {})
        slot_events = [e for e in events if e.get("event") == "slot" and e.get("name") == "availability"]
        assert slot_events[0]["value"] is False


@pytest.mark.asyncio
async def test_cancel_booking_not_found():
    tracker = make_tracker(
        slots={"reference_number": "INVALID-999"},
        intent="provide_reference",
    )
    dispatcher = CollectingDispatcher()
    with patch("actions.action_cancel_booking.find_booking") as mock_find:
        mock_find.return_value = None
        action = ActionCancelBooking()
        await action.run(dispatcher, tracker, {})
        assert len(dispatcher.messages) == 1
        assert "not found" in dispatcher.messages[0]["text"].lower()

End-to-End Story Tests with pytest-rasa
The pytest-rasa package lets you write end-to-end conversation tests in pytest style, without needing a running Rasa server:
# tests/test_e2e_conversations.py
import pytest


@pytest.mark.asyncio
async def test_full_booking_conversation(rasa_app):
    """Test a complete booking conversation end-to-end."""
    client = rasa_app.test_client()

    # Turn 1: Initiate booking
    response = await client.post(
        "/webhooks/rest/webhook",
        json={"sender": "test-e2e-1", "message": "I want to book a table for two on Friday"}
    )
    assert response.status == 200
    data = await response.json()
    assert any("time" in msg.get("text", "").lower() for msg in data), \
        f"Expected time elicitation, got: {data}"

    # Turn 2: Provide time
    response = await client.post(
        "/webhooks/rest/webhook",
        json={"sender": "test-e2e-1", "message": "at 7pm"}
    )
    data = await response.json()
    assert any("confirm" in msg.get("text", "").lower() for msg in data), \
        f"Expected confirmation prompt, got: {data}"

    # Turn 3: Confirm
    response = await client.post(
        "/webhooks/rest/webhook",
        json={"sender": "test-e2e-1", "message": "yes"}
    )
    data = await response.json()
    assert any("confirmed" in msg.get("text", "").lower() for msg in data), \
        f"Expected booking confirmation, got: {data}"

Difference: Unit Tests vs Integration Tests for Actions
Unit tests for custom actions give you fast, isolated feedback — mock the database, mock external APIs, test the logic. Integration tests verify the full pipeline including NLU parsing, dialog management, and action execution.
Use unit tests during development (run in milliseconds) and integration tests in CI (run in seconds to minutes). Never run integration tests against a production Rasa server.
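If an integration test genuinely needs a live HTTP endpoint, start a throwaway local instance instead; a minimal sketch, assuming default ports and an endpoints.yml whose action_endpoint points at localhost:5055:

# Throwaway local Rasa server and action server for integration tests
rasa run actions --port 5055 &
rasa run --enable-api --port 5005 &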
A quick rule of thumb: if you're testing the logic inside an action (what it does given specific slots), use unit tests. If you're testing whether the right action gets triggered given a user utterance, use story tests or E2E tests.
CI Pipeline
name: Rasa Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - run: pip install rasa==3.6.0 pytest pytest-asyncio pytest-rasa
      - name: Train model
        run: rasa train --fixed-model-name test-model
      - name: NLU tests
        run: |
          rasa test nlu --nlu tests/test_data/nlu.yml --model models/test-model.tar.gz --out results/nlu/
          python scripts/check_nlu_thresholds.py
      - name: Story tests
        run: rasa test core --stories tests/test_stories.yml --model models/test-model.tar.gz --out results/core/
      - name: Action unit tests
        run: pytest tests/test_actions.py -v
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: rasa-test-results
          path: results/

HelpMeTest can monitor your deployed Rasa chatbot by running test conversations on a schedule, so you catch NLU drift and dialog regressions before your users do.