Rasa NLU Testing Guide: pytest-rasa, Stories, and Custom Actions
Rasa testing has three distinct layers: NLU model accuracy tests using rasa test nlu, conversation flow tests using rasa test core with story files, and unit tests for custom actions using pytest with mocked trackers. Each layer catches different categories of bugs.
Rasa Open Source gives you full control over your NLU pipeline and dialog management — and full responsibility for testing it. Unlike hosted services, there's no Rasa cloud console to run test cases in. You own the entire testing stack. Done right, this is an advantage: you can integrate NLU accuracy assertions, story coverage, and custom action unit tests into the same CI pipeline as your application code.
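For orientation, these are the files this guide references (the layout is a suggestion, not a Rasa requirement):

actions/
  action_check_availability.py
  action_cancel_booking.py
scripts/
  check_nlu_thresholds.py
tests/
  test_data/nlu.yml
  test_stories.yml
  test_actions.py
  test_e2e_conversations.py
models/
results/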
Layer 1: NLU Model Testing with rasa test nlu
The rasa test nlu command evaluates your trained NLU model against a test dataset and produces precision/recall/F1 metrics per intent and entity. It's your primary tool for catching NLU regressions.
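If you don't already maintain a separate test set, rasa data split nlu can carve one out of your training data (by default it writes to a train_test_split/ directory); the 0.8 fraction below is just an example:

# Create an 80/20 train/test split of your NLU training data
rasa data split nlu --training-fraction 0.8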
Structure your test data in tests/test_data/nlu.yml:
nlu:
- intent: book_table
  examples: |
    - I want to book a table
    - reserve a spot for dinner
    - can I get a reservation for two
    - table for 4 on Friday
    - need a table tomorrow evening
- intent: cancel_booking
  examples: |
    - cancel my reservation
    - I need to cancel
    - forget the booking
    - please cancel
    - I changed my mind about dinner

Run evaluation:
rasa test nlu \
  --nlu tests/test_data/nlu.yml \
  --model models/ \
  --out results/nlu/

This writes intent_report.json, DIETClassifier_report.json, and entity_report.json to the output directory. Inspect these for low-performing intents.
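intent_report.json follows the scikit-learn classification-report layout: one metrics object per intent plus aggregate entries. An abridged, made-up example of the shape the threshold script below expects:

{
  "book_table": {"precision": 0.92, "recall": 0.88, "f1-score": 0.90, "support": 25},
  "cancel_booking": {"precision": 0.81, "recall": 0.76, "f1-score": 0.78, "support": 17},
  "accuracy": 0.86,
  "macro avg": {"precision": 0.87, "recall": 0.82, "f1-score": 0.84, "support": 42},
  "weighted avg": {"precision": 0.88, "recall": 0.86, "f1-score": 0.87, "support": 42}
}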
Asserting NLU Thresholds in CI
The raw test command exits 0 even when F1 is 0.5. You need a script that reads the JSON report and fails the build when any intent falls below your threshold:
# scripts/check_nlu_thresholds.py
import json
import sys

MIN_F1 = 0.85
MIN_PRECISION = 0.80
MIN_RECALL = 0.80

with open("results/nlu/intent_report.json") as f:
    report = json.load(f)

failures = []
for intent, metrics in report.items():
    if intent in ("accuracy", "macro avg", "weighted avg"):
        continue
    f1 = metrics.get("f1-score", 0)
    precision = metrics.get("precision", 0)
    recall = metrics.get("recall", 0)
    support = metrics.get("support", 0)
    if support == 0:
        continue  # Skip intents with no test examples
    if f1 < MIN_F1:
        failures.append(f" {intent}: F1={f1:.2f} (min {MIN_F1}), support={support}")
    if precision < MIN_PRECISION:
        failures.append(f" {intent}: precision={precision:.2f} (min {MIN_PRECISION})")
    if recall < MIN_RECALL:
        failures.append(f" {intent}: recall={recall:.2f} (min {MIN_RECALL})")

if failures:
    print("NLU threshold violations:")
    for failure in failures:
        print(failure)
    sys.exit(1)

print(f"All intents pass thresholds (F1>={MIN_F1}, P>={MIN_PRECISION}, R>={MIN_RECALL})")

Integrate into your CI:
rasa test nlu --nlu tests/test_data/nlu.yml --model models/ --out results/nlu/
python scripts/check_nlu_thresholds.py

Layer 2: Story Testing with rasa test core
Story files define expected conversation flows. rasa test core replays these stories against your trained model and reports failed predictions. Stories live in tests/test_stories.yml:
stories:
- story: happy path - book table with all params
  steps:
  - user: |
      I want to book a table for two on Friday at 7pm
    intent: book_table
    entities:
    - party_size: "2"
    - date: "friday"
    - time: "19:00"
  - action: action_check_availability
  - slot_was_set:
    - availability: true
  - action: utter_confirm_booking

- story: booking - missing date, elicitation
  steps:
  - user: |
      book a table for two
    intent: book_table
    entities:
    - party_size: "2"
  - action: utter_ask_date
  - user: |
      tomorrow
    intent: provide_date
    entities:
    - date: "tomorrow"
  - action: utter_ask_time

- story: cancellation flow
  steps:
  - user: |
      cancel my reservation
    intent: cancel_booking
  - action: utter_ask_cancellation_reference
  - user: |
      reservation 12345
    intent: provide_reference
    entities:
    - reference_number: "12345"
  - action: action_cancel_booking
  - action: utter_cancellation_confirmed

Run story tests:
rasa test core \
  --stories tests/test_stories.yml \
  --model models/ \
  --out results/core/

The output includes failed_test_stories.yml — a file containing only the stories that failed, with the actual vs expected action at the point of divergence. This is extremely useful for debugging.
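Each failed story is written back out with a comment marking the step where the prediction diverged, along these lines (illustrative, using one of the stories above):

stories:
- story: booking - missing date, elicitation
  steps:
  - intent: book_table
  - action: utter_ask_date  # predicted: utter_ask_time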
Layer 3: Custom Action Unit Testing with pytest
Custom actions are where your business logic lives, and they're the most important layer to unit test. The Rasa SDK's CollectingDispatcher captures whatever an action dispatches, and you can construct Tracker objects directly rather than mocking them.
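One note before the tests: they're async, so pytest needs the pytest-asyncio plugin (installed in the CI pipeline at the end of this guide). A minimal configuration, assuming you keep a pytest.ini at the project root:

# pytest.ini
[pytest]
asyncio_mode = strict
testpaths = tests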
# tests/test_actions.py
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from rasa_sdk import Tracker
from rasa_sdk.executor import CollectingDispatcher

from actions.action_check_availability import ActionCheckAvailability
from actions.action_cancel_booking import ActionCancelBooking


def make_tracker(slots=None, intent="book_table", entities=None):
    """Build a minimal Rasa Tracker for testing."""
    slots = slots or {}
    entities = entities or []
    return Tracker(
        sender_id="test-user",
        slots=slots,
        latest_message={
            "intent": {"name": intent, "confidence": 0.99},
            "entities": entities,
            "text": "test input",
        },
        events=[],
        paused=False,
        followup_action=None,
        active_loop=None,
        latest_action_name=None,
    )


@pytest.mark.asyncio
async def test_check_availability_when_available():
    tracker = make_tracker(slots={"date": "2026-06-15", "time": "19:00", "party_size": 2})
    dispatcher = CollectingDispatcher()
    domain = {}
    with patch("actions.action_check_availability.check_restaurant_availability") as mock_check:
        mock_check.return_value = True
        action = ActionCheckAvailability()
        events = await action.run(dispatcher, tracker, domain)
        assert any(
            e.get("event") == "slot" and e.get("name") == "availability" and e.get("value") is True
            for e in events
        ), "Expected availability slot to be set to True"
        assert len(dispatcher.messages) == 0  # Action sends no message when a table is available


@pytest.mark.asyncio
async def test_check_availability_when_unavailable():
    tracker = make_tracker(slots={"date": "2026-06-20", "time": "19:00", "party_size": 2})
    dispatcher = CollectingDispatcher()
    with patch("actions.action_check_availability.check_restaurant_availability") as mock_check:
        mock_check.return_value = False
        action = ActionCheckAvailability()
        events = await action.run(dispatcher, tracker, {})
        slot_events = [e for e in events if e.get("event") == "slot" and e.get("name") == "availability"]
        assert slot_events[0]["value"] is False


@pytest.mark.asyncio
async def test_cancel_booking_not_found():
    tracker = make_tracker(
        slots={"reference_number": "INVALID-999"},
        intent="provide_reference",
    )
    dispatcher = CollectingDispatcher()
    with patch("actions.action_cancel_booking.find_booking") as mock_find:
        mock_find.return_value = None
        action = ActionCancelBooking()
        await action.run(dispatcher, tracker, {})
        assert len(dispatcher.messages) == 1
        assert "not found" in dispatcher.messages[0]["text"].lower()

End-to-End Story Tests with pytest-rasa
The pytest-rasa package lets you write end-to-end conversation tests in pytest style, without needing a running Rasa server:
# tests/test_e2e_conversations.py
import pytest


@pytest.mark.asyncio
async def test_full_booking_conversation(rasa_app):
    """Test a complete booking conversation end-to-end."""
    client = rasa_app.test_client()

    # Turn 1: Initiate booking
    response = await client.post(
        "/webhooks/rest/webhook",
        json={"sender": "test-e2e-1", "message": "I want to book a table for two on Friday"}
    )
    assert response.status == 200
    data = await response.json()
    assert any("time" in msg.get("text", "").lower() for msg in data), \
        f"Expected time elicitation, got: {data}"

    # Turn 2: Provide time
    response = await client.post(
        "/webhooks/rest/webhook",
        json={"sender": "test-e2e-1", "message": "at 7pm"}
    )
    data = await response.json()
    assert any("confirm" in msg.get("text", "").lower() for msg in data), \
        f"Expected confirmation prompt, got: {data}"

    # Turn 3: Confirm
    response = await client.post(
        "/webhooks/rest/webhook",
        json={"sender": "test-e2e-1", "message": "yes"}
    )
    data = await response.json()
    assert any("confirmed" in msg.get("text", "").lower() for msg in data), \
        f"Expected booking confirmation, got: {data}"

Difference: Unit Tests vs Integration Tests for Actions
Unit tests for custom actions give you fast, isolated feedback — mock the database, mock external APIs, test the logic. Integration tests verify the full pipeline including NLU parsing, dialog management, and action execution.
Use unit tests during development (run in milliseconds) and integration tests in CI (run in seconds to minutes). Never run integration tests against a production Rasa server.
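If an integration test genuinely needs a live HTTP endpoint, start a throwaway local instance instead; a minimal sketch, assuming default ports and an endpoints.yml whose action_endpoint points at localhost:5055:

# Throwaway local Rasa server and action server for integration tests
rasa run actions --port 5055 &
rasa run --enable-api --port 5005 &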
A quick rule of thumb: if you're testing the logic inside an action (what it does given specific slots), use unit tests. If you're testing whether the right action gets triggered given a user utterance, use story tests or E2E tests.
CI Pipeline
name: Rasa Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - run: pip install rasa==3.6.0 pytest pytest-asyncio pytest-rasa
      - name: Train model
        run: rasa train --fixed-model-name test-model
      - name: NLU tests
        run: |
          rasa test nlu --nlu tests/test_data/nlu.yml --model models/test-model.tar.gz --out results/nlu/
          python scripts/check_nlu_thresholds.py
      - name: Story tests
        run: rasa test core --stories tests/test_stories.yml --model models/test-model.tar.gz --out results/core/
      - name: Action unit tests
        run: pytest tests/test_actions.py -v
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: rasa-test-results
          path: results/

HelpMeTest can monitor your deployed Rasa chatbot by running test conversations on a schedule, so you catch NLU drift and dialog regressions before your users do.