End-to-End Voice Assistant Testing: Alexa Skills, Google Actions, and Wake-Word Detection
Voice assistants are distributed systems. A user says "Alexa, reorder coffee" and the following chain executes: wake-word detection, audio capture, speech-to-text (STT) transcription, natural language understanding, skill dispatch, business logic, text-to-speech (TTS) synthesis, and audio playback. Each link can fail independently. End-to-end testing covers the full chain, not just the skill logic you wrote.
This guide covers testing strategies for Alexa Skills, Google Actions, and the wake-word detection pipelines that gate all of this.
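Because each link can fail independently, it helps when a failing run names the first broken link rather than only reporting a bad final response. The following is a minimal sketch of that idea. The stage names mirror the chain described above, while `StageResult` and `first_failure` are hypothetical helpers introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage order follows the chain described in the introduction.
PIPELINE_STAGES = [
    "wake_word", "audio_capture", "stt", "nlu",
    "skill_dispatch", "business_logic", "tts", "playback",
]


@dataclass
class StageResult:
    """Outcome of one pipeline stage (hypothetical type for this sketch)."""
    stage: str
    ok: bool
    detail: str = ""


def first_failure(results: List[StageResult]) -> Optional[StageResult]:
    """Return the earliest failing stage in pipeline order, or None if all passed.

    A stage that never reported is treated as a failure, because a missing
    result usually means an earlier link stopped the chain.
    """
    by_stage = {r.stage: r for r in results}
    for stage in PIPELINE_STAGES:
        result = by_stage.get(stage)
        if result is None:
            return StageResult(stage, False, "stage did not report")
        if not result.ok:
            return result
    return None
```

Reporting the earliest failure keeps downstream noise out of the test output: if STT returns the wrong text, the NLU and skill failures that follow are symptoms, not causes.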
The E2E Testing Challenge for Voice Assistants
Voice assistant testing has constraints that standard web or API testing doesn't:
Platform coupling. Alexa Skills and Google Actions run inside vendor infrastructure. You can unit test your skill logic in isolation, but integration and E2E tests must go through the platform's simulator or real hardware.
Non-deterministic NLU. The platform's natural language understanding (NLU) layer converts user speech to structured intents. The same utterance can map to different intents across platform versions. This is outside your control but inside your test scope.
Stateful conversations. Multi-turn interactions build on session state. Testing step 3 of a conversation requires the state from steps 1 and 2. Unlike with stateless REST APIs, you can't test a step without first reproducing or injecting that state.
Acoustic dependencies. Wake-word detection accuracy depends on device hardware, room acoustics, background noise, and user voice characteristics. Software tests approximate this, but hardware-in-the-loop tests catch problems software misses.
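One way software tests can approximate acoustic conditions is to mix recorded noise into clean speech at a controlled signal-to-noise ratio. The sketch below assumes mono floating-point audio arrays at the same sample rate; `mix_at_snr` is a hypothetical helper, not part of any wake-word SDK, and it does not guard against clipping.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so the result has the requested SNR in dB.

    Assumes mono float arrays sampled at the same rate (an assumption of
    this sketch). The noise is looped or truncated to the speech length.
    """
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both contain non-zero samples")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise
```

Generating fixtures this way lets a test suite sweep a range of SNR values from one clean recording, though it still cannot reproduce microphone placement or room reverberation, which is where hardware-in-the-loop tests remain necessary.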
Testing Alexa Skills
Unit Testing Skill Handlers
The Alexa Skills Kit (ASK) SDK decouples request handling from the platform. Unit test your handlers with mock request objects:
# tests/unit/test_order_coffee_handler.py
import json
from typing import Optional
from unittest.mock import MagicMock, patch

import pytest


def build_intent_request(intent_name: str, slots: Optional[dict] = None) -> dict:
    """Build a mock Alexa intent request for testing."""
    return {
        "version": "1.0",
        "session": {
            "sessionId": "test-session-123",
            "application": {"applicationId": "amzn1.ask.skill.test"},
            "user": {"userId": "test-user-123"},
            "new": False,
            "attributes": {}
        },
        "context": {
            "System": {
                "application": {"applicationId": "amzn1.ask.skill.test"},
                "user": {"userId": "test-user-123"}
            }
        },
        "request": {
            "type": "IntentRequest",
            "requestId": "test-request-123",
            "intent": {
                "name": intent_name,
                "confirmationStatus": "NONE",
                "slots": {
                    slot_name: {
                        "name": slot_name,
                        "value": slot_value,
                        "confirmationStatus": "NONE"
                    }
                    for slot_name, slot_value in (slots or {}).items()
                }
            }
        }
    }


class TestOrderCoffeeHandler:
    def test_order_coffee_with_quantity(self):
        """Ordering coffee with a quantity should confirm the order."""
        from myskill.handlers import handle_order_coffee

        request = build_intent_request(
            "OrderCoffeeIntent",
            slots={"quantity": "2", "roast": "medium"}
        )
        response = handle_order_coffee(request)

        assert response["response"]["outputSpeech"]["type"] == "SSML"
        speech = response["response"]["outputSpeech"]["ssml"]
        assert "2" in speech or "two" in speech.lower()
        # Session stays open because the skill expects a confirmation
        assert response["response"]["shouldEndSession"] is False

    def test_order_coffee_missing_quantity_asks(self):
        """Missing required slot should trigger elicitation dialog."""
        from myskill.handlers import handle_order_coffee

        request = build_intent_request("OrderCoffeeIntent", slots={})
        response = handle_order_coffee(request)

        # Should enter dialog delegation for slot elicitation
        assert response["response"].get("directives") is not None
        directives = response["response"]["directives"]
        dialog_directives = [d for d in directives if d["type"].startswith("Dialog")]
        assert len(dialog_directives) > 0

    def test_session_attributes_persist_between_turns(self):
        """Order context should persist across multiple conversation turns."""
        from myskill.handlers import handle_order_coffee, handle_confirm_order

        # Turn 1: initial order
        request1 = build_intent_request(
            "OrderCoffeeIntent",
            slots={"quantity": "1"}
        )
        response1 = handle_order_coffee(request1)
        session_attrs = response1["sessionAttributes"]
        assert "pending_order" in session_attrs

        # Turn 2: confirmation with session state from turn 1
        request2 = build_intent_request("AMAZON.YesIntent")
        request2["session"]["attributes"] = session_attrs
        response2 = handle_confirm_order(request2)
        assert "confirmed" in response2["response"]["outputSpeech"]["ssml"].lower()
Integration Testing with ASK CLI
The ASK CLI simulator lets you test the full NLU pipeline without hardware:
# Install ASK CLI
npm install -g ask-cli
# Simulate an intent
ask simulate \
  --skill-id amzn1.ask.skill.your-skill-id \
  --locale en-US \
  --text "order two medium roast coffees"
Automate this in Python:
import json
import subprocess

import pytest


def ask_simulate(skill_id: str, text: str, locale: str = "en-US") -> dict:
    """Run ASK CLI simulator and return parsed response."""
    # Flag names and output format can differ between ASK CLI versions;
    # confirm them with `ask simulate --help` for your installed version.
    result = subprocess.run([
        "ask", "simulate",
        "--skill-id", skill_id,
        "--locale", locale,
        "--text", text,
        "--format", "json"
    ], capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        raise RuntimeError(f"ASK CLI error: {result.stderr}")
    return json.loads(result.stdout)


@pytest.mark.integration
class TestAlexaE2E:
    SKILL_ID = "amzn1.ask.skill.your-skill-id"

    HAPPY_PATH_UTTERANCES = [
        ("order two medium roast coffees", "2", "medium"),
        ("add one dark roast to my order", "1", "dark"),
        ("I want three light roast", "3", "light"),
    ]

    @pytest.mark.parametrize("utterance,expected_qty,expected_roast", HAPPY_PATH_UTTERANCES)
    def test_order_intent_recognition(self, utterance, expected_qty, expected_roast):
        """NLU should map utterances to OrderCoffeeIntent with correct slots."""
        response = ask_simulate(self.SKILL_ID, utterance)
        invocations = response.get("result", {}).get("skillExecutionInfo", {}).get("invocations", [])
        assert invocations, f"No invocations found for: {utterance}"

        intent_request = invocations[0]["invocationRequest"]["body"]["request"]
        assert intent_request["intent"]["name"] == "OrderCoffeeIntent"

        slots = intent_request["intent"]["slots"]
        assert slots.get("quantity", {}).get("value") == expected_qty
        assert slots.get("roast", {}).get("value") == expected_roast

    def test_cancel_intent_ends_session(self):
        """Saying cancel should end the session cleanly."""
        response = ask_simulate(self.SKILL_ID, "cancel")
        skill_response = response["result"]["skillExecutionInfo"]["invocations"][0]["invocationResponse"]
        assert skill_response["body"]["response"]["shouldEndSession"] is True
Testing Google Actions
Google Actions use a different SDK, but similar patterns apply:
# tests/unit/test_google_action_handlers.py
from typing import Optional
from unittest.mock import MagicMock

import pytest


def build_google_intent_request(intent: str, params: Optional[dict] = None) -> dict:
    """Build a mock Google Actions webhook request."""
    return {
        "handler": {"name": intent},
        "intent": {
            "name": intent,
            "params": {
                key: {"original": val, "resolved": val}
                for key, val in (params or {}).items()
            }
        },
        "scene": {
            "name": "OrderScene",
            "slotFillingStatus": "FINAL" if params else "COLLECTING"
        },
        "session": {
            "id": "test-session-123",
            "params": {},
            "languageCode": "en-US"
        },
        "user": {"locale": "en-US"}
    }


class TestGoogleActionHandlers:
    def test_order_handler_returns_simple_response(self):
        """Order handler should return a valid Actions response."""
        from myaction.handlers import handle_order

        request = build_google_intent_request(
            "order_coffee",
            params={"quantity": "2", "roast_type": "medium"}
        )
        response = handle_order(request)

        assert "prompt" in response
        assert "firstSimple" in response["prompt"]
        speech = response["prompt"]["firstSimple"]["speech"]
        assert len(speech) > 0

    def test_account_linking_required_for_orders(self):
        """Orders require account linking; unauthenticated users get a sign-in prompt."""
        from myaction.handlers import handle_order

        # No user account token
        request = build_google_intent_request("order_coffee")
        response = handle_order(request)

        # Should trigger account linking. The scene-name check below is an
        # assumption; match it to the account-linking scene in your Action.
        next_scene = response.get("scene", {}).get("next", {}).get("name", "")
        assert (
            response.get("requestLinkPermission") is not None
            or "accountlinking" in next_scene.lower().replace("_", "")
            or "sign in" in str(response).lower()
        )
Wake-Word Detection Testing
Wake-word detection sits at the front of the entire pipeline. False negatives mean users can't activate the device. False positives drain battery and compromise privacy.
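These two failure modes are commonly tracked as a false reject rate (missed activations over all wake-word clips) and false accepts per hour of non-wake-word audio. The sketch below shows one way to compute both from test-run counts; the `WakeWordStats` class and its field names are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WakeWordStats:
    """Counts collected from a wake-word test run (hypothetical structure)."""
    positive_clips: int            # clips that contain the wake word
    detected_clips: int            # positive clips where the detector fired
    false_accepts: int             # detections on audio without the wake word
    negative_audio_seconds: float  # total duration of non-wake-word audio

    @property
    def false_reject_rate(self) -> float:
        """Fraction of wake-word clips the detector missed."""
        if self.positive_clips <= 0:
            raise ValueError("need at least one positive clip")
        return 1.0 - self.detected_clips / self.positive_clips

    @property
    def false_accepts_per_hour(self) -> float:
        """Spurious activations normalized to one hour of negative audio."""
        if self.negative_audio_seconds <= 0:
            raise ValueError("need a positive duration of negative audio")
        return self.false_accepts / (self.negative_audio_seconds / 3600.0)
```

Normalizing false accepts by audio duration makes runs with different fixture sets comparable, which matters once the negative corpus grows beyond a handful of clips.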
Software Testing Wake-Word Models
# tests/test_wake_word_detection.py
import numpy as np
import pytest
import soundfile as sf


class TestWakeWordDetection:
    """
    Tests for a custom wake-word detector.
    Adjust for your specific wake-word engine (Porcupine, Precise, etc.)
    """
    WAKE_WORD = "hey_helpmetest"

    @pytest.fixture(scope="class")
    def detector(self):
        from myapp.wakeword import WakeWordDetector
        return WakeWordDetector(
            keyword=self.WAKE_WORD,
            sensitivity=0.5,
            model_path="models/wake_word.ppn"
        )

    def test_detects_wake_word_in_clean_audio(self, detector):
        """Should reliably detect wake word in clean studio conditions."""
        audio, sr = sf.read("tests/fixtures/audio/wake_word_clean.wav")
        assert sr == 16000, "Wake-word models typically require 16kHz"
        detections = detector.process(audio)
        assert len(detections) == 1, (
            f"Expected 1 detection, got {len(detections)}. "
            "False negative in clean audio is unacceptable."
        )

    def test_detects_wake_word_with_background_noise(self, detector):
        """Should detect the wake word with office noise 15 dB below the speech level (15 dB SNR)."""
        audio, sr = sf.read("tests/fixtures/audio/wake_word_office_noise.wav")
        detections = detector.process(audio)
        # At least one detection is required; duplicate triggers are tolerated in noise
        assert len(detections) >= 1, "Wake word missed in office noise (15 dB SNR)"

    def test_no_false_positives_on_speech(self, detector):
        """Conversational speech should not trigger the wake word detector."""
        audio, sr = sf.read("tests/fixtures/audio/conversation_no_wake_word.wav")
        detections = detector.process(audio)
        assert len(detections) == 0, (
            f"False positive: wake word triggered {len(detections)} times "
            "during non-wake-word speech."
        )

    def test_no_false_positives_on_music(self, detector):
        """Music and TV audio should not trigger the wake word."""
        for fixture in ["tv_audio.wav", "music_pop.wav", "music_classical.wav"]:
            audio, sr = sf.read(f"tests/fixtures/audio/{fixture}")
            detections = detector.process(audio)
            assert len(detections) == 0, f"False positive on {fixture}"

    def test_sensitivity_tradeoff(self, detector):
        """Higher sensitivity should increase true positives and false positives."""
        from myapp.wakeword import WakeWordDetector

        audio_tp, _ = sf.read("tests/fixtures/audio/wake_word_distant.wav")
        audio_fp, _ = sf.read("tests/fixtures/audio/similar_sounding_word.wav")

        low_sensitivity = WakeWordDetector(keyword=self.WAKE_WORD, sensitivity=0.3)
        high_sensitivity = WakeWordDetector(keyword=self.WAKE_WORD, sensitivity=0.8)

        # Low sensitivity may miss the distant wake word
        low_tp = len(low_sensitivity.process(audio_tp))
        # High sensitivity should catch the distant wake word
        high_tp = len(high_sensitivity.process(audio_tp))
        # High sensitivity tends to produce more false positives
        high_fp = len(high_sensitivity.process(audio_fp))

        assert high_tp >= low_tp, "Higher sensitivity should catch more true positives"
        # high_fp is measured but not asserted on; log or track it to observe the tradeoff
Full Pipeline E2E Test
Test the complete chain with a simulated voice input:
import os

import pytest
import soundfile as sf


@pytest.mark.e2e
def test_full_voice_pipeline():
    """
    End-to-end: audio → wake word → STT → NLU → skill → TTS → audio output.
    This test catches integration failures across the whole chain.
    """
    from myapp.pipeline import VoicePipeline

    pipeline = VoicePipeline(
        wake_word_model="models/wake_word.ppn",
        stt_api_key=os.environ["STT_API_KEY"],
        skill_endpoint="https://api.myskill.com/handle",
        tts_api_key=os.environ["TTS_API_KEY"]
    )

    # Load audio with wake word + command
    audio, sr = sf.read("tests/fixtures/audio/hey_helpmetest_order_coffee.wav")
    result = pipeline.process(audio)

    # Wake word was detected
    assert result.wake_word_detected, "Wake word not detected"

    # STT produced text
    assert result.transcription, "STT returned empty transcription"
    assert "coffee" in result.transcription.lower(), (
        f"Expected 'coffee' in transcription, got: {result.transcription}"
    )

    # NLU mapped to correct intent
    assert result.intent == "OrderCoffeeIntent", (
        f"Expected OrderCoffeeIntent, got: {result.intent}"
    )

    # Skill returned a response
    assert result.skill_response, "Skill returned no response"

    # TTS generated audio
    assert result.audio_response and len(result.audio_response) > 0, (
        "TTS returned empty audio"
    )
Continuous Monitoring
Voice assistant pipelines can degrade without warning: STT model updates, NLU retraining, and infrastructure changes all affect the full stack. Schedule daily E2E tests using HelpMeTest to run against a staging environment and catch regressions before users do. For Alexa Skills, run the ASK simulator tests on a schedule; for Google Actions, the Actions on Google simulator API supports programmatic testing.
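A scheduled run is most useful when it compares today's results against a known-good baseline, so that an NLU retraining that silently remaps an utterance shows up as a diff. The following is a minimal sketch of that comparison, assuming each run produces a mapping from utterance text to resolved intent name; `find_intent_drift` is a hypothetical helper, not part of any platform SDK.

```python
from typing import Dict, List


def find_intent_drift(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Describe every baseline utterance whose resolved intent has changed.

    Both arguments map utterance text to intent name. An utterance missing
    from the current run is reported with an intent of None.
    """
    drift = []
    for utterance, expected in sorted(baseline.items()):
        actual = current.get(utterance)
        if actual != expected:
            drift.append(f"{utterance!r}: expected {expected}, got {actual}")
    return drift


# Usage: fail the monitoring job when any utterance has drifted.
baseline = {"order two coffees": "OrderCoffeeIntent", "cancel": "AMAZON.CancelIntent"}
current = {"order two coffees": "OrderCoffeeIntent", "cancel": "AMAZON.StopIntent"}
for line in find_intent_drift(baseline, current):
    print("NLU drift:", line)
```

Storing the baseline alongside the test fixtures keeps intentional changes reviewable: updating an expected intent becomes an explicit commit rather than a silent shift.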
Conclusion
End-to-end voice assistant testing requires covering three distinct layers: skill handler logic (unit tests), platform NLU mapping (ASK/Actions simulator integration tests), and the full audio pipeline (E2E tests with audio fixtures). Wake-word testing adds a fourth layer that requires careful fixture construction to measure the sensitivity/accuracy tradeoff. Build all four layers, run the expensive ones daily rather than on every commit, and treat any regression in wake-word false positives as a severity-1 issue, because they're the first thing users notice.