End-to-End Voice Assistant Testing: Alexa Skills, Google Actions, and Wake-Word Detection

Voice assistants are distributed systems. A user says "Hey Alexa, reorder coffee" and the following chain executes: wake-word detection, audio capture, speech-to-text (STT) transcription, natural language understanding, skill dispatch, business logic, text-to-speech (TTS) synthesis, and audio playback. Each link can fail independently. End-to-end testing covers the full chain, not just the skill logic you wrote.

This guide covers testing strategies for Alexa Skills, Google Actions, and the wake-word detection pipelines that gate all of this.

The E2E Testing Challenge for Voice Assistants

Voice assistant testing has constraints that standard web or API testing doesn't:

Platform coupling. Alexa Skills and Google Actions run inside vendor infrastructure. You can unit test your skill logic in isolation, but integration and E2E tests must go through the platform's simulator or real hardware.

Non-deterministic NLU. The platform's natural language understanding (NLU) layer converts user speech to structured intents. The same utterance can map to different intents across platform versions. This is outside your control but inside your test scope.

Stateful conversations. Multi-turn interactions build on session state. Testing step 3 of a conversation requires the state from steps 1 and 2. Unlike a stateless REST API, a step can't be tested in isolation unless you first replay the earlier turns or inject their session state into the request.

Acoustic dependencies. Wake-word detection accuracy depends on device hardware, room acoustics, background noise, and user voice characteristics. Software tests approximate this, but hardware-in-the-loop tests catch problems software misses.
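
The stateful-conversation constraint is usually handled by replaying turns in order and carrying session attributes forward. The following is a minimal sketch, assuming handlers accept and return Alexa-style JSON dicts (`session.attributes` in the request, `sessionAttributes` in the response). `replay_conversation` and the two toy handlers are hypothetical names used only for illustration.

```python
from copy import deepcopy
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], dict]


def replay_conversation(turns: List[Tuple[Handler, dict]]) -> List[dict]:
    """Run handlers in order, giving each turn the session attributes
    returned by the previous turn. Returns every response."""
    session_attrs: Dict = {}
    responses = []
    for handler, request in turns:
        request = deepcopy(request)  # do not mutate the caller's fixture
        request.setdefault("session", {})["attributes"] = dict(session_attrs)
        response = handler(request)
        session_attrs = response.get("sessionAttributes", {})
        responses.append(response)
    return responses


# Toy handlers standing in for real skill handlers (hypothetical).
def start_order(request: dict) -> dict:
    return {
        "sessionAttributes": {"pending_order": {"quantity": 1}},
        "response": {"shouldEndSession": False},
    }


def confirm_order(request: dict) -> dict:
    pending = request["session"]["attributes"].get("pending_order")
    return {
        "sessionAttributes": {},
        "response": {"shouldEndSession": pending is not None},
    }


responses = replay_conversation([(start_order, {}), (confirm_order, {})])
```

Each test can then assert on any turn in `responses` without hand-copying state between requests.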

Testing Alexa Skills

Unit Testing Skill Handlers

The Alexa Skills Kit (ASK) SDK decouples request handling from the platform. Unit test your handlers with mock request objects:

# tests/unit/test_order_coffee_handler.py
import pytest
from unittest.mock import MagicMock, patch
import json

def build_intent_request(intent_name: str, slots: dict = None) -> dict:
    """Build a mock Alexa intent request for testing."""
    return {
        "version": "1.0",
        "session": {
            "sessionId": "test-session-123",
            "application": {"applicationId": "amzn1.ask.skill.test"},
            "user": {"userId": "test-user-123"},
            "new": False,
            "attributes": {}
        },
        "context": {
            "System": {
                "application": {"applicationId": "amzn1.ask.skill.test"},
                "user": {"userId": "test-user-123"}
            }
        },
        "request": {
            "type": "IntentRequest",
            "requestId": "test-request-123",
            "intent": {
                "name": intent_name,
                "confirmationStatus": "NONE",
                "slots": {
                    slot_name: {
                        "name": slot_name,
                        "value": slot_value,
                        "confirmationStatus": "NONE"
                    }
                    for slot_name, slot_value in (slots or {}).items()
                }
            }
        }
    }

class TestOrderCoffeeHandler:
    
    def test_order_coffee_with_quantity(self):
        """Ordering coffee with a quantity should confirm the order."""
        from myskill.handlers import handle_order_coffee
        
        request = build_intent_request(
            "OrderCoffeeIntent",
            slots={"quantity": "2", "roast": "medium"}
        )
        
        response = handle_order_coffee(request)
        
        assert response["response"]["outputSpeech"]["type"] == "SSML"
        speech = response["response"]["outputSpeech"]["ssml"]
        assert "2" in speech or "two" in speech.lower()
        assert response["response"]["shouldEndSession"] is False  # Expects confirmation
    
    def test_order_coffee_missing_quantity_asks(self):
        """Missing required slot should trigger elicitation dialog."""
        from myskill.handlers import handle_order_coffee
        
        request = build_intent_request("OrderCoffeeIntent", slots={})
        response = handle_order_coffee(request)
        
        # Should enter dialog delegation for slot elicitation
        assert response["response"].get("directives") is not None
        directives = response["response"]["directives"]
        dialog_directives = [d for d in directives if d["type"].startswith("Dialog")]
        assert len(dialog_directives) > 0
    
    def test_session_attributes_persist_between_turns(self):
        """Order context should persist across multiple conversation turns."""
        from myskill.handlers import handle_order_coffee, handle_confirm_order
        
        # Turn 1: initial order
        request1 = build_intent_request(
            "OrderCoffeeIntent",
            slots={"quantity": "1"}
        )
        response1 = handle_order_coffee(request1)
        
        session_attrs = response1["sessionAttributes"]
        assert "pending_order" in session_attrs
        
        # Turn 2: confirmation with session state from turn 1
        request2 = build_intent_request("AMAZON.YesIntent")
        request2["session"]["attributes"] = session_attrs
        
        response2 = handle_confirm_order(request2)
        
        assert "confirmed" in response2["response"]["outputSpeech"]["ssml"].lower()

Integration Testing with ASK CLI

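The ASK CLI simulator lets you test the full NLU pipeline without hardware. Command names and flags differ between ASK CLI versions (newer releases expose simulation under `ask smapi`), so confirm the exact invocation with `ask --help` for your install: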

# Install ASK CLI
npm install -g ask-cli

# Simulate an intent
ask simulate \
  --skill-id amzn1.ask.skill.your-skill-id \
  --locale en-US \
  --text "order two medium roast coffees"

Automate this in Python:

import subprocess
import json
import pytest

def ask_simulate(skill_id: str, text: str, locale: str = "en-US") -> dict:
    """Run ASK CLI simulator and return parsed response."""
    result = subprocess.run([
        "ask", "simulate",
        "--skill-id", skill_id,
        "--locale", locale,
        "--text", text,
        "--format", "json"
    ], capture_output=True, text=True, timeout=30)
    
    if result.returncode != 0:
        raise RuntimeError(f"ASK CLI error: {result.stderr}")
    
    return json.loads(result.stdout)

@pytest.mark.integration
class TestAlexaE2E:
    
    SKILL_ID = "amzn1.ask.skill.your-skill-id"
    
    HAPPY_PATH_UTTERANCES = [
        ("order two medium roast coffees", "2", "medium"),
        ("add one dark roast to my order", "1", "dark"),
        ("I want three light roast", "3", "light"),
    ]
    
    @pytest.mark.parametrize("utterance,expected_qty,expected_roast", HAPPY_PATH_UTTERANCES)
    def test_order_intent_recognition(self, utterance, expected_qty, expected_roast):
        """NLU should map utterances to OrderCoffeeIntent with correct slots."""
        response = ask_simulate(self.SKILL_ID, utterance)
        
        invocations = response.get("result", {}).get("skillExecutionInfo", {}).get("invocations", [])
        assert invocations, f"No invocations found for: {utterance}"
        
        intent_request = invocations[0]["invocationRequest"]["body"]["request"]
        assert intent_request["intent"]["name"] == "OrderCoffeeIntent"
        
        slots = intent_request["intent"]["slots"]
        assert slots.get("quantity", {}).get("value") == expected_qty
        assert slots.get("roast", {}).get("value") == expected_roast
    
    def test_cancel_intent_ends_session(self):
        """Saying cancel should end the session cleanly."""
        response = ask_simulate(self.SKILL_ID, "cancel")
        
        skill_response = response["result"]["skillExecutionInfo"]["invocations"][0]["invocationResponse"]
        assert skill_response["body"]["response"]["shouldEndSession"] is True
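
Because NLU mapping can shift between platform versions, it may help to track accuracy over a larger utterance set alongside the per-utterance assertions above. This is a rough sketch, assuming you have already collected (expected intent, actual intent) pairs from simulator runs. `intent_accuracy` and the sample data are illustrative and not part of any SDK.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def intent_accuracy(results: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Per-intent accuracy from (expected_intent, actual_intent) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, actual in results:
        totals[expected] += 1
        if expected == actual:
            correct[expected] += 1
    return {intent: correct[intent] / totals[intent] for intent in totals}


# Hypothetical results gathered from simulator runs.
observed = [
    ("OrderCoffeeIntent", "OrderCoffeeIntent"),
    ("OrderCoffeeIntent", "OrderCoffeeIntent"),
    ("OrderCoffeeIntent", "AMAZON.FallbackIntent"),
    ("AMAZON.CancelIntent", "AMAZON.CancelIntent"),
]
accuracy = intent_accuracy(observed)
```

A threshold assertion on each intent's accuracy tolerates a single flaky utterance while still failing when a whole intent drifts.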

Testing Google Actions

Google Actions use a different SDK, but the same patterns apply:

# tests/unit/test_google_action_handlers.py
import pytest
from unittest.mock import MagicMock

def build_google_intent_request(intent: str, params: dict = None) -> dict:
    """Build a mock Google Actions webhook request."""
    return {
        "handler": {"name": intent},
        "intent": {
            "name": intent,
            "params": {
                key: {"original": val, "resolved": val}
                for key, val in (params or {}).items()
            }
        },
        "scene": {
            "name": "OrderScene",
            "slotFillingStatus": "FINAL" if params else "COLLECTING"
        },
        "session": {
            "id": "test-session-123",
            "params": {},
            "languageCode": "en-US"
        },
        "user": {"locale": "en-US"}
    }

class TestGoogleActionHandlers:
    
    def test_order_handler_returns_simple_response(self):
        """Order handler should return a valid Actions response."""
        from myaction.handlers import handle_order
        
        request = build_google_intent_request(
            "order_coffee",
            params={"quantity": "2", "roast_type": "medium"}
        )
        
        response = handle_order(request)
        
        assert "prompt" in response
        assert "firstSimple" in response["prompt"]
        speech = response["prompt"]["firstSimple"]["speech"]
        assert len(speech) > 0
    
    def test_account_linking_required_for_orders(self):
        """Orders require account linking — unauthenticated users get sign-in prompt."""
        from myaction.handlers import handle_order
        
        request = build_google_intent_request("order_coffee")
        # No user account token
        
        response = handle_order(request)
        
        # Should hand off to account linking. The scene name is project-specific;
        # "AccountLinking" is an assumed name here.
        next_scene = response.get("scene", {}).get("next", {}).get("name", "")
        assert (
            "accountlinking" in next_scene.lower()
            or "sign in" in str(response).lower()
        )

Wake-Word Detection Testing

Wake-word detection sits at the front of the entire pipeline. False negatives mean users can't activate the device. False positives drain battery and compromise privacy.
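
These two failure modes are commonly tracked as a false reject rate (missed wake words divided by wake words spoken) and false accepts per hour of non-wake-word audio. Here is a small sketch of both calculations. The function names and sample numbers are illustrative assumptions, not measurements.

```python
def false_reject_rate(detected: int, total_wake_utterances: int) -> float:
    """Fraction of spoken wake words the detector missed."""
    if total_wake_utterances <= 0:
        raise ValueError("total_wake_utterances must be positive")
    return 1.0 - detected / total_wake_utterances


def false_accepts_per_hour(false_accepts: int, audio_seconds: float) -> float:
    """False triggers normalized to one hour of non-wake-word audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return false_accepts / (audio_seconds / 3600.0)


# Illustrative numbers only.
frr = false_reject_rate(detected=95, total_wake_utterances=100)
fa_per_hour = false_accepts_per_hour(false_accepts=2, audio_seconds=4 * 3600)
```

Reporting both numbers together keeps a sensitivity change from improving one metric while quietly worsening the other.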

Software Testing Wake-Word Models

# tests/test_wake_word_detection.py
import pytest
import numpy as np
import soundfile as sf

class TestWakeWordDetection:
    """
    Tests for a custom wake-word detector.
    Adjust for your specific wake-word engine (Porcupine, Precise, etc.)
    """
    
    WAKE_WORD = "hey_helpmetest"
    
    @pytest.fixture(scope="class")
    def detector(self):
        from myapp.wakeword import WakeWordDetector
        return WakeWordDetector(
            keyword=self.WAKE_WORD,
            sensitivity=0.5,
            model_path="models/wake_word.ppn"
        )
    
    def test_detects_wake_word_in_clean_audio(self, detector):
        """Should reliably detect wake word in clean studio conditions."""
        audio, sr = sf.read("tests/fixtures/audio/wake_word_clean.wav")
        assert sr == 16000, "Wake-word models typically require 16kHz"
        
        detections = detector.process(audio)
        assert len(detections) == 1, (
            f"Expected 1 detection, got {len(detections)}. "
            "False negative in clean audio is unacceptable."
        )
    
    def test_detects_wake_word_with_background_noise(self, detector):
        """Should detect wake word in office noise at 15 dB SNR."""
        audio, sr = sf.read("tests/fixtures/audio/wake_word_office_noise.wav")
        
        detections = detector.process(audio)
        # The fixture contains one wake word; it must still be detected under noise
        assert len(detections) >= 1, "Wake word missed in office noise (15 dB SNR)"
    
    def test_no_false_positives_on_speech(self, detector):
        """Conversational speech should not trigger the wake word detector."""
        audio, sr = sf.read("tests/fixtures/audio/conversation_no_wake_word.wav")
        
        detections = detector.process(audio)
        assert len(detections) == 0, (
            f"False positive: wake word triggered {len(detections)} times "
            "during non-wake-word speech."
        )
    
    def test_no_false_positives_on_music(self, detector):
        """Music and TV audio should not trigger the wake word."""
        for fixture in ["tv_audio.wav", "music_pop.wav", "music_classical.wav"]:
            audio, sr = sf.read(f"tests/fixtures/audio/{fixture}")
            detections = detector.process(audio)
            assert len(detections) == 0, f"False positive on {fixture}"
    
    def test_sensitivity_tradeoff(self, detector):
        """Higher sensitivity should increase true positives and false positives."""
        from myapp.wakeword import WakeWordDetector
        
        audio_tp, _ = sf.read("tests/fixtures/audio/wake_word_distant.wav")
        audio_fp, _ = sf.read("tests/fixtures/audio/similar_sounding_word.wav")
        
        low_sensitivity = WakeWordDetector(keyword=self.WAKE_WORD, sensitivity=0.3)
        high_sensitivity = WakeWordDetector(keyword=self.WAKE_WORD, sensitivity=0.8)
        
        # Low sensitivity should miss distant wake word
        low_tp = len(low_sensitivity.process(audio_tp))
        
        # High sensitivity should catch distant wake word
        high_tp = len(high_sensitivity.process(audio_tp))
        
        # High sensitivity produces more false positives
        high_fp = len(high_sensitivity.process(audio_fp))
        
        assert high_tp >= low_tp, "Higher sensitivity should catch more true positives"
        # Don't assert on FP count — just verify tradeoff is measurable
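
Noisy fixtures like the ones used above can be generated by mixing a clean recording with a noise recording at a chosen signal-to-noise ratio. This is a sketch under stated assumptions: both signals are mono float arrays at the same sample rate, and `mix_at_snr` is a hypothetical helper, not part of any wake-word engine. The synthetic sine wave and Gaussian noise stand in for real recordings.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise so speech power / noise power equals snr_db, then add it."""
    if len(noise) < len(speech):
        # Loop the noise if it is shorter than the speech clip
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        raise ValueError("noise signal is silent")

    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise


# Synthetic stand-ins for real recordings (assumed 16 kHz mono).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
speech = 0.5 * np.sin(2 * np.pi * 220 * t)
noise = rng.normal(0.0, 0.1, size=8000)
mixed = mix_at_snr(speech, noise, snr_db=15.0)
```

The sketch does not guard against clipping, so check the peak level before writing the mixed signal to a fixed-point WAV file.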

Full Pipeline E2E Test

Test the complete chain with a simulated voice input:

import os

import pytest
import soundfile as sf

@pytest.mark.e2e
def test_full_voice_pipeline():
    """
    End-to-end: audio → wake word → STT → NLU → skill → TTS → audio output.
    This test catches integration failures across the whole chain.
    """
    from myapp.pipeline import VoicePipeline
    
    pipeline = VoicePipeline(
        wake_word_model="models/wake_word.ppn",
        stt_api_key=os.environ["STT_API_KEY"],
        skill_endpoint="https://api.myskill.com/handle",
        tts_api_key=os.environ["TTS_API_KEY"]
    )
    
    # Load audio with wake word + command
    audio, sr = sf.read("tests/fixtures/audio/hey_helpmetest_order_coffee.wav")
    
    result = pipeline.process(audio)
    
    # Wake word was detected
    assert result.wake_word_detected, "Wake word not detected"
    
    # STT produced text
    assert result.transcription, "STT returned empty transcription"
    assert "coffee" in result.transcription.lower(), (
        f"Expected 'coffee' in transcription, got: {result.transcription}"
    )
    
    # NLU mapped to correct intent
    assert result.intent == "OrderCoffeeIntent", (
        f"Expected OrderCoffeeIntent, got: {result.intent}"
    )
    
    # Skill returned a response
    assert result.skill_response, "Skill returned no response"
    
    # TTS generated audio
    assert result.audio_response and len(result.audio_response) > 0, (
        "TTS returned empty audio"
    )
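
When a pipeline test like this fails, it helps to know which stage broke first. Below is a minimal sketch that walks the stages in order and returns the first one with no output. `PipelineResult` is a stand-in that only mirrors the field names used in the test above, and `first_failed_stage` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineResult:
    """Stand-in for the pipeline result; field names follow the test above."""
    wake_word_detected: bool = False
    transcription: str = ""
    intent: Optional[str] = None
    skill_response: Optional[dict] = None
    audio_response: bytes = b""


def first_failed_stage(result: PipelineResult) -> Optional[str]:
    """Return the first stage that produced no output, or None if all did."""
    checks = [
        ("wake_word", result.wake_word_detected),
        ("stt", result.transcription),
        ("nlu", result.intent),
        ("skill", result.skill_response),
        ("tts", result.audio_response),
    ]
    for stage, output in checks:
        if not output:
            return stage
    return None


partial = PipelineResult(wake_word_detected=True, transcription="order coffee")
failed = first_failed_stage(partial)
```

Including the failing stage name in the assertion message shortens triage when the test runs unattended.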

Continuous Monitoring

Voice assistant pipelines can degrade without warning: STT model updates, NLU retraining, and infrastructure changes all affect the full stack. Schedule daily E2E tests using HelpMeTest to run against a staging environment and catch regressions before users do. For Alexa Skills, run the ASK simulator tests on a schedule; for Google Actions, the Actions on Google simulator API supports programmatic testing.
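
One way to turn scheduled runs into alerts is to compare each run's metrics against a stored baseline. This is a hedged sketch: the metric names, values, and the 0.02 tolerance are illustrative assumptions, all metrics are treated as "higher is better", and `find_regressions` is a hypothetical helper.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Names of metrics that fell more than `tolerance` below baseline.
    A metric missing from `current` is also reported."""
    regressions = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or base_value - value > tolerance:
            regressions.append(name)
    return sorted(regressions)


# Illustrative values only.
baseline = {"intent_accuracy": 0.97, "wake_word_recall": 0.95, "stt_keyword_hit_rate": 0.99}
today = {"intent_accuracy": 0.96, "wake_word_recall": 0.90, "stt_keyword_hit_rate": 0.99}
regressed = find_regressions(baseline, today)
```

The tolerance absorbs day-to-day noise from non-deterministic NLU and STT while still flagging sustained drops.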

Conclusion

End-to-end voice assistant testing requires covering three distinct layers: skill handler logic (unit tests), platform NLU mapping (ASK/Actions simulator integration tests), and the full audio pipeline (E2E tests with audio fixtures). Wake-word testing adds a fourth layer that requires careful fixture construction to measure the sensitivity/accuracy tradeoff. Build all four layers, run the expensive ones daily rather than on every commit, and treat any regression in wake-word false positives as a severity-1 issue, because they're the first thing users notice.
