How to Test Amazon Strands Agents Before They Hit Production

Amazon released the Strands Agents SDK in May 2025. If your team is already deep in the AWS ecosystem — Bedrock models, Lambda, DynamoDB — Strands is the natural path to production AI agents. It integrates cleanly with AWS services and handles the orchestration layer for you.

But clean integration and correct behavior are different things. An agent that connects to AWS correctly and still gives wrong answers, calls the wrong tools, or fails silently on edge cases is a production problem regardless of how well it's architected.

This is the Strands testing problem: the AWS integration works, but does the agent behave correctly?

What Makes Strands Agents Tricky to Test

The Strands Agents SDK is model-agnostic (it supports Amazon Bedrock, Anthropic, and other providers) and tool-focused. An agent is defined by the tools it can call and by the model that decides when and how to call them.

The testing challenges:

  • Tool selection — with multiple tools available, does the agent pick the right one for each input?
  • AWS service interactions — tools that call DynamoDB, S3, Lambda, or other AWS services need to behave correctly in tests without hitting real production resources
  • Model-specific behavior — behavior differences across Bedrock models (Claude 3.5 vs Titan vs Llama) affect which tool gets called and how
  • Session state — multi-turn agents carry state between turns; errors in state management compound
  • Streaming behavior — Strands supports streaming responses; test that partial results don't break your application
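
The streaming concern can be exercised without a model by feeding your consumer a scripted event stream. The sketch below assumes events are dicts and that text chunks arrive under a `data` key; check both against the Strands version you run. `collect_text`, `fake_stream`, and the non-text event are hypothetical stand-ins.

```python
import asyncio
from typing import AsyncIterator


async def collect_text(events: AsyncIterator[dict]) -> str:
    """Join text chunks from a stream, ignoring events that carry no text."""
    parts = []
    async for event in events:
        chunk = event.get("data")  # assumed key for text chunks
        if isinstance(chunk, str):
            parts.append(chunk)
    return "".join(parts)


async def fake_stream() -> AsyncIterator[dict]:
    """Scripted stand-in for a streaming agent response."""
    for event in [
        {"data": "Customer "},
        {"tool_event": {"name": "lookup_customer"}},  # hypothetical non-text event
        {"data": ""},  # empty chunk
        {"data": "is premium."},
    ]:
        yield event


def test_consumer_handles_partial_and_non_text_events():
    text = asyncio.run(collect_text(fake_stream()))
    assert text == "Customer is premium."
```

Once the consumer survives empty chunks and non-text events here, the same function can be pointed at the real stream in an integration test.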

Layer 1: Testing Individual Tools

In Strands, tools are decorated Python functions. Test them in isolation first:

from strands import tool
import boto3
from moto import mock_aws  # moto 5+; moto 4.x used per-service decorators such as mock_dynamodb

@tool
def lookup_customer(customer_id: str) -> dict:
    """Look up customer information by ID."""
    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    table = dynamodb.Table('customers')
    response = table.get_item(Key={'customer_id': customer_id})
    return response.get('Item', {})

def create_customers_table():
    """Create the 'customers' table; call only while the moto mock is active."""
    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    return dynamodb.create_table(
        TableName='customers',
        KeySchema=[{'AttributeName': 'customer_id', 'KeyType': 'HASH'}],
        AttributeDefinitions=[{'AttributeName': 'customer_id', 'AttributeType': 'S'}],
        BillingMode='PAY_PER_REQUEST'
    )

@mock_aws
def test_lookup_customer_returns_correct_data():
    # Seed the mocked table with one customer
    table = create_customers_table()
    table.put_item(Item={
        'customer_id': 'cust_123',
        'name': 'Alice Smith',
        'tier': 'premium'
    })

    result = lookup_customer('cust_123')

    assert result['name'] == 'Alice Smith'
    assert result['tier'] == 'premium'

@mock_aws
def test_lookup_customer_returns_empty_for_missing():
    # Table exists but customer doesn't
    create_customers_table()

    result = lookup_customer('nonexistent_cust')
    assert result == {}

Use moto to mock AWS services in tests. Never hit real DynamoDB, S3, or Lambda in unit tests — it's slow, costs money, and creates test pollution.

Layer 2: Testing Agent Tool Selection

After individual tools check out, test that the agent calls the right tool for each input:

from strands import Agent, tool

def test_agent_calls_lookup_for_status_query():
    # This test calls the agent's configured model (Bedrock by default),
    # so it needs model access and its outcome can vary between runs.
    tool_calls = []

    @tool
    def tracking_lookup_customer(customer_id: str) -> dict:
        """Look up a customer's name, tier, and account status by ID."""
        tool_calls.append({"tool": "lookup_customer", "customer_id": customer_id})
        return {"name": "Alice Smith", "tier": "premium", "status": "active"}

    @tool
    def tracking_cancel_account(customer_id: str, reason: str) -> dict:
        """Cancel a customer's account for the given reason."""
        tool_calls.append({"tool": "cancel_account", "customer_id": customer_id})
        return {"cancelled": True}

    agent = Agent(
        tools=[tracking_lookup_customer, tracking_cancel_account]
    )

    response = agent("What is the status of customer cust_123?")

    # The lookup tool must have run, with the right ID, and nothing destructive
    called_tool_names = [c["tool"] for c in tool_calls]
    assert "lookup_customer" in called_tool_names
    assert "cancel_account" not in called_tool_names
    assert "cust_123" in [c.get("customer_id") for c in tool_calls]

Calling the wrong tool with well-formed parameters is a production failure, and because nothing raises an error, it may not look like one in the logs.
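
One way to scale this check beyond a single prompt is a table of inputs, each with the tool it should trigger and the tools it must never trigger. The sketch below keeps the comparison logic separate from the agent so it can be tested on its own. `ToolCase`, `check_tool_calls`, and the second sample prompt are hypothetical; `tool_calls` is assumed to be a list of dicts with a `tool` key, as recorded by the tracking functions above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCase:
    prompt: str
    expected: str  # tool that must be called
    forbidden: set = field(default_factory=set)  # tools that must not be called


def check_tool_calls(case: ToolCase, tool_calls: list) -> list:
    """Return human-readable failures for one case (empty list means pass)."""
    called = [c["tool"] for c in tool_calls]
    failures = []
    if case.expected not in called:
        failures.append(f"{case.prompt!r}: expected {case.expected}, got {called}")
    for name in sorted(case.forbidden & set(called)):
        failures.append(f"{case.prompt!r}: forbidden tool {name} was called")
    return failures


CASES = [
    ToolCase("What is the status of customer cust_123?",
             expected="lookup_customer", forbidden={"cancel_account"}),
    ToolCase("Please cancel account cust_123, they are moving abroad",
             expected="cancel_account"),
]
```

In a real test you would clear `tool_calls`, send `case.prompt` to the agent, and assert that `check_tool_calls(case, tool_calls)` returns an empty list for every case.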

Layer 3: Testing Multi-Turn Conversations

Strands agents support multi-turn conversations with session state. Test that state persists correctly across turns:

from strands import Agent

# lookup_customer is the tool from Layer 1; update_customer is assumed to be a
# similar @tool function. Both need a mocked table with cust_456 seeded.
def test_agent_maintains_context_across_turns():
    agent = Agent(tools=[lookup_customer, update_customer])

    # First turn: establish context
    response1 = agent("Look up customer cust_456")
    assert "cust_456" in str(response1) or "customer" in str(response1).lower()

    # Second turn: agent should remember the customer from first turn
    response2 = agent("What tier is this customer on?")

    # The agent should answer based on the customer looked up in turn 1
    # without needing the customer ID repeated
    answer = str(response2).lower()
    assert any(word in answer for word in ("tier", "premium", "standard"))

Context loss between turns is a silent failure. The agent either re-fetches data it already has (wasteful) or gives a generic answer instead of a contextual one (wrong).
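
The re-fetch half of this failure can be asserted directly by counting tool invocations per turn. The sketch below is one way to do that, under the assumption that you can wrap the underlying function before registering it as a tool. `count_calls`, `calls_during`, and `lookup_customer_stub` are hypothetical names, and whether your tool decorator accepts a wrapped function should be verified.

```python
import functools


def count_calls(fn):
    """Wrap fn so each invocation is counted on wrapper.calls."""
    @functools.wraps(fn)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def lookup_customer_stub(customer_id: str) -> dict:
    """Look up customer information by ID (stubbed data)."""
    return {"customer_id": customer_id, "tier": "premium"}


def calls_during(counter, action) -> int:
    """Run action() and return how many times counter was invoked by it."""
    before = counter.calls
    action()
    return counter.calls - before
```

In the multi-turn test, wrap each turn: `calls_during(lookup_customer_stub, lambda: agent("What tier is this customer on?"))` should return 0 if the agent answered the second turn from context rather than looking the customer up again.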

Layer 4: Testing AWS Integration Correctness

Strands agents in AWS-heavy stacks often interact with multiple services. Test the integration points:

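from moto import mock_aws  # moto 5+ replaces mock_s3, mock_dynamodb, and mock_lambda
import boto3
from strands import Agent

# read_s3_document and store_analysis_result are assumed to be @tool functions
# defined elsewhere in your codebase.
@mock_aws
def test_agent_document_processing_flow():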
    # Set up mock S3
    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='documents-bucket')
    s3.put_object(
        Bucket='documents-bucket',
        Key='contracts/contract_001.pdf',
        Body=b'Contract content: payment terms 30 days net'
    )
    
    # Set up mock DynamoDB for results
    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    table = dynamodb.create_table(
        TableName='processing-results',
        KeySchema=[{'AttributeName': 'doc_id', 'KeyType': 'HASH'}],
        AttributeDefinitions=[{'AttributeName': 'doc_id', 'AttributeType': 'S'}],
        BillingMode='PAY_PER_REQUEST'
    )
    
    agent = Agent(tools=[read_s3_document, store_analysis_result])
    response = agent("Analyze contract_001.pdf and extract the payment terms")
    
    # Verify the agent stored results in DynamoDB
    result = table.get_item(Key={'doc_id': 'contract_001'})
    assert 'Item' in result
    assert '30 days' in result['Item'].get('payment_terms', '')

Layer 5: Testing Bedrock Model Behavior

Strands is model-agnostic, but different Bedrock models behave differently. If you're testing a specific model, account for model-specific patterns:

from strands import Agent
from strands.models.bedrock import BedrockModel
from unittest.mock import patch

def test_agent_with_claude_bedrock_model():
    # Canned model output. The response shape and the `invoke` patch target
    # below are illustrative: confirm which BedrockModel method your installed
    # Strands version uses to call Bedrock and what it returns, then adjust.
    # patch() raises AttributeError if the named attribute does not exist.
    mock_response = {
        "content": [{"type": "text", "text": "I'll look up that customer for you."}],
        "tool_use": [{"name": "lookup_customer", "input": {"customer_id": "cust_789"}}]
    }

    with patch('strands.models.bedrock.BedrockModel.invoke') as mock_invoke:
        mock_invoke.return_value = mock_response

        model = BedrockModel(model_id="anthropic.claude-3-5-sonnet-20241022-v2:0")
        agent = Agent(model=model, tools=[lookup_customer])

        response = agent("What do we know about customer cust_789?")

        mock_invoke.assert_called_once()
        call_args = mock_invoke.call_args
        # Verify the model was called with the right tool configuration
        assert "lookup_customer" in str(call_args)

What Code-Level Tests Miss

Your unit and integration tests run against mocked AWS services in a local environment. Production fails differently:

  • IAM permission edge cases — your Lambda has the right permissions in dev. In production, a specific action on a specific resource fails because the IAM policy is too narrow.
  • VPC network issues — DynamoDB calls succeed in your test environment. In production, the Lambda is in a VPC without a proper endpoint and calls time out silently.
  • Bedrock throttling — model inference calls succeed under low test load. In production, simultaneous requests hit Bedrock throttling limits.
  • Real input distribution — users ask questions in ways your test fixtures don't cover. Tool selection that worked for your test cases fails on the ambiguous middle.
  • Cross-region behavior — you tested in us-east-1. Production deploys to eu-west-1. A service your agent depends on behaves differently.
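
Some of these gaps, throttling in particular, only appear under concurrent load. The sketch below is a minimal concurrency smoke test you could point at a staging deployment. `run_concurrently` and `flaky_agent` are hypothetical, and `call_agent` stands in for whatever function sends one prompt to your deployed agent and returns its reply.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools
import threading


def run_concurrently(call_agent, prompts, max_workers=8):
    """Send prompts in parallel; return (responses, errors) without raising."""
    def attempt(prompt):
        try:
            return ("ok", call_agent(prompt))
        except Exception as exc:  # smoke test: record every failure type
            return ("error", f"{type(exc).__name__}: {exc}")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(attempt, prompts))
    responses = [value for kind, value in outcomes if kind == "ok"]
    errors = [value for kind, value in outcomes if kind == "error"]
    return responses, errors


# Stand-in for a deployed agent that throttles every third request.
_counter = itertools.count(1)
_lock = threading.Lock()


def flaky_agent(prompt):
    with _lock:
        n = next(_counter)
    if n % 3 == 0:
        raise RuntimeError("ThrottlingException (simulated)")
    return f"answer to {prompt}"
```

Against a real endpoint, the useful assertion is a budget rather than zero errors, for example that fewer than some agreed fraction of requests fail at the concurrency level you expect in production.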

Monitoring Strands Agents in Production

Once your Strands agent is live, you need ongoing behavioral monitoring.

HelpMeTest lets you write natural language behavioral tests against your deployed agent endpoint and run them on a schedule:

Test: customer lookup returns correct tier information
When user says: "Look up customer status for account A-1023"
Then: response includes account tier (standard, premium, or enterprise)
And: response includes account status (active, suspended, or cancelled)
And: response time under 10 seconds
And: no AWS error messages appear in response

Tests run continuously. If your agent's behavior shifts after a Bedrock model update, an IAM policy change, or a DynamoDB schema migration, you find out before your users do.

Free tier: 10 tests, unlimited health checks. Try HelpMeTest →

Strands Agents Testing Checklist

Before shipping any Strands-based agent:

  • Unit tests for every tool function with moto-mocked AWS services
  • Tool selection tests — right tool called for each input type
  • Multi-turn context tests — session state persists correctly
  • AWS service integration tests — correct behavior against mocked DynamoDB, S3, Lambda
  • Error handling — what happens when an AWS service returns an error?
  • IAM permission validation — agent has exactly the permissions it needs, no more
  • Bedrock model-specific tests if you're targeting a specific model
  • Production behavioral monitoring for model updates, IAM changes, and service drift

The AWS infrastructure is solid. The agent behavior is what you need to test.
