AWS Fault Injection Simulator (FIS): Chaos Experiments Tutorial

AWS Fault Injection Simulator (FIS) is AWS's managed chaos engineering service. It runs fault injection experiments against your AWS resources — EC2 instances, ECS tasks, RDS clusters, EKS pods, and more — without requiring agents or sidecar processes. Experiments are defined as templates in AWS, run on a schedule or on demand, and automatically stop if your system crosses a health threshold you define.

What AWS FIS Does

FIS injects faults at the infrastructure level. Unlike application-level chaos tools, FIS can:

  • Terminate EC2 instances
  • Stop RDS database instances
  • Drain ECS tasks
  • Introduce network latency and packet loss on EC2 instances
  • Stress CPU and memory on ECS containers
  • Kill EKS pods and nodes

The key difference from tools like LitmusChaos: FIS requires no tooling in your cluster or application. AWS owns the agent, which means less setup but also means you're limited to what AWS has implemented. For AWS-native workloads, this is often exactly what you need.
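
You can see exactly which fault actions are available in your region straight from the CLI:

# Enumerate the fault actions FIS currently supports
aws fis list-actions --query 'actions[].id' --output text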

IAM Setup

FIS needs an IAM role that it can assume to perform fault actions. The trust policy must allow the fis.amazonaws.com service principal; the role's permission policy grants the fault actions themselves. Create the role:

# Trust policy for FIS
<span class="hljs-built_in">cat > fis-trust-policy.json << <span class="hljs-string">'EOF'
{
  <span class="hljs-string">"Version": <span class="hljs-string">"2012-10-17",
  <span class="hljs-string">"Statement": [{
    <span class="hljs-string">"Effect": <span class="hljs-string">"Allow",
    <span class="hljs-string">"Principal": {<span class="hljs-string">"Service": <span class="hljs-string">"fis.amazonaws.com"},
    <span class="hljs-string">"Action": <span class="hljs-string">"sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name FISExperimentRole \
  --assume-role-policy-document file://fis-trust-policy.json

<span class="hljs-comment"># Attach permissions for EC2 actions
aws iam attach-role-policy \
  --role-name FISExperimentRole \
  --policy-arn arn:aws:iam::aws:policy/PowerUserAccess

For production, use a narrower policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:StopInstances",
        "ec2:TerminateInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/Environment": "staging"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecs:ListTasks",
        "ecs:StopTask"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarms"
      ],
      "Resource": "*"
    }
  ]
}

The Condition on EC2 actions scopes experiments to instances tagged Environment: staging — preventing accidental production impact.
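
For tag-based targeting to select anything, the staging instances must actually carry the matching tags. A quick sketch (the instance ID is a placeholder):

# Tag a staging web server so both FIS targeting and the IAM Condition match
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=Role,Value=web-server Key=Environment,Value=staging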

Your First Experiment: Stop EC2 Instances

Create an experiment template via AWS CLI:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/FISExperimentRole"

aws fis create-experiment-template \
  --description "Stop 50% of web tier instances" \
  --targets '
    {
      "webInstances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {
          "Role": "web-server",
          "Environment": "staging"
        },
        "selectionMode": "PERCENT(50)"
      }
    }
  ' \
  --actions '
    {
      "stopInstances": {
        "actionId": "aws:ec2:stop-instances",
        "targets": {
          "Instances": "webInstances"
        },
        "parameters": {
          "startInstancesAfterDuration": "PT5M"
        }
      }
    }
  ' \
  --stop-conditions '
    [
      {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:'${ACCOUNT_ID}':alarm:prod-error-rate"
      }
    ]
  ' \
  --role-arn "${ROLE_ARN}"

This experiment:

  • Targets 50% of EC2 instances tagged Role: web-server in staging
  • Stops them for 5 minutes (PT5M in ISO 8601 duration format)
  • Automatically aborts if the prod-error-rate CloudWatch alarm triggers

The stop condition is critical — it's your safety valve.

Stop Conditions

Stop conditions are what make FIS safe enough for production use. When a stop condition alarm triggers, FIS:

  1. Immediately stops injecting new faults
  2. Rolls back reversible actions (starts stopped instances, etc.)
  3. Reports the experiment as stopped

Design stop conditions before running any experiment:

# Create a stop condition alarm
aws cloudwatch put-metric-alarm \
  --alarm-name FIS-StopCondition-ErrorRate \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-description "FIS stop condition: too many errors"

Good stop condition metrics:

  • HTTP 5xx error rate
  • P99 latency above threshold
  • Failed health checks
  • Queue depth exceeding capacity
  • Your key business metric (orders per minute, etc.)

If any of these trigger, your system is degraded and the experiment should stop.
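
For example, the latency variant can be expressed as a P99 alarm on ALB response time (a sketch; the 1.5-second threshold is illustrative):

# P99 latency stop condition
aws cloudwatch put-metric-alarm \
  --alarm-name FIS-StopCondition-P99Latency \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --extended-statistic p99 \
  --period 60 \
  --threshold 1.5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1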

Running an Experiment

# Get the template ID
TEMPLATE_ID=$(aws fis list-experiment-templates \
  --query 'experimentTemplates[?description==`Stop 50% of web tier instances`].id' \
  --output text)

# Start the experiment
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id $TEMPLATE_ID \
  --query 'experiment.id' \
  --output text)

echo "Experiment ID: $EXPERIMENT_ID"

# Monitor status
aws fis get-experiment \
  --id $EXPERIMENT_ID \
  --query 'experiment.state'

Experiment states:

  • pending — queued to run
  • initiating — setting up
  • running — actively injecting faults
  • completed — finished normally
  • stopping — stop condition triggered, rolling back
  • stopped — rolled back after stop condition
  • failed — experiment itself failed (not the target system)
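
You can also halt a running experiment manually instead of waiting for a stop condition; FIS rolls back reversible actions the same way:

# Stop a running experiment by hand
aws fis stop-experiment --id $EXPERIMENT_ID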

ECS Task Chaos

Stop tasks in an ECS service to verify service restarts and load balancer behavior:

aws fis create-experiment-template \
  --description "Stop ECS tasks in payment service" \
  --targets '
    {
      "paymentTasks": {
        "resourceType": "aws:ecs:task",
        "filters": [
          {
            "path": "clusterArn",
            "values": ["arn:aws:ecs:us-east-1:'${ACCOUNT_ID}':cluster/staging"]
          },
          {
            "path": "group",
            "values": ["service:payment-service"]
          }
        ],
        "selectionMode": "COUNT(1)"
      }
    }
  ' \
  --actions '
    {
      "stopTasks": {
        "actionId": "aws:ecs:stop-task",
        "targets": {
          "Tasks": "paymentTasks"
        }
      }
    }
  ' \
  --stop-conditions '[{"source": "none"}]' \
  --role-arn "${ROLE_ARN}"

"source": "none" means no stop condition — only use this in isolated staging environments where you've verified no production impact is possible.

RDS Failover

Test Aurora failover to verify your application handles database leader changes:

aws fis create-experiment-template \
  --description "Aurora failover test" \
  --targets '
    {
      "auroraCluster": {
        "resourceType": "aws:rds:cluster",
        "resourceArns": [
          "arn:aws:rds:us-east-1:'${ACCOUNT_ID}':cluster:staging-aurora-cluster"
        ],
        "selectionMode": "ALL"
      }
    }
  ' \
  --actions '
    {
      "failoverCluster": {
        "actionId": "aws:rds:failover-db-cluster",
        "targets": {
          "Clusters": "auroraCluster"
        }
      }
    }
  ' \
  --stop-conditions '
    [{"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:us-east-1:'${ACCOUNT_ID}':alarm:FIS-StopCondition-ErrorRate"}]
  ' \
  --role-arn "${ROLE_ARN}"

Aurora failover completes in under 30 seconds for most workloads. But your application might not handle the brief connection interruption gracefully. This experiment reveals:

  • Whether your connection pool handles failover without errors
  • How long your application shows errors during the failover window
  • Whether your read replicas correctly handle traffic during primary unavailability
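
A simple way to measure that error window is to probe a health endpoint while the experiment runs (the URL below is a placeholder for your staging environment):

# Log the HTTP status of a health endpoint once per second during the failover
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://staging.example.com/health)
  echo "$(date -u +%H:%M:%S) $code"
  sleep 1
done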

EKS Pod Chaos

FIS supports Kubernetes pod operations through its EKS integration (note that the experiment role must also be granted Kubernetes RBAC permissions inside the target cluster):

aws fis create-experiment-template \
  --description "Delete EKS pods" \
  --targets '
    {
      "appPods": {
        "resourceType": "aws:eks:pod",
        "filters": [
          {
            "path": "clusterName",
            "values": ["staging-cluster"]
          },
          {
            "path": "namespace",
            "values": ["production"]
          },
          {
            "path": "selector/matchLabels/app",
            "values": ["my-app"]
          }
        ],
        "selectionMode": "PERCENT(50)"
      }
    }
  ' \
  --actions '
    {
      "deletePods": {
        "actionId": "aws:eks:pod-delete",
        "targets": {
          "Pods": "appPods"
        },
        "parameters": {
          "gracePeriodSeconds": "0"
        }
      }
    }
  ' \
  --stop-conditions '[{"source": "none"}]' \
  --role-arn "${ROLE_ARN}"

A gracePeriodSeconds of 0 force-kills the pods. Use 30 or higher to test graceful shutdown handling.
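
To verify that the ReplicaSet replaces the deleted pods, watch them during the run (assumes kubectl access to the staging cluster):

# Watch pods being deleted and recreated while the experiment runs
kubectl get pods -n production -l app=my-app --watch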

Observability Integration

FIS publishes experiment events to EventBridge. Route them to CloudWatch for a complete picture:

# Create EventBridge rule for FIS events
aws events put-rule \
  --name FIS-Experiment-Events \
  --event-pattern '
    {
      "source": ["aws.fis"],
      "detail-type": ["FIS Experiment State Change"]
    }
  '

Add Grafana annotations when experiments start and stop. This overlays chaos events on your metrics dashboards — you can see exactly when error rate spiked relative to when the experiment started.
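
If you use Grafana, a minimal sketch against its annotations HTTP API (host and token are placeholders):

# Post a chaos annotation to Grafana when the experiment starts
curl -s -X POST https://grafana.example.com/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"text": "FIS experiment started", "tags": ["chaos", "fis"]}'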

For CloudWatch dashboards themselves, you can write vertical annotations into the dashboard body; a minimal sketch, assuming standard metric widgets:

import json
from datetime import datetime, timezone
import boto3

def annotate_dashboard(experiment_id, state):
    cloudwatch = boto3.client('cloudwatch')
    body = json.loads(cloudwatch.get_dashboard(DashboardName='ApplicationHealth')['DashboardBody'])
    # Mark the chaos event with a vertical annotation on every time-series widget
    annotation = {'value': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
                  'label': f'FIS {experiment_id}: {state}'}
    for widget in body.get('widgets', []):
        if widget.get('type') == 'metric':
            widget['properties'].setdefault('annotations', {}).setdefault('vertical', []).append(annotation)
    cloudwatch.put_dashboard(DashboardName='ApplicationHealth', DashboardBody=json.dumps(body))

Cost Considerations

FIS itself has no base cost — you pay per action-minute (the time your experiment runs). Each action type has different pricing. Check the AWS FIS pricing page for current rates.

The bigger cost concern is the chaos targets themselves. Terminating EC2 instances and spinning them back up costs compute time. Running database failovers causes brief unavailability. Design experiments to minimize both duration and blast radius.

Rule of thumb: Start with short experiments (2-5 minutes) against 1-10% of your fleet, verify your stop conditions work, then expand. Never run your first experiment against 50% of production.

Terraform Integration

Manage FIS experiment templates as code:

resource "aws_fis_experiment_template" "pod_delete" {
  description = "Delete 25% of web pods"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }

  target {
    name           = "webPods"
    resource_type  = "aws:eks:pod"
    selection_mode = "PERCENT(25)"

    filter {
      path   = "clusterName"
      values = ["staging"]
    }

    filter {
      path   = "namespace"
      values = ["default"]
    }

    filter {
      path   = "selector/matchLabels/app"
      values = ["web"]
    }
  }

  action {
    name      = "deletePods"
    action_id = "aws:eks:pod-delete"

    target {
      key   = "Pods"
      value = "webPods"
    }

    parameter {
      key   = "gracePeriodSeconds"
      value = "30"
    }
  }
}

Storing experiments in Terraform means they're version-controlled, reviewed in PRs, and reproducible across environments.

Running in CI

Schedule chaos tests in your CI pipeline:

# .github/workflows/chaos.yml
name: Chaos Tests
on:
  schedule:
    - cron: '0 14 * * 2'   # Tuesdays at 2pm UTC

jobs:
  fis-chaos:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_CHAOS_ROLE }}
          aws-region: us-east-1

      - name: Run FIS experiment
        run: |
          EXPERIMENT_ID=$(aws fis start-experiment \
            --experiment-template-id ${{ vars.FIS_TEMPLATE_ID }} \
            --query 'experiment.id' --output text)
          
          echo "Experiment: $EXPERIMENT_ID"
          
          # Poll until complete
          while true; do
            STATE=$(aws fis get-experiment \
              --id $EXPERIMENT_ID \
              --query 'experiment.state.status' --output text)
            echo "State: $STATE"
            
            case $STATE in
              completed) echo "Experiment completed successfully"; break ;;
              stopped) echo "Stop condition triggered - system degraded"; exit 1 ;;
              failed) echo "Experiment failed"; exit 1 ;;
              *) sleep 30 ;;
            esac
          done

FIS is the lowest-friction chaos engineering option for AWS users. No cluster changes, no agents, no certificates — just IAM permissions and a CloudWatch alarm for your stop condition. For teams already deep in the AWS ecosystem, it's the natural first choice for infrastructure-level resilience testing.
