Gremlin Chaos Engineering: Attacks, Scenarios, and Best Practices
When Netflix introduced Chaos Monkey, it proved a point: you can deliberately break production systems and come out stronger. But replicating that capability required deep engineering investment. Gremlin was built to package it as a commercial platform that is safe, auditable, and accessible to teams without Netflix-scale engineering resources.
Gremlin operates as a daemon running on your hosts or containers. It communicates with the Gremlin control plane, which lets you launch attacks through a web UI, a CLI, or an API. The daemon executes attacks locally on the target. Do not assume that an attack keeps running when the control plane is unreachable: Gremlin documents a fail-safe in which an agent that loses contact with the control plane halts its attacks, a deliberate design decision that keeps an unsupervised experiment from running on. Confirm this behavior for your agent version before relying on it.
This guide covers Gremlin's attack taxonomy, running attacks from the CLI and REST API, composing multi-attack scenarios, planning GameDays, and integrating Gremlin with PagerDuty for closed-loop incident simulation.
Attack Categories
Gremlin organizes attacks into three categories: Resource, State, and Network. Each category tests a different failure hypothesis.
Resource Attacks
Resource attacks starve the system of compute, memory, or disk. They simulate the failure modes that come from gradual degradation: a memory leak that grows over weeks, a log file that fills a disk, a CPU-intensive job that runs at the wrong time.
CPU — consumes a configurable percentage of CPU cores for a defined duration. This tests whether throttling or eviction policies work correctly when a noisy neighbor appears on the same host.
Memory — allocates a specified amount of RAM. This forces the OOM killer to act or evicts containers, revealing which processes your system considers expendable and whether it recovers correctly when memory is released.
Disk — fills disk space to a threshold. This catches applications that crash when log files cannot be written, or databases that corrupt data when the write-ahead log runs out of space.
IO — generates disk read/write load, simulating a situation where storage throughput is saturated. This is particularly relevant for database servers and applications that buffer heavily to disk.
State Attacks
State attacks change the runtime state of the system: killing processes, shifting the clock, or terminating containers.
Process Killer — repeatedly kills a named process. Paired with a process manager that should restart the process automatically, this validates that your restart policies are configured correctly and that the restart time is within acceptable bounds.
Time Travel — shifts the system clock forward or backward. This is surprisingly destructive in systems that use JWTs with expiration times, TLS certificates, or cron jobs. Testing clock skew is something most teams completely skip.
Blackhole — drops network traffic to and from a target, simulating a complete network partition. Gremlin's own taxonomy lists Blackhole under Network attacks; it is grouped here because the effect is a change of state: the system must behave as if the target were completely isolated.
Container Killer — terminates Docker containers or Kubernetes pods. Equivalent to docker kill or kubectl delete pod.
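To make the Time Travel failure mode concrete, here is a minimal sketch of an expiry check of the kind a JWT verifier performs. The function `token_status` and its `leeway` parameter are hypothetical names for illustration, not part of Gremlin or of any JWT library.

```python
from datetime import datetime, timedelta, timezone


def token_status(issued_at, lifetime, now, leeway=timedelta(0)):
    """Classify a token the way a typical issued-at/expiry check would.

    issued_at and now are timezone-aware datetimes; lifetime and leeway
    are timedeltas. leeway is the clock skew the verifier tolerates.
    """
    expires_at = issued_at + lifetime
    if now + leeway < issued_at:
        return "not yet valid"
    if now - leeway >= expires_at:
        return "expired"
    return "valid"


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
one_hour = timedelta(hours=1)

# Verifier clock agrees with the issuer: the token is accepted.
print(token_status(issued, one_hour, issued + timedelta(minutes=5)))
# Verifier clock shifted two hours forward: a fresh token looks expired.
print(token_status(issued, one_hour, issued + timedelta(hours=2, minutes=5)))
# Verifier clock shifted ten minutes back: the token looks issued in the future.
print(token_status(issued, one_hour, issued - timedelta(minutes=10)))
# A 15-minute leeway absorbs that backward skew.
print(token_status(issued, one_hour, issued - timedelta(minutes=10),
                   leeway=timedelta(minutes=15)))
```

A Time Travel experiment is a way to find out which of these branches your real verifier takes, and whether the leeway it allows matches the skew you expect in production.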
Network Attacks
Network attacks modify how traffic flows between services. These are often the most revealing attacks because timeout and retry settings in distributed systems are frequently left at their defaults or tuned incorrectly.
Latency — adds delay to network packets. The killer question is: does your service return a timeout error to the caller, or does it hang indefinitely?
Packet Loss — randomly drops a percentage of packets. Unlike latency, which is measurable, packet loss is often silent—TCP retransmits without the application knowing, which burns through connection resources.
Bandwidth — throttles available bandwidth to a specified rate. Simulates WAN conditions or a slow upstream dependency.
DNS — corrupts DNS resolution for specified hostnames. If your service caches DNS results aggressively, it may fail to reconnect even after the upstream service recovers.
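Certificate Expiry — as Gremlin documents it, this check connects to a target endpoint, reads the TLS certificate chain it serves, and flags any certificate that expires within a window you specify. It does not inject a fault; it surfaces certificates that are about to lapse before they cause an outage. If you also want to know whether your service rejects an already expired certificate, test that separately: a service that accepts one silently is itself a finding.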
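The question a Latency attack asks (does the caller fail fast or hang?) can be illustrated with a client that sets an explicit deadline. This is a minimal sketch using only the standard library; `read_with_deadline` is a hypothetical helper, and the silent local server stands in for a dependency that accepts connections but never answers.

```python
import socket


def read_with_deadline(host, port, timeout_s):
    """Return bytes from the peer, or None if nothing arrives within timeout_s.

    Without the timeout argument, recv() would block indefinitely on a
    peer that never responds.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            return sock.recv(1024)
    except socket.timeout:
        return None


# A listener that never accepts or replies; the kernel still completes
# the TCP handshake, so the client connects and then waits for data.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

result = read_with_deadline("127.0.0.1", port, timeout_s=0.2)
print("timed out" if result is None else result)
server.close()
```

The caller gets a definite answer after 0.2 seconds and can return an error upstream. Under an injected delay, a client without a deadline would hold its connection and its worker thread for as long as the attack runs.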
Installing the Gremlin Agent
# Ubuntu/Debian
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
curl -fsSL https://deb.gremlin.com/gremlin.gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/gremlin.gpg
sudo apt-get update && sudo apt-get install -y gremlin gremlind

# Authenticate
sudo gremlin init
# Enter your Team ID and certificate credentials from app.gremlin.com/settings/teams

For Kubernetes, deploy via Helm:

helm repo add gremlin https://helm.gremlin.com
helm repo update
helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --create-namespace \
  --set gremlin.secret.managed=true \
  --set gremlin.secret.type=secret \
  --set gremlin.secret.teamID="YOUR_TEAM_ID" \
  --set gremlin.secret.clusterID="production-us-east-1" \
  --set gremlin.secret.teamSecret="YOUR_TEAM_SECRET"

Running Attacks via CLI
The gremlin CLI provides direct attack execution without going through the web UI.
# List available attack types
gremlin attack-container --help

# CPU attack: consume 75% of 2 cores for 60 seconds on a specific container
gremlin attack-container \
  --length 60 \
  --cpu-cores 2 \
  --cpu-percent 75 \
  my-app-container

# Network latency: add 300ms to all outbound traffic from a host
gremlin attack-host \
  --attack-type latency \
  --length 120 \
  --ms 300 \
  --percent 100 \
  --egress \
  my-host-identifier

# Memory attack: allocate 2GB of RAM for 90 seconds
gremlin attack-host \
  --attack-type memory \
  --length 90 \
  --mb 2048 \
  my-host-identifier

# Target a specific port with packet loss
gremlin attack-container \
  --attack-type packet_loss \
  --length 60 \
  --percent 25 \
  --port 5432 \
  my-app-container

The --port parameter for network attacks is particularly useful because it limits the blast radius to traffic on a specific port: you can simulate "the PostgreSQL connection is lossy" without affecting all network traffic from the container.
Running Attacks via REST API
The Gremlin API enables programmatic attack creation, which is essential for CI integration and automated GameDay orchestration.
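For CI use, the request body is usually assembled in code rather than typed by hand. The sketch below builds a latency-attack body in the same shape as the JSON this section posts to /v1/attacks/new. The function `build_latency_attack` and its parameter names are hypothetical, and the field layout is copied from this article's example rather than verified against the live API.

```python
import json


def build_latency_attack(tag_key, tag_value, target_percent, length_s, delay_ms):
    """Assemble a latency-attack request body in this article's payload shape."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    if length_s <= 0 or delay_ms <= 0:
        raise ValueError("length_s and delay_ms must be positive")
    return {
        "targetDefinition": {
            "strategy": {"percentage": target_percent, "type": "RandomPercent"},
            "type": "Random",
        },
        "targetType": "Container",
        "targetTags": {tag_key: tag_value},
        "command": {
            "type": "latency",
            "commandType": "Network",
            # "-l" is the length in seconds and "-m" the delay in ms;
            # "-r", "100" is carried over unchanged from the article's example.
            "args": ["-l", str(length_s), "-m", str(delay_ms), "-r", "100"],
        },
    }


body = build_latency_attack("app", "payment-service", 25, 120, 500)
print(json.dumps(body, indent=2))
```

Validating the blast-radius percentage before the request leaves your pipeline is cheap insurance: a typo such as 250 fails locally instead of reaching the API.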
First, obtain a bearer token:
GREMLIN_TOKEN=$(curl -s -X POST https://api.gremlin.com/v1/users/auth \
  -H "Content-Type: application/json" \
  -d "{\"email\": \"$GREMLIN_EMAIL\", \"password\": \"$GREMLIN_PASSWORD\", \"companyName\": \"$GREMLIN_COMPANY\"}" \
  | jq -r '.[0].token')

Launch a latency attack against a target:
# A token from /users/auth is a bearer token, so it is sent with the Bearer scheme.
# The Key scheme is for Gremlin API keys.
curl -X POST https://api.gremlin.com/v1/attacks/new \
  -H "Authorization: Bearer $GREMLIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "targetDefinition": {
      "strategy": {
        "percentage": 25,
        "type": "RandomPercent"
      },
      "type": "Random"
    },
    "targetType": "Container",
    "targetTags": {
      "app": "payment-service"
    },
    "command": {
      "type": "latency",
      "commandType": "Network",
      "args": [
        "-l", "120",
        "-m", "500",
        "-r", "100"
      ]
    }
  }'

Halt all active attacks programmatically:
# Get all active attacks (bearer token from /users/auth, hence the Bearer scheme)
ACTIVE=$(curl -s -H "Authorization: Bearer $GREMLIN_TOKEN" \
  https://api.gremlin.com/v1/attacks/active | jq -r '.[].guid')

# Halt each one
for attack_id in $ACTIVE; do
  curl -X DELETE \
    -H "Authorization: Bearer $GREMLIN_TOKEN" \
    "https://api.gremlin.com/v1/attacks/$attack_id"
done

Building Scenarios
Gremlin Scenarios allow you to compose multiple attacks into a single, reusable workflow with explicit ordering and delays between steps. A scenario is the unit of a structured GameDay.
Scenarios are defined as JSON and stored in Gremlin. Here is an example that simulates a cascading database failure:
{
  "name": "Database Degradation Cascade",
  "description": "Simulate slow DB leading to connection pool exhaustion and downstream timeout",
  "hypothesis": "When DB latency exceeds 2s, the API returns 503 within 5s and the circuit breaker opens within 30s",
  "steps": [
    {
      "delay": 0,
      "attacks": [
        {
          "targetDefinition": {
            "type": "Exact",
            "exact": {
              "hostnames": ["db-primary.internal"]
            }
          },
          "command": {
            "type": "latency",
            "args": ["-l", "300", "-m", "2000", "-r", "100", "-p", "5432"]
          }
        }
      ]
    },
    {
      "delay": 60,
      "attacks": [
        {
          "targetDefinition": {
            "type": "RandomPercent",
            "percentage": 50
          },
          "targetTags": {"tier": "api"},
          "command": {
            "type": "cpu",
            "args": ["-l", "120", "-c", "2", "-p", "80"]
          }
        }
      ]
    }
  ]
}

This scenario first introduces 2 seconds of latency on port 5432 (simulating a slow database), then 60 seconds later adds CPU pressure to the API tier (simulating connection pool threads spinning). The combined effect tests whether the system degrades gracefully or whether a slow database causes cascading failures across the stack.
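Before running a scenario like this, it helps to check when each attack is active and how long the attacks overlap. The sketch below derives that timeline from a trimmed copy of the scenario JSON. It assumes each step's delay is counted from the launch of the previous step, which matches the "60 seconds later" reading above but is an assumption about scheduling, and it reads the attack length from the "-l" argument. The helpers `timeline` and `overlap` are hypothetical names.

```python
def arg_value(args, flag):
    """Return the value that follows flag in a flat argument list."""
    return args[args.index(flag) + 1]


def timeline(scenario):
    """Return (attack_type, start_s, end_s) for every attack in the scenario."""
    events = []
    start = 0
    for step in scenario["steps"]:
        start += step["delay"]  # assumed: delay is relative to the previous launch
        for attack in step["attacks"]:
            command = attack["command"]
            length = int(arg_value(command["args"], "-l"))
            events.append((command["type"], start, start + length))
    return events


def overlap(a, b):
    """Return the (start_s, end_s) window where both attacks run, or None."""
    start, end = max(a[1], b[1]), min(a[2], b[2])
    return (start, end) if start < end else None


# Trimmed copy of the scenario above: only the fields the timeline needs.
scenario = {
    "steps": [
        {"delay": 0, "attacks": [{"command": {
            "type": "latency",
            "args": ["-l", "300", "-m", "2000", "-r", "100", "-p", "5432"]}}]},
        {"delay": 60, "attacks": [{"command": {
            "type": "cpu",
            "args": ["-l", "120", "-c", "2", "-p", "80"]}}]},
    ]
}

events = timeline(scenario)
print(events)
print(overlap(events[0], events[1]))
```

Under that assumption the database latency runs from 0 to 300 seconds and the CPU attack from 60 to 180 seconds, so the combined-pressure window that the hypothesis depends on lasts two minutes.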
Planning a GameDay
A GameDay is a structured session where the engineering team deliberately runs chaos experiments against a production or production-like environment. Gremlin provides a GameDay template, but the structure matters more than the tooling.
Before the GameDay:
Define the steady-state hypothesis for each experiment. Write it down explicitly:
Steady state: p99 API latency < 200ms, error rate < 0.5%, no alerts firing
Hypothesis: When we inject 500ms of latency on the database connection, the API degrades to p99 < 800ms and the circuit breaker opens within 30s, routing to the cache fallback
Expected result: Users see slightly slower responses but no errors; dashboard shows circuit breaker state change; no on-call alert fires

Assign roles: attack operator (runs Gremlin), observer (watches dashboards), incident commander (decides whether to halt), scribe (records findings).
Define the halt condition: any condition under which the attack operator immediately halts all experiments. Typically: p99 > 5s, error rate > 5%, on-call alert fires unexpectedly, or any data loss detected.
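Writing the halt condition as code removes ambiguity during the session. This is a minimal sketch using the thresholds listed above; the `Snapshot` fields and the `halt_reasons` function are hypothetical, and in practice the values would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p99_latency_s: float
    error_rate: float        # fraction of requests, so 0.05 means 5%
    unexpected_alert: bool   # an on-call alert fired that the plan did not predict
    data_loss: bool


def halt_reasons(snapshot, max_p99_s=5.0, max_error_rate=0.05):
    """Return every halt condition the snapshot violates; empty means continue."""
    reasons = []
    if snapshot.p99_latency_s > max_p99_s:
        reasons.append("p99 latency above %.1fs" % max_p99_s)
    if snapshot.error_rate > max_error_rate:
        reasons.append("error rate above %.0f%%" % (max_error_rate * 100))
    if snapshot.unexpected_alert:
        reasons.append("unexpected on-call alert")
    if snapshot.data_loss:
        reasons.append("data loss detected")
    return reasons


healthy = Snapshot(p99_latency_s=0.8, error_rate=0.01,
                   unexpected_alert=False, data_loss=False)
degraded = Snapshot(p99_latency_s=6.2, error_rate=0.08,
                    unexpected_alert=False, data_loss=False)

print(halt_reasons(healthy))
print(halt_reasons(degraded))
```

Returning the list of reasons instead of a bare boolean gives the scribe something concrete to record when the incident commander calls a halt.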
During the GameDay:
Start with the lowest-impact experiments. Validate that monitoring detects the chaos before escalating to higher-impact scenarios. If your Prometheus dashboard does not show the latency spike, something is wrong with observability—stop and fix it before continuing.
Run each scenario for the planned duration, record metrics, and then halt. Allow the system 5 minutes to recover and verify it returns to steady state before running the next experiment.
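The recovery check between experiments can be scripted as well. The sketch below polls a metrics source until the steady state defined earlier holds (p99 under 200ms, error rate under 0.5%, no alerts firing) or the 5-minute allowance runs out. The dictionary keys, `read_metrics`, and `wait_for_recovery` are hypothetical names; the `sleep` parameter is injectable so the loop can be exercised without waiting.

```python
import time


def is_steady(sample):
    """Steady state from the GameDay plan: p99 < 200ms, errors < 0.5%, no alerts."""
    return (sample["p99_ms"] < 200
            and sample["error_rate"] < 0.005
            and not sample["alerts_firing"])


def wait_for_recovery(read_metrics, timeout_s=300, interval_s=15, sleep=time.sleep):
    """Poll until steady state returns.

    Returns the seconds waited, or None if the system has not recovered
    once timeout_s has elapsed.
    """
    waited = 0
    while True:
        if is_steady(read_metrics()):
            return waited
        if waited >= timeout_s:
            return None
        sleep(interval_s)
        waited += interval_s


# Simulated metrics: degraded for three polls, then healthy.
samples = iter([
    {"p99_ms": 900, "error_rate": 0.020, "alerts_firing": True},
    {"p99_ms": 450, "error_rate": 0.004, "alerts_firing": True},
    {"p99_ms": 210, "error_rate": 0.001, "alerts_firing": False},
    {"p99_ms": 150, "error_rate": 0.001, "alerts_firing": False},
])
recovered_after = wait_for_recovery(lambda: next(samples), sleep=lambda s: None)
print(recovered_after)
```

A None result means the system did not return to steady state within the allowance, which is a finding in its own right and a reason not to start the next experiment.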
After the GameDay:
Write a findings document with three sections: what worked as expected, what failed unexpectedly, and what we are going to fix. Gremlin's GameDay report feature generates a shareable PDF automatically.
Integrating with PagerDuty
Integrating Gremlin with PagerDuty enables two powerful workflows. First, you can validate that your alerts fire during chaos experiments. Second, you can automatically halt attacks when a real incident fires, preventing chaos experiments from making a real outage worse.
Setting Up the Integration
In PagerDuty, create an Events API v2 integration in the service you want to correlate with. Copy the integration key.
In Gremlin, go to Team Settings → Integrations → PagerDuty. Provide the integration key. Gremlin will:
- Send a PagerDuty event when an attack starts (so you can validate the alert fires correctly)
- Listen for PagerDuty incidents and halt all active Gremlin attacks when a new incident is triggered
Testing Alert Fidelity
During a GameDay, run a CPU attack that you expect to trigger a CPU usage alert. After the attack starts, check PagerDuty:
# Check if the alert fired (via PagerDuty API)
# The date call below uses GNU syntax; on macOS/BSD use: date -u -v-10M +%Y-%m-%dT%H:%M:%SZ
curl -X GET https://api.pagerduty.com/incidents \
  -H "Authorization: Token token=$PAGERDUTY_API_KEY" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  -G \
  --data-urlencode "service_ids[]=$SERVICE_ID" \
  --data-urlencode "statuses[]=triggered" \
  --data-urlencode "since=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  | jq '.incidents[] | {id, title, created_at, urgency}'

If the alert does not fire during the experiment, the alert threshold or its routing is probably misconfigured, and that finding is worth more than the experiment itself.
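To attribute an incident to the experiment rather than to unrelated noise, compare its creation time with the attack window. The sketch below filters incident records by the `created_at` field that the query above selects, and also builds the `since` value without relying on a platform-specific date command. The helper names and the sample records are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(timestamp):
    """Parse an ISO 8601 UTC timestamp such as 2024-03-01T10:04:30Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def since_param(now, minutes=10):
    """Format the lower bound for the incidents query, `minutes` before now."""
    return (now - timedelta(minutes=minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")


def incidents_in_window(incidents, attack_start, attack_end):
    """Keep only the incidents created while the attack was running."""
    return [incident for incident in incidents
            if attack_start <= parse_utc(incident["created_at"]) <= attack_end]


attack_start = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
attack_end = attack_start + timedelta(seconds=120)

# Made-up records in the shape the jq filter above produces.
incidents = [
    {"id": "P1", "title": "High CPU", "created_at": "2024-03-01T10:01:10Z"},
    {"id": "P2", "title": "Disk full", "created_at": "2024-03-01T09:40:00Z"},
]

matched = incidents_in_window(incidents, attack_start, attack_end)
print([incident["id"] for incident in matched])
print(since_param(attack_end))
```

An empty result after an attack that should have tripped an alert is the signal to review the alert rule before the next experiment.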
Auto-Halt on Real Incidents
Configure the PagerDuty webhook to call Gremlin's halt endpoint when an incident fires:
# Gremlin halt webhook (to configure in PagerDuty)
# POST https://api.gremlin.com/v1/attacks/halt
# Header: Authorization: Key YOUR_GREMLIN_TOKEN

In PagerDuty, add a webhook subscription in Extensions → Add Extension → Webhook, targeting the Gremlin halt endpoint. Now if a real incident fires while a GameDay is in progress, Gremlin automatically stops all active attacks.
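If you route the webhook through a small relay of your own instead of pointing PagerDuty straight at Gremlin, the relay needs to decide which deliveries should trigger a halt. The sketch below extracts the IDs of newly triggered incidents from a webhook body. The field names follow my reading of PagerDuty's v2 (generic webhook extension) and v3 (webhook subscription) payloads and should be checked against a captured delivery; `triggered_incident_ids` is a hypothetical helper, and the halt call itself is left out.

```python
def triggered_incident_ids(payload):
    """Return the IDs of incidents that this webhook delivery reports as triggered."""
    ids = []

    # Assumed v3 shape: one event per delivery under the "event" key.
    event = payload.get("event")
    if isinstance(event, dict) and event.get("event_type") == "incident.triggered":
        ids.append(event.get("data", {}).get("id"))

    # Assumed v2 shape: a batch of messages, each naming its event.
    for message in payload.get("messages", []):
        if message.get("event") == "incident.trigger":
            ids.append(message.get("incident", {}).get("id"))

    return [incident_id for incident_id in ids if incident_id]


v3_delivery = {"event": {"event_type": "incident.triggered",
                         "data": {"id": "PABC123", "type": "incident"}}}
v2_delivery = {"messages": [
    {"event": "incident.trigger", "incident": {"id": "PDEF456"}},
    {"event": "incident.acknowledge", "incident": {"id": "PDEF456"}},
]}

print(triggered_incident_ids(v3_delivery))
print(triggered_incident_ids(v2_delivery))
```

Only a non-empty result should lead to a halt request; acknowledgements and resolutions are ignored so that routine incident updates do not interrupt a GameDay that has already been judged safe to continue.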
Recommended Attack Progression
Teams new to chaos engineering consistently make the same mistake: they start with too-aggressive experiments and get burned. Follow this progression:
Week 1–2: Network latency only, staging environment, low percentages (10–25%), short duration (30–60 seconds). Goal: validate that monitoring detects the failure.
Week 3–4: Process killer and container termination in staging. Goal: validate restart policies and recovery time.
Month 2: Full scenario runs in staging with GameDay structure. Goal: validate end-to-end resilience and alert fidelity.
Month 3+: Selected experiments in production canary, small blast radius (5–10% of targets). Goal: validate that production has the same resilience properties as staging.
Ongoing: Automated scheduled experiments in staging. Any new service deployed to production must pass a defined set of chaos experiments before being removed from heightened monitoring.
Gremlin's value is not in the individual attack—it is in building the organizational habit of treating failure as a first-class concern, tested continuously, with findings documented and tracked to resolution.