Gremlin Tutorial: Running Chaos Experiments at Scale
Gremlin is a commercial chaos engineering platform that simplifies running controlled failure experiments at scale. Where open-source tools like Chaos Mesh require Kubernetes expertise to operate, Gremlin provides a web UI, pre-built attack scenarios, and team collaboration features.
This tutorial covers setup, running your first experiments, and building a chaos engineering practice with Gremlin.
What Gremlin Offers
Gremlin organizes chaos experiments into attack categories:
Resource attacks — Consume CPU, memory, disk, or I/O to simulate resource exhaustion:
- CPU: consume N% of CPU capacity for N seconds
- Memory: allocate and hold N GB of memory
- Disk: fill disk to N% capacity
- I/O: slow down disk reads and writes
Network attacks — Degrade or disrupt network connectivity:
- Latency: add delay to outbound traffic
- Packet loss: drop N% of packets
- Bandwidth: limit available bandwidth
- DNS: corrupt DNS responses for specific hosts
- Blackhole: block all traffic to a target IP/host
State attacks — Terminate processes or nodes:
- Shutdown: shut down the host
- Time Travel: change system clock
- Process killer: kill a specific process by name
Application attacks — For Kubernetes and containers:
- Pod termination
- Container kill
- Stress specific pods
Setup
Create a Gremlin Account
Sign up at gremlin.com. The free tier allows one target (one agent) with all attack types — sufficient for learning and small environments.
Install the Gremlin Agent
Linux (systemd)
# Add Gremlin repo
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
curl https://deb.gremlin.com/gpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y gremlin gremlind
Configure with your credentials:
sudo gremlin init
# Enter your team ID and certificate/secret
Start the daemon:
sudo systemctl start gremlind
sudo systemctl enable gremlind
Docker
docker run -d \
  --name gremlin \
  -e GREMLIN_TEAM_ID="your-team-id" \
  -e GREMLIN_TEAM_SECRET="your-team-secret" \
  -e GREMLIN_IDENTIFIER="my-container" \
  --cap-add=NET_ADMIN \
  --cap-add=SYS_BOOT \
  --pid=host \
  gremlin/gremlin daemon
The NET_ADMIN capability is required for network attacks. SYS_BOOT is needed for shutdown attacks. Remove capabilities you don't need.
Kubernetes
Install via Helm:
helm repo add gremlin https://helm.gremlin.com
helm repo update
helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --create-namespace \
  --set gremlin.secret.managed=true \
  --set gremlin.secret.type=secret \
  --set gremlin.secret.teamID="your-team-id" \
  --set gremlin.secret.clusterID="production-cluster" \
  --set gremlin.secret.teamSecret="your-team-secret"
Verify agents are registered:
kubectl get pods -n gremlin
You should see a Gremlin agent pod on each node.
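To make the "one agent pod per node" check scriptable, here is a minimal sketch. It assumes you save two `kubectl` listings to files first. `nodes_missing_agent` is a hypothetical helper name, and it treats any Running pod in the `gremlin` namespace as coverage, so filter by pod name beforehand if your install places other pods there.

```shell
# Hypothetical helper: print every node that has no Running pod in the
# gremlin namespace.
#   $1 = file with one node name per line
#   $2 = file with lines of "<node> <phase>"
#
# Assumed way to produce the two files (not executed here):
#   kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name > nodes.txt
#   kubectl get pods -n gremlin --no-headers \
#     -o custom-columns=NODE:.spec.nodeName,PHASE:.status.phase > pods.txt
nodes_missing_agent() {
  # The pod listing is read first (ARGV[1]) to build the set of covered
  # nodes; the node listing is then checked against that set.
  awk 'FILENAME == ARGV[1] { if ($2 == "Running") covered[$1] = 1; next }
       !($1 in covered)    { print $1 }' "$2" "$1"
}
```

Calling `nodes_missing_agent nodes.txt pods.txt` prints nothing when every node is covered, so an empty result is the passing case.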
Running Your First Attack
Via Web UI
- Go to app.gremlin.com
- Click Attacks → New Attack
- Select Infrastructure tab
- Choose your target (the host where you installed the agent)
- Select attack type: Resource → CPU
- Configure: CPU = 80%, Duration = 60 seconds
- Click Unleash Gremlin
Watch your monitoring dashboard. If CPU utilization spikes as expected and your application keeps responding normally, you have validated that the system tolerates CPU pressure.
Via CLI
# Install Gremlin CLI
pip install gremlinapi
# Authenticate
gremlin login
# Run a CPU attack on a specific target
gremlin attack-targets \
  --attack-type cpu \
  --cpu-capacity 80 \
  --length 60 \
  --targets 'exact!{"type":"Host","hosts":{"ids":["your-target-id"]}}'
Via API
curl -X POST https://api.gremlin.com/v1/attacks/new \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target": {
      "type": "Exact",
      "hosts": {
        "ids": ["your-target-id"]
      }
    },
    "command": {
      "type": "cpu",
      "args": ["-c", "80", "-l", "60"]
    }
  }'
Useful Attack Scenarios
CPU Exhaustion
Validate autoscaling: does your cluster spin up new instances when CPU is high?
gremlin attack-targets \
  --attack-type cpu \
  --cpu-capacity 95 \
  --length 300 \
  --targets 'exact!{"type":"Host","hosts":{"ids":["app-server-1"]}}'
Watch for: autoscaler triggering, traffic routing to healthy instances, response times staying acceptable.
Memory Leak Simulation
gremlin attack-targets \
  --attack-type memory \
  --memory-gb 4 \
  --length 120 \
  --targets 'exact!{"type":"Container","containerLabels":[{"key":"app","value":"api-server"}]}'
Watch for: OOM killer firing, pod restart, memory alerts triggering, requests failing during restart.
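One way to confirm that the OOM killer actually fired is to inspect each container's last termination reason. The sketch below assumes the target pods carry the `app=api-server` label used above and live in the current namespace. `oom_killed_pods` is a hypothetical helper that filters lines of the form `<pod> <reason> [<reason> ...]`.

```shell
# Hypothetical helper: read "<pod> <reason...>" lines on stdin and print
# the pods where any container last terminated with reason OOMKilled.
#
# Assumed way to produce the input (not executed here):
#   kubectl get pods -l app=api-server -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
oom_killed_pods() {
  awk '{
    # Fields 2..NF hold one termination reason per container (may be absent).
    for (i = 2; i <= NF; i++) {
      if ($i == "OOMKilled") { print $1; break }
    }
  }'
}
```

Piping the `kubectl` output into `oom_killed_pods` during or just after the attack lists the pods that were OOM-killed. An empty result while memory alerts are firing suggests the attack did not reach the container's limit.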
Network Latency to Downstream Service
gremlin attack-targets \
  --attack-type latency \
  --latency-ms 500 \
  --latency-jitter 100 \
  --affected-hosts api.stripe.com \
  --length 120 \
  --targets 'exact!{"type":"Container","containerLabels":[{"key":"app","value":"checkout-service"}]}'
Watch for: checkout timeout handling, error messages shown to users, circuit breaker opening.
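To observe the checkout path from the outside while the latency attack runs, you could probe an endpoint with `curl` and classify each result. This is a rough sketch: the probe URL is a placeholder, `classify_probe` is a hypothetical helper, and the labels it prints are illustrative rather than anything Gremlin defines.

```shell
# Hypothetical helper: classify one probe result.
#   $1 = HTTP status code as printed by curl (000 means no response)
#   $2 = total request time in seconds
#   $3 = latency budget in seconds
#
# Assumed way to produce $1 and $2 (placeholder URL, not executed here):
#   curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}\n' \
#     https://staging.example.com/checkout/health
classify_probe() {
  local code="$1" secs="$2" budget="$3"
  case "$code" in
    000) echo "timeout-or-connect-failure"; return ;;
    5??) echo "server-error"; return ;;
  esac
  # awk handles the floating-point comparison that plain sh cannot.
  if awk -v t="$secs" -v b="$budget" 'BEGIN { exit !(t > b) }'; then
    echo "slow"
  else
    echo "ok"
  fi
}
```

During a healthy run with a circuit breaker in place, you would expect a short burst of `slow` results followed by fast responses once the breaker opens, rather than a steady stream of timeouts.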
Random Pod Termination (Kubernetes)
gremlin attack-targets \
  --attack-type shutdown \
  --delay 0 \
  --targets 'random!{"type":"Container","containerLabels":[{"key":"app","value":"worker"}],"percent":25}'
Terminates 25% of worker containers. Validates that your job queue handles in-flight job requeuing.
Chaos Scenarios as Gremlin Scenarios
Gremlin's Scenario feature lets you chain attacks into a workflow:
- Go to Scenarios → New Scenario
- Add a step: CPU attack on app servers, 80% CPU, 60 seconds
- Add a step: Latency attack on the same servers, 200ms, 60 seconds
- Add a step: Wait 30 seconds
- Add a step: Shutdown one instance
This simulates progressive degradation — the system starts struggling with CPU, then network gets slow, then a node goes down. Real cascading failures often look like this.
Scheduling Regular Experiments
Gremlin lets you schedule recurring experiments:
Via UI: Scenarios → your scenario → Schedule → set cron expression.
Via API:
curl -X POST https://api.gremlin.com/v1/scenarios/{scenario-id}/schedules \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cron": "0 10 * * 1",
    "endTime": "2026-12-31T00:00:00.000Z",
    "timezone": "America/New_York",
    "tags": {
      "team": "platform",
      "environment": "staging"
    }
  }'
This runs the scenario every Monday at 10am Eastern, during business hours, when the team is available to observe and respond.
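The cron expression is easy to misread, so it can help to print its fields by name before submitting a schedule. The sketch below assumes the standard five-field layout (minute, hour, day of month, month, day of week), which is consistent with `0 10 * * 1` meaning Monday at 10:00. `explain_cron` is a hypothetical helper and does not validate field ranges.

```shell
# Hypothetical helper: label the five fields of a standard cron expression.
explain_cron() {
  local minute hour dom month dow extra
  # read splits on whitespace without glob-expanding the "*" fields.
  read -r minute hour dom month dow extra <<EOF
$1
EOF
  if [ -z "$dow" ] || [ -n "$extra" ]; then
    echo "expected exactly 5 fields: $1" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
}

# explain_cron "0 10 * * 1"
# prints: minute=0 hour=10 day-of-month=* month=* day-of-week=1
```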
Halt / Rollback
Every Gremlin attack can be stopped immediately:
- UI: Click the running attack → Halt
- CLI:
gremlin halt --attack-id <id>
- API:
DELETE https://api.gremlin.com/v1/attacks/<id>
Define halt conditions before starting:
- Error rate exceeds 1%
- P99 latency exceeds 2 seconds
- On-call engineer requests stop
- Any SEV1 alert fires
Make halt the default response when something unexpected happens. You can always restart with more information.
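The halt conditions listed above can be written down as an explicit check, so the decision is not improvised mid-experiment. This is a minimal sketch: `should_halt` is a hypothetical helper, the thresholds are the ones from the list, and how you obtain the four inputs (your metrics system, alerting tool, or a person on the call) is left open.

```shell
# Hypothetical helper: decide whether a running experiment should be halted.
#   $1 = current error rate in percent
#   $2 = current P99 latency in seconds
#   $3 = "yes" if any SEV1 alert is firing
#   $4 = "yes" if the on-call engineer asked to stop
# Prints the reasons and returns 0 when a halt is warranted, 1 otherwise.
should_halt() {
  local error_pct="$1" p99_s="$2" sev1="$3" stop_requested="$4"
  local reasons=""
  if awk -v v="$error_pct" 'BEGIN { exit !(v > 1) }'; then
    reasons="$reasons error-rate"
  fi
  if awk -v v="$p99_s" 'BEGIN { exit !(v > 2) }'; then
    reasons="$reasons p99-latency"
  fi
  if [ "$sev1" = "yes" ]; then reasons="$reasons sev1-alert"; fi
  if [ "$stop_requested" = "yes" ]; then reasons="$reasons oncall-stop"; fi
  if [ -n "$reasons" ]; then
    printf 'HALT:%s\n' "$reasons"
    return 0
  fi
  return 1
}

# should_halt 2.5 1.2 no no
# prints: HALT: error-rate
```

A watcher loop could call this every few seconds and, on a zero exit status, run the halt command shown above for the active attack.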
Reliability Score
Gremlin's Reliability Score aggregates your chaos experiment results into a score per service:
- Attacks run — how many experiments you've conducted
- Weaknesses found — findings from experiments
- Recommendations — what to run next based on your infrastructure
Use it to prioritize where to invest resilience improvements.
Integrations
Monitoring
Connect Gremlin to your monitoring stack to automatically mark experiment windows:
Datadog:
curl -X POST https://api.gremlin.com/v1/integrations \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -d '{"type":"datadog","apiKey":"your-datadog-api-key"}'
Gremlin will send events to Datadog when attacks start and stop, and they appear on your dashboards as annotations.
PagerDuty:
Configure under Settings → Integrations → PagerDuty. Gremlin can silence alerts during planned experiments (so you don't page on-call for a scheduled test).
CI/CD
Run chaos experiments after staging deployment:
# GitHub Actions
chaos-validation:
  needs: deploy-staging
  runs-on: ubuntu-latest
  steps:
    - name: Run Gremlin scenario
      env:
        GREMLIN_API_KEY: ${{ secrets.GREMLIN_API_KEY }}
        SCENARIO_ID: ${{ vars.CHAOS_SCENARIO_ID }}
      run: |
        # Start scenario
        ATTACK_ID=$(curl -s -X POST \
          "https://api.gremlin.com/v1/scenarios/$SCENARIO_ID/runs" \
          -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.guid')
        # Wait for completion
        sleep 180
        # Check results
        STATUS=$(curl -s \
          "https://api.gremlin.com/v1/scenarios/$SCENARIO_ID/runs/$ATTACK_ID" \
          -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.status')
        if [ "$STATUS" = "failed" ]; then
          echo "Chaos scenario found issues; blocking deployment"
          exit 1
        fi
Gremlin vs. Open Source
| Feature | Gremlin | Chaos Mesh / Litmus |
|---|---|---|
| Setup | Agent install, web UI | Kubernetes Helm install |
| Attack types | Resource, network, state, app | Pod/network/IO (Kubernetes) |
| UI | Full web UI with history | Kubernetes dashboard |
| Scheduling | Built-in | CronJob resources |
| Team features | RBAC, audit logs | None built-in |
| Non-Kubernetes | ✓ | ✗ |
| Cost | Paid (free tier: 1 target) | Free |
Gremlin's main advantages are the ability to target non-Kubernetes hosts and the team/audit features. Chaos Mesh is preferable if you're Kubernetes-only and want to avoid a paid tool.
Starting Your Practice
The technology is the easy part. Building a chaos engineering practice requires:
- Start with a postmortem list — what caused your last three incidents? Run experiments that reproduce those conditions.
- Define your steady state first — what does "working" look like? Error rate, latency, throughput. You need a baseline to compare against.
- Run in staging before production — validate that your experiments produce the expected conditions, and that your halt procedures work, before touching production.
- Make findings into tickets — every weakness discovered must become a tracked issue. Chaos experiments that don't drive fixes are theater.
- Hold a GameDay — schedule 2-3 hours with the on-call team, run experiments together, review the results. Learning happens faster when it's shared.
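For the steady-state step, a baseline can be as simple as an error rate and a P99 latency computed from request samples taken before the experiment, then recomputed during it. The sketch below is one way to do that under stated assumptions: input lines have the form `<http_status> <latency_ms>`, only 5xx responses count as errors, and P99 uses the nearest-rank method. `steady_state_summary` is a hypothetical helper.

```shell
# Hypothetical helper: summarize "<http_status> <latency_ms>" lines on stdin
# into request count, error rate (5xx only), and nearest-rank P99 latency.
steady_state_summary() {
  # Sort by latency so the P99 can be read off by rank.
  sort -k2,2n | awk '
    { n++; lat[n] = $2; if ($1 >= 500) errors++ }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      rank = int((n * 99 + 99) / 100)   # integer ceil(0.99 * n)
      printf "requests=%d error_rate_pct=%.2f p99_ms=%d\n",
             n, 100 * errors / n, lat[rank]
    }'
}

# printf '200 120\n200 95\n503 400\n200 110\n' | steady_state_summary
# prints: requests=4 error_rate_pct=25.00 p99_ms=400
```

Run it once on pre-experiment traffic and once on traffic captured during the attack; the difference between the two summaries is what the experiment measured.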
Gremlin's platform makes the experiments easier. The value comes from acting on what you find.