
Gremlin Tutorial: Running Chaos Experiments at Scale

HelpMeTest

14 May 2026 — 6 min read

Gremlin is a commercial chaos engineering platform that simplifies running controlled failure experiments at scale. Where open-source tools like Chaos Mesh require Kubernetes expertise to operate, Gremlin provides a web UI, pre-built attack scenarios, and team collaboration features.

This tutorial covers setup, running your first experiments, and building a chaos engineering practice with Gremlin.

What Gremlin Offers

Gremlin organizes chaos experiments into attack categories:

Resource attacks — Consume CPU, memory, disk, or I/O to simulate resource exhaustion:

CPU: consume N% of CPU capacity for N seconds
Memory: allocate and hold N GB of memory
Disk: fill disk to N% capacity
I/O: slow down disk reads and writes

Network attacks — Degrade or disrupt network connectivity:

Latency: add delay to outbound traffic
Packet loss: drop N% of packets
Bandwidth: limit available bandwidth
DNS: corrupt DNS responses for specific hosts
Blackhole: block all traffic to a target IP/host

State attacks — Terminate processes or nodes:

Shutdown: shut down the host
Time Travel: change system clock
Process killer: kill a specific process by name

Application attacks — For Kubernetes and containers:

Pod termination
Container kill
Stress specific pods

Setup

Create a Gremlin Account

Sign up at gremlin.com. The free tier allows one target (one agent) with all attack types — sufficient for learning and small environments.

Install the Gremlin Agent

Linux (systemd)

# Add Gremlin repo
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
curl https://deb.gremlin.com/gpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y gremlin gremlind

Configure with your credentials:

sudo gremlin init
# Enter your team ID and certificate/secret

Start the daemon:

sudo systemctl start gremlind
sudo systemctl enable gremlind

Docker

docker run -d \
  --name gremlin \
  -e GREMLIN_TEAM_ID="your-team-id" \
  -e GREMLIN_TEAM_SECRET="your-team-secret" \
  -e GREMLIN_IDENTIFIER="my-container" \
  --cap-add=NET_ADMIN \
  --cap-add=SYS_BOOT \
  --pid=host \
  gremlin/gremlin daemon

The NET_ADMIN capability is required for network attacks. SYS_BOOT is needed for shutdown attacks. Remove capabilities you don't need.
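
One way to keep the container's privileges minimal is to derive the `--cap-add` flags from the attack categories you plan to run. The sketch below encodes only the two mappings stated above (network attacks need NET_ADMIN, shutdown attacks need SYS_BOOT). The function name `gremlin_cap_flags` and the category names `network` and `shutdown` are hypothetical, and other attack types may need capabilities this tutorial does not list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the --cap-add flags for the given attack categories.
# Only the two mappings mentioned in this tutorial are encoded.
gremlin_cap_flags() {
  local flags="" category
  for category in "$@"; do
    case "$category" in
      network)  flags="$flags --cap-add=NET_ADMIN" ;;
      shutdown) flags="$flags --cap-add=SYS_BOOT" ;;
      *) echo "unknown attack category: $category" >&2; return 1 ;;
    esac
  done
  # Trim the leading space left by the concatenation above.
  echo "${flags# }"
}
```

For a host that should only ever receive network attacks, `gremlin_cap_flags network` prints `--cap-add=NET_ADMIN`, which can be substituted into the `docker run` command in place of both capability lines.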

Kubernetes

Install via Helm:

helm repo add gremlin https://helm.gremlin.com
helm repo update

helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --create-namespace \
  --set gremlin.secret.managed=true \
  --set gremlin.secret.type=secret \
  --set gremlin.secret.teamID="your-team-id" \
  --set gremlin.secret.clusterID="production-cluster" \
  --set gremlin.secret.teamSecret="your-team-secret"

Verify agents are registered:

kubectl get pods -n gremlin

You should see a Gremlin agent pod on each node.
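
To turn that visual check into something scriptable, you could compare the number of nodes with the number of running pods in the `gremlin` namespace. This is a rough sketch. The function names are hypothetical, the namespace may contain Gremlin pods other than the per-node agent, and tainted nodes may legitimately have no agent, so treat a mismatch as a prompt to look closer rather than a definitive failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if there is at least one node and at
# least as many running pods as nodes. Counts are passed in as arguments
# so the comparison can be tested without a cluster.
agents_cover_nodes() {
  local node_count=$1 running_pods=$2
  [ "$node_count" -gt 0 ] && [ "$running_pods" -ge "$node_count" ]
}

# Example wiring (requires kubectl access to the cluster).
check_gremlin_coverage() {
  local nodes pods
  nodes=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  pods=$(kubectl get pods -n gremlin --no-headers \
           --field-selector=status.phase=Running | wc -l | tr -d ' ')
  if agents_cover_nodes "$nodes" "$pods"; then
    echo "OK: $pods running pods for $nodes nodes"
  else
    echo "CHECK: $pods running pods for $nodes nodes" >&2
    return 1
  fi
}
```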

Running Your First Attack

Via Web UI

Go to app.gremlin.com
Click Attacks → New Attack
Select Infrastructure tab
Choose your target (the host where you installed the agent)
Select attack type: Resource → CPU
Configure: CPU = 80%, Duration = 60 seconds
Click Unleash Gremlin

Watch your monitoring dashboard. If CPU utilization spikes as expected and your application keeps responding normally, you've validated that your system tolerates CPU pressure.
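
If you want a number alongside the dashboard, a small probe loop run for the duration of the attack can count failed requests. The sketch below is one possible approach, not a Gremlin feature. The function name `count_probe_failures`, the `PROBE_INTERVAL` variable, and the health-check URL in the usage note are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a probe command a fixed number of times and
# print how many of those runs exited non-zero.
#   $1        number of attempts
#   $2...     the probe command and its arguments
# PROBE_INTERVAL sets the pause between attempts in seconds (default 1).
count_probe_failures() {
  local attempts=$1 failures=0 i=0
  shift
  while [ "$i" -lt "$attempts" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
    sleep "${PROBE_INTERVAL:-1}"
  done
  echo "$failures"
}
```

For the 60-second attack above, something like `count_probe_failures 60 curl -fsS --max-time 2 https://staging.example.com/healthz` would print the number of probes that failed or timed out while the CPU was under load.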

Via CLI

# Install Gremlin CLI
pip install gremlinapi

# Authenticate
gremlin login

# Run a CPU attack on a specific target
gremlin attack-targets \
  --attack-type cpu \
  --cpu-capacity 80 \
  --length 60 \
  --targets 'exact!{"type":"Host","hosts":{"ids":["your-target-id"]}}'

Via API

curl -X POST https://api.gremlin.com/v1/attacks/new \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target": {
      "type": "Exact",
      "hosts": {
        "ids": ["your-target-id"]
      }
    },
    "command": {
      "type": "cpu",
      "args": ["-c", "80", "-l", "60"]
    }
  }'
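
When the same request is issued from scripts, it can help to generate the JSON body from parameters rather than editing it by hand. The sketch below reproduces the body shown above and nothing more. The function name `cpu_attack_payload` is hypothetical, and the values are assumed to be plain identifiers and numbers, since no JSON escaping is done.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JSON body from the curl example above.
#   $1 target ID, $2 value passed to -c, $3 duration in seconds passed to -l
# Assumes the arguments contain no characters that need JSON escaping.
cpu_attack_payload() {
  local target_id=$1 cpu=$2 seconds=$3
  printf '{"target":{"type":"Exact","hosts":{"ids":["%s"]}},"command":{"type":"cpu","args":["-c","%s","-l","%s"]}}\n' \
    "$target_id" "$cpu" "$seconds"
}
```

The output can be passed straight to the request, for example `-d "$(cpu_attack_payload your-target-id 80 60)"`.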

Useful Attack Scenarios

CPU Exhaustion

Validate autoscaling: does your cluster spin up new instances when CPU is high?

gremlin attack-targets \
  --attack-type cpu \
  --cpu-capacity 95 \
  --length 300 \
  --targets 'exact!{"type":"Host","hosts":{"ids":["app-server-1"]}}'

Watch for: autoscaler triggering, traffic routing to healthy instances, response times staying acceptable.

Memory Leak Simulation

gremlin attack-targets \
  --attack-type memory \
  --memory-gb 4 \
  --length 120 \
  --targets 'exact!{"type":"Container","containerLabels":[{"key":"app","value":"api-server"}]}'

Watch for: OOM killer firing, pod restart, memory alerts triggering, requests failing during restart.

Network Latency to Downstream Service

gremlin attack-targets \
  --attack-type latency \
  --latency-ms 500 \
  --latency-jitter 100 \
  --affected-hosts api.stripe.com \
  --length 120 \
  --targets 'exact!{"type":"Container","containerLabels":[{"key":"app","value":"checkout-service"}]}'

Watch for: checkout timeout handling, error messages shown to users, circuit breaker opening.

Random Pod Termination (Kubernetes)

gremlin attack-targets \
  --attack-type shutdown \
  --delay 0 \
  --targets 'random!{"type":"Container","containerLabels":[{"key":"app","value":"worker"}],"percent":25}'

Terminates 25% of the containers labeled app=worker, chosen at random. Validates that your job queue requeues jobs that were in flight when a worker died.

Chaos Scenarios as Gremlin Scenarios

Gremlin's Scenario feature lets you chain attacks into a workflow:

Go to Scenarios → New Scenario
Add a step: CPU attack on app servers, 80% CPU, 60 seconds
Add a step: Latency attack on the same servers, 200ms, 60 seconds
Add a step: Wait 30 seconds
Add a step: Shutdown one instance

This simulates progressive degradation — the system starts struggling with CPU, then network gets slow, then a node goes down. Real cascading failures often look like this.

Scheduling Regular Experiments

Gremlin lets you schedule recurring experiments:

Via UI: Scenarios → your scenario → Schedule → set cron expression.

Via API:

curl -X POST https://api.gremlin.com/v1/scenarios/{scenario-id}/schedules \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cron": "0 10 * * 1",
    "endTime": "2026-12-31T00:00:00.000Z",
    "timezone": "America/New_York",
    "tags": {
      "team": "platform",
      "environment": "staging"
    }
  }'

This runs the scenario every Monday at 10am Eastern — during business hours, when the team is available to observe and respond.
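
The cron expression is easier to review when its five fields are labelled. The sketch below splits a standard five-field expression (minute, hour, day of month, month, day of week) and prints each field by name. The function name `describe_cron` is hypothetical, and the sketch does not validate field ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label the five fields of a standard cron expression.
describe_cron() {
  set -f        # keep "*" fields from expanding to file names
  set -- $1     # split the expression on whitespace into $1..$n
  set +f
  if [ "$#" -ne 5 ]; then
    echo "expected 5 fields, got $#" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

For the schedule above, `describe_cron '0 10 * * 1'` prints `minute=0 hour=10 day-of-month=* month=* day-of-week=1`, where day-of-week 1 is Monday.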

Halt / Rollback

Every Gremlin attack can be stopped immediately:

UI: Click the running attack → Halt
CLI: gremlin halt --attack-id <id>
API: DELETE https://api.gremlin.com/v1/attacks/<id>

Define halt conditions before starting:

Error rate exceeds 1%
P99 latency exceeds 2 seconds
On-call engineer requests stop
Any SEV1 alert fires

Make halt the default response when something unexpected happens. You can always restart with more information.
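
The measurable halt conditions above can be written down as a single check so that nobody has to decide under pressure. The sketch below is one way to express them. The function name `should_halt` is hypothetical, and how you obtain the three inputs from your monitoring system is left open.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 (halt) when any threshold from the list
# above is exceeded, exit 1 otherwise.
#   $1 error rate in percent
#   $2 P99 latency in seconds
#   $3 number of SEV1 alerts currently firing
should_halt() {
  local error_rate_pct=$1 p99_seconds=$2 sev1_alerts=$3
  # awk handles the decimal comparisons that plain shell arithmetic cannot.
  awk -v e="$error_rate_pct" -v p="$p99_seconds" -v s="$sev1_alerts" \
    'BEGIN { exit !(e > 1 || p > 2 || s > 0) }'
}
```

A watcher loop could then run `if should_halt "$rate" "$p99" "$sev1"; then gremlin halt --attack-id "$ATTACK_ID"; fi`, using the halt command shown above. A manual stop request from the on-call engineer still needs a human in the loop.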

Reliability Score

Gremlin's Reliability Score aggregates your chaos experiment results into a score per service:

Attacks run — how many experiments you've conducted
Weaknesses found — findings from experiments
Recommendations — what to run next based on your infrastructure

Use it to prioritize where to invest resilience improvements.

Integrations

Monitoring

Connect Gremlin to your monitoring stack to automatically mark experiment windows:

Datadog:

curl -X POST https://api.gremlin.com/v1/integrations \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -d '{"type":"datadog","apiKey":"your-datadog-api-key"}'

Gremlin will send events to Datadog when attacks start and stop — you'll see them on your dashboards as annotations.

PagerDuty:

Configure under Settings → Integrations → PagerDuty. Gremlin can silence alerts during planned experiments (so you don't page on-call for a scheduled test).

CI/CD

Run chaos experiments after staging deployment:

# GitHub Actions
chaos-validation:
  needs: deploy-staging
  runs-on: ubuntu-latest
  steps:
    - name: Run Gremlin scenario
      env:
        GREMLIN_API_KEY: ${{ secrets.GREMLIN_API_KEY }}
        SCENARIO_ID: ${{ vars.CHAOS_SCENARIO_ID }}
      run: |
        # Start the scenario run and capture its ID
        RUN_ID=$(curl -s -X POST \
          "https://api.gremlin.com/v1/scenarios/$SCENARIO_ID/runs" \
          -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.guid')

        # Stop early if the API did not return a run ID
        if [ -z "$RUN_ID" ] || [ "$RUN_ID" = "null" ]; then
          echo "Could not start chaos scenario"
          exit 1
        fi

        # Wait for completion (fixed delay; size it to the scenario's length)
        sleep 180

        # Check results
        STATUS=$(curl -s \
          "https://api.gremlin.com/v1/scenarios/$SCENARIO_ID/runs/$RUN_ID" \
          -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.status')

        if [ "$STATUS" = "failed" ]; then
          echo "Chaos scenario found issues: blocking deployment"
          exit 1
        fi
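
The fixed `sleep 180` in the workflow above either wastes time or ends too early if the scenario length changes. A polling loop is a common alternative. The sketch below is an assumption-laden outline: the function name `wait_for_run` is hypothetical, `failed` is the only status value this tutorial mentions, and the status that means "finished cleanly" is passed in by the caller because it is not specified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a status command until the run fails,
# finishes, or the attempts run out.
#   $1    maximum number of attempts
#   $2    seconds to sleep between attempts
#   $3    status string that means the run finished cleanly (caller-supplied)
#   $4... command that prints the current status
# Exit codes: 0 finished, 1 failed, 2 timed out.
wait_for_run() {
  local max_attempts=$1 interval=$2 done_status=$3 status="" i=0
  shift 3
  while [ "$i" -lt "$max_attempts" ]; do
    status=$("$@")
    if [ "$status" = "failed" ]; then
      echo "$status"
      return 1
    fi
    if [ "$status" = "$done_status" ]; then
      echo "$status"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 2
}
```

In the workflow, the status command would be a small function wrapping the second `curl ... | jq -r '.status'` call, so the job stops as soon as the run reaches a terminal state.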

Gremlin vs. Open Source

Feature	Gremlin	Chaos Mesh / Litmus
Setup	Agent install, web UI	Kubernetes Helm install
Attack types	Resource, network, state, app	Pod/network/IO (Kubernetes)
UI	Full web UI with history	Project web UIs (Chaos Dashboard, ChaosCenter)
Scheduling	Built-in	Cron-style schedule resources
Team features	RBAC, audit logs	None built-in
Non-Kubernetes	✓	✗
Cost	Paid (free tier: 1 target)	Free

Gremlin's main advantages are the ability to target non-Kubernetes hosts and the team/audit features. Chaos Mesh is preferable if you're Kubernetes-only and want to avoid a paid tool.

Starting Your Practice

The technology is the easy part. Building a chaos engineering practice requires:

Start with a postmortem list — what caused your last three incidents? Run experiments that reproduce those conditions.
Define your steady state first — what does "working" look like? Error rate, latency, throughput. You need a baseline to compare against.
Run in staging before production — validate that your experiments produce the expected conditions, and that your halt procedures work, before touching production.
Make findings into tickets — every weakness discovered must become a tracked issue. Chaos experiments that don't drive fixes are theater.
Hold a GameDay — schedule 2-3 hours with the on-call team, run experiments together, review the results. Learning happens faster when it's shared.
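
The steady-state step in the list above lends itself to a simple numeric check: record a baseline, then test whether a value observed during an experiment stays within an agreed tolerance. The sketch below shows the arithmetic only. The function name `within_tolerance` and the example numbers are hypothetical, and the right tolerance depends on your service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when the observed value is within
# tolerance_pct percent of the baseline, exit 1 otherwise.
#   $1 baseline value, $2 observed value, $3 tolerance in percent
within_tolerance() {
  awk -v b="$1" -v o="$2" -v t="$3" 'BEGIN {
    d = o - b
    if (d < 0) d = -d          # absolute deviation from the baseline
    exit !(d <= b * t / 100)
  }'
}
```

With a baseline P99 of 200 ms and a 10% tolerance, `within_tolerance 200 210 10` succeeds and `within_tolerance 200 250 10` fails, which would be recorded as a finding.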

Gremlin's platform makes the experiments easier. The value comes from acting on what you find.