Gremlin Tutorial: Running Chaos Experiments at Scale

Gremlin is a commercial chaos engineering platform that simplifies running controlled failure experiments at scale. Where open-source tools like Chaos Mesh require Kubernetes expertise to operate, Gremlin provides a web UI, pre-built attack scenarios, and team collaboration features.

This tutorial covers setup, running your first experiments, and building a chaos engineering practice with Gremlin.

What Gremlin Offers

Gremlin organizes chaos experiments into attack categories:

Resource attacks — Consume CPU, memory, disk, or I/O to simulate resource exhaustion:

  • CPU: consume a set percentage of CPU capacity for a set duration
  • Memory: allocate and hold N GB of memory
  • Disk: fill disk to N% capacity
  • I/O: slow down disk reads and writes

Network attacks — Degrade or disrupt network connectivity:

  • Latency: add delay to outbound traffic
  • Packet loss: drop N% of packets
  • Bandwidth: limit available bandwidth
  • DNS: corrupt DNS responses for specific hosts
  • Blackhole: block all traffic to a target IP/host

State attacks — Terminate processes or nodes:

  • Shutdown: shut down the host
  • Time Travel: change system clock
  • Process killer: kill a specific process by name

Application attacks — For Kubernetes and containers:

  • Pod termination
  • Container kill
  • Stress specific pods

Setup

Create a Gremlin Account

Sign up at gremlin.com. The free tier allows one target (one agent) with all attack types — sufficient for learning and small environments.

Install the Gremlin Agent

Linux (systemd)

# Add Gremlin repo
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
curl https://deb.gremlin.com/gpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y gremlin gremlind

Configure with your credentials:

sudo gremlin init
# Enter your team ID and certificate/secret

Start the daemon:

sudo systemctl start gremlind
sudo systemctl enable gremlind

Docker

docker run -d \
  --name gremlin \
  -e GREMLIN_TEAM_ID="your-team-id" \
  -e GREMLIN_TEAM_SECRET="your-team-secret" \
  -e GREMLIN_IDENTIFIER="my-container" \
  --cap-add=NET_ADMIN \
  --cap-add=SYS_BOOT \
  --pid=host \
  gremlin/gremlin daemon

The NET_ADMIN capability is required for network attacks. SYS_BOOT is needed for shutdown attacks. Remove capabilities you don't need.
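
One way to keep the capability list minimal is to derive it from the attacks you plan to run. The following is an illustrative Python sketch, not part of Gremlin: the function name `docker_run_args` and the category names `"network"` and `"shutdown"` are hypothetical, and only the two capability mappings come from the paragraph above.

```python
import shlex

# Capability requirements stated above; other attack types may need more.
REQUIRED_CAPS = {
    "network": "NET_ADMIN",   # network attacks
    "shutdown": "SYS_BOOT",   # shutdown attacks
}


def docker_run_args(planned_attacks, team_id, team_secret, identifier):
    """Build a `docker run` argument list with only the capabilities needed."""
    args = [
        "docker", "run", "-d",
        "--name", "gremlin",
        "-e", f"GREMLIN_TEAM_ID={team_id}",
        "-e", f"GREMLIN_TEAM_SECRET={team_secret}",
        "-e", f"GREMLIN_IDENTIFIER={identifier}",
    ]
    for attack in sorted(set(planned_attacks)):
        cap = REQUIRED_CAPS.get(attack)
        if cap:
            args.append(f"--cap-add={cap}")
    args += ["--pid=host", "gremlin/gremlin", "daemon"]
    return args


# Only network attacks are planned here, so SYS_BOOT is left out.
print(shlex.join(docker_run_args(["network"], "your-team-id",
                                 "your-team-secret", "my-container")))
```

The function only assembles the command line. Review the printed command before running it yourself.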

Kubernetes

Install via Helm:

helm repo add gremlin https://helm.gremlin.com
helm repo update

helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --create-namespace \
  --set gremlin.secret.managed=true \
  --set gremlin.secret.type=secret \
  --set gremlin.secret.teamID="your-team-id" \
  --set gremlin.secret.clusterID="production-cluster" \
  --set gremlin.secret.teamSecret="your-team-secret"

Verify agents are registered:

kubectl get pods -n gremlin

You should see a Gremlin agent pod on each node.
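
On a large cluster, checking coverage by eye gets tedious. The sketch below compares the node list against the nodes that host a Running pod in the `gremlin` namespace. It relies only on standard Kubernetes fields (`metadata.name`, `spec.nodeName`, `status.phase`). The function names are hypothetical, and as a simplification it counts any Running pod in the namespace as an agent.

```python
import json
import subprocess


def nodes_missing_agent(nodes_json, pods_json):
    """Return names of nodes with no Running pod in the Gremlin namespace."""
    all_nodes = {item["metadata"]["name"] for item in nodes_json["items"]}
    covered = {
        pod["spec"].get("nodeName")
        for pod in pods_json["items"]
        if pod.get("status", {}).get("phase") == "Running"
    }
    return sorted(all_nodes - covered)


def kubectl_json(*args):
    """Run kubectl and parse its JSON output (requires cluster access)."""
    out = subprocess.run(["kubectl", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)


# Demo with sample data shaped like kubectl's JSON output.
sample_nodes = {"items": [{"metadata": {"name": "node-a"}},
                          {"metadata": {"name": "node-b"}}]}
sample_pods = {"items": [
    {"spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
    {"spec": {"nodeName": "node-b"}, "status": {"phase": "Pending"}},
]}
print(nodes_missing_agent(sample_nodes, sample_pods))
```

Against a real cluster, pass `kubectl_json("get", "nodes")` and `kubectl_json("get", "pods", "-n", "gremlin")` in place of the sample data.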

Running Your First Attack

Via Web UI

  1. Go to app.gremlin.com
  2. Click Attacks → New Attack
  3. Select Infrastructure tab
  4. Choose your target (the host where you installed the agent)
  5. Select attack type: Resource → CPU
  6. Configure: CPU = 80%, Duration = 60 seconds
  7. Click Unleash Gremlin

Watch your monitoring dashboard. If CPU utilization spikes as expected and your application keeps responding normally, you have validated that your system tolerates CPU pressure. If the application slows down or errors out, you have found a weakness to investigate.

Via CLI

# Install Gremlin CLI
pip install gremlinapi

# Authenticate
gremlin login

# Run a CPU attack on a specific target
gremlin attack-targets \
  --attack-type cpu \
  --cpu-capacity 80 \
  --length 60 \
  --targets 'exact!{"type":"Host","hosts":{"ids":["your-target-id"]}}'

Via API

curl -X POST https://api.gremlin.com/v1/attacks/new \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target": {
      "type": "Exact",
      "hosts": {
        "ids": ["your-target-id"]
      }
    },
    "command": {
      "type": "cpu",
      "args": ["-c", "80", "-l", "60"]
    }
  }'
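
If you launch attacks from scripts, it can help to build the request in code so the parameters are validated before anything is sent. The sketch below mirrors the curl call above using only the Python standard library. The function name is hypothetical, the payload shape is copied from the example rather than from Gremlin's API reference, and the request is built but never sent.

```python
import json
import urllib.request

API_URL = "https://api.gremlin.com/v1/attacks/new"  # endpoint from the curl example


def build_cpu_attack_request(api_key, target_id, cpu_percent, length_seconds):
    """Build (but do not send) the POST request from the curl example."""
    if not 1 <= cpu_percent <= 100:
        raise ValueError("cpu_percent must be between 1 and 100")
    payload = {
        "target": {"type": "Exact", "hosts": {"ids": [target_id]}},
        "command": {
            "type": "cpu",
            "args": ["-c", str(cpu_percent), "-l", str(length_seconds)],
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_cpu_attack_request("example-key", "your-target-id", 80, 60)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes it easy to log or review the payload before an attack starts.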

Useful Attack Scenarios

CPU Exhaustion

Validate autoscaling: does your cluster spin up new instances when CPU is high?

gremlin attack-targets \
  --attack-type cpu \
  --cpu-capacity 95 \
  --length 300 \
  --targets 'exact!{"type":"Host","hosts":{"ids":["app-server-1"]}}'

Watch for: autoscaler triggering, traffic routing to healthy instances, response times staying acceptable.

Memory Leak Simulation

gremlin attack-targets \
  --attack-type memory \
  --memory-gb 4 \
  --length 120 \
  --targets 'exact!{"type":"Container","containerLabels":[{"key":"app","value":"api-server"}]}'

Watch for: OOM killer firing, pod restart, memory alerts triggering, requests failing during restart.

Network Latency to Downstream Service

gremlin attack-targets \
  --attack-type latency \
  --latency-ms 500 \
  --latency-jitter 100 \
  --affected-hosts api.stripe.com \
  --length 120 \
  --targets 'exact!{"type":"Container","containerLabels":[{"key":"app","value":"checkout-service"}]}'

Watch for: checkout timeout handling, error messages shown to users, circuit breaker opening.

Random Pod Termination (Kubernetes)

gremlin attack-targets \
  --attack-type shutdown \
  --delay 0 \
  --targets 'random!{"type":"Container","containerLabels":[{"key":"app","value":"worker"}],"percent":25}'

Terminates 25% of worker containers. Validates that your job queue handles in-flight job requeuing.

Chaos Scenarios as Gremlin Scenarios

Gremlin's Scenario feature lets you chain attacks into a workflow:

  1. Go to Scenarios → New Scenario
  2. Add a step: CPU attack on app servers, 80% CPU, 60 seconds
  3. Add a step: Latency attack on the same servers, 200ms, 60 seconds
  4. Add a step: Wait 30 seconds
  5. Add a step: Shutdown one instance

This simulates progressive degradation — the system starts struggling with CPU, then network gets slow, then a node goes down. Real cascading failures often look like this.
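
To see when each stage begins, you can lay the steps out on a timeline. This is a small sketch that assumes the steps run strictly one after another and that the shutdown takes no time. The `Step` type and `timeline` function are illustrative and not part of Gremlin.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    seconds: int  # how long the step runs before the next one starts


# The four steps listed above; the shutdown is treated as instantaneous.
SCENARIO = [
    Step("CPU attack, 80%", 60),
    Step("Latency attack, 200ms", 60),
    Step("Wait", 30),
    Step("Shutdown one instance", 0),
]


def timeline(steps):
    """Return (start_offset_seconds, name) for each step, run back to back."""
    offset, result = 0, []
    for step in steps:
        result.append((offset, step.name))
        offset += step.seconds
    return result


for start, name in timeline(SCENARIO):
    print(f"t+{start:>3}s  {name}")
```

Under those assumptions the shutdown lands 150 seconds in, which tells you how long to keep dashboards and the on-call channel open for a single run.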

Scheduling Regular Experiments

Gremlin lets you schedule recurring experiments:

Via UI: Scenarios → your scenario → Schedule → set cron expression.

Via API:

curl -X POST https://api.gremlin.com/v1/scenarios/{scenario-id}/schedules \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cron": "0 10 * * 1",
    "endTime": "2026-12-31T00:00:00.000Z",
    "timezone": "America/New_York",
    "tags": {
      "team": "platform",
      "environment": "staging"
    }
  }'

This runs the scenario every Monday at 10am Eastern — during business hours, when the team is available to observe and respond.
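
Cron expressions are easy to misread, so it is worth checking when a schedule will actually fire. The sketch below handles only the simple weekly form used above (fixed minute, hour, and day of week) and assumes the input time is already in the schedule's timezone. It is an illustration of the cron fields, not how Gremlin evaluates schedules.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, cron_dow, hour, minute=0):
    """Next run strictly after `now` for a weekly cron like '0 10 * * 1'.

    cron_dow uses cron numbering (0 or 7 = Sunday, 1 = Monday).
    `now` is assumed to already be in the schedule's timezone.
    """
    python_dow = (cron_dow + 6) % 7  # datetime.weekday(): Monday = 0
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(python_dow - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# 2026-01-07 is a Wednesday, so the next Monday 10:00 is 2026-01-12.
print(next_weekly_run(datetime(2026, 1, 7, 15, 30), cron_dow=1, hour=10))
```

For full cron syntax (ranges, steps, lists), use a dedicated cron parsing library instead of extending this helper.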

Halt / Rollback

Every Gremlin attack can be stopped immediately:

  • UI: Click the running attack → Halt
  • CLI: gremlin halt --attack-id <id>
  • API: DELETE https://api.gremlin.com/v1/attacks/<id>

Define halt conditions before starting:

  • Error rate exceeds 1%
  • P99 latency exceeds 2 seconds
  • On-call engineer requests stop
  • Any SEV1 alert fires

Make halt the default response when something unexpected happens. You can always restart with more information.
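
Writing the halt conditions down as code removes ambiguity about when to stop. The sketch below encodes the four conditions listed above. The `Metrics` type and its field names are hypothetical, and how you populate them depends on your monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float              # fraction of failed requests, 0.004 = 0.4%
    p99_latency_seconds: float
    sev1_alert_firing: bool = False
    stop_requested: bool = False   # on-call engineer asked to stop


def halt_reasons(m, max_error_rate=0.01, max_p99_seconds=2.0):
    """Return the list of halt conditions that are currently met."""
    reasons = []
    if m.error_rate > max_error_rate:
        reasons.append(
            f"error rate {m.error_rate:.1%} exceeds {max_error_rate:.0%}")
    if m.p99_latency_seconds > max_p99_seconds:
        reasons.append(
            f"P99 latency {m.p99_latency_seconds}s exceeds {max_p99_seconds}s")
    if m.stop_requested:
        reasons.append("on-call engineer requested stop")
    if m.sev1_alert_firing:
        reasons.append("SEV1 alert firing")
    return reasons


# Healthy system: no reasons, the experiment may continue.
print(halt_reasons(Metrics(error_rate=0.004, p99_latency_seconds=0.8)))
# Degraded system: two conditions met, halt the attack.
print(halt_reasons(Metrics(error_rate=0.03, p99_latency_seconds=2.5)))
```

A non-empty list means halt. Returning the reasons, not just a boolean, gives you something concrete to record in the experiment notes.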

Reliability Score

Gremlin's Reliability Score aggregates your chaos experiment results into a score per service:

  • Attacks run — how many experiments you've conducted
  • Weaknesses found — findings from experiments
  • Recommendations — what to run next based on your infrastructure

Use it to prioritize where to invest resilience improvements.

Integrations

Monitoring

Connect Gremlin to your monitoring stack to automatically mark experiment windows:

Datadog:

curl -X POST https://api.gremlin.com/v1/integrations \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type":"datadog","apiKey":"your-datadog-api-key"}'

Gremlin will send events to Datadog when attacks start and stop — you'll see them on your dashboards as annotations.

PagerDuty:

Configure under Settings → Integrations → PagerDuty. Gremlin can silence alerts during planned experiments (so you don't page on-call for a scheduled test).

CI/CD

Run chaos experiments after staging deployment:

# GitHub Actions
chaos-validation:
  needs: deploy-staging
  runs-on: ubuntu-latest
  steps:
    - name: Run Gremlin scenario
      env:
        GREMLIN_API_KEY: ${{ secrets.GREMLIN_API_KEY }}
        SCENARIO_ID: ${{ vars.CHAOS_SCENARIO_ID }}
      run: |
        # Start scenario
        ATTACK_ID=$(curl -s -X POST \
          "https://api.gremlin.com/v1/scenarios/$SCENARIO_ID/runs" \
          -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.guid')
        
        # Wait for completion
        sleep 180
        
        # Check results
        STATUS=$(curl -s \
          "https://api.gremlin.com/v1/scenarios/$SCENARIO_ID/runs/$ATTACK_ID" \
          -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.status')
        
        if [ "$STATUS" = "failed" ]; then
          echo "Chaos scenario found issues — blocking deployment"
          exit 1
        fi
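
A fixed `sleep 180` either wastes time or checks too early if the scenario runs long. A polling loop with a timeout is one alternative. The sketch below is written in Python so it can be tested without the API. The in-progress status names are assumptions (check the values your API responses actually contain), and `fetch_status` stands in for a wrapper around the GET request in the workflow above.

```python
import time


def wait_for_run(fetch_status, timeout_seconds=600, interval_seconds=15,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the run leaves an in-progress state or the timeout expires.

    fetch_status is any zero-argument callable returning a status string.
    """
    in_progress = {"pending", "running"}  # assumed names, adjust to the API
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status not in in_progress:
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval_seconds)


# Demo with a fake status source instead of a real API call.
fake = iter(["running", "running", "successful"])
print(wait_for_run(lambda: next(fake), sleep=lambda s: None))
```

Treat a `"timeout"` result as a failure in the pipeline, since a run that never finishes is itself a finding.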

Gremlin vs. Open Source

Feature        | Gremlin                       | Chaos Mesh / Litmus
Setup          | Agent install, web UI         | Kubernetes Helm install
Attack types   | Resource, network, state, app | Pod/network/IO (Kubernetes)
UI             | Full web UI with history      | Kubernetes dashboard
Scheduling     | Built-in                      | CronJob resources
Team features  | RBAC, audit logs              | None built-in
Non-Kubernetes | Yes                           | No
Cost           | Paid (free tier: 1 target)    | Free

Gremlin's main advantages are the ability to target non-Kubernetes hosts and the team/audit features. Chaos Mesh is preferable if you're Kubernetes-only and want to avoid a paid tool.

Starting Your Practice

The technology is the easy part. Building a chaos engineering practice requires:

  1. Start with a postmortem list — what caused your last three incidents? Run experiments that reproduce those conditions.
  2. Define your steady state first — what does "working" look like? Error rate, latency, throughput. You need a baseline to compare against.
  3. Run in staging before production — validate that your experiments produce the expected conditions, and that your halt procedures work, before touching production.
  4. Make findings into tickets — every weakness discovered must become a tracked issue. Chaos experiments that don't drive fixes are theater.
  5. Hold a GameDay — schedule 2-3 hours with the on-call team, run experiments together, review the results. Learning happens faster when it's shared.
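
For step 2, a steady state is only useful if you can compare against it. The sketch below flags metrics that move more than a chosen tolerance away from the baseline. The metric names, sample values, and the 10% tolerance are all illustrative assumptions.

```python
def steady_state_deviations(baseline, observed, tolerance=0.10):
    """Return {metric: relative_change} for metrics outside the tolerance.

    Relative change is (observed - baseline) / baseline; metrics with a
    zero baseline are flagged whenever the observed value is non-zero.
    """
    deviations = {}
    for name, base in baseline.items():
        value = observed[name]
        if base == 0:
            if value != 0:
                deviations[name] = float("inf")
            continue
        change = (value - base) / base
        if abs(change) > tolerance:
            deviations[name] = round(change, 3)
    return deviations


baseline = {"error_rate": 0.002, "p95_latency_ms": 180, "throughput_rps": 450}
during_attack = {"error_rate": 0.002, "p95_latency_ms": 260, "throughput_rps": 440}

# Latency rose by about 44%; throughput moved about 2%, within tolerance.
print(steady_state_deviations(baseline, during_attack))
```

Each flagged metric is a candidate for the ticket described in step 4.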

Gremlin's platform makes the experiments easier. The value comes from acting on what you find.
