Steadybit: Reliability Testing Platform Guide

Steadybit is a reliability testing platform that combines chaos engineering with automated discovery of your system topology. It maps your running services, detects their dependencies, and suggests reliability experiments based on what it finds. Where other chaos tools require you to know exactly what to target, Steadybit helps you discover what to test first.

The Reliability Testing Problem

Most chaos engineering tools assume you already know your system's weak points. You write experiments targeting specific services, specific failure modes, and specific infrastructure. This works well when you have deep system knowledge — but it misses the gaps.

Steadybit approaches this differently:

  1. Discover: Agents map your running infrastructure — containers, services, hosts, Kubernetes resources
  2. Advise: The advice engine analyzes your topology and suggests experiments based on missing redundancy, unconfigured timeouts, single points of failure
  3. Experiment: Run targeted reliability experiments against discovered targets
  4. Observe: Integrated metrics and logs show the impact during experiments

The discovery layer is what distinguishes Steadybit from tools like LitmusChaos or Chaos Toolkit. Instead of writing experiments from scratch, you start with what Steadybit found and verified about your system.

Architecture

Steadybit consists of:

Platform: The central control plane — SaaS or self-hosted. Stores experiments, schedules runs, displays results.

Agents: Lightweight processes running in your infrastructure. Each agent discovers local resources and executes attacks. Agents connect outbound to the platform; you don't need inbound firewall rules.

Extensions: Pluggable modules that add attack types and discovery capabilities. Extensions exist for Kubernetes, AWS, Datadog, Prometheus, Dynatrace, and others.

Installation

Steadybit runs as SaaS with agents deployed in your infrastructure. Sign up for a free account at steadybit.com to get started.

Kubernetes agent installation via Helm:

# Add Steadybit Helm repo
helm repo add steadybit https://steadybit.github.io/helm-charts
helm repo update

# Install agent with your API key (from Steadybit dashboard)
helm install steadybit-agent steadybit/steadybit-agent \
  --namespace steadybit-agent \
  --create-namespace \
  --set agent.key="<YOUR_AGENT_KEY>" \
  --set global.agentKey="<YOUR_AGENT_KEY>" \
  --set agent.registerUrl="https://platform.steadybit.com"

Verify agents are running:

kubectl get pods -n steadybit-agent
# NAME                                READY   STATUS    RESTARTS
# steadybit-agent-7f8b9c-xk4wm       1/1     Running   0
# steadybit-agent-7f8b9c-ml2qp       1/1     Running   0

Docker/Linux agent:

curl -sfL https://get.steadybit.com/agent | \
  STEADYBIT_KEY=<YOUR_AGENT_KEY> \
  bash

After a few minutes, your infrastructure appears in the Steadybit dashboard under "Landscape."

Understanding Discovery

Once agents run, Steadybit automatically discovers:

  • Kubernetes: Pods, deployments, stateful sets, services, namespaces, nodes
  • Hosts: CPU, memory, disk, network interfaces
  • Containers: Docker containers and their metadata
  • AWS: EC2 instances, ECS tasks, Lambda functions (with AWS extension)
  • Applications: HTTP services, JVM applications (via Micrometer)

The landscape view shows these targets and their relationships — which pods belong to which deployments, which services connect to which databases, which nodes host which workloads.

The Advice System

Steadybit's advice engine analyzes your landscape and flags reliability issues. Navigate to Advice in the dashboard to see what it found.

Common advice findings:

No redundancy: "Service payment-api has only 1 replica. A single pod failure causes a complete outage."

Suggested experiment: kill the pod and verify availability. Expected result: outage detected → fix by scaling to 2+ replicas.
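The fix the advice points toward is a one-line change in the Deployment spec. A minimal excerpt (names and namespace are illustrative, matching the examples in this guide):

```yaml
# Deployment excerpt: with two replicas, the pod_delete experiment
# should pass because the service survives losing one pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production
spec:
  replicas: 2
```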

Missing readiness probe: "Pod worker-service has no readiness probe. Traffic may reach unhealthy pods."

Suggested experiment: kill the pod and measure recovery time with and without readiness probe.
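The corresponding fix is a readiness probe on the container, so Kubernetes withholds traffic until the pod reports healthy. A sketch, with an assumed /ready endpoint and port:

```yaml
# Container excerpt: traffic only reaches the pod once /ready
# returns a 2xx response. Path, port, and timings are illustrative.
containers:
  - name: worker-service
    image: example.com/worker-service:latest
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
```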

No resource limits: "Container api has no CPU or memory limits. Resource contention may cause unpredictable behavior."

Suggested experiment: stress CPU on co-located pods, verify isolation.
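The fix here is declaring requests and limits so the scheduler can isolate workloads. A sketch with illustrative starting values:

```yaml
# Container excerpt: requests give the scheduler placement
# information; limits cap what the container can consume under
# contention. Values are illustrative, not recommendations.
containers:
  - name: api
    image: example.com/api:latest
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
```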

Long restart time: "Pod data-processor took 45 seconds to restart in the last 7 days. This suggests initialization work that could be moved to readiness gates."

Each advice item links directly to an experiment you can run to verify the issue and measure improvement after fixing.

Creating Experiments

Via the UI: The experiment builder lets you select targets from the discovery, choose attack types, add checks, and configure timing.

Via YAML: Experiments can be defined as code:

name: "Pod Resilience Check"
lanes:
  - steps:
      - type: action
        actionType: "com.steadybit.extension_kubernetes.pod_delete"
        parameters:
          duration: "30s"
          podCountCheckMode: "podCountEqualsDesiredCount"
        targets:
          fromQuery:
            query: 'k8s.namespace="production" AND k8s.label.app="payment-api"'
            percentage: 50

  - steps:
      - type: action
        actionType: "com.steadybit.extension_http.check"
        parameters:
          url: "https://api.example.com/health"
          method: "GET"
          successRate: 95
          duration: "35s"

Lanes run in parallel. This experiment simultaneously deletes 50% of payment-api pods and checks that the health endpoint maintains 95% success rate.

Attack Types

Steadybit's extensions provide a wide range of attack types:

Kubernetes:

  • pod_delete — delete pods, simulating crashes
  • pod_pause — pause pod processes
  • deployment_rollout_restart — trigger rolling restart
  • network_blackhole — block all network traffic
  • network_delay — add latency using tc
  • network_package_loss — drop packets
  • network_bandwidth_limit — throttle bandwidth

Host:

  • cpu_stress — consume CPU on host
  • memory_stress — consume memory
  • network_delay — host-level network latency
  • disk_stress — fill disk or stress I/O

AWS (via extension):

  • ec2_stop — stop EC2 instances
  • rds_failover — trigger RDS failover
  • az_outage — block traffic to specific AZ

Checks (used to validate steady state):

  • HTTP endpoint check
  • Datadog monitor check
  • Prometheus query check
  • Kubernetes pod count check
  • Process check
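A check is added to an experiment as a lane alongside the attack, just like the HTTP check in the earlier YAML example. As a sketch of a Prometheus-based steady-state check — the actionType and parameter names here are assumptions, not verified extension IDs:

```yaml
# Hypothetical Prometheus check lane: fail the experiment if the
# 5xx error rate is non-zero during the chaos window.
- steps:
    - type: action
      actionType: "com.steadybit.extension_prometheus.check"
      parameters:
        query: 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
        duration: "60s"
```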

Target Queries

Steadybit uses a query language to select experiment targets from the discovered landscape:

# All pods in production namespace
k8s.namespace="production"

# Specific service
k8s.namespace="production" AND k8s.label.app="payment-api"

# Pods on a specific node
k8s.node.name="worker-node-1"

# AWS instances in a specific AZ
aws.zone="us-east-1a" AND aws.tag.Environment="staging"

# Multiple environments (logical OR)
k8s.namespace="staging" OR k8s.namespace="qa"

This query-based targeting means experiments are environment-agnostic — the same experiment runs in staging and production by changing the namespace query.
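Concretely, the "Pod Resilience Check" experiment defined earlier moves from production to staging by editing only its targets block:

```yaml
# Same experiment, retargeted at staging by changing only the query
targets:
  fromQuery:
    query: 'k8s.namespace="staging" AND k8s.label.app="payment-api"'
    percentage: 50
```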

Running an Experiment from the CLI

Steadybit provides a CLI for automation:

# Install CLI
curl -sfL https://get.steadybit.com/cli | bash

# Authenticate
steadybit login --key <YOUR_API_KEY>

# Run an experiment
steadybit experiment run --id exp-12345

# Run and wait for result
steadybit experiment run --id exp-12345 --wait

# Exit code reflects experiment result (0=pass, 1=fail)
echo $?

CI/CD Integration

# .github/workflows/reliability.yml
name: Reliability Tests
on:
  push:
    branches: [main]

jobs:
  reliability:
    runs-on: ubuntu-latest
    steps:
      - name: Install Steadybit CLI
        run: curl -sfL https://get.steadybit.com/cli | bash

      - name: Run reliability experiments
        env:
          STEADYBIT_API_KEY: ${{ secrets.STEADYBIT_API_KEY }}
        run: |
          # Run all experiments tagged for CI
          steadybit experiment run \
            --tag ci \
            --wait \
            --timeout 300

      - name: Report results
        if: always()
        run: |
          steadybit experiment list \
            --tag ci \
            --format table

Tag experiments in the Steadybit UI with ci to indicate they should run in the pipeline.

Observability During Experiments

Steadybit integrates with your existing monitoring:

Prometheus/Grafana: Import the Steadybit Grafana dashboard to see experiment events overlaid on your metrics. This shows exactly when chaos started and stopped, correlated with latency or error rate spikes.

Datadog: Steadybit can check Datadog monitors as experiment conditions. If a monitor triggers during chaos, the experiment stops.

steps:
  - type: action
    actionType: "com.steadybit.extension_datadog.monitor_check"
    parameters:
      monitorId: "monitor-12345"
      duration: "60s"
      allowedState: "OK"

PagerDuty: Steadybit can notify PagerDuty when experiments start (so on-call knows it's intentional) and when they find real issues (creating incidents automatically).

Experiment Templates

The Steadybit Hub provides pre-built experiment templates organized by reliability pattern:

  • Redundancy: verify services handle instance loss
  • Fallback: verify graceful degradation when dependencies fail
  • Recovery: verify automatic recovery after failure
  • Scalability: verify correct behavior under resource pressure

Import templates directly into your workspace and customize the target query for your environment.

Self-Hosted Platform

For teams with strict data locality requirements, Steadybit offers a self-hosted option:

# Docker Compose self-hosted setup
curl -sfL https://docs.steadybit.com/install-steadybit/self-hosted/steadybit-setup.sh | \
  STEADYBIT_LICENSE=<YOUR_LICENSE_KEY> bash

The self-hosted platform runs as Docker containers; the same agents connect to your internal platform URL instead of platform.steadybit.com.
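For the Helm-installed Kubernetes agents from earlier, that means pointing the register URL at the internal endpoint. A values excerpt, with a placeholder hostname:

```yaml
# Helm values excerpt for agents registering with a self-hosted
# platform; the hostname is a placeholder for your internal URL.
agent:
  registerUrl: "https://steadybit.internal.example.com"
```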

Steadybit vs Chaos Toolkit vs LitmusChaos

Steadybit: Commercial platform (free tier available), UI-focused, discovery-driven. Best for teams that want guided reliability testing without writing experiments from scratch. The advice system accelerates time-to-first-experiment.

Chaos Toolkit: Open-source, CLI-focused, experiment-as-code. Best for teams that want experiments version-controlled as code, integrated with existing CI, without any SaaS dependency.

LitmusChaos: Open-source, Kubernetes-native CRDs. Best for teams already invested in Kubernetes-native tooling, who want chaos experiments alongside their Kubernetes resource definitions.

Steadybit's key differentiator is the advice loop: it discovers what you have, tells you what's likely to fail, and helps you verify it. For teams starting their chaos engineering journey, this guided approach is often faster than beginning with blank YAML files. For teams with mature practices who need experiments version-controlled and reviewed in PRs, Chaos Toolkit's code-first approach is more natural.

The free tier of Steadybit (up to 10 agents) is sufficient for most development and staging environments, making it accessible for teams that want to evaluate reliability testing without upfront commitment.
