Steadybit: Reliability Testing Platform Guide
Steadybit is a reliability testing platform that combines chaos engineering with automated discovery of your system topology. It maps your running services, detects their dependencies, and suggests reliability experiments based on what it finds. Where other chaos tools require you to know exactly what to target, Steadybit helps you discover what to test first.
The Reliability Testing Problem
Most chaos engineering tools assume you already know your system's weak points. You write experiments targeting specific services, specific failure modes, and specific infrastructure. This works well when you have deep system knowledge — but it misses the gaps.
Steadybit approaches this differently:
- Discover: Agents map your running infrastructure — containers, services, hosts, Kubernetes resources
- Advise: The advice engine analyzes your topology and suggests experiments based on missing redundancy, unconfigured timeouts, single points of failure
- Experiment: Run targeted reliability experiments against discovered targets
- Observe: Integrated metrics and logs show the impact during experiments
The discovery layer is what distinguishes Steadybit from tools like LitmusChaos or Chaos Toolkit. Instead of writing experiments from scratch, you start with what Steadybit found and verified about your system.
Architecture
Steadybit consists of:
Platform: The central control plane — SaaS or self-hosted. Stores experiments, schedules runs, displays results.
Agents: Lightweight processes running in your infrastructure. Each agent discovers local resources and executes attacks. Agents connect outbound to the platform; you don't need inbound firewall rules.
Extensions: Pluggable modules that add attack types and discovery capabilities. Extensions exist for Kubernetes, AWS, Datadog, Prometheus, Dynatrace, and others.
Installation
Steadybit runs as SaaS with agents deployed in your infrastructure. Sign up for a free account at steadybit.com to get started.
Kubernetes agent installation via Helm:
# Add Steadybit Helm repo
helm repo add steadybit https://steadybit.github.io/helm-charts
helm repo update
# Install agent with your API key (from Steadybit dashboard)
helm install steadybit-agent steadybit/steadybit-agent \
  --namespace steadybit-agent \
  --create-namespace \
  --set agent.key="<YOUR_AGENT_KEY>" \
  --set global.agentKey="<YOUR_AGENT_KEY>" \
  --set agent.registerUrl="https://platform.steadybit.com"

Verify agents are running:
kubectl get pods -n steadybit-agent
# NAME                          READY   STATUS    RESTARTS
# steadybit-agent-7f8b9c-xk4wm  1/1     Running   0
# steadybit-agent-7f8b9c-ml2qp  1/1     Running   0

Docker/Linux agent:
curl -sfL https://get.steadybit.com/agent | \
STEADYBIT_KEY=<YOUR_AGENT_KEY> \
bash

After a few minutes, your infrastructure appears in the Steadybit dashboard under "Landscape."
Understanding Discovery
Once agents run, Steadybit automatically discovers:
- Kubernetes: Pods, deployments, stateful sets, services, namespaces, nodes
- Hosts: CPU, memory, disk, network interfaces
- Containers: Docker containers and their metadata
- AWS: EC2 instances, ECS tasks, Lambda functions (with AWS extension)
- Applications: HTTP services, JVM applications (via Micrometer)
The landscape view shows these targets and their relationships — which pods belong to which deployments, which services connect to which databases, which nodes host which workloads.
The Advice System
Steadybit's advice engine analyzes your landscape and flags reliability issues. Navigate to Advice in the dashboard to see what it found.
Common advice findings:
No redundancy: "Service payment-api has only 1 replica. A single pod failure causes a complete outage."
Suggested experiment: kill the pod and verify availability. Expected result: outage detected → fix by scaling to 2+ replicas.
Missing readiness probe: "Pod worker-service has no readiness probe. Traffic may reach unhealthy pods."
Suggested experiment: kill the pod and measure recovery time with and without readiness probe.
No resource limits: "Container api has no CPU or memory limits. Resource contention may cause unpredictable behavior."
Suggested experiment: stress CPU on co-located pods, verify isolation.
Long restart time: "Pod data-processor took 45 seconds to restart in the last 7 days. This suggests initialization work that could be moved to readiness gates."
Each advice item links directly to an experiment you can run to verify the issue and measure improvement after fixing.
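Each advice-to-fix loop typically ends in a small manifest change. As an illustration, the no-redundancy finding above is resolved by bumping the replica count; the manifest below is a sketch (the image name, port, and probe path are placeholders, not Steadybit output):

```yaml
# deployment.yaml — scale payment-api to two replicas so a single
# pod failure no longer causes a complete outage
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production
spec:
  replicas: 2          # was 1; the advice suggested 2+
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:latest  # placeholder image
          readinessProbe:          # also addresses the missing-probe finding
            httpGet:
              path: /health       # placeholder probe path
              port: 8080
```

After applying the fix, re-running the linked experiment verifies that killing one pod no longer causes an outage.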
Creating Experiments
Via the UI: The experiment builder lets you select targets from the discovery, choose attack types, add checks, and configure timing.
Via YAML: Experiments can be defined as code:
name: "Pod Resilience Check"
lanes:
  - steps:
      - type: action
        actionType: "com.steadybit.extension_kubernetes.pod_delete"
        parameters:
          duration: "30s"
          podCountCheckMode: "podCountEqualsDesiredCount"
        targets:
          fromQuery:
            query: 'k8s.namespace="production" AND k8s.label.app="payment-api"'
            percentage: 50
  - steps:
      - type: action
        actionType: "com.steadybit.extension_http.check"
        parameters:
          url: "https://api.example.com/health"
          method: "GET"
          successRate: 95
          duration: "35s"

Lanes run in parallel. This experiment simultaneously deletes 50% of payment-api pods and checks that the health endpoint maintains 95% success rate.
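A steady-state check on the pod count itself can run as a third lane. The sketch below assumes the Kubernetes extension exposes a pod-count check action with the same shape as the steps above; the exact action id and parameter names may differ in your Steadybit version:

```yaml
  # additional lane: verify the deployment keeps its desired replica count
  - steps:
      - type: action
        actionType: "com.steadybit.extension_kubernetes.pod_count_check"  # assumed action id
        parameters:
          duration: "35s"
          podCountCheckMode: "podCountEqualsDesiredCount"
        targets:
          fromQuery:
            query: 'k8s.namespace="production" AND k8s.label.app="payment-api"'
```

If the replica count stays below the desired count for the full duration, the experiment fails, which surfaces slow rescheduling as well as outright outages.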
Attack Types
Steadybit's extensions provide a wide range of attack types:
Kubernetes:
- pod_delete — delete pods, simulating crashes
- pod_pause — pause pod processes
- deployment_rollout_restart — trigger rolling restart
- network_blackhole — block all network traffic
- network_delay — add latency using tc
- network_package_loss — drop packets
- network_bandwidth_limit — throttle bandwidth
Host:
- cpu_stress — consume CPU on host
- memory_stress — consume memory
- network_delay — host-level network latency
- disk_stress — fill disk or stress I/O
AWS (via extension):
- ec2_stop — stop EC2 instances
- rds_failover — trigger RDS failover
- az_outage — block traffic to specific AZ
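Any of these attacks slots into an experiment as an action step. For example, a latency attack on discovered pods might look like the sketch below; the parameter name for the delay is an assumption modeled on the pod_delete example, not verified against the extension:

```yaml
- type: action
  actionType: "com.steadybit.extension_kubernetes.network_delay"
  parameters:
    duration: "60s"
    networkDelay: "200ms"   # assumed parameter name for the injected latency
  targets:
    fromQuery:
      query: 'k8s.namespace="production" AND k8s.label.app="payment-api"'
```

Pairing this with an HTTP check lane (as in the Pod Resilience example earlier) turns "add 200ms of latency" into a testable hypothesis about user-visible impact.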
Checks (used to validate steady state):
- HTTP endpoint check
- Datadog monitor check
- Prometheus query check
- Kubernetes pod count check
- Process check
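Check steps have the same shape as attack steps. A Prometheus query check might look like the sketch below; the action id and parameter names are assumptions patterned after the HTTP and Datadog checks shown elsewhere in this guide:

```yaml
- type: action
  actionType: "com.steadybit.extension_prometheus.check"  # assumed action id
  parameters:
    duration: "60s"
    # example PromQL: rate of 5xx responses across the service
    query: 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
    # threshold configuration omitted; consult the extension's docs
```

The idea is the same regardless of backend: the check defines steady state, and the experiment fails if steady state is violated while the attack runs.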
Target Queries
Steadybit uses a query language to select experiment targets from the discovered landscape:
# All pods in production namespace
k8s.namespace="production"
# Specific service
k8s.namespace="production" AND k8s.label.app="payment-api"
# Pods on a specific node
k8s.node.name="worker-node-1"
# AWS instances in a specific AZ
aws.zone="us-east-1a" AND aws.tag.Environment="staging"
# Multiple environments (logical OR)
k8s.namespace="staging" OR k8s.namespace="qa"

This query-based targeting means experiments are environment-agnostic — the same experiment runs in staging and production by changing the namespace query.
Running an Experiment from the CLI
Steadybit provides a CLI for automation:
# Install CLI
curl -sfL https://get.steadybit.com/cli | bash
# Authenticate
steadybit login --key <YOUR_API_KEY>
# Run an experiment
steadybit experiment run --id exp-12345
# Run and wait for result
steadybit experiment run --id exp-12345 --wait
# Exit code reflects experiment result (0=pass, 1=fail)
echo $?

CI/CD Integration
# .github/workflows/reliability.yml
name: Reliability Tests
on:
  push:
    branches: [main]
jobs:
  reliability:
    runs-on: ubuntu-latest
    steps:
      - name: Install Steadybit CLI
        run: curl -sfL https://get.steadybit.com/cli | bash
      - name: Run reliability experiments
        env:
          STEADYBIT_API_KEY: ${{ secrets.STEADYBIT_API_KEY }}
        run: |
          # Run all experiments tagged for CI
          steadybit experiment run \
            --tag ci \
            --wait \
            --timeout 300
      - name: Report results
        if: always()
        run: |
          steadybit experiment list \
            --tag ci \
            --format table

Tag experiments in the Steadybit UI with ci to indicate they should run in the pipeline.
Observability During Experiments
Steadybit integrates with your existing monitoring:
Prometheus/Grafana: Import the Steadybit Grafana dashboard to see experiment events overlaid on your metrics. This shows exactly when chaos started and stopped, correlated with latency or error rate spikes.
Datadog: Steadybit can check Datadog monitors as experiment conditions. If a monitor triggers during chaos, the experiment stops.
steps:
  - type: action
    actionType: "com.steadybit.extension_datadog.monitor_check"
    parameters:
      monitorId: "monitor-12345"
      duration: "60s"
      allowedState: "OK"

PagerDuty: Steadybit can notify PagerDuty when experiments start (so on-call knows it's intentional) and when they find real issues (creating incidents automatically).
Experiment Templates
The Steadybit Hub provides pre-built experiment templates organized by reliability pattern:
- Redundancy: verify services handle instance loss
- Fallback: verify graceful degradation when dependencies fail
- Recovery: verify automatic recovery after failure
- Scalability: verify correct behavior under resource pressure
Import templates directly into your workspace and customize the target query for your environment.
Self-Hosted Platform
For teams with strict data locality requirements, Steadybit offers a self-hosted option:
# Docker Compose self-hosted setup
curl -sfL https://docs.steadybit.com/install-steadybit/self-hosted/steadybit-setup.sh | \
  STEADYBIT_LICENSE=<YOUR_LICENSE_KEY> bash

The self-hosted platform runs as Docker containers, and the same agents connect to your internal URL instead of steadybit.com.
Steadybit vs Chaos Toolkit vs LitmusChaos
Steadybit: Commercial platform (free tier available), UI-focused, discovery-driven. Best for teams that want guided reliability testing without writing experiments from scratch. The advice system accelerates time-to-first-experiment.
Chaos Toolkit: Open-source, CLI-focused, experiment-as-code. Best for teams that want experiments version-controlled as code, integrated with existing CI, without any SaaS dependency.
LitmusChaos: Open-source, Kubernetes-native CRDs. Best for teams already invested in Kubernetes-native tooling, who want chaos experiments alongside their Kubernetes resource definitions.
Steadybit's key differentiator is the advice loop: it discovers what you have, tells you what's likely to fail, and helps you verify it. For teams starting their chaos engineering journey, this guided approach is often faster than beginning with blank YAML files. For teams with mature practices who need experiments version-controlled and reviewed in PRs, Chaos Toolkit's code-first approach is more natural.
The free tier of Steadybit (up to 10 agents) is sufficient for most development and staging environments, making it accessible for teams that want to evaluate reliability testing without upfront commitment.