Toxiproxy: Testing Network Faults and Latency in Your Applications
Network faults are among the most common sources of distributed-system bugs, and among the hardest to test. A unit test can stub a database call. An integration test can verify a happy-path HTTP request. But which tests verify that your application handles a 5-second connection timeout gracefully? Or that a flaky connection that drops 20% of packets triggers a retry rather than a silent failure?
Toxiproxy is a TCP proxy developed by Shopify specifically for testing these scenarios. It sits between your application and any network dependency and lets you inject faults programmatically: latency, bandwidth throttling, connection resets, timeouts, slow closes, and more. Because it operates at the TCP level, it works with any TCP-based protocol: HTTP, database wire protocols, message broker connections, and so on.
How Toxiproxy Works
Toxiproxy has two components: a server and a client. The server is a standalone process that manages proxy configurations. The client is a library (available for most languages) that communicates with the server to create, configure, and tear down proxies.
A proxy is a named TCP listener that forwards traffic to a real upstream. Your application connects to the proxy instead of directly to the upstream. A toxic is a configuration applied to a proxy that modifies how traffic flows, for example by adding latency, throttling bandwidth, or closing connections.
The key properties:
- Toxics are directional. Each toxic is attached to either the upstream (application → dependency) or the downstream (dependency → application) stream, so the two directions can be configured independently.
- Multiple toxics stack. You can apply latency and bandwidth throttling simultaneously.
- Toxics are added and removed at runtime through the HTTP control API, so you can enable or disable them mid-test without restarting the proxy.
- Latency jitter is supported natively, making it easy to simulate realistic network conditions rather than artificial constant latency.
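The properties above show up directly in the JSON that describes a toxic. The sketch below models that body as a Go struct and builds two toxics meant for the same proxy, one per stream. The field names follow the body accepted by `POST /proxies/{proxy}/toxics`; the struct, the helper functions, and the toxic names are hypothetical and only for illustration.

```go
package main

import "encoding/json"

// toxicSpec mirrors the JSON body of POST /proxies/{proxy}/toxics.
type toxicSpec struct {
	Name       string         `json:"name"`
	Type       string         `json:"type"`
	Stream     string         `json:"stream"`   // "upstream" or "downstream"
	Toxicity   float64        `json:"toxicity"` // share of connections affected, 0.0 to 1.0
	Attributes map[string]int `json:"attributes"`
}

// stackedToxics returns two toxics intended for the same proxy:
// jittered latency on replies and a bandwidth cap on requests.
func stackedToxics() []toxicSpec {
	return []toxicSpec{
		{Name: "slow-replies", Type: "latency", Stream: "downstream", Toxicity: 1.0,
			Attributes: map[string]int{"latency": 500, "jitter": 50}},
		{Name: "thin-uplink", Type: "bandwidth", Stream: "upstream", Toxicity: 1.0,
			Attributes: map[string]int{"rate": 100}},
	}
}

// encodeToxic renders one toxic as a JSON document.
func encodeToxic(t toxicSpec) (string, error) {
	body, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return string(body), nil
}
```

Posting each document separately stacks the toxics on the proxy, and deleting one by name leaves the other in place.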
Installing and Running Toxiproxy
Toxiproxy is a single binary with no dependencies:
# macOS via Homebrew
brew install toxiproxy
# Linux (download binary)
wget -O toxiproxy-server https://github.com/Shopify/toxiproxy/releases/download/v2.7.0/toxiproxy-server-linux-amd64
chmod +x toxiproxy-server
# Run the server (default port 8474 for control API)
./toxiproxy-server
# Or with an explicit port and log level (the log level is read from the LOG_LEVEL environment variable)
LOG_LEVEL=info ./toxiproxy-server --port 8474
The control API runs on port 8474 by default. The proxies you create listen on whatever ports you specify.
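Because the control API is plain HTTP, a proxy can also be created without any client library. The following is a rough sketch using only the Go standard library; it assumes the server is reachable at the URL you pass in, and the type and function names (`proxySpec`, `newCreateProxyRequest`, `createProxy`) are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// proxySpec mirrors the JSON body of POST /proxies.
type proxySpec struct {
	Name     string `json:"name"`
	Listen   string `json:"listen"`
	Upstream string `json:"upstream"`
	Enabled  bool   `json:"enabled"`
}

// newCreateProxyRequest builds the POST /proxies request for the given control API URL.
func newCreateProxyRequest(apiURL string, spec proxySpec) (*http.Request, error) {
	body, err := json.Marshal(spec)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, apiURL+"/proxies", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// createProxy sends the request and treats any non-2xx status as an error.
func createProxy(apiURL string, spec proxySpec) error {
	req, err := newCreateProxyRequest(apiURL, spec)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("create proxy %q: unexpected status %s", spec.Name, resp.Status)
	}
	return nil
}
```

A call such as `createProxy("http://localhost:8474", proxySpec{Name: "postgres", Listen: "localhost:5433", Upstream: "postgres:5432", Enabled: true})` would set up a Postgres proxy. In practice the client libraries wrap exactly this kind of call.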
For Docker:
docker run -d \
  --name toxiproxy \
  -p 8474:8474 \
  -p 5433:5433 \
  ghcr.io/shopify/toxiproxy:2.7.0
The CLI tool (toxiproxy-cli) communicates with the control API:
# Create a proxy: listen on localhost:5433, forward to postgres:5432
toxiproxy-cli create postgres --listen localhost:5433 --upstream postgres:5432
# List all proxies
toxiproxy-cli list
# Add a latency toxic (500ms ± 50ms jitter)
toxiproxy-cli toxic add postgres --type latency --attribute latency=500 --attribute jitter=50
# Remove the toxic
toxiproxy-cli toxic remove postgres --toxicName latency_downstream
# Delete the proxy
toxiproxy-cli delete postgres
Toxic Types
Latency
Adds a fixed delay plus optional jitter to all data passing through. This is the most commonly used toxic.
# 1000ms base latency with ±100ms random jitter
toxiproxy-cli toxic add my-proxy \
  --type latency \
  --attribute latency=1000 \
  --attribute jitter=100
Bandwidth
Throttles the data transfer rate, simulating a slow connection:
toxiproxy-cli toxic add my-proxy \
  --type bandwidth \
  --attribute rate=100  # 100 KB/s
Slow Close
Delays the TCP close after the upstream has finished sending data. This simulates a connection that hangs at the end of a response—particularly common in poorly implemented keep-alive scenarios:
toxiproxy-cli toxic add my-proxy \
  --type slow_close \
  --attribute delay=5000  # 5s before close completes
Timeout
Stops all data from getting through and closes the connection once the timeout (in milliseconds) has elapsed. With timeout=0 the connection is never closed: it appears open, but nothing arrives until the toxic is removed. That silent variant is deadlier than a clean close because many applications do not handle silent timeouts:
toxiproxy-cli toxic add my-proxy \
  --type timeout \
  --attribute timeout=3000  # Drop all data, then close the connection after 3s
Reset Peer
Immediately resets (RST) the connection. This simulates an abrupt connection failure rather than a graceful close:
toxiproxy-cli toxic add my-proxy \
  --type reset_peer \
  --attribute timeout=0  # Reset immediately
Slicer
Splits data into small chunks and sends them with delays between each chunk. This simulates a server that sends responses byte by byte, which can expose buffering bugs in HTTP parsers:
# 1 byte per chunk, 50ms between chunks (the slicer delay is given in microseconds)
toxiproxy-cli toxic add my-proxy \
  --type slicer \
  --attribute average_size=1 \
  --attribute delay=50000
Limit Data
Forwards a fixed number of bytes through the connection and then closes it. Useful for testing partial response handling:
toxiproxy-cli toxic add my-proxy \
  --type limit_data \
  --attribute bytes=1024  # Forward only 1KB then close
Using Toxiproxy in Go Tests
The Go client library provides a clean API for creating and managing proxies within test code. Install it:
go get github.com/Shopify/toxiproxy/v2/client
Here is a complete example testing a PostgreSQL connection pool under network fault conditions:
package db_test

import (
	"context"
	"database/sql"
	"testing"
	"time"

	toxiproxy "github.com/Shopify/toxiproxy/v2/client"
	_ "github.com/lib/pq"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// setupToxiproxy creates a proxy on localhost:15432 that forwards to Postgres
// and deletes it again when the test finishes.
func setupToxiproxy(t *testing.T) (*toxiproxy.Client, *toxiproxy.Proxy) {
	t.Helper()
	client := toxiproxy.NewClient("localhost:8474")
	proxy, err := client.CreateProxy("postgres-test", "localhost:15432", "localhost:5432")
	require.NoError(t, err)
	t.Cleanup(func() {
		_ = proxy.Delete()
	})
	return client, proxy
}

func openTestDB(t *testing.T) *sql.DB {
	t.Helper()
	// Connect to Toxiproxy, not directly to Postgres
	db, err := sql.Open("postgres",
		"host=localhost port=15432 user=test password=test dbname=testdb sslmode=disable")
	require.NoError(t, err)
	db.SetConnMaxLifetime(5 * time.Second)
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	return db
}

func TestQuerySucceedsUnderNormalConditions(t *testing.T) {
	setupToxiproxy(t)
	db := openTestDB(t)
	defer db.Close()
	var result int
	err := db.QueryRow("SELECT 1").Scan(&result)
	assert.NoError(t, err)
	assert.Equal(t, 1, result)
}

func TestQueryTimesOutUnderHighLatency(t *testing.T) {
	_, proxy := setupToxiproxy(t)
	db := openTestDB(t)
	defer db.Close()
	// Open a connection first so the handshake itself is not delayed
	require.NoError(t, db.Ping())
	// Add 6 seconds of latency, which exceeds the 5s query deadline below
	_, err := proxy.AddToxic("latency-test", "latency", "downstream", 1.0,
		toxiproxy.Attributes{
			"latency": 6000,
			"jitter":  0,
		})
	require.NoError(t, err)
	// SetConnMaxLifetime does not bound individual queries, so the deadline
	// comes from the context passed to QueryRowContext.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// The query should fail with a timeout, not hang forever
	done := make(chan error, 1)
	go func() {
		var result int
		done <- db.QueryRowContext(ctx, "SELECT 1").Scan(&result)
	}()
	select {
	case err := <-done:
		// We expect an error: the query should have hit its deadline
		assert.Error(t, err, "expected timeout error under high latency")
	case <-time.After(10 * time.Second):
		t.Fatal("query hung indefinitely: missing timeout configuration")
	}
}

func TestConnectionPoolRecoversAfterLatencyRemoved(t *testing.T) {
	_, proxy := setupToxiproxy(t)
	db := openTestDB(t)
	defer db.Close()
	// Verify baseline works
	var result int
	err := db.QueryRow("SELECT 1").Scan(&result)
	require.NoError(t, err)
	// Inject latency
	toxic, err := proxy.AddToxic("latency-recovery", "latency", "downstream", 1.0,
		toxiproxy.Attributes{"latency": 4000})
	require.NoError(t, err)
	// Queries during the fault period should fail once the 2s deadline passes
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err = db.QueryRowContext(ctx, "SELECT 1").Scan(&result)
	assert.Error(t, err, "expected failure during latency injection")
	// Remove the toxic
	err = proxy.RemoveToxic(toxic.Name)
	require.NoError(t, err)
	// Wait for pool to recover and retry
	time.Sleep(500 * time.Millisecond)
	// Queries should succeed again
	err = db.QueryRow("SELECT 1").Scan(&result)
	assert.NoError(t, err, "expected recovery after latency removed")
	assert.Equal(t, 1, result)
}

func TestResetPeerCausesReconnection(t *testing.T) {
	_, proxy := setupToxiproxy(t)
	db := openTestDB(t)
	defer db.Close()
	// Establish a connection
	var result int
	err := db.QueryRow("SELECT 1").Scan(&result)
	require.NoError(t, err)
	// Inject a TCP reset, which kills all active connections
	toxic, err := proxy.AddToxic("reset-test", "reset_peer", "downstream", 1.0,
		toxiproxy.Attributes{"timeout": 0})
	require.NoError(t, err)
	// Remove the toxic shortly afterwards: we only want to close existing connections.
	// Use the name returned by AddToxic so the removal targets the toxic we created.
	time.Sleep(100 * time.Millisecond)
	require.NoError(t, proxy.RemoveToxic(toxic.Name))
	// The pool should re-establish connections transparently
	err = db.QueryRow("SELECT 1").Scan(&result)
	assert.NoError(t, err, "expected pool to reconnect after TCP reset")
}
Using Toxiproxy in Node.js Tests
The toxiproxy-node-client package provides a promise-based API:
npm install toxiproxy-node-client
// tests/redis-resilience.test.js
const { Toxiproxy } = require('toxiproxy-node-client');
const Redis = require('ioredis');

let toxiproxy;
let proxy;
let redis;

beforeAll(async () => {
  toxiproxy = new Toxiproxy('http://localhost:8474');
});

beforeEach(async () => {
  // Create a proxy for each test to ensure clean state
  proxy = await toxiproxy.createProxy({
    name: `redis-test-${Date.now()}`,
    listen: '0.0.0.0:16379',
    upstream: 'localhost:6379',
    enabled: true,
  });
  redis = new Redis({
    host: 'localhost',
    port: 16379,
    connectTimeout: 2000,
    commandTimeout: 3000,
    maxRetriesPerRequest: 1,
    enableOfflineQueue: false,
  });
  // The offline queue is disabled, so wait until the connection is ready
  // before any test issues a command
  await new Promise((resolve, reject) => {
    redis.once('ready', resolve);
    redis.once('error', reject);
  });
});

afterEach(async () => {
  redis.disconnect();
  await proxy.remove();
});

test('SET and GET succeed under normal conditions', async () => {
  await redis.set('test-key', 'hello');
  const value = await redis.get('test-key');
  expect(value).toBe('hello');
});

test('commands fail fast under high latency', async () => {
  // Add 5 seconds of latency, which exceeds our 3s command timeout
  await proxy.addToxic({
    name: 'redis-latency',
    type: 'latency',
    stream: 'downstream',
    toxicity: 1.0,
    attributes: { latency: 5000, jitter: 0 },
  });
  const start = Date.now();
  await expect(redis.get('any-key')).rejects.toThrow();
  const elapsed = Date.now() - start;
  // Should fail within ~4 seconds, not hang for 30+
  expect(elapsed).toBeLessThan(5000);
});

test('bandwidth throttling does not cause silent failures', async () => {
  // Throttle to 1 KB/s
  await proxy.addToxic({
    name: 'bandwidth-limit',
    type: 'bandwidth',
    stream: 'downstream',
    toxicity: 1.0,
    attributes: { rate: 1 },
  });
  // A small GET should still succeed, just slowly
  const value = await redis.get('test-key');
  // Value may be null (key not set) but the command should complete, not throw
  expect(value === null || typeof value === 'string').toBe(true);
});
CI Integration with Docker Compose
The standard pattern for CI is to run Toxiproxy as a service alongside your application's dependencies:
# docker-compose.test.yml
version: '3.8'
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test -d testdb"]
      interval: 3s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 3s
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:2.7.0
    ports:
      - "8474:8474"   # Control API
      - "15432:15432" # Postgres proxy
      - "16379:16379" # Redis proxy
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
  test:
    build:
      context: .
      target: test
    environment:
      DB_HOST: toxiproxy
      DB_PORT: 15432
      REDIS_HOST: toxiproxy
      REDIS_PORT: 16379
      TOXIPROXY_URL: http://toxiproxy:8474
    depends_on:
      toxiproxy:
        condition: service_started
    # List form with a literal block keeps the script's newlines and quotes intact
    command:
      - sh
      - -c
      - |
        set -e
        # Wait for toxiproxy to be ready
        until curl -sf http://toxiproxy:8474/proxies; do sleep 1; done
        # Create proxies
        curl -sf -X POST http://toxiproxy:8474/proxies \
          -H 'Content-Type: application/json' \
          -d '{"name":"postgres","listen":"0.0.0.0:15432","upstream":"postgres:5432","enabled":true}'
        curl -sf -X POST http://toxiproxy:8474/proxies \
          -H 'Content-Type: application/json' \
          -d '{"name":"redis","listen":"0.0.0.0:16379","upstream":"redis:6379","enabled":true}'
        # Run tests
        go test ./... -tags integration -timeout 5m
In your CI pipeline:
# .github/workflows/resilience-tests.yml
name: Resilience Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run resilience tests
        run: docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from test
      - name: Collect logs on failure
        if: failure()
        run: docker compose -f docker-compose.test.yml logs toxiproxy
Testing HTTP Client Behavior
Toxiproxy is not limited to database connections. Testing how your HTTP client handles a slow or unresponsive upstream API is equally important:
package httpclient_test

import (
	"context"
	"net"
	"net/http"
	"testing"
	"time"

	toxiproxy "github.com/Shopify/toxiproxy/v2/client"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestHTTPClientTimesOutOnSlowAPI(t *testing.T) {
	client := toxiproxy.NewClient("localhost:8474")
	// Proxy to an upstream API (plain HTTP, to match the http:// URL below)
	proxy, err := client.CreateProxy("external-api", "localhost:18080", "api.example.com:80")
	require.NoError(t, err)
	defer proxy.Delete()
	// Inject a timeout toxic. With timeout=0 the connection stays open but no
	// data gets through, so only the client's own timeout can end the request.
	_, err = proxy.AddToxic("api-timeout", "timeout", "downstream", 1.0,
		toxiproxy.Attributes{"timeout": 0})
	require.NoError(t, err)
	httpClient := &http.Client{
		Timeout: 2 * time.Second, // Our configured client timeout
		Transport: &http.Transport{
			// Point at our proxy instead of the real API
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "tcp", "localhost:18080")
			},
		},
	}
	start := time.Now()
	_, err = httpClient.Get("http://localhost:18080/endpoint")
	elapsed := time.Since(start)
	assert.Error(t, err, "expected timeout error")
	// Should fail within our 2s timeout, not hang
	assert.Less(t, elapsed, 3*time.Second,
		"HTTP client should respect timeout configuration")
}
What Toxiproxy Does Not Cover
Toxiproxy operates at layer 4 (TCP). This means it cannot simulate:
- TLS certificate errors: for those, use a test certificate or mitmproxy
- HTTP-level faults: if you need to return specific HTTP error codes, use a mock server like WireMock
- UDP protocols: Toxiproxy is TCP-only
- DNS resolution failures: manipulate /etc/hosts or use a mock DNS server for this
Understanding these boundaries helps you choose the right tool. For most database, cache, and microservice communication testing, Toxiproxy covers exactly the failure modes that matter most.
The discipline of testing network faults is ultimately about making implicit assumptions explicit. When you write a Toxiproxy test, you are documenting a specific assumption: "this service will time out after 3 seconds, not hang indefinitely." That assumption is now verified on every build, and the test will fail if someone changes the timeout configuration without understanding the implication.