Toxiproxy: Testing Network Faults and Latency in Your Applications

Network faults are among the most common sources of distributed system bugs, and among the hardest to test. A unit test can stub a database call. An integration test can verify a happy-path HTTP request. But which tests verify that your application handles a 5-second connection timeout gracefully? Or that a flaky dependency that resets 20% of connections triggers a retry rather than a silent failure?

Toxiproxy is a TCP proxy developed by Shopify specifically for testing these scenarios. It sits between your application and any network dependency, and lets you inject faults programmatically: latency, bandwidth throttling, timeouts, connection resets, slow-closing connections, truncated responses, and more. Because it operates at the TCP level, it works with any protocol that uses TCP: HTTP, database wire protocols, message broker connections, and so on.

How Toxiproxy Works

Toxiproxy has two components: a server and a client. The server is a standalone process that manages proxy configurations. The client is a library (available for most languages) that communicates with the server to create, configure, and tear down proxies.

A proxy is a named TCP listener that forwards traffic to a real upstream. Your application connects to the proxy instead of directly to the upstream. A toxic is a configuration applied to a proxy that modifies how traffic flows, for example by adding latency, throttling bandwidth, or closing connections.

The key properties:

  • Toxics are directional. Each toxic is attached to either the upstream (application → dependency) or the downstream (dependency → application) stream, so you can shape the two directions independently.
  • Multiple toxics stack. You can apply latency and bandwidth throttling simultaneously.
  • Toxics are applied atomically via the HTTP control API, so you can enable/disable them mid-test without restarting the proxy.
  • Latency jitter is supported natively, making it easy to simulate realistic network conditions rather than artificial constant latency.

Installing and Running Toxiproxy

Toxiproxy is a single binary with no dependencies:

# macOS via Homebrew
brew install toxiproxy

# Linux (download binary)
wget -O toxiproxy-server https://github.com/Shopify/toxiproxy/releases/download/v2.7.0/toxiproxy-server-linux-amd64
chmod +x toxiproxy-server

# Run the server (default port 8474 for control API)
./toxiproxy-server
# Or with an explicit port and log level (the log level is read from the LOG_LEVEL environment variable):
LOG_LEVEL=info ./toxiproxy-server -port 8474

The control API runs on port 8474 by default. The proxies you create listen on whatever ports you specify.

For Docker:

docker run -d \
  --name toxiproxy \
  -p 8474:8474 \
  -p 5433:5433 \
  ghcr.io/shopify/toxiproxy:2.7.0

The CLI tool (toxiproxy-cli) communicates with the control API:

# Create a proxy: listen on localhost:5433, forward to postgres:5432
toxiproxy-cli create postgres --listen localhost:5433 --upstream postgres:5432

# List all proxies
toxiproxy-cli list

# Add a latency toxic (500ms ± 50ms jitter)
toxiproxy-cli toxic add postgres --type latency --attribute latency=500 --attribute jitter=50

# Remove the toxic
toxiproxy-cli toxic remove postgres --toxicName latency_downstream

# Delete the proxy
toxiproxy-cli delete postgres

Toxic Types

Latency

Adds a fixed delay plus optional jitter to all data passing through. This is the most commonly used toxic.

# 1000ms base latency with ±100ms random jitter
toxiproxy-cli toxic add my-proxy \
  --type latency \
  --attribute latency=1000 \
  --attribute jitter=100

Bandwidth

Throttles the data transfer rate, simulating a slow connection:

toxiproxy-cli toxic add my-proxy \
  --type bandwidth \
  --attribute rate=100          # 100 KB/s

Slow Close

Delays the TCP close after the upstream has finished sending data. This simulates a connection that hangs at the end of a response—particularly common in poorly implemented keep-alive scenarios:

toxiproxy-cli toxic add my-proxy \
  --type slow_close \
  --attribute delay=5000        # 5s before close completes

Timeout

Stops all data from getting through and closes the connection once the timeout elapses. With a timeout of 0, the connection is never closed and simply stays silent until the toxic is removed. That case is deadlier than a clean close, because many applications do not handle a connection that appears open but never responds:

toxiproxy-cli toxic add my-proxy \
  --type timeout \
  --attribute timeout=3000      # Block all data, then close the connection after 3s

Reset Peer

Immediately resets (RST) the connection. This simulates an abrupt connection failure rather than a graceful close:

toxiproxy-cli toxic add my-proxy \
  --type reset_peer \
  --attribute timeout=0         # Reset immediately

Slicer

Splits data into small chunks and sends them with delays between each chunk. This simulates a server that sends responses byte by byte, which can expose buffering bugs in HTTP parsers:

# 1 byte per chunk, 50ms between chunks (the slicer delay is given in microseconds)
toxiproxy-cli toxic add my-proxy \
  --type slicer \
  --attribute average_size=1 \
  --attribute delay=50000

LimitData

Forwards a fixed number of bytes through the connection and then closes it. Useful for testing partial response handling:

toxiproxy-cli toxic add my-proxy \
  --type limit_data \
  --attribute bytes=1024        # Forward only 1KB then close

Using Toxiproxy in Go Tests

The Go client library provides a clean API for creating and managing proxies within test code. Install it:

go get github.com/Shopify/toxiproxy/v2/client

Here is a complete example testing a PostgreSQL connection pool under network fault conditions:

package db_test

import (
    "context"
    "database/sql"
    "testing"
    "time"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
    _ "github.com/lib/pq"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func setupToxiproxy(t *testing.T) (*toxiproxy.Client, *toxiproxy.Proxy) {
    t.Helper()

    client := toxiproxy.NewClient("localhost:8474")

    proxy, err := client.CreateProxy("postgres-test", "localhost:15432", "localhost:5432")
    require.NoError(t, err)

    t.Cleanup(func() {
        proxy.Delete()
    })

    return client, proxy
}

func openTestDB(t *testing.T) *sql.DB {
    t.Helper()
    // Connect to Toxiproxy, not directly to Postgres
    db, err := sql.Open("postgres",
        "host=localhost port=15432 user=test password=test dbname=testdb sslmode=disable")
    require.NoError(t, err)
    db.SetConnMaxLifetime(5 * time.Second)
    db.SetMaxOpenConns(10)
    db.SetMaxIdleConns(5)
    return db
}

// queryOne runs SELECT 1 with a per-query deadline. database/sql applies no
// query timeout of its own, and SetConnMaxLifetime only recycles connections.
func queryOne(db *sql.DB, timeout time.Duration) (int, error) {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    var result int
    err := db.QueryRowContext(ctx, "SELECT 1").Scan(&result)
    return result, err
}

func TestQuerySucceedsUnderNormalConditions(t *testing.T) {
    _, _ = setupToxiproxy(t)
    db := openTestDB(t)
    defer db.Close()

    var result int
    err := db.QueryRow("SELECT 1").Scan(&result)
    assert.NoError(t, err)
    assert.Equal(t, 1, result)
}

func TestQueryTimesOutUnderHighLatency(t *testing.T) {
    _, proxy := setupToxiproxy(t)
    db := openTestDB(t)
    defer db.Close()

    // Warm up the pool so the fault hits an established connection
    _, err := queryOne(db, 5*time.Second)
    require.NoError(t, err)

    // Add 6 seconds of latency, which exceeds the 5s query deadline below
    _, err = proxy.AddToxic("latency-test", "latency", "downstream", 1.0,
        toxiproxy.Attributes{
            "latency": 6000,
            "jitter":  0,
        })
    require.NoError(t, err)

    // The query should fail with a timeout, not hang forever
    done := make(chan error, 1)
    go func() {
        _, err := queryOne(db, 5*time.Second)
        done <- err
    }()

    select {
    case err := <-done:
        // We expect an error because the context deadline has expired.
        // Depending on the driver, it may only surface once the delayed
        // response arrives, which is still well before the 10s guard.
        assert.Error(t, err, "expected timeout error under high latency")
    case <-time.After(10 * time.Second):
        t.Fatal("query hung indefinitely: missing timeout configuration")
    }
}

func TestConnectionPoolRecoversAfterLatencyRemoved(t *testing.T) {
    _, proxy := setupToxiproxy(t)
    db := openTestDB(t)
    defer db.Close()

    // Verify baseline works
    result, err := queryOne(db, 2*time.Second)
    require.NoError(t, err)

    // Inject latency that exceeds the 2s query deadline
    toxic, err := proxy.AddToxic("latency-recovery", "latency", "downstream", 1.0,
        toxiproxy.Attributes{"latency": 4000})
    require.NoError(t, err)

    // Queries during fault period should fail
    _, err = queryOne(db, 2*time.Second)
    assert.Error(t, err, "expected failure during latency injection")

    // Remove the toxic
    err = proxy.RemoveToxic(toxic.Name)
    require.NoError(t, err)

    // Wait for pool to recover and retry
    time.Sleep(500 * time.Millisecond)

    // Queries should succeed again
    result, err = queryOne(db, 2*time.Second)
    assert.NoError(t, err, "expected recovery after latency removed")
    assert.Equal(t, 1, result)
}

func TestResetPeerCausesReconnection(t *testing.T) {
    _, proxy := setupToxiproxy(t)
    db := openTestDB(t)
    defer db.Close()

    // Establish a connection
    var result int
    err := db.QueryRow("SELECT 1").Scan(&result)
    require.NoError(t, err)

    // Inject a TCP reset; this kills all active connections
    _, err = proxy.AddToxic("reset-test", "reset_peer", "downstream", 1.0,
        toxiproxy.Attributes{"timeout": 0})
    require.NoError(t, err)

    // Remove the toxic shortly afterwards; we only want to close existing connections.
    // The name must match the one passed to AddToxic above.
    time.Sleep(100 * time.Millisecond)
    require.NoError(t, proxy.RemoveToxic("reset-test"))

    // The pool should re-establish connections transparently
    err = db.QueryRow("SELECT 1").Scan(&result)
    assert.NoError(t, err, "expected pool to reconnect after TCP reset")
}

Using Toxiproxy in Node.js Tests

The toxiproxy-node-client package provides a promise-based API:

npm install toxiproxy-node-client

// tests/redis-resilience.test.js
const { Toxiproxy } = require('toxiproxy-node-client');
const Redis = require('ioredis');

let toxiproxy;
let proxy;
let redis;

beforeAll(async () => {
  toxiproxy = new Toxiproxy('http://localhost:8474');
});

beforeEach(async () => {
  // Create a proxy for each test to ensure clean state
  proxy = await toxiproxy.createProxy({
    name: `redis-test-${Date.now()}`,
    listen: '0.0.0.0:16379',
    upstream: 'localhost:6379',
    enabled: true,
  });

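  redis = new Redis({
    host: 'localhost',
    port: 16379,
    connectTimeout: 2000,
    commandTimeout: 3000,
    maxRetriesPerRequest: 1,
    enableOfflineQueue: false,
  });

  // With the offline queue disabled, commands issued before the connection
  // is ready are rejected, so wait for the 'ready' event first
  await new Promise((resolve) => redis.once('ready', resolve));
});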

afterEach(async () => {
  redis.disconnect();
  await proxy.remove();
});

test('SET and GET succeed under normal conditions', async () => {
  await redis.set('test-key', 'hello');
  const value = await redis.get('test-key');
  expect(value).toBe('hello');
});

test('commands fail fast under high latency', async () => {
  // Add 5 seconds of latency — exceeds our 3s command timeout
  await proxy.addToxic({
    name: 'redis-latency',
    type: 'latency',
    stream: 'downstream',
    toxicity: 1.0,
    attributes: { latency: 5000, jitter: 0 },
  });

  const start = Date.now();
  await expect(redis.get('any-key')).rejects.toThrow();
  const elapsed = Date.now() - start;

  // Should fail within ~4 seconds, not hang for 30+
  expect(elapsed).toBeLessThan(5000);
});

test('bandwidth throttling does not cause silent failures', async () => {
  // Throttle to 1 KB/s
  await proxy.addToxic({
    name: 'bandwidth-limit',
    type: 'bandwidth',
    stream: 'downstream',
    toxicity: 1.0,
    attributes: { rate: 1 },
  });

  // A small GET should still succeed, just slowly
  const value = await redis.get('test-key');
  // Value may be null (key not set) but the command should complete, not throw
  expect(value === null || typeof value === 'string').toBe(true);
});

CI Integration with Docker Compose

The standard pattern for CI is to run Toxiproxy as a service alongside your application's dependencies:

# docker-compose.test.yml
version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: test   # matches user=test in the Go test DSN
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test -d testdb"]
      interval: 3s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 3s

  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:2.7.0
    ports:
      - "8474:8474"   # Control API
      - "15432:15432" # Postgres proxy
      - "16379:16379" # Redis proxy
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  test:
    build:
      context: .
      target: test
    environment:
      DB_HOST: toxiproxy
      DB_PORT: 15432
      REDIS_HOST: toxiproxy
      REDIS_PORT: 16379
      TOXIPROXY_URL: http://toxiproxy:8474
    depends_on:
      toxiproxy:
        condition: service_started
    command: >
      sh -c "
        # Wait for toxiproxy to be ready
        until curl -sf http://toxiproxy:8474/proxies; do sleep 1; done;

        # Create proxies
        curl -X POST http://toxiproxy:8474/proxies \
          -H 'Content-Type: application/json' \
          -d '{\"name\":\"postgres\",\"listen\":\"0.0.0.0:15432\",\"upstream\":\"postgres:5432\",\"enabled\":true}';

        curl -X POST http://toxiproxy:8474/proxies \
          -H 'Content-Type: application/json' \
          -d '{\"name\":\"redis\",\"listen\":\"0.0.0.0:16379\",\"upstream\":\"redis:6379\",\"enabled\":true}';

        # Run tests
        go test ./... -tags integration -timeout 5m
      "

In your CI pipeline:

# .github/workflows/resilience-tests.yml
name: Resilience Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run resilience tests
        run: docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from test

      - name: Collect logs on failure
        if: failure()
        run: docker compose -f docker-compose.test.yml logs toxiproxy

Testing HTTP Client Behavior

Toxiproxy is not limited to database connections. Testing how your HTTP client handles a slow or unresponsive upstream API is equally important:

// Requires the additional imports "context", "net", and "net/http".
func TestHTTPClientTimesOutOnSlowAPI(t *testing.T) {
    client := toxiproxy.NewClient("localhost:8474")

    // Proxy to an upstream API
    proxy, err := client.CreateProxy("external-api", "localhost:18080", "api.example.com:443")
    require.NoError(t, err)
    defer proxy.Delete()

    // Inject a timeout toxic. With timeout=0 the connection stays open
    // but no data gets through, so the client sees silence. A non-zero
    // value would instead close the connection after that many milliseconds.
    _, err = proxy.AddToxic("api-timeout", "timeout", "downstream", 1.0,
        toxiproxy.Attributes{"timeout": 0})
    require.NoError(t, err)

    httpClient := &http.Client{
        Timeout: 2 * time.Second, // Our configured client timeout
        Transport: &http.Transport{
            // Point at our proxy instead of the real API
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return (&net.Dialer{}).DialContext(ctx, "tcp", "localhost:18080")
            },
        },
    }

    start := time.Now()
    _, err = httpClient.Get("http://localhost:18080/endpoint")
    elapsed := time.Since(start)

    assert.Error(t, err, "expected timeout error")
    // Should fail within our 2s timeout, not hang
    assert.Less(t, elapsed, 3*time.Second,
        "HTTP client should respect timeout configuration")
}

What Toxiproxy Does Not Cover

Toxiproxy operates at layer 4 (TCP). This means it cannot simulate:

  • TLS certificate errors — for those, use a test certificate or mitmproxy
  • HTTP-level faults — if you need to return specific HTTP error codes, use a mock server like WireMock
  • UDP protocols — Toxiproxy is TCP-only
  • DNS resolution failures — manipulate /etc/hosts or use a mock DNS server for this

Understanding these boundaries helps you choose the right tool. For most database, cache, and microservice communication testing, Toxiproxy covers exactly the failure modes that matter most.

The discipline of testing network faults is ultimately about making implicit assumptions explicit. When you write a Toxiproxy test, you are documenting a specific assumption: "this service will time out after 3 seconds, not hang indefinitely." That assumption is now verified on every build, and the test will fail if someone changes the timeout configuration without understanding the implication.
