Load Testing MCP Servers: Concurrent Tool Calls, Streaming, and k6 Benchmarks

MCP servers work fine with one client. The question is whether they hold up when five AI agents are calling them simultaneously, or when a single agent fires off ten parallel tool calls in a complex workflow.

Most MCP server developers never test this. They ship, an agent makes concurrent requests, and they discover their server has a database connection leak or a shared mutex that serializes every request.

Performance testing for MCP servers isn't about finding the maximum throughput. It's about finding the failure modes before your users do.

What Breaks Under Load

Before writing tests, understand what actually fails:

Connection pool exhaustion. If your tools make database or API calls, you have a connection pool. Under load, all connections get occupied and new requests queue. If the queue grows unbounded, memory usage spikes. If the queue times out, requests fail with confusing errors.

Shared mutable state. Caches, counters, and in-memory data structures that work fine with one request at a time can be corrupted when concurrent handlers interleave at await points in async Node.js code.
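The interleaving problem can be reproduced without any MCP code. The sketch below is illustrative only: `hits`, `unsafeIncrement`, and `safeIncrement` are hypothetical names, and the `setTimeout` pause stands in for whatever database or API call a real handler would await.

```typescript
// Hypothetical counter shared by every tool handler in the process.
let hits = 0;

// Stand-in for an awaited database or API call.
const pause = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Unsafe: reads, awaits, then writes. Other handlers run during the await.
async function unsafeIncrement(): Promise<void> {
  const current = hits;
  await pause();
  hits = current + 1;
}

// Safer: chain each update so only one read-modify-write runs at a time.
let updateQueue: Promise<void> = Promise.resolve();
function safeIncrement(): Promise<void> {
  updateQueue = updateQueue.then(async () => {
    const current = hits;
    await pause();
    hits = current + 1;
  });
  return updateQueue;
}

async function demoLostUpdates(): Promise<{ unsafe: number; safe: number }> {
  hits = 0;
  await Promise.all(Array.from({ length: 10 }, () => unsafeIncrement()));
  const unsafe = hits; // every call read 0 before any call wrote

  hits = 0;
  await Promise.all(Array.from({ length: 10 }, () => safeIncrement()));
  const safe = hits;

  return { unsafe, safe };
}
```

Ten concurrent unsafe increments leave the counter at 1, because every handler read 0 before any of them wrote. The chained version ends at 10.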

Event loop blocking. Synchronous operations inside tool handlers block all other requests while they run. One slow handler can starve all other concurrent tool calls.
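One rough way to see blocking is to measure how late a zero-delay timer fires. This is a sketch under the assumption that timer lateness is a good enough proxy for event loop lag; `measureLoopLag` and `blockFor` are hypothetical helpers, and the busy loop simulates synchronous work inside a handler.

```typescript
// Resolves with how many milliseconds late a 0 ms timer actually fired.
function measureLoopLag(): Promise<number> {
  const scheduled = Date.now();
  return new Promise<number>((resolve) => {
    setTimeout(() => resolve(Date.now() - scheduled), 0);
  });
}

// Simulates a synchronous operation (for example, a large JSON.parse).
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Busy wait: nothing else on the event loop can run here.
  }
}

async function demoBlockedLoop(): Promise<number> {
  const lag = measureLoopLag(); // timer is queued now
  blockFor(200);                // but cannot fire until this returns
  return lag;
}
```

Every other pending tool call waits just as long as that timer does, which is why one synchronous handler shows up as a latency spike across all of them.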

SSE connection limits. HTTP/SSE servers have connection limits at the OS, HTTP server, and application levels. Exceeding any of them causes new connections to fail.

Memory leaks. Some resources (stream handles, timers, event listeners) that aren't properly cleaned up look fine in short tests but cause gradual memory growth under sustained load.
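Event listeners are an easy leak to demonstrate. In this sketch, `bus` is a hypothetical process-wide emitter and both handlers are stand-ins for tool handlers; the only difference between them is whether the listener is removed when the work finishes.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical emitter that lives as long as the server process.
const bus = new EventEmitter();

// Leaky: registers a listener on every call and never removes it.
function leakyHandler(): void {
  bus.on('config-changed', () => {});
  // ... tool work ...
}

// Clean: removes its listener even if the tool work throws.
function cleanHandler(): void {
  const onChange = (): void => {};
  bus.on('config-changed', onChange);
  try {
    // ... tool work ...
  } finally {
    bus.off('config-changed', onChange);
  }
}
```

After five calls to `leakyHandler`, `bus.listenerCount('config-changed')` is 5 and stays there; calls to `cleanHandler` leave the count unchanged.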

Concurrent Tool Call Testing

Start with a direct concurrency test using the MCP SDK. This doesn't require k6 — it's just Node.js.

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from './server';

describe('concurrent tool call performance', () => {
  let client: Client;
  let server: ReturnType<typeof createServer>;

  beforeAll(async () => {
    const [ct, st] = InMemoryTransport.createLinkedPair();
    server = createServer();
    await server.connect(st);
    client = new Client({ name: 'perf-test', version: '1.0' });
    await client.connect(ct);
  });

  afterAll(async () => {
    await client.close();
    await server.close();
  });

  it('handles 10 concurrent tool calls without errors', async () => {
    const calls = Array.from({ length: 10 }, (_, i) =>
      client.callTool({
        name: 'search',
        arguments: { query: `concurrent query ${i}`, limit: 5 }
      })
    );

    const results = await Promise.all(calls);
    const errors = results.filter(r => r.isError);
    
    expect(errors).toHaveLength(0);
    expect(results).toHaveLength(10);
  });

  it('each concurrent result is independent', async () => {
    const queries = ['alpha', 'beta', 'gamma', 'delta', 'epsilon'];
    
    const calls = queries.map(q =>
      client.callTool({
        name: 'search',
        arguments: { query: q, limit: 1 }
      })
    );

    const results = await Promise.all(calls);
    
    // Each result should correspond to its query, not a mixed result
    results.forEach((result, i) => {
      const text = result.content[0].text;
      expect(text).toContain(queries[i]);
    });
  });

  it('completes 100 sequential tool calls within acceptable time', async () => {
    const start = Date.now();
    
    for (let i = 0; i < 100; i++) {
      await client.callTool({
        name: 'search',
        arguments: { query: `sequential ${i}`, limit: 1 }
      });
    }
    
    const elapsed = Date.now() - start;
    const avgMs = elapsed / 100;
    
    console.log(`Average tool call latency: ${avgMs.toFixed(2)}ms`);
    expect(avgMs).toBeLessThan(50); // Adjust based on your server's expected perf
  });
});

The last test is a baseline measurement. Run it before and after every significant change. If average latency doubles, you've introduced a regression.
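An average can hide a slow tail, so it may be worth recording each call's latency and reporting percentiles as well. A minimal sketch of a nearest-rank percentile, assuming the latencies are collected into an array; `percentile` is a hypothetical helper, not part of the MCP SDK or the test runner.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile requires at least one sample');
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in the range (0, 100]');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

In the sequential test, that would mean pushing each call's elapsed time into an array and asserting on `percentile(latencies, 95)` alongside the average.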

Memory Leak Detection

// Assumes the same client/server setup (beforeAll/afterAll) as the concurrency tests above
describe('memory stability', () => {
  it('does not leak memory across many tool calls', async () => {
    // Warmup
    for (let i = 0; i < 10; i++) {
      await client.callTool({ name: 'search', arguments: { query: 'warmup', limit: 1 } });
    }

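    // Collect warmup garbage first so the baseline is comparable to the final reading
    if (global.gc) global.gc();
    const baselineMemory = process.memoryUsage().heapUsed;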
    
    // Run many calls
    for (let i = 0; i < 500; i++) {
      await client.callTool({
        name: 'search',
        arguments: { query: `memory test ${i}`, limit: 1 }
      });
    }

    // Force GC if available
    if (global.gc) global.gc();
    
    const finalMemory = process.memoryUsage().heapUsed;
    const growthMB = (finalMemory - baselineMemory) / 1024 / 1024;
    
    console.log(`Memory growth after 500 calls: ${growthMB.toFixed(2)}MB`);
    
    // Allow some growth, but not unbounded
    expect(growthMB).toBeLessThan(50); // 50MB growth is suspicious
  });
});

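Run the test runner under node --expose-gc so the explicit GC calls take effect; without the flag, global.gc is undefined and the reading includes garbage that has not been collected yet. If memory still grows beyond your threshold, you probably have a leak.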
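A single before/after comparison cannot tell steady growth from a one-time allocation. One way to separate them, sketched here under the assumption that you record `process.memoryUsage().heapUsed` into an array every fixed number of calls, is to fit a line through the samples and look at its slope. `heapSlope` is a hypothetical helper.

```typescript
// Least-squares slope of heap samples taken at equal intervals.
// The result is in bytes per sample: near zero means stable,
// consistently positive means the heap keeps growing.
function heapSlope(samples: number[]): number {
  const n = samples.length;
  if (n < 2) {
    throw new Error('need at least two samples');
  }
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, y) => sum + y, 0) / n;

  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < n; i++) {
    numerator += (i - meanX) * (samples[i] - meanY);
    denominator += (i - meanX) ** 2;
  }
  return numerator / denominator;
}
```

A heap that jumps once during warmup and then flattens gives a small slope over the later samples, while a leak gives a slope that stays positive however long the test runs.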

Streaming Response Testing

If your MCP server returns streaming content (large text responses, file reads), test that streaming works correctly and doesn't buffer everything in memory.

// Assumes the same client/server setup (beforeAll/afterAll) as the concurrency tests above
describe('streaming response behavior', () => {
  it('handles large text responses without timeout', async () => {
    // Tool that returns a large response
    const result = await client.callTool({
      name: 'read-large-file',
      arguments: { path: '/tmp/10mb-test-file.txt' }
    });

    expect(result.isError).toBeFalsy(); // isError may be omitted on success
    expect(result.content[0].type).toBe('text');
    expect(result.content[0].text.length).toBeGreaterThan(1000000); // 1MB+
  }, 30000); // 30 second timeout for large response

  it('handles multiple large responses concurrently', async () => {
    const calls = Array.from({ length: 3 }, () =>
      client.callTool({
        name: 'read-large-file',
        arguments: { path: '/tmp/10mb-test-file.txt' }
      })
    );

    const results = await Promise.all(calls);
    results.forEach(r => expect(r.isError).toBeFalsy());
  }, 60000);
});

k6 Load Testing for HTTP/SSE Servers

For HTTP-based MCP servers, k6 gives you realistic load testing with virtual users and ramp-up curves.

First, create a k6 script that exercises your server via HTTP:

// k6/mcp-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const toolCallLatency = new Trend('tool_call_latency');

export const options = {
  stages: [
    { duration: '30s', target: 5 },   // Ramp up to 5 virtual users
    { duration: '60s', target: 10 },  // Ramp from 5 to 10 over a minute
    { duration: '30s', target: 20 },  // Climb to 20
    { duration: '30s', target: 0 },   // Ramp down
  ],
  thresholds: {
    errors: ['rate<0.05'],                    // Less than 5% error rate
    tool_call_latency: ['p95<2000'],          // 95th percentile under 2 seconds
    http_req_duration: ['p99<5000'],          // 99th percentile under 5 seconds
  },
};

const BASE_URL = __ENV.MCP_SERVER_URL || 'http://localhost:3000';
const TOKEN = __ENV.MCP_TOKEN || 'test-token';

export default function () {
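  // Initialize an MCP session over HTTP POST.
  // If your server returns an Mcp-Session-Id header, send it back on later requests.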
  const initRes = http.post(
    `${BASE_URL}/mcp`,
    JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'initialize',
      params: {
        protocolVersion: '2024-11-05',
        capabilities: {},
        clientInfo: { name: 'k6-test', version: '1.0' }
      }
    }),
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${TOKEN}`
      }
    }
  );

  check(initRes, {
    'init status 200': (r) => r.status === 200,
  });

  if (initRes.status !== 200) {
    errorRate.add(1);
    return;
  }

  // Call a tool
  const start = Date.now();
  const toolRes = http.post(
    `${BASE_URL}/mcp`,
    JSON.stringify({
      jsonrpc: '2.0',
      id: 2,
      method: 'tools/call',
      params: {
        name: 'search',
        arguments: { query: 'k6 load test', limit: 5 }
      }
    }),
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${TOKEN}`
      },
      timeout: '10s'
    }
  );

  const latency = Date.now() - start;
  toolCallLatency.add(latency);

  const toolSuccess = check(toolRes, {
    'tool call status 200': (r) => r.status === 200,
    'result not error': (r) => {
      try {
        const body = JSON.parse(r.body);
        return body.result && !body.result.isError;
      } catch {
        return false;
      }
    }
  });

  errorRate.add(!toolSuccess ? 1 : 0);
  sleep(1);
}

Run it:

k6 run \
  --env MCP_SERVER_URL=http://localhost:3000 \
  --env MCP_TOKEN=your-test-token \
  k6/mcp-load-test.js

Reading k6 output:

scenarios: (100.00%) 1 scenario, 20 max VUs

✓ init status 200
✓ tool call status 200
✓ result not error

checks.........................: 99.23% 
tool_call_latency.............: avg=234ms p90=445ms p95=612ms p99=1234ms
http_req_duration..............: avg=156ms p90=312ms p95=489ms p99=987ms
errors.........................: 0.77%

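The numbers that matter are p95 latency (95% of requests finish faster than this, so it shows how slow the worst 5% are) and error rate. The 5% error threshold in the script is a loose ceiling for catching gross failures; a sustained 5% error rate under load is unacceptable for a production server.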

Finding Connection Pool Exhaustion

// k6/connection-pool-test.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 50,       // High concurrent users
  duration: '30s',
  thresholds: {
    http_req_failed: ['rate<0.01'], // Less than 1% failure
  },
};

export default function () {
  // Fire many requests rapidly without sleep
  const res = http.post(
    `${__ENV.MCP_SERVER_URL}/mcp`,
    JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'tools/call',
      params: { name: 'db-query', arguments: { sql: 'SELECT 1' } }
    }),
    { headers: { 'Content-Type': 'application/json', 'Authorization': `Bearer ${__ENV.MCP_TOKEN}` } }
  );

  check(res, { 'success': (r) => r.status === 200 });
}

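If this test starts failing with 429 or 503 responses, or with timeouts, you have likely found your connection pool limit (or a rate limiter in front of it). The fix is usually increasing pool size, adding bounded request queuing, or making sure every handler releases its connection.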
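Request queuing can be as small as a counting semaphore around the handlers that touch the pool. This is a sketch, not any pool library's own API: `ConcurrencyLimiter`, `maxActive`, and `maxQueued` are hypothetical names, and the limits you choose should sit at or below your real pool size.

```typescript
// Runs at most maxActive tasks at once and queues up to maxQueued more.
// Anything beyond that is rejected immediately instead of piling up.
class ConcurrencyLimiter {
  private active = 0;
  private waiting: Array<() => void> = [];
  private readonly maxActive: number;
  private readonly maxQueued: number;

  constructor(maxActive: number, maxQueued: number) {
    this.maxActive = maxActive;
    this.maxQueued = maxQueued;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active < this.maxActive) {
      this.active++;
    } else {
      if (this.waiting.length >= this.maxQueued) {
        throw new Error('queue full');
      }
      // Wait until a finishing task hands its slot to us.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // hand the slot over; active count is unchanged
      } else {
        this.active--;
      }
    }
  }
}
```

Wrapping each database call as `limiter.run(() => queryDatabase(sql))`, with `queryDatabase` standing in for your own function, keeps the queue bounded and turns overload into a fast, explicit error rather than a timeout.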

Latency Benchmarks in CI

Add a lightweight latency benchmark to your CI to catch performance regressions:

# .github/workflows/perf.yml
name: Performance

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci && npm run build
      - run: npm run server &
      - run: sleep 3 # Wait for server to start
      - name: Run latency benchmark
        run: npm run test:perf
        env:
          MCP_SERVER_URL: http://localhost:3000
      - name: Compare with baseline
        run: node scripts/compare-perf-baseline.js

Keep a perf-baseline.json in your repo with acceptable latency numbers. If the PR degrades p95 by more than 20%, fail the CI check.
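The comparison itself can be a few lines. A minimal sketch, assuming both perf-baseline.json and the current run are flat objects mapping a metric name to its p95 latency in milliseconds; that file shape and the `findRegressions` name are assumptions, not something the workflow prescribes.

```typescript
// Metric name -> p95 latency in milliseconds (assumed file shape).
type LatencyReport = Record<string, number>;

// Returns a description of every metric that got slower than
// baseline * (1 + tolerance). Metrics missing from the current run are skipped.
function findRegressions(
  baseline: LatencyReport,
  current: LatencyReport,
  tolerance = 0.2
): string[] {
  const failures: string[] = [];
  for (const [name, baseMs] of Object.entries(baseline)) {
    const nowMs = current[name];
    if (nowMs === undefined) continue;
    if (nowMs > baseMs * (1 + tolerance)) {
      failures.push(`${name}: p95 ${baseMs}ms -> ${nowMs}ms`);
    }
  }
  return failures;
}
```

A CI script would load both files, call `findRegressions`, print the failures, and exit with a non-zero status if the list is not empty.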

Interpreting Results and Setting Thresholds

Reasonable thresholds for an MCP server:

  • Tool call p50 latency: under 100ms for pure computation tools, under 500ms for tools making external API calls
  • Tool call p95 latency: under 500ms for computation, under 2s for external API tools
  • Error rate at 10x your expected concurrency: under 1%
  • Memory growth over 1000 calls: under 10MB

These aren't universal — they depend on what your tools do. The point is to define thresholds before you measure, not after.

When thresholds fail:

  • High p95 but low p50: you have occasional slow requests. Look for synchronous I/O, timeouts, or GC pauses.
  • High error rate under load: connection pool exhaustion or rate limiting from upstream services.
  • Memory growth: resource leak. Profile with Node.js --inspect and Chrome DevTools.
  • Everything fine in test but slow in production: your test environment has different network latency or CPU. Always test against production-like infrastructure.

Performance testing isn't a one-time activity. Run your benchmarks on every significant change, and keep the historical data. The trend matters as much as the current number.
