Testing Apps on Fly.io: Health Checks, Multi-Region, and CI Integration

Fly.io has become a go-to platform for teams that want Heroku-style simplicity with genuine multi-region distribution. You push code, and Fly.io runs it close to your users in 30+ regions worldwide. But with that power comes a real testing challenge: how do you verify that your app is healthy not just in one region but across all the regions you've deployed to? How do you validate rolling deployments without taking down production? And how do you wire all of this into a CI pipeline that gives you confidence before traffic hits real users?

This guide covers practical testing strategies for Fly.io deployments — from configuring health checks in fly.toml to writing smoke tests that run automatically after every deploy.

Understanding Fly.io's Deployment Model

Before writing tests, it helps to understand what you're testing. Fly.io runs your app as Machines — lightweight Firecracker microVMs that boot in milliseconds. When you deploy, Fly.io performs a rolling update: new Machines start, health checks run, and only if checks pass does Fly.io route traffic to them.

This means your health check configuration is your first line of defense. Get it wrong, and either bad deployments slip through or healthy releases fail their checks and never roll out.

Configuring Health Checks in fly.toml

Your fly.toml is the foundation of your deployment configuration. Here's a production-ready health check setup:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1

  [[http_service.checks]]
    grace_period = "10s"
    interval = "15s"
    method = "GET"
    path = "/health"
    protocol = "http"
    timeout = "5s"
    tls_skip_verify = false

    [http_service.checks.headers]
      X-Internal-Check = "fly-health"

[checks]
  [checks.db_connection]
    type = "http"
    grace_period = "20s"
    interval = "30s"
    method = "GET"
    path = "/health/db"
    protocol = "http"
    timeout = "10s"
    port = 8080

The /health endpoint should be lightweight: just enough to confirm the process is alive and answering HTTP. The /health/db check goes further and confirms the app can reach its database. Here's what these endpoints should look like in a Node.js app:

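// health.js
// Assumes `app` is an Express-style application and `db` is a database
// client exposing query(), both created elsewhere in your app.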
app.get('/health', (req, res) => {
  res.json({
    status: 'ok',
    region: process.env.FLY_REGION,
    app: process.env.FLY_APP_NAME,
    machine: process.env.FLY_MACHINE_ID,
    timestamp: new Date().toISOString()
  });
});

app.get('/health/db', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.json({ status: 'ok', db: 'connected' });
  } catch (err) {
    res.status(503).json({ status: 'error', db: err.message });
  }
});

Including FLY_REGION in your health response is useful because it lets you verify which region a request is being served from — critical for multi-region debugging.
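
Because the health response carries the region, a few repeated requests can be tallied to see which regions actually answer. The following is a minimal sketch, not part of the original setup: tally_regions is a hypothetical helper name, and the commented usage assumes curl and jq are installed and that your app lives at your-app-name.fly.dev. From a single location you will usually see only the nearest region, since anycast routes each request to a nearby Machine.

```shell
#!/bin/sh
# tally-regions.sh (hypothetical helper, not from the original article)

# Reads one region code per line on stdin and prints "region count" pairs,
# sorted by region code. Blank lines are ignored.
tally_regions() {
  awk 'NF { seen[$1]++ } END { for (r in seen) print r, seen[r] }' | sort
}

# Example usage (needs network access, curl, and jq; left commented out):
# for i in 1 2 3 4 5 6 7 8 9 10; do
#   curl -s "https://your-app-name.fly.dev/health" | jq -r '.region'
# done | tally_regions
```

Running the tally from CI runners in different locations gives a rough picture of how traffic is distributed without relying on any routing header.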

Testing Multi-Region Deployments

Fly.io lets you run Machines in multiple regions simultaneously. Testing multi-region behavior means verifying that:

  1. Each region is actually serving traffic
  2. Data replication is working (if you use LiteFS or Fly Postgres with read replicas)
  3. Region-specific configuration is correct

Here's a bash script that validates all your configured regions:

#!/bin/bash
# test-regions.sh: validate all Fly.io regions are healthy

APP_NAME="your-app-name"
REGIONS=("iad" "lhr" "sin" "syd")
FAILED=0

for region in "${REGIONS[@]}"; do
  echo "Testing region: $region"

  # Use the Fly.io anycast address with region header
  RESPONSE=$(curl -s -o /tmp/response.json -w "%{http_code}" \
    -H "fly-prefer-region: $region" \
    "https://${APP_NAME}.fly.dev/health")

  if [ "$RESPONSE" != "200" ]; then
    echo "  FAIL: HTTP $RESPONSE from $region"
    FAILED=$((FAILED + 1))
    continue
  fi

  SERVING_REGION=$(jq -r '.region' /tmp/response.json)
  echo "  OK: Serving from $SERVING_REGION (requested $region)"

  if [ "$SERVING_REGION" != "$region" ]; then
    echo "  WARN: Region mismatch, possibly no machine in $region, routed to $SERVING_REGION"
  fi
done

if [ $FAILED -gt 0 ]; then
  echo "Region tests failed: $FAILED/${#REGIONS[@]}"
  exit 1
fi

echo "All regions healthy"

Using the Machines API for Deployment Validation

The Fly.io Machines API gives you programmatic control over your deployment. You can use it to verify Machine states after a deploy:

#!/bin/bash
# check-machines.sh: verify all machines are in expected state after deploy
# Note: with auto_stop_machines enabled, idle Machines may legitimately be
# "stopped" and will be reported as errors by this strict check.

APP_NAME="your-app-name"
# Fail fast with a clear message if the token is missing
FLY_API_TOKEN="${FLY_API_TOKEN:?FLY_API_TOKEN must be set}"
EXPECTED_STATE="started"

MACHINES=$(curl -s \
  -H "Authorization: Bearer ${FLY_API_TOKEN}" \
  "https://api.machines.dev/v1/apps/${APP_NAME}/machines")

TOTAL=$(echo "$MACHINES" | jq length)
STARTED=$(echo "$MACHINES" | jq "[.[] | select(.state == \"${EXPECTED_STATE}\")] | length")

echo "Machines total: $TOTAL, started: $STARTED"

if [ "$STARTED" -ne "$TOTAL" ]; then
  echo "ERROR: Not all machines are in state '${EXPECTED_STATE}'"
  echo "$MACHINES" | jq '.[] | {id, region, state}'
  exit 1
fi

# Check version consistency: all machines should run the same image
VERSIONS=$(echo "$MACHINES" | jq -r '.[].image_ref.tag' | sort -u)
VERSION_COUNT=$(echo "$VERSIONS" | wc -l | tr -d ' ')

if [ "$VERSION_COUNT" -gt 1 ]; then
  echo "WARN: Multiple versions running:"
  echo "$VERSIONS"
fi

echo "Deployment verified: $TOTAL machines running version $(echo "$VERSIONS" | head -1)"
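
When auto_stop_machines is enabled, as in the fly.toml shown earlier, idle Machines can sit in the stopped state without anything being wrong, so requiring every Machine to be started may raise false alarms. One way to loosen the check is sketched here under that assumption: unexpected_states is a hypothetical helper that takes a list of allowed states and reports only the Machines outside it. The commented usage assumes the $MACHINES variable from the script above and that jq is installed.

```shell
#!/bin/sh
# allowed-states.sh (hypothetical helper, not from the original article)

# Usage: unexpected_states ALLOWED_STATE [ALLOWED_STATE...]
# Reads "machine_id state" lines on stdin and prints only the lines whose
# state is NOT in the allowed list. Prints nothing when everything is fine.
unexpected_states() {
  local allowed=" $* "
  local id state
  while read -r id state; do
    [ -z "$id" ] && continue
    case "$allowed" in
      *" $state "*) ;;            # state is allowed, skip it
      *) echo "$id $state" ;;     # report anything else
    esac
  done
}

# Example usage with the Machines API response (left commented out):
# BAD=$(echo "$MACHINES" | jq -r '.[] | "\(.id) \(.state)"' \
#   | unexpected_states started stopped)
# if [ -n "$BAD" ]; then
#   echo "ERROR: machines in unexpected states:"
#   echo "$BAD"
#   exit 1
# fi
```

Which states count as acceptable depends on your app; a service that must always be warm would pass only started.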

flyctl in CI Pipelines

The most reliable pattern for Fly.io CI integration is to deploy, wait for health checks, then run smoke tests. Here's a complete GitHub Actions workflow:

# .github/workflows/deploy.yml
name: Deploy and Test

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup flyctl
        uses: superfly/flyctl-actions/setup-flyctl@master

      - name: Deploy to Fly.io
        run: flyctl deploy --remote-only --wait-timeout 300
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

      - name: Wait for deployment to stabilize
        run: |
          echo "Waiting for health checks to pass..."
          sleep 15

          # Poll until the app returns 200; fail the job if it never does
          HEALTHY=0
          for i in {1..10}; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://your-app.fly.dev/health)
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed on attempt $i"
              HEALTHY=1
              break
            fi
            echo "Attempt $i: got HTTP $STATUS, retrying..."
            sleep 5
          done

          if [ "$HEALTHY" -ne 1 ]; then
            echo "App never became healthy after 10 attempts"
            exit 1
          fi

      - name: Run smoke tests
        run: ./scripts/smoke-tests.sh
        env:
          APP_URL: https://your-app.fly.dev

      - name: Verify machine states
        run: ./scripts/check-machines.sh
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
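
The polling step is easier to reuse and harder to get wrong as a function that returns non-zero once every attempt has failed, so the CI job stops rather than moving on to the smoke tests. A minimal sketch follows; wait_for_healthy and probe_health are hypothetical names, and the URL in the commented usage is the placeholder from the workflow above.

```shell
#!/bin/sh
# wait-for-healthy.sh (hypothetical helper, not from the original article)

# Usage: wait_for_healthy MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for_healthy() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "Healthy on attempt $attempt"
      return 0
    fi
    echo "Attempt $attempt/$max_attempts failed"
    # Skip the pause after the final attempt
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Succeeds only when the URL answers with HTTP 200 within 5 seconds.
probe_health() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")" = "200" ]
}

# Example usage in a CI step (left commented out):
# wait_for_healthy 10 5 probe_health "https://your-app.fly.dev/health" || exit 1
```

Passing the probe as a command keeps the retry logic independent of curl, so the same function can wrap a database check or any other readiness test.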

Writing Effective Smoke Tests

Smoke tests after a Fly.io deploy should cover the critical paths your users depend on. Here's a structured approach using bash and curl:

#!/bin/bash
# smoke-tests.sh

APP_URL="${APP_URL:-https://your-app.fly.dev}"
FAILED=0

# Compare an actual value with the expected one and record failures
run_test() {
  local name="$1"
  local result="$2"
  local expected="$3"

  if [ "$result" = "$expected" ]; then
    echo "  PASS: $name"
  else
    echo "  FAIL: $name (expected: $expected, got: $result)"
    FAILED=$((FAILED + 1))
  fi
}

echo "=== Smoke Tests: $APP_URL ==="

# Test 1: Basic health check
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL/health")
run_test "Health endpoint returns 200" "$STATUS" "200"

# Test 2: Database connectivity
DB_STATUS=$(curl -s "$APP_URL/health/db" | jq -r '.status')
run_test "Database connection is healthy" "$DB_STATUS" "ok"

# Test 3: API authentication endpoint exists
AUTH_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"wrong"}' \
  "$APP_URL/api/auth/login")
run_test "Auth endpoint reachable (returns 401 for bad creds)" "$AUTH_STATUS" "401"

# Test 4: Static assets load
ASSET_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL/favicon.ico")
run_test "Static assets accessible" "$ASSET_STATUS" "200"

# Test 5: Response time is acceptable
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" "$APP_URL/health")
# awk always yields an integer; bc prints sub-millisecond values as ".5",
# which would leave an empty string after cut and break the comparison
RESPONSE_MS=$(awk -v t="$RESPONSE_TIME" 'BEGIN { printf "%d", t * 1000 }')
if [ "$RESPONSE_MS" -lt 2000 ]; then
  echo "  PASS: Response time acceptable (${RESPONSE_MS}ms)"
else
  echo "  FAIL: Response time too slow (${RESPONSE_MS}ms)"
  FAILED=$((FAILED + 1))
fi

echo ""
if [ $FAILED -gt 0 ]; then
  echo "Smoke tests FAILED: $FAILED tests failed"
  exit 1
fi
echo "All smoke tests passed"
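
A single timing sample can be skewed by a cold start or one slow network hop. One option, sketched here on the assumption that the median of a handful of samples is a fair signal for your app, is to collect several timings and compare the median against the threshold. The helper names seconds_to_ms and median_ms are hypothetical, and the commented usage assumes the $APP_URL variable from the script above.

```shell
#!/bin/sh
# latency-median.sh (hypothetical helpers, not from the original article)

# Converts curl's %{time_total} value (seconds, e.g. 0.123456) to whole ms.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 }'
}

# Reads one integer per line on stdin and prints the median.
# For an even number of samples it prints the lower of the two middle values.
median_ms() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR == 0) exit 1; print v[int((NR + 1) / 2)] }'
}

# Example usage (needs network access and curl; left commented out):
# MEDIAN=$(for i in 1 2 3 4 5; do
#   seconds_to_ms "$(curl -s -o /dev/null -w '%{time_total}' "$APP_URL/health")"
# done | median_ms)
# [ "$MEDIAN" -lt 2000 ] || echo "  FAIL: median response time ${MEDIAN}ms"
```

Five samples add only a few seconds to the run while making the check much less sensitive to one unlucky request.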

Testing Rolling Deployments

Fly.io performs rolling deployments by default — old Machines keep running while new ones start. You can verify this behavior with a script that monitors availability during a deploy:

#!/bin/bash
# monitor-during-deploy.sh
# Run this in one terminal while deploying in another
# Note: date +%s%N needs GNU date (Linux); BSD/macOS date does not support %N

APP_URL="https://your-app.fly.dev"
INTERVAL=2
DOWNTIME=0
REQUESTS=0

echo "Monitoring $APP_URL during deployment..."
echo "Press Ctrl+C to stop"

while true; do
  START=$(date +%s%N)
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$APP_URL/health")
  END=$(date +%s%N)
  # Nanoseconds to milliseconds
  LATENCY=$(( (END - START) / 1000000 ))

  REQUESTS=$((REQUESTS + 1))
  TIMESTAMP=$(date +%H:%M:%S)

  if [ "$STATUS" != "200" ]; then
    DOWNTIME=$((DOWNTIME + 1))
    echo "$TIMESTAMP  FAIL  HTTP $STATUS  ${LATENCY}ms  [downtime: $DOWNTIME/$REQUESTS]"
  else
    echo "$TIMESTAMP  OK    HTTP $STATUS  ${LATENCY}ms"
  fi

  sleep $INTERVAL
done
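
The monitor prints each request but leaves the final tally to the reader. A small summary printed on Ctrl+C makes the result easier to compare across deploys. This is a sketch only: availability_pct is a hypothetical helper, and the commented trap line assumes the REQUESTS and DOWNTIME counters from the script above.

```shell
#!/bin/sh
# availability.sh (hypothetical helper, not from the original article)

# Usage: availability_pct TOTAL_REQUESTS FAILED_REQUESTS
# Prints the share of successful requests as a percentage with two decimals,
# or "n/a" when no requests were made.
availability_pct() {
  local requests="$1" failures="$2"
  if [ "$requests" -eq 0 ]; then
    echo "n/a"
    return 0
  fi
  awk -v r="$requests" -v f="$failures" \
    'BEGIN { printf "%.2f\n", (r - f) * 100 / r }'
}

# Example: add this line before the monitoring loop so that Ctrl+C prints
# a summary instead of just ending the script (left commented out):
# trap 'echo "Availability: $(availability_pct "$REQUESTS" "$DOWNTIME")%"; exit 0' INT
```

With a 2-second interval, each failed request stands for roughly two seconds of possible downtime, so the percentage is an estimate rather than an exact measurement.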

Continuous Monitoring with HelpMeTest

Running smoke tests after each deploy catches regressions, but it doesn't tell you about problems that develop over time — memory leaks, database connection exhaustion, or third-party API degradation. For ongoing confidence, tools like HelpMeTest complement your deploy pipeline by running full browser-based tests on a schedule.

HelpMeTest's Robot Framework and Playwright-based tests can simulate real user journeys across your Fly.io regions, with AI-powered test generation that creates tests from plain English descriptions. At $100/month for unlimited tests and parallel execution, you can run comprehensive multi-region checks every few minutes without worrying about test infrastructure costs.

A typical setup pairs the CI smoke tests above with HelpMeTest monitoring runs every 5 minutes, giving you both immediate post-deploy confidence and ongoing production health visibility.

Validating fly.toml Configuration Changes

When you change fly.toml — especially health check settings or machine sizing — validate the configuration before deploying:

# Validate fly.toml syntax
flyctl config validate

# Show planned changes without deploying
# (check `flyctl deploy --help` to confirm your flyctl version supports --dry-run)
flyctl deploy --dry-run

# Check current health check status
flyctl checks list

# Watch health check results in real time
# (if --watch is not available in your flyctl version, `watch -n 5 flyctl checks list` is an alternative)
flyctl checks list --watch

After any configuration change, verify that your health checks are firing correctly:

# Tail logs to see health check requests
flyctl logs --app your-app-name | grep "health"

# Check machine status
flyctl status --app your-app-name

# SSH into a machine to debug health check behavior
flyctl ssh console --app your-app-name

Testing Fly.io Postgres

If you're using Fly Postgres, add a dedicated test for your database cluster health:

#!/bin/bash
# test-fly-postgres.sh

PG_APP="your-app-db"

echo "Testing Fly Postgres cluster: $PG_APP"

# Check cluster status
flyctl postgres status --app "$PG_APP"

# Count machines that report at least one passing check
# (any() avoids counting a machine once per passing check)
MACHINES=$(flyctl machines list --app "$PG_APP" --json)
HEALTHY=$(echo "$MACHINES" | jq -r '[.[] | select(any(.checks[]?; .status == "passing"))] | length')
echo "Healthy machines: $HEALTHY"

# Test connection string works
# (confirm with `flyctl postgres connect --help` that -c is accepted by your flyctl version)
flyctl postgres connect --app "$PG_APP" -c "SELECT version();"

Putting It All Together

Here's the complete testing pipeline for a Fly.io deployment:

  1. Pre-deploy: Run unit and integration tests in CI
  2. Deploy: flyctl deploy --remote-only --wait-timeout 300
  3. Post-deploy: Wait for Fly.io health checks, then run smoke tests
  4. Validate machines: Confirm all Machines are in started state and on the same version
  5. Multi-region check: Hit each configured region and verify responses
  6. Continuous monitoring: HelpMeTest runs full E2E tests on a schedule
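
The scripted stages of this pipeline can be chained by a small runner that stops at the first failure. The sketch below is an illustration, not the article's own tooling: run_pipeline is a hypothetical helper, and the commands in the commented usage (npm test as the pre-deploy test command, the scripts/ paths) are assumptions to replace with your own.

```shell
#!/bin/sh
# run-pipeline.sh (hypothetical helper, not from the original article)

# Usage: run_pipeline "command one" "command two" ...
# Runs each command string in order through sh -c and stops at the first
# one that exits non-zero. Returns 0 only if every step succeeds.
run_pipeline() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! sh -c "$step"; then
      echo "Pipeline stopped: '$step' failed" >&2
      return 1
    fi
  done
  echo "Pipeline complete"
}

# Example usage mirroring steps 1 to 5 above (left commented out):
# run_pipeline \
#   "npm test" \
#   "flyctl deploy --remote-only --wait-timeout 300" \
#   "./scripts/smoke-tests.sh" \
#   "./scripts/check-machines.sh" \
#   "./scripts/test-regions.sh"
```

In GitHub Actions the same ordering comes for free, since a failed step stops the job; a runner like this is mainly useful for local runs or other CI systems.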

The key insight is that Fly.io's built-in health checks and rolling deploy behavior give you a safety net, but they only check what you configure. Smoke tests and continuous monitoring catch what health checks miss — subtle regressions, performance degradation, and user-facing issues that only appear under real conditions.

With this setup, you can deploy to Fly.io with more confidence: most bad deploys are stopped by health checks before they take traffic, and post-deploy issues surface within minutes rather than through user reports.
