Deep Dive: Health Checks
Complete technical documentation for HelpMeTest health checks - CLI reference, Docker integration, Kubernetes configuration, and troubleshooting
This is the complete technical reference for HelpMeTest health checks. For a high-level overview of what health checks are and why they matter, see our main health checks guide.
Table of Contents
- CLI Installation
- CLI Command Reference
- Grace Period Formats
- Environment Variables
- Docker Integration
- Docker Compose Integration
- Kubernetes Integration
- Cron Job Monitoring
- API Isolation Feature
- Auto-Detection of Service Types
- Troubleshooting
- Real-World Examples
CLI Installation
The HelpMeTest CLI is a single binary (~55MB) that includes the Bun runtime for JavaScript execution. The installer is a shell script that detects your operating system and CPU architecture, downloads the appropriate binary from our releases, and installs it to a location in your PATH.
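To confirm the binary was installed to a directory on your PATH, run the CLI's version command. A sketch (the exact flag isn't documented here, so `--version` is an assumption):

```bash
helpmetest --version   # flag name is an assumption; check the CLI's help output if it differs
```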
This should output the current version number. If you get "command not found", the binary wasn't installed to your PATH correctly.
CLI Command Reference
Basic Health Check
The most basic health check is just a heartbeat that reports your service is alive:
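A minimal sketch of that form, using an illustrative name and grace period:

```bash
helpmetest health "web-api" "2m"
```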
Parameters:
- `name` - A unique identifier for this health check. This is how the health check appears in your dashboard. Use descriptive names like "database-backup" or "web-api" instead of generic names like "check1".
- `grace_period` - How long the platform should wait before marking this health check as failed if no heartbeat is received. This should be longer than your service's normal execution time plus a buffer for network delays. Format: `30s`, `5m`, `2h`, `1d`.
- `command` - Optional. A command to execute that determines if the service is healthy. If the command exits with code 0, the health check passes. If it exits with any other code, the health check fails. If omitted, this is just a simple heartbeat.
Examples:
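A heartbeat sent after a backup script finishes (sketch):

```bash
helpmetest health "database-backup" "5m"
```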
This reports that the service "database-backup" is alive. If the platform doesn't receive another heartbeat within 5 minutes, it marks the service as down and sends an alert. Use this pattern after your backup script completes successfully.
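To tag a check with an environment, set `ENV` on the command (values are illustrative):

```bash
ENV=production helpmetest health "web-api" "2m"
```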
The ENV environment variable tags this health check as belonging to the "production" environment. This lets you filter your dashboard by environment and see production vs staging vs development separately. The platform automatically captures any environment variable starting with ENV or HELPMETEST_.
HTTP Health Checks
HTTP health checks make an HTTP GET request to a URL and expect a 200-299 response code. This is useful for web servers, APIs, and any service that exposes an HTTP endpoint.
When you provide a path like /health, the CLI automatically prepends http://localhost to create the full URL http://localhost/health. This is a convenience for services running locally on default ports.
If you need to specify a port, include it in the host. The CLI will request http://127.0.0.1:3000/health. This is useful in containers where services might run on non-standard ports.
For external services or HTTPS endpoints, provide the full URL. The CLI makes the request exactly as specified. Any HTTP status code in the 200-299 range is considered success. Everything else (404, 500, connection refused, timeout) is considered failure.
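Sketches of the three URL forms described above, assuming the HTTP target is passed as the third argument like any other check (names and grace periods are illustrative):

```bash
# Path only: the CLI prepends http://localhost
helpmetest health "web-api" "2m" "/health"

# Host and port included
helpmetest health "web-api" "2m" "127.0.0.1:3000/health"

# Full URL for external or HTTPS endpoints
helpmetest health "status-page" "5m" "https://example.com/health"
```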
How it works:
The CLI uses the Bun fetch API to make the HTTP request with a 10-second timeout. If the request returns 200-299, the CLI exits with code 0 (success) and reports the health check as passing. If the request returns any other status code, times out, or fails to connect, the CLI exits with code 1 (failure) and reports the health check as failing.
Port Availability Checks
Port checks verify that a service is listening on a specific TCP port. This doesn't verify that the service is working correctly, just that something is bound to that port.
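A sketch of the port-check form (the check name and grace period are illustrative):

```bash
helpmetest health "my-service" "2m" ":3000"
```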
The :3000 syntax tells the CLI to check if port 3000 is listening on localhost. The CLI attempts to open a TCP connection to 127.0.0.1:3000. If the connection succeeds (something is listening), the health check passes. If the connection is refused (nothing listening) or times out, the health check fails.
When to use this:
Use port checks for services that don't expose HTTP endpoints but listen on TCP ports, like databases, message queues, or custom TCP servers. This is less thorough than HTTP checks (you're just verifying something is listening, not that it's working) but useful when that's all you can test.
File Age Checks
File age checks verify that a file exists and was modified recently. This is perfect for background workers that update status files or batch jobs that create output files.
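A sketch of the file-age form explained below (the check name is illustrative):

```bash
helpmetest health "log-writer" "5m" "file-updated 2m /var/log/app.log"
```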
The file-updated 2m /var/log/app.log syntax tells the CLI to check if /var/log/app.log exists and was modified within the last 2 minutes. If the file doesn't exist or is older than 2 minutes, the health check fails. The 5-minute grace period means if the health check doesn't report success within 5 minutes, you get an alert.
For daily batch jobs, verify the output file was created in the last day. The 25-hour grace period (24 hours + 1-hour buffer) accounts for the daily schedule plus some slack for jobs that run late.
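A sketch for a daily job (the output path is a placeholder):

```bash
helpmetest health "daily-export" "25h" "file-updated 1d /data/exports/report.csv"
```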
You can check multiple files in a single health check. The CLI checks each file in sequence. If any file is missing or too old, the entire health check fails. This is useful for ensuring multiple related files are being updated together.
How to use this with workers:
Your background worker code should touch a status file periodically:
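A Node-style sketch of that touch logic; the status file path is an assumption:

```javascript
// Update the status file's mtime after each processed job
import { closeSync, openSync, utimesSync } from "node:fs";

function touchStatusFile(path = "/tmp/worker.alive") {
  const now = new Date();
  try {
    utimesSync(path, now, now);       // bump mtime if the file already exists
  } catch {
    closeSync(openSync(path, "w"));   // create it on the first run
  }
}
```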
Then your health check verifies this file is recent, proving the worker is still processing jobs.
Command Execution Checks
Command execution checks run an arbitrary shell command and use its exit code to determine health. Exit code 0 means success, anything else means failure.
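A sketch using the PostgreSQL example discussed below (the check name and grace period are illustrative):

```bash
helpmetest health "postgres" "2m" "psql -h localhost -c '\l'"
```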
This runs psql -h localhost -c '\l' (which lists all databases) and checks the exit code. If psql can connect to PostgreSQL and execute the query, it exits with 0 and the health check passes. If the connection fails or the query fails, psql exits with a non-zero code and the health check fails.
You can execute custom scripts that implement your own health check logic. The script should exit with 0 if everything is healthy and non-zero if something is wrong.
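A sketch with a hypothetical script path:

```bash
helpmetest health "disk-space" "10m" "/usr/local/bin/check_disk.sh"
```

You can also chain the CLI after an existing script with the shell's `&&` operator, for example in a backup job (paths are illustrative):

```bash
/usr/local/bin/backup.sh && helpmetest health "backup" "25h"
```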
The && operator means "run the second command only if the first succeeds". So backup.sh runs first. If it succeeds (exit code 0), then helpmetest health runs and reports success. If backup.sh fails (non-zero exit), helpmetest health never runs and no heartbeat is sent. After 25 hours without a heartbeat, the platform marks the backup as failed and alerts you.
Why use && vs passing the command as an argument:
- `backup.sh && helpmetest health "backup" "25h"` - Only reports success, never reports failure. Use this when you want silence on failure.
- `helpmetest health "backup" "25h" "backup.sh"` - Reports both success and failure. Use this when you want explicit failure reports.
Status Command
The status command shows the current state of all your health checks:
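As described, the basic invocation is just:

```bash
helpmetest status
```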
This queries the HelpMeTest API and displays a table of all your health checks with their current status (up/down/unknown), last heartbeat time, and grace period. The table updates to show the most recent data from the platform.
Filter to only show health checks tagged with ENV=production. This is useful when you have multiple environments and want to see just production or just staging.
Shows additional details like the full command being executed, environment variables, and system metrics collected with each heartbeat.
Grace Period Formats
Grace periods use the timespan-parser library which supports human-readable time formats:
- `30s` - 30 seconds
- `5m` - 5 minutes
- `2h` - 2 hours
- `1d` - 1 day
- `15min` - 15 minutes (alternative syntax)
- `2.5h` - 2 hours and 30 minutes (decimals work)
Grace Period Guidelines:
The grace period should be longer than your service's normal execution time plus a buffer for variability and network delays.
| Service Type | Recommended Grace Period | Reasoning |
|---|---|---|
| Web APIs | 30s - 2m | Fast response expected. If your API doesn't respond for 2 minutes, something is very wrong. |
| Database operations | 2m - 10m | Queries can legitimately take time. Connection issues often resolve themselves within minutes. |
| Backup jobs | 20-30% longer than execution | If your backup takes 2 hours, use a 3-hour grace period to account for slower nights. |
| Daily jobs | 25h - 26h | For a job that runs once per day, 25 hours gives you 1 hour of slack for late execution. |
| Weekly jobs | 8d - 9d | For weekly jobs, 8 days gives you 1 day of buffer for maintenance windows. |
Why these buffers matter:
Too short: False alerts when services are legitimately slow. Too long: Delayed alerts when services actually fail.
Start with the recommendations above and tune based on your false positive rate. If you're getting alerts for services that are actually healthy, increase the grace period. If you're discovering failures hours after they happen, decrease it.
Environment Variables
The CLI uses environment variables for configuration and metadata.
Required:
- `HELPMETEST_API_TOKEN` - Your API token from the HelpMeTest platform. This authenticates your health check reports. Get this from your dashboard settings. The token is a long string like `HELP-1dc7fbe0-1f4f-4c58-abb6-20f7ae47570c`.
Optional:
- `ENV` - Environment identifier (dev, staging, prod). This tags your health checks by environment so you can filter them in the dashboard. The platform treats this as a special field and provides environment-based filtering.
- `HELPMETEST_*` - Any environment variable starting with `HELPMETEST_` is captured and sent with health check reports. For example:
  - `HELPMETEST_SERVICE=auth-api` - Service name
  - `HELPMETEST_VERSION=2.1.3` - Deployment version
  - `HELPMETEST_REGION=us-west-2` - AWS region
  - `HELPMETEST_POD_NAME=web-app-abc123` - Kubernetes pod name
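A sketch of a typical shell setup before running checks (the token and values are placeholders, and the port in the HTTP check is an assumption):

```bash
export HELPMETEST_API_TOKEN="HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export ENV="production"
export HELPMETEST_SERVICE="auth-api"
export HELPMETEST_VERSION="2.1.3"

helpmetest health "auth-api" "2m" "localhost:3000/health"
```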
Auto-Collected System Metrics:
Every health check report includes system metrics automatically collected by the CLI:
- Hostname - Output of the `hostname` command
- IP address - First non-loopback IP address found
- CPU usage - Percentage of CPU used (sampled at collection time)
- Memory usage - Total memory and available memory in MB
- Disk usage - Disk space used and available for the root partition
- Environment variables - All `HELPMETEST_*` and `ENV` variables
These metrics appear in your dashboard alongside the health check status, giving you context about the system state when the health check ran.
Docker Integration
Docker's HEALTHCHECK directive runs a command periodically inside the container and uses its exit code to determine container health. This integrates with Docker's health tracking so docker ps shows container health status.
Basic Dockerfile Health Check
Here's a complete Dockerfile for a Node.js web application with health checks:
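The Dockerfile itself isn't reproduced here, so below is a sketch: the base image, port, and CLI install step are assumptions, and the HEALTHCHECK values match the parameters explained next.

```dockerfile
# Sketch: base image, port, and install details are assumptions
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Install the HelpMeTest CLI into the image using the installer from "CLI Installation"
# RUN <install helpmetest CLI here>

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD helpmetest health "web-app" "2m" "localhost:3000/health"

CMD ["node", "server.js"]
```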
HEALTHCHECK parameters explained:
- `--interval=30s` - How often Docker runs the health check command. 30 seconds is good for web services that should always be responsive. For less critical services, use longer intervals like 60s or 120s to reduce overhead.
- `--timeout=10s` - Maximum time Docker waits for the health check command to complete. If the command takes longer than this, Docker kills it and considers the check failed. 10 seconds is reasonable for HTTP requests and database queries. If your health check legitimately takes longer, increase this.
- `--start-period=5s` - Grace period during container startup before Docker starts counting health check failures. Your application needs time to start up (loading config, connecting to database, warming up). Set this to your application's typical startup time. Failed health checks during the start period don't count toward the failure threshold.
- `--retries=3` - How many consecutive health check failures before Docker marks the container as unhealthy. 3 retries means the service must fail for 3 * 30s = 90 seconds before Docker calls it unhealthy. This prevents false positives from temporary glitches.
What happens when a container is unhealthy:
In plain Docker, an unhealthy container keeps running but docker ps shows it as unhealthy; restart policies like restart: always react to container exits, not to health status, so Docker alone won't restart an unhealthy-but-running container. In orchestration systems like Docker Swarm, unhealthy containers are removed and replaced automatically, and Kubernetes uses its own liveness probes for the same purpose.
Database Container
Databases need different health checks than web services because they don't expose HTTP endpoints:
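A Dockerfile sketch for PostgreSQL; the CLI install step and start-period value are assumptions, while the other values match the parameters explained next.

```dockerfile
FROM postgres:15

# Install the HelpMeTest CLI into the image (see "CLI Installation")
# RUN <install helpmetest CLI here>

HEALTHCHECK --interval=60s --timeout=30s --start-period=30s --retries=3 \
  CMD helpmetest health "postgres" "5m" "psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1'"
```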
Why these parameters are different:
- `--interval=60s` - Database queries are more expensive than HTTP requests, so we check less frequently. 60 seconds is reasonable for database health.
- `--timeout=30s` - Database connections can take longer to establish than HTTP requests, especially if the connection pool is exhausted. 30 seconds gives the database time to process the connection request.
- Grace period `5m` - If we don't receive a health check report for 5 minutes, something is wrong. This is longer than the web service grace period because database operations are expected to be slower.
The health check command:
psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1' connects to PostgreSQL and executes a simple query. This verifies:
- PostgreSQL is accepting connections
- The database exists
- The user credentials work
- The database can execute queries
If any of these fail, psql exits with a non-zero code and the health check fails.
Background Worker Container
Background workers don't expose HTTP endpoints or database interfaces. They just process jobs from queues. We use file-based health checks for these:
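A worker Dockerfile sketch; the base image and CLI install step are assumptions, and the check values match the walkthrough below.

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Install the HelpMeTest CLI into the image (see "CLI Installation")
# RUN <install helpmetest CLI here>

HEALTHCHECK --interval=60s --timeout=10s --retries=3 \
  CMD helpmetest health "worker" "10m" "file-updated 5m /tmp/worker.alive"

CMD ["node", "worker.js"]
```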
Worker application code:
Your worker must periodically update the status file to prove it's alive:
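A Node-style sketch of that loop; `processNextJob()` is a placeholder for your real job logic:

```javascript
import { closeSync, openSync, utimesSync } from "node:fs";

const STATUS_FILE = "/tmp/worker.alive";

// Bump the file's mtime, creating it on the first run
function touch(path) {
  const now = new Date();
  try {
    utimesSync(path, now, now);
  } catch {
    closeSync(openSync(path, "w"));
  }
}

// Placeholder: pull and handle one job from your queue here
async function processNextJob() {
  await new Promise((resolve) => setTimeout(resolve, 60_000));
}

async function main() {
  while (true) {
    await processNextJob();
    touch(STATUS_FILE);   // prove the worker is still processing
  }
}

main();
```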
How the file check works:
- Every 60 seconds, Docker runs the health check
- The health check verifies `/tmp/worker.alive` was modified in the last 5 minutes
- If the file is older than 5 minutes or doesn't exist, the health check fails
- After 3 consecutive failures (3 minutes), Docker marks the container unhealthy
- The platform alerts you if no heartbeat is received for 10 minutes
Why 5 minutes for file age:
If your jobs take 2 minutes on average, you should update the file after each job. In the worst case (job takes 2 minutes, then health check runs), the file will be 2 minutes old. 5 minutes gives you a 3-minute buffer for slow jobs. Tune this based on your job processing time.
Docker Compose Integration
Docker Compose runs multiple containers together and can coordinate their health checks. Here's a complete example showing different health check patterns:
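A condensed sketch of such a file; image names, build contexts, and credentials are assumptions, and each image must have the helpmetest CLI installed for the checks to run.

```yaml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'postgres' '5m' \"psql -U postgres -c 'SELECT 1'\""]
      interval: 60s
      timeout: 30s
      retries: 3

  api:
    build: .
    environment:
      HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
      ENV: production
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'api' '2m' 'localhost:3000/health'"]
      interval: 30s
      timeout: 10s
      retries: 3

  worker:
    build: .
    command: ["node", "worker.js"]
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'worker' '10m' 'file-updated 5m /tmp/worker.alive'"]
      interval: 60s
      timeout: 10s
      retries: 3
```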
The depends_on configuration:
This tells Docker Compose to wait for the database container to report healthy before starting the api service. Without this, the API would start immediately and fail to connect to the database because it's still starting up. The health check coordination ensures services start in the right order.
Running with Docker Compose:
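For example:

```bash
docker-compose up -d    # start all services with health checks enabled
docker-compose ps       # show per-service health status
```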
The docker-compose ps output shows health status for each service. Services show as "healthy", "unhealthy", or "starting" based on their health check results.
Kubernetes Integration
Kubernetes has two types of probes: liveness and readiness. They serve different purposes and both are important for robust deployments.
Liveness Probe: "Is this container broken and should be restarted?"
- If the liveness probe fails repeatedly, Kubernetes kills and restarts the pod
- Use this to detect deadlocks, infinite loops, or corrupted state that requires a restart
- Should be conservative - only fail when a restart would actually help
Readiness Probe: "Is this container ready to serve traffic?"
- If the readiness probe fails, Kubernetes removes the pod from the service load balancer
- Use this to temporarily remove pods during startup, during degraded states, or when dependent services are down
- Can fail more liberally - temporary removal from load balancer doesn't hurt
Create Secret for API Token
Never hardcode API tokens in Kubernetes manifests. Use secrets instead:
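A sketch of creating the secret described below (the token value is a placeholder):

```bash
kubectl create secret generic helpmetest-secret \
  --from-literal=api-token=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```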
This creates a secret named helpmetest-secret with one key api-token containing your token. The secret is stored encrypted in etcd and can be mounted into containers as environment variables or files.
Or use a secret manager like Infisical:
To base64-encode your token:
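For example (the token is a placeholder; `-n` avoids encoding a trailing newline):

```bash
echo -n 'HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' | base64
```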
Secret managers like Infisical, AWS Secrets Manager, or HashiCorp Vault are better for production because they provide:
- Automatic token rotation
- Audit logs of secret access
- Integration with your existing auth system
- Encryption at rest and in transit
Deployment with Liveness and Readiness Probes
Here's a production-ready Kubernetes deployment with proper health checks:
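The manifest isn't reproduced here, so below is a sketch under a few assumptions: the image name and port are placeholders, the helpmetest CLI is installed in the image, and the probe timings follow the web-service values discussed later in this section (30s delay, 30s period, 10s timeout).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:2.1.3   # placeholder image
          ports:
            - containerPort: 3000
          env:
            - name: ENV
              value: production
            - name: HELPMETEST_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: helpmetest-secret
                  key: api-token
          livenessProbe:
            exec:
              command: ["helpmetest", "health", "web-app-live", "2m", "localhost:3000/health"]
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            exec:
              command: ["helpmetest", "health", "web-app-ready", "1m", "localhost:3000/ready"]
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 3
```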
Why separate liveness and readiness probes:
Imagine your app depends on a database. If the database goes down:
- Readiness probe fails immediately, removing your pod from the load balancer so users don't hit it
- Liveness probe keeps passing because your app process is fine, just waiting for the database
- When the database comes back, readiness probe starts passing and traffic resumes
- No pod restart needed because the app itself was never broken
If liveness and readiness were the same probe, the pod would restart every time the database hiccuped, which doesn't help anything.
The /health vs /ready endpoints:
Your application should expose two endpoints:
- `/health` - Returns 200 if the app process itself is healthy (not deadlocked, not out of memory)
- `/ready` - Returns 200 if the app is ready to serve traffic (dependencies are available, caches are warm)
Database Deployment
Databases are stateful and need special handling:
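A sketch of a single-replica PostgreSQL deployment; the image is assumed to also contain the helpmetest CLI, and the probe timings match the database values listed below.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1            # primary databases don't run in parallel
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15   # assumes the helpmetest CLI is installed in this image
          ports:
            - containerPort: 5432
          env:
            - name: HELPMETEST_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: helpmetest-secret
                  key: api-token
          livenessProbe:
            exec:
              command: ["helpmetest", "health", "postgres", "5m", "psql -U postgres -c 'SELECT 1'"]
            initialDelaySeconds: 60
            periodSeconds: 60
            timeoutSeconds: 30
            failureThreshold: 3
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
```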
Why databases are different:
- Longer `initialDelaySeconds` (60s vs 30s) because databases initialize schema, load data, etc.
- Longer `periodSeconds` (60s vs 30s) because database queries have more overhead than HTTP requests
- Longer `timeoutSeconds` (30s vs 10s) because connection pools can be exhausted under load
- Only 1 replica because primary databases don't run in parallel (use StatefulSet for replicas)
CronJob with Health Check
Kubernetes CronJobs run scheduled tasks. Use health checks to verify they complete successfully:
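A CronJob sketch; the image name and backup script path are assumptions, while the schedule, restart policy, and grace period match the walkthrough below.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"             # every day at 2 AM
  jobTemplate:
    spec:
      backoffLimit: 3               # retry transient failures before giving up
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: registry.example.com/backup:latest   # placeholder image with the CLI installed
              env:
                - name: HELPMETEST_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: helpmetest-secret
                      key: api-token
              command:
                - /bin/sh
                - -c
                - /usr/local/bin/backup.sh && helpmetest health "database-backup" "25h"
```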
How the health check works with CronJobs:
- At 2 AM, Kubernetes creates a new pod and runs your command
- The backup script executes
- If the backup succeeds (exit code 0), the shell continues to the next line
- `helpmetest health` reports success to the platform
- If the backup fails (non-zero exit), the shell stops and the `helpmetest health` line never runs
- After 25 hours without a heartbeat, the platform marks the backup as failed and alerts you
The 25-hour grace period:
The job runs daily, so 24 hours is the expected interval. We add 1 hour of slack for jobs that run late, infrastructure delays, etc. If the backup hasn't reported success for 25 hours, something is definitely wrong.
Why restartPolicy: OnFailure:
If the backup fails, Kubernetes retries up to backoffLimit times. This handles transient failures (network blips, temporarily full disk) without alerting you. If all 3 attempts fail, you get an alert because the job legitimately failed.
Cron Job Monitoring
Traditional cron jobs on Linux servers need health check integration too. Add the health check command after your script runs:
Basic Cron Setup
When you pass a command to helpmetest health, it executes the command and reports both success and failure based on the exit code. This is different from command && helpmetest health which only reports success.
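A crontab sketch of that pattern (the script path is a placeholder):

```bash
# m h dom mon dow  command
0 2 * * * /usr/local/bin/helpmetest health "database-backup" "25h" "/usr/local/bin/backup.sh"
```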
Cron Best Practices
1. Use absolute paths:
Cron runs with a minimal PATH that often doesn't include /usr/local/bin. Always use absolute paths:
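For example (paths are illustrative):

```bash
# Bad: "backup.sh" and "helpmetest" may not be on cron's minimal PATH
# 0 2 * * * backup.sh && helpmetest health "backup" "3h"

# Good: absolute paths everywhere
0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "3h"
```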
2. Set environment variables explicitly:
Cron doesn't inherit your shell's environment. Set variables at the top of your crontab:
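A sketch (the token value is a placeholder):

```bash
HELPMETEST_API_TOKEN=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
ENV=production

0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "3h"
```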
3. Grace periods should be 20-30% longer than execution time:
The extra buffer accounts for variability. Some nights the backup might take 2.5 hours if the database is larger or the disk is slower. A 3-hour grace period prevents false alerts on those nights.
4. Log output for debugging:
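A sketch (the log path is a placeholder):

```bash
0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1 && /usr/local/bin/helpmetest health "backup" "3h"
```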
This captures all output (both stdout and stderr) to a log file. When a backup fails, you can check the log to see what went wrong.
5. Use file-based health checks for long-running scripts:
If your cron job runs for hours, you can't use the command execution pattern (it would time out). Instead, have your script update a status file and check that file:
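A sketch of that pattern; the check name and 30-minute grace period are assumptions, while the interval, file path, and age window match the explanation below:

```bash
*/10 * * * * /usr/local/bin/helpmetest health "long-backup" "30m" "file-updated 2h /tmp/backup-complete"
```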
Your backup script should touch /tmp/backup-complete when it finishes successfully. The health check runs every 10 minutes and verifies the file was updated in the last 2 hours.
API Isolation Feature
This is one of the most important features of the HelpMeTest CLI and is critical for production deployments.
The Problem:
Imagine your Kubernetes cluster runs health checks that report to the HelpMeTest API. One day there's a network issue between your cluster and the API (AWS region outage, DNS failure, firewall misconfiguration). If health check exit codes depend on successfully reporting to the API, all your health checks would fail even though your services are perfectly healthy. Kubernetes would kill all your pods. Your entire application goes down because of an unrelated network issue.
The Solution:
The HelpMeTest CLI always returns an exit code based purely on whether your health check command succeeded:
- CLI executes your health check command (HTTP request, database query, file check, etc.)
- Command succeeds → CLI returns exit code 0 (success)
- Command fails → CLI returns exit code 1 (failure)
- This happens regardless of whether the CLI can reach the HelpMeTest API
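A quick way to observe this exit-code behavior (the check name and port are illustrative):

```bash
helpmetest health "web-api" "2m" "localhost:3000/health"
echo $?   # 0 if the check command passed, 1 if it failed - regardless of API reachability
```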
If the API is reachable:
- CLI reports the health check result to the platform
- You see the status in your dashboard
- Alerts fire when checks fail
- Historical data is stored
If the API is unreachable:
- CLI logs a warning: "Failed to report health check to API: connection timeout"
- CLI still returns the correct exit code based on your service status
- Kubernetes doesn't kill healthy pods
- Docker doesn't mark healthy containers as unhealthy
- Your services keep running normally
Why this matters in practice:
Your infrastructure is more reliable than any third-party monitoring service. The HelpMeTest API might have outages (AWS problems, deployment issues, DDoS attacks). Your application should never go down because your monitoring is down. The API isolation feature ensures health checks reflect actual service health, not network connectivity to a monitoring platform.
Where the health check data goes when API is down:
The CLI doesn't queue or retry. If it can't reach the API, the health check data for that specific check is lost. But that's okay because:
- Health checks run frequently (every 30-60 seconds), so you'll only miss a few data points
- Your services stay healthy during the API outage, which is what matters
- When the API comes back, health checks resume reporting normally
Auto-Detection of Service Types
When you use the AI integration to add health checks automatically, the AI detects service types based on the container image and exposed ports. This table shows what health check commands the AI generates for different service types:
| Container Type | Auto-Detected Check | Why This Check |
|---|---|---|
| PostgreSQL | psql -h localhost -c "SELECT 1" | Verifies database accepts connections and can execute queries |
| MySQL | mysql -h localhost -e "SELECT 1" | Verifies MySQL connection and query execution |
| Redis | redis-cli ping | Redis's built-in health check returns PONG if healthy |
| MongoDB | mongosh --eval "db.runCommand({ping: 1})" | MongoDB's ping command verifies connection and responsiveness |
| Node.js | GET localhost:3000/health | Most Node apps expose a /health endpoint on port 3000 |
| Python/Flask | GET localhost:8000/health | Python web apps are commonly served on port 8000 (e.g., behind Gunicorn); a /health endpoint is standard |
| Nginx | GET localhost:80/health | Nginx runs on port 80, can proxy health checks to backends |
| Kafka | :9092 | Check if Kafka is listening on its default port |
| RabbitMQ | GET localhost:15672/api/overview | RabbitMQ management API provides cluster health status |
The AI looks at:
- Base image name (FROM postgres:15 → database health check)
- Exposed ports (EXPOSE 3000 → HTTP health check on :3000)
- Installed packages (apt-get install postgresql-client → database service)
- Running processes (CMD ["nginx"] → web server health check)
Troubleshooting
Debug Mode
Enable verbose output to see exactly what the CLI is doing:
Debug mode logs:
- Full HTTP request/response details for HTTP checks
- File paths and modification times for file checks
- Command execution and output for command checks
- API request/response details
- System metrics collection process
This shows the system metrics the CLI collects (CPU, memory, disk) without actually reporting a health check. Useful for verifying the CLI can access system information.
Verbose status shows:
- Full command being executed
- All environment variables
- System metrics from last check
- Historical status changes
False Positive Alerts
Symptoms:
- Getting alerts when service is actually healthy
- Intermittent failures for stable services
- Health checks timing out unexpectedly
Solutions:
1. Increase grace period:
If your service occasionally takes longer than expected, increase the grace period:
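For example (the check name and values are illustrative):

```bash
# Before: alerts after 5 minutes of silence
helpmetest health "report-generator" "5m"

# After: allow 15 minutes before alerting
helpmetest health "report-generator" "15m"
```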
Start with 50-100% buffer over typical execution time and adjust based on false positive rate.
2. Test command manually:
Run the exact health check command yourself to see if it actually works:
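For example, using the PostgreSQL check from earlier:

```bash
psql -h localhost -c 'SELECT 1'
echo $?   # non-zero means the health check would fail too
```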
If the command fails when you run it manually, fix the command before using it in health checks.
3. Check system resources during execution:
High CPU or memory usage can cause health checks to timeout:
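A few standard commands for a quick snapshot:

```bash
top -b -n 1 | head -20   # CPU snapshot
free -m                  # memory in MB
df -h /                  # root partition disk usage
```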
If CPU is pegged at 100% or memory is maxed out, your health checks might legitimately be slow. Either add more resources or increase timeouts.
4. Use more specific health checks:
Generic checks can fail for many reasons. Specific checks give you better signal:
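For example (the check name and port are illustrative):

```bash
# Generic: only proves something is listening on the port
helpmetest health "web-api" "2m" ":3000"

# Specific: proves the application answers its health endpoint
helpmetest health "web-api" "2m" "localhost:3000/health"
```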
The HTTP check is better because it verifies the application is responding correctly, not just that something is listening on the port.
Missing Heartbeats
Symptoms:
- Health checks show as 'down' but services are running
- Irregular heartbeat patterns in dashboard
- Cron jobs not reporting consistently
Solutions:
1. Verify cron job syntax:
Check if cron is actually running your job:
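A sketch (log location varies by distribution; on Debian/Ubuntu cron logs to syslog):

```bash
grep CRON /var/log/syslog | tail -20

# Run the same line by hand to compare behaviour
/usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "3h"
```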
If the manual test works but cron doesn't, the problem is in your crontab syntax.
2. Use absolute paths in cron:
Cron has a minimal PATH. Use absolute paths for everything:
3. Set environment variables explicitly:
Cron doesn't inherit your shell environment:
Command Execution Issues
Symptoms:
- Health check commands fail unexpectedly
- Different behavior when run manually vs automated
- Permission denied errors
Solutions:
1. Use absolute paths:
2. Set required environment variables:
3. Check file permissions:
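For example (the script path is a placeholder):

```bash
ls -l /usr/local/bin/check_service.sh      # confirm ownership and mode
chmod +x /usr/local/bin/check_service.sh   # ensure it is executable
```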
4. Test as same user:
Health checks in containers run as the container user (often root). Test as that user:
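A sketch ("my-container" and the script path are placeholders):

```bash
docker exec -it my-container sh -c '/usr/local/bin/check_service.sh; echo $?'
```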
Container Health Check Debugging
Run health check manually inside container:
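For example ("my-container", the check name, and the port are placeholders):

```bash
docker exec -it my-container helpmetest health "web-api" "2m" "localhost:3000/health"
```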
Enable debug mode in container:
Check container logs:
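For example (the container name is a placeholder):

```bash
docker logs --tail 100 my-container
```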
Check Docker container health status:
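For example (the container name is a placeholder):

```bash
docker inspect --format '{{json .State.Health}}' my-container
```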
Real-World Examples
E-commerce Platform
Complete Docker Compose setup for an e-commerce platform with web frontend, API, background workers, payment processing, and database:
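The full file isn't reproduced here; a condensed sketch following the Compose pattern shown earlier (image names are assumptions, and each image needs the helpmetest CLI installed):

```yaml
services:
  web:
    image: shop/web:latest
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'storefront' '2m' 'localhost:3000/health'"]
      interval: 30s        # customer-facing: check frequently
      timeout: 10s
      retries: 3
  worker:
    image: shop/worker:latest
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'order-worker' '10m' 'file-updated 5m /tmp/worker.alive'"]
      interval: 60s        # background: check less often
      timeout: 10s
      retries: 3
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'shop-db' '5m' \"psql -U postgres -c 'SELECT 1'\""]
      interval: 60s
      timeout: 30s         # database queries get longer timeouts
      retries: 3
```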
This configuration:
- Uses appropriate grace periods for each service type
- Checks customer-facing services more frequently (30s) than background workers (60s)
- Uses HTTP checks for web services and file checks for workers
- Gives database queries longer timeouts (30s vs 10s)
SaaS Application
Multi-service SaaS platform with authentication, background jobs, email sending, and analytics:
Key patterns:
- Critical services (auth) have lower `retries` for faster failure detection
- Background services have longer intervals and timeouts
- File-based checks use realistic time windows based on job duration
- Version tracking with the `HELPMETEST_VERSION` environment variable
For a high-level overview and AI-powered setup, see the main health checks guide.
Questions? Email us at contact@helpmetest.com