This is the complete technical reference for HelpMeTest health checks. For a high-level overview of what health checks are and why they matter, see our main health checks guide.

CLI Installation

The HelpMeTest CLI is a single binary (~55MB) that includes the Bun runtime for JavaScript execution. The installer is a shell script that detects your operating system and CPU architecture, downloads the appropriate binary from our releases, and installs it to a location in your PATH.
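If you haven't installed it yet, the flow looks roughly like this (the installer URL comes from your dashboard or the main guide, and the exact version flag is an assumption):

    # Download and run the installer (use the URL from your HelpMeTest dashboard or docs)
    curl -fsSL <installer-url> | sh

    # Verify the binary landed on your PATH (flag name assumed; check the CLI's built-in help if it differs)
    helpmetest --version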

This should output the current version number. If you get "command not found", the binary wasn't installed to a directory on your PATH.

CLI Command Reference

Basic Health Check

The most basic health check is just a heartbeat that reports your service is alive:
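Based on the parameters documented below, the command takes this general shape (square brackets mark the optional command argument):

    helpmetest health "<name>" "<grace_period>" ["<command>"]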

Parameters:

  • name - A unique identifier for this health check. This is how the health check appears in your dashboard. Use descriptive names like "database-backup" or "web-api" instead of generic names like "check1".

  • grace_period - How long the platform should wait before marking this health check as failed if no heartbeat is received. This should be longer than your service's normal execution time plus a buffer for network delays. Format: 30s, 5m, 2h, 1d.

  • command - Optional. A command to execute that determines if the service is healthy. If the command exits with code 0, the health check passes. If it exits with any other code, the health check fails. If omitted, this is just a simple heartbeat.

Examples:
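A plain heartbeat for a backup job (the values match the explanation that follows):

    helpmetest health "database-backup" "5m"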

This reports that the service "database-backup" is alive. If the platform doesn't receive another heartbeat within 5 minutes, it marks the service as down and sends an alert. Use this pattern after your backup script completes successfully.
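The same heartbeat tagged with an environment:

    ENV=production helpmetest health "database-backup" "5m"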

The ENV environment variable tags this health check as belonging to the "production" environment. This lets you filter your dashboard by environment and see production, staging, and development separately. The platform automatically captures the ENV variable and any environment variable starting with HELPMETEST_.

HTTP Health Checks

HTTP health checks make an HTTP GET request to a URL and expect a 200-299 response code. This is useful for web servers, APIs, and any service that exposes an HTTP endpoint.

When you provide a path like /health, the CLI automatically prepends http://localhost to create the full URL http://localhost/health. This is a convenience for services running locally on default ports.
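For example (the check name and grace period are illustrative):

    # The CLI requests http://localhost/health
    helpmetest health "web-app" "1m" "/health"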

If you need to specify a port, include it in the host. The CLI will request http://127.0.0.1:3000/health. This is useful in containers where services might run on non-standard ports.
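For example:

    # The CLI requests http://127.0.0.1:3000/health
    helpmetest health "api" "2m" "127.0.0.1:3000/health"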

For external services or HTTPS endpoints, provide the full URL. The CLI makes the request exactly as specified. Any HTTP status code in the 200-299 range is considered success. Everything else (404, 500, connection refused, timeout) is considered failure.
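For example:

    # Requested exactly as written, over HTTPS
    helpmetest health "public-site" "5m" "https://example.com/health"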

How it works:

The CLI uses the Bun fetch API to make the HTTP request with a 10-second timeout. If the request returns 200-299, the CLI exits with code 0 (success) and reports the health check as passing. If the request returns any other status code, times out, or fails to connect, the CLI exits with code 1 (failure) and reports the health check as failing.

Port Availability Checks

Port checks verify that a service is listening on a specific TCP port. This doesn't verify that the service is working correctly, just that something is bound to that port.
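For example (the name and grace period are illustrative):

    # Pass if something is listening on TCP port 3000
    helpmetest health "api-port" "2m" ":3000"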

The :3000 syntax tells the CLI to check if port 3000 is listening on localhost. The CLI attempts to open a TCP connection to 127.0.0.1:3000. If the connection succeeds (something is listening), the health check passes. If the connection is refused (nothing listening) or times out, the health check fails.

When to use this:

Use port checks for services that don't expose HTTP endpoints but listen on TCP ports, like databases, message queues, or custom TCP servers. This is less thorough than HTTP checks (you're just verifying something is listening, not that it's working) but useful when that's all you can test.

File Age Checks

File age checks verify that a file exists and was modified recently. This is perfect for background workers that update status files or batch jobs that create output files.
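For example (the check name is illustrative; the path and thresholds match the explanation below):

    helpmetest health "log-writer" "5m" "file-updated 2m /var/log/app.log"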

The file-updated 2m /var/log/app.log syntax tells the CLI to check if /var/log/app.log exists and was modified within the last 2 minutes. If the file doesn't exist or is older than 2 minutes, the health check fails. The 5-minute grace period means if the health check doesn't report success within 5 minutes, you get an alert.

For daily batch jobs, verify the output file was created in the last day. The 25-hour grace period (24 hours + 1-hour buffer) accounts for the daily schedule plus some slack for jobs that run late.
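A sketch of such a daily check (the name and output path are illustrative):

    helpmetest health "daily-export" "25h" "file-updated 1d /data/exports/daily.csv"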

You can check multiple files in a single health check. The CLI checks each file in sequence. If any file is missing or too old, the entire health check fails. This is useful for ensuring multiple related files are being updated together.
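The exact multi-file syntax here (space-separated paths) is an assumption; check the CLI's built-in help for your version:

    helpmetest health "etl-outputs" "2h" "file-updated 1h /data/out/orders.csv /data/out/customers.csv"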

How to use this with workers:

Your background worker code should touch a status file periodically:
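A minimal JavaScript (Node/Bun) sketch; getNextJob and handleJob stand in for your own queue client and job logic:

    import { writeFileSync } from "fs";

    while (true) {
      const job = await getNextJob();   // your queue client
      await handleJob(job);             // your job logic
      // Update the heartbeat file; the file-age check reads its modification time
      writeFileSync("/tmp/worker.alive", new Date().toISOString());
    }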

Then your health check verifies this file is recent, proving the worker is still processing jobs.

Command Execution Checks

Command execution checks run an arbitrary shell command and use its exit code to determine health. Exit code 0 means success, anything else means failure.
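For example (the name and grace period are illustrative):

    helpmetest health "postgres" "5m" "psql -h localhost -c '\l'"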

This runs psql -h localhost -c '\l' (which lists all databases) and checks the exit code. If psql can connect to PostgreSQL and execute the query, it exits with 0 and the health check passes. If the connection fails or the query fails, psql exits with a non-zero code and the health check fails.

You can execute custom scripts that implement your own health check logic. The script should exit with 0 if everything is healthy and non-zero if something is wrong.
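For example (script paths are illustrative):

    # Custom script as the check command: reports both success and failure
    helpmetest health "disk-space" "10m" "/opt/scripts/check-disk.sh"

    # Chained after a backup script: reports success only
    /usr/local/bin/backup.sh && helpmetest health "backup" "25h"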

The && operator means "run the second command only if the first succeeds". So backup.sh runs first. If it succeeds (exit code 0), then helpmetest health runs and reports success. If backup.sh fails (non-zero exit), helpmetest health never runs and no heartbeat is sent. After 25 hours without a heartbeat, the platform marks the backup as failed and alerts you.

Why use && vs passing the command as an argument:

  • backup.sh && helpmetest health "backup" "25h" - Only reports success, never reports failure. Use this when you want silence on failure.
  • helpmetest health "backup" "25h" "backup.sh" - Reports both success and failure. Use this when you want explicit failure reports.

Status Command

The status command shows the current state of all your health checks:
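The basic invocation (output format will vary):

    helpmetest status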

This queries the HelpMeTest API and displays a table of all your health checks with their current status (up/down/unknown), last heartbeat time, and grace period. The table updates to show the most recent data from the platform.

You can also filter the output to show only health checks tagged with ENV=production. This is useful when you have multiple environments and want to see just production or just staging.

A verbose mode shows additional details like the full command being executed, environment variables, and system metrics collected with each heartbeat.

Grace Period Formats

Grace periods use the timespan-parser library which supports human-readable time formats:

  • 30s - 30 seconds
  • 5m - 5 minutes
  • 2h - 2 hours
  • 1d - 1 day
  • 15min - 15 minutes (alternative syntax)
  • 2.5h - 2 hours and 30 minutes (decimals work)

Grace Period Guidelines:

The grace period should be longer than your service's normal execution time plus a buffer for variability and network delays.

Recommended grace periods by service type:

  • Web APIs: 30s - 2m. A fast response is expected; if your API doesn't respond for 2 minutes, something is very wrong.
  • Database operations: 2m - 10m. Queries can legitimately take time, and connection issues often resolve themselves within minutes.
  • Backup jobs: 20-30% longer than execution time. If your backup takes 2 hours, use a 3-hour grace period to account for slower nights.
  • Daily jobs: 25h - 26h. For a job that runs once per day, 25 hours gives you 1 hour of slack for late execution.
  • Weekly jobs: 8d - 9d. For weekly jobs, 8 days gives you 1 day of buffer for maintenance windows.

Why these buffers matter:

Too short: False alerts when services are legitimately slow. Too long: Delayed alerts when services actually fail.

Start with the recommendations above and tune based on your false positive rate. If you're getting alerts for services that are actually healthy, increase the grace period. If you're discovering failures hours after they happen, decrease it.

Environment Variables

The CLI uses environment variables for configuration and metadata.

Required:

  • HELPMETEST_API_TOKEN - Your API token from the HelpMeTest platform. This authenticates your health check reports. Get this from your dashboard settings. The token is a long string like HELP-1dc7fbe0-1f4f-4c58-abb6-20f7ae47570c.

Optional:

  • ENV - Environment identifier (dev, staging, prod). This tags your health checks by environment so you can filter them in the dashboard. The platform treats this as a special field and provides environment-based filtering.

  • HELPMETEST_* - Any environment variable starting with HELPMETEST_ is captured and sent with health check reports. For example:

    • HELPMETEST_SERVICE=auth-api - Service name
    • HELPMETEST_VERSION=2.1.3 - Deployment version
    • HELPMETEST_REGION=us-west-2 - AWS region
    • HELPMETEST_POD_NAME=web-app-abc123 - Kubernetes pod name

Auto-Collected System Metrics:

Every health check report includes system metrics automatically collected by the CLI:

  • Hostname - Output of hostname command
  • IP address - First non-loopback IP address found
  • CPU usage - Percentage of CPU used (sampled at collection time)
  • Memory usage - Total memory and available memory in MB
  • Disk usage - Disk space used and available for the root partition
  • Environment variables - All HELPMETEST_* and ENV variables

These metrics appear in your dashboard alongside the health check status, giving you context about the system state when the health check ran.

Docker Integration

Docker's HEALTHCHECK directive runs a command periodically inside the container and uses its exit code to determine container health. This integrates with Docker's health tracking so docker ps shows container health status.

Basic Dockerfile Health Check

Here's a complete Dockerfile for a Node.js web application with health checks:
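A sketch of such a Dockerfile; the application details and the CLI install step are assumptions, while the HEALTHCHECK parameters match the explanation below:

    FROM node:20-slim
    WORKDIR /app

    COPY package*.json ./
    RUN npm ci --omit=dev
    COPY . .

    # Install the HelpMeTest CLI into the image (installer URL is an assumption)
    # RUN curl -fsSL <installer-url> | sh

    EXPOSE 3000

    # Docker runs this command inside the container on a schedule
    HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
      CMD helpmetest health "web-app" "2m" "localhost:3000/health"

    CMD ["node", "server.js"]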

HEALTHCHECK parameters explained:

  • --interval=30s - How often Docker runs the health check command. 30 seconds is good for web services that should always be responsive. For less critical services, use longer intervals like 60s or 120s to reduce overhead.

  • --timeout=10s - Maximum time Docker waits for the health check command to complete. If the command takes longer than this, Docker kills it and considers the check failed. 10 seconds is reasonable for HTTP requests and database queries. If your health check legitimately takes longer, increase this.

  • --start-period=5s - Grace period during container startup before Docker starts counting health check failures. Your application needs time to start up (loading config, connecting to database, warming up). Set this to your application's typical startup time. Failed health checks during the start period don't count toward the failure threshold.

  • --retries=3 - How many consecutive health check failures before Docker marks the container as unhealthy. 3 retries means the service must fail for 3 * 30s = 90 seconds before Docker calls it unhealthy. This prevents false positives from temporary glitches.

What happens when a container is unhealthy:

In plain Docker and Docker Compose, an unhealthy container keeps running and docker ps shows it as unhealthy; restart policies like restart: always act on container exit, not on health status. Orchestrators do act on health: Docker Swarm removes and replaces unhealthy containers automatically, and Kubernetes restarts or stops routing traffic to pods based on its own liveness and readiness probes.

Database Container

Databases need different health checks than web services because they don't expose HTTP endpoints:
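A sketch (the base image and start period are assumptions; the interval, timeout, grace period, and psql command match the explanation below, and POSTGRES_USER/POSTGRES_DB come from the container environment):

    FROM postgres:15

    # HelpMeTest CLI installed as in the previous example (assumption)

    HEALTHCHECK --interval=60s --timeout=30s --start-period=30s --retries=3 \
      CMD helpmetest health "postgres" "5m" "psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1'"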

Why these parameters are different:

  • --interval=60s - Database queries are more expensive than HTTP requests, so we check less frequently. 60 seconds is reasonable for database health.

  • --timeout=30s - Database connections can take longer to establish than HTTP requests, especially if the connection pool is exhausted. 30 seconds gives the database time to process the connection request.

  • Grace period 5m - If we don't receive a health check report for 5 minutes, something is wrong. This is longer than the web service grace period because database operations are expected to be slower.

The health check command:

psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1' connects to PostgreSQL and executes a simple query. This verifies:

  1. PostgreSQL is accepting connections
  2. The database exists
  3. The user credentials work
  4. The database can execute queries

If any of these fail, psql exits with a non-zero code and the health check fails.

Background Worker Container

Background workers don't expose HTTP endpoints or database interfaces. They just process jobs from queues. We use file-based health checks for these:
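A sketch; the image details, timeout, and start period are assumptions, while the check parameters match the breakdown below:

    FROM node:20-slim
    WORKDIR /app
    COPY . .
    RUN npm ci --omit=dev

    # HelpMeTest CLI installed as in the earlier examples (assumption)

    HEALTHCHECK --interval=60s --timeout=10s --start-period=15s --retries=3 \
      CMD helpmetest health "worker" "10m" "file-updated 5m /tmp/worker.alive"

    CMD ["node", "worker.js"]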

Worker application code:

Your worker must periodically update the status file to prove it's alive:
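A JavaScript sketch; jobQueue and processJob stand in for your own queue client and job handler:

    // worker.js (sketch): update /tmp/worker.alive so the file-age check
    // can prove the worker is still processing jobs
    import { writeFileSync } from "fs";

    const heartbeat = () =>
      writeFileSync("/tmp/worker.alive", new Date().toISOString());

    heartbeat();                          // mark the worker alive at startup
    for await (const job of jobQueue) {   // your queue client
      await processJob(job);              // your job logic
      heartbeat();                        // prove we're still processing
    }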

How the file check works:

  • Every 60 seconds, Docker runs the health check
  • The health check verifies /tmp/worker.alive was modified in the last 5 minutes
  • If the file is older than 5 minutes or doesn't exist, the health check fails
  • After 3 consecutive failures (3 minutes), Docker marks the container unhealthy
  • The platform alerts you if no heartbeat is received for 10 minutes

Why 5 minutes for file age:

If your jobs take 2 minutes on average, you should update the file after each job. In the worst case (job takes 2 minutes, then health check runs), the file will be 2 minutes old. 5 minutes gives you a 3-minute buffer for slow jobs. Tune this based on your job processing time.

Docker Compose Integration

Docker Compose runs multiple containers together and can coordinate their health checks. Here's a complete example showing different health check patterns:
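A sketch of such a file; service names, images, ports, and build paths are illustrative, and the quoting of the check commands may need adjusting for your setup:

    version: "3.8"

    services:
      db:
        image: postgres:15
        environment:
          POSTGRES_PASSWORD: example
        healthcheck:
          test: helpmetest health "postgres" "5m" "psql -U postgres -c 'SELECT 1'"
          interval: 60s
          timeout: 30s
          retries: 3

      api:
        build: ./api
        environment:
          HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
          ENV: production
        depends_on:
          db:
            condition: service_healthy
        healthcheck:
          test: helpmetest health "api" "2m" "localhost:3000/health"
          interval: 30s
          timeout: 10s
          retries: 3

      worker:
        build: ./worker
        environment:
          HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
          ENV: production
        depends_on:
          db:
            condition: service_healthy
        healthcheck:
          test: helpmetest health "worker" "10m" "file-updated 5m /tmp/worker.alive"
          interval: 60s
          timeout: 10s
          retries: 3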

The depends_on configuration:

This tells Docker Compose to wait for the database container to report healthy before starting the api service. Without this, the API would start immediately and fail to connect to the database because it's still starting up. The health check coordination ensures services start in the right order.

Running with Docker Compose:
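For example:

    docker-compose up -d
    docker-compose ps      # shows health status per service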

The docker-compose ps output shows health status for each service. Services show as "healthy", "unhealthy", or "starting" based on their health check results.

Kubernetes Integration

Kubernetes has two types of probes: liveness and readiness. They serve different purposes and both are important for robust deployments.

Liveness Probe: "Is this container broken and should be restarted?"

  • If the liveness probe fails repeatedly, Kubernetes kills and restarts the pod
  • Use this to detect deadlocks, infinite loops, or corrupted state that requires a restart
  • Should be conservative - only fail when a restart would actually help

Readiness Probe: "Is this container ready to serve traffic?"

  • If the readiness probe fails, Kubernetes removes the pod from the service load balancer
  • Use this to temporarily remove pods during startup, during degraded states, or when dependent services are down
  • Can fail more liberally - temporary removal from load balancer doesn't hurt

Create Secret for API Token

Never hardcode API tokens in Kubernetes manifests. Use secrets instead:
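For example (the token value is a placeholder):

    kubectl create secret generic helpmetest-secret \
      --from-literal=api-token=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx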

This creates a secret named helpmetest-secret with one key, api-token, containing your token. Secrets are stored in etcd (base64-encoded by default; enable encryption at rest for stronger protection) and can be exposed to containers as environment variables or mounted files.

Or use a secret manager like Infisical:

To base64-encode your token:
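For example (placeholder token):

    echo -n "HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" | base64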

Secret managers like Infisical, AWS Secrets Manager, or HashiCorp Vault are better for production because they provide:

  • Automatic token rotation
  • Audit logs of secret access
  • Integration with your existing auth system
  • Encryption at rest and in transit

Deployment with Liveness and Readiness Probes

Here's a production-ready Kubernetes deployment with proper health checks:
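A sketch using plain httpGet probes against the /health and /ready endpoints described below; the image, replica count, and readiness timings are assumptions, while the liveness timings match the database comparison later in this section. You could instead use exec probes that run helpmetest health so results are also reported to the platform:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web-app
      template:
        metadata:
          labels:
            app: web-app
        spec:
          containers:
            - name: web-app
              image: registry.example.com/web-app:2.1.3
              ports:
                - containerPort: 3000
              env:
                - name: ENV
                  value: production
                - name: HELPMETEST_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: helpmetest-secret
                      key: api-token
              livenessProbe:
                httpGet:
                  path: /health
                  port: 3000
                initialDelaySeconds: 30
                periodSeconds: 30
                timeoutSeconds: 10
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 3000
                initialDelaySeconds: 10
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3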

Why separate liveness and readiness probes:

Imagine your app depends on a database. If the database goes down:

  • Readiness probe fails immediately, removing your pod from the load balancer so users don't hit it
  • Liveness probe keeps passing because your app process is fine, just waiting for the database
  • When the database comes back, readiness probe starts passing and traffic resumes
  • No pod restart needed because the app itself was never broken

If liveness and readiness were the same probe, the pod would restart every time the database hiccuped, which doesn't help anything.

The /health vs /ready endpoints:

Your application should expose two endpoints:

  • /health - Returns 200 if the app process itself is healthy (not deadlocked, not out of memory)
  • /ready - Returns 200 if the app is ready to serve traffic (dependencies are available, caches are warm)

Database Deployment

Databases are stateful and need special handling:
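A condensed sketch; the probe commands and readiness timings are assumptions (they presume local auth works inside the container), while the liveness timings and single replica match the points below:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: postgres
    spec:
      replicas: 1                        # primary databases don't run in parallel
      selector:
        matchLabels:
          app: postgres
      template:
        metadata:
          labels:
            app: postgres
        spec:
          containers:
            - name: postgres
              image: postgres:15
              ports:
                - containerPort: 5432
              livenessProbe:
                exec:
                  command: ["sh", "-c", "psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1'"]
                initialDelaySeconds: 60
                periodSeconds: 60
                timeoutSeconds: 30
                failureThreshold: 3
              readinessProbe:
                exec:
                  command: ["sh", "-c", "pg_isready -U $POSTGRES_USER"]
                initialDelaySeconds: 30
                periodSeconds: 30
                timeoutSeconds: 10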

Why databases are different:

  • Longer initialDelaySeconds (60s vs 30s) because databases initialize schema, load data, etc.
  • Longer periodSeconds (60s vs 30s) because database queries have more overhead than HTTP requests
  • Longer timeoutSeconds (30s vs 10s) because connection pools can be exhausted under load
  • Only 1 replica because primary databases don't run in parallel (use StatefulSet for replicas)

CronJob with Health Check

Kubernetes CronJobs run scheduled tasks. Use health checks to verify they complete successfully:
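A sketch; the image and backup script path are assumptions, while the schedule, grace period, restart policy, and retry count match the walkthrough below:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: database-backup
    spec:
      schedule: "0 2 * * *"              # 2 AM every day
      jobTemplate:
        spec:
          backoffLimit: 3                # retry transient failures up to 3 times
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: backup
                  image: registry.example.com/db-backup:latest
                  env:
                    - name: HELPMETEST_API_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: helpmetest-secret
                          key: api-token
                  command: ["/bin/sh", "-c"]
                  args:
                    - |
                      set -e                                   # stop here if the backup fails
                      /scripts/backup.sh
                      helpmetest health "database-backup" "25h"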

How the health check works with CronJobs:

  1. At 2 AM, Kubernetes creates a new pod and runs your command
  2. The backup script executes
  3. If the backup succeeds (exit code 0), the shell continues to the next line
  4. helpmetest health reports success to the platform
  5. If the backup fails (non-zero exit), the shell stops and the helpmetest health line never runs
  6. After 25 hours without a heartbeat, the platform marks the backup as failed and alerts you

The 25-hour grace period:

The job runs daily, so 24 hours is the expected interval. We add 1 hour of slack for jobs that run late, infrastructure delays, etc. If the backup hasn't reported success for 25 hours, something is definitely wrong.

Why restartPolicy: OnFailure:

If the backup fails, Kubernetes retries up to backoffLimit times. This handles transient failures (network blips, temporarily full disk) without alerting you. If all 3 attempts fail, you get an alert because the job legitimately failed.

Cron Job Monitoring

Traditional cron jobs on Linux servers need health check integration too. Add the health check command after your script runs:

Basic Cron Setup
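A minimal crontab sketch (paths, the check name, and the placeholder token are illustrative):

    # crontab -e
    HELPMETEST_API_TOKEN=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

    # 2 AM daily: the CLI runs the backup and reports success or failure
    0 2 * * * /usr/local/bin/helpmetest health "nightly-backup" "25h" "/usr/local/bin/backup.sh"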

When you pass a command to helpmetest health, it executes the command and reports both success and failure based on the exit code. This is different from command && helpmetest health which only reports success.

Cron Best Practices

1. Use absolute paths:

Cron runs with a minimal PATH that often doesn't include /usr/local/bin. Always use absolute paths:
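For example:

    # Fragile: relies on PATH, which cron may not share with your shell
    0 2 * * * backup.sh && helpmetest health "backup" "25h"

    # Robust: absolute paths for both the script and the CLI
    0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "25h"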

2. Set environment variables explicitly:

Cron doesn't inherit your shell's environment. Set variables at the top of your crontab:
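For example (placeholder token):

    HELPMETEST_API_TOKEN=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    ENV=production
    PATH=/usr/local/bin:/usr/bin:/bin

    0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "25h"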

3. Grace periods should be 20-30% longer than execution time:

The extra buffer accounts for variability. Some nights the backup might take 2.5 hours if the database is larger or the disk is slower. A 3-hour grace period prevents false alerts on those nights.

4. Log output for debugging:
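For example (log path is illustrative):

    0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1 && /usr/local/bin/helpmetest health "backup" "25h"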

This captures all output (both stdout and stderr) to a log file. When a backup fails, you can check the log to see what went wrong.

5. Use file-based health checks for long-running scripts:

If your cron job runs for hours, you can't use the command execution pattern (it would time out). Instead, have your script update a status file and check that file:
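A sketch; the check name and grace period are assumptions, while the file path and intervals match the explanation below:

    # Long-running backup: the script touches /tmp/backup-complete when it finishes
    0 2 * * * /usr/local/bin/backup.sh

    # Every 10 minutes: verify the completion file is less than 2 hours old
    */10 * * * * /usr/local/bin/helpmetest health "backup-file" "30m" "file-updated 2h /tmp/backup-complete"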

Your backup script should touch /tmp/backup-complete when it finishes successfully. The health check runs every 10 minutes and verifies the file was updated in the last 2 hours.

API Isolation Feature

This is one of the most important features of the HelpMeTest CLI and is critical for production deployments.

The Problem:

Imagine your Kubernetes cluster runs health checks that report to the HelpMeTest API. One day there's a network issue between your cluster and the API (AWS region outage, DNS failure, firewall misconfiguration). If health check exit codes depend on successfully reporting to the API, all your health checks would fail even though your services are perfectly healthy. Kubernetes would kill all your pods. Your entire application goes down because of an unrelated network issue.

The Solution:

The HelpMeTest CLI always returns an exit code based purely on whether your health check command succeeded:

  1. CLI executes your health check command (HTTP request, database query, file check, etc.)
  2. Command succeeds → CLI returns exit code 0 (success)
  3. Command fails → CLI returns exit code 1 (failure)
  4. This happens regardless of whether the CLI can reach the HelpMeTest API

If the API is reachable:

  • CLI reports the health check result to the platform
  • You see the status in your dashboard
  • Alerts fire when checks fail
  • Historical data is stored

If the API is unreachable:

  • CLI logs a warning: "Failed to report health check to API: connection timeout"
  • CLI still returns the correct exit code based on your service status
  • Kubernetes doesn't kill healthy pods
  • Docker doesn't mark healthy containers as unhealthy
  • Your services keep running normally

Why this matters in practice:

Your infrastructure is more reliable than any third-party monitoring service. The HelpMeTest API might have outages (AWS problems, deployment issues, DDoS attacks). Your application should never go down because your monitoring is down. The API isolation feature ensures health checks reflect actual service health, not network connectivity to a monitoring platform.

Where the health check data goes when API is down:

The CLI doesn't queue or retry. If it can't reach the API, the health check data for that specific check is lost. But that's okay because:

  1. Health checks run frequently (every 30-60 seconds), so you'll only miss a few data points
  2. Your services stay healthy during the API outage, which is what matters
  3. When the API comes back, health checks resume reporting normally

Auto-Detection of Service Types

When you use the AI integration to add health checks automatically, the AI detects service types based on the container image and exposed ports. Here is what it generates for common service types, and why:

  • PostgreSQL: psql -h localhost -c "SELECT 1" - verifies the database accepts connections and can execute queries
  • MySQL: mysql -h localhost -e "SELECT 1" - verifies MySQL connection and query execution
  • Redis: redis-cli ping - Redis's built-in health check returns PONG if healthy
  • MongoDB: mongosh --eval "db.runCommand({ping: 1})" - MongoDB's ping command verifies connection and responsiveness
  • Node.js: GET localhost:3000/health - most Node apps expose a /health endpoint on port 3000
  • Python/Flask: GET localhost:8000/health - Python web apps are commonly served on port 8000 (e.g., behind gunicorn), with /health as a standard endpoint
  • Nginx: GET localhost:80/health - Nginx listens on port 80 and can proxy health checks to backends
  • Kafka: :9092 - checks that Kafka is listening on its default port
  • RabbitMQ: GET localhost:15672/api/overview - the RabbitMQ management API provides cluster health status

The AI looks at:

  • Base image name (FROM postgres:15 → database health check)
  • Exposed ports (EXPOSE 3000 → HTTP health check on :3000)
  • Installed packages (apt-get install postgresql-client → database service)
  • Running processes (CMD ["nginx"] → web server health check)

Troubleshooting

Debug Mode

Enable verbose output to see exactly what the CLI is doing:
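The exact switch depends on your CLI version and is an assumption here; consult the CLI's built-in help:

    # --debug is an assumption; check the CLI help for the exact flag
    helpmetest health "web-api" "2m" "localhost:3000/health" --debug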

Debug mode logs:

  • Full HTTP request/response details for HTTP checks
  • File paths and modification times for file checks
  • Command execution and output for command checks
  • API request/response details
  • System metrics collection process

The CLI can also display the system metrics it collects (CPU, memory, disk) without actually reporting a health check, which is useful for verifying that it can access system information.

Verbose status shows:

  • Full command being executed
  • All environment variables
  • System metrics from last check
  • Historical status changes

False Positive Alerts

Symptoms:

  • Getting alerts when service is actually healthy
  • Intermittent failures for stable services
  • Health checks timing out unexpectedly

Solutions:

1. Increase grace period:

If your service occasionally takes longer than expected, increase the grace period:
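For example (the check name and script path are illustrative):

    # Before: a 5-minute grace period causes alerts on slow runs
    helpmetest health "report-job" "5m" "/opt/scripts/generate-report.sh"

    # After: extra headroom for the occasional slow run
    helpmetest health "report-job" "10m" "/opt/scripts/generate-report.sh"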

Start with 50-100% buffer over typical execution time and adjust based on false positive rate.

2. Test command manually:

Run the exact health check command yourself to see if it actually works:
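For example, for an HTTP check:

    curl -sf http://localhost:3000/health
    echo $?        # 0 means healthy, anything else means the check would fail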

If the command fails when you run it manually, fix the command before using it in health checks.

3. Check system resources during execution:

High CPU or memory usage can cause health checks to timeout:
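For example:

    top -b -n 1 | head -15    # CPU load and busiest processes
    free -m                   # memory in MB
    df -h /                   # disk usage for the root partition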

If CPU is pegged at 100% or memory is maxed out, your health checks might legitimately be slow. Either add more resources or increase timeouts.

4. Use more specific health checks:

Generic checks can fail for many reasons. Specific checks give you better signal:
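For example (the name and grace period are illustrative):

    # Generic: only proves something is listening on port 3000
    helpmetest health "api" "2m" ":3000"

    # Specific: proves the application answers its health endpoint
    helpmetest health "api" "2m" "localhost:3000/health"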

The HTTP check is better because it verifies the application is responding correctly, not just that something is listening on the port.

Missing Heartbeats

Symptoms:

  • Health checks show as 'down' but services are running
  • Irregular heartbeat patterns in dashboard
  • Cron jobs not reporting consistently

Solutions:

1. Verify cron job syntax:

Check if cron is actually running your job:
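For example (the log location varies by distribution):

    grep CRON /var/log/syslog | tail -20      # did cron run the job at all?

    # Run the exact crontab line by hand and watch the exit code
    /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "25h"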

If the manual test works but cron doesn't, the problem is in your crontab syntax.

2. Use absolute paths in cron:

Cron has a minimal PATH. Use absolute paths for everything:
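For example:

    0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "25h"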

3. Set environment variables explicitly:

Cron doesn't inherit your shell environment:
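For example (placeholder token):

    HELPMETEST_API_TOKEN=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    PATH=/usr/local/bin:/usr/bin:/bin

    0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "25h"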

Command Execution Issues

Symptoms:

  • Health check commands fail unexpectedly
  • Different behavior when run manually vs automated
  • Permission denied errors

Solutions:

1. Use absolute paths:
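For example (the script path is illustrative):

    # Use full paths inside the check command, not relative ones
    helpmetest health "custom-check" "10m" "/opt/scripts/check.sh"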

2. Set required environment variables:
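For example (the variable and script are illustrative):

    DATABASE_URL=postgres://localhost/app helpmetest health "db-check" "5m" "/opt/scripts/db-check.sh"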

3. Check file permissions:
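For example:

    ls -l /opt/scripts/check.sh      # readable and executable by the right user?
    chmod +x /opt/scripts/check.sh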

4. Test as same user:

Health checks in containers run as the container user (often root). Test as that user:
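For example (the container and script names are illustrative):

    docker exec my-container /opt/scripts/check.sh; echo $?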

Container Health Check Debugging

Run health check manually inside container:
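For example (the container name and check are illustrative):

    docker exec -it my-container helpmetest health "web-app" "2m" "localhost:3000/health"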

Enable debug mode in container:
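As above, the debug flag is an assumption; check the CLI help for the exact switch:

    docker exec -it my-container helpmetest health "web-app" "2m" "localhost:3000/health" --debug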

Check container logs:
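For example:

    docker logs --tail 100 my-container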

Check Docker container health status:
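For example:

    docker inspect --format '{{.State.Health.Status}}' my-container
    docker inspect --format '{{json .State.Health}}' my-container    # full check history and output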

Real-World Examples

E-commerce Platform

Complete Docker Compose setup for an e-commerce platform with web frontend, API, background workers, payment processing, and database:
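A condensed sketch; the api and payments services would follow the same pattern as web, and all names, images, and ports are illustrative:

    version: "3.8"

    services:
      web:
        build: ./web
        environment:
          HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
          ENV: production
        healthcheck:
          test: helpmetest health "storefront" "2m" "localhost:3000/health"
          interval: 30s
          timeout: 10s
          retries: 3

      worker:
        build: ./worker
        environment:
          HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
          ENV: production
        healthcheck:
          test: helpmetest health "order-worker" "10m" "file-updated 5m /tmp/worker.alive"
          interval: 60s
          timeout: 10s
          retries: 3

      db:
        image: postgres:15
        environment:
          POSTGRES_PASSWORD: example
        healthcheck:
          test: helpmetest health "orders-db" "5m" "psql -U postgres -c 'SELECT 1'"
          interval: 60s
          timeout: 30s
          retries: 3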

This configuration:

  • Uses appropriate grace periods for each service type
  • Checks customer-facing services more frequently (30s) than background workers (60s)
  • Uses HTTP checks for web services and file checks for workers
  • Gives database queries longer timeouts (30s vs 10s)

SaaS Application

Multi-service SaaS platform with authentication, background jobs, email sending, and analytics:
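A condensed sketch of two representative services (names, images, ports, and version values are illustrative):

    services:
      auth:
        build: ./auth
        environment:
          HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
          ENV: production
          HELPMETEST_VERSION: "2.1.3"
        healthcheck:
          test: helpmetest health "auth-api" "1m" "localhost:4000/health"
          interval: 30s
          timeout: 10s
          retries: 2              # fail fast for the critical auth path

      email-worker:
        build: ./email-worker
        environment:
          HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
          ENV: production
          HELPMETEST_VERSION: "2.1.3"
        healthcheck:
          test: helpmetest health "email-worker" "15m" "file-updated 10m /tmp/email-worker.alive"
          interval: 60s
          timeout: 30s
          retries: 3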

Key patterns:

  • Critical services (auth) have lower retries for faster failure detection
  • Background services have longer intervals and timeouts
  • File-based checks use realistic time windows based on job duration
  • Version tracking with HELPMETEST_VERSION environment variable

For a high-level overview and AI-powered setup, see the main health checks guide.

Questions? Email us at contact@helpmetest.com