Deep Dive: Health Checks
Complete technical documentation for HelpMeTest health checks - CLI reference, Docker integration, Kubernetes configuration, and troubleshooting
This is the complete technical reference for HelpMeTest health checks. For a high-level overview of what health checks are and why they matter, see our main health checks guide.
Table of Contents
- CLI Installation
- CLI Command Reference
- Grace Period Formats
- Environment Variables
- Docker Integration
- Docker Compose Integration
- Kubernetes Integration
- Cron Job Monitoring
- API Isolation Feature
- Auto-Detection of Service Types
- Troubleshooting
- Real-World Examples
CLI Installation
The HelpMeTest CLI is a single binary (~55MB) that includes the Bun runtime for JavaScript execution. The installer is a shell script that detects your operating system and CPU architecture, downloads the appropriate binary from our releases, and installs it to a location in your PATH.
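To confirm the binary was installed to a directory on your PATH, run the CLI's version command. A sketch (the exact flag isn't documented here, so `--version` is an assumption):

```bash
helpmetest --version   # flag name is an assumption; check the CLI's help output if it differs
```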
This should output the current version number. If you get "command not found", the binary wasn't installed to your PATH correctly.
CLI Command Reference
Basic Health Check
The most basic health check is just a heartbeat that reports your service is alive:
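A minimal sketch of that form, using an illustrative name and grace period:

```bash
helpmetest health "web-api" "2m"
```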
Parameters:
- `name` - A unique identifier for this health check. This is how the health check appears in your dashboard. Use descriptive names like "database-backup" or "web-api" instead of generic names like "check1".
- `grace_period` - How long the platform should wait before marking this health check as failed if no heartbeat is received. This should be longer than your service's normal execution time plus a buffer for network delays. Format: `30s`, `5m`, `2h`, `1d`.
- `command` - Optional. A command to execute that determines if the service is healthy. If the command exits with code 0, the health check passes. If it exits with any other code, the health check fails. If omitted, this is just a simple heartbeat.
Examples:
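A heartbeat sent after a backup script finishes (sketch):

```bash
helpmetest health "database-backup" "5m"
```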
This reports that the service "database-backup" is alive. If the platform doesn't receive another heartbeat within 5 minutes, it marks the service as down and sends an alert. Use this pattern after your backup script completes successfully.
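To tag a check with an environment, set `ENV` on the command (values are illustrative):

```bash
ENV=production helpmetest health "web-api" "2m"
```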
The ENV environment variable tags this health check as belonging to the "production" environment. This lets you filter your dashboard by environment and see production vs staging vs development separately. The platform automatically captures any environment variable starting with ENV or HELPMETEST_.
HTTP Health Checks
HTTP health checks make an HTTP GET request to a URL and expect a 200-299 response code. This is useful for web servers, APIs, and any service that exposes an HTTP endpoint.
When you provide a path like /health, the CLI automatically prepends http://localhost to create the full URL http://localhost/health. This is a convenience for services running locally on default ports.
If you need to specify a port, include it in the host. The CLI will request http://127.0.0.1:3000/health. This is useful in containers where services might run on non-standard ports.
For external services or HTTPS endpoints, provide the full URL. The CLI makes the request exactly as specified. Any HTTP status code in the 200-299 range is considered success. Everything else (404, 500, connection refused, timeout) is considered failure.
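Sketches of the three URL forms described above, assuming the HTTP target is passed as the third argument like any other check (names and grace periods are illustrative):

```bash
# Path only: the CLI prepends http://localhost
helpmetest health "web-api" "2m" "/health"

# Host and port included
helpmetest health "web-api" "2m" "127.0.0.1:3000/health"

# Full URL for external or HTTPS endpoints
helpmetest health "status-page" "5m" "https://example.com/health"
```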
How it works:
The CLI uses the Bun fetch API to make the HTTP request with a 10-second timeout. If the request returns 200-299, the CLI exits with code 0 (success) and reports the health check as passing. If the request returns any other status code, times out, or fails to connect, the CLI exits with code 1 (failure) and reports the health check as failing.
Port Availability Checks
Port checks verify that a service is listening on a specific TCP port. This doesn't verify that the service is working correctly, just that something is bound to that port.
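A sketch of the port-check form (the check name and grace period are illustrative):

```bash
helpmetest health "my-service" "2m" ":3000"
```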
The :3000 syntax tells the CLI to check if port 3000 is listening on localhost. The CLI attempts to open a TCP connection to 127.0.0.1:3000. If the connection succeeds (something is listening), the health check passes. If the connection is refused (nothing listening) or times out, the health check fails.
When to use this:
Use port checks for services that don't expose HTTP endpoints but listen on TCP ports, like databases, message queues, or custom TCP servers. This is less thorough than HTTP checks (you're just verifying something is listening, not that it's working) but useful when that's all you can test.
File Age Checks
File age checks verify that a file exists and was modified recently. This is perfect for background workers that update status files or batch jobs that create output files.
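A sketch of the file-age form explained below (the check name is illustrative):

```bash
helpmetest health "log-writer" "5m" "file-updated 2m /var/log/app.log"
```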
The file-updated 2m /var/log/app.log syntax tells the CLI to check if /var/log/app.log exists and was modified within the last 2 minutes. If the file doesn't exist or is older than 2 minutes, the health check fails. The 5-minute grace period means if the health check doesn't report success within 5 minutes, you get an alert.
For daily batch jobs, verify the output file was created in the last day. The 25-hour grace period (24 hours + 1-hour buffer) accounts for the daily schedule plus some slack for jobs that run late.
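A sketch for a daily job (the output path is a placeholder):

```bash
helpmetest health "daily-export" "25h" "file-updated 1d /data/exports/report.csv"
```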
You can check multiple files in a single health check. The CLI checks each file in sequence. If any file is missing or too old, the entire health check fails. This is useful for ensuring multiple related files are being updated together.
How to use this with workers:
Your background worker code should touch a status file periodically:
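A Node-style sketch of that touch logic; the status file path is an assumption:

```javascript
// Update the status file's mtime after each processed job
import { closeSync, openSync, utimesSync } from "node:fs";

function touchStatusFile(path = "/tmp/worker.alive") {
  const now = new Date();
  try {
    utimesSync(path, now, now);       // bump mtime if the file already exists
  } catch {
    closeSync(openSync(path, "w"));   // create it on the first run
  }
}
```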
Then your health check verifies this file is recent, proving the worker is still processing jobs.
Command Execution Checks
Command execution checks run an arbitrary shell command and use its exit code to determine health. Exit code 0 means success, anything else means failure.
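A sketch using the PostgreSQL example discussed below (the check name and grace period are illustrative):

```bash
helpmetest health "postgres" "2m" "psql -h localhost -c '\l'"
```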
This runs psql -h localhost -c '\l' (which lists all databases) and checks the exit code. If psql can connect to PostgreSQL and execute the query, it exits with 0 and the health check passes. If the connection fails or the query fails, psql exits with a non-zero code and the health check fails.
You can execute custom scripts that implement your own health check logic. The script should exit with 0 if everything is healthy and non-zero if something is wrong.
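A sketch with a hypothetical script path:

```bash
helpmetest health "disk-space" "10m" "/usr/local/bin/check_disk.sh"
```

You can also chain the CLI after an existing script with the shell's `&&` operator, for example in a backup job (paths are illustrative):

```bash
/usr/local/bin/backup.sh && helpmetest health "backup" "25h"
```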
The && operator means "run the second command only if the first succeeds". So backup.sh runs first. If it succeeds (exit code 0), then helpmetest health runs and reports success. If backup.sh fails (non-zero exit), helpmetest health never runs and no heartbeat is sent. After 25 hours without a heartbeat, the platform marks the backup as failed and alerts you.
Why use && vs passing the command as an argument:
- `backup.sh && helpmetest health "backup" "25h"` - Only reports success, never reports failure. Use this when you want silence on failure.
- `helpmetest health "backup" "25h" "backup.sh"` - Reports both success and failure. Use this when you want explicit failure reports.
Status Command
The status command shows the current state of all your health checks:
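As described, the basic invocation is just:

```bash
helpmetest status
```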
This queries the HelpMeTest API and displays a table of all your health checks with their current status (up/down/unknown), last heartbeat time, and grace period. The table updates to show the most recent data from the platform.
Filter to only show health checks tagged with ENV=production. This is useful when you have multiple environments and want to see just production or just staging.
Shows additional details like the full command being executed, environment variables, and system metrics collected with each heartbeat.
Grace Period Formats
Grace periods use the timespan-parser library which supports human-readable time formats:
- `30s` - 30 seconds
- `5m` - 5 minutes
- `2h` - 2 hours
- `1d` - 1 day
- `15min` - 15 minutes (alternative syntax)
- `2.5h` - 2 hours and 30 minutes (decimals work)
Grace Period Guidelines:
The grace period should be longer than your service's normal execution time plus a buffer for variability and network delays.
| Service Type | Recommended Grace Period | Reasoning |
|---|---|---|
| Web APIs | 30s - 2m | Fast response expected. If your API doesn't respond for 2 minutes, something is very wrong. |
| Database operations | 2m - 10m | Queries can legitimately take time. Connection issues often resolve themselves within minutes. |
| Backup jobs | 20-30% longer than execution | If your backup takes 2 hours, use a 3-hour grace period to account for slower nights. |
| Daily jobs | 25h - 26h | For a job that runs once per day, 25 hours gives you 1 hour of slack for late execution. |
| Weekly jobs | 8d - 9d | For weekly jobs, 8 days gives you 1 day of buffer for maintenance windows. |
Why these buffers matter:
Too short: False alerts when services are legitimately slow. Too long: Delayed alerts when services actually fail.
Start with the recommendations above and tune based on your false positive rate. If you're getting alerts for services that are actually healthy, increase the grace period. If you're discovering failures hours after they happen, decrease it.
Environment Variables
The CLI uses environment variables for configuration and metadata.
Required:
- `HELPMETEST_API_TOKEN` - Your API token from the HelpMeTest platform. This authenticates your health check reports. Get this from your dashboard settings. The token is a long string like `HELP-1dc7fbe0-1f4f-4c58-abb6-20f7ae47570c`.
Optional:
- `ENV` - Environment identifier (dev, staging, prod). This tags your health checks by environment so you can filter them in the dashboard. The platform treats this as a special field and provides environment-based filtering.
- `HELPMETEST_*` - Any environment variable starting with `HELPMETEST_` is captured and sent with health check reports. For example:
  - `HELPMETEST_SERVICE=auth-api` - Service name
  - `HELPMETEST_VERSION=2.1.3` - Deployment version
  - `HELPMETEST_REGION=us-west-2` - AWS region
  - `HELPMETEST_POD_NAME=web-app-abc123` - Kubernetes pod name
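A sketch of a typical shell setup before running checks (the token and values are placeholders, and the port in the HTTP check is an assumption):

```bash
export HELPMETEST_API_TOKEN="HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export ENV="production"
export HELPMETEST_SERVICE="auth-api"
export HELPMETEST_VERSION="2.1.3"

helpmetest health "auth-api" "2m" "localhost:3000/health"
```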
Auto-Collected System Metrics:
Every health check report includes system metrics automatically collected by the CLI:
- Hostname - Output of the `hostname` command
- IP address - First non-loopback IP address found
- CPU usage - Percentage of CPU used (sampled at collection time)
- Memory usage - Total memory and available memory in MB
- Disk usage - Disk space used and available for the root partition
- Environment variables - All `HELPMETEST_*` and `ENV` variables
These metrics appear in your dashboard alongside the health check status, giving you context about the system state when the health check ran.
Docker Integration
Docker's HEALTHCHECK directive runs a command periodically inside the container and uses its exit code to determine container health. This integrates with Docker's health tracking so docker ps shows container health status.
Basic Dockerfile Health Check
Here's a complete Dockerfile for a Node.js web application with health checks:
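The Dockerfile itself isn't reproduced here, so below is a sketch: the base image, port, and CLI install step are assumptions, and the HEALTHCHECK values match the parameters explained next.

```dockerfile
# Sketch: base image, port, and install details are assumptions
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Install the HelpMeTest CLI into the image using the installer from "CLI Installation"
# RUN <install helpmetest CLI here>

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD helpmetest health "web-app" "2m" "localhost:3000/health"

CMD ["node", "server.js"]
```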
HEALTHCHECK parameters explained:
- `--interval=30s` - How often Docker runs the health check command. 30 seconds is good for web services that should always be responsive. For less critical services, use longer intervals like 60s or 120s to reduce overhead.
- `--timeout=10s` - Maximum time Docker waits for the health check command to complete. If the command takes longer than this, Docker kills it and considers the check failed. 10 seconds is reasonable for HTTP requests and database queries. If your health check legitimately takes longer, increase this.
- `--start-period=5s` - Grace period during container startup before Docker starts counting health check failures. Your application needs time to start up (loading config, connecting to database, warming up). Set this to your application's typical startup time. Failed health checks during the start period don't count toward the failure threshold.
- `--retries=3` - How many consecutive health check failures before Docker marks the container as unhealthy. 3 retries means the service must fail for 3 * 30s = 90 seconds before Docker calls it unhealthy. This prevents false positives from temporary glitches.
What happens when a container is unhealthy:
In plain Docker, an unhealthy container keeps running but docker ps shows it as unhealthy; restart policies like restart: always react to container exits, not to health status, so Docker alone won't restart an unhealthy-but-running container. In orchestration systems like Docker Swarm, unhealthy containers are removed and replaced automatically, and Kubernetes uses its own liveness probes for the same purpose.
Database Container
Databases need different health checks than web services because they don't expose HTTP endpoints:
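A Dockerfile sketch for PostgreSQL; the CLI install step and start-period value are assumptions, while the other values match the parameters explained next.

```dockerfile
FROM postgres:15

# Install the HelpMeTest CLI into the image (see "CLI Installation")
# RUN <install helpmetest CLI here>

HEALTHCHECK --interval=60s --timeout=30s --start-period=30s --retries=3 \
  CMD helpmetest health "postgres" "5m" "psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1'"
```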
Why these parameters are different:
- `--interval=60s` - Database queries are more expensive than HTTP requests, so we check less frequently. 60 seconds is reasonable for database health.
- `--timeout=30s` - Database connections can take longer to establish than HTTP requests, especially if the connection pool is exhausted. 30 seconds gives the database time to process the connection request.
- Grace period `5m` - If we don't receive a health check report for 5 minutes, something is wrong. This is longer than the web service grace period because database operations are expected to be slower.
The health check command:
psql -U $POSTGRES_USER -d $POSTGRES_DB -c 'SELECT 1' connects to PostgreSQL and executes a simple query. This verifies:
- PostgreSQL is accepting connections
- The database exists
- The user credentials work
- The database can execute queries
If any of these fail, psql exits with a non-zero code and the health check fails.
Background Worker Container
Background workers don't expose HTTP endpoints or database interfaces. They just process jobs from queues. We use file-based health checks for these:
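A worker Dockerfile sketch; the base image and CLI install step are assumptions, and the check values match the walkthrough below.

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Install the HelpMeTest CLI into the image (see "CLI Installation")
# RUN <install helpmetest CLI here>

HEALTHCHECK --interval=60s --timeout=10s --retries=3 \
  CMD helpmetest health "worker" "10m" "file-updated 5m /tmp/worker.alive"

CMD ["node", "worker.js"]
```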
Worker application code:
Your worker must periodically update the status file to prove it's alive:
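A Node-style sketch of that loop; `processNextJob()` is a placeholder for your real job logic:

```javascript
import { closeSync, openSync, utimesSync } from "node:fs";

const STATUS_FILE = "/tmp/worker.alive";

// Bump the file's mtime, creating it on the first run
function touch(path) {
  const now = new Date();
  try {
    utimesSync(path, now, now);
  } catch {
    closeSync(openSync(path, "w"));
  }
}

// Placeholder: pull and handle one job from your queue here
async function processNextJob() {
  await new Promise((resolve) => setTimeout(resolve, 60_000));
}

async function main() {
  while (true) {
    await processNextJob();
    touch(STATUS_FILE);   // prove the worker is still processing
  }
}

main();
```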
How the file check works:
- Every 60 seconds, Docker runs the health check
- The health check verifies `/tmp/worker.alive` was modified in the last 5 minutes
- If the file is older than 5 minutes or doesn't exist, the health check fails
- After 3 consecutive failures (3 minutes), Docker marks the container unhealthy
- The platform alerts you if no heartbeat is received for 10 minutes
Why 5 minutes for file age:
If your jobs take 2 minutes on average, you should update the file after each job. In the worst case (job takes 2 minutes, then health check runs), the file will be 2 minutes old. 5 minutes gives you a 3-minute buffer for slow jobs. Tune this based on your job processing time.
Docker Compose Integration
Docker Compose runs multiple containers together and can coordinate their health checks. Here's a complete example showing different health check patterns:
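A condensed sketch of such a file; image names, build contexts, and credentials are assumptions, and each image must have the helpmetest CLI installed for the checks to run.

```yaml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'postgres' '5m' \"psql -U postgres -c 'SELECT 1'\""]
      interval: 60s
      timeout: 30s
      retries: 3

  api:
    build: .
    environment:
      HELPMETEST_API_TOKEN: ${HELPMETEST_API_TOKEN}
      ENV: production
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'api' '2m' 'localhost:3000/health'"]
      interval: 30s
      timeout: 10s
      retries: 3

  worker:
    build: .
    command: ["node", "worker.js"]
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'worker' '10m' 'file-updated 5m /tmp/worker.alive'"]
      interval: 60s
      timeout: 10s
      retries: 3
```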
The depends_on configuration:
This tells Docker Compose to wait for the database container to report healthy before starting the api service. Without this, the API would start immediately and fail to connect to the database because it's still starting up. The health check coordination ensures services start in the right order.
Running with Docker Compose:
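For example:

```bash
docker-compose up -d    # start all services with health checks enabled
docker-compose ps       # show per-service health status
```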
The docker-compose ps output shows health status for each service. Services show as "healthy", "unhealthy", or "starting" based on their health check results.
Kubernetes Integration
Kubernetes has two types of probes: liveness and readiness. They serve different purposes and both are important for robust deployments.
Liveness Probe: "Is this container broken and should be restarted?"
- If the liveness probe fails repeatedly, Kubernetes kills and restarts the pod
- Use this to detect deadlocks, infinite loops, or corrupted state that requires a restart
- Should be conservative - only fail when a restart would actually help
Readiness Probe: "Is this container ready to serve traffic?"
- If the readiness probe fails, Kubernetes removes the pod from the service load balancer
- Use this to temporarily remove pods during startup, during degraded states, or when dependent services are down
- Can fail more liberally - temporary removal from load balancer doesn't hurt
Create Secret for API Token
Never hardcode API tokens in Kubernetes manifests. Use secrets instead:
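A sketch of creating the secret described below (the token value is a placeholder):

```bash
kubectl create secret generic helpmetest-secret \
  --from-literal=api-token=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```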
This creates a secret named helpmetest-secret with one key api-token containing your token. The secret is stored encrypted in etcd and can be mounted into containers as environment variables or files.
Or use a secret manager like Infisical:
To base64-encode your token:
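For example (the token is a placeholder; `-n` avoids encoding a trailing newline):

```bash
echo -n 'HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' | base64
```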
Secret managers like Infisical, AWS Secrets Manager, or HashiCorp Vault are better for production because they provide:
- Automatic token rotation
- Audit logs of secret access
- Integration with your existing auth system
- Encryption at rest and in transit
Deployment with Liveness and Readiness Probes
Here's a production-ready Kubernetes deployment with proper health checks:
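The manifest isn't reproduced here, so below is a sketch under a few assumptions: the image name and port are placeholders, the helpmetest CLI is installed in the image, and the probe timings follow the web-service values discussed later in this section (30s delay, 30s period, 10s timeout).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:2.1.3   # placeholder image
          ports:
            - containerPort: 3000
          env:
            - name: ENV
              value: production
            - name: HELPMETEST_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: helpmetest-secret
                  key: api-token
          livenessProbe:
            exec:
              command: ["helpmetest", "health", "web-app-live", "2m", "localhost:3000/health"]
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            exec:
              command: ["helpmetest", "health", "web-app-ready", "1m", "localhost:3000/ready"]
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 3
```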
Why separate liveness and readiness probes:
Imagine your app depends on a database. If the database goes down:
- Readiness probe fails immediately, removing your pod from the load balancer so users don't hit it
- Liveness probe keeps passing because your app process is fine, just waiting for the database
- When the database comes back, readiness probe starts passing and traffic resumes
- No pod restart needed because the app itself was never broken
If liveness and readiness were the same probe, the pod would restart every time the database hiccuped, which doesn't help anything.
The /health vs /ready endpoints:
Your application should expose two endpoints:
- `/health` - Returns 200 if the app process itself is healthy (not deadlocked, not out of memory)
- `/ready` - Returns 200 if the app is ready to serve traffic (dependencies are available, caches are warm)
Database Deployment
Databases are stateful and need special handling:
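A sketch of a single-replica PostgreSQL deployment; the image is assumed to also contain the helpmetest CLI, and the probe timings match the database values listed below.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1            # primary databases don't run in parallel
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15   # assumes the helpmetest CLI is installed in this image
          ports:
            - containerPort: 5432
          env:
            - name: HELPMETEST_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: helpmetest-secret
                  key: api-token
          livenessProbe:
            exec:
              command: ["helpmetest", "health", "postgres", "5m", "psql -U postgres -c 'SELECT 1'"]
            initialDelaySeconds: 60
            periodSeconds: 60
            timeoutSeconds: 30
            failureThreshold: 3
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
```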
Why databases are different:
- Longer `initialDelaySeconds` (60s vs 30s) because databases initialize schema, load data, etc.
- Longer `periodSeconds` (60s vs 30s) because database queries have more overhead than HTTP requests
- Longer `timeoutSeconds` (30s vs 10s) because connection pools can be exhausted under load
- Only 1 replica because primary databases don't run in parallel (use StatefulSet for replicas)
CronJob with Health Check
Kubernetes CronJobs run scheduled tasks. Use health checks to verify they complete successfully:
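A CronJob sketch; the image name and backup script path are assumptions, while the schedule, restart policy, and grace period match the walkthrough below.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"             # every day at 2 AM
  jobTemplate:
    spec:
      backoffLimit: 3               # retry transient failures before giving up
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: registry.example.com/backup:latest   # placeholder image with the CLI installed
              env:
                - name: HELPMETEST_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: helpmetest-secret
                      key: api-token
              command:
                - /bin/sh
                - -c
                - /usr/local/bin/backup.sh && helpmetest health "database-backup" "25h"
```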
How the health check works with CronJobs:
- At 2 AM, Kubernetes creates a new pod and runs your command
- The backup script executes
- If the backup succeeds (exit code 0), the shell continues to the next line
- `helpmetest health` reports success to the platform
- If the backup fails (non-zero exit), the shell stops and the `helpmetest health` line never runs
- After 25 hours without a heartbeat, the platform marks the backup as failed and alerts you
The 25-hour grace period:
The job runs daily, so 24 hours is the expected interval. We add 1 hour of slack for jobs that run late, infrastructure delays, etc. If the backup hasn't reported success for 25 hours, something is definitely wrong.
Why restartPolicy: OnFailure:
If the backup fails, Kubernetes retries up to backoffLimit times. This handles transient failures (network blips, temporarily full disk) without alerting you. If all 3 attempts fail, you get an alert because the job legitimately failed.
Cron Job Monitoring
Traditional cron jobs on Linux servers need health check integration too. Add the health check command after your script runs:
Basic Cron Setup
When you pass a command to helpmetest health, it executes the command and reports both success and failure based on the exit code. This is different from command && helpmetest health which only reports success.
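A crontab sketch of that pattern (the script path is a placeholder):

```bash
# m h dom mon dow  command
0 2 * * * /usr/local/bin/helpmetest health "database-backup" "25h" "/usr/local/bin/backup.sh"
```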
Cron Best Practices
1. Use absolute paths:
Cron runs with a minimal PATH that often doesn't include /usr/local/bin. Always use absolute paths:
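For example (paths are illustrative):

```bash
# Bad: "backup.sh" and "helpmetest" may not be on cron's minimal PATH
# 0 2 * * * backup.sh && helpmetest health "backup" "3h"

# Good: absolute paths everywhere
0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "3h"
```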
2. Set environment variables explicitly:
Cron doesn't inherit your shell's environment. Set variables at the top of your crontab:
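A sketch (the token value is a placeholder):

```bash
HELPMETEST_API_TOKEN=HELP-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
ENV=production

0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "3h"
```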
3. Grace periods should be 20-30% longer than execution time:
The extra buffer accounts for variability. Some nights the backup might take 2.5 hours if the database is larger or the disk is slower. A 3-hour grace period prevents false alerts on those nights.
4. Log output for debugging:
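A sketch (the log path is a placeholder):

```bash
0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1 && /usr/local/bin/helpmetest health "backup" "3h"
```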
This captures all output (both stdout and stderr) to a log file. When a backup fails, you can check the log to see what went wrong.
5. Use file-based health checks for long-running scripts:
If your cron job runs for hours, you can't use the command execution pattern (it would time out). Instead, have your script update a status file and check that file:
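A sketch of that pattern; the check name and 30-minute grace period are assumptions, while the interval, file path, and age window match the explanation below:

```bash
*/10 * * * * /usr/local/bin/helpmetest health "long-backup" "30m" "file-updated 2h /tmp/backup-complete"
```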
Your backup script should touch /tmp/backup-complete when it finishes successfully. The health check runs every 10 minutes and verifies the file was updated in the last 2 hours.
API Isolation Feature
This is one of the most important features of the HelpMeTest CLI and is critical for production deployments.
The Problem:
Imagine your Kubernetes cluster runs health checks that report to the HelpMeTest API. One day there's a network issue between your cluster and the API (AWS region outage, DNS failure, firewall misconfiguration). If health check exit codes depend on successfully reporting to the API, all your health checks would fail even though your services are perfectly healthy. Kubernetes would kill all your pods. Your entire application goes down because of an unrelated network issue.
The Solution:
The HelpMeTest CLI always returns an exit code based purely on whether your health check command succeeded:
- CLI executes your health check command (HTTP request, database query, file check, etc.)
- Command succeeds → CLI returns exit code 0 (success)
- Command fails → CLI returns exit code 1 (failure)
- This happens regardless of whether the CLI can reach the HelpMeTest API
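A quick way to observe this exit-code behavior (the check name and port are illustrative):

```bash
helpmetest health "web-api" "2m" "localhost:3000/health"
echo $?   # 0 if the check command passed, 1 if it failed - regardless of API reachability
```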
If the API is reachable:
- CLI reports the health check result to the platform
- You see the status in your dashboard
- Alerts fire when checks fail
- Historical data is stored
If the API is unreachable:
- CLI logs a warning: "Failed to report health check to API: connection timeout"
- CLI still returns the correct exit code based on your service status
- Kubernetes doesn't kill healthy pods
- Docker doesn't mark healthy containers as unhealthy
- Your services keep running normally
Why this matters in practice:
Your infrastructure is more reliable than any third-party monitoring service. The HelpMeTest API might have outages (AWS problems, deployment issues, DDoS attacks). Your application should never go down because your monitoring is down. The API isolation feature ensures health checks reflect actual service health, not network connectivity to a monitoring platform.
Where the health check data goes when API is down:
The CLI doesn't queue or retry. If it can't reach the API, the health check data for that specific check is lost. But that's okay because:
- Health checks run frequently (every 30-60 seconds), so you'll only miss a few data points
- Your services stay healthy during the API outage, which is what matters
- When the API comes back, health checks resume reporting normally
Auto-Detection of Service Types
When you use the AI integration to add health checks automatically, the AI detects service types based on the container image and exposed ports. This table shows what health check commands the AI generates for different service types:
| Container Type | Auto-Detected Check | Why This Check |
|---|---|---|
| PostgreSQL | psql -h localhost -c "SELECT 1" | Verifies database accepts connections and can execute queries |
| MySQL | mysql -h localhost -e "SELECT 1" | Verifies MySQL connection and query execution |
| Redis | redis-cli ping | Redis's built-in health check returns PONG if healthy |
| MongoDB | mongosh --eval "db.runCommand({ping: 1})" | MongoDB's ping command verifies connection and responsiveness |
| Node.js | GET localhost:3000/health | Most Node apps expose a /health endpoint on port 3000 |
| Python/Flask | GET localhost:8000/health | Python web apps are commonly served on port 8000 (e.g., behind Gunicorn); a /health endpoint is standard |
| Nginx | GET localhost:80/health | Nginx runs on port 80, can proxy health checks to backends |
| Kafka | :9092 | Check if Kafka is listening on its default port |
| RabbitMQ | GET localhost:15672/api/overview | RabbitMQ management API provides cluster health status |
The AI looks at:
- Base image name (FROM postgres:15 → database health check)
- Exposed ports (EXPOSE 3000 → HTTP health check on :3000)
- Installed packages (apt-get install postgresql-client → database service)
- Running processes (CMD ["nginx"] → web server health check)
Troubleshooting
Debug Mode
Enable verbose output to see exactly what the CLI is doing:
Debug mode logs:
- Full HTTP request/response details for HTTP checks
- File paths and modification times for file checks
- Command execution and output for command checks
- API request/response details
- System metrics collection process
This shows the system metrics the CLI collects (CPU, memory, disk) without actually reporting a health check. Useful for verifying the CLI can access system information.
Verbose status shows:
- Full command being executed
- All environment variables
- System metrics from last check
- Historical status changes
False Positive Alerts
Symptoms:
- Getting alerts when service is actually healthy
- Intermittent failures for stable services
- Health checks timing out unexpectedly
Solutions:
1. Increase grace period:
If your service occasionally takes longer than expected, increase the grace period:
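For example (the check name and values are illustrative):

```bash
# Before: alerts after 5 minutes of silence
helpmetest health "report-generator" "5m"

# After: allow 15 minutes before alerting
helpmetest health "report-generator" "15m"
```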
Start with 50-100% buffer over typical execution time and adjust based on false positive rate.
2. Test command manually:
Run the exact health check command yourself to see if it actually works:
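For example, using the PostgreSQL check from earlier:

```bash
psql -h localhost -c 'SELECT 1'
echo $?   # non-zero means the health check would fail too
```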
If the command fails when you run it manually, fix the command before using it in health checks.
3. Check system resources during execution:
High CPU or memory usage can cause health checks to timeout:
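A few standard commands for a quick snapshot:

```bash
top -b -n 1 | head -20   # CPU snapshot
free -m                  # memory in MB
df -h /                  # root partition disk usage
```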
If CPU is pegged at 100% or memory is maxed out, your health checks might legitimately be slow. Either add more resources or increase timeouts.
4. Use more specific health checks:
Generic checks can fail for many reasons. Specific checks give you better signal:
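For example (the check name and port are illustrative):

```bash
# Generic: only proves something is listening on the port
helpmetest health "web-api" "2m" ":3000"

# Specific: proves the application answers its health endpoint
helpmetest health "web-api" "2m" "localhost:3000/health"
```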
The HTTP check is better because it verifies the application is responding correctly, not just that something is listening on the port.
Missing Heartbeats
Symptoms:
- Health checks show as 'down' but services are running
- Irregular heartbeat patterns in dashboard
- Cron jobs not reporting consistently
Solutions:
1. Verify cron job syntax:
Check if cron is actually running your job:
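A sketch (log location varies by distribution; on Debian/Ubuntu cron logs to syslog):

```bash
grep CRON /var/log/syslog | tail -20

# Run the same line by hand to compare behaviour
/usr/local/bin/backup.sh && /usr/local/bin/helpmetest health "backup" "3h"
```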
If the manual test works but cron doesn't, the problem is in your crontab syntax.
2. Use absolute paths in cron:
Cron has a minimal PATH. Use absolute paths for everything:
3. Set environment variables explicitly:
Cron doesn't inherit your shell environment:
Command Execution Issues
Symptoms:
- Health check commands fail unexpectedly
- Different behavior when run manually vs automated
- Permission denied errors
Solutions:
1. Use absolute paths:
2. Set required environment variables:
3. Check file permissions:
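For example (the script path is a placeholder):

```bash
ls -l /usr/local/bin/check_service.sh      # confirm ownership and mode
chmod +x /usr/local/bin/check_service.sh   # ensure it is executable
```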
4. Test as same user:
Health checks in containers run as the container user (often root). Test as that user:
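A sketch ("my-container" and the script path are placeholders):

```bash
docker exec -it my-container sh -c '/usr/local/bin/check_service.sh; echo $?'
```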
Container Health Check Debugging
Run health check manually inside container:
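For example ("my-container", the check name, and the port are placeholders):

```bash
docker exec -it my-container helpmetest health "web-api" "2m" "localhost:3000/health"
```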
Enable debug mode in container:
Check container logs:
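For example (the container name is a placeholder):

```bash
docker logs --tail 100 my-container
```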
Check Docker container health status:
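For example (the container name is a placeholder):

```bash
docker inspect --format '{{json .State.Health}}' my-container
```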
Real-World Examples
E-commerce Platform
Complete Docker Compose setup for an e-commerce platform with web frontend, API, background workers, payment processing, and database:
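The full file isn't reproduced here; a condensed sketch following the Compose pattern shown earlier (image names are assumptions, and each image needs the helpmetest CLI installed):

```yaml
services:
  web:
    image: shop/web:latest
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'storefront' '2m' 'localhost:3000/health'"]
      interval: 30s        # customer-facing: check frequently
      timeout: 10s
      retries: 3
  worker:
    image: shop/worker:latest
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'order-worker' '10m' 'file-updated 5m /tmp/worker.alive'"]
      interval: 60s        # background: check less often
      timeout: 10s
      retries: 3
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "helpmetest health 'shop-db' '5m' \"psql -U postgres -c 'SELECT 1'\""]
      interval: 60s
      timeout: 30s         # database queries get longer timeouts
      retries: 3
```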
This configuration:
- Uses appropriate grace periods for each service type
- Checks customer-facing services more frequently (30s) than background workers (60s)
- Uses HTTP checks for web services and file checks for workers
- Gives database queries longer timeouts (30s vs 10s)
SaaS Application
Multi-service SaaS platform with authentication, background jobs, email sending, and analytics:
Key patterns:
- Critical services (auth) have lower `retries` for faster failure detection
- Background services have longer intervals and timeouts
- File-based checks use realistic time windows based on job duration
- Version tracking with the `HELPMETEST_VERSION` environment variable
For a high-level overview and AI-powered setup, see the main health checks guide.
Questions? Email us at contact@helpmetest.com