Your database backup runs every night at 2 AM. One day it silently fails. You find out three weeks later when you need to restore 50GB of customer data that doesn't exist. The client threatening to sue you for data loss is on line 2.

Your API is down. Kubernetes thinks it's fine because the container is running. Users are getting 500 errors. You lose $2,000 in sales before someone tweets "@yourcompany your checkout is broken lol".

Your background worker stopped processing jobs 6 hours ago. The process is alive. The queue has 10,000 stuck jobs. Your support inbox has 47 angry emails.

These are silent failures. Your infrastructure breaks and you don't know until it's too late. Until it costs you money or reputation you can't get back.

What Health Checks Actually Do

Health checks are automated tests that verify your services are actually working, not just running. There's a massive difference between a process that's alive and a process that's doing its job. Your container might be running and the process might show up in ps aux, but the web server could be returning 500 errors, or the database connection pool could be exhausted, or the background worker could be stuck in an infinite loop processing the same job forever. The container is technically "up" but functionally dead.

Health checks catch this by testing actual functionality instead of just checking if a process exists. A web app health check makes an HTTP request to your service and expects a 200 response, proving the web server is not just running but actually serving requests. A database health check executes a simple query and verifies it returns data, proving the database can accept connections and execute SQL. A background worker health check looks for a recently updated status file that proves the worker is actively processing jobs, not just sitting there in an infinite loop.
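
To make that concrete, here is a minimal Python sketch of those three checks, assuming a PostgreSQL database and the psycopg2 driver; the URL, connection string, status file path, and thresholds are illustrative placeholders rather than values the platform prescribes.

```python
import os
import sys
import time
import urllib.request


def web_is_healthy(url: str = "http://localhost:8000/health") -> bool:
    # A 200 response proves the server is serving requests, not just running.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def database_is_healthy(dsn: str = "dbname=app user=app") -> bool:
    # A trivial query proves the database accepts connections and executes SQL.
    import psycopg2  # assumes PostgreSQL; swap in your own driver otherwise
    try:
        with psycopg2.connect(dsn, connect_timeout=5) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
    except psycopg2.Error:
        return False


def worker_is_healthy(status_file: str = "/tmp/worker.status",
                      max_age_seconds: int = 600) -> bool:
    # The worker touches this file while processing jobs; a stale file means it's stuck.
    try:
        return time.time() - os.path.getmtime(status_file) < max_age_seconds
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    # The exit code is what the orchestrator sees: 0 means healthy, 1 means not.
    sys.exit(0 if web_is_healthy() else 1)
```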

When a health check fails, your orchestration system takes action automatically. Docker restarts the container. Kubernetes removes the pod from the load balancer and spins up a replacement. Your monitoring system sends an alert. The whole point is you know immediately instead of waiting for users to complain or discovering the problem when you desperately need that three-week-old backup that doesn't exist.

The Three Problems

Setting up health monitoring has three problems. You solve one, the other two bite you.

Problem 1: Configuration Is Tedious

Every platform has its own syntax. Docker uses HEALTHCHECK directives. Kubernetes uses livenessProbe and readinessProbe. Docker Compose uses yet another format.

It's like owning four cars where each one has the gas pedal in a different spot. You end up writing the same health check logic four different ways just to cover Docker, Kubernetes, Docker Compose, and scheduled tasks.

Then there's choosing the right check for each service type. A web app needs HTTP endpoint monitoring because it serves requests, but a database can't be tested with HTTP requests because it doesn't speak HTTP—it needs a connection query using psql or mysql. A background worker might not expose any network interface at all, so you need file-based monitoring where the worker updates a status file and the health check verifies the file is recent enough.

And you need to tune timeouts based on what each service actually does. A web API should respond in seconds, so you check every 30 seconds. But a database backup might take hours, so you need a 25-hour grace period. A background worker processing heavy jobs might legitimately take 10 minutes per job. Get these timeouts wrong and you either miss real failures or get bombarded with false alerts at 3 AM.
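
For the slow cases, the check usually reduces to "did the last run finish recently enough?" Here is a sketch for the nightly backup, assuming the backup script touches a marker file on success; the path and window are hypothetical.

```python
import os
import sys
import time

BACKUP_MARKER = "/var/backups/last_success"  # assumed: the backup job touches this file when it finishes
GRACE_PERIOD_SECONDS = 25 * 60 * 60          # nightly job, so allow a 25-hour window


def backup_is_healthy() -> bool:
    # Healthy means the last successful backup happened inside the grace period.
    try:
        return time.time() - os.path.getmtime(BACKUP_MARKER) < GRACE_PERIOD_SECONDS
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    # A non-zero exit is what the scheduler or monitor treats as a failed check.
    sys.exit(0 if backup_is_healthy() else 1)
```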

Problem 2: Health Checks Need Somewhere To Report

Docker and Kubernetes have built-in health check mechanisms, but they're local and ephemeral. Docker knows if your container is healthy right now on this particular machine. Kubernetes knows if your pod is healthy right now in this specific cluster. But nobody is collecting this data centrally, nobody is tracking historical patterns, and nobody is sending you alerts when things break at 3 AM.

If your web app fails its health check at 3 AM, Kubernetes will automatically restart the pod, which is great for keeping your service running. But unless you happen to be watching kubectl get pods at 3 AM (and you're not), you'll never know it happened. There's no historical record of the failure, no alert sent to your phone, and no way to see patterns like "this service fails its health check every Tuesday" or "this pod has been restarting constantly since we deployed version 2.1."

For real monitoring, you need a centralized platform that collects health check data from all your services across all environments, stores the historical data so you can analyze patterns, and sends you alerts when things actually break. Building this means picking a monitoring platform, integrating it with your infrastructure, configuring alerts for each service, building dashboards that make sense, and managing API keys and authentication.

Problem 3: Maintenance Never Stops

Let's say you solve the first two problems—you configure all your health checks and integrate with a monitoring platform. Everything works beautifully. You're getting alerts when things break. You have dashboards showing your entire infrastructure. Life is good.

Then three weeks later you add a new microservice to handle authentication. Or you split your monolithic background worker into three specialized workers. Or you change how your batch processor reports its status because the old method was causing false alerts. Your infrastructure is constantly evolving because software doesn't stand still.

Now you need to update all your health checks to reflect these changes. You write health check configurations for the new services. You review the configs. You deploy the updated Dockerfiles and Kubernetes manifests to your clusters. Then you update your monitoring platform to know about the new services, reconfigure your alerts, update your dashboards, and test everything.

Every single infrastructure change requires this health check maintenance cycle. The maintenance burden never actually stops because infrastructure never stops evolving.

How We Solve All Three Problems

We built a system that handles configuration, monitoring, and ongoing maintenance. The AI integration generates health check configurations automatically based on your actual infrastructure. The monitoring platform is built in, so you don't need to pick one, integrate it, and configure it separately. And when your infrastructure inevitably changes, you just ask the AI to update your health checks and it handles everything end-to-end.

Step 1: Install the CLI and Configure AI Integration

The installer automatically detects which AI editors you have installed—Claude, Cursor, VSCode, or others—and configures the MCP integration for you. This gives your AI the ability to read your infrastructure files and generate health checks that report to the HelpMeTest platform.

Step 2: Ask Your AI to Add Health Checks

The AI scans your entire infrastructure by reading every Dockerfile and Kubernetes manifest it can find. It analyzes what each service actually does based on the code and configuration—is it a web server? A database? A background worker? Then it generates appropriate health checks for each service type and adds them directly to your configuration files.

Here's roughly what the conversation looks like (the exact wording will vary with your setup):
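
You: "Add health checks to every service in this repo and hook them up to HelpMeTest."

AI: "I found three services: a web API, a PostgreSQL database, and a background worker. I added an HTTP check to the API's Dockerfile, a connection query check to the database's Compose entry, and a file-based freshness check for the worker, all reporting through the HelpMeTest CLI. Review the diffs and deploy when you're ready."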

Step 3: Deploy and Start Monitoring

Deploy your updated containers to your infrastructure. As soon as they start running, the health checks begin executing and reporting their status to the HelpMeTest platform automatically. You don't need to configure anything else because the health check commands include our CLI, and the CLI already knows how to report to the platform using your API token.

You immediately get a dashboard that shows all your services across all environments in one place. Development, staging, production—they all show up automatically. When any service fails its health check, you get an alert through whatever channel you configured. The platform stores all the historical health check data, so you can see patterns like "this service has been flaky for the past week" or "this started failing right after we deployed version 2.1."

Maintenance Is Just As Easy

Three weeks later you add a new microservice to handle authentication. Instead of manually writing health check configs and updating your monitoring dashboards, you just ask your AI "add health checks for the new auth service." The AI scans your infrastructure again, finds the new service, determines that it's a web service that needs HTTP health checks, generates the appropriate configurations, and adds them to your files. You review the changes, deploy the updated configs, and the health checks start reporting to the platform automatically. The new service just shows up in your monitoring dashboard without any manual configuration.

What Actually Gets Added

The AI adds health checks appropriate for each service type. For a web application, it adds HTTP endpoint checks. For a background worker that doesn't expose HTTP endpoints, it adds file-based monitoring where your worker code updates a status file and the health check verifies it's recent. For a database, it adds connection query checks.
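
On the worker side, the "status file" piece is just a heartbeat written after each completed job. A minimal sketch, with the path and job handling as placeholders:

```python
import pathlib
import queue
import time

STATUS_FILE = pathlib.Path("/tmp/worker.status")  # hypothetical path; the health check compares its age to a freshness window


def do_work(job) -> None:
    time.sleep(0.1)  # stand-in for the real job handler


def run_worker(jobs: queue.Queue) -> None:
    while True:
        job = jobs.get()     # blocks until a job arrives
        do_work(job)
        STATUS_FILE.touch()  # heartbeat: only written when a job actually completes
```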

The AI also handles all the platform-specific syntax differences—Docker HEALTHCHECK directives, Kubernetes livenessProbe and readinessProbe configurations, Docker Compose healthcheck sections. It sets appropriate timeouts based on service type. Web services get 30-second checks. Background workers get 10-minute grace periods. Database backups get 25-hour windows.

For detailed examples of what gets added to your Dockerfiles, Kubernetes manifests, and Docker Compose files, see our technical deep dive guide.

API Isolation

The CLI has a critical feature specifically designed for container orchestration environments. Health check exit codes depend only on your actual service status, not on whether the CLI can reach the HelpMeTest API. This prevents cascading failures where network issues to our API cause your healthy containers to be killed.

If your Kubernetes cluster has a temporary network issue that prevents it from reaching the HelpMeTest API, Kubernetes won't start killing healthy pods just because they can't report their health check status. Similarly, if you're developing locally on your laptop and you go offline, Docker won't mark all your containers as unhealthy just because they can't phone home.

The CLI always executes your health check command—the HTTP request, the database query, the file age check, whatever—and returns an exit code based purely on whether that command succeeded. If the API is reachable, the CLI reports the health check result so you can see it in your dashboard and get alerts. If the API is unreachable due to network issues, the CLI logs a warning but still returns the correct exit code based on your service's actual health.
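
The pattern itself is simple to express. This is not the CLI's source, just a generic sketch of the isolation behavior with a placeholder reporting endpoint:

```python
import subprocess
import sys
import urllib.request

REPORT_URL = "https://monitoring.example.com/report"  # placeholder, not the real endpoint


def run_health_check(command: list[str]) -> int:
    # The verdict comes from the check command itself, never from reporting.
    result = subprocess.run(command, capture_output=True, timeout=60)
    healthy = result.returncode == 0

    try:
        # Best-effort reporting: a dead network must not change the exit code.
        urllib.request.urlopen(REPORT_URL, data=b"ok" if healthy else b"fail", timeout=5)
    except OSError:
        print("warning: monitoring API unreachable, returning local result anyway", file=sys.stderr)

    return 0 if healthy else 1


if __name__ == "__main__":
    # Example usage: python healthcheck.py curl -f http://localhost:8000/health
    sys.exit(run_health_check(sys.argv[1:]))
```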

Manual Setup

If you prefer manual control or need to understand exactly how the CLI works, you can use it directly without the AI integration.


For complete technical documentation including all CLI command syntax, Docker integration patterns, Kubernetes configuration examples, cron job setup, troubleshooting guides, and grace period recommendations, see our technical deep dive guide.

Questions? Email us at contact@helpmetest.com