Synthetic Monitoring Alerting Strategy: Stop Waking Up for Nothing
Bad alerting is worse than no alerting. A monitoring setup that pages you at 3am for a 2-second blip that self-resolved trains your team to ignore alerts. And teams that ignore alerts miss the real outages.
Getting synthetic monitoring alerting right is mostly about restraint: being precise about what constitutes a real problem, routing alerts to the right people, and building escalation paths that match business impact.
The Fundamental Principle: Alert on Impact, Not on Events
Every alert should represent something a human needs to act on right now. If a human doesn't need to act on it right now, it should be a log entry, a dashboard metric, or a weekly report — not an alert.
Applied to synthetic monitoring: a single failed check is usually not actionable. Networks blip. DNS occasionally resolves slowly. A server stalls during a garbage collection pause. One failed check in isolation is noise. Two or three consecutive failures from multiple locations are a real problem.
Before configuring any alert, ask: "If this fires at 3am, will the on-call engineer be able to do something useful, and will it matter that they do it now rather than in the morning?" If the answer is no to either part, the alert shouldn't wake anyone up.
Thresholds: The Basics
Consecutive Failure Count
Never alert on a single failed check. Require 2 or 3 consecutive failures before triggering:
| Check Interval | Failures Before Alert | Time to Alert |
|---|---|---|
| 1 minute | 2 consecutive | ~2 minutes |
| 5 minutes | 2 consecutive | ~10 minutes |
| 5 minutes | 3 consecutive | ~15 minutes |
For critical flows like checkout or login, 2 consecutive failures (about 10 minutes at a 5-minute interval) is a reasonable threshold. For less critical flows, require 3 consecutive failures.
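The consecutive-failure rule can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation; the class name, the default threshold, and the helper function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureTracker:
    """Counts consecutive failed checks and reports when the threshold is reached."""

    threshold: int = 2  # failures required before alerting (assumed default)
    streak: int = 0     # current run of consecutive failures

    def record(self, passed: bool) -> bool:
        """Record one check result; return True only when the alert should fire."""
        if passed:
            self.streak = 0
            return False
        self.streak += 1
        # Fire exactly once, on the check that reaches the threshold.
        return self.streak == self.threshold


def worst_case_minutes_to_alert(interval_minutes: int, threshold: int) -> int:
    """Upper bound on detection time used in the table: interval x failures."""
    return interval_minutes * threshold
```

A single pass resets the streak, so an isolated blip never reaches the threshold on its own.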
Multi-Location Confirmation
If you run monitors from multiple geographic locations, require failures from at least 2 locations before alerting. A failure from one location might be a network issue between your monitoring infrastructure and your app, not an actual outage.
Single-location failure → log it, don't alert. Multi-location failure → this is real, alert immediately.
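As a rough sketch, the log-versus-alert decision could look like the function below. The function name, the return labels, and the shape of the input (latest pass/fail result per location) are assumptions made for illustration.

```python
def classify_failure(latest_by_location: dict[str, bool], min_locations: int = 2) -> str:
    """Return "ok", "log", or "alert" from the latest pass/fail result per location.

    latest_by_location maps a location name to True (passed) or False (failed).
    """
    failing = [loc for loc, passed in latest_by_location.items() if not passed]
    if not failing:
        return "ok"
    if len(failing) >= min_locations:
        return "alert"  # confirmed from several vantage points
    return "log"        # one location only: likely a network path issue
```

In practice this check would be combined with the consecutive-failure count, so that an alert needs both repeated failures and more than one failing location.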
Performance Thresholds
For latency-based alerts, the rule is similar: don't alert on a single slow measurement. Alert when performance degrades consistently.
A reasonable approach:
- Warning threshold: 3 consecutive checks where response time > 2× baseline
- Critical threshold: 3 consecutive checks where response time > 4× baseline, or response time > absolute limit (e.g., 10 seconds regardless of baseline)
You need a baseline to make this work. Let your monitors run for a week before setting performance alert thresholds. Use the 95th percentile of that week's data as your baseline, not the average.
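The baseline and threshold rules above might be sketched as follows. Two assumptions are baked in: the 95th percentile uses the nearest-rank method, and the absolute limit is read as needing the same 3 consecutive checks as the multiplier rules. Function names and defaults are hypothetical.

```python
def percentile_95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a week of response times."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def latency_severity(
    recent: list[float],
    baseline: float,
    absolute_limit: float = 10.0,  # seconds, assumed hard ceiling
    window: int = 3,               # consecutive checks required
) -> str:
    """Classify the most recent response times as "ok", "warning", or "critical"."""
    if len(recent) < window:
        return "ok"  # not enough data to call it consistent
    last = recent[-window:]
    if all(t > 4 * baseline or t > absolute_limit for t in last):
        return "critical"
    if all(t > 2 * baseline for t in last):
        return "warning"
    return "ok"
```

One slow measurement inside the window is enough to keep the result at "ok", which is the point of requiring consistency.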
Grace Periods: The Underused Tool
Grace periods let you absorb known variability without creating noise. HelpMeTest's health check system has built-in grace period support:
# Service gets 5 minutes to respond before alerting
helpmetest health api-service 5m
# Batch job gets 2 hours (it's not a real-time service)
helpmetest health nightly-processor 2h
# Long-running report generation: give it a full day
helpmetest health monthly-report-job 1d
Grace periods are especially important for:
- Batch jobs that run on a schedule — you expect them to be "missing" for most of the day
- Services with rolling restarts — a new deploy causes a brief period where health checks fail
- Scheduled maintenance — you need to suppress alerts during maintenance windows without disabling monitoring entirely
The grace periods available in HelpMeTest (30s, 5m, 2h, 1d) map to common operational patterns. Use 30s for real-time services where downtime is immediately user-facing. Use 5m for services that can tolerate brief interruptions. Use 2h or 1d for batch processing.
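To make the idea concrete, here is a small Python sketch of how a grace period check could work in general. It is not HelpMeTest's implementation; the function names and the parsing rules are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Suffix to timedelta keyword; covers the 30s / 5m / 2h / 1d style shown above.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_grace(text: str) -> timedelta:
    """Turn a string such as "5m" or "2h" into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unrecognised grace period: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def is_overdue(last_seen: datetime, grace: str, now: datetime) -> bool:
    """True when the service has been silent for longer than its grace period."""
    return now - last_seen > parse_grace(grace)
```

A nightly batch job checked with a `2h` grace period stays quiet for most of the day and only becomes alert-worthy once it has been silent for more than two hours past its last report.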
Alert Routing: Match Severity to Action
Not all monitor failures are equal. A broken checkout flow and a broken "contact us" page both show up as synthetic monitor failures. The response should be very different.
Three-Tier Routing Model
Tier 1 — Immediate page (any hour):
- Login failures (users can't access the app)
- Checkout or payment failures (direct revenue impact)
- Core API down (app is unusable)
- Data loss or corruption detected
Tier 2 — Business hours page, off-hours Slack:
- Performance degradation > 3× baseline
- Secondary features broken (search, notifications, account settings)
- Third-party integration failures that have workarounds
Tier 3 — Slack notification, no page:
- Single-region performance degradation
- Non-critical features broken
- Intermittent errors below threshold
The implementation varies by alerting tool, but the pattern is universal. In PagerDuty:
# Routing rules (simplified)
- condition: monitor.tags includes "critical" AND failures >= 2
  route: pagerduty-critical-service
- condition: monitor.tags includes "high" AND failures >= 2 AND hour in [9..18]
  route: pagerduty-business-hours
- condition: monitor.tags includes "high" AND failures >= 2 AND hour not in [9..18]
  route: slack-channel-monitoring
- condition: failures >= 3
  route: slack-channel-monitoring
Tag your monitors with severity levels when you create them. This metadata pays dividends when you're half-asleep at 3am and need to know immediately whether to get out of bed.
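The same simplified rules can be expressed as a plain function, which makes the evaluation order explicit (first match wins). This is a sketch of the routing logic only; it does not call any PagerDuty or Slack API, and it assumes the `[9..18]` range is inclusive at both ends.

```python
from typing import Optional


def route_alert(tags: set[str], failures: int, hour: int) -> Optional[str]:
    """Return the destination for an alert, or None if nothing should be sent.

    hour is the local hour of day (0-23); business hours are assumed to be 9-18 inclusive.
    """
    if "critical" in tags and failures >= 2:
        return "pagerduty-critical-service"
    if "high" in tags and failures >= 2:
        if 9 <= hour <= 18:
            return "pagerduty-business-hours"
        return "slack-channel-monitoring"
    if failures >= 3:
        return "slack-channel-monitoring"
    return None
```

For example, `route_alert({"high"}, failures=2, hour=3)` goes to Slack, while the same failure on a monitor tagged `critical` pages regardless of the hour.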
Handling Flapping
Flapping is when a monitor oscillates between passing and failing. It's one of the most frustrating alert patterns: you get paged, the alert resolves before you can investigate, you go back to sleep, it fires again 20 minutes later.
The fix is alert suppression during flapping:
- Alert fires (2 consecutive failures)
- Monitor recovers (1 pass)
- Monitor fails again (1 failure)
- Don't re-alert — the monitor is flapping
The "don't re-alert for X minutes after recovery" pattern prevents flapping from generating repeated pages. Most alerting systems support a "re-alert after N hours if still failing" option or an equivalent re-notification interval. Use it, and set the interval to at least 30 minutes for critical monitors and 2 hours for less critical ones.
In the meantime, create a flapping ticket automatically. A monitor that oscillates is telling you something is wrong even if it's not fully down. Log it and investigate during business hours.
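A minimal sketch of this suppression logic is shown below. The class name, the return labels, and the choice to measure the quiet period from the recovery of the last paged incident are all assumptions; real alerting tools expose this behaviour through their own settings.

```python
from datetime import datetime, timedelta
from typing import Optional


class FlapSuppressor:
    """Pages on the first incident, then suppresses re-pages shortly after recovery."""

    def __init__(self, threshold: int = 2, quiet_period: timedelta = timedelta(minutes=30)):
        self.threshold = threshold
        self.quiet_period = quiet_period
        self.streak = 0
        self.alerting = False  # True between a page and the next passing check
        self.last_recovery: Optional[datetime] = None

    def record(self, passed: bool, now: datetime) -> str:
        """Return "page", "suppressed" (open a ticket instead), or "none"."""
        if passed:
            if self.alerting:
                self.last_recovery = now  # a paged incident just recovered
            self.alerting = False
            self.streak = 0
            return "none"
        self.streak += 1
        if self.alerting or self.streak < self.threshold:
            return "none"  # already paged, or not enough failures yet
        in_quiet_period = (
            self.last_recovery is not None
            and now - self.last_recovery < self.quiet_period
        )
        if in_quiet_period:
            return "suppressed"
        self.alerting = True
        return "page"
```

If the monitor is still failing once the quiet period has elapsed, the next failing check pages again, so a flapping monitor that turns into a hard outage is not silenced indefinitely.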
Alert Message Content
The alert message is what an engineer reads at 3am before deciding what to do. Make it count.
Bad alert message:
ALERT: synthetic-monitor-login failed
Good alert message:
CRITICAL: Login flow failing — 3 consecutive failures (last: 14:23:07)
Monitor: login-flow-prod
Location: us-east-1
Error: Timeout waiting for [data-testid="dashboard-content"] after 10000ms
Step: Wait for dashboard after login redirect
Last success: 14:18:02 (5 minutes ago)
Runbook: https://wiki.example.com/runbooks/login-failure
Dashboard: https://monitoring.example.com/monitors/login-flow-prod
Every good alert message includes:
- What failed — monitor name and the specific failure
- When it started failing — last successful check time
- Where — which location(s) are seeing the failure
- What the error was — the actual exception or assertion that failed
- Link to runbook — what should I do right now?
- Link to dashboard — where can I see more context?
Writing runbooks takes time upfront but saves time and stress during incidents. A runbook for a login failure might be: "Check auth service health → check database connection pool → check recent deploys → check Cloudflare status." Three minutes to write, saves 20 minutes of disoriented investigation at 3am.
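One way to keep alert messages consistent is to build them from a fixed set of fields. The sketch below mirrors the fields listed above; the dataclass, its field names, and the exact layout are hypothetical and would need adapting to whatever your alerting tool accepts.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    severity: str
    summary: str
    failures: int
    last_failure: str   # timestamps kept as preformatted strings for simplicity
    last_success: str
    monitor: str
    locations: list[str]
    error: str
    step: str
    runbook_url: str
    dashboard_url: str


def format_alert(ctx: AlertContext) -> str:
    """Render the what / when / where / error / links structure as plain text."""
    lines = [
        f"{ctx.severity}: {ctx.summary}, {ctx.failures} consecutive failures (last: {ctx.last_failure})",
        f"Monitor: {ctx.monitor}",
        f"Location: {', '.join(ctx.locations)}",
        f"Error: {ctx.error}",
        f"Step: {ctx.step}",
        f"Last success: {ctx.last_success}",
        f"Runbook: {ctx.runbook_url}",
        f"Dashboard: {ctx.dashboard_url}",
    ]
    return "\n".join(lines)
```

Because every field is required by the dataclass, a monitor without a runbook link fails at construction time rather than producing an incomplete page at 3am.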
Alert Fatigue: The Warning Signs
Your alerting strategy is failing if:
- On-call engineers mute Slack during their rotation
- You regularly close alerts without investigating them
- The same monitor fires repeatedly without anyone fixing the underlying issue
- New engineers don't understand why most alerts can be safely ignored
If you see these patterns, audit your alerts. For each one that fired in the last 30 days:
- Did an engineer act on it?
- Was the action they took meaningful?
- Did the problem require an immediate response, rather than one that could have waited until morning?
Any alert where the answer to one of these questions is "no" should be demoted (lower severity, Slack instead of page) or removed entirely.
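The audit can be run against an export of your alert history. The sketch below assumes each fired alert has been annotated with the three answers; the record type, the field names, and the 50% noise cutoff are hypothetical choices, not a standard.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FiredAlert:
    monitor: str
    acted_on: bool
    action_meaningful: bool
    needed_immediate_response: bool


def monitors_to_demote(history: list[FiredAlert], max_noise_ratio: float = 0.5) -> list[str]:
    """Return monitors whose share of noisy alerts exceeds max_noise_ratio."""
    total: Counter = Counter()
    noisy: Counter = Counter()
    for alert in history:
        total[alert.monitor] += 1
        # An alert is noise if any of the three audit questions is answered "no".
        if not (alert.acted_on and alert.action_meaningful and alert.needed_immediate_response):
            noisy[alert.monitor] += 1
    return sorted(m for m in total if noisy[m] / total[m] > max_noise_ratio)
```

Setting `max_noise_ratio=0.0` applies the strict version of the rule, flagging any monitor that produced even one alert failing the audit.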
A Practical Starter Setup
For a team just getting started with synthetic monitoring alerting:
Critical monitors (page any time):
- Login flow — 2 consecutive failures, multi-location
- Checkout flow — 2 consecutive failures, multi-location
- Core API health — 2 consecutive failures
Standard monitors (Slack alert, page during business hours):
- Search functionality — 3 consecutive failures
- User profile — 3 consecutive failures
- Third-party integrations — 3 consecutive failures
Infrastructure health (Slack only):
- All services — alert after grace period expires
- Alert resolves automatically when service reports back in
Run this for a month. Review the alert history. Promote monitors that generated false negatives (real outages you missed). Demote monitors that generated noise. After 3 months, your alerting should be tight enough that every page represents a genuine, urgent problem.
The goal isn't zero alerts — it's zero meaningless alerts. Every page should be something you're glad you woke up for.