Testing Prometheus Alerting Rules: promtool, Unit Tests, and Alert Validation
Prometheus alerting rules can fail silently for years: a typo in a PromQL expression, a label mismatch in Alertmanager routing, or a threshold that never actually fires. promtool provides a native unit testing framework for alert rules—YAML-defined tests that validate expressions against synthetic time-series data without running a live Prometheus instance.
Key Takeaways
- **Alert rules are code. Test them like code.** `promtool test rules` runs YAML-defined unit tests against your alerting rules without a live Prometheus instance, which makes it a natural fit for CI.
- **Test both the firing and the non-firing conditions.** An alert that fires when it shouldn't is alert fatigue. An alert that never fires when it should is an incident you missed.
- **Test Alertmanager routing separately.** A correct alert rule is useless if Alertmanager routes it to the wrong receiver or silences it with an accidental inhibition rule.
Why Alert Rules Need Tests
Prometheus alerting rules are evaluated continuously against live time-series data. But the rules themselves are just PromQL expressions in YAML files—they are code. And like all code, they break.
Common failure modes:
- A label name changes in the exporter (e.g., `job` becomes `service`) and the alert expression silently stops matching any series
- A threshold is set in the wrong unit (milliseconds instead of seconds)
- A `for` duration is too short, causing the alert to fire on transient spikes and page the team at 3am for nothing
- An alert is added but the Alertmanager routing tree doesn't have a matching route, so it goes to a catch-all that nobody watches
None of these produce an error. Prometheus happily evaluates the broken expression and either never fires or fires constantly. The promtool test rules command catches this before it reaches production.
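To make the first failure mode concrete: a promtool unit test (the format is covered in the next section) pins the labels your rules select on. If an `InstanceDown` rule matches `up{job="api"}` but the exporter starts emitting `service` instead of `job`, a test like this fails loudly instead of the alert silently dying. All names here are illustrative:

```yaml
tests:
  - name: InstanceDown fires when the api target is down
    interval: 1m
    input_series:
      # The exporter now emits `service`; the rule still selects `job`,
      # so the expected alert never appears and promtool reports a failure.
      - series: 'up{service="api", instance="10.0.0.1:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              service: api
              instance: "10.0.0.1:9090"
```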
The promtool Test Format
promtool ships with the Prometheus distribution. Its `test rules` subcommand takes a YAML file that specifies:
- The alerting rule file(s) to load
- Synthetic time-series input data
- The expected alert state at specific points in time
Here is a minimal example for a high error rate alert:
```yaml
# rules/api-alerts.yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          runbook: "https://runbooks.example.com/high-error-rate"
```

Note the `sum by (job)`: aggregating without it would drop the `job` label that both the summary annotation and Alertmanager routing depend on.

```yaml
# tests/test-api-alerts.yaml
rule_files:
  - ../rules/api-alerts.yaml

evaluation_interval: 1m

tests:
  - name: HighErrorRate fires when error rate exceeds 5%
    interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status="200"}'
        values: "100 100 100 100 100 100"
      - series: 'http_requests_total{job="api", status="500"}'
        values: "0 2 4 6 8 10"  # increasing errors
    alert_rule_test:
      - eval_time: 2m
        alertname: HighErrorRate
        exp_alerts: []  # not yet firing: still inside the 2m `for` window
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              job: api
              severity: critical
            exp_annotations:
              summary: "High error rate on api"

  - name: HighErrorRate does not fire on low error rate
    interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status="200"}'
        values: "1000 1000 1000 1000 1000 1000"
      - series: 'http_requests_total{job="api", status="500"}'
        values: "1 1 1 1 1 1"  # 0.1% error rate
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts: []
```

Run it:

```shell
promtool test rules tests/test-api-alerts.yaml
```

Output on success:

```
Unit Testing: tests/test-api-alerts.yaml
  SUCCESS
```

On failure, promtool shows the expected vs actual alert state, the expression value at each evaluation step, and which labels didn't match.
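Why does the first test fire even though only a handful of errors accumulate? A quick back-of-the-envelope check (a sketch of the idea, not promtool's exact extrapolating `rate()`) makes the arithmetic concrete: the success counter is flat, so every observed increase is an error.

```python
# Idealized check of the first test's arithmetic (ignores the
# extrapolation details of Prometheus' real rate() function).

INTERVAL_S = 60  # the test feeds one sample per minute

def per_second_rate(values, interval_s=INTERVAL_S):
    """Idealized counter rate across the whole sample span."""
    return (values[-1] - values[0]) / ((len(values) - 1) * interval_s)

ok_rate = per_second_rate([100, 100, 100, 100, 100, 100])  # constant counter: 0.0
err_rate = per_second_rate([0, 2, 4, 6, 8, 10])            # +2 per minute

ratio = err_rate / (ok_rate + err_rate)
print(f"error ratio = {ratio:.2f}")  # 1.00: every observed increase is an error
assert ratio > 0.05  # comfortably above HighErrorRate's threshold
```

The constant success counter contributes a rate of zero, which is why the ratio saturates at 1.0 rather than hovering near the 5% threshold.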
Writing Unit Tests for Alert Expressions
Anatomy of input_series Values
The `values` field uses a compact encoding:

```
"0 1 2 3 4 5"   # one value per interval
"0+1x5"         # start=0, increment=1, 5 more steps → 0 1 2 3 4 5
"0 1 _ 3 4"     # _ means "no sample" (gap in data)
"stale"         # explicit staleness marker
```

For counter metrics (like `http_requests_total`), use cumulative values; the `rate()` function computes the per-second increase:

```yaml
input_series:
  - series: 'http_requests_total{job="api", status="500"}'
    values: "0+10x10"  # counter goes 0, 10, 20, ... 100
                       # rate() over a 5m window ≈ 10/60 per second
```

Testing the `for` Duration
The `for` field means the alert expression must stay continuously true for that duration before the alert transitions from PENDING to FIRING. Test both sides:
```yaml
tests:
  - name: Alert does not fire before for-duration elapses
    interval: 1m
    input_series:
      - series: 'up{job="my-service", instance="10.0.0.1:9090"}'
        values: "0 0 0 0 0"  # down from t=0
    alert_rule_test:
      # Alert rule has `for: 3m`
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []  # PENDING, not FIRING yet
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: my-service
              instance: "10.0.0.1:9090"
              severity: warning
```

Testing Multi-Label Alerts
When alerts can fire for multiple label combinations, test them explicitly:
```yaml
tests:
  - name: HighMemoryUsage fires per pod
    interval: 1m
    input_series:
      - series: 'container_memory_usage_bytes{pod="api-1", namespace="production"}'
        values: "0+1073741824x6"  # grows 1GiB per minute
      - series: 'container_memory_usage_bytes{pod="api-2", namespace="production"}'
        values: "500000000+0x6"  # stays at 500MB, should not fire
    alert_rule_test:
      - eval_time: 5m
        alertname: HighMemoryUsage
        exp_alerts:
          - exp_labels:
              pod: api-1
              namespace: production
              severity: warning
          # api-2 should NOT appear in exp_alerts
```

Testing PromQL Queries with promtool query
promtool query lets you run PromQL against a live Prometheus instance from the CLI—useful for verifying your queries work correctly against real data before encoding them as alert rules:
```shell
# Instant query
promtool query instant http://prometheus:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Range query (last 30 minutes, 1m step)
promtool query range \
  --start=$(date -d '-30 minutes' +%s) \
  --end=$(date +%s) \
  --step=1m \
  http://prometheus:9090 \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
```

Use this during rule development to validate that expressions return sensible values before encoding them as unit tests.
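Under the hood, promtool query talks to Prometheus' HTTP query API, and you can script against that API directly when a CLI round-trip is awkward. A minimal sketch using only the standard library; the `PROM_URL` value is an assumption to adapt to your setup:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # assumption: point at your instance

def instant_query_url(base, expr):
    """Build the /api/v1/query URL for an instant query."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def instant_query(expr):
    """Run an instant query and return the result vector."""
    with urllib.request.urlopen(instant_query_url(PROM_URL, expr)) as resp:
        return json.load(resp)["data"]["result"]

# Usage (requires a reachable Prometheus):
#   for sample in instant_query('up == 0'):
#       print(sample["metric"], sample["value"])
```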
Alertmanager Routing Tests
An alert that fires correctly but routes to the wrong receiver is useless. Alertmanager has its own test tooling: `amtool check-config` and `amtool config routes test`.
First, validate that your Alertmanager config parses correctly:

```shell
amtool check-config alertmanager.yml
```

Then test routing for specific alert label sets:

```shell
# Which receiver handles a critical API alert?
amtool config routes test \
  --config.file=alertmanager.yml \
  severity=critical \
  team=api
# Prints the matched receiver, e.g.:
# pagerduty-api-critical
```

Write a shell-based test suite for your routing rules:
```shell
#!/bin/bash
# tests/test-alertmanager-routing.sh
set -e

AMTOOL="amtool config routes test --config.file=alertmanager.yml"

assert_receiver() {
  local labels="$1"
  local expected_receiver="$2"
  local actual
  # `amtool config routes test` prints the matched receiver name
  actual=$($AMTOOL $labels)
  if [ "$actual" = "$expected_receiver" ]; then
    echo "PASS: $labels → $expected_receiver"
  else
    echo "FAIL: $labels → expected $expected_receiver, got $actual"
    exit 1
  fi
}

# Critical API alerts go to PagerDuty
assert_receiver "severity=critical team=api" "pagerduty-api-critical"
# Warning alerts go to Slack
assert_receiver "severity=warning team=api" "slack-api-warnings"
# Database critical alerts go to DB team PagerDuty
assert_receiver "severity=critical team=database" "pagerduty-db-critical"
# Watchdog (always-on) alert goes to deadman switch
assert_receiver "alertname=Watchdog" "deadman-sns"

echo "All routing tests passed."
```

Validating Alert Labels and Annotations
Alert labels must be consistent—they are how Alertmanager routes and deduplicates. Test them programmatically:
```python
#!/usr/bin/env python3
# tests/validate_alert_rules.py
import pathlib
import re
import sys

import yaml

REQUIRED_LABELS = {'severity'}
SEVERITY_VALUES = {'critical', 'warning', 'info'}
REQUIRED_ANNOTATIONS = {'summary', 'runbook'}
RUNBOOK_PATTERN = re.compile(r'^https://runbooks\.example\.com/')

errors = []
for rule_file in pathlib.Path('rules').glob('*.yaml'):
    data = yaml.safe_load(rule_file.read_text())
    for group in data.get('groups', []):
        for rule in group.get('rules', []):
            if 'alert' not in rule:
                continue  # recording rule, skip
            name = rule['alert']
            labels = rule.get('labels', {})
            annotations = rule.get('annotations', {})
            for required in REQUIRED_LABELS:
                if required not in labels:
                    errors.append(f"{rule_file}:{name}: missing required label '{required}'")
            if labels.get('severity') not in SEVERITY_VALUES:
                errors.append(
                    f"{rule_file}:{name}: severity '{labels.get('severity')}' "
                    f"not in {SEVERITY_VALUES}"
                )
            for required in REQUIRED_ANNOTATIONS:
                if required not in annotations:
                    errors.append(f"{rule_file}:{name}: missing annotation '{required}'")
            runbook = annotations.get('runbook', '')
            if runbook and not RUNBOOK_PATTERN.match(runbook):
                errors.append(f"{rule_file}:{name}: runbook URL format incorrect: {runbook}")

if errors:
    for e in errors:
        print(f"ERROR: {e}", file=sys.stderr)
    sys.exit(1)
print("Validated all alert rules in rules/ — no issues found.")
```

CI Integration
Add all tests to CI so alert regressions are caught before rules are applied to the cluster:
```yaml
# .github/workflows/prometheus-rules.yaml
name: Validate Prometheus Rules

on:
  pull_request:
    paths:
      - 'rules/**'
      - 'tests/test-*.yaml'
      - 'alertmanager.yml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Download promtool
        run: |
          wget -q https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
          tar xzf prometheus-2.51.0.linux-amd64.tar.gz
          sudo mv prometheus-2.51.0.linux-amd64/promtool /usr/local/bin/

      - name: Validate rule syntax
        run: promtool check rules rules/*.yaml

      - name: Run alert unit tests
        run: |
          for test_file in tests/test-*.yaml; do
            echo "Running: $test_file"
            promtool test rules "$test_file"
          done

      - name: Validate alert metadata
        run: python3 tests/validate_alert_rules.py

      - name: Install amtool
        run: |
          wget -q https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
          tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
          sudo mv alertmanager-0.27.0.linux-amd64/amtool /usr/local/bin/

      - name: Validate Alertmanager config
        run: amtool check-config alertmanager.yml

      - name: Test Alertmanager routing
        run: bash tests/test-alertmanager-routing.sh
```

Every PR that touches alerting rules or Alertmanager config must pass these checks before merge. No more "I thought that route matched" postmortems.
Common Pitfalls
**Rate windows shorter than the scrape interval.** If your scrape interval is 30s and you write `rate(metric[15s])`, the window can never contain the two samples `rate()` needs, so the expression returns no data at all. Test with realistic values.
**Missing `by` clause on alerts that should be per-instance.** `sum(rate(...))` aggregates everything away, leaving no label to route on. Test that the `exp_labels` in your unit tests actually contain the labels Alertmanager needs.
**`for` duration shorter than twice the scrape interval.** A single missed scrape can leave the expression without a result for one evaluation, which resets the pending timer and can prevent the alert from ever reaching FIRING. Test with evaluation intervals that reflect your real setup.
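The first pitfall falls straight out of how `rate()` works: it needs at least two samples inside the range window. A back-of-the-envelope model (evenly spaced scrapes, ignoring Prometheus' actual staleness handling) shows which windows can ever produce a value:

```python
def max_samples_in_window(window_s, scrape_interval_s):
    """Upper bound on evenly spaced samples that can fall inside one range window."""
    return window_s // scrape_interval_s + 1

for window in (15, 30, 60, 300):
    n = max_samples_in_window(window, scrape_interval_s=30)
    verdict = "ok" if n >= 2 else "rate() returns no data"
    print(f"rate(metric[{window}s]) with 30s scrapes: at most {n} sample(s), {verdict}")
```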