Infrastructure Testing in CI/CD: GitOps Validation Pipelines
GitOps treats infrastructure like application code: changes go through pull requests, get reviewed, pass CI, and deploy automatically. But infrastructure PRs need different validation than application PRs. A bad dependency upgrade might break a test; a bad Terraform change might delete your database.
This guide covers how to build CI/CD pipelines that validate infrastructure changes thoroughly before they reach production.
The GitOps Infrastructure Testing Model
In a GitOps workflow, every infrastructure change follows this path:
Developer → PR → CI validation → Review → Merge → CD → ProductionCI validation for infrastructure needs multiple gates:
- Static checks — syntax, formatting, linting (seconds)
- Policy checks — security, compliance, required tags (seconds)
- Plan validation — "what will change?" (seconds to minutes)
- Ephemeral testing — real deployment to isolated environment (minutes)
- Deployment verification — post-deploy health checks (minutes)
Layer 1: Static Validation
Run fast checks on every commit and PR. These should complete in under 2 minutes:
Terraform
# .github/workflows/terraform-validate.yml
name: Terraform Static Checks
on:
push:
pull_request:
jobs:
static-checks:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.6.6"
- name: Format check
run: terraform fmt -check -recursive
- name: Init (no backend)
run: terraform init -backend=false
- name: Validate
run: terraform validate
- name: tflint
uses: terraform-linters/setup-tflint@v4
run: tflint --recursive --format compact
- name: Checkov security scan
uses: bridgecrewio/checkov-action@v12
with:
framework: terraform
soft_fail: falseKubernetes
kubernetes-validation:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Validate YAML syntax
run: |
pip install yamllint
yamllint kubernetes/
- name: kubeval — schema validation
run: |
wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
tar xzf kubeval-linux-amd64.tar.gz
./kubeval --kubernetes-version 1.29.0 kubernetes/**/*.yaml
- name: conftest — policy checks
run: |
conftest test kubernetes/ --policy policy/kubernetes/Layer 2: Plan and Drift Detection
Before merging, show reviewers exactly what will change:
terraform-plan:
runs-on: ubuntu-latest
needs: static-checks
permissions:
pull-requests: write # To comment the plan
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- name: Terraform Init
run: terraform init
env:
AWS_ACCESS_KEY_ID: ${{ secrets.PLAN_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.PLAN_AWS_SECRET_ACCESS_KEY }}
- name: Terraform Plan
id: plan
run: terraform plan -out=plan.tfplan -no-color 2>&1 | tee plan.txt
continue-on-error: true
- name: Post plan to PR
uses: actions/github-script@v7
if: github.event_name == 'pull_request'
with:
script: |
const plan = require('fs').readFileSync('plan.txt', 'utf8');
const truncated = plan.length > 60000 ? plan.slice(0, 60000) + '\n...[truncated]' : plan;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `## Terraform Plan\n\`\`\`hcl\n${truncated}\n\`\`\``
});
- name: Fail if plan failed
if: steps.plan.outcome == 'failure'
run: exit 1
- name: Policy check on plan
run: |
terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policy/terraform/ --parser jsonFor detecting configuration drift (actual infrastructure diverged from code), run plans on a schedule:
drift-detection:
runs-on: ubuntu-latest
if: github.event_name == 'schedule'
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- name: Check for drift
run: |
terraform init
terraform plan -detailed-exitcode 2>&1
EXIT_CODE=$?
if [ $EXIT_CODE -eq 2 ]; then
echo "DRIFT DETECTED — infrastructure has diverged from code"
exit 1
fi
env:
AWS_ACCESS_KEY_ID: ${{ secrets.PROD_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.PROD_AWS_SECRET_ACCESS_KEY }}
on:
schedule:
- cron: '0 */6 * * *' # Every 6 hours
push:
branches: [main]Layer 3: Ephemeral Environment Testing
For significant infrastructure changes, deploy to an isolated test environment and run real tests against it:
ephemeral-test:
runs-on: ubuntu-latest
needs: [static-checks, terraform-plan]
if: github.event_name == 'pull_request'
environment: test
env:
TF_WORKSPACE: pr-${{ github.event.pull_request.number }}
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- uses: actions/setup-go@v5
with:
go-version: '1.21'
- name: Create ephemeral workspace
run: |
terraform init
terraform workspace new $TF_WORKSPACE || terraform workspace select $TF_WORKSPACE
- name: Deploy test infrastructure
run: terraform apply -auto-approve
env:
TF_VAR_environment: test
TF_VAR_name_prefix: pr-${{ github.event.pull_request.number }}
- name: Run Terratest integration tests
run: go test -v -timeout 30m ./test/...
- name: Destroy test infrastructure
if: always() # Always destroy, even if tests fail
run: |
terraform destroy -auto-approve
terraform workspace select default
terraform workspace delete $TF_WORKSPACELayer 4: Post-Deployment Verification
After infrastructure deploys to production, verify the actual system is healthy:
verify-production:
runs-on: ubuntu-latest
needs: deploy
steps:
- uses: actions/checkout@v4
- name: Wait for deployment to stabilize
run: sleep 60
- name: Health check — API endpoint
run: |
response=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
if [ "$response" != "200" ]; then
echo "Health check failed: HTTP $response"
exit 1
fi
echo "Health check passed: HTTP $response"
- name: Health check — database connectivity
run: |
# Use a lightweight probe that tests DB connectivity
curl -s https://api.example.com/readiness | jq -e '.database == "ok"'
- name: Verify key metrics
run: |
# Check that error rate is below threshold
ERROR_RATE=$(curl -s "https://monitoring.example.com/metrics" | jq '.error_rate')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "Error rate too high after deployment: $ERROR_RATE"
exit 1
fiAutomated Rollback
If post-deployment verification fails, automatically roll back:
deploy-and-verify:
runs-on: ubuntu-latest
steps:
- name: Get previous revision
id: prev
run: |
PREV=$(terraform show -json | jq -r '.values.root_module.resources[0].values.id')
echo "PREVIOUS_ID=$PREV" >> $GITHUB_OUTPUT
- name: Deploy
run: terraform apply -auto-approve
- name: Verify deployment
id: verify
run: ./scripts/verify-deployment.sh
continue-on-error: true
- name: Rollback on failure
if: steps.verify.outcome == 'failure'
run: |
echo "Deployment verification failed — rolling back"
git revert HEAD --no-edit
terraform apply -auto-approve
exit 1Full GitOps Pipeline: Putting It Together
A complete pipeline using separate stages:
name: Infrastructure GitOps Pipeline
on:
push:
branches: [main]
paths: ['infrastructure/**', 'kubernetes/**']
pull_request:
paths: ['infrastructure/**', 'kubernetes/**']
schedule:
- cron: '0 */6 * * *'
jobs:
# === ALWAYS: Static checks ===
static:
name: Static Validation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: ./scripts/static-checks.sh
# === PRs only: Plan + ephemeral test ===
plan:
name: Terraform Plan
needs: static
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: ./scripts/terraform-plan.sh
test:
name: Ephemeral Integration Test
needs: plan
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
environment: test
steps:
- uses: actions/checkout@v4
- run: ./scripts/ephemeral-test.sh
# === Main branch: Deploy + verify ===
deploy-staging:
name: Deploy to Staging
needs: static
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- run: terraform apply -auto-approve
- run: ./scripts/verify-staging.sh
deploy-production:
name: Deploy to Production
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- run: terraform apply -auto-approve
- run: ./scripts/verify-production.sh
# === Scheduled: Drift detection ===
drift:
name: Drift Detection
if: github.event_name == 'schedule'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: ./scripts/drift-detection.shContinuous Monitoring After Deployment
Infrastructure tests validate the deployment process. But once infrastructure is live, things still go wrong: certificates expire, services become unreachable, latency degrades.
HelpMeTest provides continuous end-to-end monitoring after every deployment. Write test scenarios in plain English — "check that the login page loads and a user can sign in" — and HelpMeTest runs them every 5 minutes, 24/7. When something breaks in production, you find out before your users do.
This is the final layer of a mature infrastructure testing strategy: not just "did it deploy correctly" but "is it still working correctly, right now."
Summary
| Layer | When | Tool | Speed |
|---|---|---|---|
| Static validation | Every commit | terraform fmt, tflint, yamllint | Seconds |
| Policy checks | Every commit | conftest, checkov | Seconds |
| Plan review | On PR | terraform plan + PR comment | 1-2 min |
| Drift detection | Scheduled | terraform plan -detailed-exitcode | 1-5 min |
| Ephemeral testing | On PR | Terratest, Molecule | 10-30 min |
| Post-deploy verification | After deploy | curl, readiness probes | Minutes |
| Continuous monitoring | 24/7 | HelpMeTest | Ongoing |
Start with static validation (free, fast, immediate feedback) and add layers as your infrastructure matures. The goal is to catch issues as early as possible — ideally before the PR is merged, not in production at 2am.