Infrastructure Testing in CI/CD: GitOps Validation Pipelines

GitOps treats infrastructure like application code: changes go through pull requests, get reviewed, pass CI, and deploy automatically. But infrastructure PRs need different validation than application PRs. A bad dependency upgrade might break a test; a bad Terraform change might delete your database.

This guide covers how to build CI/CD pipelines that validate infrastructure changes thoroughly before they reach production.

The GitOps Infrastructure Testing Model

In a GitOps workflow, every infrastructure change follows this path:

Developer → PR → CI validation → Review → Merge → CD → Production

CI validation for infrastructure needs multiple gates:

  1. Static checks — syntax, formatting, linting (seconds)
  2. Policy checks — security, compliance, required tags (seconds)
  3. Plan validation — "what will change?" (seconds to minutes)
  4. Ephemeral testing — real deployment to isolated environment (minutes)
  5. Deployment verification — post-deploy health checks (minutes)

Layer 1: Static Validation

Run fast checks on every commit and PR. These should complete in under 2 minutes:

Terraform

# .github/workflows/terraform-validate.yml
name: Terraform Static Checks

on:
  push:
  pull_request:

jobs:
  static-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.6"

      - name: Format check
        run: terraform fmt -check -recursive

      - name: Init (no backend)
        run: terraform init -backend=false

      - name: Validate
        run: terraform validate

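      # A step cannot combine `uses` and `run`, so install and invoke tflint separately
      - name: Set up tflint
        uses: terraform-linters/setup-tflint@v4

      - name: Run tflint
        run: tflint --recursive --format compact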

      - name: Checkov security scan
        uses: bridgecrewio/checkov-action@v12
        with:
          framework: terraform
          soft_fail: false
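
The same checks can be run locally before pushing. The following is a minimal sketch, not part of the original pipeline: the function names `run_check` and `run_static_checks` are hypothetical, and it assumes `terraform` and `tflint` are on the PATH. Unlike a workflow job, which stops at the first failing step, it runs every check and reports all failures together.

```shell
#!/usr/bin/env bash
# Hypothetical local runner for the static checks (names are illustrative).

run_check() {
  # $1 = label, remaining arguments = command to run
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

run_static_checks() {
  local failures=0
  run_check "fmt"      terraform fmt -check -recursive || failures=$((failures + 1))
  run_check "init"     terraform init -backend=false   || failures=$((failures + 1))
  run_check "validate" terraform validate              || failures=$((failures + 1))
  run_check "tflint"   tflint --recursive              || failures=$((failures + 1))
  echo "$failures check(s) failed"
  # Non-zero status if anything failed, so callers can gate on it
  [ "$failures" -eq 0 ]
}
```

Sourcing this file and calling `run_static_checks` from the Terraform root returns a non-zero status when any check fails, so it could also back a pre-commit hook.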

Kubernetes

  kubernetes-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate YAML syntax
        run: |
          pip install yamllint
          yamllint kubernetes/

      - name: kubeval schema validation
        run: |
          wget https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz
          tar xzf kubeval-linux-amd64.tar.gz
          shopt -s globstar  # without this, bash does not expand ** recursively
          ./kubeval --kubernetes-version 1.29.0 kubernetes/**/*.yaml

      - name: conftest policy checks
        run: |
          # Assumes conftest was installed in an earlier step
          conftest test kubernetes/ --policy policy/kubernetes/

Layer 2: Plan and Drift Detection

Before merging, show reviewers exactly what will change:

  terraform-plan:
    runs-on: ubuntu-latest
    needs: static-checks
    permissions:
      pull-requests: write  # To comment the plan
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.PLAN_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.PLAN_AWS_SECRET_ACCESS_KEY }}

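      - name: Terraform Plan
        id: plan
        shell: bash  # explicit bash enables pipefail, so tee cannot mask a failed plan
        run: terraform plan -out=plan.tfplan -no-color 2>&1 | tee plan.txt
        continue-on-error: true
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.PLAN_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.PLAN_AWS_SECRET_ACCESS_KEY }}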

      - name: Post plan to PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const plan = require('fs').readFileSync('plan.txt', 'utf8');
            const truncated = plan.length > 60000 ? plan.slice(0, 60000) + '\n...[truncated]' : plan;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`hcl\n${truncated}\n\`\`\``
            });

      - name: Fail if plan failed
        if: steps.plan.outcome == 'failure'
        run: exit 1

      - name: Policy check on plan
        run: |
          terraform show -json plan.tfplan > plan.json
          conftest test plan.json --policy policy/terraform/ --parser json
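
The conftest policies themselves are not shown here. As a stand-in for one common rule (flag any plan that deletes or replaces resources), the sketch below uses `jq` instead of Rego. The function names are hypothetical. It relies on the `resource_changes[].change.actions` array in the JSON written by `terraform show -json`, and it assumes `jq` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical guard against destructive plans (a jq stand-in, not a conftest policy).

# Count resources whose planned actions include "delete".
# Replacements show up as ["delete","create"] or ["create","delete"], so they count too.
# $1 = path to the JSON produced by `terraform show -json plan.tfplan`
count_deletions() {
  jq '[.resource_changes[]? | select(.change.actions | index("delete"))] | length' "$1"
}

guard_destructive_plan() {
  local n
  n="$(count_deletions "$1")" || return 2
  if [ "$n" -gt 0 ]; then
    echo "Plan deletes or replaces $n resource(s); require explicit approval"
    return 1
  fi
  echo "No destructive changes in plan"
}
```

Calling `guard_destructive_plan plan.json` after the `terraform show -json` step would fail the job whenever the plan removes something, which is the case a reviewer most needs to see.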

To detect configuration drift (live infrastructure that has diverged from the code), run plans on a schedule:

  drift-detection:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

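      - name: Check for drift
        run: |
          terraform init
          # The default shell runs with -e, so capture the exit code without aborting
          EXIT_CODE=0
          terraform plan -detailed-exitcode 2>&1 || EXIT_CODE=$?
          if [ "$EXIT_CODE" -eq 2 ]; then
            echo "DRIFT DETECTED: infrastructure has diverged from code"
            exit 1
          elif [ "$EXIT_CODE" -ne 0 ]; then
            echo "terraform plan failed with exit code $EXIT_CODE"
            exit "$EXIT_CODE"
          fi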
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.PROD_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.PROD_AWS_SECRET_ACCESS_KEY }}

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  push:
    branches: [main]
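
With `-detailed-exitcode`, `terraform plan` exits 0 when nothing would change, 1 on error, and 2 when the plan contains changes. A minimal sketch of how a drift script might separate those cases follows; the function names `classify_plan_exit` and `check_drift` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that turn a `terraform plan -detailed-exitcode` result into a verdict.

classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;  # plan succeeded, no changes
    2) echo "drift" ;;    # plan succeeded, changes present
    *) echo "error" ;;    # plan itself failed
  esac
}

check_drift() {
  local code=0
  # Send the plan output to stderr so stdout carries only the verdict
  terraform plan -detailed-exitcode -input=false -no-color >&2 || code=$?
  classify_plan_exit "$code"
  [ "$code" -eq 0 ]
}
```

Keeping "drift" and "error" distinct matters for alerting: expired credentials on the scheduled runner should not page anyone as if production had changed.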

Layer 3: Ephemeral Environment Testing

For significant infrastructure changes, deploy to an isolated test environment and run real tests against it:

  ephemeral-test:
    runs-on: ubuntu-latest
    needs: [static-checks, terraform-plan]
    if: github.event_name == 'pull_request'
    environment: test
    env:
      # Not TF_WORKSPACE: Terraform refuses `workspace new` and `workspace select` while it is set
      PR_WORKSPACE: pr-${{ github.event.pull_request.number }}
      # Job-level so that apply and destroy see the same variable values
      TF_VAR_environment: test
      TF_VAR_name_prefix: pr-${{ github.event.pull_request.number }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: actions/setup-go@v5
        with:
          go-version: '1.21'

      - name: Create ephemeral workspace
        run: |
          terraform init
          terraform workspace new "$PR_WORKSPACE" || terraform workspace select "$PR_WORKSPACE"

      - name: Deploy test infrastructure
        run: terraform apply -auto-approve

      - name: Run Terratest integration tests
        run: go test -v -timeout 30m ./test/...

      - name: Destroy test infrastructure
        if: always()  # Always destroy, even if tests fail
        run: |
          terraform destroy -auto-approve
          terraform workspace select default
          terraform workspace delete "$PR_WORKSPACE"
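
The full pipeline later in this guide calls `./scripts/ephemeral-test.sh` without showing it. One possible shape for such a script is sketched below, assuming `terraform` and `go` are on the PATH. The function name `run_ephemeral_test` is hypothetical. The key property is that cleanup runs whether or not the apply or the tests succeed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an ephemeral test run: deploy, test, always destroy.
# $1 = workspace name, e.g. pr-123

run_ephemeral_test() {
  local ws="$1" rc=0
  terraform init -input=false || return 1
  terraform workspace select "$ws" 2>/dev/null || terraform workspace new "$ws" || return 1

  # Apply and test; record a failure but fall through to cleanup either way
  { terraform apply -auto-approve -input=false && go test -timeout 30m ./test/...; } || rc=1

  # Cleanup runs regardless of the test result
  terraform destroy -auto-approve -input=false || rc=1
  terraform workspace select default && terraform workspace delete "$ws" || rc=1
  return "$rc"
}
```

A failed destroy leaves billable resources behind, so it is worth treating a non-zero status here as a prompt to check the test account for leftovers.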

Layer 4: Post-Deployment Verification

After infrastructure deploys to production, verify the actual system is healthy:

  verify-production:
    runs-on: ubuntu-latest
    needs: deploy
    steps:
      - uses: actions/checkout@v4

      - name: Wait for deployment to stabilize
        run: sleep 60

      - name: Health check - API endpoint
        run: |
          response=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
          if [ "$response" != "200" ]; then
            echo "Health check failed: HTTP $response"
            exit 1
          fi
          echo "Health check passed: HTTP $response"

      - name: Health check - database connectivity
        run: |
          # Use a lightweight probe that tests DB connectivity
          curl -s https://api.example.com/readiness | jq -e '.database == "ok"'

      - name: Verify key metrics
        run: |
          # Check that error rate is below threshold
          ERROR_RATE=$(curl -s "https://monitoring.example.com/metrics" | jq '.error_rate')
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "Error rate too high after deployment: $ERROR_RATE"
            exit 1
          fi
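
A fixed `sleep 60` either wastes time or is too short. An alternative is to poll the health endpoint until it responds or a limit is reached. The sketch below is one way to do that; the function name `wait_for_healthy` and its default attempt count and delay are assumptions, not values from the original pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical polling helper for post-deploy verification.
# $1 = URL, $2 = max attempts (default 10), $3 = seconds between attempts (default 6)

wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-6}" i=1 status
  while [ "$i" -le "$attempts" ]; do
    # On connection failure curl reports 000 and exits non-zero; keep the loop going
    status="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$status" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP ${status:-none}" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

For example, `wait_for_healthy https://api.example.com/health 10 6` waits at most about a minute but returns as soon as the endpoint answers with HTTP 200.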

Automated Rollback

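If post-deployment verification fails, roll back automatically by reverting the commit and re-applying. This restores the previous configuration, but it cannot bring back data from resources that were destroyed: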

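  deploy-and-verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # git revert needs the parent commit

      - name: Record previous revision
        id: prev
        run: echo "sha=$(git rev-parse HEAD~1)" >> "$GITHUB_OUTPUT"

      - name: Deploy
        run: terraform apply -auto-approve

      - name: Verify deployment
        id: verify
        run: ./scripts/verify-deployment.sh
        continue-on-error: true

      - name: Rollback on failure
        if: steps.verify.outcome == 'failure'
        run: |
          echo "Deployment verification failed; rolling back to ${{ steps.prev.outputs.sha }}"
          # git revert creates a commit, which requires an identity on the runner
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          # Local revert only (a merge commit would also need -m 1)
          git revert HEAD --no-edit
          terraform apply -auto-approve
          exit 1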

Full GitOps Pipeline: Putting It Together

A complete pipeline using separate stages:

name: Infrastructure GitOps Pipeline

on:
  push:
    branches: [main]
    paths: ['infrastructure/**', 'kubernetes/**']
  pull_request:
    paths: ['infrastructure/**', 'kubernetes/**']
  schedule:
    - cron: '0 */6 * * *'

jobs:
  # === ALWAYS: Static checks ===
  static:
    name: Static Validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/static-checks.sh

  # === PRs only: Plan + ephemeral test ===
  plan:
    name: Terraform Plan
    needs: static
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/terraform-plan.sh

  test:
    name: Ephemeral Integration Test
    needs: plan
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    environment: test
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/ephemeral-test.sh

  # === Main branch: Deploy + verify ===
  deploy-staging:
    name: Deploy to Staging
    needs: static
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - run: terraform apply -auto-approve
      - run: ./scripts/verify-staging.sh

  deploy-production:
    name: Deploy to Production
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: terraform apply -auto-approve
      - run: ./scripts/verify-production.sh

  # === Scheduled: Drift detection ===
  drift:
    name: Drift Detection
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/drift-detection.sh

Continuous Monitoring After Deployment

Infrastructure tests validate the deployment process. But once infrastructure is live, things still go wrong: certificates expire, services become unreachable, latency degrades.

HelpMeTest provides continuous end-to-end monitoring after every deployment. Write test scenarios in plain English — "check that the login page loads and a user can sign in" — and HelpMeTest runs them every 5 minutes, 24/7. When something breaks in production, you find out before your users do.

This is the final layer of a mature infrastructure testing strategy: not just "did it deploy correctly" but "is it still working correctly, right now."

Summary

| Layer | When | Tool | Speed |
|---|---|---|---|
| Static validation | Every commit | terraform fmt, tflint, yamllint | Seconds |
| Policy checks | Every commit | conftest, checkov | Seconds |
| Plan review | On PR | terraform plan + PR comment | 1-2 min |
| Drift detection | Scheduled | terraform plan -detailed-exitcode | 1-5 min |
| Ephemeral testing | On PR | Terratest, Molecule | 10-30 min |
| Post-deploy verification | After deploy | curl, readiness probes | Minutes |
| Continuous monitoring | 24/7 | HelpMeTest | Ongoing |

Start with static validation (free, fast, immediate feedback) and add layers as your infrastructure matures. The goal is to catch issues as early as possible — ideally before the PR is merged, not in production at 2am.
