Automating AI Compliance Tests in CI/CD: EU AI Act Engineering Checklist
EU AI Act compliance for high-risk AI systems is not a one-time checklist — it requires ongoing automated testing. This guide shows how to integrate bias, accuracy, robustness, audit trail, and logging tests into GitHub Actions using DeepEval and Giskard, with full YAML examples and a production monitoring setup using HelpMeTest health checks.
Key Takeaways
- Compliance tests belong in CI, not in a spreadsheet. Every model update should trigger automated bias and accuracy gates before deployment.
- Audit trail testing is mandatory, not optional. Article 12 requires logs, so your CI must verify that logs are being written correctly after every deploy.
- Production monitoring closes the loop. CI tests your code; HelpMeTest monitors whether your running system actually behaves as the tests specify.
The EU AI Act does not say "run a bias test before launch and ship it." Article 9's risk management system requirement is ongoing — you need to demonstrate that you are continuously monitoring and managing risk throughout the system's lifecycle.
For engineers, this translates to two things: automated compliance gates in CI/CD pipelines, and continuous production monitoring. This guide covers both.
The Compliance CI/CD Architecture
A complete AI compliance CI/CD pipeline has five stages:
- Data quality gate — validate training/inference data before model update
- Bias gate — DeepEval and Giskard bias scans
- Accuracy gate — regression against fixed holdout set
- Robustness gate — adversarial and distribution shift tests
- Audit trail gate — verify logging and audit infrastructure works
Each gate is a go/no-go decision. A failure blocks deployment.
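The go/no-go behavior can be sketched in a few lines of Python. This is a minimal illustration, not part of any tool named in this guide: `GateResult`, `run_gates`, `deployment_allowed`, and the stub checks are hypothetical names, and the gates run strictly in sequence here, whereas a CI system can run independent gates in parallel.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A gate check returns (passed, detail). All names here are illustrative.
GateCheck = Callable[[], Tuple[bool, str]]


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def run_gates(gates: List[Tuple[str, GateCheck]]) -> List[GateResult]:
    """Run gates in order and stop at the first failure."""
    results: List[GateResult] = []
    for name, check in gates:
        try:
            passed, detail = check()
        except Exception as exc:
            # A gate that crashes is treated as a failed gate, not a skipped one.
            passed, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(GateResult(name, passed, detail))
        if not passed:
            break  # later gates never run once one gate says no-go
    return results


def deployment_allowed(gates: List[Tuple[str, GateCheck]]) -> bool:
    """Deploy only if at least one gate exists, every gate ran, and all passed."""
    results = run_gates(gates)
    return bool(results) and len(results) == len(gates) and all(r.passed for r in results)


# Stub checks standing in for the real test suites.
example_gates = [
    ("data-quality", lambda: (True, "null rate below 1%")),
    ("bias", lambda: (False, "2 of 40 hiring prompts flagged")),
    ("accuracy", lambda: (True, "accuracy 0.91")),
]

example_results = run_gates(example_gates)
for r in example_results:
    print(f"{r.name}: {'PASS' if r.passed else 'FAIL'} ({r.detail})")
print("deploy:", deployment_allowed(example_gates))
```

In this run the accuracy gate never executes because the bias gate fails first, and an exception inside a gate counts as a failure rather than a pass.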
Stage 1: Data Quality Gate
Data quality failures are the root cause of most bias and accuracy regressions. Test them first.
# tests/compliance/test_data_quality.py
import pytest
import pandas as pd
import great_expectations as ge


@pytest.fixture
def inference_batch():
    """Load the current inference batch for validation."""
    return pd.read_parquet("data/inference_batch_latest.parquet")


def test_no_nulls_in_required_features(inference_batch):
    required = ["annual_income", "credit_score", "loan_amount", "employment_years"]
    for col in required:
        null_rate = inference_batch[col].isna().mean()
        assert null_rate < 0.01, f"Column '{col}' has {null_rate:.1%} nulls (threshold: 1%)"


def test_prohibited_attributes_absent(inference_batch):
    """Protected attributes must not be direct model inputs."""
    prohibited = ["race", "religion", "national_origin", "sex", "disability_status"]
    for col in prohibited:
        assert col not in inference_batch.columns, \
            f"Prohibited attribute '{col}' found in model input features"


def test_feature_distributions_within_range(inference_batch):
    # ge.from_pandas is the legacy Great Expectations pandas API; newer
    # releases may have removed it, so pin a version that still provides it.
    df = ge.from_pandas(inference_batch)
    assert df.expect_column_values_to_be_between("credit_score", 300, 850).success
    assert df.expect_column_values_to_be_between("annual_income", 0, 10_000_000).success
    assert df.expect_column_values_to_be_between("employment_years", 0, 70).success


def test_sample_size_sufficient(inference_batch):
    assert len(inference_batch) >= 100, \
        f"Inference batch too small for reliable bias testing: {len(inference_batch)} rows"

Stage 2: Bias Gate with DeepEval
# tests/compliance/test_bias.py
import json
from pathlib import Path
from typing import List

import pytest
from deepeval.metrics import BiasMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase


def load_bias_test_prompts() -> List[dict]:
    """Load standardized bias test prompts from fixtures."""
    return json.loads(Path("tests/fixtures/bias_prompts.json").read_text())


def get_model_output(prompt: str) -> str:
    # your_model_client is a placeholder for your own inference client
    from your_model_client import model
    return model.complete(prompt)


@pytest.mark.compliance
def test_no_demographic_bias_in_hiring_context():
    prompts = load_bias_test_prompts()
    hiring_prompts = [p for p in prompts if p["category"] == "hiring"]
    # For BiasMetric the threshold is a maximum: scores above it fail
    metric = BiasMetric(threshold=0.5)
    failures = []
    for item in hiring_prompts:
        output = get_model_output(item["prompt"])
        tc = LLMTestCase(input=item["prompt"], actual_output=output)
        metric.measure(tc)
        if not metric.is_successful():
            failures.append({
                "prompt_id": item["id"],
                "score": metric.score,
                "reason": metric.reason
            })
    assert len(failures) == 0, \
        f"Bias detected in {len(failures)}/{len(hiring_prompts)} hiring prompts: {failures}"


@pytest.mark.compliance
def test_no_toxicity_in_outputs():
    metric = ToxicityMetric(threshold=0.5)
    test_prompts = load_bias_test_prompts()
    failures = []
    for item in test_prompts[:20]:  # sample for speed
        output = get_model_output(item["prompt"])
        tc = LLMTestCase(input=item["prompt"], actual_output=output)
        metric.measure(tc)
        if not metric.is_successful():
            failures.append(item["id"])
    assert len(failures) == 0, f"Toxicity in outputs for prompts: {failures}"

Stage 2b: Giskard Scan in CI
Giskard's vulnerability scanner automates detection of bias, hallucination, and robustness issues for tabular and NLP models. Add it as a separate job that can run in parallel with DeepEval:
# scripts/giskard_scan.py
import argparse
import sys

import giskard


def run_scan(fail_on_level: str = "major"):
    # your_model is a placeholder; load_model is assumed here to return
    # both the model and its feature names
    from your_model import load_model, load_test_data
    model, feature_names = load_model()
    test_df = load_test_data()
    gsk_model = giskard.Model(
        model=model,
        model_type="classification",
        feature_names=feature_names,
        classification_labels=["denied", "approved"],
        name="CreditDecisionModel"
    )
    gsk_dataset = giskard.Dataset(
        df=test_df,
        target="decision",
        name="HoldoutSet",
        cat_columns=["employment_type", "region"]
    )
    results = giskard.scan(gsk_model, gsk_dataset)
    # Save scan report as compliance artifact
    results.to_html("compliance/giskard-scan-latest.html")

    level_map = {"minor": 1, "medium": 2, "major": 3, "critical": 4}
    threshold = level_map.get(fail_on_level, 3)

    def severity(issue) -> int:
        # issue.level may be an enum or a plain string; compare on its value
        return level_map.get(getattr(issue.level, "value", issue.level), 0)

    issues = [i for i in results.issues if severity(i) >= threshold]
    if issues:
        print(f"COMPLIANCE GATE FAILED: {len(issues)} {fail_on_level}+ issues found")
        for issue in issues:
            print(f"  - [{issue.level}] {issue.description}")
        sys.exit(1)
    else:
        print(f"Giskard scan passed. No {fail_on_level}+ issues found.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--fail-on", default="major")
    args = parser.parse_args()
    run_scan(args.fail_on)

Stage 3: Accuracy Gate
# tests/compliance/test_accuracy.py
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
import pytest
from sklearn.metrics import accuracy_score, f1_score

ACCURACY_THRESHOLD = 0.87
F1_THRESHOLD = 0.83
# Holdout columns that are labels or grouping keys, not model inputs
NON_FEATURE_COLUMNS = ["label", "demographic_group"]


def features(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns the model is allowed to see."""
    return df.drop(columns=NON_FEATURE_COLUMNS, errors="ignore")


@pytest.mark.compliance
def test_accuracy_regression():
    """Accuracy must not regress below declared threshold."""
    from your_model import load_model
    model = load_model()
    holdout = pd.read_parquet("tests/fixtures/holdout_v3.parquet")
    y = holdout["label"]
    predictions = model.predict(features(holdout))
    acc = accuracy_score(y, predictions)
    f1 = f1_score(y, predictions, average="macro")

    # Persist for audit trail
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model.version,
        "accuracy": float(acc),
        "f1_macro": float(f1),
        "holdout_version": "v3",
        # bool() keeps NumPy booleans out of the JSON encoder
        "passed": bool(acc >= ACCURACY_THRESHOLD and f1 >= F1_THRESHOLD),
    }
    log_path = Path("compliance/accuracy_log.jsonl")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as log_file:
        log_file.write(json.dumps(record) + "\n")

    assert acc >= ACCURACY_THRESHOLD, \
        f"Accuracy {acc:.4f} regressed below threshold {ACCURACY_THRESHOLD}"
    assert f1 >= F1_THRESHOLD, \
        f"F1 {f1:.4f} regressed below threshold {F1_THRESHOLD}"


@pytest.mark.compliance
def test_accuracy_per_demographic_group():
    """No demographic group should fall significantly below overall accuracy."""
    from your_model import load_model
    model = load_model()
    holdout = pd.read_parquet("tests/fixtures/holdout_v3.parquet")
    overall_acc = accuracy_score(holdout["label"], model.predict(features(holdout)))
    for group in holdout["demographic_group"].unique():
        group_df = holdout[holdout["demographic_group"] == group]
        group_acc = accuracy_score(
            group_df["label"],
            model.predict(features(group_df))
        )
        # No group should be more than 5pp below overall accuracy
        assert group_acc >= overall_acc - 0.05, \
            f"Group '{group}' accuracy {group_acc:.3f} is more than 5pp below overall {overall_acc:.3f}"

Stage 4: Robustness Gate
# tests/compliance/test_robustness.py
import numpy as np
import pandas as pd
import pytest


def load_holdout_features() -> pd.DataFrame:
    """Load the holdout set without label or grouping columns."""
    holdout = pd.read_parquet("tests/fixtures/holdout_v3.parquet")
    return holdout.drop(columns=["label", "demographic_group"], errors="ignore")


@pytest.mark.compliance
def test_confident_prediction_stability():
    """Confident predictions should not flip under small perturbations."""
    from your_model import load_model
    model = load_model()
    holdout = load_holdout_features()
    proba = model.predict_proba(holdout)
    confident_mask = proba.max(axis=1) > 0.90
    confident_cases = holdout[confident_mask].copy()
    # An empty sample would make the flip rate NaN and the result meaningless
    assert len(confident_cases) > 0, "No confident predictions in holdout set"
    base_preds = model.predict(confident_cases)
    perturbed = confident_cases.copy()
    # Reassign instead of *= so an integer column is safely upcast to float
    perturbed["annual_income"] = perturbed["annual_income"] * 1.01
    perturbed_preds = model.predict(perturbed)
    flip_rate = (base_preds != perturbed_preds).mean()
    assert flip_rate < 0.02, \
        f"Decision flip rate {flip_rate:.3f} under 1% income perturbation (threshold: 0.02)"


@pytest.mark.compliance
def test_graceful_handling_of_missing_values():
    """Model must not crash or produce errors on null inputs."""
    from your_model import load_model
    model = load_model()
    holdout = load_holdout_features()
    nullable_features = ["credit_score", "employment_years", "loan_amount"]
    for feature in nullable_features:
        test_input = holdout.iloc[[0]].copy()
        test_input[feature] = np.nan
        try:
            result = model.predict(test_input)
        except Exception as e:
            pytest.fail(f"Model crashed on null '{feature}': {e}")
        assert result is not None, f"Model returned no prediction for null '{feature}'"

Stage 5: Audit Trail Gate
This is the gate engineers most often skip — and the one most likely to cause regulatory problems. Article 12 requires logging; your CI must verify it works.
# tests/compliance/test_audit_trail.py
import uuid

import pytest


@pytest.mark.compliance
def test_inference_creates_audit_log_entry():
    """Every inference must produce a logged entry with required fields."""
    from your_model import load_model, get_audit_log
    from tests.fixtures import get_test_input
    model = load_model()
    audit_log = get_audit_log()
    test_input = get_test_input()
    correlation_id = str(uuid.uuid4())
    model.predict(test_input, correlation_id=correlation_id)
    # Retrieve the log entry
    entry = audit_log.get_by_correlation_id(correlation_id)
    assert entry is not None, "No audit log entry found for inference"
    assert entry["timestamp"] is not None
    assert entry["model_version"] is not None
    assert entry["input_hash"] is not None
    assert entry["output"] is not None
    assert entry["correlation_id"] == correlation_id


@pytest.mark.compliance
def test_audit_log_is_immutable():
    """Audit log entries must not be modifiable after creation."""
    from your_model import get_audit_log
    audit_log = get_audit_log()
    existing_id = audit_log.get_latest_entry_id()
    # Accept only the errors an immutable store is expected to raise. Catching
    # bare Exception would let unrelated failures pass this test, and
    # pytest.raises already fails the test if nothing is raised.
    with pytest.raises((PermissionError, NotImplementedError)):
        audit_log.modify_entry(existing_id, {"output": "tampered"})


@pytest.mark.compliance
def test_audit_log_retention_metadata():
    """Log entries must include retention metadata for compliance."""
    from your_model import get_audit_log
    audit_log = get_audit_log()
    entry = audit_log.get_latest_entry()
    assert "retention_until" in entry, "Log entry missing retention timestamp"
    assert "data_subject_id_hash" in entry or "no_personal_data" in entry, \
        "Log entry must carry a data subject hash or a no_personal_data flag"

Full GitHub Actions Workflow
# .github/workflows/ai-compliance.yml
name: AI Compliance Gate

on:
  push:
    paths:
      - 'models/**'
      - 'src/ml/**'
      - 'tests/compliance/**'
  pull_request:
    paths:
      - 'models/**'
      - 'src/ml/**'

jobs:
  data-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-test.txt
      - name: Data quality gate
        run: pytest tests/compliance/test_data_quality.py -v --tb=short

  bias-gate:
    runs-on: ubuntu-latest
    needs: data-quality
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-test.txt deepeval giskard
      - name: DeepEval bias tests
        run: pytest tests/compliance/test_bias.py -v -m compliance --tb=short
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Giskard vulnerability scan
        run: python scripts/giskard_scan.py --fail-on major
      - name: Upload scan report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: giskard-scan-${{ github.sha }}
          path: compliance/giskard-scan-latest.html
          # Audit retention; cannot exceed the repository's configured artifact retention limit
          retention-days: 365

  accuracy-gate:
    runs-on: ubuntu-latest
    needs: data-quality
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-test.txt
      - name: Accuracy regression tests
        run: pytest tests/compliance/test_accuracy.py -v -m compliance
      - name: Upload accuracy log
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: accuracy-log-${{ github.sha }}
          path: compliance/accuracy_log.jsonl
          retention-days: 365

  robustness-gate:
    runs-on: ubuntu-latest
    needs: data-quality
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-test.txt
      - name: Robustness tests
        run: pytest tests/compliance/test_robustness.py -v -m compliance

  audit-trail-gate:
    runs-on: ubuntu-latest
    needs: [bias-gate, accuracy-gate, robustness-gate]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-test.txt
      - name: Audit trail tests
        run: pytest tests/compliance/test_audit_trail.py -v -m compliance

  compliance-report:
    runs-on: ubuntu-latest
    needs: audit-trail-gate
    if: always()
    steps:
      - uses: actions/checkout@v4
      - name: Generate compliance summary
        run: python scripts/generate_compliance_report.py --sha ${{ github.sha }}
      - uses: actions/upload-artifact@v4
        with:
          name: compliance-report-${{ github.sha }}
          path: compliance/report-*.json
          retention-days: 365

Setting Pass/Fail Thresholds
Compliance thresholds should be declared explicitly in one shared module that every test imports, not hardcoded in individual test files:
# compliance/thresholds.py
COMPLIANCE_THRESHOLDS = {
    "accuracy": {
        "overall": 0.87,
        "per_group_min": 0.82,          # No group below 82%
        "regression_tolerance": 0.02    # Max 2pp drop from previous version
    },
    "bias": {
        "demographic_parity_difference": 0.10,
        "equalized_odds_difference": 0.05,
        "winobias_gap": 0.15,
        "deepeval_bias_score": 0.5
    },
    "robustness": {
        "perturbation_flip_rate": 0.02,
        "ood_detection_rate": 0.90
    },
    "audit": {
        "log_completeness_rate": 1.0,
        "max_log_latency_ms": 100
    }
}

Production Monitoring with HelpMeTest Health Checks
CI tests your code at deploy time. But after deployment, model behavior can change — due to data drift, upstream API changes, or gradual performance degradation. HelpMeTest health checks let you run your compliance tests on a schedule against your production endpoints.
Configure a daily compliance health check that:
- Pings your AI endpoint with a standard test input
- Verifies the response doesn't contain bias markers
- Checks that audit log entries are being created
- Alerts you if accuracy on a canary test set drops
This closes the compliance loop: CI gates prevent regressions from shipping, and production health checks catch drift after deployment.
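The four checks listed above could be bundled into one script that a scheduler runs daily. The sketch below is an assumption-heavy illustration and does not use any HelpMeTest API: `run_health_check`, `fake_predict`, `audit_store`, and the canary data are hypothetical, the endpoint and audit log are in-memory stand-ins, and the bias-marker check is a crude substring match that a real setup would replace with a proper metric.

```python
import uuid
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical signatures: adapt them to your own endpoint and audit store.
Predict = Callable[[Dict, str], str]              # (features, correlation_id) -> decision
FindAuditEntry = Callable[[str], Optional[Dict]]  # correlation_id -> log entry or None


def run_health_check(
    predict: Predict,
    find_audit_entry: FindAuditEntry,
    canary: Sequence[Tuple[Dict, str]],
    min_canary_accuracy: float,
    bias_markers: Sequence[str] = (),
) -> List[str]:
    """Return human-readable failures; an empty list means healthy."""
    failures: List[str] = []
    correct = 0
    for index, (features, expected) in enumerate(canary):
        correlation_id = str(uuid.uuid4())
        try:
            output = predict(features, correlation_id)
        except Exception as exc:
            failures.append(f"canary {index}: endpoint error: {exc}")
            continue
        lowered = str(output).lower()
        hits = [m for m in bias_markers if m.lower() in lowered]
        if hits:
            failures.append(f"canary {index}: bias markers in output: {hits}")
        if find_audit_entry(correlation_id) is None:
            failures.append(f"canary {index}: no audit log entry")
        if output == expected:
            correct += 1
    # An empty canary set is reported as 0.0 accuracy rather than silently passing.
    accuracy = correct / len(canary) if canary else 0.0
    if accuracy < min_canary_accuracy:
        failures.append(
            f"canary accuracy {accuracy:.2f} below {min_canary_accuracy:.2f}"
        )
    return failures


# In-memory stand-ins for a production endpoint and its audit log.
audit_store: Dict[str, Dict] = {}


def fake_predict(features: Dict, correlation_id: str) -> str:
    decision = "approved" if features["credit_score"] >= 650 else "denied"
    audit_store[correlation_id] = {"output": decision}
    return decision


canary_set = [
    ({"credit_score": 720}, "approved"),
    ({"credit_score": 580}, "denied"),
]

failures = run_health_check(fake_predict, audit_store.get, canary_set, 0.9)
print(failures or "healthy")
```

Returning a list of failures rather than raising on the first one lets a single scheduled run report every problem at once, which is more useful in an alert than a partial result.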
Try HelpMeTest
Set up continuous compliance monitoring for your AI system at https://helpmetest.com. Schedule your compliance test suite to run daily against production, get alerts when bias or accuracy metrics regress, and keep an audit trail of test results for EU AI Act documentation. Flat $100/month — no per-test pricing, unlimited runs.