Soda Core Data Quality Testing: From Setup to Production Monitoring
Soda Core is an open-source Python library for data quality checks across databases, data warehouses, and data lakes. It uses a YAML-based check definition language called SodaCL (Soda Checks Language) that lets data engineers define quality rules without writing SQL. This guide walks through setting up Soda Core, writing comprehensive checks, integrating with CI/CD pipelines, and monitoring data quality in production.
What Soda Core Does
Soda Core connects to your data source — Snowflake, BigQuery, PostgreSQL, Spark, and more — runs your defined checks, and reports results. Checks cover:
- Schema checks — required columns, correct types
- Missing value checks — null counts and percentages
- Validity checks — format patterns, value ranges, allowed values
- Freshness checks — data recency
- Referential integrity — foreign key consistency
- Duplicate detection — unique constraints
- Statistical checks — row counts, averages, distributions
Installation and Setup
pip install soda-core
# Install database-specific connector
pip install soda-core-snowflake <span class="hljs-comment"># Snowflake
pip install soda-core-bigquery <span class="hljs-comment"># BigQuery
pip install soda-core-postgres <span class="hljs-comment"># PostgreSQL
pip install soda-core-spark <span class="hljs-comment"># Apache SparkConfigure your data source in a configuration.yaml:
# configuration.yaml
data_sources:
orders_warehouse:
type: snowflake
username: ${SNOWFLAKE_USER}
password: ${SNOWFLAKE_PASSWORD}
account: ${SNOWFLAKE_ACCOUNT}
database: ANALYTICS
schema: PUBLIC
warehouse: COMPUTE_WHWriting SodaCL Checks
Create a checks file for your table:
# checks/orders.yaml
checks for orders:
# Row count
- row_count > 0
- row_count > 1000:
name: Sufficient order volume
fail: when < 100
warn: when between 100 and 1000
# Missing values
- missing_count(order_id) = 0:
name: order_id is never null
- missing_percent(customer_id) < 0.1%:
name: customer_id almost always populated
- missing_count(amount) = 0:
name: amount is never null
# Validity
- invalid_count(status) = 0:
name: Valid status values only
valid values:
- pending
- confirmed
- shipped
- delivered
- cancelled
- invalid_count(amount) = 0:
name: Amount within valid range
valid min: 0
valid max: 1000000
# Freshness
- freshness(created_at) < 1h:
name: Orders are recent
# Duplicates
- duplicate_count(order_id) = 0:
name: order_id is uniqueRunning Soda Scans
# Run checks against a data source
soda scan -d orders_warehouse -c configuration.yaml checks/orders.yaml
<span class="hljs-comment"># Output:
<span class="hljs-comment"># Soda Core 3.x.x
<span class="hljs-comment"># By Soda | SpW [soda.io/terms]
<span class="hljs-comment"># Sending anonymous telemetry...
<span class="hljs-comment"># Scanned 1 data source(s)
<span class="hljs-comment">#
<span class="hljs-comment"># orders [orders_warehouse]
<span class="hljs-comment"># PASS row_count > 0 [4821]
<span class="hljs-comment"># PASS order_id is never null [missing_count = 0]
<span class="hljs-comment"># FAIL Valid status values only [invalid_count = 3]
<span class="hljs-comment"># PASS Orders are recent [freshness = 23m ago]Advanced Check Patterns
Schema Checks
checks for orders:
- schema:
name: Orders table schema matches contract
warn:
when wrong column index:
- order_id
- customer_id
- amount
fail:
when required column missing:
- order_id
- customer_id
- amount
- status
- created_at
when wrong column type:
- order_id: text
- amount: decimal
- created_at: timestamp_ltzReference Checks (Cross-Table)
checks for orders:
- values in (customer_id) must exist in customers (customer_id):
name: All order customer_ids exist in customers table
samples limit: 50Change-Over-Time Checks
Detect anomalies by comparing current values to historical baselines:
checks for orders:
- change for row_count:
name: Row count doesn't drop more than 20%
fail: when percent change < -20
warn: when percent change < -10
- change for missing_count(amount):
name: Missing amount count doesn't increase
fail: when percent change > 10SQL-Based Custom Checks
For business logic that SodaCL can't express natively:
checks for orders:
- failed rows:
name: No negative amounts on delivered orders
fail query: |
SELECT *
FROM orders
WHERE status = 'delivered'
AND amount <= 0
- failed rows:
name: No future timestamps
fail query: |
SELECT *
FROM orders
WHERE created_at > CURRENT_TIMESTAMP()Integration Testing Soda Checks
Test your Soda check configurations against synthetic data:
# tests/test_data_quality.py
import pytest
import pandas as pd
from soda.scan import Scan
@pytest.fixture
def clean_orders_data():
return pd.DataFrame({
'order_id': ['uuid-1', 'uuid-2', 'uuid-3'],
'customer_id': ['CUST-000001', 'CUST-000002', 'CUST-000003'],
'amount': [100.0, 250.0, 75.50],
'status': ['confirmed', 'delivered', 'pending'],
'created_at': pd.date_range('2026-01-01', periods=3, freq='1h'),
})
@pytest.fixture
def dirty_orders_data():
return pd.DataFrame({
'order_id': ['uuid-1', 'uuid-1', 'uuid-3'], # duplicate!
'customer_id': ['CUST-000001', None, 'CUST-000003'], # null!
'amount': [100.0, -50.0, 75.50], # negative!
'status': ['confirmed', 'refunded', 'pending'], # invalid status!
'created_at': pd.date_range('2026-01-01', periods=3, freq='1h'),
})
def run_soda_scan(df, checks_file):
"""Helper to run Soda scan against a DataFrame"""
scan = Scan()
scan.set_data_source_name('test_source')
scan.add_pandas_dataframe(
dataset_name='orders',
pandas_df=df,
data_source_name='test_source'
)
scan.add_sodacl_yaml_file(checks_file)
scan.execute()
return scan
def test_clean_data_passes_all_checks(clean_orders_data):
scan = run_soda_scan(clean_orders_data, 'checks/orders.yaml')
assert scan.get_error_logs_text() == ''
checks = scan.get_checks_fail()
assert len(checks) == 0, f"Expected no failures, got: {[c.name for c in checks]}"
def test_dirty_data_fails_expected_checks(dirty_orders_data):
scan = run_soda_scan(dirty_orders_data, 'checks/orders.yaml')
failed_checks = {c.name for c in scan.get_checks_fail()}
assert 'order_id is unique' in failed_checks
assert 'customer_id almost always populated' in failed_checks
assert 'Valid status values only' in failed_checks
assert 'Amount within valid range' in failed_checksCI/CD Pipeline Integration
GitHub Actions
# .github/workflows/data-quality.yml
name: Data Quality Checks
on:
schedule:
- cron: '0 */4 * * *' # Every 4 hours
push:
paths:
- 'checks/**'
jobs:
soda-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install Soda Core
run: |
pip install soda-core-snowflake
- name: Run data quality checks
run: |
soda scan \
-d orders_warehouse \
-c configuration.yaml \
checks/orders.yaml \
checks/customers.yaml \
checks/events.yaml
env:
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SODA_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}dbt Integration
If you use dbt, Soda integrates as a test layer:
# dbt_project.yml
on-run-end:
- "{{ soda.run_checks(results) }}"Airflow DAG Integration
from airflow import DAG
from airflow.operators.bash import BashOperator
with DAG('orders_pipeline', schedule_interval='@hourly') as dag:
transform_orders = BashOperator(
task_id='transform_orders',
bash_command='dbt run --select orders',
)
validate_orders = BashOperator(
task_id='validate_orders',
bash_command='soda scan -d orders_warehouse -c /config/soda.yaml /checks/orders.yaml',
env={
'SNOWFLAKE_USER': '{{ var.value.snowflake_user }}',
'SNOWFLAKE_PASSWORD': '{{ var.value.snowflake_password }}',
},
)
transform_orders >> validate_ordersAlerting on Check Failures
Configure Soda Cloud for centralized alerting (or use the CLI exit code):
# Exit code is non-zero on any failure — works with CI failure conditions
soda scan -d my_source -c config.yaml checks/orders.yaml
<span class="hljs-built_in">echo <span class="hljs-string">"Exit code: $?"
<span class="hljs-comment"># Send results to Soda Cloud for dashboards and alerting
soda scan \
-d my_source \
-c config.yaml \
checks/orders.yaml \
--cloud-api-key <span class="hljs-variable">$SODA_CLOUD_API_KEYFor team notifications without Soda Cloud:
# notify_on_failure.py
import subprocess
import json
import requests
result = subprocess.run(
['soda', 'scan', '-d', 'orders_warehouse', '-c', 'config.yaml', 'checks/orders.yaml', '-o', 'json'],
capture_output=True, text=True
)
scan_results = json.loads(result.stdout)
failures = [c for c in scan_results['checks'] if c['outcome'] == 'fail']
if failures:
message = f"Data quality failures detected:\n" + \
"\n".join(f"- {c['name']}" for c in failures)
requests.post(os.environ['SLACK_WEBHOOK_URL'], json={'text': message})Production Monitoring with HelpMeTest
Soda scans are best when scheduled continuously. HelpMeTest's health check system complements Soda by monitoring the application layer — if a data quality issue causes your dashboards to show incorrect values, HelpMeTest catches the symptom while Soda catches the root cause.
Set up a health check that:
- Runs your Soda scan via CLI
- Fails if exit code is non-zero
- Alerts your team immediately
This gives you data quality monitoring with 5-minute granularity.
Summary
Soda Core enables declarative data quality testing with SodaCL checks covering schema, missing values, validity, freshness, duplicates, and cross-table references. Test your checks against synthetic clean and dirty datasets in CI to validate the checks themselves. Integrate Soda scans into your ETL pipelines (dbt, Airflow) to catch quality issues at every stage. Schedule production scans every 15-60 minutes and alert on failures. The most overlooked check type is the change-over-time check — detecting anomalies relative to your baseline is far more powerful than static thresholds alone.