Soda Core Data Quality Testing: From Setup to Production Monitoring

Soda Core Data Quality Testing: From Setup to Production Monitoring

Soda Core is an open-source Python library for data quality checks across databases, data warehouses, and data lakes. It uses a YAML-based check definition language called SodaCL (Soda Checks Language) that lets data engineers define quality rules without writing SQL. This guide walks through setting up Soda Core, writing comprehensive checks, integrating with CI/CD pipelines, and monitoring data quality in production.

What Soda Core Does

Soda Core connects to your data source — Snowflake, BigQuery, PostgreSQL, Spark, and more — runs your defined checks, and reports results. Checks cover:

  • Schema checks — required columns, correct types
  • Missing value checks — null counts and percentages
  • Validity checks — format patterns, value ranges, allowed values
  • Freshness checks — data recency
  • Referential integrity — foreign key consistency
  • Duplicate detection — unique constraints
  • Statistical checks — row counts, averages, distributions

Installation and Setup

pip install soda-core

# Install database-specific connector
pip install soda-core-snowflake  <span class="hljs-comment"># Snowflake
pip install soda-core-bigquery   <span class="hljs-comment"># BigQuery
pip install soda-core-postgres   <span class="hljs-comment"># PostgreSQL
pip install soda-core-spark      <span class="hljs-comment"># Apache Spark

Configure your data source in a configuration.yaml:

# configuration.yaml
data_sources:
  orders_warehouse:
    type: snowflake
    username: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    account: ${SNOWFLAKE_ACCOUNT}
    database: ANALYTICS
    schema: PUBLIC
    warehouse: COMPUTE_WH

Writing SodaCL Checks

Create a checks file for your table:

# checks/orders.yaml
checks for orders:
  # Row count
  - row_count > 0
  - row_count > 1000:
      name: Sufficient order volume
      fail: when < 100
      warn: when between 100 and 1000

  # Missing values
  - missing_count(order_id) = 0:
      name: order_id is never null
  - missing_percent(customer_id) < 0.1%:
      name: customer_id almost always populated
  - missing_count(amount) = 0:
      name: amount is never null

  # Validity
  - invalid_count(status) = 0:
      name: Valid status values only
      valid values:
        - pending
        - confirmed
        - shipped
        - delivered
        - cancelled

  - invalid_count(amount) = 0:
      name: Amount within valid range
      valid min: 0
      valid max: 1000000

  # Freshness
  - freshness(created_at) < 1h:
      name: Orders are recent

  # Duplicates
  - duplicate_count(order_id) = 0:
      name: order_id is unique

Running Soda Scans

# Run checks against a data source
soda scan -d orders_warehouse -c configuration.yaml checks/orders.yaml

<span class="hljs-comment"># Output:
<span class="hljs-comment"># Soda Core 3.x.x
<span class="hljs-comment"># By Soda | SpW [soda.io/terms]
<span class="hljs-comment"># Sending anonymous telemetry...
<span class="hljs-comment"># Scanned 1 data source(s)
<span class="hljs-comment"># 
<span class="hljs-comment"># orders [orders_warehouse]
<span class="hljs-comment"># PASS row_count > 0 [4821]
<span class="hljs-comment"># PASS order_id is never null [missing_count = 0]
<span class="hljs-comment"># FAIL Valid status values only [invalid_count = 3]
<span class="hljs-comment"># PASS Orders are recent [freshness = 23m ago]

Advanced Check Patterns

Schema Checks

checks for orders:
  - schema:
      name: Orders table schema matches contract
      warn:
        when wrong column index:
          - order_id
          - customer_id
          - amount
      fail:
        when required column missing:
          - order_id
          - customer_id
          - amount
          - status
          - created_at
        when wrong column type:
          - order_id: text
          - amount: decimal
          - created_at: timestamp_ltz

Reference Checks (Cross-Table)

checks for orders:
  - values in (customer_id) must exist in customers (customer_id):
      name: All order customer_ids exist in customers table
      samples limit: 50

Change-Over-Time Checks

Detect anomalies by comparing current values to historical baselines:

checks for orders:
  - change for row_count:
      name: Row count doesn't drop more than 20%
      fail: when percent change < -20
      warn: when percent change < -10

  - change for missing_count(amount):
      name: Missing amount count doesn't increase
      fail: when percent change > 10

SQL-Based Custom Checks

For business logic that SodaCL can't express natively:

checks for orders:
  - failed rows:
      name: No negative amounts on delivered orders
      fail query: |
        SELECT *
        FROM orders
        WHERE status = 'delivered'
          AND amount <= 0

  - failed rows:
      name: No future timestamps
      fail query: |
        SELECT *
        FROM orders
        WHERE created_at > CURRENT_TIMESTAMP()

Integration Testing Soda Checks

Test your Soda check configurations against synthetic data:

# tests/test_data_quality.py
import pytest
import pandas as pd
from soda.scan import Scan

@pytest.fixture
def clean_orders_data():
    return pd.DataFrame({
        'order_id': ['uuid-1', 'uuid-2', 'uuid-3'],
        'customer_id': ['CUST-000001', 'CUST-000002', 'CUST-000003'],
        'amount': [100.0, 250.0, 75.50],
        'status': ['confirmed', 'delivered', 'pending'],
        'created_at': pd.date_range('2026-01-01', periods=3, freq='1h'),
    })

@pytest.fixture
def dirty_orders_data():
    return pd.DataFrame({
        'order_id': ['uuid-1', 'uuid-1', 'uuid-3'],  # duplicate!
        'customer_id': ['CUST-000001', None, 'CUST-000003'],  # null!
        'amount': [100.0, -50.0, 75.50],  # negative!
        'status': ['confirmed', 'refunded', 'pending'],  # invalid status!
        'created_at': pd.date_range('2026-01-01', periods=3, freq='1h'),
    })

def run_soda_scan(df, checks_file):
    """Helper to run Soda scan against a DataFrame"""
    scan = Scan()
    scan.set_data_source_name('test_source')
    scan.add_pandas_dataframe(
        dataset_name='orders',
        pandas_df=df,
        data_source_name='test_source'
    )
    scan.add_sodacl_yaml_file(checks_file)
    scan.execute()
    return scan

def test_clean_data_passes_all_checks(clean_orders_data):
    scan = run_soda_scan(clean_orders_data, 'checks/orders.yaml')
    
    assert scan.get_error_logs_text() == ''
    checks = scan.get_checks_fail()
    assert len(checks) == 0, f"Expected no failures, got: {[c.name for c in checks]}"

def test_dirty_data_fails_expected_checks(dirty_orders_data):
    scan = run_soda_scan(dirty_orders_data, 'checks/orders.yaml')
    
    failed_checks = {c.name for c in scan.get_checks_fail()}
    assert 'order_id is unique' in failed_checks
    assert 'customer_id almost always populated' in failed_checks
    assert 'Valid status values only' in failed_checks
    assert 'Amount within valid range' in failed_checks

CI/CD Pipeline Integration

GitHub Actions

# .github/workflows/data-quality.yml
name: Data Quality Checks

on:
  schedule:
    - cron: '0 */4 * * *'  # Every 4 hours
  push:
    paths:
      - 'checks/**'

jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Soda Core
        run: |
          pip install soda-core-snowflake

      - name: Run data quality checks
        run: |
          soda scan \
            -d orders_warehouse \
            -c configuration.yaml \
            checks/orders.yaml \
            checks/customers.yaml \
            checks/events.yaml
        env:
          SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          SODA_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}

dbt Integration

If you use dbt, Soda integrates as a test layer:

# dbt_project.yml
on-run-end:
  - "{{ soda.run_checks(results) }}"

Airflow DAG Integration

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('orders_pipeline', schedule_interval='@hourly') as dag:
    
    transform_orders = BashOperator(
        task_id='transform_orders',
        bash_command='dbt run --select orders',
    )

    validate_orders = BashOperator(
        task_id='validate_orders',
        bash_command='soda scan -d orders_warehouse -c /config/soda.yaml /checks/orders.yaml',
        env={
            'SNOWFLAKE_USER': '{{ var.value.snowflake_user }}',
            'SNOWFLAKE_PASSWORD': '{{ var.value.snowflake_password }}',
        },
    )

    transform_orders >> validate_orders

Alerting on Check Failures

Configure Soda Cloud for centralized alerting (or use the CLI exit code):

# Exit code is non-zero on any failure — works with CI failure conditions
soda scan -d my_source -c config.yaml checks/orders.yaml
<span class="hljs-built_in">echo <span class="hljs-string">"Exit code: $?"

<span class="hljs-comment"># Send results to Soda Cloud for dashboards and alerting
soda scan \
  -d my_source \
  -c config.yaml \
  checks/orders.yaml \
  --cloud-api-key <span class="hljs-variable">$SODA_CLOUD_API_KEY

For team notifications without Soda Cloud:

# notify_on_failure.py
import subprocess
import json
import requests

result = subprocess.run(
    ['soda', 'scan', '-d', 'orders_warehouse', '-c', 'config.yaml', 'checks/orders.yaml', '-o', 'json'],
    capture_output=True, text=True
)

scan_results = json.loads(result.stdout)
failures = [c for c in scan_results['checks'] if c['outcome'] == 'fail']

if failures:
    message = f"Data quality failures detected:\n" + \
              "\n".join(f"- {c['name']}" for c in failures)
    requests.post(os.environ['SLACK_WEBHOOK_URL'], json={'text': message})

Production Monitoring with HelpMeTest

Soda scans are best when scheduled continuously. HelpMeTest's health check system complements Soda by monitoring the application layer — if a data quality issue causes your dashboards to show incorrect values, HelpMeTest catches the symptom while Soda catches the root cause.

Set up a health check that:

  1. Runs your Soda scan via CLI
  2. Fails if exit code is non-zero
  3. Alerts your team immediately

This gives you data quality monitoring with 5-minute granularity.

Summary

Soda Core enables declarative data quality testing with SodaCL checks covering schema, missing values, validity, freshness, duplicates, and cross-table references. Test your checks against synthetic clean and dirty datasets in CI to validate the checks themselves. Integrate Soda scans into your ETL pipelines (dbt, Airflow) to catch quality issues at every stage. Schedule production scans every 15-60 minutes and alert on failures. The most overlooked check type is the change-over-time check — detecting anomalies relative to your baseline is far more powerful than static thresholds alone.

Read more