Great Expectations Tutorial: Data Validation and Pipeline Testing

Great Expectations (GX) is a Python library for defining and running data quality assertions against DataFrames, files, and databases. You define Expectations (e.g., "this column should be non-null"), group them into Expectation Suites, and run them via Checkpoints. Validation results are rendered as HTML Data Docs reports that show exactly what passed, what failed, and why.

Key Takeaways

Expectations are assertions about your data. expect_column_values_to_not_be_null, expect_column_values_to_be_between, expect_table_row_count_to_be_between—GX ships with 50+ built-in Expectations covering most data quality needs.

Expectation Suites group related Expectations. A suite for "raw orders" might contain 20 Expectations covering all the columns in that dataset.

Checkpoints connect suites to data. A Checkpoint says "run this Expectation Suite against this data source." Checkpoints are what you call in CI or in your pipeline.

Data Docs provide human-readable validation reports. Every Checkpoint run produces an HTML report showing which Expectations passed, which failed, and with what values—useful for debugging and for sharing with data consumers.

Profiling generates an initial suite automatically. Run the profiler against a sample of good data to get a baseline suite, then edit it to remove noisy assertions.

What Great Expectations Solves

Data pipelines fail in two ways: the pipeline crashes (you know immediately), or the pipeline runs but produces wrong data (you find out later, in a meeting). Great Expectations targets the second failure mode.

By defining Expectations on your data—at ingestion, after transformation, before loading to a warehouse—you create checkpoints where bad data is caught and rejected instead of silently flowing downstream.

Installation

pip install great-expectations

For database connections, install the relevant extras:

pip install great-expectations[sqlalchemy]    # PostgreSQL, MySQL, etc.
pip install great-expectations[bigquery]       # BigQuery
pip install great-expectations[snowflake]      # Snowflake

Core Concepts

Data Context — the central GX object that manages configuration, data sources, suites, and checkpoints.

Data Source — a connection to your data (a pandas DataFrame, a database, a file system).

Expectation Suite — a collection of Expectations about a specific dataset.

Batch — a slice of data to validate (a DataFrame, a table, a partition).

Checkpoint — a reusable configuration that runs a suite against a batch and saves results.

Validation Result — the output of a Checkpoint run, including pass/fail for each Expectation.

Quick Start: Validating a DataFrame

import great_expectations as gx
import pandas as pd

# Create a Data Context (in-memory for quick testing)
context = gx.get_context(mode="ephemeral")

# Create a Data Source from a pandas DataFrame
data = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 103, None],
    "amount": [50.0, -10.0, 200.0, 75.0],
    "status": ["placed", "shipped", "unknown", "completed"]
})

datasource = context.sources.add_pandas("my_datasource")
asset = datasource.add_dataframe_asset("orders")
batch_request = asset.build_batch_request(dataframe=data)

# Create an Expectation Suite
suite = context.add_expectation_suite("orders_suite")

# Add Expectations
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)

validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("amount", min_value=0)
validator.expect_column_values_to_be_in_set(
    "status",
    value_set=["placed", "shipped", "returned", "completed"]
)
validator.expect_table_row_count_to_be_between(min_value=1, max_value=1000000)

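# Keep Expectations that fail on this sample; by default, saving discards them
validator.save_expectation_suite(discard_failed_expectations=False)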

# Run validation
results = validator.validate()
print(f"Success: {results.success}")
print(f"Failed expectations: {results.statistics['unsuccessful_expectations']}")
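To see which checks failed rather than only how many, one option is to convert the result to a plain dict with results.to_json_dict() and walk it. The sketch below assumes that dict has a "results" list whose entries carry "success", "expectation_config", and "result" keys. The helper name failed_expectations is hypothetical, and the sample dict is hand-written for illustration, not captured GX output.

```python
def failed_expectations(result_dict):
    """Return (expectation_type, column, unexpected_count) for each failed check."""
    failures = []
    for entry in result_dict.get("results", []):
        if entry.get("success"):
            continue
        config = entry.get("expectation_config", {})
        kwargs = config.get("kwargs", {})
        failures.append((
            config.get("expectation_type"),
            kwargs.get("column"),  # None for table-level Expectations
            entry.get("result", {}).get("unexpected_count"),
        ))
    return failures


# Hand-written sample in the assumed shape (not real GX output).
sample = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "order_id"},
            },
            "result": {"element_count": 4, "unexpected_count": 0},
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "amount", "min_value": 0},
            },
            "result": {"element_count": 4, "unexpected_count": 1},
        },
    ],
}

for expectation_type, column, unexpected in failed_expectations(sample):
    print(f"{expectation_type} on {column}: {unexpected} unexpected value(s)")
```

In the quick start above, the equivalent call would be failed_expectations(results.to_json_dict()), assuming the installed GX version serializes results in this shape.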

Common Expectations

Null Checks

# No nulls allowed
validator.expect_column_values_to_not_be_null("email")

# Some nulls are acceptable (up to 10%)
validator.expect_column_values_to_not_be_null(
    "phone_number",
    mostly=0.90  # at least 90% must be non-null
)

Value Range Checks

# Numeric bounds
validator.expect_column_values_to_be_between(
    "price",
    min_value=0.01,
    max_value=99999.99
)

# Date bounds
validator.expect_column_values_to_be_between(
    "created_at",
    min_value="2020-01-01",
    max_value="2030-12-31",
    parse_strings_as_datetimes=True
)

Format Checks

# Regex pattern matching
validator.expect_column_values_to_match_regex(
    "email",
    regex=r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"
)

# JSON parseable
validator.expect_column_values_to_be_json_parseable("metadata")

Uniqueness and Cardinality

# No duplicates
validator.expect_column_values_to_be_unique("order_id")

# Expected number of distinct values
validator.expect_column_unique_value_count_to_be_between(
    "country_code",
    min_value=1,
    max_value=250
)

Table-Level Checks

# Row count
validator.expect_table_row_count_to_be_between(
    min_value=1000,
    max_value=10000000
)

# Exactly these columns, in exactly this order
validator.expect_table_columns_to_match_ordered_list([
    "order_id", "customer_id", "amount", "status", "created_at"
])

Statistical Checks

# Mean in expected range
validator.expect_column_mean_to_be_between(
    "order_amount",
    min_value=40.0,
    max_value=300.0
)

# No unexpected distribution shift
validator.expect_column_quantile_values_to_be_between(
    "order_amount",
    quantile_ranges={
        "quantiles": [0.25, 0.50, 0.75],
        "value_ranges": [[10, 100], [50, 250], [100, 500]]
    }
)
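The quantile check pairs each entry in "quantiles" with the range at the same position in "value_ranges". A minimal standard-library sketch of that pairing follows. It is not how GX computes quantiles internally (the method can differ by backend), and quartiles_within is a hypothetical helper.

```python
from statistics import quantiles


def quartiles_within(values, value_ranges):
    """Check the 25th, 50th, and 75th percentiles against [low, high] ranges."""
    observed = quantiles(values, n=4, method="inclusive")  # [q25, q50, q75]
    return [low <= q <= high for q, (low, high) in zip(observed, value_ranges)]


ranges = [[10, 100], [50, 250], [100, 500]]

typical_amounts = [20, 60, 120, 180, 400]
shifted_amounts = [200, 300, 400, 500, 600]

print(quartiles_within(typical_amounts, ranges))  # every quartile inside its range
print(quartiles_within(shifted_amounts, ranges))  # lower quartiles drift out of range
```

A shift in the distribution shows up as one or more quartiles leaving their range, even when the mean still looks plausible.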

Using Checkpoints for Pipeline Integration

Checkpoints are the main integration point between GX and your pipeline:

import great_expectations as gx

context = gx.get_context()

# Define a Checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="orders_checkpoint",
    validations=[
        {
            "batch_request": {
                "datasource_name": "my_datasource",
                "data_asset_name": "orders",
            },
            "expectation_suite_name": "orders_suite",
        }
    ],
)

# Run it
results = checkpoint.run()

if not results.success:
    raise ValueError("Data quality validation failed — pipeline halted")

Call this Checkpoint at the start of your pipeline to gate on data quality before running expensive transformations.

Validating Database Tables

Connect GX to a SQL database:

context = gx.get_context()

# PostgreSQL connection
datasource = context.sources.add_or_update_sql(
    name="postgres_datasource",
    connection_string="postgresql+psycopg2://user:pass@localhost/mydb"
)

# Add a table asset
table_asset = datasource.add_table_asset(
    name="raw_orders",
    table_name="raw_orders",
    schema_name="public"
)

# Build a batch request (the whole table, since no partitioning is configured)
batch_request = table_asset.build_batch_request()

# Validate against the suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)
results = validator.validate()

For Snowflake, BigQuery, and other warehouses, swap the connection_string for the appropriate SQLAlchemy URL or GX-specific connector.

Profiling: Bootstrap an Expectation Suite from Data

Instead of writing every Expectation manually, use the profiler to generate an initial suite from a sample of known-good data:

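# Legacy profiler import: the module path and available profiler classes
# differ between GX releases, so verify this against your installed version
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler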

# Get a validator with your data
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="auto_profiled_suite"
)

# Run the profiler
profiler = BasicDatasetProfiler()
suite, validation_result = profiler.profile(validator)

context.save_expectation_suite(suite)

The profiler examines column types, null rates, value distributions, and uniqueness to generate a baseline suite. Review the generated suite and remove or adjust Expectations that are too strict or don't reflect real business rules.
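Pruning a profiled suite can be scripted. The sketch below assumes the suite is available as a dict in the JSON layout GX uses for stored suites (an "expectations" list whose entries have "expectation_type" and "kwargs"). The helper drop_expectation_types and the sample suite are hypothetical.

```python
def drop_expectation_types(suite, unwanted_types):
    """Return a copy of the suite dict without the listed expectation types."""
    kept = [
        expectation
        for expectation in suite.get("expectations", [])
        if expectation.get("expectation_type") not in unwanted_types
    ]
    return {**suite, "expectations": kept}


# Hand-written sample standing in for a profiler-generated suite.
profiled_suite = {
    "expectation_suite_name": "auto_profiled_suite",
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "order_id"}},
        {"expectation_type": "expect_column_mean_to_be_between",
         "kwargs": {"column": "amount", "min_value": 78.7, "max_value": 78.8}},
    ],
}

# Drop the overly tight statistical check and keep the structural one.
pruned = drop_expectation_types(profiled_suite, {"expectation_column_mean_placeholder",
                                                 "expect_column_mean_to_be_between"})
print([e["expectation_type"] for e in pruned["expectations"]])
```

The original dict is left untouched, so the pruned version can be diffed against it before it is saved.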

Integrating with Airflow

Add a GX validation step to your Airflow DAG:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import great_expectations as gx

def validate_orders(**kwargs):
    context = gx.get_context()
    checkpoint = context.get_checkpoint("orders_checkpoint")
    results = checkpoint.run()
    
    if not results.success:
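        # A Checkpoint result wraps one validation result per validated batch
        failed = [
            r["expectation_config"]["expectation_type"]
            for validation in results.list_validation_results()
            for r in validation.results
            if not r["success"]
        ]
        raise ValueError(f"Data quality failed: {failed}")

def run_transformations(**kwargs):
    # Placeholder for the pipeline's real transformation step
    pass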

with DAG("orders_pipeline", start_date=datetime(2025, 1, 1), schedule_interval="@daily") as dag:
    validate = PythonOperator(
        task_id="validate_raw_orders",
        python_callable=validate_orders,
    )
    
    transform = PythonOperator(
        task_id="transform_orders",
        python_callable=run_transformations,
    )
    
    validate >> transform  # Validation gates transformation

Running in CI

# .github/workflows/data-quality.yml
name: Data Quality Check
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC
  push:
    paths:
      - 'expectations/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: pip install great-expectations pandas
      
      - name: Run data quality checks
        run: python scripts/run_checkpoints.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
      
      - name: Upload Data Docs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: data-docs
          path: great_expectations/uncommitted/data_docs/
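The workflow calls scripts/run_checkpoints.py, which the article does not show. A minimal sketch of what such a script might look like, assuming the Checkpoints are already saved in the project and using only the GX calls that appear elsewhere in this tutorial. The names CHECKPOINT_NAMES, exit_code, and run_checkpoints are hypothetical.

```python
CHECKPOINT_NAMES = ["orders_checkpoint"]  # assumed: already saved in the GX project


def exit_code(outcomes):
    """Map {checkpoint_name: passed} to a process exit code (0 means all passed).

    An empty mapping returns 1, so a run that validated nothing fails the job.
    """
    return 0 if outcomes and all(outcomes.values()) else 1


def run_checkpoints(names):
    """Run each named Checkpoint and record whether it passed."""
    # Imported here so the helper above can be tested without GX installed.
    import great_expectations as gx

    context = gx.get_context()
    outcomes = {}
    for name in names:
        result = context.get_checkpoint(name).run()
        outcomes[name] = bool(result.success)
    return outcomes


def main():
    outcomes = run_checkpoints(CHECKPOINT_NAMES)
    for name, passed in outcomes.items():
        print(f"{name}: {'passed' if passed else 'FAILED'}")
    return exit_code(outcomes)


# In the actual script, finish with: raise SystemExit(main())
```

A non-zero exit code fails the "Run data quality checks" step, while the "Upload Data Docs" step still runs because of its if: always() condition.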

Data Docs

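Every Checkpoint run produces HTML Data Docs that show a human-readable report. Build them locally:

great_expectations docs build

The build command offers to open the generated site in your browser. From Python, context.build_data_docs() followed by context.open_data_docs() does the same.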

Data Docs show:

  • Which Expectations passed and failed
  • Samples of the unexpected values behind each failure
  • A history of validation runs and their pass/fail status
  • Links to the Expectation definitions

Share Data Docs with data consumers to communicate data quality SLAs and current status.

Practical Tips

Start with the profiler on a good batch. Running the profiler against a week of clean production data gives you a realistic baseline suite in minutes, not hours.

Use mostly for tolerant checks. Not every column needs 100% compliance. mostly=0.95 means at least 95% of values must pass, allowing for acceptable null rates or edge cases.
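The threshold logic behind mostly can be sketched in a few lines of plain Python. This is an illustration of the idea only, not GX's implementation: it ignores details such as how GX treats nulls for each Expectation type, and meets_mostly is a hypothetical helper.

```python
def meets_mostly(values, predicate, mostly=1.0):
    """True when at least `mostly` (a fraction) of values satisfy predicate."""
    if not values:
        return True  # in this sketch, an empty column has nothing to violate
    passing = sum(1 for value in values if predicate(value))
    return passing / len(values) >= mostly


def is_present(value):
    return value is not None


# Ten phone numbers, one of them missing: 90% non-null.
phones = ["555-0100", None, "555-0101", "555-0102", "555-0103",
          "555-0104", "555-0105", "555-0106", "555-0107", "555-0108"]

print(meets_mostly(phones, is_present, mostly=0.90))  # passes at a 90% threshold
print(meets_mostly(phones, is_present, mostly=0.95))  # fails at a 95% threshold
```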

Separate critical from warning expectations. Pass meta={"severity": "warning"} to Expectations that shouldn't block the pipeline. GX stores this tag with the Expectation but does not act on it, so your Checkpoint runner has to filter failures by severity itself.
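One way a runner might apply the severity tag, assuming a validation result has been converted to a plain dict and that each entry's "expectation_config" carries the "meta" dict the Expectation was defined with. The helper blocking_failures, the "critical" default for untagged Expectations, and the sample data are all assumptions made for illustration.

```python
def blocking_failures(result_dict):
    """Return failed expectation types that are not tagged severity=warning."""
    blocking = []
    for entry in result_dict.get("results", []):
        if entry.get("success"):
            continue
        config = entry.get("expectation_config", {})
        # Untagged Expectations are treated as critical in this sketch.
        severity = (config.get("meta") or {}).get("severity", "critical")
        if severity != "warning":
            blocking.append(config.get("expectation_type"))
    return blocking


# Hand-written sample in the assumed shape (not real GX output).
tagged_sample = {
    "results": [
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_not_be_null",
             "kwargs": {"column": "phone_number"},
             "meta": {"severity": "warning"}}},
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_be_unique",
             "kwargs": {"column": "order_id"},
             "meta": {}}},
    ]
}

failures = blocking_failures(tagged_sample)
print(failures)
if failures:
    print("Pipeline should halt")
```

Warning-level failures still appear in the validation result, so they can be logged or reported without stopping the pipeline.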

Version control your expectation suites. Store great_expectations/expectations/ in git. Suite changes are code changes—review them in PRs.

Test GX itself in CI with a sample CSV. Include a small test fixture in your repo so CI validates that your expectation suites work before running against production data.

Summary

Great Expectations provides a structured way to define, version, and enforce data quality assertions throughout your pipeline. Define Expectations on raw data to catch ingestion problems, define them on transformed data to catch transformation bugs, and run Checkpoints in CI to catch regressions before they reach production.

The investment in defining Expectation Suites pays off quickly: instead of finding data quality issues in dashboards or stakeholder meetings, you find them at the pipeline stage where they're easiest to fix.

By HelpMeTest