Testing BigQuery Queries and Data Pipelines with bq-test-kit and dbt
BigQuery pipelines are notoriously hard to test because every query costs money and requires live infrastructure. This guide covers three layers of testing: bq-test-kit for Python unit tests against a local emulator, dbt's built-in test framework for data quality assertions, and direct SQL testing patterns for UDFs and partitioned tables — all runnable in CI without a real GCP project.
Key Takeaways
Use the BigQuery emulator for unit tests. Running a local emulator via Docker eliminates query costs and network latency in CI, making it practical to run hundreds of SQL unit tests on every commit.
bq-test-kit lets you test SQL in isolation. It handles dataset/table creation, data injection, and cleanup — you write a query, assert on results, done.
dbt singular tests catch business logic regressions. A singular test is just a SQL file that returns rows on failure — powerful for encoding invariants like "revenue must never be negative."
Test UDFs with known inputs and edge cases. BigQuery UDFs are pure functions; test them with null inputs, boundary values, and Unicode edge cases before deploying.
Data freshness tests prevent silent staleness. A freshness assertion on your most critical tables will catch pipeline failures before your stakeholders do.
Why Testing BigQuery Is Hard
BigQuery is a managed service with no local equivalent built in. Every query hits the network, costs money (or burns free tier quota), and requires IAM permissions. The usual unit-test feedback loop — run fast, iterate fast — breaks down the moment you introduce a cloud warehouse.
The practical answer is a combination of:
- A local emulator that speaks the BigQuery API
- A Python test harness that injects data and asserts on query results
- dbt tests for data quality at the pipeline level
This guide walks through all three.
Setting Up the BigQuery Emulator
The open-source BigQuery emulator from goccy/bigquery-emulator runs as a Docker container and implements most of the BigQuery REST and Storage APIs.
# docker-compose.yml
version: "3.8"
services:
bigquery-emulator:
image: ghcr.io/goccy/bigquery-emulator:0.6.3
ports:
- "9050:9050"
- "9060:9060"
command: ["--project=test-project", "--log-level=error"]Start it with docker compose up -d bigquery-emulator. The emulator exposes a standard BigQuery HTTP API on port 9050 and a gRPC endpoint on 9060.
To point the Python client at the emulator, set the API endpoint override:
from google.cloud import bigquery
from google.api_core.client_options import ClientOptions
client = bigquery.Client(
project="test-project",
client_options=ClientOptions(api_endpoint="http://localhost:9050"),
credentials=None, # emulator ignores credentials
)Unit Testing with bq-test-kit
bq-test-kit is a pytest-friendly library that wraps the BigQuery client and provides a clean DSL for creating datasets, loading fixture data, running queries, and asserting on results.
pip install bq-test-kit[pandas] pytestBasic Query Test
# tests/test_revenue_query.py
import pytest
from bq_test_kit.bq_test_kit import BQTestKit
from bq_test_kit.bq_test_kit_config import BQTestKitConfig
from bq_test_kit.resource_loaders.package_file_loader import PackageFileLoader
from google.cloud import bigquery
from google.api_core.client_options import ClientOptions
@pytest.fixture(scope="session")
def bq_client():
return bigquery.Client(
project="test-project",
client_options=ClientOptions(api_endpoint="http://localhost:9050"),
credentials=None,
)
@pytest.fixture(scope="session")
def bqtk(bq_client):
return BQTestKit(bq_client=bq_client, bqtk_config=BQTestKitConfig())
def test_daily_revenue_aggregation(bqtk):
with bqtk.dataset("analytics").clean_and_keep() as ds:
with ds.table("orders", schema=[
bigquery.SchemaField("order_id", "STRING"),
bigquery.SchemaField("order_date", "DATE"),
bigquery.SchemaField("amount_usd", "FLOAT64"),
bigquery.SchemaField("status", "STRING"),
]).insert([
{"order_id": "o1", "order_date": "2026-01-15", "amount_usd": 100.0, "status": "completed"},
{"order_id": "o2", "order_date": "2026-01-15", "amount_usd": 50.0, "status": "completed"},
{"order_id": "o3", "order_date": "2026-01-15", "amount_usd": 200.0, "status": "refunded"},
{"order_id": "o4", "order_date": "2026-01-16", "amount_usd": 75.0, "status": "completed"},
]) as orders:
result = bqtk.query("""
SELECT
order_date,
SUM(amount_usd) AS total_revenue,
COUNT(*) AS order_count
FROM `test-project.analytics.orders`
WHERE status = 'completed'
GROUP BY order_date
ORDER BY order_date
""").to_dataframe()
assert len(result) == 2
jan15 = result[result["order_date"].astype(str) == "2026-01-15"].iloc[0]
assert jan15["total_revenue"] == 150.0
assert jan15["order_count"] == 2The .clean_and_keep() context manager creates the dataset at entry and deletes it at exit, keeping your emulator clean between test runs.
Testing BigQuery UDFs
User-defined functions are pure transformations — they're ideal for unit testing because you control inputs and can enumerate edge cases.
-- sql/udf_normalize_phone.sql
CREATE OR REPLACE FUNCTION `test-project.utils.normalize_phone`(raw STRING)
RETURNS STRING
LANGUAGE js AS """
if (!raw) return null;
const digits = raw.replace(/\\D/g, '');
if (digits.length === 10) return '+1' + digits;
if (digits.length === 11 && digits[0] === '1') return '+' + digits;
return null;
""";# tests/test_udf_normalize_phone.py
import pytest
@pytest.mark.parametrize("raw, expected", [
("555-867-5309", "+15558675309"),
("(555) 867-5309", "+15558675309"),
("15558675309", "+15558675309"),
(None, None),
("", None),
("not-a-phone", None),
("1234", None),
])
def test_normalize_phone(bqtk, raw, expected):
# Create UDF first
bqtk.query(open("sql/udf_normalize_phone.sql").read()).run()
raw_literal = f"'{raw}'" if raw is not None else "NULL"
result = bqtk.query(f"""
SELECT `test-project.utils.normalize_phone`({raw_literal}) AS normalized
""").to_dataframe()
assert result.iloc[0]["normalized"] == expectedTesting Partitioned Table Queries
Partition pruning is a common source of bugs — a missing WHERE event_date BETWEEN ... clause can silently full-scan a petabyte table. Test that your queries actually filter correctly.
def test_partition_filter_applied(bqtk):
"""Ensure the query only reads from the target partition."""
schema = [
bigquery.SchemaField("event_id", "STRING"),
bigquery.SchemaField("event_date", "DATE"),
bigquery.SchemaField("user_id", "STRING"),
bigquery.SchemaField("event_type", "STRING"),
]
with bqtk.dataset("events").clean_and_keep() as ds:
with ds.table("user_events", schema=schema,
partition_field="event_date").insert([
{"event_id": "e1", "event_date": "2026-01-01", "user_id": "u1", "event_type": "click"},
{"event_id": "e2", "event_date": "2026-01-02", "user_id": "u2", "event_type": "purchase"},
{"event_id": "e3", "event_date": "2026-01-03", "user_id": "u3", "event_type": "click"},
]) as t:
result = bqtk.query("""
SELECT event_id, user_id
FROM `test-project.events.user_events`
WHERE event_date = '2026-01-02'
""").to_dataframe()
assert len(result) == 1
assert result.iloc[0]["event_id"] == "e2"dbt Tests for Data Quality
dbt has two test categories: generic tests (built-in constraints) and singular tests (custom SQL that returns rows on failure).
Generic Tests in schema.yml
# models/schema.yml
version: 2
models:
- name: fct_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: amount_usd
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: status
tests:
- accepted_values:
values: ["pending", "completed", "refunded", "cancelled"]Run with dbt test --select fct_orders. Each test compiles to a SQL query that returns failing rows — zero rows means the test passes.
Singular Tests for Business Logic
-- tests/assert_revenue_never_negative.sql
-- Returns rows where daily revenue is negative — should always be 0 rows.
SELECT
order_date,
SUM(amount_usd) AS total_revenue
FROM {{ ref('fct_orders') }}
WHERE status = 'completed'
GROUP BY order_date
HAVING SUM(amount_usd) < 0-- tests/assert_no_duplicate_events_per_user_per_day.sql
SELECT
user_id,
event_date,
event_type,
COUNT(*) AS cnt
FROM {{ ref('fct_user_events') }}
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1Testing Data Freshness
Add freshness assertions to your sources.yml:
# models/sources.yml
version: 2
sources:
- name: raw
database: my_project
schema: raw_data
freshness:
warn_after:
count: 6
period: hour
error_after:
count: 24
period: hour
loaded_at_field: _loaded_at
tables:
- name: orders
- name: events
freshness:
error_after:
count: 2
period: hourRun dbt source freshness in CI to catch stale tables before they affect downstream models.
CI Integration
# .github/workflows/bigquery-tests.yml
name: BigQuery Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
services:
bigquery-emulator:
image: ghcr.io/goccy/bigquery-emulator:0.6.3
ports:
- 9050:9050
options: >-
--health-cmd "curl -sf http://localhost:9050/bigquery/v2/projects"
--health-interval 5s
--health-retries 10
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -r requirements-test.txt
- run: pytest tests/ -v --tb=short
env:
BIGQUERY_EMULATOR_HOST: localhost:9050
GOOGLE_CLOUD_PROJECT: test-projectFor dbt tests in CI, point your profiles.yml at the emulator using the bigquery adapter's http connection type, or use a dedicated GCP test project with a service account scoped to a test dataset.
What to Test vs. What to Skip
Test these:
- SQL transformation logic — aggregations, joins, window functions with known fixture data
- UDFs — all inputs including nulls, empty strings, boundary values
- Partition filter correctness — verify queries respect date ranges
- dbt model relationships — referential integrity between fact and dimension tables
- Data freshness thresholds on critical source tables
- Schema evolution — column renames or type changes that break downstream models
Skip these:
- BigQuery infrastructure itself — Google tests that
- Query performance and slot consumption — not a unit test concern
- IAM and permission boundaries — test in staging with real credentials, not in unit tests
- Storage billing estimates — irrelevant to correctness testing
- Auto-detected schema inference on JSON exports — flaky by nature, too much variance