Atheris: Python Fuzzing with libFuzzer for Automated Vulnerability Discovery
Python's dynamic type system and exception handling make it easier to write code that handles invalid input gracefully — in theory. In practice, Python libraries that parse formats, process untrusted data, or call into native extensions can crash, hang, or exhibit undefined behavior just like C code. Atheris brings libFuzzer's coverage-guided fuzzing to Python, making it possible to find these bugs automatically. Released by Google in 2020, it has since found bugs in dozens of Python packages and is used as part of OSS-Fuzz for Python projects.
What Atheris Provides
Atheris is a Python fuzzing library that:
- Integrates with libFuzzer's coverage-guided mutation engine
- Instruments Python bytecode to track code coverage (enabling coverage-guided fuzzing in pure Python)
- Can fuzz Python code that calls into native C/C++ extensions
- Includes a structured data provider (FuzzedDataProvider) for generating typed data from raw bytes
- Supports both libFuzzer standalone mode and integration with OSS-Fuzz
The key insight that makes Atheris work: libFuzzer can guide fuzzing based on either native code coverage or Python bytecode coverage, depending on which is more relevant to the target. For pure Python, Python bytecode tracing is used. For native extensions, the native coverage instrumentation applies.
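To make coverage guidance concrete, here is a toy, pure-Python sketch of the feedback loop. It is not how Atheris is implemented: Atheris instruments bytecode and reports coverage to libFuzzer, whereas this sketch uses sys.settrace and a naive mutator. The names target, run_with_coverage, mutate, and fuzz are hypothetical and exist only for this illustration.

```python
import random
import sys


def target(data: bytes) -> None:
    """Hypothetical target: each nested branch needs one more correct byte."""
    if len(data) > 0 and data[0] == ord("F"):
        if len(data) > 1 and data[1] == ord("U"):
            if len(data) > 2 and data[2] == ord("Z"):
                raise RuntimeError("bug reached")


def run_with_coverage(func, data):
    """Run func(data); return (line numbers executed inside func, exception or None)."""
    covered = set()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames outside the target
        if event == "line":
            covered.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    error = None
    try:
        func(data)
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return covered, error


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one random edit: insert, replace, or delete a single byte."""
    buf = bytearray(data)
    choice = rng.randrange(3)
    if choice == 0 or not buf:
        buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
    elif choice == 1:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        del buf[rng.randrange(len(buf))]
    return bytes(buf)


def fuzz(func, iterations=200_000, seed=0):
    """Keep any input that reaches a new line and mutate from the kept inputs."""
    rng = random.Random(seed)
    corpus = [b""]
    seen = set()
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        covered, error = run_with_coverage(func, candidate)
        if error is not None:
            return candidate, error  # crashing input found
        if not covered <= seen:
            seen |= covered
            corpus.append(candidate)  # new coverage: keep as a seed
    return None, None


if __name__ == "__main__":
    crash_input, error = fuzz(target)
    print(crash_input, error)
```

Because every input that reaches a deeper branch is kept as a seed, the loop only has to guess one byte at a time. A blind fuzzer would have to guess all three bytes at once, which is the gap that coverage feedback closes.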
Installation
Atheris requires Python 3.6+ and a libFuzzer-enabled clang. On Linux:
# Install dependencies
sudo apt-get install -y clang python3-dev
# Install Atheris
pip install atheris
# Verify installation
python3 -c "import atheris; print(atheris.__version__)"
On macOS:
brew install llvm
pip install atheris
For fuzzing native extensions, ensure they are compiled with -fsanitize=address,fuzzer-no-link:
CC=clang CXX=clang++ \
CFLAGS="-fsanitize=address,fuzzer-no-link" \
CXXFLAGS="-fsanitize=address,fuzzer-no-link" \
pip install mypackage
Your First Fuzz Target
The basic structure of an Atheris fuzz target:
import sys
import atheris
# Import the target inside instrument_imports() so that Atheris
# records Python-level coverage for it
with atheris.instrument_imports():
    import json
def fuzz_json_parser(data):
    fdp = atheris.FuzzedDataProvider(data)
    # Generate a string from the fuzz data
    test_string = fdp.ConsumeUnicodeNoSurrogates(100)
    try:
        json.loads(test_string)
    except (json.JSONDecodeError, UnicodeDecodeError):
        pass  # Expected: invalid JSON is fine
    except Exception as e:
        # Unexpected exception = bug
        print(f"Unexpected exception: {e}")
        raise
atheris.Setup(sys.argv, fuzz_json_parser)
atheris.Fuzz()
Run it:
# Run for 10 seconds
python3 fuzz_json.py -max_total_time=10
# With corpus directory
python3 fuzz_json.py corpus/ -max_total_time=10
# With multiple workers
python3 fuzz_json.py corpus/ -jobs=4 -workers=4
FuzzedDataProvider: Structured Fuzzing
The FuzzedDataProvider is the key to fuzzing code that expects structured input rather than raw bytes. It provides methods for extracting typed values from the raw fuzz bytes:
import sys
import atheris
from my_service import UserService
def fuzz_user_service(data):
    fdp = atheris.FuzzedDataProvider(data)
    # Extract structured values
    user_id = fdp.ConsumeInt(4)  # 4-byte signed integer
    username = fdp.ConsumeString(50)  # String of up to 50 characters
    email = fdp.ConsumeUnicodeNoSurrogates(100)  # Unicode string
    age = fdp.ConsumeIntInRange(0, 150)  # Integer in range
    is_admin = fdp.ConsumeBool()  # Boolean
    score = fdp.ConsumeFloat()  # Float
    # List of items, built from whatever bytes are left
    items = []
    while fdp.remaining_bytes() > 0:
        item_id = fdp.ConsumeIntInRange(1, 10000)
        items.append(item_id)
    svc = UserService()
    try:
        user = svc.create_user(
            user_id=user_id,
            username=username,
            email=email,
            age=age,
            is_admin=is_admin
        )
        svc.add_items(user, items)
    except ValueError:
        pass  # Expected validation errors
    except Exception:
        # Unexpected errors indicate bugs
        raise
atheris.Setup(sys.argv, fuzz_user_service)
atheris.Fuzz()
Available FuzzedDataProvider methods:
fdp.ConsumeBytes(count) # bytes
fdp.ConsumeByteArray(count) # bytearray
fdp.ConsumeString(count) # str (same as ConsumeUnicode on Python 3)
fdp.ConsumeUnicode(count) # str (valid unicode, may have surrogates)
fdp.ConsumeUnicodeNoSurrogates(count) # str (valid unicode, no surrogates)
fdp.ConsumeBool() # bool
fdp.ConsumeInt(num_bytes) # int (from N bytes)
fdp.ConsumeIntInRange(min, max) # int in [min, max]
fdp.ConsumeUInt(num_bytes) # unsigned int
fdp.ConsumeFloat() # float (may be inf or nan)
fdp.ConsumeRegularFloat() # float (not inf or nan)
fdp.PickValueInList(lst) # element of lst chosen by the fuzz data
fdp.remaining_bytes() # bytes not yet consumed
Fuzzing a Real Python Library
Here is a complete example fuzzing a URL parser:
import atheris
import sys
import urllib.parse
def fuzz_url_parser(data):
    fdp = atheris.FuzzedDataProvider(data)
    url = fdp.ConsumeUnicodeNoSurrogates(200)
    try:
        result = urllib.parse.urlparse(url)
        # Test round-trip consistency
        reconstructed = urllib.parse.urlunparse(result)
        result2 = urllib.parse.urlparse(reconstructed)
        # Invariant: parsing the reconstructed URL should give
        # the same result as parsing the original
        # (with some tolerance for normalization)
        assert result.scheme == result2.scheme, \
            f"Scheme mismatch: {result.scheme!r} != {result2.scheme!r}"
    except AssertionError:
        raise
    except Exception:
        pass
atheris.Setup(sys.argv, fuzz_url_parser)
atheris.Fuzz()
The invariant check (round-trip consistency) is a powerful fuzzing technique: it does not require knowing the correct output, only that certain properties should hold. Other useful invariants:
- Parsing and serializing should round-trip
- Decoding then encoding should give back the original
- Two implementations of the same spec should agree
- Input within documented bounds should never crash
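A few of these invariants can be sketched with the standard library alone. The function names below are hypothetical, and base64, hex encoding, and json are stand-ins chosen for illustration, not targets the article itself fuzzes. The first check encodes and then decodes, which is the direction that must hold for every input.

```python
import base64
import binascii
import json


def check_encode_decode_roundtrip(data: bytes) -> None:
    """Encoding and then decoding must return the original bytes."""
    encoded = base64.b64encode(data)
    decoded = base64.b64decode(encoded, validate=True)
    assert decoded == data, f"round-trip mismatch for {data!r}"


def check_implementations_agree(data: bytes) -> None:
    """Two implementations of the same hex encoding must agree."""
    assert binascii.hexlify(data).decode("ascii") == data.hex()


def check_parse_serialize_roundtrip(text: str) -> None:
    """If text parses as JSON, serializing the result must be stable."""
    try:
        value = json.loads(text)
    except (ValueError, RecursionError):
        return  # rejected input is not an invariant violation
    first = json.dumps(value)
    second = json.dumps(json.loads(first))
    # Compare serialized text rather than values: NaN != NaN in Python,
    # so a value comparison would report a false positive.
    assert first == second, f"unstable serialization for {text!r}"


def run_checks(data: bytes) -> None:
    """Run every invariant on one input; same shape as a fuzz target."""
    check_encode_decode_roundtrip(data)
    check_implementations_agree(data)
    check_parse_serialize_roundtrip(data.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    run_checks(b'{"a": [1, 2.5, "x", null]}')
    print("invariants held")
```

Since run_checks takes a single bytes argument and signals failure by raising, it could be handed to atheris.Setup in place of a hand-written harness.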
Fuzzing Native Extensions
Python libraries with C extensions are particularly valuable to fuzz, because their bugs can cause memory corruption rather than mere Python exceptions.
import atheris
import sys
import numpy as np
# Sanitizers (ASan/UBSan) are not enabled from Python: the extension must be
# compiled with sanitizer support and the sanitizer runtime loaded at launch.
# enabled_hooks toggles optional Atheris hooks; the documented names include
# "RegEx" and "str", so confirm that your Atheris version accepts "trace_cmp".
atheris.enabled_hooks.add("trace_cmp")
def fuzz_numpy_operations(data):
    fdp = atheris.FuzzedDataProvider(data)
    # Generate a random array size and data
    size = fdp.ConsumeIntInRange(0, 1000)
    dtype_name = fdp.PickValueInList(['int8', 'int16', 'int32', 'float32', 'float64'])
    try:
        arr_bytes = fdp.ConsumeBytes(size * np.dtype(dtype_name).itemsize)
        arr = np.frombuffer(arr_bytes, dtype=dtype_name)
        # Exercise various numpy operations
        if len(arr) > 0:
            np.sort(arr)
            np.unique(arr)
            arr.reshape(-1, 1)
            np.cumsum(arr)
    except (ValueError, OverflowError):
        pass  # Expected with weird inputs
atheris.Setup(sys.argv, fuzz_numpy_operations)
atheris.Fuzz()
For extensions that you want to fuzz with AddressSanitizer:
# Set up ASan environment
export LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so)
export ASAN_OPTIONS="detect_leaks=0:halt_on_error=0"
python3 fuzz_native.py -max_total_time=60
Finding Type Errors and Logic Bugs
Atheris is particularly good at finding Python-specific bugs that are not crashes:
Type confusion bugs — where the code assumes input has a specific type:
import sys
import atheris
from my_module import process_value
def fuzz_type_confusion(data):
    fdp = atheris.FuzzedDataProvider(data)
    # Generate various Python types
    type_choice = fdp.ConsumeIntInRange(0, 4)
    if type_choice == 0:
        value = fdp.ConsumeInt(4)
    elif type_choice == 1:
        value = fdp.ConsumeFloat()
    elif type_choice == 2:
        value = fdp.ConsumeUnicodeNoSurrogates(50)
    elif type_choice == 3:
        value = None
    else:
        value = fdp.ConsumeBytes(20)
    try:
        process_value(value)
    except TypeError:
        pass  # Expected for wrong types
    except Exception as e:
        if "expected" not in str(e).lower():
            raise  # Unexpected exception = bug
atheris.Setup(sys.argv, fuzz_type_confusion)
atheris.Fuzz()
ReDoS (Regular Expression Denial of Service):
import re
import signal
import sys
import atheris
def fuzz_regex_dos(data):
    fdp = atheris.FuzzedDataProvider(data)
    test_input = fdp.ConsumeUnicodeNoSurrogates(500)
    # Test pattern that might be vulnerable to ReDoS
    pattern = r'(a+)+b'  # Classic ReDoS pattern
    def timeout_handler(signum, frame):
        raise TimeoutError("Regex took too long: possible ReDoS")
    # SIGALRM is only available on POSIX systems
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(1)  # 1 second timeout
    try:
        re.match(pattern, test_input)
    except TimeoutError:
        raise  # This is the bug we want to find
    except Exception:
        pass
    finally:
        signal.alarm(0)  # Always cancel the pending alarm
atheris.Setup(sys.argv, fuzz_regex_dos)
atheris.Fuzz()
CI Integration with GitHub Actions
name: Fuzz Tests
on:
  schedule:
    - cron: '0 3 * * *'
  push:
    paths:
      - 'src/**'
  workflow_dispatch:
    inputs:
      fuzz_duration:
        description: 'Fuzzing duration in seconds'
        default: '300'
jobs:
  fuzz:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install Atheris and dependencies
        run: |
          sudo apt-get install -y clang
          pip install atheris
          pip install -e .
      - name: Restore corpus cache
        uses: actions/cache@v3
        with:
          path: fuzz-corpus
          key: fuzz-corpus-${{ hashFiles('src/**') }}
          restore-keys: fuzz-corpus-
      - name: Run fuzz tests
        run: |
          mkdir -p fuzz-corpus/json_parser
          python3 tests/fuzz/fuzz_json_parser.py \
            fuzz-corpus/json_parser/ \
            -max_total_time=${{ inputs.fuzz_duration || 300 }} \
            -jobs=2
      - name: Save updated corpus
        uses: actions/cache/save@v3
        if: always()
        with:
          path: fuzz-corpus
          key: fuzz-corpus-${{ github.sha }}
      - name: Upload crashes
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: fuzz-crashes-${{ github.run_id }}
          path: |
            crash-*
            timeout-*
            leak-*
Interpreting Crashes
When Atheris finds a crash, the output looks like this:
INFO: Using built-in libfuzzer
INFO: Seed: 1234567890
INFO: Loaded 1 modules (50 guards): 50 [0x...]
INFO: Running: crash-abc123
python3: fuzz_json.py
...
==12345==ERROR: Python encountered a fatal error...
Traceback (most recent call last):
File "fuzz_json.py", line 15, in fuzz_json_parser
json_library.parse(data_string)
File "/path/to/json_library/parser.py", line 89, in parse
return _fast_parse(data)
RecursionError: maximum recursion depth exceeded
SUMMARY: python3 Crash
artifact_prefix='./'; Test unit written to ./crash-abc123
The crash input is written to ./crash-abc123. Reproduce it:
python3 fuzz_json.py crash-abc123
This particular crash (a RecursionError in a JSON parser) indicates that the parser does not guard against deeply nested input. A quick fix:
def parse(data, max_depth=100):
    if max_depth <= 0:
        raise ParseError("Input too deeply nested")
    # ...
Atheris makes Python fuzzing as accessible as writing a unit test. The investment in writing fuzz targets pays off in bugs found automatically, running overnight in CI, with no human effort required after setup.