Atheris: Python Fuzzing with libFuzzer for Automated Vulnerability Discovery

Python's dynamic type system and exception handling make it easier to write code that handles invalid input gracefully — in theory. In practice, Python libraries that parse formats, process untrusted data, or call into native extensions can crash, hang, or exhibit undefined behavior just like C code. Atheris brings libFuzzer's coverage-guided fuzzing to Python, making it possible to find these bugs automatically. Released by Google in 2020, it has since found bugs in dozens of Python packages and is used as part of OSS-Fuzz for Python projects.

What Atheris Provides

Atheris is a Python fuzzing library that:

  • Integrates with libFuzzer's coverage-guided mutation engine
  • Instruments Python bytecode to track code coverage (enabling coverage-guided fuzzing in pure Python)
  • Can fuzz Python code that calls into native C/C++ extensions
  • Includes a structured data provider (FuzzedDataProvider) for generating typed data from raw bytes
  • Supports both libFuzzer standalone mode and integration with OSS-Fuzz

The key insight that makes Atheris work: libFuzzer can guide fuzzing based on either native code coverage or Python bytecode coverage, depending on which is more relevant to the target. For pure Python, Python bytecode tracing is used. For native extensions, the native coverage instrumentation applies.
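
To make the idea of coverage guidance concrete, here is a toy sketch in pure Python. It is not how Atheris or libFuzzer is implemented (they rely on instrumentation and far better mutators); it uses `sys.settrace` to record executed lines and keeps any input that reaches a line not seen before. The `target` function and every helper name are made up for illustration.

```python
import random
import sys


def target(data: bytes) -> None:
    """Hypothetical target: fails only for inputs that start with b"FUZ"."""
    if len(data) > 0 and data[0] == ord("F"):
        if len(data) > 1 and data[1] == ord("U"):
            if len(data) > 2 and data[2] == ord("Z"):
                raise RuntimeError("bug reached")


def run_with_coverage(func, data):
    """Run func(data); return (line numbers executed in func, exception or None)."""
    lines = set()

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            lines.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(data)
        error = None
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(None)
    return lines, error


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one random insert, overwrite, or delete."""
    buf = bytearray(data)
    choice = rng.randrange(3)
    if choice == 0 or not buf:
        buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
    elif choice == 1:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        del buf[rng.randrange(len(buf))]
    return bytes(buf)


def fuzz(func, iterations: int, seed: int = 0):
    """Return the first input that raises, or None if none is found."""
    rng = random.Random(seed)
    corpus = [b""]
    seen_lines = set()
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        lines, error = run_with_coverage(func, candidate)
        if error is not None:
            return candidate
        if not lines <= seen_lines:
            # New coverage: keep this input as a base for later mutations.
            seen_lines |= lines
            corpus.append(candidate)
    return None
```

Calling `fuzz(target, iterations=200_000)` finds a failing input far sooner than blind random generation would, because each newly reached `if` adds a stepping stone to the corpus.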

Installation

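Atheris requires Python 3.6 or later. The prebuilt wheels on PyPI bundle libFuzzer; a libFuzzer-enabled clang is needed when building Atheris from source or fuzzing native extensions. On Linux: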

# Install dependencies
sudo apt-get install -y clang python3-dev

# Install Atheris
pip install atheris

# Verify installation
python3 -c "import atheris; print('atheris imported OK')"

On macOS:

brew install llvm
pip install atheris

For fuzzing native extensions, ensure they are compiled with -fsanitize=address,fuzzer-no-link:

CC=clang CXX=clang++ \
  CFLAGS="-fsanitize=address,fuzzer-no-link" \
  CXXFLAGS="-fsanitize=address,fuzzer-no-link" \
  pip install mypackage

Your First Fuzz Target

The basic structure of an Atheris fuzz target:

import atheris
import sys

# Import the target under instrumentation so Atheris can collect Python
# coverage for it; uninstrumented code gives the fuzzer no feedback.
with atheris.instrument_imports():
    import json

def fuzz_json_parser(data):
    fdp = atheris.FuzzedDataProvider(data)

    # Generate a string of up to 100 characters from the fuzz data
    test_string = fdp.ConsumeUnicodeNoSurrogates(100)

    try:
        json.loads(test_string)
    except json.JSONDecodeError:
        pass  # Expected: invalid JSON is fine
    # Any other exception propagates, and Atheris reports it as a crash

atheris.Setup(sys.argv, fuzz_json_parser)
atheris.Fuzz()

Run it:

# Run for 10 seconds
python3 fuzz_json.py -max_total_time=10

# With corpus directory
python3 fuzz_json.py corpus/ -max_total_time=10

# With multiple workers
python3 fuzz_json.py corpus/ -jobs=4 -workers=4
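
A corpus directory is most useful when it starts with a few valid inputs. Below is a minimal sketch for seeding one, assuming the `corpus/` directory name from the commands above. The helper name `write_seed_corpus` and the sample inputs are illustrative.

```python
import hashlib
from pathlib import Path


def write_seed_corpus(directory, seeds):
    """Write each seed to its own file, named by content hash to avoid duplicates."""
    corpus = Path(directory)
    corpus.mkdir(parents=True, exist_ok=True)
    written = []
    for seed in seeds:
        path = corpus / hashlib.sha1(seed).hexdigest()
        path.write_bytes(seed)
        written.append(path)
    return written


# Hand-picked example inputs; real seeds would come from tests or sample data.
SEEDS = [b"{}", b"[1, 2, 3]", b'{"key": "value"}', b'"text"', b"null"]
```

Calling `write_seed_corpus("corpus", SEEDS)` before the first run populates the directory. Because the target above passes the bytes through `FuzzedDataProvider`, the strings it sees may not match the seed files byte for byte; the seeds still give the mutator useful fragments to work with.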

FuzzedDataProvider: Structured Fuzzing

The FuzzedDataProvider is the key to fuzzing code that expects structured input rather than raw bytes. It provides methods for extracting typed values from the raw fuzz bytes:

import atheris
import sys

# my_service is a placeholder for the code under test
with atheris.instrument_imports():
    from my_service import UserService

def fuzz_user_service(data):
    fdp = atheris.FuzzedDataProvider(data)

    # Extract structured values
    user_id = fdp.ConsumeInt(4)                  # Signed 4-byte integer
    username = fdp.ConsumeString(50)             # String of up to 50 characters
    email = fdp.ConsumeUnicodeNoSurrogates(100)  # Unicode string, no surrogates
    age = fdp.ConsumeIntInRange(0, 150)          # Integer in [0, 150]
    is_admin = fdp.ConsumeBool()                 # Boolean
    score = fdp.ConsumeFloat()                   # Float (unused below, shown for completeness)

    # Use whatever bytes remain to build a list of item IDs
    items = []
    while fdp.remaining_bytes() > 0:
        item_id = fdp.ConsumeIntInRange(1, 10000)
        items.append(item_id)

    svc = UserService()

    try:
        user = svc.create_user(
            user_id=user_id,
            username=username,
            email=email,
            age=age,
            is_admin=is_admin
        )
        svc.add_items(user, items)
    except ValueError:
        pass  # Expected validation errors
    # Any other exception propagates and is reported as a bug

atheris.Setup(sys.argv, fuzz_user_service)
atheris.Fuzz()

Available FuzzedDataProvider methods:

fdp.ConsumeBytes(count)           # bytes (fewer than count if input runs out)
bytearray(fdp.ConsumeBytes(count))  # bytearray, by wrapping ConsumeBytes
fdp.ConsumeString(count)          # str (alias for ConsumeUnicode on Python 3)
fdp.ConsumeUnicode(count)         # str (may contain surrogate code points)
fdp.ConsumeUnicodeNoSurrogates(count)  # str (no surrogate code points)
fdp.ConsumeBool()                 # bool
fdp.ConsumeInt(num_bytes)         # signed int built from num_bytes bytes
fdp.ConsumeIntInRange(min, max)   # int in [min, max]
fdp.ConsumeUInt(num_bytes)        # unsigned int built from num_bytes bytes
fdp.ConsumeFloat()                # float (may be inf or nan)
fdp.ConsumeRegularFloat()         # float (not inf or nan)
fdp.PickValueInList(lst)          # element of lst chosen by the fuzz data
fdp.remaining_bytes()             # number of bytes not yet consumed
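
To see how typed values can be carved out of a flat byte string, here is a deliberately simplified provider in pure Python. It sketches the general idea only: the class `MiniProvider` and its snake_case methods are invented for this example, and the real `FuzzedDataProvider` lays out its bytes differently.

```python
class MiniProvider:
    """Toy stand-in for a fuzzed data provider; not Atheris's implementation."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def remaining_bytes(self) -> int:
        return len(self._data) - self._pos

    def consume_bytes(self, count: int) -> bytes:
        """Return up to `count` bytes; fewer if the input is running out."""
        chunk = self._data[self._pos:self._pos + count]
        self._pos += len(chunk)
        return chunk

    def consume_int_in_range(self, low: int, high: int) -> int:
        """Map just enough bytes onto the inclusive range [low, high]."""
        if low > high:
            raise ValueError("low must not exceed high")
        span = high - low
        num_bytes = (span.bit_length() + 7) // 8
        raw = int.from_bytes(self.consume_bytes(num_bytes), "big")
        return low + raw % (span + 1)

    def consume_bool(self) -> bool:
        return bool(self.consume_int_in_range(0, 1))

    def pick_value_in_list(self, values):
        return values[self.consume_int_in_range(0, len(values) - 1)]


fdp = MiniProvider(b"\x05\x01\x02rest")
age = fdp.consume_int_in_range(0, 150)                     # one byte, 0x05 -> 5
flag = fdp.consume_bool()                                  # one byte, 0x01 -> True
color = fdp.pick_value_in_list(["red", "green", "blue"])   # one byte, 0x02 -> "blue"
tail = fdp.consume_bytes(10)                               # only 4 bytes left -> b"rest"
```

The important property is that every method returns a valid value even when the input is exhausted, so the fuzz target never has to handle "not enough bytes" itself.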

Fuzzing a Real Python Library

Here is a complete example fuzzing a URL parser:

import atheris
import sys

with atheris.instrument_imports():
    import urllib.parse

def fuzz_url_parser(data):
    fdp = atheris.FuzzedDataProvider(data)
    url = fdp.ConsumeUnicodeNoSurrogates(200)

    try:
        result = urllib.parse.urlparse(url)

        # Test round-trip consistency
        reconstructed = urllib.parse.urlunparse(result)
        result2 = urllib.parse.urlparse(reconstructed)
    except ValueError:
        return  # Expected: urlparse rejects some malformed URLs

    # Invariant: parsing the reconstructed URL should give
    # the same result as parsing the original
    # (with some tolerance for normalization).
    # Other exceptions are not swallowed, so they surface as findings too.
    assert result.scheme == result2.scheme, \
        f"Scheme mismatch: {result.scheme!r} != {result2.scheme!r}"

atheris.Setup(sys.argv, fuzz_url_parser)
atheris.Fuzz()

The invariant check (round-trip consistency) is a powerful fuzzing technique — it does not require knowing the correct output, only that certain properties should hold. Other useful invariants:

  • Parsing and serializing should round-trip
  • Decoding then encoding should give back the original
  • Two implementations of the same spec should agree
  • Input within documented bounds should never crash
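
Two of these invariants (round-tripping and agreement between implementations) can be sketched with the standard library alone. The functions below take raw bytes, so either could be passed to `atheris.Setup` as the fuzz callback. The function names and the choice of `base64` and hex encoding are illustrative.

```python
import base64
import binascii


def check_base64_round_trip(data: bytes) -> None:
    """Encoding then decoding must give back the original bytes."""
    encoded = base64.b64encode(data)
    decoded = base64.b64decode(encoded, validate=True)
    assert decoded == data, f"round trip changed {data!r} into {decoded!r}"


def check_hex_implementations_agree(data: bytes) -> None:
    """Two implementations of the same encoding must produce the same output."""
    via_method = data.hex()
    via_binascii = binascii.hexlify(data).decode("ascii")
    assert via_method == via_binascii, f"{via_method!r} != {via_binascii!r}"
```

Neither check needs a table of expected outputs: a failing assertion is itself the bug report, and the input that triggered it is saved by the fuzzer.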

Fuzzing Native Extensions

Python libraries with C extensions are particularly valuable to fuzz, because their bugs can cause memory corruption rather than mere Python exceptions.

import atheris
import sys

# numpy's core is native code. Sanitizers are not switched on from Python:
# the extension must be compiled with sanitizer support (see Installation)
# and the sanitizer runtime preloaded when the fuzzer starts.
import numpy as np

def fuzz_numpy_operations(data):
    fdp = atheris.FuzzedDataProvider(data)

    # Let the fuzz data choose the array size and element type
    size = fdp.ConsumeIntInRange(0, 1000)
    dtype_name = fdp.PickValueInList(['int8', 'int16', 'int32', 'float32', 'float64'])

    try:
        arr_bytes = fdp.ConsumeBytes(size * np.dtype(dtype_name).itemsize)
        arr = np.frombuffer(arr_bytes, dtype=dtype_name)

        # Exercise various numpy operations
        if len(arr) > 0:
            np.sort(arr)
            np.unique(arr)
            arr.reshape(-1, 1)
            np.cumsum(arr)

    except (ValueError, OverflowError):
        pass  # Expected with weird inputs, e.g. a truncated buffer

atheris.Setup(sys.argv, fuzz_numpy_operations)
atheris.Fuzz()

For extensions that you want to fuzz with AddressSanitizer:

# Set up ASan environment
export LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so)
export ASAN_OPTIONS="detect_leaks=0:halt_on_error=0"

python3 fuzz_native.py -max_total_time=60

Finding Type Errors and Logic Bugs

Atheris is particularly good at finding Python-specific bugs that are not crashes:

Type confusion bugs — where the code assumes input has a specific type:

import atheris
import sys

# my_module is a placeholder for the code under test
with atheris.instrument_imports():
    from my_module import process_value

def fuzz_type_confusion(data):
    fdp = atheris.FuzzedDataProvider(data)

    # Let the fuzz data choose which Python type to pass
    type_choice = fdp.ConsumeIntInRange(0, 4)

    if type_choice == 0:
        value = fdp.ConsumeInt(4)
    elif type_choice == 1:
        value = fdp.ConsumeFloat()
    elif type_choice == 2:
        value = fdp.ConsumeUnicodeNoSurrogates(50)
    elif type_choice == 3:
        value = None
    else:
        value = fdp.ConsumeBytes(20)

    try:
        process_value(value)
    except TypeError:
        pass  # Expected for wrong types
    except Exception as e:
        # Messages containing "expected" are treated as deliberate validation errors
        if "expected" not in str(e).lower():
            raise  # Unexpected exception = bug

atheris.Setup(sys.argv, fuzz_type_confusion)
atheris.Fuzz()

ReDoS (Regular Expression Denial of Service):

import atheris
import re
import signal
import sys

def fuzz_regex_dos(data):
    fdp = atheris.FuzzedDataProvider(data)
    test_input = fdp.ConsumeUnicodeNoSurrogates(500)

    # Test pattern that might be vulnerable to ReDoS
    pattern = r'(a+)+b'  # Classic ReDoS pattern

    def timeout_handler(signum, frame):
        raise TimeoutError("Regex took too long: possible ReDoS")

    # SIGALRM is Unix-only. The fuzzer's own -timeout handling also relies on
    # alarms, so this per-call timer may interfere with it.
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(1)  # 1 second timeout

    try:
        re.match(pattern, test_input)
    finally:
        signal.alarm(0)  # Always cancel the pending alarm
    # A TimeoutError propagates: this is the bug we want to find

atheris.Setup(sys.argv, fuzz_regex_dos)
atheris.Fuzz()

CI Integration with GitHub Actions

name: Fuzz Tests
on:
  schedule:
    - cron: '0 3 * * *'
  push:
    paths:
      - 'src/**'
  workflow_dispatch:
    inputs:
      fuzz_duration:
        description: 'Fuzzing duration in seconds'
        default: '300'

jobs:
  fuzz:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install Atheris and dependencies
        run: |
          sudo apt-get install -y clang
          pip install atheris
          pip install -e .
      
      - name: Restore corpus cache
        uses: actions/cache/restore@v4
        with:
          path: fuzz-corpus
          key: fuzz-corpus-${{ hashFiles('src/**') }}
          restore-keys: fuzz-corpus-
      
      - name: Run fuzz tests
        run: |
          mkdir -p fuzz-corpus/json_parser
          python3 tests/fuzz/fuzz_json_parser.py \
            fuzz-corpus/json_parser/ \
            -max_total_time=${{ inputs.fuzz_duration || 300 }} \
            -jobs=2
      
      - name: Save updated corpus
        uses: actions/cache/save@v4
        if: always()
        with:
          path: fuzz-corpus
          key: fuzz-corpus-${{ github.sha }}
      
      - name: Upload crashes
        uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: fuzz-crashes-${{ github.run_id }}
          path: |
            crash-*
            timeout-*
            leak-*

Interpreting Crashes

When Atheris finds a crash, the output looks like this:

INFO: Using built-in libfuzzer
INFO: Seed: 1234567890
INFO: Loaded 1 modules (50 guards): 50 [0x...]
INFO: Running: crash-abc123
python3: fuzz_json.py
...
==12345==ERROR: Python encountered a fatal error...
Traceback (most recent call last):
  File "fuzz_json.py", line 15, in fuzz_json_parser
    json_library.parse(data_string)
  File "/path/to/json_library/parser.py", line 89, in parse
    return _fast_parse(data)
RecursionError: maximum recursion depth exceeded
SUMMARY: python3 Crash
artifact_prefix='./'; Test unit written to ./crash-abc123

The crash input is written to ./crash-abc123. Reproduce it:

python3 fuzz_json.py crash-abc123
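
Once a crash is fixed, the saved input makes a good regression test. The sketch below assumes artifacts are kept in a directory and that the fuzz callback can be imported as a plain function. The helper name `replay_artifacts` is hypothetical.

```python
from pathlib import Path


def replay_artifacts(target, directory, patterns=("crash-*", "timeout-*")):
    """Feed each saved artifact to target; return (path, exception) pairs for failures."""
    failures = []
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            try:
                target(path.read_bytes())
            except Exception as exc:  # record every failure and keep going
                failures.append((path, exc))
    return failures
```

In a test suite, asserting that `replay_artifacts(fuzz_json_parser, artifact_dir)` returns an empty list would fail for as long as any stored input still raises. This only works if the fuzz script can be imported without starting the fuzzer, for example by placing the `atheris.Setup` and `atheris.Fuzz` calls behind `if __name__ == "__main__":`.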

This particular crash — RecursionError in a JSON parser — indicates the parser does not guard against deeply nested input. A quick fix:

def parse(data, max_depth=100):
    if max_depth <= 0:
        raise ParseError("Input too deeply nested")
    # ...
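
To show the depth guard in a complete function, here is a small sketch for a made-up format of nested square brackets. `ParseError`, `parse_nested`, and the bracket grammar are all assumptions for illustration, not part of any real JSON library.

```python
class ParseError(ValueError):
    """Raised for malformed or excessively nested input."""


def parse_nested(text: str, max_depth: int = 100) -> list:
    """Parse strings like "[[][[]]]" into nested lists, refusing deep nesting."""
    value, pos = _parse_list(text, 0, max_depth)
    if pos != len(text):
        raise ParseError(f"unexpected trailing data at offset {pos}")
    return value


def _parse_list(text, pos, depth_left):
    # The guard runs before any further recursion, so the call stack
    # never grows beyond max_depth frames.
    if depth_left <= 0:
        raise ParseError("Input too deeply nested")
    if pos >= len(text) or text[pos] != "[":
        raise ParseError(f"expected '[' at offset {pos}")
    pos += 1
    items = []
    while pos < len(text) and text[pos] == "[":
        item, pos = _parse_list(text, pos, depth_left - 1)
        items.append(item)
    if pos >= len(text) or text[pos] != "]":
        raise ParseError(f"expected ']' at offset {pos}")
    return items, pos + 1
```

With the guard in place, hostile input such as thousands of opening brackets produces a `ParseError` that callers can handle, rather than a `RecursionError` from deep inside the parser.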

Atheris makes Python fuzzing as accessible as writing a unit test. The investment in writing fuzz targets pays off in bugs found automatically, running overnight in CI, with no human effort required after setup.