Benchmarking and Profiling WebAssembly: Criterion, Wasmer, and Flamegraphs

WebAssembly performance is close to native — but "close" needs to be measured. Whether you're evaluating the overhead of crossing the WASM/JS boundary, comparing execution speed between runtimes, or profiling where your WASM module spends its time, you need specific tools: Criterion for Rust-level benchmarks, Wasmer for runtime comparisons, flamegraphs for visual profiling, and Chrome DevTools for browser-side WASM profiling.

Key Takeaways

Benchmark WASM vs native in the same Criterion suite. Compile the same Rust code for both x86_64 and wasm32-wasip1 (the WASI target the benchmark binary can actually run on), run Criterion benchmarks, and compare the results directly.

Wasmer supports pluggable backends. Benchmark the same .wasm file under Cranelift, LLVM, and Singlepass backends to find the fastest option for your workload.

Criterion's throughput feature gives you meaningful units. Set group.throughput(Throughput::Elements(n)) to measure operations/second instead of raw nanoseconds.

Flamegraphs need debug symbols. Build with --release + debug = true in your profile, or use DWARF info in the WASM file. Without symbols, flamegraphs show ?? everywhere.

Chrome DevTools WASM profiling requires debug info. Build with DWARF debugging info and use the C/C++ DevTools Support (DWARF) Chrome extension to see Rust function names in the profiler.

Why WASM Performance Testing Matters

WebAssembly was designed for near-native speed, but the reality is nuanced:

  • Startup time: Parsing and compiling a WASM module takes time. A 1MB module might take 50-200ms to JIT compile in a browser. Streaming compilation (WebAssembly.instantiateStreaming) helps, but it's something to measure.
  • Call overhead: Crossing the JS/WASM boundary (calling WASM from JS or vice versa) has overhead — typically 10-100ns per call. For tight inner loops, this matters.
  • Memory operations: WASM has a single flat linear memory. Garbage collection doesn't exist at the WASM level, so allocation patterns matter more than in JS.
  • SIMD instructions: WASM SIMD can dramatically accelerate numeric workloads — but only if your runtime supports it and your code uses it.

Measuring these dimensions requires a proper benchmarking setup, not guesswork.
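To make the call-overhead point concrete, here is a small back-of-the-envelope sketch. The function name `boundary_overhead_fraction` and all the numbers are illustrative assumptions, not measurements; substitute figures from your own benchmarks.

```rust
/// Fraction of total time spent crossing the JS/WASM boundary, given an
/// assumed per-call overhead and the useful work done per call (both in ns).
fn boundary_overhead_fraction(call_overhead_ns: f64, work_per_call_ns: f64) -> f64 {
    call_overhead_ns / (call_overhead_ns + work_per_call_ns)
}

fn main() {
    // Hypothetical number: 50 ns of boundary overhead per call.
    let overhead_ns = 50.0;

    // One call per element: each call does only ~5 ns of real work.
    let per_element = boundary_overhead_fraction(overhead_ns, 5.0);

    // One call per 1,000-element batch: ~5,000 ns of real work per call.
    let batched = boundary_overhead_fraction(overhead_ns, 5_000.0);

    println!("per-element calls: {:.1}% overhead", per_element * 100.0);
    println!("batched calls:     {:.1}% overhead", batched * 100.0);
}
```

Under these assumed numbers, per-element calls spend most of their time on the boundary, while batching reduces the overhead to about 1%. This is why boundary crossings are worth measuring before optimizing anything else.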

Criterion Benchmarks for WASM

Criterion is the standard Rust benchmarking library. Setting it up for WASM comparison requires a bit of structure:

# Cargo.toml
[package]
name = "wasm-benchmarks"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib", "rlib"]

[dependencies]
wasm-bindgen = "0.2"

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "computation_bench"
harness = false

// src/lib.rs: the code to benchmark
pub fn naive_sum(data: &[f64]) -> f64 {
    data.iter().sum()
}

pub fn pairwise_sum(data: &[f64]) -> f64 {
    if data.len() <= 32 {
        return naive_sum(data);
    }
    let mid = data.len() / 2;
    pairwise_sum(&data[..mid]) + pairwise_sum(&data[mid..])
}

pub fn matrix_multiply_2x2(
    a: [[f64; 2]; 2],
    b: [[f64; 2]; 2],
) -> [[f64; 2]; 2] {
    [
        [
            a[0][0] * b[0][0] + a[0][1] * b[1][0],
            a[0][0] * b[0][1] + a[0][1] * b[1][1],
        ],
        [
            a[1][0] * b[0][0] + a[1][1] * b[1][0],
            a[1][0] * b[0][1] + a[1][1] * b[1][1],
        ],
    ]
}

pub fn fibonacci_iter(n: u64) -> u64 {
    if n <= 1 { return n; }
    let (mut a, mut b) = (0u64, 1u64);
    for _ in 2..=n {
        (a, b) = (b, a.wrapping_add(b));
    }
    b
}

// Simulate a workload with allocation to test memory performance
pub fn process_strings(inputs: &[&str]) -> Vec<String> {
    inputs.iter()
        .map(|s| s.to_uppercase())
        .filter(|s| s.len() > 3)
        .collect()
}

// benches/computation_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion, Throughput, BenchmarkId};
use wasm_benchmarks::{naive_sum, pairwise_sum, matrix_multiply_2x2, fibonacci_iter, process_strings};

fn bench_sum_variants(c: &mut Criterion) {
    let mut group = c.benchmark_group("summation");

    for size in [100usize, 1_000, 10_000, 100_000].iter() {
        let data: Vec<f64> = (0..*size).map(|i| i as f64 * 0.001).collect();

        group.throughput(Throughput::Elements(*size as u64));

        group.bench_with_input(
            BenchmarkId::new("naive", size),
            &data,
            |b, data| b.iter(|| naive_sum(black_box(data))),
        );

        group.bench_with_input(
            BenchmarkId::new("pairwise", size),
            &data,
            |b, data| b.iter(|| pairwise_sum(black_box(data))),
        );
    }

    group.finish();
}

fn bench_matrix_multiply(c: &mut Criterion) {
    let a = [[1.5_f64, 2.3], [4.1, 0.7]];
    let b = [[3.2_f64, 1.1], [2.8, 5.6]];

    c.bench_function("matrix_multiply_2x2", |bench| {
        bench.iter(|| matrix_multiply_2x2(black_box(a), black_box(b)))
    });
}

fn bench_fibonacci(c: &mut Criterion) {
    let mut group = c.benchmark_group("fibonacci");

    for n in [10u64, 20, 30, 40, 50].iter() {
        group.bench_with_input(
            BenchmarkId::from_parameter(n),
            n,
            |b, &n| b.iter(|| fibonacci_iter(black_box(n))),
        );
    }

    group.finish();
}

fn bench_string_processing(c: &mut Criterion) {
    let inputs: Vec<&str> = vec![
        "hello", "world", "rust", "wasm", "performance", "benchmark", "test",
        "foo", "bar", "baz", "qux", "quux", "corge", "grault", "garply",
    ];

    c.bench_function("process_strings", |bench| {
        bench.iter(|| process_strings(black_box(&inputs)))
    });
}

criterion_group!(
    benches,
    bench_sum_variants,
    bench_matrix_multiply,
    bench_fibonacci,
    bench_string_processing,
);
criterion_main!(benches);

Run native benchmarks:

cargo bench
# Results saved to target/criterion/
# Open target/criterion/report/index.html for HTML reports

To benchmark the WASM target, you need a WASI-aware benchmark runner:

# Build the WASM benchmark binary
# (Go users would set GOOS=wasip1 GOARCH=wasm instead.)
# For Rust:
# Note: Criterion's default features may not build for WASI; if the build
# fails, try default-features = false on the criterion dependency.
cargo build --target wasm32-wasip1 --release --bench computation_bench
# Then run with wasmtime (--dir=. lets Criterion write its results to disk):
wasmtime --dir=. target/wasm32-wasip1/release/deps/computation_bench-*.wasm -- --bench

Wasmer: Benchmarking Across Runtimes and Backends

Wasmer is a standalone WASM runtime that supports multiple compiler backends, making it ideal for comparing execution strategies:

# Cargo.toml for wasmer benchmarks
[dev-dependencies]
wasmer = "4"
wasmer-compiler-cranelift = "4"
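# LLVM backend: needs a system LLVM install. Dev-dependencies cannot be optional,
# so uncomment this line only when you want to benchmark it.
# wasmer-compiler-llvm = "4"
wasmer-compiler-singlepass = "4"
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "wasmer_bench"
harness = false

// benches/wasmer_bench.rs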
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
use wasmer::{Engine, Module, Store, Instance, imports, Function, Value};
use wasmer_compiler_cranelift::Cranelift;
use wasmer_compiler_singlepass::Singlepass;

const FIBONACCI_WAT: &str = r#"
    (module
        (func $fib (export "fib") (param $n i64) (result i64)
            local.get $n
            i64.const 1
            i64.le_s
            if (result i64)
                local.get $n
            else
                local.get $n
                i64.const 1
                i64.sub
                call $fib
                local.get $n
                i64.const 2
                i64.sub
                call $fib
                i64.add
            end
        )
    )
"#;

fn create_wasmer_instance(compiler: &str) -> (Store, Instance) {
    let engine = match compiler {
        "cranelift" => Engine::from(Cranelift::default()),
        "singlepass" => Engine::from(Singlepass::default()),
        _ => Engine::default(),
    };

    let mut store = Store::new(engine);
    let module = Module::new(&store, FIBONACCI_WAT).unwrap();
    let import_object = imports! {};
    let instance = Instance::new(&mut store, &module, &import_object).unwrap();
    (store, instance)
}

fn bench_wasmer_backends(c: &mut Criterion) {
    let mut group = c.benchmark_group("wasmer_backends");

    for (backend_name, n) in [("cranelift", 20i64), ("singlepass", 20i64)] {
        group.bench_with_input(
            BenchmarkId::new(backend_name, n),
            &n,
            |b, &n| {
                let (mut store, instance) = create_wasmer_instance(backend_name);
                let fib = instance.exports.get_function("fib").unwrap();
                b.iter(|| {
                    fib.call(&mut store, &[Value::I64(black_box(n))]).unwrap()
                });
            },
        );
    }

    group.finish();
}

fn bench_module_compilation(c: &mut Criterion) {
    let mut group = c.benchmark_group("module_compilation");

    // Measure how long it takes to compile the module (cold start simulation)
    group.bench_function("cranelift_compile", |b| {
        b.iter(|| {
            let engine = Engine::from(Cranelift::default());
            let store = Store::new(engine);
            Module::new(&store, black_box(FIBONACCI_WAT)).unwrap()
        });
    });

    group.bench_function("singlepass_compile", |b| {
        b.iter(|| {
            let engine = Engine::from(Singlepass::default());
            let store = Store::new(engine);
            Module::new(&store, black_box(FIBONACCI_WAT)).unwrap()
        });
    });

    group.finish();
}

criterion_group!(wasmer_benches, bench_wasmer_backends, bench_module_compilation);
criterion_main!(wasmer_benches);

Singlepass compiles faster (good for short-lived invocations); Cranelift produces faster code (good for long-running computations). Benchmarking both helps you choose the right backend for your use case.
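The trade-off can be turned into a break-even estimate once you have numbers from the two benchmark groups. The sketch below is only arithmetic; the function name `break_even_calls` and the sample timings are hypothetical and should be replaced with your own Criterion results.

```rust
/// Number of calls after which the slower-compiling but faster-running backend
/// becomes the cheaper choice overall. Returns None if it never catches up.
fn break_even_calls(
    quick_compile_ms: f64,      // compile time of the quick compiler (e.g. Singlepass)
    optimizing_compile_ms: f64, // compile time of the optimizing compiler (e.g. Cranelift)
    quick_call_us: f64,         // per-call time of code from the quick compiler
    optimizing_call_us: f64,    // per-call time of code from the optimizing compiler
) -> Option<u64> {
    let saved_per_call_us = quick_call_us - optimizing_call_us;
    if saved_per_call_us <= 0.0 {
        return None; // the optimizing compiler's code is not faster per call
    }
    let extra_compile_us = (optimizing_compile_ms - quick_compile_ms) * 1_000.0;
    if extra_compile_us <= 0.0 {
        return Some(0); // it also compiles at least as fast, so it always wins
    }
    Some((extra_compile_us / saved_per_call_us).ceil() as u64)
}

fn main() {
    // Hypothetical figures; take real ones from the module_compilation and
    // wasmer_backends benchmark groups.
    match break_even_calls(2.0, 10.0, 120.0, 80.0) {
        Some(n) => println!("optimizing backend pays off after {n} calls"),
        None => println!("optimizing backend never pays off"),
    }
}
```

If your workload makes fewer calls per instantiated module than the break-even count, the faster-compiling backend is likely the better fit.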

Flamegraphs for WASM Profiling

Native Flamegraph (for the compiled Rust logic)

# Install flamegraph tooling
cargo install flamegraph

# Build with debug symbols in release mode
# Add to Cargo.toml:
# [profile.release]
# debug = true

# Generate flamegraph (requires perf on Linux, DTrace on macOS)
cargo flamegraph --bench computation_bench -- --bench -n "summation"

# Writes flamegraph.svg to the current directory; open it in a browser

WASM Flamegraph with Wasmtime

# Build with DWARF debug info
cargo build --target wasm32-wasip1 --release
# Enable debug info in the WASM output:
# RUSTFLAGS="-C debuginfo=2" cargo build --target wasm32-wasip1 --release

# Record with perf while wasmtime emits jitdump data for the JIT-compiled code
perf record -k mono wasmtime run \
    --profile=jitdump \
    target/wasm32-wasip1/release/my_program.wasm

# Merge the jitdump data into perf.data and generate the flamegraph
perf inject --jit -i perf.data -o perf.jit.data
perf script -i perf.jit.data | stackcollapse-perf.pl | flamegraph.pl > wasm-flamegraph.svg
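The stackcollapse-perf.pl step emits "folded" stacks: one line per unique call stack, with frames separated by semicolons and a sample count at the end. If you want a quick text summary alongside the SVG, a minimal sketch like the following sums samples by leaf frame. The function name `self_samples_by_leaf` and the sample data are made up for illustration.

```rust
use std::collections::HashMap;

/// Sum sample counts by leaf frame from folded-stack text, where each line has
/// the form "frame_a;frame_b;leaf <count>". Malformed lines are skipped.
fn self_samples_by_leaf(folded: &str) -> Vec<(String, u64)> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    for line in folded.lines() {
        // The count is the last space-separated token on the line.
        let Some((stack, count)) = line.trim().rsplit_once(' ') else { continue };
        let Ok(count) = count.parse::<u64>() else { continue };
        // The leaf is the last frame in the semicolon-separated stack.
        let leaf = stack.rsplit(';').next().unwrap_or(stack);
        *totals.entry(leaf.to_string()).or_insert(0) += count;
    }
    let mut sorted: Vec<_> = totals.into_iter().collect();
    // Highest sample count first; ties broken by name for stable output.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    sorted
}

fn main() {
    // Made-up sample data in folded-stack format.
    let folded = "main;bench;pairwise_sum;naive_sum 70\n\
                  main;bench;pairwise_sum 10\n\
                  main;bench;naive_sum 20\n";
    for (frame, samples) in self_samples_by_leaf(folded) {
        println!("{samples:>6}  {frame}");
    }
}
```

Leaf-frame totals approximate "self time", which is what the widest plateaus at the top of a flamegraph represent.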

Browser WASM Profiling with Chrome DevTools

Chrome DevTools has first-class WASM profiling support:

  1. Enable WASM debugging: Open DevTools → Settings → Experiments → "WebAssembly Debugging: Enable DWARF support"
  2. Install the DWARF extension: Add "C/C++ DevTools Support (DWARF)" from the Chrome Web Store. This decodes WASM debug info into source-level symbols.
  3. Profile your app:
    • Open DevTools → Performance tab
    • Click Record
    • Perform the action you want to profile
    • Click Stop
  4. Interpret the results: In the flame chart, WASM frames appear as purple bars. With DWARF support, you'll see Rust function names. Without it, you see addresses.

For JavaScript-side benchmarking of WASM calls:

// benchmark_wasm.js — measure JS/WASM call overhead
async function benchmarkWasm() {
    const { instance } = await WebAssembly.instantiateStreaming(
        fetch('/calculator.wasm')
    );

    const { add, fibonacci } = instance.exports;

    // Warm up JIT
    for (let i = 0; i < 1000; i++) add(i, i + 1);

    // Benchmark add (simple operation)
    const addRuns = 1_000_000;
    const addStart = performance.now();
    for (let i = 0; i < addRuns; i++) {
        add(i | 0, (i + 1) | 0);
    }
    const addTime = performance.now() - addStart;

    console.log(`add: ${(addTime / addRuns * 1e6).toFixed(2)} ns/call`);
    console.log(`add throughput: ${(addRuns / addTime * 1000).toFixed(0)} calls/sec`);

    // Benchmark fibonacci(30)
    const fibRuns = 10_000;
    const fibStart = performance.now();
    for (let i = 0; i < fibRuns; i++) {
        fibonacci(30);
    }
    const fibTime = performance.now() - fibStart;

    console.log(`fibonacci(30): ${(fibTime / fibRuns * 1e6).toFixed(0)} ns/call`);

    // Measure module instantiation time (cold start)
    const instantiateStart = performance.now();
    for (let i = 0; i < 10; i++) {
        const resp = await fetch('/calculator.wasm');
        const buf = await resp.arrayBuffer();
        await WebAssembly.instantiate(buf);
    }
    const instantiateTime = (performance.now() - instantiateStart) / 10;

    console.log(`instantiation: ${instantiateTime.toFixed(1)} ms/module`);
}

benchmarkWasm().catch(console.error);

Tracking Performance Regressions in CI

Add a performance regression gate to your CI pipeline:

#!/bin/bash
# ci/check_perf.sh: fail if WASM is >3x slower than native
# Assumes Criterion's bencher-style output ("... bench: N ns/iter ...") and that
# a WASI runner is configured for the wasm32-wasip1 target, e.g.
# CARGO_TARGET_WASM32_WASIP1_RUNNER="wasmtime --dir=."

# Pull the number that precedes "ns/iter" from the fibonacci/30 line
extract_ns() { grep "fibonacci/30" | grep -oP '[\d.,]+(?= ns/iter)' | head -1 | tr -d ','; }

NATIVE_NS=$(cargo bench --bench computation_bench -- --output-format bencher 2>&1 | extract_ns)
WASM_NS=$(cargo bench --target wasm32-wasip1 --bench computation_bench -- --output-format bencher 2>&1 | extract_ns)

if [ -z "$NATIVE_NS" ] || [ -z "$WASM_NS" ]; then
    echo "ERROR: could not parse benchmark output"
    exit 2
fi

RATIO=$(echo "$WASM_NS / $NATIVE_NS" | bc -l)

if (( $(echo "$RATIO > 3.0" | bc -l) )); then
    echo "FAIL: WASM is ${RATIO}x slower than native (threshold: 3x)"
    exit 1
else
    echo "PASS: WASM overhead is ${RATIO}x (within 3x threshold)"
fi
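If you would rather keep the gate in Rust than in shell, the same check can be sketched as below. It assumes benchmark output lines of the form "... <number> ns/iter ...", as the script above does. The function names `extract_ns_per_iter` and `check_slowdown` and the sample lines are hypothetical.

```rust
/// Find the first line mentioning `bench_name` and return the number that
/// precedes "ns/iter" on it. Thousands separators (commas) are tolerated.
fn extract_ns_per_iter(output: &str, bench_name: &str) -> Option<f64> {
    output
        .lines()
        .filter(|line| line.contains(bench_name))
        .find_map(|line| {
            let (before, _) = line.split_once("ns/iter")?;
            let token = before.split_whitespace().last()?;
            token.replace(',', "").parse::<f64>().ok()
        })
}

/// Ok(ratio) when WASM is within `max_ratio` times native, Err(message) otherwise.
fn check_slowdown(native_ns: f64, wasm_ns: f64, max_ratio: f64) -> Result<f64, String> {
    // Written as negated comparisons so that NaN inputs are rejected too.
    if !(native_ns > 0.0) || !(wasm_ns > 0.0) {
        return Err(format!("invalid timings: native={native_ns}, wasm={wasm_ns}"));
    }
    let ratio = wasm_ns / native_ns;
    if ratio > max_ratio {
        Err(format!("WASM is {ratio:.2}x slower than native (threshold: {max_ratio}x)"))
    } else {
        Ok(ratio)
    }
}

fn main() {
    // Made-up output lines for illustration only.
    let native_out = "test fibonacci/30 ... bench:          12 ns/iter (+/- 1)";
    let wasm_out = "test fibonacci/30 ... bench:          21 ns/iter (+/- 2)";

    let native = extract_ns_per_iter(native_out, "fibonacci/30").expect("native timing");
    let wasm = extract_ns_per_iter(wasm_out, "fibonacci/30").expect("wasm timing");

    match check_slowdown(native, wasm, 3.0) {
        Ok(ratio) => println!("PASS: WASM overhead is {ratio:.2}x"),
        Err(msg) => {
            eprintln!("FAIL: {msg}");
            std::process::exit(1);
        }
    }
}
```

Unlike the shell version, this rejects missing or zero timings explicitly instead of passing an empty string to bc.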

End-to-End Performance Testing with HelpMeTest

Micro-benchmarks with Criterion and flamegraphs tell you about CPU cycles and function call overhead. But production performance problems often show up at a different level: slow page loads because the WASM module is too large, janky UI because the main thread is blocked during WASM computation, or sluggish responses because of N+1 boundary-crossing calls.

HelpMeTest can test these scenarios end-to-end. Write a scenario like "load the image processing page, upload a 5MB file, and verify the result appears within 3 seconds" — HelpMeTest runs it in a real browser, measures the wall-clock time, and fails the test if your WASM-powered feature regresses on real-world performance. No browser driver setup, no custom Playwright scripts.

Combine Criterion benchmarks (catching algorithmic regressions) with HelpMeTest scenarios (catching UX-level performance regressions). Both are necessary — a function that runs 10% slower in benchmarks might be imperceptible to users, while a page that takes 2 seconds longer to become interactive is immediately noticed.

By HelpMeTest