AFL++ Persistent Mode — Patterns That Actually Speed Up Your Fuzzing

The single biggest performance multiplier available to an AFL++ campaign is persistent mode. In the default fork-server model, AFL++ forks a new process for every generated input, which means paying for a fork() and a process teardown on every execution even though the binary is already mapped into memory. Persistent mode eliminates those forks by reusing the same process across many inputs, looping inside the harness between iterations. The throughput difference is not marginal: the AFL++ documentation describes speedups of 10x to 20x as easy to reach, and a small parser that manages a few thousand exec/s in fork mode can reach tens of thousands under persistent mode on the same hardware.

This post explains the mechanics, the instrumentation requirement, the state-leakage footgun that bites almost everyone on their first persistent-mode harness, how deferred initialization interacts with the fork server, and the cases where fork mode is actually the right answer.

What Persistent Mode Actually Does

AFL++'s fork-server model works by injecting a small stub that stops the target just before main() begins processing input and then waits for commands from afl-fuzz over a pair of control pipes. For each new input the stub forks from that paused state; the child processes one input and exits. This avoids repeating execve() and dynamic linker startup, but it still pays for a fork() and the teardown of the child on every input.

Persistent mode replaces the outer fork loop with an in-process loop controlled by the __AFL_LOOP(N) macro. The macro evaluates to true for N iterations, then returns false so the process can exit cleanly and a fresh child can be forked. Within those N iterations there are no forks at all: the process pauses at the top of the loop until the fuzzer has the next input ready, receives it on stdin (or through shared memory if the harness opts in), and reports coverage through the same shared bitmap used in fork mode.
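To make the loop semantics concrete, here is a small simulation that runs without AFL++. LoopBudget, loop_next() and simulate_campaign() are hypothetical names invented for this sketch; loop_next() only imitates the "true N times, then false" behaviour described above and is not how the real macro is implemented.

```c
#include <stdbool.h>

/* Hypothetical stand-in for __AFL_LOOP(N): true for N calls, then false. */
typedef struct {
    unsigned remaining;
} LoopBudget;

static bool loop_next(LoopBudget *b) {
    if (b->remaining == 0) return false;
    b->remaining--;
    return true;
}

/* Simulates a campaign of total_inputs executions with loop count n.
   Stores the number of child processes that were needed in *forks_out
   and returns the number of inputs handled. */
static unsigned long simulate_campaign(unsigned long total_inputs, unsigned n,
                                       unsigned long *forks_out) {
    unsigned long done = 0, forks = 0;
    if (n == 0) {              /* a zero budget would never make progress */
        *forks_out = 0;
        return 0;
    }
    while (done < total_inputs) {
        LoopBudget budget = { n };
        forks++;               /* the fork server creates a fresh child */
        while (done < total_inputs && loop_next(&budget)) {
            done++;            /* a real harness would parse one input here */
        }
        /* loop_next() returned false: this child exits */
    }
    *forks_out = forks;
    return done;
}
```

With n = 1000, ten thousand inputs cost ten forks instead of ten thousand, which is where the speedup comes from.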

The __AFL_LOOP(N) Macro and Tuning N

The N argument to __AFL_LOOP controls how many fuzzing iterations run in a single process lifetime before AFL++ restarts with a fresh fork. It is not a hard limit on throughput — it is a safety valve. After N iterations the process is discarded and a new child is forked, resetting all process state from the parent snapshot. This bounds the damage from memory leaks and from cumulative state corruption that your reset function might not fully clean up.

How to tune N:

Start with 1000–10000. For most targets this is the right range. At 50,000 exec/s with N=1000 a fresh fork happens roughly every 20 ms — imperceptible overhead.
Increase N if the target has expensive teardown. If your reset function calls into a garbage collector or destroys and re-creates a large data structure, fewer restarts means less overhead. Try N=50000 and measure exec/s.
Decrease N if you see intermittent crashes without a clear reproducer. This usually means accumulated state is triggering undefined behavior that isn't present on iteration 1. Dropping to N=100 (or even N=1, which degrades to fork mode) will tell you if iteration number is the variable.
N has no effect on coverage quality. AFL++ reads the coverage bitmap after every loop iteration, not at process exit, so the iteration count does not affect what gets discovered.
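The "every 20 ms" figure above is simple arithmetic, and it is worth redoing for your own numbers. A minimal sketch follows; both function names are made up for illustration, and the per-fork cost is something you have to measure or assume for your own system.

```c
/* Milliseconds between process restarts for loop count n at a given throughput. */
static double restart_interval_ms(unsigned long n, double execs_per_sec) {
    return (double)n / execs_per_sec * 1000.0;
}

/* Rough fraction of wall-clock time spent forking, given an ASSUMED
   cost per fork in milliseconds (measure this on your own machine). */
static double fork_overhead_fraction(unsigned long n, double execs_per_sec,
                                     double fork_cost_ms) {
    double interval = restart_interval_ms(n, execs_per_sec);
    return fork_cost_ms / (interval + fork_cost_ms);
}
```

For example, restart_interval_ms(1000, 50000.0) gives about 20 ms. If a fork cost 0.5 ms on your machine, fork_overhead_fraction(1000, 50000.0, 0.5) would put fork overhead at roughly 2.4% of wall-clock time, which is why raising N beyond a few thousand rarely buys much.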

State Leakage: The #1 Footgun

In fork mode, each input runs in a fresh process — any global or static state the previous input wrote is invisible to the next. In persistent mode, the same process handles thousands of inputs consecutively. If your target has any mutable global state — a parse context, an error accumulator, a cache, a statistics counter, an open file descriptor — that state persists from one iteration to the next unless you explicitly reset it.

State leakage produces two classes of symptom that are easy to confuse with real bugs:

False crashes. Input N writes invalid data into a global struct. Input N+1 is completely valid but reads the corrupted struct and crashes. The crashing input reproduces differently or not at all when run in isolation.
Sequence-dependent crashes. A bug is only reachable after the target processes a specific sequence of inputs that leaves the global state in a particular configuration. The fuzzer may stumble across the crash once and never reproduce it because it cannot control the sequence of prior inputs that set up the state. A falling stability percentage in the afl-fuzz status screen is often the first visible hint that this is happening.
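A toy reproduction of the first symptom, small enough to step through by hand. The parser here is invented for the example (toy_parse, toy_reset and g_depth are not from any real target): it tracks nesting depth in a global and forgets to restore it on the error path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical target state: nesting depth kept in a global. */
static int g_depth;

/* Returns 0 on success, -1 when the nesting depth exceeds 2.
   The early return leaves g_depth dirty, which is the bug being shown. */
static int toy_parse(const uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '(') g_depth++;
        if (buf[i] == ')' && g_depth > 0) g_depth--;
        if (g_depth > 2) return -1;
    }
    return 0;
}

/* The reset a persistent-mode harness has to call before every input. */
static void toy_reset(void) {
    g_depth = 0;
}
```

Feed it "(((" and then "x" without calling toy_reset() in between: the second, perfectly valid input is rejected because g_depth is still 3. Run "x" on its own and it passes, which is exactly the "does not reproduce in isolation" pattern.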

The fix is a reset function called at the top of every __AFL_LOOP body, before any call to the target. The reset must handle:

Heap allocations from the previous iteration (free or arena-reset them).
Global struct fields (memset or explicitly zero each field).
Open file descriptors (close any opened during parsing).
Static caches (flush or zero the cache arrays).
Errno and other thread-locals (usually not needed but worth considering for multi-threaded targets).

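// Example: a parser that uses a global context object.
// Arena, arena_init(), arena_reset() and parse_with_ctx() are assumed to
// come from the target's own headers.
#include <stdint.h>      // uint8_t
#include <string.h>      // memset()
#include <unistd.h>      // read(), STDIN_FILENO, ssize_t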
typedef struct {
    int     depth;
    char   *error_msg;
    Arena  *arena;
} ParseCtx;

static ParseCtx g_ctx;       // lives for the lifetime of the process
static Arena    g_arena_storage[1];

// Call this at the top of __AFL_LOOP to avoid state leakage between iterations.
static void reset_ctx(void) {
    // 1. Release everything allocated during the previous iteration.
    //    Reset the storage directly: g_ctx.arena is still NULL on the
    //    very first call, so it must not be dereferenced here.
    arena_reset(g_arena_storage);

    // 2. Zero the struct so integer/pointer fields don't carry over.
    //    This also covers depth and error_msg.
    memset(&g_ctx, 0, sizeof(g_ctx));

    // 3. Re-initialise the one field that needs a non-zero default.
    g_ctx.arena = g_arena_storage;
}

int main(void) {
    uint8_t buf[1 << 16];
    ssize_t len;

    // One-time initialisation: set up the arena allocator.
    arena_init(g_arena_storage, 1 << 20);

    __AFL_INIT();

    while (__AFL_LOOP(5000)) {
        len = read(STDIN_FILENO, buf, sizeof(buf));
        if (len <= 0) continue;

        reset_ctx();            // <-- mandatory state reset
        parse_with_ctx(&g_ctx, buf, (size_t)len);
    }
    return 0;
}

The arena pattern above is common in parser code: rather than calling malloc/free per allocation, all per-parse memory comes from an arena that is reset in bulk, here at the start of each iteration. This is both faster and safer: there are no individual free calls to miss, and the arena bounds the total per-iteration memory growth.
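If your target does not already have an arena, a bump allocator is only a few lines. The sketch below is one possible shape for the Arena type and the arena_init() / arena_reset() calls used in the harness above; the struct layout, the arena_alloc() helper and the 16-byte rounding are my assumptions, not something a particular library prescribes.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;   /* backing buffer, allocated once */
    size_t         cap;    /* total bytes available */
    size_t         used;   /* bytes handed out since the last reset */
} Arena;

/* One-time setup. Returns 0 on success, -1 if the backing malloc fails. */
static int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->cap  = a->base ? cap : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Bump allocation, sizes rounded up to a multiple of 16 bytes.
   Returns NULL when the arena is exhausted or the size overflows. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (n + 15u) & ~(size_t)15u;
    if (aligned < n || aligned > a->cap - a->used) return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Bulk reset: every pointer handed out since the last reset is now invalid. */
static void arena_reset(Arena *a) {
    a->used = 0;
}
```

One trade-off to keep in mind: ASan tracks the single backing malloc, not the individual arena allocations, so an overflow from one arena object into its neighbour will not be reported unless you add manual poisoning. For fuzz builds it can be worth keeping a compile-time switch that routes arena_alloc() to plain malloc().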

Basic Persistent Mode Harness

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>      // read(), STDIN_FILENO, ssize_t
#include "my_parser.h"

// Compiled with: AFL_USE_ASAN=1 afl-clang-fast -o fuzz_target fuzz_target.c my_parser.c
// Run with:      afl-fuzz -i seeds/ -o findings/ -- ./fuzz_target

int main(void) {
    uint8_t buf[1 << 16];
    ssize_t len;

    // __AFL_INIT() establishes the fork-server checkpoint.
    // Everything above this line runs once in the parent; everything
    // below runs once per forked child, and the loop body runs once
    // per fuzzing iteration inside that child.
    __AFL_INIT();

    while (__AFL_LOOP(10000)) {
        len = read(STDIN_FILENO, buf, sizeof(buf));
        if (len <= 0) continue;

        // Reset any mutable global state before calling the parser.
        // Failure to do this is the #1 persistent-mode footgun.
        my_parser_reset();

        parse_input(buf, (size_t)len);
    }
    return 0;
}

A few things to note about the structure above. __AFL_INIT() and __AFL_LOOP(N) are macros that the AFL++ compiler wrappers (afl-clang-fast, afl-clang-lto, afl-cc) define on the command line when they invoke the underlying compiler. They expand to calls into the fork-server runtime that AFL++ links into the target, so they do not require any explicit header include, and they are undefined if you build the same file with a plain compiler. The read(STDIN_FILENO, ...) call inside the loop is the simplest way to receive each input: AFL++ supplies the test case on the child's stdin and your harness reads it.

For maximum throughput on small inputs, AFL++ also exposes shared-memory test cases via the __AFL_FUZZ_TESTCASE_BUF pointer and __AFL_FUZZ_TESTCASE_LEN length, which let the fuzzer hand the harness a pointer into the shared input region directly — no read() syscall per iteration. This is the pattern used in most upstream AFL++ persistent-mode harnesses. Add __AFL_FUZZ_INIT(); at file scope, then inside the loop use the pointer instead of read(). The semantics of __AFL_LOOP and state-reset discipline are identical either way.
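Here is a sketch of the shared-memory variant. The #ifndef fallback block follows the pattern suggested in the AFL++ persistent-mode documentation so the file still builds with a plain compiler; count_lines() is a hypothetical stand-in for your parser, and fuzz_loop() is just a name I chose so that main() stays a one-liner that calls it.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Fallback definitions for builds without an AFL++ compiler wrapper. */
#ifndef __AFL_FUZZ_TESTCASE_LEN
static ssize_t fuzz_len;
static unsigned char fuzz_buf[1 << 16];
#define __AFL_FUZZ_TESTCASE_LEN fuzz_len
#define __AFL_FUZZ_TESTCASE_BUF fuzz_buf
#define __AFL_FUZZ_INIT() void sync(void);
#define __AFL_LOOP(x) \
    ((fuzz_len = read(0, fuzz_buf, sizeof(fuzz_buf))) > 0 ? 1 : 0)
#define __AFL_INIT() sync()
#endif

__AFL_FUZZ_INIT();   /* file scope, as the text above describes */

/* Hypothetical stand-in for the real parser: counts newline bytes. */
static size_t count_lines(const uint8_t *buf, size_t len) {
    size_t lines = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') lines++;
    }
    return lines;
}

/* Persistent loop over the shared-memory test case. Call from main(). */
int fuzz_loop(void) {
    __AFL_INIT();

    /* Fetch the buffer pointer after __AFL_INIT() and before the loop. */
    const unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF;

    while (__AFL_LOOP(10000)) {
        /* Copy the length into a local each iteration; it changes per input. */
        int len = __AFL_FUZZ_TESTCASE_LEN;
        if (len <= 0) continue;

        /* Reset target state here, exactly as in the read() version. */
        (void)count_lines(buf, (size_t)len);
    }
    return 0;
}
```

The buffer belongs to the fuzzer, so treat it as read-only and copy it if the target needs to modify its input or expects a NUL terminator.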

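The instrumentation requirement is real: persistent mode depends on the __AFL_LOOP and __AFL_INIT macros, which are provided by afl-clang-fast, afl-clang-lto, the GCC plugin frontend afl-gcc-fast, and the unified afl-cc / afl-c++ driver (which dispatches to one of the above based on what is available). The old assembler-based afl-gcc and afl-clang wrappers, where they still exist, do not support it. Binary-only targets are a separate case: QEMU mode has its own persistent mode, configured by address through environment variables such as AFL_QEMU_PERSISTENT_ADDR rather than through source macros. If your build cannot use any of the supported frontends, you have two options: use fork mode, or build with Clang just for the fuzz build. Most targets that build with GCC will also build with Clang after minor flag adjustments.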

Deferred Initialisation with __AFL_INIT()

__AFL_INIT() controls where AFL++ takes the fork-server snapshot — the point in the process from which all forked children start. Everything that runs before __AFL_INIT() runs once in the parent and is inherited by all children. Everything after runs once per child (i.e., once every N iterations).

Deferred initialization lets you amortize expensive one-time setup across the entire campaign instead of paying for it on every fork. Put expensive setup before __AFL_INIT(); put cheap per-iteration work inside the loop:

#include <stdint.h>      // uint8_t
#include <unistd.h>      // read(), STDIN_FILENO, ssize_t
#include "my_parser.h"
#include "my_crypto.h"

int main(void) {
    uint8_t buf[1 << 16];
    ssize_t len;

    // Expensive one-time setup that you want in the parent snapshot.
    // Run before __AFL_INIT() so forked children inherit the result.
    my_crypto_init();        // loads key tables, ~50 ms
    my_parser_init();        // pre-compiles RE patterns, ~20 ms

    __AFL_INIT();            // fork-server checkpoint: parent snapshots here

    while (__AFL_LOOP(10000)) {
        len = read(STDIN_FILENO, buf, sizeof(buf));
        if (len <= 0) continue;

        // Reset cheap mutable state; inherit expensive init from snapshot.
        my_parser_reset_state();
        parse_input(buf, (size_t)len);
    }
    return 0;
}

// Contrast: if my_crypto_init() were called INSIDE __AFL_LOOP it would
// run 10,000 times per process lifetime instead of once.
// If it were called AFTER __AFL_INIT() but outside the loop it would
// still run on every fork — one call per child, not once for all children.

The constraint is that whatever runs before __AFL_INIT() must not consume stdin. If your initialization reads from a config file on stdin, that read will consume bytes before the fuzzer has a chance to inject input. Read config from a file path passed as argv, or hard-code it for the fuzz build.
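A minimal sketch of the "read config from a path" option. load_config() is a hypothetical helper written for this post; it opens and closes the file inside the call, so nothing stays open across the fork-server checkpoint, and it never touches stdin.

```c
#include <stdio.h>

/* Reads at most cap-1 bytes from path into out and NUL-terminates it.
   Returns the number of bytes read, or -1 on any error. */
static long load_config(const char *path, char *out, size_t cap) {
    if (cap == 0) return -1;

    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    size_t n = fread(out, 1, cap - 1, f);
    int failed = ferror(f);
    fclose(f);               /* closed before __AFL_INIT() is ever reached */

    if (failed) return -1;
    out[n] = '\0';
    return (long)n;
}
```

In the harness this would be called before __AFL_INIT(), for example with argv[1] as the path, so every forked child inherits the parsed configuration.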

How Persistent Mode Interacts with ASan

AddressSanitizer and persistent mode interact in one important way: ASan's allocator quarantine. When you free memory, ASan does not immediately reclaim it — it quarantines freed regions to detect use-after-free bugs. Over many persistent-mode iterations this quarantine grows. By default ASan quarantines up to 256 MB; with a fast harness running 50,000 exec/s that ceiling can be hit within seconds.

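When the quarantine reaches its limit, ASan evicts the oldest entries and returns that memory to the allocator, so the quarantine itself stops growing. This is not a correctness problem, but a process that sits a few hundred megabytes above its real working set can trip a memory limit set with -m or the kernel's OOM killer, and those kills look like crashes. Tune the quarantine with: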

ASAN_OPTIONS=quarantine_size_mb=64: reduce the quarantine to 64 MB to limit RSS growth. This shortens the window in which a use-after-free is detectable, which is an acceptable trade for most fuzzing campaigns because a typical use-after-free touches the freed block soon after the free.
ASAN_OPTIONS=malloc_context_size=0: do not record allocation and free stack traces. This cuts per-allocation bookkeeping without affecting detection. The price is that reports no longer show where a block was allocated or freed, so re-run crashing inputs with a larger value when triaging.

The combination ASAN_OPTIONS=quarantine_size_mb=64:malloc_context_size=0:abort_on_error=1:symbolize=0:detect_leaks=0 is a reasonable starting point for persistent-mode ASan builds. abort_on_error=1 ensures crashes raise SIGABRT so the fork server detects them reliably. symbolize=0 skips in-process symbolization, which is slow, and some afl-fuzz versions refuse to start when a custom ASAN_OPTIONS omits it. detect_leaks=0 disables LSan's end-of-process leak scan. Under persistent mode that scan would run every time a child exits after N iterations and report anything the target or harness leaked along the way, drowning memory-corruption findings in noise. If you actually want leak detection, run a separate non-persistent campaign with leak detection enabled.
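If you would rather not depend on the environment, ASan also looks for a __asan_default_options() function in the binary and uses the string it returns as defaults. A sketch that bakes in the settings discussed above:

```c
/* ASan calls this hook at startup if the binary defines it.
   Anything set in the ASAN_OPTIONS environment variable takes
   precedence over the values returned here. */
const char *__asan_default_options(void) {
    return "quarantine_size_mb=64:malloc_context_size=0:"
           "abort_on_error=1:detect_leaks=0";
}
```

Because the environment wins, and afl-fuzz typically exports its own ASAN_OPTIONS defaults when you have not set the variable, treat this mainly as a safety net for settings afl-fuzz does not touch and for running the harness by hand during triage.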

CmpLog Mode Compatibility

AFL++'s CmpLog (REDQUEEN) mode instruments comparison instructions to record both operands, then tries substituting the observed values into the input at the positions that appear to feed the comparison (input-to-state replacement). CmpLog is one of the most effective techniques for breaking through magic-byte guards and simple checksum checks. The good news: it is fully compatible with persistent mode. The mechanism is simply an additional instrumented binary. You compile a second copy of the target with AFL_LLVM_CMPLOG=1 and pass it to afl-fuzz via the -c flag:

# Step 1: build the normal instrumented binary
AFL_USE_ASAN=1 afl-clang-fast -o fuzz_target fuzz_target.c my_parser.c

# Step 2: build a separate CmpLog-instrumented binary
# CmpLog records comparison operands (both sides of every ==, !=, <, etc.)
# so the fuzzer can try substituting them into the input.
AFL_USE_ASAN=1 AFL_LLVM_CMPLOG=1 afl-clang-fast -o fuzz_target_cmplog fuzz_target.c my_parser.c

# Run the main instance with -c pointing at the CmpLog binary
afl-fuzz -i seeds/ -o findings/ \
  -c ./fuzz_target_cmplog \
  -- ./fuzz_target

# CmpLog works with persistent mode. No changes to the harness are needed.
# The -c flag tells AFL++ to use both binaries from one afl-fuzz instance:
# the normal binary for throughput and the CmpLog binary to harvest
# comparison values.

AFL++ uses the CmpLog binary alongside the normal binary rather than as a second fuzzing process: the normal binary handles the bulk of executions, and the CmpLog binary is run on queue entries the fuzzer considers interesting in order to harvest comparison values. The harness code is identical for both. No changes to the persistent-mode loop are needed.
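For a sense of what CmpLog buys you, here is the kind of guard it is good at. The header format is invented for the example (passes_guards(), the "FUZZHDR1" magic and the version constant are all hypothetical). Blind mutation has to guess eight exact bytes and then a 32-bit constant; CmpLog sees both operands of each comparison and can place the expected values directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 only when the input passes both guards, 0 otherwise. */
static int passes_guards(const uint8_t *buf, size_t len) {
    uint32_t version;

    if (len < 12) return 0;

    /* Guard 1: 8-byte magic, compared with memcmp(). */
    if (memcmp(buf, "FUZZHDR1", 8) != 0) return 0;

    /* Guard 2: 32-bit constant. memcpy() avoids an unaligned load. */
    memcpy(&version, buf + 8, sizeof(version));
    if (version != 0x00010002u) return 0;

    return 1;   /* the interesting parser code would start here */
}
```

Without CmpLog you would normally cover guards like this with a dictionary or a seed file that already carries a valid header.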

When Fork Mode Is Actually Better

Persistent mode is not universally correct. There are targets where fork mode is the right choice — or where making persistent mode work correctly costs more than the throughput gain is worth:

Targets with irrecoverable global state. If the target initializes a subsystem (say, a custom memory allocator or a JIT compiler) in a way that cannot be reset without restarting the process, fork mode gives you a clean slate for free. Persistent mode would require re-implementing the initialization logic in userspace, which is error-prone.
Targets that spawn threads. Thread-local state, mutexes, and thread pools accumulated over persistent-mode iterations interact badly with the fork-server model. Fork does not clone threads; a forked child inherits the parent's thread state without the running threads, which leaves mutexes locked and thread pools empty. For multi-threaded targets, fork mode is simpler and safer.
Targets that mutate signal handlers. If the target installs a SIGABRT or SIGSEGV handler during initialization and removes it after processing each input, the handler state can leak between iterations. AFL++ uses these signals for crash detection; a leaked handler that swallows SIGABRT will silently discard ASan crash signals.
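Targets where fork overhead is not the bottleneck. Persistent mode removes a roughly fixed per-execution cost, so the gain depends on how large that cost is relative to the work done per input. If each execution takes tens of milliseconds because the target does heavy processing, the fork is noise, and porting to persistent mode buys a few percent at best; the correctness risk of a hand-written reset function is not justified for that. The flip side is that small, fast targets benefit the most, because fork cost dominates their per-execution time.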
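Binary-only targets. You cannot add __AFL_LOOP to a binary you cannot recompile. QEMU mode does offer its own persistent mode, driven by environment variables such as AFL_QEMU_PERSISTENT_ADDR, but you have to locate a suitable function by address and convince yourself that looping on it is safe without seeing the source. When that analysis is not practical, plain QEMU fork-server mode is the fallback; it is well optimized and often fast enough for a campaign even without persistence.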
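If you decide to stay in persistent mode with a target that fiddles with signal handlers, one mitigation is to snapshot the crash-signal dispositions before the loop and put them back before every iteration. This is a sketch under that assumption; save_crash_handlers() and restore_crash_handlers() are names invented here, and the two signals covered are just the ones mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static const int k_crash_sigs[] = { SIGABRT, SIGSEGV };
#define CRASH_SIG_COUNT (sizeof(k_crash_sigs) / sizeof(k_crash_sigs[0]))

static struct sigaction g_saved[CRASH_SIG_COUNT];

/* Call once, after __AFL_INIT() and before the loop. Returns 0 on success. */
static int save_crash_handlers(void) {
    for (size_t i = 0; i < CRASH_SIG_COUNT; i++) {
        if (sigaction(k_crash_sigs[i], NULL, &g_saved[i]) != 0) return -1;
    }
    return 0;
}

/* Call at the top of every iteration, next to the state reset.
   Puts back whatever was installed when the snapshot was taken. */
static int restore_crash_handlers(void) {
    for (size_t i = 0; i < CRASH_SIG_COUNT; i++) {
        if (sigaction(k_crash_sigs[i], &g_saved[i], NULL) != 0) return -1;
    }
    return 0;
}
```

Restoring a snapshot rather than forcing SIG_DFL matters in sanitizer builds: ASan installs its own SIGSEGV handler to print reports, and a blanket reset to the default would throw that away.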

The diagnostic workflow when persistent mode produces suspicious results: set __AFL_LOOP(1) (which degrades to fork mode semantics), rerun, and compare crash rates. If crashes disappear or change character, state leakage is the culprit. Audit your reset function against every mutable global in the target.
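Rebuilding the harness every time you want to try a different N gets old quickly. Since the loop count is passed through to the runtime as an ordinary argument, one option is to read it from the environment at startup. FUZZ_LOOP_N is a variable name I made up for this sketch, and the upper bound is arbitrary.

```c
#include <errno.h>
#include <stdlib.h>

/* Reads the loop count from the (hypothetical) FUZZ_LOOP_N variable.
   Returns fallback when it is unset, empty, malformed, zero, or huge. */
static unsigned loop_count_from_env(unsigned fallback) {
    const char *s = getenv("FUZZ_LOOP_N");
    char *end = NULL;
    unsigned long v;

    if (s == NULL || *s == '\0') return fallback;

    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0 || v > 1000000000ul) {
        return fallback;
    }
    return (unsigned)v;
}
```

In the harness you would compute unsigned n = loop_count_from_env(10000); before __AFL_INIT() and write while (__AFL_LOOP(n)). Running the campaign once with FUZZ_LOOP_N=1 and once without then gives you the comparison described above from the same binary.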

Measuring the Gain

The most useful measurement is coverage per CPU-hour rather than raw exec/s, because a faster harness is only better if it is finding new paths. Run the same target in fork mode and persistent mode for 30 minutes each, then compare afl-showmap output on the two corpora. If persistent mode finds substantially more unique edges in the same wall-clock time, it is working correctly. If coverage is similar despite the higher exec/s, process overhead was probably not what was holding the campaign back; at that point, more CPU cores or a better seed corpus will help more than further harness optimization.
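To put the two runs on the same scale, normalise the edge counts by CPU time. The helper below is a trivial sketch; the name is mine, and the edge counts are whatever you extracted from the two corpora.

```c
/* Unique edges discovered per CPU-hour.
   Returns 0 when the duration or core count is not positive. */
static double edges_per_cpu_hour(unsigned long edges, double wall_seconds,
                                 unsigned cores) {
    double cpu_hours = wall_seconds * (double)cores / 3600.0;
    return cpu_hours > 0.0 ? (double)edges / cpu_hours : 0.0;
}
```

A 30-minute single-core run that reached 1,200 unique edges works out to 2,400 edges per CPU-hour. Compare that figure between the fork-mode and persistent-mode runs rather than comparing exec/s.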