JSON seed corpus
Ubiquitous text-based data format — parser diversity and number handling edge cases make JSON a rewarding fuzz target.
JSON's simplicity is deceptive from a fuzzing perspective. The format admits a small number of value types (object, array, string, number, boolean, null), but the combination of Unicode string encoding, deeply nested structures, and floating-point number parsing creates a large space of interesting inputs. Different JSON parsers disagree on edge cases: duplicate object keys, extremely long strings, deeply nested structures that exhaust the call stack in recursive-descent parsers, and numbers with exponents larger than double-precision can represent.
Stack overflow bugs in recursive-descent parsers (particularly those without an explicit depth limit) are among the most commonly found JSON fuzzing results. Number parsing is a secondary hot spot: the conversion from decimal string to IEEE 754 double is surprisingly complex (David Gay's dtoa algorithm has its own history of bugs), and some parsers accept number formats outside the JSON grammar (leading zeros, NaN/Infinity literals), whose less-tested code paths can misbehave on certain inputs.
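The edge cases described above can be observed on a single parser before any fuzzing starts. The following is a minimal sketch, assuming CPython's standard json module as the target; the helper names probe and nested_arrays are illustrative and not part of any fuzzing tool.

```python
import json


def probe(text):
    """Return a short label describing how json.loads handles `text`."""
    try:
        return repr(json.loads(text))
    except RecursionError:
        # CPython guards its recursive decoder with the interpreter's
        # recursion limit instead of overflowing the native stack.
        return "RecursionError"
    except json.JSONDecodeError:
        return "JSONDecodeError"


def nested_arrays(depth):
    """Build `depth` nested empty arrays, e.g. [[[]]] for depth 3."""
    return "[" * depth + "]" * depth


if __name__ == "__main__":
    cases = {
        "duplicate keys": '{"a": 1, "a": 2}',
        "exponent overflow": "1e400",
        "integer above 2^53": "9007199254740993",
        "leading zero": "01",
        "NaN literal": "NaN",
        "shallow nesting": nested_arrays(10),
        "deep nesting": nested_arrays(100_000),
    }
    for name, text in cases.items():
        print(f"{name:20s} -> {probe(text)[:40]}")
```

On CPython the duplicate-key document keeps the last value, 1e400 silently becomes inf, and the NaN literal is accepted by default even though it is not valid JSON. Other parsers may answer differently for each row, which is the kind of divergence a differential harness can check for.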
Grammar-aware fuzzing is particularly effective for JSON because pure byte-level mutation generates many syntactically invalid documents that the parser rejects within the first few bytes. Building the initial corpus with a grammar-based generator (a libFuzzer custom mutator driven by a JSON grammar, or Grammarinator) and then switching to coverage-guided byte-level mutation typically gives the best results. CMPLOG/RedQueen mode in AFL++ is also highly effective because JSON parsers perform many character comparisons on bracket, colon, comma, and quote characters.
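To illustrate the grammar-based half of that workflow, here is a minimal sketch of a random generator that follows the JSON value grammar. It is a stand-in written for this page, not the libFuzzer mutator or Grammarinator themselves; the function names, the piece list, and the size limits are all assumptions.

```python
import json
import random

DIGITS = "0123456789"


def gen_number(rng):
    """Emit [-] int [frac] [exp] as defined by the JSON grammar."""
    sign = rng.choice(["", "-"])
    integer = rng.choice(["0", str(rng.randint(1, 10 ** rng.randint(1, 20)))])
    frac_digits = "".join(rng.choice(DIGITS) for _ in range(rng.randint(1, 8)))
    frac = rng.choice(["", "." + frac_digits])
    exponent = rng.choice("eE") + rng.choice(["", "+", "-"]) + str(rng.randint(0, 400))
    exp = rng.choice(["", exponent])
    return sign + integer + frac + exp


def gen_string(rng):
    """Emit a quoted string built from plain characters and valid escapes."""
    pieces = ["a", "Z", " ", "é", "\\n", "\\\"", "\\\\", "\\/",
              "\\u0000", "\\u00e9", "\\ud83d\\ude00"]
    body = "".join(rng.choice(pieces) for _ in range(rng.randint(0, 6)))
    return '"' + body + '"'


def gen_value(rng, depth=0, max_depth=4):
    """Emit one JSON value; containers are only chosen above max_depth."""
    kinds = ["null", "bool", "number", "string"]
    if depth < max_depth:
        kinds += ["array", "object"]
    kind = rng.choice(kinds)
    if kind == "null":
        return "null"
    if kind == "bool":
        return rng.choice(["true", "false"])
    if kind == "number":
        return gen_number(rng)
    if kind == "string":
        return gen_string(rng)
    count = rng.randint(0, 3)
    if kind == "array":
        items = [gen_value(rng, depth + 1, max_depth) for _ in range(count)]
        return "[" + ",".join(items) + "]"
    members = [gen_string(rng) + ":" + gen_value(rng, depth + 1, max_depth)
               for _ in range(count)]
    return "{" + ",".join(members) + "}"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the corpus is reproducible
    for _ in range(5):
        print(gen_value(rng))
```

Every document this produces is syntactically valid, so a fuzzer seeded with the output starts past the parser's early rejection paths and can spend its byte-level mutations on deeper code. Object keys are generated independently, so duplicate keys can appear, which is deliberate.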
Building + curating your corpus
- →The JSONTestSuite (300+ test vectors) is a strong compliance corpus: file-name prefixes mark whether each input must be accepted (y_), must be rejected (n_), or is left to the implementation (i_), and it collects most of the major edge cases in one place.
- →Include deeply nested structures (object inside array inside object, 500+ levels) as separate corpus entries to exercise depth-limit and stack overflow guards.
- →Add JSON files with extremely long strings (1 MB of repeated characters), strings with every Unicode escape sequence (\uXXXX), and strings with surrogate pairs.
- →Include number edge cases: very large integers (larger than 2^53), numbers with very long decimal parts, numbers in scientific notation, and -0.
- →Use afl-cmin to reduce a large API-response corpus — many real-world JSON responses have identical structure with different values; minimise to unique coverage paths.
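The hand-written entries in the list above can be generated rather than typed. The following is a minimal sketch of a seed writer; the file names, the output directory, and the exact sizes are assumptions chosen to match the bullets, and the escape range is a sample rather than every \uXXXX value.

```python
import json
import os


def build_seeds(depth=500):
    """Return a mapping of file name to JSON text for hand-picked edge cases."""
    seeds = {}
    # Deep nesting: plain arrays, and object-inside-array repeated `depth` times.
    seeds["deep_array.json"] = "[" * depth + "]" * depth
    seeds["deep_mixed.json"] = '{"a":[' * depth + "null" + "]}" * depth
    # Long string: 1 MiB of a repeated character.
    seeds["long_string.json"] = json.dumps("A" * (1024 * 1024))
    # Escapes: a sample of \uXXXX values, a surrogate pair, and a lone surrogate.
    escapes = "".join(f"\\u{cp:04x}" for cp in range(0x0000, 0x0100))
    seeds["unicode_escapes.json"] = '"' + escapes + '"'
    seeds["surrogate_pair.json"] = '"\\ud83d\\ude00"'
    seeds["lone_surrogate.json"] = '"\\ud800"'
    # Numbers: above 2^53, long fraction, scientific notation, negative zero.
    seeds["int_above_2_53.json"] = str(2 ** 53 + 1)
    seeds["long_fraction.json"] = "0." + "3" * 1000
    seeds["scientific.json"] = "1.5e-10"
    seeds["negative_zero.json"] = "-0"
    return seeds


def write_seeds(directory, seeds):
    """Write each seed as a UTF-8 file under `directory`; return the paths."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for name, text in seeds.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(text.encode("utf-8"))
        paths.append(path)
    return paths


if __name__ == "__main__":
    written = write_seeds("corpus_json", build_seeds())
    print(f"wrote {len(written)} seed files")
```

Keeping each edge case in its own file lets the fuzzer and afl-cmin treat them independently, so a seed that adds no coverage can be dropped without losing the others.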
Mutator hints
- →Use AFL++ json.dict to teach the fuzzer JSON structural tokens ('{', '}', '[', ']', ':', ',', 'true', 'false', 'null') as dictionary entries.
- →Write or adapt a libFuzzer custom mutator driven by a JSON grammar (the grammar-based fuzzing chapters of the Fuzzing Book are a useful reference for the approach) to generate syntactically valid JSON with adversarial semantic content.
- →Use CMPLOG/RedQueen mode in AFL++ to recover exact string values compared inside JSON parsers — particularly useful for finding key-specific code paths in schema validators.
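- →Inject numbers at IEEE 754 boundary values as corpus entries or dictionary tokens: DBL_MAX (1.7976931348623157e308), DBL_MIN (2.2250738585072014e-308), the smallest subnormal (5e-324), negative zero, exponents that overflow to infinity (for example 1e309), and the non-standard NaN and Infinity literals that some parsers accept.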
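Boundary numbers can be supplied to the fuzzer as extra dictionary tokens alongside the structural ones. The following is a minimal sketch that renders a handful of values in the name="value" dictionary syntax used by AFL++; the token names and the selection of values are assumptions, and the output is meant to supplement json.dict rather than replace it.

```python
import sys


def boundary_tokens():
    """Return named decimal spellings of IEEE 754 double boundary values."""
    return {
        "dbl_max": repr(sys.float_info.max),   # largest finite double
        "dbl_min": repr(sys.float_info.min),   # smallest normal double
        "dbl_subnormal_min": "5e-324",         # smallest positive subnormal
        "overflow": "1e309",                   # rounds to infinity
        "underflow": "1e-400",                 # rounds to zero
        "neg_zero": "-0.0",
        "int_2_53_plus_1": str(2 ** 53 + 1),   # not exactly representable
        "nan_literal": "NaN",                  # non-standard literals that
        "inf_literal": "Infinity",             # some parsers accept
        "neg_inf_literal": "-Infinity",
    }


def to_afl_dict(tokens):
    """Render tokens as name="value" lines, escaping backslashes and quotes."""
    lines = []
    for name, value in tokens.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{name}="{escaped}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sys.stdout.write(to_afl_dict(boundary_tokens()))
```

The output can be saved to a file and passed to the fuzzer as an additional dictionary, so that mutations splice whole boundary values into number positions instead of having to discover them digit by digit.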
Recommended fuzzers
- → AFL++
- → libFuzzer
- → Honggfuzz
- → Centipede