PDF seed corpus
The broadest attack surface in document parsing — 700+ spec pages and decades of implementation debt.
PDF is one of the most complex binary formats in widespread use. Its cross-reference tables, object streams, embedded fonts, compressed cross-reference streams, and optional encryption layers each add their own parsing logic, and any of them can interact with the others in unexpected ways. A well-built PDF corpus should exercise all major object types: Type 1 and TrueType fonts, embedded JPEG and JBIG2 images, form XObjects, JavaScript actions, and incremental updates.
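As a rough way to check whether a corpus touches the object types listed above, the following Python sketch scans raw file bytes for marker names. The marker table, the function names, and the `*.pdf` glob are assumptions made for illustration. A byte scan cannot see objects stored inside compressed object streams, and linearized files may also contain more than one %%EOF marker, so treat the counts as an approximation rather than a parser-accurate inventory.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Set

# Heuristic byte markers chosen for this sketch; not an exhaustive list.
FEATURE_MARKERS = {
    "type1_font": rb"/Subtype\s*/Type1\b",
    "truetype_font": rb"/Subtype\s*/TrueType\b",
    "jpeg_image": rb"/DCTDecode\b",
    "jbig2_image": rb"/JBIG2Decode\b",
    "form_xobject": rb"/Subtype\s*/Form\b",
    "javascript_action": rb"/S\s*/JavaScript\b",
    "object_stream": rb"/Type\s*/ObjStm\b",
    "xref_stream": rb"/Type\s*/XRef\b",
    "encryption": rb"/Encrypt\b",
}
COMPILED = {name: re.compile(pat) for name, pat in FEATURE_MARKERS.items()}


def detect_features(data: bytes) -> Set[str]:
    """Return the names of all markers that appear in the raw bytes."""
    found = {name for name, pat in COMPILED.items() if pat.search(data)}
    # More than one %%EOF suggests an incremental update (heuristic only).
    if data.count(b"%%EOF") > 1:
        found.add("incremental_update")
    return found


def corpus_coverage(corpus_dir: str) -> Counter:
    """Count how many files in corpus_dir exhibit each feature."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.pdf"):
        counts.update(detect_features(path.read_bytes()))
    return counts
```

Features that come back with a zero count point to gaps worth filling with additional seeds.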
The format's optional-everything philosophy means production parsers routinely accept malformed input that the spec rejects. This makes black-box corpus collection especially valuable: real-world PDFs crawled from the web carry mutations that formal grammar-based generators miss. Bug classes found by fuzzing PDF parsers include heap overflows in glyph rendering, integer overflows in page-tree traversal, use-after-free in form field evaluation, and out-of-bounds reads in stream filters (FlateDecode, LZWDecode, CCITTFaxDecode).
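Stream filter code is only reached if seeds actually contain filtered streams. As a minimal sketch, the helpers below (hypothetical names, not part of any PDF library) wrap a payload in a FlateDecode stream object using Python's standard `zlib` module. The second helper cuts the compressed data short while the declared /Length still claims the full size, which is one way to probe how a decoder handles incomplete data. The resulting bytes are a single indirect object, not a complete file, and would need to be spliced into a seed by hand.

```python
import zlib


def flate_stream_object(obj_num: int, payload: bytes) -> bytes:
    """Wrap payload in an indirect stream object compressed with FlateDecode."""
    compressed = zlib.compress(payload)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        obj_num, len(compressed))
    return header + compressed + b"\nendstream\nendobj\n"


def truncated_flate_object(obj_num: int, payload: bytes, keep: float = 0.5) -> bytes:
    """Same object, but with only a fraction of the compressed bytes present."""
    compressed = zlib.compress(payload)
    cut = compressed[: max(1, int(len(compressed) * keep))]
    # /Length deliberately reports the full compressed size, not len(cut).
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        obj_num, len(compressed))
    return header + cut + b"\nendstream\nendobj\n"
```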
For mutation, PDF-specific dictionaries that include token keywords (/Type, /Pages, startxref, %%EOF) typically improve coverage over raw byte-flip strategies. Grammar-aware mutators, such as a Peach Pit written for PDF, can generate structurally valid but semantically surprising inputs; general-purpose mutators such as radamsa have no dedicated PDF mode but can still diversify a seed set. AFL++'s CMPLOG mode logs the operands of comparison instructions, which helps the fuzzer get past checks on keywords and magic values such as the %PDF- header.
Building + curating your corpus
- →Run afl-cmin on a raw crawled corpus before fuzzing. It drops files that add no new coverage; it does not shrink individual files. Crawled PDFs are often hundreds of kilobytes, so also prefer small seeds (ideally under 10 KB), since smaller inputs execute faster.
- →Include at least a few intentionally malformed files: truncated cross-reference tables, mismatched stream lengths, and broken startxref offsets stress the least-tested error paths.
- →Seed with PDFs from multiple major generators (Microsoft Office export, LibreOffice, Ghostscript, Cairo) to maximise feature coverage across the code paths each generator tickles.
- →Strip JavaScript-heavy PDFs from your starter corpus unless specifically targeting JS engines — they bloat execution time without expanding structural coverage.
- →Use afl-tmin on any crash-reproducing file to isolate the minimal trigger before reporting.
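The malformed seeds mentioned in the list above can be produced by hand. The Python sketch below builds a minimal one-page PDF with correct cross-reference offsets, then derives three broken variants from it. All function names and output file names are hypothetical, and the truncation point and length delta are arbitrary choices for illustration.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a one-page PDF with a correct xref table and startxref offset."""
    content = b"0 0 m 72 72 l S"  # draws one line; needs no resources
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def truncate_xref(pdf: bytes) -> bytes:
    """Cut the file partway through the cross-reference table."""
    start = pdf.index(b"\nxref\n") + 1
    return pdf[: start + 40]


def mismatch_stream_length(pdf: bytes, delta: int = 64) -> bytes:
    """Declare a /Length larger than the actual stream body."""
    def bump(match):
        return b"/Length %d" % (int(match.group(1)) + delta)
    # If the digit count changes, later xref offsets shift too; that is
    # acceptable here because the variant is meant to be malformed.
    return re.sub(rb"/Length (\d+)", bump, pdf, count=1)


def break_startxref(pdf: bytes) -> bytes:
    """Point startxref past the end of the file."""
    return re.sub(rb"startxref\n\d+", b"startxref\n%d" % (len(pdf) + 1000), pdf)


def malformed_variants(pdf: bytes) -> dict:
    """Map hypothetical seed file names to malformed byte strings."""
    return {
        "truncated_xref.pdf": truncate_xref(pdf),
        "mismatched_length.pdf": mismatch_stream_length(pdf),
        "broken_startxref.pdf": break_startxref(pdf),
    }
```

Writing each value of `malformed_variants(build_minimal_pdf())` to the seed directory alongside the crawled files gives the fuzzer a starting point inside the error-handling paths.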
Mutator hints
- →Load the AFL++ PDF dictionary (pdf.dict) to teach the fuzzer magic tokens like 'obj', 'endobj', 'stream', 'startxref', and '%%EOF'.
- →Use CMPLOG / RedQueen mode in AFL++ to solve comparisons against keywords and magic values that byte-level mutation rarely guesses, including strcmp-style checks on tokens.
- →Structure-aware mutators (e.g. a PDF pit for Peach) produce valid object graphs with injected type confusion, which catches bugs that pure byte-flip mutation misses.
- →Combine structure-aware seed generation with coverage-guided byte-level mutation: start with grammar-generated inputs, then let AFL++ mutate from there.
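When a target recognises tokens beyond those in a stock dictionary, a supplementary dictionary file can be generated. The Python sketch below writes entries in the `name="value"` form that AFL-style dictionaries use, escaping quotes, backslashes, and non-printable bytes as `\xNN`. The token list, the `pdf_NNN` naming scheme, and the function names are assumptions for illustration only.

```python
# Example tokens taken from the keywords discussed above; extend as needed.
EXTRA_TOKENS = [
    b"/Type", b"/Pages", b"obj", b"endobj", b"stream", b"endstream",
    b"xref", b"trailer", b"startxref", b"%%EOF",
]


def afl_escape(token: bytes) -> str:
    """Escape a token for use inside a double-quoted dictionary value."""
    out = []
    for byte in token:
        ch = chr(byte)
        if 32 <= byte < 127 and ch not in '"\\':
            out.append(ch)
        else:
            out.append("\\x%02x" % byte)  # hex escape for everything else
    return "".join(out)


def make_dictionary(tokens) -> str:
    """Render tokens as one name="value" entry per line."""
    lines = []
    for index, token in enumerate(tokens):
        lines.append('pdf_%03d="%s"' % (index, afl_escape(token)))
    return "\n".join(lines) + "\n"


def write_dictionary(path: str, tokens=EXTRA_TOKENS) -> None:
    """Write the dictionary to path as ASCII text."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(make_dictionary(tokens))
```

The resulting file can be passed to afl-fuzz with the `-x` option.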
Recommended fuzzers
- → AFL++
- → libFuzzer
- → Honggfuzz
Libraries that consume PDF