
PDF seed corpus

The broadest attack surface in document parsing — 700+ spec pages and decades of implementation debt.

PDF is one of the most complex binary formats in widespread use. Classic cross-reference tables, cross-reference streams, object streams, embedded fonts, and optional encryption layers each add their own parsing logic and state, and any of them can interact with the others in unexpected ways. A well-built PDF corpus should exercise all major features: Type 1 and TrueType fonts, embedded JPEG and JBIG2 images, form XObjects, JavaScript actions, and incremental updates.
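
To make the cross-reference machinery concrete, here is a minimal sketch that assembles a tiny one-page PDF by hand and computes the byte offsets the classic xref table needs. The function name build_minimal_pdf and the object contents are illustrative assumptions, not part of any corpus tooling; the 20-byte xref entry layout follows the PDF specification.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page PDF with a classic cross-reference table.

    Hypothetical helper for illustration: three objects, no content stream.
    """
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)  # startxref must point at the "xref" keyword
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0: head of the free list
    for off in offsets:
        # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        # the in-use flag "n", and a two-byte end-of-line (space + LF here).
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("minimal_seed.pdf", "wb") as fh:  # hypothetical output name
        fh.write(build_minimal_pdf())
```

A file like this is only a few hundred bytes, which makes it a convenient starting seed alongside real-world documents; it does not cover fonts, images, or streams.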

The format's optional-everything philosophy means production parsers routinely accept malformed input that the spec rejects. This makes black-box corpus collection especially valuable: real-world PDFs crawled from the web carry mutations that formal grammar-based generators miss. Bug classes found by fuzzing PDF parsers include heap overflows in glyph rendering, integer overflows in page-tree traversal, use-after-free in form field evaluation, and out-of-bounds reads in stream filters (FlateDecode, LZWDecode, CCITTFaxDecode).

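For mutation, PDF-specific dictionaries that include token keywords (/Type, /Pages, startxref, %%EOF) typically improve coverage over raw byte-flip strategies, because the fuzzer no longer has to discover multi-byte keywords one bit at a time. Grammar-aware tools such as Peach with a PDF pit can generate structurally valid but semantically surprising inputs. radamsa has no PDF-specific mode, but as a general-purpose mutator it is still useful for producing malformed variants of existing seeds. AFL++'s CMPLOG mode logs the operands of comparisons in the target and feeds them back into the input, which helps the fuzzer get past checks on magic byte sequences, keywords, and numeric fields such as object numbers.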
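
The difference between a raw bit flip and a dictionary-driven mutation can be sketched in a few lines. This is a toy illustration, not how AFL++ implements its mutators; the function names and the four-token list are assumptions taken from the keywords mentioned above.

```python
import random

# Tokens a PDF dictionary might contain (taken from the text above).
PDF_TOKENS = [b"/Type", b"/Pages", b"startxref", b"%%EOF"]


def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Raw strategy: flip one randomly chosen bit."""
    if not data:
        return data
    out = bytearray(data)
    index = rng.randrange(len(out))
    out[index] ^= 1 << rng.randrange(8)
    return bytes(out)


def overwrite_with_token(data: bytes, rng: random.Random,
                         tokens=PDF_TOKENS) -> bytes:
    """Dictionary strategy: overwrite a random position with a whole keyword."""
    token = rng.choice(tokens)
    if len(token) > len(data):
        return data
    pos = rng.randrange(len(data) - len(token) + 1)
    return data[:pos] + token + data[pos + len(token):]


if __name__ == "__main__":
    rng = random.Random(1234)
    seed = b"1 0 obj << /Kind /Catalog >> endobj"
    print(flip_random_bit(seed, rng))
    print(overwrite_with_token(seed, rng))
```

A single bit flip almost never produces a complete keyword such as startxref, whereas the token overwrite places one in a single step, which is the intuition behind the coverage gain.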

Building + curating your corpus

  • Run afl-cmin on a raw crawled corpus before fuzzing to drop files that add no new coverage. Crawled PDFs are often hundreds of kilobytes, so also prefer small seeds (ideally under 10 KB); smaller inputs execute faster and can shorten a campaign considerably.
  • Include at least a few intentionally malformed files. Truncated cross-reference tables, mismatched stream lengths, and broken startxref offsets stress the least-tested error paths.
  • Seed with PDFs from multiple major generators (Microsoft Office export, LibreOffice, Ghostscript, Cairo) to maximise feature coverage across the code paths each generator tickles.
  • Strip JavaScript-heavy PDFs from your starter corpus unless specifically targeting JS engines — they bloat execution time without expanding structural coverage.
  • Use afl-tmin on any crash-reproducing file to isolate the minimal trigger before reporting.
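
The malformed seeds suggested above can be derived from any well-formed file with a few byte-level edits. The following is a rough sketch under simplifying assumptions: the helper names are hypothetical, the file is assumed to use LF line endings and a classic xref table, and only direct integer /Length values are touched.

```python
import re


def truncate_in_xref(pdf: bytes, keep: int = 25) -> bytes:
    """Cut the file off a few bytes into the last classic xref table."""
    pos = pdf.rfind(b"\nxref")  # "startxref" is not preceded by a newline
    if pos < 0:
        return pdf  # no classic table found (e.g. xref streams only)
    return pdf[: pos + keep]


def shift_startxref(pdf: bytes, delta: int = 7) -> bytes:
    """Make the final startxref offset point at the wrong byte."""
    matches = list(re.finditer(rb"startxref\s+(\d+)", pdf))
    if not matches:
        return pdf
    m = matches[-1]
    wrong = str(int(m.group(1)) + delta).encode()
    return pdf[: m.start(1)] + wrong + pdf[m.end(1):]


def mismatch_stream_length(pdf: bytes, delta: int = 5) -> bytes:
    """Inflate the first direct /Length so it disagrees with the stream data."""
    # The lookahead skips indirect references such as "/Length 15 0 R".
    m = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", pdf)
    if not m:
        return pdf
    wrong = str(int(m.group(1)) + delta).encode()
    return pdf[: m.start(1)] + wrong + pdf[m.end(1):]


if __name__ == "__main__":
    with open("seed.pdf", "rb") as fh:  # hypothetical input file
        original = fh.read()
    variants = {
        "truncated_xref.pdf": truncate_in_xref(original),
        "bad_startxref.pdf": shift_startxref(original),
        "bad_length.pdf": mismatch_stream_length(original),
    }
    for name, data in variants.items():
        with open(name, "wb") as fh:
            fh.write(data)
```

Each helper returns the input unchanged when it cannot find its target, so it is worth checking that a variant actually differs from the original before adding it to the corpus.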

Mutator hints

  • Load the AFL++ PDF dictionary (pdf.dict) to teach the fuzzer magic tokens like 'obj', 'endobj', 'stream', 'startxref', and '%%EOF'.
  • Use CMPLOG / RedQueen mode in AFL++ to get past comparisons the parser makes against keywords and numeric fields (for example object and generation numbers), whether they are integer compares or strcmp-style checks.
  • Structurally aware mutators (e.g. Peach's PDF pit) produce valid object graphs with injected type confusion — catches bugs that pure byte-flip misses.
  • Combine structure-aware seed generation with coverage-guided byte-level mutation: start with grammar-generated inputs, then let AFL++ mutate from there.
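
If the stock dictionary lacks tokens your target cares about, a supplementary file in the AFL dictionary format (one name="value" entry per line, with \xNN hex escapes) is easy to generate. The sketch below is an assumption-laden example: the entry names, the extra tokens, and the extra_pdf.dict filename are all illustrative.

```python
def afl_dict_entry(name: str, token: bytes) -> str:
    """Format one entry as name="value", hex-escaping anything risky."""
    escaped = "".join(
        # Keep printable ASCII as-is, except the quote and the backslash.
        chr(b) if 0x20 <= b < 0x7F and b not in (0x22, 0x5C) else f"\\x{b:02X}"
        for b in token
    )
    return f'{name}="{escaped}"'


def build_dictionary(tokens: dict) -> str:
    """Join entries into the text of a dictionary file."""
    return "\n".join(afl_dict_entry(n, t) for n, t in tokens.items()) + "\n"


# Hypothetical extra tokens; pick ones relevant to the parser under test.
EXTRA_TOKENS = {
    "kw_endstream": b"endstream",
    "filter_flate": b"/FlateDecode",
    "filter_jbig2": b"/JBIG2Decode",
    "eol_crlf": b"\r\n",
}

if __name__ == "__main__":
    with open("extra_pdf.dict", "w", encoding="ascii") as fh:
        fh.write(build_dictionary(EXTRA_TOKENS))
```

The resulting file is passed to afl-fuzz with the -x option, in the same way as pdf.dict.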

Recommended fuzzers

  • AFL++
  • libFuzzer
  • Honggfuzz