ZIP seed corpus

Dual-directory format with overlapping metadata fields — inconsistencies between local and central directory headers are a classic bug class.

ZIP archives store metadata in two places: local file headers (LFH) immediately preceding each compressed entry, and a central directory (CD) at the end of the archive. Parsers that trust one over the other, or that fail to validate that both agree, are vulnerable to a class of inconsistency bugs that includes path traversal (zip slip), integer overflow in field calculations, and heap overflows from mismatched entry counts. A fuzzing corpus must exercise archives with many entries, archives with the data descriptor extension, ZIP64 archives, self-extracting archives, and archives with comment fields at the maximum length.
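
The agreement check described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name `find_mismatches` is hypothetical, the struct layouts follow the fixed-size portions of the standard ZIP records, and ZIP64, multi-disk archives, and signature bytes embedded in the comment are deliberately ignored.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end of central directory record
CD_SIG = b"PK\x01\x02"    # central directory file header
LFH_SIG = b"PK\x03\x04"   # local file header


def find_mismatches(data: bytes) -> list:
    """Compare each central directory entry with the local header it points to."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no end-of-central-directory record found")
    # EOCD: sig, disk, cd_disk, entries_on_disk, entries_total, cd_size, cd_offset, comment_len
    fields = struct.unpack_from("<4s4H2LH", data, eocd)
    total_entries, cd_offset = fields[4], fields[6]

    problems = []
    pos = cd_offset
    for _ in range(total_entries):
        cd = struct.unpack_from("<4s6H3L5H2L", data, pos)  # 46 fixed bytes
        if cd[0] != CD_SIG:
            raise ValueError(f"bad central directory signature at offset {pos}")
        cd_name = data[pos + 46 : pos + 46 + cd[10]]
        lfh_offset = cd[16]
        lfh = struct.unpack_from("<4s5H3L2H", data, lfh_offset)  # 30 fixed bytes
        if lfh[0] != LFH_SIG:
            problems.append(f"{cd_name!r}: no local header at offset {lfh_offset}")
        else:
            lfh_name = data[lfh_offset + 30 : lfh_offset + 30 + lfh[9]]
            if lfh_name != cd_name:
                problems.append(f"{cd_name!r}: local header name is {lfh_name!r}")
            if lfh[3] != cd[4]:
                problems.append(f"{cd_name!r}: compression method differs")
            # With a data descriptor (flag bit 3) the LFH sizes are legitimately zero.
            if not (lfh[2] & 0x08) and lfh[7] != cd[8]:
                problems.append(f"{cd_name!r}: compressed size differs")
        # Advance past the name, extra field, and per-entry comment.
        pos += 46 + cd[10] + cd[11] + cd[12]
    return problems


# Usage: build a clean archive, then rename the entry in the local header only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
clean = buf.getvalue()
tampered = clean.replace(b"a.txt", b"b.txt", 1)  # first occurrence is the LFH name
print(find_mismatches(clean))
print(find_mismatches(tampered))
```

A checker like this is also handy for triaging a corpus: seeds that already report mismatches are the ones most likely to reach the inconsistency-handling paths in the target.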

The Deflate compression used inside ZIP entries adds a second attack surface: decompressors must handle invalid block types, Huffman trees with degenerate code lengths, and back-references that point outside the sliding window. libarchive and libzip both expose this surface, and OSS-Fuzz has found numerous bugs in their Deflate implementations by fuzzing with a corpus of minimised ZIP files.
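
Two of the malformed inputs mentioned above are small enough to build by hand. The sketch below assumes Python's standard `zlib` module as a stand-in decompressor; the `BitWriter` class and the helper names are hypothetical. It emits a raw Deflate stream with the reserved block type and another whose first symbol is a back-reference pointing before the start of the output.

```python
import zlib


class BitWriter:
    """Collects bits and packs them into bytes, least-significant bit first."""

    def __init__(self):
        self.bits = []

    def write(self, value, nbits):
        # Header fields are stored starting from their low-order bit.
        for i in range(nbits):
            self.bits.append((value >> i) & 1)

    def write_huffman(self, code, nbits):
        # Huffman codes are stored starting from their high-order bit.
        for i in reversed(range(nbits)):
            self.bits.append((code >> i) & 1)

    def to_bytes(self):
        out = bytearray((len(self.bits) + 7) // 8)
        for i, bit in enumerate(self.bits):
            out[i // 8] |= bit << (i % 8)
        return bytes(out)


def reserved_block_type() -> bytes:
    w = BitWriter()
    w.write(1, 1)  # BFINAL = 1
    w.write(3, 2)  # BTYPE = 11, which the Deflate format reserves
    return w.to_bytes()


def backref_before_start() -> bytes:
    w = BitWriter()
    w.write(1, 1)                  # BFINAL = 1
    w.write(1, 2)                  # BTYPE = 01, fixed Huffman codes
    w.write_huffman(0b0000001, 7)  # symbol 257: match of length 3
    w.write_huffman(0b00000, 5)    # distance code 0: distance 1, but no output exists yet
    w.write_huffman(0b0000000, 7)  # symbol 256: end of block
    return w.to_bytes()


def is_rejected(raw: bytes) -> bool:
    """Return True if zlib refuses the raw Deflate stream (wbits=-15 means no header)."""
    try:
        zlib.decompress(raw, wbits=-15)
    except zlib.error:
        return True
    return False


print(is_rejected(reserved_block_type()))
print(is_rejected(backref_before_start()))
```

A decompressor that accepts either stream without an error is worth a closer look. To use them as seeds, the bytes would still need to be wrapped in a ZIP entry with compression method 8.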

ZIP is also the container format for DOCX, XLSX, PPTX, JAR, APK, and many other high-level formats, which means bugs in the ZIP layer can have cascading effects on format-specific parsers built on top.

Building + curating your corpus

  • Include ZIP64 archives alongside regular ZIP archives (the headers can simulate entries larger than 4 GB without storing that much data); many parsers have separate code paths for the 64-bit extensions.
  • Add archives with deliberate local/central-directory mismatches: different filenames, different compressed sizes, or different compression methods in LFH vs CD.
  • Include zero-entry archives, archives with the maximum comment length (65535 bytes), and archives with duplicate filenames.
  • Use afl-cmin to reduce a crawled corpus of real-world ZIP files; JAR and DOCX files are valid ZIP archives and add real-world feature diversity.
  • Keep entries small (a few bytes each) but vary entry count widely (1, 64, 1000+) to stress iteration logic.
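
Several of the seeds listed above can be generated with Python's standard `zipfile` module. The following is a rough sketch: the helper names, entry names, and file names are all hypothetical, and it covers only the zero-entry, maximum-comment, duplicate-name, and varied-entry-count cases. ZIP64 and mismatched-header seeds would need hand-built records.

```python
import io
import os
import warnings
import zipfile


def make_zip(names, comment=b""):
    """Build an in-memory archive with one tiny entry per name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names but still writes them.
        warnings.simplefilter("ignore")
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            for name in names:
                zf.writestr(name, b"x")  # keep payloads to a single byte
            zf.comment = comment
    return buf.getvalue()


def build_seeds():
    """Return a mapping of seed file name to archive bytes."""
    seeds = {
        "empty.zip": make_zip([]),
        "max_comment.zip": make_zip(["a"], comment=b"A" * 65535),
        "duplicate_names.zip": make_zip(["dup", "dup"]),
    }
    for count in (1, 64, 1000):
        names = [f"f{i}" for i in range(count)]
        seeds[f"entries_{count}.zip"] = make_zip(names)
    return seeds


def write_seeds(out_dir):
    """Write every seed into out_dir, creating the directory if needed."""
    os.makedirs(out_dir, exist_ok=True)
    for filename, data in build_seeds().items():
        with open(os.path.join(out_dir, filename), "wb") as fh:
            fh.write(data)
```

Calling `write_seeds("seeds")` would produce a starting directory that can then be merged with crawled files and passed through afl-cmin.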

Mutator hints

  • Use AFL++ zip.dict to inject ZIP signature bytes (PK\x03\x04, PK\x01\x02, PK\x05\x06) as dictionary tokens.
  • A custom mutator that independently mutates local and central directory headers without synchronising them is the most direct way to generate inconsistency bugs.
  • For Deflate fuzzing inside ZIP entries, replace valid Deflate streams with streams that have invalid block type bits, over-subscribed Huffman codes, or back-references that reach before the start of the output or use the reserved distance codes 30 and 31 (valid distances stop at 32768).
  • CMPLOG mode in AFL++ recovers the PK magic signatures and version-needed values that the parser checks before reading any structured fields.
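
The unsynchronised header mutation suggested above might look like the sketch below. It assumes the AFL++ Python custom mutator interface, where a module exposes `init(seed)` and `fuzz(buf, add_buf, max_size)`. The helper `desync_headers` is hypothetical, and the single bit flip is just one simple choice of mutation.

```python
import random

LFH_SIG, LFH_SIZE = b"PK\x03\x04", 30  # local file header, fixed part
CD_SIG, CD_SIZE = b"PK\x01\x02", 46    # central directory header, fixed part


def init(seed):
    # Called once by the fuzzer; seed the module-level RNG for reproducibility.
    random.seed(seed)


def _find_all(buf, sig):
    """Return every offset at which sig occurs in buf."""
    offsets = []
    pos = buf.find(sig)
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(sig, pos + 1)
    return offsets


def desync_headers(buf, rng=random):
    """Flip one bit in a single LFH or CD record, leaving its counterpart alone."""
    out = bytearray(buf)
    records = [(off, LFH_SIZE) for off in _find_all(out, LFH_SIG)]
    records += [(off, CD_SIZE) for off in _find_all(out, CD_SIG)]
    # Ignore records whose fixed-size header runs past the end of the buffer.
    records = [(off, size) for off, size in records if off + size <= len(out)]
    if not records:
        return out
    offset, size = rng.choice(records)
    # Skip the 4-byte signature so the parser still recognises the record.
    target = offset + rng.randrange(4, size)
    out[target] ^= 1 << rng.randrange(8)
    return out


def fuzz(buf, add_buf, max_size):
    # Entry point called by the fuzzer for each mutation; add_buf is unused here.
    return desync_headers(buf)[:max_size]
```

Because only one record is touched per call, every mutated field ends up disagreeing with its copy in the other directory, which is exactly the condition the target has to handle. With AFL++ the module would typically be loaded through the `AFL_PYTHON_MODULE` environment variable.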

Recommended fuzzers

  • AFL++
  • libFuzzer
  • Honggfuzz
  • Centipede