ZIP seed corpus
Dual-directory format with overlapping metadata fields — inconsistencies between local and central directory headers are a classic bug class.
ZIP archives store metadata in two places: local file headers (LFH) immediately preceding each compressed entry, and a central directory (CD) at the end of the archive. Parsers that trust one over the other, or that fail to validate that both agree, are vulnerable to a class of inconsistency bugs that includes path traversal (zip slip), integer overflow in field calculations, and heap overflows from mismatched entry counts. A fuzzing corpus must exercise archives with many entries, archives with the data descriptor extension, ZIP64 archives, self-extracting archives, and archives with comment fields at the maximum length.
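To make the local/central split concrete, here is a minimal Python sketch that walks the central directory and compares each entry against the local file header it points to. The function name `find_mismatches` and the message strings are illustrative only. The sketch assumes a single-disk archive without ZIP64 records and locates the end-of-central-directory record with a naive backwards search, so treat it as a triage aid rather than a validator.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
CD_SIG = b"PK\x01\x02"
LFH_SIG = b"PK\x03\x04"
CD_FMT = "<4sHHHHHHIIIHHHHHII"  # 46-byte fixed part of a central directory header
LFH_FMT = "<4sHHHHHIIIHH"       # 30-byte fixed part of a local file header


def find_mismatches(data: bytes) -> list:
    """Return human-readable LFH/CD disagreements (hypothetical helper)."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no end-of-central-directory record found")
    # Total entry count, CD size and CD offset sit at EOCD offsets 10, 12, 16.
    total, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)

    problems = []
    pos = cd_offset
    for index in range(total):
        cd = struct.unpack_from(CD_FMT, data, pos)
        if cd[0] != CD_SIG:
            raise ValueError(f"entry {index}: bad central directory signature")
        cd_method, cd_csize = cd[4], cd[8]
        name_len, extra_len, comment_len, lfh_offset = cd[10], cd[11], cd[12], cd[16]
        cd_name = data[pos + 46:pos + 46 + name_len]

        lfh = struct.unpack_from(LFH_FMT, data, lfh_offset)
        if lfh[0] != LFH_SIG:
            problems.append(f"entry {index}: CD offset does not point at an LFH")
        else:
            lfh_flags, lfh_method, lfh_csize, lfh_name_len = lfh[2], lfh[3], lfh[7], lfh[9]
            lfh_name = data[lfh_offset + 30:lfh_offset + 30 + lfh_name_len]
            if lfh_name != cd_name:
                problems.append(f"entry {index}: name {lfh_name!r} vs {cd_name!r}")
            if lfh_method != cd_method:
                problems.append(f"entry {index}: method {lfh_method} vs {cd_method}")
            # With a data descriptor (flag bit 3) the LFH size is legitimately zero.
            if not lfh_flags & 0x08 and lfh_csize != cd_csize:
                problems.append(f"entry {index}: csize {lfh_csize} vs {cd_csize}")

        pos += 46 + name_len + extra_len + comment_len
    return problems
```

Running a helper like this over a corpus is one way to confirm that seeds intended to be inconsistent really are, and that the "clean" seeds are not inconsistent by accident.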
The Deflate compression used inside ZIP entries adds a second attack surface: decompressors must handle invalid block types, Huffman trees with degenerate code lengths, and back-references that point outside the sliding window. libarchive and libzip both expose this surface when they decompress entries (both normally delegate Deflate to zlib), and OSS-Fuzz has found numerous bugs in their decompression paths by fuzzing with a corpus of minimised ZIP files.
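Two of these malformed cases fit in a few bytes each. The sketch below hand-assembles raw Deflate streams and checks that Python's `zlib` binding rejects them. The constant names and the `raw_inflate_rejects` helper are illustrative, and the byte values follow from the RFC 1951 bit layout (header bits packed LSB first, fixed Huffman codes MSB first).

```python
import zlib

# BFINAL=1, BTYPE=0b11: the reserved block type.
RESERVED_BLOCK_TYPE = b"\x07"

# BFINAL=1, BTYPE=0b01 (fixed Huffman), then length code 257 (length 3) and
# distance code 0 (distance 1) as the very first symbol, so the back-reference
# points before the start of the output.
DANGLING_BACKREF = b"\x03\x02\x00"

# BFINAL=1, BTYPE=0b01, end-of-block: a valid stream that inflates to nothing.
EMPTY_FIXED_BLOCK = b"\x03\x00"


def raw_inflate_rejects(stream: bytes) -> bool:
    """True if zlib refuses the raw (headerless) Deflate stream."""
    try:
        zlib.decompress(stream, wbits=-15)  # negative wbits selects raw Deflate
    except zlib.error:
        return True
    return False
```

To turn these into seeds, one option is to splice such a stream in as the payload of a ZIP entry whose method field is 8 and fix up the compressed-size fields by hand. That step is left out here because it depends on how strictly the target checks sizes and CRCs.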
ZIP is also the container format for DOCX, XLSX, PPTX, JAR, APK, and many other high-level formats, which means bugs in the ZIP layer can have cascading effects on format-specific parsers built on top.
Building + curating your corpus
- →Include ZIP64 archives (headers can simulate entries larger than 4 GB without storing the data) alongside regular ZIP archives; many parsers have separate code paths for the 64-bit extensions.
- →Add archives with deliberate local/central-directory mismatches: different filenames, different compressed sizes, or different compression methods in LFH vs CD.
- →Include zero-entry archives, archives with the maximum comment length (65535 bytes), and archives with duplicate filenames.
- →Use afl-cmin to reduce a crawled corpus of real-world ZIP files; JAR and DOCX files are valid ZIP archives and add real-world feature diversity.
- →Keep entries small (a few bytes each) but vary entry count widely (1, 64, 1000+) to stress iteration logic.
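Several of the seeds listed above can be generated with Python's standard `zipfile` module. The following is a rough sketch, not a complete corpus builder: the seed file names and the `make_zip`, `build_seeds` and `write_seeds` helpers are made up for illustration. Note that `force_zip64=True` only makes `zipfile` write a ZIP64 extra field into the local header of a tiny entry; a ZIP64 end-of-central-directory record would need hand-crafting or a very large archive.

```python
import io
import pathlib
import warnings
import zipfile


def make_zip(entries, comment=b"", force_zip64=False):
    """Build an archive in memory from (name, payload) pairs."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile emits a UserWarning for duplicate names; here they are intended.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, payload in entries:
                with zf.open(name, "w", force_zip64=force_zip64) as dest:
                    dest.write(payload)
            zf.comment = comment
    return buf.getvalue()


def build_seeds():
    """Return a dict mapping seed file name to archive bytes."""
    seeds = {"empty.zip": make_zip([])}
    for count in (1, 64, 1000):
        entries = [(f"f{i}.txt", b"hi") for i in range(count)]
        seeds[f"entries_{count}.zip"] = make_zip(entries)
    seeds["max_comment.zip"] = make_zip([("a.txt", b"hi")], comment=b"C" * 65535)
    seeds["duplicate_names.zip"] = make_zip([("dup.txt", b"one"), ("dup.txt", b"two")])
    seeds["zip64_headers.zip"] = make_zip([("big.txt", b"hi")], force_zip64=True)

    # Local/central mismatch: flip one bit of the file name in the first local
    # header (its name starts at byte 30) and leave the central directory alone.
    mismatched = bytearray(seeds["entries_1.zip"])
    mismatched[30] ^= 0x01
    seeds["lfh_cd_name_mismatch.zip"] = bytes(mismatched)
    return seeds


def write_seeds(directory):
    """Write every seed into the given directory, creating it if needed."""
    out = pathlib.Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    for name, blob in build_seeds().items():
        (out / name).write_bytes(blob)
```

Calling `write_seeds("corpus/zip")` (the path is only an example) produces a small starting set that can then be merged with crawled real-world files before minimisation.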
Mutator hints
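- →Use the zip.dict dictionary shipped with AFL++ to inject ZIP signature bytes as dictionary tokens: PK\x03\x04 (local file header), PK\x01\x02 (central directory header), and PK\x05\x06 (end of central directory).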
- →A custom mutator that independently mutates local and central directory headers without synchronising them is the most direct way to generate inconsistency bugs.
- →For Deflate fuzzing inside ZIP entries, replace valid Deflate streams with streams that have invalid block type bits, over-subscribed Huffman codes, or back-references with distance > 32768.
- →CMPLOG mode in AFL++ recovers the PK magic signatures and version-needed values that the parser checks before reading any structured fields.
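The unsynchronised-header idea can be sketched as a Python custom mutator for AFL++. The `init`, `fuzz` and `deinit` hooks follow AFL++'s documented Python custom mutator interface (the module is selected with the `AFL_PYTHON_MODULE` environment variable). Everything else is an assumption made for illustration: the `desync` helper, the choice of fields, and the list of interesting values. Each call overwrites one field in one header, either local or central, and never touches its counterpart.

```python
import random
import struct

LFH_SIG = b"PK\x03\x04"
CD_SIG = b"PK\x01\x02"

# (offset inside the header, struct format) for fields that both headers carry.
LFH_FIELDS = {"method": (8, "<H"), "csize": (18, "<I"),
              "usize": (22, "<I"), "name_len": (26, "<H")}
CD_FIELDS = {"method": (10, "<H"), "csize": (20, "<I"),
             "usize": (24, "<I"), "name_len": (28, "<H")}
INTERESTING = (0, 1, 8, 0x7FFF, 0xFFFF, 0x7FFFFFFF, 0xFFFFFFFF)


def _find_all(buf, sig):
    """Return every offset at which sig occurs in buf."""
    hits, pos = [], buf.find(sig)
    while pos != -1:
        hits.append(pos)
        pos = buf.find(sig, pos + 1)
    return hits


def desync(buf, rng):
    """Overwrite one field in one LFH or CD header, leaving its twin untouched."""
    out = bytearray(buf)
    targets = [(pos, LFH_FIELDS) for pos in _find_all(out, LFH_SIG)]
    targets += [(pos, CD_FIELDS) for pos in _find_all(out, CD_SIG)]
    if not targets:
        return out  # nothing that looks like a ZIP header; pass through
    pos, fields = rng.choice(targets)
    offset, fmt = rng.choice(list(fields.values()))
    size = struct.calcsize(fmt)
    if pos + offset + size > len(out):
        return out  # truncated header; leave it for other mutators
    mask = (1 << (8 * size)) - 1
    struct.pack_into(fmt, out, pos + offset, rng.choice(INTERESTING) & mask)
    return out


# --- AFL++ Python custom mutator hooks ---
_rng = random.Random()


def init(seed):
    _rng.seed(seed)


def fuzz(buf, add_buf, max_size):
    return desync(buf, _rng)[:max_size]


def deinit():
    pass
```

Because the mutator never changes the input length, it composes with AFL++'s built-in havoc stage, which continues to run unless custom-only mode is requested.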
Recommended fuzzers
- → AFL++
- → libFuzzer
- → Honggfuzz
- → Centipede