TAR seed corpus

Sequential header-data blocks with octal numeric fields — path traversal and header field overflow are the two dominant bug classes.

TAR archives are a sequence of 512-byte header blocks, each followed by the entry's data padded to a multiple of 512 bytes. The POSIX ustar format and the extended GNU tar format both encode numeric fields (file size, uid, gid, mtime) as octal ASCII strings in fixed-width fields. A NUL or space terminator is not guaranteed to be present, so a parser that assumes one can read past the end of the field, which makes integer parsing a frequent source of bugs. GNU tar extensions add multi-volume archives, long filename records (././@LongLink), and sparse file headers that significantly complicate the parser.
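As a rough illustration of the layout described above, the sketch below reads a few fields from one 512-byte header block. The offsets are the standard ustar ones; the function names `parse_octal` and `parse_header` are hypothetical, and the base-256 branch handles only non-negative GNU values (marker byte 0x80).

```python
BLOCK_SIZE = 512


def parse_octal(field: bytes) -> int:
    """Parse a fixed-width numeric field without assuming a terminator."""
    # GNU base-256: a leading 0x80 byte marks a non-negative big-endian value.
    # Negative values (leading 0xff) are deliberately not handled in this sketch.
    if field[:1] == b"\x80":
        return int.from_bytes(field[1:], "big")
    # Octal text: cut at the first NUL if there is one, then drop space padding.
    text = field.split(b"\0", 1)[0].strip(b" ")
    return int(text.decode("ascii"), 8) if text else 0


def parse_header(block: bytes) -> dict:
    """Extract a handful of fields from a single ustar header block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a tar header is exactly 512 bytes")
    return {
        "name": block[0:100].split(b"\0", 1)[0].decode("utf-8", "replace"),
        "uid": parse_octal(block[108:116]),
        "gid": parse_octal(block[116:124]),
        "size": parse_octal(block[124:136]),
        "mtime": parse_octal(block[136:148]),
        "typeflag": block[156:157],
        "magic": block[257:263],
    }
```

The slicing is bounded by the field width rather than by a terminator, which is the property a C parser has to reproduce by hand.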

The dominant vulnerability class in TAR parsing is path traversal: extractors that write entries without stripping leading slashes or normalising '..' components can be exploited to write files outside the intended extraction directory. Fuzzing for this requires corpus files with absolute paths, paths containing '../', and symlink entries that point outside the archive root. Sanitizers catch only the memory-safety consequences. Path traversal itself is a logic bug, so the harness needs an application-level check that every resolved output path stays inside the extraction directory.
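One way to express such an application-level check is sketched below. It is purely lexical and assumes POSIX-style entry names; the function names `is_unsafe_path` and `is_unsafe_symlink` are hypothetical. It does not account for symlinks that an earlier entry has already created on disk, so a real extractor would also need to resolve paths against the filesystem.

```python
import posixpath


def _escapes_root(normalised: str) -> bool:
    """True if a normalised relative path climbs above its starting directory."""
    return normalised == ".." or normalised.startswith("../")


def is_unsafe_path(name: str) -> bool:
    """Flag entry names that are absolute or traverse out of the extraction root."""
    if name.startswith("/"):
        return True
    return _escapes_root(posixpath.normpath(name))


def is_unsafe_symlink(entry_name: str, target: str) -> bool:
    """Flag symlinks whose target is absolute or resolves outside the root."""
    if target.startswith("/"):
        return True
    # A relative target is resolved against the directory holding the link.
    base = posixpath.dirname(entry_name)
    return _escapes_root(posixpath.normpath(posixpath.join(base, target)))
```

A harness could call these on every entry and abort (for example with an assertion) when the code under test extracts an entry the check flags, turning a silent traversal into a crash the fuzzer can report.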

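libarchive supports TAR as one of many archive formats and also handles the encoding conversions required for non-ASCII filenames. The format-detection layer, which inspects the leading bytes of the input to distinguish TAR from ZIP, CPIO, and 7z, is itself an interesting fuzzing target when fed inputs that mix format signatures. Unlike those formats, TAR has no signature at offset 0: the 'ustar' magic sits at offset 257 of the header block, so another format's signature can occupy the start of the same block.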
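A mixed-signature seed can be sketched as follows: a single ustar header whose name field begins with another format's signature, followed by the two zero blocks that end an archive. The signature bytes are the well-known ones for ZIP, 7z, and old-style cpio; the helper name `tar_header_with_prefix` is hypothetical, and how any particular detector scores such an input is not something this sketch assumes.

```python
SIGNATURES = {
    "zip": b"PK\x03\x04",            # ZIP local file header
    "7z": b"7z\xbc\xaf\x27\x1c",     # 7z container signature
    "cpio_odc": b"070707",           # cpio portable ASCII format
}


def tar_header_with_prefix(prefix: bytes) -> bytes:
    """Build a checksum-valid ustar header whose name field starts with `prefix`."""
    hdr = bytearray(512)
    hdr[0:len(prefix)] = prefix           # bytes 0..99 are the name field
    hdr[100:108] = b"0000644\0"           # mode
    hdr[108:116] = b"0000000\0"           # uid
    hdr[116:124] = b"0000000\0"           # gid
    hdr[124:136] = b"00000000000\0"       # size: empty file
    hdr[136:148] = b"00000000000\0"       # mtime
    hdr[156:157] = b"0"                   # typeflag: regular file
    hdr[257:263] = b"ustar\0"             # POSIX magic
    hdr[263:265] = b"00"                  # version
    # The checksum is computed with its own field filled with spaces.
    hdr[148:156] = b" " * 8
    hdr[148:156] = b"%06o\0 " % sum(hdr)
    return bytes(hdr)


def mixed_seeds() -> dict:
    """One seed per foreign signature, each terminated by two zero blocks."""
    end_of_archive = b"\0" * 1024
    return {
        label: tar_header_with_prefix(sig) + end_of_archive
        for label, sig in SIGNATURES.items()
    }
```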

Where to grab a starter corpus

Building + curating your corpus

→Seed with both POSIX ustar and GNU tar format archives — they share a common 512-byte header layout but diverge in extension headers; both need coverage.
→Include archives with long filenames via GNU extension records (././@LongLink), sparse files, and multi-volume markers.
→Add a TAR file with a symlink entry whose target is an absolute path or contains '../' to exercise path-safety checks in the extraction layer.
→Keep data blocks minimal (1-byte files) to maximise the number of header blocks per corpus file and thus the header-parsing coverage per input.
→Mix POSIX numeric encoding (octal strings) with GNU base-256 encoding in numeric fields to exercise both parsing branches.
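Several of the points above can be covered with Python's standard `tarfile` module. The sketch below is one possible seed generator, not a prescribed one: the entry names, the `build_seed` helper, and the oversized uid are all illustrative choices. It does not produce sparse or multi-volume archives, which would need GNU tar itself or hand-built headers.

```python
import io
import tarfile


def build_seed(fmt: int) -> bytes:
    """Build a small archive in the given tarfile format constant."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        # 1-byte regular file: minimal data, maximal header density.
        info = tarfile.TarInfo("a.txt")
        info.size = 1
        tar.addfile(info, io.BytesIO(b"x"))

        # Symlink whose target climbs out of the extraction root.
        link = tarfile.TarInfo("link")
        link.type = tarfile.SYMTYPE
        link.linkname = "../../etc/passwd"
        tar.addfile(link)

        if fmt != tarfile.USTAR_FORMAT:
            # Name longer than 100 bytes: GNU emits a ././@LongLink record,
            # PAX emits an extended header carrying the path.
            long_info = tarfile.TarInfo("d/" * 60 + "long.txt")
            long_info.size = 1
            tar.addfile(long_info, io.BytesIO(b"y"))

        if fmt == tarfile.GNU_FORMAT:
            # A uid too large for 7 octal digits makes tarfile fall back
            # to GNU base-256 encoding for that field.
            big = tarfile.TarInfo("biguid.txt")
            big.size = 1
            big.uid = 8 ** 7
            tar.addfile(big, io.BytesIO(b"z"))
    return buf.getvalue()


def build_corpus() -> dict:
    """Return one seed per format, keyed by a suggested file name."""
    return {
        "ustar.tar": build_seed(tarfile.USTAR_FORMAT),
        "gnu.tar": build_seed(tarfile.GNU_FORMAT),
        "pax.tar": build_seed(tarfile.PAX_FORMAT),
    }
```

Writing each value of `build_corpus()` to a file in the seed directory gives the fuzzer one starting point per header dialect.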

Mutator hints

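→Mutate octal numeric fields in TAR headers with boundary values (e.g. '77777777777\0' = 8 GiB − 1, the largest size an 11-digit octal field can hold, which already overflows a 32-bit size_t) to reach integer overflow checks; use GNU base-256 encoding to express values beyond that.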
→Use a custom mutator that generates adversarial GNU extension headers: ././@LongLink entries with lengths that mismatch the following data block size.
→CMPLOG mode helps AFL++ learn the 'ustar' magic string at offset 257 in the header block that distinguishes POSIX tar from other formats.
→Inject PAX extended headers (typeflag 'x' or 'g', conventionally stored under a 'PaxHeaders' path component) with repeated or oversized key=value records to stress the key-value parsing loop.
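The first hint can be sketched as a small structure-aware mutator: pick a header block, overwrite its size field with a boundary value, and recompute the checksum so the parser does not reject the block early. The value list and the `mutate_size` helper are illustrative assumptions. The `init` and `fuzz` functions mirror the hook names AFL++ documents for Python custom mutators; verify them against your AFL++ version before relying on them.

```python
import random

# Boundary values for the 12-byte size field (illustrative selection).
INTERESTING_SIZES = [
    b"77777777777\0",                          # 8 GiB - 1, largest terminated octal
    b"777777777777",                           # 12 digits, no terminator
    b"00000000000\0",                          # zero
    b"\x80" + b"\0" * 3 + b"\x7f" + b"\xff" * 7,  # base-256: 2**63 - 1
    b"\x80" + b"\xff" * 11,                    # base-256: 2**88 - 1
]


def fix_checksum(hdr: bytearray) -> None:
    """Recompute the header checksum in place (field counted as spaces)."""
    hdr[148:156] = b" " * 8
    hdr[148:156] = b"%06o\0 " % sum(hdr)


def mutate_size(data: bytes, rng: random.Random) -> bytes:
    """Replace the size field of one ustar-looking block and fix its checksum."""
    out = bytearray(data)
    # Candidate headers: 512-aligned blocks carrying the 'ustar' magic.
    offsets = [
        off for off in range(0, len(out) - 511, 512)
        if out[off + 257:off + 262] == b"ustar"
    ]
    if not offsets:
        return bytes(out)
    off = rng.choice(offsets)
    hdr = out[off:off + 512]
    hdr[124:136] = rng.choice(INTERESTING_SIZES)
    fix_checksum(hdr)
    out[off:off + 512] = hdr
    return bytes(out)


_rng = random.Random(0)


def init(seed):
    """AFL++-style hook: seed the mutator's random source."""
    global _rng
    _rng = random.Random(seed)


def fuzz(buf, add_buf, max_size):
    """AFL++-style hook: return one mutated input, capped at max_size."""
    return bytearray(mutate_size(bytes(buf), _rng))[:max_size]
```

Keeping the checksum valid is a design choice: it spends fuzzing time on the size-handling code rather than on the checksum rejection path, which the fuzzer's generic mutations already reach easily.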

Recommended fuzzers

→ AFL++
→ libFuzzer
→ Honggfuzz

Libraries that consume TAR

libarchive

One library, 30 archive formats — each is a separate bug surface with shared allocation logic.

Run a TAR fuzz campaign on Fuzze.rs →

Push a Dockerfile + harness + the corpus links above. First month 50% off.