DOC/DOCX seed corpus
Binary OLE2 and ZIP-based XML: two completely different parser stacks behind nearly identical file extensions.
The DOC (Word Binary File Format) format stores document content inside a Compound File Binary (CFB, also called OLE2) container. Each stream within the container (WordDocument, the table stream 0Table or 1Table, Data, and the SummaryInformation property stream) is parsed separately by the consuming application. A fuzzing corpus must exercise each stream in isolation as well as in combination, since bugs frequently arise from cross-stream consistency assumptions. DOCX wraps its content in a ZIP archive of Open Packaging Conventions (OPC) XML parts, which introduces a completely separate attack surface: the ZIP parser, the XML parser, and the schema-level validation logic are all in play at once.
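As a rough illustration of the two container layers described above, the following Python sketch tells the formats apart by their leading magic bytes, reads a few fixed-offset fields from the 512-byte CFB header, and lists the parts of a ZIP package. The function names are hypothetical, and the classification is only a heuristic: any OLE2 file is labelled "doc" and any ZIP file "docx", even though other Office formats share the same containers.

```python
import io
import struct
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # CFB header signature
ZIP_LOCAL_MAGIC = b"PK\x03\x04"                 # ZIP local file header signature


def classify(data: bytes) -> str:
    """Heuristic: return 'doc', 'docx', or 'unknown' from the leading bytes."""
    if data.startswith(OLE2_MAGIC):
        return "doc"
    if data.startswith(ZIP_LOCAL_MAGIC):
        return "docx"
    return "unknown"


def ole2_header_summary(data: bytes) -> dict:
    """Read a few fixed-offset fields from the 512-byte CFB header."""
    if len(data) < 512 or not data.startswith(OLE2_MAGIC):
        raise ValueError("not an OLE2/CFB file")
    # 0x18: minor version, major version, byte order, sector shift, mini sector shift
    _minor, major, _byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 0x18
    )
    # 0x2C: number of FAT sectors, 0x30: first directory sector
    num_fat_sectors, first_dir_sector = struct.unpack_from("<II", data, 0x2C)
    return {
        "major_version": major,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
    }


def docx_part_names(data: bytes) -> list:
    """List the part names stored in a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()
```

A check like this is mainly useful for triage: it shows which container parser a seed will reach first, before any stream or XML part is looked at.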
Both formats have a long history of code execution vulnerabilities because parsers are expected to render rich content including embedded OLE objects, macros, RTF fragments, and binary data structures like SPRM records. Mutators that understand the OLE2 sector-chain layout are more effective than byte-flip strategies at reaching deep parsing logic without hitting early rejection paths.
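To make the sector-chain layout concrete, here is a minimal Python sketch that loads the FAT and follows one chain with loop detection. It is an illustration under simplifying assumptions, not a full CFB reader: it only uses the 109 DIFAT entries embedded in the header (so it ignores extra DIFAT sectors in large files), and the helper names are hypothetical.

```python
import struct

MAXREGSECT = 0xFFFFFFFA  # largest valid regular sector number
ENDOFCHAIN = 0xFFFFFFFE  # terminates a sector chain
FREESECT = 0xFFFFFFFF    # unused FAT or DIFAT entry


def read_fat(data: bytes) -> list:
    """Concatenate the FAT sectors named by the header's DIFAT entries."""
    if len(data) < 512:
        raise ValueError("file is shorter than a CFB header")
    sector_size = 1 << struct.unpack_from("<H", data, 0x1E)[0]
    fat = []
    # The header holds the first 109 DIFAT entries starting at offset 0x4C.
    for sid in struct.unpack_from("<109I", data, 0x4C):
        if sid > MAXREGSECT:
            break  # FREESECT marks the end of the used entries
        # Sector n starts at (n + 1) * sector_size, after the header.
        offset = (sid + 1) * sector_size
        if offset + sector_size > len(data):
            raise ValueError("FAT sector %d lies outside the file" % sid)
        fat.extend(struct.unpack_from("<%dI" % (sector_size // 4), data, offset))
    return fat


def walk_chain(fat: list, start: int) -> list:
    """Follow a sector chain from `start`, rejecting loops and stray links."""
    chain, seen = [], set()
    sid = start
    while sid <= MAXREGSECT:
        if sid in seen or sid >= len(fat):
            raise ValueError("corrupt chain at sector %d" % sid)
        seen.add(sid)
        chain.append(sid)
        sid = fat[sid]
    return chain
```

A mutator that can walk chains this way knows which bytes are container metadata and which are stream payload, which is what lets it avoid the early rejection paths mentioned above.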
libgsf and LibreOffice's filter infrastructure are common open-source targets. OSS-Fuzz continuously fuzzes LibreOffice's DOC and DOCX import filters, so the seed corpora maintained for that project are well-exercised starting points.
Building + curating your corpus
- →Seed separately for DOC (binary OLE2) and DOCX (ZIP+XML) — they exercise completely different parser paths and should not be mixed in a single corpus directory.
- →Include intentionally corrupt OLE2 FAT chains and mismatched sector counts to reach error-recovery code paths in OLE2 parsers.
- →For DOCX, include ZIP files with duplicate part names, non-UTF-8 filenames, and oversized central directory entries to stress the archive-parsing and decompression layers before the XML layer is even reached.
- →Crawl public government document repositories for real-world diversity — they often include files exported from unusual Office versions that exercise legacy compatibility paths.
- →Keep individual corpus files under 64 KB for libFuzzer campaigns; DOCX files can be stripped of embedded images while preserving structure.
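One of the malformed-ZIP cases from the list above can be sketched with Python's standard zipfile module: a minimal DOCX-shaped package in which word/document.xml appears twice. The XML bodies are deliberately tiny placeholders, the function names are hypothetical, and the 64 KB constant simply mirrors the size budget suggested above.

```python
import io
import warnings
import zipfile

MAX_SEED_BYTES = 64 * 1024  # size budget suggested for libFuzzer seeds

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Default Extension="rels" ContentType='
    '"application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Relationships xmlns='
    '"http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    "</Relationships>"
)


def document_xml(text: str) -> str:
    """Return a one-paragraph WordprocessingML body containing `text`."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w='
        '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:body></w:document>" % text
    )


def build_duplicate_part_docx() -> bytes:
    """Build a small package whose main document part is stored twice."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names but still writes both entries.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("[Content_Types].xml", CONTENT_TYPES)
            zf.writestr("_rels/.rels", RELS)
            zf.writestr("word/document.xml", document_xml("first"))
            zf.writestr("word/document.xml", document_xml("second"))
    return buf.getvalue()


def within_budget(seed: bytes, limit: int = MAX_SEED_BYTES) -> bool:
    """Check a seed against the corpus size budget."""
    return len(seed) <= limit
```

Parsers disagree on which of the two duplicate entries wins, so a seed like this gives the fuzzer a starting point that already sits on an ambiguous path.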
Mutator hints
- →Use an AFL++-format dictionary such as office.dict to inject OOXML namespace URIs and DOC SPRM opcodes as interesting tokens.
- →For OLE2 fuzzing, write a custom mutator that recomputes FAT sector chains after byte-level mutations; this prevents trivial rejection at the container layer.
- →Grammar-aware mutation over OOXML schemas (DrawingML, WordprocessingML) surfaces type-confusion bugs that random mutation rarely reaches.
- →CMPLOG in AFL++ helps the fuzzer get past hardcoded magic-byte comparisons such as the OLE2 signature (D0 CF 11 E0 A1 B1 1A E1) and the ZIP local file header signature (PK\x03\x04, bytes 50 4B 03 04).
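A container-aware mutator can be sketched in Python as follows. Note that this is a simpler stand-in for the chain-recomputing mutator described above: instead of rebuilding the FAT after mutation, it leaves the header and the FAT sectors untouched and flips bits only in the remaining bytes, which has a similar effect on early rejection. The init and fuzz hooks follow the AFL++ Python custom mutator interface; the helper names and the flip count are assumptions for illustration, and only the 109 DIFAT entries stored in the header are considered.

```python
import random
import struct

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
HEADER_SIZE = 512
MAXREGSECT = 0xFFFFFFFA  # largest valid regular sector number


def protected_ranges(data: bytes) -> list:
    """Return (start, end) byte ranges for the header and the FAT sectors."""
    if len(data) < HEADER_SIZE or not data.startswith(OLE2_MAGIC):
        return []  # not OLE2: nothing to protect
    sector_size = 1 << struct.unpack_from("<H", data, 0x1E)[0]
    ranges = [(0, HEADER_SIZE)]
    # Header DIFAT entries at 0x4C name the sectors that hold the FAT.
    for sid in struct.unpack_from("<109I", data, 0x4C):
        if sid > MAXREGSECT:
            continue
        start = (sid + 1) * sector_size
        if start + sector_size <= len(data):
            ranges.append((start, start + sector_size))
    return ranges


def mutate_payload(data: bytes, rng: random.Random, flips: int = 8) -> bytearray:
    """Flip random bits while leaving the protected ranges byte-identical."""
    out = bytearray(data)
    if not out:
        return out
    ranges = protected_ranges(data)
    for _ in range(flips):
        # Rejection sampling: retry a few times if we land on protected bytes.
        for _attempt in range(16):
            i = rng.randrange(len(out))
            if not any(start <= i < end for start, end in ranges):
                out[i] ^= 1 << rng.randrange(8)
                break
    return out


_rng = random.Random()


def init(seed):
    """AFL++ hook: called once with a seed for the mutator's RNG."""
    _rng.seed(seed)


def fuzz(buf, add_buf, max_size):
    """AFL++ hook: return a mutated copy of buf no longer than max_size."""
    return mutate_payload(bytes(buf), _rng)[:max_size]
```

If the file is saved under a hypothetical module name such as ole2_mutator.py, AFL++ can load it by setting AFL_PYTHON_MODULE=ole2_mutator with the module's directory on PYTHONPATH.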
Recommended fuzzers
- → AFL++
- → libFuzzer
- → Honggfuzz
Libraries that consume DOC/DOCX