
RTF seed corpus

Plain-text container with binary payloads: RTF's readable syntax hides considerable parser complexity.

Rich Text Format is deceptively simple on the surface: a plain-text stream of control words (e.g. \rtf1, \ansi, \fonttbl) and groups delimited by braces. In practice, RTF parsers must handle deeply nested groups, hexadecimal-encoded payloads, raw binary runs introduced by the \bin keyword, embedded OLE objects, and field instructions that trigger macro-like evaluation. Recursive group handling without a depth limit is a classic source of stack exhaustion bugs.
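As a rough illustration of the structure described above, the following Python sketch walks an RTF byte stream, tracks brace depth against an explicit limit, and skips the raw bytes that follow \bin N. The function name scan_rtf and the default limit of 1000 are assumptions made for this example, and the scanner is deliberately simplified: it is not a conforming RTF parser.

```python
import re

# A control word: backslash, letters, optional signed number, optional one-space delimiter.
_CONTROL_WORD = re.compile(rb"\\([a-zA-Z]+)(-?\d+)? ?")


def scan_rtf(data: bytes, max_depth: int = 1000):
    """Return (deepest_group_level, control_word_names) for an RTF byte string.

    Raises ValueError on nesting beyond max_depth, unbalanced braces,
    or a bin byte count that runs past the end of the input.
    """
    depth = 0
    deepest = 0
    words = []
    i = 0
    n = len(data)
    while i < n:
        c = data[i:i + 1]
        if c == b"{":
            depth += 1
            if depth > max_depth:
                raise ValueError("group nesting exceeds limit")
            deepest = max(deepest, depth)
            i += 1
        elif c == b"}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing brace")
            i += 1
        elif c == b"\\":
            m = _CONTROL_WORD.match(data, i)
            if m is None:
                # Control symbol (escaped brace, escaped backslash, \'hh, ...):
                # skip the backslash and the byte that follows it.
                i += 2
                continue
            name = m.group(1).decode("ascii")
            words.append(name)
            i = m.end()
            if name == "bin":
                # \binN is followed by N raw bytes that must not be tokenized.
                count = int(m.group(2) or b"0")
                if count < 0 or i + count > n:
                    raise ValueError("bin count exceeds remaining input")
                i += count
        else:
            i += 1  # plain text byte
    if depth != 0:
        raise ValueError("unclosed group")
    return deepest, words
```

Using an explicit loop with a depth counter, instead of recursing once per group, is the design choice that keeps a scanner like this from exhausting the stack on deeply nested input.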

RTF has historically been one of the most exploited document formats, with numerous CVEs in Microsoft Word's RTF parser. The \objdata and \pict keywords embed arbitrary binary content that is handed to secondary decoders (WMF, EMF, OLE), which greatly expands the attack surface of any RTF-capable parser. A good corpus must include files that stress all major keyword classes: character formatting, paragraph formatting, table markup, embedded objects, and annotations.
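When triaging seeds, it can help to see which files actually carry an embedded payload and what its first bytes look like. The sketch below pulls the hex-encoded data that follows \objdata or \pict and decodes it to bytes. The helper name extract_hex_payloads is hypothetical, and the approach is a heuristic under a stated assumption: the hex digits sit directly in the same group as the keyword, with no nested group in between. Payloads preceded by a nested group (for example picture properties) are not recovered.

```python
import binascii
import re

# Keyword followed by everything up to the next brace (no nested groups handled).
_HEX_DEST = re.compile(rb"\\(objdata|pict)\b([^{}]*)")
# Control words such as \wmetafile8 or "\picw100 " that precede the hex data.
_CTRL = re.compile(rb"\\[a-zA-Z]+-?\d* ?")
_NON_HEX = re.compile(rb"[^0-9A-Fa-f]")


def extract_hex_payloads(data: bytes):
    """Return a list of (keyword, decoded_bytes) for hex payloads found in data."""
    payloads = []
    for m in _HEX_DEST.finditer(data):
        body = _CTRL.sub(b"", m.group(2))      # drop control words
        digits = _NON_HEX.sub(b"", body)        # drop whitespace and other bytes
        if len(digits) % 2:
            digits = digits[:-1]                # ignore a dangling half byte
        payloads.append((m.group(1).decode("ascii"), binascii.unhexlify(digits)))
    return payloads
```

Checking the first few decoded bytes against known signatures is one way to confirm that a seed will reach the intended secondary decoder.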

Because RTF is text-based, grammar-aware mutation is particularly effective. AFL++ CMPLOG mode and libFuzzer's value-profile mode (-use_value_profile=1) both perform well on RTF because parser internals are dense with magic string comparisons against control word names.

Building + curating your corpus

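  • RTF files compress very well, but store seeds uncompressed as plain text in the corpus directory: the fuzzer mutates raw file bytes, so bit flips on a compressed seed would mostly break the compression layer instead of reaching the RTF parser.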
  • Include deeply nested brace groups (>100 levels) and files with \bin followed by large byte counts to stress depth-limit and allocation checks.
  • Add RTF files containing embedded WMF and EMF images via \pict — these reach a completely separate rendering code path.
  • Mix RTF files generated by different applications: Microsoft Word exports, LibreOffice Writer, and AbiWord each produce distinct feature subsets.
  • Use afl-cmin to minimize the corpus by coverage: many RTF generators produce nearly identical control word sequences, and diversity in the corpus matters more than file count.
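
The stress seeds described in the list above (deep brace nesting and \bin with large byte counts) are easy to generate by script. The following is a minimal sketch: the function names, file names, and chosen sizes are all assumptions for illustration, and the output directory is whatever corpus path your campaign uses.

```python
from pathlib import Path


def nested_groups(depth: int) -> bytes:
    """An RTF document whose body is wrapped in `depth` nested groups."""
    return b"{\\rtf1\\ansi " + b"{" * depth + b"x" + b"}" * depth + b"}"


def bin_seed(declared: int, actual: int) -> bytes:
    """An RTF document with \\bin declaring `declared` bytes but carrying `actual`."""
    header = b"{\\rtf1\\ansi \\bin" + str(declared).encode("ascii") + b" "
    return header + b"\x00" * actual + b"}"


def write_seeds(out_dir) -> list:
    """Write a small set of stress seeds into out_dir and return their file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seeds = {
        "nest_0128.rtf": nested_groups(128),          # just past 100 levels
        "nest_4096.rtf": nested_groups(4096),         # likely beyond sane limits
        "bin_exact.rtf": bin_seed(16, 16),            # well-formed baseline
        "bin_oversized.rtf": bin_seed(2**31 - 1, 4),  # count far larger than the file
    }
    for name, blob in seeds.items():
        (out / name).write_bytes(blob)
    return sorted(seeds)
```

Keeping a well-formed baseline next to each malformed variant gives the fuzzer a valid path to mutate from as well as an immediate edge case.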

Mutator hints

  • Load the rtf.dict dictionary shipped with AFL++ (passed with -x) to give the fuzzer RTF control words as dictionary tokens. This is critical for reaching rarely exercised keyword handlers.
  • Use a text-mode grammar mutator (e.g. Grammarinator with an RTF grammar) for the first 24 hours to build a coverage-rich initial corpus, then switch to byte-level AFL++.
  • CMPLOG mode in AFL++ is highly effective for RTF because the parser does many strcmp/strncmp comparisons on control word names.
  • Inject \bin N with N larger than the remaining file size to stress length-validation paths in binary blob readers.
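
The \bin injection hint above can be sketched as an AFL++ Python custom mutator. The module below follows the init/fuzz entry points of the AFL++ custom mutator interface as I understand it (loaded through the AFL_PYTHON_MODULE environment variable); treat that wiring as an assumption and check it against the AFL++ version you run. The helper inject_oversized_bin and the upper bound on the excess count are choices made for this example.

```python
import random

_rng = random.Random(0)


def init(seed):
    """Called once by AFL++ with a seed; used here to seed the module RNG."""
    _rng.seed(seed)


def inject_oversized_bin(buf: bytes, rng: random.Random, max_size: int) -> bytearray:
    """Insert a \\binN token at a random offset, with N larger than what follows it."""
    out = bytearray(buf)
    pos = rng.randrange(len(out) + 1)
    remaining = len(out) - pos
    # Declare at least one byte more than the input can supply after the token.
    declared = remaining + rng.randint(1, 1 << 16)
    out[pos:pos] = b"\\bin" + str(declared).encode("ascii") + b" "
    # Truncation only shortens the tail, so the declared count stays too large.
    return out[:max_size]


def fuzz(buf, add_buf, max_size):
    """AFL++ mutation entry point: return the mutated test case as a bytearray."""
    return inject_oversized_bin(bytes(buf), _rng, max_size)
```

The insertion offset is random and may land inside an existing control word, which is acceptable for a fuzzing mutator since malformed neighbours are themselves useful inputs.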

Recommended fuzzers

  • AFL++
  • libFuzzer