JPEG seed corpus

Marker-segmented with Huffman entropy coding — entropy streams make pure random mutation less effective than marker-aware strategies.

JPEG stores image data as a sequence of marker segments (SOF, DHT, DQT, SOS, APP0, APP1/EXIF, COM). The entropy-coded scan data (ECS) that follows each SOS header, and in a single-scan file runs up to the EOI marker, is Huffman-coded in most files. That makes it largely opaque to random byte-flip mutation: a changed byte in the ECS usually just desynchronises the bitstream, so the decoder emits corrupt pixels or abandons the scan without reaching new decoding logic. Effective JPEG fuzzing therefore requires either marker-aware mutation or a corpus seeded with valid JFIF and EXIF files of diverse types.
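
To make the segment layout concrete, here is a minimal Python sketch that walks the marker segments of a file up to the first SOS header. The function name `walk_segments` is illustrative and does not come from any library. The sketch assumes a well-formed header and raises `ValueError` otherwise; it is not a substitute for a real parser.

```python
import struct

# Markers that carry no length field: SOI, EOI, TEM and RST0-RST7.
STANDALONE = {0xD8, 0xD9, 0x01} | set(range(0xD0, 0xD8))

def walk_segments(data: bytes):
    """Yield (marker, offset, payload) for each segment up to and including SOS.

    Iteration stops after the SOS header because the bytes that follow are
    entropy-coded scan data with no segment structure of their own.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    yield 0xD8, 0, b""
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:               # fill byte before a marker: skip it
            pos += 1
            continue
        if marker in STANDALONE:
            yield marker, pos, b""
            pos += 2
            continue
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # The big-endian 16-bit length counts its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(data):
            raise ValueError(f"bad segment length at offset {pos}")
        yield marker, pos, data[pos + 4:pos + 2 + length]
        pos += 2 + length
        if marker == 0xDA:               # SOS: scan data starts here
            return
```

Printing `hex(marker)` and `len(payload)` for each yielded tuple gives a quick view of which segments a seed file actually contains.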

libjpeg-turbo's SIMD-optimised inverse DCT (IDCT) and its colour-conversion paths are the regions where most memory-safety bugs have been found historically. Progressive JPEGs (which reassemble multiple scans into a single image) are particularly complex: the decoder must maintain coefficient state across many SOS markers, and bugs in progressive scan reassembly have led to heap overflows. EXIF metadata parsing is a secondary attack surface that can be reached before any pixel data is decoded.

A high-quality JPEG corpus should include baseline, progressive, arithmetic-coded, and lossless JPEG files; JFIF and raw EXIF thumbnails; images with oversized or undersized quantisation tables; and at least a few malformed-but-accepted files from real-world applications. OSS-Fuzz's libjpeg-turbo project provides production-quality seed corpora used in continuous fuzzing.

Building + curating your corpus

  • Include both JFIF (APP0) and EXIF (APP1) files — they exercise different metadata parsing paths and should both be in your starter set.
  • Add progressive JPEG files explicitly; many crawled corpora are dominated by baseline JPEGs, leaving progressive decode paths undercovered.
  • Strip embedded thumbnails from most files, keeping the main image, so that corpus files stay below 32 KB for higher throughput; keep a few thumbnail-bearing files so the thumbnail parsing path stays covered.
  • Include arithmetic-coded JPEG files if your target supports them — this codepath is rarely tested and has historically harboured unique bugs.
  • Use afl-cmin after initial corpus collection; many image hosting sites serve near-identical re-encoded files that add no new coverage.
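
One way to check the points above is to tally which JPEG variants a corpus directory actually contains. The following is a rough sketch, not a validated tool: `jpeg_features` and `tally` are hypothetical helper names, the header walk ignores standalone markers (which rarely appear before SOS), and the 32 KB threshold simply mirrors the size guideline in the list.

```python
from collections import Counter
from pathlib import Path

# SOF marker bytes and the coding process each one announces.
SOF_KINDS = {
    0xC0: "baseline",
    0xC1: "extended-sequential",
    0xC2: "progressive",
    0xC3: "lossless",
    0xC9: "arithmetic-sequential",
    0xCA: "arithmetic-progressive",
    0xCB: "arithmetic-lossless",
}

def jpeg_features(data: bytes) -> set:
    """Return coarse feature labels found in the header segments of one file."""
    feats = set()
    if data[:2] != b"\xff\xd8":          # no SOI: not a JPEG
        return feats
    if len(data) > 32 * 1024:
        feats.add("over-32KB")
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xFF:               # fill byte
            pos += 1
            continue
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        payload = data[pos + 4:pos + 2 + length]
        if marker in SOF_KINDS:
            feats.add(SOF_KINDS[marker])
        elif marker == 0xE0 and payload.startswith(b"JFIF\x00"):
            feats.add("jfif")
        elif marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            feats.add("exif")
        if marker == 0xDA or length < 2:  # SOS reached, or a broken length
            break
        pos += 2 + length
    return feats

def tally(corpus_dir) -> Counter:
    """Count how many files in corpus_dir show each feature."""
    counts = Counter()
    for path in Path(corpus_dir).iterdir():
        if path.is_file():
            counts.update(jpeg_features(path.read_bytes()) or {"unclassified"})
    return counts
```

Calling `tally("corpus/")` on a crawled set will typically show whether progressive, arithmetic-coded, or EXIF-bearing files are missing and need to be added by hand.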

Mutator hints

  • Use AFL++ jpeg.dict to teach the fuzzer JPEG marker bytes (0xFF followed by SOF/SOS/DHT/DQT type bytes) as interesting token replacements.
  • Write a marker-aware custom mutator that inserts, deletes, or reorders complete JPEG segments while preserving SOI (0xFF 0xD8) and EOI (0xFF 0xD9) markers.
  • CMPLOG mode in AFL++ is highly effective for EXIF metadata paths because EXIF parsers contain many comparisons against magic values (the "Exif\0\0" header, the II/MM byte-order mark) and 16-bit numeric tag IDs.
  • For entropy-coded data fuzzing, generate valid Huffman tables with adversarial symbol distributions (all symbols mapped to maximum code length) to stress Huffman decoder edge cases.
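
The marker-aware mutator and the adversarial Huffman table can be sketched together in Python. The `init` and `fuzz` hooks follow the shape of AFL++'s Python custom mutator interface. Everything else (`split_segments`, `max_length_dht`, `mutate_segments`) is a hypothetical helper written for this example. The sketch assumes header segments are all length-prefixed, leaves the scan data untouched, and caps the table at 255 symbols because each per-length count in a DHT segment is a single byte.

```python
import random

SOI, EOI = b"\xff\xd8", b"\xff\xd9"
STOP = (0xDA, 0xD9)  # SOS starts the entropy-coded scan, EOI ends the file

def split_segments(buf):
    """Split a JPEG into (length-prefixed header segments, untouched tail).

    Returns None when buf does not start with SOI. Parsing stops at SOS,
    EOI, or the first implausible length; the remainder becomes the tail.
    """
    buf = bytes(buf)
    if buf[:2] != SOI:
        return None
    segments, pos = [], 2
    while pos + 4 <= len(buf) and buf[pos] == 0xFF and buf[pos + 1] not in STOP:
        length = int.from_bytes(buf[pos + 2:pos + 4], "big")
        end = pos + 2 + length
        if length < 2 or end > len(buf):
            break
        segments.append(buf[pos:end])
        pos = end
    return segments, buf[pos:]

def max_length_dht(n_symbols=255):
    """Build a DHT segment for AC table 0 whose codes are all 16 bits long."""
    n = min(n_symbols, 255)              # each per-length count is one byte
    counts = bytes([0] * 15 + [n])       # no codes of length 1-15, n of length 16
    payload = b"\x10" + counts + bytes(range(n))
    return b"\xff\xc4" + (len(payload) + 2).to_bytes(2, "big") + payload

def mutate_segments(buf, add_buf, rng):
    """Insert, delete, or swap one whole segment; SOI, scan data and EOI stay put."""
    parsed = split_segments(buf)
    if parsed is None or not parsed[0]:
        return bytes(buf)                # not parseable: leave it to other mutators
    segments, tail = parsed
    op = rng.choice(("insert", "delete", "swap"))
    i = rng.randrange(len(segments))
    if op == "insert":
        donor = split_segments(add_buf) if add_buf else None
        pool = (donor[0] if donor else []) + [max_length_dht()]
        segments.insert(rng.randrange(len(segments) + 1), rng.choice(pool))
    elif op == "delete":
        del segments[i]
    else:
        j = rng.randrange(len(segments))
        segments[i], segments[j] = segments[j], segments[i]
    if not tail.endswith(EOI):
        tail += EOI
    return SOI + b"".join(segments) + tail

_rng = random.Random()

def init(seed):
    """AFL++ hook: called once with a seed."""
    _rng.seed(seed)

def fuzz(buf, add_buf, max_size):
    """AFL++ hook: return one mutated test case no longer than max_size."""
    out = mutate_segments(buf, add_buf, _rng)
    if len(out) > max_size:              # too large: fall back to the input
        out = bytes(buf)[:max_size]
    return bytearray(out)
```

With an AFL++ build that has Python support, a module like this is normally loaded by pointing `AFL_PYTHON_MODULE` at its name and adding its directory to `PYTHONPATH`. Because the mutator only touches header segments, it pairs naturally with the stock havoc stage rather than replacing it.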

Recommended fuzzers

  • AFL++
  • libFuzzer
  • Honggfuzz