JPEG seed corpus
Marker-segmented with Huffman entropy coding — entropy streams make pure random mutation less effective than marker-aware strategies.
JPEG stores image data as a sequence of marker segments (SOF, DHT, DQT, SOS, APP0/JFIF, APP1/EXIF, COM). The entropy-coded segment (ECS) that follows each SOS header is Huffman-coded (or, less commonly, arithmetic-coded), which makes it largely opaque to random byte-flip mutation: a changed byte usually desynchronises the bitstream, so the decoder either bails out with an error or decodes garbage coefficients along code paths it has already covered, and rarely reaches new decoding logic. Effective JPEG fuzzing therefore requires either marker-aware mutation or a corpus seeded with valid JFIF and EXIF files of diverse types.
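The marker-segment layout described above can be made concrete with a small parser. The following is a minimal sketch, not a full JPEG reader: the function name `iter_segments` is hypothetical, and it assumes a well-formed header and stops at the first SOS, since the entropy-coded data that follows is not length-prefixed.

```python
import struct
from typing import Iterator, Tuple

# Markers that stand alone and carry no length field: TEM, RST0-7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def iter_segments(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (marker_code, payload) for each segment up to and including SOS."""
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected 0xFF at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte before a marker: skip it
            pos += 1
            continue
        pos += 2
        if marker in STANDALONE:
            yield marker, b""
            if marker == 0xD9:      # EOI: nothing more to read
                return
            continue
        if pos + 2 > len(data):
            raise ValueError("truncated segment length")
        # The 16-bit big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 2 or pos + length > len(data):
            raise ValueError(f"bad segment length {length} at offset {pos}")
        yield marker, data[pos + 2:pos + length]
        pos += length
        if marker == 0xDA:          # SOS: entropy-coded data follows
            return
```

Printing `hex(marker)` and `len(payload)` for each yielded pair gives a quick view of which segments a seed file actually contains.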
libjpeg-turbo's SIMD-optimised inverse DCT and its colour-conversion paths are the regions where most memory-safety bugs have historically been found. Progressive JPEGs (which spread a single image across multiple scans) are particularly complex: the decoder must maintain scan state across many SOS markers, and bugs in progressive scan reassembly have led to heap overflows. EXIF metadata parsing is a secondary attack surface that can be reached before any pixel data is decoded.
A high-quality JPEG corpus should include baseline, progressive, arithmetic-coded, and lossless JPEG files; JFIF and raw EXIF thumbnails; images with oversized or undersized quantisation tables; and at least a few malformed-but-accepted files from real-world applications. OSS-Fuzz's libjpeg-turbo project provides production-quality seed corpora used in continuous fuzzing.
Building + curating your corpus
- →Include both JFIF (APP0) and EXIF (APP1) files — they exercise different metadata parsing paths and should both be in your starter set.
- →Add progressive JPEG files explicitly; many crawled corpora are dominated by baseline JPEGs, leaving progressive decode paths undercovered.
- →Strip embedded thumbnails while preserving the main image, so that each corpus file stays below 32 KB for higher throughput.
- →Include arithmetic-coded JPEG files if your target supports them — this codepath is rarely tested and has historically harboured unique bugs.
- →Use afl-cmin after initial corpus collection; many image hosting sites serve near-identical re-encoded files that add no new coverage.
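One way to check a collected corpus against the points above is to tally each file's coding process and metadata type. This is a rough sketch under stated assumptions: `describe_jpeg` and `MAX_SEED_BYTES` are hypothetical names, the scan ignores fill bytes and the differential (hierarchical) SOF markers, and it only inspects segments before the first SOS.

```python
import struct

# SOF marker code -> coding process (non-differential processes only).
SOF_PROCESS = {
    0xC0: "baseline",
    0xC1: "extended-sequential",
    0xC2: "progressive",
    0xC3: "lossless",
    0xC9: "arithmetic-sequential",
    0xCA: "arithmetic-progressive",
    0xCB: "arithmetic-lossless",
}
MAX_SEED_BYTES = 32 * 1024  # size target taken from the list above


def describe_jpeg(data: bytes) -> dict:
    """Summarise one file: coding process, JFIF/EXIF presence, size check."""
    info = {"process": None, "jfif": False, "exif": False,
            "small": len(data) < MAX_SEED_BYTES}
    if data[:2] != b"\xff\xd8":           # no SOI: not a JPEG
        return info
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:                # SOS: header segments are over
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload.startswith(b"JFIF\x00"):
            info["jfif"] = True
        elif marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            info["exif"] = True
        elif marker in SOF_PROCESS:
            info["process"] = SOF_PROCESS[marker]
        pos += 2 + max(length, 2)         # always advance, even on bad lengths
    return info
```

Running this over every file in the corpus directory and counting the `process` values with `collections.Counter` shows at a glance whether progressive and arithmetic-coded files are missing or whether oversized seeds need trimming.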
Mutator hints
- →Use the jpeg.dict dictionary shipped with AFL++ (passed with -x) to teach the fuzzer JPEG marker bytes (0xFF followed by SOF/SOS/DHT/DQT type bytes) as interesting token replacements.
- →Write a marker-aware custom mutator that inserts, deletes, or reorders complete JPEG segments while preserving SOI (0xFF 0xD8) and EOI (0xFF 0xD9) markers.
- →CMPLOG mode in AFL++ is highly effective for EXIF metadata paths because EXIF parsers compare input against many magic values: the "Exif" header, the TIFF byte-order mark, and numeric tag IDs.
- →For entropy-coded data fuzzing, generate valid Huffman tables with adversarial symbol distributions (for example, every symbol assigned the maximum 16-bit code length) to stress Huffman decoder edge cases.
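The marker-aware mutator and the adversarial Huffman table from the hints above might be sketched as follows. This is an illustrative outline rather than a tuned implementation: `split_segments`, `mutate_segments`, and `max_length_dht` are hypothetical helper names, only segments before the first SOS are touched, and the `init`/`fuzz`/`deinit` functions follow the hook names AFL++ uses for Python custom mutators (loaded through `AFL_PYTHON_MODULE`).

```python
import random
import struct

SOI, EOI = b"\xff\xd8", b"\xff\xd9"
# Markers without a length field, plus SOS, end the header walk.
STOP = {0x00, 0x01, 0xD8, 0xD9, 0xDA, 0xFF} | set(range(0xD0, 0xD8))


def split_segments(buf: bytes):
    """Return (header_segments, tail), or None if buf lacks an SOI marker."""
    if buf[:2] != SOI:
        return None
    segs, pos = [], 2
    while pos + 4 <= len(buf) and buf[pos] == 0xFF and buf[pos + 1] not in STOP:
        (length,) = struct.unpack(">H", buf[pos + 2:pos + 4])
        end = pos + 2 + length
        if length < 2 or end > len(buf):
            break
        segs.append(bytes(buf[pos:end]))
        pos = end
    return segs, bytes(buf[pos:])      # tail = SOS header, scan data, EOI


def max_length_dht(table_id: int = 0, n_symbols: int = 16) -> bytes:
    """Build an AC DHT segment in which every symbol has a 16-bit code."""
    bits = bytes([0] * 15 + [n_symbols])       # code counts for lengths 1..16
    payload = bytes([0x10 | table_id]) + bits + bytes(range(n_symbols))
    return b"\xff\xc4" + struct.pack(">H", 2 + len(payload)) + payload


def mutate_segments(buf: bytes, rng: random.Random, max_size: int) -> bytes:
    """Insert, delete, swap, or add a DHT among whole header segments."""
    parsed = split_segments(buf)
    if parsed is None or not parsed[0]:
        return bytes(buf)                      # nothing to work with
    segs, tail = parsed
    op = rng.choice(("insert", "delete", "swap", "dht"))
    if op == "insert":
        segs.insert(rng.randrange(len(segs) + 1), rng.choice(segs))
    elif op == "delete":
        del segs[rng.randrange(len(segs))]
    elif op == "dht":
        segs.append(max_length_dht(rng.randrange(4), rng.randrange(1, 256)))
    elif len(segs) > 1:
        i, j = rng.sample(range(len(segs)), 2)
        segs[i], segs[j] = segs[j], segs[i]
    out = SOI + b"".join(segs) + tail
    if not out.endswith(EOI):                  # keep the closing marker
        out += EOI
    return out if len(out) <= max_size else bytes(buf)


_rng = random.Random()


def init(seed):                                # AFL++ hook: called once
    _rng.seed(seed)


def fuzz(buf, add_buf, max_size):              # AFL++ hook: one mutation
    return bytearray(mutate_segments(bytes(buf), _rng, max_size))


def deinit():                                  # AFL++ hook: cleanup
    pass
```

Because the tail is copied through unchanged, the entropy-coded data stays decodable against the original tables unless a mutation deletes or overrides them, which is the kind of inconsistency this strategy is meant to probe.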
Recommended fuzzers
- → AFL++
- → libFuzzer
- → Honggfuzz
Libraries that consume JPEG