A benchmark for multi-modal structured extraction from documents
1,777 government forms, each with its own schema, evaluated across four input modalities. Built to diagnose where small models (≤4B parameters) fail and how to fix them.
@article{varex2026,
  title  = {VAREX: A Benchmark for Multi-Modal
            Structured Extraction from Documents},
  author = {Barzelay, Udi and Azulai, Ophir and Shapira, Inbar and
            Friedman, Idan and Abo Dahood, Foad and Lee, Madison and
            Daniels, Abraham},
  year   = {2026},
  note   = {arXiv preprint forthcoming}
}