VAREX

A benchmark for multi-modal structured extraction from documents

1,777 government forms, each paired with its own per-document schema, evaluated across four input modalities. Built for diagnosing where small models (≤4B) fail, and how to fix them.

1,777
Documents
single-page government forms
1,771
Unique Schemas
per-document extraction targets
21,084
Evaluation Fields
validated through 3-phase QA
4
Input Modalities
text · spatial · image · combined

Structured extraction from documents

📄 Document (image, text, or both) + { } JSON Schema (unique per document) → 🤖 Model (from 800M params to frontier APIs) → Structured JSON (scored against ground truth)

Anatomy of a benchmark item

What models receive as input, and what they must produce as output.

SF-2823 at 200 DPI
Primary evaluation modality. Clean digital render at 200 DPI — all text is legible. Frontier models reach 98.0% EM.
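To make the input/output contract concrete, here is a minimal sketch of a benchmark item. The field names and values are invented for illustration; they are not taken from the real SF-2823 schema. The model receives the document plus a JSON Schema like this one and must emit JSON conforming to it.

```python
import json

# Hypothetical benchmark item (illustrative only; field names are invented,
# not drawn from the actual SF-2823 extraction target).
item = {
    "doc_id": "sf-2823",
    "schema": {  # the per-document JSON Schema given to the model
        "type": "object",
        "properties": {
            "applicant_name": {"type": "string"},
            "date_signed": {"type": "string"},
            "beneficiaries": {  # a tabular region becomes an array of objects
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "share_percent": {"type": "string"},
                    },
                },
            },
        },
    },
}

# A conforming model output, scored field-by-field against ground truth:
prediction = {
    "applicant_name": "Jane Q. Public",
    "date_signed": "2026-01-15",
    "beneficiaries": [{"name": "John Public", "share_percent": "100"}],
}

print(json.dumps(prediction, indent=2))
```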

Model Results

Image (V) modality · Exact Match
# Model Size EM % ANLS Flat Nested Table Perfect

Examples

Documents from the benchmark.

About the benchmark

VAREX uses Reverse Annotation: instead of annotating existing documents, we start with fillable PDF templates, fill them with deterministic placeholders, use an LLM to discover a semantic schema, then inject realistic synthetic values. Because every value is programmatically written to a specific PDF widget, the ground truth is deterministic — no human annotation noise. A three-phase QA process (automated screening, frontier-model audit, expert human review) validates the schema mappings to ~98.5% field-level accuracy.
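The three pipeline steps can be sketched as follows. This is a toy illustration with invented widget names and a mocked schema-discovery step; the real pipeline writes to actual PDF AcroForm widgets and uses an LLM for step 2.

```python
def fill_placeholders(widget_names):
    """Step 1: write a deterministic placeholder into every PDF widget."""
    return {w: f"<<FIELD_{i:03d}>>" for i, w in enumerate(widget_names)}

def discover_schema(placeholder_map):
    """Step 2: map each placeholder to a semantic field name.
    In the real pipeline an LLM reads the placeholder-filled render;
    here a hard-coded lookup stands in for it."""
    semantic_names = {"txt1": "applicant_name", "txt2": "date_signed"}
    return {semantic_names[w]: ph for w, ph in placeholder_map.items()}

def inject_values(schema_map, synthetic_values):
    """Step 3: write realistic synthetic values into the same widgets.
    Ground truth is therefore known programmatically, not hand-annotated."""
    return {field: synthetic_values[field] for field in schema_map}

placeholders = fill_placeholders(["txt1", "txt2"])
schema_map = discover_schema(placeholders)
ground_truth = inject_values(schema_map, {
    "applicant_name": "Jane Q. Public",
    "date_signed": "2026-01-15",
})
print(ground_truth)
```

Because step 3 reuses the exact widget-to-field mapping from step 2, every synthetic value lands in a known location, which is what makes the ground truth deterministic.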
1,777 single-page, English-language U.S. government forms from agencies including HUD, GSA, FDA, USCIS, and state departments. Forms range from simple 3-field applications to complex documents with nested sections and tabular regions. The benchmark comprises 299 Flat (17%), 1,146 Nested (64%), and 332 Table (19%) documents.
Primary metric is Exact Match (EM): a field scores 1 if the normalized prediction matches the ground truth exactly. We additionally report ANLS (Average Normalized Levenshtein Similarity) for partial credit. Array fields use order-invariant matching via the Hungarian algorithm — models aren't penalized for reading table rows in a different order. Scores are reported per structure category: VAREX-FLAT, VAREX-NESTED, and VAREX-TABLE.
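The scoring rules above can be sketched in a few functions. The normalization shown (lowercase, trim) is an assumption, and rows are simplified to strings; for order-invariant matching, brute force over permutations stands in for the Hungarian algorithm, which is equivalent for the benchmark's small tables (median 3 rows).

```python
from itertools import permutations

def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(pred, gold):
    """Normalized Levenshtein similarity in [0, 1] for partial credit."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))

def field_em(pred, gold):
    """Exact match after light normalization (assumed: lowercase + trim)."""
    return float(pred.strip().lower() == gold.strip().lower())

def array_em(pred_rows, gold_rows):
    """Order-invariant matching: score the best alignment of predicted rows
    to gold rows, so row order never costs the model points. Rows are
    plain strings here for simplicity."""
    if len(pred_rows) != len(gold_rows):
        return 0.0
    return max(
        sum(field_em(p, g) for p, g in zip(perm, gold_rows)) / len(gold_rows)
        for perm in permutations(pred_rows)
    )
```

For example, `array_em(["row b", "row a"], ["row a", "row b"])` scores 1.0, while `anls("Jane", "Jan")` gives 0.75 (one edit over a length-4 string).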
VAREX most effectively discriminates models in the 60–95% range, making it especially useful for practitioners building and fine-tuning small (≤4B) extraction models for cost-sensitive or on-device deployment. The benchmark diagnoses specific failure modes — schema echo, under-extraction, attention decay — that offer concrete targets for improvement. Above 95%, residual ground truth noise limits the precision of model-to-model comparisons.
All documents are single-page, English-only, typed digital PDFs — no handwriting, scan artifacts, or multilingual content. Faker-generated values (e.g., 555-prefix phones) don't match real-world distributions. Table documents have a median of 3 rows, below enterprise scale. The exact-match metric doesn't normalize name ordering or phone formats.

Citation

@article{varex2026,
  title   = {VAREX: A Benchmark for Multi-Modal
             Structured Extraction from Documents},
  author  = {Barzelay, Udi and Azulai, Ophir and Shapira, Inbar and
             Friedman, Idan and Abo Dahood, Foad and Lee, Madison and
             Daniels, Abraham},
  year    = {2026},
  note    = {arXiv preprint forthcoming}
}