VAREX

A benchmark for multi-modal structured extraction from documents

1,777 government forms, each paired with its own per-document schema, evaluated across four input modalities. Built for diagnosing where small models (≤4B) fail, and how to fix them.

1,777
Documents
single-page government forms
1,771
Unique Schemas
per-document extraction targets
21,084
Evaluation Fields
validated through 3-phase QA
4
Input Modalities
text · spatial · image · combined

Structured extraction from documents

📄 Document (image, text, or both) + { } JSON Schema (unique per document) → 🤖 Model (from 800M params to frontier APIs) → Structured JSON (scored against ground truth)

Anatomy of a benchmark item

What models receive as input, and what they must produce as output.

SF-2823 at 200 DPI
Primary evaluation modality. Clean digital render at 200 DPI — all text is legible. Frontier models reach 98.0% EM.
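To make the input/output contract concrete, here is a minimal sketch of a benchmark item. The field names and values are invented for illustration; they are not taken from the real SF-2823 schema. The model receives the document plus a JSON Schema like this one and must emit JSON conforming to it.

```python
import json

# Hypothetical benchmark item (illustrative only; field names are invented,
# not drawn from the actual SF-2823 extraction target).
item = {
    "doc_id": "sf-2823",
    "schema": {  # the per-document JSON Schema given to the model
        "type": "object",
        "properties": {
            "applicant_name": {"type": "string"},
            "date_signed": {"type": "string"},
            "beneficiaries": {  # a tabular region becomes an array of objects
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "share_percent": {"type": "string"},
                    },
                },
            },
        },
    },
}

# A conforming model output, scored field-by-field against ground truth:
prediction = {
    "applicant_name": "Jane Q. Public",
    "date_signed": "2026-01-15",
    "beneficiaries": [{"name": "John Public", "share_percent": "100"}],
}

print(json.dumps(prediction, indent=2))
```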

Model Results

Image (V) modality · Exact Match
# Model Size EM % ANLS Flat Nested Table Perfect

Examples

Documents from the benchmark.

About the benchmark

VAREX uses Reverse Annotation: instead of annotating existing documents, we start with fillable PDF templates, fill them with deterministic placeholders, use an LLM to discover a semantic schema, then inject realistic synthetic values. Because every value is programmatically written to a specific PDF widget, the ground truth is deterministic — no human annotation noise. A three-phase QA process (automated screening, frontier-model audit, expert human review) validates the schema mappings to ~98.5% field-level accuracy.
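The three pipeline steps can be sketched as follows. This is a toy illustration with invented widget names and a mocked schema-discovery step; the real pipeline writes to actual PDF AcroForm widgets and uses an LLM for step 2.

```python
def fill_placeholders(widget_names):
    """Step 1: write a deterministic placeholder into every PDF widget."""
    return {w: f"<<FIELD_{i:03d}>>" for i, w in enumerate(widget_names)}

def discover_schema(placeholder_map):
    """Step 2: map each placeholder to a semantic field name.
    In the real pipeline an LLM reads the placeholder-filled render;
    here a hard-coded lookup stands in for it."""
    semantic_names = {"txt1": "applicant_name", "txt2": "date_signed"}
    return {semantic_names[w]: ph for w, ph in placeholder_map.items()}

def inject_values(schema_map, synthetic_values):
    """Step 3: write realistic synthetic values into the same widgets.
    Ground truth is therefore known programmatically, not hand-annotated."""
    return {field: synthetic_values[field] for field in schema_map}

placeholders = fill_placeholders(["txt1", "txt2"])
schema_map = discover_schema(placeholders)
ground_truth = inject_values(schema_map, {
    "applicant_name": "Jane Q. Public",
    "date_signed": "2026-01-15",
})
print(ground_truth)
```

Because step 3 reuses the exact widget-to-field mapping from step 2, every synthetic value lands in a known location, which is what makes the ground truth deterministic.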
1,777 single-page, English-language U.S. government forms from agencies including HUD, GSA, FDA, USCIS, and state departments. Forms range from simple 3-field applications to complex documents with nested sections and tabular regions. The benchmark comprises 299 Flat (17%), 1,146 Nested (64%), and 332 Table (19%) documents.
Primary metric is Exact Match (EM): a field scores 1 if the normalized prediction matches the ground truth exactly. We additionally report ANLS (Average Normalized Levenshtein Similarity) for partial credit. Array fields use order-invariant matching via the Hungarian algorithm — models aren't penalized for reading table rows in a different order. Scores are reported per structure category: VAREX-FLAT, VAREX-NESTED, and VAREX-TABLE.
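The scoring rules above can be sketched in a few functions. The normalization shown (lowercase, trim) is an assumption, and rows are simplified to strings; for order-invariant matching, brute force over permutations stands in for the Hungarian algorithm, which is equivalent for the benchmark's small tables (median 3 rows).

```python
from itertools import permutations

def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(pred, gold):
    """Normalized Levenshtein similarity in [0, 1] for partial credit."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))

def field_em(pred, gold):
    """Exact match after light normalization (assumed: lowercase + trim)."""
    return float(pred.strip().lower() == gold.strip().lower())

def array_em(pred_rows, gold_rows):
    """Order-invariant matching: score the best alignment of predicted rows
    to gold rows, so row order never costs the model points. Rows are
    plain strings here for simplicity."""
    if len(pred_rows) != len(gold_rows):
        return 0.0
    return max(
        sum(field_em(p, g) for p, g in zip(perm, gold_rows)) / len(gold_rows)
        for perm in permutations(pred_rows)
    )
```

For example, `array_em(["row b", "row a"], ["row a", "row b"])` scores 1.0, while `anls("Jane", "Jan")` gives 0.75 (one edit over a length-4 string).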
VAREX most effectively discriminates models in the 60–95% range, making it especially useful for practitioners building and fine-tuning small (≤4B) extraction models for cost-sensitive or on-device deployment. The benchmark diagnoses specific failure modes — schema echo, under-extraction, attention decay — that offer concrete targets for improvement. Above 95%, residual ground truth noise limits the precision of model-to-model comparisons.
All documents are single-page, English-only, typed digital PDFs — no handwriting, scan artifacts, or multilingual content. Faker-generated values (e.g., 555-prefix phones) don't match real-world distributions. Table documents have a median of 3 rows, below enterprise scale. The exact-match metric doesn't normalize name ordering or phone formats.

Citation

@article{varex2026,
  title   = {VAREX: A Benchmark for Multi-Modal
             Structured Extraction from Documents},
  author  = {Barzelay, Udi and Azulai, Ophir and Shapira, Inbar and
             Friedman, Idan and Abo Dahood, Foad and Lee, Madison and
             Daniels, Abraham},
  year    = {2026},
  note    = {arXiv preprint forthcoming}
}