This work presents EPIC-Bench (Embodied PerceptIon BenChmark), a grounding benchmark designed to systematically evaluate the visual perception capabilities required of large vision-language models (VLMs) in embodied environments. We construct a dataset of 6.6k meticulously annotated (Image, Text, Mask) tuples to answer the question: Can VLMs perceive the embodied real world? EPIC-Bench is characterized by three key design principles. First, it encourages genuinely visually grounded perception rather than the exploitation of linguistic priors. Second, it comprises 23 fine-grained tasks spanning the embodied interaction pipeline, from Target Localization to Navigation and Manipulation. Third, its fine-grained taxonomy supports diagnostic analysis of embodied visual perception. Comprehensive experiments show that VLMs still struggle to align visual and textual information for downstream physical interaction, especially in affordance region detection, where the target is only part of an object.
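For concreteness, each benchmark sample pairs an image with a grounding query and a mask of the queried region. The sketch below is a minimal, hypothetical representation of such an (Image, Text, Mask) tuple; the class name, field names, and the assumption that masks are stored as binary PNG segmentation maps are illustrative and not taken from the released dataset.

```python
from dataclasses import dataclass

import numpy as np
from PIL import Image


@dataclass
class EpicBenchSample:
    """One (Image, Text, Mask) tuple: a scene image, a natural-language
    grounding query, and a mask of the queried region (hypothetical schema)."""
    image_path: str   # RGB scene image
    text: str         # grounding query, e.g. an affordance-region description
    mask_path: str    # binary mask of the target region (assumed same size as image)

    def load(self) -> tuple[np.ndarray, str, np.ndarray]:
        image = np.asarray(Image.open(self.image_path).convert("RGB"))
        mask = np.asarray(Image.open(self.mask_path).convert("L")) > 0
        return image, self.text, mask


# Example with placeholder paths and an affordance-style query.
sample = EpicBenchSample(
    image_path="scene_0001.jpg",
    text="the handle of the mug nearest the sink",
    mask_path="scene_0001_mask.png",
)
```

Note that for affordance-region tasks the mask covers only part of an object (e.g. a handle), which is exactly the setting the abstract identifies as most challenging for current VLMs.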
@article{2026,
  title={EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models},
  journal={xxx},
  year={2026}
}