EgoCross
A comprehensive benchmark across Surgery, Industry, Extreme Sports, and Animal Perspective. EgoCross comprises 798 clips and 957 QA pairs, supporting both CloseQA and OpenQA formats for fine‑grained evaluation.
EgoCross is a cross-domain benchmark designed to evaluate how well multimodal large language models (MLLMs) generalize on egocentric video question answering (VQA). Unlike prior daily-life datasets, EgoCross targets diverse and challenging domains, including surgery, industrial assembly, extreme sports, and animal perspective, to assess model robustness under varying visual and semantic conditions.
The benchmark covers 15 sub-tasks grouped into four capability families: Identification, Localization, Prediction, and Counting. Each video clip is paired with multiple closed-ended (CloseQA) and open-ended (OpenQA) questions that require fine-grained temporal, spatial, and reasoning understanding.
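To make the task format concrete, the sketch below shows what a single QA record and CloseQA scoring could look like in Python. The field names (clip_id, domain, family, choices) are illustrative assumptions, not the released schema.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EgoCrossQA:
    clip_id: str                   # identifier of the egocentric video clip (assumed field name)
    domain: str                    # one of: surgery, industry, extreme sports, animal perspective
    family: str                    # capability family: identification, localization, prediction, counting
    question: str                  # natural-language question about the clip
    choices: Optional[List[str]]   # answer options for CloseQA; None for OpenQA
    answer: str                    # gold answer (a choice for CloseQA, free text for OpenQA)

def closeqa_accuracy(predictions: List[str], records: List[EgoCrossQA]) -> float:
    # CloseQA can be scored by exact match against the gold choice.
    hits = sum(p.strip().lower() == r.answer.strip().lower()
               for p, r in zip(predictions, records))
    return hits / len(records)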
In total, EgoCross contains 798 video clips and 957 QA pairs, curated through a semi-automatic pipeline combining LLM-based question generation and human verification. It provides a unified platform for measuring cross-domain generalization, highlighting the gap between everyday understanding and complex real-world egocentric perception.
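The curation pipeline can be pictured as a simple generate-then-verify loop, sketched below for illustration only; llm_propose_qa and human_review are hypothetical stand-ins for the actual generation model and annotation interface.

def curate(clips, llm_propose_qa, human_review):
    # Semi-automatic curation: an LLM proposes candidate QA pairs per clip,
    # and each candidate is kept only if a human annotator approves it.
    accepted = []
    for clip in clips:
        for qa in llm_propose_qa(clip):   # LLM-based question generation
            if human_review(clip, qa):    # human verification gate
                accepted.append(qa)
    return accepted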
Four domain exemplars (Figure 1):
Surgery
Fine‑grained tool recognition, phase understanding, and hand‑specific interactions.

Industry
Component identification, procedural reasoning, and tool‑usage logic.

Extreme Sports
High‑speed egocentric motion, navigation cues, and temporal anticipation.

Animal Perspective
Species cues, alternative movement patterns, and behavioral understanding.
The EgoCross dataset is now available on Hugging Face. Access the complete benchmark with all domains and QA pairs.
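If the benchmark is published as a standard Hugging Face dataset, it should be loadable with the datasets library as sketched below; the repo id here is a placeholder, so substitute the actual one listed on the dataset page.

from datasets import load_dataset

# "ORG/EgoCross" is a placeholder, not the real repo id.
ds = load_dataset("ORG/EgoCross")
print(ds)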
East China Normal University • INSAIT • Fudan University
For questions and collaboration, please reach out to our team members.