Overview of the construction pipeline for EgoPlan-Bench, which builds on existing untrimmed egocentric videos with detailed action narrations. (1) We first leverage GPT-4 to identify task goals through hierarchical reasoning. (2) We then filter task goals based on the requisite number of actions. (3) We design multiple-choice questions: each question is generated automatically from a task goal, and its options are drawn from different actions under the same task goal. (4) We employ human annotators to verify each question to ensure benchmark quality.
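Steps (2) and (3) of the pipeline can be illustrated with a minimal Python sketch. It is not the released construction code: the function names, the `min_actions` threshold, the number of options, and the question template are all assumptions made for illustration. It only shows the idea of keeping goals with enough actions and sampling distractors from other actions under the same goal.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    goal: str
    prompt: str
    options: list
    answer_index: int


def filter_goals(goal_to_actions, min_actions=4):
    """Keep only task goals with at least `min_actions` actions (threshold is hypothetical)."""
    return {g: a for g, a in goal_to_actions.items() if len(a) >= min_actions}


def build_question(goal, actions, step, num_options=4, rng=None):
    """Build one multiple-choice question whose answer is actions[step].

    Distractors are other, distinct actions under the same task goal.
    """
    rng = rng or random.Random(0)
    correct = actions[step]
    # Deduplicate while preserving order, and exclude the correct action.
    pool = list(dict.fromkeys(a for a in actions if a != correct))
    if len(pool) < num_options - 1:
        raise ValueError("not enough distinct actions to sample distractors")
    options = rng.sample(pool, num_options - 1) + [correct]
    rng.shuffle(options)
    # Hypothetical template; the benchmark's actual wording may differ.
    prompt = f"What action should I take next in order to {goal}?"
    return Question(goal, prompt, options, options.index(correct))


goals = {
    "make tea": ["fill kettle", "boil water", "add tea bag", "pour water", "stir tea"],
    "open door": ["turn handle"],
}
kept = filter_goals(goals)
question = build_question("make tea", kept["make tea"], step=2)
```

Drawing distractors from the same goal keeps every option plausible for the task, so a model has to rely on the observed task progress rather than on the goal text alone.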
The evaluation data of EgoPlan-Bench comprises a total of 4,939 multiple-choice questions, divided into two subsets: EgoPlan-Val for validation and EgoPlan-Test for testing. Our benchmark exhibits three main characteristics: 1) Realism of Tasks: The tasks are extrapolated from authentic real-world videos, offering a closer reflection of daily human needs and showcasing greater variety than artificially constructed tasks. 2) Diversity of Action Plans: The benchmark involves a diverse set of action plans, requiring interaction with hundreds of different objects and extending beyond basic manipulation skills such as picking and placing items. 3) Intricacy of Visual Observations: The visual observations span various real-world scenes, where objects vary in appearance, state, and placement.
a) Statistics of the evaluation data of EgoPlan-Bench.
b) Wordcloud of task goals in EgoPlan-Bench questions.
c) Top 20 verbs with top 8 related objects in EgoPlan-Bench candidate action plans.
We evaluate a total of 28 MLLMs on our benchmark. The results indicate that our benchmark poses significant challenges for existing MLLMs, and there is still a long way to go before these models evolve into human-level task planners. We further analyze three main reasons for this limitation: 1) insufficient integration of the visual modality, 2) omission of key state changes in task progress, and 3) inadequate application of world knowledge.
a) Impact of goal-answer similarity on model performance.
b) Impact of task progress length on model performance.
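An analysis like the one in panel b) amounts to grouping questions by some property and computing accuracy per group. The sketch below is only an illustration under assumed names: the record fields (`progress_len`, `prediction`, `answer`) and the bucket edges are hypothetical and are not taken from the benchmark's released files.

```python
from collections import defaultdict


def accuracy_by_bucket(records, key, edges):
    """Return accuracy per half-open bucket [edges[i], edges[i+1]) of record[key].

    Records whose value falls outside all buckets are ignored.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        value = record[key]
        for lo, hi in zip(edges, edges[1:]):
            if lo <= value < hi:
                label = f"[{lo}, {hi})"
                counts[label] += 1
                hits[label] += int(record["prediction"] == record["answer"])
                break
    return {label: hits[label] / counts[label] for label in counts}


# Toy records; field names and values are made up for illustration.
records = [
    {"progress_len": 1, "prediction": "A", "answer": "A"},
    {"progress_len": 2, "prediction": "B", "answer": "A"},
    {"progress_len": 5, "prediction": "C", "answer": "C"},
]
result = accuracy_by_bucket(records, "progress_len", [0, 3, 6])
```

The same helper could be pointed at a goal-answer similarity score instead of a length by changing `key` and `edges`, assuming such a score has been computed per question.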
Given the suboptimal performance of the evaluated MLLMs on EgoPlan-Bench, we investigate enhancing the human-level planning capabilities of MLLMs through instruction-tuning. Specifically, we construct an instruction-tuning dataset, EgoPlan-IT, to align MLLMs with real-world needs of task planning.
The model tuned on EgoPlan-IT demonstrates a significant and robust performance improvement on the proposed benchmark, verifying the effectiveness of our data.
@misc{chen2024egoplanbenchbenchmarkingmultimodallarge,
title={EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning},
author={Yi Chen and Yuying Ge and Yixiao Ge and Mingyu Ding and Bohao Li and Rui Wang and Ruifeng Xu and Ying Shan and Xihui Liu},
year={2024},
eprint={2312.06722},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2312.06722},
}