EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

1Tencent AI Lab, 2The University of Hong Kong, 3ARC Lab, Tencent PCG,
4University of California, Berkeley, 5Peng Cheng Laboratory
Benchmark overview.
Our EgoPlan-Bench evaluates Planning: given a video showing the task progress, the current visual observation, and an open-form task goal as inputs, a model must predict the next feasible action plan, much as a human would. In contrast, the egocentric-video-based QA examples from existing benchmarks mainly evaluate Comprehension, where a model answers questions based on spatial and temporal understanding of the entire video.
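
To make the input/output format concrete, below is a minimal Python sketch of how a single planning question could be represented and posed to a model. The field names and prompt wording are illustrative assumptions, not the released schema.

from dataclasses import dataclass
from typing import List

@dataclass
class EgoPlanQuestion:
    video_path: str             # egocentric video showing the task progress so far
    current_observation: str    # path to the current visual observation (e.g., the last frame)
    task_goal: str              # open-form task goal, e.g., "make a cup of coffee"
    candidate_plans: List[str]  # candidate next action plans, one of which is correct
    answer_index: int           # index of the ground-truth next action plan

def format_prompt(q: EgoPlanQuestion) -> str:
    """Compose the multiple-choice planning question posed to an MLLM."""
    options = "\n".join(f"({chr(65 + i)}) {p}" for i, p in enumerate(q.candidate_plans))
    return (
        f"Task goal: {q.task_goal}\n"
        "Considering the task progress shown in the video and the current observation, "
        f"what should be your next action?\n{options}"
    )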

Abstract

The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning and generalization capabilities, as well as proficiency in processing multimodal inputs. A crucial milestone in the evolution of AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments and solving a wide range of real-world problems. Despite the impressive advancements in MLLMs, a question remains: How far are current MLLMs from achieving human-level planning? To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark for evaluating the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench emphasizes the evaluation of planning capabilities, featuring realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting substantial room for improvement before MLLMs reach human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench. All code, data, and a maintained benchmark leaderboard are publicly available to advance future research.

Benchmark Construction

Benchmark construction.

Overview of the construction pipeline for EgoPlan-Bench, built on existing untrimmed egocentric videos with detailed action narrations. (1) We first leverage GPT-4 to identify task goals through hierarchical reasoning. (2) We then filter task goals based on the requisite number of actions. (3) Questions are designed in a multiple-choice format: each question is automatically generated from a task goal, and the candidate options are derived from different actions under the same task goal. (4) We employ human annotators to verify each question to ensure benchmark quality.
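
As an illustration of step (3), the following hedged sketch pairs the ground-truth next action with distractors sampled from other actions under the same task goal; the function and field names are assumptions for illustration only, not the actual pipeline code.

import random
from typing import Dict, List

def build_question(task_goal: str, actions: List[str], answer_idx: int,
                   num_options: int = 4, seed: int = 0) -> Dict:
    """Pair the ground-truth next action with distractors drawn from
    other actions under the same task goal, then shuffle the options."""
    rng = random.Random(seed)
    answer = actions[answer_idx]
    distractor_pool = [a for i, a in enumerate(actions) if i != answer_idx]
    distractors = rng.sample(distractor_pool, k=min(num_options - 1, len(distractor_pool)))
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "task_goal": task_goal,
        "question": f"What is the next action to achieve the goal: {task_goal}?",
        "options": options,
        "answer": answer,
    }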

Data Statistics

The evaluation data of EgoPlan-Bench comprises a total of 4,939 multiple-choice questions, which are divided into two subsets: EgoPlan-Val for validation and EgoPlan-Test for testing. Our benchmark exhibits three main characteristics: 1) Realism of Tasks: the tasks are extrapolated from authentic real-world videos, offering a closer reflection of daily human needs and showcasing greater variety than artificially constructed tasks. 2) Diversity of Action Plans: the benchmark involves a diverse set of action plans, requiring interaction with hundreds of different objects and extending beyond basic manipulation skills such as picking and placing items. 3) Intricacy of Visual Observations: the visual observations span various real-world scenes, where objects vary in appearance, state, and placement.


Evaluation Data Statistics.

a) Statistics of the evaluation data of EgoPlan-Bench.



Task Goal Distribution.

b) Wordcloud of task goals in EgoPlan-Bench questions.

Action Distribution.

c) Top 20 verbs with top 8 related objects in EgoPlan-Bench candidate action plans.

Evaluation Results

We evaluate a total of 28 MLLMs on our benchmark. The results indicate that EgoPlan-Bench poses significant challenges for existing MLLMs, and there is still a long way to go before these models evolve into human-level task planners. We further analyze three main reasons for this limitation: 1) insufficient integration of the visual modality, 2) omission of key state changes in the task progress, and 3) inadequate application of world knowledge.
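
For reference, a common protocol for multiple-choice MLLM evaluation is to rank the candidate options by language-modeling loss given the visual context and take the lowest-loss option as the prediction. The sketch below illustrates that protocol as an assumption about a typical setup, not necessarily the exact procedure used here; the model call signature is a placeholder, since it varies per MLLM.

import torch

@torch.no_grad()
def predict_option(model, tokenizer, visual_inputs, prompt, options):
    """Return the index of the candidate option with the lowest average token loss."""
    losses = []
    for option in options:
        inputs = tokenizer(prompt + " " + option, return_tensors="pt")
        # `visual_inputs` stands in for whatever image/video features the
        # specific MLLM expects; the exact forward signature varies per model.
        outputs = model(**inputs, visual_inputs=visual_inputs, labels=inputs["input_ids"])
        losses.append(outputs.loss.item())
    return min(range(len(options)), key=lambda i: losses[i])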



Impact of Goal-Answer Similarity.

a) Impact of goal-answer similarity on model performance.



Impact of Task Progress Length.

b) Impact of task progress length on model performance.

Enhancing Human-Level Planning through Instruction Tuning

Given the suboptimal performance of the evaluated MLLMs on EgoPlan-Bench, we investigate enhancing their human-level planning capabilities through instruction tuning. Specifically, we construct an instruction-tuning dataset, EgoPlan-IT, to align MLLMs with the real-world needs of task planning. The model tuned on EgoPlan-IT demonstrates a significant and robust performance improvement on the proposed benchmark, verifying the effectiveness of our data.
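
As a rough illustration of how an EgoPlan-IT sample might be cast into a conversation-style tuning example, consider the sketch below; the field names and chat format are hypothetical assumptions rather than the released data format.

def to_instruction_example(sample):
    """Convert one EgoPlan-IT sample into a conversation-style tuning example."""
    instruction = (
        f"Task goal: {sample['task_goal']}\n"
        "Considering the progress shown in the video and your current observation, "
        "what should be your next action?"
    )
    return {
        "video": sample["video_path"],  # hypothetical field name
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "assistant", "value": sample["next_action"]},
        ],
    }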

Instruction-tuning results.

BibTeX

@article{chen2023egoplan,
  title={EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning},
  author={Chen, Yi and Ge, Yuying and Ge, Yixiao and Ding, Mingyu and Li, Bohao and Wang, Rui and Xu, Ruifeng and Shan, Ying and Liu, Xihui},
  journal={arXiv preprint arXiv:2312.06722},
  year={2023}
}