Multimodal Large Language Models (MLLMs), which combine the remarkable reasoning and generalization capabilities of Large Language Models (LLMs) with the ability to comprehend visual inputs, have opened new avenues for embodied task planning. Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner must predict feasible next actions, a capability that MLLMs appear well positioned to provide.
In this work, we aim to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios by introducing a human-annotated benchmark named EgoPlan-Bench. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied scenes. We evaluate a wide range of MLLMs, revealing that none of these models, not even GPT-4V, has yet evolved into an embodied planning generalist.
We further construct an instruction-tuning dataset EgoPlan-IT from videos with human-object interactions, to facilitate the learning of high-level task planning in intricate real-world situations. The experiment results demonstrate that the model tuned on EgoPlan-IT not only significantly improves performance on our benchmark, but can also be applied as a task planner for guiding embodied agents in simulations.
Overview of the construction pipeline for EgoPlan-Bench based on existing untrimmed egocentric videos with detailed action narrations. (1) We first leverage GPT-4 to identify task goals through hierarchical reasoning. (2) We then filter task goals based on the requisite number of actions. (3) Questions are designed in a multiple-choice format: each question is automatically generated from a task goal, and the candidate options are derived from different actions under the same task goal. (4) We employ human annotators to verify each question to ensure the benchmark quality.
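The multiple-choice format produced by steps (3) and (4) can be sketched as a simple record, shown below. The schema, field names, and question template here are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EgoPlanQA:
    """One multiple-choice planning question (hypothetical schema)."""
    task_goal: str     # goal identified by GPT-4, e.g. "boil water"
    video_clip: str    # egocentric clip showing the task progress so far
    question: str      # automatically generated from the task goal
    options: List[str] # candidate next actions under the same task goal
    answer_idx: int    # index of the ground-truth next action

def build_question(task_goal: str) -> str:
    # Questions follow a fixed template filled with the task goal
    # (the template wording here is an assumption for illustration).
    return ("Considering the progress shown in the video and my current "
            f"observation, what should I do next in order to {task_goal}?")
```

A pipeline like this makes the distractors hard by construction: every option is a real action from the same task, so superficial cues do not reveal the answer.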
Our benchmark comprises a total of 3,355 QA pairs. Drawing upon the attributes of the utilized egocentric video sources, our benchmark exhibits three main characteristics. 1) Realism of Tasks: The tasks are extrapolated from authentic real-world videos, offering a closer reflection of daily human needs and showcasing greater variety than artificially constructed tasks. 2) Diversity of Actions: The benchmark involves a diverse set of actions, requiring interaction with hundreds of different objects and extending beyond basic manipulation skills such as picking and placing items. 3) Intricacy of Visual Observations: The visual observations span various real-world scenes, where objects vary in appearance, state, and placement.
a) Statistics of the evaluation data of EgoPlan-Bench.
b) Wordcloud of verbs in task goals of EgoPlan-Bench.
c) Top 20 verbs with top 8 related objects in EgoPlan-Bench candidate actions.
We evaluate a wide range of MLLMs. The results indicate that our EgoPlan-Bench poses significant challenges for existing MLLMs, and there is still a long way to go before these models evolve into generalist embodied task planners.
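A common protocol for scoring MLLMs on multiple-choice benchmarks like this is to have the model score each candidate option and pick the highest-scoring one. The sketch below assumes a caller-supplied `score_option` function (e.g. returning the model's log-likelihood of an option given the question and visual context); it is a generic illustration, not the paper's exact evaluation code.

```python
from typing import Callable, Sequence

def evaluate_accuracy(
    questions: Sequence[dict],
    score_option: Callable[[dict, str], float],
) -> float:
    """Multiple-choice accuracy: the prediction is the option the
    model scores highest; a question counts as correct when that
    option matches the ground-truth answer index."""
    correct = 0
    for q in questions:
        scores = [score_option(q, opt) for opt in q["options"]]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == q["answer_idx"])
    return correct / max(len(questions), 1)
```

Because all options are plausible actions from the same task, accuracy under this protocol directly measures whether the model tracks task progress rather than exploiting wording artifacts.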
We further construct EgoPlan-IT, an instruction-tuning dataset, to facilitate the learning of high-level task planning from human videos. The model tuned on EgoPlan-IT not only exhibits a significant performance enhancement on our benchmark, but also shows potential as a task planner for an embodied agent to complete long-horizon tasks within a simulated environment.
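Conceptually, instruction-tuning data of this kind pairs the visual context and planning question with the ground-truth next action as the target response. The conversion below is a minimal sketch under that assumption; the field names are hypothetical and not taken from the released EgoPlan-IT format.

```python
def to_instruction_sample(qa: dict) -> dict:
    """Turn a planning QA pair into an instruction-tuning record
    (illustrative schema: 'video', 'instruction', 'response' are
    assumed field names, not the dataset's actual keys)."""
    return {
        "video": qa["video_clip"],
        "instruction": qa["question"],
        "response": qa["options"][qa["answer_idx"]],
    }
```

Training on such records teaches the model to emit the correct next action directly, which is what lets the tuned model double as a planner for an embodied agent.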
@article{chen2023egoplan,
title={EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models},
author={Chen, Yi and Ge, Yuying and Ge, Yixiao and Ding, Mingyu and Li, Bohao and Wang, Rui and Xu, Ruifeng and Shan, Ying and Liu, Xihui},
journal={arXiv preprint arXiv:2312.06722},
year={2023}
}