EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

1Tencent AI Lab, 2The University of Hong Kong, 3ARC Lab, Tencent PCG,
4University of California, Berkeley, 5Peng Cheng Laboratory
Benchmark overview.
Our EgoPlan-Bench evaluates Egocentric Embodied Planning, where a model predicts the next feasible action given three inputs: a video showing task progress, the current visual observation, and a language instruction. In contrast, existing egocentric video QA benchmarks mainly evaluate Comprehension, where a model answers questions based on spatial and temporal understanding of the entire video. The key difference is that the EgoPlan-Bench setting aligns more closely with real-world embodied AI applications, since the model's output can directly serve as a plan for agents to execute tasks in real environments.
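
To make this input-output setting concrete, below is a minimal sketch of how a single evaluation example could be represented and turned into a multiple-choice query. The field names and prompt wording are illustrative assumptions, not the benchmark's official schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EgoPlanExample:
    """One planning example (field names are illustrative, not the official schema)."""
    task_goal: str                # open-form language instruction, e.g. "make a cup of coffee"
    task_progress_video: str      # path to the egocentric clip showing actions completed so far
    current_observation: str      # path to the frame capturing the current state of the scene
    candidate_actions: List[str] = field(default_factory=list)  # candidate next actions
    answer: int = 0               # index of the only feasible next action

def build_prompt(example: EgoPlanExample) -> str:
    """Assemble a multiple-choice planning query for an MLLM (illustrative format)."""
    options = "\n".join(f"({chr(65 + i)}) {a}" for i, a in enumerate(example.candidate_actions))
    return (
        f"Task goal: {example.task_goal}\n"
        "Given the task progress video and the current observation, "
        f"what is the next feasible action?\n{options}"
    )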

Abstract

Multimodal Large Language Models (MLLMs), which combine the remarkable reasoning and generalization capabilities of Large Language Models (LLMs) with the ability to comprehend visual inputs, have opened up new avenues for embodied task planning. Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected to predict feasible actions, a feat that MLLMs are, in principle, well positioned to achieve.

In this work, we aim to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios by introducing a human-annotated benchmark named EgoPlan-Bench. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied scenes. We evaluate a wide range of MLLMs and find that none of them, not even GPT-4V, has yet evolved into a generalist embodied planner.

We further construct EgoPlan-IT, an instruction-tuning dataset built from videos of human-object interactions, to facilitate the learning of high-level task planning in intricate real-world situations. The experimental results demonstrate that the model tuned on EgoPlan-IT not only significantly improves performance on our benchmark but can also be applied as a task planner to guide embodied agents in simulation.

Benchmark Construction

Benchmark construction.

Overview of the construction pipeline for EgoPlan-Bench, built on existing untrimmed egocentric videos with detailed action narrations. (1) We first leverage GPT-4 to identify task goals through hierarchical reasoning. (2) We then filter task goals based on the requisite number of actions. (3) Questions are automatically generated from the task goals in a multiple-choice format, with options derived from different actions under the same task goal. (4) Human annotators verify each question to ensure benchmark quality.
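
As an illustration of step (3), the sketch below shows one way a multiple-choice question could be assembled from the actions annotated under a single task goal. The function and dictionary keys are hypothetical and only meant to convey the idea, not to reproduce the exact pipeline.

import random

def make_question(task_goal, actions, answer_index):
    """Compose a multiple-choice planning question (illustrative, not the exact pipeline).

    task_goal:    a goal distilled by GPT-4 from the video narrations
    actions:      actions occurring under this task goal in the source video
    answer_index: index into `actions` marking the ground-truth next action
    """
    correct = actions[answer_index]
    # Distractors are drawn from other actions under the same task goal, so every
    # option is plausible for the goal but only one is feasible at this moment.
    distractors = random.sample([a for i, a in enumerate(actions) if i != answer_index], k=3)
    options = distractors + [correct]
    random.shuffle(options)
    question = (
        f"Considering the progress shown in the video and my current observation, "
        f"what should I do next in order to {task_goal}?"
    )
    return {"question": question, "options": options, "answer": options.index(correct)}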

Data Statistics

Our benchmark comprises a total of 3,355 QA pairs. Drawing on the attributes of the underlying egocentric video sources, it exhibits three main characteristics. 1) Realism of Tasks: the tasks are extrapolated from authentic real-world videos, offering a closer reflection of daily human needs and showcasing greater variety than artificially constructed tasks. 2) Diversity of Actions: the benchmark involves a diverse set of actions, requiring interaction with hundreds of different objects and extending beyond basic manipulation skills such as picking and placing items. 3) Intricacy of Visual Observations: the visual observations come from various real-world scenes, where objects vary in appearance, state, and placement.


Evaluation Data Statistics.

a) Statistics of the evaluation data of EgoPlan-Bench.



Task Goal Distribution.

b) Wordcloud of verbs in task goals of EgoPlan-Bench.

Action distribution.

c) Top 20 verbs with top 8 related objects in EgoPlan-Bench candidate actions.

Evaluation Results

We evaluate a wide range of MLLMs. The results indicate that our EgoPlan-Bench poses significant challenges for existing MLLMs, and there is still a long way to go before these models evolve into generalist embodied task planners.
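
For reference, a common way to score such multiple-choice benchmarks is to compare the model's preference over the candidate options against the ground truth. The loop below is a minimal sketch under that assumption; `score_option` is a placeholder for whatever MLLM-specific scoring (e.g., option likelihood) one plugs in, and is not part of any official toolkit.

def evaluate(examples, score_option):
    """Compute multiple-choice accuracy (illustrative evaluation loop).

    score_option(example, option) -> float: higher means the MLLM prefers this
    option given the task-progress video, current observation, and instruction.
    """
    correct = 0
    for ex in examples:
        scores = [score_option(ex, opt) for opt in ex["options"]]
        prediction = scores.index(max(scores))
        correct += int(prediction == ex["answer"])
    return correct / len(examples)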

Enhancing EgoPlan Capability by Instruction Tuning

We further construct EgoPlan-IT, an instruction-tuning dataset, to facilitate the learning of high-level task planning from human videos. The model tuned on EgoPlan-IT not only exhibits a significant performance improvement on our benchmark, but also shows potential as a task planner for an embodied agent completing long-horizon tasks in a simulated environment.
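
As a rough illustration, an EgoPlan-IT sample could be serialized into a standard instruction-tuning record like the one below. The keys and file name are assumptions for exposition; the released format may differ.

# A single instruction-tuning record (illustrative schema, not the released format).
egoplan_it_sample = {
    "video": "clip_000123.mp4",   # hypothetical egocentric clip with human-object interactions
    "instruction": (
        "The video shows my progress so far. Considering the current observation, "
        "what is my next action in order to mix the vegetables in the bowl?"
    ),
    "output": "pick up the spoon",  # ground-truth next action as free-form text
}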

Simulation case.

BibTeX

@article{chen2023egoplan,
  title={EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models},
  author={Chen, Yi and Ge, Yuying and Ge, Yixiao and Ding, Mingyu and Li, Bohao and Wang, Rui and Xu, Ruifeng and Shan, Ying and Liu, Xihui},
  journal={arXiv preprint arXiv:2312.06722},
  year={2023}
}