The EgoPlan Challenge will be held at the ICML 2024 Workshop: Multi-modal Foundation Model meets Embodied AI.
Embodied task planning in real-world scenarios presents significant challenges, as it requires a comprehensive understanding of the dynamic and complicated visual environment and the open-form task goals. Multimodal Large Language Models (MLLMs), combining the remarkable reasoning and generalization capabilities of Large Language Models with the ability to comprehend visual inputs, have opened up new possibilities for embodied task planning.
The EgoPlan Challenge aims to evaluate the planning capabilities of MLLMs in complex real-world scenarios, focusing on realistic tasks involved in human daily activities. In the competition, models need to choose the most reasonable next step from a diverse set of candidate actions based on open-form task goal descriptions, real-time task progress videos, and current environment observations, to effectively advance task completion.
We encourage participants to explore MLLM-related technologies (including but not limited to instruction tuning, prompt engineering, etc.) to enhance the task planning capabilities of MLLMs, thereby promoting the research and application of MLLMs as versatile AI assistants in daily life.
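For a concrete sense of what a query looks like, the sketch below assembles the textual part of a multiple-choice planning question (task goal plus four candidate next actions) into a prompt for an MLLM. The wording and function name are illustrative assumptions, not the official challenge prompt template; the task-progress video and current observation are passed to the model as visual inputs and are not shown here.

```python
# Illustrative only: assembling the text side of an EgoPlan-style
# multiple-choice planning query. Not the official prompt template.

def build_prompt(task_goal: str, candidate_actions: list[str]) -> str:
    """Combine an open-form task goal with four candidate next actions."""
    options = "\n".join(
        f"({label}) {action}"
        for label, action in zip("ABCD", candidate_actions)
    )
    return (
        f"Task goal: {task_goal}\n"
        "Given the task progress shown in the video and the current "
        "observation, which action should be taken next?\n"
        f"{options}\n"
        "Answer with the letter of the most reasonable next action."
    )


if __name__ == "__main__":
    print(build_prompt(
        "make a cup of coffee",
        ["pick up the kettle", "open the fridge",
         "turn off the stove", "wash the mug"],
    ))
```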
• Challenge Website
https://icml-mfm-eai.github.io/challenges/#TRACK1
• GitHub Repository
https://github.com/ChenYi99/EgoPlan/tree/challenge
• EgoPlan Paper
https://arxiv.org/abs/2312.06722
• From now until July 10, 2024: Register for this challenge by filling out the Google Form
• May 1, 2024: Training set and validation set available
• June 1, 2024: Test set available, test server opens
• July 10, 2024: Test server closes, registration ends
The EgoPlan datasets are constructed from two existing egocentric video sources: Epic-Kitchens-100 [1] and Ego4D [2].
• The training dataset is automatically constructed and encompasses 50K instruction-following pairs.
• The validation set contains 3,355 human-verified multiple-choice questions with ground-truth answers.
• The test set will be released on June 1, 2024. Please follow the GitHub repository for updates.
For more details, please refer to the GitHub repository.
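As a minimal loading sketch, the snippet below reads a validation-style annotation file and inspects one sample. The file name and the assumption that the annotations are a JSON list are illustrative; check the GitHub repository for the actual file names and schema.

```python
# A minimal sketch, assuming the annotations are distributed as a JSON list.
# The file name below is a placeholder, not the official release name.
import json

with open("EgoPlan_validation.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} multiple-choice questions")
# Inspect one sample to see its fields (goal, video reference, candidate actions, ...)
print(json.dumps(samples[0], indent=2, ensure_ascii=False))
```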
Questions are formatted as multiple-choice problems. MLLMs need to select the most reasonable answer from four candidate choices. The primary metric is Accuracy.
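The metric itself is straightforward: the fraction of questions whose predicted choice matches the ground-truth answer. The helper below is a minimal sketch; `predictions` and `ground_truths` are hypothetical lists of choice labels ("A"–"D").

```python
# Accuracy over multiple-choice predictions: correct answers / total questions.

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    assert len(predictions) == len(ground_truths)
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


print(accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 0.75
```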
The test server will be open on June 1, 2024. Please follow the GitHub repository for updates.
We enhance the planning capability of Video-LLaMA [3] by instruction-tuning it on our training data. The method is detailed in Section 5 of the EgoPlan paper, and the implementation details can be found in the GitHub repository.
Instruction-tuning the baseline takes 8 V100 GPUs for about 0.5 days.
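One common way to turn a generative MLLM into a multiple-choice answerer is to score each candidate action by the language-modeling loss the model assigns to it as the answer, and pick the lowest-loss candidate. The sketch below shows that general idea; `VideoLLM` and its `candidate_loss` method are hypothetical placeholders, not the baseline's actual interface, which lives in the GitHub repository.

```python
# A hedged sketch of loss-based candidate ranking for multiple-choice planning.
# `VideoLLM.candidate_loss` is a placeholder for whatever scoring interface
# the actual model implementation exposes.
from typing import Protocol, Sequence


class VideoLLM(Protocol):
    def candidate_loss(self, video_path: str, prompt: str, candidate: str) -> float:
        """Return the per-token loss of `candidate` as the answer continuation."""
        ...


def predict_choice(model: VideoLLM, video_path: str, prompt: str,
                   candidates: Sequence[str]) -> int:
    """Return the index of the candidate action the model finds most likely."""
    losses = [model.candidate_loss(video_path, prompt, c) for c in candidates]
    return min(range(len(candidates)), key=losses.__getitem__)
```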
From now until July 10, 2024, participants can register for this challenge by filling out the Google Form.
After the test set is released on June 1, 2024, results can be submitted via the test server.
Please follow the GitHub repository for updates. We will also keep you updated on the challenge news through the email address you provided in the Google Form.
• Outstanding Champion: USD $800
• Honorable Runner-up: USD $600
• Innovation Award: USD $600
All participants will retain the copyright of their submitted works. Participants have the right to decide how their work is used and distributed. Our organization will not copy, distribute, display, or perform the participants' works without explicit permission from the authors.
• GitHub Issues: https://github.com/ChenYi99/EgoPlan/issues
• yichennlp@gmail.com
[1] Damen, Dima, et al. "Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100." International Journal of Computer Vision (2022): 1-23.
[2] Grauman, Kristen, et al. "Ego4d: Around the world in 3,000 hours of egocentric video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[3] Zhang, Hang, Xin Li, and Lidong Bing. "Video-llama: An instruction-tuned audio-visual language model for video understanding." arXiv preprint arXiv:2306.02858 (2023).
@article{chen2023egoplan,
title={EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models},
author={Chen, Yi and Ge, Yuying and Ge, Yixiao and Ding, Mingyu and Li, Bohao and Wang, Rui and Xu, Ruifeng and Shan, Ying and Liu, Xihui},
journal={arXiv preprint arXiv:2312.06722},
year={2023}
}