Figure: Challenge overview.

🚀Introduction

The EgoPlan Challenge will be held at the ICML 2024 Workshop: Multi-modal Foundation Model meets Embodied AI.

Embodied task planning in real-world scenarios presents significant challenges, as it requires a comprehensive understanding of the dynamic and complicated visual environment and the open-form task goals. Multimodal Large Language Models (MLLMs), combining the remarkable reasoning and generalization capabilities of Large Language Models with the ability to comprehend visual inputs, have opened up new possibilities for embodied task planning.

The EgoPlan Challenge aims to evaluate the planning capabilities of MLLMs in complex real-world scenarios, focusing on realistic tasks involved in human daily activities. In the competition, models need to choose the most reasonable next step from a diverse set of candidate actions based on open-form task goal descriptions, real-time task progress videos, and current environment observations, to effectively advance task completion.

We encourage participants to explore MLLM-related technologies (including but not limited to instruction tuning, prompt engineering, etc.) to enhance the task planning capabilities of MLLMs, thereby promoting the research and application of MLLMs as versatile AI assistants in daily life.
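
As a rough illustration of what such a query looks like on the text side, the sketch below assembles a multiple-choice planning prompt from a task goal and four candidate actions. The wording and structure are illustrative assumptions only; the task-progress video and current observation frame would be supplied to the MLLM separately as visual inputs, and the official data format and prompt templates are documented in the GitHub repository.

def build_planning_prompt(task_goal, candidates):
    """Assemble the textual part of a multiple-choice planning query.

    Note: this prompt wording is an illustrative assumption, not the official
    EgoPlan template. The task-progress video and the current observation
    frame are passed to the MLLM separately as visual inputs.
    """
    letters = ["A", "B", "C", "D"]
    lines = [
        f"Task goal: {task_goal}",
        "Considering the progress shown in the video and the current observation,",
        "which is the most reasonable next action?",
    ]
    lines += [f"{letter}. {action}" for letter, action in zip(letters, candidates)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

# Toy example with a made-up sample:
print(build_planning_prompt(
    "make a cup of coffee",
    ["pick up the kettle", "open the fridge", "turn off the tap", "wash the cup"],
))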

🔍Track Information

📆Timeline

• From now until July 10, 2024: Register for this challenge by filling out the Google Form

• May 1, 2024: Training set and validation set available

• June 1, 2024: Test set available, test server opens

• July 10, 2024: Test server closes, registration ends

📚Dataset Information

The EgoPlan datasets are constructed based on two existing egocentric video sources: Epic-Kitchens-100 [1] and Ego4D [2].


• The training dataset is automatically constructed and encompasses 50K instruction-following pairs.

• The validation set contains 3,355 human-verified multiple-choice questions with ground-truth answers.

• The test set will be released on June 1, 2024. Please follow the GitHub repository for updates.

For more details, please refer to the GitHub repository.
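
As a minimal sketch of how the released files might be consumed, the snippet below loads a validation file and inspects one sample. The file name and JSON keys here are assumptions for illustration only; the authoritative schema and download instructions are in the GitHub repository.

import json

# Hypothetical file name and keys; consult the GitHub repository for the
# actual release format of the EgoPlan validation set.
with open("egoplan_validation.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} multiple-choice questions")

sample = samples[0]
# Each sample is expected to provide an open-form task goal, a pointer to the
# egocentric video (task progress and current observation), four candidate
# actions, and the ground-truth choice.
print(sample.get("task_goal"))
print([sample.get(key) for key in ("choice_a", "choice_b", "choice_c", "choice_d")])
print(sample.get("answer"))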

📊Evaluation Metrics

Questions are formatted as multiple-choice problems. MLLMs need to select the most reasonable answer from four candidate choices. The primary metric is Accuracy.

Figure: Question formulation.
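
Since accuracy is the primary metric, scoring a set of predictions reduces to comparing predicted choice letters against the ground truth, roughly as in the sketch below (the id-to-letter mapping is an illustrative assumption; official scoring is performed by the test server).

def accuracy(predictions, ground_truth):
    """Fraction of questions whose predicted choice matches the ground truth.

    Both arguments map a sample id to a choice letter ("A"-"D"). A missing
    prediction counts as incorrect.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1 for qid, gold in ground_truth.items() if predictions.get(qid) == gold
    )
    return correct / len(ground_truth)

# Toy example: one of two questions answered correctly -> 0.5
print(accuracy({"q1": "A", "q2": "C"}, {"q1": "A", "q2": "B"}))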

🖥️Test Server

The test server will open on June 1, 2024. Please follow the GitHub repository for updates.

📈Baselines

We enhance the planning capability of Video-LLaMA [3] by instruction-tuning it on our training data. The detailed method is described in Section 5 of the EgoPlan paper, and the implementation details can be found in the GitHub repository.
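
Independent of the tuning recipe, one common way to obtain a multiple-choice prediction from a generative MLLM is to score each candidate action under the model and keep the most likely one. The sketch below illustrates this ranking idea with a hypothetical score_loss callable standing in for a model-specific loss; it is not the baseline's exact inference code, which is in the repository.

from typing import Callable, Sequence

def pick_choice(score_loss: Callable[[str], float], candidates: Sequence[str]) -> int:
    """Return the index of the candidate with the lowest loss (highest likelihood).

    `score_loss` is a hypothetical stand-in for a model-specific function that
    returns the language-modeling loss of a candidate answer given the visual
    inputs and the question text.
    """
    losses = [score_loss(candidate) for candidate in candidates]
    return min(range(len(candidates)), key=losses.__getitem__)

# Toy usage with a dummy scorer that happens to prefer shorter answers:
def dummy_scorer(text):
    return float(len(text))

print(pick_choice(dummy_scorer, ["pick up the kettle", "open the fridge", "stir", "nap"]))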

⚙️Reference Training Cost

Training the baseline takes approximately 0.5 days on 8 V100 GPUs.

📝Participation

From now until July 10, 2024, participants can register for this challenge by filling out the Google Form.

After the test set is released on June 1, 2024, results can be submitted via the test server.

Please follow the GitHub repository for updates. We will also keep you updated on the challenge news through the email address you provided in the Google Form.

🏅Award

• Outstanding Champion: USD $800

• Honorable Runner-up: USD $600

• Innovation Award: USD $600

❗Rules

  • To participate in the challenge, each team is strictly required to register by filling out the Google Form.
  • Any kind of Multimodal Large Language Model (MLLM) can be used in this challenge.
  • MLLMs must rely solely on the visual modality within the designated temporal scope to perceive the environment and monitor task progress: visual input is restricted to the current observation frame and its preceding frames (see the sketch after this list), and incorporating textual narrations of past actions into the input is not permitted.
  • Using training data in addition to the officially released EgoPlan-IT is allowed.
  • To check for compliance, participants will be asked to submit technical reports to the challenge committee, and award winners will be asked to give a public talk about their work.
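
To make the temporal-scope rule above concrete, the sketch below keeps only the frames at or before the current observation timestamp. The (timestamp, frame) representation is an assumption for illustration; adapt it to your own video-loading pipeline.

def filter_allowed_frames(frames, current_timestamp):
    """Keep only frames permitted by the rules: the current observation frame
    and the frames preceding it.

    `frames` is an iterable of (timestamp_in_seconds, frame_data) pairs; this
    representation is a hypothetical example, not an official format.
    """
    return [(ts, frame) for ts, frame in frames if ts <= current_timestamp]

# Toy usage: frames after t = 4.0 s are dropped.
toy_frames = [(1.0, "f1"), (2.5, "f2"), (4.0, "f3"), (5.5, "f4")]
print(filter_allowed_frames(toy_frames, current_timestamp=4.0))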

🛡️Copyright Policy

All participants will retain the copyright of their submitted works. Participants have the right to decide how their work is used and distributed. Our organization will not copy, distribute, display, or perform participants' works without their explicit permission.

📩Contact

👥Organizers


References

[1] Damen, Dima, et al. "Rescaling egocentric vision: Collection, pipeline and challenges for Epic-Kitchens-100." International Journal of Computer Vision (2022): 1-23.

[2] Grauman, Kristen, et al. "Ego4d: Around the world in 3,000 hours of egocentric video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Zhang, Hang, Xin Li, and Lidong Bing. "Video-llama: An instruction-tuned audio-visual language model for video understanding." arXiv preprint arXiv:2306.02858 (2023).

BibTeX

@article{chen2023egoplan,
  title={EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models},
  author={Chen, Yi and Ge, Yuying and Ge, Yixiao and Ding, Mingyu and Li, Bohao and Wang, Rui and Xu, Ruifeng and Shan, Ying and Liu, Xihui},
  journal={arXiv preprint arXiv:2312.06722},
  year={2023}
}