Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Abstract

In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images taken before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to capture detailed object characteristics and/or subtle changes in object position. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates three key types of features into a multi-level aligned representation: features that preserve local image information, features aligned with natural language, and features structured through natural language. This allows the model to focus on salient changes by examining the difference between the representations of the two images. We evaluate Contrastive $\lambda$-Repformer on a dataset built on the large-scale standard RT-1 dataset and on a physical robot platform. The results show that our approach outperformed existing approaches, including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy over the representative MLLM-based model.
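The core idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three feature extractors below are placeholder stand-ins (the actual model uses learned encoders), and all function names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_features(image: np.ndarray) -> np.ndarray:
    # Stand-in for features that preserve local image information.
    return image.reshape(-1)[:64]

def aligned_features(image: np.ndarray) -> np.ndarray:
    # Stand-in for vision features aligned with natural language
    # (e.g., a CLIP-style image embedding in the real model).
    return np.tanh(image.reshape(-1)[:32])

def structured_features(image: np.ndarray, instruction: str) -> np.ndarray:
    # Stand-in for features structured through natural language
    # (e.g., derived from a textual description of the scene).
    return np.full(16, len(instruction) / 100.0)

def multi_level_rep(image: np.ndarray, instruction: str) -> np.ndarray:
    # Concatenate the three feature types into one aligned representation.
    return np.concatenate([
        local_features(image),
        aligned_features(image),
        structured_features(image, instruction),
    ])

def predict_success(pre_img, post_img, instruction, w, b=0.0):
    # The difference between post- and pre-manipulation representations
    # highlights what changed; a classifier maps it to a success score.
    diff = multi_level_rep(post_img, instruction) - multi_level_rep(pre_img, instruction)
    logit = float(diff @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # probability of task success
```

Example usage with random inputs: `predict_success(pre, post, "pick up the can", w)` returns a probability in (0, 1); in the actual model the classifier and encoders are trained jointly rather than fixed as here.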

Cite

Text

Goko et al. "Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations." Proceedings of The 8th Conference on Robot Learning, 2024.

Markdown

[Goko et al. "Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations." Proceedings of The 8th Conference on Robot Learning, 2024.](https://mlanthology.org/corl/2024/goko2024corl-task/)

BibTeX

@inproceedings{goko2024corl-task,
  title     = {{Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations}},
  author    = {Goko, Miyu and Kambara, Motonari and Saito, Daichi and Otsuki, Seitaro and Sugiura, Komei},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  year      = {2024},
  pages     = {3242--3263},
  volume    = {270},
  url       = {https://mlanthology.org/corl/2024/goko2024corl-task/}
}