Learning Cross-Modal Embeddings for Cooking Recipes and Food Images

Abstract

In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Accordingly, we train a neural network to find a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Additionally, we demonstrate that regularization via the addition of a high-level, semantic classification objective improves performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M dataset and food and cooking in general.
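Below is a minimal, hypothetical sketch (PyTorch) of the general idea the abstract describes: an image branch and a recipe branch projected into a shared embedding space, trained with a cosine-similarity alignment term plus an auxiliary semantic classification loss acting as a regularizer. All dimensions, encoder choices, class counts, and the loss weighting here are illustrative placeholders, not the paper's exact architecture or objective (the paper additionally uses mismatched negative pairs).

```python
# Hypothetical sketch of a joint image-recipe embedding with a semantic
# classification regularizer. Names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_feat_dim=2048, recipe_feat_dim=1024,
                 embed_dim=1024, num_classes=1000):
        super().__init__()
        # Project pre-extracted image features (e.g., CNN activations) into the shared space.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        # Project recipe features (e.g., from text encoders over ingredients/instructions).
        self.recipe_proj = nn.Linear(recipe_feat_dim, embed_dim)
        # Shared semantic classifier used to regularize both modalities.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, img_feats, recipe_feats):
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        rec_emb = F.normalize(self.recipe_proj(recipe_feats), dim=-1)
        return img_emb, rec_emb

def joint_loss(img_emb, rec_emb, labels, classifier, sem_weight=0.02):
    # Alignment term: pull matching image/recipe pairs toward cosine similarity 1.
    align = (1.0 - F.cosine_similarity(img_emb, rec_emb)).mean()
    # Semantic regularization: both embeddings should predict the same class.
    sem = F.cross_entropy(classifier(img_emb), labels) + \
          F.cross_entropy(classifier(rec_emb), labels)
    return align + sem_weight * sem

# Usage with random stand-in features:
model = JointEmbedding()
img_feats = torch.randn(8, 2048)       # placeholder image features
recipe_feats = torch.randn(8, 1024)    # placeholder recipe features
labels = torch.randint(0, 1000, (8,))  # placeholder semantic class labels
img_emb, rec_emb = model(img_feats, recipe_feats)
loss = joint_loss(img_emb, rec_emb, labels, model.classifier)
loss.backward()
```

Because the two branches share the classifier and the embedding space, retrieval reduces to a nearest-neighbor search by cosine similarity between an image embedding and a set of recipe embeddings (or vice versa).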

Cite

Text

Salvador et al. "Learning Cross-Modal Embeddings for Cooking Recipes and Food Images." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.327

Markdown

[Salvador et al. "Learning Cross-Modal Embeddings for Cooking Recipes and Food Images." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/salvador2017cvpr-learning/) doi:10.1109/CVPR.2017.327

BibTeX

@inproceedings{salvador2017cvpr-learning,
  title     = {{Learning Cross-Modal Embeddings for Cooking Recipes and Food Images}},
  author    = {Salvador, Amaia and Hynes, Nicholas and Aytar, Yusuf and Marin, Javier and Ofli, Ferda and Weber, Ingmar and Torralba, Antonio},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2017},
  doi       = {10.1109/CVPR.2017.327},
  url       = {https://mlanthology.org/cvpr/2017/salvador2017cvpr-learning/}
}