Learnable PINs: Cross-Modal Embeddings for Person Identity

Abstract

We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, which is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of the joint embedding: automatically retrieving and labelling characters in TV dramas.
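
As a rough illustration of the two ideas named in the abstract, the sketch below treats a face and a voice taken from the same talking-face video as a positive pair (the cross-modal self-supervision signal) and gradually raises the proportion of hard negatives used during training. This is a minimal PyTorch sketch under assumed choices: the margin-based contrastive loss, the margin value, and the names contrastive_loss, mine_negatives and hard_fraction are illustrative and are not taken from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, voice_emb, same_video, margin=0.6):
    """Margin-based contrastive loss on L2-normalised embeddings.

    face_emb, voice_emb: (N, D) outputs of the face and voice sub-networks.
    same_video: (N,) bool tensor, True where the face and voice come from
    the same talking-face video (the only supervisory signal used here).
    """
    face_emb = F.normalize(face_emb, dim=1)
    voice_emb = F.normalize(voice_emb, dim=1)
    dist = (face_emb - voice_emb).pow(2).sum(dim=1).sqrt()
    pos = same_video.float() * dist.pow(2)                      # pull matching pairs together
    neg = (1 - same_video.float()) * F.relu(margin - dist).pow(2)  # push mismatched pairs apart
    return (pos + neg).mean()

def mine_negatives(face_emb, voice_emb, hard_fraction):
    """Illustrative curriculum hard-negative mining within a batch.

    For each face, pick a non-matching voice: with probability
    hard_fraction take the closest (hardest) wrong voice, otherwise a
    random one. Ramping hard_fraction from 0 towards 1 over training
    gives a simple easy-to-hard curriculum.
    """
    face_emb = F.normalize(face_emb, dim=1)
    voice_emb = F.normalize(voice_emb, dim=1)
    n = face_emb.size(0)
    d = torch.cdist(face_emb, voice_emb)       # pairwise face-voice distances
    d.fill_diagonal_(float('inf'))             # exclude each face's true voice
    hardest = d.argmin(dim=1)                  # closest wrong voice per face
    rand = torch.randint(0, n, (n,))
    rand = torch.where(rand == torch.arange(n), (rand + 1) % n, rand)
    use_hard = torch.rand(n) < hard_fraction
    return torch.where(use_hard, hardest, rand)

The linear ramp implied by hard_fraction is only a stand-in for the curriculum schedule developed in the paper; the point of the sketch is that no identity labels appear anywhere, only co-occurrence of a face and a voice in the same video.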

Cite

Text

Nagrani et al. "Learnable PINs: Cross-Modal Embeddings for Person Identity." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01261-8_5

Markdown

[Nagrani et al. "Learnable PINs: Cross-Modal Embeddings for Person Identity." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/nagrani2018eccv-learnable/) doi:10.1007/978-3-030-01261-8_5

BibTeX

@inproceedings{nagrani2018eccv-learnable,
  title     = {{Learnable PINs: Cross-Modal Embeddings for Person Identity}},
  author    = {Nagrani, Arsha and Albanie, Samuel and Zisserman, Andrew},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2018},
  doi       = {10.1007/978-3-030-01261-8_5},
  url       = {https://mlanthology.org/eccv/2018/nagrani2018eccv-learnable/}
}