Detecting Deep-Fake Videos from Aural and Oral Dynamics

Abstract

A face-swap deep fake replaces a person’s face (from eyebrows to chin) with another face. A lip-sync deep fake replaces a person’s mouth region to be consistent with an impersonated or synthesized audio track. An overlooked aspect in the creation of these deep-fake videos is the human ear. Statically, the shape of the human ear has been shown to provide a biometric signal. Dynamically, movement of the mandible (lower jaw) causes changes in the shape of the ear and ear canal. While the facial identity in a face-swap deep fake may accurately depict the co-opted identity, the ears belong to the original identity. While the mouth in a lip-sync deep fake may be well synchronized with the audio, the dynamics of the ear motion will be decoupled from the mouth and jaw motion. We describe a forensic technique that exploits these static and dynamic aural properties.
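The dynamic cue lends itself to a simple correlation test. Below is a minimal sketch, not the authors' implementation, assuming you already have grayscale video frames, an ear bounding box from an upstream tracker, and a per-frame jaw-opening measurement from facial landmarks; it correlates frame-to-frame ear-region optical flow with jaw motion, where an unusually low correlation is consistent with a lip-sync deep fake whose mouth was synthesized independently of the rest of the head.

```python
import cv2
import numpy as np

def ear_motion_signal(frames, ear_box):
    """Per-frame motion magnitude inside a fixed ear region.

    frames  : list of grayscale frames (np.uint8 arrays)
    ear_box : (x, y, w, h) box around the visible ear, assumed to
              come from an upstream face/ear tracker
    """
    x, y, w, h = ear_box
    signal = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev[y:y + h, x:x + w], curr[y:y + h, x:x + w],
            None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Mean flow magnitude summarizes how much the ear deformed
        # between consecutive frames.
        signal.append(np.linalg.norm(flow, axis=2).mean())
    return np.asarray(signal)

def aural_oral_correlation(ear_signal, jaw_opening):
    """Pearson correlation between ear motion and jaw motion.

    jaw_opening is assumed to be a per-frame jaw-opening distance
    (e.g., chin-to-upper-lip landmark distance); it is differenced
    so it matches the frame-to-frame ear motion signal.
    """
    jaw_motion = np.abs(np.diff(jaw_opening))
    n = min(len(ear_signal), len(jaw_motion))
    return np.corrcoef(ear_signal[:n], jaw_motion[:n])[0, 1]

# Illustrative decision rule; the 0.2 threshold is a placeholder,
# not a value from the paper:
# if aural_oral_correlation(ear_sig, jaw_sig) < 0.2: flag the video
```

In a real pipeline the same landmark tracker would supply both signals, and the correlation would be computed over speaking segments only, since the coupling between mandible and ear motion is only informative when the jaw is actually moving.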

Cite

Text

Agarwal and Farid. "Detecting Deep-Fake Videos from Aural and Oral Dynamics." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021. doi:10.1109/CVPRW53098.2021.00109

Markdown

[Agarwal and Farid. "Detecting Deep-Fake Videos from Aural and Oral Dynamics." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.](https://mlanthology.org/cvprw/2021/agarwal2021cvprw-detecting/) doi:10.1109/CVPRW53098.2021.00109

BibTeX

@inproceedings{agarwal2021cvprw-detecting,
  title     = {{Detecting Deep-Fake Videos from Aural and Oral Dynamics}},
  author    = {Agarwal, Shruti and Farid, Hany},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2021},
  pages     = {981--989},
  doi       = {10.1109/CVPRW53098.2021.00109},
  url       = {https://mlanthology.org/cvprw/2021/agarwal2021cvprw-detecting/}
}