Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction
Abstract
Frame-level visual features are generally aggregated over time with techniques such as LSTM, Fisher Vectors, and NetVLAD to produce a robust video-level representation. We introduce a learnable aggregation technique whose primary objective is to retain the short-time temporal structure between frame-level features, and their spatial interdependencies, in the final representation. It can also be easily adapted to cases where training samples are very scarce. We evaluate the method on a real-fake expression prediction dataset to demonstrate its effectiveness. Our method obtains a 65% score on the test set in the official MAP evaluation, only one misclassified decision away from the best result reported in the ChaLearn Challenge (i.e., 66.7%). Lastly, we believe this method can be extended to other problems such as action/event recognition in the future.
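The abstract does not spell out the aggregation itself, so the snippet below is only an illustrative sketch of the general idea of keeping short-time temporal structure while pooling frame-level features into a video-level descriptor. It is not the authors' learnable method; the function name `aggregate_video_features`, the window length, and the feature dimensions are all hypothetical choices made for the example.

```python
import numpy as np

def aggregate_video_features(frame_feats, window=5):
    """Pool frame-level CNN features into a video-level descriptor.

    frame_feats: (T, D) array of per-frame features.
    window: number of consecutive frames pooled together, so that
            short-time temporal order is preserved between window descriptors.
    """
    T, D = frame_feats.shape
    # Pad by repeating the last frame so T is a multiple of the window length.
    pad = (-T) % window
    if pad:
        frame_feats = np.vstack([frame_feats, np.repeat(frame_feats[-1:], pad, axis=0)])
    # Average inside each short temporal window (local pooling) ...
    windows = frame_feats.reshape(-1, window, D).mean(axis=1)
    # ... then concatenate the window descriptors in order, so the
    # coarse temporal structure of the video is retained.
    video_desc = windows.reshape(-1)
    # L2-normalize the final video-level representation.
    return video_desc / (np.linalg.norm(video_desc) + 1e-12)

# Example: 40 frames with 512-dimensional features.
video = aggregate_video_features(np.random.randn(40, 512), window=5)
print(video.shape)  # (4096,) = (40 / 5) * 512
```

In the paper the aggregation weights are learned rather than fixed averages; this sketch only shows why windowed pooling, unlike global pooling over all frames, keeps the ordering of short-time segments in the video-level representation.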
Cite
Text
Özkan and Akar. "Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction." IEEE/CVF International Conference on Computer Vision Workshops, 2017. doi:10.1109/ICCVW.2017.366
Markdown
[Özkan and Akar. "Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction." IEEE/CVF International Conference on Computer Vision Workshops, 2017.](https://mlanthology.org/iccvw/2017/ozkan2017iccvw-relaxed/) doi:10.1109/ICCVW.2017.366
BibTeX
@inproceedings{ozkan2017iccvw-relaxed,
title = {{Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction}},
author = {Özkan, Savas and Akar, Gozde Bozdagi},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2017},
pages = {3094-3100},
doi = {10.1109/ICCVW.2017.366},
url = {https://mlanthology.org/iccvw/2017/ozkan2017iccvw-relaxed/}
}