Set Augmented Triplet Loss for Video Person Re-Identification
Abstract
Modern video person re-identification (re-ID) models are often trained using a metric learning approach, supervised by a triplet loss. The triplet loss used in video re-ID is usually based on so-called clip features, each aggregated from a few frame features. In this paper, we propose to model the video clip as a set and instead study the distance between sets in the corresponding triplet loss. In contrast to the distance between clip representations, the distance between clip sets considers the pairwise similarity of each element (i.e., frame representation) between two sets. This allows the network to directly optimize the feature representation at the frame level. Apart from the commonly used set distance metrics (e.g., ordinary distance and Hausdorff distance), we further propose a hybrid distance metric tailored for the set-aware triplet loss. We also propose a hard positive set construction strategy using the learned class prototypes in a batch. Our proposed method achieves state-of-the-art results across several standard benchmarks, demonstrating the advantages of the proposed method.
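To make the core idea concrete, below is a minimal sketch of a set-aware triplet loss in PyTorch, using the symmetric Hausdorff distance between frame-feature sets. The function names, the Euclidean frame metric, and the margin value are illustrative assumptions; the paper's hybrid distance metric and hard positive set construction are not reproduced here.

```python
# Minimal sketch of a set-aware triplet loss, assuming PyTorch and
# Euclidean distances between frame features. Names and the margin
# are hypothetical; this is not the paper's full method.
import torch
import torch.nn.functional as F


def hausdorff_dist(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Hausdorff distance between two frame-feature sets.

    a: (m, d) frame features of one clip; b: (n, d) of another clip.
    """
    d = torch.cdist(a, b)  # (m, n) pairwise Euclidean distances
    return torch.max(d.min(dim=1).values.max(), d.min(dim=0).values.max())


def set_triplet_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negative: torch.Tensor,
                     margin: float = 0.3) -> torch.Tensor:
    """Triplet loss on set distances rather than on clip features
    aggregated from frames."""
    d_ap = hausdorff_dist(anchor, positive)
    d_an = hausdorff_dist(anchor, negative)
    return F.relu(d_ap - d_an + margin)


# Hypothetical usage: three clips of 8 frames with 256-d frame features.
anchor, positive, negative = (torch.randn(8, 256) for _ in range(3))
loss = set_triplet_loss(anchor, positive, negative)
```

Because the loss is computed from pairwise frame-to-frame distances, gradients flow to individual frame representations instead of only to an aggregated clip feature, which is the motivation stated in the abstract.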
Cite
Text
Fang et al. "Set Augmented Triplet Loss for Video Person Re-Identification." Winter Conference on Applications of Computer Vision, 2021.
Markdown
[Fang et al. "Set Augmented Triplet Loss for Video Person Re-Identification." Winter Conference on Applications of Computer Vision, 2021.](https://mlanthology.org/wacv/2021/fang2021wacv-set/)
BibTeX
@inproceedings{fang2021wacv-set,
title = {{Set Augmented Triplet Loss for Video Person Re-Identification}},
author = {Fang, Pengfei and Ji, Pan and Petersson, Lars and Harandi, Mehrtash},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2021},
pages = {464--473},
url = {https://mlanthology.org/wacv/2021/fang2021wacv-set/}
}