Video-Based Human-Object Interaction Detection from Tubelet Tokens
Abstract
We present a novel vision Transformer, named TUTOR, which learns tubelet tokens that serve as highly-abstracted spatiotemporal representations for video-based human-object interaction (V-HOI) detection. Tubelet tokens structure a video by agglomerating and linking semantically related patch tokens along the spatial and temporal dimensions, which yields two benefits: 1) compactness: each token is learned via a selective attention mechanism that reduces redundant dependencies on other tokens; 2) expressiveness: each token can align with a semantic instance, i.e., an object or a human, thanks to the agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show that our method outperforms existing works by large margins, with a relative mAP gain of $16.14\%$ on VidHOI, a gain of 2 points on CAD-120, and a $4\times$ speedup.
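To make the two claimed properties concrete, here is a minimal, self-contained PyTorch sketch, not the paper's implementation: `selective_attention` restricts each query to its top-`keep` keys (the top-k rule is an assumption standing in for TUTOR's selective attention), and `agglomerate` merges similar patch tokens into fewer tokens by greedy assignment to seed tokens (a hypothetical stand-in for the learned spatial agglomeration; the temporal linking step is omitted).

```python
import torch
import torch.nn.functional as F

def selective_attention(q, k, v, keep: int):
    """Attention where each query attends only to its top-`keep` keys.

    Hedged sketch of 'compactness': low-scoring dependencies are pruned so
    each token aggregates from a small, relevant subset of tokens. The
    top-k rule is an assumption, not the paper's selection mechanism.
    """
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (N, N) logits
    topk = scores.topk(keep, dim=-1).indices                # kept keys per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)                            # 0 where kept, -inf elsewhere
    return F.softmax(scores + mask, dim=-1) @ v

def agglomerate(tokens, num_clusters: int):
    """Merge semantically similar patch tokens into fewer abstract tokens.

    Hypothetical: each token is assigned to its most similar cluster seed
    by cosine similarity, then clusters are mean-pooled; this stands in
    for the paper's learned agglomeration.
    """
    seeds = tokens[:num_clusters]                                     # (C, D)
    sim = F.normalize(tokens, dim=-1) @ F.normalize(seeds, dim=-1).T  # (N, C)
    assign = sim.argmax(dim=-1)                                       # (N,)
    return torch.stack(
        [tokens[assign == c].mean(dim=0) if (assign == c).any() else seeds[c]
         for c in range(num_clusters)]
    )

# Toy usage: 64 patch tokens of dim 32 merged into 8 tokens, then attended.
x = torch.randn(64, 32)
tube = agglomerate(x, num_clusters=8)
out = selective_attention(tube, tube, tube, keep=4)
print(out.shape)  # torch.Size([8, 32])
```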
Cite
Tu et al. "Video-Based Human-Object Interaction Detection from Tubelet Tokens." Neural Information Processing Systems, 2022.
@inproceedings{tu2022neurips-videobased,
  title     = {{Video-Based Human-Object Interaction Detection from Tubelet Tokens}},
  author    = {Tu, Danyang and Sun, Wei and Min, Xiongkuo and Zhai, Guangtao and Shen, Wei},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/tu2022neurips-videobased/}
}