Polar Relative Positional Encoding for Video-Language Segmentation
Abstract
In this paper, we tackle the challenging task of video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in the video frames. To accurately denote a target object, the given sentence usually refers to multiple attributes, such as nearby objects and their spatial relations. In this paper, we propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a ``linguistic'' way, i.e., in terms of direction and range. The sentence feature can interact with positional embeddings more directly to extract the implied relative positional relations. We also propose parameterized functions for these positional embeddings to adapt to real-valued directions and ranges. With PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin of 11.4% absolute improvement in terms of mAP on the challenging A2D Sentences dataset. Our method also achieves competitive performance on the J-HMDB Sentences dataset.
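The sketch below illustrates the general idea of a polar relative positional encoding: relative offsets between grid positions are expressed as a direction (angle) and a range (radius), each mapped to an embedding by a parameterized function. This is not the authors' code; the sinusoidal featurization, the linear projections, and all names (e.g., `PolarRelativePE`) are assumptions, since the abstract does not specify the exact parameterization.

```python
# Minimal sketch of a polar relative positional encoding (assumed design,
# not the paper's implementation): pairwise offsets on an H x W grid are
# converted to polar coordinates (direction, range), and each component is
# embedded by a small parameterized function.
import torch
import torch.nn as nn


class PolarRelativePE(nn.Module):
    def __init__(self, dim: int = 64, num_freqs: int = 8):
        super().__init__()
        # Hypothetical parameterized functions over real-valued direction and
        # range: learned sinusoidal frequencies followed by linear projections.
        self.freqs = nn.Parameter(torch.randn(num_freqs))
        self.dir_proj = nn.Linear(2 * num_freqs, dim)
        self.range_proj = nn.Linear(2 * num_freqs, dim)

    def _featurize(self, x: torch.Tensor) -> torch.Tensor:
        # Map real values (...,) to sinusoidal features (..., 2 * num_freqs).
        angles = x.unsqueeze(-1) * self.freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, height: int, width: int) -> torch.Tensor:
        # Relative offsets between every pair of grid positions.
        ys, xs = torch.meshgrid(
            torch.arange(height, dtype=torch.float32),
            torch.arange(width, dtype=torch.float32),
            indexing="ij",
        )
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (N, 2)
        rel = pos[:, None, :] - pos[None, :, :]                  # (N, N, 2)
        # Polar coordinates: direction (angle) and range (radius).
        direction = torch.atan2(rel[..., 0], rel[..., 1])        # (N, N)
        rng = torch.linalg.norm(rel, dim=-1)                     # (N, N)
        # Separate direction and range embeddings, summed into one code.
        return (self.dir_proj(self._featurize(direction))
                + self.range_proj(self._featurize(rng)))         # (N, N, dim)


if __name__ == "__main__":
    pe = PolarRelativePE(dim=64)
    emb = pe(height=8, width=8)
    print(emb.shape)  # torch.Size([64, 64, 64]) -> (H*W, H*W, dim)
```

Such pairwise polar embeddings could then be combined with sentence features inside an attention-style fusion module, in the spirit of the Polar Attention Module described in the abstract; the exact fusion is not specified here.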
Cite
Text
Ning et al. "Polar Relative Positional Encoding for Video-Language Segmentation." International Joint Conference on Artificial Intelligence, 2020. doi:10.24963/IJCAI.2020/132
Markdown
[Ning et al. "Polar Relative Positional Encoding for Video-Language Segmentation." International Joint Conference on Artificial Intelligence, 2020.](https://mlanthology.org/ijcai/2020/ning2020ijcai-polar/) doi:10.24963/IJCAI.2020/132
BibTeX
@inproceedings{ning2020ijcai-polar,
title = {{Polar Relative Positional Encoding for Video-Language Segmentation}},
author = {Ning, Ke and Xie, Lingxi and Wu, Fei and Tian, Qi},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2020},
pages = {948--954},
doi = {10.24963/IJCAI.2020/132},
url = {https://mlanthology.org/ijcai/2020/ning2020ijcai-polar/}
}