C4AV: Learning Cross-Modal Representations from Transformers

Luo, Shujie; Dai, Hang; Shao, Ling; Ding, Yong

doi:10.1007/978-3-030-66096-3_3

ECCVW 2020 pp. 33-38

Abstract

This paper addresses the object referral problem in the autonomous driving setting. We propose a novel framework that learns cross-modal representations from transformers. To extract linguistic features, we feed the input command to a transformer encoder, while a ResNet backbone learns the image features. The image features are flattened and used as the query inputs to a transformer decoder, where they are aggregated with the linguistic features. Region-of-interest (RoI) alignment is then applied to the feature map output by the transformer decoder to crop RoI features for the region proposals. Finally, a multi-layer classifier performs object referral from the features of the proposal regions.
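
The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random arrays stand in for the ResNet feature map and the transformer-encoder output, a single attention head replaces the full transformer decoder, mean pooling over an integer box replaces RoI alignment, and every name, shape, and weight here (`cross_attention`, `roi_mean_pool`, `boxes`, and so on) is a hypothetical chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_feats, w_q, w_k, w_v):
    # Image positions act as queries; command tokens supply keys and values.
    q = image_feats @ w_q                      # (H*W, d)
    k = text_feats @ w_k                       # (T, d)
    v = text_feats @ w_v                       # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (H*W, T)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

def roi_mean_pool(feature_map, box):
    # Simplified stand-in for RoI alignment: average the features inside
    # an integer box (x0, y0, x1, y1). Real RoI Align samples bilinearly.
    x0, y0, x1, y1 = box
    return feature_map[:, y0:y1, x0:x1].mean(axis=(1, 2))

rng = np.random.default_rng(0)
C, H, W, T, d = 8, 4, 6, 5, 8                  # assumed toy dimensions

feature_map = rng.standard_normal((C, H, W))   # placeholder for ResNet output
text_feats = rng.standard_normal((T, d))       # placeholder for encoder output

# Flatten the spatial grid so that every location becomes one query vector.
image_feats = feature_map.reshape(C, H * W).T  # (H*W, C)

w_q = rng.standard_normal((C, d))
w_k = rng.standard_normal((d, d))
w_v = rng.standard_normal((d, d))
fused, weights = cross_attention(image_feats, text_feats, w_q, w_k, w_v)

# Restore the spatial layout so that proposal regions can be cropped.
fused_map = fused.T.reshape(d, H, W)

# Hypothetical region proposals in feature-map coordinates.
boxes = [(0, 0, 3, 2), (2, 1, 6, 4), (1, 0, 4, 3)]
roi_feats = np.stack([roi_mean_pool(fused_map, b) for b in boxes])  # (P, d)

# Two-layer classifier that scores each proposal, then softmax over proposals.
w1, b1 = rng.standard_normal((d, 16)), np.zeros(16)
w2, b2 = rng.standard_normal((16, 1)), np.zeros(1)
hidden = np.maximum(0.0, roi_feats @ w1 + b1)
logits = (hidden @ w2 + b2).squeeze(-1)        # (P,)
probs = softmax(logits)
best = int(np.argmax(probs))                   # index of the referred proposal
```

In this sketch the proposal with the highest probability is taken as the referred object. Because the weights are random and untrained, the selected index carries no meaning; only the shapes and the order of operations are intended to mirror the description.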

Cite

Text

Luo et al. "C4AV: Learning Cross-Modal Representations from Transformers." European Conference on Computer Vision Workshops, 2020. doi:10.1007/978-3-030-66096-3_3

Markdown

[Luo et al. "C4AV: Learning Cross-Modal Representations from Transformers." European Conference on Computer Vision Workshops, 2020.](https://mlanthology.org/eccvw/2020/luo2020eccvw-c4av/) doi:10.1007/978-3-030-66096-3_3

BibTeX

@inproceedings{luo2020eccvw-c4av,
  title     = {{C4AV: Learning Cross-Modal Representations from Transformers}},
  author    = {Luo, Shujie and Dai, Hang and Shao, Ling and Ding, Yong},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2020},
  pages     = {33-38},
  doi       = {10.1007/978-3-030-66096-3_3},
  url       = {https://mlanthology.org/eccvw/2020/luo2020eccvw-c4av/}
}