Improving Language-Supervised Object Detection with Linguistic Structure Analysis

Abstract

Language-supervised object detection typically uses descriptive captions from human-annotated datasets. However, in-the-wild captions span a wider range of language styles. We analyze one particularly ubiquitous form of language: narrative. We study the differences in linguistic structure and visual-text alignment between narrative and descriptive captions, and find that we can classify captions as descriptive or narrative using linguistic features such as part of speech, Rhetorical Structure Theory, and multimodal discourse. We then use this classification to select captions from which to extract image-level labels as supervision for weakly supervised object detection. We also improve the quality of the extracted labels by filtering them based on proximity to verb types, for both descriptive and narrative captions.

Cite

Text

Rai and Kovashka. "Improving Language-Supervised Object Detection with Linguistic Structure Analysis." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi:10.1109/CVPRW59228.2023.00588

Markdown

[Rai and Kovashka. "Improving Language-Supervised Object Detection with Linguistic Structure Analysis." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.](https://mlanthology.org/cvprw/2023/rai2023cvprw-improving/) doi:10.1109/CVPRW59228.2023.00588

BibTeX

@inproceedings{rai2023cvprw-improving,
  title     = {{Improving Language-Supervised Object Detection with Linguistic Structure Analysis}},
  author    = {Rai, Arushi and Kovashka, Adriana},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2023},
  pages     = {5560--5570},
  doi       = {10.1109/CVPRW59228.2023.00588},
  url       = {https://mlanthology.org/cvprw/2023/rai2023cvprw-improving/}
}