Modulating Bottom-up and Top-Down Visual Processing via Language-Conditional Filters

Abstract

How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that also using language to condition the bottom-up processing from pixels to high-level features can improve overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing, in addition to top-down attention, leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves the segmentation of objects, especially when the input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
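
To make the idea of language-conditional filters concrete, the following is a minimal sketch and not the authors' implementation (their code is in the repository linked above). It assumes that a text embedding is mapped by a linear layer to the weights of a 1x1 convolution, which is then applied to a visual feature map, so the same image yields different features under different sentences. All names (`make_linear`, `generate_filters`, `conv1x1`), the dimensions, and the random weights standing in for learned parameters are hypothetical.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned linear layer (assumption)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def generate_filters(text_vec, weight, out_ch, in_ch):
    """Map a text embedding to 1x1 convolution filters shaped [out_ch][in_ch]."""
    # One dot product per filter weight: flat has out_ch * in_ch entries.
    flat = [sum(w * t for w, t in zip(row, text_vec)) for row in weight]
    return [flat[o * in_ch:(o + 1) * in_ch] for o in range(out_ch)]


def conv1x1(features, filters):
    """Apply 1x1 filters to a feature map [in_ch][H][W]; returns [out_ch][H][W]."""
    in_ch = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [[sum(filt[c] * features[c][y][x] for c in range(in_ch))
          for x in range(width)]
         for y in range(height)]
        for filt in filters
    ]


# Toy setup: a 3-channel 5x5 "image" and two different 8-dimensional text embeddings.
rng = random.Random(0)
in_ch, out_ch, text_dim, size = 3, 4, 8, 5
weight = make_linear(text_dim, out_ch * in_ch, rng)
image = [[[rng.random() for _ in range(size)] for _ in range(size)]
         for _ in range(in_ch)]
text_a = [rng.uniform(-1, 1) for _ in range(text_dim)]
text_b = [rng.uniform(-1, 1) for _ in range(text_dim)]

# The same pixels are filtered differently depending on the sentence.
feats_a = conv1x1(image, generate_filters(text_a, weight, out_ch, in_ch))
feats_b = conv1x1(image, generate_filters(text_b, weight, out_ch, in_ch))
```

In a U-Net-style model, a layer of this kind could sit in the encoder (bottom-up), in the decoder (top-down), or in both, which corresponds to the configurations the abstract compares. Plain Python lists are used here only to keep the sketch dependency-free.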

Cite

Text

Kesen et al. "Modulating Bottom-up and Top-Down Visual Processing via Language-Conditional Filters." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00507

Markdown

[Kesen et al. "Modulating Bottom-up and Top-Down Visual Processing via Language-Conditional Filters." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/kesen2022cvprw-modulating/) doi:10.1109/CVPRW56347.2022.00507

BibTeX

@inproceedings{kesen2022cvprw-modulating,
  title     = {{Modulating Bottom-up and Top-Down Visual Processing via Language-Conditional Filters}},
  author    = {Kesen, Ilker and Can, Ozan Arkan and Erdem, Erkut and Erdem, Aykut and Yüret, Deniz},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2022},
  pages     = {4609--4619},
  doi       = {10.1109/CVPRW56347.2022.00507},
  url       = {https://mlanthology.org/cvprw/2022/kesen2022cvprw-modulating/}
}