LAMS: A Location-Aware Approach for Multimodal Summarization (Student Abstract)

Abstract

Multimodal summarization aims to distill salient information from multiple modalities, of which text and images are the two most widely studied. In recent years, many strong approaches that model image-text interactions have emerged in this field; however, they overlook the fact that most multimodal documents are deliberately organized by their writers. As a result, a critical organizational factor, image location, has long received insufficient attention, even though it can carry illuminating information and signal the key contents of a document. To address this issue, we propose a location-aware approach for multimodal summarization (LAMS) based on the Transformer. We exploit image locations for multimodal summarization via a stack of multimodal fusion blocks, which model high-order interactions between images and text. An extensive experimental study on an extended multimodal dataset validates the superior summarization performance of the proposed model.
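To make the abstract's architectural idea concrete, the sketch below shows one way a location-aware multimodal fusion block could be stacked: each image feature is augmented with a learned embedding of its location in the document before text tokens attend to it via cross-attention. This is a minimal illustration, not the authors' implementation; all module names, dimensions, and the use of paragraph indices as "locations" are assumptions.

```python
# Minimal sketch (not the authors' code) of a location-aware multimodal fusion
# block. Assumes pre-extracted text token features and image features, plus an
# integer location per image (e.g., the paragraph index where it appears).
import torch
import torch.nn as nn


class LocationAwareFusionBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, max_locations=64):
        super().__init__()
        # Learned embedding for where each image sits within the document.
        self.loc_embed = nn.Embedding(max_locations, d_model)
        # Text tokens attend to location-augmented image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text_feats, img_feats, img_locs):
        # text_feats: (B, T, d), img_feats: (B, I, d), img_locs: (B, I) int64
        img_feats = img_feats + self.loc_embed(img_locs)    # inject location info
        fused, _ = self.cross_attn(text_feats, img_feats, img_feats)
        x = self.norm1(text_feats + fused)                  # residual + norm
        return self.norm2(x + self.ffn(x))                  # feed-forward + norm


if __name__ == "__main__":
    blocks = nn.ModuleList(LocationAwareFusionBlock() for _ in range(2))  # small stack
    text = torch.randn(1, 50, 256)         # 50 text token features
    imgs = torch.randn(1, 3, 256)          # 3 image features
    locs = torch.tensor([[0, 12, 40]])     # assumed paragraph indices of the images
    for blk in blocks:
        text = blk(text, imgs, locs)
    print(text.shape)  # torch.Size([1, 50, 256])
```

Stacking several such blocks lets text representations repeatedly re-attend to location-aware image features, which is one plausible way to realize the high-order image-text interactions described above.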

Cite

Text

Zhang et al. "LAMS: A Location-Aware Approach for Multimodal Summarization (Student Abstract)." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I18.17971

Markdown

[Zhang et al. "LAMS: A Location-Aware Approach for Multimodal Summarization (Student Abstract)." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/zhang2021aaai-lams/) doi:10.1609/AAAI.V35I18.17971

BibTeX

@inproceedings{zhang2021aaai-lams,
  title     = {{LAMS: A Location-Aware Approach for Multimodal Summarization (Student Abstract)}},
  author    = {Zhang, Zhengkun and Wang, Jun and Sun, Zhe and Yang, Zhenglu},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2021},
  pages     = {15949--15950},
  doi       = {10.1609/AAAI.V35I18.17971},
  url       = {https://mlanthology.org/aaai/2021/zhang2021aaai-lams/}
}