MUSE: Mamba Is Efficient Multi-Scale Learner for Text-Video Retrieval

Abstract

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods build on large-scale pre-trained vision-language models (e.g., CLIP). However, due to CLIP's inherently plain (single-scale) structure, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. We then employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies of different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
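The pipeline the abstract describes can be sketched at a high level: a feature pyramid turns the last single-scale feature map into maps at several resolutions, whose tokens are then processed jointly by a linear-time sequence model. The sketch below is a hypothetical illustration, not the paper's implementation: average pooling is an assumed pyramid operator, and a toy exponential-decay scan stands in for Mamba's selective state-space scan (which uses input-dependent parameters).

```python
import numpy as np

def feature_pyramid(feat, scales=(1, 2, 4)):
    """Build multi-scale maps from one H x W x C feature map.
    Average pooling is an assumption; the paper's operator may differ."""
    H, W, C = feat.shape
    maps = []
    for s in scales:
        h, w = H // s, W // s
        pooled = feat[:h * s, :w * s].reshape(h, s, w, s, C).mean(axis=(1, 3))
        maps.append(pooled)
    return maps

def linear_scan(tokens, decay=0.9):
    """Toy linear-complexity recurrent scan standing in for Mamba's
    selective scan: one pass over the sequence, O(length) time."""
    state = np.zeros(tokens.shape[-1])
    out = np.empty_like(tokens)
    for t, x in enumerate(tokens):
        state = decay * state + (1 - decay) * x  # simple fixed-decay recurrence
        out[t] = state
    return out

# Flatten each scale to tokens and jointly process the concatenated sequence.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))  # last single-scale feature map
tokens = np.concatenate([m.reshape(-1, 16) for m in feature_pyramid(feat)])
fused = linear_scan(tokens)
print(tokens.shape, fused.shape)  # sequence length = 64 + 16 + 4 = 84 tokens
```

Because the scan visits each of the 84 multi-scale tokens once, cost grows linearly with sequence length, which is the efficiency argument for using a Mamba-style learner over quadratic attention across scales.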

Cite

Text

Tang et al. "MUSE: Mamba Is Efficient Multi-Scale Learner for Text-Video Retrieval." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I7.32778

Markdown

[Tang et al. "MUSE: Mamba Is Efficient Multi-Scale Learner for Text-Video Retrieval." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/tang2025aaai-muse/) doi:10.1609/AAAI.V39I7.32778

BibTeX

@inproceedings{tang2025aaai-muse,
  title     = {{MUSE: Mamba Is Efficient Multi-Scale Learner for Text-Video Retrieval}},
  author    = {Tang, Haoran and Cao, Meng and Huang, Jinfa and Liu, Ruyang and Jin, Peng and Li, Ge and Liang, Xiaodan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {7238--7246},
  doi       = {10.1609/AAAI.V39I7.32778},
  url       = {https://mlanthology.org/aaai/2025/tang2025aaai-muse/}
}