CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer
Abstract
Video Moment Retrieval (VMR) involves locating specific moments within a video based on natural language queries. Existing VMR methods employ various cross-modal alignment strategies, yet they still suffer from limited understanding of fine-grained semantics, semantic overlap, and sparse constraints. To address these limitations, we propose the Concept Decomposition Transformer (CDTR), a novel model for VMR. CDTR introduces a semantic concept decomposition module that disentangles video moments and sentence queries into concept representations. These representations reflect the relevance between concepts and capture the fine-grained semantics that are crucial for cross-modal matching. The decomposed concept representations then serve as pseudo-labels, with adaptive concept-specific thresholds determining whether each is a positive or negative sample. Fine-grained concept alignment is subsequently performed both within the video modality (intra-modal) and across the text and video modalities (cross-modal). This aligns the different conceptual components within the features, enhances the model's ability to distinguish fine-grained semantics, and alleviates semantic overlap and sparse constraints. Comprehensive experiments demonstrate the effectiveness of CDTR, which outperforms state-of-the-art methods on three widely used datasets: QVHighlights, Charades-STA, and TACoS.
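The abstract's idea of turning concept scores into positive or negative pseudo-labels with a separate threshold per concept can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state how the thresholds adapt, so the rule used here (per-concept mean plus `alpha` times the standard deviation), the function name `concept_pseudo_labels`, the parameter `alpha`, and the toy scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def concept_pseudo_labels(scores, alpha=0.5):
    """Binarize concept scores with one threshold per concept.

    scores: list of rows, one per video moment; each row holds K concept scores.
    alpha:  hypothetical knob controlling how strict each threshold is.
    Returns (thresholds, labels), where labels[i][k] is 1 if moment i is
    treated as a positive sample for concept k, else 0.
    """
    num_concepts = len(scores[0])

    # Assumed rule: each concept gets its own threshold, computed from the
    # distribution of that concept's scores across all moments.
    thresholds = []
    for k in range(num_concepts):
        column = [row[k] for row in scores]
        thresholds.append(mean(column) + alpha * pstdev(column))

    # A moment is positive for a concept only if it exceeds that concept's threshold.
    labels = [
        [1 if row[k] > thresholds[k] else 0 for k in range(num_concepts)]
        for row in scores
    ]
    return thresholds, labels


# Toy example: 4 moments, 3 concepts (made-up scores).
scores = [
    [0.9, 0.1, 0.4],
    [0.2, 0.1, 0.5],
    [0.1, 0.8, 0.6],
    [0.0, 0.2, 0.5],
]
thresholds, labels = concept_pseudo_labels(scores)
print(labels)
```

Because each threshold depends on its own concept's score distribution, a concept with uniformly high scores (the third column above) needs a higher score to count as positive than it would under a single global cutoff.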
Cite
Text
Ran et al. "CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I6.32717
Markdown
[Ran et al. "CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/ran2025aaai-cdtr/) doi:10.1609/AAAI.V39I6.32717
BibTeX
@inproceedings{ran2025aaai-cdtr,
title = {{CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer}},
author = {Ran, Ran and Wei, Jiwei and Cai, Xiangyi and Guan, Xiang and Zou, Jie and Yang, Yang and Shen, Heng Tao},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {6684--6692},
doi = {10.1609/AAAI.V39I6.32717},
url = {https://mlanthology.org/aaai/2025/ran2025aaai-cdtr/}
}