Boundary Matching and Refinement Network with Cross-Modal Contrastive Learning for Temporal Moment Localization
Abstract
Temporal Moment Localization (TML) identifies specific temporal intervals in untrimmed videos based on a sentence query. Traditional methods using 2D temporal maps face limitations due to fixed proposal boundaries and GPU memory constraints. We propose a Boundary Matching and Refinement Network (BMRN) that dynamically adjusts moment proposals with predicted center and length offsets for precise localization. BMRN integrates boundary matching and refinement maps with a length-aware cross-modal interactive proposal feature map. Enhanced with Cross-Modal Contrastive Learning (CCL), BMRN-CCL reduces the impact of visually and semantically similar negative samples. Extensive ablation studies and benchmarks on the Charades-STA and ActivityNet Captions datasets demonstrate the superior performance of BMRN and BMRN-CCL, surpassing state-of-the-art methods.
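The two key ideas in the abstract can be sketched in a few lines: refining a moment proposal with predicted center and length offsets, and an InfoNCE-style contrastive loss that pushes the query embedding toward the matched moment and away from similar negatives. The offset parameterization and loss form below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def refine_proposal(center, length, d_center, d_length):
    """Adjust a moment proposal (center, length) with predicted offsets.
    The parameterization (shift scaled by length, log-space length scaling)
    is a common box-regression convention and an assumption here."""
    new_center = center + d_center * length
    new_length = length * math.exp(d_length)
    # Return the refined (start, end) interval in normalized time.
    return new_center - new_length / 2, new_center + new_length / 2

def contrastive_loss(sim_pos, sims_neg, temperature=0.1):
    """InfoNCE-style cross-modal contrastive loss over cosine similarities:
    the positive moment-query pair is contrasted against negative moments.
    A hypothetical stand-in for the paper's CCL objective."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)
```

With zero offsets, `refine_proposal` returns the original interval; a higher positive similarity relative to the negatives drives `contrastive_loss` toward zero.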
Cite
Text
Moon et al. "Boundary Matching and Refinement Network with Cross-Modal Contrastive Learning for Temporal Moment Localization." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91581-9_21
Markdown
[Moon et al. "Boundary Matching and Refinement Network with Cross-Modal Contrastive Learning for Temporal Moment Localization." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/moon2024eccvw-boundary/) doi:10.1007/978-3-031-91581-9_21
BibTeX
@inproceedings{moon2024eccvw-boundary,
title = {{Boundary Matching and Refinement Network with Cross-Modal Contrastive Learning for Temporal Moment Localization}},
author = {Moon, Jinyoung and Seol, Muah and Kim, Jonghee},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {294--310},
doi = {10.1007/978-3-031-91581-9_21},
url = {https://mlanthology.org/eccvw/2024/moon2024eccvw-boundary/}
}