UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

He, Qingdong; Peng, Jinlong; Jiang, Zhengkai; Wu, Kai; Ji, Xiaozhong; Zhang, Jiangning; Wang, Yabiao; Wang, Chengjie; Chen, Mingang; Wu, Yunsheng

doi:10.24963/ijcai.2024/90

IJCAI 2024 pp. 812-820

Abstract

Person Re-Identification (ReID) aims to match individuals across different camera views, but occlusions in real-world scenarios, such as vehicles or crowds, hinder feature extraction and matching. Current occluded ReID methods typically rely on visual augmentation techniques to mitigate occlusion-induced noise. However, relying solely on visual data fails to filter out occlusion noise effectively. In this paper, we introduce the Fine-grained Language-guided Noise Filtering Network (FLaN-Net) for occluded ReID. FLaN-Net employs a categorical attention mechanism to generate adaptive tokens that capture three distinct types of visual information: comprehensive descriptions of individuals, detailed visible attributes, and characteristics of occluding objects. Subsequently, a cross-attention mechanism aligns these prompts with the image, guiding the model to focus on relevant regions. To generate robust and discriminative features for occluded pedestrians, we further introduce a dynamic weighting fusion module that integrates visual, textual, and cross-attention features based on their reliability. Experimental results demonstrate that FLaN-Net outperforms existing methods on occluded ReID benchmarks, offering a robust solution for challenging real-world conditions.

Cite

Text

He et al. "UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/90

Markdown

[He et al. "UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/he2024ijcai-unim/) doi:10.24963/ijcai.2024/90

BibTeX

@inproceedings{he2024ijcai-unim,
  title     = {{UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation}},
  author    = {He, Qingdong and Peng, Jinlong and Jiang, Zhengkai and Wu, Kai and Ji, Xiaozhong and Zhang, Jiangning and Wang, Yabiao and Wang, Chengjie and Chen, Mingang and Wu, Yunsheng},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {812-820},
  doi       = {10.24963/ijcai.2024/90},
  url       = {https://mlanthology.org/ijcai/2024/he2024ijcai-unim/}
}