CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

Abstract

Open-vocabulary object detection (OVD) utilizes image-level cues to expand the linguistic space of region proposals, thereby facilitating the detection of diverse novel classes. Recent works adapt CLIP embedding by minimizing the object-image and object-text discrepancy combinatorially in a discriminative paradigm. However, they ignore the underlying distribution and the disagreement between the image and text objective, leading to the misaligned distribution between the vision and language sub-space. To address the deficiency, we explore the advanced generative paradigm with distribution perception and propose a novel framework based on the diffusion model, coined Continual Latent Diffusion (CLIFF), which formulates a continual distribution transfer among the object, image, and text latent space probabilistically. CLIFF consists of a Variational Latent Sampler (VLS) enabling the probabilistic modeling and a Continual Diffusion Module (CDM) for the distribution transfer. Specifically, in VLS, we first establish a probabilistic object space with region proposals by estimating distribution parameters. Then, the object-centric noise is sampled from the estimated distribution to generate text embedding for OVD. To achieve this generation process, CDM conducts a short-distance object-to-image diffusion from the sampled noise to generate image embedding as the medium, which guides the long-distance diffusion to generate text embedding. Extensive experiments verify that CLIFF can significantly surpass state-of-the-art methods on benchmarks. The code is available at https://github.com/CUHK-AIM-Group/CLIFF.

Cite

Text

Li et al. "CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73001-6_15

Markdown

[Li et al. "CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/li2024eccv-cliff/) doi:10.1007/978-3-031-73001-6_15

BibTeX

@inproceedings{li2024eccv-cliff,
  title     = {{CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection}},
  author    = {Li, Wuyang and Liu, Xinyu and Ma, Jiayi and Yuan, Yixuan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73001-6_15},
  url       = {https://mlanthology.org/eccv/2024/li2024eccv-cliff/}
}