STING-BEE: Towards Vision-Language Model for Real-World X-Ray Baggage Security Inspection

Velayudhan, Divya; Ahmed, Abdelfatah; Alansari, Mohamad; Gour, Neha; Behouch, Abderaouf; Hassan, Taimur; Wasim, Syed Talal; Maalej, Nabil; Naseer, Muzammal; Gall, Juergen; Bennamoun, Mohammed; Damiani, Ernesto; Werghi, Naoufel

doi:10.1109/CVPR52734.2025.01934

CVPR 2025 pp. 20767-20777

Abstract

Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol, which ensures domain-aware, coherent captions that lead to multimodal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multimodal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

Cite

Text

Velayudhan et al. "STING-BEE: Towards Vision-Language Model for Real-World X-Ray Baggage Security Inspection." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01934

Markdown

[Velayudhan et al. "STING-BEE: Towards Vision-Language Model for Real-World X-Ray Baggage Security Inspection." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/velayudhan2025cvpr-stingbee/) doi:10.1109/CVPR52734.2025.01934

BibTeX

@inproceedings{velayudhan2025cvpr-stingbee,
  title     = {{STING-BEE: Towards Vision-Language Model for Real-World X-Ray Baggage Security Inspection}},
  author    = {Velayudhan, Divya and Ahmed, Abdelfatah and Alansari, Mohamad and Gour, Neha and Behouch, Abderaouf and Hassan, Taimur and Wasim, Syed Talal and Maalej, Nabil and Naseer, Muzammal and Gall, Juergen and Bennamoun, Mohammed and Damiani, Ernesto and Werghi, Naoufel},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {20767--20777},
  doi       = {10.1109/CVPR52734.2025.01934},
  url       = {https://mlanthology.org/cvpr/2025/velayudhan2025cvpr-stingbee/}
}