TaxaBind: A Unified Embedding Space for Ecological Applications

Abstract

We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite image, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-modal retrieval, and audio classification. The datasets and models are made available at https://github.com/mvrl/TaxaBind.
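To illustrate the binding-modality idea described above, the sketch below shows an ImageBind-style setup in which each modality encoder is aligned to the ground-level species image encoder with a symmetric contrastive loss. This is a minimal conceptual sketch, not the authors' implementation: the encoder architectures, feature dimensions, temperature, and loss form are assumptions, and the multimodal patching step is not shown.

# Minimal sketch (assumed, not from the paper): align other modalities to the
# binding modality (ground-level species images) with a symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in encoder mapping a flat feature vector into the shared space."""
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                  nn.Linear(embed_dim, embed_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def symmetric_infonce(z_bind, z_other, temperature=0.07):
    """Contrastive loss pulling paired (binding, other) embeddings together."""
    logits = z_bind @ z_other.t() / temperature
    targets = torch.arange(z_bind.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Ground-level species images act as the binding modality; every other
# modality encoder is trained against it using paired batches.
ground_enc = ToyEncoder(in_dim=768)   # e.g., species photo features (assumed dims)
audio_enc  = ToyEncoder(in_dim=128)   # e.g., audio features
sat_enc    = ToyEncoder(in_dim=256)   # e.g., satellite image features

batch = 8
z_ground = ground_enc(torch.randn(batch, 768))
loss = (symmetric_infonce(z_ground, audio_enc(torch.randn(batch, 128))) +
        symmetric_infonce(z_ground, sat_enc(torch.randn(batch, 256))))
loss.backward()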

Cite

Text

Sastry et al. "TaxaBind: A Unified Embedding Space for Ecological Applications." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Sastry et al. "TaxaBind: A Unified Embedding Space for Ecological Applications." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/sastry2025wacv-taxabind/)

BibTeX

@inproceedings{sastry2025wacv-taxabind,
  title     = {{TaxaBind: A Unified Embedding Space for Ecological Applications}},
  author    = {Sastry, Srikumar and Khanal, Subash and Dhakal, Aayush and Ahmad, Adeel and Jacobs, Nathan},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {1765--1774},
  url       = {https://mlanthology.org/wacv/2025/sastry2025wacv-taxabind/}
}