ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation
Abstract
Transformers have revolutionized Computer Vision (CV) through self-attention mechanisms. However, their complexity makes latent token representations difficult to interpret. We introduce ULTra, a framework for interpreting Transformer embeddings and uncovering meaningful semantic patterns within them. ULTra enables unsupervised semantic segmentation using pre-trained models without requiring fine-tuning. Additionally, we propose a self-supervised training approach that refines segmentation performance by learning an external transformation matrix without modifying the underlying model. Our method achieves state-of-the-art performance in unsupervised semantic segmentation, outperforming existing segmentation methods. Furthermore, we validate ULTra for model interpretation on both synthetic and real-world scenarios, including Object Selection and interpretable text summarization using LLMs, demonstrating its broad applicability in explaining the semantic structure of latent token representations.
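To make the idea concrete, below is a minimal sketch, not the authors' implementation, of the pipeline the abstract describes: token embeddings are taken from a frozen pre-trained Vision Transformer, passed through an external transformation matrix that is the only trainable component, and then grouped into segments. Every specific choice here is an assumption: timm's ViT-B/16 as the backbone, a randomly initialized linear layer standing in for the learned transformation (the paper would train it with its self-supervised objective, which the abstract does not detail), and k-means standing in for the grouping step.

```python
# Minimal sketch (not the authors' code): unsupervised segmentation from frozen
# ViT patch-token embeddings via an external transformation matrix + clustering.
# Assumptions: timm ViT-B/16 backbone, a linear map as the transformation, and
# k-means as a stand-in for the paper's (unspecified) grouping procedure.
import numpy as np
import timm
import torch
from sklearn.cluster import KMeans

device = "cpu"
backbone = timm.create_model("vit_base_patch16_224", pretrained=True).to(device).eval()
grid = 224 // backbone.patch_embed.patch_size[0]  # 14x14 patch grid for ViT-B/16

# External transformation matrix: the backbone stays frozen; only this map would
# be learned. It is randomly initialized here purely for illustration.
transform = torch.nn.Linear(backbone.embed_dim, backbone.embed_dim, bias=False).to(device)

@torch.no_grad()
def segment(image: torch.Tensor, n_segments: int = 5) -> np.ndarray:
    """image: [1, 3, 224, 224], ImageNet-normalized. Returns a [14, 14] label map."""
    tokens = backbone.forward_features(image.to(device))  # [1, 197, D], incl. CLS token
    patch_tokens = tokens[:, 1:, :]                        # drop CLS -> [1, 196, D]
    z = transform(patch_tokens)[0].cpu().numpy()           # transformed token embeddings
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(z)
    return labels.reshape(grid, grid)

if __name__ == "__main__":
    # Dummy input for illustration; a real image should be resized and normalized.
    seg = segment(torch.randn(1, 3, 224, 224))
    print(seg.shape, np.unique(seg))
```

The coarse 14x14 label map would typically be upsampled to the input resolution; the point of the sketch is only the structure the abstract emphasizes: a frozen backbone, an external learnable transformation, and segmentation derived from the transformed latent tokens.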
Cite
Text
Hosseini et al. "ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation." Transactions on Machine Learning Research, 2026.
Markdown
[Hosseini et al. "ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/hosseini2026tmlr-ultra/)
BibTeX
@article{hosseini2026tmlr-ultra,
  title   = {{ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation}},
  author  = {Hosseini, Hesam and Mighan, Ghazal Hosseini and Afzali, Amirabbas and Amini, Sajjad and Houmansadr, Amir},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  url     = {https://mlanthology.org/tmlr/2026/hosseini2026tmlr-ultra/}
}