Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data
Abstract
This work introduces a novel approach to model regularization and explanation in Vision Transformers (ViTs), particularly beneficial for small-scale but high-dimensional data regimes, such as in healthcare. We introduce stochastic embedded feature selection in the context of echocardiography video analysis, specifically focusing on the EchoNet-Dynamic dataset for the prediction of the left ventricular ejection fraction (LVEF). Our proposed method, termed Gumbel Video Vision Transformers (G-ViTs), augments Video Vision Transformers (V-ViTs), a performant transformer architecture for videos, with Concrete Autoencoders (CAEs), a common dataset-level feature selection technique, to enhance the V-ViT's generalization and interpretability. The key contribution lies in the incorporation of stochastic token selection, performed individually for each video frame during training. Such token selection regularizes the training of the V-ViT and improves its interpretability; it is achieved by differentiable sampling of categorical variables using the Gumbel-Softmax distribution. Our experiments on EchoNet-Dynamic demonstrate a consistent and notable regularization effect. The G-ViT model outperforms both a random selection baseline and the standard V-ViT. The G-ViT is also compared against recent works on EchoNet-Dynamic, where it exhibits state-of-the-art performance among end-to-end learned methods. Finally, we explore model explainability by visualizing selected patches, providing insights into how the G-ViT utilizes regions that humans know to be crucial for LVEF prediction. The proposed approach therefore extends beyond regularization, offering enhanced interpretability for ViTs.
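The per-frame token selection described in the abstract can be illustrated with a minimal NumPy sketch of relaxed categorical sampling via the Gumbel-Softmax distribution. This is not the authors' implementation: the function name `gumbel_softmax_select`, the array shapes, and the temperature `tau` are assumptions made for illustration only, and a real model would learn the selector logits by backpropagation in an autodiff framework.

```python
import numpy as np

def gumbel_softmax_select(logits, tokens, tau=1.0, rng=None):
    """Relaxed selection of k patch tokens per frame (illustrative sketch).

    logits: (T, k, n) selector logits, one categorical over n patches
            for each of k selection slots in each of T frames (assumed layout).
    tokens: (T, n, d) patch embeddings for each frame.
    Returns the selected tokens (T, k, d) and the relaxed weights (T, k, n).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Perturb the logits with i.i.d. Gumbel(0, 1) noise.
    g = rng.gumbel(size=logits.shape)
    y = (logits + g) / tau
    # Numerically stable softmax over the patch axis.
    y = y - y.max(axis=-1, keepdims=True)
    w = np.exp(y)
    w = w / w.sum(axis=-1, keepdims=True)
    # Each selected token is a convex combination of the frame's patches;
    # as tau approaches 0 the weights approach a one-hot choice.
    return w @ tokens, w

# Usage: 4 frames, 16 patches of dimension 8, keep 5 tokens per frame.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 8))
logits = rng.normal(size=(4, 5, 16))
selected, weights = gumbel_softmax_select(logits, tokens, tau=0.5, rng=rng)
print(selected.shape, weights.shape)
```

Lower temperatures make the weights closer to a hard choice of a single patch, while higher temperatures blend several patches; how the temperature is scheduled during training is a design choice not specified in the abstract.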
Cite
Text
Nilsson and Azizpour. "Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data." Proceedings of the fifth Conference on Health, Inference, and Learning, 2024.
Markdown
[Nilsson and Azizpour. "Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data." Proceedings of the fifth Conference on Health, Inference, and Learning, 2024.](https://mlanthology.org/chil/2024/nilsson2024chil-regularizing/)
BibTeX
@inproceedings{nilsson2024chil-regularizing,
title = {{Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data}},
author = {Nilsson, Alfred and Azizpour, Hossein},
booktitle = {Proceedings of the fifth Conference on Health, Inference, and Learning},
year = {2024},
pages = {155-168},
volume = {248},
url = {https://mlanthology.org/chil/2024/nilsson2024chil-regularizing/}
}