Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data

Nilsson, Alfred; Azizpour, Hossein

CHIL 2024 pp. 155-168

Abstract

This work introduces a novel approach to model regularization and explanation in Vision Transformers (ViTs), particularly beneficial for small-scale but high-dimensional data regimes, such as in healthcare. We introduce stochastic embedded feature selection in the context of echocardiography video analysis, specifically focusing on the EchoNet-Dynamic dataset for the prediction of left ventricular ejection fraction (LVEF). Our proposed method, termed Gumbel Video Vision Transformers (G-ViTs), augments Video Vision Transformers (V-ViTs), a performant transformer architecture for videos, with Concrete Autoencoders (CAEs), a common dataset-level feature selection technique, to enhance the V-ViT's generalization and interpretability. The key contribution lies in the incorporation of stochastic token selection individually for each video frame during training. Such token selection regularizes the training of the V-ViT, improves its interpretability, and is achieved by differentiable sampling of categoricals using the Gumbel-Softmax distribution. Our experiments on EchoNet-Dynamic demonstrate a consistent and notable regularization effect. The G-ViT model outperforms both a random selection baseline and the standard V-ViT. The G-ViT is also compared against recent works on EchoNet-Dynamic, where it exhibits state-of-the-art performance among end-to-end learned methods. Finally, we explore model explainability by visualizing the selected patches, providing insights into how the G-ViT utilizes regions known to be crucial for human LVEF prediction. The proposed approach therefore extends beyond regularization, offering enhanced interpretability for ViTs.
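The per-frame token selection described in the abstract relies on Gumbel-Softmax sampling. The following is a minimal, forward-pass-only sketch of that idea in plain Python, not the authors' implementation. The names `gumbel_softmax_sample` and `select_patches`, the toy token values, the selector logits, and the temperature are all illustrative assumptions; the abstract does not specify how the logits are parameterized, and an actual model would use an automatic differentiation framework so that gradients flow through the relaxed samples.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample over len(logits) categories."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u inside (0, 1)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_patches(frame_tokens, selector_logits, temperature, rng):
    """Mix the N patch tokens of one frame into k softly selected tokens.

    frame_tokens:    N tokens, each a list of D floats.
    selector_logits: k rows of N logits, one row per selector (hypothetical
                     free parameters in this sketch).
    """
    num_patches = len(frame_tokens)
    dim = len(frame_tokens[0])
    selected, weights = [], []
    for logits in selector_logits:
        w = gumbel_softmax_sample(logits, temperature, rng)
        # Each selected token is a convex combination of the frame's patches;
        # as the temperature approaches 0 it approaches a single patch.
        token = [sum(w[n] * frame_tokens[n][d] for n in range(num_patches))
                 for d in range(dim)]
        selected.append(token)
        weights.append(w)
    return selected, weights


# Toy frame: 4 patch tokens of dimension 2, and 2 selectors (assumed values).
rng = random.Random(0)
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
selector_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
selected, weights = select_patches(frame_tokens, selector_logits, 0.5, rng)
```

Inspecting `weights` after such a pass shows which patches each selector favors, which is the kind of signal the abstract refers to when visualizing selected patches.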

Cite

Text

Nilsson and Azizpour. "Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data." Proceedings of the fifth Conference on Health, Inference, and Learning, 2024.

Markdown

[Nilsson and Azizpour. "Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data." Proceedings of the fifth Conference on Health, Inference, and Learning, 2024.](https://mlanthology.org/chil/2024/nilsson2024chil-regularizing/)

BibTeX

@inproceedings{nilsson2024chil-regularizing,
  title     = {{Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data}},
  author    = {Nilsson, Alfred and Azizpour, Hossein},
  booktitle = {Proceedings of the fifth Conference on Health, Inference, and Learning},
  year      = {2024},
  pages     = {155--168},
  volume    = {248},
  url       = {https://mlanthology.org/chil/2024/nilsson2024chil-regularizing/}
}