ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Abstract
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image–language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image–language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video–text supervision or fine-tuning, ActAlign achieves 30.4% accuracy on ActionAtlas—the most diverse benchmark of fine-grained actions across multiple sports—where human performance is only 61.6%. ActAlign outperforms billion-parameter video–language models while using $\sim 8\times$ fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image–language models for fine-grained video understanding.
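The alignment step the abstract describes can be sketched in a toy form: score each class by how well its ordered sub-action embeddings align monotonically with the video's frame embeddings, then pick the best-scoring class. This is a minimal illustration, not the paper's implementation; the function names, the simplified DTW transitions (each frame either stays on the current sub-action or advances to the next), and the length normalization are all assumptions for the sketch.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity matrix between rows of a (frames) and rows of b (sub-actions)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def dtw_align_score(frame_emb, subaction_emb):
    """Best monotonic alignment of T frames to K ordered sub-actions (toy DTW).

    The path starts at (0, 0) and ends at (T-1, K-1), so every sub-action
    is visited in order; at each frame the path either repeats the current
    sub-action or advances to the next one (assumes T >= K).
    """
    S = cosine_sim(frame_emb, subaction_emb)  # (T, K) similarity matrix
    T, K = S.shape
    D = np.full((T, K), -np.inf)
    D[0, 0] = S[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]                              # same sub-action
            advance = D[t - 1, k - 1] if k > 0 else -np.inf  # next sub-action
            D[t, k] = S[t, k] + max(stay, advance)
    return D[-1, -1] / T  # length-normalized alignment score

def classify(frame_emb, class_subaction_embs):
    """Pick the class whose sub-action script aligns best with the frames."""
    scores = {c: dtw_align_score(frame_emb, e)
              for c, e in class_subaction_embs.items()}
    return max(scores, key=scores.get), scores

# Toy example: frames follow class A's sub-action order, not class B's reversed one.
frames = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
class_embs = {
    "A": np.eye(3),                                        # sub-actions in order
    "B": np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]], float),  # reversed order
}
pred, scores = classify(frames, class_embs)
```

In the real method the frame and sub-action embeddings would come from a pretrained image-language model (e.g., SigLIP) and an LLM-generated sub-action script, respectively; the key point the sketch captures is that ordering matters, so a class whose script runs in the wrong order scores lower even with identical sub-action content.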
Cite
Text
Aghdam et al. "ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment." Transactions on Machine Learning Research, 2025.
Markdown
[Aghdam et al. "ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/aghdam2025tmlr-actalign/)
BibTeX
@article{aghdam2025tmlr-actalign,
  title   = {{ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment}},
  author  = {Aghdam, Amir and Hu, Vincent Tao and Ommer, Bj{\"o}rn},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/aghdam2025tmlr-actalign/}
}