LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multi-view, multi-modal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at https://adl-x.github.io/.
Cite
Text
Reilly et al. "LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02263
Markdown
[Reilly et al. "LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/reilly2025cvpr-llavidal/) doi:10.1109/CVPR52734.2025.02263
BibTeX
@inproceedings{reilly2025cvpr-llavidal,
title = {{LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living}},
author = {Reilly, Dominick and Chakraborty, Rajatsubhra and Sinha, Arkaprava and Govind, Manish Kumar and Wang, Pu and Bremond, Francois and Xue, Le and Das, Srijan},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {24297-24308},
doi = {10.1109/CVPR52734.2025.02263},
url = {https://mlanthology.org/cvpr/2025/reilly2025cvpr-llavidal/}
}