LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

Abstract

With video content becoming ever more pervasive throughout society, the demand for robust video-language models is increasingly urgent. In this work we introduce LLAVIDAL, a Large Language Vision Model tailored for Activities of Daily Living (ADL). Unlike existing models trained primarily on curated web videos, LLAVIDAL leverages a novel multiview RGB-D dataset, ADL-X, comprising 100K untrimmed video-instruction pairs enriched with 3D skeletons and object trajectories that mimic real-world complexity. The model integrates these features to effectively understand the intricate human behaviors and spatiotemporal dynamics typical of daily activities. We also introduce ADLMCQ, a new benchmark designed to evaluate the proficiency of video-language models in interpreting ADL content. Our evaluations demonstrate that LLAVIDAL significantly outperforms existing models, showcasing a superior ability to process and reason about real-life video scenarios. These findings underscore the need for advanced processing techniques that handle the scale and multimodality of video data, alongside comprehensive benchmarks that more accurately reflect real-world use cases. The instruction-tuning data is available at https://adl-x.github.io
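The abstract describes LLAVIDAL as integrating video features with 3D skeletons and object trajectories inside a language model, but does not reproduce the architecture here. The sketch below shows one common way such heterogeneous cues can be fused: each modality is mapped into the LLM's embedding space by a learned linear projection and concatenated as a prefix of visual tokens. All class names, dimensions, and the concatenation scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiCueProjector(nn.Module):
    """Hypothetical sketch: project per-modality features (video, 3D skeleton,
    object trajectories) into a shared LLM embedding space and concatenate
    them along the token axis. Dimensions are illustrative, not from the paper."""

    def __init__(self, video_dim=1024, skeleton_dim=256, object_dim=512, llm_dim=4096):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, llm_dim)
        self.skeleton_proj = nn.Linear(skeleton_dim, llm_dim)
        self.object_proj = nn.Linear(object_dim, llm_dim)

    def forward(self, video_feats, skeleton_feats, object_feats):
        # Each input: (batch, num_tokens_for_modality, modality_dim)
        return torch.cat(
            [
                self.video_proj(video_feats),
                self.skeleton_proj(skeleton_feats),
                self.object_proj(object_feats),
            ],
            dim=1,  # concatenate along the token axis
        )  # (batch, total_tokens, llm_dim), to be prepended to text embeddings


# Usage with dummy features to inspect the fused token sequence.
proj = MultiCueProjector()
vid = torch.randn(2, 32, 1024)   # e.g. 32 pooled video tokens
skel = torch.randn(2, 16, 256)   # e.g. 16 skeleton tokens
obj = torch.randn(2, 8, 512)     # e.g. 8 object-trajectory tokens
print(proj(vid, skel, obj).shape)  # torch.Size([2, 56, 4096])
```

The fused sequence would then be fed to the LLM alongside the tokenized instruction; how the actual model aligns and weights the three cue streams is detailed in the paper itself.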

Cite

Text

Chakraborty et al. "LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living." NeurIPS 2024 Workshops: Video-Language Models, 2024.

Markdown

[Chakraborty et al. "LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living." NeurIPS 2024 Workshops: Video-Language Models, 2024.](https://mlanthology.org/neuripsw/2024/chakraborty2024neuripsw-llavidal/)

BibTeX

@inproceedings{chakraborty2024neuripsw-llavidal,
  title     = {{LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living}},
  author    = {Chakraborty, Rajatsubhra and Sinha, Arkaprava and Reilly, Dominick and Govind, Manish Kumar and Wang, Pu and Bremond, Francois and Das, Srijan},
  booktitle = {NeurIPS 2024 Workshops: Video-Language Models},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/chakraborty2024neuripsw-llavidal/}
}