Multi-Modal Knowledge Distillation-Based Human Trajectory Forecasting
Abstract
Pedestrian trajectory forecasting is crucial in applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires a vision-language model (VLM), which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modalities is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model that uses only trajectory, or trajectory with human pose as its sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets in both ego-view (JRDB, SIT) and bird's-eye-view (ETH/UCY) setups, using both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, with gains of up to 13%. The code is available at github.com/Jaewoo97/KDTF.
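The two-part distillation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of the loss, so this example assumes simple mean-squared-error feature matching between student and teacher. All names here (`distillation_loss`, `w_intra`, `w_inter`, and the feature arguments) are hypothetical and are not taken from the authors' code.

```python
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    pred_traj: Vector,
    gt_traj: Vector,
    student_intra: Vector,
    teacher_intra: Vector,
    student_inter: Vector,
    teacher_inter: Vector,
    w_intra: float = 1.0,  # hypothetical weight for the intra-agent term
    w_inter: float = 1.0,  # hypothetical weight for the inter-agent term
) -> float:
    """Student objective: task loss plus two separate distillation terms.

    Assumption: each distillation term is an MSE between student and
    teacher features. The paper may use a different formulation.
    """
    # Task loss: the student's predicted trajectory against ground truth.
    task = mse(pred_traj, gt_traj)
    # Intra-agent term: match the teacher's per-agent multi-modal features
    # (trajectory, pose, and text on the teacher side).
    intra = mse(student_intra, teacher_intra)
    # Inter-agent term: match the teacher's agent-interaction features.
    inter = mse(student_inter, teacher_inter)
    return task + w_intra * intra + w_inter * inter
```

Keeping the intra-agent and inter-agent terms separate mirrors the abstract's statement that the two kinds of knowledge are distilled separately. At inference time only the student runs, so no text extraction is needed.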
Cite
Text
Jeong et al. "Multi-Modal Knowledge Distillation-Based Human Trajectory Forecasting." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02256
Markdown
[Jeong et al. "Multi-Modal Knowledge Distillation-Based Human Trajectory Forecasting." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/jeong2025cvpr-multimodal/) doi:10.1109/CVPR52734.2025.02256
BibTeX
@inproceedings{jeong2025cvpr-multimodal,
  title = {{Multi-Modal Knowledge Distillation-Based Human Trajectory Forecasting}},
  author = {Jeong, Jaewoo and Lee, Seohee and Park, Daehee and Lee, Giwon and Yoon, Kuk-Jin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year = {2025},
  pages = {24222--24233},
  doi = {10.1109/CVPR52734.2025.02256},
  url = {https://mlanthology.org/cvpr/2025/jeong2025cvpr-multimodal/}
}