Eyes on the Road, Words in the Changing Skies: Vision-Language Assistance for Autonomous Driving in Transitional Weather
Abstract
The rapid advancement of autonomous vehicle technology (AVT) necessitates robust scene perception and interactive decision-making, particularly under adverse weather conditions. While significant progress has been made on extreme weather scenarios such as cloudy, foggy, rainy, and snowy conditions, a critical challenge remains in transitional weather, such as the shift from cloudy to rainy or from foggy to sunny. These dynamic environmental changes degrade the performance of conventional vision-language systems by causing unpredictable illumination changes and partial occlusions, which are inadequately represented in current AVT datasets. This lack of continuous, transitional training data compromises model robustness and ultimately affects safety and reliability. Meanwhile, vision-language models (VLMs) enable interpretable reasoning in autonomous driving through tasks such as image captioning and visual question answering. However, current VLMs are designed for clear weather, perform poorly in transitional conditions, and rely on computationally expensive LLMs, leading to high memory usage and slow inference that are unsuitable for real-time decision-making in AVT. To address these limitations, we propose Vision-Language Assistance for Autonomous Driving under Transitional Weather (VLAAD-TW), a lightweight framework with a novel cross-modal spatiotemporal reasoning architecture that robustly interprets and acts on multimodal data. The VLAAD-TW framework integrates a Feature Encoder for Transitional Weather (FETW), a lightweight backbone for robust visual feature extraction, with a Spatiotemporal Contextual Aggregator (SCA), which models dynamic weather-induced changes. A Selective Attention-guided Fusion Module (SAFM) then dynamically balances visual and linguistic cues into a unified representation. Finally, a Semantic Text Generator (STG) decodes these representations into context-aware driving information, adapting in real time to both current and predicted weather states. Further, we introduce AIWD16-text, an adverse intermediate weather driving dataset for vision-language tasks, which features sixteen transitional weather states created using a Stochastic Conditional Variational Autoencoder (SC-VAE) and enriched with manual annotations of image captions and open-ended question-answer pairs. Extensive evaluation on the AIWD16-text and DriveLM datasets demonstrates VLAAD-TW's strong BLEU and ROUGE scores with low memory and computational requirements, confirming its effectiveness in challenging weather conditions.
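The abstract specifies the VLAAD-TW pipeline only at the component level: FETW extracts per-frame visual features, SCA aggregates them over time, SAFM fuses them with linguistic cues, and STG decodes text. Below is a minimal PyTorch-style sketch of how such a pipeline could be wired together; every module body, layer choice, and dimension here is an illustrative assumption, not the authors' implementation.

# Illustrative sketch of the VLAAD-TW pipeline described in the abstract.
# All internals (layers, dimensions) are assumptions for exposition only.
import torch
import torch.nn as nn

class FETW(nn.Module):
    """Feature Encoder for Transitional Weather: a lightweight visual
    backbone (assumed here as a small convolutional stem)."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, frames):                  # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.stem(frames.flatten(0, 1))     # (B*T, D, 1, 1)
        return x.flatten(1).view(b, t, -1)      # (B, T, D)

class SCA(nn.Module):
    """Spatiotemporal Contextual Aggregator: models weather-induced change
    across frames (assumed here as a GRU over per-frame features)."""
    def __init__(self, dim=256):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, feats):                   # (B, T, D)
        out, _ = self.rnn(feats)
        return out                              # (B, T, D)

class SAFM(nn.Module):
    """Selective Attention-guided Fusion Module: balances visual and
    linguistic cues (assumed here as cross-attention, text as queries)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):                # (B, T, D), (B, L, D)
        fused, _ = self.attn(query=txt, key=vis, value=vis)
        return fused                            # (B, L, D)

class STG(nn.Module):
    """Semantic Text Generator: decodes fused features into token logits
    (sketched as a linear head over a vocabulary)."""
    def __init__(self, dim=256, vocab=32000):
        super().__init__()
        self.head = nn.Linear(dim, vocab)

    def forward(self, fused):                   # (B, L, D)
        return self.head(fused)                 # (B, L, vocab)

class VLAADTW(nn.Module):
    def __init__(self, dim=256, vocab=32000):
        super().__init__()
        self.fetw, self.sca = FETW(dim), SCA(dim)
        self.safm, self.stg = SAFM(dim), STG(dim, vocab)
        self.embed = nn.Embedding(vocab, dim)   # assumed text embedding

    def forward(self, frames, token_ids):
        vis = self.sca(self.fetw(frames))       # temporal visual context
        txt = self.embed(token_ids)             # linguistic cues
        return self.stg(self.safm(vis, txt))    # context-aware text logits

# Shape check with dummy data: 2 clips of 4 frames, 8-token prompts.
model = VLAADTW()
logits = model(torch.randn(2, 4, 3, 64, 64), torch.randint(0, 32000, (2, 8)))
print(logits.shape)                             # torch.Size([2, 8, 32000])

Cross-attention with text tokens as queries is one natural reading of "selective attention-guided fusion"; the paper's actual mechanism may differ.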
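The abstract also states that the sixteen transitional weather states of AIWD16-text were synthesized with a Stochastic Conditional Variational Autoencoder (SC-VAE). One plausible reading is a conditional VAE whose condition vector is interpolated between two weather classes; the sketch below follows that assumption, with hypothetical class indices and image sizes.

# Hedged sketch of conditional-VAE-style transitional weather synthesis.
# Class ids, dimensions, and architecture are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCVAE(nn.Module):
    def __init__(self, img_dim=3 * 32 * 32, cond_dim=4, z_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(img_dim + cond_dim, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Sigmoid(),
        )

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

# A transitional state is obtained by blending two weather condition vectors.
cloudy = F.one_hot(torch.tensor([0]), num_classes=4).float()  # hypothetical ids
rainy  = F.one_hot(torch.tensor([2]), num_classes=4).float()
alpha  = 0.5                                  # halfway along cloudy -> rainy
c_mid  = (1 - alpha) * cloudy + alpha * rainy

model = SCVAE()
z = torch.randn(1, 64)                        # stochastic latent draw
frame = model.dec(torch.cat([z, c_mid], dim=-1))  # flattened 3x32x32 image
print(frame.shape)                            # torch.Size([1, 3072])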
Cite
Text
Kondapally et al. "Eyes on the Road, Words in the Changing Skies: Vision-Language Assistance for Autonomous Driving in Transitional Weather." Transactions on Machine Learning Research, 2026.

Markdown

[Kondapally et al. "Eyes on the Road, Words in the Changing Skies: Vision-Language Assistance for Autonomous Driving in Transitional Weather." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/kondapally2026tmlr-eyes/)

BibTeX
@article{kondapally2026tmlr-eyes,
title = {{Eyes on the Road, Words in the Changing Skies: Vision-Language Assistance for Autonomous Driving in Transitional Weather}},
author = {Kondapally, Madhavi and Kumar, K Naveen and Mohan, C Krishna},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/kondapally2026tmlr-eyes/}
}