TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding

Abstract

Multimodal large language models (MLLMs) have made significant progress across vision-language tasks, yet many designs still suffer from two core limitations. (i) Excessive visual tokens and broken global context: Tiled Patch Encoding fragments high-resolution images, leading to token overload and disrupting global attention modeling. (ii) Lack of temporal reasoning: Most models process video as independent frames using static image encoders, failing to capture temporal dynamics. We present TempFlex-VL, a token-efficient and temporally aware MLLM that addresses both issues through lightweight architectural enhancements. First, we introduce a resolution-agnostic visual encoder that directly processes full images without tiling, preserving global context while substantially reducing visual tokens. Second, we propose Temporal Fiber Fusion (TFF), a plug-and-play module with three complementary pathways: (1) a dynamic local-convolution branch for fine-grained motion, (2) a gated memory accumulator for long-term dependencies, and (3) a periodic encoder for modeling cyclic patterns. These signals are softly fused, enabling the model to adapt to diverse temporal structures without overfitting. To support large-scale video-language pretraining, we curate TempFlex-2M, a high-quality synthetic video–text corpus generated in a single stage via GPT-4o with direct visual prompting. We instantiate TempFlex-VL using two different language backbones, Gemma3-4B and Qwen3-4B, demonstrating the generality of our design across architectures. Both variants achieve state-of-the-art or competitive results on a wide range of image and video benchmarks while markedly improving token efficiency. Code is publicly available at: https://github.com/wang-zhanyu/TempFlex.
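The Temporal Fiber Fusion idea described above, with three complementary pathways combined by a soft weighting, can be sketched in plain Python. Everything here is an illustrative assumption: the abstract does not specify the pathway parameterizations, so the local branch is stood in by a temporal moving average, the memory branch by a gated exponential accumulator, the periodic branch by sinusoidal modulation of the frame index, and the fusion weights by a softmax over scalar logits (all of which would be learned in the real model).

```python
import math

def temporal_fiber_fusion(x, kernel=3, gate=0.9, num_freqs=2, logits=(0.0, 0.0, 0.0)):
    """Illustrative sketch of three TFF-style pathways over frame features x,
    a list of T frames, each a list of D floats. All pathway definitions are
    hypothetical stand-ins, not the paper's actual modules."""
    T, D = len(x), len(x[0])

    # (1) Local-convolution branch: temporal moving average with edge padding,
    # a stand-in for a dynamic local conv capturing fine-grained motion.
    pad = kernel // 2
    xp = [x[0]] * pad + list(x) + [x[-1]] * pad
    local = [[sum(xp[t + j][d] for j in range(kernel)) / kernel
              for d in range(D)] for t in range(T)]

    # (2) Gated memory accumulator: exponential moving average over time,
    # modeling long-term dependencies.
    mem, m = [], [0.0] * D
    for t in range(T):
        m = [gate * m[d] + (1.0 - gate) * x[t][d] for d in range(D)]
        mem.append(m)

    # (3) Periodic encoder: modulate features with sinusoids of the frame
    # index to expose cyclic structure (frequencies chosen arbitrarily here).
    periodic = []
    for t in range(T):
        phase = sum(math.sin(2 * math.pi * t / (2 ** k + 2)) for k in range(num_freqs))
        periodic.append([x[t][d] * (1.0 + 0.1 * phase) for d in range(D)])

    # Soft fusion: softmax over branch logits, then a weighted sum of branches.
    ws = [math.exp(l) for l in logits]
    s = sum(ws)
    w = [v / s for v in ws]
    return [[w[0] * local[t][d] + w[1] * mem[t][d] + w[2] * periodic[t][d]
             for d in range(D)] for t in range(T)]

# Toy usage: 8 frames of 4-dim features, equal fusion weights.
feats = [[float(t + d) for d in range(4)] for t in range(8)]
fused = temporal_fiber_fusion(feats)
```

Because the fusion is a convex combination, pushing one branch's logit high recovers that branch alone, which is how a soft gate can adapt to clips dominated by a single temporal structure.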

Cite

Text

Wang et al. "TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding." Transactions on Machine Learning Research, 2025.

Markdown

[Wang et al. "TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/wang2025tmlr-tempflex/)

BibTeX

@article{wang2025tmlr-tempflex,
  title     = {{TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding}},
  author    = {Wang, Zhanyu and Tang, Chen and He, Haoyu and Feng, Kuan and Wang, Chao and Zhang, Bingni and Xu, Xiaolei and Wang, Shen and Zhou, Luping},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/wang2025tmlr-tempflex/}
}