RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications

Abstract

Visual Language Models (VLMs) have achieved remarkable success in many domains due to their ability to perform step-by-step reasoning. However, progress in the telecommunications (Telecom) domain remains limited, primarily due to the lack of high-quality datasets and domain-specific insights. In this paper, we introduce RMultiplex200K, a multimodal dataset that provides step-wise reasoning rationales and correctness scores for real-world Telecom questions. This enables VLMs to engage in step-level reasoning and verification using multimodal information, thereby facilitating reliable problem-solving. RMultiplex200K is highly scalable because it is constructed without human annotations, relying instead on our automatic plan-based annotation (ApPA) method, which synthesizes reasoning steps labeled with reward scores. With this dataset, we introduce TC-NAVIGATOR, a new mechanism for training multimodal process reward models to serve as reliable reasoning verifiers for VLMs. For instance, the Qwen-2-VL-72B and Llama-3.2-90B models, which initially achieve only 21.3% and 19.8% accuracy on practice Telecom questions, reach 48.5% and 46.1%, respectively, after training with RMultiplex200K and verifying with TC-NAVIGATOR.
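
To make the idea of step-level verification concrete, the following is a minimal sketch of how per-step reward scores could be used to choose among several candidate reasoning chains. It is not the TC-NAVIGATOR procedure, which the abstract does not specify; the `Candidate` type, the `select_best` function, and both aggregation rules (minimum and product of step scores) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from math import prod
from typing import Callable, Sequence


@dataclass
class Candidate:
    """One candidate reasoning chain (hypothetical structure)."""
    steps: list[str]          # reasoning steps produced by the VLM
    step_scores: list[float]  # one reward score in [0, 1] per step
    answer: str               # final answer of the chain


def aggregate_min(scores: Sequence[float]) -> float:
    # Assumption: a chain is only as reliable as its weakest step.
    return min(scores) if scores else 0.0


def aggregate_prod(scores: Sequence[float]) -> float:
    # Assumption: treat step scores as independent correctness probabilities.
    return prod(scores) if scores else 0.0


def select_best(
    candidates: Sequence[Candidate],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> Candidate:
    """Return the candidate whose aggregated step score is highest."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: aggregate(c.step_scores))


# Illustrative scores only; these are not taken from the dataset.
long_chain = Candidate(["s1", "s2", "s3", "s4", "s5"], [0.6] * 5, "A")
short_chain = Candidate(["s1", "s2"], [0.5, 1.0], "B")

chosen_by_min = select_best([long_chain, short_chain], aggregate_min)
chosen_by_prod = select_best([long_chain, short_chain], aggregate_prod)
```

The two rules can disagree: the minimum favors the chain with no weak step, while the product penalizes longer chains. Which aggregation a real verifier should use is an open design choice here, not something the abstract states.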

Cite

Text

Chen and Song. "RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications." International Conference on Computer Vision, 2025.

Markdown

[Chen and Song. "RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/chen2025iccv-rmultiplex200k/)

BibTeX

@inproceedings{chen2025iccv-rmultiplex200k,
  title     = {{RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications}},
  author    = {Chen, Sijia and Song, Bin},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {1686--1696},
  url       = {https://mlanthology.org/iccv/2025/chen2025iccv-rmultiplex200k/}
}