A Multimodal AI Dialogue System for Unified Document, Visual, and Audio Interaction

Abstract

This paper presents a multimodal intelligent dialogue system that integrates document analysis, visual media processing, and audio interaction within a unified web interface. The system pairs secure user identity verification with persistent conversation management, and it draws on textual document analysis, dynamic context integration, and cross-media interaction via video, image, and real-time speech processing. Our approach introduces three key innovations: (1) context-aware document analysis through text extraction, (2) a multimodal input pipeline supporting images, videos, and audio, and (3) persistent chat history management for maintaining conversational continuity. The system moves between audio and text by processing audio input and converting text responses into speech, which enables natural interaction. The platform also provides an intuitive interface for document uploads, camera capture, and audio recording, and it preserves conversation context across sessions. This implementation demonstrates the practical integration of multimodal input in an interactive artificial intelligence (AI) system and shows its potential for enhanced user engagement.
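As a rough illustration of how extracted document text and persistent chat history might be combined into one conversational context, the sketch below stores a per-user session as a JSON file and rebuilds the context after a reload. It is not the authors' implementation: the `ChatSession` class, its field names, the `modality` tag, and the JSON-on-disk layout are all assumptions made for this example, and the abstract does not specify how the system stores or assembles context.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ChatSession:
    """Hypothetical per-user session; not the paper's actual data model."""
    user_id: str
    store_dir: Path
    messages: list = field(default_factory=list)   # dicts: role, modality, text
    documents: dict = field(default_factory=dict)  # filename -> extracted text

    @property
    def path(self) -> Path:
        # One JSON file per user (assumed layout; a real system would
        # validate user_id before using it in a file name).
        return self.store_dir / f"{self.user_id}.json"

    @classmethod
    def load(cls, user_id, store_dir):
        """Restore a saved session, or start an empty one."""
        session = cls(user_id=user_id, store_dir=Path(store_dir))
        if session.path.exists():
            saved = json.loads(session.path.read_text(encoding="utf-8"))
            session.messages = saved["messages"]
            session.documents = saved["documents"]
        return session

    def save(self):
        self.store_dir.mkdir(parents=True, exist_ok=True)
        payload = {"messages": self.messages, "documents": self.documents}
        self.path.write_text(json.dumps(payload), encoding="utf-8")

    def add_document(self, name, extracted_text):
        """Attach text already extracted from an uploaded document."""
        self.documents[name] = extracted_text
        self.save()

    def add_message(self, role, text, modality="text"):
        """Record a turn; audio turns are stored as their transcript."""
        self.messages.append({"role": role, "modality": modality, "text": text})
        self.save()

    def build_context(self, max_turns=10):
        """Join document text and the most recent turns into one string."""
        doc_part = [f"[document: {n}]\n{t}" for n, t in self.documents.items()]
        turn_part = [
            f"{m['role']} ({m['modality']}): {m['text']}"
            for m in self.messages[-max_turns:]
        ]
        return "\n".join(doc_part + turn_part)


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    session = ChatSession.load("alice", store)
    session.add_document("notes.txt", "Project deadline is Friday.")
    session.add_message("user", "When is the deadline?", modality="audio")
    # A later visit reloads the same history and documents from disk.
    print(ChatSession.load("alice", store).build_context())
```

Saving after every change keeps the sketch simple; a deployed system would more likely use a database keyed by the verified user identity and would cap the amount of document text placed in the context.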

Cite

Text

Feng et al. "A Multimodal AI Dialogue System for Unified Document, Visual, and Audio Interaction." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1259

Markdown

[Feng et al. "A Multimodal AI Dialogue System for Unified Document, Visual, and Audio Interaction." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/feng2025ijcai-multimodal/) doi:10.24963/IJCAI.2025/1259

BibTeX

@inproceedings{feng2025ijcai-multimodal,
  title     = {{A Multimodal AI Dialogue System for Unified Document, Visual, and Audio Interaction}},
  author    = {Feng, Yujun and Huang, Jingyi and Zhang, Yang},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {11044--11047},
  doi       = {10.24963/IJCAI.2025/1259},
  url       = {https://mlanthology.org/ijcai/2025/feng2025ijcai-multimodal/}
}