Multichannel AV-Wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Zhu, Qiushi; Zhang, Jie; Gu, Yu; Hu, Yuchen; Dai, Lirong

doi:10.1609/AAAI.V38I17.29951

AAAI 2024 pp. 19768-19776

Abstract

Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework, AV-wav2vec2, which takes video and multichannel audio data as inputs. First, we propose a multi-path structure that processes the multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the rich information in multichannel speech data. Second, based on contrastive learning, we use additional single-channel audio data, trained jointly, to improve the multichannel multi-modal representation. Finally, we use a Chinese multichannel multi-modal dataset collected in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
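The abstract names intra- and inter-channel contrastive targets but does not give their formulas. The sketch below is an illustration only, not the authors' implementation. It assumes an InfoNCE-style objective of the kind used in wav2vec 2.0, and it assumes that "intra-channel" means a channel's context features are matched against that same channel's targets, while "inter-channel" means they are matched against another channel's targets. The function `info_nce`, the temperature, the array shapes and the random features are all hypothetical.

```python
import numpy as np

def info_nce(context, targets, temperature=0.1):
    """Mean InfoNCE loss: context[t] should match targets[t];
    the targets at all other frames act as negatives."""
    # L2-normalise so that dot products are cosine similarities.
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (c @ q.T) / temperature               # (T, T) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
T, D = 8, 16  # hypothetical: 8 frames, 16-dimensional features

# Hypothetical target features for two microphone channels that
# observe the same speech with slightly different noise.
targets_ch0 = rng.normal(size=(T, D))
targets_ch1 = targets_ch0 + 0.1 * rng.normal(size=(T, D))

# Hypothetical context-network output for channel 0.
context_ch0 = targets_ch0 + 0.1 * rng.normal(size=(T, D))

# Assumed reading of the two terms (not confirmed by the abstract):
intra_loss = info_nce(context_ch0, targets_ch0)  # same channel
inter_loss = info_nce(context_ch0, targets_ch1)  # across channels
total_loss = intra_loss + inter_loss
print(intra_loss, inter_loss, total_loss)
```

Under these assumptions, the inter-channel term pushes each channel's representation to agree with what the other channels observe at the same frame, which is one plausible way to exploit the redundancy in multichannel recordings.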

Cite

Text

Zhu et al. "Multichannel AV-Wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I17.29951

Markdown

[Zhu et al. "Multichannel AV-Wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zhu2024aaai-multichannel/) doi:10.1609/AAAI.V38I17.29951

BibTeX

@inproceedings{zhu2024aaai-multichannel,
  title     = {{Multichannel AV-Wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation}},
  author    = {Zhu, Qiushi and Zhang, Jie and Gu, Yu and Hu, Yuchen and Dai, Lirong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {19768-19776},
  doi       = {10.1609/AAAI.V38I17.29951},
  url       = {https://mlanthology.org/aaai/2024/zhu2024aaai-multichannel/}
}