Speaker-Invariant Features for Automatic Speech Recognition

Abstract

In this paper, we consider the generation of features for automatic speech recognition (ASR) that are robust to speaker variations. A major cause of degradation in the performance of ASR systems is inter-speaker variation. These variations are commonly modeled by a pure scaling relation between the spectra of speakers enunciating the same sound. Therefore, current state-of-the-art ASR systems overcome this problem of speaker variability through a brute-force search for the optimal scaling parameter. This procedure, known as vocal-tract length normalization (VTLN), is computationally intensive. We have recently used the Scale-Transform (a variation of the Mellin transform) to generate features which are robust to speaker variations without the need to search for the scaling parameter. However, these features perform worse due to the loss of phase information. In this paper, we propose to use the magnitude of the Scale-Transform together with a pre-computed phase vector for each phoneme to generate speaker-invariant features. We compare the performance of the proposed features with conventional VTLN on a phoneme recognition task.
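The key property behind the abstract's claim can be illustrated numerically. The scale transform of a spectrum is, after the exponential warping t = e^x, just a Fourier transform of the warped, t^(-1/2)-weighted spectrum; a spectral scaling t → at then becomes a shift in x, which leaves the Fourier magnitude unchanged. The sketch below is illustrative only (the toy spectrum, grid, and warping factor are invented for the demo and are not from the paper):

```python
import numpy as np

def scale_transform_mag(f, x):
    """Magnitude of the scale transform of f(t), computed via the
    exponential warping t = e^x followed by an FFT.
    f: callable giving the (toy) spectrum value at t > 0
    x: uniform grid on the warped (log-time) axis
    """
    # The t^{-1/2} weighting of the scale transform becomes e^{x/2}
    g = f(np.exp(x)) * np.exp(x / 2)
    return np.abs(np.fft.fft(g))

# A smooth toy "spectrum", concentrated well inside the grid
f = lambda t: np.exp(-(np.log(t) - 1.0) ** 2)

a = 1.3                                 # hypothetical warping factor
fa = lambda t: np.sqrt(a) * f(a * t)    # energy-normalized scaled spectrum

x = np.linspace(-8.0, 8.0, 4096)
m1 = scale_transform_mag(f, x)
m2 = scale_transform_mag(fa, x)

# The two magnitude spectra agree up to discretization error,
# even though f and fa differ by a frequency scaling.
print(np.max(np.abs(m1 - m2)) / np.max(m1))
```

The phase of the scale transform, by contrast, picks up a factor proportional to ln a under scaling, which is why discarding it (as the magnitude features do) loses speaker-independent phoneme information and motivates the pre-computed phase vector proposed in the paper.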

Cite

Text

Umesh et al. "Speaker-Invariant Features for Automatic Speech Recognition." International Joint Conference on Artificial Intelligence, 2007.

Markdown

[Umesh et al. "Speaker-Invariant Features for Automatic Speech Recognition." International Joint Conference on Artificial Intelligence, 2007.](https://mlanthology.org/ijcai/2007/umesh2007ijcai-speaker/)

BibTeX

@inproceedings{umesh2007ijcai-speaker,
  title     = {{Speaker-Invariant Features for Automatic Speech Recognition}},
  author    = {Umesh, Srinivasan and Sanand, D. Rama and Praveen, G.},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2007},
  pages     = {1738--1743},
  url       = {https://mlanthology.org/ijcai/2007/umesh2007ijcai-speaker/}
}