Self-Supervised Acquisition of Vowels in American English

Abstract

This paper presents a self-supervised framework for perceptual learning based upon correlations in different sensory modalities. We demonstrate this with a system that has learned the vowel structure of American English – i.e., the number of vowels and their phonetic descriptions – by simultaneously watching and listening to someone speak. It is highly non-parametric, knowing neither the number of vowels nor their input distributions in advance, and it has no prior linguistic knowledge. This work is the first example of unsupervised phonetic acquisition of which we are aware, outside of that done by human infants. This system is based on the cross-modal clustering framework introduced by [4], which has been significantly enhanced here. This paper presents our results and focuses on the mathematical framework that enables this type of intersensory self-supervised learning.

This is the first unsupervised acquisition of phonetic structure of which we are aware, at least outside of that done by human infants, who solve this problem easily. The output of this system is displayed in Figure 1. The goal of this paper is to elaborate upon these results and outline the framework through which they were obtained. Our approach to perceptual grounding has been to mathematically formalize an insight in Aristotle's De Anima [1], that differences in the world are only detectable because different senses perceive the same world events differently. This implies both that sensory systems need some way to share their different perspectives on the world and that they need some way to incorporate these shared

Cite

Text

Coen. "Self-Supervised Acquisition of Vowels in American English." AAAI Conference on Artificial Intelligence, 2006.

Markdown

[Coen. "Self-Supervised Acquisition of Vowels in American English." AAAI Conference on Artificial Intelligence, 2006.](https://mlanthology.org/aaai/2006/coen2006aaai-self/)

BibTeX

@inproceedings{coen2006aaai-self,
  title     = {{Self-Supervised Acquisition of Vowels in American English}},
  author    = {Coen, Michael H.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2006},
  pages     = {1451-1456},
  url       = {https://mlanthology.org/aaai/2006/coen2006aaai-self/}
}