On the Emergence of Linear Analogies in Word Embeddings

Abstract

Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure (for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$) whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/(P(i)P(j))$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$ are included, their number setting the dimension of the embeddings, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king--queen, man--woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)--(iv). It also gives a fine-grained view of the role played by each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
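As an illustration of the kind of construction the abstract describes, the following is a minimal sketch and not the authors' code. It assumes a toy vocabulary in which every word is one sign pattern of $d$ binary attributes, and it assumes a product form $M(i,j) = \prod_k (1 + s_k a_i^{(k)} a_j^{(k)})$ for the attribute-based interaction. That form, the strengths `s`, the helper names `embed` and `index_of`, and the labels king/man/woman/queen attached to sign patterns are all hypothetical choices made for the example; the abstract does not specify them. Embeddings are taken from the top eigenvectors of $M$ (and of $\log M$), scaled by the square root of the eigenvalue magnitude.

```python
import itertools
import numpy as np

d = 4                                   # number of binary attributes (assumed)
s = np.array([0.5, 0.45, 0.4, 0.35])    # hypothetical attribute strengths, 0 < s_k < 1

# Toy vocabulary: one "word" per sign pattern in {-1, +1}^d.
attrs = np.array(list(itertools.product([1, -1], repeat=d)), dtype=float)

# Assumed interaction: M[i, j] = prod_k (1 + s_k * a_i^k * a_j^k).
M = np.prod(1.0 + s * attrs[:, None, :] * attrs[None, :, :], axis=2)


def embed(matrix, dim):
    """Embed words with the `dim` eigenvectors of largest |eigenvalue|."""
    evals, evecs = np.linalg.eigh(matrix)          # matrix is symmetric
    top = np.argsort(-np.abs(evals))[:dim]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))


def index_of(pattern):
    """Row index of the word whose attribute pattern equals `pattern`."""
    return int(np.where((attrs == np.array(pattern)).all(axis=1))[0][0])


# Hypothetical labels: attribute 0 plays "royal", attribute 1 plays "male".
king = index_of([1, 1, 1, 1])
man = index_of([-1, 1, 1, 1])
woman = index_of([-1, -1, 1, 1])
queen = index_of([1, -1, 1, 1])

# Embedding from M, using the constant mode plus one mode per attribute.
W = embed(M, dim=d + 1)
query = W[king] - W[man] + W[woman]
residual = float(np.linalg.norm(query - W[queen]))

# Nearest word to the query, excluding the three input words.
dist = np.linalg.norm(W - query, axis=1)
dist[[king, man, woman]] = np.inf
nearest = int(np.argmin(dist))

# Same check with log M in place of M.
W_log = embed(np.log(M), dim=d + 1)
query_log = W_log[king] - W_log[man] + W_log[woman]
residual_log = float(np.linalg.norm(query_log - W_log[queen]))

print("nearest word is queen:", nearest == queen)
print("residual (M):", residual, " residual (log M):", residual_log)
```

In this noiseless toy, the strengths are chosen so that every $s_k$ exceeds every product $s_k s_l$. The top $d+1$ eigenvectors of $M$ are then the constant mode and the $d$ attribute vectors, so each embedding is a linear function of the word's attributes and the analogy holds to numerical precision. Larger strengths would let product modes enter the top eigenvectors, which is one way to explore how the structure depends on the embedding dimension.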

Cite

Text

Korchinski et al. "On the Emergence of Linear Analogies in Word Embeddings." Advances in Neural Information Processing Systems, 2025.

Markdown

[Korchinski et al. "On the Emergence of Linear Analogies in Word Embeddings." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/korchinski2025neurips-emergence/)

BibTeX

@inproceedings{korchinski2025neurips-emergence,
  title     = {{On the Emergence of Linear Analogies in Word Embeddings}},
  author    = {Korchinski, Daniel James and Karkada, Dhruva and Bahri, Yasaman and Wyart, Matthieu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/korchinski2025neurips-emergence/}
}