Two-Dimensional Visualization of Large Document Libraries Using T-SNE

Abstract

We benchmarked different approaches for creating 2D visualizations of large document libraries, using as a use case the MEDLINE (PubMed) database, which covers the entire biomedical literature (19 million scientific papers). Our optimal pipeline is based on a log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
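Two of the pipeline ingredients named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The abstract does not state the exact TF-IDF variant or annealing schedule, so the code assumes a common convention, `(1 + ln tf) * (ln((1 + N) / (1 + df)) + 1)` with L2 normalization, and a linear decay of the exaggeration factor. The function names `log_tfidf` and `annealed_exaggeration`, the whitespace tokenizer, and the starting factor of 12 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def log_tfidf(docs):
    """Return one L2-normalized {term: weight} dict per document.

    Assumed weighting (not taken from the paper):
    (1 + ln tf) * (ln((1 + N) / (1 + df)) + 1).
    """
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            term: (1.0 + math.log(tf))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Leave empty documents as empty vectors instead of dividing by zero.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def annealed_exaggeration(step, n_steps, start=12.0, end=1.0):
    """Linearly decay the exaggeration factor from `start` to `end`.

    The linear shape and the default start value are assumptions.
    """
    if n_steps <= 1:
        return end
    frac = min(max(step / (n_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    vectors = log_tfidf(["gene expression in cancer", "cancer cancer therapy"])
    print(vectors[0])
    print([round(annealed_exaggeration(s, 5), 2) for s in range(5)])
```

In a full pipeline, the sparse TF-IDF vectors would then be reduced with a truncated SVD and passed to a t-SNE implementation that accepts a per-iteration exaggeration factor. Neither of those steps is shown here.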

Cite

Text

González-Márquez et al. "Two-Dimensional Visualization of Large Document Libraries Using T-SNE." ICLR 2022 Workshops: GTRL, 2022.

Markdown

[González-Márquez et al. "Two-Dimensional Visualization of Large Document Libraries Using T-SNE." ICLR 2022 Workshops: GTRL, 2022.](https://mlanthology.org/iclrw/2022/gonzalezmarquez2022iclrw-twodimensional/)

BibTeX

@inproceedings{gonzalezmarquez2022iclrw-twodimensional,
  title     = {{Two-Dimensional Visualization of Large Document Libraries Using T-SNE}},
  author    = {González-Márquez, Rita and Berens, Philipp and Kobak, Dmitry},
  booktitle = {ICLR 2022 Workshops: GTRL},
  year      = {2022},
  url       = {https://mlanthology.org/iclrw/2022/gonzalezmarquez2022iclrw-twodimensional/}
}