Nomic Embed: Training a Reproducible Long Context Text Embedder
Abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allow for full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors
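Because the weights are released under Apache 2.0, they can be loaded directly for inference. The sketch below is a minimal, non-authoritative example assuming the checkpoint is published on the Hugging Face Hub as nomic-ai/nomic-embed-text-v1 and that inputs are task-prefixed (e.g. search_document: / search_query:); see the linked repository for the authoritative usage.

```python
# Minimal sketch: embedding text with the released weights via sentence-transformers.
# Assumptions (not stated in the abstract): the checkpoint is on the Hugging Face Hub
# as "nomic-ai/nomic-embed-text-v1" and inputs use task prefixes.
# See https://github.com/nomic-ai/contrastors for the authoritative instructions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = ["search_document: Nomic Embed is an 8192 context length text embedder."]
query = ["search_query: long context text embedding models"]

# Encode with L2-normalized outputs so the dot product below is cosine similarity.
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

print(query_emb @ doc_emb.T)
```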
Cite
Text
Nussbaum et al. "Nomic Embed: Training a Reproducible Long Context Text Embedder." Transactions on Machine Learning Research, 2025.

Markdown
[Nussbaum et al. "Nomic Embed: Training a Reproducible Long Context Text Embedder." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/nussbaum2025tmlr-nomic/)

BibTeX
@article{nussbaum2025tmlr-nomic,
  title = {{Nomic Embed: Training a Reproducible Long Context Text Embedder}},
  author = {Nussbaum, Zach and Morris, John Xavier and Mulyar, Andriy and Duderstadt, Brandon},
  journal = {Transactions on Machine Learning Research},
  year = {2025},
  url = {https://mlanthology.org/tmlr/2025/nussbaum2025tmlr-nomic/}
}