Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery

Abstract

The predictive capabilities of natural language processing models in materials science are a subject of ongoing interest. This study examines material property prediction, relying on models to extract latent knowledge from compound names and material properties. We assessed various methods for producing contextual embeddings and explored pre-trained models such as BERT and GPT. Our findings indicate that information-dense embeddings from the third layer of domain-specific BERT models, such as MatBERT, combined with the context-average method, are the most effective way to use unsupervised word embeddings from materials science literature to identify material-property relationships. The stark contrast between the domain-specific MatBERT and the general BERT model emphasizes the value of domain-specific training and tokenization for material prediction. Our research identifies a "tokenizer effect": specialized tokenization during the pretraining phase is important for capturing material names effectively. We found that a tokenizer that preserves compound names in their entirety, while maintaining a consistent token count, enhances the efficacy of context-aware embeddings in functional material prediction.
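To make the "tokenizer effect" concrete, the following is a minimal sketch of greedy longest-match-first sub-word splitting in the style of WordPiece, the scheme used by BERT. It is not the authors' code: the function name `wordpiece_tokenize` and both vocabularies are hypothetical and chosen only to illustrate how a general vocabulary can fragment a compound name that a domain-specific vocabulary keeps whole.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lower-cased word into sub-tokens by greedy longest match.

    Non-initial pieces carry the "##" continuation prefix. If no piece
    matches at some position, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabularies (assumptions, not taken from BERT or MatBERT).
general_vocab = {"li", "##fe", "##po", "##4", "ga", "##as"}
domain_vocab = general_vocab | {"lifepo4", "gaas"}

for name in ("lifepo4", "gaas"):
    print(name,
          wordpiece_tokenize(name, general_vocab),
          wordpiece_tokenize(name, domain_vocab))
```

With the general vocabulary the two names split into four and two pieces respectively, so the number of vectors representing a material varies from compound to compound. With the domain vocabulary each name is a single token, which matches the abstract's description of preserving compound names while keeping the token count consistent.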

Cite

Text

Xie et al. "Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery." NeurIPS 2023 Workshops: AI4Mat, 2023.

Markdown

[Xie et al. "Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery." NeurIPS 2023 Workshops: AI4Mat, 2023.](https://mlanthology.org/neuripsw/2023/xie2023neuripsw-tokenizer/)

BibTeX

@inproceedings{xie2023neuripsw-tokenizer,
  title     = {{Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery}},
  author    = {Xie, Tong and Wan, Yuwei and Lu, Ke and Zhang, Wenjie and Kit, Chunyu and Hoex, Bram},
  booktitle = {NeurIPS 2023 Workshops: AI4Mat},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/xie2023neuripsw-tokenizer/}
}