Mitigating Data Scarcity in Polymer Property Prediction via Multi-Task Auxiliary Learning

Pinheiro, Gabriel A.; Quiles, Marcos G.; Da Silva, Juarez L. F.; Fern, Xiaoli Z.

doi:10.1007/978-3-032-06118-8_25

ECML-PKDD 2025 pp. 426-442

Abstract

Polymers are fundamental materials with numerous applications in everyday life, making their synthesis, characterization, and property measurement critically important. Machine learning (ML) algorithms offer promising opportunities to accelerate polymer screening with high accuracy, yet significant challenges persist. Unlike small molecules with fixed structures, polymers, especially copolymers formed by polymerizing two or more distinct monomers, can be modeled at multiple scales (atomic, monomer, or repeat-unit level) and exhibit inherent variability due to the stochastic polymerization process, which affects connectivity, chain length, conformations, and compositional complexity. Additionally, labeled polymer data with high-fidelity, experimentally measured properties are scarce, which makes ML training difficult. In this work, we tackle these challenges by (1) proposing CoPolyGNN (CoPolymer Graph Neural Network), a multi-scale model that employs a GNN encoder to learn representations of polymer repeating units or individual monomers, combined with an attention-based readout function that aggregates these representations with explicit monomer proportion information; (2) compiling a large dataset of polymers annotated with both simulated and experimentally measured properties; and (3) introducing a supervised auxiliary training framework to mitigate data scarcity in polymer property prediction. We empirically validate CoPolyGNN on datasets of polymer properties measured under real experimental conditions. Our findings demonstrate that augmenting the main task with auxiliary tasks improves performance on the main task. Consequently, our work provides a neural architecture and training framework that enable practitioners to predict polymer properties from simple text notations of repeat units or monomers and their proportions, achieving strong performance even with limited training data. (Code available at https://github.com/CIDAG/CoPolyGNN.)
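
The two mechanisms named in the abstract, a readout that combines monomer representations with their proportions and a loss that adds auxiliary tasks to the main task, can be illustrated with a small sketch. This is not the CoPolyGNN implementation: the abstract does not specify the readout formula or the loss weighting, so the function names (`proportion_aware_readout`, `multitask_loss`), the choice to add the log of each monomer fraction to a dot-product attention score, and the single `aux_weight` hyperparameter are all assumptions made for illustration. The monomer embeddings stand in for the output of a GNN encoder.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def proportion_aware_readout(
    embeddings: Sequence[Sequence[float]],
    fractions: Sequence[float],
    query: Sequence[float],
) -> List[float]:
    """Pool monomer embeddings into one polymer vector (hypothetical readout).

    Assumption: the attention score of monomer i is <query, h_i> + log(fraction_i),
    so its weight is proportional to fraction_i * exp(<query, h_i>).
    """
    if len(embeddings) != len(fractions):
        raise ValueError("one fraction is required per monomer embedding")
    if any(f <= 0 for f in fractions):
        raise ValueError("fractions must be strictly positive")

    scores = [
        sum(q * h for q, h in zip(query, emb)) + math.log(f)
        for emb, f in zip(embeddings, fractions)
    ]
    weights = softmax(scores)

    # Weighted sum of the embeddings, one output dimension at a time.
    dim = len(query)
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]


def multitask_loss(
    main_loss: float, aux_losses: Sequence[float], aux_weight: float = 0.1
) -> float:
    """Main-task loss plus a down-weighted sum of auxiliary-task losses.

    Assumption: one shared weight for all auxiliary tasks; the paper's
    actual weighting scheme may differ.
    """
    return main_loss + aux_weight * sum(aux_losses)


# Example: a copolymer of two monomers in a 75:25 ratio.
monomer_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for GNN outputs
monomer_fractions = [0.75, 0.25]
attention_query = [0.0, 0.0]  # a zero query reduces attention to the fractions

polymer_vector = proportion_aware_readout(
    monomer_embeddings, monomer_fractions, attention_query
)
total_loss = multitask_loss(main_loss=1.0, aux_losses=[2.0, 4.0], aux_weight=0.5)
```

With a zero query the attention weights reduce to the monomer fractions, so the pooled vector is a composition-weighted average; a learned query would shift weight toward the monomers most relevant to the target property. In a real model the query and the auxiliary weight would be trained or tuned rather than fixed.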

Cite

Text

Pinheiro et al. "Mitigating Data Scarcity in Polymer Property Prediction via Multi-Task Auxiliary Learning." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025. doi:10.1007/978-3-032-06118-8_25

Markdown

[Pinheiro et al. "Mitigating Data Scarcity in Polymer Property Prediction via Multi-Task Auxiliary Learning." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025.](https://mlanthology.org/ecmlpkdd/2025/pinheiro2025ecmlpkdd-mitigating/) doi:10.1007/978-3-032-06118-8_25

BibTeX

@inproceedings{pinheiro2025ecmlpkdd-mitigating,
  title     = {{Mitigating Data Scarcity in Polymer Property Prediction via Multi-Task Auxiliary Learning}},
  author    = {Pinheiro, Gabriel A. and Quiles, Marcos G. and Da Silva, Juarez L. F. and Fern, Xiaoli Z.},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2025},
  pages     = {426--442},
  doi       = {10.1007/978-3-032-06118-8_25},
  url       = {https://mlanthology.org/ecmlpkdd/2025/pinheiro2025ecmlpkdd-mitigating/}
}