Assessing Machine Learning and Data Imputation Approaches to Handle the Issue of Data Sparsity in Sports Forecasting
Abstract
Sparsity is a common characteristic of datasets used in the domain of sports forecasting, mainly derived from inconsistencies in data coverage. Typically, this issue is circumvented by cutting the number of features (depth-focused) or the sample size (breadth-focused) for analysis. The present study uses an experimental approach to analyse the effects of depth- or breadth-focused analyses and data imputation to enable usage of the full sample size and feature wealth. Two forecasting models following a hybrid (i.e., a combination of classical statistical and machine learning) and a full deep learning approach are introduced to perform experiments on a dataset of more than 300,000 soccer matches. In contrast to typical soccer forecasting studies, the analysis was not restricted to one-match-ahead forecasts but used a longer forecasting horizon of up to two months ahead. Systematic differences between the two types of models were identified. The hybrid model, based on classical statistical rating models, performs strongly on depth-focused approaches while improving only marginally, or not at all, for approaches with high data breadth. The deep learning model, however, performs weakly in a depth-focused approach but profits strongly from data breadth. The improved prediction performance in cases of high data breadth suggests that a rich feature set offers better training opportunities than a less detailed set with a larger sample size. Additionally, we showcase that data imputation can be used to address data sparsity by enabling full data depth and breadth. The presented findings are relevant for advancing predictive accuracy and sports forecasting methodologies, emphasizing the viability of imputation techniques to increase data coverage in different analytical approaches.
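The three ways of handling sparsity named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy records, the feature names (`goals`, `shots`, `xg`), the function names, and the use of column-mean imputation as a simple stand-in. The abstract does not state which imputation technique the study uses.

```python
from statistics import mean

# Toy match records; None marks a feature that was not recorded for that match.
# Feature names and values are invented for illustration only.
matches = [
    {"goals": 2, "shots": 14, "xg": 1.8},
    {"goals": 0, "shots": 9, "xg": None},
    {"goals": 1, "shots": None, "xg": None},
    {"goals": 3, "shots": 17, "xg": 2.4},
]
features = ["goals", "shots", "xg"]


def depth_focused(rows, cols):
    """Keep every match, but only the features observed in all of them."""
    complete = [c for c in cols if all(r[c] is not None for r in rows)]
    return [{c: r[c] for c in complete} for r in rows]


def breadth_focused(rows, cols):
    """Keep every feature, but only the matches with no missing values."""
    return [r for r in rows if all(r[c] is not None for c in cols)]


def mean_imputed(rows, cols):
    """Keep all matches and features, filling gaps with the column mean.

    Mean imputation is an assumed placeholder, not the study's method.
    """
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [
        {c: r[c] if r[c] is not None else means[c] for c in cols}
        for r in rows
    ]


depth = depth_focused(matches, features)      # 4 matches, 1 feature
breadth = breadth_focused(matches, features)  # 2 matches, 3 features
imputed = mean_imputed(matches, features)     # 4 matches, 3 features
```

In this toy case the depth-focused view retains all four matches but only `goals`, the breadth-focused view retains all three features but only two matches, and imputation keeps the full table at the cost of introducing estimated values.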
Cite
Text
Wunderlich et al. "Assessing Machine Learning and Data Imputation Approaches to Handle the Issue of Data Sparsity in Sports Forecasting." Machine Learning, 2025. doi:10.1007/S10994-024-06651-7
Markdown
[Wunderlich et al. "Assessing Machine Learning and Data Imputation Approaches to Handle the Issue of Data Sparsity in Sports Forecasting." Machine Learning, 2025.](https://mlanthology.org/mlj/2025/wunderlich2025mlj-assessing/) doi:10.1007/S10994-024-06651-7
BibTeX
@article{wunderlich2025mlj-assessing,
title = {{Assessing Machine Learning and Data Imputation Approaches to Handle the Issue of Data Sparsity in Sports Forecasting}},
author = {Wunderlich, Fabian and Biermann, Henrik and Yang, Weiran and Bassek, Manuel and Raabe, Dominik and Elbert, Nico and Memmert, Daniel and Caparrós, Marc Garnica},
journal = {Machine Learning},
year = {2025},
pages = {48},
doi = {10.1007/S10994-024-06651-7},
volume = {114},
url = {https://mlanthology.org/mlj/2025/wunderlich2025mlj-assessing/}
}