Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?
Abstract
Missing data poses a significant challenge in real-world data analysis, prompting the development of various imputation methods. However, existing literature often overlooks two critical limitations. Firstly, many methods assume a Missing Completely At Random (MCAR) mechanism, which is relatively easy to handle but may not reflect real-world scenarios, where data are often missing because of underlying mechanisms that are frequently unknown. Such missingness is categorized as Missing At Random (MAR) or Missing Not At Random (MNAR). Secondly, the effectiveness of these methods is assessed primarily in terms of imputation accuracy, using metrics such as Root Mean Square Error (RMSE), which ignores the practical utility of imputed data in downstream tasks. In this study, we comprehensively compare a broad spectrum of missing data imputation techniques, ranging from traditional statistical methods to advanced machine and deep learning approaches. Our evaluation considers their effectiveness in handling various missing mechanisms across different missing parameters. Furthermore, we assess the quality of the imputed data not only in terms of RMSE but also in terms of its impact on downstream tasks such as classification, regression, and clustering. Contrary to common assumptions, our findings reveal that complex deep learning-based methods are not guaranteed to outperform simple traditional techniques. Moreover, relying solely on RMSE for evaluation can be misleading. Instead, the choice of an imputation method should prioritise its effectiveness in enhancing the performance of learning algorithms in downstream tasks.
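The three missing mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the synthetic two-column data, the helper names `make_mask` and `mean_impute_rmse`, the 30% missing rate, and the median-threshold rule for MAR and MNAR are all assumptions chosen for illustration, and mean imputation stands in for any imputation method.

```python
import numpy as np

def make_mask(X, mechanism, rate=0.3, seed=0):
    """Return a boolean mask (True = missing) for column 0 of X.

    Hypothetical helper: the threshold rule below is one simple way to
    simulate each mechanism, not the only one.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if mechanism == "MCAR":
        # Missingness is independent of every value in the data.
        prob = np.full(n, rate)
    elif mechanism == "MAR":
        # Missingness in column 0 depends only on the fully observed column 1.
        prob = np.where(X[:, 1] > np.median(X[:, 1]), 2 * rate, 0.0)
    elif mechanism == "MNAR":
        # Missingness in column 0 depends on the value that goes missing.
        prob = np.where(X[:, 0] > np.median(X[:, 0]), 2 * rate, 0.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(n) < prob

def mean_impute_rmse(X, mask):
    """Fill missing entries of column 0 with the observed mean; return RMSE."""
    fill = X[~mask, 0].mean()
    return float(np.sqrt(np.mean((X[mask, 0] - fill) ** 2)))

# Synthetic data: two correlated, roughly unit-variance columns.
rng = np.random.default_rng(42)
x1 = rng.normal(size=5000)
x0 = 0.8 * x1 + 0.6 * rng.normal(size=5000)
X = np.column_stack([x0, x1])

masks = {m: make_mask(X, m) for m in ("MCAR", "MAR", "MNAR")}
rmse = {m: mean_impute_rmse(X, mask) for m, mask in masks.items()}
for m in rmse:
    print(f"{m}: missing={masks[m].mean():.2f}, RMSE={rmse[m]:.3f}")
```

Under these assumptions, the observed mean is biased when missingness depends on the data, so the same imputation method scores differently across mechanisms at the same missing rate. The sketch covers RMSE only; the downstream evaluation described in the abstract would additionally train a classifier, regressor, or clustering model on the imputed data.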
Cite
Text
Zhou et al. "Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024. doi:10.1007/978-3-031-70381-2_7
Markdown
[Zhou et al. "Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024.](https://mlanthology.org/ecmlpkdd/2024/zhou2024ecmlpkdd-missing/) doi:10.1007/978-3-031-70381-2_7
BibTeX
@inproceedings{zhou2024ecmlpkdd-missing,
title = {{Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?}},
author = {Zhou, Youran and Bouadjenek, Mohamed Reda and Aryal, Sunil},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2024},
pages = {100-115},
doi = {10.1007/978-3-031-70381-2_7},
url = {https://mlanthology.org/ecmlpkdd/2024/zhou2024ecmlpkdd-missing/}
}