Marvolo: Programmatic Data Augmentation for Deep Malware Detection
Abstract
Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government) may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo for creating realistic, semantics-preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10% respectively, while boosting efficiency by 79x by avoiding redundant computation.
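The idea of semantics-preserving augmentation with label propagation can be illustrated with a minimal sketch. This is not Marvolo's implementation, which operates on raw binary files; it assumes a hypothetical toy register machine with unbounded integers and no flags, and the names `run`, `substitute`, and `augment` are invented for illustration. Each variant is produced by inserting no-ops and swapping an instruction for an equivalent one, so the program's behavior is unchanged and the original label can be reused.

```python
import random

# Hypothetical toy instruction set (an assumption, not a real ISA):
#   ("mov", reg, imm), ("add", reg, imm), ("sub", reg, imm), ("nop",)
def run(program):
    """Execute a toy program and return the final register state."""
    regs = {}
    for ins in program:
        op = ins[0]
        if op == "mov":
            regs[ins[1]] = ins[2]
        elif op == "add":
            regs[ins[1]] = regs.get(ins[1], 0) + ins[2]
        elif op == "sub":
            regs[ins[1]] = regs.get(ins[1], 0) - ins[2]
        elif op == "nop":
            pass
        else:
            raise ValueError(f"unknown op: {op}")
    return regs

def substitute(ins):
    """Swap add/sub for the equivalent instruction with a negated immediate.
    Exact here only because the toy machine has no flags or overflow."""
    if ins[0] == "add":
        return ("sub", ins[1], -ins[2])
    if ins[0] == "sub":
        return ("add", ins[1], -ins[2])
    return ins

def augment(program, label, rng, p_sub=0.5, p_nop=0.3):
    """Return a behavior-equivalent variant together with the unchanged label."""
    out = []
    for ins in program:
        if rng.random() < p_nop:
            out.append(("nop",))  # semantic no-op insertion
        out.append(substitute(ins) if rng.random() < p_sub else ins)
    return out, label

rng = random.Random(0)
sample = [("mov", "eax", 5), ("add", "eax", 3), ("sub", "eax", 1)]
variants = [augment(sample, "malicious", rng) for _ in range(4)]

# Every variant computes the same result, so propagating the label is safe.
for prog, lab in variants:
    assert run(prog) == run(sample)
    assert lab == "malicious"
```

On real machine code, equivalences like the add/sub swap need more care (for example, processor flags can differ), which is part of why operating on binaries complicates correctness.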
Cite
Text
Wong et al. "Marvolo: Programmatic Data Augmentation for Deep Malware Detection." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023. doi:10.1007/978-3-031-43412-9_16
Markdown
[Wong et al. "Marvolo: Programmatic Data Augmentation for Deep Malware Detection." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023.](https://mlanthology.org/ecmlpkdd/2023/wong2023ecmlpkdd-marvolo/) doi:10.1007/978-3-031-43412-9_16
BibTeX
@inproceedings{wong2023ecmlpkdd-marvolo,
title = {{Marvolo: Programmatic Data Augmentation for Deep Malware Detection}},
author = {Wong, Mike and Raff, Edward and Holt, James and Netravali, Ravi},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2023},
pages = {270--285},
doi = {10.1007/978-3-031-43412-9_16},
url = {https://mlanthology.org/ecmlpkdd/2023/wong2023ecmlpkdd-marvolo/}
}