Marvolo: Programmatic Data Augmentation for Deep Malware Detection

Wong, Mike; Raff, Edward; Holt, James; Netravali, Ravi

doi:10.1007/978-3-031-43412-9_16

ECML-PKDD 2023, pp. 270-285

Abstract

Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government) may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo, which creates realistic, semantics-preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10%, respectively, while boosting efficiency by 79x by avoiding redundant computation.
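The idea of label propagation under semantics-preserving transformations can be illustrated with a toy sketch. This is not Marvolo's implementation: the three-instruction register machine, the NOP-insertion transform, and the names `run`, `insert_nops`, and `augment` are all hypothetical and chosen only for illustration. Real binaries would require disassembly and rewriting, which this sketch does not attempt.

```python
import random

# Hypothetical toy instruction set (not x86):
#   ("mov", reg, imm)   set reg to an immediate value
#   ("add", dst, src)   dst += src
#   ("nop",)            do nothing

def run(program):
    """Execute a toy program and return the final register state."""
    regs = {"a": 0, "b": 0, "c": 0}
    for ins in program:
        op = ins[0]
        if op == "mov":
            regs[ins[1]] = ins[2]
        elif op == "add":
            regs[ins[1]] += regs[ins[2]]
        elif op == "nop":
            pass
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs

def insert_nops(program, rng, rate=0.3):
    """Semantics-preserving transform: randomly insert no-op instructions."""
    out = []
    for ins in program:
        if rng.random() < rate:
            out.append(("nop",))
        out.append(ins)
    return out

def augment(sample, label, n, seed=0):
    """Produce n variants of a sample; each variant inherits the original label."""
    rng = random.Random(seed)
    return [(insert_nops(sample, rng), label) for _ in range(n)]

if __name__ == "__main__":
    original = [("mov", "a", 2), ("mov", "b", 3), ("add", "a", "b")]
    for variant, label in augment(original, "malicious", n=3):
        # Behavior is unchanged, so reusing the label is safe in this toy setting.
        assert run(variant) == run(original)
        print(label, variant)
```

Because the transform cannot change what the program computes, each variant can reuse the original label without re-analysis, which is the property the abstract relies on to avoid reverse engineering newly generated files.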

Cite

Text

Wong et al. "Marvolo: Programmatic Data Augmentation for Deep Malware Detection." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023. doi:10.1007/978-3-031-43412-9_16

Markdown

[Wong et al. "Marvolo: Programmatic Data Augmentation for Deep Malware Detection." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023.](https://mlanthology.org/ecmlpkdd/2023/wong2023ecmlpkdd-marvolo/) doi:10.1007/978-3-031-43412-9_16

BibTeX

@inproceedings{wong2023ecmlpkdd-marvolo,
  title     = {{Marvolo: Programmatic Data Augmentation for Deep Malware Detection}},
  author    = {Wong, Mike and Raff, Edward and Holt, James and Netravali, Ravi},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2023},
  pages     = {270--285},
  doi       = {10.1007/978-3-031-43412-9_16},
  url       = {https://mlanthology.org/ecmlpkdd/2023/wong2023ecmlpkdd-marvolo/}
}