malware-modeler 0.0.2

malware modeler

A machine learning application and library for training logistic regression models for benign vs. malicious prediction.

This code is alpha quality and is not fully tested. Do not use it in a production setting.

There are four basic steps:

Feature extraction: find the top k n-grams in the collection. k is typically between 100,000 and 1,000,000, and n should be 8.
Dataset file creation: featurize the samples in your malware and goodware collection and write the result to a dataset file.
Model training: train a model on the training data.
Evaluation: evaluate the model against held-out testing or validation data.
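
The four steps can be sketched on toy data roughly as follows. This is a minimal, standard-library Python illustration and not this project's API. Every function name here is hypothetical. The sketch assumes byte n-grams (as in the research cited below), assumes "top k" means the n-grams found in the most samples, keeps the dataset in memory instead of writing a file, and shrinks n and k to 4 and 16 so it runs instantly.

```python
import math
from collections import Counter

N, K = 4, 16  # toy values; the text above suggests n = 8 and k of 100k to 1m


def ngrams(data, n):
    """Distinct byte n-grams in one sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def top_k_ngrams(samples, n, k):
    """Step 1: the k n-grams found in the most samples (assumed ranking)."""
    counts = Counter()
    for data in samples:
        counts.update(ngrams(data, n))
    # Sort by count, then by bytes, so ties are broken deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def featurize(data, vocab, n):
    """Step 2: one dataset row, 1.0 where the sample contains the n-gram."""
    present = ngrams(data, n)
    return [1.0 if gram in present else 0.0 for gram in vocab]


def predict(weights, bias, row):
    """Logistic model: probability that the row is malicious."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=200, lr=0.5):
    """Step 3: logistic regression fitted by stochastic gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            error = predict(weights, bias, row) - label
            weights = [w - lr * error * x for w, x in zip(weights, row)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, rows, labels):
    """Step 4: fraction of hold-out rows classified correctly."""
    hits = sum((predict(weights, bias, r) >= 0.5) == (y == 1)
               for r, y in zip(rows, labels))
    return hits / len(rows)


# Made-up byte strings standing in for files; label 1 = malicious, 0 = benign.
train_set = [(b"MZ\x90\x00evil_payload", 1), (b"MZ\x90\x00evil_loader!", 1),
             (b"MZ\x90\x00hello_world", 0), (b"MZ\x90\x00hello_there", 0)]
test_set = [(b"MZ\x90\x00evil_dropper", 1), (b"MZ\x90\x00hello_again", 0)]

vocab = top_k_ngrams([d for d, _ in train_set], N, K)
rows = [featurize(d, vocab, N) for d, _ in train_set]
labels = [y for _, y in train_set]
weights, bias = train(rows, labels)

test_rows = [featurize(d, vocab, N) for d, _ in test_set]
test_labels = [y for _, y in test_set]
test_accuracy = accuracy(weights, bias, test_rows, test_labels)
print(test_accuracy)
```

A real collection would need a far more memory-efficient top-k search than an exact counter, which is the problem the KiloGrams and Hash-Grams papers listed below address.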

Additionally:

The model can reduce the k features to a smaller number, performing further feature selection that may yield a better model.
Each model should cover only one file type: one model for EXEs, one for PDFs, one for ELFs, and so on.
The training data should come from a large, balanced collection: the same number of benign and malicious samples, and at least hundreds of thousands of samples.
These are simple models and are only as good as the training data. Bad, mislabeled, or overly similar data yields a worthless model.
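
As a rough illustration of the data-hygiene points above, the sketch below removes byte-identical samples and then downsamples the larger class so that both classes are the same size. It is an assumption about how one might prepare a collection, not part of this project. The helper names `dedupe` and `balance` are hypothetical, and exact hashing only catches identical files, not samples that are merely very similar.

```python
import hashlib
import random


def dedupe(samples):
    """Drop byte-identical samples, keeping the first copy of each."""
    seen, unique = set(), []
    for data in samples:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique


def balance(benign, malicious, seed=0):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    size = min(len(benign), len(malicious))
    return rng.sample(benign, size), rng.sample(malicious, size)


# Made-up byte strings standing in for files.
benign = dedupe([b"good-1", b"good-2", b"good-1", b"good-3"])
malicious = dedupe([b"bad-1", b"bad-2"])
benign_bal, malicious_bal = balance(benign, malicious)
print(len(benign_bal), len(malicious_bal))
```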

Based on the following research:

Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. KiloGrams: Very Large N-Grams for Malware Classification. In Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19). 2019. Article.
William Fleshman, Edward Raff, Richard Zak, Mark McLean and Charles Nicholas. Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus. In 2018 13th International Conference on Malicious and Unwanted Software (MALWARE). October 2018, 1–10. Best Paper Award. Article, arXiv, DOI.
Edward Raff and Charles Nicholas. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In Proceedings of the ACM Symposium on Document Engineering 2018. 2018. Article, DOI.
Richard Zak, Edward Raff and Charles Nicholas. What can N-grams learn for malware detection? In 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). October 2017, 109–118. Article, DOI.
Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques, September 2016. Article, DOI.