Crate malware_modeler

A machine learning application and library for training logistic regression models for benign vs. malicious prediction.

This code is alpha quality and is not fully tested. Do not use it in a production setting.

There are four basic steps:

Feature extraction: find the top k n-grams, where k is roughly 100,000 to 1,000,000 and n should be 8.
Dataset file creation: featurize your malware and goodware collection and write the featurized samples to a dataset file.
Model training: train a model on the training data.
Evaluation: evaluate the model against held-out testing or validation data.
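
The first two steps can be sketched as follows. This is a minimal illustration, not the crate's implementation: it assumes byte n-grams, counts them exactly in memory, and encodes each sample as a binary presence vector. The names `top_k_ngrams` and `featurize` are hypothetical and are not part of this crate's API. Exact counting will not scale to the collection sizes mentioned here; the research cited on this page covers approaches meant for that scale.

```rust
use std::collections::{HashMap, HashSet};

/// Count every byte n-gram across all samples and keep the `k` most frequent.
/// Ties are broken by byte order so the result is deterministic.
/// (Hypothetical helper for illustration only.)
fn top_k_ngrams(samples: &[Vec<u8>], n: usize, k: usize) -> Vec<Vec<u8>> {
    let mut counts: HashMap<&[u8], usize> = HashMap::new();
    // `windows(0)` panics, so an n of zero simply yields no n-grams.
    if n > 0 {
        for sample in samples {
            // `windows` yields nothing when the sample is shorter than `n`.
            for gram in sample.windows(n) {
                *counts.entry(gram).or_insert(0) += 1;
            }
        }
    }
    let mut ranked: Vec<(&[u8], usize)> = counts.into_iter().collect();
    // Highest count first, then lexicographic byte order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    ranked.into_iter().take(k).map(|(g, _)| g.to_vec()).collect()
}

/// Turn one sample into a binary feature vector: 1.0 where the corresponding
/// vocabulary n-gram occurs in the sample, 0.0 otherwise.
/// (Hypothetical helper for illustration only.)
fn featurize(sample: &[u8], vocab: &[Vec<u8>], n: usize) -> Vec<f32> {
    let present: HashSet<&[u8]> = if n > 0 {
        sample.windows(n).collect()
    } else {
        HashSet::new()
    };
    vocab
        .iter()
        .map(|g| if present.contains(g.as_slice()) { 1.0 } else { 0.0 })
        .collect()
}
```

Calling `top_k_ngrams(&samples, 8, 100_000)` would correspond to the low end of the suggested parameters; the resulting vocabulary is then passed to `featurize` for every sample that goes into the dataset file.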

Additionally:

The model can reduce the k features to a smaller number, performing further feature selection that may yield a better model.
Build each model for a single file type: one for EXEs, one for PDFs, one for ELFs, etc.
Train on a large, balanced collection: equal numbers of benign and malicious samples, with at least hundreds of thousands of samples.
These are simple models and are only as good as their training data. Bad, mislabeled, or overly similar data yields a worthless model.
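
To make the prediction, evaluation, and balance points concrete, here is a small sketch of how a logistic regression model scores a featurized sample and how hold-out accuracy and class balance could be checked. All names (`predict_proba`, `accuracy`, `is_balanced`) are hypothetical and are not taken from this crate; the `true` = malicious labeling and the decision threshold are assumptions made for the example.

```rust
/// Logistic (sigmoid) function: maps a raw score to a probability in (0, 1).
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Probability that a featurized sample is malicious under a linear model.
/// (Hypothetical helper; the weights would come from the training step.)
fn predict_proba(weights: &[f32], bias: f32, features: &[f32]) -> f32 {
    let z = weights.iter().zip(features).map(|(w, x)| w * x).sum::<f32>() + bias;
    sigmoid(z)
}

/// Fraction of hold-out samples whose thresholded prediction matches the label.
/// Labels are assumed to be `true` for malicious, `false` for benign.
fn accuracy(weights: &[f32], bias: f32, data: &[(Vec<f32>, bool)], threshold: f32) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let correct = data
        .iter()
        .filter(|(x, label)| (predict_proba(weights, bias, x) >= threshold) == *label)
        .count();
    correct as f32 / data.len() as f32
}

/// True when the benign and malicious counts differ by at most `tolerance`,
/// expressed as a fraction of the total number of samples.
fn is_balanced(labels: &[bool], tolerance: f32) -> bool {
    if labels.is_empty() {
        return false;
    }
    let malicious = labels.iter().filter(|&&l| l).count();
    let benign = labels.len() - malicious;
    let diff = malicious.abs_diff(benign) as f32 / labels.len() as f32;
    diff <= tolerance
}
```

Accuracy is only one possible metric and is meaningful here mainly because the collection is expected to be balanced; on a skewed hold-out set it can hide poor detection of the minority class.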

Based on the following research:

Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. KiloGrams: Very Large N-Grams for Malware Classification. In Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS’19). 2019.
Edward Raff and Charles Nicholas. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In Proceedings of the ACM Symposium on Document Engineering 2018. 2018.
Richard Zak, Edward Raff and Charles Nicholas. What can N-grams learn for malware detection? In 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). October 2017, 109–118.
Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques, September 2016.

Modules

dataset: Data structures and logic for storing training/inference data
model: Data structure and logic for training a model and calculating predictions

Constants

MAX_RECURSION_DEPTH: Maximum recursion depth when walking a directory structure
VERSION: Malware Modeler version
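
As an illustration of what a recursion limit means when walking a sample collection, the sketch below collects files from a directory tree while refusing to descend past a fixed depth. It is not the crate's code: `collect_files` and `DEPTH_LIMIT` are hypothetical names, and the limit value is made up for the example, since the actual value of `MAX_RECURSION_DEPTH` is not stated on this page.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical limit for this sketch only; not the crate's actual
/// `MAX_RECURSION_DEPTH` value.
const DEPTH_LIMIT: usize = 3;

/// Collect regular files under `dir`, descending at most `max_depth`
/// directory levels below it. With `max_depth == 0` only the files
/// directly inside `dir` are returned.
fn collect_files(dir: &Path, max_depth: usize, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // `DirEntry::file_type` does not follow symlinks, so symlinked
        // files and directories are skipped, which also avoids cycles.
        let file_type = entry.file_type()?;
        let path = entry.path();
        if file_type.is_file() {
            out.push(path);
        } else if file_type.is_dir() && max_depth > 0 {
            collect_files(&path, max_depth - 1, out)?;
        }
    }
    Ok(())
}
```

A caller would start with `collect_files(root, DEPTH_LIMIT, &mut files)`. Bounding the depth keeps a badly nested or unexpectedly deep collection from exhausting the stack.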