malware-modeler 0.0.1

## malware modeler

A machine learning application and library for training logistic regression models for benign vs. malicious prediction.

**This code is alpha quality and is not fully tested. Do not use it in a production setting.**

There are four basic steps:
1. Feature extraction: find the top *k* *n*-grams. *k* is typically 100,000 to 1,000,000, and *n* should be 8.
2. Dataset file creation: featurize the samples in your malware and goodware collections and write the result to a dataset file.
3. Model training: train a model on the training data.
4. Evaluation: evaluate the model against testing or validation (hold-out) data.
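
The first two steps can be sketched in a few lines of Python. This is only an illustration of the idea, not this library's API: the function names (`byte_ngrams`, `top_k_ngrams`, `featurize`) are hypothetical, ranking by the number of samples that contain each *n*-gram is an assumption, and the exact in-memory counting shown here would not scale to the collection sizes recommended below. The papers listed at the end describe approximate methods suited to large *n* and large corpora.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> set:
    """Return the distinct n-byte windows found in one sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def top_k_ngrams(samples, n: int, k: int) -> list:
    """Rank n-grams by how many samples contain them and keep the top k.

    Assumption: exact counting in memory, which is fine for a toy example only.
    """
    counts = Counter()
    for data in samples:
        counts.update(byte_ngrams(data, n))
    # Sort by descending count, then by byte value so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


def featurize(data: bytes, vocab: list) -> list:
    """Turn one sample into a 0/1 vector: is each selected n-gram present?"""
    if not vocab:
        return []
    present = byte_ngrams(data, len(vocab[0]))
    return [1 if gram in present else 0 for gram in vocab]
```

Each row produced by `featurize` would then be written, together with its benign or malicious label, to the dataset file used for training.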

Additionally:
* The model can reduce the *k* features to a smaller set, performing further feature selection that may yield a better model.
* Build each model for a single file type: one model for EXEs, one for PDFs, one for ELFs, and so on.
* Train on a large, balanced collection: equal numbers of benign and malicious samples, and at least hundreds of thousands of samples.
* These are simple models and are only as good as their training data. Bad, mislabeled, or overly similar data yields a worthless model.
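
To make the training, feature reduction, and evaluation steps concrete, here is a minimal pure-Python sketch of logistic regression. It assumes an L1 penalty (applied as a soft-threshold after each gradient step) as one common way a model can drive uninformative feature weights to exactly zero; whether this library reduces features the same way is not stated here. All names (`train_logreg`, `predict`, `accuracy`) and hyperparameter defaults are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, l1=0.0, lr=0.5, epochs=200):
    """Batch gradient descent on log-loss with an optional L1 penalty.

    Assumption: the penalty is applied by soft-thresholding each weight,
    which sets weights with little predictive value to exactly zero.
    """
    d, m = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            stepped = w[j] - lr * grad_w[j] / m
            # Soft-threshold: shrink toward zero, clamp at zero.
            w[j] = math.copysign(max(abs(stepped) - lr * l1, 0.0), stepped)
        b -= lr * grad_b / m
    return w, b


def predict(w, b, x) -> int:
    """Return 1 (malicious) or 0 (benign) for one feature vector."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(w, b, X, y) -> float:
    """Fraction of hold-out samples classified correctly."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

Features whose weight ends at zero can be dropped from the model. Accuracy should always be measured on hold-out data that was not used for training, as step 4 describes.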

Based on the following research:
* Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. **KiloGrams: Very Large N-Grams for Malware Classification.** In *Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19)*. 2019. [Article](https://arxiv.org/abs/1908.00200).
* Edward Raff and Charles Nicholas. **Hash-Grams: Faster N-Gram Features for Classification and Malware Detection.** In *Proceedings of the ACM Symposium on Document Engineering 2018*. 2018. [Article](http://doi.acm.org/10.1145/3209280.3229085), [DOI](http://dx.doi.org/10.1145/3209280.3229085).
* Richard Zak, Edward Raff and Charles Nicholas. **What can N-grams learn for malware detection?** In *2017 12th International Conference on Malicious and Unwanted Software (MALWARE)*. October 2017, 109–118. [Article](http://ieeexplore.ieee.org/document/8323963/), [DOI](http://dx.doi.org/10.1109/MALWARE.2017.8323963).
* Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. **An investigation of byte n-gram features for malware classification.** *Journal of Computer Virology and Hacking Techniques*, September 2016. [Article](http://link.springer.com/10.1007/s11416-016-0283-1), [DOI](http://dx.doi.org/10.1007/s11416-016-0283-1).