malware-modeler 0.0.5

Train logistic regression models for benign vs. malicious files based on byte n-grams and published research, plus related tools.
## malware modeler
[![Crates.io Version](https://img.shields.io/crates/v/malware-modeler)](https://crates.io/crates/malware-modeler)

A machine learning application and library for training logistic regression models for benign vs. malicious prediction plus related tools.

**This code is alpha quality and is not fully tested. Don't use in a production setting.**

There are four basic steps:
1. Feature extraction: find the top *k* *n*-grams in your collection. *k* is typically about 100k to 1M, and *n* should be 8.
2. Dataset file creation: featurize the samples in your malware and goodware collection and write them to a dataset file.
3. Model training: train a model on the training data.
4. Evaluation: evaluate the model against testing or validation (hold-out) data.
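
Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not this crate's API: the function names `top_k_ngrams` and `featurize` are hypothetical, ranking n-grams by the number of files they appear in is an assumption, and the binary present/absent encoding is an assumption as well. A real run would use *n* = 8 and a much larger *k*, and would likely need a more memory-efficient counter.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: rank n-grams by how many files contain them
/// (an assumed ranking rule) and keep the `k` most common.
fn top_k_ngrams(files: &[Vec<u8>], n: usize, k: usize) -> Vec<Vec<u8>> {
    let mut doc_freq: HashMap<&[u8], usize> = HashMap::new();
    for file in files {
        if n == 0 || file.len() < n {
            continue; // `windows(0)` would panic; short files have no n-grams
        }
        // De-duplicate within a file so each file counts once per n-gram.
        let unique: HashSet<&[u8]> = file.windows(n).collect();
        for gram in unique {
            *doc_freq.entry(gram).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(&[u8], usize)> = doc_freq.into_iter().collect();
    // Descending count; ties broken by byte order so the result is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    ranked.into_iter().take(k).map(|(g, _)| g.to_vec()).collect()
}

/// Hypothetical helper: encode one file as a 0/1 vector, one entry per
/// vocabulary n-gram (1 if the n-gram occurs anywhere in the file).
fn featurize(file: &[u8], vocab: &[Vec<u8>]) -> Vec<u8> {
    let n = vocab.first().map_or(0, |g| g.len());
    if n == 0 || file.len() < n {
        return vec![0; vocab.len()];
    }
    let present: HashSet<&[u8]> = file.windows(n).collect();
    vocab
        .iter()
        .map(|g| u8::from(present.contains(g.as_slice())))
        .collect()
}
```

Calling `top_k_ngrams(&files, 8, 100_000)` on the training collection and then `featurize` on each sample would produce the rows of a dataset; how this crate actually stores its dataset file is not shown here.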

Additionally:
* The similarity feature can be used to ensure the samples used for training have decent variation.
* N-gram features, dataset files, and models are tied to a file type.
* Training can reduce the *k* features to a smaller subset, letting the model perform further feature selection, which may yield a better model.
* Each model should cover a single file type: one model for EXEs, one for PDFs, one for ELFs, and so on.
* The training data should come from a large, balanced collection: equal numbers of benign and malicious samples, with at least hundreds of thousands of samples in total.
* These are simple models, which are only as good as the training data. Bad, mislabeled, or overly similar data yields a worthless model.
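
To make "simple models" concrete, here is a minimal sketch of logistic regression over feature vectors, covering the training and evaluation steps. It is an illustration under stated assumptions, not this crate's implementation: the names `predict`, `train`, and `accuracy` are hypothetical, plain stochastic gradient descent without regularization is an assumed optimizer, and features are taken to be `f64` values of 0.0 or 1.0.

```rust
/// Logistic (sigmoid) function.
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Predicted probability that a sample is malicious (label 1.0).
fn predict(weights: &[f64], bias: f64, x: &[f64]) -> f64 {
    let z = weights.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>() + bias;
    sigmoid(z)
}

/// Fit weights with plain SGD on the log loss (assumed optimizer).
/// `labels` holds 1.0 for malicious and 0.0 for benign.
fn train(samples: &[Vec<f64>], labels: &[f64], epochs: usize, lr: f64) -> (Vec<f64>, f64) {
    let dim = samples.first().map_or(0, |s| s.len());
    let mut weights: Vec<f64> = vec![0.0; dim];
    let mut bias: f64 = 0.0;
    for _ in 0..epochs {
        for (x, &y) in samples.iter().zip(labels) {
            // Gradient of the log loss with respect to z is (p - y).
            let err = predict(&weights, bias, x) - y;
            for (w, xi) in weights.iter_mut().zip(x) {
                *w -= lr * err * xi;
            }
            bias -= lr * err;
        }
    }
    (weights, bias)
}

/// Fraction of samples classified correctly at a 0.5 threshold.
fn accuracy(weights: &[f64], bias: f64, samples: &[Vec<f64>], labels: &[f64]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let correct = samples
        .iter()
        .zip(labels)
        .filter(|&(x, &y)| (predict(weights, bias, x) >= 0.5) == (y >= 0.5))
        .count();
    correct as f64 / samples.len() as f64
}
```

Because the model is just a weighted sum of features passed through a sigmoid, it can only reflect patterns present in the training samples, which is why label quality and sample variety matter so much. Accuracy should be measured on hold-out data rather than on the samples used for training.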

Based on the following research:
* Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. **KiloGrams: Very Large N-Grams for Malware Classification.** In *Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19)*. 2019. [Article](https://arxiv.org/abs/1908.00200).
* William Fleshman, Edward Raff, Richard Zak, Mark McLean and Charles Nicholas. **Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus**. In *2018 13th International Conference on Malicious and Unwanted Software (MALWARE)*. October 2018, 1–10. Best Paper Award. [Article](https://ieeexplore.ieee.org/document/8659360/), [arXiv](https://arxiv.org/abs/1806.04773), [DOI](http://dx.doi.org/10.1109/MALWARE.2018.8659360).
* Edward Raff and Charles Nicholas. **Hash-Grams: Faster N-Gram Features for Classification and Malware Detection.** In *Proceedings of the ACM Symposium on Document Engineering 2018*. 2018. [Article](http://doi.acm.org/10.1145/3209280.3229085), [DOI](http://dx.doi.org/10.1145/3209280.3229085).
* Richard Zak, Edward Raff and Charles Nicholas. **What can N-grams learn for malware detection?** In *2017 12th International Conference on Malicious and Unwanted Software (MALWARE)*. October 2017, 109–118. [Article](http://ieeexplore.ieee.org/document/8323963/), [DOI](http://dx.doi.org/10.1109/MALWARE.2017.8323963).
* Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. **An investigation of byte n-gram features for malware classification.** *Journal of Computer Virology and Hacking Techniques*, September 2016. [Article](http://link.springer.com/10.1007/s11416-016-0283-1), [DOI](http://dx.doi.org/10.1007/s11416-016-0283-1).

Additional tools:
* Extract files from a Zip archive based on file type, useful for working with files from [VirusShare](https://virusshare.com).
* Get a summary of files in a Zip archive by file type.
* Check files in a directory for similarity with each other to help you build a dataset with good variation.
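
One way to picture the similarity check is Jaccard similarity over the sets of byte n-grams in two files. The sketch below is only an illustration: the similarity measure this crate actually uses is not stated here, and the names `jaccard` and `near_duplicates`, the n-gram size, and the threshold are all assumptions.

```rust
use std::collections::HashSet;

/// Set of distinct byte n-grams in `data` (empty when `n` is 0).
fn ngram_set(data: &[u8], n: usize) -> HashSet<&[u8]> {
    if n == 0 {
        return HashSet::new(); // `windows(0)` would panic
    }
    data.windows(n).collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two n-gram sets.
/// Returns 0.0 when neither input has any n-grams.
fn jaccard(a: &[u8], b: &[u8], n: usize) -> f64 {
    let (sa, sb) = (ngram_set(a, n), ngram_set(b, n));
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Index pairs of files whose similarity is at least `threshold`.
/// Compares every pair, so the cost grows quadratically with file count.
fn near_duplicates(files: &[Vec<u8>], n: usize, threshold: f64) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..files.len() {
        for j in (i + 1)..files.len() {
            if jaccard(&files[i], &files[j], n) >= threshold {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

Pairs reported by something like `near_duplicates` are candidates for pruning, so that the dataset is not dominated by many copies of nearly the same file.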