malware-modeler 0.0.2

Train logistic regression models for benign vs. malicious files based on byte n-grams and published research.

malware modeler

A machine learning application and library for training logistic regression models for benign vs. malicious prediction.

This code is alpha quality and not fully tested. Do not use it in a production setting.

There are four basic steps:

  1. Feature extraction: find the top k n-grams, where k is about 100k to 1m and n should be 8.
  2. Dataset file creation: featurize the samples in your malware and goodware collection and write them to a dataset file.
  3. Model training: train a model on the training data.
  4. Evaluation: evaluate the model against testing or validation (hold-out) data.
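
Steps 1 and 2 can be sketched in a few lines of standard-library Rust. This is an illustration only, not this crate's API: the function names `top_k_ngrams` and `featurize` are hypothetical, n-grams are ranked by total occurrence count (an assumption), and features are encoded as simple presence flags (also an assumption).

```rust
use std::collections::HashMap;

/// Count every byte n-gram across all files and return the k most frequent.
/// Ties are broken by byte order so the result is deterministic.
/// (Hypothetical helper; ranking by total occurrences is an assumption.)
fn top_k_ngrams(files: &[Vec<u8>], n: usize, k: usize) -> Vec<Vec<u8>> {
    assert!(n > 0, "n-gram length must be positive");
    let mut counts: HashMap<&[u8], usize> = HashMap::new();
    for file in files {
        // `windows` yields nothing when the file is shorter than n bytes.
        for gram in file.windows(n) {
            *counts.entry(gram).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(&[u8], usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    ranked.into_iter().take(k).map(|(gram, _)| gram.to_vec()).collect()
}

/// Turn one file into a feature vector with one slot per vocabulary n-gram:
/// 1.0 if the n-gram occurs anywhere in the file, 0.0 otherwise.
/// (Hypothetical helper; presence flags are an assumption.)
fn featurize(file: &[u8], vocab: &[Vec<u8>], n: usize) -> Vec<f32> {
    let index: HashMap<&[u8], usize> = vocab
        .iter()
        .enumerate()
        .map(|(i, gram)| (gram.as_slice(), i))
        .collect();
    let mut features = vec![0.0; vocab.len()];
    for gram in file.windows(n) {
        if let Some(&i) = index.get(gram) {
            features[i] = 1.0;
        }
    }
    features
}
```

Calling `top_k_ngrams(&files, 8, 100_000)` would match the recommended n and the lower end of the recommended k. Exact counting in a hash map keeps every distinct n-gram in memory, so this sketch is only practical for small collections.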

Additionally:

  • Training can reduce the k features to a smaller number, letting the model perform further feature selection in the hope of producing a better model.
  • Each model should cover only one file type: one model for EXEs, one for PDFs, one for ELFs, and so on.
  • The training data should be a large, balanced collection: equal numbers of benign and malicious samples, with at least hundreds of thousands of samples.
  • These are simple models and are only as good as the training data. Bad, mislabeled, or overly similar data yields a worthless model.
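
Training, evaluation, and feature reduction can be sketched as follows. This README does not say how the model reduces the feature set; the sketch assumes one common approach, an L1 penalty applied by soft-thresholding, which drives weak weights to exactly zero. The `Model` type and the `train` and `accuracy` functions are hypothetical names, not this crate's API.

```rust
/// Logistic function mapping a score to a probability in (0, 1).
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Hypothetical model type: one weight per feature plus a bias.
struct Model {
    weights: Vec<f32>,
    bias: f32,
}

impl Model {
    /// Probability that a feature vector belongs to the malicious class.
    fn predict(&self, x: &[f32]) -> f32 {
        let score: f32 = self.weights.iter().zip(x).map(|(w, v)| w * v).sum();
        sigmoid(score + self.bias)
    }
}

/// Full-batch gradient descent on the log loss. After each gradient step the
/// weights are soft-thresholded, which applies the L1 penalty (assumed here)
/// and sets small weights to exactly zero. Labels: 1.0 malicious, 0.0 benign.
fn train(xs: &[Vec<f32>], ys: &[f32], epochs: usize, lr: f32, l1: f32) -> Model {
    let dim = xs.first().map_or(0, |x| x.len());
    let mut model = Model { weights: vec![0.0; dim], bias: 0.0 };
    let count = xs.len().max(1) as f32;
    for _ in 0..epochs {
        let mut grad_w = vec![0.0f32; dim];
        let mut grad_b = 0.0f32;
        for (x, &y) in xs.iter().zip(ys) {
            let err = model.predict(x) - y;
            for (g, v) in grad_w.iter_mut().zip(x) {
                *g += err * v;
            }
            grad_b += err;
        }
        for (w, g) in model.weights.iter_mut().zip(&grad_w) {
            let stepped = *w - lr * g / count;
            // Soft threshold: shrink toward zero and clamp at zero.
            *w = stepped.signum() * (stepped.abs() - lr * l1).max(0.0);
        }
        model.bias -= lr * grad_b / count;
    }
    model
}

/// Fraction of samples classified correctly at a 0.5 threshold.
fn accuracy(model: &Model, xs: &[Vec<f32>], ys: &[f32]) -> f32 {
    let mut correct = 0usize;
    for (x, &y) in xs.iter().zip(ys) {
        if (model.predict(x) >= 0.5) == (y >= 0.5) {
            correct += 1;
        }
    }
    correct as f32 / xs.len().max(1) as f32
}
```

In this sketch, the features that survive are those whose weights remain nonzero after training, and a larger `l1` value removes more of them. Evaluation should use hold-out data that was not seen during training, as step 4 describes.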

Based on the following research:

  • Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. KiloGrams: Very Large N-Grams for Malware Classification. In Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19). 2019. Article.
  • William Fleshman, Edward Raff, Richard Zak, Mark McLean and Charles Nicholas. Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus. In 2018 13th International Conference on Malicious and Unwanted Software (MALWARE). October 2018, 1–10. Best Paper Award. Article, arXiv, DOI.
  • Edward Raff and Charles Nicholas. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In Proceedings of the ACM Symposium on Document Engineering 2018. 2018. Article, DOI.
  • Richard Zak, Edward Raff and Charles Nicholas. What can N-grams learn for malware detection? In 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). October 2017, 109–118. Article, DOI.
  • Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques, September 2016. Article, DOI.