§malware modeler
A machine learning application and library for training logistic regression models for benign vs. malicious prediction.
This code is alpha quality and is not fully tested. Don’t use in a production setting.
There are four basic steps:
- Feature extraction: find the top k n-grams, where k is typically about 100,000 to 1,000,000 and n should be 8.
- Dataset file creation: featurize the samples in your malware and goodware collection and write them to a dataset file.
- Model training: train a model on the training data.
- Evaluation: evaluate the model against testing or validation (hold-out) data.
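The first two steps can be illustrated with a minimal sketch. It is not this crate's API: the function names `top_k_ngrams` and `featurize` are hypothetical, "top" is assumed to mean "most frequent", and features are assumed to be binary presence flags. Exact counting in a `HashMap` like this would not scale to the collection sizes recommended here; it only shows the idea.

```rust
use std::collections::HashMap;

/// Count every byte n-gram across `samples` and return the `k` most frequent.
/// Ties are broken by byte order so the result is deterministic.
/// (Hypothetical helper, not part of the crate.)
fn top_k_ngrams(samples: &[Vec<u8>], n: usize, k: usize) -> Vec<Vec<u8>> {
    assert!(n > 0, "n-gram size must be at least 1");
    let mut counts: HashMap<&[u8], u64> = HashMap::new();
    for sample in samples {
        // `windows` yields nothing when the sample is shorter than `n`.
        for gram in sample.windows(n) {
            *counts.entry(gram).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(&[u8], u64)> = counts.into_iter().collect();
    // Highest count first, then lexicographic order of the n-gram bytes.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    ranked.into_iter().take(k).map(|(gram, _)| gram.to_vec()).collect()
}

/// Turn one sample into a feature vector with one slot per vocabulary n-gram:
/// 1.0 if the n-gram occurs anywhere in the sample, 0.0 otherwise.
/// (Binary presence is an assumption made for this sketch.)
fn featurize(sample: &[u8], vocab: &[Vec<u8>], n: usize) -> Vec<f32> {
    assert!(n > 0, "n-gram size must be at least 1");
    let index: HashMap<&[u8], usize> = vocab
        .iter()
        .enumerate()
        .map(|(i, gram)| (gram.as_slice(), i))
        .collect();
    let mut features = vec![0.0; vocab.len()];
    for gram in sample.windows(n) {
        if let Some(&i) = index.get(gram) {
            features[i] = 1.0;
        }
    }
    features
}
```

Calling `top_k_ngrams(&samples, 8, 100_000)` once over the whole collection and then `featurize` on each sample would produce the rows of a dataset under these assumptions.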
Additionally:
- The model can reduce the k features to a smaller number, performing further feature selection that may yield a better model.
- Each model should cover only one file type: one model for EXEs, one for PDFs, one for ELFs, etc.
- The training data should come from a large, balanced collection: equal numbers of benign and malicious samples, and at least hundreds of thousands of samples.
- These are simple models and are only as good as the training data. Bad, mislabeled, or overly similar data yields a worthless model.
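To make "simple models" concrete, here is a rough sketch of logistic regression trained with plain stochastic gradient descent and scored by accuracy on hold-out data. The `LogisticModel` type, its methods, the 0.5 decision threshold, and the training procedure are all assumptions for illustration. The crate's own `model` module may work differently, and the feature reduction mentioned above is not shown.

```rust
/// Logistic function mapping any real number into (0, 1).
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Hypothetical model type for this sketch, not the crate's `model` module.
struct LogisticModel {
    weights: Vec<f32>,
    bias: f32,
}

impl LogisticModel {
    fn new(dim: usize) -> Self {
        Self { weights: vec![0.0; dim], bias: 0.0 }
    }

    /// Estimated probability that the sample is malicious (label 1.0).
    fn predict(&self, x: &[f32]) -> f32 {
        let z = self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + self.bias;
        sigmoid(z)
    }

    /// Plain SGD on the log loss: one weight update per sample per epoch.
    fn train(&mut self, xs: &[Vec<f32>], ys: &[f32], epochs: usize, lr: f32) {
        for _ in 0..epochs {
            for (x, &y) in xs.iter().zip(ys) {
                // Gradient of the log loss with respect to the linear output.
                let err = self.predict(x) - y;
                for (w, xi) in self.weights.iter_mut().zip(x) {
                    *w -= lr * err * xi;
                }
                self.bias -= lr * err;
            }
        }
    }
}

/// Fraction of samples whose thresholded prediction matches the label.
/// Returns 0.0 for an empty evaluation set.
fn accuracy(model: &LogisticModel, xs: &[Vec<f32>], ys: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let mut correct = 0usize;
    for (x, &y) in xs.iter().zip(ys) {
        let predicted_malicious = model.predict(x) >= 0.5;
        if predicted_malicious == (y >= 0.5) {
            correct += 1;
        }
    }
    correct as f32 / xs.len() as f32
}
```

Accuracy is only meaningful when it is measured on samples the model never saw during training, which is why the evaluation step uses hold-out data.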
Based on the following research:
- Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. KiloGrams: Very Large N-Grams for Malware Classification. In Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS’19). 2019. Article.
- Edward Raff and Charles Nicholas. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In Proceedings of the ACM Symposium on Document Engineering 2018. 2018. Article, DOI.
- Richard Zak, Edward Raff and Charles Nicholas. What can N-grams learn for malware detection? In 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). October 2017, 109–118. Article, DOI.
- Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques, September 2016. Article, DOI.
Modules§
- dataset
- Data structures and logic for storing training/inference data
- model
- Data structure and logic for training a model and calculating predictions
Structs§
- Ngrammer
- N-gramming object
Constants§
- MAX_RECURSION_DEPTH
- Maximum recursion depth when walking a directory structure
- VERSION
- Malware Modeler version
Type Aliases§
- Bytes
- Convenience type for vector of bytes