pub struct Dataset {
    pub data: Vec<Vec<f32>>,
    pub labels: Vec<f32>,
    pub features: Vec<Bytes>,
    pub ftype: FileType,
}
A dataset contains data for training or inference. Training requires labels; inference does not.
Fields§
data: Vec<Vec<f32>>
Data used for training a model or calculating predictions
labels: Vec<f32>
Data labels; can be empty if the dataset is only used for inference
features: Vec<Bytes>
N-gram byte features
ftype: FileType
The type of file represented
Implementations§
impl Dataset
pub fn load<P: AsRef<Path>>(path: P) -> Result<Dataset>
Load a file
§Errors
Returns an error if the file type can’t be determined, is determined incorrectly, or isn’t a supported format.
pub fn from_csv_file<P: AsRef<Path>>(
    path: P,
    data_length: usize,
) -> Result<Self>
Create a dataset struct from a CSV file
§Errors
Returns an error if:
- The file can’t be read
- The data contained isn’t numeric
- Feature data is missing
- The expected amount of data isn’t encountered
pub fn from_csv_file_assume_data_length<P: AsRef<Path>>(path: P) -> Result<Self>
Create a dataset struct from a CSV file, assuming the data length rather than taking it as a parameter
§Errors
Returns an error if:
- The file can’t be read
- The data contained isn’t a float
- Feature data is missing
- The amount of columns can’t be determined
pub fn from_csv_string(contents: &str, data_length: usize) -> Result<Self>
Create a dataset struct from a CSV string
§Errors
Returns an error if:
- The data contained isn’t numeric
- Feature data is missing
- The expected amount of data isn’t encountered
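The error conditions above can be illustrated with a minimal, standalone sketch. It is not the crate's implementation: the `CsvError` type, the `parse_csv` helper, and the row layout (`data_length` numeric values followed by one label) are all assumptions made for illustration.

```rust
/// Hypothetical error type for this sketch; the crate's own error type is not shown here.
#[derive(Debug, PartialEq)]
enum CsvError {
    NotNumeric { line: usize },
    WrongLength { line: usize, expected: usize, found: usize },
}

/// Parse CSV text into (data rows, labels).
/// Assumed layout (not confirmed by the documentation): each row holds
/// `data_length` numeric values followed by exactly one numeric label.
fn parse_csv(contents: &str, data_length: usize) -> Result<(Vec<Vec<f32>>, Vec<f32>), CsvError> {
    let mut data = Vec::new();
    let mut labels = Vec::new();
    for (i, line) in contents.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // skip blank lines
        }
        let mut values = Vec::new();
        for cell in line.split(',') {
            let value = cell
                .trim()
                .parse::<f32>()
                .map_err(|_| CsvError::NotNumeric { line: i + 1 })?;
            values.push(value);
        }
        if values.len() != data_length + 1 {
            return Err(CsvError::WrongLength {
                line: i + 1,
                expected: data_length + 1,
                found: values.len(),
            });
        }
        // The last value is the label; the rest is the data row.
        let label = values.pop().expect("length checked above");
        data.push(values);
        labels.push(label);
    }
    Ok((data, labels))
}

fn main() {
    let (data, labels) = parse_csv("1.0,2.0,1\n3.0,4.0,0\n", 2).unwrap();
    println!("{data:?} {labels:?}");
}
```

Reporting the line number with each error is a design choice of this sketch; it makes malformed rows easier to locate in large files.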
pub fn from_arff_file<P: AsRef<Path>>(path: P) -> Result<Self>
Create a dataset struct from an ARFF file
§Errors
Returns an error if:
- The file can’t be read
- The data contained isn’t numeric
- Feature data is missing
- The expected amount of data isn’t encountered
pub fn from_arff_string(contents: &str) -> Result<Self>
Create a dataset struct from an ARFF string
§Errors
Returns an error if:
- The data contained isn’t numeric
- Feature data is missing
- The expected amount of data isn’t encountered
pub fn from_libsvm_file<P: AsRef<Path>>(path: P) -> Result<Self>
Create a dataset struct from a libsvm file
§Errors
Returns an error if:
- The file can’t be read
- Feature data is missing
- The data isn’t in the expected format
- The expected amount of data isn’t encountered
pub fn from_libsvm_string(contents: &str) -> Result<Self>
Create a dataset from a libsvm string
§Errors
Returns an error if the string isn’t in the expected format or is missing features
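For reference, a libsvm line has the form `<label> <index>:<value> ...`, with 1-based feature indices and omitted features treated as zero. The following is a minimal sketch of parsing one such line into a dense row. The `parse_libsvm_line` helper, its `width` parameter, and the `String` error type are hypothetical and are not taken from this crate.

```rust
/// Parse one libsvm line into a label and a dense row of `width` values.
/// Indices are 1-based; features that are not listed stay at 0.0.
fn parse_libsvm_line(line: &str, width: usize) -> Result<(f32, Vec<f32>), String> {
    let mut parts = line.split_whitespace();
    let label = parts
        .next()
        .ok_or_else(|| "empty line".to_string())?
        .parse::<f32>()
        .map_err(|e| format!("bad label: {e}"))?;

    let mut row = vec![0.0_f32; width];
    for pair in parts {
        let (index, value) = pair
            .split_once(':')
            .ok_or_else(|| format!("expected index:value, found {pair:?}"))?;
        let index = index
            .parse::<usize>()
            .map_err(|e| format!("bad index: {e}"))?;
        let value = value
            .parse::<f32>()
            .map_err(|e| format!("bad value: {e}"))?;
        if index == 0 || index > width {
            return Err(format!("index {index} is outside 1..={width}"));
        }
        row[index - 1] = value; // convert the 1-based index to 0-based
    }
    Ok((label, row))
}

fn main() {
    let parsed = parse_libsvm_line("1 1:0.5 3:2.0", 3);
    println!("{parsed:?}");
}
```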
pub fn create_save_from_benign_malicious_files_and_ngrams<P: AsRef<Path>>(
    malicious_dir: P,
    benign_dir: P,
    ngrams_file: P,
    output_file: P,
) -> Result<()>
Given paths to malicious files, benign files, and n-grams (features), create a dataset and save it to the output file.
§Errors
This will fail if:
- The directories for benign or malicious files don’t exist or are empty.
- The n-gram feature file doesn’t exist, is empty, or doesn’t have hexadecimal-encoded features
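As an illustration of the hexadecimal-encoded feature requirement, here is a minimal sketch that decodes n-grams into raw bytes. It assumes one n-gram per line, which this page does not confirm; `decode_hex` and `parse_ngrams` are hypothetical helpers, not part of this crate.

```rust
/// Decode one hexadecimal-encoded n-gram, e.g. "4d5a", into raw bytes.
fn decode_hex(s: &str) -> Result<Vec<u8>, String> {
    let s = s.trim();
    let valid = !s.is_empty()
        && s.len() % 2 == 0
        && s.bytes().all(|b| b.is_ascii_hexdigit());
    if !valid {
        return Err(format!("not a hexadecimal-encoded n-gram: {s:?}"));
    }
    // Every character is an ASCII hex digit, so slicing by byte index is safe.
    Ok((0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).expect("validated above"))
        .collect())
}

/// Parse the body of a feature file, assuming one hex-encoded n-gram per line.
fn parse_ngrams(contents: &str) -> Result<Vec<Vec<u8>>, String> {
    let ngrams = contents
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(decode_hex)
        .collect::<Result<Vec<_>, _>>()?;
    if ngrams.is_empty() {
        return Err("n-gram feature file is empty".to_string());
    }
    Ok(ngrams)
}

fn main() {
    let ngrams = parse_ngrams("4D5A\n7f454c46\n");
    println!("{ngrams:?}");
}
```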
pub fn save_libsvm<P: AsRef<Path>>(&self, path: P) -> Result<()>
Save a dataset as a libsvm file
§Errors
Returns an error if the file can’t be opened for writing
pub fn save<P: AsRef<Path>>(&self, path: P) -> Result<()>
Save the dataset using the file extension to determine data format
§Errors
Returns an error if the file can’t be written or if the format can’t be determined
pub fn validate(&self) -> bool
Check whether the dataset is valid:
- All data rows have the same number of columns
- If labels are present, the number of data rows equals the number of labels
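The two checks above can be sketched as a free function over plain vectors. This is an illustration under stated assumptions, not the crate's code: `is_valid` is a hypothetical name, and treating an empty dataset as valid is a choice made here.

```rust
/// Return true when every row has the same width and, if labels are present,
/// there is exactly one label per row.
fn is_valid(data: &[Vec<f32>], labels: &[f32]) -> bool {
    let same_width = match data.first() {
        Some(first) => data.iter().all(|row| row.len() == first.len()),
        None => true, // assumption: an empty dataset passes the width check
    };
    let labels_match = labels.is_empty() || labels.len() == data.len();
    same_width && labels_match
}

fn main() {
    let data = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    println!("{}", is_valid(&data, &[1.0, 0.0]));
}
```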
pub fn shuffle(&mut self)
Shuffle the data using roughly 10 × log10(size) iterations: 10 records get 10 iterations, and 1,000 records get 30 iterations
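The iteration count can be worked out with a short sketch. The function name, the rounding to the nearest integer, and the result of zero for fewer than two records are assumptions made here, since the documentation only says "roughly".

```rust
/// Approximate default iteration count: 10 × log10(size), rounded to the nearest integer.
fn default_shuffle_iterations(size: usize) -> u32 {
    if size < 2 {
        return 0; // assumption: nothing to shuffle with fewer than two records
    }
    (10.0 * (size as f64).log10()).round() as u32
}

fn main() {
    for size in [10, 100, 1_000] {
        println!("{size} records: {} iterations", default_shuffle_iterations(size));
    }
}
```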
pub fn shuffle_iterations(&mut self, iterations: u32)
Shuffle the data with a specified number of iterations. If labels are present, they are swapped along with the data so that each row keeps its label.
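Keeping labels aligned with rows comes down to applying every swap to both vectors. The sketch below shows that idea only; it is not the crate's algorithm. It assumes that one iteration is one Fisher-Yates pass, and `XorShift64`, `shuffle_rows`, and the `seed` parameter are hypothetical names used to keep the example free of external crates.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in 0..bound (slightly biased by the modulo; fine for a sketch).
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Shuffle rows in place, applying each swap to the labels as well.
fn shuffle_rows(data: &mut [Vec<f32>], labels: &mut [f32], iterations: u32, seed: u64) {
    let n = data.len();
    if n < 2 {
        return;
    }
    let has_labels = labels.len() == n;
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    for _ in 0..iterations {
        // One Fisher-Yates pass over the rows.
        for i in (1..n).rev() {
            let j = rng.below(i + 1);
            data.swap(i, j);
            if has_labels {
                labels.swap(i, j);
            }
        }
    }
}

fn main() {
    let mut data: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32]).collect();
    let mut labels: Vec<f32> = (0..5).map(|i| i as f32).collect();
    shuffle_rows(&mut data, &mut labels, 3, 42);
    println!("{data:?} {labels:?}");
}
```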
pub fn split(&mut self, ratio: f32) -> Self
Split the dataset, ideally into train/test datasets. The ratio indicates how much data is kept; the remainder is removed from this dataset and returned.
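The keep-and-return behavior maps naturally onto `Vec::split_off`. The following is a minimal sketch over plain vectors, not the crate's implementation; `split_rows`, the clamping of `ratio` to 0.0..=1.0, and the rounding of the kept row count are assumptions.

```rust
/// Keep the first `ratio` share of the rows in place and return the rest.
fn split_rows(
    data: &mut Vec<Vec<f32>>,
    labels: &mut Vec<f32>,
    ratio: f32,
) -> (Vec<Vec<f32>>, Vec<f32>) {
    let ratio = ratio.clamp(0.0, 1.0);
    // Number of rows to keep, never more than the rows available.
    let keep = ((data.len() as f32 * ratio).round() as usize).min(data.len());
    let shed_data = data.split_off(keep);
    // Labels may be empty for inference-only datasets.
    let shed_labels = if labels.len() > keep {
        labels.split_off(keep)
    } else {
        Vec::new()
    };
    (shed_data, shed_labels)
}

fn main() {
    let mut data: Vec<Vec<f32>> = (0..10).map(|i| vec![i as f32]).collect();
    let mut labels: Vec<f32> = vec![0.0; 10];
    let (test_data, test_labels) = split_rows(&mut data, &mut labels, 0.8);
    println!("train: {}, test: {}", data.len(), test_data.len());
    println!("train labels: {}, test labels: {}", labels.len(), test_labels.len());
}
```

Shuffling before splitting is usually advisable, so that the returned portion is not just the tail of the original file order.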
pub fn reduce(&mut self, model: &LogisticRegression) -> Result<Vec<usize>>
Model training not only trains a model but also determines which features are most useful for distinguishing benign from malicious. This method removes the features deemed unneeded.
§Errors
If the model would remove all features, an error is returned, since an empty dataset isn’t useful; it’s more likely that the model and dataset weren’t built from the same data collection.
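A minimal sketch of column removal with the all-features guard follows. How `LogisticRegression` marks a feature as unneeded is not documented here, so the sketch assumes a hypothetical per-feature weight slice in which a weight of exactly zero means "unneeded". `reduce_features` and its returned list of removed indices are likewise assumptions.

```rust
/// Drop every column whose weight is zero and return the removed indices.
fn reduce_features(data: &mut [Vec<f32>], weights: &[f32]) -> Result<Vec<usize>, String> {
    if data.iter().any(|row| row.len() != weights.len()) {
        return Err("model and dataset have different feature counts".to_string());
    }
    let removed: Vec<usize> = weights
        .iter()
        .enumerate()
        .filter(|(_, w)| **w == 0.0)
        .map(|(i, _)| i)
        .collect();
    if removed.len() == weights.len() {
        return Err("model would remove every feature".to_string());
    }
    for row in data.iter_mut() {
        // `retain` visits elements in order, so a running column counter works.
        let mut col = 0;
        row.retain(|_| {
            let keep = weights[col] != 0.0;
            col += 1;
            keep
        });
    }
    Ok(removed)
}

fn main() {
    let mut data = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let removed = reduce_features(&mut data, &[0.5, 0.0, -1.2]);
    println!("{removed:?} {data:?}");
}
```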
Trait Implementations§
impl<'de> Deserialize<'de> for Dataset
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Auto Trait Implementations§
impl Freeze for Dataset
impl RefUnwindSafe for Dataset
impl Send for Dataset
impl Sync for Dataset
impl Unpin for Dataset
impl UnwindSafe for Dataset
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.