pub struct DatasetBase<R, T>
where
    R: Records,
{
    pub records: R,
    pub targets: T,
    pub weights: Array1<f32>,
    /* private fields */
}

DatasetBase

This is the fundamental structure of a dataset. It contains a number of records about the data and may contain targets, weights and feature names. To keep the type complexity low, the dataset base is generic only over the records and targets and introduces a trait bound on the records. The weights and feature_names, on the other hand, are always assumed to be owned and are copied when views are created.

Fields

  • records: a two-dimensional matrix with dimensionality (nsamples, nfeatures) or, in the case of kernel methods, a square matrix with dimensionality (nsamples, nsamples), which may be sparse
  • targets: a one- or two-dimensional array with dimensionality (nsamples) or (nsamples, ntargets)
  • weights: optional weights for each sample with dimensionality (nsamples)
  • feature_names: optional descriptive feature names with dimensionality (nfeatures)

Trait bounds

  • R: Records: generic over feature matrices or kernel matrices
  • T: generic over any ndarray matrix which can be used as targets. The AsTargets trait bound is omitted here to avoid some repetition in the implementation (src/dataset/impl_dataset.rs)

Fields

records: R
targets: T
weights: Array1<f32>

Implementations

Calculate the Pearson Correlation Coefficients from a dataset

The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations, and is therefore essentially a normalised measurement of the covariance with values in the range [-1, 1]. A negative coefficient indicates a negative correlation between the two variables.

Example
let corr = linfa_datasets::diabetes()
    .pearson_correlation();

println!("{}", corr);
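
The definition above (covariance divided by the product of the standard deviations) can be sketched for two plain slices as follows. This is only an illustration using the standard library; the function name `pearson` and its `Option` return type are assumptions made here and are not part of the dataset API.

```rust
/// Pearson correlation coefficient of two equally long samples.
/// Returns `None` if the lengths differ, fewer than two points are given,
/// or one of the samples has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    // Accumulate the unnormalised covariance and variances; the
    // normalisation factors cancel in the final ratio.
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x.sqrt() * var_y.sqrt()))
}
```

Two perfectly linearly related samples give a coefficient of (numerically almost exactly) 1 or -1, while a constant sample has no defined coefficient and yields `None` in this sketch.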

Calculate the Pearson Correlation Coefficients and p-values from the dataset

The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations, and is therefore essentially a normalised measurement of the covariance with values in the range [-1, 1]. A negative coefficient indicates a negative correlation between the two variables.

The p-value supports or rejects the null hypothesis that two variables are not correlated. The smaller the p-value, the stronger the evidence that the two variables are correlated. A typical threshold is p < 0.05.

Parameters
  • num_iter: number of iterations of the permutation test to estimate the p-value
Example
let corr = linfa_datasets::diabetes()
    .pearson_correlation_with_p_value(100);

println!("{}", corr);
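
A permutation test of this kind can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the crate's implementation: the names `XorShift64`, `pearson_r` and `permutation_p_value`, the `seed` parameter, and the choice to count permutations with `>=` are all hypothetical.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut s = self.0;
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        self.0 = s;
        s
    }

    /// Pseudo-random index in `0..bound` (modulo bias is ignored here).
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Pearson correlation; assumes equal lengths and non-constant inputs.
fn pearson_r(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    cov / (var_x.sqrt() * var_y.sqrt())
}

/// Fraction of random permutations of `y` whose |r| reaches the observed |r|.
/// Assumes `num_iter > 0`.
fn permutation_p_value(x: &[f64], y: &[f64], num_iter: usize, seed: u64) -> f64 {
    let observed = pearson_r(x, y).abs();
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut shuffled = y.to_vec();
    let mut hits = 0usize;
    for _ in 0..num_iter {
        // A Fisher-Yates shuffle breaks any pairing between x and y.
        for i in (1..shuffled.len()).rev() {
            let j = rng.below(i + 1);
            shuffled.swap(i, j);
        }
        if pearson_r(x, &shuffled).abs() >= observed {
            hits += 1;
        }
    }
    hits as f64 / num_iter as f64
}
```

With more iterations the estimate becomes finer grained: `num_iter` permutations can only resolve p-values in steps of `1 / num_iter`.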

Implementation without constraints on records and targets

This implementation block provides methods for the creation and mutation of datasets. This includes swapping the targets, returning the records, etc.

Create a new dataset from records and targets

Example
let dataset = Dataset::new(records, targets);

Returns reference to targets

Returns the weights, if any

Return a single weight

The weight of the idx-th observation is returned. If no weight is specified, then all observations are unweighted with default value 1.0.
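
The fallback to 1.0 can be pictured with a small helper over a plain slice. This is a sketch only: `weight_for` is a hypothetical name, and representing "no weights" as an empty slice is an assumption made for the illustration.

```rust
/// Weight of the `idx`-th observation, falling back to 1.0 when no weight
/// is stored for that index (for example because the slice is empty).
fn weight_for(weights: &[f32], idx: usize) -> f32 {
    weights.get(idx).copied().unwrap_or(1.0)
}
```

An unweighted dataset therefore behaves exactly like one in which every observation carries the weight 1.0.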

Returns feature names

A feature name is a human-readable string describing the purpose of a single feature. This allows the reader to understand its purpose while analysing results, for example in correlation analysis or feature importance.

Return records of a dataset

The records are data points from which predictions are made. This function returns a reference to the records field.

Updates the records of a dataset

This function overwrites the records in a dataset. It also invalidates the weights and feature names.

Updates the targets of a dataset

This function overwrites the targets in a dataset.

Updates the weights of a dataset

Updates the feature names of a dataset

Map targets with a function f

Example
let dataset = linfa_datasets::winequality()
    .map_targets(|x| *x > 6);

// dataset now has boolean targets
println!("{:?}", dataset.targets());
Returns

A modified dataset with new target type.

Return the number of targets in the dataset

Example
let dataset = linfa_datasets::winequality();

println!("#targets {}", dataset.ntargets());

Iterate over observations

This function creates an iterator which produces tuples of a data point and its target value. The iterator runs once for each data point and, while doing so, holds a reference to the owned dataset.

For multi-target datasets, the yielded target value is an ArrayView1 consisting of the different targets. For single-target datasets, the target value is an ArrayView0 containing the single target.

Example
let dataset = linfa_datasets::iris();

for (x, y) in dataset.sample_iter() {
    println!("{} => {}", x, y);
}

Creates a view of a dataset

Iterate over features

This iterator produces dataset views with only a single feature each, while the set of targets remains complete. It can be useful for comparing each feature individually against all targets.

Iterate over targets

This function creates an iterator which produces dataset views with complete records, but only a single target each. This is useful for training multiple single-target models on a multi-target dataset.

Split dataset into two disjoint chunks

This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated with the ratio. For example, a ratio of 0.9 allocates 90% to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.

Produce N boolean targets from multi-class targets

Some algorithms (like SVM) don’t support multi-class targets. This function splits a dataset into multiple binary single-target views of the same dataset.
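
The one-vs-all idea can be sketched on a plain slice of class labels. The helper below is illustrative only: `one_vs_all` is a hypothetical name, and it returns owned boolean vectors rather than views of a dataset.

```rust
use std::collections::BTreeSet;

/// For every distinct class label, build a boolean target vector that is
/// `true` exactly where the sample belongs to that class.
/// Classes are returned in ascending order.
fn one_vs_all<L: Ord + Clone>(targets: &[L]) -> Vec<(L, Vec<bool>)> {
    let classes: BTreeSet<L> = targets.iter().cloned().collect();
    classes
        .into_iter()
        .map(|class| {
            let mask: Vec<bool> = targets.iter().map(|t| *t == class).collect();
            (class, mask)
        })
        .collect()
}
```

Each returned pair can then serve as the target of one binary classifier, so a problem with N classes yields N binary problems.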

Calculates label frequencies from a dataset while masking certain samples.

Parameters
  • mask: a boolean array that specifies which samples to include in the count
Returns

A mapping of the Dataset’s labels to their frequencies
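
A masked frequency count can be sketched as follows. The function name and signature are hypothetical, and summing per-sample weights (with 1.0 as the fallback when no weight is stored) is an assumption made for this illustration.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Sum the weights of all samples per label, skipping samples whose mask
/// entry is `false`. A missing weight counts as 1.0.
fn label_frequencies_with_mask<L>(
    labels: &[L],
    weights: &[f32],
    mask: &[bool],
) -> HashMap<L, f32>
where
    L: Hash + Eq + Clone,
{
    let mut freqs = HashMap::new();
    for (idx, (label, keep)) in labels.iter().zip(mask).enumerate() {
        if *keep {
            let weight = weights.get(idx).copied().unwrap_or(1.0);
            *freqs.entry(label.clone()).or_insert(0.0) += weight;
        }
    }
    freqs
}
```

Labels that only occur in masked-out samples do not appear in the resulting map at all.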

Calculates label frequencies from a dataset

Apply bootstrapping for samples and features

Bootstrap aggregating is used for sub-sample generation and can improve the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. This selects a subset of observations as well as features.

Parameters
  • sample_feature_size: The number of samples and features per bootstrap
  • rng: The random number generator used in the sampling procedure
Returns

An infinite Iterator yielding at each step a new bootstrapped dataset

Apply sample bootstrapping

Bootstrap aggregating is used for sub-sample generation and can improve the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a sample subset is selected which retains all features and targets.

Parameters
  • num_samples: The number of samples per bootstrap
  • rng: The random number generator used in the sampling procedure
Returns

An infinite Iterator yielding at each step a new bootstrapped dataset
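
Sampling with replacement and the "infinite iterator" shape can be sketched with the standard library. Everything named here (`XorShift64`, `bootstrap_indices`, the `seed` parameter) is hypothetical, and the sketch yields row indices instead of datasets; the indices would then be used to select rows.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate.
struct XorShift64(u64);

impl XorShift64 {
    /// Pseudo-random index in `0..bound` (modulo bias is ignored here).
    fn below(&mut self, bound: usize) -> usize {
        let mut s = self.0;
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        self.0 = s;
        (s % bound as u64) as usize
    }
}

/// Endless iterator of bootstrap index sets: each set holds `num_samples`
/// indices drawn with replacement from `0..n_samples`, so an index may
/// occur several times within a set or not at all.
fn bootstrap_indices(
    n_samples: usize,
    num_samples: usize,
    seed: u64,
) -> impl Iterator<Item = Vec<usize>> {
    assert!(n_samples > 0, "cannot sample from an empty dataset");
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    std::iter::repeat_with(move || {
        (0..num_samples)
            .map(|_| rng.below(n_samples))
            .collect::<Vec<usize>>()
    })
}
```

Because the iterator never ends, callers are expected to bound it themselves, for example with `take(n)`.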

Apply feature bootstrapping

Bootstrap aggregating is used for sub-sample generation and can improve the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a feature subset is selected while retaining all samples and targets.

Parameters
  • num_features: The number of features per bootstrap
  • rng: The random number generator used in the sampling procedure
Returns

An infinite Iterator yielding at each step a new bootstrapped dataset

Produces a shuffled version of the current Dataset.

Parameters
  • rng: the random number generator that will be used to shuffle the samples
Returns

A new shuffled version of the current Dataset

Performs K-folding on the dataset.

The dataset is divided into k “folds”, each containing (dataset size)/k samples, used to generate k training-validation dataset pairs. Each pair contains a validation Dataset with (dataset size)/k samples, the ones contained in the i-th fold, and a training Dataset composed of the union of all the samples in the remaining folds.

Parameters
  • k: the number of folds to apply
Returns

A vector of k training-validation Dataset pairs.

Example
use linfa::dataset::DatasetView;
use ndarray::{Ix1, array};

let records = array![[1.,1.], [2.,1.], [3.,2.], [4.,1.],[5., 3.], [6.,2.]];
let targets = array![1, 1, 0, 1, 0, 0];

let dataset : DatasetView<f64, usize, Ix1> = (records.view(), targets.view()).into();
let accuracies = dataset.fold(3).into_iter().map(|(train, valid)| {
    // Here you can train your model on `train` and validate it on `valid`,
    // for example:
    //
    // let model = params.fit(&train);
    // let predi = model.predict(&valid);
    // predi.confusion_matrix(&valid).accuracy()
});
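
How the folds partition the samples can be sketched on indices alone. The helper `fold_indices` is hypothetical, and the handling of leftover samples when the dataset size is not divisible by k (here they are always placed in the training set) is an assumption of this sketch.

```rust
/// Build `k` (training, validation) index pairs. Fold `i` validates on the
/// `i`-th block of `n_samples / k` consecutive indices and trains on all
/// remaining indices.
fn fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k > 0 && k <= n_samples, "k must lie in 1..=n_samples");
    let fold_size = n_samples / k;
    (0..k)
        .map(|i| {
            let start = i * fold_size;
            let end = start + fold_size;
            let valid: Vec<usize> = (start..end).collect();
            // Everything before and after the validation block is training data.
            let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
            (train, valid)
        })
        .collect()
}
```

For six samples and k = 3, every validation set has two indices and every training set has the other four.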

Performs k-folding cross validation on fittable algorithms.

Given a dataset as input, a value of k and the desired params for the fittable algorithm, returns an iterator over the k trained models and the associated validation set.

The models are trained according to a closure specified as an input.

Parameters
  • k: the number of folds to apply to the dataset
  • params: the desired parameters for the fittable algorithm at hand
  • fit_closure: a closure of the type (params, training_data) -> fitted_model that will be used to produce the trained model for each fold. The training data given in input won’t outlive the closure.
Returns

An iterator over pairs (trained_model, validation_set).

Panics

This method will panic for any of the following three reasons:

  • The value of k provided is not positive;
  • The value of k provided is greater than the total number of samples in the dataset;
  • The dataset’s data is not stored contiguously and in standard order;
Example
use linfa::traits::Fit;
use linfa::dataset::{Dataset, DatasetView, Records};
use ndarray::{array, ArrayView1, ArrayView2, Ix1};
use linfa::Error;

struct MockFittable {}

struct MockFittableResult {
   mock_var: usize,
}

impl<'a> Fit<ArrayView2<'a,f64>, ArrayView1<'a, f64>, linfa::error::Error> for MockFittable {
    type Object = MockFittableResult;

    fn fit(&self, training_data: &DatasetView<f64, f64, Ix1>) -> Result<Self::Object, linfa::error::Error> {
        Ok(MockFittableResult {
            mock_var: training_data.nsamples(),
        })
    }
}

let records = array![[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]];
let targets = array![1.,2.,3.,4.,5.];
let mut dataset: Dataset<f64, f64, Ix1> = (records, targets).into();
let params = MockFittable {};

for (model,validation_set) in dataset.iter_fold(5, |v| params.fit(v).unwrap()){
    // Here you can use `model` and `validation_set` to
    // assess the performance of the chosen algorithm
}

Cross validation for single and multi-target algorithms

Given a list of fittable models, cross validation is used to compare their performance according to some performance metric. To do so, k-folding is applied to the dataset and, for each fold, each model is trained on the training set and its performance is evaluated on the validation set. The performances collected for each model are then averaged over the number of folds.

For single-target datasets, Dataset::cross_validate_single is recommended.

Parameters
  • k: the number of folds to apply
  • parameters: a list of models to compare
  • eval: closure used to evaluate the performance of each trained model. This closure is called on the model output and validation targets of each fold and outputs the performance score for each target. For single-target dataset the signature is (Array1, Array1) -> Array0. For multi-target dataset the signature is (Array2, Array2) -> Array1.
Returns

An array of model performances, for each model and each target, if no errors occur. For multi-target dataset, the array has dimensions (n_models, n_targets). For single-target dataset, the array has dimensions (n_models). Otherwise, it might return an Error in one of the following cases:

  • An error occurred during the fitting of one model
  • An error occurred inside the evaluation closure
Example

use linfa::prelude::*;
use ndarray::arr0;

// mutability needed for fast cross validation
let mut dataset = linfa_datasets::diabetes();

let models = vec![model1, model2];

let r2_scores = dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(truth).map(arr0))?;
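
The fold-then-average procedure can be sketched in a heavily simplified form. The sketch below is not the crate's API: `cross_validate_mean` is a hypothetical helper that works on a bare slice of targets, handles a single model, and ignores any samples left over when the length is not divisible by k.

```rust
/// Average validation score of one model over `k` folds.
/// `fit` trains a model on the training targets and `eval` scores the
/// model on the validation targets of the same fold.
fn cross_validate_mean<M>(
    targets: &[f64],
    k: usize,
    fit: impl Fn(&[f64]) -> M,
    eval: impl Fn(&M, &[f64]) -> f64,
) -> f64 {
    assert!(k > 0 && k <= targets.len(), "k must lie in 1..=targets.len()");
    let fold_size = targets.len() / k;
    let mut total = 0.0;
    for i in 0..k {
        let (start, end) = (i * fold_size, (i + 1) * fold_size);
        let valid = &targets[start..end];
        // The training data is everything outside the validation block.
        let train: Vec<f64> = targets[..start]
            .iter()
            .chain(&targets[end..])
            .copied()
            .collect();
        let model = fit(&train);
        total += eval(&model, valid);
    }
    total / k as f64
}
```

As a usage idea, `fit` could return the mean of the training targets and `eval` the mean squared error of that constant prediction; comparing several models then amounts to calling the helper once per model with the same k.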

Specialized version of cross_validate for single-target datasets. It allows the evaluation closure to return a float without wrapping it in arr0. See Dataset::cross_validate for more details.

Split dataset into two disjoint chunks

This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated with the ratio. If the input Dataset contains n samples, then the two new Datasets will have n * ratio and n - (n * ratio) samples, respectively. For example, a ratio of 0.9 allocates 90% to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.

Parameters
  • ratio: the ratio of samples in the input Dataset to include in the first output one
Returns

The input Dataset split into two according to the input ratio.

Panics

Panics if the input records or targets are not in row-major layout.
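
The n * ratio arithmetic can be sketched on a plain slice. The helper `split_with_ratio` is hypothetical, and rounding n * ratio up to the next integer is an assumption of this sketch, since the text above does not state the rounding mode.

```rust
/// Split `samples` into two disjoint chunks, the first holding roughly
/// `ratio` of all samples.
fn split_with_ratio<T: Clone>(samples: &[T], ratio: f32) -> (Vec<T>, Vec<T>) {
    assert!((0.0..=1.0).contains(&ratio), "ratio must lie in [0, 1]");
    // Size of the first chunk; rounding up is an assumption of this sketch.
    let n_first = ((samples.len() as f32 * ratio).ceil() as usize).min(samples.len());
    let (first, second) = samples.split_at(n_first);
    (first.to_vec(), second.to_vec())
}
```

With ten samples and a ratio of 0.9, the first chunk receives nine samples and the second receives one.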

Transforms the input dataset by keeping only those samples whose label appears in labels.

In the multi-target case a sample is kept if any of its targets appears in labels.

Sample weights and feature names are preserved by this transformation.
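
The "any target matches" rule for the multi-target case can be sketched as an index filter. The helper `keep_with_labels` is hypothetical, and representing each sample's targets as a `Vec` is an assumption made to keep the sketch free of external crates.

```rust
/// Indices of the samples to keep: a sample is kept if at least one of its
/// targets appears in `labels`.
fn keep_with_labels<L: PartialEq>(targets: &[Vec<L>], labels: &[L]) -> Vec<usize> {
    targets
        .iter()
        .enumerate()
        .filter(|(_, row)| row.iter().any(|t| labels.contains(t)))
        .map(|(idx, _)| idx)
        .collect()
}
```

The returned indices can then be used to select the matching records, targets and sample weights together, which keeps them aligned.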

Trait Implementations

Log loss of the probabilities of the binary target

Returns a copy of the value.

Performs copy-assignment from source.

Formats the value using the given formatter.

Converts to this type from the input type.

Maximal error between two continuous variables

Mean error between two continuous variables

Mean squared error between two continuous variables

Mean squared log error between two continuous variables

Median absolute error between two continuous variables

R squared coefficient: the proportion of the variance in the dependent variable that is predictable from the independent variable.

Same as R-Squared but with biased variance
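
The R squared coefficient mentioned above is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below illustrates that textbook formula on plain slices; the free function `r2` is hypothetical and assumes equally long inputs and a non-constant `truth`.

```rust
/// Coefficient of determination: 1 - SS_res / SS_tot.
/// Assumes `prediction` and `truth` have the same length and that `truth`
/// is not constant (otherwise SS_tot is zero).
fn r2(prediction: &[f64], truth: &[f64]) -> f64 {
    let mean = truth.iter().sum::<f64>() / truth.len() as f64;
    // Squared error left unexplained by the prediction.
    let ss_res: f64 = truth
        .iter()
        .zip(prediction)
        .map(|(t, p)| (t - p).powi(2))
        .sum();
    // Squared deviation of the ground truth from its own mean.
    let ss_tot: f64 = truth.iter().map(|t| (t - mean).powi(2)).sum();
    1.0 - ss_res / ss_tot
}
```

A perfect prediction scores 1.0, and always predicting the mean of the ground truth scores 0.0.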

This method tests for self and other values to be equal, and is used by ==.

This method tests for !=.

Implement records for a DatasetBase

Evaluates the quality of a clustering.
