Struct linfa::dataset::DatasetBase

pub struct DatasetBase<R, T> where
    R: Records,
{
    pub records: R,
    pub targets: T,
    pub weights: Array1<f32>,
    // some fields omitted
}

DatasetBase

This is the fundamental structure of a dataset. It contains a number of records about the data and may contain targets, weights and feature names. In order to keep the type complexity low the dataset base is only generic over the records and targets and introduces a trait bound on the records. weights and feature_names, on the other hand, are always assumed to be owned and copied when views are created.

Fields

  • records: a two-dimensional matrix with dimensionality (nsamples, nfeatures); for kernel methods, a square matrix with dimensionality (nsamples, nsamples), which may be sparse
  • targets: a one- or two-dimensional matrix with dimensionality (nsamples) or (nsamples, ntargets)
  • weights: optional weights for each sample with dimensionality (nsamples)
  • feature_names: optional descriptive feature names with dimensionality (nfeatures)

Trait bounds

  • R: Records: generic over feature matrices or kernel matrices
  • T: generic over any ndarray matrix which can be used as targets. The AsTargets trait bound is omitted here to avoid repetition in the implementation (src/dataset/impl_dataset.rs)

Fields

records: R
targets: T
weights: Array1<f32>

Implementations

Calculate the Pearson Correlation Coefficients from a dataset

The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations, and is therefore essentially a normalised measurement of the covariance in the range [-1, 1]. A negative coefficient indicates a negative correlation between both variables.

Example
let corr = linfa_datasets::diabetes()
    .pearson_correlation();

println!("{}", corr);
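The definition above (covariance divided by the product of the standard deviations) can be sketched for two plain slices as follows. This is an illustrative, standard-library-only sketch and not linfa's implementation; the function name `pearson` and the choice to return `None` for degenerate input are assumptions made for this example.

```rust
/// Pearson correlation coefficient of two equally long slices.
/// Returns `None` if the lengths differ, fewer than two values are given,
/// or one of the variables has zero variance (hypothetical error handling).
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    let mut cov = 0.0;
    let mut var_x = 0.0;
    let mut var_y = 0.0;
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    // The 1/n factors of the covariance and the standard deviations cancel.
    Some(cov / (var_x.sqrt() * var_y.sqrt()))
}
```

Two variables related by a positive linear function yield a coefficient of 1, and reversing one of them yields -1.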

Calculate the Pearson Correlation Coefficients and p-values from the dataset

The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations, and is therefore essentially a normalised measurement of the covariance in the range [-1, 1]. A negative coefficient indicates a negative correlation between both variables.

The p-value supports or rejects the null hypothesis that two variables are not correlated. The smaller the p-value, the stronger the evidence that the two variables are correlated. A typical threshold is p < 0.05.

Parameters
  • num_iter: number of iterations of the permutation test to estimate the p-value
Example
let corr = linfa_datasets::diabetes()
    .pearson_correlation_with_p_value(100);

println!("{}", corr);
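A permutation test of the kind described above can be sketched as follows: one variable is shuffled num_iter times, and the p-value is estimated as the fraction of shuffles whose absolute correlation is at least as large as the observed one. This is a standard-library-only illustration, not linfa's code; the `XorShift` generator, the `seed` parameter and the two-sided comparison are assumptions of this sketch.

```rust
/// Tiny xorshift generator so the sketch needs no external crate (hypothetical helper).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Pearson correlation of two equally long, non-constant slices.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let (mx, my) = (x.iter().sum::<f64>() / n, y.iter().sum::<f64>() / n);
    let cov: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    let vx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
    let vy: f64 = y.iter().map(|b| (b - my).powi(2)).sum();
    cov / (vx.sqrt() * vy.sqrt())
}

/// Estimates a two-sided p-value: the fraction of random permutations of `y`
/// whose absolute correlation with `x` is at least the observed one.
fn permutation_p_value(x: &[f64], y: &[f64], num_iter: usize, seed: u64) -> f64 {
    let observed = pearson(x, y).abs();
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut shuffled = y.to_vec();
    let mut hits = 0usize;
    for _ in 0..num_iter {
        // Fisher-Yates shuffle of the second variable.
        for i in (1..shuffled.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            shuffled.swap(i, j);
        }
        if pearson(x, &shuffled).abs() >= observed {
            hits += 1;
        }
    }
    hits as f64 / num_iter as f64
}
```

More iterations give a finer-grained estimate, since the smallest non-zero p-value this scheme can report is 1 / num_iter.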

Implementation without constraints on records and targets

This implementation block provides methods for the creation and mutation of datasets. This includes swapping the targets, returning the records, etc.

Create a new dataset from records and targets

Example
let dataset = Dataset::new(records, targets);

Returns reference to targets

Returns the weights, if any

Return a single weight

The weight of the idx-th observation is returned. If no weight is specified, all observations are treated as unweighted and the default value 1.0 is returned.
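The fallback behaviour described above can be sketched on a plain slice. This is an illustration under the assumption that missing weights are represented by an empty (or too short) slice; it is not taken from the library source.

```rust
/// Returns the weight of observation `idx`, or 1.0 when no weight is stored for it.
fn weight_for(weights: &[f32], idx: usize) -> f32 {
    weights.get(idx).copied().unwrap_or(1.0)
}
```

With this convention an unweighted dataset behaves exactly like one in which every observation has weight 1.0.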

Returns feature names

A feature name gives a human-readable string describing the purpose of a single feature. This allows the reader to understand its purpose while analysing results, for example in correlation analysis or feature importance.

Return records of a dataset

The records are data points from which predictions are made. This function returns a reference to the records field.

Updates the records of a dataset

This function overwrites the records in a dataset. It also invalidates the weights and feature names.

Updates the targets of a dataset

This function overwrites the targets in a dataset.

Updates the weights of a dataset

Updates the feature names of a dataset

Map targets with a function f

Example
let dataset = linfa_datasets::winequality()
    .map_targets(|x| *x > 6);

// dataset has now boolean targets
println!("{:?}", dataset.targets());
Returns

A modified dataset with new target type.

Return the number of targets in the dataset

Example
let dataset = linfa_datasets::winequality();

println!("#targets {}", dataset.ntargets());

Iterate over observations

This function creates an iterator which produces tuples of data points and target values. The iterator runs once for each data point and, while doing so, holds a reference to the owned dataset.

Example
let dataset = linfa_datasets::iris();

for (x, y) in dataset.sample_iter() {
    println!("{} => {}", x, y);
}

Creates a view of a dataset

Iterate over features

This iterator produces dataset views with only a single feature, while the set of targets remains complete. It can be useful for comparing each feature individually to all targets.

Iterate over targets

This function creates an iterator which produces dataset views with complete records, but only a single target each. This is useful for training multiple single-target models on a multi-target dataset.

Split dataset into two disjoint chunks

This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated with the ratio. For example, a ratio of 0.9 allocates 90% to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.

Produce N boolean targets from multi-class targets

Some algorithms (like SVM) don’t support multi-class targets. This function splits a dataset into multiple binary target views of the same dataset.
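The one-vs-all idea above can be sketched on a plain slice of labels: for every distinct class, a boolean target vector marks the samples belonging to that class. The function name `one_vs_all` is reused here for illustration only; the signature and the sorted label order are assumptions of this sketch, and the records (which stay unchanged) are left out.

```rust
/// For every distinct label, builds a boolean target vector that is `true`
/// where the sample carries that label. Labels are returned in sorted order.
fn one_vs_all<L: Ord + Clone>(targets: &[L]) -> Vec<(L, Vec<bool>)> {
    let mut labels: Vec<L> = targets.to_vec();
    labels.sort();
    labels.dedup();
    labels
        .into_iter()
        .map(|label| {
            let mask = targets.iter().map(|t| *t == label).collect();
            (label, mask)
        })
        .collect()
}
```

A dataset with N classes therefore yields N binary problems, each of which a binary classifier can handle.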

Calculates label frequencies from a dataset while masking certain samples.

Parameters
  • mask: a boolean array that specifies which samples to include in the count
Returns

A mapping of the Dataset’s labels to their frequencies
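A masked frequency count of this kind can be sketched with a `HashMap`. This illustration counts each included sample once; treating every sample as having unit weight, the string labels and the `usize` counts are all assumptions of the sketch rather than the library's types.

```rust
use std::collections::HashMap;

/// Counts, per label, the samples whose mask entry is `true`.
fn label_frequencies_with<'a>(targets: &[&'a str], mask: &[bool]) -> HashMap<&'a str, usize> {
    let mut freqs = HashMap::new();
    for (label, keep) in targets.iter().zip(mask) {
        if *keep {
            *freqs.entry(*label).or_insert(0) += 1;
        }
    }
    freqs
}
```

Labels that only occur in masked-out samples do not appear in the resulting map at all.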

Calculates label frequencies from a dataset

Apply bootstrapping for samples and features

Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. This selects a subset of observations as well as features.

Parameters
  • sample_feature_size: The number of samples and features per bootstrap
  • rng: The random number generator used in the sampling procedure
Returns

An infinite Iterator yielding at each step a new bootstrapped dataset

Apply sample bootstrapping

Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a sample subset is selected which retains all features and targets.

Parameters
  • num_samples: The number of samples per bootstrap
  • rng: The random number generator used in the sampling procedure
Returns

An infinite Iterator yielding at each step a new bootstrapped dataset
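Sampling with replacement into an endless stream of datasets can be sketched as below. This uses only the standard library; the `XorShift` generator stands in for the rng parameter and, like the slice-based signature, is a hypothetical choice for this example. The sketch panics if `rows` is empty.

```rust
/// Tiny xorshift generator so the sketch needs no external crate (hypothetical helper).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Infinite iterator of bootstrapped datasets: each item holds `num_samples`
/// rows drawn uniformly with replacement from `rows`.
fn bootstrap_samples<'a, T: Clone>(
    rows: &'a [T],
    num_samples: usize,
    mut rng: XorShift,
) -> impl Iterator<Item = Vec<T>> + 'a {
    std::iter::repeat_with(move || {
        (0..num_samples)
            .map(|_| rows[(rng.next_u64() % rows.len() as u64) as usize].clone())
            .collect()
    })
}
```

Because the iterator never ends, callers bound it themselves, for example with `.take(n)`.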

Apply feature bootstrapping

Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a feature subset is selected while retaining all samples and targets.

Parameters
  • num_features: The number of features per bootstrap
  • rng: The random number generator used in the sampling procedure
Returns

An infinite Iterator yielding at each step a new bootstrapped dataset

Produces a shuffled version of the current Dataset.

Parameters
  • rng: the random number generator that will be used to shuffle the samples
Returns

A new shuffled version of the current Dataset

Performs K-folding on the dataset. The dataset is divided into k folds, each containing (dataset size)/k samples, which are used to generate k training-validation dataset pairs. Each pair contains a validation Dataset with the (dataset size)/k samples of the i-th fold, and a training Dataset composed of the union of all the samples in the remaining folds.

Parameters
  • k: the number of folds to apply
Returns

A vector of k training-validation Dataset pairs.

Example
use linfa::dataset::DatasetView;
use ndarray::array;

let records = array![[1.,1.], [2.,1.], [3.,2.], [4.,1.],[5., 3.], [6.,2.]];
let targets = array![1, 1, 0, 1, 0, 0];

let dataset : DatasetView<f64, usize> = (records.view(), targets.view()).into();
let accuracies = dataset.fold(3).into_iter().map(|(train, valid)| {
    // Here you can train your model and perform validation
     
    // let model = params.fit(&train);
    // let predi = model.predict(&valid);
    // predi.confusion_matrix(&valid).accuracy()  
});
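How the k training-validation pairs are formed can be sketched on a plain slice. This is a standard-library illustration of the scheme, not the library's implementation; in particular, dropping leftover samples when the dataset size is not a multiple of k is an assumption of this sketch.

```rust
/// Splits `samples` into `k` folds of `samples.len() / k` items each and
/// returns `k` (training, validation) pairs. Leftover samples that do not
/// fill a whole fold are dropped in this sketch.
fn fold<T: Clone>(samples: &[T], k: usize) -> Vec<(Vec<T>, Vec<T>)> {
    assert!(k > 0 && k <= samples.len(), "k must be in 1..=nsamples");
    let fold_size = samples.len() / k;
    let used = &samples[..fold_size * k];
    (0..k)
        .map(|i| {
            let (start, end) = (i * fold_size, (i + 1) * fold_size);
            // The i-th fold is the validation set ...
            let valid = used[start..end].to_vec();
            // ... and everything before and after it is the training set.
            let train = [&used[..start], &used[end..]].concat();
            (train, valid)
        })
        .collect()
}
```

With six samples and k = 3, each validation set holds two samples and each training set the remaining four.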

Performs k-fold cross-validation on fittable algorithms.

Given a dataset, a value of k and the desired params for the fittable algorithm, it returns an iterator over the k trained models and the associated validation sets.

The models are trained according to a closure specified as an input.

Parameters
  • k: the number of folds to apply to the dataset
  • params: the desired parameters for the fittable algorithm at hand
  • fit_closure: a closure of the type (params, training_data) -> fitted_model that will be used to produce the trained model for each fold. The training data given in input won’t outlive the closure.
Returns

An iterator over couples (trained_model, validation_set).

Panics

This method will panic for any of the following three reasons:

  • The value of k provided is not positive;
  • The value of k provided is greater than the total number of samples in the dataset;
  • The dataset’s data is not stored contiguously and in standard order;
Example
use linfa::traits::Fit;
use linfa::dataset::{Dataset, DatasetView, Records};
use ndarray::{array, ArrayView1, ArrayView2};
use linfa::Error;

struct MockFittable {}

struct MockFittableResult {
   mock_var: usize,
}


impl<'a> Fit<ArrayView2<'a,f64>, ArrayView2<'a, f64>, linfa::error::Error> for MockFittable {
    type Object = MockFittableResult;

    fn fit(&self, training_data: &DatasetView<f64, f64>) -> Result<Self::Object, linfa::error::Error> {
        Ok(MockFittableResult {
            mock_var: training_data.nsamples(),
        })
    }
}

let records = array![[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]];
let targets = array![1.,2.,3.,4.,5.];
let mut dataset: Dataset<f64, f64> = (records, targets).into();
let params = MockFittable {};

for (model,validation_set) in dataset.iter_fold(5, |v| params.fit(&v).unwrap()){
    // Here you can use `model` and `validation_set` to
    // assert the performance of the chosen algorithm
}

Cross validation for multi-target algorithms

Given a list of fittable models, cross validation is used to compare their performance according to some performance metric. To do so, k-folding is applied to the dataset and, for each fold, each model is trained on the training set and its performance is evaluated on the validation set. The performances collected for each model are then averaged over the number of folds.

Parameters:
  • k: the number of folds to apply
  • parameters: a list of models to compare
  • eval: closure used to evaluate the performance of each trained model
Returns

An array of model performances, in the same order as the models in input, if no errors occur. The performance of each model is given as an array of performances, one for each target. Otherwise, it might return an Error in one of the following cases:

  • An error occurred during the fitting of one model
  • An error occurred inside the evaluation closure
Example

use linfa::prelude::*;

// mutability needed for fast cross validation
let mut dataset = linfa_datasets::diabetes();

let models = vec![model1, model2, ... ];

let r2_scores = dataset.cross_validate_multi(5,&models, |prediction, truth| prediction.r2(truth))?;

Cross validation for single target algorithms

Given a list of fittable models, cross validation is used to compare their performance according to some performance metric. To do so, k-folding is applied to the dataset and, for each fold, each model is trained on the training set and its performance is evaluated on the validation set. The performances collected for each model are then averaged over the number of folds.

Parameters:
  • k: the number of folds to apply
  • parameters: a list of models to compare
  • eval: closure used to evaluate the performance of each trained model. For single target datasets, this closure is called once for each fold. For multi-target datasets the closure is called, in each fold, once for every different target. If there is the need to use different evaluations for each target, take a look at the cross_validate_multi method.
Returns

On successful evaluation it returns an array of model performances, in the same order as the models in input.

It returns an Error in one of the following cases:

  • An error occurred during the fitting of one model
  • An error occurred inside the evaluation closure
Example

use linfa::prelude::*;

// mutability needed for fast cross validation
let mut dataset = linfa_datasets::diabetes();

let models = vec![model1, model2, ... ];

let r2_scores = dataset.cross_validate(5,&models, |prediction, truth| prediction.r2(truth))?;
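The procedure described above (fit every model on each training fold, score it on the validation fold, then average over the folds) can be sketched without any crate. Everything here is a simplification made for illustration: a "model" is a plain function that turns training targets into one constant prediction, `mean_model` and `zero_model` are hypothetical examples, and the error handling of the real method is left out.

```rust
/// Averages, for every model, an evaluation score over `k` folds.
/// `eval` scores a constant prediction against the validation targets.
fn cross_validate(
    k: usize,
    targets: &[f64],
    models: &[fn(&[f64]) -> f64],
    eval: impl Fn(f64, &[f64]) -> f64,
) -> Vec<f64> {
    assert!(k > 0 && k <= targets.len(), "k must be in 1..=nsamples");
    let fold_size = targets.len() / k;
    let mut scores = vec![0.0; models.len()];
    for i in 0..k {
        let (start, end) = (i * fold_size, (i + 1) * fold_size);
        let valid = &targets[start..end];
        let train = [&targets[..start], &targets[end..]].concat();
        for (score, model) in scores.iter_mut().zip(models) {
            let prediction = model(train.as_slice());
            // Accumulate the mean over folds.
            *score += eval(prediction, valid) / k as f64;
        }
    }
    scores
}

/// Hypothetical model: always predicts the mean of the training targets.
fn mean_model(train: &[f64]) -> f64 {
    train.iter().sum::<f64>() / train.len() as f64
}

/// Hypothetical model: always predicts zero.
fn zero_model(_train: &[f64]) -> f64 {
    0.0
}
```

The returned vector keeps the order of `models`, so the scores can be compared position by position.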

Split dataset into two disjoint chunks

This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated with the ratio. If the input Dataset contains n samples, then the two new Datasets will have n * ratio and n - (n * ratio) samples respectively. For example, a ratio of 0.9 allocates 90% to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.

Parameters
  • ratio: the ratio of samples in the input Dataset to include in the first output one
Returns

The input Dataset split into two according to the input ratio.
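The size computation above can be sketched on a slice. Since n * ratio is generally not an integer, some rounding rule is needed; rounding to the nearest integer is an assumption of this sketch and may differ from what the library does.

```rust
/// Splits `samples` so that the first chunk holds about `ratio` of them.
fn split_with_ratio<T>(samples: &[T], ratio: f64) -> (&[T], &[T]) {
    assert!((0.0..=1.0).contains(&ratio), "ratio must lie in [0, 1]");
    // Assumption: the split index n * ratio is rounded to the nearest integer.
    let n_first = (samples.len() as f64 * ratio).round() as usize;
    samples.split_at(n_first)
}
```

For ten samples and a ratio of 0.9 this gives chunks of nine and one sample.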

Transforms the input dataset by keeping only those samples whose label appears in labels.

In the multi-target case a sample is kept if any of its targets appears in labels.

Sample weights and feature names are preserved by this transformation.
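For the single-target case, the filtering can be sketched on parallel slices of records and targets. The generic signature below is hypothetical, and the multi-target rule and the handling of weights and feature names are omitted for brevity.

```rust
/// Keeps the (record, target) pairs whose target is listed in `labels`.
fn with_labels<R: Clone, L: PartialEq + Clone>(
    records: &[R],
    targets: &[L],
    labels: &[L],
) -> (Vec<R>, Vec<L>) {
    records
        .iter()
        .zip(targets)
        .filter(|(_, t)| labels.contains(t))
        .map(|(r, t)| (r.clone(), t.clone()))
        .unzip()
}
```

The relative order of the remaining samples is unchanged.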

Trait Implementations

Returns a view on targets as two-dimensional array

Convert to single target, fails for more than one target

Returns a mutable view on targets as two-dimensional array

Convert to single target, fails for more than one target

Performs the conversion.

Performs the conversion.

Performs the conversion.

Maximal error between two continuous variables

Mean error between two continuous variables

Mean squared error between two continuous variables

Mean squared log error between two continuous variables

Median absolute error between two continuous variables

R squared coefficient: the proportion of the variance in the dependent variable that is predictable from the independent variable

Same as R-Squared but with biased variance
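The R squared coefficient mentioned above is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of that textbook formula on plain slices; it is not taken from the library, and the argument order is an assumption.

```rust
/// Coefficient of determination: 1 - SS_res / SS_tot.
/// Assumes equally long, non-empty slices and non-constant `truth`.
fn r2(prediction: &[f64], truth: &[f64]) -> f64 {
    let mean = truth.iter().sum::<f64>() / truth.len() as f64;
    // Residual sum of squares: squared errors of the prediction.
    let ss_res: f64 = truth.iter().zip(prediction).map(|(t, p)| (t - p).powi(2)).sum();
    // Total sum of squares: squared deviations from the mean of the truth.
    let ss_tot: f64 = truth.iter().map(|t| (t - mean).powi(2)).sum();
    1.0 - ss_res / ss_tot
}
```

A perfect prediction scores 1, and always predicting the mean of the true values scores 0.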

Implement records for a DatasetBase

Evaluates the quality of a clustering.
