Struct linfa::dataset::DatasetBase[src]

pub struct DatasetBase<R, T> where
    R: Records,
{
    pub records: R,
    pub targets: T,
    // some fields omitted
}

DatasetBase

This is the fundamental structure of a dataset. It contains a number of records about the data and may contain targets, weights and feature names. To keep the type complexity low, DatasetBase is only generic over the records and targets and introduces a trait bound on the records. weights and feature_names, on the other hand, are always assumed to be owned and are copied when views are created.

Fields

  • records: a two-dimensional matrix with dimensionality (nsamples, nfeatures), in case of kernel methods a quadratic matrix with dimensionality (nsamples, nsamples), which may be sparse
  • targets: a two-/one-dimensional matrix with dimensionality (nsamples, ntargets)
  • weights: optional weights for each sample with dimensionality (nsamples)
  • feature_names: optional descriptive feature names with dimensionality (nfeatures)

Trait bounds

  • R: Records: generic over feature matrices or kernel matrices
  • T: generic over any ndarray matrix which can be used as targets. The AsTargets trait bound is omitted here to avoid repetition in the implementation (src/dataset/impl_dataset.rs)

Fields

records: R

targets: T

Implementations

impl<F: Float, D: Data<Elem = F>, T> DatasetBase<ArrayBase<D, Ix2>, T>[src]

pub fn pearson_correlation(&self) -> PearsonCorrelation<F>[src]

Calculate the Pearson Correlation Coefficients from a dataset

The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations and therefore essentially a normalised measurement of the covariance, lying in the range [-1, 1]. A negative coefficient indicates a negative correlation between both variables.

Example

let corr = linfa_datasets::diabetes()
    .pearson_correlation();

println!("{}", corr);

pub fn pearson_correlation_with_p_value(
    &self,
    num_iter: usize
) -> PearsonCorrelation<F>
[src]

Calculate the Pearson Correlation Coefficients and p-values from the dataset

The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations and therefore essentially a normalised measurement of the covariance, lying in the range [-1, 1]. A negative coefficient indicates a negative correlation between both variables.

The p-value supports or rejects the null hypothesis that the two variables are not correlated. The smaller the p-value, the stronger the evidence that the two variables are correlated. A typical threshold is p < 0.05.

Parameters

  • num_iter: number of iterations of the permutation test to estimate the p-value

Example

let corr = linfa_datasets::diabetes()
    .pearson_correlation_with_p_value(100);

println!("{}", corr);

impl<R: Records, S> DatasetBase<R, S>[src]

Implementation without constraints on records and targets

This implementation block provides methods for the creation and mutation of datasets. This includes swapping the targets, returning the records, etc.

pub fn new<T: IntoTargets<S>>(records: R, targets: T) -> DatasetBase<R, S>[src]

Create a new dataset from records and targets

Example

let dataset = Dataset::new(records, targets);

pub fn targets(&self) -> &S[src]

Returns reference to targets

pub fn weights(&self) -> Option<&[f32]>[src]

Returns the optional weights

pub fn weight_for(&self, idx: usize) -> f32[src]

Return a single weight

The weight of the idx-th observation is returned. If no weight is specified, then all observations are unweighted with default value 1.0.
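
A minimal sketch of attaching and querying per-sample weights; the records, targets and weight values are made up for illustration:

use linfa::Dataset;
use ndarray::array;

// weights are stored as f32 values, one per sample
let dataset = Dataset::new(array![[1., 2.], [3., 4.]], array![0usize, 1])
    .with_weights(array![0.5f32, 2.0]);

assert_eq!(dataset.weight_for(0), 0.5);
// for a dataset without explicit weights, weight_for falls back to 1.0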

pub fn feature_names(&self) -> Vec<String>[src]

Returns feature names

A feature name gives a human-readable string describing the purpose of a single feature. This allows the reader to understand its purpose while analysing results, for example in a correlation analysis or feature importance ranking.

pub fn records(&self) -> &R[src]

Return records of a dataset

The records are data points from which predictions are made. This function returns a reference to the records field.

pub fn with_records<T: Records>(self, records: T) -> DatasetBase<T, S>[src]

Updates the records of a dataset

This function overwrites the records in a dataset. It also invalidates the weights and feature names.

pub fn with_targets<T>(self, targets: T) -> DatasetBase<R, T>[src]

Updates the targets of a dataset

This function overwrites the targets in a dataset.
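
A short sketch of replacing the targets of an existing dataset; the records stay untouched while the target type may change. The values here are illustrative only:

use linfa::Dataset;
use ndarray::array;

let dataset = Dataset::new(array![[1., 2.], [3., 4.]], array![0usize, 1]);

// swap the integer targets for floating point ones
let dataset = dataset.with_targets(array![0.5, 1.5]);
println!("{:?}", dataset.targets());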

pub fn with_weights(self, weights: Array1<f32>) -> DatasetBase<R, S>[src]

Updates the weights of a dataset

pub fn with_feature_names<I: Into<String>>(
    self,
    names: Vec<I>
) -> DatasetBase<R, S>
[src]

Updates the feature names of a dataset
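
A minimal sketch combining with_feature_names and feature_names; the feature names and values are made up:

use linfa::Dataset;
use ndarray::array;

let dataset = Dataset::new(array![[1., 2.], [3., 4.]], array![0usize, 1])
    .with_feature_names(vec!["height", "weight"]);

assert_eq!(dataset.feature_names(), vec!["height", "weight"]);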

impl<L, R: Records, T: AsTargets<Elem = L>> DatasetBase<R, T>[src]

pub fn map_targets<S, G: FnMut(&L) -> S>(
    self,
    fnc: G
) -> DatasetBase<R, Array2<S>>
[src]

Map targets with a function f

Example

let dataset = linfa_datasets::winequality()
    .map_targets(|x| *x > 6);

// dataset has now boolean targets
println!("{:?}", dataset.targets());

Returns

A modified dataset with the new target type.

pub fn ntargets(&self) -> usize[src]

Return the number of targets in the dataset

Example

let dataset = linfa_datasets::winequality();

println!("#targets {}", dataset.ntargets());

impl<'a, F: Float, L, D, T> DatasetBase<ArrayBase<D, Ix2>, T> where
    D: Data<Elem = F>,
    T: AsTargets<Elem = L>, 
[src]

pub fn sample_iter(&'a self) -> Iter<'a, '_, F, T::Elem>[src]

Iterate over observations

This function creates an iterator which produces tuples of data points and target values. The iterator runs once for each data point and, while doing so, holds a reference to the owned dataset.

Example

let dataset = linfa_datasets::iris();

for (x, y) in dataset.sample_iter() {
    println!("{} => {}", x, y);
}

impl<'a, F: Float, L: 'a, D, T> DatasetBase<ArrayBase<D, Ix2>, T> where
    D: Data<Elem = F>,
    T: AsTargets<Elem = L> + FromTargetArray<'a, L>, 
[src]

pub fn view(&'a self) -> DatasetBase<ArrayView2<'a, F>, T::View>[src]

Creates a view of a dataset
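
A short sketch, assuming the linfa_datasets helper used in the other examples; the view borrows records and targets without copying:

let dataset = linfa_datasets::iris();
let view = dataset.view();

// the view exposes the same accessors as the owned dataset
println!("{} samples, {} features", view.nsamples(), view.nfeatures());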

pub fn feature_iter(&'a self) -> DatasetIter<'a, '_, ArrayBase<D, Ix2>, T>[src]

Iterate over features

This iterator produces dataset views with only a single feature, while the set of targets remains complete. It can be useful to compare each feature individually against all targets.
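
A sketch of iterating over single-feature views, assuming the iris dataset from linfa_datasets and that each item behaves like any other dataset view:

let dataset = linfa_datasets::iris();

for feature in dataset.feature_iter() {
    // every item keeps all targets but exposes a single feature column
    println!("{} feature(s)", feature.nfeatures());
}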

pub fn target_iter(&'a self) -> DatasetIter<'a, '_, ArrayBase<D, Ix2>, T>[src]

Iterate over targets

This function creates an iterator which produces dataset views with complete records, but only a single target each. This is useful for training multiple single-target models on a multi-target dataset.
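
A sketch of walking over single-target views; `params.fit` stands in for any fittable algorithm and is not part of this API. Since linfa_datasets::iris() has a single target, this loop runs once, while a multi-target dataset would yield one view per target:

let dataset = linfa_datasets::iris();

for single_target in dataset.target_iter() {
    // each view keeps all records but exposes exactly one target column
    // let model = params.fit(&single_target);
    println!("{} target(s)", single_target.ntargets());
}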

impl<'a, L: 'a, F: Float, T> DatasetBase<ArrayView2<'a, F>, T> where
    T: AsTargets<Elem = L> + FromTargetArray<'a, L>, 
[src]

pub fn split_with_ratio(
    &'a self,
    ratio: f32
) -> (DatasetBase<ArrayView2<'a, F>, T::View>, DatasetBase<ArrayView2<'a, F>, T::View>)
[src]

Split dataset into two disjoint chunks

This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated from the ratio. For example, a ratio of 0.9 allocates 90% of the samples to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.

impl<'a, 'b: 'a, F: Float, L: Label, T, D> DatasetBase<ArrayBase<D, Ix2>, T> where
    D: Data<Elem = F>,
    T: AsTargets<Elem = L> + Labels<Elem = L>, 
[src]

pub fn one_vs_all(
    &self
) -> Result<Vec<DatasetBase<ArrayView2<'_, F>, CountedTargets<bool, Array2<bool>>>>>
[src]

Produce N boolean targets from multi-class targets

Some algorithms (like SVM) don’t support multi-class targets. This function splits a dataset into multiple binary-target views of the same dataset.
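
A sketch using the multi-class winequality dataset, assuming its integer labels satisfy the Labels bound; each returned view marks one class against the rest with boolean targets:

let dataset = linfa_datasets::winequality();

for binary in dataset.one_vs_all().unwrap() {
    // `binary` shares the records but carries boolean one-vs-rest targets
    println!("{} samples", binary.nsamples());
}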

impl<L: Label, R: Records, S: AsTargets<Elem = L>> DatasetBase<R, S>[src]

pub fn label_frequencies_with_mask(&self, mask: &[bool]) -> HashMap<L, f32>[src]

Calculates label frequencies from a dataset while masking certain samples.

Parameters

  • mask: a boolean array that specifies which samples to include in the count

Returns

A mapping of the Dataset’s labels to their frequencies

pub fn label_frequencies(&self) -> HashMap<L, f32>[src]

Calculates label frequencies from a dataset
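
A short sketch of inspecting the label distribution, again using the winequality dataset from linfa_datasets:

let dataset = linfa_datasets::winequality();

// maps every label to its (weighted) frequency
let freqs = dataset.label_frequencies();
println!("{:?}", freqs);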

impl<'b, F: Float, E: Copy + 'b, D, T> DatasetBase<ArrayBase<D, Ix2>, T> where
    D: Data<Elem = F>,
    T: AsTargets<Elem = E> + FromTargetArray<'b, E>,
    T::Owned: AsTargets
[src]

pub fn bootstrap<R: Rng>(
    &'b self,
    sample_feature_size: (usize, usize),
    rng: &'b mut R
) -> impl Iterator<Item = DatasetBase<Array2<F>, <T as FromTargetArray<'b, E>>::Owned>> + 'b
[src]

Apply bootstrapping for samples and features

Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. This selects a subset of observations as well as features.

Parameters

  • sample_feature_size: The number of samples and features per bootstrap
  • rng: The random number generator used in the sampling procedure

Returns

An infinite Iterator yielding at each step a new bootstrapped dataset
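
A sketch of drawing a few bootstrapped subsets. It assumes the rand crate’s thread_rng as the generator and uses take() because the iterator is infinite:

use rand::thread_rng;

let dataset = linfa_datasets::iris();
let mut rng = thread_rng();

// draw three datasets with 30 samples and 2 features each
for subset in dataset.bootstrap((30, 2), &mut rng).take(3) {
    assert_eq!(subset.nsamples(), 30);
    assert_eq!(subset.nfeatures(), 2);
}

bootstrap_samples and bootstrap_features below work the same way but resample along only one axis.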

pub fn bootstrap_samples<R: Rng>(
    &'b self,
    num_samples: usize,
    rng: &'b mut R
) -> impl Iterator<Item = DatasetBase<Array2<F>, <T as FromTargetArray<'b, E>>::Owned>> + 'b
[src]

Apply sample bootstrapping

Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a sample subset is selected which retains all features and targets.

Parameters

  • num_samples: The number of samples per bootstrap
  • rng: The random number generator used in the sampling procedure

Returns

An infinite Iterator yielding at each step a new bootstrapped dataset

pub fn bootstrap_features<R: Rng>(
    &'b self,
    num_features: usize,
    rng: &'b mut R
) -> impl Iterator<Item = DatasetBase<Array2<F>, <T as FromTargetArray<'b, E>>::Owned>> + 'b
[src]

Apply feature bootstrapping

Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a feature subset is selected while retaining all samples and targets.

Parameters

  • num_features: The number of features per bootstrap
  • rng: The random number generator used in the sampling procedure

Returns

An infinite Iterator yielding at each step a new bootstrapped dataset

pub fn shuffle<R: Rng>(&self, rng: &mut R) -> DatasetBase<Array2<F>, T::Owned>[src]

Produces a shuffled version of the current Dataset.

Parameters

  • rng: the random number generator that will be used to shuffle the samples

Returns

A new shuffled version of the current Dataset
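
A short sketch, again assuming rand’s thread_rng; the shuffled copy owns its data and leaves the original dataset untouched:

use rand::thread_rng;

let dataset = linfa_datasets::iris();
let mut rng = thread_rng();

let shuffled = dataset.shuffle(&mut rng);
assert_eq!(shuffled.nsamples(), dataset.nsamples());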

pub fn fold(
    &self,
    k: usize
) -> Vec<(DatasetBase<Array2<F>, T::Owned>, DatasetBase<Array2<F>, T::Owned>)>
[src]

Performs K-folding on the dataset. The dataset is divided into k folds, each containing (dataset size)/k samples, which are used to generate k training-validation dataset pairs. Each pair contains a validation Dataset made of the samples in the i-th fold, and a training Dataset composed of the union of all the samples in the remaining folds.

Parameters

  • k: the number of folds to apply

Returns

A vector of k training-validation Dataset pairs.

Example

use linfa::dataset::DatasetView;
use ndarray::array;

let records = array![[1.,1.], [2.,1.], [3.,2.], [4.,1.],[5., 3.], [6.,2.]];
let targets = array![1, 1, 0, 1, 0, 0];

let dataset : DatasetView<f64, usize> = (records.view(), targets.view()).into();
let accuracies = dataset.fold(3).into_iter().map(|(train, valid)| {
    // Here you can train your model and perform validation
     
    // let model = params.fit(&dataset);
    // let predi = model.predict(&valid);
    // predi.confusion_matrix(&valid).accuracy()  
});

pub fn sample_chunks<'a: 'b>(
    &'b self,
    chunk_size: usize
) -> ChunksIter<'b, 'a, F, T>
[src]

pub fn to_owned(&self) -> DatasetBase<Array2<F>, T::Owned>[src]

impl<'a, F: Float, E: Copy + 'a, D, S> DatasetBase<ArrayBase<D, Ix2>, ArrayBase<S, Ix2>> where
    D: DataMut<Elem = F>,
    S: DataMut<Elem = E>, 
[src]

pub fn iter_fold<O, C: Fn(DatasetView<'_, F, E>) -> O>(
    &'a mut self,
    k: usize,
    fit_closure: C
) -> impl Iterator<Item = (O, DatasetBase<ArrayView2<'_, F>, ArrayView2<'_, E>>)>
[src]

Allows performing k-fold cross-validation on fittable algorithms.

Given a dataset, a value of k and the desired parameters for the fittable algorithm, this method returns an iterator over the k trained models and the associated validation sets.

The models are trained according to a closure specified as an input.

Parameters

  • k: the number of folds to apply to the dataset
  • params: the desired parameters for the fittable algorithm at hand
  • fit_closure: a closure of the type (params, training_data) -> fitted_model that will be used to produce the trained model for each fold. The training data given in input won’t outlive the closure.

Returns

An iterator over couples (trained_model, validation_set).

Panics

This method will panic for any of the following three reasons:

  • The value of k provided is not positive;
  • The value of k provided is greater than the total number of samples in the dataset;
  • The dataset’s data is not stored contiguously and in standard order;

Example

use linfa::traits::Fit;
use linfa::dataset::{Dataset, DatasetView};
use ndarray::{array, ArrayView1, ArrayView2};

struct MockFittable {}

struct MockFittableResult {
    mock_var: usize,
}

impl<'a> Fit<'a, ArrayView2<'a, f64>, ArrayView2<'a, f64>> for MockFittable {
    type Object = MockFittableResult;

    fn fit(&self, training_data: &DatasetView<f64, f64>) -> Self::Object {
        MockFittableResult { mock_var: training_data.ntargets()}
    }
}

let records = array![[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]];
let targets = array![1.,2.,3.,4.,5.];
let mut dataset: Dataset<f64, f64> = (records, targets).into();
let params = MockFittable {};

for (model,validation_set) in dataset.iter_fold(5, |v| params.fit(&v)){
    // Here you can use `model` and `validation_set` to
    // assert the performance of the chosen algorithm
}

impl<F: Float, E> DatasetBase<ArrayBase<OwnedRepr<F>, Dim<[usize; 2]>>, ArrayBase<OwnedRepr<E>, Dim<[usize; 2]>>>[src]

pub fn split_with_ratio(self, ratio: f32) -> (Self, Self)[src]

Split dataset into two disjoint chunks

This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated from the ratio. If the input Dataset contains n samples then the two new Datasets will contain n * ratio and n - (n * ratio) samples respectively. For example, a ratio of 0.9 allocates 90% of the samples to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.

Parameters

  • ratio: the ratio of samples in the input Dataset to include in the first output one

Returns

The input Dataset split into two according to the input ratio.
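
A short sketch using the iris dataset (150 samples), where a ratio of 0.8 is expected to yield a 120/30 split:

let dataset = linfa_datasets::iris();

let (train, valid) = dataset.split_with_ratio(0.8);

assert_eq!(train.nsamples(), 120);
assert_eq!(valid.nsamples(), 30);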

impl<F: Float, L: Copy + Label, D, T> DatasetBase<ArrayBase<D, Ix2>, T> where
    D: Data<Elem = F>,
    T: AsTargets<Elem = L>, 
[src]

pub fn with_labels(
    &self,
    labels: &[&[L]]
) -> DatasetBase<Array2<F>, CountedTargets<L, Array2<L>>>
[src]

Trait Implementations

impl<L, R: Records, T: AsTargets<Elem = L>> AsTargets for DatasetBase<R, T>[src]

type Elem = L

impl<L, R: Records, T: AsTargetsMut<Elem = L>> AsTargetsMut for DatasetBase<R, T>[src]

type Elem = L

impl<R: Records, R2: Records, T: AsTargets<Elem = bool>, T2: AsTargets<Elem = Pr>> BinaryClassification<&'_ DatasetBase<R, T>> for DatasetBase<R2, T2>[src]

impl<F: Float, E, D, S> From<(ArrayBase<D, Dim<[usize; 2]>>, ArrayBase<S, Dim<[usize; 1]>>)> for DatasetBase<ArrayBase<D, Ix2>, ArrayBase<S, Ix2>> where
    D: Data<Elem = F>,
    S: Data<Elem = E>, 
[src]

impl<F: Float, E, D, S> From<(ArrayBase<D, Dim<[usize; 2]>>, ArrayBase<S, Dim<[usize; 2]>>)> for DatasetBase<ArrayBase<D, Ix2>, ArrayBase<S, Ix2>> where
    D: Data<Elem = F>,
    S: Data<Elem = E>, 
[src]

impl<F: Float, D: Data<Elem = F>, I: Dimension> From<ArrayBase<D, I>> for DatasetBase<ArrayBase<D, I>, Array2<()>>[src]

impl<L: Label, T: Labels<Elem = L>, R: Records> Labels for DatasetBase<R, T>[src]

type Elem = L

impl<L: Label, R: Records, T: AsTargets<Elem = L>> Labels for DatasetBase<R, CountedTargets<L, T>>[src]

A NdArray with discrete labels can act as labels

type Elem = L

impl<F: Float, T: AsTargets<Elem = F>, T2: AsTargets<Elem = F>, D: Data<Elem = F>> MultiTargetRegression<F, T2> for DatasetBase<ArrayBase<D, Ix2>, T>[src]

impl<'a, F: Float, R, T, S, O> Predict<&'a DatasetBase<R, T>, S> for O where
    R: Records<Elem = F>,
    O: PredictRef<R, S>, 
[src]

impl<F: Float, D, T, O> Predict<ArrayBase<D, Dim<[usize; 2]>>, DatasetBase<ArrayBase<D, Dim<[usize; 2]>>, T>> for O where
    D: Data<Elem = F>,
    O: PredictRef<ArrayBase<D, Ix2>, T>, 
[src]

impl<F: Float, R, T, S, O> Predict<DatasetBase<R, T>, DatasetBase<R, S>> for O where
    R: Records<Elem = F>,
    O: PredictRef<R, S>, 
[src]

impl<F: Float, D: Records<Elem = F>, T> Records for DatasetBase<D, T>[src]

Implement records for a DatasetBase

type Elem = F

impl<'a, F: Float, L: 'a + Label, D: Data<Elem = F>, T: AsTargets<Elem = L> + Labels<Elem = L>> SilhouetteScore<F> for DatasetBase<ArrayBase<D, Ix2>, T>[src]

impl<L: Label, R, R2, T, T2> ToConfusionMatrix<L, &'_ DatasetBase<R, T>> for DatasetBase<R2, T2> where
    R: Records,
    R2: Records,
    T: AsTargets<Elem = L>,
    T2: AsTargets<Elem = L> + Labels<Elem = L>, 
[src]

impl<L: Label, S: Data<Elem = L>, T: AsTargets<Elem = L> + Labels<Elem = L>, R: Records> ToConfusionMatrix<L, &'_ DatasetBase<R, T>> for ArrayBase<S, Ix1>[src]

Auto Trait Implementations

impl<R, T> RefUnwindSafe for DatasetBase<R, T> where
    R: RefUnwindSafe,
    T: RefUnwindSafe

impl<R, T> Send for DatasetBase<R, T> where
    R: Send,
    T: Send

impl<R, T> Sync for DatasetBase<R, T> where
    R: Sync,
    T: Sync

impl<R, T> Unpin for DatasetBase<R, T> where
    R: Unpin,
    T: Unpin

impl<R, T> UnwindSafe for DatasetBase<R, T> where
    R: UnwindSafe,
    T: UnwindSafe

Blanket Implementations

impl<T> Any for T where
    T: 'static + ?Sized
[src]

impl<T> Borrow<T> for T where
    T: ?Sized
[src]

impl<T> BorrowMut<T> for T where
    T: ?Sized
[src]

impl<T> From<T> for T[src]

impl<T, U> Into<U> for T where
    U: From<T>, 
[src]

impl<'a, F, R, T, S, O> Predict<&'a DatasetBase<R, T>, S> for O where
    F: Float,
    R: Records<Elem = F>,
    O: PredictRef<R, S>, 
[src]

impl<F, D, T, O> Predict<ArrayBase<D, Dim<[usize; 2]>>, DatasetBase<ArrayBase<D, Dim<[usize; 2]>>, T>> for O where
    F: Float,
    D: Data<Elem = F>,
    O: PredictRef<ArrayBase<D, Dim<[usize; 2]>>, T>, 
[src]

impl<F, R, T, S, O> Predict<DatasetBase<R, T>, DatasetBase<R, S>> for O where
    F: Float,
    R: Records<Elem = F>,
    O: PredictRef<R, S>, 
[src]

impl<T, U> TryFrom<U> for T where
    U: Into<T>, 
[src]

type Error = Infallible

The type returned in the event of a conversion error.

impl<T, U> TryInto<U> for T where
    U: TryFrom<T>, 
[src]

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

impl<V, T> VZip<V> for T where
    V: MultiLane<T>,