Struct linfa::dataset::DatasetBase
pub struct DatasetBase<R, T> where
R: Records, {
pub records: R,
pub targets: T,
pub weights: Array1<f32>,
// some fields omitted
}
DatasetBase
This is the fundamental structure of a dataset. It contains a number of records about the data and may contain targets, weights and feature names. In order to keep the type complexity low the dataset base is only generic over the records and targets and introduces a trait bound on the records. weights and feature_names, on the other hand, are always assumed to be owned and copied when views are created.
Fields
records: a two-dimensional matrix with dimensionality (nsamples, nfeatures); in the case of kernel methods, a square matrix with dimensionality (nsamples, nsamples), which may be sparse
targets: a two-/one-dimensional matrix with dimensionality (nsamples, ntargets)
weights: optional weights for each sample with dimensionality (nsamples)
feature_names: optional descriptive feature names with dimensionality (nfeatures)
Trait bounds
R: Records: generic over feature matrices or kernel matrices
T: generic over any ndarray matrix which can be used as targets. The AsTargets trait bound is omitted here to avoid some repetition in implementations
Fields
records: R
targets: T
weights: Array1<f32>
Implementations
Calculate the Pearson Correlation Coefficients from a dataset
The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations, and therefore essentially a normalised measurement of the covariance, lying in the range [-1, 1]. A negative coefficient indicates a negative correlation between both variables.
Example
let corr = linfa_datasets::diabetes()
.pearson_correlation();
println!("{}", corr);
Calculate the Pearson Correlation Coefficients and p-values from the dataset
The PCC describes the linear correlation between two variables. It is the covariance divided by the product of the standard deviations, and therefore essentially a normalised measurement of the covariance, lying in the range [-1, 1]. A negative coefficient indicates a negative correlation between both variables.
The p-value supports or rejects the null hypothesis that the two variables are not correlated. The smaller the p-value, the stronger the evidence that the two variables are correlated. A typical threshold is p < 0.05.
Parameters
num_iter: number of iterations of the permutation test to estimate the p-value
Example
let corr = linfa_datasets::diabetes()
.pearson_correlation_with_p_value(100);
println!("{}", corr);
Implementation without constraints on records and targets
This implementation block provides methods for the creation and mutation of datasets. This includes swapping the targets, returning the records, etc.
Create a new dataset from records and targets
Example
let dataset = Dataset::new(records, targets);
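A more complete sketch (the record and target values are made up for illustration; the tuple conversion mirrors the one used in the iter_fold example further below):
use linfa::Dataset;
use ndarray::array;

// Two samples with three features each and one target per sample.
let records = array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
let targets = array![0.0, 1.0];
let dataset: Dataset<f64, f64> = (records, targets).into();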
Return a single weight
The weight of the idx-th observation is returned. If no weight is specified, then all observations are unweighted, with default value 1.0.
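A short usage sketch (assuming this accessor is the weight_for method taking the observation index):
let dataset = linfa_datasets::iris();
// No weights were set on this dataset, so every observation
// falls back to the default weight of 1.0.
assert_eq!(dataset.weight_for(0), 1.0);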
Returns feature names
A feature name gives a human-readable string describing the purpose of a single feature. This allows the reader to understand its purpose while analysing results, for example in correlation analysis or feature importance.
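For example (assuming the accessor is named feature_names):
let dataset = linfa_datasets::diabetes();
// Prints one descriptive name per feature column.
for name in dataset.feature_names() {
    println!("{}", name);
}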
Return records of a dataset
The records are data points from which predictions are made. This function returns a reference to the records field.
Updates the records of a dataset
This function overwrites the records in a dataset. It also invalidates the weights and feature names.
Updates the targets of a dataset
This function overwrites the targets in a dataset.
Updates the weights of a dataset
Updates the feature names of a dataset
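A sketch chaining the two update methods above (the builder-style names with_weights and with_feature_names follow linfa's conventions but are assumptions here):
use ndarray::Array1;

let dataset = linfa_datasets::iris()
    // iris ships with 150 samples; give each the same weight
    .with_weights(Array1::ones(150))
    .with_feature_names(vec!["sepal length", "sepal width", "petal length", "petal width"]);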
Iterate over observations
This function creates an iterator which produces tuples of data points and target values. The iterator runs once for each data point and, while doing so, holds a reference to the owned dataset.
Example
let dataset = linfa_datasets::iris();
for (x, y) in dataset.sample_iter() {
println!("{} => {}", x, y);
}
impl<'a, F: Float, L: 'a, D, T> DatasetBase<ArrayBase<D, Ix2>, T> where
D: Data<Elem = F>,
T: AsTargets<Elem = L> + FromTargetArray<'a, L>,
Creates a view of a dataset
Iterate over features
This iterator produces dataset views with only a single feature, while the set of targets remains complete. It can be useful to compare each feature individually against all targets.
Iterate over targets
This function creates an iterator which produces dataset views with complete records, but only a single target each. Useful for training multiple single-target models on a multi-target dataset.
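A sketch of both iterators (assuming the method names feature_iter and target_iter, matching the descriptions above):
let dataset = linfa_datasets::iris();
// One view per feature; each view keeps the complete targets.
for feature in dataset.feature_iter() {
    println!("{:?}", feature.records().dim());
}
// One view per target; each view keeps the complete records.
for target in dataset.target_iter() {
    println!("{:?}", target.records().dim());
}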
impl<'a, L: 'a, F: Float, T> DatasetBase<ArrayView2<'a, F>, T> where
T: AsTargets<Elem = L> + FromTargetArray<'a, L>,
pub fn split_with_ratio(
&'a self,
ratio: f32
) -> (DatasetBase<ArrayView2<'a, F>, T::View>, DatasetBase<ArrayView2<'a, F>, T::View>)
Split dataset into two disjoint chunks
This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated with the ratio. For example, a ratio of 0.9 allocates 90% of the samples to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.
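For example (a sketch that borrows the owned iris dataset as a view via the view method described above):
use linfa::prelude::*;

let dataset = linfa_datasets::iris();
let view = dataset.view();
// Split a borrowed view 90/10 without consuming the owned dataset.
let (train, valid) = view.split_with_ratio(0.9);
assert_eq!(train.nsamples() + valid.nsamples(), dataset.nsamples());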
pub fn one_vs_all(
&self
) -> Result<Vec<(L, DatasetBase<ArrayView2<'_, F>, CountedTargets<bool, Array2<bool>>>)>>
Produce N boolean targets from multi-class targets
Some algorithms (like SVM) don’t support multi-class targets. This function splits a dataset into multiple binary-target views of the same dataset.
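A usage sketch (winequality is just one example of a multi-class dataset):
use linfa::prelude::*;

let dataset = linfa_datasets::winequality();
// One binary-target view per distinct class label.
for (label, binary) in dataset.one_vs_all().unwrap() {
    println!("class {}: {} samples", label, binary.nsamples());
}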
Calculates label frequencies from a dataset while masking certain samples.
Parameters
mask: a boolean array that specifies which samples to include in the count
Returns
A mapping of the Dataset’s labels to their frequencies
Calculates label frequencies from a dataset
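For instance (assuming the unmasked variant is named label_frequencies and returns a map from label to frequency):
let dataset = linfa_datasets::winequality();
for (label, frequency) in dataset.label_frequencies() {
    println!("{} => {}", label, frequency);
}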
Apply bootstrapping for samples and features
Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. This selects a subset of observations as well as features.
Parameters
sample_feature_size: the number of samples and features per bootstrap
rng: the random number generator used in the sampling procedure
Returns
An infinite Iterator yielding at each step a new bootstrapped dataset
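A sketch drawing a few subsets (assuming the method is named bootstrap and takes the (samples, features) tuple described above; SmallRng and the sizes are illustrative):
use rand::{rngs::SmallRng, SeedableRng};

let mut rng = SmallRng::seed_from_u64(42);
let dataset = linfa_datasets::iris();
// Take three datasets of 30 samples and 2 features each
// from the infinite iterator.
for subset in dataset.bootstrap((30, 2), &mut rng).take(3) {
    println!("{:?}", subset.records().dim());
}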
pub fn bootstrap_samples<R: Rng>(
&'b self,
num_samples: usize,
rng: &'b mut R
) -> impl Iterator<Item = DatasetBase<Array2<F>, <T as FromTargetArray<'b, E>>::Owned>> + 'b
Apply sample bootstrapping
Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a sample subset is selected which retains all features and targets.
Parameters
num_samples: the number of samples per bootstrap
rng: the random number generator used in the sampling procedure
Returns
An infinite Iterator yielding at each step a new bootstrapped dataset
pub fn bootstrap_features<R: Rng>(
&'b self,
num_features: usize,
rng: &'b mut R
) -> impl Iterator<Item = DatasetBase<Array2<F>, <T as FromTargetArray<'b, E>>::Owned>> + 'b
Apply feature bootstrapping
Bootstrap aggregating is used for sub-sample generation and improves the accuracy and stability of machine learning algorithms. It samples data uniformly with replacement and generates datasets where elements may be shared. Only a feature subset is selected while retaining all samples and targets.
Parameters
num_features: the number of features per bootstrap
rng: the random number generator used in the sampling procedure
Returns
An infinite Iterator yielding at each step a new bootstrapped dataset
Produces a shuffled version of the current Dataset.
Parameters
rng: the random number generator that will be used to shuffle the samples
Returns
A new shuffled version of the current Dataset
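For example (assuming the method is named shuffle; SmallRng is just one possible generator):
use rand::{rngs::SmallRng, SeedableRng};

let mut rng = SmallRng::seed_from_u64(42);
// A new dataset with the same samples in shuffled order.
let shuffled = linfa_datasets::iris().shuffle(&mut rng);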
Performs K-folding on the dataset.
The dataset is divided into k “folds”, each containing (dataset size)/k samples, used to generate k training-validation dataset pairs. Each pair contains a validation Dataset with (dataset size)/k samples, the ones contained in the i-th fold, and a training Dataset composed of the union of all the samples in the remaining folds.
Parameters
k: the number of folds to apply
Returns
A vector of k training-validation Dataset pairs.
Example
use linfa::dataset::DatasetView;
use ndarray::array;
let records = array![[1.,1.], [2.,1.], [3.,2.], [4.,1.],[5., 3.], [6.,2.]];
let targets = array![1, 1, 0, 1, 0, 0];
let dataset : DatasetView<f64, usize> = (records.view(), targets.view()).into();
let accuracies = dataset.fold(3).into_iter().map(|(train, valid)| {
// Here you can train your model and perform validation
// let model = params.fit(&dataset);
// let predi = model.predict(&valid);
// predi.confusion_matrix(&valid).accuracy()
});
pub fn iter_fold<O, C: Fn(&DatasetView<'_, F, E>) -> O>(
&'a mut self,
k: usize,
fit_closure: C
) -> impl Iterator<Item = (O, DatasetBase<ArrayView2<'_, F>, ArrayView2<'_, E>>)>
Allows one to perform k-fold cross-validation on fittable algorithms.
Given a dataset, a value of k, and the desired parameters for the fittable algorithm, returns an iterator over the k trained models and the associated validation sets.
The models are trained according to a closure specified as an input.
Parameters
k: the number of folds to apply to the dataset
params: the desired parameters for the fittable algorithm at hand
fit_closure: a closure of the type (params, training_data) -> fitted_model that will be used to produce the trained model for each fold. The training data given in input won’t outlive the closure.
Returns
An iterator over couples (trained_model, validation_set)
.
Panics
This method will panic for any of the following three reasons:
- The value of k provided is not positive;
- The value of k provided is greater than the total number of samples in the dataset;
- The dataset’s data is not stored contiguously and in standard order.
Example
use linfa::traits::Fit;
use linfa::dataset::{Dataset, DatasetView, Records};
use ndarray::{array, ArrayView1, ArrayView2};
use linfa::Error;
struct MockFittable {}
struct MockFittableResult {
mock_var: usize,
}
impl<'a> Fit<ArrayView2<'a,f64>, ArrayView2<'a, f64>, linfa::error::Error> for MockFittable {
type Object = MockFittableResult;
fn fit(&self, training_data: &DatasetView<f64, f64>) -> Result<Self::Object, linfa::error::Error> {
Ok(MockFittableResult {
mock_var: training_data.nsamples(),
})
}
}
let records = array![[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]];
let targets = array![1.,2.,3.,4.,5.];
let mut dataset: Dataset<f64, f64> = (records, targets).into();
let params = MockFittable {};
for (model,validation_set) in dataset.iter_fold(5, |v| params.fit(&v).unwrap()){
// Here you can use `model` and `validation_set` to
// assert the performance of the chosen algorithm
}
pub fn cross_validate_multi<O, ER, M, FACC, C>(
&'a mut self,
k: usize,
parameters: &[M],
eval: C
) -> Result<Array2<FACC>, ER> where
ER: Error + From<Error>,
M: for<'c> Fit<ArrayView2<'c, F>, ArrayView2<'c, E>, ER, Object = O>,
O: for<'d> PredictInplace<ArrayView2<'a, F>, Array2<E>>,
FACC: Float,
C: Fn(&Array2<E>, &ArrayView2<'_, E>) -> Result<Array1<FACC>, Error>,
Cross validation for multi-target algorithms
Given a list of fittable models, cross validation is used to compare their performance according to some performance metric. To do so, k-folding is applied to the dataset and, for each fold, each model is trained on the training set and its performance is evaluated on the validation set. The performances collected for each model are then averaged over the number of folds.
Parameters:
k: the number of folds to apply
parameters: a list of models to compare
eval: closure used to evaluate the performance of each trained model
Returns
An array of model performances, in the same order as the models in input, if no errors occur. The performance of each model is given as an array of performances, one for each target. Otherwise, it might return an Error in one of the following cases:
- An error occurred during the fitting of one model
- An error occurred inside the evaluation closure
Example
use linfa::prelude::*;
// mutability needed for fast cross validation
let mut dataset = linfa_datasets::diabetes();
let models = vec![model1, model2, ... ];
let r2_scores = dataset.cross_validate_multi(5,&models, |prediction, truth| prediction.r2(truth))?;
pub fn cross_validate<O, ER, M, FACC, C, I>(
&'a mut self,
k: usize,
parameters: &[M],
eval: C
) -> Result<ArrayBase<OwnedRepr<FACC>, I>, ER> where
ER: Error + From<Error>,
M: for<'c> Fit<ArrayView2<'c, F>, ArrayView2<'c, E>, ER, Object = O>,
O: for<'d> PredictInplace<ArrayView2<'a, F>, ArrayBase<OwnedRepr<E>, I>>,
FACC: Float,
C: Fn(&ArrayView1<'_, E>, &ArrayView1<'_, E>) -> Result<FACC, Error>,
I: Dimension,
Cross validation for single target algorithms
Given a list of fittable models, cross validation is used to compare their performance according to some performance metric. To do so, k-folding is applied to the dataset and, for each fold, each model is trained on the training set and its performance is evaluated on the validation set. The performances collected for each model are then averaged over the number of folds.
Parameters:
k: the number of folds to apply
parameters: a list of models to compare
eval: closure used to evaluate the performance of each trained model. For single-target datasets, this closure is called once for each fold. For multi-target datasets the closure is called, in each fold, once for every different target. If different evaluations are needed for each target, take a look at the cross_validate_multi method.
Returns
On successful evaluation it returns an array of model performances, in the same order as the models in input.
It returns an Error in one of the following cases:
- An error occurred during the fitting of one model
- An error occurred inside the evaluation closure
Example
use linfa::prelude::*;
// mutability needed for fast cross validation
let mut dataset = linfa_datasets::diabetes();
let models = vec![model1, model2, ... ];
let r2_scores = dataset.cross_validate(5,&models, |prediction, truth| prediction.r2(truth))?;
Split dataset into two disjoint chunks
This function splits the observations in a dataset into two disjoint chunks. The splitting threshold is calculated with the ratio. If the input Dataset contains n samples, then the two new Datasets will have respectively n * ratio and n - (n * ratio) samples. For example, a ratio of 0.9 allocates 90% of the samples to the first chunk and 10% to the second. This is often used in training/validation splitting procedures.
Parameters
ratio: the ratio of samples in the input Dataset to include in the first output one
Returns
The input Dataset split into two according to the input ratio.
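For example (iris ships with 150 samples, so a ratio of 0.8 yields 120 and 30):
use linfa::prelude::*;

let (train, valid) = linfa_datasets::iris().split_with_ratio(0.8);
assert_eq!(train.nsamples(), 120);
assert_eq!(valid.nsamples(), 30);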
pub fn with_labels(
&self,
labels: &[L]
) -> DatasetBase<Array2<F>, CountedTargets<L, Array2<L>>>
Transforms the input dataset by keeping only those samples whose label appears in labels. In the multi-target case a sample is kept if any of its targets appears in labels.
Sample weights and feature names are preserved by this transformation.
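A usage sketch (the winequality dataset and the label values 6 and 7 are illustrative):
let dataset = linfa_datasets::winequality();
// Keep only the samples whose quality label is 6 or 7.
let filtered = dataset.with_labels(&[6, 7]);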
Trait Implementations
type Elem = L
Returns a mutable view on the targets as a two-dimensional array
Converts to a single target; fails for more than one target
impl<R: Records, R2: Records, T: AsTargets<Elem = bool>, T2: AsTargets<Elem = Pr>> BinaryClassification<&'_ DatasetBase<R, T>> for DatasetBase<R2, T2>
impl<F: Float, T: AsTargets<Elem = F>, T2: AsTargets<Elem = F>, D: Data<Elem = F>> MultiTargetRegression<F, T2> for DatasetBase<ArrayBase<D, Ix2>, T>
Maximal error between two continuous variables
Mean error between two continuous variables
Mean squared error between two continuous variables
Mean squared log error between two continuous variables
Median absolute error between two continuous variables
R squared coefficient: the proportion of the variance in the dependent variable that is predictable from the independent variable.
Same as R-Squared but with biased variance
impl<'a, F: Float, R, T, S, O> Predict<&'a DatasetBase<R, T>, S> for O where
R: Records<Elem = F>,
O: PredictInplace<R, S>,
impl<F: Float, R, T, E, S, O> Predict<DatasetBase<R, T>, DatasetBase<R, S>> for O where
R: Records<Elem = F>,
S: AsTargets<Elem = E>,
O: PredictInplace<R, S>,
Implement records for a DatasetBase
impl<'a, F: Float, L: 'a + Label, D: Data<Elem = F>, T: AsTargets<Elem = L> + Labels<Elem = L>> SilhouetteScore<F> for DatasetBase<ArrayBase<D, Ix2>, T>
Evaluates the quality of a clustering.
impl<L: Label, R, R2, T, T2> ToConfusionMatrix<L, &'_ DatasetBase<R, T>> for DatasetBase<R2, T2> where
R: Records,
R2: Records,
T: AsTargets<Elem = L>,
T2: AsTargets<Elem = L> + Labels<Elem = L>,
impl<L: Label, S: Data<Elem = L>, T: AsTargets<Elem = L> + Labels<Elem = L>, R: Records> ToConfusionMatrix<L, &'_ DatasetBase<R, T>> for ArrayBase<S, Ix1>
Auto Trait Implementations
impl<R, T> RefUnwindSafe for DatasetBase<R, T> where
R: RefUnwindSafe,
T: RefUnwindSafe,
impl<R, T> Send for DatasetBase<R, T> where
R: Send,
T: Send,
impl<R, T> Sync for DatasetBase<R, T> where
R: Sync,
T: Sync,
impl<R, T> Unpin for DatasetBase<R, T> where
R: Unpin,
T: Unpin,
impl<R, T> UnwindSafe for DatasetBase<R, T> where
R: UnwindSafe,
T: UnwindSafe,
Blanket Implementations
Mutably borrows from an owned value.