rafor 0.3.0 - Docs.rs

**Rafor** is a performance-oriented Random Forest and Decision Trees library.

# Classification
Rafor provides a decision tree (DT) classifier `dt::Classifier` and a random forest (RF) classifier
`rf::Classifier`. Class labels are `i64` values. Classifiers use the Gini index to
evaluate split impurity.
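
As an illustration of the criterion only, the Gini index of a set of labels is `1 - sum(p_k^2)`,
where `p_k` is the share of class `k`. The sketch below is not Rafor's internal code; the
function names `gini_impurity` and `split_impurity` are hypothetical, and scoring a split by the
size-weighted average of its children is the textbook formulation, assumed here.

```rust
use std::collections::HashMap;

/// Gini impurity of a set of class labels: 1 - sum_k p_k^2.
/// Returns 0.0 for an empty slice. (Hypothetical helper, not part of Rafor.)
fn gini_impurity(labels: &[i64]) -> f32 {
    if labels.is_empty() {
        return 0.0;
    }
    // Count how many times each label occurs.
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    let n = labels.len() as f32;
    let sum_sq: f32 = counts.values().map(|&c| (c as f32 / n).powi(2)).sum();
    1.0 - sum_sq
}

/// Impurity of a candidate split: the size-weighted average of the
/// Gini impurities of the two children (textbook formulation, assumed).
fn split_impurity(left: &[i64], right: &[i64]) -> f32 {
    let total = (left.len() + right.len()) as f32;
    if total == 0.0 {
        return 0.0;
    }
    (left.len() as f32 * gini_impurity(left) + right.len() as f32 * gini_impurity(right)) / total
}
```

A pure node scores 0, and a lower `split_impurity` indicates a better candidate split.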

Classifiers provide a `predict` method for predicting a batch of samples; it returns a `Vec<i64>`
of predicted class labels. The `predict_one` method returns an `i64`, the predicted class for a
single sample.

To get probability distributions, use the `proba` method, which returns a `Vec<f32>` of
length `num_samples * num_classes`, where the `i`-th chunk of length `num_classes` contains the
class probabilities for the `i`-th sample. Within each chunk, classes are ordered by their label values.
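
The flat layout can be consumed with plain slice chunking. The sketch below does not call Rafor;
it works on any buffer of this shape. The helpers `argmax_per_sample` and `indices_to_labels` are
hypothetical names, and it assumes that "ordered by their values" means ascending label order.

```rust
/// For each sample, return the index of the most probable class in a flat,
/// row-major buffer of length `num_samples * num_classes`.
/// (Hypothetical helper, not part of Rafor.)
fn argmax_per_sample(proba: &[f32], num_classes: usize) -> Vec<usize> {
    assert!(num_classes > 0 && proba.len() % num_classes == 0);
    proba
        .chunks(num_classes)
        .map(|row| {
            // Keep the first index with the largest probability.
            let mut best = 0;
            for (k, &p) in row.iter().enumerate() {
                if p > row[best] {
                    best = k;
                }
            }
            best
        })
        .collect()
}

/// Map class indices back to labels. Assumes `sorted_classes` holds the
/// distinct training labels in ascending order (an assumption of this sketch).
fn indices_to_labels(indices: &[usize], sorted_classes: &[i64]) -> Vec<i64> {
    indices.iter().map(|&k| sorted_classes[k]).collect()
}
```

For labels `-15`, `1` and `5`, a row `[0.1, 0.7, 0.2]` would then mean class `1` is the most
likely one for that sample.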

# Regression
The regression models are the decision tree regressor `dt::Regressor` and the random forest regressor
`rf::Regressor`. Targets are `f32` values. By default, regressors use the MSE score to evaluate
split impurity.
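
For reference, the MSE of a node is the mean squared deviation of its targets from their mean.
The sketch below shows that textbook definition; `node_mse` and `split_mse` are hypothetical
names, and how Rafor computes the score internally is not described here.

```rust
/// Mean squared error of the targets around their own mean.
/// Returns 0.0 for an empty slice. (Hypothetical helper, not part of Rafor.)
fn node_mse(targets: &[f32]) -> f32 {
    if targets.is_empty() {
        return 0.0;
    }
    let n = targets.len() as f32;
    let mean = targets.iter().sum::<f32>() / n;
    targets.iter().map(|&t| (t - mean).powi(2)).sum::<f32>() / n
}

/// Score of a candidate split: the size-weighted average of the children's MSE
/// (textbook formulation, assumed for this sketch).
fn split_mse(left: &[f32], right: &[f32]) -> f32 {
    let total = (left.len() + right.len()) as f32;
    if total == 0.0 {
        return 0.0;
    }
    (left.len() as f32 * node_mse(left) + right.len() as f32 * node_mse(right)) / total
}
```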

# Dataset
Multiple samples for inference or training are provided as a single `f32` slice, where each chunk of
`num_features` values (the size of the feature space) is treated as the feature vector of a single sample.
During training, `num_features` is derived as the length of the `f32` input slice of samples
divided by the number of provided targets.
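
The arithmetic can be sketched in a few lines. The helpers `derive_num_features` and `flatten`
are hypothetical and not part of Rafor; returning `None` when the length is not divisible by the
number of targets is a choice of this sketch, not documented library behavior.

```rust
/// Derive the feature-space size from a flat sample buffer and the number of targets.
/// Returns `None` if the buffer cannot be split evenly. (Hypothetical helper.)
fn derive_num_features(data: &[f32], num_targets: usize) -> Option<usize> {
    if num_targets == 0 || data.is_empty() || data.len() % num_targets != 0 {
        return None;
    }
    Some(data.len() / num_targets)
}

/// Flatten per-sample rows into the single row-major buffer described above.
fn flatten(rows: &[Vec<f32>]) -> Vec<f32> {
    rows.iter().flat_map(|row| row.iter().copied()).collect()
}
```

With 10 values and 5 targets, `derive_num_features` yields 2, and `data.chunks(2)` then iterates
over the individual feature vectors.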

# Model training
All models provide a `trainer()` method, which returns a `Trainer` object for that model. The
`Trainer` exposes a builder interface (`use rafor::prelude::*`) for setting optional
training parameters and a `train` method that takes the dataset and targets.

The currently supported training parameters are listed below. See each concrete model for its
default values.
## Common parameters
The following parameters are common for decision trees and forests.

`max_depth: usize` defines the maximal tree depth.

`max_features`: [MaxFeaturesPolicy], the maximum number of features considered when searching for the
best split value for a decision tree node. Note that if no split value is found, additional features
are considered until a split is found or all features have been used.

`seed: u64` defines the seed for the random number generator. For trees, random numbers are
used to generate the feature sequence when searching for a split, which matters when `max_features`
is less than the number of features in the training dataset. In RF, the per-tree datasets are
generated by random sampling, and the seeds for individual trees are also randomly generated,
because by default `max_features` in RF is less than the total number of features.

`min_samples_leaf: usize` guarantees that each leaf has at least `min_samples_leaf` samples.
Default: `1`.

`min_samples_split: usize`, the minimum number of samples a node must contain to be considered for splitting.

`sample_weights: Vec<f32>` defines the weight for each sample. If empty, each sample is weighted
with 1.0.

## Ensemble parameters
`num_trees: usize` defines the number of individual trees in ensemble.

`num_threads: usize` defines the number of CPU threads to use for training.

# Example
```rust
use rafor::prelude::*; // Required for .with_option builders and .num_classes().
use rafor::rf::Classifier;
use num_cpus; // Requires num_cpus dependency in Cargo.toml

fn main() {
    // Dataset for 5 samples (number of samples is defined by the number of targets).
    let dataset = [
        0.7, 0.0,
        0.8, 1.0,
        0.3, 0.0,
        1.0, 1.3,
        0.4, 2.1
    ];

    // Target classes.
    let targets = [1, 5, 1, -15, 5];

    let predictor = Classifier::trainer()
        .with_max_depth(15)
        .with_trees(40)
        .with_threads(num_cpus::get())
        .with_seed(42)
        .train(&dataset, &targets);

    // Get predictions for same dataset.
    let predictions = predictor.predict(&dataset, num_cpus::get());
    println!("Predictions: {:?}", predictions);

    // Now let's get probability distributions for each class. Use all CPU cores.
    let proba = predictor.proba(&dataset, num_cpus::get());
    println!("Probability distributions:");
    for p in proba.chunks(predictor.num_classes()) {
        println!("{:?}", p);
    }
}
```

# Model serialization and deserialization
All models support [serde](https://docs.rs/serde/latest/serde/), so any library that works with
`serde` can be used for serialization and deserialization.

# Space / performance considerations
Rafor uses a compact tree representation under the following restrictions:
1. split threshold is `f32`;
2. feature index is `u16`, up to 2^16 = 65,536 features allowed;
3. in regression tasks, the target type is `f32`;
4. in classification tasks, the class is represented by `u32` (the input `i64` labels are mapped
into `u32` internally, and restored during prediction);
5. child node index is `u32`, up to 2^32 = 4,294,967,296 nodes allowed.
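
To make restriction 4 concrete, here is one possible way to map arbitrary `i64` labels to dense
`u32` ids and back. This is a sketch under assumptions: the function names are hypothetical, and
using the rank in the ascending list of distinct labels as the id is this sketch's choice, not a
statement about how Rafor performs the mapping.

```rust
/// Collect the distinct labels in ascending order; the position of a label
/// in this table serves as its `u32` class id. (Hypothetical helper.)
fn build_class_table(targets: &[i64]) -> Vec<i64> {
    let mut classes = targets.to_vec();
    classes.sort_unstable();
    classes.dedup();
    classes
}

/// Encode each label as its index in the sorted class table.
/// Panics if a label is missing from the table.
fn encode_labels(targets: &[i64], classes: &[i64]) -> Vec<u32> {
    targets
        .iter()
        .map(|t| classes.binary_search(t).expect("label not in class table") as u32)
        .collect()
}

/// Restore the original label from its class id.
fn decode_label(id: u32, classes: &[i64]) -> i64 {
    classes[id as usize]
}
```

With this scheme, sparse or negative labels such as `-15` cost no more storage per node than
labels `0`, `1`, `2`.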

The decision tree is represented by a vector of internal (parent) nodes. The leaf value
(`f32` for regression trees, `u32` index pointing to the class probabilities for classification
trees) is bit-packed into the parent's `u32` child node index.
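
The idea can be illustrated with a toy regression tree. Everything in the sketch below is
hypothetical: the `Node` layout, the `leaf_mask` byte that marks which slots hold leaves, and the
"`<=` goes left" rule are assumptions made for illustration, since the actual encoding Rafor uses
is not described here. The only part taken from the text is that an `f32` leaf value shares the
`u32` slot that would otherwise hold a child index, which `f32::to_bits` and `f32::from_bits`
make possible.

```rust
/// What a child slot decodes to.
#[derive(Debug, PartialEq)]
enum Child {
    Node(u32), // index of another internal node
    Leaf(f32), // regression value stored directly in the slot
}

/// Hypothetical internal node; not Rafor's actual layout.
struct Node {
    threshold: f32,
    feature: u16,
    leaf_mask: u8,   // bit 0: left slot is a leaf, bit 1: right slot is a leaf (assumed)
    slots: [u32; 2], // child node index, or the raw bits of an f32 leaf value
}

impl Node {
    fn child(&self, side: usize) -> Child {
        if self.leaf_mask & (1u8 << side) != 0 {
            Child::Leaf(f32::from_bits(self.slots[side]))
        } else {
            Child::Node(self.slots[side])
        }
    }
}

/// Walk the vector of internal nodes from the root (index 0) until a leaf slot is hit.
fn predict(nodes: &[Node], sample: &[f32]) -> f32 {
    let mut idx = 0usize;
    loop {
        let node = &nodes[idx];
        let side = if sample[node.feature as usize] <= node.threshold { 0 } else { 1 };
        match node.child(side) {
            Child::Leaf(value) => return value,
            Child::Node(next) => idx = next as usize,
        }
    }
}
```

Because leaves live inside their parents, no separate leaf nodes are allocated, which is where
the space saving comes from.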


# License
Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or [MIT license](LICENSE-MIT)
at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in
rafor by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.