
[![MIT license](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE-MIT)
[![Apache 2.0 license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](./LICENSE-APACHE)
[![Docs](https://img.shields.io/docsrs/stats-ci)](https://docs.rs/stats-ci)
[![Tests](https://github.com/xdefago/stats-ci/actions/workflows/tests.yml/badge.svg)](https://github.com/xdefago/stats-ci/actions/workflows/tests.yml)
[![Downloads](https://img.shields.io/crates/d/stats-ci)](https://crates.io/crates/stats-ci)
[![Latest crates.io](https://img.shields.io/crates/v/stats-ci)](https://crates.io/crates/stats-ci)

# stats-ci

## Description

Stats-ci provides basic functions to compute confidence intervals from sample data.
This includes the following:
* confidence intervals around the mean for numerical data,
* confidence intervals around a quantile (e.g., median) for arbitrary ordered data,
* confidence intervals for proportions,
* confidence intervals for comparisons (paired or unpaired observations).

Planned but not yet included:
* confidence intervals for regression parameters,
* confidence intervals for other statistics (e.g., variance),
* chi-square test.

## Motivation

The motivation for creating this crate came from both a recurring need for confidence intervals in personal projects and the frustration of having to look up the formulas each time. I reckoned that I might not be alone in this situation and that such a crate could prove useful to some.

## Disclaimer

NB: As the `0.0.x` version number suggests, this crate is not yet in a finished state and any commit may introduce breaking changes. At this point, I am making no particular effort to preserve backward compatibility. Therefore, please use it at your own risk, at least until version `0.1` or above.

I am far from being a statistician and I will gladly welcome any advice or corrections.
I only made a feeble attempt at numerical stability (e.g., Kahan sum, log-sum-exp).
In any case, please be circumspect about the results obtained from this crate for the time being.

## Usage

Add the most recent release to your `Cargo.toml` _(check the latest version number on [crates.io](https://crates.io/crates/stats-ci) and replace `{ latest version }` below)_:

```toml
[dependencies]
stats-ci = "{ latest version }"
```

## Features

The crate has two features:

* `approx` _(default)_ enables approximate comparison between intervals. It adds a dependency on the crate [`approx`](https://crates.io/crates/approx).
* `serde` adds the crate [`serde`](https://crates.io/crates/serde) as a dependency and provides serialization and deserialization for both [`Confidence`](https://docs.rs/stats-ci/latest/stats_ci/enum.Confidence.html) and [`Interval`](https://docs.rs/stats-ci/latest/stats_ci/enum.Interval.html), as well as the incremental states for intervals on the mean.

For example, to enable `serde`:
```toml
stats-ci = { version = "{ latest version }", features = ["serde"] }
```

## Examples

You can find more detailed information and additional examples from this crate's [API documentation](https://docs.rs/stats-ci).

### C.I. for the Mean

The crate provides functions to compute confidence intervals for the mean of floating-point (`f32` or `f64`) data.
The functions are generic and can be used with any type that implements the `Float` trait from the crate [`num-traits`](https://crates.io/crates/num-traits).
 
The crate provides three functions to compute confidence intervals for the mean of floating-point data:
* `mean::Arithmetic::ci` computes the confidence interval for the arithmetic mean.
* `mean::Geometric::ci` computes the confidence interval for the geometric mean.
* `mean::Harmonic::ci` computes the confidence interval for the harmonic mean.

```rust
    use stats_ci::*;
    let data = [
        82., 94., 68., 6., 39., 80., 10., 97., 34., 66., 62., 7., 39.,
        68., 93., 64., 10., 74., 15., 34., 4., 48., 88., 94., 17., 99.,
        81., 37., 68., 66., 40., 23., 67., 72., 63., 71., 18., 51.,
        65., 87., 12., 44., 89., 67., 28., 86., 62., 22., 90., 18.,
        50., 25., 98., 24., 61., 62., 86., 100., 96., 27., 36., 82.,
        90., 55., 26., 38., 97., 73., 16., 49., 23., 26., 55., 26., 3.,
        23., 47., 27., 58., 27., 97., 32., 29., 56., 28., 23., 37.,
        72., 62., 77., 63., 100., 40., 84., 77., 39., 71., 61., 17.,
        77.,
    ];
    // 1. create a statistics object
    let mut stats = mean::Arithmetic::new();
    // 2. add data
    stats.extend(data)?;

    // 3. define a confidence level
    let confidence = Confidence::new_two_sided(0.95);
    // 4. compute the confidence interval over the mean for some
    //    confidence level
    let ci = stats.ci_mean(confidence)?;
    // 5. get and print other statistics on the sample data
    println!("mean: {}", stats.sample_mean());
        // mean: 53.67
    println!("std_dev: {}", stats.sample_std_dev());
        // std_dev: 28.097613040716794
    println!(
        "ci ({} {}%): {}",
        confidence.kind(),
        confidence.percent(),
        ci
    ); // ci (two-sided 95%): [48.09482399055084, 59.24517600944916]
    println!("low: {}", ci.low_f()); // low: 48.09482399055084
    println!("high: {}", ci.high_f()); // high: 59.24517600944916

    // 6. compute other confidence intervals
    //    (almost no additional performance cost)
    println!(
        "upper one-sided 90% ci: {}",
        stats.ci_mean(Confidence::new_upper(0.9))?
    ); // upper one-sided 90% ci: [50.04495430416555,->)
    println!(
        "lower one-sided 80% ci: {}",
        stats.ci_mean(Confidence::new_lower(0.8))?
    ); // lower one-sided 80% ci: (<-,56.044998597990755]
    let ci = stats.ci_mean(Confidence::new_upper(0.975))?;
    println!("ci: {}", ci); // ci: [48.09482399055084,->)
    println!("low: {}", ci.low_f()); // low: 48.09482399055084
    println!("high: {}", ci.high_f()); // high: inf
    println!("low: {:?}", ci.low()); // low: Some(48.09482399055084)
    println!("high: {:?}", ci.high()); // high: None

    // get statistics for other means (harmonic)
    let stats = mean::Harmonic::from_iter(data)?;
    let ci = stats.ci_mean(confidence)?;
    println!("harmonic mean: {}", stats.sample_mean());
        // harmonic mean: 30.03131315633959
    println!("ci: {}", ci);
        // ci: [23.614092539460778, 41.23786064976718]

    // get statistics for other means (geometric)
    let stats = mean::Geometric::from_iter(data)?;
    let ci = stats.ci_mean(confidence)?;
    println!("geometric mean: {}", stats.sample_mean());
        // geometric mean: 43.7268032829256
    println!("ci: {}", ci);
        // ci: [37.731050052007795, 50.675327686564806]

    // incremental/intermediate statistics also work
    let mut stats = mean::Arithmetic::from_iter(data)?;
    let ci = stats.ci_mean(confidence)?;
    // a. confidence interval from the original data
    println!("incr ci: {}", ci);
        // incr ci: [48.09482399055084, 59.24517600944916]

    // b. confidence interval after adding 10 additional data points
    for _ in 0..10 {
        stats.append(1_000.)?;
    }
    let ci = stats.ci_mean(confidence)?;
    println!("incr ci: {}", ci);
        // incr ci: [87.80710255546494, 191.59289744453503]

    // parallel computation of the confidence interval
    use rayon::prelude::*;
    let state = data
        .par_iter()
        .map(|&x| mean::Arithmetic::from_iter([x]).unwrap())
        .reduce(|| mean::Arithmetic::new(), |s1, s2| s1 + s2);
    println!("parallel ci: {}", state.ci_mean(confidence)?);
        // parallel ci: [48.09482399055084, 59.24517600944916]
```

Incremental statistics are useful in at least three common scenarios:

* when you have a stream of data and don't want to keep all values.
* when you want to continue collecting data until you have sufficient statistical significance (e.g., interval shorter than some width relative to the mean).
* when you want to compute the confidence intervals of several confidence levels in a single pass through the data.

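To illustrate why incremental state is cheap to keep and to combine, here is a minimal, std-only sketch of a running mean and standard deviation. It is not the crate's implementation: `RunningStats` and its methods are hypothetical names, and the sketch uses Welford's update with a pairwise merge rather than the Kahan summation mentioned in the disclaimer.

```rust
/// Hypothetical accumulator, for illustration only (not a stats-ci type).
#[derive(Clone, Copy, Debug, Default)]
pub struct RunningStats {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl RunningStats {
    pub fn new() -> Self {
        Self::default()
    }

    /// Adds one observation (Welford's update); no past values are stored.
    pub fn append(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Combines two partial states, e.g. computed on separate threads.
    pub fn merge(self, other: Self) -> Self {
        if self.count == 0 {
            return other;
        }
        if other.count == 0 {
            return self;
        }
        let count = self.count + other.count;
        let n = count as f64;
        let delta = other.mean - self.mean;
        let mean = self.mean + delta * other.count as f64 / n;
        let m2 = self.m2
            + other.m2
            + delta * delta * (self.count as f64 * other.count as f64) / n;
        Self { count, mean, m2 }
    }

    pub fn sample_mean(&self) -> f64 {
        self.mean
    }

    /// Sample standard deviation (n - 1 denominator); NaN below two samples.
    pub fn sample_std_dev(&self) -> f64 {
        if self.count < 2 {
            return f64::NAN;
        }
        (self.m2 / (self.count - 1) as f64).sqrt()
    }
}
```

The state is three numbers regardless of how many values were seen, which is what makes the streaming and "collect until precise enough" scenarios practical. The `merge` operation plays the same role as the `s1 + s2` reduction in the parallel example above.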

### C.I. for Quantiles

Depending on the type of data and measurements, it is sometimes inappropriate to compute the mean of the data because that value makes little sense.
For instance, consider a communication system and suppose that we want to find an upper bound on message delays such that, with 90% confidence, at least 95% of messages are delivered within this bound.
Then, the value of interest is the lower one-sided confidence interval of the 95th percentile with 90% confidence (quantile=0.95, confidence level=0.9).

In a different context, if the data is an ordered sequence of strings, it might make sense to compute an interval around the median of the data, but the mean cannot be computed.

```rust
use stats_ci::*;

let quantile = 0.5; // median

let data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

let confidence = Confidence::new_two_sided(0.95);
let ci = quantile::ci(confidence, &data, quantile)?;
assert_eq!(ci, Interval::new(5, 12)?);

let confidence = Confidence::new_two_sided(0.8);
let ci = quantile::ci(confidence, &data, quantile)?;
assert_eq!(ci, Interval::new(6, 11)?);

let data = [
    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
    "M", "N", "O",
];
let confidence = Confidence::new_two_sided(0.95);
let ci = quantile::ci(confidence, &data, quantile)?;
println!("ci: {}", ci); // ci: [E, L]
```

### C.I. for Proportions

Confidence intervals for proportions are often used in the context of A/B testing or when measuring the success/failure rate of a system.
They are also useful when running Monte-Carlo simulations to estimate the winning chances of a player in a game.
 
This crate uses the Wilson score interval to compute the confidence interval for a proportion,
which is more stable than the standard normal approximation but results in slightly more conservative intervals.

```rust
use stats_ci::*;
let confidence = Confidence::new_two_sided(0.95);

let data = [
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    19, 20,
];
let ci = proportion::ci_if(confidence, &data, |&x| x <= 10)?;
println!("ci: {}", ci); // ci: [0.2992980081982124, 0.7007019918017876]
assert!(ci.contains(&0.5));

let population = 500;
let successes = 421;
let ci = proportion::ci(confidence, population, successes)?;
println!("ci: {}", ci); // ci: [0.8074376489887337, 0.8713473021355645]
assert!(ci.contains(&0.842));
```
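
For readers who want to see the formula behind the Wilson score interval, the following std-only sketch computes it directly. It is an illustration under stated assumptions, not the crate's code: `wilson_interval` is a hypothetical function, and the standard normal quantile `z` is passed in by the caller (about 1.959964 for a two-sided 95% level) instead of being derived from a `Confidence` value.

```rust
/// Hypothetical helper, for illustration only (not part of stats-ci).
///
/// Returns the Wilson score interval for `successes` out of `population`
/// trials, given the standard normal quantile `z` for the desired level.
/// Returns `None` when the inputs are not meaningful.
pub fn wilson_interval(population: u64, successes: u64, z: f64) -> Option<(f64, f64)> {
    if population == 0 || successes > population || !(z > 0.0) {
        return None;
    }
    let n = population as f64;
    let p_hat = successes as f64 / n; // observed proportion
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    // The center is pulled from p_hat toward 1/2, more strongly for small n.
    let center = (p_hat + z2 / (2.0 * n)) / denom;
    let half_width =
        z / denom * (p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)).sqrt();
    Some((center - half_width, center + half_width))
}
```

With `z = 1.959964`, calling `wilson_interval(500, 421, z)` gives approximately `(0.8074, 0.8713)`, and `wilson_interval(20, 10, z)` gives approximately `(0.2993, 0.7007)`, in line with the intervals printed in the example above.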

## Contributing

I will gladly and carefully consider any constructive comments that you have to offer.
In particular, I will consider constructive feedback on both the interface and the calculations,
with the following priorities: correctness, code readability, genericity, efficiency.

Currently, the following are on my TODO list:

* [feature] confidence intervals for regression parameters.
* [stats] review/fix statistical tests
* [API] remove `unwrap()` and reduce panicking code
* [Refactoring] restructure error results

## References

* Raj Jain. [The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling,](https://www.cse.wustl.edu/~jain/books/perfbook.htm) John Wiley & Sons, 1991.
* [Wikipedia - Confidence interval](https://en.wikipedia.org/wiki/Confidence_interval)
* [Wikipedia - Binomial proportion confidence interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval)
* [Wikipedia article on normal approximation interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval)
* Dransfield R.D., Brightwell R. (2012) Avoiding and Detecting Statistical Malpractice (or "How to Get On Top of Statistics"): Design & Analysis for Biologists, with R. InfluentialPoints, UK [online](https://influentialpoints.com/hyperbook.htm)
* _idem_. Chapter [Confidence intervals of proportions and rates](https://influentialpoints.com/Training/confidence_intervals_of_proportions-principles-properties-assumptions.htm)
* Francis J. DiTraglia. [Blog post: The Wilson Confidence Interval for a Proportion](https://www.econometrics.blog/post/the-wilson-confidence-interval-for-a-proportion/). Feb 2022.
* Nilan Norris. "The standard errors of the geometric and harmonic means and their application to index numbers." Ann. Math. Statist. 11(4): 445-448 (December, 1940). DOI: [10.1214/aoms/1177731830](https://doi.org/10.1214/aoms/1177731830) [JSTOR](https://www.jstor.org/stable/2235727)
* PennState. Stat 500. [Online](https://online.stat.psu.edu/stat500/)

## License

Licensed under either of

 * Apache License, Version 2.0
   ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license
   ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

## Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.