# Lace

<div align=center>
    <img src='../assets/lace.svg' width='300px'/>
    <i><h3>Putting "science" in "data science"</h3></i>
</div>

A fast, extensible probabilistic cross-categorization engine.

<div align=center>
    <div>
        <strong>Documentation</strong>: 
        <a href='#'>User guide</a> | 
        <a href='#'>Rust API</a> | 
        <a href='#'>CLI</a>
    </div>
    <div>
        <strong>Navigation</strong>: 
        <a href='#design'>Design</a> | 
        <a href='#example'>Example</a> | 
        <a href='#getting-started'>Get started</a> | 
        <a href='#standard-cli-workflow'>CLI workflow</a> | 
        <a href='#license'>License</a>
     </div>
</div>

Lace is a probabilistic cross-categorization engine written in Rust with an
optional interface to Python. Unlike traditional machine learning methods,
which learn some function mapping inputs to outputs, Lace learns a joint
probability distribution over your dataset, which enables users to...

- predict or compute likelihoods of any number of features conditioned on any
  number of other features
- identify, quantify, and attribute uncertainty from variance in the data,
  epistemic uncertainty in the model, and missing features
- determine which variables are predictive of which others
- determine which records/rows are similar to which others on the whole or
  given a specific context
- simulate and manipulate synthetic data
- work natively with missing data and make inferences about missingness
  (missing not-at-random)
- work with continuous and categorical data natively, without transformation
- identify anomalies, errors, and inconsistencies within the data
- edit, backfill, and append data without retraining

and more, all in one place, without any explicit model building.

## Design
Lace learns a probabilistic model of tabular data using cross-categorization.
The general steps of operation are:

* Create a `prelude::Codebook` which describes your data. One can be
    autogenerated but it is best to check it before use.
* Create a `prelude::Engine` with your data and codebook.
* Train the `prelude::Engine` and monitor the model likelihood for convergence.
* Ask questions via the `prelude::OracleT` implementation of `prelude::Engine`
    to explore your data.


## Example

(For a complete tutorial, see the [Lace Book](https://TODO))

The following example uses the pre-trained `animals` example dataset. Each row
represents an animal and each column represents a feature of that animal. The
feature is present if the cell value is 1 and is absent if the value is 0.

First, we create an oracle. Row and column indices can be given as plain
English strings naming the animals and their features.

```rust
use lace::prelude::*;
use lace::examples::Example;

let oracle = Example::Animals.oracle().unwrap();
// You can also load previously trained metadata:
// let engine = Engine::load("my-metadata.lace")?;
```
Let's ask about the statistical dependence between whether something swims
and whether it is fast or has flippers. We expect that having flippers is
more indicative of swimming than being fast is, so we expect the dependence
between swims and flippers to be higher.

```rust
let depprob_fast = oracle.depprob(
    "swims",
    "fast",
).unwrap();

let depprob_flippers = oracle.depprob(
    "swims",
    "flippers",
).unwrap();

assert!(depprob_flippers > depprob_fast);
```

We have the same expectation for mutual information. Mutual information
requires more input from the user: the type of mutual information to compute,
and the number of samples to take if the mutual information must be
estimated.

```rust
let mi_fast = oracle.mi(
    "swims",
    "fast",
    1000,
    MiType::Iqr,
).unwrap();

let mi_flippers = oracle.mi(
    "swims",
    "flippers",
    1000,
    MiType::Iqr,
).unwrap();

assert!(mi_flippers > mi_fast);
```

We can likewise ask about the similarity between rows -- in this case,
animals.

```rust
let wrt: Option<&[usize]> = None;
let rowsim_wolf = oracle.rowsim(
    "wolf",
    "chihuahua",
    wrt,
    RowSimilarityVariant::ViewWeighted,
).unwrap();

let rowsim_rat = oracle.rowsim(
    "rat",
    "chihuahua",
    wrt,
    RowSimilarityVariant::ViewWeighted,
).unwrap();

assert!(rowsim_rat > rowsim_wolf);
```

And we can add context to similarity.

```rust
let context = vec!["swims"];
let rowsim_otter = oracle.rowsim(
    "beaver",
    "otter",
    Some(&context),
    RowSimilarityVariant::ViewWeighted,
).unwrap();

let rowsim_dolphin = oracle.rowsim(
    "beaver",
    "dolphin",
    Some(&context),
    RowSimilarityVariant::ViewWeighted,
).unwrap();
```

## Getting started

To use Lace as a library, simply add it to your `Cargo.toml`

```toml
[dependencies]
lace = "*"
```

To install the CLI

```bash
$ cargo install --locked lace
```

To install from source

```bash
$ cargo install --path .
```

To build the API documentation

```bash
$ cargo doc --all --no-deps
```

To run tests

```bash
$ cargo test --all
```

Note that when the build script runs, example files are moved to your data
directory. Once you ask for an `Oracle` for one of the examples, lace will
build the metadata if it does not exist already. If you need to regenerate
the metadata — say the metadata spec has changed — you can do so with the
following CLI command:

```bash
$ lace regen-examples
```

## Standard CLI workflow

The CLI makes some things easier than they would be in Rust or Python. In
particular, generating codebooks and fitting models is simpler from the
command line.

### Codebook

The codebook tells lace how to model your data -- what type of data can be
in each feature; if they're categorical, what values they can take; what their
prior (and hyperprior) distributions should be; etc. Since codebooks scale with
the size of your data, it's best to start with a template codebook generated
using sensible defaults, and then edit it if necessary (normally it won't be).

To generate a template codebook from a csv file:

```bash
$ lace codebook --csv mydata.csv codebook.yaml
```

Open the codebook in your favorite editor to adjust it. You can find
[tips for editing codebooks](#TODO) and [a full codebook reference](#TODO) in
the user guide.
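
For a sense of what you will be editing, here is an abbreviated, illustrative
sketch of a codebook. The exact field names and structure are defined by the
codebook reference and may differ between versions:

```yaml
# Abbreviated, illustrative sketch only -- consult the codebook
# reference for the exact schema.
table_name: mydata
col_metadata:
  - name: swims
    # A categorical column taking one of two values
    coltype: !Categorical
      k: 2
  - name: weight
    # A continuous column; prior and hyperprior settings would
    # also live under the coltype
    coltype: !Continuous
      hyper: ~
row_names:
  - otter
  - beaver
```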

### Run inference

You can run inference (fit a model) using Rust or the CLI.

Using Rust:

```rust
use rand::SeedableRng;
use rand_xoshiro::Xoshiro256Plus;
use polars::prelude::CsvReader;
use lace::prelude::*;

// Load a dataframe
let df = CsvReader::from_path("mydata.csv")
    .unwrap()
    .has_header(true)
    .finish()
    .unwrap();

// Create a codebook
let codebook = Codebook::from_df(&df).unwrap();

// Build the engine
let mut engine = Engine::new(
    16,
    codebook,
    DataSource::Polars(df),
    0,
    Xoshiro256Plus::from_entropy(),
).unwrap();

// Run the fit procedure. You can also use `Engine::update` if
// you would like more control over the algorithms run or if you
// would like to collect different diagnostics.
engine.run(1000).unwrap();

// Save the model
engine.save("mydata.lace", SerializedType::Bincode).unwrap();
```
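
Once saved, the metadata can be loaded back and queried directly, since
`Engine` implements `OracleT`. Here is a minimal sketch; `"col_a"` and
`"col_b"` are placeholders for columns in your own data:

```rust
use lace::prelude::*;

// Load the engine we saved above
let engine = Engine::load("mydata.lace").unwrap();

// Engine implements OracleT, so we can query it directly. The column
// names "col_a" and "col_b" are placeholders for your own columns.
let dep = engine.depprob("col_a", "col_b").unwrap();
println!("dependence probability: {dep}");
```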

You can also use the CLI. To run inference on a csv file using the default
codebook and settings, and save to `mydata.lace`

```bash
$ lace run --csv mydata.csv --codebook codebook.yaml mydata.lace
```

If you do not specify a codebook, a default codebook will be generated behind the scenes.

You can specify which transitions and algorithms to use in two ways. You can
use CLI args

```bash
$ lace run \
    --csv mydata.csv \
    --row-alg slice \
    --col-alg gibbs \
    --transitions=row_assignment,view_alphas,column_assignment,state_alpha \
    mydata.lace
```

Or you can provide a run config

```yaml
# runconfig.yaml
n_iters: 4
timeout: 60
save_path: ~
transitions:
  - row_assignment: slice
  - view_alphas
  - column_assignment: gibbs
  - state_alpha
  - feature_priors
```

Then pass the run config to `lace run`:

```bash
$ lace run \
    --csv mydata.csv \
    --run-config runconfig.yaml \
    mydata.lace
```

Note that any CLI arguments covered in the run config cannot be used if a run
config is provided.

We can also specify the number of states (samples) using `-s`, the number of
iterations to run using `-n`, and the maximum number of seconds a state should
run using `--timeout`. Below, we run 32 states for at most 1000 iterations or
10 minutes (600 seconds), whichever comes first.

```bash
$ lace run --csv mydata.csv -s 32 -n 1000 --timeout 600 mydata.lace
```

We can extend a run (add more iterations) by passing the saved engine back to
`lace run`.

```bash
$ lace run --engine mydata.lace -n 1000 mydata-extended.lace
```

## License

Lace is licensed under the Server Side Public License (SSPL).

If you would like a license for use in closed-source code, please contact
`lace@promised.ai`.