Crate lace

A fast, extensible probabilistic cross-categorization engine.

Lace is a probabilistic cross-categorization engine written in Rust with an optional interface to Python. Unlike traditional machine learning methods, which learn some function mapping inputs to outputs, Lace learns a joint probability distribution over your dataset, which enables users to…

  • predict or compute likelihoods of any number of features conditioned on any number of other features
  • identify, quantify, and attribute uncertainty from variance in the data, epistemic uncertainty in the model, and missing features
  • determine which variables are predictive of which others
  • determine which records/rows are similar to which others on the whole or given a specific context
  • simulate and manipulate synthetic data
  • work natively with missing data and make inferences about missingness (missing not-at-random)
  • work with continuous and categorical data natively, without transformation
  • identify anomalies, errors, and inconsistencies within the data
  • edit, backfill, and append data without retraining

and more, all in one place, without any explicit model building.

§Design

Lace learns a probabilistic model of tabular data using cross-categorization. The general steps to operation are

§Example

(For a complete tutorial, see the Lace Book)

The following example uses the pre-trained animals example dataset. Each row represents an animal and each column represents a feature of that animal. The feature is present if the cell value is 1 and is absent if the value is 0.

First, we create an oracle and import some enums that allow us to call out some of the row and column indices in plain English.

use lace::prelude::*;
use lace::examples::Example;

let oracle = Example::Animals.oracle().unwrap();

Let’s ask about the statistical dependence between whether something swims and whether it is fast or has flippers. We expect that having flippers is more indicative of whether something swims than being fast is, so we expect the dependence between swims and flippers to be higher.

let depprob_fast = oracle.depprob(
    "swims",
    "fast",
).unwrap();

let depprob_flippers = oracle.depprob(
    "swims",
    "flippers",
).unwrap();

assert!(depprob_flippers > depprob_fast);
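To build intuition for the number `depprob` returns, here is a minimal standalone sketch. It assumes that dependence probability is the fraction of posterior states in which two columns are assigned to the same view. The `depprob` function and the view assignments below are hypothetical illustrations and do not use the lace API.

```rust
/// Fraction of states in which columns `a` and `b` share a view.
/// `assignments[s][c]` is the view index of column `c` in state `s`.
fn depprob(assignments: &[Vec<usize>], a: usize, b: usize) -> f64 {
    let same = assignments
        .iter()
        .filter(|asgn| asgn[a] == asgn[b])
        .count();
    same as f64 / assignments.len() as f64
}

fn main() {
    // Hypothetical view assignments for three columns
    // (0 = swims, 1 = flippers, 2 = fast) across four states.
    let assignments: Vec<Vec<usize>> = vec![
        vec![0, 0, 1],
        vec![0, 0, 0],
        vec![1, 1, 0],
        vec![0, 0, 1],
    ];

    let dp_flippers = depprob(&assignments, 0, 1);
    let dp_fast = depprob(&assignments, 0, 2);

    println!("swims/flippers: {dp_flippers}, swims/fast: {dp_fast}");
    assert!(dp_flippers > dp_fast);
}
```

Under this reading, a dependence probability near 1 means that nearly every state models the two columns jointly, and a value near 0 means they are almost always modeled separately.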

We have the same expectation of mutual information. Mutual information requires more input from the user: we need to specify which type of mutual information to compute, and how many samples to draw if the mutual information must be estimated.

let mi_fast = oracle.mi(
    "swims",
    "fast",
    1000,
    MiType::Iqr,
).unwrap();

let mi_flippers = oracle.mi(
    "swims",
    "flippers",
    1000,
    MiType::Iqr,
).unwrap();

assert!(mi_flippers > mi_fast);
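For intuition about what a normalized mutual information measures, the sketch below computes it exactly for two binary variables from a joint probability table. It assumes that `MiType::Iqr` refers to the information quality ratio, I(X;Y) / H(X,Y). The function names and the joint tables are hypothetical and do not use the lace API; an oracle works from its learned model and may rely on sampling rather than exact enumeration.

```rust
/// Mutual information (in nats) of two binary variables, given their joint
/// probability table `p[x][y]`.
fn mutual_information(p: [[f64; 2]; 2]) -> f64 {
    // Marginals of X (rows) and Y (columns).
    let px = [p[0][0] + p[0][1], p[1][0] + p[1][1]];
    let py = [p[0][0] + p[1][0], p[0][1] + p[1][1]];
    let mut mi = 0.0;
    for x in 0..2 {
        for y in 0..2 {
            // Zero-probability cells contribute nothing.
            if p[x][y] > 0.0 {
                mi += p[x][y] * (p[x][y] / (px[x] * py[y])).ln();
            }
        }
    }
    mi
}

/// Joint entropy H(X, Y) in nats.
fn joint_entropy(p: [[f64; 2]; 2]) -> f64 {
    p.iter()
        .flatten()
        .filter(|&&q| q > 0.0)
        .map(|&q| -q * q.ln())
        .sum()
}

/// Information quality ratio: I(X; Y) / H(X, Y), which lies in [0, 1].
fn iqr(p: [[f64; 2]; 2]) -> f64 {
    mutual_information(p) / joint_entropy(p)
}

fn main() {
    // Hypothetical joint tables, indexed as p[swims][other feature].
    let swims_flippers = [[0.60, 0.05], [0.05, 0.30]];
    let swims_fast = [[0.35, 0.30], [0.20, 0.15]];

    println!("IQR(swims, flippers) = {}", iqr(swims_flippers));
    println!("IQR(swims, fast)     = {}", iqr(swims_fast));
    assert!(iqr(swims_flippers) > iqr(swims_fast));
}
```

Normalizing by the joint entropy puts the score on a 0 to 1 scale, which makes values comparable across column pairs with different amounts of total uncertainty.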

We can likewise ask about the similarity between rows – in this case, animals.

let wrt: Option<&[usize]> = None;
let rowsim_wolf = oracle.rowsim(
    "wolf",
    "chihuahua",
    wrt,
    RowSimilarityVariant::ViewWeighted,
).unwrap();

let rowsim_rat = oracle.rowsim(
    "rat",
    "chihuahua",
    wrt,
    RowSimilarityVariant::ViewWeighted,
).unwrap();

assert!(rowsim_rat > rowsim_wolf);

And we can add context to similarity.

let context = vec!["swims"];
let rowsim_otter = oracle.rowsim(
    "beaver",
    "otter",
    Some(&context),
    RowSimilarityVariant::ViewWeighted,
).unwrap();

let rowsim_dolphin = oracle.rowsim(
    "beaver",
    "dolphin",
    Some(&context),
    RowSimilarityVariant::ViewWeighted,
).unwrap();
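The following standalone sketch illustrates one way to think about why context changes row similarity. It assumes that similarity is the fraction of views in which two rows share a category, averaged over states, and that supplying context restricts the calculation to views containing the context columns. The `View` struct, the `rowsim` function, and the data are hypothetical and do not use the lace API.

```rust
/// One view of a hypothetical state: the columns it contains and the
/// category assigned to each row within that view.
struct View {
    columns: Vec<usize>,
    row_categories: Vec<usize>,
}

/// Mean over states of the fraction of relevant views in which `row_a` and
/// `row_b` share a category. With `wrt`, only views that contain at least
/// one of the context columns are considered.
fn rowsim(states: &[Vec<View>], row_a: usize, row_b: usize, wrt: Option<&[usize]>) -> f64 {
    let mut total = 0.0;
    for views in states {
        let relevant: Vec<&View> = views
            .iter()
            .filter(|v| match wrt {
                None => true,
                Some(cols) => v.columns.iter().any(|c| cols.contains(c)),
            })
            .collect();
        let same = relevant
            .iter()
            .filter(|v| v.row_categories[row_a] == v.row_categories[row_b])
            .count();
        total += same as f64 / relevant.len() as f64;
    }
    total / states.len() as f64
}

fn main() {
    // Rows: 0 = beaver, 1 = otter, 2 = dolphin. Columns: 0 = swims, 1 = furry.
    let states = vec![
        vec![
            View { columns: vec![0], row_categories: vec![0, 0, 0] },
            View { columns: vec![1], row_categories: vec![0, 0, 1] },
        ],
        vec![View { columns: vec![0, 1], row_categories: vec![0, 0, 1] }],
    ];

    let swims = [0usize];
    let overall = rowsim(&states, 0, 2, None);
    let in_context = rowsim(&states, 0, 2, Some(&swims[..]));

    println!("beaver/dolphin overall: {overall}, with respect to swims: {in_context}");
    assert!(in_context > overall);
}
```

In this toy data, the beaver and the dolphin look more alike once attention is restricted to the views that model swimming, which is the kind of effect the context argument is meant to expose.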

§Feature flags

  • formats: create Engines and Codebooks from IPC, CSV, JSON, and Parquet data files
  • bencher: build benchmarking utilities
  • ctrlc_handler: enables an update handler that captures Ctrl+C

Re-exports§

pub use config::EngineUpdateConfig;

Modules§

cc
codebook
config
consts
data
Data loaders and utilities
defaults
Default values
error
examples
metadata
misc
Misc, generally useful helper functions
optimize
Function optimization utilities
prelude
Common import for general use.
stats
update_handler
utils

Structs§

DatalessOracle
An oracle without data for sensitive data applications
Engine
The engine runs states in parallel
EngineBuilder
Builds Engines
InsertDataActions
Describes table-extending actions taken when inserting data
Metadata
MiComponents
Holds the components required to compute mutual information
Oracle
Oracle answers questions
ParseError
Row
A list of data for insertion into a certain row
StateDiagnostics
Stores some diagnostic info in the State at every iteration
Value
A datum for insertion into a certain column
WriteMode
Defines how/where data may be inserted, which data may and may not be overwritten, and whether data may extend the domain

Enums§

AppendStrategy
Defines the behavior of the data table when new rows are appended
BuildEngineError
Category
ConditionalEntropyType
The variant on conditional entropy to compute
Datum
Represents the types of data lace can work with
FType
Feature type
Given
Describes the conditions (or lack thereof) on a conditional distribution
InsertMode
Defines insert data behavior – where data may be inserted.
MiType
Mutual Information Type
NameOrIndex
Holds a String name or a usize index
OverwriteMode
Defines which data may be overwritten
RowSimilarityVariant
The variant of row similarity to compute
StateTransition
MCMC transitions in the State
SummaryStatistics
SupportExtension
Describes the support extension action taken
TableIndex

Traits§

ColumnIndex
Trait defining items that can be converted into a usize column index
HasData
Returns and summarizes data
HasStates
Returns references to crosscat states
OracleT
RowIndex
Trait defining an item that can be converted into a row index