A fast, extensible probabilistic cross-categorization engine.
Lace is a probabilistic cross-categorization engine written in Rust with an optional interface to Python. Unlike traditional machine learning methods, which learn some function mapping inputs to outputs, Lace learns a joint probability distribution over your dataset, which enables users to…
- predict or compute likelihoods of any number of features conditioned on any number of other features
- identify, quantify, and attribute uncertainty from variance in the data, epistemic uncertainty in the model, and missing features
- determine which variables are predictive of which others
- determine which records/rows are similar to which others on the whole or given a specific context
- simulate and manipulate synthetic data
- work natively with missing data and make inferences about missingness (missing not-at-random)
- work with continuous and categorical data natively, without transformation
- identify anomalies, errors, and inconsistencies within the data
- edit, backfill, and append data without retraining
and more, all in one place, without any explicit model building.
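To make the idea of querying a joint distribution concrete, here is a minimal sketch that does not use Lace's API at all. It assumes a hand-written joint probability table over two binary features; the probabilities are illustrative only and are not taken from Lace or from any dataset. The function names are likewise hypothetical. The sketch shows that a conditional probability and a measure of how predictive one feature is of another (mutual information) can both be read off the same joint table.

```rust
/// Joint probability table over two binary features.
/// `JOINT[a][b]` stands for P(swims = a, flippers = b).
/// The numbers are made up for illustration.
const JOINT: [[f64; 2]; 2] = [
    [0.60, 0.05], // swims = 0
    [0.15, 0.20], // swims = 1
];

/// Marginal P(first feature = a), summing out the second feature.
fn marginal_a(joint: &[[f64; 2]; 2], a: usize) -> f64 {
    joint[a][0] + joint[a][1]
}

/// Marginal P(second feature = b), summing out the first feature.
fn marginal_b(joint: &[[f64; 2]; 2], b: usize) -> f64 {
    joint[0][b] + joint[1][b]
}

/// Conditional P(second feature = b | first feature = a).
fn conditional_b_given_a(joint: &[[f64; 2]; 2], b: usize, a: usize) -> f64 {
    joint[a][b] / marginal_a(joint, a)
}

/// Mutual information between the two features, in nats.
fn mutual_information(joint: &[[f64; 2]; 2]) -> f64 {
    let mut mi = 0.0;
    for a in 0..2 {
        for b in 0..2 {
            let p = joint[a][b];
            // Zero-probability cells contribute nothing to the sum.
            if p > 0.0 {
                mi += p * (p / (marginal_a(joint, a) * marginal_b(joint, b))).ln();
            }
        }
    }
    mi
}

fn main() {
    let p = conditional_b_given_a(&JOINT, 1, 1);
    println!("P(flippers = 1 | swims = 1) = {p:.3}");
    println!("P(flippers = 1)             = {:.3}", marginal_b(&JOINT, 1));
    println!("I(swims; flippers)          = {:.4} nats", mutual_information(&JOINT));
}
```

Lace learns a far richer model than a fixed table, but the same principle applies: once the joint distribution is available, conditioning and dependence queries need no separate model.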
§Design
Lace learns a probabilistic model of tabular data using cross-categorization. The general steps of operation are:
- Create a prelude::Codebook which describes your data. One can be autogenerated, but it is best to check it before use.
- Create a prelude::Engine with your data and codebook.
- Train the prelude::Engine and monitor the model likelihood for convergence.
- Ask questions via the prelude::OracleT implementation of prelude::Engine to explore your data.
§Example
(For a complete tutorial, see the Lace Book)
The following example uses the pre-trained animals example dataset.
Each row represents an animal and each column represents a feature of that
animal.
The feature is present if the cell value is 1 and is absent if the value is 0.
First, we create an oracle and import some enums that allow us to call
out some of the row and column indices in plain English.
use lace::prelude::*;
use lace::examples::Example;
let oracle = Example::Animals.oracle().unwrap();
Let’s ask about the statistical dependence between whether something swims and whether it is fast or has flippers. We expect that whether something swims is more indicative of whether it has flippers than of whether it is fast, so we expect the dependence between swims and flippers to be higher.
let depprob_fast = oracle.depprob(
"swims",
"fast",
).unwrap();
let depprob_flippers = oracle.depprob(
"swims",
"flippers",
).unwrap();
assert!(depprob_flippers > depprob_fast);
We have the same expectation of mutual information. Mutual information requires more input from the user: we need to specify the type of mutual information and, if the mutual information must be estimated, how many samples to take.
let mut rng = rand::rng();
let mi_fast = oracle.mi(
"swims",
"fast",
1000,
MiType::Iqr,
).unwrap();
let mi_flippers = oracle.mi(
"swims",
"flippers",
1000,
MiType::Iqr,
).unwrap();
assert!(mi_flippers > mi_fast);
We can likewise ask about the similarity between rows – in this case, animals.
let wrt: Option<&[usize]> = None;
let rowsim_wolf = oracle.rowsim(
"wolf",
"chihuahua",
wrt,
RowSimilarityVariant::ViewWeighted,
).unwrap();
let rowsim_rat = oracle.rowsim(
"rat",
"chihuahua",
wrt,
RowSimilarityVariant::ViewWeighted,
).unwrap();
assert!(rowsim_rat > rowsim_wolf);
And we can add context to similarity.
let context = vec!["swims"];
let rowsim_otter = oracle.rowsim(
"beaver",
"otter",
Some(&context),
RowSimilarityVariant::ViewWeighted,
).unwrap();
let rowsim_dolphin = oracle.rowsim(
"beaver",
"dolphin",
Some(&context),
RowSimilarityVariant::ViewWeighted,
).unwrap();
§Feature flags
- formats: create Engines and Codebooks from IPC, CSV, JSON, and Parquet data files
- bencher: build benchmarking utilities
- ctrc_handler: enables an update handler that captures Ctrl+C
Re-exports§
- pub use config::EngineUpdateConfig;
- pub use interface::Metadata;
- pub use cc::feature::FType;
- pub use cc::state::StateDiagnostics;
- pub use cc::transition::StateTransition;
- pub use data::Category;
- pub use data::Datum;
- pub use data::SummaryStatistics;
- pub use rv;
Modules§
- cc
- codebook
- The Codebook is a YAML file used to associate metadata with the dataset. The user can set the priors on the structure of each state, identify the model for each column, and set hyper priors.
- config
- consts
- data
- Data loaders and utilities
- defaults
- Default values
- error
- examples
- geweke
- Geweke (joint distribution) test
- metadata
- misc
- Misc, generally useful helper functions
- optimize
- Function optimization utilities
- prelude
- Common import for general use.
- stats
- update_handler
- utils
Macros§
- impl_metadata_version
- Implements the MetadataVersion trait
- loaders
- Creates a bunch of helper functions in a load module that load the metadata components and create a Metadata object of the appropriate version.
- series_to_opt_strings
- series_to_opt_vec
- series_to_strings
- series_to_vec
- to_from_newtype
- For a newtype Outer(Inner), implements From<Inner> for Outer and From<Outer> for Inner.
- validate_assignment
- Validates assignments if LACE_NOCHECK is not set to "1".
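The to_from_newtype description above amounts to generating a pair of From implementations. The sketch below illustrates that pattern with a hypothetical macro, newtype_conversions!, and a hypothetical newtype, ColumnId. Neither name comes from Lace, and Lace's own macro may differ in its invocation syntax and in what it generates.

```rust
/// Hypothetical macro (not Lace's `to_from_newtype`): for a tuple newtype
/// `Outer(Inner)`, generate `From<Inner> for Outer` and `From<Outer> for Inner`.
macro_rules! newtype_conversions {
    ($inner:ty, $outer:ident) => {
        impl From<$inner> for $outer {
            fn from(inner: $inner) -> Self {
                $outer(inner)
            }
        }

        impl From<$outer> for $inner {
            fn from(outer: $outer) -> Self {
                outer.0
            }
        }
    };
}

/// Hypothetical newtype wrapping a raw column index.
#[derive(Debug, PartialEq)]
struct ColumnId(usize);

newtype_conversions!(usize, ColumnId);

fn main() {
    // Wrap a raw value, then unwrap it again, using only the generated impls.
    let id: ColumnId = 3_usize.into();
    println!("{id:?}");
    let raw: usize = id.into();
    println!("{raw}");
}
```

Generating both directions at once keeps the newtype cheap to adopt: callers can pass either the raw value or the wrapper and convert with into().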
Structs§
- DatalessOracle
- An oracle without data for sensitive data applications
- Engine
- The engine runs states in parallel
- EngineBuilder
- Builds Engines
- InsertDataActions
- Describes table-extending actions taken when inserting data
- MiComponents
- Holds the components required to compute mutual information
- Oracle
- Oracle answers questions
- ParseError
- Row
- A list of data for insertion into a certain row
- Value
- A datum for insertion into a certain column
- WriteMode
- Defines how/where data may be inserted, which data may and may not be overwritten, and whether data may extend the domain
Enums§
- AppendStrategy
- Defines the behavior of the data table when new rows are appended
- BuildEngineError
- ConditionalEntropyType
- The variant on conditional entropy to compute
- Given
- Describes the conditions (or not) on a conditional distribution
- InsertMode
- Defines insert data behavior – where data may be inserted.
- MiType
- Mutual Information Type
- NameOrIndex
- Holds a String name or a usize index
- OverwriteMode
- Defines which data may be overwritten
- RowSimilarityVariant
- The variant of row similarity to compute
- SupportExtension
- Describes the support extension action taken
- TableIndex
Traits§
- ColumnIndex
- Trait defining items that can be converted into a usize column index
- HasData
- Returns and summarizes data
- HasStates
- Returns references to crosscat states
- OracleT
- RowIndex
- Trait defining an item that can be converted into a row index
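The example earlier on this page passes names such as "swims" where a column is expected, and the index traits listed above describe items that can be converted into a usize index. One way such a conversion trait can work is sketched below. The trait ToColumnIndex, the Columns lookup struct, the resolve function, and the String error type are all hypothetical stand-ins, not Lace's definitions.

```rust
use std::collections::HashMap;

/// Hypothetical name-to-position lookup (a stand-in for codebook metadata).
struct Columns {
    by_name: HashMap<String, usize>,
}

impl Columns {
    /// Build the lookup from column names in table order (names assumed unique).
    fn new(names: &[&str]) -> Self {
        let by_name = names
            .iter()
            .enumerate()
            .map(|(ix, name)| (name.to_string(), ix))
            .collect();
        Columns { by_name }
    }

    fn len(&self) -> usize {
        self.by_name.len()
    }
}

/// Hypothetical trait: anything that can be resolved to a `usize` column index.
trait ToColumnIndex {
    fn col_ix(&self, columns: &Columns) -> Result<usize, String>;
}

impl ToColumnIndex for usize {
    // An integer is already an index; it only needs a bounds check.
    fn col_ix(&self, columns: &Columns) -> Result<usize, String> {
        if *self < columns.len() {
            Ok(*self)
        } else {
            Err(format!("column index {self} is out of bounds"))
        }
    }
}

impl ToColumnIndex for &str {
    // A name is looked up in the name-to-position table.
    fn col_ix(&self, columns: &Columns) -> Result<usize, String> {
        columns
            .by_name
            .get(*self)
            .copied()
            .ok_or_else(|| format!("no column named {self}"))
    }
}

/// A query entry point that accepts either a name or an integer index.
fn resolve<Ix: ToColumnIndex>(columns: &Columns, col: Ix) -> Result<usize, String> {
    col.col_ix(columns)
}

fn main() {
    let columns = Columns::new(&["swims", "fast", "flippers"]);
    println!("{:?}", resolve(&columns, "flippers"));
    println!("{:?}", resolve(&columns, 1_usize));
    println!("{:?}", resolve(&columns, "wings"));
}
```

Making query functions generic over such a trait lets callers use readable names in exploratory code and integer indices in tight loops, with the same validation in both cases.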