§nisaba
This library brings a disciplined structure to data reconciliation, providing deterministic conflict resolution, data validation and quality checks, clean merging primitives, and a trustworthy source of truth—so engineers can focus on building systems, not untangling inconsistent or unreliable records. It exposes the properties and attributes of datasets to make differences, inconsistencies, and risks explicit.
In its initial versions, the library focuses on reconciliation, producing deterministic insights and reports that explain how disparate data silos relate to one another and how they can be merged into a single, unified source of truth. This enables data store migrations and integrations to surface gaps as they are discovered and resolve them efficiently, before trust is lost downstream.
§Naming
In Sumerian (Mesopotamian) mythology, Nisaba is the goddess of writing, accounting, and the orderly keeping of records, entrusted with maintaining clarity across ledgers and knowledge archives.
§Core Concepts and Features
- Reconciliation-first architecture: establishes dataset equivalence across systems as the strongest guarantee of correctness, using LanceDB for vector persistence and similarity search, and FastEmbed for embedding generation.
- Deterministic reconciliation engine: produces order-independent, repeatable results suitable for CI and automated workflows.
- Cross-store data support: unified handling of tabular data across SQL (MySQL, PostgreSQL, SQLite), NoSQL (MongoDB), and file formats (CSV, Excel, Parquet).
- Store-agnostic internal data model: logical representation of data decoupled from physical storage or format.
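The scoring and similarity knobs exposed by the crate (type_weight, structure_weight, threshold, DistanceType::Cosine) suggest a weighted combination of cosine-based similarities. The sketch below is an illustration of that idea only, not the crate's internals; cosine_similarity and combined_score are hypothetical helpers:

```rust
// Illustrative only: a toy weighted similarity in the spirit of
// ScoringConfig { type_weight, structure_weight }. The crate's real
// scoring lives in its engine; these helpers are hypothetical.

/// Cosine similarity between two equal-length embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Blend a type-level and a structure-level similarity using the
/// weights from the scoring configuration.
fn combined_score(type_sim: f32, structure_sim: f32, type_weight: f32, structure_weight: f32) -> f32 {
    type_weight * type_sim + structure_weight * structure_sim
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    let b = [1.0, 1.0, 0.0];
    let sim = cosine_similarity(&a, &b); // ≈ 0.5 for these vectors
    let score = combined_score(sim, 0.8, 0.65, 0.35);
    println!("cosine = {sim:.2}, combined = {score:.3}");
}
```

With the defaults shown in the usage example (0.65 / 0.35), a strong type match dominates the blended score.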
§Getting Started
To get started, add nisaba to your Cargo.toml:
[dependencies]
nisaba = { version = "0.2.0" }
§Usage
Prefer the bundled example and the generated docs, or start from the following:
use nisaba::{
    AnalyzerConfig, DistanceType, EmbeddingModel, FileStoreType, SchemaAnalyzer, ScoringConfig,
    SimilarityConfig, Source,
};

#[tokio::main]
async fn main() {
    let config = AnalyzerConfig::builder()
        .sample_size(10)
        .scoring(ScoringConfig {
            type_weight: 0.65,
            structure_weight: 0.35,
        })
        .similarity(SimilarityConfig {
            threshold: 0.59,
            top_k: Some(7),
            algorithm: DistanceType::Cosine,
        })
        .build();

    // Build the analyzer with a primary CSV source and an extra Parquet source.
    let analyzer = SchemaAnalyzer::builder()
        .name("nisaba")
        .config(config)
        .embedding_model(EmbeddingModel::MultilingualE5Small)
        .source(
            Source::files(FileStoreType::Csv)
                .path("./assets/csv")
                .num_rows(10)
                .has_header(true)
                .build()
                .unwrap(),
        )
        .sources(vec![
            Source::files(FileStoreType::Parquet)
                .path("./assets/parquet")
                .num_rows(10)
                .build()
                .unwrap(),
        ])
        .build()
        .await
        .unwrap();

    let _result = analyzer.analyze().await.unwrap();
}
§How nisaba works
Assume a data engineer discovers multiple schemas/sources, each with several tables that have long been ignored, and wants to deduce how they are connected and related, both among themselves and to the current data store. The engineer would:
- Map out the sources and relevant credentials
- Set up Nisaba StorageConfigs
- Set up the SchemaAnalyzer
- Run the analyzer with the storage configs
- Review the results/report for reconciliation hints
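The steps above boil down to a deterministic matching pass: compare column metadata across sources and emit sorted, threshold-filtered hints. The following is a self-contained toy version of that idea, not the crate's actual engine; Column, column_similarity, and reconciliation_hints are illustrative names:

```rust
// Toy deterministic reconciliation: compare columns across two sources
// and keep matches at or above a threshold. Sorting the output makes
// the result independent of input order, as the crate promises.

#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    ty: String,
}

/// Toy similarity: 1.0 when normalized names match, 0.5 when only the
/// types match, 0.0 otherwise.
fn column_similarity(a: &Column, b: &Column) -> f64 {
    if a.name.to_lowercase() == b.name.to_lowercase() {
        1.0
    } else if a.ty == b.ty {
        0.5
    } else {
        0.0
    }
}

/// Every cross-source pair scoring at or above `threshold`, sorted by
/// descending score and then by name so the output is repeatable.
fn reconciliation_hints(
    left: &[Column],
    right: &[Column],
    threshold: f64,
) -> Vec<(String, String, f64)> {
    let mut hints = Vec::new();
    for a in left {
        for b in right {
            let s = column_similarity(a, b);
            if s >= threshold {
                hints.push((a.name.clone(), b.name.clone(), s));
            }
        }
    }
    hints.sort_by(|x, y| {
        y.2.partial_cmp(&x.2)
            .unwrap()
            .then(x.0.cmp(&y.0))
            .then(x.1.cmp(&y.1))
    });
    hints
}

fn main() {
    let csv = vec![
        Column { name: "user_id".into(), ty: "int".into() },
        Column { name: "Email".into(), ty: "text".into() },
    ];
    let parquet = vec![
        Column { name: "email".into(), ty: "text".into() },
        Column { name: "user_id".into(), ty: "int".into() },
    ];
    for (l, r, s) in reconciliation_hints(&csv, &parquet, 0.9) {
        println!("{l} <-> {r} (score {s})");
    }
}
```

The crate replaces the toy name comparison with embedding similarity over a vector store, but the shape of the workflow (score, filter, sort, report) is the same.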
§Roadmap
Successive releases will add more data-quality and validation features, as documented in the roadmap.
§Versioning
As with most Rust crates, this library is versioned according to Semantic Versioning. Breaking changes will only be made with good reason, and as infrequently as is feasible. Such changes will generally be made in releases where the major version number is increased (note Cargo’s caveat for pre-1.x versions), although limited exceptions may occur. Increases in the minimum supported Rust version (MSRV) are not considered breaking, but will result in a minor version bump.
See also the changelog for details about changes in recent versions.
§License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
§Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Structs§
- AnalyzerConfig - A struct with various fields representing weights and thresholds for an analyzer configuration.
- CsvInferenceEngine
- ExcelInferenceEngine
- LatentStore - The LatentStore represents an interface to access the LanceDB store.
- MySQLInferenceEngine - SQL inference engine for common OLTPs.
- NoSQLInferenceEngine - NoSQL inference engine for MongoDB; has a sample_size field of type u32.
- ParquetInferenceEngine
- PostgreSQLInferenceEngine
- SchemaAnalyzer - The SchemaAnalyzer provides an interface for store reconciliation. It contains fields for name, configuration, sources, and runtime state.
- ScoringConfig
- SimilarityConfig
- Source - Data source with connection details and metadata such as an identifier.
- SqliteInferenceEngine
Enums§
- DistanceType
- EmbeddingModel
- FileStoreType - File-based data stores.
Traits§
- SchemaInferenceEngine - Trait for schema inference engines.
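As an illustration of what a schema-inference abstraction can look like, here is a hypothetical sketch; the crate's real SchemaInferenceEngine signature is in the generated docs, and every other name below is made up:

```rust
// Hypothetical shape of a schema-inference trait. The per-store engines
// (CSV, Parquet, MySQL, ...) would each implement it against their own
// backing store; this toy engine "infers" a schema from a header row.

#[derive(Debug, PartialEq)]
struct InferredColumn {
    name: String,
    ty: String,
}

trait SchemaInferenceEngine {
    /// Inspect the backing store and infer its columns.
    fn infer(&self) -> Vec<InferredColumn>;
}

/// Toy engine backed by an in-memory header row.
struct HeaderRowEngine {
    header: Vec<String>,
}

impl SchemaInferenceEngine for HeaderRowEngine {
    fn infer(&self) -> Vec<InferredColumn> {
        self.header
            .iter()
            .map(|name| InferredColumn {
                name: name.clone(),
                ty: "text".to_string(),
            })
            .collect()
    }
}

fn main() {
    let engine = HeaderRowEngine {
        header: vec!["id".into(), "email".into()],
    };
    println!("{:?}", engine.infer());
}
```

A trait like this is what lets the analyzer stay store-agnostic: it consumes inferred schemas without caring whether they came from a CSV header or a SQL information schema.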