§nisaba
This library brings a disciplined structure to data reconciliation, providing deterministic conflict resolution, data validation and quality checks, clean merging primitives, and a trustworthy source of truth—so engineers can focus on building systems, not untangling inconsistent or unreliable records. It exposes the properties and attributes of datasets to make differences, inconsistencies, and risks explicit.
In its initial versions, the library focuses on reconciliation, producing deterministic insights and reports that explain how disparate data silos relate to one another and how they can be merged into a single, unified source of truth. This enables data store migrations and integrations to surface gaps as they are discovered and resolve them efficiently, before trust is lost downstream.
§Naming
In Sumerian (Mesopotamian) mythology, Nisaba is the goddess of writing, accounting, and the orderly keeping of records, entrusted with maintaining clarity across ledgers and knowledge archives.
§Core Concepts and Features
- Reconciliation-first architecture: establishes dataset equivalence across systems as the strongest guarantee of correctness, using LanceDB for vector persistence and similarity search, and FastEmbed for embedding generation.
- Deterministic reconciliation engine: produces order-independent, repeatable results suitable for CI and automated workflows.
- Cross-store data support: unified handling of tabular data across SQL (MySQL, PostgreSQL, SQLite), NoSQL (MongoDB), and file formats (CSV, Excel, Parquet).
- Store-agnostic internal data model: logical representation of data decoupled from physical storage or format.
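The scoring and similarity knobs exposed by the crate (type_weight, structure_weight, threshold, DistanceType::Cosine) suggest a weighted combination of cosine-based similarities. The sketch below is an illustration of that idea only, not the crate's internals; cosine_similarity and combined_score are hypothetical helpers:

```rust
// Illustrative only: a toy weighted similarity in the spirit of
// ScoringConfig { type_weight, structure_weight }. The crate's real
// scoring lives in its engine; these helpers are hypothetical.

/// Cosine similarity between two equal-length embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Blend a type-level and a structure-level similarity using the
/// weights from the scoring configuration.
fn combined_score(type_sim: f32, structure_sim: f32, type_weight: f32, structure_weight: f32) -> f32 {
    type_weight * type_sim + structure_weight * structure_sim
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    let b = [1.0, 1.0, 0.0];
    let sim = cosine_similarity(&a, &b); // ≈ 0.5 for these vectors
    let score = combined_score(sim, 0.8, 0.65, 0.35);
    println!("cosine = {sim:.2}, combined = {score:.3}");
}
```

With the defaults shown in the usage example (0.65 / 0.35), a strong type match dominates the blended score.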
§Getting Started
To get started, add nisaba to your Cargo.toml:
[dependencies]
nisaba = { version = "0.2.0" }
§Usage
Prefer the bundled example and the generated docs, or start from the following:
use nisaba::{
    AnalyzerConfig, DistanceType, EmbeddingModel, FileStoreType, SchemaAnalyzer, ScoringConfig,
    SimilarityConfig, Source,
};

#[tokio::main]
async fn main() {
    let config = AnalyzerConfig::builder()
        .sample_size(10)
        .scoring(ScoringConfig {
            type_weight: 0.65,
            structure_weight: 0.35,
        })
        .similarity(SimilarityConfig {
            threshold: 0.59,
            top_k: Some(7),
            algorithm: DistanceType::Cosine,
        })
        .build();

    // Build the analyzer with a primary CSV source and an extra Parquet source.
    let analyzer = SchemaAnalyzer::builder()
        .name("nisaba")
        .config(config)
        .embedding_model(EmbeddingModel::MultilingualE5Small)
        .source(
            Source::files(FileStoreType::Csv)
                .path("./assets/csv")
                .num_rows(10)
                .has_header(true)
                .build()
                .unwrap(),
        )
        .sources(vec![
            Source::files(FileStoreType::Parquet)
                .path("./assets/parquet")
                .num_rows(10)
                .build()
                .unwrap(),
        ])
        .build()
        .await
        .unwrap();

    let _result = analyzer.analyze().await.unwrap();
}
§How nisaba works
Assume a data engineer discovers multiple schemas/sources, each with several tables that have long been ignored, and wants to deduce how they are connected and related, both among themselves and to the current data store. The engineer would:
- Map out the sources and relevant credentials
- Set up Nisaba StorageConfigs
- Set up the SchemaAnalyzer
- Run the analyzer with the storage configs
- Review the results/report for reconciliation hints
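The steps above boil down to a deterministic matching pass: compare column metadata across sources and emit sorted, threshold-filtered hints. The following is a self-contained toy version of that idea, not the crate's actual engine; Column, column_similarity, and reconciliation_hints are illustrative names:

```rust
// Toy deterministic reconciliation: compare columns across two sources
// and keep matches at or above a threshold. Sorting the output makes
// the result independent of input order, as the crate promises.

#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    ty: String,
}

/// Toy similarity: 1.0 when normalized names match, 0.5 when only the
/// types match, 0.0 otherwise.
fn column_similarity(a: &Column, b: &Column) -> f64 {
    if a.name.to_lowercase() == b.name.to_lowercase() {
        1.0
    } else if a.ty == b.ty {
        0.5
    } else {
        0.0
    }
}

/// Every cross-source pair scoring at or above `threshold`, sorted by
/// descending score and then by name so the output is repeatable.
fn reconciliation_hints(
    left: &[Column],
    right: &[Column],
    threshold: f64,
) -> Vec<(String, String, f64)> {
    let mut hints = Vec::new();
    for a in left {
        for b in right {
            let s = column_similarity(a, b);
            if s >= threshold {
                hints.push((a.name.clone(), b.name.clone(), s));
            }
        }
    }
    hints.sort_by(|x, y| {
        y.2.partial_cmp(&x.2)
            .unwrap()
            .then(x.0.cmp(&y.0))
            .then(x.1.cmp(&y.1))
    });
    hints
}

fn main() {
    let csv = vec![
        Column { name: "user_id".into(), ty: "int".into() },
        Column { name: "Email".into(), ty: "text".into() },
    ];
    let parquet = vec![
        Column { name: "email".into(), ty: "text".into() },
        Column { name: "user_id".into(), ty: "int".into() },
    ];
    for (l, r, s) in reconciliation_hints(&csv, &parquet, 0.9) {
        println!("{l} <-> {r} (score {s})");
    }
}
```

The crate replaces the toy name comparison with embedding similarity over a vector store, but the shape of the workflow (score, filter, sort, report) is the same.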
§Roadmap
Successive releases will add more data-quality and validation features, as documented in the roadmap.
§Versioning
As with most Rust crates, this library is versioned according to Semantic Versioning. Breaking changes will only be made with good reason, and as infrequently as is feasible. Such changes will generally be made in releases where the major version number is increased (note Cargo’s caveat for pre-1.x versions), although limited exceptions may occur. Increases in the minimum supported Rust version (MSRV) are not considered breaking, but will result in a minor version bump.
See also the changelog for details about changes in recent versions.
§License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
§Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Structs§
- AnalyzerConfig - A struct with various fields representing weights and thresholds for an analyzer configuration.
- CsvInferenceEngine
- ExcelInferenceEngine
- LatentStore - The LatentStore represents an interface to access the LanceDB store.
- MySQLInferenceEngine - SQL inference engine for common OLTPs.
- NoSQLInferenceEngine - NoSQL inference engine for MongoDB; has a sample_size field of type u32.
- ParquetInferenceEngine
- PostgreSQLInferenceEngine
- SchemaAnalyzer - The SchemaAnalyzer provides an interface for store reconciliation. It contains fields for name, configuration, sources, and runtime state.
- ScoringConfig
- SimilarityConfig
- Source - Data source with connection details and metadata such as an identifier.
- SqliteInferenceEngine
Enums§
- DistanceType
- EmbeddingModel
- FileStoreType - File-based data stores.
Traits§
- SchemaInferenceEngine - Trait for schema inference engines.
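As an illustration of what a schema-inference abstraction can look like, here is a hypothetical sketch; the crate's real SchemaInferenceEngine signature is in the generated docs, and every other name below is made up:

```rust
// Hypothetical shape of a schema-inference trait. The per-store engines
// (CSV, Parquet, MySQL, ...) would each implement it against their own
// backing store; this toy engine "infers" a schema from a header row.

#[derive(Debug, PartialEq)]
struct InferredColumn {
    name: String,
    ty: String,
}

trait SchemaInferenceEngine {
    /// Inspect the backing store and infer its columns.
    fn infer(&self) -> Vec<InferredColumn>;
}

/// Toy engine backed by an in-memory header row.
struct HeaderRowEngine {
    header: Vec<String>,
}

impl SchemaInferenceEngine for HeaderRowEngine {
    fn infer(&self) -> Vec<InferredColumn> {
        self.header
            .iter()
            .map(|name| InferredColumn {
                name: name.clone(),
                ty: "text".to_string(),
            })
            .collect()
    }
}

fn main() {
    let engine = HeaderRowEngine {
        header: vec!["id".into(), "email".into()],
    };
    println!("{:?}", engine.infer());
}
```

A trait like this is what lets the analyzer stay store-agnostic: it consumes inferred schemas without caring whether they came from a CSV header or a SQL information schema.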