# tabkit
Tabular files → schema + sample rows. The shared spreadsheet reader Tauri / Iced / native desktop apps reach for when they need to introspect XLSX / CSV / TSV without inventing the same calamine-plus-type-inference glue twice.
Status: v0.4 — API stability candidate for 1.0. Format coverage closed in v0.3 (XLSX-family + CSV/TSV + Parquet, with typed `Date`/`DateTime` cells). v0.4 freezes the public surface — see [Stability (v0.4+)](#stability-v04) below for what's locked in. v0.4.x will iterate on examples + cookbook docs. 1.0 ships once the API is exercised by at least one downstream production user.
## Why this exists
Every "show the user what's in their spreadsheet" project rebuilds the same calamine wrapper, the same type-inference pass, the same first-row-is-headers guess, the same ragged-row padding. Every project gets it slightly wrong:
- Treats Excel's `Float(1.0)` as a `Float`, so a `qty` column that should infer to `Integer` ends up as `Float` in the schema.
- Forgets ragged rows, hands downstream code a `Vec<Vec<_>>` where rows have different lengths (see the padding sketch after this list).
- Hard-codes `,` as the delimiter, breaks on `.tsv`.
- Reads the entire file into memory chasing a 'sample.'
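The ragged-rows fix, for instance, is a few lines that everyone rewrites slightly differently. A minimal sketch of the idea (tabkit handles this internally; its actual pad value is a private detail, so empty text is an assumption here):

```rust
// Pad every row to the width of the widest row so downstream code can
// index columns uniformly. Sketch only — not tabkit's internal code.
fn pad_rows(mut rows: Vec<Vec<String>>) -> Vec<Vec<String>> {
    let width = rows.iter().map(Vec::len).max().unwrap_or(0);
    for row in &mut rows {
        row.resize(width, String::new()); // assumed pad value: empty text
    }
    rows
}
```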
tabkit ships these bits once, with the edge cases handled in one
place. It's deliberately lower-level than a full data tool — it
hands you a [Table] and gets out of the way. Pair it with
scankit for walk-and-watch
and mdkit for documents →
markdown.
## Quick start
```rust
use tabkit::Engine;
use std::path::Path;

let engine = Engine::with_defaults();
let table = engine.read(Path::new("data.xlsx"))?;

// Debug formatting on Column and on row values is assumed here.
for col in &table.columns {
    println!("{col:?}");
}
for row in &table.sample_rows {
    println!("{row:?}");
}
# Ok::<(), tabkit::Error>(())
```
## Design principles
- Do one thing well. Read tabular files → return `Table`. Anything richer (SQL, persistence, change tracking) is the consuming application's job.
- `Send + Sync` everywhere. A single `Engine` shared across threads, a single `Reader` instance per format.
- JSON-friendly output. `Value` has six narrow variants so the result serialises cleanly through Tauri IPC; since v0.3, dates travel as ISO-8601 strings in `Value::Date` / `Value::DateTime` (no chrono dep). See the sketch after this list.
- Forward-compat defaults. `Table`, `Column`, `Value`, `Error`, and `ReadOptions` are `#[non_exhaustive]` so we can add fields / variants without breaking downstream callers.
- Honest dep budget. `calamine` + `csv` + `thiserror` are the only required deps — roughly 700 KB compiled with both default backends (see the feature table below).
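What "serialises cleanly through Tauri IPC" looks like in practice, as a hedged sketch: it assumes `Table` implements `serde::Serialize` (implied by the principle above), and the command name is hypothetical:

```rust
// Hypothetical Tauri command: read a file, hand the Table to the frontend.
// Assumes tabkit::Table: serde::Serialize so Tauri can JSON-encode it.
#[tauri::command]
fn inspect_tabular(path: String) -> Result<tabkit::Table, String> {
    tabkit::Engine::with_defaults()
        .read(std::path::Path::new(&path))
        .map_err(|e| e.to_string())
}
```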
## Feature flags
| Feature | Adds | Approx. cost |
|---|---|---|
| `calamine` (default) | XLSX / XLS / XLSB / XLSM / ODS via calamine | ~600 KB compiled |
| `csv` (default) | CSV / TSV via the csv crate | ~100 KB compiled |
| `default` | both `calamine` + `csv` | ~700 KB compiled |
| `parquet` | Parquet via the parquet crate (default features off — no Arrow runtime) | ~3 MB compiled |
| `full` | `calamine` + `csv` + `parquet` | ~4 MB compiled |
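For example, a consumer that only needs Parquet can drop the default backends in its `Cargo.toml` (version number illustrative):

```toml
[dependencies]
tabkit = { version = "0.4", default-features = false, features = ["parquet"] }
```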
## Examples
Runnable example programs live in `examples/`:

- `inspect.rs` — print schema + sample rows for any tabular file. Run with `cargo run --example inspect -- <path-to-file>`.

More examples (`custom_reader.rs`, `compose_with_duckdb.rs`) land in v0.4.x.
## Stability (v0.4+) {#stability-v04}
v0.4 is the API stability candidate for 1.0. The following surface is committed and will only change with a major version bump:

- The `Reader` trait shape — required methods, default implementations, `Send + Sync` bound. Future trait methods land with default impls so existing implementors don't break.
- `Engine` construction + dispatch — `new`, `with_defaults`, `register`, `read`, `len`, `is_empty`.
- `Table`, `Column`, `Value`, `DataType`, `Error`, `ReadOptions` field/variant sets. All `#[non_exhaustive]` so we can grow them without major bumps. Pattern-matchers must include a wildcard arm (see the sketch after this list).
- Feature flag names: `calamine`, `csv`, `parquet`, `full`. Each reader's per-format extension list is also stable.
- Per-reader `name()` strings (`"calamine"`, `"csv"`, `"parquet"`).
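What the wildcard-arm rule means for consumers, as a minimal sketch: `Date` / `DateTime` carry ISO-8601 strings per the v0.3 notes; the other payload types are assumptions:

```rust
use tabkit::Value;

// Render a cell for display. Value is #[non_exhaustive], so the match
// must keep a wildcard arm; new variants then don't break this code.
fn render(value: &Value) -> String {
    match value {
        Value::Bool(b) => b.to_string(),    // payload type assumed
        Value::Integer(i) => i.to_string(), // payload type assumed
        Value::Float(f) => f.to_string(),   // payload type assumed
        Value::Text(s) => s.clone(),        // payload type assumed
        Value::Date(iso) | Value::DateTime(iso) => iso.clone(), // ISO-8601 (v0.3)
        _ => String::from("?"),             // required wildcard arm
    }
}
```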
The following are implementation details and may change in minor versions:
- Internal layout of any specific reader (private fields, helper methods, type-inference heuristics).
- Exact set of `Table.metadata` keys per backend (new keys may appear; documented keys stay).
- Auto-registration order in `Engine::with_defaults` (the rule "first registered wins for overlapping extensions" stays; the specific order doesn't).
1.0 will be cut once the API is exercised by at least one downstream production user.
## Composing with DuckDB {#composing-with-duckdb}
When you need SQL queries on tabular data, use duckdb
directly — DuckDB has excellent native readers for CSV and Parquet,
and it's purpose-built for this. Use tabkit for "what's in the
file" (schema, samples, type inference for the UI / agent
grounding); use DuckDB for "compute over the data" (joins,
aggregates, projections):
```rust
// `path: &std::path::Path` is assumed to be in scope.

// 1. tabkit for schema + samples (fast, lightweight)
let table = tabkit::Engine::with_defaults().read(path)?;
println!("{} columns, {} sample rows", table.columns.len(), table.sample_rows.len());

// 2. DuckDB for the SQL surface (when you actually need it)
let conn = duckdb::Connection::open_in_memory()?;
let path_str = path.display().to_string();
conn.execute(
    &format!("CREATE TABLE t AS SELECT * FROM read_csv_auto('{path_str}')"),
    [],
)?;
let mut stmt = conn.prepare("SELECT * FROM t LIMIT 10")?; // illustrative query
// ...
```
Same composition shape works for XLSX (read with calamine, write intermediate CSV/Parquet, query with DuckDB) and Parquet (DuckDB reads natively).
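A hedged sketch of that XLSX leg: full read with calamine (tabkit deliberately returns only a sample), intermediate CSV via the csv crate, then DuckDB's native reader. File names and the helper are illustrative:

```rust
use calamine::{open_workbook_auto, Reader as _};

fn xlsx_via_csv(xlsx: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Full read with calamine; tabkit's read() only samples.
    let mut workbook = open_workbook_auto(xlsx)?;
    let range = workbook
        .worksheet_range_at(0)
        .ok_or("workbook has no sheets")??;

    // Intermediate CSV that DuckDB can ingest natively.
    let mut writer = csv::Writer::from_path("intermediate.csv")?;
    for row in range.rows() {
        writer.write_record(row.iter().map(|cell| cell.to_string()))?;
    }
    writer.flush()?;

    // Query with DuckDB's native CSV reader.
    let conn = duckdb::Connection::open_in_memory()?;
    conn.execute(
        "CREATE TABLE t AS SELECT * FROM read_csv_auto('intermediate.csv')",
        [],
    )?;
    Ok(())
}
```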
## License
Dual-licensed under MIT OR Apache 2.0
at your option. SPDX: MIT OR Apache-2.0.
## Status & roadmap
- v0.1 — schema + samples. `Engine` + `Reader` trait + `Table` + `Column` + `Value`, calamine + csv backends, type inference (Bool / Integer / Float / Text / Unknown), header-or-not, sheet selection for multi-sheet XLSX, ragged-row padding.
- v0.2 — `parquet` feature. Apache Parquet read support via the `parquet` crate (default features off — no Arrow runtime). Same schema-and-samples surface, same type-inference rules.
- v0.3 — typed dates. `Value::Date(String)` / `Value::DateTime(String)` with ISO-8601 string payloads (no chrono dep). All three readers emit typed dates for source values that carry date semantics. `Value` is now `#[non_exhaustive]` for forward compat.
- v0.4 — audit pass + 1.0 candidate. Stability commitments documented in `lib.rs` + README. `#[non_exhaustive]` already on every public struct + enum (added incrementally v0.1 → v0.3); `#[must_use]` already on every constructor + builder + accessor. Documentation-only release — no API-shape changes.
- v1.0 — once exercised by at least one downstream production user. Sery Link is the canonical integration target; v1.0 ships once the API survives real use without breaking changes.
## Why no duckdb feature
Earlier roadmaps mentioned a v0.x `duckdb` feature for SQL queries on top of any read table. We dropped that plan because:

- Dep weight. The bundled `duckdb` crate is ~50 MB compiled — adding it would grow tabkit's ~4 MB `full` build roughly 13×. That violates the "small focused kit" aesthetic.
- Scope creep. tabkit's contract is "schema + samples from a file." SQL queries are a fundamentally different abstraction: compute over data, not introspect a file.
- DuckDB has native CSV/Parquet readers. A tabkit-DuckDB feature would duplicate functionality — users would have two readers for the same format and not know which to pick.
- Composition is cleaner. See [Composing with DuckDB](#composing-with-duckdb) above.
Issues, PRs, and design discussion welcome at https://github.com/seryai/tabkit/issues.
## Used by
tabkit was extracted from the schema-extraction layer of
Sery Link, a privacy-respecting data network for the files
on your machines. If you use tabkit in your project, please open
a PR to add yourself here.
## The kit family
tabkit is part of a coordinated suite of focused single-purpose
Rust crates extracted from Sery Link:
- `mdkit` — documents → markdown (PDF, DOCX, PPTX, HTML, IPYNB, OCR).
- `scankit` — walk + watch directory trees with exclude-glob and size-cap filters.
- `tabkit` — spreadsheets → schema + sample rows (this crate).
Use them together, use them separately. The trait surfaces are designed to compose without forcing a particular runtime.