tabkit
Tabular files → schema + sample rows. The shared spreadsheet reader Tauri / Iced / native desktop apps reach for when they need to introspect XLSX / CSV / TSV without inventing the same calamine-plus-type-inference glue twice.
Status: v0.3. Supports XLSX / XLS / XLSB / XLSM / ODS, CSV / TSV, and Parquet (opt-in via the `parquet` feature). Schema inference, sample row capping, header detection, ragged-row padding, and typed `Date` / `DateTime` cells (ISO-8601 strings) emitted by all three readers. DuckDB-backed SQL queries are planned for v0.4 behind another opt-in feature.
Why this exists
Every "show the user what's in their spreadsheet" project rebuilds the same calamine wrapper, the same type-inference pass, the same first-row-is-headers guess, the same ragged-row padding. Every project gets it slightly wrong:
- Treats Excel's `Float(1.0)` as a Float, so a `qty` column that should infer to `Integer` ends up as `Float` in the schema.
- Forgets ragged rows, hands downstream code a `Vec<Vec<_>>` where rows have different lengths.
- Hard-codes `,` as the delimiter, breaks on `.tsv`.
- Reads the entire file into memory chasing a "sample."
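The first pitfall is easy to show in isolation. The following is a minimal sketch, not tabkit's actual implementation: `Cell`, `Inferred`, `classify`, and `infer_column` are hypothetical names, and the widening rules (Integer mixed with Float becomes Float, any other mix becomes Text) are assumptions chosen for illustration.

```rust
/// Column types named in this README's type-inference list.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inferred { Bool, Integer, Float, Text, Unknown }

/// Hypothetical stand-in for a raw spreadsheet cell. Excel stores every
/// number as a float, so there is no integer variant here.
#[derive(Debug, Clone, PartialEq)]
enum Cell { Empty, Bool(bool), Float(f64), Text(String) }

/// Classify one cell; empty cells carry no type information.
fn classify(cell: &Cell) -> Option<Inferred> {
    match cell {
        Cell::Empty => None,
        Cell::Bool(_) => Some(Inferred::Bool),
        // A finite float with no fractional part is treated as an integer.
        Cell::Float(f) if f.is_finite() && f.fract() == 0.0 => Some(Inferred::Integer),
        Cell::Float(_) => Some(Inferred::Float),
        Cell::Text(_) => Some(Inferred::Text),
    }
}

/// Fold the per-cell types into one column type (assumed widening rules).
fn infer_column(cells: &[Cell]) -> Inferred {
    let mut acc = Inferred::Unknown;
    for kind in cells.iter().filter_map(classify) {
        acc = match (acc, kind) {
            (Inferred::Unknown, k) => k,
            (a, k) if a == k => a,
            (Inferred::Integer, Inferred::Float)
            | (Inferred::Float, Inferred::Integer) => Inferred::Float,
            _ => Inferred::Text,
        };
    }
    acc
}
```

With these rules a `qty` column holding `1.0`, `2.0` and an empty cell infers to `Integer`, while a single `2.5` widens the whole column to `Float`.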
tabkit ships these bits once, with the edge cases handled in one
place. It's deliberately lower-level than a full data tool — it
hands you a [Table] and gets out of the way. Pair it with
scankit for walk-and-watch
and mdkit for documents →
markdown.
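The other three pitfalls listed above (ragged rows, a hard-coded delimiter, reading the whole file for a sample) can be sketched with the standard library alone. This is an illustration under stated assumptions, not tabkit's code: `delimiter_for` and `sample_rows` are hypothetical helpers, the extension mapping is a guess, and the naive `split` ignores quoting, which a real CSV parser such as the `csv` crate handles.

```rust
use std::io::{self, BufRead};
use std::path::Path;

/// Pick a delimiter from the file extension (assumption: `.tsv` and
/// `.tab` mean tab-separated, everything else is comma-separated).
fn delimiter_for(path: &Path) -> char {
    match path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .as_deref()
    {
        Some("tsv") | Some("tab") => '\t',
        _ => ',',
    }
}

/// Read at most `cap` lines, then pad every row to the widest one.
/// Only the sampled lines are ever held in memory.
fn sample_rows<R: BufRead>(
    reader: R,
    delimiter: char,
    cap: usize,
) -> io::Result<Vec<Vec<String>>> {
    let mut rows = Vec::new();
    for line in reader.lines().take(cap) {
        let line = line?;
        // Naive split: no quote handling, for illustration only.
        rows.push(line.split(delimiter).map(str::to_owned).collect::<Vec<_>>());
    }
    // Ragged-row padding: short rows are filled with empty strings.
    let width = rows.iter().map(Vec::len).max().unwrap_or(0);
    for row in &mut rows {
        row.resize(width, String::new());
    }
    Ok(rows)
}
```

Wrapping a `File` in `std::io::BufReader` and passing it as `reader` keeps memory use proportional to `cap`, not to the file size.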
Quick start
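    use tabkit::{Engine, ReadOptions};
    use std::path::Path;

    // Assumed call shape: the example path, the `ReadOptions::default()`
    // argument, and the `Debug` printing are illustrative. Check the crate
    // docs for the exact signatures.
    let engine = Engine::with_defaults();
    let table = engine.read(Path::new("data.xlsx"), &ReadOptions::default())?;
    for col in &table.columns {
        println!("{col:?}");
    }
    for row in &table.sample_rows {
        println!("{row:?}");
    }
    # Ok::<(), tabkit::Error>(())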
Design principles
- Do one thing well. Read tabular files → return `Table`. Anything richer (SQL, persistence, change tracking) is the consuming application's job.
- `Send + Sync` everywhere. A single `Engine` shared across threads, a single `Reader` instance per format.
- JSON-friendly output. `Value` has a handful of narrow variants so the result serialises cleanly through Tauri IPC. Since v0.3, dates travel as `Value::Date` / `Value::DateTime` with ISO-8601 string payloads, so they still serialise as plain strings.
- Forward-compat defaults. `Table`, `Column`, `Value`, `Error`, and `ReadOptions` are `#[non_exhaustive]` so we can add fields / variants without breaking downstream callers.
- Honest dep budget. `calamine` + `csv` + `thiserror` are the only required deps. ~1 MB compiled with both default backends.
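To make the `Send + Sync` principle concrete, here is a minimal sketch of one engine shared across threads. The `Reader` trait, `CsvReader`, `Engine`, and `check_in_parallel` below are stand-ins written for this example; they are not tabkit's real definitions, only the same shape: one boxed reader per format behind a single engine.

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in reader trait. The `Send + Sync` supertraits make every
/// `Box<dyn Reader>` safe to share between threads.
trait Reader: Send + Sync {
    fn extensions(&self) -> &[&'static str];
}

struct CsvReader;

impl Reader for CsvReader {
    fn extensions(&self) -> &[&'static str] {
        &["csv", "tsv"]
    }
}

/// Stand-in engine holding one reader instance per format.
struct Engine {
    readers: Vec<Box<dyn Reader>>,
}

impl Engine {
    fn supports(&self, ext: &str) -> bool {
        self.readers.iter().any(|r| r.extensions().contains(&ext))
    }
}

/// Query the same engine from one worker thread per extension.
/// Results come back in input order because handles are joined in order.
fn check_in_parallel(engine: Arc<Engine>, exts: Vec<&'static str>) -> Vec<bool> {
    let handles: Vec<_> = exts
        .into_iter()
        .map(|ext| {
            let engine = Arc::clone(&engine);
            thread::spawn(move || engine.supports(ext))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Because the engine is only read, an `Arc` is enough and no `Mutex` is needed; that is the practical payoff of requiring `Send + Sync` on the reader trait itself.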
Feature flags
| Feature | Adds | Approx. cost |
|---|---|---|
| `calamine` (default) | XLSX / XLS / XLSB / XLSM / ODS via calamine | ~600 KB compiled |
| `csv` (default) | CSV / TSV via the `csv` crate | ~100 KB compiled |
| `default` | both `calamine` + `csv` | ~700 KB compiled |
| `parquet` | Parquet via the `parquet` crate (default features off, no Arrow runtime) | ~3 MB compiled |
| `full` | `calamine` + `csv` + `parquet` | ~4 MB compiled |
| (planned) `duckdb` | SQL queries on top of read tables | ~50 MB |
| (planned) `dates` | typed `Date` / `DateTime` variants on `Value` | <1 MB (chrono) |
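Opting into a feature is a one-line change in the consuming crate's `Cargo.toml`. A minimal sketch, assuming the crate is published under the name `tabkit` and that a `0.3` version requirement matches the status above:

```toml
[dependencies]
# Default backends (calamine + csv) plus the opt-in Parquet reader.
tabkit = { version = "0.3", features = ["parquet"] }

# CSV / TSV only, assuming the defaults can be switched off in the usual way:
# tabkit = { version = "0.3", default-features = false, features = ["csv"] }
```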
License
Dual-licensed under MIT OR Apache 2.0
at your option. SPDX: MIT OR Apache-2.0.
Status & roadmap
- v0.1: schema + samples. `Engine` + `Reader` trait + `Table` + `Column` + `Value`, calamine + csv backends, type inference (Bool / Integer / Float / Text / Unknown), header-or-not, sheet selection for multi-sheet XLSX, ragged row padding.
- v0.2: `parquet` feature. Apache Parquet read support via the `parquet` crate (default features off, no Arrow runtime). Same schema-and-samples surface, same type-inference rules.
- v0.3: typed dates. `Value::Date(String)` / `Value::DateTime(String)` with ISO-8601 string payloads (no chrono dep). All three readers emit typed dates for source values that carry date semantics. `Value` is now `#[non_exhaustive]` for forward-compat.
- v0.4: `duckdb` feature (optional SQL query interface on top of any read table; opt-in because DuckDB is a ~50 MB dep).
- v0.5: audit pass + first stable trait release (1.0 candidate).
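The v0.3 note about ISO-8601 payloads without a chrono dependency can be illustrated with plain integer arithmetic. The sketch below is not taken from tabkit: `civil_from_days` and `excel_serial_to_iso` are hypothetical helpers, and it assumes the Excel 1900 date system, where serial 25569 is 1970-01-01. Serials below 61 are affected by Excel's fictitious 1900-02-29 and are not handled here.

```rust
/// Convert days since 1970-01-01 to (year, month, day) in the proleptic
/// Gregorian calendar, using the well-known "civil from days" arithmetic.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // whole 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, starting 1 March
    let mp = (5 * doy + 2) / 153; // 0 = March ... 11 = February
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format an Excel serial as an ISO-8601 string: a bare date when the
/// serial has no time-of-day part, otherwise date and time.
fn excel_serial_to_iso(serial: f64) -> String {
    let whole = serial.floor();
    let mut days = whole as i64;
    let mut secs = ((serial - whole) * 86_400.0).round() as i64;
    if secs >= 86_400 {
        // Rounding pushed the time past midnight: roll over to the next day.
        days += 1;
        secs = 0;
    }
    let (y, m, d) = civil_from_days(days - 25_569);
    if secs == 0 {
        format!("{y:04}-{m:02}-{d:02}")
    } else {
        let (h, mi, s) = (secs / 3_600, secs % 3_600 / 60, secs % 60);
        format!("{y:04}-{m:02}-{d:02}T{h:02}:{mi:02}:{s:02}")
    }
}
```

For example, `excel_serial_to_iso(45292.0)` yields `"2024-01-01"` and `excel_serial_to_iso(45292.5)` yields `"2024-01-01T12:00:00"`. Keeping the payload a plain `String` is what lets the value pass through JSON untouched.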
Issues, PRs, and design discussion welcome at https://github.com/seryai/tabkit/issues.
Used by
tabkit was extracted from the schema-extraction layer of
Sery Link, a privacy-respecting data network for the files
on your machines. If you use tabkit in your project, please open
a PR to add yourself here.
The kit family
tabkit is part of a coordinated suite of focused single-purpose
Rust crates extracted from Sery Link:
- `mdkit`: documents → markdown (PDF, DOCX, PPTX, HTML, IPYNB, OCR).
- `scankit`: walk + watch directory trees with exclude-glob and size-cap filters.
- `tabkit`: spreadsheets → schema + sample rows (this crate).
Use them together, use them separately. The trait surfaces are designed to compose without forcing a particular runtime.