# tabkit

Tabular files → schema + sample rows. The shared spreadsheet reader Tauri / Iced / native desktop apps reach for when they need to introspect XLSX / CSV / TSV without inventing the same calamine-plus-type-inference glue twice.

Status: v0.1 — XLSX / XLS / XLSB / XLSM / ODS via `calamine`, CSV / TSV via `csv`. Schema inference, sample-row capping, header detection, and ragged-row padding are all handled. Parquet (v0.2) and DuckDB-backed SQL queries (v0.3) are planned behind opt-in features.
## Why this exists
Every "show the user what's in their spreadsheet" project rebuilds the same calamine wrapper, the same type-inference pass, the same first-row-is-headers guess, the same ragged-row padding. Every project gets it slightly wrong:
- Treats Excel's `Float(1.0)` as a float, so a `qty` column that should infer to `Integer` ends up as `Float` in the schema.
- Forgets ragged rows and hands downstream code a `Vec<Vec<_>>` whose rows have different lengths.
- Hard-codes `,` as the delimiter and breaks on `.tsv`.
- Reads the entire file into memory chasing a "sample".
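The first pitfall — narrowing whole-valued floats to integers — is the one hand-rolled wrappers most often miss, because calamine reports every Excel number as a float. A minimal sketch of that kind of inference pass (the `Cell` and `ColumnType` names are illustrative, not tabkit's real API):

```rust
/// Illustrative cell type; calamine-style readers surface numbers as floats.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Bool(bool),
    Float(f64),
    Text(String),
    Empty,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Bool,
    Integer,
    Float,
    Text,
    Unknown,
}

/// Infer a column's type from sampled cells, narrowing whole-valued
/// floats (1.0, 42.0) to Integer instead of letting them pin the
/// column to Float.
fn infer_column(cells: &[Cell]) -> ColumnType {
    let mut ty = ColumnType::Unknown;
    for cell in cells {
        let cell_ty = match cell {
            Cell::Bool(_) => ColumnType::Bool,
            Cell::Float(f) if f.fract() == 0.0 => ColumnType::Integer,
            Cell::Float(_) => ColumnType::Float,
            Cell::Text(_) => ColumnType::Text,
            Cell::Empty => continue, // empty cells don't vote
        };
        ty = match (ty, cell_ty) {
            (ColumnType::Unknown, t) => t,
            (a, b) if a == b => a,
            // mixing Integer and Float widens to Float
            (ColumnType::Integer, ColumnType::Float)
            | (ColumnType::Float, ColumnType::Integer) => ColumnType::Float,
            // any other mix degrades to Text
            _ => ColumnType::Text,
        };
    }
    ty
}

fn main() {
    // a qty column that Excel stored as floats
    let qty = [Cell::Float(1.0), Cell::Float(2.0), Cell::Empty];
    assert_eq!(infer_column(&qty), ColumnType::Integer);
    let price = [Cell::Float(1.5), Cell::Float(2.0)];
    assert_eq!(infer_column(&price), ColumnType::Float);
    println!("qty -> {:?}", infer_column(&qty));
}
```

This is a sketch of the pass tabkit replaces, not its actual implementation.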
tabkit ships these bits once, with the edge cases handled in one
place. It's deliberately lower-level than a full data tool — it
hands you a [Table] and gets out of the way. Pair it with
scankit for walk-and-watch
and mdkit for documents →
markdown.
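The quick start below iterates `table.columns` and `table.sample_rows`; a rough sketch of that shape (field types beyond those two names are assumptions — consult the crate docs for the real definitions, which use a `Value` enum rather than strings):

```rust
// Illustrative shape only, not tabkit's real structs.
#[derive(Debug)]
struct Column {
    name: String,      // assumed field name
    data_type: String, // tabkit uses a DataType enum; String keeps the sketch short
}

#[derive(Debug)]
struct Table {
    columns: Vec<Column>,          // one entry per inferred column
    sample_rows: Vec<Vec<String>>, // capped sample, padded to uniform width
}

fn main() {
    let table = Table {
        columns: vec![Column { name: "qty".into(), data_type: "Integer".into() }],
        sample_rows: vec![vec!["1".into()], vec!["2".into()]],
    };
    for col in &table.columns {
        println!("{} ({})", col.name, col.data_type);
    }
    assert_eq!(table.sample_rows.len(), 2);
}
```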
## Quick start

```rust
use tabkit::Engine;
use std::path::Path;

let engine = Engine::with_defaults();
let table = engine.read(Path::new("inventory.xlsx"))?;
for col in &table.columns {
    println!("{col:?}");
}
for row in &table.sample_rows {
    println!("{row:?}");
}
# Ok::<(), tabkit::Error>(())
```
## Design principles

- Do one thing well. Read tabular files → return `Table`. Anything richer (SQL, persistence, change tracking) is the consuming application's job.
- `Send + Sync` everywhere. A single `Engine` shared across threads, a single `Reader` instance per format.
- JSON-friendly output. `Value` has six narrow variants so the result serialises cleanly through Tauri IPC. Dates flatten to `Text` for now; a future `dates` feature could carry typed dates.
- Forward-compat defaults. `Table`, `Column`, `Value`, `Error`, and `ReadOptions` are `#[non_exhaustive]` so we can add fields / variants without breaking downstream callers.
- Honest dep budget. `calamine` + `csv` + `thiserror` are the only required deps. ~1 MB compiled with both default backends.
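The practical effect of `#[non_exhaustive]` is that downstream matches keep a wildcard arm, so tabkit can add a variant (say, a typed date) in a minor release without breaking callers. A minimal sketch, using a local stand-in enum (the variant list here is illustrative, not tabkit's actual `Value`):

```rust
// Illustrative stand-in for tabkit's Value; the real variants may differ.
#[non_exhaustive]
#[derive(Debug)]
enum Value {
    Bool(bool),
    Integer(i64),
    Float(f64),
    Text(String),
    Empty,
    Error(String),
}

// Because the enum is #[non_exhaustive], code in other crates must
// keep a wildcard arm: future variants fall through it instead of
// turning a library upgrade into a compile error.
fn to_display(v: &Value) -> String {
    match v {
        Value::Bool(b) => b.to_string(),
        Value::Integer(i) => i.to_string(),
        Value::Float(f) => f.to_string(),
        Value::Text(s) => s.clone(),
        Value::Empty => String::new(),
        _ => format!("{v:?}"),
    }
}

fn main() {
    println!("{}", to_display(&Value::Integer(42)));
}
```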
Feature flags
| Feature | Adds | Approx. cost |
|---|---|---|
calamine (default) |
XLSX / XLS / XLSB / XLSM / ODS via calamine |
~600 KB compiled |
csv (default) |
CSV / TSV via the csv crate |
~100 KB compiled |
default |
both calamine + csv |
~700 KB compiled |
(planned) parquet |
Parquet via the parquet crate |
~? MB |
(planned) duckdb |
SQL queries on top of read tables | ~50 MB |
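If you only need one backend, disabling default features trims the budget accordingly. A sketch of the consumer-side `Cargo.toml` (the version number is assumed):

```toml
[dependencies]
# CSV / TSV only — skips calamine's ~600 KB
tabkit = { version = "0.1", default-features = false, features = ["csv"] }
```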
## License

Dual-licensed under MIT OR Apache 2.0 at your option. SPDX: `MIT OR Apache-2.0`.
## Status & roadmap

- v0.1 — schema + samples. `Engine` + `Reader` trait + `Table` + `Column` + `Value`, calamine + csv backends, type inference (Bool / Integer / Float / Text / Unknown), header-or-not detection, sheet selection for multi-sheet XLSX, ragged-row padding.
- v0.2 — `parquet` feature (read Parquet directly via the `parquet` crate).
- v0.3 — `duckdb` feature (optional SQL query interface on top of any read table; opt-in because DuckDB is a ~50 MB dep).
- v0.4 — typed dates via a `dates` feature (`DataType::Date` + `DataType::DateTime`, `Value::Date(...)`).
- v0.5 — audit pass + first stable trait release (1.0 candidate).
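The ragged-row padding that v0.1 handles amounts to one pass over the sampled rows. A sketch under simplifying assumptions (`pad_rows` is a hypothetical helper, and the empty-string filler stands in for tabkit's own empty `Value`):

```rust
/// Pad every row to the width of the widest row so downstream code
/// can index columns without bounds checks. Hypothetical helper —
/// tabkit does the equivalent internally before building a Table.
fn pad_rows(mut rows: Vec<Vec<String>>) -> Vec<Vec<String>> {
    let width = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    for row in &mut rows {
        // resize fills the missing trailing cells with empty strings
        row.resize(width, String::new());
    }
    rows
}

fn main() {
    let rows = vec![
        vec!["a".to_string(), "b".to_string(), "c".to_string()],
        vec!["1".to_string()], // ragged: two cells short
    ];
    let padded = pad_rows(rows);
    assert!(padded.iter().all(|r| r.len() == 3));
    println!("{padded:?}");
}
```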
Issues, PRs, and design discussion welcome at https://github.com/seryai/tabkit/issues.
## Used by
tabkit was extracted from the schema-extraction layer of
Sery Link, a privacy-respecting data network for the files
on your machines. If you use tabkit in your project, please open
a PR to add yourself here.
## The kit family

tabkit is part of a coordinated suite of focused single-purpose Rust crates extracted from Sery Link:

- `mdkit` — documents → markdown (PDF, DOCX, PPTX, HTML, IPYNB, OCR).
- `scankit` — walk + watch directory trees with exclude-glob and size-cap filters.
- `tabkit` — spreadsheets → schema + sample rows (this crate).
Use them together, use them separately. The trait surfaces are designed to compose without forcing a particular runtime.