tabkit
Tabular files → schema + sample rows. The shared spreadsheet reader Tauri / Iced / native desktop apps reach for when they need to introspect XLSX / CSV / TSV without inventing the same calamine-plus-type-inference glue twice.
Status: v0.4, the API stability candidate for 1.0. Format coverage closed in v0.3 (XLSX-family + CSV/TSV + Parquet, with typed Date/DateTime cells). v0.4 freezes the public surface; see the stability section below for what's locked in. v0.4.x will iterate on examples + cookbook docs. 1.0 ships once the API is exercised by at least one downstream production user.
Why this exists
Every "show the user what's in their spreadsheet" project rebuilds the same calamine wrapper, the same type-inference pass, the same first-row-is-headers guess, the same ragged-row padding. Every project gets it slightly wrong:
- Treats Excel's Float(1.0) as a Float, so a qty column that should infer to Integer ends up as Float in the schema.
- Forgets ragged rows, hands downstream code a Vec<Vec<_>> where rows have different lengths.
- Hard-codes , as the delimiter, breaks on .tsv.
- Reads the entire file into memory chasing a "sample."
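Two of these edge cases can be sketched in a few lines of standalone Rust. This is a minimal illustration, not tabkit's implementation: the Cell and Inferred types and the infer_column and pad_rows helpers are hypothetical names invented for this sketch, and the rule that any text cell makes the whole column Text is a simplifying assumption.

```rust
/// Hypothetical cell type for this sketch; not tabkit's actual `Value`.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Null,
    Float(f64),
    Text(String),
}

/// Hypothetical inferred column type; not tabkit's actual `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inferred {
    Integer,
    Float,
    Text,
    Unknown,
}

/// Infer a column type, treating integer-valued floats such as `1.0`
/// as integers instead of reporting the whole column as Float.
fn infer_column(cells: &[Cell]) -> Inferred {
    let mut seen_number = false;
    let mut all_integral = true;
    for cell in cells {
        match cell {
            // Empty cells carry no type information.
            Cell::Null => {}
            // Simplifying assumption: any text cell makes the column Text.
            Cell::Text(_) => return Inferred::Text,
            Cell::Float(f) => {
                seen_number = true;
                if !f.is_finite() || f.fract() != 0.0 {
                    all_integral = false;
                }
            }
        }
    }
    match (seen_number, all_integral) {
        (false, _) => Inferred::Unknown,
        (true, true) => Inferred::Integer,
        (true, false) => Inferred::Float,
    }
}

/// Pad ragged rows with `Cell::Null` so every row has the same width.
fn pad_rows(rows: &mut [Vec<Cell>]) {
    let width = rows.iter().map(Vec::len).max().unwrap_or(0);
    for row in rows.iter_mut() {
        row.resize(width, Cell::Null);
    }
}
```

With these helpers, a column holding 1.0 and 2.0 infers to Integer, a column holding 1.0 and 2.5 infers to Float, and a two-row table with widths 1 and 2 comes back with both rows at width 2.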
tabkit ships these bits once, with the edge cases handled in one
place. It's deliberately lower-level than a full data tool — it
hands you a [Table] and gets out of the way. Pair it with
scankit for walk-and-watch
and mdkit for documents →
markdown.
Quick start
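use std::path::Path;
use tabkit::{Engine, ReadOptions};

// NOTE: the import list, the `read` arguments, and the loop bodies were lost
// in extraction and are reconstructed as assumptions; "sales.xlsx" is a placeholder.
let engine = Engine::with_defaults();
let table = engine.read(Path::new("sales.xlsx"), &ReadOptions::default())?;
for col in &table.columns {
    println!("{col:?}");
}
for row in &table.sample_rows {
    println!("{row:?}");
}
# Ok::<(), tabkit::Error>(())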
Design principles
- Do one thing well. Read tabular files → return Table. Anything richer (SQL, persistence, change tracking) is the consuming application's job.
- Send + Sync everywhere. A single Engine shared across threads, a single Reader instance per format.
- JSON-friendly output. Value has a small set of narrow variants so the result serialises cleanly through Tauri IPC. Since v0.3, dates are carried as Value::Date / Value::DateTime with ISO-8601 string payloads rather than flattened to Text.
- Forward-compat defaults. Table, Column, Value, Error, and ReadOptions are #[non_exhaustive] so we can add fields / variants without breaking downstream callers.
- Honest dep budget. calamine + csv + thiserror are the only required deps. ~1 MB compiled with both default backends.
Feature flags
| Feature | Adds | Approx. cost |
|---|---|---|
| calamine (default) | XLSX / XLS / XLSB / XLSM / ODS via calamine | ~600 KB compiled |
| csv (default) | CSV / TSV via the csv crate | ~100 KB compiled |
| default | both calamine + csv | ~700 KB compiled |
| parquet | Parquet via the parquet crate (default features off, no Arrow runtime) | ~3 MB compiled |
| full | calamine + csv + parquet | ~4 MB compiled |
Examples
Runnable example programs live in examples/:
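- inspect.rs: print schema + sample rows for any tabular file. It runs as a regular Cargo example (cargo run --example inspect).
- custom_reader.rs: implement the Reader trait for a custom format (a toy semicolon-separated .ssv). It runs the same way (cargo run --example custom_reader).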
Stability (v0.4+)
v0.4 is the API stability candidate for 1.0. The following surface is committed to and will only change with a major version bump:
- The Reader trait shape: required methods, default implementations, Send + Sync bound. Future trait methods land with default impls so existing implementors don't break.
- Engine construction + dispatch: new, with_defaults, register, read, len, is_empty.
- Table, Column, Value, DataType, Error, ReadOptions field/variant sets. All #[non_exhaustive] so we can grow them without major bumps. Pattern-matchers must include a wildcard arm.
- Feature flag names: calamine, csv, parquet, full. Each reader's per-format extension list is also stable.
- Per-reader name() strings ("calamine", "csv", "parquet").
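The wildcard-arm requirement looks like this in practice. The enum below is a hypothetical stand-in written for this sketch: its variant list is illustrative and is not guaranteed to match tabkit's actual Value, and render is an invented helper name.

```rust
/// Hypothetical stand-in for a `#[non_exhaustive]` value enum.
/// The variant list is illustrative, not tabkit's actual `Value`.
#[non_exhaustive]
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Bool(bool),
    Integer(i64),
    Float(f64),
    Text(String),
    Date(String),
}

/// Render a value for display. The trailing `_` arm is what a downstream
/// crate has to write so that variants added later still compile.
pub fn render(value: &Value) -> String {
    match value {
        Value::Null => String::new(),
        Value::Bool(b) => b.to_string(),
        Value::Integer(i) => i.to_string(),
        Value::Float(f) => f.to_string(),
        Value::Text(s) => s.clone(),
        // Catch-all for `Date` and for any variant added in a later release.
        _ => format!("{value:?}"),
    }
}
```

Inside the defining crate the compiler does not enforce the wildcard; it becomes mandatory only for code matching on the enum from another crate, which is the downstream case the stability note describes.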
The following are implementation details and may change in minor versions:
- Internal layout of any specific reader (private fields, helper methods, type-inference heuristics).
- Exact set of Table.metadata keys per backend (new keys may appear; documented keys stay).
- Auto-registration order in Engine::with_defaults (the rule "first registered wins for overlapping extensions" stays; the specific order doesn't).
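The "first registered wins for overlapping extensions" rule can be illustrated with a small standalone registry. Everything here is an assumption made for the sketch: the Reader trait below is a hypothetical two-method stand-in (the real trait's methods are not listed in this README), and Registry, CsvLike, SsvToy, and reader_for are invented names.

```rust
use std::path::Path;

/// Hypothetical reader trait for this sketch; the real `Reader` trait's
/// signatures are not shown in this README, so these are assumptions.
trait Reader: Send + Sync {
    fn name(&self) -> &'static str;
    fn extensions(&self) -> &'static [&'static str];
}

struct CsvLike;
struct SsvToy;

impl Reader for CsvLike {
    fn name(&self) -> &'static str {
        "csv"
    }
    fn extensions(&self) -> &'static [&'static str] {
        &["csv", "tsv"]
    }
}

impl Reader for SsvToy {
    fn name(&self) -> &'static str {
        "ssv-toy"
    }
    // Deliberately overlaps with `CsvLike` on "csv" to show the rule.
    fn extensions(&self) -> &'static [&'static str] {
        &["ssv", "csv"]
    }
}

/// Hypothetical registry that dispatches on file extension.
struct Registry {
    readers: Vec<Box<dyn Reader>>,
}

impl Registry {
    fn new() -> Self {
        Registry { readers: Vec::new() }
    }

    fn register(&mut self, reader: Box<dyn Reader>) {
        self.readers.push(reader);
    }

    /// Return the first registered reader that claims the file's extension.
    fn reader_for(&self, file_name: &str) -> Option<&dyn Reader> {
        let ext = Path::new(file_name).extension()?.to_str()?.to_ascii_lowercase();
        for reader in &self.readers {
            if reader.extensions().iter().any(|e| *e == ext.as_str()) {
                return Some(reader.as_ref());
            }
        }
        None
    }
}
```

Registering CsvLike before SsvToy means a .csv file resolves to the "csv" reader even though both claim the extension, while .ssv still reaches the toy reader. Code that relies only on this rule, and not on a particular default order, stays valid across minor versions.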
1.0 will be cut once the API is exercised by at least one downstream production user.
When you need SQL
tabkit doesn't do SQL — that's a different abstraction (compute
over data) than what this crate aims at (introspect a file). When
your application needs queries, pair tabkit (for the
schema-and-samples view a UI / agent renders) with whatever SQL
crate fits your runtime — pure-Rust query engines, embedded
databases, ad-hoc Arrow pipelines all compose the same way:
read the file separately for queries, keep tabkit's Table for
schema awareness.
License
Dual-licensed under MIT OR Apache 2.0
at your option. SPDX: MIT OR Apache-2.0.
Status & roadmap
- v0.1: schema + samples. Engine + Reader trait + Table + Column + Value, calamine + csv backends, type inference (Bool / Integer / Float / Text / Unknown), header-or-not, sheet selection for multi-sheet XLSX, ragged row padding.
- v0.2: parquet feature. Apache Parquet read support via the parquet crate (default features off, no Arrow runtime). Same schema-and-samples surface, same type-inference rules.
- v0.3: typed dates. Value::Date(String) / Value::DateTime(String) with ISO-8601 string payloads (no chrono dep). All three readers emit typed dates for source values that carry date semantics. Value is now #[non_exhaustive] for forward-compat.
- v0.4: audit pass + 1.0 candidate. Stability commitments doc in lib.rs + README. #[non_exhaustive] already on every public struct + enum (added incrementally v0.1 → v0.3); #[must_use] already on every constructor + builder + accessor. Documentation-only release, no API-shape changes.
- v1.0: once exercised by at least one downstream production user. Sery Link is the canonical integration target; v1.0 ships once the API survives real use without breaking changes.
Why no SQL feature
Earlier roadmaps mentioned a v0.x SQL-engine feature for queries on top of any read table. We dropped that plan because:
- Dep weight. Embedded SQL engines are big — typically tens of MB compiled, sometimes a hundred. Adding one would multiply tabkit's current ~4 MB many times over and violate the "small focused kit" aesthetic.
- Scope creep. tabkit's contract is "schema + samples from a file." SQL queries are a fundamentally different abstraction: compute over data, not introspect a file.
- Engines bring their own readers. Most SQL engines have native CSV / Parquet readers; bundling one with tabkit would duplicate functionality — users would have two readers per format and not know which to pick.
- Composition is cleaner. When you want SQL, pair tabkit with the SQL crate that fits your runtime; they don't need to share types.
Issues, PRs, and design discussion welcome at https://github.com/seryai/tabkit/issues.
Used by
tabkit was extracted from the schema-extraction layer of
Sery Link, a privacy-respecting data network for the files
on your machines. If you use tabkit in your project, please open
a PR to add yourself here.
The kit family
tabkit is part of a coordinated suite of focused single-purpose
Rust crates extracted from Sery Link:
- mdkit: documents → markdown (PDF, DOCX, PPTX, HTML, IPYNB, OCR).
- scankit: walk + watch directory trees with exclude-glob and size-cap filters.
- tabkit: spreadsheets → schema + sample rows (this crate).
Use them together, use them separately. The trait surfaces are designed to compose without forcing a particular runtime.