scankit 0.3.0

Walk + watch + filter directory trees. The shared scanner Tauri / Iced / native desktop apps reach for when they need to enumerate user files.

Status: v0.3 — API stability candidate for 1.0. Feature coverage closed in v0.2 (one-shot walking via the walk feature, continuous filesystem-event monitoring via the watch feature, both with shared exclude-glob + size-cap filters). v0.3 freezes the public surface — see the stability section below for what's locked in. v0.3.x will iterate on examples + cookbook docs. 1.0 ships once the API is exercised by at least one downstream production user.

Why this exists

Every "index files on the user's machine" project (RAG tools, search apps, backup utilities, file watchers, document assistants) rebuilds the same five hundred lines of walkdir-with-excludes-and-size-cap-and-symlink-handling glue. Every project gets it slightly wrong:

  • Missed **/.git/** in the exclude set, scanned 200K objects in a .pack file.
  • Forgot to cap file sizes, OOM'd on a 50 GB sqlite database the user accidentally dropped in their Documents folder.
  • Followed a symlink loop and hung the indexer.
  • Rebuilt the GlobSet per-iteration, ate 30 % of CPU on glob compilation alone.

scankit ships these bits once, with the edge cases handled in one place. It's deliberately lower-level than a full indexer — it does not parse files, generate embeddings, or persist anything. It hands you ScanEntrys and gets out of the way. Pair it with mdkit for documents → markdown, with calamine / csv for tabular files, with whatever you like for the rest.
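
To make the glue concrete, here is a minimal std-only sketch of the kind of hand-rolled walker the list above describes. It is not scankit's implementation: walk_capped and its parameters are hypothetical, it matches directory names exactly rather than by glob, and it skips symlinks entirely instead of detecting loops.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of scankit): collect regular files under
/// `root` that are at most `max_bytes` long, skipping any directory whose
/// name appears in `skip_dirs`.
fn walk_capped(root: &Path, max_bytes: u64, skip_dirs: &[&str]) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            // symlink_metadata does not follow links, so a link that points
            // back up the tree is never descended into.
            let meta = fs::symlink_metadata(&path)?;
            let file_type = meta.file_type();

            if file_type.is_symlink() {
                continue;
            }
            if file_type.is_dir() {
                let name = entry.file_name();
                if skip_dirs.iter().any(|skip| name.to_str() == Some(*skip)) {
                    continue;
                }
                pending.push(path);
            } else if file_type.is_file() && meta.len() <= max_bytes {
                // The size check uses metadata only; the file is never opened.
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

Calling walk_capped(root, 50 * 1024 * 1024, &[".git", "node_modules"]) returns a sorted list of paths. Unlike the lazy iterator scankit exposes, this sketch aborts on the first I/O error and buffers every result in memory.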

Quick start

use scankit::{Scanner, ScanConfig};
use std::path::Path;

fn main() -> Result<(), scankit::Error> {
    // Configure the size cap and exclude globs, then build the scanner.
    let scanner = Scanner::new(
        ScanConfig::default()
            .max_file_size_bytes(50 * 1024 * 1024) // 50 MB cap
            .add_exclude("**/.git/**")?
            .add_exclude("**/node_modules/**")?
            .add_exclude("**/.DS_Store")?,
    )?;

    // Each item is a Result: log errors and keep walking.
    for result in scanner.walk(Path::new("/Users/me/Documents")) {
        match result {
            Ok(entry) => println!(
                "{}: {} bytes, .{}",
                entry.path.display(),
                entry.size_bytes,
                entry.extension,
            ),
            Err(e) => eprintln!("scan error: {e}"),
        }
    }
    Ok(())
}
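
Once entries are flowing, a consumer usually routes each one on its extension. The sketch below shows one way to do that with plain std. The Route enum, the route_for function, and the extension lists are all hypothetical and illustrative only. It assumes the extension arrives without a leading dot, as the quick start's format string suggests.

```rust
/// Hypothetical routing targets; scankit itself does not define these.
#[derive(Debug, PartialEq)]
enum Route {
    Document,
    Tabular,
    Skip,
}

/// Pick a handler from a file extension (no leading dot), ignoring ASCII case.
/// The extension lists are illustrative, not a statement of what any
/// downstream crate supports.
fn route_for(extension: &str) -> Route {
    match extension.to_ascii_lowercase().as_str() {
        "pdf" | "docx" | "html" => Route::Document,
        "csv" | "xlsx" | "ods" => Route::Tabular,
        _ => Route::Skip,
    }
}
```

Inside the walk loop this would be called as route_for(&entry.extension), assuming the field is a string type.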

Design principles

  1. Do one thing well. Walk + filter + emit ScanEntry. Anything richer (parse, embed, persist) is the consuming application's job.
  2. Send + Sync everywhere. A single Scanner shared across threads, a single GlobSet built once.
  3. No surprises in the iterator. Filtered-out entries are silently dropped. Errors come through as Err items in the stream — callers can log-and-continue or short-circuit.
  4. Forward-compat defaults. ScanConfig and ScanEntry are #[non_exhaustive] so we can add fields (content hash, inode, per-entry metadata) without breaking downstream callers.
  5. Honest dep budget. walkdir + globset + thiserror are the only required deps. notify is gated behind the watch feature.
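
Principle 3 leaves the error policy to the caller. The two policies it names can be sketched over any iterator of Result items using only std. The helper names keep_going and stop_on_error are hypothetical, not part of scankit.

```rust
/// Log-and-continue: keep every `Ok` item and report each `Err` to stderr.
fn keep_going<T, E: std::fmt::Display>(items: impl Iterator<Item = Result<T, E>>) -> Vec<T> {
    items
        .filter_map(|item| match item {
            Ok(value) => Some(value),
            Err(e) => {
                eprintln!("scan error: {e}");
                None
            }
        })
        .collect()
}

/// Short-circuit: stop at the first `Err` and return it to the caller.
fn stop_on_error<T, E>(items: impl Iterator<Item = Result<T, E>>) -> Result<Vec<T>, E> {
    // Collecting into a Result stops consuming the iterator at the first error.
    items.collect()
}
```

Either helper should accept the iterator returned by Scanner::walk directly, since its items are Results. The first suits background indexers; the second suits tools where a partial scan would be misleading.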

Feature flags

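| Feature        | Adds                                                             | Approx. cost     |
|----------------|------------------------------------------------------------------|------------------|
| walk (default) | One-shot directory walking                                       | ~250 KB compiled |
| watch          | Continuous filesystem-event monitoring on top of an initial walk | ~500 KB compiled |
| default        | walk                                                             | ~250 KB compiled |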

Examples

Runnable example programs live in examples/:

  • walk.rs — walk a directory tree with conventional excludes (.git, node_modules, .DS_Store, build outputs) and a 50 MB size cap. Run with:
    cargo run --example walk -- /Users/me/Documents
    
  • watch.rs — continuous scan: initial walk + live filesystem events. Requires the watch feature. Run with:
    cargo run --example watch --features watch -- /Users/me/Documents
    

Stability (v0.3+)

v0.3 is the API stability candidate for 1.0. The following surface is committed to and will only change with a major version bump:

  • Scanner construction + dispatch — new, walk, scan (under the watch feature), config. Future trait methods land with default impls so existing callers don't break.
  • ScanConfig field set + the builder methods (max_file_size_bytes, follow_symlinks, add_exclude).
  • ScanEntry, ScanEvent, Error field/variant sets. All #[non_exhaustive] so we can grow them without major bumps. Pattern-matchers must include a wildcard arm.
  • The lazy Iterator<Item = Result<ScanEntry>> shape from Scanner::walk.
  • The Iterator<Item = ScanEvent> lifecycle from Scanner::scan (Initial → InitialComplete → live Created / Modified / Deleted).
  • Feature flag names: walk, watch.
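
The wildcard-arm requirement and the event lifecycle can be illustrated together. The sketch below uses a stand-in Event enum, not scankit's ScanEvent. It assumes the lifecycle is Initial entries, then an InitialComplete sentinel, then live Created / Modified / Deleted events, and the PathBuf payloads are an assumption. The replay function is hypothetical.

```rust
use std::collections::BTreeSet;
use std::path::PathBuf;

/// Stand-in for an event type like scankit's `ScanEvent`.
/// Variant payloads are assumptions made for this sketch.
#[non_exhaustive]
enum Event {
    Initial(PathBuf),
    InitialComplete,
    Created(PathBuf),
    Modified(PathBuf),
    Deleted(PathBuf),
}

/// Fold a stream of events into the set of known paths, plus a flag that
/// says whether the initial enumeration has finished.
// The allow is only needed because this sketch defines the enum in the same
// crate, where the compiler can see the match is already exhaustive.
#[allow(unreachable_patterns)]
fn replay(events: impl IntoIterator<Item = Event>) -> (BTreeSet<PathBuf>, bool) {
    let mut known = BTreeSet::new();
    let mut caught_up = false;
    for event in events {
        match event {
            Event::Initial(path) | Event::Created(path) | Event::Modified(path) => {
                known.insert(path);
            }
            Event::Deleted(path) => {
                known.remove(&path);
            }
            Event::InitialComplete => caught_up = true,
            // Downstream crates must include this arm for a #[non_exhaustive]
            // enum; variants added in later releases land here.
            _ => {}
        }
    }
    (known, caught_up)
}
```

The wildcard arm is what lets a consumer compile unchanged when a new variant is added in a minor release.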

The following are implementation details and may change in minor versions:

  • Internal layout of Scanner / ScanWalkIter / ScanStream (private fields, helper methods).
  • Threading model of Scanner::scan (currently one short-lived initial-walk thread + the notify watcher's own threads).
  • Platform-specific event-translation rules (notify itself is platform-specific; we follow upstream).
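
For readers curious what "one short-lived initial-walk thread" can look like in general, here is a std-only sketch of the pattern: a producer thread sends the initial items over a channel, then a sentinel, then exits. This is an illustration of the pattern under stated assumptions, not scankit's internals. Msg and start are hypothetical names, and strings stand in for real entries.

```rust
use std::sync::mpsc;
use std::thread;

/// Messages on the channel: items from the initial pass, then one sentinel.
#[derive(Debug, PartialEq)]
enum Msg {
    Initial(String),
    InitialComplete,
}

/// Spawn a short-lived producer thread and return the receiving end.
fn start(initial: Vec<String>) -> mpsc::Receiver<Msg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for item in initial {
            // A send error means the receiver was dropped; stop early.
            if tx.send(Msg::Initial(item)).is_err() {
                return;
            }
        }
        let _ = tx.send(Msg::InitialComplete);
        // `tx` is dropped here. With no other sender alive, the receiver's
        // iteration ends after the sentinel.
    });
    rx
}
```

In a design with live events, a watcher would hold a second clone of the sender, so the stream would stay open after the sentinel instead of ending.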

1.0 will be cut once the API is exercised by at least one downstream production user.

License

Dual-licensed under MIT OR Apache 2.0 at your option. SPDX: MIT OR Apache-2.0.

Status & roadmap

  • v0.1: one-shot walking. Scanner + ScanConfig + ScanEntry, exclude-glob and size-cap filters, symlink handling, lazy iterator.
  • v0.2: watch feature. Scanner::scan returns a ScanStream (an Iterator<Item = ScanEvent>). Initial walk + continuous filesystem-event monitoring via notify; the same exclude + size-cap filters apply to both. An InitialComplete sentinel marks the boundary between the initial enumeration and live events.
  • v0.3: API stability candidate. Stability commitments doc in lib.rs + README. #[non_exhaustive] already on every public struct + enum (added incrementally v0.1 → v0.2); #[must_use] already on every constructor + builder + accessor. Documentation-only release with no API-shape changes.
  • v0.4: Renamed event variant (consolidate Deleted + Created pairs from notify's platform-specific rename shapes); extension-based dispatch helper.
  • v0.4: audit pass + first stable trait release (1.0 candidate).

Issues, PRs, and design discussion welcome at https://github.com/seryai/scankit/issues.

Used by

scankit was extracted from the folder-scanner of Sery Link, a privacy-respecting data network for the files on your machines. If you use scankit in your project, please open a PR to add yourself here.

Acknowledgements

  • walkdir: BurntSushi's battle-tested directory walker. Loop detection, permission handling, and Send-iterator semantics all come from there.
  • globset: also BurntSushi's. The compiled multi-pattern glob matcher that makes our exclude set efficient even with hundreds of patterns.
  • notify: the cross-platform filesystem-event crate that the watch loop (added in v0.2) is built on.
  • mdkit: sibling crate; scankit does "files → entries", mdkit does "documents → markdown".