# ZahirScan: Template-Based Content Compression & Metadata Extraction
> "Others will dream that I am mad, while I dream of the Zahir." — JL Borges, *Labyrinths*
A high-performance Rust CLI that uses probabilistic template mining to extract structure, patterns, and metadata from the formats below.
## Supported formats
- Logs: Plain text logs, JSON-formatted logs, structured log files
- Text Documents: TXT, Markdown (MD), plain text content
- Documents: DOCX, XLSX, PPTX, PDF, EPUB
- Databases: SQLite (.db, .sqlite, .sqlite3)
- Settings: INI (.ini, .cfg), TOML (.toml, .lock), YAML (.yaml, .yml), XML (.xml)
- Structured: CSV, JSON, HTML (.html, .htm)
- Archives: ZIP (.zip); TAR and compressed TAR (.tar, .tar.gz, .tgz, .tar.bz2, .tar.xz)
- Code/Scripts: Detected via linguist (e.g. .py, .rs, .js, .ts, .sh, Makefile, Dockerfile)
- Images: JPEG, PNG, GIF, WebP, BMP, TIFF
- Videos: MP4, MKV, AVI, MOV, WMV, FLV, WebM, M4V, 3GP, OGV
- Audio: MP3, FLAC, WAV, M4A, AAC, OGG, Opus, WMA, APE, DSD, DSF
## Key Features
- Template mining: Repeated patterns in logs/text → templates with placeholders
- Memory-mapped I/O: `memmap2`; a single open per path
- Adaptive parallelization: Phase 2 chunk sizes and worker usage are tuned from Phase 1 stats (file count, bytes, variance); Rayon parallel iteration with adaptive batching when the task count exceeds `workers × threshold_multiplier`
- Path batching: For large path sets, the pipeline runs in batches (batch size derived from the process fd limit) so mmaps are dropped between batches, avoiding "too many open files" on huge scans (e.g. 900k+ paths)
- Streamable output: `OutputSink::Collect` (default), `OutputSink::StreamOnly` (callback per file, no collection), or `OutputSink::Channel` (send on a channel); works with path batching
- Size reduction: Typically 80–95% smaller than the raw input while preserving structure and metadata
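To make the template-mining idea concrete, here is a minimal, illustrative sketch (not ZahirScan's actual algorithm): lines with the same token count are grouped, and any token position that varies within a group becomes a `<*>` placeholder.

```rust
use std::collections::HashMap;

/// Illustrative template miner: group lines by token count, then keep a
/// token only if it is identical at that position across the whole group.
fn mine_templates(lines: &[&str]) -> Vec<String> {
    let mut groups: HashMap<usize, Vec<Vec<&str>>> = HashMap::new();
    for line in lines {
        let toks: Vec<&str> = line.split_whitespace().collect();
        groups.entry(toks.len()).or_default().push(toks);
    }
    let mut templates = Vec::new();
    for (len, rows) in groups {
        let template: Vec<String> = (0..len)
            .map(|i| {
                let first = rows[0][i];
                if rows.iter().all(|r| r[i] == first) {
                    first.to_string()
                } else {
                    "<*>".to_string() // this position varies: placeholder
                }
            })
            .collect();
        templates.push(template.join(" "));
    }
    templates
}

fn main() {
    let lines = ["user alice logged in", "user bob logged in"];
    assert_eq!(mine_templates(&lines), vec!["user <*> logged in"]);
}
```

Real miners (and presumably ZahirScan's) add frequency thresholds and prefix-tree clustering, but the placeholder idea is the same.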
## Metadata extraction by format
| Metadata | Extracts |
|---|---|
| Media | Dimensions, codecs, bitrates for images, videos, audio |
| Document | DOCX: word count, character count, paragraph count, title, author, creation/modification dates, revision. XLSX: sheet count, sheet names, row/column counts per sheet, core properties. PPTX: slide count, core properties. PDF: page count, title, author, subject, creator, producer, creation/modification dates, PDF version, encryption status. EPUB: title, author, language, identifier, chapter count; writing footprint from spine body text. DRM-protected EPUBs (META-INF/encryption.xml present) are skipped for parsing and writing analysis. |
| Log | Byte count, line count, line ending (lf/crlf/cr), max line length, blank line count, has_timestamps (sample of first lines for ISO8601/epoch) |
| JSON | Byte count, line count, line ending, max line length, blank line count; root_type (array/object), root_array_length or root_object_key_count, max_depth, pretty_printed heuristic |
| CSV | Row/column counts, column names, data types, delimiter, quote/escape characters, null percentages, unique counts; type-specific statistics (numeric: min/max/mean/median/IQR/stdev, date: span/min/max, boolean: true percentage) |
| SQLite | Schema (tables, columns, types, constraints), primary keys, foreign keys, indexes, row counts, column statistics (null percentages, unique counts, numeric/text/boolean/blob/date) |
| TOML, YAML, INI, CFG | Recursive schema (scalar, table/mapping, array/sequence; INI: section→key→scalar, multi-line values), key count, max depth. TOML: section count. YAML: scalar/sequence/map counts. INI/.cfg: section count, comment count |
| Code/Scripts | script_type (linguist + optional shebang), byte_count, line_count; BOM, line_ending, trailing_newline, max_line_length, blank_line_count, indentation (single-pass scan) |
| ZIP | File count, entries (path, uncompressed/compressed size, detected type, modified, compression method), entry_type_counts; filters hidden OS files (e.g. __MACOSX, .DS_Store, Thumbs.db) |
| Archive (TAR family) | File count, entries (path, size), compressed_size, uncompressed_size |
| XML | Recursive schema (root→children with attributes; repeated siblings as arrays with union of all children), element count, attribute count, max depth, has_namespaces |
| HTML | Title, meta description, lang, charset, viewport; link/stylesheet/script/style counts; heading (h1–h6) and element counts (img, table, form, p, ul, ol, iframe, article, nav, section, header, footer, main); plain_text_len, word_count; writing footprint from body text |
| Writing Footprint | For text/markdown/html: vocabulary richness, sentence structure, template diversity, punctuation metrics. Uses two writing-analysis passes: (1) exact-pattern grouping (n-gram/phrase-based); (2) shape fallback (group by sentence length + end punctuation) when pass 1 yields no templates |
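The shape-fallback pass in the last row can be sketched as follows (illustrative only; the bucket width and key choice are assumptions, not ZahirScan's actual code): sentences are grouped by a coarse shape key built from a word-count bucket plus terminal punctuation.

```rust
use std::collections::HashMap;

/// Coarse "shape" of a sentence: 5-word length bucket + end punctuation.
/// (Hypothetical key; the real pass may use different features.)
fn shape_key(sentence: &str) -> (usize, char) {
    let words = sentence.split_whitespace().count();
    let bucket = words / 5;
    let end = sentence.trim_end().chars().last().unwrap_or('.');
    (bucket, end)
}

/// Group sentences by shape; used only when exact-pattern grouping
/// produced no templates.
fn group_by_shape<'a>(sentences: &[&'a str]) -> HashMap<(usize, char), Vec<&'a str>> {
    let mut groups: HashMap<(usize, char), Vec<&'a str>> = HashMap::new();
    for &s in sentences {
        groups.entry(shape_key(s)).or_default().push(s);
    }
    groups
}

fn main() {
    let sentences = ["Is it done?", "Are we there?", "It rained all day today."];
    let groups = group_by_shape(&sentences);
    // The two short questions share a shape; the statement stands alone.
    assert_eq!(groups[&(0, '?')].len(), 2);
    assert_eq!(groups.len(), 2);
}
```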
## Installation
ffprobe (FFmpeg) is optional; it is required only for video/audio metadata.
## Usage
### CLI
Output formats:
- Mode 1 (Templates): Minimal JSON with template patterns & schema, writing footprint (for text/markdown), media metadata (for images/videos/audio), code metadata (for code/script files), log metadata (for log files), JSON metadata (for JSON files), and document metadata (for DOCX/XLSX/PPTX)
- Mode 2 (Full): Mode 1 output plus:
- File statistics (size, line count, processing time)
- Size comparison (before/after)
## Library Usage
ZahirScan can be used as a Rust library to extract schemas (templates and metadata) from files programmatically.
**Basic: collect all outputs** (sketch; argument order is illustrative, see docs.rs for the exact signature):

```rust
use zahirscan::{extract_zahir, OutputSink};

// Default config, no file write; all results are collected in result.outputs
let result = extract_zahir(&["path/to/scan"], None, None, OutputSink::Collect)?;
// result.outputs, result.phase1_failed, result.phase2_failed
```
**Stream-only: callback or channel (no collection, bounded memory)** (sketch; the exact callback type may differ):

```rust
use std::sync::{Arc, Mutex};
use zahirscan::{extract_zahir, OutputSink};

let collected = Arc::new(Mutex::new(Vec::new()));
let c = Arc::clone(&collected);
let sink = OutputSink::StreamOnly(Box::new(move |output| c.lock().unwrap().push(output)));
let result = extract_zahir(&["path/to/scan"], None, None, sink)?;
// result.outputs is empty; results are in `collected`
```
**Streaming input (paths from a channel)** (sketch):

```rust
use std::sync::mpsc;
use zahirscan::{extract_zahir_from_stream, OutputSink};

let (tx, rx) = mpsc::channel();
// Producer sends paths, then drops tx
let result = extract_zahir_from_stream(rx, None, None, OutputSink::Collect)?;
```
Inputs: a single path (`&str`, `String`) or multiple paths (`&[&str]`, `Vec<String>`, etc.). Config: pass `Some(&config)` for a custom `RuntimeConfig`; `None` uses the embedded default. Full API (`ZahirScanResult`, `Output`, `OutputSink`, `Template`, `WritingFootprint`, and per-format metadata): docs.rs.
## Configuration
- CLI: The embedded default (`config.toml`) is merged with a user config in the app data dir if present; only keys set in the user file override.
  - User config path: Unix `~/.config/zahirscan/zahirscan.toml` (or `$XDG_CONFIG_HOME/zahirscan/zahirscan.toml`), Windows `%APPDATA%\zahirscan\zahirscan.toml`.
  - Run `zahirscan init` to write the embedded default out for editing; the CLI will use it as the overlay.
- Library: `extract_zahir(..., config: None, output_dir: None)` uses the embedded default only (no overlay). For a custom config pass `Some(&config)`; `RuntimeConfig::new()` is the embedded default (no file I/O). The overlay is only applied by the CLI via `setup::load_config()`.

Full schema: `config.toml`.
Adaptive batching and parallelization:
- Path batching: If the number of paths exceeds the batch size (derived from the process fd limit), the pipeline runs in batches; mmaps are dropped between batches so open file count stays bounded.
- Phase 2 adaptive chunking: Chunk sizes and "chunks per file" are derived from Phase 1 stats (file count, mean bytes, variance); the target is a neat multiple of `max_workers` for load balancing.
- Phase 2 parallel batching: When the task count exceeds `workers × threshold_multiplier`, Rayon uses `with_min_len(batch_size)` to avoid thread-pool saturation; otherwise full parallelism. `max_workers = 0` uses a sensible default (e.g. `num_cpus - 1`). No manual tuning is required for typical workloads.
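The threshold rule above can be sketched as a single decision function (the names mirror the config keys; the rule itself is illustrative, not ZahirScan's exact code):

```rust
/// Decide the minimum chunk length for the parallel iterator: above the
/// threshold, force batches (as with Rayon's `with_min_len`); below it,
/// every task can be its own unit of work.
fn min_chunk_len(task_count: usize, workers: usize, threshold_multiplier: usize, batch_size: usize) -> usize {
    if task_count > workers * threshold_multiplier {
        batch_size // many small tasks: batch to avoid pool saturation
    } else {
        1 // few tasks: full parallelism
    }
}

fn main() {
    // 4 workers × multiplier 8 = 32; 100 tasks exceeds it, 20 does not.
    assert_eq!(min_chunk_len(100, 4, 8, 16), 16);
    assert_eq!(min_chunk_len(20, 4, 8, 16), 1);
}
```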
File filtering (`[filter]`):
- `ignore_patterns`: skip files whose basename matches (exact: `.DS_Store`, `Thumbs.db`; suffix: `*.swp`, `*~`; prefix: `prefix*`)
- `ignore_hidden_files = true`: skip Unix hidden files (basename starts with `.`)
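A minimal sketch of the three basename pattern forms (exact, `*suffix`, `prefix*`); this mirrors the description above, not ZahirScan's actual matcher:

```rust
/// Match a basename against one ignore pattern: a leading `*` means
/// suffix match, a trailing `*` means prefix match, otherwise exact.
fn matches_pattern(basename: &str, pattern: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('*') {
        basename.ends_with(suffix)
    } else if let Some(prefix) = pattern.strip_suffix('*') {
        basename.starts_with(prefix)
    } else {
        basename == pattern
    }
}

fn main() {
    assert!(matches_pattern(".DS_Store", ".DS_Store")); // exact
    assert!(matches_pattern("notes.swp", "*.swp"));     // suffix
    assert!(matches_pattern("backup~", "*~"));          // suffix
    assert!(matches_pattern("tmp_file", "tmp*"));       // prefix
    assert!(!matches_pattern("file.txt", "*.swp"));
}
```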
## Architecture
Phase 1: Format detection, stats (lines/bytes/tokens), mmap per path, content-type classification. Runs in parallel over paths (Rayon).
Path batching: When path count exceeds the batch size (from the process fd limit), the pipeline runs Phase 1 + Phase 2 per chunk of paths, then drops the chunk (and mmaps) before the next chunk.
Phase 2: Metadata extraction per format, template mining, writing footprint (exact-pattern then shape fallback for text/markdown). Single Rayon pool; adaptive chunk sizing from Phase 1 stats and adaptive parallel batching (min chunk length) when task count is large.
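The batching loop can be sketched as follows (illustrative; the fd-limit headroom of 64 is an assumption, not ZahirScan's actual policy):

```rust
/// Derive a batch size from the process fd limit, leaving headroom for
/// stdio, the output file, etc. (assumed policy).
fn batch_size_from_fd_limit(fd_limit: usize) -> usize {
    fd_limit.saturating_sub(64).max(1)
}

/// Run the pipeline chunk by chunk so per-file state (including mmaps)
/// opened in one chunk is dropped before the next chunk begins.
fn process_in_batches(paths: &[&str]) -> usize {
    let batch = batch_size_from_fd_limit(1024);
    let mut processed = 0;
    for chunk in paths.chunks(batch) {
        // Phase 1 + Phase 2 would run here; mmaps are dropped when the
        // chunk's per-file state goes out of scope at end of iteration.
        processed += chunk.len();
    }
    processed
}

fn main() {
    let paths = vec!["a.log"; 2500];
    // With a 1024 fd limit, the 2500 paths are processed in 3 chunks.
    assert_eq!(process_in_batches(&paths), 2500);
}
```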
## Security
Read-only, non-invasive: path sanitization, existence checks, no source modification.
## License
Dual-licensed under MIT or Apache-2.0.