zahirscan 0.2.6

Token-efficient content compression for AI analysis using probabilistic template mining

ZahirScan: Template-Based Content Compression & Metadata Extraction


"Others will dream that I am mad, while I dream of the Zahir." (JL Borges, Labyrinths)

A high-performance Rust CLI tool that extracts templates and patterns from unstructured content, converting them into compact structured formats while preserving essential information. Provides comprehensive metadata extraction for many file types: media (images, video, audio), documents (DOCX, XLSX, PDF), databases (SQLite), settings (TOML, YAML, INI, XML), archives (ZIP, TAR), and code/scripts (via linguist).

Overview

ZahirScan uses probabilistic template mining to extract essential structure and patterns from content, and extracts metadata for the formats below.

Supported Formats:

  • Logs: Plain text logs, JSON-formatted logs, structured log files
  • Text Documents: TXT, Markdown (MD), plain text content
  • Documents: DOCX, XLSX, PDF
  • Databases: SQLite (.db, .sqlite, .sqlite3)
  • Settings: INI (.ini, .cfg), TOML (.toml, .lock), YAML (.yaml, .yml), XML (.xml)
  • Structured: CSV, HTML (.html, .htm)
  • Archives: ZIP (.zip); TAR and compressed TAR (.tar, .tar.gz, .tgz, .tar.bz2, .tar.xz)
  • Code/Scripts: Detected via linguist (e.g. .py, .rs, .js, .ts, .sh, Makefile, Dockerfile)
  • Images: JPEG, PNG, GIF, WebP, BMP, TIFF
  • Videos: MP4, MKV, AVI, MOV, WMV, FLV, WebM, M4V, 3GP, OGV
  • Audio: MP3, FLAC, WAV, M4A, AAC, OGG, Opus, WMA, APE, DSD, DSF

All outputs reduce size by 80-95% compared to raw content while preserving essential information.

Key Features

  • Template Mining: Automatically identifies repeated patterns in logs/text and extracts them as templates with placeholders
  • Zero-Copy Processing: Uses memory-mapped files (memmap2) to handle files larger than available RAM
  • Adaptive Parallelization: Automatically optimizes chunk sizes based on file statistics and CPU resources
  • Size Reduction: Typically reduces content size by 80-95% while preserving essential information
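The core idea behind template mining can be sketched in a few lines. This is a toy illustration, not the crate's actual algorithm: after whitespace tokenization, a token position that is identical across matching lines is kept as a literal, and a position that varies becomes a placeholder (shown here as `<*>`; the crate's placeholder format may differ).

```rust
// Toy frequency-based template mining: static token positions stay
// literal, varying positions become a placeholder.
fn mine_template(lines: &[&str]) -> String {
    let rows: Vec<Vec<&str>> = lines
        .iter()
        .map(|l| l.split_whitespace().collect())
        .collect();
    // Only compare positions present in every line.
    let width = rows.iter().map(|r| r.len()).min().unwrap_or(0);
    (0..width)
        .map(|i| {
            if rows.iter().all(|r| r[i] == rows[0][i]) {
                rows[0][i].to_string() // static field: keep the literal
            } else {
                "<*>".to_string() // dynamic field: placeholder
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let lines = [
        "2024-01-01 12:00:01 ERROR: disk full",
        "2024-01-01 12:00:02 ERROR: disk slow",
    ];
    // Timestamps and messages vary; the rest is structure.
    println!("{}", mine_template(&lines)); // 2024-01-01 <*> ERROR: disk <*>
}
```

Two lines collapse into one template plus example values, which is where the size reduction comes from.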

Metadata extraction by format

  • Media: Dimensions, codecs, bitrates for images, videos, audio
  • Documents: DOCX: word count, character count, paragraph count, title, author, creation/modification dates, revision. XLSX: sheet count, sheet names, row/column counts per sheet, core properties. PDF: page count, title, author, subject, creator, producer, creation/modification dates, PDF version, encryption status
  • CSV: Row/column counts, column names, data types, delimiter, quote/escape characters, null percentages, unique counts; type-specific statistics (numeric: min/max/mean/median/IQR/stdev; date: span/min/max; boolean: true percentage)
  • SQLite: Schema (tables, columns, types, constraints), primary keys, foreign keys, indexes, row counts, column statistics (null percentages, unique counts, numeric/text/boolean/blob/date)
  • TOML, YAML, INI, CFG: Recursive schema (scalar, table/mapping, array/sequence; INI: section→key→scalar, multi-line values), key count, max depth. TOML: section count. YAML: scalar/sequence/map counts. INI/.cfg: section count, comment count
  • Code/Scripts: script_type (linguist + optional shebang), byte_count, line_count; BOM, line_ending, trailing_newline, max_line_length, blank_line_count, indentation (single-pass scan)
  • ZIP: File count, entries (path, uncompressed/compressed size, detected type, modified, compression method), entry_type_counts; filters hidden OS files (e.g. __MACOSX, .DS_Store, Thumbs.db)
  • Archives (TAR family): File count, entries (path, size), compressed_size, uncompressed_size
  • XML: Recursive schema (root→children with attributes; repeated siblings as arrays with union of all children), element count, attribute count, max depth, has_namespaces
  • HTML: Title, meta description, lang, charset, viewport; link/stylesheet/script/style counts; heading (h1–h6) and element counts (img, table, form, p, ul, ol, iframe, article, nav, section, header, footer, main); plain_text_len, word_count; writing footprint from body text
  • Writing Footprint: For text/markdown/HTML: vocabulary richness, sentence structure, template diversity, punctuation metrics
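As a toy illustration of the per-column numeric statistics mentioned above (not the crate's implementation), min/max/mean for a column of string values might look like this; unparseable cells are treated as nulls and skipped:

```rust
// Illustrative per-column numeric statistics: parse what parses,
// skip nulls, return None for a column with no numeric values.
fn numeric_stats(values: &[&str]) -> Option<(f64, f64, f64)> {
    let nums: Vec<f64> = values.iter().filter_map(|v| v.parse().ok()).collect();
    if nums.is_empty() {
        return None;
    }
    let min = nums.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = nums.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let mean = nums.iter().sum::<f64>() / nums.len() as f64;
    Some((min, max, mean))
}

fn main() {
    let column = ["10", "20", "", "30"]; // "" is a null cell, skipped
    println!("{:?}", numeric_stats(&column)); // Some((10.0, 30.0, 20.0))
}
```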

Installation

# library
cargo add zahirscan

# CLI (from crates.io)
cargo install zahirscan

# Source archive (from GitHub Releases)
# Download from: https://github.com/thicclatka/zahirscan/releases

Note: ffprobe (from FFmpeg) is an optional dependency; it is required only for video/audio metadata extraction.

Documentation: docs.rs/zahirscan

Usage

CLI

$ zahirscan --help
Text file and log file parser using probabilistic template mining

Usage: zahirscan [OPTIONS]

Options:
  -i, --input <INPUT>...
          Input file(s) to parse (can specify multiple)

  -o, --output <OUTPUT>
          Output folder path (defaults to temp file if not specified).
          Creates filename.zahirscan.out in the folder for each input file

  -f, --full
          Output mode: full metadata (for development/debugging).
          Default is templates-only mode (minimal JSON with templates, writing footprint, and media metadata)

  -d, --dev
          Development mode: enables debug logging.
          Default is production mode (info level only).
          This disables progress bars if enabled

  -r, --redact
          Redact file paths in output (show only filename as ***/filename.ext).
          Useful for privacy when sharing output JSON

  -n, --no-media
          Skip media metadata extraction (audio, video, image).
          Faster processing when metadata is not needed

  -p, --progress
          Show progress bars during processing.
          This is ignored if dev mode is enabled.

  -h, --help
          Print help

Output formats:

  • Mode 1 (Templates): Minimal JSON with template patterns & schema, writing footprint (for text/markdown), media metadata (for images/videos/audio), code metadata (for code/script files), and document metadata (for DOCX/XLSX)
  • Mode 2 (Full): Mode 1 output plus:
    • File statistics (size, line count, processing time)
    • Size comparison (before/after)

Library Usage

ZahirScan can be used as a Rust library to extract schemas (templates and metadata) from files programmatically.

Basic Example

The extract_schema() function accepts flexible input types via the ToPathIter trait:

  • Single file: &str, &String, or String
  • Multiple files: &[&str], Vec<&str>, &[String], Vec<String>, or arrays like [&str; N]
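A simplified sketch of how a ToPathIter-style trait can accept both a single path and a collection (the crate's actual trait and method names may differ; `extract_schema_sketch` below is a stand-in, not the real `extract_schema()`):

```rust
// Sketch of an input-flexibility trait: both &str and &[&str] convert
// into a list of paths, so one generic function accepts either.
trait ToPathIter {
    fn to_paths(self) -> Vec<String>;
}

impl ToPathIter for &str {
    fn to_paths(self) -> Vec<String> {
        vec![self.to_string()]
    }
}

impl ToPathIter for &[&str] {
    fn to_paths(self) -> Vec<String> {
        self.iter().map(|s| s.to_string()).collect()
    }
}

// Stand-in for extract_schema(): a real implementation would parse each
// file; here we just echo the normalized path list.
fn extract_schema_sketch<P: ToPathIter>(input: P) -> Vec<String> {
    input.to_paths()
}

fn main() {
    println!("{:?}", extract_schema_sketch("app.log"));
    println!("{:?}", extract_schema_sketch(&["a.log", "b.csv"][..]));
}
```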

For a complete working example, see examples/basic_usage.rs. Run it with:

cargo run --example basic_usage -- <input-file>

Output Schema

The extract_schema() function returns Result<Vec<Output>>. Each Output object contains:

Always present (both modes):

  • templates: Vec<Template> - Extracted template patterns
  • source: String - Source file path
  • file_type: String - Detected file type (e.g., "Log", "Text", "Code", "Sqlite", "Image")

Mode 2 (Full) only (all optional):

  • line_count: Option<usize> - Number of lines in file
  • byte_count: Option<usize> - File size in bytes
  • token_count: Option<usize> - Estimated token count
  • processing_time_ms: Option<f64> - Processing duration
  • is_binary: Option<bool> - Whether file is binary
  • compression: Option<CompressionStats> - Compression metrics

Conditional Fields (present when applicable):

  • writing_footprint: Option<WritingFootprint> - Writing analysis for text/markdown files
  • image_metadata: Option<ImageMetadata> - Image metadata (dimensions, format, etc.)
  • video_metadata: Option<VideoMetadata> - Video metadata (codec, resolution, bitrate, etc.)
  • audio_metadata: Option<AudioMetadata> - Audio metadata (codec, bitrate, sample rate, etc.)
  • code_metadata: Option<CodeMetadata> - Code/script metadata (script_type, byte_count, line_count, BOM, line_ending, trailing_newline, max_line_length, blank_line_count, indentation)
  • csv_metadata: Option<CsvMetadata> - CSV metadata (row/column counts, data types, statistics)
  • sqlite_metadata: Option<SqliteMetadata> - SQLite database metadata (schema, tables, columns, indexes, statistics)
  • toml_metadata: Option<TomlMetadata> - TOML config metadata (recursive schema, section/key counts, depth)
  • zip_metadata: Option<ZipMetadata> - ZIP archive metadata (entries, sizes, detected types, compression; hidden OS files filtered)
  • archive_metadata: Option<ArchiveMetadata> - TAR / compressed TAR. Plain .tar: format, file_count, entries, compressed_size, uncompressed_size. Compressed archives (.tar.gz/.tar.xz/.tar.bz2) are read zero-copy without decompression: format and compressed_size only; .tar.gz additionally reports uncompressed_size from the gzip trailer, while file_count and entries are None
  • xml_metadata: Option<XmlMetadata> - XML structure metadata (recursive schema, element/attribute counts, namespaces)
  • html_metadata: Option<HtmlMetadata> - HTML metadata (title, meta, lang, charset, element counts, plain text/word count, writing footprint from body)
  • yaml_metadata: Option<YamlMetadata> - YAML metadata (recursive schema, key count, max depth, scalar/sequence/map counts)
  • ini_metadata: Option<IniMetadata> - INI/.cfg metadata (recursive schema section→key→scalar, section/key/comment counts, max depth, multi-line values)
  • pdf_metadata: Option<PdfMetadata> - PDF metadata (page count, document properties, etc.)
  • docx_metadata: Option<DocumentMetadata> - DOCX/XLSX metadata (word count, sheet count, title, author, dates, etc.)
  • pptx_metadata: Option<PptxMetadata> - PPTX metadata (slide count, core properties, etc.)
  • epub_metadata: Option<EpubMetadata> - EPUB metadata (title, creator, language, chapter count, etc.)
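The uncompressed_size reported for .tar.gz without decompression comes from the gzip ISIZE trailer: per RFC 1952, the last 4 bytes of a gzip stream hold the uncompressed length modulo 2^32, little-endian (so sizes above 4 GiB wrap). A minimal sketch of reading it:

```rust
// Read the gzip ISIZE trailer: the final 4 bytes of a gzip stream,
// little-endian, give the uncompressed size mod 2^32 (RFC 1952).
fn gzip_isize(raw: &[u8]) -> Option<u32> {
    let tail = raw.get(raw.len().checked_sub(4)?..)?;
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

fn main() {
    // Fake stream tail: 4 CRC32 bytes, then ISIZE = 1024 little-endian.
    let data = [0u8, 0, 0, 0, 0x00, 0x04, 0x00, 0x00];
    println!("{:?}", gzip_isize(&data)); // Some(1024)
}
```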

Template Structure

Each Template contains:

  • pattern: String - Template pattern with placeholders (e.g., "[DATE] [TIME] ERROR: [MESSAGE]")
  • count: usize - Number of lines matching this template
  • examples: BTreeMap<String, Vec<String>> - Example values for each placeholder

Writing Footprint Structure

WritingFootprint (for text/markdown files) contains:

  • vocabulary_richness: f64 - Unique words / total words (0.0-1.0)
  • avg_sentence_length: f64 - Average sentence length in words
  • punctuation: PunctuationMetrics - Punctuation usage statistics
  • template_diversity: usize - Number of unique template patterns
  • avg_entropy: f64 - Average entropy across templates (0.0-1.0)
  • svo_analysis: Option<SVOAnalysis> - Sentence structure analysis
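Two of these metrics are simple ratios and can be illustrated directly (this is not the crate's actual code; real sentence segmentation and word normalization are more involved):

```rust
use std::collections::BTreeSet;

// Illustrative writing-footprint metrics: vocabulary richness
// (unique words / total words) and average sentence length in words.
fn footprint(text: &str) -> (f64, f64) {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    let unique: BTreeSet<&str> = words.iter().map(|w| w.as_str()).collect();
    // Naive sentence split on terminal punctuation.
    let sentences = text
        .split(&['.', '!', '?'][..])
        .filter(|s| !s.trim().is_empty())
        .count()
        .max(1);
    let richness = unique.len() as f64 / words.len().max(1) as f64;
    let avg_sentence_len = words.len() as f64 / sentences as f64;
    (richness, avg_sentence_len)
}

fn main() {
    let (r, s) = footprint("The cat sat. The cat ran.");
    // 4 unique words of 6 total, 6 words over 2 sentences.
    println!("richness={r:.2}, avg_sentence_len={s:.1}");
}
```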

Compression Stats Structure

CompressionStats contains:

  • original_tokens: usize - Original content token count
  • compressed_tokens: usize - Compressed template token count
  • reduction_percent: f64 - Percentage reduction (0.0-100.0)
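The relationship between these fields is presumably the usual one (an assumed formula, consistent with the 80–95% figures quoted earlier):

```rust
// Assumed relationship between the CompressionStats fields:
// reduction_percent = (1 - compressed/original) * 100.
fn reduction_percent(original_tokens: usize, compressed_tokens: usize) -> f64 {
    if original_tokens == 0 {
        return 0.0; // avoid division by zero for empty inputs
    }
    100.0 * (1.0 - compressed_tokens as f64 / original_tokens as f64)
}

fn main() {
    println!("{}", reduction_percent(1024, 256)); // 75
}
```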

Configuration

See config.toml for configuration.

Adaptive Defaults:

  • max_workers = 0 uses a sensible default based on CPU cores
  • Phase 2 uses adaptive chunking based on Phase 1 file statistics (count/bytes/variance) and targets a neat multiple of max_workers
  • No manual batching configuration is required for typical workloads
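A hypothetical sketch of the "neat multiple of max_workers" idea (the real heuristic also weighs byte counts and variance, which are omitted here):

```rust
// Hypothetical adaptive chunk sizing: target a chunk count that is an
// exact multiple of the worker count so workers stay evenly loaded.
fn chunk_size(total_lines: usize, workers: usize, chunks_per_worker: usize) -> usize {
    let target_chunks = (workers * chunks_per_worker).max(1);
    // Ceiling division, clamped to at least one line per chunk.
    ((total_lines + target_chunks - 1) / target_chunks).max(1)
}

fn main() {
    // 1,000,000 lines, 8 workers, 4 chunks each -> 32 chunks of 31,250 lines.
    println!("{}", chunk_size(1_000_000, 8, 4)); // 31250
}
```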

File filtering ([filter]):

  • ignore_patterns: skip files whose basename matches (exact: .DS_Store, Thumbs.db; suffix: *.swp, *~; prefix: prefix*)
  • ignore_hidden_files = true: skip Unix hidden files (basename starts with .)
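The matching rules above can be sketched as follows (not the crate's actual matcher, just the three pattern forms plus the hidden-file check):

```rust
// Sketch of basename filtering: exact match, "*.suffix", "prefix*",
// and optional Unix hidden-file skipping.
fn is_ignored(basename: &str, patterns: &[&str], ignore_hidden: bool) -> bool {
    if ignore_hidden && basename.starts_with('.') {
        return true;
    }
    patterns.iter().any(|p| {
        if let Some(suffix) = p.strip_prefix('*') {
            basename.ends_with(suffix) // "*.swp", "*~"
        } else if let Some(prefix) = p.strip_suffix('*') {
            basename.starts_with(prefix) // "prefix*"
        } else {
            basename == *p // exact, e.g. "Thumbs.db"
        }
    })
}

fn main() {
    let pats = ["Thumbs.db", "*.swp", "*~", "tmp*"];
    println!("{}", is_ignored(".DS_Store", &pats, true)); // true (hidden)
    println!("{}", is_ignored("notes.swp", &pats, true)); // true (*.swp)
    println!("{}", is_ignored("report.md", &pats, true)); // false
}
```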

Architecture

Phase 1: Initial File Scan

  • File format detection and statistics collection (line count, byte count, token count)
  • Memory-mapped file access for text files (memmap2)
  • Content type determination (log vs. text/markdown vs. media)
  • Prepares tasks for Phase 2

Phase 2: Template Mining and Metadata Extraction

  • Metadata extraction (media, document, database, settings, structured, archives, code): see the Metadata extraction by format table above for what is extracted per format.
  • Template Mining: Frequency-based analysis to identify static vs. dynamic fields, extracts patterns as templates
  • Tokenization: Content-aware (whitespace for logs, JSON structure for JSON logs, sentence/paragraph for text/markdown)
  • Writing Footprint: Calculates vocabulary richness, sentence structure, punctuation metrics, template diversity for text/markdown
  • Parallel Processing: Single Rayon thread pool with adaptive chunk sizing based on Phase 1 statistics

Security

ZahirScan implements non-invasive file operations:

  • Path sanitization to prevent directory traversal attacks
  • File existence validation before processing
  • Read-only file access (never modifies source files)
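One common form of the traversal check is rejecting any path that contains a `..` component; the crate's actual sanitization may be stricter, but a minimal sketch looks like:

```rust
use std::path::{Component, Path};

// Illustrative directory-traversal check: reject paths containing a
// ".." component before opening them.
fn is_traversal_safe(path: &str) -> bool {
    Path::new(path)
        .components()
        .all(|c| !matches!(c, Component::ParentDir))
}

fn main() {
    println!("{}", is_traversal_safe("data/app.log"));     // true
    println!("{}", is_traversal_safe("../../etc/passwd")); // false
}
```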

TODO

  • Word universe for enhanced writing analysis (per-document vocabulary corpus with frequency distributions, word length statistics, and visualization data)

  • Improve template extraction for short literary texts (adaptive thresholds and pattern similarity merging for better pattern recognition in short documents)

  • Shared lightweight NLP utility layer for logs + writing analysis (normalization/tokenization/stats/redaction; optional similarity/embeddings later)

  • (Optional) Security hardening: output path validation + symlink checks

License

This project is licensed under the MIT OR Apache-2.0 dual license - see the LICENSE-MIT and LICENSE-APACHE files for details.