ZahirScan: Template-Based Content Compression & Media Metadata Extraction
"Others will dream that I am mad, while I dream of the Zahir." — JL Borges, Labyrinths
A high-performance Rust CLI tool that extracts templates and patterns from unstructured content, converting them into compact structured formats while preserving essential information. It also provides comprehensive metadata extraction for media files.
Note: This project is currently a work in progress, so use with caution.
Overview
ZahirScan uses probabilistic template mining to extract essential structure and patterns from content. The tool automatically adapts to different content types:
- Logs & Text: Identifies static vs. dynamic tokens, groups similar log lines into templates, extracts structural patterns and repeated phrases
- Media Files: Automatically detects and extracts comprehensive metadata for images, videos, and audio
Supported Formats:
- Logs: Plain text logs, JSON-formatted logs, structured log files
- Text Documents: TXT, Markdown (MD), plain text content
- Documents: DOCX (Word documents), XLSX (Excel spreadsheets), PDF (metadata extraction)
- CSV Files: CSV
- Images: JPEG, PNG, GIF, WebP, BMP, TIFF
- Videos: MP4, MKV, AVI, MOV, WMV, FLV, WebM, M4V, 3GP, OGV
- Audio: MP3, FLAC, WAV, M4A, AAC, OGG, Opus, WMA, APE, DSD, DSF
All outputs reduce size by 80-95% compared to raw content while preserving essential information.
Key Features
- Template Mining: Automatically identifies repeated patterns in logs/text and extracts them as templates with placeholders
- Media Metadata: Extracts comprehensive metadata for images, videos, and audio (dimensions, codecs, bitrates, etc.)
- Document Metadata: Extracts metadata from DOCX files (word count, character count, paragraph count, title, author, creation/modification dates, revision), XLSX files (sheet count, sheet names, row/column counts per sheet, core properties), and PDF files (page count, title, author, subject, creator, producer, creation/modification dates, PDF version, encryption status)
- CSV Metadata: Extracts row/column counts, column names, data types, delimiter, quote/escape characters, null percentages, unique counts, and type-specific statistics (numeric: min/max/mean/median/IQR/stdev, date: span/min/max, boolean: true percentage)
- Writing Footprint: For text/markdown files, provides vocabulary richness, sentence structure, template diversity metrics, and word universe analysis (when enabled)
- Zero-Copy Processing: Uses memory-mapped files (`memmap2`) to handle files larger than available RAM
- Adaptive Parallelization: Automatically optimizes chunk sizes based on file statistics and CPU resources
- Size Reduction: Typically reduces content size by 80-95% while preserving essential information
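As a rough illustration of the template-mining idea, here is a simplified sketch (not ZahirScan's actual probabilistic algorithm; it assumes every line tokenizes to the same number of whitespace-separated tokens):

```rust
// Simplified sketch of template mining: columns where all lines agree are
// static tokens; columns that vary across lines become placeholders.
// Assumes every line has the same token count (the real tool is adaptive).
fn mine_template(lines: &[&str]) -> String {
    let rows: Vec<Vec<&str>> = lines
        .iter()
        .map(|l| l.split_whitespace().collect())
        .collect();
    (0..rows[0].len())
        .map(|col| {
            let first = rows[0][col];
            if rows.iter().all(|r| r[col] == first) {
                first.to_string() // static token
            } else {
                "[VAR]".to_string() // dynamic field -> placeholder
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let lines = [
        "2024-01-01 ERROR: disk /dev/sda full",
        "2024-01-02 ERROR: disk /dev/sdb full",
    ];
    println!("{}", mine_template(&lines)); // "[VAR] ERROR: disk [VAR] full"
}
```

Compressing many similar lines into one such pattern plus a few example values is what yields the large size reductions quoted above.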
Installation
Prerequisites
- Rust (stable toolchain)
- ffprobe (optional, for video/audio metadata extraction): `ffprobe` is distributed with FFmpeg. Install FFmpeg: https://ffmpeg.org/download.html

Note: If `ffprobe` is not installed, ZahirScan will still work for text, log, and image files. Video and audio files will be processed, but metadata extraction will be skipped.
Build
```sh
# Build from source
cargo build --release
```
Usage
Quickstart Examples
```sh
# Process log files
# Process text/markdown files (extracts templates and writing footprint)
# Extract image metadata (dimensions, format, compression, chroma subsampling)
# Extract video metadata (requires ffprobe: codec, resolution, bitrate, frame_rate, etc.)
# Extract audio metadata (codec, bitrate, sample_rate, channels, bit_rate_mode for MP3)
# Extract CSV metadata (row/column counts, data types, statistics)
# Extract DOCX metadata (word count, character count, title, author, dates, revision)
# Extract XLSX metadata (sheet count, sheet names, row/column counts, core properties)
# Process multiple file types at once
# Skip media metadata for faster processing
# Redact file paths in output (privacy)
```
Command-Line Options
Output formats:
- Mode 1 (Templates): Minimal JSON with template patterns & schema, writing footprint (for text/markdown), media metadata (for images/videos/audio), and document metadata (for DOCX/XLSX)
- Mode 2 (Full): Mode 1 output plus:
- File statistics (size, line count, processing time)
- Size comparison (before/after)
Library Usage
ZahirScan can be used as a Rust library to extract schemas (templates and metadata) from files programmatically.
Basic Example
The `extract_schema()` function accepts flexible input types via the `ToPathIter` trait:
- Single file: `&str`, `&String`, or `String`
- Multiple files: `&[&str]`, `Vec<&str>`, `&[String]`, `Vec<String>`, or arrays like `[&str; N]`
```rust
use zahirscan::extract_schema; // crate path assumed from the project name

// Process a single file (accepts &str, &String, or String).
// File names here are illustrative.
let outputs = extract_schema("app.log")?;
println!("{:#?}", outputs);

// Process multiple files (accepts slices, vectors, or arrays)
let files = vec!["app.log", "server.log"];
let outputs = extract_schema(files)?;
for output in outputs {
    println!("{} templates extracted", output.templates.len());
}
```
For a complete working example, see `examples/basic_usage.rs`. Run it with `cargo run --example basic_usage`.
Output Schema
The `extract_schema()` function returns `Result<Vec<Output>>`. Each `Output` object contains:
Always Present:
- `templates: Vec<Template>` - Extracted template patterns
Mode 2 (Full) Only (all optional):
- `source: Option<String>` - Source file path
- `file_type: Option<String>` - Detected file type (e.g., "log", "text", "image", "video")
- `line_count: Option<usize>` - Number of lines in file
- `byte_count: Option<usize>` - File size in bytes
- `token_count: Option<usize>` - Estimated token count
- `processing_time_ms: Option<f64>` - Processing duration
- `is_binary: Option<bool>` - Whether file is binary
- `compression: Option<CompressionStats>` - Compression metrics
Conditional Fields (present when applicable):
- `writing_footprint: Option<WritingFootprint>` - Writing analysis for text/markdown files
- `image_metadata: Option<ImageMetadata>` - Image metadata (dimensions, format, etc.)
- `video_metadata: Option<VideoMetadata>` - Video metadata (codec, resolution, bitrate, etc.)
- `audio_metadata: Option<AudioMetadata>` - Audio metadata (codec, bitrate, sample rate, etc.)
- `csv_metadata: Option<CsvMetadata>` - CSV metadata (row/column counts, data types, statistics)
- `pdf_metadata: Option<PdfMetadata>` - PDF metadata (page count, document properties, etc.)
- `docx_metadata: Option<DocumentMetadata>` - DOCX/XLSX metadata (word count, sheet count, title, author, dates, etc.)
Template Structure
Each `Template` contains:
- `pattern: String` - Template pattern with placeholders (e.g., `"[DATE] [TIME] ERROR: [MESSAGE]"`)
- `count: usize` - Number of lines matching this template
- `examples: BTreeMap<String, Vec<String>>` - Example values for each placeholder
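In Rust terms, that maps to a struct shaped like the following (field names taken from this README; the crate's actual definition may differ in derives and visibility):

```rust
use std::collections::BTreeMap;

// Shape of a mined template as documented above.
#[derive(Debug)]
pub struct Template {
    pub pattern: String,                         // e.g. "[DATE] [TIME] ERROR: [MESSAGE]"
    pub count: usize,                            // number of lines matching this template
    pub examples: BTreeMap<String, Vec<String>>, // placeholder name -> sample values
}

fn main() {
    let mut examples = BTreeMap::new();
    examples.insert("MESSAGE".to_string(), vec!["disk full".to_string()]);
    let t = Template {
        pattern: "[DATE] [TIME] ERROR: [MESSAGE]".to_string(),
        count: 42,
        examples,
    };
    println!("{} lines matched {}", t.count, t.pattern);
}
```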
Writing Footprint Structure
`WritingFootprint` (for text/markdown files) contains:
- `vocabulary_richness: f64` - Unique words / total words (0.0-1.0)
- `avg_sentence_length: f64` - Average sentence length in words
- `punctuation: PunctuationMetrics` - Punctuation usage statistics
- `template_diversity: usize` - Number of unique template patterns
- `avg_entropy: f64` - Average entropy across templates (0.0-1.0)
- `svo_analysis: Option<SVOAnalysis>` - Sentence structure analysis
- `word_universe: Option<WordUniverse>` - Per-document vocabulary corpus for enhanced writing analysis (future enhancement)
Word Universe (when enabled) provides detailed vocabulary analysis:
- Unique word collection and frequency distributions
- Word length statistics (min, max, average, median, distribution)
- Most common and rare words
- Frequency histograms for visualization
- Enables better template extraction for short texts by identifying structural vs. content words
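Of the metrics above, `vocabulary_richness` is the simplest to make concrete: unique words divided by total words. A minimal sketch (ZahirScan's actual tokenization rules may differ):

```rust
use std::collections::BTreeSet;

// vocabulary_richness = unique words / total words, in [0.0, 1.0].
// Words are lowercased and stripped of surrounding punctuation here;
// the real tool's normalization may be more sophisticated.
fn vocabulary_richness(text: &str) -> f64 {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    if words.is_empty() {
        return 0.0;
    }
    let unique: BTreeSet<&String> = words.iter().collect();
    unique.len() as f64 / words.len() as f64
}

fn main() {
    // 4 words, 3 unique ("the" repeats) -> 0.75
    println!("{}", vocabulary_richness("The cat the hat"));
}
```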
Compression Stats Structure
`CompressionStats` contains:
- `original_tokens: usize` - Original content token count
- `compressed_tokens: usize` - Compressed template token count
- `reduction_percent: f64` - Percentage reduction (0.0-100.0)
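The `reduction_percent` field is derived from the two token counts; a sketch of that arithmetic:

```rust
// reduction_percent = 100 * (original - compressed) / original.
// saturating_sub guards against a compressed count larger than the original.
fn reduction_percent(original_tokens: usize, compressed_tokens: usize) -> f64 {
    if original_tokens == 0 {
        return 0.0;
    }
    let saved = original_tokens.saturating_sub(compressed_tokens);
    100.0 * saved as f64 / original_tokens as f64
}

fn main() {
    // 10_000 original tokens compressed to 1_200 -> 88% reduction
    println!("{}", reduction_percent(10_000, 1_200));
}
```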
Configuration
See config.toml for configuration.
Adaptive Defaults:
- `max_workers = 0` uses a sensible default based on CPU cores
- Phase 2 uses adaptive chunking based on Phase 1 file statistics (count/bytes/variance) and targets a neat multiple of `max_workers`
- No manual batching configuration is required for typical workloads
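The chunking idea can be pictured with a small hypothetical heuristic (the `chunks_per_worker` parameter and the fallback worker count are illustrative, not the crate's actual values):

```rust
// Hypothetical sketch of adaptive chunking: pick a chunk size so the total
// number of chunks is a multiple of the worker count, giving each worker
// several similarly sized chunks. Not ZahirScan's actual implementation.
fn chunk_size(total_lines: usize, max_workers: usize, chunks_per_worker: usize) -> usize {
    let workers = if max_workers == 0 { 8 } else { max_workers }; // 0 = auto (fallback assumed)
    let target_chunks = workers * chunks_per_worker;
    // Ceiling division, clamped to at least one line per chunk.
    ((total_lines + target_chunks - 1) / target_chunks).max(1)
}

fn main() {
    // 100_000 lines, 8 workers, 4 chunks each -> 32 chunks of 3_125 lines
    println!("{}", chunk_size(100_000, 8, 4));
}
```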
Architecture
Phase 1: Initial File Scan
- File format detection and statistics collection (line count, byte count, token count)
- Memory-mapped file access for text files (`memmap2`)
- Content type determination (log vs. text/markdown vs. media)
- Prepares tasks for Phase 2
Phase 2: Template Mining and Metadata Extraction
- Media Metadata: Extracts metadata for images (via the `image` crate) and videos/audio (via `ffprobe`)
- Document Metadata: Extracts metadata from DOCX/XLSX files (via the `zip` and `quick_xml` crates; `calamine` for XLSX row/column counts)
- Template Mining: Frequency-based analysis to identify static vs. dynamic fields, extracting patterns as templates
- Tokenization: Content-aware (whitespace for logs, JSON structure for JSON logs, sentence/paragraph for text/markdown)
- Writing Footprint: Calculates vocabulary richness, sentence structure, template diversity for text/markdown, with optional word universe analysis for enhanced pattern recognition
- Parallel Processing: Single Rayon thread pool with adaptive chunk sizing based on Phase 1 statistics
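The parallel fan-out in Phase 2 can be sketched without dependencies; `std::thread::scope` stands in here for the single Rayon pool the project actually uses:

```rust
use std::thread;

// Dependency-free sketch of chunked parallel processing: split the lines
// into chunks, count tokens per chunk on scoped threads, and sum the results.
// (ZahirScan uses one Rayon thread pool with adaptive chunk sizing instead.)
fn count_tokens_parallel(lines: &[&str], chunk_size: usize) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|l| l.split_whitespace().count()).sum::<usize>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let lines = ["a b c", "d e", "f"];
    println!("{}", count_tokens_parallel(&lines, 2)); // 6 tokens total
}
```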
Security
ZahirScan implements non-invasive file operations:
- Path sanitization to prevent directory traversal attacks
- File existence validation before processing
- Read-only file access (never modifies source files)
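The traversal check above amounts to rejecting paths that contain a parent-directory component. A minimal sketch (illustrative, not the project's exact code):

```rust
use std::path::{Component, Path};

// Reject any path containing a ".." component, which could escape the
// intended directory. Real-world sanitization may also canonicalize paths.
fn is_traversal_safe(path: &str) -> bool {
    !Path::new(path)
        .components()
        .any(|c| matches!(c, Component::ParentDir))
}

fn main() {
    println!("{}", is_traversal_safe("logs/app.log"));   // true
    println!("{}", is_traversal_safe("../etc/passwd")); // false
}
```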
TODO
- Word universe for enhanced writing analysis (per-document vocabulary corpus with frequency distributions, word length statistics, and visualization data)
- Improve template extraction for short literary texts (adaptive thresholds and pattern similarity merging for better pattern recognition in short documents)
- SQLite database metadata extraction (schema information, table/column metadata, database statistics)
License
This project is licensed under the MIT OR Apache-2.0 dual license - see the LICENSE-MIT and LICENSE-APACHE files for details.