Token-efficient content compression using probabilistic template mining.
This crate provides tools for analyzing text files, logs, and media files to extract templates and metadata. It uses probabilistic template mining to compress content while preserving structure, making it efficient for AI consumption.
§Main Workflow
- Phase 1: Initial file scan to collect statistics and prepare for processing
- Phase 2: Template mining and metadata extraction
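The two phases can be pictured with a small, self-contained sketch. This is not the crate's implementation: `ScanStats`, `scan`, `to_template`, and `mine` are hypothetical names, and the digit-masking heuristic is a deliberately simple stand-in for the probabilistic miner. It only illustrates the idea of scanning first, then collapsing similar lines into templates.

```rust
use std::collections::HashMap;

/// Phase 1 (hypothetical): collect cheap statistics in a single pass.
#[derive(Debug, PartialEq)]
struct ScanStats {
    line_count: usize,
    byte_count: usize,
}

fn scan(text: &str) -> ScanStats {
    ScanStats {
        line_count: text.lines().count(),
        byte_count: text.len(),
    }
}

/// Replace every token that contains a digit with a `<*>` placeholder.
/// A simple stand-in heuristic, not the crate's probabilistic miner.
fn to_template(line: &str) -> String {
    line.split_whitespace()
        .map(|tok| if tok.chars().any(|c| c.is_ascii_digit()) { "<*>" } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Phase 2 (hypothetical): group lines by template and count occurrences.
fn mine(text: &str) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for line in text.lines().filter(|l| !l.trim().is_empty()) {
        *counts.entry(to_template(line)).or_insert(0) += 1;
    }
    let mut templates: Vec<_> = counts.into_iter().collect();
    // Most frequent first; ties broken alphabetically for stable output.
    templates.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    templates
}

fn main() {
    let log = "user 42 logged in\nuser 7 logged in\ndisk usage at 91%\n";
    let stats = scan(log);
    println!("{stats:?}");
    for (template, count) in mine(log) {
        println!("{count}x {template}");
    }
}
```

Storing each template once with a count, rather than every raw line, is what makes the output compact for downstream consumers.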
§API Example
use zahirscan::{extract_zahir, OutputMode};
// Process with default config (no overlay)
let result = extract_zahir("file.log", OutputMode::Full, None, None, &zahirscan::OutputSink::Collect)?;
// Process with explicit config and optional output dir (None = no file write)
let config = zahirscan::RuntimeConfig::new();
let result = extract_zahir(
    vec!["file1.log", "file2.log"],
    OutputMode::Templates,
    Some(&config),
    None,
    &zahirscan::OutputSink::Collect,
)?;
§Stream-only example (no collection, bounded memory)
Use OutputSink::StreamOnly to receive each (path, Output) in a callback; the engine does not collect.
use std::sync::{Arc, Mutex};
use zahirscan::{extract_zahir, Output, OutputMode, OutputSink};
let collected = Arc::new(Mutex::new(Vec::<(String, zahirscan::Output)>::new()));
let collected_clone = Arc::clone(&collected);
let sink = OutputSink::StreamOnly(Box::new(move |path, out| {
    collected_clone.lock().unwrap().push((path, out));
}));
let result = extract_zahir(
    ["file1.log", "file2.log"],
    OutputMode::Full,
    None,
    None,
    &sink,
)?;
// result.outputs is empty; collected has each (path, Output) as it completed
§Streaming input
Use extract_zahir_from_stream when paths come from a channel (e.g. nefaxer’s on_entry callback). The producer sends path strings; when the sender is dropped, zahirscan drains the receiver and runs the pipeline. Pass OutputSink::Collect to get all outputs in the result, or OutputSink::StreamOnly / OutputSink::Channel to stream them out.
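The "drop the sender to finish" contract is standard std::sync::mpsc behavior. The following is a minimal sketch using only the standard library: `drain_paths` and `produce_and_drain` are hypothetical helpers and do not show how zahirscan reads the channel internally. It shows why the receiving side only completes once every sender has been dropped.

```rust
use std::sync::mpsc;
use std::thread;

/// Drain every path from the receiver; returns once all senders are dropped.
fn drain_paths(rx: &mpsc::Receiver<String>) -> Vec<String> {
    // `iter()` blocks waiting for the next message and ends when the channel closes.
    rx.iter().collect()
}

/// Send `paths` from a producer thread, then collect them on the calling thread.
fn produce_and_drain(paths: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    let producer = thread::spawn(move || {
        for p in paths {
            // A send error only occurs if the receiver is already gone.
            tx.send(p).ok();
        }
        // `tx` is dropped at the end of this closure, which closes the channel.
    });
    let received = drain_paths(&rx);
    producer.join().expect("producer thread panicked");
    received
}

fn main() {
    let got = produce_and_drain(vec!["a.log".to_string(), "b.log".to_string()]);
    println!("{got:?}");
}
```

If any clone of the sender is kept alive, the drain never finishes, so the producer side should drop every sender once it is done. With zahirscan itself, the receiving side is a single call: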
use std::sync::mpsc;
use zahirscan::{extract_zahir_from_stream, OutputMode, OutputSink};
let (tx, rx) = mpsc::channel();
// In another thread: run nefaxer with on_entry: Some(|e| { tx.send(e.path.to_string_lossy().into_owned()).ok(); });
// Then drop(tx). This thread:
let result = extract_zahir_from_stream(&rx, OutputMode::Full, None, None, &OutputSink::Collect)?;
Re-exports§
pub use config::DEFAULT_CONFIG_TOML;
pub use config::RuntimeConfig;
pub use config::RuntimeFlags;
pub use engine::chunking::ProcessingTask;
pub use engine::chunking::calculate_adaptive_chunking;
pub use engine::orchestrator::run_pipeline;
pub use engine::phases::mining::phase2_mining;
pub use engine::phases::scanning::phase1_scan;
pub use parsers::FileType;
pub use parsers::ParseResult;
pub use parsers::extract_templates;
pub use parsers::initial_file_scan;
pub use setup::OutputSink;
pub use results::*;
pub use utils::*;
Modules§
- analysis: Analysis utilities for text: sentence extraction, n-grams, writing footprint, SVO, pivot extraction. Shape/coarse fallback (by word count + end type) is in plain_text when exact-pattern yields no templates.
- config: Configuration for ZahirScan: TOML structs, runtime config, loading, and overlay merge.
- engine: Scan engine: orchestration, chunking, path iteration, progress, and file-type/format utilities.
- parsers: Probabilistic template mining and parsing. Main handler that routes to log or text parsers.
- results: Result structures for template mining and parsing.
- setup: Setup utilities for ZahirScan: logger, config, CLI flag application, input/output path resolution.
- utils
Macros§
- cached_static: Macro to create a lazily-initialized static value using OnceLock.
- copy_metadata_fields: Macro to copy all metadata fields from ParseResult to Output. Usage: copy_metadata_fields!(from_stats, to_output). This ensures all metadata fields are copied without having to manually list them.
- extract_metadata_with_fallback: Macro to extract metadata with error handling and fallback. Usage: extract_metadata_with_fallback!(stats.field, extract_fn, stats, MetadataType, type_name_expr)
- impl_minimal_fallback: Macro to implement the MinimalFallback trait for metadata types.
- no_template_mining: Defines extract_X_templates that returns empty_mining_result(stats). Use for parsers that do metadata only and no template mining. Usage: crate::no_template_mining!(extract_toml_templates, "TOML is config; schema covers structure. No template mining.")
- ok_anyhow: Wrap an infallible value in Result::Ok with anyhow::Error as the error type. Use with process_with_metadata! when metadata extractors return T instead of Result<T, _>.
- process_with_metadata: Macro to handle metadata extraction (unless skipped), then run the templates extractor. Use from any parser mod: crate::process_with_metadata!(stats, mmap, config, field_name, extract_meta_call, MetadataType, FileType::X, extract_templates_call)
- serialize_optional: Helper macro to conditionally serialize optional fields. Skips serialization if the field is None.
- validate_min: Validate that a numeric field is >= min; return Err with a clear message otherwise.
- validate_range_01: Validate that an f64 is in [0.0, 1.0]; return Err with a clear message otherwise.
- with_progress: Macro to execute a function and update the progress bar. Usage: with_progress!(pb, function_call(...)). Optimized: only calls update_progress_bar if pb is Some.
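Two of the patterns named in this list, a lazily-initialized static backed by OnceLock and an early-return range check, can be sketched with the standard library alone. The macros below (`lazy_static_fn!`, `check_unit_range!`) and the functions that use them are hypothetical illustrations. They are not the definitions or signatures of cached_static or validate_range_01, and they use String as the error type purely for simplicity.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a `cached_static`-style macro: defines a function
/// that initializes its value once and returns a `&'static` reference afterwards.
macro_rules! lazy_static_fn {
    ($name:ident, $ty:ty, $init:expr) => {
        fn $name() -> &'static $ty {
            static CELL: OnceLock<$ty> = OnceLock::new();
            CELL.get_or_init(|| $init)
        }
    };
}

/// Hypothetical stand-in for a `validate_range_01`-style check: returns early
/// with an error message when the value is outside [0.0, 1.0] (NaN included).
macro_rules! check_unit_range {
    ($value:expr, $field:expr) => {
        if !(0.0..=1.0).contains(&$value) {
            return Err(format!("{} must be in [0.0, 1.0], got {}", $field, $value));
        }
    };
}

lazy_static_fn!(known_extensions, Vec<&'static str>, vec!["log", "txt", "toml"]);

fn validate_threshold(threshold: f64) -> Result<f64, String> {
    check_unit_range!(threshold, "threshold");
    Ok(threshold)
}

fn main() {
    println!("{:?}", known_extensions());
    println!("{:?}", validate_threshold(0.25));
    println!("{:?}", validate_threshold(1.5));
}
```

Wrapping these patterns in macros keeps call sites to one line, at the cost of hiding the early `return`, so such macros are only usable inside functions that return a compatible Result.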
Constants§
Functions§
- extract_zahir: Single entry point: extract templates and metadata from one or more files.
- extract_zahir_from_stream: Extract templates and metadata from paths received over a channel (streaming input).