
Crate zahirscan

Token-efficient content compression using probabilistic template mining.

This crate provides tools for analyzing text files, logs, and media files to extract templates and metadata. It uses probabilistic template mining to compress content while preserving structure, making it efficient for AI consumption.
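
As a rough illustration of what "template mining" means in general, the sketch below collapses log lines onto shared templates by masking variable-looking tokens and counting how often each template occurs. This is not zahirscan's algorithm: the digit-based masking rule, the `<*>` placeholder, and the function names `to_template` and `mine_templates` are all assumptions made for this example, and the sketch is deterministic rather than probabilistic.

```rust
use std::collections::HashMap;

/// Replace every token that contains a digit with a `<*>` placeholder.
/// The masking rule is an assumption for illustration only.
fn to_template(line: &str) -> String {
    line.split_whitespace()
        .map(|tok| if tok.chars().any(|c| c.is_ascii_digit()) { "<*>" } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Count how many input lines collapse onto each template.
fn mine_templates(lines: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for line in lines {
        *counts.entry(to_template(line)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let lines = [
        "user 42 logged in from 10.0.0.1",
        "user 7 logged in from 10.0.0.9",
        "cache miss for key a1b2",
    ];
    // Sort by descending count, then alphabetically, for stable output.
    let mut templates: Vec<_> = mine_templates(&lines).into_iter().collect();
    templates.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    for (template, count) in templates {
        println!("{count} x {template}");
    }
}
```

The two login lines share one template, so the output needs one entry plus a count instead of two full lines; that is the sense in which templates compress content while keeping its structure.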

§Main Workflow

  1. Phase 1: Initial file scan to collect statistics and prepare for processing
  2. Phase 2: Template mining and metadata extraction

§API Example

use zahirscan::{extract_zahir, OutputMode};

// Process with default config (no overlay)
let result = extract_zahir("file.log", OutputMode::Full, None, None, &zahirscan::OutputSink::Collect)?;

// Process with explicit config and optional output dir (None = no file write)
let config = zahirscan::RuntimeConfig::new();
let result = extract_zahir(
    vec!["file1.log", "file2.log"],
    OutputMode::Templates,
    Some(&config),
    None,
    &zahirscan::OutputSink::Collect,
)?;

§Stream-only example (no collection, bounded memory)

Use OutputSink::StreamOnly to receive each (path, Output) in a callback; the engine does not collect.

use std::sync::{Arc, Mutex};
use zahirscan::{extract_zahir, Output, OutputMode, OutputSink};

let collected = Arc::new(Mutex::new(Vec::<(String, Output)>::new()));
let collected_clone = Arc::clone(&collected);
let sink = OutputSink::StreamOnly(Box::new(move |path, out| {
    collected_clone.lock().unwrap().push((path, out));
}));
let result = extract_zahir(
    ["file1.log", "file2.log"],
    OutputMode::Full,
    None,
    None,
    &sink,
)?;
// result.outputs is empty; collected has each (path, Output) as it completed
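
To make the difference between collecting and streaming concrete, here is a minimal, std-only sketch of the pattern. The `Sink` enum, the `run` function, and the use of a path's length as its "output" are hypothetical stand-ins invented for this example; they are not zahirscan's `OutputSink` or its engine.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a sink type; not zahirscan's `OutputSink`.
enum Sink {
    /// The engine keeps every output and returns them at the end.
    Collect,
    /// The engine hands each output to the callback and keeps nothing.
    StreamOnly(Box<dyn Fn(String, usize) + Send + Sync>),
}

/// Toy "engine": the output for each path is simply the path's length.
fn run(paths: &[&str], sink: &Sink) -> Vec<(String, usize)> {
    let mut collected = Vec::new();
    for path in paths {
        let output = path.len();
        match sink {
            Sink::Collect => collected.push((path.to_string(), output)),
            Sink::StreamOnly(callback) => callback(path.to_string(), output),
        }
    }
    collected
}

fn main() {
    let seen = Arc::new(Mutex::new(Vec::<(String, usize)>::new()));
    let seen_clone = Arc::clone(&seen);
    let sink = Sink::StreamOnly(Box::new(move |path, out| {
        seen_clone.lock().unwrap().push((path, out));
    }));
    // With the streaming sink the returned vector stays empty.
    let returned = run(&["a.log", "bb.log"], &sink);
    println!("returned: {}, streamed: {}", returned.len(), seen.lock().unwrap().len());
}
```

With a streaming sink the engine never holds more than one output at a time, which is what keeps memory bounded; the caller decides whether to store, forward, or discard each item.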

§Streaming input

Use extract_zahir_from_stream when paths arrive over a channel (e.g. from nefaxer’s on_entry callback). The producer sends path strings; once the sender is dropped, zahirscan drains the receiver and runs the pipeline. Pass OutputSink::Collect to get all outputs in the result, or OutputSink::StreamOnly / OutputSink::Channel to stream them out.

use std::sync::mpsc;
use zahirscan::{extract_zahir_from_stream, OutputMode, OutputSink};

let (tx, rx) = mpsc::channel();
// In another thread: run nefaxer with on_entry: Some(|e| { tx.send(e.path.to_string_lossy().into_owned()).ok(); });
// Then drop(tx). This thread:
let result = extract_zahir_from_stream(&rx, OutputMode::Full, None, None, &OutputSink::Collect)?;
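
The producer side of this handoff can be sketched with the standard library alone. The example below shows only the channel mechanics (send paths, drop the sender, drain the receiver); `drain_paths` is a hypothetical helper for illustration and does not call zahirscan or nefaxer.

```rust
use std::sync::mpsc;
use std::thread;

/// Drain every path from the receiver. `Receiver::iter` blocks until a
/// value arrives and ends once all senders have been dropped.
fn drain_paths(rx: &mpsc::Receiver<String>) -> Vec<String> {
    rx.iter().collect()
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // Producer thread: sends path strings, then drops `tx` on exit.
    let producer = thread::spawn(move || {
        for name in ["file1.log", "file2.log"] {
            tx.send(name.to_string()).expect("receiver is still alive");
        }
        // `tx` goes out of scope here, which closes the channel.
    });

    // Consumer: returns only after the channel is closed.
    let paths = drain_paths(&rx);
    producer.join().expect("producer thread panicked");
    println!("received {} paths", paths.len());
}
```

If any clone of the sender stays alive, the drain never finishes, so make sure every `tx` handle is dropped once the producer is done.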

Re-exports§

pub use config::DEFAULT_CONFIG_TOML;
pub use config::RuntimeConfig;
pub use config::RuntimeFlags;
pub use engine::chunking::ProcessingTask;
pub use engine::chunking::calculate_adaptive_chunking;
pub use engine::orchestrator::run_pipeline;
pub use engine::phases::mining::phase2_mining;
pub use engine::phases::scanning::phase1_scan;
pub use parsers::FileType;
pub use parsers::ParseResult;
pub use parsers::extract_templates;
pub use parsers::initial_file_scan;
pub use setup::OutputSink;
pub use results::*;
pub use utils::*;

Modules§

analysis
Analysis utilities for text: sentence extraction, n-grams, writing footprint, SVO, pivot extraction. The shape/coarse fallback (by word count and end type), used when exact-pattern matching yields no templates, is in plain_text.
config
Configuration for ZahirScan: TOML structs, runtime config, loading, and overlay merge.
engine
Scan engine: orchestration, chunking, path iteration, progress, and file-type/format utilities.
parsers
Probabilistic template mining and parsing. Main handler that routes to log or text parsers.
results
Result structures for template mining and parsing
setup
Setup utilities for ZahirScan: logger, config, CLI flag application, input/output path resolution.
utils

Macros§

cached_static
Macro to create a lazily-initialized static value using OnceLock
copy_metadata_fields
Macro to copy all metadata fields from ParseResult to Output, so that no field has to be listed manually. Usage: copy_metadata_fields!(from_stats, to_output)
extract_metadata_with_fallback
Macro to extract metadata with error handling and fallback. Usage: extract_metadata_with_fallback!(stats.field, extract_fn, stats, MetadataType, type_name_expr)
impl_minimal_fallback
Macro to implement MinimalFallback trait for metadata types
no_template_mining
Defines extract_X_templates that returns empty_mining_result(stats). Use for parsers that do metadata only and no template mining. Usage: crate::no_template_mining!(extract_toml_templates, "TOML is config; schema covers structure. No template mining.")
ok_anyhow
Wrap an infallible value in Result::Ok with anyhow::Error as the error type. Use with process_with_metadata! when metadata extractors return T instead of Result<T, _>.
process_with_metadata
Macro to handle metadata extraction (unless skipped) and then run the templates extractor. Use from any parser module: crate::process_with_metadata!(stats, mmap, config, field_name, extract_meta_call, MetadataType, FileType::X, extract_templates_call)
serialize_optional
Helper macro to conditionally serialize optional fields. Skips serialization if the field is None.
validate_min
Validate a numeric field is >= min; return Err with clear message otherwise.
validate_range_01
Validate that an f64 is in [0.0, 1.0]; return Err with a clear message otherwise.
with_progress
Macro to execute a function and update the progress bar. Usage: with_progress!(pb, function_call(...)). Optimized: only calls update_progress_bar if pb is Some.
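
Two of the patterns named in the macro list above, a lazily-initialized static backed by OnceLock and a [0.0, 1.0] range check, can be written as plain functions. The sketch below is not the expansion of cached_static or validate_range_01; the names `known_extensions` and `check_range_01`, the extension list, and the error message format are assumptions made for illustration.

```rust
use std::sync::OnceLock;

/// Build a static list the first time it is requested and reuse it after.
/// The contents are placeholder values for this example.
fn known_extensions() -> &'static Vec<&'static str> {
    static CELL: OnceLock<Vec<&'static str>> = OnceLock::new();
    CELL.get_or_init(|| vec!["log", "txt", "toml"])
}

/// Return an error unless `value` lies in [0.0, 1.0]. NaN is rejected,
/// because it is not contained in any range.
fn check_range_01(name: &str, value: f64) -> Result<(), String> {
    if (0.0..=1.0).contains(&value) {
        Ok(())
    } else {
        Err(format!("{name} must be in [0.0, 1.0], got {value}"))
    }
}

fn main() {
    println!("extensions: {:?}", known_extensions());
    match check_range_01("threshold", 1.5) {
        Ok(()) => println!("threshold accepted"),
        Err(msg) => println!("rejected: {msg}"),
    }
}
```

A macro form of either pattern mainly saves the repeated `static` declaration or the repeated `if`/`Err` boilerplate at each call site.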

Constants§

PKG_NAME

Functions§

extract_zahir
Single entry point: extract templates and metadata from one or more files.
extract_zahir_from_stream
Extract templates and metadata from paths received over a channel (streaming input).