scribe-patterns 0.5.1

Advanced pattern matching and search algorithms for Scribe
scribe-patterns

Advanced pattern matching and filtering for Scribe repository analysis.

Overview

scribe-patterns provides pattern matching for file selection, filtering, and search operations. It handles glob patterns, regex matching, .gitignore semantics, and custom ignore rules, with an emphasis on performance and correct edge-case handling.

Key Features

Glob Pattern Matching

  • Standard glob syntax: *, **, ?, [abc], {a,b,c}
  • Directory-aware matching: Handles **/ for recursive directory traversal
  • Negative patterns: !pattern to exclude specific files
  • Case sensitivity control: Case-insensitive matching on Windows by default

Gitignore Semantics

  • .gitignore parsing: Full compatibility with Git's ignore rules
  • Directory negation: Properly handles ! negation patterns
  • Relative vs absolute paths: Distinguishes /pattern from pattern
  • Trailing slashes: Directory-only patterns with /
  • Comment support: Lines starting with # are ignored

Custom Ignore Files

  • .scribeignore: Scribe-specific ignore patterns
  • Multiple ignore files: Hierarchical ignore file processing
  • Override precedence: Later patterns override earlier ones
  • Inheritance: Child directories inherit parent ignore rules
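
The override rule above ("later patterns override earlier ones") can be pictured as a last-match-wins scan. The sketch below is illustrative only: the `Rule` type and `is_ignored` function are hypothetical names, and plain prefix matching stands in for real glob matching. Rules from a parent directory would come first in the slice, followed by rules from child directories.

```rust
/// Hypothetical rule type: a path prefix plus a flag for `!pattern` rules.
struct Rule {
    prefix: &'static str,
    negated: bool,
}

/// Returns true if `path` is ignored. The LAST matching rule decides,
/// so later rules override earlier ones.
fn is_ignored(rules: &[Rule], path: &str) -> bool {
    rules
        .iter()
        .rev() // scan from the most recently added rule
        .find(|r| path.starts_with(r.prefix))
        .map_or(false, |r| !r.negated) // no matching rule: not ignored
}
```

With the rules `target/` followed by `!target/debug/important.txt`, a path such as `target/debug/app` is ignored while `target/debug/important.txt` is kept, because the later negated rule matches last.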

Performance Optimizations

  • Compiled pattern sets: Pre-compile globs into efficient matchers
  • Aho-Corasick for literals: Fast multi-pattern matching for literal strings
  • Regex caching: Compiled regex patterns are cached
  • Early returns: Short-circuit evaluation for common cases

Architecture

Pattern Input → Parser → Compiled Matcher → Match Engine
     ↓            ↓            ↓                ↓
 Glob/Regex   Validate   globset/regex    Apply to Paths
  Strings     Syntax     Compilation       Fast Matching

Core Components

PatternSet

Collection of patterns with unified matching interface:

  • Globs: File name patterns like *.rs, **/*.py
  • Regex: Complex patterns using regular expressions
  • Literals: Exact string matches (optimized with Aho-Corasick)
  • Negations: Exclude patterns that override includes

IgnoreBuilder

Constructs ignore rule sets from multiple sources:

  • .gitignore files: Standard Git ignore semantics
  • .scribeignore files: Scribe-specific patterns
  • Custom patterns: Programmatically added rules
  • Precedence handling: Correct override behavior

PathMatcher

Efficient path matching against pattern sets:

  • Compiled matchers: Pre-compiled globset for performance
  • Path normalization: Handles Windows vs Unix path separators
  • Absolute vs relative: Correct matching for both path types
  • Directory detection: Special handling for directory patterns
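
As a rough illustration of path normalization, the sketch below converts Windows separators to `/` and strips a leading `./` before matching. The function name `normalize_path` is an assumption made for this example, not part of the crate's documented API, and it is meant for paths only (in patterns, a backslash is an escape character).

```rust
/// Hypothetical helper: put a path into the `/`-separated, relative form
/// that glob patterns are written against.
fn normalize_path(path: &str) -> String {
    // Treat `\` as a separator, as on Windows.
    let unified = path.replace('\\', "/");
    // Drop a leading "./" so "./src/main.rs" and "src/main.rs" compare equal.
    let trimmed = unified.strip_prefix("./").unwrap_or(unified.as_str());
    trimmed.to_string()
}
```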

PatternParser

Parses and validates pattern syntax:

  • Glob expansion: Converts globs to regex when needed
  • Escape sequence handling: Properly handles \*, \?, etc.
  • Error reporting: Clear error messages for invalid patterns
  • Syntax validation: Detects malformed patterns early

Usage

Basic Glob Matching

use scribe_patterns::{PatternSet, PathMatcher};

let patterns = PatternSet::from_globs(vec![
    "**/*.rs",           // All Rust files
    "**/*.py",           // All Python files
    "!**/*_test.py",     // Except test files
])?;

let matcher = PathMatcher::new(patterns);

assert!(matcher.is_match("src/main.rs"));
assert!(matcher.is_match("lib/utils.py"));
assert!(!matcher.is_match("lib/utils_test.py")); // Negated

Gitignore-Style Filtering

use scribe_patterns::IgnoreBuilder;

let mut builder = IgnoreBuilder::new("/path/to/repo");
builder.add_gitignore(".gitignore")?;
builder.add_custom("target/**")?;    // Exclude Rust build directory
builder.add_custom("!target/debug/important.txt")?; // But include this file

let ignore = builder.build()?;

for entry in walkdir::WalkDir::new("/path/to/repo") {
    let entry = entry?;
    if ignore.matched(entry.path(), entry.file_type().is_dir()).is_ignore() {
        continue; // Skip ignored files
    }
    // Process file
}

Multiple Pattern Sets

use scribe_patterns::{PatternSet, Matcher};

// Include patterns
let include = PatternSet::from_globs(vec![
    "src/**/*.rs",
    "lib/**/*.rs",
])?;

// Exclude patterns
let exclude = PatternSet::from_globs(vec![
    "**/target/**",
    "**/*.bak",
])?;

let matcher = Matcher::new()
    .include(include)
    .exclude(exclude);

// File must match include AND not match exclude
if matcher.should_include("src/utils.rs") {
    // Process file
}

Regex Patterns

use scribe_patterns::PatternSet;

let patterns = PatternSet::from_regex(vec![
    r".*_test\.(rs|py)$",     // Test files in Rust or Python
    r"^src/.*/mod\.rs$",      // All mod.rs files in src
])?;

assert!(patterns.is_match("src/utils/mod.rs"));
assert!(patterns.is_match("lib/parser_test.py"));

Case-Insensitive Matching

use scribe_patterns::{MatchOptions, PathMatcher, PatternSet};

let patterns = PatternSet::from_globs(vec!["*.TXT", "*.Md"])?;

let options = MatchOptions {
    case_sensitive: false,
    ..Default::default()
};

let matcher = PathMatcher::new(patterns).with_options(options);

assert!(matcher.is_match("readme.md"));  // Matches *.Md
assert!(matcher.is_match("notes.txt"));  // Matches *.TXT

Pattern Syntax

Glob Patterns

Pattern     Matches                       Example
*           Any string (not /)            *.rs → main.rs, lib.rs
**          Any number of path segments   **/*.py → a/b/c.py
?           Single character              ?.txt → a.txt, 1.txt
[abc]       Character set                 [abc].rs → a.rs, b.rs
{a,b}       Alternatives                  *.{rs,py} → main.rs, util.py
!pattern    Negation                      !test*.py → exclude test files
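
To make the difference between `*` and `**` concrete, here is a minimal recursive matcher written against the standard library only. It is a sketch, not the crate's implementation (which, per the architecture section, compiles patterns with globset), and it covers only `?`, `*`, and `**`; character sets and `{a,b}` alternatives are left out. The function names are hypothetical.

```rust
/// Returns true if `path` matches `pattern`.
/// `?` and `*` never cross a `/`; `**` may span several path segments.
fn glob_match(pattern: &str, path: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let s: Vec<char> = path.chars().collect();
    match_from(&p, &s)
}

fn match_from(p: &[char], s: &[char]) -> bool {
    match p {
        // Pattern exhausted: match only if the path is exhausted too.
        [] => s.is_empty(),
        // `**/` matches zero or more whole directories.
        ['*', '*', '/', rest @ ..] => {
            match_from(rest, s)
                || (0..s.len()).any(|i| s[i] == '/' && match_from(rest, &s[i + 1..]))
        }
        // A bare `**` matches any run of characters, including `/`.
        ['*', '*', rest @ ..] => (0..=s.len()).any(|i| match_from(rest, &s[i..])),
        // `*` matches any run of characters within one segment.
        ['*', rest @ ..] => {
            for i in 0..=s.len() {
                if match_from(rest, &s[i..]) {
                    return true;
                }
                if i < s.len() && s[i] == '/' {
                    break; // `*` may not consume a separator
                }
            }
            false
        }
        // `?` matches exactly one character that is not a separator.
        ['?', rest @ ..] => {
            matches!(s.first(), Some(c) if *c != '/') && match_from(rest, &s[1..])
        }
        // Any other character must match literally.
        [c, rest @ ..] => s.first() == Some(c) && match_from(rest, &s[1..]),
    }
}
```

Under this sketch, `*.rs` matches `main.rs` but not `src/main.rs`, while `**/*.py` matches both `c.py` and `a/b/c.py`.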

Gitignore Rules

Pattern     Behavior
pattern     Matches in any directory
/pattern    Matches only at root
dir/        Matches directory only
!pattern    Negates previous patterns
#comment    Ignored line
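
The rules in this table can be read as a small classification step applied to each line of an ignore file. The following sketch shows one way to do that; `IgnoreRule` and `parse_ignore_line` are hypothetical names, and the logic is simplified (for instance, Git also anchors patterns that contain a `/` in the middle, which is not handled here).

```rust
/// Hypothetical parsed form of one ignore-file line.
#[derive(Debug, PartialEq)]
struct IgnoreRule {
    pattern: String,
    negated: bool,  // line started with `!`
    anchored: bool, // line started with `/`: match only at the root
    dir_only: bool, // line ended with `/`: match directories only
}

/// Parses one line; returns None for blank lines and comments.
fn parse_ignore_line(line: &str) -> Option<IgnoreRule> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Peel off the markers one at a time, remembering which were present.
    let (negated, rest) = match line.strip_prefix('!') {
        Some(r) => (true, r),
        None => (false, line),
    };
    let (anchored, rest) = match rest.strip_prefix('/') {
        Some(r) => (true, r),
        None => (false, rest),
    };
    let (dir_only, rest) = match rest.strip_suffix('/') {
        Some(r) => (true, r),
        None => (false, rest),
    };
    if rest.is_empty() {
        return None; // nothing left to match, e.g. a lone "/" or "!"
    }
    Some(IgnoreRule { pattern: rest.to_string(), negated, anchored, dir_only })
}
```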

Special Cases

  • Empty patterns: Ignored (no effect)
  • Whitespace: Leading/trailing whitespace is trimmed
  • Backslash escapes: \* matches literal *
  • Unicode: Full UTF-8 support for paths and patterns
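
A short sketch of how the whitespace and backslash-escape rules might look at the tokenizing stage. The `Token` enum and `tokenize` function are hypothetical and exist only for this example; the treatment of a trailing lone backslash (kept as a literal) is an assumption, not documented behavior.

```rust
/// Hypothetical token type for a tiny subset of glob syntax.
#[derive(Debug, PartialEq)]
enum Token {
    Literal(char), // an ordinary or escaped character
    AnyChar,       // `?`
    AnyRun,        // `*`
}

/// Splits a pattern into tokens, honoring `\` escapes.
fn tokenize(pattern: &str) -> Vec<Token> {
    let mut out = Vec::new();
    // Leading and trailing whitespace is trimmed before parsing.
    let mut chars = pattern.trim().chars();
    while let Some(c) = chars.next() {
        match c {
            // `\x` yields a literal `x`; a trailing `\` is kept as-is.
            '\\' => out.push(Token::Literal(chars.next().unwrap_or('\\'))),
            '*' => out.push(Token::AnyRun),
            '?' => out.push(Token::AnyChar),
            other => out.push(Token::Literal(other)),
        }
    }
    out
}
```

An empty or whitespace-only pattern produces no tokens, which a caller can treat as "no effect".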

Performance

Benchmarks

Pattern compilation and matching are highly optimized:

  • Glob compilation: <1ms for typical pattern sets (10-50 patterns)
  • Path matching: <1μs per path for compiled matchers
  • Literal matching: <100ns using Aho-Corasick for large literal sets
  • Regex matching: ~1-10μs depending on pattern complexity

Optimizations

  1. Lazy compilation: Patterns compiled only when first used
  2. Caching: Compiled matchers cached in OnceCell
  3. Fast paths: Literal string matching before expensive regex
  4. Set operations: Boolean algebra simplification for pattern sets
  5. Aho-Corasick: Multi-pattern matching for literals in O(n) time in the length of the input path, regardless of the number of literals
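
Items 1 and 2 (lazy compilation and caching) follow a common pattern that can be sketched with the standard library's `std::cell::OnceCell`. Everything here is illustrative: `LazyMatcher` is a hypothetical type, the "compiled" form is just a list of file suffixes, and the counter exists only to show that compilation runs once.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical matcher that defers "compilation" until first use.
struct LazyMatcher {
    raw: Vec<String>,
    compiled: OnceCell<Vec<String>>, // stand-in for a compiled matcher
    compile_count: Cell<u32>,        // demo only: counts compilations
}

impl LazyMatcher {
    fn new(raw: Vec<String>) -> Self {
        LazyMatcher { raw, compiled: OnceCell::new(), compile_count: Cell::new(0) }
    }

    /// Compiles on the first call, then returns the cached result.
    fn compiled(&self) -> &Vec<String> {
        self.compiled.get_or_init(|| {
            self.compile_count.set(self.compile_count.get() + 1);
            // "Compile" `*.ext` patterns into plain suffixes.
            self.raw
                .iter()
                .map(|p| p.trim_start_matches('*').to_string())
                .collect()
        })
    }

    fn is_match(&self, path: &str) -> bool {
        self.compiled().iter().any(|suffix| path.ends_with(suffix.as_str()))
    }
}
```

Constructing the matcher costs nothing beyond storing the raw patterns; the first `is_match` call pays for compilation and every later call reuses the cached value.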

Configuration

MatchOptions

Field                        Type  Default   Description
case_sensitive               bool  Platform  Match case-sensitively (case-insensitive by default on Windows)
require_literal_separator    bool  true      * doesn't match /
require_literal_leading_dot  bool  true      * doesn't match .hidden

IgnoreOptions

Field        Type  Default  Description
hidden       bool  true     Ignore hidden files (.file)
parents      bool  true     Check parent .gitignore files
git_global   bool  false    Use Git global ignore
git_exclude  bool  false    Use .git/info/exclude

Error Handling

All pattern operations return Result<T, PatternError>:

use std::io;

pub enum PatternError {
    InvalidGlob(String),      // Malformed glob syntax
    InvalidRegex(String),     // Malformed regex pattern
    IoError(io::Error),       // File read errors
    EmptyPatternSet,          // No patterns provided
}
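
The sketch below shows how an error type of this shape typically works with the `?` operator. It defines a local copy of the enum so the example is self-contained; the `Display` and `From<io::Error>` implementations, and the helper functions `parse_patterns` and `load_patterns`, are assumptions made for illustration and may not match what the crate actually provides.

```rust
use std::fmt;
use std::io;

/// Local mirror of the enum above, for illustration only.
#[derive(Debug)]
#[allow(dead_code)] // not every variant is constructed in this sketch
enum PatternError {
    InvalidGlob(String),
    InvalidRegex(String),
    IoError(io::Error),
    EmptyPatternSet,
}

impl fmt::Display for PatternError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PatternError::InvalidGlob(p) => write!(f, "invalid glob: {p}"),
            PatternError::InvalidRegex(p) => write!(f, "invalid regex: {p}"),
            PatternError::IoError(e) => write!(f, "I/O error: {e}"),
            PatternError::EmptyPatternSet => write!(f, "no patterns provided"),
        }
    }
}

// Lets `?` convert file errors into PatternError::IoError.
impl From<io::Error> for PatternError {
    fn from(e: io::Error) -> Self {
        PatternError::IoError(e)
    }
}

/// Hypothetical helper: one pattern per non-blank line.
fn parse_patterns(text: &str) -> Result<Vec<String>, PatternError> {
    let patterns: Vec<String> = text
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(String::from)
        .collect();
    if patterns.is_empty() {
        return Err(PatternError::EmptyPatternSet);
    }
    Ok(patterns)
}

/// Hypothetical helper: read a pattern file, propagating I/O errors.
fn load_patterns(path: &str) -> Result<Vec<String>, PatternError> {
    let text = std::fs::read_to_string(path)?; // io::Error -> IoError
    parse_patterns(&text)
}
```

Callers can then match on the variant to decide whether an error is a user mistake (a bad pattern or an empty set) or an environment problem (an unreadable file).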

Integration

scribe-patterns is used throughout Scribe:

  • scribe-scanner: Filters files during repository traversal
  • scribe-analysis: Selects files for AST parsing
  • scribe-selection: Applies include/exclude rules to selection
  • CLI: Processes --include and --exclude flags

See Also

  • scribe-scanner: Repository scanning and filtering
  • scribe-selection: File selection using patterns
  • scribe-core: Shared types and configuration
  • globset documentation: Underlying glob implementation