hunch 0.3.1

A Rust port of guessit — extract media metadata from filenames
Documentation

🔍 Hunch

🤖 This repository is entirely AI-generated with human guidance. All code, tests, and documentation were produced by AI under the direction of a human collaborator.

A Rust port of Python's guessit for extracting media metadata from filenames.

⚠️ Work in progress. Hunch currently passes 80.0% of guessit's own 1,309-case test suite (1,047 / 1,309). All 49 guessit properties are implemented (3 intentionally diverged). Core properties like video codec, screen size, source, audio codec, edition, and year are 96–100% accurate. Title (92%), release group (90%), episode (90%), and 40+ other properties are steadily improving. Uses a two-pass pipeline with post-resolution extraction. All regex is linear-time via the regex crate (ReDoS-immune). See COMPATIBILITY.md for the full breakdown.

Hunch extracts title, year, season, episode, resolution, codec, language, and 40+ other properties from messy media filenames — the same job guessit does, rewritten from scratch for Rust.

Quick Start

cargo add hunch

As a library

use hunch::hunch;

fn main() {
    let result = hunch("The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv");
    println!("{:#?}", result);
    // GuessResult {
    //   title: Some("The Walking Dead"),
    //   season: Some(5),
    //   episode: Some(3),
    //   screen_size: Some("720p"),
    //   source: Some("Blu-ray"),
    //   video_codec: Some("H.264"),
    //   release_group: Some("DEMAND"),
    //   container: Some("mkv"),
    //   media_type: Episode,
    //   ...
    // }
}

As a CLI tool

$ hunch "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"
{
  "container": "mkv",
  "episode": 3,
  "release_group": "DEMAND",
  "screen_size": "720p",
  "season": 5,
  "source": "Blu-ray",
  "title": "The Walking Dead",
  "type": "episode",
  "video_codec": "H.264"
}

guessit Compatibility

Hunch is a port of guessit. All 49 of guessit's properties are implemented (3 intentionally diverged — see COMPATIBILITY.md). We validate against guessit's own YAML test suite:

guessit (Python) hunch (Rust)
Overall pass rate 100% (by definition) 80.0% (1,047 / 1,309)
Properties implemented 49 49 (3 diverged)
Properties at 90%+ 49 31
Properties at 100% 49 16

Where hunch matches guessit (96–100% accuracy): year, video_codec, container, source, screen_size, audio_codec, crc32, color_depth, streaming_service, bonus, film, aspect_ratio, size, edition, episode_details, version, frame_rate, episode_count, season_count, proper_count, date, disc, episode_format, week, video_api.

Where hunch is developing (70–95%): title (92%), release_group (90%), episode (90%), season (94%), audio_channels (95%), language (85%), other (85%), episode_title (74%), subtitle_language (77%), audio_profile (85%).

For per-property breakdowns, per-file results, and known gaps, see COMPATIBILITY.md.

Design

Hunch does not port guessit's rebulk engine. Instead it uses a tokenizer-first, two-pass, TOML-driven architecture:

  1. Tokenize — Split input on separators (. - _ space), extract extension, detect bracket groups, handle path segments.
  2. Zone map — Detect structural anchors (SxxExx, 720p, x264) to establish title zone vs tech zone boundaries, per directory and filename.
  3. Pass 1: Match & Resolve — TOML rule files (20 files, embedded at compile time) match tokens via exact lookup, regex, and templates. Algorithmic matchers handle complex patterns (episodes, dates). Conflicts resolved by priority, then length. Zone disambiguation.
  4. Pass 2: Extract — Release group, title, and episode title run with access to resolved match positions from Pass 1.
  5. Result — HunchResult with 49 typed property accessors.
Input: "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"
  │
  ├─ 1. Tokenize → TokenStream (segments, tokens, brackets, extension)
  ├─ 2. Zone map → title zone, tech zone, year disambiguation
  │
  ═══ Pass 1: Tech Resolution ═════════════════════════════════
  ├─ 3. TOML rules + legacy matchers → Vec<MatchSpan>
  ├─ 4. Conflict resolution (priority, then length)
  ├─ 5. Zone disambiguation → resolved_tech_matches
  │
  ═══ Pass 2: Positional Extraction ═══════════════════════════
  ├─ 6. Release group (sees resolved positions)
  ├─ 7. Title, episode title, alternative title
  └─ 8. Build HunchResult

See ARCHITECTURE.md for the full design, decision log, and module map.

Project Structure

src/
├── lib.rs              # Public API: hunch(), hunch_with()
├── main.rs             # CLI binary (clap)
├── hunch_result.rs     # HunchResult type + JSON serialization
├── options.rs          # Configuration
├── zone_map.rs         # ZoneMap: structural zone analysis (filename + per-dir)
├── tokenizer.rs        # Input → TokenStream (segments, brackets, extension)
├── pipeline/           # Two-pass orchestration
│   ├─ mod.rs          # Pipeline: Pass 1 (tech) → Pass 2 (positional)
│   ├─ zone_rules.rs   # Zone disambiguation (pre-RG + post-RG rules)
│   └─ proper_count.rs # PROPER/REPACK count computation
├── matcher/
│   ├── span.rs         # MatchSpan + Property enum (55 variants)
│   ├── engine.rs       # Conflict resolution
│   ├── regex_utils.rs  # BoundedRegex + boundary checking
│   └── rule_loader.rs  # TOML rule engine: exact + regex + templates + zone_scope
└── properties/         # 31 property matcher modules
    ├── title/          # Title extraction (mod.rs + clean.rs + secondary.rs)
    ├── episodes/       # Season/episode (mod.rs + patterns.rs + tests.rs)
    ├── release_group/  # Post-resolution release group extraction (Pass 2)
    │   ├── mod.rs       # Regex patterns + matching logic
    │   └── known_tokens.rs # Position-based validation + helpers
    └── ...             # year, date, source, language, etc.

rules/                  # 20 TOML data files (compile-time embedded)
├── source.toml         # Source patterns with Rip/Screener side_effects
├── other.toml          # Other flags (unambiguous)
├── other_positional.toml # Position-dependent Other (zone_scope=tech_only)
└── ...                 # video_codec, audio_codec, language, edition, etc.

tests/ ├── integration.rs # 32 hand-written end-to-end tests ├── guessit_regression.rs # 22 regression suites + compatibility report └── fixtures/ # Copied from guessit (self-contained)


## License

MIT