hunch 1.1.3

A Rust port of guessit — extract media metadata from filenames
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
# ARCHITECTURE.md — Hunch

> **Decision log, architectural rationale, and developer guide for the hunch
> media filename parser.** This is the single source of truth for how the
> project works and why.

---

## Overview

Hunch parses media filenames like `The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv`
into structured metadata (title, season, episode, codec, etc.).

| Field       | Value                                                        |
| ----------- | ------------------------------------------------------------ |
| Crate name  | `hunch`                                                      |
| Language    | Rust (edition 2024)                                          |
| License     | MIT                                                          |
| Ancestor    | Python `guessit` (LGPLv3) — patterns/knowledge ported, engine rewritten |
| Goal        | Fast, correct, offline, deterministic media filename parsing |

The problem decomposes into three sub-problems, each favoring a different approach:

1. **Recognition** — Is `x264` a video codec? → Lookup tables + regex (TOML rules)
2. **Disambiguation** — Is `French` a language or title word? → Zone inference
3. **Extraction** — Where does the title end? → Positional/algorithmic (Rust code)

---

## Current Status

**Overall: 81.4%** (1,066 / 1,309 guessit test cases). `regex`-only (no
`fancy_regex`). TOML-driven rule engine with side effects, neighbor
constraints (`not_before`/`not_after`/`requires_after`/`requires_before`),
zone-scope filtering, and two-pass pipeline (tech resolution → positional extraction).

| Tier | Properties |
|------|------------|
| ✅ 100% | video_api, season_count, disc, aspect_ratio, proper_count, version, bonus, film, size, frame_rate, date, episode_count, episode_format, week, edition, color_depth |
| ✅ 95–99% | video_codec (98.6%), screen_size (98.4%), audio_codec (97.8%), source (97.5%), year (96.5%), crc32 (96.0%) |
| ✅ 90–94% | audio_channels (94.9%), container (94.7%), season (93.9%), type (93.6%), title (91.7%), release_group (91.1%), website (90.9%), episode (90.6%), streaming_service (90.3%) |
| 🟡 80–89% | other (87.7%), film_title (87.5%), uuid (87.5%), language (87.3%), video_profile (85.7%), audio_profile (85.3%), part (84.2%), subtitle_language (82.7%), episode_details (81.2%) |
| 🟡 60–80% | episode_title (73.6%), country (69.2%), bonus_title (61.5%), absolute_episode (60.0%), cd (60.0%) |
| 🟡 <60% | cd_count (50.0%), alternative_title (43.8%) |

Properties: 49/49 implemented (3 intentionally diverged — see COMPATIBILITY.md).

Highest-ROI targets: episode_title (53 fails), episode (52),
release_group (48), title (88 fails), other (43).

---

## Layered Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Layer 0–1: Tokenizer + TOML Rules + regex-only (this crate) │
│                                                             │
│   • Split input into tokens at boundaries (. - _ space)     │
│   • Match tokens against TOML rule files (embedded at       │
│     compile time via include_str!())                         │
│   • Algorithmic matchers for episodes, dates, titles,       │
│     release groups (Rust code)                               │
│   • Zone-based disambiguation (structural, not heuristic)   │
│   • regex crate only — linear-time, ReDoS-immune            │
│   • Offline, deterministic, fast (microseconds)             │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: Database Lookup (future — NOT this crate)          │
│   • TMDB/IMDB/TVDB to validate titles                      │
│   • Resolves "2001 A Space Odyssey" — DB knows 2001 is     │
│     part of the title, not the year                          │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: LLM Fallback (future — NOT this crate)             │
│   • For truly ambiguous filenames                            │
└─────────────────────────────────────────────────────────────┘
```

**Hard boundary**: Hunch is a **pure, offline, deterministic library**. No
network, no database, no ML. Layers 2–3 belong in downstream consumers.

---

## v0.2 Pipeline

```
Input string
  │
  ├─ 1. Tokenize: split on separators, extract extension, detect brackets
  │     └─ TokenStream { tokens: [{text, start, end, separator, in_brackets}...], extension }
  │
  ├─ 1b. Zone map: anchor detection + year disambiguation (v0.2.1)
  │     └─ ZoneMap { title_zone, tech_zone, has_anchors, year }
  │
  ═══ PASS 1: Tech Property Resolution ═════════════════════════════════════
  │
  ├─ 2a. TOML rules: iterate tokens + multi-token windows (1–3, longest first)
  │      └─ 20 TOML rule files (exact lookups + regex + {N} capture templates)
  │
  ├─ 2b. Legacy matchers: regex against raw input (algorithmic properties)
  │      └─ year, date, episodes, language, subtitle_language, etc.
  │      └─ NOTE: release_group NOT included (runs in Pass 2)
  │
  ├─ 2c. Extension → Container span (priority 10)
  │
  ├─ 3. Year disambiguation (ZoneMap title-years)
  │
  ├─ 4. Conflict resolution (sort by priority desc, length desc; sweep overlaps)
  │
  ├─ 5. Zone-based disambiguation (7 active rules in zone_rules.rs)
  │     └─ Output: resolved_tech_matches
  │
  ═══ PASS 2: Positional Property Extraction ═══════════════════════════════
  │
  ├─ 6a. release_group(input, resolved_matches, zone_map)
  │      └─ Uses resolved positions instead of is_known_token exclusion list
  │
  ├─ 6b. title extraction (uses zone_map boundaries)
  │
  ├─ 6c. episode_title, film_title, alternative_title
  │
  ├─ 7. Computed: media_type, proper_count
  │
  └─ 8. Build HunchResult → JSON
```

### v0.2 pipeline limitations

The v0.2 pipeline follows a **match-everything-then-prune** pattern. Every
ambiguous token generates a match first; zone rules try to undo mistakes
after the fact. This has three structural problems:

1. **Lost information**: once a token is claimed (e.g., "Proof" → Other),
   title extraction sees it as consumed. Removing the Other match doesn't
   restore it as title content — the title extractor already ran.
2. **Scattered zone logic**: six zone rules in `apply_zone_rules()` each
   reconstruct a partial view of the filename's structure. Title extraction,
   episode title extraction, and release group extraction each independently
   re-derive zone boundaries.
3. **No positional awareness during matching**: TOML rules match tokens
   without knowing whether the token is in the title zone or the tech zone.
   The same token ("Proof", "French", "3D") should match in tech zones
   but be suppressed in title zones.

---

## v0.2.1 Pipeline: Zone Map

v0.2.1 introduces a **ZoneMap** — a structural analysis of the filename
that identifies zones *before* full matching runs. This inverts the
control flow from "match then prune" to "know zones, then match
appropriately."

```
Input string
  │
  ├─ 1. Tokenize (unchanged)
  │     └─ TokenStream { segments, extension }
  │
  ├─ 2. Anchor detection (NEW — two-phase, no TOML rules)
  │     Phase 1: Find high-confidence markers (Tier 1 + Tier 2):
  │     ├─ Structural:  S01E02, 1x03, 1080p, .mkv  (Tier 1)
  │     ├─ Tech vocab:  x264, BluRay, DTS, HDTV     (Tier 2)
  │     └─ → Establishes `tech_zone_start` (first Tier 1/2 token)
  │
  │     Phase 2: Disambiguate position-dependent tokens (Tier 3):
  │     ├─ Year-like numbers: use tech_zone_start + position + count
  │     ├─ Leading brackets:  [Group] at start
  │     └─ Trailing group:    -GROUP.ext at end
  │
  ├─ 3. Zone map construction (NEW)
  │     From anchors, derive zone boundaries per filename segment:
  │
  │     [Group] Title.Words.Year.SxxExx.Ep_Title.Source.Codec-Group.ext
  │      ──┬──  ───┬───── ─┬── ───┬─── ────┬───── ──────┬────── ─┬─── ─┬─
  │     release title     anchor          ep_title    tech          release ext
  │     (lead)  zone      (year/ep)       zone        zone          (trail)
  │
  │     ZoneMap {
  │       title_zone:      fn_start .. first_anchor,
  │       tech_zone:       first_anchor .. group_start,
  │       ep_title_zone:   ep_end .. first_tech_after_ep,  // if episode exists
  │       release_zone:    group_start .. ext_start,        // if group detected
  │     }
  │
  ├─ 4. Zone-aware matching (CHANGED)
  │     TOML rules declare a `zone_scope`:
  │     ├─ Unrestricted: match everywhere (default, backwards-compatible)
  │     │   VideoCodec, AudioCodec, ScreenSize (with p/i suffix), Source, ...
  │     ├─ TechOnly: suppress in title zone (ambiguous tokens)
  │     │   Other (Proof, HDR, 3D), Language, Edition, EpisodeDetails
  │     └─ TechOrAfterAnchor: match in tech zone + after any anchor
  │         Country, SubtitleLanguage
  │
  ├─ 4b. Legacy matchers (unchanged, transitional)
  │
  ├─ 5. Conflict resolution (unchanged)
  │
  ├─ 6. Zone-informed disambiguation (REPLACES apply_zone_rules)
  │     Most zone rules become unnecessary because matching was already
  │     zone-aware. Remaining rules handle cross-zone conflicts only.
  │
  ├─ 7. Post-processing (SIMPLIFIED)
  │     Title extraction uses zone boundaries directly instead of
  │     re-deriving them from match positions.
  │
  └─ 8. Build HunchResult → JSON
```

### Anchor confidence tiers

Not all tokens are equally unambiguous. Anchors exist on a **confidence
spectrum**, and zone construction must account for this.

#### Tier 1: Structural anchors (always unambiguous)

These tokens have built-in structural markers (prefix, suffix, or
position) that make them unambiguous regardless of where they appear.
They never appear as title words.

| Anchor | Signal | Examples |
|---|---|---|
| Season/Episode | `S`/`E` prefix + digits | `S01E02`, `S03-E01` |
| NxN episode | Digit-x-digit pattern | `1x03`, `5x09` |
| Suffixed resolution | Digits + `p`/`i` suffix | `1080p`, `720i`, `2160p` |
| Extension | Last `.xxx` position | `.mkv`, `.avi`, `.mp4` |

#### Tier 2: Unambiguous vocabulary (high confidence)

These tokens have unique vocabulary that virtually never appears in
titles. They are safe to use as zone boundary markers.

| Anchor | Examples |
|---|---|
| Video codec | `x264`, `x265`, `H.264`, `XviD`, `HEVC`, `AV1` |
| Audio codec | `DTS`, `AAC`, `AC3`, `FLAC`, `Atmos`, `TrueHD` |
| Source | `BluRay`, `WEB-DL`, `HDTV`, `DVDRip`, `BDRip` |
| Container (inline) | `MKV`, `AVI` when not in extension position |

#### Tier 3: Position-dependent (need disambiguation)

These tokens can be either metadata or title content. Their meaning
depends on **absolute position** (where in the filename) and **relative
position** (what's around them).

| Token | As metadata | As title | Disambiguation |
|---|---|---|---|
| Year-like (1920–2039) | Release year | Movie title | See below |
| `[Group]` at start | Release group | Subtitle tag | Anime heuristics |
| `-GROUP` at end | Release group | Hyphenated word | Tech tokens must precede |
| `HD`, `3D` | Other / ScreenSize | Title word | Zone position |

### Year disambiguation strategy

Year-like numbers (1920–2039) are the most important position-dependent
anchor. Notable title-as-year examples:

- `1917.2019.1080p.BluRay` — "1917" is title, "2019" is year
- `2001.A.Space.Odyssey.1968.1080p` — "2001" is title, "1968" is year
- `2012.2009.720p.BluRay` — "2012" is title, "2009" is year
- `1922.2017.WEB-DL` — "1922" is title, "2017" is year

**Key insight**: tech tokens define the zone boundary; years are
disambiguated *using* that boundary, not the other way around.

The algorithm:

1. **Find `tech_zone_start`** from Tier 1 + Tier 2 anchors (the first
   structural anchor or unambiguous tech token in the filename). This
   does NOT use year-like numbers.

2. **Classify each year candidate** using `tech_zone_start`:

   | Position | Context | Classification |
   |---|---|---|
   | Parenthesized `(NNNN)` | Any | **Year** (very high confidence) |
   | After `tech_zone_start` | In tech zone | **Year** |
   | Before `tech_zone_start` | Only year candidate | **Year** (also zone boundary) |
   | Before `tech_zone_start` | Multiple candidates; this is first | **Title** |
   | Before `tech_zone_start` | Multiple candidates; this is last | **Year** |
   | At filename start (pos 0) | Followed by non-year words | **Title** |
   | At filename start (pos 0) | Immediately before tech | **Ambiguous** (could be either) |

3. **Refine `title_zone`**: if the year candidate that survives as
   the actual year is before `tech_zone_start`, it becomes the new
   zone boundary. If the year is title content, the boundary stays
   at `tech_zone_start`.

Example walkthrough:
```
2001.A.Space.Odyssey.1968.HDDVD.1080p.DTS.x264.mkv
 │                     │    │     │    │    │
 │                     │    └─────┴────┴────┴── Tier 2 tech tokens
 │                     │
 ├─ Year candidate #1  ├─ Year candidate #2
 │  pos=0 (start)       │  pos=before tech zone
 │  followed by words   │  immediately before HDDVD (Tier 2)
 │                      │
 └─ → Title content     └─ → Actual year

tech_zone_start = position of "HDDVD" (Tier 2, first tech token)
title_zone = [0 .. "1968") → "2001 A Space Odyssey"
year = 1968
```

### Zone scopes for TOML rules

Each TOML rule file can declare a `zone_scope` (defaults to `"unrestricted"`
for backwards compatibility):

```toml
property = "other"
zone_scope = "tech_only"     # NEW: suppress in title zone

[exact]
proof    = "Proof"
hdr      = "HDR"
hdr10    = "HDR10"
```

| Scope | Behavior | Use for |
|---|---|---|
| `unrestricted` | Match in all zones (default) | Unambiguous tech: codecs, resolutions |
| `tech_only` | Suppress matches in title zone | Ambiguous tokens: Other, Edition |
| `after_anchor` | Match only after first anchor | Language, SubtitleLanguage, Country |

The pipeline passes the `ZoneMap` to `match_tokens_in_segment()`. If a
token falls in the title zone and the rule's scope is `tech_only`, the
match is silently skipped — no span is emitted, no conflict to resolve.

### What ZoneMap solves

| Problem | v0.2 (match-then-prune) | v0.2.1 (zones-first) |
|---|---|---|
| "Proof" at start → Other | Other claims it, title gets nothing | title_zone → Other suppressed → title |
| "French" before year | Language claims it, zone rule partial fix | title_zone → Language suppressed → title |
| "3D" in movie title | ScreenSize claims it | title_zone → ambiguous ScreenSize suppressed |
| "LiNE" in title | Other (Line Audio) claims it | title_zone → Other suppressed → title |
| "Edition" in title | Edition claims it before title runs | title_zone → Edition suppressed → title |
| "2001" as title | Both 2001 and 1968 claimed as year | Year disambiguation: first=title, last=year |
| Episode title boundary | Independently re-derives zones | Uses ep_title_zone from ZoneMap |
| Path segment selection | Title picks wrong dir | Per-segment zone analysis → best structure |

### What ZoneMap does NOT solve

These problems are orthogonal to zones:

- **Multi-token compounds** ("Edition Collector") — needs TOML engine
  multi-token window enhancement, not zone awareness.
- **Compound release groups** ("Tigole QxR") — needs release_group.rs
  logic for merging across brackets, not zones.
- **Layer 2 disambiguation** ("2001 A Space Odyssey") — needs external
  title DB. Out of scope for this crate (D004).

### Incremental implementation path

No big-bang rewrite. Each step is a separate commit, tested independently:

1. **Add `ZoneMap` struct + `detect_anchors()`** — non-breaking new code.
   Two-phase: (a) find Tier 1+2 tech tokens → `tech_zone_start`,
   (b) disambiguate Tier 3 year candidates using that boundary.
   Compute zones, log to debug. (~150 lines)
2. **Add `zone_scope` field to TOML `RuleSet`** — additive, defaults to
   `Unrestricted`. Parse `zone_scope` from TOML files. (~30 lines)
3. **Pass `ZoneMap` to `match_tokens_in_segment()`** — add filtering
   alongside existing matching code. Tokens in title zone skip
   `TechOnly` rules. (~20 lines)
4. **Tag ambiguous TOML rules** with `zone_scope = "tech_only"`   start with `other.toml`, `edition.toml`. Each file is one commit.
5. **Retire `apply_zone_rules()` heuristics** one at a time as zone
   scopes make them redundant.
6. **Simplify `extract_title()`** — use `title_zone` boundaries
   instead of re-scanning for first-match positions.
7. **Integrate year disambiguation** into the zone map so title
   extraction naturally includes year-as-title numbers.

### Why not rebulk?

guessit uses `rebulk`, a generic Python pattern-matching engine with match
chaining, conflict resolution, and rule-based post-processing. We do NOT
port rebulk. Instead:

| Rebulk Feature | Hunch Equivalent | Why |
|---------------|-----------------|-----|
| Chain execution (A → B) | Flat pipeline, order-independent | Avoids execution-order bugs |
| Rule DSL + callbacks | TOML files + Rust closures | Simpler, auditable |
| Runtime reconfiguration | Compile-time embedding | YAGNI, faster |
| Match tagging system | Typed `Property` enum | Stronger types |
| Backtracking regex | `regex` crate (linear-time) | ReDoS-immune |

---

## v0.2 Audit Findings (2025-02)

Thorough review of every engine component, all 18 TOML files, 21 legacy
matchers, the pipeline, tokenizer, and test infrastructure.

### What's solid (don't change)

1. **Tokenizer-first pipeline** — structurally position-aware, eliminates
   rebulk's execution-order sensitivity.
2. **TOML data-driven rules** — clean, auditable, compile-time embedded.
   Contributors can add codecs without deep Rust knowledge.
3. **Zone-based disambiguation** — structural (derived from token positions),
   not procedural (dependent on execution order). Better than guessit.
4. **Algorithmic matchers in Rust** — episodes (800+ lines), title, release_group,
   date. These are inherently algorithmic and should NEVER be TOML.
5. **Flat `Vec<MatchSpan>` + conflict resolution** — simpler than rebulk's
   chain/tree, priority + length tiebreaking is sufficient.

### Gap 1: Side-effect rules ~~(BLOCKER for legacy removal)~~ ✅ IMPLEMENTED

The single most important gap. Pattern:
```
"DVDRip" → Source:DVD + Other:Rip        (2 properties)
"BRRip"  → Source:Blu-ray + Other:Rip + Other:Reencoded  (3 properties!)
"DVDSCR" → Source:DVD + Other:Screener   (2 properties)
```

**Implemented**: `side_effects` in TOML pattern entries:
```toml
[[patterns]]
match = '(?i)^dvd[-. ]?rip$'
value = "DVD"
side_effects = [
    { property = "other", value = "Rip" }
]
```

This is NOT rebulk chains. It's a flat, declarative side-effect list. One
match → N outputs. No callbacks, no execution order, no rule dependencies.

### Gap 2: Context-dependent matching ✅ IMPLEMENTED

Some tokens are ambiguous and need neighbor-awareness:
- `"HD"` → Other:HD, but NOT before `tv`, `dvd`, `cam`, `rip`
- `"DV"` → Dolby Vision in tech zone vs ignored elsewhere
- `"TS"` → Telesync source vs `.ts` file extension

**Implemented**: Token neighbor checks (NOT regex lookahead):
```toml
[[patterns]]
match = '(?i)^hd$'
value = "HD"
not_before = ["tv", "dvd", "cam", "rip", "tc", "ts"]
```

Supports `not_before`, `not_after`, and `requires_after`. Uses the
tokenizer — checks adjacent token text, not regex lookahead.
Linear time, no backtracking, structurally sound.

### Gap 3: Path-segment awareness ✅ IMPLEMENTED

The tokenizer now tokenizes ALL path segments with `SegmentKind`
(Directory vs Filename). Each TOML rule set declares a `SegmentScope`:

- **`FilenameOnly`**: Tech properties (source, codec, screen_size, etc.)
  skip directory tokens to avoid false positives like "TV Shows" → Source:TV.
- **`AllSegments`**: Contextual properties that benefit from directory metadata
  get a priority penalty (-5) so filename matches always win in conflicts.

Currently all rules use FilenameOnly. AllSegments requires per-segment
zone detection (future work) to avoid title-word false positives in dirs.

### The fancy_regex removal path

| Component | Uses fancy_regex? | Removal path |
|-----------|:-:|---|
| TOML rule_loader || Already clean |
| BoundedRegex (episodes, date) || Strips lookarounds → standard `regex` |
| ValuePattern (source, language, other) | ✅ Migrated | Standard `regex` only |

`fancy_regex` lives ONLY in `ValuePattern` (regex_utils.rs). Once legacy
matchers are removed, `regex_utils.rs` (380 lines), `ValuePattern`, and
`fancy_regex` all die together.

### The dual-pipeline problem

Every vocabulary property is currently matched twice (TOML + legacy). This:
- Required Zone Rule 6 (TOML Source catches "WEB-DL", legacy Language catches "DL")
- Causes ~1-2% regression in common_words and movies tests
- Increases conflict resolution work

This is fine as a transitional state but must be resolved by removing legacy
matchers incrementally.

---

## Execution Plan

### Phase A: Close engine gaps ~~(unblock legacy removal)~~ ✅ DONE
1. ✅ Add `side_effects` to TOML rule engine + rule_loader
2. ✅ Add `not_before` / `not_after` / `requires_after` neighbor checks
3. ✅ Path-segment tokenizer with `SegmentKind` (Directory vs Filename)
4. ✅ Property-scoped `SegmentScope` (FilenameOnly vs AllSegments)
5.`Property::from_name()` for side-effect property mapping
6. ✅ Extract `match_tokens_in_segment()` for clarity
7. ✅ tokenizer.rs coherent single unit (698 lines) — no split needed

### Phase A.1: Zone map (v0.2.1) — ✅ DONE
1. `ZoneMap` struct + two-phase `detect_anchors()` in `zone_map.rs`
2.`zone_scope` field in TOML `RuleSet` + parser (`rule_loader.rs`)
3.`ZoneMap` passed to `match_tokens_in_segment()` for filtering
4. ✅ Tagged ambiguous TOML rules:
   - `other_positional.toml`: `zone_scope = "tech_only"`
   - `episode_details.toml`: `zone_scope = "tech_only"`
   - `edition.toml`, `other.toml`: intentionally unrestricted (unambiguous)
   - `language.toml`: handled by zone_rules Rule 1 (needs legacy matcher retirement first)
5. ✅ Retired `apply_zone_rules()` heuristics (Rule 4 → TOML zone_scope)
6. ✅ Simplified `extract_title()` — uses ZoneMap for year disambiguation
7. ✅ Year disambiguation integrated into zone map + pipeline Step 2b

### Phase B: Remove legacy matchers (incremental)
Retire one legacy matcher at a time, in order of coverage:
1. **TOML-only (test shells remain)**: color_depth, audio_profile,
   other_positional, video_api, video_codec, edition, streaming_service,
   video_profile, episode_details, country, audio_codec, screen_size,
   container, frame_rate, other
2.**Cooperative legacy (no overlap)**: source (TOML-only),
   language (bracket codes only), subtitle_language (algorithmic only)
3.**ValuePattern retired**: year.rs uses plain Regex; ValuePattern
   deleted from regex_utils.rs.
4. **Never TOML** (algorithmic): episodes, title, release_group, date,
   year, crc32, uuid, website, size, part, bonus, version, aspect_ratio,
   bit_rate, episode_count

### Phase C: Accuracy improvements ✅
1. ✅ Tier 2 anchor expansion (dvd, dvdr, bd, pal, ntsc, secam)
2. ✅ TOML fixes: bd→source, scr→other, ultra→other, ld/hq→unrestricted
3. ✅ Audio profile HQ fix (require AAC prefix for standalone HQ)
4. ✅ Dubbed not_after constraint (GERMAN.DUBBED → no Other)
5. ✅ Zone Rule 5: adjacency-based HQ/FanSub pruning near release groups
6. ✅ Zone Rule 8: source subsumption dedup (TV+HDTV → HDTV)
7. ✅ AmazonHD side_effect (streaming_service + other:HD)
8. ✅ requires_before engine support (symmetric with requires_after)
9. ✅ Fix: requires_before + requires_after (surrounded by tech tokens)
10. ✅ Complete: requires_before with expanded context list
11. ✅ Year-as-anchor (title_len >= 6) for zone filtering
12. ✅ Edition Collector 2-token pattern (French reversed form)
13. ✅ Edition AllSegments scope
14. ✅ FLEMISH → nl-be
15. ✅ Episode title: exclude Part from stop list
16. ✅ Bracket group title boundary detection
17. Subtitle language (76.5% → 80%+)
18. Title extraction edge cases (90.1%)
19. Episode title improvements (70.1% → 80%+)
20. Release group edge cases (89.1%)

### Phase D: Polish ✅
- ✅ Version bumped to 1.0.0
- ✅ Updated COMPATIBILITY.md, README, CHANGELOG
-`cargo clippy` clean, no warnings
- Benchmark comparison with guessit (Python)
- crates.io published

### Phase E: v0.3.x — Architectural improvements ✉️ DONE

#### E1: Release group → post-resolution extraction ✅ IMPLEMENTED

**Goal**: Replace the 150-line `is_known_token` hardcoded exclusion list
in `release_group.rs` with structural overlap detection against resolved
matches.

**Implemented in v0.3**: Two-pass pipeline. Pass 1 resolves all tech
properties (TOML rules + legacy matchers). Pass 2 runs release_group
with access to resolved match positions. `is_known_token` (∼130 tokens)
replaced by:

- `is_position_claimed()` — checks if candidate overlaps any resolved tech match
- `is_non_group_token()` — small curated list (∼20 tokens) for subtitle markers
  and containers not covered by TOML rules
- `zone_map::is_tier2_token()` — reuses existing Tier 2 vocabulary
- `is_suffixed_resolution()` — catches NNNNp/i patterns

**Key design decisions**:

1. **Positional properties excluded from overlap check**: Title, EpisodeTitle,
   BonusTitle, AlternativeTitle are broad positional spans that shouldn't
   block release group detection.
2. **50%+ overlap threshold**: Prevents over-broad regex matches (like a
   VideoCodec pattern spanning `HEVC.Atmos-GROUP`) from blocking groups.
3. **Tech-only claims for backwards expansion**: `expand_group_backwards`
   only treats codec/source/screen_size matches as "tech" anchors, not
   Language/SubtitleLanguage (which would cause `SPANISH.AUDIO-GROUP` to
   incorrectly expand through AUDIO).
4. **Compound token check preserved**: `DVD+R = dvdr` is checked against
   Tier 2 tokens to prevent `DVD-R-GROUP` from expanding to `R-GROUP`.

---

## Decision Log

### D001: Data-driven patterns (TOML) over hardcoded Rust

**Status**: Stable (v1.0)

Move simple property patterns into TOML rule files, embedded at compile time
via `include_str!()`. Keep complex algorithmic logic (title, episodes,
release_group, date) in Rust.

**Consequences**:
- Pattern definitions are readable and auditable in isolation
- The Rust engine becomes a generic rule loader + matcher
- Contributors can add patterns without deep Rust knowledge
- Single binary deployment preserved (TOML embedded at compile time)

### D002: `regex` crate only — drop `fancy_regex`

**Status**: Stable

The tokenizer eliminates the need for lookaround because patterns match
against isolated tokens, not substrings of the full input:
```
Before (needs lookaround):  (?<![a-z])HDTV(?![a-z])  on "Movie.HDTV.x264"
After  (token isolation):   (?i)^HDTV$               on token "HDTV"
```

**Security benefit**: `regex` guarantees linear-time matching. ReDoS is
structurally impossible.

### D003: Tokenizer + TOML + regex-only as bundled change

These three are interdependent:
- TOML without tokenizer still needs `fancy_regex` for boundaries
- regex-only without tokenizer breaks ~30 patterns
- Tokenizer enables both TOML and regex-only cleanly

### D004: No network, database, or ML in this crate

**Status**: Decided, permanent. See Layered Architecture above.

### D005: No rebulk port

**Status**: Decided, permanent. See "Why not rebulk?" above.

### D006: Zone map for disambiguation (anchors first, zones second, matching third)

**Status**: Stable (v1.0)

The v0.2 pipeline matches all tokens against all rules, then prunes
mistakes via post-hoc zone rules. This loses information (a pruned match
can't be restored as title content) and scatters zone logic across
multiple components.

v0.2.1 inverts the flow: detect unambiguous anchors first, construct a
zone map from those anchors, then run TOML rules with zone-awareness so
ambiguous tokens in the title zone are never matched in the first place.

**Key insight 1**: Anchors are not binary — they have confidence tiers.
Structural markers (SxxExx, 1080p, .mkv) are always unambiguous.
Tech vocabulary (x264, BluRay) is almost always unambiguous. But
position-dependent tokens like year-like numbers (1920–2039) need
contextual disambiguation.

**Key insight 2**: Tech tokens define the zone boundary; years are
disambiguated *using* that boundary, not the other way around. The
first unambiguous tech token establishes `tech_zone_start`. Year
candidates before it may be title content (e.g., "2001" in
"2001.A.Space.Odyssey.1968"); year candidates at or after it are
metadata.

**Consequences**:
- Eliminates the "match-then-prune" anti-pattern for most disambiguation
- Zone logic lives in one place (`ZoneMap`) instead of scattered heuristics
- Title extraction becomes simpler (uses zone boundaries, not re-scanning)
- Year-as-title cases (1917, 2001, 2012) are handled structurally
- Incremental: each TOML rule file can opt-in to zone scoping independently
- Backwards compatible: default scope is `unrestricted` (no behavior change)

**Does NOT solve**: multi-token compounds, compound release groups,
title DB lookups (Layer 2). These are orthogonal problems.

---

## Module Map

```
src/
├── lib.rs                  # Public API: parse()
├── main.rs                 # CLI binary (clap)
├── hunch_result.rs         # HunchResult type + typed accessors + JSON
├── pipeline.rs             # v0.2 pipeline orchestration
├── tokenizer.rs            # Input → TokenStream (separators, brackets, extension)
├── zone_map.rs             # (v0.2.1) Anchor detection + zone boundary computation
├── matcher/
│   ├── mod.rs              # Re-exports
│   ├── span.rs             # MatchSpan, Property enum (55 variants)
│   ├── engine.rs           # Conflict resolution (priority + length sweep)
│   ├── regex_utils.rs      # ValuePattern + BoundedRegex (LEGACY — to be removed)
│   └── rule_loader.rs      # TOML rule engine: exact + regex + {N} templates + zone_scope
└── properties/             # 31 property matcher modules
    ├── title.rs             # Title/episode_title extraction (algorithmic, stays)
    ├── episodes/            # S01E02, 1x03, ranges, anime, week (algorithmic, stays)
    ├── release_group.rs     # Positional heuristics (stays)
    ├── bit_rate.rs          # Bit rate detection (v0.2.1)
    ├── date.rs              # Date parsing (algorithmic, stays)
    ├── year.rs              # Year detection (stays)
    └── ...                  # ~25 legacy matchers (being migrated to TOML)

rules/                      # 20 TOML data files (compile-time embedded)
├── video_codec.toml        # H.264, H.265, AV1, Xvid, …
├── audio_codec.toml        # AAC, DTS, Dolby, FLAC, Opus, …
├── source.toml             # Blu-ray, WEB-DL, HDTV, DVDRip, …
├── screen_size.toml        # 720p, 1080p, 4K, WxH ({N} templates)
├── container.toml          # mkv, mp4, avi, srt, …
├── frame_rate.toml         # 24fps, 120fps ({N} templates)
├── language.toml           # English, French, Multi, VFF, …
├── subtitle_language.toml  # VOSTFR, NLsubs, SubForced, …
├── edition.toml            # Director's Cut, Extended, Unrated, …
├── other.toml              # HDR, Remux, Proper, Repack, 3D, …
├── other_positional.toml    # Position-dependent Other (zone_scope = tech_only)
├── streaming_service.toml  # AMZN, NF, HMAX, DSNP, …
├── video_profile.toml      # Hi10P, HP, SVC, …
├── audio_profile.toml      # Atmos, DTS:X, TrueHD, …
├── audio_channels.toml     # 5.1, 7.1, 2ch, …
├── color_depth.toml        # 10-bit, 8-bit, 12-bit, …
├── country.toml            # US, UK, GB, CA, AU, NZ
├── episode_details.toml    # Special, Pilot, Unaired, Final
├── episode_format.toml     # Minisode (v0.2.1)
└── video_api.toml          # DXVA, D3D11, CUDA, …
```
tests/
├── guessit_regression.rs   # 22 fixture files, ratchet-pattern floors
├── integration.rs          # 27 hand-written end-to-end tests
├── helpers/mod.rs           # YAML loader, test case struct
└── fixtures/               # Self-contained test vectors (from guessit)
```

---

## TOML Rule File Format

```toml
property = "video_codec"
zone_scope = "unrestricted"  # (v0.2.1) "unrestricted" | "tech_only" | "after_anchor"

[exact]                    # Case-insensitive exact token lookups
x264 = "H.264"
hevc = "H.265"

[exact_sensitive]          # Case-sensitive (for ambiguous short tokens)
NZ = "NZ"

[[patterns]]               # Regex with optional {N} capture templates
match = '(?i)^[xh][-.]?265$'
value = "H.265"            # Static value

[[patterns]]
match = '(?i)^(\d{3,4})x(\d{3,4})$'
value = "{2}p"             # Dynamic: capture group 2 → "1080p"

[[patterns]]               # Side effects: one match → multiple properties
match = '(?i)^dvd[-. ]?rip$'
value = "DVD"
side_effects = [{ property = "other", value = "Rip" }]

[[patterns]]               # Neighbor constraints: context-aware matching
match = '(?i)^hd$'
value = "HD"
not_before = ["tv", "dvd", "cam"]
# Also available: not_after, requires_after
```

Matching order: case-sensitive exact → case-insensitive exact → regex (first match wins).
All regex uses `regex` crate only (linear-time, ReDoS-immune).

---

## Developer Guidelines

### Code style

- **Idiomatic Rust**: ownership, strong types, exhaustive matches, `clippy` clean.
- **DRY**: shared helpers in `matcher/`. Don't duplicate patterns between TOML and Rust.
- **YAGNI**: don't build Phase D infra during Phase A.
- **Files under 600 lines**: split by responsibility if growing beyond this.
- **Tests in each module**: `#[cfg(test)] mod tests` blocks.

### Adding a new property (v0.2)

1. Create `rules/<name>.toml` with `property`, `[exact]`, and `[[patterns]]`.
2. Add a `LazyLock<RuleSet>` static in `pipeline.rs`.
3. Register it in the `toml_rules` vector with appropriate property + priority.
4. Add `Property::YourProp` variant to `src/matcher/span.rs` (if new).
5. Add unit tests in rule_loader and integration tests.
6. Only create `src/properties/<name>.rs` if the property needs algorithmic
   parsing (episodes, year, etc.) that tokens can't express.

### Testing strategy

1. **Unit tests** in each property matcher (`#[cfg(test)]` blocks).
2. **Integration tests** (`tests/integration.rs`) — hand-written end-to-end.
3. **Regression tests** (`tests/guessit_regression.rs`) — 22 fixture files with
   ratchet-pattern minimum pass rates. Floors only go up.
4. **Compatibility report**`cargo test compatibility_report -- --ignored --nocapture`
5. **Benchmarks** (`benches/parse.rs`) — Criterion benchmarks.
6. All fixtures self-contained in `tests/fixtures/` (no external repos needed).

### Conflict resolution

1. **Priority tiers**: Extension (10) > known tokens (0) > weak/positional (-1/-2).
2. **Overlap**: higher priority wins; ties broken by longer span.
3. **Multi-value**: Episode, Language, SubtitleLanguage, Other, Season, Disc
   support multiple values on the same span (serialized as JSON arrays).

---

## Dependencies

| Crate | Purpose | Status |
|-------|---------|--------|
| `regex` | Pattern matching (linear-time) | Permanent |
| `fancy-regex` | Lookaround fallback | **Removing** (blocked by legacy matchers) |
| `serde` + `serde_json` | JSON output | Permanent |
| `clap` | CLI argument parsing | Permanent |
| `toml` | TOML rule parsing | Permanent |

---

## Guessit Source Map

For porting patterns, find originals in
[guessit](https://github.com/guessit-io/guessit) `guessit/rules/properties/`:

| hunch module | guessit source |
|-------------|----------------|
| `episodes/` | `episodes.py` (~900 lines) |
| `title.rs` | `title.py` |
| `release_group.rs` | `release_group.py` |
| `source.rs``source.toml` | `source.py` |
| `audio_codec.rs``audio_codec.toml` | `audio_codec.py` |
| `video_codec.rs``video_codec.toml` | `video_codec.py` |
| `language.rs``language.toml` | `language.py` |
| `other.rs``other.toml` | `other.py` |

---

## Security Model

- TOML rule files embedded at compile time — no runtime file access
- `regex` crate only (target) — linear-time, ReDoS structurally impossible
- Schema-validated at load time (max pattern length, valid property names)
- No `unsafe`, no FFI, no file I/O, no network
- All patterns reviewed as code changes (TOML files are versioned)