clockwords 0.4.0

# clockwords

**Find and resolve natural-language time expressions in text.**

[![CI](https://github.com/hg8496/clockwords/actions/workflows/ci.yml/badge.svg)](https://github.com/hg8496/clockwords/actions/workflows/ci.yml)
[![Crates.io](https://img.shields.io/crates/v/clockwords.svg)](https://crates.io/crates/clockwords)
[![Docs.rs](https://docs.rs/clockwords/badge.svg)](https://docs.rs/clockwords)
[![License](https://img.shields.io/crates/l/clockwords.svg)](LICENSE)

`clockwords` scans free-form text for relative time expressions like *"last Friday from 9 to eleven"*, *"yesterday at 3pm"*, or *"letzten Freitag von 9 bis 12 Uhr"* and returns their byte-offset spans together with resolved `DateTime<Utc>` values. It supports **English**, **German**, **French**, and **Spanish** out of the box.

Built for **real-time GUI applications** (time-tracking, note-taking, calendars) where the user types naturally and the app highlights detected time references as they appear. Timezone-aware — times the user enters are interpreted in their local timezone (configurable, defaults to UTC).

## Features

- **Four languages**: English, German, French, Spanish
- **Timezone-aware**: User input is interpreted in a configurable timezone (defaults to UTC for backward compatibility)
- **Byte-offset spans**: Directly usable for text highlighting in any GUI framework
- **Resolved times**: Every match resolves to a concrete `DateTime<Utc>` point or range
- **Incremental typing support**: Detects partial matches (e.g. `"yester"` while the user is still typing `"yesterday"`)
- **Accent-tolerant**: Handles `días`/`dias`, `à`/`a`, `mañana`/`manana`, `dernière`/`derniere`
- **Fast rejection**: Aho-Corasick keyword prefilter skips text with no time-related words in sub-microsecond time
- **Zero allocations on rejection**: If no keywords are found, `scan()` returns immediately
- **No unsafe code**
- **Defensive**: All internal date arithmetic returns `Option` — no panics from edge-case dates
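
As an illustration of the accent tolerance listed above, here is a minimal sketch of one way to compare words while ignoring diacritics. It is not taken from the crate: `fold_accents` and `same_ignoring_accents` are hypothetical helpers, and the mapping table is deliberately small.

```rust
/// Fold a handful of accented Latin letters to their ASCII base letter.
/// Hypothetical, deliberately small table for illustration only.
fn fold_accents(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'á' | 'à' | 'â' => 'a',
            'é' | 'è' | 'ê' => 'e',
            'í' | 'î' => 'i',
            'ó' | 'ô' => 'o',
            'ú' | 'ù' | 'û' => 'u',
            'ñ' => 'n',
            other => other,
        })
        .collect()
}

/// Compare two words ignoring case and the accents handled above.
fn same_ignoring_accents(a: &str, b: &str) -> bool {
    fold_accents(&a.to_lowercase()) == fold_accents(&b.to_lowercase())
}
```

Folding changes byte lengths (`ñ` is two bytes, `n` is one), so spans found in a folded copy would have to be mapped back to the original text. Accepting both spellings in the pattern itself avoids that bookkeeping.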

## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
clockwords = "0.4"
```

### Basic Usage

```rust
use clockwords::{default_scanner, ResolvedTime};
use chrono::Utc;

fn main() {
    // Create a scanner with all four languages enabled
    let scanner = default_scanner();
    let now = Utc::now();

    let text = "The last hour I coded the initial code for the time library";
    let matches = scanner.scan(text, now);

    for m in &matches {
        println!(
            "Found '{}' at bytes {}..{} ({:?})",
            &text[m.span.as_range()],
            m.span.start,
            m.span.end,
            m.kind,
        );

        match &m.resolved {
            ResolvedTime::Point(dt) => println!("  Resolved to: {dt}"),
            ResolvedTime::Range { start, end } => {
                println!("  Resolved to: {start} .. {end}")
            }
        }
    }
}
```

**Example output** (timestamps depend on the current time):
```
Found 'The last hour' at bytes 0..13 (TimeRange)
  Resolved to: 2026-02-08T12:30:00Z .. 2026-02-08T13:30:00Z
```

### Select Specific Languages

```rust
use clockwords::scanner_for_languages;

// Only English and German
let scanner = scanner_for_languages(&["en", "de"]);
```

### Timezone Support

By default, all times are interpreted in UTC. To interpret user input in a specific timezone, configure `ParserConfig::timezone` or use `scan_with_tz()`:

```rust
use clockwords::{ParserConfig, TimeExpressionScanner, Tz, default_scanner};
use chrono::Utc;

// Option 1: Set timezone in config
let config = ParserConfig {
    timezone: Tz::Europe__Berlin,
    ..Default::default()
};
// Pass config when constructing the scanner (e.g. via TimeExpressionScanner::new)

// Option 2: Override per scan call
let scanner = default_scanner();
let matches = scanner.scan_with_tz("yesterday at 3pm", Utc::now(), Tz::Europe__Berlin);
// "3pm" is interpreted as 15:00 Berlin time → resolves to 14:00 UTC (in winter)
```

When a timezone is set, all day boundaries (midnight), time-of-day values, and weekday calculations use the user's local timezone. The resolved output always remains in UTC. For example, with `Europe/Berlin` (CET, UTC+1 in winter):

- `"today"` at 23:30 UTC (= 00:30 CET next day) → the range covers the *next* calendar day in Berlin
- `"at 3pm"` → resolves to 14:00 UTC (not 15:00 UTC)
- `"the last hour"` → unchanged (duration-based, timezone-independent)
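
The arithmetic behind these examples can be sketched with plain Unix timestamps and a fixed UTC offset. This is a simplified model under stated assumptions: the function names are hypothetical, the offset is constant (no daylight-saving transitions), and the crate itself relies on `chrono-tz` rather than code like this.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// UTC timestamps of the start and end (next local midnight) of the local
/// calendar day containing `now_utc`, for a fixed UTC offset in seconds.
fn local_day_bounds_utc(now_utc: i64, offset_secs: i64) -> (i64, i64) {
    // Shift into local wall-clock time, floor to local midnight, shift back.
    let local = now_utc + offset_secs;
    let local_midnight = local.div_euclid(SECS_PER_DAY) * SECS_PER_DAY;
    let start_utc = local_midnight - offset_secs;
    (start_utc, start_utc + SECS_PER_DAY)
}

/// UTC timestamp of a local wall-clock time on the local day containing `now_utc`.
fn local_time_to_utc(now_utc: i64, offset_secs: i64, hour: i64, minute: i64) -> i64 {
    let (start_utc, _) = local_day_bounds_utc(now_utc, offset_secs);
    start_utc + hour * 3_600 + minute * 60
}
```

With an offset of `3_600` (CET), a `now_utc` of 23:30 UTC yields a day that starts at 23:00 UTC, which is the next calendar day in Berlin, and 15:00 local time maps to 14:00 UTC.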

## Supported Expressions

### Relative Days

| Language | Examples |
|----------|----------|
| English  | `today`, `tomorrow`, `yesterday` |
| German   | `heute`, `morgen`, `gestern` |
| French   | `aujourd'hui`, `demain`, `hier` |
| Spanish  | `hoy`, `mañana`, `ayer` |

Resolves to a full-day `Range` (midnight to midnight in the configured timezone).

### Relative Weekdays

| Language | Examples |
|----------|----------|
| English  | `last Friday`, `next Monday`, `this Wednesday` |
| German   | `letzten Freitag`, `nächsten Montag`, `diesen Mittwoch` |
| French   | `vendredi dernier`, `lundi prochain`, `ce mercredi` |
| Spanish  | `el viernes pasado`, `el próximo lunes`, `este miércoles` |

Resolves to a full-day `Range` (midnight to midnight in the configured timezone). French and Spanish support both pre- and post-positive word order (e.g. `lundi prochain` and `prochain lundi`). Spanish also supports `el viernes que viene`.
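
One plausible way to turn "last" and "next" into a day offset is sketched below. The semantics are an assumption for illustration (strictly before or strictly after today, never today itself) and the crate's actual rules may differ. The function names and the weekday numbering are hypothetical.

```rust
// Weekdays are numbered 0 = Monday .. 6 = Sunday (a convention chosen for this sketch).

/// Day offset to the most recent `target` weekday strictly before `today`,
/// e.g. "last Friday". Always in -7..=-1.
fn last_weekday_offset(today: u8, target: u8) -> i64 {
    let back = (i64::from(today) - i64::from(target)).rem_euclid(7);
    if back == 0 { -7 } else { -back }
}

/// Day offset to the first `target` weekday strictly after `today`,
/// e.g. "next Monday". Always in 1..=7.
fn next_weekday_offset(today: u8, target: u8) -> i64 {
    let ahead = (i64::from(target) - i64::from(today)).rem_euclid(7);
    if ahead == 0 { 7 } else { ahead }
}
```

The resulting offset can then be applied to the current local date to obtain the full-day range.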

### Day Offsets

| Language | Examples |
|----------|----------|
| English  | `in 4 days`, `two days ago`, `in three days` |
| German   | `in 3 Tagen`, `vor zwei Tagen` |
| French   | `dans 3 jours`, `il y a deux jours` |
| Spanish  | `en 3 días`, `hace 2 dias` |

Supports both digits and written-out number words (1–30).
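
A minimal sketch of how a count given as digits or as a number word might be turned into a signed day offset. `parse_count` and `day_offset` are hypothetical names, and only a few English words are listed here; the crate keeps its own mappings in `src/lang/numbers.rs`.

```rust
/// Parse a day count given as digits or as an English number word.
/// Only a few words are listed; a full table would cover 1 through 30.
fn parse_count(token: &str) -> Option<u32> {
    let lower = token.to_lowercase();
    let n = match lower.as_str() {
        "one" => 1,
        "two" => 2,
        "three" => 3,
        "four" => 4,
        "five" => 5,
        "ten" => 10,
        "thirty" => 30,
        // Anything else must be plain digits.
        digits => digits.parse::<u32>().ok()?,
    };
    Some(n)
}

/// Signed day offset: positive for "in N days", negative for "N days ago".
fn day_offset(count_token: &str, in_past: bool) -> Option<i64> {
    let n = i64::from(parse_count(count_token)?);
    Some(if in_past { -n } else { n })
}
```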

### Time Specifications

| Language | Examples |
|----------|----------|
| English  | `at 3pm`, `at 3 am`, `13 o'clock`, `at 3:30pm`, `11:30am`, `at 15:30` |
| German   | `um 15 Uhr`, `um 15:30 Uhr`, `um 15:30` |
| French   | `à 13h`, `à 13h30`, `à 13:30` |
| Spanish  | `a las 3`, `a las 15:30` |

Colon-delimited minutes (`H:MM`) are supported in all languages. In English, am/pm is optional — bare `H:MM` with `at` is treated as 24-hour time. French supports both `h` and `:` as separators (`13h30` and `13:30`).

Resolves to a `Point` in time.
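
To make the am/pm and 24-hour rules concrete, here is a small standalone sketch that normalizes a clock string to a 24-hour `(hour, minute)` pair. `parse_clock` is a hypothetical helper, not part of the crate, which uses per-language regex rules instead.

```rust
/// Parse a clock time such as "3pm", "3 am", "3:30pm", "11:30am" or "15:30"
/// into (hour, minute) on a 24-hour clock. Returns None for invalid input.
fn parse_clock(input: &str) -> Option<(u32, u32)> {
    let s = input.trim().to_lowercase();
    // Split off an optional am/pm suffix.
    let (digits, meridiem) = if let Some(rest) = s.strip_suffix("am") {
        (rest.trim_end(), Some(false))
    } else if let Some(rest) = s.strip_suffix("pm") {
        (rest.trim_end(), Some(true))
    } else {
        (s.as_str(), None)
    };
    // Split "H:MM" into hour and minute; a bare "H" means minute 0.
    let (hour_str, minute_str) = match digits.split_once(':') {
        Some((h, m)) => (h, m),
        None => (digits, "0"),
    };
    let hour: u32 = hour_str.parse().ok()?;
    let minute: u32 = minute_str.parse().ok()?;
    if minute > 59 {
        return None;
    }
    let hour24 = match meridiem {
        // 12-hour clock: 12am is 00:xx, 12pm is 12:xx.
        Some(is_pm) => {
            if !(1..=12).contains(&hour) {
                return None;
            }
            hour % 12 + if is_pm { 12 } else { 0 }
        }
        // No suffix: treat the value as 24-hour time.
        None => {
            if hour > 23 {
                return None;
            }
            hour
        }
    };
    Some((hour24, minute))
}
```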

### Time Ranges

| Language | Examples |
|----------|----------|
| English  | `the last hour`, `last minute`, `between 9 and 12`, `from 9 to 12` |
| German   | `die letzte Stunde`, `von 9 bis 12 Uhr`, `zwischen 9 und 12` |
| French   | `la dernière heure`, `entre 9 et 12 heures` |
| Spanish  | `la última hora`, `entre las 9 y las 12` |

English supports both `between X and Y` and `from X to Y` with number words (`from nine to five`).

### Combined Expressions

Any day reference (relative day, weekday, or day offset) can be combined with a time specification or time range in a single expression. The entire phrase is detected as one match:

**Relative day + time:**

| Language | Examples |
|----------|----------|
| English  | `yesterday at 3pm`, `yesterday at 3:30pm`, `yesterday at 15:30`, `tomorrow between 9 and 12`, `yesterday from 9 to 11` |
| German   | `gestern um 15 Uhr`, `gestern um 15:30 Uhr`, `gestern um 15:30`, `gestern von 9 bis 12 Uhr` |
| French   | `hier à 13h`, `hier à 13h30`, `hier à 13:30`, `hier entre 9 et 12 heures` |
| Spanish  | `ayer a las 3`, `ayer a las 15:30`, `ayer entre las 9 y las 12` |

**Weekday + time:**

| Language | Examples |
|----------|----------|
| English  | `last Friday at 3pm`, `last Friday at 3:30pm`, `last Friday at 15:30`, `last Friday from 9 to eleven`, `next Monday between 9 and 12` |
| German   | `letzten Freitag um 15 Uhr`, `letzten Freitag um 15:30 Uhr`, `nächsten Montag um 9:15`, `diesen Mittwoch zwischen 9 und 11` |
| French   | `vendredi dernier à 13h`, `vendredi dernier à 13h30`, `vendredi dernier à 13:30`, `ce lundi à 14h30`, `ce mercredi entre 9 et 11 heures` |
| Spanish  | `el viernes pasado a las 3`, `el viernes pasado a las 3:30`, `el próximo lunes a las 9:30`, `el pasado viernes entre las 9 y las 12` |

Combined expressions resolve to either a `Point` (day + time spec) or a `Range` (day + time range) on the specified day.

## Architecture

### How Scanning Works

```
Input text
┌─────────────────────┐
│ Aho-Corasick        │  Fast keyword check (~ns)
│ Prefilter           │  Rejects text with no time words
└─────────┬───────────┘
          │ keywords found
┌─────────────────────┐
│ Per-Language        │  Regex rules with resolver closures
│ Grammar Rules       │  Run for each enabled language
└─────────┬───────────┘
          │ raw matches
┌─────────────────────┐
│ Deduplication       │  Prefer Complete > Partial, longer > shorter
│ & Sorting           │  Remove overlapping inferior matches
└─────────┬───────────┘
     Vec<TimeMatch>
```
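
The deduplication stage of the diagram can be sketched as follows. This is one reading of "prefer Complete > Partial, longer > shorter", not the crate's actual code. `Candidate`, `Confidence`, and `dedup` are hypothetical stand-ins for the real types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Confidence {
    Partial,
    Complete, // declared last, so it compares as greater than Partial
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    start: usize,
    end: usize,
    confidence: Confidence,
}

impl Candidate {
    fn overlaps(&self, other: &Candidate) -> bool {
        self.start < other.end && other.start < self.end
    }
}

/// Keep the best candidate among overlapping ones, then sort by position.
fn dedup(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // Best first: Complete before Partial, then longer before shorter,
    // then earlier start as a deterministic tie-breaker.
    candidates.sort_by(|a, b| {
        b.confidence
            .cmp(&a.confidence)
            .then((b.end - b.start).cmp(&(a.end - a.start)))
            .then(a.start.cmp(&b.start))
    });
    let mut kept: Vec<Candidate> = Vec::new();
    for c in candidates {
        // A candidate survives only if it overlaps nothing better.
        if kept.iter().all(|k| !k.overlaps(&c)) {
            kept.push(c);
        }
    }
    kept.sort_by_key(|c| c.start);
    kept
}
```

For `"yesterday at 3pm"`, the candidates `0..9`, `10..16`, and `0..16` would collapse to the single combined span `0..16` under this scheme.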

### Buffer-Rescan Strategy

Rather than maintaining an incremental parser state machine, `clockwords` re-scans the full text buffer on every call to `scan()`. This is the right trade-off for GUI text input:

- Input buffers are typically < 1 KB
- Full regex scan of a short buffer completes in microseconds
- Dramatically simpler than maintaining parser state across edits
- No edge cases around cursor position, insertions, or deletions

### Type Overview

| Type | Description |
|------|-------------|
| `TimeExpressionScanner` | Main entry point — holds language parsers and prefilter |
| `TimeMatch` | A single match result: span + confidence + resolved time + kind |
| `Span` | Byte-offset range (`start..end`) for slicing the original text |
| `ResolvedTime` | `Point(DateTime<Utc>)` or `Range { start, end }` |
| `MatchConfidence` | `Partial` (user still typing) or `Complete` |
| `ExpressionKind` | `RelativeDay`, `RelativeDayOffset`, `TimeSpecification`, `TimeRange`, `Combined` |
| `ParserConfig` | Settings: `report_partial` (default `true`), `max_matches` (default `10`), `timezone` (default `Tz::UTC`) |
| `Tz` | Re-exported from `chrono-tz` — IANA timezone (e.g. `Tz::Europe__Berlin`, `Tz::US__Eastern`) |

## GUI Integration

`clockwords` is designed for real-time text highlighting. Here's how to wire it up:

```rust
use clockwords::{default_scanner, MatchConfidence, TimeExpressionScanner};
use chrono::Utc;

struct App {
    scanner: TimeExpressionScanner,
}

impl App {
    fn new() -> Self {
        Self {
            scanner: default_scanner(),
        }
    }

    /// Call this on every keystroke
    fn on_text_changed(&self, text: &str) {
        let matches = self.scanner.scan(text, Utc::now());

        for m in &matches {
            let range = m.span.start..m.span.end;
            let style = match m.confidence {
                MatchConfidence::Complete => "solid_underline",
                MatchConfidence::Partial  => "dotted_underline",
            };
            // Apply `style` to this byte range in your text widget
            println!("Highlight bytes {range:?} with {style}");
        }
    }
}
```
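
Spans are byte offsets, while some text widgets index by character. A minimal conversion sketch using only the standard library is shown below. `byte_span_to_char_range` is a hypothetical helper, not part of the crate.

```rust
/// Convert a byte-offset span into a char-index range.
/// Returns None if an offset is out of bounds or not on a char boundary.
fn byte_span_to_char_range(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end || end > text.len() {
        return None;
    }
    if !text.is_char_boundary(start) || !text.is_char_boundary(end) {
        return None;
    }
    // Count the chars before the span, then the chars inside it.
    let char_start = text[..start].chars().count();
    let char_end = char_start + text[start..end].chars().count();
    Some((char_start, char_end))
}
```

In `"¿Nos vemos mañana?"`, the word `mañana` occupies bytes `12..19` but characters `11..17`. Widgets that index by UTF-16 code units would need `encode_utf16().count()` in place of `chars().count()`.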

### Partial Match Highlighting

When the user types `"I worked yester"`, the scanner returns a **Partial** match on `"yester"`. Your GUI can show a dimmed or dotted underline to hint that a time expression is being formed. Once the user completes `"yesterday"`, the match upgrades to **Complete** with a fully resolved time.
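
The idea behind partial detection can be sketched as a prefix check on the fragment currently being typed. This is an assumption about the general technique, not the crate's implementation. `trailing_partial` is a hypothetical function, and the minimum length of 3 mirrors the `keyword_prefixes()` note in this README.

```rust
/// If `text` ends in a strict prefix (at least `min_len` bytes long) of one
/// of `keywords`, return the byte offset where that trailing fragment starts.
fn trailing_partial(text: &str, keywords: &[&str], min_len: usize) -> Option<usize> {
    // The last whitespace-delimited fragment is what the user is typing.
    let start = text
        .char_indices()
        .rev()
        .find(|(_, c)| c.is_whitespace())
        .map(|(i, c)| i + c.len_utf8())
        .unwrap_or(0);
    let fragment = text[start..].to_lowercase();
    if fragment.len() < min_len {
        return None;
    }
    // Strict prefix: a fully typed keyword is a complete match, not a partial one.
    let is_prefix = keywords
        .iter()
        .any(|k| k.len() > fragment.len() && k.starts_with(fragment.as_str()));
    if is_prefix { Some(start) } else { None }
}
```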

To disable partial matching:

```rust
use clockwords::{ParserConfig, TimeExpressionScanner};

let config = ParserConfig {
    report_partial: false,
    ..Default::default()
};
// Pass `config` when constructing the scanner (e.g. via TimeExpressionScanner::new)
```

## Adding a New Language

1. Create `src/lang/xx.rs` (copy an existing language file as a template)
2. Implement the `LanguageParser` trait:
   - `lang_id()` — return the ISO 639-1 code (e.g. `"it"`)
   - `keywords()` — return Aho-Corasick trigger words
   - `keyword_prefixes()` — return typing prefixes (length >= 3)
   - `parse()` — call `apply_rules()` with your `GrammarRule` list
3. Add number-word mappings to `src/lang/numbers.rs`
4. Register the language in `scanner_for_languages()` in `src/lib.rs`
5. Add tests in `tests/`

Each `GrammarRule` is a compiled regex paired with a resolver closure:

```rust
GrammarRule {
    pattern: Regex::new(r"(?i)\b(?P<day>oggi|domani|ieri)\b").unwrap(),
    kind: ExpressionKind::RelativeDay,
    resolver: |caps, now, tz| {
        let offset = match caps.name("day")?.as_str().to_lowercase().as_str() {
            "oggi" => 0,
            "domani" => 1,
            "ieri" => -1,
            _ => return None,
        };
        resolve::resolve_relative_day(offset, now, tz)
    },
}
```

## Performance

| Scenario | Approximate Time |
|----------|------------------|
| No keywords in text (fast rejection) | ~1 µs |
| Short sentence with 1 match | ~10 µs |
| Paragraph with multiple matches | ~10 µs |

The Aho-Corasick prefilter rejects text without any time-related words in about a microsecond, and the regex engine is never invoked for such input.
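
The shape of this fast-rejection path can be sketched with the standard library alone. The code below is a naive stand-in, not the crate's prefilter: it makes one pass per keyword where an Aho-Corasick automaton makes a single pass over the text, and both function names are hypothetical.

```rust
/// Return true if `text` contains any of `keywords`, ignoring ASCII case.
/// Works on raw bytes and allocates nothing.
fn may_contain_time_expression(text: &str, keywords: &[&str]) -> bool {
    let bytes = text.as_bytes();
    keywords.iter().any(|k| {
        !k.is_empty()
            && bytes
                .windows(k.len())
                .any(|w| w.eq_ignore_ascii_case(k.as_bytes()))
    })
}

/// Run the (expensive) `scan` closure only when the prefilter passes.
fn scan_if_relevant<T>(
    text: &str,
    keywords: &[&str],
    scan: impl FnOnce(&str) -> Vec<T>,
) -> Vec<T> {
    if may_contain_time_expression(text, keywords) {
        scan(text)
    } else {
        Vec::new() // an empty Vec does not allocate
    }
}
```

Because the rejection branch returns an empty `Vec` without touching the heap, text with no keywords costs only the keyword search itself.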

## Running Tests

```bash
cargo test
```

The test suite includes **141 integration tests + 1 doctest** covering:
- All four languages with various expression types
- Combined weekday + time expressions across all languages
- Timezone-aware resolution (Europe/Berlin, US/Eastern, UTC)
- Cross-midnight timezone boundary handling
- Accent-tolerant variants (with and without diacritics)
- Embedded expressions in longer sentences
- Colon-delimited time parsing (`3:30pm`, `15:30`, `13h30`, `13:30`)
- `from X to Y` with number words (`nine to five`)
- Incremental/partial matching
- Edge cases (empty input, no false positives)
- Cross-language default scanner

## Running the TUI Demo

An interactive terminal demo is included:

```bash
cargo run --example tui_demo
```

Type time expressions and watch them get parsed in real time. Press **ESC** to quit.

## Dependencies

| Crate | Purpose |
|-------|---------|
| [`chrono`](https://crates.io/crates/chrono) | Date/time types and arithmetic |
| [`chrono-tz`](https://crates.io/crates/chrono-tz) | IANA timezone database for timezone-aware resolution |
| [`regex`](https://crates.io/crates/regex) | Per-language grammar patterns |
| [`aho-corasick`](https://crates.io/crates/aho-corasick) | Fast multi-keyword prefilter |

## License

Licensed under the Apache License, Version 2.0 ([LICENSE](LICENSE) or <http://www.apache.org/licenses/LICENSE-2.0>).