Function tokenise 

pub fn tokenise(utterance: &str, config: &Config) -> Vec<Token>

Public re-export of the tokenise function. Split utterance on any standard separator or extra separator and classify each resulting chunk as a Token.

§What counts as a separator

The standard separator set is ASCII whitespace (space, tab, newline, carriage return) plus five punctuation characters: slash (/), hyphen (-), period (.), comma (,), and backslash (\). Any additional strings in config.extra_separators are also treated as separators.
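
The splitting rule above can be illustrated with a minimal standalone sketch. The helper name split_on_separators and its signature are hypothetical and are not part of the crate; the real implementation may handle extra separators differently (this version simply rewrites them to spaces first, so an extra separator that occurs inside a word would also split that word).

```rust
/// Hypothetical helper, for illustration only: splits `utterance` on the
/// standard separator characters and on every string in `extra_separators`.
fn split_on_separators(utterance: &str, extra_separators: &[&str]) -> Vec<String> {
    // Rewrite each extra separator to a space so a single pass handles both kinds.
    let mut normalised = utterance.to_string();
    for sep in extra_separators {
        if !sep.is_empty() {
            normalised = normalised.replace(*sep, " ");
        }
    }

    normalised
        // Standard set: space, tab, newline, carriage return, / - . , and backslash.
        .split(|c: char| matches!(c, ' ' | '\t' | '\n' | '\r' | '/' | '-' | '.' | ',' | '\\'))
        // Consecutive separators produce empty chunks; drop them.
        .filter(|chunk| !chunk.is_empty())
        .map(str::to_string)
        .collect()
}
```

Under these assumptions, "19/10-2014" yields the chunks 19, 10, and 2014, and "19 of October" with "of" as an extra separator yields 19 and October.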

When config.no_separator is true and the utterance is a pure digit string of length 6 or 8, it is sliced positionally according to config.component_order rather than split on separators.
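
As a rough illustration of positional slicing, the sketch below cuts a 6- or 8-digit string into three fields. The Component enum, the slice_positionally function, and the width rule (two digits each for day and month, the remaining two or four for the year) are assumptions made for this example; the source does not state how config.component_order is typed or how widths are assigned.

```rust
/// Hypothetical stand-in for the entries of `config.component_order`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Component {
    Day,
    Month,
    Year,
}

/// Hypothetical helper: slices a pure digit string of length 6 or 8 into
/// three fields in the given order. Returns None for any other input.
fn slice_positionally(digits: &str, order: [Component; 3]) -> Option<Vec<(Component, String)>> {
    let all_digits = digits.bytes().all(|b| b.is_ascii_digit());
    if !all_digits || !(digits.len() == 6 || digits.len() == 8) {
        return None;
    }

    // Assumed widths: day and month take 2 digits, the year takes the rest.
    let year_width = digits.len() - 4;
    let mut start = 0;
    let mut fields = Vec::with_capacity(3);
    for &component in order.iter() {
        let width = if component == Component::Year { year_width } else { 2 };
        // `get` avoids a panic if `order` repeats a component.
        let piece = digits.get(start..start + width)?;
        fields.push((component, piece.to_string()));
        start += width;
    }

    // Reject orders whose widths do not cover the whole string.
    if start == digits.len() {
        Some(fields)
    } else {
        None
    }
}
```

With the order day, month, year, "19102014" becomes 19, 10, 2014; with the order year, month, day, "141019" becomes 14, 10, 19.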

§Classification

Each non-separator chunk is examined for digit-to-alpha (or alpha-to-digit) boundaries, allowing adjacent tokens like "19october" or "August7" to be split and classified independently.

  • Token::OrdinalDay — digit run followed by st, nd, rd, or th.
  • Token::MonthName — full name, 3-letter abbreviation, unambiguous prefix, or fuzzy misspelling.
  • Token::Numeric — a run of ASCII digits; stores (value, digit_count).
  • Anything else (noise words, stray punctuation) is silently discarded.
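
The boundary split and the classification rules listed above can be sketched in a few dozen lines. Everything here is a simplified stand-in: SketchToken, split_on_boundary, and classify_chunk are hypothetical names, month matching covers only full names and 3-letter abbreviations (no unambiguous prefixes or fuzzy matching), and letter O substitution is ignored.

```rust
/// Simplified stand-in for the crate's Token type (assumed field types).
#[derive(Debug, PartialEq)]
enum SketchToken {
    OrdinalDay(u32),
    MonthName(&'static str),
    Numeric(u32, usize), // (value, digit_count)
}

const MONTHS: [&str; 12] = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
];

/// Split a chunk wherever an ASCII digit meets a non-digit, or vice versa.
fn split_on_boundary(chunk: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    let mut prev_is_digit: Option<bool> = None;
    for (i, c) in chunk.char_indices() {
        let is_digit = c.is_ascii_digit();
        if let Some(prev) = prev_is_digit {
            if prev != is_digit {
                pieces.push(&chunk[start..i]);
                start = i;
            }
        }
        prev_is_digit = Some(is_digit);
    }
    if start < chunk.len() {
        pieces.push(&chunk[start..]);
    }
    pieces
}

/// Classify every piece of one separator-free chunk; unknown pieces are dropped.
fn classify_chunk(chunk: &str) -> Vec<SketchToken> {
    let pieces = split_on_boundary(chunk);
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < pieces.len() {
        let piece = pieces[i];
        if piece.bytes().all(|b| b.is_ascii_digit()) {
            // Digit runs too large for u32 are discarded in this sketch.
            if let Ok(value) = piece.parse::<u32>() {
                // A digit run followed by st/nd/rd/th is an ordinal day.
                let suffix = pieces.get(i + 1).map(|s| s.to_ascii_lowercase());
                if matches!(suffix.as_deref(), Some("st" | "nd" | "rd" | "th")) {
                    tokens.push(SketchToken::OrdinalDay(value));
                    i += 2; // consume the suffix as well
                    continue;
                }
                tokens.push(SketchToken::Numeric(value, piece.len()));
            }
        } else {
            let lower = piece.to_ascii_lowercase();
            let found = MONTHS
                .iter()
                .find(|m| **m == lower || (lower.len() == 3 && m.starts_with(lower.as_str())));
            if let Some(name) = found {
                tokens.push(SketchToken::MonthName(*name));
            }
        }
        i += 1;
    }
    tokens
}
```

In this sketch "19october" classifies as a numeric 19 followed by the month october, "19th" as an ordinal day, and a noise word such as "of" produces no token at all.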

When Config::letter_o_substitution is true (the default), any token whose characters are all ASCII digits or the letter O (upper or lower case) is treated as a numeric token with every O/o replaced by 0. This handles OCR and typing errors such as "2O24" → "2024". The substitution applies only to isolated tokens; a letter O that is part of a longer alphabetic run (e.g. "october") is never affected because sub_split_on_boundary has already separated digit and alpha runs.
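
The substitution rule itself is small enough to sketch directly. The function name substitute_letter_o is hypothetical, and the sketch follows the rule literally: it accepts any non-empty token made up only of ASCII digits and the letter O, and does not model where in the pipeline the crate applies it.

```rust
/// Hypothetical helper: returns the digit string obtained by replacing every
/// O/o with 0, or None if the token contains anything other than ASCII
/// digits and the letter O.
fn substitute_letter_o(token: &str) -> Option<String> {
    let eligible = !token.is_empty()
        && token
            .chars()
            .all(|c| c.is_ascii_digit() || c == 'O' || c == 'o');
    if !eligible {
        return None;
    }
    Some(
        token
            .chars()
            .map(|c| if c == 'O' || c == 'o' { '0' } else { c })
            .collect(),
    )
}
```

Under this rule "2O24" and "2o24" both become "2024", while "october" is rejected because it contains letters other than O.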

At most three tokens are returned.

§Examples

use partial_date::extract::tokenise;
use partial_date::models::{Config, MonthName, Token};

assert_eq!(
    tokenise("19 October 2014", &Config::default()),
    vec![
        Token::Numeric(19, 2),
        Token::MonthName(MonthName::October),
        Token::Numeric(2014, 4),
    ]
);

assert_eq!(
    tokenise("19th October,2015", &Config::default()),
    vec![
        Token::OrdinalDay(19),
        Token::MonthName(MonthName::October),
        Token::Numeric(2015, 4),
    ]
);

assert_eq!(
    tokenise("19october", &Config::default()),
    vec![
        Token::Numeric(19, 2),
        Token::MonthName(MonthName::October),
    ]
);

// Letter O substitution (enabled by default):
assert_eq!(
    tokenise("2O24", &Config::default()),
    vec![Token::Numeric(2024, 4)]
);

// "7october" — the O is part of "october", not a standalone token, so
// substitution does not apply and the month name is recognised normally.
assert_eq!(
    tokenise("7october", &Config::default()),
    vec![
        Token::Numeric(7, 1),
        Token::MonthName(MonthName::October),
    ]
);