pub fn tokenise(utterance: &str, config: &Config) -> Vec<Token>
Public re-export of the tokenise function.
Split utterance on any standard separator or extra separator and classify
each resulting chunk as a Token.
§What counts as a separator
The standard separator set is: ASCII whitespace (space, tab, newline,
carriage return), slash (/), hyphen (-), full stop (.), comma (,), and
backslash (\). Any additional strings in config.extra_separators are also
treated as separators.
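The separator rules above can be illustrated with a minimal standalone sketch. It is not the crate's implementation: split_on_separators is a hypothetical helper, and treating each extra separator as a literal substring is an assumption.

```rust
/// Hypothetical helper (not part of the crate's documented API): splits an
/// utterance on the standard separator set plus any extra separator strings.
fn split_on_separators(utterance: &str, extra_separators: &[&str]) -> Vec<String> {
    // Assumption: extra separators are literal substrings. Replacing each one
    // with a space lets a single character-based pass do all the splitting.
    let mut normalised = utterance.to_string();
    for &sep in extra_separators {
        if !sep.is_empty() {
            normalised = normalised.replace(sep, " ");
        }
    }

    // Standard set: ASCII whitespace, '/', '-', '.', ',', and '\'.
    let chunks: Vec<String> = normalised
        .split(|c: char| matches!(c, ' ' | '\t' | '\n' | '\r' | '/' | '-' | '.' | ',' | '\\'))
        .filter(|chunk| !chunk.is_empty()) // drop empties from adjacent separators
        .map(str::to_string)
        .collect();
    chunks
}
```

Under these assumptions, "19/10-2014" yields the chunks "19", "10", and "2014", and passing " of " as an extra separator makes "19 of October 2014" yield "19", "October", and "2014".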
When config.no_separator is true and the utterance is a pure digit
string of length 6 or 8, it is sliced positionally according to
config.component_order rather than split on separators.
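Positional slicing might look roughly like the sketch below. The Component enum and slice_positionally function are hypothetical stand-ins, and the width rule (two digits per component, with the year taking four digits in an 8-digit string) is an assumption rather than something this page states.

```rust
/// Hypothetical stand-in for the entries of config.component_order.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Component {
    Day,
    Month,
    Year,
}

/// Hypothetical helper: slices a pure digit string of length 6 or 8 into
/// three components in the given order. Returns None for any other input.
fn slice_positionally(digits: &str, order: [Component; 3]) -> Option<Vec<(Component, String)>> {
    let all_digits = digits.bytes().all(|b| b.is_ascii_digit());
    if !all_digits || !(digits.len() == 6 || digits.len() == 8) {
        return None;
    }

    // Assumption: day and month are always two digits wide; the year is
    // four digits wide in an 8-digit string and two digits wide otherwise.
    let year_width = if digits.len() == 8 { 4 } else { 2 };

    let mut start = 0;
    let mut parts = Vec::with_capacity(3);
    for component in order {
        let width = if component == Component::Year { year_width } else { 2 };
        // get() returns None instead of panicking if the order is malformed
        // (for example, two Year entries) and the slice would run past the end.
        let slice = digits.get(start..start + width)?;
        parts.push((component, slice.to_string()));
        start += width;
    }
    Some(parts)
}
```

With a day, month, year order, "19102014" would be sliced into "19", "10", and "2014" in this sketch; a string such as "1910" is rejected because its length is neither 6 nor 8.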
§Classification
Each non-separator chunk is examined for digit-to-alpha (or alpha-to-digit)
boundaries, so that fused input such as "19october" or "August7" is split
into parts that are classified independently.
- Token::OrdinalDay: a digit run followed by st, nd, rd, or th.
- Token::MonthName: a full name, 3-letter abbreviation, unambiguous prefix, or fuzzy misspelling.
- Token::Numeric: a run of ASCII digits; stores (value, digit_count).
- Anything else (noise words, stray punctuation) is silently discarded.
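The boundary split described in this section can be sketched as follows. This is an illustrative helper only, not the crate's own code: split_on_boundary is a hypothetical name, every non-digit character is treated as "alpha", and ordinal suffixes and letter O substitution are deliberately ignored.

```rust
/// Hypothetical helper: splits a chunk wherever an ASCII digit meets a
/// non-digit character, in either direction.
fn split_on_boundary(chunk: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    let mut prev_is_digit: Option<bool> = None;

    for (idx, ch) in chunk.char_indices() {
        let is_digit = ch.is_ascii_digit();
        if let Some(prev) = prev_is_digit {
            // A change of character class marks the end of the current run.
            if prev != is_digit {
                pieces.push(&chunk[start..idx]);
                start = idx;
            }
        }
        prev_is_digit = Some(is_digit);
    }

    // Push the final run, if the chunk was not empty.
    if start < chunk.len() {
        pieces.push(&chunk[start..]);
    }
    pieces
}
```

In this sketch "19october" splits into "19" and "october", and "August7" splits into "August" and "7"; each piece would then be classified on its own.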
When Config::letter_o_substitution is true (the default), any token
whose characters are all ASCII digits or the letter O (upper or lower
case) is treated as a numeric token with every O/o replaced by 0.
This handles OCR and typing errors such as "2O24" → 2024. The
substitution applies only to isolated tokens; a letter O that is part of a
longer alphabetic run (e.g. "october") is never affected because
sub_split_on_boundary has already separated digit and alpha runs.
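As a rough illustration of the substitution rule, the sketch below checks a single isolated token and returns a (value, digit_count) pair shaped like the payload of Token::Numeric. The function name substitute_letter_o and the u32 value type are assumptions made for the example.

```rust
/// Hypothetical helper: if every character of the token is an ASCII digit or
/// the letter O (either case), replaces each O with 0 and parses the result.
/// Returns (value, digit_count), or None if the token does not qualify.
fn substitute_letter_o(token: &str) -> Option<(u32, usize)> {
    let qualifies = !token.is_empty()
        && token
            .chars()
            .all(|c| c.is_ascii_digit() || c == 'O' || c == 'o');
    if !qualifies {
        return None;
    }

    // Replace every O/o with the digit 0; all other characters are digits.
    let digits: String = token
        .chars()
        .map(|c| if c == 'O' || c == 'o' { '0' } else { c })
        .collect();

    // Assumption: values fit in u32; parsing fails (None) on overflow.
    let value = digits.parse::<u32>().ok()?;
    Some((value, digits.len()))
}
```

Here "2O24" produces (2024, 4), while "october" is rejected because it contains letters other than O.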
At most three tokens are returned.
§Examples
use partial_date::extract::tokenise;
use partial_date::models::{Config, MonthName, Token};

assert_eq!(
    tokenise("19 October 2014", &Config::default()),
    vec![
        Token::Numeric(19, 2),
        Token::MonthName(MonthName::October),
        Token::Numeric(2014, 4),
    ]
);

assert_eq!(
    tokenise("19th October,2015", &Config::default()),
    vec![
        Token::OrdinalDay(19),
        Token::MonthName(MonthName::October),
        Token::Numeric(2015, 4),
    ]
);

assert_eq!(
    tokenise("19october", &Config::default()),
    vec![
        Token::Numeric(19, 2),
        Token::MonthName(MonthName::October),
    ]
);

// Letter O substitution (enabled by default):
assert_eq!(
    tokenise("2O24", &Config::default()),
    vec![Token::Numeric(2024, 4)]
);

// "7october": the O is part of "october", not a standalone token, so
// substitution does not apply and the month name is recognised normally.
assert_eq!(
    tokenise("7october", &Config::default()),
    vec![
        Token::Numeric(7, 1),
        Token::MonthName(MonthName::October),
    ]
);