Skip to main content

Module tokens

Module tokens 

Source
Expand description

Identifier tokenizer for BM25 indexing.

Port of ~/src/semble/src/semble/tokens.py. Splits identifiers on camelCase / PascalCase / snake_case boundaries and emits sub-tokens alongside the lowercased compound, so partial matches work in BM25.

§Behavioural parity with Python

Python’s _CAMEL_RE uses a lookahead ((?=[A-Z][a-z])) that the default regex crate’s RE2 engine does not support. Rather than pulling in fancy-regex, this module hand-rolls [split_camel_segments] to match the four alternatives Python’s regex tries in order:

  • [A-Z]+(?=[A-Z][a-z]) — uppercase acronym before a CamelWord (HTTP in HTTPResponse).
  • [A-Z]?[a-z]+ — optional upper + lower-case run (Response, get, andler).
  • [A-Z]+ — uppercase-only run (acronym at end of identifier).
  • [0-9]+ — digit run.

Parity is enforced by [tests::matches_python_docstring_examples] and a corpus property test.

Functions§

split_identifier
Split a single identifier into sub-tokens via camelCase / snake_case.
tokenize
Split text into lowercase identifier-like tokens for BM25 indexing.