Expand description
Identifier tokenizer for BM25 indexing.
Port of ~/src/semble/src/semble/tokens.py. Splits identifiers on
camelCase / PascalCase / snake_case boundaries and emits sub-tokens
alongside the lowercased compound, so partial matches work in BM25.
§Behavioural parity with Python
Python’s _CAMEL_RE uses a lookahead ((?=[A-Z][a-z])) that the
default regex crate’s RE2 engine does not support. Rather than
pulling in fancy-regex, this module hand-rolls
[split_camel_segments] to match the four alternatives Python’s
regex tries in order:
[A-Z]+(?=[A-Z][a-z])— uppercase acronym before a CamelWord (HTTPinHTTPResponse).[A-Z]?[a-z]+— optional upper + lower-case run (Response,get,andler).[A-Z]+— uppercase-only run (acronym at end of identifier).[0-9]+— digit run.
Parity is enforced by [tests::matches_python_docstring_examples]
and a corpus property test.
Functions§
- split_
identifier - Split a single identifier into sub-tokens via camelCase / snake_case.
- tokenize
- Split text into lowercase identifier-like tokens for BM25 indexing.