String processing utilities for scientific computing
This module provides common string distance metrics, similarity measures, tokenization, n-gram generation, and case conversion utilities used across the SciRS2 ecosystem.
§Distance Metrics
- levenshtein_distance - Edit distance (insertions, deletions, substitutions)
- hamming_distance - Number of positions where characters differ
- jaro_similarity / jaro_winkler_similarity - Positional character similarity
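To make the first two metrics concrete, here is a minimal standalone sketch of the textbook algorithms. The names `levenshtein_sketch` and `hamming_sketch` are hypothetical and the `Option` return type is an assumption; the module's own functions may use different signatures and error handling.

```rust
/// Levenshtein edit distance over Unicode scalar values, using a two-row
/// dynamic-programming table. Illustrative only; not the module's own code.
fn levenshtein_sketch(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hamming distance: count of positions where the characters differ.
/// Returns `None` when the strings differ in length (an assumed convention).
fn hamming_sketch(a: &str, b: &str) -> Option<usize> {
    if a.chars().count() != b.chars().count() {
        return None;
    }
    Some(a.chars().zip(b.chars()).filter(|(x, y)| x != y).count())
}
```

For example, `levenshtein_sketch("kitten", "sitting")` evaluates to 3 and `hamming_sketch("karolin", "kathrin")` to `Some(3)`.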
§Subsequences
- longest_common_subsequence - LCS length
- lcs_string - Actual LCS string
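The difference between the LCS length and the LCS string can be illustrated with the classic dynamic-programming table plus a backtracking pass. This is a sketch under assumptions: `lcs_sketch` is a hypothetical name, and when several subsequences share the maximum length it returns just one of them, which may differ from what the module returns.

```rust
/// Returns (LCS length, one LCS string) for two strings, compared by Unicode
/// scalar value. Illustrative only; not the module's own code.
fn lcs_sketch(a: &str, b: &str) -> (usize, String) {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // dp[i][j] = LCS length of the first i chars of `a` and the first j chars of `b`.
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            dp[i][j] = if a[i - 1] == b[j - 1] {
                dp[i - 1][j - 1] + 1
            } else {
                dp[i - 1][j].max(dp[i][j - 1])
            };
        }
    }
    // Walk back from the bottom-right corner to recover one LCS.
    let (mut i, mut j) = (a.len(), b.len());
    let mut out = Vec::with_capacity(dp[i][j]);
    while i > 0 && j > 0 {
        if a[i - 1] == b[j - 1] {
            out.push(a[i - 1]);
            i -= 1;
            j -= 1;
        } else if dp[i - 1][j] >= dp[i][j - 1] {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    out.reverse();
    (dp[a.len()][b.len()], out.into_iter().collect())
}
```

The table costs O(m·n) time and memory; if only the length is needed, two rows are enough.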
§Tokenization & N-grams
- tokenize_whitespace - Split by whitespace
- tokenize_pattern - Split by custom delimiter patterns
- ngrams / char_ngrams - Word-level and character-level n-grams
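Word-level and character-level n-grams are both sliding windows, over tokens in one case and over characters in the other. The sketch below uses only the standard library; `word_ngrams_sketch` and `char_ngrams_sketch` are hypothetical names, and returning an empty result for `n == 0` is an assumed convention rather than documented behavior of `ngrams` or `char_ngrams`.

```rust
/// Word-level n-grams: every run of `n` consecutive tokens.
/// Illustrative only; not the module's own code.
fn word_ngrams_sketch<'a>(tokens: &[&'a str], n: usize) -> Vec<Vec<&'a str>> {
    if n == 0 {
        // `windows(0)` panics, so handle this case explicitly.
        return Vec::new();
    }
    tokens.windows(n).map(|w| w.to_vec()).collect()
}

/// Character-level n-grams: every run of `n` consecutive Unicode scalar values.
fn char_ngrams_sketch(s: &str, n: usize) -> Vec<String> {
    if n == 0 {
        return Vec::new();
    }
    let chars: Vec<char> = s.chars().collect();
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Splitting `"the quick brown fox"` with `split_whitespace()` and asking for bigrams yields `["the", "quick"]`, `["quick", "brown"]`, and `["brown", "fox"]`; an input shorter than `n` yields no n-grams.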
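§Case Conversions
- to_snake_case / to_screaming_snake_case - snake_case and SCREAMING_SNAKE_CASE
- to_camel_case / to_pascal_case - camelCase and PascalCase
- to_kebab_case / to_title_case - kebab-case and Title Case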
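Case conversion usually reduces to splitting the input into words and rejoining them with a different separator or capitalization. The sketch below shows one way to do that; all three function names are hypothetical, and the word-boundary rules (non-alphanumeric separators plus lowercase-to-uppercase steps) are an assumption. The module's functions may treat acronyms and digits differently.

```rust
/// Split an identifier into lowercase words. Boundaries are non-alphanumeric
/// characters and lowercase-to-uppercase transitions (assumed rules).
fn split_words_sketch(s: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for c in s.chars() {
        if !c.is_alphanumeric() {
            // Separator: close the current word, if any.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            prev_lower = false;
        } else {
            // A lowercase-to-uppercase step starts a new word.
            if c.is_uppercase() && prev_lower && !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            current.extend(c.to_lowercase());
            prev_lower = c.is_lowercase();
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// snake_case: lowercase words joined by underscores.
fn to_snake_sketch(s: &str) -> String {
    split_words_sketch(s).join("_")
}

/// camelCase: first word lowercase, later words capitalized.
fn to_camel_sketch(s: &str) -> String {
    let mut out = String::new();
    for (i, w) in split_words_sketch(s).iter().enumerate() {
        let mut chars = w.chars();
        match chars.next() {
            Some(first) if i > 0 => {
                out.extend(first.to_uppercase());
                out.push_str(chars.as_str());
            }
            _ => out.push_str(w),
        }
    }
    out
}
```

With these rules, `to_snake_sketch("helloWorld")` gives `hello_world` and `to_camel_sketch("my-kebab-name")` gives `myKebabName`. A run of capitals such as `HTTPServer` stays a single word in this sketch.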
Functions§
- center - Center a string within a given width.
- char_ngrams - Generate character-level n-grams from a string.
- count_occurrences - Count occurrences of a substring.
- dice_coefficient - Compute the Dice coefficient (bigram overlap) between two strings.
- hamming_distance - Compute the Hamming distance between two strings of equal length.
- is_palindrome - Check if a string is a palindrome (case-insensitive, alphanumeric only).
- jaro_similarity - Compute the Jaro similarity between two strings.
- jaro_winkler_similarity - Compute the Jaro-Winkler similarity between two strings.
- lcs_similarity - Compute the LCS similarity (0.0 = no common subsequence, 1.0 = identical).
- lcs_string - Return the actual longest common subsequence string.
- levenshtein_distance - Compute the Levenshtein edit distance between two strings.
- levenshtein_similarity - Compute the Levenshtein similarity (1.0 = identical, 0.0 = completely different).
- longest_common_subsequence - Compute the length of the longest common subsequence (LCS) of two strings.
- ngrams - Generate word-level n-grams from a list of tokens.
- normalized_levenshtein - Compute the normalized Levenshtein distance (0.0 = identical, 1.0 = completely different).
- pad_left - Pad a string on the left to a given width.
- pad_right - Pad a string on the right to a given width.
- reverse - Reverse a string (Unicode-aware).
- skip_bigrams - Generate skip-grams (n-grams with gaps).
- to_camel_case - Convert a string to camelCase.
- to_kebab_case - Convert a string to kebab-case.
- to_pascal_case - Convert a string to PascalCase.
- to_screaming_snake_case - Convert a string to SCREAMING_SNAKE_CASE.
- to_snake_case - Convert a string to snake_case.
- to_title_case - Convert a string to Title Case.
- tokenize_char - Tokenize a string by splitting on a delimiter character.
- tokenize_pattern - Tokenize by splitting on a string pattern.
- tokenize_predicate - Tokenize by splitting on any character matching a predicate.
- tokenize_sentences - Tokenize into sentences (split on ‘.’, ‘!’, ‘?’).
- tokenize_whitespace - Tokenize a string by splitting on whitespace.
- tokenize_words - Simple word tokenizer that splits on non-alphanumeric characters.