Module string_ops

String processing utilities for scientific computing

This module provides common string distance metrics, similarity measures, tokenization, n-gram generation, and case conversion utilities used across the SciRS2 ecosystem.

§Distance Metrics

levenshtein_distance - Edit distance (insertions, deletions, substitutions)
hamming_distance - Number of positions where characters differ
jaro_similarity / jaro_winkler_similarity - Positional character similarity
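
To make the first two metrics concrete, here is a minimal standalone sketch of the textbook algorithms. The function names `levenshtein` and `hamming`, their signatures, and the choice to compare Unicode scalar values (`char`) are assumptions made for illustration; they are not taken from this module, whose actual signatures and error handling may differ.

```rust
/// Levenshtein edit distance over `char`s, using two rolling rows.
/// Hypothetical helper for illustration; not this module's implementation.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hamming distance: number of positions where the characters differ.
/// Returns `None` when the inputs have different lengths (an assumed
/// convention; the module may report this case differently).
fn hamming(a: &str, b: &str) -> Option<usize> {
    if a.chars().count() != b.chars().count() {
        return None;
    }
    Some(a.chars().zip(b.chars()).filter(|(x, y)| x != y).count())
}
```

With these definitions, `levenshtein("kitten", "sitting")` evaluates to 3 and `hamming("karolin", "kathrin")` to `Some(3)`.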

§Subsequences

longest_common_subsequence - LCS length
lcs_string - Actual LCS string
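
The relationship between the LCS length and the LCS string can be sketched with the standard dynamic-programming table followed by a backtrack. The names `lcs_table` and `lcs` are hypothetical and the code is a standalone illustration, not this module's implementation. When several subsequences share the maximum length, which one is returned depends on the tie-breaking rule in the backtrack.

```rust
/// Build the LCS length table: dp[i][j] is the LCS length of the
/// first i chars of `a` and the first j chars of `b`.
/// Hypothetical helper for illustration only.
fn lcs_table(a: &[char], b: &[char]) -> Vec<Vec<usize>> {
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            dp[i][j] = if a[i - 1] == b[j - 1] {
                dp[i - 1][j - 1] + 1
            } else {
                dp[i - 1][j].max(dp[i][j - 1])
            };
        }
    }
    dp
}

/// Recover one longest common subsequence by walking the table backwards.
fn lcs(a: &str, b: &str) -> String {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let dp = lcs_table(&a, &b);
    let (mut i, mut j) = (a.len(), b.len());
    let mut out: Vec<char> = Vec::new();
    while i > 0 && j > 0 {
        if a[i - 1] == b[j - 1] {
            // Matching characters belong to the subsequence.
            out.push(a[i - 1]);
            i -= 1;
            j -= 1;
        } else if dp[i - 1][j] >= dp[i][j - 1] {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    // Characters were collected back to front.
    out.into_iter().rev().collect()
}
```

The LCS length is the bottom-right table entry, so `lcs(a, b).chars().count()` equals `dp[a.len()][b.len()]`. For example, `lcs("abcde", "ace")` yields `"ace"`.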

§Tokenization & N-grams

tokenize_whitespace - Split by whitespace
tokenize_pattern - Split by custom delimiter patterns
ngrams / char_ngrams - Word-level and character-level n-grams
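
As a rough illustration of the difference between word-level and character-level n-grams, the sketch below uses only the standard library (`split_whitespace` and slice `windows`). The helper names `word_windows` and `char_windows` are hypothetical, and returning an empty list for `n == 0` or for inputs shorter than `n` is an assumed convention, not documented behavior of this module.

```rust
/// Word-level n-grams: every run of `n` consecutive tokens.
/// Hypothetical helper for illustration only.
fn word_windows<'a>(tokens: &[&'a str], n: usize) -> Vec<Vec<&'a str>> {
    // `windows(0)` panics, so guard against n == 0 explicitly.
    if n == 0 || tokens.len() < n {
        return Vec::new();
    }
    tokens.windows(n).map(|w| w.to_vec()).collect()
}

/// Character-level n-grams: every run of `n` consecutive `char`s.
fn char_windows(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}
```

Tokenizing first and then windowing, `"the quick brown fox".split_whitespace().collect::<Vec<_>>()` passed to `word_windows` with `n = 2` gives the three pairs (the, quick), (quick, brown), (brown, fox), while `char_windows("abcd", 3)` gives `"abc"` and `"bcd"`.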

§Case Conversions

to_snake_case / to_camel_case / to_pascal_case / to_kebab_case / to_screaming_snake_case
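
One common way to implement such conversions is to split the input into lowercase words and then rejoin them in the target style. The sketch below follows that approach; `split_words`, `capitalize`, `snake`, `kebab`, `pascal`, `camel`, and `screaming_snake` are hypothetical names, and the word-boundary rules (separators `_`, `-`, whitespace, plus a lowercase-to-uppercase transition) are assumptions. In particular, this sketch keeps a run of capitals such as `HTTPServer` together as one word, which may not match how this module treats acronyms.

```rust
/// Split on '_', '-', whitespace, and lowercase-to-uppercase transitions,
/// returning lowercase words. Hypothetical helper for illustration only.
fn split_words(s: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for c in s.chars() {
        if c == '_' || c == '-' || c.is_whitespace() {
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        // "helloWorld": the 'W' after a lowercase letter starts a new word.
        if c.is_uppercase() && prev_lower && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.extend(c.to_lowercase());
        prev_lower = c.is_lowercase() || c.is_ascii_digit();
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Uppercase the first character of a word.
fn capitalize(w: &str) -> String {
    let mut chars = w.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

fn snake(s: &str) -> String {
    split_words(s).join("_")
}

fn kebab(s: &str) -> String {
    split_words(s).join("-")
}

fn screaming_snake(s: &str) -> String {
    snake(s).to_uppercase()
}

fn pascal(s: &str) -> String {
    split_words(s).iter().map(|w| capitalize(w)).collect()
}

fn camel(s: &str) -> String {
    split_words(s)
        .iter()
        .enumerate()
        .map(|(i, w)| if i == 0 { w.clone() } else { capitalize(w) })
        .collect()
}
```

Under these rules, `snake("helloWorld")` gives `"hello_world"` and `pascal("hello_world")` gives `"HelloWorld"`.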

Functions§

center: Center a string within a given width.
char_ngrams: Generate character-level n-grams from a string.
count_occurrences: Count occurrences of a substring.
dice_coefficient: Compute the Dice coefficient (bigram overlap) between two strings.
hamming_distance: Compute the Hamming distance between two strings of equal length.
is_palindrome: Check if a string is a palindrome (case-insensitive, alphanumeric only).
jaro_similarity: Compute the Jaro similarity between two strings.
jaro_winkler_similarity: Compute the Jaro-Winkler similarity between two strings.
lcs_similarity: Compute the LCS similarity (0.0 = no common subsequence, 1.0 = identical).
lcs_string: Return the actual longest common subsequence string.
levenshtein_distance: Compute the Levenshtein edit distance between two strings.
levenshtein_similarity: Compute the Levenshtein similarity (1.0 = identical, 0.0 = completely different).
longest_common_subsequence: Compute the length of the longest common subsequence (LCS) of two strings.
ngrams: Generate word-level n-grams from a list of tokens.
normalized_levenshtein: Compute the normalized Levenshtein distance (0.0 = identical, 1.0 = completely different).
pad_left: Pad a string on the left to a given width.
pad_right: Pad a string on the right to a given width.
reverse: Reverse a string (Unicode-aware).
skip_bigrams: Generate skip-grams (n-grams with gaps).
to_camel_case: Convert a string to camelCase.
to_kebab_case: Convert a string to kebab-case.
to_pascal_case: Convert a string to PascalCase.
to_screaming_snake_case: Convert a string to SCREAMING_SNAKE_CASE.
to_snake_case: Convert a string to snake_case.
to_title_case: Convert a string to Title Case.
tokenize_char: Tokenize a string by splitting on a delimiter character.
tokenize_pattern: Tokenize by splitting on a string pattern.
tokenize_predicate: Tokenize by splitting on any character matching a predicate.
tokenize_sentences: Tokenize into sentences (split on ‘.’, ‘!’, ‘?’).
tokenize_whitespace: Tokenize a string by splitting on whitespace.
tokenize_words: Simple word tokenizer that splits on non-alphanumeric characters.
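
The bigram-overlap idea behind the Dice coefficient can be sketched as 2·|X ∩ Y| / (|X| + |Y|), where X and Y are the sets of adjacent character pairs of the two strings. The helpers `bigrams` and `dice` below are hypothetical, standalone illustrations. Using sets rather than multisets, and the handling of inputs too short to contain a bigram, are assumptions; this module may define either differently.

```rust
use std::collections::HashSet;

/// Set of adjacent character pairs in `s`.
/// Hypothetical helper for illustration only.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Dice coefficient over character-bigram sets, in the range 0.0..=1.0.
fn dice(a: &str, b: &str) -> f64 {
    let (x, y) = (bigrams(a), bigrams(b));
    // Assumed convention: with no bigrams on either side, fall back to
    // exact equality to avoid dividing by zero.
    if x.is_empty() && y.is_empty() {
        return if a == b { 1.0 } else { 0.0 };
    }
    let shared = x.intersection(&y).count();
    2.0 * shared as f64 / (x.len() + y.len()) as f64
}
```

For example, "night" and "nacht" each have four bigrams and share only "ht", so `dice("night", "nacht")` is 2·1 / 8 = 0.25.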
