simple-string-patterns 0.2.3

Makes it easier to match, split and extract strings in Rust without regular expressions. The parallel string-patterns crate provides extensions to work with regular expressions via the Regex library.
simple-string-patterns-0.2.3 has been yanked.

Simple String Patterns

This library makes it easier to match, split and extract strings in Rust. It builds on the Rust standard library. A parallel string-patterns crate provides extensions to work with regular expressions. Together, these crates aim to make working with strings in Rust as easy as it is in JavaScript or Python, with cleaner syntax.

Simple string matching methods such as starts_with, contains or ends_with generally perform better than regular expressions, especially when processing large data sets. To this end, the crate provides methods such as starts_with_ci and starts_with_ci_alphanum for basic string validation without regular expressions, as well as extension methods to split strings into vectors of strings or into head and tail components.

Method overview

  • Many methods without _ci or _cs suffixes require a boolean case_insensitive parameter
  • Methods ending in _cs are case-sensitive
  • Methods ending in _ci are case-insensitive
  • Methods containing _split return either a vector or tuple pair.
  • Methods containing _part(s) always include leading or trailing separators and may return empty elements in vectors
  • Methods containing _segment(s) ignore leading, trailing, repeated consecutive separators and thus exclude empty elements
  • In tuples returned from segment(s) and part(s) methods, head means the segment before the first split and tail the remainder, while start means the whole string before the last split and end only the part after the last matched separator.
  • Enclose or wrap methods ending in _escaped have an optional escape character parameter
  • Enclose or wrap methods ending in _safe insert a backslash before any non-final occurrences of the closing characters unless already present
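The difference between parts and segments can be illustrated with the standard library alone. This is only a sketch of the naming convention described above, not the crate's implementation; the helper names parts and segments are hypothetical.

```rust
/// Hypothetical helper: keep every element, including empty ones ("parts").
fn parts<'a>(text: &'a str, separator: &str) -> Vec<&'a str> {
    text.split(separator).collect()
}

/// Hypothetical helper: drop empty elements caused by leading, trailing
/// or repeated separators ("segments").
fn segments<'a>(text: &'a str, separator: &str) -> Vec<&'a str> {
    text.split(separator).filter(|s| !s.is_empty()).collect()
}

fn main() {
    let path = "/var//www/";
    println!("{:?}", parts(path, "/"));    // ["", "var", "", "www", ""]
    println!("{:?}", segments(path, "/")); // ["var", "www"]
}
```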
Simple case-insensitive match
let str_1 = "Dog food";
if str_1.starts_with_ci("dog") {
  println!("{} is dog-related", str_1);
}
Simple case-insensitive match on the alphanumeric characters only in a longer text
// This method is handy for validating text values from external data sources with
// inconsistent naming conventions, e.g. first-name, first_name, firstName or "first name"
let str_1 = "Do you spell hip-hop with a hyphen?";
if str_1.contains_ci_alphanum("hiphop") {
  println!("{} is hip-hop-related", str_1);
}
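For readers curious how an _alphanum match can work without regular expressions, here is a minimal std-only sketch. It assumes the approach is to drop non-alphanumeric characters and lower-case the rest before comparing; the function names are hypothetical and the crate's own implementation may differ.

```rust
/// Hypothetical helper: keep only alphanumeric characters, lower-cased.
fn normalize_alphanum(text: &str) -> String {
    text.chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Sketch of a case-insensitive, alphanumeric-only "contains" check.
fn contains_ci_alphanum_sketch(text: &str, pattern: &str) -> bool {
    normalize_alphanum(text).contains(normalize_alphanum(pattern).as_str())
}

fn main() {
    let sample = "Do you spell hip-hop with a hyphen?";
    println!("{}", contains_ci_alphanum_sketch(sample, "hiphop")); // true
}
```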
Filter a vector of strings by their first alphanumeric characters
// Methods ending in _alphanum are good for filtering strings that may contain other
// characters. to_strings() converts an array of &str references to a vector of owned strings
let sample_strs = [
  "/blue-sky.jpg",
  "----bluesky.png",
  "-B-l-u-e--sky",
  "Blueberry",
  " Blue sky thinking"
].to_strings();
let strings_starting_with_blue = sample_strs
  .into_iter()
  .filter(|s| s.starts_with_ci_alphanum("bluesky"))
  .collect::<Vec<String>>();
// should return all except "Blueberry"
Extract the third non-empty segment of a long path name
let path_string = "/var/www/mysite.com/web/uploads";
if let Some(domain) = path_string.to_segment("/", 2) {
  println!("The domain folder name is: {}", domain); // "mysite.com" is an owned string
}
Extract the head and tail or start and end from a longer string
let test_string = "long-list-of-technical-words";
let (head, tail) = test_string.to_head_tail("-");
println!("Head: {}, tail: {}", head, tail); // Head: long, tail: list-of-technical-words

let (start, end) = test_string.to_start_end("-");
println!("Start: {}, end: {}", start, end); // Start: long-list-of-technical, end: words
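The head/tail and start/end pairs correspond closely to split_once and rsplit_once in the standard library. The sketch below shows that correspondence only; returning the whole string with an empty second element when the separator is absent is an assumption, and the function names are hypothetical.

```rust
/// Split on the first separator: (head, tail).
fn head_tail<'a>(text: &'a str, separator: &str) -> (&'a str, &'a str) {
    // Assumption: with no separator, the whole text is the head.
    text.split_once(separator).unwrap_or((text, ""))
}

/// Split on the last separator: (start, end).
fn start_end<'a>(text: &'a str, separator: &str) -> (&'a str, &'a str) {
    // Assumption: with no separator, the whole text is the start.
    text.rsplit_once(separator).unwrap_or((text, ""))
}

fn main() {
    let test_string = "long-list-of-technical-words";
    println!("{:?}", head_tail(test_string, "-")); // ("long", "list-of-technical-words")
    println!("{:?}", start_end(test_string, "-")); // ("long-list-of-technical", "words")
}
```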
Capture an inner segment via multiple patterns
let source_str = "long/path/with-a-long-title/details";
// Take the third "/" segment ("with-a-long-title"), then its third "-" segment
if let Some(inner_segment) = source_str.to_inner_segment(&[("/", 2), ("-", 2)]) {
  println!("The inner segment between 'a' and 'title' is: '{}'", inner_segment); // should read 'long'
}
Extract the first decimal value as an f64 from a longer string
// Assumed rate: 1 euro costs 0.835 pounds, so dividing a price in pounds by it yields euros
const GBP_TO_EURO: f64 = 0.835;

let sample_str = "Price £12.50 each";
if let Some(price_gbp) = sample_str.to_first_number::<f64>() {
    let price_eur = price_gbp / GBP_TO_EURO;
    println!("The price in euros is {:.2}", price_eur);
}
Extract numeric sequences from phrases and convert them to a vector of floats
// Extract European-style numbers with commas as decimal separators and points as thousand separators
let sample_str = "2.500 grammi di farina costa 9,90€ al supermercato.";
let numbers: Vec<f32> = sample_str.to_numbers_euro();
// If two valid numbers are matched, assume the first is the weight
if numbers.len() > 1 {
  let weight_grams = numbers[0];
  let price_euros = numbers[1];
  let price_per_kg = price_euros / (weight_grams / 1000f32);
  // the price per kg should be 3.96
  println!("Flour costs €{:.2} per kilo", price_per_kg);
}
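To make the European number convention concrete, the following std-only sketch tokenizes a phrase on anything that is not a digit, point or comma, then treats points as thousand separators and the comma as the decimal separator. It is an illustration under those assumptions, not the crate's parser, and both function names are hypothetical.

```rust
/// Hypothetical helper: parse one token such as "2.500" or "9,90".
fn parse_euro_number(raw: &str) -> Option<f32> {
    let normalized: String = raw
        .chars()
        .filter(|c| *c != '.') // drop thousand separators
        .map(|c| if c == ',' { '.' } else { c }) // decimal comma to point
        .collect();
    normalized.parse::<f32>().ok()
}

/// Hypothetical helper: collect every parsable number in a phrase.
fn euro_numbers(text: &str) -> Vec<f32> {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.' || c == ','))
        // strip sentence punctuation left at the edges of a token
        .map(|token| token.trim_matches(|c: char| c == '.' || c == ','))
        .filter(|token| !token.is_empty())
        .filter_map(|token| parse_euro_number(token))
        .collect()
}

fn main() {
    let numbers = euro_numbers("2.500 grammi di farina costa 9,90€ al supermercato.");
    println!("{:?}", numbers); // [2500.0, 9.9]
}
```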
Match multiple patterns without regular expressions
// Match only file names that contain the character sequence "nepal" and do not end in .psd 
// This is very useful for prefiltering large sets of simple strings 
// representing things like file names.
// Ci, Cs suffixes mean case-insensitive and case-sensitive respectively
let mixed_conditions = [
  StringBounds::ContainsCi("nepal", true),
  StringBounds::EndsWithCi(".psd", false),
];

// The same array may also be expressed via the new bounds_builder() function with chainable rules:
// Call .as_vec() at the end
let mixed_conditions = bounds_builder()
  .contains_ci("nepal", true)
  .ends_with_ci(".psd", false)
  .as_vec();

let file_names = [
  "edited-img-Nepal-Feb-2003.psd",
  "image-Thailand-Mar-2003.jpg",
  "photo_Nepal_Jan-2005.jpg",
  "image-India-Mar-2003.jpg",
  "pic_nepal_Dec-2004.png"
];
  
let nepal_source_files: Vec<&str> = file_names.filter_all_conditional(&mixed_conditions);
// should yield two file names: ["photo_Nepal_Jan-2005.jpg", "pic_nepal_Dec-2004.png"]
// This will now return Vec<&str> or Vec<String> depending on the source string type.
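The positivity flag can be read as "the match result must equal this boolean". The sketch below models that idea with a small stand-in enum; Rule, is_satisfied_by and filter_all are hypothetical names used for illustration and are not part of the crate.

```rust
/// Hypothetical stand-in for two of the StringBounds variants.
enum Rule<'a> {
    ContainsCi(&'a str, bool),
    EndsWithCi(&'a str, bool),
}

impl Rule<'_> {
    /// True when the case-insensitive match result equals the positivity flag.
    fn is_satisfied_by(&self, text: &str) -> bool {
        let lower = text.to_lowercase();
        match self {
            Rule::ContainsCi(pattern, is_positive) => {
                lower.contains(pattern.to_lowercase().as_str()) == *is_positive
            }
            Rule::EndsWithCi(pattern, is_positive) => {
                lower.ends_with(pattern.to_lowercase().as_str()) == *is_positive
            }
        }
    }
}

/// Keep only the items that satisfy every rule.
fn filter_all<'a>(items: &[&'a str], rules: &[Rule<'_>]) -> Vec<&'a str> {
    items
        .iter()
        .copied()
        .filter(|item| rules.iter().all(|rule| rule.is_satisfied_by(item)))
        .collect()
}

fn main() {
    let rules = [Rule::ContainsCi("nepal", true), Rule::EndsWithCi(".psd", false)];
    let file_names = ["edited-img-Nepal-Feb-2003.psd", "photo_Nepal_Jan-2005.jpg"];
    println!("{:?}", filter_all(&file_names, &rules)); // ["photo_Nepal_Jan-2005.jpg"]
}
```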

Enclose strings in common bounding characters

  let sample_phrase = r#"LLM means "large language model""#;
  
  let phrase_in_round_brackets = sample_phrase.parenthesize();
  // yields (LLM means "large language model")
  // but will not escape any parentheses in the source string.

  let phrase_in_left_right_quotes = sample_phrase.enclose('“', '”');
  // yields “LLM means "large language model"”
  // in custom left and right quotation marks, but will not escape double quotes.

  let phrase_in_double_quotes = sample_phrase.double_quotes_safe();
  // yields "LLM means \"large language model\"" with backslash-escaped double quotes
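
The escaping rule behind the _safe variants can be sketched with the standard library as follows. This assumes that "already present" means the closing character is directly preceded by a backslash; enclose_safe_sketch is a hypothetical name and the crate's behaviour may differ in edge cases.

```rust
/// Wrap text in open/close characters, escaping inner closing characters.
fn enclose_safe_sketch(text: &str, open: char, close: char) -> String {
    let mut out = String::with_capacity(text.len() + 2);
    out.push(open);
    let mut previous: Option<char> = None;
    for c in text.chars() {
        // Escape the closing character unless a backslash already precedes it.
        if c == close && previous != Some('\\') {
            out.push('\\');
        }
        out.push(c);
        previous = Some(c);
    }
    out.push(close);
    out
}

fn main() {
    let sample_phrase = r#"LLM means "large language model""#;
    println!("{}", enclose_safe_sketch(sample_phrase, '"', '"'));
    // "LLM means \"large language model\""
}
```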

Filter strings by character categories

  let sample_str = "Products: $9.99 per unit, £19.50 each, €15 only. Zürich café cañon";
  
  let vowels_only = sample_str.filter_by_type(CharType::Chars(&['a','e','i','o', 'u', 'é', 'ü', 'y']));
  println!("{}", vowels_only);
  // should print "oueuieaoyüiaéao"

  let lower_case_letters_a_to_m_only = sample_str.filter_by_type(CharType::Range('a'..'n'));
  println!("{}", lower_case_letters_a_to_m_only);
  // should print  "dceieachlichcafca"

  // You can filter strings by multiple character categories
  let sample_with_lower_case_chars_and_spaces = sample_str.filter_by_types(&[CharType::Lower, CharType::Spaces]);
  println!("{}", sample_with_lower_case_chars_and_spaces);
  // Should print "roducts  per unit  each  only ürich café cañon"

Remove character categories from strings

  let sample_without_punctuation = sample_str.strip_by_type(CharType::Punctuation);
  println!("{}", sample_without_punctuation);
  // should print "Products 999 per unit £1950 each €15 only Zürich café cañon";
  
  let sample_without_spaces_and_punct = sample_str.strip_by_types(&[CharType::Spaces, CharType::Punctuation]);
  println!("{}", sample_without_spaces_and_punct);
  // should print "Products999perunit£1950each€15onlyZürichcafécañon";
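
Filtering and stripping are two sides of the same character test. The std-only sketch below expresses each category as a closure instead of a CharType value; filter_chars and strip_chars are hypothetical helpers, shown only to make the semantics concrete.

```rust
/// Keep only the characters accepted by the predicate.
fn filter_chars(text: &str, keep: impl Fn(char) -> bool) -> String {
    text.chars().filter(|c| keep(*c)).collect()
}

/// Remove the characters accepted by the predicate.
fn strip_chars(text: &str, remove: impl Fn(char) -> bool) -> String {
    text.chars().filter(|c| !remove(*c)).collect()
}

fn main() {
    let sample_str = "Products: $9.99 per unit, £19.50 each, €15 only. Zürich café cañon";
    // Half-open range: a to m
    println!("{}", filter_chars(sample_str, |c| ('a'..'n').contains(&c)));
    // ASCII punctuation includes '$' but not '£' or '€'
    println!("{}", strip_chars(sample_str, |c| c.is_ascii_punctuation()));
}
```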

Split a string on any of a set of characters

  let sample_str = "jazz-and-blues_music/section";
  let parts = sample_str.split_on_any_char(&['-','_', '/']);
  // should yield "jazz", "and", "blues", "music", "section" as a vector of strings
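
For comparison, the same split can be written with the standard library by passing a closure to split. This sketch returns borrowed slices rather than owned strings, and split_on_chars is a hypothetical name.

```rust
/// Split on any of the given separator characters.
fn split_on_chars<'a>(text: &'a str, separators: &[char]) -> Vec<&'a str> {
    text.split(|c: char| separators.contains(&c)).collect()
}

fn main() {
    let parts = split_on_chars("jazz-and-blues_music/section", &['-', '_', '/']);
    println!("{:?}", parts); // ["jazz", "and", "blues", "music", "section"]
}
```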

Traits

  • MatchOccurrences: Has methods to return the indices of all occurrences of an exact string (find_matched_indices) or single character (find_char_indices)
  • CharGroupMatch: Has methods to validate strings with character classes, e.g. has_digits, has_alphanumeric, has_alphabetic
  • IsNumeric: Provides a method to check if the string may be parsed to an integer or float
  • StripCharacters: Set of methods to strip unwanted characters by type or extract vectors of numeric strings, integers or floats without regular expressions
  • SimpleMatch: Regex-free match methods for common validation rules, e.g. starts_with_ci_alphanum checks if the first letters or numerals in a sample string match a pattern in case-insensitive mode.
  • SimpleMatchesMany: Regex-free multiple match methods accepting an array of StringBounds items, tuples or patterns and returning a vector of boolean results, e.g. matched_conditional
  • SimpleMatchAll: Regex-free multiple match methods accepting an array of StringBounds items, tuples or patterns and returning a boolean if all are matched
  • SimpleFilterAll: Applies simple Regex-free multiple match methods to an array or vector of strings and returns a filtered vector of string slices
  • SimpleEnclose: Wraps strings in pairs of matching characters with variants for different escape character rules
  • ToStrings: Converts arrays or vectors of strs to a vector of owned strings
  • ToSegments: Splits strings into parts, segments or head and tail pairs on a separator
  • ToSegmentFromChars: Splits strings into parts on any of an array of characters

Enums

  • StringBounds: Defines simple match rules with the pattern and a positivity flag, e.g. StringBounds::ContainsCi("report", true) or StringBounds::EndsWithCi(".docx", false). The Ci and Cs variants affect case-sensitivity. Options:
    • StartsWithCi(&str, bool) case-insensitive starts with + boolean positivity flag
    • EndsWithCi(&str, bool) case-insensitive ends with + is_positive flag
    • ContainsCi(&str, bool) case-insensitive contains + is_positive flag
    • StartsWithCs(&str, bool) case-sensitive starts with + is_positive flag
    • EndsWithCs(&str, bool) case-sensitive ends with + is_positive flag
    • ContainsCs(&str, bool) case-sensitive contains + is_positive flag
  • CharType: Defines categories, sets or ranges of characters as well as single characters.
    • Any => Match any characters
    • DecDigit => Match 0-9 only (is_ascii_digit)
    • Digit(radix) => Match digits with the specified radix (e.g. 16 for hexadecimal)
    • Numeric => Match number-like characters in the decimal base. Unlike the is_numeric() extension method this excludes . and -. Use to_numbers_conditional() to extract valid decimal numbers as strings
    • AlphaNum => Match any alphanumeric characters (is_alphanumeric)
    • Lower => Match lower case letters (is_lowercase)
    • Upper => Match upper case letters (is_uppercase)
    • Alpha => Match any letters in most supported alphabets (is_alphabetic)
    • Spaces => Match whitespace characters (is_whitespace)
    • Punctuation => Match ASCII punctuation characters (is_ascii_punctuation)
    • Char(char) => Match a single character
    • Chars(&[char]) => Match an array of characters
    • Range(Range) => Match a Range, e.g. 'a'..'d' will include a, b and c, but not d. This follows the Unicode sequence.
    • Between(c1, c2) => Match characters between the specified characters, inclusive, e.g. Between('a', 'd') will include d.
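
The difference between Range and Between mirrors Rust's half-open and inclusive range syntax. The sketch below uses only standard ranges over char; the two helper names are hypothetical.

```rust
/// Half-open check, like Range('a'..'d'): the end character is excluded.
fn in_half_open_range(c: char, start: char, end: char) -> bool {
    (start..end).contains(&c)
}

/// Inclusive check, like Between('a', 'd'): the end character is included.
fn in_inclusive_range(c: char, start: char, end: char) -> bool {
    (start..=end).contains(&c)
}

fn main() {
    println!("{}", in_half_open_range('d', 'a', 'd')); // false
    println!("{}", in_inclusive_range('d', 'a', 'd')); // true
}
```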

Dev Notes

This crate serves as a building block for other crates as well as to supplement a future version of string-patterns. Some updates reflect minor editorial changes.

Version 0.2.2 introduces three new features:

  • bounds_builder() makes it easier to define string matching rules for methods requiring an array of StringBounds rules such as filter_all_conditional(). See example above.
  • ToSegmentFromChars provides new methods to split on any of an array of characters, e.g. when processing common patterns that may use a predictable set of separators. This mimics character classes in regular expressions and is more efficient when you only need to allow for a limited set of split characters.
  • MatchOccurrences has a variant find_char_indices method that accepts a char rather than a &str. This avoids any need to cast a character to a string.
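
As a rough illustration of what finding occurrence indices involves, the standard library offers char_indices and match_indices. The sketch below returns byte offsets; whether the crate's methods report byte or character positions is not stated here, so treat that as an assumption. Both function names are hypothetical.

```rust
/// Byte offsets of every occurrence of a single character.
fn char_indices_of(text: &str, target: char) -> Vec<usize> {
    text.char_indices()
        .filter(|(_, c)| *c == target)
        .map(|(index, _)| index)
        .collect()
}

/// Byte offsets of every non-overlapping occurrence of a string pattern.
fn str_indices_of(text: &str, pattern: &str) -> Vec<usize> {
    text.match_indices(pattern).map(|(index, _)| index).collect()
}

fn main() {
    println!("{:?}", char_indices_of("jazz-and-blues", '-')); // [4, 8]
    println!("{:?}", str_indices_of("abcabc", "bc"));         // [1, 4]
}
```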

Versions of the string-patterns crate before 0.3.0 contained many of these extensions. Since version 0.3.0 all traits, enums and methods defined in this simple-string-patterns crate have been removed from string-patterns. These crates supplement each other, but may be installed independently.