BlitzText
BlitzText is a high-performance library for efficient keyword extraction and replacement in strings. It is based on the FlashText and Aho-Corasick algorithms, and has both Rust and Python implementations. The main difference from Aho-Corasick is that BlitzText matches greedily, keeping only the longest matching pattern.
Installation
Rust
Add this to your Cargo.toml:
[dependencies]
blitztext = "0.1.0"
or
cargo add blitztext
Python
Install the library using pip:
pip install blitztext
Usage
Rust Usage
use blitztext::KeywordProcessor;
Python Usage
=
=
=
=
// :
Features
1. Parallel Processing
For processing multiple texts in parallel:
// Rust
let texts = vec!;
let results = processor.parallel_extract_keywords_from_texts;
# Python
=
=
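As a rough illustration of what batch extraction returns, the sketch below maps a stand-in extraction function over several texts with Python's standard `concurrent.futures`. It does not call BlitzText: `extract_keywords` and `extract_from_texts` are hypothetical helpers written for this example, and the substring test is a placeholder for real keyword matching. In pure Python, threads will not speed up CPU-bound work; the point is only the shape of the result, one list of matches per input text, in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_keywords(text, keywords):
    """Hypothetical stand-in: case-insensitive substring test per keyword."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def extract_from_texts(texts, keywords, max_workers=4):
    """Run extract_keywords over every text; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(lambda t: extract_keywords(t, keywords), texts))


if __name__ == "__main__":
    texts = ["I love Rust", "Python is fun", "no match"]
    print(extract_from_texts(texts, ["rust", "python"]))
```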
2. Fuzzy Matching
Both Rust and Python implementations support fuzzy matching:
// Rust
let matches = processor.extract_keywords;
# Python
=
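To make the idea of threshold-based fuzzy matching concrete, here is a minimal sketch built on `difflib.SequenceMatcher` from the Python standard library. It is not BlitzText's implementation, and BlitzText's scoring may differ; `fuzzy_extract` and its `threshold` parameter are assumptions made for this example. Each whitespace-separated token is compared with each keyword, and pairs whose similarity ratio reaches the threshold are reported.

```python
from difflib import SequenceMatcher


def fuzzy_extract(text, keywords, threshold=0.8):
    """Return (keyword, token, score) for tokens similar enough to a keyword."""
    hits = []
    for token in text.split():
        for kw in keywords:
            # ratio() is 1.0 for identical strings and approaches 0.0
            # as the two strings share fewer characters.
            score = SequenceMatcher(None, token.lower(), kw.lower()).ratio()
            if score >= threshold:
                hits.append((kw, token, score))
    return hits


if __name__ == "__main__":
    # "programing" is a misspelling that still clears the 0.8 threshold.
    for kw, token, score in fuzzy_extract("I love programing", ["programming"]):
        print(f"{token!r} ~ {kw!r} (score {score:.2f})")
```

Raising the threshold toward 1.0 makes the match stricter; lowering it admits more distant misspellings at the cost of more false positives.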
3. Case Sensitivity
You can enable case-sensitive matching:
// Rust
let mut processor = with_options;
processor.add_keyword;
let matches = processor.extract_keywords;
// Only "Rust" will be matched, not "rust"
# Python
=
=
# Only "Rust" will be matched, not "rust"
4. Overlapping Matches
Enable overlapping matches:
// Rust
let mut processor = with_options;
processor.add_keyword;
processor.add_keyword;
let matches = processor.extract_keywords;
// "word" will be matched
# Python
=
=
# "word" will be matched
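The difference between greedy longest-match scanning and overlapping matches can be sketched in a few lines of plain Python. This is an illustration only, not BlitzText's code: `find_keywords` and its `allow_overlaps` flag are hypothetical names, the keywords "sword" and "word" are chosen for the example, and word-boundary checks are left out to keep the focus on overlap handling.

```python
def find_keywords(text, keywords, allow_overlaps=False):
    """Return (keyword, start, end) tuples for keywords found in text."""
    # Longest keywords first, so the first hit at a position is the longest.
    by_length = sorted((kw for kw in keywords if kw), key=len, reverse=True)
    matches = []
    i = 0
    while i < len(text):
        here = [kw for kw in by_length if text.startswith(kw, i)]
        if not here:
            i += 1
        elif allow_overlaps:
            # Report every keyword starting here, then advance one character
            # so keywords nested inside a longer match are also found.
            matches.extend((kw, i, i + len(kw)) for kw in here)
            i += 1
        else:
            # Greedy: keep only the longest keyword and jump past it.
            longest = here[0]
            matches.append((longest, i, i + len(longest)))
            i += len(longest)
    return matches


if __name__ == "__main__":
    text = "I have a sword"
    print(find_keywords(text, ["word", "sword"]))
    print(find_keywords(text, ["word", "sword"], allow_overlaps=True))
```

In greedy mode only "sword" is reported, because the scan jumps past the whole match; with overlaps allowed, "word" inside "sword" is reported as well.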
5. Custom Non-Word Boundaries
This library uses the concept of non-word boundaries to determine where words begin and end. By default, alphanumeric characters and underscores are considered part of a word. You can customize this behavior to fit your specific needs.
Understanding Non-Word Boundaries
- Characters defined as non-word boundaries are considered part of a word; despite the name, they never separate words.
- Characters not defined as non-word boundaries are treated as word separators.
Example
// Rust
let mut processor = KeywordProcessor::new();
processor.add_keyword;
processor.add_keyword;
let text = "I-love-rust-programming-and-1coding2";
// Default behavior: '-' is a word separator
let matches = processor.extract_keywords;
assert_eq!(matches.len(), 2);
// Matches: "rust" and "coding"
// Add '-' as a non-word boundary
processor.add_non_word_boundary;
// Now '-' is considered part of words
let matches = processor.extract_keywords;
assert_eq!(matches.len(), 0);
// No matches, because "rust" and "programming" are now part of larger "words"
# Python
=
=
# Default behavior: '-' is a word separator
=
assert len(matches) == 2
# Matches: "rust" and "coding"
# Add '-' as a non-word boundary
# Now '-' is considered part of words
=
assert len(matches) == 0
# No matches, because "rust" and "programming" are now part of larger "words"
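The boundary rule described above can be sketched independently of the library. The code below is a standard-library illustration, not BlitzText's implementation: `extract` and `DEFAULT_WORD_CHARS` are hypothetical names, the default set is approximated with ASCII letters, digits, and the underscore, and the keywords "rust" and "programming" are chosen for the example. A keyword counts as a match only when the characters on both sides of it are outside the word-character set.

```python
import string

# Approximation of the default: alphanumerics and underscore are word characters.
DEFAULT_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def extract(text, keywords, word_chars=DEFAULT_WORD_CHARS):
    """Return (keyword, start, end) for keywords that stand as whole words."""
    found = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            end = start + len(kw)
            # A match is valid only if neither neighbour is a word character.
            left_ok = start == 0 or text[start - 1] not in word_chars
            right_ok = end == len(text) or text[end] not in word_chars
            if left_ok and right_ok:
                found.append((kw, start, end))
            start = text.find(kw, start + 1)
    return sorted(found, key=lambda m: m[1])


if __name__ == "__main__":
    text = "I-love-rust-programming-and-1coding2"
    keywords = ["rust", "programming"]
    # '-' separates words by default, so both keywords are found.
    print(extract(text, keywords))
    # Once '-' counts as a word character, the keywords are buried in one long word.
    print(extract(text, keywords, DEFAULT_WORD_CHARS | {"-"}))
```

With the default set the sketch finds two matches; after adding '-' to the set it finds none, mirroring the counts asserted in the example above.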
Setting a whole new set of non-word boundaries
// Rust
processor.set_non_word_boundaries;
# Python
Performance
BlitzText is designed for high performance, making it suitable for processing large volumes of text. Benchmark details here.
Multi-threaded performance:
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Issues
If you encounter any problems, please file an issue along with a detailed description.
License
This project is licensed under the MIT License.