# Features
This document provides detailed information about the features supported by regexr.
## Pattern Syntax
### Literals
Match literal characters:
```rust
use regexr::Regex;

let re = Regex::new("hello").unwrap();
assert!(re.is_match("hello world"));
```
### Character Classes
#### Basic Classes
- `.` - Any character except newline
- `\d` - Digit `[0-9]`
- `\D` - Non-digit `[^0-9]`
- `\w` - Word character `[a-zA-Z0-9_]`
- `\W` - Non-word character `[^a-zA-Z0-9_]`
- `\s` - Whitespace `[ \t\n\r\f\v]`
- `\S` - Non-whitespace
```rust
let re = Regex::new(r"\d+").unwrap();
assert!(re.is_match("123"));
```
#### Custom Classes
```rust
let re = Regex::new(r"[aeiou]").unwrap();
assert!(re.is_match("hello"));
let re = Regex::new(r"[^aeiou]").unwrap(); // Negated
assert!(re.is_match("xyz"));
let re = Regex::new(r"[a-z]").unwrap(); // Range
assert!(re.is_match("hello"));
```
### Quantifiers
#### Greedy Quantifiers
- `*` - Zero or more
- `+` - One or more
- `?` - Zero or one
- `{n}` - Exactly n times
- `{n,}` - n or more times
- `{n,m}` - Between n and m times
```rust
let re = Regex::new(r"\d+").unwrap();
assert_eq!(re.find("abc123def").unwrap().as_str(), "123");
let re = Regex::new(r"\w{3,5}").unwrap();
assert!(re.is_match("hello"));
```
#### Non-Greedy Quantifiers
Add `?` after any quantifier to make it non-greedy:
- `*?` - Zero or more (non-greedy)
- `+?` - One or more (non-greedy)
- `??` - Zero or one (non-greedy)
- `{n,}?` - n or more (non-greedy)
- `{n,m}?` - Between n and m (non-greedy)
```rust
let re = Regex::new(r#"<.*?>"#).unwrap();
assert_eq!(re.find("<a><b>").unwrap().as_str(), "<a>");
let re = Regex::new(r#"<.*>"#).unwrap(); // Greedy version
assert_eq!(re.find("<a><b>").unwrap().as_str(), "<a><b>");
```
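
To make the difference concrete without depending on the library, the following std-only sketch models just these two patterns. The function name `match_angle` is hypothetical, and the code hard-codes `<.*>` and `<.*?>` rather than reflecting how regexr's engines work internally.

```rust
/// Models `<.*>` (greedy) or `<.*?>` (non-greedy) with leftmost-match semantics.
/// `.` does not match a newline, so the closing `>` must be on the same line.
fn match_angle(text: &str, greedy: bool) -> Option<&str> {
    for (start, _) in text.match_indices('<') {
        let rest = &text[start + 1..];
        // Only the remainder of the current line is reachable by `.*`.
        let line = rest.split('\n').next().unwrap_or("");
        let close = if greedy {
            line.rfind('>') // greedy: extend to the last `>`
        } else {
            line.find('>') // non-greedy: stop at the first `>`
        };
        if let Some(c) = close {
            // `<` and `>` are one byte each, hence the `+ 2`.
            return Some(&text[start..start + c + 2]);
        }
    }
    None
}
```

Calling `match_angle("<a><b>", false)` yields the short match and `match_angle("<a><b>", true)` the long one, mirroring the two assertions above.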
### Anchors
- `^` - Start of string
- `$` - End of string
- `\b` - Word boundary
- `\B` - Non-word boundary
```rust
let re = Regex::new(r"^\d+").unwrap();
assert!(re.is_match("123abc"));
assert!(!re.is_match("abc123"));
let re = Regex::new(r"\bword\b").unwrap();
assert!(re.is_match("a word here"));
assert!(!re.is_match("awordhere"));
```
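
As a rough illustration of what `\b` checks, the sketch below tests whether a position sits between a word character and a non-word character. It assumes the ASCII definition of `\w` listed under Basic Classes; the helper names `is_word` and `is_word_boundary` are hypothetical and not part of regexr.

```rust
/// ASCII word character, following the `[a-zA-Z0-9_]` definition of `\w`.
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Returns true if byte offset `pos` in `text` is a `\b` position.
/// `pos` must lie on a char boundary. The start and end of the text
/// count as non-word neighbours.
fn is_word_boundary(text: &str, pos: usize) -> bool {
    let before = text[..pos].chars().next_back().map_or(false, is_word);
    let after = text[pos..].chars().next().map_or(false, is_word);
    before != after
}
```

`\B` is simply the negation: it holds wherever `is_word_boundary` returns `false`.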
### Alternation
```rust
let re = Regex::new(r"cat|dog").unwrap();
assert!(re.is_match("She has a cat"));
```
### Grouping
#### Capturing Groups
```rust
let re = Regex::new(r"(\w+)@(\w+)").unwrap();
let caps = re.captures("user@example").unwrap();
assert_eq!(&caps[1], "user");
assert_eq!(&caps[2], "example");
```
#### Named Groups
```rust
let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+)").unwrap();
let caps = re.captures("user@example").unwrap();
assert_eq!(&caps["user"], "user");
assert_eq!(&caps["domain"], "example");
```
#### Non-Capturing Groups
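```rust
// `(?:...)` groups without creating a capture
let re = Regex::new(r"(?:cat|dog)+").unwrap();
assert!(re.is_match("catdog"));
```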
### Backreferences
Match the same text as a previous capture group:
```rust
let re = Regex::new(r"(\w+)\s+\1").unwrap();
assert!(re.is_match("hello hello"));
assert!(!re.is_match("hello world"));
```
Backreferences require the BacktrackingVm or BacktrackingJit engine.
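
Backreferences need an engine that can remember captured text and compare it later, which a pure finite automaton cannot do. The sketch below hand-codes the pattern `(\w+)\s+\1` to show that comparison step. It is an illustration only: `find_repeated_word` is a hypothetical name, `\w` is treated as ASCII, and this is not how regexr's backtracking engines are implemented.

```rust
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Hand-written model of `(\w+)\s+\1`. Returns the text of group 1
/// for the leftmost match, or `None` if nothing matches.
fn find_repeated_word(text: &str) -> Option<&str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    for start in 0..chars.len() {
        if !is_word(chars[start].1) {
            continue;
        }
        // Group 1 must run to the end of the word, because `\s+` follows it.
        let mut end = start;
        while end < chars.len() && is_word(chars[end].1) {
            end += 1;
        }
        // `\s+`: at least one whitespace character, then more text.
        let mut ws = end;
        while ws < chars.len() && chars[ws].1.is_whitespace() {
            ws += 1;
        }
        if ws == end || ws == chars.len() {
            continue;
        }
        let group = &text[chars[start].0..chars[end].0];
        // `\1`: the remaining text must repeat the captured group exactly.
        if text[chars[ws].0..].starts_with(group) {
            return Some(group);
        }
    }
    None
}
```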
### Lookaround Assertions
#### Lookahead
- `(?=...)` - Positive lookahead
- `(?!...)` - Negative lookahead
```rust
let re = Regex::new(r"\w+(?=@)").unwrap();
assert_eq!(re.find("user@example").unwrap().as_str(), "user");
let re = Regex::new(r"\w+(?!@)").unwrap();
assert_eq!(re.find("user example").unwrap().as_str(), "user");
```
#### Lookbehind
- `(?<=...)` - Positive lookbehind
- `(?<!...)` - Negative lookbehind
```rust
let re = Regex::new(r"(?<=@)\w+").unwrap();
assert_eq!(re.find("user@example").unwrap().as_str(), "example");
let re = Regex::new(r"(?<!@)\w+").unwrap();
assert_eq!(re.find("user example").unwrap().as_str(), "user");
```
Lookaround assertions require the PikeVm or TaggedNfa engine.
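
The defining property of a lookaround is that it is zero-width: it inspects neighbouring text without adding it to the match. A minimal sketch of `\w+(?=@)` under that reading follows. The function name `word_before_at` is hypothetical and `\w` is treated as ASCII.

```rust
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Models `\w+(?=@)`: the match is the word run; the `@` is only peeked at.
fn word_before_at(text: &str) -> Option<&str> {
    let bytes = text.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if !is_word(bytes[i] as char) {
            i += 1;
            continue;
        }
        let start = i;
        while i < bytes.len() && is_word(bytes[i] as char) {
            i += 1;
        }
        // Lookahead: check the next byte but leave it out of the match.
        if bytes.get(i) == Some(&b'@') {
            return Some(&text[start..i]);
        }
    }
    None
}
```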
## Unicode Support
### Unicode Character Classes
Patterns can match Unicode characters using escape sequences:
```rust
let re = Regex::new(r"\w+").unwrap();
assert!(re.is_match("café")); // Matches Unicode word characters
```
### Unicode Properties
Match characters by Unicode properties:
```rust
// Letter category
let re = Regex::new(r"\p{Letter}+").unwrap();
assert!(re.is_match("hello"));
assert!(re.is_match("привет")); // Cyrillic
// Number category
let re = Regex::new(r"\p{Number}+").unwrap();
assert!(re.is_match("123"));
assert!(re.is_match("①②③")); // Unicode numbers
// Script
let re = Regex::new(r"\p{Greek}+").unwrap();
assert!(re.is_match("αβγ"));
let re = Regex::new(r"\p{Cyrillic}+").unwrap();
assert!(re.is_match("привет"));
```
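
For intuition about what these categories cover, Rust's standard library exposes similar character tests. The sketch below is an approximation, not regexr's implementation: `char::is_alphabetic` uses the Unicode `Alphabetic` property, which is slightly broader than the `Letter` category, while `char::is_numeric` covers the `Nd`, `Nl`, and `No` categories. The helper names are hypothetical, and unlike `is_match` they test the whole string.

```rust
/// Roughly `^\p{Letter}+$`, using the std `Alphabetic` property as a stand-in.
fn all_letters(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_alphabetic)
}

/// Roughly `^\p{Number}+$`, using std's numeric categories (Nd, Nl, No).
fn all_numbers(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_numeric)
}
```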
### Supported Unicode Categories
- `Letter` (`L`): All letters
- `Number` (`N`): All numbers
- `Mark` (`M`): Combining marks
- `Punctuation` (`P`): Punctuation characters
- `Symbol` (`S`): Symbols
- `Separator` (`Z`): Separators
### Supported Scripts
- `Arabic`, `Armenian`, `Bengali`, `Cyrillic`, `Devanagari`
- `Georgian`, `Greek`, `Gujarati`, `Gurmukhi`, `Han`
- `Hangul`, `Hebrew`, `Hiragana`, `Kannada`, `Katakana`
- `Khmer`, `Lao`, `Latin`, `Malayalam`, `Myanmar`
- `Oriya`, `Sinhala`, `Tamil`, `Telugu`, `Thai`, `Tibetan`
Many more scripts are supported. See the Unicode Character Database for the complete list.
### Case-Insensitive Matching
Unicode-aware case folding:
```rust
let re = Regex::new(r"(?i)hello").unwrap();
assert!(re.is_match("HELLO"));
assert!(re.is_match("Hello"));
let re = Regex::new(r"(?i)café").unwrap();
assert!(re.is_match("CAFÉ")); // Unicode case folding
```
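
Unicode-aware comparison differs from ASCII-only comparison in that non-ASCII letters such as `É` and `é` are also treated as equal. A minimal std-only sketch of the idea follows. It uses per-character lowercasing as a stand-in for full Unicode case folding, which handles a few extra special cases, and the name `eq_ignore_case` is hypothetical.

```rust
/// Compares two strings after lowercasing every character.
/// `char::to_lowercase` may yield more than one char, hence `flat_map`.
fn eq_ignore_case(a: &str, b: &str) -> bool {
    a.chars()
        .flat_map(char::to_lowercase)
        .eq(b.chars().flat_map(char::to_lowercase))
}
```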
## JIT Compilation
### Enabling JIT
Use `RegexBuilder` to enable JIT compilation:
```rust
use regexr::RegexBuilder;

let re = RegexBuilder::new(r"\w+@\w+\.\w+")
    .jit(true)
    .build()
    .unwrap();
assert!(re.is_match("user@example.com"));
```
### When to Use JIT
JIT compilation is beneficial when:
1. **Pattern will be matched many times**: JIT has higher compilation cost but faster execution
2. **Performance is critical**: JIT generates native code for maximum speed
3. **Pattern has effective prefilters**: Combines SIMD literal search with native DFA execution
### JIT Requirements
- Available on x86-64 (Linux, macOS, Windows) and ARM64 (Linux, macOS)
- Requires `jit` feature flag
- Automatically falls back to interpreted engines if compilation fails
### JIT Engine Selection
When JIT is enabled, the engine is selected based on:
- **Backreferences**: Uses BacktrackingJit
- **Lookaround**: Falls back to PikeVm (requires NFA semantics)
- **Non-greedy quantifiers**: Uses TaggedNfa
- **Large Unicode classes**: Uses LazyDfa (avoids state explosion)
- **Alternations without prefilter**: Uses JitShiftOr
- **General patterns**: Uses DFA JIT with SIMD prefiltering
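
The rules above can be read as a decision function. The sketch below encodes them in that form. The `Engine` and `PatternTraits` types are hypothetical, and the top-to-bottom priority when several traits apply is an assumption, since the list does not state how conflicts are resolved.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Engine {
    BacktrackingJit,
    PikeVm,
    TaggedNfa,
    LazyDfa,
    JitShiftOr,
    DfaJit,
}

/// Hypothetical summary of the pattern features that influence selection.
#[derive(Default, Clone, Copy)]
struct PatternTraits {
    has_backrefs: bool,
    has_lookaround: bool,
    has_non_greedy: bool,
    has_large_unicode_class: bool,
    is_alternation_without_prefilter: bool,
}

/// Assumed priority: the first matching rule in the documented list wins.
fn select_jit_engine(t: PatternTraits) -> Engine {
    if t.has_backrefs {
        Engine::BacktrackingJit
    } else if t.has_lookaround {
        Engine::PikeVm
    } else if t.has_non_greedy {
        Engine::TaggedNfa
    } else if t.has_large_unicode_class {
        Engine::LazyDfa
    } else if t.is_alternation_without_prefilter {
        Engine::JitShiftOr
    } else {
        Engine::DfaJit
    }
}
```

In practice, `engine_name()` remains the authoritative way to see which engine a compiled pattern actually uses.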
## SIMD Acceleration
### Default Behavior
SIMD acceleration is enabled by default through the `simd` feature. It provides:
- AVX2-accelerated literal search using Teddy algorithm
- Fast multi-pattern matching for 2-8 literals
- Automatic fallback to scalar implementations when SIMD is unavailable
### How It Works
The SIMD prefilter:
1. Extracts required literals from the pattern
2. Uses SIMD instructions to scan for candidates
3. Verifies candidates with the full regex engine
4. Returns matches
Example pattern with effective prefilter:
```rust
let re = Regex::new(r"hello\w+").unwrap();
// SIMD scans for "hello", then engine verifies \w+
```
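
The scan-then-verify loop can be sketched in plain Rust. Here the scalar `str::find` stands in for the SIMD scan, and a hand-written check stands in for the regex engine verifying `\w+`. The function name `find_hello_word` is hypothetical, `\w` is treated as ASCII, and the real prefilter works on extracted literals rather than a hard-coded one.

```rust
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Models `hello\w+` as "find the literal, then verify the rest".
fn find_hello_word(text: &str) -> Option<&str> {
    let literal = "hello";
    let mut from = 0;
    // Step 2: scan for candidate positions of the required literal.
    while let Some(off) = text[from..].find(literal) {
        let start = from + off;
        let after = start + literal.len();
        // Step 3: verify the candidate by matching `\w+` right after it.
        let extra: usize = text[after..]
            .chars()
            .take_while(|&c| is_word(c))
            .map(char::len_utf8)
            .sum();
        if extra > 0 {
            return Some(&text[start..after + extra]); // Step 4: report the match
        }
        // False positive: resume scanning just past this candidate.
        from = start + 1;
    }
    None
}
```

The benefit comes from step 2 being much cheaper than step 3, so most of the input is skipped without running the full engine.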
### Disabling SIMD
Build without SIMD:
```bash
cargo build --no-default-features
```
## Prefix Optimization
### Tokenizer Optimization
For patterns with many literal alternatives (common in tokenizers), prefix optimization merges common prefixes into a trie structure:
```rust
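use regexr::RegexBuilder;

// Illustrative keyword alternation; real tokenizer patterns are much larger
let re = RegexBuilder::new(r"fn|for|foo")
    .optimize_prefixes(true)
    .build()
    .unwrap();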
```
### How It Works
Without optimization, the compiler creates a separate NFA branch for each alternative.

With optimization, alternatives that share a prefix (for example a leading `f`) are merged into a single branch, reducing active NFA threads from O(vocabulary_size) to O(token_length).
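
To illustrate the merging step, here is a small std-only trie sketch. The word list `fn`, `for`, `foo` used in the note below is a made-up example, and the `TrieNode` type and helper functions are hypothetical; regexr's internal representation may differ.

```rust
use std::collections::BTreeMap;

/// One trie node; `is_end` marks the end of a complete alternative.
#[derive(Default)]
struct TrieNode {
    children: BTreeMap<char, TrieNode>,
    is_end: bool,
}

/// Inserts every alternative, sharing nodes for common prefixes.
fn build_trie(words: &[&str]) -> TrieNode {
    let mut root = TrieNode::default();
    for w in words {
        let mut node = &mut root;
        for c in w.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_end = true;
    }
    root
}

/// Counts the nodes below `n` (the root itself is not counted).
fn node_count(n: &TrieNode) -> usize {
    n.children.values().map(|c| 1 + node_count(c)).sum()
}

/// Walks one path from the root: at most one step per input character.
fn contains(root: &TrieNode, word: &str) -> bool {
    let mut node = root;
    for c in word.chars() {
        match node.children.get(&c) {
            Some(next) => node = next,
            None => return false,
        }
    }
    node.is_end
}
```

For `build_trie(&["fn", "for", "foo"])` the root has a single child `f`, and the three alternatives need 5 nodes instead of the 8 characters they contain. A lookup follows one path whose length is bounded by the token length, which is the intuition behind the O(token_length) figure.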
### When to Use
Enable prefix optimization when:
- Pattern has many literal alternatives (>10)
- Alternatives share common prefixes
- Used in tokenization or keyword matching
## API Features
### Matching
```rust
let re = Regex::new(r"\d+").unwrap();
// Check if pattern matches
if re.is_match("abc123") {
    println!("Match found");
}
// Find first match
if let Some(m) = re.find("abc123def") {
    println!("Found at {}-{}: {}", m.start(), m.end(), m.as_str());
}
// Find all matches
for m in re.find_iter("123 456 789") {
    println!("{}", m.as_str());
}
```
### Capture Groups
```rust
let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
let caps = re.captures("2024-01-15").unwrap();
println!("Year: {}", &caps[1]);
println!("Month: {}", &caps[2]);
println!("Day: {}", &caps[3]);
// Iterate over all captures in text
for caps in re.captures_iter("2024-01-15 2024-02-20") {
    println!("{}", &caps[0]);
}
```
### Text Replacement
```rust
let re = Regex::new(r"\d+").unwrap();
// Replace first match
let result = re.replace("Price: 100 dollars", "200");
assert_eq!(result, "Price: 200 dollars");
// Replace all matches
let result = re.replace_all("Price: 100 and 200 dollars", "X");
assert_eq!(result, "Price: X and X dollars");
```
### Pattern Information
```rust
let re = Regex::new(r"(?P<year>\d{4})-(?P<month>\d{2})").unwrap();
// Get original pattern
println!("Pattern: {}", re.as_str());
// Get capture names
for name in re.capture_names() {
    println!("Capture group: {}", name);
}
// Get engine name (for debugging)
println!("Engine: {}", re.engine_name());
```
## Error Handling
All regex compilation returns `Result<Regex, Error>`:
```rust
use regexr::Regex;
match Regex::new(r"(unclosed") {
    Ok(_) => println!("Compiled successfully"),
    Err(e) => eprintln!("Compilation error: {}", e),
}
```
Common errors:
- Unclosed groups
- Invalid escape sequences
- Invalid repetition
- Invalid backreference
- Unsupported features
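
As an illustration of the first error class, the sketch below shows one way an unclosed group could be detected before compilation. It is a simplified, hypothetical checker, not regexr's parser: it skips escaped characters and ignores parentheses inside `[...]`, but does not handle every class syntax edge case, and the error strings are invented for the example.

```rust
/// Returns `Err` with a description if the groups in `pattern` are unbalanced.
fn check_groups(pattern: &str) -> Result<(), String> {
    let mut open: Vec<usize> = Vec::new(); // byte offsets of unmatched `(`
    let mut in_class = false;
    let mut chars = pattern.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            // Skip the escaped character so `\(` is not counted as a group.
            '\\' => {
                chars.next();
            }
            '[' if !in_class => in_class = true,
            ']' if in_class => in_class = false,
            '(' if !in_class => open.push(i),
            ')' if !in_class => {
                if open.pop().is_none() {
                    return Err(format!("unmatched ')' at offset {}", i));
                }
            }
            _ => {}
        }
    }
    match open.pop() {
        Some(i) => Err(format!("unclosed group at offset {}", i)),
        None => Ok(()),
    }
}
```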
## Performance Tips
### 1. Enable JIT for Hot Patterns
```rust
let re = RegexBuilder::new(pattern)
    .jit(true)
    .build()
    .unwrap();
```
### 2. Use Prefix Optimization for Tokenizers
```rust
let re = RegexBuilder::new(keyword_pattern)
    .optimize_prefixes(true)
    .build()
    .unwrap();
```
### 3. Anchor Patterns When Possible
```rust
// Better
let re = Regex::new(r"^\d+").unwrap();
// Slower (must search entire string)
let re = Regex::new(r"\d+").unwrap();
```
### 4. Use Character Classes Instead of Alternations
```rust
// Better
let re = Regex::new(r"[abc]").unwrap();
// Slower
let re = Regex::new(r"a|b|c").unwrap();
```
### 5. Avoid Unnecessary Captures
```rust
// Better (non-capturing group)
let re = Regex::new(r"(?:cat|dog)+").unwrap();
// Slower (capturing group not needed)
let re = Regex::new(r"(cat|dog)+").unwrap();
```
### 6. Profile Engine Selection
Use `engine_name()` to verify the selected engine:
```rust
let re = Regex::new(pattern).unwrap();
println!("Using engine: {}", re.engine_name());
```
Ensure the engine matches your expectations for the pattern type.
## Limitations
### Current Limitations
1. **SIMD**: Only available on x86-64 with AVX2 support
2. **JIT**: Not available on WASM — falls back to interpreted engines automatically
3. **Multiline mode**: Currently `.` never matches newline
4. **Backreferences**: Cannot be combined with JIT DFA (uses BacktrackingJit instead)
5. **Variable-width lookbehind**: Limited support (fixed-width lookbehind only)
### Platform Support
| Platform | JIT | SIMD |
|----------|-----|------|
| Linux x86-64 | ✓ | ✓ (AVX2) |
| Linux ARM64 | ✓ | ✗ |
| macOS x86-64 | ✓ | ✓ (AVX2) |
| macOS ARM64 (Apple Silicon) | ✓ | ✗ |
| Windows x86-64 | ✓ | ✓ (AVX2) |
| WASM (wasm32) | ✗ | ✗ |
| Other | ✗ | ✗ |
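
The platform matrix can also be expressed as a small lookup, which may be handy for logging or diagnostics. This is a sketch only: `platform_support` is a hypothetical helper, the strings follow the values of `std::env::consts::ARCH` and `std::env::consts::OS`, and the second flag means "SIMD-capable target", since actual AVX2 availability also depends on the CPU at run time.

```rust
/// Returns `(jit_available, simd_capable)` for a target, per the table above.
fn platform_support(arch: &str, os: &str) -> (bool, bool) {
    let desktop_os = matches!(os, "linux" | "macos" | "windows");
    let jit = match arch {
        "x86_64" => desktop_os,
        "aarch64" => matches!(os, "linux" | "macos"),
        _ => false,
    };
    // SIMD acceleration is limited to x86-64 targets (AVX2).
    let simd = arch == "x86_64" && desktop_os;
    (jit, simd)
}
```

A typical call would be `platform_support(std::env::consts::ARCH, std::env::consts::OS)`.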
### Feature Compatibility
| Backreferences | ✓ | ✗ | ✗ | ✗ | ✓ |
| Lookaround | ✓ | ✗ | ✗ | ✗ | ✗ |
| Non-greedy | ✓ | ✗ | ✗ | ✗ | ✗ |
| Word boundaries | ✓ | ✗ | ✓ | ✓ | ✓ |
| Anchors | ✓ | ✗ | ✓ | ✓ | ✓ |
| Captures | ✓ | ✓\* | ✓\* | ✓ | ✓ |
\*ShiftOr and LazyDFA fall back to PikeVm for capture extraction.