# Rift Search Specification
This document outlines the regular expression syntax and features supported by Rift's search engine.
## Usage
Add `monster-regex` to your `Cargo.toml`:
```toml
[dependencies]
monster-regex = "0.1.0"
```
### Basic Example
```rust
use monster_regex::{Regex, Flags};
fn main() {
// Compile a regex
let re = Regex::new(r"\w+", Flags::default()).unwrap();
// Check for a match
assert!(re.is_match("hello"));
// Find a match
if let Some(m) = re.find("hello world") {
println!("Found match at {}-{}", m.start, m.end); // 0-5
}
}
```
### Using Flags
You can configure behavior using `Flags`:
```rust
use monster_regex::{Regex, Flags};
fn main() {
let mut flags = Flags::default();
flags.ignore_case = Some(true); // Case insensitive
flags.multiline = true; // ^ and $ match line boundaries
let re = Regex::new(r"^hello", flags).unwrap();
assert!(re.is_match("HELLO\nworld"));
}
```
### Parsing Rift Format
You can also parse patterns in the `pattern/flags` format used by Rift:
```rust
use monster_regex::parse_rift_format;
use monster_regex::Regex;
fn main() {
let (pattern, flags) = parse_rift_format("abc/i").unwrap();
let re = Regex::new(&pattern, flags).unwrap();
assert!(re.is_match("ABC"));
}
```
### Find All
```rust
use monster_regex::{Regex, Flags};
fn main() {
let re = Regex::new(r"\d+", Flags::default()).unwrap();
let text = "123 abc 456";
for m in re.find_all(text) {
println!("Match: {}", &text[m.start..m.end]);
}
}
```
### Replacement
```rust
use monster_regex::{Regex, Flags};
fn main() {
let re = Regex::new(r"foo", Flags::default()).unwrap();
let result = re.replace_all("foo bar foo", "baz");
assert_eq!(result, "baz bar baz");
}
```
### Streaming / Zero-Copy Search
For advanced use cases like searching non-contiguous memory (ropes, gap buffers) without allocation, implement the `Haystack` trait:
```rust
use monster_regex::{Regex, Haystack};
#[derive(Copy, Clone)]
struct MyRope<'a> {
// ... custom internal structure
phantom: std::marker::PhantomData<&'a ()>,
}
impl<'a> Haystack for MyRope<'a> {
fn len(&self) -> usize { /* ... */ }
fn char_at(&self, pos: usize) -> Option<(char, usize)> { /* ... */ }
fn char_before(&self, pos: usize) -> Option<char> { /* ... */ }
fn matches_range(&self, pos: usize, other_start: usize, other_end: usize) -> bool { /* ... */ }
fn starts_with(&self, pos: usize, literal: &str) -> bool { /* ... */ }
}
fn main() {
let rope = MyRope { /* ... */ };
let re = Regex::new("pattern", Default::default()).unwrap();
// Find generic match
if let Some(m) = re.find_from(rope) {
println!("Match at {}-{}", m.start, m.end);
}
// Iterate all matches
for m in re.find_all_from(rope) {
// ...
}
}
```
## 1. General Syntax
Search patterns are entered in the format:
`pattern/flags`
* **Pattern**: The regex to match.
* **Flags**: Optional single-character flags modifying the search behavior.
### Special Characters
The following characters have special meaning and must be escaped with `\` to be matched literally:
`.` `*` `+` `?` `^` `$` `|` `(` `)` `[` `]` `{` `}` `\`
All other characters match themselves literally.
**Note on Dot (`.`):**
By default, `.` matches any character except newline. Use the `s` (dotall) flag to make `.` match newlines.
### Case Sensitivity
* **Default (Smartcase)**: Case-insensitive if the pattern contains only lowercase letters. Case-sensitive if the pattern contains any uppercase letters.
* **Overrides**: Can be explicitly set using the `i` (ignore-case) or `c` (case-sensitive) flags.
## 2. Quantifiers
Quantifiers specify how many times the preceding atom (character, group, or character class) should match.
| `*` | 0 or more | Yes | `a*` matches "", "a", "aa"... |
| `+` | 1 or more | Yes | `a+` matches "a", "aa"... |
| `?` | 0 or 1 | Yes (prefers 1) | `a?` matches "" or "a", preferring "a" |
| `{n}` | Exactly *n* | — | `a{3}` matches "aaa" |
| `{n,m}` | *n* to *m* | Yes | `a{2,4}` matches "aa", "aaa", "aaaa" |
| `{n,}` | *n* or more | Yes | `a{2,}` matches "aa", "aaa"... |
| `{,m}` | 0 to *m* | Yes | `a{,3}` matches "", "a", "aa", "aaa" |
| `*?` | 0 or more | No | `a*?` matches minimal characters |
| `+?` | 1 or more | No | `a+?` matches minimal characters |
| `??` | 0 or 1 | No | `a??` prefers 0 matches |
| `{n,m}?` | *n* to *m* | No | `a{2,4}?` matches "aa" before "aaa" |
## 3. Character Classes
### Standard Classes
| `\d` | Digit `[0-9]` |
| `\D` | Non-digit |
| `\w` | Word character `[a-zA-Z0-9_]` (ASCII by default) |
| `\W` | Non-word character |
| `\s` | Whitespace `[ \t\r\n\f\v]` |
| `\S` | Non-whitespace |
### Extended Classes
| `\l` | Lowercase character |
| `\L` | Non-lowercase character |
| `\u` | Uppercase character |
| `\U` | Non-uppercase character |
| `\x` | Hexadecimal digit |
| `\X` | Non-hexadecimal digit |
| `\o` | Octal digit |
| `\O` | Non-octal digit |
| `\h` | Head of word character (start of a word) |
| `\H` | Non-head of word character |
| `\p` | Punctuation `[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]` |
| `\P` | Non-punctuation |
| `\a` | Alphanumeric `[a-zA-Z0-9]` |
| `\A` | Non-alphanumeric |
### Unicode Support
* **Default**: `\w`, `\d`, `\s`, `\h` match ASCII characters only.
* **With `u` flag**: These classes include Unicode characters (e.g., `\w` matches accented characters).
### Character Sets
Custom character sets and ranges (e.g., `[a-z]`, `[^0-9]`) are supported.
**Note on Escaping in Character Classes:**
In character classes, special meaning is different. For example, `[\]]` matches a literal `]`, and `[a\-z]` matches `a`, `\`, or `-`.
## 4. Anchors and Boundaries
Anchors assert a position without matching characters (zero-width).
| `^` | Start of string (or start of line in multiline mode) |
| `$` | End of string (or end of line in multiline mode) |
| `\<` | Start of word |
| `\>` | End of word |
| `\b` | Word boundary (matches at `\<` or `\>`) |
| `\zs` | Sets the start of the match (everything before is excluded from the result) |
| `\ze` | Sets the end of the match (everything after is excluded from the result) |
### Position Anchors
These anchors match at a specific position in the buffer. They are zero-width assertions and do not consume characters.
| `\%nl` | Matches anywhere on line *n* (1-indexed). | `\%5lfoo` matches "foo" only if it appears on line 5. |
Not implemented in the parser, clients must handle line-based matching.
### Word Boundaries Explained
* `\<`: Matches the position where a word starts (preceded by non-word, followed by word char).
* `\>`: Matches the position where a word ends (preceded by word char, followed by non-word).
* `\b`: Matches at either `\<` or `\>`.
Word boundaries `\<` and `\>` use the same character definition as `\w` (`[a-zA-Z0-9_]`). With the `u` flag, both adapt to Unicode.
## 5. Flags
Flags are appended after the pattern delimiter (e.g., `pattern/flags`).
| `i` | ignore-case | Case-insensitive matching (overrides smartcase). |
| `c` | case-sensitive | Case-sensitive matching (overrides smartcase). |
| `m` | multiline | `^` and `$` match line boundaries (`\n`), not just the start/end of the entire buffer. |
| `s` | dotall | `.` matches newlines (including end-of-line). |
| `x` | verbose | Whitespace and `#` comments in the pattern are ignored. Literal spaces must be escaped (e.g., `\ ` or `[ ]`). |
| `g` | global | Match all occurrences (used for find-all or replace operations). |
| `u` | unicode | Enables Unicode support for character classes (`\w`, `\d`, etc.). |
**Verbose Mode Examples (`x` flag):**
* `/foo bar/x` matches "foobar" (space is ignored).
* `/foo\ bar/x` matches "foo bar" (space is escaped).
* `/foo[ ]bar/x` matches "foo bar" (space in bracket).
## 6. Escape Sequences
| `\n` | Newline (LF) |
| `\t` | Tab |
| `\r` | Carriage return (CR) |
| `\f` | Form feed |
| `\v` | Vertical tab |
| `\\` | Literal backslash |
## 7. Groups, Alternation, and Assertions
* **Alternation**: `pattern1|pattern2` matches either *pattern1* or *pattern2*.
* **Grouping**: `(pattern)` groups part of the regex and captures it.
* **Named Capture**: `(?<name>pattern)` captures the group with a specific name.
* **Non-Capturing Group**: `(?:pattern)` groups without capturing.
* **Backreferences**: `\1` through `\9` refer to captured groups 1-9. `\0` refers to the entire match.
### Lookaround Assertions
Lookarounds assert that what follows or precedes the current position matches a pattern, without including it in the match result.
| `(?>=foo)` | Positive Lookahead | Matches if followed by "foo". |
| `(?>!foo)` | Negative Lookahead | Matches if **not** followed by "foo". |
| `(?<=foo)` | Positive Lookbehind | Matches if preceded by "foo". |
| `(?<!foo)` | Negative Lookbehind | Matches if **not** preceded by "foo". |