tyrx 0.1.0

# TyRx: a typed, ergonomic regular expression library

TyRx attempts to bring the strong typing and excellent domain modeling capabilities
of Rust into the world of regular expressions.

It provides traits, types, and macros for quickly building types that know how to
parse themselves from a string by compiling and matching a regular expression.

The crate name is pronounced "tee-rex", like the dinosaur.

## Examples

As a trivial example, here is how to parse a list of numbers from a string:

```rust
use tyrx::{Result, TyRx};

fn main() -> Result<()> {
    let string = "13.37 69.67 -137 +42 -2.718281829";
    let numbers: Vec<_> = f64::iter_from_str(string).collect::<Result<_>>()?;

    assert_eq!(numbers, [
        13.37,
        69.67,
        -137.0,
        42.0,
        -2.718281829,
    ]);

    Ok(())
}
```

Now for a slightly more complicated example that shows off more of the crate's
capabilities. Let's say there's a file in which each line has the format:

```text
ident1: 3.14, SomeText
ident2: -137.42, OtherStringContent
```

so the first part before the `:` is an identifier of the record, while the
rest of the line is a comma-separated pair of values (a fractional number
and some alphanumeric text), represented by a nested type.

You can use the following piece of code to represent the outer and the inner
type, specify the subpatterns necessary for matching each field, and have the
library generate all the parsing boilerplate:

```rust
use tyrx::{
    RegexPattern, FromMatch, ErasedLifetime, TyRx,
    builder::{Char, Ignore},
};


#[derive(PartialEq, Debug, RegexPattern, FromMatch, ErasedLifetime)]
struct Outer {
    #[tyrx(pattern = r"(?<Outer.prefix>[[:alnum:]]+)")]
    prefix: String,
    
    colon: Char<':'>,

    #[tyrx(pattern = r"(?<Outer.space>\s+)")]
    space: Ignore<String>,

    /// nested type implementing `RegexPattern` and `FromMatch`
    inner: Inner,
}

#[derive(PartialEq, Debug, RegexPattern, FromMatch, ErasedLifetime)]
struct Inner {
    number_value: f64,

    #[tyrx(pattern = r"(?<Inner.separator>,\s*)")]
    separator: Ignore<String>,
    
    #[tyrx(pattern = r"(?<Inner.text_content>[[:alnum:]]+)")]
    text_content: String,
}

fn main() -> tyrx::Result<()> {
    let text = r#"
        ident1: 3.14, SomeText
        ident2: -137.42, OtherStringContent
    "#;

    let matches: Vec<_> = Outer::iter_from_str(text).collect::<tyrx::Result<_>>()?;

    assert_eq!(matches, [
        Outer {
            prefix: String::from("ident1"),
            colon: Char::default(),
            space: Ignore::default(),
            inner: Inner {
                number_value: 3.14,
                separator: Ignore::default(),
                text_content: String::from("SomeText"),
            },
        },
        Outer {
            prefix: String::from("ident2"),
            colon: Char::default(),
            space: Ignore::default(),
            inner: Inner {
                number_value: -137.42,
                separator: Ignore::default(),
                text_content: String::from("OtherStringContent"),
            },
        },
    ]);

    Ok(())
}
```

### Explanation of the Example

The main entry point of the crate is the [`TyRx`] trait. This is automatically
implemented (by means of a blanket impl) for types that also implement the
[`RegexPattern`], [`FromMatch`], and [`ErasedLifetime`] traits, all of which
can be automatically `#[derive]`'d.

* The [`RegexPattern`] trait is implemented by types that represent a regular
  expression pattern. They supply this pattern to the regex engine by writing
  it into the provided formatter in the [`RegexPattern::fmt_pattern()`] method.

  The derive macro accepts the following attributes:

  - Struct field and variant field attributes:
    - `#[tyrx(rename = identifier)]`: causes the field name part of the capture
      group in the generated pattern to be replaced by the specified literal
      identifier.
    - `#[tyrx(pattern = "regex pattern string or other Display-able value")]`:
      causes the field's portion of the generated pattern to be replaced by the
      supplied sub-pattern. By default, the field's sub-pattern is derived from
      its type. You may re-use this sub-pattern in the custom pattern by using
      e.g. `format_args!()` and interpolating [`RegexPattern::pattern_display()`],
      forwarded to the field type.

  - Enum variant attributes:
    - `#[tyrx(rename = identifier)]`: similar to the `rename` attribute on struct
      fields, except that it replaces the variant name part of the capture group
      name. When applied to a unit variant, it also changes the literal pattern to
      be matched.

* The [`FromMatch`] trait represents a type that can parse itself from a match
  or a set of matched capture groups.
  
  The derive macro accepts **all attributes accepted by the [`RegexPattern`]
  derive,** and some more:

  - Top-level (struct and enum) attributes:
    - `#[tyrx(lifetime = 'lt)]`: changes the lifetime parameter of the trait
      from the default, fresh lifetime to the specified parameter. The specified
      lifetime must already exist as a parameter of the type, as it will not be
      added to the generic parameter declaration list of the generated `impl`.
      
* The [`ErasedLifetime`] trait is a technical necessity, arising out of storing
  compiled regular expressions in a global cache. For a detailed explanation, see
  the [relevant section below](#the-regex-cache-and-erased-lifetimes).

## Caveats

- Due to the way capture groups are named, a given type can't be nested in an
  outer type more than once, since that would lead to duplicate capture group
  names.
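
To see why, here is a minimal sketch using only the standard library. It assumes
capture groups are named `Type.field`, as in the explicit patterns above
(e.g. `Inner.text_content`). The `group_names`, `first_duplicate`, and
`pair_group_names` helpers, as well as the `Pair` type mentioned in the comments,
are hypothetical and not part of the crate.

```rust
use std::collections::HashSet;

/// Hypothetical helper: builds `Type.field` capture group names,
/// mirroring the naming used in the explicit patterns above.
fn group_names(type_name: &str, fields: &[&str]) -> Vec<String> {
    fields
        .iter()
        .map(|field| format!("{type_name}.{field}"))
        .collect()
}

/// Returns the first name that occurs more than once, if any.
fn first_duplicate(names: &[String]) -> Option<&str> {
    let mut seen = HashSet::new();
    names
        .iter()
        .map(String::as_str)
        .find(|name| !seen.insert(*name))
}

/// A hypothetical `Pair { left: Inner, right: Inner }` would contribute
/// the capture groups of `Inner` twice, once per field.
fn pair_group_names() -> Vec<String> {
    let inner = group_names("Inner", &["separator", "text_content"]);
    let mut all = inner.clone(); // groups contributed by `left: Inner`
    all.extend(inner); // the same groups again, from `right: Inner`
    all
}
```

Here `first_duplicate(&pair_group_names())` returns `Some("Inner.separator")`:
the group name does not encode the path to the field, so the second occurrence
of `Inner` collides with the first.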

## Advanced Concepts

### Enums

Enums are represented as a choice between their variants. Choices are ordered: the
variants are tried one after another, in sequence. This matters when some patterns
overlap (i.e., they can match the same haystack).

Variants are treated identically to structs, with one exception: unit variants,
unlike unit structs, match their own literal name. For example:

```rust
use tyrx::{TyRx, RegexPattern, FromMatch, ErasedLifetime};

#[derive(Clone, PartialEq, Debug, RegexPattern, FromMatch, ErasedLifetime)]
enum MyChoice {
    /// Struct variants
    Ratio {
        numerator: f64,
        slash: tyrx::builder::Char<'/'>,
        denominator: f64,
    },
    /// Unit variants match themselves, except when renamed
    #[tyrx(rename = literal_one)]
    LiteralOne,
    /// Raw identifiers work correctly, too
    r#LiteralTwo,
    /// Tuple variants
    Identifier(
        #[tyrx(pattern = "(?<MyChoice.Identifier.foo>[a-zA-Z_][a-zA-Z0-9_]*)", rename = r#foo)]
        String,
    ),
}

fn main() -> tyrx::Result<()> {
    let haystack = "42/-13.37 +8./1.0 arbitrary literal_one -69/42 Some LiteralTwo OTHER";
    let enum_matches: Vec<_> = MyChoice::iter_from_str(haystack).collect::<tyrx::Result<_>>()?;

    assert_eq!(enum_matches, [
        MyChoice::Ratio {
            numerator: 42.0, 
            slash: Default::default(), 
            denominator: -13.37,
        },
        MyChoice::Ratio {
            numerator: 8.0, 
            slash: Default::default(), 
            denominator: 1.0,
        },
        MyChoice::Identifier("arbitrary".into()),
        MyChoice::LiteralOne,
        MyChoice::Ratio {
            numerator: -69.0, 
            slash: Default::default(), 
            denominator: 42.0,
        },
        MyChoice::Identifier("Some".into()),
        MyChoice::LiteralTwo,
        MyChoice::Identifier("OTHER".into()),
    ]);

    Ok(())
}
```
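
In the example above, `literal_one` is parsed as `MyChoice::LiteralOne` even though
the `Identifier` pattern would match it too, which is consistent with ordered choice.
The effect of ordering can be sketched without any regex machinery. The `first_match`
helper below is hypothetical and only compares literal prefixes; it is not how the
crate matches variants.

```rust
/// Tries each alternative in order and returns the first one that
/// matches at the start of the haystack (ordered choice).
fn first_match<'a>(alternatives: &[&'a str], haystack: &str) -> Option<&'a str> {
    alternatives
        .iter()
        .copied()
        .find(|alt| haystack.starts_with(*alt))
}
```

With this, `first_match(&["for", "foreach"], "foreach x")` yields `Some("for")`,
while swapping the two alternatives yields `Some("foreach")`. By analogy, when
enum variants overlap, the more specific variant probably needs to come first.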

### Borrowing from the input string

Borrowed string-like types (including `&str`, `Cow<'_, str>`, etc.) can also be
deserialized from the haystack without copying or allocation. The following example
demonstrates this:

```rust
use std::borrow::Cow;
use tyrx::{TyRx, RegexPattern, FromMatch, ErasedLifetime};

#[derive(Clone, PartialEq, Debug, RegexPattern, FromMatch, ErasedLifetime)]
struct Borrowing<'a> {
    #[tyrx(pattern = r"(?<Borrowing.first>[0-9]+)\s+")]
    first: &'a str,
    #[tyrx(pattern = r"(?<Borrowing.last>[a-zA-Z]+)")]
    last: Cow<'a, str>,
}

fn main() -> tyrx::Result<()> {
    // make this a local instead of a &'static str
    let haystack = String::from("123 abc 99 defghi 9876543 foobar");
    let borrowed_matches: Vec<_> = Borrowing::iter_from_str(&haystack).collect::<tyrx::Result<_>>()?;

    assert_eq!(borrowed_matches, [
        Borrowing { first: "123", last: Cow::Borrowed("abc") },
        Borrowing { first: "99", last: Cow::Borrowed("defghi") },
        Borrowing { first: "9876543", last: Cow::Borrowed("foobar") },
    ]);

    Ok(())
}
```

This example also demonstrates that the automatically-added bounds should usually
suffice. However, if you need precise control over the lifetime argument of the
[`FromMatch`] impl, then you can use the `#[tyrx(lifetime = 'a)]` annotation with
the `#[derive]` macro.

### The Regex Cache and Erased Lifetimes

To avoid recompiling the regex each time a type is parsed, the crate maintains a
global cache of compiled regular expressions. The cache is keyed by the `TypeId`
of each type.

`TypeId`, however, is only available for `'static` types. On its own, this would
preclude non-`'static` types from being used with the library, which would be a
pretty big loss, as borrowing from the matched string (as opposed to cloning its
substrings) is an important performance optimization. To solve this problem, the
[`ErasedLifetime`] trait is defined with the sole purpose of providing the
[`ErasedLifetime::Erased`] associated type. When automatically derived, this
associated type is set to the `Self` type with all lifetime parameters (if any)
replaced by the `'static` lifetime. `TypeId` can then be applied to the
lifetime-erased type, which allows borrowed types to work with the library as well.

Compiling and caching a regular expression can be performed explicitly by calling
the [`build_regex()`] function.
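
The idea can be sketched with the standard library alone. Everything below is an
illustration under assumptions: the trait is a simplified stand-in for
[`ErasedLifetime`], `cache_key` and `cached_pattern` are hypothetical helpers, and
a plain `String` stands in for the compiled regex. The crate's actual internals
may well differ.

```rust
use std::any::TypeId;
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Simplified stand-in for the crate's trait: maps a possibly-borrowing
/// type to a `'static` twin that `TypeId` can work with.
trait ErasedLifetime {
    type Erased: 'static;
}

#[allow(dead_code)]
struct Borrowing<'a> {
    first: &'a str,
}

// Conceptually what a derive would generate: every lifetime becomes `'static`.
impl<'a> ErasedLifetime for Borrowing<'a> {
    type Erased = Borrowing<'static>;
}

/// Cache key that is available even when `T` itself is not `'static`.
fn cache_key<T: ErasedLifetime>() -> TypeId {
    TypeId::of::<T::Erased>()
}

/// Global cache; a `String` stands in for the compiled regex here.
static CACHE: OnceLock<Mutex<HashMap<TypeId, String>>> = OnceLock::new();

/// Returns the cached entry for `T`, running `compile` only on a cache miss.
fn cached_pattern<T: ErasedLifetime>(compile: impl FnOnce() -> String) -> String {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(cache_key::<T>()).or_insert_with(compile).clone()
}
```

Calling `cached_pattern::<Borrowing<'_>>(...)` twice runs the closure only the
first time, and both calls share one cache slot regardless of the concrete
lifetime, because the key is always the `TypeId` of `Borrowing<'static>`.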

### Collecting Span Information

The [`Spanned`] type allows one to preserve the byte range of each match.
This is a transparent newtype wrapper which simply forwards its [`RegexPattern`]
and [`FromMatch`] impls to the underlying type, while storing the byte span of
the specific match it came from.
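
As a rough illustration of the concept (not of the crate's API), the sketch below
pairs a value with the byte range it was found at. `SpannedSketch` and
`spanned_words` are hypothetical names, the field layout of the real [`Spanned`]
type may differ, and whitespace splitting stands in for regex matching.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the idea behind `Spanned`: a parsed value
/// plus the byte range of the match it was parsed from.
#[derive(Debug, PartialEq)]
struct SpannedSketch<T> {
    value: T,
    span: Range<usize>,
}

/// Splits on whitespace and records where each word sits in the haystack.
fn spanned_words(haystack: &str) -> Vec<SpannedSketch<&str>> {
    haystack
        .split_whitespace()
        .map(|word| {
            // `word` is a subslice of `haystack`, so the difference of the
            // two start addresses is its byte offset.
            let start = word.as_ptr() as usize - haystack.as_ptr() as usize;
            SpannedSketch {
                value: word,
                span: start..start + word.len(),
            }
        })
        .collect()
}
```

Indexing the original haystack with a stored `span` gives back the matched text,
which is what makes spans useful for error reporting or highlighting.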

### Best-effort checking of regex pattern literals for capturing groups

When using `#[tyrx(pattern = "...")]`, the derive macro makes a best-effort attempt
at ensuring that the specified pattern contains the corresponding, appropriately
named capture group. However, this only works when the pattern expression is a
literal or a sufficiently simple expression (e.g., a block, a parenthesized group,
a typecast expression, a reference or dereference) that can be naively determined
to be a literal. If the expression contains more complex subexpressions, then the
macro gives up and lets the code compile, even if the required capture group is
missing.
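
The kind of check involved can be approximated at runtime as follows. This is
only a naive sketch and not the macro's implementation: `has_named_group` is a
hypothetical helper, and it assumes that named groups are written as either
`(?<name>...)` or `(?P<name>...)`. It does not account for escaped parentheses
or character classes.

```rust
/// Hypothetical check: does a pattern literal contain a capture group
/// with the expected name? Accepts `(?<name>...)` and `(?P<name>...)`.
fn has_named_group(pattern: &str, name: &str) -> bool {
    pattern.contains(&format!("(?<{name}>")) || pattern.contains(&format!("(?P<{name}>"))
}
```

For the `prefix` field of `Outer` above, the check would look for a group named
`Outer.prefix` and reject a pattern such as `r"([[:alnum:]]+)"`, which has no
named group at all.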

### Harnessing [`FromStr`](std::str::FromStr) `impl`s

Many types have an implementation of the standard [`FromStr`](std::str::FromStr)
trait as a way of naturally parsing a value from a string. If you have such a
type, you can automatically adapt it to have [`RegexPattern`] and [`FromMatch`]
impls by wrapping it in a [`MatchFromStr`].
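
For reference, this is the kind of type that qualifies: a small, hypothetical
`Rgb` color with a hand-written `FromStr` impl, using only the standard library.
The wrapping step itself is not shown, since the exact usage of [`MatchFromStr`]
(for instance, which pattern it matches) is not described here.

```rust
use std::str::FromStr;

/// Hypothetical example type: an RGB color written as `#rrggbb`.
#[derive(Debug, PartialEq)]
struct Rgb {
    r: u8,
    g: u8,
    b: u8,
}

impl FromStr for Rgb {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let hex = s.strip_prefix('#').ok_or("missing '#' prefix")?;

        // Exactly six ASCII hex digits, so the byte slicing below is safe.
        if hex.len() != 6 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err(format!("expected 6 hex digits, got {hex:?}"));
        }

        // Parses the two-digit channel starting at byte offset `i`.
        let channel =
            |i: usize| u8::from_str_radix(&hex[i..i + 2], 16).map_err(|e| e.to_string());

        Ok(Rgb {
            r: channel(0)?,
            g: channel(2)?,
            b: channel(4)?,
        })
    }
}
```

With this impl in place, `"#ff8000".parse::<Rgb>()` succeeds, and the same parsing
logic is what a `FromStr`-based adapter would reuse.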

### Ignoring matched substrings

Matched substrings can be ignored in two ways:

* manually: by specifying an explicit pattern and only wrapping part of it in
  a named capture group. TODO(H2CO3): describe this in detail.
* automatically: by wrapping a matching type into a [`builder::Ignore`].
  TODO(H2CO3): describe this in detail.

### Regex Builder Types

The [`builder`] module contains helper types for composing regexes in
frequently-used ways. For example:

* [`crate::builder::Char`]
* [`crate::builder::CharRange`]
* [`crate::builder::CharClass`]
* [`crate::builder::Repeat`]
* [`crate::builder::Alternation`]
* [`crate::builder::Ignore`]

TODO(H2CO3): describe each of these in detail.