
/*
 * Copyright © 2019-today Peter M. Stahl pemistahl@gmail.com
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either expressed or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

//! ## 1. What does this tool do?
//!
//! *grex* is a library as well as a command-line utility that is meant to simplify the often
//! complicated and tedious task of creating regular expressions. It does so by automatically
//! generating a single regular expression from user-provided test cases. The resulting
//! expression is guaranteed to match the test cases which it was generated from.
//!
//! This project started as a Rust port of the JavaScript tool
//! [*regexgen*](https://github.com/devongovett/regexgen) written by
//! [Devon Govett](https://github.com/devongovett). Although many more useful features
//! could be added to it, its development apparently ceased several years ago. The plan
//! is now to add these new features to *grex*, as Rust really shines when it comes to
//! command-line tools. *grex* offers all features that *regexgen* provides, and more.
//!
//! The philosophy of this project is to generate, by default, the most specific regular
//! expression possible, one that matches exactly the given input and nothing else.
//! With the use of command-line flags (in the CLI tool) or preprocessing methods
//! (in the library), more generalized expressions can be created.
//!
//! The produced expressions are [Perl-compatible regular expressions](https://www.pcre.org)
//! which are also compatible with the regular expression parser in Rust's
//! [*regex crate*](https://crates.io/crates/regex).
//! Other regular expression parsers or respective libraries from other programming languages
//! have not been tested so far, but they ought to be mostly compatible as well.
//!
//! ## 2. Do I still need to learn to write regexes then?
//!
//! **Definitely, yes!** Using the standard settings, *grex* produces a regular expression that
//! is guaranteed to match only the test cases given as input and nothing else. This has been
//! verified by [property tests](https://github.com/pemistahl/grex/blob/main/tests/property_tests.rs).
//! However, if the conversion to shorthand character classes such as `\w` is enabled, the
//! resulting regex matches a much wider range of strings than the given test cases. Understanding
//! the consequences of this conversion is essential for finding a correct regular expression for
//! your business domain.
//!
//! *grex* uses an algorithm that tries to find the shortest possible regex for the given test cases.
//! Very often though, the resulting expression is still longer or more complex than it needs to be.
//! In such cases, a more compact or elegant regex can be created only by hand.
//! Also, every regular expression engine has different built-in optimizations.
//! *grex* does not know anything about those and therefore cannot optimize its regexes
//! for a specific engine.
//!
//! **So, please learn how to write regular expressions!** Currently, the best use case for *grex*
//! is to find an initial correct regex, which should then be inspected by hand to see whether
//! further optimizations are possible.
//!
//! ## 3. Current features
//!
//! - literals
//! - character classes
//! - detection of common prefixes and suffixes
//! - detection of repeated substrings and conversion to `{min,max}` quantifier notation
//! - alternation using `|` operator
//! - optionality using `?` quantifier
//! - escaping of non-ASCII characters, with optional conversion of astral code points to surrogate pairs
//! - case-sensitive or case-insensitive matching
//! - capturing or non-capturing groups
//! - optional anchors `^` and `$`
//! - fully compliant with [Unicode Standard 15.0](https://unicode.org/versions/Unicode15.0.0)
//! - fully compatible with [*regex* crate 1.9.0+](https://crates.io/crates/regex)
//! - correctly handles graphemes consisting of multiple Unicode symbols
//! - reads input strings from the command-line or from a file
//! - produces more readable expressions indented on multiple lines using optional verbose mode
//!
//! ## 4. How to use?
//!
//! The code snippets below show how to use the public API.
//!
//! For [more detailed examples](https://github.com/pemistahl/grex/tree/main#53-examples), please
//! take a look at the project's readme file on GitHub.
//!
//! ### 4.1 Default settings
//!
//! Test cases are passed either from a collection via [`RegExpBuilder::from()`]
//! or from a file via [`RegExpBuilder::from_file()`].
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["a", "aa", "aaa"]).build();
//! assert_eq!(regexp, "^a(?:aa?)?$");
//! ```
//!
//! ### 4.2 Convert to character classes
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["a", "aa", "123"])
//!     .with_conversion_of_digits()
//!     .with_conversion_of_words()
//!     .build();
//! assert_eq!(regexp, "^(?:\\d\\d\\d|\\w(?:\\w)?)$");
//! ```
//!
//! ### 4.3 Convert repeated substrings
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
//!     .with_conversion_of_repetitions()
//!     .build();
//! assert_eq!(regexp, "^(?:a{2}|(?:bc){2}|(?:def){3})$");
//! ```
//!
//! By default, *grex* converts every substring that is at least one character long
//! and that is repeated at least once in direct succession. You can customize these
//! two parameters if you like.
//!
//! In the following example, the test case `aa` is not converted to `a{2}` because the repeated
//! substring `a` has a length of 1, but the minimum substring length has been set to 2.
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
//!     .with_conversion_of_repetitions()
//!     .with_minimum_substring_length(2)
//!     .build();
//! assert_eq!(regexp, "^(?:aa|(?:bc){2}|(?:def){3})$");
//! ```
//!
//! In the next example, the minimum number of repetitions is set to 2. Only the test case
//! `defdefdef` is converted because it is the only one whose substring is repeated twice.
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
//!     .with_conversion_of_repetitions()
//!     .with_minimum_repetitions(2)
//!     .build();
//! assert_eq!(regexp, "^(?:bcbc|aa|(?:def){3})$");
//! ```
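//!
//! The effect of the two parameters can be illustrated with a minimal sketch that uses the
//! standard library only. It is not how *grex* is implemented: the helper names
//! `repeated_unit` and `to_quantified` are hypothetical, only strings that consist entirely
//! of one repeated unit are handled, and regex metacharacters are not escaped.
//!
//! ```rust
//! /// Returns the shortest unit whose repetition yields `input`, together with
//! /// the number of occurrences, or `None` if `input` is not a pure repetition.
//! fn repeated_unit(input: &str) -> Option<(&str, usize)> {
//!     let len = input.len();
//!     for unit_len in 1..=len / 2 {
//!         // Skip lengths that split a UTF-8 character or do not divide the input evenly.
//!         if !input.is_char_boundary(unit_len) || len % unit_len != 0 {
//!             continue;
//!         }
//!         let unit = &input[..unit_len];
//!         let count = len / unit_len;
//!         if input == unit.repeat(count) {
//!             return Some((unit, count));
//!         }
//!     }
//!     None
//! }
//!
//! /// Rewrites `input` with `{n}` notation if its repeated unit is long enough
//! /// and is repeated often enough (occurrences minus one); otherwise returns
//! /// `input` unchanged.
//! fn to_quantified(input: &str, min_substring_length: usize, min_repetitions: usize) -> String {
//!     match repeated_unit(input) {
//!         Some((unit, count))
//!             if unit.chars().count() >= min_substring_length
//!                 && count - 1 >= min_repetitions =>
//!         {
//!             if unit.chars().count() == 1 {
//!                 format!("{}{{{}}}", unit, count)
//!             } else {
//!                 format!("(?:{}){{{}}}", unit, count)
//!             }
//!         }
//!         _ => input.to_string(),
//!     }
//! }
//! ```
//!
//! Under these assumptions, `to_quantified("bcbc", 1, 1)` yields `(?:bc){2}`, whereas
//! `to_quantified("bcbc", 1, 2)` leaves the input as `bcbc` because the unit `bc` is
//! repeated only once.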
//!
//! ### 4.4 Escape non-ASCII characters
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["You smell like 💩."])
//!     .with_escaping_of_non_ascii_chars(false)
//!     .build();
//! assert_eq!(regexp, "^You smell like \\u{1f4a9}\\.$");
//! ```
//!
//! Old versions of JavaScript do not support Unicode escape sequences for
//! the astral code planes (range `U+010000` to `U+10FFFF`). To support
//! these symbols in JavaScript regular expressions, they have to be converted
//! to surrogate pairs. More information on that matter can be
//! found [here](https://mathiasbynens.be/notes/javascript-unicode).
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["You smell like 💩."])
//!     .with_escaping_of_non_ascii_chars(true)
//!     .build();
//! assert_eq!(regexp, "^You smell like \\u{d83d}\\u{dca9}\\.$");
//! ```
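//!
//! The surrogate pair in the assertion above can be reproduced with the standard library
//! alone. The following is a minimal sketch, not the code *grex* uses internally; the
//! helper name `escape_as_utf16_units` is hypothetical.
//!
//! ```rust
//! /// Formats every UTF-16 code unit of `c` as a `\u{...}` escape sequence.
//! /// Characters outside the Basic Multilingual Plane produce two escapes,
//! /// namely the high and the low surrogate.
//! fn escape_as_utf16_units(c: char) -> String {
//!     let mut buffer = [0u16; 2];
//!     c.encode_utf16(&mut buffer)
//!         .iter()
//!         .map(|unit| format!("\\u{{{:x}}}", unit))
//!         .collect()
//! }
//! ```
//!
//! For `'💩'` (U+1F4A9), the function returns `\u{d83d}\u{dca9}`, whereas a character
//! from the Basic Multilingual Plane such as `'é'` yields the single escape `\u{e9}`.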
//!
//! ### 4.5 Case-insensitive matching
//!
//! The regular expressions that *grex* generates are case-sensitive by default.
//! Case-insensitive matching can be enabled like so:
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["big", "BIGGER"])
//!     .with_case_insensitive_matching()
//!     .build();
//! assert_eq!(regexp, "(?i)^big(?:ger)?$");
//! ```
//!
//! ### 4.6 Capturing groups
//!
//! Non-capturing groups are used by default.
//! Extending the previous example, you can switch to capturing groups instead.
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["big", "BIGGER"])
//!     .with_case_insensitive_matching()
//!     .with_capturing_groups()
//!     .build();
//! assert_eq!(regexp, "(?i)^big(ger)?$");
//! ```
//!
//! ### 4.7 Verbose mode
//!
//! If you find the generated regular expression hard to read, you can enable verbose mode.
//! The expression is then put on multiple lines and indented to make it easier on the eyes.
//!
//! ```
//! use grex::RegExpBuilder;
//! use indoc::indoc;
//!
//! let regexp = RegExpBuilder::from(&["a", "b", "bcd"])
//!     .with_verbose_mode()
//!     .build();
//!
//! assert_eq!(regexp, indoc!(
//!     r#"
//!     (?x)
//!     ^
//!       (?:
//!         b
//!         (?:
//!           cd
//!         )?
//!         |
//!         a
//!       )
//!     $"#
//! ));
//! ```
//!
//! ### 4.8 Disable anchors
//!
//! By default, the anchors `^` and `$` are put around every generated regular expression in order
//! to ensure that it matches only the test cases given as input. Often, however, the generated
//! pattern is meant to be used as part of a larger one. For this purpose, the anchors
//! can be disabled, either individually or both at once.
//!
//! ```
//! use grex::RegExpBuilder;
//!
//! let regexp = RegExpBuilder::from(&["a", "aa", "aaa"])
//!     .without_anchors()
//!     .build();
//! assert_eq!(regexp, "a(?:aa?)?");
//! ```
//!
//! ## 5. How does it work?
//!
//! 1. A [deterministic finite automaton](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) (DFA)
//!    is created from the input strings.
//!
//! 2. The number of states and transitions between states in the DFA is reduced by applying
//!    [Hopcroft's DFA minimization algorithm](https://en.wikipedia.org/wiki/DFA_minimization#Hopcroft.27s_algorithm).
//!
//! 3. The minimized DFA is expressed as a system of linear equations which are solved with
//!    [Brzozowski's algebraic method](http://cs.stackexchange.com/questions/2016/how-to-convert-finite-automata-to-regular-expressions#2392),
//!    resulting in the final regular expression.
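//!
//! Step 1 can be pictured with a small, self-contained sketch that builds a trie-shaped DFA
//! from the input strings. This is an illustration only and does not mirror the crate's
//! internal implementation: the type `TrieDfa` and its methods are hypothetical, and the
//! sketch operates on `char`s rather than on graphemes.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// A trie-shaped DFA. State 0 is the start state.
//! struct TrieDfa {
//!     /// Outgoing transitions per state, keyed by input character.
//!     transitions: Vec<HashMap<char, usize>>,
//!     /// Whether the state with the same index is accepting.
//!     accepting: Vec<bool>,
//! }
//!
//! impl TrieDfa {
//!     /// Builds a DFA that accepts exactly the given strings.
//!     fn from_strings(inputs: &[&str]) -> Self {
//!         let mut dfa = TrieDfa {
//!             transitions: vec![HashMap::new()],
//!             accepting: vec![false],
//!         };
//!         for input in inputs {
//!             let mut state = 0;
//!             for c in input.chars() {
//!                 let existing = dfa.transitions[state].get(&c).copied();
//!                 state = match existing {
//!                     Some(next) => next,
//!                     None => {
//!                         // No transition yet: allocate a fresh state and link it.
//!                         let next = dfa.transitions.len();
//!                         dfa.transitions.push(HashMap::new());
//!                         dfa.accepting.push(false);
//!                         dfa.transitions[state].insert(c, next);
//!                         next
//!                     }
//!                 };
//!             }
//!             // The state reached after the last character accepts the input.
//!             dfa.accepting[state] = true;
//!         }
//!         dfa
//!     }
//!
//!     /// Returns `true` if the DFA ends in an accepting state after reading `input`.
//!     fn accepts(&self, input: &str) -> bool {
//!         let mut state = 0;
//!         for c in input.chars() {
//!             match self.transitions[state].get(&c) {
//!                 Some(&next) => state = next,
//!                 None => return false,
//!             }
//!         }
//!         self.accepting[state]
//!     }
//! }
//! ```
//!
//! For the inputs `a`, `aa` and `aaa`, this construction creates four states (the start state
//! plus one per character on the longest path) and accepts exactly those three strings.
//! Inputs with common suffixes, such as `ab` and `cb`, leave equivalent states in such a trie,
//! which is the kind of redundancy that the minimization in step 2 removes.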

#[macro_use]
mod macros;

mod builder;
mod char_range;
mod cluster;
mod component;
mod config;
mod dfa;
mod expression;
mod format;
mod grapheme;
mod quantifier;
mod regexp;
mod substring;
mod unicode_tables;

#[cfg(feature = "python")]
mod python;

#[cfg(target_family = "wasm")]
mod wasm;

pub use builder::RegExpBuilder;

#[cfg(target_family = "wasm")]
pub use wasm::RegExpBuilder as WasmRegExpBuilder;