//! Implements _character classes_. The analogue in the regex world are
//! [character classes](https://www.regular-expressions.info/charclass.html),
//! [shorthand character classes](https://www.regular-expressions.info/shorthand.html),
//! [non-printable characters](https://www.regular-expressions.info/nonprint.html),
//! [Unicode categories/scripts/blocks](https://www.regular-expressions.info/unicode.html#category),
//! [POSIX classes](https://www.regular-expressions.info/posixbrackets.html#class) and the
//! [dot](https://www.regular-expressions.info/dot.html).
//!
//! All kinds of character classes mentioned above require `[` square brackets
//! `]` in Pomsky. A character class can be negated by putting a `!` in front
//! of the opening bracket. For example, `![.]` compiles to `\n`.
//!
//! ## Items
//!
//! A character class can contain multiple _items_, which can be
//!
//! - A __code point__, e.g. `['a']` or `[U+107]`
//!
//!   - This includes [non-printable characters](https://www.regular-expressions.info/nonprint.html).\
//!     Supported are `[n]`, `[r]`, `[t]`, `[a]`, `[e]` and `[f]`.
//!
//! - A __range of code points__. For example, `[U+10 - U+200]` matches any code
//!   point P where `U+10 ≤ P ≤ U+200`
//!
//! - A __named character class__, which can be one of
//!
//!   - a [shorthand character class](https://www.regular-expressions.info/shorthand.html).\
//!     Supported are `[w]`, `[d]`, `[s]`, `[h]`, `[v]` and `[R]`.
//!
//!   - a [POSIX class](https://www.regular-expressions.info/posixbrackets.html#class).\
//!     Supported are `[ascii_alnum]`, `[ascii_alpha]`, `[ascii]`,
//!     `[ascii_blank]`, `[ascii_cntrl]`, `[ascii_digit]`, `[ascii_graph]`,
//!     `[ascii_lower]`, `[ascii_print]`, `[ascii_punct]`, `[ascii_space]`,
//!     `[ascii_upper]`, `[ascii_word]` and `[ascii_xdigit]`.\
//!     _Note_: POSIX classes are not Unicode aware!\
//!     _Note_: They're converted to ranges, e.g. `[ascii_alpha]` = `[a-zA-Z]`.
//!
//!   - a [Unicode category, script or block](https://www.regular-expressions.info/unicode.html#category).\
//!     For example: `[Letter]` compiles to `\p{Letter}`. Pomsky currently
//!     treats any uppercase identifier except `R` as a Unicode class.
//!
//! ### "Special" items
//!
//! There are also two special items:
//!
//! - `[cp]` or `[codepoint]`, matching any code point
//! - `[.]` (the [dot](https://www.regular-expressions.info/dot.html)), matching
//!   any code point except the ASCII line break (`\n`)
//!
//! A character class containing `cp` or `.` can't contain anything else. Note
//! that:
//!
//! - combining `[cp]` with anything else would be equivalent to `[cp]`
//! - combining `[.]` with anything other than `[cp]` or `[n]` would be
//! equivalent to `[.]`
//!
//! They also require special treatment when negating them (see below).
//!
//! ## Compilation
//!
//! When a character class contains only a single item (e.g. `[w]`), the
//! character class is "flattened":
//!
//! - `['a']` = `a`
//! - `[w]` = `\w`
//! - `[Letter]` = `\p{Letter}`
//! - `[.]` = `.`
//!
//! The exception is `[cp]`, which compiles to `[\S\s]`.
//!
//! When there is more than one item or a range (e.g. `['a'-'z' '!']`), a regex
//! character class is created:
//!
//! - `['a'-'z' '!']` = `[a-z!]`
//! - `[w e Punctuation]` = `[\w\e\p{Punctuation}]`
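//!
//! The flattening rule above can be sketched as follows. This is a minimal
//! illustration, not Pomsky's actual implementation: the `Item` enum and the
//! `item_regex` and `compile` functions are hypothetical names, `[cp]` and
//! `[.]` are left out, and escaping of special characters is omitted.
//!
//! ```rust
//! // Hypothetical, simplified item type (not Pomsky's `GroupItem`).
//! enum Item {
//!     Char(char),
//!     Range(char, char),
//!     // A shorthand such as `w`, `d` or `s`.
//!     Shorthand(char),
//!     // A Unicode category, script or block, e.g. "Letter".
//!     Unicode(&'static str),
//! }
//!
//! // Renders one item the way it would appear inside or outside brackets.
//! fn item_regex(item: &Item) -> String {
//!     match item {
//!         Item::Char(c) => c.to_string(),
//!         Item::Range(a, b) => format!("{}-{}", a, b),
//!         Item::Shorthand(c) => format!("\\{}", c),
//!         Item::Unicode(name) => format!("\\p{{{}}}", name),
//!     }
//! }
//!
//! // A single non-range item is flattened; anything else gets brackets.
//! fn compile(items: &[Item]) -> String {
//!     match items {
//!         [single] if !matches!(single, Item::Range(..)) => item_regex(single),
//!         _ => {
//!             let body: String = items.iter().map(item_regex).collect();
//!             format!("[{}]", body)
//!         }
//!     }
//! }
//! ```
//!
//! Under these assumptions, `compile(&[Item::Shorthand('w')])` yields `\w`,
//! while `compile(&[Item::Range('a', 'z'), Item::Char('!')])` yields `[a-z!]`.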
//!
//! ### Negation
//!
//! Negation is implemented as follows:
//!
//! - Ranges and chars such as `!['a'-'z' '!' e]` are wrapped in a negative
//! character class, e.g. `[^a-z!\e]`.
//!
//! - The `h`, `v` and `R` shorthands are also wrapped in a negative character
//! class.
//!
//! - The `w`, `d` and `s` shorthands are negated by making them uppercase
//! (`![w]` = `\W`), except when there is more than one item in the class
//! (`![w '-']` = `[^\w\-]`)
//!
//! - Special classes:
//!   - `![.]` = `\n`
//!   - `![cp]` is an error, as this would result in an empty group, which is
//!     only allowed in JavaScript; instead we could return `[^\S\s]`, but this
//!     doesn't have a use case, since it matches nothing (it always fails).
//!
//! - `w`, `s`, `d` and Unicode categories/scripts/blocks can be negated
//! individually _within a character class_, e.g. `[s !s]` = `[\s\S]`
//! (equivalent to `[cp]`), `![!Latin 'a']` = `[^\P{Latin}a]`.
//!
//! When a negated character class contains only one item, and that item is
//! itself negated, the class is removed and the two negations cancel out:
//! `![!w]` = `\w`, `![!L]` = `\p{L}`.
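//!
//! The rules for negating the `w`, `d` and `s` shorthands can be sketched as
//! follows. This is an illustration under simplifying assumptions, not
//! Pomsky's actual implementation: `Item`, `item_regex` and `compile_negated`
//! are hypothetical names, only characters and these three shorthands are
//! modeled, and escaping is omitted.
//!
//! ```rust
//! // Hypothetical, simplified item type (not Pomsky's `GroupItem`).
//! #[derive(Clone, Copy)]
//! enum Item {
//!     Char(char),
//!     // `w`, `d` or `s`; `negated` is true for `!w`, `!d`, `!s`.
//!     Shorthand { name: char, negated: bool },
//! }
//!
//! // Renders one item; `flip` applies the negation of the enclosing class.
//! fn item_regex(item: Item, flip: bool) -> String {
//!     match item {
//!         Item::Char(c) => c.to_string(),
//!         Item::Shorthand { name, negated } => {
//!             // Uppercase means negated; two negations cancel out.
//!             let letter = if negated ^ flip { name.to_ascii_uppercase() } else { name };
//!             format!("\\{}", letter)
//!         }
//!     }
//! }
//!
//! // Compiles the contents of a negated class such as `![w]`.
//! fn compile_negated(items: &[Item]) -> String {
//!     match items {
//!         // A lone shorthand is negated by switching its case.
//!         [item @ Item::Shorthand { .. }] => item_regex(*item, true),
//!         // Everything else is wrapped in a negative character class.
//!         _ => {
//!             let body: String = items.iter().map(|i| item_regex(*i, false)).collect();
//!             format!("[^{}]", body)
//!         }
//!     }
//! }
//! ```
//!
//! With these definitions, a lone `w` compiles to `\W`, a lone `!w` compiles
//! back to `\w`, and `w` together with `'a'` compiles to `[^\wa]`.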
use crate::{error::ParseErrorKind, Span};
pub use char_group::{CharGroup, GroupItem, GroupName};
pub use unicode::{Category, CodeBlock, OtherProperties, Script};
mod ascii;
pub(crate) mod char_group;
pub(crate) mod unicode;
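/// A _character class_. Refer to the [module-level documentation](self) for
/// details.
#[derive(Clone, PartialEq, Eq)]
pub struct CharClass {
    /// Whether the class is negated (written with a leading `!`).
    pub negative: bool,
    /// The contents of the class (see [`CharGroup`]).
    pub inner: CharGroup,
    /// The span of the class in the source.
    pub span: Span,
}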
impl CharClass {
    /// Creates a new, positive (non-negated) character class.
    pub fn new(inner: CharGroup, span: Span) -> Self {
        CharClass { inner, span, negative: false }
    }

    /// Makes a positive character class negative. Negating a class that is
    /// already negative is an error.
    pub(crate) fn negate(&mut self) -> Result<(), ParseErrorKind> {
        if self.negative {
            Err(ParseErrorKind::UnallowedMultiNot(2))
        } else {
            self.negative = !self.negative;
            Ok(())
        }
    }
    #[cfg(feature = "dbg")]
    pub(super) fn pretty_print(&self, buf: &mut crate::PrettyPrinter) {
        match &self.inner {
            CharGroup::Dot if self.negative => buf.push_str("[n]"),
            CharGroup::Dot => buf.push_str("."),
            CharGroup::Items(items) => {
                if self.negative {
                    buf.push_str("![");
                } else {
                    buf.push('[');
                }
                for (i, item) in items.iter().enumerate() {
                    if i > 0 {
                        buf.push(' ');
                    }
                    item.pretty_print(buf);
                }
                buf.push(']');
            }
        }
    }
}