//! Autolink occurs in the [text][] content type.
//!
//! ## Grammar
//!
//! Autolink forms with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! autolink ::= '<' (url | email) '>'
//!
//! url ::= protocol *url_byte
//! protocol ::= ascii_alphabetic 0*31(protocol_byte) ':'
//! protocol_byte ::= '+' | '-' | '.' | ascii_alphanumeric
//! url_byte ::= byte - ascii_control - ' '
//!
//! email ::= 1*ascii_atext '@' email_domain *('.' email_domain)
//! ; Restriction: up to (including) 63 characters are allowed in each domain.
//! email_domain ::= ascii_alphanumeric *(ascii_alphanumeric | '-' ascii_alphanumeric)
//!
//! ascii_atext ::= ascii_alphanumeric | '!' | '"' | '#' | '$' | '%' | '&' | '\'' | '*' | '+' | '-' | '/' | '=' | '?' | '^' | '_' | '`' | '{' | '|' | '}' | '~'
//! ```
//!
//! The maximum allowed size of a scheme is `31` (inclusive), which is defined
//! in [`AUTOLINK_SCHEME_SIZE_MAX`][].
//! The maximum allowed size of a domain is `63` (inclusive), which is defined
//! in [`AUTOLINK_DOMAIN_SIZE_MAX`][].
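/*!

The URL branch of the grammar can be sketched as a standalone check over the
text between `<` and `>`.
This is a minimal sketch under stated assumptions, not this module's
implementation: `is_autolink_url` and its `scheme_max` parameter (standing in
for `AUTOLINK_SCHEME_SIZE_MAX`) are hypothetical names.
It follows the state functions in this file, which need at least two scheme
bytes before the `:` and reject `<` as well as ASCII control characters and
spaces.

```rust
// Hypothetical helper: `inner` is the text between `<` and `>`.
fn is_autolink_url(inner: &str, scheme_max: usize) -> bool {
    let bytes = inner.as_bytes();
    // The scheme runs up to the first `:`.
    let colon = match bytes.iter().position(|&b| b == b':') {
        Some(index) => index,
        None => return false,
    };
    let (scheme, rest) = (&bytes[..colon], &bytes[colon + 1..]);
    // The scheme needs at least two bytes and at most `scheme_max`.
    if scheme.len() < 2 || scheme.len() > scheme_max {
        return false;
    }
    // First byte: ASCII alphabetic. Later bytes: alphanumeric, `+`, `-`, `.`.
    let scheme_ok = scheme[0].is_ascii_alphabetic()
        && scheme[1..]
            .iter()
            .all(|&b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'-' | b'.'));
    // After the colon: no ASCII control, space, `<`, or `>`.
    let rest_ok = rest
        .iter()
        .all(|&b| !b.is_ascii_control() && !matches!(b, b' ' | b'<' | b'>'));
    scheme_ok && rest_ok
}
```

Passing the limit as a parameter keeps the sketch independent of the actual
value of the constant.

*/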
//!
//! The grammar for autolinks is quite strict and prohibits the use of ASCII control
//! characters or spaces.
//! To use non-ASCII characters and otherwise impossible characters in URLs,
//! you can use percent encoding:
//!
//! ```markdown
//! <https://example.com/alpha%20bravo>
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com/alpha%20bravo">https://example.com/alpha%20bravo</a></p>
//! ```
//!
//! There are several cases where incorrect encoding of URLs would, in other
//! languages, result in a parse error.
//! In markdown, there are no errors, and URLs are normalized.
//! In addition, many characters are percent encoded
//! ([`sanitize_uri`][sanitize_uri]).
//! For example:
//!
//! ```markdown
//! <https://a👍b%>
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://a%F0%9F%91%8Db%25">https://a👍b%</a></p>
//! ```
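/*!

The normalization shown above can be approximated with a short function.
This is a minimal sketch, not the crate's `sanitize_uri`:
`percent_encode_sketch` is a hypothetical name, and the set of bytes it leaves
untouched (the unreserved and reserved characters of RFC 3986) is an
assumption that may differ from what `sanitize_uri` keeps.

```rust
// Hypothetical helper, not the crate's `sanitize_uri`.
fn percent_encode_sketch(url: &str) -> String {
    let bytes = url.as_bytes();
    let mut out = String::with_capacity(bytes.len());
    for (i, &b) in bytes.iter().enumerate() {
        // A well-formed escape (`%` plus two hex digits) is kept as is.
        let is_escape = b == b'%'
            && i + 2 < bytes.len()
            && bytes[i + 1].is_ascii_hexdigit()
            && bytes[i + 2].is_ascii_hexdigit();
        // Assumed safe set: ASCII alphanumerics and common URL punctuation.
        let is_safe = b.is_ascii_alphanumeric() || b"-._~:/?#[]@!$&'()*+,;=".contains(&b);
        if is_escape || is_safe {
            out.push(b as char);
        } else {
            // Everything else, including each byte of a non-ASCII
            // character and a stray `%`, becomes `%XX`.
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}
```

Under these assumptions, the two examples above come out as documented:
`alpha%20bravo` is left alone, while `👍` and the trailing `%` are encoded.

*/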
//!
//! Interestingly, there are a couple of things that are valid autolinks in
//! markdown but in HTML would be valid tags, such as `<svg:rect>` and
//! `<xml:lang/>`.
//! However, because `CommonMark` employs a naïve HTML parsing algorithm, those
//! are not considered HTML.
//!
//! While `CommonMark` restricts links from occurring in other links in the
//! case of labels (see [label end][label_end]), this restriction is not in
//! place for autolinks inside labels:
//!
//! ```markdown
//! [<https://example.com>](#)
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="#"><a href="https://example.com">https://example.com</a></a></p>
//! ```
//!
//! The generated output, in this case, is invalid according to HTML.
//! When a browser sees that markup, it will instead parse it as:
//!
//! ```html
//! <p><a href="#"></a><a href="https://example.com">https://example.com</a></p>
//! ```
//!
//! ## HTML
//!
//! Autolinks relate to the `<a>` element in HTML.
//! See [*§ 4.5.1 The `a` element*][html_a] in the HTML spec for more info.
//! When an email autolink is used (so, without a protocol), the string
//! `mailto:` is prepended before the email, when generating the `href`
//! attribute of the hyperlink.
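/*!

How the `href` value is chosen can be sketched as follows.
`autolink_href` and its `is_email` flag are hypothetical names used only for
illustration, and the percent encoding described earlier is left out.

```rust
// Hypothetical helper: `value` is the text between `<` and `>`, and
// `is_email` says which branch of the grammar matched.
fn autolink_href(value: &str, is_email: bool) -> String {
    if is_email {
        // Email autolinks have no protocol, so `mailto:` is prepended.
        format!("mailto:{}", value)
    } else {
        // URL autolinks already start with their own protocol.
        value.to_string()
    }
}
```

The visible link text stays the same in both cases; only the `href` differs.

*/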
//!
//! ## Recommendation
//!
//! It is recommended to use labels ([label start link][label_start_link],
//! [label end][label_end]), either with a resource or a definition
//! ([definition][]), instead of autolinks, as those allow more characters in
//! URLs, and allow relative URLs and `www.` URLs.
//! They also allow for descriptive text to explain the URL in prose.
//!
//! ## Tokens
//!
//! * [`Autolink`][Name::Autolink]
//! * [`AutolinkEmail`][Name::AutolinkEmail]
//! * [`AutolinkMarker`][Name::AutolinkMarker]
//! * [`AutolinkProtocol`][Name::AutolinkProtocol]
//!
//! ## References
//!
//! * [`autolink.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/autolink.js)
//! * [*§ 6.4 Autolinks* in `CommonMark`](https://spec.commonmark.org/0.31/#autolinks)
//!
//! [text]: crate::construct::text
//! [definition]: crate::construct::definition
//! [label_start_link]: crate::construct::label_start_link
//! [label_end]: crate::construct::label_end
//! [autolink_scheme_size_max]: crate::util::constant::AUTOLINK_SCHEME_SIZE_MAX
//! [autolink_domain_size_max]: crate::util::constant::AUTOLINK_DOMAIN_SIZE_MAX
//! [sanitize_uri]: crate::util::sanitize_uri
//! [html_a]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-a-element

use crate::event::Name;
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::constant::{AUTOLINK_DOMAIN_SIZE_MAX, AUTOLINK_SCHEME_SIZE_MAX};

/// Start of an autolink.
///
/// ```markdown
/// > | a<https://example.com>b
///      ^
/// > | a<user@example.com>b
///      ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.parse_state.options.constructs.autolink && tokenizer.current == Some(b'<') {
        tokenizer.enter(Name::Autolink);
        tokenizer.enter(Name::AutolinkMarker);
        tokenizer.consume();
        tokenizer.exit(Name::AutolinkMarker);
        tokenizer.enter(Name::AutolinkProtocol);
        State::Next(StateName::AutolinkOpen)
    } else {
        State::Nok
    }
}

/// After `<`, at protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///       ^
/// > | a<user@example.com>b
///       ^
/// ```
pub fn open(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphabetic.
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::AutolinkSchemeOrEmailAtext)
        }
        Some(b'@') => State::Nok,
        _ => State::Retry(StateName::AutolinkEmailAtext),
    }
}

/// At second byte of protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///        ^
/// > | a<user@example.com>b
///        ^
/// ```
pub fn scheme_or_email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric and `+`, `-`, and `.`.
        Some(b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            // Count the previous alphabetical from `open` too.
            tokenizer.tokenize_state.size = 1;
            State::Retry(StateName::AutolinkSchemeInsideOrEmailAtext)
        }
        _ => State::Retry(StateName::AutolinkEmailAtext),
    }
}

/// In ambiguous protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///        ^
/// > | a<user@example.com>b
///        ^
/// ```
pub fn scheme_inside_or_email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b':') => {
            tokenizer.consume();
            tokenizer.tokenize_state.size = 0;
            State::Next(StateName::AutolinkUrlInside)
        }
        // ASCII alphanumeric and `+`, `-`, and `.`.
        Some(b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.tokenize_state.size < AUTOLINK_SCHEME_SIZE_MAX =>
        {
            tokenizer.consume();
            tokenizer.tokenize_state.size += 1;
            State::Next(StateName::AutolinkSchemeInsideOrEmailAtext)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Retry(StateName::AutolinkEmailAtext)
        }
    }
}

/// After protocol, in URL.
///
/// ```markdown
/// > | a<https://example.com>b
///             ^
/// ```
pub fn url_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'>') => {
            tokenizer.exit(Name::AutolinkProtocol);
            tokenizer.enter(Name::AutolinkMarker);
            tokenizer.consume();
            tokenizer.exit(Name::AutolinkMarker);
            tokenizer.exit(Name::Autolink);
            State::Ok
        }
        // ASCII control, space, or `<`.
        None | Some(b'\0'..=0x1F | b' ' | b'<' | 0x7F) => State::Nok,
        Some(_) => {
            tokenizer.consume();
            State::Next(StateName::AutolinkUrlInside)
        }
    }
}

/// In email atext.
///
/// ```markdown
/// > | a<user.name@example.com>b
///              ^
/// ```
pub fn email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'@') => {
            tokenizer.consume();
            State::Next(StateName::AutolinkEmailAtSignOrDot)
        }
        // ASCII atext.
        //
        // atext is an ASCII alphanumeric (see [`is_ascii_alphanumeric`][]), or
        // a byte in the inclusive ranges U+0023 NUMBER SIGN (`#`) to U+0027
        // APOSTROPHE (`'`), U+002A ASTERISK (`*`), U+002B PLUS SIGN (`+`),
        // U+002D DASH (`-`), U+002F SLASH (`/`), U+003D EQUALS TO (`=`),
        // U+003F QUESTION MARK (`?`), U+005E CARET (`^`) to U+0060 GRAVE
        // ACCENT (`` ` ``), or U+007B LEFT CURLY BRACE (`{`) to U+007E TILDE
        // (`~`).
        //
        // See:
        // **\[RFC5322]**:
        // [Internet Message Format](https://tools.ietf.org/html/rfc5322).
        // P. Resnick.
        // IETF.
        //
        // [`is_ascii_alphanumeric`]: char::is_ascii_alphanumeric
        Some(
            b'#'..=b'\'' | b'*' | b'+' | b'-'..=b'9' | b'=' | b'?' | b'A'..=b'Z' | b'^'..=b'~',
        ) => {
            tokenizer.consume();
            State::Next(StateName::AutolinkEmailAtext)
        }
        _ => State::Nok,
    }
}
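/*
The byte ranges in the match arm of `email_atext` can be tried out in
isolation. This is a minimal sketch: `is_atext_byte` is a hypothetical name,
and the pattern is copied from the function above.

```rust
// Hypothetical standalone predicate with the same pattern as `email_atext`.
fn is_atext_byte(byte: u8) -> bool {
    matches!(
        byte,
        b'#'..=b'\'' | b'*' | b'+' | b'-'..=b'9' | b'=' | b'?' | b'A'..=b'Z' | b'^'..=b'~'
    )
}
```

The range from `-` to `9` covers `.`, `/`, and the digits, and the range from
`^` to `~` covers the lowercase letters, so `user.name` passes byte by byte
while `@`, `<`, `>`, and spaces do not.
*/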

/// In label, after at-sign or dot.
///
/// ```markdown
/// > | a<user.name@example.com>b
///                 ^       ^
/// ```
pub fn email_at_sign_or_dot(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric.
        Some(b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            State::Retry(StateName::AutolinkEmailValue)
        }
        _ => State::Nok,
    }
}

/// In label, where `.` and `>` are allowed.
///
/// ```markdown
/// > | a<user.name@example.com>b
///                   ^
/// ```
pub fn email_label(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'.') => {
            tokenizer.consume();
            tokenizer.tokenize_state.size = 0;
            State::Next(StateName::AutolinkEmailAtSignOrDot)
        }
        Some(b'>') => {
            let index = tokenizer.events.len();
            tokenizer.exit(Name::AutolinkProtocol);
            // Change the event name.
            tokenizer.events[index - 1].name = Name::AutolinkEmail;
            tokenizer.events[index].name = Name::AutolinkEmail;
            tokenizer.enter(Name::AutolinkMarker);
            tokenizer.consume();
            tokenizer.exit(Name::AutolinkMarker);
            tokenizer.exit(Name::Autolink);
            tokenizer.tokenize_state.size = 0;
            State::Ok
        }
        _ => State::Retry(StateName::AutolinkEmailValue),
    }
}

/// In label, where `.` and `>` are *not* allowed.
///
/// Though, this is also used in `email_label` to parse other values.
///
/// ```markdown
/// > | a<user.name@ex-ample.com>b
///                    ^
/// ```
pub fn email_value(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric or `-`.
        Some(b'-' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.tokenize_state.size < AUTOLINK_DOMAIN_SIZE_MAX =>
        {
            let name = if matches!(tokenizer.current, Some(b'-')) {
                StateName::AutolinkEmailValue
            } else {
                StateName::AutolinkEmailLabel
            };
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(name)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Nok
        }
    }
}
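/*
Taken together, `email_at_sign_or_dot`, `email_label`, and `email_value`
accept a domain made of labels separated by `.`, where each label starts and
ends with an ASCII alphanumeric, contains only alphanumerics and `-`, and
stays within the size limit. A minimal sketch of that rule as one standalone
check follows; `is_email_domain` and `LABEL_SIZE_MAX` are hypothetical names,
and the value `63` is taken from the module docs rather than from the
constant itself.

```rust
// Assumed to mirror `AUTOLINK_DOMAIN_SIZE_MAX` (documented as 63).
const LABEL_SIZE_MAX: usize = 63;

// Hypothetical helper: checks the part of an email autolink after `@`.
fn is_email_domain(domain: &str) -> bool {
    domain.split('.').all(|label| {
        let bytes = label.as_bytes();
        // Each label is non-empty and at most `LABEL_SIZE_MAX` bytes long.
        !bytes.is_empty()
            && bytes.len() <= LABEL_SIZE_MAX
            // It starts and ends with an ASCII alphanumeric.
            && bytes[0].is_ascii_alphanumeric()
            && bytes[bytes.len() - 1].is_ascii_alphanumeric()
            // In between, only ASCII alphanumerics and `-` occur.
            && bytes.iter().all(|&b| b.is_ascii_alphanumeric() || b == b'-')
    })
}
```

The state functions reach the same result one byte at a time, resetting
`tokenize_state.size` at each `.` instead of splitting the input up front.
*/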