//! Autolink occurs in the [text][] content type.
//!
//! ## Grammar
//!
//! Autolink forms with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! autolink ::= '<' (url | email) '>'
//!
//! url ::= protocol *url_byte
//! protocol ::= ascii_alphabetic 1*31(protocol_byte) ':'
//! protocol_byte ::= '+' | '-' | '.' | ascii_alphanumeric
//! url_byte ::= byte - ascii_control - ' ' - '<' - '>'
//!
//! email ::= 1*ascii_atext '@' email_domain *('.' email_domain)
//! ; Restriction: up to (including) 63 characters are allowed in each domain.
//! email_domain ::= ascii_alphanumeric *(ascii_alphanumeric | '-' ascii_alphanumeric)
//!
//! ascii_atext ::= ascii_alphanumeric | '#' | '$' | '%' | '&' | '\'' | '*' | '+' | '-' | '.' | '/' | '=' | '?' | '^' | '_' | '`' | '{' | '|' | '}' | '~'
//! ```
//!
//! The maximum allowed size of a scheme is `31` (inclusive), which is defined
//! in [`AUTOLINK_SCHEME_SIZE_MAX`][].
//! The maximum allowed size of a domain is `63` (inclusive), which is defined
//! in [`AUTOLINK_DOMAIN_SIZE_MAX`][].
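//!
//! As a rough illustration of these two limits, the following sketch checks
//! a scheme and an email domain with plain standard-library Rust.
//! The helper names (`is_autolink_scheme`, `is_email_domain`) and the local
//! constants are assumptions made for this example, not items of this crate;
//! the tokenizer below works byte by byte as a state machine instead.
//! The scheme rule follows `CommonMark`, where a scheme is 2 to 32 characters:
//! a letter followed by 1 to 31 further scheme bytes.
//!
//! ```rust
//! // Hypothetical limits for this sketch (not the crate's constants).
//! const SCHEME_REST_MAX: usize = 31; // bytes allowed after the first letter
//! const DOMAIN_LABEL_MAX: usize = 63; // bytes allowed in one domain label
//!
//! // A scheme is an ASCII letter followed by 1 to 31 scheme bytes
//! // (ASCII alphanumerics, `+`, `-`, or `.`).
//! fn is_autolink_scheme(scheme: &str) -> bool {
//!     match scheme.as_bytes().split_first() {
//!         Some((first, rest)) => {
//!             first.is_ascii_alphabetic()
//!                 && (1..=SCHEME_REST_MAX).contains(&rest.len())
//!                 && rest
//!                     .iter()
//!                     .all(|&b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'-' | b'.'))
//!         }
//!         None => false,
//!     }
//! }
//!
//! // Each dot-separated label is 1 to 63 bytes of ASCII alphanumerics or `-`,
//! // and neither starts nor ends with `-`.
//! fn is_email_domain(domain: &str) -> bool {
//!     domain.split('.').all(|label| {
//!         let bytes = label.as_bytes();
//!         (1..=DOMAIN_LABEL_MAX).contains(&bytes.len())
//!             && bytes.iter().all(|&b| b.is_ascii_alphanumeric() || b == b'-')
//!             && bytes[0] != b'-'
//!             && bytes[bytes.len() - 1] != b'-'
//!     })
//! }
//! ```
//!
//! With these helpers, `is_autolink_scheme("https")` holds while
//! `is_autolink_scheme("a")` does not, and `is_email_domain("ex-ample.com")`
//! holds while `is_email_domain("-example.com")` does not.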
//!
//! The grammar for autolinks is quite strict and prohibits the use of ASCII control
//! characters or spaces.
//! To use non-ASCII characters and otherwise impossible characters in URLs,
//! you can use percent encoding:
//!
//! ```markdown
//! <https://example.com/alpha%20bravo>
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com/alpha%20bravo">https://example.com/alpha%20bravo</a></p>
//! ```
//!
//! There are several cases where incorrect encoding of URLs would, in other
//! languages, result in a parse error.
//! In markdown, there are no errors, and URLs are normalized.
//! In addition, many characters are percent encoded
//! ([`sanitize_uri`][sanitize_uri]).
//! For example:
//!
//! ```markdown
//! <https://a👍b%>
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://a%F0%9F%91%8Db%25">https://a👍b%</a></p>
//! ```
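//!
//! A minimal sketch of that kind of normalization, written with only the
//! standard library, could look as follows.
//! It is not the implementation of `sanitize_uri`: the helper name
//! `percent_encode_sketch` and the set of punctuation that is kept as-is are
//! assumptions made for this example.
//!
//! ```rust
//! // Hypothetical helper: percent-encode every byte that this sketch does not
//! // consider safe, and keep escapes that are already valid.
//! fn percent_encode_sketch(url: &str) -> String {
//!     // Punctuation kept as-is here (an assumption, not the crate's list).
//!     const KEEP: &[u8] = b"-._~:/?#[]@!$&'()*+,;=";
//!     let bytes = url.as_bytes();
//!     let mut out = String::with_capacity(bytes.len());
//!     for (i, &b) in bytes.iter().enumerate() {
//!         // A `%` followed by two hex digits is an existing escape.
//!         let valid_escape = b == b'%'
//!             && bytes.get(i + 1).map_or(false, |c| c.is_ascii_hexdigit())
//!             && bytes.get(i + 2).map_or(false, |c| c.is_ascii_hexdigit());
//!         if b.is_ascii_alphanumeric() || KEEP.contains(&b) || valid_escape {
//!             out.push(b as char);
//!         } else {
//!             // Everything else, including each byte of a multi-byte UTF-8
//!             // character and a stray `%`, becomes `%XX`.
//!             out.push_str(&format!("%{:02X}", b));
//!         }
//!     }
//!     out
//! }
//! ```
//!
//! For the input above, `percent_encode_sketch("https://a👍b%")` gives
//! `https://a%F0%9F%91%8Db%25`, while an already valid escape such as `%20`
//! is left alone.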
//!
//! Interestingly, there are a couple of things that are valid autolinks in
//! markdown but in HTML would be valid tags, such as `<svg:rect>` and
//! `<xml:lang/>`.
//! However, because `CommonMark` employs a naïve HTML parsing algorithm, those
//! are not considered HTML.
//!
//! While `CommonMark` restricts links from occurring in other links in the
//! case of labels (see [label end][label_end]), this restriction is not in
//! place for autolinks inside labels:
//!
//! ```markdown
//! [<https://example.com>](#)
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="#"><a href="https://example.com">https://example.com</a></a></p>
//! ```
//!
//! The generated output, in this case, is invalid according to HTML.
//! When a browser sees that markup, it will instead parse it as:
//!
//! ```html
//! <p><a href="#"></a><a href="https://example.com">https://example.com</a></p>
//! ```
//!
//! ## HTML
//!
//! Autolinks relate to the `<a>` element in HTML.
//! See [*§ 4.5.1 The `a` element*][html_a] in the HTML spec for more info.
//! When an email autolink is used (so, without a protocol), the string
//! `mailto:` is prepended before the email, when generating the `href`
//! attribute of the hyperlink.
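//!
//! The `href` rule can be sketched as below.
//! The helper `autolink_href` and its `is_email` flag are hypothetical names
//! for this example; the sketch leaves out URL sanitizing and HTML escaping,
//! which a real compiler still has to apply.
//!
//! ```rust
//! // Hypothetical helper: build the `href` value from the text between `<`
//! // and `>`.
//! fn autolink_href(value: &str, is_email: bool) -> String {
//!     if is_email {
//!         // Email autolinks carry no protocol, so `mailto:` is prepended.
//!         format!("mailto:{}", value)
//!     } else {
//!         // URL autolinks already start with their own protocol.
//!         value.to_string()
//!     }
//! }
//! ```
//!
//! So `autolink_href("user@example.com", true)` gives
//! `mailto:user@example.com`, and a URL autolink is passed through unchanged.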
//!
//! ## Recommendation
//!
//! It is recommended to use labels ([label start link][label_start_link],
//! [label end][label_end]), either with a resource or a definition
//! ([definition][]), instead of autolinks, as those allow more characters in
//! URLs, and allow relative URLs and `www.` URLs.
//! They also allow for descriptive text to explain the URL in prose.
//!
//! ## Tokens
//!
//! * [`Autolink`][Name::Autolink]
//! * [`AutolinkEmail`][Name::AutolinkEmail]
//! * [`AutolinkMarker`][Name::AutolinkMarker]
//! * [`AutolinkProtocol`][Name::AutolinkProtocol]
//!
//! ## References
//!
//! * [`autolink.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/autolink.js)
//! * [*§ 6.4 Autolinks* in `CommonMark`](https://spec.commonmark.org/0.31/#autolinks)
//!
//! [text]: crate::construct::text
//! [definition]: crate::construct::definition
//! [label_start_link]: crate::construct::label_start_link
//! [label_end]: crate::construct::label_end
//! [autolink_scheme_size_max]: crate::util::constant::AUTOLINK_SCHEME_SIZE_MAX
//! [autolink_domain_size_max]: crate::util::constant::AUTOLINK_DOMAIN_SIZE_MAX
//! [sanitize_uri]: crate::util::sanitize_uri
//! [html_a]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-a-element

use crate::event::Name;
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::constant::{AUTOLINK_DOMAIN_SIZE_MAX, AUTOLINK_SCHEME_SIZE_MAX};

/// Start of an autolink.
///
/// ```markdown
/// > | a<https://example.com>b
///      ^
/// > | a<user@example.com>b
///      ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.parse_state.options.constructs.autolink && tokenizer.current == Some(b'<') {
        tokenizer.enter(Name::Autolink);
        tokenizer.enter(Name::AutolinkMarker);
        tokenizer.consume();
        tokenizer.exit(Name::AutolinkMarker);
        tokenizer.enter(Name::AutolinkProtocol);
        State::Next(StateName::AutolinkOpen)
    } else {
        State::Nok
    }
}

/// After `<`, at protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///       ^
/// > | a<user@example.com>b
///       ^
/// ```
pub fn open(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphabetic.
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::AutolinkSchemeOrEmailAtext)
        }
        Some(b'@') => State::Nok,
        _ => State::Retry(StateName::AutolinkEmailAtext),
    }
}

/// At second byte of protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///        ^
/// > | a<user@example.com>b
///        ^
/// ```
pub fn scheme_or_email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric and `+`, `-`, and `.`.
        Some(b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            // Count the previous alphabetical from `open` too.
            tokenizer.tokenize_state.size = 1;
            State::Retry(StateName::AutolinkSchemeInsideOrEmailAtext)
        }
        _ => State::Retry(StateName::AutolinkEmailAtext),
    }
}

/// In ambiguous protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///        ^
/// > | a<user@example.com>b
///        ^
/// ```
pub fn scheme_inside_or_email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b':') => {
            tokenizer.consume();
            tokenizer.tokenize_state.size = 0;
            State::Next(StateName::AutolinkUrlInside)
        }
        // ASCII alphanumeric and `+`, `-`, and `.`.
        Some(b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.tokenize_state.size < AUTOLINK_SCHEME_SIZE_MAX =>
        {
            tokenizer.consume();
            tokenizer.tokenize_state.size += 1;
            State::Next(StateName::AutolinkSchemeInsideOrEmailAtext)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Retry(StateName::AutolinkEmailAtext)
        }
    }
}

/// After protocol, in URL.
///
/// ```markdown
/// > | a<https://example.com>b
///             ^
/// ```
pub fn url_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'>') => {
            tokenizer.exit(Name::AutolinkProtocol);
            tokenizer.enter(Name::AutolinkMarker);
            tokenizer.consume();
            tokenizer.exit(Name::AutolinkMarker);
            tokenizer.exit(Name::Autolink);
            State::Ok
        }
        // ASCII control, space, or `<`.
        None | Some(b'\0'..=0x1F | b' ' | b'<' | 0x7F) => State::Nok,
        Some(_) => {
            tokenizer.consume();
            State::Next(StateName::AutolinkUrlInside)
        }
    }
}

/// In email atext.
///
/// ```markdown
/// > | a<user.name@example.com>b
///           ^
/// ```
pub fn email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'@') => {
            tokenizer.consume();
            State::Next(StateName::AutolinkEmailAtSignOrDot)
        }
        // ASCII atext.
        //
        // atext is an ASCII alphanumeric (see [`is_ascii_alphanumeric`][]), or
        // a byte in the inclusive ranges U+0023 NUMBER SIGN (`#`) to U+0027
        // APOSTROPHE (`'`), U+002A ASTERISK (`*`), U+002B PLUS SIGN (`+`),
        // U+002D DASH (`-`) to U+002F SLASH (`/`), U+003D EQUALS SIGN (`=`),
        // U+003F QUESTION MARK (`?`), U+005E CARET (`^`) to U+0060 GRAVE
        // ACCENT (`` ` ``), or U+007B LEFT CURLY BRACE (`{`) to U+007E TILDE
        // (`~`).
        //
        // See:
        // **\[RFC5322]**:
        // [Internet Message Format](https://tools.ietf.org/html/rfc5322).
        // P. Resnick.
        // IETF.
        //
        // [`is_ascii_alphanumeric`]: char::is_ascii_alphanumeric
        Some(
            b'#'..=b'\'' | b'*' | b'+' | b'-'..=b'9' | b'=' | b'?' | b'A'..=b'Z' | b'^'..=b'~',
        ) => {
            tokenizer.consume();
            State::Next(StateName::AutolinkEmailAtext)
        }
        _ => State::Nok,
    }
}

/// In label, after at-sign or dot.
///
/// ```markdown
/// > | a<user.name@example.com>b
///                 ^       ^
/// ```
pub fn email_at_sign_or_dot(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric.
        Some(b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            State::Retry(StateName::AutolinkEmailValue)
        }
        _ => State::Nok,
    }
}

/// In label, where `.` and `>` are allowed.
///
/// ```markdown
/// > | a<user.name@example.com>b
///                  ^
/// ```
pub fn email_label(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'.') => {
            tokenizer.consume();
            tokenizer.tokenize_state.size = 0;
            State::Next(StateName::AutolinkEmailAtSignOrDot)
        }
        Some(b'>') => {
            let index = tokenizer.events.len();
            tokenizer.exit(Name::AutolinkProtocol);
            // The token was entered as `AutolinkProtocol` in `start`.
            // Now that it turned out to be an email, rename both its enter
            // event (`index - 1`) and its exit event (`index`).
            tokenizer.events[index - 1].name = Name::AutolinkEmail;
            tokenizer.events[index].name = Name::AutolinkEmail;
            tokenizer.enter(Name::AutolinkMarker);
            tokenizer.consume();
            tokenizer.exit(Name::AutolinkMarker);
            tokenizer.exit(Name::Autolink);
            tokenizer.tokenize_state.size = 0;
            State::Ok
        }
        _ => State::Retry(StateName::AutolinkEmailValue),
    }
}

/// In label, where `.` and `>` are *not* allowed.
///
/// Though, this is also used in `email_label` to parse other values.
///
/// ```markdown
/// > | a<user.name@ex-ample.com>b
///                    ^
/// ```
pub fn email_value(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric or `-`.
        Some(b'-' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.tokenize_state.size < AUTOLINK_DOMAIN_SIZE_MAX =>
        {
            let name = if matches!(tokenizer.current, Some(b'-')) {
                StateName::AutolinkEmailValue
            } else {
                StateName::AutolinkEmailLabel
            };
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(name)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Nok
        }
    }
}
358}