//! GFM: autolink literal occurs in the [text][] content type.
//!
//! ## Grammar
//!
//! Autolink literals form with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! gfm_autolink_literal ::= http_autolink | www_autolink | email_autolink
//!
//! ; Restriction: the code before must be `www_autolink_before`.
//! ; Restriction: the code after `.` must not be eof.
//! www_autolink ::= 3('w' | 'W') '.' [domain [path]]
//! www_autolink_before ::= eof | eol | space_or_tab | '(' | '*' | '_' | '[' | ']' | '~'
//!
//! ; Restriction: the code before must be `http_autolink_before`.
//! ; Restriction: the code after the protocol must be `http_autolink_protocol_after`.
//! http_autolink ::= ('h' | 'H') 2('t' | 'T') ('p' | 'P') ['s' | 'S'] ':' 2'/' domain [path]
//! http_autolink_before ::= byte - ascii_alpha
//! http_autolink_protocol_after ::= byte - eof - eol - ascii_control - unicode_whitespace - unicode_punctuation
//!
//! ; Restriction: the code before must be `email_autolink_before`.
//! ; Restriction: `ascii_digit` may not occur in the last label part of the label.
//! email_autolink ::= 1*('+' | '-' | '.' | '_' | ascii_alphanumeric) '@' 1*(1*label_segment label_dot_cont) 1*label_segment
//! email_autolink_before ::= byte - ascii_alpha - '/'
//!
//! ; Restriction: `_` may not occur in the last two domain parts.
//! domain ::= 1*(url_ampt_cont | domain_punct_cont | '-' | byte - eof - ascii_control - unicode_whitespace - unicode_punctuation)
//! ; Restriction: must not be followed by `punct`.
//! domain_punct_cont ::= '.' | '_'
//! ; Restriction: must not be followed by `char_ref`.
//! url_ampt_cont ::= '&'
//!
//! ; Restriction: a counter `balance = 0` is increased for every `(`, and decreased for every `)`.
//! ; Restriction: `)` must not be `paren_at_end`.
//! path ::= 1*(url_ampt_cont | path_punctuation_cont | '(' | ')' | byte - eof - eol - space_or_tab)
//! ; Restriction: must not be followed by `punct`.
//! path_punctuation_cont ::= trailing_punctuation - '<'
//! ; Restriction: must be followed by `punct` and `balance` must be less than `0`.
//! paren_at_end ::= ')'
//!
//! label_segment ::= label_dash_underscore_cont | ascii_alpha | ascii_digit
//! ; Restriction: if followed by `punct`, the whole email autolink is invalid.
//! label_dash_underscore_cont ::= '-' | '_'
//! ; Restriction: must not be followed by `punct`.
//! label_dot_cont ::= '.'
//!
//! punct ::= *trailing_punctuation ( byte - eof - eol - space_or_tab - '<' )
//! char_ref ::= *ascii_alpha ';' path_end
//! trailing_punctuation ::= '!' | '"' | '\'' | ')' | '*' | ',' | '.' | ':' | ';' | '<' | '?' | '_' | '~'
//! ```
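//!
//! For example, each of the three kinds, on a line of its own:
//!
//! ```markdown
//! www.example.com
//!
//! https://example.com
//!
//! contact@example.com
//! ```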
//!
//! The grammar for GFM autolink literal is very relaxed: basically anything
//! except for whitespace is allowed after a prefix.
//! To use whitespace characters, and other characters that are otherwise
//! impossible in URLs, you can use percent encoding:
//!
//! ```markdown
//! https://example.com/alpha%20bravo
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com/alpha%20bravo">https://example.com/alpha%20bravo</a></p>
//! ```
//!
//! There are several cases where incorrect encoding of URLs would, in other
//! languages, result in a parse error.
//! In markdown, there are no errors, and URLs are normalized.
//! In addition, many characters are percent encoded
//! ([`sanitize_uri`][sanitize_uri]).
//! For example:
//!
//! ```markdown
//! www.a👍b%
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="http://www.a%F0%9F%91%8Db%25">www.a👍b%</a></p>
//! ```
//!
//! There is a big difference between how www and protocol literals work
//! compared to how email literals work.
//! The first two are done when parsing, and work like anything else in
//! markdown.
//! But email literals are handled afterwards: when everything is parsed, we
//! look back at the events to figure out if there were email addresses.
//! This particularly affects how they interleave with character escapes and
//! character references.
//!
//! ## HTML
//!
//! GFM autolink literals relate to the `<a>` element in HTML.
//! See [*§ 4.5.1 The `a` element*][html_a] in the HTML spec for more info.
//! When an email autolink is used, the string `mailto:` is prepended when
//! generating the `href` attribute of the hyperlink.
//! When a www autolink is used, the string `http:` is prepended.
//!
//! ## Recommendation
//!
//! It is recommended to use labels ([label start link][label_start_link],
//! [label end][label_end]), either with a resource or a definition
//! ([definition][]), instead of autolink literals, as those allow relative
//! URLs and descriptive text to explain the URL in prose.
//!
//! ## Bugs
//!
//! GitHub’s own algorithm to parse autolink literals contains three bugs.
//! The two main bugs are not present in this project; a smaller bug is left
//! unfixed for consistency.
//! The issues relating to autolink literals are:
//!
//! * [GFM autolink extension (`www.`, `https?://` parts): links don’t work when after bracket](https://github.com/github/cmark-gfm/issues/278)\
//!   fixed here ✅
//! * [GFM autolink extension (`www.` part): uppercase does not match on issues/PRs/comments](https://github.com/github/cmark-gfm/issues/280)\
//!   fixed here ✅
//! * [GFM autolink extension (`www.` part): the word `www` matches](https://github.com/github/cmark-gfm/issues/279)\
//!   present here for consistency
//!
//! ## Tokens
//!
//! * [`GfmAutolinkLiteralEmail`][Name::GfmAutolinkLiteralEmail]
//! * [`GfmAutolinkLiteralMailto`][Name::GfmAutolinkLiteralMailto]
//! * [`GfmAutolinkLiteralProtocol`][Name::GfmAutolinkLiteralProtocol]
//! * [`GfmAutolinkLiteralWww`][Name::GfmAutolinkLiteralWww]
//! * [`GfmAutolinkLiteralXmpp`][Name::GfmAutolinkLiteralXmpp]
//!
//! ## References
//!
//! * [`micromark-extension-gfm-autolink-literal`](https://github.com/micromark/micromark-extension-gfm-autolink-literal)
//! * [*§ 6.9 Autolinks (extension)* in `GFM`](https://github.github.com/gfm/#autolinks-extension-)
//!
//! > 👉 **Note**: `mailto:` and `xmpp:` protocols before email autolinks were
//! > added in `cmark-gfm@0.29.0.gfm.5` and are as of yet undocumented.
//!
//! [text]: crate::construct::text
//! [definition]: crate::construct::definition
//! [label_start_link]: crate::construct::label_start_link
//! [label_end]: crate::construct::label_end
//! [sanitize_uri]: crate::util::sanitize_uri
//! [html_a]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-a-element

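//!
//! For example, a plain email address in prose is still found during that
//! later pass:
//!
//! ```markdown
//! Mail contact@example.com for more info.
//! ```
//!
//! Yields:
//!
//! ```html
//! <p>Mail <a href="mailto:contact@example.com">contact@example.com</a> for more info.</p>
//! ```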
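//!
//! For example:
//!
//! ```markdown
//! www.example.com
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="http://www.example.com">www.example.com</a></p>
//! ```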
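//!
//! For example, a label with a resource can carry descriptive text:
//!
//! ```markdown
//! [our website](https://example.com)
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com">our website</a></p>
//! ```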
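//!
//! For example:
//!
//! ```markdown
//! mailto:contact@example.com
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="mailto:contact@example.com">mailto:contact@example.com</a></p>
//! ```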
use crate::event::{Event, Kind, Name};
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::{
    char::{kind_after_index, Kind as CharacterKind},
    slice::{Position, Slice},
};
use alloc::vec::Vec;

/// Start of protocol autolink literal.
///
/// ```markdown
/// > | https://example.com/a?b#c
///     ^
/// ```
pub fn protocol_start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer
        .parse_state
        .options
        .constructs
        .gfm_autolink_literal
        && matches!(tokenizer.current, Some(b'H' | b'h'))
        // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L214>.
        && !matches!(tokenizer.previous, Some(b'A'..=b'Z' | b'a'..=b'z'))
    {
        tokenizer.enter(Name::GfmAutolinkLiteralProtocol);
        tokenizer.attempt(
            State::Next(StateName::GfmAutolinkLiteralProtocolAfter),
            State::Nok,
        );
        tokenizer.attempt(
            State::Next(StateName::GfmAutolinkLiteralDomainInside),
            State::Nok,
        );
        tokenizer.tokenize_state.start = tokenizer.point.index;
        State::Retry(StateName::GfmAutolinkLiteralProtocolPrefixInside)
    } else {
        State::Nok
    }
}

/// After a protocol autolink literal.
///
/// ```markdown
/// > | https://example.com/a?b#c
///                              ^
/// ```
pub fn protocol_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.exit(Name::GfmAutolinkLiteralProtocol);
    State::Ok
}

/// In protocol.
///
/// ```markdown
/// > | https://example.com/a?b#c
///     ^^^^^
/// ```
pub fn protocol_prefix_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'A'..=b'Z' | b'a'..=b'z')
            // `5` is size of `https`
            if tokenizer.point.index - tokenizer.tokenize_state.start < 5 =>
        {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralProtocolPrefixInside)
        }
        Some(b':') => {
            let slice = Slice::from_indices(
                tokenizer.parse_state.bytes,
                tokenizer.tokenize_state.start,
                tokenizer.point.index,
            );
            let name = slice.as_str().to_ascii_lowercase();

            tokenizer.tokenize_state.start = 0;

            if name == "http" || name == "https" {
                tokenizer.consume();
                State::Next(StateName::GfmAutolinkLiteralProtocolSlashesInside)
            } else {
                State::Nok
            }
        }
        _ => {
            tokenizer.tokenize_state.start = 0;
            State::Nok
        }
    }
}

/// In protocol slashes.
///
/// ```markdown
/// > | https://example.com/a?b#c
///           ^^
/// ```
pub fn protocol_slashes_inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(b'/') {
        tokenizer.consume();
        if tokenizer.tokenize_state.size == 0 {
            tokenizer.tokenize_state.size += 1;
            State::Next(StateName::GfmAutolinkLiteralProtocolSlashesInside)
        } else {
            tokenizer.tokenize_state.size = 0;
            State::Ok
        }
    } else {
        tokenizer.tokenize_state.size = 0;
        State::Nok
    }
}

/// Start of www autolink literal.
///
/// ```markdown
/// > | www.example.com/a?b#c
///     ^
/// ```
pub fn www_start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer
        .parse_state
        .options
        .constructs
        .gfm_autolink_literal
        && matches!(tokenizer.current, Some(b'W' | b'w'))
        // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L156>.
        && matches!(tokenizer.previous, None | Some(b'\t' | b'\n' | b' ' | b'(' | b'*' | b'_' | b'[' | b']' | b'~'))
    {
        tokenizer.enter(Name::GfmAutolinkLiteralWww);
        tokenizer.attempt(
            State::Next(StateName::GfmAutolinkLiteralWwwAfter),
            State::Nok,
        );
        // Note: we *check*, so we can discard the `www.` we parsed.
        // If it worked, we consider it as a part of the domain.
        tokenizer.check(
            State::Next(StateName::GfmAutolinkLiteralDomainInside),
            State::Nok,
        );
        State::Retry(StateName::GfmAutolinkLiteralWwwPrefixInside)
    } else {
        State::Nok
    }
}

/// After a www autolink literal.
///
/// ```markdown
/// > | www.example.com/a?b#c
///                          ^
/// ```
pub fn www_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.exit(Name::GfmAutolinkLiteralWww);
    State::Ok
}

/// In www prefix.
///
/// ```markdown
/// > | www.example.com
///     ^^^^
/// ```
pub fn www_prefix_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'.') if tokenizer.tokenize_state.size == 3 => {
            tokenizer.tokenize_state.size = 0;
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralWwwPrefixAfter)
        }
        Some(b'W' | b'w') if tokenizer.tokenize_state.size < 3 => {
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralWwwPrefixInside)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Nok
        }
    }
}

/// After www prefix.
///
/// ```markdown
/// > | www.example.com
///         ^
/// ```
pub fn www_prefix_after(tokenizer: &mut Tokenizer) -> State {
    // If there is *anything*, we can link.
    if tokenizer.current.is_none() {
        State::Nok
    } else {
        State::Ok
    }
}

/// In domain.
///
/// ```markdown
/// > | https://example.com/a
///             ^^^^^^^^^^^
/// ```
pub fn domain_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // Check whether this trailing punctuation marker is optionally
        // followed by more trailing markers, and then followed by an end.
        Some(b'.' | b'_') => {
            tokenizer.check(
                State::Next(StateName::GfmAutolinkLiteralDomainAfter),
                State::Next(StateName::GfmAutolinkLiteralDomainAtPunctuation),
            );
            State::Retry(StateName::GfmAutolinkLiteralTrail)
        }
        // Dashes and continuation bytes are fine.
        Some(b'-' | 0x80..=0xBF) => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralDomainInside)
        }
        _ => {
            // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L12>.
            if kind_after_index(tokenizer.parse_state.bytes, tokenizer.point.index)
                == CharacterKind::Other
            {
                tokenizer.tokenize_state.seen = true;
                tokenizer.consume();
                State::Next(StateName::GfmAutolinkLiteralDomainInside)
            } else {
                State::Retry(StateName::GfmAutolinkLiteralDomainAfter)
            }
        }
    }
}

/// In domain, at potential trailing punctuation, that was not trailing.
///
/// ```markdown
/// > | https://example.com
///                    ^
/// ```
pub fn domain_at_punctuation(tokenizer: &mut Tokenizer) -> State {
    // There is an underscore in the last segment of the domain.
    if matches!(tokenizer.current, Some(b'_')) {
        tokenizer.tokenize_state.marker = b'_';
    }
    // Otherwise, it’s a `.`: save the last segment underscore in the
    // penultimate segment slot.
    else {
        tokenizer.tokenize_state.marker_b = tokenizer.tokenize_state.marker;
        tokenizer.tokenize_state.marker = 0;
    }

    tokenizer.consume();
    State::Next(StateName::GfmAutolinkLiteralDomainInside)
}

/// After domain.
///
/// ```markdown
/// > | https://example.com/a
///                        ^
/// ```
pub fn domain_after(tokenizer: &mut Tokenizer) -> State {
    // No underscores allowed in the last two segments.
    let result = if tokenizer.tokenize_state.marker_b == b'_'
        || tokenizer.tokenize_state.marker == b'_'
        // At least one character must be seen.
        || !tokenizer.tokenize_state.seen
    // Note: GH says a dot is needed, but that’s not true:
    // <https://github.com/github/cmark-gfm/issues/279>
    {
        State::Nok
    } else {
        State::Retry(StateName::GfmAutolinkLiteralPathInside)
    };

    tokenizer.tokenize_state.seen = false;
    tokenizer.tokenize_state.marker = 0;
    tokenizer.tokenize_state.marker_b = 0;
    result
}

/// In path.
///
/// ```markdown
/// > | https://example.com/a
///                        ^^
/// ```
pub fn path_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // Continuation bytes are fine, we’ve already checked the first one.
        Some(0x80..=0xBF) => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralPathInside)
        }
        // Count opening parens.
        Some(b'(') => {
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralPathInside)
        }
        // Check whether this trailing punctuation marker is optionally
        // followed by more trailing markers, and then followed by an end.
        // If this is a paren (followed by trailing, then the end), we
        // *continue* if we saw fewer closing parens than opening parens.
        Some(
            b'!' | b'"' | b'&' | b'\'' | b')' | b'*' | b',' | b'.' | b':' | b';' | b'<' | b'?'
            | b']' | b'_' | b'~',
        ) => {
            let next = if tokenizer.current == Some(b')')
                && tokenizer.tokenize_state.size_b < tokenizer.tokenize_state.size
            {
                StateName::GfmAutolinkLiteralPathAtPunctuation
            } else {
                StateName::GfmAutolinkLiteralPathAfter
            };
            tokenizer.check(
                State::Next(next),
                State::Next(StateName::GfmAutolinkLiteralPathAtPunctuation),
            );
            State::Retry(StateName::GfmAutolinkLiteralTrail)
        }
        _ => {
            // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L12>.
            if tokenizer.current.is_none()
                || kind_after_index(tokenizer.parse_state.bytes, tokenizer.point.index)
                    == CharacterKind::Whitespace
            {
                State::Retry(StateName::GfmAutolinkLiteralPathAfter)
            } else {
                tokenizer.consume();
                State::Next(StateName::GfmAutolinkLiteralPathInside)
            }
        }
    }
}

/// In path, at potential trailing punctuation, that was not trailing.
///
/// ```markdown
/// > | https://example.com/a"b
///                          ^
/// ```
pub fn path_at_punctuation(tokenizer: &mut Tokenizer) -> State {
    // Count closing parens.
    if tokenizer.current == Some(b')') {
        tokenizer.tokenize_state.size_b += 1;
    }

    tokenizer.consume();
    State::Next(StateName::GfmAutolinkLiteralPathInside)
}

/// At end of path, reset parens.
///
/// ```markdown
/// > | https://example.com/asd(qwe).
///                                 ^
/// ```
pub fn path_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.tokenize_state.size = 0;
    tokenizer.tokenize_state.size_b = 0;
    State::Ok
}

/// In trail of domain or path.
///
/// ```markdown
/// > | https://example.com").
///                        ^
/// ```
pub fn trail(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // Regular trailing punctuation.
        Some(
            b'!' | b'"' | b'\'' | b')' | b'*' | b',' | b'.' | b':' | b';' | b'?' | b'_' | b'~',
        ) => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrail)
        }
        // `&` followed by one or more alphabeticals and then a `;`, is
        // as a whole considered as trailing punctuation.
        // In all other cases, it is considered as continuation of the URL.
        Some(b'&') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrailCharRefStart)
        }
        // `<` is an end.
        Some(b'<') => State::Ok,
        // Needed because we allow literals after `[`, as we fix:
        // <https://github.com/github/cmark-gfm/issues/278>.
        // Check that it is not followed by `(` or `[`.
        Some(b']') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrailBracketAfter)
        }
        _ => {
            // Whitespace is the end of the URL, anything else is continuation.
            if kind_after_index(tokenizer.parse_state.bytes, tokenizer.point.index)
                == CharacterKind::Whitespace
            {
                State::Ok
            } else {
                State::Nok
            }
        }
    }
}

/// In trail, after `]`.
///
/// > 👉 **Note**: this deviates from `cmark-gfm` to fix a bug.
/// > See end of <https://github.com/github/cmark-gfm/issues/278> for more.
///
/// ```markdown
/// > | https://example.com](
///                         ^
/// ```
pub fn trail_bracket_after(tokenizer: &mut Tokenizer) -> State {
    // Whitespace or something that could start a resource or reference is the end.
    // Switch back to trail otherwise.
    if matches!(
        tokenizer.current,
        None | Some(b'\t' | b'\n' | b' ' | b'(' | b'[')
    ) {
        State::Ok
    } else {
        State::Retry(StateName::GfmAutolinkLiteralTrail)
    }
}

/// In character-reference like trail, after `&`.
///
/// ```markdown
/// > | https://example.com&).
///                         ^
/// ```
pub fn trail_char_ref_start(tokenizer: &mut Tokenizer) -> State {
    if matches!(tokenizer.current, Some(b'A'..=b'Z' | b'a'..=b'z')) {
        State::Retry(StateName::GfmAutolinkLiteralTrailCharRefInside)
    } else {
        State::Nok
    }
}

/// In character-reference like trail.
///
/// ```markdown
/// > | https://example.com&).
///                         ^
/// ```
pub fn trail_char_ref_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrailCharRefInside)
        }
        // Switch back to trail if this is well-formed.
        Some(b';') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrail)
        }
        _ => State::Nok,
    }
}

/// Resolve: postprocess text to find email autolink literals.
pub fn resolve(tokenizer: &mut Tokenizer) {
    tokenizer.map.consume(&mut tokenizer.events);

    let mut index = 0;
    let mut links = 0;

    while index < tokenizer.events.len() {
        let event = &tokenizer.events[index];

        if event.kind == Kind::Enter {
            if event.name == Name::Link {
                links += 1;
            }
        } else {
            if event.name == Name::Data && links == 0 {
                let slice = Slice::from_position(
                    tokenizer.parse_state.bytes,
                    &Position::from_exit_event(&tokenizer.events, index),
                );
                let bytes = slice.bytes;
                let mut byte_index = 0;
                let mut replace = Vec::new();
                let mut point = tokenizer.events[index - 1].point.clone();
                let start_index = point.index;
                let mut min = 0;

                while byte_index < bytes.len() {
                    if bytes[byte_index] == b'@' {
                        let mut range = (0, 0, Name::GfmAutolinkLiteralEmail);

                        if let Some(start) = peek_bytes_atext(bytes, min, byte_index) {
                            let (start, kind) = peek_protocol(bytes, min, start);

                            if let Some(end) = peek_bytes_email_domain(
                                bytes,
                                byte_index + 1,
                                kind == Name::GfmAutolinkLiteralXmpp,
                            ) {
                                // Note: normally we’d truncate trailing
                                // punctuation from the link.
                                // However, email autolink literals cannot
                                // contain any of those markers, except for
                                // `.`, but that can only occur if it isn’t
                                // trailing.
                                // So we can ignore truncating while
                                // postprocessing!
                                range = (start, end, kind);
                            }
                        }

                        if range.1 != 0 {
                            byte_index = range.1;

                            // If there is something between the last link
                            // (or `min`) and this link.
                            if min != range.0 {
                                replace.push(Event {
                                    kind: Kind::Enter,
                                    name: Name::Data,
                                    point: point.clone(),
                                    link: None,
                                });
                                point = point
                                    .shift_to(tokenizer.parse_state.bytes, start_index + range.0);
                                replace.push(Event {
                                    kind: Kind::Exit,
                                    name: Name::Data,
                                    point: point.clone(),
                                    link: None,
                                });
                            }

                            // Add the link.
                            replace.push(Event {
                                kind: Kind::Enter,
                                name: range.2.clone(),
                                point: point.clone(),
                                link: None,
                            });
                            point =
                                point.shift_to(tokenizer.parse_state.bytes, start_index + range.1);
                            replace.push(Event {
                                kind: Kind::Exit,
                                name: range.2.clone(),
                                point: point.clone(),
                                link: None,
                            });
                            min = range.1;
                        }
                    }

                    byte_index += 1;
                }

                // If there was a link, and we have more bytes left.
                if min != 0 && min < bytes.len() {
                    replace.push(Event {
                        kind: Kind::Enter,
                        name: Name::Data,
                        point: point.clone(),
                        link: None,
                    });
                    replace.push(Event {
                        kind: Kind::Exit,
                        name: Name::Data,
                        point: event.point.clone(),
                        link: None,
                    });
                }

                // If there were links.
                if !replace.is_empty() {
                    tokenizer.map.add(index - 1, 2, replace);
                }
            }

            if event.name == Name::Link {
                links -= 1;
            }
        }

        index += 1;
    }
}

/// Move back past atext.
///
/// Moving back is only used when postprocessing text: so for the email
/// address algorithm.
///
/// ```markdown
/// > | a contact@example.org b
///              ^-- from
///       ^-- to
/// ```
fn peek_bytes_atext(bytes: &[u8], min: usize, end: usize) -> Option<usize> {
    let mut index = end;

    // Take simplified atext.
    // See `email_atext` in `autolink.rs` for a similar algorithm.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L301>.
    while index > min
        && matches!(bytes[index - 1], b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'_' | b'a'..=b'z')
    {
        index -= 1;
    }

    // Do not allow a slash “inside” atext.
    // The reference code is a bit weird, but that’s what it results in.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L307>.
    // Other than slash, every preceding character is allowed.
    if index == end || (index > min && bytes[index - 1] == b'/') {
        None
    } else {
        Some(index)
    }
}

/// Move back past a `mailto:` or `xmpp:` protocol.
///
/// Moving back is only used when postprocessing text: so for the email
/// address algorithm.
///
/// ```markdown
/// > | a mailto:contact@example.org b
///              ^-- from
///       ^-- to
/// ```
fn peek_protocol(bytes: &[u8], min: usize, end: usize) -> (usize, Name) {
    let mut index = end;

    if index > min && bytes[index - 1] == b':' {
        index -= 1;

        // Take alphanumerical.
        while index > min && matches!(bytes[index - 1], b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') {
            index -= 1;
        }

        let slice = Slice::from_indices(bytes, index, end - 1);
        let name = slice.as_str().to_ascii_lowercase();

        if name == "xmpp" {
            return (index, Name::GfmAutolinkLiteralXmpp);
        } else if name == "mailto" {
            return (index, Name::GfmAutolinkLiteralMailto);
        }
    }

    (end, Name::GfmAutolinkLiteralEmail)
}

/// Move past an email domain.
///
/// Peeking like this is only used when postprocessing text: so for the
/// email address algorithm.
///
/// ```markdown
/// > | a contact@example.org b
///               ^-- from
///                          ^-- to
/// ```
fn peek_bytes_email_domain(bytes: &[u8], start: usize, xmpp: bool) -> Option<usize> {
    let mut index = start;
    let mut dot = false;

    // Move past “domain”.
    // The reference code is a bit overly complex as it handles the `@`,
    // of which there may be just one.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L318>
    while index < bytes.len() {
        match bytes[index] {
            // Alphanumerical, `-`, and `_`.
            b'-' | b'0'..=b'9' | b'A'..=b'Z' | b'_' | b'a'..=b'z' => {}
            b'/' if xmpp => {}
            // Dot followed by alphanumerical (not `-` or `_`).
            b'.' if index + 1 < bytes.len()
                && matches!(bytes[index + 1], b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') =>
            {
                dot = true;
            }
            _ => break,
        }

        index += 1;
    }

    // The domain must not be empty, must include a dot, and must end in an
    // alphabetical character or `.`.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L332>.
    if index > start && dot && matches!(bytes[index - 1], b'.' | b'A'..=b'Z' | b'a'..=b'z') {
        Some(index)
    } else {
        None
    }
}