//! GFM: autolink literal occurs in the [text][] content type.
//!
//! ## Grammar
//!
//! Autolink literals form with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! gfm_autolink_literal ::= http_autolink | www_autolink | email_autolink
//!
//! ; Restriction: the code before must be `www_autolink_before`.
//! ; Restriction: the code after `.` must not be eof.
//! www_autolink ::= 3('w' | 'W') '.' [domain [path]]
//! www_autolink_before ::= eof | eol | space_or_tab | '(' | '*' | '_' | '[' | ']' | '~'
//!
//! ; Restriction: the code before must be `http_autolink_before`.
//! ; Restriction: the code after the protocol must be `http_autolink_protocol_after`.
//! http_autolink ::= ('h' | 'H') 2('t' | 'T') ('p' | 'P') ['s' | 'S'] ':' 2'/' domain [path]
//! http_autolink_before ::= byte - ascii_alpha
//! http_autolink_protocol_after ::= byte - eof - eol - ascii_control - unicode_whitespace - unicode_punctuation
//!
//! ; Restriction: the code before must be `email_autolink_before`.
//! ; Restriction: `ascii_digit` may not occur in the last label part of the label.
//! email_autolink ::= 1*('+' | '-' | '.' | '_' | ascii_alphanumeric) '@' 1*(1*label_segment label_dot_cont) 1*label_segment
//! email_autolink_before ::= byte - ascii_alpha - '/'
//!
//! ; Restriction: `_` may not occur in the last two domain parts.
//! domain ::= 1*(url_ampt_cont | domain_punct_cont | '-' | byte - eof - ascii_control - unicode_whitespace - unicode_punctuation)
//! ; Restriction: must not be followed by `punct`.
//! domain_punct_cont ::= '.' | '_'
//! ; Restriction: must not be followed by `char_ref`.
//! url_ampt_cont ::= '&'
//!
//! ; Restriction: a counter `balance = 0` is increased for every `(`, and decreased for every `)`.
//! ; Restriction: `)` must not be `paren_at_end`.
//! path ::= 1*(url_ampt_cont | path_punctuation_cont | '(' | ')' | byte - eof - eol - space_or_tab)
//! ; Restriction: must not be followed by `punct`.
//! path_punctuation_cont ::= trailing_punctuation - '<'
//! ; Restriction: must be followed by `punct` and `balance` must be less than `0`.
//! paren_at_end ::= ')'
//!
//! label_segment ::= label_dash_underscore_cont | ascii_alpha | ascii_digit
//! ; Restriction: if followed by `punct`, the whole email autolink is invalid.
//! label_dash_underscore_cont ::= '-' | '_'
//! ; Restriction: must not be followed by `punct`.
//! label_dot_cont ::= '.'
//!
//! punct ::= *trailing_punctuation ( byte - eof - eol - space_or_tab - '<' )
//! char_ref ::= *ascii_alpha ';' path_end
//! trailing_punctuation ::= '!' | '"' | '\'' | ')' | '*' | ',' | '.' | ':' | ';' | '<' | '?' | '_' | '~'
//! ```
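//!
//! For example, each of the three kinds, on a line of its own:
//!
//! ```markdown
//! www.example.com
//!
//! https://example.com
//!
//! contact@example.com
//! ```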
//!
//! The grammar for GFM autolink literal is very relaxed: basically anything
//! except for whitespace is allowed after a prefix.
//! To use whitespace characters, and other characters that are otherwise
//! impossible in URLs, you can use percent encoding:
//!
//! ```markdown
//! https://example.com/alpha%20bravo
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com/alpha%20bravo">https://example.com/alpha%20bravo</a></p>
//! ```
//!
//! There are several cases where incorrect encoding of URLs would, in other
//! languages, result in a parse error.
//! In markdown, there are no errors, and URLs are normalized.
//! In addition, many characters are percent encoded
//! ([`sanitize_uri`][sanitize_uri]).
//! For example:
//!
//! ```markdown
//! www.a👍b%
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="http://www.a%F0%9F%91%8Db%25">www.a👍b%</a></p>
//! ```
//!
//! There is a big difference between how www and protocol literals work
//! compared to how email literals work.
//! The first two are done when parsing, and work like anything else in
//! markdown.
//! But email literals are handled afterwards: when everything is parsed, we
//! look back at the events to figure out if there were email addresses.
//! This particularly affects how they interleave with character escapes and
//! character references.
//!
//! ## HTML
//!
//! GFM autolink literals relate to the `<a>` element in HTML.
//! See [*§ 4.5.1 The `a` element*][html_a] in the HTML spec for more info.
//! When an email autolink is used, the string `mailto:` is prepended when
//! generating the `href` attribute of the hyperlink.
//! When a www autolink is used, the string `http:` is prepended.
//!
//! ## Recommendation
//!
//! It is recommended to use labels ([label start link][label_start_link],
//! [label end][label_end]), either with a resource or a definition
//! ([definition][]), instead of autolink literals, as those allow relative
//! URLs and descriptive text to explain the URL in prose.
//!
//! ## Bugs
//!
//! GitHub’s own algorithm to parse autolink literals contains three bugs.
//! The two main bugs are not present in this project; a smaller bug is left
//! unfixed for consistency.
//! The issues relating to autolink literals are:
//!
//! * [GFM autolink extension (`www.`, `https?://` parts): links don’t work when after bracket](https://github.com/github/cmark-gfm/issues/278)\
//!   fixed here ✅
//! * [GFM autolink extension (`www.` part): uppercase does not match on issues/PRs/comments](https://github.com/github/cmark-gfm/issues/280)\
//!   fixed here ✅
//! * [GFM autolink extension (`www.` part): the word `www` matches](https://github.com/github/cmark-gfm/issues/279)\
//!   present here for consistency
//!
//! ## Tokens
//!
//! * [`GfmAutolinkLiteralEmail`][Name::GfmAutolinkLiteralEmail]
//! * [`GfmAutolinkLiteralMailto`][Name::GfmAutolinkLiteralMailto]
//! * [`GfmAutolinkLiteralProtocol`][Name::GfmAutolinkLiteralProtocol]
//! * [`GfmAutolinkLiteralWww`][Name::GfmAutolinkLiteralWww]
//! * [`GfmAutolinkLiteralXmpp`][Name::GfmAutolinkLiteralXmpp]
//!
//! ## References
//!
//! * [`micromark-extension-gfm-autolink-literal`](https://github.com/micromark/micromark-extension-gfm-autolink-literal)
//! * [*§ 6.9 Autolinks (extension)* in `GFM`](https://github.github.com/gfm/#autolinks-extension-)
//!
//! > 👉 **Note**: `mailto:` and `xmpp:` protocols before email autolinks were
//! > added in `cmark-gfm@0.29.0.gfm.5` and are as of yet undocumented.
//!
//! [text]: crate::construct::text
//! [definition]: crate::construct::definition
//! [label_start_link]: crate::construct::label_start_link
//! [label_end]: crate::construct::label_end
//! [sanitize_uri]: crate::util::sanitize_uri
//! [html_a]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-a-element

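//!
//! For example, a plain email address in prose is still found during that
//! later pass:
//!
//! ```markdown
//! Mail contact@example.com for more info.
//! ```
//!
//! Yields:
//!
//! ```html
//! <p>Mail <a href="mailto:contact@example.com">contact@example.com</a> for more info.</p>
//! ```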
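//!
//! For example:
//!
//! ```markdown
//! www.example.com
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="http://www.example.com">www.example.com</a></p>
//! ```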
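//!
//! For example, a label with a resource can carry descriptive text:
//!
//! ```markdown
//! [our website](https://example.com)
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com">our website</a></p>
//! ```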
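//!
//! For example:
//!
//! ```markdown
//! mailto:contact@example.com
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="mailto:contact@example.com">mailto:contact@example.com</a></p>
//! ```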
use crate::event::{Event, Kind, Name};
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::{
    char::{kind_after_index, Kind as CharacterKind},
    slice::{Position, Slice},
};
use alloc::vec::Vec;

/// Start of protocol autolink literal.
///
/// ```markdown
/// > | https://example.com/a?b#c
///     ^
/// ```
pub fn protocol_start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer
        .parse_state
        .options
        .constructs
        .gfm_autolink_literal
        && matches!(tokenizer.current, Some(b'H' | b'h'))
        // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L214>.
        && !matches!(tokenizer.previous, Some(b'A'..=b'Z' | b'a'..=b'z'))
    {
        tokenizer.enter(Name::GfmAutolinkLiteralProtocol);
        tokenizer.attempt(
            State::Next(StateName::GfmAutolinkLiteralProtocolAfter),
            State::Nok,
        );
        tokenizer.attempt(
            State::Next(StateName::GfmAutolinkLiteralDomainInside),
            State::Nok,
        );
        tokenizer.tokenize_state.start = tokenizer.point.index;
        State::Retry(StateName::GfmAutolinkLiteralProtocolPrefixInside)
    } else {
        State::Nok
    }
}

/// After a protocol autolink literal.
///
/// ```markdown
/// > | https://example.com/a?b#c
///                              ^
/// ```
pub fn protocol_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.exit(Name::GfmAutolinkLiteralProtocol);
    State::Ok
}

/// In protocol.
///
/// ```markdown
/// > | https://example.com/a?b#c
///     ^^^^^
/// ```
pub fn protocol_prefix_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'A'..=b'Z' | b'a'..=b'z')
            // `5` is size of `https`
            if tokenizer.point.index - tokenizer.tokenize_state.start < 5 =>
        {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralProtocolPrefixInside)
        }
        Some(b':') => {
            let slice = Slice::from_indices(
                tokenizer.parse_state.bytes,
                tokenizer.tokenize_state.start,
                tokenizer.point.index,
            );
            let name = slice.as_str().to_ascii_lowercase();

            tokenizer.tokenize_state.start = 0;

            if name == "http" || name == "https" {
                tokenizer.consume();
                State::Next(StateName::GfmAutolinkLiteralProtocolSlashesInside)
            } else {
                State::Nok
            }
        }
        _ => {
            tokenizer.tokenize_state.start = 0;
            State::Nok
        }
    }
}

/// In protocol slashes.
///
/// ```markdown
/// > | https://example.com/a?b#c
///           ^^
/// ```
pub fn protocol_slashes_inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(b'/') {
        tokenizer.consume();
        if tokenizer.tokenize_state.size == 0 {
            tokenizer.tokenize_state.size += 1;
            State::Next(StateName::GfmAutolinkLiteralProtocolSlashesInside)
        } else {
            tokenizer.tokenize_state.size = 0;
            State::Ok
        }
    } else {
        tokenizer.tokenize_state.size = 0;
        State::Nok
    }
}

/// Start of www autolink literal.
///
/// ```markdown
/// > | www.example.com/a?b#c
///     ^
/// ```
pub fn www_start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer
        .parse_state
        .options
        .constructs
        .gfm_autolink_literal
        && matches!(tokenizer.current, Some(b'W' | b'w'))
        // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L156>.
        && matches!(tokenizer.previous, None | Some(b'\t' | b'\n' | b' ' | b'(' | b'*' | b'_' | b'[' | b']' | b'~'))
    {
        tokenizer.enter(Name::GfmAutolinkLiteralWww);
        tokenizer.attempt(
            State::Next(StateName::GfmAutolinkLiteralWwwAfter),
            State::Nok,
        );
        // Note: we *check*, so we can discard the `www.` we parsed.
        // If it worked, we consider it as a part of the domain.
        tokenizer.check(
            State::Next(StateName::GfmAutolinkLiteralDomainInside),
            State::Nok,
        );
        State::Retry(StateName::GfmAutolinkLiteralWwwPrefixInside)
    } else {
        State::Nok
    }
}

/// After a www autolink literal.
///
/// ```markdown
/// > | www.example.com/a?b#c
///                          ^
/// ```
pub fn www_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.exit(Name::GfmAutolinkLiteralWww);
    State::Ok
}

/// In www prefix.
///
/// ```markdown
/// > | www.example.com
///     ^^^^
/// ```
pub fn www_prefix_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'.') if tokenizer.tokenize_state.size == 3 => {
            tokenizer.tokenize_state.size = 0;
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralWwwPrefixAfter)
        }
        Some(b'W' | b'w') if tokenizer.tokenize_state.size < 3 => {
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralWwwPrefixInside)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Nok
        }
    }
}

/// After www prefix.
///
/// ```markdown
/// > | www.example.com
///         ^
/// ```
pub fn www_prefix_after(tokenizer: &mut Tokenizer) -> State {
    // If there is *anything*, we can link.
    if tokenizer.current.is_none() {
        State::Nok
    } else {
        State::Ok
    }
}

/// In domain.
///
/// ```markdown
/// > | https://example.com/a
///             ^^^^^^^^^^^
/// ```
pub fn domain_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // Check whether this trailing punctuation marker is optionally
        // followed by more trailing markers, and then followed by an end.
        Some(b'.' | b'_') => {
            tokenizer.check(
                State::Next(StateName::GfmAutolinkLiteralDomainAfter),
                State::Next(StateName::GfmAutolinkLiteralDomainAtPunctuation),
            );
            State::Retry(StateName::GfmAutolinkLiteralTrail)
        }
        // Dashes and continuation bytes are fine.
        Some(b'-' | 0x80..=0xBF) => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralDomainInside)
        }
        _ => {
            // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L12>.
            if kind_after_index(tokenizer.parse_state.bytes, tokenizer.point.index)
                == CharacterKind::Other
            {
                tokenizer.tokenize_state.seen = true;
                tokenizer.consume();
                State::Next(StateName::GfmAutolinkLiteralDomainInside)
            } else {
                State::Retry(StateName::GfmAutolinkLiteralDomainAfter)
            }
        }
    }
}

/// In domain, at potential trailing punctuation, that was not trailing.
///
/// ```markdown
/// > | https://example.com
///                    ^
/// ```
pub fn domain_at_punctuation(tokenizer: &mut Tokenizer) -> State {
    // There is an underscore in the last segment of the domain.
    if matches!(tokenizer.current, Some(b'_')) {
        tokenizer.tokenize_state.marker = b'_';
    }
    // Otherwise, it’s a `.`: save the last segment underscore in the
    // penultimate segment slot.
    else {
        tokenizer.tokenize_state.marker_b = tokenizer.tokenize_state.marker;
        tokenizer.tokenize_state.marker = 0;
    }

    tokenizer.consume();
    State::Next(StateName::GfmAutolinkLiteralDomainInside)
}

/// After domain.
///
/// ```markdown
/// > | https://example.com/a
///                        ^
/// ```
pub fn domain_after(tokenizer: &mut Tokenizer) -> State {
    // No underscores allowed in the last two segments.
    let result = if tokenizer.tokenize_state.marker_b == b'_'
        || tokenizer.tokenize_state.marker == b'_'
        // At least one character must be seen.
        || !tokenizer.tokenize_state.seen
    // Note: GH says a dot is needed, but that’s not true:
    // <https://github.com/github/cmark-gfm/issues/279>
    {
        State::Nok
    } else {
        State::Retry(StateName::GfmAutolinkLiteralPathInside)
    };

    tokenizer.tokenize_state.seen = false;
    tokenizer.tokenize_state.marker = 0;
    tokenizer.tokenize_state.marker_b = 0;
    result
}

/// In path.
///
/// ```markdown
/// > | https://example.com/a
///                        ^^
/// ```
pub fn path_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // Continuation bytes are fine, we’ve already checked the first one.
        Some(0x80..=0xBF) => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralPathInside)
        }
        // Count opening parens.
        Some(b'(') => {
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralPathInside)
        }
        // Check whether this trailing punctuation marker is optionally
        // followed by more trailing markers, and then followed by an end.
        // If this is a paren (followed by trailing, then the end), we
        // *continue* if we saw fewer closing parens than opening parens.
        Some(
            b'!' | b'"' | b'&' | b'\'' | b')' | b'*' | b',' | b'.' | b':' | b';' | b'<' | b'?'
            | b']' | b'_' | b'~',
        ) => {
            let next = if tokenizer.current == Some(b')')
                && tokenizer.tokenize_state.size_b < tokenizer.tokenize_state.size
            {
                StateName::GfmAutolinkLiteralPathAtPunctuation
            } else {
                StateName::GfmAutolinkLiteralPathAfter
            };
            tokenizer.check(
                State::Next(next),
                State::Next(StateName::GfmAutolinkLiteralPathAtPunctuation),
            );
            State::Retry(StateName::GfmAutolinkLiteralTrail)
        }
        _ => {
            // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L12>.
            if tokenizer.current.is_none()
                || kind_after_index(tokenizer.parse_state.bytes, tokenizer.point.index)
                    == CharacterKind::Whitespace
            {
                State::Retry(StateName::GfmAutolinkLiteralPathAfter)
            } else {
                tokenizer.consume();
                State::Next(StateName::GfmAutolinkLiteralPathInside)
            }
        }
    }
}

/// In path, at potential trailing punctuation, that was not trailing.
///
/// ```markdown
/// > | https://example.com/a"b
///                          ^
/// ```
pub fn path_at_punctuation(tokenizer: &mut Tokenizer) -> State {
    // Count closing parens.
    if tokenizer.current == Some(b')') {
        tokenizer.tokenize_state.size_b += 1;
    }

    tokenizer.consume();
    State::Next(StateName::GfmAutolinkLiteralPathInside)
}

/// At end of path, reset parens.
///
/// ```markdown
/// > | https://example.com/asd(qwe).
///                                 ^
/// ```
pub fn path_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.tokenize_state.size = 0;
    tokenizer.tokenize_state.size_b = 0;
    State::Ok
}

/// In trail of domain or path.
///
/// ```markdown
/// > | https://example.com").
///                        ^
/// ```
pub fn trail(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // Regular trailing punctuation.
        Some(
            b'!' | b'"' | b'\'' | b')' | b'*' | b',' | b'.' | b':' | b';' | b'?' | b'_' | b'~',
        ) => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrail)
        }
        // `&` followed by one or more alphabeticals and then a `;`, is
        // as a whole considered as trailing punctuation.
        // In all other cases, it is considered as continuation of the URL.
        Some(b'&') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrailCharRefStart)
        }
        // `<` is an end.
        Some(b'<') => State::Ok,
        // Needed because we allow literals after `[`, as we fix:
        // <https://github.com/github/cmark-gfm/issues/278>.
        // Check that it is not followed by `(` or `[`.
        Some(b']') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrailBracketAfter)
        }
        _ => {
            // Whitespace is the end of the URL, anything else is continuation.
            if kind_after_index(tokenizer.parse_state.bytes, tokenizer.point.index)
                == CharacterKind::Whitespace
            {
                State::Ok
            } else {
                State::Nok
            }
        }
    }
}

/// In trail, after `]`.
///
/// > 👉 **Note**: this deviates from `cmark-gfm` to fix a bug.
/// > See end of <https://github.com/github/cmark-gfm/issues/278> for more.
///
/// ```markdown
/// > | https://example.com](
///                         ^
/// ```
pub fn trail_bracket_after(tokenizer: &mut Tokenizer) -> State {
    // Whitespace or something that could start a resource or reference is the end.
    // Switch back to trail otherwise.
    if matches!(
        tokenizer.current,
        None | Some(b'\t' | b'\n' | b' ' | b'(' | b'[')
    ) {
        State::Ok
    } else {
        State::Retry(StateName::GfmAutolinkLiteralTrail)
    }
}

/// In character-reference like trail, after `&`.
///
/// ```markdown
/// > | https://example.com&).
///                         ^
/// ```
pub fn trail_char_ref_start(tokenizer: &mut Tokenizer) -> State {
    if matches!(tokenizer.current, Some(b'A'..=b'Z' | b'a'..=b'z')) {
        State::Retry(StateName::GfmAutolinkLiteralTrailCharRefInside)
    } else {
        State::Nok
    }
}

/// In character-reference like trail.
///
/// ```markdown
/// > | https://example.com&).
///                         ^
/// ```
pub fn trail_char_ref_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrailCharRefInside)
        }
        // Switch back to trail if this is well-formed.
        Some(b';') => {
            tokenizer.consume();
            State::Next(StateName::GfmAutolinkLiteralTrail)
        }
        _ => State::Nok,
    }
}

/// Resolve: postprocess text to find email autolink literals.
pub fn resolve(tokenizer: &mut Tokenizer) {
    tokenizer.map.consume(&mut tokenizer.events);

    let mut index = 0;
    let mut links = 0;

    while index < tokenizer.events.len() {
        let event = &tokenizer.events[index];

        if event.kind == Kind::Enter {
            if event.name == Name::Link {
                links += 1;
            }
        } else {
            if event.name == Name::Data && links == 0 {
                let slice = Slice::from_position(
                    tokenizer.parse_state.bytes,
                    &Position::from_exit_event(&tokenizer.events, index),
                );
                let bytes = slice.bytes;
                let mut byte_index = 0;
                let mut replace = Vec::new();
                let mut point = tokenizer.events[index - 1].point.clone();
                let start_index = point.index;
                let mut min = 0;

                while byte_index < bytes.len() {
                    if bytes[byte_index] == b'@' {
                        let mut range = (0, 0, Name::GfmAutolinkLiteralEmail);

                        if let Some(start) = peek_bytes_atext(bytes, min, byte_index) {
                            let (start, kind) = peek_protocol(bytes, min, start);

                            if let Some(end) = peek_bytes_email_domain(
                                bytes,
                                byte_index + 1,
                                kind == Name::GfmAutolinkLiteralXmpp,
                            ) {
                                // Note: normally we’d truncate trailing
                                // punctuation from the link.
                                // However, email autolink literals cannot
                                // contain any of those markers, except for
                                // `.`, but that can only occur if it isn’t
                                // trailing.
                                // So we can ignore truncating while
                                // postprocessing!
                                range = (start, end, kind);
                            }
                        }

                        if range.1 != 0 {
                            byte_index = range.1;

                            // If there is something between the last link
                            // (or `min`) and this link.
                            if min != range.0 {
                                replace.push(Event {
                                    kind: Kind::Enter,
                                    name: Name::Data,
                                    point: point.clone(),
                                    link: None,
                                });
                                point = point
                                    .shift_to(tokenizer.parse_state.bytes, start_index + range.0);
                                replace.push(Event {
                                    kind: Kind::Exit,
                                    name: Name::Data,
                                    point: point.clone(),
                                    link: None,
                                });
                            }

                            // Add the link.
                            replace.push(Event {
                                kind: Kind::Enter,
                                name: range.2.clone(),
                                point: point.clone(),
                                link: None,
                            });
                            point =
                                point.shift_to(tokenizer.parse_state.bytes, start_index + range.1);
                            replace.push(Event {
                                kind: Kind::Exit,
                                name: range.2.clone(),
                                point: point.clone(),
                                link: None,
                            });
                            min = range.1;
                        }
                    }

                    byte_index += 1;
                }

                // If there was a link, and we have more bytes left.
                if min != 0 && min < bytes.len() {
                    replace.push(Event {
                        kind: Kind::Enter,
                        name: Name::Data,
                        point: point.clone(),
                        link: None,
                    });
                    replace.push(Event {
                        kind: Kind::Exit,
                        name: Name::Data,
                        point: event.point.clone(),
                        link: None,
                    });
                }

                // If there were links.
                if !replace.is_empty() {
                    tokenizer.map.add(index - 1, 2, replace);
                }
            }

            if event.name == Name::Link {
                links -= 1;
            }
        }

        index += 1;
    }
}

/// Move back past atext.
///
/// Moving back is only used when postprocessing text: so for the email
/// address algorithm.
///
/// ```markdown
/// > | a contact@example.org b
///              ^-- from
///       ^-- to
/// ```
fn peek_bytes_atext(bytes: &[u8], min: usize, end: usize) -> Option<usize> {
    let mut index = end;

    // Take simplified atext.
    // See `email_atext` in `autolink.rs` for a similar algorithm.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L301>.
    while index > min
        && matches!(bytes[index - 1], b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'_' | b'a'..=b'z')
    {
        index -= 1;
    }

    // Do not allow a slash “inside” atext.
    // The reference code is a bit weird, but that’s what it results in.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L307>.
    // Other than slash, every preceding character is allowed.
    if index == end || (index > min && bytes[index - 1] == b'/') {
        None
    } else {
        Some(index)
    }
}

/// Move back past a `mailto:` or `xmpp:` protocol.
///
/// Moving back is only used when postprocessing text: so for the email
/// address algorithm.
///
/// ```markdown
/// > | a mailto:contact@example.org b
///              ^-- from
///       ^-- to
/// ```
fn peek_protocol(bytes: &[u8], min: usize, end: usize) -> (usize, Name) {
    let mut index = end;

    if index > min && bytes[index - 1] == b':' {
        index -= 1;

        // Take alphanumerical.
        while index > min && matches!(bytes[index - 1], b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') {
            index -= 1;
        }

        let slice = Slice::from_indices(bytes, index, end - 1);
        let name = slice.as_str().to_ascii_lowercase();

        if name == "xmpp" {
            return (index, Name::GfmAutolinkLiteralXmpp);
        } else if name == "mailto" {
            return (index, Name::GfmAutolinkLiteralMailto);
        }
    }

    (end, Name::GfmAutolinkLiteralEmail)
}

/// Move past an email domain.
///
/// Peeking like this is only used when postprocessing text: so for the
/// email address algorithm.
///
/// ```markdown
/// > | a contact@example.org b
///               ^-- from
///                          ^-- to
/// ```
fn peek_bytes_email_domain(bytes: &[u8], start: usize, xmpp: bool) -> Option<usize> {
    let mut index = start;
    let mut dot = false;

    // Move past “domain”.
    // The reference code is a bit overly complex as it handles the `@`,
    // of which there may be just one.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L318>
    while index < bytes.len() {
        match bytes[index] {
            // Alphanumerical, `-`, and `_`.
            b'-' | b'0'..=b'9' | b'A'..=b'Z' | b'_' | b'a'..=b'z' => {}
            b'/' if xmpp => {}
            // Dot followed by alphanumerical (not `-` or `_`).
            b'.' if index + 1 < bytes.len()
                && matches!(bytes[index + 1], b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') =>
            {
                dot = true;
            }
            _ => break,
        }

        index += 1;
    }

    // The domain must not be empty, must include a dot, and must end in an
    // alphabetical character or `.`.
    // Source: <https://github.com/github/cmark-gfm/blob/ef1cfcb/extensions/autolink.c#L332>.
    if index > start && dot && matches!(bytes[index - 1], b'.' | b'A'..=b'Z' | b'a'..=b'z') {
        Some(index)
    } else {
        None
    }
}