//! Roxygen2 doc-comment recognition, sub-tokenization, and block structure.
//!
//! A roxygen line is a comment whose text matches `^#+'` (one-or-more `#`
//! followed by a single `'`). Such lines are sub-tokenized—rather than emitted
//! as one `COMMENT` token—so their structure (marker, tags, arguments, prose)
//! lives directly in the lossless CST. The sub-tokens' texts tile the line's
//! bytes exactly, preserving the round-trip invariant.
//!
//! The work is split into three phases, one module each, plus this parent which
//! owns the macro-classification layer (the `\macro` arity/verbatim tables that
//! both the lexer and the structure builder consult) and the shared
//! balanced-delimiter scan:
//!
//! * [`lex`] — sub-lexing: block-mode resolution + the per-line tokenizer
//! (text → `Vec<Token>`).
//! * [`group`] — block grouping: wrapping a run of lines in a `ROXYGEN_BLOCK`
//! and laying out its section/paragraph skeleton (`Vec<Token>` → `Vec<Event>`).
//! * [`build`] — structure building: the block-level Rd-macro and markdown
//! constructs (`\itemize{…}`, `\describe{…}`, markdown lists) dispatched from
//! the grouper.
pub use emit_roxygen_block;
pub use ;
/// Inline Rd macros whose `{…}` content is **verbatim** (`VERB` in
/// `tools::parse_Rd`): the body is raw text and nested `\macro` markup is *not*
/// parsed. Confirmed against `parse_Rd` (see the projector's `rd_macros` work).
/// Latexlike macros (`\code`, `\emph`, `\strong`, `\link`, …) are everything
/// else --- their content is sub-parsed, so nested macros become child nodes.
const VERBATIM_RD_MACROS: & = &;
/// Whether the macro named `name` (without the leading `\`) takes verbatim
/// `{…}` content. Used both when building the CST (don't recurse into a verbatim
/// body) and when projecting it (emit `VERB`, not coalesced `TEXT`).
pub
/// Whether argument group `index` (0-based) of the macro named `name` takes
/// **verbatim** `{…}` content (`VERB` in `parse_Rd`: raw text, no nested markup).
/// A fully-verbatim macro (`\url`/`\verb`/…) is verbatim in its only argument;
/// `\href{url}{text}` is verbatim in its *first* argument (the URL) but latexlike
/// in its *second* (the link text, which is sub-parsed). Drives both the tree
/// builder (don't recurse into a verbatim arg) and, via the emitted `VERB` leaf,
/// the projector. Confirmed against `parse_Rd`: `\href`'s first arg is `VERB`.
pub
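// Illustrative sketch only, not this module's real tables: a per-argument
// verbatim check covering just the macros named in the comments above
// (`\url`, `\verb`, `\href`, `\figure`). The `_sketch` name and the partial
// match arms are assumptions; the real verbatim list is longer.
fn is_verbatim_rd_arg_sketch(name: &str, index: usize) -> bool {
    match name {
        // Fully verbatim macros: verbatim in their only argument.
        "url" | "verb" => index == 0,
        // `\href{url}{text}`: the URL is verbatim, the link text is sub-parsed.
        "href" => index == 0,
        // `\figure{path}{caption}`: both arguments are verbatim.
        "figure" => index < 2,
        // Latexlike macros (`\code`, `\emph`, ...) have their content sub-parsed.
        _ => false,
    }
}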
/// Inline Rd macros that take **two** adjacent `{…}` argument groups, the way
/// `tools::parse_Rd` does: `\item{term}{description}` (in `\describe`/`\value`/
/// `\arguments`) and `\tabular{format}{content}`. A one-argument macro like
/// `\code` consumes only its first group, so a trailing `\code{x}{y}`'s `{y}`
/// stays literal --- the arity is per macro. Also `\href{url}{text}`, whose first
/// argument is verbatim, and `\figure{path}{caption}` (both args verbatim --- see
/// [`is_verbatim_rd_arg`]). Extensible (`\section`/… are
/// future targets, several of which surface as block macros instead). A braceless
/// `\item` (under `\itemize`/`\enumerate`) never reaches here: it has no `{`, so
/// it is not a macro token at all.
///
/// These are also the macros whose `{…}` arguments `parse_Rd` models as *list*
/// wrappers (so a multi-atom argument projects to a `(GRP …)`), as opposed to
/// latexlike macros (`\code`, `\emph`, …) whose single argument's content is
/// inlined directly. The projector keys its GRP rule on this set.
const TWO_ARG_RD_MACROS: & = &;
/// Whether the macro named `name` (without the leading `\`) takes two `{…}`
/// argument groups. Drives the lexer (consume the second group into one token),
/// the tree builder (emit both groups as children), and the projector (each
/// group is a list argument --- a multi-atom one becomes a `(GRP …)`).
pub
/// Split a GFM table row into its cells, honoring backslash-escaped pipes. One
/// optional leading and one optional trailing **unescaped** `|` are stripped (the
/// GFM leading/trailing pipe), then the remainder is split on each unescaped `|`.
/// Cells are returned untrimmed (callers trim). An escaped `\|` stays inside its
/// cell. Shared by the recognition gate (cell counting) and the projector (cell
/// rendering) so the two never disagree on where a cell begins.
///
/// GFM counts pipes **without** honoring code spans — a `|` inside `` `…` `` still
/// splits a cell — so this deliberately does not track backticks. That is what
/// makes `| ` + "`a|b`" + ` | y |` fail the header/delimiter cell-count match and
/// stay prose (verified against roxygen2).
pub
/// The number of cells in a GFM table row (see [`split_table_row_cells`]). The
/// header row and the delimiter row form a table only when these are equal.
pub
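// Illustrative sketch of the row splitting described above, not the module's
// actual implementation. The `_sketch` names and the `Vec<&str>` return type
// are assumptions, and the caller is assumed to pass the row with surrounding
// whitespace already removed (so an optional leading `|` sits at byte 0).

// True when the `|` at byte `idx` is preceded by an odd run of `\`.
fn is_escaped_pipe_sketch(bytes: &[u8], idx: usize) -> bool {
    let backslashes = bytes[..idx]
        .iter()
        .rev()
        .take_while(|&&b| b == b'\\')
        .count();
    backslashes % 2 == 1
}

// Strip one optional leading and one optional trailing unescaped `|`, then
// split on every remaining unescaped `|`. Cells are returned untrimmed.
fn split_table_row_cells_sketch(line: &str) -> Vec<&str> {
    let bytes = line.as_bytes();
    let mut start = 0;
    let mut end = bytes.len();
    // A pipe at byte 0 has nothing before it, so it can never be escaped.
    if bytes.first() == Some(&b'|') {
        start = 1;
    }
    // Strip the trailing pipe only if it is unescaped and is not the same
    // byte as the leading pipe.
    if end > start && bytes[end - 1] == b'|' && !is_escaped_pipe_sketch(bytes, end - 1) {
        end -= 1;
    }
    let mut cells = Vec::new();
    let mut cell_start = start;
    for i in start..end {
        // Backticks are deliberately not tracked: a `|` in a code span splits.
        if bytes[i] == b'|' && !is_escaped_pipe_sketch(bytes, i) {
            cells.push(&line[cell_start..i]);
            cell_start = i + 1;
        }
    }
    cells.push(&line[cell_start..end]);
    cells
}

// The cell count is just the length of the split.
fn table_row_cell_count_sketch(line: &str) -> usize {
    split_table_row_cells_sketch(line).len()
}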
/// Whether `line` is a GFM table **delimiter row**: it contains at least one
/// unescaped `|` (so a bare `---`, which is a setext underline, is *not* a
/// single-column table) and every cell (trimmed) is `:?-+:?` (optional leading
/// colon, one or more hyphens, optional trailing colon). The pipe requirement
/// mirrors cmark-gfm, which treats a pipeless dash run as a setext heading.
pub
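// Illustrative sketch of the delimiter-row test described above; the `_sketch`
// names are hypothetical and this is not the module's actual code. It assumes
// the caller passes the row with surrounding whitespace already trimmed.

// A delimiter cell is `:?-+:?`: optional colon, one or more hyphens,
// optional colon. `cell` is expected to be trimmed already.
fn is_delimiter_cell_sketch(cell: &str) -> bool {
    let cell = cell.strip_prefix(':').unwrap_or(cell);
    let cell = cell.strip_suffix(':').unwrap_or(cell);
    !cell.is_empty() && cell.bytes().all(|b| b == b'-')
}

fn is_table_delimiter_row_sketch(line: &str) -> bool {
    // Shortcut used by this sketch: a backslash always lands inside some cell
    // and can never be part of a valid delimiter cell, so any backslash rules
    // the row out. With no backslashes present, every `|` is unescaped.
    if line.contains('\\') || !line.contains('|') {
        return false;
    }
    // Drop the optional leading and trailing pipe, then check each cell.
    let inner = line.strip_prefix('|').unwrap_or(line);
    let inner = inner.strip_suffix('|').unwrap_or(inner);
    inner.split('|').all(|cell| is_delimiter_cell_sketch(cell.trim()))
}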
/// Whether the byte at `idx` (a `|`) is backslash-escaped: preceded by an odd
/// run of `\`.
/// Whether `line` contains an unescaped `|`.
/// Whether `cell` (already trimmed) is a valid delimiter cell: `:?-+:?`.
/// Scan a balanced delimited run starting at `bytes[i] == open`, tracking nesting
/// and skipping Rd backslash escapes (`\}` etc.). Returns the index past the
/// matching `close`, or `None` if it is unbalanced before end of input.
pub
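// Illustrative sketch of the balanced-delimiter scan described above. The
// `_sketch` name and the exact parameter list (`bytes`, `i`, `open`, `close`)
// are assumptions, not the module's confirmed signature.
fn scan_balanced_sketch(bytes: &[u8], i: usize, open: u8, close: u8) -> Option<usize> {
    // The scan is only defined when it starts on the opening delimiter.
    if bytes.get(i) != Some(&open) {
        return None;
    }
    let mut depth = 0usize;
    let mut j = i;
    while j < bytes.len() {
        match bytes[j] {
            // A backslash escapes the next byte, so `\{` and `\}` do not nest.
            b'\\' => j += 1,
            b if b == open => depth += 1,
            b if b == close => {
                depth -= 1;
                if depth == 0 {
                    // Index one past the matching close.
                    return Some(j + 1);
                }
            }
            _ => {}
        }
        j += 1;
    }
    // Ran off the end of input with delimiters still open.
    None
}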
/// The end index of an Rd macro name starting at `bytes[start]` (the byte *after*
/// the leading `\`). An Rd command name is `[A-Za-z][A-Za-z0-9]*`: a leading
/// letter then any letters or digits (e.g. `\linkS4class`). Returns `start` when
/// no valid name begins there (`\\`, `\{`, `\4`, end of input). The single source
/// of truth for where a `\name` ends, shared by the lexer and the tree builder.
pub
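// Illustrative sketch of the name scan described above (`[A-Za-z][A-Za-z0-9]*`).
// The `_sketch` name and the slice-plus-index signature are assumptions.
fn rd_macro_name_end_sketch(bytes: &[u8], start: usize) -> usize {
    // The name must open with an ASCII letter; otherwise no name begins here.
    match bytes.get(start) {
        Some(b) if b.is_ascii_alphabetic() => {}
        _ => return start,
    }
    // Then consume any run of ASCII letters or digits.
    let mut end = start + 1;
    while end < bytes.len() && bytes[end].is_ascii_alphanumeric() {
        end += 1;
    }
    end
}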
/// The built-in Rd macro names `tools::parse_Rd` recognizes (without the leading
/// `\`). A `\word` *not* in this set is an **unknown** macro: `parse_Rd` tags it
/// `UNKNOWN` (warning "unknown macro '\word'"), even brace-less. Used to gate
/// brace-less macro recognition in the lexer (only an unknown name is carved as a
/// token; a brace-less known name stays literal prose, and its name-only/expanded
/// rendering is still backlog) and the projector's name-only classification (a known
/// list child like `\item`/`\cr` → `(\name)`, an unknown one → `(UNKNOWN …)`).
///
/// The set is parse_Rd's static keyword table, verified against R 4.5; it
/// deliberately excludes package/user-defined macros (`\CRANpkg`, `\doi`, …),
/// which `parse_Rd` *expands* rather than parses (out of scope for a static
/// projector --- they surface as faithful divergences).
const KNOWN_RD_MACROS: & = &;
/// Whether `name` (without the leading `\`) is a built-in Rd macro `parse_Rd`
/// recognizes. The single source of truth for the known/unknown split, shared by
/// the lexer (gate brace-less recognition) and the projector (name-only → `(\name)`
/// vs `(UNKNOWN …)`). See [`KNOWN_RD_MACROS`].
pub
/// Rd macros whose `{…}` content roxygen2 **protects** from the markdown parser
/// (`escaped_for_md` in roxygen2's `R/markdown-escaping.R`): under `@md`,
/// `escape_rd_for_md` swaps the whole `\tag{…}` span out for a placeholder before
/// running cmark, so markdown inside such a macro stays literal Rd
/// (`\code{*x*}` → `\code{*x*}`, not `\code{\emph{x}}`). Every *other* macro keeps
/// only its backslash-word as literal text while its argument **is** markdown-
/// processed (`\emph{*x*}` → `\emph{\emph{x}}`), so the projector resolves the arg
/// of a non-fragile, known, single-argument macro as a markdown inline run.
const FRAGILE_FOR_MD_RD_MACROS: & = &;
/// Whether the macro named `name` (without the leading `\`) has its `{…}` content
/// **protected** from markdown under `@md` (roxygen2's `escaped_for_md`). A fragile
/// macro keeps its argument literal; a non-fragile one has it markdown-processed.
/// See [`FRAGILE_FOR_MD_RD_MACROS`].
pub
/// Resolve a bare prose `content` string as a `@md` markdown **inline run** and
/// return the resulting `ROXYGEN_PARAGRAPH` node, whose children are the resolved
/// inline elements (text, emphasis/strong nodes, links, code spans, nested Rd
/// macros). Drives the projector's translation of a non-fragile Rd macro's argument
/// under `@md` (`\emph{*x*}` → `\emph{\emph{x}}`): the projector slices out the raw
/// argument text and feeds it here, reusing the **real** inline pass (the
/// delimiter-stack arena) rather than a second markdown scanner — so nesting,
/// links, and code spans resolve exactly as in ordinary `@md` prose. A nested
/// fragile macro stays an opaque `ROXYGEN_RD_MACRO` token here; the projector keeps
/// its argument literal by recursing with the same fragility check.
pub
/// One piece of a **structural** Rd-macro argument fed to
/// [`resolve_md_inline_pieces`]: either raw prose `Text` (markdown-lexed) or a
/// pre-parsed nested `Macro` (its raw `\name{…}`/`\name` source, kept opaque).
pub
/// Resolve a **structural** Rd-macro argument (`\item`/`\tabular`/`\href` under
/// `@md`) as a single markdown inline run from its already-carved `pieces`,
/// returning the resolved `ROXYGEN_PARAGRAPH` node (same shape as
/// [`resolve_md_inline`]).
///
/// roxygen2 markdown-processes a structural argument as **one** cmark run: a nested
/// Rd macro is opaque text to cmark (reconstituted afterward), so an emphasis or
/// link span crosses it. Re-lexing the raw argument string cannot reproduce this
/// faithfully — the prose fragment lexer leaves a *brace-less* known macro
/// (`\tab`/`\cr`, the table separators) literal. Instead each pre-parsed macro
/// child (carved by the block-macro grouper) is emitted as one opaque
/// `RoxygenRdMacro` token, which [`build_rd_macro`](crate::parser::tree_builder)
/// re-expands into a faithful node (a brace-less `\tab` → a name-only `\tab` node).
/// The prose pieces between them lex as ordinary markdown fragments, so the
/// delimiter-stack arena spans emphasis across the macros exactly as cmark does.
pub
/// Wrap `tokens` in a paragraph, run the emphasis/inline pass, and return the
/// resolved `ROXYGEN_PARAGRAPH` node. Shared by [`resolve_md_inline`] and
/// [`resolve_md_inline_pieces`].
/// Length in bytes of the UTF-8 char whose leading byte is `b`.
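// Illustrative sketch of a leading-byte length lookup; the `_sketch` name is
// hypothetical, and the fallback of 1 for a continuation or invalid byte is an
// assumption chosen so that a byte scan always makes progress.
fn utf8_char_len_sketch(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1, // ASCII
        0xC0..=0xDF => 2, // 110xxxxx
        0xE0..=0xEF => 3, // 1110xxxx
        0xF0..=0xF7 => 4, // 11110xxx
        _ => 1,           // continuation or invalid leading byte
    }
}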