//! `v2`: a lossless, trivia-bearing token layer.
//!
//! # Why a v2 exists
//!
//! The `v1` pipeline is built for analysis: its lexer emits
//! `Newline` and `Comment` tokens but drops plain whitespace, and
//! its `unparse` regenerates spacing from scratch. That is the
//! right trade for linting and metrics and the wrong one for the two tools
//! planned next:
//!
//! * a **formatter** must read the author's comments and line structure to
//! place them in the output, and must be able to prove "I only changed
//! whitespace";
//! * a **codemod** (e.g. `Get-WmiObject` to `Get-CimInstance`) must rewrite
//! one command and leave every other byte alone, or the diff is unusable.
//!
//! Both need the same foundation: tokens that own every byte of the input.
//! Per the crate's versioning policy this is a breaking change to the token
//! shape, so it ships as a sibling `v2` module; `v1` stays untouched.
//!
//! # The invariant
//!
//! For any input `src`, including malformed input:
//!
//! ```
//! use poshtree::v2::{lex, reconstruct};
//!
//! let src = "ls -la # list\r\n@'\n raw '@ body\n'@\n";
//! assert_eq!(reconstruct(&lex(src).tokens), src);
//! ```
//!
//! Every byte lands in exactly one token's `leading` trivia, `value`, or
//! `trailing` trivia, in source order. [`reconstruct`] is the lossless
//! unparse; the formatter and codemod tools never need the v1 unparser.
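//!
//! The concatenation behind that invariant can be sketched in isolation.
//! The block below is a minimal illustration, not this module's code:
//! `SketchToken` and `reconstruct_sketch` are hypothetical names, and the
//! trivia is assumed to be flattened into plain strings, which the real
//! [`Token`] need not do.
//!
//! ```rust
//! /// Hypothetical stand-in for a v2 token (assumption: trivia is
//! /// flattened to strings here; the real token type may differ).
//! struct SketchToken {
//!     leading: String,
//!     value: String,
//!     trailing: String,
//! }
//!
//! /// Concatenates `leading + value + trailing` over all tokens, in order.
//! fn reconstruct_sketch(tokens: &[SketchToken]) -> String {
//!     let mut out = String::new();
//!     for t in tokens {
//!         out.push_str(&t.leading);
//!         out.push_str(&t.value);
//!         out.push_str(&t.trailing);
//!     }
//!     out
//! }
//! ```
//!
//! Because every byte is owned by exactly one of the three fields, the
//! loop needs no knowledge of token kinds to reproduce the input.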
//!
//! # From tokens to a tree
//!
//! Tokens alone carry the codemod use case (find a token, patch its span).
//! For anything that reasons about structure, `tree::parse_with_tokens`
//! (available when the `v1` feature is also enabled) pairs the v1 AST with
//! these tokens: every node gets a `TokenRange` into the token vector, so
//! `node.unparse_lossless()` returns that node's exact source, trivia and
//! all. It reuses the v1 parser rather than forking the grammar; see the
//! `tree` module for how ranges are recovered from node offsets. The walks
//! come in untyped (`TreeNode::walk`, `walk_with_ancestors`) and typed
//! (`Tree::walk_zipped`) forms; the zipped walk hands a visitor each typed
//! v1 node, its range-bearing mirror, and the ancestor path, which is what
//! a refactoring tool wants.
//!
//! # Formatting
//!
//! [`formatter::format_source`] is a width-aware formatter built on these
//! tokens and verified with the native parser, so it works under `v2`
//! alone. It normalizes indentation, spacing, blank lines, and over-long
//! lines while preserving every token byte-for-byte, and it checks its own
//! output by re-lexing and re-parsing: the result either round-trips to
//! the identical program or the call returns an error instead of damaged
//! source. See the `formatter` module docs for the exact rules.
//!
//! # A native parser
//!
//! `tree` reuses the v1 parser and recovers ranges from node offsets, so it
//! needs the `v1` feature. `parser::parse` is the standalone path: a native
//! recursive-descent parser that consumes v2 tokens and builds an `ast::Node`
//! tree directly, with each node carrying both a byte `Span` and a
//! `TokenRange`. It depends on no `v1` code, so it builds under `v2` alone.
//! Because v2 keeps newlines as trivia, statement boundaries come from
//! `Token::starts_line` and `;`.
//!
//! The grammar tracks v1's: pipelines and `&&`/`||` chains, command-versus-
//! expression dispatch with parameter-argument binding and redirections, every
//! control-flow statement, `function`/`filter`/`workflow`, `class`, `enum`,
//! `using`, `trap`/`data`/`dynamicparam`, `param` blocks, and the full
//! expression layer. A differential test reduces both trees to a label
//! skeleton and asserts the native parser matches the v1 tree shape across a
//! broad corpus, including double-quoted string interpolation parts (`$var`,
//! `${name}`, `$(...)`) and `Add-Type` C# extraction (`[DllImport]` P/Invoke
//! parsing, with constant propagation of a string assigned to a variable). A
//! companion test checks the extracted C# metadata against v1's field by
//! field. The remaining differences are at the lexer, not the parser: the v1
//! and v2 lexers tokenize a few things differently (for example a dotted run
//! such as `a.b.c`).
//!
//! ```
//! use poshtree::v2::{parse, NodeKind};
//!
//! let out = parse("Get-ChildItem -Recurse | Sort-Object Length\n");
//! assert!(out.errors.is_empty());
//! let mut commands = Vec::new();
//! out.script.walk(&mut |n| {
//!     if let NodeKind::Command { name, .. } = &n.kind {
//!         if let NodeKind::BareWord(s) = &name.kind {
//!             commands.push(s.clone());
//!         }
//!     }
//! });
//! assert_eq!(commands, ["Get-ChildItem", "Sort-Object"]);
//! ```
//!
//! # What changed against v1, concretely
//!
//! * v1 emits `Newline` and `Comment` as tokens; v2 has neither. A line
//! break or comment rides as [`Trivia`] on a neighboring token, so the
//! significant stream is free of layout noise.
//! * v2 keeps the plain whitespace v1 drops as [`TriviaKind::Whitespace`];
//! that is what makes reconstruction lossless.
//! * Instead of v1's start offset plus line/column per token, v2 stores
//! full byte [`Span`]s and derives line/column on demand through
//! [`LineIndex`].
//! * v1 decodes `text`, `scope`, and `splat` at lex time; v2 keeps only the
//! raw `value`, so a token is a faithful slice of the source and decoding
//! is the parser's job.
//! * After `--%`, v1's parser re-slices the raw source; v2's lexer emits
//! the operator and one raw [`TokenKind::VerbatimArgs`] token.
//! * A v1 rewrite reprints the whole tree; v2 patches byte spans with
//! [`TextEdit`] and [`apply_edits`], so a change shows up as a minimal
//! diff.
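//!
//! The span-patching idea in the last bullet can be sketched without any
//! crate types. This is an illustration under stated assumptions, not the
//! real [`TextEdit`]/[`apply_edits`] API: `SketchEdit` and `apply_sketch`
//! are hypothetical names, and rejecting overlapping edits is a choice
//! made for this sketch only.
//!
//! ```rust
//! /// Hypothetical byte-span edit: replace `src[start..end]`.
//! struct SketchEdit {
//!     start: usize,
//!     end: usize,
//!     replacement: String,
//! }
//!
//! /// Applies non-overlapping edits and copies every untouched byte
//! /// verbatim. Returns `None` if edits overlap, run past the end of
//! /// `src`, or cut through a UTF-8 character.
//! fn apply_sketch(src: &str, edits: &[SketchEdit]) -> Option<String> {
//!     let mut sorted: Vec<&SketchEdit> = edits.iter().collect();
//!     sorted.sort_by_key(|e| e.start);
//!     let mut out = String::with_capacity(src.len());
//!     let mut cursor = 0;
//!     for e in sorted {
//!         if e.start < cursor || e.end < e.start || e.end > src.len() {
//!             return None;
//!         }
//!         // `str::get` yields `None` on a non-boundary index.
//!         out.push_str(src.get(cursor..e.start)?);
//!         out.push_str(&e.replacement);
//!         cursor = e.end;
//!     }
//!     out.push_str(src.get(cursor..)?);
//!     Some(out)
//! }
//! ```
//!
//! Replacing only bytes `0..13` of `Get-WmiObject Win32_Process # procs`
//! leaves the argument and the comment byte-identical, which is the
//! minimal-diff property a codemod relies on.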
//!
//! Classification agrees with v1 by test rather than by import: v2 carries
//! its own copies of the [`KEYWORDS`](tokens::KEYWORDS) and
//! [`NAMED_OPERATORS`](tokens::NAMED_OPERATORS) tables, and an integration
//! test keeps them equal to v1's, so the two lexers decide `Keyword` vs
//! `Generic` and `Operator` vs `Parameter` identically. The
//! `?.`/`?[`/`??`/`??=` operators and the `.5`-vs-member-access rule also
//! match v1. The lexer, spans, trivia, and edits therefore compile without
//! v1 (build with `default-features = false, features = ["v2"]`); the one
//! v2 component that does use v1 is `tree`, which by design runs the v1
//! parser and is only compiled when both features are on (see its docs for
//! the removal path).
//!
//! # Where v2 lexing deliberately differs from v1
//!
//! Each of these favors the formatter/codemod use case; all are kind/shape
//! differences only, never byte loss:
//!
//! * **Cohesive barewords.** `C:\tmp`, `*.txt`, `/usr/bin/env`, `user@host`
//! are one `Generic` each, where v1 fragments them (a Windows path lexes
//! as four v1 tokens including an `Unknown` for `\`). The v1 parser
//! already glues fragments back by byte adjacency; v2 just does less
//! fragmenting up front. Where v2 still splits (`.\run.ps1`, `a=b`,
//! `192.168.1.1`), adjacent spans carry the same glue signal.
//! * **`-Path:` keeps its colon** as one `Parameter` token; v1 emits
//! `-Path` plus a glued `:` operator the parser re-joins by position.
//! * **`--%` is handled in the lexer.** v1 lexes the rest of the line
//! normally (comments and all) and the parser re-reads the raw source;
//! v2 emits the `--%` operator and a raw [`TokenKind::VerbatimArgs`].
//! * **Richer literals and operators.** Binary `0b1010`, the PowerShell 7
//! numeric suffixes (`u`, `ul`, `n`, ...), `-=` as one token, and `!` as
//! an `Operator` (v1: `Unknown`).
//! * **Unicode identifiers.** `$żółć` is one `Variable`; v1's name scan is
//! ASCII-only.
//! * **A UTF-8 BOM becomes whitespace trivia** instead of being stripped,
//! so reconstruction stays exact ([`lex`] never calls
//! [`strip_bom`](crate::encoding::strip_bom)).
//!
//! # Porting parser code from v1
//!
//! `tok.ty` becomes `tok.kind`; `tok.value` stays `tok.value`. Checks
//! against `TokenType::Newline` become [`Token::starts_line`]. Code that
//! used `tok.pos` arithmetic to detect glued tokens compares
//! `a.span.end == b.span.start` instead. Decoded payloads (`text`,
//! `scope`, `splat`) are derived from `value` on demand.
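//!
//! The adjacency check can be sketched as follows. `SketchSpan`,
//! `is_glued`, and `glued_runs` are hypothetical names for illustration,
//! and the sketch assumes half-open byte spans `[start, end)`, which is
//! what the `a.span.end == b.span.start` comparison implies.
//!
//! ```rust
//! /// Hypothetical half-open byte span `[start, end)`.
//! #[derive(Clone, Copy)]
//! struct SketchSpan {
//!     start: usize,
//!     end: usize,
//! }
//!
//! /// Two tokens are glued when no byte separates them.
//! fn is_glued(a: SketchSpan, b: SketchSpan) -> bool {
//!     a.end == b.start
//! }
//!
//! /// Groups token indices into maximal runs of glued neighbors.
//! fn glued_runs(spans: &[SketchSpan]) -> Vec<std::ops::Range<usize>> {
//!     let mut runs = Vec::new();
//!     let mut start = 0;
//!     for i in 1..=spans.len() {
//!         // Close the current run at the end or at the first gap.
//!         if i == spans.len() || !is_glued(spans[i - 1], spans[i]) {
//!             runs.push(start..i);
//!             start = i;
//!         }
//!     }
//!     runs
//! }
//! ```
//!
//! For `a=b c`, the spans of `a`, `=`, and `b` touch and form one run,
//! while `c` starts a new run after the space.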
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
/// Concatenates `leading + value + trailing` over all tokens: the lossless
/// unparse. Applied to the output of [`lex`], this reproduces the original
/// source byte-for-byte.