kindaxml 0.1.0

Close-enough, XML-ish annotation parsing with deterministic recovery for LLM output.
# KindaXML Spec + Recovery Rules

This document defines a tolerant, XML-ish annotation DSL designed for LLM output. It is **not XML** and does not aim for a fully-correct DOM. Its purpose is to let a model express structured annotations while remaining **robust to common model formatting errors**.

---

## 1) Goals & Non-Goals

### Goals

* **LLM-friendly syntax**: looks like XML so models naturally emit it.
* **Tolerant parsing**: missing quotes, missing close-tags, and other small errors are recoverable.
* **Annotation-first**: tags primarily **annotate spans** (Option A), with a small amount of “block segmentation” (Option B) via auto-close-on-next-tag.
* **Deterministic recovery**: every malformed input has a predictable parse.

### Non-Goals

* No requirement to preserve XML well-formedness.
* No guarantee of building a faithful nested tree (Option C).
* No strict validation of attribute correctness beyond a simple tokenizer + recovery.

---

## 2) Core Concepts

### 2.1 Tokens

The language recognizes:

* **Start tag**: `<Tag ...>`
* **End tag**: `</Tag>`
* **Self-closing tag**: `<Tag .../>`
* **Text**: any content not part of tags
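
As a rough illustration of the four token kinds, the following Python sketch splits input with a single regular expression built on the name pattern from Section 2.2. The `tokenize` function and the token labels are hypothetical names chosen for this example, and the sketch deliberately ignores the quote recovery and CDATA handling described in later sections.

```python
import re

# "<", optional "/", a tag name, then anything up to the first ">".
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9_\-:.]*)([^>]*)>")

def tokenize(text: str) -> list:
    """Split text into ("text" | "start" | "end" | "self_closing", value) pairs."""
    tokens = []
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        closing, name, rest = m.groups()
        if closing:
            kind = "end"
        elif rest.endswith("/"):
            kind = "self_closing"
        else:
            kind = "start"
        tokens.append((kind, name))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens
```

A `<` that is not followed by a valid tag name (as in `1 < 2`) never matches the pattern, so it stays part of the surrounding text token.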

### 2.2 Tag Names

* `Tag` matches: `[A-Za-z][A-Za-z0-9_\-:.]*`
* Tags are **case-sensitive by default** (configurable).
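
The name pattern and the case-sensitivity switch can be expressed in a few lines of Python. This is a minimal sketch; `is_valid_tag_name` and `same_tag` are illustrative helper names, not part of a documented interface.

```python
import re

# Direct transcription of the tag-name pattern given above.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_\-:.]*")

def is_valid_tag_name(name: str) -> bool:
    """True if the whole string is a valid tag name."""
    return TAG_NAME.fullmatch(name) is not None

def same_tag(a: str, b: str, case_sensitive: bool = True) -> bool:
    """Compare two tag names, honoring the case-sensitivity setting."""
    return a == b if case_sensitive else a.lower() == b.lower()
```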

### 2.3 Recognized vs Unrecognized Tags

The parser operates with a configurable set:

* **Recognized tags**: tags the LLM is instructed to use (e.g., `cite`, `claim`, `todo`, `metric`, `risk`, `note`, etc.)
* **Unrecognized tags**: everything else

Parser behavior is configurable:

* `unknown_mode = "strip"`: remove unknown tags but keep their inner text (if any)
* `unknown_mode = "passthrough"`: leave unknown tags in-place as literal text
* `unknown_mode = "treat_as_text"`: treat `<...>` as plain text if the tag name isn’t recognized

---

## 3) Grammar (Informal)

This is an *informal grammar* describing intended forms; recovery rules (Section 6) define what happens when input deviates.

```
Document  := (Text | Element)*

Element   := StartTag Document? EndTag?
          | SelfClosingTag

StartTag  := "<" Name Attrs? ">"
EndTag    := "</" Name ">"
SelfClosingTag := "<" Name Attrs? "/>"

Attrs     := (WS Attr)*
Attr      := Name (WS? "=" WS? Value)?
Value     := Quoted | Unquoted
Quoted    := "'" (not-quote-or-recovered)* "'"  |  '"' (not-quote-or-recovered)* '"'
Unquoted  := (any char except WS, ">", or the sequence "/>")+
```

**Important:** While this grammar allows nesting, **the semantic model does not rely on nesting.** Auto-close behavior makes it closer to “flat annotations”.

---

## 4) Semantics: What a Tag Means

### 4.1 Annotation Tag (Option A)

A recognized tag may annotate:

* **Inline span** between `<Tag ...>` and `</Tag>` (if it closes properly), or
* A **recovered span** determined by recovery rules (if it does not close properly).

The output form is typically a sequence of segments:

```json
[
  {"text": "some text", "ann": []},
  {"text": "annotated phrase", "ann": [{"tag": "cite", "attrs": {"id": "3"}}]},
  ...
]
```

### 4.2 Soft Block Boundary (Option B-lite)

Because “auto-close on encountering another tag” is always applied, tags naturally become **local annotations** and can also be treated as **block delimiters** in downstream logic if you wish.

Example: If `<note>` often begins a line, you can interpret it as “this line is a note” even if it doesn’t close.

---

## 5) Attributes

### 5.1 Allowed Forms

Attributes support all of:

* `a="x"`
* `a='x'`
* `a=x` (unquoted)
* `a` (boolean attribute; value is `true`)
* Whitespace around `=` is allowed.

### 5.2 Recovery for Broken Quotes

If a quoted value is opened but the closing quote is missing:

* The value is terminated by the **end of the tag** (`>` or `/>`), **not** by scanning arbitrarily far.
* In practice: `ok='es>` parses as `ok = "es"` and the quote is considered implicitly closed at `>`.
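
The allowed forms in 5.1 and the broken-quote rule above can be sketched together in Python. The function below is an illustration under stated assumptions: `parse_attrs` is a hypothetical name, and it expects to receive only the attribute text, with the tag name, the closing `>`, and any trailing `/` already removed by the caller.

```python
def parse_attrs(raw: str) -> dict:
    """Parse attribute text such as ' id="3" n=5 draft' into a dict."""
    attrs = {}
    i, n = 0, len(raw)

    def skip_ws(k: int) -> int:
        while k < n and raw[k].isspace():
            k += 1
        return k

    while True:
        i = skip_ws(i)
        if i >= n:
            return attrs
        # Attribute name runs up to whitespace or "=".
        start = i
        while i < n and not raw[i].isspace() and raw[i] != "=":
            i += 1
        name = raw[start:i]
        j = skip_ws(i)                      # whitespace before "=" is allowed
        if j >= n or raw[j] != "=":
            attrs[name] = True              # boolean attribute
            continue
        j = skip_ws(j + 1)                  # whitespace after "=" is allowed
        if j < n and raw[j] in "'\"":
            end = raw.find(raw[j], j + 1)
            if end == -1:
                end = n                     # missing quote: value ends at end of tag
            value, i = raw[j + 1:end], end + 1
        else:
            start = j
            while j < n and not raw[j].isspace():
                j += 1
            value, i = raw[start:j], j
        if name:
            attrs[name] = value             # a later duplicate overwrites an earlier one
```

Because the input is already bounded by the tag's `>`, a missing closing quote can never consume text beyond the tag.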

### 5.3 Duplicate Attributes

If the same attribute appears multiple times:

* Default: **last one wins**
* Configurable alternative: first wins / keep list.
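
One possible way to express the three duplicate policies in Python is shown below. The policy strings `"last"`, `"first"`, and `"list"` are names assumed for this sketch; the spec does not fix their spelling.

```python
def merge_attrs(pairs, policy: str = "last") -> dict:
    """Combine (name, value) pairs in document order under a duplicate policy."""
    merged = {}
    for name, value in pairs:
        if policy == "last":
            merged[name] = value                      # later value overwrites
        elif policy == "first":
            merged.setdefault(name, value)            # earlier value is kept
        elif policy == "list":
            merged.setdefault(name, []).append(value) # keep every value
        else:
            raise ValueError(f"unknown duplicate policy: {policy!r}")
    return merged
```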

---

## 6) Recovery Rules (Deterministic)

Recovery rules are applied in this order.

### 6.1 Tag Boundary Detection

A tag starts at `<` and ends at the first `>` that is not clearly “inside a quoted value”.

Because we tolerate broken quotes, “inside quoted value” is determined with a *bounded rule*:

* If a quote (`'` or `"`) begins inside a tag but no matching closing quote appears before `>`, the quote is **implicitly closed at `>`**.

This prevents runaway parsing across lines.
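
A minimal Python sketch of this bounded scan follows. `find_tag_end` is a hypothetical helper; it reports both where the tag ends and whether a quote had to be closed implicitly, which a consumer could record as recovery metadata.

```python
def find_tag_end(text: str, start: int):
    """Return (index of the closing '>', implicit_quote_close) for the tag at text[start].

    Returns (-1, False) when no '>' follows the opening '<'.
    """
    if text[start] != "<":
        raise ValueError("start must point at '<'")
    open_quote = None
    for i in range(start + 1, len(text)):
        ch = text[i]
        if open_quote is None:
            if ch in "'\"":
                open_quote = ch          # a quoted value begins
            elif ch == ">":
                return i, False          # ordinary end of tag
        elif ch == open_quote:
            open_quote = None            # quoted value closed normally
        elif ch == ">":
            return i, True               # no matching quote before '>': close it here
    return -1, False
```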

### 6.2 Auto-close on Encountering Another Tag

**Rule:** When a start tag `<A ...>` is open and the parser encounters the next tag start `<B...` (recognized or not, configurable), then `<A>` is **implicitly closed immediately before `<B`**, unless `<A>` was explicitly closed earlier.

This is the key rule that keeps the structure flat and robust.

Notes:

* This applies even if the earlier tag was intended to wrap multiple tags.
* This deliberately sacrifices nesting to preserve determinism.
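
The flattening effect of this rule can be seen in a short Python sketch. It is a simplification under stated assumptions: `flat_spans` is a hypothetical name, tags are found with a plain regular expression (no quote recovery), every tag closes the currently open one, and the reported text is simply what lies between the start tag and the point where it was closed. Which span an unclosed tag finally annotates is a separate decision covered in 6.5.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9_\-:.]*)[^>]*>")

def flat_spans(text: str) -> list:
    """Return (tag, text_until_close, explicitly_closed) for each start tag."""
    spans = []
    open_tag = None                      # (name, index where its content begins)
    for m in TAG.finditer(text):
        is_end, name = m.group(1) == "/", m.group(2)
        if open_tag is not None:
            open_name, content_start = open_tag
            explicit = is_end and name == open_name
            spans.append((open_name, text[content_start:m.start()], explicit))
            open_tag = None              # at most one tag is ever open
        if not is_end and not m.group(0).endswith("/>"):
            open_tag = (name, m.end())
    if open_tag is not None:             # still open at end of document
        spans.append((open_tag[0], text[open_tag[1]:], False))
    return spans
```

Tracking a single `open_tag` instead of a stack is what makes the result flat: there is never more than one candidate for an end tag to match.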

### 6.3 Missing End Tag

If `<A ...>` is opened and no `</A>` appears before either of the following:

* end of document, or
* the next tag start that forces auto-close (6.2),

then `<A>` is treated as an **unclosed tag** and recovered as described in 6.5.

### 6.4 Self-closing Tags

`<A .../>` produces an annotation event with **zero-length span** (or a “marker” annotation), depending on consumer needs:

* Default: emit `A` as a marker at that position.
* Configurable: treat as annotating the next token / next word / until newline.

### 6.5 Annotation Span Selection for Unclosed Tags

For unclosed tags, the parser chooses an annotation target span using one of these strategies (configurable per tag; defaults shown):

**Default (line-anchored retroactive):**

* Annotate the text on the **current output line** from the most recent emitted newline up to the tag start.
* Optionally trim leading/trailing punctuation/whitespace.

This suits “postfix tags” such as citations, where the tag follows the text it annotates.

**Alternative strategies (supported by config):**

* `forward_until_tag`: annotate from tag end to next tag start / newline.
* `forward_next_token`: annotate the next contiguous token after the tag.
* `noop`: ignore unclosed tags completely (only for markers).

Recommended defaults by tag type (strategy names as in Section 12):

* `<cite id=...>`: **`retro_line`** (line-anchored retroactive)
* `<label ...>`: **`forward_next_token`** or **`retro_line`**, depending on style
* `<note>`: **`forward_until_tag`** or **`forward_until_newline`**
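
The default line-anchored retroactive strategy might look like the following in Python. This is a sketch, assuming `text` holds the output emitted so far with earlier tag syntax already removed, and `tag_start` is the position where the unclosed tag begins; `retro_line_span` is a hypothetical helper name.

```python
import string

def retro_line_span(text: str, tag_start: int, trim: bool = True):
    """Return (start, end) of the span on the current line that precedes tag_start."""
    # rfind returns -1 when there is no earlier newline, so the line starts at 0.
    start = text.rfind("\n", 0, tag_start) + 1
    end = tag_start
    if trim:
        strip_chars = string.whitespace + string.punctuation
        while start < end and text[start] in strip_chars:
            start += 1
        while end > start and text[end - 1] in strip_chars:
            end -= 1
    return start, end
```

For example, with the text `"Intro line.\nWe shipped last week <cite id=1>"` and `tag_start` pointing at `<cite`, the returned span covers `We shipped last week`; the earlier line is never touched.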

### 6.6 Unknown Tags

If a tag name is unknown:

* `passthrough`: keep it as literal text (no structural meaning)
* `strip`: remove the tag syntax but keep its inner text
* `treat_as_text`: treat the entire `<...>` as text and do not attempt attribute parsing
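
At the string level, the modes can be illustrated with a small Python pre-pass that leaves recognized tags untouched for the main parser. This is only a sketch: `apply_unknown_mode` is a hypothetical name, and in this simplified form `passthrough` and `treat_as_text` produce the same output, since their difference (whether attribute parsing is attempted) only shows up inside a real tokenizer.

```python
import re

TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9_\-:.]*)[^>]*>")

def apply_unknown_mode(text: str, recognized: set, mode: str = "strip") -> str:
    """Rewrite unknown tags according to the mode; recognized tags are left as-is."""
    if mode not in {"strip", "passthrough", "treat_as_text"}:
        raise ValueError(f"unknown mode: {mode!r}")

    def replace(m):
        if m.group(1) in recognized:
            return m.group(0)            # leave for the main parser
        if mode == "strip":
            return ""                    # drop tag syntax, keep inner text
        return m.group(0)                # keep the tag as literal text

    return TAG.sub(replace, text)
```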

### 6.7 Malformed End Tags

If `</X>` is found and there is no matching open start tag currently relevant (because of auto-close policy):

* Treat it as literal text **or** drop it (configurable).
  Default: **drop** if `X` is recognized; **passthrough** otherwise.
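
The default described above reduces to a small decision function. The Python sketch below uses hypothetical names (`stray_end_tag_text`, `policy`) and returns the literal text to emit for an unmatched end tag, or an empty string when it is dropped.

```python
from typing import Optional

def stray_end_tag_text(name: str, raw: str, recognized: set,
                       policy: Optional[str] = None) -> str:
    """Return the text to emit for an unmatched end tag ("" means dropped)."""
    if policy is None:
        # Default: drop recognized closers, pass unrecognized ones through.
        policy = "drop" if name in recognized else "passthrough"
    if policy not in {"drop", "passthrough"}:
        raise ValueError(f"unknown policy: {policy!r}")
    return "" if policy == "drop" else raw
```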

---

## 7) Escaping and Literal Text (`CDATA` / Alternatives)

### 7.1 Is CDATA “well-known” to LLMs?

Yes. `<![CDATA[ ... ]]>` is a common XML construct, and most LLMs trained on web and code corpora can be expected to recognize it.

### 7.2 CDATA Support (Recommended)

Support the following literal block:

* Start: `<![CDATA[`
* End: `]]>`

Inside CDATA:

* No tags are parsed; everything is literal text.

If the end delimiter `]]>` is missing:

* Recover by treating CDATA as running to end-of-document.
* (Optional safer recovery) run to the next line that begins with `]]>`.
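
The CDATA rules, including the run-to-end-of-document recovery, can be sketched in Python as a pre-pass that separates literal chunks from chunks that still need tag parsing. `split_cdata` is a hypothetical name, and the optional "next line that begins with `]]>`" recovery is not implemented here.

```python
CDATA_START, CDATA_END = "<![CDATA[", "]]>"

def split_cdata(text: str) -> list:
    """Split text into (chunk, is_literal) pairs; literal chunks must not be tag-parsed."""
    parts = []
    pos = 0
    while True:
        start = text.find(CDATA_START, pos)
        if start == -1:
            if pos < len(text):
                parts.append((text[pos:], False))
            return parts
        if start > pos:
            parts.append((text[pos:start], False))
        body_start = start + len(CDATA_START)
        end = text.find(CDATA_END, body_start)
        if end == -1:
            # Missing "]]>": treat the rest of the document as literal text.
            parts.append((text[body_start:], True))
            return parts
        parts.append((text[body_start:end], True))
        pos = end + len(CDATA_END)
```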

### 7.3 Simpler Alternative (Backslash Escaping)

If you prefer not to implement CDATA, support backslash escaping:

* `\<` means literal `<`
* `\>` means literal `>`

This is simpler but less “standard”.
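
A minimal Python sketch of backslash escaping is shown below. `tag_starts` and `unescape` are hypothetical helpers, and the sketch assumes a backslash itself has no escape form, since none is defined here.

```python
import re

# A "<" that is not immediately preceded by a backslash may start a tag.
UNESCAPED_LT = re.compile(r"(?<!\\)<")

def tag_starts(text: str) -> list:
    """Positions of every '<' that is allowed to begin a tag."""
    return [m.start() for m in UNESCAPED_LT.finditer(text)]

def unescape(text: str) -> str:
    """Turn the escape sequences for '<' and '>' back into the literal characters."""
    return re.sub(r"\\([<>])", r"\1", text)
```

Tag detection runs on the escaped text first; `unescape` is then applied only to the text segments that are emitted.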

---

## 8) Output Model

A recommended canonical output for consumers:

### 8.1 Segmented Text Stream

Return an ordered list of segments, each with:

* `text: str`
* `annotations: List[Annotation]` (abbreviated to `ann` in the Section 4.1 example)

Where each annotation has:

* `tag: str`
* `attrs: dict[str, str|bool]`
* optional `confidence` / `recovery_reason` metadata

### 8.2 Marker Events (optional)

For self-closing tags or zero-width tags:

* Represent as `{pos, tag, attrs}` events, or
* Convert them into segments with empty `text`.
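
In Python, the output model might be represented with dataclasses along these lines. The class names `Annotation`, `Segment`, and `Marker` are illustrative, and the field types are one reasonable reading of the lists above rather than a fixed schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

@dataclass
class Annotation:
    tag: str
    attrs: Dict[str, Union[str, bool]] = field(default_factory=dict)
    confidence: Optional[float] = None        # optional metadata
    recovery_reason: Optional[str] = None     # optional metadata

@dataclass
class Segment:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

@dataclass
class Marker:
    pos: int                                  # offset in the output text
    tag: str
    attrs: Dict[str, Union[str, bool]] = field(default_factory=dict)

# The "Properly Closed" example from Section 9.1, written out by hand.
segments = [
    Segment("We shipped "),
    Segment("last week", [Annotation("cite", {"id": "1"})]),
    Segment("."),
]
plain_text = "".join(s.text for s in segments)
```

Concatenating the `text` fields in order recovers the plain text with all tag syntax removed.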

---

## 9) Examples

### 9.1 Properly Closed

Input:

```
We shipped <cite id="1">last week</cite>.
```

Output:

* “We shipped ” (no ann)
* “last week” (cite id=1)
* “.” (no ann)

### 9.2 Unclosed + Auto-close by Next Tag

Input:

```
We shipped last week <cite id=1> <note>Details...</note>
```

Process:

* `<cite>` is unclosed; next `<note>` forces auto-close.
* Cite applies retroactively to “We shipped last week ” (trim punctuation optional).

### 9.3 Broken Quote in Attribute

Input:

```
<cite id='1, 2>Evidence</cite>
```

Recovery:

* `id` parses as `1, 2`, with the quote implicitly closed at `>`
* the inner text “Evidence” is emitted, annotated with `cite` (`id` = `1, 2`)

### 9.4 Unknown Tag Passthrough

Input:

```
Hello <weird x=1>world</weird>
```

With `unknown_mode=passthrough`, the output keeps `<weird x=1>` and `</weird>` as literal text.

### 9.5 CDATA Literal

Input:

```
<note><![CDATA[Use < and > freely here]]></note>
```

No tag parsing inside CDATA.

---

## 10) Failure Cases & Known Limitations (By Design)

### 10.1 Nested Intent Will Flatten

Input:

```
<A>outer <B>inner</B> more</A>
```

Because auto-close is always applied when another tag is encountered, the parser will typically:

* close `<A>` immediately before `<B>`
* drop `</A>` as an unmatched closer (the default when `A` is a recognized tag)

**Guidance to LLM:** avoid nesting; prefer sibling tags.

### 10.2 Attribute Edge Weirdness

Input:

```
<tag a="x y z b=2>
```

The recovery rule closes the quote at `>`, so `a` receives the value `x y z b=2` and `b` is never parsed as a separate attribute, which is probably not what was intended.
Mitigation:

* Encourage quoted attributes to end before `>` and discourage embedding `b=` in values.

### 10.3 Literal “<” Without CDATA/Escape

If CDATA/escaping isn’t used, `<` can start a tag accidentally.
Mitigation:

* Recommend CDATA for code snippets and angle-bracket-heavy text.

### 10.4 Stray `</tag>` closers

Given auto-close, closers often become stray.
Mitigation:

* Default to drop recognized stray closers.

---

## 11) Recommended Authoring Guidelines (for the LLM Prompt)

Tell the LLM:

* Use only recognized tags: `<cite> <note> <risk> <todo> ...`
* Do not nest tags.
* Prefer putting `<cite ...>` at end of line (postfix) if using retroactive mode.
* Use `<![CDATA[ ... ]]>` for code or any text containing `<` `>`.

---

## 12) Configuration Surface (Suggested)

* `recognized_tags: set[str]`
* `unknown_mode: {"strip","passthrough","treat_as_text"}`
* `autoclose_on_any_tag: bool` (recommended: `true`)
* `per_tag_recovery_strategy: dict[tag -> strategy]`

  * `retro_line` (default for cite)
  * `forward_until_tag`
  * `forward_until_newline`
  * `forward_next_token`
  * `noop`
* `trim_punctuation: bool`
* `case_sensitive_tags: bool`
* `stray_end_tag_policy: {"drop","passthrough"}`
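
The suggested surface maps naturally onto a Python dataclass. The sketch below is illustrative only: `ParserConfig` and `strategy_for` are hypothetical names, and the defaults chosen for `unknown_mode` and `trim_punctuation` are assumptions, since this document does not state them. The fallback strategy of `retro_line` follows the default described in Section 6.5.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

UNKNOWN_MODES = {"strip", "passthrough", "treat_as_text"}
STRATEGIES = {"retro_line", "forward_until_tag", "forward_until_newline",
              "forward_next_token", "noop"}
STRAY_POLICIES = {"drop", "passthrough"}

@dataclass
class ParserConfig:
    recognized_tags: Set[str] = field(default_factory=set)
    unknown_mode: str = "strip"                 # assumed default
    autoclose_on_any_tag: bool = True
    per_tag_recovery_strategy: Dict[str, str] = field(default_factory=dict)
    trim_punctuation: bool = True               # assumed default
    case_sensitive_tags: bool = True
    stray_end_tag_policy: str = "drop"          # applied to recognized closers

    def __post_init__(self):
        # Reject values outside the enumerations listed above.
        if self.unknown_mode not in UNKNOWN_MODES:
            raise ValueError(f"bad unknown_mode: {self.unknown_mode!r}")
        if self.stray_end_tag_policy not in STRAY_POLICIES:
            raise ValueError(f"bad stray_end_tag_policy: {self.stray_end_tag_policy!r}")
        for tag, strategy in self.per_tag_recovery_strategy.items():
            if strategy not in STRATEGIES:
                raise ValueError(f"bad strategy for {tag!r}: {strategy!r}")

    def strategy_for(self, tag: str) -> str:
        """Recovery strategy for an unclosed tag, falling back to retro_line."""
        return self.per_tag_recovery_strategy.get(tag, "retro_line")
```

Validating in `__post_init__` surfaces a misspelled mode or strategy at construction time instead of partway through a parse.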