# KindaXML Spec + Recovery Rules
This document defines a tolerant, XML-ish annotation DSL designed for LLM output. It is **not XML** and does not aim for a fully-correct DOM. Its purpose is to let a model express structured annotations while remaining **robust to common model formatting errors**.
---
## 1) Goals & Non-Goals
### Goals
* **LLM-friendly syntax**: looks like XML so models naturally emit it.
* **Tolerant parsing**: missing quotes, missing close-tags, and other small errors are recoverable.
* **Annotation-first**: tags primarily **annotate spans** (Option A), with a small amount of “block segmentation” (Option B) via auto-close-on-next-tag.
* **Deterministic recovery**: every malformed input has a predictable parse.
### Non-Goals
* No requirement to preserve XML well-formedness.
* No guarantee of building a faithful nested tree (Option C).
* No strict validation of attribute correctness beyond a simple tokenizer + recovery.
---
## 2) Core Concepts
### 2.1 Tokens
The language recognizes:
* **Start tag**: `<Tag ...>`
* **End tag**: `</Tag>`
* **Self-closing tag**: `<Tag .../>`
* **Text**: any content not part of tags
### 2.2 Tag Names
* `Tag` matches: `[A-Za-z][A-Za-z0-9_\-:.]*`
* Tags are **case-sensitive by default** (configurable).
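
The name rule can be checked directly with Python's `re` module. This is a minimal sketch: the helper names `is_valid_tag_name` and `normalize_tag_name` are illustrative and not part of the spec, and lowercasing is only one assumed way to implement the case-insensitive option.

```python
import re

# Same character set as the pattern in 2.2; the hyphen is placed last
# inside the class so it needs no escaping.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_:.-]*")

def is_valid_tag_name(name: str) -> bool:
    """True if the whole string is a valid tag name."""
    return TAG_NAME.fullmatch(name) is not None

def normalize_tag_name(name: str, case_sensitive: bool = True) -> str:
    """Assumed normalization: fold to lowercase when matching is case-insensitive."""
    return name if case_sensitive else name.lower()
```

With the default setting, `cite` and `Cite` are distinct tags; with `case_sensitive=False` both normalize to `cite`.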
### 2.3 Recognized vs Unrecognized Tags
The parser operates with a configurable set:
* **Recognized tags**: tags the LLM is instructed to use (e.g., `cite`, `claim`, `todo`, `metric`, `risk`, `note`)
* **Unrecognized tags**: everything else
Parser behavior is configurable:
* `unknown_mode = "strip"`: remove unknown tags but keep their inner text (if any)
* `unknown_mode = "passthrough"`: leave unknown tags in-place as literal text
* `unknown_mode = "treat_as_text"`: treat the entire `<...>` as plain text, without attempting attribute parsing, if the tag name isn’t recognized
---
## 3) Grammar (Informal)
This is an *informal grammar* describing intended forms; recovery rules (Section 6) define what happens when input deviates.
```
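Document       := (Text | Element)*
Element        := StartTag Document? EndTag?
                | SelfClosingTag
StartTag       := "<" Name Attrs? ">"
EndTag         := "</" Name ">"
SelfClosingTag := "<" Name Attrs? "/>"
Attrs          := (WS Attr)*
Attr           := Name (WS? "=" WS? Value)?
Value          := '"' (not '"')* '"' | "'" (not "'")* "'" | Unquoted
Unquoted       := (not-WS-and-not-">" and not "/>")+
Name           := [A-Za-z][A-Za-z0-9_\-:.]*
WS             := (space | tab | newline)+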
```
**Important:** While this grammar allows nesting, **the semantic model does not rely on nesting.** Auto-close behavior makes it closer to “flat annotations”.
---
## 4) Semantics: What a Tag Means
### 4.1 Annotation Tag (Option A)
A recognized tag may annotate:
* **Inline span** between `<Tag ...>` and `</Tag>` (if it closes properly), or
* A **recovered span** determined by recovery rules (if it does not close properly).
The output form is typically a sequence of segments:
```json
[
{"text": "some text", "annotations": []},
{"text": "annotated phrase", "annotations": [{"tag": "cite", "attrs": {"id": "3"}}]},
...
]
```
### 4.2 Soft Block Boundary (Option B-lite)
Because “auto-close on encountering another tag” (Section 6.2) is applied by default, tags naturally become **local annotations**. Downstream logic can also treat them as **block delimiters**.
Example: If `<note>` often begins a line, you can interpret it as “this line is a note” even if it doesn’t close.
---
## 5) Attributes
### 5.1 Allowed Forms
Attributes support all of:
* `a="x"`
* `a='x'`
* `a=x` (unquoted)
* `a` (boolean attribute; value is `true`)
* Whitespace around `=` is allowed.
### 5.2 Recovery for Broken Quotes
If a quoted value is opened but the closing quote is missing:
* The value is terminated by the **end of the tag** (`>` or `/>`), **not** by scanning arbitrarily far.
* In practice: `ok='es>` parses as `ok = "es"` and the quote is considered implicitly closed at `>`.
### 5.3 Duplicate Attributes
If the same attribute appears multiple times:
* Default: **last one wins**
* Configurable alternative: first wins / keep list.
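
The attribute forms and recovery rules above can be sketched as a small hand-written scanner. This is an illustrative sketch, not a reference implementation: the function name `parse_attrs` and the `duplicate_policy` parameter are assumptions, and the input is assumed to be the raw text between the tag name and the closing `>` or `/>`, with that delimiter already removed.

```python
from typing import Dict, Union

def parse_attrs(raw: str, duplicate_policy: str = "last") -> Dict[str, Union[str, bool]]:
    """Parse the attribute portion of a tag (hypothetical helper)."""
    attrs: Dict[str, Union[str, bool]] = {}
    i, n = 0, len(raw)
    while i < n:
        while i < n and raw[i].isspace():          # skip whitespace between attributes
            i += 1
        if i >= n:
            break
        start = i
        while i < n and not raw[i].isspace() and raw[i] != "=":
            i += 1
        name = raw[start:i]
        value: Union[str, bool] = True             # bare attribute means boolean true
        j = i
        while j < n and raw[j].isspace():          # whitespace before "=" is allowed
            j += 1
        if j < n and raw[j] == "=":
            j += 1
            while j < n and raw[j].isspace():      # whitespace after "=" is allowed
                j += 1
            if j < n and raw[j] in "\"'":
                end = raw.find(raw[j], j + 1)
                if end == -1:
                    end = n                        # broken quote: closed at end of tag
                value = raw[j + 1:end]
                j = min(end + 1, n)
            else:
                start = j
                while j < n and not raw[j].isspace():
                    j += 1
                value = raw[start:j]
            i = j
        if not name:
            continue                               # stray "=" with no name: skip it
        if duplicate_policy == "first" and name in attrs:
            continue
        attrs[name] = value                        # default: last one wins
    return attrs
```

For example, `parse_attrs("ok='es")` yields `{"ok": "es"}`, matching the broken-quote case in 5.2.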
---
## 6) Recovery Rules (Deterministic)
Recovery rules are applied in this order.
### 6.1 Tag Boundary Detection
A tag starts at `<` and ends at the first `>` that is not clearly “inside a quoted value”.
Because we tolerate broken quotes, “inside quoted value” is determined with a *bounded rule*:
* If a quote (`'` or `"`) begins inside a tag but no matching closing quote appears before `>`, the quote is **implicitly closed at `>`**.
This prevents runaway parsing across lines.
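
A minimal sketch of this boundary rule follows. It takes the rule literally, which is an assumption about intent: since an open quote is implicitly closed at `>`, the first `>` after `<` always ends the tag, and the quote state only records whether recovery happened. The function name `find_tag_end` and its return shape are illustrative.

```python
from typing import Tuple

def find_tag_end(text: str, start: int) -> Tuple[int, bool]:
    """Return (index of the closing '>', quote_was_implicitly_closed).

    `start` must point at the '<' that opens the tag.
    Returns (-1, False) if no '>' follows.
    """
    quote = None
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == ">":
            # An open quote at this point is implicitly closed here.
            return i, quote is not None
        if quote is None and ch in "\"'":
            quote = ch
        elif ch == quote:
            quote = None
    return -1, False
```

The boolean can feed the optional recovery metadata on the resulting annotation.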
### 6.2 Auto-close on Encountering Another Tag
**Rule:** When a start tag `<A ...>` is open and the parser encounters the next tag start `<B...` (recognized or not, configurable), then `<A>` is **implicitly closed immediately before `<B`**, unless `<A>` was explicitly closed earlier.
This is the key rule that keeps the structure flat and robust.
Notes:
* This applies even if the earlier tag was intended to wrap multiple tags.
* This deliberately sacrifices nesting to preserve determinism.
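
The flattening effect of this rule can be sketched as a single pass that tracks at most one open tag. The code below is a simplified illustration under stated assumptions: it uses a regex instead of a full tokenizer, ignores recognized/unrecognized distinctions, and leaves the open tag untouched when a non-matching end tag appears. The name `close_events` and the event labels are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Simplified tag matcher: optional "/", a name, then everything up to the first ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9_:.-]*)([^>]*)>")

def close_events(text: str) -> List[Tuple[str, str]]:
    """Report how each tag ends: explicit, auto, eof, marker, or stray."""
    events: List[Tuple[str, str]] = []
    open_tag: Optional[str] = None
    for m in TAG.finditer(text):
        is_end = m.group(1) == "/"
        name = m.group(2)
        self_closing = m.group(3).rstrip().endswith("/")
        if is_end:
            if name == open_tag:
                events.append((name, "explicit"))
                open_tag = None
            else:
                events.append((name, "stray"))     # no matching open tag
            continue
        if open_tag is not None:
            events.append((open_tag, "auto"))      # closed just before this tag
            open_tag = None
        if self_closing:
            events.append((name, "marker"))
        else:
            open_tag = name
    if open_tag is not None:
        events.append((open_tag, "eof"))           # still open at end of document
    return events
```

For the nested input `<A>outer <B>inner</B> more</A>`, this reports `A` as auto-closed, `B` as explicitly closed, and the final `</A>` as stray.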
### 6.3 Missing End Tag
If `<A ...>` is opened and no `</A>` appears before:
* end of document, or
* next tag start that forces auto-close (6.2),
then it is treated as an **unclosed tag** and recovered as described in 6.5.
### 6.4 Self-closing Tags
`<A .../>` produces an annotation event with **zero-length span** (or a “marker” annotation), depending on consumer needs:
* Default: emit `A` as a marker at that position.
* Configurable: treat as annotating the next token / next word / until newline.
### 6.5 Annotation Span Selection for Unclosed Tags
For unclosed tags, the parser chooses an annotation target span using one of these strategies (configurable per tag; defaults shown):
**Default (`retro_line`, line-anchored retroactive):**
* Annotate the text on the **current output line** from the most recent emitted newline up to the tag start.
* Optionally trim leading/trailing punctuation/whitespace.
This suits “postfix tags” such as citations, which follow the text they annotate.
**Alternative strategies (supported by config):**
* `forward_until_tag`: annotate from tag end to next tag start / newline.
* `forward_next_token`: annotate the next contiguous token after the tag.
* `noop`: ignore unclosed tags completely (only for markers).
Recommended defaults by tag type:
* `<cite id=...>`: `retro_line`
* `<label ...>`: `forward_next_token` or `retro_line`, depending on style
* `<note>`: `forward_until_tag` or `forward_until_newline`
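
The default line-anchored retroactive strategy can be sketched as a lookup over the text emitted so far. This is a minimal sketch with assumed names: `retro_line_span` is hypothetical, `emitted` is assumed to hold the plain output text up to the unclosed tag, and trimming uses Python's ASCII punctuation set, which a real implementation might widen.

```python
import string
from typing import Tuple

_TRIM = string.whitespace + string.punctuation

def retro_line_span(emitted: str, trim: bool = True) -> Tuple[int, int]:
    """Return (start, end) offsets into `emitted` for the retroactive span."""
    start = emitted.rfind("\n") + 1      # 0 when there is no newline yet
    end = len(emitted)                   # the unclosed tag sits at the end
    if trim:
        while start < end and emitted[start] in _TRIM:
            start += 1
        while end > start and emitted[end - 1] in _TRIM:
            end -= 1
    return start, end

emitted = "Intro.\nWe shipped last week "
s, e = retro_line_span(emitted)
span_text = emitted[s:e]                 # "We shipped last week"
```

Returning offsets rather than text lets the caller split existing segments at the span boundaries.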
### 6.6 Unknown Tags
If a tag name is unknown:
* `passthrough`: keep it as literal text (no structural meaning)
* `strip`: remove the tag syntax but keep its inner text
* `treat_as_text`: treat the entire `<...>` as text and do not attempt attribute parsing
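
The visible effect of the three modes on output text can be sketched as follows. This is an illustration only: `apply_unknown_mode` is a hypothetical helper, recognized tags are left in place for the real parser, and `passthrough` and `treat_as_text` produce the same text here because they differ only in whether attribute parsing is attempted.

```python
import re
from typing import Set

# Matches a start, end, or self-closing tag up to the first ">".
ANY_TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9_:.-]*)[^>]*>")

def apply_unknown_mode(text: str, recognized: Set[str], unknown_mode: str = "strip") -> str:
    """Rewrite unknown tags according to unknown_mode; leave recognized tags alone."""
    def replace(match: "re.Match[str]") -> str:
        if match.group(1) in recognized:
            return match.group(0)        # handled later by the real parser
        if unknown_mode == "strip":
            return ""                    # drop the tag syntax, keep inner text
        return match.group(0)            # passthrough / treat_as_text: literal
    return ANY_TAG.sub(replace, text)
```

With `strip`, `Hello <weird x=1>world</weird>` becomes `Hello world`; with `passthrough` it is returned unchanged.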
### 6.7 Malformed End Tags
If `</X>` is found and there is no matching open start tag currently relevant (because of auto-close policy):
* Treat it as literal text **or** drop it (configurable).
Default: **drop** if `X` is recognized; **passthrough** otherwise.
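
The default can be expressed as a small decision function. This is a sketch under assumptions: `resolve_stray_end_tag` is a hypothetical name, and an explicit `policy` argument is assumed to override the recognized/unrecognized default.

```python
from typing import Optional, Set

def resolve_stray_end_tag(name: str, raw: str, recognized: Set[str],
                          policy: Optional[str] = None) -> str:
    """Return the text to emit for an unmatched end tag (possibly empty)."""
    if policy is None:
        # Default from 6.7: drop recognized closers, pass others through.
        policy = "drop" if name in recognized else "passthrough"
    return "" if policy == "drop" else raw
```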
---
## 7) Escaping and Literal Text (`CDATA` / Alternatives)
### 7.1 Is CDATA “well-known” to LLMs?
Yes. `<![CDATA[ ... ]]>` is a common XML construct, and LLMs trained on web and code corpora are likely to recognize it.
### 7.2 CDATA Support (Recommended)
Support the following literal block:
* Start: `<![CDATA[`
* End: `]]>`
Inside CDATA:
* No tags are parsed; everything is literal text.
If the end delimiter `]]>` is missing:
* Recover by treating CDATA as running to end-of-document.
* (Optional safer recovery) run to the next line that begins with `]]>`.
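
One way to implement this is to split the input into markup and literal chunks before any tag parsing runs. The sketch below uses the end-of-document recovery; the function name `split_cdata` and the `"markup"` / `"literal"` labels are illustrative assumptions.

```python
from typing import List, Tuple

CDATA_START = "<![CDATA["
CDATA_END = "]]>"

def split_cdata(text: str) -> List[Tuple[str, str]]:
    """Split text into ("markup", chunk) and ("literal", chunk) pieces."""
    parts: List[Tuple[str, str]] = []
    pos = 0
    while True:
        start = text.find(CDATA_START, pos)
        if start == -1:
            if pos < len(text):
                parts.append(("markup", text[pos:]))
            return parts
        if start > pos:
            parts.append(("markup", text[pos:start]))
        body_start = start + len(CDATA_START)
        end = text.find(CDATA_END, body_start)
        if end == -1:
            # Missing "]]>": the literal block runs to end of document.
            parts.append(("literal", text[body_start:]))
            return parts
        parts.append(("literal", text[body_start:end]))
        pos = end + len(CDATA_END)
```

Only the `"markup"` chunks are then passed to the tag tokenizer; `"literal"` chunks are emitted as text unchanged.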
### 7.3 Simpler Alternative (Optional)
If you prefer not to implement CDATA, support backslash escaping:
* `\<` means literal `<`
* `\>` means literal `>`
This is simpler but less “standard”.
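
Backslash escaping touches two places: finding the next real tag start, and producing the final text. A minimal sketch of both, with hypothetical helper names (`next_tag_start`, `unescape`) and the assumption that a backslash before any other character is kept as-is:

```python
def next_tag_start(text: str, pos: int = 0) -> int:
    """Index of the next unescaped '<' at or after pos, or -1."""
    i = pos
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in "<>":
            i += 2                       # skip the escaped bracket
            continue
        if text[i] == "<":
            return i
        i += 1
    return -1

def unescape(text: str) -> str:
    """Turn \\< and \\> into literal brackets; leave other backslashes alone."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in "<>":
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```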
---
## 8) Output Model
A recommended canonical output for consumers:
### 8.1 Segmented Text Stream
Return an ordered list of segments, each with:
* `text: str`
* `annotations: List[Annotation]`
Where each annotation has:
* `tag: str`
* `attrs: dict[str, str|bool]`
* optional `confidence` / `recovery_reason` metadata
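
The segment model above maps naturally onto dataclasses. This is a sketch, assuming `confidence` is a float and `recovery_reason` a free-form string (the spec does not fix their types); the example values reproduce the properly closed case from Section 9.1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

@dataclass
class Annotation:
    tag: str
    attrs: Dict[str, Union[str, bool]] = field(default_factory=dict)
    confidence: Optional[float] = None        # assumed type
    recovery_reason: Optional[str] = None     # assumed type

@dataclass
class Segment:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

# Segments for: We shipped <cite id="1">last week</cite>.
segments = [
    Segment("We shipped "),
    Segment("last week", [Annotation("cite", {"id": "1"})]),
    Segment("."),
]
plain_text = "".join(s.text for s in segments)
```

Concatenating the segment texts recovers the input with all tag syntax removed.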
### 8.2 Marker Events (optional)
For self-closing tags or zero-width tags:
* Represent as `{pos, tag, attrs}` events, or
* Convert them into segments with empty `text`.
---
## 9) Examples
### 9.1 Properly Closed
Input:
```
We shipped <cite id="1">last week</cite>.
```
Output:
* “We shipped ” (no ann)
* “last week” (cite id=1)
* “.” (no ann)
### 9.2 Unclosed + Auto-close by Next Tag
Input:
```
We shipped last week <cite id=1> <note>Details...</note>
```
Process:
* `<cite>` is unclosed; next `<note>` forces auto-close.
* Cite applies retroactively to “We shipped last week ” (trim punctuation optional).
### 9.3 Broken Quote in Attribute
Input:
```
<cite id='1, 2>Evidence</cite>
```
Recovery:
* `id` parses as `1, 2`, and the quote is implicitly closed at `>`.
* The inner text “Evidence” is emitted, annotated with cite(id="1, 2").
### 9.4 Unknown Tag Passthrough
Input:
```
Hello <weird x=1>world</weird>
```
With `unknown_mode=passthrough`, the output keeps `<weird x=1>` and `</weird>` as literal text.
### 9.5 CDATA Literal
Input:
```
<note><![CDATA[Use < and > freely here]]></note>
```
No tag parsing inside CDATA.
---
## 10) Failure Cases & Known Limitations (By Design)
### 10.1 Nested Intent Will Flatten
Input:
```
<A>outer <B>inner</B> more</A>
```
Because the parser auto-closes an open tag when it encounters another tag, this input is parsed as follows:
* `<A>` is closed immediately before `<B>`.
* `</A>` may be dropped as unmatched, depending on the stray end tag policy.
**Guidance to LLM:** avoid nesting; prefer sibling tags.
### 10.2 Attribute Edge Weirdness
Input:
```
<tag a="x y z b=2>
```
The recovery rule closes the quote at `>`, so `a` takes the value `x y z b=2`, which is probably not what was intended.
Mitigation:
* Instruct the model to close every quoted attribute value before `>`, and discourage embedding `b=` in values.
### 10.3 Literal “<” Without CDATA/Escape
If CDATA/escaping isn’t used, `<` can start a tag accidentally.
Mitigation:
* Recommend CDATA for code snippets and angle-bracket-heavy text.
### 10.4 Stray `</tag>` closers
Given auto-close, closers often become stray.
Mitigation:
* Default to drop recognized stray closers.
---
## 11) Recommended Authoring Guidelines (for the LLM Prompt)
Tell the LLM:
* Use only recognized tags: `<cite> <note> <risk> <todo> ...`
* Do not nest tags.
* Prefer putting `<cite ...>` at end of line (postfix) if using retroactive mode.
* Use `<![CDATA[ ... ]]>` for code or any text containing `<` `>`.
---
## 12) Configuration Surface (Suggested)
* `recognized_tags: set[str]`
* `unknown_mode: {"strip","passthrough","treat_as_text"}`
* `autoclose_on_any_tag: bool` (recommended: `true`)
* `per_tag_recovery_strategy: dict[tag -> strategy]`, where `strategy` is one of:
  * `retro_line` (default for cite)
  * `forward_until_tag`
  * `forward_until_newline`
  * `forward_next_token`
  * `noop`
* `trim_punctuation: bool`
* `case_sensitive_tags: bool`
* `stray_end_tag_policy: {"drop","passthrough"}`
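
The configuration surface above could be captured in a validated dataclass. This is a sketch, not a prescribed API: the class name `ParserConfig` and the method `strategy_for` are hypothetical, and the defaults for `unknown_mode` and `trim_punctuation` are assumptions, since the spec does not name them.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

UNKNOWN_MODES = {"strip", "passthrough", "treat_as_text"}
STRATEGIES = {"retro_line", "forward_until_tag", "forward_until_newline",
              "forward_next_token", "noop"}
STRAY_POLICIES = {"drop", "passthrough"}

@dataclass
class ParserConfig:
    recognized_tags: Set[str] = field(default_factory=set)
    unknown_mode: str = "strip"               # assumed default
    autoclose_on_any_tag: bool = True
    per_tag_recovery_strategy: Dict[str, str] = field(default_factory=dict)
    trim_punctuation: bool = False            # assumed default
    case_sensitive_tags: bool = True
    stray_end_tag_policy: str = "drop"        # applies to recognized tags

    def __post_init__(self) -> None:
        if self.unknown_mode not in UNKNOWN_MODES:
            raise ValueError(f"unknown_mode must be one of {sorted(UNKNOWN_MODES)}")
        if self.stray_end_tag_policy not in STRAY_POLICIES:
            raise ValueError(f"stray_end_tag_policy must be one of {sorted(STRAY_POLICIES)}")
        for tag, strategy in self.per_tag_recovery_strategy.items():
            if strategy not in STRATEGIES:
                raise ValueError(f"unknown strategy {strategy!r} for tag {tag!r}")

    def strategy_for(self, tag: str) -> str:
        """Per-tag strategy, falling back to the line-anchored retroactive default."""
        return self.per_tag_recovery_strategy.get(tag, "retro_line")
```

Validating at construction time keeps typos in strategy names from silently falling back to a default.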