kindaxml 0.1.0

Close-enough, XML-ish annotation parsing with deterministic recovery for LLM output.
Documentation
# KindaXML (`kindaxml`) — close-enough, XML-ish markup for LLM output

KindaXML is an **XML-inspired annotation DSL** designed for **LLM-generated text**. It keeps the familiar `<tag attr=...>` shape, but the parser is **tolerant**: it recovers from missing end tags, missing quotes, and other common “almost XML” mistakes.

KindaXML is **not XML** (and not meant to be parsed by strict XML parsers). Think: *well-formed-ish*.

---

## Why KindaXML?

LLMs are good at emitting XML-like text, but strict XML breaks easily. KindaXML aims to be:

* **LLM-friendly**: angle brackets and attributes feel natural in prompts.
* **Deterministic recovery**: malformed input still produces predictable output.
* **Annotation-first**: tags annotate spans of text rather than building a complex DOM.
* **Configurable**: recognized tags are whitelisted, unknown tags can be stripped or preserved.

---

## Design: Annotation DSL (Option A) + a pinch of “blocks”

KindaXML’s primary output is a **stream of text segments**, each optionally annotated:

```json
[
  {"text": "We shipped last week", "ann": [{"tag":"cite","attrs":{"id":"1"}}]},
  {"text": ". ", "ann": []},
  {"text": "Details", "ann": [{"tag":"note","attrs":{}}]}
]
```

KindaXML intentionally avoids deep nesting. In fact, it auto-closes open tags when the next tag begins, which keeps structures shallow and robust.

---

## Syntax overview

### Tags

* Start tag: `<tag ...>`
* End tag: `</tag>`
* Self-closing tag: `<tag .../>`

Tag names match:

```
[A-Za-z][A-Za-z0-9_\-:.]*
```

### Attributes

Supported forms:

* `a="x"`
* `a='x'`
* `a=x` (unquoted)
* `a` (boolean attribute; implies `true`)
* Whitespace around `=` is allowed.

---

## Parsing rules (the “close enough” part)

### 1) Tag boundary detection

A tag begins at `<` and ends at the first `>`.

If a quote starts inside the tag but never closes, it is **implicitly closed at `>`**.

Example:

```
<cite id='1,2>text</cite>
```

Parses as:

* `tag = cite`
* `id = "1,2"` (quote recovered)
* inner text = `text`

### 2) Auto-close on encountering another tag

If a start tag is open and the parser encounters the next `<something...>`, the current tag is **implicitly closed immediately before** that next `<`.

This is the core rule that prevents runaway structures.

Example:

```
<A>hello <B>world</B>
```

`<A>` auto-closes before `<B>`.

### 3) Missing end tags are tolerated

If a tag never closes, it’s recovered according to its configured **span strategy** (below).

### 4) Self-closing tags

`<tag .../>` is treated as a **marker annotation** at that position (or optionally “annotate next token”, configurable).

---

## Span strategies (how KindaXML decides what a tag annotates)

KindaXML is annotation-first. Each recognized tag can be configured with a span strategy:

### `inline` (normal XML-ish)

If `<tag> ... </tag>` is present, annotate the inner range.

### `retro_line` (great for citations)

If `<cite ...>` is unclosed, annotate the text on the current line **before** the tag (from last emitted newline to the tag start), optionally trimming punctuation/whitespace.

Example:

```
We shipped last week <cite id=1>.
```

The cite attaches to `We shipped last week` (not the punctuation).

### Other useful strategies (optional)

* `forward_until_tag`: annotate from the end of `<tag ...>` to the next tag start.
* `forward_until_newline`: annotate until newline.
* `forward_next_token`: annotate the next token/word.
* `noop`: ignore tag if unclosed (marker-only tags).

---

## Unknown tags

You instruct the LLM to use a whitelist of recognized tags, but the parser can handle unknown tags in one of three modes:

* `strip` (default-friendly): drop unknown tag markup, keep inner text
* `passthrough`: keep unknown tags as literal text
* `treat_as_text`: don’t parse unknown tags at all; treat `<...>` as text

---

## Escaping / literal text (CDATA support)

KindaXML can support XML’s CDATA form:

* Start: `<![CDATA[`
* End: `]]>`

Inside CDATA, nothing is parsed as tags.

Example:

```xml
<note><![CDATA[
Use < and > freely here. Even <fake tags>.
]]></note>
```

If `]]>` is missing, CDATA runs to end-of-document (recovered).

(If you prefer simpler escaping, you can also support `\<` and `\>` as literals.)

---

## Using the Rust crate

```rust
use kindaxml::{parse, ParserConfig, UnknownMode};

fn main() {
    let mut cfg = ParserConfig::default();
    cfg.recognized_tags = ["cite", "note"].into_iter().map(String::from).collect();
    cfg.case_sensitive_tags = false;
    cfg.unknown_mode = UnknownMode::Strip;

    let input = "We shipped <cite id=1>last week</cite>.";
    let parsed = parse(input, &cfg);

    for segment in parsed.segments {
        println!("{:?} -> {:?}", segment.text, segment.annotations);
    }
}
```

`ParserConfig` exposes toggles for unknown tags, per-tag recovery strategies, case sensitivity, punctuation trimming, and auto-close behavior. The default config is conservative and strips unknown tags.

---

## Examples

Run the runnable demo with `cargo run --example basic` to see the original snippets alongside their parsed segments and markers.

### Closed tag (inline span)

Input:

```xml
We shipped <cite id="1">last week</cite>.
```

Output (conceptual):

* `We shipped ` (no annotations)
* `last week` (annotated: cite{id=1})
* `.` (no annotations)

### Unclosed cite (retro_line)

Input:

```xml
We shipped last week <cite id=1>.
```

Output:

* `We shipped last week` (annotated: cite{id=1})
* `.`
* (tag removed)

### Broken quote recovery

Input:

```xml
<cite id='1, 2>Evidence</cite>
```

Recovered as `id="1,2"`.

### Auto-close on next tag

Input:

```xml
alpha <note>bravo <cite id=9> charlie
```

* `<note>` auto-closes before `<cite ...>`
* `<cite>` is unclosed and recovered by its strategy

---

## Failure cases / limitations (by design)

### Nesting will not behave like XML

KindaXML is not a DOM language. If you try to nest, the “auto-close on next tag” rule will flatten it.

Bad idea:

```xml
<A>outer <B>inner</B> outer</A>
```

KindaXML outcome: `<A>` likely ends before `<B>`, and `</A>` may become stray.

**Guidance:** don’t nest; prefer sibling tags.

### Attribute ambiguity in severely malformed tags

Example:

```xml
<tag a="x y z b=2>
```

KindaXML will recover by closing the quote at `>` and treat the entire remaining text as part of `a`. This is intentional: recovery is bounded to the tag.

**Guidance:** keep attributes simple; use CDATA for messy text.

### Stray end tags

Because auto-close flattens structure, you may get stray `</tag>`. By default, recognized stray end tags are dropped; unknown ones can be passed through (configurable).

---

## Recommended prompting style for LLMs

Tell the model:

* Use only these tags: `<cite> <note> <todo> <risk> ...` (whitelist)
* Do **not** nest tags
* Prefer postfix citations: `... statement <cite id=1>.`
* Use CDATA for code or text with `<`/`>`: `<![CDATA[ ... ]]>`