KindaXML (kindaxml) — close-enough, XML-ish markup for LLM output
KindaXML is an XML-inspired annotation DSL designed for LLM-generated text. It keeps the familiar <tag attr=...> shape, but the parser is tolerant: it recovers from missing end tags, missing quotes, and other common “almost XML” mistakes.
KindaXML is not XML (and not meant to be parsed by strict XML parsers). Think: well-formed-ish.
Why KindaXML?
LLMs are good at emitting XML-like text, but strict XML breaks easily. KindaXML aims to be:
- LLM-friendly: angle brackets and attributes feel natural in prompts.
- Deterministic recovery: malformed input still produces predictable output.
- Annotation-first: tags annotate spans of text rather than building a complex DOM.
- Configurable: recognized tags are whitelisted; unknown tags can be stripped or preserved.
Design: Annotation DSL (Option A) + a pinch of “blocks”
KindaXML’s primary output is a stream of text segments, each optionally annotated.
KindaXML intentionally avoids deep nesting. In fact, it auto-closes open tags when the next tag begins, which keeps structures shallow and robust.
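One way to picture that output stream is the minimal sketch below. The `Segment` and `Annotation` types and the `plain_text` helper are hypothetical names chosen for illustration; they are not taken from the crate's API.

```rust
use std::collections::BTreeMap;

/// One annotation attached to a span of text (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Annotation {
    pub tag: String,
    pub attrs: BTreeMap<String, String>,
}

/// A run of plain text plus zero or more annotations.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub annotations: Vec<Annotation>,
}

impl Segment {
    /// A segment with no annotations.
    pub fn plain(text: &str) -> Self {
        Segment { text: text.to_string(), annotations: Vec::new() }
    }

    /// A segment carrying a single annotation.
    pub fn tagged(text: &str, tag: &str, attrs: &[(&str, &str)]) -> Self {
        let attrs = attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Segment {
            text: text.to_string(),
            annotations: vec![Annotation { tag: tag.to_string(), attrs }],
        }
    }
}

/// Concatenating every segment's text gives back the markup-free text.
pub fn plain_text(segments: &[Segment]) -> String {
    segments.iter().map(|s| s.text.as_str()).collect()
}
```

Because the structure is flat, a consumer can walk the segments in order and never has to traverse a tree.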
Syntax overview
Tags
- Start tag: <tag ...>
- End tag: </tag>
- Self-closing tag: <tag .../>
Tag names match:
[A-Za-z][A-Za-z0-9_\-:.]*
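The pattern above can be checked without a regex dependency. This is a small sketch; `is_valid_tag_name` is a hypothetical helper name, not part of the crate.

```rust
/// Returns true if `name` matches `[A-Za-z][A-Za-z0-9_\-:.]*`.
pub fn is_valid_tag_name(name: &str) -> bool {
    let mut chars = name.chars();
    // First character must be an ASCII letter.
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() => {}
        _ => return false,
    }
    // Remaining characters: ASCII letters, digits, '_', '-', ':' or '.'.
    chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | ':' | '.'))
}
```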
Attributes
Supported forms:
- a="x"
- a='x'
- a=x (unquoted)
- a (boolean attribute; implies true)
- Whitespace around = is allowed.
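The attribute forms listed above could be handled by a small hand-written scanner. The sketch below is illustrative only: `parse_attrs` is a hypothetical function, and representing a boolean attribute as the string "true" is an assumption about how "implies true" might be encoded.

```rust
/// Parse the attribute part of a tag body, e.g. `id=1 kind = "x" draft`.
/// Hypothetical helper; returns (name, value) pairs in source order.
pub fn parse_attrs(src: &str) -> Vec<(String, String)> {
    let chars: Vec<char> = src.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Skip whitespace between attributes.
        while i < chars.len() && chars[i].is_whitespace() { i += 1; }
        if i >= chars.len() { break; }
        // Attribute name runs up to whitespace or '='.
        let start = i;
        while i < chars.len() && !chars[i].is_whitespace() && chars[i] != '=' { i += 1; }
        let name: String = chars[start..i].iter().collect();
        // Whitespace is allowed before '='.
        let mut j = i;
        while j < chars.len() && chars[j].is_whitespace() { j += 1; }
        if j < chars.len() && chars[j] == '=' {
            j += 1;
            // Whitespace is allowed after '=' as well.
            while j < chars.len() && chars[j].is_whitespace() { j += 1; }
            let value: String = if j < chars.len() && (chars[j] == '"' || chars[j] == '\'') {
                let quote = chars[j];
                j += 1;
                let vstart = j;
                // An unterminated quote simply runs to the end of the tag body.
                while j < chars.len() && chars[j] != quote { j += 1; }
                let v = chars[vstart..j].iter().collect();
                if j < chars.len() { j += 1; } // skip the closing quote
                v
            } else {
                // Unquoted value runs up to the next whitespace.
                let vstart = j;
                while j < chars.len() && !chars[j].is_whitespace() { j += 1; }
                chars[vstart..j].iter().collect()
            };
            if !name.is_empty() { out.push((name, value)); }
            i = j;
        } else {
            // Bare name: boolean attribute (assumed to be stored as "true").
            out.push((name, "true".to_string()));
        }
    }
    out
}
```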
Parsing rules (the “close enough” part)
1) Tag boundary detection
A tag begins at < and ends at the first >.
If a quote starts inside the tag but never closes, it is implicitly closed at >.
Example:
<cite id='1,2>text</cite>
Parses as:
- tag = cite
- id = "1,2" (quote recovered)
- inner text = text
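The boundary rule can be sketched in a few lines. `next_tag` and `quoted_value` are hypothetical helpers; treating a `<` with no following `>` as "no tag" is an assumption, since the rule above does not cover that case.

```rust
/// Find the next tag. Returns (text before it, tag body, text after it).
/// The tag ends at the FIRST '>' after '<', even if a quote is still open.
pub fn next_tag(input: &str) -> Option<(&str, &str, &str)> {
    let lt = input.find('<')?;
    let gt = lt + input[lt..].find('>')?; // assumption: no '>' means no tag
    Some((&input[..lt], &input[lt + 1..gt], &input[gt + 1..]))
}

/// Read a quoted value starting at its opening quote.
/// If the closing quote is missing, the end of the tag body closes it.
/// Input that does not start with a quote is returned unchanged.
pub fn quoted_value(from_quote: &str) -> &str {
    let mut chars = from_quote.chars();
    let quote = match chars.next() {
        Some(q @ ('"' | '\'')) => q,
        _ => return from_quote,
    };
    let rest = chars.as_str();
    match rest.find(quote) {
        Some(end) => &rest[..end],
        None => rest, // implicit close at '>'
    }
}
```

Bounding recovery to the tag body means a broken quote can never swallow the text that follows the tag.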
2) Auto-close on encountering another tag
If a start tag is open and the parser encounters the next <something...>, the current tag is implicitly closed immediately before that next <.
This is the core rule that prevents runaway structures.
Example:
<A>hello <B>world</B>
<A> auto-closes before <B>.
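A minimal sketch of the auto-close rule follows. It ignores attributes, whitelists, and span strategies, and `flatten` is a hypothetical name; it only shows that at most one tag is ever open.

```rust
/// Split input into (open tag name, text) runs.
/// Any tag boundary closes whatever tag is currently open.
pub fn flatten(input: &str) -> Vec<(Option<String>, String)> {
    let mut out = Vec::new();
    let mut open: Option<String> = None;
    let mut rest = input;
    loop {
        // Text up to the next '<' belongs to the currently open tag, if any.
        let lt = rest.find('<').unwrap_or(rest.len());
        if lt > 0 {
            out.push((open.clone(), rest[..lt].to_string()));
        }
        if lt == rest.len() {
            break;
        }
        // Reaching the next '<' implicitly closes the current tag.
        open = None;
        let after_lt = &rest[lt + 1..];
        let gt = match after_lt.find('>') {
            Some(gt) => gt,
            None => {
                // Assumption: a '<' with no '>' is kept as literal text.
                out.push((None, rest[lt..].to_string()));
                break;
            }
        };
        let body = &after_lt[..gt];
        // Only start tags open a new span; end tags and self-closing tags do not.
        if !body.starts_with('/') && !body.ends_with('/') {
            open = body.split_whitespace().next().map(str::to_string);
        }
        rest = &after_lt[gt + 1..];
    }
    out
}
```

For the example above, this yields "hello " under A and "world" under B, with no nesting between them.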
3) Missing end tags are tolerated
If a tag never closes, it’s recovered according to its configured span strategy (below).
4) Self-closing tags
<tag .../> is treated as a marker annotation at that position (or optionally “annotate next token”, configurable).
Span strategies (how KindaXML decides what a tag annotates)
KindaXML is annotation-first. Each recognized tag can be configured with a span strategy:
inline (normal XML-ish)
If <tag> ... </tag> is present, annotate the inner range.
retro_line (great for citations)
If <cite ...> is unclosed, annotate the text on the current line before the tag (from last emitted newline to the tag start), optionally trimming punctuation/whitespace.
Example:
We shipped last week <cite id=1>.
The cite attaches to We shipped last week (not the punctuation).
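The retro_line span can be computed from the tag's byte offset alone. This is a sketch under assumptions: `retro_line_span` is a hypothetical helper, trimming is always on (the text above calls it optional), and punctuation means ASCII punctuation.

```rust
/// Byte range annotated by an unclosed tag starting at `tag_start`.
/// `tag_start` must be a char boundary, e.g. an offset returned by `str::find`.
pub fn retro_line_span(text: &str, tag_start: usize) -> (usize, usize) {
    let before = &text[..tag_start];
    // The line starts just after the last newline, or at the start of the text.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let line = &before[line_start..];
    // Trim trailing whitespace and punctuation, then leading whitespace.
    let trimmed_end =
        line.trim_end_matches(|c: char| c.is_whitespace() || c.is_ascii_punctuation());
    let trimmed = trimmed_end.trim_start();
    let start = line_start + (trimmed_end.len() - trimmed.len());
    (start, start + trimmed.len())
}
```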
Other useful strategies (optional)
- forward_until_tag: annotate from the end of <tag ...> to the next tag start.
- forward_until_newline: annotate until newline.
- forward_next_token: annotate the next token/word.
- noop: ignore tag if unclosed (marker-only tags).
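The three forward strategies differ only in where the span stops. The sketch below uses a hypothetical `Forward` enum and `forward_span` function, and assumes a "token" is a whitespace-delimited word.

```rust
/// Forward-looking span strategies (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Forward {
    UntilTag,
    UntilNewline,
    NextToken,
}

/// Byte range annotated by a tag whose markup ends at byte offset `from`.
pub fn forward_span(text: &str, from: usize, strategy: Forward) -> (usize, usize) {
    let rest = &text[from..];
    match strategy {
        // Stop at the next '<', or at end of input.
        Forward::UntilTag => (from, from + rest.find('<').unwrap_or(rest.len())),
        // Stop at the next newline, or at end of input.
        Forward::UntilNewline => (from, from + rest.find('\n').unwrap_or(rest.len())),
        // Skip leading whitespace, then take one whitespace-delimited word.
        Forward::NextToken => {
            let token = rest.trim_start();
            let skipped = rest.len() - token.len();
            let len = token.find(char::is_whitespace).unwrap_or(token.len());
            (from + skipped, from + skipped + len)
        }
    }
}
```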
Unknown tags
You instruct the LLM to use a whitelist of recognized tags, but the parser can handle unknown tags in one of three modes:
- strip (default-friendly): drop unknown tag markup, keep inner text
- passthrough: keep unknown tags as literal text
- treat_as_text: don’t parse unknown tags at all; treat <...> as text
Escaping / literal text (CDATA support)
KindaXML can support XML’s CDATA form:
- Start: <![CDATA[
- End: ]]>
Inside CDATA, nothing is parsed as tags.
Example:
If ]]> is missing, CDATA runs to end-of-document (recovered).
(If you prefer simpler escaping, you can also support \< and \> as literals.)
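CDATA handling, including the missing-terminator recovery, fits in a few lines. This is a sketch; `take_cdata` is a hypothetical helper, not part of the crate.

```rust
/// If `input` starts with a CDATA section, return (literal content, remaining input).
/// Nothing inside the section is parsed as a tag.
pub fn take_cdata(input: &str) -> Option<(&str, &str)> {
    let body = input.strip_prefix("<![CDATA[")?;
    match body.find("]]>") {
        // Normal case: content ends at the terminator, which is 3 bytes long.
        Some(end) => Some((&body[..end], &body[end + 3..])),
        // Recovery: a missing terminator means CDATA runs to end-of-document.
        None => Some((body, "")),
    }
}
```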
Using the Rust crate
use ;
ParserConfig exposes toggles for unknown tags, per-tag recovery strategies, case sensitivity, punctuation trimming, and auto-close behavior. The default config is conservative and strips unknown tags.
Examples
Run the demo with cargo run --example basic to see the original snippets alongside their parsed segments and markers.
Closed tag (inline span)
Input:
We shipped <cite id=1>last week</cite>.
Output (conceptual):
- We shipped (no annotations)
- last week (annotated: cite{id=1})
- . (no annotations)
Unclosed cite (retro_line)
Input:
We shipped last week <cite id=1>.
Output:
- We shipped last week (annotated: cite{id=1})
- .
- (tag removed)
Broken quote recovery
Input:
<cite id='1,2>text</cite>
Recovered as id="1,2".
Auto-close on next tag
Input:
alpha bravo charlie
- <note> auto-closes before <cite ...>
- <cite> is unclosed and recovered by its strategy
Failure cases / limitations (by design)
Nesting will not behave like XML
KindaXML is not a DOM language. If you try to nest, the “auto-close on next tag” rule will flatten it.
Bad idea:
<A>outer <B>inner</B> outer</A>
KindaXML outcome: <A> likely ends before <B>, and </A> may become stray.
Guidance: don’t nest; prefer sibling tags.
Attribute ambiguity in severely malformed tags
Example:
KindaXML will recover by closing the quote at > and treat the entire remaining text as part of a. This is intentional: recovery is bounded to the tag.
Guidance: keep attributes simple; use CDATA for messy text.
Stray end tags
Because auto-close flattens structure, you may get stray </tag>. By default, recognized stray end tags are dropped; unknown ones can be passed through (configurable).
Recommended prompting style for LLMs
Tell the model:
- Use only these tags: <cite> <note> <todo> <risk> ... (whitelist)
- Do not nest tags
- Prefer postfix citations: ... statement <cite id=1>.
- Use CDATA for code or text with < or >: <![CDATA[ ... ]]>