xml-syntax-reader
A high-performance, zero-copy, streaming XML syntax reader for Rust.
This is a syntax reader, not a full XML parser. It tokenizes well-formed XML into events (start tags, attributes, text, comments, etc.) without building a tree, resolving namespaces, or expanding entity references. It validates syntactic well-formedness constraints that are detectable at the lexical level, but does not check higher-level rules like tag matching or DTD conformance.
Usage
use ;
;
Streaming
The reader is designed for streaming use. parse() borrows the caller's buffer, processes as much as possible, and returns the number of bytes consumed. The caller shifts unconsumed bytes to the front, appends more data, and calls parse() again:
use ;
use Read;
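The shift-and-refill pattern described above can be sketched without the crate itself. In this minimal sketch, `parse_records` is a hypothetical stand-in for the reader's `parse()` (it consumes complete newline-terminated records rather than XML), and `drive` is an assumed helper name; only the buffer handling is the point.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the reader's `parse()`: consumes every complete
/// newline-terminated record in `buf`, pushes it to `sink`, and returns the
/// number of bytes consumed. A trailing partial record is left unconsumed.
fn parse_records(buf: &[u8], sink: &mut Vec<Vec<u8>>) -> usize {
    let mut consumed = 0;
    while let Some(pos) = buf[consumed..].iter().position(|&b| b == b'\n') {
        sink.push(buf[consumed..consumed + pos].to_vec());
        consumed += pos + 1;
    }
    consumed
}

/// Drives `parse_records` over `src` using a fixed-size buffer.
fn drive<R: Read>(mut src: R, capacity: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut buf = vec![0u8; capacity];
    let mut filled = 0; // number of valid bytes at the front of `buf`
    let mut sink = Vec::new();
    loop {
        if filled == buf.len() {
            // The buffer is full and nothing more can be read into it.
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "record does not fit in the buffer",
            ));
        }
        // Append new data after the unconsumed bytes.
        let n = src.read(&mut buf[filled..])?;
        if n == 0 {
            break; // end of input
        }
        filled += n;
        let consumed = parse_records(&buf[..filled], &mut sink);
        // Shift the unconsumed tail to the front of the buffer.
        buf.copy_within(consumed..filled, 0);
        filled -= consumed;
    }
    if filled > 0 {
        // Whatever remains at end of input is an unterminated final record.
        sink.push(buf[..filled].to_vec());
    }
    Ok(sink)
}
```

The same loop shape applies to any API that reports bytes consumed: the caller owns the buffer, so memory use is bounded by its capacity.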
Events
The Visitor trait provides callbacks for all XML syntax constructs:
| Callback | Trigger | Data |
|---|---|---|
| `start_tag_open` | `<name` | element name |
| `attribute_name` | attribute name in start tag | attribute name |
| `attribute_value` | `"value"` or `'value'` | raw value (without quotes) |
| `attribute_entity_ref` | `&name;` in attribute value | entity name |
| `attribute_char_ref` | `&#60;` or `&#x3C;` in attribute value | value between `&#` and `;` |
| `attribute_end` | closing quote of attribute value | |
| `start_tag_close` | `>` | |
| `empty_element_end` | `/>` | |
| `end_tag` | `</name>` | element name |
| `characters` | text content | raw text |
| `entity_ref` | `&name;` | entity name |
| `char_ref` | `&#60;` or `&#x3C;` | value between `&#` and `;` |
| `cdata_start` / `cdata_content` / `cdata_end` | `<![CDATA[...]]>` | raw content |
| `comment_start` / `comment_content` / `comment_end` | `<!--...-->` | comment text |
| `xml_declaration` | `<?xml ...?>` | version, encoding, standalone |
| `pi_start` / `pi_content` / `pi_end` | `<?target ...?>` | target name, content |
| `doctype_start` / `doctype_content` / `doctype_end` | `<!DOCTYPE ...>` | root name, opaque content |
All &[u8] slices are zero-copy references into the caller's buffer. Every event includes a Span with absolute byte offsets into the input stream.
Attribute values may be segmented at entity/char-ref boundaries and buffer boundaries - attribute_value fires for each text segment, interleaved with attribute_entity_ref / attribute_char_ref callbacks. Empty segments are omitted, so an attribute whose value is empty or consists entirely of references produces zero attribute_value calls.
Text content between markup is delivered as interleaved characters, entity_ref, and char_ref callbacks. For example, `a&amp;b` produces characters("a"), entity_ref("amp"), characters("b").
Content bodies (cdata_content, comment_content, pi_content, doctype_content) fire zero or more times per construct - zero for empty constructs (e.g. <!---->, <?target?>), and more than once when content spans buffer boundaries.
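To make the attribute segmentation concrete, here is a minimal sketch of a layer that rebuilds the raw source form of each attribute value from its segments. The `AttrEvent` enum and `reassemble` function are hypothetical and not part of the crate; they only mirror the callback sequence described above (`attribute_value`, `attribute_entity_ref`, `attribute_char_ref`, `attribute_end`).

```rust
/// Hypothetical event type mirroring the attribute callbacks.
/// It is an illustration only, not the crate's API.
enum AttrEvent<'a> {
    Value(&'a [u8]),     // attribute_value: one raw text segment
    EntityRef(&'a [u8]), // attribute_entity_ref: name between `&` and `;`
    CharRef(&'a [u8]),   // attribute_char_ref: value between `&#` and `;`
    End,                 // attribute_end: closing quote
}

/// Rebuilds the raw (unexpanded) text of each attribute value.
/// One output entry is produced per `End` event.
fn reassemble(events: &[AttrEvent<'_>]) -> Vec<Vec<u8>> {
    let mut values = Vec::new();
    let mut current = Vec::new();
    for event in events {
        match event {
            AttrEvent::Value(segment) => current.extend_from_slice(segment),
            AttrEvent::EntityRef(name) => {
                current.push(b'&');
                current.extend_from_slice(name);
                current.push(b';');
            }
            AttrEvent::CharRef(value) => {
                current.extend_from_slice(b"&#");
                current.extend_from_slice(value);
                current.push(b';');
            }
            // An empty value arrives as `End` with no preceding segments.
            AttrEvent::End => values.push(std::mem::take(&mut current)),
        }
    }
    values
}
```

Because empty segments are omitted, the accumulator has to be finalized on the end event rather than on the last value segment.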
Error Handling
The parser rejects malformed input with specific error kinds:
| Error | Trigger |
|---|---|
| `UnexpectedByte(u8)` | Invalid byte in the current parsing context |
| `UnexpectedEof` | Input ends inside an incomplete construct |
| `CdataEndInContent` | `]]>` in text content |
| `DoubleDashInComment` | `--` inside a comment body |
| `InvalidCharRef` | Empty or non-numeric character reference |
| `DoctypeMissingWhitespace` | Missing whitespace after `<!DOCTYPE` keyword |
| `DoctypeMissingName` | Missing or invalid name in `<!DOCTYPE` declaration |
| `InvalidUtf8` | Invalid UTF-8 byte sequence |
| `NameTooLong` | Name exceeds 1,000-byte limit |
| `CharRefTooLong` | Character reference exceeds 7-byte limit |
| `DoctypeBracketsTooDeep` | DOCTYPE bracket nesting exceeds 1,024 depth limit |
| `MalformedXmlDeclaration` | Malformed XML declaration (missing version, bad syntax) |
| `ReservedPITarget` | PI target matching `xml` (case-insensitive) after document start |
Errors include the absolute byte offset where the problem was detected.
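For in-memory documents, an absolute byte offset can be turned into a line and column for diagnostics. The helper below is a sketch under that assumption; `line_col` is a hypothetical name, not something the crate provides, and the column it reports counts bytes rather than characters.

```rust
/// Converts an absolute byte offset into a 1-based (line, column) pair.
/// The column counts bytes since the most recent line feed.
/// Returns `None` if `offset` lies past the end of `doc`.
fn line_col(doc: &[u8], offset: usize) -> Option<(usize, usize)> {
    let prefix = doc.get(..offset)?;
    // Every line feed before the offset starts a new line.
    let line = 1 + prefix.iter().filter(|&&b| b == b'\n').count();
    let col = match prefix.iter().rposition(|&b| b == b'\n') {
        Some(newline) => offset - newline,
        None => offset + 1,
    };
    Some((line, col))
}
```

For streamed input the same mapping needs a running line counter, since earlier bytes are no longer in the buffer.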
Convenience Functions
For in-memory documents, parse_slice avoids the streaming boilerplate:
let mut reader = new;
reader.parse_slice.unwrap;
For std::io::Read sources, parse_read manages the buffer internally:
use ;
let file = open.unwrap;
let mut visitor = MyVisitor;
parse_read.unwrap;
parse_read_with_capacity allows specifying the buffer size (minimum 64 bytes).
Encoding Detection
probe_encoding examines the first few bytes of a document for a BOM and/or XML declaration to determine the encoding:
use ;
let result = probe_encoding;
assert_eq!;
assert_eq!;
Supported detections: UTF-8, UTF-16 LE/BE, UTF-32 LE/BE, and encoding names declared in XML declarations.
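The BOM half of this detection is simple enough to sketch by hand. The following is an illustration of standard BOM sniffing, not the crate's implementation; `Bom` and `sniff_bom` are hypothetical names, and declaration-based detection is left out.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
}

/// Returns the detected BOM and its length in bytes, if any.
fn sniff_bom(bytes: &[u8]) -> Option<(Bom, usize)> {
    // UTF-32 LE is tested before UTF-16 LE because its BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE).
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some((Bom::Utf32Le, 4))
    } else if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some((Bom::Utf32Be, 4))
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Bom::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((Bom::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((Bom::Utf16Be, 2))
    } else {
        None
    }
}
```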
UTF-8 Handling
The parser operates on raw bytes and assumes its input is UTF-8. It does not fully validate that every byte sequence in the document is valid UTF-8, nor does it transcode from other encodings. To safely reject documents in invalid or unsupported encodings, callers should take these steps:
1. Probe the encoding before parsing. Call `probe_encoding()` on the first bytes of the document. If the result is anything other than `Encoding::Utf8` (e.g. UTF-16, UTF-32, or a declared encoding like `ISO-8859-1`), either transcode the input to UTF-8 before feeding it to the reader, or reject the document.
2. Strip the BOM if present. `probe_encoding()` returns a `bom_length` - skip that many bytes when passing data to the reader. (A UTF-8 BOM is harmless to the parser but may appear as content in the first text event.)
3. Validate UTF-8 in visitor callbacks. The parser delivers `&[u8]` slices, not `&str`. It guarantees that multi-byte UTF-8 sequences are never split across `characters()` calls at buffer boundaries, so `std::str::from_utf8()` on any individual callback slice will not fail due to a buffer-boundary split. However, it will fail if the source data contains genuinely invalid UTF-8. Call `std::str::from_utf8()` (or your own validation) on the slices you care about and reject the document if it fails.
4. Check the `xml_declaration` encoding attribute. The `xml_declaration` callback receives the declared `encoding` value (if any). A document that declares `encoding="UTF-8"` (or omits the attribute, which defaults to UTF-8) and passes UTF-8 validation in step 3 is safe. A document that declares a non-UTF-8 encoding should be transcoded or rejected.
In summary: probe_encoding() detects the transport encoding, the reader handles byte-level tokenization, and the caller is responsible for validating that the bytes are actually valid UTF-8.
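The per-callback validation step might look like the sketch below. `TextCollector` and its methods are hypothetical; in real use `on_characters` would be called from the visitor's `characters` callback, whose exact signature is not reproduced here.

```rust
use std::str::Utf8Error;

/// Hypothetical collector that validates each text slice as UTF-8 and
/// remembers the first failure so the document can be rejected afterwards.
#[derive(Default)]
struct TextCollector {
    text: String,
    error: Option<Utf8Error>,
}

impl TextCollector {
    /// Intended to be called with each raw slice delivered for text content.
    fn on_characters(&mut self, bytes: &[u8]) {
        if self.error.is_some() {
            return; // already failed; ignore the rest of the document
        }
        match std::str::from_utf8(bytes) {
            Ok(s) => self.text.push_str(s),
            Err(e) => self.error = Some(e),
        }
    }

    /// Returns the collected text, or the first UTF-8 error encountered.
    fn finish(self) -> Result<String, Utf8Error> {
        match self.error {
            Some(e) => Err(e),
            None => Ok(self.text),
        }
    }
}
```

Validating slice by slice works only because multi-byte sequences are never split across `characters()` calls, as noted in step 3.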
no_std Support
The crate supports no_std environments. The std feature is enabled by default and adds:
- Runtime SIMD detection via `is_x86_feature_detected!` - selects AVX2, SSE2, or scalar at runtime.
- `parse_read` / `parse_read_with_capacity` - convenience functions for `std::io::Read` sources.
- `ReadError` type (wraps `std::io::Error`).
To use without std:
[dependencies]
xml-syntax-reader = { version = "...", default-features = false }
SIMD backend selection falls back to compile-time target_feature detection. If you compile with -C target-feature=+avx2, the AVX2 backend is used; otherwise it falls back to scalar. Reader, parse_slice, Visitor, probe_encoding, and all error types remain available.
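The difference between the two selection strategies can be illustrated as follows. This is a sketch of the general technique, not the crate's internal code; `Backend`, `select_runtime`, and `select_compile_time` are hypothetical names.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Sse2,
    Scalar,
}

/// Runtime selection (needs `std`): asks the CPU what it supports.
fn select_runtime() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    Backend::Scalar
}

/// Compile-time selection (works without `std`): depends only on the
/// target features the compiler was told to assume, e.g. via
/// `-C target-feature=+avx2`.
fn select_compile_time() -> Backend {
    if cfg!(target_feature = "avx2") {
        Backend::Avx2
    } else {
        Backend::Scalar
    }
}
```

A binary built without `+avx2` takes the scalar path under compile-time selection even on an AVX2-capable machine, which is the trade-off of dropping `std`.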
Security Considerations
The parser enforces hardcoded limits to bound resource consumption from untrusted input:
- Name length: element names, attribute names, PI targets, DOCTYPE names, and entity reference names are capped at 1,000 bytes (`NameTooLong`).
- Character reference length: the value between `&#` and `;` is capped at 7 bytes - the longest valid reference is `&#x10FFFF;` or `&#1114111;` (`CharRefTooLong`).
- DOCTYPE bracket nesting: internal subset `[` nesting is capped at 1,024 levels (`DoctypeBracketsTooDeep`).
Text content, attribute values, and content bodies (comments, CDATA, PIs, DOCTYPE) have no size limit - they are streamed in chunks to the visitor, so memory usage is bounded by the caller's buffer size, not by document size.
All unsafe code is confined to SIMD intrinsics in src/bitstream/. The parser logic itself contains no unsafe blocks.
Beyond Syntax
This crate is a syntax reader, not a conformant XML processor. If you are building a higher-level layer on top (namespace resolution, DOM construction, validation), you need to know exactly what gaps exist relative to the XML Information Set. This section catalogs them.
Well-formedness constraints not checked
The XML 1.0 spec requires all conformant processors to enforce these rules. This parser deliberately skips them:
- Tag matching - `<a></b>` is accepted without error. The parser does not track a stack of open elements.
- Attribute uniqueness - `<e a="x" a="y">` is accepted. Duplicate attribute names are not detected.
- Character validation - bytes in text, attributes, comments, and PIs are not checked against the XML `Char` production. Control characters like U+0000 pass through.
- Character reference range - references to codepoints that are not legal XML characters are not rejected. The parser validates the syntax of character references (digits, hex digits, terminating `;`) but not that the decoded codepoint is a legal XML character.
- Namespace prefix binding - the parser does not enforce that `xml` and `xmlns` prefixes are used correctly. This is a namespace-level constraint.
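A higher layer can add the first two checks with a small amount of state. The sketch below assumes the layer receives the same information as the `start_tag_open`, `attribute_name`, `empty_element_end`, and `end_tag` callbacks; `StructureChecker` and `StructureError` are hypothetical names, and the method signatures are illustrative rather than the crate's `Visitor` trait.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
enum StructureError {
    MismatchedEndTag { expected: Vec<u8>, found: Vec<u8> },
    UnexpectedEndTag(Vec<u8>),
    DuplicateAttribute(Vec<u8>),
    UnclosedElements(usize),
}

/// Tracks open elements and the attribute names seen in the current start tag.
/// Names are copied because borrowed slices are only valid during a callback.
#[derive(Default)]
struct StructureChecker {
    open: Vec<Vec<u8>>,
    attrs: HashSet<Vec<u8>>,
}

impl StructureChecker {
    fn start_tag_open(&mut self, name: &[u8]) {
        self.open.push(name.to_vec());
        self.attrs.clear(); // a new start tag begins a new attribute set
    }

    fn attribute_name(&mut self, name: &[u8]) -> Result<(), StructureError> {
        if self.attrs.insert(name.to_vec()) {
            Ok(())
        } else {
            Err(StructureError::DuplicateAttribute(name.to_vec()))
        }
    }

    fn empty_element_end(&mut self) {
        self.open.pop(); // `<e/>` opens and closes in one tag
    }

    fn end_tag(&mut self, name: &[u8]) -> Result<(), StructureError> {
        match self.open.pop() {
            Some(expected) if expected == name => Ok(()),
            Some(expected) => Err(StructureError::MismatchedEndTag {
                expected,
                found: name.to_vec(),
            }),
            None => Err(StructureError::UnexpectedEndTag(name.to_vec())),
        }
    }

    /// Call at end of input to detect elements that were never closed.
    fn finish(&self) -> Result<(), StructureError> {
        if self.open.is_empty() {
            Ok(())
        } else {
            Err(StructureError::UnclosedElements(self.open.len()))
        }
    }
}
```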
Information the parser does not provide
The XML Information Set defines these as properties of information items. This parser does not deliver them:
- Namespace URI / local name - names are reported verbatim (e.g. `ns:elem`). No prefix resolution is performed.
- Entity expansion - `&amp;` is reported as `entity_ref("amp")`, not expanded to `&`. This applies to all five predefined entities and any DTD-defined entities.
- Character reference decoding - `&#60;` is reported as `char_ref("60")`, not decoded to `<`.
- Attribute value normalization - raw bytes between quotes are delivered as-is. No whitespace normalization (tabs/newlines to spaces, leading/trailing stripping for tokenized types) is applied.
- Default attributes - DTD-declared defaults are not applied; only attributes present in the source are reported.
- Document structure - no tree, no parent/child relationships, no document-order guarantees beyond event sequence.
- Base URI - not tracked.
- Notations and unparsed entities - not reported.
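Two of these gaps, predefined entity expansion and character reference decoding, are small enough to sketch. The function names below are hypothetical and not part of the crate; the inputs are the raw byte slices that `entity_ref` and `char_ref` are described as reporting, and `is_xml_char` follows the XML 1.0 `Char` production.

```rust
/// Maps the five predefined XML entity names to their replacement characters.
fn predefined_entity(name: &[u8]) -> Option<char> {
    match name {
        b"amp" => Some('&'),
        b"lt" => Some('<'),
        b"gt" => Some('>'),
        b"apos" => Some('\''),
        b"quot" => Some('"'),
        _ => None, // DTD-defined entities need a separate lookup
    }
}

/// XML 1.0 `Char` production: the characters allowed in a document.
fn is_xml_char(c: char) -> bool {
    matches!(
        c,
        '\u{9}' | '\u{A}' | '\u{D}'
            | '\u{20}'..='\u{D7FF}'
            | '\u{E000}'..='\u{FFFD}'
            | '\u{10000}'..='\u{10FFFF}'
    )
}

/// Decodes the value between `&#` and `;` (e.g. `60` or `x3C`).
/// Returns `None` for bad syntax or a codepoint that is not a legal XML character.
fn decode_char_ref(value: &[u8]) -> Option<char> {
    let text = std::str::from_utf8(value).ok()?;
    let code = match text.strip_prefix('x') {
        Some(hex) if hex.bytes().all(|b| b.is_ascii_hexdigit()) => {
            u32::from_str_radix(hex, 16).ok()?
        }
        Some(_) => return None,
        None if text.bytes().all(|b| b.is_ascii_digit()) => text.parse::<u32>().ok()?,
        None => return None,
    };
    char::from_u32(code).filter(|&c| is_xml_char(c))
}
```

Checking `is_xml_char` on the decoded value also covers the character reference range constraint, which the parser itself does not enforce.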
DTD and external features not processed
- Internal subset - delivered as opaque `doctype_content` chunks. Entity, notation, and attlist declarations within it are not parsed.
- External subset - not fetched or processed.
- External entities - not resolved.
- Standalone enforcement - `standalone="yes"` is parsed and reported via `xml_declaration`, but its constraints (no external markup declarations affecting content, no attribute defaulting, no normalization changes) are not enforced.