xml-syntax-reader 0.1.0

Low-level, callback-based, streaming XML tokenizer
Documentation
  • Coverage
  • 85.92%
    61 out of 71 items documented1 out of 1 items with examples
  • Size
  • Source code size: 187.74 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 7.04 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 16s Average build duration of successful builds.
  • all releases: 16s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • dholroyd/xml-syntax-reader
    0 0 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • dholroyd

xml-syntax-reader

A high-performance, zero-copy, streaming XML syntax reader for Rust.

This is a syntax reader, not a full XML parser. It tokenizes well-formed XML into events (start tags, attributes, text, comments, etc.) without building a tree, resolving namespaces, or expanding entity references. It validates syntactic well-formedness constraints that are detectable at the lexical level, but does not check higher-level rules like tag matching or DTD conformance.

Usage

use xml_syntax_reader::{Reader, Visitor, Span};

struct MyVisitor;

impl Visitor for MyVisitor {
    type Error = std::convert::Infallible;

    fn start_tag_open(&mut self, name: &[u8], span: Span) -> Result<(), Self::Error> {
        println!("element: {} at {}..{}", String::from_utf8_lossy(name), span.start, span.end);
        Ok(())
    }

    fn characters(&mut self, text: &[u8], span: Span) -> Result<(), Self::Error> {
        println!("text: {:?} at {}..{}", String::from_utf8_lossy(text), span.start, span.end);
        Ok(())
    }

    // ... implement other callbacks as needed; defaults are no-ops
}

fn main() {
    let xml = b"<greeting>Hello, world!</greeting>";
    let mut reader = Reader::new();
    let mut visitor = MyVisitor;
    reader.parse_slice(xml, &mut visitor).unwrap();
}

Streaming

The reader is designed for streaming use. parse() borrows the caller's buffer, processes as much as possible, and returns the number of bytes consumed. The caller shifts unconsumed bytes to the front, appends more data, and calls parse() again:

use xml_syntax_reader::{Reader, Visitor, Span};
use std::io::Read;

struct CountVisitor { count: u64 }
impl Visitor for CountVisitor {
    type Error = std::convert::Infallible;
    fn start_tag_open(&mut self, _: &[u8], _: Span) -> Result<(), Self::Error> {
        self.count += 1;
        Ok(())
    }
}

fn count_elements(mut input: impl Read) -> std::io::Result<u64> {
    let mut reader = Reader::new();
    let mut visitor = CountVisitor { count: 0 };
    let mut buf = vec![0u8; 64 * 1024];
    let mut valid = 0usize;
    let mut stream_offset = 0u64;

    loop {
        let n = input.read(&mut buf[valid..])?;
        valid += n;
        let is_final = n == 0;

        let consumed = reader
            .parse(&buf[..valid], stream_offset, is_final, &mut visitor)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, format!("{e:?}")))?
            as usize;

        buf.copy_within(consumed..valid, 0);
        valid -= consumed;
        stream_offset += consumed as u64;

        if is_final { break; }
    }
    Ok(visitor.count)
}

Events

The Visitor trait provides callbacks for all XML syntax constructs:

Callback Trigger Data
start_tag_open <name element name
attribute_name attribute name in start tag attribute name
attribute_value "value" or 'value' raw value (without quotes)
attribute_entity_ref &name; in attribute value entity name
attribute_char_ref &#60; or &#x3C; in attribute value value between &# and ;
attribute_end closing quote of attribute value
start_tag_close >
empty_element_end />
end_tag </name> element name
characters text content raw text
entity_ref &name; entity name
char_ref &#60; or &#x3C; value between &# and ;
cdata_start / cdata_content / cdata_end <![CDATA[...]]> raw content
comment_start / comment_content / comment_end <!--...--> comment text
xml_declaration <?xml ...?> version, encoding, standalone
pi_start / pi_content / pi_end <?target ...?> target name, content
doctype_start / doctype_content / doctype_end <!DOCTYPE ...> root name, opaque content

All &[u8] slices are zero-copy references into the caller's buffer. Every event includes a Span with absolute byte offsets into the input stream.

Attribute values may be segmented at entity/char-ref boundaries and buffer boundaries - attribute_value fires for each text segment, interleaved with attribute_entity_ref / attribute_char_ref callbacks. Empty segments are omitted, so an attribute whose value is empty or consists entirely of references produces zero attribute_value calls.

Text content between markup is delivered as interleaved characters, entity_ref, and char_ref callbacks. For example, a&amp;b produces characters("a"), entity_ref("amp"), characters("b").

Content bodies (cdata_content, comment_content, pi_content, doctype_content) fire zero or more times per construct - zero for empty constructs (e.g. <!---->, <?target?>), and more than once when content spans buffer boundaries.

Error Handling

The parser rejects malformed input with specific error kinds:

Error Trigger
UnexpectedByte(u8) Invalid byte in the current parsing context
UnexpectedEof Input ends inside an incomplete construct
CdataEndInContent ]]> in text content
DoubleDashInComment -- inside a comment body
InvalidCharRef Empty or non-numeric character reference
DoctypeMissingWhitespace Missing whitespace after <!DOCTYPE keyword
DoctypeMissingName Missing or invalid name in <!DOCTYPE declaration
InvalidUtf8 Invalid UTF-8 byte sequence
NameTooLong Name exceeds 1,000-byte limit
CharRefTooLong Character reference exceeds 7-byte limit
DoctypeBracketsTooDeep DOCTYPE bracket nesting exceeds 1,024 depth limit
MalformedXmlDeclaration Malformed XML declaration (missing version, bad syntax)
ReservedPITarget PI target matching xml (case-insensitive) after document start

Errors include the absolute byte offset where the problem was detected.

Convenience Functions

For in-memory documents, parse_slice avoids the streaming boilerplate:

let mut reader = Reader::new();
reader.parse_slice(xml, &mut visitor).unwrap();

For std::io::Read sources, parse_read manages the buffer internally:

use xml_syntax_reader::{parse_read, Visitor, Span};

let file = std::fs::File::open("input.xml").unwrap();
let mut visitor = MyVisitor;
parse_read(file, &mut visitor).unwrap();

parse_read_with_capacity allows specifying the buffer size (minimum 64 bytes).

Encoding Detection

probe_encoding examines the first few bytes of a document for a BOM and/or XML declaration to determine the encoding:

use xml_syntax_reader::{probe_encoding, Encoding};

let result = probe_encoding(b"\xEF\xBB\xBF<?xml version=\"1.0\"?>");
assert_eq!(result.encoding, Encoding::Utf8);
assert_eq!(result.bom_length, 3);

Supported detections: UTF-8, UTF-16 LE/BE, UTF-32 LE/BE, and encoding names declared in XML declarations.

UTF-8 Handling

The parser operates on raw bytes and assumes its input is UTF-8. It does not fully validate that every byte sequence in the document is valid UTF-8, nor does it transcode from other encodings. To safely reject documents in invalid or unsupported encodings, callers should take these steps:

  1. Probe the encoding before parsing. Call probe_encoding() on the first bytes of the document. If the result is anything other than Encoding::Utf8 (e.g. UTF-16, UTF-32, or a declared encoding like ISO-8859-1), either transcode the input to UTF-8 before feeding it to the reader, or reject the document.

  2. Strip the BOM if present. probe_encoding() returns a bom_length - skip that many bytes when passing data to the reader. (A UTF-8 BOM is harmless to the parser but may appear as content in the first text event.)

  3. Validate UTF-8 in visitor callbacks. The parser delivers &[u8] slices, not &str. It guarantees that multi-byte UTF-8 sequences are never split across characters() calls at buffer boundaries, so std::str::from_utf8() on any individual callback slice will not fail due to a buffer-boundary split. However, it will fail if the source data contains genuinely invalid UTF-8. Call std::str::from_utf8() (or your own validation) on the slices you care about and reject the document if it fails.

  4. Check the xml_declaration encoding attribute. The xml_declaration callback receives the declared encoding value (if any). A document that declares encoding="UTF-8" (or omits the attribute, which defaults to UTF-8) and passes UTF-8 validation in step 3 is safe. A document that declares a non-UTF-8 encoding should be transcoded or rejected.

In summary: probe_encoding() detects the transport encoding, the reader handles byte-level tokenization, and the caller is responsible for validating that the bytes are actually valid UTF-8.

no_std Support

The crate supports no_std environments. The std feature is enabled by default and adds:

  • Runtime SIMD detection via is_x86_feature_detected! - selects AVX2, SSE2, or scalar at runtime.
  • parse_read / parse_read_with_capacity - convenience functions for std::io::Read sources.
  • ReadError type (wraps std::io::Error).

To use without std:

[dependencies]
xml-syntax-reader = { version = "...", default-features = false }

SIMD backend selection falls back to compile-time target_feature detection. If you compile with -C target-feature=+avx2, the AVX2 backend is used; otherwise it falls back to scalar. Reader, parse_slice, Visitor, probe_encoding, and all error types remain available.

Security Considerations

The parser enforces hardcoded limits to bound resource consumption from untrusted input:

  • Name length: element names, attribute names, PI targets, DOCTYPE names, and entity reference names are capped at 1,000 bytes (NameTooLong).
  • Character reference length: the value between &# and ; is capped at 7 bytes - the longest valid reference is &#x10FFFF; or &#1114111; (CharRefTooLong).
  • DOCTYPE bracket nesting: internal subset [ nesting is capped at 1,024 levels (DoctypeBracketsTooDeep).

Text content, attribute values, and content bodies (comments, CDATA, PIs, DOCTYPE) have no size limit - they are streamed in chunks to the visitor, so memory usage is bounded by the caller's buffer size, not by document size.

All unsafe code is confined to SIMD intrinsics in src/bitstream/. The parser logic itself contains no unsafe blocks.

Beyond Syntax

This crate is a syntax reader, not a conformant XML processor. If you are building a higher-level layer on top (namespace resolution, DOM construction, validation), you need to know exactly what gaps exist relative to the XML Information Set. This section catalogs them.

Well-formedness constraints not checked

The XML 1.0 spec requires all conformant processors to enforce these rules. This parser deliberately skips them:

  • Tag matching - <a></b> is accepted without error. The parser does not track a stack of open elements.
  • Attribute uniqueness - <e a="x" a="y"> is accepted. Duplicate attribute names are not detected.
  • Character validation - bytes in text, attributes, comments, and PIs are not checked against the XML Char production. Control characters like U+0000 pass through.
  • Character reference range - &#0; and &#xD800; are not rejected. The parser validates the syntax of character references (digits, hex digits, terminating ;) but not that the decoded codepoint is a legal XML character.
  • Namespace prefix binding - the parser does not enforce that xml and xmlns prefixes are used correctly. This is a namespace-level constraint.

Information the parser does not provide

The XML Information Set defines these as properties of information items. This parser does not deliver them:

  • Namespace URI / local name - names are reported verbatim (e.g. ns:elem). No prefix resolution is performed.
  • Entity expansion - &amp; is reported as entity_ref("amp"), not expanded to &. This applies to all five predefined entities and any DTD-defined entities.
  • Character reference decoding - &#60; is reported as char_ref("60"), not decoded to <.
  • Attribute value normalization - raw bytes between quotes are delivered as-is. No whitespace normalization (tabs/newlines to spaces, leading/trailing stripping for tokenized types) is applied.
  • Default attributes - DTD-declared defaults are not applied; only attributes present in the source are reported.
  • Document structure - no tree, no parent/child relationships, no document-order guarantees beyond event sequence.
  • Base URI - not tracked.
  • Notations and unparsed entities - not reported.

DTD and external features not processed

  • Internal subset - delivered as opaque doctype_content chunks. Entity, notation, and attlist declarations within it are not parsed.
  • External subset - not fetched or processed.
  • External entities - not resolved.
  • Standalone enforcement - standalone="yes" is parsed and reported via xml_declaration, but its constraints (no external markup declarations affecting content, no attribute defaulting, no normalization changes) are not enforced.