xmltok 0.14.0

Pull-based, zero-allocation XML tokenizer with compact, lifetime-free tokens. A fork of xmlparser.
Repository: artemshein/xmlparser
Owner: artemshein

xmltok


xmltok is a low-level, pull-based, zero-allocation XML 1.0 tokenizer.

It is a fork of RazrFalcon/xmlparser with a focus on reducing memory consumption and removing the lifetime parameter from Token.

Differences from upstream

The upstream library represents token fields as StrSpan<'a>, a struct that holds a &'a str pointer along with a byte offset into the original document. This ties every Token<'a> to the lifetime of the source string, and each span carries an 8-byte pointer even though the pointer is always the same (the start of the document).

This fork replaces StrSpan<'a> fields inside tokens with two offset-only types:

  • SmallDetachedStrSpan — stores start and end as u16 values relative to the token's own start offset. Used for short in-token strings (tag names, attribute names/values, etc.).
  • DetachedStrSpan — stores start and end as u32 values relative to the token's own start offset. Used where u16 would overflow (text, CDATA, comments).

Each token stores one absolute start: u32 position. Most variants store an end: u16 length; ElementStart, Attribute, Text, Comment and Cdata store no end at all — it is derived from their last sub-span by Token::range(). All string spans inside a token are relative to its start, so no pointer is stored. To recover a &str you call span.as_str(full_text, token_start as usize).
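The layout described above can be illustrated with a minimal sketch. The `SmallSpan` and `AttrToken` types below are hypothetical stand-ins written for this example, not the crate's actual definitions; only the idea (relative `u16` offsets resolved against the document text and the token's absolute start) comes from the description.

```rust
/// Hypothetical offset-only span; NOT the crate's real type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SmallSpan {
    start: u16, // relative to the owning token's start
    end: u16,   // relative to the owning token's start
}

impl SmallSpan {
    /// Resolves the span against the full document text.
    fn as_str<'a>(&self, full_text: &'a str, token_start: usize) -> &'a str {
        let from = token_start + self.start as usize;
        let to = token_start + self.end as usize;
        &full_text[from..to]
    }
}

/// Hypothetical attribute token: one absolute start plus relative spans.
/// It holds no reference, so it needs no lifetime parameter.
#[derive(Clone, Copy, Debug)]
struct AttrToken {
    start: u32,
    name: SmallSpan,
    value: SmallSpan,
}

fn main() {
    let doc = "<item id='42'/>";
    // The attribute `id='42'` begins at byte 6 of the document.
    let tok = AttrToken {
        start: 6,
        name: SmallSpan { start: 0, end: 2 },
        value: SmallSpan { start: 4, end: 6 },
    };
    let base = tok.start as usize;
    println!("{}={}", tok.name.as_str(doc, base), tok.value.as_str(doc, base));
}
```

Because the token stores only integers, the borrow of the document appears only at the moment a `&str` is requested.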

The result is that Token has no lifetime parameter, is Copy + 'static, and occupies 20 bytes instead of upstream's 112.

The compact representation imposes documented limits, enforced with parse errors instead of silent truncation: markup tokens (declaration, PI, DOCTYPE, ENTITY, element start/end, attribute) are limited to 64 KiB each, text/CDATA/comment tokens and the whole document to 4 GiB.
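As a rough illustration of "parse error instead of silent truncation", a checked conversion can reject lengths that do not fit in a u16. The `LimitError` type and `markup_len` function are hypothetical names for this sketch; the crate's real error variants may differ.

```rust
use std::convert::TryFrom;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
enum LimitError {
    MarkupTooLong { len: usize },
}

/// Converts a markup token length to u16, failing instead of wrapping.
fn markup_len(len: usize) -> Result<u16, LimitError> {
    u16::try_from(len).map_err(|_| LimitError::MarkupTooLong { len })
}

fn main() {
    println!("{:?}", markup_len(1_000));
    println!("{:?}", markup_len(70_000));
    // For contrast, a plain `as` cast silently wraps around.
    println!("{}", 70_000usize as u16);
}
```

The same pattern would apply to the u32 limits on text tokens and on the whole document.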

Benchmarks

Measured on Apple Silicon (aarch64) with Rust 1.96 against xmlparser 0.13.6 (the latest crates.io release, identical to upstream master). Four XML fixtures:

  • dense — 44 KB, compact elements, minimal whitespace
  • spaced — 66 KB, heavy inter-element whitespace
  • dtd — 12 KB, DOCTYPE with 50 ENTITY declarations, comments, and processing instructions
  • large — 1 GB, 10.5 million items, same structure as dense

Token memory footprint

                            This fork   Upstream 0.13.6
size_of::<Token>()          20 bytes    112 bytes
10.5 M tokens (1 GB file)   210 MB      1 176 MB
Savings per token           −92 bytes (82%)
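The per-token figures translate into collection size by plain multiplication. The sketch below uses the two reported token sizes and an illustrative count of one million tokens (an assumption chosen for round numbers, not a measurement); `vec_bytes` is a hypothetical helper.

```rust
/// Bytes needed to hold `count` tokens of `token_size` bytes each
/// (ignores any spare Vec capacity).
fn vec_bytes(count: u64, token_size: u64) -> u64 {
    count * token_size
}

fn main() {
    const FORK: u64 = 20; // size_of::<Token>() reported for this fork
    const UPSTREAM: u64 = 112; // size_of::<Token>() reported for upstream
    let count = 1_000_000; // illustrative token count, not a measurement

    println!("fork:     {} MB", vec_bytes(count, FORK) / 1_000_000);
    println!("upstream: {} MB", vec_bytes(count, UPSTREAM) / 1_000_000);
    println!("saved per token: {} bytes", UPSTREAM - FORK);
    println!("saving: {:.0}%", (UPSTREAM - FORK) as f64 / UPSTREAM as f64 * 100.0);
    println!("ratio: {:.1}x", UPSTREAM as f64 / FORK as f64);
}
```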

Streaming throughput (tokens counted, nothing stored)

File             Fork        Upstream 0.13.6   Δ
dense (44 KB)    536 MiB/s   434 MiB/s         +24%
spaced (66 KB)   628 MiB/s   485 MiB/s         +30%
dtd (12 KB)      582 MiB/s   450 MiB/s         +29%
large (1 GB)     594 MiB/s   457 MiB/s         +30%

Collect throughput (Vec<Token> allocation included)

File             Fork        Upstream 0.13.6   Δ
dense (44 KB)    498 MiB/s   356 MiB/s         +40%
spaced (66 KB)   624 MiB/s   408 MiB/s         +53%
dtd (12 KB)      535 MiB/s   379 MiB/s         +41%

The streaming gap comes from byte-table scanning in the hot paths (qualified names, text, attribute values, comments). The collect gap additionally benefits from the 5.6× smaller Vec — less memory to allocate, write, and grow during reallocation.
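Byte-table scanning generally means classifying each input byte with a single lookup in a 256-entry table instead of a chain of comparisons. The sketch below shows the general technique for name scanning; it is not the crate's code. The table contents are a simplification (ASCII name characters plus all non-ASCII bytes), and `build_name_table`, `NAME_BYTE`, and `scan_name` are hypothetical names.

```rust
/// Builds a 256-entry table marking bytes accepted inside a name.
/// Simplified: ASCII letters, digits, `_ - . :` and every non-ASCII byte.
const fn build_name_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        table[i] = b.is_ascii_alphanumeric()
            || b == b'_'
            || b == b'-'
            || b == b'.'
            || b == b':'
            || b >= 0x80;
        i += 1;
    }
    table
}

/// Computed at compile time, so scanning costs one load per byte.
static NAME_BYTE: [bool; 256] = build_name_table();

/// Returns the length of the leading run of name bytes.
fn scan_name(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|&&b| NAME_BYTE[b as usize]).count()
}

fn main() {
    let input = b"svg:rect width='10'";
    let n = scan_name(input);
    println!("{}", std::str::from_utf8(&input[..n]).unwrap());
}
```

A real tokenizer would also validate the first character of a name separately and check non-ASCII characters against the XML name rules.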

Numbers are medians from the interleaved A/B harnesses, which are stable run-to-run:

cargo run --release --example ab [fixture.xml]
cargo run --release --example ab_collect [fixture.xml]

Criterion benchmarks live in benches/tokenize.rs (cargo bench).
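For readers unfamiliar with the approach, an interleaved A/B measurement alternates the two candidates within each round, so that drift such as thermal throttling or background load affects both about equally, and then reports medians. The following is a generic sketch of that idea, not the code in the ab examples; `interleaved_ab` and `median` are hypothetical names and the two workloads are placeholders.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Median of the samples (upper middle element for even counts).
fn median(samples: &mut [Duration]) -> Duration {
    samples.sort();
    samples[samples.len() / 2]
}

/// Runs `a` and `b` alternately for `rounds` rounds and returns their medians.
fn interleaved_ab<A: FnMut(), B: FnMut()>(
    mut a: A,
    mut b: B,
    rounds: usize,
) -> (Duration, Duration) {
    assert!(rounds > 0, "at least one round is required");
    let mut times_a = Vec::with_capacity(rounds);
    let mut times_b = Vec::with_capacity(rounds);
    for _ in 0..rounds {
        let t = Instant::now();
        a();
        times_a.push(t.elapsed());

        let t = Instant::now();
        b();
        times_b.push(t.elapsed());
    }
    (median(&mut times_a), median(&mut times_b))
}

fn main() {
    let input = "<a><b/></a>".repeat(10_000);
    // Placeholder workloads: two ways of counting '<' in the input.
    let (a, b) = interleaved_ab(
        || {
            black_box(input.bytes().filter(|&c| c == b'<').count());
        },
        || {
            black_box(input.matches('<').count());
        },
        31,
    );
    println!("A median: {:?}, B median: {:?}", a, b);
}
```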


Original upstream README below.

xmlparser

xmlparser is a low-level, pull-based, zero-allocation XML 1.0 parser.

Example

for token in xmlparser::Tokenizer::from("<tagname name='value'/>") {
    println!("{:?}", token);
}

Why a new library?

This library is basically a low-level XML tokenizer that preserves the positions of the tokens and is not intended to be used directly.

If you are looking for a higher level solution, check out roxmltree.

Benefits

  • All tokens contain StrSpan structs which represent the position of the substring in the original document.
  • Good error processing. All error types contain the position (line:column) where the error occurred.
  • No heap allocations.
  • No dependencies.
  • Tiny. ~1400 LOC and ~30KiB in the release build according to cargo-bloat.
  • Supports no_std builds. To use without the standard library, disable the default features.

Limitations

  • Currently, only ENTITY objects are parsed from the DOCTYPE. All others are ignored.
  • No tree structure validation. XML like <root><child></root></child>, or a string without a root element, will be parsed without errors. You should check for this manually. On the other hand, <a/><a/> will lead to an error.
  • Duplicated attributes are not an error. XML like <item a="v1" a="v2"/> will be parsed without errors. You should check for this manually.
  • UTF-8 only.
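The manual checks mentioned above (tag balance and duplicate attributes) could look roughly like this. To stay self-contained, the sketch works on a hypothetical, simplified `Event` enum that stands in for tokenizer output; it does not use the library's real token type, and `check` is a made-up helper.

```rust
use std::collections::HashSet;

/// Hypothetical simplified events standing in for tokenizer output.
#[derive(Debug, Clone, Copy)]
enum Event<'a> {
    Start(&'a str),    // `<name`
    Attr(&'a str),     // `name="..."`
    EndOpen,           // `>`
    EndEmpty,          // `/>`
    EndClose(&'a str), // `</name>`
}

/// Checks that tags are balanced and attributes are unique per element.
fn check(events: &[Event<'_>]) -> Result<(), String> {
    let mut stack: Vec<&str> = Vec::new();
    let mut seen: HashSet<&str> = HashSet::new();
    for ev in events {
        match *ev {
            Event::Start(name) => {
                stack.push(name);
                seen.clear(); // attributes are tracked per element
            }
            Event::Attr(name) => {
                if !seen.insert(name) {
                    return Err(format!("duplicate attribute `{}`", name));
                }
            }
            Event::EndOpen => {}
            Event::EndEmpty => {
                stack.pop();
            }
            Event::EndClose(name) => match stack.pop() {
                Some(open) if open == name => {}
                Some(open) => {
                    return Err(format!("expected </{}>, found </{}>", open, name));
                }
                None => return Err(format!("unexpected </{}>", name)),
            },
        }
    }
    match stack.pop() {
        Some(open) => Err(format!("unclosed <{}>", open)),
        None => Ok(()),
    }
}

fn main() {
    // <root><child></root></child>
    let crossed = [
        Event::Start("root"),
        Event::EndOpen,
        Event::Start("child"),
        Event::EndOpen,
        Event::EndClose("root"),
        Event::EndClose("child"),
    ];
    println!("{:?}", check(&crossed));

    // <item a="v1" a="v2"/>
    let dup = [
        Event::Start("item"),
        Event::Attr("a"),
        Event::Attr("a"),
        Event::EndEmpty,
    ];
    println!("{:?}", check(&dup));
}
```

This sketch compares raw names only; a namespace-aware check would compare prefix and local name pairs.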

Safety

  • The library must not panic. Any panic is considered a critical bug and should be reported.
  • The library forbids unsafe code.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.