xmltok
xmltok is a low-level, pull-based, zero-allocation XML 1.0 tokenizer.
It is a fork of RazrFalcon/xmlparser with a focus on
reducing memory consumption and removing the lifetime parameter from Token.
Differences from upstream
The upstream library represents token fields as StrSpan<'a>, a struct that holds a &'a str
pointer along with a byte offset into the original document. This ties every Token<'a> to the
lifetime of the source string, and each span carries an 8-byte pointer even though the pointer is
always the same (the start of the document).
This fork replaces StrSpan<'a> fields inside tokens with two offset-only types:
- `SmallDetachedStrSpan`: stores `start` and `end` as `u16` values relative to the token's own `start` offset. Used for short in-token strings (tag names, attribute names/values, etc.).
- `DetachedStrSpan`: stores `start` and `end` as `u32` values relative to the token's own `start` offset. Used where `u16` would overflow (text, CDATA, comments).
Each token stores one absolute start: u32 position. Most variants store an end: u16 length;
ElementStart, Attribute, Text, Comment and Cdata store no end at all — it is derived
from their last sub-span by Token::range(). All string spans inside a token are relative to its
start, so no pointer is stored. To recover a &str you call
span.as_str(full_text, token_start as usize).
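As a rough illustration of how a pointer-free span can be resolved, here is a minimal sketch. `RelSpan` is a hypothetical stand-in written for this example, not the crate's actual type; only the shape of the `as_str(full_text, token_start)` call is taken from the description above.

```rust
/// Hypothetical stand-in for a small detached span: two `u16` offsets
/// measured from the owning token's start, with no pointer stored.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RelSpan {
    start: u16,
    end: u16,
}

impl RelSpan {
    /// Creates a span from offsets relative to the token start.
    /// Panics if `start > end`; a real tokenizer would not need this check.
    pub fn new(start: u16, end: u16) -> Self {
        assert!(start <= end, "span start must not exceed span end");
        RelSpan { start, end }
    }

    /// Resolves the span against the full document text.
    /// Panics if the resulting range is out of bounds or splits a UTF-8
    /// character, like any other string slicing.
    pub fn as_str<'a>(self, full_text: &'a str, token_start: usize) -> &'a str {
        let from = token_start + usize::from(self.start);
        let to = token_start + usize::from(self.end);
        &full_text[from..to]
    }
}
```

The span itself is `Copy` and carries no lifetime; the borrow only appears at the moment the caller supplies the document text.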
The result is that Token has no lifetime parameter, is Copy + 'static, and 20 bytes instead
of upstream's 112.
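To see where the per-span savings come from, the following sketch compares the size of a borrowed span with offset-only spans. The three structs are illustrative assumptions modelled on the description above, not the definitions used by either crate.

```rust
use std::mem::size_of;

/// Shaped like a borrowed span: a string slice plus a byte offset.
/// Field names are illustrative.
#[allow(dead_code)]
pub struct BorrowedSpan<'a> {
    text: &'a str,
    start: usize,
}

/// Offset-only span with 16-bit fields.
#[allow(dead_code)]
pub struct SmallOffsets {
    start: u16,
    end: u16,
}

/// Offset-only span with 32-bit fields.
#[allow(dead_code)]
pub struct WideOffsets {
    start: u32,
    end: u32,
}

/// Returns the sizes in bytes of the three span shapes, in declaration order.
pub fn span_sizes() -> (usize, usize, usize) {
    (
        size_of::<BorrowedSpan<'static>>(),
        size_of::<SmallOffsets>(),
        size_of::<WideOffsets>(),
    )
}
```

On a typical 64-bit target the borrowed shape occupies three machine words (a `&str` is a pointer plus a length), while the offset-only shapes take 4 and 8 bytes.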
The compact representation imposes documented limits, enforced with parse errors instead of silent truncation: markup tokens (declaration, PI, DOCTYPE, ENTITY, element start/end, attribute) are limited to 64 KiB each, text/CDATA/comment tokens and the whole document to 4 GiB.
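One way to turn these limits into errors rather than silent truncation is a checked conversion at the point where an offset is narrowed. The sketch below is an assumption about the general approach; `LimitError`, `markup_len` and `document_offset` are hypothetical names, and the exact boundary values the crate enforces may differ.

```rust
/// Hypothetical error type for values that do not fit the compact fields.
#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    /// A markup token is longer than a `u16` can describe.
    MarkupTokenTooLarge { len: usize },
    /// A document offset is larger than a `u32` can describe.
    DocumentTooLarge { offset: usize },
}

/// Narrows a markup token length to `u16`, failing instead of truncating.
pub fn markup_len(len: usize) -> Result<u16, LimitError> {
    u16::try_from(len).map_err(|_| LimitError::MarkupTokenTooLarge { len })
}

/// Narrows an absolute document offset to `u32`, failing instead of truncating.
pub fn document_offset(offset: usize) -> Result<u32, LimitError> {
    u32::try_from(offset).map_err(|_| LimitError::DocumentTooLarge { offset })
}
```

Using `try_from` instead of an `as` cast is the key point: an `as` cast would wrap around and produce a span that points at the wrong bytes.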
Benchmarks
Measured on Apple Silicon (aarch64) with Rust 1.96 against xmlparser 0.13.6 (the latest
crates.io release, identical to upstream master). Four XML fixtures:
- dense — 44 KB, compact elements, minimal whitespace
- spaced — 66 KB, heavy inter-element whitespace
- dtd — 12 KB, DOCTYPE with 50 ENTITY declarations, comments, and processing instructions
- large — 1 GB, 10.5 million items, same structure as dense
Token memory footprint
| | This fork | Upstream 0.13.6 |
|---|---|---|
| size_of::<Token>() | 20 bytes | 112 bytes |
| 10.5 M tokens (1 GB file) | 210 MB | 1 176 MB |
| Savings per token | −92 bytes (82%) | — |
Streaming throughput (tokens counted, nothing stored)
| File | Fork | Upstream 0.13.6 | Δ |
|---|---|---|---|
| dense (44 KB) | 536 MiB/s | 434 MiB/s | +24% |
| spaced (66 KB) | 628 MiB/s | 485 MiB/s | +30% |
| dtd (12 KB) | 582 MiB/s | 450 MiB/s | +29% |
| large (1 GB) | 594 MiB/s | 457 MiB/s | +30% |
Collect throughput (Vec<Token> allocation included)
| File | Fork | Upstream 0.13.6 | Δ |
|---|---|---|---|
| dense (44 KB) | 498 MiB/s | 356 MiB/s | +40% |
| spaced (66 KB) | 624 MiB/s | 408 MiB/s | +53% |
| dtd (12 KB) | 535 MiB/s | 379 MiB/s | +41% |
The streaming gap comes from byte-table scanning in the hot paths (qualified names, text,
attribute values, comments). The collect gap additionally benefits from the 5.6× smaller
Vec — less memory to allocate, write, and grow during reallocation.
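For readers unfamiliar with byte-table scanning, here is a minimal sketch of the technique: a 256-entry table classifies each byte, and the scan loop does one table lookup per byte instead of a chain of comparisons. The table contents are a simplification chosen for this example (ASCII name characters plus all non-ASCII bytes) and do not reproduce the XML `Name` production or the crate's actual tables.

```rust
/// Lookup table: `true` for bytes this sketch accepts inside a name.
const NAME_BYTE: [bool; 256] = build_name_table();

/// Builds the table at compile time.
const fn build_name_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut b = 0usize;
    while b < 256 {
        let c = b as u8;
        table[b] = c.is_ascii_alphanumeric()
            || c == b'_'
            || c == b'-'
            || c == b'.'
            || c == b':'
            || c >= 0x80; // simplification: accept every non-ASCII byte
        b += 1;
    }
    table
}

/// Returns the index of the first byte at or after `start` that is not a
/// name byte, or `bytes.len()` if the name runs to the end of the input.
pub fn scan_name(bytes: &[u8], start: usize) -> usize {
    let mut i = start;
    while i < bytes.len() && NAME_BYTE[usize::from(bytes[i])] {
        i += 1;
    }
    i
}
```

Because every byte maps to exactly one table slot, the inner loop has a single predictable branch regardless of how many character classes the table encodes.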
Numbers are medians from the interleaved A/B harnesses, which are stable run-to-run:
cargo run --release --example ab [fixture.xml]
cargo run --release --example ab_collect [fixture.xml]
Criterion benchmarks live in benches/tokenize.rs (cargo bench).
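The idea behind an interleaved A/B measurement can be sketched as follows. This is not the code of the `ab` examples; `interleaved_ab` and `median` are hypothetical helpers that only illustrate alternating the two candidates within each round and reporting medians, so that slow drift (thermal throttling, background load) affects both sides similarly.

```rust
use std::time::{Duration, Instant};

/// Returns the median of the samples (the upper middle for even counts).
/// Panics on an empty slice.
pub fn median(samples: &mut [Duration]) -> Duration {
    assert!(!samples.is_empty(), "median of zero samples is undefined");
    samples.sort();
    samples[samples.len() / 2]
}

/// Runs `a` and `b` alternately for `rounds` rounds and returns the median
/// wall-clock time of each. `rounds` must be at least 1.
pub fn interleaved_ab<A, B>(rounds: usize, mut a: A, mut b: B) -> (Duration, Duration)
where
    A: FnMut(),
    B: FnMut(),
{
    let mut a_times = Vec::with_capacity(rounds);
    let mut b_times = Vec::with_capacity(rounds);
    for _ in 0..rounds {
        let t = Instant::now();
        a();
        a_times.push(t.elapsed());

        let t = Instant::now();
        b();
        b_times.push(t.elapsed());
    }
    (median(&mut a_times), median(&mut b_times))
}
```

In a real harness each closure would tokenize the same fixture with one of the two libraries and pass the result through `std::hint::black_box` so the work is not optimized away.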
Original upstream README below.
xmlparser
xmlparser is a low-level, pull-based, zero-allocation XML 1.0 parser.
Example
for token in xmlparser::Tokenizer::from("<tagname name='value'/>") {
    println!("{:?}", token);
}
Why a new library?
This library is basically a low-level XML tokenizer that preserves the positions of the tokens and is not intended to be used directly.
If you are looking for a higher level solution, check out roxmltree.
Benefits
- All tokens contain `StrSpan` structs which represent the position of the substring in the original document.
- Good error processing. All error types contain the position (line:column) where it occurred.
- No heap allocations.
- No dependencies.
- Tiny. ~1400 LOC and ~30KiB in the release build according to `cargo-bloat`.
- Supports `no_std` builds. To use without the standard library, disable the default features.
Limitations
- Currently, only ENTITY objects are parsed from the DOCTYPE. All others are ignored.
- No tree structure validation. So an XML like `<root><child></root></child>` or a string without root element will be parsed without errors. You should check for this manually. On the other hand `<a/><a/>` will lead to an error.
- Duplicate attributes are not an error. So XML like `<item a="v1" a="v2"/>` will be parsed without errors. You should check for this manually.
- UTF-8 only.
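The manual nesting check mentioned above can be done with a stack of open element names. The sketch below works on a simplified, hypothetical `Event` enum rather than on the tokenizer's own token type; mapping tokens to these events is left to the caller. Note that it allocates a `Vec`, unlike the tokenizer itself.

```rust
/// Simplified element events (hypothetical; not the tokenizer's token type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event<'a> {
    /// `<name>`
    Open(&'a str),
    /// `</name>`
    Close(&'a str),
    /// `<name/>`
    Empty(&'a str),
}

#[derive(Debug, PartialEq, Eq)]
pub enum NestingError<'a> {
    /// A closing tag appeared with no element open.
    UnexpectedClose(&'a str),
    /// A closing tag did not match the innermost open element.
    Mismatch { expected: &'a str, found: &'a str },
    /// The input ended while this element was still open.
    Unclosed(&'a str),
    /// The input contained no element at all.
    NoRoot,
}

/// Checks that every closing tag matches the innermost open element.
pub fn check_nesting<'a>(
    events: impl IntoIterator<Item = Event<'a>>,
) -> Result<(), NestingError<'a>> {
    let mut stack: Vec<&'a str> = Vec::new();
    let mut saw_element = false;
    for event in events {
        match event {
            Event::Open(name) => {
                saw_element = true;
                stack.push(name);
            }
            Event::Empty(_) => saw_element = true,
            Event::Close(found) => match stack.pop() {
                Some(expected) if expected == found => {}
                Some(expected) => return Err(NestingError::Mismatch { expected, found }),
                None => return Err(NestingError::UnexpectedClose(found)),
            },
        }
    }
    if let Some(name) = stack.pop() {
        return Err(NestingError::Unclosed(name));
    }
    if !saw_element {
        return Err(NestingError::NoRoot);
    }
    Ok(())
}
```

A duplicate-attribute check can follow the same pattern: collect the attribute names seen since the last element start and reject a name that is already present.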
Safety
- The library must not panic. Any panic is considered a critical bug and should be reported.
- The library forbids unsafe code.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.