# xmltok

[![crates.io](https://img.shields.io/crates/v/xmltok.svg)](https://crates.io/crates/xmltok)
[![docs.rs](https://img.shields.io/docsrs/xmltok)](https://docs.rs/xmltok)

*xmltok* is a low-level, pull-based, zero-allocation XML 1.0 tokenizer.

It is a fork of [RazrFalcon/xmlparser](https://github.com/RazrFalcon/xmlparser) with a focus on
reducing memory consumption and removing the lifetime parameter from `Token`.

### Differences from upstream

The upstream library represents token fields as `StrSpan<'a>`, a struct that holds a `&'a str`
pointer along with a byte offset into the original document. This ties every `Token<'a>` to the
lifetime of the source string, and each span carries an 8-byte pointer even though the pointer is
always the same (the start of the document).

This fork replaces `StrSpan<'a>` fields inside tokens with two offset-only types:

- **`SmallDetachedStrSpan`** — stores `start` and `end` as `u16` values relative to the token's
  own `start` offset. Used for short in-token strings (tag names, attribute names/values, etc.).
- **`DetachedStrSpan`** — stores `start` and `end` as `u32` values relative to the token's
  own `start` offset. Used where `u16` would overflow (text, CDATA, comments).

Each token stores one absolute `start: u32` position. Most variants store an `end: u16` length;
`ElementStart`, `Attribute`, `Text`, `Comment` and `Cdata` store no end at all — it is derived
from their last sub-span by `Token::range()`. All string spans inside a token are relative to its
`start`, so no pointer is stored. To recover a `&str` you call
`span.as_str(full_text, token_start as usize)`.
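
As a rough illustration of how a token-relative span can be resolved, here is a minimal sketch. `RelSpan` and `rel_span` are hypothetical names invented for this example, not the crate's types; only the `as_str(full_text, token_start)` call shape is taken from the description above.

```rust
/// Hypothetical stand-in for an offset-only span (not the crate's actual type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RelSpan {
    start: u16, // byte offset from the owning token's start
    end: u16,   // exclusive byte offset from the owning token's start
}

impl RelSpan {
    /// Resolves the span against the full document and the owning token's
    /// absolute start offset. Panics if the range is out of bounds.
    pub fn as_str<'a>(&self, full_text: &'a str, token_start: usize) -> &'a str {
        let from = token_start + usize::from(self.start);
        let to = token_start + usize::from(self.end);
        &full_text[from..to]
    }
}

/// Builds a relative span from absolute byte positions `[abs_start, abs_end)`.
/// Returns `None` if a position lies before the token or does not fit in `u16`.
pub fn rel_span(token_start: usize, abs_start: usize, abs_end: usize) -> Option<RelSpan> {
    let start = u16::try_from(abs_start.checked_sub(token_start)?).ok()?;
    let end = u16::try_from(abs_end.checked_sub(token_start)?).ok()?;
    Some(RelSpan { start, end })
}
```

In `<root><item id='7'/></root>`, the `<item` token starts at byte 6 and its name covers bytes 7..11, so `rel_span(6, 7, 11)` stores the pair `(1, 5)` and resolves back to `item` when given the document and the token start.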

The result is that `Token` has no lifetime parameter, is `Copy + 'static`, and occupies
20 bytes instead of upstream's 112.
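
The effect of dropping the borrowed string can be seen on a single span. The sketch below compares two hypothetical structs, `BorrowedSpan` and `SmallSpan`, that only approximate the shapes described above; neither is the real upstream or fork type, and the sizes they yield are for these structs, not for `Token`.

```rust
use std::mem::size_of;

/// Approximates a borrowed span: a string slice plus an offset (assumed shape).
#[allow(dead_code)]
pub struct BorrowedSpan<'a> {
    text: &'a str,
    start: usize,
}

/// Offset-only span: two `u16` values, no pointer and no lifetime.
#[allow(dead_code)]
#[derive(Clone, Copy)]
pub struct SmallSpan {
    start: u16,
    end: u16,
}

/// Returns `(borrowed, small)` sizes in bytes for the two sketch types.
pub fn span_sizes() -> (usize, usize) {
    (size_of::<BorrowedSpan<'static>>(), size_of::<SmallSpan>())
}

/// Compiles only for types that are `Copy + 'static`.
pub fn assert_copy_static<T: Copy + 'static>() {}
```

On a 64-bit target the borrowed sketch is three machine words wide, while the offset-only one is 4 bytes. `assert_copy_static::<SmallSpan>()` compiles because the type holds no borrow.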

The compact representation imposes documented limits, enforced with parse errors instead of
silent truncation: markup tokens (declaration, PI, DOCTYPE, ENTITY, element start/end,
attribute) are limited to 64 KiB each; text, CDATA, and comment tokens, as well as the whole
document, are limited to 4 GiB.
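
One way to turn an oversized length into an error rather than a truncated value is a checked integer conversion. This is a sketch only: `LimitError`, `markup_len`, and `document_offset` are hypothetical names, and the exact boundary values the crate uses are an assumption here.

```rust
/// Hypothetical error type for this sketch; the crate's real errors may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    MarkupTooLong { len: usize },
    DocumentTooLarge { pos: usize },
}

/// Narrows a markup-token length to `u16`, failing instead of wrapping around.
pub fn markup_len(len: usize) -> Result<u16, LimitError> {
    u16::try_from(len).map_err(|_| LimitError::MarkupTooLong { len })
}

/// Narrows an absolute document position to `u32`, failing instead of wrapping.
pub fn document_offset(pos: usize) -> Result<u32, LimitError> {
    u32::try_from(pos).map_err(|_| LimitError::DocumentTooLarge { pos })
}
```

A plain `len as u16` cast would silently wrap 65 536 to 0; `try_from` makes the overflow visible so it can be reported as a parse error.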

### Benchmarks

Measured on Apple Silicon (aarch64) with Rust 1.96 against `xmlparser` 0.13.6 (the latest
crates.io release, identical to upstream master). Four XML fixtures:
- **dense** — 44 KB, compact elements, minimal whitespace
- **spaced** — 66 KB, heavy inter-element whitespace
- **dtd** — 12 KB, DOCTYPE with 50 ENTITY declarations, comments, and processing instructions
- **large** — 1 GB, 10.5 million items, same structure as dense

#### Token memory footprint

| | This fork | Upstream 0.13.6 |
|---|---|---|
| `size_of::<Token>()` | **20 bytes** | 112 bytes |
| 10.5 M tokens (1 GB file) | **210 MB** | 1 176 MB |
| Savings per token | **−92 bytes (82%)** | — |

#### Streaming throughput (tokens counted, nothing stored)

| File | Fork | Upstream 0.13.6 | Δ |
|---|---|---|---|
| dense (44 KB) | 536 MiB/s | 434 MiB/s | **+24%** |
| spaced (66 KB) | 628 MiB/s | 485 MiB/s | **+30%** |
| dtd (12 KB) | 582 MiB/s | 450 MiB/s | **+29%** |
| large (1 GB) | 594 MiB/s | 457 MiB/s | **+30%** |

#### Collect throughput (`Vec<Token>` allocation included)

| File | Fork | Upstream 0.13.6 | Δ |
|---|---|---|---|
| dense (44 KB) | 498 MiB/s | 356 MiB/s | **+40%** |
| spaced (66 KB) | 624 MiB/s | 408 MiB/s | **+53%** |
| dtd (12 KB) | 535 MiB/s | 379 MiB/s | **+41%** |

The streaming gap comes from byte-table scanning in the hot paths (qualified names, text,
attribute values, comments). The collect gap additionally benefits from the 5.6× smaller
`Vec` — less memory to allocate, write, and grow during reallocation.
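
For readers unfamiliar with the term, byte-table scanning classifies each byte with a single array lookup instead of a chain of comparisons. The sketch below shows the general idea for a text run; `TEXT_END` and `scan_text` are hypothetical names, the choice of terminator bytes is an assumption, and this is not the crate's actual scanning code.

```rust
/// 256-entry lookup table: `true` for bytes that end a run of plain text.
/// Treating `<` and `&` as the terminators is an assumption of this sketch.
const TEXT_END: [bool; 256] = {
    let mut table = [false; 256];
    table[b'<' as usize] = true;
    table[b'&' as usize] = true;
    table
};

/// Returns the length of the leading run of bytes that are not terminators.
pub fn scan_text(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .position(|&b| TEXT_END[usize::from(b)])
        .unwrap_or(bytes.len())
}
```

The table is built at compile time, so the inner loop does one indexed load per byte. `scan_text(b"hello<b>")` stops at the `<` and returns 5.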

Numbers are medians from the interleaved A/B harnesses, which are stable run-to-run:

```
cargo run --release --example ab [fixture.xml]
cargo run --release --example ab_collect [fixture.xml]
```

Criterion benchmarks live in `benches/tokenize.rs` (`cargo bench`).

---

Original upstream README below.

# xmlparser

[<img alt="github" src="https://img.shields.io/badge/github-RazrFalcon/xmlparser-8da0cb?style=for-the-badge&logo=github" height="20">](https://github.com/RazrFalcon/xmlparser)
[<img alt="crates.io" src="https://img.shields.io/crates/v/xmlparser.svg?style=for-the-badge&color=fc8d62&logo=rust" height="20">](https://crates.io/crates/xmlparser)
[<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-xmlparser-66c2a5?style=for-the-badge&logoColor=white&logo=data:image/svg+xml;base64,PHN2ZyByb2xlPSJpbWciIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgdmlld0JveD0iMCAwIDUxMiA1MTIiPjxwYXRoIGZpbGw9IiNmNWY1ZjUiIGQ9Ik00ODguNiAyNTAuMkwzOTIgMjE0VjEwNS41YzAtMTUtOS4zLTI4LjQtMjMuNC0zMy43bC0xMDAtMzcuNWMtOC4xLTMuMS0xNy4xLTMuMS0yNS4zIDBsLTEwMCAzNy41Yy0xNC4xIDUuMy0yMy40IDE4LjctMjMuNCAzMy43VjIxNGwtOTYuNiAzNi4yQzkuMyAyNTUuNSAwIDI2OC45IDAgMjgzLjlWMzk0YzAgMTMuNiA3LjcgMjYuMSAxOS45IDMyLjJsMTAwIDUwYzEwLjEgNS4xIDIyLjEgNS4xIDMyLjIgMGwxMDMuOS01MiAxMDMuOSA1MmMxMC4xIDUuMSAyMi4xIDUuMSAzMi4yIDBsMTAwLTUwYzEyLjItNi4xIDE5LjktMTguNiAxOS45LTMyLjJWMjgzLjljMC0xNS05LjMtMjguNC0yMy40LTMzLjd6TTM1OCAyMTQuOGwtODUgMzEuOXYtNjguMmw4NS0zN3Y3My4zek0xNTQgMTA0LjFsMTAyLTM4LjIgMTAyIDM4LjJ2LjZsLTEwMiA0MS40LTEwMiA0MS40di0uNnptODQgMjkxLjFsLTg1IDQyLjV2LTc5LjFsODUtMzguOHY3NS40em0wLTExMmwtMTAyIDQxLjQtMTAyLTQxLjR2LS42bDEwMiAzOC4yIDEwMiAzOC4ydi42em0yNDAgMTEybC04NSA0Mi41di03OS4xbDg1LTM4Ljh2NzUuNHptMC0xMTJsLTEwMiA0MS40LTEwMiA0MS40di0uNmwxMDItMzguMiAxMDIgMzguMnYuNnoiPjwvcGF0aD48L3N2Zz4K" height="20">](https://docs.rs/xmlparser)
[<img alt="build status" src="https://img.shields.io/github/actions/workflow/status/RazrFalcon/xmlparser/ci.yml?branch=master&style=for-the-badge" height="20">](https://github.com/RazrFalcon/xmlparser/actions?query=branch%3Amaster)

*xmlparser* is a low-level, pull-based, zero-allocation
[XML 1.0](https://www.w3.org/TR/xml/) parser.

<br>

## Example

```rust
for token in xmlparser::Tokenizer::from("<tagname name='value'/>") {
    println!("{:?}", token);
}
```

<br>

## Why a new library?

This library is basically a low-level XML tokenizer that preserves the
positions of the tokens and is not intended to be used directly.

If you are looking for a higher level solution, check out
[roxmltree](https://github.com/RazrFalcon/roxmltree).

<br>

## Benefits

- All tokens contain `StrSpan` structs which represent the position of the
  substring in the original document.
- Good error processing. All error types contain the position (line:column)
  where the error occurred.
- No heap allocations.
- No dependencies.
- Tiny. ~1400 LOC and ~30KiB in the release build according to
  `cargo-bloat`.
- Supports `no_std` builds. To use without the standard library, disable the
  default features.

<br>

## Limitations

- Currently, only ENTITY objects are parsed from the DOCTYPE. All others are
  ignored.
- No tree structure validation. XML such as
  `<root><child></root></child>` or a string without a root element will be
  parsed without errors. You should check for this manually. On the other
  hand, `<a/><a/>` will lead to an error.
- Duplicate attributes are not an error. XML such as `<item a="v1" a="v2"/>`
  will be parsed without errors. You should check for this manually.
- UTF-8 only.
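
The manual checks mentioned above can be done with a stack of open element names and a per-tag set of attribute names. The following is a minimal sketch over a hypothetical, simplified `Event` enum; mapping the tokenizer's real tokens onto it is left to the caller, and it does not check for a missing root element.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified events (not the tokenizer's token type).
#[derive(Clone, Copy)]
pub enum Event<'a> {
    Open(&'a str),  // `<name`
    Attr(&'a str),  // attribute name inside the current start tag
    SelfClose,      // `/>`
    Close(&'a str), // `</name>`
}

/// Checks element nesting and duplicate attributes over a sequence of events.
pub fn check_structure(events: &[Event<'_>]) -> Result<(), String> {
    let mut open: Vec<&str> = Vec::new(); // stack of currently open elements
    let mut attrs: HashSet<&str> = HashSet::new(); // names seen in the current tag
    for event in events {
        match *event {
            Event::Open(name) => {
                open.push(name);
                attrs.clear();
            }
            Event::Attr(name) => {
                if !attrs.insert(name) {
                    return Err(format!("duplicate attribute `{name}`"));
                }
            }
            Event::SelfClose => {
                open.pop();
            }
            Event::Close(name) => match open.pop() {
                Some(top) if top == name => {}
                Some(top) => return Err(format!("expected `</{top}>`, found `</{name}>`")),
                None => return Err(format!("unexpected `</{name}>`")),
            },
        }
    }
    match open.last() {
        Some(name) => Err(format!("unclosed element `{name}`")),
        None => Ok(()),
    }
}
```

Fed the events for `<root><child></root></child>`, the function reports the mismatched close tag; fed those for `<item a="v1" a="v2"/>`, it reports the duplicate attribute.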

<br>

## Safety

- The library must not panic. Any panic is considered a critical bug and
  should be reported.
- The library forbids unsafe code.

<br>

## License

Licensed under either of

- Apache License, Version 2.0 ([LICENSE-APACHE] or
  http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT] or http://opensource.org/licenses/MIT)

at your option.

<br>

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.

[LICENSE-APACHE]: https://github.com/RazrFalcon/xmlparser/blob/master/LICENSE-APACHE
[LICENSE-MIT]: https://github.com/RazrFalcon/xmlparser/blob/master/LICENSE-MIT