Skip to main content

Crate simdxml

Crate simdxml 

Source
Expand description

SIMD-accelerated XML parser with XPath 1.0 evaluation.

simdxml parses XML into flat arrays instead of a DOM tree, then evaluates XPath expressions against those arrays. The approach adapts simdjson’s structural indexing architecture to XML: SIMD instructions classify structural characters (<, >, ", etc.) in parallel, producing a compact index that supports random-access XPath evaluation without building a pointer-heavy tree.

The structural index uses ~16 bytes per tag (vs ~35 for a typical DOM node), has better cache locality for axis traversal, and supports all 13 XPath 1.0 axes via array operations with O(1) ancestor/descendant checks.

§Quick Start

let xml = b"<library><book><title>Rust</title></book></library>";
let index = simdxml::parse(xml).unwrap();

let titles = index.xpath_text("//title").unwrap();
assert_eq!(titles, vec!["Rust"]);

§Compiled Queries

For repeated queries (batch processing, multiple documents), compile the XPath expression once and reuse it:

use simdxml::CompiledXPath;

let query = CompiledXPath::compile("//title").unwrap();

let docs: Vec<&[u8]> = vec![
    b"<r><title>A</title></r>",
    b"<r><title>B</title></r>",
];
for doc in &docs {
    let index = simdxml::parse(doc).unwrap();
    let results = query.eval_text(&index).unwrap();
    assert_eq!(results.len(), 1);
}

§Scalar Expressions

Top-level scalar expressions (count(), string(), boolean(), arithmetic) are supported via XmlIndex::eval:

let xml = b"<r><item/><item/><item/></r>";
let mut index = simdxml::parse(xml).unwrap();

match index.eval("count(//item)").unwrap() {
    simdxml::XPathResult::Number(n) => assert_eq!(n, 3.0),
    _ => panic!("expected number"),
}

§Batch Processing

Process many documents with a single compiled query. The batch API handles bloom filter prescanning (skip files that can’t match) and lazy parsing (only index tags relevant to the query):

use simdxml::{batch, CompiledXPath};

let docs: Vec<&[u8]> = vec![
    b"<r><claim>First</claim></r>",
    b"<r><other>No claims here</other></r>",
    b"<r><claim>Third</claim></r>",
];
let query = CompiledXPath::compile("//claim").unwrap();
let results = batch::eval_batch_text_bloom(&docs, &query).unwrap();

assert_eq!(results[0], vec!["First"]);
assert!(results[1].is_empty()); // skipped via bloom filter
assert_eq!(results[2], vec!["Third"]);

§Parallel Parsing

Large files can be split across cores for parallel structural indexing. Each chunk is parsed independently, then merged:

let index = simdxml::parallel::parse_parallel(xml, 4).unwrap();
assert!(index.tag_count() > 0);

§Lazy Parsing

When you know the query ahead of time, parse_for_xpath only indexes tags relevant to the expression — skipping 70-90% of index construction for selective queries on large documents:

let xml = b"<r><a>1</a><b>2</b><c>3</c></r>";
let index = simdxml::parse_for_xpath(xml, "//a").unwrap();
let texts = index.xpath_text("//a").unwrap();
assert_eq!(texts, vec!["1"]);

§Persistent Indices

For files queried repeatedly, load_or_parse saves the structural index to a .sxi sidecar file and reloads it via mmap on subsequent calls:

let index = simdxml::load_or_parse("large_file.xml").unwrap();
// First call: parses and saves large_file.sxi
// Subsequent calls: mmap the .sxi, skip parsing entirely

§Platform Support

PlatformSIMD BackendStatus
aarch64 (Apple Silicon, ARM)NEON 128-bitProduction
x86_64AVX2 256-bit / SSE4.2 128-bitProduction
OtherScalar (memchr-accelerated)Working

The parser automatically selects the best available backend at runtime via is_x86_feature_detected! on x86_64 (compile-time on aarch64). A scalar fallback is always available.

Re-exports§

pub use bloom::TagBloom;
pub use error::Result;
pub use error::SimdXmlError;
pub use index::XmlIndex;
pub use persist::OwnedXmlIndex;
pub use xpath::CompiledXPath;
pub use xpath::XPathResult;

Modules§

batch
Columnar batch XPath evaluation.
bloom
Per-document bloom filter of tag names.
error
Error types for XML parsing, XPath evaluation, and index persistence.
index
Flat-array structural index for XML documents.
parallel
Speculative parallel chunked parsing.
persist
Persistent structural index — serialize to .sxi, load via mmap.
xpath
XPath 1.0 evaluation engine.

Functions§

load_or_parse
Load a pre-built .sxi index if it exists and is fresh, otherwise parse and save the index for next time. Returns an OwnedXmlIndex that derefs to XmlIndex.
parse
Parse XML bytes and build a structural index.
parse_for_xpath
Parse XML with query-driven optimization: only index tags relevant to the given XPath expression. Falls back to full parse if the query uses wildcards.