uppsala 0.3.0

A pure Rust XML parser, DOM, namespace, XPath, and XSD validation library
# Uppsala

A **zero-dependency** pure Rust XML library.

Uppsala implements the core XML stack from parsing through schema validation,
with no external crates -- not even in dev-dependencies. Everything is built
from scratch: the parser, the DOM, the XPath engine, the XSD validator, and
even the regex engine used for XSD pattern facets.

## Features

- **XML 1.0 (Fifth Edition)** parsing and well-formedness checking
- **Namespaces in XML 1.0 (Third Edition)** with prefix resolution and scoping
- **Arena-based DOM** with tree mutation (insert, remove, replace)
- **XPath 1.0** evaluation (all axes, functions, predicates, operators)
- **XSD 1.1 validation** (structures + datatypes, 40+ built-in types)
- **XSD regex engine** (custom NFA matcher for pattern facets)
- **SIMD-accelerated parsing** (SSE2 on x86_64, scalar fallback elsewhere)
- **Serialization** with round-trip fidelity, pretty-printing, and streaming output
- **XmlWriter** for imperative XML construction without a DOM
- **UTF-16 auto-detection** (LE/BE with or without BOM)
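
To illustrate the UTF-16 auto-detection item, here is a minimal sketch of the general technique from Appendix F of the XML 1.0 specification: check for a byte order mark first, then fall back to how the leading `<` is laid out in memory. The names `Detected`, `detect_encoding`, and `decode_utf16` are hypothetical and are not Uppsala's API; the check is also simplified (it ignores UTF-32, for example).

```rust
/// Encoding guessed from the first bytes of a document (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Detected {
    Utf16Le,
    Utf16Be,
    Utf8OrOther,
}

/// Guess UTF-16 from a BOM or, failing that, from the byte layout of `<` (0x3C).
/// Simplified: a real detector would also look at UTF-32 and the XML declaration.
fn detect_encoding(bytes: &[u8]) -> Detected {
    match bytes {
        [0xFF, 0xFE, ..] => Detected::Utf16Le, // BOM, little-endian
        [0xFE, 0xFF, ..] => Detected::Utf16Be, // BOM, big-endian
        [0x3C, 0x00, ..] => Detected::Utf16Le, // "<" with no BOM
        [0x00, 0x3C, ..] => Detected::Utf16Be, // "<" with no BOM
        _ => Detected::Utf8OrOther,
    }
}

/// Decode UTF-16 bytes into a String, skipping a leading BOM if present.
/// Returns None for odd-length input or invalid surrogate sequences.
fn decode_utf16(bytes: &[u8], little_endian: bool) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| {
            if little_endian {
                u16::from_le_bytes([pair[0], pair[1]])
            } else {
                u16::from_be_bytes([pair[0], pair[1]])
            }
        })
        .collect();
    let units = match units.first() {
        Some(&0xFEFF) => &units[1..], // drop the BOM code unit
        _ => &units[..],
    };
    String::from_utf16(units).ok()
}

fn main() {
    // Encode a tiny document as UTF-16LE without a BOM, then detect and decode it.
    let le: Vec<u8> = "<a/>".encode_utf16().flat_map(|u| u.to_le_bytes()).collect();
    let detected = detect_encoding(&le);
    println!("{:?} -> {:?}", detected, decode_utf16(&le, true));
}
```

The BOM check has to come first, because a BOM-prefixed document does not start with the bytes of `<`.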

## Conformance

Uppsala is tested against the W3C conformance suites:

| Suite | Pass Rate | Tests |
|-------|-----------|-------|
| W3C XML Conformance (not-wf) | 100% | 631/631 |
| W3C XML Conformance (valid) | 100% | 531/531 |
| W3C XML Conformance (invalid) | 100% | 46/46 |
| W3C XSD -- NIST Datatypes | 100% | 19,217/19,217 |
| W3C XSD -- Sun Combined | 100% | 199/199 |
| W3C XSD -- MS DataTypes | 100% | 1,212/1,212 |

In addition, there are 274 hand-written tests covering XML parsing, namespaces,
XPath evaluation, XSD validation, serialization round-trips, and source ranges.

```bash
# Run all tests
cargo test

# Run W3C XML Conformance Suite (~1208 tests)
cargo test --test w3c_xmlconf

# Run W3C XML Schema Test Suite (~20156 tests)
cargo test --test w3c_xsts -- --nocapture
```

## Performance

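The numbers below were measured in an Ubuntu 24.04 VM, not in a controlled
benchmarking environment, so treat them as indicative only. A full benchmark
on proper hardware would be a welcome contribution.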

Uppsala uses SSE2 SIMD intrinsics on x86_64 to scan text content and attribute
values 16 bytes at a time, with a scalar fallback for other architectures.
Combined with lookup-table optimizations and zero-copy parsing, this made it
faster than roxmltree on every file measured:

| File | Size | vs roxmltree |
|------|------|-------------|
| gigantic.svg | 1.3 MB | **5.3x faster** |
| text.xml | 126 KB | **9.3x faster** |
| attributes.xml | 265 KB | **2.0x faster** |
| medium.svg | 152 KB | **1.4x faster** |
| huge.xml | 815 KB | **1.2x faster** |
| SAML files | 3-11 KB | **1.5-1.8x faster** |

Text-heavy documents benefit most from SIMD -- long runs of plain text between
markup are scanned with minimal per-byte overhead.
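
The 16-bytes-at-a-time scan with a scalar fallback can be sketched roughly as follows. This is an illustration of the general technique using `std::arch` SSE2 intrinsics, not Uppsala's `simd.rs`; the function names `scan_text` and `scan_scalar`, and the choice of `<` and `&` as the delimiters to stop at, are assumptions made for the example.

```rust
/// Scalar scan: index of the first `<` or `&`, or `bytes.len()` if there is none.
fn scan_scalar(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .position(|&b| b == b'<' || b == b'&')
        .unwrap_or(bytes.len())
}

/// SSE2 scan: compare 16 bytes per iteration, then finish the tail with the scalar loop.
#[cfg(target_arch = "x86_64")]
fn scan_text(bytes: &[u8]) -> usize {
    use std::arch::x86_64::*;
    let mut i = 0;
    // SAFETY: SSE2 is part of the x86_64 baseline, and every unaligned load
    // reads exactly 16 bytes that lie inside `bytes` (i + 16 <= len).
    unsafe {
        let lt = _mm_set1_epi8(b'<' as i8);
        let amp = _mm_set1_epi8(b'&' as i8);
        while i + 16 <= bytes.len() {
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i);
            // Each matching byte becomes 0xFF; merge the two comparisons.
            let hits = _mm_or_si128(_mm_cmpeq_epi8(chunk, lt), _mm_cmpeq_epi8(chunk, amp));
            // One bit per byte; the lowest set bit is the first delimiter.
            let mask = _mm_movemask_epi8(hits);
            if mask != 0 {
                return i + mask.trailing_zeros() as usize;
            }
            i += 16;
        }
    }
    i + scan_scalar(&bytes[i..])
}

/// Other architectures use the scalar loop for the whole input.
#[cfg(not(target_arch = "x86_64"))]
fn scan_text(bytes: &[u8]) -> usize {
    scan_scalar(bytes)
}

fn main() {
    let text = b"a long run of plain character data with no markup at all <tag>";
    let idx = scan_text(text);
    println!("first delimiter at byte {}: {:?}", idx, text[idx] as char);
}
```

In a sketch like this, a chunk with no delimiter costs one load, two compares, and one mask test, which is where long runs of plain text gain the most.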

Is this really fast? Maybe, maybe not. But it is good enough for my use cases right now.

## Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
uppsala = "0.3"
```

### Parse and query

```rust
use uppsala::{parse, XPathEvaluator};
use uppsala::xpath::XPathValue;

let xml = r#"
<bookstore>
  <book category="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <price>10.99</price>
  </book>
  <book category="non-fiction">
    <title>Sapiens</title>
    <author>Yuval Noah Harari</author>
    <price>14.99</price>
  </book>
</bookstore>
"#;

let mut doc = parse(xml).unwrap();

// DOM traversal
let titles = doc.get_elements_by_tag_name("title");
for id in &titles {
    println!("{}", doc.text_content_deep(*id));
}

// XPath queries
doc.prepare_xpath();
let eval = XPathEvaluator::new();
let root = doc.root();
if let Ok(XPathValue::NodeSet(nodes)) =
    eval.evaluate(&doc, root, "//book[@category='fiction']/title")
{
    for id in &nodes {
        println!("Fiction: {}", doc.text_content_deep(*id));
    }
}
```

### Validate against an XSD schema

```rust
use uppsala::{parse, XsdValidator};

let schema_xml = r#"
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="temperature" type="xs:decimal"/>
</xs:schema>
"#;

let instance_xml = "<temperature>36.6</temperature>";

let schema_doc = parse(schema_xml).unwrap();
let instance_doc = parse(instance_xml).unwrap();
let validator = XsdValidator::from_schema(&schema_doc).unwrap();
let errors = validator.validate(&instance_doc);

if errors.is_empty() {
    println!("Valid!");
} else {
    for e in &errors {
        println!("Validation error: {}", e);
    }
}
```

### Build XML with XmlWriter

```rust
use uppsala::XmlWriter;

let mut w = XmlWriter::new();
w.write_declaration();
w.start_element("catalog", &[("xmlns", "urn:example:catalog")]);
w.start_element("item", &[("id", "1")]);
w.text("Widget");
w.end_element("item");
w.empty_element("item", &[("id", "2"), ("name", "Gadget")]);
w.end_element("catalog");

println!("{}", w.into_string());
```

### Pretty-print a document

```rust
use uppsala::{parse, XmlWriteOptions};

let xml = "<root><a><b>text</b></a></root>";
let doc = parse(xml).unwrap();
let opts = XmlWriteOptions::pretty("  ");
println!("{}", doc.to_xml_with_options(&opts));
```

## Architecture

Uppsala uses an arena-based DOM where all nodes live in a flat `Vec<NodeData>`
indexed by `NodeId(usize)`. Tree relationships are maintained through
parent/first_child/last_child/next_sibling/prev_sibling indices. This avoids
`Rc`/`RefCell` overhead and makes tree mutation straightforward.
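
A minimal sketch of that layout, assuming a simplified `NodeData` that stores only a name plus the five link fields. The `Arena` type and its methods are hypothetical names for illustration and do not reflect Uppsala's actual `Document` API.

```rust
/// Index into the arena's node vector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(usize);

/// Simplified node: a name plus the tree links, all stored as indices.
#[derive(Debug)]
struct NodeData {
    name: String,
    parent: Option<NodeId>,
    first_child: Option<NodeId>,
    last_child: Option<NodeId>,
    next_sibling: Option<NodeId>,
    prev_sibling: Option<NodeId>,
}

/// Hypothetical arena: every node lives in one flat Vec.
#[derive(Debug, Default)]
struct Arena {
    nodes: Vec<NodeData>,
}

impl Arena {
    /// Allocate a detached node and return its index.
    fn new_node(&mut self, name: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(NodeData {
            name: name.to_string(),
            parent: None,
            first_child: None,
            last_child: None,
            next_sibling: None,
            prev_sibling: None,
        });
        id
    }

    /// Append `child` as the last child of `parent`. Assumes `child` is detached.
    fn append_child(&mut self, parent: NodeId, child: NodeId) {
        let old_last = self.nodes[parent.0].last_child;
        self.nodes[child.0].parent = Some(parent);
        self.nodes[child.0].prev_sibling = old_last;
        match old_last {
            Some(last) => self.nodes[last.0].next_sibling = Some(child),
            None => self.nodes[parent.0].first_child = Some(child),
        }
        self.nodes[parent.0].last_child = Some(child);
    }

    /// Unlink `node` from its parent and siblings; its slot stays in the Vec.
    fn detach(&mut self, node: NodeId) {
        let (parent, prev, next) = {
            let n = &self.nodes[node.0];
            (n.parent, n.prev_sibling, n.next_sibling)
        };
        match (prev, parent) {
            (Some(p), _) => self.nodes[p.0].next_sibling = next,
            (None, Some(par)) => self.nodes[par.0].first_child = next,
            (None, None) => {}
        }
        match (next, parent) {
            (Some(n), _) => self.nodes[n.0].prev_sibling = prev,
            (None, Some(par)) => self.nodes[par.0].last_child = prev,
            (None, None) => {}
        }
        let n = &mut self.nodes[node.0];
        n.parent = None;
        n.prev_sibling = None;
        n.next_sibling = None;
    }

    /// Walk the sibling chain and collect the child names in document order.
    fn child_names(&self, parent: NodeId) -> Vec<&str> {
        let mut out = Vec::new();
        let mut cur = self.nodes[parent.0].first_child;
        while let Some(id) = cur {
            out.push(self.nodes[id.0].name.as_str());
            cur = self.nodes[id.0].next_sibling;
        }
        out
    }
}

fn main() {
    let mut arena = Arena::default();
    let root = arena.new_node("root");
    let a = arena.new_node("a");
    let b = arena.new_node("b");
    arena.append_child(root, a);
    arena.append_child(root, b);
    println!("{:?}", arena.child_names(root));
    arena.detach(a);
    println!("{:?}", arena.child_names(root));
}
```

Because links are plain indices, mutation is a handful of field updates on `&mut Arena`, with no reference counting or interior mutability involved.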

```
src/
  lib.rs            Public API, parse(), parse_bytes(), encoding detection
  error.rs          XmlError enum, XmlResult type alias
  dom.rs            Arena-based DOM: Document, NodeId, QName, serialization
  parser.rs         XML 1.0 recursive-descent parser with full DTD internal subset
  simd.rs           SSE2-accelerated byte scanning (content + attribute delimiters)
  namespace.rs      Namespace prefix resolution with scope stack
  writer.rs         XmlWriter imperative builder
  xpath.rs          XPath 1.0 lexer, parser, and evaluator
  xsd/              XSD validator (split into submodules)
    mod.rs          Module declarations, re-exports
    types.rs        Core data structures (XsdValidator, ElementDecl, TypeDef, etc.)
    builder.rs      Multi-pass schema builder
    parser.rs       Schema element/type/attribute/group parsing
    validation.rs   Instance document validation
    builtins.rs     Built-in type validation, facet enforcement
    composition.rs  xs:include, xs:redefine, xs:import
    identity.rs     xs:key, xs:unique, xs:keyref
    datetime.rs     Date/time/duration validation
    decimal.rs      Arbitrary-precision decimal comparison
  xsd_regex.rs      XSD regex pattern engine (custom NFA matcher)
```

## Examples

The `examples/` directory contains runnable programs:

```bash
# Parse XML, traverse the DOM, and run XPath queries
cargo run --example parse_and_query

# Validate documents against XSD schemas
cargo run --example validate_schema

# Build XML programmatically with XmlWriter and DOM
cargo run --example build_xml
```

## Test Data Licensing

The `test-data/` directory contains third-party conformance test suites.
These files are **not** covered by Uppsala's BSD-2-Clause license; they
retain their original licenses as described below.

### W3C XML Conformance Test Suite

- **Location:** `test-data/xmlconf/`
- **Version:** 20130923
- **Source:** <https://www.w3.org/XML/Test/>
- **License:** [W3C Document License](https://www.w3.org/copyright/document-license-2023/)
- **Contributors:** James Clark (xmltest), Sun Microsystems, IBM,
  OASIS, Edinburgh University (eduni), and others

### W3C XML Schema Test Suite (XSTS)

- **Location:** `test-data/xsts/xmlschema2006-11-06/`
- **Version:** 2006-11-06
- **Source:** <https://www.w3.org/XML/2004/xml-schema-test-suite/>
- **License:** [W3C Document License](https://www.w3.org/copyright/document-license-2023/)
  (see `test-data/xsts/xmlschema2006-11-06/00COPYRIGHT`)
- **Contributors:** NIST, Microsoft, Sun Microsystems, Boeing

## License

Uppsala itself is licensed under the BSD-2-Clause license. See [LICENSE](LICENSE)
for details.