Simple(ish) parser and extractor of XML.
This package provides an XmlReader which can automatically determine the character encoding
of UTF-8 and UTF-16 (big endian and little endian byte order) XML byte streams, and parse the
XML into an immutable Element tree held within an XmlDocument. It's also possible to use a
custom byte stream decoder to read XML in other character encodings.
The aim of this package is to support as closely as possible the W3C specifications Extensible Markup Language (XML) 1.0 and Namespaces in XML 1.0 for well-formed XML. This package does not aim to support validation of XML, and consequently DTD (document type definition) is deliberately not supported.
Namespace support is always enabled, so the colon character is not permitted within the names of elements or attributes, except as the separator between a namespace prefix and the local part of the name.
XML concepts already supported
- Elements
- Attributes
- Default namespaces: xmlns="namespace.com"
- Prefixed namespaces: xmlns:prefix="namespace.com"
- Processing instructions
- Comments (skipped and thus not retrievable)
- CDATA sections
- Element language (xml:lang) and filtering by language
- White space indication (xml:space)
- Automatic detection and decoding of UTF-8 and UTF-16 XML streams.
- Support for custom encodings where the encoding is known before parsing, and where the client supplies a custom decoder to handle the byte-to-character conversion.
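To make the automatic detection of UTF-8 and UTF-16 more concrete, here is a minimal standard-library sketch of how an encoding can be guessed from the first bytes of a stream, loosely following the autodetection notes in Appendix F of the XML 1.0 specification. It is not this package's implementation: the names DetectedEncoding and sniff_encoding are hypothetical, and the sketch assumes the stream starts with either a byte order mark or a "<" character.

```rust
/// Encodings this sketch can recognise (hypothetical type, not part of the package).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DetectedEncoding {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Guess the encoding from a byte order mark or, failing that, from how the
/// leading '<' character is laid out in the first bytes.
/// Returns None when the prefix is not recognised.
pub fn sniff_encoding(prefix: &[u8]) -> Option<DetectedEncoding> {
    match prefix {
        // Byte order marks.
        [0xEF, 0xBB, 0xBF, ..] => Some(DetectedEncoding::Utf8),
        [0xFE, 0xFF, ..] => Some(DetectedEncoding::Utf16Be),
        [0xFF, 0xFE, ..] => Some(DetectedEncoding::Utf16Le),
        // No byte order mark: '<' stored as one 16-bit unit.
        [0x00, b'<', ..] => Some(DetectedEncoding::Utf16Be),
        [b'<', 0x00, ..] => Some(DetectedEncoding::Utf16Le),
        // No byte order mark: '<' stored as a single byte.
        [b'<', ..] => Some(DetectedEncoding::Utf8),
        _ => None,
    }
}
```

A reader built along these lines would peek at the first few bytes, pick a decoder, and fall back to a client-supplied decoder when the result is None.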
Examples
Reading an XML file
Suppose you want to read and extract XML from a file you know to be either UTF-8 or UTF-16
encoded. You can use XmlReader::parse_auto to read, parse, and extract the XML from the file
and return either an XmlDocument or an std::io::Error.
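// Placeholder path; parse_auto is assumed here to take the opened file.
let xml_file = File::open("path/to/file.xml")?;
let xml_doc = XmlReader::parse_auto(xml_file)?;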
Traversing an XmlDocument
Once you have an XmlDocument you can grab an immutable reference to the root Element and
then traverse through the element tree using the req (required child element) and opt
(optional child element) methods to target the first child element with the specified name.
And once we're pointing at the desired target, we can use element() or text() to attempt to
grab the element or text-only content of the target element.
For example, let's define a simple XML structure where required elements have a name starting with "r_" and optional elements have a name starting with "o_".
Helix
2021-05-12
23.10
archseer
the-mikedavis
sudormrfbin
pascalkuthe
dsseng
pickfire
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
Note that the return type of the element() and text() methods varies depending on whether
the method chain involves req or opt or both. This table summarizes the scenarios.
Similarly, the return types of att_req and att_opt methods also vary depending on the method
chain.
An easier way to remember this: req/att_req generate an error if the element or attribute
does not exist, so using them means the return type involves a Result<_, XmlError> of some
sort. opt/att_opt may or may not return a value, so using them means the return type
involves an Option<_> of some sort. Mixing the two (required and optional) means the return
type involves a Result<Option<_>, XmlError> of some sort. Finally, text() generates an error
if the target element does not have simple content (no child elements and no processing
instructions), so its use also means the return type involves a Result of some sort.
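The way required and optional lookups combine into Result, Option, and Result<Option<_>, _> can be modelled with plain standard-library types. The sketch below is not this package's API: Node, MissingElement, required, optional, req_then_opt and opt_then_req are hypothetical stand-ins. In particular, treating an absent optional parent as Ok(None) is an assumption made for illustration.

```rust
/// Hypothetical stand-in for the error raised when a required child is absent.
#[derive(Debug, PartialEq)]
pub struct MissingElement(pub String);

/// Hypothetical stand-in for an element with named children.
#[derive(Debug, PartialEq)]
pub struct Node {
    pub name: String,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(name: &str, children: Vec<Node>) -> Self {
        Node { name: name.to_string(), children }
    }

    /// Required child: absence is an error, so the result is a Result.
    pub fn required(&self, name: &str) -> Result<&Node, MissingElement> {
        self.optional(name)
            .ok_or_else(|| MissingElement(name.to_string()))
    }

    /// Optional child: absence is fine, so the result is an Option.
    pub fn optional(&self, name: &str) -> Option<&Node> {
        self.children.iter().find(|c| c.name == name)
    }
}

/// Required step followed by an optional step: Result<Option<_>, _>.
pub fn req_then_opt<'a>(
    root: &'a Node,
    req: &str,
    opt: &str,
) -> Result<Option<&'a Node>, MissingElement> {
    Ok(root.required(req)?.optional(opt))
}

/// Optional step followed by a required step. If the optional parent is
/// absent the result is Ok(None); if it is present but lacks the required
/// child, the result is an error.
pub fn opt_then_req<'a>(
    root: &'a Node,
    opt: &str,
    req: &str,
) -> Result<Option<&'a Node>, MissingElement> {
    root.optional(opt)
        .map(|parent| parent.required(req))
        .transpose()
}
```

The point of the sketch is only the shape of the types: any chain that includes a required step can fail, and any chain that includes an optional step can come back empty.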
More complex traversal using XmlPath
The methods req and opt always turn their attention to the first child element with the
given name. It's not possible to use them to target a sibling, say the second "Widget" within a
list of "Widget" elements. To target siblings, and/or to iterate multiple elements, you instead
use XmlPath. (Don't confuse this with XPath, which has a
similar purpose but a very different implementation.)
For example, if you have XML which contains a list of employees, and you want to iterate the
employees' tasks' deadlines, you could use XmlPath like this:
Angelica
Finance
Payroll
tomorrow
Reconciliation
Friday
Byron
Sales
Close the big deal
Saturday night
Cat
Software
Fix that bug
Maybe later this month
Add that new feature
Possibly this year
Make that customer happy
Good luck with that
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
This creates and iterates an XmlPath which represents "the first deadline element within
every task within the first task-list within every employee". Based on the example XML above,
this will print out all the text content of all six "deadline" elements.
Note that we could use first("employee") if we only wanted the first employee, or
nth("employee", 1) if we only wanted the second employee (zero would point to the first),
or last("employee") if we only wanted the last employee. Similarly, we could use
first("task") if we only wanted to consider the first task in each employee's list.
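The idea of a path made of "every", "first", "nth" and "last" steps can be illustrated without the package at all. The following sketch uses a hypothetical Node tree and a hypothetical Step enum; it is a model of the selection semantics described above, not the XmlPath implementation, and the helper names (leaf, branch, select, texts) are made up for the example.

```rust
/// Hypothetical stand-in for an element: a name, text content and children.
#[derive(Debug)]
pub struct Node {
    pub name: &'static str,
    pub text: &'static str,
    pub children: Vec<Node>,
}

/// Element with text and no children.
pub fn leaf(name: &'static str, text: &'static str) -> Node {
    Node { name, text, children: Vec::new() }
}

/// Element with children and no text.
pub fn branch(name: &'static str, children: Vec<Node>) -> Node {
    Node { name, text: "", children }
}

/// One step of a path: which of the same-named children to keep.
pub enum Step {
    All(&'static str),
    First(&'static str),
    Nth(&'static str, usize), // zero-based index
    Last(&'static str),
}

fn children_named<'a>(node: &'a Node, name: &str) -> Vec<&'a Node> {
    node.children.iter().filter(|c| c.name == name).collect()
}

fn apply<'a>(node: &'a Node, step: &Step) -> Vec<&'a Node> {
    match step {
        Step::All(n) => children_named(node, n),
        Step::First(n) => children_named(node, n).into_iter().take(1).collect(),
        Step::Nth(n, i) => children_named(node, n).into_iter().skip(*i).take(1).collect(),
        Step::Last(n) => children_named(node, n).into_iter().last().into_iter().collect(),
    }
}

/// Apply each step in turn to every element selected by the previous step.
pub fn select<'a>(root: &'a Node, steps: &[Step]) -> Vec<&'a Node> {
    let mut current = vec![root];
    for step in steps {
        current = current.into_iter().flat_map(|n| apply(n, step)).collect();
    }
    current
}

/// Convenience: the text of every element selected by the path.
pub fn texts(root: &Node, steps: &[Step]) -> Vec<&'static str> {
    select(root, steps).iter().map(|n| n.text).collect()
}
```

With a tree shaped like the employee example, the steps All("employee"), First("task-list"), All("task"), First("deadline") select one deadline per task, while swapping the first step for Nth("employee", 1) restricts the result to the second employee.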
Filtering elements within an XmlPath
An XmlPath not only lets you specify which child element names are of interest, but also lets
you specify which xml:lang patterns are of interest, and lets you specify a required attribute
name-value pair which must be found within a child element in order to include it in the
iterator.
C&C: Tiberian Dawn
Command & Conquer
C&C: Teil 1
Doom
Zla kob
ドゥーム
Half-Life
Polu-život
Aliens
Aliens - Återkomsten
Quái Vật Không Gian 2
The Cabin In The Woods
Хижа в гората
La cabane dans les bois
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
This will print out the names of all four English-language titles for the three games. It will
skip all of the movies, and all names which are rejected by the "en" language filter. Note
that this "en" filter will match both xml:lang="en" and xml:lang="en-US" so you'll get two
matching name elements for the first game.
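The observation that an "en" filter matches both "en" and "en-US" is consistent with prefix-style language range matching (the "basic filtering" scheme of RFC 4647). The sketch below shows that rule using only the standard library. The function name lang_matches is hypothetical, and whether this package follows exactly this rule, including the case-insensitive comparison, is an assumption.

```rust
/// Returns true when the xml:lang value `tag` is matched by the filter `range`.
/// A range matches a tag that is equal to it, or that starts with it followed
/// by a '-' separator. Comparison is ASCII case-insensitive.
pub fn lang_matches(range: &str, tag: &str) -> bool {
    let range = range.to_ascii_lowercase();
    let tag = tag.to_ascii_lowercase();
    match tag.strip_prefix(range.as_str()) {
        // Exact match, or the range is a whole-subtag prefix of the tag.
        Some(rest) => rest.is_empty() || rest.starts_with('-'),
        None => false,
    }
}
```

Note that the '-' check matters: without it, a filter of "en" would wrongly accept an unrelated tag that merely begins with the same letters.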
Attribute extraction
Getting the value of an attribute is done with the methods att_req (generate an error if the
attribute is missing) and att_opt (no error if the attribute is missing).
For example, given this simple XML document, we can grab the attribute values easily.
40.5
38.9
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
Note: the xml:lang and xml:space values cannot be read as attribute values from an
Element, because these are "special attributes" whose values are inherited by child elements
(and the language is inherited by an element's attributes too). To get the effective value of
these language and space properties, see the methods language_tag and white_space_handling
instead.
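Inheritance is the reason an "effective" value differs from a plain attribute lookup. As a rough illustration, the sketch below resolves the effective language of an element by walking up through its ancestors. The Elem type and effective_lang function are hypothetical and unrelated to this package's internals. Treating an empty xml:lang value as "no language" follows the XML 1.0 specification and is assumed, not confirmed, to apply here.

```rust
/// Hypothetical element model: an optional xml:lang of its own and a parent link.
pub struct Elem<'a> {
    pub own_lang: Option<&'a str>,
    pub parent: Option<&'a Elem<'a>>,
}

/// The effective language is the nearest xml:lang declared on the element
/// itself or on one of its ancestors. An empty value clears the language.
pub fn effective_lang<'a>(elem: &Elem<'a>) -> Option<&'a str> {
    let mut current = Some(elem);
    while let Some(e) = current {
        if let Some(lang) = e.own_lang {
            // An empty xml:lang overrides any inherited value with "none".
            return if lang.is_empty() { None } else { Some(lang) };
        }
        current = e.parent;
    }
    None
}
```

The same walk-up-the-ancestors approach applies to xml:space, with a different set of permitted values.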
Namespace handling
All of the examples so far have used XML without any namespace declarations, which means that
the element and attribute names are not within any namespace (or put another way, they have a
namespace which has no value). Specifying the target name of an element or attribute can be
done with a string slice &str when the namespace has no value. But when the target name has
a namespace value, you must specify the namespace in order to target the desired element.
The most direct way of doing this is to use a (&str, &str) tuple which contains the local
part and then namespace (not the prefix) of the element name. But you can also call the
pre_ns (preset or predefined namespace) method to let a cursor or XmlPath know that it should
assume the given namespace value if you don't use a tuple to directly specify the namespace for
each element and attribute within the method chain. An example is probably the easiest way to
explain this.
<!-- The root element declares that the default namespace for it
and its descendants should be the given URI. It also declares that
any element/attribute using prefix 'pfx' belongs to a namespace
with a different URI. -->
This child element has no prefix, so it inherits
the default namespace.
This child element has prefix pfx, so inherits the
other namespace.
Attribute names can be prefixed
too.
Unprefixed attribute names do *not*
inherit namespaces.
The default namespace can be
cleared too.
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
It's important to note that once you call element() the effect of pre_ns vanishes. So if you
call element() in the middle of a method chain, remember to call pre_ns again in order to
specify the preset namespace from that point forward.
something
whatever
more
and so on
// Defining a constant makes it quicker to type namespaces,
// and easier to read the code.
const NS_DEF: &str = "example.com/DefaultNamespace";
// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
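As a rough model of the name matching described in this section, the sketch below shows how a target given as a plain &str or as a (local part, namespace) tuple might be compared against a resolved element name, with an optional preset namespace filling in when no tuple is used. Target, Name and name_matches are hypothetical names for illustration only. How the real pre_ns interacts with names that have no namespace is not covered by this sketch.

```rust
/// Hypothetical target name: a local part plus an optional explicit namespace.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Target<'a> {
    pub local: &'a str,
    pub namespace: Option<&'a str>,
}

/// A plain string slice targets a name without an explicit namespace.
impl<'a> From<&'a str> for Target<'a> {
    fn from(local: &'a str) -> Self {
        Target { local, namespace: None }
    }
}

/// A tuple is (local part, namespace), not (prefix, local part).
impl<'a> From<(&'a str, &'a str)> for Target<'a> {
    fn from((local, namespace): (&'a str, &'a str)) -> Self {
        Target { local, namespace: Some(namespace) }
    }
}

/// Hypothetical resolved name of an element or attribute in a document.
pub struct Name {
    pub local: String,
    pub namespace: Option<String>,
}

/// Compare a resolved name with a target. An explicit namespace in the
/// target wins; otherwise the preset namespace (if any) is assumed.
pub fn name_matches<'a>(
    name: &Name,
    target: impl Into<Target<'a>>,
    preset_ns: Option<&'a str>,
) -> bool {
    let target = target.into();
    let wanted_ns = target.namespace.or(preset_ns);
    name.local == target.local && name.namespace.as_deref() == wanted_ns
}
```

Because the comparison uses the namespace value rather than the prefix, two documents that bind different prefixes to the same namespace would match the same target.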
Error handling
The examples above have simplified the code snippets for brevity, but in a real application you will need to handle the different error types returned by the different steps of reading/parsing and extracting from XML. Here is a compact example which shows the error handling needed for each step.
// The XML parsing methods might return an std::io::Error, so they
// go into their own method.
// The extraction methods might return an XmlError, so they go into
// their own method.
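One common way for a caller to combine the two steps is a small application-level error enum with From conversions, so that the ? operator works across both. The sketch below uses only the standard library. ExtractError is a hypothetical stand-in for XmlError, and parse_step and extract_step are toy placeholders for the real parsing and extraction calls, not this package's methods.

```rust
use std::fmt;
use std::io;

/// Hypothetical stand-in for the extraction error type.
#[derive(Debug)]
pub struct ExtractError(pub String);

/// Application-level error that wraps the error of either step.
#[derive(Debug)]
pub enum AppError {
    Io(io::Error),
    Extract(ExtractError),
}

impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Io(e)
    }
}

impl From<ExtractError> for AppError {
    fn from(e: ExtractError) -> Self {
        AppError::Extract(e)
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::Io(e) => write!(f, "could not read or parse: {e}"),
            AppError::Extract(e) => write!(f, "could not extract: {}", e.0),
        }
    }
}

impl std::error::Error for AppError {}

/// Toy stand-in for the reading/parsing step, which fails with io::Error.
pub fn parse_step(bytes: &[u8]) -> Result<String, io::Error> {
    String::from_utf8(bytes.to_vec())
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Toy stand-in for the extraction step, which fails with ExtractError.
pub fn extract_step(text: &str) -> Result<&str, ExtractError> {
    text.strip_prefix("<v>")
        .and_then(|t| t.strip_suffix("</v>"))
        .ok_or_else(|| ExtractError("missing <v> element".to_string()))
}

/// Both steps chained with `?`; each error converts into AppError.
pub fn run(bytes: &[u8]) -> Result<String, AppError> {
    let text = parse_step(bytes)?;
    let value = extract_step(&text)?;
    Ok(value.to_string())
}
```

Keeping each step in its own function, as the comments above suggest, means each function has a single error type, and only the caller needs to know about both.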