Crate maybe_xml
MaybeXml is a library to scan and evaluate XML-like data into tokens. In effect, the library provides a non-validating lexer. The interface is similar to many XML pull parsers.
The library does 3 things:
- A Scanner receives byte slices and identifies the start and end of tokens like tags, character content, and declarations.
- An Evaluator transforms bytes from an input source (like instances of types which implement std::io::BufRead) into complete tokens via either a cursor or an iterator pull style API.
  From an implementation point of view, when a library user asks an Evaluator for the next token, the Evaluator reads the input and passes the bytes to an internal Scanner. The Evaluator buffers the scanned bytes and keeps reading until the Scanner determines a token has been completely read. Then all of the bytes which represent the token are returned to the library user as a variant of a token type.
- Each token type provides methods which can provide views into the underlying bytes. For instance, a tag token could provide a name() method which returns a TagName. The TagName provides a method like to_str() which can be called to get a str representation of the tag name.
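The division of labor between a scanner and an evaluator described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `scan` and `ToyEvaluator` are hypothetical names, and the scanning rule (markup ends at the first `>`, character content ends before the next `<`) is a deliberate simplification.

```rust
use std::io::{self, BufRead};

/// Hypothetical toy scanner: returns how many leading bytes of `buf` form one
/// complete token, or `None` when more input is needed to decide.
fn scan(buf: &[u8]) -> Option<usize> {
    match *buf.first()? {
        // Markup is complete once the closing '>' has been seen.
        b'<' => buf.iter().position(|&b| b == b'>').map(|i| i + 1),
        // Character content runs up to, but not including, the next '<'.
        _ => buf.iter().position(|&b| b == b'<'),
    }
}

/// Hypothetical toy evaluator: buffers bytes from a reader until `scan`
/// reports a complete token, then returns that token as owned bytes.
struct ToyEvaluator<R> {
    reader: R,
    buf: Vec<u8>,
}

impl<R: BufRead> ToyEvaluator<R> {
    fn new(reader: R) -> Self {
        ToyEvaluator { reader, buf: Vec::new() }
    }

    fn next_token(&mut self) -> io::Result<Option<Vec<u8>>> {
        loop {
            // Ask the scanner whether the buffered bytes hold a full token.
            if let Some(len) = scan(&self.buf) {
                return Ok(Some(self.buf.drain(..len).collect()));
            }
            // Not yet: read more input and append it to the buffer.
            let chunk = self.reader.fill_buf()?;
            if chunk.is_empty() {
                // End of input: hand back any leftover bytes as a last token.
                if self.buf.is_empty() {
                    return Ok(None);
                }
                return Ok(Some(std::mem::take(&mut self.buf)));
            }
            let n = chunk.len();
            self.buf.extend_from_slice(chunk);
            self.reader.consume(n);
        }
    }
}
```

Calling `next_token()` repeatedly on `ToyEvaluator::new(&b"<ID>Example</ID>"[..])` yields the three byte sequences `<ID>`, `Example`, and `</ID>`, then `None`.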
Usage
In most cases, a library user should use an Evaluator to read the XML and transform the data into tokens.
First, instantiate an Evaluator with an input source.
Second, if the Evaluator supports the iterator API, a library user may transform the Evaluator into an iterator by calling the into_iter() method. Then, like other iterators, you can call next() or use any of the other Iterator methods like map, filter, etc. In most cases, especially if you need to further transform the XML content, using the Iterator API is easier. The returned tokens from the Iterator API have owned copies of the bytes representing the token.
If the use case only involves reading the data and not copying or transforming the data, the cursor API (usually by calling next_token directly on the Evaluator itself and not transforming the Evaluator into an iterator) may be sufficient. Each token returned from the cursor API borrows a byte slice view from a shared internal buffer instead of owning a copy of the bytes.
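The difference between borrowed cursor tokens and owned iterator tokens can be illustrated with a standard-library sketch. The names `TagCursor` and `OwnedTokens` are hypothetical and are not types from this crate, and the "token" rule (everything up to and including the next `>`) is a simplification chosen to keep the example short.

```rust
use std::io::BufRead;

/// Hypothetical cursor that treats everything up to and including the next
/// '>' as one token and keeps it in a single reusable buffer.
struct TagCursor<R> {
    reader: R,
    buf: Vec<u8>, // shared internal buffer, overwritten on every call
}

impl<R: BufRead> TagCursor<R> {
    fn new(reader: R) -> Self {
        TagCursor { reader, buf: Vec::new() }
    }

    /// Cursor style: the returned slice borrows from `self.buf`, so it is
    /// only valid until the next call. Read errors are treated as end of
    /// input here purely to keep the sketch small.
    fn next_token(&mut self) -> Option<&[u8]> {
        self.buf.clear();
        match self.reader.read_until(b'>', &mut self.buf) {
            Ok(n) if n > 0 => Some(self.buf.as_slice()),
            _ => None,
        }
    }
}

/// Iterator style: `Iterator::next` cannot return a borrow of the iterator
/// itself, so every item is an owned copy of the token bytes.
struct OwnedTokens<R>(TagCursor<R>);

impl<R: BufRead> Iterator for OwnedTokens<R> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        self.0.next_token().map(|bytes| bytes.to_vec())
    }
}
```

The cursor avoids one allocation per token, but the borrow checker rejects holding two tokens at once. The iterator pays for a copy and in exchange works with adapters such as `map`, `filter`, and `collect`.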
Example
As a simplified and unoptimized example, the following code uses the iterator style API to transform uppercase ID into lowercase id tag names.
use maybe_xml::token::owned::{Token, StartTag, Characters, EndTag};

let mut input = std::io::BufReader::new(r#"<ID>Example</ID>"#.as_bytes());

let eval = maybe_xml::eval::bufread::BufReadEvaluator::from_reader(input);

// Lowercase the bytes of every start and end tag; pass other tokens through.
let mut iter = eval.into_iter()
    .map(|token| match token {
        Token::StartTag(start_tag) => {
            if let Ok(str) = start_tag.to_str() {
                Token::StartTag(StartTag::from(str.to_lowercase()))
            } else {
                Token::StartTag(start_tag)
            }
        }
        Token::EndTag(end_tag) => {
            if let Ok(str) = end_tag.to_str() {
                Token::EndTag(EndTag::from(str.to_lowercase()))
            } else {
                Token::EndTag(end_tag)
            }
        }
        _ => token,
    });

let token = iter.next();
assert_eq!(token, Some(Token::StartTag(StartTag::from("<id>"))));
match token {
    Some(Token::StartTag(start_tag)) => {
        assert_eq!(start_tag.name().to_str()?, "id");
    }
    _ => panic!("unexpected token"),
}
assert_eq!(iter.next(), Some(Token::Characters(Characters::from("Example"))));
assert_eq!(iter.next(), Some(Token::EndTag(EndTag::from("</id>"))));
assert_eq!(iter.next(), Some(Token::Eof));
assert_eq!(iter.next(), None);
Well-formed vs. Malformed document processing
The library should scan and evaluate well-formed XML documents correctly. For XML documents which are not well-formed, the behavior is currently undefined. The library does not error when scanning a malformed document.
Security Considerations
There are no limits on the amount of data read, so if a large input source is used, an Evaluator will buffer a large number of bytes until the end of the current token is found. For untrusted input sources such as a std::io::BufRead, the input source could be wrapped with a type which checks the number of bytes read and returns an error if too many bytes have been read.
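One way to build such a wrapper is sketched below using only the standard library. `LimitedReader` is a hypothetical name, not a type provided by this crate, and the choice to fail with `io::ErrorKind::Other` is an assumption made for the example.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical wrapper that lets at most `limit` bytes through and then
/// fails instead of handing out more data.
struct LimitedReader<R> {
    inner: R,
    remaining: usize,
}

impl<R> LimitedReader<R> {
    fn new(inner: R, limit: usize) -> Self {
        LimitedReader { inner, remaining: limit }
    }
}

impl<R: BufRead> BufRead for LimitedReader<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        let buf = self.inner.fill_buf()?;
        // More data is available but the budget is spent: report an error.
        if !buf.is_empty() && self.remaining == 0 {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "input exceeds byte limit",
            ));
        }
        // Never expose more bytes than the remaining budget allows.
        let n = buf.len().min(self.remaining);
        Ok(&buf[..n])
    }

    fn consume(&mut self, amt: usize) {
        let amt = amt.min(self.remaining);
        self.remaining -= amt;
        self.inner.consume(amt);
    }
}

impl<R: BufRead> Read for LimitedReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Route plain reads through fill_buf/consume so the limit applies.
        let n = {
            let available = self.fill_buf()?;
            let n = available.len().min(out.len());
            out[..n].copy_from_slice(&available[..n]);
            n
        };
        self.consume(n);
        Ok(n)
    }
}
```

Because the wrapper itself implements `BufRead`, it can in principle be passed anywhere a buffered reader is accepted. An input that is exactly `limit` bytes long reads to the end without error; the error is raised only when further bytes are available after the budget is used up.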
Another possible solution is to use a Scanner directly and process bytes immediately instead of using an Evaluator which buffers the bytes until a complete token is read.
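The idea of processing bytes immediately can be sketched without this crate's Scanner type. The hypothetical `count_markup` function below keeps only one flag of state between chunks, so memory use is bounded by the reader's own buffer no matter how long a token is. It illustrates the streaming approach only and does not show the Scanner API.

```rust
use std::io::{self, BufRead};

/// Hypothetical streaming pass: counts `<...>` runs while holding only the
/// reader's current chunk in memory, never a whole token.
fn count_markup<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut inside = false; // true while between '<' and the matching '>'
    let mut count = 0;
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            return Ok(count); // end of input
        }
        // Handle each byte as soon as it arrives; nothing is copied aside.
        for &b in chunk {
            match (inside, b) {
                (false, b'<') => inside = true,
                (true, b'>') => {
                    inside = false;
                    count += 1;
                }
                _ => {}
            }
        }
        let n = chunk.len();
        reader.consume(n);
    }
}
```

The trade-off is that the caller never sees a complete token as a single slice, so any per-token work has to be expressed as incremental state updates.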
Modules
eval | Evaluators transform scanned byte sequences into complete tokens. |
scanner | Scans byte sequences for tokens. |
token | Tokens are the returned values when evaluating scanned byte ranges. |