reader_for_microxml 2.0.1

reader for microXml - the simplified subset of xml
// region: lmake_readme include "readme.md" //! A
//! # reader for microXml
//!
//! [![crev reviews](
//! https://web.crev.dev/rust-reviews/badge/crev_count/reader_for_microxml.svg
//! )](https://web.crev.dev/rust-reviews/crate/reader_for_microxml/) [![Rust lines](https://img.shields.io/badge/Rust_lines-308-green.svg)](https://github.com/LucianoBestia/reader_for_microxml) [![Licence](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/LucianoBestia/reader_for_microxml/blob/master/LICENSE) [![Rust](https://github.com/LucianoBestia/reader_for_microxml/workflows/Rust/badge.svg)](https://github.com/LucianoBestia/reader_for_microxml/) [![crates.io](https://meritbadge.herokuapp.com/reader_for_microxml)](https://crates.io/crates/reader_for_microxml) [![Documentation](https://docs.rs/reader_for_microxml/badge.svg)](https://docs.rs/reader_for_microxml/)  
//!
//! ***version: 1.1.12  date: 2020-06-15 authors: Luciano Bestia***  
//! **reader for microXml - the simplified subset of xml**
//!
//!
//! There are many xml parsers/readers/tokenizers/lexers around, but I need something very small and simple for my simple html templates in wasm.\
//! I found that there is a standard (or W3C proposal) for *MicroXml* - dramatically simpler than the full Xml standard. Perfect for my use-case: I have small, simple html files that are microXml compatible.  
//!
//! ## microXml
//!
//! MicroXML is a subset of XML. It is dramatically simpler.\
//! <https://www.xml.com/articles/2017/06/03/simplifying-xml-microxml/>\
//! <https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html>\
//! MicroXML is actually well-formed Xml.\
//! In the data model of MicroXml there are no CData, namespaces, declarations, processing instructions,...\
//! An example of everything that can be done in well-formed microXml:  
//!
//! ```xml
//! <memo lang="en" date="2017-05-01">
//!     I <em>love</em> microXML!<br />
//!     <!-- some comment -->
//!     It's so clean &amp; simple.
//! </memo>
//! ```
//!
//! MicroXml can be only in utf-8. I am lucky, because Rust Strings are internally utf-8 and are automatically checked for correctness.\
//! MicroXml should go through normalization: CR and CRLF should be converted to LF, but I don't do that here. Decoding the predefined entities `&quot;`, `&amp;`,... and numeric character references like `&#xE343;`, `&#xE312;`,... is also not done inside the reader. It is left for a higher-level library to choose what to do with them.\
//! MicroXml can contain Comments, but they are not official microXml data. But I need them for my templating project.\
//! Whitespaces are completely preserved in Text Nodes. For me they are significant. Also newline and Tabs. This is different from full Xml whitespace processing.\
//! All other whitespaces are ignored - they are insignificant.  
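//!
//! Because normalization and entity decoding are left to the caller, here is a minimal sketch of what a higher-level library could do with a text slice it got from the reader. The function names `normalize_newlines` and `decode_predefined_entities` are hypothetical and not part of this crate. The sketch uses `String`, so it allocates and needs `std`, and it handles only the five predefined entities, not numeric references like `&#xE343;`.
//!
//! ```rust
//! /// Converts CRLF and lone CR line endings to LF.
//! /// Hypothetical caller-side helper, not part of this crate.
//! fn normalize_newlines(text: &str) -> String {
//!     // Replace CRLF first, so that its CR is not converted a second time.
//!     text.replace("\r\n", "\n").replace('\r', "\n")
//! }
//!
//! /// Decodes the five predefined entities; everything else is left untouched.
//! /// Hypothetical caller-side helper, not part of this crate.
//! fn decode_predefined_entities(text: &str) -> String {
//!     // `&amp;` goes last, otherwise "&amp;lt;" would be decoded twice into "<".
//!     text.replace("&lt;", "<")
//!         .replace("&gt;", ">")
//!         .replace("&quot;", "\"")
//!         .replace("&apos;", "'")
//!         .replace("&amp;", "&")
//! }
//! ```
//!
//! With these helpers, the text `It's so clean &amp; simple.` from the memo example above would decode to `It's so clean & simple.`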
//!
//! ## reader
//!
//! This ReaderForMicroXml obviously cannot read a complicated full XML.\
//! This `reader_for_microxml` is used for small html fragments.\
//! They must be well-formed microXml.\
//! These fragments are meant for html templating for dodrio.\
//! Because of the small size of fragments, I can put all the text in memory in a string.\
//! Only basic cases of malformed input produce errors. I am not trying to return errors for every possible way microXml can be malformed.\
//! The speed is not really important, but the size of the code is, because it will be used in WebAssembly. Any code is too big for Wasm!\
//! The crate has `#![no_std]`, `#![forbid(unsafe_code)]`, NO dependencies, NO allocations.  
//!
//! ## iterator
//!
//! The reader is an iterator.\
//! It implements the `Iterator` trait.\
//! Use this syntax to process all tokens:\
//! `for result_token in reader_iterator {`\
//! or\
//! `let x: Option<Result<Token, &str>> = reader_iterator.next();`  
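//!
//! Because every item is a `Result`, a caller that is allowed to allocate can also collect all tokens at once with the standard `collect` into `Result<Vec<_>, _>`. The sketch below is generic over any iterator with this item shape, so it stays self-contained. The helper name `collect_tokens` is hypothetical and not part of this crate, and passing the real reader to it is an assumption based on the `Item` type shown above.
//!
//! ```rust
//! /// Collects all tokens into a Vec, or returns the first error message.
//! /// Hypothetical caller-side helper, not part of this crate.
//! fn collect_tokens<'a, T, I>(reader: I) -> Result<Vec<T>, &'a str>
//! where
//!     I: Iterator<Item = Result<T, &'a str>>,
//! {
//!     // Collecting into a Result stops at the first Err and returns it.
//!     reader.collect()
//! }
//! ```
//!
//! The collection stops at the first `Err`, so tokens after a malformed spot are never read.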
//!
//! ## Tests
//!
//! Run 16 tests with:\
//! `clear; cargo make test`
//!
//! ## Examples
//!
//! Find examples in the repository on github.\
//! Run them with:  
//! `clear; cargo make run_rel1`\
//! `clear; cargo make run_rel2`\
//! it is a shortcut to:\
//! `cargo run --example microxml_tree examples/t2.html`
//!
//! ```rust
//! /// read xml and write to screen
//! use reader_for_microxml::*;
//!
//! fn main(){
//!     let str_xml = r#"<html>test</html>"#;
//!     let mut reader_iterator = ReaderForMicroXml::new(str_xml);
//!     let result = read_xml_to_debug_string(&mut reader_iterator);
//!     println!("Result: {}", result)
//! }
//!
//! fn read_xml_to_debug_string(reader_iterator: &mut ReaderForMicroXml) -> String {
//!     let mut result = String::new();
//!     // reader_iterator yields Option<Result<Token, &str>>
//!     // the outer Option tells the iterator where the end is
//!     // then the Result holds either a Token or an error message
//!     for result_token in reader_iterator {
//!         match result_token {
//!             Ok(token) => match token {
//!                 Token::StartElement(name) => {
//!                     result.push_str(&format!("Start: \"{}\"\n", name));
//!                 }
//!                 Token::Attribute(name, value) => {
//!                     result.push_str(&format!("Attribute: \"{}\" = \"{}\"\n", name, value));
//!                 }
//!                 Token::TextNode(txt) => {
//!                     result.push_str(&format!("Text: \"{}\"\n", txt));
//!                 }
//!                 Token::Comment(txt) => {
//!                     result.push_str(&format!("Comment: \"{}\"\n", txt));
//!                 }
//!                 Token::EndElement(name) => {
//!                     result.push_str(&format!("End: \"{}\"\n", name));
//!                 }
//!             },
//!             Err(err_msg) => {
//!                 panic!("{}", err_msg);
//!             }
//!         }
//!     }
//!     //return
//!     result
//! }
//! ```
//!
//! ## used in projects
//!
//! <https://github.com/LucianoBestia/cargo_crev_web>  
//! <https://github.com/LucianoBestia/dodrio_templating>  
//! <https://github.com/LucianoBestia/mem6_game>  
//!
//! ## cargo crev reviews and advisory
//!
//! It is recommended to always use [cargo-crev](https://github.com/crev-dev/cargo-crev)\
//! to verify the trustworthiness of each of your dependencies.\
//! Please, spread this info.\
//! On the web use this url to read crate reviews. Example:\
//! <https://web.crev.dev/rust-reviews/crate/num-traits>  
//!
//! ## Ideas for the future
//!
//! ### Speed
//!
//! The speed could probably be improved if I use Vec\<u8\> instead of CharIndices. That could work because all the xml delimiters are ASCII characters. A property of the UTF-8 encoding is that ASCII characters can in no way be misinterpreted inside a string. Their first bit is always 0.\
//! All other unicode characters are multi-byte and all of their bytes MUST start with bit 1.\
//! So there is no way of confusing them.\
//! <https://betterexplained.com/articles/unicode/>\
//! <https://naveenr.net/unicode-character-set-and-utf-8-utf-16-utf-32-encoding/>  
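//!
//! A minimal sketch of this idea, not how the crate currently works: searching the raw bytes for an ASCII delimiter and using the found offset directly as a slice boundary. The function name `find_ascii` is hypothetical. It needs no allocation, because it only looks at the borrowed bytes of the string.
//!
//! ```rust
//! /// Returns the byte offset of the next `delimiter` at or after `from`.
//! /// Hypothetical helper, not part of this crate.
//! fn find_ascii(input: &str, from: usize, delimiter: u8) -> Option<usize> {
//!     // Only ASCII delimiters are safe to search for byte by byte.
//!     debug_assert!(delimiter.is_ascii());
//!     input
//!         .as_bytes()
//!         .get(from..)? // None if `from` is past the end of the input
//!         .iter()
//!         // Every byte of a multi-byte UTF-8 character is >= 0x80,
//!         // so a byte equal to an ASCII delimiter is always that delimiter.
//!         .position(|&byte| byte == delimiter)
//!         .map(|offset| from + offset)
//! }
//! ```
//!
//! The offset of an ASCII byte is always a character boundary, so the result can be used for slicing the original `&str` even when the text between the delimiters contains multi-byte characters.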
//!
//! ## References
//!
//! <https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html>\
//! <https://www.xml.com/articles/2017/06/03/simplifying-xml-microxml/>\
//! <https://github.com/tafia/quick-xml>\
//! <https://github.com/RazrFalcon/roxmltree>  
// endregion: lmake_readme include "readme.md" //! A

#![no_std]
#![forbid(unsafe_code)]

pub struct PosChar {
    pub pos: usize,
    pub ch: char,
}

/// struct Reader for MicroXml - the Class  
/// Rust has Structs + Traits, but for me it is just like Class/Object.  
/// Just without inheritance.  
/// All the fields are internal and not public.  
/// The only way to interact is through methods.  
pub struct ReaderForMicroXml<'a> {
    /// reference to the xml string (no allocation)
    input: &'a str,
    /// Iterator CharIndices over the input string
    indices: core::str::CharIndices<'a>,
    /// I need to know the TagState for programming as a state machine
    tag_state: TagState,
    /// the last read character from the indices iterator
    last_char: PosChar,
    /// for significant whitespace (in TextNode beginning)
    start_of_text_node_before_whitespace: usize,
}

/// The reader_for_microxml returns tokens.  
/// The caller will manage these tokens. So they must be public.  
/// The string slices are references to the original string with microXml text  
#[derive(Clone, Debug)]
pub enum Token<'a> {
    /// Start of xml element  
    StartElement(&'a str),
    /// End of xml element  
    EndElement(&'a str),
    /// Attribute  
    Attribute(&'a str, &'a str),
    /// Text node between `StartElement` and `EndElement`.  
    TextNode(&'a str),
    /// comment node
    Comment(&'a str),
}

/// internal enum: Tags are strings inside the delimiters `<` and `>`, like `<div>` or `</div>`  
enum TagState {
    /// outside of tag  
    OutsideOfTag,
    /// inside of tag  
    InsideOfTag,
    /// reached normal end of file
    EndOfFile,
}

impl PosChar {
    pub fn set(&mut self, tup: (usize, char)) {
        self.pos = tup.0;
        self.ch = tup.1;
    }
}

impl<'a> ReaderForMicroXml<'a> {
    /// Constructor. String is immutably borrowed here. No allocation.  
    pub fn new(input: &str) -> ReaderForMicroXml {
        // CharIndices is an iterator that returns a tuple: (pos, ch).
        // I convert this into PosChar{pos, ch} for easier coding.
        // The "byte" position for using the string slice and the character.
        // This is a complication because one utf-8 character can have more bytes.
        // And the slices are defined by "bytes position", not by "character position".
        // Very important distinction!

        let mut indices = input.char_indices();
        let mut last_char = PosChar { pos: 0, ch: ' ' };
        if !input.is_empty() {
            // unwrap because it cannot error if the string is not empty
            last_char.set(indices.next().unwrap());
        }
        ReaderForMicroXml {
            input,
            indices,
            tag_state: TagState::OutsideOfTag,
            last_char,
            start_of_text_node_before_whitespace: 0,
        }
    }

    /// Reads the next token (internal).  
    /// The internal function can understand when the Eof is in a correct position  
    /// and stops the propagation of Option None.  
    #[allow(clippy::integer_arithmetic, clippy::nonminimal_bool)]
    fn read_token_internal(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        match &self.tag_state {
            TagState::OutsideOfTag => {
                if self.start_of_text_node_before_whitespace == 0 {
                    self.start_of_text_node_before_whitespace = self.last_char.pos;
                }
                self.move_over_whitespaces()?;
                // Tags can look like this:
                // Start Tags: < xxx >,  < xxx attr="val" >,  < xxx />
                // End Tags: </xxx>
                // Comments: <!-- xxx -->
                // start delimiter is <
                if self.last_char.ch == '<' {
                    self.tag_state = TagState::InsideOfTag;
                    self.move_next_char()?;
                    self.move_over_whitespaces()?;
                    // if it is not comment or end tag, must be the element name
                    if !(self.last_char.ch == '!' || self.last_char.ch == '/') {
                        self.read_element_name()
                    } else if self.last_char.ch == '!' {
                        // this is a comment <!-- xxx -->
                        // comment are not data in MicroXml standard
                        // but I need them for my templating project
                        self.read_comment()
                    } else {
                        // the end element look like this </xxx>
                        self.read_end_element()
                    }
                } else {
                    // the text node is between element so looks like this
                    // > text <
                    self.read_text_node()
                }
            }
            TagState::InsideOfTag => {
                self.move_over_whitespaces()?;
                // InsideOfTag (after name) can be > or attributes or self_closing
                // < xxx >,  < xxx attr="val" >,  < xxx />
                // if it is not self-closing or > then must be an attribute
                if self.last_char.ch == '>' {
                    // here must be the end of start tag >
                    self.move_next_char()?;
                    self.tag_state = TagState::OutsideOfTag;
                    self.start_of_text_node_before_whitespace = 0;
                    // recursive calling
                    return self.read_token_internal();
                } else if self.last_char.ch == '/' {
                    // self-closing element: reported as an EndElement with an empty name
                    self.move_next_char()?; // to >
                    self.move_over_whitespaces()?;
                    if self.last_char.ch != '>' {
                        return Some(Err("Error: Tag has / but not />"));
                    } else {
                        self.move_next_char()?;
                        self.tag_state = TagState::OutsideOfTag;
                        self.start_of_text_node_before_whitespace = 0;
                        return Some(Ok(Token::EndElement("")));
                    }
                } else {
                    // attribute
                    self.read_attribute()
                }
            }
            TagState::EndOfFile => {
                //return None to stop the iterator
                None
            }
        }
    }

    /// Reads the element name  
    /// Propagation of Option None if is Eof  
    fn read_element_name(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // start of tag name < xxx >
        self.move_over_whitespaces()?;
        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // read until delimiter space, / or >
            if self.last_char.ch.is_whitespace()
                || self.last_char.ch == '/'
                || self.last_char.ch == '>'
            {
                end_pos = self.last_char.pos;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        self.move_over_whitespaces()?;
        self.tag_state = TagState::InsideOfTag;

        // unwrap because I am confident that start_pos or end_pos are correct
        return Some(Ok(Token::StartElement(
            self.input.get(start_pos..end_pos).unwrap(),
        )));
    }

    /// Reads the attribute name and value.  
    /// Return Option None if Eof.  
    fn read_attribute(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        self.move_over_whitespaces()?;
        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // delimiters are whitespace or =
            if self.last_char.ch.is_whitespace() || self.last_char.ch == '=' {
                end_pos = self.last_char.pos;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        // unwrap because I am confident that start_pos or end_pos are correct
        let attr_name = self.input.get(start_pos..end_pos).unwrap();

        // region: skip delimiters: whitespace, =, "
        self.move_over_whitespaces()?;
        if self.last_char.ch == '=' {
            self.move_next_char()?;
        }
        self.move_over_whitespaces()?;
        if self.last_char.ch == '"' {
            self.move_next_char()?;
        } else {
            // The opening quote is missing (a missing `=` alone is tolerated above).
            return Some(Err("Error: Attribute does not have the char = ."));
        }
        // endregion

        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // end delimiter is "
            if self.last_char.ch == '"' {
                end_pos = self.last_char.pos;
                self.move_next_char()?;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        self.move_over_whitespaces()?;
        // unwrap because I am confident that start_pos or end_pos are correct
        let attr_value = self.input.get(start_pos..end_pos).unwrap();
        // return
        Some(Ok(Token::Attribute(attr_name, attr_value)))
    }

    /// reads end element  
    fn read_end_element(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // end tag for element  </ xxx >
        // we are already at the / char
        self.move_next_char()?;
        self.move_over_whitespaces()?;
        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // read until space or >
            if self.last_char.ch.is_whitespace() || self.last_char.ch == '>' {
                end_pos = self.last_char.pos;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        self.move_over_whitespaces()?;
        if self.last_char.ch == '>' {
            // after the End element is possible to have a correct Eof
            if let Some(()) = self.move_next_char() {
                //dbg!(self.last_char.pos);
                self.start_of_text_node_before_whitespace = self.last_char.pos;
                if let Some(()) = self.move_over_whitespaces() {
                    self.tag_state = TagState::OutsideOfTag;
                } else {
                    self.tag_state = TagState::EndOfFile;
                }
            } else {
                self.tag_state = TagState::EndOfFile;
            }
            return Some(Ok(Token::EndElement(
                // unwrap because I am confident that start_pos or end_pos are correct
                self.input.get(start_pos..end_pos).unwrap(),
            )));
        } else {
            return Some(Err("End Element does not have > ."));
        }
    }

    /// Reads text node  
    /// I don't do any encoding/decoding here, because I need it "as is" for html templating.  
    /// I preserve all the "significant" whitespaces because I will use this for templating.  
    /// And because there is no hard standard for trailing spaces in xml text node.  
    /// If reached Eof propagates Option None.  
    fn read_text_node(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // text element look like this > some text <
        // it has significant whitespace start
        let start_pos = self.start_of_text_node_before_whitespace;
        // reset it to 0, because I don't need it more here
        // and this is the signal to store a new one.
        self.start_of_text_node_before_whitespace = 0;

        let mut end_pos;
        loop {
            //dbg!(self.last_char.ch);
            end_pos = self.last_char.pos;
            // end delimiter in < or end of file
            if self.last_char.ch == '<' {
                self.tag_state = TagState::OutsideOfTag;
                break;
            } else {
                if self.move_next_char().is_none() {
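                    // Eof: the text node runs to the end of the input.
                    // (`end_pos + 1` would split a multi-byte last character.)
                    end_pos = self.input.len();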
                    self.tag_state = TagState::EndOfFile;
                    break;
                }
            }
        }
        // unwrap because I am confident that start_pos or end_pos are correct

        //dbg!(end_pos);
        return Some(Ok(Token::TextNode(
            self.input.get(start_pos..end_pos).unwrap(),
        )));
    }

    /// Comments are not data for MicroXml standard,  
    /// But I need them as data for my templating project.  
    /// The Option is returned only because of Option None propagation because of Eof.  
    fn read_comment(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // comments looks like this <!-- xxx -->
        // we should be now at the second character  <!
        self.move_next_char()?; // skip char !
        self.move_next_char()?; // skip char -
        self.move_next_char()?; // skip char -
        let start_pos = self.last_char.pos;
        let end_pos;
        // read until end of comment -->
        let mut ch1 = ' ';
        let mut ch2 = ' ';
        loop {
            let ch3 = self.last_char.ch;
            // end delimiter -->
            if ch1 == '-' && ch2 == '-' && ch3 == '>' {
                end_pos = self.last_char.pos - 2;
                self.move_next_char()?;
                break;
            } else {
                ch1 = ch2;
                ch2 = ch3;
                self.move_next_char()?;
            }
        }
        // it is possible to have a comment in between 2 text nodes
        self.start_of_text_node_before_whitespace = 0;
        self.tag_state = TagState::OutsideOfTag;
        // unwrap because I am confident that start_pos or end_pos are correct
        return Some(Ok(Token::Comment(
            self.input.get(start_pos..end_pos).unwrap(),
        )));
    }

    // region: methods for iterator

    /// Moves the iterator and stores the last_char.  
    /// Iterator next() of CharIndices is consuming the char.  
    /// There is no way back to the same char.  
    /// But often I need to get again the same character of the last operation.  
    /// I tried with peekable.peek(), but it gives a reference and this was a problem.  
    /// So now I have the method move_next_char() and the field last_char.  
    /// I store the last_char for repeated use.  
    /// Anytime it can reach the End of File (Eof),  
    /// then it propagates the Option None to the caller with the ? syntax.  
    /// Only the caller knows if the Eof here is ok or it is an unexpected error.  
    /// The `()` inside the Option is only a dummy,  
    /// only because I need to propagate the Option None because of Eof  
    fn move_next_char(&mut self) -> Option<()> {
        // Eof can be reached anytime. I will propagate None to the caller with ?
        self.last_char.set(self.indices.next()?);
        // returns a dummy only because of Option None propagation with ?
        Some(())
    }

    /// Skips all whitespaces if there are any  
    /// and stops at the first character that is not whitespace (kept in last_char).  
    /// The caller saves the position where the whitespace begins beforehand, because
    /// only the caller knows if the whitespaces are significant. For example in a TextNode.  
    /// If found Eof, propagates Option None.  
    fn move_over_whitespaces(&mut self) -> Option<()> {
        loop {
            if !self.last_char.ch.is_whitespace() {
                return Some(());
            } else {
                self.move_next_char()?;
            }
        }
    }
    // endregion
}

impl<'a> Iterator for ReaderForMicroXml<'a> {
    type Item = Result<Token<'a>, &'static str>;
    /// Reads the next token: StartElement, Attribute, TextNode, Comment or EndElement  
    fn next(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // return
        self.read_token_internal()
    }
}