rust_dot 0.6.0

/*!
RustDOT is mostly the Graphviz DOT language, lightly rustified.
It can be embedded as a macro or parsed from a string or file.
Its purpose is to extract the graph structure. Layout hints are currently out of scope.

<img width="110" height="120" src="https://rust-dot.sourceforge.io/abcd.svg"
    style="position: absolute; right: 0; margin-top: 2em; z-index: 9">

```
let g1 = rust_dot! {
    graph {
        A -- B -- C; /* semicolon is optional */
        "B" -- D // quotes not needed here
    }
};
println!("{} {} \"{}\" {:?} {:?}", g1.strict, g1.directed, g1.name, g1.nodes, g1.edges);
// false false "" ["A", "B", "C", "D"] [(0, 1), (1, 2), (1, 3)]

let g2 = parse_string("digraph Didi { -1 -> 2 -> .3  2 -> 4.2 }");
println!("{} {} \"{}\" {:?} {:?}", g2.strict, g2.directed, g2.name, g2.nodes, g2.edges);
// false true "Didi" ["-1", "2", ".3", "4.2"] [(0, 1), (1, 2), (1, 3)]
```
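
As the printed output suggests, each edge is a pair of indices into `nodes`. A minimal std-only sketch of resolving
them back to names, assuming nodes are `String`s and edges are `(usize, usize)` pairs; `edge_names` is a hypothetical
helper for illustration, not part of RustDOT:

```rust
/// Resolve index pairs to the node names they refer to.
/// Hypothetical helper; panics if an index is out of range.
fn edge_names<'a>(nodes: &'a [String], edges: &[(usize, usize)]) -> Vec<(&'a str, &'a str)> {
    edges
        .iter()
        .map(|&(from, to)| (nodes[from].as_str(), nodes[to].as_str()))
        .collect()
}
```

Applied to the data of `g1` above, this would pair up `A`/`B`, `B`/`C` and `B`/`D`.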

The return value can be fed to crates such as `petgraph`:
```
// `rust_dot_graph` is the graph obtained from `rust_dot!` or a `parse_*` function
let mut petgraph = petgraph::graph::Graph::new();
let nodes: Vec<_> = rust_dot_graph.nodes
    .iter()
    .map(|node| petgraph
        .add_node(node))
    .collect();
for edge in rust_dot_graph.edges {
    petgraph
        .add_edge(nodes[edge.0], nodes[edge.1], ());
}
```
or `graph`/`graph_builder`:
```
use graph::prelude::*;

let graph: DirectedCsrGraph<usize> = GraphBuilder::new()
    .csr_layout(CsrLayout::Sorted)
    .edges(rust_dot_graph.edges)
    .build();
```
This is work in progress. Nothing is stabilised!

# Todo

- Implement `strict`, it is currently ignored/skipped
- Return Err instead of panicking on wrong input
- Put Spans on Lexemes, based on their input, maybe using crate macroex
- Separate return type (currently `Parser`, which should be internal)
- Implement node attributes, they are currently ignored/skipped
- Implement node defaults
- Implement edge attributes, they are currently ignored/skipped
- Implement edge defaults
- Deal with graph attributes, with and without keyword `graph`

- Reimplement `rust_dot` as a proc-macro, transforming its input as const at compile time
- As an extension to DOT, allow label or weight to come from a Rust expression
- As an extension to DOT, allow label or weight to come from invoking a closure

# Limitations

Macro input is tokenised by the Rust lexer, whose rules differ subtly from those of Graphviz. For consistency (and ease
of implementation) the [`parse_*`](parse_string()) functions use the same lexer. These are the consequences:

- Macros must be in UTF-8, while the input to the [`parse_*`](parse_bytes()) functions may also be UTF-16 or Latin-1.
  You must deal with other encodings yourself.
- Double quotes, parentheses, braces and brackets must be balanced and some characters are not allowed. As a workaround
  you can change something like the following first line into the second. The commented quotes are seen by Rust, but
  ignored as HTML (once that is implemented):
  ```graphviz
  <<I>"</I> <B> )}] [{( </B> \\>
  <<I>"<!--"--></I> <B><!--"--> )}]  [{( <!--"--></B> <!--"-->\\<!--"-->>
  ```
- HTML is partly whitespace-sensitive, whereas Rust is not. So on the macro side it’s impossible to get whitespace
  right, and for run time input it would take quite some effort. Instead RustDOT uses a heuristic: a space between all
  tokens, except inside tags and entities and before `[,;.:!?]` (incomplete, and wrong for some languages).
- Strings are not yet unescaped when RustDOT gets them, yet the Rust lexer already validates their escapes. The
  [`parse_*`](parse_string()) functions work around this, but in [`rust_dot!`] you must use raw strings like `r"\N"`
  when they contain unrusty backslash sequences.
- Comments are exactly Rust comments. They differ from DOT in that block comments can nest.
- Not officially comments, but everything after `#` on the same line is also discarded. Unlike real comments, these are
  handled by RustDOT, after lexical analysis. This means that the rest of the line must be balanced, as described in
  the 2<sup>nd</sup> point above. And if a delimiter is opened there, the pseudo-comment only ends after the matching
  closing delimiter, so you should put that on the same line! In [`rust_dot!`] you must use `//` instead! (Only the
  nightly compiler gives access to line numbers in macros.)
- Valid identifiers should be accepted by Rust, though (only in [`rust_dot!`]) confusable letters like Cyrillic ‘о’ or
  rare scripts like Runic give warnings.
- Valid numbers should be accepted by Rust, and floats do not need a leading zero before the decimal dot.
- RustDOT returns one graph, so it wants exactly one in the input. The Graphviz grammar doesn’t clarify multiple graphs
  per file, but Graphviz accepts them. However, they lead to two SVGs invalidly concatenated in one file, or a PNG
  displaying only the first. Likewise Graphviz accepts an empty document; RustDOT does not.
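
The backslash workaround from the strings point above can be sketched with std alone. This mirrors the preprocessing
that `parse_string()` currently performs before handing the text to the Rust lexer: every backslash is doubled, except
one that escapes a double quote or a newline. The name `double_backslashes` is hypothetical and only serves as
illustration:

```rust
/// Double backslashes so that the Rust lexer accepts escapes it does not know.
/// A backslash directly before `"` or a line break is left alone.
fn double_backslashes(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + input.len() / 20);
    // true while the previous char was a backslash that is not yet decided
    let mut pending = false;
    for c in input.chars() {
        if pending {
            pending = false;
            if c == '\\' {
                // an escaped backslash: one is already out, add two, the 4th follows below
                out.push_str("\\\\");
            } else if c != '"' && c != '\n' {
                out.push('\\');
            }
        } else if c == '\\' {
            pending = true;
        }
        out.push(c);
    }
    out
}
```

With this, DOT's `\N` reaches the lexer as `\\N`, which Rust reads as an escaped backslash followed by `N`.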
*/
use std::{
    fs::File,
    io::{self, Read},
    path::Path,
    str::from_utf8,
};

use proc_macro2::{LexError, TokenStream};

mod lexer;
use lexer::{lexer, Lexeme::*};

mod parser;
use parser::Parser;

#[cfg(test)]
mod graphviz_tests;

//extern crate r#macro;
//pub use r#macro::rust_dot;

/**
Embed RustDOT as a sub-language in Rust.

# Examples
```
let g1 = rust_dot! {
    graph "gravity" {
        -2.1 -- -1 -- 0 -- 2.71828 -- 3.14159
        A -- "B"
        { 1; 2; 3 } -- { 4 5 } -- { 6; 7 8 }
    }
};

let g2 = rust_dot! {
    DiGraph Diggy {
        A [oh = boy]
        A -> B [x = 1, y = 2 z = 3][zz = 3.1];
        A -> { C -> { D:p -> E:q:nw } -> F -> G } -> H;
    }
};
```
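
Braced groups on either side of an edge operator stand for every node inside them, so each node of one group gets
connected to each node of the next. The crate's own `cross_product` test expects 2·3 + 3·2 = 12 edges from three groups
of 2, 3 and 2 nodes. A minimal std-only sketch of that pairing for flat groups of node indices follows;
`chain_cross_product` is a hypothetical helper for illustration, not how RustDOT is implemented, and it does not cover
nested groups that contain edges of their own:

```rust
/// Connect every node of each group to every node of the following group.
fn chain_cross_product(groups: &[Vec<usize>]) -> Vec<(usize, usize)> {
    let mut edges = Vec::new();
    // look at each pair of neighbouring groups in the chain
    for pair in groups.windows(2) {
        for &from in &pair[0] {
            for &to in &pair[1] {
                edges.push((from, to));
            }
        }
    }
    edges
}
```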
Since macro bodies are scanned by Rust before the macro gets them, there are slight differences to the
[`parse_*`](parse_string()) functions:

- Strings may not contain unknown escapes, so in that case you must use raw strings like `r"\G"`.
- `#` pseudo-comments are not possible, you must use normal Rust comments instead.

Currently this generates code to parse at run time and return heap allocated data.
For this to work, for now you must depend on crate `quote`.

In the future, this should become a proc-macro that generates const arrays of strs.
Then hopefully labels and weights can be Rust expressions or results of closures.
*/
#[macro_export]
macro_rules! rust_dot {
    ($($graph:tt)+) => {
        $crate::_parse_token_stream(quote::quote!($($graph)+), false, false)
    }
}

/**
Transform RustDOT from file at run time.
Input is tried as UTF-8, UTF-16 and Latin-1 (see details at [`parse_bytes()`]).
# Examples
```
let g = parse_file("polka.dot");
```
*/
pub fn parse_file<T: AsRef<std::ffi::OsStr>>(file: T) -> io::Result<Parser> {
    parse_read(&mut File::open(Path::new(&file))?)
}

/// Transform RustDOT from `Read` object at run time (generalised [`parse_file()`]).
pub fn parse_read(read: &mut impl Read) -> io::Result<Parser> {
    let mut buf = vec![];
    read.read_to_end(&mut buf)?;
    Ok(parse_bytes(&buf))
}

/** Transform RustDOT from input at run time.
If input is not UTF-8 and has a UTF-16 BOM, or (given that graphs must start with ASCII) a NUL as 1<sup>st</sup> or
2<sup>nd</sup> byte, it is treated as UTF-16, either BE or LE. Otherwise it is treated as Latin-1.
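
The detection order described above, as a minimal std-only sketch. `decode_dot_bytes` is a hypothetical stand-alone
helper, not part of the API. Unlike `parse_bytes()` it only decodes, and it returns `None` for malformed UTF-16 instead
of panicking. A BOM, if present, stays in the result as U+FEFF:

```rust
/// Decode as UTF-8, else UTF-16 (BOM or NUL heuristic), else Latin-1.
fn decode_dot_bytes(input: &[u8]) -> Option<String> {
    if let Ok(s) = std::str::from_utf8(input) {
        return Some(s.to_owned());
    }
    // Some(true) = big endian, Some(false) = little endian, None = not UTF-16
    let big_endian = match input {
        [0xFE, 0xFF, ..] | [0, _, ..] => Some(true),
        [0xFF, 0xFE, ..] | [_, 0, ..] => Some(false),
        _ => None,
    };
    match big_endian {
        Some(be) => {
            // a trailing odd byte is ignored
            let units: Vec<u16> = input
                .chunks_exact(2)
                .map(|pair| {
                    let bytes = [pair[0], pair[1]];
                    if be {
                        u16::from_be_bytes(bytes)
                    } else {
                        u16::from_le_bytes(bytes)
                    }
                })
                .collect();
            String::from_utf16(&units).ok()
        }
        // Latin-1 bytes map one-to-one onto the first 256 Unicode code points
        None => Some(input.iter().map(|&b| b as char).collect()),
    }
}
```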
*/
pub fn parse_bytes(input: &[u8]) -> Parser {
    if let Ok(str) = from_utf8(input) {
        return parse_string(str);
    }
    // document must start with ASCII comment or keyword, so 0 byte also means UTF-16
    // slice patterns instead of `&input[0..2]`, which would panic on input shorter than 2 bytes
    let (utf16, be) = match input {
        [0xFE, 0xFF, ..] | [0, _, ..] => (true, true),
        [0xFF, 0xFE, ..] | [_, 0, ..] => (true, false),
        _ => (false, false),
    };
    if utf16 {
        let mut iter = input.iter();
        let mut acc = Vec::with_capacity(input.len() / 2_usize);
        let join = |a: &u8, b: &u8| (*a as u16) << 8 | (*b as u16);
        while let (Some(b0), Some(b1)) = (iter.next(), iter.next()) {
            acc.push(if be { join(b0, b1) } else { join(b1, b0) });
        }
        parse_string(&String::from_utf16(&acc).unwrap())
    } else {
        // fall back to Latin-1
        parse_string(&input.iter().map(|&c| c as char).collect::<String>())
    }
}

/**
Transform RustDOT from input at run time.
# Examples
```
let g1 = parse_string("graph { 1 -- 2 -- 3; 2 -- 4 }");
// Read file at compile time, but parse it at run time
let g2 = parse_string(include_str!("polka.dot"));
```
*/
pub fn parse_string(input: &str) -> Parser {
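    let mut input = input;
    let owned;
    // The Rust lexer rejects unknown escapes, so double every backslash,
    // except one that escapes a double quote or a newline.
    let mut esc = input.contains('\\');
    if esc {
        esc = false;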
        let len = input.len();
        owned = input
            .chars()
            .fold(String::with_capacity(len + len / 20), |mut acc, c| {
                if esc {
                    esc = false;
                    if c == '\\' {
                        acc += "\\\\"
                    } else if c != '"' && c != '\n' {
                        acc.push('\\')
                    }
                } else if c == '\\' {
                    esc = true;
                }
                acc.push(c);
                acc
            });
        input = &owned;
        esc = true;
    }
    _parse_token_stream(
        input.parse().unwrap_or_else(|e: LexError| {
            let span = e.span().start();
            panic!(
                "parse_*() input not lexically Rust-conforming at {}:{}",
                span.line,
                span.column + 1
            )
        }),
        true,
        esc,
    )
}

/// The internal workhorse, exposed only for [`rust_dot!`].
pub fn _parse_token_stream(input: TokenStream, parse_fn: bool, _esc: bool) -> Parser {
    let mut lexer = lexer(input, parse_fn);

    let strict = next_if!(lexer, Strict);

    let directed = match lexer.next() {
        Some(Graph) => false,
        Some(DiGraph) => true,
        _ => panic!("[ strict ] ( graph | digraph ) expected"),
    };

    let mut item = lexer.next();
    let name = if let Some(Id(name)) = item {
        item = lexer.next();
        name
    } else {
        "".into()
    };

    let Some(Block(block)) = item else {
        panic!("After ( graph | digraph ) [ ID ] expected block")
    };
    if lexer.next().is_some() {
        panic!("Nothing expected after main block")
    };

    Parser::graph(strict, directed, name, block.stream(), parse_fn)
}

#[cfg(test)]
mod tests {
    use super::*;

    macro_rules! validate {
        ($(($strict:ident $directed:ident $name:literal))? $nodes:literal $edges:literal; $($graph:tt)+) => {{
            let graph = rust_dot!($($graph)+);
            $(
                assert!(graph.strict == $strict);
                assert!(graph.directed == $directed);
                assert_eq!(graph.name, $name);
            )?
            assert!(graph.nodes.len() == $nodes);
            assert!(graph.edges.len() == $edges);
            graph
        }}
    }

    #[test]
    fn empty() {
        validate! {
            (false false "Foo") 0 0;
            Graph Foo {}
        };

        validate! {
            (false true "bar") 0 0;
            DIGRAPH "bar" {}
        };

        validate! {
            (true false "") 0 0;
            strict graph {}
        };
    }

    #[test]
    fn simple() {
        validate! {
            5 3;
            digraph {
                0 -> 1
                2 -> 3 -> 4 [weight = 2, style = fancy];
            }
        };
    }

    #[test]
    fn cross_product() {
        validate! {
            7 12;
            DIGRAPH "Lacrosse" {
                { 0; 1 } -> { 2 3 ; 4 } -> { 5 [foo = bar] 6 }
            }
        };
    }

    #[test]
    fn deeply_nested() {
        validate! {
            8 17;
            DiGraph Diggy {
                A [oh = boy]
                A -> B [x = 1, y = 2 z = 3][zz = 3.1];
                A -> { C -> { D:p -> E:q:nw } -> F -> G } -> H;
            }
        };
    }

    #[test]
    fn attrs() {
        validate! {
            0 0;
            graph 123 {
                foo = bar
                graph [x= y z = 0];
                node [a=1 b=2.1, c=toot d="oho"]
                edge [weight = 42]
            }
        };
    }

    #[test]
    fn bytes() {
        let utf16_be = [
            0xfe, 0xff, 0, 0x67, 0, 0x72, 0, 0x61, 0, 0x70, 0, 0x68, 0, 0x7b, 0, 0xd8, 0, 0x7d,
        ];
        let utf16_le = [
            0xff, 0xfe, 0x67, 0, 0x72, 0, 0x61, 0, 0x70, 0, 0x68, 0, 0x7b, 0, 0xd8, 0, 0x7d, 0,
        ];
        let latin1 = [0x67, 0x72, 0x61, 0x70, 0x68, 0x7b, 0xd8, 0x7d];
        for bytes in [
            &utf16_be,
            &utf16_be[2..],
            &utf16_le,
            &utf16_le[2..],
            "graph{Ø} # don't see comment".as_bytes(),
            &latin1,
        ] {
            let graph = parse_bytes(bytes);
            assert!(graph.nodes.len() == 1);
        }
    }

    #[test]
    fn html() {
        validate! {
            2 1;
            digraph {
                < this is <B>bold</B> &amp; first > -> < bella <I>Italica</I> >
            }
        };
    }
}