High-fidelity JSON lexer and parser.
§Introduction
JSON is a data format that is underspecified and sometimes contradictory. As a reference, I recommend the excellent article “Parsing JSON is a Minefield”. In particular, it is ambiguous how to parse strings and numbers. For example, JSON does not impose any restriction on the maximal size of numbers, but in reality, most JSON parsers use a lossy representation, for example 64-bit floating point. This is allowed by the JSON specification; however, if we are allowed to fix arbitrary maximal sizes, then a parser that fails on every input is also a valid parser! I hope to have convinced you at this point that this is all quite a mess. The best I can do to help you around this mess is to give you a tool to deal with it in the way that suits you most. hifijson is this tool.
What makes hifijson so flexible is that unlike most other JSON parsers, it exposes its basic building blocks, called lexers, that allow you to build your own parsers on top of them.
Because hifijson exposes a variety of lexers and parsers,
you can combine them in a way that allows you to achieve your desired behaviour,
without having to write everything from scratch.
For example, suppose that your input data does not contain escape sequences (\n, \uxxxx);
then you can use the str::LexWrite::str_bytes function that is
guaranteed to never allocate memory when lexing from a slice,
making it suitable for use in embedded environments.
Or suppose that you are reading an object {"title": ..., "reviews": ...},
and you do not feel like caring about reviews today.
Then you can simply skip reading the value for reviews by using ignore::parse (a sketch of this follows at the end of this introduction).
Going wild and stretching the syntax a bit, you can also make
a parser that accepts any value (instead of only strings as mandated by JSON) as object key.
Or, if you just want to have a complete JSON value, then
you can use value::parse_unbounded.
The choice is yours.
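To make the reviews example a bit more concrete, here is a rough sketch of such a parser. It is not verbatim hifijson API: it assumes that str::LexAlloc::str_string takes no arguments, yields a decoded key that dereferences to a string slice, and reports its error via Error::Str, and that ignore::parse and value::parse_unbounded both take the peeked character and the lexer, as in the examples further below.

use hifijson::{Error, Expect};

/// Return the value of "title" (if present), skipping the values of all other members.
fn title<L: hifijson::LexAlloc>(next: u8, lexer: &mut L) -> Result<Option<String>, Error> {
    let mut title = None;
    // the value must start with '{', i.e. it must be an object
    lexer.expect(|_| Some(next), b'{').ok_or(Expect::Value)?;
    // perform the following for every key-value pair of the object
    lexer.seq(b'}', L::ws_peek, |next, lexer| {
        // read the key and the ':' after it
        lexer.expect(|_| Some(next), b'"').ok_or(Expect::String)?;
        // assumption: str_string takes no arguments and yields the decoded key
        let key = lexer.str_string().map_err(Error::Str)?;
        lexer.expect(L::ws_peek, b':').ok_or(Expect::Colon)?;
        // peek the first character of the value
        let next = lexer.ws_peek().ok_or(Expect::Value)?;
        if &*key == "title" {
            // we care about this value, so parse it completely
            title = Some(hifijson::value::parse_unbounded(next, lexer)?.to_string());
        } else {
            // we do not feel like caring about this value today, so skip it
            hifijson::ignore::parse(next, lexer)?;
        }
        Ok::<_, Error>(())
    })?;
    Ok(title)
}

Such a parser can then be run like the parsers in the examples below, for example via lexer.exactly_one(L::ws_peek, title).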
In summary, hifijson aims to give you the tools to interpret JSON-like data flexibly and efficiently.
§Lexers
The hardest parts of lexing JSON are strings and numbers. hifijson offers many different string/number lexers, which differ most prominently in their memory allocation behaviour. For example,
- str::Lex::str_ignore discards a string,
- str::LexWrite::str_bytes reads a string, but does not interpret escape sequences, and
- str::LexAlloc::str_string reads a string and interprets escape sequences.
In particular,

- lexers that implement the Lex trait never allocate memory;
- lexers that implement the LexWrite trait only allocate memory when lexing from iterators, and
- lexers that implement the LexAlloc trait may allocate memory when lexing from both iterators and slices.
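To illustrate how these bounds show up in your own code, here is a minimal sketch of a validator that never allocates. It assumes that ignore::parse only requires the Lex trait, takes the peeked character and the lexer, and returns Result<(), Error>, analogously to value::parse_unbounded in the examples below; the exact signature may differ.

use hifijson::{Error, Lex};

/// Check that the input is exactly one valid JSON value, without building any value.
/// The `Lex` bound documents that this function never allocates memory.
fn validate<L: Lex>(mut lexer: L) -> Result<(), Error> {
    lexer.exactly_one(L::ws_peek, hifijson::ignore::parse)
}

let json = br#"{"valid": true}"#;
validate(hifijson::SliceLexer::new(json)).expect("valid JSON");

A function that constructs an owned value, in contrast, would use the LexAlloc bound, like process in the file/stream example below.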
§Slices and Iterators
One important feature of hifijson is that it allows reading from both slices and iterators over bytes. This is useful when your application should support reading from both files and streams (such as standard input).
§Feature Flags
If you build hifijson without the feature flag alloc, you disable any allocation.
If you build hifijson with the feature flag serde,
then you can use hifijson to deserialise JSON to data types implementing serde::Deserialize.
§Examples
§Parsing strings to values
Let us consider a very simple usage: parsing a JSON value from a string. For this, we first have to create a lexer from the string, then call the value parser on the lexer:
// our input JSON that we want to parse
let json = br#"[null, true, false, "hello", 0, 3.1415, [1, 2], {"x": 1, "y": 2}]"#;
// the lexer on our input -- just creating it does not actually run it yet
let mut lexer = hifijson::SliceLexer::new(json);
use hifijson::token::Lex;
// now we are going -- we try to
// obtain exactly one JSON value from the lexer and
// parse it to a value, allowing for arbitrarily deep (unbounded) nesting
let value = lexer.exactly_one(Lex::ws_peek, hifijson::value::parse_unbounded);
let value = value.expect("parse");
// yay, we got an array!
assert!(matches!(value, hifijson::value::Value::Array(_)));
assert_eq!(
    value.to_string(),
    // printing a value yields a compact representation with minimal spaces
    r#"[null,true,false,"hello",0,3.1415,[1,2],{"x":1,"y":2}]"#
);

§Parsing files and streams
The following example reads JSON from a file if an argument is given, otherwise from standard input:
/// Parse a single JSON value and print it.
///
/// Note that the `LexAlloc` trait indicates that this lexer allocates memory.
fn process<L: hifijson::LexAlloc>(mut lexer: L) {
    let value = lexer.exactly_one(L::ws_peek, hifijson::value::parse_unbounded);
    let value = value.expect("parse");
    println!("{}", value);
}

let filename = std::env::args().nth(1);
if let Some(filename) = filename {
    let file = std::fs::read(filename).expect("read file");
    process(hifijson::SliceLexer::new(&file))
} else {
    use std::io::Read;
    process(hifijson::IterLexer::new(std::io::stdin().bytes()))
}

We just made a pretty printer (stretching the definition of pretty pretty far).
§Operating on the lexer
Often, it is better for performance to operate directly on the non-whitespace characters that the lexer yields rather than parsing everything into a value and then processing the value. For example, the following program counts the number of values in the input JSON. Unlike the previous examples, it requires only constant memory!
use hifijson::{Error, Expect, Lex};
/// Recursively count the number of values in the value starting with the `next` character.
///
/// The `Lex` trait indicates that this lexer does *not* allocate memory.
fn count<L: Lex>(next: u8, lexer: &mut L) -> Result<usize, Error> {
    match next {
        // the JSON values "null", "true", and "false"
        b'a'..=b'z' => Ok(lexer.null_or_bool().map(|_| 1).ok_or(Expect::Value)?),
        b'0'..=b'9' => Ok(lexer.num_ignore().map(|_| 1)?),
        b'-' => count(b'0', lexer.discarded()),
        b'"' => Ok(lexer.discarded().str_ignore().map(|_| 1)?),
        // start of array
        b'[' => {
            // an array is a value itself, so start with 1
            let mut sum = 1;
            // perform the following for every item of the array
            lexer.discarded().seq(b']', L::ws_peek, |next, lexer| {
                sum += count(next, lexer)?;
                Ok::<_, Error>(())
            })?;
            Ok(sum)
        }
        // start of object
        b'{' => {
            let mut sum = 1;
            // perform the following for every key-value pair of the object
            lexer.discarded().seq(b'}', L::ws_peek, |next, lexer| {
                // read the key, ignoring it, and then the ':' after it
                lexer.expect(|_| Some(next), b'"').ok_or(Expect::String)?;
                lexer.str_ignore().map_err(Error::Str)?;
                lexer.expect(L::ws_peek, b':').ok_or(Expect::Colon)?;
                // peek the next non-whitespace character
                let next = lexer.ws_peek().ok_or(Expect::Value)?;
                sum += count(next, lexer)?;
                Ok::<_, Error>(())
            })?;
            Ok(sum)
        }
        _ => Err(Expect::Value)?,
    }
}

fn process<L: Lex>(mut lexer: L) -> Result<usize, Error> {
    lexer.exactly_one(L::ws_peek, count)
}
let json = br#"[null, true, false, "hello", 0, 3.1415, [1, 2], {"x": 1, "y": 2}]"#;
let lexer = hifijson::SliceLexer::new(json);
let n = process(lexer).unwrap();
assert_eq!(n, 13);

§More Examples
See the cat example for a more worked version of a JSON “pretty” printer
that can also be used to lazily filter parts of the data based on a path.
hifijson also powers all JSON reading in the jaq crate,
for which it was originally created.
Re-exports§
pub use token::Expect;
Modules§
- escape - Escape sequences.
- ignore - Discarding values.
- num - Positive numbers.
- str - Strings.
- token - Tokens.
- value - Parsing and values.
Structs§
- IterLexer - JSON lexer from an iterator over (fallible) bytes.
- SliceLexer - JSON lexer from a shared byte slice.
Enums§
- Error - Parse error.