Crate schema_analysis

§Universal-ish Schema Analysis

Ever wished you could figure out what was in that json file? Or maybe it was xml… Ehr, yaml? It was definitely toml.

Alas, many great tools will only work with one of those formats, and the internet is not so nice a place as to finally understand that no, xml is not an acceptable document format.

Enter this neat little tool, a single interface to any self-describing format supported by our gymnast friend, serde.

§Features

Works with any self-describing format with a Serde implementation.
Suitable for large files¹.
Keeps track of some useful info for each type.
Keeps track of null/normal/missing/duplicate values separately.
Integrates with Schemars and json_typegen to produce types and json schema if needed.
There’s a demo website here.

§Usage

use schema_analysis::{InferredSchema, Schema, StructuralEq};

let data: &[u8] = b"true";

// Just pick your format, and deserialize InferredSchema as if it were a normal type.
let inferred: InferredSchema = serde_json::from_slice(data)?;
// let inferred: InferredSchema = serde_yaml::from_slice(data)?;
// let inferred: InferredSchema = serde_cbor::from_slice(data)?;
// let inferred: InferredSchema = toml::from_slice(data)?;
// let inferred: InferredSchema = rawbson::de::from_bytes(data)?;
// let inferred: InferredSchema = quick_xml::de::from_reader(data)?;

// InferredSchema is a wrapper around Schema
let schema: Schema = inferred.schema;
let expected: Schema = Schema::Boolean(Default::default());
assert!(schema.structural_eq(&expected));

// The wrapper is there so we can both do the magic above, and also store the data for later
let serialized_schema: String = serde_json::to_string_pretty(&schema)?;

That’s it.

Check Schema to see what info you get, and targets to see the available integrations (which include code and json schema generation).

§Advanced Usage

I know, I know, the internet is evil and has decided to plague you with not one, but thousands, maybe even millions, of files.

Unfortunately Serde relies on type information to work, so ~~there is nothing we can do about it~~ we can bring out the big guns: DeserializeSeed. It’s everything you love about Serde, but with runtime state.

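// Imports this snippet relies on. NumberContext is assumed to live in the context module.
use schema_analysis::{context::NumberContext, Aggregate, InferredSchema, Schema};
use serde::de::DeserializeSeed;

let a_lot_of_json_files: &[&str] = &[ "1", "2", "1000" ];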
let mut iter = a_lot_of_json_files.iter();

if let Some(file) = iter.next() {
    // We use the first file to generate a new schema to work with.
    let mut inferred: InferredSchema = serde_json::from_str(file)?;

    // Then we iterate over the rest to expand the schema.
    for file in iter {
        let mut json_deserializer = serde_json::Deserializer::from_str(file);
        // DeserializeSeed is implemented on &mut InferredSchema
        // So here it borrows the data mutably and runs it against the deserializer.
        let () = inferred.deserialize(&mut json_deserializer)?;
    }

    // The result in this case would be a simple integer schema
    // that 'has met' the numbers 1, 2, and 1000.
    let mut context: NumberContext<i128> = Default::default();
    context.aggregate(&1);
    context.aggregate(&2);
    context.aggregate(&1000);

    assert_eq!(inferred.schema, Schema::Integer(context));
}

Furthermore, if you need to generate separate schemas (for example to run the analysis on multiple threads) you can use the Coalesce trait to merge them after-the-fact.
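
To make the merge-after-the-fact idea concrete, here is a minimal, std-only sketch. Everything in it is hypothetical: the `Coalesce` trait below is a local stand-in with an assumed signature (check the crate's own trait before relying on it), and `IntSummary` is a toy type, not the crate's `NumberContext`. It only shows the shape of the workflow: each thread builds a partial result, and the partial results are folded into one.

```rust
use std::thread;

/// Hypothetical stand-in for the crate's `Coalesce` trait: merge `other` into `self`.
trait Coalesce {
    fn coalesce(&mut self, other: Self);
}

/// Toy summary of the integers seen so far (not the crate's `NumberContext`).
#[derive(Debug, Default, Clone, PartialEq)]
struct IntSummary {
    count: usize,
    min: Option<i64>,
    max: Option<i64>,
}

impl IntSummary {
    /// Record a single value.
    fn aggregate(&mut self, value: i64) {
        self.count += 1;
        self.min = Some(self.min.map_or(value, |m| m.min(value)));
        self.max = Some(self.max.map_or(value, |m| m.max(value)));
    }
}

impl Coalesce for IntSummary {
    fn coalesce(&mut self, other: Self) {
        self.count += other.count;
        self.min = match (self.min, other.min) {
            (Some(a), Some(b)) => Some(a.min(b)),
            (a, b) => a.or(b),
        };
        self.max = match (self.max, other.max) {
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        };
    }
}

/// Summarize each chunk on its own thread, then merge the partial results.
fn summarize_in_parallel(chunks: Vec<Vec<i64>>) -> IntSummary {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                let mut partial = IntSummary::default();
                for value in chunk {
                    partial.aggregate(value);
                }
                partial
            })
        })
        .collect();

    let mut total = IntSummary::default();
    for handle in handles {
        total.coalesce(handle.join().expect("worker thread panicked"));
    }
    total
}

fn main() {
    let total = summarize_in_parallel(vec![vec![1, 2], vec![1000]]);
    println!("{:?}", total);
}
```

The merge is order-independent here (counts add, min and max commute), which is what makes it safe to combine partial results in whatever order the threads finish.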

§I really wish I could convert that Schema into something, you know, actually useful.

You are in luck! Check out the integrations with json_typegen and Schemars in the targets module to convert the analysis into useful files like Rust types and json schemas. You can also find a demo website here.

§How does this work?

For the short story made long, see the analysis module. The juicy bit is that Serde is kind enough to let the format tell us what it is working with; we take it from there and construct a nice schema from that info.
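
As a rough illustration of "let the format tell us what it is working with", here is a toy, std-only sketch. All names (`Value`, `ToySchema`, `infer`, `merge`) are made up for this example. It also cheats: it walks an already-parsed value tree, whereas the crate gets the same type information from Serde while parsing and never needs to hold the whole document.

```rust
use std::collections::BTreeMap;

/// Toy self-describing value (hypothetical, for illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Str(String),
    Seq(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

/// Toy schema, far smaller than the crate's `Schema` enum.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ToySchema {
    Null,
    Boolean,
    Integer,
    String,
    Sequence(Box<ToySchema>),
    Struct(BTreeMap<String, ToySchema>),
    Union(Vec<ToySchema>),
}

/// Derive a schema from the shape of a single value.
fn infer(value: &Value) -> ToySchema {
    match value {
        Value::Null => ToySchema::Null,
        Value::Bool(_) => ToySchema::Boolean,
        Value::Int(_) => ToySchema::Integer,
        Value::Str(_) => ToySchema::String,
        Value::Seq(items) => {
            // Merge the schemas of all elements; an empty sequence gets a placeholder.
            let element = items
                .iter()
                .map(infer)
                .reduce(merge)
                .unwrap_or(ToySchema::Null);
            ToySchema::Sequence(Box::new(element))
        }
        Value::Map(fields) => ToySchema::Struct(
            fields.iter().map(|(k, v)| (k.clone(), infer(v))).collect(),
        ),
    }
}

/// Combine two schemas: identical ones collapse, different ones form a flat union.
/// (A fuller version would also merge structs field by field.)
fn merge(a: ToySchema, b: ToySchema) -> ToySchema {
    if a == b {
        return a;
    }
    let mut variants: Vec<ToySchema> = Vec::new();
    for schema in [a, b] {
        let candidates = match schema {
            ToySchema::Union(inner) => inner,
            other => vec![other],
        };
        for candidate in candidates {
            if !variants.contains(&candidate) {
                variants.push(candidate);
            }
        }
    }
    ToySchema::Union(variants)
}

fn main() {
    let doc = Value::Seq(vec![Value::Int(1), Value::Null, Value::Int(2)]);
    println!("{:?}", infer(&doc));
}
```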

§Performance

These are not proper benchmarks, but should give a vague idea of the performance on an i7-7700HQ laptop (2017) with the raw data already loaded into memory.

Size	wasm time (MB/s)	native time (MB/s)	Format	File #
~180MB	~20s (9)	~5s (36)	json	1
~650MB	~150s (4.3)	~50s (13)	json	1
~1.7GB	~470s (3.6)	~145s (11.7)	json	1
~2.1GB	^a	~182s (11.5)	json	1
~13.3GB^b	-	~810s (16.4)	xml	~200k

^a This one seems to go over some kind of browser limit when fetching the data in the Web Worker; I believe I would have to split large files to handle it.

^b ~2.7GB compressed. This one seems like it would be a worst-case scenario because it includes decompression overhead and the files had a section that was formatted text which resulted in crazy schemas. (The json pretty printed schema was almost 0.5GB!)
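
The figures in parentheses are just size divided by time. A tiny sketch to recompute them for the native column, assuming 1 GB = 1000 MB and using the rounded sizes and times from the table (the function name is made up for this example):

```rust
/// Throughput in MB/s, given an input size in MB and a run time in seconds.
fn throughput_mb_per_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

fn main() {
    // (approximate size in MB, approximate native run time in seconds), from the table.
    let native_runs = [
        (180.0, 5.0),
        (650.0, 50.0),
        (1700.0, 145.0),
        (2100.0, 182.0),
        (13300.0, 810.0),
    ];
    for (size_mb, seconds) in native_runs {
        println!(
            "{:>7} MB in {:>4} s: {:.1} MB/s",
            size_mb,
            seconds,
            throughput_mb_per_s(size_mb, seconds)
        );
    }
}
```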

¹ This is just a weirdly shaped parser, so values are discarded as soon as they have been analyzed. This should hopefully translate into memory requirements that scale with the size of Schema, not the input data. If you ever had a schema analysis tool break on you after a mere million documents, this’ll probably be appreciated.

Re-exports§

pub use analysis::InferredSchema;
pub use analysis::InferredSchemaWithContext;
pub use context::Aggregators;
pub use context::Context;
pub use traits::Aggregate;
pub use traits::Coalesce;
pub use traits::StructuralEq;

Modules§

analysis: This module holds the analysis logic and an overview of how it works.
context: The Context provides a way to store information about the types found during analysis.
helpers: A module for any useful helper functions.
targets: A module containing some integrations that enable useful applications.
traits: A module holding the crate’s public traits.

Structs§

Field: A Field is a useful abstraction to record metadata that does not belong or would be unwieldy to place into the Schema and to account for cases in which the existence of a Field might be known, but nothing is known about its shape.
FieldStatus: The FieldStatus keeps track of what kind of values a Field has been found to have.

Enums§

Schema: This enum is the core output of the analysis; it describes the structure of a document.
