Crate sqlite2parquet


Generate parquet files from sqlite databases

This library provides two things:

  1. A flexible way to generate a parquet file from a bunch of SQL statements
  2. A way to generate the necessary config for writing a whole table to a parquet file

This package also contains a binary crate which lets you easily compress a whole sqlite DB into a bunch of parquet files. This typically gets a better compression ratio than xz, and is much faster. See ARCHIVE for a comparison.

§The easy way

If you just want to dump a whole table as-is into a parquet file, you can use the handy infer_schema(). It tries to guess the best encoding for each column based on the sqlite schema.

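// `conn` is assumed to be an already-open connection to the sqlite database.
// infer_schema() yields one fallible item per column, so collect them into a Vec.
let cols = sqlite2parquet::infer_schema(&conn, "my_table")
    .unwrap()
    .collect::<anyhow::Result<Vec<_>>>()
    .unwrap();
// Create the output file, then write the table into it.
let out_path = std::fs::File::create("my_table.parquet").unwrap();
sqlite2parquet::write_table(&conn, "my_table", &cols, &out_path, 1_000_000).unwrap();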

§The flexible way

Explicitly define the columns that will go in the parquet file. Each column is filled by its own SELECT query, so be careful: all of the queries must return the same number of rows. If they don’t, you’ll get a runtime error.

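use sqlite2parquet::*;
// Both queries below use the same GROUP BY and ORDER BY clauses,
// so they return the same number of rows in the same order.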
let cols = vec![
    Column {
        name: "category".to_string(),
        required: true,
        physical_type: PhysicalType::ByteArray,
        logical_type: Some(LogicalType::String),
        encoding: None,
        dictionary: true,
        query: "SELECT category FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
    },
    Column {
        name: "first_timestamp".to_string(),
        required: true,
        physical_type: PhysicalType::Int64,
        logical_type: Some(LogicalType::Timestamp(TimeType { utc: true, unit: TimeUnit::Nanos })),
        encoding: Some(Encoding::DeltaBinaryPacked),
        dictionary: false,
        query: "SELECT MIN(timestamp) FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
    },
];

let out_path = std::fs::File::create("category_start_times.parquet").unwrap();
write_table(&conn, "category_start_times", &cols, &out_path, 1_000_000).unwrap();
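
The row-count requirement described above can be checked before any file is written. The following is a minimal sketch, not part of sqlite2parquet: count_query and check_row_counts are hypothetical helper names, and it assumes you run each generated COUNT(*) statement yourself (for example through your sqlite connection) and pass the resulting counts in.

```rust
/// Hypothetical helper (not part of sqlite2parquet): wraps a column's SELECT
/// in a COUNT(*) subquery so its row count can be fetched before writing.
fn count_query(select: &str) -> String {
    // Strip surrounding whitespace and any trailing semicolon so the
    // statement can be nested inside parentheses.
    let inner = select.trim().trim_end_matches(';');
    format!("SELECT COUNT(*) FROM ({})", inner)
}

/// Hypothetical helper: takes (column name, row count) pairs and returns
/// Ok(()) if every column has the same count, or a message describing the
/// first column that disagrees with the first one.
fn check_row_counts(counts: &[(&str, u64)]) -> Result<(), String> {
    let (first_name, expected) = match counts.first() {
        Some(&(name, n)) => (name, n),
        None => return Ok(()), // no columns, nothing to compare
    };
    for &(name, n) in &counts[1..] {
        if n != expected {
            return Err(format!(
                "column `{}` returns {} rows, but `{}` returns {}",
                name, n, first_name, expected
            ));
        }
    }
    Ok(())
}
```

Run count_query(&col.query) for each column, collect the results, and call check_row_counts before write_table. A mismatch then shows up as a message naming the offending column rather than as an error partway through the write. This only compares counts; it does not check that the rows line up in the same order.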

Structs§

Column
Progress
TimeType

Enums§

Encoding
LogicalType
PhysicalType
TimeUnit

Functions§

infer_schema
Infer a parquet schema to use for this dataset.
write_table
Creates a parquet file from a set of SQL queries.
write_table_with_progress
Like write_table(), but lets you provide a callback which is called regularly.