Generate parquet files from sqlite databases
This library provides two things:
- A flexible way to generate a parquet file from a set of SQL statements
- A way to generate the necessary config for writing a whole table to a parquet file

This package also contains a binary crate which lets you easily compress a whole sqlite DB into a set of parquet files. This typically achieves a better compression ratio than xz, and is much faster. See ARCHIVE for a comparison.
§The easy way
If you just want to dump the whole table as-is into a parquet file, you can
use the handy infer_schema(). It tries to guess the best encoding based
on the sqlite schema.
```rust
// Assumes `conn` is an already-open connection to the source sqlite database.
let cols = sqlite2parquet::infer_schema(&conn, "my_table")
    .unwrap()
    .collect::<anyhow::Result<Vec<_>>>()
    .unwrap();
let out_file = std::fs::File::create("my_table.parquet").unwrap();
sqlite2parquet::write_table(&conn, "my_table", &cols, &out_file, 1_000_000).unwrap();
```

§The flexible way
Explicitly define the columns that will go in the parquet file. One thing
to be careful about: the SELECT queries must all return the same number
of rows. If not, you’ll get a runtime error.
```rust
use sqlite2parquet::*;

// Assumes `conn` is an already-open connection to the source sqlite database.
let cols = vec![
    Column {
        name: "category".to_string(),
        required: true,
        physical_type: PhysicalType::ByteArray,
        logical_type: Some(LogicalType::String),
        encoding: None,
        dictionary: true,
        query: "SELECT category FROM my_table GROUP BY category ORDER BY MIN(timestamp)"
            .to_string(),
    },
    Column {
        name: "first_timestamp".to_string(),
        required: true,
        physical_type: PhysicalType::Int64,
        logical_type: Some(LogicalType::Timestamp(TimeType { utc: true, unit: TimeUnit::Nanos })),
        encoding: Some(Encoding::DeltaBinaryPacked),
        dictionary: false,
        query: "SELECT MIN(timestamp) FROM my_table GROUP BY category ORDER BY MIN(timestamp)"
            .to_string(),
    },
];
let out_file = std::fs::File::create("category_start_times.parquet").unwrap();
write_table(&conn, "category_start_times", &cols, &out_file, 1_000_000).unwrap();
```
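Because the flexible path only fails at runtime when the per-column queries disagree on row count, it can be worth checking the counts up front. A minimal, hypothetical pre-flight helper (not part of sqlite2parquet's API) might look like the sketch below; in practice each count would come from running a `SELECT COUNT(*)` wrapper around each column query against the connection:

```rust
// Hypothetical helper, not part of sqlite2parquet: verify that every
// column query would produce the same number of rows before writing.
// In practice each count comes from `SELECT COUNT(*) FROM (<column query>)`.
fn row_counts_match(counts: &[u64]) -> bool {
    // Every adjacent pair is equal, so all counts are equal.
    // Vacuously true for zero or one queries.
    counts.windows(2).all(|w| w[0] == w[1])
}

fn main() {
    assert!(row_counts_match(&[3, 3])); // both queries returned 3 rows: OK
    assert!(!row_counts_match(&[3, 4])); // mismatch: write_table would error
    println!("ok");
}
```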
§Functions
- infer_schema - Infer a parquet schema to use for this dataset.
- write_table - Creates a parquet file from a set of SQL queries.
- write_table_with_progress - Like write_table(), but lets you provide a callback which is called regularly.
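To give an intuition for the kind of guessing infer_schema does, here is a rough, hypothetical sketch of a declared-type-to-physical-type mapping. The function name and the mapping are illustrative assumptions, not the crate's actual inference rules, which also pick encodings, logical types, and nullability:

```rust
// Hypothetical sketch only: the flavor of mapping a schema-inference step
// might use. sqlite2parquet's real rules may differ.
fn guess_physical_type(sqlite_decl_type: &str) -> &'static str {
    match sqlite_decl_type.to_ascii_uppercase().as_str() {
        "INTEGER" | "INT" => "Int64",
        "REAL" | "FLOAT" | "DOUBLE" => "Double",
        "TEXT" | "VARCHAR" => "ByteArray + LogicalType::String",
        "BLOB" => "ByteArray",
        // SQLite columns are dynamically typed, so fall back to raw bytes.
        _ => "ByteArray",
    }
}

fn main() {
    assert_eq!(guess_physical_type("integer"), "Int64");
    assert_eq!(guess_physical_type("TEXT"), "ByteArray + LogicalType::String");
    println!("ok");
}
```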