What is this repo?
This repo contains a Rust binary (main.rs) that translates a CSV table into Rust types. Every column is converted into a Vec of an enum representing all the unique types found in it; if a column is of String type, every unique String is deserialized as an enum variant (see the iris dataset example).
- The binary writes all the generated Rust code to stdout, so it can easily be redirected to a file from the terminal.
Bin installation
- Clone the repo
git clone https://github.com/AliothCancer/csv_deserializer.git
A folder called csv_deserializer will be created
- Move inside that folder
cd csv_deserializer
- Compile the project
cargo build --release
- Copy the binary into your local bin directory
- Assuming ~/.local/bin is in $PATH (bash) or in $env.PATH (nushell):
cp target/release/csv_deserializer ~/.local/bin
Bin usage
Note on null values:
--null-values is an optional comma-separated list of strings that will be converted to the Null variant, which all generated enums have.
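As a rough illustration of the idea behind --null-values, the sketch below splits a comma-separated argument into markers and checks whether a raw cell matches one of them. The helper names parse_null_values and is_null are hypothetical and are not taken from this crate, whose own parsing may differ (for example in how it treats whitespace).

```rust
/// Split a comma-separated `--null-values` argument into individual markers.
/// Hypothetical helper: the crate's own parsing may differ.
fn parse_null_values(arg: &str) -> Vec<String> {
    arg.split(',').map(|s| s.to_string()).collect()
}

/// Return true when a raw CSV cell should become the `Null` variant.
/// The comparison is exact: no trimming and no case folding.
fn is_null(cell: &str, null_values: &[String]) -> bool {
    null_values.iter().any(|marker| marker == cell)
}
```

With an argument such as "NA,NaN,?", a cell containing NA would be treated as null, while a cell containing 5.1 would not.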
Lib Usage Guide
There are two structs that represent the CSV file as Rust types:
- CsvDataset is defined in lib.rs. It can also be used on its own to easily load a CSV. Every CSV "cell" is stored as a CsvAny type.
- CsvDataFrame is generated by the binary of this crate, so it is available only after you put the generated Rust code in an .rs file and declare it as a module. Its exact structure depends on the CSV file you passed, i.e. the names of the columns and the unique values in each column. (See the iris example as a reference for the structure of this type.)
To use this library for generating and utilizing a typed Rust interface for your CSV files, follow these steps:
1. Loading the Dataset
First, load your CSV file using a csv::Reader. You then create a CsvDataset by providing the reader and specifying which strings should be treated as null values.
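// Assumes `path` holds the CSV path and `null_values` the strings to treat as null.
let file = std::fs::File::open(path)?;
let rdr = csv::ReaderBuilder::new()
    .has_headers(true)
    .from_reader(file);
// The exact argument types of CsvDataset::new are defined in lib.rs.
let dataset = CsvDataset::new(rdr, null_values);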
2. Generating Rust Code
Use the csv_deserializer CLI to generate the Rust code for a specific CSV file. The binary prints all the generated code to stdout, so you can redirect its output to a file from your command line to save it.
3. Using the Generated Code
Once the code is saved into a file (e.g., iris.rs), you can import it into your project. To work with the typed data, initialize a CsvDataFrame type by passing the CsvDataset you created earlier.
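// Assumes the generated code was saved as iris.rs and declared with `mod iris;`.
use iris::*;
// Whether `new` takes the dataset by value or by reference depends on the generated code.
let df = CsvDataFrame::new(&dataset);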
4. Iris Dataset ETL Example
// Build a reader for the CSV file
let path = "iris.csv";
let file = std::fs::File::open(path)?;
let rdr = csv::ReaderBuilder::new()
    .has_headers(true)
    .from_reader(file);
// Build the CsvDataset from the reader and the null values
// (`null_values` is assumed to be defined; its exact type is set by lib.rs)
let dataset = CsvDataset::new(rdr, null_values);
// The iris.rs file is generated with the csv_deserializer binary.
// Inside iris.rs, CsvDataFrame is the main struct
// that contains all the data
let df = CsvDataFrame::new(&dataset);
// Do ETL work in a type-safe way. This sometimes comes at the cost of
// flexibility, so you can always fall back to CsvDataset, which
// uses CsvAny as the type for every cell
// The column wrapper, CsvColumn, can be destructured with if let
if let target = &df.target
&& let petal_length_cm = &df.petal_length_cm
// You can also iterate over a list of all the columns;
// make sure to use editor completion
// to fill in the match arms
for col in df.get_columns
More info
Name sanitization and Type Recognition: Categorical vs Numerical
Sanitization is achieved by converting any number or special character into a string that can be used in the generated code. The function that does this, sanitize_identifier, is in sanitizer.rs.
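To make the goal of sanitization concrete, here is a simplified sketch that turns a CSV header into a valid snake_case Rust field name. It is not the crate's sanitize_identifier: the function name sanitize_header, the underscore-collapsing rule, and the col_ prefix are all assumptions made for illustration, and the real function may map digits and special characters differently.

```rust
/// Turn an arbitrary CSV header into a snake_case Rust identifier.
/// Hypothetical stand-in, not the crate's `sanitize_identifier`.
fn sanitize_header(raw: &str) -> String {
    let mut out = String::new();
    for ch in raw.chars() {
        if ch.is_ascii_alphanumeric() {
            out.push(ch.to_ascii_lowercase());
        } else if !out.is_empty() && !out.ends_with('_') {
            // Collapse any run of spaces or special characters into one underscore.
            out.push('_');
        }
    }
    // Drop a trailing underscore left by characters such as ')'.
    while out.ends_with('_') {
        out.pop();
    }
    // An identifier cannot be empty or start with a digit, so prefix it.
    if out.is_empty() || out.starts_with(|c: char| c.is_ascii_digit()) {
        out.insert_str(0, "col_");
    }
    out
}
```

Under these assumed rules, the iris header petal length (cm) becomes petal_length_cm, which matches the field name used in the ETL example above.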
The library identifies types by attempting to parse each raw CSV value.
- Numerical: If a value parses as an i64, it is treated as an Int; if it parses as an f64, it is treated as a Float. For example, for sepal length (cm) in the iris dataset, the resulting type is:
// Also implement from string
- Categorical: Values that cannot be parsed as numbers are treated as Str. The generated Rust code for a column of string values looks like this (example for the iris dataset):
create_enum!;
The create_enum macro provides syntactic sugar for associating raw strings with the typed enum variants.
- Metadata: ColumnInfo tracks the count of these types and stores the unique variants to facilitate categorical enum generation.
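The recognition order described above (i64 first, then f64, then string) and the per-column bookkeeping can be sketched as follows. The names CellKind, classify, ColumnSummary, and summarize are illustrative assumptions; they are not the crate's CsvAny or ColumnInfo, whose fields and methods may differ.

```rust
use std::collections::BTreeSet;

/// Kind of value recognized in one raw CSV cell (names are illustrative).
#[derive(Debug, PartialEq)]
enum CellKind {
    Int(i64),
    Float(f64),
    Str(String),
}

/// Try `i64` first, then `f64`, and fall back to a string.
fn classify(raw: &str) -> CellKind {
    if let Ok(i) = raw.parse::<i64>() {
        CellKind::Int(i)
    } else if let Ok(f) = raw.parse::<f64>() {
        CellKind::Float(f)
    } else {
        CellKind::Str(raw.to_string())
    }
}

/// Per-column counters plus the unique string values (hypothetical layout).
#[derive(Debug, Default)]
struct ColumnSummary {
    ints: usize,
    floats: usize,
    strs: usize,
    unique_strs: BTreeSet<String>,
}

/// Classify every cell of a column and accumulate the counts.
fn summarize(cells: &[&str]) -> ColumnSummary {
    let mut summary = ColumnSummary::default();
    for cell in cells {
        match classify(cell) {
            CellKind::Int(_) => summary.ints += 1,
            CellKind::Float(_) => summary.floats += 1,
            CellKind::Str(s) => {
                summary.strs += 1;
                summary.unique_strs.insert(s);
            }
        }
    }
    summary
}
```

One consequence of parsing with f64 in Rust is that strings such as NaN or inf also parse as floats, so a column that uses them as missing-value markers would need them listed as null values before classification.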
Main structure of the generated code
This is the example for the iris dataset:
sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
Rust generated code:
Each enum used to represent a CSV value has a Null variant.
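The following sketch is not the actual generated code. It only illustrates, under stated assumptions, how a macro can pair raw strings with enum variants and always append a Null variant. The macro name string_enum, its syntax, the from_raw method, and the variant names are hypothetical; the crate's create_enum macro may look quite different. Mapping unknown strings to Null is also an assumption made to keep the example short.

```rust
/// Hypothetical macro: NOT the crate's `create_enum!`, whose syntax may differ.
macro_rules! string_enum {
    ($name:ident { $($variant:ident => $raw:literal),+ $(,)? }) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        enum $name {
            $($variant,)+
            /// Stands for missing or unrecognized cells.
            Null,
        }

        impl $name {
            /// Map a raw CSV string to its variant, or `Null` if unknown.
            fn from_raw(raw: &str) -> Self {
                match raw {
                    $($raw => $name::$variant,)+
                    _ => $name::Null,
                }
            }
        }
    };
}

// Illustrative use for a categorical column such as `target` in the iris dataset.
string_enum!(Target {
    IrisSetosa => "Iris-setosa",
    IrisVersicolor => "Iris-versicolor",
});
```

Calling Target::from_raw("Iris-setosa") yields Target::IrisSetosa, while a string that is not listed, such as "NA", yields Target::Null.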