dsq-formats

File format support for DSQ - handles reading and writing various data formats.

Overview

dsq-formats provides comprehensive support for reading and writing multiple structured data formats. It serves as the I/O layer for DSQ, converting between different file formats and DSQ's internal data representations.

Features

Multiple formats: CSV, JSON, JSON Lines, Parquet, Avro, Arrow IPC
Format detection: Automatic format detection based on file content
Streaming support: Efficient processing of large files
Schema inference: Automatic schema detection for structured data
Flexible options: Configurable parsing and writing options
Error handling: Detailed error messages for format issues

Installation

Add this to your Cargo.toml:

[dependencies]
dsq-formats = "0.1"

Enable specific formats:

[dependencies]
dsq-formats = { version = "0.1", features = ["csv", "json", "parquet"] }

Usage

Reading CSV Files

use dsq_formats::csv::read_csv_file;

fn main() {
    let df = read_csv_file("data.csv")
        .expect("Failed to read CSV");

    println!("Loaded {} rows", df.height());
}

Writing JSON

use dsq_formats::json::write_json_file;
use polars::prelude::*;

fn main() {
    let df = df! {
        "name" => ["Alice", "Bob"],
        "age" => [30, 25],
    }.unwrap();

    write_json_file(&df, "output.json")
        .expect("Failed to write JSON");
}

Reading Parquet

use dsq_formats::parquet::read_parquet_file;

fn main() {
    let df = read_parquet_file("data.parquet")
        .expect("Failed to read Parquet");

    println!("Columns: {:?}", df.get_column_names());
}

Format Detection

use dsq_formats::detect_format;

fn main() {
    let format = detect_format("data.csv")
        .expect("Failed to detect format");

    match format {
        Format::Csv => println!("CSV file detected"),
        Format::Json => println!("JSON file detected"),
        Format::Parquet => println!("Parquet file detected"),
        _ => println!("Other format"),
    }
}

Custom Options

use dsq_formats::csv::{read_csv_file_with_options, CsvReadOptions};

fn main() {
    let options = CsvReadOptions {
        has_header: true,
        delimiter: b';',
        quote_char: Some(b'"'),
        ..Default::default()
    };

    let df = read_csv_file_with_options("data.csv", &options)
        .expect("Failed to read CSV with options");
}

Supported Formats

CSV (Comma-Separated Values)

Read: Yes
Write: Yes
Features: Custom delimiters, headers, quotes, null values
Streaming: Yes

JSON

Read: Yes (standard JSON and JSON Lines)
Write: Yes
Features: Pretty printing, compact format
Streaming: Yes (JSON Lines)

JSON5

Read: Yes
Write: No
Features: Comments, trailing commas, unquoted keys
Streaming: No

Parquet

Read: Yes
Write: Yes
Features: Compression, column pruning, predicate pushdown
Streaming: Yes (with chunking)

Avro

Read: Yes
Write: Yes
Features: Schema evolution, compression
Streaming: Yes

Arrow IPC

Read: Yes
Write: Yes
Features: Zero-copy reads, compression
Streaming: Yes

Format Detection

The library can automatically detect file formats based on:

File extension
Magic bytes (file signature)
Content analysis

use dsq_formats::detect_format;

let format = detect_format("unknown.dat")?;

Configuration Options

Each format supports various configuration options:

CSV Options

delimiter: Field separator character
has_header: Whether first row contains headers
quote_char: Character for quoting fields
null_values: List of strings to interpret as NULL
skip_rows: Number of rows to skip
encoding: Character encoding

JSON Options

pretty: Pretty-print output
indent: Indentation level
null_handling: How to handle null values

Parquet Options

compression: Compression algorithm (snappy, gzip, lz4, zstd)
row_group_size: Rows per row group
statistics: Whether to compute column statistics

API Documentation

For detailed API documentation, see docs.rs/dsq-formats.

Performance

Format readers and writers are optimized for:

Large file handling with streaming
Memory-efficient processing
Parallel parsing where applicable
Zero-copy operations for compatible formats

Contributing

Contributions are welcome! To add support for new formats:

Create a new module for the format
Implement read/write functions
Add format detection logic
Include tests with sample data
Update documentation

See CONTRIBUTING.md for more details.

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.

dsq-formats 0.1.0

dsq-formats

Overview

Features

Installation

Usage

Reading CSV Files

Writing JSON

Reading Parquet

Format Detection

Custom Options

Supported Formats

CSV (Comma-Separated Values)

JSON

JSON5

Parquet

Avro

Arrow IPC

Format Detection

Configuration Options

CSV Options

JSON Options

Parquet Options

API Documentation

Performance

Contributing

License