dsq-formats 0.1.0

# dsq-formats

File format support for DSQ - handles reading and writing various data formats.

## Overview

`dsq-formats` reads and writes multiple structured data formats. It serves as the I/O layer for DSQ, converting between file formats and DSQ's internal data representation.

## Features

- **Multiple formats**: CSV, JSON, JSON Lines, JSON5 (read-only), Parquet, Avro, Arrow IPC
- **Format detection**: Automatic format detection based on file extension, magic bytes, and content analysis
- **Streaming support**: Efficient processing of large files
- **Schema inference**: Automatic schema detection for structured data
- **Flexible options**: Configurable parsing and writing options
- **Error handling**: Detailed error messages for format issues
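
To illustrate what schema inference involves, here is a minimal std-only sketch that picks the narrowest type able to hold every value in a column. The `ColumnType` enum and the `infer_value_type`, `unify`, and `infer_column_type` functions are hypothetical names for this example, not part of the `dsq-formats` API, and the crate's actual inference rules may differ.

```rust
/// Hypothetical column types for this sketch (not dsq-formats types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Null,
    Boolean,
    Integer,
    Float,
    Text,
}

/// Classify a single raw field, trying the narrowest types first.
fn infer_value_type(raw: &str) -> ColumnType {
    let s = raw.trim();
    if s.is_empty() {
        ColumnType::Null
    } else if s == "true" || s == "false" {
        ColumnType::Boolean
    } else if s.parse::<i64>().is_ok() {
        ColumnType::Integer
    } else if s.parse::<f64>().is_ok() {
        // Note: Rust's f64 parser also accepts "inf" and "NaN".
        ColumnType::Float
    } else {
        ColumnType::Text
    }
}

/// Merge two types into the narrowest type that can hold both.
fn unify(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Null, x) | (x, Null) => x,
        (Integer, Float) | (Float, Integer) => Float,
        _ => Text,
    }
}

/// Infer the type of a whole column from its raw string values.
fn infer_column_type(values: &[&str]) -> ColumnType {
    values
        .iter()
        .map(|v| infer_value_type(v))
        .fold(ColumnType::Null, unify)
}
```

For example, `infer_column_type(&["1", "2.5", ""])` widens to `ColumnType::Float`, because empty fields count as nulls and integers can be stored as floats.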

## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
dsq-formats = "0.1"
```

Enable specific formats:

```toml
[dependencies]
dsq-formats = { version = "0.1", features = ["csv", "json", "parquet"] }
```

## Usage

### Reading CSV Files

```rust
use dsq_formats::csv::read_csv_file;

fn main() {
    let df = read_csv_file("data.csv")
        .expect("Failed to read CSV");

    println!("Loaded {} rows", df.height());
}
```

### Writing JSON

```rust
use dsq_formats::json::write_json_file;
use polars::prelude::*;

fn main() {
    let df = df! {
        "name" => ["Alice", "Bob"],
        "age" => [30, 25],
    }.unwrap();

    write_json_file(&df, "output.json")
        .expect("Failed to write JSON");
}
```

### Reading Parquet

```rust
use dsq_formats::parquet::read_parquet_file;

fn main() {
    let df = read_parquet_file("data.parquet")
        .expect("Failed to read Parquet");

    println!("Columns: {:?}", df.get_column_names());
}
```

### Format Detection

```rust
// Assumption: `Format` is exported from the crate root next to `detect_format`.
use dsq_formats::{detect_format, Format};

fn main() {
    let format = detect_format("data.csv")
        .expect("Failed to detect format");

    match format {
        Format::Csv => println!("CSV file detected"),
        Format::Json => println!("JSON file detected"),
        Format::Parquet => println!("Parquet file detected"),
        _ => println!("Other format"),
    }
}
```

### Custom Options

```rust
use dsq_formats::csv::{read_csv_file_with_options, CsvReadOptions};

fn main() {
    let options = CsvReadOptions {
        has_header: true,
        delimiter: b';',
        quote_char: Some(b'"'),
        ..Default::default()
    };

    let df = read_csv_file_with_options("data.csv", &options)
        .expect("Failed to read CSV with options");

    println!("Loaded {} rows", df.height());
}
```

## Supported Formats

### CSV (Comma-Separated Values)

- **Read**: Yes
- **Write**: Yes
- **Features**: Custom delimiters, headers, quotes, null values
- **Streaming**: Yes

### JSON

- **Read**: Yes (standard JSON and JSON Lines)
- **Write**: Yes
- **Features**: Pretty printing, compact format
- **Streaming**: Yes (JSON Lines)

### JSON5

- **Read**: Yes
- **Write**: No
- **Features**: Comments, trailing commas, unquoted keys
- **Streaming**: No

### Parquet

- **Read**: Yes
- **Write**: Yes
- **Features**: Compression, column pruning, predicate pushdown
- **Streaming**: Yes (with chunking)

### Avro

- **Read**: Yes
- **Write**: Yes
- **Features**: Schema evolution, compression
- **Streaming**: Yes

### Arrow IPC

- **Read**: Yes
- **Write**: Yes
- **Features**: Zero-copy reads, compression
- **Streaming**: Yes

## Format Detection

The library can automatically detect file formats based on:

- File extension
- Magic bytes (file signature)
- Content analysis

```rust
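use dsq_formats::detect_format;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _format = detect_format("unknown.dat")?; // assumes the error type implements `std::error::Error`
    Ok(())
}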
```
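
The three signals listed above can be combined roughly as in the following std-only sketch. It is not the crate's implementation: the `DetectedFormat` enum, the helper functions, and the order in which the signals are tried (magic bytes, then extension, then content) are assumptions made for illustration. The signatures used are the well-known ones: `PAR1` for Parquet, `ARROW1` for Arrow IPC files, and `Obj` followed by byte `0x01` for Avro object container files.

```rust
use std::path::Path;

/// Hypothetical result type for this sketch (not the crate's `Format`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DetectedFormat {
    Csv,
    Json,
    JsonLines,
    Parquet,
    Avro,
    ArrowIpc,
    Unknown,
}

/// Check the leading bytes against known binary signatures.
fn detect_from_magic(head: &[u8]) -> Option<DetectedFormat> {
    use DetectedFormat::*;
    if head.starts_with(b"PAR1") {
        Some(Parquet)
    } else if head.starts_with(b"ARROW1") {
        Some(ArrowIpc)
    } else if head.starts_with(b"Obj\x01") {
        Some(Avro)
    } else {
        None
    }
}

/// Map a case-insensitive file extension to a format.
fn detect_from_extension(path: &str) -> Option<DetectedFormat> {
    use DetectedFormat::*;
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "csv" => Some(Csv),
        "json" => Some(Json),
        "jsonl" | "ndjson" => Some(JsonLines),
        "parquet" => Some(Parquet),
        "avro" => Some(Avro),
        "arrow" | "ipc" => Some(ArrowIpc),
        _ => None,
    }
}

/// Very rough content heuristic for text formats.
fn detect_from_content(head: &[u8]) -> Option<DetectedFormat> {
    let text = std::str::from_utf8(head).ok()?;
    let trimmed = text.trim_start();
    if trimmed.starts_with('{') || trimmed.starts_with('[') {
        Some(DetectedFormat::Json)
    } else if text.lines().next().map_or(false, |l| l.contains(',')) {
        Some(DetectedFormat::Csv)
    } else {
        None
    }
}

/// Try each signal in turn and fall back to `Unknown`.
fn detect(path: &str, head: &[u8]) -> DetectedFormat {
    detect_from_magic(head)
        .or_else(|| detect_from_extension(path))
        .or_else(|| detect_from_content(head))
        .unwrap_or(DetectedFormat::Unknown)
}
```

Here `head` stands for the first bytes read from the file. Trying magic bytes first means a Parquet file named `unknown.dat` is still recognized, while a text file with no helpful extension falls through to the content heuristic.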

## Configuration Options

Each format supports various configuration options:

### CSV Options

- `delimiter`: Field separator character
- `has_header`: Whether first row contains headers
- `quote_char`: Character for quoting fields
- `null_values`: List of strings to interpret as NULL
- `skip_rows`: Number of rows to skip
- `encoding`: Character encoding
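
To make the effect of `delimiter`, `quote_char`, and `null_values` concrete, the following std-only sketch splits a single record into fields. `LineOptions`, `parse_line`, and `finish_field` are hypothetical names for this example and do not reflect how `dsq-formats` parses CSV internally. The sketch assumes that a doubled quote inside a quoted field is a literal quote and that quoted fields are never treated as null.

```rust
/// Hypothetical options struct mirroring three of the option names above.
struct LineOptions<'a> {
    delimiter: u8,
    quote_char: Option<u8>,
    null_values: &'a [&'a str],
}

/// Split one CSV record into fields; `None` marks a null value.
fn parse_line(line: &str, opts: &LineOptions) -> Vec<Option<String>> {
    let delimiter = opts.delimiter as char;
    let quote = opts.quote_char.map(|q| q as char);
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut was_quoted = false;
    let mut chars = line.chars().peekable();

    while let Some(c) = chars.next() {
        if Some(c) == quote {
            if in_quotes && chars.peek() == Some(&c) {
                // A doubled quote inside a quoted field is a literal quote.
                current.push(c);
                chars.next();
            } else {
                in_quotes = !in_quotes;
                was_quoted = true;
            }
        } else if c == delimiter && !in_quotes {
            fields.push(finish_field(&mut current, was_quoted, opts));
            was_quoted = false;
        } else {
            current.push(c);
        }
    }
    fields.push(finish_field(&mut current, was_quoted, opts));
    fields
}

/// Take the accumulated text and map unquoted null markers to `None`.
fn finish_field(current: &mut String, was_quoted: bool, opts: &LineOptions) -> Option<String> {
    let value = std::mem::take(current);
    if !was_quoted && opts.null_values.contains(&value.as_str()) {
        None
    } else {
        Some(value)
    }
}
```

With `delimiter: b';'`, `quote_char: Some(b'"')`, and `null_values: &["NA", ""]`, the record `Alice;"Smith; Jr.";NA` yields two text fields followed by a null. The `has_header`, `skip_rows`, and `encoding` options apply to whole files and are outside the scope of this sketch.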

### JSON Options

- `pretty`: Pretty-print output
- `indent`: Indentation level
- `null_handling`: How to handle null values
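
As a rough illustration of how these options could shape the output, the sketch below serializes one flat record of optional integers. Everything in it is an assumption for the example: the `NullHandling` enum with `Include` and `Omit` variants is one plausible reading of `null_handling`, `write_object` is a hypothetical helper, and keys are assumed not to need escaping.

```rust
/// Hypothetical interpretation of `null_handling` for this sketch.
#[derive(Clone, Copy)]
enum NullHandling {
    /// Write missing values as JSON `null`.
    Include,
    /// Leave out fields whose value is missing.
    Omit,
}

/// Serialize a flat record; keys are assumed to need no escaping.
fn write_object(
    fields: &[(&str, Option<i64>)],
    pretty: bool,
    indent: usize,
    nulls: NullHandling,
) -> String {
    // Pretty output puts each field on its own indented line.
    let (newline, pad, colon) = if pretty {
        ("\n", " ".repeat(indent), ": ")
    } else {
        ("", String::new(), ":")
    };
    let entries: Vec<String> = fields
        .iter()
        .filter(|(_, v)| v.is_some() || matches!(nulls, NullHandling::Include))
        .map(|(k, v)| {
            let value = v.map_or_else(|| "null".to_string(), |n| n.to_string());
            format!("{pad}\"{k}\"{colon}{value}")
        })
        .collect();
    if entries.is_empty() {
        return "{}".to_string();
    }
    let separator = format!(",{newline}");
    format!("{{{newline}{}{newline}}}", entries.join(separator.as_str()))
}
```

Compact output keeps everything on one line, while `pretty` with `indent: 2` places each field on its own line, indented by two spaces.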

### Parquet Options

- `compression`: Compression algorithm (snappy, gzip, lz4, zstd)
- `row_group_size`: Rows per row group
- `statistics`: Whether to compute column statistics
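
The interplay between `row_group_size` and `statistics` is easier to see with a small model. The std-only sketch below splits one integer column into row groups, records min/max statistics per group, and then uses them to skip groups that cannot match a filter, which is the idea behind predicate pushdown. `RowGroupStats`, `row_group_stats`, and `groups_to_scan` are hypothetical names; this is a conceptual model, not how `dsq-formats` or any Parquet library is implemented.

```rust
/// Hypothetical per-row-group statistics for one integer column.
#[derive(Debug, PartialEq)]
struct RowGroupStats {
    start_row: usize,
    num_rows: usize,
    min: i64,
    max: i64,
}

/// Split a column into row groups and compute min/max for each group.
fn row_group_stats(values: &[i64], row_group_size: usize) -> Vec<RowGroupStats> {
    assert!(row_group_size > 0, "row_group_size must be positive");
    values
        .chunks(row_group_size)
        .enumerate()
        .map(|(i, chunk)| RowGroupStats {
            start_row: i * row_group_size,
            num_rows: chunk.len(),
            min: *chunk.iter().min().expect("chunks are never empty"),
            max: *chunk.iter().max().expect("chunks are never empty"),
        })
        .collect()
}

/// Indices of row groups that may contain a value `>= threshold`.
/// Groups whose maximum is below the threshold can be skipped unread.
fn groups_to_scan(stats: &[RowGroupStats], threshold: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max >= threshold)
        .map(|(i, _)| i)
        .collect()
}
```

Smaller row groups give finer-grained skipping at the cost of more metadata, which is the usual trade-off when choosing `row_group_size`.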

## API Documentation

For detailed API documentation, see [docs.rs/dsq-formats](https://docs.rs/dsq-formats).

## Performance

Format readers and writers are optimized for:

- Large file handling with streaming
- Memory-efficient processing
- Parallel parsing where applicable
- Zero-copy operations for compatible formats
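
To show what streaming means in practice for a line-oriented format such as JSON Lines, the following std-only sketch processes input one line at a time, so memory use depends on the longest line rather than on the file size. `summarize_lines` is a hypothetical helper for this example and is not part of the `dsq-formats` API.

```rust
use std::io::{self, BufRead};

/// Count non-empty records and track the longest line, reading one line
/// at a time instead of loading the whole input into memory.
fn summarize_lines<R: BufRead>(reader: R) -> io::Result<(usize, usize)> {
    let mut records: usize = 0;
    let mut longest: usize = 0;
    for line in reader.lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue; // skip blank lines between records
        }
        records += 1;
        longest = longest.max(line.len());
    }
    Ok((records, longest))
}
```

In a real program the reader would typically be `std::io::BufReader::new(std::fs::File::open(path)?)`; any `BufRead` implementation works, including an in-memory byte slice.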

## Contributing

Contributions are welcome! To add support for new formats:

1. Create a new module for the format
2. Implement read/write functions
3. Add format detection logic
4. Include tests with sample data
5. Update documentation

See [CONTRIBUTING.md](../../CONTRIBUTING.md) for more details.

## License

Licensed under either of:

- Apache License, Version 2.0 ([LICENSE-APACHE](../../LICENSE-APACHE))
- MIT license ([LICENSE-MIT](../../LICENSE-MIT))

at your option.