dsq-formats
File format support for DSQ - handles reading and writing various data formats.
Overview
dsq-formats reads and writes multiple structured data formats. It serves as the I/O layer for DSQ, converting between file formats and DSQ's internal data representations.
Features
- Multiple formats: CSV, JSON, JSON Lines, JSON5 (read-only), Parquet, Avro, Arrow IPC
- Format detection: Automatic format detection from file extension, magic bytes, and content analysis
- Streaming support: Efficient processing of large files
- Schema inference: Automatic schema detection for structured data
- Flexible options: Configurable parsing and writing options
- Error handling: Detailed error messages for format issues
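To make the schema inference feature concrete, here is a minimal sketch of how a column type could be inferred from raw text values. It uses only the standard library and is not dsq-formats' actual implementation: `InferredType`, `infer_value`, `unify`, and `infer_column` are hypothetical names, and the set of types and the widening rules (Int widens to Float, anything else mixed falls back to Text) are assumptions.

```rust
/// Column types this sketch can infer (a hypothetical set, not DSQ's own).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferredType {
    Null,
    Bool,
    Int,
    Float,
    Text,
}

/// Infer the type of a single raw field.
fn infer_value(raw: &str) -> InferredType {
    let v = raw.trim();
    if v.is_empty() {
        InferredType::Null
    } else if v.eq_ignore_ascii_case("true") || v.eq_ignore_ascii_case("false") {
        InferredType::Bool
    } else if v.parse::<i64>().is_ok() {
        InferredType::Int
    } else if v.parse::<f64>().is_ok() {
        InferredType::Float
    } else {
        InferredType::Text
    }
}

/// Combine two observations into the narrowest type that fits both.
fn unify(a: InferredType, b: InferredType) -> InferredType {
    match (a, b) {
        // Identical observations need no widening.
        (x, y) if x == y => x,
        // Nulls are compatible with every type.
        (InferredType::Null, x) | (x, InferredType::Null) => x,
        // Integers widen to floats.
        (InferredType::Int, InferredType::Float)
        | (InferredType::Float, InferredType::Int) => InferredType::Float,
        // Every other mix falls back to text.
        _ => InferredType::Text,
    }
}

/// Infer one column's type by folding over all of its values.
pub fn infer_column<'a, I>(values: I) -> InferredType
where
    I: IntoIterator<Item = &'a str>,
{
    values
        .into_iter()
        .map(infer_value)
        .fold(InferredType::Null, unify)
}
```

A real reader would typically sample only the first rows of a file rather than scanning every value, trading accuracy for speed.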
Installation
Add this to your Cargo.toml:
[dependencies]
dsq-formats = "0.1"
Enable specific formats:
[dependencies]
dsq-formats = { version = "0.1", features = ["csv", "json", "parquet"] }
Usage
Reading CSV Files
use read_csv_file;
Writing JSON
use write_json_file;
use *;
Reading Parquet
use read_parquet_file;
Format Detection
use detect_format;
Custom Options
use ;
Supported Formats
CSV (Comma-Separated Values)
- Read: Yes
- Write: Yes
- Features: Custom delimiters, headers, quotes, null values
- Streaming: Yes
JSON
- Read: Yes (standard JSON and JSON Lines)
- Write: Yes
- Features: Pretty printing, compact format
- Streaming: Yes (JSON Lines)
JSON5
- Read: Yes
- Write: No
- Features: Comments, trailing commas, unquoted keys
- Streaming: No
Parquet
- Read: Yes
- Write: Yes
- Features: Compression, column pruning, predicate pushdown
- Streaming: Yes (with chunking)
Avro
- Read: Yes
- Write: Yes
- Features: Schema evolution, compression
- Streaming: Yes
Arrow IPC
- Read: Yes
- Write: Yes
- Features: Zero-copy reads, compression
- Streaming: Yes
Format Detection
The library can automatically detect file formats based on:
- File extension
- Magic bytes (file signature)
- Content analysis
use detect_format;
let format = detect_format?;
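The three signals listed above can be combined in a single function. The sketch below is a standalone illustration using only the standard library, not the crate's `detect_format`: the `Format` enum, the helper names, `sniff_format`, and the extension list are all hypothetical. It checks magic bytes first because they are the most reliable signal, which is a design assumption rather than documented behavior. The signatures themselves are the well-known ones: `PAR1` for Parquet, `ARROW1` for the Arrow IPC file format, and `Obj` followed by the byte 1 for Avro object container files.

```rust
use std::path::Path;

/// Formats this sketch distinguishes (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Csv,
    Json,
    JsonLines,
    Json5,
    Parquet,
    Avro,
    ArrowIpc,
}

/// Binary formats announce themselves in their first bytes.
/// Note: the Arrow IPC *stream* format has no such signature.
fn from_magic(head: &[u8]) -> Option<Format> {
    if head.starts_with(b"PAR1") {
        Some(Format::Parquet)
    } else if head.starts_with(b"ARROW1") {
        Some(Format::ArrowIpc)
    } else if head.starts_with(b"Obj\x01") {
        Some(Format::Avro)
    } else {
        None
    }
}

/// Map a file extension to a format (the extension list is an assumption).
fn from_extension(path: &Path) -> Option<Format> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "csv" => Some(Format::Csv),
        "json" => Some(Format::Json),
        "jsonl" | "ndjson" => Some(Format::JsonLines),
        "json5" => Some(Format::Json5),
        "parquet" => Some(Format::Parquet),
        "avro" => Some(Format::Avro),
        "arrow" => Some(Format::ArrowIpc),
        _ => None,
    }
}

/// Last resort: a rough heuristic over the first lines of text.
fn from_content(head: &[u8]) -> Option<Format> {
    let text = String::from_utf8_lossy(head);
    let mut lines = text.lines().map(str::trim).filter(|l| !l.is_empty());
    let first = lines.next()?;
    if first.starts_with('{') {
        // Two consecutive lines that each open an object suggest JSON Lines.
        match lines.next() {
            Some(second) if second.starts_with('{') => Some(Format::JsonLines),
            _ => Some(Format::Json),
        }
    } else if first.starts_with('[') {
        Some(Format::Json)
    } else if first.contains(',') {
        Some(Format::Csv)
    } else {
        None
    }
}

/// Try magic bytes, then the extension, then content analysis.
/// `head` is the first few bytes of the file, read by the caller.
pub fn sniff_format(path: &Path, head: &[u8]) -> Option<Format> {
    from_magic(head)
        .or_else(|| from_extension(path))
        .or_else(|| from_content(head))
}
```

The content heuristic is deliberately crude: it cannot tell JSON5 from JSON, and a text file with commas in its first line is reported as CSV.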
Configuration Options
Each format supports various configuration options:
CSV Options
- delimiter: Field separator character
- has_header: Whether first row contains headers
- quote_char: Character for quoting fields
- null_values: List of strings to interpret as NULL
- skip_rows: Number of rows to skip
- encoding: Character encoding
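To show how these options interact, the following sketch models them as a plain struct and applies them in a small line-based parser. It is an illustration only: the `CsvOptions` struct here is a hypothetical stand-in that borrows the option names above, the defaults are assumptions, `encoding` is omitted (input is taken as UTF-8), and quoted fields that span multiple lines are not handled.

```rust
/// Hypothetical options struct mirroring the option names listed above.
#[derive(Debug, Clone)]
pub struct CsvOptions {
    pub delimiter: char,
    pub has_header: bool,
    pub quote_char: char,
    pub null_values: Vec<String>,
    pub skip_rows: usize,
}

impl Default for CsvOptions {
    fn default() -> Self {
        // These defaults are assumptions, not the crate's documented values.
        CsvOptions {
            delimiter: ',',
            has_header: true,
            quote_char: '"',
            null_values: vec![String::new()],
            skip_rows: 0,
        }
    }
}

/// Turn a finished field into `None` when it matches a null marker.
/// Quoted fields are never treated as NULL.
fn finish(value: String, was_quoted: bool, opts: &CsvOptions) -> Option<String> {
    if !was_quoted && opts.null_values.iter().any(|n| *n == value) {
        None
    } else {
        Some(value)
    }
}

/// Split one line into fields, honoring quotes and doubled-quote escapes.
fn split_line(line: &str, opts: &CsvOptions) -> Vec<Option<String>> {
    let mut fields = Vec::new();
    let mut field = String::new();
    let mut in_quotes = false;
    let mut was_quoted = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        if in_quotes {
            if c == opts.quote_char {
                if chars.peek() == Some(&opts.quote_char) {
                    // A doubled quote inside quotes is a literal quote.
                    field.push(c);
                    chars.next();
                } else {
                    in_quotes = false;
                }
            } else {
                field.push(c);
            }
        } else if c == opts.quote_char {
            in_quotes = true;
            was_quoted = true;
        } else if c == opts.delimiter {
            fields.push(finish(std::mem::take(&mut field), was_quoted, opts));
            was_quoted = false;
        } else {
            field.push(c);
        }
    }
    fields.push(finish(field, was_quoted, opts));
    fields
}

/// Parse text into an optional header row and data rows.
pub fn parse_csv(
    text: &str,
    opts: &CsvOptions,
) -> (Option<Vec<String>>, Vec<Vec<Option<String>>>) {
    // skip_rows is applied before the header is read.
    let mut lines = text.lines().skip(opts.skip_rows);
    let header = if opts.has_header {
        lines.next().map(|l| {
            split_line(l, opts)
                .into_iter()
                .map(|f| f.unwrap_or_default())
                .collect()
        })
    } else {
        None
    };
    let rows = lines.map(|l| split_line(l, opts)).collect();
    (header, rows)
}
```

With `delimiter: ';'`, `skip_rows: 1`, and `null_values: ["NA"]`, a file whose first line is a comment parses into a two-column header followed by rows in which `NA` becomes `None` while a quoted `"a;b"` stays a single field.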
JSON Options
- pretty: Pretty-print output
- indent: Indentation level
- null_handling: How to handle null values
Parquet Options
- compression: Compression algorithm (snappy, gzip, lz4, zstd)
- row_group_size: Rows per row group
- statistics: Whether to compute column statistics
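As an illustration of validating these options before a write, the sketch below parses the compression name into an enum and bundles the three options into a struct. Both `Compression` and `ParquetOptions` are hypothetical types defined here for the example, and the default values are assumptions rather than the crate's documented defaults.

```rust
use std::str::FromStr;

/// The four compression algorithms named above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Snappy,
    Gzip,
    Lz4,
    Zstd,
}

impl FromStr for Compression {
    type Err = String;

    /// Accept the algorithm name case-insensitively; reject anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "snappy" => Ok(Compression::Snappy),
            "gzip" => Ok(Compression::Gzip),
            "lz4" => Ok(Compression::Lz4),
            "zstd" => Ok(Compression::Zstd),
            other => Err(format!("unsupported compression: {}", other)),
        }
    }
}

/// Hypothetical options struct mirroring the option names listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParquetOptions {
    pub compression: Compression,
    pub row_group_size: usize,
    pub statistics: bool,
}

impl Default for ParquetOptions {
    fn default() -> Self {
        // Assumed defaults, chosen only for this example.
        ParquetOptions {
            compression: Compression::Snappy,
            row_group_size: 1024 * 1024,
            statistics: true,
        }
    }
}
```

Parsing the name up front means a typo such as `"brotli"` fails with a clear message before any data is written.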
API Documentation
For detailed API documentation, see docs.rs/dsq-formats.
Performance
Format readers and writers are optimized for:
- Large file handling with streaming
- Memory-efficient processing
- Parallel parsing where applicable
- Zero-copy operations for compatible formats
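The streaming idea behind the first two points can be sketched with the standard library alone: read line-oriented input (for example JSON Lines) through a buffered reader and hand records to a callback in fixed-size batches, so that only one batch is held in memory at a time. `for_each_batch` is a hypothetical helper written for this example, not part of dsq-formats.

```rust
use std::io::{self, BufRead};

/// Read non-empty lines from `reader` and pass them to `on_batch`
/// in groups of at most `batch_size`. Returns the number of lines seen.
pub fn for_each_batch<R, F>(reader: R, batch_size: usize, mut on_batch: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnMut(&[String]),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut total = 0;
    for line in reader.lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue; // blank lines carry no record
        }
        batch.push(line);
        if batch.len() == batch_size {
            on_batch(&batch);
            total += batch.len();
            batch.clear(); // reuse the allocation for the next batch
        }
    }
    // Flush the final, possibly partial, batch.
    if !batch.is_empty() {
        on_batch(&batch);
        total += batch.len();
    }
    Ok(total)
}
```

In practice the reader would be a `std::io::BufReader` wrapped around a `std::fs::File`, and the callback would parse each line and append it to an output buffer. Peak memory then depends on `batch_size` rather than on the file size.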
Contributing
Contributions are welcome! To add support for new formats:
- Create a new module for the format
- Implement read/write functions
- Add format detection logic
- Include tests with sample data
- Update documentation
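As a rough picture of what the first three steps might produce, here is a self-contained sketch of a format module for tab-separated values. Everything in it is hypothetical: the `FormatHandler` trait, the `Rows` type standing in for DSQ's internal representation, and the choice of TSV as the example format. The crate's real module layout and function signatures may differ, so treat CONTRIBUTING.md as the authority.

```rust
use std::io::{self, BufRead, Write};

/// Rows of string fields; a stand-in for DSQ's internal representation.
pub type Rows = Vec<Vec<String>>;

/// Hypothetical contract a format module might fulfil.
pub trait FormatHandler {
    fn name(&self) -> &'static str;
    /// Detection logic: decide from the first bytes whether this format applies.
    fn detect(&self, head: &[u8]) -> bool;
    fn read(&self, input: &mut dyn BufRead) -> io::Result<Rows>;
    fn write(&self, rows: &Rows, output: &mut dyn Write) -> io::Result<()>;
}

/// Example format: tab-separated values without quoting.
pub struct Tsv;

impl FormatHandler for Tsv {
    fn name(&self) -> &'static str {
        "tsv"
    }

    fn detect(&self, head: &[u8]) -> bool {
        // Treat the input as TSV when its first line contains a tab.
        head.split(|b| *b == b'\n')
            .next()
            .map_or(false, |line| line.contains(&b'\t'))
    }

    fn read(&self, input: &mut dyn BufRead) -> io::Result<Rows> {
        let mut rows = Rows::new();
        let mut line = String::new();
        // read_line returns 0 at end of input.
        while input.read_line(&mut line)? != 0 {
            let trimmed = line.trim_end_matches(|c| c == '\n' || c == '\r');
            rows.push(trimmed.split('\t').map(str::to_string).collect());
            line.clear();
        }
        Ok(rows)
    }

    fn write(&self, rows: &Rows, output: &mut dyn Write) -> io::Result<()> {
        for row in rows {
            writeln!(output, "{}", row.join("\t"))?;
        }
        Ok(())
    }
}
```

A round-trip test (read sample data, write it back, compare the bytes) is a natural starting point for the "tests with sample data" step.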
See CONTRIBUTING.md for more details.
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.