# scirs2-io
Scientific data input/output for the SciRS2 scientific computing library (v0.3.2).
scirs2-io provides comprehensive, high-performance file I/O for scientific and numerical workloads. It covers everything from classic scientific formats (MATLAB, NetCDF, HDF5, WAV) through modern columnar and streaming formats (Parquet, Arrow IPC, NDJSON, ORC) to cloud storage, ETL pipelines, data catalogs, and lineage tracking — all as pure Rust with no C/Fortran dependencies.
## Features (v0.3.2)

### Classic Scientific Formats
- MATLAB (.mat): Full v4/v5 read/write with all data types, structures, and cell arrays
- WAV: Professional-grade WAV audio read/write
- ARFF: Complete Weka Attribute-Relation File Format support
- NetCDF: NetCDF3 Classic and NetCDF4/HDF5 with unlimited dimensions and chunking
- HDF5-lite: Pure-Rust hierarchical data format reader; group hierarchies, attributes, datasets
- Matrix Market / Harwell-Boeing: Sparse and dense matrix exchange formats
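As a minimal sketch of reading one of these classic formats (the `read_mat` function and the shape of its return value are assumptions, not the confirmed API; see docs.rs/scirs2-io for exact signatures):

```rust
// Hypothetical usage sketch -- function name and return type are assumed.
use scirs2_io::matlab;

// Read all variables from a MATLAB v5 .mat file into a map of arrays.
let vars = matlab::read_mat("experiment.mat")?;
if let Some(array) = vars.get("pressure") {
    println!("pressure: {array:?}");
}
```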

### Modern Columnar and Binary Formats
- Parquet-lite: Pure-Rust Parquet reader for columnar analytical data
- Feather (Arrow IPC): Memory-mapped Arrow columnar format, zero-copy read/write
- ORC: ORC columnar format reader
- Binary format utilities: Portable binary encoding helpers

### Serialization Formats
- CBOR: Concise Binary Object Representation (RFC 7049) serialization and deserialization
- BSON: Binary JSON as used by MongoDB ecosystems
- Avro: Schema-based binary serialization with schema evolution support
- Protobuf-lite: Pure-Rust protocol buffer encoding and decoding (no C codegen dependency)
- MessagePack: Compact, fast binary serialization
- JSON / NDJSON: Standard JSON and Newline-Delimited JSON (streaming-friendly)

### Streaming and Lazy Evaluation
- Streaming CSV: Lazy, chunk-by-chunk CSV processing; handles files larger than RAM
- Streaming JSON: Incremental JSON parser for large document streams
- NDJSON streaming: Line-by-line newline-delimited JSON processing
- Arrow IPC streaming: Framed streaming Arrow record batches
- Backpressure-aware pipelines: Push-pull pipeline with configurable buffer sizes
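A backpressure-aware pipeline connects a source to a sink through bounded buffers, so a fast producer cannot outrun a slow consumer. A sketch of what this could look like (the builder methods, `buffer_size`, and the stage names here are all assumptions about the `pipeline` module, not its confirmed API):

```rust
// Hypothetical sketch -- builder methods and stage types are assumed.
use scirs2_io::pipeline::Pipeline;

let pipeline = Pipeline::builder()
    .buffer_size(1024)        // bounded buffer between stages = backpressure
    .source(csv_source)       // e.g. a streaming CSV source
    .transform(normalize)     // per-chunk transform stage
    .sink(parquet_sink)       // e.g. a Parquet writer sink
    .build()?;
pipeline.run()?;
```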

### Compression
- LZ4: High-speed lossless compression
- Zstd: Zstandard compression with configurable levels
- Brotli: General-purpose compression (HTTP-friendly)
- Snappy: Google Snappy for fast block compression
- GZIP / BZIP2: Classic deflate-based and BWT-based compression
- Parallel compression: Chunk-parallel encode/decode with up to 2.5x throughput improvement

### Data Catalog, Lineage, and Governance
- Data catalog: Register, tag, and discover datasets by name and schema
- Lineage tracking: Record transformations and data provenance for auditability
- Schema registry: Store and evolve data schemas; validate records on read/write
- Versioning support: Immutable dataset versioning with diff and rollback
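A sketch of how registering a dataset and recording its provenance might fit together (the `Catalog` and `LineageTracker` method names below are assumptions for illustration, not the confirmed API):

```rust
// Hypothetical sketch -- type and method names are assumed.
use scirs2_io::catalog::Catalog;
use scirs2_io::lineage::LineageTracker;

let mut catalog = Catalog::new();
let id = catalog.register("sea_surface_temp", "data/sst.parquet")?;
catalog.tag(&id, "oceanography")?;

let mut lineage = LineageTracker::new();
lineage.record_transformation(&id, "resampled daily -> monthly means")?;
```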

### ETL and Query
- ETL pipeline framework: Declarative source → transform → sink pipelines with parallel stages
- Query interface: SQL-like predicate pushdown and projection for tabular formats
- Universal reader: Automatic format detection from magic bytes and file extension
- Format detection: Detect dozens of formats without specifying them explicitly
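The universal reader checks magic bytes first and falls back to the file extension, so callers never need to name the format. A sketch of the idea (the `open` function and `format()` accessor are assumptions, not the confirmed API):

```rust
// Hypothetical sketch -- exact names are assumed.
use scirs2_io::universal_reader;

// Format is detected from magic bytes, then from the file extension.
let dataset = universal_reader::open("observations.bin")?;
println!("detected format: {:?}", dataset.format());
```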

### Cloud and Distributed I/O
- Cloud storage connectors: Framework for AWS S3, Google Cloud Storage, and Azure Blob Storage
- Distributed I/O: Partitioned parallel read/write for cluster workloads

### Validation and Integrity
- Checksums: CRC32, SHA-256, BLAKE3 integrity verification
- Schema-based validation: JSON Schema-compatible validation engine
- Format validators: Format-specific structural validators
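Verifying a file's integrity with one of the supported digests might look like this (the `checksum` helper and `ChecksumAlgorithm` enum are assumptions for illustration; consult the `validation` module docs for the real names):

```rust
// Hypothetical sketch -- helper names are assumed.
use scirs2_io::validation::{checksum, ChecksumAlgorithm};

let digest = checksum("data/sst.parquet", ChecksumAlgorithm::Blake3)?;
println!("BLAKE3: {digest}");
```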
## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
scirs2-io = "0.3.2"
```

To enable optional feature groups:

```toml
[dependencies]
scirs2-io = { version = "0.3.2", features = ["async", "compression"] }
```
### Reading a CSV file

```rust
use scirs2_io::csv::{read_csv, CsvReaderConfig};

// Configuration fields and the exact signature are illustrative.
let config = CsvReaderConfig::default();
let (headers, data) = read_csv("measurements.csv", Some(config))?;
println!("columns: {headers:?}");
println!("rows: {}", data.nrows());
```
### Streaming a large NDJSON file

```rust
use scirs2_io::ndjson_streaming::NdjsonReader;

let reader = NdjsonReader::open("events.ndjson")?;
for record in reader {
    // Records are parsed lazily, one line at a time.
    let record = record?;
    println!("{record:?}");
}
```
### CBOR serialization round-trip

```rust
use scirs2_io::formats::cbor::{decode_cbor, encode_cbor};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Measurement { sensor: String, value: f64 }

let m = Measurement { sensor: "t0".into(), value: 21.5 };
let bytes = encode_cbor(&m)?;
let decoded: Measurement = decode_cbor(&bytes)?;
assert_eq!(m, decoded);
```
### Streaming CSV with lazy evaluation

```rust
use scirs2_io::streaming_csv::StreamingCsvReader;

let mut reader = StreamingCsvReader::open("huge.csv")?;
reader.set_chunk_size(10_000); // rows per chunk
while let Some(chunk) = reader.next_chunk()? {
    // Only one chunk is resident in memory at a time.
    println!("read {} rows", chunk.len());
}
```
### Parallel Zstd compression

```rust
use scirs2_io::compression::{compress_parallel, ParallelConfig};
use std::fs;

let data = fs::read("large_array.bin")?;
// The algorithm argument and config fields are illustrative.
let cfg = ParallelConfig::default();
let compressed = compress_parallel(&data, "zstd", &cfg)?;
println!("{} -> {} bytes", data.len(), compressed.len());
```
## API Overview

| Module | Purpose |
|---|---|
| `matlab` | `.mat` file read/write |
| `wavfile` | WAV audio read/write |
| `netcdf` | NetCDF3/4 |
| `hdf5_lite` | Pure-Rust HDF5-lite |
| `csv` | CSV with type inference |
| `image` | PNG, JPEG, BMP, TIFF |
| `matrix_market` | Matrix Market / Harwell-Boeing |
| `formats::cbor` | CBOR encode/decode |
| `formats::bson` | BSON encode/decode |
| `formats::avro` | Avro schema-based serialization |
| `formats::protobuf` | Protobuf-lite encode/decode |
| `formats::msgpack` | MessagePack encode/decode |
| `formats::orc` | ORC reader |
| `formats::feather` | Feather (Arrow IPC) |
| `parquet_lite` | Parquet-lite reader |
| `streaming_csv` | Lazy streaming CSV |
| `streaming_json` | Streaming JSON parser |
| `ndjson_streaming` | NDJSON line-by-line streaming |
| `arrow_ipc` / `arrow_streaming` | Arrow IPC framed streaming |
| `binary_format` | Binary encoding utilities |
| `compression` / `compression_utils` | LZ4, Zstd, Brotli, Snappy, GZIP, BZIP2 |
| `universal_reader` | Auto-detect format and open |
| `format_detect` | Magic bytes / extension detection |
| `query` | SQL-like predicate and projection |
| `catalog` | Data catalog |
| `lineage` | Provenance and lineage tracking |
| `schema` | Schema registry |
| `versioning` | Dataset versioning |
| `etl` | ETL pipeline framework |
| `cloud` | Cloud storage connectors |
| `distributed` | Distributed / partitioned I/O |
| `pipeline` | Typed streaming pipeline (sources, transforms, sinks) |
| `validation` | Checksums, schema validation |
| `serialize` | Array and sparse-matrix serialization |
| `streaming` | Low-level streaming utilities |
## Feature Flags

| Feature | Description |
|---|---|
| `default` | CSV, compression, validation |
| `async` | Async I/O via tokio |
| `reqwest` | HTTP/HTTPS network I/O |
| `compression` | LZ4, Zstd, Brotli, Snappy (enabled by default) |
| `hdf5` | HDF5 system library binding (optional; hdf5-lite is always available) |
| `all` | All features |
## Documentation

Full API documentation is available at [docs.rs/scirs2-io](https://docs.rs/scirs2-io).
## License

Licensed under the Apache License 2.0. See LICENSE for details.