scirs2-io 0.1.0-rc.2

Input/Output utilities module for SciRS2 (scirs2-io)
Documentation

SciRS2 IO

crates.io License Documentation

Production-ready Input/Output module for the SciRS2 scientific computing library (v0.1.0-rc.2). Following the SciRS2 POLICY, this module provides comprehensive functionality for reading and writing various scientific and numerical data formats with high performance, reliability, and enhanced real-time streaming capabilities through scirs2-core abstractions.

Features

Core File Format Support

  • MATLAB Support: Complete .mat file format support with all data types
  • WAV File Support: Professional-grade WAV audio file processing
  • ARFF Support: Full Weka ARFF (Attribute-Relation File Format) implementation
  • CSV Support: Advanced CSV processing with flexible configuration and type inference
  • NetCDF Support: Complete NetCDF3 and NetCDF4/HDF5 integration with hierarchical data management
  • HDF5 Support: Comprehensive Hierarchical Data Format with groups, datasets, compression, and chunking
  • Matrix Market: High-performance sparse and dense matrix format support
  • Harwell-Boeing: Complete sparse matrix format implementation

Advanced Data Processing

  • Image Support: Professional image processing (PNG, JPEG, BMP, TIFF) with EXIF metadata
  • Data Serialization: Multi-format serialization (Binary, JSON, MessagePack) with metadata preservation
  • Data Compression: Production-grade compression with parallel processing support
    • Multiple algorithms: GZIP, Zstandard, LZ4, BZIP2
    • Up to 2.5x performance improvement with parallel processing
    • Configurable compression levels and threading
  • Data Validation: Enterprise-grade validation with comprehensive error reporting
    • Multiple checksum algorithms (CRC32, SHA-256, BLAKE3)
    • JSON Schema-compatible validation engine
    • Format-specific validators
  • Sparse Matrix Operations: Optimized sparse matrix handling (COO, CSR, CSC formats)

High-Performance Features

  • Parallel Processing: Multi-threaded operations with automatic optimization
  • Streaming Interfaces: Memory-efficient processing for large datasets
  • Async I/O: Non-blocking operations with tokio integration
  • Memory Mapping: Efficient handling of large arrays without memory overhead
  • Network I/O: HTTP/HTTPS client with progress tracking and retry logic
  • Cloud Integration: Framework for AWS S3, Google Cloud Storage, Azure Blob Storage

Production Quality

  • Zero Warnings: Clean compilation with comprehensive error handling
  • 114 Unit Tests: Extensive test coverage with edge case validation
  • Cross-Platform: Linux, macOS, Windows support
  • API Stability: Stable APIs with semantic versioning
  • Performance Benchmarks: Validated performance improvements

Installation

Add to your Cargo.toml:

[dependencies]
scirs2-io = "0.1.0-rc.2"

Enable specific features as needed:

[dependencies]
scirs2-io = { version = "0.1.0-rc.2", features = ["hdf5", "async", "compression"] }

Available Features

  • default: CSV, compression, and validation (recommended for most use cases)
  • hdf5: HDF5 file format support
  • async: Asynchronous I/O with tokio
  • reqwest: Network operations and HTTP client
  • all: All features enabled

Quick Start

Basic File Operations

use scirs2_io::{matlab, csv, image, compression};
use scirs2_core::error::CoreResult;
use ndarray::Array2;

// Read MATLAB file
let data = matlab::loadmat("data.mat")?;
let array = data.get_array::<f64>("matrix")?;

// Process CSV with automatic type detection
let (headers, data) = csv::read_csv_numeric("dataset.csv", None)?;
println!("Dataset shape: {:?}", data.shape());

// Handle images with metadata
let (image_data, metadata) = image::read_image("photo.jpg")?;
println!("Image: {}x{} pixels", metadata.width, metadata.height);

// High-performance compression
let compressed = compression::compress_data(&large_dataset, 
    compression::CompressionAlgorithm::Zstd, Some(6))?;

Advanced Parallel Processing

use scirs2_io::compression::{
    compress_data_parallel, ParallelCompressionConfig, CompressionAlgorithm
};

// Configure high-performance parallel compression
let config = ParallelCompressionConfig {
    num_threads: 8,
    chunk_size: 1024 * 1024,  // 1MB chunks
    buffer_size: 64 * 1024,   // 64KB buffer
    enable_memory_mapping: true,
};

// Process large dataset (10MB example)
let large_data = vec![0u8; 10_000_000];
let (compressed, stats) = compress_data_parallel(
    &large_data, 
    CompressionAlgorithm::Zstd, 
    Some(6), 
    config
)?;

println!("Compressed to {:.1}% in {:.2}ms", 
    100.0 * stats.bytes_output as f64 / stats.bytes_processed as f64,
    stats.operation_time_ms);
println!("Throughput: {:.2} MB/s", stats.throughput_bps / 1_000_000.0);

Schema-Based Data Validation

use scirs2_io::validation::{SchemaValidator, schema_helpers, SchemaConstraint};
use serde_json::json;

let validator = SchemaValidator::new();

// Define validation schema
let user_schema = schema_helpers::object([
    ("name", schema_helpers::string()
        .with_constraint(SchemaConstraint::MinLength(1))
        .required()),
    ("age", schema_helpers::integer()
        .with_constraint(SchemaConstraint::MinValue(0.0))
        .with_constraint(SchemaConstraint::MaxValue(150.0))
        .required()),
    ("email", schema_helpers::email().required()),
].into_iter().collect());

// Validate data
let user_data = json!({
    "name": "Alice Johnson",
    "age": 30,
    "email": "alice@example.com"
});

let result = validator.validate(&user_data, &user_schema);
if result.valid {
    println!("Data validation passed!");
} else {
    for error in &result.errors {
        println!("Validation error in {}: {}", error.path, error.message);
    }
}

Streaming Large Files

use scirs2_io::streaming::{StreamingConfig, process_file_chunked};

// Process large files efficiently
let config = StreamingConfig::default()
    .chunk_size(64 * 1024)
    .enable_progress_reporting(true);

let (result, stats) = process_file_chunked("large_dataset.bin", config, 
    |chunk_data, chunk_id| {
        // Process each chunk
        println!("Processing chunk {}: {} bytes", chunk_id, chunk_data.len());
        // Your processing logic here
        Ok(())
    })?;

println!("Processed {} chunks, {} total bytes", 
    stats.total_chunks, stats.total_bytes_processed);

API Reference

File Format Modules

MATLAB Files

use scirs2_io::matlab::{loadmat, savemat, MatFile, MatVar};

Scientific Data Formats

use scirs2_io::{
    netcdf::{NetCDFFile, NetCDFOptions, NetCDFFormat},
    hdf5::{HDF5File, CompressionOptions, DatasetOptions},
    matrix_market::{read_matrix_market, write_matrix_market},
};

Image Processing

use scirs2_io::image::{
    read_image, write_image, convert_image, get_grayscale,
    ImageFormat, ColorMode, ImageMetadata
};

Data Processing

Compression

use scirs2_io::compression::{
    // Basic compression
    compress_data, decompress_data,
    // Parallel processing
    compress_data_parallel, decompress_data_parallel,
    // Configuration
    CompressionAlgorithm, ParallelCompressionConfig,
    // Array-specific
    ndarray::{compress_array, decompress_array},
};

Validation

use scirs2_io::validation::{
    // Integrity checking
    calculate_checksum, verify_checksum,
    // Schema validation
    SchemaValidator, schema_helpers, SchemaConstraint,
    // Format validation
    formats::{validate_format, detect_file_format},
};

Serialization

use scirs2_io::serialize::{
    serialize_array, deserialize_array,
    serialize_sparse_matrix, deserialize_sparse_matrix,
    SerializationFormat, SparseMatrixCOO,
};

Performance Characteristics

  • Parallel Compression: Up to 2.5x faster than single-threaded operations
  • Memory Efficiency: Streaming interfaces for datasets larger than RAM
  • Network I/O: Optimized for scientific data transfer with retry logic
  • Zero-Copy: Memory mapping for large file operations
  • SIMD-Ready: Architecture prepared for vectorized operations

Format Support Details

MATLAB (.mat)

  • All MATLAB data types (double, single, integers, logical, char)
  • Multidimensional arrays, structures, and cell arrays
  • Both compressed and uncompressed formats
  • Full metadata preservation

NetCDF/HDF5

  • NetCDF3 Classic and NetCDF4/HDF5 formats
  • Unlimited dimensions and compression
  • Group hierarchies and attributes
  • Chunked storage for large datasets

Image Formats

  • PNG, JPEG, BMP, TIFF with full metadata
  • Color space handling (RGB, RGBA, Grayscale)
  • EXIF metadata extraction and manipulation
  • Format conversion with quality control

Sparse Matrices

  • COO, CSR, CSC format support
  • Matrix Market and Harwell-Boeing formats
  • Efficient format conversion with caching
  • Integration with numerical operations

System Requirements

  • Rust: Edition 2021, MSRV 1.70+
  • Platforms: Linux, macOS, Windows (64-bit)
  • Optional: HDF5 system library for HDF5 features
  • Memory: Configurable memory usage for large datasets

Production Deployment

This library is production-ready with:

  • Comprehensive testing: 114 unit tests with edge case coverage
  • Memory safety: Zero unsafe code in core paths
  • Error handling: Detailed error messages with recovery suggestions
  • Performance monitoring: Built-in statistics and benchmarking
  • Backwards compatibility: Semantic versioning with stable APIs

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

For production releases:

  • All features require comprehensive tests
  • Performance changes need benchmarks
  • API changes require documentation updates
  • Security considerations for all I/O operations

License

Licensed under either:

Choose the license that works best for your project.


Ready for Production: scirs2-io v0.1.0-rc.2 provides enterprise-grade I/O capabilities for scientific computing applications.