Velox: Lightweight Rust-Powered Data Processing & Analytics Library
Velox is a new Rust library designed for highly performant and extremely lightweight in-memory data processing and analytics. It prioritizes minimal dependencies, optimal memory footprint, and compile-time guarantees, making it an ideal choice for resource-constrained environments, high-performance computing, and applications where every byte and cycle counts.
Core Principles & Design Goals
- Extreme Lightweighting: Strives for zero or very few, carefully selected external crates. Focuses on minimal overhead and small binary size.
- Performance First: Leverages Rust's zero-cost abstractions, with potential for SIMD and parallelism. Data structures are optimized for cache efficiency.
- Safety & Reliability: Fully utilizes Rust's ownership and borrowing system to ensure memory safety and prevent common data manipulation errors. Unsafe code is minimized and thoroughly audited.
- Ergonomics & Idiomatic Rust API: Designed for a clean, discoverable, and user-friendly API that feels natural to Rust developers, supporting method chaining and strong static typing.
- Composability & Extensibility: Features a modular design, allowing components to be independent and easily combinable, and is built to be easily extendable.
Key Features
Core Data Structures
- DataFrame: A columnar data store supporting heterogeneous data types per column (i32, f64, bool, String). Efficient storage and handling of missing values.
- Series (or Column): A single-typed, named column of data within a DataFrame, providing type-specific operations.
Data Ingestion & Loading
- From
Vec<Vec<T>>/ Iterator: Basic in-memory construction from Rust native collections. - CSV Support: Minimalistic, highly efficient CSV parser for loading data.
- JSON Support: Efficient parsing for common JSON structures (planned).
- Custom Data Sources: Traits/interfaces for users to implement their own data loading mechanisms.
Data Cleaning & Preparation
drop_nulls(): Remove rows with any null values.fill_nulls(value): Fill nulls with a specified value (type-aware).interpolate_nulls(): Basic linear interpolation for numeric series.- Type Casting: Efficient conversion between compatible data types for Series (e.g., i32 to f64).
rename_column(old_name, new_name): Rename columns.
Data Transformation & Manipulation
- Selection:
select_columns(names),drop_columns(names). - Filtering: Predicate-based row selection using logical (
AND,OR,NOT) and comparison operators (==,!=,<,>). - Projection:
with_column(new_name, expression),apply()for user-defined functions. - Sorting: Sort DataFrame by one or more columns (ascending/descending).
- Joining: Basic inner, left, and right join operations on common keys.
- Concatenation/Append: Combine DataFrames vertically.
Aggregation & Reduction
- Simple Aggregations:
sum(),mean(),median(),min(),max(),count(),std_dev(). - Group By: Perform aggregations on groups defined by one or more columns.
- Unique Values:
unique()for a Series or DataFrame columns.
Basic Analytics & Statistics
describe(): Provides summary statistics for numeric columns (count, mean, std, min, max, quartiles).correlation(): Calculate Pearson correlation between two numeric Series.covariance(): Calculate covariance.
Output & Export
- To
Vec<Vec<T>>: Export DataFrame content back to standard Rust collections. - To CSV: Efficiently write DataFrame to a CSV file.
- Display/Pretty Print: User-friendly console output for DataFrame and Series.
Installation
Add the following to your Cargo.toml file:
[]
= "0.1.0" # Or the latest version
Usage Example
Here's a quick example demonstrating how to create a DataFrame, filter it, and perform a group-by aggregation:
use DataFrame;
use Series;
use ;
use Condition;
use HashMap;
Non-Functional Requirements
- Comprehensive Documentation: Extensive
///documentation for all public APIs, examples, and design choices. - Robust Testing: Thorough unit and integration tests covering all functionalities and edge cases.
- Performance Benchmarking: Includes benchmarks to track performance and memory usage, ensuring lightweight and high-performance goals are met.
- Cross-Platform Compatibility: Designed to work on common operating systems (Linux, macOS, Windows).
- Safety: Upholds Rust's safety guarantees, with minimal and heavily justified
unsafecode.
Future Considerations / Roadmap
- Streaming Data: Support for processing data in a streaming fashion.
- Time-Series Functionality: Basic time-series resampling, rolling windows.
- FFI (Foreign Function Interface): Consider C API for integration with other languages (Python, JavaScript).
- Simple Plotting Integration: Provide hooks or basic data preparation for common plotting libraries.
- Persistence: Basic serialization/deserialization formats (e.g., custom binary format, Parquet subset).
Current Development Status
All compilation and test errors have been resolved, except for JSON parsing. We are currently investigating issues with the microjson crate for JSON parsing and may explore alternative solutions.
License
This project is licensed under the MIT License - see the LICENSE file for details.