# PandRS

A DataFrame library for data analysis implemented in Rust. Its features and design are inspired by Python's pandas library, combining fast data processing with type safety.
## Key Features
- Efficient data processing with high-performance column-oriented storage
- Low memory footprint with categorical data and string pool optimization
- Multi-core utilization through parallel processing
- GPU acceleration with CUDA integration (up to 20x speedup)
- Optimization with lazy evaluation system
- Thread-safe implementation
- Robustness leveraging Rust's type safety and ownership system
- Modularized design (implementation divided by functionality)
- Python integration (PyO3 bindings)
- WebAssembly support for browser-based visualization
## Features
- Series (1-dimensional array) and DataFrame (2-dimensional table) data structures
- Support for missing values (NA)
- Grouping and aggregation operations
- Row labels with indexes
- Multi-level indexes (hierarchical indexes)
- CSV/JSON reading and writing
- Parquet data format support
- Basic operations (filtering, sorting, joining, etc.)
- Aggregation functions for numeric data
- Special operations for string data
- Basic time series data processing
- Categorical data types (efficient memory use, ordered categories)
- Pivot tables
- Visualization with text-based and high-quality graphs
- Parallel processing support
- Statistical analysis functions (descriptive statistics, t-tests, regression analysis, etc.)
- Specialized categorical data statistics (contingency tables, chi-square tests, etc.)
- Machine learning evaluation metrics (MSE, R², accuracy, F1, etc.)
- Optimized implementation (column-oriented storage, lazy evaluation, string pool)
- High-performance modular implementation (each functional area split into focused submodules)
- GPU acceleration for matrix operations, statistical computation, and ML algorithms
- Disk-based processing for very large datasets
- Streaming data support with real-time analytics
- Distributed processing for datasets exceeding single-machine capacity (experimental)
- SQL-like query interface with DataFusion
- Window function support for analytics
- Advanced expression system for complex transformations
- User-defined function support
## Usage Examples

### Creating and Basic Operations with DataFrames
```rust
use pandrs::{DataFrame, Series};

// Create series (example values are illustrative)
let ages = Series::new(vec![30, 25, 40], Some("age".to_string()))?;
let heights = Series::new(vec![180, 175, 182], Some("height".to_string()))?;

// Add series to DataFrame
let mut df = DataFrame::new();
df.add_column("age".to_string(), ages)?;
df.add_column("height".to_string(), heights)?;

// Save as CSV
df.to_csv("data.csv")?;

// Load DataFrame from CSV (second argument: file has a header row)
let df_from_csv = DataFrame::from_csv("data.csv", true)?;
```
### Numeric Operations
```rust
// Create numeric series (values chosen so the results below hold)
let numbers = Series::new(vec![10, 20, 30, 40, 50], Some("values".to_string()))?;

// Statistical calculations
let sum = numbers.sum();    // 150
let mean = numbers.mean()?; // 30
let min = numbers.min()?;   // 10
let max = numbers.max()?;   // 50
```
## Installation

Add the following to your `Cargo.toml`:
```toml
[dependencies]
pandrs = "0.1.0-alpha.2"
```
For GPU acceleration, add the CUDA feature flag:
```toml
[dependencies]
pandrs = { version = "0.1.0-alpha.2", features = ["cuda"] }
```
For distributed processing capabilities, add the distributed feature:
```toml
[dependencies]
pandrs = { version = "0.1.0-alpha.2", features = ["distributed"] }
```
Multiple features can be combined:
```toml
[dependencies]
pandrs = { version = "0.1.0-alpha.2", features = ["cuda", "distributed", "wasm"] }
```
## Working with Missing Values (NA)
```rust
use pandrs::NASeries;

// Create series with NA values (values are illustrative)
let data = vec![Some(10), None, Some(30), None, Some(50)];
let series = NASeries::new(data, Some("values".to_string()))?;

// Handle NA values (accessor names are illustrative)
println!("NA count: {}", series.na_count());
println!("Value count: {}", series.value_count());

// Drop and fill NA values
let dropped = series.dropna()?;
let filled = series.fillna(0)?;
```
## Group Operations
```rust
use pandrs::{GroupBy, Series};

// Data and group keys (values are illustrative)
let values = Series::new(vec![10, 20, 15, 30, 25], Some("values".to_string()))?;
let keys = vec!["A", "B", "A", "C", "B"];

// Group and aggregate (constructor signature is illustrative)
let group_by = GroupBy::new(keys, &values, Some("by_key".to_string()))?;

// Aggregation results
let sums = group_by.sum()?;
let means = group_by.mean()?;
```
## Time Series Operations
```rust
use pandrs::temporal::{date_range, TimeSeries};
use chrono::NaiveDate;

// NOTE: function signatures below are illustrative
// Generate date range
let dates = date_range(
    NaiveDate::from_ymd_opt(2023, 1, 1).unwrap(),
    NaiveDate::from_ymd_opt(2023, 1, 31).unwrap(),
    "DAILY",
    true,
)?;

// Create time series data (one value per date)
let values: Vec<f64> = (0..dates.len()).map(|i| i as f64).collect();
let time_series = TimeSeries::new(values, dates, Some("daily_data".to_string()))?;

// Time filtering
let filtered = time_series.filter_by_time(
    &NaiveDate::from_ymd_opt(2023, 1, 10).unwrap(),
    &NaiveDate::from_ymd_opt(2023, 1, 20).unwrap(),
)?;

// Calculate moving average
let moving_avg = time_series.rolling_mean(7)?;

// Resampling (convert to weekly)
let weekly = time_series.resample("WEEKLY").mean()?;
```
## Distributed Processing (Experimental)

### Using the DataFrame-Style API
```rust
use pandrs::DataFrame;
use pandrs::distributed::DistributedConfig;

// Create a local DataFrame (the `regions`, `sales`, and `dates` series
// are assumed to be built beforehand; column names are illustrative)
let mut df = DataFrame::new();
df.add_column("region".to_string(), regions)?;
df.add_column("sales".to_string(), sales)?;
df.add_column("date".to_string(), dates)?;

// Configure distributed processing
let config = DistributedConfig::new()
    .with_executor("datafusion") // Use DataFusion engine
    .with_concurrency(4);        // Use 4 threads

// Convert to distributed DataFrame
let dist_df = df.to_distributed(config)?;

// Define distributed operations (lazy execution)
let result = dist_df
    .filter("sales > 1000")?
    .groupby(&["region"])?
    .aggregate(&["sales"], &["sum"])?;

// Execute operations and get performance metrics
let executed = result.execute()?;
if let Some(metrics) = executed.execution_metrics() {
    // Check execution summary
    println!("{:?}", metrics);
}

// Collect results back to local DataFrame
let final_df = executed.collect_to_local()?;
println!("{:?}", final_df);

// Write results directly to Parquet
executed.write_parquet("sales_summary.parquet")?;
```
### Using the SQL-Style API with DistributedContext
```rust
use pandrs::distributed::{DistributedConfig, DistributedContext};

// Create a distributed context
let mut context = DistributedContext::new(DistributedConfig::new())?;

// Register multiple DataFrames with the context
let customers = DataFrame::new(); // Create customers DataFrame
let orders = DataFrame::new();    // Create orders DataFrame
context.register_dataframe("customers", &customers)?;
context.register_dataframe("orders", &orders)?;

// Also register CSV or Parquet files directly (paths are illustrative)
context.register_csv("products", "products.csv")?;
context.register_parquet("sales", "sales.parquet")?;

// Execute SQL queries against registered datasets
let result = context.sql_to_dataframe(
    "SELECT c.name, SUM(o.amount) AS total \
     FROM customers c JOIN orders o ON c.id = o.customer_id \
     GROUP BY c.name",
)?;
println!("{:?}", result);

// Execute query and write directly to Parquet
let metrics = context.sql_to_parquet(
    "SELECT * FROM sales WHERE amount > 100",
    "high_value_sales.parquet",
)?;
println!("{:?}", metrics);
```
### Using Window Functions for Analytics
```rust
use pandrs::DataFrame;
use pandrs::distributed::{DistributedConfig, DistributedContext};
use pandrs::distributed::window as distributed_window; // Access window functions

// NOTE: the window-function helpers below are illustrative
// Create a DataFrame with time series data
let mut df = DataFrame::new();
// ... add columns with time series data ...

// Convert to distributed DataFrame
let dist_df = df.to_distributed(DistributedConfig::new())?;

// Add ranking by sales within each region
let ranked = dist_df.window(distributed_window::rank("sales", &["region"], "sales_rank"))?;

// Calculate running total of sales by date
let running_total =
    dist_df.window(distributed_window::cumulative_sum("sales", &["date"], "running_total"))?;

// Calculate 7-day moving average
let moving_avg = dist_df.window(distributed_window::rolling_mean("sales", 7, "moving_avg_7d"))?;

// Or use SQL directly
let mut context = DistributedContext::new(DistributedConfig::new())?;
context.register_dataframe("sales", &df)?;
let result = context.sql_to_dataframe(
    "SELECT date, sales, \
            AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d \
     FROM sales",
)?;
```
### Using Expressions for Complex Data Transformations
```rust
use pandrs::distributed::expr::Udf;

// NOTE: expression strings and the UDF API below are illustrative
// Create a DataFrame with sales data
let mut df = DataFrame::new();
// ... add columns with sales data ...

// Convert to distributed DataFrame
let dist_df = df.to_distributed(DistributedConfig::new())?;

// Select columns with complex expressions
let selected = dist_df.select_expr(&["region", "revenue - cost AS profit"])?;

// Filter using complex expressions
let high_margin = dist_df
    .filter_expr("(revenue - cost) / revenue > 0.3")?;

// Create a user-defined function
let commission_udf = Udf::new("commission", |revenue: f64| revenue * 0.05);

// Register and use the UDF
let with_commission = dist_df
    .create_udf(commission_udf)?
    .select_expr(&["region", "revenue", "commission(revenue) AS commission"])?;

// Chain operations together
let final_analysis = dist_df
    // Add calculated columns
    .with_column("profit", "revenue - cost")?
    // Filter for high-profit regions
    .filter_expr("profit > 100000")?
    // Project final columns with calculations
    .select_expr(&["region", "profit", "profit / revenue AS margin"])?;
```
## Statistical Analysis and Machine Learning Evaluation Functions
```rust
use pandrs::DataFrame;
use pandrs::stats::{correlation, describe, linear_regression, ttest};
use pandrs::ml::metrics::{accuracy_score, f1_score, mean_squared_error, r2_score};

// NOTE: module paths, argument lists, and result fields are illustrative
// Descriptive statistics
let data = vec![1.2, 2.5, 3.1, 4.8, 5.3];
let stats_summary = describe(&data)?;
println!("Mean: {}", stats_summary.mean);
println!("Std dev: {}", stats_summary.std);

// Calculate correlation coefficient
let x = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let y = vec![2.1, 3.9, 6.2, 8.0, 9.8];
let correlation = correlation(&x, &y)?;
println!("Correlation: {}", correlation);

// Run t-test
let sample1 = vec![5.2, 4.8, 5.1, 5.0, 4.9];
let sample2 = vec![4.8, 4.9, 5.0, 4.7, 4.6];
let alpha = 0.05; // significance level
let result = ttest(&sample1, &sample2, alpha)?;
println!("t-statistic: {}", result.statistic);
println!("Significant: {}", result.significant);

// Regression analysis (predictor and target series built beforehand)
let mut df = DataFrame::new();
df.add_column("x1".to_string(), x1_series)?;
df.add_column("x2".to_string(), x2_series)?;
df.add_column("y".to_string(), y_series)?;
let model = linear_regression(&df, "y", &["x1", "x2"])?;
println!("Coefficients: {:?}", model.coefficients);
println!("R²: {}", model.r_squared);

// Machine learning model evaluation - regression metrics
let y_true = vec![3.0, 5.0, 2.5, 7.0];
let y_pred = vec![2.8, 4.8, 2.9, 7.2];
let mse = mean_squared_error(&y_true, &y_pred)?;
let r2 = r2_score(&y_true, &y_pred)?;
println!("MSE: {}, R²: {}", mse, r2);

// Machine learning model evaluation - classification metrics
let true_labels = vec![true, false, true, true, false];
let pred_labels = vec![true, false, false, true, false];
let accuracy = accuracy_score(&true_labels, &pred_labels)?;
let f1 = f1_score(&true_labels, &pred_labels)?;
println!("Accuracy: {}, F1: {}", accuracy, f1);
```
## Pivot Tables and Grouping
```rust
use pandrs::pivot::AggFunction;

// Grouping and aggregation (import path and column names are illustrative)
let grouped = df.groupby("category")?;
let category_sum = grouped.sum("sales")?;

// Pivot table
let pivot_result = df.pivot_table("category", "month", "sales", AggFunction::Sum)?;
```
## Categorical Data Analysis
```rust
use pandrs::DataFrame;
use pandrs::stats::categorical::{
    categorical_anova_from_df, chi_square_independence, contingency_table_from_df,
    cramers_v_from_df,
};

// Create categorical data (column names, values, and result fields are illustrative)
let mut df = DataFrame::new();
df.add_column("gender".to_string(), gender_series)?;
df.add_column("product".to_string(), product_series)?;

// Convert columns to categorical
df.convert_to_categorical("gender")?;
df.convert_to_categorical("product")?;

// Create contingency table
let contingency = contingency_table_from_df(&df, "gender", "product")?;
println!("{:?}", contingency);

// Chi-square test for independence
let chi2_result = chi_square_independence(&df, "gender", "product", 0.05)?;
println!("Chi-square statistic: {}", chi2_result.statistic);
println!("p-value: {}", chi2_result.p_value);
println!("Significant: {}", chi2_result.significant);

// Measure of association
let cramers_v = cramers_v_from_df(&df, "gender", "product")?;
println!("Cramér's V: {}", cramers_v);

// Test association between categorical and numeric variables
df.add_column("purchase_amount".to_string(), amount_series)?;
let anova_result = categorical_anova_from_df(&df, "product", "purchase_amount", 0.05)?;
println!("F-statistic: {}", anova_result.f_statistic);
println!("p-value: {}", anova_result.p_value);
```
## Development Plan and Implementation Status
- Basic DataFrame structure
- Series implementation
- Index functionality
- CSV input/output
- JSON input/output
- Parquet format support
- Missing value handling
- Grouping operations
- Time series data support
  - Date range generation
  - Time filtering
  - Moving average calculation
  - Frequency conversion (resampling)
- Pivot tables
- Complete implementation of join operations
  - Inner join (matching rows only)
  - Left join (all left rows kept)
  - Right join (all right rows kept)
  - Outer join (all rows kept)
- Visualization functionality integration
  - Line graphs
  - Scatter plots
  - Text plot output
- Parallel processing support
  - Parallel conversion of Series/NASeries
  - Parallel processing of DataFrames
  - Parallel filtering (1.15x speedup)
  - Parallel aggregation (3.91x speedup)
  - Parallel computation processing (1.37x speedup)
  - Adaptive parallel processing (automatic selection based on data size)
- Enhanced visualization
  - Text-based plots with textplots (line, scatter)
  - High-quality graph output with plotters (PNG, SVG formats)
  - Various graph types (line, scatter, bar, histogram, area)
  - Graph customization options (size, color, grid, legend)
  - Intuitive plot API for Series and DataFrame
- Multi-level indexes
  - Hierarchical index structure
  - Data grouping by multiple levels
  - Level operations (swap, select)
- Categorical data types
  - Memory-efficient encoding
  - Support for ordered and unordered categories
  - Complete integration with NA values (missing values)
- Advanced DataFrame operations
  - Long-form and wide-form conversion (melt, stack, unstack)
  - Conditional aggregation
  - DataFrame concatenation
- Memory usage optimization
  - String pool optimization (up to 89.8% memory reduction)
  - Categorical encoding (2.59x performance improvement)
  - Global string pool implementation
  - Improved memory locality with column-oriented storage
- Python bindings
  - Python module with PyO3
  - Interoperability with numpy and pandas
  - Jupyter Notebook support
  - Speedup with string pool optimization (up to 3.33x)
- Distributed processing enhancements
  - SQL-like API with DistributedContext
  - Window function support
  - Expression system for complex transformations
  - User-defined function registration and use
- Lazy evaluation system
  - Operation optimization with computation graph
  - Operation fusion
  - Avoiding unnecessary intermediate results
- Statistical analysis features
  - Descriptive statistics (mean, standard deviation, quantiles, etc.)
  - Correlation coefficient and covariance
  - Hypothesis testing (t-test)
  - Regression analysis (simple and multiple regression)
  - Sampling methods (bootstrap, etc.)
- Machine learning evaluation metrics
  - Regression evaluation (MSE, MAE, RMSE, R² score)
  - Classification evaluation (accuracy, precision, recall, F1 score)
- Codebase maintainability improvements
  - File separation of OptimizedDataFrame by functionality
  - API compatibility maintained through re-exports
  - Independent implementation of ML metrics module
- GPU acceleration
  - CUDA integration for numerical operations
  - GPU-accelerated matrix operations
  - GPU-accelerated statistical functions
  - GPU-accelerated machine learning algorithms
  - Comprehensive benchmarking utility
  - Python bindings for GPU acceleration
## Multi-level Index Operations
```rust
use pandrs::{DataFrame, MultiIndex, Series};

// Create MultiIndex from tuples (values are illustrative)
let tuples = vec![
    vec!["A".to_string(), "1".to_string()],
    vec!["A".to_string(), "2".to_string()],
    vec!["B".to_string(), "1".to_string()],
];

// Set level names
let names = Some(vec![Some("letter".to_string()), Some("number".to_string())]);
let multi_idx = MultiIndex::from_tuples(tuples, names)?;

// Create DataFrame with MultiIndex
let mut df = DataFrame::with_multi_index(multi_idx.clone());

// Add data
let data = vec!["a".to_string(), "b".to_string(), "c".to_string()];
df.add_column("data".to_string(), Series::new(data, Some("data".to_string()))?)?;

// Level operations
let level0_values = multi_idx.get_level_values(0)?;
let level1_values = multi_idx.get_level_values(1)?;

// Swap levels
let swapped_idx = multi_idx.swaplevel(0, 1)?;
```
## GPU Acceleration
```rust
use pandrs::gpu::{init_gpu, GpuBenchmark, GpuMatrix};
use pandrs::gpu::DataFrameGpuExt;
use ndarray::Array2;

// NOTE: module paths and signatures below are illustrative
// Initialize GPU
let device_status = init_gpu()?;
println!("GPU available: {}", device_status.available);

// Create matrices for GPU operations (shapes and values are illustrative)
let a_data = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0])?;
let b_data = Array2::from_shape_vec((2, 2), vec![5.0, 6.0, 7.0, 8.0])?;

// Create GPU matrices
let a = GpuMatrix::new(a_data);
let b = GpuMatrix::new(b_data);

// Perform GPU-accelerated matrix multiplication
let result = a.dot(&b)?;

// GPU-accelerated DataFrame operations
let mut df = DataFrame::new();
// Add data to DataFrame...

// Using GPU-accelerated correlation matrix
let corr_matrix = df.gpu_corr(&["x1", "x2", "x3"])?;

// GPU-accelerated linear regression
let model = df.gpu_linear_regression("y", &["x1", "x2"])?;

// GPU-accelerated PCA
let (components, explained_variance) = df.gpu_pca(&["x1", "x2", "x3"], 2)?;

// Benchmarking CPU vs GPU performance
let mut benchmark = GpuBenchmark::new()?;
let matrix_multiply_result = benchmark.benchmark_matrix_multiply(1000, 1000)?;
println!("{:?}", matrix_multiply_result);
```
## Python Binding Usage Examples
```python
import pandrs as pr

# NOTE: the Python API names below are illustrative sketches of the bindings

# Create optimized DataFrame
df = pr.OptimizedDataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

# Traditional API compatible interface
df_compat = pr.DataFrame({'A': [1, 2, 3]})

# Interoperability with pandas
pd_df = df.to_pandas()                            # Convert from PandRS to pandas DataFrame
pr_df = pr.OptimizedDataFrame.from_pandas(pd_df)  # Convert from pandas DataFrame to PandRS

# Using lazy evaluation
lazy = df.lazy()
result = lazy.filter('A > 1').execute()

# Direct use of string pool
pool = pr.StringPool()
idx1 = pool.add("repeated_value")
idx2 = pool.add("repeated_value")  # Returns the same index
value = pool.get(idx1)             # Returns "repeated_value"

# GPU acceleration in Python
# Initialize GPU
device_status = pr.init_gpu()
corr_matrix = df.gpu_corr(['A', 'B'])
model = df.gpu_linear_regression('B', ['A'])

# CSV input/output
df = pr.OptimizedDataFrame.read_csv('data.csv')

# NumPy integration
np_array = df['A'].to_numpy()
series = pr.Series.from_numpy(np_array)

# Jupyter Notebook support
```
## Performance Optimization Results
The implementation of optimized column-oriented storage, lazy evaluation system, and GPU acceleration has achieved significant performance improvements:
### Performance Comparison of Key Operations
| Operation | Traditional Implementation | Optimized Implementation | Speedup |
|---|---|---|---|
| Series/Column Creation | 198.446ms | 149.528ms | 1.33x |
| DataFrame Creation (1 million rows) | NA | NA | NA |
| Filtering | 596.146ms | 161.816ms | 3.68x |
| Group Aggregation | 544.384ms | 107.837ms | 5.05x |
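Much of this gain comes from the lazy evaluation system: operations are recorded in a computation graph and fused into a single pass instead of materializing intermediate results. The sketch below illustrates the fusion idea with plain iterators; it is a hypothetical illustration of the technique, not PandRS's internal machinery:

```rust
/// Eager style: each step allocates an intermediate vector.
fn eager(data: &[i64]) -> i64 {
    let filtered: Vec<i64> = data.iter().copied().filter(|&x| x % 2 == 0).collect();
    let mapped: Vec<i64> = filtered.iter().map(|&x| x * 3).collect();
    mapped.iter().sum()
}

/// Fused style: filter, map, and sum run in one pass with no
/// intermediate allocations -- the effect a lazy computation
/// graph achieves after operation fusion.
fn fused(data: &[i64]) -> i64 {
    data.iter().copied().filter(|&x| x % 2 == 0).map(|x| x * 3).sum()
}

fn main() {
    let data: Vec<i64> = (0..1_000_000).collect();
    assert_eq!(eager(&data), fused(&data)); // same result, fewer passes and allocations
}
```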
### GPU Acceleration Performance (vs CPU)
| Operation | Data Size | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| Matrix Multiplication | 1000x1000 | 232.8 ms | 11.5 ms | 20.2x |
| Element-wise Addition | 2000x2000 | 18.6 ms | 2.3 ms | 8.1x |
| Correlation Matrix | 10000x10 | 89.4 ms | 12.1 ms | 7.4x |
| Linear Regression | 10000x10 | 124.3 ms | 18.7 ms | 6.6x |
| Rolling Window | 100000, window=100 | 58.2 ms | 9.8 ms | 5.9x |
### String Processing Optimization
| Mode | Processing Time | vs Traditional | Notes |
|---|---|---|---|
| Legacy Mode | 596.50ms | 1.00x | Traditional implementation |
| Categorical Mode | 230.11ms | 2.59x | Categorical optimization |
| Optimized Implementation | 232.38ms | 2.57x | Optimizer selection |
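The memory reductions reported elsewhere in this README come from interning: each distinct string is stored once and columns hold only small integer indices into the pool. The sketch below is a minimal, self-contained illustration of that idea; it is not PandRS's internal string pool, and all names here are hypothetical:

```rust
use std::collections::HashMap;

/// A minimal string pool: stores each distinct string once and
/// hands out stable u32 indices (hypothetical illustration).
struct StringPool {
    lookup: HashMap<String, u32>,
    values: Vec<String>,
}

impl StringPool {
    fn new() -> Self {
        StringPool { lookup: HashMap::new(), values: Vec::new() }
    }

    /// Intern a string: returns the existing index if the string was
    /// seen before, so each duplicate row costs only 4 bytes.
    fn add(&mut self, s: &str) -> u32 {
        if let Some(&idx) = self.lookup.get(s) {
            return idx;
        }
        let idx = self.values.len() as u32;
        self.values.push(s.to_string());
        self.lookup.insert(s.to_string(), idx);
        idx
    }

    fn get(&self, idx: u32) -> Option<&str> {
        self.values.get(idx as usize).map(String::as_str)
    }
}

fn main() {
    let mut pool = StringPool::new();
    // A column with 1% unique values stores each string once...
    let a = pool.add("repeated_value");
    let b = pool.add("repeated_value");
    assert_eq!(a, b); // ...and duplicate rows share the same index
    assert_eq!(pool.get(a), Some("repeated_value"));
}
```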
### Parallel Processing Performance Improvements
| Operation | Serial Processing | Parallel Processing | Speedup |
|---|---|---|---|
| Group Aggregation | 696.85ms | 178.09ms | 3.91x |
| Filtering | 201.35ms | 175.48ms | 1.15x |
| Computation | 15.41ms | 11.23ms | 1.37x |
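The "adaptive" strategy mentioned in the feature list means parallelism is only engaged when the data is large enough to amortize thread overhead (note the modest 1.15x for filtering versus 3.91x for aggregation above). A minimal sketch of that pattern with Rayon, using a hypothetical size threshold:

```rust
use rayon::prelude::*;

/// Hypothetical cutoff: below this size, threading overhead
/// outweighs the gain, so we stay sequential.
const PARALLEL_THRESHOLD: usize = 100_000;

/// Sum of squares, choosing serial or parallel execution by data size.
fn adaptive_sum_squares(data: &[f64]) -> f64 {
    if data.len() < PARALLEL_THRESHOLD {
        data.iter().map(|x| x * x).sum()      // sequential path
    } else {
        data.par_iter().map(|x| x * x).sum()  // parallel path across all cores
    }
}

fn main() {
    let small: Vec<f64> = (0..1_000).map(|i| i as f64).collect();
    let large: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();
    println!("{}", adaptive_sum_squares(&small)); // runs sequentially
    println!("{}", adaptive_sum_squares(&large)); // runs in parallel
}
```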
### Python Bindings String Optimization
| Data Size | Unique Rate | Without Pool | With Pool | Processing Speedup | Memory Reduction |
|---|---|---|---|---|---|
| 100,000 rows | 1% (high duplication) | 82ms | 35ms | 2.34x | 88.6% |
| 1,000,000 rows | 1% (high duplication) | 845ms | 254ms | 3.33x | 89.8% |
## Recent Improvements

- GPU Acceleration Integration
  - CUDA-based acceleration for performance-critical operations
  - Up to 20x speedup for matrix operations
  - GPU-accelerated statistical functions and ML algorithms
  - Transparent CPU fallback when GPU is unavailable
  - Comprehensive benchmarking utility for CPU vs GPU performance comparison
  - Python bindings for GPU functionality
- Specialized Categorical Data Statistics
  - Comprehensive contingency table implementation
  - Chi-square test for independence
  - Cramér's V measure of association
  - Categorical ANOVA for comparing means across categories
  - Entropy and mutual information calculations
  - Statistical summaries specific to categorical variables
- Large-Scale Data Processing (see the chunked-processing sketch after this list)
  - Disk-based processing for datasets larger than available memory
  - Memory-mapped file support for efficient large data access
  - Chunked processing capabilities for scalable data operations
  - Spill-to-disk functionality when memory limits are reached
- Streaming Data Support
  - Real-time data processing interfaces
  - Stream connectors for various data sources
  - Windowed operations on streaming data
  - Real-time analytics capabilities
- Column-Oriented Storage Engine
  - Type-specialized column implementation (Int64Column, Float64Column, StringColumn, BooleanColumn)
  - Improved cache efficiency through memory locality
  - Operation acceleration and parallel processing efficiency
- String Processing Optimization
  - Elimination of duplicate strings with global string pool
  - String to index conversion with categorical encoding
  - Consistent API design and multiple optimization modes
- Lazy Evaluation System Implementation
  - Operation pipelining with computation graph
  - Avoiding unnecessary intermediate results
  - Improved efficiency through operation fusion
- Significant Parallel Processing Improvements
  - Efficient multi-threading with Rayon
  - Adaptive parallel processing (automatic selection based on data size)
  - Chunk processing optimization
- Enhanced Python Integration
  - Efficient data conversion between Python and Rust with string pool optimization
  - Utilization of NumPy buffer protocol
  - Near zero-copy data access
  - Type-specialized Python API
  - GPU acceleration support through Python bindings
- Advanced DataFrame Operations
  - Complete implementation of long-form and wide-form conversion (melt, stack, unstack)
  - Enhanced conditional aggregation processing
  - Optimization of complex join operations
- Enhanced Time Series Data Processing
  - Support for RFC3339 format date parsing
  - Complete implementation of advanced window operations
  - Support for complete format frequency specification (`DAILY`, `WEEKLY`, etc.)
  - GPU-accelerated time series operations
- WebAssembly Support
  - Browser-based visualization capabilities
  - Interactive dashboard functionality
  - Theme customization options
  - Multiple visualization types support
- Stability and Quality Improvements
  - Implementation of comprehensive test suite
  - Improved error handling and warning elimination
  - Enhanced documentation
  - Updated dependencies (Rust 2023 compatible)
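As referenced in the large-scale data processing item above, datasets larger than RAM are handled by streaming fixed-size chunks rather than loading everything at once. The sketch below shows the chunked pattern using the csv crate directly; it is a hypothetical illustration of the technique and does not use PandRS's disk-based API:

```rust
use std::error::Error;

/// Aggregate a CSV far larger than RAM by processing fixed-size chunks
/// of rows; peak memory stays bounded by the chunk size.
fn chunked_sum(path: &str, chunk_rows: usize) -> Result<f64, Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(path)?;
    let mut total = 0.0;
    let mut chunk: Vec<f64> = Vec::with_capacity(chunk_rows);

    for record in reader.records() {
        let record = record?;
        if let Some(field) = record.get(0) {
            chunk.push(field.parse::<f64>()?);
        }
        if chunk.len() == chunk_rows {
            total += chunk.iter().sum::<f64>(); // reduce, then discard the chunk
            chunk.clear();
        }
    }
    total += chunk.iter().sum::<f64>(); // final partial chunk
    Ok(total)
}

fn main() -> Result<(), Box<dyn Error>> {
    // Path and chunk size are illustrative; any numeric CSV column works.
    let total = chunked_sum("large_dataset.csv", 100_000)?;
    println!("sum = {total}");
    Ok(())
}
```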
## High-Quality Visualization (Plotters Integration)
```rust
use pandrs::{DataFrame, Series};
use pandrs::vis::plotters_ext::{PlotKind, PlotSettings};

// NOTE: settings fields and output paths below are illustrative
// Create plot from a single Series
let values = vec![15.2, 16.8, 14.9, 18.3, 17.5];
let series = Series::new(values, Some("temperature".to_string()))?;

// Create line graph
let line_settings = PlotSettings { title: "Temperature Over Time".to_string(), plot_kind: PlotKind::Line, ..PlotSettings::default() };
series.plotters_plot("line.png", &line_settings)?;

// Create histogram
let hist_settings = PlotSettings { title: "Temperature Distribution".to_string(), plot_kind: PlotKind::Histogram, ..PlotSettings::default() };
series.plotters_histogram("histogram.png", &hist_settings)?;

// Visualization using DataFrame (series built beforehand)
let mut df = DataFrame::new();
df.add_column("temperature".to_string(), temperature_series)?;
df.add_column("humidity".to_string(), humidity_series)?;

// Scatter plot (relationship between temperature and humidity)
let xy_settings = PlotSettings { title: "Temperature vs Humidity".to_string(), plot_kind: PlotKind::Scatter, ..PlotSettings::default() };
df.plotters_xy("temperature", "humidity", "scatter.png", &xy_settings)?;

// Multiple series line graph
let multi_settings = PlotSettings { title: "Weather Metrics".to_string(), ..PlotSettings::default() };
df.plotters_multi(&["temperature", "humidity"], "multi.png", &multi_settings)?;
```
## Dependency Versions
Latest dependency versions (May 2024):
```toml
# NOTE: crate names below are reconstructed from their descriptions;
# verify against the project's Cargo.toml.
[dependencies]
num-traits = "0.2.19"                                   # Numeric trait support
thiserror = "2.0.12"                                    # Error handling
serde = { version = "1.0.219", features = ["derive"] }  # Serialization
serde_json = "1.0.114"                                  # JSON processing
chrono = "0.4.40"                                       # Date and time processing
regex = "1.10.2"                                        # Regular expressions
csv = "1.3.1"                                           # CSV processing
rayon = "1.9.0"                                         # Parallel processing
lazy_static = "1.5.0"                                   # Lazy initialization
rand = "0.9.0"                                          # Random number generation
tempfile = "3.8.1"                                      # Temporary files
textplots = "0.8.7"                                     # Text-based visualization
plotters = "0.3.7"                                      # High-quality visualization
chrono-tz = "0.10.3"                                    # Timezone processing
parquet = "54.3.1"                                      # Parquet file support
arrow = "54.3.1"                                        # Arrow format support
crossbeam-channel = "0.5.8"                             # Concurrent message passing
memmap2 = "0.7.1"                                       # Memory-mapped files
calamine = "0.23.1"                                     # Excel reading
simple_excel_writer = "0.2.0"                           # Excel writing

# Optional dependencies (feature-gated)
# CUDA support
cudarc = "0.10.0"                                       # CUDA bindings
half = "2.3.1"                                          # Half-precision floating point support
# CUDA support for ndarray: version "0.2.0" (crate name not recoverable)

# WebAssembly support
wasm-bindgen = "0.2.91"                                 # WebAssembly bindings
js-sys = "0.3.68"                                       # JavaScript interop
web-sys = "0.3.68"                                      # Web API bindings
plotters-canvas = "0.4.0"                               # Canvas backend for plotters
```
## License
Available under the Apache License 2.0.