# CSV Processor
A high-performance **Rust library and CLI tool** for CSV data analysis, featuring automatic type inference, statistical analysis, and professional reporting capabilities.
## ๐ฆ Library + CLI Tool
This project provides both:
- **๐ Rust Library** - For embedding CSV analysis in your applications
- **๐ฅ๏ธ CLI Tool** - For command-line data analysis
## Features
- **Automatic Type Inference**: Intelligently detects integers, floats, booleans, and strings
- **Missing Value Analysis**: Comprehensive NA/null detection and reporting
- **Statistical Operations**: Built-in sum, mean, min, max calculations for all numeric types
- **JSON Export**: Native JSON serialization for DataFrames and columns
- **Professional Output**: Formatted tables and statistical reports
- **Fast Processing**: Rust-powered performance for large CSV files
- **Self-Analyzing Columns**: Each column type implements its own statistical operations
- **Comprehensive Testing**: 37+ tests ensuring reliability
## Installation
### As a Library
Add to your `Cargo.toml`:
```toml
[dependencies]
csv_processor = "0.1.0"
```
### As a CLI Tool
```bash
cargo install csv_processor
# Or build from source
git clone https://github.com/kkruglik/csv_processor
cd csv_processor
cargo build --release
```
## Usage
### ๐ Library Usage
```rust
use csv_processor::{DataFrame, reporter::{generate_info_report, generate_na_report}};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Load CSV file
let df = DataFrame::from_csv("data.csv")?;
// Generate statistical report
let stats_report = generate_info_report(&df);
println!("Statistics:\n{}", stats_report);
// Generate NA analysis report
let na_report = generate_na_report(&df);
println!("Missing Values:\n{}", na_report);
// Export to JSON
let json_output = df.to_json()?;
println!("JSON: {}", json_output);
// Access individual columns for custom analysis
if let Some(column) = df.get_column(0) {
println!("Column mean: {:?}", column.mean());
println!("Column nulls: {}", column.null_count());
let column_json = column.to_json();
println!("Column as JSON: {:?}", column_json);
}
Ok(())
}
```
### ๐ฅ๏ธ CLI Usage
```bash
# Check for missing values
csv_processor na sample.csv
# Calculate comprehensive statistics
csv_processor info sample.csv
# Get help
csv_processor --help
```
**Development Usage:**
```bash
# When developing/building from source
cargo run --bin csv_processor -- na sample.csv
cargo run --bin csv_processor -- info sample.csv
```
## Sample Output
### DataFrame Display
When loading a CSV file, data is displayed in a formatted table:
```
โโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโ
โ name โ age โ salary โ department โ active โ score โ ... โ ... โ
โโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโค
โ Alice Smith โ 28 โ 75000.5 โEngineering โ true โ 8.7 โ ... โ ... โ
โ Bob Johnson โ null โ 65000 โ Marketing โ false โ null โ ... โ ... โ
โ Carol Davis โ 35 โ null โEngineering โ true โ 9.2 โ ... โ ... โ
โ null โ 29 โ58000.75 โ Sales โ true โ 7.8 โ ... โ ... โ
โ โฎ โ โฎ โ โฎ โ โฎ โ โฎ โ โฎ โ โฎ โ โฎ โ
โ Henry Taylor โ 38 โ 82000 โEngineering โ false โ 7.5 โ ... โ ... โ
โโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโ
10 rows ร 8 columns
```
### Statistical Report (Wide Format)
```
โโโโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโฌโโโโโโโโโโโโโโ
โ column โ mean โ sum โ min โ max โ
โโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ id โ 5.5 โ 55.0 โ 1.0 โ 10.0 โ
โ age โ 31.29 โ 250.33 โ 26.0 โ 42.0 โ
โ salary โ 72571.5 โ 507000.5 โ 58000.75 โ 95000.0 โ
โ active โ 0.8 โ 8.0 โ 0.0 โ 1.0 โ
โ score โ 8.06 โ 56.4 โ 6.9 โ 9.2 โ
โโโโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโดโโโโโโโโโโโโโโ
5 rows ร 5 columns
```
### Missing Value Analysis
```
Column Analysis:
- id: 0 missing values (0.0%)
- name: 2 missing values (20.0%)
- age: 2 missing values (20.0%)
- salary: 3 missing values (30.0%)
- department: 1 missing values (10.0%)
- active: 1 missing values (10.0%)
- start_date: 2 missing values (20.0%)
- score: 3 missing values (30.0%)
```
## API Reference
### Core Types
```rust
use csv_processor::{DataFrame, ColumnArray, CellValue, reporter};
// Main data container
let df = DataFrame::from_csv("data.csv")?;
// Access columns polymorphically
let column: &dyn ColumnArray = df.get_column(0).unwrap();
// Statistical operations (all return Option<f64>)
let mean = column.mean();
let sum = column.sum();
let min = column.min();
let max = column.max();
let nulls = column.null_count();
// JSON export
let json_output = df.to_json()?;
let column_json = column.to_json();
// Generate reports
let stats_report = reporter::generate_info_report(&df);
let na_report = reporter::generate_na_report(&df);
```
### Key Traits
- `ColumnArray` - Unified interface for column data, statistical operations, and JSON export
- `Display` - Formatted output for DataFrames and reports
## Architecture
### Library + Binary Structure
```
src/
โโโ lib.rs # Library interface with documentation
โโโ bin/
โ โโโ csv_processor.rs # CLI binary
โโโ series/ # Column-oriented data structures (Polars-style)
โ โโโ array.rs # ColumnArray trait with statistical operations
โโโ frame/ # DataFrame operations and CSV I/O
โ โโโ mod.rs # Main DataFrame implementation
โโโ scalar/ # Cell-level operations and values
โโโ reporter.rs # Statistical report generation
โโโ config.rs # CLI parsing (exported for advanced use)
```
### Core Design Principles
- **Library First**: Clean API for embedding in applications
- **Self-Analyzing Columns**: Statistical operations embedded in column types
- **Functional Design**: Pure functions over object-oriented patterns
- **Rust Idioms**: Leverage ownership system and proper error handling
### Key Data Types
- **DataFrame**: Main container with typed columns and display formatting
- **ColumnArray**: Unified trait for data access AND statistical operations
- **Column Types**: `IntegerColumn`, `FloatColumn`, `StringColumn`, `BooleanColumn`
- **CellValue**: Enum for individual cell values with type information
## Development
```bash
# Build the project
cargo build
# Run all tests (37+ test suite)
cargo test
# Run specific test suite
cargo test frame_tests
cargo test columns_tests
# Check code quality
cargo clippy
# Format code
cargo fmt
# Check without building
cargo check
```
## Performance
- **Fast Type Inference**: Automatic detection of optimal column types
- **Memory Efficient**: Column-oriented storage following Apache Arrow patterns
- **Zero-Cost Abstractions**: Rust's performance with high-level ergonomics
- **Parallel Processing Ready**: Architecture designed for future parallelization
## Examples
### Sample CSV Structure
The tool handles various data types and missing values:
```csv
id,name,age,salary,department,active,start_date,score
1,Alice Smith,28,75000.50,Engineering,true,2021-03-15,8.7
2,Bob Johnson,,65000,Marketing,false,2020-11-22,
3,Carol Davis,35,NA,Engineering,true,,9.2
```
### Usage Examples
**CLI Usage:**
```bash
# Analyze missing values
csv_processor na employee_data.csv
# Generate statistical report (includes JSON export demonstration)
csv_processor info sales_data.csv
# For development (building from source)
cargo run --bin csv_processor -- na employee_data.csv
```
**Library Usage:**
```rust
use csv_processor::{DataFrame, reporter::generate_info_report};
let df = DataFrame::from_csv("sales_data.csv")?;
let report = generate_info_report(&df);
println!("{}", report);
```
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Write tests for your changes
4. Run the test suite (`cargo test`)
5. Ensure code quality (`cargo clippy`)
6. Commit your changes (`git commit -am 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.