# CSV Processor

A high-performance **Rust library and CLI tool** for CSV data analysis, featuring automatic type inference, statistical analysis, and professional reporting capabilities.

## 📦 Library + CLI Tool

This project provides both:
- **📚 Rust Library** - For embedding CSV analysis in your applications
- **🖥️ CLI Tool** - For command-line data analysis

## Features

- **Automatic Type Inference**: Intelligently detects integers, floats, booleans, and strings
- **Missing Value Analysis**: Comprehensive NA/null detection and reporting
- **Statistical Operations**: Built-in sum, mean, min, max calculations for all numeric types
- **JSON Export**: Native JSON serialization for DataFrames and columns
- **Professional Output**: Formatted tables and statistical reports
- **Fast Processing**: Rust-powered performance for large CSV files
- **Self-Analyzing Columns**: Each column type implements its own statistical operations
- **Comprehensive Testing**: 37+ tests ensuring reliability

## Installation

### As a Library
Add to your `Cargo.toml`:
```toml
[dependencies]
csv_processor = "0.1.4"
```

### As a CLI Tool
```bash
cargo install csv_processor

# Or build from source
git clone https://github.com/kkruglik/csv_processor
cd csv_processor
cargo build --release
```

## Usage

### 📚 Library Usage

```rust
use csv_processor::{DataFrame, reporter::{generate_info_report, generate_na_report}};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load CSV file
    let df = DataFrame::from_csv("data.csv")?;
    
    // Generate statistical report
    let stats_report = generate_info_report(&df);
    println!("Statistics:\n{}", stats_report);
    
    // Generate NA analysis report
    let na_report = generate_na_report(&df);
    println!("Missing Values:\n{}", na_report);
    
    // Export to JSON
    let json_output = df.to_json()?;
    println!("JSON: {}", json_output);
    
    // Access individual columns for custom analysis
    if let Some(column) = df.get_column(0) {
        println!("Column mean: {:?}", column.mean());
        println!("Column nulls: {}", column.null_count());
        let column_json = column.to_json();
        println!("Column as JSON: {:?}", column_json);
    }
    
    Ok(())
}
```

### 🖥️ CLI Usage

```bash
# Check for missing values
csv_processor na sample.csv

# Calculate comprehensive statistics  
csv_processor info sample.csv

# Get help
csv_processor --help
```

**Development Usage:**
```bash
# When developing/building from source
cargo run --bin csv_processor -- na sample.csv
cargo run --bin csv_processor -- info sample.csv
```

## Sample Output

### DataFrame Display
When a CSV file is loaded, its data is displayed as a formatted table:

```
┌─────────────────┬──────────┬─────────┬────────────┬─────────────┬────────┬────────────┬───────┐
│      name       │   age    │ salary  │ department │   active    │ score  │    ...     │  ...  │
├─────────────────┼──────────┼─────────┼────────────┼─────────────┼────────┼────────────┼───────┤
│   Alice Smith   │    28    │ 75000.5 │Engineering │    true     │  8.7   │    ...     │  ...  │
│   Bob Johnson   │   null   │  65000  │ Marketing  │   false     │ null   │    ...     │  ...  │
│   Carol Davis   │    35    │  null   │Engineering │    true     │  9.2   │    ...     │  ...  │
│      null       │    29    │58000.75 │   Sales    │    true     │  7.8   │    ...     │  ...  │
│        ⋮        │    ⋮     │    ⋮    │     ⋮      │      ⋮      │   ⋮    │     ⋮      │   ⋮   │
│  Henry Taylor   │    38    │  82000  │Engineering │   false     │  7.5   │    ...     │  ...  │
└─────────────────┴──────────┴─────────┴────────────┴─────────────┴────────┴────────────┴───────┘
10 rows × 8 columns
```

### Statistical Report (Wide Format)
```
┌────────────┬──────────┬─────────────┬───────────┬─────────────┐
│   column   │   mean   │     sum     │    min    │     max     │
├────────────┼──────────┼─────────────┼───────────┼─────────────┤
│     id     │   5.5    │    55.0     │    1.0    │    10.0     │
│    age     │  31.29   │   250.33    │   26.0    │    42.0     │
│   salary   │ 72571.5  │  507000.5   │  58000.75 │   95000.0   │
│   active   │   0.8    │     8.0     │    0.0    │     1.0     │
│   score    │   8.06   │    56.4     │    6.9    │     9.2     │
└────────────┴──────────┴─────────────┴───────────┴─────────────┘
5 rows × 5 columns
```

### Missing Value Analysis
```
Column Analysis:
- id: 0 missing values (0.0%)
- name: 2 missing values (20.0%)
- age: 2 missing values (20.0%)
- salary: 3 missing values (30.0%)
- department: 1 missing values (10.0%)
- active: 1 missing values (10.0%)
- start_date: 2 missing values (20.0%)
- score: 3 missing values (30.0%)
```

## API Reference

### Core Types

```rust
use csv_processor::{DataFrame, ColumnArray, CellValue, reporter};

// Main data container
let df = DataFrame::from_csv("data.csv")?;

// Access columns polymorphically  
let column: &dyn ColumnArray = df.get_column(0).unwrap();

// Statistical operations (mean, sum, min, max all return Option<f64>)
let mean = column.mean();
let sum = column.sum();
let min = column.min();
let max = column.max();

// Missing-value count
let nulls = column.null_count();

// JSON export
let json_output = df.to_json()?;
let column_json = column.to_json();

// Generate reports
let stats_report = reporter::generate_info_report(&df);
let na_report = reporter::generate_na_report(&df);
```

### Key Traits

- `ColumnArray` - Unified interface for column data, statistical operations, and JSON export (see the sketch after this list)
- `Display` - Formatted output for DataFrames and reports
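
The snippet below is a minimal sketch of driving `ColumnArray` polymorphically. It uses only the calls shown above and assumes that `DataFrame::get_column` returns `None` once the index passes the last column, so the loop does not need to know the column count up front.

```rust
use csv_processor::{ColumnArray, DataFrame};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let df = DataFrame::from_csv("data.csv")?;

    // Every column, regardless of its concrete type, answers the same
    // statistical questions through the ColumnArray trait; each operation
    // returns Option<f64>, so a non-applicable result is simply None.
    let mut index = 0;
    while let Some(column) = df.get_column(index) {
        println!(
            "column {}: mean = {:?}, nulls = {}",
            index,
            column.mean(),
            column.null_count()
        );
        index += 1;
    }

    Ok(())
}
```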

## Architecture

### Library + Binary Structure
```
src/
โ”œโ”€โ”€ lib.rs              # Library interface with documentation
โ”œโ”€โ”€ bin/
โ”‚   โ””โ”€โ”€ csv_processor.rs # CLI binary
โ”œโ”€โ”€ series/             # Column-oriented data structures (Polars-style)
โ”‚   โ””โ”€โ”€ array.rs        # ColumnArray trait with statistical operations
โ”œโ”€โ”€ frame/              # DataFrame operations and CSV I/O
โ”‚   โ””โ”€โ”€ mod.rs          # Main DataFrame implementation  
โ”œโ”€โ”€ scalar/             # Cell-level operations and values
โ”œโ”€โ”€ reporter.rs         # Statistical report generation
โ””โ”€โ”€ config.rs           # CLI parsing (exported for advanced use)
```

### Core Design Principles
- **Library First**: Clean API for embedding in applications
- **Self-Analyzing Columns**: Statistical operations embedded in column types
- **Functional Design**: Pure functions over object-oriented patterns  
- **Rust Idioms**: Leverage ownership system and proper error handling

### Key Data Types
- **DataFrame**: Main container with typed columns and display formatting
- **ColumnArray**: Unified trait for data access AND statistical operations
- **Column Types**: `IntegerColumn`, `FloatColumn`, `StringColumn`, `BooleanColumn`
- **CellValue**: Enum for individual cell values with type information (an illustrative sketch follows below)
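
To make the `CellValue` idea concrete, here is a purely illustrative sketch of how such a tagged cell value is typically shaped. The variant names are hypothetical and not taken from this crate; the point is only that each cell carries its value together with its type, with missing values as a first-class case.

```rust
// Hypothetical illustration -- variant and type names are NOT this crate's API.
#[derive(Debug)]
enum CellValue {
    Int(i64),
    Float(f64),
    Bool(bool),
    Text(String),
    Null, // missing / NA cell
}

// With a shape like this, type information travels with every value,
// which is what lets columns infer their type and count missing entries.
fn describe(cell: &CellValue) -> &'static str {
    match cell {
        CellValue::Int(_) => "integer",
        CellValue::Float(_) => "float",
        CellValue::Bool(_) => "boolean",
        CellValue::Text(_) => "string",
        CellValue::Null => "missing",
    }
}

fn main() {
    let cells = [CellValue::Int(28), CellValue::Null, CellValue::Float(8.7)];
    for cell in &cells {
        println!("{:?} is a {} cell", cell, describe(cell));
    }
}
```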

## Development

```bash
# Build the project
cargo build

# Run all tests (37+ test suite)
cargo test

# Run specific test suite
cargo test frame_tests
cargo test columns_tests

# Check code quality
cargo clippy

# Format code
cargo fmt

# Check without building
cargo check
```

## Performance

- **Fast Type Inference**: Automatic detection of optimal column types
- **Memory Efficient**: Column-oriented storage following Apache Arrow patterns
- **Zero-Cost Abstractions**: Rust's performance with high-level ergonomics
- **Parallel Processing Ready**: Architecture designed for future parallelization

## Examples

### Sample CSV Structure
The tool handles various data types and missing values:
```csv
id,name,age,salary,department,active,start_date,score
1,Alice Smith,28,75000.50,Engineering,true,2021-03-15,8.7
2,Bob Johnson,,65000,Marketing,false,2020-11-22,
3,Carol Davis,35,NA,Engineering,true,,9.2
```
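
As an end-to-end sketch tying this sample to the library API above (the file name `employee_sample.csv` is just an example), the rows can be written to disk and analyzed in a few lines:

```rust
use csv_processor::{
    reporter::{generate_info_report, generate_na_report},
    DataFrame,
};
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write the sample rows shown above to a file so DataFrame::from_csv can read them.
    let sample = "\
id,name,age,salary,department,active,start_date,score
1,Alice Smith,28,75000.50,Engineering,true,2021-03-15,8.7
2,Bob Johnson,,65000,Marketing,false,2020-11-22,
3,Carol Davis,35,NA,Engineering,true,,9.2
";
    fs::write("employee_sample.csv", sample)?;

    let df = DataFrame::from_csv("employee_sample.csv")?;

    // Column types are inferred automatically; the two reports cover
    // numeric statistics and missing values respectively.
    println!("{}", generate_info_report(&df));
    println!("{}", generate_na_report(&df));

    Ok(())
}
```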

### Usage Examples

**CLI Usage:**
```bash
# Analyze missing values
csv_processor na employee_data.csv

# Generate statistical report (includes JSON export demonstration)
csv_processor info sales_data.csv

# For development (building from source)
cargo run --bin csv_processor -- na employee_data.csv
```

**Library Usage:**
```rust
use csv_processor::{DataFrame, reporter::generate_info_report};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let df = DataFrame::from_csv("sales_data.csv")?;
    let report = generate_info_report(&df);
    println!("{}", report);
    Ok(())
}
```

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Write tests for your changes
4. Run the test suite (`cargo test`)
5. Ensure code quality (`cargo clippy`)
6. Commit your changes (`git commit -am 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.