# datacell Specification
## Project Overview
**datacell** is a fast, unified CLI tool for spreadsheet and columnar data manipulation, built in Rust. It provides a single binary that handles CSV, Excel (XLSX/XLS/ODS), Parquet, and Avro files with Excel-like formulas, pandas-style operations, and data transformations.
## Core Requirements
### 1. File Format Support
#### Input Formats
- **CSV**: Standard comma-separated values with configurable delimiters
- **Excel**: XLSX, XLS, ODS (OpenDocument Spreadsheet)
- **Parquet**: Apache Parquet columnar format
- **Avro**: Apache Avro binary format
#### Output Formats
- **CSV**: Standard output with headers
- **Excel**: XLSX with styling support
- **Parquet**: Compressed columnar format
- **Avro**: Schema-based binary format
- **JSON**: Structured JSON output
- **Markdown**: Formatted tables for documentation
### 2. Formula Evaluation
#### Arithmetic Operations
- Basic operators: `+`, `-`, `*`, `/`
- Cell references: `A1`, `B2`, etc.
- Range references: `A1:A10`
#### Aggregate Functions
- `SUM(range)` - Sum of values
- `AVERAGE(range)` - Mean of values
- `MIN(range)` - Minimum value
- `MAX(range)` - Maximum value
- `COUNT(range)` - Count of numeric values
#### Conditional Functions
- `IF(condition, true_value, false_value)` - Conditional logic
- `SUMIF(range, criteria, sum_range)` - Conditional sum
- `COUNTIF(range, criteria)` - Conditional count
#### Lookup Functions
- `VLOOKUP(lookup_value, table_range, col_index)` - Vertical lookup
#### Text Functions
- `CONCAT(text1, text2, ...)` - String concatenation
- `LEN(text)` - String length
#### Math Functions
- `ROUND(number, decimals)` - Round to decimal places
- `ABS(number)` - Absolute value
### 3. Data Operations
#### Core Operations
- **Sort**: Sort by column (ascending/descending)
- **Filter**: SQL-like WHERE clause filtering
- **Replace**: Find and replace values
- **Dedupe**: Remove duplicate rows
- **Transpose**: Swap rows and columns
- **Append**: Add data to existing files
#### Pandas-Style Operations
- **Selection**: `head`, `tail`, `sample`, `select`, `drop`
- **Statistics**: `describe`, `value_counts`, `corr`
- **Grouping**: `groupby` with aggregations (sum, count, mean, min, max)
- **Joining**: `join` (left, right, inner, outer), `concat`
- **Missing Data**: `fillna`, `dropna`
- **Transformation**: `rename`, `mutate`, `astype`
- **Query**: SQL-like filtering with WHERE clause
- **Info**: `dtypes`, `unique`, `info`
#### Transform Operations
- **Clip**: Limit values to min/max range
- **Normalize**: Scale column to 0-1 range
- **Date Parsing**: Convert between date formats
- **Regex**: Filter and replace with patterns
### 4. Advanced Features
#### Data Quality
- **Validation**: Rule-based data validation with JSON rules
- **Profiling**: Data quality profiling and statistics
- **Quality Reports**: Automated quality assessment
#### Security
- **Encryption**: XOR and AES256 file encryption
- **Decryption**: Decrypt encrypted files
#### Visualization
- **Styled Excel**: Export with headers, colors, borders
- **Charts**: Column, bar, line, pie, area, scatter, doughnut charts
#### Configuration
- **Config File**: `.datacell.toml` for default settings
- **Shell Completions**: bash, zsh, fish, powershell
#### Integration
- **MCP Server**: Model Context Protocol server for AI assistants
- **Batch Processing**: Process multiple files with glob patterns
- **Streaming**: Handle large files efficiently
- **Plugins**: Extensible plugin system
### 5. Performance Requirements
- **Fast Startup**: < 100ms for simple operations
- **Memory Efficient**: Stream processing for large files
- **Single Binary**: No runtime dependencies
- **Cross-Platform**: macOS, Linux, Windows
### 6. API Design
#### Command Structure
```bash
datacell <command> [options]
```
#### Common Options
- `-i, --input <INPUT>` - Input file path
- `-o, --output <OUTPUT>` - Output file path
- `-s, --sheet <SHEET>` - Sheet name for Excel files
- `-f, --format <FORMAT>` - Output format (csv, json, markdown)
#### Error Handling
- Clear error messages with context
- File path, row, and column information
- Suggestions for common mistakes
### 7. Testing Requirements
#### Unit Tests
- 189+ unit tests covering all modules
- Test coverage for formulas, operations, conversions
- Mock implementations for testing
#### Integration Tests
- End-to-end CLI testing
- 60+ capability tests across 10 categories
- Automated test suite with exit code validation
#### Test Categories
1. File Format I/O (8 tests)
2. Format Conversions (12 tests)
3. Formula Evaluation (15 tests)
4. Data Operations (10 tests)
5. Pandas-Style Operations (25 tests)
6. Transform Operations (8 tests)
7. Advanced Features (7 tests)
8. Styling and Visualization (6 tests)
9. Configuration (3 tests)
10. Batch Processing (1 test)
### 8. Documentation Requirements
- **README.md**: Overview, quick start, features, usage examples
- **ARCHITECTURE.md**: System design, modules, data flow
- **TESTING_GUIDE.md**: Comprehensive testing documentation
- **QUICK_REFERENCE.md**: Command reference card
- **EXAMPLES_SUMMARY.md**: Test suite overview
- **TODO.md**: Project status and roadmap
### 9. Quality Standards
- **Code Quality**: Zero compiler warnings, clean code
- **Test Coverage**: 100% of major features tested
- **Documentation**: All commands documented with examples
- **Performance**: Benchmarked against pandas and xsv
- **Reliability**: All tests passing, no known bugs
### 10. Future Enhancements
#### Planned Features
- Real-time data streaming support
- Data lineage tracking
- REST API server mode
- Web UI for interactive data exploration
- Machine learning integration
- Cloud storage support (S3, GCS, Azure)
#### Optimization Opportunities
- Parallel processing for large datasets
- Query optimization for complex operations
- Caching for repeated operations
- Incremental updates for large files
## Custom XLSX Writer Design
datacell generates Excel files using a from-scratch XLSX writer rather than an external library. This section documents the design rationale and implementation.
### Motivation
XLSX files are ZIP archives containing XML files that follow the ECMA-376 Office Open XML (OOXML) standard. Several Rust crates exist for writing XLSX, but each had drawbacks for our use case:
| `rust_xlsxwriter` | Heavy dependency tree; version conflicts with our `zip` crate usage |
| `simple_excel_writer` | No formula support; limited cell types |
| `xlsxwriter` (C FFI) | Requires system C library; complicates cross-compilation and CI |
We chose to generate the XML directly using only the `zip` crate. This gives us a smaller binary, full control over output, and no external library version conflicts. The trade-off is that we must implement OOXML compliance ourselves, and advanced features like charts require significant XML markup work.
### OOXML Structure
A valid XLSX file contains these required entries in the ZIP archive:
```
[Content_Types].xml # MIME types for all parts
_rels/.rels # Top-level relationships
xl/workbook.xml # Workbook with sheet list
xl/_rels/workbook.xml.rels # Workbook relationships
xl/styles.xml # Fonts, fills, borders, cell formats
xl/theme/theme1.xml # Color/font theme
xl/worksheets/sheet1.xml # Worksheet data (one per sheet)
```
### Key Implementation Details
The writer lives in `src/excel/xlsx_writer/` with these modules:
- **`mod.rs`** — `XlsxWriter` struct, sheet/row/cell API, chart/cond-fmt/sparkline integration
- **`types.rs`** — `CellData`, `RowData`, `SheetData` types
- **`xml_gen.rs`** — Core XML generation (content types, workbook, worksheets, styles, theme)
- **`chart_xml.rs`** — OOXML DrawingML chart generation (bar, column, line, area, pie, scatter, doughnut)
- **`cond_fmt_xml.rs`** — Conditional formatting XML (color scales, data bars, icon sets, formula/cell-value rules)
- **`sparkline_xml.rs`** — Sparkline XML via x14 extension namespace (line, column, win/loss)
- **`streaming.rs`** — `StreamingXlsxWriter` for row-by-row large file writing
Critical OOXML elements that Excel/Numbers require (discovered by byte-level comparison with openpyxl output):
- `<sheetPr>` with `<outlinePr>` and `<pageSetUpPr/>`
- `<sheetFormatPr baseColWidth="8" defaultRowHeight="15"/>`
- `<selection>` inside `<sheetView>` for cursor positioning
- `<pageMargins>` at the end of each worksheet
- `<workbookPr/>`, `<bookViews>`, `<calcPr>` in workbook.xml
- `<diagonal/>` in every `<border>` element
- `<tableStyles>` in styles.xml
- Number cells with explicit `t="n"` type attribute
- Theme with `lnStyleLst` and triple entries in fill/effect/bg style lists
### Supported Features
- Multiple sheets (max 31-char names, invalid character validation)
- Cell types: String (inline), Number, Formula, Empty
- Column widths (auto-fit and manual)
- Freeze header row
- Auto-filter
- Basic styling (bold, fills, borders, alignment)
- **Charts** — Bar, column, line, area, pie, scatter, doughnut with custom colors, legends, axis titles, multi-series
- **Sparklines** — Line, column, win/loss in-cell mini charts with markers
- **Conditional formatting** — Two/three-color scales, data bars, icon sets, formula-based, cell-value rules
- **Streaming** — `StreamingXlsxWriter` for incremental row-by-row writing of large files
- **CSV injection protection** — `sanitize_csv_value()` / `sanitize_csv_row()` for safe CSV output
### Not Yet Implemented
- Merged cells
- Data validation dropdowns
- Pivot tables
## Technical Stack
- **Language**: Rust (Edition 2024)
- **CLI Framework**: clap (derive macros)
- **Excel Read**: calamine (reads .xls, .xlsx, .ods)
- **Excel Write**: Custom OOXML writer using `zip` crate (no external Excel library)
- **CSV**: csv crate
- **Parquet**: parquet + arrow (v54)
- **Avro**: apache-avro
- **Async**: tokio
- **Serialization**: serde_json
- **Error Handling**: anyhow, thiserror
- **Testing**: cargo test, integration tests
## Success Criteria
1. ✅ All 189 unit tests passing
2. ✅ All 60+ integration tests passing
3. ✅ Zero compilation errors or warnings
4. ✅ Comprehensive documentation complete
5. ✅ Single binary deployment
6. ✅ Cross-platform compatibility
7. ✅ Performance benchmarks met
8. ✅ User-friendly CLI with help text
9. ✅ Extensible architecture
10. ✅ Production-ready quality
## Version History
- **v0.1.3** (Jan 2026): Comprehensive test suite, 189 tests passing, full CLI implementation
- **v0.1.2**: Added pandas-style operations, advanced features
- **v0.1.1**: Formula evaluation, data operations
- **v0.1.0**: Initial release with basic CSV/Excel support