# DataFrame Implementation Sprint - v3.64.0
## π Sprint Overview
**Objective**: Implement production-ready DataFrames using Polars backend (0% β 100%)
**Duration**: 5-7 days
**Priority**: βββ HIGHEST VALUE
**Version Target**: v3.64.0
## Current Status Analysis
### β
What Exists (Infrastructure)
- **Parser**: `df![]` syntax parsing working (`ruchy check` passes)
- **AST**: `ExprKind::DataFrame { columns }` defined
- **Transpiler**: DataFrame β Rust code generation exists
- **Runtime Helpers**: Some DataFrame methods (select, sum, slice, join, groupby)
- **Tests**: Comprehensive test suite exists but disabled in `tests.disabled/`
### β Critical Gaps (Blocking)
1. **BLOCKER**: `ExprKind::DataFrame` not handled in interpreter's `eval_expr()`
- Parser creates AST correctly
- Interpreter falls through to `eval_misc_expr()` β fails
- **Fix**: Add `ExprKind::DataFrame` case to `is_data_structure_expr()`
2. **NO LITERAL EVALUATION**: DataFrame literals don't execute
- `df![age => [25, 30, 35]]` parses but can't evaluate
- **Fix**: Implement `eval_dataframe_literal()` function
3. **MISSING POLARS INTEGRATION**: Feature flag exists but not connected
- `Cargo.toml` has `dataframe` feature with Polars
- No actual Polars usage in runtime
- **Fix**: Integrate Polars for actual dataframe storage
### π Chapter 18 Requirements (From ruchy-book)
**3 Test Files** (0/3 working):
1. `01-dataframe-creation.ruchy` - Creation, CSV, JSON loading
2. `02-dataframe-operations.ruchy` - Filter, sort, aggregate, transform
3. `03-dataframe-analytics.ruchy` - Real-world analytics use cases
**API Surface** (0% implemented):
```ruchy
// Constructor API
DataFrame::new()
.column("name", ["Alice", "Bob"])
.column("age", [25, 30])
.build()
DataFrame::from_csv_string(csv_data)
DataFrame::from_json(json_data)
// Query methods
df.rows()
df.columns()
df.column_names()
df.get("column", index)
// Operations
df.sort_by("score", descending: true)
df.group_by("department")
.agg("salary", "mean")
.agg("years", "sum")
// Transformations
// Statistics
df.column("col").mean()
df.column("col").percentile(95)
df.column("col").std()
df.rolling_mean("col", window_size)
```
## π― Ticket Breakdown (EXTREME TDD)
### **DF-001: DataFrame Literal Evaluation** β‘ P0-CRITICAL
**Status**: π΄ BLOCKING - Nothing works without this
**Effort**: 2-3 hours
**Complexity**: <10
**Problem**:
```bash
$ echo 'df![age => [25, 30]]' | ruchy repl
Error: Evaluation error: In backslash lambda after params: Expected Arrow, found Bang
```
**Root Cause**:
- Parser creates `ExprKind::DataFrame` correctly
- Interpreter's `is_data_structure_expr()` doesn't include DataFrame
- Falls through to `eval_misc_expr()` β confusing error
**Fix Strategy**:
1. Add `ExprKind::DataFrame { .. }` to `is_data_structure_expr()`
2. Implement `eval_dataframe_literal()` in `eval_data_structures.rs`
3. Create `Value::DataFrame` with columns
**Test Plan** (TDD):
```rust
#[test]
fn test_empty_dataframe_literal() {
let mut interp = Interpreter::new();
let result = interp.eval_code("df![]").unwrap();
assert_eq!(result, Value::DataFrame { columns: vec![] });
}
#[test]
fn test_single_column_dataframe() {
let mut interp = Interpreter::new();
let result = interp.eval_code("df![age => [25, 30, 35]]").unwrap();
// Verify column exists and has 3 values
}
#[test]
fn test_multi_column_dataframe() {
let code = r#"df![
name => ["Alice", "Bob"],
age => [25, 30]
]"#;
let result = interp.eval_code(code).unwrap();
// Verify 2 columns, 2 rows
}
```
**Success Criteria**:
- β
`df![]` evaluates to empty DataFrame
- β
Single column DataFrames work
- β
Multi-column DataFrames work
- β
All 3 TDD tests passing
- β
REPL demo works
**Files to Modify**:
- `src/runtime/eval_expr.rs`: Add DataFrame to `is_data_structure_expr()`
- `src/runtime/eval_data_structures.rs`: Add `eval_dataframe_literal()`
- `tests/dataframe_literal_tests.rs`: New TDD test file
---
### **DF-002: DataFrame Constructor API** ποΈ P0-CRITICAL
**Status**: π΄ BLOCKING - Required by all Chapter 18 tests
**Effort**: 1 day
**Complexity**: <10 per function
**Requirements**:
```ruchy
let df = DataFrame::new()
.column("name", ["Alice", "Bob", "Charlie"])
.column("age", [25, 30, 35])
.build();
assert_eq!(df.rows(), 3);
assert_eq!(df.columns(), 3);
assert_eq!(df.column_names(), ["name", "age", "salary"]);
```
**Implementation**:
1. `DataFrame` type in runtime (struct or object)
2. `.new()` static method
3. `.column(name, values)` builder method
4. `.build()` finalizer
5. `.rows()` accessor
6. `.columns()` accessor
7. `.column_names()` accessor
**Test Plan** (TDD):
```rust
#[test]
fn test_dataframe_new_empty() {
let df = DataFrame::new().build();
assert_eq!(df.rows(), 0);
assert_eq!(df.columns(), 0);
}
#[test]
fn test_dataframe_builder_pattern() {
let df = DataFrame::new()
.column("x", vec![1, 2, 3])
.column("y", vec![4, 5, 6])
.build();
assert_eq!(df.rows(), 3);
assert_eq!(df.columns(), 2);
}
#[test]
fn test_dataframe_accessors() {
let df = create_test_df();
assert_eq!(df.rows(), 3);
assert_eq!(df.columns(), 2);
assert_eq!(df.column_names(), vec!["name", "age"]);
}
```
**Success Criteria**:
- β
Builder pattern works
- β
Accessors return correct values
- β
All 3 TDD tests passing
- β
Chapter 18 test 01 section 1 passes
**Files to Create/Modify**:
- `src/runtime/dataframe_builder.rs`: New builder implementation
- `tests/dataframe_constructor_tests.rs`: New TDD tests
---
### **DF-003: CSV/JSON Import** π P1-HIGH
**Status**: π‘ HIGH PRIORITY - Common use case
**Effort**: 1 day
**Complexity**: <10 per function
**Requirements**:
```ruchy
let csv_data = "name,age,city
Alice,25,New York
Bob,30,San Francisco";
let df = DataFrame::from_csv_string(csv_data);
assert_eq!(df.rows(), 2);
assert_eq!(df.get("name", 0), "Alice");
let json_data = r#"[
{"name": "Alice", "age": 25},
{"name": "Bob", "age": 30}
]"#;
let df2 = DataFrame::from_json(json_data);
```
**Implementation**:
1. CSV parsing (use existing CSV library or simple parser)
2. JSON parsing (use serde_json)
3. Type inference for columns
4. Error handling for malformed data
**Test Plan** (TDD):
```rust
#[test]
fn test_from_csv_simple() {
let csv = "name,age\nAlice,25\nBob,30";
let df = DataFrame::from_csv_string(csv).unwrap();
assert_eq!(df.rows(), 2);
assert_eq!(df.get("name", 0), Value::String("Alice"));
}
#[test]
fn test_from_json_array() {
let json = r#"[{"x": 1}, {"x": 2}]"#;
let df = DataFrame::from_json(json).unwrap();
assert_eq!(df.rows(), 2);
}
#[test]
fn test_csv_type_inference() {
let csv = "num,text\n42,hello\n100,world";
let df = DataFrame::from_csv_string(csv).unwrap();
assert_eq!(df.get("num", 0), Value::Integer(42));
}
```
**Success Criteria**:
- β
CSV import works with header
- β
JSON array import works
- β
Type inference (int, float, string)
- β
All 3 TDD tests passing
- β
Chapter 18 test 01 sections 2-3 pass
**Files to Create**:
- `src/runtime/dataframe_io.rs`: CSV/JSON importers
- `tests/dataframe_io_tests.rs`: TDD tests
---
### **DF-004: Filter and Transform Operations** π P1-HIGH
**Status**: π‘ HIGH PRIORITY - Core operations
**Effort**: 1-2 days
**Complexity**: <10 per function
**Requirements**:
```ruchy
let df = create_products_df();
// Filter with closure
// Transform with closure
let with_tax = df.with_column("price_with_tax", |row| {
row["price"] * 1.08
});
// Modify existing column
**Implementation**:
1. `.filter(closure)` method
2. `.with_column(name, closure)` method
3. `.transform(name, closure)` method
4. Closure evaluation with row context
**Test Plan** (TDD):
```rust
#[test]
fn test_filter_numeric() {
let df = df_with_ages([20, 25, 30, 35]);
let filtered = df.filter(|row| row["age"] > 25);
assert_eq!(filtered.rows(), 2);
}
#[test]
fn test_with_column() {
let df = df_with_column("x", vec![1, 2, 3]);
let result = df.with_column("y", |row| row["x"] * 2);
assert_eq!(result.columns(), 2);
assert_eq!(result.get("y", 0), Value::Integer(2));
}
#[test]
fn test_transform_existing() {
let df = df_with_column("x", vec![1, 2, 3]);
let result = df.transform("x", |val| val * 10);
assert_eq!(result.get("x", 0), Value::Integer(10));
}
```
**Success Criteria**:
- β
Filter works with closures
- β
with_column adds new column
- β
transform modifies existing
- β
All 3 TDD tests passing
- β
Chapter 18 test 02 sections 1,4 pass
**Files to Create**:
- `src/runtime/dataframe_operations.rs`: Filter/transform logic
- `tests/dataframe_operations_tests.rs`: TDD tests
---
### **DF-005: Sort and Aggregation** π P1-HIGH
**Status**: π‘ HIGH PRIORITY - Analytics essential
**Effort**: 1-2 days
**Complexity**: <10 per function
**Requirements**:
```ruchy
// Sort ascending
let sorted = df.sort_by("name");
// Sort descending
let sorted_desc = df.sort_by("score", descending: true);
// Group and aggregate
let grouped = df.group_by("department")
.agg("salary", "mean")
.agg("years", "sum");
```
**Implementation**:
1. `.sort_by(column)` method
2. `.sort_by(column, descending: bool)` variant
3. `.group_by(column)` returns GroupedDataFrame
4. `.agg(column, function)` on grouped DF
5. Aggregation functions: sum, mean, count, min, max
**Test Plan** (TDD):
```rust
#[test]
fn test_sort_ascending() {
let df = df_with_column("x", vec![3, 1, 2]);
let sorted = df.sort_by("x");
assert_eq!(sorted.get("x", 0), Value::Integer(1));
assert_eq!(sorted.get("x", 2), Value::Integer(3));
}
#[test]
fn test_group_by_sum() {
let df = df_with_groups_and_values();
let grouped = df.group_by("category").agg("amount", "sum");
// Verify aggregation results
}
#[test]
fn test_group_by_multiple_aggs() {
let df = employees_df();
let result = df.group_by("dept")
.agg("salary", "mean")
.agg("years", "sum");
assert_eq!(result.columns(), 3); // dept, salary_mean, years_sum
}
```
**Success Criteria**:
- β
Sort ascending/descending works
- β
group_by works
- β
Multiple aggregations work
- β
All 3 TDD tests passing
- β
Chapter 18 test 02 sections 2-3 pass
**Files to Create**:
- `src/runtime/dataframe_aggregation.rs`: Sort/group/agg logic
- `tests/dataframe_aggregation_tests.rs`: TDD tests
---
### **DF-006: Statistics Methods** π P2-MEDIUM
**Status**: π’ ENHANCEMENT - Nice to have
**Effort**: 1 day
**Complexity**: <10 per function
**Requirements**:
```ruchy
let avg = df.column("response_time").mean();
let p95 = df.column("response_time").percentile(95);
let std_dev = df.column("cpu_usage").std();
let smoothed = df.rolling_mean("cpu_usage", window_size);
```
**Implementation**:
1. `.column(name)` returns Column accessor
2. `.mean()` on Column
3. `.percentile(p)` on Column
4. `.std()` on Column
5. `.rolling_mean(column, window)` on DataFrame
**Test Plan** (TDD - Property Tests)**:
```rust
#[test]
fn test_mean_calculation() {
let df = df_with_column("x", vec![1, 2, 3, 4, 5]);
assert_eq!(df.column("x").mean(), 3.0);
}
proptest! {
#[test]
fn prop_mean_is_within_range(values in vec(1i64..100, 1..1000)) {
let df = df_from_values(values.clone());
let mean = df.column("x").mean();
let min = values.iter().min().unwrap();
let max = values.iter().max().unwrap();
prop_assert!(mean >= *min as f64 && mean <= *max as f64);
}
}
#[test]
fn test_rolling_mean() {
let df = df_with_column("x", vec![1, 2, 3, 4, 5]);
let smoothed = df.rolling_mean("x", 3);
// First 2 rows should be nil/nan, then rolling averages
}
```
**Success Criteria**:
- β
mean/std/percentile work
- β
rolling_mean works
- β
Property tests pass (10K+ iterations)
- β
Chapter 18 test 03 passes
**Files to Create**:
- `src/runtime/dataframe_stats.rs`: Statistics calculations
- `tests/dataframe_stats_tests.rs`: TDD + property tests
---
### **DF-007: Polars Integration** π P2-MEDIUM
**Status**: π’ OPTIMIZATION - Performance boost
**Effort**: 1-2 days
**Complexity**: <10 per integration point
**Requirements**:
- Use Polars for actual dataframe storage (not custom structs)
- Leverage Polars' lazy evaluation
- Use Arrow memory format for efficiency
- Performance: 1M rows in <1s
**Implementation**:
1. Replace custom DataFrame struct with Polars wrapper
2. Convert Ruchy Value β Polars Series
3. Convert Polars DataFrame β Ruchy Value
4. Use lazy evaluation where possible
**Test Plan** (Performance + Property Tests):
```rust
#[test]
fn test_polars_large_dataframe() {
let rows: Vec<_> = (0..1_000_000).collect();
let df = DataFrame::from_column("id", rows);
let start = Instant::now();
let filtered = df.filter(|row| row["id"] > 500_000);
assert!(start.elapsed() < Duration::from_secs(1));
}
proptest! {
#[test]
fn prop_polars_operations_equivalent(
values in vec(any::<i64>(), 1..10000)
) {
let custom_df = CustomDataFrame::from(values.clone());
let polars_df = PolarsDataFrame::from(values);
let custom_result = custom_df.filter(|v| v > 50).sum();
let polars_result = polars_df.filter(|v| v > 50).sum();
prop_assert_eq!(custom_result, polars_result);
}
}
```
**Success Criteria**:
- β
Polars backend working
- β
1M rows < 1s performance
- β
Property tests confirm correctness
- β
All existing tests still pass
**Files to Modify**:
- `src/runtime/dataframe_polars.rs`: New Polars integration
- `tests/dataframe_polars_tests.rs`: Performance tests
---
## π
Implementation Timeline (5-7 days)
### Day 1: Foundation (P0 CRITICAL)
- **Morning**: DF-001 - DataFrame literal evaluation (2-3 hours)
- Fix `is_data_structure_expr()`
- Implement `eval_dataframe_literal()`
- TDD tests passing
- **Afternoon**: DF-002 - Constructor API (4 hours)
- Builder pattern implementation
- Accessors (rows, columns, column_names)
- TDD tests passing
### Day 2: Data Import (P1 HIGH)
- **Full Day**: DF-003 - CSV/JSON import
- CSV parsing implementation
- JSON parsing implementation
- Type inference
- TDD + Chapter 18 test 01 passing
### Day 3: Core Operations (P1 HIGH)
- **Full Day**: DF-004 - Filter and Transform
- Filter with closures
- with_column implementation
- transform implementation
- TDD + Chapter 18 test 02 partial passing
### Day 4: Analytics (P1 HIGH)
- **Full Day**: DF-005 - Sort and Aggregation
- Sorting (asc/desc)
- group_by implementation
- Aggregation functions
- TDD + Chapter 18 test 02 complete
### Day 5: Statistics (P2 MEDIUM)
- **Full Day**: DF-006 - Statistics Methods
- mean, std, percentile
- rolling_mean
- Property tests (10K+ iterations)
- Chapter 18 test 03 passing
### Day 6-7: Polars Integration (P2 MEDIUM - Optional)
- **2 Days**: DF-007 - Polars Backend
- Replace custom structs
- Performance optimization
- Property tests for equivalence
- 1M row benchmarks
## π― Success Metrics
### Code Quality (PMAT TDG A-)
- β
All functions <10 complexity
- β
Zero SATD comments
- β
PMAT TDG grade β₯ A- (85 points)
- β
No clippy warnings
### Test Coverage (80%+)
- β
30+ TDD unit tests
- β
10+ property tests (10K iterations each)
- β
3 Chapter 18 integration tests
- β
All tests passing
### Book Compatibility
- β
Chapter 18: 0/4 β 4/4 (100%)
- β
Overall: 92.5% β 96% (64/67 examples)
### Performance (Production Ready)
- β
1M rows processed in <1s
- β
Memory efficient (Arrow format if Polars)
- β
No memory leaks
## π₯ Risk Mitigation
### Risk 1: Polars Integration Complexity
**Mitigation**: Start with custom DataFrame, migrate to Polars in DF-007
**Fallback**: Ship v3.64.0 with custom implementation, Polars in v3.65.0
### Risk 2: Closure Evaluation Complexity
**Mitigation**: Use existing lambda evaluation infrastructure
**Fallback**: Start with simple filters (no closures), add closures in patch
### Risk 3: Time Overrun (>7 days)
**Mitigation**: DF-006 and DF-007 are P2, can defer to v3.65.0
**Minimum Viable**: DF-001 through DF-005 = 100% Chapter 18 compatibility
## π Daily Commit Protocol
Every commit MUST include:
```
[DF-XXX] Brief description
Detailed changes:
- Specific improvements
- Test coverage added
- Complexity maintained <10
TDG Score Changes:
- src/file.rs: XX.XβYY.Y (GradeβGrade) [+Z.Z improvement]
Success Metrics:
- Tests: X passing
- Complexity: All functions <10
- Coverage: XX%
Closes: DF-XXX
```
## π Release Checklist (v3.64.0)
- [ ] DF-001 through DF-005 complete (minimum)
- [ ] All TDD tests passing (30+)
- [ ] Chapter 18: 4/4 examples working
- [ ] Property tests passing (10K+ iterations each)
- [ ] PMAT TDG grade β₯ A-
- [ ] Zero clippy warnings
- [ ] Documentation updated
- [ ] CHANGELOG.md updated
- [ ] Version bump to 3.64.0
- [ ] Published to crates.io
- [ ] GitHub release created
- [ ] Roadmap updated
---
**Date Created**: 2025-10-01
**Sprint Start**: Ready to begin
**Target Completion**: 2025-10-08 (7 days)
**Methodology**: EXTREME TDD + PMAT Quality Gates + Toyota Way