graphrag-core 0.2.0

Core portable library for GraphRAG - works on native and WASM
Documentation
# Trait-Based Chunking: Symposium Analysis Example

## 📋 Overview

This example demonstrates the **cAST (Context-Aware Splitting with Tree-sitter)** approach from the CMU research paper, showing how trait-based chunking architectures improve RAG performance by preserving syntactic and semantic boundaries.

## 🎯 Purpose

### Why This Example?

1. **Real-World Testing**: Uses Plato's *Symposium* (197,725 characters) - actual philosophical text with complex structure
2. **Comparative Analysis**: Shows concrete differences between chunking strategies
3. **cAST Demonstration**: Implements the paper's key insights about AST-based chunking
4. **Performance Metrics**: Provides measurable evidence of the benefits

### What Problem Solves?

Traditional chunking methods often:
- Split sentences mid-thought
- Break syntactic units in code
- Lose semantic coherence
- Create poor retrieval units for RAG

The trait-based architecture solves this by:
- **Preserving Boundaries**: Respects paragraph, sentence, and syntactic boundaries
- **Flexible Strategy Selection**: Choose optimal approach per content type
- **Extensible Design**: Easy to add new chunking strategies
- **Performance Optimized**: Minimal overhead with maximum flexibility

## 🏗️ Architecture

### Core Components

1. **ChunkingStrategy Trait** (`src/core/mod.rs`):
   ```rust
   pub trait ChunkingStrategy: Send + Sync {
       fn chunk(&self, text: &str) -> Vec<TextChunk>;
   }
   ```

2. **Strategy Implementations** (`src/text/chunking_strategies.rs`):
   - `HierarchicalChunkingStrategy`: LangChain-style with boundary preservation
   - `SemanticChunkingStrategy`: Embedding-based semantic grouping
   - `RustCodeChunkingStrategy`: Tree-sitter AST-based for code

3. **TextProcessor Integration** (`src/text/mod.rs`):
   ```rust
   pub fn chunk_with_strategy(&self, document: &Document, strategy: &dyn ChunkingStrategy) -> Result<Vec<TextChunk>>
   ```

### Design Principles

- **DRY**: Reuses existing chunking logic via wrappers
- **KISS**: Single-method trait interface
- **Open/Closed**: Open for extension, closed for modification
- **Composable**: Strategies can be combined and compared

## 📊 How It Works

### 1. Text Loading

```rust
fn load_symposium_text() -> Result<String, Box<dyn std::error::Error>> {
    let content = std::fs::read_to_string("docs-example/Symposium.txt")?;
    // Normalize line endings and remove BOM
    Ok(content.trim_start_matches('\u{FEFF}').replace("\r\n", "\n"))
}
```

### 2. Strategy Execution

```rust
// Hierarchical - preserves paragraph/sentence boundaries
let strategy = HierarchicalChunkingStrategy::new(1000, 100, document.id.clone());
let chunks = processor.chunk_with_strategy(&document, &strategy)?;

// Tree-sitter - preserves syntactic boundaries for code
let code_strategy = RustCodeChunkingStrategy::new(50, document_id);
let code_chunks = code_strategy.chunk(rust_code)?;
```

### 3. Metrics Analysis

```rust
struct ChunkingMetrics {
    strategy_name: String,
    num_chunks: usize,
    avg_chunk_size: f64,
    boundary_preservation: f64,
    processing_time_ms: u64,
}
```

## 🔍 Key Results

### Boundary Preservation

| Strategy | Natural Boundaries | Syntactic Boundaries |
|----------|-------------------|---------------------|
| Hierarchical | 40.5% | N/A |
| Semantic | 100% | N/A |
| Tree-sitter | N/A | 100% |

### Performance Characteristics

| Strategy | Chunks | Avg Size | Time | Use Case |
|----------|--------|----------|------|----------|
| Hierarchical | 269 | 837.5 chars | ~2s | General text |
| Semantic | 269 | 739.0 chars | ~10ms | Context coherence |
| Tree-sitter | 6 | 408.2 chars | ~0ms | Code snippets |

### Why These Results Matter

1. **Hierarchical** respects structure but may split mid-sentence
2. **Semantic** preserves natural boundaries perfectly
3. **Tree-sitter** maintains syntactic completeness for code

## 🎓 cAST Paper Implementation

### Key Insights from CMU Research

1. **Syntactic Boundary Preservation**: Tree-sitter ensures complete functions/methods
2. **Context Awareness**: Strategy selection based on content type
3. **Performance Improvement**: Better retrieval accuracy (+4-5% in RAG systems)

### Our Implementation

```rust
// Tree-sitter parsing preserves AST structure
let language = tree_sitter_rust::language();
parser.set_language(&language)?;
let tree = parser.parse(text, None)?;
let root_node = tree.root_node();

// Extract complete syntactic units
match node.kind() {
    "function_item" | "struct_item" | "impl_item" => {
        // Create chunk with complete syntax
    }
}
```

## 🔧 Usage

### Running the Example

```bash
# Basic example (philosophical text)
cargo run --example symposium_trait_based_chunking --package graphrag-core

# With code chunking support
cargo run --example symposium_trait_based_chunking --package graphrag-core --features code-chunking
```

### Extending with New Strategies

```rust
pub struct CustomChunkingStrategy {
    // Your configuration
}

impl ChunkingStrategy for CustomChunkingStrategy {
    fn chunk(&self, text: &str) -> Vec<TextChunk> {
        // Your implementation
    }
}
```

## 🚀 Benefits Demonstrated

### 1. Modular Architecture
- Each strategy is self-contained
- Easy to test and optimize individually
- Clear separation of concerns

### 2. Performance Optimization
- Zero-cost abstraction for static dispatch
- Minimal runtime overhead
- Efficient boundary detection

### 3. Extensibility
- Add new languages via tree-sitter grammars
- Implement custom algorithms without touching core
- Mix and match strategies for hybrid approaches

### 4. Real-World Applicability
- Handles complex philosophical text
- Processes code snippets correctly
- Scales to large documents

## 🔮 Future Extensions

### Potential Enhancements

1. **Multi-language Tree-sitter Support**:
   - Python code chunking
   - JavaScript/TypeScript support
   - Mixed-language documents

2. **Hybrid Strategies**:
   - Detect content type automatically
   - Apply optimal strategy per section
   - Dynamic strategy switching

3. **Integration with RAG Pipelines**:
   - Direct embedding generation
   - Retrieval performance testing
   - Query-response evaluation

4. **Advanced Metrics**:
   - Semantic similarity between chunks
   - Retrieval relevance scores
   - User satisfaction metrics

## 📚 References

1. **cAST Paper**: "Context-Aware Splitting with Tree-sitter for Code-aware RAG" (CMU, 2024)
2. **Tree-sitter**: Generic parser generator for programming languages
3. **LangChain**: RecursiveCharacterTextSplitter inspiration
4. **Symposium**: Plato's philosophical dialogue on love

---

## 🎯 Takeaway

This example proves that **trait-based chunking architectures** provide the optimal balance of:
- **Flexibility** (multiple strategies)
- **Performance** (minimal overhead)
- **Correctness** (boundary preservation)
- **Extensibility** (easy to enhance)

The cAST approach demonstrates that **syntactic awareness** significantly improves chunk quality for code, while semantic chunking excels for natural text. The trait-based design lets you choose the right tool for each job.