---
title: "ADR-009: Reduced Default Chunk Size"
description: "Decision to reduce default chunk size for better granularity."
sidebar:
  label: "ADR-009: Reduced Chunk Size"
---
# ADR-009: Reduced Default Chunk Size
## Status
Accepted
## Context
### Background and Problem Statement
The default chunk size affects:
- Search precision: Smaller chunks = more precise retrieval
- Context efficiency: Smaller chunks = less wasted context
- Embedding quality: Embedding models have optimal input sizes
- Chunking overhead: More chunks = more storage and processing
The original default of 4000 bytes was chosen conservatively. Production usage showed that smaller chunks would improve retrieval precision and context efficiency.
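To make "chunk size in bytes" concrete, the following is a minimal sketch of fixed-size splitting, not the project's actual chunker. The function name `split_by_bytes` is hypothetical, and the sketch assumes non-overlapping chunks that only need to respect UTF-8 character boundaries.

```rust
/// Split `text` into chunks of at most `max_bytes` bytes each, never cutting
/// inside a UTF-8 character. Hypothetical helper for illustration only.
fn split_by_bytes(text: &str, max_bytes: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + max_bytes).min(text.len());
        // Back up until `end` lands on a character boundary.
        while end > start && !text.is_char_boundary(end) {
            end -= 1;
        }
        if end == start {
            // `max_bytes` is smaller than the next character: take the whole
            // character so the loop always makes progress.
            end = start + text[start..].chars().next().map_or(1, |c| c.len_utf8());
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}
```

Under these assumptions, a 4000-byte ASCII document yields one chunk at the old default and two at a 2000-byte limit. A real strategy would likely also prefer sentence or paragraph boundaries.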
### Current Limitations
1. **Large chunks dilute relevance**: A 4000-byte chunk may contain one relevant sentence and much irrelevant content
2. **Context waste**: Retrieved chunks often contain more text than needed
3. **Embedding quality**: Longer text may exceed the input length at which embedding models perform best
## Decision Drivers
### Primary Decision Drivers
1. **Search precision**: Smaller chunks increase retrieval precision
2. **Context efficiency**: Smaller chunks reduce wasted LLM context tokens
3. **Embedding model alignment**: 2000 bytes aligns better with typical model token limits
### Secondary Decision Drivers
1. **User feedback**: Reports of imprecise search results
2. **Empirical testing**: Better search quality observed with smaller chunks
3. **Backward compatibility**: Existing databases should still work (they keep their chunk size)
## Considered Options
### Option 1: Reduce to 2000 bytes
**Description**: Change default from 4000 to 2000 bytes.
**Technical Characteristics**:
- ~500-600 tokens per chunk (model-dependent)
- Fits comfortably in embedding model context
- Good balance of precision and coherence
**Advantages**:
- More precise search results
- Less wasted context in retrieved chunks
- Better embedding quality
- Still large enough for coherent units
**Disadvantages**:
- More chunks per document
- Slightly more storage overhead
- May break some semantic units
**Risk Assessment**:
- **Technical Risk**: Low. Simple constant change
- **Schedule Risk**: Low. Minimal code change
- **Ecosystem Risk**: Low. Backwards compatible
### Option 2: Keep 4000 bytes
**Description**: Maintain status quo.
**Technical Characteristics**:
- ~1000-1200 tokens per chunk
- Current behavior preserved
**Advantages**:
- No change required
- Preserves larger semantic units
**Disadvantages**:
- Continues precision issues
- Wastes context on irrelevant content
**Risk Assessment**:
- **Technical Risk**: None. No change
- **Schedule Risk**: None. No change
- **Ecosystem Risk**: Low. Status quo
### Option 3: Reduce to 1000 bytes
**Description**: More aggressive reduction to 1000 bytes.
**Technical Characteristics**:
- ~250-300 tokens per chunk
- Very fine-grained retrieval
**Advantages**:
- Maximum precision
- Minimal context waste
**Disadvantages**:
- May fragment semantic units
- Many more chunks to manage
- Higher storage overhead
- May lose broader context
**Risk Assessment**:
- **Technical Risk**: Medium. May be too aggressive
- **Schedule Risk**: Low. Simple change
- **Ecosystem Risk**: Low. Backwards compatible
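The token figures quoted for the three options can be roughly reproduced with a fixed bytes-per-token ratio. The sketch below assumes about 4 bytes per token, which matches the lower end of each range above. The real ratio depends on the tokenizer and the text, and the names `OPTION_SIZES`, `estimate_tokens`, and `option_estimates` are hypothetical.

```rust
/// Chunk sizes (in bytes) of the three options considered in this ADR.
const OPTION_SIZES: [usize; 3] = [1000, 2000, 4000];

/// Rough token estimate for a chunk, assuming a fixed bytes-per-token ratio.
/// `bytes_per_token` is an assumption and should be positive; 4.0 is a common
/// rule of thumb for English text.
fn estimate_tokens(chunk_bytes: usize, bytes_per_token: f64) -> usize {
    (chunk_bytes as f64 / bytes_per_token).round() as usize
}

/// Pair each option's chunk size with its estimated token count.
fn option_estimates(bytes_per_token: f64) -> Vec<(usize, usize)> {
    OPTION_SIZES
        .iter()
        .map(|&size| (size, estimate_tokens(size, bytes_per_token)))
        .collect()
}
```

With a ratio of 4.0 this gives 250, 500, and 1000 tokens. A denser ratio moves each estimate toward the upper end of its quoted range.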
## Decision
Reduce the default chunk size from 4000 to 2000 bytes.
The implementation will:
- Change `DEFAULT_CHUNK_SIZE` constant from 4000 to 2000
- Existing databases retain their original chunk sizes
- Users can override with `--chunk-size` flag
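A minimal sketch of how these three points could fit together is shown below. Only the `DEFAULT_CHUNK_SIZE` name and its value come from this ADR. The function `effective_chunk_size` and the precedence order (CLI flag first, then the size stored with an existing database, then the default) are assumptions for illustration.

```rust
/// Default chunk size in bytes, as decided in this ADR.
pub const DEFAULT_CHUNK_SIZE: usize = 2000;

/// Resolve the chunk size to use.
///
/// Assumed precedence (not confirmed by the ADR):
/// 1. an explicit `--chunk-size` value,
/// 2. the size recorded in an existing database,
/// 3. `DEFAULT_CHUNK_SIZE`.
pub fn effective_chunk_size(cli_override: Option<usize>, stored: Option<usize>) -> usize {
    cli_override.or(stored).unwrap_or(DEFAULT_CHUNK_SIZE)
}
```

Under this ordering, a database created with 4000-byte chunks keeps that size unless the user passes the flag, and a new database with no flag gets 2000.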
## Consequences
### Positive
1. **Improved precision**: Search results contain more focused, relevant content
2. **Better context efficiency**: Fewer wasted tokens in LLM context
3. **Embedding alignment**: 2000 bytes fits well within the optimal input range of embedding models
4. **User satisfaction**: Addresses feedback about imprecise results
### Negative
1. **More chunks**: Documents produce ~2x more chunks than before
2. **Re-chunking needed**: Users who want the new default must reload their documents
3. **Storage increase**: Slightly more metadata per document
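The "~2x" figure follows from ceiling division of document size by chunk size. The sketch below assumes fixed-size, non-overlapping chunks, and `chunk_count` is a hypothetical helper rather than a function from the codebase.

```rust
/// Number of fixed-size, non-overlapping chunks needed to cover a document.
/// Assumes every chunk except possibly the last is exactly `chunk_size` bytes.
fn chunk_count(doc_bytes: usize, chunk_size: usize) -> usize {
    assert!(chunk_size > 0, "chunk size must be positive");
    doc_bytes.div_ceil(chunk_size)
}
```

For a 100,000-byte document this gives 25 chunks at 4000 bytes and 50 at 2000 bytes, exactly double. For short documents the ratio varies: a 5,000-byte document goes from 2 chunks to 3.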
### Neutral
1. **Backward compatibility**: Existing databases continue to work
## Decision Outcome
The 2000-byte default provides a better balance of precision and coherence based on production usage feedback. Users who prefer larger chunks can still use `--chunk-size 4000`.
Mitigations:
- Document the change in the CHANGELOG
- Provide migration guidance for users wanting to re-chunk
- Keep `--chunk-size` flag for customization
## Related Decisions
- [ADR-004: Multiple Chunking Strategies](004-multiple-chunking-strategies.md) - Chunking framework
- [ADR-008: Hybrid Search](008-hybrid-search-with-rrf.md) - Chunk size affects search quality
## Links
- [CHANGELOG v1.1.2](../../CHANGELOG.md) - Release notes documenting the change
## More Information
- **Date:** 2025-01-18
- **Source:** v1.1.2 release based on user feedback
- **Related ADRs:** ADR-004, ADR-008
## Audit
### 2025-01-20
**Status:** Compliant
**Findings:**
| Finding | Files | Lines | Assessment |
|---------|-------|-------|------------|
| DEFAULT_CHUNK_SIZE = 2000 | `src/chunking/mod.rs` | - | compliant |
| --chunk-size flag available | `src/main.rs` | - | compliant |
| CHANGELOG documents change | `CHANGELOG.md` | v1.1.2 | compliant |
**Summary:** Default chunk size reduced to 2000 bytes with CLI override available.
**Action Required:** None
### 2026-01-19
**Status:** Superseded in practice
**Findings:**
| Finding | Files | Lines | Assessment |
|---------|-------|-------|------------|
| DEFAULT_CHUNK_SIZE = 3000 | `src/chunking/mod.rs` | - | changed from 2000 |
| MAX_CHUNK_SIZE = 50000 | `src/chunking/mod.rs` | - | reduced from 250000 |
**Summary:** In v1.1.2, the default chunk size was revised to 3,000 characters. The implementation
had drifted from this ADR's 2,000-byte target to 240,000 before the v1.1.2 correction. The maximum
was also reduced to 50,000 (from 250,000). The CLI `--chunk-size` override remains available.
**Action Required:** None. The spirit of this ADR (reduce chunk size for better search
precision) remains valid. The exact value (3,000) is documented in the CHANGELOG under v1.1.2.