three-dcf-core
A high-performance Rust library for encoding documents into structured datasets optimized for LLM training and retrieval-augmented generation (RAG).
Features
- Multi-format support: PDF, Markdown, HTML, plain text, and images
- Structure preservation: Maintains document hierarchy (headers, tables, code blocks)
- Token-aware: Budget-based encoding with tiktoken-compatible token counting
- Deduplication: Hash-based content deduplication across pages
- NumGuard: Automatic numerical data validation
- Parallel processing: Rayon-powered multi-threaded encoding
- Optional OCR: Tesseract integration for scanned documents
Installation
Add to your Cargo.toml:
[]
= "0.2"
For PDF rendering and OCR support:
[]
= { = "0.2", = ["full"] }
Quick Start
use *;
Encoder Presets
| Preset | Use Case | Resolution |
|---|---|---|
reports |
Business documents, research papers | 1024×1400 |
slides |
Presentations | 1920×1080 |
news |
Articles, blog posts | 1100×1600 |
scans |
Scanned documents | 1400×2000 |
Advanced Configuration
use ;
let encoder = new?
.budget // Limit to 4K tokens
.drop_footers // Remove page footers
.dedup_window // Dedup across 5 consecutive pages
.hyphenation
.importance_tuning
.build;
Chunking for RAG
use ;
let chunker = new;
let chunks = chunker.chunk;
for chunk in chunks
Feature Flags
| Feature | Description | Dependencies |
|---|---|---|
text |
Basic text processing (default) | - |
pdfium |
Native PDF rendering | pdfium-render |
ocr |
Tesseract OCR | leptess |
full |
All features | All above |
Output Formats
Text Serialization (for LLM context)
<ctx3d grid=coarse codeset=HASH256 preset=reports budget=auto>
(z=0,x=64,y=64,w=896,h=24,code=a1b2c3d4...,rle=0,imp=100,type=HEADER) "Quarterly Results"
(z=0,x=64,y=100,w=896,h=200,code=e5f6g7h8...,rle=0,imp=80,type=TABLE) "[table rows=5 cols=4]"
</ctx3d>
JSONL (for ML pipelines)
Performance
Benchmarks on M2 MacBook Pro (8-core):
| Document Type | Pages | Time | Throughput |
|---|---|---|---|
| PDF (text) | 100 | 1.2s | 83 pages/s |
| PDF (scanned) | 100 | 45s | 2.2 pages/s |
| Markdown | 1000 | 0.8s | 1250 files/s |
License
Apache-2.0
Contributing
See CONTRIBUTING.md for guidelines.