§three-dcf-core
A high-performance library for encoding documents into structured datasets optimized for LLM training and retrieval-augmented generation (RAG).
§Overview
three-dcf-core converts various document formats (PDF, Markdown, HTML, images)
into a normalized, cell-based representation that preserves document structure
while being optimized for machine learning workloads.
§Quick Start
use three_dcf_core::prelude::*;
fn main() -> Result<()> {
    // Encode a PDF document
    let encoder = Encoder::from_preset("reports")?;
    let (document, metrics) = encoder.encode_path("report.pdf")?;
    println!("Processed {} pages, {} cells", metrics.pages, metrics.cells_kept);

    // Serialize to text format for LLM context
    let serializer = TextSerializer::new();
    let output = serializer.to_string(&document)?;
    Ok(())
}

§Encoder Presets
| Preset | Use Case | Page Size |
|---|---|---|
| reports | Business documents, papers | 1024×1400 |
| slides | Presentations | 1920×1080 |
| news | Articles, blogs | 1100×1600 |
| scans | Scanned documents | 1400×2000 |
§Features
- text (default): Basic text/markdown/HTML processing
- pdfium: Native PDF rendering via pdfium for better extraction
- ocr: Optical character recognition via Tesseract
- full: All features enabled
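Features are selected in Cargo.toml as usual. A hypothetical manifest entry enabling native PDF rendering and OCR (the version requirement is a placeholder, not taken from this documentation):

```toml
[dependencies]
three-dcf-core = { version = "*", features = ["pdfium", "ocr"] }
```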
§Architecture
The encoding pipeline:
- Input → Document loaded from file (PDF/MD/HTML/image)
- Parse → Extract pages and text content
- Normalize → Apply hyphenation rules, detect structure
- Classify → Identify cell types (text, table, code, header)
- Score → Calculate importance scores for ranking
- Deduplicate → Hash-based deduplication across pages
- Output → Document with cells, dictionary, and metadata
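The deduplication stage above can be sketched in plain Rust. This is an illustrative stand-in, not the crate's implementation: `cell_hash` and `dedup_pages` are hypothetical names, and the real pipeline hashes normalized cells rather than raw strings.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hash a cell's whitespace-normalized text (illustrative stand-in
/// for the crate's cell hashing).
fn cell_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.split_whitespace().collect::<Vec<_>>().join(" ").hash(&mut h);
    h.finish()
}

/// Keep only cells whose hash has not been seen in the last `window` pages;
/// repeated headers/footers fall out naturally.
fn dedup_pages(pages: &[Vec<&str>], window: usize) -> Vec<Vec<String>> {
    let mut recent: VecDeque<HashSet<u64>> = VecDeque::new();
    let mut out = Vec::new();
    for page in pages {
        let mut kept = Vec::new();
        let mut page_hashes = HashSet::new();
        for cell in page {
            let h = cell_hash(cell);
            let seen = recent.iter().any(|s| s.contains(&h));
            if !seen && page_hashes.insert(h) {
                kept.push(cell.to_string());
            }
        }
        recent.push_back(page_hashes);
        if recent.len() > window {
            recent.pop_front();
        }
        out.push(kept);
    }
    out
}

fn main() {
    let pages = vec![
        vec!["Header: Annual Report", "Revenue grew 12%."],
        vec!["Header: Annual Report", "Costs fell 3%."],
    ];
    let deduped = dedup_pages(&pages, 5);
    // The repeated header is dropped on the second page.
    assert_eq!(deduped[1], vec!["Costs fell 3%."]);
    println!("{:?}", deduped);
}
```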
§Output Formats
- TextSerializer: Human-readable format for LLM context windows
- JsonlWriter: JSONL output for dataset pipelines
- Protobuf: Binary format via the proto module
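As a rough illustration of what a JSONL writer emits (one JSON object per line), here is a minimal std-only sketch; `escape_json` and `to_jsonl` are hypothetical helpers, not part of the crate's JsonlWriter API, which presumably uses a full JSON library.

```rust
/// Minimal JSON string escaping for illustration only.
fn escape_json(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Emit one JSONL record per page: {"page": n, "text": "..."}.
fn to_jsonl(pages: &[&str]) -> String {
    pages
        .iter()
        .enumerate()
        .map(|(i, text)| format!("{{\"page\":{},\"text\":\"{}\"}}", i + 1, escape_json(text)))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let lines = to_jsonl(&["First page", "Line with \"quotes\""]);
    // Two input pages produce two JSONL lines.
    assert_eq!(lines.lines().count(), 2);
    println!("{}", lines);
}
```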
§Example: Custom Configuration
use three_dcf_core::{EncoderBuilder, HyphenationMode, ImportanceTuning};
let encoder = EncoderBuilder::new("reports")?
    .budget(Some(4096))      // Token budget
    .drop_footers(true)      // Remove page footers
    .dedup_window(5)         // Dedup across 5 pages
    .hyphenation(HyphenationMode::Preserve)
    .importance_tuning(ImportanceTuning {
        header_boost: 1.5,
        table_boost: 1.2,
        ..Default::default()
    })
    .build();

§Chunking for RAG
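Conceptually, chunking slides a token window with a fixed overlap across the document. A minimal sketch of that mechanism in plain Rust — the `chunk_tokens` helper is hypothetical, and the crate's ChunkMode::Semantic would additionally respect cell and section boundaries rather than splitting on raw whitespace:

```rust
/// Split whitespace tokens into windows of `target` tokens,
/// advancing by `target - overlap` each step.
fn chunk_tokens(text: &str, target: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < target, "overlap must be smaller than the chunk size");
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let step = target - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + target).min(tokens.len());
        chunks.push(tokens[start..end].join(" "));
        if end == tokens.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "one two three four five six seven eight";
    // Windows of 4 tokens with 1 token of overlap: the last token of
    // each chunk reappears as the first token of the next.
    let chunks = chunk_tokens(text, 4, 1);
    assert_eq!(chunks.len(), 3);
    assert!(chunks[1].starts_with("four"));
    println!("{:?}", chunks);
}
```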
use three_dcf_core::{Chunker, ChunkConfig, ChunkMode};
let chunker = Chunker::new(ChunkConfig {
    mode: ChunkMode::Semantic,
    target_tokens: 512,
    overlap_tokens: 64,
    ..Default::default()
});
let chunks = chunker.chunk(&document);

Re-exports§
pub use index::DocumentRecord;
pub use index::JsonlWriter;
pub use index::PageRecord;
Modules§
- index
- Index types for JSONL output and dataset pipelines (merged from three_dcf_index).
- prelude
- Prelude module for convenient imports.
- proto
- Protobuf-generated types for binary serialization.
Structs§
- BenchConfig
- BenchResult
- BenchRunner
- CellRecord
- ChunkConfig
- ChunkRecord
- Chunker
- CorpusMetrics
- Decoder
- Document
- EmbeddingRecord
- EncodeInput
- Encoder
- EncoderBuilder
- HashEmbedder
- HashEmbedderConfig
- Header
- ImportanceTuning
- IngestOptions
- Metrics
- NumGuard
- NumGuardAlert
- NumStats
- PageInfo
- Stats
- TextSerializer
- TextSerializerConfig
- TokenMetrics