§three-dcf-core
A high-performance library for encoding documents into structured datasets optimized for LLM training and retrieval-augmented generation (RAG).
§Overview
three-dcf-core converts various document formats (PDF, Markdown, HTML, images)
into a normalized, cell-based representation that preserves document structure
while being optimized for machine learning workloads.
§Quick Start
use three_dcf_core::prelude::*;
fn main() -> Result<()> {
    // Encode a PDF document
    let encoder = Encoder::from_preset("reports")?;
    let (document, metrics) = encoder.encode_path("report.pdf")?;
    println!("Processed {} pages, {} cells", metrics.pages, metrics.cells_kept);

    // Serialize to text format for LLM context
    let serializer = TextSerializer::new();
    let output = serializer.to_string(&document)?;
    Ok(())
}

§Encoder Presets
| Preset | Use Case | Page Size |
|---|---|---|
| reports | Business documents, papers | 1024×1400 |
| slides | Presentations | 1920×1080 |
| news | Articles, blogs | 1100×1600 |
| scans | Scanned documents | 1400×2000 |
§Features
- text (default): Basic text/markdown/HTML processing
- pdfium: Native PDF rendering via pdfium for better extraction
- ocr: Optical character recognition via Tesseract
- full: All features enabled
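Features are selected in Cargo.toml as usual. A hypothetical manifest entry enabling native PDF rendering and OCR (the version requirement is a placeholder, not taken from this documentation):

```toml
[dependencies]
three-dcf-core = { version = "*", features = ["pdfium", "ocr"] }
```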
§Architecture
The encoding pipeline:
- Input → Document loaded from file (PDF/MD/HTML/image)
- Parse → Extract pages and text content
- Normalize → Apply hyphenation rules, detect structure
- Classify → Identify cell types (text, table, code, header)
- Score → Calculate importance scores for ranking
- Deduplicate → Hash-based deduplication across pages
- Output → Document with cells, dictionary, and metadata
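The deduplication stage above can be sketched in plain Rust. This is an illustrative stand-in, not the crate's implementation: `cell_hash` and `dedup_pages` are hypothetical names, and the real pipeline hashes normalized cells rather than raw strings.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hash a cell's whitespace-normalized text (illustrative stand-in
/// for the crate's cell hashing).
fn cell_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.split_whitespace().collect::<Vec<_>>().join(" ").hash(&mut h);
    h.finish()
}

/// Keep only cells whose hash has not been seen in the last `window` pages;
/// repeated headers/footers fall out naturally.
fn dedup_pages(pages: &[Vec<&str>], window: usize) -> Vec<Vec<String>> {
    let mut recent: VecDeque<HashSet<u64>> = VecDeque::new();
    let mut out = Vec::new();
    for page in pages {
        let mut kept = Vec::new();
        let mut page_hashes = HashSet::new();
        for cell in page {
            let h = cell_hash(cell);
            let seen = recent.iter().any(|s| s.contains(&h));
            if !seen && page_hashes.insert(h) {
                kept.push(cell.to_string());
            }
        }
        recent.push_back(page_hashes);
        if recent.len() > window {
            recent.pop_front();
        }
        out.push(kept);
    }
    out
}

fn main() {
    let pages = vec![
        vec!["Header: Annual Report", "Revenue grew 12%."],
        vec!["Header: Annual Report", "Costs fell 3%."],
    ];
    let deduped = dedup_pages(&pages, 5);
    // The repeated header is dropped on the second page.
    assert_eq!(deduped[1], vec!["Costs fell 3%."]);
    println!("{:?}", deduped);
}
```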
§Output Formats
- TextSerializer: Human-readable format for LLM context windows
- JsonlWriter: JSONL output for dataset pipelines
- Protobuf: Binary format via the proto module
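As a rough illustration of what a JSONL writer emits (one JSON object per line), here is a minimal std-only sketch; `escape_json` and `to_jsonl` are hypothetical helpers, not part of the crate's JsonlWriter API, which presumably uses a full JSON library.

```rust
/// Minimal JSON string escaping for illustration only.
fn escape_json(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Emit one JSONL record per page: {"page": n, "text": "..."}.
fn to_jsonl(pages: &[&str]) -> String {
    pages
        .iter()
        .enumerate()
        .map(|(i, text)| format!("{{\"page\":{},\"text\":\"{}\"}}", i + 1, escape_json(text)))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let lines = to_jsonl(&["First page", "Line with \"quotes\""]);
    // Two input pages produce two JSONL lines.
    assert_eq!(lines.lines().count(), 2);
    println!("{}", lines);
}
```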
§Example: Custom Configuration
use three_dcf_core::{EncoderBuilder, HyphenationMode, ImportanceTuning};
let encoder = EncoderBuilder::new("reports")?
    .budget(Some(4096))      // Token budget
    .drop_footers(true)      // Remove page footers
    .dedup_window(5)         // Dedup across 5 pages
    .hyphenation(HyphenationMode::Preserve)
    .importance_tuning(ImportanceTuning {
        header_boost: 1.5,
        table_boost: 1.2,
        ..Default::default()
    })
    .build();

§Chunking for RAG
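Conceptually, chunking slides a token window with a fixed overlap across the document. A minimal sketch of that mechanism in plain Rust — the `chunk_tokens` helper is hypothetical, and the crate's ChunkMode::Semantic would additionally respect cell and section boundaries rather than splitting on raw whitespace:

```rust
/// Split whitespace tokens into windows of `target` tokens,
/// advancing by `target - overlap` each step.
fn chunk_tokens(text: &str, target: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < target, "overlap must be smaller than the chunk size");
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let step = target - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + target).min(tokens.len());
        chunks.push(tokens[start..end].join(" "));
        if end == tokens.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "one two three four five six seven eight";
    // Windows of 4 tokens with 1 token of overlap: the last token of
    // each chunk reappears as the first token of the next.
    let chunks = chunk_tokens(text, 4, 1);
    assert_eq!(chunks.len(), 3);
    assert!(chunks[1].starts_with("four"));
    println!("{:?}", chunks);
}
```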
use three_dcf_core::{Chunker, ChunkConfig, ChunkMode};
let chunker = Chunker::new(ChunkConfig {
    mode: ChunkMode::Semantic,
    target_tokens: 512,
    overlap_tokens: 64,
    ..Default::default()
});
let chunks = chunker.chunk(&document);

Re-exports§
pub use index::DocumentRecord;
pub use index::JsonlWriter;
pub use index::PageRecord;
Modules§
- index
- Index types for JSONL output and dataset pipelines (merged from three_dcf_index).
- prelude
- Prelude module for convenient imports.
- proto
- Protobuf-generated types for binary serialization.
Structs§
- BenchConfig
- BenchResult
- BenchRunner
- CellRecord
- ChunkConfig
- ChunkRecord
- Chunker
- CorpusMetrics
- Decoder
- Document
- EmbeddingRecord
- EncodeInput
- Encoder
- EncoderBuilder
- HashEmbedder
- HashEmbedderConfig
- Header
- ImportanceTuning
- IngestOptions
- Metrics
- NumGuard
- NumGuardAlert
- NumStats
- PageInfo
- Stats
- TextSerializer
- TextSerializerConfig
- TokenMetrics