Crate three_dcf_core


§three-dcf-core

A high-performance library for encoding documents into structured datasets optimized for LLM training and retrieval-augmented generation (RAG).

§Overview

three-dcf-core converts various document formats (PDF, Markdown, HTML, images) into a normalized, cell-based representation that preserves document structure while being optimized for machine learning workloads.

§Quick Start

use three_dcf_core::prelude::*;

fn main() -> Result<()> {
    // Encode a PDF document
    let encoder = Encoder::from_preset("reports")?;
    let (document, metrics) = encoder.encode_path("report.pdf")?;

    println!("Processed {} pages, {} cells", metrics.pages, metrics.cells_kept);

    // Serialize to text format for LLM context
    let serializer = TextSerializer::new();
    let output = serializer.to_string(&document)?;

    Ok(())
}

§Encoder Presets

Preset  | Use Case                   | Page Size
--------|----------------------------|----------
reports | Business documents, papers | 1024×1400
slides  | Presentations              | 1920×1080
news    | Articles, blogs            | 1100×1600
scans   | Scanned documents          | 1400×2000
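For reference, the table above can be mirrored as a small lookup. `page_size` is a hypothetical helper written only to restate the table; in the crate itself a preset is selected by name via `Encoder::from_preset` or `EncoderBuilder::new`:

```rust
// Page sizes copied verbatim from the preset table above.
// `page_size` is illustrative only, not part of the three_dcf_core API.
fn page_size(preset: &str) -> Option<(u32, u32)> {
    match preset {
        "reports" => Some((1024, 1400)),
        "slides" => Some((1920, 1080)),
        "news" => Some((1100, 1600)),
        "scans" => Some((1400, 2000)),
        _ => None, // unknown preset names have no page size
    }
}

fn main() {
    assert_eq!(page_size("slides"), Some((1920, 1080)));
}
```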

§Features

  • text (default): Basic text/markdown/HTML processing
  • pdfium: Native PDF rendering via pdfium for better extraction
  • ocr: Optical character recognition via Tesseract
  • full: All features enabled
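Assuming the crate is published under the name `three-dcf-core`, optional features would be enabled in `Cargo.toml` in the usual way, e.g.:

```toml
[dependencies]
# `text` is on by default; add the optional features you need,
# or use `full` to enable everything.
three-dcf-core = { version = "0.1", features = ["pdfium", "ocr"] }
```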

§Architecture

The encoding pipeline:

  1. Input → Document loaded from file (PDF/MD/HTML/image)
  2. Parse → Extract pages and text content
  3. Normalize → Apply hyphenation rules, detect structure
  4. Classify → Identify cell types (text, table, code, header)
  5. Score → Calculate importance scores for ranking
  6. Deduplicate → Hash-based deduplication across pages
  7. Output → Document with cells, dictionary, and metadata
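Step 6 can be sketched independently of the crate's actual cell types. The following is a minimal, self-contained illustration of hash-based deduplication (the real pipeline hashes cell payloads and applies a page window; here a `HashSet` over raw strings stands in for both):

```rust
use std::collections::HashSet;

// Keep the first occurrence of each payload; drop repeats.
// `dedup_cells` is an illustrative stand-in, not the crate's API.
fn dedup_cells(cells: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    cells
        .iter()
        .copied()
        .filter(|c| seen.insert(*c)) // insert() returns false for repeats
        .map(String::from)
        .collect()
}

fn main() {
    // A footer repeated on every page collapses to one cell.
    let cells = ["Header", "Body text", "Page footer", "Page footer"];
    assert_eq!(dedup_cells(&cells), vec!["Header", "Body text", "Page footer"]);
}
```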

§Output Formats

  • TextSerializer: Human-readable format for LLM context windows
  • JsonlWriter: JSONL output for dataset pipelines
  • Protobuf: Binary format via proto module
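The JSONL shape is one JSON object per line. A rough std-only sketch of that framing is below; the real `JsonlWriter` and its record fields (`DocumentRecord`, `PageRecord`, etc.) may differ, and `to_jsonl_line` with its `doc_id`/`page`/`text` fields is purely hypothetical:

```rust
// Emit one JSON object per line, with minimal manual escaping
// sufficient for the demo payloads used here.
fn to_jsonl_line(doc_id: &str, page: u32, text: &str) -> String {
    let escaped = text.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        "{{\"doc_id\":\"{}\",\"page\":{},\"text\":\"{}\"}}",
        doc_id, page, escaped
    )
}

fn main() {
    let line = to_jsonl_line("report.pdf", 1, "Quarterly results");
    assert_eq!(
        line,
        "{\"doc_id\":\"report.pdf\",\"page\":1,\"text\":\"Quarterly results\"}"
    );
}
```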

§Example: Custom Configuration

use three_dcf_core::{EncoderBuilder, HyphenationMode, ImportanceTuning, Result};

fn main() -> Result<()> {
    let encoder = EncoderBuilder::new("reports")?
        .budget(Some(4096))           // Token budget
        .drop_footers(true)           // Remove page footers
        .dedup_window(5)              // Dedup across 5 pages
        .hyphenation(HyphenationMode::Preserve)
        .importance_tuning(ImportanceTuning {
            header_boost: 1.5,
            table_boost: 1.2,
            ..Default::default()
        })
        .build();
    Ok(())
}

§Chunking for RAG

use three_dcf_core::prelude::*;
use three_dcf_core::{ChunkConfig, ChunkMode, Chunker};

fn main() -> Result<()> {
    // Encode a document first, then chunk it for retrieval.
    let encoder = Encoder::from_preset("reports")?;
    let (document, _metrics) = encoder.encode_path("report.pdf")?;

    let chunker = Chunker::new(ChunkConfig {
        mode: ChunkMode::Semantic,
        target_tokens: 512,
        overlap_tokens: 64,
        ..Default::default()
    });

    let chunks = chunker.chunk(&document);
    println!("{} chunks produced", chunks.len());
    Ok(())
}
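The `target_tokens`/`overlap_tokens` semantics can be illustrated with a self-contained sketch of fixed-size chunking with overlap. This is not the crate's semantic chunker; whitespace-split words stand in for real tokenizer output, and `chunk_tokens` is a hypothetical helper:

```rust
// Slice a token stream into windows of `target` tokens, with `overlap`
// tokens shared between consecutive chunks.
fn chunk_tokens(tokens: &[&str], target: usize, overlap: usize) -> Vec<Vec<String>> {
    assert!(overlap < target, "overlap must be smaller than the chunk size");
    let step = target - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + target).min(tokens.len());
        chunks.push(tokens[start..end].iter().map(|t| t.to_string()).collect());
        if end == tokens.len() {
            break; // final (possibly short) chunk emitted
        }
        start += step;
    }
    chunks
}

fn main() {
    let tokens: Vec<&str> = "a b c d e f g".split_whitespace().collect();
    // target 4, overlap 1 → [a b c d] and [d e f g], sharing "d".
    let chunks = chunk_tokens(&tokens, 4, 1);
    assert_eq!(chunks.len(), 2);
}
```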

Re-exports§

pub use index::DocumentRecord;
pub use index::JsonlWriter;
pub use index::PageRecord;

Modules§

index
Index types for JSONL output and dataset pipelines (merged from three_dcf_index).
prelude
Prelude module for convenient imports.
proto
Protobuf-generated types for binary serialization

Structs§

BenchConfig
BenchResult
BenchRunner
CellRecord
ChunkConfig
ChunkRecord
Chunker
CorpusMetrics
Decoder
Document
EmbeddingRecord
EncodeInput
Encoder
EncoderBuilder
HashEmbedder
HashEmbedderConfig
Header
ImportanceTuning
IngestOptions
Metrics
NumGuard
NumGuardAlert
NumStats
PageInfo
Stats
TextSerializer
TextSerializerConfig
TokenMetrics

Enums§

BenchMode
CellType
ChunkMode
DcfError
EncoderPreset
HyphenationMode
NumGuardIssue
TableMode
TokenizerKind

Functions§

cer
estimate_tokens
hash_payload
ingest_to_index
ingest_to_index_with_opts
numeric_stats
wer

Type Aliases§

CodeHash
Result