//! # three-dcf-core
//!
//! A high-performance library for encoding documents into structured datasets
//! optimized for LLM training and retrieval-augmented generation (RAG).
//!
//! ## Overview
//!
//! `three-dcf-core` converts various document formats (PDF, Markdown, HTML, images)
//! into a normalized, cell-based representation that preserves document structure
//! while being optimized for machine learning workloads.
//!
//! ## Quick Start
//!
//! ```rust,no_run
//! use three_dcf_core::prelude::*;
//!
//! fn main() -> Result<()> {
//!     // Encode a PDF document
//!     let encoder = Encoder::from_preset("reports")?;
//!     let (document, metrics) = encoder.encode_path("report.pdf")?;
//!
//!     println!("Processed {} pages, {} cells", metrics.pages, metrics.cells_kept);
//!
//!     // Serialize to text format for LLM context
//!     let serializer = TextSerializer::new();
//!     let output = serializer.to_string(&document)?;
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Encoder Presets
//!
//! | Preset | Use Case | Page Size |
//! |--------|----------|-----------|
//! | `reports` | Business documents, papers | 1024×1400 |
//! | `slides` | Presentations | 1920×1080 |
//! | `news` | Articles, blogs | 1100×1600 |
//! | `scans` | Scanned documents | 1400×2000 |
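//!
//! Pick the preset that matches the source layout. For example, a slide deck:
//!
//! ```rust,no_run
//! use three_dcf_core::Encoder;
//!
//! let encoder = Encoder::from_preset("slides")?;
//! # Ok::<(), three_dcf_core::DcfError>(())
//! ```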
//!
//! ## Features
//!
//! - **`text`** (default): Basic text/markdown/HTML processing
//! - **`pdfium`**: Native PDF rendering via pdfium for better extraction
//! - **`ocr`**: Optical character recognition via Tesseract
//! - **`full`**: All features enabled
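//!
//! Optional backends are enabled through Cargo features. For example, to build
//! with native PDF rendering and OCR (the version number below is a placeholder):
//!
//! ```toml
//! [dependencies]
//! three-dcf-core = { version = "0.1", features = ["pdfium", "ocr"] }
//! ```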
//!
//! ## Architecture
//!
//! The encoding pipeline:
//!
//! 1. **Input** → Document loaded from file (PDF/MD/HTML/image)
//! 2. **Parse** → Extract pages and text content
//! 3. **Normalize** → Apply hyphenation rules, detect structure
//! 4. **Classify** → Identify cell types (text, table, code, header)
//! 5. **Score** → Calculate importance scores for ranking
//! 6. **Deduplicate** → Hash-based deduplication across pages
//! 7. **Output** → `Document` with cells, dictionary, and metadata
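//!
//! As a sketch of consuming the pipeline's output (the `cells()` iterator and
//! the per-cell accessors below are illustrative assumptions, not documented API):
//!
//! ```rust,ignore
//! let (document, _metrics) = encoder.encode_path("report.pdf")?;
//! for cell in document.cells() {
//!     // Each cell carries the type from step 4 and the score from step 5.
//!     println!("{:?} (importance {:.2})", cell.kind(), cell.score());
//! }
//! ```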
//!
//! ## Output Formats
//!
//! - **TextSerializer**: Human-readable format for LLM context windows
//! - **JsonlWriter**: JSONL output for dataset pipelines
//! - **Protobuf**: Binary format via `proto` module
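//!
//! `TextSerializer` is shown in the Quick Start above. For JSONL output, a
//! sketch (the `JsonlWriter::new` constructor and `write` method are
//! assumptions here; only the type name is documented):
//!
//! ```rust,ignore
//! let file = std::fs::File::create("dataset.jsonl")?;
//! let mut writer = JsonlWriter::new(file);
//! writer.write(&document)?;
//! ```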
//!
//! ## Example: Custom Configuration
//!
//! ```rust,no_run
//! use three_dcf_core::{EncoderBuilder, HyphenationMode, ImportanceTuning};
//!
//! let encoder = EncoderBuilder::new("reports")?
//!     .budget(Some(4096))     // Token budget
//!     .drop_footers(true)     // Remove page footers
//!     .dedup_window(5)        // Dedup across 5 pages
//!     .hyphenation(HyphenationMode::Preserve)
//!     .importance_tuning(ImportanceTuning {
//!         header_boost: 1.5,
//!         table_boost: 1.2,
//!         ..Default::default()
//!     })
//!     .build();
//! # Ok::<(), three_dcf_core::DcfError>(())
//! ```
//!
//! ## Chunking for RAG
//!
//! ```rust,no_run
//! use three_dcf_core::{Chunker, ChunkConfig, ChunkMode};
//! # use three_dcf_core::Encoder;
//! # let encoder = Encoder::from_preset("reports").unwrap();
//! # let (document, _metrics) = encoder.encode_path("report.pdf").unwrap();
//!
//! let chunker = Chunker::new(ChunkConfig {
//!     mode: ChunkMode::Semantic,
//!     target_tokens: 512,
//!     overlap_tokens: 64,
//!     ..Default::default()
//! });
//!
//! let chunks = chunker.chunk(&document);
//! ```
/// Protobuf-generated types for binary serialization
pub mod proto;

/// Index types for JSONL output (merged from three_dcf_index)
pub mod index;

/// Prelude for convenient imports
pub mod prelude;
// Re-exports for public API.
// The names below are the ones used at the crate root throughout the docs above;
// the source module paths are assumptions and may differ from the actual layout.
pub use chunk::{ChunkConfig, ChunkMode, Chunker};
pub use decoder::Decoder;
pub use document::{CellRecord, Document};
pub use encoder::{Encoder, EncoderBuilder, HyphenationMode, ImportanceTuning};
pub use error::{DcfError, Result};
pub use serializer::{JsonlWriter, TextSerializer};
// Re-export index types at crate root for convenience
// (a glob is assumed here; see the note below for the one deliberate exception).
pub use index::*;
// Note: index::CellRecord conflicts with document::CellRecord; refer to it via `index::CellRecord` explicitly.
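// Example disambiguation (illustrative):
//     use three_dcf_core::CellRecord;                       // document::CellRecord
//     use three_dcf_core::index::CellRecord as IndexRecord; // the index-side record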