Module index


Index types for JSONL output and dataset pipelines (merged from three_dcf_index).

This module provides data structures for exporting documents to JSONL format, suitable for downstream ML pipelines and vector databases.

§Example

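use three_dcf_core::index::{DocumentRecord, JsonlWriter};
use std::fs::File;

// The `?` operator needs an enclosing function that returns a `Result`.
// Assumption: the error types involved implement `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create("output.jsonl")?;
    let mut writer = JsonlWriter::new(file);

    // Write one document-level metadata record to the output file.
    writer.write_record(&DocumentRecord {
        doc_id: "doc_001".to_string(),
        title: Some("Annual Report 2024".to_string()),
        source_type: "files".to_string(),
        source_format: "pdf".to_string(),
        source_ref: "/data/reports/annual_2024.pdf".to_string(),
        tags: vec!["finance".to_string(), "annual".to_string()],
    })?;

    Ok(())
}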
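
For readers unfamiliar with the format, JSONL stores one JSON object per line, so a writer can emit records one at a time without holding the whole dataset in memory. The sketch below illustrates that idea using only the standard library. It is not this crate's implementation: `SketchRecord`, `SketchJsonlWriter`, `escape_json`, and `into_inner` are hypothetical names introduced for illustration, the record carries only a subset of the fields shown above, and the real `JsonlWriter` may serialize differently (for example through a serialization library).

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a document record (not the crate's `DocumentRecord`).
struct SketchRecord {
    doc_id: String,
    title: Option<String>,
    tags: Vec<String>,
}

/// Escape a string so it can sit inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical streaming writer: each record becomes exactly one line.
struct SketchJsonlWriter<W: Write> {
    inner: W,
}

impl<W: Write> SketchJsonlWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner }
    }

    /// Serialize one record as a single JSON object followed by a newline.
    fn write_record(&mut self, r: &SketchRecord) -> io::Result<()> {
        let title = match &r.title {
            Some(t) => format!("\"{}\"", escape_json(t)),
            None => "null".to_string(),
        };
        let tags: Vec<String> = r
            .tags
            .iter()
            .map(|t| format!("\"{}\"", escape_json(t)))
            .collect();
        writeln!(
            self.inner,
            "{{\"doc_id\":\"{}\",\"title\":{},\"tags\":[{}]}}",
            escape_json(&r.doc_id),
            title,
            tags.join(",")
        )
    }

    /// Give back the underlying sink, e.g. to inspect an in-memory buffer.
    fn into_inner(self) -> W {
        self.inner
    }
}
```

Because the sketch is generic over `Write`, the same code can target a `File`, a `Vec<u8>`, or a buffered writer. Since each line is an independent JSON object, downstream tools can read the output line by line.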

Structs§

CellRecord
Record for a single cell (text block) within a page.
DocumentRecord
Metadata record for a processed document.
JsonlWriter
A streaming JSONL writer for efficient dataset export.
PageRecord
Metadata record for a single page within a document.