Index types for JSONL output and dataset pipelines (merged from three_dcf_index).
This module provides data structures for exporting documents to JSONL format, suitable for downstream ML pipelines and vector databases.
§Example
use three_dcf_core::index::{DocumentRecord, JsonlWriter};
use std::fs::File;

let file = File::create("output.jsonl")?;
let mut writer = JsonlWriter::new(file);
writer.write_record(&DocumentRecord {
    doc_id: "doc_001".to_string(),
    title: Some("Annual Report 2024".to_string()),
    source_type: "files".to_string(),
    source_format: "pdf".to_string(),
    source_ref: "/data/reports/annual_2024.pdf".to_string(),
    tags: vec!["finance".to_string(), "annual".to_string()],
})?;

§Structs
- CellRecord - Record for a single cell (text block) within a page.
- DocumentRecord - Metadata record for a processed document.
- JsonlWriter - A streaming JSONL writer for efficient dataset export.
- PageRecord - Metadata record for a single page within a document.
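
To illustrate what "streaming JSONL" means for a writer like JsonlWriter, here is a minimal standard-library sketch: each record is serialized to one JSON object and written, followed by a newline, as soon as it arrives, so nothing is buffered across records. SketchRecord, SketchJsonlWriter, json_string, and to_json_line are hypothetical names invented for this illustration. They are not the crate's types, and the real JsonlWriter may serialize differently (for example through a serialization library) and emit more fields.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a document record (NOT the crate's DocumentRecord).
struct SketchRecord {
    doc_id: String,
    title: Option<String>,
    tags: Vec<String>,
}

/// Encode a Rust string as a JSON string literal, including the quotes.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl SketchRecord {
    /// Serialize the record as a single-line JSON object (no trailing newline).
    fn to_json_line(&self) -> String {
        let title = match &self.title {
            Some(t) => json_string(t),
            None => "null".to_string(),
        };
        let tags: Vec<String> = self.tags.iter().map(|t| json_string(t)).collect();
        format!(
            "{{\"doc_id\":{},\"title\":{},\"tags\":[{}]}}",
            json_string(&self.doc_id),
            title,
            tags.join(",")
        )
    }
}

/// Hypothetical streaming writer: one JSON object per line, written immediately.
struct SketchJsonlWriter<W: Write> {
    inner: W,
}

impl<W: Write> SketchJsonlWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner }
    }

    /// Write one record followed by a newline; earlier records are not retained.
    fn write_record(&mut self, record: &SketchRecord) -> io::Result<()> {
        self.inner.write_all(record.to_json_line().as_bytes())?;
        self.inner.write_all(b"\n")
    }

    /// Give back the underlying sink, e.g. to inspect an in-memory buffer.
    fn into_inner(self) -> W {
        self.inner
    }
}
```

Because the sketch is generic over std::io::Write, the same code can target a File, a network stream, or a Vec<u8> in tests. Embedded newlines are escaped inside string values, so every record occupies exactly one line and the output can be consumed line by line downstream.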