pub struct Chunk {
pub id: Uuid,
pub doc_id: Uuid,
pub text: String,
pub byte_offset: u64,
pub byte_length: u64,
pub sequence: u32,
pub text_hash: [u8; 32],
}
A chunk of text extracted from a document
Documents are split into overlapping chunks for embedding. Each chunk tracks its position within the source document.
Per CP-011: Uses byte-based offsets (not character-based) for accurate slicing back to original document content.
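Byte offsets allow a chunk's text to be sliced straight back out of the source document, because Rust's `str` indexing is byte-based. A minimal sketch (`slice_chunk` is a hypothetical helper, not part of this crate):

```rust
// Recover a chunk's text from the source document using byte offsets.
// `&doc[start..end]` indexes by BYTES, which is exactly what
// byte_offset/byte_length store.
fn slice_chunk(doc: &str, byte_offset: usize, byte_length: usize) -> &str {
    &doc[byte_offset..byte_offset + byte_length]
}

fn main() {
    // 'é' is 2 bytes in UTF-8, so a character-based offset would point at
    // the wrong bytes here.
    let doc = "héllo world";
    // "world" starts at byte 7: h(1) + é(2) + l(1) + l(1) + o(1) + space(1).
    assert_eq!(slice_chunk(doc, 7, 5), "world");
    println!("{}", slice_chunk(doc, 7, 5));
}
```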
Per CP-001: the chunk ID is STABLE; ID = hash(doc_id + sequence) only. This ensures that re-chunking with different parameters produces the same IDs.
Fields
id: Uuid - Unique identifier for this chunk (BLAKE3-16 of doc_id + sequence); STABLE
doc_id: Uuid - Parent document ID
text: String - The actual text content (canonicalized)
byte_offset: u64 - Byte offset within the source document (u64 for large files)
byte_length: u64 - Length of this chunk in bytes (u64 for large files)
sequence: u32 - Sequence number within the document (0-indexed)
text_hash: [u8; 32] - Hash of the canonicalized text content, for verification
Implementations
impl Chunk
pub fn new(doc_id: Uuid, text: String, byte_offset: u64, sequence: u32) -> Self
Create a new chunk with automatic ID generation.
Per CP-001: the chunk ID is STABLE; it does NOT include the text. This ensures that re-chunking with different parameters produces the same IDs. Content is instead verified via the text_hash field.
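The stability property can be sketched as follows. This is an illustration only: `chunk_id` and its use of std's `DefaultHasher` are stand-ins for the crate's actual BLAKE3-16 derivation, and `DefaultHasher` is not stable across Rust releases, so this only demonstrates determinism within a run:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real ID derivation: hash ONLY doc_id and sequence.
fn chunk_id(doc_id: u128, sequence: u32) -> u64 {
    let mut h = DefaultHasher::new();
    doc_id.hash(&mut h);
    sequence.hash(&mut h);
    h.finish()
}

fn main() {
    let doc_id = 0xDEAD_BEEFu128;
    // The chunk text is deliberately NOT an input: re-chunking the same
    // document with different parameters yields the same (doc_id, sequence)
    // pairs for the leading chunks, and therefore the same IDs.
    let before = chunk_id(doc_id, 0); // chunk text was, say, 512 bytes
    let after = chunk_id(doc_id, 0); // re-chunked at 1024 bytes, same ID
    assert_eq!(before, after);
}
```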
pub fn text_hash_hex(&self) -> String
Get the text hash as a hex string
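A plausible sketch of what such an accessor does, assuming lowercase hex with two digits per byte (the helper name `to_hex` is hypothetical):

```rust
// Hex-encode a byte slice: each byte becomes two lowercase hex digits.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    assert_eq!(to_hex(&[0xde, 0xad, 0xbe, 0xef]), "deadbeef");
    // A 32-byte text_hash becomes a 64-character string.
    assert_eq!(to_hex(&[0u8; 32]).len(), 64);
}
```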
pub fn approx_tokens(&self) -> usize
Approximate token count (rough estimate: 4 chars per token)
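A minimal sketch of the heuristic, assuming integer division over the byte length (the real method may count characters or round differently):

```rust
// "4 chars per token" estimate: integer division, so short texts round down.
fn approx_tokens(text: &str) -> usize {
    text.len() / 4
}

fn main() {
    assert_eq!(approx_tokens("hello world!"), 3); // 12 bytes / 4
}
```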