pub struct Slab {
pub text: String,
pub start: usize,
pub end: usize,
pub char_start: Option<usize>,
pub char_end: Option<usize>,
pub index: usize,
}
A chunk of text with its position in the original document.
The name “slab” evokes a physical slice of material such as concrete, wood, or stone. Each slab is a self-contained piece that can be embedded, indexed, and retrieved independently.
§Offsets
Primary offsets (start/end) are byte offsets into the original text,
matching Rust’s string slicing semantics:
use code_chunker::Slab;
let text = "Hello, world!";
let slab = Slab::new("world", 7, 12, 0);
// The offsets let you recover the original position
assert_eq!(&text[slab.start..slab.end], "world");
Character offsets (char_start/char_end) are automatically populated
when using Chunker::chunk. They count Unicode
scalar values (chars), which is useful for NLP systems that index by character
position. They are None when a slab comes from Chunker::chunk_bytes
directly or is built with Slab::new, until with_char_offsets or
compute_char_offsets is called.
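The difference between the two kinds of offsets only shows up with non-ASCII text. The following is a minimal sketch of how a byte range maps to a character range, using only the standard library. The `char_offsets` helper is hypothetical and written for illustration; it is not claimed to be how the crate computes its offsets.

```rust
/// Convert a byte range into a range counted in Unicode scalar values.
/// Hypothetical helper for illustration; not part of the crate's API.
fn char_offsets(text: &str, start: usize, end: usize) -> (usize, usize) {
    // Slicing panics if `start` or `end` does not fall on a char boundary.
    let char_start = text[..start].chars().count();
    let char_end = char_start + text[start..end].chars().count();
    (char_start, char_end)
}

fn main() {
    let text = "héllo wörld";
    // 'é' and 'ö' each take two bytes in UTF-8, so "wörld" spans bytes 7..13.
    let (start, end) = (7, 13);
    assert_eq!(&text[start..end], "wörld");

    // Counted in chars, the same span is 6..11.
    assert_eq!(char_offsets(text, start, end), (6, 11));
}
```

For pure ASCII input the two ranges coincide, which is why the earlier "Hello, world!" example cannot tell them apart.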
§Overlap Handling
When chunks overlap, adjacent slabs share some text. The index field
identifies each slab’s position in the sequence:
Original: "The quick brown fox"
Slab 0: "The quick b" [0..11]
Slab 1: "ck brown fox" [7..19] <- overlaps with slab 0
overlap region [7..11] ("ck b")
Fields
text: String
The chunk text.
start: usize
Byte offset where this chunk starts in the original document.
end: usize
Byte offset where this chunk ends (exclusive) in the original document.
char_start: Option<usize>
Character offset where this chunk starts (Unicode scalar values).
None until with_char_offsets or
compute_char_offsets is called.
char_end: Option<usize>
Character offset where this chunk ends (exclusive, Unicode scalar values).
index: usize
Zero-based index of this chunk in the sequence.
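Because start and end are half-open byte ranges, the region shared by two adjacent slabs (see Overlap Handling) can be recovered by intersecting their ranges. The sketch below uses only the standard library; `overlap` is a hypothetical helper, and the offsets are derived from the text with `str::find` rather than taken from the crate.

```rust
use std::ops::Range;

/// Byte range shared by two half-open ranges, or `None` if they are disjoint.
/// Hypothetical helper; pass each slab's `start..end`.
fn overlap(a: Range<usize>, b: Range<usize>) -> Option<Range<usize>> {
    let start = a.start.max(b.start);
    let end = a.end.min(b.end);
    if start < end {
        Some(start..end)
    } else {
        None
    }
}

fn main() {
    let text = "The quick brown fox";
    let first = "The quick b";
    let second = "ck brown fox";

    // Derive the byte ranges from the text instead of hard-coding them.
    let a = 0..first.len();
    let b_start = text.find(second).expect("second chunk occurs in text");
    let b = b_start..b_start + second.len();

    if let Some(shared) = overlap(a, b) {
        let shared_text = &text[shared.clone()];
        println!("shared bytes {:?}: {:?}", shared, shared_text);
    }
}
```

Ranges that merely touch, such as 0..5 and 5..9, share no bytes, so the helper returns `None` for them.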
Implementations
impl Slab
pub fn new(
    text: impl Into<String>,
    start: usize,
    end: usize,
    index: usize,
) -> Self
Create a new slab (byte offsets only; char offsets unset).
pub fn with_char_offsets(self, char_start: usize, char_end: usize) -> Self
Set character offsets on this slab.
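Since with_char_offsets takes self by value and returns Self, it can be chained directly onto new. The sketch below illustrates that pattern with a local stand-in type that mirrors the documented fields and signatures so it compiles on its own. The method bodies are assumptions consistent with the descriptions above, not the crate's actual source; in real code you would import `code_chunker::Slab` instead.

```rust
// Local stand-in mirroring the documented fields and signatures.
// The method bodies are assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Slab {
    pub text: String,
    pub start: usize,
    pub end: usize,
    pub char_start: Option<usize>,
    pub char_end: Option<usize>,
    pub index: usize,
}

impl Slab {
    /// Byte offsets only; char offsets start out unset.
    pub fn new(text: impl Into<String>, start: usize, end: usize, index: usize) -> Self {
        Slab {
            text: text.into(),
            start,
            end,
            char_start: None,
            char_end: None,
            index,
        }
    }

    /// Consumes the slab and returns it with both char offsets set.
    pub fn with_char_offsets(mut self, char_start: usize, char_end: usize) -> Self {
        self.char_start = Some(char_start);
        self.char_end = Some(char_end);
        self
    }
}

fn main() {
    let text = "héllo wörld";

    // Freshly constructed slabs carry byte offsets only.
    let plain = Slab::new("wörld", 7, 13, 0);
    assert_eq!(plain.char_start, None);

    // Chaining the builder fills in the character offsets.
    let slab = Slab::new("wörld", 7, 13, 0).with_char_offsets(6, 11);
    assert_eq!(&text[slab.start..slab.end], slab.text);
    assert_eq!((slab.char_start, slab.char_end), (Some(6), Some(11)));
}
```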