Expand description
On-disk format for one FTS posting list (Phase 8c).
Each cell carries either a posting list for one term — (term, [(rowid, term_freq), ...]) — or, in a single sidecar cell with
term.is_empty(), the per-doc length map (rowid, doc_len). Cells
live on TableLeaf pages identical to regular table data trees, so
the slot directory + sibling next_page chain + interior-page
mechanics from Phase 3d work without FTS-specific page plumbing.
Reusing the table-tree shape lets Cell::peek_rowid work uniformly
across cell kinds: it skips cell_length | kind_tag and reads the
first varint, which is cell_id here. cell_id is a sequential
integer assigned at save time (1, 2, 3 …), not a row identifier —
the B-Tree just needs an ordered key for slot directory binary
search; the actual data is keyed by term.
cell_length varint bytes after this field
kind_tag u8 = 0x06 (KIND_FTS_POSTING)
cell_id zigzag varint sequential B-Tree slot key
term_len varint length of `term` in bytes; 0 → sidecar
term term_len bytes ASCII-lowercased term per Phase 8 Q3
count varint number of (rowid, value) pairs
for each:
rowid zigzag varint the row this entry belongs to
value varint term frequency, or doc length when
term_len == 0 (sidecar cell)One sidecar cell suffices for the entire index: it lists every
indexed doc with its tokenized length, including the zero-length
corner case (a row whose text tokenizes to nothing — still indexed
so len() and total_docs round-trip). Posting cells follow.
No null bitmap, no per-field type tag — every field has a fixed type. The encoding is deliberately minimal because long posting lists dominate disk usage on real corpora.
Structs§
- FtsPosting
Cell - One FTS posting list cell — either a per-term postings entry or the
single doc-lengths sidecar (when
term.is_empty()).