Skip to main content

Module fts_cell

Module fts_cell 

Source
Expand description

On-disk format for one FTS posting list (Phase 8c).

Each cell carries either a posting list for one term — (term, [(rowid, term_freq), ...]) — or, in a single sidecar cell with term.is_empty(), the per-doc length map (rowid, doc_len). Cells live on TableLeaf pages identical to regular table data trees, so the slot directory + sibling next_page chain + interior-page mechanics from Phase 3d work without FTS-specific page plumbing.

Reusing the table-tree shape lets Cell::peek_rowid work uniformly across cell kinds: it skips cell_length | kind_tag and reads the first varint, which is cell_id here. cell_id is a sequential integer assigned at save time (1, 2, 3 …), not a row identifier — the B-Tree just needs an ordered key for slot directory binary search; the actual data is keyed by term.

  cell_length   varint          bytes after this field
  kind_tag      u8 = 0x06       (KIND_FTS_POSTING)
  cell_id       zigzag varint   sequential B-Tree slot key
  term_len      varint          length of `term` in bytes; 0 → sidecar
  term          term_len bytes  ASCII-lowercased term per Phase 8 Q3
  count         varint          number of (rowid, value) pairs
  for each:
    rowid       zigzag varint   the row this entry belongs to
    value       varint          term frequency, or doc length when
                                term_len == 0 (sidecar cell)

One sidecar cell suffices for the entire index: it lists every indexed doc with its tokenized length, including the zero-length corner case (a row whose text tokenizes to nothing — still indexed so len() and total_docs round-trip). Posting cells follow.

No null bitmap, no per-field type tag — every field has a fixed type. The encoding is deliberately minimal because long posting lists dominate disk usage on real corpora.

Structs§

FtsPostingCell
One FTS posting list cell — either a per-term postings entry or the single doc-lengths sidecar (when term.is_empty()).