logo
Expand description

Tantivy can (if instructed to do so in the schema) store the term positions in a given field. This positions are expressed as token ordinal. For instance, In “The beauty and the beast”, the term “the” appears in position 0 and position 4. This information is useful to run phrase queries.

The SegmentComponent::POSITIONS file contains all of the bitpacked positions delta, for all terms of a given field, one term after the other.

Each terms is encoded independently. Like for positing lists, tantivy rely on simd bitpacking to encode the positions delta in blocks of 128 deltas. Because we rarely have a multiple of 128, a final block may encode the remaining values variable byte encoding.

In order to make reading possible, the term delta positions first encodes the number of bitpacked blocks, then the bitwidth for each blocks, then the actual bitpacked block and finally the final variable int encoded block.

Contrary to postings list, the reader does not have access on the number of positions that is encoded, and instead stops decoding the last block when its byte slice has been entirely read.

More formally:

  • Positions := NumBitPackedBlocks BitPackedPositionBlock^(P/128) BitPackedPositionsDeltaBitWidth VIntPosDeltas?
  • NumBitPackedBlocks*: := P / 128 encoded as a variable byte integer.
  • BitPackedPositionBlock := bit width encoded block of 128 positions delta
  • BitPackedPositionsDeltaBitWidth := (BitWidth: u8)^NumBitPackedBlocks
  • VIntPosDeltas := VIntPosDelta^(P % 128).

The skip widths encoded separately makes it easy and fast to rapidly skip over n positions.

Structs

When accessing the position of a term, we get a positions_idx from the Terminfo. This means we need to skip to the nth positions efficiently.

The PositionSerializer is in charge of serializing all of the positions of all of the terms of a given field.