Module tantivy::positions

source ·
Expand description

Tantivy can (if instructed to do so in the schema) store the term positions in a given field. This position is expressed as token ordinal. For instance, In “The beauty and the beast”, the term “the” appears in position 0 and position 3. This information is useful to run phrase queries.

The position file contains all of the bitpacked positions delta, for all terms of a given field, one term after the other.

Each term is encoded independently. Like for posting lists, tantivy relies on simd bitpacking to encode the positions delta in blocks of 128 deltas. Because we rarely have a multiple of 128, the final block encodes the remaining values with variable int encoding.

In order to make reading possible, the term delta positions first encode the number of bitpacked blocks, then the bitwidth for each block, then the actual bitpacked blocks and finally the final variable int encoded block.

Contrary to postings list, the reader does not have access on the number of positions that is encoded, and instead stops decoding the last block when its byte slice has been entirely read.

More formally:

  • Positions := NumBitPackedBlocks BitPackedPositionBlock^(P/128) BitPackedPositionsDeltaBitWidth VIntPosDeltas?
  • NumBitPackedBlocks*: := P / 128 encoded as a variable byte integer.
  • BitPackedPositionBlock := bit width encoded block of 128 positions delta
  • BitPackedPositionsDeltaBitWidth := (BitWidth: u8)^NumBitPackedBlocks
  • VIntPosDeltas := VIntPosDelta^(P % 128).

The skip widths encoded separately makes it easy and fast to rapidly skip over n positions.

Structs§

  • When accessing the positions of a term, we get a positions_idx from the Terminfo. This means we need to skip to the nth position efficiently.
  • The PositionSerializer is in charge of serializing all of the positions of all of the terms of a given field.