Theory
======
The stringdex information retrieval pipeline goes through the following steps:
1. It generates an implicit suffix tree according to Ukkonen's algorithm
(suffix trees are a data structure described in the background section below).
2. It marks the portion of the suffix tree that's unneeded for full-word
matching, so that it doesn't need to be loaded during Levenshtein search.
3. Then it serializes parts of the tree, hashes them, and uses those hashes
to deduplicate identical subtrees.
4. Then it compresses parts of the tree that are bundled into the same file,
so that they can use relative offsets instead of absolute hash IDs.
Background information: tries and suffix trees
----------------------------------------------
Stringdex is based on a complicated [trie][]. A simple trie is a tree of
characters, and you can construct one like this:
1. Start the `current node` at the special root node, and the `current char`
at the first character (if the string is empty, one-past-the-end).
2. Branch on the value of `current char`:
- If the `current char` is one-past-the-end, you're done. Add the leaf
as a child of the `current node` and exit.
- If the `current node` has a child for the `current char`, move
`current node` to that child. Advance `current char` to the next,
and repeat step 2.
- Otherwise, create a child for `current node` with `current char` in it.
Advance `current char` to next, move `current node` to the child you
just created, and repeat step 2.
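The insertion loop above can be sketched in Rust (a minimal sketch with hypothetical names; the crate's real node type is more involved):

```rust
use std::collections::BTreeMap;

/// A minimal trie node: one child per byte, plus an optional leaf ID.
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    leaf: Option<u32>,
}

/// Insert `key` with document ID `id`, following the three steps above.
fn insert(root: &mut Node, key: &str, id: u32) {
    let mut current = root; // step 1: start at the root
    for b in key.bytes() {
        // step 2: move to the existing child, or create one
        current = current.children.entry(b).or_default();
    }
    // `current char` is one-past-the-end: attach the leaf
    current.leaf = Some(id);
}

fn main() {
    let mut root = Node::default();
    insert(&mut root, "cat", 1);
    insert(&mut root, "cow", 2);
    insert(&mut root, "dog", 3);
    // as in the diagram: the root branches on `c` and `d`
    assert_eq!(root.children.len(), 2);
    // and the `c` node branches on `a` and `o`
    assert_eq!(root.children[&b'c'].children.len(), 2);
}
```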
For example, a trie for `{"cat":1}` looks like this:
( )
│
c
│
a
│
t
│
(1)
Add `{"cow":2,"dog":3}`, and it now looks like this:
( )
┌──┴──┐
c d
┌─┴─┐ │
a o o
│ │ │
t w g
│ │ │
(1) (2) (3)
[trie]: https://en.wikipedia.org/wiki/Trie
The power of the trie, compared to a hash table, is that you can find non-exact
matches for your search query. For example, to find every "c" word, walk the
subtree below the "c" node.
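A sketch of that walk, using a minimal byte trie with hypothetical names (not the crate's real API): descend to the node for the prefix, then collect every leaf below it.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    leaf: Option<u32>,
}

fn insert(root: &mut Node, key: &str, id: u32) {
    let mut cur = root;
    for b in key.bytes() {
        cur = cur.children.entry(b).or_default();
    }
    cur.leaf = Some(id);
}

/// Every document ID whose key starts with `prefix`: walk down to the
/// prefix node, then visit its whole subtree.
fn find_prefix(root: &Node, prefix: &str) -> Vec<u32> {
    let mut cur = root;
    for b in prefix.bytes() {
        match cur.children.get(&b) {
            Some(child) => cur = child,
            None => return Vec::new(), // no word starts with this prefix
        }
    }
    let mut out = Vec::new();
    collect(cur, &mut out);
    out
}

fn collect(node: &Node, out: &mut Vec<u32>) {
    if let Some(id) = node.leaf {
        out.push(id);
    }
    for child in node.children.values() {
        collect(child, out);
    }
}

fn main() {
    let mut root = Node::default();
    for (word, id) in [("cat", 1), ("cow", 2), ("dog", 3)] {
        insert(&mut root, word, id);
    }
    assert_eq!(find_prefix(&root, "c"), vec![1, 2]); // every "c" word
    assert_eq!(find_prefix(&root, "d"), vec![3]);
}
```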
The trie implemented by this crate is a bit more complicated, because we don't
just insert the strings, but every *suffix* of the string, making it a
[suffix tree][]. It's the trie you'd get for
`{"cat":1,"at":1,"t":1,"cow":2,"ow":2,"w":2,"dog":3,"og":3,"g":3}`:
( )
┌─────┬─────┬───┬──┴──┬─────┬───┐
a c d g o t w
│ ┌─┴─┐ │ │ ┌─┴─┐ │ │
t a o o (3) g w (1) (2)
│ │ │ │ │ │
(1) t w g (3) (2)
│ │ │
(1) (2) (3)
[suffix tree]: https://en.wikipedia.org/wiki/Suffix_tree
While an ordinary trie makes it easy to answer the question "what's every word
that starts with 'c'?", a suffix tree answers the question "what's every word
that has an 'o' in it?".
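Conceptually, a suffix tree is the trie you get by inserting every suffix of every word. A naive sketch of the suffix enumeration (the real crate avoids materializing all of these, as Step 1 below explains):

```rust
/// Every suffix of `word`, longest first (a naive sketch; the real
/// crate builds these incrementally with Ukkonen's algorithm instead).
fn suffixes(word: &str) -> Vec<&str> {
    // this is a byte trie, so we slice at byte offsets; the example
    // inputs here are ASCII
    (0..word.len()).map(|i| &word[i..]).collect()
}

fn main() {
    // the suffixes that turn `{"cat":1}` into `{"cat":1,"at":1,"t":1}`
    assert_eq!(suffixes("cat"), vec!["cat", "at", "t"]);
    // total characters across all suffixes of an N-char word: N(N+1)/2
    let total: usize = suffixes("cow").iter().map(|s| s.len()).sum();
    assert_eq!(total, 3 * 4 / 2);
}
```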
A suffix tree is a lot bigger than a regular trie. There are several tactics
that this crate uses to keep the size manageable:
Step 1: Efficient construction of a suffix tree
-----------------------------------------------
This crate contains a version of [Ukkonen's algorithm][], which can
generate suffix trees in non-quadratic time.
How is this possible when "w,ow,cow...somewordendingincow" has a total of
1+2+3+...+N = [N(N+1)/2][triangular number] characters?
The Wikipedia references, and especially [desmond's blog][],
describe it much better than I do. But the key points are:
- A node can contain more than one byte. This makes it necessary to
implement complicated "splitting" logic, but reduces the per-node overhead
by having fewer nodes.
- Each node stores offsets into a shared buffer, instead of storing the
bytes directly. This means that the length of the string associated
with each node has no effect on how much space the node takes up
at construction (the serialized form inlines the text for data locality).
- This is a byte trie. So each node has, at most, 256 children.
[Ukkonen's algorithm]: https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm
[triangular number]: https://en.wikipedia.org/wiki/Triangular_number
[desmond's blog]: http://programmerspatch.blogspot.com/2013/02/ukkonens-suffix-tree-algorithm.html
This means that the suffix tree for `{"cat":1,"cow":2,"dog":3}` becomes this:
Backing buffer: catcowdog
Tree:
( )
┌─────┬─────┬───┬──┴──┬─────┬───┐
1,2 0,0 6,8 8,8 4,4 2,2 5,5
│ ┌─┴─┐ │ │ ┌─┴─┐ │ │
(1) 1,2 4,5 (3) (3) 8,8 5,5 (1) (2)
│ │ │ │
(1) (2) (3) (2)
Notice that `6,8`, the node that used to be the three-node chain `d-o-g`, takes
up the same amount of space as `5,5`, a node representing only `w`.
This code can be found in `src/internals/tree/mod.rs`.
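The offset representation can be sketched like this (hypothetical names; the offsets are inclusive, matching the `6,8` = `dog` convention in the diagram above):

```rust
/// A node labels itself with inclusive offsets into one shared buffer,
/// so its size is independent of how long its label is.
struct SpanNode {
    start: usize,
    end: usize, // inclusive, as in the diagram (`6,8` spans "dog")
}

fn label<'a>(buffer: &'a [u8], node: &SpanNode) -> &'a [u8] {
    &buffer[node.start..=node.end]
}

fn main() {
    let buffer = b"catcowdog"; // the shared backing buffer from the example
    // the three-byte chain d-o-g collapses into one node...
    assert_eq!(label(buffer, &SpanNode { start: 6, end: 8 }), b"dog");
    // ...which takes the same space as a one-byte node like `w`
    assert_eq!(label(buffer, &SpanNode { start: 5, end: 5 }), b"w");
}
```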
Step 2: The suffixes-only flag
------------------------------
The function `make_explicit_tree_and_mark_suffixes_only`, which performs
the last step of Ukkonen's algorithm, also distinguishes suffix leaves
from whole-word leaves, and bubbles up a `suffixes_only` flag on all
branches that transitively contain only suffix leaves.
This results in an ordinary prefix trie interleaved with the
suffix tree. The part of the tree that needs to be loaded when performing a
prefix-only match is represented with thick lines (because the trees
are interleaved, you can't avoid downloading the list of suffix-only children
of the root node, even when doing a no-suffixes search, but those child nodes
don't need to be traversed):
Backing buffer: catcowdog
Tree:
( )
╒═════╦═════╦═══╤══╩══╤═════╤═══╕
1,2 0,0 6,8 8,8 4,4 2,2 5,5
│ ╔═╩═╗ ║ │ ┌─┴─┐ │ │
(1) 1,2 4,5 (3) (3) 8,8 5,5 (1) (2)
║ ║ │ │
(1) (2) (3) (2)
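The bubbling rule can be sketched as a recursion over a toy tree type (hypothetical; the real function also makes the implicit tree explicit and stores the flag on each branch as it goes):

```rust
/// A branch is `suffixes_only` exactly when every leaf below it is a
/// suffix leaf rather than a whole-word leaf.
enum Node {
    Leaf { is_suffix: bool },
    Branch { children: Vec<Node> },
}

fn suffixes_only(node: &Node) -> bool {
    match node {
        Node::Leaf { is_suffix } => *is_suffix,
        Node::Branch { children } => children.iter().all(suffixes_only),
    }
}

fn main() {
    // like the `c` branch above: whole-word leaves keep it "thick"
    let c = Node::Branch {
        children: vec![
            Node::Leaf { is_suffix: false }, // "cat" -> (1)
            Node::Leaf { is_suffix: false }, // "cow" -> (2)
        ],
    };
    assert!(!suffixes_only(&c));
    // like the `o` branch: only suffix leaves below, so it can be
    // skipped during a prefix-only search
    let o = Node::Branch {
        children: vec![
            Node::Leaf { is_suffix: true }, // "og" -> (3)
            Node::Leaf { is_suffix: true }, // "ow" -> (2)
        ],
    };
    assert!(suffixes_only(&o));
}
```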
Step 3: Serialization
---------------------
Once the tree has been constructed, the generator crate needs to store the
tree to disk in a format that the JavaScript-based searcher can read.
While Ukkonen's algorithm is mostly a pure win for generating the tree in the
first place, the on-disk format has more meaningful trade-offs between latency,
total storage used, and generation speed.
- While the crate uses the "huge shared buffer" for construction, the
serialized tree stores the text inline.
- It uses a scheme where multiple tree nodes can be stored in a single file.
- The disk format is a Merkle tree. That is, a node is named after its hash.
Each file has a "subtree root node", whose hash becomes the filename.
By using a Merkle tree, we get HTTP cache busting "for free," since each
node's file is named after its subtree root hash, which changes if the node's
transitive contents change. We also get de-duplication: in the example above,
the `1,2 -> (1)`, `8,8 -> (3)`, and `5,5 -> (2)` subtrees are duplicates.
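The dedup effect can be sketched with Rust's `DefaultHasher` standing in for the crate's real hash function (the 47-bit masking mimics the hash ID layout described later in this step; the rest is an illustrative assumption):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A stand-in node ID: 48 bits with the MSB cleared, leaving 47 bits
/// of hash. The real crate uses its own hash function; `DefaultHasher`
/// is just for the sketch.
fn node_id(serialized: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    serialized.hash(&mut h);
    h.finish() & 0x7FFF_FFFF_FFFF
}

fn main() {
    // a content-addressed store: identical serialized subtrees collapse
    let mut store: HashMap<u64, Vec<u8>> = HashMap::new();
    let subtrees: [&[u8]; 3] = [b"leaf(1)", b"leaf(1)", b"leaf(2)"];
    for s in subtrees {
        store.insert(node_id(s), s.to_vec());
    }
    // the two identical subtrees deduplicate into one entry
    assert_eq!(store.len(), 2);
    // the MSB is always clear, as the hash-form IDs require
    assert_eq!(node_id(b"leaf(1)") >> 47, 0);
}
```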
Each node in the search tree is formatted like this:
* The first byte is used for compression data. In canonical form,
it's always 0. See "Step 4".
* Data length, as one byte. This means a single node can only contain
255 characters of data.
* Data. The initial character, which our parent node has a copy of,
is not stored here and is not considered part of the length.
* Child length counts, one byte each:
- first might-have-prefix child count
(that is, children that are not `suffixes_only`)
- then `suffixes_only` child count
These two sets of children are disjoint, which means there can be a maximum
of 256 children in total between both of them. There is a single byte for
each count, which can only go to 255.
If there are 256 children, then I don't retain the suffixes-only
information: both bytes are set to 255 (which cannot happen otherwise),
and all children are encoded and decoded as-if any of them could have
prefix leaves beneath them. This is safe, because prefix matching needs to
filter out suffixes anyway, and it's not a serious performance problem,
because UTF-8 text (and, I suspect, most data that isn't compressed)
doesn't use every possible byte.
* Six-byte child node hash IDs, first the might-have-prefix ones, then the
suffixes-only ones. If a child node is not found in the same file as this
node, it will be found in a file named after its hash, as hex (12 characters).
As a further optimization, there are two kinds of node IDs. Hashes have
the MSB set to zero (leaving 47 bits of hash strength). Hash collisions
are handled by adding a dummy node child and rehashing.
The other representation is used for nodes that have exactly one leaf
child, and has the MSB set to one. The second-most-significant-bit
is a tag marking the node as either a whole match (1) or a suffix (0).
The second nibble is a tag that indicates whether there's a byte (1) or
not (0). The next byte, if any, is the data. The remaining 32 bits are
the leaf ID.
* One-byte child branches, first the might-have-prefix ones, then the
suffix-only ones. These are in the same order as the hashes, so as to act as
a set of key-value pairs, but are stored this way for compressibility.
* Roaring bitmaps for whole strings and suffixes.
The length of the bitmap is calculated while decoding it, so that the
node is self-terminating.
Roaring bitmaps start with a 32-bit magic number (which may have the
bitmap's length OR-ed with it). If both bitmaps are empty, we write a
single `ff` byte here, which is shorter than encoding two empty sets
normally. If a bitmap has a small number of entries (less than 0x3a),
we write `NN` here, followed by NN 32-bit words, if that turns out to be
a win (the magic number that starts a roaring bitmap is either 3a or 3b).
The canonical format, using the same example as always, formats the root node
like this:
00000000 00 00 02 05 4b db 3e d3 77 46 67 1c cc f3 33 1d |....K.>.wFg...3.|
00000010 81 74 00 00 00 00 80 00 00 00 00 02 2b 00 8a 05 |.t..........+...|
00000020 1f 52 80 00 00 00 00 00 80 00 00 00 00 01 63 64 |.R............cd|
00000030 61 67 6f 74 77 ff |agotw.|
Walking through it chunk-by-chunk:
00000000 00 00 02 05 4b db 3e d3 77 46 67 1c cc f3 33 1d |....K.>.wFg...3.|
┌──── ┌──── ┌───────────────── ┌────────────────
│ │ │ │
│ │ └ `c` hash └ `d` hash
│ │
│ ├ 2 might-have-prefix children
│ └ 5 suffix-only children
│
└ data length (always zero for the root node)
00000010 81 74 00 00 00 00 80 00 00 00 00 02 2b 00 8a 05 |.t..........+...|
┌──────────────── ┌───────────────── ╭──────────
│ │ :
│ └ `g` inlined ID :
│ 80 is tag :
└ `a` inlined ID 02 is leaf ID :
81 is tag :
74 is ascii `t` :
0 is leaf ID :
.. .. .. .. .. .. .. .. .. .. .. .. ..
:
00000020  1f 52 80 00 00 00 00 00 80 00 00 00 00 01 63 64 |.R............cd|
╰┬───── ┌──────────────── ┌──────────────── ┌────
│ │ │ │
│ │ │ └ `c` `d`
│ │ │
│ │ └ `w` inlined ID
│ │ 80 is tag
│ │ 1 is leaf ID
│ │
│ └ `t` inlined ID
│ 80 is tag
│ 0 is leaf ID
│
└ `o` hash
00000030 61 67 6f 74 77 ff |agotw.|
┌───────────── ┌─
│ │
│ └ NIL
│
└ `a` `g` `o` `t` `w`
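The inlined IDs in the walkthrough above can be decoded by a sketch like this. The bit layout is inferred from the `81 74 …` and `80 … 02` examples, including treating the "second nibble" as the tag byte's low nibble and the leaf ID as big-endian; those inferences are assumptions, and the struct is hypothetical:

```rust
/// Decoded form of a 48-bit inlined one-leaf node ID.
#[derive(Debug, PartialEq)]
struct InlinedLeaf {
    whole_match: bool, // second MSB: whole match (1) or suffix (0)
    data: Option<u8>,  // one optional byte of label data
    leaf_id: u32,
}

/// Returns `None` for the hash form (MSB = 0), otherwise decodes the
/// inlined-leaf form (MSB = 1).
fn decode(id: [u8; 6]) -> Option<InlinedLeaf> {
    let tag = id[0];
    if tag & 0x80 == 0 {
        return None; // 47-bit hash ID: look the node up by name instead
    }
    Some(InlinedLeaf {
        whole_match: tag & 0x40 != 0,
        // assumption: low nibble 1 means a data byte is present
        data: if tag & 0x0F == 1 { Some(id[1]) } else { None },
        leaf_id: u32::from_be_bytes([id[2], id[3], id[4], id[5]]),
    })
}

fn main() {
    // `81 74 00 00 00 00` from the hexdump: a suffix leaf carrying `t`
    assert_eq!(
        decode([0x81, 0x74, 0, 0, 0, 0]),
        Some(InlinedLeaf { whole_match: false, data: Some(b't'), leaf_id: 0 })
    );
    // `80 00 00 00 00 02`: a suffix leaf with no data byte, leaf ID 2
    assert_eq!(
        decode([0x80, 0, 0, 0, 0, 0x02]),
        Some(InlinedLeaf { whole_match: false, data: None, leaf_id: 2 })
    );
}
```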
These nodes are "bundled" by concatenating them together so that dependencies
come BEFORE the data they depend on. That means the node that a file is named
for is the last node in the file. If a dependency is not found in the file,
it's lazy-loaded as needed.
Step 4: Compression
-------------------
The sample I gave above is the "canonical" form of the root node. This is the
form used for hashing, because it contains the hash IDs of all its children,
and it contains all the data, so hashing *that* is hashing the entire subtree.
For actual storage, it will, under very specific circumstances, use a different
serialization form:
* The MSB of the compression byte (the first byte, which is always 0 in
canonical form) is set to 1 if dictionary compression is used.
* The remaining 7 bits are used as flags to mark each child pointer
as being in offset form instead of the whole hash (compressed and uncompressed
pointers can be mixed within the same node, as long as there are at most 7
children). If all the child pointers are compressed, the 7-child limit
doesn't apply: the byte is set to 0x7F or 0xFF and all children are compressed.
* Data length, as one byte.
* Data. The initial character, which our parent node has a copy of,
is not stored here and is not considered part of the length.
* Child length counts. The same format as canonical.
* One-byte relative node offsets. These offsets are within the same
bundle file, and start immediately before the current node, so `0`
is the node directly before this one.
The compressed form is never used for a child that is too far away,
can be one-leaf inlined, or is stored in a separate file.
* One-byte child branches, first the might-have-prefix ones, then the
suffix-only ones. These are in the same order as the hashes, so as to act as
a set of key-value pairs, but are stored this way for compressibility.
* Roaring bitmaps for whole strings and suffixes, with special representation
for very short lists. The format is the same as the canonical format.
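Reading that first byte can be sketched like this (the per-child bit order is an assumption; the description above only specifies that 7 flag bits exist and that 0x7F/0xFF means "all compressed"):

```rust
/// True when the dictionary-compression bit (the MSB) is set.
fn dictionary_compressed(flags: u8) -> bool {
    flags & 0x80 != 0
}

/// True when child `i`'s pointer is a one-byte relative offset rather
/// than a six-byte hash ID. Assumption: bit 0 flags the first child.
fn child_is_offset(flags: u8, i: usize) -> bool {
    let low7 = flags & 0x7F;
    if low7 == 0x7F {
        return true; // special value: every child is in offset form
    }
    i < 7 && (low7 >> i) & 1 != 0
}

fn main() {
    // canonical form: the byte is 0, so nothing is compressed
    assert!(!dictionary_compressed(0x00));
    assert!(!child_is_offset(0x00, 0));
    // 0xFF: dictionary compression plus all-children-compressed
    assert!(dictionary_compressed(0xFF));
    assert!(child_is_offset(0xFF, 12)); // even past the 7-flag limit
    // mixed: only the second child is an offset
    assert!(child_is_offset(0b0000_0010, 1));
    assert!(!child_is_offset(0b0000_0010, 0));
}
```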