Struct fst::raw::Fst

pub struct Fst { /* fields omitted */ }

An acyclic deterministic finite state transducer.

How does it work?

The short answer: it's just like a prefix trie, which compresses keys based only on their prefixes, except that an automaton/transducer also compresses suffixes.

The longer answer is that keys in an automaton are stored only in the transitions from one state to another. A key can be acquired by tracing a path from the root of the automaton to any match state. The inputs along each transition are concatenated. Once a match state is reached, the concatenation of inputs up until that point corresponds to a single key.

But why is it called a transducer instead of an automaton? A finite state transducer is just like a finite state automaton, except that it has output transitions in addition to input transitions. Namely, the value associated with any particular key is determined by summing the outputs along every input transition that leads to the key's corresponding match state.

This is best demonstrated with a couple of images. First, let's ignore the "transducer" aspect and focus on a plain automaton.

Consider that your keys are abbreviations of some of the months in the Gregorian calendar:

jan
feb
mar
apr
may
jun
jul

The corresponding automaton that stores all of these as keys looks like this:

[Figure: finite state automaton]

Notice here how the prefix and suffix of jan and jun are shared. Similarly, the prefixes of jun and jul are shared and the prefixes of mar and may are shared.

All of the keys from this automaton can be enumerated in lexicographic order by following every transition from each node in lexicographic order. Since it is acyclic, the procedure will terminate.
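
A minimal sketch of this enumeration, using a hand-built automaton that holds only jan, jul and jun. The `State` type, the `state` helper and the state numbering are illustrative assumptions and do not reflect this crate's internal representation.

```rust
use std::collections::BTreeMap;

/// A state in a hand-built automaton: a match flag plus its transitions.
struct State {
    is_match: bool,
    // A BTreeMap iterates its transitions in order of input byte.
    trans: BTreeMap<u8, usize>,
}

fn state(is_match: bool, trans: &[(u8, usize)]) -> State {
    State { is_match, trans: trans.iter().cloned().collect() }
}

/// Depth-first walk that follows transitions in byte order, so keys are
/// collected in lexicographic order. It terminates because there are no cycles.
fn collect_keys(states: &[State], id: usize, prefix: &mut Vec<u8>, out: &mut Vec<String>) {
    if states[id].is_match {
        out.push(String::from_utf8_lossy(prefix).into_owned());
    }
    for (&byte, &next) in &states[id].trans {
        prefix.push(byte);
        collect_keys(states, next, prefix, out);
        prefix.pop();
    }
}

fn month_automaton() -> Vec<State> {
    vec![
        state(false, &[(b'j', 1)]),            // 0: root
        state(false, &[(b'a', 2), (b'u', 3)]), // 1: after "j"
        state(false, &[(b'n', 4)]),            // 2: after "ja"
        state(false, &[(b'l', 4), (b'n', 4)]), // 3: after "ju"
        state(true, &[]),                      // 4: match state shared by all keys
    ]
}

fn main() {
    let states = month_automaton();
    let mut keys = Vec::new();
    collect_keys(&states, 0, &mut Vec::new(), &mut keys);
    println!("{:?}", keys); // ["jan", "jul", "jun"]
}
```

Note that jul comes before jun: the order is lexicographic on bytes, not calendar order.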

A key can be found by tracing it through the transitions in the automaton. For example, the key oct is known not to be in the automaton by only visiting the root state (because there is no o transition). For another example, the key jax is known not to be in the set only after moving through the transitions for j and a. Namely, after those transitions are followed, there are no transitions for x.

Notice here that the time to look up a key is proportional to the length of the key itself. Namely, lookup time is not affected by the number of keys in the automaton!
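
The lookup procedure can be sketched with a plain trie. This is an assumption made for brevity: a trie shares prefixes only, but adding suffix sharing would not change how a key is traced. The `Node` type and its `lookup` method are hypothetical names, not part of this crate.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
struct Node {
    is_match: bool,
    children: BTreeMap<u8, Node>,
}

impl Node {
    fn insert(&mut self, key: &[u8]) {
        let mut node = self;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
        node.is_match = true;
    }

    /// Returns whether `key` is present and how many transitions were
    /// followed. The count never exceeds `key.len()`.
    fn lookup(&self, key: &[u8]) -> (bool, usize) {
        let mut node = self;
        let mut steps = 0;
        for b in key {
            match node.children.get(b) {
                Some(next) => {
                    node = next;
                    steps += 1;
                }
                // No transition for this byte: the key cannot be present.
                None => return (false, steps),
            }
        }
        (node.is_match, steps)
    }
}

fn months() -> Node {
    let mut root = Node::default();
    for key in &["jan", "feb", "mar", "apr", "may", "jun", "jul"] {
        root.insert(key.as_bytes());
    }
    root
}

fn main() {
    let root = months();
    for key in &["jul", "jax", "oct"] {
        let (found, steps) = root.lookup(key.as_bytes());
        println!("{}: found={} after {} transitions", key, found, steps);
    }
}
```

Here jul is found after three transitions, jax is rejected after two, and oct is rejected at the root, regardless of how many other keys are stored.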

Additionally, notice that the automaton exploits the fact that many keys share common prefixes and suffixes. For example, jun and jul are represented with no more states than would be required to represent either one on its own. Instead, the only change is a single extra transition. This is a form of compression and is key to how the automatons produced by this crate are so small.

Let's move on to finite state transducers. Consider the same set of keys as above, but let's assign their numeric month values:

jan,1
feb,2
mar,3
apr,4
may,5
jun,6
jul,7

The corresponding transducer looks very similar to the automaton above, except outputs have been added to some of the transitions:

[Figure: finite state transducer]

All of the operations with a transducer are the same as described above for automatons. Additionally, the same compression techniques are used: common prefixes and suffixes in keys are exploited.

The key difference is that some transitions have been given an output. As one follows input transitions, one must sum the outputs as they are seen. (A transition with no output represents the additive identity, or 0 in this case.) For example, when looking up feb, the transition f has output 2, the transition e has output 0, and the transition b also has output 0. The sum of these is 2, which is exactly the value we associated with feb.

For another more interesting example, consider jul. The j transition has output 1, the u transition has output 5 and the l transition has output 1. Summing these together gets us 7, which is again the correct value associated with jul. Notice that if we instead looked up the jun key, then the n transition, which has no output, would be followed instead of the l transition. Therefore, the value of the jun key is 1+5+0=6.
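
The summation can be sketched with a small hand-built transducer covering feb, jan, jun and jul. The outputs on f, e, b, j, u, l and n are the ones given in the text; the zero output on the a transition of jan is an assumption that follows from j already carrying 1. The `State` type and the `get` function are illustrative only and are not this crate's API.

```rust
use std::collections::BTreeMap;

struct State {
    is_match: bool,
    // input byte -> (output, target state)
    trans: BTreeMap<u8, (u64, usize)>,
}

fn state(is_match: bool, trans: &[(u8, u64, usize)]) -> State {
    State {
        is_match,
        trans: trans.iter().map(|&(b, out, to)| (b, (out, to))).collect(),
    }
}

/// Follows `key` from the root (state 0), summing outputs along the way.
fn get(states: &[State], key: &[u8]) -> Option<u64> {
    let mut id = 0;
    let mut sum = 0;
    for b in key {
        let &(out, next) = states[id].trans.get(b)?;
        sum += out; // a transition "with no output" contributes 0
        id = next;
    }
    if states[id].is_match { Some(sum) } else { None }
}

fn month_transducer() -> Vec<State> {
    vec![
        state(false, &[(b'f', 2, 1), (b'j', 1, 3)]), // 0: root
        state(false, &[(b'e', 0, 2)]),               // 1: after "f"
        state(false, &[(b'b', 0, 6)]),               // 2: after "fe"
        state(false, &[(b'a', 0, 4), (b'u', 5, 5)]), // 3: after "j"
        state(false, &[(b'n', 0, 6)]),               // 4: after "ja"
        state(false, &[(b'l', 1, 6), (b'n', 0, 6)]), // 5: after "ju"
        state(true, &[]),                            // 6: shared match state
    ]
}

fn main() {
    let fst = month_transducer();
    for key in &["feb", "jul", "jun", "jax"] {
        println!("{} -> {:?}", key, get(&fst, key.as_bytes()));
    }
}
```

Running this prints Some(2), Some(7) and Some(6) for feb, jul and jun, and None for jax.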

The trick to transducers is that there exists a unique path through the transducer for every key, and its outputs are stored appropriately along this path such that the correct value is returned when they are all summed together. This process also enables the data that makes up each value to be shared across many values in the transducer in exactly the same way that keys are shared. This is yet another form of compression!
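
One way to arrive at an output assignment like the one in the example is to place on each transition the smallest value among all keys that pass through it, minus whatever has already been placed earlier on the path. The sketch below is only an illustration of that idea and is not necessarily how this crate's builder works. It assumes ASCII keys, that no key is a prefix of another, and that `key` is one of the stored keys; `path_outputs` is a hypothetical helper.

```rust
/// Computes the output on each transition along `key`'s path.
fn path_outputs(pairs: &[(&str, u64)], key: &str) -> Vec<u64> {
    let mut outputs = Vec::new();
    let mut emitted = 0; // sum of outputs placed on earlier transitions
    for end in 1..=key.len() {
        let prefix = &key[..end];
        // Smallest value among the keys sharing this prefix: this much can
        // be emitted no matter which of those keys is being looked up.
        let shared = pairs
            .iter()
            .filter(|(k, _)| k.starts_with(prefix))
            .map(|&(_, v)| v)
            .min()
            .unwrap_or(0);
        outputs.push(shared - emitted);
        emitted = shared;
    }
    outputs
}

const MONTHS: [(&str, u64); 7] = [
    ("jan", 1), ("feb", 2), ("mar", 3), ("apr", 4),
    ("may", 5), ("jun", 6), ("jul", 7),
];

fn main() {
    println!("{:?}", path_outputs(&MONTHS, "jul")); // [1, 5, 1]
    println!("{:?}", path_outputs(&MONTHS, "jun")); // [1, 5, 0]
    println!("{:?}", path_outputs(&MONTHS, "feb")); // [2, 0, 0]
}
```

The j and u outputs (1 and 5) are computed once and serve both jun and jul, which is the value sharing described above.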

Bonus: a billion strings

The amount of compression one can get from automata can be absolutely ridiculous. Consider the particular case of storing all one billion strings in the range 0000000001-1000000000, e.g.,

0000000001
0000000002
...
0000000100
0000000101
...
0999999999
1000000000

The corresponding automaton looks like this:

[Figure: finite state automaton - one billion strings]

Indeed, the on disk size of this automaton is a mere 251 bytes.

Of course, this is a bit of a pathological best case, but it does serve to show how good compression can be in the optimal case.

Also, check out the corresponding transducer that maps each string to its integer value. It's a bit bigger, but still only takes up 896 bytes of space on disk. This demonstrates that output values are also compressible.

Does this crate produce minimal transducers?

For any non-trivially sized set of keys, it is unlikely that this crate will produce a minimal transducer. As far as this author knows, guaranteeing a minimal transducer requires working memory proportional to the number of states. This can be quite costly and is anathema to the main design goal of this crate: provide the ability to work with gigantic sets of strings with constant memory overhead.

Instead, construction of a finite state transducer uses a cache of states. More frequently used states are cached and reused, which provides reasonably good compression ratios. (No comprehensive benchmarks exist to back up this claim.)
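
To illustrate the trade-off, here is a minimal sketch of a bounded state cache. It is a hypothetical design (one slot per hash bucket, overwrite on collision) and not this crate's actual cache; `State`, `StateCache` and `intern` are made-up names. The point is only that memory stays fixed while equivalent states are reused when they happen to still be cached.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A finished state: its match flag and (input, target address) transitions.
#[derive(Clone, PartialEq, Eq, Hash)]
struct State {
    is_match: bool,
    trans: Vec<(u8, usize)>,
}

/// Fixed-size cache: memory is bounded by the number of slots, no matter
/// how many states are produced during construction.
struct StateCache {
    slots: Vec<Option<(State, usize)>>,
    next_addr: usize,
}

impl StateCache {
    /// `size` must be greater than zero.
    fn new(size: usize) -> StateCache {
        StateCache { slots: vec![None; size], next_addr: 0 }
    }

    /// Returns the address of an equal cached state if there is one.
    /// Otherwise assigns a fresh address and caches the state, evicting
    /// whatever previously occupied its slot.
    fn intern(&mut self, state: State) -> usize {
        let mut hasher = DefaultHasher::new();
        state.hash(&mut hasher);
        let slot = (hasher.finish() as usize) % self.slots.len();
        if let Some((cached, addr)) = &self.slots[slot] {
            if *cached == state {
                return *addr;
            }
        }
        let addr = self.next_addr;
        self.next_addr += 1;
        self.slots[slot] = Some((state, addr));
        addr
    }
}

fn main() {
    let fin = State { is_match: true, trans: vec![] };
    let mut cache = StateCache::new(1); // deliberately tiny
    let first = cache.intern(fin.clone());
    let again = cache.intern(fin.clone());
    let other = cache.intern(State { is_match: false, trans: vec![(b'n', first)] });
    let evicted = cache.intern(fin);
    println!("{} {} {} {}", first, again, other, evicted); // 0 0 1 2
}
```

With a single slot, the final state is reused while it is cached (address 0 twice) but is written again as address 2 after being evicted. That duplicate is exactly the kind of redundancy that makes the result non-minimal.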

It is possible that this crate may expose a way to guarantee minimal construction of transducers at the expense of exorbitant memory requirements.

Bibliography

I initially got the idea to use finite state transducers to represent ordered sets/maps from Michael McCandless' work on incorporating transducers in Lucene.

However, my work would also not have been possible without the hard work of many academics, especially Jan Daciuk.

Methods

impl Fst

Opens a transducer stored at the given file path via a memory map.

The fst must have been written with a compatible finite state transducer builder (Builder qualifies). If the format is invalid or if there is a mismatch between the API version of this library and the fst, then an error is returned.

This is unsafe because Rust programs cannot guarantee that memory backed by a memory mapped file won't be mutably aliased. It is up to the caller to enforce that the memory map is not modified while it is opened.

Opens a transducer from a MmapReadOnly.

This is useful if a transducer is serialized to only a part of a file. A MmapReadOnly lets one control which region of the file is used for the transducer.

Creates a transducer from its representation as a raw byte sequence.

Note that this operation is very cheap (no allocations and no copies).

The fst must have been written with a compatible finite state transducer builder (Builder qualifies). If the format is invalid or if there is a mismatch between the API version of this library and the fst, then an error is returned.

Creates a transducer from its representation as a raw byte sequence.

This accepts a static byte slice, which may be useful if the Fst is embedded into source code.

Creates a transducer from a shared vector at the given offset and length.

This permits creating multiple transducers from a single region of owned memory.

Retrieves the value associated with a key.

If the key does not exist, then None is returned.

Returns true if and only if the given key is in this FST.

Return a lexicographically ordered stream of all key-value pairs in this fst.

Return a builder for range queries.

A range query returns a subset of key-value pairs in this fst in a range given in lexicographic order.
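
The semantics of a lexicographic range can be sketched with the standard library's `BTreeMap` standing in for the fst. This models only which pairs a range selects, not how the fst computes it; `range_ge_lt` is a hypothetical helper taking an inclusive lower bound and an exclusive upper bound.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

fn months() -> BTreeMap<String, u64> {
    let pairs = [
        ("jan", 1), ("feb", 2), ("mar", 3), ("apr", 4),
        ("may", 5), ("jun", 6), ("jul", 7),
    ];
    pairs.iter().map(|&(k, v)| (k.to_string(), v)).collect()
}

/// Returns all pairs whose key k satisfies ge <= k < lt, in key order.
/// Assumes `ge <= lt`; `BTreeMap::range` panics otherwise.
fn range_ge_lt(map: &BTreeMap<String, u64>, ge: &str, lt: &str) -> Vec<(String, u64)> {
    map.range::<str, _>((Bound::Included(ge), Bound::Excluded(lt)))
        .map(|(k, v)| (k.clone(), *v))
        .collect()
}

fn main() {
    let map = months();
    // Keys in lexicographic order: apr, feb, jan, jul, jun, mar, may.
    println!("{:?}", range_ge_lt(&map, "feb", "jun"));
    // [("feb", 2), ("jan", 1), ("jul", 7)]
}
```

The range is defined by key order, so the values that come back (2, 1, 7) are not themselves sorted.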

Executes an automaton on the keys of this fst.

Returns the number of keys in this fst.

Returns true if and only if this fst has no keys.

Returns the number of bytes used by this fst.

Creates a new fst operation with this fst added to it.

The OpBuilder type can be used to add additional fst streams and perform set operations like union, intersection, difference and symmetric difference on the keys of the fst. These set operations also allow one to specify how conflicting values are merged in the stream.
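
As a rough model of what a union with value merging does, the sketch below merges two key-ordered slices in a single pass. It does not use OpBuilder; `union_with` and its `merge` callback are hypothetical and stand in for the idea of specifying how conflicting values are combined.

```rust
use std::cmp::Ordering;

/// Unions two streams that are sorted by key. When a key appears in both,
/// `merge` decides the resulting value.
fn union_with<F>(a: &[(&str, u64)], b: &[(&str, u64)], merge: F) -> Vec<(String, u64)>
where
    F: Fn(u64, u64) -> u64,
{
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(b[j].0) {
            Ordering::Less => {
                out.push((a[i].0.to_string(), a[i].1));
                i += 1;
            }
            Ordering::Greater => {
                out.push((b[j].0.to_string(), b[j].1));
                j += 1;
            }
            Ordering::Equal => {
                out.push((a[i].0.to_string(), merge(a[i].1, b[j].1)));
                i += 1;
                j += 1;
            }
        }
    }
    // At most one of the two inputs still has keys left.
    for &(k, v) in &a[i..] {
        out.push((k.to_string(), v));
    }
    for &(k, v) in &b[j..] {
        out.push((k.to_string(), v));
    }
    out
}

fn main() {
    let a = [("apr", 4), ("jan", 1)];
    let b = [("jan", 10), ("mar", 3)];
    println!("{:?}", union_with(&a, &b, |x, y| x + y));
    // [("apr", 4), ("jan", 11), ("mar", 3)]
}
```

Because both inputs are ordered, the output is produced in order without buffering either input.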

Returns true if and only if the self fst is disjoint with the fst stream.

stream must be a lexicographically ordered sequence of byte strings with associated values.

Returns true if and only if the self fst is a subset of the fst stream.

stream must be a lexicographically ordered sequence of byte strings with associated values.

Returns true if and only if the self fst is a superset of the fst stream.

stream must be a lexicographically ordered sequence of byte strings with associated values.
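
The ordering requirement is what makes these three checks possible in a single pass. A minimal sketch over sorted key slices follows (values are ignored, and the functions are stand-ins that do not use this crate's stream types):

```rust
use std::cmp::Ordering;

/// True if the two sorted key lists have no key in common.
fn is_disjoint(a: &[&str], b: &[&str]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => return false,
        }
    }
    true
}

/// True if every key in sorted `a` also appears in sorted `b`.
fn is_subset(a: &[&str], b: &[&str]) -> bool {
    let mut j = 0;
    for key in a {
        // Skip keys in `b` that sort before `key`; they can never match
        // a later key of `a` either, because `a` is sorted.
        while j < b.len() && b[j] < *key {
            j += 1;
        }
        if j == b.len() || b[j] != *key {
            return false;
        }
        j += 1;
    }
    true
}

/// True if every key in sorted `b` also appears in sorted `a`.
fn is_superset(a: &[&str], b: &[&str]) -> bool {
    is_subset(b, a)
}

fn main() {
    let all = ["apr", "feb", "jan", "jul", "jun", "mar", "may"];
    println!("{}", is_disjoint(&["apr", "feb"], &["jan", "jul"])); // true
    println!("{}", is_subset(&["jan", "jul"], &all)); // true
    println!("{}", is_superset(&["jan", "jul"], &all)); // false
}
```

If either input were unordered, keys skipped by the forward-only cursors could be missed and the answers would be unreliable.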

Returns the underlying type of this fst.

FstType is a convention used to indicate the type of the underlying transducer.

This crate reserves the range 0-255 (inclusive) but currently leaves the meaning of 0-255 unspecified.

Returns the root node of this fst.

Returns the node at the given address.

Node addresses can be obtained by reading transitions on Node values.

Returns a copy of the binary contents of this FST.

Trait Implementations

impl<'a, 'f> IntoStreamer<'a> for &'f Fst

The type of the item emitted by the stream.

The type of the stream to be constructed.

Construct a stream from Self.