fst/lib.rs
1/*!
2Crate `fst` is a library for efficiently storing and searching ordered sets or
3maps where the keys are byte strings. A key design goal of this crate is to
4support storing and searching *very large* sets or maps (i.e., billions). This
5means that much effort has gone in to making sure that all operations are
6memory efficient.
7
8Sets and maps are represented by a finite state machine, which acts as a form
9of compression on common prefixes and suffixes in the keys. Additionally,
10finite state machines can be efficiently queried with automata (like regular
11expressions or Levenshtein distance for fuzzy queries) or lexicographic ranges.
12
13To read more about the mechanics of finite state transducers, including a
14bibliography for algorithms used in this crate, see the docs for the
15[`raw::Fst`](raw/struct.Fst.html) type.
16
17# Installation
18
19Simply add a corresponding entry to your `Cargo.toml` dependency list:
20
21```plain
22[dependencies]
23fst = "0.4"
24```
25
26The examples in this documentation will show the rest.
27
28# Overview of types and modules
29
30This crate provides the high level abstractions---namely sets and maps---in the
31top-level module.
32
33The `set` and `map` sub-modules contain types specific to sets and maps, such
34as range queries and streams.
35
36The `raw` module permits direct interaction with finite state transducers.
37Namely, the states and transitions of a transducer can be directly accessed
38with the `raw` module.
39*/
40#![cfg_attr(
41    feature = "levenshtein",
42    doc = r##"
43# Example: fuzzy query
44
45This example shows how to create a set of strings in memory, and then execute
46a fuzzy query. Namely, the query looks for all keys within an edit distance
47of `1` of `foo`. (Edit distance is the number of character insertions,
48deletions or substitutions required to get from one string to another. In this
49case, a character is a Unicode codepoint.)
50
51This requires the `levenshtein` feature in this crate to be enabled. It is not
52enabled by default.
53
54```rust
55use fst::{IntoStreamer, Streamer, Set};
56use fst::automaton::Levenshtein;
57
58# fn main() { example().unwrap(); }
59fn example() -> Result<(), Box<dyn std::error::Error>> {
60    // A convenient way to create sets in memory.
61    let keys = vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"];
62    let set = Set::from_iter(keys)?;
63
64    // Build our fuzzy query.
65    let lev = Levenshtein::new("foo", 1)?;
66
67    // Apply our fuzzy query to the set we built.
68    let mut stream = set.search(lev).into_stream();
69
70    let keys = stream.into_strs()?;
71    assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
72    Ok(())
73}
74```
75
76**Warning**: Levenshtein automatons use a lot of memory
77
78The construction of Levenshtein automatons should be consider "proof of
79concept" quality. Namely, they do just enough to be *correct*. But they haven't
80had any effort put into them to be memory conscious.
81
82Note that an error will be returned if a Levenshtein automaton gets too big
83(tens of MB in heap usage).
84
85"##
86)]
87/*!
88# Example: stream to a file and memory map it for searching
89
90This shows how to create a `MapBuilder` that will stream construction of the
91map to a file. Notably, this will never store the entire transducer in memory.
92Instead, only constant memory is required during construction.
93
94For the search phase, we use the
95[`memmap`](https://crates.io/memmap)
96crate to make the file available as a `&[u8]` without necessarily reading it
97all into memory (the operating system will automatically handle that for you).
98
99```rust,no_run
100# fn example() -> Result<(), fst::Error> {
101use std::fs::File;
102use std::io;
103
104use fst::{IntoStreamer, Streamer, Map, MapBuilder};
105use memmap::Mmap;
106
107// This is where we'll write our map to.
108let mut wtr = io::BufWriter::new(File::create("map.fst")?);
109
110// Create a builder that can be used to insert new key-value pairs.
111let mut build = MapBuilder::new(wtr)?;
112build.insert("bruce", 1).unwrap();
113build.insert("clarence", 2).unwrap();
114build.insert("stevie", 3).unwrap();
115
116// Finish construction of the map and flush its contents to disk.
117build.finish()?;
118
119// At this point, the map has been constructed. Now we'd like to search it.
120// This creates a memory map, which enables searching the map without loading
121// all of it into memory.
122let mmap = unsafe { Mmap::map(&File::open("map.fst")?)? };
123let map = Map::new(mmap)?;
124
125// Query for keys that are greater than or equal to clarence.
126let mut stream = map.range().ge("clarence").into_stream();
127
128let kvs = stream.into_str_vec()?;
129assert_eq!(kvs, vec![
130    ("clarence".to_owned(), 2),
131    ("stevie".to_owned(), 3),
132]);
133# Ok(())
134# }
135# example().unwrap();
136```
137
138# Example: case insensitive search
139
140We can perform case insensitive search on a set using a regular expression. We
141can use the [`regex-automata`](https://docs.rs/regex-automata) crate to compile
142a regular expression into an automaton:
143
144```ignore
145use fst::{IntoStreamer, Set};
146use regex_automata::dense; // regex-automata crate with 'transducer' feature
147
148fn main() -> Result<(), Box<dyn std::error::Error>> {
149    let set = Set::from_iter(&["FoO", "Foo", "fOO", "foo"])?;
150    let pattern = r"(?i)foo";
151    // Setting 'anchored' is important, otherwise the regex can match anywhere
152    // in the key. This would cause the regex to iterate over every key in the
153    // FST set.
154    let dfa = dense::Builder::new().anchored(true).build(pattern).unwrap();
155
156    let keys = set.search(&dfa).into_stream().into_strs()?;
157    assert_eq!(keys, vec!["FoO", "Foo", "fOO", "foo"]);
158    println!("{:?}", keys);
159    Ok(())
160}
161```
162
163Note that for this to work, the `regex-automata` crate must be compiled with
164the `transducer` feature enabled:
165
166```toml
167[dependencies]
168fst = "0.4"
169regex-automata = { version = "0.1.9", features = ["transducer"] }
170```
171
172# Example: searching multiple sets efficiently
173
174Since queries can search a transducer without reading the entire data structure
175into memory, it is possible to search *many* transducers very quickly.
176
177This crate provides efficient set/map operations that allow one to combine
178multiple streams of search results. Each operation only uses memory
179proportional to the number of streams.
180
181The example below shows how to find all keys that start with `B` or `G`. The
182example below uses sets, but the same operations are available on maps too.
183
184```rust
185use fst::automaton::{Automaton, Str};
186use fst::set;
187use fst::{IntoStreamer, Set, Streamer};
188
189# fn main() { example().unwrap(); }
190fn example() -> Result<(), Box<dyn std::error::Error>> {
191    let set1 = Set::from_iter(&["AC/DC", "Aerosmith"])?;
192    let set2 = Set::from_iter(&["Bob Seger", "Bruce Springsteen"])?;
193    let set3 = Set::from_iter(&["George Thorogood", "Golden Earring"])?;
194    let set4 = Set::from_iter(&["Kansas"])?;
195    let set5 = Set::from_iter(&["Metallica"])?;
196
197    // Create the matcher. We can reuse it to search all of the sets.
198    let matcher = Str::new("B")
199        .starts_with()
200        .union(Str::new("G").starts_with());
201
202    // Build a set operation. All we need to do is add a search result stream
203    // for each set and ask for the union. (Other operations, like intersection
204    // and difference are also available.)
205    let mut stream =
206        set::OpBuilder::new()
207        .add(set1.search(&matcher))
208        .add(set2.search(&matcher))
209        .add(set3.search(&matcher))
210        .add(set4.search(&matcher))
211        .add(set5.search(&matcher))
212        .union();
213
214    // Now collect all of the keys. Alternatively, you could build another set
215    // here using `SetBuilder::extend_stream`.
216    let mut keys = vec![];
217    while let Some(key) = stream.next() {
218        keys.push(String::from_utf8(key.to_vec())?);
219    }
220    assert_eq!(keys, vec![
221        "Bob Seger",
222        "Bruce Springsteen",
223        "George Thorogood",
224        "Golden Earring",
225    ]);
226    Ok(())
227}
228```
229
230# Memory usage
231
232An important advantage of using finite state transducers to represent sets and
233maps is that they can compress very well depending on the distribution of keys.
234The smaller your set/map is, the more likely it is that it will fit into
235memory. If it's in memory, then searching it is faster. Therefore, it is
236important to do what we can to limit what actually needs to be in memory.
237
238This is where automata shine, because they can be queried in their compressed
239state without loading the entire data structure into memory. This means that
240one can store a set/map created by this crate on disk and search it without
241actually reading the entire set/map into memory. This use case is served well
242by *memory maps*, which lets one assign the entire contents of a file to a
243contiguous region of virtual memory.
244
245Indeed, this crate encourages this mode of operation. Both sets and maps can
246be constructed from anything that provides an `AsRef<[u8]>` implementation,
247which any memory map should.
248
249This is particularly important for long running processes that use this crate,
250since it enables the operating system to determine which regions of your
251finite state transducers are actually in memory.
252
253Of course, there are downsides to this approach. Namely, navigating a
254transducer during a key lookup or a search will likely follow a pattern
255approximating random access. Supporting random access when reading from disk
256can be very slow because of how often `seek` must be called (or, in the case
257of memory maps, page faults). This is somewhat mitigated by the prevalence of
258solid state drives where seek time is eliminated. Nevertheless, solid state
259drives are not ubiquitous and it is possible that the OS will not be smart
260enough to keep your memory mapped transducers in the page cache. In that case,
261it is advisable to load the entire transducer into your process's memory (e.g.,
262calling `Set::new` with a `Vec<u8>`).
263
264# Streams
265
266Searching a set or a map needs to provide some way to iterate over the search
267results. Idiomatic Rust calls for something satisfying the `Iterator` trait
268to be used here. Unfortunately, this is not possible to do efficiently because
269the `Iterator` trait does not permit values emitted by the iterator to borrow
270from the iterator. Borrowing from the iterator is required in our case because
271keys and values are constructed *during iteration*.
272
273Namely, if we were to use iterators, then every key would need its own
274allocation, which could be quite costly.
275
276Instead, this crate provides a `Streamer`, which can be thought of as a
277streaming iterator. Namely, a stream in this crate maintains a single key
278buffer and lends it out on each iteration.
279
280For more details, including important limitations, see the `Streamer` trait.
281
282# Quirks
283
284There's no doubt about it, finite state transducers are a specialty data
285structure. They have a host of restrictions that don't apply to other similar
286data structures found in the standard library, such as `BTreeSet` and
287`BTreeMap`. Here are some of them:
288
2891. Sets can only contain keys that are byte strings.
2902. Maps can also only contain keys that are byte strings, and its values are
291   limited to unsigned 64 bit integers. (The restriction on values may be
292   relaxed some day.)
2933. Creating a set or a map requires inserting keys in lexicographic order.
294   Often, keys are not already sorted, which can make constructing large
295   sets or maps tricky. One way to do it is to sort pieces of the data and
296   build a set/map for each piece. This can be parallelized trivially. Once
297   done, they can be merged together into one big set/map if desired.
298   A somewhat simplistic example of this procedure can be seen in
299   `fst-bin/src/merge.rs` from the root of this crate's repository.
300*/
301
302#![deny(missing_docs)]
303
304#[cfg(all(feature = "levenshtein", doctest))]
305doc_comment::doctest!("../README.md");
306
307pub use crate::automaton::Automaton;
308pub use crate::error::{Error, Result};
309pub use crate::map::{Map, MapBuilder};
310pub use crate::set::{Set, SetBuilder};
311pub use crate::stream::{IntoStreamer, Streamer};
312
313mod bytes;
314mod error;
315#[path = "automaton/mod.rs"]
316mod inner_automaton;
317#[path = "map.rs"]
318mod inner_map;
319#[path = "set.rs"]
320mod inner_set;
321pub mod raw;
322mod stream;
323
324/// Automaton implementations for finite state transducers.
325///
326/// This module defines a trait, `Automaton`, with several implementations
327/// including, but not limited to, union, intersection and complement.
328pub mod automaton {
329    pub use crate::inner_automaton::*;
330}
331
332/// Map operations implemented by finite state transducers.
333///
334/// This API provided by this sub-module is close in spirit to the API
335/// provided by
336/// [`std::collections::BTreeMap`](https://doc.rust-lang.org/stable/std/collections/struct.BTreeMap.html).
337///
338/// # Overview of types
339///
340/// `Map` is a read only interface to pre-constructed sets. `MapBuilder` is
341/// used to create new sets. (Once a set is created, it can never be modified.)
342/// `Stream`, `Keys` and `Values` are streams that originated from a map.
343/// `StreamBuilder` builds range queries. `OpBuilder` collects a set of streams
344/// and executes set operations like `union` or `intersection` on them with the
345/// option of specifying a merge strategy for a map's values. The rest of the
346/// types are streams for set operations.
347pub mod map {
348    pub use crate::inner_map::*;
349}
350
351/// Set operations implemented by finite state transducers.
352///
353/// This API provided by this sub-module is close in spirit to the API
354/// provided by
355/// [`std::collections::BTreeSet`](https://doc.rust-lang.org/stable/std/collections/struct.BTreeSet.html).
356/// The principle difference, as with everything else in this crate, is that
357/// operations are performed on streams of byte strings instead of generic
358/// iterators. Another difference is that most of the set operations (union,
359/// intersection, difference and symmetric difference) work on multiple sets at
360/// the same time, instead of two.
361///
362/// # Overview of types
363///
364/// `Set` is a read only interface to pre-constructed sets. `SetBuilder` is
365/// used to create new sets. (Once a set is created, it can never be modified.)
366/// `Stream` is a stream of values that originated from a set (analogous to an
367/// iterator). `StreamBuilder` builds range queries. `OpBuilder` collects a set
368/// of streams and executes set operations like `union` or `intersection` on
369/// them. The rest of the types are streams for set operations.
370pub mod set {
371    pub use crate::inner_set::*;
372}