fst_no_std/
lib.rs

1/*!
2This is a fork of [fst](https://github.com/BurntSushi/fst) adding support for `no_std` targets (see [`no_std` usage](#no_std-usage) for details).
3
4If you're unsure whether to use this fork or the original one: Just use the original, chances are that's more up-to-date.
5
6Crate `fst` is a library for efficiently storing and searching ordered sets or
7maps where the keys are byte strings. A key design goal of this crate is to
8support storing and searching *very large* sets or maps (i.e., billions). This
9means that much effort has gone in to making sure that all operations are
10memory efficient.
11
12Sets and maps are represented by a finite state machine, which acts as a form
13of compression on common prefixes and suffixes in the keys. Additionally,
14finite state machines can be efficiently queried with automata (like regular
15expressions or Levenshtein distance for fuzzy queries) or lexicographic ranges.
16
17To read more about the mechanics of finite state transducers, including a
18bibliography for algorithms used in this crate, see the docs for the
19[`raw::Fst`](raw/struct.Fst.html) type.
20
21# Installation
22
23Simply add a corresponding entry to your `Cargo.toml` dependency list:
24
25```plain
26[dependencies]
27fst-no-std = "0.4"
28```
29
30The examples in this documentation will show the rest.
31
32# Overview of types and modules
33
34This crate provides the high level abstractions---namely sets and maps---in the
35top-level module.
36
37The `set` and `map` sub-modules contain types specific to sets and maps, such
38as range queries and streams.
39
40The `raw` module permits direct interaction with finite state transducers.
41Namely, the states and transitions of a transducer can be directly accessed
42with the `raw` module.
43*/
44#![cfg_attr(
45    feature = "levenshtein",
46    doc = r##"
47# Example: fuzzy query
48
49This example shows how to create a set of strings in memory, and then execute
50a fuzzy query. Namely, the query looks for all keys within an edit distance
51of `1` of `foo`. (Edit distance is the number of character insertions,
52deletions or substitutions required to get from one string to another. In this
53case, a character is a Unicode codepoint.)
54
55This requires the `levenshtein` feature in this crate to be enabled. It is not
56enabled by default.
57
58```rust
59use fst_no_std::{IntoStreamer, Streamer, Set};
60use fst_no_std::automaton::Levenshtein;
61
62# fn main() { example().unwrap(); }
63fn example() -> Result<(), Box<dyn std::error::Error>> {
64    // A convenient way to create sets in memory.
65    let keys = vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"];
66    let set = Set::from_iter(keys)?;
67
68    // Build our fuzzy query.
69    let lev = Levenshtein::new("foo", 1)?;
70
71    // Apply our fuzzy query to the set we built.
72    let mut stream = set.search(lev).into_stream();
73
74    let keys = stream.into_strs()?;
75    assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
76    Ok(())
77}
78```
79
80**Warning**: Levenshtein automatons use a lot of memory
81
82The construction of Levenshtein automatons should be consider "proof of
83concept" quality. Namely, they do just enough to be *correct*. But they haven't
84had any effort put into them to be memory conscious.
85
86Note that an error will be returned if a Levenshtein automaton gets too big
87(tens of MB in heap usage).
88
89"##
90)]
91/*!
92# Example: stream to a file and memory map it for searching
93
94This shows how to create a `MapBuilder` that will stream construction of the
95map to a file. Notably, this will never store the entire transducer in memory.
96Instead, only constant memory is required during construction.
97
98For the search phase, we use the
99[`memmap2`](https://crates.io/memmap2)
100crate to make the file available as a `&[u8]` without necessarily reading it
101all into memory (the operating system will automatically handle that for you).
102
103```rust,no_run
104# fn example() -> Result<(), fst_no_std::Error> {
105use std::fs::File;
106use std::io;
107
108use fst_no_std::{IntoStreamer, Streamer, Map, MapBuilder};
109use memmap2::Mmap;
110
111// This is where we'll write our map to.
112let mut wtr = io::BufWriter::new(File::create("map.fst")?);
113
114// Create a builder that can be used to insert new key-value pairs.
115let mut build = MapBuilder::new(wtr)?;
116build.insert("bruce", 1).unwrap();
117build.insert("clarence", 2).unwrap();
118build.insert("stevie", 3).unwrap();
119
120// Finish construction of the map and flush its contents to disk.
121build.finish()?;
122
123// At this point, the map has been constructed. Now we'd like to search it.
124// This creates a memory map, which enables searching the map without loading
125// all of it into memory.
126let mmap = unsafe { Mmap::map(&File::open("map.fst")?)? };
127let map = Map::new(mmap)?;
128
129// Query for keys that are greater than or equal to clarence.
130let mut stream = map.range().ge("clarence").into_stream();
131
132let kvs = stream.into_str_vec()?;
133assert_eq!(kvs, vec![
134    ("clarence".to_owned(), 2),
135    ("stevie".to_owned(), 3),
136]);
137# Ok(())
138# }
139# example().unwrap();
140```
141
142# Example: case insensitive search
143
144We can perform case insensitive search on a set using a regular expression. We
145can use the [`regex-automata`](https://docs.rs/regex-automata) crate to compile
146a regular expression into an automaton:
147
148```ignore
149use fst::{IntoStreamer, Set};
150use regex_automata::dense; // regex-automata crate with 'transducer' feature
151
152fn main() -> Result<(), Box<dyn std::error::Error>> {
153    let set = Set::from_iter(&["FoO", "Foo", "fOO", "foo"])?;
154    let pattern = r"(?i)foo";
155    // Setting 'anchored' is important, otherwise the regex can match anywhere
156    // in the key. This would cause the regex to iterate over every key in the
157    // FST set.
158    let dfa = dense::Builder::new().anchored(true).build(pattern).unwrap();
159
160    let keys = set.search(&dfa).into_stream().into_strs()?;
161    assert_eq!(keys, vec!["FoO", "Foo", "fOO", "foo"]);
162    println!("{:?}", keys);
163    Ok(())
164}
165```
166
167Note that for this to work, the `regex-automata` crate must be compiled with
168the `transducer` feature enabled:
169
170```toml
171[dependencies]
172fst = "0.4"
173regex-automata = { version = "0.1.9", features = ["transducer"] }
174```
175
176# Example: searching multiple sets efficiently
177
178Since queries can search a transducer without reading the entire data structure
179into memory, it is possible to search *many* transducers very quickly.
180
181This crate provides efficient set/map operations that allow one to combine
182multiple streams of search results. Each operation only uses memory
183proportional to the number of streams.
184
185The example below shows how to find all keys that start with `B` or `G`. The
186example below uses sets, but the same operations are available on maps too.
187
188```rust
189use fst_no_std::automaton::{Automaton, Str};
190use fst_no_std::set;
191use fst_no_std::{IntoStreamer, Set, Streamer};
192
193# fn main() { example().unwrap(); }
194fn example() -> Result<(), Box<dyn std::error::Error>> {
195    let set1 = Set::from_iter(&["AC/DC", "Aerosmith"])?;
196    let set2 = Set::from_iter(&["Bob Seger", "Bruce Springsteen"])?;
197    let set3 = Set::from_iter(&["George Thorogood", "Golden Earring"])?;
198    let set4 = Set::from_iter(&["Kansas"])?;
199    let set5 = Set::from_iter(&["Metallica"])?;
200
201    // Create the matcher. We can reuse it to search all of the sets.
202    let matcher = Str::new("B")
203        .starts_with()
204        .union(Str::new("G").starts_with());
205
206    // Build a set operation. All we need to do is add a search result stream
207    // for each set and ask for the union. (Other operations, like intersection
208    // and difference are also available.)
209    let mut stream =
210        set::OpBuilder::new()
211        .add(set1.search(&matcher))
212        .add(set2.search(&matcher))
213        .add(set3.search(&matcher))
214        .add(set4.search(&matcher))
215        .add(set5.search(&matcher))
216        .union();
217
218    // Now collect all of the keys. Alternatively, you could build another set
219    // here using `SetBuilder::extend_stream`.
220    let mut keys = vec![];
221    while let Some(key) = stream.next() {
222        keys.push(String::from_utf8(key.to_vec())?);
223    }
224    assert_eq!(keys, vec![
225        "Bob Seger",
226        "Bruce Springsteen",
227        "George Thorogood",
228        "Golden Earring",
229    ]);
230    Ok(())
231}
232```
233
234# Memory usage
235
236An important advantage of using finite state transducers to represent sets and
237maps is that they can compress very well depending on the distribution of keys.
238The smaller your set/map is, the more likely it is that it will fit into
239memory. If it's in memory, then searching it is faster. Therefore, it is
240important to do what we can to limit what actually needs to be in memory.
241
242This is where automata shine, because they can be queried in their compressed
243state without loading the entire data structure into memory. This means that
244one can store a set/map created by this crate on disk and search it without
245actually reading the entire set/map into memory. This use case is served well
246by *memory maps*, which lets one assign the entire contents of a file to a
247contiguous region of virtual memory.
248
249Indeed, this crate encourages this mode of operation. Both sets and maps can
250be constructed from anything that provides an `AsRef<[u8]>` implementation,
251which any memory map should.
252
253This is particularly important for long running processes that use this crate,
254since it enables the operating system to determine which regions of your
255finite state transducers are actually in memory.
256
257Of course, there are downsides to this approach. Namely, navigating a
258transducer during a key lookup or a search will likely follow a pattern
259approximating random access. Supporting random access when reading from disk
260can be very slow because of how often `seek` must be called (or, in the case
261of memory maps, page faults). This is somewhat mitigated by the prevalence of
262solid state drives where seek time is eliminated. Nevertheless, solid state
263drives are not ubiquitous and it is possible that the OS will not be smart
264enough to keep your memory mapped transducers in the page cache. In that case,
265it is advisable to load the entire transducer into your process's memory (e.g.,
266calling `Set::new` with a `Vec<u8>`).
267
268# Streams
269
270Searching a set or a map needs to provide some way to iterate over the search
271results. Idiomatic Rust calls for something satisfying the `Iterator` trait
272to be used here. Unfortunately, this is not possible to do efficiently because
273the `Iterator` trait does not permit values emitted by the iterator to borrow
274from the iterator. Borrowing from the iterator is required in our case because
275keys and values are constructed *during iteration*.
276
277Namely, if we were to use iterators, then every key would need its own
278allocation, which could be quite costly.
279
280Instead, this crate provides a `Streamer`, which can be thought of as a
281streaming iterator. Namely, a stream in this crate maintains a single key
282buffer and lends it out on each iteration.
283
284For more details, including important limitations, see the `Streamer` trait.
285
286# Quirks
287
288There's no doubt about it, finite state transducers are a specialty data
289structure. They have a host of restrictions that don't apply to other similar
290data structures found in the standard library, such as `BTreeSet` and
291`BTreeMap`. Here are some of them:
292
2931. Sets can only contain keys that are byte strings.
2942. Maps can also only contain keys that are byte strings, and its values are
295   limited to unsigned 64 bit integers. (The restriction on values may be
296   relaxed some day.)
2973. Creating a set or a map requires inserting keys in lexicographic order.
298   Often, keys are not already sorted, which can make constructing large
299   sets or maps tricky. One way to do it is to sort pieces of the data and
300   build a set/map for each piece. This can be parallelized trivially. Once
301   done, they can be merged together into one big set/map if desired.
302   A somewhat simplistic example of this procedure can be seen in
303   `fst-bin/src/merge.rs` from the root of this crate's repository.
304*/
305
306#![cfg_attr(not(feature = "std"), no_std)]
307#![cfg_attr(not(feature = "std"), feature(error_in_core))]
308#![deny(missing_docs)]
309#![allow(clippy::should_implement_trait)]
310
311#[cfg(feature = "alloc")]
312extern crate alloc;
313
314#[cfg(all(feature = "levenshtein", doctest))]
315doc_comment::doctest!("../README.md");
316
317pub use crate::automaton::Automaton;
318pub use crate::error::{Error, Result};
319pub use crate::map::Map;
320#[cfg(feature = "std")]
321pub use crate::map::MapBuilder;
322pub use crate::set::Set;
323#[cfg(feature = "std")]
324pub use crate::set::SetBuilder;
325pub use crate::stream::{IntoStreamer, Streamer};
326
327mod bytes;
328mod error;
329#[path = "automaton/mod.rs"]
330mod inner_automaton;
331#[path = "map.rs"]
332mod inner_map;
333#[path = "set.rs"]
334mod inner_set;
335pub mod raw;
336mod stream;
337
338/// Automaton implementations for finite state transducers.
339///
340/// This module defines a trait, `Automaton`, with several implementations
341/// including, but not limited to, union, intersection and complement.
342pub mod automaton {
343    pub use crate::inner_automaton::*;
344}
345
346/// Map operations implemented by finite state transducers.
347///
348/// This API provided by this sub-module is close in spirit to the API
349/// provided by
350/// [`std::collections::BTreeMap`](https://doc.rust-lang.org/stable/std/collections/struct.BTreeMap.html).
351///
352/// # Overview of types
353///
354/// `Map` is a read only interface to pre-constructed sets. `MapBuilder` is
355/// used to create new sets. (Once a set is created, it can never be modified.)
356/// `Stream`, `Keys` and `Values` are streams that originated from a map.
357/// `StreamBuilder` builds range queries. `OpBuilder` collects a set of streams
358/// and executes set operations like `union` or `intersection` on them with the
359/// option of specifying a merge strategy for a map's values. The rest of the
360/// types are streams for set operations.
361pub mod map {
362    pub use crate::inner_map::*;
363}
364
365/// Set operations implemented by finite state transducers.
366///
367/// This API provided by this sub-module is close in spirit to the API
368/// provided by
369/// [`std::collections::BTreeSet`](https://doc.rust-lang.org/stable/std/collections/struct.BTreeSet.html).
370/// The principle difference, as with everything else in this crate, is that
371/// operations are performed on streams of byte strings instead of generic
372/// iterators. Another difference is that most of the set operations (union,
373/// intersection, difference and symmetric difference) work on multiple sets at
374/// the same time, instead of two.
375///
376/// # Overview of types
377///
378/// `Set` is a read only interface to pre-constructed sets. `SetBuilder` is
379/// used to create new sets. (Once a set is created, it can never be modified.)
380/// `Stream` is a stream of values that originated from a set (analogous to an
381/// iterator). `StreamBuilder` builds range queries. `OpBuilder` collects a set
382/// of streams and executes set operations like `union` or `intersection` on
383/// them. The rest of the types are streams for set operations.
384pub mod set {
385    pub use crate::inner_set::*;
386}