rustfst/
lib.rs

1//! Rust implementation of Weighted Finite States Transducers.
2//!
3//! Rustfst is a library for constructing, combining, optimizing, and searching weighted
4//! finite-state transducers (FSTs). Weighted finite-state transducers are automata where
5//! each transition has an input label, an output label, and a weight.
6//! The more familiar finite-state acceptor is represented as a transducer
7//! with each transition's input and output label equal. Finite-state acceptors
8//! are used to represent sets of strings (specifically, regular or rational sets);
9//! finite-state transducers are used to represent binary relations between pairs of
10//! strings (specifically, rational transductions). The weights can be used to represent
11//! the cost of taking a particular transition.
12//!
13//! FSTs have key applications in speech recognition and synthesis, machine translation,
14//! optical character recognition, pattern matching, string processing, machine learning,
15//! information extraction and retrieval among others. Often a weighted transducer is used to
16//! represent a probabilistic model (e.g., an n-gram model, pronunciation model). FSTs can be
17//! optimized by determinization and minimization, models can be applied to hypothesis sets
18//! (also represented as automata) or cascaded by finite-state composition, and the best
19//! results can be selected by shortest-path algorithms.
20//!
21//! ![fst](https://raw.githubusercontent.com/Garvys/rustfst-images-doc/master/images/project_in.svg?sanitize=true)
22//!
23//! ## Overview
24//!
25//! For a basic [example](#example) see the section below.
26//!
27//! Some simple and commonly encountered types of FSTs can be easily
28//! created with the macro [`fst`] or the functions
29//! [`acceptor`](utils::acceptor) and
30//! [`transducer`](utils::transducer).
31//!
32//! For more complex cases you will likely start with the
33//! [`VectorFst`](fst_impls::VectorFst) type, which will be imported
34//! in the [`prelude`] along with most everything else you need.
35//! [`VectorFst<TropicalWeight>`](fst_impls::VectorFst) corresponds
36//! directly to the OpenFST `StdVectorFst`, and can be used to load
37//! its files using [`read`](fst_traits::SerializableFst::read) or
38//! [`read_text`](fst_traits::SerializableFst::read_text).
39//!
40//! Because "iteration" over an FST can mean many different things,
41//! there are a variety of different iterators.  To iterate over state
42//! IDs you may use
43//! [`states_iter`](fst_traits::StateIterator::states_iter), while to
44//! iterate over transitions out of a state, you may use
45//! [`get_trs`](fst_traits::CoreFst::get_trs).  Since it is common to
46//! iterate over both, this can be done using
47//! [`fst_iter`](fst_traits::FstIterator::fst_iter) or
48//! [`fst_into_iter`](fst_traits::FstIntoIterator::fst_into_iter).  It
49//! is also very common to iterate over paths accepted by an FST,
50//! which can be done with
51//! [`paths_iter`](fst_traits::Fst::paths_iter), and as a convenience
52//! for generating text,
53//! [`string_paths_iter`](fst_traits::Fst::string_paths_iter).
54//! Alternately, in the case of a linear FST, you may retrieve the
55//! only possible path with
56//! [`decode_linear_fst`](utils::decode_linear_fst).
57//!
58//! Note that iterating over paths is not the same thing as finding
59//! the *shortest* path or paths, which is done with
60//! [`shortest_path`](algorithms::shortest_path) (for a single path)
61//! or
62//! [`shortest_path_with_config`](algorithms::shortest_path_with_config)
63//! (for N-shortest paths).
64//!
65//! For the complete list of algorithms, see the [`algorithms`] module.
66//!
67//! You may now be wondering, especially if you have previously used
68//! such linguist-friendly tools as
69//! [pyfoma](https://github.com/mhulden/pyfoma), "what if I just want
70//! to *transduce some text*???"  The unfriendly answer is that
71//! rustfst is a somewhat lower-level library, designed for
72//! implementing things like speech recognizers.  The somewhat more
73//! helpful answer is that you would do this by constructing an
74//! [`acceptor`](utils::acceptor) for your input, which you will
75//! [`compose`](algorithms::compose) with a
76//! [`transducer`](utils::transducer), then
77//! [`project`](algorithms::project) the result [to its
78//! output](algorithms::ProjectType::ProjectOutput), and finally
79//! [iterate over the paths](fst_traits::Fst::string_paths_iter) in
80//! the resulting FST.
81//!
82//! ## References
83//!
84//! Implementation heavily inspired from Mehryar Mohri's, Cyril Allauzen's and Michael Riley's work :
85//! - [Weighted automata algorithms](https://cs.nyu.edu/~mohri/pub/hwa.pdf)
86//! - [The design principles of a weighted finite-state transducer library](https://core.ac.uk/download/pdf/82101846.pdf)
87//! - [OpenFst: A general and efficient weighted finite-state transducer library](https://link.springer.com/chapter/10.1007%2F978-3-540-76336-9_3)
88//! - [Weighted finite-state transducers in speech recognition](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1010&context=cis_papers)
89//!
90//! The API closely resembles that of OpenFST, with some
91//! simplifications and changes to make it more idiomatic in Rust, notably
92//! the use of `Tr` instead of `Arc`.  See [Differences from
93//! OpenFST](#differences-from-openfst) for more information.
94//!
95//! ## Example
96//!
97//! ```rust
98//! use anyhow::Result;
99//! use rustfst::prelude::*;
100//! use rustfst::algorithms::determinize::{DeterminizeType, determinize};
101//! use rustfst::algorithms::rm_epsilon::rm_epsilon;
102//!
103//! fn main() -> Result<()> {
104//!     // Creates a empty wFST
105//!     let mut fst = VectorFst::<TropicalWeight>::new();
106//!
107//!     // Add some states
108//!     let s0 = fst.add_state();
109//!     let s1 = fst.add_state();
110//!     let s2 = fst.add_state();
111//!
112//!     // Set s0 as the start state
113//!     fst.set_start(s0)?;
114//!
115//!     // Add a transition from s0 to s1
116//!     fst.add_tr(s0, Tr::new(3, 5, 10.0, s1))?;
117//!
118//!     // Add a transition from s0 to s2
119//!     fst.add_tr(s0, Tr::new(5, 7, 18.0, s2))?;
120//!
121//!     // Set s1 and s2 as final states
122//!     fst.set_final(s1, 31.0)?;
123//!     fst.set_final(s2, 45.0)?;
124//!
125//!     // Iter over all the paths in the wFST
126//!     for p in fst.paths_iter() {
127//!          println!("{:?}", p);
128//!     }
129//!
130//!     // A lot of operations are available to modify/optimize the FST.
131//!     // Here are a few examples :
132//!
133//!     // - Remove useless states.
134//!     connect(&mut fst)?;
135//!
136//!     // - Optimize the FST by merging states with the same behaviour.
137//!     minimize(&mut fst)?;
138//!
139//!     // - Copy all the input labels in the output.
140//!     project(&mut fst, ProjectType::ProjectInput);
141//!
142//!     // - Remove epsilon transitions.
143//!     rm_epsilon(&mut fst)?;
144//!
145//!     // - Compute an equivalent FST but deterministic.
146//!     fst = determinize(&fst)?;
147//!
148//!     Ok(())
149//! }
150//! ```
151//!
152//! ## Differences from OpenFST
153//!
154//! Here is a non-exhaustive list of ways in which Rustfst's API
155//! differs from OpenFST:
156//!
157//! - The default epsilon symbol is `<eps>` and not `<epsilon>`.
158//! - Functions and methods follow Rust naming conventions,
159//!   e.g. `add_state` rather than `AddState`, but are otherwise mostly
160//!   equivalent, except that:
161//! - Transitions are called `Tr` and not `Arc`, because `Arc` has a
162//!   rather different and well-established meaning in Rust, and rustfst
163//!   uses it (`std::sync::Arc`, that is) to reference-count symbol
164//!   tables.  All associated functions also use `tr`.
165//! - Final states are not indicated by a final weight of `zero`.  You
166//!   can test for finality using [`is_final`](fst_traits::CoreFst::is_final), and
167//!   [`final_weight`](fst_traits::CoreFst::final_weight) returns an [`Option`].  This
168//!   requires some care when converting OpenFST code.
169//! - Transitions can be accessed directly as a slice rather than requiring
170//!   an iterator.
171//! - Semiring operations are expressed as plain old methods rather
172//!   than strange C++ things.  So write `w1.plus(w2)` rather than
173//!   `Plus(w1, w2)`, for instance.
174//! - Weights have in-place operations for ⊕
175//!   ([`plus_assign`](Semiring::plus_assign)) and ⊗
176//!   ([`times_assign`](Semiring::times_assign)).
177//! - Most of the type aliases (which would be trait aliases in Rust) such
178//!   as `StdArc`, `StdFst`, and so forth, are missing, but type inference
179//!   allows us to avoid explicit type arguments in most cases, such as
180//!   when calling [`Tr::new`], for instance.
181//! - State IDs are unsigned, with [`NO_STATE_ID`] used for a missing value.
182//!   They are also 32 bits by default (presumably, 4 billion states
183//!   is enough for most applications).  This means you must take care to
184//!   cast them to [`usize`] when using them as indices, and vice-versa,
185//!   preferably checking for overflows
186//! - Symbol IDs are also unsigned and 32-bits, with [`NO_LABEL`] used
187//!   for a missing value.
188//! - Floating-point weights are not generic, so are always single-precision.
189
190#[warn(missing_docs)]
191#[cfg(test)]
192extern crate counter;
193#[macro_use]
194extern crate anyhow;
195#[cfg(test)]
196extern crate serde;
197#[cfg(test)]
198extern crate serde_json;
199
200pub use crate::drawing_config::DrawingConfig;
201pub use crate::fst_path::{check_path_in_fst, FstPath};
202pub use crate::string_path::StringPath;
203pub use crate::symbol_table::SymbolTable;
204
205pub use self::tr::Tr;
206pub use self::trs::{Trs, TrsConst, TrsVec};
207
208pub use crate::semirings::Semiring;
209#[cfg(test)]
210use doc_comment::doc_comment;
211
212// When running `cargo test`, rustdoc will check this file as well.
213#[cfg(test)]
214doc_comment!(include_str!("../../README.md"));
215
216#[cfg(test)]
217mod tests_openfst;
218
219mod symbol_table;
220
221/// Type used for the input label and output label of a transition in a wFST -> usize
222#[cfg(feature = "state-label-u32")]
223pub type Label = u32;
224#[cfg(not(feature = "state-label-u32"))]
225pub type Label = usize;
226/// Symbol to map in the Symbol Table -> String
227pub type Symbol = String;
228
229/// Type used to identify a state in a wFST -> usize
230#[cfg(feature = "state-label-u32")]
231pub type StateId = u32;
232#[cfg(not(feature = "state-label-u32"))]
233pub type StateId = usize;
234
235/// Epsilon label representing the epsilon transition (empty transition) = `0`.
236pub const EPS_LABEL: Label = 0;
237/// Epsilon symbol representing the epsilon transition (empty transition) = `<eps>`.
238pub const EPS_SYMBOL: &str = "<eps>";
239
240/// A few utilities to manipulate wFSTs.
241pub mod utils;
242
243/// Provides algorithms that are generic to all Fst.
244pub mod algorithms;
245
246/// Provides the `FstProperties` struct and some utils functions around it.
247/// Useful to assert some properties on a Fst.
248pub mod fst_properties;
249/// Implementation of the transitions inside a Fst.
250mod tr;
251
252#[macro_use]
253/// Provides traits that must be implemented to be able to use generic algorithms.
254pub mod fst_traits;
255/// Implementation of the wFST traits with different data structures.
256pub mod fst_impls;
257/// Provides a trait that shall be implemented for all weights stored inside a wFST.
258pub mod semirings;
259
260mod drawing_config;
261/// Implementation of a successful path inside a wFST.
262mod fst_path;
263mod parsers;
264mod string_path;
265
266pub use crate::parsers::nom_utils::NomCustomError;
267
268/// A representable float near .001. (Used in Quantize)
269pub const KDELTA: f32 = 1.0f32 / 1024.0f32;
270/// Default tolerance value used in floating-point comparisons.
271pub const KSHORTESTDELTA: f32 = 1e-6;
272
273/// Module re-exporting most of the objects from this crate.
274pub mod prelude {
275    pub use crate::algorithms::tr_compares::*;
276    pub use crate::algorithms::tr_mappers::*;
277    pub use crate::algorithms::*;
278    pub use crate::fst_impls::*;
279    pub use crate::fst_traits::*;
280    pub use crate::parsers::SerializeBinary;
281    pub use crate::semirings::*;
282    pub use crate::tr::Tr;
283    pub use crate::trs::{Trs, TrsConst, TrsVec};
284    pub use crate::*;
285}
286
287#[cfg(test)]
288pub mod proptest_fst;
289
290/// Used to indicate a transition with no label.
291#[cfg(feature = "state-label-u32")]
292pub static NO_LABEL: Label = std::u32::MAX;
293#[cfg(not(feature = "state-label-u32"))]
294pub static NO_LABEL: Label = std::usize::MAX;
295
296/// Used to indicate a missing state ID.
297#[cfg(feature = "state-label-u32")]
298pub static NO_STATE_ID: StateId = std::u32::MAX;
299#[cfg(not(feature = "state-label-u32"))]
300pub static NO_STATE_ID: StateId = std::usize::MAX;
301
302pub(crate) static UNASSIGNED: usize = std::usize::MAX;
303
304/// Provides a trait used to access transitions from a state.
305pub mod trs;
306/// Provides a trait used to mutably access transitions from a state.
307pub mod trs_iter_mut;