Crate rustfst

Rust implementation of Weighted Finite-State Transducers.

Rustfst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition’s input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition.

FSTs have key applications in speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, information extraction and retrieval among others. Often a weighted transducer is used to represent a probabilistic model (e.g., an n-gram model, pronunciation model). FSTs can be optimized by determinization and minimization, models can be applied to hypothesis sets (also represented as automata) or cascaded by finite-state composition, and the best results can be selected by shortest-path algorithms.

§Overview

For a basic example, see the Example section below.

Some simple and commonly encountered types of FSTs can be easily created with the macro fst or the functions acceptor and transducer.

For more complex cases you will likely start with the VectorFst type, which will be imported in the prelude along with most everything else you need. VectorFst<TropicalWeight> corresponds directly to the OpenFST StdVectorFst, and can be used to load its files using read or read_text.

Because “iteration” over an FST can mean many different things, there are a variety of different iterators. To iterate over state IDs you may use states_iter, while to iterate over transitions out of a state, you may use get_trs. Since it is common to iterate over both, this can be done using fst_iter or fst_into_iter. It is also very common to iterate over paths accepted by an FST, which can be done with paths_iter, and as a convenience for generating text, string_paths_iter. Alternately, in the case of a linear FST, you may retrieve the only possible path with decode_linear_fst.

Note that iterating over paths is not the same thing as finding the shortest path or paths, which is done with shortest_path (for a single path) or shortest_path_with_config (for N-shortest paths).

For the complete list of algorithms, see the algorithms module.

You may now be wondering, especially if you have previously used such linguist-friendly tools as pyfoma, “what if I just want to transduce some text???” The unfriendly answer is that rustfst is a somewhat lower-level library, designed for implementing things like speech recognizers. The somewhat more helpful answer is that you would do this by constructing an acceptor for your input, which you will compose with a transducer, then project the result to its output, and finally iterate over the paths in the resulting FST.
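
The recipe above (input acceptor, composition, output projection, path enumeration) can be sketched without the library to show what each step contributes. This is a minimal standalone illustration, not rustfst's API: `ToyTr`, `ToyFst`, and `transduce` are hypothetical names, the transducer is assumed to be epsilon-free, and weights are assumed to combine by addition as in the tropical semiring.

```rust
/// One transition of a toy transducer. All names here are hypothetical;
/// these are not rustfst's `Tr` or `VectorFst` types.
struct ToyTr {
    from: usize,
    ilabel: char,
    olabel: char,
    weight: f32,
    to: usize,
}

/// A toy epsilon-free transducer with additive (tropical-style) weights.
struct ToyFst {
    start: usize,
    finals: Vec<(usize, f32)>, // (final state, final weight)
    trs: Vec<ToyTr>,
}

/// Returns every (output string, total weight) pair that `fst` assigns to `input`.
fn transduce(fst: &ToyFst, input: &str) -> Vec<(String, f32)> {
    let symbols: Vec<char> = input.chars().collect();
    let mut results = Vec::new();
    // Each entry is a state of the implicit composition of the linear input
    // acceptor with the transducer: (input position, transducer state),
    // plus the output and weight accumulated so far.
    let mut stack = vec![(0usize, fst.start, String::new(), 0.0f32)];
    while let Some((pos, state, out, weight)) = stack.pop() {
        if pos == symbols.len() {
            // The input is consumed; accept only in a final transducer state.
            for &(final_state, final_weight) in &fst.finals {
                if final_state == state {
                    results.push((out.clone(), weight + final_weight));
                }
            }
            continue;
        }
        for tr in &fst.trs {
            if tr.from == state && tr.ilabel == symbols[pos] {
                // Keeping only the output label is the "project to output" step.
                let mut next_out = out.clone();
                next_out.push(tr.olabel);
                stack.push((pos + 1, tr.to, next_out, weight + tr.weight));
            }
        }
    }
    results
}
```

The sketch fuses the steps into one traversal for brevity. With rustfst itself, each step is a separate operation on FST values, so the intermediate results can also be optimized or searched with the algorithms listed above.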

§References

The implementation is heavily inspired by the work of Mehryar Mohri, Cyril Allauzen, and Michael Riley.

The API closely resembles that of OpenFST, with some simplifications and changes to make it more idiomatic in Rust, notably the use of Tr instead of Arc. See Differences from OpenFST for more information.

§Example

use anyhow::Result;
use rustfst::prelude::*;
use rustfst::algorithms::determinize::{DeterminizeType, determinize};
use rustfst::algorithms::rm_epsilon::rm_epsilon;

fn main() -> Result<()> {
    // Create an empty wFST
    let mut fst = VectorFst::<TropicalWeight>::new();

    // Add some states
    let s0 = fst.add_state();
    let s1 = fst.add_state();
    let s2 = fst.add_state();

    // Set s0 as the start state
    fst.set_start(s0)?;

    // Add a transition from s0 to s1
    fst.add_tr(s0, Tr::new(3, 5, 10.0, s1))?;

    // Add a transition from s0 to s2
    fst.add_tr(s0, Tr::new(5, 7, 18.0, s2))?;

    // Set s1 and s2 as final states
    fst.set_final(s1, 31.0)?;
    fst.set_final(s2, 45.0)?;

    // Iterate over all the paths in the wFST
    for p in fst.paths_iter() {
        println!("{:?}", p);
    }

    // A lot of operations are available to modify/optimize the FST.
    // Here are a few examples:

    // - Remove useless states.
    connect(&mut fst)?;

    // - Optimize the FST by merging states with the same behaviour.
    minimize(&mut fst)?;

    // - Copy all the input labels to the output labels.
    project(&mut fst, ProjectType::ProjectInput);

    // - Remove epsilon transitions.
    rm_epsilon(&mut fst)?;

    // - Compute an equivalent FST but deterministic.
    fst = determinize(&fst)?;

    Ok(())
}
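
To make the weights in this example concrete: in the tropical semiring, the weight of a path is the sum of its transition weights and the final weight of its last state. The following standalone sketch (the helper name `tropical_path_weight` is hypothetical and not part of rustfst) computes that sum with plain `f32` values.

```rust
/// Total weight of one accepting path in the tropical semiring:
/// the transition weights and the final weight are combined by addition.
fn tropical_path_weight(transition_weights: &[f32], final_weight: f32) -> f32 {
    transition_weights.iter().sum::<f32>() + final_weight
}
```

For the FST built above, the path through s1 has weight 10.0 + 31.0 = 41.0 and the path through s2 has weight 18.0 + 45.0 = 63.0, so the path through s1 is the shorter of the two.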

§Differences from OpenFST

Here is a non-exhaustive list of ways in which Rustfst’s API differs from OpenFST:

  • The default epsilon symbol is <eps> and not <epsilon>.
  • Functions and methods follow Rust naming conventions, e.g. add_state rather than AddState, but are otherwise mostly equivalent, except that:
  • Transitions are called Tr and not Arc, because Arc has a rather different and well-established meaning in Rust, and rustfst uses it (std::sync::Arc, that is) to reference-count symbol tables. All associated functions also use tr.
  • Non-final states are not indicated by a final weight of zero, as they are in OpenFST. You can test for finality using is_final, and final_weight returns an Option, which is None for a non-final state. This requires some care when converting OpenFST code.
  • Transitions can be accessed directly as a slice rather than requiring an iterator.
  • Semiring operations are expressed as plain old methods rather than strange C++ things. So write w1.plus(w2) rather than Plus(w1, w2), for instance.
  • Weights have in-place operations for ⊕ (plus_assign) and ⊗ (times_assign).
  • Most of the type aliases (which would be trait aliases in Rust) such as StdArc, StdFst, and so forth, are missing, but type inference allows us to avoid explicit type arguments in most cases, such as when calling Tr::new, for instance.
  • State IDs are unsigned, with NO_STATE_ID used for a missing value. They are also 32 bits by default (presumably, 4 billion states is enough for most applications). This means you must take care to cast them to usize when using them as indices, and vice versa, preferably checking for overflows.
  • Symbol IDs are also unsigned and 32 bits, with NO_LABEL used for a missing value.
  • Floating-point weights are not generic, so are always single-precision.
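
Three of the points above (semiring operations as methods, final weights as an Option, and 32-bit state IDs used as indices) can be illustrated with a small standalone sketch. It deliberately does not use rustfst's own types: `ToyTropical`, `ToyStateId`, `final_weight`, and `state_id_from_index` are hypothetical stand-ins, and the real methods may have different signatures (for example, by returning a Result).

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a tropical weight; not rustfst's `TropicalWeight`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ToyTropical(f32);

impl ToyTropical {
    /// ⊕ in the tropical semiring is the minimum.
    fn plus(self, rhs: ToyTropical) -> ToyTropical {
        ToyTropical(self.0.min(rhs.0))
    }

    /// ⊗ in the tropical semiring is ordinary addition.
    fn times(self, rhs: ToyTropical) -> ToyTropical {
        ToyTropical(self.0 + rhs.0)
    }

    /// In-place ⊕.
    fn plus_assign(&mut self, rhs: ToyTropical) {
        *self = self.plus(rhs);
    }

    /// In-place ⊗.
    fn times_assign(&mut self, rhs: ToyTropical) {
        *self = self.times(rhs);
    }
}

/// Hypothetical 32-bit unsigned state ID.
type ToyStateId = u32;

/// Looks up a final weight in a per-state table.
/// `None` means the state is not final (or does not exist),
/// so no sentinel weight is needed to mark non-final states.
fn final_weight(finals: &[Option<ToyTropical>], state: ToyStateId) -> Option<ToyTropical> {
    // Checked conversion from the 32-bit ID to an index.
    let index = usize::try_from(state).ok()?;
    finals.get(index).copied().flatten()
}

/// Converts a vector index back to a state ID, returning `None` on overflow.
fn state_id_from_index(index: usize) -> Option<ToyStateId> {
    ToyStateId::try_from(index).ok()
}
```

Using checked conversions in both directions keeps an oversized index from silently wrapping into a valid-looking state ID, which is the overflow concern raised in the list above.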

Re-exports§

Modules§

  • Provides algorithms that are generic to all Fst.
  • Implementation of the wFST traits with different data structures.
  • Provides the FstProperties struct and some utility functions around it. Useful to assert some properties on a Fst.
  • Provides traits that must be implemented to be able to use generic algorithms.
  • Module re-exporting most of the objects from this crate.
  • Provides a trait that shall be implemented for all weights stored inside a wFST.
  • Provides a trait used to access transitions from a state.
  • Provides a trait used to mutably access transitions from a state.
  • A few utilities to manipulate wFSTs.

Macros§

  • Creates a linear Fst containing the arguments.
  • Creates a Path containing the arguments.
  • Creates a SymbolTable containing the arguments.

Structs§

  • Struct to configure how the FST should be drawn.
  • Structure representing a path in an FST (list of input labels, list of output labels and total weight).
  • Wrapper around FstPath to nicely handle SymbolTables.
  • A symbol table stores a bidirectional mapping between transition labels and “symbols” (strings).
  • Structure representing a transition from a state to another state in an FST.

Enums§

Constants§

  • Epsilon label representing the epsilon transition (empty transition) = 0.
  • Epsilon symbol representing the epsilon transition (empty transition) = <eps>.
  • A representable float near .001. (Used in Quantize)
  • Default tolerance value used in floating-point comparisons.

Statics§

  • Used to indicate a transition with no label.
  • Used to indicate a missing state ID.

Functions§

Type Aliases§

  • Type used for the input label and output label of a transition in a wFST -> u32 by default
  • Type used to identify a state in a wFST -> u32 by default
  • Symbol to map in the Symbol Table -> String