//! # `tongrams`: Tons of *N*-grams
//!
//! `tongrams` is a crate for indexing and querying large language models in compressed space, based on the data structures presented in the following papers:
//!
//!  - Giulio Ermanno Pibiri and Rossano Venturini, [Efficient Data Structures for Massive N-Gram Datasets](https://doi.org/10.1145/3077136.3080798). In *Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017)*, pp. 615-624.
//!
//!  - Giulio Ermanno Pibiri and Rossano Venturini, [Handling Massive N-Gram Datasets Efficiently](https://doi.org/10.1145/3302913). *ACM Transactions on Information Systems (TOIS)*, 37.2 (2019): 1-41.
//!
//! This is a Rust port of the [`tongrams`](https://github.com/jermp/tongrams) C++ library.
//!
//! ## What it can do
//!
//!  - Store *N*-gram language models with frequency counts.
//!
//!  - Look up *N*-grams to get the frequency counts.
//!
//! ## Features
//!
//!  - **Compressed language model.** `tongrams-rs` can store large *N*-gram language models in very compressed space. For example, the word *N*-gram datasets (*N*=1..5) in `test_data` are stored in only 2.6 bytes per gram.
//!
//!  - **Time and memory efficiency.** `tongrams-rs` employs the *Elias-Fano Trie*, which cleverly encodes a trie data structure consisting of *N*-grams through *Elias-Fano codes*, enabling fast lookups in compressed space.
//!
//!  - **Pure Rust.** `tongrams-rs` is written only in Rust and can be easily plugged into your Rust code.
//!
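//! The Elias-Fano split behind the trie encoding can be illustrated directly. The following is a sketch of the textbook definition, not this crate's implementation: each value of a monotone sequence of `n` integers bounded by `u` is split into `l = log2(u / n)` low bits, stored verbatim, and the remaining high bits, stored in unary.
//!
//! ```
//! // Illustrative Elias-Fano split (not this crate's API).
//! let seq: [u64; 4] = [3, 4, 7, 13]; // monotone sequence, n = 4
//! let u = 16u64; // universe size
//! let l = (u / seq.len() as u64).ilog2() as u64; // low-bit width, here 2
//! let lows: Vec<u64> = seq.iter().map(|&x| x & ((1 << l) - 1)).collect();
//! let highs: Vec<u64> = seq.iter().map(|&x| x >> l).collect();
//! assert_eq!(lows, vec![3, 0, 3, 1]);
//! assert_eq!(highs, vec![0, 1, 1, 3]); // stored in unary in the real structure
//! ```
//!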
//! ## Input data format
//!
//! The file format of *N*-gram counts files is the same as that used in [`tongrams`](https://github.com/jermp/tongrams), a modified [Google format](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), where
//!
//!  - a separate file for each distinct value of *N* (order) lists one gram per row,
//!  - the header row `<number_of_grams>` indicates the number of *N*-grams in the file,
//!  - tokens in a gram `<gram>` are separated by a space (e.g., `the same time`), and
//!  - a gram `<gram>` and its count `<count>` are separated by a horizontal tab.
//!
//! ```text
//! <number_of_grams>
//! <gram1><TAB><count1>
//! <gram2><TAB><count2>
//! <gram3><TAB><count3>
//! ...
//! ```
//!
//! For example,
//!
//! ```text
//! 61516
//! the // parent       1
//! the function is     22
//! the function a      4
//! the function to     1
//! the function and    1
//! ...
//! ```
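//!
//! A file in this format can be parsed with only the standard library. The following is an illustrative sketch (`parse_counts` is a hypothetical helper, not an API of this crate):
//!
//! ```
//! // Hypothetical parser for the gram-count format described above.
//! fn parse_counts(text: &str) -> Option<Vec<(String, usize)>> {
//!     let mut lines = text.lines();
//!     // The header row gives the number of grams in the file.
//!     let num: usize = lines.next()?.trim().parse().ok()?;
//!     let mut records = Vec::with_capacity(num);
//!     for line in lines.take(num) {
//!         // A gram and its count are separated by a horizontal tab.
//!         let (gram, count) = line.split_once('\t')?;
//!         records.push((gram.to_string(), count.trim().parse().ok()?));
//!     }
//!     Some(records)
//! }
//!
//! let text = "2\nthe function is\t22\nthe function a\t4\n";
//! let records = parse_counts(text).unwrap();
//! assert_eq!(records[0], ("the function is".to_string(), 22));
//! assert_eq!(records[1], ("the function a".to_string(), 4));
//! ```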
//!
//! ## Examples
//!
//! The following code uses datasets in [`test_data`](https://github.com/kampersanda/tongrams-rs/tree/main/test_data) at the root of this repository.
//!
//! ```
//! use tongrams::EliasFanoTrieCountLm;
//!
//! // File names of the N-gram counts files.
//! let filenames = vec![
//!     "../test_data/1-grams.sorted.gz",
//!     "../test_data/2-grams.sorted.gz",
//!     "../test_data/3-grams.sorted.gz",
//! ];
//!
//! // Builds the language model from the N-gram counts files.
//! let lm = EliasFanoTrieCountLm::from_gz_files(&filenames).unwrap();
//!
//! // Creates an instance for lookup.
//! let mut lookuper = lm.lookuper();
//!
//! // Gets the count of a query N-gram written as a space-separated string.
//! assert_eq!(lookuper.with_str("vector"), Some(182));
//! assert_eq!(lookuper.with_str("in order"), Some(47));
//! assert_eq!(lookuper.with_str("the same memory"), Some(8));
//! assert_eq!(lookuper.with_str("vector is array"), None);
//!
//! // Gets the count of a query N-gram formed by a string array.
//! assert_eq!(lookuper.with_tokens(&["vector"]), Some(182));
//! assert_eq!(lookuper.with_tokens(&["in", "order"]), Some(47));
//! assert_eq!(lookuper.with_tokens(&["the", "same", "memory"]), Some(8));
//! assert_eq!(lookuper.with_tokens(&["vector", "is", "array"]), None);
//!
//! // Serializes the index into a writable stream.
//! let mut data = vec![];
//! lm.serialize_into(&mut data).unwrap();
//!
//! // Deserializes the index from a readable stream.
//! let other = EliasFanoTrieCountLm::deserialize_from(&data[..]).unwrap();
//! assert_eq!(lm.num_orders(), other.num_orders());
//! assert_eq!(lm.num_grams(), other.num_grams());
//! ```
pub mod gram;
pub mod loader;
pub mod parser;
pub mod record;
pub mod trie_count_lm;
pub mod util;
pub mod vocabulary;

mod mappers;
mod rank_array;
mod trie_array;

/// The maximum order of *N*-grams (i.e., `1 <= N <= 8`).
pub const MAX_ORDER: usize = 8;
/// The separator between tokens.
pub const TOKEN_SEPARATOR: u8 = b' ';
/// The separator between a gram and its count.
pub const GRAM_COUNT_SEPARATOR: u8 = b'\t';

pub use gram::Gram;
pub use record::Record;
pub use trie_count_lm::TrieCountLm;

pub use loader::{GramsFileFormats, GramsLoader};
pub use parser::GramsParser;

pub use rank_array::{EliasFanoRankArray, RankArray, SimpleRankArray};
pub use trie_array::{EliasFanoTrieArray, SimpleTrieArray, TrieArray};
pub use vocabulary::{DoubleArrayVocabulary, SimpleVocabulary, Vocabulary};

/// A simple implementation of [`TrieCountLm`].
/// Note that this is for debugging only; do NOT use it to store massive datasets.
pub type SimpleTrieCountLm = TrieCountLm<SimpleTrieArray, SimpleVocabulary, SimpleRankArray>;

/// The Elias-Fano Trie implementation of [`TrieCountLm`].
/// This configuration is similar to `ef_trie_PSEF_ranks_count_lm` in the original `tongrams`.
pub type EliasFanoTrieCountLm =
    TrieCountLm<EliasFanoTrieArray, DoubleArrayVocabulary, EliasFanoRankArray>;

fn handle_bincode_error(e: std::boxed::Box<bincode::ErrorKind>) -> anyhow::Error {
    anyhow::anyhow!("{:?}", e)
}