Expand description
§tongrams
: Tons of N-grams
tongrams
is a crate to index and query large language models in compressed space, in which the data structures are presented in the following papers:
-
Giulio Ermanno Pibiri and Rossano Venturini, Efficient Data Structures for Massive N-Gram Datasets. In Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017), pp. 615-624.
-
Giulio Ermanno Pibiri and Rossano Venturini, Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems (TOIS), 37.2 (2019): 1-41.
This is a Rust port of tongrams
C++ library.
§What can do
-
Store N-gram language models with frequency counts.
-
Look up N-grams to get the frequency counts.
§Features
-
Compressed language model.
tongrams-rs
can store large N-gram language models in very compressed space. For example, the word N-gram datasets (N=1..5) intest_data
are stored in only 2.6 bytes per gram. -
Time and memory efficiency.
tongrams-rs
employs Elias-Fano Trie, which cleverly encodes a trie data structure consisting of N-grams through Elias-Fano codes, enabling fast lookups in compressed space. -
Pure Rust.
tongrams-rs
is written only in Rust and can be easily pluged into your Rust codes.
§Input data format
The file format of N-gram counts files is the same as that used in tongrams
, a modified Google format, where
- one separate file for each distinct value of N (order) lists one gram per row,
- each header row
<number_of_grams>
indicates the number of N-grams in the file, - tokens in a gram
<gram>
are sparated by a space (e.g.,the same time
), and - a gram
<gram>
and the count<count>
are sparated by a horizontal tab.
<number_of_grams>
<gram1><TAB><count1>
<gram2><TAB><count2>
<gram3><TAB><count3>
...
For example,
61516
the // parent 1
the function is 22
the function a 4
the function to 1
the function and 1
...
§Examples
The following code uses datasets in test_data
at the root of this repository.
use tongrams::EliasFanoTrieCountLm;
// File names of N-grams.
let filenames = vec![
"../test_data/1-grams.sorted.gz",
"../test_data/2-grams.sorted.gz",
"../test_data/3-grams.sorted.gz",
];
// Builds the language model from n-gram counts files.
let lm = EliasFanoTrieCountLm::from_gz_files(&filenames).unwrap();
// Creates the instance for lookup.
let mut lookuper = lm.lookuper();
// Gets the count of a query N-gram written in a space-separated string.
assert_eq!(lookuper.with_str("vector"), Some(182));
assert_eq!(lookuper.with_str("in order"), Some(47));
assert_eq!(lookuper.with_str("the same memory"), Some(8));
assert_eq!(lookuper.with_str("vector is array"), None);
// Gets the count of a query N-gram formed by a string array.
assert_eq!(lookuper.with_tokens(&["vector"]), Some(182));
assert_eq!(lookuper.with_tokens(&["in", "order"]), Some(47));
assert_eq!(lookuper.with_tokens(&["the", "same", "memory"]), Some(8));
assert_eq!(lookuper.with_tokens(&["vector", "is", "array"]), None);
// Serializes the index into a writable stream.
let mut data = vec![];
lm.serialize_into(&mut data).unwrap();
// Deserializes the index from a readable stream.
let other = EliasFanoTrieCountLm::deserialize_from(&data[..]).unwrap();
assert_eq!(lm.num_orders(), other.num_orders());
assert_eq!(lm.num_grams(), other.num_grams());
Re-exports§
pub use gram::Gram;
pub use record::Record;
pub use trie_count_lm::TrieCountLm;
pub use loader::GramsFileFormats;
pub use loader::GramsLoader;
pub use parser::GramsParser;
pub use vocabulary::DoubleArrayVocabulary;
pub use vocabulary::SimpleVocabulary;
pub use vocabulary::Vocabulary;
Modules§
Structs§
- Elias
Fano Rank Array - Spece-efficient implementation of
RankArray
with Elias-Fano gapped encording. - Elias
Fano Trie Array - Spece-efficient implementation of
TrieArray
with Elias-Fano encording. - Simple
Rank Array - Simple implementation of
RankArray
withVec<usize>
. - Simple
Trie Array - Simple implementation of
TrieArray
withVec<usize>
.
Constants§
- GRAM_
COUNT_ SEPARATOR - The separator for grams and count.
- MAX_
ORDER - The maximum order of N-grams (i.e.,
1 <= N <= 8
). - TOKEN_
SEPARATOR - The separator for tokens.
Traits§
- Rank
Array - Trait for a data structure for storing count ranks.
- Trie
Array - Trait for a data structure for sorted arrays of each trie level.
Type Aliases§
- Elias
Fano Trie Count Lm - Elias-Fano Trie implementation of
TrieCountLm
. This configuration is similar toef_trie_PSEF_ranks_count_lm
in the originaltongrams
. - Simple
Trie Count Lm - Simple implementation of
TrieCountLm
. Note that this is for debug, and do NOT use it for storing massive datasets.