# Tokengrams
Tokengrams lets you efficiently compute $n$-gram statistics for the pre-tokenized text corpora used to train large language models. It does this not by pre-computing $n$-gram counts for a fixed $n$, but by building a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) index over the corpus, which lets you compute the count of any $n$-gram on the fly, for any $n$.
Our code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.
The backend is written in Rust, and the Python bindings are generated using PyO3.
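To see why a suffix array makes this possible: every occurrence of a query occupies one contiguous run of the suffix array, so its count falls out of two binary searches. Here is a toy Python illustration of the idea (an illustration only, not the library's Rust implementation):

```python
from bisect import bisect_left, bisect_right

corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3]  # a tiny pre-tokenized "corpus"

# The suffix array: every suffix start position, sorted by the suffix it points to.
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def count(query: list[int]) -> int:
    """Count occurrences of `query` with two binary searches (Python 3.10+ bisect)."""
    prefix = lambda i: corpus[i : i + len(query)]
    lo = bisect_left(suffix_array, query, key=prefix)
    hi = bisect_right(suffix_array, query, key=prefix)
    return hi - lo

print(count([1, 2]))     # 3 occurrences of the bigram (1, 2)
print(count([1, 2, 3]))  # 2 occurrences of the trigram (1, 2, 3)
```

The library does the same thing over memory-mapped token files, in Rust.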
## Installation
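Tokengrams is available from PyPI:

```bash
pip install tokengrams
```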
## Development
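To hack on the Rust backend, the usual PyO3/maturin workflow should apply (assuming a Rust toolchain is installed; the exact commands may differ from the repo's contributing notes):

```bash
pip install maturin
maturin develop --release  # build the extension and install it into the active venv
```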
## Usage
### Building an index
Something like the following builds an index and saves it to disk (a sketch against the bindings' documented methods; exact signatures can vary between releases, and the Hugging Face tokenizer is only an example):
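```python
from tokengrams import MemmapIndex
from transformers import AutoTokenizer

# Any tokenizer matching the corpus works; this one is just an example.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Create a new index from an on-disk corpus called `document.bin` and save it to
# `pile.idx`.
index = MemmapIndex.build("document.bin", "pile.idx")

# Verify index correctness
print(index.is_sorted())

# Get the count of "hello world" in the corpus.
print(index.count(tokenizer.encode("hello world")))

# You can now load the index from disk later using __init__
index = MemmapIndex("document.bin", "pile.idx")
```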
### Using an index
A built index supports conditional counts, autoregressive sampling, and membership queries. As above, this is a sketch assuming the documented methods (`count_next`, `batch_count_next`, `sample_unsmoothed`, `contains`, `positions`); check the type stubs shipped with your installed version for exact signatures:
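```python
from tokengrams import MemmapIndex
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
index = MemmapIndex("document.bin", "pile.idx")

# Count how often each token in the corpus succeeds "hello world".
print(index.count_next(tokenizer.encode("hello world")))

# Parallelize over queries
print(index.batch_count_next(
    [tokenizer.encode("hello world"), tokenizer.encode("hello universe")]
))

# Autoregressively sample 10 tokens using 5-gram language statistics. Initial
# gram statistics are derived from the query, with lower order gram statistics used
# until the sequence contains at least 5 tokens.
print(index.sample_unsmoothed(tokenizer.encode("hello world"), n=5, k=10, num_samples=1))

# Parallelize over sequence generations
print(index.sample_unsmoothed(tokenizer.encode("hello world"), n=5, k=10, num_samples=20))

# Query whether the corpus contains "hello world"
print(index.contains(tokenizer.encode("hello world")))

# Get all n-grams beginning with "hello world" in the corpus: `positions`
# returns the corpus offsets where the query occurs.
print(index.positions(tokenizer.encode("hello world")))
```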
## Support
The best way to get support is to open an issue on this repo or post in #inductive-biases in the EleutherAI Discord server. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!