# Tokengrams
Tokengrams allows you to efficiently compute $n$-gram statistics for pre-tokenized text corpora used to train large language models. It does this not by explicitly pre-computing the $n$-gram counts for fixed $n$, but by creating a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) index which allows you to efficiently compute the count of an $n$-gram on the fly for any $n$.
Our code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.
The backend is written in Rust, and the Python bindings are generated using [PyO3](https://github.com/PyO3/pyo3).
# Installation
```bash
pip install tokengrams
```
# Development
```bash
pip install maturin
maturin develop
```
# Usage
## Building an index
```python
from tokengrams import MemmapIndex
# Create a new index from an on-disk corpus called `document.bin` and save it to
# `pile.idx`.
index = MemmapIndex.build(
"/data/document.bin",
"/pile.idx",
)
# Verify index correctness
print(index.is_sorted())
# Get the count of "hello world" in the corpus.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
print(index.count(tokenizer.encode("hello world")))
# You can now load the index from disk later using __init__
index = MemmapIndex(
"/data/document.bin",
"/pile.idx"
)
```
## Using an index
```python
# Count how often each token in the corpus succeeds "hello world".
print(index.count_next(tokenizer.encode("hello world")))
# Parallelise over queries
print(index.batch_count_next(
[tokenizer.encode("hello world"), tokenizer.encode("hello universe")]
))
# Autoregressively sample 10 tokens using 5-gram language statistics. Initial
# gram statistics are derived from the query, with lower order gram statistics used
# until the sequence contains at least 5 tokens.
print(index.sample(tokenizer.encode("hello world"), n=5, k=10))
# Parallelize over sequence generations
print(index.batch_sample(tokenizer.encode("hello world"), n=5, k=10, num_samples=20))
# Query whether the corpus contains "hello world"
print(index.contains(tokenizer.encode("hello world")))
# Get all n-grams beginning with "hello world" in the corpus
print(index.positions(tokenizer.encode("hello world")))
```
# Support
The best way to get support is to open an issue on this repo or post in #inductive-biases in the [EleutherAI Discord server](https://discord.gg/eleutherai). If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!