# N-gram Model for Readable Hash Generation
This directory contains tools for training an n-gram language model that converts arbitrary bytes into pronounceable, English-like words.
## Overview
The system uses a position-aware tokenizer with 1024 tokens:
- **Beginning tokens** (0-255): Start of word, e.g., `^the`, `^un`, `^re`
- **End tokens** (256-511): End of word, e.g., `ing$`, `ed$`, `s$`
- **Middle tokens** (512-1023): Middle of word, e.g., `er`, `at`, `ion`
## Algorithm Details
### 1) Extract words
The trainer lowercases text and extracts alphabetic words only:
- Regex: `[a-z]+`
- Non-ASCII letters and punctuation are ignored.
### 2) Collect n-gram counts
For each word, the trainer counts n-grams up to length 6:
- **Beginning n-grams:** `^` + prefix of length 1..6
- **End n-grams:** suffix of length 1..6 + `$`
- **Middle n-grams:** all substrings that start after the first character and
end before the last character
This yields three frequency tables: beginning, middle, and end.
### 3) Build a fixed-size vocabulary (1024 tokens)
The vocabulary is built by frequency within each position class:
- Top 256 beginning tokens (0–255); unused slots are filled with
`[UNUSED_BEG_*]`.
- Top 256 end tokens (256–511); unused slots are filled with
`[UNUSED_END_*]`.
- Top 512 middle tokens (512–1023); unused slots are filled with
`[UNUSED_MID_*]`.
The `id_to_token` array is derived from the vocabulary so IDs are contiguous.
### 4) Tokenize words (greedy, position-aware)
Each word is tokenized using greedy longest-match up to 6 characters:
1. At each position, try the longest candidate substring.
2. If at the beginning, prefer `^substring` tokens.
3. If at the end, prefer `substring$` tokens.
4. If the entire word fits, allow `^substring$`.
5. If no candidate is found, fall back to a single-character token.
This produces a token sequence per word, typically 2–16 tokens.
### 5) Build transition counts
The model is a bigram transition table over token IDs:
- `begin_transitions`: frequency of first tokens in each word.
- `transitions`: counts for token → token where the next token is **not** an end
token.
- `end_transitions`: counts for token → end-token.
### 6) Convert to cumulative probability tables
For each transition list, counts are converted to cumulative probabilities and
sorted by descending frequency. This enables binary search over cumulative
probabilities during generation.
## Usage
### Build a Generator Model
```bash
python3 tokenize.py training-data/english-lowercase.txt -o training-data/
```
`tokenize.py` accepts one or more `.txt` files or directories. If you pass a
directory, it will scan `**/*.txt`. If you pass no input files, it reads from
stdin.
```bash
# multiple input files
python3 tokenize.py corpus/a.txt corpus/b.txt -o training-data/
# directory input
python3 tokenize.py corpus/ -o training-data/
# stdin input
The output model filename is based on the first input file stem (or `stdin`).
This generates `training-data/english-lowercase-model.json` containing:
- `vocab`: Token to ID mapping
- `id_to_token`: ID to token mapping (array)
- `begin_transitions`: Cumulative probabilities for first token selection
- `transitions`: Token-to-token transition probabilities
- `end_transitions`: Token-to-end-token transition probabilities
### Generate Words from Bytes
The model converts entropy (bytes) into English-like words using cumulative probability distributions.
**Example: 0xDEADBEEF (4 bytes)**
```
Input: 0xDEADBEEF
Output: snattenotional
```
**Example: 0xDEADBEEFCAFE (6 bytes)**
```
Input: 0xDEADBEEFCAFE
Output: snattenotlausece
```
### Word Generation Algorithm
1. **Select beginning token**: Use first 8 bits to index into `begin_transitions`
2. **Select middle tokens**: For each token, use bits to select from `transitions[current_token]`
3. **Select end token**: Use remaining bits to select from `end_transitions[current_token]`
The cumulative probability format enables O(log n) binary search:
```json
{
"begin_transitions": [[token_id, cumulative_prob], ...],
"transitions": {"token_id": [[next_id, cumulative_prob], ...]},
"end_transitions": {"token_id": [[end_id, cumulative_prob], ...]}
}
```
### Output Length
Word length scales with input bytes:
| 1 | 2 |
| 2 | 3 |
| 4 | 6 |
| 6 | 8 |
| 8 | 10 |
| 16 | 16 (max) |
## Generate Rust Code
Convert the model to Rust code for use in the library:
```bash
python3 generate_rust.py training-data/english-lowercase-model.json
```
This generates `../src/english_word.rs` containing:
- Static token vocabulary
- Transition tables with cumulative probabilities
- `generate_word()` function for word generation
### Rust API
```rust
use readable_hash::english_word_hash;
let words = english_word_hash("hello");
// "basember durangle misinual sancture execony weary"
```
## Files
- `tokenize.py` - Vocabulary builder, tokenizer, and model trainer
- `generate_rust.py` - Generate Rust code from model
- `extract_words.py` - Extract unique words from corpus
- `test_tokenizer.py` - Test tokenization of words
- `training-data/` - Generated model files
- `corpus/` - Source text files (not tracked in git)