TokenGeeX - Efficient Tokenizer for CodeGeeX

This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX aimed at code and Chinese text. It is based on UnigramLM (Kudo, 2018) and TokenMonster.
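For background, UnigramLM treats tokenization as a search for the most probable segmentation of the input under per-token probabilities, which can be solved with a Viterbi-style dynamic program. The following Python sketch illustrates the idea with a made-up toy vocabulary; it is a simplified illustration, not TokenGeeX's actual implementation.

import math

# Hypothetical toy vocabulary mapping token -> probability. A real
# UnigramLM vocabulary is learned with EM over a training corpus.
VOCAB = {"def": 0.10, "de": 0.05, "f": 0.05, " main": 0.08, " ": 0.10, "main": 0.06}

def segment(text, vocab, max_token_len=16):
    # best[i] holds (log-probability, start index) of the best
    # segmentation of text[:i].
    best = [(-math.inf, 0) for _ in range(len(text) + 1)]
    best[0] = (0.0, 0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_token_len), end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow back-pointers to recover the winning segmentation.
    tokens, i = [], len(text)
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return list(reversed(tokens))

print(segment("def main", VOCAB))  # ['def', ' main']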

Python

You can install the TokenGeeX Python package from PyPI through pip.

pip install tokengeex

Example usage:

import tokengeex

tokenizer = tokengeex.load("unigram-32k.json")

# Vocab
print(tokenizer.vocab_size()) # 32768
print(tokenizer.token_to_id("token")) # 13513
print(tokenizer.id_to_token(13513)) # "token"

# Encode
ids = tokenizer.encode("def main(): print(\"Hello world!\")")
print(ids) # [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

# Decode
print(tokenizer.decode(ids)) # "def main(): print(\"Hello world!\")"

# Byte fallbacks
print([tokenizer.id_to_token(id) for id in tokenizer.encode("电脑")]) # ["电", "<0xe8>", "<0x84>", "<0x91>"]
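Characters missing from the vocabulary are covered by byte fallback: one token per UTF-8 byte, rendered as "<0xHH>" in the output above. The sketch below shows why such a stream round-trips losslessly, assuming that naming convention; in practice tokenizer.decode performs this reassembly for you.

import re

def detokenize(tokens):
    # Collect raw bytes: ordinary tokens contribute their UTF-8 bytes,
    # byte-fallback tokens such as "<0xe8>" contribute a single byte.
    buf = bytearray()
    for token in tokens:
        m = re.fullmatch(r"<0x([0-9a-fA-F]{2})>", token)
        if m:
            buf.append(int(m.group(1), 16))
        else:
            buf.extend(token.encode("utf-8"))
    return buf.decode("utf-8")

print(detokenize(["电", "<0xe8>", "<0x84>", "<0x91>"]))  # "电脑"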

Rust

You can install the Rust library crate through cargo.

cargo add tokengeex

Example usage:

fn main() {
    let tokenizer = tokengeex::load("unigram-32k.json").unwrap();

    // Vocab
    println!("{}", tokenizer.vocab_size()); // 32768
    println!("{}", .token_to_id("token").unwrap()) // 13513
    println!("{:?}", .id_to_token(13513).unwrap()) // "token"

    // Encode
    let ids = tokenizer.encode("def main(): print(\"Hello world!\")");
    println!("{:?}", ids); // [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

    // Decode
    println!("{:?}", tokenizer.decode(ids)); // "def main(): print(\"Hello world!\")"

    // Byte fallbacks
    println!("{:?}", tokenizer.encode("电脑").map(|id| tokenizer.id_to_token(id))); // ["电", "<0xe8>", "<0x84>", "<0x91>"]
}

CLI

You can install the Rust binary crate through cargo.

cargo install tokengeex --features cli

Here's a sample command to train a strict 4k vocabulary on a hundred megabytes of data.

RUST_LOG=info TOKENGEEX_PARALLELISM=true tokengeex train --model 'unigram' \
    --input 'data/train/code-100MB.bin' \
    --output 'data/vocab/unigram-code-4k-strict.json' \
    --special-token '<|CODE_PREFIX|>' \
    --special-token '<|CODE_SUFFIX|>' \
    --special-token '<|CODE_MIDDLE|>' \
    --special-token '<|EOS|>' \
    --vocab-size 4096 \
    --shrinking-factor '0.8' \
    --num-sub-iterations '2' \
    --suggested-tokens-file 'data/tokens/suggested.json' \
    --added-tokens-file 'data/tokens/added.json' \
    --vg-max-token-length '16' \
    --vg-max-words-per-token '3' \
    --vg-initial-vocab-size '100000' \
    --vg-insert-probability '0.01' \
    --vg-cache 'data/cache/vocab-4k-code-100MB-strict.json' \
    --vg-strict true \
    --sg-max-sentence-size '32'
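Once training completes, the resulting file can be loaded with the Python API shown earlier to sanity-check the vocabulary. A minimal sketch, keeping in mind that the exact token IDs depend on the trained vocabulary:

import tokengeex

# Load the vocabulary written by the training command above.
tokenizer = tokengeex.load("data/vocab/unigram-code-4k-strict.json")

print(tokenizer.vocab_size())  # expected to be around 4096

ids = tokenizer.encode("def main(): pass")
print([tokenizer.id_to_token(id) for id in ids])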

Here's a sample command to train a strict 16k vocabulary on the same hundred megabytes of data. Compared to the 4k recipe, it starts from a ten-times larger initial vocabulary, allows longer tokens (20 characters instead of 16), and uses a larger maximum sentence size.

RUST_LOG=info TOKENGEEX_PARALLELISM=true tokengeex train --model 'unigram' \
    --input 'data/train/code-100MB.bin' \
    --output 'data/vocab/unigram-code-16k-strict.json' \
    --special-token '<|CODE_PREFIX|>' \
    --special-token '<|CODE_SUFFIX|>' \
    --special-token '<|CODE_MIDDLE|>' \
    --special-token '<|EOS|>' \
    --vocab-size 16384 \
    --shrinking-factor '0.75' \
    --num-sub-iterations '2' \
    --suggested-tokens-file 'data/tokens/suggested.json' \
    --added-tokens-file 'data/tokens/added.json' \
    --vg-max-token-length '20' \
    --vg-max-words-per-token '3' \
    --vg-initial-vocab-size '1000000' \
    --vg-insert-probability '0.01' \
    --vg-cache 'data/cache/vocab-16k-code-100MB-strict.json' \
    --vg-strict true \
    --sg-max-sentence-size '40'