# TokenGeeX - Efficient Tokenizer for CodeGeeX
This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX aimed at code and Chinese text. It is based on UnigramLM (Taku Kudo, 2018) and TokenMonster.
## Python
You can install the PyPI TokenGeeX package through pip.
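```bash
pip install tokengeex
```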
Example usage (the vocabulary file name is illustrative):

```python
import tokengeex

# Load a tokenizer from a trained vocabulary file (file name is illustrative).
tokenizer = tokengeex.load("code-32k.json")

# Vocab
print(tokenizer.vocab_size())  # 32768
print(tokenizer.token_to_id("token"))  # 13513
print(tokenizer.id_to_token(13513))  # (b"token", -13.322)

# Encode
ids = tokenizer.encode("def main(): print(\"Hello world!\")")
print(ids)  # [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

# Decode
print(tokenizer.decode(ids))  # "def main(): print(\"Hello world!\")"

# Byte fallbacks
print([tokenizer.id_to_token(id) for id in tokenizer.encode("电脑")])  # ["电", "<0xe8>", "<0x84>", "<0x91>"]
```
## Rust
You can install the Rust library crate through cargo.
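```bash
cargo add tokengeex
```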
Example usage:
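The snippet below is a minimal sketch mirroring the Python example above; it assumes the crate exposes `load`, `encode`, and `decode` entry points that parallel the Python bindings, and elides most error handling.

```rust
fn main() {
    // Load a tokenizer from a trained vocabulary file (file name is illustrative).
    let tokenizer = tokengeex::load("code-32k.json").unwrap();

    // Vocab
    println!("{}", tokenizer.vocab_size()); // 32768

    // Encode
    let ids = tokenizer.encode("def main(): print(\"Hello world!\")");
    println!("{:?}", ids); // [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

    // Decode
    println!("{}", tokenizer.decode(&ids)); // def main(): print("Hello world!")
}
```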
## CLI
### Train
You can install the Rust binary crate through cargo.
```bash
cargo install tokengeex --features cli
```
Here's the full command used to train base vocabularies:

```bash
RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex train \
    --model 'unigram' \
    --output 'base-131k.json' \
    --logfile 'base-131k.log' \
    --vocab-size 131072 \
    --processor 'nfc' \
    --processor 'crlf' \
    --initial-vocab-max-token-length 32 \
    --initial-vocab-size 5000000 \
    --initial-vocab-insert-probability 0.01 \
    --initial-vocab-allow "$(cat data/base.regex)" \
    --unigram-shrinking-factor 0.8 \
    --unigram-num-sub-iterations 2 \
    --unigram-sample-regularization 'log' \
    --added-tokens-file './hub/tokens/base/added.json' \
    --suggested-tokens-file './hub/tokens/base/suggested.json' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin --suggested-tokens-file ./hub/tokens/base/suggested-${lang}.json "; done)
```
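The inline `for` loop expands into one `--train`/`--test` pair, plus a per-language `--suggested-tokens-file` flag, for each of the listed languages.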
Here's the full command used to train capcode vocabularies:

```bash
RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex train \
    --model 'unigram' \
    --output 'capcode-65k.json' \
    --logfile 'capcode-65k.log' \
    --vocab-size 65536 \
    --processor 'nfc' \
    --processor 'crlf' \
    --processor 'capcode' \
    --initial-vocab-max-token-length 32 \
    --initial-vocab-size 5000000 \
    --initial-vocab-insert-probability 0.01 \
    --initial-vocab-allow "$(cat data/capcode.regex)" \
    --unigram-shrinking-factor 0.8 \
    --unigram-num-sub-iterations 2 \
    --unigram-sample-regularization 'log' \
    --added-tokens-file './hub/tokens/capcode/added.json' \
    --suggested-tokens-file './hub/tokens/capcode/suggested.json' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin --suggested-tokens-file ./hub/tokens/capcode/suggested-${lang}.json "; done)
```
### Extend with BPE

Here's the full command used to extend a trained base vocabulary with BPE merges:
```bash
RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex bpe \
    --output ./base-131k-extended.json \
    --vocab ./base-131k.json \
    --num-merges 1000 \
    --step 100 \
    --score-scale-factor 0.85 \
    --max-merge-length 16 \
    --ignore '^$' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin "; done)
```