# TokenGeeX - Efficient Tokenizer for CodeGeeX
This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX aimed at code and Chinese text. It is based on UnigramLM (Taku Kudo, 2018) and TokenMonster.
## Python
You can install the PyPI TokenGeeX package through pip.
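```bash
pip install tokengeex
```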
Example usage (the vocabulary file name is illustrative):

```python
import tokengeex

# Load a tokenizer from a trained vocabulary file (file name is illustrative).
tokenizer = tokengeex.load("code-32k.json")

# Vocab
print(tokenizer.vocab_size())  # 32768
print(tokenizer.token_to_id("token"))  # 13513
print(tokenizer.id_to_token(13513))  # (b"token", -13.322)

# Encode
ids = tokenizer.encode("def main(): print(\"Hello world!\")")
print(ids)  # [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

# Decode
print(tokenizer.decode(ids))  # "def main(): print(\"Hello world!\")"

# Byte fallbacks
print([tokenizer.id_to_token(id) for id in tokenizer.encode("电脑")])  # ["电", "<0xe8>", "<0x84>", "<0x91>"]
```
## Rust
You can install the Rust library crate through cargo.
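```bash
cargo add tokengeex
```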
Example usage:
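The snippet below is a minimal sketch mirroring the Python example above; it assumes the crate exposes `load`, `encode`, and `decode` entry points that parallel the Python bindings, and elides most error handling.

```rust
fn main() {
    // Load a tokenizer from a trained vocabulary file (file name is illustrative).
    let tokenizer = tokengeex::load("code-32k.json").unwrap();

    // Vocab
    println!("{}", tokenizer.vocab_size()); // 32768

    // Encode
    let ids = tokenizer.encode("def main(): print(\"Hello world!\")");
    println!("{:?}", ids); // [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

    // Decode
    println!("{}", tokenizer.decode(&ids)); // def main(): print("Hello world!")
}
```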
## CLI
### Train
You can install the Rust binary crate through cargo.
```bash
cargo install tokengeex --features cli
```
Here's the full command used to train base vocabularies:

```bash
RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex train \
    --model 'unigram' \
    --output 'base-131k.json' \
    --logfile 'base-131k.log' \
    --vocab-size 131072 \
    --processor 'nfc' \
    --processor 'crlf' \
    --initial-vocab-max-token-length 32 \
    --initial-vocab-size 5000000 \
    --initial-vocab-insert-probability 0.01 \
    --initial-vocab-allow "$(cat data/base.regex)" \
    --unigram-shrinking-factor 0.8 \
    --unigram-num-sub-iterations 2 \
    --unigram-sample-regularization 'log' \
    --added-tokens-file './hub/tokens/base/added.json' \
    --suggested-tokens-file './hub/tokens/base/suggested.json' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin --suggested-tokens-file ./hub/tokens/base/suggested-${lang}.json "; done)
```
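The inline `for` loop expands into one `--train`/`--test` pair, plus a per-language `--suggested-tokens-file` flag, for each of the listed languages.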
Here's the full command used to train capcode vocabularies:

```bash
RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex train \
    --model 'unigram' \
    --output 'capcode-65k.json' \
    --logfile 'capcode-65k.log' \
    --vocab-size 65536 \
    --processor 'nfc' \
    --processor 'crlf' \
    --processor 'capcode' \
    --initial-vocab-max-token-length 32 \
    --initial-vocab-size 5000000 \
    --initial-vocab-insert-probability 0.01 \
    --initial-vocab-allow "$(cat data/capcode.regex)" \
    --unigram-shrinking-factor 0.8 \
    --unigram-num-sub-iterations 2 \
    --unigram-sample-regularization 'log' \
    --added-tokens-file './hub/tokens/capcode/added.json' \
    --suggested-tokens-file './hub/tokens/capcode/suggested.json' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin --suggested-tokens-file ./hub/tokens/capcode/suggested-${lang}.json "; done)
```
### Extend with BPE

Here's the full command used to extend a trained base vocabulary with BPE merges:
```bash
RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex bpe \
    --output ./base-131k-extended.json \
    --vocab ./base-131k.json \
    --num-merges 1000 \
    --step 100 \
    --score-scale-factor 0.85 \
    --max-merge-length 16 \
    --ignore '^$' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin "; done)
```