TokenGeeX - Efficient Tokenizer for CodeGeeX
This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX geared towards code and Chinese text. It is based on UnigramLM (Kudo, 2018) and TokenMonster.
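For intuition, a UnigramLM tokenizer assigns each vocabulary token a log-probability and segments input text into the token sequence whose log-probabilities sum to the maximum, which can be computed with a Viterbi-style dynamic program over character positions. Below is a minimal, self-contained sketch of that idea in Python; the toy vocabulary and the segment helper are illustrative only and are not part of the TokenGeeX API.
import math

# Hypothetical toy vocabulary with unigram log-probabilities (illustrative values).
log_probs = {
    "def": math.log(0.05),
    "de": math.log(0.01),
    "f": math.log(0.02),
    " main": math.log(0.03),
    " ": math.log(0.04),
    "main": math.log(0.02),
}

def segment(text, log_probs, max_token_len=8):
    # best[i] = (score of the best segmentation of text[:i], start index of its last token)
    best = [(-math.inf, -1)] * (len(text) + 1)
    best[0] = (0.0, 0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_token_len), end):
            token = text[start:end]
            if token in log_probs and best[start][0] + log_probs[token] > best[end][0]:
                best[end] = (best[start][0] + log_probs[token], start)
    # Backtrack to recover the winning token sequence.
    tokens, i = [], len(text)
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return list(reversed(tokens))

print(segment("def main", log_probs))  # ['def', ' main']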
Python
You can install the TokenGeeX package from PyPI through pip.
pip install tokengeex
Example usage:
import tokengeex
tokenizer = tokengeex.load("code-32k-strict.json")
print(tokenizer.vocab_size())
print(tokenizer.token_to_id("token"))
print(tokenizer.id_to_token(13513))
ids = tokenizer.encode("def main(): print(\"Hello world!\")")
print(ids)
print(tokenizer.decode(ids))
print([tokenizer.id_to_token(id) for id in tokenizer.encode("电脑")])
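As a quick sanity check, you can round-trip a string through encode and decode and compare it with the original input. The snippet below only uses the calls shown above and assumes decode returns a string.
import tokengeex

tokenizer = tokengeex.load("code-32k-strict.json")

samples = [
    "def main(): print(\"Hello world!\")",
    "电脑",
]

for sample in samples:
    ids = tokenizer.encode(sample)
    # A lossless round-trip should reproduce the input exactly.
    print(len(ids), tokenizer.decode(ids) == sample)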
Rust
You can install the Rust library crate through cargo.
cargo add tokengeex
Example usage:
fn main() {
    let tokenizer = tokengeex::load("code-32k-strict.json").unwrap();

    println!("{}", tokenizer.vocab_size());
    println!("{}", tokenizer.token_to_id("token").unwrap());
    println!("{:?}", tokenizer.id_to_token(13513).unwrap());

    let ids = tokenizer.encode("def main(): print(\"Hello world!\")");
    println!("{:?}", ids);
    println!("{:?}", tokenizer.decode(ids));

    println!("{:?}", tokenizer.encode("电脑").into_iter().map(|id| tokenizer.id_to_token(id)).collect::<Vec<_>>());
}
CLI
You can install the Rust binary crate through cargo.
cargo install tokengeex --features cli
Here's a sample command to train a strict 1k vocabulary on 250MB of data.
RUST_LOG=info TOKENGEEX_PARALLELISM=true tokengeex train --model 'unigram' \
--input 'data/train/code-250MB.bin' \
--output 'data/vocab/code-1k-strict.json' \
--special-token '<|CODE_PREFIX|>' \
--special-token '<|CODE_SUFFIX|>' \
--special-token '<|CODE_MIDDLE|>' \
--special-token '<|EOS|>' \
--vocab-size 1024 \
--shrinking-factor '0.7' \
--num-sub-iterations '2' \
--vg-max-token-length '8' \
--vg-max-words-per-token '2' \
--vg-initial-vocab-size '4096' \
--vg-insert-probability '0.01' \
--vg-cache 'data/cache/code-4k-250MB-strict.json' \
--vg-strict true
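Once training finishes, the resulting vocabulary can be loaded with the Python (or Rust) API shown above. A minimal sketch, assuming the --output path from the command above:
import tokengeex

# Path written by the --output flag of the training command above.
tokenizer = tokengeex.load("data/vocab/code-1k-strict.json")

# Should reflect --vocab-size (1024 in this example).
print(tokenizer.vocab_size())
print(tokenizer.encode("def main(): pass"))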
Here's a sample command to train a strict 4k vocabulary on 250MB of data.
RUST_LOG=info TOKENGEEX_PARALLELISM=true tokengeex train --model 'unigram' \
--input 'data/train/code-250MB.bin' \
--output 'data/vocab/code-4k-strict.json' \
--special-token '<|CODE_PREFIX|>' \
--special-token '<|CODE_SUFFIX|>' \
--special-token '<|CODE_MIDDLE|>' \
--special-token '<|EOS|>' \
--vocab-size 4096 \
--shrinking-factor '0.7' \
--num-sub-iterations '2' \
--suggested-tokens-file 'data/tokens/suggested.json' \
--added-tokens-file 'data/tokens/added.json' \
--vg-max-token-length '16' \
--vg-max-words-per-token '3' \
--vg-initial-vocab-size '100000' \
--vg-insert-probability '0.01' \
--vg-cache 'data/cache/code-100k-250MB-strict.json' \
--vg-strict true
Here's a sample command to train a strict 16k vocabulary on 250MB of data.
RUST_LOG=info TOKENGEEX_PARALLELISM=true tokengeex train --model 'unigram' \
--input 'data/train/code-250MB.bin' \
--output 'data/vocab/code-16k-strict.json' \
--special-token '<|CODE_PREFIX|>' \
--special-token '<|CODE_SUFFIX|>' \
--special-token '<|CODE_MIDDLE|>' \
--special-token '<|EOS|>' \
--vocab-size 16384 \
--shrinking-factor '0.7' \
--num-sub-iterations '2' \
--suggested-tokens-file 'data/tokens/suggested.json' \
--added-tokens-file 'data/tokens/added.json' \
--vg-max-token-length '24' \
--vg-max-words-per-token '3' \
--vg-initial-vocab-size '1000000' \
--vg-insert-probability '0.01' \
--vg-cache 'data/cache/code-1000k-250MB-strict.json' \
--vg-strict true
Here's a sample command to train a strict 32k vocabulary on 100MB of data.
RUST_LOG=info TOKENGEEX_PARALLELISM=true tokengeex train --model 'unigram' \
--input 'data/train/code-100MB.bin' \
--output 'data/vocab/code-32k-strict.json' \
--special-token '<|CODE_PREFIX|>' \
--special-token '<|CODE_SUFFIX|>' \
--special-token '<|CODE_MIDDLE|>' \
--special-token '<|EOS|>' \
--vocab-size 32768 \
--shrinking-factor '0.7' \
--num-sub-iterations '2' \
--suggested-tokens-file 'data/tokens/suggested.json' \
--added-tokens-file 'data/tokens/added.json' \
--vg-max-token-length '24' \
--vg-max-words-per-token '3' \
--vg-initial-vocab-size '1000000' \
--vg-insert-probability '0.01' \
--vg-cache 'data/cache/code-1000k-100MB-strict.json' \
--vg-strict true