TokenGeeX - Efficient Tokenizer for CodeGeeX
This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX, designed for code and Chinese text. It is based on UnigramLM (Kudo, 2018).
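To illustrate the UnigramLM approach the tokenizer is built on, here is a minimal sketch of its decoding step: among all ways to segment the input with the vocabulary, pick the one maximizing the sum of token log-probabilities. The vocabulary and probabilities below are made up for demonstration and are not TokenGeeX's.

```python
import math

# Toy vocabulary with made-up unigram probabilities.
VOCAB = {
    "def": 0.05, "de": 0.01, "f": 0.02,
    " ": 0.1, "fo": 0.01, "foo": 0.04, "o": 0.03,
}

def tokenize(text, vocab):
    """Viterbi search over all segmentations of `text` using `vocab`."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, back-pointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the back-pointers to recover the best segmentation.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

print(tokenize("def foo", VOCAB))  # → ['def', ' ', 'foo']
```

Training a Unigram vocabulary then amounts to repeatedly re-estimating these probabilities and pruning low-value pieces; the patterns described below constrain which pieces are allowed to form in the first place.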
CLI
Exact
The most restrictive pattern. It does not allow punctuation to be mixed into words, strictly adheres to code structure, and forbids words that mix casing. Each digit is encoded as its own token.
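The rules above can be sketched with a regular expression. This is NOT the actual TokenGeeX pattern, only a hypothetical illustration of its stated constraints: no mixed-case words, one digit per token, punctuation kept separate.

```python
import re

# Hypothetical "exact"-style pre-tokenization sketch (not the real pattern).
EXACT = re.compile(
    r"[A-Z]+(?![a-z])"   # UPPERCASE runs (e.g. HTML), not spilling into a word
    r"|[A-Z][a-z]+"      # Capitalized words (e.g. Foo)
    r"|[a-z]+"           # lowercase words (e.g. foo)
    r"|[0-9]"            # one digit at a time
    r"|\s+"              # whitespace runs
    r"|."                # any other single character (punctuation)
)

print(EXACT.findall("parseHTML42(x)"))
# → ['parse', 'HTML', '4', '2', '(', 'x', ')']
```

Note how `parseHTML` is split into `parse` and `HTML` rather than kept as one mixed-case token, and `42` becomes two single-digit tokens.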
Exact+
The pattern used for the merge step of exact vocabularies.
General
General-purpose pattern, loosely analogous to GPT-4's pre-tokenization pattern. Numbers of up to three digits are allowed to form a single token.
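As a hypothetical illustration of the three-digit rule (analogous to the `\p{N}{1,3}` fragment in GPT-4's pre-tokenization pattern, and not the actual TokenGeeX regex):

```python
import re

# Hypothetical "general"-style sketch: digits group, but at most three per token.
GENERAL = re.compile(r"[0-9]{1,3}|[a-zA-Z]+|\s+|.")

print(GENERAL.findall("port 65535"))  # → ['port', ' ', '655', '35']
```

Longer numbers simply split into chunks of up to three digits, which bounds the number of distinct numeric tokens the vocabulary has to spend space on.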
General+
The pattern used for the merge step of general vocabularies.
Idiomatic
Permissive pattern which allows some common idioms, including multi-word tokens, to form.
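A hypothetical sketch of what allowing multi-word tokens means at the pre-tokenization level (again, not the real TokenGeeX pattern): let a short run of words, with their separating spaces, stay together as one pre-token, so frequent idioms like `for i in` can become single vocabulary entries.

```python
import re

# Hypothetical "idiomatic"-style sketch: up to three words may stick together.
IDIOMATIC = re.compile(r"(?: ?[a-zA-Z]+){1,3}|[0-9]{1,3}|\s+|.")

print(IDIOMATIC.findall("for i in range"))  # → ['for i in', ' range']
```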
Idiomatic+
The pattern used for the merge step of idiomatic vocabularies.
Loose
The most permissive pattern. Permits a wide range of constructs and idioms, and achieves the highest compression.