token-counter 0.1.0

`wc` for tokens: count tokens in files with HF Tokenizers

tc - Token Count

tc is a CLI tool for counting tokens in text files. It is a lightweight wrapper around the HuggingFace Tokenizers crate: like the Unix wc command, but for tokens instead of words.

Features

  • Count tokens in files or from stdin
  • Support for multiple files and glob patterns
  • Uses any tokenizer in HuggingFace Tokenizers
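
To make the wc analogy concrete, here is a minimal Python sketch of per-file counting with a total row. It is not tc's source, and it rests on several assumptions: the function names are hypothetical, the whitespace tokenizer is only a stand-in for a real HuggingFace tokenizer, and the "count, then name, plus a total line" layout is borrowed from wc rather than taken from tc's actual output.

```python
from typing import Callable, Iterable, List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): tc itself delegates to a
    # HuggingFace tokenizer, which would split text very differently.
    return text.split()


def count_tokens(
    inputs: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[Tuple[str, int]]:
    # One (name, count) row per input; a "total" row is appended
    # only when more than one input was given, as wc does.
    rows = [(name, len(tokenize(text))) for name, text in inputs]
    if len(rows) > 1:
        rows.append(("total", sum(count for _, count in rows)))
    return rows


def format_rows(rows: List[Tuple[str, int]]) -> str:
    # Right-align counts to the widest number (hypothetical layout).
    width = max((len(str(count)) for _, count in rows), default=1)
    return "\n".join(f"{count:>{width}} {name}" for name, count in rows)


if __name__ == "__main__":
    sample = [("a.md", "hello token world"), ("b.md", "two tokens")]
    print(format_rows(count_tokens(sample)))
```

Passing the tokenizer as a parameter mirrors the idea behind the --model option: the counting logic stays the same while the tokenizer is swapped out.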

Installation

cargo install token-counter

Usage

Using the default tokenizer (cl100k, the tokenizer for GPT-3.5 and GPT-4):

tc file1.md file2.md

Using globs:

tc *.md

Arguments:

  • -m, --model: HuggingFace ID of the model whose tokenizer should be used (e.g. google-bert/bert-base-uncased)