# token-count

> A fast, accurate CLI tool for counting tokens in LLM model inputs

[![Rust](https://img.shields.io/badge/rust-1.85%2B-orange.svg)](https://www.rust-lang.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-152%20passing-brightgreen.svg)](tests/)

## Overview

`token-count` is a POSIX-style command-line tool that counts tokens for various LLM models. It supports exact tokenization for OpenAI models (offline) and adaptive estimation for Claude models (with optional API mode for exact counts). Pipe any text in, get token counts out—fast, offline, and accurate.

```bash
# OpenAI models (exact, offline)
echo "Hello world" | token-count --model gpt-4
2

# Claude models (estimation, offline)
echo "Hello, Claude!" | token-count --model claude
9

# From file
token-count --model gpt-4 < document.txt
1842

# With context info
cat prompt.txt | token-count --model claude-sonnet-4-6 -v
Model: claude-sonnet-4-6 (anthropic-claude)
Tokens: 142
Context window: 1000000 tokens (0.0142% used)
```

## Features

✅ **Accurate** - Exact tokenization for OpenAI, adaptive estimation for Claude  
✅ **Fast** - ~2.7µs for small inputs (3,700x faster than 10ms target)  
✅ **Efficient** - 57MB memory for 12MB files (8.8x under 500MB limit)  
✅ **Compact** - 9.2MB binary with all tokenizers embedded  
✅ **Offline** - Zero runtime dependencies for OpenAI; optional API for Claude  
✅ **Simple** - POSIX-style interface, works like `wc` or `grep`

## Installation

### Quick Install (Recommended)

**Linux / macOS:**
```bash
curl -sSfL https://raw.githubusercontent.com/shaunburdick/token-count/main/install.sh | bash
```

**Homebrew (macOS / Linux):**
```bash
brew install shaunburdick/tap/token-count
```

**Cargo (All Platforms):**
```bash
cargo install token-count
```

**Manual Download:**  
Download pre-built binaries from [GitHub Releases](https://github.com/shaunburdick/token-count/releases).

For detailed installation instructions, troubleshooting, and platform-specific guidance, see [INSTALL.md](INSTALL.md).

### System Requirements

- **Platform**: Linux x86_64, macOS (Intel/Apple Silicon), Windows x86_64
- **Runtime**: No dependencies (static binary)
- **Build from source**: Rust 1.85.0 or later

## Usage

### Basic Usage

```bash
# Default model (gpt-3.5-turbo)
echo "Hello world" | token-count
2

# Specific model
echo "Hello world" | token-count --model gpt-4
2

# From file
token-count --model gpt-4 < input.txt
1842

# Piped from another command
cat README.md | token-count --model gpt-4o
3521
```

### Model Selection

```bash
# Use canonical name
token-count --model gpt-4 < input.txt

# Use alias (case-insensitive)
token-count --model gpt4 < input.txt
token-count --model GPT-4 < input.txt

# With provider prefix
token-count --model openai/gpt-4 < input.txt
```

### Verbosity Levels

```bash
# Simple output (default) - just the number
echo "test" | token-count
1

# Verbose (-v) - model info and context usage
echo "test" | token-count -v
Model: gpt-3.5-turbo (cl100k_base)
Tokens: 1
Context window: 16385 tokens (0.0061% used)

# Debug (-vvv) - for troubleshooting
echo "test" | token-count -vvv
Model: gpt-3.5-turbo (cl100k_base)
Tokens: 1
Context window: 16385 tokens

[Debug mode: Token IDs and decoding require tokenizer access]
[Full implementation in Phase 6]
```

### Model Information

```bash
# List all supported models
token-count --list-models

# Output:
# Supported models:
#
#   gpt-3.5-turbo
#     Encoding: cl100k_base
#     Context window: 16385 tokens
#     Aliases: gpt-3.5, gpt35, gpt-35-turbo, openai/gpt-3.5-turbo
#
#   gpt-4
#     Encoding: cl100k_base
#     Context window: 128000 tokens
#     Aliases: gpt4, openai/gpt-4
# ...
```

### Help and Version

```bash
# Show help
token-count --help

# Show version
token-count --version
```

## Supported Models

### OpenAI Models (Exact Tokenization - Offline)

| Model | Encoding | Context Window | Aliases |
|-------|----------|----------------|---------|
| gpt-3.5-turbo | cl100k_base | 16,385 | gpt-3.5, gpt35, gpt-35-turbo |
| gpt-4 | cl100k_base | 128,000 | gpt4 |
| gpt-4-turbo | cl100k_base | 128,000 | gpt4-turbo, gpt-4turbo |
| gpt-4o | o200k_base | 128,000 | gpt4o |

### Anthropic Claude Models (Adaptive Estimation - Offline by Default)

| Model | Context Window | Aliases | Estimation Mode |
|-------|----------------|---------|-----------------|
| claude-opus-4-6 | 1,000,000 | opus, opus-4-6, opus-4.6 | ±10% accuracy |
| claude-sonnet-4-6 | 1,000,000 | claude, sonnet, sonnet-4-6, sonnet-4.6 | ±10% accuracy |
| claude-haiku-4-5 | 200,000 | haiku, haiku-4-5, haiku-4.5 | ±10% accuracy |

**Claude Tokenization Modes:**

**Offline Estimation (Default)** - No API key needed:
```bash
# Fast offline estimation using adaptive content-type detection
echo "Hello, Claude!" | token-count --model claude
9
```

**Exact API Mode (Optional)** - Requires `ANTHROPIC_API_KEY`:
```bash
# Exact count via Anthropic API (requires consent)
export ANTHROPIC_API_KEY="sk-ant-..."
echo "Hello, Claude!" | token-count --model claude --accurate
# Prompts: "This will send your input to Anthropic's API... Proceed? (y/N)"
# Output: 8

# Skip prompt for automation
cat file.txt | token-count --model claude --accurate -y
```

**How Claude Estimation Works:**
- Detects content type (code vs. prose) using punctuation and keyword analysis
- **Code**: 3.0 chars/token (lots of `{}[]();` and keywords)
- **Prose**: 4.5 chars/token (natural language)
- **Mixed**: 3.75 chars/token (markdown + code blocks)
- Target: ±10% accuracy for typical inputs
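The chars-per-token heuristic above can be sketched in shell. The detection rule here (structural punctuation density of at least 10%) is a hypothetical stand-in for the tool's actual code/prose analysis, and real counts include factors this sketch ignores, so treat it as an illustration of the approach rather than the shipped algorithm:

```shell
#!/bin/sh
# Illustrative chars/token estimation (ratios from the list above;
# the punctuation-density detection rule is hypothetical).
estimate_tokens() {
  text=$1
  chars=${#text}
  # crude content-type detection: density of structural punctuation
  punct=$(printf '%s' "$text" | tr -cd '{}[]();' | wc -c)
  if [ "$((punct * 10))" -ge "$chars" ]; then
    ratio=30   # code: 3.0 chars/token (scaled x10 for integer math)
  else
    ratio=45   # prose: 4.5 chars/token
  fi
  echo $(( (chars * 10 + ratio - 1) / ratio ))  # ceiling division
}

estimate_tokens 'fn main() { println!("hello"); }'
estimate_tokens 'The quick brown fox jumps over the lazy dog.'
```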

All models support:
- Case-insensitive names (e.g., `GPT-4`, `gpt-4`, `Gpt-4`)
- Provider prefix (e.g., `openai/gpt-4`, `anthropic/claude-sonnet-4-6`)

## Error Handling

`token-count` provides helpful error messages with suggestions:

```bash
# Unknown model with fuzzy suggestions
$ echo "test" | token-count --model gpt5
Error: Unknown model: 'gpt5'. Did you mean: gpt-4, gpt-4o?

# Typo correction
$ echo "test" | token-count --model gpt4-tubro
Error: Unknown model: 'gpt4-tubro'. Did you mean: gpt-4-turbo?

# Invalid UTF-8
$ token-count < invalid.bin
Error: Input contains invalid UTF-8 at byte 0
```

### Exit Codes

- `0` - Success
- `1` - I/O error or invalid UTF-8
- `2` - Unknown model name
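A small wrapper function (illustrative, not shipped with the tool) makes these exit codes easy to consume from scripts:

```shell
#!/bin/sh
# Map token-count's documented exit codes to script behavior.
count_tokens() {
  count=$("$@")
  case $? in
    0) echo "$count"; return 0 ;;
    1) echo "error: I/O failure or invalid UTF-8" >&2; return 1 ;;
    2) echo "error: unknown model" >&2; return 2 ;;
  esac
}

# usage: count_tokens token-count --model gpt-4 < input.txt
```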

## Performance

### Benchmarks

Measured on Ubuntu 22.04 with Rust 1.85.0:

| Input Size | Time | Target | Result |
|------------|------|--------|--------|
| 100 bytes | 2.7µs | <10ms | 3,700x faster ⚡ |
| 1 KB | 54µs | <100ms | 1,850x faster ⚡ |
| 10 KB | 534µs | N/A | Excellent |

### Memory Usage

- **12MB file**: 57 MB resident memory (8.8x under 500MB limit)
- **Processing time**: 0.76 seconds for 12MB
- **No memory leaks**: Validated with valgrind

### Binary Size

- **Release binary**: 9.2 MB (5.4x under 50MB target)
- **Includes**: All 4 OpenAI tokenizers embedded
- **Optimizations**: Stripped, LTO enabled

## Development

### Building from Source

```bash
# Clone repository
git clone https://github.com/shaunburdick/token-count
cd token-count

# Run tests
cargo test

# Run benchmarks
cargo bench

# Build release binary
cargo build --release

# Check code quality
cargo clippy -- -D warnings
cargo fmt --check

# Security audit
cargo audit
```

### Running Tests

```bash
# All tests (152 tests)
cargo test

# Specific test suite
cargo test --test model_aliases
cargo test --test verbosity
cargo test --test performance

# With output
cargo test -- --nocapture
```

### Project Structure

```
token-count/
├── src/
│   ├── lib.rs              # Public library API
│   ├── main.rs             # Binary entry point
│   ├── cli/                # CLI argument parsing
│   │   ├── args.rs         # Clap definitions
│   │   ├── input.rs        # Stdin reading
│   │   └── mod.rs
│   ├── tokenizers/         # Tokenization engine
│   │   ├── openai.rs       # OpenAI tokenizer (tiktoken)
│   │   ├── claude/         # Claude tokenizer
│   │   │   ├── mod.rs      # Main tokenizer
│   │   │   ├── estimation.rs  # Adaptive estimation
│   │   │   ├── api_client.rs  # Anthropic API
│   │   │   └── models.rs   # Model definitions
│   │   ├── registry.rs     # Model registry
│   │   └── mod.rs
│   ├── api/                # API utilities
│   │   ├── consent.rs      # Interactive consent prompt
│   │   └── mod.rs
│   ├── output/             # Output formatters
│   │   ├── simple.rs       # Simple formatter
│   │   ├── verbose.rs      # Verbose formatter
│   │   ├── debug.rs        # Debug formatter
│   │   └── mod.rs
│   └── error.rs            # Error types
├── tests/                  # Integration tests
│   ├── fixtures/           # Test data
│   ├── model_aliases.rs
│   ├── verbosity.rs
│   ├── performance.rs
│   ├── error_handling.rs
│   ├── end_to_end.rs
│   ├── claude_estimation.rs  # Claude estimation tests
│   ├── claude_api.rs          # Claude API tests
│   └── ...
├── benches/                # Performance benchmarks
│   └── tokenization.rs
└── .github/
    └── workflows/
        └── ci.yml          # CI configuration
```

## Security

### Resource Limits

- **Maximum input size**: 100MB per invocation
- **Memory usage**: Typically <100MB, peaks at ~2x input size
- **CPU usage**: Single-threaded, 100% of one core during processing
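For scripts feeding arbitrary files, a pre-flight size check (an illustrative sketch; the 100MB figure is the documented cap) avoids tripping the input limit:

```shell
#!/bin/sh
# Pre-flight check against the documented 100MB input cap (illustrative).
check_size() {
  max_bytes=$((100 * 1024 * 1024))
  size=$(wc -c < "$1") || return 1
  [ "$((size))" -le "$max_bytes" ]
}

# usage:
# check_size input.txt && token-count --model gpt-4 < input.txt
```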

### Known Limitations

**Stack Overflow with Highly Repetitive Inputs**: The underlying tiktoken-rs library can experience stack overflow when processing highly repetitive single-character inputs (e.g., 1MB+ of the same character). This is due to regex backtracking in the tokenization engine. Real-world text with varied content works fine at large sizes.

- **Workaround**: Break extremely large repetitive inputs into smaller chunks
- **Impact**: Minimal - real documents rarely exhibit this pathological pattern
- **Status**: Tracked upstream in tiktoken-rs
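The chunking workaround can be scripted along these lines (illustrative; summed per-chunk counts can drift slightly from a whole-file count when tokens straddle chunk boundaries):

```shell
#!/bin/sh
# Split a pathological input into 1MB chunks and sum per-chunk counts.
count_chunked() {
  file=$1; model=$2
  dir=$(mktemp -d)
  split -b 1048576 "$file" "$dir/chunk_"   # 1MB chunks
  total=0
  for f in "$dir"/chunk_*; do
    n=$(token-count --model "$model" < "$f")
    total=$((total + n))
  done
  rm -rf "$dir"
  echo "$total"
}

# usage: count_chunked big-input.txt gpt-4
```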

### Best Practices

**For CI/CD Pipelines**:
```bash
# Limit concurrent processes to avoid resource exhaustion
ulimit -n 1024                    # Limit file descriptors
ulimit -v $((500 * 1024))        # Limit virtual memory to 500MB
echo "text" | token-count --model gpt-4
```

**For Untrusted Input**:
```bash
# Use timeout to prevent hangs
timeout 30s token-count --model gpt-4 < input.txt
```

**For Large Files**:
```bash
# Monitor memory usage
/usr/bin/time -v token-count --model gpt-4 < large-file.txt
```

### Security Audit

- **Last audit**: 2026-03-13
- **Findings**: 0 critical, 0 high, 0 medium vulnerabilities
- **Dependencies**: 5 direct, all audited with `cargo audit`
- **Binary**: Stripped, no debug symbols, 9.2MB

Run security checks:
```bash
cargo audit                      # Check for known vulnerabilities
cargo clippy -- -D warnings     # Strict linting
```

### Reporting Security Issues

If you discover a security vulnerability, please email hello@burdick.dev (or open a private security advisory on GitHub). Do not open public issues for security concerns.

## Architecture

### Design Principles

From our [Constitution](.specify/memory/constitution.md):

1. **POSIX Simplicity** - Behaves like standard Unix utilities
2. **Accuracy Over Speed** - Exact tokenization for supported models
3. **Zero Runtime Dependencies** - Single offline binary
4. **Fail Fast with Clear Errors** - No silent failures
5. **Semantic Versioning** - Predictable upgrade paths

### Technical Stack

- **Language**: Rust 1.85.0+ (stable)
- **CLI Parsing**: clap 4.6.0+ (derive API)
- **Tokenization**: 
  - tiktoken-rs 0.9.1+ (OpenAI models - offline)
  - Adaptive estimation algorithm (Claude models - offline)
  - Anthropic API via reqwest 0.12+ (Claude accurate mode - optional)
- **Async Runtime**: tokio 1.0+ (for API calls)
- **Error Handling**: anyhow 1.0.102+, thiserror 1.0+
- **Fuzzy Matching**: strsim 0.11+ (Levenshtein distance)
- **Testing**: 152 tests with criterion benchmarks

### Key Features

- **Library-first design**: Core logic in `lib.rs`, thin binary wrapper
- **Trait-based abstractions**: Extensible for future tokenizers
- **Strategy pattern**: Multiple output formatters
- **Registry pattern**: Model configuration with lazy initialization
- **Streaming support**: 64KB chunks for large inputs

## Roadmap

### v0.1.0 (Released) ✅

- [x] OpenAI model support (4 models)
- [x] CLI with model selection and verbosity
- [x] Fuzzy model suggestions
- [x] UTF-8 validation with error reporting
- [x] Comprehensive test suite (100 tests)
- [x] Performance benchmarks
- [x] Cross-platform support (Linux, macOS, Windows)
- [x] Multiple installation methods (install.sh, Homebrew, cargo, manual)
- [x] GitHub release binaries with checksums
- [x] Automated release pipeline

### v0.2.0 (Current Release) ✅

- [x] Anthropic Claude support (3 models)
- [x] Adaptive token estimation algorithm (code/prose detection)
- [x] Optional accurate mode via Anthropic API
- [x] Interactive consent prompt for API calls
- [x] Non-interactive mode support (`-y` flag)

### v0.3.0 (Future - More Models)

- [ ] Google Gemini support
- [ ] Meta Llama support
- [ ] Mistral support

### v0.4.0 (Future - Stable API)

- [ ] Stable library API for embedding
- [ ] Token ID output (debug mode)
- [ ] Batch processing mode
- [ ] Configuration file support

## Contributing

Contributions are welcome! This project follows specification-driven development.

### Development Setup

See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed instructions.

Quick start:
```bash
git clone https://github.com/shaunburdick/token-count
cd token-count
cargo test
cargo clippy
```

### Code Quality Standards

- **No disabled lint rules** - Fix code to comply, don't silence warnings
- **100% type safety** - No `any` types or suppressions
- **All public APIs documented** - With examples
- **Test coverage** - All user stories covered
- **Zero clippy warnings** - Strict linting enforced

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

Built with:
- [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs) - Rust tiktoken implementation
- [clap](https://github.com/clap-rs/clap) - Command line argument parser
- [spec-kit](https://github.com/github/spec-kit) - Specification-driven development

Special thanks to:
- OpenAI for open-sourcing tiktoken
- The Rust community for excellent tooling

---

**Status**: ✅ v0.2.0 Complete (Claude Support) | **Version**: 0.2.0  
**Author**: [Shaun Burdick](https://github.com/shaunburdick)