# Quickstart: Core CLI Token Counting

**Date**: 2026-03-13 | **Feature**: 001-core-cli | **Plan**: [plan.md](./plan.md)

## Purpose

This quickstart guide provides key validation scenarios for testing the token-count CLI tool during development. Use these scenarios to verify that each phase is working correctly before moving to the next.

---

## Prerequisites

### Build the Binary
```bash
cargo build --release
export PATH="./target/release:$PATH"  # Add to PATH for easy testing
```

### Verify Binary Exists
```bash
which token-count
# Expected: /path/to/token-count/target/release/token-count

token-count --version
# Expected: token-count 0.1.0
```

---

## Phase 0: Project Setup

### Test 1: Cargo Build Succeeds
```bash
cargo build
# Expected: Compiles without errors, generates debug binary

cargo build --release
# Expected: Compiles without errors, generates optimized binary
```

### Test 2: Linting Passes
```bash
cargo fmt --check
# Expected: No formatting issues

cargo clippy -- -D warnings
# Expected: No warnings or errors
```

### Test 3: Tests Run Successfully
```bash
cargo test
# Expected: All tests pass (initially just placeholder tests)
```

---

## Phase 1: Core Tokenization Logic

### Test 4: Simple Token Count (US-001)
```bash
echo -n "Hello world" | token-count --model gpt-4
# Expected: 2

echo -n "Translate this: Hello world" | token-count --model gpt-4
# Expected: 6
```

### Test 5: Model Aliases (US-002)
```bash
echo -n "Test" | token-count --model gpt4
# Expected: 1

echo -n "Test" | token-count --model GPT-4
# Expected: 1 (case-insensitive)

echo -n "Test" | token-count --model openai/gpt-4
# Expected: 1 (provider format)
```

### Test 6: Default Model (US-003)
```bash
echo -n "Hello" | token-count
# Expected: 1 (uses gpt-3.5-turbo by default)
```

### Test 7: List Models
```bash
token-count --list-models
# Expected:
# Supported Models:
#   OpenAI:
#     gpt-3.5-turbo (aliases: gpt35, gpt3.5, openai/gpt-3.5-turbo)
#     gpt-4 (aliases: gpt4, openai/gpt-4)
#     gpt-4-turbo (aliases: gpt4-turbo, openai/gpt-4-turbo)
#     gpt-4o (aliases: gpt4o, openai/gpt-4o)
```
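
Tests 5–7 all hinge on a single lookup table mapping canonical names, aliases, encodings, and context windows. Below is a minimal sketch of what such a registry could look like in Rust; the type, field names, and the gpt-3.5-turbo context window are illustrative assumptions, not the tool's actual implementation.

```rust
/// Illustrative registry entry; the type and field names are assumptions,
/// not the tool's actual implementation.
struct ModelInfo {
    canonical: &'static str,
    encoding: &'static str,
    context_window: u32,
    aliases: &'static [&'static str],
}

const MODELS: &[ModelInfo] = &[
    ModelInfo {
        canonical: "gpt-4",
        encoding: "cl100k_base",
        context_window: 8_192, // confirmed by Test 18
        aliases: &["gpt4", "openai/gpt-4"],
    },
    ModelInfo {
        canonical: "gpt-3.5-turbo", // the default model (Test 6)
        encoding: "cl100k_base",
        context_window: 16_385, // illustrative value
        aliases: &["gpt35", "gpt3.5", "openai/gpt-3.5-turbo"],
    },
    // gpt-4-turbo and gpt-4o omitted for brevity
];

/// Resolve user input case-insensitively against canonical names
/// and aliases (Tests 5 and 6).
fn resolve_model(input: &str) -> Option<&'static ModelInfo> {
    let needle = input.to_ascii_lowercase();
    MODELS
        .iter()
        .find(|m| m.canonical == needle || m.aliases.iter().any(|a| *a == needle))
}
```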

### Test 8: Tokenization Accuracy
Test against hardcoded reference values (pre-generated using Python tiktoken):

```bash
# Reference values generated once using:
# python3 -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode('Hello world')))"

echo -n "Hello world" | token-count --model gpt-4
# Expected: 2 (verified against tiktoken 0.5.2)

echo -n "Hello δΈ–η•Œ 🌍" | token-count --model gpt-4
# Expected: 8 (verified against tiktoken 0.5.2)
```

**Note**: Use `echo -n` to avoid counting the newline character. All test fixtures are pre-generated and stored in `tests/fixtures/tokenization_reference.json`. No runtime Python dependency required. See [Fixture Generation Guide](#fixture-generation) for regeneration instructions.
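
Once the fixtures exist, an integration test can replay them against the Rust tokenizer. The sketch below assumes `tiktoken-rs`, `serde` (with the `derive` feature), and `serde_json` as dev-dependencies; `cl100k_base()` and `encode_with_special_tokens` are real tiktoken-rs APIs, while the test name and file layout are illustrative.

```rust
// tests/tokenization_accuracy.rs — a sketch; fixture field names follow
// the JSON format shown in the Fixture Generation section.
use std::fs;

use serde::Deserialize;
use tiktoken_rs::cl100k_base;

#[derive(Deserialize)]
struct Fixture {
    input: String,
    expected_tokens: usize,
}

#[test]
fn cl100k_matches_reference_fixtures() {
    let raw = fs::read_to_string("tests/fixtures/tokenization_reference.json")
        .expect("fixture file present");
    let all: std::collections::HashMap<String, Vec<Fixture>> =
        serde_json::from_str(&raw).expect("valid fixture JSON");

    let bpe = cl100k_base().expect("encoder loads");
    for case in &all["cl100k_base"] {
        let count = bpe.encode_with_special_tokens(&case.input).len();
        assert_eq!(count, case.expected_tokens, "input: {:?}", case.input);
    }
}
```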

---

## Phase 2: CLI Argument Parsing

### Test 9: Help Output (US-009)
```bash
token-count --help
# Expected:
# token-count 0.1.0
# Count tokens for LLM models using exact tokenization
# 
# USAGE:
#     token-count [OPTIONS]
# 
# OPTIONS:
#     -m, --model <MODEL>      Model to use [default: gpt-3.5-turbo]
#     -v, --verbose            Increase output verbosity (can be repeated)
#         --list-models        List all supported models
#     -h, --help               Print help information
#     -V, --version            Print version information
# 
# EXAMPLES:
#     # Count tokens from stdin
#     echo "Hello world" | token-count --model gpt-4
#     ...
```
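
For reference, the documented interface maps naturally onto clap's derive API. The sketch below assumes clap 4 with the `derive` feature; struct and field names are illustrative. Two caveats: clap 4 renders `Usage:`/`Options:` headers rather than the uppercase style shown above, and a counted `-v` flag covers `-v`/`-vv`/`-vvv` but not the explicit `-v 0` form in Test 16, which would need a value-taking action instead.

```rust
// src/cli.rs — a sketch of the documented flags; names are illustrative.
use clap::Parser;

#[derive(Parser)]
#[command(
    name = "token-count",
    version,
    about = "Count tokens for LLM models using exact tokenization"
)]
struct Cli {
    /// Model to use
    #[arg(short, long, default_value = "gpt-3.5-turbo")]
    model: String,

    /// Increase output verbosity (can be repeated)
    #[arg(short, long, action = clap::ArgAction::Count)]
    verbose: u8,

    /// List all supported models
    #[arg(long)]
    list_models: bool,
}
```

Note that clap's own usage errors exit with code 2, which is what Test 11 expects.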

### Test 10: Version Output (US-009)
```bash
token-count --version
# Expected: token-count 0.1.0
```

### Test 11: Invalid Arguments
```bash
token-count --invalid-flag
# Expected: Error message from clap, exit code 2
```

---

## Phase 3: Input Processing

### Test 12: File Input (US-005)
```bash
echo "Hello world from file" > /tmp/test.txt
token-count --model gpt-4 < /tmp/test.txt
# Expected: 4

cat /tmp/test.txt | token-count --model gpt-4
# Expected: 4
```

### Test 13: Empty Input (US-008)
```bash
echo "" | token-count --model gpt-4
# Expected: 0

cat /dev/null | token-count --model gpt-4
# Expected: 0
```

### Test 14: Large Input (Streaming)
```bash
# Generate 1MB of text
python3 -c "print('Hello world ' * 100000)" > /tmp/large.txt
ls -lh /tmp/large.txt
# Expected: ~1.2MB file

time token-count --model gpt-4 < /tmp/large.txt
# Expected: Completes in <1 second, outputs token count
# Memory usage should be <500MB (check with /usr/bin/time -v on Linux)
```
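
On meeting the memory target: for the sizes tested here, buffering all of stdin keeps peak memory roughly proportional to input size, comfortably under 500MB even for a 100MB input. A minimal reading sketch (raw bytes, so the UTF-8 check in Phase 5 can run before tokenization; a real implementation might chunk instead):

```rust
use std::io::{self, Read};

/// Read all of stdin as raw bytes; UTF-8 validation happens afterwards
/// (see Test 21). Peak memory stays close to the input size.
fn read_stdin() -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    io::stdin().lock().read_to_end(&mut buf)?;
    Ok(buf)
}
```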

### Test 15: Unicode Input
```bash
echo "Hello δΈ–η•Œ 🌍 Ω…Ψ±Ψ­Ψ¨Ψ§" | token-count --model gpt-4
# Expected: Token count (should handle Unicode correctly)

echo "Emoji test: πŸ˜€ πŸŽ‰ πŸš€ ⭐ πŸ’―" | token-count --model gpt-4
# Expected: Token count (emoji should tokenize correctly)
```

---

## Phase 4: Output Formatting

### Test 16: Verbosity 0 - Simple Output (US-004)
```bash
echo "Hello world" | token-count --model gpt-4
# Expected: 2

echo "Hello world" | token-count --model gpt-4 -v 0
# Expected: 2 (explicit verbosity 0)
```

### Test 17: Verbosity 1 - Model Info (US-004)
```bash
echo "Hello world" | token-count --model gpt-4 -v
# Expected:
# Model: gpt-4 (cl100k_base encoding)
# Tokens: 2
```

### Test 18: Verbosity 2 - Context Window (US-004)
```bash
echo "Hello world" | token-count --model gpt-4 -vv
# Expected:
# Model: gpt-4 (cl100k_base encoding)
# Tokens: 2
# Context Window: 8,192 tokens
# Usage: 0.02%
```

### Test 19: Verbosity 3 - Debug Output (US-004)
```bash
echo "Hello world" | token-count --model gpt-4 -vvv
# Expected:
# Model: gpt-4 (cl100k_base encoding)
# Tokens: 2
# Token IDs: [15339, 1917]
# Decoded Tokens: ["Hello", " world"]
# Context Window: 8,192 tokens
# Usage: 0.02%
```

### Test 20: Verbosity 3 - Long Input (Truncation)
```bash
echo "One two three four five six seven eight nine ten eleven twelve" | token-count --model gpt-4 -vvv
# Expected:
# Model: gpt-4 (cl100k_base encoding)
# Tokens: 12
# Token IDs: [3198, 1403, 2380, 3116, 4330, 4848, 8254, 8223, 11888, 5899, ...]
# Decoded Tokens: ["One", " two", " three", " four", " five", " six", " seven", " eight", " nine", " ten", ...]
# Context Window: 8,192 tokens
# Usage: 0.15%
```
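
Tests 16–20 pin down one output contract per verbosity level, including the ten-item truncation for long inputs. A sketch of that mapping follows; the function and its inputs are illustrative, and decoded tokens plus thousands separators are omitted for brevity:

```rust
/// Print the report for a given verbosity level (Tests 16–20).
/// Signature is illustrative, not the tool's actual API.
fn print_report(verbosity: u8, model: &str, encoding: &str, context: u32, ids: &[usize]) {
    if verbosity == 0 {
        // Verbosity 0: bare count only (Test 16)
        println!("{}", ids.len());
        return;
    }
    println!("Model: {} ({} encoding)", model, encoding);
    println!("Tokens: {}", ids.len());
    if verbosity >= 3 {
        // Verbosity 3: token IDs, truncated to the first 10 (Test 20)
        let shown: Vec<String> = ids.iter().take(10).map(|id| id.to_string()).collect();
        let suffix = if ids.len() > 10 { ", ..." } else { "" };
        println!("Token IDs: [{}{}]", shown.join(", "), suffix);
    }
    if verbosity >= 2 {
        // Verbosity 2: context window and usage; 2/8192 renders as 0.02%
        println!("Context Window: {} tokens", context);
        println!("Usage: {:.2}%", ids.len() as f64 / context as f64 * 100.0);
    }
}
```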

---

## Phase 5: Error Handling

### Test 21: Invalid UTF-8 (US-007)
```bash
# Create binary file
echo -n -e '\xff\xfe' > /tmp/binary.bin
cat /tmp/binary.bin | token-count --model gpt-4
# Expected:
# Error: Input contains invalid UTF-8
# 
# token-count requires valid UTF-8 text input.
# Binary files cannot be tokenized.
# Exit code: 1

echo $?
# Expected: 1
```
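
The check behind this test sits between reading input and tokenizing it. A sketch, reusing the error text from the expected output above and exit code 1 per the contract in Test 24 (the function name is illustrative):

```rust
use std::process::ExitCode;

/// Validate raw stdin bytes as UTF-8 before tokenizing; map failure
/// to the documented error message and exit code 1.
fn validate_input(bytes: Vec<u8>) -> Result<String, ExitCode> {
    String::from_utf8(bytes).map_err(|_| {
        eprintln!("Error: Input contains invalid UTF-8");
        eprintln!();
        eprintln!("token-count requires valid UTF-8 text input.");
        eprintln!("Binary files cannot be tokenized.");
        ExitCode::from(1)
    })
}
```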

### Test 22: Unknown Model (US-006)
```bash
echo "Test" | token-count --model gpt5
# Expected:
# Error: Unknown model 'gpt5'
# 
# Did you mean one of these?
#   - gpt-4
#   - gpt-4o
#   - gpt-3.5-turbo
# 
# Use --list-models to see all supported models
# Exit code: 2

echo $?
# Expected: 2
```

### Test 23: Model Suggestions (Fuzzy Matching)
```bash
echo "Test" | token-count --model gpt4-turb
# Expected:
# Error: Unknown model 'gpt4-turb'
# 
# Did you mean one of these?
#   - gpt-4-turbo
#   - gpt-4
# 
# Use --list-models to see all supported models
```
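
The "did you mean" list in Tests 22–23 can come from plain edit distance. A sketch assuming the `strsim` crate (its `levenshtein` function is real); the distance threshold and three-result cap are illustrative choices:

```rust
use strsim::levenshtein;

/// Suggest up to three known model names within edit distance 3 of the
/// input, closest first ("gpt4-turb" → "gpt-4-turbo" at distance 2).
fn suggest(input: &str, known: &[&str]) -> Vec<String> {
    let mut scored: Vec<(usize, &str)> = known
        .iter()
        .map(|name| (levenshtein(input, name), *name))
        .filter(|(dist, _)| *dist <= 3) // keep only near misses
        .collect();
    scored.sort(); // smallest distance first
    scored
        .into_iter()
        .take(3)
        .map(|(_, name)| name.to_string())
        .collect()
}
```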

### Test 24: Exit Codes
```bash
echo "Test" | token-count --model gpt-4
echo $?
# Expected: 0 (success)

echo "Test" | token-count --model invalid-model
echo $?
# Expected: 2 (user error)

echo -e '\xff\xfe' | token-count --model gpt-4
echo $?
# Expected: 1 (runtime error)
```
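
This contract is easy to lock in with an integration test. A sketch using the `assert_cmd` crate as a dev-dependency:

```rust
// tests/exit_codes.rs — a sketch pinning down the exit-code contract.
use assert_cmd::Command;

#[test]
fn exit_codes_match_contract() {
    // Success: valid model and input → 0
    Command::cargo_bin("token-count").unwrap()
        .args(["--model", "gpt-4"])
        .write_stdin("Test")
        .assert()
        .success();

    // User error: unknown model → 2
    Command::cargo_bin("token-count").unwrap()
        .args(["--model", "invalid-model"])
        .write_stdin("Test")
        .assert()
        .code(2);

    // Runtime error: invalid UTF-8 input → 1
    Command::cargo_bin("token-count").unwrap()
        .args(["--model", "gpt-4"])
        .write_stdin(vec![0xff, 0xfe])
        .assert()
        .code(1);
}
```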

---

## Phase 6: Integration & Testing

### Test 25: Cross-Platform Line Endings
On Windows, create file with CRLF:
```powershell
# Windows PowerShell
"Hello`r`nworld" | Out-File -Encoding ASCII test.txt
Get-Content test.txt | token-count --model gpt-4
# Expected: 2 (should handle CRLF correctly)
```

On Linux/macOS:
```bash
printf "Hello\r\nworld" | token-count --model gpt-4
# Expected: 2 (should handle CRLF correctly)
```

### Test 26: Binary Size Check (Informational)
```bash
ls -lh target/release/token-count
# Expected: 40-60MB (embedded tokenizers, acceptable per Amendment 1.3.0)

# macOS / Linux
du -h target/release/token-count
# Expected: 40-60MB

# Windows
dir target\release\token-count.exe
# Expected: 40-60MB

# Note: Binary size no longer has a hard limit. Accuracy takes precedence.
```

### Test 27: Performance Benchmarks
```bash
# criterion is declared in dev-dependencies; cargo bench builds and runs it
cargo bench

# Expected output:
# tokenization/small (100 bytes)  time: ~5ms
# tokenization/medium (1MB)        time: ~50ms
# tokenization/large (100MB)       time: ~3s
```
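
A matching criterion harness could look like the sketch below; it assumes `criterion` and `tiktoken-rs` as dev-dependencies plus a `[[bench]]` entry in Cargo.toml with `harness = false`, and omits the 100MB case to keep bench runs short:

```rust
// benches/tokenization.rs — a sketch for the targets above.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tiktoken_rs::cl100k_base;

fn bench_tokenization(c: &mut Criterion) {
    let bpe = cl100k_base().unwrap();
    let small = "Hello world ".repeat(8);       // ~100 bytes
    let medium = "Hello world ".repeat(87_000); // ~1MB

    c.bench_function("tokenization/small", |b| {
        b.iter(|| bpe.encode_with_special_tokens(black_box(&small)))
    });
    c.bench_function("tokenization/medium", |b| {
        b.iter(|| bpe.encode_with_special_tokens(black_box(&medium)))
    });
}

criterion_group!(benches, bench_tokenization);
criterion_main!(benches);
```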

### Test 28: Memory Usage (Large File)
```bash
# Linux (requires time package)
/usr/bin/time -v sh -c 'cat /tmp/large.txt | token-count --model gpt-4'
# Expected: Maximum resident set size: <500MB

# macOS
/usr/bin/time -l sh -c 'cat /tmp/large.txt | token-count --model gpt-4'
# Expected: maximum resident set size < 500MB
```

---

## Phase 7: Documentation & Polish

### Test 29: README Examples Work
Copy-paste examples from README and verify they work as documented.

```bash
# Example 1: Quick token count
echo "Hello world" | token-count --model gpt-4
# Expected: 2 (as shown in README)

# Example 2: From file
token-count --model claude-sonnet < document.txt
# Expected: unknown-model error (Claude is not supported in the MVP), or a token count once it is
```

### Test 30: Help Text Quality
```bash
token-count --help | wc -l
# Expected: ≤24 lines (should fit in a standard terminal)

token-count --help
# Verify:
# - Clear usage instructions
# - Examples included
# - Default values documented
# - Flags explained
```

---

## Automated Test Suite

### Run All Tests
```bash
cargo test --all
# Expected: All tests pass

cargo test --all --release
# Expected: All tests pass (release mode)
```

### Run Integration Tests Only
```bash
cargo test --test '*'
# Expected: All integration tests pass
```

### Run Unit Tests Only
```bash
cargo test --lib
# Expected: All unit tests pass
```

### Run Benchmarks
```bash
cargo bench
# Expected: Benchmarks complete, performance targets met
```

### Code Coverage
```bash
# Install tarpaulin
cargo install cargo-tarpaulin

# Run coverage
cargo tarpaulin --out Html
# Expected: ≥80% coverage

# Open coverage report
open tarpaulin-report.html  # macOS
xdg-open tarpaulin-report.html  # Linux
```

---

## Continuous Integration

### Local CI Simulation
```bash
# Run all CI checks locally
./scripts/ci-check.sh

# Or manually:
cargo fmt --check
cargo clippy -- -D warnings
cargo test --all
cargo build --release
```

### GitHub Actions
After pushing to feature branch:
```bash
git push origin 001-core-cli
# Check GitHub Actions: https://github.com/shaunburdick/token-count/actions
# Expected: All checks pass (build, test, lint)
```

---

## Release Validation (Final Check)

### Test Installation Methods

**1. Cargo Install (from local)**
```bash
cargo install --path .
which token-count
# Expected: ~/.cargo/bin/token-count

token-count --version
# Expected: token-count 0.1.0
```

**2. Binary Release Simulation**
```bash
# Build release binary
cargo build --release

# Copy to system location
sudo cp target/release/token-count /usr/local/bin/
which token-count
# Expected: /usr/local/bin/token-count

token-count --version
# Expected: token-count 0.1.0
```

**3. Cross-Platform Builds**
```bash
# Build for all platforms (requires cross-compilation setup)
cargo build --release --target x86_64-unknown-linux-gnu
cargo build --release --target x86_64-unknown-linux-musl
cargo build --release --target x86_64-apple-darwin
cargo build --release --target aarch64-apple-darwin
cargo build --release --target x86_64-pc-windows-gnu

# Verify binary sizes (informational)
ls -lh target/*/release/token-count*
# Expected: 40-60MB (acceptable per Amendment 1.3.0)
```

---

## Troubleshooting

### Binary Size Optimization (Optional)
```bash
# Check what's taking up space
cargo install cargo-bloat
cargo bloat --release

# Optional size optimizations in Cargo.toml (if needed):
# opt-level = "z"     # Optimize for size
# strip = true        # Strip debug symbols
# lto = "fat"         # Link-time optimization

# Note: Size optimization is best effort. Accuracy takes precedence.
```

### Performance Issues
```bash
# Profile with flamegraph
cargo install flamegraph
cargo flamegraph -- token-count --model gpt-4 < /tmp/large.txt

# View flamegraph.svg to identify bottlenecks
```

### Memory Leaks
```bash
# Linux: valgrind
valgrind --leak-check=full token-count --model gpt-4 < /tmp/test.txt

# macOS: instruments
instruments -t Leaks target/release/token-count
```

---

## Success Criteria Checklist

Before marking feature complete, verify:

- [ ] All 30 quickstart tests pass
- [ ] Token counts match reference fixtures (Test 8)
- [ ] Binary size tracked (Test 26, informational only)
- [ ] Performance benchmarks meet targets (Test 27)
- [ ] Memory usage <500MB for large files (Test 28)
- [ ] All unit tests pass (≥80% coverage)
- [ ] All integration tests pass
- [ ] Cross-platform tests pass (Linux, macOS, Windows)
- [ ] `cargo clippy` zero warnings
- [ ] `cargo fmt --check` passes
- [ ] Documentation is complete and accurate
- [ ] Help text fits in 24 lines (Test 30)
- [ ] All error messages are helpful (Tests 21-24)
- [ ] CI pipeline is green

---

## Fixture Generation

Test fixtures in `tests/fixtures/tokenization_reference.json` are pre-generated using Python tiktoken. **This is a one-time setup step, not required for running tests.**

### Initial Fixture Generation

```bash
# Create fixture generation script
cat > scripts/generate_fixtures.py <<'EOF'
#!/usr/bin/env python3
"""Generate tokenization reference fixtures for testing."""
import json
import tiktoken

test_cases = [
    # Basic ASCII
    "Hello world",
    "The quick brown fox jumps over the lazy dog",
    "",
    " ",
    "\n",
    
    # Unicode
    "Hello δΈ–η•Œ 🌍",
    "Emoji test: πŸŽ‰πŸŽŠπŸŽˆ",
    "Japanese: γ“γ‚“γ«γ‘γ―δΈ–η•Œ",
    "Arabic: Ω…Ψ±Ψ­Ψ¨Ψ§ Ψ¨Ψ§Ω„ΨΉΨ§Ω„Ω…",
    
    # Edge cases
    "a" * 1000,  # Repeated chars
    "Word " * 100,  # Repeated words
    "Mixed 123 !@# symbols",
]

encodings = {
    "gpt2": "gpt2",
    "p50k_base": "text-davinci-003",
    "cl100k_base": "gpt-4",
    "o200k_base": "gpt-4o",
}

fixtures = {}
for enc_name, model in encodings.items():
    enc = tiktoken.encoding_for_model(model)
    fixtures[enc_name] = []
    for text in test_cases:
        token_count = len(enc.encode(text))
        fixtures[enc_name].append({
            "input": text,
            "expected_tokens": token_count
        })

with open("tests/fixtures/tokenization_reference.json", "w") as f:
    json.dump(fixtures, f, indent=2, ensure_ascii=False)

print(f"βœ… Generated {sum(len(v) for v in fixtures.values())} test fixtures")
EOF

chmod +x scripts/generate_fixtures.py
```

### Regenerate Fixtures (if tiktoken-rs updates)

```bash
# Install Python tiktoken (match the vocabulary data used by tiktoken-rs).
# Note: o200k_base (gpt-4o) requires tiktoken >= 0.7; the 0.5.2 pin cited
# in Test 8 predates it and only covers the cl100k_base reference values.
pip install tiktoken

# Generate fixtures
mkdir -p tests/fixtures
python3 scripts/generate_fixtures.py

# Verify fixtures
cat tests/fixtures/tokenization_reference.json | jq '.cl100k_base[0]'
# Expected: {"input": "Hello world", "expected_tokens": 2}
```

**When to Regenerate**:
- After updating tiktoken-rs dependency
- When adding new test cases to `generate_fixtures.py`
- If OpenAI changes encoding implementations (rare)

---

**Quickstart Version**: 1.1 | **Last Updated**: 2026-03-13