token-count 0.4.0

Count tokens for LLM models using exact tokenization

# Contributing to token-count

Thank you for your interest in contributing to `token-count`! This document provides guidelines and instructions for contributing.

## Table of Contents

- [Code of Conduct](#code-of-conduct)
- [Getting Started](#getting-started)
- [Development Setup](#development-setup)
- [Development Workflow](#development-workflow)
- [Code Quality Standards](#code-quality-standards)
- [Testing](#testing)
- [Submitting Changes](#submitting-changes)
- [Project Structure](#project-structure)

## Code of Conduct

This project follows the Rust Code of Conduct. Please be respectful and constructive in all interactions.

## Getting Started

### Prerequisites

- **Rust**: 1.85.0 or later
- **Git**: For version control
- **OS**: Linux (Ubuntu 22.04+), macOS, or Windows

### Fork and Clone

```bash
# Fork the repository on GitHub, then:
git clone https://github.com/YOUR_USERNAME/token-count
cd token-count
```

## Development Setup

### 1. Install Rust

```bash
# Install rustup if you haven't already
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Ensure you have Rust 1.85.0+
rustup update
rustc --version  # Should be 1.85.0 or later
```

### 2. Build the Project

```bash
# Build in debug mode
cargo build

# Build in release mode
cargo build --release
```

### 3. Run Tests

```bash
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test suite
cargo test --test model_aliases
```

### 4. Run Benchmarks

```bash
# Run performance benchmarks
cargo bench

# View results in target/criterion/report/index.html
```

## Development Workflow

### 1. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix
```

### 2. Make Changes

Follow the [Code Quality Standards](#code-quality-standards) below.

### 3. Run Quality Checks

```bash
# Format code
cargo fmt

# Check formatting
cargo fmt --check

# Run linter (must have zero warnings)
cargo clippy -- -D warnings

# Run tests
cargo test

# Security audit
cargo install cargo-audit  # First time only
cargo audit
```

### 4. Commit Changes

```bash
# Stage your changes
git add .

# Commit with a descriptive message
git commit -m "feat: add support for new model"
# or
git commit -m "fix: correct token count for edge case"
```

**Commit message format**:
- `feat:` - New feature
- `fix:` - Bug fix
- `docs:` - Documentation changes
- `test:` - Test additions or changes
- `refactor:` - Code refactoring
- `perf:` - Performance improvements
- `chore:` - Build process or tooling changes

### 5. Push and Create Pull Request

```bash
git push origin feature/your-feature-name
```

Then create a Pull Request on GitHub.

## Code Quality Standards

### Non-Negotiable Rules

#### 1. No Disabled Linting Rules ❌

**NEVER** add these to your code:
- `#[allow(clippy::...)]`
- `// eslint-disable`
- `@ts-ignore`
- `@ts-nocheck`

**Instead**: Refactor your code to comply with the lint rule.

**Example - Wrong vs Right:**

```rust
// ❌ WRONG - Disabling the rule
#[allow(clippy::needless_return)]
fn calculate(x: i32) -> i32 {
    return x * 2;
}

// ✅ RIGHT - Following the rule
fn calculate(x: i32) -> i32 {
    x * 2
}
```

#### 2. 100% Type Safety ✅

- No `any` types in TypeScript (if we add TS tooling)
- Proper Rust types, no excessive `unwrap()` in library code
- Use `Result<T, E>` for fallible operations
- Use `Option<T>` for optional values
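
A minimal sketch of what this looks like in practice. The names here (`resolve_model`, `encoding_for`, the `TokenError` variant) are illustrative only, not the crate's actual API:

```rust
// Illustrative only: prefer propagating errors over unwrapping in library code.

#[derive(Debug)]
enum TokenError {
    UnknownModel(String),
}

fn resolve_model(name: &str) -> Option<&'static str> {
    // Returns None instead of panicking on an unknown alias.
    match name {
        "gpt-4" | "gpt4" => Some("cl100k_base"),
        _ => None,
    }
}

fn encoding_for(name: &str) -> Result<&'static str, TokenError> {
    // Converts the Option into a Result with a descriptive error.
    resolve_model(name).ok_or_else(|| TokenError::UnknownModel(name.to_string()))
}

fn main() {
    // unwrap() is acceptable here because this is example/test code, not library code.
    assert_eq!(encoding_for("gpt-4").unwrap(), "cl100k_base");
    assert!(encoding_for("no-such-model").is_err());
}
```

Callers then decide how to handle the error (report it, suggest an alias, exit with a status code) rather than having the library panic on their behalf.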

#### 3. Documentation Required 📝

All public APIs must have doc comments with examples:

```rust
/// Count tokens in the given text using the specified model
///
/// # Arguments
///
/// * `text` - The input text to tokenize
/// * `model_name` - The name of the model to use (canonical name or alias)
///
/// # Returns
///
/// Returns a `TokenizationResult` containing the token count and model information
///
/// # Errors
///
/// Returns an error if:
/// - The model name is unknown
/// - Tokenization fails
///
/// # Example
///
/// ```
/// use token_count::count_tokens;
///
/// let result = count_tokens("Hello world", "gpt-4").unwrap();
/// assert_eq!(result.token_count, 2);
/// ```
pub fn count_tokens(text: &str, model_name: &str) -> Result<TokenizationResult, TokenError> {
    // Implementation
}
```

#### 4. Test Coverage ✅

- All new features must have tests
- All bug fixes must have regression tests
- Integration tests for user-facing functionality
- Unit tests for internal logic
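
A regression test pins down a previously buggy input so the fix cannot silently regress. A minimal sketch, using a hypothetical `count_words` helper that is not part of this crate:

```rust
// Hypothetical function under test, for illustration only.
fn count_words(text: &str) -> usize {
    // Bug once reported (hypothetically): empty input must yield 0, not 1.
    text.split_whitespace().count()
}

#[cfg(test)]
mod tests {
    use super::*;

    // Name the test after the bug it guards against.
    #[test]
    fn regression_empty_input_is_zero() {
        assert_eq!(count_words(""), 0);
    }

    #[test]
    fn counts_simple_words() {
        assert_eq!(count_words("hello world"), 2);
    }
}

fn main() {
    assert_eq!(count_words(""), 0);
    assert_eq!(count_words("hello world"), 2);
}
```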

### Verification Protocol (Before Committing)

**⚠️ Complete this checklist before EVERY commit:**

- [ ] `cargo test` - All tests pass
- [ ] `cargo clippy -- -D warnings` - Zero warnings
- [ ] `cargo fmt --check` - Code is formatted
- [ ] `cargo build --release` - Release build succeeds
- [ ] Manually test the affected functionality
- [ ] Update documentation if behavior changed
- [ ] Add tests for new functionality

## Testing

### Test Structure

```
tests/
├── fixtures/              # Test data files
│   ├── ascii.txt
│   ├── unicode.txt
│   ├── large.txt
│   └── invalid_utf8.bin
├── model_aliases.rs       # Model resolution tests
├── verbosity.rs           # Output format tests
├── performance.rs         # Performance tests
├── error_handling.rs      # Error scenario tests
├── end_to_end.rs          # E2E integration tests
└── ...
```

### Writing Tests

```rust
use assert_cmd::Command;
use predicates::prelude::*;

#[test]
fn test_new_feature() {
    let mut cmd = Command::cargo_bin("token-count").unwrap();
    cmd.arg("--model")
        .arg("gpt-4")
        .write_stdin("test input");

    cmd.assert()
        .success()
        .stdout(predicate::str::contains("expected output"));
}
```

### Running Specific Tests

```bash
# Run all tests
cargo test

# Run specific test file
cargo test --test model_aliases

# Run specific test function
cargo test test_fuzzy_suggestions

# Run with output
cargo test -- --nocapture

# Run benchmarks
cargo bench
```

## Submitting Changes

### Pull Request Process

1. **Ensure all checks pass**
   - All tests pass
   - Zero clippy warnings
   - Code is formatted
   - Security audit clean

2. **Update documentation**
   - Update README.md if user-facing changes
   - Update CHANGELOG.md with your changes
   - Add/update doc comments

3. **Write a clear PR description**
   - What does this change do?
   - Why is this change needed?
   - How was it tested?
   - Any breaking changes?

4. **Request review**
   - Tag relevant maintainers
   - Respond to feedback promptly
   - Make requested changes

5. **Wait for CI**
   - GitHub Actions will run all tests
   - All checks must pass before merge

### PR Title Format

Use conventional commits format:

```
feat: add support for Claude models
fix: correct token count for emoji
docs: update README with new examples
test: add integration tests for file input
refactor: simplify error handling
perf: optimize large file processing
```

## Project Structure

```
token-count/
├── .github/
│   └── workflows/
│       └── ci.yml          # GitHub Actions CI
├── .specify/               # Specification documents
│   ├── memory/
│   │   └── constitution.md
│   └── features/
│       └── 001-core-cli.md
├── benches/                # Performance benchmarks
│   └── tokenization.rs
├── scripts/                # Development scripts
│   └── generate_fixtures.py
├── src/
│   ├── lib.rs             # Public library API
│   ├── main.rs            # Binary entry point
│   ├── cli/               # CLI argument parsing
│   │   ├── args.rs        # Clap definitions
│   │   ├── input.rs       # Stdin reading
│   │   └── mod.rs
│   ├── tokenizers/        # Tokenization engine
│   │   ├── openai.rs      # OpenAI tokenizer
│   │   ├── registry.rs    # Model registry
│   │   └── mod.rs
│   ├── output/            # Output formatters
│   │   ├── simple.rs      # Simple formatter
│   │   ├── verbose.rs     # Verbose formatter
│   │   ├── debug.rs       # Debug formatter
│   │   └── mod.rs
│   └── error.rs           # Error types
├── specs/                 # Implementation plans
│   └── 001-core-cli/
│       ├── plan.md
│       ├── research.md
│       ├── data-model.md
│       └── tasks.md
├── tests/                 # Integration tests
│   ├── fixtures/          # Test data
│   └── *.rs               # Test files
├── Cargo.toml             # Rust project manifest
├── README.md              # User documentation
├── CHANGELOG.md           # Version history
├── CONTRIBUTING.md        # This file
└── LICENSE                # MIT license
```

## Release Process

This section is for maintainers who have permission to create releases.

### Pre-Release Checklist

Before creating a release, ensure:

- [ ] All tests pass: `cargo test`
- [ ] No clippy warnings: `cargo clippy -- -D warnings`
- [ ] Code is formatted: `cargo fmt --check`
- [ ] CHANGELOG.md is updated with version and date
- [ ] README.md reflects current features
- [ ] All documentation is up to date

### Creating a Release

1. **Update version in `Cargo.toml`**:
   ```toml
   version = "0.1.0"  # Increment according to semver
   ```

2. **Update CHANGELOG.md**:
   ```markdown
   ## [0.1.0] - 2026-03-14
   
   ### Added
   - Feature descriptions...
   ```

3. **Commit version bump**:
   ```bash
   git add Cargo.toml CHANGELOG.md
   git commit -m "chore: bump version to 0.1.0"
   git push origin main
   ```

4. **Create and push tag**:
   ```bash
   git tag v0.1.0
   git push origin v0.1.0
   ```

5. **Monitor GitHub Actions**:
   - Go to: https://github.com/shaunburdick/token-count/actions
   - Verify the release workflow completes successfully
   - Check that all 4 platform binaries are built
   - Verify checksums.txt is generated

6. **Verify draft release**:
   - Go to: https://github.com/shaunburdick/token-count/releases
   - Review the auto-generated draft release
   - Check that all assets are attached:
     - token-count-VERSION-x86_64-unknown-linux-gnu.tar.gz
     - token-count-VERSION-x86_64-apple-darwin.tar.gz
     - token-count-VERSION-aarch64-apple-darwin.tar.gz
     - token-count-VERSION-x86_64-pc-windows-msvc.zip
     - checksums.txt
   - Edit release notes if needed
   - **Publish the release** (remove draft status)

7. **Verify automated updates**:
   - **Homebrew**: Check that homebrew-tap repository received an automatic commit updating the formula
   - **crates.io**: Verify package is published at https://crates.io/crates/token-count

8. **Test all installation methods**:
   ```bash
   # Install script
   curl -sSfL https://raw.githubusercontent.com/shaunburdick/token-count/main/install.sh | bash
   token-count --version  # Should show new version
   
   # Homebrew (wait ~5 minutes for formula update)
   brew update
   brew upgrade token-count
   token-count --version
   
   # Cargo
   cargo install token-count --force
   token-count --version
   
   # Manual download
   # Download the archive from the releases page, then verify it, e.g.:
   # sha256sum -c --ignore-missing checksums.txt
   ```

### Hotfix Releases

For critical bug fixes:

1. Create hotfix branch from the release tag:
   ```bash
   git checkout -b hotfix/v0.1.1 v0.1.0
   ```

2. Make the fix and test thoroughly

3. Follow the release process above with the new version

4. Merge hotfix back to main:
   ```bash
   git checkout main
   git merge hotfix/v0.1.1
   git push origin main
   ```

### Rollback Procedure

If a release has critical issues:

1. **Delete the GitHub release** (not the tag):
   - Go to releases page
   - Click "Delete" on the problematic release

2. **Notify users**:
   - Create a GitHub issue explaining the problem
   - Update CHANGELOG.md with a note

3. **Create a fixed release** following the process above

### Troubleshooting Releases

**Release workflow fails:**
- Check GitHub Actions logs for errors
- Common issues:
  - Compilation errors on specific platforms
  - Missing secrets (HOMEBREW_TAP_TOKEN, CARGO_REGISTRY_TOKEN)
  - Network issues downloading dependencies

**Homebrew formula doesn't update:**
- Verify HOMEBREW_TAP_TOKEN secret is valid
- Check that the tag is an official release (no `-` in version)
- Manually update formula as fallback (see specs/002-installation/HOMEBREW-SETUP.md)

**crates.io publish fails:**
- Verify CARGO_REGISTRY_TOKEN is valid
- Check that version doesn't already exist
- Ensure Cargo.toml metadata is complete

## Getting Help

- **Issues**: [GitHub Issues](https://github.com/shaunburdick/token-count/issues)
- **Discussions**: [GitHub Discussions](https://github.com/shaunburdick/token-count/discussions)
- **Email**: hello@burdick.dev

## License

By contributing, you agree that your contributions will be licensed under the MIT License.

---

Thank you for contributing to `token-count`! 🎉