miktik 0.1.0

# MikTik

Unified, multi-backend tokenizer library for LLMs, written in Rust.

MikTik provides one interface for token counting across model families:
- OpenAI models via `tiktoken-rs`
- Web tokenizer (`tokenizer.json`) models via `tokenizers`
- SentencePiece `.model` tokenizers via `sentencepiece-model` + `tokenizers`

## Design Goals
- Unified API for `encode` / `decode` / message token counting
- Lazy loading with a thread-safe registry cache
- Explicit error propagation (no hidden panic paths)
- No network I/O in the core library (resource ownership stays with the caller)

## Installation

```toml
[dependencies]
miktik = "0.1"
```

## Quick Start

```rust
use miktik::{Message, TokenizerRegistry};

let registry = TokenizerRegistry::new();

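// OpenAI models are built in and need no registration.
let text_tokens = registry.count_tokens("gpt-4o", "Hello, world!")?;

// Non-OpenAI families need a tokenizer resource registered before counting.
registry.register_model_file("claude", "/path/to/claude-tokenizer.json")?;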
let chat_tokens = registry.count_messages(
    "claude",
    &[
        Message::new("user", "What is Rust?"),
        Message::new("assistant", "A systems programming language."),
    ],
)?;
# Ok::<(), miktik::TokenizerError>(())
```

## Model Resolution

Raw model names are canonicalized by a rule chain:
- O-series (`o1`/`o3`/`gpt-5`) -> `o1`
- GPT-4 family -> `gpt-4o` / `gpt-4-32k` / `gpt-4`
- Claude / LLaMA / open-source aliases -> canonical family id
- Unknown models fall back to `gpt-3.5-turbo`
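
To make the rule chain concrete, here is a minimal, self-contained sketch of how such a resolver could look. It is not MikTik's actual implementation: the `Rule` type, the function names, the prefix/substring matching, and the `claude` / `llama` family ids are all assumptions chosen for illustration.

```rust
/// Hypothetical rule type: returns a canonical id if the rule matches.
type Rule = fn(&str) -> Option<&'static str>;

/// O-series names map to `o1` (prefix matching is an assumption).
fn o_series(name: &str) -> Option<&'static str> {
    let matched = ["o1", "o3", "gpt-5"].iter().any(|p| name.starts_with(p));
    matched.then_some("o1")
}

/// GPT-4 family: check the more specific prefixes before plain `gpt-4`.
fn gpt4_family(name: &str) -> Option<&'static str> {
    if name.starts_with("gpt-4o") {
        Some("gpt-4o")
    } else if name.starts_with("gpt-4-32k") {
        Some("gpt-4-32k")
    } else if name.starts_with("gpt-4") {
        Some("gpt-4")
    } else {
        None
    }
}

/// Other families; the ids returned here are illustrative placeholders.
fn open_families(name: &str) -> Option<&'static str> {
    if name.contains("claude") {
        Some("claude")
    } else if name.contains("llama") {
        Some("llama")
    } else {
        None
    }
}

/// Runs the rules in order; the first match wins, otherwise fall back.
fn canonicalize(raw: &str) -> &'static str {
    const RULES: [Rule; 3] = [o_series, gpt4_family, open_families];
    let name = raw.trim().to_ascii_lowercase();
    RULES
        .iter()
        .find_map(|rule| rule(name.as_str()))
        .unwrap_or("gpt-3.5-turbo")
}
```

Under these assumed rules, `canonicalize("gpt-4-turbo")` yields `gpt-4`, and an unrecognized name yields `gpt-3.5-turbo`.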

## Model Resource Registration

For non-OpenAI families, register resources before counting:
- `register_model_file(model, path)`
- `register_model_bytes(model, bytes)`
- Compatibility aliases:
  - `register_huggingface_file(model, path)`
  - `register_huggingface_bytes(model, bytes)`

Supported formats:
- Web tokenizer: `tokenizer.json`
- SentencePiece: `tokenizer.model`

## Thread Safety

`TokenizerRegistry` is safe for concurrent use:
- Uses `RwLock<HashMap<...>>` for the lazy cache
- Uses double-checked locking to avoid duplicate instantiation
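
The pattern described above can be sketched with the standard library alone. This is an illustration of double-checked locking over an `RwLock<HashMap<...>>`, not MikTik's source: `Tokenizer`, `LazyCache`, and `get_or_load` are hypothetical names, and the "load" step is a stand-in for real tokenizer construction.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a loaded tokenizer backend.
struct Tokenizer {
    model: String,
}

/// Minimal lazy cache (illustrative, not the real registry type).
struct LazyCache {
    entries: RwLock<HashMap<String, Arc<Tokenizer>>>,
}

impl LazyCache {
    fn new() -> Self {
        Self { entries: RwLock::new(HashMap::new()) }
    }

    fn get_or_load(&self, model: &str) -> Arc<Tokenizer> {
        // First check: shared read lock, the common fast path.
        {
            // `expect` keeps the sketch short; real code could propagate
            // lock poisoning as an error instead of panicking.
            let entries = self.entries.read().expect("lock poisoned");
            if let Some(hit) = entries.get(model) {
                return Arc::clone(hit);
            }
        } // read guard dropped here, before taking the write lock

        // Slow path: exclusive write lock.
        let mut entries = self.entries.write().expect("lock poisoned");
        // Second check: another thread may have inserted the entry
        // between releasing the read lock and acquiring the write lock.
        if let Some(hit) = entries.get(model) {
            return Arc::clone(hit);
        }
        let loaded = Arc::new(Tokenizer { model: model.to_string() });
        entries.insert(model.to_string(), Arc::clone(&loaded));
        loaded
    }
}
```

Readers share the lock on cache hits, and the second lookup under the write lock ensures that only one thread constructs a given entry even when several miss at the same time.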

## Integration

MikTik is designed for general Rust LLM projects and is actively used in
`TauriTavern`.

- TauriTavern: `https://github.com/Darkatse/TauriTavern`

## License

MIT