tekken-rs
A Rust implementation of the Mistral Tekken tokenizer with audio support. This library provides fast tokenization of text and audio data and is designed to be fully compatible with Mistral AI's tokenizer.
Features
- Text Tokenization: Full compatibility with Mistral's Tekken tokenizer
- Audio Support: Encode and decode audio data with mel-scale spectrogram processing
- Multiple Versions: Support for various tokenizer versions (V7, etc.)
- Special Tokens: Complete handling of special tokens (BOS, EOS, audio tokens, etc.)
Installation
Add this to your Cargo.toml:
[dependencies]
tekken-rs = "0.1.0"
Or use the Git repository directly:
[dependencies]
tekken-rs = { git = "https://github.com/jorge-menjivar/tekken-rs" }
Quick Start
Basic Text Tokenization
use Tekkenizer;
use SpecialTokenPolicy;
Audio Processing
use ;
Examples
Run the examples to see the tokenizer in action:
# Basic tokenizer test
# Audio processing test
Testing
Run the test suite:
cargo test
Architecture
The tokenizer consists of several key components:
- tokenizer.rs: Main tokenizer implementation
- audio.rs: Audio processing and encoding functionality
- special_tokens.rs: Special token definitions and handling
- config.rs: Configuration structures
- errors.rs: Error handling
Audio Support
The audio implementation includes:
- WAV file loading and processing
- Mel-scale spectrogram computation
- Audio chunk encoding to tokens
- Compatibility with the Python implementation
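To make the mel-scale step concrete, here is a minimal sketch of how filter centre frequencies can be spaced evenly on the mel scale. It assumes the HTK variant of the mel formula; the filter bank in this crate and in the Python implementation may use a different variant (for example Slaney-style), and the function names below are hypothetical, not part of the tekken-rs API.

```rust
/// Convert a frequency in Hz to mels.
/// Assumption: HTK formula, chosen for illustration only.
fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Inverse of `hz_to_mel`: convert mels back to Hz.
fn mel_to_hz(mel: f64) -> f64 {
    700.0 * (10.0_f64.powf(mel / 2595.0) - 1.0)
}

/// Centre frequencies (in Hz) of `n_mels` filters spaced evenly on the
/// mel scale between `f_min` and `f_max` (both in Hz).
fn mel_center_frequencies(n_mels: usize, f_min: f64, f_max: f64) -> Vec<f64> {
    let m_min = hz_to_mel(f_min);
    let m_max = hz_to_mel(f_max);
    // n_mels + 2 evenly spaced points: the two outer points are the band
    // edges, the n_mels interior points are the filter centres.
    let step = (m_max - m_min) / (n_mels as f64 + 1.0);
    (1..=n_mels)
        .map(|i| mel_to_hz(m_min + step * i as f64))
        .collect()
}
```

For a 16 kHz signal the upper band edge is at most the Nyquist frequency of 8000 Hz, so a call such as mel_center_frequencies(80, 0.0, 8000.0) yields 80 centres that are dense at low frequencies and sparse at high ones. The bin count of 80 is only an illustrative value.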
Audio Token Flow
- Load Audio: Load WAV files or audio data
- Resample: Convert to target sampling rate (16 kHz)
- Pad: Ensure minimum length for processing
- Tokenize: Convert to token sequence with special audio markers
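The resample and pad steps above can be sketched in plain Rust as follows. This is an illustration under stated assumptions, not the crate's implementation: it uses simple linear interpolation (a production resampler would normally use a band-limited filter to avoid aliasing), it pads with trailing zeros, and both function names are hypothetical. The tokenize step is omitted because it depends on the crate's own API and special token IDs.

```rust
/// Resample mono samples from `from_hz` to `to_hz` by linear interpolation.
/// Illustrative only: no anti-aliasing filter is applied.
fn resample_linear(samples: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if samples.is_empty() || from_hz == to_hz {
        return samples.to_vec();
    }
    // Output length scales with the ratio of the two sampling rates.
    let out_len = (samples.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let ratio = from_hz as f64 / to_hz as f64;
    let last = samples.len() - 1;
    (0..out_len)
        .map(|i| {
            // Fractional position of output sample `i` in the input signal.
            let pos = i as f64 * ratio;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx.min(last)];
            let b = samples[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Right-pad with zeros (silence) so the signal has at least `min_len` samples.
/// Signals that are already long enough are returned unchanged.
fn pad_to_min_len(mut samples: Vec<f32>, min_len: usize) -> Vec<f32> {
    if samples.len() < min_len {
        samples.resize(min_len, 0.0);
    }
    samples
}
```

For example, one second of 48 kHz audio passed through resample_linear(&input, 48_000, 16_000) comes back as 16,000 samples, which can then be handed to pad_to_min_len with whatever minimum length the encoder requires.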
Compatibility
This Rust implementation is designed to be fully compatible with the Python version:
- Same tokenization results
- Identical audio processing
- Compatible special token handling
- Same mel filter bank computations
Requirements
- Rust 1.70 or higher
- For audio support: audio files in WAV format
Project Structure
tekken-rs/
├── src/
│ ├── lib.rs # Library entry point
│ ├── tokenizer.rs # Main tokenizer implementation
│ ├── audio.rs # Audio processing functionality
│ ├── special_tokens.rs # Special token definitions
│ ├── config.rs # Configuration structures
│ └── errors.rs # Error types
├── examples/ # Example usage
├── tests/ # Integration tests
└── benches/ # Performance benchmarks
Performance
The Rust implementation provides significant performance improvements over the Python version:
- Fast tokenization using efficient data structures
- Zero-copy string handling where possible
- Optimized audio processing with SIMD operations
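As one illustration of what zero-copy string handling can look like in Rust, the sketch below decodes token bytes with the standard library's String::from_utf8_lossy, which borrows the input when it is already valid UTF-8 and allocates only when invalid bytes must be replaced. Whether this crate uses this exact approach is an assumption; decode_bytes is a hypothetical helper, not part of the tekken-rs API.

```rust
use std::borrow::Cow;

/// Decode raw token bytes to text.
/// Returns `Cow::Borrowed` (no allocation, no copy) for valid UTF-8 and
/// `Cow::Owned` only when invalid sequences are replaced with U+FFFD.
fn decode_bytes(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

Callers that only read the text can use the returned value as a &str in both cases, so the common valid-UTF-8 path avoids a copy entirely.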
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to:
- Update tests as appropriate
- Follow Rust coding conventions
- Run cargo fmt and cargo clippy before submitting
See CONTRIBUTING.md for detailed guidelines.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
This is an original Rust implementation designed to be compatible with Mistral AI's Tekken tokenizer format.
See NOTICE file for detailed attribution.