# Cuttle 🦀

A CPU-based large language model inference engine written in pure Rust, optimized for the Qwen3-0.6B model.
## ✨ Features

- 🦀 Pure Rust Implementation: No Python dependencies; high-performance CPU inference
- 🤖 Qwen3-0.6B Support: Specifically optimized for the Qwen3-0.6B model
- 🌏 Bilingual Support: Generates both Chinese and English text
- 📦 Auto Download: Automatic model download
- 💻 Command Line Interface: Easy-to-use CLI tool
- 🔧 Flexible Configuration: Configurable inference parameters and tokenization system
- 📊 Performance Monitoring: Built-in performance analysis and benchmarking
## 🏗️ Architecture

Cuttle adopts a modular design with the following main components:

- Tensor Module (`tensor`): High-performance tensor operations in pure Rust
- Model Module (`model`): Transformer architecture implementation
- Tokenizer Module (`tokenizer`): Text tokenization and encoding
- Inference Engine (`inference`): Complete inference pipeline
- Utils Module (`utils`): Performance monitoring and utility functions
## 📦 Installation and Build

### System Requirements

- Rust 1.70+
- Memory: 4GB+ recommended
- Storage: ~2GB for model files
- Network: Internet connection required for initial model download

### Build from Source
```bash
# Clone repository (repository URL not given in the original)
git clone <repo-url>
cd cuttle

# Debug build
cargo build

# Release build (recommended for production use)
cargo build --release

# Install command line tool
cargo install --path .
```
## 🚀 Quick Start

### 1. Download Qwen3-0.6B Model

```bash
# Download Qwen3-0.6B model files to assets directory
# (subcommand name assumed from context)
cuttle download

# Force re-download (if files already exist)
cuttle download --force
```
### 2. Text Generation

```bash
# Chinese text generation (prompts are illustrative; subcommand name assumed)
cuttle generate "你好，请介绍一下自己"

# English text generation
cuttle generate "Hello, please introduce yourself"

# Interactive mode
cuttle generate --interactive

# Custom parameters
cuttle generate "Hello" --max-length 256 --temperature 0.8 --top-p 0.9 --top-k 50
```
### 3. View Model Information

```bash
# Display model information (subcommand name assumed)
cuttle info
```
## 💻 Programming Interface

### Basic Usage

```rust
// Type and function names below are reconstructed from the module
// layout above and may differ slightly from the actual API.
use cuttle::{InferenceEngine, Model, ModelConfig, create_default_tokenizer};

// Create model configuration
let config = ModelConfig::default();
let model = Model::new(&config)?;

// Create tokenizer and build its vocabulary
let mut tokenizer = create_default_tokenizer();
let texts = vec!["Hello, world!".to_string()];
tokenizer.build_vocab(&texts)?;

// Create inference engine
let engine = InferenceEngine::new(model, tokenizer);

// Generate text
let response = engine.generate("Hello, please introduce yourself")?;
println!("{}", response);
```
### Custom Inference Configuration

```rust
// Field names mirror the CLI options and are assumed
let inference_config = InferenceConfig {
    max_length: 512,
    temperature: 0.8,
    top_p: 0.9,
    top_k: 50,
    ..Default::default()
};
let engine = InferenceEngine::with_config(model, tokenizer, inference_config);
```
### Batch Processing

```rust
let prompts = vec!["First prompt".to_string(), "Second prompt".to_string()];
let responses = engine.generate_batch(&prompts)?;
for (prompt, response) in prompts.iter().zip(responses.iter()) {
    println!("{prompt} -> {response}");
}
```
### Tensor Operations

```rust
use cuttle::tensor::Tensor;

// Create tensors (shapes are illustrative)
let a = Tensor::randn(&[2, 3])?;
let b = Tensor::randn(&[3, 4])?;

// Matrix multiplication
let c = a.matmul(&b)?;

// Activation function
let activated = c.gelu();

// Softmax over the last dimension
let probs = activated.softmax(-1)?;
```
## ⚙️ Configuration
### Model Configuration (config.json)
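The contents of `config.json` are not reproduced here; going by the Qwen3-0.6B parameters listed later in this README, it plausibly carries fields along these lines (Hugging Face-style field names, assumed rather than confirmed):

```json
{
  "vocab_size": 151936,
  "hidden_size": 1024,
  "num_hidden_layers": 28,
  "num_attention_heads": 16,
  "num_key_value_heads": 8
}
```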
### Configuration Options
- `--max-length`: Maximum generation length (default: 512)
- `--temperature`: Temperature parameter controlling randomness (default: 1.0)
- `--top-p`: Top-p sampling parameter (default: 0.9)
- `--top-k`: Top-k sampling parameter (default: 50)
- `--interactive`: Interactive mode
- `--force`: Force re-download of model files
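The `--top-k` and `--top-p` options combine two standard filters over the next-token distribution. A minimal sketch of how such filtering works (illustrative only, not Cuttle's actual sampler):

```rust
// Keep at most `top_k` highest-probability tokens, then cut the list
// once cumulative probability mass reaches `top_p`, and renormalize.
fn filter_top_k_top_p(probs: &[f32], top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    // Rank token indices by descending probability
    let mut ranked: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    let mut kept = Vec::new();
    let mut cumulative = 0.0_f32;
    for &(idx, p) in ranked.iter().take(top_k) {
        kept.push((idx, p));
        cumulative += p;
        if cumulative >= top_p {
            break;
        }
    }
    // Renormalize the surviving probabilities to sum to 1
    let total: f32 = kept.iter().map(|&(_, p)| p).sum();
    for item in kept.iter_mut() {
        item.1 /= total;
    }
    kept
}

fn main() {
    let probs = [0.5, 0.3, 0.1, 0.05, 0.05];
    let kept = filter_top_k_top_p(&probs, 50, 0.9);
    // 0.5 + 0.3 = 0.8 < 0.9; adding 0.1 reaches 0.9, so three tokens survive
    println!("{} candidates kept", kept.len());
}
```

Lower `--top-p` or `--top-k` makes output more deterministic; higher values admit more candidates and more randomness.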
## 📊 Performance Benchmarks

Run benchmarks:

```bash
# Run all benchmarks
cargo bench

# Run specific benchmarks (substitute a bench target from benches/)
cargo bench --bench <name>
```
### Performance Optimization Tips

- Compilation Optimization: Use `--release` mode
- Pure Rust Implementation: No external BLAS dependencies required
- Parallel Processing: Utilize Rayon for parallel computation
- Memory Management: Avoid unnecessary memory allocations
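The Rayon tip amounts to fanning batch items out across CPU cores. A dependency-free sketch of the same idea with scoped std threads (illustrative; Cuttle's actual code uses Rayon, and `process_batch` is a hypothetical helper):

```rust
use std::thread;

// Process a batch of items in parallel with scoped threads; the
// closure body is a stand-in for per-prompt inference work.
fn process_batch(prompts: &[&str]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = prompts
            .iter()
            .map(|p| s.spawn(move || p.len())) // stand-in for real work
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let lens = process_batch(&["short", "a longer prompt"]);
    println!("{lens:?}");
}
```

With Rayon the same pattern collapses to a `par_iter().map(...).collect()` chain, which also handles work-stealing across cores.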
## 🧪 Testing

```bash
# Run unit tests
cargo test

# Run integration tests
cargo test --test '*'

# Run documentation tests
cargo test --doc
```
## 📚 API Documentation

Generate and view API documentation:

```bash
cargo doc --open
```
## 🛠️ Development

### Project Structure
```
cuttle/
├── src/
│   ├── lib.rs          # Library entry point
│   ├── main.rs         # Command line tool
│   ├── model.rs        # Model definition
│   ├── inference.rs    # Inference engine
│   ├── tensor.rs       # Tensor operations
│   ├── tokenizer.rs    # Tokenizer
│   ├── downloader.rs   # Model downloader
│   ├── error.rs        # Error handling
│   └── utils.rs        # Utility functions
├── assets/             # Model file storage directory
│   └── qwen3-0.6b/     # Qwen3-0.6B model files
├── examples/           # Example code
├── benches/            # Performance tests
├── tests/              # Integration tests
├── Cargo.toml          # Project configuration
└── README.md           # Project documentation
```
## 🤖 Qwen3-0.6B Model Configuration

- Parameters: 0.6B
- Vocabulary Size: 151,936
- Hidden Dimension: 1,024
- Layers: 28
- Attention Heads: 16
- Key-Value Heads: 8 (GQA)
- Supported Languages: Chinese, English, and other languages
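The GQA entry means key/value projections are shared across query heads: with 16 query heads and 8 KV heads, each KV head serves a group of 2 query heads. The arithmetic, using the numbers from this list (an illustrative sketch, not Cuttle's internal code):

```rust
// Head bookkeeping for Grouped-Query Attention (GQA),
// with the Qwen3-0.6B shape listed above.
const HIDDEN_DIM: usize = 1024;
const NUM_HEADS: usize = 16; // query heads
const NUM_KV_HEADS: usize = 8; // key/value heads

fn main() {
    let head_dim = HIDDEN_DIM / NUM_HEADS; // 64
    let group_size = NUM_HEADS / NUM_KV_HEADS; // 2 query heads per KV head
    // Query head q attends using KV head q / group_size
    for q in 0..NUM_HEADS {
        let kv = q / group_size;
        assert!(kv < NUM_KV_HEADS);
    }
    println!("head_dim={head_dim}, queries per KV head={group_size}");
}
```

Halving the KV heads halves the KV-cache size relative to full multi-head attention, which matters for CPU inference memory.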
## 📝 Usage Examples

### Chinese Text Generation

### English Text Generation

### Interactive Dialogue
## Contributing Guidelines

1. Fork the project
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Create a Pull Request
### Code Style

- Use `rustfmt` to format code
- Use `clippy` for code linting
- Write comprehensive documentation and tests

```bash
# Format code
cargo fmt

# Code linting
cargo clippy
```
## 🔧 Troubleshooting

### Common Issues

Q: Compilation errors

A: Ensure you have an up-to-date Rust toolchain (the project targets the Rust 2024 edition):

```bash
# Update Rust
rustup update
```
Q: Slow inference speed

A: Check the following optimization options:

- Compile in `--release` mode
- Adjust the batch size
- Use smaller models for testing
- Enable parallel processing
Q: High memory usage
A: Try the following approaches:
- Reduce model size
- Lower batch processing size
- Use smaller sequence lengths
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- rayon - Parallel computing framework
- serde - Serialization framework
- clap - Command line argument parsing
- tokio - Asynchronous runtime
## 🔗 Related Links

Cuttle - Power your AI inference with Rust 🦀✨