cuttle 0.1.0

A large language model inference engine in Rust

Cuttle 🦀

A CPU-based large language model inference engine, written in pure Rust and specifically optimized for the Qwen3-0.6B model.

✨ Features

  • 🦀 Pure Rust: no Python dependencies; high-performance CPU inference
  • 🤖 Qwen3-0.6B support: specifically optimized for the Qwen3-0.6B model
  • 🌐 Bilingual: text generation in both Chinese and English
  • 📦 Automatic download: built-in model download
  • 💻 Command-line interface: easy-to-use CLI tool
  • 🔧 Flexible configuration: configurable inference parameters and tokenization
  • 📊 Performance monitoring: built-in profiling and benchmarking

🏗️ Architecture

Cuttle adopts a modular design with the following main components:

  • Tensor Module (tensor): High-performance tensor operations using pure Rust
  • Model Module (model): Transformer architecture implementation
  • Tokenizer Module (tokenizer): Text tokenization and encoding
  • Inference Engine (inference): Complete inference pipeline
  • Utils Module (utils): Performance monitoring and utility functions

📦 Installation and Build

System Requirements

  • Rust 1.70+
  • Memory: 4 GB or more recommended
  • Storage: about 2 GB for model files
  • Network: an internet connection is required for the initial model download

Build from Source

# Clone the repository
git clone https://github.com/passchaos/cuttle.git
cd cuttle

# Debug build
cargo build

# Release build (recommended for real use)
cargo build --release

# Install the CLI tool
cargo install --path .

🚀 Quick Start

1. Download the Qwen3-0.6B Model

# Download the Qwen3-0.6B model files into the assets directory
cargo run -- download

# Force a re-download (if the files already exist)
cargo run -- download --force

2. Text Generation

# Chinese text generation
cargo run -- generate --prompt "你好,请介绍一下自己。"

# English text generation
cargo run -- generate --prompt "Hello, how are you?"

# Interactive mode
cargo run -- generate --interactive

# Custom parameters
cargo run -- generate \
  --prompt "请写一首关于春天的诗。" \
  --max-length 200 \
  --temperature 0.8 \
  --top-p 0.9

3. Show Model Information

# Display model information
cargo run -- info

💻 Programming Interface

Basic Usage

use cuttle::{
    InferenceEngine, Model, ModelConfig, 
    Tokenizer, InferenceConfig
};

// Create model configuration
let config = ModelConfig::default();
let model = Model::new(config)?;

// Create tokenizer
let mut tokenizer = cuttle::tokenizer::create_default_tokenizer();
let texts = vec!["hello world".to_string()];
tokenizer.build_vocab(&texts)?;

// Create inference engine
let engine = InferenceEngine::new(model, tokenizer);

// Generate text
let response = engine.generate("Hello, how are you?")?;
println!("Generated: {}", response);

Custom Inference Configuration

let inference_config = InferenceConfig {
    max_length: 512,
    temperature: 0.8,
    top_p: 0.9,
    top_k: 50,
    do_sample: true,
    repetition_penalty: 1.1,
};

let engine = InferenceEngine::with_config(model, tokenizer, inference_config);
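
The `repetition_penalty` field dampens tokens that have already been generated. A minimal sketch of the standard formulation (as popularized by CTRL); illustrative only, not Cuttle's internal implementation:

```rust
/// Apply a repetition penalty to raw logits: tokens already generated
/// become less likely. Positive logits are divided by the penalty and
/// negative logits multiplied, so the push is always toward "less likely".
fn apply_repetition_penalty(logits: &mut [f32], generated: &[usize], penalty: f32) {
    for &tok in generated {
        if let Some(l) = logits.get_mut(tok) {
            if *l > 0.0 {
                *l /= penalty;
            } else {
                *l *= penalty;
            }
        }
    }
}

fn main() {
    let mut logits = vec![2.0, -1.0, 0.5];
    // Tokens 0 and 1 were already generated; token 2 is untouched.
    apply_repetition_penalty(&mut logits, &[0, 1], 1.1);
    println!("{logits:?}");
}
```

A penalty of 1.0 disables the effect entirely, which is why 1.1 is a common mild default.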

Batch Processing

let prompts = vec![
    "What is AI?".to_string(),
    "Explain machine learning".to_string(),
    "How does deep learning work?".to_string(),
];

let responses = engine.generate_batch(&prompts)?;
for (prompt, response) in prompts.iter().zip(responses.iter()) {
    println!("Q: {}\nA: {}\n", prompt, response);
}

Tensor Operations

use cuttle::tensor::Tensor;

// Create tensors
let a = Tensor::randn(&[128, 256])?;
let b = Tensor::randn(&[256, 512])?;

// Matrix multiplication
let c = a.matmul(&b)?;

// Activation function
let activated = c.gelu();

// Softmax
let probs = activated.softmax(1)?;
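
For reference, a numerically stable softmax over one row can be written in a few lines of pure Rust. This sketch illustrates the technique; it is not the `Tensor::softmax` implementation:

```rust
/// Numerically stable softmax: subtract the row maximum before
/// exponentiating so exp() cannot overflow, then normalize to sum to 1.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let probs = softmax(&[1.0, 2.0, 3.0]);
    let total: f32 = probs.iter().sum();
    println!("{probs:?} (sum = {total})");
}
```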

⚙️ Configuration

Model Configuration (config.json)

Note: the values below are illustrative defaults, not the Qwen3-0.6B dimensions; those are listed in the model section further down.

{
  "vocab_size": 32000,
  "hidden_size": 4096,
  "num_layers": 32,
  "num_attention_heads": 32,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "rms_norm_eps": 1e-6
}
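
These fields map naturally onto a plain Rust struct. A hypothetical sketch (field names follow the JSON above; the crate's real `ModelConfig` may differ, and in practice it would derive `serde::Deserialize` to load `config.json`):

```rust
/// Hypothetical mirror of config.json; the crate's actual ModelConfig
/// may differ. With serde this would also derive Deserialize.
#[derive(Debug, Clone)]
struct ModelConfig {
    vocab_size: usize,
    hidden_size: usize,
    num_layers: usize,
    num_attention_heads: usize,
    intermediate_size: usize,
    max_position_embeddings: usize,
    rms_norm_eps: f64,
}

impl Default for ModelConfig {
    fn default() -> Self {
        Self {
            vocab_size: 32_000,
            hidden_size: 4_096,
            num_layers: 32,
            num_attention_heads: 32,
            intermediate_size: 11_008,
            max_position_embeddings: 2_048,
            rms_norm_eps: 1e-6,
        }
    }
}

fn main() {
    let cfg = ModelConfig::default();
    // The per-head dimension follows from hidden_size / num_attention_heads.
    println!("head_dim = {}", cfg.hidden_size / cfg.num_attention_heads);
}
```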

Configuration Options

  • --max-length: maximum generation length (default: 512)
  • --temperature: temperature parameter controlling randomness (default: 1.0)
  • --top-p: top-p (nucleus) sampling parameter (default: 0.9)
  • --top-k: top-k sampling parameter (default: 50)
  • --interactive: interactive mode
  • --force: force model re-download
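
These knobs interact in a fixed order: temperature rescales the logits, top-k keeps only the k highest-probability tokens, and top-p keeps the smallest set whose cumulative probability reaches p. A self-contained sketch of top-p filtering (illustrative; not Cuttle's internal code):

```rust
/// Nucleus (top-p) filtering: sort probabilities in descending order and
/// keep the smallest prefix whose cumulative mass reaches `p`; the next
/// token is then sampled only from the returned indices.
fn top_p_indices(probs: &[f32], p: f32) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let mut kept = Vec::new();
    let mut cum = 0.0;
    for i in idx {
        kept.push(i);
        cum += probs[i];
        if cum >= p {
            break;
        }
    }
    kept
}

fn main() {
    // With p = 0.9, the two most likely tokens (0.5 + 0.3 = 0.8) are not
    // enough, so a third token is included before the cutoff.
    let kept = top_p_indices(&[0.5, 0.3, 0.15, 0.05], 0.9);
    println!("{kept:?}");
}
```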

📊 Performance Benchmarks

Run benchmarks:

# Run all benchmarks
cargo bench

# Run specific benchmarks
cargo bench tensor_operations
cargo bench inference

Performance Optimization Tips

  1. Compilation Optimization: Use --release mode
  2. Pure Rust Implementation: No external BLAS dependencies required
  3. Parallel Processing: Utilize Rayon for parallel computation
  4. Memory Management: Avoid unnecessary memory allocations
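
Tip 3 in practice: Cuttle uses Rayon, but the idea of splitting a reduction across cores can be sketched with the standard library's scoped threads alone (with Rayon this whole function reduces to `data.par_iter().sum()`):

```rust
use std::thread;

/// Sum a slice in parallel by splitting it into chunks, one thread per
/// chunk, then combining the partial sums on the calling thread.
fn parallel_sum(data: &[f64], n_threads: usize) -> f64 {
    let chunk = data.len().div_ceil(n_threads);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<f64> = (1..=1000).map(|x| x as f64).collect();
    println!("{}", parallel_sum(&data, 4)); // 500500
}
```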

🧪 Testing

# Run unit tests
cargo test

# Run integration tests
cargo test --test integration

# Run documentation tests
cargo test --doc

📚 API Documentation

Generate and view API documentation:

cargo doc --open

🛠️ Development

Project Structure

cuttle/
├── src/
│   ├── lib.rs          # Library entry point
│   ├── main.rs         # CLI tool
│   ├── model.rs        # Model definition
│   ├── inference.rs    # Inference engine
│   ├── tensor.rs       # Tensor operations
│   ├── tokenizer.rs    # Tokenizer
│   ├── downloader.rs   # Model downloader
│   ├── error.rs        # Error handling
│   └── utils.rs        # Utility functions
├── assets/             # Model file storage
│   └── qwen3-0.6b/     # Qwen3-0.6B model files
├── examples/           # Example code
├── benches/            # Benchmarks
├── tests/              # Integration tests
├── Cargo.toml          # Project manifest
└── README.md           # Project documentation

🤖 Qwen3-0.6B Model Configuration

  • Parameters: 0.6B
  • Vocabulary size: 151,936
  • Hidden size: 1,024
  • Layers: 28
  • Attention heads: 16
  • Key/value heads: 8 (GQA)
  • Supported languages: Chinese, English, and others
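
Because only the 8 key/value heads are cached under GQA, the KV cache stays small. A back-of-the-envelope estimate, assuming f32 storage and head_dim = hidden_size / num_attention_heads = 64 (an inference from the table above, not a published figure):

```rust
/// Rough KV-cache size per generated token, in bytes. Assumes f32 storage;
/// the factor of 2 covers keys plus values, cached at every layer.
fn kv_cache_bytes_per_token(num_layers: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    2 * num_layers * num_kv_heads * head_dim * std::mem::size_of::<f32>()
}

fn main() {
    let head_dim = 1024 / 16; // assumed: hidden_size / num_attention_heads
    let bytes = kv_cache_bytes_per_token(28, 8, head_dim);
    println!("KV cache: {} KiB per token", bytes / 1024); // 112 KiB
}
```

Caching 8 KV heads instead of the 16 query heads halves this figure, which is exactly the memory saving GQA is designed for.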

📝 Usage Examples

Chinese Text Generation

cargo run -- generate --prompt "请写一首关于春天的诗。" --max-length 150

English Text Generation

cargo run -- generate --prompt "Explain quantum computing in simple terms." --max-length 200

Interactive Chat

cargo run -- generate --interactive

Contributing Guidelines

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Create a Pull Request

Code Style

  • Use rustfmt to format code
  • Use clippy for code linting
  • Write comprehensive documentation and tests

# Format code
cargo fmt

# Code linting
cargo clippy

🔧 Troubleshooting

Common Issues

Q: Compilation errors

A: Ensure you have the latest Rust toolchain:

# Update Rust
rustup update

# The Rust 2024 edition requires Rust 1.85 or newer
rustup toolchain install stable

Q: Slow inference speed

A: Check the following optimization options:

  • Compile with --release mode
  • Adjust batch processing size
  • Use smaller models for testing
  • Enable parallel processing

Q: High memory usage

A: Try the following approaches:

  • Reduce model size
  • Lower batch processing size
  • Use smaller sequence lengths

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • rayon - Parallel computing framework
  • serde - Serialization framework
  • clap - Command line argument parsing
  • tokio - Asynchronous runtime


Cuttle - Power your AI inference with Rust 🦀✨