# MiniLLM
A lightweight, efficient transformer inference engine written in Rust. MiniLLM provides a clean, well-documented implementation of GPT-2 style transformer models with support for text generation.
## Features

- **Fast Inference**: Efficient tensor operations using `ndarray`
- **Memory Safe**: Written in Rust with zero-copy operations where possible
- **Easy to Use**: High-level API for quick integration
- **Well Tested**: Comprehensive examples and documentation
- **Extensible**: Modular architecture for easy customization
- **GPT-2 Compatible**: Load and run GPT-2 models from HuggingFace
- **SafeTensors Support**: Fast and secure model weight loading
## Architecture

```
src/
├── lib.rs          # Library entry point and public API
├── main.rs         # Simple CLI example (27 lines)
├── inference.rs    # High-level inference engine
├── gpt.rs          # GPT model implementation
├── transformer.rs  # Transformer block components
├── attention.rs    # Multi-head attention mechanism
├── mlp.rs          # Feed-forward network layers
├── tensor.rs       # Tensor operations and math
├── weights.rs      # Model weight loading (SafeTensors)
└── config.rs       # Model configuration handling

examples/
├── basic_generation.rs  # Simple text generation
├── interactive_chat.rs  # Interactive chat interface
└── tokenization.rs      # Tokenization examples
```
## Quick Start

### Library Usage

```rust
// Crate path assumed to be `minillm`; adjust to the name in Cargo.toml.
use minillm::InferenceEngine;
```

See the API documentation below for the engine's methods.
### Command Line

```bash
# Run the main example
cargo run --release

# Run specific examples
cargo run --release --example basic_generation
cargo run --release --example interactive_chat
cargo run --release --example tokenization
```
## Requirements
- Rust 1.70+
- HuggingFace token (optional, for private models)
Set your HuggingFace token:
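For example (assuming a recent `hf-hub`, which reads the `HF_TOKEN` environment variable; older versions read the token from `~/.cache/huggingface/token` instead):

```shell
# Replace with your own token from https://huggingface.co/settings/tokens
export HF_TOKEN="hf_your_token_here"
```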
## Dependencies

- `ndarray` - Tensor operations
- `safetensors` - Model weight loading
- `tokenizers` - Text tokenization
- `hf-hub` - HuggingFace model downloading
- `serde` - Configuration parsing
## API Documentation

### `InferenceEngine`
The main high-level interface, sketched with illustrative arguments:

```rust
// Create an engine (model name is illustrative)
let engine = InferenceEngine::new("gpt2")?;

// Generate text (prompt and token count are illustrative)
let result = engine.generate("Once upon a time", 50)?;

// Tokenization
let tokens = engine.tokenize("Hello, world")?;
let text = engine.decode(&tokens)?;

// Get model info
let config = engine.config;
```
### Low-Level Components

For custom implementations, you can use the individual components:

- `GPTModel` - Complete transformer model
- `TransformerBlock` - Individual transformer layers
- `MultiHeadAttention` - Attention mechanism
- `MLP` - Feed-forward networks
- `Tensor` - Mathematical operations
## Examples

### Basic Generation

Demonstrates simple text generation with model configuration display.

### Interactive Chat

Interactive command-line chat interface with the model.

### Tokenization

Shows tokenization, encoding/decoding, and round-trip verification.
## Performance

MiniLLM is designed for inference efficiency:

- **Memory**: ~1 GB RAM for GPT-2 (117M parameters)
- **Speed**: ~10-50 tokens/second (CPU; varies by hardware)
- **Accuracy**: Identical outputs to reference implementations
- **Models**: Currently supports the GPT-2 architecture
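A back-of-envelope check on the memory figure (assuming f32 weights; this arithmetic is illustrative, not measured):

```rust
// 117M parameters stored as f32 (4 bytes each) account for roughly half
// of the ~1 GB figure; activations, the tokenizer, and allocator overhead
// make up the rest.
fn weight_bytes(params: u64) -> u64 {
    params * 4 // bytes per f32
}

fn main() {
    let gb = weight_bytes(117_000_000) as f64 / 1e9;
    println!("GPT-2 f32 weights: ~{gb:.2} GB");
}
```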
## Development

```bash
# Clone and build (substitute the repository URL)
git clone <repo-url>
cd minillm
cargo build --release

# Run tests
cargo test

# Check examples
cargo check --examples

# Generate documentation
cargo doc --open
```
## Architecture Details

### Transformer Implementation
- Multi-head attention with causal masking
- Feed-forward networks with GELU activation
- Layer normalization and residual connections
- Position and token embeddings
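For instance, the GELU activation used in the feed-forward layers is typically the tanh approximation below; this is a minimal sketch, not the crate's own `mlp.rs` code:

```rust
// Tanh-approximation GELU, the variant GPT-2 uses in its MLP blocks.
fn gelu(x: f32) -> f32 {
    // sqrt(2 / pi)
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x)).tanh())
}

fn main() {
    // Near zero for negative inputs, near identity for positive ones.
    println!("{:.4} {:.4} {:.4}", gelu(-3.0), gelu(0.0), gelu(3.0));
}
```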
### Tensor Operations
- Dynamic 1D-4D tensor support
- Optimized matrix multiplication
- Element-wise operations (add, softmax, layer_norm)
- Memory-efficient implementations
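As a sketch of the softmax listed above (the crate's own version operates on `ndarray` tensors rather than slices), subtracting the row max before exponentiating keeps large logits from overflowing:

```rust
// Numerically stable softmax over one row of logits.
fn softmax(row: &[f32]) -> Vec<f32> {
    // Subtracting the max leaves the result mathematically unchanged
    // but prevents exp() from overflowing on large inputs.
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

fn main() {
    let probs = softmax(&[1.0, 2.0, 1000.0]);
    println!("{probs:?}"); // finite probabilities even with an extreme logit
}
```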
### Model Loading
- SafeTensors format support
- Automatic model downloading from HuggingFace
- Configuration parsing and validation
- Error handling with detailed messages
## Current Status

- ✅ **Core Architecture**: Complete GPT-2 implementation
- ✅ **Inference Engine**: High-level API ready
- ✅ **Examples**: Comprehensive usage examples
- ✅ **Documentation**: Well-documented codebase
- ✅ **Testing**: All components tested and working
## Roadmap

- **Performance**: GPU acceleration support
- **Models**: Support for larger GPT variants
- **Features**: Beam search and sampling options
- **Optimization**: Quantization and pruning
- **Integration**: Python bindings
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
- Inspired by Andrej Karpathy's educational implementations
- Built on the excellent Rust ecosystem (ndarray, tokenizers, etc.)
- Model weights from HuggingFace transformers library
## Author
BM Monjur Morshed