# MiniLLM πŸ€–

A lightweight, efficient transformer inference engine written in Rust. MiniLLM provides a clean, well-documented implementation of GPT-2 style transformer models with support for text generation.

## ✨ Features

- **πŸš€ Fast Inference**: Efficient tensor operations using ndarray
- **πŸ”’ Memory Safe**: Written in Rust with zero-copy operations where possible  
- **πŸ“¦ Easy to Use**: High-level API for quick integration
- **🎯 Well Tested**: Comprehensive examples and documentation
- **πŸ”§ Extensible**: Modular architecture for easy customization
- **πŸ€– GPT-2 Compatible**: Load and run GPT-2 models from HuggingFace
- **πŸ“Š SafeTensors Support**: Fast and secure model weight loading

## πŸ—οΈ Architecture

```
src/
β”œβ”€β”€ lib.rs          # Library entry point and public API
β”œβ”€β”€ main.rs         # Minimal CLI example
β”œβ”€β”€ inference.rs    # High-level inference engine
β”œβ”€β”€ gpt.rs          # GPT model implementation
β”œβ”€β”€ transformer.rs  # Transformer block components
β”œβ”€β”€ attention.rs    # Multi-head attention mechanism
β”œβ”€β”€ mlp.rs          # Feed-forward network layers
β”œβ”€β”€ tensor.rs       # Tensor operations and math
β”œβ”€β”€ weights.rs      # Model weight loading (SafeTensors)
└── config.rs       # Model configuration handling

examples/
β”œβ”€β”€ basic_generation.rs  # Simple text generation
β”œβ”€β”€ interactive_chat.rs  # Interactive chat interface
└── tokenization.rs      # Tokenization examples
```

## πŸš€ Quick Start

### Library Usage

```rust
use minillm::inference::InferenceEngine;

fn main() -> minillm::Result<()> {
    // Load a GPT-2 model
    let engine = InferenceEngine::new("openai-community/gpt2")?;
    
    // Generate text
    let prompt = "The future of AI is";
    let generated = engine.generate(prompt, 20)?;
    
    println!("Generated: {}", generated);
    Ok(())
}
```

### Command Line

```bash
# Run the main example
cargo run

# Run specific examples  
cargo run --example basic_generation
cargo run --example interactive_chat
cargo run --example tokenization
```

## πŸ“‹ Requirements

- Rust 1.70+
- HuggingFace token (optional, for private models)

Set your HuggingFace token:
```bash
echo "HF_TOKEN=your_token_here" > .env
```

## πŸ”§ Dependencies

- `ndarray` - Tensor operations
- `safetensors` - Model weight loading
- `tokenizers` - Text tokenization
- `hf-hub` - HuggingFace model downloading
- `serde` - Configuration parsing

## πŸ“– API Documentation

### InferenceEngine

The main high-level interface:

```rust
// Create engine
let engine = InferenceEngine::new("openai-community/gpt2")?;

// Generate text
let result = engine.generate("prompt", max_tokens)?;

// Tokenization
let tokens = engine.tokenize("text")?;
let text = engine.decode(&tokens)?;

// Get model info
let config = engine.config();
```
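Under the hood, `generate` amounts to repeatedly running the model forward and picking the next token. The sketch below shows what such a greedy-decoding loop looks like in plain std-only Rust; `argmax`, `greedy_generate`, and the `forward` closure are illustrative stand-ins, not part of the minillm API.

```rust
// Index of the largest logit (the greedily chosen next token).
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Appends `max_tokens` greedily chosen ids, calling `forward` to get the
/// next-token logits at each step. A real engine would also stop on EOS.
fn greedy_generate(
    mut tokens: Vec<usize>,
    max_tokens: usize,
    forward: impl Fn(&[usize]) -> Vec<f32>,
) -> Vec<usize> {
    for _ in 0..max_tokens {
        let logits = forward(&tokens);
        tokens.push(argmax(&logits));
    }
    tokens
}
```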

### Low-Level Components

For custom implementations, you can use the individual components:

- `GPTModel` - Complete transformer model
- `TransformerBlock` - Individual transformer layers  
- `MultiHeadAttention` - Attention mechanism
- `MLP` - Feed-forward networks
- `Tensor` - Mathematical operations

## 🎯 Examples

### Basic Generation
```bash
cargo run --example basic_generation
```
Demonstrates simple text generation and prints the model configuration.

### Interactive Chat
```bash
cargo run --example interactive_chat
```
Interactive command-line chat interface with the model.

### Tokenization
```bash
cargo run --example tokenization
```
Shows tokenization, encoding/decoding, and round-trip verification.

## πŸ“Š Performance

MiniLLM is designed for inference efficiency:

- **Memory**: ~1GB RAM for GPT-2 (117M parameters)
- **Speed**: ~10-50 tokens/second (CPU, varies by hardware)
- **Accuracy**: Identical outputs to reference implementations
- **Models**: Currently supports GPT-2 architecture

## πŸ› οΈ Development

```bash
# Clone and build
git clone https://github.com/bmqube/minillm
cd minillm
cargo build --release

# Run tests
cargo test

# Check examples
cargo check --examples

# Generate documentation
cargo doc --open
```

## πŸ“š Architecture Details

### Transformer Implementation
- **Multi-head attention** with causal masking
- **Feed-forward networks** with GELU activation
- **Layer normalization** and residual connections
- **Position and token embeddings**
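The core of each block is scaled dot-product attention under a causal mask. A minimal std-only sketch of a single head, using `Vec<Vec<f32>>` in place of the crate's `Tensor` type (function names here are illustrative; the actual implementation lives in `attention.rs`):

```rust
// Numerically stable softmax over one row of attention scores.
fn softmax(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// q, k, v: [seq_len][head_dim]; returns [seq_len][head_dim].
/// Position i only attends to positions 0..=i (the causal mask).
fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (seq, dim) = (q.len(), q[0].len());
    let scale = 1.0 / (dim as f32).sqrt();
    let mut out = vec![vec![0.0; dim]; seq];
    for i in 0..seq {
        // Scores against past and current positions only; future tokens
        // are simply never scored, which is equivalent to a -inf mask.
        let mut scores: Vec<f32> = (0..=i)
            .map(|j| q[i].iter().zip(&k[j]).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        softmax(&mut scores);
        for (j, w) in scores.iter().enumerate() {
            for d in 0..dim {
                out[i][d] += w * v[j][d];
            }
        }
    }
    out
}
```

Note that the first position's output is exactly `v[0]`, since it can only attend to itself.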

### Tensor Operations
- Dynamic 1D-4D tensor support
- Optimized matrix multiplication
- Element-wise operations (add, softmax, layer_norm)
- Memory-efficient implementations
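To make the element-wise ops concrete, here is a std-only sketch of layer normalization and the tanh-approximated GELU that GPT-2 uses. These are reference formulas, not the crate's `Tensor` methods; the real versions operate on `ndarray` arrays in `tensor.rs`.

```rust
/// Layer normalization over one vector: normalize to zero mean and unit
/// variance, then apply the learned scale (gamma) and shift (beta).
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

/// GPT-2's tanh approximation of GELU:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x.powi(3))).tanh())
}
```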

### Model Loading
- SafeTensors format support
- Automatic model downloading from HuggingFace
- Configuration parsing and validation
- Error handling with detailed messages

## βœ… Current Status

- βœ… **Core Architecture**: Complete GPT-2 implementation
- βœ… **Inference Engine**: High-level API ready
- βœ… **Examples**: Comprehensive usage examples
- βœ… **Documentation**: Well-documented codebase
- βœ… **Testing**: All components tested and working

## πŸ—ΊοΈ Roadmap

- [ ] **Performance**: GPU acceleration support
- [ ] **Models**: Support for larger GPT variants
- [ ] **Features**: Beam search and sampling options
- [ ] **Optimization**: Quantization and pruning
- [ ] **Integration**: Python bindings

## πŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## πŸ™ Acknowledgments

- Inspired by Andrej Karpathy's educational implementations
- Built on the excellent Rust ecosystem (ndarray, tokenizers, etc.)
- Model weights from HuggingFace transformers library

## πŸ‘¨β€πŸ’» Author

**BM Monjur Morshed**  
- GitHub: [@bmqube](https://github.com/bmqube)
- Project: [minillm](https://github.com/bmqube/minillm)