helios-engine 0.5.5

# Running Qwen2.5-0.5B-Instruct Locally with Candle

This guide shows you how to set up and run the Qwen2.5-0.5B-Instruct model locally using the Candle backend with automatic cache loading.

## Prerequisites

1. **Rust installed** (version 1.70 or higher)
   ```bash
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   ```

2. **Model cached locally** - The model is loaded automatically from your HuggingFace cache; Step 1 below downloads it

## Step 1: Download Model to Local Cache

First, download the model to your local HuggingFace cache. You only need to do this once:

```bash
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
```

Or using Python:
```python
from huggingface_hub import snapshot_download
# With no cache_dir argument, files go to the default hub cache (~/.cache/huggingface/hub)
snapshot_download("Qwen/Qwen2.5-0.5B-Instruct")
```

**Verify the download:**
```bash
ls -la ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/*/
```

You should see files like:
- `model.safetensors` (main model file)
- `tokenizer.json` (tokenizer)
- `config.json` (model config)

## Step 2: Create Configuration File

Create a `config.toml` file in your project root (or copy from `config.example.toml`):

```toml
[llm]
model_name = "gpt-3.5-turbo"
base_url = "https://api.openai.com/v1"
api_key = "your-api-key-here"
temperature = 0.7
max_tokens = 2048

[candle]
# Qwen2.5-0.5B-Instruct configuration
huggingface_repo = "Qwen/Qwen2.5-0.5B-Instruct"
model_file = "model.safetensors"
context_size = 32768
temperature = 0.7
max_tokens = 2048
use_gpu = true
```

### Configuration Options

- **`huggingface_repo`**: The HuggingFace model repository (e.g., `Qwen/Qwen2.5-0.5B-Instruct`)
- **`model_file`**: The model file name in the repository (typically `model.safetensors`)
- **`context_size`**: Maximum context length (Qwen2.5-0.5B-Instruct supports up to 32768)
- **`temperature`**: Controls randomness (0.0-1.0, lower = more deterministic)
- **`max_tokens`**: Maximum tokens to generate per request
- **`use_gpu`**: Whether to use GPU if available (set to `false` for CPU-only)
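
To make the constraints in this list concrete, here is a hedged sketch of a sanity check over the `[candle]` values. `CandleSettings` and `validate` are hypothetical names for illustration only; helios-engine's own config type and any validation it performs may differ. The rule that `max_tokens` should not exceed `context_size` is an assumption, not something the crate is known to enforce.

```rust
/// Hypothetical mirror of the `[candle]` table; not helios-engine's own type.
#[derive(Debug, Clone, PartialEq)]
struct CandleSettings {
    huggingface_repo: String,
    model_file: String,
    context_size: usize,
    temperature: f32,
    max_tokens: usize,
    use_gpu: bool,
}

impl CandleSettings {
    /// Collect every problem found instead of stopping at the first one.
    /// `model_max_context` is the model's limit (32768 for Qwen2.5-0.5B-Instruct).
    fn validate(&self, model_max_context: usize) -> Vec<String> {
        let mut problems = Vec::new();

        // A repo id has the form "owner/name".
        let parts: Vec<&str> = self.huggingface_repo.split('/').collect();
        if parts.len() != 2 || parts.iter().any(|part| part.is_empty()) {
            problems.push(format!("huggingface_repo '{}' is not 'owner/name'", self.huggingface_repo));
        }
        if self.model_file.is_empty() {
            problems.push("model_file is empty".to_string());
        }
        if self.context_size == 0 || self.context_size > model_max_context {
            problems.push(format!("context_size must be in 1..={model_max_context}"));
        }
        // Range documented in this guide.
        if !(0.0..=1.0).contains(&self.temperature) {
            problems.push("temperature must be in 0.0..=1.0".to_string());
        }
        // Assumption: generated tokens have to fit inside the context window.
        if self.max_tokens == 0 || self.max_tokens > self.context_size {
            problems.push("max_tokens must be in 1..=context_size".to_string());
        }
        problems
    }
}

fn main() {
    // Values from the example config.toml above.
    let settings = CandleSettings {
        huggingface_repo: "Qwen/Qwen2.5-0.5B-Instruct".to_string(),
        model_file: "model.safetensors".to_string(),
        context_size: 32768,
        temperature: 0.7,
        max_tokens: 2048,
        use_gpu: true,
    };
    let problems = settings.validate(32768);
    if problems.is_empty() {
        println!("config looks consistent (use_gpu = {})", settings.use_gpu);
    } else {
        for problem in problems {
            println!("config problem: {problem}");
        }
    }
}
```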

## Step 3: Build the Project

Build with the `candle` feature enabled:

```bash
cargo build --features candle --release
```

This will compile the project with Candle ML framework support.

## Step 4: Run the Application

### Option A: Using the binary directly

```bash
./target/release/helios-engine
```

### Option B: Using cargo run

```bash
cargo run --features candle --release
```

### Option C: Running specific examples

```bash
# Basic chat example
cargo run --example basic_chat --features candle --release

# Direct LLM usage
cargo run --example direct_llm_usage --features candle --release

# Agent with tools
cargo run --example agent_with_tools --features candle --release
```

## Step 5: Use the Model in Your Code

### Simple Chat Example

```rust
use helios_engine::{Config, Agent};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load config from file
    let config = Config::from_file("config.toml")?;
    
    // Create an agent with Candle backend
    let mut agent = Agent::new(config).await?;
    
    // Send a message
    let response = agent.chat("Hello, what is Rust?").await?;
    println!("Response: {}", response);
    
    Ok(())
}
```

## Model Information

**Qwen2.5-0.5B-Instruct**
- **Parameters**: 0.5 billion
- **Context Window**: 32,768 tokens
- **Format**: Chat-optimized (Instruct)
- **Size**: ~1 GB for the published BF16 `model.safetensors` (smaller when quantized)
- **Speed**: Very fast on CPU and GPU
- **Language**: Multilingual (over 29 languages, including English and Chinese)

This is a lightweight model ideal for:
- Local development and testing
- Edge devices and resource-constrained environments
- Fast inference with reasonable quality

## Custom Cache Location

If you want to use a different cache location:

```bash
# Set custom HuggingFace cache directory
export HF_HOME=/path/to/custom/cache

# Then run your application
cargo run --features candle --release
```
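
For reference, the sketch below shows how a hub cache directory is commonly resolved from the environment. It follows the convention used by the Python `huggingface_hub` library (`HF_HUB_CACHE` first, then `$HF_HOME/hub`, then `~/.cache/huggingface/hub`); whether helios-engine applies exactly the same precedence is an assumption, and `resolve_hub_cache` is an illustrative name rather than a crate API.

```rust
use std::path::PathBuf;

/// Resolve the hub cache directory from already-read environment values.
/// Precedence (assumed, following `huggingface_hub`):
///   1. HF_HUB_CACHE, used as-is
///   2. HF_HOME with "hub" appended
///   3. HOME with ".cache/huggingface/hub" appended
/// Empty strings are treated as unset.
fn resolve_hub_cache(
    hf_hub_cache: Option<&str>,
    hf_home: Option<&str>,
    home: Option<&str>,
) -> Option<PathBuf> {
    if let Some(dir) = hf_hub_cache.filter(|s| !s.is_empty()) {
        return Some(PathBuf::from(dir));
    }
    if let Some(dir) = hf_home.filter(|s| !s.is_empty()) {
        return Some(PathBuf::from(dir).join("hub"));
    }
    home.filter(|s| !s.is_empty())
        .map(|dir| PathBuf::from(dir).join(".cache").join("huggingface").join("hub"))
}

fn main() {
    let hub = std::env::var("HF_HUB_CACHE").ok();
    let hf_home = std::env::var("HF_HOME").ok();
    let home = std::env::var("HOME").ok();

    match resolve_hub_cache(hub.as_deref(), hf_home.as_deref(), home.as_deref()) {
        Some(path) => println!("hub cache: {}", path.display()),
        None => println!("could not determine a cache directory"),
    }
}
```

Note that with `HF_HOME=/path/to/custom/cache`, models are expected under `/path/to/custom/cache/hub/`, so the model has to be downloaded with the same variable set.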

## Troubleshooting

### "Model not found in cache"

Make sure you've downloaded the model:
```bash
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
```

Verify the cache structure:
```bash
ls ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/*/model.safetensors
```

### "Failed to initialize tokenizer"

The tokenizer file must be in the same snapshot directory as the model. Verify:
```bash
ls ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/*/tokenizer.json
```

### Out of Memory

If you get OOM errors:
- Reduce `context_size` in config.toml
- Reduce `max_tokens`
- Set `use_gpu = false` to run on the CPU (slower, but the model then uses system RAM instead of GPU memory)
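
To see why a smaller `context_size` helps, the following back-of-the-envelope sketch estimates weight and KV-cache memory. The architecture numbers (24 layers, 2 key/value heads, head dimension 64, roughly 0.49 billion parameters) are taken from the Qwen2.5-0.5B model card and are assumptions here; the estimate ignores activations and framework overhead, and it assumes a 2-byte element type. How helios-engine actually allocates its cache is not confirmed by this guide.

```rust
/// Bytes needed to hold the model weights at a given precision.
fn weight_bytes(param_count: u64, bytes_per_param: u64) -> u64 {
    param_count * bytes_per_param
}

/// Bytes needed for the KV cache: one key and one value vector
/// per layer, per KV head, per token.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64, tokens: u64) -> u64 {
    2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
}

/// Convert bytes to mebibytes for display.
fn mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    // Assumed values for Qwen2.5-0.5B-Instruct (see lead-in).
    let (layers, kv_heads, head_dim) = (24, 2, 64);
    let bytes_per_elem = 2; // BF16 / F16

    println!(
        "weights: ~{:.0} MiB",
        mib(weight_bytes(490_000_000, bytes_per_elem))
    );
    for context_size in [32_768u64, 8_192, 2_048] {
        let cache = kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_elem, context_size);
        println!("context_size {context_size}: KV cache ~{:.0} MiB", mib(cache));
    }
}
```

Under these assumptions a full 32,768-token cache takes about 384 MiB, while 8,192 tokens take about 96 MiB and 2,048 tokens about 24 MiB, so the cache scales linearly with `context_size`.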

### Slow Performance

If inference is slow:
- Ensure `use_gpu = true` if you have a CUDA-capable GPU
- Install CUDA libraries if using GPU
- Consider reducing `context_size` and `max_tokens`
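
Before changing settings, it helps to measure. The sketch below is a generic timing helper built on the standard library; the closure passed to `timed` is a stand-in for a real generation call, and `timed` and `tokens_per_second` are illustrative names, not helios-engine functions.

```rust
use std::time::{Duration, Instant};

/// Run `work` once and return its result together with the elapsed wall time.
fn timed<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let output = work();
    (output, start.elapsed())
}

/// Throughput in tokens per second; returns 0.0 for a zero duration.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let seconds = elapsed.as_secs_f64();
    if seconds == 0.0 {
        0.0
    } else {
        tokens as f64 / seconds
    }
}

fn main() {
    // Stand-in workload: replace the closure body with a real generation
    // call that returns the number of tokens it produced.
    let (generated_tokens, elapsed) = timed(|| {
        std::thread::sleep(Duration::from_millis(50));
        128usize
    });
    println!(
        "{generated_tokens} tokens in {:.3}s ({:.1} tokens/s)",
        elapsed.as_secs_f64(),
        tokens_per_second(generated_tokens, elapsed)
    );
}
```

Timing the first request separately from later ones distinguishes model-loading cost from steady-state inference speed.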

## Alternative Models

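You can also use other models with Candle. Change `huggingface_repo` and make sure the model file is in your cache. Larger repositories often ship their weights as several `model-0000X-of-0000N.safetensors` shards rather than a single `model.safetensors`, so check the file names in the snapshot directory before setting `model_file`: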

```toml
[candle]
# Enable exactly one repository; TOML rejects a key that appears twice in a table.

# Qwen2-7B (larger, better quality)
huggingface_repo = "Qwen/Qwen2-7B-Instruct"
model_file = "model.safetensors"

# Or Llama
# huggingface_repo = "meta-llama/Llama-2-7b-chat"
# model_file = "model.safetensors"

# Or Gemma
# huggingface_repo = "google/gemma-7b-it"
# model_file = "model.safetensors"
```

## Performance Tips

1. **First run is slow**: Loading the model takes time on first use
2. **Use GPU**: Enable `use_gpu = true` for a speedup of roughly 10-20x, depending on hardware
3. **Batch requests**: Process multiple requests together
4. **Cache model**: Keep the model in cache to avoid re-downloading

## Next Steps

- Check out the `examples/` directory for more usage patterns
- Read the [Candle documentation](https://github.com/huggingface/candle)
- Explore other examples like `agent_with_tools.rs`, `forest_of_agents.rs`, etc.