oxide-rs 0.1.1

AI Inference library and CLI in Rust - llama.cpp style
# Library Usage Guide

This guide covers how to use oxide-rs as a library in your Rust projects.

## Adding Oxide-rs to Your Project

```toml
# Cargo.toml
[dependencies]
oxide-rs = "0.1.0"
```

## Core Concepts

### GenerateOptions

Configuration for text generation:

```rust
use oxide_rs::GenerateOptions;

let options = GenerateOptions {
    max_tokens: 512,        // Maximum tokens to generate
    temperature: 0.3,      // Sampling temperature (0.0 = greedy)
    top_p: None,           // Nucleus sampling threshold
    top_k: None,           // Top-k sampling threshold
    repeat_penalty: 1.1,   // Penalty for repeated tokens
    repeat_last_n: 64,     // Number of most recent tokens checked for repeats
    seed: 299792458,       // Random seed for reproducibility
    system_prompt: None,   // Optional system prompt
};
```
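To make the sampling fields more concrete, here is a minimal, standalone sketch of what `temperature` and `top_p` conventionally do to a model's output logits. It does not use oxide-rs and is not taken from its source; the function names (`softmax_with_temperature`, `argmax`, `top_p_filter`) are illustrative assumptions, and the library's actual sampler may differ in detail.

```rust
/// Convert raw logits to probabilities after temperature scaling.
/// Assumes `temperature > 0.0`; the greedy case is handled by `argmax`.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Greedy decoding: the index of the largest logit.
/// This is what a temperature of 0.0 conventionally means.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Nucleus (top-p) filtering: keep the smallest set of most likely tokens
/// whose cumulative probability reaches `top_p`. Returns token indices,
/// most likely first.
fn top_p_filter(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    let mut kept = Vec::new();
    let mut mass = 0.0;
    for i in order {
        kept.push(i);
        mass += probs[i];
        if mass >= top_p {
            break;
        }
    }
    kept
}
```

Lower temperatures concentrate probability on the most likely token, while a smaller `top_p` shrinks the candidate set; both make output more deterministic.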

### Model

The `Model` struct provides a builder pattern for more complex usage:

```rust
use oxide_rs::Model;

// `options` is the `GenerateOptions` value built in the previous example
let mut model = Model::new("model.gguf")
    .with_options(options)
    .with_tokenizer("tokenizer.json")  // Optional, extracted from GGUF if omitted
    .load()?;
```

## Usage Patterns

### Pattern 1: Simple Function

For quick one-off generations:

```rust
use oxide_rs::{generate, GenerateOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let result = generate(
        "model.gguf",
        GenerateOptions::default(),
        "Write a hello world program",
    )?;
    println!("{}", result);
    Ok(())
}
```

### Pattern 2: Builder API

For multiple generations with the same model:

```rust
use oxide_rs::Model;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::new("model.gguf")
        .with_options(oxide_rs::GenerateOptions {
            max_tokens: 256,
            temperature: 0.7,
            ..Default::default()
        })
        .load()?;

    // Multiple generations with same model loaded
    let response1 = model.generate("What is Rust?")?;
    let response2 = model.generate("What is Cargo?")?;

    println!("Rust: {}", response1);
    println!("Cargo: {}", response2);
    Ok(())
}
```

### Pattern 3: Streaming

For real-time output as tokens are generated:

```rust
use oxide_rs::Model;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::new("model.gguf")
        .load()?;

    model.generate_stream("Tell me a story", |token| {
        print!("{}", token);
        std::io::Write::flush(&mut std::io::stdout()).unwrap();
    })?;

    println!(); // Newline after streaming
    Ok(())
}
```
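A streaming callback often needs to do two things at once: show each token immediately and keep the full text for later. The sketch below shows one way to structure that with only the standard library. `fake_generate_stream` is a hypothetical stand-in for a streaming generator, and the `FnMut(&str)` callback shape is an assumption, not the confirmed signature of `generate_stream`.

```rust
use std::io::Write;

/// Hypothetical stand-in for a streaming generator: it hands each token
/// to `on_token` in order. Not part of oxide-rs.
fn fake_generate_stream<F: FnMut(&str)>(tokens: &[&str], mut on_token: F) {
    for &t in tokens {
        on_token(t);
    }
}

/// Forward every streamed token to `sink` and also collect the full text.
/// The first I/O error is remembered and returned once streaming ends,
/// instead of panicking inside the callback.
fn stream_and_collect<W: Write>(tokens: &[&str], sink: &mut W) -> std::io::Result<String> {
    let mut full = String::new();
    let mut io_err: Option<std::io::Error> = None;

    fake_generate_stream(tokens, |token| {
        full.push_str(token);
        if io_err.is_none() {
            // Flush after each token so output appears immediately.
            if let Err(e) = write!(sink, "{}", token).and_then(|_| sink.flush()) {
                io_err = Some(e);
            }
        }
    });

    match io_err {
        Some(e) => Err(e),
        None => Ok(full),
    }
}
```

Passing `&mut std::io::stdout()` as the sink prints tokens live; passing a `Vec<u8>` makes the same code easy to test.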

### Pattern 4: Warmup

Pre-compile compute kernels so the first token arrives faster:

```rust
use oxide_rs::Model;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::new("model.gguf")
        .load()?;

    // Warmup with 128 tokens (compiles compute kernels)
    model.warmup(128)?;

    // The first generation no longer pays the kernel compilation cost
    let response = model.generate("Hello!")?;
    println!("{}", response);
    Ok(())
}
```

### Pattern 5: Conversation History

Maintain conversation history across generations:

```rust
use oxide_rs::Model;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::new("model.gguf")
        .with_options(oxide_rs::GenerateOptions {
            system_prompt: Some("You are a helpful assistant.".into()),
            ..Default::default()
        })
        .load()?;

    // First turn
    let response1 = model.generate("What is 2+2?")?;
    println!("User: What is 2+2?");
    println!("Assistant: {}", response1);

    // Second turn - history is maintained automatically
    let response2 = model.generate("Multiply that by 3.")?;
    println!("User: Multiply that by 3.");
    println!("Assistant: {}", response2);

    // Clear history if needed
    model.clear_history();
    Ok(())
}
```
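If you want to reason about what "history is maintained" means, the following standalone sketch models a conversation as a list of turns that is rendered into a single prompt before each generation. Everything here (`ChatHistory`, `Role`, the `role: text` rendering, and the choice to keep the system prompt on clear) is a hypothetical illustration; it does not describe how oxide-rs stores history or which chat template it applies.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

/// Hypothetical conversation buffer; not an oxide-rs type.
struct ChatHistory {
    turns: Vec<(Role, String)>,
}

impl ChatHistory {
    /// Start a conversation, optionally seeded with a system prompt.
    fn new(system_prompt: Option<&str>) -> Self {
        let mut turns = Vec::new();
        if let Some(s) = system_prompt {
            turns.push((Role::System, s.to_string()));
        }
        ChatHistory { turns }
    }

    /// Append one turn to the end of the conversation.
    fn push(&mut self, role: Role, text: &str) {
        self.turns.push((role, text.to_string()));
    }

    /// Drop user and assistant turns. Keeping the system prompt is a
    /// design choice made for this sketch only.
    fn clear(&mut self) {
        self.turns.retain(|(role, _)| *role == Role::System);
    }

    /// Render all turns into one prompt string, one `role: text` line each.
    fn render(&self) -> String {
        let mut out = String::new();
        for (role, text) in &self.turns {
            let tag = match role {
                Role::System => "system",
                Role::User => "user",
                Role::Assistant => "assistant",
            };
            out.push_str(tag);
            out.push_str(": ");
            out.push_str(text);
            out.push('\n');
        }
        out
    }
}
```

Because every turn is replayed on each generation, long conversations consume more of the model's context window, which is one reason to clear history when starting a new topic.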

## Accessing Model Metadata

Get information about the loaded model:

```rust
use oxide_rs::Model;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::new("model.gguf").load()?;

    if let Some(metadata) = model.metadata() {
        println!("Model: {}", metadata.name);
        println!("Architecture: {}", metadata.architecture);
        println!("Layers: {}", metadata.n_layer);
        println!("Embedding Size: {}", metadata.n_embd);
        println!("Vocab Size: {}", metadata.vocab_size);
        println!("Context Length: {}", metadata.context_length);
    }
    Ok(())
}
```
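One practical use of the context length is to keep `max_tokens` within what the model can actually hold. The helper below is a small sketch of that arithmetic. The name `clamp_max_tokens` is hypothetical, and it assumes you can obtain the context length and the prompt's token count as `usize` values; the actual field types in oxide-rs may differ.

```rust
/// Largest number of tokens that can still be generated once the prompt
/// has used part of the context window, capped at the requested amount.
/// Returns 0 when the prompt alone fills or exceeds the window.
fn clamp_max_tokens(context_length: usize, prompt_tokens: usize, requested: usize) -> usize {
    // `saturating_sub` avoids underflow when the prompt is too long.
    let remaining = context_length.saturating_sub(prompt_tokens);
    requested.min(remaining)
}
```

For example, with a 2048-token context and a 2000-token prompt, a request for 512 tokens is reduced to 48.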

## Error Handling

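Fallible operations such as `generate`, `load`, and `warmup` return `Result<T, Box<dyn std::error::Error>>`, so you can propagate errors with `?` or handle them explicitly: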

```rust
use oxide_rs::{generate, GenerateOptions};

fn main() {
    match generate("model.gguf", GenerateOptions::default(), "Hello") {
        Ok(result) => println!("Success: {}", result),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```
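Because the error type is a boxed trait object, callers that need to react differently to different failures can downcast to concrete types. The sketch below demonstrates the pattern using only the standard library. `EmptyPrompt`, `check_inputs`, and `describe` are hypothetical names for illustration, and which concrete error types oxide-rs actually returns is not stated in this guide.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical application-level error; not an oxide-rs type.
#[derive(Debug)]
struct EmptyPrompt;

impl fmt::Display for EmptyPrompt {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "prompt must not be empty")
    }
}

impl Error for EmptyPrompt {}

/// Stand-in for a fallible call that returns a boxed error.
fn check_inputs(model_path: &str, prompt: &str) -> Result<(), Box<dyn Error>> {
    if prompt.trim().is_empty() {
        return Err(Box::new(EmptyPrompt));
    }
    // `?` converts the `std::io::Error` into `Box<dyn Error>` automatically.
    std::fs::metadata(model_path)?;
    Ok(())
}

/// Classify a boxed error by downcasting to the concrete types we know about.
fn describe(err: &(dyn Error + 'static)) -> &'static str {
    if err.downcast_ref::<EmptyPrompt>().is_some() {
        "empty prompt"
    } else if let Some(io) = err.downcast_ref::<std::io::Error>() {
        match io.kind() {
            std::io::ErrorKind::NotFound => "model file not found",
            _ => "I/O error",
        }
    } else {
        "other error"
    }
}
```

Given `let e = check_inputs(path, prompt).unwrap_err();`, calling `describe(&*e)` yields a short category you can map to a user-facing message or a retry policy.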

## Best Practices

1. **Reuse `Model` instances** - Loading a model is expensive; load once and call `generate` repeatedly on the same instance
2. **Use warmup** - Call `warmup()` once after loading so the first generation does not pay the kernel compilation cost
3. **Adjust `max_tokens`** - Cap output length to what your use case needs; generation time grows with the number of tokens produced
4. **Stream long outputs** - `generate_stream` shows text as it is produced, which gives better UX than waiting for the full response
5. **Tune `repeat_penalty`** - Higher values (1.2+) reduce repetition but may hurt coherence
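
To see why a higher `repeat_penalty` trades repetition for coherence, here is a standalone sketch of the penalty rule used by llama.cpp-style samplers: logits of recently generated tokens are pushed toward "less likely". That oxide-rs applies exactly this rule is an assumption, and `apply_repeat_penalty` is a hypothetical helper name.

```rust
use std::collections::HashSet;

/// Penalize tokens that appear in the last `last_n` entries of `recent`.
/// Positive logits are divided by `penalty` and non-positive logits are
/// multiplied by it, so both move toward lower probability when
/// `penalty > 1.0`. Each distinct token is penalized once.
fn apply_repeat_penalty(logits: &mut [f32], recent: &[usize], penalty: f32, last_n: usize) {
    let start = recent.len().saturating_sub(last_n);
    let mut seen = HashSet::new();
    for &tok in &recent[start..] {
        // Skip out-of-range ids and tokens that were already penalized.
        if tok < logits.len() && seen.insert(tok) {
            let l = &mut logits[tok];
            *l = if *l > 0.0 { *l / penalty } else { *l * penalty };
        }
    }
}
```

With a penalty of 1.0 the logits are unchanged. As the penalty grows, tokens the model legitimately needs to repeat (names, punctuation, common words) are suppressed too, which is where coherence can start to suffer.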