# RML Core – Rust Language Model

`rml-core` is a simple n-gram language model implemented in Rust. The name is a play on LLM (Large Language Model) and stands for **"Rust Language Model"**.

---

## 🧠 Overview

This project implements a character-level n-gram language model using a basic neural architecture with one hidden layer. It uses a context of 4 characters to predict the next character in a sequence.
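
Concretely, the network maps a 4-character window to a score for every character in the vocabulary through a single hidden layer. The sketch below illustrates the shapes and data flow only; it is not the crate's code, and the activation function shown is an assumption (sizes follow the Technical Details table further down):

```rust
// Conceptual sketch of the forward pass; rml-core hides these details behind
// `NGramModel`, and the tanh activation here is an assumption.
fn forward_sketch(
    context: &[usize; 4],  // the 4 context characters, as vocabulary indices
    w_in: &[Vec<f32>],     // [4 * vocab_size][128] input-to-hidden weights
    w_out: &[Vec<f32>],    // [128][vocab_size] hidden-to-output weights
) -> Vec<f32> {
    let vocab_size = w_out[0].len();
    let hidden_size = w_in[0].len(); // 128 neurons

    // Hidden activations: with one-hot inputs, the matrix product reduces to
    // summing the weight rows selected by each context character.
    let mut hidden = vec![0.0f32; hidden_size];
    for (pos, &ch) in context.iter().enumerate() {
        for (h, w) in hidden.iter_mut().zip(&w_in[pos * vocab_size + ch]) {
            *h += w;
        }
    }
    let hidden: Vec<f32> = hidden.iter().map(|x| x.tanh()).collect();

    // Output scores: one per vocabulary character (sampling happens later).
    let mut scores = vec![0.0f32; vocab_size];
    for (h, row) in hidden.iter().zip(w_out) {
        for (s, w) in scores.iter_mut().zip(row) {
            *s += h * w;
        }
    }
    scores
}
```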

---

## ✨ Features

- Train a language model on any text file  
- Generate text based on a seed string  
- Supports letters, numbers, and basic punctuation  
- Configurable training parameters (e.g. number of epochs)  
- Save and load trained models  

---

## 🚀 Installation

In your `Cargo.toml`:

```toml
[dependencies]
rml-core = "0.1.0"
```

---

## 🧪 Usage

### 📌 Train a model

```bash
cargo run --bin train path/to/input.txt path/to/output/model [epochs]
```

Example:

```bash
cargo run --bin train data/shakespeare.txt model.bin 10
```

---

### 📌 Generate text

```bash
cargo run --bin generate path/to/model "Seed Text" [length]
```

Example:

```bash
cargo run --bin generate model.bin "To be" 200
```

---

### 📚 Use as a library

```rust
use rml_core::{NGramModel, prepare_training_data};

// Training
let text = std::fs::read_to_string("data/input.txt").unwrap();
let training_data = prepare_training_data(&text);
let mut model = NGramModel::new();

for (context, target) in training_data {
    model.train(&context, target);
}

model.save("model.bin").unwrap();

// Generation
let mut model = NGramModel::load("model.bin").unwrap();
// Use model.forward() and sampling logic
```
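
The snippet above stops short of the generation loop. One way to drive it looks roughly like the sketch below, which assumes `forward(&context)` returns a probability vector over the vocabulary; `sample_index` and `index_to_char` are hypothetical helpers, not part of `rml-core`:

```rust
// Sketch only: `forward`'s exact signature is assumed, and
// `sample_index` / `index_to_char` are hypothetical helpers.
let model = NGramModel::load("model.bin").unwrap();

let mut output = String::from("To be");
for _ in 0..200 {
    // Use the last 4 characters as the context window.
    let chars: Vec<char> = output.chars().collect();
    let start = chars.len().saturating_sub(4);
    let context: String = chars[start..].iter().collect();

    let probs = model.forward(&context);   // assumed to return a Vec<f32>
    let next = sample_index(&probs, 0.3);  // temperature sampling (see Technical Details)
    output.push(index_to_char(next));
}
println!("{output}");
```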

---

## โš™๏ธ How It Works

1. **Preprocessing**: The input text is filtered to include only allowed characters (ASCII a–z, A–Z, 0–9, punctuation).  
2. **Training Data**: The preprocessed text is turned into (context, target) pairs, where each context is 4 characters long (see the sketch after this list).  
3. **Training**: The model learns to predict the next character using backpropagation.  
4. **Generation**: Given a seed, the model predicts the next character and slides the context window forward.
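
For illustration, steps 1 and 2 amount to roughly the following (a standalone sketch, not the crate's actual `prepare_training_data`; whitespace handling is omitted):

```rust
/// Standalone illustration of preprocessing and pair construction.
/// Not the crate's actual implementation.
fn build_pairs(text: &str) -> Vec<(String, char)> {
    const CONTEXT_SIZE: usize = 4;

    // Step 1: keep only the allowed characters (letters, digits, punctuation).
    let filtered: Vec<char> = text
        .chars()
        .filter(|c| c.is_ascii_alphanumeric() || c.is_ascii_punctuation())
        .collect();

    // Step 2: slide a 4-character window over the text; the character
    // following each window becomes the prediction target.
    filtered
        .windows(CONTEXT_SIZE + 1)
        .map(|w| (w[..CONTEXT_SIZE].iter().collect(), w[CONTEXT_SIZE]))
        .collect()
}
```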

---

## 🧬 Technical Details

| Component            | Value                      |
|----------------------|----------------------------|
| Context Size         | 4 characters               |
| Hidden Layer         | 128 neurons                |
| Learning Rate        | 0.005                      |
| Sampling Temperature | 0.3 (conservative)         |
| Vocabulary           | a–z, A–Z, 0–9, punctuation |
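
The sampling temperature rescales the model's output distribution before the next character is drawn; a low value like 0.3 sharpens it, so generation stays close to the most likely continuations. A minimal, self-contained sketch of temperature sampling (illustration only, not the crate's internal code; uses the `rand` crate):

```rust
use rand::Rng;

/// Draw an index from `probs` after applying a temperature.
/// T < 1.0 (e.g. 0.3) sharpens the distribution; T = 1.0 leaves it unchanged.
fn sample_with_temperature(probs: &[f32], temperature: f32) -> usize {
    // Rescale each probability to p^(1/T), then sample from the rescaled mass.
    let scaled: Vec<f32> = probs.iter().map(|p| p.powf(1.0 / temperature)).collect();
    let total: f32 = scaled.iter().sum();

    let mut r = rand::thread_rng().gen_range(0.0..total);
    for (i, p) in scaled.iter().enumerate() {
        if r < *p {
            return i;
        }
        r -= p;
    }
    scaled.len() - 1 // fallback for floating-point rounding
}
```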

---

## ๐Ÿ” Example

```bash
# Train on Shakespeare for 10 epochs
cargo run --bin train data/shakespeare.txt shakespeare_model 10

# Generate 200 characters using "To be" as the seed
cargo run --bin generate shakespeare_model "To be" 200
```

---

## ๐Ÿค Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

---

## 📄 License

This project is licensed under the [MIT License](LICENSE).
