panini-lang-engine 0.3.0

<div align="center">
  <h1>Pāṇini</h1>
  <p><b>An LLM-powered linguistic feature extraction framework</b></p>
  <p>
    <a href="https://crates.io/crates/panini-lang"><img src="https://img.shields.io/crates/v/panini-lang.svg" alt="Crates.io" /></a>
    <a href="https://pypi.org/project/panini-lang/"><img src="https://img.shields.io/pypi/v/panini-lang.svg" alt="PyPI" /></a>
    <img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License" />
  </p>
  <p>
    <b><a href="https://yro7.github.io/panini/">Official Documentation (MkDocs)</a></b>
  </p>
  <p>
    Usage: <a href="docs/guides/python.md">Python</a> | <a href="docs/guides/rust.md">Rust</a> | <a href="docs/guides/cli.md">CLI</a>
  </p>
</div>


<br>

Pāṇini is a linguistic feature extraction framework: describe your language's morphology as Rust types, write extraction directives, and the pipeline handles the rest (prompt assembly, JSON schema generation, LLM orchestration, response parsing, and validation). No universal schema is imposed; you define exactly the features your language needs.

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Extraction Capabilities](#extraction-capabilities)
  - [Available components](#available-components)
  - [Output examples](#output-examples)
    - [`morphology`](#morphology)
    - [`pedagogical_explanation`](#pedagogical_explanation)
    - [`morpheme_segmentation`](#morpheme_segmentation)
    - [`multiword_expressions`](#multiword_expressions)
    - [`leipzig_alignment`](#leipzig_alignment)
- [Usage](#usage)
  - [As a library (Rust API)](#as-a-library-rust-api)
  - [As a standalone CLI](#as-a-standalone-cli)
  - [Python](#python)
- [What you define, what the framework does](#what-you-define-what-the-framework-does)
- [Design principles](#design-principles)
- [Workspace structure](#workspace-structure)
- [Adding a language](#adding-a-language)
  - [Automatically (LLM-assisted)](#automatically-llm-assisted)
  - [Manually (step by step)](#manually-step-by-step)
- [Adding an analysis component](#adding-an-analysis-component)
- [Building](#building)
- [License](#license)

## Extraction Capabilities

Extraction is built around **composable components** (`AnalysisComponent`). Each component provides a different axis of analysis or extracts specific features. You choose which components to run per request — pick only what you need.

By default, all compatible components are run. You can restrict the selection with `--components`:

```bash
# Run only morphology and Leipzig glossing
panini extract --components morphology,leipzig_alignment \
  --text "Gila abur-u-n ferma güǧüna amuq'-da-č." \
  --target "amuq'-da-č"

# Run everything (default)
panini extract --text "Dał kotowi mleko." --target kotowi
```

From the Rust API, pass an optional list of component keys to `extract_erased_with_components()`:

```rust
let result = registry::extract_erased_with_components(
    "pol", &model, &request,
    Some(&["morphology", "pedagogical_explanation"]),
    0.2, 4096, None, &prompts,
).await?;
```

### Available components

| Key                       | Component                | Description                                                                                     | Compatibility                 |
| ------------------------- | ------------------------ | ----------------------------------------------------------------------------------------------- | ----------------------------- |
| `morphology`              | `MorphologyAnalysis`     | POS tagging, lemmatization, case/tense/aspect/gender — language-specific morphological features | All languages                 |
| `pedagogical_explanation` | `PedagogicalExplanation` | Structured HTML explanation for learners (translations, analysis, grammar recap)                | All languages                 |
| `morpheme_segmentation`   | `MorphemeSegmentation`   | Morpheme-by-morpheme segmentation with grammatical function labels                              | Agglutinative languages only* |
| `multiword_expressions`   | `MultiwordExpressions`   | Extracts idioms, collocations, and phrasal expressions                                          | All languages                 |
| `leipzig_alignment`       | `LeipzigAlignment`       | Leipzig-style interlinear morpheme-by-morpheme gloss (Leipzig Glossing Rules)                   | All languages                 |

*Agglutinative languages are those with an `Agglutinative` trait implementation in the framework. You can provide this implementation for any language, including weakly agglutinative ones such as French.
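
The selection rule described above (all compatible components by default, optionally narrowed by a list of keys) can be sketched in plain Rust. This is a simplified illustration, not the crate's implementation: `Lang`, `Component`, and `select_components` are hypothetical names, and silently dropping a requested component that is incompatible with the language is an assumption about the behavior.

```rust
/// Hypothetical stand-in for a language definition.
struct Lang {
    name: &'static str,
    /// True when the language has an `Agglutinative` implementation.
    agglutinative: bool,
}

/// Hypothetical, simplified component descriptor
/// (not the real `AnalysisComponent` trait).
struct Component {
    key: &'static str,
    /// Mirrors the "Compatibility" column of the table above.
    requires_agglutinative: bool,
}

const COMPONENTS: [Component; 5] = [
    Component { key: "morphology", requires_agglutinative: false },
    Component { key: "pedagogical_explanation", requires_agglutinative: false },
    Component { key: "morpheme_segmentation", requires_agglutinative: true },
    Component { key: "multiword_expressions", requires_agglutinative: false },
    Component { key: "leipzig_alignment", requires_agglutinative: false },
];

/// Returns the keys that would run for `lang`.
/// `None` means "everything compatible"; `Some(keys)` narrows the selection.
/// Unknown keys are reported as an error.
fn select_components(
    lang: &Lang,
    requested: Option<&[&str]>,
) -> Result<Vec<&'static str>, String> {
    if let Some(keys) = requested {
        for k in keys {
            if !COMPONENTS.iter().any(|c| c.key == *k) {
                return Err(format!("unknown component for {}: {}", lang.name, k));
            }
        }
    }
    Ok(COMPONENTS
        .iter()
        // Compatibility filter: skip agglutinative-only components when needed.
        .filter(|c| !c.requires_agglutinative || lang.agglutinative)
        // Selection filter: keep everything when no list was given.
        .filter(|c| requested.map_or(true, |keys| keys.contains(&c.key)))
        .map(|c| c.key)
        .collect())
}
```

Under these assumptions, a non-agglutinative language with no explicit selection gets four components, and an agglutinative one gets all five.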

### Output examples

#### `morphology`

Polish — `"Studentka czyta interesującą książkę w bibliotece."`

```json
{
  "morphology": {
    "target_features": [
      { "word": "studentka", "morphology": { "pos": "noun", "lemma": "studentka", "gender": "feminine", "case": "nominative" } },
      { "word": "czyta",     "morphology": { "pos": "verb", "lemma": "czytać", "tense": "present", "aspect": "imperfective" } }
    ],
    "context_features": [
      { "word": "interesującą", "morphology": { "pos": "adjective", "lemma": "interesujący", "gender": "feminine", "case": "accusative" } },
      { "word": "w",            "morphology": { "pos": "adposition", "lemma": "w", "governed_case": "locative" } },
      { "word": "bibliotece",   "morphology": { "pos": "noun", "lemma": "biblioteka", "gender": "feminine", "case": "locative" } }
    ]
  }
}
```

#### `pedagogical_explanation`

```json
{
  "pedagogical_explanation": "<p><b>Translations:</b><br><i>Lit:</i> Student-female reads interesting book in library.<br><i>Nat:</i> The (female) student reads an interesting book in the library.</p><p><b>Analysis:</b></p><ul><li><span style='color:#3498db'><b>studentka</b></span> — nominative (subject)...</li></ul><div style='background-color:#3a3a3a;color:#e0e0e0;padding:10px;border-radius:5px;margin-top:10px;border-left:4px solid #3498db'><b>Grammar Recap:</b><br>Accusative case marks the direct object...</div>"
}
```

#### `morpheme_segmentation`

Turkish — `"Öğrenciler kütüphanede kitap okuyorlar."`

```json
{
  "morpheme_segmentation": [
    {
      "word": "öğrenciler",
      "morphemes": [
        { "surface": "ler", "base_form": "lAr", "function": { "category": "number", "value": "plural" } }
      ]
    },
    {
      "word": "okuyorlar",
      "morphemes": [
        { "surface": "yor", "base_form": "(I)yor", "function": { "category": "tense", "value": "present" } },
        { "surface": "lar", "base_form": "lAr", "function": { "category": "agreement", "person": "third", "number": "plural" } }
      ]
    }
  ]
}
```

#### `multiword_expressions`

Polish — `"Dał nogę przed policją."`

```json
{
  "multiword_expressions": [
    {
      "expression": "dać nogę",
      "translation": "to run away / to bolt",
      "type": "idiom"
    }
  ]
}
```

#### `leipzig_alignment`

Lezgian — `"Gila abur-u-n ferma hamišaluǧ güǧüna amuq'-da-č."`

```json
{
  "leipzig_alignment": {
    "original_script": "Gila abur-u-n ferma hamišaluǧ güǧüna amuq'-da-č.",
    "words": [
      { "source": "Gila",           "gloss": "now" },
      { "source": "abur-u-n",       "gloss": "they-OBL-GEN" },
      { "source": "ferma",          "gloss": "farm" },
      { "source": "hamišaluǧ",      "gloss": "forever" },
      { "source": "güǧüna",         "gloss": "behind" },
      { "source": "amuq'-da-č",     "gloss": "stay-FUT-NEG" }
    ],
    "free_translation": "Now their farm will not stay behind forever."
  }
}
```
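
One possible way to display a `leipzig_alignment` result is to pad each source word and its gloss to a shared column width, so the two lines align word by word. The sketch below assumes the `words` array has already been deserialized into a hypothetical `GlossedWord` struct; `pad_to` and `render_interlinear` are illustrative helpers, not part of the framework.

```rust
/// One aligned word; field names follow the JSON above.
/// The struct itself is hypothetical.
struct GlossedWord<'a> {
    source: &'a str,
    gloss: &'a str,
}

/// Right-pads `text` with spaces up to `width` characters.
/// Width is counted in `char`s, which only approximates display width
/// (combining marks or wide glyphs would need a dedicated crate).
fn pad_to(text: &str, width: usize) -> String {
    let len = text.chars().count();
    let mut out = String::from(text);
    out.extend(std::iter::repeat(' ').take(width.saturating_sub(len)));
    out
}

/// Renders the source line and the gloss line with aligned columns,
/// separated by two spaces.
fn render_interlinear(words: &[GlossedWord]) -> String {
    let mut top = String::new();
    let mut bottom = String::new();
    for w in words {
        // Each column is as wide as the longer of the two cells, plus a gap.
        let width = w.source.chars().count().max(w.gloss.chars().count()) + 2;
        top.push_str(&pad_to(w.source, width));
        bottom.push_str(&pad_to(w.gloss, width));
    }
    format!("{}\n{}", top.trim_end(), bottom.trim_end())
}
```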

## Usage

### As a library (Rust API)

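Add `panini-engine`, `panini-langs`, and `rig-core` to your `Cargo.toml` (the example below also uses `tokio` and `anyhow`):

```toml
[dependencies]
panini-engine = { path = "…" }
panini-langs  = { path = "…" }
rig-core      = "0.33"
tokio         = { version = "1", features = ["macros", "rt-multi-thread"] }
anyhow        = "1"
```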

Then call `extract_features_via_llm` with any `rig::completion::CompletionModel`:

```rust
use panini_engine::{extract_features_via_llm, ExtractionOptions, ExtractionRequest};
use panini_engine::prompts::ExtractorPrompts;
use panini_langs::polish::Polish;
use rig::providers::openai;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = openai::Client::new(&std::env::var("OPENAI_API_KEY")?)?;
    let model  = client.completion_model("gpt-4o");
    let prompts = ExtractorPrompts::load("prompts/default.yml")?;

    let request = ExtractionRequest::builder()
        .content("Dał kotowi mleko.")
        .targets(vec!["kotowi".to_string()])
        .build();

    let options = ExtractionOptions {
        temperature: 0.2,
        max_tokens: 4096,
        previous_attempt: None,
        extractor_prompts: &prompts,
    };

    let result = extract_features_via_llm(&Polish, &model, &request, options).await?;

    println!("{:#?}", result.target_features);
    Ok(())
}
```

### As a standalone CLI

**1. Install**

```bash
# from the workspace root
cargo install --path panini-cli

# or build locally
cargo build -p panini-cli --release
```

**2. Create a config file** (copy from `panini.example.toml`):

```toml
# panini.toml
provider     = "google"           # openai | anthropic | google
model        = "gemini-2.0-flash"
language     = "pol"              # pol | tur | ara
api_key      = "$GEMINI_API_KEY"
prompts_file = "panini-cli/prompts/default.yml"
```

**3. Run**

```bash
export GEMINI_API_KEY="<your-api-key>"

panini extract \
  --config panini.toml \
  --text "Studentka czyta interesującą książkę." \
  --target studentka --target czyta --target książkę

# Select specific components
panini extract --config panini.toml \
  --text "Dał kotowi mleko." --target kotowi \
  --components morphology,leipzig_alignment

# List supported languages
panini languages

# Pipe output to jq
panini extract --config panini.toml --text "…" --target "…" \
  | jq '.morphology.target_features'
```

**CLI options**

| Flag            | Default                  | Description                                       |
| --------------- | ------------------------ | ------------------------------------------------- |
| `--config`      | `panini.toml`            | Path to TOML config                               |
| `--text`        | *(required)*             | Sentence / card content to analyse                |
| `--target`      | *(required, repeatable)* | Target word(s) to focus extraction on             |
| `--components`  | *(all)*                  | Comma-separated list of components to run         |
| `--temperature` | `0.2`                    | Sampling temperature                              |
| `--max-tokens`  | `4096`                   | Max tokens for LLM response                       |
| `--ui-language` | `English`                | Learner's UI language for pedagogical explanation |

### Python

```bash
pip install panini-lang
```

```python
from panini import extract

result = extract(
    provider="openai",        # "openai" | "anthropic" | "google"
    model="gpt-4o",
    api_key="sk-...",
    language="pol",           # ISO 639-3 code
    text="Dał kotowi mleko.",
    targets=["kotowi"],
)
```

→ See [pynini/README.md](pynini/README.md) for the full Python API reference.

---

## What you define, what the framework does

**You define:**
- A **morphology enum** — the features you want extracted (POS, case, tense, aspect, gender… whatever your language needs)
- **Extraction directives** — natural-language instructions that guide the LLM on how to analyze your language
- **Optional morpheme segmentation** — for agglutinative languages, a morpheme inventory with validation rules
- **Optional post-processing** — hooks to validate or enrich the LLM's output after parsing

**The framework handles:**
- **Prompt assembly** — combines your directives, the generated schema, learner context, and pedagogical focus into a structured prompt
- **JSON schema generation** — automatically derived from your Rust types, so the LLM is constrained to return exactly what you defined
- **LLM orchestration** — provider-agnostic; bring your own client (OpenAI, Anthropic, Google, local)
- **Response parsing & validation** — deserializes the LLM output into your typed structs, rejects malformed responses, supports retry with self-correction
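
The "retry with self-correction" step can be pictured as a loop that feeds the failed output and the parser's error back into the next call. The sketch below is only an illustration under stated assumptions: `PreviousAttempt` and `extract_with_retry` are hypothetical names, the LLM is replaced by a closure, and the real engine's types and retry policy may differ.

```rust
/// Hypothetical record of a failed attempt, passed back to the model on retry.
#[derive(Debug, Clone, PartialEq)]
struct PreviousAttempt {
    raw_output: String,
    error: String,
}

/// Calls `call_llm` up to `max_attempts` times.
/// After each parse failure, the raw output and the parse error are handed
/// to the next call so the model can correct itself.
/// Returns the last failed attempt if no call produced parseable output.
fn extract_with_retry<T>(
    max_attempts: usize,
    mut call_llm: impl FnMut(Option<&PreviousAttempt>) -> String,
    parse: impl Fn(&str) -> Result<T, String>,
) -> Result<T, PreviousAttempt> {
    let mut previous: Option<PreviousAttempt> = None;
    for _ in 0..max_attempts {
        let raw = call_llm(previous.as_ref());
        match parse(&raw) {
            Ok(value) => return Ok(value),
            Err(error) => {
                previous = Some(PreviousAttempt { raw_output: raw, error });
            }
        }
    }
    // Only reachable with `None` when `max_attempts` is zero.
    Err(previous.unwrap_or(PreviousAttempt {
        raw_output: String::new(),
        error: "no attempts were made".to_string(),
    }))
}
```

Keeping the raw output next to the error is what makes the second prompt more informative than a blind retry: the model can see both what it produced and why it was rejected.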

## Design principles

- **No universal schema.** Each language defines its own morphology enum with exactly the features it needs. Polish has 7 cases and verbal aspect. Arabic has triliteral roots and wazn patterns. There is no lowest-common-denominator `Morphology` struct.
- **Type safety over convention.** Morphology variants are strongly typed Rust enums, validated at compile time via `#[derive(MorphologyInfo)]`. Every variant must carry a `lemma`. The LLM's JSON output is parsed into these types and rejected if it doesn't conform.
- **LLM as untrusted source.** Responses are validated against a JSON schema, deserialized into typed structs, then post-processed. On parse failure, the raw output and error are returned for retry with self-correction.
- **Provider-agnostic via rig.** The engine accepts any `rig::completion::CompletionModel` — OpenAI, Anthropic, Google Gemini, Mistral, Ollama, or any custom provider.
- **Opt-in complexity.** A simple language (Polish) needs a morphology enum and a few directives. An agglutinative language (Turkish) can opt into morpheme inventories, segmentation, and validation. You only implement what you need.

## Workspace structure

```
panini/              # Facade crate, re-exports everything
panini-core/         # Traits, domain types, morphology enums, components
  src/components/    # AnalysisComponent implementations
panini-engine/       # LLM extraction pipeline, prompt assembly, schema composer
panini-langs/        # Per-language implementations (Polish, Arabic, Turkish)
panini-macro/        # #[derive(MorphologyInfo)], #[derive(PaniniResult)] proc macros
```

## Adding a language

### Automatically (LLM-assisted)

Fill in your `panini.toml` config, choose a language (with its ISO 639-3 code), and run:

`cargo run -p panini-cli -- add-language --language "French" --iso-code fra --config panini.toml`

Always review the generated file, especially the linguistic definitions, to make sure the LLM described the language correctly.

### Manually (step by step)

1. Create `panini-langs/src/<language>.rs`
2. Define a `Morphology` enum with `#[derive(MorphologyInfo)]` and `#[serde(tag = "pos")]`; every variant must have `lemma: String` as its first field
3. Implement `LinguisticDefinition` on a unit struct
4. For agglutinative languages, also implement `Agglutinative` with a morpheme inventory

See `panini-langs/src/polish.rs` or `panini-langs/src/turkish.rs` as references.
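
Step 2 can be pictured with a toy enum. This sketch deliberately leaves out `#[derive(MorphologyInfo)]` and `#[serde(tag = "pos")]` so that it compiles with the standard library alone. The `Case` and `Tense` enums, the variant set, and the hand-written `lemma()` and `pos()` accessors are illustrative assumptions, not what the derive macro actually generates.

```rust
/// Hypothetical feature enums for a toy language.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Case {
    Nominative,
    Accusative,
    Dative,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Tense {
    Past,
    Present,
}

/// Toy morphology enum: every variant carries `lemma: String` first,
/// followed by only the features that part of speech needs.
#[derive(Debug, Clone, PartialEq)]
enum Morphology {
    Noun { lemma: String, case: Case },
    Verb { lemma: String, tense: Tense },
    Adposition { lemma: String },
}

impl Morphology {
    /// Total accessor: possible because no variant lacks a lemma.
    fn lemma(&self) -> &str {
        match self {
            Morphology::Noun { lemma, .. }
            | Morphology::Verb { lemma, .. }
            | Morphology::Adposition { lemma } => lemma.as_str(),
        }
    }

    /// The discriminator value as it appears under "pos" in the JSON examples.
    fn pos(&self) -> &'static str {
        match self {
            Morphology::Noun { .. } => "noun",
            Morphology::Verb { .. } => "verb",
            Morphology::Adposition { .. } => "adposition",
        }
    }
}
```

The point of the lemma-first rule is visible in `lemma()`: the match needs no fallback arm and no `Option`, because the invariant holds for every variant.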

## Adding an analysis component

To add a new component (e.g. `leipzig_alignment`), edit three files:

**1. Create the component** in `panini-core/src/components/<name>.rs`:

```rust
use crate::component::{AnalysisComponent, ComponentContext};
use crate::traits::LinguisticDefinition;

#[derive(Debug, Clone, Default)]
pub struct MyComponent;

impl<L: LinguisticDefinition> AnalysisComponent<L> for MyComponent {
    fn name(&self) -> &'static str { "My Component" }
    fn schema_key(&self) -> &'static str { "my_component" }

    fn schema_fragment(&self, _lang: &L) -> serde_json::Value {
        // JSON Schema for the component's output
        serde_json::json!({
            "type": "object",
            "properties": {
                "field": { "type": "string" }
            },
            "required": ["field"]
        })
    }

    fn prompt_fragment(&self, _lang: &L, _ctx: &ComponentContext) -> String {
        "Instructions for the LLM on how to produce this component's output.".to_string()
    }

    // Optional overrides:
    // fn is_compatible(&self, lang: &L) -> bool  — filter by language/typology
    // fn output_instruction(&self) -> Option<&str>  — extra output rules
    // fn pre_process(&self, raw: &str) -> String  — clean raw JSON before parsing
    // fn validate(&self, lang: &L, section: &Value) -> Result<(), String>
    // fn post_process(&self, lang: &L, section: &mut Value) -> Result<(), String>
}
```

**2. Register the module** in `panini-core/src/components/mod.rs`:

```rust
pub mod my_component;
pub use my_component::MyComponent;
```

**3. Add to the registry** in `panini-langs/src/registry.rs` (`extract_for_language`):

```rust
let my_comp = MyComponent;
let all_components: Vec<(&str, &dyn AnalysisComponent<L>)> = vec![
    // ...existing...
    ("my_component", &my_comp),
];
```

That's it. The component is automatically integrated into the schema and prompt via `compose_schema()` / `compose_prompt()`. Use it with `--components my_component`.
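
To give a rough idea of what "automatically integrated" could mean, the sketch below merges per-component fragments into one prompt and collects the top-level schema keys. It is a simplified illustration and not the engine's `compose_schema()` / `compose_prompt()`: the `ComponentSketch` trait drops the language parameter and context, and the section layout of the prompt is an assumption.

```rust
/// Hypothetical, language-agnostic simplification of `AnalysisComponent`.
trait ComponentSketch {
    fn name(&self) -> &'static str;
    fn schema_key(&self) -> &'static str;
    fn prompt_fragment(&self) -> String;
}

struct MorphologySketch;
struct MyComponentSketch;

impl ComponentSketch for MorphologySketch {
    fn name(&self) -> &'static str { "Morphology" }
    fn schema_key(&self) -> &'static str { "morphology" }
    fn prompt_fragment(&self) -> String {
        "Tag each target word with its part of speech and lemma.".to_string()
    }
}

impl ComponentSketch for MyComponentSketch {
    fn name(&self) -> &'static str { "My Component" }
    fn schema_key(&self) -> &'static str { "my_component" }
    fn prompt_fragment(&self) -> String {
        "Instructions for the LLM on how to produce this component's output.".to_string()
    }
}

/// Joins every component's instructions into one prompt,
/// one titled section per component (layout is an assumption).
fn compose_prompt_sketch(components: &[&dyn ComponentSketch]) -> String {
    components
        .iter()
        .map(|c| format!("## {} (`{}`)\n{}", c.name(), c.schema_key(), c.prompt_fragment()))
        .collect::<Vec<_>>()
        .join("\n\n")
}

/// Lists the top-level keys a composed schema would expose,
/// in registration order.
fn schema_keys(components: &[&dyn ComponentSketch]) -> Vec<&'static str> {
    components.iter().map(|c| c.schema_key()).collect()
}
```

Because each component only contributes a key and a fragment, adding one does not require touching the others, which matches the three-file workflow above.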

## Building

```bash
cargo build
cargo test
```

## License

MIT