Pāṇini is a linguistic feature extraction framework: describe your language's morphology as Rust types, write extraction directives, and the pipeline handles the rest — prompt assembly, JSON schema generation, LLM orchestration, response parsing, and validation. No universal schema imposed; you define exactly the features your language needs.
Extraction Capabilities
Extraction is built around composable components (AnalysisComponent). Each component provides a different axis of analysis or extracts specific features. You choose which components to run per request — pick only what you need.
By default, all compatible components are run. You can restrict the selection with --components:
# Run only morphology and Leipzig glossing
# Run everything (default)
From the Rust API, pass an optional list of component keys to extract_erased_with_components():
let result = extract_erased_with_components.await?;
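As a rough illustration of the selection rule described above, the sketch below treats a missing key list as "run everything" and an explicit list as a filter. `select_components` and `ALL_KEYS` are hypothetical names invented for this example, not part of the panini API; only the key strings are taken from this README.

```rust
/// Component keys as listed in this README (the constant itself is illustrative).
const ALL_KEYS: [&str; 5] = [
    "morphology",
    "pedagogical_explanation",
    "morpheme_segmentation",
    "multiword_expressions",
    "leipzig_alignment",
];

/// Hypothetical helper: `None` keeps every component, `Some(keys)` keeps only
/// the requested ones and reports the first unknown key as an error.
fn select_components(requested: Option<&[&str]>) -> Result<Vec<&'static str>, String> {
    match requested {
        None => Ok(ALL_KEYS.to_vec()),
        Some(keys) => keys
            .iter()
            .map(|key| {
                ALL_KEYS
                    .iter()
                    .copied()
                    .find(|known| known == key)
                    .ok_or_else(|| format!("unknown component: {key}"))
            })
            .collect(),
    }
}
```

Under these assumptions, `select_components(None)` yields all five keys, while an unrecognised key produces an error instead of being silently dropped. Whether the real pipeline rejects or ignores unknown keys is not stated here.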
Available components
| Key | Component | Description | Compatibility |
|---|---|---|---|
| `morphology` | `MorphologyAnalysis` | POS tagging, lemmatization, and language-specific morphological features (case, tense, aspect, gender) | All languages |
| `pedagogical_explanation` | `PedagogicalExplanation` | Structured HTML explanation for learners (translations, analysis, grammar recap) | All languages |
| `morpheme_segmentation` | `MorphemeSegmentation` | Morpheme-by-morpheme segmentation with grammatical function labels | Agglutinative languages only* |
| `multiword_expressions` | `MultiwordExpressions` | Extracts idioms, collocations, and phrasal expressions | All languages |
| `leipzig_alignment` | `LeipzigAlignment` | Leipzig-style interlinear morpheme-by-morpheme gloss (Leipzig Glossing Rules) | All languages |
*A language counts as agglutinative when it implements the `Agglutinative` trait in the framework. You can provide that implementation for any language, including weakly agglutinative ones such as French.
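One way to picture the opt-in trait is as a compile-time gate: a function bounded on `Agglutinative` simply cannot be called for a language that has not implemented it. The trait shapes and method names below (`iso_code`, `morpheme_inventory`, `segmentation_labels`) are assumptions made for this sketch; the real definitions live in panini-core and may differ.

```rust
/// Hypothetical stand-in for the framework's base language trait.
trait LinguisticDefinition {
    fn iso_code(&self) -> &'static str;
}

/// Hypothetical opt-in trait: implementing it declares that morpheme
/// segmentation makes sense for this language.
trait Agglutinative: LinguisticDefinition {
    /// Illustrative inventory as (form, grammatical function) pairs.
    fn morpheme_inventory(&self) -> Vec<(&'static str, &'static str)>;
}

struct Turkish;

impl LinguisticDefinition for Turkish {
    fn iso_code(&self) -> &'static str {
        "tur"
    }
}

impl Agglutinative for Turkish {
    fn morpheme_inventory(&self) -> Vec<(&'static str, &'static str)> {
        vec![("-ler", "plural"), ("-de", "locative")]
    }
}

/// Only compiles for languages that implement `Agglutinative`.
fn segmentation_labels<L: Agglutinative>(lang: &L) -> Vec<String> {
    lang.morpheme_inventory()
        .iter()
        .map(|(form, function)| format!("{form}:{function}"))
        .collect()
}
```

A language struct without an `Agglutinative` impl would be rejected by the compiler at the `segmentation_labels` call site, which is one plausible way to express the "Agglutinative languages only" compatibility rule.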
Output examples
morphology
Polish — "Studentka czyta interesującą książkę w bibliotece."
pedagogical_explanation
{
"pedagogical_explanation": "<p><b>Translations:</b><br><i>Lit:</i> Student-female reads interesting book in library.<br><i>Nat:</i> The (female) student reads an interesting book in the library.</p><p><b>Analysis:</b></p><ul><li><span style='color:#3498db'><b>studentka</b></span> — nominative (subject)...</li></ul><div style='background-color:#3a3a3a;color:#e0e0e0;padding:10px;border-radius:5px;margin-top:10px;border-left:4px solid #3498db'><b>Grammar Recap:</b><br>Accusative case marks the direct object...</div>"
}
morpheme_segmentation
Turkish — "Öğrenciler kütüphanede kitap okuyorlar."
multiword_expressions
Polish — "Dał nogę przed policją."
leipzig_alignment
Lezgian — "Gila abur-u-n ferma hamišaluǧ güǧüna amuq'-da-č."
Usage
As a library (Rust API)
Add panini-engine and rig-core to your Cargo.toml:
= { = "…" }
= { = "…" }
rig-core = "0.33"
Then call extract_features_via_llm with any rig::completion::CompletionModel:
use ;
use ExtractorPrompts;
use Polish;
use openai;
async
As a standalone CLI
1. Install
# from the workspace root
# or build locally
2. Create a config file (copy from panini.example.toml):
# panini.toml
= "google" # openai | anthropic | google
= "gemini-2.0-flash"
= "pol" # pol | tur | ara
= "$GEMINI_API_KEY"
= "panini-cli/prompts/default.yml"
3. Run
# Select specific components
# List supported languages
# Pipe output to jq
|
CLI options
| Flag | Default | Description |
|---|---|---|
| `--config` | `panini.toml` | Path to TOML config |
| `--text` | (required) | Sentence / card content to analyse |
| `--target` | (required, repeatable) | Target word(s) to focus extraction on |
| `--components` | (all) | Comma-separated list of components to run |
| `--temperature` | `0.2` | Sampling temperature |
| `--max-tokens` | `4096` | Max tokens for LLM response |
| `--ui-language` | `English` | Learner's UI language for pedagogical explanation |
Python
=
→ See pynini/README.md for the full Python API reference.
What you define, what the framework does
You define:
- A morphology enum — the features you want extracted (POS, case, tense, aspect, gender… whatever your language needs)
- Extraction directives — natural-language instructions that guide the LLM on how to analyze your language
- Optional morpheme segmentation — for agglutinative languages, a morpheme inventory with validation rules
- Optional post-processing — hooks to validate or enrich the LLM's output after parsing
The framework handles:
- Prompt assembly — combines your directives, the generated schema, learner context, and pedagogical focus into a structured prompt
- JSON schema generation — automatically derived from your Rust types, so the LLM is constrained to return exactly what you defined
- LLM orchestration — provider-agnostic; bring your own client (OpenAI, Anthropic, Google, local)
- Response parsing & validation — deserializes the LLM output into your typed structs, rejects malformed responses, supports retry with self-correction
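The "retry with self-correction" step can be sketched as a small loop: call the model, try to parse, and on failure fold the raw output and the error into the next prompt. Everything named here (`extract_with_retry`, `parse_lemma`, the prompt wording, the toy `lemma=<word>` format) is hypothetical and stands in for the engine's real schema validation and typed deserialization.

```rust
/// Hypothetical retry loop. `model` stands in for an LLM call and `parse`
/// for validation plus deserialization into a typed value.
fn extract_with_retry<T, M, P>(
    base_prompt: &str,
    max_attempts: usize,
    mut model: M,
    parse: P,
) -> Result<T, String>
where
    M: FnMut(&str) -> String,
    P: Fn(&str) -> Result<T, String>,
{
    let mut prompt = base_prompt.to_string();
    let mut last_error = String::from("no attempts were made");
    for _ in 0..max_attempts {
        let raw = model(&prompt);
        match parse(&raw) {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Feed the failed output and the parse error back to the model.
                prompt = format!(
                    "{base_prompt}\n\nPrevious output:\n{raw}\n\nError:\n{err}\n\nReturn corrected output."
                );
                last_error = err;
            }
        }
    }
    Err(last_error)
}

/// Toy validator: accepts only `lemma=<word>` with a non-empty word.
fn parse_lemma(raw: &str) -> Result<String, String> {
    raw.strip_prefix("lemma=")
        .filter(|rest| !rest.is_empty())
        .map(str::to_string)
        .ok_or_else(|| format!("expected `lemma=<word>`, got `{raw}`"))
}
```

Passing the parser in as a closure keeps the loop independent of any particular output type, which mirrors the idea that each language brings its own typed structs. The attempt limit and the correction prompt text are illustrative choices.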
Design principles
- No universal schema. Each language defines its own morphology enum with exactly the features it needs. Polish has 7 cases and verbal aspect. Arabic has triliteral roots and wazn patterns. There is no lowest-common-denominator `Morphology` struct.
- Type safety over convention. Morphology variants are strongly typed Rust enums, validated at compile time via `#[derive(MorphologyInfo)]`. Every variant must carry a `lemma`. The LLM's JSON output is parsed into these types and rejected if it doesn't conform.
- LLM as untrusted source. Responses are validated against a JSON schema, deserialized into typed structs, then post-processed. On parse failure, the raw output and error are returned for retry with self-correction.
- Provider-agnostic via rig. The engine accepts any `rig::completion::CompletionModel`: OpenAI, Anthropic, Google Gemini, Mistral, Ollama, or any custom provider.
- Opt-in complexity. A simple language (Polish) needs a morphology enum and a few directives. An agglutinative language (Turkish) can opt into morpheme inventories, segmentation, and validation. You only implement what you need.
Workspace structure
panini/ # Facade crate, re-exports everything
panini-core/ # Traits, domain types, morphology enums, components
src/components/ # AnalysisComponent implementations
panini-engine/ # LLM extraction pipeline, prompt assembly, schema composer
panini-langs/ # Per-language implementations (Polish, Arabic, Turkish)
panini-macro/ # #[derive(MorphologyInfo)], #[derive(PaniniResult)] proc macros
Adding a language
Automatically (LLM-assisted)
Fill in your panini.toml config, choose a language (with its ISO 639-3 code), and run:
cargo run -p panini-cli -- add-language --language "French" --iso-code fra --config panini.toml
ALWAYS review the generated file, especially the linguistic definitions, to make sure the LLM described the language correctly.
Manually (step by step)
- Create `panini-langs/src/<language>.rs`
- Define a `Morphology` enum with `#[derive(MorphologyInfo)]` and `#[serde(tag = "pos")]`; every variant must have `lemma: String` as its first field
- Implement `LinguisticDefinition` on a unit struct
- For agglutinative languages, also implement `Agglutinative` with a morpheme inventory
See panini-langs/src/polish.rs or panini-langs/src/turkish.rs as references.
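The manual steps can be pictured with a stripped-down language module. This is only a sketch: the derive and serde attributes appear as comments so the snippet compiles without the panini crates, and the `LinguisticDefinition` methods shown (`iso_code`, `extraction_directives`) as well as the hand-written `lemma()` accessor are assumptions, not the trait's actual surface.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Case {
    Nominative,
    Genitive,
    Dative,
    Accusative,
    Instrumental,
    Locative,
    Vocative,
}

#[derive(Debug, Clone, PartialEq)]
enum Aspect {
    Perfective,
    Imperfective,
}

// In the real crate: #[derive(MorphologyInfo)] and #[serde(tag = "pos")].
#[derive(Debug, Clone, PartialEq)]
enum Morphology {
    // `lemma` is the first field of every variant, as the steps require.
    Noun { lemma: String, case: Case },
    Verb { lemma: String, aspect: Aspect },
    Other { lemma: String },
}

impl Morphology {
    /// Hand-written accessor; the derive macro may provide something similar.
    fn lemma(&self) -> &str {
        match self {
            Morphology::Noun { lemma, .. }
            | Morphology::Verb { lemma, .. }
            | Morphology::Other { lemma } => lemma,
        }
    }
}

/// Hypothetical shape of the language trait, for illustration only.
trait LinguisticDefinition {
    fn iso_code(&self) -> &'static str;
    fn extraction_directives(&self) -> Vec<&'static str>;
}

/// Unit struct representing the language.
struct Polish;

impl LinguisticDefinition for Polish {
    fn iso_code(&self) -> &'static str {
        "pol"
    }
    fn extraction_directives(&self) -> Vec<&'static str> {
        vec!["Always report verbal aspect for verbs."]
    }
}
```

Because every variant carries a `lemma`, code that only needs the dictionary form can ignore the language-specific fields entirely.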
Adding an analysis component
To add a new component (e.g. leipzig_alignment), touch 3 files:
1. Create the component in panini-core/src/components/<name>.rs:
use Debug;
use crate;
use crateLinguisticDefinition;
;
2. Register the module in panini-core/src/components/mod.rs:
pub use MyComponent;
3. Add to the registry in panini-langs/src/registry.rs (extract_for_language):
let my_comp = MyComponent;
let all_components: = vec!;
That's it. The component is automatically integrated into the schema and prompt via compose_schema() / compose_prompt(). Use it with --components my_component.
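To give a feel for what "integrated into the schema" could mean, the sketch below merges one JSON schema fragment per component into a single object schema keyed by component key. The trait methods (`key`, `schema_fragment`) and the function `compose_schema_sketch` are hypothetical; the real `AnalysisComponent` trait and `compose_schema()` may have different signatures and output.

```rust
/// Hypothetical, reduced view of a component: a key plus a schema fragment.
trait AnalysisComponent {
    fn key(&self) -> &'static str;
    /// JSON schema (as text) describing this component's output.
    fn schema_fragment(&self) -> String;
}

struct MyComponent;

impl AnalysisComponent for MyComponent {
    fn key(&self) -> &'static str {
        "my_component"
    }
    fn schema_fragment(&self) -> String {
        r#"{"type":"string"}"#.to_string()
    }
}

/// Illustrative composer: one property per component, all of them required.
fn compose_schema_sketch(components: &[Box<dyn AnalysisComponent>]) -> String {
    let properties: Vec<String> = components
        .iter()
        .map(|c| format!("\"{}\":{}", c.key(), c.schema_fragment()))
        .collect();
    let required: Vec<String> = components
        .iter()
        .map(|c| format!("\"{}\"", c.key()))
        .collect();
    format!(
        "{{\"type\":\"object\",\"properties\":{{{}}},\"required\":[{}]}}",
        properties.join(","),
        required.join(",")
    )
}
```

With this shape, adding a component to the list is enough for its key to appear in the combined schema, which matches the "touch 3 files" workflow in spirit. String concatenation is used only to keep the example dependency-free.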
Building
License
MIT