Evaluation framework for RAG retrieval quality (PMAT-015)
RAG evaluation that uses an LLM as judge on actual chunk content, with synthetic ground truth generated from the corpus itself.
§Architecture
Split pipeline — trueno-rag handles data, Claude Code handles LLM work:
- eval sample: Sample chunks from index (no API needed)
- eval retrieve: Run queries against index (no API needed)
- eval metrics: Compute IR metrics from judgments (no API needed)
- Claude Code /eval-generate skill: Generate questions from sampled chunks
- Claude Code /eval-judge skill: Judge relevance of retrieved chunks
Optional direct API mode (requires ANTHROPIC_API_KEY):
- eval generate: Sample + generate questions via Claude API
- eval judge: Judge + compute metrics via Claude API
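Because the direct API mode requires ANTHROPIC_API_KEY, a caller may want to fail early with a clear message before running eval generate or eval judge. The following is a minimal sketch of such a pre-flight check. The names `validate_api_key` and `api_key_from_env` are hypothetical and are not part of this crate's API; only the environment variable name comes from the documentation above.

```rust
use std::env;

/// Hypothetical helper: accept a key only if it is present and non-blank.
/// Returns the trimmed key on success, or a human-readable error.
pub fn validate_api_key(raw: Option<&str>) -> Result<&str, String> {
    match raw.map(str::trim) {
        Some(key) if !key.is_empty() => Ok(key),
        _ => Err("ANTHROPIC_API_KEY is not set; use the split pipeline \
                  (eval sample / retrieve / metrics) or export the key"
            .to_string()),
    }
}

/// Hypothetical helper: read the key from the process environment.
pub fn api_key_from_env() -> Result<String, String> {
    let raw = env::var("ANTHROPIC_API_KEY").ok();
    validate_api_key(raw.as_deref()).map(str::to_owned)
}
```

Keeping the validation separate from the environment lookup makes the check easy to unit test without mutating process-wide state.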
Re-exports§
pub use client::AnthropicClient;
pub use domain::classify_domain;
pub use generate::GroundTruthGenerator;
pub use judge::RelevanceJudge;
pub use metrics::compute_metrics_from_judgments;
pub use types::EvalConfig;
pub use types::GroundTruthEntry;
pub use types::JudgeCache;
pub use types::JudgeCacheEntry;
pub use types::JudgeVerdict;
pub use types::JudgmentEntry;
pub use types::RetrievalResultEntry;
Modules§
- client: Minimal Anthropic API client for eval operations
- domain: Domain classification for course directories
- generate: Synthetic ground truth generation from corpus chunks
- judge: LLM-as-judge for content-based relevance scoring
- metrics: Compute IR metrics from pre-judged results (no API calls needed)
- types: Core types for the evaluation framework
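The metrics module is described as computing IR metrics from pre-judged results, which needs no LLM calls. To illustrate the kind of computation involved, here is a minimal sketch of two common IR metrics, precision@k and mean reciprocal rank. It assumes binary relevance verdicts listed in rank order for each query. This is an assumption for illustration only: the crate's actual `JudgeVerdict` and `JudgmentEntry` types and the signature of `compute_metrics_from_judgments` are not shown on this page and may differ (for example, by using graded relevance).

```rust
/// Precision@k: fraction of the top-k retrieved chunks judged relevant.
/// `verdicts` holds one boolean per retrieved chunk, in rank order
/// (assumed binary relevance; not the crate's actual types).
pub fn precision_at_k(verdicts: &[bool], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = verdicts.iter().take(k).filter(|&&v| v).count();
    hits as f64 / k as f64
}

/// Reciprocal rank: 1 / (1-based rank of the first relevant chunk),
/// or 0.0 when no retrieved chunk was judged relevant.
pub fn reciprocal_rank(verdicts: &[bool]) -> f64 {
    match verdicts.iter().position(|&v| v) {
        Some(i) => 1.0 / (i + 1) as f64,
        None => 0.0,
    }
}

/// Mean reciprocal rank over all queries; 0.0 for an empty query set.
pub fn mean_reciprocal_rank(queries: &[Vec<bool>]) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    let total: f64 = queries.iter().map(|q| reciprocal_rank(q)).sum();
    total / queries.len() as f64
}
```

Since these functions take only stored verdicts as input, the same judgments can be re-scored with different values of k without re-running the judge.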