Module eval

Evaluation framework for RAG retrieval quality (PMAT-015)

RAG evaluation that uses an LLM as judge on the actual chunk content, with synthetic ground truth generated from the corpus itself.

Architecture

Split pipeline — trueno-rag handles data, Claude Code handles LLM work:

  • eval sample — Sample chunks from index (no API needed)
  • eval retrieve — Run queries against index (no API needed)
  • eval metrics — Compute IR metrics from judgments (no API needed)
  • Claude Code /eval-generate skill — Generate questions from sampled chunks
  • Claude Code /eval-judge skill — Judge relevance of retrieved chunks

Optional direct API mode (requires ANTHROPIC_API_KEY):

  • eval generate — Sample + generate questions via Claude API
  • eval judge — Judge + compute metrics via Claude API
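
The metrics step above takes relevance judgments and needs no API access. As a rough illustration of that kind of computation, here is a minimal sketch of three common IR metrics (precision@k, mean reciprocal rank, hit rate). The `Judgment` and `Metrics` types, their fields, and the `compute_metrics` function are hypothetical names chosen for this example; they are not the crate's `JudgmentEntry` layout or the signature of `compute_metrics_from_judgments`, and the set of metrics the crate actually reports may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical judgment record: one judged (query, chunk) pair.
/// Field names are assumptions for this sketch, not the crate's types.
#[derive(Debug, Clone)]
pub struct Judgment {
    pub query_id: String,
    /// 1-based position of the chunk in the retrieved list.
    pub rank: usize,
    /// Verdict produced by the judging step.
    pub relevant: bool,
}

/// Hypothetical aggregate, averaged over all queries.
#[derive(Debug, PartialEq)]
pub struct Metrics {
    pub precision_at_k: f64,
    pub mrr: f64,
    pub hit_rate: f64,
}

/// Computes metrics from pre-judged results.
/// Returns `None` when `k` is zero or there are no judgments.
pub fn compute_metrics(judgments: &[Judgment], k: usize) -> Option<Metrics> {
    if k == 0 {
        return None;
    }

    // Group judgments by query so each query contributes equally.
    let mut by_query: BTreeMap<&str, Vec<&Judgment>> = BTreeMap::new();
    for j in judgments {
        by_query.entry(j.query_id.as_str()).or_default().push(j);
    }
    if by_query.is_empty() {
        return None;
    }

    let mut precision_sum = 0.0_f64;
    let mut reciprocal_rank_sum = 0.0_f64;
    let mut hits = 0.0_f64;

    for results in by_query.values() {
        // Relevant chunks that appear within the top k positions.
        let relevant_in_top_k = results
            .iter()
            .filter(|j| j.relevant && j.rank >= 1 && j.rank <= k)
            .count();
        precision_sum += relevant_in_top_k as f64 / k as f64;

        // Reciprocal rank of the first relevant chunk anywhere in the
        // judged list; a query with no relevant chunk contributes 0.
        let first_relevant = results
            .iter()
            .filter(|j| j.relevant && j.rank >= 1)
            .map(|j| j.rank)
            .min();
        if let Some(rank) = first_relevant {
            reciprocal_rank_sum += 1.0 / rank as f64;
        }

        if relevant_in_top_k > 0 {
            hits += 1.0;
        }
    }

    let n = by_query.len() as f64;
    Some(Metrics {
        precision_at_k: precision_sum / n,
        mrr: reciprocal_rank_sum / n,
        hit_rate: hits / n,
    })
}
```

Averaging per query rather than per judgment keeps a query with many retrieved chunks from dominating the result. Returning `Option` instead of a zeroed struct makes an empty judgment file distinguishable from a run in which nothing relevant was retrieved.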

Re-exports

pub use client::AnthropicClient;
pub use domain::classify_domain;
pub use generate::GroundTruthGenerator;
pub use judge::RelevanceJudge;
pub use metrics::compute_metrics_from_judgments;
pub use types::EvalConfig;
pub use types::GroundTruthEntry;
pub use types::JudgeCache;
pub use types::JudgeCacheEntry;
pub use types::JudgeVerdict;
pub use types::JudgmentEntry;
pub use types::RetrievalResultEntry;

Modules

client
Minimal Anthropic API client for eval operations
domain
Domain classification for course directories
generate
Synthetic ground truth generation from corpus chunks
judge
LLM-as-judge for content-based relevance scoring
metrics
Compute IR metrics from pre-judged results (no API calls needed)
types
Core types for the evaluation framework