# femind Practical Evaluation
## Purpose
This document defines the real-world validation layer for `femind`.
It exists to keep production readiness tied to practical memory behavior rather
than benchmark scores.
The practical eval set should answer one question first:
Can `femind` extract, store, retrieve, and update real working memory in ways
that are useful to an actual application or developer workflow?
## Validation Order
1. Practical eval scenarios
2. Small approved live validation pass
3. Fixes for real-world failures
4. Repeatable local regression checks
5. Secondary benchmark comparison if still useful
Benchmark datasets remain useful for regression and comparison, but they are
not the primary design target.
## Practical Eval Categories
The curated eval set should cover these categories:
1. Current vs superseded facts
   The system should prefer the latest accepted fact and avoid surfacing stale
   answers as current.
2. Preferences and decisions
   The system should preserve stable preferences and explicit decisions without
   losing the reason behind them.
3. Temporal and recency reasoning
   The system should answer questions about what changed, when it changed, and
   what is current now.
4. Distractor resistance
   The system should still retrieve the correct information when unrelated but
   semantically similar text is present.
5. Messy source extraction
   The system should extract useful facts from rough notes, meetings, and
   transcripts instead of only from clean synthetic prompts.
6. Abstention
   The system should avoid confident fabrication when the answer is missing.
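
As a rough illustration of what categories 1 and 6 expect from an eval check, the sketch below models "latest accepted fact wins" and "abstain when nothing matches" on a toy record list. It is not how `femind` stores or retrieves memory; `FactRecord`, its fields, and `current_answer` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FactRecord:
    """Hypothetical stored fact; not femind's real schema."""
    key: str          # what the fact is about
    value: str        # the stated answer
    recorded_at: int  # order in which the fact was accepted


def current_answer(records: list[FactRecord], key: str) -> Optional[str]:
    """Return the latest accepted value for `key`, or None to abstain."""
    matching = [r for r in records if r.key == key]
    if not matching:
        return None  # abstention: no stored fact answers the question
    return max(matching, key=lambda r: r.recorded_at).value


records = [
    FactRecord("deploy_target", "staging-a", recorded_at=1),
    FactRecord("deploy_target", "staging-b", recorded_at=2),  # supersedes the first
    FactRecord("deploy_region", "eu-west", recorded_at=3),    # similar-looking distractor
]
```

Here a question about `deploy_target` should resolve to the newer record, while a question about a key that was never recorded should produce no answer rather than a guess.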
## Eval Artifact Layout
The curated eval set lives under `eval/practical/`.
- `eval/practical/README.md`
  Maintainer notes and review workflow
- `eval/practical/scenarios.json`
  Curated practical scenarios and expected behavior
## Review Standard
Every practical scenario should be easy to inspect by hand.
Each scenario should include:
- source records or session text
- the main retrieval questions
- expected current answers
- expected abstentions where relevant
- optional extraction expectations for messy source material
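
The checklist above could be enforced with a small completeness check before a scenario is accepted into the set. The sketch below is only an assumption about shape: the field names (`sources`, `questions`, `expected_answers`, and so on) are hypothetical and are not taken from the actual `scenarios.json` schema.

```python
# Hypothetical field names; the real scenarios.json schema may differ.
REQUIRED_FIELDS = ("sources", "questions", "expected_answers")
OPTIONAL_FIELDS = ("expected_abstentions", "extraction_expectations")


def missing_fields(scenario: dict) -> list[str]:
    """List required fields that are absent or empty in a scenario."""
    return [name for name in REQUIRED_FIELDS if not scenario.get(name)]


example_scenario = {
    "id": "superseded-deploy-target",
    "category": "current_vs_superseded",
    "sources": [
        "Deploy target is staging-a.",
        "Update: deploy target moved to staging-b.",
    ],
    "questions": ["What is the current deploy target?"],
    "expected_answers": ["staging-b"],
    "expected_abstentions": ["Who approved the move?"],
}
```

A scenario this small stays easy to read by hand, which is the point of the review standard.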
## Release Use
Before larger live runs or benchmark comparisons:
- run a small approved sample from `eval/practical/scenarios.json`
- inspect extraction quality manually
- inspect retrieval answers manually
- log failures by category
`femind` should be treated as production-ready only when the practical eval
set passes consistently, gives the same results across repeated runs, and shows
no obvious failures in any category.
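
Logging failures by category can be as simple as a tally over per-scenario outcomes. The sketch below assumes results are available as `(category, passed)` pairs; that shape and the category labels are hypothetical and do not describe the summary file `femind` writes.

```python
from collections import Counter


def failures_by_category(results: list[tuple[str, bool]]) -> dict[str, int]:
    """Count failed scenarios per category; passing scenarios are ignored."""
    return dict(Counter(category for category, passed in results if not passed))


results = [
    ("abstention", True),
    ("distractor_resistance", False),
    ("abstention", False),
    ("distractor_resistance", False),
    ("temporal_reasoning", True),
]
```

A category that appears in the tally more than once is a stronger signal than a single isolated failure.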
## Repeatable Command
The primary live-validation entry point is:
```bash
scripts/run-practical-eval.sh
```
Default behavior:
- `FEMIND_EVAL_MODE=retrieval`
- `FEMIND_VECTOR_MODE=exact`
- `FEMIND_EXTRACT_BACKEND=api`
- `FEMIND_EXTRACT_MODEL=openai/gpt-oss-120b`
- summary output at `target/practical-eval/retrieval-exact.json`
- API key resolved at runtime through the macOS Keychain unless overridden with `FEMIND_EVAL_KEY_CMD`
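
The defaults listed above amount to "use the environment variable if set, otherwise fall back". A minimal sketch of that resolution follows. The variable names and default values come from the list; `resolve_settings` and `summary_path` are hypothetical helpers, and the `<mode>-<vector mode>.json` naming rule is an assumption inferred from the documented default path rather than something the script is known to do.

```python
import os
from typing import Mapping

# Defaults as documented for scripts/run-practical-eval.sh.
DEFAULTS = {
    "FEMIND_EVAL_MODE": "retrieval",
    "FEMIND_VECTOR_MODE": "exact",
    "FEMIND_EXTRACT_BACKEND": "api",
    "FEMIND_EXTRACT_MODEL": "openai/gpt-oss-120b",
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Take each setting from the environment, falling back to its default."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def summary_path(settings: Mapping[str, str]) -> str:
    """Assumed naming rule: target/practical-eval/<mode>-<vector mode>.json."""
    mode = settings["FEMIND_EVAL_MODE"]
    vector_mode = settings["FEMIND_VECTOR_MODE"]
    return f"target/practical-eval/{mode}-{vector_mode}.json"
```

For example, `resolve_settings({"FEMIND_VECTOR_MODE": "ann"})` keeps the other three defaults and switches only the vector mode.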
Equivalent direct command:
```bash
cargo run --example practical_eval --features api-embeddings,api-llm,ann -- \
  --scenarios eval/practical/scenarios.json \
  --mode retrieval \
  --vector-mode exact \
  --summary target/practical-eval/retrieval-exact.json
```
The example resolves the API key at runtime through a key command, so secrets
never need to be written into source files or shell history.
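
The key-command pattern can be sketched as "run a command, read the secret from its stdout, keep it in memory only". The helper below is a hypothetical illustration of that pattern, not the code in the `practical_eval` example; it assumes the command prints the key on stdout and can be split into arguments without a shell.

```python
import shlex
import subprocess


def resolve_api_key(key_cmd: str) -> str:
    """Run the key command and return its trimmed stdout as the API key."""
    completed = subprocess.run(
        shlex.split(key_cmd),   # no shell: the command is split into argv
        capture_output=True,
        text=True,
        check=True,             # a failing command raises CalledProcessError
    )
    key = completed.stdout.strip()
    if not key:
        raise RuntimeError("key command produced no output")
    return key
```

On macOS, `security find-generic-password -s <service> -w` is one command that prints a Keychain secret to stdout; the service name `femind` expects is not stated here, so it is left as a placeholder.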
Extraction backend options:
- `api`
  Uses the OpenAI-compatible API callback.
  Recommended default: `openai/gpt-oss-120b`
- `codex-cli`
  Uses the local Codex CLI callback.
  Recommended default: `gpt-5.4-mini`
  Lower-cost fallback: `gpt-5.1-codex-mini`
## Current Practical Baseline
Current validated baseline:
- extraction-only practical eval with DeepInfra `openai/gpt-oss-120b` passes `4/4`
- extraction-only practical eval with Codex CLI `gpt-5.4-mini` passes `4/4`
- extraction-only practical eval with Codex CLI `gpt-5.1-codex-mini` passes `4/4`
- retrieval-only practical eval with `vector_mode=exact` currently passes `9/9`
- retrieval-only practical eval with `vector_mode=ann` currently passes `9/9`
- summary artifact for the exact-mode retrieval run: `target/practical-eval/retrieval-exact.json`
- broader live-usage sample from actual project docs currently passes `11/11` for all four tested extraction models
- live-usage summary artifact: `target/practical-eval/live-usage-exact.json`
The exact-mode retrieval run is the standard local regression check to pass
before trying wider live-usage samples or ANN comparisons.
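
A local regression gate against the recorded baseline might look like the sketch below. It assumes the summary JSON exposes `passed` and `total` counts; those field names are hypothetical, since the actual layout of `retrieval-exact.json` is not described in this document.

```python
import json


def regression_problems(summary_json: str, baseline_passed: int,
                        baseline_total: int) -> list[str]:
    """Compare a summary (assumed {"passed": int, "total": int}) to a baseline."""
    summary = json.loads(summary_json)
    problems = []
    if summary["total"] != baseline_total:
        problems.append(
            f"scenario count changed: {summary['total']} vs {baseline_total}"
        )
    if summary["passed"] < baseline_passed:
        problems.append(
            f"passes dropped: {summary['passed']} vs {baseline_passed}"
        )
    return problems
```

An empty list means the run matches the baseline, for instance the `9/9` recorded for exact-mode retrieval; any entry is a reason to stop before wider live runs.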