Available on crate feature experimental only.
Evals. From OpenAI’s evals repo:

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent common LLM patterns in your workflow without exposing any of that data publicly.
Structs
- LlmJudgeBuilder
- LlmJudgeBuilderWithFn
- LlmJudgeMetric - An LLM as a judge that judges an output by a given schema (and outputs the schema). The schema type uses the Judgment trait, which simply enforces a single function that checks whether the output passes or not.
- LlmJudgeMetricWithFn - An LLM as a judge that judges an output by a given schema (and outputs the schema). Unlike LlmJudgeMetric, this type uses a function pointer that takes the type and returns a bool instead.
- LlmScoreMetric - An eval that scores an output based on some given criteria.
- LlmScoreMetricBuilder
- LlmScoreMetricScore - The scoring output returned by LlmScoreMetric. Must also be used as the Extractor return type when passed into LlmScoreMetric.
- SemanticSimilarityMetric - A semantic similarity metric that uses cosine similarity. In broad terms, cosine similarity measures how similar two documents are, which is useful for quickly testing the semantic similarity between two documents.
- SemanticSimilarityMetricBuilder - A builder struct for SemanticSimilarityMetric.
- SemanticSimilarityMetricScore - The scoring metric used for SemanticSimilarityMetric.
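The cosine-similarity math that SemanticSimilarityMetric is described as using can be sketched from scratch. This is a minimal illustration of the formula, not the crate's implementation; the actual metric compares embedding vectors produced by an embedding model:

```rust
/// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
/// Returns a value in [-1.0, 1.0]; 1.0 means the vectors point the same way.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    // Guard against zero vectors, which have no defined direction.
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why a high cosine similarity between two document embeddings suggests the documents are semantically close.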
Enums
- EvalError - Evaluation errors.
- EvalOutcome - The outcome of an evaluation (i.e., sending an input to an LLM which is then tested against a set of criteria). Invalid results due to things like functions returning errors should be encoded as invalid evaluation outcomes.
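The pattern described for EvalOutcome, where errors are encoded as invalid outcomes rather than propagated, can be sketched with a hypothetical stand-in enum. The names below are illustrative assumptions, not the crate's actual variants:

```rust
/// A hypothetical three-way outcome: the evaluation passed, failed,
/// or could not be judged at all (e.g. a scoring function errored).
#[derive(Debug, PartialEq)]
enum Outcome {
    Pass,
    Fail,
    /// An invalid result carries the error message instead of bubbling it up,
    /// so one bad sample doesn't abort a whole eval run.
    Invalid(String),
}

/// Convert a fallible pass/fail check into an outcome.
fn outcome_from(check: Result<bool, String>) -> Outcome {
    match check {
        Ok(true) => Outcome::Pass,
        Ok(false) => Outcome::Fail,
        Err(e) => Outcome::Invalid(e),
    }
}
```

Encoding failures this way lets a batch of evaluations run to completion and report how many samples were invalid, instead of stopping at the first error.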
Traits
- Eval - A trait to encode evaluators: types that can be used to test LLM outputs against criteria. Evaluators come in all shapes and sizes, and may themselves use LLMs (although there are many heuristics you can use that don’t). There are three possible states that an evaluation can result in (see EvalOutcome).
- Judgment - A helper trait for LlmJudgeMetric. Types that implement Judgment generally have a very standard way of either passing or failing; as such, this can be enforced as a trait.
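The single-function contract described for Judgment can be sketched as follows. The trait shape, the FactualityJudgment schema, and the method name passes are assumptions for illustration, not the crate's actual API:

```rust
/// A hypothetical judge output schema: the judge LLM would fill in these
/// fields, and the trait decides pass/fail from them.
struct FactualityJudgment {
    is_factual: bool,
    reasoning: String,
}

/// Sketch of a judgment trait: a single function that reports
/// whether the judged output passed.
trait Judgment {
    fn passes(&self) -> bool;
}

impl Judgment for FactualityJudgment {
    fn passes(&self) -> bool {
        // The standard pass/fail rule lives here, so every schema
        // exposes the same check to the metric that drives it.
        self.is_factual
    }
}
```

Putting the pass/fail rule behind a trait means the judge metric only needs to call one method, regardless of how elaborate the schema the LLM fills in is.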