Module evals

Available on crate feature experimental only.

Evals. From OpenAI’s evals repo:

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent the common LLMs patterns in your workflow without exposing any of that data publicly.

Structs§

LlmJudgeBuilder
LlmJudgeBuilderWithFn
LlmJudgeMetric
An LLM-as-a-judge metric that judges an output against a given schema (and outputs that schema). The schema type must implement the Judgment trait, which simply enforces a single function that checks whether the output passes or not.
LlmJudgeMetricWithFn
An LLM-as-a-judge metric that judges an output against a given schema (and outputs that schema). Unlike LlmJudgeMetric, this type uses a function pointer that takes the schema type and returns a bool.
LlmScoreMetric
An eval that scores an output based on some given criteria.
LlmScoreMetricBuilder
LlmScoreMetricScore
The scoring output returned by LlmScoreMetric. Must also be used as the Extractor return type when passed into LlmScoreMetric.
SemanticSimilarityMetric
A semantic similarity metric based on cosine similarity. In broad terms, cosine similarity measures how similar two documents are, which makes it useful for quickly checking how close an output is to a reference text.
SemanticSimilarityMetricBuilder
A builder struct for SemanticSimilarityMetric.
SemanticSimilarityMetricScore
The scoring metric used for SemanticSimilarityMetric.
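The cosine-similarity computation behind SemanticSimilarityMetric can be sketched as follows. This is an illustrative standalone function over embedding vectors, not the crate’s actual implementation:

```rust
/// Cosine similarity between two equal-length embedding vectors:
/// dot(a, b) / (|a| * |b|). Returns a value in [-1.0, 1.0];
/// 1.0 means the vectors point in the same direction.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // degenerate (all-zero) embedding; treat as no similarity
    }
    dot / (norm_a * norm_b)
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [2.0, 4.0, 6.0]; // same direction as `a`
    let c = [-2.0, 1.0, 0.0]; // orthogonal to `a`
    println!("{:.3}", cosine_similarity(&a, &b)); // 1.000
    println!("{:.3}", cosine_similarity(&a, &c)); // 0.000
}
```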

Enums§

EvalError
Evaluation errors.
EvalOutcome
The outcome of an evaluation (i.e., sending an input to an LLM, then testing the response against a set of criteria). Results that are invalid for incidental reasons, such as a function returning an error, should be encoded as invalid evaluation outcomes.
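The pass/fail/invalid distinction can be sketched as an enum like the one below. This is an illustrative stand-in, not the crate’s actual EvalOutcome definition; its real variants and payloads may differ:

```rust
/// Illustrative stand-in for an evaluation outcome: an evaluation
/// either passes, fails, or is invalid (e.g. a scoring function
/// returned an error instead of a verdict).
#[derive(Debug, PartialEq)]
enum EvalOutcome {
    Pass,
    Fail,
    Invalid(String), // reason the evaluation could not be scored
}

/// Fold a fallible scoring step into an outcome, encoding errors
/// as `Invalid` rather than propagating them.
fn outcome_from(result: Result<bool, String>) -> EvalOutcome {
    match result {
        Ok(true) => EvalOutcome::Pass,
        Ok(false) => EvalOutcome::Fail,
        Err(reason) => EvalOutcome::Invalid(reason),
    }
}

fn main() {
    println!("{:?}", outcome_from(Ok(true)));              // Pass
    println!("{:?}", outcome_from(Err("timeout".into()))); // Invalid("timeout")
}
```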

Traits§

Eval
A trait to encode evaluators: types that can be used to test LLM outputs against criteria. Evaluators come in all shapes and sizes, and may themselves use LLMs (although there are many heuristics you can use that don’t). There are three possible states that an evaluation can result in: pass, fail, or invalid (see EvalOutcome).
Judgment
A helper trait for LlmJudgeMetric. Types that implement Judgment generally have a very standard way of either passing or failing. As such, this can be enforced as a trait.
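The single-function contract that Judgment describes can be illustrated with a self-contained sketch. The trait and type below mirror the description above but are hypothetical, not the crate’s exact signatures:

```rust
/// Illustrative stand-in for the `Judgment` trait: one method
/// deciding whether the judged output passes.
trait Judgment {
    fn passes(&self) -> bool;
}

/// A hypothetical schema an LLM judge might fill in.
struct FactualityJudgment {
    /// Whether the judged output contradicted the reference answer.
    contradicts_reference: bool,
    /// The judge's free-form reasoning, kept for debugging.
    reasoning: String,
}

impl Judgment for FactualityJudgment {
    fn passes(&self) -> bool {
        !self.contradicts_reference
    }
}

fn main() {
    let verdict = FactualityJudgment {
        contradicts_reference: false,
        reasoning: "Output matches the reference answer.".into(),
    };
    println!("passed: {}", verdict.passes()); // passed: true
    println!("why: {}", verdict.reasoning);
}
```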