pydantic_evals.evaluators
Bases: Generic[InputsT, OutputT, MetadataT]
Context for report-level evaluation, containing the full experiment results.
The experiment name.
Type: str
The full evaluation report.
Type: EvaluationReport[InputsT, OutputT, MetadataT]
Experiment-level metadata.
Bases: Evaluator[object, object, object]
Check if the output exactly equals the provided value.
Bases: Generic[InputsT, OutputT, MetadataT]
Context for evaluating a task execution.
An instance of this class is the sole input to all Evaluators. It contains all the information needed to evaluate the task execution, including inputs, outputs, metadata, and telemetry data.
Evaluators use this context to access the task inputs, actual output, expected output, and other information when evaluating the result of the task execution.
Example:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Use the context to access task inputs, outputs, and expected outputs
        return ctx.output == ctx.expected_output
The name of the case.
The inputs provided to the task for this case.
Type: InputsT
Metadata associated with the case, if provided. May be None if no metadata was specified.
Type: MetadataT | None
The expected output for the case, if provided. May be None if no expected output was specified.
Type: OutputT | None
The actual output produced by the task for this case.
Type: OutputT
The duration of the task run for this case.
Type: float
Attributes associated with the task run for this case.
These can be set by calling pydantic_evals.dataset.set_eval_attribute in any code executed during the evaluation task.
Metrics associated with the task run for this case.
These can be set by calling pydantic_evals.dataset.increment_eval_metric in any code executed during the evaluation task.
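Because these values can be set from any code running inside the task, the mechanics are naturally modeled with context-local state. The following is an illustrative stdlib-only sketch of how per-run metric accumulation can work; the names (`_metrics`, `increment_metric`, `run_task`) are invented for illustration and are not the library's implementation:

```python
from contextvars import ContextVar

# Illustrative only: a per-run mapping that accumulates named metrics,
# similar in spirit to increment_eval_metric.
_metrics: ContextVar[dict[str, float]] = ContextVar("metrics")

def increment_metric(name: str, amount: float) -> None:
    # Add `amount` to the named metric for the current task run.
    metrics = _metrics.get()
    metrics[name] = metrics.get(name, 0.0) + amount

def run_task() -> dict[str, float]:
    # Each task run starts with a fresh metrics mapping.
    _metrics.set({})
    increment_metric("llm_calls", 1)
    increment_metric("llm_calls", 1)
    increment_metric("tokens", 130)
    return _metrics.get()

print(run_task())  # {'llm_calls': 2.0, 'tokens': 130.0}
```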
Get the SpanTree for this task execution.
The span tree is a graph where each node corresponds to an OpenTelemetry span recorded during the task execution, including timing information and any custom spans created during execution.
Type: SpanTree
The result of running an evaluator with an optional explanation.
Contains a scalar value and an optional "reason" explaining the value.
The scalar result of the evaluation (boolean, integer, float, or string).
An optional explanation of the evaluation result.
Bases: BaseEvaluator, Generic[InputsT, OutputT, MetadataT]
Base class for experiment-wide evaluators that analyze full reports.
Unlike case-level Evaluators which assess individual task outputs, ReportEvaluators see all case results together and produce experiment-wide analyses like confusion matrices, precision-recall curves, or scalar statistics.
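The difference in scope can be illustrated with a stdlib-only sketch: a report-level analysis sees every case at once, for example tallying a confusion matrix over (expected, predicted) label pairs. The function name and input shape here are illustrative, not the library's API:

```python
from collections import Counter

def confusion_matrix(results: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    # Each result is an (expected_label, predicted_label) pair drawn from
    # one case; the matrix aggregates counts across the whole experiment.
    return dict(Counter(results))

cases = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "ham")]
print(confusion_matrix(cases))
# {('spam', 'spam'): 1, ('spam', 'ham'): 1, ('ham', 'ham'): 2}
```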
@abstractmethod
def evaluate(
ctx: ReportEvaluatorContext[InputsT, OutputT, MetadataT],
) -> ReportAnalysis | list[ReportAnalysis] | Awaitable[ReportAnalysis | list[ReportAnalysis]]
Evaluate the full report and return experiment-wide analysis/analyses.
ReportAnalysis | list[ReportAnalysis] | Awaitable[ReportAnalysis | list[ReportAnalysis]]
@async
def evaluate_async(
ctx: ReportEvaluatorContext[InputsT, OutputT, MetadataT],
) -> ReportAnalysis | list[ReportAnalysis]
Evaluate, handling both sync and async implementations.
ReportAnalysis | list[ReportAnalysis]
Bases: Evaluator[object, object, object]
Check if the output exactly equals the expected output.
Bases: Generic[EvaluationScalarT]
The details of an individual evaluation result.
Contains the name, value, reason, and source evaluator for a single evaluation.
name : str
The name of the evaluation.
The scalar result of the evaluation.
An optional explanation of the evaluation result.
The spec of the evaluator that produced this result.
def downcast(*value_types: type[T]) -> EvaluationResult[T] | None
Attempt to downcast this result to a more specific type.
EvaluationResult[T] | None — A downcast version of this result if the value is an instance of one of the given types, otherwise None.
*value_types : type[T] Default: ()
The types to check the value against.
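The core of this check can be sketched with plain isinstance tests. This is an illustrative sketch, not the library's code; note the caveat that bool is a subclass of int in Python, so downcasting a bool value to int succeeds under plain isinstance:

```python
def downcast_value(value, *value_types):
    # Return the value when it is an instance of one of the given types,
    # otherwise None.  Caveat: bool is a subclass of int, so a plain
    # isinstance check lets a bool pass as an int.
    for t in value_types:
        if isinstance(value, t):
            return value
    return None

print(downcast_value(0.85, int, float))    # 0.85
print(downcast_value("pass", int, float))  # None
```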
Bases: Evaluator[object, object, object]
Check if the output contains the expected output.
For strings, checks if expected_output is a substring of output. For lists/tuples, checks if expected_output is in output. For dicts, checks if all key-value pairs in expected_output are in output. For model-like types (BaseModel, dataclasses), converts to a dict and checks key-value pairs.
Note: case_sensitive only applies when both the value and output are strings.
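These rules can be sketched in plain Python. This is an illustrative sketch of the described semantics, not the library's implementation, and it handles only dataclasses as the "model-like" case:

```python
from dataclasses import asdict, is_dataclass

def contains(output, expected, case_sensitive=True):
    # Strings: substring check, optionally case-insensitive.
    if isinstance(output, str) and isinstance(expected, str):
        if not case_sensitive:
            output, expected = output.lower(), expected.lower()
        return expected in output
    # Model-like types (here: dataclasses): convert to dicts first.
    if is_dataclass(output) and not isinstance(output, type):
        output = asdict(output)
    if is_dataclass(expected) and not isinstance(expected, type):
        expected = asdict(expected)
    # Dicts: every expected key/value pair must appear in the output.
    if isinstance(output, dict) and isinstance(expected, dict):
        return all(k in output and output[k] == v for k, v in expected.items())
    # Lists/tuples: membership check.
    if isinstance(output, (list, tuple)):
        return expected in output
    return False

print(contains("Hello World", "hello", case_sensitive=False))  # True
print(contains({"a": 1, "b": 2}, {"a": 1}))                    # True
print(contains([1, 2, 3], 2))                                  # True
```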
Represents a failure raised during the execution of an evaluator.
Bases: ReportEvaluator
Computes a confusion matrix from case data.
Bases: BaseEvaluator, Generic[InputsT, OutputT, MetadataT]
Base class for all evaluators.
Evaluators can assess the performance of a task in a variety of ways, as a function of the EvaluatorContext.
Subclasses must implement the evaluate method. Note it can be defined with either def or async def.
Example:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output
@classmethod
@deprecated
def name(cls) -> str
name has been renamed, use get_serialization_name instead.
def get_default_evaluation_name() -> str
Return the default name to use in reports for the output of this evaluator.
By default, if the evaluator has an attribute called evaluation_name of type string, that will be used.
Otherwise, the serialization name of the evaluator (which is usually the class name) will be used.
This can be overridden to get a more descriptive name in evaluation reports, e.g. using instance information.
Note that evaluators that return a mapping of results will always use the keys of that mapping as the names of the associated evaluation results.
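The default lookup can be sketched as follows (an illustrative sketch of the described behavior, not the library's code; the class names are invented for the example):

```python
def default_evaluation_name(evaluator) -> str:
    # Prefer a string-valued `evaluation_name` attribute if present;
    # otherwise fall back to the class name.
    name = getattr(evaluator, "evaluation_name", None)
    if isinstance(name, str):
        return name
    return type(evaluator).__name__

class Accuracy:
    pass

class NamedCheck:
    evaluation_name = "strict_match"

print(default_evaluation_name(Accuracy()))    # Accuracy
print(default_evaluation_name(NamedCheck()))  # strict_match
```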
@abstractmethod
def evaluate(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput | Awaitable[EvaluatorOutput]
Evaluate the task output in the given context.
This is the main evaluation method that subclasses must implement. It can be either synchronous or asynchronous, returning either an EvaluatorOutput directly or an Awaitable[EvaluatorOutput].
EvaluatorOutput | Awaitable[EvaluatorOutput] — The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those. Can be returned either synchronously or as an awaitable for asynchronous evaluation.
The context containing the inputs, outputs, and metadata for evaluation.
def evaluate_sync(ctx: EvaluatorContext[InputsT, OutputT, MetadataT]) -> EvaluatorOutput
Run the evaluator synchronously, handling both sync and async implementations.
This method ensures synchronous execution by running any async evaluate implementation to completion using run_until_complete.
EvaluatorOutput — The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.
The context containing the inputs, outputs, and metadata for evaluation.
@async
def evaluate_async(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput
Run the evaluator asynchronously, handling both sync and async implementations.
This method ensures asynchronous execution by properly awaiting any async evaluate implementation. For synchronous implementations, it returns the result directly.
EvaluatorOutput — The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.
The context containing the inputs, outputs, and metadata for evaluation.
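Supporting both def and async def evaluate implementations behind one synchronous call comes down to checking whether the call returned an awaitable. An illustrative stdlib sketch (the library uses run_until_complete; this version uses asyncio.run, and all names are invented for the example):

```python
import asyncio
import inspect

def evaluate_sync(evaluate_fn, ctx):
    # Call the implementation; if it returned an awaitable (i.e. it was
    # defined with `async def`), drive it to completion synchronously.
    result = evaluate_fn(ctx)
    if inspect.isawaitable(result):
        return asyncio.run(result)
    return result

def sync_impl(ctx):
    return ctx == "expected"

async def async_impl(ctx):
    return ctx == "expected"

print(evaluate_sync(sync_impl, "expected"))   # True
print(evaluate_sync(async_impl, "expected"))  # True
```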
Bases: Evaluator[object, object, object]
Check if the output is an instance of a type with the given name.
Bases: Evaluator[object, object, object]
Check if the execution time is under the specified maximum.
Bases: ReportEvaluator
Computes a precision-recall curve from case data.
Returns both a PrecisionRecall chart and a ScalarResult with the AUC value.
The AUC is computed at full resolution (every unique score threshold) for accuracy, while the chart points are downsampled to n_thresholds for display.
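Computing precision/recall points over score thresholds can be sketched in plain Python. This is an illustrative sketch of the general technique, not the library's implementation:

```python
def pr_points(scores, labels):
    # For each unique score used as a threshold, classify scores >= threshold
    # as positive and compute a (recall, precision) point.
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision))
    return points

scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, True, False, True]
print(pr_points(scores, labels))
```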
Bases: TypedDict
Configuration for the score and assertion outputs of the LLMJudge evaluator.
Bases: Evaluator[object, object, object]
Judge whether the output of a language model meets the criteria of a provided rubric.
If you do not specify a model, it uses the default model for judging. This starts as 'openai:gpt-5.2', but can be overridden by calling set_default_judge_model.
Bases: ReportEvaluator
Computes an ROC curve and AUC from case data.
Returns a LinePlot with the ROC curve (plus a dashed random-baseline diagonal) and a ScalarResult with the AUC value.
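ROC AUC has a simple rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, counting ties as half. An illustrative stdlib sketch of that computation (not the library's implementation):

```python
def roc_auc(scores, labels):
    # Fraction of (positive, negative) pairs where the positive case
    # scores strictly higher; ties count as half a win.
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.4, 0.2], [True, True, False, True]))
```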
Bases: Evaluator[object, object, object]
Check if the span tree contains a span that matches the specified query.
Bases: ReportEvaluator
Computes a Kolmogorov-Smirnov plot and statistic from case data.
Plots the empirical CDFs of the score distribution for positive and negative cases, and computes the KS statistic (maximum vertical distance between the two CDFs).
Returns a LinePlot with the two CDF curves and a ScalarResult with the KS statistic.
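The KS statistic itself is straightforward to compute from the two score samples. An illustrative stdlib sketch (not the library's implementation):

```python
def ks_statistic(pos_scores, neg_scores):
    # Maximum vertical distance between the empirical CDFs of the
    # positive and negative score distributions.
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    best = 0.0
    for t in thresholds:
        cdf_pos = sum(1 for s in pos_scores if s <= t) / len(pos_scores)
        cdf_neg = sum(1 for s in neg_scores if s <= t) / len(neg_scores)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Perfectly separated distributions give the maximum statistic of 1.0.
print(ks_statistic([0.8, 0.9, 0.7], [0.2, 0.4, 0.6]))  # 1.0
```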
The specification of an evaluator to be run.
This class is used to represent evaluators in a serializable format, supporting various short forms for convenience when defining evaluators in YAML or JSON dataset files.
In particular, each of the following forms is supported for specifying an evaluator with name MyEvaluator:
'MyEvaluator' - Just the (string) name of the Evaluator subclass is used if its __init__ takes no arguments
{'MyEvaluator': first_arg} - A single argument is passed as the first positional argument to MyEvaluator.__init__
{'MyEvaluator': {k1: v1, k2: v2}} - Multiple kwargs are passed to MyEvaluator.__init__
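Normalizing these short forms into a (name, args, kwargs) triple can be sketched as follows; this is an illustrative sketch of the described behavior, not the library's parsing code, and the evaluator names in the usage lines are just examples:

```python
def parse_spec(spec):
    # Normalize an evaluator spec short form into (name, args, kwargs).
    if isinstance(spec, str):                      # 'MyEvaluator'
        return spec, (), {}
    name, value = next(iter(spec.items()))
    if isinstance(value, dict):                    # {'MyEvaluator': {k1: v1}}
        return name, (), value
    return name, (value,), {}                      # {'MyEvaluator': first_arg}

print(parse_spec("MaxDuration"))                      # ('MaxDuration', (), {})
print(parse_spec({"MaxDuration": 1.5}))               # ('MaxDuration', (1.5,), {})
print(parse_spec({"LLMJudge": {"rubric": "polite"}})) # ('LLMJudge', (), {'rubric': 'polite'})
```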
Default: NamedSpec
Type for the output of an evaluator, which can be a scalar, an EvaluationReason, or a mapping of names to either.
Default: EvaluationScalar | EvaluationReason | Mapping[str, EvaluationScalar | EvaluationReason]
Bases: BaseModel
The output of a grading operation.
@async
def judge_output(
output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
@async
def judge_input_output(
inputs: Any,
output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on the inputs and a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
@async
def judge_input_output_expected(
inputs: Any,
output: Any,
expected_output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on the inputs, the expected output, and a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
@async
def judge_output_expected(
output: Any,
expected_output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on the expected output, output, and a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
def set_default_judge_model(model: models.Model | models.KnownModelName) -> None
Set the default model used for judging.
This model is used if None is passed to the model argument of judge_output and judge_input_output.