pydantic_evals.evaluators
Bases: Generic[InputsT, OutputT, MetadataT]
Context for report-level evaluation, containing the full experiment results.
The experiment name.
Type: str
The full evaluation report.
Type: EvaluationReport[InputsT, OutputT, MetadataT]
Experiment-level metadata.
Bases: Evaluator[object, object, object]
Check if the output exactly equals the provided value.
Bases: Generic[InputsT, OutputT, MetadataT]
Context for evaluating a task execution.
An instance of this class is the sole input to all Evaluators. It contains all the information needed to evaluate the task execution, including inputs, outputs, metadata, and telemetry data.
Evaluators use this context to access the task inputs, actual output, expected output, and other information when evaluating the result of the task execution.
Example:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Use the context to access task inputs, outputs, and expected outputs
        return ctx.output == ctx.expected_output
The name of the case.
The inputs provided to the task for this case.
Type: InputsT
Metadata associated with the case, if provided. May be None if no metadata was specified.
Type: MetadataT | None
The expected output for the case, if provided. May be None if no expected output was specified.
Type: OutputT | None
The actual output produced by the task for this case.
Type: OutputT
The duration of the task run for this case.
Type: float
Attributes associated with the task run for this case.
These can be set by calling pydantic_evals.dataset.set_eval_attribute in any code executed during the evaluation task.
Metrics associated with the task run for this case.
These can be set by calling pydantic_evals.dataset.increment_eval_metric in any code executed during the evaluation task.
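Because these values can be set from any code running inside the task, the mechanics are naturally modeled with context-local state. The following is an illustrative stdlib-only sketch of how per-run metric accumulation can work; the names (`_metrics`, `increment_metric`, `run_task`) are invented for illustration and are not the library's implementation:

```python
from contextvars import ContextVar

# Illustrative only: a per-run mapping that accumulates named metrics,
# similar in spirit to increment_eval_metric.
_metrics: ContextVar[dict[str, float]] = ContextVar("metrics")

def increment_metric(name: str, amount: float) -> None:
    # Add `amount` to the named metric for the current task run.
    metrics = _metrics.get()
    metrics[name] = metrics.get(name, 0.0) + amount

def run_task() -> dict[str, float]:
    # Each task run starts with a fresh metrics mapping.
    _metrics.set({})
    increment_metric("llm_calls", 1)
    increment_metric("llm_calls", 1)
    increment_metric("tokens", 130)
    return _metrics.get()

print(run_task())  # {'llm_calls': 2.0, 'tokens': 130.0}
```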
Get the SpanTree for this task execution.
The span tree is a graph where each node corresponds to an OpenTelemetry span recorded during the task execution, including timing information and any custom spans created during execution.
Type: SpanTree
The result of running an evaluator with an optional explanation.
Contains a scalar value and an optional "reason" explaining the value.
The scalar result of the evaluation (boolean, integer, float, or string).
An optional explanation of the evaluation result.
Bases: BaseEvaluator, Generic[InputsT, OutputT, MetadataT]
Base class for experiment-wide evaluators that analyze full reports.
Unlike case-level Evaluators which assess individual task outputs, ReportEvaluators see all case results together and produce experiment-wide analyses like confusion matrices, precision-recall curves, or scalar statistics.
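The difference in scope can be illustrated with a stdlib-only sketch: a report-level analysis sees every case at once, for example tallying a confusion matrix over (expected, predicted) label pairs. The function name and input shape here are illustrative, not the library's API:

```python
from collections import Counter

def confusion_matrix(results: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    # Each result is an (expected_label, predicted_label) pair drawn from
    # one case; the matrix aggregates counts across the whole experiment.
    return dict(Counter(results))

cases = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "ham")]
print(confusion_matrix(cases))
# {('spam', 'spam'): 1, ('spam', 'ham'): 1, ('ham', 'ham'): 2}
```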
@abstractmethod
def evaluate(
ctx: ReportEvaluatorContext[InputsT, OutputT, MetadataT],
) -> ReportAnalysis | list[ReportAnalysis] | Awaitable[ReportAnalysis | list[ReportAnalysis]]
Evaluate the full report and return experiment-wide analysis/analyses.
ReportAnalysis | list[ReportAnalysis] | Awaitable[ReportAnalysis | list[ReportAnalysis]]
@async
def evaluate_async(
ctx: ReportEvaluatorContext[InputsT, OutputT, MetadataT],
) -> ReportAnalysis | list[ReportAnalysis]
Evaluate, handling both sync and async implementations.
ReportAnalysis | list[ReportAnalysis]
Bases: Evaluator[object, object, object]
Check if the output exactly equals the expected output.
Bases: Generic[EvaluationScalarT]
The details of an individual evaluation result.
Contains the name, value, reason, and source evaluator for a single evaluation.
name : str
The name of the evaluation.
The scalar result of the evaluation.
An optional explanation of the evaluation result.
The spec of the evaluator that produced this result.
def downcast(*value_types: type[T]) -> EvaluationResult[T] | None
Attempt to downcast this result to a more specific type.
EvaluationResult[T] | None — A downcast version of this result if the value is an instance of one of the given types, otherwise None.
*value_types : type[T] Default: ()
The types to check the value against.
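The core of this check can be sketched with plain isinstance tests. This is an illustrative sketch, not the library's code; note the caveat that bool is a subclass of int in Python, so downcasting a bool value to int succeeds under plain isinstance:

```python
def downcast_value(value, *value_types):
    # Return the value when it is an instance of one of the given types,
    # otherwise None.  Caveat: bool is a subclass of int, so a plain
    # isinstance check lets a bool pass as an int.
    for t in value_types:
        if isinstance(value, t):
            return value
    return None

print(downcast_value(0.85, int, float))    # 0.85
print(downcast_value("pass", int, float))  # None
```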
Bases: Evaluator[object, object, object]
Check if the output contains the expected output.
For strings, checks if expected_output is a substring of output. For lists/tuples, checks if expected_output is in output. For dicts, checks if all key-value pairs in expected_output are in output. For model-like types (BaseModel, dataclasses), converts to a dict and checks key-value pairs.
Note: case_sensitive only applies when both the value and output are strings.
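These rules can be sketched in plain Python. This is an illustrative sketch of the described semantics, not the library's implementation, and it handles only dataclasses as the "model-like" case:

```python
from dataclasses import asdict, is_dataclass

def contains(output, expected, case_sensitive=True):
    # Strings: substring check, optionally case-insensitive.
    if isinstance(output, str) and isinstance(expected, str):
        if not case_sensitive:
            output, expected = output.lower(), expected.lower()
        return expected in output
    # Model-like types (here: dataclasses): convert to dicts first.
    if is_dataclass(output) and not isinstance(output, type):
        output = asdict(output)
    if is_dataclass(expected) and not isinstance(expected, type):
        expected = asdict(expected)
    # Dicts: every expected key/value pair must appear in the output.
    if isinstance(output, dict) and isinstance(expected, dict):
        return all(k in output and output[k] == v for k, v in expected.items())
    # Lists/tuples: membership check.
    if isinstance(output, (list, tuple)):
        return expected in output
    return False

print(contains("Hello World", "hello", case_sensitive=False))  # True
print(contains({"a": 1, "b": 2}, {"a": 1}))                    # True
print(contains([1, 2, 3], 2))                                  # True
```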
Represents a failure raised during the execution of an evaluator.
Bases: ReportEvaluator
Computes a confusion matrix from case data.
Bases: BaseEvaluator, Generic[InputsT, OutputT, MetadataT]
Base class for all evaluators.
Evaluators can assess the performance of a task in a variety of ways, as a function of the EvaluatorContext.
Subclasses must implement the evaluate method. Note it can be defined with either def or async def.
Example:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output
@classmethod
@deprecated
def name(cls) -> str
name has been renamed, use get_serialization_name instead.
def get_default_evaluation_name() -> str
Return the default name to use in reports for the output of this evaluator.
By default, if the evaluator has an attribute called evaluation_name of type string, that will be used.
Otherwise, the serialization name of the evaluator (which is usually the class name) will be used.
This can be overridden to get a more descriptive name in evaluation reports, e.g. using instance information.
Note that evaluators that return a mapping of results will always use the keys of that mapping as the names of the associated evaluation results.
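The default lookup can be sketched as follows (an illustrative sketch of the described behavior, not the library's code; the class names are invented for the example):

```python
def default_evaluation_name(evaluator) -> str:
    # Prefer a string-valued `evaluation_name` attribute if present;
    # otherwise fall back to the class name.
    name = getattr(evaluator, "evaluation_name", None)
    if isinstance(name, str):
        return name
    return type(evaluator).__name__

class Accuracy:
    pass

class NamedCheck:
    evaluation_name = "strict_match"

print(default_evaluation_name(Accuracy()))    # Accuracy
print(default_evaluation_name(NamedCheck()))  # strict_match
```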
@abstractmethod
def evaluate(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput | Awaitable[EvaluatorOutput]
Evaluate the task output in the given context.
This is the main evaluation method that subclasses must implement. It can be either synchronous or asynchronous, returning either an EvaluatorOutput directly or an Awaitable[EvaluatorOutput].
EvaluatorOutput | Awaitable[EvaluatorOutput] — The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those. Can be returned either synchronously or as an awaitable for asynchronous evaluation.
The context containing the inputs, outputs, and metadata for evaluation.
def evaluate_sync(ctx: EvaluatorContext[InputsT, OutputT, MetadataT]) -> EvaluatorOutput
Run the evaluator synchronously, handling both sync and async implementations.
This method ensures synchronous execution by running any async evaluate implementation to completion using run_until_complete.
EvaluatorOutput — The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.
The context containing the inputs, outputs, and metadata for evaluation.
@async
def evaluate_async(
ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput
Run the evaluator asynchronously, handling both sync and async implementations.
This method ensures asynchronous execution by properly awaiting any async evaluate implementation. For synchronous implementations, it returns the result directly.
EvaluatorOutput — The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.
The context containing the inputs, outputs, and metadata for evaluation.
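Supporting both def and async def evaluate implementations behind one synchronous call comes down to checking whether the call returned an awaitable. An illustrative stdlib sketch (the library uses run_until_complete; this version uses asyncio.run, and all names are invented for the example):

```python
import asyncio
import inspect

def evaluate_sync(evaluate_fn, ctx):
    # Call the implementation; if it returned an awaitable (i.e. it was
    # defined with `async def`), drive it to completion synchronously.
    result = evaluate_fn(ctx)
    if inspect.isawaitable(result):
        return asyncio.run(result)
    return result

def sync_impl(ctx):
    return ctx == "expected"

async def async_impl(ctx):
    return ctx == "expected"

print(evaluate_sync(sync_impl, "expected"))   # True
print(evaluate_sync(async_impl, "expected"))  # True
```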
Bases: Evaluator[object, object, object]
Check if the output is an instance of a type with the given name.
Bases: Evaluator[object, object, object]
Check if the execution time is under the specified maximum.
Bases: ReportEvaluator
Computes a precision-recall curve from case data.
Returns both a PrecisionRecall chart and a ScalarResult with the AUC value.
The AUC is computed at full resolution (every unique score threshold) for accuracy, while the chart points are downsampled to n_thresholds for display.
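Computing precision/recall points over score thresholds can be sketched in plain Python. This is an illustrative sketch of the general technique, not the library's implementation:

```python
def pr_points(scores, labels):
    # For each unique score used as a threshold, classify scores >= threshold
    # as positive and compute a (recall, precision) point.
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision))
    return points

scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, True, False, True]
print(pr_points(scores, labels))
```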
Bases: TypedDict
Configuration for the score and assertion outputs of the LLMJudge evaluator.
Bases: Evaluator[object, object, object]
Judge whether the output of a language model meets the criteria of a provided rubric.
If you do not specify a model, it uses the default model for judging. This starts as 'openai:gpt-5.2', but can be overridden by calling set_default_judge_model.
Bases: ReportEvaluator
Computes an ROC curve and AUC from case data.
Returns a LinePlot with the ROC curve (plus a dashed random-baseline diagonal) and a ScalarResult with the AUC value.
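ROC AUC has a simple rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, counting ties as half. An illustrative stdlib sketch of that computation (not the library's implementation):

```python
def roc_auc(scores, labels):
    # Fraction of (positive, negative) pairs where the positive case
    # scores strictly higher; ties count as half a win.
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.4, 0.2], [True, True, False, True]))
```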
Bases: Evaluator[object, object, object]
Check if the span tree contains a span that matches the specified query.
Bases: ReportEvaluator
Computes a Kolmogorov-Smirnov plot and statistic from case data.
Plots the empirical CDFs of the score distribution for positive and negative cases, and computes the KS statistic (maximum vertical distance between the two CDFs).
Returns a LinePlot with the two CDF curves and a ScalarResult with the KS statistic.
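The KS statistic itself is straightforward to compute from the two score samples. An illustrative stdlib sketch (not the library's implementation):

```python
def ks_statistic(pos_scores, neg_scores):
    # Maximum vertical distance between the empirical CDFs of the
    # positive and negative score distributions.
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    best = 0.0
    for t in thresholds:
        cdf_pos = sum(1 for s in pos_scores if s <= t) / len(pos_scores)
        cdf_neg = sum(1 for s in neg_scores if s <= t) / len(neg_scores)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Perfectly separated distributions give the maximum statistic of 1.0.
print(ks_statistic([0.8, 0.9, 0.7], [0.2, 0.4, 0.6]))  # 1.0
```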
The specification of an evaluator to be run.
This class is used to represent evaluators in a serializable format, supporting various short forms for convenience when defining evaluators in YAML or JSON dataset files.
In particular, each of the following forms is supported for specifying an evaluator with name MyEvaluator:
'MyEvaluator' - Just the (string) name of the Evaluator subclass is used if its __init__ takes no arguments
{'MyEvaluator': first_arg} - A single argument is passed as the first positional argument to MyEvaluator.__init__
{'MyEvaluator': {k1: v1, k2: v2}} - Multiple kwargs are passed to MyEvaluator.__init__
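Normalizing these short forms into a (name, args, kwargs) triple can be sketched as follows; this is an illustrative sketch of the described behavior, not the library's parsing code, and the evaluator names in the usage lines are just examples:

```python
def parse_spec(spec):
    # Normalize an evaluator spec short form into (name, args, kwargs).
    if isinstance(spec, str):                      # 'MyEvaluator'
        return spec, (), {}
    name, value = next(iter(spec.items()))
    if isinstance(value, dict):                    # {'MyEvaluator': {k1: v1}}
        return name, (), value
    return name, (value,), {}                      # {'MyEvaluator': first_arg}

print(parse_spec("MaxDuration"))                      # ('MaxDuration', (), {})
print(parse_spec({"MaxDuration": 1.5}))               # ('MaxDuration', (1.5,), {})
print(parse_spec({"LLMJudge": {"rubric": "polite"}})) # ('LLMJudge', (), {'rubric': 'polite'})
```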
Default: NamedSpec
Type for the output of an evaluator, which can be a scalar, an EvaluationReason, or a mapping of names to either.
Default: EvaluationScalar | EvaluationReason | Mapping[str, EvaluationScalar | EvaluationReason]
Bases: BaseModel
The output of a grading operation.
@async
def judge_output(
output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
@async
def judge_input_output(
inputs: Any,
output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on the inputs and a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
@async
def judge_input_output_expected(
inputs: Any,
output: Any,
expected_output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on the inputs, the expected output, and a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
@async
def judge_output_expected(
output: Any,
expected_output: Any,
rubric: str,
model: models.Model | models.KnownModelName | str | None = None,
model_settings: ModelSettings | None = None,
) -> GradingOutput
Judge the output of a model based on the expected output, output, and a rubric.
If the model is not specified, a default model is used. The default model starts as 'openai:gpt-5.2', but this can be changed using the set_default_judge_model function.
GradingOutput
def set_default_judge_model(model: models.Model | models.KnownModelName) -> None
Set the default model used for judging.
This model is used if None is passed to the model argument of judge_output and judge_input_output.