pydantic_evals.reporting

ConfusionMatrix

Bases: BaseModel

A confusion matrix comparing expected vs predicted labels across cases.

Attributes

class_labels

Ordered list of class labels (used for both axes).

Type: list[str]

matrix

matrix[expected_idx][predicted_idx] = count of cases.

Type: list[list[int]]
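The matrix layout can be sketched with plain lists (the labels and counts below are invented for illustration; this is not the library API):

```python
# Illustrative data only: a 2x2 confusion matrix for hypothetical labels.
class_labels = ["positive", "negative"]

# matrix[expected_idx][predicted_idx] = count of cases
matrix = [
    [8, 2],  # expected "positive": 8 predicted positive, 2 negative
    [1, 9],  # expected "negative": 1 predicted positive, 9 negative
]

# Accuracy is the diagonal mass over the total count.
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(class_labels)))
accuracy = correct / total
```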

PrecisionRecallPoint

Bases: BaseModel

A single point on a precision-recall curve.

PrecisionRecallCurve

Bases: BaseModel

A single precision-recall curve.

Attributes

name

Name of this curve (e.g., experiment name or evaluator name).

Type: str

points

Points on the curve, ordered by threshold.

Type: list[PrecisionRecallPoint]

auc

Area under the precision-recall curve.

Type: float | None Default: None

PrecisionRecall

Bases: BaseModel

Precision-recall curve data across cases.

Attributes

curves

One or more curves.

Type: list[PrecisionRecallCurve]
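As a sketch of how a value like the optional auc field might be computed from ordered points (trapezoidal rule over recall; the point data here is invented, and the library may compute auc differently):

```python
# Hypothetical (precision, recall) points, ordered by threshold so that
# recall increases monotonically.
points = [(1.0, 0.0), (0.9, 0.5), (0.75, 0.8), (0.6, 1.0)]

# Trapezoidal area under the precision-recall curve.
auc = 0.0
for (p0, r0), (p1, r1) in zip(points, points[1:]):
    auc += (r1 - r0) * (p0 + p1) / 2
```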

ScalarResult

Bases: BaseModel

A single scalar statistic (e.g., F1 score, accuracy, BLEU).

Attributes

unit

Optional unit label (e.g., '%', 'ms').

Type: str | None Default: None

ReportCase

Bases: Generic[InputsT, OutputT, MetadataT]

A single case in an evaluation report.

Attributes

name

The name of the case.

Type: str

inputs

The inputs to the task, from Case.inputs.

Type: InputsT

metadata

Any metadata associated with the case, from Case.metadata.

Type: MetadataT | None

expected_output

The expected output of the task, from Case.expected_output.

Type: OutputT | None

output

The output of the task execution.

Type: OutputT

source_case_name

The original case name before run-indexing. Serves as the aggregation key for multi-run experiments. None when repeat == 1.

Type: str | None Default: None

trace_id

The trace ID of the case span.

Type: str | None Default: None

span_id

The span ID of the case span.

Type: str | None Default: None

TableResult

Bases: BaseModel

A generic table of data (fallback for custom analyses).

Attributes

columns

Column headers.

Type: list[str]

rows

Row data, one list per row.

Type: list[list[str | int | float | bool | None]]
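Data in this shape might look like the following (the column names and values are invented; each row must have one cell per column):

```python
# Hypothetical data matching TableResult's shape: column headers plus
# rows of str | int | float | bool | None cells.
columns = ["case", "accuracy", "passed"]
rows = [
    ["greeting", 0.92, True],
    ["farewell", 0.88, False],
]

# Every row should line up with the headers.
assert all(len(row) == len(columns) for row in rows)
```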

LinePlot

Bases: BaseModel

A generic XY line plot with labeled axes, supporting multiple curves.

Use this for ROC curves, KS plots, calibration curves, or any custom line chart that doesn’t fit the specific PrecisionRecall type.

Attributes

x_label

Label for the x-axis.

Type: str

y_label

Label for the y-axis.

Type: str

x_range

Optional fixed range for x-axis (min, max).

Type: tuple[float, float] | None Default: None

y_range

Optional fixed range for y-axis (min, max).

Type: tuple[float, float] | None Default: None

curves

One or more curves to plot.

Type: list[LinePlotCurve]
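As a sketch, a calibration plot might carry data in this shape (LinePlotCurve's own fields are not listed in this reference, so the per-curve point structure below is an assumption):

```python
# Hypothetical axis configuration for a calibration curve.
x_label, y_label = "Predicted probability", "Observed frequency"
x_range = y_range = (0.0, 1.0)  # fixed axes for a calibration plot

# A perfectly calibrated model would lie on the diagonal.
diagonal = [(t / 10, t / 10) for t in range(11)]
```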

ReportCaseFailure

Bases: Generic[InputsT, OutputT, MetadataT]

A single case in an evaluation report that failed due to an error during task execution.

Attributes

name

The name of the case.

Type: str

inputs

The inputs to the task, from Case.inputs.

Type: InputsT

metadata

Any metadata associated with the case, from Case.metadata.

Type: MetadataT | None

expected_output

The expected output of the task, from Case.expected_output.

Type: OutputT | None

error_message

The message of the exception that caused the failure.

Type: str

error_stacktrace

The stacktrace of the exception that caused the failure.

Type: str

source_case_name

The original case name before run-indexing. Serves as the aggregation key for multi-run experiments. None when repeat == 1.

Type: str | None Default: None

trace_id

The trace ID of the case span.

Type: str | None Default: None

span_id

The span ID of the case span.

Type: str | None Default: None

ReportCaseGroup

Bases: Generic[InputsT, OutputT, MetadataT]

Grouped results from running the same case multiple times.

This is a computed view, not stored data. Obtain via EvaluationReport.case_groups().

Attributes

name

The original case name (shared across all runs).

Type: str

inputs

The inputs (same for all runs).

Type: InputsT

metadata

The metadata (same for all runs).

Type: MetadataT | None

expected_output

The expected output (same for all runs).

Type: OutputT | None

runs

Individual run results.

Type: Sequence[ReportCase[InputsT, OutputT, MetadataT]]

failures

Runs that failed with exceptions.

Type: Sequence[ReportCaseFailure[InputsT, OutputT, MetadataT]]

summary

Aggregated statistics across runs.

Type: ReportCaseAggregate
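The grouping that case_groups() performs can be sketched with plain dicts keyed on source_case_name (the run records and their names below are invented):

```python
from collections import defaultdict

# Invented run records: the same case executed three times, with
# distinct run names and a shared source_case_name aggregation key.
runs = [
    {"name": "greeting-1", "source_case_name": "greeting", "score": 0.8},
    {"name": "greeting-2", "source_case_name": "greeting", "score": 0.9},
    {"name": "greeting-3", "source_case_name": "greeting", "score": 1.0},
]

# Group runs by their aggregation key.
groups: dict[str, list[dict]] = defaultdict(list)
for run in runs:
    groups[run["source_case_name"]].append(run)

# Per-group summary: mean score across runs.
summary = {
    name: sum(r["score"] for r in rs) / len(rs) for name, rs in groups.items()
}
```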

ReportCaseAggregate

Bases: BaseModel

A synthetic case that summarizes a set of cases.

Methods

average

@staticmethod

def average(cases: list[ReportCase]) -> ReportCaseAggregate

Produce a synthetic “summary” case by averaging quantitative attributes.

Returns

ReportCaseAggregate

average_from_aggregates

@staticmethod

def average_from_aggregates(
    aggregates: list[ReportCaseAggregate],
) -> ReportCaseAggregate

Average across multiple aggregates (used for multi-run experiment summaries).

Returns

ReportCaseAggregate
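The averaging of quantitative attributes that average() performs can be sketched like this (the per-case score dicts are invented; the real method operates on ReportCase objects):

```python
# Invented per-case quantitative attributes to average over.
cases = [
    {"accuracy": 0.8, "latency_ms": 120.0},
    {"accuracy": 1.0, "latency_ms": 80.0},
]

# Mean of each quantitative attribute across cases.
keys = cases[0].keys()
aggregate = {k: sum(c[k] for c in cases) / len(cases) for k in keys}
```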

EvaluationReport

Bases: Generic[InputsT, OutputT, MetadataT]

A report of the results of evaluating a model on a set of cases.

Attributes

name

The name of the report.

Type: str

cases

The cases in the report.

Type: list[ReportCase[InputsT, OutputT, MetadataT]]

failures

The failures in the report. These are cases where task execution raised an exception.

Type: list[ReportCaseFailure[InputsT, OutputT, MetadataT]] Default: field(default_factory=(list[ReportCaseFailure[InputsT, OutputT, MetadataT]]))

analyses

Experiment-wide analyses produced by report evaluators.

Type: list[ReportAnalysis] Default: field(default_factory=(list[ReportAnalysis]))

report_evaluator_failures

Failures from report evaluators that raised exceptions.

Type: list[EvaluatorFailure] Default: field(default_factory=(list[EvaluatorFailure]))

experiment_metadata

Metadata associated with the specific experiment represented by this report.

Type: dict[str, Any] | None Default: None

trace_id

The trace ID of the evaluation.

Type: str | None Default: None

span_id

The span ID of the evaluation.

Type: str | None Default: None

Methods

case_groups
def case_groups() -> list[ReportCaseGroup[InputsT, OutputT, MetadataT]] | None

Group cases by source_case_name and compute per-group aggregates.

Returns None if no cases have source_case_name set (i.e., single-run experiment).

Returns

list[ReportCaseGroup[InputsT, OutputT, MetadataT]] | None

render
def render(
    width: int | None = None,
    baseline: EvaluationReport[InputsT, OutputT, MetadataT] | None = None,
    include_input: bool = False,
    include_metadata: bool = False,
    include_expected_output: bool = False,
    include_output: bool = False,
    include_durations: bool = True,
    include_total_duration: bool = False,
    include_removed_cases: bool = False,
    include_averages: bool = True,
    include_errors: bool = True,
    include_error_stacktrace: bool = False,
    include_evaluator_failures: bool = True,
    include_analyses: bool = True,
    input_config: RenderValueConfig | None = None,
    metadata_config: RenderValueConfig | None = None,
    output_config: RenderValueConfig | None = None,
    score_configs: dict[str, RenderNumberConfig] | None = None,
    label_configs: dict[str, RenderValueConfig] | None = None,
    metric_configs: dict[str, RenderNumberConfig] | None = None,
    duration_config: RenderNumberConfig | None = None,
    include_reasons: bool = False,
) -> str

Render this report to a nicely-formatted string, optionally comparing it to a baseline report.

If you want more control over the output, use console_table instead and pass it to rich.Console.print.

Returns

str

print
def print(
    width: int | None = None,
    baseline: EvaluationReport[InputsT, OutputT, MetadataT] | None = None,
    console: Console | None = None,
    include_input: bool = False,
    include_metadata: bool = False,
    include_expected_output: bool = False,
    include_output: bool = False,
    include_durations: bool = True,
    include_total_duration: bool = False,
    include_removed_cases: bool = False,
    include_averages: bool = True,
    include_errors: bool = True,
    include_error_stacktrace: bool = False,
    include_evaluator_failures: bool = True,
    include_analyses: bool = True,
    input_config: RenderValueConfig | None = None,
    metadata_config: RenderValueConfig | None = None,
    output_config: RenderValueConfig | None = None,
    score_configs: dict[str, RenderNumberConfig] | None = None,
    label_configs: dict[str, RenderValueConfig] | None = None,
    metric_configs: dict[str, RenderNumberConfig] | None = None,
    duration_config: RenderNumberConfig | None = None,
    include_reasons: bool = False,
) -> None

Print this report to the console, optionally comparing it to a baseline report.

If you want more control over the output, use console_table instead and pass it to rich.Console.print.

Returns

None

console_table
def console_table(
    baseline: EvaluationReport[InputsT, OutputT, MetadataT] | None = None,
    include_input: bool = False,
    include_metadata: bool = False,
    include_expected_output: bool = False,
    include_output: bool = False,
    include_durations: bool = True,
    include_total_duration: bool = False,
    include_removed_cases: bool = False,
    include_averages: bool = True,
    include_evaluator_failures: bool = True,
    input_config: RenderValueConfig | None = None,
    metadata_config: RenderValueConfig | None = None,
    output_config: RenderValueConfig | None = None,
    score_configs: dict[str, RenderNumberConfig] | None = None,
    label_configs: dict[str, RenderValueConfig] | None = None,
    metric_configs: dict[str, RenderNumberConfig] | None = None,
    duration_config: RenderNumberConfig | None = None,
    include_reasons: bool = False,
    with_title: bool = True,
) -> Table

Return a table containing the data from this report.

If a baseline is provided, returns a diff between this report and the baseline report. Optionally include input and output details.

Returns

Table

failures_table
def failures_table(
    include_input: bool = False,
    include_metadata: bool = False,
    include_expected_output: bool = False,
    include_error_message: bool = True,
    include_error_stacktrace: bool = True,
    input_config: RenderValueConfig | None = None,
    metadata_config: RenderValueConfig | None = None,
) -> Table

Return a table containing the failures in this report.

Returns

Table

__str__
def __str__() -> str

Return a string representation of the report.

Returns

str

RenderValueConfig

Bases: TypedDict

A configuration for rendering values in an Evaluation report.

RenderNumberConfig

Bases: TypedDict

A configuration for rendering a particular score or metric in an Evaluation report.

See the implementation of _RenderNumber for more clarity on how these parameters affect the rendering.

Attributes

value_formatter

The logic to use for formatting values.

  • If not provided, format as ints if all values are ints, otherwise at least one decimal place and at least four significant figures.
  • You can also use a custom string format spec, e.g. '{:.3f}'
  • You can also use a custom function, e.g. lambda x: f'{x:.3f}'

Type: str | Callable[[float | int], str]

diff_formatter

The logic to use for formatting details about the diff.

The strings produced by the value_formatter will always be included in the reports, but the diff_formatter is used to produce additional text about the difference between the old and new values, such as the absolute or relative difference.

  • If not provided, format as ints if all values are ints, otherwise at least one decimal place and at least four significant figures, and will include the percentage change.
  • You can also use a custom string format spec, e.g. '{:+.3f}'
  • You can also use a custom function, e.g. lambda x: f'{x:+.3f}'. If this function returns None, no extra diff text will be added.
  • You can also use None to never generate extra diff text.

Type: str | Callable[[float | int, float | int], str | None] | None

diff_atol

The absolute tolerance for considering a difference “significant”.

A difference is “significant” if abs(new - old) > self.diff_atol + self.diff_rtol * abs(old).

If a difference is not significant, it will not have the diff styles applied. Note that we still show both the rendered before and after values in the diff any time they differ, even if the difference is not significant. (If the rendered values are exactly the same, we only show the value once.)

If not provided, use 1e-6.

Type: float
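The significance test described above can be sketched as a small predicate (the default tolerances here mirror the documented fallbacks of 1e-6 for diff_atol and 0.05 for float-valued diff_rtol):

```python
def is_significant(
    old: float, new: float, atol: float = 1e-6, rtol: float = 0.05
) -> bool:
    """A difference is significant when it exceeds the combined
    absolute and relative tolerance."""
    return abs(new - old) > atol + rtol * abs(old)
```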

diff_rtol

The relative tolerance for considering a difference “significant”.

See the description of diff_atol for more details about what makes a difference “significant”.

If not provided, use 0.001 if all values are ints, otherwise 0.05.

Type: float

diff_increase_style

The style to apply to diffed values that have a significant increase.

See the description of diff_atol for more details about what makes a difference “significant”.

If not provided, use green for scores and red for metrics. You can also use arbitrary rich styles, such as “bold red”.

Type: str

diff_decrease_style

The style to apply to diffed values that have significant decrease.

See the description of diff_atol for more details about what makes a difference “significant”.

If not provided, use red for scores and green for metrics. You can also use arbitrary rich styles, such as “bold red”.

Type: str

ReportCaseRenderer

Methods

build_base_table
def build_base_table(title: str) -> Table

Build and return a Rich Table for the diff output.

Returns

Table

build_failures_table
def build_failures_table(title: str) -> Table

Build and return a Rich Table for the failures output.

Returns

Table

build_row
def build_row(case: ReportCase) -> list[str]

Build a table row for a single case.

Returns

list[str]

build_aggregate_row
def build_aggregate_row(aggregate: ReportCaseAggregate) -> list[str]

Build a table row for an aggregated case.

Returns

list[str]

build_diff_row
def build_diff_row(new_case: ReportCase, baseline: ReportCase) -> list[str]

Build a diff table row comparing a new case to its baseline.

Returns

list[str]

build_diff_aggregate_row
def build_diff_aggregate_row(
    new: ReportCaseAggregate,
    baseline: ReportCaseAggregate,
) -> list[str]

Build a diff table row comparing a new aggregate to its baseline.

Returns

list[str]

build_failure_row
def build_failure_row(case: ReportCaseFailure) -> list[str]

Build a table row for a single case failure.

Returns

list[str]

EvaluationRenderer

A class for rendering an EvaluationReport or the diff between two EvaluationReports.

Methods

build_table
def build_table(report: EvaluationReport, with_title: bool = True) -> Table

Build a table for the report.

Returns

Table — A Rich Table object

Parameters

report : EvaluationReport

The evaluation report to render

with_title : bool Default: True

Whether to include the title in the table (default True)

build_diff_table
def build_diff_table(
    report: EvaluationReport,
    baseline: EvaluationReport,
    with_title: bool = True,
) -> Table

Build a diff table comparing report to baseline.

Returns

Table — A Rich Table object

Parameters

report : EvaluationReport

The evaluation report to compare

baseline : EvaluationReport

The baseline report to compare against

with_title : bool Default: True

Whether to include the title in the table (default True)

ReportAnalysis

Discriminated union of all report-level analysis types.

Default: Annotated[ConfusionMatrix | PrecisionRecall | ScalarResult | TableResult | LinePlot, Discriminator('type')]