pydantic_evals.reporting
Bases: BaseModel
A confusion matrix comparing expected vs predicted labels across cases.
Ordered list of class labels (used for both axes).
matrix[expected_idx][predicted_idx] = count of cases.
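The indexing convention can be illustrated with a plain-Python sketch; it builds the nested-list matrix directly rather than using the ConfusionMatrix model:

```python
# Build a confusion matrix as a nested list, following the convention
# matrix[expected_idx][predicted_idx] = count of cases.
class_labels = ["cat", "dog"]
label_to_idx = {label: i for i, label in enumerate(class_labels)}

pairs = [("cat", "cat"), ("cat", "dog"), ("dog", "dog"), ("dog", "dog")]
matrix = [[0] * len(class_labels) for _ in class_labels]
for expected, predicted in pairs:
    matrix[label_to_idx[expected]][label_to_idx[predicted]] += 1

print(matrix)  # [[1, 1], [0, 2]]
```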
Bases: BaseModel
A single point on a precision-recall curve.
Bases: BaseModel
A single precision-recall curve.
Name of this curve (e.g., experiment name or evaluator name).
Type: str
Points on the curve, ordered by threshold.
Type: list[PrecisionRecallPoint]
Area under the precision-recall curve.
Type: float | None Default: None
Bases: BaseModel
Precision-recall curve data across cases.
One or more curves.
Type: list[PrecisionRecallCurve]
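For intuition about the auc field, the area under a precision-recall curve can be approximated from its points with a trapezoidal sum over recall. This plain-Python sketch works on bare (recall, precision) tuples and is independent of the PrecisionRecallPoint model:

```python
def pr_auc(points: list[tuple[float, float]]) -> float:
    """Approximate area under a PR curve from (recall, precision) points
    using the trapezoidal rule; points must be sorted by recall."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area

# A perfect classifier holds precision 1.0 at every recall level.
print(pr_auc([(0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]))  # 1.0
```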
Bases: BaseModel
A single scalar statistic (e.g., F1 score, accuracy, BLEU).
Optional unit label (e.g., ’%’, ‘ms’).
Type: str | None Default: None
Bases: Generic[InputsT, OutputT, MetadataT]
A single case in an evaluation report.
The name of the case.
Type: str
The inputs to the task, from Case.inputs.
Type: InputsT
Any metadata associated with the case, from Case.metadata.
Type: MetadataT | None
The expected output of the task, from Case.expected_output.
Type: OutputT | None
The output of the task execution.
Type: OutputT
The original case name before run-indexing. Serves as the aggregation key for multi-run experiments. None when repeat == 1.
Type: str | None Default: None
The trace ID of the case span.
Type: str | None Default: None
The span ID of the case span.
Type: str | None Default: None
Bases: BaseModel
A generic table of data (fallback for custom analyses).
Column headers.
Row data, one list per row.
Type: list[list[str | int | float | bool | None]]
Bases: BaseModel
A generic XY line plot with labeled axes, supporting multiple curves.
Use this for ROC curves, KS plots, calibration curves, or any custom line chart that doesn’t fit the specific PrecisionRecall type.
Label for the x-axis.
Type: str
Label for the y-axis.
Type: str
Optional fixed range for x-axis (min, max).
Type: tuple[float, float] | None Default: None
Optional fixed range for y-axis (min, max).
Type: tuple[float, float] | None Default: None
One or more curves to plot.
Type: list[LinePlotCurve]
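As an example of the kind of data such a plot might carry, here is a plain-Python sketch that computes ROC points (false-positive rate vs true-positive rate) from scored examples. The fields of LinePlotCurve are not shown in this reference, so the sketch stays library-independent:

```python
def roc_points(scores, labels, thresholds):
    """Return (fpr, tpr) points for each threshold; labels are 0/1."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], [0.5]))  # [(0.0, 1.0)]
```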
Bases: Generic[InputsT, OutputT, MetadataT]
A single case in an evaluation report that failed due to an error during task execution.
The name of the case.
Type: str
The inputs to the task, from Case.inputs.
Type: InputsT
Any metadata associated with the case, from Case.metadata.
Type: MetadataT | None
The expected output of the task, from Case.expected_output.
Type: OutputT | None
The message of the exception that caused the failure.
Type: str
The stacktrace of the exception that caused the failure.
Type: str
The original case name before run-indexing. Serves as the aggregation key for multi-run experiments. None when repeat == 1.
Type: str | None Default: None
The trace ID of the case span.
Type: str | None Default: None
The span ID of the case span.
Type: str | None Default: None
Bases: Generic[InputsT, OutputT, MetadataT]
Grouped results from running the same case multiple times.
This is a computed view, not stored data. Obtain via
EvaluationReport.case_groups().
The original case name (shared across all runs).
Type: str
The inputs (same for all runs).
Type: InputsT
The metadata (same for all runs).
Type: MetadataT | None
The expected output (same for all runs).
Type: OutputT | None
Individual run results.
Type: Sequence[ReportCase[InputsT, OutputT, MetadataT]]
Runs that failed with exceptions.
Type: Sequence[ReportCaseFailure[InputsT, OutputT, MetadataT]]
Aggregated statistics across runs.
Type: ReportCaseAggregate
Bases: BaseModel
A synthetic case that summarizes a set of cases.
@staticmethod
def average(cases: list[ReportCase]) -> ReportCaseAggregate
Produce a synthetic “summary” case by averaging quantitative attributes.
ReportCaseAggregate
@staticmethod
def average_from_aggregates(
aggregates: list[ReportCaseAggregate],
) -> ReportCaseAggregate
Average across multiple aggregates (used for multi-run experiment summaries).
ReportCaseAggregate
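The averaging behaviour can be sketched in plain Python: for each named score, take the mean across the aggregates that define it. The real ReportCaseAggregate covers more attributes (such as labels and durations); this is a simplified illustration over bare score dicts:

```python
def average_scores(aggregates: list[dict[str, float]]) -> dict[str, float]:
    """Average each named score across a list of per-run score dicts."""
    keys = {k for agg in aggregates for k in agg}
    return {
        k: sum(agg[k] for agg in aggregates if k in agg)
        / sum(1 for agg in aggregates if k in agg)
        for k in keys
    }

print(average_scores([{"accuracy": 0.75}, {"accuracy": 0.25}]))  # {'accuracy': 0.5}
```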
Bases: Generic[InputsT, OutputT, MetadataT]
A report of the results of evaluating a model on a set of cases.
The name of the report.
Type: str
The cases in the report.
Type: list[ReportCase[InputsT, OutputT, MetadataT]]
The failures in the report. These are cases where task execution raised an exception.
Type: list[ReportCaseFailure[InputsT, OutputT, MetadataT]] Default: field(default_factory=(list[ReportCaseFailure[InputsT, OutputT, MetadataT]]))
Experiment-wide analyses produced by report evaluators.
Type: list[ReportAnalysis] Default: field(default_factory=(list[ReportAnalysis]))
Failures from report evaluators that raised exceptions.
Type: list[EvaluatorFailure] Default: field(default_factory=(list[EvaluatorFailure]))
Metadata associated with the specific experiment represented by this report.
Type: dict[str, Any] | None Default: None
The trace ID of the evaluation.
Type: str | None Default: None
The span ID of the evaluation.
Type: str | None Default: None
def case_groups() -> list[ReportCaseGroup[InputsT, OutputT, MetadataT]] | None
Group cases by source_case_name and compute per-group aggregates.
Returns None if no cases have source_case_name set (i.e., single-run experiment).
list[ReportCaseGroup[InputsT, OutputT, MetadataT]] | None
def render(
width: int | None = None,
baseline: EvaluationReport[InputsT, OutputT, MetadataT] | None = None,
include_input: bool = False,
include_metadata: bool = False,
include_expected_output: bool = False,
include_output: bool = False,
include_durations: bool = True,
include_total_duration: bool = False,
include_removed_cases: bool = False,
include_averages: bool = True,
include_errors: bool = True,
include_error_stacktrace: bool = False,
include_evaluator_failures: bool = True,
include_analyses: bool = True,
input_config: RenderValueConfig | None = None,
metadata_config: RenderValueConfig | None = None,
output_config: RenderValueConfig | None = None,
score_configs: dict[str, RenderNumberConfig] | None = None,
label_configs: dict[str, RenderValueConfig] | None = None,
metric_configs: dict[str, RenderNumberConfig] | None = None,
duration_config: RenderNumberConfig | None = None,
include_reasons: bool = False,
) -> str
Render this report to a nicely-formatted string, optionally comparing it to a baseline report.
If you want more control over the output, use console_table instead and pass the resulting Table to rich.Console.print.
def print(
width: int | None = None,
baseline: EvaluationReport[InputsT, OutputT, MetadataT] | None = None,
console: Console | None = None,
include_input: bool = False,
include_metadata: bool = False,
include_expected_output: bool = False,
include_output: bool = False,
include_durations: bool = True,
include_total_duration: bool = False,
include_removed_cases: bool = False,
include_averages: bool = True,
include_errors: bool = True,
include_error_stacktrace: bool = False,
include_evaluator_failures: bool = True,
include_analyses: bool = True,
input_config: RenderValueConfig | None = None,
metadata_config: RenderValueConfig | None = None,
output_config: RenderValueConfig | None = None,
score_configs: dict[str, RenderNumberConfig] | None = None,
label_configs: dict[str, RenderValueConfig] | None = None,
metric_configs: dict[str, RenderNumberConfig] | None = None,
duration_config: RenderNumberConfig | None = None,
include_reasons: bool = False,
) -> None
Print this report to the console, optionally comparing it to a baseline report.
If you want more control over the output, use console_table instead and pass the resulting Table to rich.Console.print.
def console_table(
baseline: EvaluationReport[InputsT, OutputT, MetadataT] | None = None,
include_input: bool = False,
include_metadata: bool = False,
include_expected_output: bool = False,
include_output: bool = False,
include_durations: bool = True,
include_total_duration: bool = False,
include_removed_cases: bool = False,
include_averages: bool = True,
include_evaluator_failures: bool = True,
input_config: RenderValueConfig | None = None,
metadata_config: RenderValueConfig | None = None,
output_config: RenderValueConfig | None = None,
score_configs: dict[str, RenderNumberConfig] | None = None,
label_configs: dict[str, RenderValueConfig] | None = None,
metric_configs: dict[str, RenderNumberConfig] | None = None,
duration_config: RenderNumberConfig | None = None,
include_reasons: bool = False,
with_title: bool = True,
) -> Table
Return a table containing the data from this report.
If a baseline is provided, returns a diff between this report and the baseline report. Optionally include input and output details.
Table
def failures_table(
include_input: bool = False,
include_metadata: bool = False,
include_expected_output: bool = False,
include_error_message: bool = True,
include_error_stacktrace: bool = True,
input_config: RenderValueConfig | None = None,
metadata_config: RenderValueConfig | None = None,
) -> Table
Return a table containing the failures in this report.
Table
def __str__() -> str
Return a string representation of the report.
Bases: TypedDict
A configuration for rendering a value in an Evaluation report.
Bases: TypedDict
A configuration for rendering a particular score or metric in an Evaluation report.
See the implementation of _RenderNumber for more clarity on how these parameters affect the rendering.
The logic to use for formatting values.
- If not provided, format as ints if all values are ints, otherwise at least one decimal place and at least four significant figures.
- You can also use a custom string format spec, e.g. '{:.3f}'
- You can also use a custom function, e.g. lambda x: f'{x:.3f}'
Type: str | Callable[[float | int], str]
The logic to use for formatting details about the diff.
The strings produced by the value_formatter will always be included in the reports, but the diff_formatter is used to produce additional text about the difference between the old and new values, such as the absolute or relative difference.
- If not provided, format as ints if all values are ints, otherwise at least one decimal place and at least four significant figures, and will include the percentage change.
- You can also use a custom string format spec, e.g. '{:+.3f}'
- You can also use a custom function, e.g. lambda x: f'{x:+.3f}'. If this function returns None, no extra diff text will be added.
- You can also use None to never generate extra diff text.
Type: str | Callable[[float | int, float | int], str | None] | None
The absolute tolerance for considering a difference “significant”.
A difference is “significant” if abs(new - old) > self.diff_atol + self.diff_rtol * abs(old).
If a difference is not significant, it will not have the diff styles applied. Note that we still show both the rendered before and after values in the diff any time they differ, even if the difference is not significant. (If the rendered values are exactly the same, we only show the value once.)
If not provided, use 1e-6.
Type: float
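The significance test can be sketched in plain Python: a difference counts as significant when it exceeds the combined absolute and relative tolerance. Here atol and rtol stand in for diff_atol and diff_rtol:

```python
def is_significant(old: float, new: float,
                   atol: float = 1e-6, rtol: float = 0.05) -> bool:
    """True when abs(new - old) exceeds atol + rtol * abs(old)."""
    return abs(new - old) > atol + rtol * abs(old)


print(is_significant(100.0, 101.0))  # False: within the 5% relative tolerance
print(is_significant(100.0, 110.0))  # True: a 10% change exceeds the tolerance
```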
The relative tolerance for considering a difference “significant”.
See the description of diff_atol for more details about what makes a difference “significant”.
If not provided, use 0.001 if all values are ints, otherwise 0.05.
Type: float
The style to apply to diffed values that have a significant increase.
See the description of diff_atol for more details about what makes a difference “significant”.
If not provided, use green for scores and red for metrics. You can also use arbitrary rich styles, such as “bold red”.
Type: str
The style to apply to diffed values that have a significant decrease.
See the description of diff_atol for more details about what makes a difference “significant”.
If not provided, use red for scores and green for metrics. You can also use arbitrary rich styles, such as “bold red”.
Type: str
def build_base_table(title: str) -> Table
Build and return a Rich Table for the diff output.
Table
def build_failures_table(title: str) -> Table
Build and return a Rich Table for the failures output.
Table
def build_row(case: ReportCase) -> list[str]
Build a table row for a single case.
def build_aggregate_row(aggregate: ReportCaseAggregate) -> list[str]
Build a table row for an aggregated case.
def build_diff_row(new_case: ReportCase, baseline: ReportCase) -> list[str]
Build a table row for a given case ID.
def build_diff_aggregate_row(
new: ReportCaseAggregate,
baseline: ReportCaseAggregate,
) -> list[str]
Build a table row for a given case ID.
def build_failure_row(case: ReportCaseFailure) -> list[str]
Build a table row for a single case failure.
A class for rendering an EvalReport or the diff between two EvalReports.
def build_table(report: EvaluationReport, with_title: bool = True) -> Table
Build a table for the report.
Table — A Rich Table object
The evaluation report to render
with_title : bool Default: True
Whether to include the title in the table (default True)
def build_diff_table(
report: EvaluationReport,
baseline: EvaluationReport,
with_title: bool = True,
) -> Table
Build a diff table comparing report to baseline.
Table — A Rich Table object
The evaluation report to compare
The baseline report to compare against
with_title : bool Default: True
Whether to include the title in the table (default True)
Discriminated union of all report-level analysis types.
Default: Annotated[ConfusionMatrix | PrecisionRecall | ScalarResult | TableResult | LinePlot, Discriminator('type')]