pydantic_evals.dataset

Dataset management for pydantic evals.

This module provides functionality for creating, loading, saving, and evaluating datasets of test cases. Each case must have inputs, and can optionally have a name, expected output, metadata, and case-specific evaluators.

Datasets can be loaded from and saved to YAML or JSON files, and can be evaluated against a task function to produce an evaluation report.

Case

Bases: Generic[InputsT, OutputT, MetadataT]

A single row of a Dataset.

Each case represents a single test scenario with inputs to test. A case may optionally specify a name, expected outputs to compare against, and arbitrary metadata.

Cases can also have their own specific evaluators which are run in addition to dataset-level evaluators.

Example:

from pydantic_evals import Case

case = Case(
    name='Simple addition',
    inputs={'a': 1, 'b': 2},
    expected_output=3,
    metadata={'description': 'Tests basic addition'},
)

Attributes

name

Name of the case. This is used to identify the case in the report and can be used to filter cases.

Type: str | None Default: name

inputs

Inputs to the task being evaluated.

Type: InputsT Default: inputs

metadata

Metadata to be used in the evaluation.

This can be used to provide additional information about the case to the evaluators.

Type: MetadataT | None Default: metadata

expected_output

Expected output of the task, used by evaluators for comparison against the actual output.

Type: OutputT | None Default: expected_output

evaluators

Evaluators to be used just on this case.

Type: list[Evaluator[InputsT, OutputT, MetadataT]] Default: list(evaluators)

Methods

__init__
def __init__(
    *,
    name: str | None = None,
    inputs: InputsT,
    metadata: MetadataT | None = None,
    expected_output: OutputT | None = None,
    evaluators: tuple[Evaluator[InputsT, OutputT, MetadataT], ...] = (),
)

Initialize a new test case.

Parameters

name : str | None Default: None

Optional name for the case. If not provided, a generic name will be assigned when added to a dataset.

inputs : InputsT

The inputs to the task being evaluated.

metadata : MetadataT | None Default: None

Optional metadata for the case, which can be used by evaluators.

expected_output : OutputT | None Default: None

Optional expected output of the task, used for comparison in evaluators.

evaluators : tuple[Evaluator[InputsT, OutputT, MetadataT], ...] Default: ()

Tuple of evaluators specific to this case. These are in addition to any dataset-level evaluators.

Dataset

Bases: BaseModel, Generic[InputsT, OutputT, MetadataT]

A dataset of test cases.

Datasets allow you to organize a collection of test cases and evaluate them against a task function. They can be loaded from and saved to YAML or JSON files, and can have dataset-level evaluators that apply to all cases.

Example:

# Create a dataset with two test cases
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output

dataset = Dataset(
    name='uppercase_tests',
    cases=[
        Case(name='test1', inputs={'text': 'Hello'}, expected_output='HELLO'),
        Case(name='test2', inputs={'text': 'World'}, expected_output='WORLD'),
    ],
    evaluators=[ExactMatch()],
)

# Evaluate the dataset against a task function
async def uppercase(inputs: dict) -> str:
    return inputs['text'].upper()

async def main():
    report = await dataset.evaluate(uppercase)
    report.print()
'''
   Evaluation Summary: uppercase
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID  ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ test1    │ ✔          │     10ms │
├──────────┼────────────┼──────────┤
│ test2    │ ✔          │     10ms │
├──────────┼────────────┼──────────┤
│ Averages │ 100.0% ✔   │     10ms │
└──────────┴────────────┴──────────┘
'''

Attributes

name

Name of the dataset. This will become required in a future version.

Type: str | None Default: None

cases

List of test cases in the dataset.

Type: list[Case[InputsT, OutputT, MetadataT]]

evaluators

List of evaluators to be used on all cases in the dataset.

Type: list[Evaluator[InputsT, OutputT, MetadataT]] Default: []

report_evaluators

Evaluators that operate on the full report to produce experiment-wide analyses.

Type: list[ReportEvaluator[InputsT, OutputT, MetadataT]] Default: []

Methods

__init__
def __init__(
    *,
    name: str | None = None,
    cases: Sequence[Case[InputsT, OutputT, MetadataT]],
    evaluators: Sequence[Evaluator[InputsT, OutputT, MetadataT]] = (),
    report_evaluators: Sequence[ReportEvaluator[InputsT, OutputT, MetadataT]] = (),
)

Initialize a new dataset with test cases and optional evaluators.

Parameters

name : str | None Default: None

Name for the dataset. Omitting this is deprecated and will raise an error in a future version.

cases : Sequence[Case[InputsT, OutputT, MetadataT]]

Sequence of test cases to include in the dataset.

evaluators : Sequence[Evaluator[InputsT, OutputT, MetadataT]] Default: ()

Optional sequence of evaluators to apply to all cases in the dataset.

report_evaluators : Sequence[ReportEvaluator[InputsT, OutputT, MetadataT]] Default: ()

Optional sequence of report evaluators that run on the full evaluation report.

evaluate

async def evaluate(
    task: Callable[[InputsT], Awaitable[OutputT]] | Callable[[InputsT], OutputT],
    name: str | None = None,
    max_concurrency: int | None = None,
    progress: bool = True,
    retry_task: RetryConfig | None = None,
    retry_evaluators: RetryConfig | None = None,
    task_name: str | None = None,
    metadata: dict[str, Any] | None = None,
    repeat: int = 1,
    lifecycle: type[CaseLifecycle[InputsT, OutputT, MetadataT]] | None = None,
) -> EvaluationReport[InputsT, OutputT, MetadataT]

Evaluates the test cases in the dataset using the given task.

This method runs the task on each case in the dataset, applies evaluators, and collects results into a report. Cases are run concurrently, limited by max_concurrency if specified.

Returns

EvaluationReport[InputsT, OutputT, MetadataT] — A report containing the results of the evaluation.

Parameters

task : Callable[[InputsT], Awaitable[OutputT]] | Callable[[InputsT], OutputT]

The task to evaluate. This should be a callable that takes the inputs of the case and returns the output.

name : str | None Default: None

The name of the experiment being run; this is used to identify the experiment in the report. If omitted, task_name is used; if that is also unspecified, the name of the task function is used.

max_concurrency : int | None Default: None

The maximum number of concurrent evaluations of the task to allow. If None, all cases will be evaluated concurrently.

progress : bool Default: True

Whether to show a progress bar for the evaluation. Defaults to True.

retry_task : RetryConfig | None Default: None

Optional retry configuration for the task execution.

retry_evaluators : RetryConfig | None Default: None

Optional retry configuration for evaluator execution.

task_name : str | None Default: None

Optional override for the name of the task being executed; if not provided, the name of the task function is used.

metadata : dict[str, Any] | None Default: None

Optional dict of experiment metadata.

repeat : int Default: 1

Number of times to run each case. When > 1, each case is run multiple times and results are grouped by the original case name for aggregation. Defaults to 1.

lifecycle : type[CaseLifecycle[InputsT, OutputT, MetadataT]] | None Default: None

Optional lifecycle class for per-case setup, context preparation, and teardown hooks. A new instance is created for each case. See CaseLifecycle.

evaluate_sync
def evaluate_sync(
    task: Callable[[InputsT], Awaitable[OutputT]] | Callable[[InputsT], OutputT],
    name: str | None = None,
    max_concurrency: int | None = None,
    progress: bool = True,
    retry_task: RetryConfig | None = None,
    retry_evaluators: RetryConfig | None = None,
    task_name: str | None = None,
    metadata: dict[str, Any] | None = None,
    repeat: int = 1,
    lifecycle: type[CaseLifecycle[InputsT, OutputT, MetadataT]] | None = None,
) -> EvaluationReport[InputsT, OutputT, MetadataT]

Evaluates the test cases in the dataset using the given task.

This is a synchronous wrapper around evaluate provided for convenience.

Returns

EvaluationReport[InputsT, OutputT, MetadataT] — A report containing the results of the evaluation.

Parameters

task : Callable[[InputsT], Awaitable[OutputT]] | Callable[[InputsT], OutputT]

The task to evaluate. This should be a callable that takes the inputs of the case and returns the output.

name : str | None Default: None

The name of the experiment being run; this is used to identify the experiment in the report. If omitted, task_name is used; if that is also unspecified, the name of the task function is used.

max_concurrency : int | None Default: None

The maximum number of concurrent evaluations of the task to allow. If None, all cases will be evaluated concurrently.

progress : bool Default: True

Whether to show a progress bar for the evaluation. Defaults to True.

retry_task : RetryConfig | None Default: None

Optional retry configuration for the task execution.

retry_evaluators : RetryConfig | None Default: None

Optional retry configuration for evaluator execution.

task_name : str | None Default: None

Optional override for the name of the task being executed; if not provided, the name of the task function is used.

metadata : dict[str, Any] | None Default: None

Optional dict of experiment metadata.

repeat : int Default: 1

Number of times to run each case. When > 1, each case is run multiple times and results are grouped by the original case name for aggregation. Defaults to 1.

lifecycle : type[CaseLifecycle[InputsT, OutputT, MetadataT]] | None Default: None

Optional lifecycle class for per-case setup, context preparation, and teardown hooks. A new instance is created for each case. See CaseLifecycle.

add_case
def add_case(
    *,
    name: str | None = None,
    inputs: InputsT,
    metadata: MetadataT | None = None,
    expected_output: OutputT | None = None,
    evaluators: tuple[Evaluator[InputsT, OutputT, MetadataT], ...] = (),
) -> None

Adds a case to the dataset.

This is a convenience method for creating a Case and adding it to the dataset.

Returns

None

Parameters

name : str | None Default: None

Optional name for the case. If not provided, a generic name will be assigned.

inputs : InputsT

The inputs to the task being evaluated.

metadata : MetadataT | None Default: None

Optional metadata for the case, which can be used by evaluators.

expected_output : OutputT | None Default: None

The expected output of the task, used for comparison in evaluators.

evaluators : tuple[Evaluator[InputsT, OutputT, MetadataT], ...] Default: ()

Tuple of evaluators specific to this case, in addition to dataset-level evaluators.

add_evaluator
def add_evaluator(
    evaluator: Evaluator[InputsT, OutputT, MetadataT],
    specific_case: str | None = None,
) -> None

Adds an evaluator to the dataset or a specific case.

Returns

None

Parameters

evaluator : Evaluator[InputsT, OutputT, MetadataT]

The evaluator to add.

specific_case : str | None Default: None

If provided, the evaluator will only be added to the case with this name. If None, the evaluator will be added to all cases in the dataset.

Raises
  • ValueError — If specific_case is provided but no case with that name exists in the dataset.

from_file

@classmethod

def from_file(
    cls,
    path: Path | str,
    fmt: Literal['yaml', 'json'] | None = None,
    custom_evaluator_types: Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] = (),
    custom_report_evaluator_types: Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] = (),
) -> Self

Load a dataset from a file.

Returns

Self — A new Dataset instance loaded from the file.

Parameters

path : Path | str

Path to the file to load.

fmt : Literal['yaml', 'json'] | None Default: None

Format of the file. If None, the format will be inferred from the file extension. Must be either 'yaml' or 'json'.

custom_evaluator_types : Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom evaluator classes to use when deserializing the dataset. These are additional evaluators beyond the default ones.

custom_report_evaluator_types : Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom report evaluator classes to use when deserializing the dataset. These are additional report evaluators beyond the default ones.

Raises
  • ValidationError — If the file cannot be parsed as a valid dataset.
  • ValueError — If the format cannot be inferred from the file extension.

from_text

@classmethod

def from_text(
    cls,
    contents: str,
    fmt: Literal['yaml', 'json'] = 'yaml',
    custom_evaluator_types: Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] = (),
    custom_report_evaluator_types: Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] = (),
    default_name: str | None = None,
) -> Self

Load a dataset from a string.

Returns

Self — A new Dataset instance parsed from the string.

Parameters

contents : str

The string content to parse.

fmt : Literal['yaml', 'json'] Default: 'yaml'

Format of the content. Must be either 'yaml' or 'json'.

custom_evaluator_types : Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom evaluator classes to use when deserializing the dataset. These are additional evaluators beyond the default ones.

custom_report_evaluator_types : Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom report evaluator classes to use when deserializing the dataset. These are additional report evaluators beyond the default ones.

default_name : str | None Default: None

Default name of the dataset, to be used if not specified in the serialized contents.

Raises
  • ValidationError — If the content cannot be parsed as a valid dataset.

from_dict

@classmethod

def from_dict(
    cls,
    data: dict[str, Any],
    custom_evaluator_types: Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] = (),
    custom_report_evaluator_types: Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] = (),
    default_name: str | None = None,
) -> Self

Load a dataset from a dictionary.

Returns

Self — A new Dataset instance created from the dictionary.

Parameters

data : dict[str, Any]

Dictionary representation of the dataset.

custom_evaluator_types : Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom evaluator classes to use when deserializing the dataset. These are additional evaluators beyond the default ones.

custom_report_evaluator_types : Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom report evaluator classes to use when deserializing the dataset. These are additional report evaluators beyond the default ones.

default_name : str | None Default: None

Default name of the dataset, to be used if not specified in the data.

Raises
  • ValidationError — If the dictionary cannot be converted to a valid dataset.

to_file
def to_file(
    path: Path | str,
    fmt: Literal['yaml', 'json'] | None = None,
    schema_path: Path | str | None = DEFAULT_SCHEMA_PATH_TEMPLATE,
    custom_evaluator_types: Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] = (),
    custom_report_evaluator_types: Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] = (),
)

Save the dataset to a file.

Parameters

path : Path | str

Path to save the dataset to.

fmt : Literal['yaml', 'json'] | None Default: None

Format to use. If None, the format will be inferred from the file extension. Must be either 'yaml' or 'json'.

schema_path : Path | str | None Default: DEFAULT_SCHEMA_PATH_TEMPLATE

Path to save the JSON schema to. If None, no schema will be saved. Can be a string template with {stem} which will be replaced with the dataset filename stem.

custom_evaluator_types : Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom evaluator classes to include in the schema.

custom_report_evaluator_types : Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom report evaluator classes to include in the schema.

model_json_schema_with_evaluators

@classmethod

def model_json_schema_with_evaluators(
    cls,
    custom_evaluator_types: Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] = (),
    custom_report_evaluator_types: Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] = (),
) -> dict[str, Any]

Generate a JSON schema for this dataset type, including evaluator details.

This is useful for generating a schema that can be used to validate YAML-format dataset files.

Returns

dict[str, Any] — A dictionary representing the JSON schema.

Parameters

custom_evaluator_types : Sequence[type[Evaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom evaluator classes to include in the schema.

custom_report_evaluator_types : Sequence[type[ReportEvaluator[InputsT, OutputT, MetadataT]]] Default: ()

Custom report evaluator classes to include in the schema.

set_eval_attribute

def set_eval_attribute(name: str, value: Any) -> None

Set an attribute on the current task run.

Returns

None

Parameters

name : str

The name of the attribute.

value : Any

The value of the attribute.

increment_eval_metric

def increment_eval_metric(name: str, amount: int | float) -> None

Increment a metric on the current task run.

Returns

None

Parameters

name : str

The name of the metric.

amount : int | float

The amount to increment by.

InputsT

Generic type for the inputs to the task being evaluated.

Default: TypeVar('InputsT', default=Any)

OutputT

Generic type for the expected output of the task being evaluated.

Default: TypeVar('OutputT', default=Any)

MetadataT

Generic type for the metadata associated with the task being evaluated.

Default: TypeVar('MetadataT', default=Any)

DEFAULT_DATASET_PATH

Default path for saving/loading datasets.

Default: './test_cases.yaml'

DEFAULT_SCHEMA_PATH_TEMPLATE

Default template for schema file paths, where {stem} is replaced with the dataset filename stem.

Default: './{stem}_schema.json'
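The {stem} placeholder is filled with the dataset file's stem, e.g.:

```python
from pathlib import Path

DEFAULT_SCHEMA_PATH_TEMPLATE = './{stem}_schema.json'

dataset_path = Path('./test_cases.yaml')
schema_path = DEFAULT_SCHEMA_PATH_TEMPLATE.format(stem=dataset_path.stem)
# → './test_cases_schema.json'
```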