# Multi-Run Evaluation

Run each case multiple times to measure variability and get more reliable aggregate results.

AI systems are inherently stochastic -- the same input can produce different outputs across runs. The `repeat` parameter lets you run each case multiple times and automatically aggregates the results, giving you a clearer picture of your system's typical behavior.

## Basic Usage

Pass `repeat` to [`evaluate()`](/docs/ai/api/pydantic_evals/dataset/#pydantic_evals.dataset.Dataset.evaluate) or [`evaluate_sync()`](/docs/ai/api/pydantic_evals/dataset/#pydantic_evals.dataset.Dataset.evaluate_sync):

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    name='multi_run_basic',
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ],
)


def task(inputs: str) -> str:
    return inputs.upper()


# Run each case 5 times
report = dataset.evaluate_sync(task, repeat=5)

# 2 cases × 5 repeats = 10 total runs
print(len(report.cases))
#> 10
```

When `repeat > 1`, each run gets an indexed name like `greeting [1/5]`, `greeting [2/5]`, etc., while the original case name is preserved in [`source_case_name`](/docs/ai/api/pydantic_evals/reporting/#pydantic_evals.reporting.ReportCase.source_case_name) for grouping.

## Accessing Grouped Results

Use [`case_groups()`](/docs/ai/api/pydantic_evals/reporting/#pydantic_evals.reporting.EvaluationReport.case_groups) to access runs organized by original case, with per-group aggregated statistics:

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    name='grouped_results',
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ],
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

groups = report.case_groups()
assert groups is not None  # case_groups() returns None only when repeat=1

print(len(groups))
#> 2

group_names = [g.name for g in groups]
print(group_names)
#> ['greeting', 'farewell']

# Each group has 3 runs and aggregated statistics
for group in groups:
    assert len(group.runs) == 3
    assert len(group.failures) == 0
    assert group.summary.task_duration > 0
```

Each [`ReportCaseGroup`](/docs/ai/api/pydantic_evals/reporting/#pydantic_evals.reporting.ReportCaseGroup) contains:

-   `name` -- the original case name
-   `runs` -- the individual [`ReportCase`](/docs/ai/api/pydantic_evals/reporting/#pydantic_evals.reporting.ReportCase) results
-   `failures` -- any runs that raised exceptions
-   `summary` -- a [`ReportCaseAggregate`](/docs/ai/api/pydantic_evals/reporting/#pydantic_evals.reporting.ReportCaseAggregate) with averaged scores, metrics, labels, assertions, and durations

## Aggregation

With `repeat > 1`, the report's `averages()` uses a two-level aggregation strategy:

1.  **Per-group averages**: Each case's runs are averaged into a group summary
2.  **Cross-group averages**: The group summaries are averaged to produce the final result

This ensures each original case contributes equally to the overall averages, even when some cases end up with fewer successful runs than others.

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    name='aggregation',
    cases=[
        Case(name='easy', inputs='hello', expected_output='HELLO'),
        Case(name='hard', inputs='world', expected_output='WORLD'),
    ],
    evaluators=[EqualsExpected()],
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

averages = report.averages()
assert averages is not None
print(f'Overall assertion rate: {averages.assertions}')
#> Overall assertion rate: 1.0
```
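Because every run passes in the example above, it does not show how the two-level strategy differs from a flat average. The following plain-Python sketch works through the arithmetic with made-up scores. It does not call `pydantic_evals`, and it assumes that runs which raised exceptions contribute no score, so the `hard` case has fewer scored runs than the `easy` case.

```python
from statistics import mean

# Hypothetical scores from repeat=4: two 'hard' runs raised exceptions,
# so only two of its runs produced a score (an assumption for illustration).
scores_by_case = {
    'easy': [1.0, 1.0, 1.0, 1.0],
    'hard': [0.0, 0.0],
}

# Flat average: every scored run counts once, so 'easy' dominates
flat = mean(score for scores in scores_by_case.values() for score in scores)

# Two-level average: average within each case, then across cases
group_means = {name: mean(scores) for name, scores in scores_by_case.items()}
two_level = mean(group_means.values())

print(round(flat, 3))
#> 0.667
print(two_level)
#> 0.5
```

With the flat average, the case with more scored runs carries more weight; with the two-level average, `easy` and `hard` each contribute exactly half of the result.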

## Default Behavior

When `repeat=1` (the default), behavior is identical to a standard evaluation: run names are not indexed, `source_case_name` is `None`, and `case_groups()` returns `None`:

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(name='default_behavior', cases=[Case(name='test', inputs='hello')])


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task)  # repeat=1 by default

assert report.case_groups() is None
assert all(c.source_case_name is None for c in report.cases)
```

## Next Steps

-   **[Concurrency & Performance](/docs/ai/evals/how-to/concurrency)** -- Control parallel execution with `max_concurrency`
-   **[Metrics & Attributes](/docs/ai/evals/how-to/metrics-attributes)** -- Track custom metrics across runs
-   **[Logfire Integration](/docs/ai/evals/how-to/logfire-integration)** -- Visualize multi-run results