# Quick Start

**Pydantic Evals** is a framework for systematically testing and evaluating AI systems, from simple LLM calls to complex multi-agent applications.

## What is Pydantic Evals?

Pydantic Evals helps you:

-   **Create test datasets** with type-safe structured inputs and expected outputs
-   **Run evaluations** against your AI systems with automatic concurrency
-   **Score results** using deterministic checks, LLM judges, or custom evaluators
-   **Generate reports** with detailed metrics, assertions, and performance data
-   **Track changes** by comparing evaluation runs over time
-   **Integrate with Logfire** for visualization and collaborative analysis

## Installation

```bash
pip install pydantic-evals
```

For OpenTelemetry tracing and Logfire integration:

```bash
pip install 'pydantic-evals[logfire]'
```

## Quick Start

While evaluations are typically used to test AI systems, the Pydantic Evals framework works with any function call. To demonstrate the core functionality, we'll start with a simple, deterministic example.

Here's a complete example of evaluating a simple text transformation function:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, EqualsExpected

# Create a dataset with test cases
dataset = Dataset(
    name='uppercase_tests',
    cases=[
        Case(
            name='uppercase_basic',
            inputs='hello world',
            expected_output='HELLO WORLD',
        ),
        Case(
            name='uppercase_with_numbers',
            inputs='hello 123',
            expected_output='HELLO 123',
        ),
    ],
    evaluators=[
        EqualsExpected(),  # Check exact match with expected_output
        Contains(value='HELLO', case_sensitive=True),  # Check contains "HELLO"
    ],
)


# Define the function to evaluate
def uppercase_text(text: str) -> str:
    return text.upper()


# Run the evaluation
report = dataset.evaluate_sync(uppercase_text)

# Print the results
report.print()
"""
        Evaluation Summary: uppercase_text
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID                ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ uppercase_basic        │ ✔✔         │     10ms │
├────────────────────────┼────────────┼──────────┤
│ uppercase_with_numbers │ ✔✔         │     10ms │
├────────────────────────┼────────────┼──────────┤
│ Averages               │ 100.0% ✔   │     10ms │
└────────────────────────┴────────────┴──────────┘
"""
```

## Key Concepts

Understanding a few core concepts will help you get the most out of Pydantic Evals:

-   **[`Dataset`](/docs/ai/api/pydantic_evals/dataset/#pydantic_evals.dataset.Dataset)** - A collection of test cases and (optional) evaluators
-   **[`Case`](/docs/ai/api/pydantic_evals/dataset/#pydantic_evals.dataset.Case)** - A single test scenario with inputs, plus optional expected outputs and case-specific evaluators
-   **[`Evaluator`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.Evaluator)** - A class that scores or validates task outputs
-   **[`EvaluationReport`](/docs/ai/api/pydantic_evals/reporting/#pydantic_evals.reporting.EvaluationReport)** - Results from running an evaluation

For a deeper dive, see [Core Concepts](/docs/ai/evals/core-concepts).
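
To make the relationship between these concepts concrete, here is a minimal plain-Python sketch of the evaluation flow. It does not use or reproduce Pydantic Evals internals: `SketchCase`, `equals_expected`, `contains_hello`, and `run_cases` are hypothetical names invented for illustration, and the real library additionally handles concurrency, typing, and report rendering.

```python
from dataclasses import dataclass
from typing import Any, Callable


# Hypothetical stand-in for a test case; NOT the pydantic_evals Case class.
@dataclass
class SketchCase:
    name: str
    inputs: Any
    expected_output: Any = None


# In this sketch an "evaluator" is any callable taking (output, case)
# and returning True (pass) or False (fail).
SketchEvaluator = Callable[[Any, SketchCase], bool]


def equals_expected(output: Any, case: SketchCase) -> bool:
    return output == case.expected_output


def contains_hello(output: Any, case: SketchCase) -> bool:
    return 'HELLO' in output


def run_cases(
    task: Callable[[Any], Any],
    cases: list[SketchCase],
    evaluators: list[SketchEvaluator],
) -> dict[str, list[bool]]:
    """Run the task on every case and apply every evaluator to its output."""
    report: dict[str, list[bool]] = {}
    for case in cases:
        output = task(case.inputs)
        report[case.name] = [evaluate(output, case) for evaluate in evaluators]
    return report


cases = [
    SketchCase('uppercase_basic', 'hello world', 'HELLO WORLD'),
    SketchCase('no_hello', 'goodbye', 'GOODBYE'),
]
report = run_cases(str.upper, cases, [equals_expected, contains_hello])
print(report)
```

The dictionary returned by `run_cases` plays the role of the report: one entry per case, one pass/fail result per evaluator.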

## Common Use Cases

### Deterministic Validation

Test that your AI system produces correctly structured outputs:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, IsInstance

dataset = Dataset(
    name='dict_validation',
    cases=[
        Case(inputs={'data': 'required_key present'}, expected_output={'result': 'success'}),
    ],
    evaluators=[
        IsInstance(type_name='dict'),
        Contains(value='required_key'),
    ],
)
```
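
As a rough illustration of what these two checks look for, here is a plain-Python sketch. The `process` task is hypothetical, and the sketch assumes that `Contains` applied to a dict output amounts to a key-membership test; consult the evaluator reference for the exact semantics.

```python
def process(inputs: dict) -> dict:
    """Hypothetical task: echo the input text under 'required_key'."""
    return {'required_key': inputs['data'], 'result': 'success'}


output = process({'data': 'required_key present'})

# Assumed equivalent of IsInstance(type_name='dict'): a type check on the output.
is_dict = isinstance(output, dict)

# Assumed equivalent of Contains(value='required_key') on a dict: key membership.
has_required_key = 'required_key' in output

print(is_dict, has_required_key)  # True True
```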

### LLM-as-a-Judge Evaluation

Use an LLM to evaluate subjective qualities like accuracy or helpfulness:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    name='llm_judge_test',
    cases=[
        Case(inputs='What is the capital of France?', expected_output='Paris'),
    ],
    evaluators=[
        LLMJudge(
            rubric='Response is accurate and helpful',
            include_input=True,
            model='anthropic:claude-sonnet-4-6',
        )
    ],
)
```

### Performance Testing

Ensure your system meets latency requirements by failing any case that runs longer than a set limit:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration

dataset = Dataset(
    name='performance_test',
    cases=[
        Case(inputs='test input', expected_output='test output'),
    ],
    evaluators=[
        MaxDuration(seconds=2.0),
    ],
)
```
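
Conceptually, a duration check times the task and compares the elapsed time against a limit. The sketch below shows that idea using only the standard library; `check_max_duration` and `slow_task` are hypothetical helpers, not the `MaxDuration` implementation.

```python
import time
from typing import Any, Callable


def check_max_duration(
    task: Callable[[Any], Any], inputs: Any, max_seconds: float
) -> tuple[bool, float]:
    """Run the task once and report whether it finished within max_seconds."""
    start = time.perf_counter()
    task(inputs)
    elapsed = time.perf_counter() - start
    return elapsed <= max_seconds, elapsed


def slow_task(text: str) -> str:
    # Hypothetical slow task: simulate latency with a short sleep.
    time.sleep(0.05)
    return text.upper()


fast_ok, _ = check_max_duration(str.upper, 'test input', max_seconds=2.0)
slow_ok, slow_elapsed = check_max_duration(slow_task, 'test input', max_seconds=0.01)
print(fast_ok, slow_ok)
```

A single timing like this is noisy, so in practice a limit is usually set with some headroom above the typical run time.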

## Next Steps

Explore the documentation to learn more:

-   **[Core Concepts](/docs/ai/evals/core-concepts)** - Understand the data model and evaluation flow
-   **[Native Evaluators](/docs/ai/evals/evaluators/built-in)** - Learn about all available evaluators
-   **[Custom Evaluators](/docs/ai/evals/evaluators/custom)** - Write your own evaluation logic
-   **[Dataset Management](/docs/ai/evals/how-to/dataset-management)** - Save, load, and generate datasets
-   **[Examples](/docs/ai/evals/examples/simple-validation)** - Practical examples for common scenarios