# Built-in Evaluators

Pydantic Evals provides several built-in evaluators for common evaluation tasks.

## Comparison Evaluators

### EqualsExpected

Check if the output exactly equals the expected output from the case.

```python
from pydantic_evals.evaluators import EqualsExpected

EqualsExpected()
```

**Parameters:** None

**Returns:** `bool` - `True` if `ctx.output == ctx.expected_output`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    name='equals_expected_demo',
    cases=[
        Case(
            name='addition',
            inputs='2 + 2',
            expected_output='4',
        ),
    ],
    evaluators=[EqualsExpected()],
)
```

**Notes:**

-   Skips evaluation if `expected_output` is `None` (returns empty dict `{}`)
-   Uses Python's `==` operator, so it works with any types that support equality comparison
-   For structured data, nested values are compared for equality as well
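The notes above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: `equals_expected_sketch` is a hypothetical helper that only mimics the documented behavior (an empty dict when the case has no expected output, otherwise the result of `==`).

```python
from typing import Any


def equals_expected_sketch(output: Any, expected_output: Any) -> 'bool | dict':
    """Hypothetical stand-in that mirrors the documented EqualsExpected behavior."""
    if expected_output is None:
        # No expected output on the case: nothing to compare, so no result is produced.
        return {}
    # Python's == compares nested containers element by element.
    return output == expected_output


print(equals_expected_sketch('4', '4'))                        # True
print(equals_expected_sketch({'a': [1, 2]}, {'a': [1, 3]}))    # False
print(equals_expected_sketch('anything', None))                # {}
```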

* * *

### Equals

Check if the output equals a specific value.

```python
from pydantic_evals.evaluators import Equals

Equals(value='expected_result')
```

**Parameters:**

-   `value` (Any): The value to compare against
-   `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** `bool` - `True` if `ctx.output == value`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Equals

# Check output is always "success"
dataset = Dataset(
    name='equals_demo',
    cases=[Case(inputs='test')],
    evaluators=[
        Equals(value='success', evaluation_name='is_success'),
    ],
)
```

**Use Cases:**

-   Checking for sentinel values
-   Validating consistent outputs
-   Testing classification into specific categories

* * *

### Contains

Check if the output contains a specific value or substring.

```python
from pydantic_evals.evaluators import Contains

Contains(
    value='substring',
    case_sensitive=True,
    as_strings=False,
)
```

**Parameters:**

-   `value` (Any): The value to search for
-   `case_sensitive` (bool): Case-sensitive comparison for strings (default: `True`)
-   `as_strings` (bool): Convert both values to strings before checking (default: `False`)
-   `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** [`EvaluationReason`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EvaluationReason) - Pass/fail with explanation

**Behavior:**

For **strings**: checks substring containment

-   `Contains(value='hello', case_sensitive=False)`
    -   Matches: "Hello World", "say hello", "HELLO"
    -   Doesn't match: "hi there"

For **lists/tuples**: checks membership

-   `Contains(value='apple')`
    -   Matches: `['apple', 'banana']`, `('apple',)`
    -   Doesn't match: `['apples', 'orange']`

For **dicts**: checks key-value pairs

-   `Contains(value={'name': 'Alice'})`
    -   Matches: `{'name': 'Alice', 'age': 30}`
    -   Doesn't match: `{'name': 'Bob'}`
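The three behaviors above can be approximated in plain Python. This is a rough sketch under stated assumptions, not the library's code: `contains_sketch` is a hypothetical helper, and it returns a plain `bool` rather than an `EvaluationReason`.

```python
from typing import Any


def contains_sketch(
    output: Any,
    value: Any,
    case_sensitive: bool = True,
    as_strings: bool = False,
) -> bool:
    """Hypothetical helper approximating the containment rules described above."""
    if as_strings or (isinstance(output, str) and isinstance(value, str)):
        haystack, needle = str(output), str(value)
        if not case_sensitive:
            haystack, needle = haystack.lower(), needle.lower()
        return needle in haystack  # substring containment
    if isinstance(output, dict) and isinstance(value, dict):
        # Every key-value pair in `value` must also appear in `output`.
        return all(k in output and output[k] == v for k, v in value.items())
    try:
        return value in output  # membership for lists, tuples, and similar containers
    except TypeError:
        # `output` is not a container (assumption: treat this as "not contained").
        return False


print(contains_sketch('Hello World', 'hello', case_sensitive=False))    # True
print(contains_sketch(['apples', 'orange'], 'apple'))                   # False
print(contains_sketch({'name': 'Alice', 'age': 30}, {'name': 'Alice'})) # True
```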

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains

dataset = Dataset(
    name='contains_demo',
    cases=[Case(inputs='test')],
    evaluators=[
        # Check for required keywords
        Contains(value='terms and conditions', case_sensitive=False),
        # Contains passes when the value is present, so it cannot flag PII by itself;
        # to fail when PII is found, use a custom evaluator that returns False
    ],
)
```

**Use Cases:**

-   Required content verification
-   Keyword detection
-   PII/sensitive data detection
-   Multi-value validation

* * *

## Type Validation

### IsInstance

Check if the output is an instance of a type with the given name.

```python
from pydantic_evals.evaluators import IsInstance

IsInstance(type_name='str')
```

**Parameters:**

-   `type_name` (str): The type name to check (uses `__name__` or `__qualname__`)
-   `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** [`EvaluationReason`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EvaluationReason) - Pass/fail with type information

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

dataset = Dataset(
    name='isinstance_demo',
    cases=[Case(inputs='test')],
    evaluators=[
        # Check output is always a string
        IsInstance(type_name='str'),
        # Check for Pydantic model
        IsInstance(type_name='MyModel'),
        # Check for dict
        IsInstance(type_name='dict'),
    ],
)
```

**Notes:**

-   Matches against both `__name__` and `__qualname__` of the type
-   Works with built-in types (`str`, `int`, `dict`, `list`, etc.)
-   Works with custom classes and Pydantic models
-   Checks the entire MRO (Method Resolution Order) for inheritance
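As a rough illustration of name-based matching across the MRO, the following plain-Python sketch walks `type(output).__mro__` and compares each class's `__name__` and `__qualname__`. The helper `is_instance_by_name` and the `Animal`/`Dog` classes are hypothetical, and the real evaluator returns an `EvaluationReason` rather than a bare `bool`.

```python
def is_instance_by_name(output: object, type_name: str) -> bool:
    """Hypothetical helper: match type_name against every class in the MRO."""
    for cls in type(output).__mro__:
        if type_name in (cls.__name__, cls.__qualname__):
            return True
    return False


class Animal:
    pass


class Dog(Animal):
    pass


print(is_instance_by_name('hi', 'str'))      # True
print(is_instance_by_name(Dog(), 'Animal'))  # True, matched through inheritance
print(is_instance_by_name(Dog(), 'Cat'))     # False
```

Because matching is by name only, two unrelated classes that share a name would both match, which is worth keeping in mind when naming output models.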

**Use Cases:**

-   Format validation
-   Structured output verification
-   Type consistency checks

* * *

## Performance Evaluation

### MaxDuration

Check if task execution time does not exceed a maximum threshold.

```python
from datetime import timedelta

from pydantic_evals.evaluators import MaxDuration

MaxDuration(seconds=2.0)
# or
MaxDuration(seconds=timedelta(seconds=2))
```

**Parameters:**

-   `seconds` (float | timedelta): Maximum allowed duration

**Returns:** `bool` - `True` if `ctx.duration <= seconds`

**Example:**

```python
from datetime import timedelta

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration

dataset = Dataset(
    name='max_duration_demo',
    cases=[Case(inputs='test')],
    evaluators=[
        # SLA: must respond in under 2 seconds
        MaxDuration(seconds=2.0),
        # Or using timedelta
        MaxDuration(seconds=timedelta(milliseconds=500)),
    ],
)
```
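The comparison itself can be sketched in plain Python. This is an illustration only, assuming the duration is measured in seconds as a `float`; `within_max_duration` is a hypothetical helper, not part of the library.

```python
from datetime import timedelta
from typing import Union


def within_max_duration(duration: float, seconds: Union[float, timedelta]) -> bool:
    """Hypothetical helper: normalize the limit to seconds, then compare with <=."""
    limit = seconds.total_seconds() if isinstance(seconds, timedelta) else float(seconds)
    return duration <= limit


print(within_max_duration(1.5, 2.0))                           # True
print(within_max_duration(2.0, 2.0))                           # True, the limit is inclusive
print(within_max_duration(0.6, timedelta(milliseconds=500)))   # False
```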

**Use Cases:**

-   SLA compliance
-   Performance regression testing
-   Latency requirements
-   Timeout validation

**See Also:** [Concurrency & Performance](/docs/ai/evals/how-to/concurrency)

* * *

## LLM-as-a-Judge

### LLMJudge

Use an LLM to evaluate subjective qualities based on a rubric.

```python
from pydantic_evals.evaluators import LLMJudge

LLMJudge(
    rubric='Response is accurate and helpful',
    model='openai:gpt-5.2',
    include_input=False,
    include_expected_output=False,
    model_settings=None,
    score=False,
    assertion={'include_reason': True},
)
```

**Parameters:**

-   `rubric` (str): The evaluation criteria (required)
-   `model` (Model | KnownModelName | None): Model to use (default: `'openai:gpt-5.2'`)
-   `include_input` (bool): Include task inputs in the prompt (default: `False`)
-   `include_expected_output` (bool): Include expected output in the prompt (default: `False`)
-   `model_settings` (ModelSettings | None): Custom model settings
-   `score` (OutputConfig | False): Configure score output (default: `False`)
-   `assertion` (OutputConfig | False): Configure assertion output (default: `{'include_reason': True}`)

**Returns:** Depends on `score` and `assertion` parameters (see below)

**Output Modes:**

By default, returns a **boolean assertion** with reason:

-   `LLMJudge(rubric='Response is polite')`
    -   Returns: `{'LLMJudge_pass': EvaluationReason(value=True, reason='...')}`

Return a **score** (0.0 to 1.0) instead:

-   `LLMJudge(rubric='Response quality', score={'include_reason': True}, assertion=False)`
    -   Returns: `{'LLMJudge_score': EvaluationReason(value=0.85, reason='...')}`

Return **both** score and assertion:

-   `LLMJudge(rubric='Response quality', score={'include_reason': True}, assertion={'include_reason': True})`
    -   Returns: `{'LLMJudge_score': EvaluationReason(value=0.85, reason='...'), 'LLMJudge_pass': EvaluationReason(value=True, reason='...')}`

**Customize evaluation names:**

-   `LLMJudge(rubric='Response is factually accurate', assertion={'evaluation_name': 'accuracy', 'include_reason': True})`
    -   Returns: `{'accuracy': EvaluationReason(value=True, reason='...')}`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    name='llm_judge_demo',
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Basic accuracy check
        LLMJudge(
            rubric='Response is factually accurate',
            include_input=True,
        ),
        # Quality score with different model
        LLMJudge(
            rubric='Overall response quality',
            model='anthropic:claude-sonnet-4-6',
            score={'evaluation_name': 'quality', 'include_reason': False},
            assertion=False,
        ),
        # Check against expected output
        LLMJudge(
            rubric='Response matches the expected answer semantically',
            include_input=True,
            include_expected_output=True,
        ),
    ],
)
```

**See Also:** [LLM Judge Deep Dive](/docs/ai/evals/evaluators/llm-judge)

* * *

## Span-Based Evaluation

### HasMatchingSpan

Check if OpenTelemetry spans match a query (requires Logfire configuration).

```python
from pydantic_evals.evaluators import HasMatchingSpan

HasMatchingSpan(
    query={'name_contains': 'tool_call'},
    evaluation_name='called_tool',
)
```

**Parameters:**

-   `query` ([`SpanQuery`](/docs/ai/api/pydantic_evals/otel/#pydantic_evals.otel.SpanQuery)): Query to match against spans
-   `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** `bool` - `True` if any span matches the query

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

dataset = Dataset(
    name='span_check_demo',
    cases=[Case(inputs='test')],
    evaluators=[
        # Check that a specific tool was called
        HasMatchingSpan(
            query={'name_contains': 'search_database'},
            evaluation_name='used_database',
        ),
        # Check for errors
        HasMatchingSpan(
            query={'has_attributes': {'error': True}},
            evaluation_name='had_errors',
        ),
        # Check duration constraints
        HasMatchingSpan(
            query={
                'name_equals': 'llm_call',
                'max_duration': 2.0,  # seconds
            },
            evaluation_name='llm_fast_enough',
        ),
    ],
)
```
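To make the query semantics more concrete, here is a simplified plain-Python sketch of span matching. It assumes that all conditions in a query must hold at once, models spans as plain dicts with hypothetical `name`, `attributes`, and `duration` keys, and covers only the four query keys used in the example above. The real evaluator works on the recorded OpenTelemetry span tree, so treat this as an approximation.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so a missing attribute never equals a queried value


def span_matches(span: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Hypothetical matcher: every condition present in the query must hold."""
    if 'name_equals' in query and span['name'] != query['name_equals']:
        return False
    if 'name_contains' in query and query['name_contains'] not in span['name']:
        return False
    if 'has_attributes' in query:
        attrs = span.get('attributes', {})
        for key, expected in query['has_attributes'].items():
            if attrs.get(key, _MISSING) != expected:
                return False
    if 'max_duration' in query and span['duration'] > query['max_duration']:
        return False
    return True


def has_matching_span(spans: List[Dict[str, Any]], query: Dict[str, Any]) -> bool:
    """True if any span satisfies the query."""
    return any(span_matches(span, query) for span in spans)


spans = [
    {'name': 'search_database', 'attributes': {}, 'duration': 0.3},
    {'name': 'llm_call', 'attributes': {'error': True}, 'duration': 2.5},
]
print(has_matching_span(spans, {'name_contains': 'search'}))                       # True
print(has_matching_span(spans, {'name_equals': 'llm_call', 'max_duration': 2.0}))  # False
```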

**See Also:** [Span-Based Evaluation](/docs/ai/evals/evaluators/span-based)

* * *

## Native Report Evaluators

In addition to the case-level evaluators above, Pydantic Evals provides report evaluators that analyze entire experiment results. These are passed via the `report_evaluators` parameter on `Dataset`.

| Report Evaluator | Purpose | Output |
| --- | --- | --- |
| [`ConfusionMatrixEvaluator`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.ConfusionMatrixEvaluator) | Classification confusion matrix | `ConfusionMatrix` |
| [`PrecisionRecallEvaluator`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.PrecisionRecallEvaluator) | PR curve with AUC | `PrecisionRecall` |

**See:** [Report Evaluators](/docs/ai/evals/evaluators/report-evaluators) for full documentation, parameters, and examples, including how to write custom report evaluators that produce `ScalarResult` and `TableResult` analyses.
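For readers unfamiliar with the idea, a confusion matrix counts how often each expected label was predicted as each label across all cases. The sketch below is a generic plain-Python illustration with made-up labels; `confusion_matrix_sketch` is a hypothetical helper and does not reflect the structure of the library's `ConfusionMatrix` result.

```python
from collections import Counter
from typing import List, Tuple


def confusion_matrix_sketch(pairs: List[Tuple[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical helper: rows are expected labels, columns are predicted labels."""
    counts = Counter(pairs)
    labels = sorted({label for pair in pairs for label in pair})
    matrix = [[counts[(expected, predicted)] for predicted in labels] for expected in labels]
    return labels, matrix


# (expected, predicted) pairs gathered across all cases in an experiment
pairs = [('cat', 'cat'), ('cat', 'dog'), ('dog', 'dog'), ('dog', 'dog')]
labels, matrix = confusion_matrix_sketch(pairs)
print(labels)  # ['cat', 'dog']
print(matrix)  # [[1, 1], [0, 2]]
```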

* * *

## Quick Reference Table

### Case-Level Evaluators

| Evaluator | Purpose | Return Type | Cost | Speed |
| --- | --- | --- | --- | --- |
| [`EqualsExpected`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EqualsExpected) | Exact match with expected | `bool` | Free | Instant |
| [`Equals`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.Equals) | Equals specific value | `bool` | Free | Instant |
| [`Contains`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.Contains) | Contains value/substring | `bool` + reason | Free | Instant |
| [`IsInstance`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.IsInstance) | Type validation | `bool` + reason | Free | Instant |
| [`MaxDuration`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.MaxDuration) | Performance threshold | `bool` | Free | Instant |
| [`LLMJudge`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.LLMJudge) | Subjective quality | `bool` and/or `float` | $$ | Slow |
| [`HasMatchingSpan`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.HasMatchingSpan) | Behavioral check | `bool` | Free | Fast |

### Report-Level Evaluators

| Evaluator | Purpose | Output Type | Cost | Speed |
| --- | --- | --- | --- | --- |
| [`ConfusionMatrixEvaluator`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.ConfusionMatrixEvaluator) | Classification matrix | `ConfusionMatrix` | Free | Instant |
| [`PrecisionRecallEvaluator`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.PrecisionRecallEvaluator) | PR curve with AUC | `PrecisionRecall` | Free | Instant |

## Combining Evaluators

Best practice is to combine fast deterministic checks with slower LLM evaluations:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    name='combined_evaluators',
    cases=[Case(inputs='test')],
    evaluators=[
        # Fast checks first (fail fast)
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Expensive LLM checks last
        LLMJudge(rubric='Response is helpful and accurate'),
    ],
)
```

This approach:

1.  Catches format/structure issues immediately
2.  Validates required content quickly
3.  Only runs expensive LLM evaluation if basic checks pass
4.  Provides comprehensive quality assessment

## Next Steps

-   **[LLM Judge](/docs/ai/evals/evaluators/llm-judge)** - Deep dive on LLM-as-a-Judge evaluation
-   **[Custom Evaluators](/docs/ai/evals/evaluators/custom)** - Write your own evaluation logic
-   **[Report Evaluators](/docs/ai/evals/evaluators/report-evaluators)** - Experiment-wide analyses (confusion matrices, PR curves, etc.)
-   **[Span-Based Evaluation](/docs/ai/evals/evaluators/span-based)** - Using OpenTelemetry spans for behavioral checks