# Overview

Evaluators are the core of Pydantic Evals. They analyze task outputs and provide scores, labels, or pass/fail assertions.

## When to Use Different Evaluators

### Deterministic Checks (Fast & Reliable)

Use deterministic evaluators when you can define exact rules:

| Evaluator | Use Case | Example |
| --- | --- | --- |
| [`EqualsExpected`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EqualsExpected) | Exact output match | Structured data, classification |
| [`Equals`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.Equals) | Equals specific value | Checking for sentinel values |
| [`Contains`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.Contains) | Substring/element check | Required keywords, PII detection |
| [`IsInstance`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.IsInstance) | Type validation | Format validation |
| [`MaxDuration`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.MaxDuration) | Performance threshold | SLA compliance |
| [`HasMatchingSpan`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.HasMatchingSpan) | Behavior verification | Tool calls, code paths |

**Advantages:**

-   Fast execution (microseconds to milliseconds)
-   Deterministic results
-   No cost
-   Easy to debug

**When to use:**

-   Format validation (JSON structure, type checking)
-   Required content checks (must contain X, must not contain Y)
-   Performance requirements (latency, token counts)
-   Behavioral checks (which tools were called, which code paths executed)
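
As a rough illustration of the kind of rule that fits this category, here is a plain-Python sketch of a combined format and content check. The function name `check_json_reply` and its parameters are hypothetical, not part of Pydantic Evals; in practice you would use the built-in evaluators above or wrap logic like this in a custom evaluator.

```python
import json


def check_json_reply(output: str, required_keys: set[str], banned_terms: set[str]) -> bool:
    """Return True only if `output` is a JSON object with all required keys and no banned terms."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(data, dict):
        return False  # valid JSON, but not an object
    if not required_keys.issubset(data):
        return False  # at least one required key is missing
    lowered = output.lower()
    # "must not contain Y": reject if any banned term appears anywhere in the raw text
    return not any(term.lower() in lowered for term in banned_terms)


print(check_json_reply('{"answer": "4", "source": "calculator"}', {'answer'}, {'ssn'}))  # True
```

The check is exact and repeatable: the same output always gives the same result, which is what makes this class of evaluator easy to debug.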

### LLM-as-a-Judge (Flexible & Nuanced)

Use [`LLMJudge`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.LLMJudge) when evaluation requires understanding or judgment:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    name='llm_judge_example',
    cases=[Case(inputs='What is 2+2?', expected_output='4')],
    evaluators=[
        LLMJudge(
            rubric='Response is factually accurate based on the input',
            include_input=True,
        )
    ],
)
```

**Advantages:**

-   Can evaluate subjective qualities (helpfulness, tone, creativity)
-   Understands natural language
-   Can follow complex rubrics
-   Flexible across domains

**Disadvantages:**

-   Slower (seconds per evaluation)
-   Costs money
-   Non-deterministic
-   Can have biases

**When to use:**

-   Factual accuracy
-   Relevance and helpfulness
-   Tone and style
-   Completeness
-   Following instructions
-   RAG quality (groundedness, citation accuracy)

### Custom Evaluators

Write a custom evaluator when you need evaluation logic that the framework doesn't provide. They are most often useful for domain-specific checks:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ValidSQL(Evaluator):
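    def evaluate(self, ctx: EvaluatorContext) -> bool:
        try:
            import sqlparse

            # sqlparse is a non-validating parser, so this only catches
            # output it cannot process at all (for example, non-string output)
            sqlparse.parse(ctx.output)
            return True
        except Exception:
            return False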
```

**When to use:**

-   Domain-specific validation (SQL syntax, regex patterns, business rules)
-   External API calls (running generated code, checking databases)
-   Complex calculations (precision/recall, BLEU scores)
-   Integration checks (does API call succeed?)
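
For the "complex calculations" case, the scoring logic is often a small pure function that a custom evaluator's `evaluate` method can call. The sketch below computes precision and recall over sets of items; the function name `precision_recall` and the example keyword sets are hypothetical and only illustrate the idea.

```python
def precision_recall(predicted: set[str], expected: set[str]) -> dict[str, float]:
    """Compute precision and recall of `predicted` items against `expected` items."""
    true_positives = len(predicted & expected)
    # Guard against division by zero when either set is empty
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return {'precision': precision, 'recall': recall}


scores = precision_recall(
    predicted={'refund', 'policy', 'apology'},
    expected={'refund', 'policy', 'timeline', 'contact'},
)
print(scores)  # {'precision': 0.6666666666666666, 'recall': 0.5}
```

Keeping the calculation separate from the evaluator class makes it easy to unit test on its own.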

## Evaluation Types

**Detailed Return Types Guide:** For full details on what custom evaluators may return, see [Custom Evaluator Return Types](/docs/ai/evals/evaluators/custom#return-types).

Evaluators return three kinds of results:

### 1\. Assertions (bool)

Pass/fail checks that appear as ✔ or ✗ in reports:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class HasKeyword(Evaluator):
    keyword: str

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return self.keyword in ctx.output
```

**Use for:** Binary checks, quality gates, compliance requirements

### 2\. Scores (int or float)

Numeric metrics:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ConfidenceScore(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        # Analyze and return score
        return 0.87  # 87% confidence
```

**Use for:** Quality metrics, ranking, A/B testing, regression tracking
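
The evaluator above returns a constant for brevity. As one possible way to compute a real score, the sketch below uses the standard library's `difflib.SequenceMatcher` to measure how close an output is to an expected string. The function name `similarity_score` is hypothetical, and character-level similarity is only one of many metrics you might choose.

```python
from difflib import SequenceMatcher


def similarity_score(output: str, expected: str) -> float:
    """Return a similarity ratio in [0.0, 1.0]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, output, expected).ratio()


print(similarity_score('abc', 'abc'))  # 1.0
print(similarity_score('abc', 'xyz'))  # 0.0
```

A custom evaluator could return this value from `evaluate`, for example by passing `ctx.output` and `ctx.expected_output` when both are strings.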

### 3\. Labels (str)

Categorical classifications:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SentimentClassifier(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> str:
        if 'error' in ctx.output.lower():
            return 'error'
        elif 'success' in ctx.output.lower():
            return 'success'
        return 'neutral'
```

**Use for:** Classification, error categorization, quality buckets

### Multiple Results

You can return multiple evaluations from a single evaluator:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ComprehensiveCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | float | str]:
        return {
            'valid_format': self._check_format(ctx.output),  # bool
            'quality_score': self._score_quality(ctx.output),  # float
            'category': self._classify(ctx.output),  # str
        }

    def _check_format(self, output: str) -> bool:
        return True

    def _score_quality(self, output: str) -> float:
        return 0.85

    def _classify(self, output: str) -> str:
        return 'good'
```

## Combining Evaluators

Mix and match evaluators to create comprehensive evaluation suites:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    name='layered_evaluation',
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Fast deterministic checks first
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Slower LLM checks after
        LLMJudge(
            rubric='Response is accurate and helpful',
            include_input=True,
        ),
    ],
)
```

## Case-Specific Evaluators

Case-specific evaluators are one of the most powerful features for building comprehensive evaluation suites. You can attach evaluators to individual [`Case`](/docs/ai/api/pydantic_evals/dataset/#pydantic_evals.dataset.Case) objects that only run for those specific cases:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge

dataset = Dataset(
    name='case_specific_evaluators',
    cases=[
        Case(
            name='greeting_response',
            inputs='Say hello',
            evaluators=[
                # This evaluator only runs for this case
                LLMJudge(
                    rubric='Response is warm and friendly, uses casual tone',
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='formal_response',
            inputs='Write a business email',
            evaluators=[
                # Different requirements for this case
                LLMJudge(
                    rubric='Response is professional and formal, uses business language',
                    include_input=True,
                ),
            ],
        ),
    ],
    evaluators=[
        # This runs for ALL cases
        IsInstance(type_name='str'),
    ],
)
```

### Why Case-Specific Evaluators Matter

Case-specific evaluators solve a fundamental problem with one-size-fits-all evaluation: **if you could write a single evaluator rubric that perfectly captured your requirements across all cases, you'd just incorporate that rubric into your agent's instructions**. This argument is weaker when you deliberately run a cheaper model in production and assess it with a more expensive one, but in many cases it makes sense to use the best model you can in production.

The power of case-specific evaluation comes from the nuance:

-   **Different cases have different requirements**: A customer support response needs empathy; a technical API response needs precision
-   **Avoid "inmates running the asylum"**: If your LLMJudge rubric is generic enough to work everywhere, your agent should already be following it
-   **Capture nuanced golden behavior**: Each case can specify exactly what "good" looks like for that scenario

### Building Golden Datasets with Case-Specific LLMJudge

A particularly powerful pattern is using case-specific [`LLMJudge`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.LLMJudge) evaluators to quickly build comprehensive, maintainable evaluation suites. Instead of needing exact `expected_output` values, you can describe what you care about:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    name='golden_dataset',
    cases=[
        Case(
            name='handle_refund_request',
            inputs={'query': 'I want my money back', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Acknowledge the refund request empathetically
                    2. Ask for the reason for the refund
                    3. Mention our 30-day refund policy
                    4. NOT process the refund immediately (needs manager approval)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_shipping_question',
            inputs={'query': 'Where is my order?', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Confirm the order number
                    2. Provide tracking information
                    3. Give estimated delivery date
                    4. Be brief and factual (not overly apologetic)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_angry_customer',
            inputs={'query': 'This is completely unacceptable!', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Prioritize de-escalation with empathy
                    2. Avoid being defensive
                    3. Offer concrete next steps
                    4. Use phrases like "I understand" and "Let me help"
                    """,
                    include_input=True,
                ),
            ],
        ),
    ],
)
```

This approach lets you:

-   **Build comprehensive test suites quickly**: Just describe what you want per case
-   **Maintain easily**: Update rubrics as requirements change, without regenerating outputs
-   **Cover edge cases naturally**: Add new cases with specific requirements as you discover them
-   **Capture domain knowledge**: Each rubric documents what "good" means for that scenario

The LLM evaluator excels at understanding nuanced requirements and assessing compliance, making this a practical way to create thorough evaluation coverage without brittleness.

## Async vs Sync

Evaluators can be sync or async:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SyncEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return True


async def some_async_operation() -> bool:
    return True


@dataclass
class AsyncEvaluator(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext) -> bool:
        result = await some_async_operation()
        return result
```

Pydantic Evals handles both automatically. Use async when:

-   Making API calls
-   Running database queries
-   Performing I/O operations
-   Calling LLMs (like [`LLMJudge`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.LLMJudge))
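
To see why async suits I/O-bound work, the sketch below uses only `asyncio`. The function `fake_api_check` is a hypothetical stand-in for a real network call, and the snippet illustrates general asyncio behavior rather than how Pydantic Evals schedules evaluators internally.

```python
import asyncio
import time


async def fake_api_check(output: str) -> bool:
    """Pretend to call a remote service; the sleep stands in for network latency."""
    await asyncio.sleep(0.1)
    return bool(output.strip())


async def run_checks(outputs: list[str]) -> list[bool]:
    # gather() lets the waits overlap instead of running one after another
    return await asyncio.gather(*(fake_api_check(o) for o in outputs))


start = time.perf_counter()
results = asyncio.run(run_checks(['a', '', 'c']))
elapsed = time.perf_counter() - start

print(results)  # [True, False, True]
print(f'{elapsed:.2f}s')  # roughly the time of one call, not three
```

While one coroutine is waiting, the event loop can make progress on the others; a synchronous function that blocks on I/O holds everything up for the full duration of the call.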

## Evaluation Context

All evaluators receive an [`EvaluatorContext`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EvaluatorContext):

-   `ctx.inputs` - Task inputs
-   `ctx.output` - Task output (to evaluate)
-   `ctx.expected_output` - Expected output (if provided)
-   `ctx.metadata` - Case metadata (if provided)
-   `ctx.duration` - Task execution time (seconds)
-   `ctx.span_tree` - OpenTelemetry spans (if Logfire is configured)
-   `ctx.metrics` - Custom metrics dict
-   `ctx.attributes` - Custom attributes dict

This gives evaluators full context to make informed assessments.
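
As a small sketch of combining several of these fields, the function below checks both correctness and latency. It accepts any object exposing `output`, `expected_output`, and `duration`, so a `SimpleNamespace` stands in for a real `EvaluatorContext` here; the function name `correct_and_fast` and the 2-second budget are hypothetical.

```python
from types import SimpleNamespace
from typing import Any


def correct_and_fast(ctx: Any, max_seconds: float = 2.0) -> dict[str, bool]:
    """Combine an expected-output comparison with a latency budget check."""
    return {
        # Only counts as a match when an expected output was actually provided
        'matches_expected': ctx.expected_output is not None and ctx.output == ctx.expected_output,
        'within_time_budget': ctx.duration <= max_seconds,
    }


# Stand-in for the context an evaluator would receive
fake_ctx = SimpleNamespace(inputs='What is 2+2?', output='4', expected_output='4', duration=0.35)
print(correct_and_fast(fake_ctx))  # {'matches_expected': True, 'within_time_budget': True}
```

Inside a custom evaluator, the same body could live in `evaluate(self, ctx)` and read the fields directly from `ctx`.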

## Error Handling

If an evaluator raises an exception, it's captured as an [`EvaluatorFailure`](/docs/ai/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EvaluatorFailure):

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


def risky_operation(output: str) -> bool:
    # This might raise an exception
    if 'error' in output:
        raise ValueError('Found error in output')
    return True


@dataclass
class RiskyEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # If this raises an exception, it will be captured
        result = risky_operation(ctx.output)
        return result
```

Failures appear in `report.cases[i].evaluator_failures` with:

-   Evaluator name
-   Error message
-   Full stack trace

Use retry configuration to handle transient failures (see [Retry Strategies](/docs/ai/evals/how-to/retry-strategies)).
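
Conceptually, capturing a failure means catching the exception and recording those three pieces of information instead of letting the run crash. The sketch below shows that idea with the standard library only. `FailureRecord` and `run_safely` are hypothetical names for illustration, not the library's implementation of `EvaluatorFailure`.

```python
import traceback
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class FailureRecord:
    """Hypothetical record holding the details listed above."""

    name: str
    message: str
    stacktrace: str


def run_safely(name: str, check: Callable[[str], bool], output: str) -> Union[bool, FailureRecord]:
    """Run a check and convert any exception into a FailureRecord."""
    try:
        return check(output)
    except Exception as exc:
        return FailureRecord(
            name=name,
            message=f'{type(exc).__name__}: {exc}',
            stacktrace=traceback.format_exc(),
        )


def risky_operation(output: str) -> bool:
    if 'error' in output:
        raise ValueError('Found error in output')
    return True


ok = run_safely('RiskyEvaluator', risky_operation, 'all good')
failed = run_safely('RiskyEvaluator', risky_operation, 'error: timeout')
print(ok)  # True
print(failed.message)  # ValueError: Found error in output
```

Because the failure is stored as data, the remaining cases and evaluators can still run and the problem can be inspected afterwards.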

## Report Evaluators (Experiment-Wide)

All the evaluators above run once per case. **Report evaluators** are different: they run once per experiment after all cases have been evaluated, and analyze the full set of results together.

Use report evaluators for experiment-wide statistics like:

-   **Confusion matrices**: visualize classification accuracy across classes
-   **Precision-recall curves**: assess ranking quality with AUC scores
-   **Scalar metrics**: overall accuracy, F1, BLEU, or any single number
-   **Summary tables**: per-class breakdowns, error category summaries

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import ConfusionMatrixEvaluator

dataset = Dataset(
    name='report_evaluator_example',
    cases=[
        Case(inputs='meow', expected_output='cat'),
        Case(inputs='woof', expected_output='dog'),
    ],
    report_evaluators=[
        ConfusionMatrixEvaluator(
            predicted_from='output',
            expected_from='expected_output',
        ),
    ],
)
```

**See:** [Report Evaluators](/docs/ai/evals/evaluators/report-evaluators) for the full guide, including built-in report evaluators and how to write custom ones.
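
To make the experiment-wide idea concrete, here is a plain-Python sketch of what a confusion matrix and an overall accuracy figure boil down to once every case has an expected and a predicted label. The function names and the sample data are hypothetical; the built-in report evaluators handle this for you.

```python
from collections import Counter


def confusion_matrix(pairs: list[tuple[str, str]]) -> Counter:
    """Count how often each (expected, predicted) combination occurs across all cases."""
    return Counter(pairs)


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of cases where the predicted label equals the expected label."""
    if not pairs:
        return 0.0
    return sum(expected == predicted for expected, predicted in pairs) / len(pairs)


# (expected, predicted) for each case in a hypothetical experiment
results = [('cat', 'cat'), ('dog', 'dog'), ('cat', 'dog'), ('dog', 'dog')]

matrix = confusion_matrix(results)
print(matrix[('cat', 'dog')])  # 1
print(accuracy(results))  # 0.75
```

Neither number can be computed from a single case, which is why this kind of analysis runs once per experiment rather than once per case.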

## Next Steps

-   **[Native Evaluators](/docs/ai/evals/evaluators/built-in)** - Complete reference of all provided evaluators
-   **[LLM Judge](/docs/ai/evals/evaluators/llm-judge)** - Deep dive on LLM-as-a-Judge evaluation
-   **[Third-Party Integrations](/docs/ai/evals/evaluators/framework-integrations)** - Wrap Ragas, DeepEval, and other metrics libraries
-   **[Custom Evaluators](/docs/ai/evals/evaluators/custom)** - Write your own evaluation logic
-   **[Report Evaluators](/docs/ai/evals/evaluators/report-evaluators)** - Experiment-wide analyses
-   **[Span-Based Evaluation](/docs/ai/evals/evaluators/span-based)** - Evaluate using OpenTelemetry spans