Running Evaluations

Once you have a hosted dataset (created via the Web UI or SDK), you can fetch it as a typed pydantic_evals.Dataset and use it to evaluate your AI system.

Getting a Typed pydantic-evals Dataset

The get_dataset method fetches all cases and returns a typed pydantic_evals.Dataset that you can use directly for evaluation:

from dataclasses import dataclass

from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient


@dataclass
class QuestionInput:
    question: str
    context: str | None = None


@dataclass
class AnswerOutput:
    answer: str
    confidence: float


@dataclass
class CaseMetadata:
    category: str
    difficulty: str
    reviewed: bool = False


with LogfireAPIClient(api_key='your-api-key') as client:
    dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
        'qa-golden-set',
        input_type=QuestionInput,
        output_type=AnswerOutput,
        metadata_type=CaseMetadata,
    )

print(f'Fetched {len(dataset.cases)} cases')
print(f'First case input type: {type(dataset.cases[0].inputs).__name__}')

If you have custom evaluator types stored with your cases, pass them via custom_evaluator_types so they can be deserialized:

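# `client` is an open LogfireAPIClient, as in the previous example;
# MyCustomEvaluator is a placeholder for your own evaluator class.
dataset = client.get_dataset(
    'qa-golden-set',
    input_type=QuestionInput,
    output_type=AnswerOutput,
    custom_evaluator_types=[MyCustomEvaluator],
)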

Without type arguments, get_dataset returns the dataset as a raw dict in a pydantic-evals-compatible format:

raw_data = client.get_dataset('qa-golden-set')
# raw_data is a dict with 'name', 'cases', etc.
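
The raw form is handy for quick inspection without defining any types. The sketch below summarizes a dataset by metadata category using only the standard library. The sample dict is a hypothetical stand-in for the value returned by get_dataset: only the 'name' and 'cases' keys come from the text above, while the per-case keys ('inputs', 'expected_output', 'metadata') are assumed.

```python
from collections import Counter
from typing import Any

# Hypothetical stand-in for client.get_dataset('qa-golden-set').
# The per-case keys below are assumptions made for this sketch.
raw_data: dict[str, Any] = {
    'name': 'qa-golden-set',
    'cases': [
        {
            'name': 'capital-of-france',
            'inputs': {'question': 'What is the capital of France?'},
            'expected_output': {'answer': 'Paris', 'confidence': 1.0},
            'metadata': {'category': 'geography', 'difficulty': 'easy'},
        },
        {
            'name': 'capital-of-japan',
            'inputs': {'question': 'What is the capital of Japan?'},
            'expected_output': {'answer': 'Tokyo', 'confidence': 1.0},
            'metadata': {'category': 'geography', 'difficulty': 'easy'},
        },
        {
            'name': 'unlabelled-case',
            'inputs': {'question': 'What is 2 + 2?'},
            'expected_output': {'answer': '4', 'confidence': 1.0},
            'metadata': None,
        },
    ],
}


def count_cases_by_category(data: dict[str, Any]) -> Counter:
    """Count cases per metadata category, tolerating missing metadata."""
    counts: Counter = Counter()
    for case in data.get('cases', []):
        metadata = case.get('metadata') or {}
        counts[metadata.get('category', 'uncategorized')] += 1
    return counts


category_counts = count_cases_by_category(raw_data)
print(f"{raw_data['name']}: {len(raw_data['cases'])} cases")
print(dict(category_counts))
```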

Running the Evaluation

Use the dataset with pydantic-evals to evaluate your AI system:

import asyncio

from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient

# QuestionInput, AnswerOutput, and CaseMetadata are the dataclasses
# defined in the previous example.


async def my_qa_task(inputs: QuestionInput) -> AnswerOutput:
    """The AI system under test."""
    # Your AI logic here: call an LLM, run an agent, etc.
    ...


async def run_evaluation():
    with LogfireAPIClient(api_key='your-api-key') as client:
        # Fetch the typed dataset
        dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
            'qa-golden-set',
            input_type=QuestionInput,
            output_type=AnswerOutput,
            metadata_type=CaseMetadata,
        )

    # Run the task against every case and print a summary report
    report = await dataset.evaluate(my_qa_task)
    report.print()


if __name__ == '__main__':
    asyncio.run(run_evaluation())
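
To try the pipeline before wiring in a real model, the task body can be a simple stand-in. The sketch below is one hypothetical way to fill in my_qa_task with canned answers; the CANNED_ANSWERS table and the fallback reply are assumptions made for illustration, and a real task would await an LLM or agent call instead.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuestionInput:
    question: str
    context: Optional[str] = None


@dataclass
class AnswerOutput:
    answer: str
    confidence: float


# Hypothetical canned answers standing in for a real model call.
CANNED_ANSWERS = {
    'what is the capital of france?': 'Paris',
}


async def my_qa_task(inputs: QuestionInput) -> AnswerOutput:
    """Stand-in task: looks up a canned answer instead of calling an LLM."""
    # Yield to the event loop, as a real awaited model call would.
    await asyncio.sleep(0)
    answer = CANNED_ANSWERS.get(inputs.question.strip().lower())
    if answer is None:
        # Unknown question: reply with zero confidence.
        return AnswerOutput(answer='I do not know.', confidence=0.0)
    return AnswerOutput(answer=answer, confidence=1.0)


result = asyncio.run(my_qa_task(QuestionInput(question='What is the capital of France?')))
print(result)
```

Because the stub has the same signature as the real task, it can be passed to dataset.evaluate unchanged and later swapped for the production implementation.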

Viewing Results in the Evals Tab

With Logfire tracing enabled (typically by calling logfire.configure() before the evaluation runs), the evaluation results appear automatically in the Evals tab, where you can compare experiments and analyze performance over time.

The Evaluation Workflow

Together, hosted datasets and evaluations form a continuous improvement loop:

1. Observe production behavior in Live View.
2. Curate test cases by adding interesting traces to a dataset.
3. Evaluate your system against the dataset using pydantic-evals.
4. Analyze the results in the Logfire Evals tab.
5. Improve your system and repeat.