
Running Evaluations

Evaluations in Logfire are powered by pydantic-evals. You have two equally supported options for where your test cases live:

  • Local datasets --- defined in code (or loaded from a YAML file) as a pydantic_evals.Dataset. No server round-trip required. This is the simplest way to get started and is all you need for many projects.
  • Hosted datasets --- stored on Logfire, editable in the Web UI, and fetchable as a typed Dataset. Useful when you want to curate cases from production traces or collaborate with teammates.

Either way, once you have a Dataset in hand, the evaluation step is identical, and results appear on the Evals: Datasets & Experiments page as long as Logfire tracing is configured.

Evaluating a Local Dataset

If your test cases live in code, you can run an evaluation without ever talking to the Logfire datasets API. Just build a Dataset and call evaluate:

from dataclasses import dataclass

from pydantic_evals import Case, Dataset

import logfire

# Configure Logfire so the evaluation shows up on the Evals: Datasets & Experiments page in the Logfire UI.
# Without this, the evaluation still runs but its results will not be sent to Logfire.
logfire.configure()
logfire.instrument_pydantic_ai()  # optional, traces the AI task under test


@dataclass
class QuestionInput:
    question: str
    context: str | None = None


@dataclass
class AnswerOutput:
    answer: str
    confidence: float


dataset = Dataset[QuestionInput, AnswerOutput, None](
    cases=[
        Case(
            name='capital_of_france',
            inputs=QuestionInput(question='What is the capital of France?'),
            expected_output=AnswerOutput(answer='Paris', confidence=1.0),
        ),
        # ... more cases
    ],
)


async def my_qa_task(inputs: QuestionInput) -> AnswerOutput:
    """The AI system under test."""
    ...


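async def run_evaluation():
    report = await dataset.evaluate(my_qa_task)
    report.print()


# run_evaluation is a coroutine: from synchronous code, start it with
# asyncio.run(run_evaluation()) (after `import asyncio`).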

You can also load local datasets from YAML files --- see the pydantic-evals documentation for details. With Logfire tracing enabled, runs against local datasets still appear on the Evals: Datasets & Experiments page (as Local datasets --- see Hosted vs Local Datasets).

Evaluating a Hosted Dataset

If you’d rather manage cases on the server --- for example so teammates can edit them in the UI or so you can seed cases from production traces --- fetch a hosted dataset and use it the same way.

Hosted datasets are typically created in the Web UI or published from code via push_dataset(...).

Getting a typed pydantic-evals Dataset

The get_dataset method fetches all hosted cases and returns a typed pydantic_evals.Dataset that you can use directly for evaluation:

from dataclasses import dataclass

from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient


@dataclass
class QuestionInput:
    question: str
    context: str | None = None


@dataclass
class AnswerOutput:
    answer: str
    confidence: float


@dataclass
class CaseMetadata:
    category: str
    difficulty: str
    reviewed: bool = False


with LogfireAPIClient(api_key='your-api-key') as client:
    dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
        'qa-golden-set',
        input_type=QuestionInput,
        output_type=AnswerOutput,
        metadata_type=CaseMetadata,
    )

print(f'Fetched {len(dataset.cases)} cases')
print(f'First case input type: {type(dataset.cases[0].inputs).__name__}')

If you have custom evaluator types stored with your cases, pass them via custom_evaluator_types so they can be deserialized:

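# `client` is an open LogfireAPIClient, as in the example above;
# MyCustomEvaluator stands for your own pydantic-evals Evaluator subclass.
dataset = client.get_dataset(
    'qa-golden-set',
    input_type=QuestionInput,
    output_type=AnswerOutput,
    custom_evaluator_types=[MyCustomEvaluator],
)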

Without type arguments, get_dataset returns the raw dict in pydantic-evals-compatible format:

raw_data = client.get_dataset('qa-golden-set')
# raw_data is a dict with 'name', 'cases', etc.

Running the Evaluation

Once fetched, a hosted dataset is just a pydantic_evals.Dataset --- use it exactly like the local example above:

from pydantic_evals import Dataset

import logfire
from logfire.experimental.api_client import LogfireAPIClient

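# QuestionInput, AnswerOutput, and CaseMetadata are the dataclasses defined in the
# "Getting a typed pydantic-evals Dataset" example above.

# Send evaluation results to the Evals: Datasets & Experiments page in Logfire.
logfire.configure()
# You can instrument libraries here for richer information in the evaluation traces,
# e.g., via `logfire.instrument_pydantic_ai()`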


async def my_qa_task(inputs: QuestionInput) -> AnswerOutput:
    """The AI system under test."""
    # Your AI logic here --- call an LLM, run an agent, etc.
    ...


async def run_evaluation():
    with LogfireAPIClient(api_key='your-api-key') as client:
        # Get the dataset
        dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
            'qa-golden-set',
            input_type=QuestionInput,
            output_type=AnswerOutput,
            metadata_type=CaseMetadata,
        )

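    # Run the evaluation (the fetched dataset remains usable after the client is closed)
    report = await dataset.evaluate(my_qa_task)
    report.print()


# From synchronous code, start the coroutine with asyncio.run(run_evaluation()).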

Viewing Results on the Evals Page

With Logfire tracing enabled, the evaluation results appear automatically on the Evals: Datasets & Experiments page, where you can compare experiments and analyze performance over time.

The Evaluation Workflow

Taken together, observing, curating, and evaluating form a continuous improvement loop:

  1. Observe production behavior in Live View.
  2. Curate test cases by adding interesting traces to a dataset.
  3. Evaluate your system against the dataset using pydantic-evals.
  4. Analyze the results on the Logfire Evals: Datasets & Experiments page.
  5. Improve your system and repeat.