Running Evaluations
Once you have a hosted dataset (created via the Web UI or SDK), you can fetch it as a
typed pydantic_evals.Dataset and use it to evaluate your AI system.
The get_dataset method fetches all cases and deserializes them into the input, output, and
metadata types you pass, returning a typed pydantic_evals.Dataset:
from dataclasses import dataclass

from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient


@dataclass
class QuestionInput:
    question: str
    context: str | None = None


@dataclass
class AnswerOutput:
    answer: str
    confidence: float


@dataclass
class CaseMetadata:
    category: str
    difficulty: str
    reviewed: bool = False


with LogfireAPIClient(api_key='your-api-key') as client:
    dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
        'qa-golden-set',
        input_type=QuestionInput,
        output_type=AnswerOutput,
        metadata_type=CaseMetadata,
    )

    print(f'Fetched {len(dataset.cases)} cases')
    print(f'First case input type: {type(dataset.cases[0].inputs).__name__}')
If you have custom evaluator types stored with your cases, pass them via custom_evaluator_types so they can be deserialized:
dataset = client.get_dataset(
    'qa-golden-set',
    input_type=QuestionInput,
    output_type=AnswerOutput,
    custom_evaluator_types=[MyCustomEvaluator],
)
Without type arguments, get_dataset returns the raw dict in pydantic-evals-compatible format:
raw_data = client.get_dataset('qa-golden-set')
# raw_data is a dict with 'name', 'cases', etc.
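The raw form can be handy for a quick inspection before committing to typed classes. The following is a minimal sketch using only the standard library. It assumes each entry in 'cases' is a dict that may carry a 'metadata' dict with a 'category' key, mirroring the CaseMetadata fields used above; the helper name summarize_raw_dataset and the sample data are hypothetical, and the exact layout of the raw dict should be checked against a real response.

```python
from collections import Counter
from typing import Any


def summarize_raw_dataset(raw_data: dict[str, Any]) -> dict[str, Any]:
    """Return the dataset name, the number of cases, and a count of cases per category."""
    cases = raw_data.get('cases', [])
    # 'metadata' and 'category' are assumed field names; cases without them
    # are counted under 'uncategorized'.
    categories = Counter(
        (case.get('metadata') or {}).get('category', 'uncategorized') for case in cases
    )
    return {
        'name': raw_data.get('name'),
        'case_count': len(cases),
        'by_category': dict(categories),
    }


# Hand-written stand-in for the dict returned by client.get_dataset('qa-golden-set').
example_raw_data = {
    'name': 'qa-golden-set',
    'cases': [
        {
            'name': 'capital',
            'inputs': {'question': 'What is the capital of France?'},
            'metadata': {'category': 'geography', 'difficulty': 'easy'},
        },
        {
            'name': 'sum',
            'inputs': {'question': 'What is 2 + 2?'},
            'metadata': {'category': 'math', 'difficulty': 'easy'},
        },
        {'name': 'no-metadata', 'inputs': {'question': 'Who wrote Hamlet?'}},
    ],
}

summary = summarize_raw_dataset(example_raw_data)
print(summary)
```

Because the helper only reads keys defensively with .get, it tolerates cases that have no metadata at all.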
Use the dataset with pydantic-evals to evaluate your AI system:
from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient

# QuestionInput, AnswerOutput, and CaseMetadata are the dataclasses defined in the first example.


async def my_qa_task(inputs: QuestionInput) -> AnswerOutput:
    """The AI system under test."""
    # Your AI logic here: call an LLM, run an agent, etc.
    ...


async def run_evaluation():
    with LogfireAPIClient(api_key='your-api-key') as client:
        # Get the dataset
        dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
            'qa-golden-set',
            input_type=QuestionInput,
            output_type=AnswerOutput,
            metadata_type=CaseMetadata,
        )

    # Run the evaluation
    report = await dataset.evaluate(my_qa_task)
    report.print()
With Logfire tracing enabled, the evaluation results appear automatically in the Evals tab, where you can compare experiments and analyze performance over time.
This creates a continuous improvement loop:
- Observe production behavior in Live View.
- Curate test cases by adding interesting traces to a dataset.
- Evaluate your system against the dataset using pydantic-evals.
- Analyze the results in the Logfire Evals tab.
- Improve your system and repeat.