Evals & Continuous Learning Engineer

Shipping reliable AI applications means closing the loop: capture what happened in production, turn it into evaluation data, measure quality, and feed improvements back into the system.

We already have a real foundation in production — evals and experiments built into Logfire, our observability platform, plus our open source pydantic-evals library. We're looking for someone who has worked on evaluation or LLM-observability platforms to own this end to end — and to push it toward genuine continuous learning, where systems measurably improve from their own production data.

This might be you if you've worked on a product like Braintrust, Langfuse, LangSmith, Arize/Phoenix, or Humanloop — or built serious internal eval tooling.

  • Own and grow Logfire's evals product: datasets, experiments, evaluators, online evals, and the comparison and analysis UI.
  • Build the trace → dataset → eval → improvement loop: turn production telemetry into evaluation cases, detect regressions, and surface what changed.
  • Advance our continuous-learning / optimizer work: analyze failure patterns from production traces, propose prompt and variable changes, and safely A/B and roll them out via managed variables.
  • Develop the open source pydantic-evals library alongside the product — evaluators, reports, and integrations the community relies on.
  • Work closely with the Pydantic AI team so evals and improvement are first-class for agent developers.
  • Define what "adopted" and "improving" mean as measurable metrics, and instrument them.

We expect a candidate for this position to have:

  • Built evaluation, experimentation, prompt-management, or LLM-observability systems — in a product or as serious internal tooling.
  • Fluency in the realities of LLM evaluation: LLM-as-judge, rubric design, dataset curation, statistical measures, and the failure modes of each.
  • A product-minded approach: you can turn a fuzzy "make the agent better" into datasets, metrics, and a workflow people actually use.
  • Comfort across the backend (FastAPI, Postgres) and a willingness to work in the frontend (React/TypeScript) for the surfaces you own.
  • A genuine interest in AI engineering.
  • At least 5 years of software engineering experience.

Nice to haves but not required:

  • Experience with an evals / observability platform (Braintrust, Langfuse, LangSmith, Arize, Humanloop, Logfire, etc.).
  • Contributions to open source eval or AI tooling.
  • Familiarity with Pydantic, Pydantic AI, and OpenTelemetry's GenAI semantic conventions.
  • Background in ML / experimentation methodology (A/B testing, statistics).
  • Live and work in a timezone between PT (UTC-8) and CET (UTC+1).
  • Able to travel to the EU, UK, and US up to 4 times a year to join our off-sites.
  • Willing to participate in an on-call rotation for the systems you own.

Pydantic Validation is the data validation library that powers modern Python development: 500 million downloads per month, used by virtually every tech company you've heard of. Why? Because we obsess over developer experience and write code we'd actually want to use ourselves.

We're applying that same engineering mindset to Pydantic Logfire, our observability platform with first-class support for AI engineering. It's built for today's development reality of AI workloads, multi-language environments, and cloud infrastructure, and it's designed to be straightforward to set up and maintain.

We build with technologies developers actually want to work with:

  • OpenTelemetry for standardized instrumentation
  • SQL for intuitive querying (no proprietary query language to learn)
  • Rust, Python, and TypeScript for performance and productivity
  • Postgres, DataFusion, and object storage for scalable backends

Unlike other companies that pay lip service to open source, we commit over 20% of our engineering team to maintaining and expanding our open source ecosystem. This includes the core Pydantic Validation library and Pydantic AI, our rapidly growing framework that's becoming the standard for AI application development. We're signatories of the Open Source Pledge and build on open standards because we believe in interoperability, not lock-in. Use our OpenTelemetry-based SDK with any compatible backend; we're confident you'll choose us on merit.

We're backed by Sequoia Capital and run a fully remote team across multiple time zones (with regular in-person off-sites; the next one is in June 2026 in London).

Join our team of exceptional engineers who value substance over hype, practical approaches over perfectionism, and meaningful progress over busyness. We've built a culture that balances technical ambition with sustainable practices—minimal meetings and respect for your expertise and time. We're creating tools that genuinely improve developers' lives, and we're looking for thoughtful contributors who share our commitment to quality and our passion for elegant solutions.

  • 💰 Compensation: Competitive salary and stock options
  • 🌍 Truly Remote: Work from anywhere within our timezone range - no office requirements
  • 🌐 Global & Diverse: Join a multi-cultural team of 8+ nationalities
  • 💪 Impact: Direct influence on tools used by millions of developers worldwide
  • 🎯 Focus on Growth: Regular opportunities for learning and professional development
  • 🤝 Team Gatherings: Connect with the team at our regular international off-sites
  • 🏥 Healthcare: Comprehensive health coverage for you and your dependents
  • 🎮 Flexible Hours: Work when you're most productive
  • 💻 Equipment: Budget for your home office setup
  • ⚖️ Work-Life Balance: Flexible working hours and 33 days of PTO no matter where you live (including public holidays, which you can choose to take or not)

To apply, email careers@pydantic.dev with the job title in the subject line. We'd also appreciate a few lines explaining why you think you'd be a good fit for the role and what you've done in the past that evidences that.

No recruiters or agencies please. Unsolicited recruiter emails will be marked as spam.

To make your application stand out, please share something you've built in the evals / AI space — a project, a library, a talk, or a blog post.