The challenge: Fragmented monitoring was slowing down debugging and incident response
Sophos's SecOps AI team builds AI-powered security solutions that protect millions of endpoints globally, including an AI Assistant for their customers. But their monitoring stack was holding them back.
"We'd lose time piecing together what had actually happened," explains Peter Kim, Principal Software Engineer at Sophos. The team needed to trace requests across LLM calls, FastAPI endpoints, and background workers - but their tools showed disconnected fragments instead of complete traces.
With their previous tooling, dashboard creation was limited: the team struggled to visualize multiple metrics at once and lacked the flexibility to build the analytical views they needed to monitor their AI systems. Background jobs would fail silently when Celery workers didn't pick up tasks. "Finding those types of issues with CloudWatch can be a nightmare," Peter notes.
The team needed unified observability that could keep pace with their AI innovation.
The solution: Logfire & tracing that just works
Sophos chose Pydantic Logfire for its OpenTelemetry foundation and developer-first design. Implementation was remarkably straightforward - the team simply toggled on Logfire's integrations for their existing libraries like FastAPI and httpx.
"We can see the whole conversation thread, the LLM call, and every API hop - all in one go," says Peter. "It saves me a ton of time."
The team went beyond basic monitoring, creating SQL-based monitors that detect "missing spans" and catch those previously invisible background job failures instantly - notifying the team far faster than before Logfire. No custom query language to learn, just SQL via DataFusion.
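A "missing spans" monitor of the kind described might run a DataFusion-style SQL query against Logfire's records table. This is an illustrative sketch only: the span name `process_task` and the 15-minute window are hypothetical values, not Sophos's actual monitor.

```python
# Hypothetical monitor query: count recent runs of a Celery task's span.
# If the count is zero, the worker has stopped picking up jobs - exactly
# the failure mode that previously went unnoticed.
MISSING_SPAN_QUERY = """
SELECT count(*) AS task_runs
FROM records
WHERE span_name = 'process_task'
  AND start_timestamp > now() - interval '15 minutes'
"""
```

A monitor built on a query like this alerts when `task_runs` drops to zero, turning a silent failure into an immediate notification.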
The team now builds complex multi-metric dashboards, giving them the analytical flexibility they need. "The filtering has been amazing because you can filter for anything," Peter notes.
Because Sophos operates with highly sensitive customer data, everything is hosted on-prem using Logfire's enterprise self-hosting option.
Going deeper: From monitoring to experimentation
As confidence grew, Sophos expanded their Logfire usage to include Pydantic Evals for LLM experimentation.
"Evals have been great. We now have the ability to compare experiments," says Peter. The team particularly values being able to test prompt changes side-by-side and understand performance immediately.
"I think the Logfire UI is cleaner, nicer. Everything's right there. It's what it should be."
— Tony Pelletier, Senior Software Engineer at Sophos
The results: A foundation for AI innovation
Today, Sophos has achieved what they set out to build:
- Complete visibility: AI agent runs are traced end-to-end in a single, connected view across all their services
- Proactive detection: SQL monitors catch issues that previously went unnoticed for hours, with custom alerts for missing spans
- Rapid experimentation: Side-by-side model evaluations directly in the UI for prompt optimization
- Team adoption: Engineers praise the interface and actively expand usage - "I'm a big fan of it," says Tony
- Future-proof architecture: OpenTelemetry foundation means no vendor lock-in - "The great part is it's based on open standards," notes Peter
"This seems more polished to me. And the support we've been getting from the Pydantic team has been awesome - that's a real bonus."
— Peter Kim, Principal Software Engineer at Sophos
Want to achieve similar unified observability for your AI systems? Get started with Pydantic Logfire.