Some customers can't use your cloud. Now what?

Some customers cannot use your cloud.

A vendor-operated environment, even an isolated dedicated one, can still be a no-go for regulatory or architectural reasons. If data cannot leave the customer's network boundary, your cloud is not part of the solution.

We announced a self-hosted option for Pydantic Logfire a little over a year ago. We chose Helm because it's the packaging format Kubernetes teams already know how to review, diff, and manage through GitOps. Getting it to run in a customer's Kubernetes cluster was the first job. The second job, keeping that install working as the product kept changing, turned out to be much harder, and a lot of that cost showed up as customer support.

This post is about what drove that cost, and what we did to bring it down.

Self-hosting usually stems from hard constraints: strict data residency, BYO encryption, internal identity providers, localized audit trails, or "no call home" behavior.

At runtime, Pydantic Logfire should only call infrastructure the customer explicitly configured. In practice, the customer owns the infrastructure for application state, object storage, and identity. Logfire's own service telemetry stays inside the cluster unless explicitly routed outward.

In a managed cloud, you can hide dependencies behind platform services. In a self-hosted chart, if a service needs network egress, TLS, or external APIs, the chart must expose it clearly and predictably. The boundary has to be explicit, not implied.

Logfire ships fast. Startup config changes. Services gain new settings, lose old ones, or change defaults. The problem is that Helm has no way to know when that happens: it will render valid YAML for a configuration the application no longer accepts. That gap between "chart is valid" and "deployment actually works" is easy to miss until a customer hits it during an upgrade.

This was one of the first things that started generating support load. We built an audit tool into our release process that diffs the platform configuration against the chart before shipping. It keeps the two in sync as the product evolves.
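To make the idea concrete, here is a minimal sketch of that kind of audit, not our actual tool. It assumes both sides can be reduced to a flat set of setting names; the function name `audit_chart` and every setting name in it are hypothetical.

```python
def audit_chart(app_settings: set[str], chart_settings: set[str]) -> dict[str, list[str]]:
    """Compare the settings the application accepts with the ones the chart renders.

    Both inputs are plain sets of setting names (an assumption made for this sketch).
    """
    return {
        # Accepted by the application but not exposed by the chart.
        "missing_from_chart": sorted(app_settings - chart_settings),
        # Still rendered by the chart but no longer accepted by the application.
        "stale_in_chart": sorted(chart_settings - app_settings),
    }


# Hypothetical setting names, for illustration only.
app = {"PG_DSN", "OBJECT_STORE_URI", "WORKER_COUNT"}
chart = {"PG_DSN", "OBJECT_STORE_URI", "LEGACY_CACHE_URL"}

report = audit_chart(app, chart)
for kind, names in report.items():
    print(kind, names)
```

A release gate can then fail whenever either list is non-empty, which turns "the chart renders" into "the chart matches what the application accepts".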

Sizing generated its own category of problems. Kubernetes makes it trivial to expose every workload knob, but forcing users to tune resource requests before their first install pushes product knowledge onto the wrong team.

The underlying issue was that internal settings such as worker counts and concurrency limits were not connected to the resource budget the pod actually had. We were getting support calls from customers hitting OOM kills or CPU throttling with configurations that looked reasonable on paper.

We added sizing presets (e.g. tiny, small, standard) and moved those internal settings to formulas derived from the resources the workload actually gets. If a pod has a specific memory and CPU budget, worker counts and concurrency limits now stay inside that budget automatically. The presets also cover autoscaling and availability defaults, so customers can get a working deployment without having to understand every knob first.
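As a rough illustration of deriving internal settings from the resource budget, consider the sketch below. The preset names come from the list above, but every number, constant, and formula here is an assumption chosen for the example, not the values the chart ships with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    cpu_millicores: int
    memory_mib: int


# Assumed budgets, for illustration only.
PRESETS = {
    "tiny": Preset(cpu_millicores=500, memory_mib=1024),
    "small": Preset(cpu_millicores=1000, memory_mib=2048),
    "standard": Preset(cpu_millicores=4000, memory_mib=8192),
}

BASE_OVERHEAD_MIB = 256  # assumed fixed memory cost of the process
PER_WORKER_MIB = 512     # assumed memory cost of each worker
TASKS_PER_WORKER = 8     # assumed concurrency per worker


def derive_workers(preset: Preset) -> int:
    """Pick the largest worker count that fits both the CPU and the memory budget."""
    by_cpu = max(1, preset.cpu_millicores // 1000)
    by_memory = max(1, (preset.memory_mib - BASE_OVERHEAD_MIB) // PER_WORKER_MIB)
    return min(by_cpu, by_memory)


def derive_concurrency(preset: Preset) -> int:
    """Scale the concurrency limit with the derived worker count."""
    return derive_workers(preset) * TASKS_PER_WORKER


for name, preset in PRESETS.items():
    print(name, derive_workers(preset), derive_concurrency(preset))
```

The point of the `min` is that whichever resource is scarcer wins, so a pod with plenty of CPU but little memory does not get more workers than its memory can hold.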

Some complexity is unavoidable: a production install genuinely requires external PostgreSQL, object storage, TLS, and identity providers. We minimized the required inputs for a working deployment, and made it possible to boot a dev-grade instance locally with PostgreSQL and MinIO before committing to the full production setup.

In the cloud product, customers don't manage the instance; we do. There's no reason to expose controls over things that only make sense when you're the one running the infrastructure. That changes completely in a self-hosted deployment.

Managing the instance becomes the customer's problem, which means it has to become our API. Building the self-hosted product forced us to design a new privilege tier: instance-level admin access. The boundary between "things operators manage" and "things the platform manages" had always been implicit. Self-hosted made it explicit.
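One way to picture an explicit tier is the sketch below. It is a hypothetical model, not Logfire's permission system: the tier names other than instance-level admin, the action names, and the strictly ordered hierarchy are all assumptions made for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Ordered so that a higher tier includes everything below it (an assumption).
    MEMBER = 1
    ORG_ADMIN = 2
    INSTANCE_ADMIN = 3  # the tier that only matters when customers run the instance


# Hypothetical actions mapped to the minimum tier that may perform them.
REQUIRED_TIER = {
    "view_traces": Tier.MEMBER,
    "manage_org_members": Tier.ORG_ADMIN,
    "manage_instance_settings": Tier.INSTANCE_ADMIN,
}


def is_allowed(tier: Tier, action: str) -> bool:
    """Return True if the caller's tier meets the minimum tier for the action."""
    return tier >= REQUIRED_TIER[action]


print(is_allowed(Tier.ORG_ADMIN, "manage_instance_settings"))
print(is_allowed(Tier.INSTANCE_ADMIN, "manage_instance_settings"))
```

Writing the mapping down is what makes the boundary explicit: each operator-facing action has to name the tier that owns it.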


The honest lesson from a year of this: the build is not the hard part. The ongoing support surface is. Every gap in setup, docs, upgrades, or troubleshooting eventually becomes a call. If we were doing it again, we would treat the chart as part of the product from day one. Not as polish, and not as a documentation pass after the fact. Startup config, sizing, and instance administration are part of the product once customers are the ones operating it.