The pod that did not survive Tuesday's deploy

Your Vercel AI SDK chatbot, shipped to GKE, has been throwing 500s on about 8% of requests since Tuesday's deploy. The error rate is not constant: it spikes for a few minutes, recovers, then spikes again. A restart loop, probably. But which pod, on which node, after which release?

  • See every layer of your Kubernetes cluster from one page: clusters, nodes, namespaces, workloads, pods, images, all sortable.
  • Find the pod that's restarting, the node that's hot, or the workload that's stuck rolling out: the answers you would get from kubectl top and kubectl describe, in your browser.
  • Spot a pod in a crash loop from the cluster view. Restart counts roll up at every level so you do not have to drill all the way down to notice.
  • Click from any pod, namespace or workload straight to the traces it produced in the live view.

Open the Workloads tab. Sort by restart count. The chatbot Deployment shows 47 restarts in the last hour. Click in. Desired versus ready replicas tells you the rollout never finished. The workload's memory chart shows one of the three pods pinned at its limit.

Switch to the Pods tab. Two pods are green and stable. One pod is in a restart loop: OOMKilled, every six minutes. Image digest matches Tuesday's release. The other two pods are running an older digest because the rollout stalled. The new image bundles a tokenizer dependency that doubled memory usage. Mystery solved.
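The reasoning in that paragraph can be written down as a small check. This is a minimal sketch, not how Logfire does it: the `PodStatus` record, its field names, the pod names, the digests, and the restart count are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PodStatus:
    # Hypothetical record shape; these field names are illustrative only.
    name: str
    image_digest: str
    restart_count: int
    last_termination_reason: Optional[str]  # e.g. "OOMKilled", or None


def diagnose(pods: list, released_digest: str) -> dict:
    """Separate pods OOM-killed on the new image from pods left on an older image."""
    oom_on_new = [
        p.name
        for p in pods
        if p.image_digest == released_digest
        and p.last_termination_reason == "OOMKilled"
        and p.restart_count > 0
    ]
    still_on_old = [p.name for p in pods if p.image_digest != released_digest]
    return {"oom_on_new_image": oom_on_new, "still_on_old_image": still_on_old}


# Made-up data mirroring the scenario: two stable pods, one in a restart loop.
pods = [
    PodStatus("chatbot-a", "sha256:old", 0, None),
    PodStatus("chatbot-b", "sha256:old", 0, None),
    PodStatus("chatbot-c", "sha256:new", 10, "OOMKilled"),
]
report = diagnose(pods, released_digest="sha256:new")
print(report)
```

A non-empty `still_on_old_image` list next to a non-empty `oom_on_new_image` list is the stalled-rollout signature: the new image cannot stay up, so the old pods are never replaced.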

Plenty of tools will tell you a pod is restarting. What was missing for us was the cluster view sitting next to the application traces from the workloads in that cluster, in the same product, with the same trace ID propagating across both. The pod metrics, the kill event, and the trace that triggered it should be one click away from each other. They are now.

All of this data is available to you and your closest friends: Claude, Codex, Gemini, Pydantic AI, and more via MCP and our RESTful query endpoints.

The data comes from two standard OpenTelemetry collector receivers: kubeletstats for pod and container metrics, and k8scluster (named k8s_cluster in collector config) for clusters, namespaces, workloads, and nodes. No DaemonSet of ours, no proprietary agent.
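To make the shape of that setup concrete, here is a rough sketch of a metrics pipeline for each receiver, built as plain Python dictionaries. Several things here are assumptions rather than Logfire's documented config: the `metrics_config` helper is hypothetical, the endpoint is a placeholder, the 30-second interval is illustrative, and splitting the two receivers into separate collector configs is one common arrangement rather than a requirement. Collector configs are normally written in YAML; JSON is printed here only to keep the sketch free of third-party dependencies.

```python
import json


def metrics_config(receiver: str, settings: dict, otlp_endpoint: str) -> dict:
    """Build a minimal collector config with one receiver feeding one OTLP exporter."""
    return {
        "receivers": {receiver: settings},
        "exporters": {"otlphttp": {"endpoint": otlp_endpoint}},
        "service": {
            "pipelines": {
                "metrics": {"receivers": [receiver], "exporters": ["otlphttp"]}
            }
        },
    }


# Placeholder endpoint: substitute the one from your own setup.
ENDPOINT = "https://otlp.example.invalid"

# Pod and container metrics, read from the kubelet.
node_config = metrics_config(
    "kubeletstats",
    {"collection_interval": "30s", "auth_type": "serviceAccount"},
    ENDPOINT,
)

# Cluster-level inventory: namespaces, workloads, nodes.
cluster_config = metrics_config(
    "k8s_cluster",
    {"collection_interval": "30s"},
    ENDPOINT,
)

print(json.dumps(cluster_config, indent=2))
```

Treat this as a map of which receiver feeds which pipeline, not as a drop-in config. The real one also needs authentication headers and RBAC, which the setup docs cover.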

The drill-down from a pod to the traces that pod produced is what the k8sattributes processor enables. Run it in your application-trace pipeline and it stamps k8s.pod.name, k8s.namespace.name, k8s.deployment.name (and the workload-kind equivalents) onto every span the pod emits. The cluster view and the live view then share the same identifiers, and one click takes you across.
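The idea of shared identifiers can be illustrated with a toy model. This is not the k8sattributes processor's implementation (the real processor resolves pod metadata itself, inside the collector). The `stamp` and `spans_for_pod` helpers, the span shape, and the sample values are hypothetical. Only the three attribute keys come from the text above.

```python
from typing import Any

# Attribute keys named in the paragraph above.
POD_KEYS = ("k8s.pod.name", "k8s.namespace.name", "k8s.deployment.name")


def stamp(span: dict, pod_meta: dict) -> dict:
    """Return a copy of the span with pod metadata added to its attributes."""
    attrs = dict(span.get("attributes", {}))
    for key in POD_KEYS:
        if key in pod_meta:
            attrs.setdefault(key, pod_meta[key])  # never overwrite existing values
    return {**span, "attributes": attrs}


def spans_for_pod(spans: list, namespace: str, pod_name: str) -> list:
    """Select the spans a given pod emitted, using the stamped identifiers."""
    return [
        s
        for s in spans
        if s["attributes"].get("k8s.namespace.name") == namespace
        and s["attributes"].get("k8s.pod.name") == pod_name
    ]


# Made-up pod metadata and spans for illustration.
pod_c: dict[str, Any] = {
    "k8s.pod.name": "chatbot-c",
    "k8s.namespace.name": "chat",
    "k8s.deployment.name": "chatbot",
}
pod_a = {**pod_c, "k8s.pod.name": "chatbot-a"}

spans = [
    stamp({"trace_id": "t1", "name": "POST /chat", "attributes": {}}, pod_c),
    stamp({"trace_id": "t2", "name": "POST /chat", "attributes": {}}, pod_a),
]
matches = spans_for_pod(spans, "chat", "chatbot-c")
print([s["trace_id"] for s in matches])
```

Once every span carries the same keys the cluster view uses, going from a pod to its traces is a plain filter on those attributes.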

Two operational details worth knowing in advance. First, the k8scluster receiver does not do leader election on its own. Run it as a single-replica Deployment, not a multi-replica one, or you'll double-count cluster metrics. Second, restart counts roll up at every level (cluster, namespace, workload, node), so you can spot a pod restart loop from the top without drilling all the way down.
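The roll-up itself is simple aggregation. Here is a minimal sketch, assuming each pod record carries its cluster, namespace, workload, and node. The record shape and all names are hypothetical, and the 47 restarts are attributed to a single pod purely for illustration.

```python
from collections import defaultdict

LEVELS = ("cluster", "namespace", "workload", "node")


def roll_up(pods: list) -> dict:
    """Sum per-pod restart counts at each level of the hierarchy."""
    totals = {level: defaultdict(int) for level in LEVELS}
    for pod in pods:
        for level in LEVELS:
            totals[level][pod[level]] += pod["restarts"]
    return {level: dict(counts) for level, counts in totals.items()}


# Made-up inventory: one restarting pod among three.
pods = [
    {"name": "chatbot-a", "cluster": "prod-gke", "namespace": "chat",
     "workload": "chatbot", "node": "node-1", "restarts": 0},
    {"name": "chatbot-b", "cluster": "prod-gke", "namespace": "chat",
     "workload": "chatbot", "node": "node-2", "restarts": 0},
    {"name": "chatbot-c", "cluster": "prod-gke", "namespace": "chat",
     "workload": "chatbot", "node": "node-2", "restarts": 47},
]
totals = roll_up(pods)
print(totals["node"])
```

Because the same count surfaces at the cluster, namespace, workload, and node level, a single looping pod is visible from whichever tab you happen to open first.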

For the full end-to-end setup (the OTel collector config, the k8sattributes processor to enrich traces with pod metadata, and unified dashboards), our earlier Full-stack Kubernetes observability with Pydantic Logfire post walks through it.

If your collector already runs the kubeletstats and k8scluster receivers and exports their metrics to Logfire, the inventory populates within a minute. If not, the empty state on each tab has a Set up button that deep-links to the relevant page of the add-data wizard, with copy-pasteable config.

In our scenario, you roll back the deploy in two clicks and the restart loop stops. The memory limit gets bumped in the manifest, the rollout finishes cleanly, and the trace from the original 500s is in the live view, with the pod restart event open in context.

Full reference for the setup, including the minimum RBAC and the k8sattributes processor config: the Kubernetes view docs.

Not using Logfire yet? Get started. The free tier includes usage up to 10 million spans, our AI Gateway, and so much more.