Why we need one of these now
A CTO walks out of a board meeting where the question came up plainly: "Where are we on AI?" On the walk back to their office, they realize that for a cloud transformation or a DevOps maturity discussion they would have had a clean structural answer ready: a current state, a target state, and the capabilities and processes to build between the two. They could have drawn it on a whiteboard from memory. For AI, the whiteboard stays blank.
The frame is missing because applied generative AI is new. The maturity models we inherited from twenty years of engineering practice (CMMI, DORA, the various cloud grids) do not quite fit. The newer ones tend to be written by consultancies that have not actually built any of this themselves. A few serious attempts in the space exist: Microsoft's LLMOps maturity model is a useful reference, Cohere has published one, G2 and a few others have their own. Most sit at the strategic-deck altitude and never land on the actual capabilities a leader is going to fund. We are proposing one that does.
What this is not
Before describing the model, a small warning about what it is not. It is not a scoreboard, it is not a rush to Level 5, and it is not a way to declare that one part of your org is "better" than another. Different products inside the same company often sit at different levels for entirely legitimate reasons: the regulatory pressure on the customer-billing agent is not the same as the regulatory pressure on the internal copilot. The model is a mirror you hold up to see what you have built. What you do with the reflection is a separate decision.
The five levels
Level 1: Experimenting. Individual engineers and small teams are trying things, the results are real but isolated, the costs land on somebody's personal card or a shared API key, and the question "what is this for, exactly" is rarely asked. This is not a bad place to be; it is where serious adoption usually starts. It becomes a problem when it persists past the point where the org is making real revenue from the work.
Level 2: Adopted. Multiple teams have AI in production, the results are uneven, the costs are visible enough to worry finance, and the platform team has started to receive questions it cannot fully answer. Most orgs we talk to sit here. The defining quality of Level 2 is the absence of common ground: every team has its own provider, its own logs, its own definition of "good", and the leader trying to answer the board question gets a different answer from every team they ask.
Level 3: Observed. A central observability surface exists, traces are captured for the major systems, cost dashboards have been set up, evals exist for the agents nobody can afford to have go wrong, and the conversation about governance has at least started. This is not yet governance, but it is the precondition for it. Until you can see what is happening, you cannot shape it.
Level 4: Governed. Policies and technical enforcement land together. Evals run as a CI gate, server-side cost limits compose cleanly across organization, project, user, and session, RBAC is fine-grained enough to be useful, audit trails are mandatory rather than aspirational, and the org conducts agent postmortems when something goes wrong. This is the level where most regulated buyers want to be, and where the org's own posture in front of an auditor stops being awkward.
Level 5: Optimized. The loop is closed and continuous. Evals run online against live traffic, unit economics are tracked per agent run, the agents themselves can read their own traces and iterate on what they did, and the practice has become reflexive rather than imposed. Level 5 is rare and not the right target for every org. It requires a maturity in the surrounding engineering culture that not every shop has, and chasing it before the lower levels are solid is a recipe for elegant scaffolding around a wobbly foundation.
The five dimensions
The model travels across five dimensions. Each one maps to a capability a senior leader can directly invest in, and together they cover the actual surface area a serious AI practice needs.
Visibility and observability is what is running, what it did, who used it, what it cost, traceable end to end, from the agent down to the tool call and back. Without this, the others cannot meaningfully exist.
Evaluation and quality is how you know the system does what it says: eval suites, regression coverage, guardrails as a special case, the CI gate that lets legal and risk put their name against the output. This is where non-deterministic systems get their definition of done.
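To make the CI gate concrete, here is a minimal sketch in plain Python. It is not the Pydantic Evals API; `EvalResult`, `ci_gate`, the case names, and the 0.9 threshold are all hypothetical, chosen only to show the shape of the idea: an eval suite produces pass/fail results, and the build fails when the pass rate drops below an agreed bar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one eval case (hypothetical structure, for illustration)."""
    case: str
    passed: bool


def ci_gate(results: list[EvalResult], min_pass_rate: float = 0.95) -> int:
    """Return a process exit code: 0 lets the build through, 1 blocks it.

    An empty suite is treated as a failure so that a missing eval run
    cannot pass the gate vacuously.
    """
    if not results:
        return 1
    pass_rate = sum(r.passed for r in results) / len(results)
    return 0 if pass_rate >= min_pass_rate else 1


# Illustrative results only; in practice these come from your eval runner.
results = [
    EvalResult("refund_policy", True),
    EvalResult("pii_redaction", True),
    EvalResult("tone", False),
]
exit_code = ci_gate(results, min_pass_rate=0.9)  # 2 of 3 passed, below the bar
```

In a pipeline, the exit code would be handed to `sys.exit` so the CI job fails. The threshold itself is the part worth negotiating with legal and risk, because it is the definition of done they are signing.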
Cost governance is how spend is shaped at the edge: who can spend how much, what visibility you have per agent and per user, what server-side limits compose cleanly across the org without breaking the dev loop.
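One way to read "compose cleanly" is that a request is allowed only if every enclosing scope still has room for it, so the tightest remaining budget governs. The sketch below assumes that interpretation; it is not how the Pydantic AI Gateway is implemented, and `Budget`, `try_spend`, and the amounts are hypothetical. Amounts are in integer cents to avoid floating-point drift.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """A spend limit for one scope (hypothetical structure, for illustration)."""
    scope: str            # e.g. "org", "project", "user", "session"
    limit_cents: int
    spent_cents: int = 0

    def remaining(self) -> int:
        return max(self.limit_cents - self.spent_cents, 0)


def try_spend(budgets: list[Budget], cost_cents: int) -> Optional[str]:
    """Charge every scope, or charge none.

    Returns the first scope that blocks the request, or None if it was
    allowed and recorded against all scopes.
    """
    for budget in budgets:
        if cost_cents > budget.remaining():
            return budget.scope
    for budget in budgets:
        budget.spent_cents += cost_cents
    return None


# Illustrative limits only, ordered from widest scope to narrowest.
budgets = [
    Budget("org", 100_000),
    Budget("project", 20_000),
    Budget("user", 5_000),
    Budget("session", 500),
]
first = try_spend(budgets, 300)   # fits every scope, so it is recorded
second = try_spend(budgets, 300)  # the session has only 200 cents left
```

Returning the blocking scope rather than a bare refusal matters for the dev loop: a developer who is told which limit they hit can ask the right person to raise it.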
Access and identity is who can run what, with which credentials, against which data, with what scope. It progresses from shared keys at Level 1 to policy-as-code at Level 5, and it is the dimension most orgs are furthest along on, because SSO and IAM are mature problems they have already solved for other reasons.
Audit and incident response is what is recorded, how long it lives, what counts as evidence when an auditor asks, and how the org conducts a postmortem when an agent misbehaves. This dimension matters most at the moment it matters at all, which is a moment nobody planned for.
The grid
Putting the five levels against the five dimensions gives the artifact: a 5x5 grid you can hold up against your own org to get an honest read without consultants in the room.
| Dimension | L1 Experimenting | L2 Adopted | L3 Observed | L4 Governed | L5 Optimized |
|---|---|---|---|---|---|
| Visibility | None, or scattered logs | Per-team dashboards | Central observability, traces captured | Cross-system traces, instrumentation standards | Agent-readable traces, reflexive |
| Evaluation | None | Manual ad-hoc review | Eval suites for critical agents | Evals as CI gate, regression coverage | Continuous online evals |
| Cost | Invoice-driven discovery | Dashboards, no caps | Centralized dashboards, threshold alerts | Server-side limits at org/project/user/session | Unit economics per agent run |
| Access | Shared keys, personal cards | Individual keys, partial SSO | SSO plus coarse RBAC | Fine-grained RBAC, scoped keys | Policy-as-code, automated rotation |
| Audit | No records | Scattered logs | Retention policies, exports | Mandatory audit trail, postmortem template | Continuous compliance posture |
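If it helps to make the read concrete, the grid can be recorded as one level per dimension and summarized. The sketch below is only an illustration: the `Level` names follow the table, while `summarize` and the sample assessment are hypothetical and not taken from any real org.

```python
from enum import IntEnum


class Level(IntEnum):
    EXPERIMENTING = 1
    ADOPTED = 2
    OBSERVED = 3
    GOVERNED = 4
    OPTIMIZED = 5


DIMENSIONS = ("visibility", "evaluation", "cost", "access", "audit")


def summarize(assessment: dict[str, Level]) -> dict:
    """Report which dimensions lead, which lag, and how wide the spread is."""
    missing = set(DIMENSIONS) - set(assessment)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    lowest = min(assessment[d] for d in DIMENSIONS)
    highest = max(assessment[d] for d in DIMENSIONS)
    return {
        "lagging": [d for d in DIMENSIONS if assessment[d] == lowest],
        "leading": [d for d in DIMENSIONS if assessment[d] == highest],
        "spread": int(highest - lowest),
    }


# A made-up self-assessment, for illustration only.
assessment = {
    "visibility": Level.ADOPTED,
    "evaluation": Level.ADOPTED,
    "cost": Level.OBSERVED,
    "access": Level.GOVERNED,
    "audit": Level.OBSERVED,
}
summary = summarize(assessment)
```

The summary deliberately reports no single overall level. Averaging the row away would hide exactly the unevenness the grid is meant to show.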
Where most orgs actually are
From a fair number of conversations now, most orgs sit somewhere between Level 2 and Level 3, with some dimensions ahead (usually access, because SSO and IAM were already in place for other reasons) and others behind (usually evaluation, because evals are still seen as optional, or as a thing the data-science team does, rather than as a CI gate the whole org depends on).
The most common stuck pattern is the L2-to-L3 gap on visibility. Teams have local dashboards, sometimes good ones, but no central observability surface, and the friction of standing one up keeps getting deferred because everything still mostly works. The cost of staying stuck is not catastrophic, it is just slow: every agent incident is investigated from scratch, every cost spike is explained by a different person, every compliance question gets a different answer depending on who is in the room. The org pays for this in attention rather than money, which is why it persists.
How to use this: the leader's move
The leader's move, looking at this grid, is not to rush to Level 5 across the board. The cost of doing that all at once is high, the value is uneven, and the political capital it spends is rarely available outside a fresh mandate. The better move is to pick the gap that is currently bleeding and close it before opening the next one.
Bleeding looks different at different orgs. If finance is shouting at you, that is cost. If a customer-facing agent misbehaved and the post-incident review went poorly, that is evaluation. If an auditor is calling, that is audit. If a new team got AI access through an unsanctioned route and you found out after the fact, that is access. The model does not tell you which gap to close, because that depends on what is in front of you. What it does is give you sharper language for naming what you are looking at.
Where the Pydantic stack fits, and where it does not
Because we will be asked, we will be plain about where the Pydantic stack maps onto this grid and where it does not.
- Pydantic Logfire covers the visibility dimension end to end, and provides the trace retention and audit-trail surface for the audit dimension.
- Pydantic Evals covers the evaluation dimension, including the CI gate at Level 4 and the online evals at Level 5.
- Pydantic AI Gateway covers the cost-governance dimension and the key-custody parts of access.
- The rest of the access dimension is shared with whatever SSO and policy infrastructure you have already standardized on, because that is a problem your IAM team has already solved.
What the stack does not yet fully cover: the upstream policy-as-code surface at Level 5 (which usually sits with the IAM team and is owned outside engineering), the full incident-management workflow (which sits with whatever incident tooling the org already runs), and the parts of guardrails that are deeply customer-domain specific, because those have to live in your code rather than in ours. Nobody's stack covers everything, and the question of how the pieces fit together is itself a piece of the maturity model.
A small note to close
The model is a mirror, not a target. The most useful question it asks is not "what level are we at", which invites a defensive answer, but whether the org is moving with intention or moving because the calendar moved. Most of the time the honest answer is mixed. That is the point at which the model stops being decorative and starts being useful: it is just a more honest mirror than the one you have been using.
If you are the leader trying to answer "where are we on AI" for your board and the existing models are not quite landing, that is the conversation we have most weeks. Talk to us about Logfire, Evals, and the AI Gateway, and we will help you read your own org against this grid honestly.