
How Lema AI cut code by 63% and boosted development velocity by 40%

Lema AI evaluated several agent frameworks before choosing Pydantic AI for its structured output validation, intuitive API, and seamless integration with Pydantic Logfire (our AI Observability Platform). The switch was a turning point in building their Agentic Risk Engineer - an autonomous system that investigates third-party security with forensic depth.

“This approach of being very structured and being very appealing to developers is really what increased our velocity. I think that was one of the real turning points in achieving greater velocity in development. It allowed us to develop more agentic pipelines, and even improve some of the existing ones.”

— Amitai Frey, Engineering Lead, Lema AI

Lema AI is building the world's first Agentic Risk Engineer for third-party risk management (TPRM). Third parties are everywhere: cloud providers, contractors, vendors—every entity that is part of your business but not part of your organization. Dependence on these third parties exposes organizations to inherent risks, including cybersecurity, financial, legal, and service delivery failures.

"This space has been neglected for a long time," explains Omer Yehudai, Co-founder and Chief Product Officer at Lema. "It used to be mostly manual work. Analysts spent hours trying to analyze and assess risk. We utilize AI not just to automate the workflow, but to elevate it—performing deep, forensic validation that simply wasn't possible for human teams to sustain at scale."

Lema's platform goes beyond simple document scanning to perform forensic analysis on artifacts like SOC2 reports, security policies, and contracts. For example, when Lema models a data-sharing engagement, it validates the entire data lifecycle - cross-referencing scanned technical attestations to verify if encryption protocols are sufficient for the specific sensitivity of the data being shared. While traditional questionnaires accept vendor claims at face value, Lema’s agents enforce forensic verification—cross-referencing thousands of data points to expose the objective ground truth buried in the artifacts.

The problem? Most LLM tooling assumes you want chatbot-style responses.

"You're basically chatting with a model that responds in natural language. That's nice for a chatbot, but that's not our use case. We rely on getting structured output and reaching decisions as a result. Having validation built into responses is really crucial for us."

— Amitai Frey, Engineering Lead, Lema AI
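The "validation built into responses" point can be illustrated with plain Pydantic. The schema below is hypothetical (a `ControlFinding` model invented for this sketch, not Lema's actual code): a well-formed response becomes a typed object, while free-text or malformed output is rejected before it can reach a decision step.

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema for a single control check; not Lema's real model.
class ControlFinding(BaseModel):
    control: str
    satisfied: bool
    evidence_quote: str

# A well-formed model response parses into a typed, validated object...
good = ControlFinding.model_validate({
    "control": "encryption-at-rest",
    "satisfied": True,
    "evidence_quote": "Customer data is encrypted with AES-256.",
})
assert good.satisfied is True

# ...while a vague, chatbot-style payload is rejected outright,
# so nothing unvalidated flows into downstream decisions.
try:
    ControlFinding.model_validate({"control": "encryption-at-rest", "satisfied": "sort of"})
except ValidationError:
    print("rejected")
```

The same model can then be handed to an agent framework as the required output type, so validation happens on every response rather than as an afterthought.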

Most RAG use cases fall into two distinct categories: searching across a vast, unstructured corpus (e.g., all US case law) or querying a single, specific document (typically 10–30 pages). Lema operates in the difficult middle ground between these extremes. Their system analyzes comprehensive vendor dossiers—collections comprising dozens of distinct files. A single assessment might involve Master Services Agreements (MSAs), Data Processing Addendums (DPAs), and security exhibits. This creates a specific set of friction points that standard RAG architectures struggle to handle:

  • Cross-document dependencies: Legal concepts are often fragmented. A definition might live in the MSA, while the specific liability cap that relies on that definition lives in an addendum three files away. Standard retrieval fetches chunks in isolation, often severing the logic required to understand the full picture.
  • The hierarchy of authority: In legal analysis, documents have an order of precedence (e.g., an addendum usually overrides the main contract). Generic retrieval systems treat all text chunks as equally weighted, which can lead to answers that are factually present in the text but legally invalid.
  • The context trap: The total dataset is too large to feed into a context window without encountering the "lost in the middle" phenomenon, yet the documents are too interconnected to be processed in silos.

To solve this, Lema built a pipeline designed for "radical grounding"—treating text not just as context, but as evidence:

  • Holistic scoped retrieval: Instead of a flat search, they scope retrieval to the specific vendor packet, allowing the system to map relationships across the full spectrum of documents.
  • Enforced citation: The model is constrained to generate answers only when it can anchor them to specific text segments.
  • Proof via highlighting: Lema removes the "black box." When the system provides an answer, it highlights the exact excerpt within the document hierarchy that supports the decision, allowing for instant human verification.

"We need to back up our decision," says Amitai. "That's really one of the main cores of our product. And one of the reasons we do it well is because of how we built the pipeline."
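One way to express that "enforced citation" rule is directly in the output schema. This is a hedged sketch, not Lema's implementation: `Citation` and `GroundedAnswer` are invented names, and the `min_length=1` constraint encodes "no anchored excerpt, no answer" at validation time.

```python
from pydantic import BaseModel, Field, ValidationError

class Citation(BaseModel):
    document: str  # e.g. a file within the scoped vendor packet
    excerpt: str   # the exact text highlighted for human verification

class GroundedAnswer(BaseModel):
    answer: str
    # min_length=1 rejects any answer that is not anchored to evidence.
    citations: list[Citation] = Field(min_length=1)

# An answer backed by a specific excerpt validates cleanly.
ok = GroundedAnswer.model_validate({
    "answer": "Liability is capped at 12 months of fees.",
    "citations": [{
        "document": "msa-addendum.pdf",
        "excerpt": "liability shall not exceed fees paid in the preceding 12 months",
    }],
})

# An unbacked claim fails validation instead of reaching the user.
try:
    GroundedAnswer.model_validate({"answer": "Encryption looks fine.", "citations": []})
except ValidationError:
    print("unbacked answer rejected")
```

Pushing the grounding rule into the schema means the constraint is enforced uniformly, whether the answer comes from the model's first attempt or a retry.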

Lema is primarily a Go shop, but about six months ago, the team decided to evaluate their AI stack. They wanted to find the framework that best matched how they needed to build: structured, validated, and developer-friendly.

They conducted a rigorous evaluation, implementing the same agent system across multiple frameworks. They were open to anything, not just Python.

The frameworks they tested:

  • LangChain Go
  • LangChain Python
  • LangGraph
  • CrewAI
  • Langflow
  • Pydantic AI

After the team implemented the exact same system in multiple frameworks, the comparison was clear.

"We implemented the exact same thing in LangChain, LangGraph, and Pydantic AI. The latter was much cleaner. It felt easier. It was nicer in every way."

— Alon Menczer, Engineering Lead, Lema AI

Lema AI chose Pydantic AI because it aligned with how they needed to build AI systems. The classic Pydantic Stack approach (structured responses with built-in validation, code modularity, and performance) was exactly what their use case demanded.

"This is a problem in general in the LLM space," explains Amitai. "These pipelines can't really be based on free text or natural language. The structured responses and the validation, having it built in that direction - that's really crucial for us."

Everything clicked: the API, the ease of validation, the ability to customize, the easy reusability of tools.

"We understand the problem and the way solutions need to be built in a similar way. There's someone else who builds it in a way we actually want to use it, and the tools solve real problems."

— Alon Menczer, Engineering Lead, Lema AI

Good documentation sealed the deal. "Good documentation is very important," says Alon. "We also have the LLMs.txt, which is useful if you want an LLM to figure out what's going on. Much easier when using Cursor or any other AI coding tools."

Lema's RAG pipelines involve multiple steps, document retrievals, and nested questions. Debugging without visibility would be impossible.

"We really loved the integration with Pydantic Logfire, which was so simple," says Alon. "Attaching Pydantic AI to Logfire allowed us to debug effectively."

The visualization particularly helps with their complex pipelines. "The visualization in Logfire is pretty good for us because we can see the turns," explains Amitai. "We have this big RAG pipeline that asks many questions. Even seeing the quotes there was very beneficial."
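In recent Logfire releases, the integration Alon describes amounts to a couple of configuration lines. Treat this as a sketch under that assumption; the `send_to_logfire="if-token-present"` setting keeps local runs working even without credentials configured.

```python
import logfire

# Configure once at startup; only sends data when a Logfire token is present.
logfire.configure(send_to_logfire="if-token-present")

# Emit spans for agent runs, model requests, and tool calls.
logfire.instrument_pydantic_ai()
```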

The team also values Logfire's SQL-based search. “The SQL search is intuitive and makes more sense for more technical users,” says Amitai.

Lema's RAG pipeline is core to their product. They're constantly improving it: better quote retrieval, question decomposition, handling more complex queries. They're even allowing customers to ask their own questions, which is a real challenge. Each change needs validation.

"As a small startup doing our best with limited resources, implementing evals in Pydantic was really easy," says Amitai. "It didn't take much time, which was one of the things that was blocking us from doing it previously."

Improving the pipeline means searching for better quotes and separating questions into sub-questions, and every improvement requires evaluation to see whether it actually helps.

“That’s where evals come in. Well-designed evals are key to improving development velocity,” says Amitai.

Since adopting The Pydantic Stack, Lema AI has seen measurable improvements in how they build:

  • 63% less code: When migrating from their previous agentic framework to Pydantic AI, the number of lines in their AI module was reduced by 63%
  • 40% faster development: The team estimates their development velocity increased by around 40%
  • Future plans: The team is also looking at using Pydantic AI Gateway for their infrastructure

"This approach of being very structured and being very appealing to developers is really what increased our velocity. I think that was one of the real turning points in achieving greater velocity in development. It allowed us to develop more agentic pipelines, and even improve some of the existing ones."

— Amitai Frey, Engineering Lead, Lema AI



Building AI systems that need structured, validated outputs? Get started with Pydantic AI.

Already a Pydantic AI user and looking to improve your system's observability? Try Pydantic Logfire.

1. Can Pydantic AI handle RAG (Retrieval-Augmented Generation) pipelines? Yes. Lema AI uses Pydantic AI to manage a complex "middle ground" RAG pipeline that processes 20–50 legal documents per vendor. The framework helps structure the retrieval and validate the specific quotes used for decision-making.

2. How does Pydantic AI ensure structured outputs? Pydantic AI leverages Pydantic’s core validation library. By defining data models in Python code, the framework ensures the LLM response adheres strictly to the required schema, rejecting or retrying invalid responses automatically.

3. How did Pydantic AI impact code efficiency? In this specific case study, Lema AI reported a 63% reduction in code lines compared to their previous agentic framework implementation, largely due to reduced boilerplate and cleaner abstractions.

4. What is the "LLMs.txt" mentioned in the case study? LLMs.txt is a documentation standard used by AI coding assistants (like Cursor or GitHub Copilot) to understand a library. Lema AI noted that Pydantic’s excellent documentation and LLMs.txt support made it easier for their AI tools to write correct code.