Messy insurance emails, property listings, and medical prescriptions are important documents in my daily workflow, but most of the time, it is difficult to get usable data out of them. Traditional document extraction tools often fail in real-world scenarios. They are rigid by design and depend on fixed templates and layout assumptions.
That’s when Google’s open-source LangExtract caught my attention. It takes a different approach. Instead of treating documents as static layouts, it uses large language models as the backbone to understand them contextually and return structured data.
I wanted to test how capable it really was and what I could get out of it. I built an app with FastAPI as the backend and React as the frontend, pairing it with the Gemini 2.5 Flash model to simplify document processing and extract structured data from messy documents.
LangExtract is an open-source project developed by Google. LangExtract Pro is a separate, independent project built by the author to explore real-world implementation scenarios.
The problem with traditional document extraction
Why PDFs, emails, and semi-structured text break template-based systems
There are many document extraction tools available, but they often fail in dynamic environments because they are built around structure, not meaning. They rely on fixed templates, predefined field positions, and layout assumptions. They work well until the template or format changes.
Real-world documents don’t follow a single template or format. Different insurance agencies structure their data differently. One doctor’s prescription rarely follows the same structure as another’s. An email conversation can contain messy, unorganized data. In these scenarios, the logic behind traditional systems breaks.
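To make that failure mode concrete, here is a minimal sketch of a template-style rule. The regular expression, the field label, and both sample texts are hypothetical and only illustrate the pattern; they are not taken from any real tool.

```python
import re

# Hypothetical template rule: assumes the claim amount always appears
# on its own line as "Claim Amount: $<number>".
CLAIM_AMOUNT_RULE = re.compile(
    r"^Claim Amount: \$([\d,]+(?:\.\d{2})?)$", re.MULTILINE
)


def extract_claim_amount(text):
    """Return the claim amount as a string, or None if the template does not match."""
    match = CLAIM_AMOUNT_RULE.search(text)
    return match.group(1) if match else None


# Matches the layout the rule was written for.
templated = "Policy No: AB-1234\nClaim Amount: $2,500.00\n"

# The same information, phrased the way a client might write it in an email.
conversational = "Hi, I'd like to claim about $2,500 for the damage under policy AB-1234."

print(extract_claim_amount(templated))       # 2,500.00
print(extract_claim_amount(conversational))  # None
```

The rule is correct for its own template and silently returns nothing for the email, even though a human reader finds the amount immediately.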
OCR systems follow a similar pattern: they can convert images to text, but they don’t understand the context. Organizing that text into structured information is difficult, and the output usually needs brittle rules or manual validation layers.
The limitation of these systems is simple: they treat documents as static layouts. But documents are language, and they contain intent, relationships, and context. Large language models work on meaning and context rather than position, and that difference is what makes them better suited to this kind of extraction.
What makes LangExtract different
A schema-first approach
Google’s LangExtract is built to address the limitations of these traditional document extraction systems. It follows a meaning-driven philosophy: it drops the layout assumptions, reads the text for context and intent, and returns structured, meaningful data.
The schema-first approach focuses on structured fields instead of a general format. For example, in the case of a doctor’s prescription, it prioritizes extracting fields like patient name, medicine name, and dosages. The schema guides the model's output, replacing regex and keyword-based matching.
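One way to picture the schema-first idea is to declare the fields up front and check any extracted record against them. The sketch below uses a plain Python dataclass for that; the `Prescription` class, the `missing_fields` helper, and the sample record are hypothetical and are not part of LangExtract’s API.

```python
from dataclasses import dataclass, fields


@dataclass
class Prescription:
    """Hypothetical schema: the fields we care about in a prescription."""
    patient_name: str
    medicine_name: str
    dosage: str


def missing_fields(record: dict) -> list[str]:
    """List schema fields that the extracted record omits or leaves empty."""
    return [f.name for f in fields(Prescription) if not record.get(f.name)]


# A made-up record, shaped like something a model might return.
record = {"patient_name": "Jane Doe", "medicine_name": "Amoxicillin"}
print(missing_fields(record))  # ['dosage']
```

Because the schema names the fields rather than their positions, the same check works no matter how the original prescription was laid out.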
The Python library alone isn’t responsible for this. Under the hood, a large language model (LLM) such as Gemini or GPT provides the contextual understanding of the parsed data.
LLMs are trained on vast datasets and do not depend on fixed layouts. They can handle narrative and semi-structured texts. For example, an email conversation between an insurance client and an agent can be unstructured, but the model can infer structured details from that context.
But this was all in theory; to see how this worked in practice, I built a small app around it.
Building the app around LangExtract
FastAPI backend, React frontend, and Gemini 2.5 Flash
Google provides LangExtract as a simple Python library. You can import it, connect it to a language model, and be good to go. That was not what I was aiming for. I wanted a real-world workflow.
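import langextract as lx

# Describe what to extract (the schema-first approach)
PROMPT = "Extract the price, location, and amenities from the property listing."

# LangExtract learns the output structure from worked examples (illustrative text)
EXAMPLES = [
    lx.data.ExampleData(
        text="Two-bedroom apartment in Austin for $350,000 with a gym.",
        extractions=[
            lx.data.Extraction(extraction_class="location", extraction_text="Austin"),
            lx.data.Extraction(extraction_class="price", extraction_text="$350,000"),
            lx.data.Extraction(extraction_class="amenity", extraction_text="gym"),
        ],
    )
]

# Simple extraction function
def extract_data(text_content):
    # LangExtract handles the heavy lifting with Gemini
    # (needs an API key, e.g. in the LANGEXTRACT_API_KEY environment variable)
    result = lx.extract(
        text_or_documents=text_content,
        prompt_description=PROMPT,
        examples=EXAMPLES,
        model_id="gemini-2.5-flash",
    )
    return result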
I wanted a simple UI where I could paste unstructured data and input my prompt for what I needed, and then get a structured output. I chose React for a modern frontend and FastAPI for a responsive backend, paired with Gemini 2.5 Flash for low-latency structured extraction.
The flow was divided into four steps. The first was ingestion: either paste the data directly into the text area or upload a TXT or PDF file. The second was parsing the text out of uploaded files. The third was schema-based extraction using the LangExtract engine. The final step was returning structured data that matched the schema.
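The four steps can be sketched as a small pipeline. This is not the app’s actual code: `parse_upload`, `run_pipeline`, and `fake_extractor` are hypothetical names, PDF parsing is left out, and the extractor is passed in as a plain function so the flow can be followed without an API key.

```python
def parse_upload(filename: str, raw: bytes) -> str:
    """Step 2: turn an uploaded file into plain text (TXT only in this sketch)."""
    if filename.lower().endswith(".txt"):
        return raw.decode("utf-8", errors="replace")
    # PDF parsing would need a third-party library and is omitted here.
    raise ValueError(f"Unsupported file type: {filename}")


def run_pipeline(extractor, pasted_text=None, upload=None) -> dict:
    """Run ingestion, parsing, extraction, and return the structured result."""
    # Step 1: ingestion, from pasted text or an uploaded (filename, bytes) pair.
    if pasted_text is not None:
        text = pasted_text
    elif upload is not None:
        filename, raw = upload
        text = parse_upload(filename, raw)
    else:
        raise ValueError("Provide pasted text or an uploaded file")

    # Step 3: schema-based extraction, delegated to the supplied function.
    extracted = extractor(text)

    # Step 4: return the parsed text alongside the structured fields.
    return {"parsed_text": text, "fields": extracted}


def fake_extractor(text: str) -> dict:
    """Stand-in for the real model call; only reports the text length."""
    return {"characters": len(text)}


print(run_pipeline(fake_extractor, upload=("note.txt", b"hello")))
# {'parsed_text': 'hello', 'fields': {'characters': 5}}
```

In a real deployment the `extractor` argument would wrap the LangExtract call, and `run_pipeline` would sit behind an API endpoint.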
Parsing is usually combined with extraction, but I chose to keep it as a separate step in the workflow. That had two advantages. The first was transparency: I could see the parsed text before extracting from it. The second was that I could clean up noisy data before sending it to the model.
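As an illustration of that cleanup step, here is a minimal sketch. The specific rules (collapsing whitespace, dropping quoted reply lines that start with ">", squeezing blank lines) are assumptions about what counts as noise, not the app’s actual logic.

```python
import re


def clean_parsed_text(text: str) -> str:
    """Normalize parsed text before it is sent to the model."""
    lines = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Drop quoted reply lines, which usually repeat earlier emails.
        if line.startswith(">"):
            continue
        lines.append(line)
    cleaned = "\n".join(lines)
    # Collapse three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


raw = "Hello   team,\n\n\n\n> old quoted line\nClaim   amount:\t$400  \n"
print(clean_parsed_text(raw))
```

Keeping this as its own function also makes it easy to show the cleaned text in the UI before extraction runs.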
To see how it performs, I then tested it against real-world documents.
Real-world tests across multiple domains
Extracting structured data from healthcare, real estate, hiring, and insurance
Building an app is straightforward, but getting the desired results across different domains is a different task. Each domain prioritizes different fields, and the aim of the application was to extract the right ones. I focused on four domains because they were part of my day-to-day workflow. I intentionally chose messy, semi-structured formats to test LangExtract’s ability.
In healthcare documents, the app extracted patient details, medication names, dosages, and follow-up instructions. From the semi-structured property listing, it pulled structured fields like price, location, area, and amenities. In a job description, it identified important highlights like role titles, required skills, salary range, and experience levels. From conversational insurance emails, it inferred policy numbers, claim amounts, and incident details.
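The per-domain fields above can be kept in one place and turned into a prompt on demand. The sketch below is one possible arrangement; `DOMAIN_FIELDS`, `build_prompt`, and the prompt wording are hypothetical, and only the field names come from the tests described here.

```python
# Field lists per domain, taken from the four tests described above.
DOMAIN_FIELDS = {
    "healthcare": ["patient details", "medication names", "dosages", "follow-up instructions"],
    "real_estate": ["price", "location", "area", "amenities"],
    "hiring": ["role title", "required skills", "salary range", "experience level"],
    "insurance": ["policy number", "claim amount", "incident details"],
}


def build_prompt(domain: str) -> str:
    """Build an extraction prompt for one domain from its field list."""
    field_list = ", ".join(DOMAIN_FIELDS[domain])
    return f"Extract the following fields from the text: {field_list}."


print(build_prompt("insurance"))
# Extract the following fields from the text: policy number, claim amount, incident details.
```

Adding a new domain then means adding one entry to the dictionary rather than writing a new template.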
The consistency across various domains was noticeable. The schema remained stable. The model handled fields and structured data smoothly without any template logic. It extracted the data based on meaning rather than position.
Beyond OCR and keyword matching
Traditional document extraction tools and Google’s LangExtract are built on different philosophies. One optimizes for position and layout, and the other for meaning and context. By combining LangExtract’s structured schema with large language models, the extraction logic shifts from layout dependency to contextual reasoning. For developers looking to move beyond rigid OCR pipelines and fragile keyword-matching rules, LangExtract offers a compelling alternative.
