VOOZH about

URL: https://dev.to/mawlaia/extract-structured-data-from-any-pdf-with-one-line-of-python-open-source-3e67

⇱ Extract structured data from any PDF with one line of Python (open-source) - DEV Community


Every back-office workflow starts with a stack of PDFs. Invoice processing, loan underwriting, insurance claims, legal review — they all begin with unstructured documents and end with data that needs to go into a database.

Traditional OCR + template engines are brittle and require months of configuration per document type. mawlaia-docparse uses LLMs to make this generic.


The core idea

from docparse import Extractor, InvoiceSchema

extractor = Extractor(schema=InvoiceSchema())
result = extractor.extract("invoice.pdf")

print(result.vendor_name) # "Acme Corp"
print(result.total_amount) # 1250.00
print(result.line_items) # [{"description": "...", "qty": 2, "unit_price": 625.0}]
print(result.confidence) # 0.94

Five vertical schemas

InvoiceSchema — vendor, buyer, line items, totals, payment terms, tax, dates.

ContractSchema — parties, effective date, termination clauses, obligations, jurisdiction.

MedicalRecordSchema — patient demographics, diagnoses (ICD codes), medications, procedures, dates.

FinancialStatementSchema — revenue, expenses, EBITDA, balance sheet items, period.

IDDocumentSchema — name, DOB, document number, expiry, issuing authority.


Custom schemas

from docparse import Extractor, BaseSchema
from pydantic import BaseModel
from typing import List

class PurchaseOrderSchema(BaseSchema):
 po_number: str
 supplier: str
 line_items: List[dict]
 delivery_date: str

result = Extractor(schema=PurchaseOrderSchema()).extract("po_12345.pdf")

TypeScript

import { Extractor, InvoiceSchema } from 'mawlaia-docparse';

const result = await new Extractor({ schema: new InvoiceSchema() }).extract('invoice.pdf');
console.log(result.vendorName, result.totalAmount);

Installation

pip install mawlaia-docparse
npm install mawlaia-docparse

Source, tests (45 Python, 61 TypeScript), MIT: github.com/Mawlaia-Labs/docparse


Hosted version with batch processing, webhook delivery, and fine-tuned vertical models coming Q3 2026. Early access: dev@mawlaia.com.