URL: https://dev.to/trinh_trankhanhduy_3429/building-a-privacy-first-document-processor-with-ollama-gradio-466f

Building a privacy-first document processor with Ollama + Gradio


A step-by-step guide to building a local AI document processor that makes zero external network calls — useful for processing NDA-bound contracts, confidential reports, or any document you can't upload to ChatGPT.

Architecture overview

PDF/DOCX file
 ↓
pdfplumber / python-docx (text extraction)
 ↓
System prompt + document text
 ↓
Ollama API (localhost:11434)
 ↓
Gradio UI (localhost:7860)
 ↓
Summary / Q&A / entities

Everything runs on localhost. Zero cloud dependencies at runtime.

Prerequisites

  • Python 3.11+
  • Ollama installed and running
  • 8GB+ RAM (16GB recommended)

# Install Ollama (Windows)
winget install Ollama.Ollama

# Pull a model
ollama pull llama3.1:8b
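
Before moving on, it can help to confirm that the Ollama server is reachable and that the model was actually pulled. The following is a minimal sketch, assuming Ollama's /api/tags endpoint (which lists locally available models) and using only the standard library. The helper names model_in_tags and check_ollama are illustrative, not part of Ollama.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # lists locally pulled models


def model_in_tags(payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(m.get("name") == model for m in payload.get("models", []))


def check_ollama(model: str = "llama3.1:8b", url: str = TAGS_URL) -> str:
    """Report whether the server is up and the model is available locally."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
    except OSError as exc:  # covers URLError, refused connections, timeouts
        return f"Ollama is not reachable at {url}: {exc}"
    if model_in_tags(payload, model):
        return f"{model} is ready"
    return f"{model} not found; run: ollama pull {model}"
```

Calling print(check_ollama()) should report either that the model is ready or what is missing.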

Core dependencies

pip install gradio pdfplumber python-docx requests

Step 1: Text extraction

import pdfplumber
import docx  # provided by the python-docx package
from pathlib import Path

def extract_text(file_path: str) -> str:
    """Return the plain text of a PDF or DOCX file."""
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        with pdfplumber.open(file_path) as pdf:
            # Pages without a text layer yield nothing, hence the `or ""`
            return "\n\n".join(
                page.extract_text() or "" for page in pdf.pages
            )
    elif suffix == ".docx":  # python-docx cannot read legacy .doc files
        doc = docx.Document(file_path)
        return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
    raise ValueError(f"Unsupported file type: {path.suffix}")
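
Because pages without a text layer contribute an empty string, a scanned PDF can pass through extraction and leave the model with nothing to analyze. A small sketch for flagging such pages follows; find_empty_pages and the 20-character threshold are assumptions for illustration, and the sample page texts are made up.

```python
from typing import Optional, Sequence


def find_empty_pages(page_texts: Sequence[Optional[str]], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is missing or very short."""
    empty = []
    for number, text in enumerate(page_texts, start=1):
        if len((text or "").strip()) < min_chars:
            empty.append(number)
    return empty


# Hand-written sample; in practice the list would come from
# [page.extract_text() for page in pdf.pages].
pages = ["This agreement is made between the two parties named below.", None, "   "]
print(find_empty_pages(pages))  # pages 2 and 3 carry no usable text
```

If most pages are flagged, the file probably needs OCR before it is worth sending to the model.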

Step 2: Ollama integration

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def query_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    response = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete JSON object
    }, timeout=120)
    response.raise_for_status()
    return response.json()["response"]

Note: http://localhost:11434 is the local Ollama server, not a cloud API, so no authentication is needed.
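
The function above sets "stream": False and waits for the full answer. With streaming enabled, Ollama sends newline-delimited JSON objects, each carrying a text fragment in "response" and a "done" flag on the last one. The sketch below assembles such a stream; iter_stream_text is a hypothetical helper and the sample lines are hand-written, not captured output.

```python
import json
from typing import Iterable, Iterator, Union


def iter_stream_text(lines: Iterable[Union[bytes, str]]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        if not raw or not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of the answer


sample = [
    b'{"response": "The parties ", "done": false}',
    b'{"response": "are A and B.", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_stream_text(sample)))
```

With requests, the lines would typically come from requests.post(OLLAMA_URL, json={..., "stream": True}, stream=True).iter_lines().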

Step 3: Domain-specific system prompts

Generic prompts give generic results, so define a tuned prompt for each document type:

DOMAIN_PROMPTS = {
    "legal": (
        "You are a legal document analyst. Extract and structure the following "
        "from the document:\n"
        "1. PARTIES: All named parties and their roles\n"
        "2. KEY DATES: Effective date, termination, deadlines\n"
        "3. OBLIGATIONS: Each party's obligations\n"
        "4. PAYMENT TERMS: Amounts, schedules, conditions\n"
        "5. UNUSUAL CLAUSES: Non-standard or notable provisions\n"
        "6. GOVERNING LAW: Jurisdiction and dispute resolution\n"
        "Be factual and precise. Do not interpret or give legal advice."
    ),
    "financial": (
        "You are a financial document analyst. Extract:\n"
        "1. AMOUNTS: All monetary values with context\n"
        "2. DATES: Payment dates, fiscal periods, deadlines\n"
        "3. PARTIES: Vendors, clients, counterparties\n"
        "4. TERMS: Payment terms, penalties, conditions\n"
        "5. KEY METRICS: Revenue, costs, margins if present"
    ),
}

def process_document(file_path: str, domain: str, model: str) -> str:
    text = extract_text(file_path)
    system = DOMAIN_PROMPTS.get(domain, "Summarize the key points of this document.")
    prompt = f"{system}\n\nDOCUMENT:\n{text[:12000]}"  # keep only the first ~12k characters
    return query_ollama(prompt, model)
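
Slicing with text[:12000] silently drops everything after the first 12,000 characters. One alternative, sketched here as an assumption rather than part of the original design, is to split the text into overlapping chunks and process each one separately. The chunk_text helper and the 500-character overlap are illustrative choices.

```python
def chunk_text(text: str, max_chars: int = 12000, overlap: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, each sharing `overlap` characters with the previous one."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so clauses cut at a boundary appear in both chunks
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk could then be sent through the same domain prompt and the partial results concatenated or summarized in a final pass. Whether that improves on plain truncation depends on the document and the model.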

Step 4: Privacy-safe Gradio UI

import gradio as gr

def build_ui():
    # analytics_enabled is a gr.Blocks() option, not a launch() argument
    with gr.Blocks(title="Local Document Processor", analytics_enabled=False) as app:
        gr.Markdown("## Local Document Processor\n*All processing on your machine, no cloud*")

        with gr.Row():
            file_input = gr.File(label="Upload PDF or DOCX", file_types=[".pdf", ".docx"])
            domain = gr.Dropdown(
                choices=list(DOMAIN_PROMPTS.keys()),
                value="legal",
                label="Domain"
            )

        process_btn = gr.Button("Process Document", variant="primary")
        output = gr.Textbox(label="Result", lines=20)

        process_btn.click(
            fn=lambda f, d: process_document(f.name, d, "llama3.1:8b"),
            inputs=[file_input, domain],
            outputs=output,
        )

    return app

if __name__ == "__main__":
    app = build_ui()
    app.launch(
        server_name="127.0.0.1",  # localhost only
        share=False,              # no Gradio tunnel
    )
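
The click handler above reads f.name, which fails if the button is pressed before a file is uploaded, since the handler then receives None. Depending on the Gradio version, the upload may also arrive as a plain path string or as a wrapper object with a .name attribute. A defensive sketch, with resolve_upload_path as a hypothetical helper:

```python
from typing import Any, Optional


def resolve_upload_path(upload: Any) -> Optional[str]:
    """Return a filesystem path for an uploaded file, or None if nothing was uploaded."""
    if upload is None:
        return None
    if isinstance(upload, str):
        return upload  # already a path
    return getattr(upload, "name", None)  # file-like wrapper exposing .name
```

In the handler, a None result can be turned into a message such as "Please upload a file first." instead of an exception.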

Step 5: Batch processing

For processing entire folders:

import zipfile
import tempfile
from pathlib import Path

def batch_process(folder_path: str, domain: str, model: str) -> str:
    results = {}
    for file in Path(folder_path).glob("*"):
        if file.suffix.lower() in (".pdf", ".docx"):
            try:
                results[file.name] = process_document(str(file), domain, model)
            except Exception as e:
                # Record the failure and continue with the remaining files
                results[file.name] = f"ERROR: {e}"

    # Package results as ZIP
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
        with zipfile.ZipFile(tmp.name, "w") as zf:
            for filename, content in results.items():
                zf.writestr(f"{filename}.txt", content)
    return tmp.name
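
Since failures are stored as text starting with "ERROR:" rather than raised, it is worth checking the archive after a run. A minimal sketch for reading the ZIP back and listing failed files follows. The names read_results and failed_entries are illustrative.

```python
import zipfile


def read_results(zip_path: str) -> dict[str, str]:
    """Map each entry name in the results archive to its text content."""
    with zipfile.ZipFile(zip_path) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


def failed_entries(results: dict[str, str]) -> list[str]:
    """Names of entries whose content starts with the ERROR: marker."""
    return sorted(name for name, text in results.items() if text.startswith("ERROR:"))
```

For example, failed_entries(read_results(batch_process(folder, "legal", "llama3.1:8b"))) would list the documents that need a retry.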

Performance tips

  • Context window: Truncate documents to ~12,000 characters for reliable results with 8b models
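  • Temperature: Set "temperature": 0.1 inside the request's "options" object for factual extraction (lower values make output more deterministic, which tends to reduce invented details)
  • Streaming: Use "stream": True for better UX on long documents, updating the UI as tokens arrive
  • Model selection: qwen2.5:3b for speed, llama3.1:8b for a balance of speed and quality, llama3.1:70b for the best accuracy if your hardware allows (it needs far more memory than the 16GB recommended above)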
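
In Ollama's REST API, sampling parameters such as temperature go in an "options" object rather than at the top level of the request. The sketch below builds such a payload. build_payload is a hypothetical helper, and the num_ctx value of 8192 (context length in tokens) is an arbitrary illustration, since larger contexts use more memory.

```python
def build_payload(
    prompt: str,
    model: str = "llama3.1:8b",
    temperature: float = 0.1,
    num_ctx: int = 8192,
    stream: bool = False,
) -> dict:
    """Build a JSON-serializable body for POST /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": stream,
        "options": {
            "temperature": temperature,  # low value for more deterministic extraction
            "num_ctx": num_ctx,          # assumed value; tune to your RAM and model
        },
    }
```

The result can be passed as the json= argument of requests.post in place of the inline dictionary used in Step 2.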

Verification

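Run Wireshark with the filter "not host 127.0.0.1 and not host ::1" while processing a document. The filter shows all non-loopback traffic on the machine, so close other network-using applications first. During processing you should see no new packets, which confirms that no data leaves your machine.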
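
As a complement to packet capture, the Python side can be made to fail loudly if it ever tries to reach a non-loopback address. The sketch below temporarily wraps socket.socket.connect. It only covers connections made through that method inside the current Python process (not DNS lookups and not the separate Ollama process), so it is a sanity check under those assumptions rather than proof. All names here are illustrative.

```python
import contextlib
import ipaddress
import socket


class ExternalConnectionBlocked(RuntimeError):
    """Raised when code tries to connect to a non-loopback address."""


def is_loopback(host: str) -> bool:
    """True for 'localhost' and loopback IPs such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # anything that is not a literal IP is treated as external


@contextlib.contextmanager
def loopback_only():
    """Within this block, refuse IP connections to non-loopback hosts."""
    real_connect = socket.socket.connect

    def guarded_connect(sock, address):
        is_ip_socket = sock.family in (socket.AF_INET, socket.AF_INET6)
        if is_ip_socket and not is_loopback(str(address[0])):
            raise ExternalConnectionBlocked(f"blocked connection to {address!r}")
        return real_connect(sock, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        socket.socket.connect = real_connect  # always restore the original method
```

Wrapping a call such as process_document("contract.pdf", "legal", "llama3.1:8b") in "with loopback_only():" would raise ExternalConnectionBlocked if any library attempted an outbound connection during processing.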

Full product

The complete version (batch mode, 10 domain types, hardware detection, Windows installer, 12 use-case recipes) is available at https://journeyer376.gumroad.com/l/ussytd for $39.

The architecture above is the core of what it does — the product adds packaging, documentation, and domain prompt iteration aimed at non-developers.


Questions about the architecture or model benchmarks? Happy to answer in the comments.