
URL: https://hub.docker.com/r/dmeric/docs-to-ai



dmeric/docs-to-ai

By dmeric

Updated 8 months ago

A Model Context Protocol (MCP) server that enables LLMs to query local documents (PDF, Word, Excel).



Docs-to-AI -- PDF/Word Document Query System with MCP

GitHub repo: https://github.com/dorianmeric/docs-to-ai

A Model Context Protocol (MCP) server that enables LLMs (such as Claude Desktop, or any other LLM client that supports MCP) to query your documents using semantic search. Documents are organized into topics based on the folder structure. Supported formats: PDF, Word, Excel, Markdown. Supported extensions: .pdf, .docx, .doc, .xlsx, .xls, .xlsam, .xlsb, .md
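To see which files in a folder would qualify, the extension list above can be turned into a small filter. This is only a sketch: the function names are hypothetical, and the case-insensitive matching is an assumption rather than documented behavior of the server.

```python
from pathlib import Path

# Extensions copied verbatim from the list above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".xlsam", ".xlsb", ".md",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported list.

    Matching is case-insensitive here (an assumption, not confirmed by the project).
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_supported_files(root: str) -> list[Path]:
    """Recursively collect supported files under root, sorted for stable output."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_supported(p.name)
    )
```

For example, `find_supported_files("./my-docs")` would list the candidate files on the host before they are scanned.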

Document retrieval uses the all-MiniLM-L6-v2 embedding model, which produces 384-dimensional embeddings.
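As a rough illustration of what semantic search does with those embeddings, the sketch below ranks chunks by cosine similarity to a query vector. The vectors are toy values, the function names are hypothetical, and cosine similarity is an assumption: the distance metric the project actually configures is not stated here.

```python
import math

EMBEDDING_DIM = 384  # dimensionality stated above for all-MiniLM-L6-v2


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]], top_k: int = 3) -> list[str]:
    """Return the names of the top_k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]
```

In the real system the vectors come from the embedding model and the ranking is delegated to ChromaDB; this only shows the idea.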

1. Installation, with Docker

You need Docker Desktop or Docker Engine running. Then save the following to a file named "docker-compose.yaml":

services:
  docs-to-ai:
    image: dmeric/docs-to-ai
    container_name: docs-to-ai

    volumes:
      - ./cache/chromadb:/app/chroma_db     # ChromaDB database (persists the vector store)
      - ./cache/doc_cache:/app/doc_cache    # Document cache (persists extracted text)
      - ./my-docs:/app/my-docs:ro           # Documents directory (your PDFs and Word docs). Read-only to prevent accidental modifications

    # Stdin/stdout - required for MCP protocol
    stdin_open: true
    tty: true

    # Restart policy
    restart: unless-stopped

    # # Resource limits (optional - adjust based on your needs)
    # deploy:
    #   resources:
    #     limits:
    #       cpus: '2'
    #       memory: 4G
    #     reservations:
    #       cpus: '1'
    #       memory: 2G
 

Then run, in Bash or PowerShell:

docker compose up -d
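It can help to create the three host folders from the volumes section before starting the container, since Docker may otherwise create missing bind-mount folders itself, and on Linux those can end up owned by root. A minimal sketch, assuming the script is run from the directory that holds docker-compose.yaml (the function name is hypothetical):

```python
from pathlib import Path

# Host-side paths taken from the volumes section of the compose file above.
HOST_DIRS = ["cache/chromadb", "cache/doc_cache", "my-docs"]


def ensure_host_dirs(base: str = ".") -> list[Path]:
    """Create the host folders used as bind mounts, if they do not exist yet."""
    created = []
    for rel in HOST_DIRS:
        path = Path(base) / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if already present
        created.append(path)
    return created


if __name__ == "__main__":
    for folder in ensure_host_dirs():
        print(f"ready: {folder}")
```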

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "docs-to-ai": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "dmeric/docs-to-ai"
      ]
    }
  }
}
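If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over them. The sketch below does that with the standard library. The function name is hypothetical, and the path to claude_desktop_config.json is passed in by the caller because its location depends on the platform.

```python
import json
from pathlib import Path

# Server entry copied from the JSON above.
DOCS_TO_AI_ENTRY = {
    "command": "docker",
    "args": ["run", "-i", "--rm", "dmeric/docs-to-ai"],
}


def add_server(config_path: str, name: str = "docs-to-ai") -> dict:
    """Add (or overwrite) one entry under "mcpServers", keeping all other entries."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    config.setdefault("mcpServers", {})[name] = dict(DOCS_TO_AI_ENTRY)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Claude Desktop typically needs a restart before it picks up changes to this file.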

Finally, put your documents in the ./my-docs folder on the host (mounted at /app/my-docs in the container), then ask the server to scan the documents and, optionally, to start the folder watcher. You should now be able to ask your LLM questions about the documents.

Features

  • Extract text from PDF documents
  • Organize documents by topics (using folder structure)
  • Generate embeddings for semantic search
  • Store documents in a vector database (ChromaDB)
  • Expose MCP tools for Claude to search and retrieve documents
  • Filter searches by topic/category
  • Handle multiple documents with the same filename across different topics

Architecture

PDFs (organized by topic folders)
    → Text Extraction
    → Chunking
    → Embeddings
    → ChromaDB (with topic tags)
        ↓
    MCP Server Tools
        ↓
    Claude
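The chunking step in the pipeline above can be sketched as a sliding window with overlap. This is an illustration only: it splits on characters, and the default sizes are placeholder values, not the project's actual settings in config.py.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and stored in ChromaDB together with its topic tag.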

Document Organization

This system is designed to work with PDFs organized in a folder structure where:

  • Each folder represents a topic or category
  • PDFs in that folder are automatically tagged with the topic name
  • Documents with the same filename in different folders are handled correctly

Example structure:

pdfs/
├── Machine_Learning/
│   ├── neural_networks.pdf
│   ├── deep_learning.pdf
│   └── introduction.pdf
├── Python_Programming/
│   ├── basics.pdf
│   ├── advanced.pdf
│   └── introduction.pdf      # Different from ML's introduction.pdf
└── Data_Science/
    ├── statistics.pdf
    └── visualization.pdf
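One way to keep the two introduction.pdf files apart is to identify each document by the pair (topic, filename) instead of the filename alone. The sketch below illustrates that idea. The function names are hypothetical, and treating the first folder as the topic is an assumption about how nested folders would be handled.

```python
from pathlib import PurePosixPath


def document_key(relative_path: str) -> tuple[str, str]:
    """Split a path such as 'Machine_Learning/introduction.pdf' into (topic, filename)."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 2:
        raise ValueError(f"no topic folder in path: {relative_path!r}")
    return parts[0], parts[-1]


def index_by_topic(relative_paths: list[str]) -> dict[str, list[str]]:
    """Group filenames under their topic folder."""
    index: dict[str, list[str]] = {}
    for rel in relative_paths:
        topic, filename = document_key(rel)
        index.setdefault(topic, []).append(filename)
    return index
```

With the structure above, `document_key` yields `("Machine_Learning", "introduction.pdf")` and `("Python_Programming", "introduction.pdf")`, which are distinct keys even though the filenames match.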

Project Structure

  • mcp_server.py - Main MCP server implementation
  • document_processor.py - PDF text extraction and chunking
  • vector_store.py - Vector database operations
  • scan_all_my_documentss.py - Batch PDF ingestion script
  • requirements.txt - Python dependencies
  • config.py - Configuration settings

MCP Tools

  • search_documents - Semantic search across all PDFs (with optional topic filter)
  • list_documents - List all available documents (with optional topic filter)
  • list_topics - List all topics/categories
  • get_collection_stats - Get statistics about the collection
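Under the hood, an MCP client invokes these tools with JSON-RPC 2.0 messages using the "tools/call" method. The sketch below builds such a request for search_documents. The argument names "query" and "topic" are hypothetical: the tool's real input schema is not listed here and should be read from the server's tool listing.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP "tools/call" request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names below are assumptions made for illustration.
request = build_tool_call(
    1,
    "search_documents",
    {"query": "neural networks", "topic": "Machine_Learning"},
)
```

In normal use Claude Desktop builds and sends these messages itself; the sketch is only meant to show what a tool call looks like on the wire.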

Example Queries for Claude

Once configured, you can ask Claude:

General queries:

  • "What documents do you have access to?"
  • "What topics are available?"
  • "Search for information about neural networks"

Topic-specific queries:

  • "Search for Python programming concepts in the Python_Programming topic"
  • "Show me all documents about Machine Learning"
  • "Find information about data visualization in the Data_Science topic"

Complex queries:

  • "Compare what different documents say about deep learning"
  • "Find all mentions of pandas across all documents"
  • "What are the key concepts in the Machine_Learning documents?"

Configuration

Edit config.py to customize:

  • Chunk size and overlap
  • Number of search results
  • Embedding model
  • ChromaDB persistence directory
  • Topic extraction behavior (enable/disable folder-based topics)

Topic Configuration

By default, the system uses folder names as topics. To modify:

# config.py
USE_FOLDER_AS_TOPIC = True # Set to False to disable topic extraction
DEFAULT_TOPIC = "uncategorized" # Default topic for PDFs not in folders
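The two settings could drive topic assignment roughly as follows. This is a sketch under assumptions, not the project's code: the function name is hypothetical, and it assumes that disabling folder-based topics places every document under DEFAULT_TOPIC.

```python
from pathlib import PurePosixPath

USE_FOLDER_AS_TOPIC = True       # same names as in config.py above
DEFAULT_TOPIC = "uncategorized"


def topic_for_path(
    relative_path: str,
    use_folder: bool = USE_FOLDER_AS_TOPIC,
    default_topic: str = DEFAULT_TOPIC,
) -> str:
    """Pick the topic for a document path given relative to the documents folder."""
    if not use_folder:
        return default_topic  # assumed behavior when topic extraction is disabled
    parts = PurePosixPath(relative_path).parts
    # A file directly in the documents folder has no topic folder.
    return parts[0] if len(parts) > 1 else default_topic
```

For example, `topic_for_path("Data_Science/statistics.pdf")` gives `"Data_Science"`, while a file placed directly in the documents folder falls back to `"uncategorized"`.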

License

MIT

Tag summary

Tag: v0.2
Content type: Image
Digest: sha256:c09a92167…
Size: 4.4 GB
Last updated: 8 months ago

docker pull dmeric/docs-to-ai:v0.2