A Model Context Protocol (MCP) server that enables LLMs to query local documents. PDF, Word, Excel
GitHub repo: https://github.com/dorianmeric/docs-to-ai
A Model Context Protocol (MCP) server that enables LLMs (like Claude Desktop or any other LLM that supports MCP) to query your documents using semantic search. It organizes documents into topics based on the folder structure. Supported formats: PDF, Word, Excel, Markdown. Supported extensions: .pdf, .docx, .doc, .xlsx, .xls, .xlsam, .xlsb, .md
The model used for document retrieval is all-MiniLM-L6-v2, with 384 dimensions for the embeddings.
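To illustrate what retrieval over embeddings means, here is a minimal sketch that ranks stored chunks against a query vector. It assumes cosine similarity as the scoring function, which this page does not confirm, and it uses toy 3-dimensional vectors in place of the real 384-dimensional ones. The names `cosine_similarity` and `rank_chunks` and the chunk IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=2):
    """Return the IDs of the top_k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy 3-dimensional vectors; a real embedding here would have 384 values.
chunks = {
    "neural_networks.pdf#0": [0.9, 0.1, 0.0],
    "statistics.pdf#0": [0.1, 0.9, 0.1],
    "basics.pdf#0": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
print(rank_chunks(query, chunks))
```

In the real system the vector store performs this ranking, so the sketch is only meant to show the idea behind the search.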
You need Docker Desktop, or Docker Engine, running. Then just write the following into a file called "docker-compose.yaml":
services:
  docs-to-ai:
    image: dmeric/docs-to-ai
    container_name: docs-to-ai
    volumes:
      - ./cache/chromadb:/app/chroma_db  # ChromaDB database (persists the vector store)
      - ./cache/doc_cache:/app/doc_cache  # Document cache (persists extracted text)
      - ./my-docs:/app/my-docs:ro  # Documents directory (your PDFs and Word docs). Read-only to prevent accidental modifications
    # Stdin/stdout - required for MCP protocol
    stdin_open: true
    tty: true
    # Restart policy
    restart: unless-stopped
    # # Resource limits (optional - adjust based on your needs)
    # deploy:
    #   resources:
    #     limits:
    #       cpus: '2'
    #       memory: 4G
    #     reservations:
    #       cpus: '1'
    #       memory: 2G
Then run, in Bash or PowerShell:
docker compose up -d
Add to your Claude Desktop config (claude_desktop_config.json):
{
  "mcpServers": {
    "docs-to-ai": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "dmeric/docs-to-ai"
      ]
    }
  }
}
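If claude_desktop_config.json already lists other MCP servers, the entry should be merged in rather than pasted over the file. The sketch below shows one way to do that with the standard library. The helper name `add_mcp_server` is hypothetical, and the location of the config file depends on your operating system, so the path is passed in explicitly.

```python
import json
from pathlib import Path

# The server entry shown above, expressed as a Python dict.
DOCS_TO_AI_ENTRY = {
    "command": "docker",
    "args": ["run", "-i", "--rm", "dmeric/docs-to-ai"],
}


def add_mcp_server(config_path, name="docs-to-ai", entry=DOCS_TO_AI_ENTRY):
    """Add or overwrite one server under "mcpServers", keeping the others."""
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():  # tolerate an empty file
            config = json.loads(text)
    # Create the "mcpServers" section if it is missing, then set our entry.
    config.setdefault("mcpServers", {})[name] = dict(entry)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it as `add_mcp_server("path/to/claude_desktop_config.json")`, then restart Claude Desktop so that it rereads the file.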
Finally, put your documents in the ./my-docs folder (mounted read-only at /app/my-docs inside the container), ask the server to scan the documents, and optionally start the folder watcher. You should now be able to ask your LLM questions about the documents.
PDFs (organized by topic folders)
→ Text Extraction
→ Chunking
→ Embeddings
→ ChromaDB (with topic tags)
↓
MCP Server Tools
↓
Claude
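The chunking step in the pipeline above can be sketched as follows. This is an illustrative assumption, not the project's actual code: it splits by character count with a fixed overlap, and the function name `chunk_text` and the default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.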
This system is designed to work with PDFs organized in a folder structure where each folder name is used as the topic for the documents it contains.
Example structure:
pdfs/
├── Machine_Learning/
│ ├── neural_networks.pdf
│ ├── deep_learning.pdf
│ └── introduction.pdf
├── Python_Programming/
│ ├── basics.pdf
│ ├── advanced.pdf
│ └── introduction.pdf # Different from ML's introduction.pdf
└── Data_Science/
├── statistics.pdf
└── visualization.pdf
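A layout like this can be walked with `pathlib` to group files by topic folder. The sketch below is a hedged illustration and not the project's scanner: the function name `list_documents_by_topic` is hypothetical, only `.pdf` files are matched by default, and files that sit directly in the root folder are skipped for simplicity.

```python
from pathlib import Path


def list_documents_by_topic(root, extensions=(".pdf",)):
    """Map each first-level folder name to the documents found beneath it."""
    root = Path(root)
    by_topic = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        relative = path.relative_to(root)
        if len(relative.parts) < 2:
            continue  # no topic folder above this file; skipped in this sketch
        topic = relative.parts[0]
        # Keep the topic in the ID so two files with the same name stay distinct.
        by_topic.setdefault(topic, []).append(relative.as_posix())
    return by_topic
```

Because each ID keeps its folder prefix, `Machine_Learning/introduction.pdf` and `Python_Programming/introduction.pdf` do not collide.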
Project files:
mcp_server.py - Main MCP server implementation
document_processor.py - PDF text extraction and chunking
vector_store.py - Vector database operations
scan_all_my_documentss.py - Batch PDF ingestion script
requirements.txt - Python dependencies
config.py - Configuration settings
MCP tools:
search_documents - Semantic search across all PDFs (with optional topic filter)
list_documents - List all available documents (with optional topic filter)
list_topics - List all topics/categories
get_collection_stats - Get statistics about the collection
Once configured, you can ask Claude:
General queries:
Topic-specific queries:
Complex queries:
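Whichever kind of question is asked, the MCP client ends up sending the server a tool call. The sketch below builds the JSON-RPC 2.0 message that MCP uses for this, a `tools/call` request, for the `search_documents` tool. The argument names `query` and `topic` and the example question are assumptions for illustration; the tool's real input schema is not shown on this page.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP "tools/call" request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: a free-text query plus an optional topic filter.
request = build_tool_call(
    1,
    "search_documents",
    {"query": "How does backpropagation work?", "topic": "Machine_Learning"},
)
print(json.dumps(request, indent=2))
```

In normal use the client builds these messages itself, so you never write them by hand. Seeing the shape can still help when debugging the server over stdin/stdout.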
Edit config.py to customize:
By default, the system uses folder names as topics. To modify:
# config.py
USE_FOLDER_AS_TOPIC = True # Set to False to disable topic extraction
DEFAULT_TOPIC = "uncategorized" # Default topic for PDFs not in folders
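The effect of these two settings can be sketched as a small function. This is an assumption about the behavior rather than the project's code: the helper `resolve_topic` is hypothetical, it treats the first folder in the relative path as the topic, and it assumes that disabling `USE_FOLDER_AS_TOPIC` assigns every document the default topic.

```python
from pathlib import PurePosixPath

USE_FOLDER_AS_TOPIC = True       # mirrors the config.py setting above
DEFAULT_TOPIC = "uncategorized"  # mirrors the config.py setting above


def resolve_topic(relative_path,
                  use_folder_as_topic=USE_FOLDER_AS_TOPIC,
                  default_topic=DEFAULT_TOPIC):
    """Return the topic for a document path relative to the documents folder."""
    parts = PurePosixPath(relative_path).parts
    # A path with a folder component has at least two parts: folder and file.
    if use_folder_as_topic and len(parts) > 1:
        return parts[0]
    return default_topic


print(resolve_topic("Machine_Learning/deep_learning.pdf"))
print(resolve_topic("loose_file.pdf"))
```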
License: MIT
Content type: Image
Digest: sha256:c09a92167…
Size: 4.4 GB
Last updated: 8 months ago
docker pull dmeric/docs-to-ai:v0.2