- Executive Summary
This report details the design, architecture, and engineering decisions behind the Voice-Controlled Local AI Agent. The primary objective of this project was to establish a fully autonomous, local-first inference pipeline capable of transcribing human speech, parsing intents via a Large Language Model (LLM), and executing sandbox-verified commands on a host operating system.
Core deliverables include:
A deterministic plugin tool system
A dual-backend hardware STT failover
An SQLite persistent memory layer
A structured Streamlit UI
- System Architecture & Component Design
The framework operates on a 6-layer architecture using decoupled asynchronous Python services (FastAPI backend, Streamlit frontend).
2.1 Transducer & Speech-to-Text (STT) Layer
Local Engine: openai/whisper-small, run through the Hugging Face transformers pipeline
Failover Logic: Checks for CUDA availability at runtime. If no GPU is present, it degrades gracefully to CPU processing, with a cloud API fallback (Groq inference API) if the required keys are provided
Pre-processing Hook: Works around the format limitations of standard audio libraries (e.g., libsndfile) by using pydub to transcode formats such as .m4a and .webm into single-channel 16 kHz NumPy arrays
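The failover order described above can be sketched as a small selection routine. This is a minimal illustration, not the project's actual code: the backend labels, the function names, and the idea of passing the Groq key in as a plain string are all assumptions made for the example. Only `torch.cuda.is_available()` is a real library call.

```python
from typing import List, Optional


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # optional dependency; may be absent on CPU-only installs
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def stt_backend_order(has_cuda: bool, groq_api_key: Optional[str]) -> List[str]:
    """Build the ordered list of STT backends to try.

    The labels below are hypothetical names used only in this sketch.
    """
    order: List[str] = []
    if has_cuda:
        order.append("whisper-local-cuda")  # preferred when a GPU is present
    order.append("whisper-local-cpu")       # always available as a local fallback
    if groq_api_key:
        order.append("groq-cloud")          # only offered when a key is configured
    return order
```

A caller would typically pass `cuda_available()` and the configured key, then try each backend in order until one succeeds. Keeping the decision in a pure function makes the failover policy easy to test without a GPU.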
2.2 Intent & Parsing Layer (The LLM Conductor)
Engine: Ollama running locally. Base model recommended is mistral
Parsing: Natural language commands are strictly parsed into a structured IntentClassification payload mapping to IntentResult parameters
Compound Action Chaining: The engine supports multi-step commands. For instance, the prompt "Make a file and write a hello world script" is serialized into an ordered list of two distinct atomic tasks (Create File → Write Code) that is passed down the executor pipeline
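To make the parsing step concrete, the sketch below validates a JSON reply from the model and turns it into an ordered list of tasks. The names IntentClassification and IntentResult come from this report, but their fields, the JSON shape, and the intent labels (`create_file`, `write_code`) are assumptions for illustration only.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class IntentResult:
    intent: str                                   # hypothetical label, e.g. "create_file"
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class IntentClassification:
    intents: List[IntentResult]                   # order is the execution order


def parse_llm_output(raw: str) -> IntentClassification:
    """Strictly parse the model's JSON reply; raise ValueError on any deviation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM reply is not valid JSON: {exc}") from exc

    items = data.get("intents") if isinstance(data, dict) else None
    if not isinstance(items, list) or not items:
        raise ValueError("expected a non-empty 'intents' list")

    results: List[IntentResult] = []
    for item in items:
        if not isinstance(item, dict) or not isinstance(item.get("intent"), str):
            raise ValueError(f"malformed intent entry: {item!r}")
        params = item.get("parameters", {})
        if not isinstance(params, dict):
            raise ValueError(f"'parameters' must be an object: {item!r}")
        results.append(IntentResult(item["intent"], params))
    return IntentClassification(results)


# Assumed reply for "Make a file and write a hello world script".
SAMPLE_REPLY = json.dumps({
    "intents": [
        {"intent": "create_file", "parameters": {"filename": "hello.py"}},
        {"intent": "write_code", "parameters": {"filename": "hello.py"}},
    ]
})
```

Rejecting anything that does not match the expected shape keeps free-form model output from reaching the executor.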
2.3 Plugin Tool System
Instead of a monolithic switch statement, the project natively supports a dynamic Tool Registry.
Any class inheriting from BaseTool is auto-parsed and injected into the intent evaluator.
Currently supported base tools:
ChatResponderTool
FileCreatorTool
CodeWriterTool
SummarizerTool
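One common way to get this kind of automatic registration in Python is the `__init_subclass__` hook, sketched below. This is an assumption about how such a registry could work, not a description of the project's implementation. The `name` attribute, the `run` signature, and the `dispatch` helper are hypothetical.

```python
from typing import Any, Dict, Type

# Maps an intent name to the tool class that handles it.
TOOL_REGISTRY: Dict[str, Type["BaseTool"]] = {}


class BaseTool:
    name: str = ""  # hypothetical attribute: the intent this tool handles

    def __init_subclass__(cls, **kwargs: Any) -> None:
        # Runs once for every subclass definition, so tools self-register.
        super().__init_subclass__(**kwargs)
        if not cls.name:
            raise TypeError(f"{cls.__name__} must define a non-empty 'name'")
        if cls.name in TOOL_REGISTRY:
            raise ValueError(f"duplicate tool name: {cls.name!r}")
        TOOL_REGISTRY[cls.name] = cls

    def run(self, **params: Any) -> str:
        raise NotImplementedError


class ChatResponderTool(BaseTool):
    name = "chat"

    def run(self, **params: Any) -> str:
        return f"echo: {params.get('message', '')}"


class FileCreatorTool(BaseTool):
    name = "create_file"

    def run(self, **params: Any) -> str:
        # Placeholder behaviour; a real tool would go through the sandbox.
        return f"would create {params.get('filename', '<unnamed>')}"


def dispatch(intent: str, **params: Any) -> str:
    """Look up the tool registered for an intent and run it."""
    try:
        tool_cls = TOOL_REGISTRY[intent]
    except KeyError:
        raise ValueError(f"no tool registered for intent {intent!r}") from None
    return tool_cls().run(**params)
```

With this pattern, adding a tool means defining one new subclass, and no central switch statement has to change.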
2.4 Safety Sandboxing
A zero-trust execution model is applied to all file I/O:
Rejects files whose extensions appear on a fixed blocklist (.exe, .sh, .bat)
Blocks symbolic link escapes and structural path traversals (e.g., passing ../ inside a filename)
Implements a Human-In-The-Loop mechanism that flags write operations as pending_confirmation before flushing payloads to disk
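The three checks above can be sketched with the standard library alone. The function and class names, the `completed` status value, and the sandbox root passed in by the caller are assumptions for this example. Only the blocklist entries and the `pending_confirmation` status come from the report.

```python
from dataclasses import dataclass
from pathlib import Path

BLOCKED_EXTENSIONS = {".exe", ".sh", ".bat"}


def resolve_in_sandbox(sandbox_root: Path, filename: str) -> Path:
    """Return the resolved target path, or raise PermissionError if it is unsafe."""
    root = sandbox_root.resolve()
    # resolve() follows symlinks and collapses "../", so escapes become visible.
    candidate = (root / filename).resolve()
    if candidate.suffix.lower() in BLOCKED_EXTENSIONS:
        raise PermissionError(f"blocked extension: {candidate.suffix}")
    try:
        candidate.relative_to(root)  # raises ValueError if outside the sandbox
    except ValueError:
        raise PermissionError(f"path escapes sandbox: {filename!r}") from None
    return candidate


@dataclass
class PendingWrite:
    path: Path
    content: str
    status: str = "pending_confirmation"

    def confirm(self) -> None:
        """Flush the payload to disk only after explicit user approval."""
        self.path.write_text(self.content, encoding="utf-8")
        self.status = "completed"  # hypothetical status name


def request_write(sandbox_root: Path, filename: str, content: str) -> PendingWrite:
    """Validate the target and stage the write without touching the disk."""
    return PendingWrite(resolve_in_sandbox(sandbox_root, filename), content)
```

Resolving the path before checking containment matters: comparing unresolved strings would miss both `../` segments and symlinks that point outside the sandbox.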
- Persistent Memory
An SQLite ledger in WAL (write-ahead logging) mode handles concurrent access between the frontend and background worker processes: readers are not blocked while a single writer commits.
All actions, including rejected or cancelled ones, are written to an Action Log for compliance and full conversation recall.
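A minimal sketch of such a ledger using Python's built-in `sqlite3` module follows. The table name, column names, and status strings are assumptions for illustration. The `PRAGMA journal_mode=WAL` statement is standard SQLite.

```python
import sqlite3
from typing import List, Tuple


def open_ledger(db_path: str) -> sqlite3.Connection:
    """Open the database, switch it to WAL mode, and ensure the log table exists."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL mode (got {mode!r})")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS action_log (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            intent     TEXT NOT NULL,
            status     TEXT NOT NULL,
            detail     TEXT
        )
        """
    )
    conn.commit()
    return conn


def log_action(conn: sqlite3.Connection, intent: str, status: str, detail: str = "") -> None:
    """Append one entry; rejected and cancelled actions are logged like any other."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO action_log (intent, status, detail) VALUES (?, ?, ?)",
            (intent, status, detail),
        )


def recent_actions(conn: sqlite3.Connection, limit: int = 10) -> List[Tuple[str, str]]:
    """Return the newest entries first as (intent, status) pairs."""
    cur = conn.execute(
        "SELECT intent, status FROM action_log ORDER BY id DESC LIMIT ?", (limit,)
    )
    return cur.fetchall()
```

WAL mode is a property of the database file, so it only needs to be set once, but issuing the pragma on every open is harmless and makes the requirement explicit.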
- Performance Metrics & Production Scaling Recommendations
Current Local Node Performance Limits:
Audio Prep: O(N) linear time scaling per chunk; operates inside microseconds
Whisper Small (GPU): Sub-second inference
Whisper Small (CPU): Averaging ~15s to ~20s inference for 3–5 seconds of speech
Ollama Pipeline (mistral 7B): Averages 2–5 sec token streaming on mid-range hardware
Recommended Upgrades for Commercial Deployment:
RAG Interconnect: Tying the File Summarizer tool to a localized vector database (e.g., ChromaDB) to allow semantic lookup against existing output files
Streaming Pipeline: Converting the /api/process-audio loop from generic async endpoints to WebSockets allowing continuous streaming telemetry from UI to backend without buffering out files locally first
Multi-Tenant Sandboxing: Swapping the simple OS-level /output path constraint with pure containerized execution like Firecracker microVMs dynamically spun up per-user
- Conclusion
The Voice-Controlled AI Agent is robust, secure, and production-viable software. It offers an extensible foundation for mapping natural-language commands onto any system API or tool command through structured intent parsing.
