Running an LLM locally is a pain you probably don’t want to deal with unless you have a real use case. I tried self-hosting OpenAI’s Whisper model on my laptop, and while the tool itself worked well, getting it up and running took a long time. You usually have to stitch together multiple tools just to get started, downloading model files from different sources, setting up Python environments, installing frameworks like PyTorch or Transformers, configuring GPU drivers, and then fixing whatever breaks next.
There’s no standard way to package or distribute models, and setups vary wildly depending on your OS and hardware. That changes with Docker Model Runner, which, no joke, makes running local LLMs easier than setting up a Minecraft server.
7 things I wish I knew when I started self-hosting LLMs
I've been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start.
What Is Docker Model Runner?
An official Docker extension for running AI models
Docker Model Runner (DMR) is an official Docker extension for running AI models, especially large language models, on your local machine. It has been released as an open-source plugin under Docker’s GitHub organization. DMR is supposed to make working with AI models as straightforward as working with containers, and it lives up to that promise. I have been using it for some time now, and the tool makes it super easy to manage, run, and deploy AI models using Docker, streamlining the process of pulling, running, and serving LLMs directly from Docker Hub or any OCI-compliant registry.
DMR provides a unified CLI and GUI for downloading, running, and exposing models via a local API. DMR supports a growing range of models, from small, efficient LLMs to larger, more capable ones. Docker works with model providers and open-source communities to distribute ready-to-run models through Docker Hub, all packaged under the ai/ namespace.
At the smaller end, SmolLM2 is a lightweight ~360M-parameter model that works well for demos, quick experiments, and low-resource systems. Moving up, Gemma 3 sits around 3.9B parameters and offers a good balance between capability and performance, while Ministral 3 is a ~7B-parameter model that also includes vision support, making it useful for multimodal workloads. For more demanding use cases, Phi-4 comes in at roughly 14.7B parameters and is designed for more complex reasoning tasks.
If you need multiple models running at the same time, Docker makes this possible using Compose or separate runner instances. You can define multiple models in a Compose file, expose them on different ports, and run them in parallel as long as your machine has enough CPU, GPU, and memory.
How DMR simplifies local LLMs
Vs traditional setup
Docker Model Runner simplifies local LLMs by treating models like containers. If you already know Docker, you already know how to run an LLM. DMR ships with a built-in inference engine, so you can run models without touching Python, CUDA configs, or native builds. Under the hood, DMR uses proven open-source backends like llama.cpp for CPU and Apple Silicon, and vLLM for GPUs.
In a traditional setup, you would create virtual environments, pin library versions, and deal with dependency conflicts. With DMR, you just enable the plugin in Docker. The runtime comes bundled with Docker Desktop or Engine, and models are treated as first-class Docker resources.
Hardware differences are another common headache with local LLMs. CPU versus GPU, x86 versus ARM, Windows versus macOS often require different builds or libraries. DMR abstracts this away. It automatically pulls a compatible model variant for your system and uses the available hardware efficiently. On Apple Silicon, it selects Arm-optimized models. On NVIDIA GPUs, it uses GPU-optimized paths. You do not need to guess or experiment to get decent performance.
Docker Model Runner is not limited to Docker’s curated catalog. You can pull compatible community models directly from Hugging Face, as long as they are available in supported formats such as GGUF. If you maintain your own fine-tuned models, you can convert them to GGUF or Safetensors and package them using Docker model package for local use or sharing.
Setting up an LLM with Docker Model Runner
It barely takes any time
Getting started with Docker Model Runner is refreshingly simple. In a few steps, you can have a local LLM running on your machine. Start by making sure Docker is up to date. Docker Model Runner was introduced as a beta in Docker Desktop 4.40 on macOS and is supported in Docker Desktop 4.41 and newer on Windows, as well as recent Docker Engine releases on Linux.
On Docker Desktop, open Settings, go to AI, and enable Docker Model Runner. If you are on Windows with a supported NVIDIA GPU, also turn on GPU-backed inference for better performance. Restart Docker if prompted. On Linux systems using Docker Engine, install the model runner CLI plugin using your package manager. For example, on Ubuntu:
sudo apt-get install docker-model-plugin
As mentioned above, Docker hosts a set of curated models on Docker Hub under the ai namespace. You can download a model using the Docker model pull command. For example:
docker model pull ai/smollm2:360M-Q4_K_M
This pulls SmolLM2, a compact 360M-parameter model quantized to 4-bit for efficient local inference. The first pull may take some time since the model files are large, but future runs are cached locally. You can also pull compatible models directly from Hugging Face by specifying the URI, provided the format is supported. For example, the below command downloads a 1B-parameter Llama 3.2 model in GGUF format.
docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
Once the model is pulled, running it is a single command:
docker model run ai/smollm2
If you prefer a graphical interface, Docker Desktop includes a Models tab. After pulling a model, open the tab, find the model under Local, and click the run button. This opens a chat-style UI where you can interact with the model visually.
Once running, Docker Model Runner can also expose an OpenAI-compatible REST API on a local port. This allows your applications to interact with the model just like they would with OpenAI’s API. For example, you can enable the API on port 12434 and send requests to:
http://localhost:12434/engines/llama.cpp/v1/chat/completions
This makes it easy to plug a local LLM into existing tools, scripts, or frameworks without changing how they talk to the model. You just point them to your local endpoint instead of a cloud service.
Put Docker to good use
Docker is hands-down one of the best platforms for self-hosting. Whether you’re running an automation tool like n8n or using apps like Restic to automate backups, Docker makes the whole process simpler. And if you’re not running anything yet, check out these tiny Docker containers that can save hours every week, or these quality-of-life Docker containers you can use every day.
I'm uninstalling Docker Desktop for good, and here's what I'm using instead
It's hands-down my favorite container management platform on Windows
