Stop Choosing Between Local and Cloud LLMs: A Field Guide to Hybrid Patterns
A hands-on walkthrough of a hybrid local-cloud workflow using Gemma 4 and GPT-5.4, with reasoning and structured outputs
In LLM applications, two deployment choices are common: we either go fully cloud, sending everything to a cloud LLM API, or go fully local, running everything with an open model served locally.
Cloud LLMs generally reason better, but using them means sending our sensitive context outside. Local models keep that context private, but they are limited by local compute and can struggle when tasks get complex.
Naturally, we’d like to ask:
Can we enjoy the reasoning capability of the cloud, while still keeping private context local?
A hybrid local-cloud pattern might achieve that. This is what we’ll explore in this post.
Specifically, we’ll discuss:
- How to reason about hybrid designs. We’ll look at a three-axis map and illustrate the five most common patterns.
- A concrete case study to walk through one of the patterns. We use both a local small language model (Gemma 4 E4B from Google) and a cloud-based large language model (GPT-5.4 from OpenAI).
By the end, you should have a reusable mental model and a working notebook for splitting an LLM application between a local and a cloud model.
1. When Does A Hybrid Local-Cloud LLM Pattern Make Sense?
When people talk about local-cloud hybrid LLM applications, they immediately think of privacy. That’s certainly an important consideration.
But privacy is not the only reason to go hybrid.
Based on my experience, I find it useful to reason about this hybrid pattern through the lens of a three-axis coordinate system:
- Direction, which answers “who acts first?”. It can be local-first or cloud-first.
- Trigger, which answers “when is the cloud used?”. In some application scenarios, the cloud is always called. At other times, the cloud is only invoked conditionally.
- Purpose, which answers “why split the workflow?”. As we mentioned previously, privacy is a major motivation. But other factors can be cost & latency, as well as trust & reliability.
With these three axes, we can place many real-world hybrid workflows in the same coordinate map. Although this is far from a perfect taxonomy, I believe it makes the design choices intuitively understandable. Let’s take a look now.
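One way to make the map concrete is to encode the three axes as small enums and list the five patterns described in this section as entries. This is only an illustrative sketch: the class names, enum members, and the `patterns_for` helper are my own assumptions, not part of any library, and "may or may not trigger the cloud" is encoded here as a conditional trigger.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    LOCAL_FIRST = "local-first"
    CLOUD_FIRST = "cloud-first"
    BOTH = "both directions"


class Trigger(Enum):
    ALWAYS = "always"
    CONDITIONAL = "conditional"


class Purpose(Enum):
    PRIVACY = "privacy"
    COST_LATENCY = "cost & latency"
    TRUST_RELIABILITY = "trust & reliability"


@dataclass(frozen=True)
class HybridPattern:
    """One point in the three-axis map (hypothetical container)."""
    name: str
    direction: Direction
    trigger: Trigger
    purpose: Purpose


PATTERNS = [
    HybridPattern("Sanitize-and-Solve", Direction.LOCAL_FIRST, Trigger.ALWAYS, Purpose.PRIVACY),
    HybridPattern("Plan-then-Ground", Direction.CLOUD_FIRST, Trigger.ALWAYS, Purpose.PRIVACY),
    HybridPattern("Escalate-on-Hard", Direction.LOCAL_FIRST, Trigger.CONDITIONAL, Purpose.COST_LATENCY),
    HybridPattern("Draft-then-Refine", Direction.LOCAL_FIRST, Trigger.CONDITIONAL, Purpose.COST_LATENCY),
    HybridPattern("Cross-Check", Direction.BOTH, Trigger.ALWAYS, Purpose.TRUST_RELIABILITY),
]


def patterns_for(purpose: Purpose) -> list[str]:
    """Return the names of all patterns motivated by the given purpose."""
    return [p.name for p in PATTERNS if p.purpose is purpose]


print(patterns_for(Purpose.PRIVACY))
# → ['Sanitize-and-Solve', 'Plan-then-Ground']
```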
Pattern 1: Sanitize-and-Solve
This pattern is local-first (axis 1), always triggers the cloud LLM (axis 2), and is designed to preserve privacy (axis 3).
The local model consumes the unstructured, messy private context and converts it into an abstract problem, which is then sent to the cloud LLM. The cloud model only sees the abstract problem and solves it. The results are sent back to the local model, which can process them further and present them to the user.
This is the pattern we will implement in the case study.
Pattern 2: Plan-then-Ground
This pattern runs in the opposite direction. It is cloud-first (axis 1), the cloud model is always triggered (axis 2), and it is still privacy-preserving (axis 3).
Here, the cloud model is responsible for producing a generic plan based on an abstract goal. This step doesn’t involve seeing the private data. Then, the plan is sent to the local model, which acts as the executor and runs the plan against the real sensitive data.
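The data flow of Plan-then-Ground can be sketched with two stand-in callables. Everything here is hypothetical: `plan_then_ground`, the fake planner and executor, and the spending records are placeholders for a real cloud call and a real local model. The point is only that the cloud function receives the abstract goal and nothing else.

```python
from typing import Callable


def plan_then_ground(
    abstract_goal: str,
    private_records: list[dict],
    cloud_plan: Callable[[str], list[str]],
    local_execute: Callable[[str, list[dict]], str],
) -> list[str]:
    """Ask the cloud for a generic plan, then execute each step locally."""
    plan = cloud_plan(abstract_goal)  # the cloud sees only the abstract goal
    return [local_execute(step, private_records) for step in plan]


# Stand-ins for the real models (assumptions for illustration only).
cloud_inputs: list[str] = []


def fake_cloud_plan(goal: str) -> list[str]:
    cloud_inputs.append(goal)  # record what actually left the machine
    return ["group records by category", "sum the amount per category"]


def fake_local_execute(step: str, records: list[dict]) -> str:
    return f"{step} (ran locally on {len(records)} records)"


records = [
    {"category": "food", "amount": 42.0},
    {"category": "rent", "amount": 900.0},
]
results = plan_then_ground(
    "summarize spending by category", records, fake_cloud_plan, fake_local_execute
)
print(cloud_inputs)  # only the abstract goal was sent to the "cloud"
print(results)
```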
Pattern 3: Escalate-on-Hard
For many applications, it’s not necessary to call the cloud model for every single user request.
The local model might already be able to handle the easy majority, such as simple extractions, classifications, summarizations, or producing short answers. The cloud model is called only when the local model cannot sufficiently or reliably address the user’s request because it is too complex.
This pattern is local-first (axis 1), triggers the cloud model only conditionally (axis 2), and is often motivated by cost and latency rather than privacy (axis 3).
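A minimal routing sketch for this pattern follows. It assumes the local model can return some confidence score alongside its answer; how to obtain a trustworthy score is a design problem of its own and is not covered here. The names `LocalResult`, `answer_with_escalation`, and the 0.7 threshold are hypothetical.

```python
from typing import Callable, NamedTuple


class LocalResult(NamedTuple):
    answer: str
    confidence: float  # hypothetical score in [0, 1]


def answer_with_escalation(
    request: str,
    local_model: Callable[[str], LocalResult],
    cloud_model: Callable[[str], str],
    threshold: float = 0.7,
) -> tuple[str, str]:
    """Return (answer, source); call the cloud only when local confidence is low."""
    local = local_model(request)
    if local.confidence >= threshold:
        return local.answer, "local"
    return cloud_model(request), "cloud"


# Stand-in models: pretend that short requests are the easy ones.
def fake_local(request: str) -> LocalResult:
    confidence = 0.9 if len(request.split()) <= 6 else 0.3
    return LocalResult(answer=f"local answer to: {request}", confidence=confidence)


def fake_cloud(request: str) -> str:
    return f"cloud answer to: {request}"


easy = answer_with_escalation("Classify this email", fake_local, fake_cloud)
hard = answer_with_escalation(
    "Compare these three contracts and list every conflicting clause",
    fake_local,
    fake_cloud,
)
print(easy[1], hard[1])
# → local cloud
```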
Pattern 4: Draft-then-Refine
Another commonly seen pattern is to separate response speed from response quality.
In this pattern, the local model produces an immediate draft. Optionally, the cloud model can work in the background to provide a more careful answer. If the cloud answer is better, the application can replace or simply enrich the initial response offered by the local model.
This pattern is local-first (axis 1), may or may not trigger the cloud model (axis 2), and is often motivated by cost and latency as well (axis 3).
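The timing of Draft-then-Refine can be sketched with a background thread: the cloud request starts first, the local draft is shown as soon as it is ready, and the refined answer replaces it only if it is judged better. As a simplification, this sketch always calls the cloud. The function name, the `is_better` comparison, and the lambda stand-ins are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def draft_then_refine(
    request: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
    is_better: Callable[[str, str], bool],
    show: Callable[[str], None],
) -> str:
    """Show a local draft immediately, then upgrade it if the cloud answer wins."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        refined_future = pool.submit(cloud_model, request)  # runs in the background
        draft = local_model(request)
        show(draft)  # the user sees the draft without waiting for the cloud
        refined = refined_future.result()
    if is_better(refined, draft):
        show(refined)
        return refined
    return draft


shown: list[str] = []
final = draft_then_refine(
    "Summarize the meeting notes",
    local_model=lambda r: "Short draft.",
    cloud_model=lambda r: "A longer, more careful summary.",
    is_better=lambda refined, draft: len(refined) > len(draft),  # toy criterion
    show=shown.append,
)
print(shown)
# → ['Short draft.', 'A longer, more careful summary.']
```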
Pattern 5: Cross-Check
Another important pattern worth mentioning is cross-checking.
Here, the local and cloud models are treated more or less as equals. They act as independent reviewers: either one model proposes an answer and the other checks it, or both models answer the same question and their agreement or disagreement becomes the signal for downstream processing.
This pattern flows in both directions (axis 1), the cloud model is always on (axis 2), and it’s mainly driven by trust & reliability (axis 3).
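The "both models answer the same question" variant can be sketched as follows. The exact-match comparison after normalization is a deliberately naive assumption; a real system would likely need a task-specific notion of agreement. `cross_check` and the `action` labels are hypothetical names.

```python
from typing import Callable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(answer.lower().split())


def cross_check(
    question: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
) -> dict:
    """Ask both models and turn their agreement into a downstream signal."""
    local_answer = local_model(question)
    cloud_answer = cloud_model(question)
    agree = normalize(local_answer) == normalize(cloud_answer)
    return {
        "agree": agree,
        "local": local_answer,
        "cloud": cloud_answer,
        "action": "accept" if agree else "needs_review",
    }


same = cross_check("Is 17 prime?", lambda q: "Yes", lambda q: " yes ")
different = cross_check("Is 21 prime?", lambda q: "Yes", lambda q: "No")
print(same["action"], different["action"])
# → accept needs_review
```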
2. Case Study: Should I Run the Dishwasher Now or Later?
In this section, we move from abstract concepts to a concrete case study. Specifically, we’ll look at a smart-home scheduling problem.
2.1 Case Setup
Here, we consider a smart-home assistant system that keeps a private household memory.
Suppose the user asks the assistant:
Should the dishwasher run now or later?
Effectively, the assistant needs to solve a scheduling problem.
To solve that problem, the assistant knows that now is 18:30, and it has access to the following household memory, device facts, and tariff:
Household memory:
- The dishwasher needs to be done before breakfast because kids need clean bowls,
usually around 06:30.
- Don't let the dishwasher still be running once everyone's in bed,
usually around 22:30; it's right by the bedrooms and Maya is a light sleeper.
- The EV is Mark's car, and it has to be charged before he leaves at 07:00.
- The robot vacuum often runs after lunch; today it finished a kitchen pass
around 16:10.
Device facts:
- Dishwasher: runtime about 90 minutes, energy about 1.2 kWh, available now.
- EV charger: runtime about 120 minutes, energy need about 14 kWh, available now.
- Robot vacuum: last cleaning cycle finished at 16:10; battery is at 78% on the dock.
Tariff:
- 17:00-20:00: $0.45/kWh
- 20:00-00:00: $0.22/kWh
- 00:00-06:00: $0.12/kWh
- 06:00-17:00: $0.25/kWh
As we can see, the above context contains quite a bit of private information, such as names and household routines, which we don’t want to send directly to a cloud model.
This naturally calls for a hybrid solution pattern. More concretely, we can design a workflow like this:
- Step 1: local LLM. Read the private context and abstract the scheduling problem so that it contains no sensitive information.
- Step 2: cloud LLM. Reason over the anonymous scheduling problem and produce a schedule.
- Step 3: local LLM. Translate the cloud result back into household language and present the final answer to the user.
That’s the workflow we’ll build next.
2.2 Set Up Ollama and the Gemma 4 Model
For this case study, we use the Gemma 4 E4B model as our local LLM.
Gemma 4 is a model family released by Google this April. It’s designed for reasoning, coding, multimodal understanding, and agentic workflows. It comes in multiple sizes. What matters for us are the edge-friendly variants, in particular the E4B model.
We’ll serve it locally with Ollama. In case you haven’t used it before, it’s a runtime for downloading, running, and serving local language models from your own machine. Once it is set up, Ollama exposes a local API endpoint.
On Windows machines, you can install Ollama with the official installer:
https://ollama.com/download
After installation, you can launch Ollama from the Windows Start menu.
On a Linux machine, you can install Ollama with:
curl -fsSL https://ollama.com/install.sh | sh
Once Ollama is up and running, we can proceed to download our local language model. We can do that via the command line:
ollama pull gemma4:e4b
For reference, my laptop has an Intel i7-13800H CPU, 32 GB RAM, and an NVIDIA RTX 2000 Ada Laptop GPU with about 8 GB VRAM. You can choose gemma4:e2b instead if E4B feels too slow.
Before moving to the next step, we can do a quick test:
ollama run gemma4:e4b "what's the capital of France?"
If you get “Paris” back, then congratulations, Gemma 4 is now available on your local machine through Ollama.
2.3 Step 1: Local Sanitization
This step runs fully locally.
Here, the local model sees the full household context, and its objective is to prepare a sanitized scheduling problem for the cloud model by stripping away any sensitive information.
We start by drafting the system instruction:
# Note: This prompt was iterated with AI
LOCAL_SANITIZER_INSTRUCTIONS = """
You are a local smart-home assistant running inside the home.
The system has access to private household memory, device facts, and tariff information.
A user has asked a scheduling question about one household load.
Your role is to prepare the scheduling problem for a cloud reasoning model without exposing household-private details.
The cloud model will reason about timing, energy use, deadlines, and electricity prices.
It does not need to know appliance names, people names, room details, family routines, or why a constraint exists.
Create an anonymous scheduling problem for the cloud model.
Use load IDs such as load_A, load_B, and load_C instead of real device or load names.
Include the loads that matter for answering the user's scheduling question.
Think of scheduling_problem as a message copied directly into a cloud API call.
Anything written in scheduling_problem leaves the home.
Keep the private mapping from anonymous load IDs back to household device names in local_mapping.
The local_mapping field stays inside the home and is the only place where private device names may appear.
Do not answer the user's question or solve the schedule yourself.
Your job is to translate the private household context into a cloud-usable anonymous scheduling problem.
Rules:
- The scheduling_problem field is the exact text that will be sent to the cloud model.
- In scheduling_problem, use only anonymous load IDs, never appliance names, people names, room details, or household explanations. Do not pair load IDs with private names; write "load_A:" rather than labels like "load_A (robot vacuum):".
- Preserve the concrete scheduling facts needed for reasoning: current time, relevant loads, duration, energy use, earliest start time, required completion time, local cutoffs, and tariff windows.
- Put the private load-name mapping only in local_mapping.
""".strip()
Here, we ask the local model to do three things:
- Identify which household appliances are actually relevant to the user’s question;
- Remove sensitive information such as names, room details, etc.;
- Preserve sufficient scheduling facts so that the cloud model can still reason properly.
Next, we define a prompt builder function to prepare the context, which combines the user question, the private household context, and a lightweight output contract:
from typing import Any

def build_local_sanitizer_prompt(
    private_context: dict[str, Any],
    user_question: str,
) -> str:
    # format_private_context is a helper that renders the context dict as plain text
    return f"""
User question:
{user_question}
Private household context:
{format_private_context(private_context)}
Task:
Prepare an anonymous scheduling problem so the cloud reasoning model can analyze and plan the schedule.
Use anonymous load IDs in the scheduling problem, and keep the local-only mapping in local_mapping.
Return your answer as JSON with exactly these fields:
{{
  "target_load_id": "anonymous ID of the load referenced by the user's question",
  "scheduling_problem": "the exact anonymous scheduling problem that will be sent to the cloud",
  "local_mapping": {{
    "load_A": "robot vacuum",
    "load_B": "<household_device_name>"
  }}
}}
The local_mapping example only illustrates the mapping direction.
Choose the actual anonymous load IDs based on the relevant loads you include.
Keys must be anonymous load IDs and values must be the original household device names.
Return only JSON, without a markdown fence.
""".strip()
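The builder above relies on a private_context dictionary and a format_private_context helper that are not shown in this post. A possible shape for both is sketched below. The dictionary keys and the plain-text layout are my assumptions; the content is copied from the case setup in Section 2.1.

```python
from typing import Any

# Hypothetical layout of the private context; the keys are assumptions.
private_context: dict[str, Any] = {
    "current_time": "18:30",
    "household_memory": [
        "The dishwasher needs to be done before breakfast because kids need clean bowls, usually around 06:30.",
        "Don't let the dishwasher still be running once everyone's in bed, usually around 22:30; "
        "it's right by the bedrooms and Maya is a light sleeper.",
        "The EV is Mark's car, and it has to be charged before he leaves at 07:00.",
        "The robot vacuum often runs after lunch; today it finished a kitchen pass around 16:10.",
    ],
    "device_facts": [
        "Dishwasher: runtime about 90 minutes, energy about 1.2 kWh, available now.",
        "EV charger: runtime about 120 minutes, energy need about 14 kWh, available now.",
        "Robot vacuum: last cleaning cycle finished at 16:10; battery is at 78% on the dock.",
    ],
    "tariff": [
        "17:00-20:00: $0.45/kWh",
        "20:00-00:00: $0.22/kWh",
        "00:00-06:00: $0.12/kWh",
        "06:00-17:00: $0.25/kWh",
    ],
}

USER_QUESTION = "Should the dishwasher run now or later?"


def format_private_context(private_context: dict[str, Any]) -> str:
    """Render the context as titled bullet lists for use inside a prompt."""
    lines = [f"Current time: {private_context['current_time']}"]
    sections = (
        ("Household memory", "household_memory"),
        ("Device facts", "device_facts"),
        ("Tariff", "tariff"),
    )
    for title, key in sections:
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in private_context[key])
    return "\n".join(lines)


formatted_context = format_private_context(private_context)
print(formatted_context)
```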
Finally, we can call the model through Ollama:
# pip install ollama
import ollama
import json

prompt = build_local_sanitizer_prompt(
    private_context=private_context,
    user_question=USER_QUESTION,
)
response = ollama.chat(
    model="gemma4:e4b",
    messages=[
        {"role": "system", "content": LOCAL_SANITIZER_INSTRUCTIONS},
        {"role": "user", "content": prompt},
    ],
    think="high",
    options={"temperature": 0, "num_ctx": 32768},
)
content = response["message"]["content"]
local_sanitizer_output = json.loads(content)
A couple of things worth mentioning:
- We need to install the Ollama Python client first (pip install ollama).
- We build the prompt from the user’s question and private household context.
- We use the thinking mode of the model by supplying the think value. Since the local model needs to reason over a messy local context, giving it more thinking budget makes sense here.
You may notice that I ask Gemma to return JSON directly in the prompt. A more formal approach would be to use structured output. Ollama supports that through the format argument:
response = ollama.chat(
    model=LOCAL_MODEL,
    messages=messages,
    format=PydanticSchema.model_json_schema(),
)
In theory, this is very useful because the output becomes easy to parse. However, in my experiments, I noticed that enforcing the output shape of the local LLM can constrain its capability. In our current case study, the model sometimes dropped important scheduling details, such as deadlines or cutoff times. On the other hand, if I only softly guide the output shape by providing examples in the prompt, the model performs much better.
This seems to be a known issue called the constraint tax [1]: for smaller models, hard structured-output constraints can improve schema validity while hurting task correctness.
Therefore, for this case study, I used prompt-level guidance to ensure that the local LLM can at least do the task properly. In a production system, I’d probably add validation and retry logic around this step.
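A minimal sketch of what such validation and retry logic might look like is shown below. It is not part of the notebook: the function names, the required-field check, and the retry count are all assumptions. The `generate` argument stands in for a function that wraps the Ollama call and returns the raw model text.

```python
import json
from typing import Any, Callable

REQUIRED_FIELDS = {"target_load_id", "scheduling_problem", "local_mapping"}


def validate_sanitizer_output(raw: str) -> dict[str, Any]:
    """Parse the model text and check the output contract; raise ValueError if broken."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["local_mapping"], dict):
        raise ValueError("local_mapping must be a JSON object")
    if data["target_load_id"] not in data["local_mapping"]:
        raise ValueError("target_load_id is not a key of local_mapping")
    return data


def sanitize_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> dict[str, Any]:
    """Call generate() until its output validates, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_sanitizer_output(generate())
        except (ValueError, TypeError) as err:
            last_error = err
    raise RuntimeError(f"no valid sanitizer output after {max_attempts} attempts: {last_error}")


# Stand-in for the local model: the first reply is chatty, the second is valid JSON.
replies = iter([
    "Sure! Here is the JSON you asked for...",
    '{"target_load_id": "load_A", "scheduling_problem": "Current Time: 18:30", '
    '"local_mapping": {"load_A": "dishwasher"}}',
])
result = sanitize_with_retry(lambda: next(replies))
print(result["target_load_id"])
# → load_A
```

With temperature set to 0, a retry with an identical prompt may simply reproduce the same invalid output, so in practice the retry would probably need to append the validation error to the prompt or relax the sampling settings.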
2.4 Step 2: Cloud Reasoning
In this step, we send the prepared anonymous scheduling problem to the cloud LLM so that we can leverage its strong reasoning capability without disclosing any sensitive information.
We start by configuring the system instruction for the cloud LLM role:
# Note: This prompt was iterated with AI
CLOUD_REASONER_INSTRUCTIONS = """
You are a scheduling reasoner.
You receive an anonymous scheduling problem with load IDs, current time, load durations,
energy use, availability, deadlines, latest-finish constraints, and electricity tariff windows.
Your task is to decide whether the target load should start now or later,
then propose a feasible schedule for the relevant loads.
Use only the information in the scheduling problem.
Keep the anonymous load IDs exactly as given.
Do not invent device names, household details, or missing constraints.
When comparing feasible schedules, account for deadlines, latest-finish constraints,
runtime, energy use, and tariff prices.
Return your answer through the structured output schema.
""".strip()
Unlike in the previous step, we use structured output directly. Here is the output schema:
from typing import Literal
from pydantic import BaseModel, Field

class ScheduledLoad(BaseModel):
    load_id: str = Field(
        ...,
        description="Anonymous load ID from the scheduling problem, such as load_A.",
    )
    start: str = Field(
        ...,
        description="Scheduled start time in 24-hour HH:MM format.",
    )
    end: str = Field(
        ...,
        description="Scheduled end time in 24-hour HH:MM format.",
    )
    reason: str = Field(
        ...,
        description="Short reason for choosing this time window.",
    )

class CloudScheduleReasoning(BaseModel):
    recommendation: Literal["run_now", "run_later"] = Field(
        ...,
        description="Whether the target load should start now or be deferred.",
    )
    proposed_schedule: list[ScheduledLoad] = Field(
        ...,
        description="Feasible schedule using only anonymous load IDs from the prompt.",
    )
    reasoning: str = Field(
        ...,
        description="Concise reasoning based only on the anonymous scheduling facts.",
    )
The cloud LLM will be tasked with filling in this schema. Then, we build the prompt for the cloud model:
def build_cloud_reasoning_prompt(local_sanitizer_output: dict[str, Any]) -> str:
    return f"""
Anonymous scheduling problem:
{local_sanitizer_output['scheduling_problem']}
Target load:
{local_sanitizer_output['target_load_id']}
Task:
Decide whether the target load should start now or later,
then propose a feasible schedule for the relevant anonymous loads.
""".strip()
Notice that we only send the anonymous scheduling problem to the cloud LLM. Neither the household memory nor the load mapping leaves the home.
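Since the sanitizer is itself an LLM, it may be worth checking its output deterministically before anything is sent out. The sketch below scans the scheduling problem for private terms. It is an assumption on my part rather than part of the notebook: `find_leaks` is a hypothetical helper, the deny-list is hand-maintained, and whole-word matching will miss variants such as plurals or paraphrases.

```python
import re


def find_leaks(text: str, private_terms: list[str]) -> list[str]:
    """Return the private terms that appear as whole words in the text."""
    leaks = []
    for term in private_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            leaks.append(term)
    return leaks


# Device names come from local_mapping; the extra deny-list is hypothetical.
local_mapping = {"load_A": "dishwasher", "load_B": "EV charger"}
deny_list = ["Maya", "Mark", "bedroom", "kids"]
private_terms = list(local_mapping.values()) + deny_list

clean = "Current Time: 18:30\nload_A: Duration Estimate: 90 minutes"
leaky = "load_A (dishwasher): must finish before Maya goes to bed"

print(find_leaks(clean, private_terms))
# → []
print(find_leaks(leaky, private_terms))
# → ['dishwasher', 'Maya']
```

If the returned list is non-empty, the workflow could refuse to call the cloud and rerun the sanitizer instead.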
Finally, we call Azure OpenAI:
import os

from openai import AzureOpenAI

# Setup cloud LLM client
cloud_client = AzureOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    azure_endpoint=os.environ["OPENAI_API_BASE"],
    api_version=os.environ["OPENAI_API_VERSION"],
)

# Prepare prompt
cloud_prompt = build_cloud_reasoning_prompt(local_sanitizer_output)

response = cloud_client.responses.parse(
    model="gpt-5.4",
    instructions=CLOUD_REASONER_INSTRUCTIONS,
    input=cloud_prompt,
    reasoning={"effort": "medium"},
    text_format=CloudScheduleReasoning,
)
cloud_reasoning = response.output_parsed.model_dump()
At this point, the cloud model has worked out the scheduling, but the result is still expressed in anonymous load IDs. That’s why we need a final step to do the translation.
2.5 Step 3: Local Grounding
This final step runs locally again.
As usual, we start by configuring the system instruction for the step 3 LLM. Its objective is to map the cloud result back into household language and produce the final recommendation:
# Note: This prompt was iterated with AI
LOCAL_FINALIZER_INSTRUCTIONS = """
You are a local smart-home assistant running inside the home.
A user asked whether to run a household load now or later.
The cloud model has already reasoned over an anonymous scheduling problem and returned a proposed schedule using load IDs.
You can see the original household context and the local mapping from anonymous load IDs back to household device names.
Your role is to write the response the user will read.
The response should answer the original now-or-later question in normal household language,
using the cloud schedule as the scheduling plan.
Use local_mapping to translate load IDs back to household device names.
Use the private household context only to make the response understandable and locally relevant.
Do not redo the scheduling optimization from scratch.
Do not imply that any appliance has already been started or scheduled automatically.
You are recommending what the user should do.
Rules:
- Answer the now-or-later question first.
- Use household device names, not anonymous load IDs.
- Preserve the important timing recommendation from the cloud schedule.
- Mention other scheduled loads only when they help explain the recommendation.
- If the cloud result conflicts with the private household context, prefer the local context and say so briefly.
- Keep the response concise and focused on the user's decision.
""".strip()
Then we build the corresponding prompt. Here, the local model receives four inputs:
- Original user question;
- Private household context;
- Local sanitizer output;
- Cloud reasoning result.
Here is the builder function:
def build_local_finalizer_prompt(
    private_context: dict[str, Any],
    user_question: str,
    local_sanitizer_output: dict[str, Any],
    cloud_reasoning: dict[str, Any],
) -> str:
    return f"""
User question:
{user_question}
Private household context:
{format_private_context(private_context)}
Local sanitizer output:
{json.dumps(local_sanitizer_output, indent=2)}
Cloud reasoning:
{json.dumps(cloud_reasoning, indent=2)}
Task:
Write the final user-facing recommendation.
Answer the user's actual question first.
Use the cloud proposed schedule above and do not create another schedule.
Mention secondary loads only if they help explain the recommendation.
Keep the answer concise and focused on the decision.
""".strip()
Finally, we call Gemma again:
LOCAL_MODEL = "gemma4:e4b"  # same local model as in Step 1

prompt = build_local_finalizer_prompt(
    private_context=private_context,
    user_question=USER_QUESTION,
    local_sanitizer_output=local_sanitizer_output,
    cloud_reasoning=cloud_reasoning,
)
response = ollama.chat(
    model=LOCAL_MODEL,
    messages=[
        {"role": "system", "content": LOCAL_FINALIZER_INSTRUCTIONS},
        {"role": "user", "content": prompt},
    ],
    think="medium",
    options={"temperature": 0, "num_ctx": 32768},
)
final_answer = response["message"]["content"]
That’s the full implementation of our three-step workflow.
2.6 Running the Workflow
Now let’s run the workflow and inspect the results.
In the first step, we used the local Gemma model to convert the private household context into an anonymous scheduling problem. Here is the output we got:
{
"target_load_id": "load_A",
"scheduling_problem": "Current Time: 18:30\nLoads:\n load_A:\n Energy Use: 1.2 kWh\n Duration Estimate: 90 minutes\n Earliest Start: 18:30\n Hard Deadline (Must finish by): 06:30\n Soft Stop Constraint (Cannot run after): 22:30\n load_B:\n Energy Use: 14 kWh\n Duration Estimate: 120 minutes\n Earliest Start: 18:30\n Hard Deadline (Must finish by): 07:00\nTariff Schedule:\n 17:00-20:00: $0.45/kWh\n 20:00-00:00: $0.22/kWh\n 00:00-06:00: $0.12/kWh\n 06:00-17:00: $0.25/kWh",
"local_mapping": {
"load_A": "dishwasher",
"load_B": "EV charger"
}
}
We see that the local model correctly filtered the context. The robot vacuum was mentioned in the original household memory, but the local model correctly identified that it’s irrelevant to the current scheduling question, so it is not included.
More importantly, we see that scheduling_problem only contains anonymous load IDs, and the cloud LLM only receives the abstract scheduling facts like duration, energy use, deadlines, etc.
In step 2, the cloud LLM returns the anonymous schedule (the free-text reason and reasoning fields are not shown here):
{
"recommendation": "run_later",
"proposed_schedule": [
{
"load_id": "load_A",
"start": "20:00",
"end": "21:30"
},
{
"load_id": "load_B",
"start": "00:00",
"end": "02:00"
}
]
}
Finally, Gemma maps the result back into user-friendly language:
You should run the dishwasher later.
To save money, wait until 8:00 PM tonight.
Starting then will allow it to finish by 9:30 PM, moving its energy use out of the most expensive time window and into a cheaper one.
The overall plan also schedules your EV charger for midnight (12:00 AM - 2:00 AM) to take advantage of the lowest electricity rates.
This is the pattern in action: we let the local model handle private context and final grounding, and the cloud model handle the heavy reasoning, but only over an anonymous problem.
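As a sanity check, the proposed schedule can be verified locally with a few lines of plain Python, using the tariff, runtimes, deadlines, and cutoff from the case setup. This is a sketch under one explicit assumption: each load draws power evenly over its runtime, which real appliances generally don’t. The helper names are hypothetical.

```python
# Tariff windows as (start minute, end minute, price in $/kWh), from the case setup.
TARIFF = [
    (0, 360, 0.12),      # 00:00-06:00
    (360, 1020, 0.25),   # 06:00-17:00
    (1020, 1200, 0.45),  # 17:00-20:00
    (1200, 1440, 0.22),  # 20:00-00:00
]


def to_minutes(hhmm: str) -> int:
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def price_at(minute: int) -> float:
    minute %= 1440  # wrap past midnight
    for start, end, price in TARIFF:
        if start <= minute < end:
            return price
    raise ValueError(f"no tariff window covers minute {minute}")


def run_cost(start: str, duration_min: int, energy_kwh: float) -> float:
    """Cost of one run, assuming constant power draw over the runtime."""
    start_min = to_minutes(start)
    kwh_per_minute = energy_kwh / duration_min
    return sum(price_at(start_min + i) * kwh_per_minute for i in range(duration_min))


def finishes_in_time(now: str, start: str, duration_min: int, limit: str) -> bool:
    """True if a run starting at `start` ends no later than `limit`, counted from `now`."""
    def offset(hhmm: str) -> int:
        return (to_minutes(hhmm) - to_minutes(now)) % 1440
    return offset(start) + duration_min <= offset(limit)


cost_now = run_cost("18:30", 90, 1.2)    # dishwasher started immediately
cost_later = run_cost("20:00", 90, 1.2)  # dishwasher per the cloud schedule
print(f"now: ${cost_now:.2f}, later: ${cost_later:.2f}")
# → now: $0.54, later: $0.26

dishwasher_ok = finishes_in_time("18:30", "20:00", 90, "22:30")  # bedtime cutoff
ev_ok = finishes_in_time("18:30", "00:00", 120, "07:00")         # departure deadline
print(dishwasher_ok, ev_ok)
# → True True
```

Under that assumption, deferring the dishwasher to 20:00 roughly halves its cost while still respecting the 22:30 cutoff, which is consistent with the recommendation above.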
3. Final Thoughts
The smart-home example above shows just one possibility in the larger design space of hybrid local-cloud LLM applications.
I believe the more general lesson here is that we don’t need to treat local and cloud models as two mutually exclusive deployment choices. In many applications, they can play different roles.
That is why I find it useful to ask the following three questions:
- Who should act first, the local model or the cloud model?
- When should the cloud model be called?
- What does the split actually buy us?
The last question is especially important. Privacy is usually the first thing people think of, but there could be other factors, such as cost/latency, reliability, and controllability. Don’t overlook those.
To me, that is the real promise of hybrid local-cloud LLM applications: not a compromise between local and cloud, but a more flexible way to design the application itself.
Reference
[1] Ray (2026), The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models. https://arxiv.org/abs/2605.26128