You've probably seen the benchmarks by now. Bifrost adds 11 microseconds of overhead per request. LiteLLM's Python proxy adds 40-50ms. The messaging is simple: latency matters for gateways. But this misses what teams actually building with Claude Code and Codex have discovered: the real problem isn't gateway latency alone. It's that coding agents need two completely different infrastructure layers, and teams are treating them like one.
The Two Layers of Coding Agent Infrastructure
When you deploy Claude Code or Codex into a team, you're actually solving two separate problems:
1. The Control Plane: Agent Management & Governance
What it does: Manage agent lifecycle, sessions, memory, permissions, scheduling, tool access, and audit trails.
What teams need here:
- Create agents without touching provider consoles
- Run agents on a schedule or via webhook
- Persist session state across restarts
- Control which teams/users can access which tools
- Log every agent action for compliance
- Manage MCP (Model Context Protocol) tool access across a team
- Integrate agents with internal systems (databases, git, APIs)
Why a gateway can't do this: A gateway sees request-response pairs. It doesn't know that Agent-A is trying to access Customer Database and needs to be blocked, or that you want to run a code review agent every night at 2 AM. These are control decisions that live above the request layer.
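To make that concrete, here is a minimal sketch of the kind of decision that lives above the request layer: a deny-by-default, per-agent tool allowlist that records every decision for audit. The `ToolPolicy` class, the agent IDs, and the tool names are all hypothetical and not taken from any real platform.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Hypothetical per-agent allowlist; not any real platform's schema."""
    allowed: dict[str, set[str]] = field(default_factory=dict)
    audit_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def authorize(self, agent_id: str, tool: str) -> bool:
        # Deny by default: an agent with no entry can use no tools.
        ok = tool in self.allowed.get(agent_id, set())
        # Record every decision, allowed or denied, for later audit.
        self.audit_log.append((agent_id, tool, ok))
        return ok


# Illustrative policy: agent-a may use GitHub only, agent-b may also
# reach the customer database.
policy = ToolPolicy(allowed={
    "agent-a": {"github"},
    "agent-b": {"github", "customer-db"},
})
```

A gateway that only sees request-response pairs has no `agent_id` or tool identity to key this check on, which is why it belongs in the control plane.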
2. The Data Plane: Fast LLM Routing & Reliability
What it does: Route LLM requests to the right provider, handle fallbacks, track costs, log traffic, enforce budgets.
Why it matters for coding agents: Each code edit, test run, or tool invocation is an LLM call. Claude Code can make 30-50 calls per task. If your gateway adds 1ms per call, that's 30-50ms of added overhead per task. At 40-50ms per call through a Python gateway, a 30-call task picks up 1.2-1.5 seconds of pure gateway latency, and a 50-call task up to 2.5 seconds. That's the latency you actually feel.
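The arithmetic is simple enough to check yourself. This sketch assumes the calls in a task run sequentially, so per-call overhead adds up linearly; the per-call figures are the ones quoted above, and the function name is just for illustration.

```python
def gateway_overhead_ms(calls: int, per_call_ms: float) -> float:
    """Total gateway overhead for sequential calls: it adds up linearly."""
    return calls * per_call_ms


# Per-call overhead figures quoted in this post, in milliseconds.
scenarios = [
    ("11 µs gateway", 0.011),
    ("1 ms gateway", 1.0),
    ("40 ms gateway", 40.0),
    ("50 ms gateway", 50.0),
]

for label, per_call_ms in scenarios:
    low = gateway_overhead_ms(30, per_call_ms)   # 30-call task
    high = gateway_overhead_ms(50, per_call_ms)  # 50-call task
    print(f"{label}: {low:.2f} to {high:.2f} ms per task")
```

If an agent issues calls in parallel, the wall-clock impact is lower than this, so treat the numbers as an upper bound for a single task.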
The Gap Teams Are Hitting
Most gateway discussions treat these as one problem: a fast gateway that routes LLM requests. That's necessary, but not sufficient.
Here's what I'm seeing in production teams running Claude Code:
- Someone needs to create agents without coding → Control plane job
- Agents need to persist across pod restarts → Control plane job
- Tool definitions need to stay in sync → Control plane job
- Agent A needs MCP server X, Agent B can't have it → Control plane job
- We need to route to Claude when latency matters and fall back to Gemini on rate limits → Data plane job
- We need sub-millisecond overhead so 30 calls don't add 1.5 seconds → Data plane job
- We need cost tracking per agent → Data plane job
- We need audit logs of every tool call → Hybrid: agent control + data logging
You can't do the first four with a pure gateway. You can't efficiently do the next three with a control platform that doesn't understand routing, and the last one needs both layers.
Teams are currently solving this by bolting together:
- A managed service for agent orchestration (Bedrock, Anthropic Hosted)
- A separate gateway for LLM routing (Bifrost, Portkey, Kong)
- Custom scripts to sync them
- Extra glue for observability
That works, but it creates operational friction.
What Production Teams Actually Need
The teams I've talked to who are scaling Claude Code and Codex beyond a single developer describe it like this:
"Claude Code is incredible when it's one engineer using it locally. The moment we try to run it on a team, we need:
- A place to define 'these are our agents' (not 30 copies in 30 notebooks)
- A way to say 'this agent runs on a schedule'
- Control over which tools each agent can access
- Visibility into what each agent is doing
- The ability to route to Claude or Gemini based on task type without editing the agent
- Sub-millisecond gateway latency so 30 LLM calls don't turn into 1.5 seconds of overhead"
That's control plane + data plane thinking. Two distinct layers.
Where the Separation Matters
Control Plane (Agent Platform)
- Harness abstraction: swap Claude Code ↔ Codex ↔ OpenCode without rewriting agents
- Session persistence: pause an agent, restart, pick up from where it left off
- Scheduling: cron, webhooks, API triggers
- Tool management: centralized MCP registry, per-agent capability matrix
- Memory: persistent context across runs
- Access control: who can create/run/modify which agents
Example: You define an agent in the platform UI, attach it to Claude Code, give it GitHub + AWS MCPs, schedule it to run every night on your codebase's failing tests, and it automatically creates PRs with fixes. All without anyone touching provider consoles.
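As a rough sketch, the record a control plane stores for that nightly agent could look like the following. The `AgentSpec` class, its field names, and the harness identifiers are assumptions made for illustration, not any platform's actual schema; the 2 AM cron expression is likewise just an example value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical harness identifiers used only in this sketch.
KNOWN_HARNESSES = {"claude-code", "codex", "opencode"}


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical control-plane record; field names are illustrative."""
    name: str
    harness: str
    mcp_servers: tuple[str, ...] = ()
    schedule_cron: Optional[str] = None  # standard 5-field cron expression
    task: str = ""

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if self.harness not in KNOWN_HARNESSES:
            problems.append(f"unknown harness: {self.harness}")
        if self.schedule_cron is not None and len(self.schedule_cron.split()) != 5:
            problems.append("schedule_cron must have 5 fields")
        return problems


nightly = AgentSpec(
    name="nightly-test-fixer",
    harness="claude-code",
    mcp_servers=("github", "aws"),
    schedule_cron="0 2 * * *",  # every night at 02:00
    task="Fix failing tests and open a PR with the changes",
)
```

Because the harness is a field rather than something baked into the agent, swapping Claude Code for Codex becomes a one-line change to the spec instead of a rewrite.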
Data Plane (Fast Gateway)
- Sub-millisecond routing: multiple calls per task, latency compounds
- Multi-provider routing: route to Claude for complex tasks, Gemini for simple ones
- Fallback chains: if Claude rate-limits, automatically try Gemini
- Cost tracking: per-provider, per-model, per-agent visibility
- Budget enforcement: hard caps that prevent runaway costs
- Observability: structured logs of every LLM call
Example: Your coding agent makes 40 calls. The gateway adds <1ms per call (under 40ms total). Its cost tracking shows that this agent used Claude on 25 calls and Gemini on 15, at a cost of $0.23. If Claude hits rate limits, the gateway transparently retries on Gemini without the agent knowing.
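Here is a minimal sketch of that fallback-plus-accounting behavior. Everything in it is hypothetical: the `FallbackRouter` class, the `RateLimited` exception, and the stub provider functions stand in for real provider clients, and the cost figure is made up.

```python
from dataclasses import dataclass, field
from typing import Callable


class RateLimited(Exception):
    """Raised by a provider callable when it is rate-limited (hypothetical)."""


# A provider takes a prompt and returns (response_text, cost_in_usd).
Provider = Callable[[str], tuple[str, float]]


@dataclass
class FallbackRouter:
    providers: list[tuple[str, Provider]]  # ordered: first entry is preferred
    calls: dict[str, int] = field(default_factory=dict)
    cost_usd: dict[str, float] = field(default_factory=dict)

    def complete(self, prompt: str) -> str:
        for name, call in self.providers:
            try:
                text, cost = call(prompt)
            except RateLimited:
                continue  # try the next provider in the chain
            # Only successful calls are counted and billed.
            self.calls[name] = self.calls.get(name, 0) + 1
            self.cost_usd[name] = self.cost_usd.get(name, 0.0) + cost
            return text
        raise RuntimeError("all providers in the chain were rate-limited")


def claude_stub(prompt: str) -> tuple[str, float]:
    raise RateLimited()  # simulate a rate-limited primary provider


def gemini_stub(prompt: str) -> tuple[str, float]:
    return ("ok from gemini", 0.002)  # made-up response and cost


router = FallbackRouter([("claude", claude_stub), ("gemini", gemini_stub)])
reply = router.complete("fix the failing test")
```

The caller only ever sees `reply`; which provider served it shows up in `router.calls` and `router.cost_usd`, which is the per-provider visibility the list above asks for.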
Recent Signals That This Separation Is Hardening
Claude Code: Now supports Hooks for policy enforcement at key lifecycle events (for example PreToolUse, PostToolUse, and Stop). That's control-plane-level thinking inside the agent harness.
Codex: Added Managed Agents API for creating/running agents from your own infrastructure. That's recognizing the control plane problem.
LiteLLM-Rust: Launched June 2026 specifically for "coding agent workloads" with <1ms target on Claude Code calls, integrated sandbox support (E2B, Daytona), and durable sessions on the roadmap. That's explicitly targeting the data plane + agent runtime integration.
TrueFoundry, Kong, Portkey: All shipping "agent gateway" features that blur the line — they're trying to build control + data in one platform.
The market is recognizing that governance + routing are different concerns, even if some platforms try to unify them.
The Practical Decision Framework
If you're running Claude Code for a single developer or a small team:
- Local Claude Code + direct API calls to Claude/OpenAI
- No gateway needed yet
If you're scaling Claude Code or Codex across a team:
- Pick your control plane: self-hosted (LiteLLM Agent Platform), managed (Bedrock), or API-first (Codex API)
- Pick your data plane: fast gateway (LiteLLM-Rust, Bifrost) or managed routing (OpenRouter, Portkey)
- Ensure they can talk: same config format, shared database, compatible APIs
If you're mixing multiple harnesses (Claude Code + Codex + OpenCode):
- Control plane becomes critical: you need abstraction over harness differences
- Data plane must support all providers those harnesses call
If you need compliance/audit/data residency:
- Control plane must self-host
- Data plane must self-host
- Consider platforms that do both (LiteLLM Agent Platform + LiteLLM-Rust + LiteLLM core)
What I'd Measure
When evaluating control + data infrastructure for coding agents:
Control Plane:
- Can I create an agent in a UI? How much YAML/JSON?
- Does it understand session state? Can agents pause and resume?
- Can I attach MCPs to agents and control per-agent access?
- Do audit logs show me every tool call and agent action?
- Can I schedule agents on cron/webhook/API?
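For the pause-and-resume question, a quick way to reason about it is a checkpoint that survives a process restart. The sketch below is an assumption-heavy toy: the `Session` class and its fields are hypothetical, and a real harness would persist far more state (tool results, working tree, model context) in a real store rather than a temp directory.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Session:
    """Hypothetical session record; real harnesses store far more state."""
    agent: str
    step: int = 0
    messages: list[str] = field(default_factory=list)

    def save(self, path: Path) -> None:
        # Write to a temp file first, then rename over the target, so a
        # crash mid-write never leaves a half-written checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(self)))
        tmp.replace(path)

    @classmethod
    def load(cls, path: Path) -> "Session":
        return cls(**json.loads(path.read_text()))


checkpoint = Path(tempfile.mkdtemp()) / "agent-b.json"
before = Session(agent="agent-b", step=3, messages=["ran tests", "2 failing"])
before.save(checkpoint)

# Simulates a restart: nothing from `before` is reused, only the file.
after = Session.load(checkpoint)
```

When evaluating a platform, the equivalent test is to kill the worker mid-task and check whether the agent resumes at step 3 or starts over.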
Data Plane:
- What's the measured overhead per request? (target: <1ms for coding agents)
- Does it support all the providers my harnesses call?
- Can I route based on task type or cost? (not just round-robin)
- Does it fail gracefully when a provider is rate-limited?
- Is cost/token tracking per-agent visible?
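For the overhead question, one approach is to replay the same requests directly and through the gateway, then look at the distribution of the difference rather than a single average. This sketch assumes you can pair the samples, for example by pointing both paths at a mock upstream with fixed latency; the function name and the sample numbers are made up for illustration.

```python
import statistics


def overhead_percentiles(direct_ms, via_gateway_ms):
    """Return (p50, p99) of per-request overhead from paired samples."""
    if len(direct_ms) != len(via_gateway_ms) or len(direct_ms) < 2:
        raise ValueError("need two equal-length samples of at least 2 requests")
    # Overhead for each request is gateway latency minus direct latency.
    overhead = [g - d for d, g in zip(direct_ms, via_gateway_ms)]
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(overhead, n=100, method="inclusive")
    return statistics.median(overhead), cuts[98]


# Made-up sample: a mock upstream that always takes 100 ms.
direct = [100.0, 100.0, 100.0, 100.0, 100.0]
via_gateway = [100.5, 100.4, 100.6, 100.5, 101.0]
p50, p99 = overhead_percentiles(direct, via_gateway)
print(f"p50 overhead: {p50:.2f} ms, p99 overhead: {p99:.2f} ms")
```

The p99 matters more than the median here: with 30-50 calls per task, an agent is likely to hit the slow tail at least once on most tasks.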
Integration:
- Do the two layers use the same config format?
- Do they share the same database or API?
- Can I swap data planes without redeploying agents?
The Unsexy Part
The sexiest discussion is always about latency. But in teams I talk to running Claude Code at scale, the conversation goes:
"Okay, gateway overhead is solved. Now: how do we keep prod from running experiments? How do we keep Agent-A from accessing the customer database? How do I know what happened yesterday when Agent-B deleted something? Can I run this every night? Can I give the junior engineer the ability to create agents without giving them API access?"
That's all control plane work. It's unglamorous, but getting it wrong is what keeps you up at 3 AM.
Wrapping Up
Coding agents (Claude Code, Codex, OpenCode) need two halves:
- Control Plane: Agent creation, sessions, memory, scheduling, tool governance, audit trails. This is where you keep your team safe.
- Data Plane: Sub-millisecond LLM routing, fallbacks, cost tracking, multi-provider support. This is where you keep your latency low.
A fast gateway solves problem #2. A control platform solves problem #1. Both are table stakes for production.
The platforms that will win here are the ones that make the separation clear and let teams pick the right tool for each job — or that do both well without unnecessary coupling.
What's your experience been? Are you running coding agents on a team? What's the first thing that broke when you tried to scale from one developer to five?
Resources
- LiteLLM Agent Platform docs — Control plane for multi-harness agent management
- LiteLLM-Rust launch — Rust-based data plane targeting <1ms overhead on coding agents
- LiteLLM core routing & fallbacks — Multi-provider routing for LLM calls
- Anthropic Hooks for Claude Code — Policy enforcement at agent lifecycle events
- OpenAI Codex Managed Agents API — API-first control of Codex agents
Paul Twist is an AI infrastructure engineer based in Berlin, focused on production agent systems and multi-provider LLM routing.
