Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems
Abstract
Agentic AI systems act through tools and sub-agents, yet the controls meant to bound their financial and environmental cost still sit on dashboards evaluated beside or after execution. Green SARC applies the SARC governance-by-architecture framework — four enforcement sites in the agent loop — to FinOps and GreenOps, contributing the theory of what to enforce and how to predict it. We report four policy-independent results. (i) The unconstrained “State Snowball” is in loop depth; on real multi-step plans (SWE-rebench) it holds on , with median curvature exceeding the linear-accretion prediction — real plans accrete faster than the model (§11.6). (ii) On real residuals the Normal- gate under-covers ( at nominal ); split-conformal calibration holds (; Theorem 2). (iii) A soft Lagrangian penalty tuned to the budget in expectation breaches it on of seeds; the architectural gate breaches . (iv) Under binding budgets the gate’s over-budget incidence is on synthetic and real (BurstGPT) arrivals. End-to-end token/USD/carbon savings (–) are real but policy-dependent in magnitude — set by a scope-cap knob, not by gate rejections. The library is open-source, dependency-free, and ships a regeneration script for every cited number.
Keywords: agentic AI, governance-by-architecture, predictive FinOps, GreenOps, token economics, conformal prediction, runtime constraints, SARC.
1 Introduction
The cost center of artificial intelligence has shifted from training, whose resource envelope is fixed at design time, to the inference trajectory: the runtime-determined sequence of model calls, tool invocations, and conditional retries an agent emits while pursuing a goal. A classical inference call has a bounded, predictable cost. An agentic workflow has neither: the same task, executed twice, can differ by an order of magnitude in token consumption. Both the API bill and the energy draw are therefore stochastic quantities governed by the execution trace, not the specification.
Two instruments are commonly deployed against this volatility. Post-hoc auditing reconciles spend after the billing period closes. Policy-as-code encodes budget rules in a layer evaluated alongside, but not inside, the agent loop. Both inherit the defect SARC identified for correctness obligations: they evaluate constraints after, or beside, the execution they are meant to bound. A budget breach detected at month-end cannot un-spend the tokens; a carbon overage logged to a dashboard cannot un-emit the carbon.
Relationship to SARC.
SARC [1] is a governance-by-architecture framework that treats constraints as first-class specification objects and compiles them into four enforcement sites: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. Green SARC is an application of that architecture — we reuse the four sites unchanged — but carries its own theory, orthogonal to SARC’s correctness results. SARC governs whether the system is right; Green SARC governs what the system costs. The two are independent axes that happen to share enforcement sites.
Contributions.
1. State-Snowball theorem, formal and empirical (§4). Naive context accretion yields quadratic cumulative prompt cost (Theorem 1); the synthetic fit recovers the closed-form coefficient exactly, and on real ShareGPT traffic the cumulative-prompt curvature is negative: the snowball is an artifact of naive orchestration, not of chat itself (§4, §10).
2. Predictive Pre-Action Gate with calibration and an anytime-valid safety bound (§5, §7). We generalize the gate to a learned forecast (of which rule-based accounting is the zero-information limit), give split-conformal marginal safety (Theorem 2), and an anytime-valid trajectory over-spend bound via Ville’s inequality (Theorem 3).
3. Binding-budget gate evaluation on the Pareto frontier (§9). Across a budget grid the gate’s empirical over-budget incidence stays at or below the target risk level while completing more work at zero overspend, dominating the soft-penalty frontier.
4. Real-trace coverage validation on ShareGPT (§10). On real, non-Gaussian residuals the Normal-$z$ gate under-covers at tight risk levels $\alpha$ while split conformal holds nominal coverage; adaptive conformal restores coverage under distribution shift.
5. Ablation with paired-bootstrap CIs (§8). A four-condition ablation decomposes the saving by lever (scope, routing, breaker), each with a CI.
6. Real-arrival ablation on a public production trace (§11). The four-condition ablation re-run on the BurstGPT trace of real Azure OpenAI traffic reproduces the synthetic savings ordering under real burstiness and prompt/response distributions, with paired-bootstrap CIs.
7. Open-source library and reproducible benchmark (§6). A dependency-free implementation with a regression contract enforced in CI via make verify.
We also relate the gate to Bandits-with-Knapsacks as a feasibility oracle (Proposition 3). The decoupling of cost from correctness governance (§3) frames the artifact’s scope.
In one line: SARC gives us where to enforce; we contribute what to enforce, why it decouples from safety, how to predict it, and the evidence that the prediction is calibrated and the architecture is necessary.
2 Background and Related Work
SARC.
A SARC specification declares, per constraint, its source, class, predicate, verification point, response protocol, and operating point, and compiles these into the four enforcement sites named above. It formalizes the minimal invariants for specification–trace correspondence and argues that finite reward penalties do not in general substitute for hard runtime constraints — a claim we make quantitative for the cost domain in §13.
FinOps and GreenOps.
FinOps brings financial accountability to variable cloud spend; GreenOps extends the discipline to carbon and energy. The financial and environmental cost of large-scale AI compute has been quantified for the training regime [3, 2, 4]; the inference regime studied here shifts this cost into a runtime-variable, per-trajectory quantity. Both disciplines are predominantly practiced as observe-and-reconcile loops [5], a cadence adequate only when the consumption unit is predictable. Agentic inference violates that premise.
Efficient inference systems.
A large systems literature reduces the unit cost of inference: phase-split serving [6], high-throughput single-GPU offloading [7], and statistical multiplexing across models [8]. These optimize how a fixed set of calls is served; Green SARC is complementary and orthogonal: it governs which calls an agent is permitted to make, given a budget, before they are issued. The two compose cleanly. If a serving optimization reduces the effective per-token cost from $c$ to $c/\eta$ for some efficiency $\eta > 1$, then a fixed token budget $B$ admits $\eta$ times as much work, since the gate’s feasibility test $\hat{c}/\eta \le B$ on a forecast cost $\hat{c}$ is equivalent to $\hat{c} \le \eta B$; the carbon ceiling scales identically through the proportional energy saving. Cheaper serving thus relaxes the gate’s effective budget by a known factor rather than changing its mechanism.
Cost-aware routing and cascades.
A complementary line reduces cost by choosing which model answers a query: FrugalGPT learns an LLM cascade that escalates to a stronger model only when a cheaper one is judged inadequate [20], and RouteLLM trains a binary router between a strong and a weak model from preference data [21]. These optimize the per-query model choice to maximize quality at lower expected cost, but they offer no hard guarantee: a router tuned to spend less in expectation can still overrun any fixed budget on an adversarial or heavy-tailed query stream, exactly the soft-constraint failure we quantify in §13. Green SARC is orthogonal and composable: it is the enforcement contract a router runs inside, supplying the per-action feasibility test that turns an expected-cost heuristic into a budget-safe one (the CBwK feasibility-oracle framing of Proposition 3). We concede the overlap honestly: Green SARC’s energy-aware routing lever (§8) is mechanistically the same idea as FrugalGPT’s cascade — down-route when a cheaper model suffices — and our contribution there is not the router but the gate that bounds it.
Conformal prediction.
Our safety guarantee rests on split (inductive) conformal prediction, which converts any point predictor into a set/interval with distribution-free, finite-sample marginal coverage under exchangeability [9, 10]. Where residuals are non-exchangeable (distribution shift), adaptive conformal inference restores coverage online [11]; we flag this as the path to robustness on real traces (§15).
Constrained decision-making.
Admitting actions under a depletable budget is formally a Bandits-with-Knapsacks / constrained-MDP problem [26]. Green SARC does not solve the optimal-policy problem; it provides the enforcement primitive (a calibrated per-action feasibility test) that any such policy needs at runtime, and contrasts it with the soft-penalty (Lagrangian) relaxation in §13. We make the relationship precise in Proposition 3: the gate composes with any sublinear-regret CBwK policy as a feasibility oracle without changing its regret order.
Architectural lineage.
The four-site, enforce-in-the-loop design has ancestry well beyond the author’s own SARC framework. The reference monitor of Anderson’s 1972 security study — a mediation mechanism that must be invoked on every access, be tamper-proof, and be small enough to verify [22] — is the direct conceptual ancestor of the Pre-Action Gate: a non-bypassable check interposed before each consequential operation. Admission control in networking and queueing systems (admit a flow only if its reserved rate fits the remaining capacity) is the same two-phase reserve-then-commit primitive our Budget implements. And runtime verification — synthesizing monitors that check an execution against a specification as it runs — is the correctness-domain analogue of the Action-Time Monitor and Post-Action Auditor. Green SARC’s novelty is not the enforce-in-the-loop stance itself but its application to predicted cost and carbon, with a calibrated forecast standing in for the boolean access check.
What Green SARC is not.
Pure observability tools (LangSmith, Helicone, raw OpenTelemetry) give post-hoc cost without enforcement: they tell you what was spent, after it was spent. API-level rate limits (provider tier limits, sidecar throttling on request counts) enforce request counts, not predicted cost or carbon, so a single expensive call passes unchecked. In-agent budget tracking (framework callbacks) is in-process and bookkeeping-only, with no cross-process attribution and no enforcement contract spanning the four sites. Green SARC differs by being a four-site governance contract whose gate predicts cost before the action fires; the closest comparable systems govern only post hoc, or only on request counts.
Regulatory context.
Enterprise deployments increasingly must keep auditable records of automated decisions and account for system accuracy and the energy footprint of AI. The EU AI Act mandates automatic record-keeping/logging (Art. 12), transparency and information provision (Art. 13), and accuracy/robustness with documented metrics (Art. 15) [27]; the Corporate Sustainability Reporting Directive (CSRD) extends mandatory sustainability disclosure to in-scope undertakings [28]. Green SARC’s Post-Action Auditor produces, as a byproduct of execution, the attribution-preserving, predicted-vs-actual trace these regimes require, extended to per-trajectory token yield and a carbon proxy.
3 Decoupling FinOps Governance from Correctness Governance
We state the decoupling explicitly because it defines the scope of the artifact.
Proposition 1 (Independence of axes).
Let a governance layer be characterized by the predicate class it enforces. SARC’s correctness layer enforces predicates over action validity (is this action safe and permitted?). The Green SARC layer enforces predicates over resource consumption (does this action fit the cost and carbon budget?). The two predicate classes share enforcement sites but neither implies the other: a perfectly safe agent can be ruinously expensive, and a perfectly cheap agent can be unsafe.
Consequences for the artifact:
- Green SARC is deployable with no safety regime present. Its value derives from the cloud bill, which every operator incurs.
- It tracks cost and carbon only; correctness/accuracy is deliberately out of scope and is not logged as a governed quantity. The quality floor (§5) is the caller’s concern.
- It composes with SARC where both are wanted (the sites are shared) but does not depend on it. The reference implementation has no dependency on SARC.
4 The State-Snowball Cost Theorem
Definition 1 (State Snowball).
A multi-agent loop exhibits the State Snowball when each step re-submits the full accreted context, so the per-step prompt grows monotonically with step index.
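Assumption 1 (Linear accretion).
The prompt at step $k$ (zero-indexed) is $p_k = b + k\delta$ tokens: a fixed base $b$ plus $\delta$ tokens appended per hop. This is the regime in which the unconstrained loop is studied; sub-linear summarization is exactly the mitigation we analyze.
Theorem 1 (Quadratic cost of the unconstrained loop).
Under Assumption 1, the cumulative prompt cost of an $n$-step loop is
$$C(n) = \sum_{k=0}^{n-1} (b + k\delta) = nb + \frac{\delta}{2}\,n(n-1) = \Theta(n^2). \tag{1}$$
Proof.
Direct summation of the arithmetic series $\sum_{k=0}^{n-1} k = n(n-1)/2$. The dominant term is $\tfrac{\delta}{2}\,n^2$, with leading coefficient $\delta/2$. ∎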
Empirical confirmation (synthetic).
Figure 1 plots the cumulative prompt cost of the benchmark’s baseline against loop depth, with , . A second-order least-squares fit recovers , identical to the closed-form ; the residual is numerically zero. This verifies that the simulator faithfully realizes Assumption 1 (the recovered coefficient is a property of the simulator’s construction, not independent evidence that real workloads accrete linearly). Bounding the per-hop increment with an Adapter Node (scope cap tokens) collapses the curve to linear: at depth the scoped cost is lower than the snowball. The Action-Time circuit breaker caps directly, bounding the other factor.
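The quadratic-versus-linear contrast can be checked in a few lines. The sketch below is illustrative only: the names `base`, `delta`, and `cap` and the numeric values are assumptions, not the benchmark's parameters, and the scope cap is modeled simply as a ceiling on the per-step prompt size.

```python
def snowball_cost(depth: int, base: int, delta: int) -> int:
    """Cumulative prompt tokens when step k re-submits base + k*delta tokens."""
    return sum(base + k * delta for k in range(depth))


def closed_form(depth: int, base: int, delta: int) -> int:
    """Arithmetic-series sum: depth*base + delta*depth*(depth-1)/2."""
    return depth * base + delta * depth * (depth - 1) // 2


def scoped_cost(depth: int, base: int, delta: int, cap: int) -> int:
    """Same loop, but each prompt is clipped at `cap` tokens (assumed scoping model)."""
    return sum(min(base + k * delta, cap) for k in range(depth))


def leading_coefficient(costs: list[int]) -> float:
    """Quadratic coefficient recovered from the second finite difference."""
    return (costs[2] - 2 * costs[1] + costs[0]) / 2


if __name__ == "__main__":
    base, delta, cap = 1000, 200, 1500  # hypothetical values
    curve = [snowball_cost(n, base, delta) for n in range(12)]
    print("leading coefficient:", leading_coefficient(curve), "expected:", delta / 2)
    print("depth 10 snowball:", curve[10], "scoped:", scoped_cost(10, base, delta, cap))
```

Because the cumulative cost is an exact quadratic in depth under linear accretion, the second finite difference equals the per-hop increment, so half of it is the leading coefficient; once the cap binds, each further step adds a constant and the curve becomes linear.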
Real chat traffic.
Whether real multi-turn traffic accretes quadratically is a separate, empirical question. On the ShareGPT replay of §10 ( conversations, up to turns) we fit the cumulative billed prompt tokens against turn depth to a quadratic. The leading coefficient is with paired-bootstrap CI — significantly negative. Real conversations are concave in depth, not convex: humans and well-behaved assistants do not blindly re-submit the full transcript every turn. The snowball is therefore a failure mode of naive multi-agent orchestration (full-context re-submission), not an intrinsic property of conversation — which is precisely why the Adapter-Node scoping that prevents it is the highest-leverage lever in the ablation (§8).
5 The Predictive Pre-Action Gate
This is the paper’s central construct. In SARC the Pre-Action Gate evaluates a deterministic predicate. We generalize it to a gate that decides on a learned, calibrated forecast of the resource cost of a proposed action. Table 1 fixes notation.
| Symbol | Meaning |
|---|---|
| $a$; $x$ | proposed action; its context |
| $B_t$ | live remaining token budget |
| $I_{r,t}$ | marginal carbon intensity (gCO2e/kWh), region $r$, time $t$; real grid data used in §11.5 [16], stipulated by default |
| $L_t$ | latency/SLA headroom (declared in the state; not enforced in Phase 1) |
| $\hat{c}$; $\hat{g}$ | forecast token cost; forecast carbon |
| $c$; $g$ | realized token cost; realized carbon |
| $\hat{\sigma}$ | estimator residual standard deviation (per key) |
| $\alpha$ | gate risk level; operating-point confidence is $1-\alpha$ |
| $\hat{q}_{1-\alpha}$ | split-conformal quantile of calibration residuals |
| $G_{\max}$; $G_t$ | carbon ceiling; carbon already spent on trajectory |
5.1 Augmented state
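The state ingests financial and environmental telemetry: the live remaining token budget, the marginal carbon intensity, and the latency/SLA headroom (Table 1). Of these, the budget and carbon terms are enforced at the gate; the latency headroom is declared for completeness but is not enforced in the Phase-1 implementation (the field exists in the state object and is unused), a divergence we record honestly in §15.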
5.2 The estimator
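Let $a$ be a proposed action in context $x$. A learned estimator predicts expected token cost $\hat{c}(a,x)$ and carbon $\hat{g}(a,x)$ before the action fires. The implementation regresses completion tokens on prompt tokens online per key using Welford-style sufficient statistics [25], exposing the residual standard deviation $\hat{\sigma}$. Writing $c$ for the realized token cost, $B_t$ for the remaining budget, $G_t$ for the carbon already spent on the trajectory, and $G_{\max}$ for the carbon ceiling, the gate admits $a$ iff the forecast fits the remaining budget at confidence $1-\alpha$ and the carbon ceiling:
$$\Pr\big[c(a,x) \le B_t\big] \ge 1-\alpha \quad\text{and}\quad G_t + \hat{g}(a,x) \le G_{\max}. \tag{2}$$
Operationally the first test is a one-sided upper bound $U_\alpha(a,x) \le B_t$. The implementation forms $U_\alpha = \hat{c} + z_{1-\alpha}\,\hat{\sigma}$ with the normal quantile $z_{1-\alpha}$ (the “Normal-$z$ gate”); §7 replaces $z_{1-\alpha}\,\hat{\sigma}$ with a distribution-free conformal margin $\hat{q}_{1-\alpha}$. Rule-based accounting is the special case in which $\hat{c}$ is a constant threshold independent of $(a,x)$: the zero-information gate.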
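A minimal sketch of the two-part admission test follows, assuming a Gaussian residual model for the token bound. The `Forecast` type, the function names, and the default risk level are hypothetical illustrations and are not the API of `green_sarc.gate`.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class Forecast:
    tokens: float  # point forecast of token cost
    sigma: float   # residual standard deviation for this key
    carbon: float  # forecast carbon for the action (gCO2e)


def normal_upper_bound(forecast: Forecast, alpha: float) -> float:
    """One-sided upper bound: forecast + z_{1-alpha} * sigma."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return forecast.tokens + z * forecast.sigma


def admit(
    forecast: Forecast,
    remaining_budget: float,
    carbon_spent: float,
    carbon_ceiling: float,
    alpha: float = 0.05,  # assumed default, not taken from the paper
) -> bool:
    """Admit only if the token bound fits the budget AND carbon fits the ceiling."""
    fits_budget = normal_upper_bound(forecast, alpha) <= remaining_budget
    fits_carbon = carbon_spent + forecast.carbon <= carbon_ceiling
    return fits_budget and fits_carbon


if __name__ == "__main__":
    f = Forecast(tokens=1000.0, sigma=100.0, carbon=1.0)
    print(admit(f, remaining_budget=1200.0, carbon_spent=0.0, carbon_ceiling=10.0))
    print(admit(f, remaining_budget=1100.0, carbon_spent=0.0, carbon_ceiling=10.0))
```

Setting `sigma` to zero and `tokens` to a constant recovers the rule-based, zero-information gate described above.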
5.3 The closed learning loop
The estimator is trained on the Post-Action Auditor’s own output, closing a loop:
| (3) |
At cold start is weak and the gate behaves conservatively (worst-case forecast); it sharpens as the audit log accumulates. §8 shows the forecast MAE collapsing from a cold-start value of 4,000 tokens to the irreducible noise floor within 20 observations.
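The cold-start-then-sharpen behaviour can be sketched with an online simple linear regression that keeps only running means and co-moments. This is a sketch under stated assumptions: the class name, the `min_obs` threshold, and the choice of prompt plus `max_tokens` as the worst-case forecast are illustrative, not the library's implementation.

```python
import math


class OnlineLinearForecaster:
    """Regress completion tokens on prompt tokens from running sufficient statistics."""

    def __init__(self, min_obs: int = 3) -> None:
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # sum of squared deviations of x
        self.sxy = 0.0  # co-moment of x and y
        self.syy = 0.0  # sum of squared deviations of y
        self.min_obs = min_obs  # assumed warm-up threshold

    def update(self, prompt_tokens: float, completion_tokens: float) -> None:
        """Welford-style single-pass update from one audited (x, y) pair."""
        self.n += 1
        dx = prompt_tokens - self.mean_x
        dy = completion_tokens - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.sxx += dx * (prompt_tokens - self.mean_x)
        self.sxy += dx * (completion_tokens - self.mean_y)
        self.syy += dy * (completion_tokens - self.mean_y)

    def _warm(self) -> bool:
        return self.n >= self.min_obs and self.sxx > 0.0

    def forecast_total(self, prompt_tokens: float, max_tokens: float) -> float:
        """Prompt plus predicted completion; worst case until warmed up."""
        if not self._warm():
            return prompt_tokens + max_tokens
        slope = self.sxy / self.sxx
        completion = self.mean_y + slope * (prompt_tokens - self.mean_x)
        return prompt_tokens + min(max(completion, 0.0), max_tokens)

    def residual_std(self) -> float:
        """Residual standard deviation; infinite while the fit is undefined."""
        if not self._warm() or self.n < 3:
            return math.inf
        rss = max(self.syy - self.sxy * self.sxy / self.sxx, 0.0)
        return math.sqrt(rss / (self.n - 2))


if __name__ == "__main__":
    est = OnlineLinearForecaster()
    print("cold:", est.forecast_total(400, 4000))
    for x, y in [(100, 60), (200, 110), (300, 160)]:
        est.update(x, y)
    print("warm:", est.forecast_total(400, 4000), "sigma:", est.residual_std())
```

Feeding each audited predicted-versus-actual pair back through `update` is the loop closure the text describes: the forecast starts at the conservative ceiling and tightens as observations accumulate.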
5.4 Release sequencing: step then trajectory
The estimator is built in two phases. Phase 1, per-action: predicts the next step only — simple to deploy, and it generates the labeled actuals needed for Phase 2. Phase 2, full-trajectory: a planner-level estimator predicts the cost of an entire plan before the agent starts, enabling rejection of expensive plans, not merely expensive steps. Phase 1 is the data engine for Phase 2; only Phase 1 is implemented here (Phase 2 is an interface stub).
5.5 Budget safety
We give two safety statements: a pointwise one assuming calibration, and a distribution-free one (proved in §7).
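Proposition 2 (Pointwise budget safety).
If the estimator is calibrated so that the realized token cost $c$ exceeds the gate’s upper bound $U_\alpha$ with probability at most $\alpha$ pointwise, i.e. $\Pr[c > U_\alpha] \le \alpha$, then an admitted action breaches the budget margin it was admitted against with probability at most $\alpha$; residual breaches scale with the gate risk level $\alpha$, not with the opportunity for breach.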
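Theorem 2 (Predictive Gate Safety, split-conformal).
Let the conformal margin $\hat{q}_{1-\alpha}$ be calibrated on an exchangeable set of $n$ residuals (Assumption 2, §7) and let the gate admit only when $\hat{c} + \hat{q}_{1-\alpha} \le B_t$, where $\hat{c}$ is the forecast token cost and $B_t$ the remaining budget. Then for a fresh exchangeable action with realized cost $c$ the per-action budget-breach probability satisfies $\Pr[c > B_t] \le \alpha$, with no distributional assumption on the residuals. Over a trajectory of $T$ gated admits, the expected number of breaches is at most $\alpha T$.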
The proof is in §7. Theorem 2 bounds the breach probability of each individual admitted action; for the whole trajectory, Theorem 3 (§7) gives an anytime-valid probabilistic tail bound on the cumulative over-spend across any trajectory prefix, uniformly over stopping times. Together these mirror, in the resource domain, SARC’s claim that residual hard violations scale with enforcement-stack error rather than with the opportunity for violation.
5.6 The Sustainable Token Yield reward
Within the gated action space the agent optimizes a reward that penalizes brute-force inference where deterministic computation suffices:
| (4) |
with task utility , FinOps weight , GreenOps weight . As SARC argues, this finite penalty shapes behavior within the hard budget/carbon constraints; it does not replace them. §13 makes this quantitative: a soft penalty tuned to the budget in expectation still breaches it most of the time.
5.7 The constrained optimization
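$$
\begin{aligned}
\max_{\pi}\quad & \mathbb{E}_{\pi}[R] \\
\text{s.t.}\quad & \textstyle\sum_{t} c_t \le B && \text{(budget; Pre-Action Gate)} \\
& n_{\text{loop}} \le N_{\max} && \text{(loop bound; Action-Time Monitor)} \\
& \textstyle\sum_{t} g_t \le G_{\max} && \text{(ESG ceiling; Post-Action Auditor)} \\
& Q \ge Q_{\min} && \text{(quality floor; caller-owned)}
\end{aligned}
\tag{5}
$$
Here $\pi$ is the agent’s policy, $R$ the reward of Eq. (4), $c_t$ and $g_t$ the realized per-step token cost and carbon, $n_{\text{loop}}$ the loop count, and $Q$ the task quality.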
Budget and carbon are hard constraints enforced at their sites, not merely penalized in .
Proposition 3 (Gate as a CBwK feasibility oracle).
Let be any Bandits-with-Knapsacks policy achieving regret with per-action costs bounded in . Composing with the split-conformal Pre-Action Gate — restrict ’s action set at each round to and let select within that set — yields a policy whose regret remains and which additionally satisfies the anytime-valid budget bound of Theorem 3 with probability at least .
Proof sketch.
The gate is a per-round feasibility filter applied to the action set, so is run on a (possibly smaller) feasible set; its per-round regret against the best feasible arm is unchanged, preserving the order. The added budget guarantee holds because only ever plays arms admitted by the gate, to which Theorem 3 applies verbatim; a union bound over the regret event and the conformal coverage event () gives the composite guarantee. The gate supplies feasibility; the bandit supplies optimality. ∎
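The composition in the proof sketch amounts to filtering the action set before the policy chooses. The following sketch shows that shape; `gated_choice`, `upper_bound`, and `policy` are hypothetical names, and the policy here stands in for any bandit rule.

```python
from typing import Callable, Optional, Sequence, TypeVar

A = TypeVar("A")


def gated_choice(
    actions: Sequence[A],
    upper_bound: Callable[[A], float],
    remaining_budget: float,
    policy: Callable[[Sequence[A]], A],
) -> Optional[A]:
    """Restrict the action set to gate-feasible arms, then let the policy pick."""
    feasible = [a for a in actions if upper_bound(a) <= remaining_budget]
    if not feasible:
        return None  # nothing fits the budget; the caller escalates
    return policy(feasible)


if __name__ == "__main__":
    # Hypothetical arms: (name, calibrated cost bound, estimated reward).
    arms = [("large", 900.0, 0.9), ("medium", 400.0, 0.7), ("small", 100.0, 0.4)]
    greedy = lambda feasible: max(feasible, key=lambda arm: arm[2])
    print(gated_choice(arms, lambda arm: arm[1], 500.0, greedy))
```

The policy never sees an infeasible arm, so its own selection logic is untouched; only the set it optimizes over shrinks.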
5.8 Calibration matters
6 Mapping the Enforcement Sites and Implementation
Table 2 maps the four SARC sites to Green SARC predicates and to the modules that implement them. The reference implementation is a standalone, dependency-free Python library; it composes with SARC via shared sites rather than importing it.
| Site | Green SARC predicate / role | Module |
|---|---|---|
| Pre-Action Gate | Predictive cost/carbon forecast; admit iff the forecast upper bound fits the remaining budget at confidence $1-\alpha$ and carbon fits. | gate.py, estimator.py |
| Action-Time Monitor | Circuit breaker on loop count / marginal cost; kills runaway retry/re-plan loops. | monitor.py |
| Post-Action Auditor | Logs predicted-vs-actual cost/carbon per action: ESG record and estimator training signal. | auditor.py |
| Escalation Router | Routes budget-/carbon-exhausted tasks to human review or a deterministic fallback. | escalation.py |
| State scoping | Adapter Node bounds the per-hop increment (§4). | scoping.py |
The runtime gate (green_sarc.gate) defaults to the Normal-$z$ upper bound of §5; as of v0.3.0 the split-conformal upper bound of §7 is also available at runtime, opt-in via calibrator=... (§6.1).
6.1 Conformal calibration in the runtime gate
The split-conformal bound of §7 and the adaptive variant of §10 are no longer only paper-side analyses: v0.3.0 ships them as runtime strategies in green_sarc.calibrator (SplitConformal, ACIConformal, behind a Calibrator protocol). The PreActionGate constructor gains an optional calibrator=... argument; supplying it replaces the Normal-$z$ token bound with the conformal one, and omitting it preserves the prior behaviour exactly (all pre-existing tests pass unchanged, and make verify holds).
We validate the runtime path against the paper-side analysis: re-running the §10 ShareGPT study with the runtime SplitConformal calibrator (--use-runtime-conformal) reproduces the held-out coverage of the offline analysis to within percentage points across . Engineering contract: the calibrator is fit once from a residual log (offline) and, for ACIConformal, updated online from realized-vs-predicted cost at the Post-Action Auditor; the protocol lives in green_sarc.calibrator and the public surface follows semver within . The difference between “we prove” (§7) and “we ship” is now a one-argument opt-in.
Capabilities: today vs. roadmap.
Table 3 states explicitly which Green SARC capabilities ship today (Phase 1) and which are roadmap items. The paper’s empirical results in §8 and §9 use only Phase 1 features; §10 uses real data but Phase 1 code.
| Capability | Phase 1 (today) | Phase 2 (roadmap) |
|---|---|---|
| Pre-Action Gate (step) | Normal-; conformal opt-in via calibrator= | ACI as default + conditional coverage |
| Predictive forecast | OLS per | trajectory estimator |
| Action-Time Monitor | Loop / marginal / total cost breaker | latency-headroom enforcement |
| Post-Action Auditor | JSONL / SQLite | Parquet, multi-tenant attribution |
| Budget | Single-process threading.Lock; distributed Redis backend (experimental) | Postgres durable ledger + fair-share reservations |
| Escalation Router | Deterministic + log-only handlers | plan-level rejection on trajectory forecast |
| Adapters | MCP, PAIS sidecar, OTel SpanProcessor | cross-process OTLP receiver; MCP transport auth |
| Audit schema | plan_id, session_id, parent_action_id | Phase-2 trajectory schema (typed events) |
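The single-process Budget row pairs a lock with the reserve-then-commit admission pattern described in §2. A minimal sketch of that pattern follows; the class name `ReservingBudget` and its methods are hypothetical and do not reproduce the library's Budget interface.

```python
import threading
from typing import Dict, Optional


class ReservingBudget:
    """Two-phase token budget: reserve an upper bound, then commit the actual."""

    def __init__(self, total: float) -> None:
        self._lock = threading.Lock()
        self._remaining = total
        self._reserved: Dict[int, float] = {}
        self._next_id = 0

    @property
    def remaining(self) -> float:
        with self._lock:
            return self._remaining

    def reserve(self, upper_bound: float) -> Optional[int]:
        """Hold `upper_bound` tokens; return a reservation id, or None if it does not fit."""
        with self._lock:
            if upper_bound > self._remaining:
                return None
            self._remaining -= upper_bound
            rid = self._next_id
            self._next_id += 1
            self._reserved[rid] = upper_bound
            return rid

    def commit(self, rid: int, actual: float) -> None:
        """Settle a reservation: refund unused tokens, or charge any excess."""
        with self._lock:
            held = self._reserved.pop(rid)
            self._remaining += held - actual

    def release(self, rid: int) -> None:
        """Cancel a reservation whose action never ran."""
        with self._lock:
            self._remaining += self._reserved.pop(rid)


if __name__ == "__main__":
    budget = ReservingBudget(1000.0)
    rid = budget.reserve(600.0)
    print("second reservation:", budget.reserve(500.0))
    budget.commit(rid, 400.0)
    print("remaining after commit:", budget.remaining)
```

Holding the upper bound rather than the point forecast is what keeps concurrent admits from jointly overdrawing the budget; the refund at commit returns whatever the action did not use.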
Reproducibility contract. The committed benchmarks/reference_summary.json (20 seeds, 4 conditions × 4 metrics) is the regression contract: a pull request that drifts these numbers by more than per cell, or by more than absolute percentage points on the +full token reduction, fails CI via make verify. This paper is the companion to release v0.4.0 of besanson/Greensarc; the tag pins the exact source tree that produced every number cited above.
API stability. The public surfaces in green_sarc.governor, .state, .gate, .auditor, .escalation, and the three adapters (mcp, pais_sidecar, otel) follow semver within ; the examples/ and benchmarks/ paths and the audit-record schema may evolve.
Tests and CI. The library passes 152 unit and integration tests on Python 3.11 and 3.12 (an additional SARC-composition suite is skipped unless the optional sarc extra is installed), including concurrency race tests, runtime conformal-coverage tests, an ACI-restoration test, sidecar SSE streaming tests, a distributed-budget race test, Prometheus-metrics tests, live-feed loader tests, and end-to-end ablation reproduction. CI runs ruff, mypy on src/ and benchmarks/, pytest -q, and make verify on every push; the release workflow gates publication on the test matrix.
Gate overhead. A microbenchmark (benchmarks/gate_overhead.py, warm decisions, single process) puts the Pre-Action Gate’s cost at p50 s / p99 s per decision on the default Normal- path (M decisions/s) — negligible beside any model call. The split-conformal path is p50 s / p99 s, dominated by recomputing the empirical residual quantile over the calibration set on each decision (a cost a production deployment would precompute and amortize); even unamortized it stays under ms. Latency is hardware-dependent; the committed figure is from the reference runner.
6.2 Deploying Green SARC
Three integration patterns, in increasing order of loose coupling:
1. In-process: GreenGovernor.with_defaults(...) wraps the agent’s executor directly. Step-level safety, single replica; best for single-agent CLI tools and notebooks (green_sarc.governor).
2. PAIS sidecar: ASGI middleware on /v1/chat/completions; returns HTTP 429 on reject, with SSE-aware passthrough for streaming. Best for agents fronted by an OpenAI-compatible API (green_sarc.adapters.pais_sidecar).
3. KAOS-managed MCP advisory + OTel observe: register Green SARC as an MCP server for advisory gate/audit tools, and consume actuals cross-process via the OTel SpanProcessor. Loosest coupling, advisory-only safety (green_sarc.adapters.mcp, green_sarc.adapters.otel).
7 Split-Conformal Calibration of the Gate
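Assumption 2 (Exchangeability).
The estimator is fixed (trained on data disjoint from the calibration set). The one-sided nonconformity scores $s_i = c_i - \hat{c}_i$, $i = 1,\dots,n$, on the calibration set and the score $s_{n+1}$ of a fresh action are exchangeable.
Define $\hat{q}_{1-\alpha}$ as the $\lceil (n+1)(1-\alpha) \rceil$-th smallest of $s_1,\dots,s_n$ (and $+\infty$ if that index exceeds $n$). The calibrated gate bound is $U = \hat{c} + \hat{q}_{1-\alpha}$.
Proof of Theorem 2.
By Assumption 2 the scores $s_1,\dots,s_{n+1}$ are exchangeable, so the rank of $s_{n+1}$ among them is uniform on $\{1,\dots,n+1\}$ (ties broken at random). Then
$$\Pr\big[s_{n+1} \le \hat{q}_{1-\alpha}\big] \;\ge\; \frac{\lceil (n+1)(1-\alpha) \rceil}{n+1} \;\ge\; 1-\alpha.$$
Since $s_{n+1} = c - \hat{c}$, the event $\{s_{n+1} \le \hat{q}_{1-\alpha}\}$ is exactly $\{c \le \hat{c} + \hat{q}_{1-\alpha}\}$. If the gate admits only when $\hat{c} + \hat{q}_{1-\alpha} \le B_t$, then $\{c > B_t\} \subseteq \{c > \hat{c} + \hat{q}_{1-\alpha}\}$, giving $\Pr[c > B_t] \le \alpha$. The trajectory bound follows by linearity of expectation over $T$ gated admits. This is the standard inductive-conformal quantile lemma [9, 10], specialized to a one-sided cost score. ∎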
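The quantile rule in the proof is short enough to state as code. This is a sketch of standard split-conformal calibration for a one-sided score (realized minus forecast cost); the function names are illustrative and not those of `green_sarc.calibrator`.

```python
import math
from typing import Sequence


def conformal_margin(scores: Sequence[float], alpha: float) -> float:
    """ceil((n+1)(1-alpha))-th smallest calibration score, or +inf if out of range."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this risk level
    return sorted(scores)[k - 1]


def conformal_admit(forecast: float, margin: float, remaining_budget: float) -> bool:
    """Admit only when forecast + conformal margin fits the remaining budget."""
    return forecast + margin <= remaining_budget


if __name__ == "__main__":
    residuals = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0]  # hypothetical scores
    margin = conformal_margin(residuals, alpha=0.5)
    print("margin:", margin)
    print("admit:", conformal_admit(forecast=100.0, margin=margin, remaining_budget=110.0))
```

The infinite margin for small calibration sets is deliberate: with too few residuals the gate cannot certify the requested risk level, so it rejects rather than guessing.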
Remark 1 (Marginal, not conditional).
Anytime-valid trajectory safety.
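Theorem 2 bounds each admitted action’s breach probability marginally. For the cumulative over-spend along a trajectory (monitored continuously, at a data-dependent stopping time) we want a time-uniform bound. Write the per-step forecast residual (realized minus forecast cost) as $\varepsilon_i = c_i - \hat{c}_i$ and let $S_t = \sum_{i=1}^{t} \varepsilon_i$ be the cumulative residual over the first $t$ admitted actions, with $S_0 = 0$ and $(\mathcal{F}_t)$ the natural filtration of admitted actions.
Theorem 3 (Anytime-valid cumulative over-spend).
Suppose the centered residuals $\varepsilon_i$ are conditionally $\sigma$-sub-Gaussian given $\mathcal{F}_{i-1}$. Then for any $\delta \in (0,1)$ and any mixing constant $\rho > 0$, simultaneously over all $t \ge 1$, and hence at every stopping time,
$$\Pr\Big[\exists\, t \ge 1:\; S_t \ge \sqrt{(\sigma^2 t + \rho)\Big(2\log\tfrac{1}{\delta} + \log\tfrac{\sigma^2 t + \rho}{\rho}\Big)}\,\Big] \le \delta. \tag{6}$$
Consequently the realized cumulative cost exceeds its forecast plus an envelope of this width with probability at most $\delta$, uniformly over the trajectory.
Proof.
Fix $\lambda \in \mathbb{R}$ and define $M_t(\lambda) = \exp\big(\lambda S_t - \lambda^2 \sigma^2 t / 2\big)$, with $M_0(\lambda) = 1$. By the conditional sub-Gaussian assumption, $\mathbb{E}[e^{\lambda \varepsilon_t} \mid \mathcal{F}_{t-1}] \le e^{\lambda^2 \sigma^2 / 2}$, so $\mathbb{E}[M_t(\lambda) \mid \mathcal{F}_{t-1}] \le M_{t-1}(\lambda)$: $M_t(\lambda)$ is a non-negative supermartingale. Mixing over $\lambda$ with a centered Gaussian prior of variance $1/\rho$ (the method of mixtures) yields another non-negative supermartingale $\bar{M}_t = \sqrt{\rho/(\sigma^2 t + \rho)}\,\exp\big(S_t^2 / (2(\sigma^2 t + \rho))\big)$ with $\bar{M}_0 = 1$. Ville’s inequality [13] gives $\Pr[\exists\, t: \bar{M}_t \ge 1/\delta] \le \delta$. Evaluating the Gaussian integral and rearranging into a bound on $S_t$ gives a time-uniform boundary of order $\sigma\sqrt{t \log(t/\delta)}$ (up to lower-order terms absorbed by the mixture); this is the standard sub-Gaussian confidence sequence [12]. Because the bound holds simultaneously for all $t$, it holds at any stopping time by optional stopping. A fuller derivation is in Appendix C. ∎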
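To make the monitoring use concrete, the sketch below evaluates a standard Gaussian-mixture boundary for a sub-Gaussian cumulative sum and reports the first step at which the running residual crosses it. It is an illustration consistent with the proof's construction, not necessarily the paper's exact expression: `rho` is an assumed mixing constant (the inverse prior variance), and `sigma` must be supplied by the caller.

```python
import math
from typing import Iterable, Optional


def mixture_boundary(t: int, sigma: float, delta: float, rho: float = 1.0) -> float:
    """Time-uniform boundary from a Gaussian mixture with prior variance 1/rho."""
    v = sigma * sigma * t + rho
    return math.sqrt(v * (2.0 * math.log(1.0 / delta) + math.log(v / rho)))


def first_crossing(
    residuals: Iterable[float], sigma: float, delta: float, rho: float = 1.0
) -> Optional[int]:
    """Return the first step t where the cumulative residual exceeds the boundary."""
    cumulative = 0.0
    for t, residual in enumerate(residuals, start=1):
        cumulative += residual
        if cumulative > mixture_boundary(t, sigma, delta, rho):
            return t
    return None


if __name__ == "__main__":
    for t in (1, 10, 100, 1000):
        print(t, round(mixture_boundary(t, sigma=1.0, delta=0.05), 3))
    print(first_crossing([0.1] * 50, sigma=1.0, delta=0.05))
```

Because the boundary holds simultaneously over all steps, the check can run after every admitted action without inflating the error rate. Remark 3's caveat still applies: the width is only as trustworthy as the sub-Gaussian assumption behind `sigma`.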
Remark 2 (A stronger assumption than Theorem 2).
Remark 3 (The sub-Gaussian assumption is not supported by the real residuals).
We state the tension plainly. The forecast residuals on real traffic are right-skewed and mildly heavy-tailed — skew , excess kurtosis , with normality decisively rejected (§10). The conditional sub-Gaussian hypothesis of Theorem 3 is therefore not established by our data; the theorem is an idealized companion bound, and a deployment should not rely on the sub-Gaussian width as if it were validated. The faithful replacement is a variance-adaptive confidence sequence that assumes only bounded or sub-exponential increments. Concretely, the empirical-Bernstein confidence sequence of Howard et al. [12] replaces the boundary of Theorem 3 with one of order
where is the empirical cumulative variance and bounds the per-step residual; the width adapts to the realized residual variance rather than to a stipulated , and the linear term carries the heavy tail. Proof sketch. The same supermartingale construction as Theorem 3 goes through with the sub-Gaussian exponential process replaced by the empirical-Bernstein supermartingale of [12] (their Thm. 4 / the canonical-assumption framework), to which Ville’s inequality [13] applies verbatim; we do not re-derive it. This is the form a production deployment should monitor the cumulative over-spend with — it is anytime-valid under exactly the bounded/sub-exponential conditions our residuals plausibly satisfy, where the sub-Gaussian boundary is not justified. Promoting it from a paper-side bound to the runtime gate is the Phase-2 item flagged in §15.
Empirical coverage (synthetic).
Figure 2 validates both bounds on a held-out test split ( learned forecasts from the benchmark). Empirical coverage tracks nominal within percentage points for the Normal- gate and within for split conformal across . The two agree closely because the synthetic data-generating process has Gaussian residuals, so the Normal- assumption happens to hold. The value of conformal is precisely that it attains the same coverage without that assumption — which is exactly what real, non-Gaussian traffic demands. §10 shows that on real ShareGPT residuals the Normal- gate under-covers at the tight one would actually deploy, while split conformal continues to hold nominal coverage.
8 Synthetic Evaluation and Ablation
Workload.
We use a synthetic Integrated Business Planning (IBP) demand-forecasting pipeline: a fan-out workload of SKUs, each handled by a depth- agent loop, over seeds. No real LLM is called; per-step token usage is simulated from a known relationship (so the estimator has a real signal to learn and runs are deterministic), while the treatment path exercises the real governance stack end to end. A small fraction of SKUs attempt to loop depth, a retry-storm stress scenario for the circuit breaker.
Ablation.
We run four conditions — baseline +scope +scope+route +full — so each lever’s contribution is isolated, with a paired-bootstrap CI on each reduction (Figure 3, Table 4). Relative to the baseline (M tokens, $, gCO2e per run under time-varying intensity), the full stack uses M tokens, $, and gCO2e.
| Condition | Token reduction | USD reduction | Carbon reduction (time-var.) |
|---|---|---|---|
| +scope | |||
| +scope+route | |||
| +full |
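For readers unfamiliar with the interval construction, a paired bootstrap over seeds can be sketched as follows. The assumptions are ours: the resampling unit is the seed, the statistic is the relative token reduction, and the interval is a plain percentile interval; the benchmark's exact procedure may differ.

```python
import random
from typing import List, Sequence, Tuple


def reduction(baseline: Sequence[float], treated: Sequence[float]) -> float:
    """Relative reduction of the treated total against the baseline total."""
    return 1.0 - sum(treated) / sum(baseline)


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_boot: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI that resamples seeds, keeping each (baseline, treated) pair intact."""
    rng = random.Random(seed)
    n = len(baseline)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(reduction([baseline[i] for i in idx], [treated[i] for i in idx]))
    stats.sort()
    lo_index = int((1.0 - level) / 2.0 * n_boot)
    hi_index = min(int((1.0 + level) / 2.0 * n_boot), n_boot) - 1
    return stats[lo_index], stats[hi_index]


if __name__ == "__main__":
    base = [100.0, 120.0, 90.0, 110.0, 105.0]  # hypothetical per-seed token totals
    full = [52.0, 58.0, 47.0, 56.0, 50.0]
    print("reduction:", round(reduction(base, full), 4))
    print("95% CI:", paired_bootstrap_ci(base, full))
```

Resampling the pair rather than each arm independently removes seed-to-seed workload variance from the interval, which is why the comparison is paired.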
Forecast quality and cold start.
In the full condition the learned estimator attains a token-cost MAE of and WAPE of over all admitted actions. Figure 4 isolates the learning dynamics on a single-key stream: per-action absolute error falls from a cold-start 4,000 tokens (worst-case forecast: prompt + full max_tokens) to the irreducible noise floor within 20 observations; rolling MAE drops from (first half) to (second half), WAPE to . Figure 5 shows predicted-vs-actual over learned forecasts (, WAPE ).
A negative result, stated plainly.
In this headline workload the gate issues zero rejections: the benchmark’s budget is generous, so the savings come from state scoping, routing, and the circuit breaker, not from the gate refusing actions. The gate’s contribution here is the forecast (which the auditor logs and the breaker and router consume) and the guarantee it would provide under a binding budget. We exercise the gate against a binding budget separately in §9, and again in §12 and §13. We consider it important not to over-claim the gate’s role in the aggregate token number.
9 Gate Behaviour Under Binding Budgets
§8’s headline workload has a non-binding budget, so the gate never rejects. Here we make the budget bind and measure the gate where its guarantee matters.
9.1 Protocol
We sweep the token budget over , with (the mean full-snowball cost), seeds each, running the full Green SARC stack at . The estimator is warm-started on an independent stream so we measure steady-state binding-budget behaviour, not the cold-start transient. We report, per budget: admission rate (admitted / attempted steps); over-budget incidence (fraction of admitted steps whose realized cost exceeded the budget remaining at the moment of admission, i.e. the per-action breach event of Theorem 2); completed-trajectory rate (fraction of SKUs whose every step ran before the budget was exhausted); MAE on admitted actions; and total tokens.
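The per-budget metrics can be made concrete with a small sketch. It assumes the gate admits a step when its forecast upper bound fits the remaining budget and then debits the realized cost; `run_gate` and its return keys are hypothetical names, not the benchmark's code.

```python
def run_gate(steps, budget):
    """Replay steps against a token budget.

    steps: list of (forecast_upper_bound, realized_cost) in arrival order.
    A step is admitted iff its upper bound fits the remaining budget; a
    breach is an admitted step whose realized cost exceeds what remained.
    """
    remaining = budget
    admitted = 0
    breaches = 0
    for upper_bound, realized in steps:
        if upper_bound <= remaining:
            admitted += 1
            if realized > remaining:
                breaches += 1
            remaining -= realized
    return {
        "admission_rate": admitted / len(steps),
        "over_budget": breaches / admitted if admitted else 0.0,
        "spent": budget - remaining,
    }

# Toy usage: the third step no longer fits and is rejected.
result = run_gate([(60, 50), (60, 55), (60, 40)], budget=120)
```

Whenever the realized cost stays at or below the forecast upper bound, an admitted step cannot breach, which is the mechanism the coverage guarantee relies on.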
9.2 Results
Table 5 reports the sweep. Over-budget incidence is at every budget level — comfortably within the target — confirming Theorem 2 empirically: the calibrated gate does not admit actions it cannot afford. Admission and completion degrade smoothly and monotonically as the budget tightens: at (a budget one-quarter of the naive baseline) the gate still admits of attempted steps and completes of trajectories with zero overspend; from the budget is slack and everything completes. Forecast MAE is stable ( tokens) across budgets.
| admission | over-budget | completed | MAE (tok) | tokens | |
|---|---|---|---|---|---|
| M | |||||
| M | |||||
| M | |||||
| M | |||||
| M | |||||
| M |
9.3 The Pareto frontier
Figure 6 plots completed-trajectory fraction against over-budget incidence for the gate (sweeping ) and for the §13 soft penalty (sweeping its weight , with over-budget measured against the binding reference ). The gate’s frontier lies along the bottom axis ( over-budget at every completion level it reaches), dominating the soft penalty, which can only complete most trajectories by breaching the budget on every seed. The penalty frontier is essentially bimodal: on this workload its realized spend jumps from “admit-cheap” (little completed, within budget) to “admit-all” (everything completed, over budget), with no intermediate tracking the budget — a direct consequence of its budget-blindness.
9.4 Reading
The gate’s empirical over-budget incidence tracks across the entire budget grid (it never exceeds it, here attaining ); admission degrades smoothly as tightens rather than collapsing; and the gate’s work-vs-overspend frontier dominates the soft-penalty baseline of §13. This is the binding-budget evidence that §8’s non-binding headline could not provide.
10 Real-Trace Coverage Validation
§7’s coverage check used synthetic, Gaussian residuals. Here we validate the gate’s calibration on real forecast residuals — the experiment the prior draft flagged as future work.
10.1 Dataset and preprocessing
We replay anon8231489123/ShareGPT_Vicuna_unfiltered [14], a public corpus of real ChatGPT/GPT-4 conversations on the Hugging Face Hub, released under a permissive research license. We use ShareGPT because LMSYS-Chat-1M is access-gated and would not reproduce on a clean clone without a credential; ShareGPT is ungated and serves the same purpose. No LLM is called: we use token counts only. Turns are tokenized with tiktoken (cl100k_base); for each assistant turn we form the pair , capped to a realistic deployment window. This yields pairs across conversations, split into calibration () and test () by a conversation-level partition (§10.3).
10.2 Residuals are not Gaussian
We fit online OLS on the calibration split. The residuals (Figure 7) have skew and excess kurtosis ; an Anderson–Darling test gives statistic , far above the critical value , and D’Agostino’s test returns : normality is decisively rejected. The Gaussian- assumption underlying the Phase-1 runtime gate does not hold on real traffic.
10.3 Coverage: Gaussian- vs. split conformal
We split the calibration and test sets by conversation: every turn of a conversation is assigned wholly to one side, so that within-conversation residual correlation never straddles the split (a row-level shuffle would leak it, violating exchangeability; on this corpus that leak inflates the reported conformal coverage at by pp, so the conversation-level number we report below is the honest, slightly looser one). Figure 8 compares empirical coverage to nominal on the conversation-level test split. The Normal- gate is mis-calibrated: it over-covers at loose (e.g. pp at , wasting budget) and, more dangerously, under-covers at the tight one actually deploys — pp at and pp at , i.e. roughly more budget breaches than promised. Split conformal stays within percentage points of nominal across the entire range ( pp at ). On real residuals the distribution-free bound is not a nicety — it is the difference between a gate that keeps its safety promise and one that quietly violates it at the operating point.
10.4 Two worlds, one guarantee
The paper now spans a synthetic-residual world (§4–§9), where Gaussian and conformal bounds coincide because the data-generating process is Gaussian by construction, and a real-residual world (§10), where they diverge and only conformal holds nominal coverage. Conformal calibration is the bound that survives both. This is also why §15 lists promoting conformal into the runtime gate as the leading Phase-2 item.
10.5 Distribution shift
Coverage guarantees assume the deployment distribution matches calibration; real workloads drift. We split the corpus by conversation into a short-context regime (calibration) and a long-context regime (deployment) — classifying each conversation by its own maximum depth so all of its turns land in one regime — then train the conformal quantile on the former and deploy on the latter ( vs. pairs). Figure 9 shows the result. The fixed quantile mis-covers post-shift — it drifts to against a target ( pp off, here over-conservative, needlessly rejecting work). Adaptive conformal inference (ACI [11]), updating the quantile level online at rate , restores empirical coverage to ( pp off target) within the rolling window. Under drift, the static conformal quantile is no longer sufficient; ACI is the runtime mechanism that maintains the guarantee.
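The ACI update can be sketched in a few lines. This follows the update rule of Gibbs and Candès [11], alpha_{t+1} = alpha_t + gamma (alpha - err_t), applied to one-sided scores; the synthetic shift below (a wider score distribution at deployment) is an assumption for illustration and does not reproduce the direction or size of the ShareGPT shift.

```python
import math
import random

def empirical_quantile(sorted_scores, level):
    """Quantile of calibration scores; infinite outside (0, 1) as ACI requires."""
    if level >= 1.0:
        return math.inf
    if level <= 0.0:
        return -math.inf
    k = math.ceil(level * len(sorted_scores)) - 1
    return sorted_scores[max(0, min(k, len(sorted_scores) - 1))]

def fixed_coverage(cal_scores, stream_scores, alpha):
    """Coverage of a static quantile fitted once on the calibration scores."""
    q = empirical_quantile(sorted(cal_scores), 1.0 - alpha)
    return sum(1 for s in stream_scores if s <= q) / len(stream_scores)

def aci_coverage(cal_scores, stream_scores, alpha, gamma):
    """Coverage when the working level alpha_t is adapted online."""
    cal = sorted(cal_scores)
    alpha_t = alpha
    covered = 0
    for s in stream_scores:
        q = empirical_quantile(cal, 1.0 - alpha_t)
        err = 0 if s <= q else 1
        covered += 1 - err
        # Miss -> alpha_t shrinks (wider bound); hit -> alpha_t grows slightly.
        alpha_t += gamma * (alpha - err)
    return covered / len(stream_scores)

# Synthetic shift: deployment scores are three times wider than calibration.
rng = random.Random(1)
cal = [rng.gauss(0, 1) for _ in range(500)]
stream = [rng.gauss(0, 3) for _ in range(2000)]
fixed = fixed_coverage(cal, stream, alpha=0.1)
adaptive = aci_coverage(cal, stream, alpha=0.1, gamma=0.05)
```

The adaptive run's long-run miss rate is pinned near the target by the update itself, independent of the score distribution, whereas the static quantile inherits whatever the shift does to it.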
11 Real-Arrival Ablation on Production Traffic
§8’s ablation ran on synthetic IBP arrivals — the leading threat to validity. Here we re-run the same four-condition ablation on a real LLM serving trace, converting the headline result from synthetic to empirical.
11.1 Dataset and trajectory construction
We use BurstGPT [15] (BurstGPT_1.csv, CC-BY-4.0), a trace of real Azure OpenAI traffic with schema (Timestamp, Model, Request tokens, Response tokens, Total tokens, Log Type). We take a -request sample (after dropping failed responses, Response tokens); the model mix is GPT-3.5 (“ChatGPT”) and GPT-4 requests, mapped to the benchmark’s efficient and frontier profiles respectively. Request tokens is the prompt and Response tokens the realized completion — the gate’s per-action target. No LLM is called: token counts only, as in §10.
BurstGPT v1.0 carries no session identifier, so we reconstruct trajectories by temporal clustering: consecutive same-model requests within a s window are grouped, capped at depth (the IBP default); API-log rows are single-step. This yields trajectories with median depth (real serving traffic is dominated by independent single requests) and maximum depth . We set the Adapter-Node scope cap to the median prompt ( tokens) and the circuit breaker to the depth cap, and document both as policy choices. (When BurstGPT v1.1 ships SessionID, the clustering heuristic becomes a one-line group-by.)
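The clustering heuristic can be sketched as below. It assumes the window is measured between consecutive requests (the text does not say whether it is measured from the previous request or from the first request of the cluster); `window_s` and `max_depth` are left as parameters, and the function name is illustrative.

```python
def cluster_trajectories(requests, window_s, max_depth):
    """Group consecutive same-model requests into trajectories.

    requests: list of (timestamp_s, model, request_tokens, response_tokens),
    sorted by timestamp. A request joins the current trajectory when it uses
    the same model, arrives within window_s of the previous request, and the
    trajectory has not reached max_depth; otherwise it starts a new one.
    """
    trajectories = []
    current = []
    for req in requests:
        timestamp, model = req[0], req[1]
        if current:
            last_timestamp, last_model = current[-1][0], current[-1][1]
            joins = (
                model == last_model
                and timestamp - last_timestamp <= window_s
                and len(current) < max_depth
            )
            if not joins:
                trajectories.append(current)
                current = []
        current.append(req)
    if current:
        trajectories.append(current)
    return trajectories

# Toy usage with made-up rows.
rows = [
    (0, "ChatGPT", 10, 5), (3, "ChatGPT", 12, 6), (4, "GPT-4", 20, 9),
    (100, "GPT-4", 8, 4), (101, "GPT-4", 9, 4), (102, "GPT-4", 7, 3),
]
depths = [len(t) for t in cluster_trajectories(rows, window_s=5, max_depth=2)]
```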
11.2 Results
Table 6 and Figure 10 report the ablation with paired-bootstrap CIs over trajectories. Scope (Adapter-Node prompt bounding) cuts tokens by and carbon by ; routing of trajectories to the efficient model adds USD savings () and carbon () at no further token cost — exactly the lever decomposition the synthetic ablation predicted (scope drives tokens, routing converts to USD/carbon). The savings ordering is confirmed on real arrivals.
| Condition | Token reduction | USD reduction | Carbon reduction |
|---|---|---|---|
| +scope | |||
| +scope+route | |||
| +full |
An honest negative. +full is identical to +scope+route: the circuit breaker logs zero trips and the gate (under a non-binding budget) issues zero rejections. Real serving traffic has none of the runaway retry loops the synthetic IBP injected, so these two levers are dormant safeguards here — their value appears only under stress (the IBP runaway SKUs, §8) and under binding budgets (§9, and below).
11.3 Binding budget under real arrivals
We repeat the §9 sweep on the real trace ( baseline tokens, ; Figure 11). Over-budget incidence is at every budget — the gate’s safety guarantee holds on real arrivals exactly as on synthetic ones — while admission and completion degrade as the budget tightens (: admitted, of trajectories completed; : full completion). The gate frontier again dominates the soft-penalty frontier, which can reach high completion only by breaching the budget on every seed.
11.4 What the synthetic IBP did and did not capture
The IBP pipeline tightened two claims and the real trace now corroborates them: the lever ordering (scope → tokens, routing → USD/carbon) and the gate’s over-budget incidence under binding budgets both reproduce on BurstGPT. Two claims weaken or shift. First, +full’s extra token saving in the IBP ( vs. ) came entirely from the breaker killing injected runaway SKUs; real arrivals have no such storms, so that increment vanishes, consistent with §10’s finding that real cumulative-prompt curvature is negative. Second, on real single-step traffic “scope” is simply context truncation: the Adapter Node caps each prompt at tokens ( the median), so the token reduction is largely the mechanical consequence of that cap, not an intrinsic property of the architecture. The headline percentage should therefore not be read as a free saving Green SARC delivers: it is a tunable policy knob whose realized magnitude scales with the cap, and whose quality cost (truncated context degrading task utility) is deliberately untracked here (§3). A different operator with a less aggressive cap would see a proportionally smaller number. What survives this caveat, and what we therefore present as the load-bearing claims of this section, are the two cap-independent results: the lever ordering reproduces on real arrivals, and the gate’s over-budget incidence is under binding budgets on real data exactly as on synthetic data. The IBP is thus best read as a controlled stress-test of the multi-step regime; BurstGPT confirms the per-request governance levers and the safety property on real distributions, but it is a single-step serving trace and does not exercise the multi-step snowball or breaker dynamics. A real multi-step agent trace is the natural next validation, which §11.6 supplies.
11.5 Carbon savings under real grid mixes
The carbon results so far use a stipulated intensity curve. We re-compute the BurstGPT carbon reductions under measured grid intensity for two zones with contrasting generation mixes.
Setup and data sources
We source hourly carbon-intensity measurements from the ElectricityMaps v3 API [16], which aggregates regulator feeds (ENTSO-E, CAISO OASIS, national TSOs) into a consistent gCO2eq/kWh series on a lifecycle (LCA) basis. We use two zones with materially different generation mixes: Italy (IT, ), gas- and import-dominated, and California (US-CAL-CISO, ), characterised by deep daytime solar troughs and gas-heavy evening peaks. For reference we also report the benchmark’s stipulated proxy (). The free API tier exposes only the most recent hours of history, so we use a single -hour measured window per zone; this captures diurnal contrast but not seasonal or weekly variation, which we note as a §15 limitation. Carbon for each step is , with the workload’s actions spread across that window. The fetched series is cached as committed CSV under paper/data/grid/, so this section reproduces from a clean clone without API access (fetch_grid.py --refresh re-fetches).
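The per-step carbon accounting can be sketched as follows. It assumes carbon is tokens times a per-token energy coefficient times the grid intensity of the hour the step lands in, and that steps are spread evenly over the window in order; `kwh_per_token` is a placeholder parameter, not the shipped default.

```python
def carbon_gco2e(step_tokens, intensity_g_per_kwh, kwh_per_token):
    """Total gCO2e for a sequence of steps spread across an hourly window.

    step_tokens: tokens consumed by each step, in execution order.
    intensity_g_per_kwh: hourly grid intensity series for the window.
    Step i is mapped to hour floor(i * hours / n), an assumed even spread.
    """
    n = len(step_tokens)
    hours = len(intensity_g_per_kwh)
    total = 0.0
    for i, tokens in enumerate(step_tokens):
        hour = i * hours // n
        total += tokens * kwh_per_token * intensity_g_per_kwh[hour]
    return total

def carbon_reduction(baseline_tokens, treated_tokens, intensity, kwh_per_token):
    """Relative carbon reduction of the treated run against the baseline."""
    base = carbon_gco2e(baseline_tokens, intensity, kwh_per_token)
    treated = carbon_gco2e(treated_tokens, intensity, kwh_per_token)
    return 1.0 - treated / base

# Toy usage: doubling a flat intensity leaves the percentage unchanged.
total = carbon_gco2e([1000, 1000], [100.0, 300.0], kwh_per_token=1e-3)
r_low = carbon_reduction([1000, 1000], [600, 600], [200.0, 200.0], 1e-3)
r_high = carbon_reduction([1000, 1000], [600, 600], [400.0, 400.0], 1e-3)
```

Under a flat series the intensity cancels in the ratio, so only the absolute grams saved change with the grid; with a strongly diurnal series the hour mapping starts to matter.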
Results
| Condition | stipulated () | IT () | US-CAISO () |
|---|---|---|---|
| +scope | |||
| +scope+route | |||
| +full |
Reading
What survives across grids is both the lever ordering and the percentage reduction: scope plus routing cuts carbon by – under all three intensities, because the reduction is a ratio of energy and enters as a common positive multiplier. The result that held under the synthetic proxy holds on real Italian and Californian grids, despite a difference in mean intensity.
What differs is the diurnal structure, and it matters more for one zone than the other. Italy’s intensity is comparatively flat ( intra-day swing, –), so the time at which an agent runs barely changes its carbon. California swings harder (, midday to evening): the same inference is roughly cleaner in the midday solar trough than at the evening peak. The absolute carbon saved therefore scales both with (a CAISO deployment at saves less than half the absolute carbon of an IT deployment for the same percentage) and, in CAISO, with when the traffic lands. Green SARC’s carbon-reduction percentage is robust to grid mix, but its real-world impact depends on where and when the compute runs; time-of-day carbon-aware routing is a Phase-2 opportunity this dataset would enable but the current code does not exploit (§15).
11.6 Multi-step real-trace ablation on SWE-rebench
BurstGPT is single-step; §11.4 flagged that it cannot exercise the multi-step snowball or breaker. We close that gap on real agent plans.
Dataset
We replay SWE-rebench OpenHands trajectories [17] (CC-BY-4.0; k real agent plans solving GitHub issues with Qwen3-Coder-480B, mapped to the frontier profile), sub-sampled to trajectories streamed from the GB parquet (token counts via tiktoken; no LLM call). These are genuinely multi-step: median depth assistant turns, maximum , with a median per-turn prompt of tokens — the context accretion the State Snowball describes.
Results
| Condition | Token reduction | USD reduction | Carbon reduction |
|---|---|---|---|
| +scope | |||
| +scope+route | |||
| +full |
The State-Snowball holds on real plans, and is steeper than the model.
Fitting each plan’s cumulative prompt against turn index to , every trajectory has (; Figure 14). The median exceeds the linear-accretion prediction (with the median per-turn growth): real agents accrete context faster than the constant-increment model of Assumption 1, because tool outputs and re-reads grow the prompt super-linearly. This is the strongest available confirmation that Theorem 1’s regime is real — and an honest correction that the closed form is a lower bound on real-plan curvature, not an exact match (the synthetic of §4 held only because the simulator was built to Assumption 1).
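To see where a linear-accretion benchmark for the quadratic coefficient comes from, the following worked step may help. The symbols are illustrative, since the paper's own notation did not survive extraction here: $p_0$ is the base prompt, $\delta$ the constant per-turn increment of Assumption 1, and $C(n)$ the cumulative prompt after $n$ turns.

```latex
% Constant-increment model: the prompt at turn t is p_t = p_0 + \delta t.
\begin{align*}
C(n) &= \sum_{t=0}^{n-1} \left(p_0 + \delta t\right)
      = p_0\, n + \delta\, \frac{n(n-1)}{2} \\
     &= \frac{\delta}{2}\, n^2 + \left(p_0 - \frac{\delta}{2}\right) n .
\end{align*}
% Under this model a quadratic fit C(n) \approx a n^2 + b n + c recovers
% a = \delta / 2. A fitted a above \delta / 2 means the per-turn increment
% itself grows with depth.
```

Under these assumptions, a fitted curvature larger than half the median per-turn growth is what "faster than linear accretion" amounts to.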
Breaker activations.
On these real plans the circuit breaker fires on of trajectories (the long-plan tail beyond the median depth), versus zero on BurstGPT. This is the decisive contrast: the breaker is a dormant safeguard on single-step serving traffic but a live, material lever on real multi-step agent plans, where it supplies the entire token-saving increment of +full over +scope+route.
What survives, what shifts.
Routing’s USD/carbon saving (/) and the lever ordering reproduce here as on BurstGPT and the IBP. Two things differ from BurstGPT. The breaker is no longer dormant ( vs ), vindicating its inclusion. And scope yields little here () only because the -median cap (k tokens) rarely binds on these plans; a tighter cap would truncate more but risks dropping the context the agent needs — the same policy/quality tradeoff named in §11, now with real multi-step stakes. Token reduction is modest precisely because we did not tune the cap aggressively; the load-bearing real-plan findings are the confirmed super-linear snowball and the live breaker.
11.7 Cost–utility frontier
The paragraph above names a tradeoff; because SWE-rebench records task outcomes, we can bound it. Each trajectory carries the benchmark’s real resolved flag — whether the agent’s patch passed the held-out tests — and of the plans resolved. We sweep the scope cap (, the median per-step prompt) and, for each cap, report tokens saved against an upper bound on quality harm: a worst-case resolution rate that assumes every resolved trajectory whose actually-used context the cap would have truncated flips to unresolved. This is deliberately pessimistic and, crucially, observational — truncation is simulated on logged trajectories, so the agent cannot react to the smaller context. The replay therefore bounds how much resolved work a cap puts at risk; it cannot establish that the work would in fact fail (a live agent might recover by re-fetching). It is a correlational upper bound, not a causal estimate.
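The observational replay can be sketched as below. It assumes each logged trajectory is reduced to its per-step prompt sizes plus the resolved flag, and that a trajectory counts as truncated when any step exceeds the cap; `cap_frontier` and its output keys are hypothetical names.

```python
def cap_frontier(trajectories, caps):
    """Tokens saved vs. worst-case resolution for each scope cap.

    trajectories: list of (per_step_prompt_tokens, resolved) pairs.
    Worst case: every resolved trajectory that the cap would have truncated
    is counted as unresolved.
    """
    total_tokens = sum(sum(prompts) for prompts, _ in trajectories)
    n = len(trajectories)
    rows = []
    for cap in caps:
        kept = sum(min(t, cap) for prompts, _ in trajectories for t in prompts)
        truncated = [any(t > cap for t in prompts) for prompts, _ in trajectories]
        still_resolved = sum(
            1 for (_, resolved), cut in zip(trajectories, truncated)
            if resolved and not cut
        )
        rows.append({
            "cap": cap,
            "tokens_saved": 1.0 - kept / total_tokens,
            "truncated_frac": sum(truncated) / n,
            "worst_case_resolution": still_resolved / n,
        })
    return rows

# Toy usage: a tight cap saves a third of the tokens but risks all resolved work.
toy = [([100, 200], True), ([100, 400], True), ([50, 50], False)]
tight, loose = cap_frontier(toy, caps=[150, 1000])
```

Because truncation is applied to logged prompts, the sketch shares the limitation stated above: it bounds the work put at risk and says nothing about how a live agent would adapt.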
The frontier is steep (Figure 15). The aggressive cap saves of tokens but truncates of plans, putting all resolved work at risk ( worst-case resolution); the cap saves while still truncating . Only the loose cap used in our ablation is close to benign — tokens for a worst-case resolution of (it touches of plans) — and the cap is essentially free ( truncated, worst-case, against the baseline). The honest reading: on real multi-step plans the token savings available from scope capping are bought against a real and possibly large truncation risk, and the cap that is safe saves little. The causal version — where the agent adapts to the cap — needs the live study (§15); this observational frontier is the upper bound that motivates it.
12 Sensitivity Analysis: the Knob
The gate’s single tunable, the risk level , trades admission throughput against realized overspend. We pre-train the estimator, then gate a fresh stream against a binding token budget over seeds, sweeping (Figure 16). Tightening from to drives the overspend rate among admitted actions from to , at a negligible throughput cost (admission throughout, since the binding budget — not — sets the admission ceiling). The practical reading: under a hard budget, a conservative buys an overspend guarantee almost for free.
12.1 Joint sensitivity over , scope cap, and routing fraction
The sweep above varies one knob. To check that the headline operating point is not a cherry-pick, we sweep all three: , scope cap the median prompt ( tokens), and routing fraction ( cells over seeds). Token/USD/carbon reductions use the benchmark’s native forecast noise; over-budget incidence is measured under a binding budget at the elevated noise of the stress above (Figures 17 and 18).
Three findings, two of them null and stated as such. (i) Token reduction is governed almost entirely by the scope cap: at , at , at , at . (ii) Routing fraction does not move the token reduction at all (it reallocates models, changing USD and carbon, not tokens — the heatmap is flat along the routing axis), and (iii) has no measurable effect on either axis in this regime: the four panels of Figure 18 are identical, and over-budget incidence never exceeds across all cells. The last is the empirical face of Theorem 2: the gate admits on its upper bound, so realized over-budget events are vanishingly rare regardless of the operating point; becomes a live throughput-vs-overspend knob only under the higher forecast uncertainty of real residuals (§10). The paper’s headline operating point (cap , routing , ) achieves the maximum token reduction at over-budget and lies on the Pareto frontier ( of cells are non-dominated): it is the most aggressive cap at zero safety cost, not an interior cherry-pick.
13 Why an Architectural Gate: the Soft-Penalty Baseline
The natural alternative to a hard gate is reward shaping: add a Lagrangian cost penalty to the objective and let the agent self-limit, admitting an action iff its value exceeds times its cost. We compare this soft penalty against the architectural gate on a stream with stochastic costs and a hard budget (Figure 19). Because the penalty is a per-action threshold blind to the remaining budget, no single both respects and matches the gate’s throughput: small admits almost everything and overspends (); large under-spends. Crucially, at the that matches in expectation, realized spend straddles and breaches it on of seeds. The architectural gate, admitting in arrival order while the calibrated forecast fits the live remaining budget, breaches on of seeds while filling of the budget. This is the cost-domain instance of SARC’s thesis that finite penalties cannot substitute for hard runtime constraints.
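The structural difference between the two admission rules can be shown on a toy stream. This is a deliberately simplified sketch: costs are bounded by a known maximum that stands in for the calibrated forecast upper bound, and the value and cost distributions are invented for illustration.

```python
import random

def soft_penalty_spend(actions, lam):
    """Admit iff value >= lam * cost. The rule never looks at the budget."""
    return sum(cost for value, cost in actions if value >= lam * cost)

def gate_spend(actions, budget, upper_bound):
    """Admit in arrival order while the upper bound fits the remaining budget."""
    remaining = budget
    for _value, cost in actions:
        if upper_bound <= remaining:
            remaining -= cost
    return budget - remaining

# Toy stream: 100 actions, each costing at most 100 tokens.
rng = random.Random(0)
actions = [(rng.uniform(0, 10), rng.uniform(50, 100)) for _ in range(100)]
budget = 2000.0

permissive = soft_penalty_spend(actions, lam=0.0)   # admits everything
strict = soft_penalty_spend(actions, lam=1.0)       # value < cost, admits nothing
gated = gate_spend(actions, budget, upper_bound=100.0)
```

The penalty's spend is a function of the weight and the stream only, so hitting a given budget requires tuning the weight to the stream in advance; the gate's spend is capped by construction as long as the upper bound holds.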
14 Threat Model and Adversarial Robustness
A runtime gate invites the question: what does an attacker who knows the gate do? We give a partial, honest answer with a toy study against the same gate code path the benchmark exercises.
14.1 Attacker model
The attacker is a prompt author with white-box knowledge of the estimator, the scope cap, and the budget, who observes gate decisions (and, via timing channels, possibly residuals). The attacker cannot modify src/ or mint tokens from nothing; the cost is always realized by the model provider. This is a cost-side adversary, distinct from the safety-side prompt-injection threat studied by [23, 24]: the goal is not to make the agent misbehave but to make it overspend while passing the gate.
14.2 Three attack classes
We construct three attacks ( seeds, instances each, paired-bootstrap CIs). Continuation inflation: a prompt whose realized completion is the benign law (“continue indefinitely” semantics). Scope-cap-aware padding: a prompt sized to exactly tokens, maximizing admitted work per call. Model-substitution gaming: a prompt that declares the efficient model while the cost is realized at frontier rates (a misreported model id).
14.3 Results
| Attack | Admission | Over-budget | Realized/declared | Gate failure mode |
|---|---|---|---|---|
| Continuation inflation | under-estimates | |||
| Scope-cap-aware padding | over-admits | |||
| Model-substitution gaming | under-estimates |
14.4 What survives
Two honest negatives. First, scope-cap-aware padding defeats the gate by staying inside its admission contract: it pads to just under the cap and extracts maximum legitimate work, so the gate admits it () with no over-budget event (, at the noise floor) and realized cost below the bound (). This is a fundamental limitation of bounded-prompt-only governance and is exactly what the Phase-2 trajectory estimator — which reasons about the whole plan, not one padded step — is meant to address. The gate alone is not sufficient here, and we do not claim otherwise.
Second, continuation inflation and model substitution both defeat the forecast (realized cost and the admitted bound), but they are caught post hoc by the Post-Action Auditor, which logs predicted-vs-actual and feeds the discrepancy back to the estimator and the Escalation Router. The architectural response to a forecast-defeating attack is audit-then-revoke at the Auditor, not admission-time rejection at the gate; this is the cost-domain instance of SARC’s predict–act–log–retrain loop. The gate bounds expected cost; it does not bound an adversary who lies about the future, and the four-site architecture — not the gate in isolation — is what makes the residual detectable.
15 Limitations, Threats to Validity, and Future Work
Threats to validity.
Synthetic headline workload. The ablation, binding-budget, and sensitivity results (§8, §9, §12) use a synthetic IBP pipeline with stipulated Gaussian noise; the State-Snowball and gate mechanics are real code, but the cost distribution is constructed. We mitigate this with the calibration study of §10 and the end-to-end real-arrival ablation of §11; a residual gap remains in that BurstGPT is single-operator Azure traffic, so cross-operator traces (Mooncake, Alibaba) are listed as future validation.
Marginal coverage. Theorem 2 is marginal and assumes exchangeability (Remark 1); Theorem 3 additionally assumes sub-Gaussian increments. Workload drift violates exchangeability; §10 shows ACI restores coverage, and as of v0.3.0 both split-conformal and ACI ship in the runtime gate (§6.1), though conditional (not merely marginal) coverage remains open.
Carbon proxy. is only as faithful as , which varies in availability and granularity across regions. The §11.5 real-grid study uses a -hour window per zone (ElectricityMaps free-tier constraint); a full year would expose seasonal variation in renewable share (e.g. CAISO winter low-solar) but is not expected to alter the grid-invariance result.
Energy model. The per-token energy is a stipulated linear coefficient (the shipped default is kWh/token, of order J/token, the same order of magnitude as benchmarked GPU inference energy for large models [19, 18]). Measured inference energy is not linear in tokens: it varies with batch size, hardware, and utilization, and grows super-linearly with context length because attention scales quadratically in sequence length [19]. The bias has a definite sign here: because the State Snowball makes context grow with loop depth, a linear proxy under-counts the marginal energy, and hence carbon, of the deepest, most expensive steps, so the carbon savings we report from bounding loop depth are, if anything, conservative. A measured energy table per model/hardware (rather than one coefficient) is the faithful fix and fits the existing CostModel interface without changing it.
Operator readiness. A single-process threading.Lock Budget is authoritative for one replica; an experimental Redis backend (one atomic Lua script per reserve/commit/release, with TTL reclamation of crashed-client reservations) provides a shared transactional counter for multi-replica deployments behind a load balancer, atomic against a single Redis but with no cross-region reconciliation and no fair-share reservations yet (Phase 2). Production deployments needing a durable ledger should await the Postgres backend.
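The reserve/commit/release pattern for a single replica can be sketched as below. This is an illustrative stand-in, not the library's Budget class: the class name `LocalBudget`, its method signatures, and the integer reservation ids are all assumptions.

```python
import threading

class LocalBudget:
    """Single-process budget: reserve a forecast, then commit the realized cost."""

    def __init__(self, total):
        self._lock = threading.Lock()
        self._total = total
        self._committed = 0.0
        self._reserved = {}  # reservation id -> reserved amount
        self._next_id = 0

    def reserve(self, amount):
        """Hold `amount` if it fits; return a reservation id, or None if not."""
        with self._lock:
            held = self._committed + sum(self._reserved.values())
            if held + amount > self._total:
                return None
            self._next_id += 1
            self._reserved[self._next_id] = amount
            return self._next_id

    def commit(self, reservation_id, realized):
        """Replace the reservation with the cost that was actually incurred."""
        with self._lock:
            self._reserved.pop(reservation_id)
            self._committed += realized

    def release(self, reservation_id):
        """Drop a reservation whose action never ran."""
        with self._lock:
            self._reserved.pop(reservation_id, None)

    def remaining(self):
        with self._lock:
            return self._total - self._committed - sum(self._reserved.values())

# Toy usage.
b = LocalBudget(100)
r1 = b.reserve(60)
second = b.reserve(50)   # does not fit alongside the first reservation
b.commit(r1, 40)         # realized cost came in under the reserved bound
r2 = b.reserve(50)
b.release(r2)
```

Holding the reserved upper bound until commit is what prevents two concurrent actions from both being admitted against the same headroom; the lock only protects one process, which is the limitation the paragraph above describes.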
Negative and null results.
The gate produces no token savings in the headline workload (§8); adding routing yields zero marginal token reduction (it trades models, saving USD/carbon only); the declared latency-headroom field is not enforced in Phase 1; and on real chat traffic the cumulative-prompt curvature is negative (§10), so the snowball is specific to naive orchestration, not universal. We report these rather than fold them into a single “governance helps” number.
Future work.
The leading item is to promote the split-conformal upper bound of §7 and the anytime-valid trajectory bound of Theorem 3 from paper-side analyses into the runtime gate, with adaptive conformal inference [11] under workload drift (§10 shows why this matters). Beyond that: a multi-step agent trace (e.g. SWE-bench / OpenHands trajectories) to exercise the breaker and State-Snowball dynamics that the single-step BurstGPT trace does not, and cross-operator traces (Mooncake, Alibaba) for the carbon and arrival-distribution generalization §15 flags; the Phase-2 full-trajectory estimator for plan-level rejection; time-of-day carbon-aware routing, which §11.5’s CAISO diurnal swing () shows is exploitable but the current router ignores; a multi-tenant distributed Budget with fair-share reservations; latency-headroom enforcement; and a production KAOS deployment of the gate as a sidecar. The single outstanding empirical action is the live governed-agent study (two arms — ungoverned vs full stack — over tasks on the Anthropic API): its harness ships at paper/scripts/run_live_study.py with an in-script USD ceiling and a probe-checkpoint spend sign-off, and is unit-tested offline against a mock transport, but the funded live run is deferred (it is the one result this paper does not yet report). These are roadmap directions, not claims.
16 Conclusion
Green SARC applies a correctness-governance architecture to the economics and ecology of inference, and develops its own theory with reproducible evidence. The State-Snowball theorem explains why unconstrained agents fail financially; its closed form is confirmed exactly on the synthetic workload and holds as a lower bound on real multi-step plans (§11.6). The predictive Pre-Action Gate generalizes the static accounting gate into a calibrated forecaster of which the rule is the zero-information limit, with a distribution-free budget-safety guarantee. And the soft-penalty comparison shows that reward shaping cannot supply that guarantee: it requires the architectural placement. The structural claim is that correctness, cost, and sustainability are instances of one problem: the runtime enforcement of declared constraints, where the only thing that changes between them is the predicate.
Appendix A Benchmark configuration
IBP defaults: SKUs, depth , base prompt , per-step increment , scope cap , max_tokens ; completion ; runaway fraction at depth; breaker ; of SKUs routed to the small model under +route; . Two model profiles (frontier/efficient) with distinct USD and energy rates; carbon under both fixed ( gCO2e/kWh) and a daily time-varying intensity curve.
Appendix B Estimator
Per key, an online least-squares fit of completion on prompt tokens via running sums (Welford-style) [25], predicting (completion clamped to ) with residual std supplied to the gate; below min_samples it defers to the zero-information cold-start forecast.
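A per-key estimator of this kind can be sketched as follows. The running-moment updates are the standard Welford-style recurrences; the clamp range [0, max_tokens], the `min_samples` default, and the class name are assumptions, since the corresponding values are not recoverable from the text.

```python
import math

class OnlineLinearEstimator:
    """Online least squares of completion tokens on prompt tokens."""

    def __init__(self, max_tokens, min_samples=5):
        self.max_tokens = max_tokens
        self.min_samples = min_samples
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # sum of squared deviations of x
        self.sxy = 0.0  # sum of co-deviations of x and y
        self.syy = 0.0  # sum of squared deviations of y

    def update(self, prompt_tokens, completion_tokens):
        self.n += 1
        dx = prompt_tokens - self.mean_x
        dy = completion_tokens - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.sxx += dx * (prompt_tokens - self.mean_x)
        self.sxy += dx * (completion_tokens - self.mean_y)
        self.syy += dy * (completion_tokens - self.mean_y)

    def predict(self, prompt_tokens):
        """Return (forecast, residual_std); cold start forecasts max_tokens."""
        if self.n < self.min_samples or self.sxx == 0.0:
            # Zero-information forecast: assume the full completion allowance.
            return float(self.max_tokens), 0.0
        slope = self.sxy / self.sxx
        intercept = self.mean_y - slope * self.mean_x
        forecast = min(max(intercept + slope * prompt_tokens, 0.0), self.max_tokens)
        rss = max(self.syy - slope * self.sxy, 0.0)
        std = math.sqrt(rss / (self.n - 2)) if self.n > 2 else 0.0
        return forecast, std

# Toy usage: a noiseless line y = 2x + 10.
est = OnlineLinearEstimator(max_tokens=1000)
cold = est.predict(50)
for x in range(10, 101, 10):
    est.update(x, 2 * x + 10)
warm_forecast, warm_std = est.predict(50)
```

Keeping only five running sums per key makes each update constant-time and avoids storing the observation history.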
Appendix C Conformal calibration and the anytime-valid bound
Split conformal (Theorem 2). Split the learned forecasts into calibration/test halves; one-sided scores ; the -th order statistic; report test coverage against nominal .
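The split-conformal procedure can be sketched as below. It uses one-sided scores (actual minus predicted) and the standard order-statistic index ceil((n + 1)(1 - alpha)); the paper's exact index did not survive extraction, so treat that choice as an assumption. The skewed synthetic residuals are for illustration only.

```python
import math
import random

def conformal_quantile(cal_scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest score; infinite if n is too small."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]

def upper_bound_coverage(cal_pairs, test_pairs, alpha):
    """Calibrate a one-sided margin on cal_pairs, report coverage on test_pairs.

    Each pair is (predicted, actual); the score is actual - predicted, so the
    upper bound for a new forecast is predicted + q.
    """
    q = conformal_quantile([actual - pred for pred, actual in cal_pairs], alpha)
    covered = sum(1 for pred, actual in test_pairs if actual <= pred + q)
    return q, covered / len(test_pairs)

# Skewed, non-Gaussian residuals around a constant forecast of 100 tokens.
rng = random.Random(0)

def draw(m):
    return [(100.0, 100.0 + rng.expovariate(1 / 20.0) - 20.0) for _ in range(m)]

q, coverage = upper_bound_coverage(draw(5000), draw(5000), alpha=0.1)
median_margin = conformal_quantile([1, 2, 3, 4, 5, 6, 7, 8, 9], alpha=0.5)
too_few = conformal_quantile([1, 2, 3, 4, 5], alpha=0.01)
```

No distributional form enters the calculation; the only requirement is that calibration and test scores are exchangeable, which is why the conversation-level split matters.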
Anytime-valid bound (Theorem 3). With centered residuals conditionally -sub-Gaussian, is a non-negative supermartingale for each (the increment’s conditional MGF is dominated by ). The mixture over a prior is again a non-negative supermartingale with ; the Gaussian integral evaluates in closed form to . Ville’s inequality [13] gives ; solving for yields the time-uniform boundary , which is the standard sub-Gaussian confidence sequence [12]. Optional stopping extends the bound from fixed to any stopping time .
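For readers who want the intermediate steps, here is the standard normal-mixture calculation written out. The notation is an assumption, because the inline formulas are not recoverable from this text: $S_t$ is the sum of the first $t$ centered residuals, $\sigma$ the sub-Gaussian parameter, and the prior on $\lambda$ is taken to be $\mathcal{N}(0, \rho^2)$. The paper's own parametrization may differ.

```latex
% Exponential supermartingale for each fixed \lambda:
\[
M_t(\lambda) = \exp\!\left(\lambda S_t - \tfrac{1}{2}\lambda^2 \sigma^2 t\right),
\qquad \mathbb{E}[M_0(\lambda)] = 1 .
\]
% Mixture over \lambda \sim \mathcal{N}(0, \rho^2); completing the square gives
\[
\bar{M}_t
= \int M_t(\lambda)\,
  \frac{1}{\sqrt{2\pi}\,\rho}\, e^{-\lambda^2 / (2\rho^2)}\, d\lambda
= \frac{1}{\sqrt{1 + \rho^2 \sigma^2 t}}\,
  \exp\!\left(\frac{\rho^2 S_t^2}{2\,(1 + \rho^2 \sigma^2 t)}\right).
\]
% Ville's inequality: P(\exists t : \bar{M}_t \ge 1/\alpha) \le \alpha.
% Solving \bar{M}_t < 1/\alpha for S_t gives the time-uniform boundary
\[
|S_t| < \sqrt{\,2\left(\sigma^2 t + \rho^{-2}\right)
\log\!\left(\frac{\sqrt{1 + \rho^2 \sigma^2 t}}{\alpha}\right)}
\quad \text{for all } t \ge 1, \text{ with probability at least } 1 - \alpha .
\]
```

The boundary grows like $\sqrt{t \log t}$, the price of holding simultaneously over all $t$ rather than at one fixed horizon.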
Appendix D Binding-budget experiment
, with the mean full-snowball cost over seeds; ; the estimator warm-started on a -step independent stream. The soft-penalty frontier sweeps at reference budget . Script: paper/scripts/run_binding_budget.py.
Appendix E Real-trace replay
Dataset anon8231489123/ShareGPT_Vicuna_unfiltered (Hugging Face, permissive research license), streamed; up to conversations / assistant-turn pairs, tokenized with tiktoken cl100k_base, capped to an context window; token counts only, no LLM calls. Split conformal vs. Gaussian- coverage on a conversation-level split (whole conversations assigned to calibration or test, never split across turns, to preserve exchangeability); shift experiment trains on short-conversation residuals and deploys on long — regimes also assigned by whole conversation (each conversation classified by its maximum depth) — comparing fixed-quantile vs. ACI (). The extracted table is cached to paper/data/sharegpt_subset.parquet (git-ignored); committed JSONs are the provenance. Script: paper/scripts/run_realtrace_replay.py.
Appendix F Reproduction
make paper-data regenerates every committed JSON asset (ablation, learning curve, binding-budget sweep, real-trace calibration and shift); make paper-figures regenerates the eleven figures and figure_stats.json; paper/scripts/check_stats.py verifies that every statistic in the text resolves to figure_stats.json; make paper compiles this PDF (CI: .github/workflows/paper.yml). The benchmark’s reference run is checked under make verify.
References
- [1] Besanson, G. (2026). SARC: A Governance-by-Architecture Framework for Agentic AI Systems: Compiling Regulatory Obligations into Runtime Constraints. Working paper, Universidad Torcuato Di Tella. arXiv:2605.07728. Code: https://github.com/besanson/sarc-governance.
- [2] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguía, L.-M., Rothchild, D., So, D., Texier, M., & Dean, J. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350.
- [3] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. In Proc. 57th ACL (pp. 3645–3650). doi:10.18653/v1/P19-1355.
- [4] Lacoste, A., Luccioni, A., Schmidt, V., & Dandres, T. (2019). Quantifying the Carbon Emissions of Machine Learning. arXiv:1910.09700.
- [5] FinOps Foundation (2023). FinOps Framework: Principles, Domains, and Capabilities for Cloud Financial Management. https://www.finops.org/framework/.
- [6] Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, Í., Maleki, S., & Bianchini, R. (2024). Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proc. ISCA.
- [7] Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., & Zhang, C. (2023). FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Proc. ICML.
- [8] Li, Z., Zheng, L., Zhong, Y., Liu, V., Sheng, Y., Jin, X., Huang, Y., Chen, Z., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In Proc. OSDI.
- [9] Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
- [10] Angelopoulos, A. N., & Bates, S. (2023). Conformal Prediction: A Gentle Introduction. Foundations and Trends in Machine Learning, 16(4), 494–591.
- [11] Gibbs, I., & Candès, E. (2021). Adaptive Conformal Inference Under Distribution Shift. In Advances in Neural Information Processing Systems (NeurIPS).
- [12] Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2), 1055–1080.
- [13] Ramdas, A., Grünwald, P., Vovk, V., & Shafer, G. (2023). Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38(4), 576–601.
- [14] ShareGPT community (2023). ShareGPT_Vicuna_unfiltered: a corpus of real ChatGPT/GPT-4 conversations. Hugging Face Hub: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
- [15] Wang, Y., Chen, Y., Li, Z., Kang, X., Fang, Y., Zhou, Y., Zheng, Y., Tang, Z., He, X., Guo, R., Wang, X., Wang, Q., Zhou, A. C., & Chu, X. (2025). BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. In Proc. 31st ACM SIGKDD V.2. arXiv:2401.17644. https://github.com/HPMLL/BurstGPT.
- [16] Electricity Maps ApS. (2024). Electricity Maps API: hourly carbon intensity by zone. https://api.electricitymap.org/v3. Free tier, attribution required. Data sources documented at https://github.com/electricitymaps/electricitymaps-contrib.
- [17] Nebius (2025). SWE-rebench OpenHands trajectories: 67k agent traces solving real GitHub issues with Qwen3-Coder-480B. CC-BY-4.0. https://huggingface.co/datasets/nebius/SWE-rebench-openhands-trajectories.
- [18] Luccioni, A. S., Jernite, Y., & Strubell, E. (2024). Power Hungry Processing: Watts Driving the Cost of AI Deployment? In Proc. 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT). arXiv:2311.16863.
- [19] Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., Bergeron, W., Kepner, J., Tiwari, D., & Gadepally, V. (2023). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. In Proc. IEEE High Performance Extreme Computing Conference (HPEC). arXiv:2310.03003.
- [20] Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176.
- [21] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665.
- [22] Anderson, J. P. (1972). Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, USAF Electronic Systems Division, Hanscom AFB. (Origin of the reference monitor concept.)
- [23] Carlini, N., Nasr, M., Choquette-Choo, C. A., et al. (2023). Are aligned neural networks adversarially aligned? In Advances in Neural Information Processing Systems (NeurIPS).
- [24] Greshake, K., Abdelnabi, S., Mishra, S., et al. (2023). Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proc. 16th ACM Workshop on Artificial Intelligence and Security (AISec).
- [25] Welford, B. P. (1962). Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3), 419–420.
- [26] Badanidiyuru, A., Kleinberg, R., & Slivkins, A. (2018). Bandits with Knapsacks. Journal of the ACM, 65(3), 1–55.
- [27] European Parliament and Council (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. (See Arts. 12, 13, 15.)
- [28] European Parliament and Council (2022). Directive (EU) 2022/2464 amending Regulation (EU) No 537/2014 and Directives 2004/109/EC, 2006/43/EC and 2013/34/EU, as regards corporate sustainability reporting (CSRD). Official Journal of the European Union.
