
URL: https://arxiv.org/html/2606.15954v1



License: CC BY 4.0
arXiv:2606.15954v1 [cs.SE] 14 Jun 2026

Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems

Gaston Besanson, Universidad Torcuato Di Tella
(Preprint, June 2026)
Abstract

Agentic AI systems act through tools and sub-agents, yet the controls meant to bound their financial and environmental cost still sit on dashboards evaluated beside or after execution. Green SARC applies the SARC governance-by-architecture framework (four enforcement sites in the agent loop) to FinOps and GreenOps, contributing the theory of what to enforce and how to predict it. We report four policy-independent results. (i) The unconstrained “State Snowball” is quadratic in loop depth (Theorem 1); on real multi-step plans (SWE-rebench) it holds on , with median curvature exceeding the linear-accretion prediction: real plans accrete faster than the model (§11.6). (ii) On real residuals the Normal- gate under-covers ( at nominal ); split-conformal calibration holds (; Theorem 2). (iii) A soft Lagrangian penalty tuned to the budget in expectation breaches it on of seeds; the architectural gate breaches . (iv) Under binding budgets the gate’s over-budget incidence is on synthetic and real (BurstGPT) arrivals. End-to-end token/USD/carbon savings (–) are real but policy-dependent in magnitude, set by a scope-cap knob, not by gate rejections. The library is open-source, dependency-free, and ships a regeneration script for every cited number.

Keywords: agentic AI, governance-by-architecture, predictive FinOps, GreenOps, token economics, conformal prediction, runtime constraints, SARC.

1  Introduction

The cost center of artificial intelligence has shifted from training, whose resource envelope is fixed at design time, to the inference trajectory: the runtime-determined sequence of model calls, tool invocations, and conditional retries an agent emits while pursuing a goal. A classical inference call has a bounded, predictable cost. An agentic workflow has neither: the same task, executed twice, can differ by an order of magnitude in token consumption. Both the API bill and the energy draw are therefore stochastic quantities governed by the execution trace, not the specification.

Two instruments are commonly deployed against this volatility. Post-hoc auditing reconciles spend after the billing period closes. Policy-as-code encodes budget rules in a layer evaluated alongside, but not inside, the agent loop. Both inherit the defect SARC identified for correctness obligations: they evaluate constraints after, or beside, the execution they are meant to bound. A budget breach detected at month-end cannot un-spend the tokens; a carbon overage logged to a dashboard cannot un-emit the carbon.

Relationship to SARC.

SARC [1] is a governance-by-architecture framework that treats constraints as first-class specification objects and compiles them into four enforcement sites: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. Green SARC is an application of that architecture — we reuse the four sites unchanged — but carries its own theory, orthogonal to SARC’s correctness results. SARC governs whether the system is right; Green SARC governs what the system costs. The two are independent axes that happen to share enforcement sites.

Contributions.
  1. State-Snowball theorem, formal and empirical (§4). Naive context accretion yields quadratic cumulative prompt cost (Theorem 1); the synthetic fit recovers the closed-form coefficient exactly, and on real ShareGPT traffic the cumulative-prompt curvature is negative: the snowball is an artifact of naive orchestration, not of chat itself (§4, §10).

  2. Predictive Pre-Action Gate with calibration and an anytime-valid safety bound (§5, §7). We generalize the gate to a learned forecast (of which rule-based accounting is the zero-information limit), give split-conformal marginal safety (Theorem 2), and an anytime-valid trajectory over-spend bound via Ville’s inequality (Theorem 3).

  3. Binding-budget gate evaluation on the Pareto frontier (§9). Across a budget grid the gate’s empirical over-budget incidence stays at or below while completing more work at zero overspend, dominating the soft-penalty frontier.

  4. Real-trace coverage validation on ShareGPT (§10). On real, non-Gaussian residuals the Normal- gate under-covers at tight while split conformal holds nominal coverage; adaptive conformal restores coverage under distribution shift.

  5. Ablation with paired-bootstrap CIs (§8). A four-condition ablation decomposes the saving by lever (scope, routing, breaker), each with a CI.

  6. Real-arrival ablation on a public production trace (§11). The four-condition ablation re-run on the BurstGPT trace of real Azure OpenAI traffic reproduces the synthetic savings ordering under real burstiness and prompt/response distributions, with paired-bootstrap CIs.

  7. Open-source library and reproducible benchmark (§6). A dependency-free implementation with a regression contract enforced in CI via make verify.

We also relate the gate to Bandits-with-Knapsacks as a feasibility oracle (Proposition 3). The decoupling of cost from correctness governance (§3) frames the artifact’s scope.

In one line: SARC gives us where to enforce; we contribute what to enforce, why it decouples from safety, how to predict it, and the evidence that the prediction is calibrated and the architecture is necessary.

2  Background and Related Work

SARC.

A SARC specification declares, per constraint, its source, class, predicate, verification point, response protocol, and operating point, and compiles these into the four enforcement sites named above. It formalizes the minimal invariants for specification–trace correspondence and argues that finite reward penalties do not in general substitute for hard runtime constraints — a claim we make quantitative for the cost domain in §13.

FinOps and GreenOps.

FinOps brings financial accountability to variable cloud spend; GreenOps extends the discipline to carbon and energy. The financial and environmental cost of large-scale AI compute has been quantified for the training regime [3, 2, 4]; the inference regime studied here shifts this cost into a runtime-variable, per-trajectory quantity. Both disciplines are predominantly practiced as observe-and-reconcile loops [5], a cadence adequate only when the consumption unit is predictable. Agentic inference violates that premise.

Efficient inference systems.

A large systems literature reduces the unit cost of inference — phase-split serving [6], high-throughput single-GPU offloading [7], and statistical multiplexing across models [8]. These optimize how a fixed set of calls is served; Green SARC is complementary and orthogonal: it governs which calls an agent is permitted to make, given a budget, before they are issued. The two compose cleanly. If a serving optimization reduces the effective per-token cost from to for some efficiency , then a fixed token budget admits times as much work, since the gate’s feasibility test is equivalent to ; the carbon ceiling scales identically through the proportional energy saving. Cheaper serving thus relaxes the gate’s effective budget by a known factor rather than changing its mechanism.

Cost-aware routing and cascades.

A complementary line reduces cost by choosing which model answers a query: FrugalGPT learns an LLM cascade that escalates to a stronger model only when a cheaper one is judged inadequate [20], and RouteLLM trains a binary router between a strong and a weak model from preference data [21]. These optimize the per-query model choice to maximize quality at lower expected cost, but they offer no hard guarantee: a router tuned to spend less in expectation can still overrun any fixed budget on an adversarial or heavy-tailed query stream, exactly the soft-constraint failure we quantify in §13. Green SARC is orthogonal and composable: it is the enforcement contract a router runs inside, supplying the per-action feasibility test that turns an expected-cost heuristic into a budget-safe one (the CBwK feasibility-oracle framing of Proposition 3). We concede the overlap honestly: Green SARC’s energy-aware routing lever (§8) is mechanistically the same idea as FrugalGPT’s cascade — down-route when a cheaper model suffices — and our contribution there is not the router but the gate that bounds it.

Conformal prediction.

Our safety guarantee rests on split (inductive) conformal prediction, which converts any point predictor into a set/interval with distribution-free, finite-sample marginal coverage under exchangeability [9, 10]. Where residuals are non-exchangeable (distribution shift), adaptive conformal inference restores coverage online [11]; we flag this as the path to robustness on real traces (§15).

Constrained decision-making.

Admitting actions under a depletable budget is formally a Bandits-with-Knapsacks / constrained-MDP problem [26]. Green SARC does not solve the optimal-policy problem; it provides the enforcement primitive (a calibrated per-action feasibility test) that any such policy needs at runtime, and contrasts it with the soft-penalty (Lagrangian) relaxation in §13. We make the relationship precise in Proposition 3: the gate composes with any sublinear-regret CBwK policy as a feasibility oracle without changing its regret order.

Architectural lineage.

The four-site, enforce-in-the-loop design has ancestry well beyond the author’s own SARC framework. The reference monitor of Anderson’s 1972 security study — a mediation mechanism that must be invoked on every access, be tamper-proof, and be small enough to verify [22] — is the direct conceptual ancestor of the Pre-Action Gate: a non-bypassable check interposed before each consequential operation. Admission control in networking and queueing systems (admit a flow only if its reserved rate fits the remaining capacity) is the same two-phase reserve-then-commit primitive our Budget implements. And runtime verification — synthesizing monitors that check an execution against a specification as it runs — is the correctness-domain analogue of the Action-Time Monitor and Post-Action Auditor. Green SARC’s novelty is not the enforce-in-the-loop stance itself but its application to predicted cost and carbon, with a calibrated forecast standing in for the boolean access check.

What Green SARC is not.

Pure observability tools (LangSmith, Helicone, raw OpenTelemetry) give post-hoc cost without enforcement: they tell you what was spent, after it was spent. API-level rate limits (provider tier limits, sidecar throttling on request counts) enforce request counts, not predicted cost or carbon, so a single expensive call passes unchecked. In-agent budget tracking (framework callbacks) is in-process and bookkeeping-only, with no cross-process attribution and no enforcement contract spanning the four sites. Green SARC differs by being a four-site governance contract whose gate predicts cost before the action fires; the closest comparable systems govern only post hoc, or only on request counts.

Regulatory context.

Enterprise deployments increasingly must keep auditable records of automated decisions and account for system accuracy and the energy footprint of AI. The EU AI Act mandates automatic record-keeping/logging (Art. 12), transparency and information provision (Art. 13), and accuracy/robustness with documented metrics (Art. 15) [27]; the Corporate Sustainability Reporting Directive (CSRD) extends mandatory sustainability disclosure to in-scope undertakings [28]. Green SARC’s Post-Action Auditor produces, as a byproduct of execution, the attribution-preserving, predicted-vs-actual trace these regimes require, extended to per-trajectory token yield and a carbon proxy.

3  Decoupling FinOps Governance from Correctness Governance

We state the decoupling explicitly because it defines the scope of the artifact.

Proposition 1 (Independence of axes).

Let a governance layer be characterized by the predicate class it enforces. SARC’s correctness layer enforces predicates over action validity (is this action safe and permitted?). The Green SARC layer enforces predicates over resource consumption (does this action fit the cost and carbon budget?). The two predicate classes share enforcement sites but neither implies the other: a perfectly safe agent can be ruinously expensive, and a perfectly cheap agent can be unsafe.

Consequences for the artifact:

  • Green SARC is deployable with no safety regime present. Its value derives from the cloud bill, which every operator incurs.

  • It tracks cost and carbon only; correctness/accuracy is deliberately out of scope and is not logged as a governed quantity. The quality floor (§5) is the caller’s concern.

  • It composes with SARC where both are wanted (the sites are shared) but does not depend on it. The reference implementation has no dependency on SARC.

4  The State-Snowball Cost Theorem

Definition 1 (State Snowball).

A multi-agent loop exhibits the State Snowball when each step re-submits the full accreted context, so the per-step prompt grows monotonically with step index.

Assumption 1 (Linear accretion).

The prompt at step $k$ (zero-indexed) is $p_k = p_0 + k\delta$ tokens: a fixed base $p_0$ plus $\delta$ tokens appended per hop. This is the regime in which the unconstrained loop is studied; sub-linear summarization is exactly the mitigation we analyze.

Theorem 1 (Quadratic cost of the unconstrained loop).

Under Assumption 1, the cumulative prompt-token cost over $n$ steps is

$$C(n) \;=\; \sum_{k=0}^{n-1} \bigl(p_0 + k\delta\bigr) \;=\; n\,p_0 + \frac{\delta}{2}\,n(n-1) \;=\; \Theta(n^2) \qquad (1)$$

with leading coefficient $\delta/2$.

Proof.

Direct summation of the arithmetic series $\sum_{k=0}^{n-1} k = n(n-1)/2$. The dominant term is $\tfrac{\delta}{2}\,n^2$, with leading coefficient $\delta/2$. ∎

Empirical confirmation (synthetic).

Figure 1 plots the cumulative prompt cost of the benchmark’s baseline against loop depth, with , . A second-order least-squares fit recovers , identical to the closed-form ; the residual is numerically zero. This verifies that the simulator faithfully realizes Assumption 1 (the recovered coefficient is a property of the simulator’s construction, not independent evidence that real workloads accrete linearly). Bounding the per-hop increment with an Adapter Node (scope cap tokens) collapses the curve to linear: at depth the scoped cost is lower than the snowball. The Action-Time circuit breaker caps directly, bounding the other factor.
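The arithmetic behind Theorem 1 and the scope cap can be checked in a few lines of dependency-free Python. This is a minimal sketch, not the benchmark's simulator: the function names, the parameter values (a 1,000-token base and 200 tokens per hop), and the way the cap is applied (the accreted context is truncated at `cap` tokens) are illustrative assumptions.

```python
def cumulative_prompt_cost(n_steps, base, delta, cap=None):
    """Sum the per-step prompt sizes base + k*delta for k = 0..n_steps-1.

    If cap is given, the accreted part k*delta is truncated at cap tokens
    (an assumed model of Adapter-Node scoping, for illustration only).
    """
    total = 0
    for k in range(n_steps):
        accreted = k * delta
        if cap is not None:
            accreted = min(accreted, cap)
        total += base + accreted
    return total


def closed_form_cost(n_steps, base, delta):
    """Arithmetic-series closed form: n*base + delta*n*(n-1)/2."""
    return n_steps * base + delta * n_steps * (n_steps - 1) // 2


def second_differences(costs):
    """Discrete curvature C(n+2) - 2*C(n+1) + C(n); constant for a quadratic."""
    return [costs[i + 2] - 2 * costs[i + 1] + costs[i]
            for i in range(len(costs) - 2)]


if __name__ == "__main__":
    base, delta = 1_000, 200  # hypothetical values, not the paper's
    depths = range(1, 21)
    snowball = [cumulative_prompt_cost(n, base, delta) for n in depths]
    scoped = [cumulative_prompt_cost(n, base, delta, cap=delta) for n in depths]
    # A constant second difference equal to delta corresponds to a
    # leading (quadratic) coefficient of delta / 2.
    print("snowball curvature:", set(second_differences(snowball)))
    print("scoped curvature:  ", set(second_differences(scoped)))
    print("cost at depth 20:  ", snowball[-1], "vs", scoped[-1])
```

The second difference of the unconstrained curve is constant and equal to the per-hop increment, which is the discrete counterpart of the fitted leading coefficient; under the assumed cap it drops to zero, so the scoped curve is linear.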

Real chat traffic.

Whether real multi-turn traffic accretes quadratically is a separate, empirical question. On the ShareGPT replay of §10 ( conversations, up to turns) we fit the cumulative billed prompt tokens against turn depth to a quadratic. The leading coefficient is with paired-bootstrap CI — significantly negative. Real conversations are concave in depth, not convex: humans and well-behaved assistants do not blindly re-submit the full transcript every turn. The snowball is therefore a failure mode of naive multi-agent orchestration (full-context re-submission), not an intrinsic property of conversation — which is precisely why the Adapter-Node scoping that prevents it is the highest-leverage lever in the ablation (§8).

Figure 1: State Snowball: the baseline cumulative prompt cost is quadratic in loop depth and its quadratic fit recovers Theorem 1’s leading coefficient exactly; bounded scope (Adapter Node) is linear.

5  The Predictive Pre-Action Gate

This is the paper’s central construct. In SARC the Pre-Action Gate evaluates a deterministic predicate. We generalize it to a gate that decides on a learned, calibrated forecast of the resource cost of a proposed action. Table 1 fixes notation.

Symbol Meaning
proposed action; its context
live remaining token budget
marginal carbon intensity (gCO2e/kWh), region , time ; real grid data used in §11.5 [16], stipulated by default
latency/SLA headroom (declared in the state; not enforced in Phase 1)
forecast token cost; forecast carbon
realized token cost; realized carbon
estimator residual standard deviation (per key)
gate risk level; operating-point confidence is
split-conformal quantile of calibration residuals
carbon ceiling; carbon already spent on trajectory
Table 1: Notation.

5.1  Augmented state

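The state ingests financial and environmental telemetry: the live remaining token budget, the marginal carbon intensity, and the latency/SLA headroom (Table 1). Of these, the budget and the carbon intensity are enforced at the gate; the latency headroom is declared for completeness but is not enforced in the Phase-1 implementation (the field exists in the state object and is unused), a divergence we record in §15.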

5.2  The estimator

Let be a proposed action in context . A learned estimator predicts expected token cost and carbon before the action fires. The implementation regresses completion tokens on prompt tokens online per key using the Welford-style sufficient statistics [25], exposing the residual standard deviation . The gate admits iff the forecast fits the remaining budget at confidence and the carbon ceiling:

(2)

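Operationally the first test is a one-sided upper bound on the token cost. The implementation forms this bound as the forecast plus the normal quantile $z$ times the residual standard deviation (the “Normal-$z$ gate”); §7 replaces the normal-quantile margin with a distribution-free conformal margin. Rule-based accounting is the special case in which the forecast is a constant threshold independent of the context: the zero-information gate.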
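The estimator and admission test described in this subsection can be sketched in standard-library Python. This is an illustrative sketch, not the code in gate.py or estimator.py. The class and function names are hypothetical, and the sketch assumes that the cost forecast is the prompt plus the predicted completion and that the cold-start fallback is the prompt plus the full `max_tokens`.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class OnlineLinearEstimator:
    """Online regression of completion tokens on prompt tokens.

    Keeps Welford-style running means and co-moments, so no history is stored.
    """
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0
    syy: float = 0.0

    def update(self, prompt_tokens, completion_tokens):
        self.n += 1
        dx = prompt_tokens - self.mean_x       # deviation from the OLD mean
        dy = completion_tokens - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each co-moment adds (old deviation) * (deviation from the NEW mean).
        self.sxx += dx * (prompt_tokens - self.mean_x)
        self.syy += dy * (completion_tokens - self.mean_y)
        self.sxy += dx * (completion_tokens - self.mean_y)

    def predict(self, prompt_tokens):
        """Predicted completion tokens, or None before a slope is identifiable."""
        if self.n < 2 or self.sxx == 0:
            return None
        slope = self.sxy / self.sxx
        return self.mean_y + slope * (prompt_tokens - self.mean_x)

    def residual_std(self):
        """Residual standard deviation, or None with fewer than 3 points."""
        if self.n < 3 or self.sxx == 0:
            return None
        sse = max(self.syy - self.sxy ** 2 / self.sxx, 0.0)
        return math.sqrt(sse / (self.n - 2))


def admit(est, prompt_tokens, max_tokens, remaining_budget, alpha,
          carbon_forecast, carbon_spent, carbon_ceiling):
    """Admit iff the one-sided token bound and the carbon ceiling both hold."""
    predicted = est.predict(prompt_tokens)
    sigma = est.residual_std()
    if predicted is None or sigma is None:
        # Cold start: fall back to the worst case (assumed: full max_tokens).
        upper = prompt_tokens + max_tokens
    else:
        z = NormalDist().inv_cdf(1.0 - alpha)  # one-sided normal quantile
        upper = prompt_tokens + predicted + z * sigma
    fits_budget = upper <= remaining_budget
    fits_carbon = carbon_spent + carbon_forecast <= carbon_ceiling
    return fits_budget and fits_carbon
```

Feeding the estimator the realized completion size after each admitted action lets the bound tighten from the worst-case fallback toward the forecast plus margin. Replacing `z * sigma` with a conformal margin is the change §7 analyzes.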

5.3  The closed learning loop

The estimator is trained on the Post-Action Auditor’s own output, closing a loop:

(3)

At cold start is weak and the gate behaves conservatively (worst-case forecast); it sharpens as the audit log accumulates. §8 shows the forecast MAE collapsing from a cold-start value of 4,000 tokens to the irreducible noise floor within 20 observations.

5.4  Release sequencing: step then trajectory

The estimator is built in two phases. Phase 1, per-action: predicts the next step only — simple to deploy, and it generates the labeled actuals needed for Phase 2. Phase 2, full-trajectory: a planner-level estimator predicts the cost of an entire plan before the agent starts, enabling rejection of expensive plans, not merely expensive steps. Phase 1 is the data engine for Phase 2; only Phase 1 is implemented here (Phase 2 is an interface stub).

5.5  Budget safety

We give two safety statements: a pointwise one assuming calibration, and a distribution-free one (proved in §7).

Proposition 2 (Pointwise budget safety).

If the estimator is calibrated so that pointwise, then an admitted action breaches the budget margin it was admitted against with probability at most ; residual breaches scale with the gate risk level , not with the opportunity for breach.

Theorem 2 (Predictive Gate Safety, split-conformal).

Let the conformal margin be calibrated on an exchangeable set of residuals (Assumption 2, §7) and let the gate admit only when . Then for a fresh exchangeable action the per-action budget-breach probability satisfies , with no distributional assumption on the residuals. Over a trajectory of gated admits, the expected number of breaches is at most .

The proof is in §7. Theorem 2 bounds the breach probability of each individual admitted action; for the whole trajectory, Theorem 3 (§7) gives an anytime-valid probabilistic tail bound on the cumulative over-spend across any trajectory prefix, uniformly over stopping times. Together these mirror, in the resource domain, SARC’s claim that residual hard violations scale with enforcement-stack error rather than with the opportunity for violation.

5.6  The Sustainable Token Yield reward

Within the gated action space the agent optimizes a reward that penalizes brute-force inference where deterministic computation suffices:

(4)

with task utility , FinOps weight , GreenOps weight . As SARC argues, this finite penalty shapes behavior within the hard budget/carbon constraints; it does not replace them. §13 makes this quantitative: a soft penalty tuned to the budget in expectation still breaches it most of the time.

5.7  The constrained optimization

(5)
s.t. (budget; Pre-Action Gate)
(loop bound; Action-Time Monitor)
(ESG ceiling; Post-Action Auditor)
(quality floor; caller-owned)

Budget and carbon are hard constraints enforced at their sites, not merely penalized in .

Proposition 3 (Gate as a CBwK feasibility oracle).

Let be any Bandits-with-Knapsacks policy achieving regret with per-action costs bounded in . Composing with the split-conformal Pre-Action Gate — restrict ’s action set at each round to and let select within that set — yields a policy whose regret remains and which additionally satisfies the anytime-valid budget bound of Theorem 3 with probability at least .

Proof sketch.

The gate is a per-round feasibility filter applied to the action set, so is run on a (possibly smaller) feasible set; its per-round regret against the best feasible arm is unchanged, preserving the order. The added budget guarantee holds because only ever plays arms admitted by the gate, to which Theorem 3 applies verbatim; a union bound over the regret event and the conformal coverage event () gives the composite guarantee. The gate supplies feasibility; the bandit supplies optimality. ∎
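The composition in Proposition 3 amounts to filtering the action set before the policy chooses. The sketch below shows only that wiring; `forecast`, `margin`, and `policy` are hypothetical stand-ins for a cost estimator, a calibrated margin, and any bandit policy, and none of the names come from the library.

```python
def feasible_set(actions, forecast, margin, remaining_budget):
    """Keep only actions whose forecast plus margin fits the remaining budget."""
    return [a for a in actions if forecast(a) + margin <= remaining_budget]


def gated_select(actions, forecast, margin, remaining_budget, policy):
    """Let the policy choose, but only among gate-admitted actions.

    Returns None when nothing is feasible, which a caller could treat as the
    signal to escalate instead of spending.
    """
    feasible = feasible_set(actions, forecast, margin, remaining_budget)
    if not feasible:
        return None
    return policy(feasible)


if __name__ == "__main__":
    # Hypothetical per-action cost forecasts and utilities.
    cost = {"small": 100, "medium": 500, "large": 2000}
    utility = {"small": 1.0, "medium": 3.0, "large": 5.0}

    def greedy(candidates):
        # Stand-in policy: pick the highest-utility feasible action.
        return max(candidates, key=utility.get)

    print(gated_select(list(cost), cost.get, 50, 600, greedy))
    print(gated_select(list(cost), cost.get, 50, 100, greedy))
```

The policy never sees an infeasible arm, so its own selection logic is unchanged; the budget guarantee comes entirely from the filter.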

5.8  Calibration matters

Proposition 2 is only as good as the calibration premise; a systematically optimistic admits overspend. Theorem 2 discharges the premise without a distributional assumption, at the cost of a held-out calibration set. §7 states the assumptions, proves the bound, and validates coverage empirically.

6  Mapping the Enforcement Sites and Implementation

Table 2 maps the four SARC sites to Green SARC predicates and to the modules that implement them. The reference implementation is a standalone, dependency-free Python library; it composes with SARC via shared sites rather than importing it.

Site | Green SARC predicate / role | Module
Pre-Action Gate | Predictive cost/carbon forecast; admit iff at and carbon fits. | gate.py, estimator.py
Action-Time Monitor | Circuit breaker on loop count / marginal cost; kills runaway retry/re-plan loops. | monitor.py
Post-Action Auditor | Logs predicted-vs-actual cost/carbon per action: ESG record and estimator training signal. | auditor.py
Escalation Router | Routes budget-/carbon-exhausted tasks to human review or a deterministic fallback. | escalation.py
State scoping | Adapter Node bounds the per-hop increment (§4). | scoping.py
Table 2: The four SARC sites under Green SARC predicates, plus state scoping. Only the Pre-Action Gate changes character: from rule to calibrated forecast.

The runtime gate (green_sarc.gate) defaults to the Normal- upper bound of §5; as of v0.3.0 the split-conformal upper bound of §7 is also available at runtime, opt-in via calibrator=... (§6.1).

6.1  Conformal calibration in the runtime gate

The split-conformal bound of §7 and the adaptive variant of §10 are no longer only paper-side analyses: v0.3.0 ships them as runtime strategies in green_sarc.calibrator (SplitConformal, ACIConformal, behind a Calibrator protocol). The PreActionGate constructor gains an optional calibrator=... argument; supplying it replaces the Normal- token bound with the conformal one, and omitting it preserves the prior behaviour exactly (all pre-existing tests pass unchanged, and make verify holds).

We validate the runtime path against the paper-side analysis: re-running the §10 ShareGPT study with the runtime SplitConformal calibrator (--use-runtime-conformal) reproduces the held-out coverage of the offline analysis to within percentage points across . Engineering contract: the calibrator is fit once from a residual log (offline) and, for ACIConformal, updated online from realized-vs-predicted cost at the Post-Action Auditor; the protocol lives in green_sarc.calibrator and the public surface follows semver within . The difference between “we prove” (§7) and “we ship” is now a one-argument opt-in.

Capabilities: today vs. roadmap.

Table 3 states explicitly which Green SARC capabilities ship today (Phase 1) and which are roadmap items. The paper’s empirical results in §8 and §9 use only Phase 1 features; §10 uses real data but Phase 1 code.

Capability | Phase 1 (today) | Phase 2 (roadmap)
Pre-Action Gate (step) | Normal-; conformal opt-in via calibrator= | ACI as default + conditional coverage
Predictive forecast | OLS per | trajectory estimator
Action-Time Monitor | Loop / marginal / total cost breaker | latency-headroom enforcement
Post-Action Auditor | JSONL / SQLite | Parquet, multi-tenant attribution
Budget | Single-process threading.Lock; distributed Redis backend (experimental) | Postgres durable ledger + fair-share reservations
Escalation Router | Deterministic + log-only handlers | plan-level rejection on trajectory forecast
Adapters | MCP, PAIS sidecar, OTel SpanProcessor | cross-process OTLP receiver; MCP transport auth
Audit schema | plan_id, session_id, parent_action_id | Phase-2 trajectory schema (typed events)
Table 3: Phase 1 (shipping) vs. Phase 2 (roadmap) capabilities.

Reproducibility contract. The committed benchmarks/reference_summary.json (20 seeds, 4 conditions × 4 metrics) is the regression contract: a pull request that drifts these numbers by more than per cell, or by more than absolute percentage points on the +full token reduction, fails CI via make verify. This paper is the companion to release v0.4.0 of besanson/Greensarc; the tag pins the exact source tree that produced every number cited above.

API stability. The public surfaces in green_sarc.governor, .state, .gate, .auditor, .escalation, and the three adapters (mcp, pais_sidecar, otel) follow semver within ; the examples/ and benchmarks/ paths and the audit-record schema may evolve.

Tests and CI. The library passes 152 unit and integration tests on Python 3.11 and 3.12 (an additional SARC-composition suite is skipped unless the optional sarc extra is installed), including concurrency race tests, runtime conformal-coverage tests, an ACI-restoration test, sidecar SSE streaming tests, a distributed-budget race test, Prometheus-metrics tests, live-feed loader tests, and end-to-end ablation reproduction. CI runs ruff, mypy on src/ and benchmarks/, pytest -q, and make verify on every push; the release workflow gates publication on the test matrix.

Gate overhead. A microbenchmark (benchmarks/gate_overhead.py, warm decisions, single process) puts the Pre-Action Gate’s cost at p50 s / p99 s per decision on the default Normal- path (M decisions/s) — negligible beside any model call. The split-conformal path is p50 s / p99 s, dominated by recomputing the empirical residual quantile over the calibration set on each decision (a cost a production deployment would precompute and amortize); even unamortized it stays under  ms. Latency is hardware-dependent; the committed figure is from the reference runner.

6.2  Deploying Green SARC

Three integration patterns, in increasing order of loose coupling:

  1. In-process: GreenGovernor.with_defaults(...) wraps the agent’s executor directly. Step-level safety, single replica; best for single-agent CLI tools and notebooks (green_sarc.governor).

  2. PAIS sidecar: ASGI middleware on /v1/chat/completions; returns HTTP 429 on reject, with SSE-aware passthrough for streaming. Best for agents fronted by an OpenAI-compatible API (green_sarc.adapters.pais_sidecar).

  3. KAOS-managed MCP advisory + OTel observe: register Green SARC as an MCP server for advisory gate/audit tools, and consume actuals cross-process via the OTel SpanProcessor. Loosest coupling, advisory-only safety (green_sarc.adapters.mcp, green_sarc.adapters.otel).

7  Split-Conformal Calibration of the Gate

We now discharge the calibration premise of Proposition 2 and prove Theorem 2.

Assumption 2 (Exchangeability).

The estimator is fixed (trained on data disjoint from the calibration set). The one-sided nonconformity scores on the calibration set and the score of a fresh action are exchangeable.

Define as the -th smallest of (and if that index exceeds ). The calibrated gate bound is .

Proof of Theorem 2.

By Assumption 2 the scores are exchangeable, so the rank of among them is uniform on (ties broken at random). Then

Since , the event is exactly . If the gate admits only when , then , giving . The trajectory bound follows by linearity of expectation over gated admits. This is the standard inductive-conformal quantile lemma [9, 10], specialized to a one-sided cost score. ∎
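The quantile rule used in the proof is the standard split-conformal one: with n calibration scores and risk level alpha, take the ceil((1 - alpha)(n + 1))-th smallest score, or +infinity when that index exceeds n. The sketch below implements it on one-sided scores (realized cost minus forecast). The symbol names `alpha` and `n` and both function names are this sketch's own, not the library's.

```python
import math


def conformal_margin(scores, alpha):
    """Split-conformal one-sided margin from calibration scores.

    scores: realized cost minus forecast, one per calibration action.
    Returns the ceil((1 - alpha) * (n + 1))-th smallest score, or +inf
    when the calibration set is too small for the requested level.
    """
    n = len(scores)
    # The small tolerance guards against float round-up just above an integer.
    k = math.ceil((1.0 - alpha) * (n + 1) - 1e-12)
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def gate_admits(forecast, margin, remaining_budget):
    """Admit only when forecast plus conformal margin fits the budget."""
    return forecast + margin <= remaining_budget


if __name__ == "__main__":
    calibration = [3, 1, 2, 5, 4]          # hypothetical residuals, in tokens
    print(conformal_margin(calibration, alpha=0.5))   # third smallest of five
    print(conformal_margin(calibration, alpha=0.1))   # index 6 > 5, so inf
```

An infinite margin rejects every action, which is the conservative behaviour one would want when there are too few calibration residuals to certify the requested level.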

Remark 1 (Marginal, not conditional).

The guarantee is marginal over the residual distribution, not conditional on , and assumes exchangeability. Under workload drift it can be restored online with adaptive conformal inference [11] (§10).

Anytime-valid trajectory safety.

Theorem 2 bounds each admitted action’s breach probability marginally. For the cumulative over-spend along a trajectory — monitored continuously, at a data-dependent stopping time — we want a time-uniform bound. Write the per-step over-prediction residual as and let be the cumulative residual over the first admitted actions, with and the natural filtration of admitted actions.

Theorem 3 (Anytime-valid cumulative over-spend).

Suppose the centered residuals are conditionally -sub-Gaussian given . Then for any , simultaneously over all — and hence at every stopping time —

(6)

Consequently the realized cumulative cost exceeds its forecast plus an envelope with probability at most , uniformly over the trajectory.

Proof.

Fix and define , with . By the conditional sub-Gaussian assumption, , so : is a non-negative supermartingale. Mixing over with a centered Gaussian prior of variance (the method of mixtures) yields another non-negative supermartingale with . Ville’s inequality [13] gives . Evaluating the Gaussian integral and rearranging into a bound on gives a time-uniform boundary of order (up to lower-order terms absorbed by the mixture); this is the standard sub-Gaussian confidence sequence [12]. Because the bound holds simultaneously for all , it holds at any stopping time by optional stopping. A fuller derivation is in Appendix C. ∎
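The mixture step in the proof has a well-known closed form, which the sketch below evaluates. Mixing the exponential supermartingale over a centered Gaussian with precision rho gives sqrt(rho / (v + rho)) * exp(S^2 / (2 (v + rho))), where v is sigma^2 t, and applying Ville's inequality at level delta yields the boundary sqrt((v + rho) * log((v + rho) / (rho * delta^2))). The symbols `sigma`, `rho`, and `delta` and the function names are this sketch's own. This is one standard normal-mixture boundary, offered as an assumption-laden illustration and not necessarily the exact form or constants of (6).

```python
import math


def mixture_boundary(t, sigma, rho, delta):
    """Time-uniform boundary for a sigma-sub-Gaussian cumulative residual.

    t: number of admitted actions so far; rho: mixture precision (> 0),
    a tuning constant in variance units; delta: failure probability.
    """
    if not 0.0 < delta < 1.0 or rho <= 0.0 or t < 0:
        raise ValueError("need 0 < delta < 1, rho > 0, t >= 0")
    v = sigma * sigma * t  # variance proxy accumulated over t steps
    return math.sqrt((v + rho) * math.log((v + rho) / (rho * delta * delta)))


def first_crossing(residuals, sigma, rho=1.0, delta=0.05):
    """Return the first step at which cumulative over-spend exceeds the boundary.

    residuals: centered per-step (realized minus forecast) costs.
    Returns None if the running sum never crosses.
    """
    cumulative = 0.0
    for t, r in enumerate(residuals, start=1):
        cumulative += r
        if cumulative > mixture_boundary(t, sigma, rho, delta):
            return t
    return None
```

Because the boundary holds for all t at once, the monitor can be checked after every action without inflating the error rate. As Remark 3 cautions, its validity rests on the sub-Gaussian premise.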

Remark 2 (A stronger assumption than Theorem 2).

Time-uniform concentration requires a tail assumption (sub-Gaussian increments) that the marginal conformal bound does not. We treat Theorem 3 as the trajectory-level companion to the per-action Theorem 2, and validate the per-action coverage it builds on empirically in §10.

Remark 3 (The sub-Gaussian assumption is not supported by the real residuals).

We state the tension plainly. The forecast residuals on real traffic are right-skewed and mildly heavy-tailed — skew , excess kurtosis , with normality decisively rejected (§10). The conditional sub-Gaussian hypothesis of Theorem 3 is therefore not established by our data; the theorem is an idealized companion bound, and a deployment should not rely on the sub-Gaussian width as if it were validated. The faithful replacement is a variance-adaptive confidence sequence that assumes only bounded or sub-exponential increments. Concretely, the empirical-Bernstein confidence sequence of Howard et al. [12] replaces the boundary of Theorem 3 with one of order

where is the empirical cumulative variance and bounds the per-step residual; the width adapts to the realized residual variance rather than to a stipulated , and the linear term carries the heavy tail. Proof sketch. The same supermartingale construction as Theorem 3 goes through with the sub-Gaussian exponential process replaced by the empirical-Bernstein supermartingale of [12] (their Thm. 4 / the canonical-assumption framework), to which Ville’s inequality [13] applies verbatim; we do not re-derive it. This is the form a production deployment should monitor the cumulative over-spend with — it is anytime-valid under exactly the bounded/sub-exponential conditions our residuals plausibly satisfy, where the sub-Gaussian boundary is not justified. Promoting it from a paper-side bound to the runtime gate is the Phase-2 item flagged in §15.

Empirical coverage (synthetic).

Figure 2 validates both bounds on a held-out test split ( learned forecasts from the benchmark). Empirical coverage tracks nominal within percentage points for the Normal- gate and within for split conformal across . The two agree closely because the synthetic data-generating process has Gaussian residuals, so the Normal- assumption happens to hold. The value of conformal is precisely that it attains the same coverage without that assumption — which is exactly what real, non-Gaussian traffic demands. §10 shows that on real ShareGPT residuals the Normal- gate under-covers at the tight one would actually deploy, while split conformal continues to hold nominal coverage.

Figure 2: Gate reliability on a held-out split: empirical vs. nominal coverage for the Normal- gate and the distribution-free split-conformal bound. Both lie on the diagonal; conformal needs no Gaussian assumption.
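The contrast between the two bounds can be sketched in a few lines. The following is a minimal illustration, not the library's implementation: the function names, the centred-lognormal residual generator, and the sample sizes are all assumptions chosen only to produce right-skewed residuals. It computes a Gaussian upper bound (mean plus a Normal quantile times the standard deviation) and a split-conformal upper bound (an order statistic of the calibration residuals), then measures empirical coverage on a fresh sample.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def normal_upper_bound(residuals, alpha):
    """Upper residual bound under a Gaussian assumption: mu + z_{1-alpha} * sigma."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return mean(residuals) + z * stdev(residuals)


def conformal_upper_bound(residuals, alpha):
    """Split-conformal bound: the ceil((n+1)(1-alpha))-th smallest calibration residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]


def coverage(test_residuals, bound):
    """Fraction of test residuals at or below the bound."""
    return sum(r <= bound for r in test_residuals) / len(test_residuals)


def demo(seed=0, n=4000, alpha=0.01):
    """Return (Gaussian coverage, conformal coverage) on skewed synthetic residuals."""
    rng = random.Random(seed)

    def draw():
        # Centred lognormal: right-skewed and heavy-tailed, purely illustrative.
        return rng.lognormvariate(0.0, 1.0) - math.exp(0.5)

    calib = [draw() for _ in range(n)]
    test = [draw() for _ in range(n)]
    return (coverage(test, normal_upper_bound(calib, alpha)),
            coverage(test, conformal_upper_bound(calib, alpha)))
```

On Gaussian residuals the two bounds nearly coincide, which matches the agreement reported for the synthetic benchmark; on skewed residuals only the order-statistic bound is expected to track the nominal level at tight risk levels.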

8  Synthetic Evaluation and Ablation

Workload.

We use a synthetic Integrated Business Planning (IBP) demand-forecasting pipeline: a fan-out workload of SKUs, each handled by a depth- agent loop, over seeds. No real LLM is called; per-step token usage is simulated from a known relationship (so the estimator has a real signal to learn and runs are deterministic), while the treatment path exercises the real governance stack end to end. A small fraction of SKUs attempt to loop depth, a retry-storm stress scenario for the circuit breaker.

Ablation.

We run four cumulative conditions (baseline → +scope → +scope+route → +full) so each lever’s contribution is isolated, with a paired-bootstrap CI on each reduction (Figure 3, Table 4). Relative to the baseline (M tokens, $,  gCO2e per run under time-varying intensity), the full stack uses M tokens, $, and  gCO2e.

Condition Token reduction USD reduction Carbon reduction (time-var.)
+scope
+scope+route
+full
Table 4: Reduction vs. baseline by lever (20 seeds; paired-bootstrap CIs). Adding routing leaves the token count unchanged — it swaps models, cutting USD and carbon, not tokens — which the ablation makes visible. The circuit breaker (+full) adds the remaining token saving by killing runaway loops (20 breaker trips/run).
Figure 3: Per-lever reduction with CIs. Scope drives the token saving; routing converts that into USD/carbon savings; the breaker adds the final token increment.
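A paired-bootstrap interval of the kind reported in Table 4 can be sketched as follows. This is an illustrative reading, assuming the reduction is defined as one minus the ratio of summed treated cost to summed baseline cost and that a percentile interval is used; the function name and defaults are hypothetical. Resampling seed indices, rather than the two arms independently, is what keeps the design paired.

```python
import random


def paired_bootstrap_reduction_ci(baseline, treated, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for the relative reduction 1 - sum(treated) / sum(baseline).

    baseline[i] and treated[i] must come from the same seed (paired design).
    Returns (point_estimate, lower, upper).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need equal-length, non-empty paired samples")
    rng = random.Random(seed)
    n = len(baseline)
    stats = []
    for _ in range(n_boot):
        # Resample seed indices with replacement; each pair stays intact.
        idx = [rng.randrange(n) for _ in range(n)]
        b = sum(baseline[i] for i in idx)
        t = sum(treated[i] for i in idx)
        stats.append(1.0 - t / b)
    stats.sort()
    lo = stats[int((1.0 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1.0 + level) / 2 * n_boot))]
    point = 1.0 - sum(treated) / sum(baseline)
    return point, lo, hi
```

Calling it once per condition against the same baseline runs yields one interval per row of the ablation table.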
Forecast quality and cold start.

In the full condition the learned estimator attains a token-cost MAE of and WAPE of over all admitted actions. Figure 4 isolates the learning dynamics on a single-key stream: per-action absolute error falls from a cold-start 4,000 tokens (worst-case forecast: prompt plus full max_tokens) to the irreducible noise floor within 20 observations; rolling MAE drops from (first half) to (second half), WAPE to . Figure 5 shows predicted-vs-actual over learned forecasts (, WAPE ).

Figure 4: Cold start: forecast error collapses to the noise floor as the loop learns.
Figure 5: Predicted vs. actual token cost (learned forecasts), .
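The two error metrics and the cold-start behaviour can be illustrated with a small sketch. The estimator below is a deliberately simplified stand-in (a running mean of observed completions), not the online OLS the paper fits; the class name and its fields are hypothetical. It forecasts the worst case (prompt plus max_tokens) until it has seen data, which is the cold-start regime described above.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def wape(actual, predicted):
    """Weighted absolute percentage error: total absolute error over total actual."""
    return (sum(abs(a - p) for a, p in zip(actual, predicted))
            / sum(abs(a) for a in actual))


class RunningMeanEstimator:
    """Hypothetical per-key forecaster: worst case until data arrives, then a running mean."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.n = 0
        self.mean_completion = 0.0

    def predict(self, prompt_tokens):
        # Cold start: assume the completion uses the full max_tokens allowance.
        completion = self.max_tokens if self.n == 0 else self.mean_completion
        return prompt_tokens + completion

    def update(self, completion_tokens):
        # Incremental mean update, no history stored.
        self.n += 1
        self.mean_completion += (completion_tokens - self.mean_completion) / self.n
```

Feeding a stream of (prompt, completion) pairs through `predict` then `update` and computing `mae` over the first and second halves reproduces the shape of a cold-start curve, though not the paper's numbers.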
A negative result, stated plainly.

In this headline workload the gate issues zero rejections: the benchmark’s budget is generous, so the savings come from state scoping, routing, and the circuit breaker — not from the gate refusing actions. The gate’s contribution here is the forecast (which the auditor logs and the breaker and router consume) and the guarantee it would provide under a binding budget. We exercise the gate against a binding budget separately in §12 and §13. We consider it important not to over-claim the gate’s role in the aggregate token number.

9  Gate Behaviour Under Binding Budgets

§8’s headline workload has a non-binding budget, so the gate never rejects. Here we make the budget bind and measure the gate where its guarantee matters.

9.1  Protocol

We sweep the token budget over , with (the mean full-snowball cost), seeds each, running the full Green SARC stack at . The estimator is warm-started on an independent stream so we measure steady-state binding-budget behaviour, not the cold-start transient. We report, per budget: admission rate (admitted / attempted steps); over-budget incidence (fraction of admitted steps whose realized cost exceeded the budget remaining at the moment of admission, i.e. the per-action breach event of Theorem 2); completed-trajectory rate (fraction of SKUs whose every step ran before the budget was exhausted); MAE on admitted actions; and total tokens.
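The reported metrics can be made concrete with a short sketch of a gated run. The `Step` record and `run_gate` function are hypothetical names, and the admission rule (admit when the forecast upper bound fits the remaining budget) is an assumed simplification of the full stack; forecast MAE is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Step:
    trajectory: int      # id of the trajectory (SKU) this step belongs to
    upper_bound: float   # calibrated forecast upper bound on cost, in tokens
    realized: float      # realized cost, in tokens


def run_gate(steps, budget):
    """Replay steps in arrival order through a budget gate and compute sweep metrics."""
    remaining = budget
    attempted = admitted = breaches = 0
    tokens = 0.0
    attempted_by_traj, admitted_by_traj = {}, {}
    for s in steps:
        attempted += 1
        attempted_by_traj[s.trajectory] = attempted_by_traj.get(s.trajectory, 0) + 1
        if s.upper_bound > remaining:
            continue  # rejected: the forecast bound does not fit the live budget
        admitted += 1
        admitted_by_traj[s.trajectory] = admitted_by_traj.get(s.trajectory, 0) + 1
        if s.realized > remaining:
            breaches += 1  # per-action breach: realized cost exceeded what was left
        remaining -= s.realized
        tokens += s.realized
    completed = sum(1 for t, n in attempted_by_traj.items()
                    if admitted_by_traj.get(t, 0) == n)
    return {
        "admission_rate": admitted / attempted if attempted else 0.0,
        "over_budget_incidence": breaches / admitted if admitted else 0.0,
        "completed_rate": completed / len(attempted_by_traj) if attempted_by_traj else 0.0,
        "total_tokens": tokens,
    }
```

Sweeping `budget` over a grid and averaging the returned dictionaries over seeds gives one row per budget level.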

9.2  Results

Table 5 reports the sweep. Over-budget incidence is at every budget level — comfortably within the target — confirming Theorem 2 empirically: the calibrated gate does not admit actions it cannot afford. Admission and completion degrade smoothly and monotonically as the budget tightens: at (a budget one-quarter of the naive baseline) the gate still admits of attempted steps and completes of trajectories with zero overspend; from the budget is slack and everything completes. Forecast MAE is stable ( tokens) across budgets.

admission over-budget completed MAE (tok) tokens
M
M
M
M
M
M
Table 5: Binding-budget sweep ( seeds, ). Over-budget incidence stays at () at every budget; admission and completion degrade smoothly as tightens.

9.3  The Pareto frontier

Figure 6 plots completed-trajectory fraction against over-budget incidence for the gate (sweeping ) and for the §13 soft penalty (sweeping its weight , with over-budget measured against the binding reference ). The gate’s frontier lies along the bottom axis ( over-budget at every completion level it reaches), dominating the soft penalty, which can only complete most trajectories by breaching the budget on every seed. The penalty frontier is essentially bimodal: on this workload its realized spend jumps from “admit-cheap” (little completed, within budget) to “admit-all” (everything completed, over budget), with no intermediate tracking the budget — a direct consequence of its budget-blindness.

Figure 6: Binding-budget frontier. The gate (green, sweeping ) completes work at over-budget incidence; the soft penalty (orange, sweeping ) can complete trajectories only by breaching the budget. The gate’s frontier dominates.

9.4  Reading

The gate’s empirical over-budget incidence tracks across the entire budget grid (it never exceeds it, here attaining ); admission degrades smoothly as tightens rather than collapsing; and the gate’s work-vs-overspend frontier dominates the soft-penalty baseline of §13. This is the binding-budget evidence that §8’s non-binding headline could not provide.

10  Real-Trace Coverage Validation

§7’s coverage check used synthetic, Gaussian residuals. Here we validate the gate’s calibration on real forecast residuals — the experiment the prior draft flagged as future work.

10.1  Dataset and preprocessing

We replay anon8231489123/ShareGPT_Vicuna_unfiltered [14], a public corpus of real ChatGPT/GPT-4 conversations on the Hugging Face Hub, released under a permissive research license. We use ShareGPT because LMSYS-Chat-1M is access-gated and would not reproduce on a clean clone without a credential; ShareGPT is ungated and serves the same purpose. No LLM is called: we use token counts only. Turns are tokenized with tiktoken (cl100k_base); for each assistant turn we form the pair , capped to a realistic deployment window. This yields pairs across conversations, split into calibration () and test () by a conversation-level partition (§10.3).

10.2  Residuals are not Gaussian

We fit online OLS on the calibration split. The residuals (Figure 7) have skew and excess kurtosis ; an Anderson–Darling test gives statistic , far above the critical value , and D’Agostino’s test returns : normality is decisively rejected. The Gaussian- assumption underlying the Phase-1 runtime gate does not hold on real traffic.

Figure 7: Real ShareGPT forecast residuals: skewed, heavy-tailed histogram (left) and Q–Q plot departing from the Normal line (right). Anderson–Darling (the critical value).
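The two shape statistics quoted above are simple moment ratios; a dependency-free sketch follows. It uses the population (biased) moment definitions, which is an assumption about the convention, and the function name is hypothetical. The formal tests named in the text are available in SciPy as `scipy.stats.anderson` and `scipy.stats.normaltest` and are not re-implemented here.

```python
def skew_and_excess_kurtosis(xs):
    """Return (skewness, excess kurtosis) from population central moments.

    Both are zero for a Gaussian; positive skew means a longer right tail,
    positive excess kurtosis means heavier tails than Normal.
    """
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0
```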

10.3  Coverage: Gaussian- vs. split conformal

We split the calibration and test sets by conversation: every turn of a conversation is assigned wholly to one side, so that within-conversation residual correlation never straddles the split (a row-level shuffle would leak it, violating exchangeability; on this corpus that leak inflates the reported conformal coverage at by pp, so the conversation-level number we report below is the honest, slightly looser one). Figure 8 compares empirical coverage to nominal on the conversation-level test split. The Normal- gate is mis-calibrated: it over-covers at loose (e.g. pp at , wasting budget) and, more dangerously, under-covers at the tight one actually deploys — pp at and pp at , i.e. roughly more budget breaches than promised. Split conformal stays within percentage points of nominal across the entire range ( pp at ). On real residuals the distribution-free bound is not a nicety — it is the difference between a gate that keeps its safety promise and one that quietly violates it at the operating point.

Figure 8: Coverage on real ShareGPT residuals. The Normal- gate under-covers at tight (unsafe); split conformal holds nominal coverage within pp without a distributional assumption.
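The conversation-level partition described in this subsection can be sketched as a group-wise split. The function name, the `(conversation_id, payload)` row layout, and the 50/50 default are illustrative assumptions; the point is only that the shuffle acts on conversation ids, never on rows.

```python
import random


def split_by_conversation(rows, calib_fraction=0.5, seed=0):
    """Split rows into (calibration, test) so that no conversation straddles the split.

    rows: list of (conversation_id, payload) tuples.
    """
    conv_ids = sorted({cid for cid, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(conv_ids)  # shuffle conversations, not individual turns
    n_calib = int(len(conv_ids) * calib_fraction)
    calib_ids = set(conv_ids[:n_calib])
    calib = [r for r in rows if r[0] in calib_ids]
    test = [r for r in rows if r[0] not in calib_ids]
    return calib, test
```

A row-level shuffle would instead call `rng.shuffle(rows)` and cut the list, which is exactly the leak the text warns about.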

10.4  Two worlds, one guarantee

The paper now spans a synthetic-residual world (§4–§9), where Gaussian and conformal bounds coincide because the data-generating process is Gaussian by construction, and a real-residual world (§10), where they diverge and only conformal holds nominal coverage. Conformal calibration is the bound that survives both. This is also why §15 lists promoting conformal into the runtime gate as the leading Phase-2 item.

10.5  Distribution shift

Coverage guarantees assume the deployment distribution matches calibration; real workloads drift. We split the corpus by conversation into a short-context regime (calibration) and a long-context regime (deployment) — classifying each conversation by its own maximum depth so all of its turns land in one regime — then train the conformal quantile on the former and deploy on the latter ( vs. pairs). Figure 9 shows the result. The fixed quantile mis-covers post-shift — it drifts to against a target ( pp off, here over-conservative, needlessly rejecting work). Adaptive conformal inference (ACI [11]), updating the quantile level online at rate , restores empirical coverage to ( pp off target) within the rolling window. Under drift, the static conformal quantile is no longer sufficient; ACI is the runtime mechanism that maintains the guarantee.

Figure 9: Coverage under distribution shift (train on short conversations, deploy on long). The fixed quantile mis-covers ( vs. target); adaptive conformal inference tracks the target ().
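The ACI update itself is a one-line recursion on the working risk level. The sketch below follows the standard form of adaptive conformal inference (raise the working level after a covered step, lower it after a miss); the function names, the Gaussian toy data, and the step size are assumptions for illustration and do not reproduce the paper's experiment.

```python
import math
import random


def empirical_quantile(sorted_scores, level):
    """Conformal quantile at `level`; infinite bounds outside (0, 1)."""
    if level >= 1.0:
        return math.inf
    if level <= 0.0:
        return -math.inf
    n = len(sorted_scores)
    k = math.ceil(level * (n + 1))
    return math.inf if k > n else sorted_scores[k - 1]


def fixed_coverage(calib_scores, stream, alpha=0.1):
    """Coverage of a static conformal bound calibrated once."""
    bound = empirical_quantile(sorted(calib_scores), 1.0 - alpha)
    return sum(y <= bound for y in stream) / len(stream)


def aci_coverage(calib_scores, stream, alpha=0.1, gamma=0.01):
    """Coverage when the working level alpha_t is adapted online."""
    scores = sorted(calib_scores)
    alpha_t = alpha
    covered = 0
    for y in stream:
        bound = empirical_quantile(scores, 1.0 - alpha_t)
        miss = 0 if y <= bound else 1
        covered += 1 - miss
        # ACI update: a miss lowers alpha_t (wider bound), a cover raises it.
        alpha_t += gamma * (alpha - miss)
    return covered / len(stream)


def demo_shift(seed=0, n_calib=2000, n_stream=5000):
    """Toy shift: deployment residuals are narrower than calibration residuals."""
    rng = random.Random(seed)
    calib = [rng.gauss(0.0, 1.0) for _ in range(n_calib)]
    stream = [rng.gauss(0.0, 0.5) for _ in range(n_stream)]
    return fixed_coverage(calib, stream), aci_coverage(calib, stream)
```

In this toy shift the static bound over-covers, the same direction the text reports, while the adapted level pulls long-run coverage back toward the target.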

11  Real-Arrival Ablation on Production Traffic

§8’s ablation ran on synthetic IBP arrivals — the leading threat to validity. Here we re-run the same four-condition ablation on a real LLM serving trace, converting the headline result from synthetic to empirical.

11.1  Dataset and trajectory construction

We use BurstGPT [15] (BurstGPT_1.csv, CC-BY-4.0), a trace of real Azure OpenAI traffic with schema (Timestamp, Model, Request tokens, Response tokens, Total tokens, Log Type). We take a -request sample (after dropping failed responses, Response tokens); the model mix is GPT-3.5 (“ChatGPT”) and GPT-4 requests, mapped to the benchmark’s efficient and frontier profiles respectively. Request tokens is the prompt and Response tokens the realized completion — the gate’s per-action target. No LLM is called: token counts only, as in §10.

BurstGPT v1.0 carries no session identifier, so we reconstruct trajectories by temporal clustering: consecutive same-model requests within a  s window are grouped, capped at depth (the IBP default); API-log rows are single-step. This yields trajectories with median depth (real serving traffic is dominated by independent single requests) and maximum depth . We set the Adapter-Node scope cap to the median prompt ( tokens) and the circuit breaker to the depth cap, and document both as policy choices. (When BurstGPT v1.1 ships SessionID, the clustering heuristic becomes a one-line group-by.)
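The clustering heuristic can be sketched as a single pass over the time-ordered log. The function name is hypothetical, and `window_s` and `max_depth` are left as parameters because the specific values are policy choices. The sketch also assumes the window is measured from the previous request in the group rather than from the first, which the description leaves open.

```python
def cluster_trajectories(requests, window_s, max_depth):
    """Group consecutive same-model requests into trajectories.

    requests: list of (timestamp_seconds, model) tuples sorted by timestamp.
    A request joins the current trajectory if it uses the same model, arrives
    within window_s of the previous request, and the depth cap is not reached.
    """
    trajectories = []
    current = []
    for ts, model in requests:
        joins = (
            bool(current)
            and model == current[-1][1]
            and ts - current[-1][0] <= window_s
            and len(current) < max_depth
        )
        if joins:
            current.append((ts, model))
        else:
            if current:
                trajectories.append(current)
            current = [(ts, model)]
    if current:
        trajectories.append(current)
    return trajectories
```

With a session identifier available, the same grouping reduces to a group-by on that column followed by the depth cap.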

11.2  Results

Table 6 and Figure 10 report the ablation with paired-bootstrap CIs over trajectories. Scope (Adapter-Node prompt bounding) cuts tokens by and carbon by ; routing of trajectories to the efficient model adds USD savings () and carbon () at no further token cost — exactly the lever decomposition the synthetic ablation predicted (scope drives tokens, routing converts to USD/carbon). The savings ordering is confirmed on real arrivals.

Condition Token reduction USD reduction Carbon reduction
+scope
+scope+route
+full
Table 6: Real-arrival ablation on BurstGPT ( requests, trajectories; paired-bootstrap CIs vs. baseline). +full equals +scope+route: the circuit breaker records zero trips because the real trace contains no retry storms.
Figure 10: Real-arrival ablation (BurstGPT). Scope drives token and carbon savings; routing adds USD/carbon savings; +full is indistinguishable from +scope+route.

An honest negative. +full is identical to +scope+route: the circuit breaker logs zero trips and the gate (under a non-binding budget) issues zero rejections. Real serving traffic has none of the runaway retry loops the synthetic IBP injected, so these two levers are dormant safeguards here — their value appears only under stress (the IBP runaway SKUs, §8) and under binding budgets (§9, and below).

11.3  Binding budget under real arrivals

We repeat the §9 sweep on the real trace ( baseline tokens, ; Figure 11). Over-budget incidence is at every budget — the gate’s safety guarantee holds on real arrivals exactly as on synthetic ones — while admission and completion degrade as the budget tightens (: admitted, of trajectories completed; : full completion). The gate frontier again dominates the soft-penalty frontier, which can reach high completion only by breaching the budget on every seed.

Figure 11: Binding-budget frontier on BurstGPT. The gate completes work at over-budget across the sweep; the soft penalty breaches to reach high completion. Mirrors Figure 6.

11.4  What the synthetic IBP did and did not capture

The IBP pipeline tightened two claims and the real trace now corroborates them: the lever ordering (scope → tokens, routing → USD/carbon) and the gate’s over-budget incidence under binding budgets both reproduce on BurstGPT.

Two claims weaken or shift. First, +full’s extra token saving in the IBP ( vs. ) came entirely from the breaker killing injected runaway SKUs; real arrivals have no such storms, so that increment vanishes, consistent with §10’s finding that real cumulative-prompt curvature is negative. Second, on real single-step traffic “scope” is simply context truncation: the Adapter Node caps each prompt at tokens ( the median), so the token reduction is largely the mechanical consequence of that cap, not an intrinsic property of the architecture. The headline percentage should therefore not be read as a free saving Green SARC delivers: it is a tunable policy knob whose realized magnitude scales with the cap, and whose quality cost (truncated context degrading task utility) is deliberately untracked here (§3). A different operator with a less aggressive cap would see a proportionally smaller number.

What survives this caveat, and what we therefore present as the load-bearing claims of this section, are the two cap-independent results: the lever ordering reproduces on real arrivals, and the gate’s over-budget incidence is under binding budgets on real data exactly as on synthetic data. The IBP is thus best read as a controlled stress-test of the multi-step regime; BurstGPT confirms the per-request governance levers and the safety property on real distributions, but it is a single-step serving trace and does not exercise the multi-step snowball or breaker dynamics, for which a real multi-step agent trace remains the natural next validation (§15).

11.5  Carbon savings under real grid mixes

The carbon results so far use a stipulated intensity curve. We re-compute the BurstGPT carbon reductions under measured grid intensity for two zones with contrasting generation mixes.

Setup and data sources

We source hourly carbon-intensity measurements from the ElectricityMaps v3 API [16], which aggregates regulator feeds (ENTSO-E, CAISO OASIS, national TSOs) into a consistent gCO2eq/kWh series on a lifecycle (LCA) basis. We use two zones with materially different generation mixes: Italy (IT, ), gas- and import-dominated, and California (US-CAL-CISO, ), characterised by deep daytime solar troughs and gas-heavy evening peaks. For reference we also report the benchmark’s stipulated proxy (). The free API tier exposes only the most recent hours of history, so we use a single -hour measured window per zone; this captures diurnal contrast but not seasonal or weekly variation, which we note as a §15 limitation. Carbon for each step is , with the workload’s actions spread across that window. The fetched series is cached as committed CSV under paper/data/grid/, so this section reproduces from a clean clone without API access (fetch_grid.py --refresh re-fetches).

Results

Condition stipulated () IT () US-CAISO ()
+scope
+scope+route
+full
Table 7: Carbon reduction vs. baseline under three grid intensities ( in gCO2eq/kWh; paired-bootstrap CIs). Figure 12.
Figure 12: Carbon reduction per lever under the stipulated curve and two real ElectricityMaps grids (Italy, California). The percentage reduction is grid-invariant.

Reading

What survives across grids is both the lever ordering and the percentage reduction: scope plus routing cuts carbon by – under all three intensities, because the reduction is a ratio of energy and enters as a common positive multiplier. The result that held under the synthetic proxy holds on real Italian and Californian grids, despite a difference in mean intensity.
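The multiplier argument can be checked numerically with a small sketch. It assumes the linear carbon model discussed in §15 (tokens times a per-token energy coefficient times grid intensity); the function names are hypothetical and the coefficient in the usage note is a placeholder, not the shipped default.

```python
def step_carbon_g(tokens, kwh_per_token, intensity_g_per_kwh):
    """Carbon for one step in gCO2eq under a linear energy-per-token model."""
    return tokens * kwh_per_token * intensity_g_per_kwh


def carbon_reduction(baseline_tokens, treated_tokens, kwh_per_token, intensity):
    """Fractional carbon reduction; all three sequences are aligned per step."""
    base = sum(step_carbon_g(t, kwh_per_token, c)
               for t, c in zip(baseline_tokens, intensity))
    treat = sum(step_carbon_g(t, kwh_per_token, c)
                for t, c in zip(treated_tokens, intensity))
    return 1.0 - treat / base
```

For example, `carbon_reduction([1000, 2000], [600, 1500], 1e-6, [300.0, 350.0])` returns the same value if every intensity is multiplied by a constant, because the constant cancels in the ratio. The cancellation is exact only for a uniform rescaling; two grids with different diurnal shapes can in principle give slightly different percentages when the token savings are not spread evenly across hours.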

What differs is the diurnal structure, and it matters more for one zone than the other. Italy’s intensity is comparatively flat ( intra-day swing, –), so the time at which an agent runs barely changes its carbon. California swings harder (, midday to evening): the same inference is roughly cleaner in the midday solar trough than at the evening peak. The absolute carbon saved therefore scales both with (a CAISO deployment at saves less than half the absolute carbon of an IT deployment for the same percentage) and, in CAISO, with when the traffic lands. Green SARC’s carbon-reduction percentage is robust to grid mix, but its real-world impact depends on where and when the compute runs; time-of-day carbon-aware routing is a Phase-2 opportunity this dataset would enable but the current code does not exploit (§15).

11.6  Multi-step real-trace ablation on SWE-rebench

BurstGPT is single-step; §11 flagged that it cannot exercise the multi-step snowball or breaker. We close that gap on real agent plans.

Dataset

We replay SWE-rebench OpenHands trajectories [17] (CC-BY-4.0; k real agent plans solving GitHub issues with Qwen3-Coder-480B, mapped to the frontier profile), sub-sampled to trajectories streamed from the  GB parquet (token counts via tiktoken; no LLM call). These are genuinely multi-step: median depth assistant turns, maximum , with a median per-turn prompt of tokens — the context accretion the State Snowball describes.

Results

Condition Token reduction USD reduction Carbon reduction
+scope
+scope+route
+full
Table 8: Multi-step ablation on SWE-rebench ( plans; paired-bootstrap CIs). Unlike BurstGPT, +full adds token saving over +scope+route (): the breaker is a live lever (Figure 13).
The State-Snowball holds on real plans, and is steeper than the model.

Fitting each plan’s cumulative prompt against turn index to , every trajectory has (; Figure 14). The median exceeds the linear-accretion prediction (with the median per-turn growth): real agents accrete context faster than the constant-increment model of Assumption 1, because tool outputs and re-reads grow the prompt super-linearly. This is the strongest available confirmation that Theorem 1’s regime is real — and an honest correction that the closed form is a lower bound on real-plan curvature, not an exact match (the synthetic of §4 held only because the simulator was built to Assumption 1).
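The per-plan fit can be sketched as an ordinary least-squares quadratic on the cumulative prompt. The code below is a dependency-free illustration with hypothetical function names. Under the assumption that turns are indexed from 1 and each turn adds a constant g tokens to the prompt, the cumulative prompt is exactly quadratic with leading coefficient g/2, which is the benchmark a fitted coefficient would be compared against under that indexing.

```python
def cumulative(prompts):
    """Running sum of per-turn prompt sizes."""
    out, total = [], 0
    for p in prompts:
        total += p
        out.append(total)
    return out


def fit_quadratic(ts, ys):
    """Least-squares fit ys ~ a*t^2 + b*t + c; returns (a, b, c)."""
    s = [sum(t ** k for t in ts) for k in range(5)]               # power sums of t
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    # Normal equations with unknowns ordered (c, b, a), as an augmented matrix.
    m = [[s[0], s[1], s[2], r[0]],
         [s[1], s[2], s[3], r[1]],
         [s[2], s[3], s[4], r[2]]]
    for i in range(3):
        # Gaussian elimination with partial pivoting.
        pivot = max(range(i, 3), key=lambda j: abs(m[j][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for j in range(i + 1, 3):
            f = m[j][i] / m[i][i]
            for k in range(i, 4):
                m[j][k] -= f * m[i][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        # Back substitution.
        x[i] = (m[i][3] - sum(m[i][k] * x[k] for k in range(i + 1, 3))) / m[i][i]
    c, b, a = x
    return a, b, c
```

A plan whose fitted `a` exceeds g/2 for its own median per-turn growth g is accreting context faster than the constant-increment model predicts.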

Figure 13: Multi-step ablation; +full exceeds +scope+route on tokens because the breaker fires.
Figure 14: Per-plan quadratic coefficient : all positive, median above .
Breaker activations.

On these real plans the circuit breaker fires on of trajectories (the long-plan tail beyond the median depth), versus zero on BurstGPT. This is the decisive contrast: the breaker is a dormant safeguard on single-step serving traffic but a live, material lever on real multi-step agent plans, where it supplies the entire token-saving increment of +full over +scope+route.

What survives, what shifts.

Routing’s USD/carbon saving (/) and the lever ordering reproduce here as on BurstGPT and the IBP. Two things differ from BurstGPT. The breaker is no longer dormant ( vs ), vindicating its inclusion. And scope yields little here () only because the -median cap (k tokens) rarely binds on these plans; a tighter cap would truncate more but risks dropping the context the agent needs — the same policy/quality tradeoff named in §11, now with real multi-step stakes. Token reduction is modest precisely because we did not tune the cap aggressively; the load-bearing real-plan findings are the confirmed super-linear snowball and the live breaker.

11.7  Cost–utility frontier

The paragraph above names a tradeoff; because SWE-rebench records task outcomes, we can bound it. Each trajectory carries the benchmark’s real resolved flag — whether the agent’s patch passed the held-out tests — and of the plans resolved. We sweep the scope cap (, the median per-step prompt) and, for each cap, report tokens saved against an upper bound on quality harm: a worst-case resolution rate that assumes every resolved trajectory whose actually-used context the cap would have truncated flips to unresolved. This is deliberately pessimistic and, crucially, observational — truncation is simulated on logged trajectories, so the agent cannot react to the smaller context. The replay therefore bounds how much resolved work a cap puts at risk; it cannot establish that the work would in fact fail (a live agent might recover by re-fetching). It is a correlational upper bound, not a causal estimate.
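One point on this observational frontier can be computed with a short replay sketch. The function name and the `(prompts, resolved)` layout are hypothetical, and the token saving here counts only truncated prompt tokens, which is a simplifying assumption. The worst-case rule matches the description above: any resolved trajectory touched by the cap is counted as unresolved.

```python
def cost_utility_point(trajectories, cap):
    """Replay a scope cap over logged trajectories.

    trajectories: list of (prompt_tokens_per_step, resolved_flag).
    Returns token reduction, truncated fraction, and the pessimistic
    resolution rate that treats every truncated resolved plan as failed.
    """
    total = saved = 0
    truncated = resolved = at_risk = 0
    for prompts, ok in trajectories:
        total += sum(prompts)
        saved += sum(max(0, p - cap) for p in prompts)
        hit = any(p > cap for p in prompts)   # cap would have truncated this plan
        truncated += int(hit)
        resolved += int(ok)
        at_risk += int(ok and hit)
    n = len(trajectories)
    return {
        "token_reduction": saved / total,
        "truncated_fraction": truncated / n,
        "baseline_resolution": resolved / n,
        "worst_case_resolution": (resolved - at_risk) / n,
    }
```

Evaluating it over a grid of caps traces the frontier; because truncation is simulated on logs, the result remains an upper bound on harm, not an estimate of it.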

Figure 15: Cost–utility frontier on SWE-rebench: token reduction vs. the worst-case resolution rate (every truncated resolved plan assumed to fail). The cap that saves tokens truncates nearly every plan; the cap that is near-harmless saves little.

The frontier is steep (Figure 15). The aggressive cap saves of tokens but truncates of plans, putting all resolved work at risk ( worst-case resolution); the cap saves while still truncating . Only the loose cap used in our ablation is close to benign — tokens for a worst-case resolution of (it touches of plans) — and the cap is essentially free ( truncated, worst-case, against the baseline). The honest reading: on real multi-step plans the token savings available from scope capping are bought against a real and possibly large truncation risk, and the cap that is safe saves little. The causal version — where the agent adapts to the cap — needs the live study (§15); this observational frontier is the upper bound that motivates it.

12  Sensitivity Analysis: the Knob

The gate’s single tunable, the risk level , trades admission throughput against realized overspend. We pre-train the estimator, then gate a fresh stream against a binding token budget over seeds, sweeping (Figure 16). Tightening from to drives the overspend rate among admitted actions from to , at a negligible throughput cost (admission throughout, since the binding budget — not — sets the admission ceiling). The practical reading: under a hard budget, a conservative buys an overspend guarantee almost for free.

Figure 16: Gate under a binding budget: realized overspend among admitted actions vanishes as tightens, with admission throughput nearly unchanged.

12.1  Joint sensitivity over , scope cap, and routing fraction

The sweep above varies one knob. To check that the headline operating point is not a cherry-pick, we sweep all three: , scope cap the median prompt ( tokens), and routing fraction , giving cells over seeds. Token/USD/carbon reductions use the benchmark’s native forecast noise; over-budget incidence is measured under a binding budget at the elevated noise of the stress above (Figures 17 and 18).

Three findings, two of them null and stated as such. (i) Token reduction is governed almost entirely by the scope cap: at , at , at , at . (ii) Routing fraction does not move the token reduction at all (it reallocates models, changing USD and carbon, not tokens — the heatmap is flat along the routing axis), and (iii) has no measurable effect on either axis in this regime: the four panels of Figure 18 are identical, and over-budget incidence never exceeds across all cells. The last is the empirical face of Theorem 2: the gate admits on its upper bound, so realized over-budget events are vanishingly rare regardless of the operating point; becomes a live throughput-vs-overspend knob only under the higher forecast uncertainty of real residuals (§10). The paper’s headline operating point (cap , routing , ) achieves the maximum token reduction at over-budget and lies on the Pareto frontier ( of cells are non-dominated): it is the most aggressive cap at zero safety cost, not an interior cherry-pick.
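The non-dominated check behind the Pareto claim is a standard pairwise comparison; a minimal sketch follows, with a hypothetical function name and each cell represented as a (token reduction, over-budget incidence) pair. A cell is dominated if another cell is at least as good on both axes and strictly better on one.

```python
def pareto_front(cells):
    """Return the non-dominated cells.

    cells: list of (token_reduction, over_budget) pairs; higher reduction
    and lower over-budget incidence are both better.
    """
    front = []
    for i, (r_i, o_i) in enumerate(cells):
        dominated = any(
            (r_j >= r_i and o_j <= o_i) and (r_j > r_i or o_j < o_i)
            for j, (r_j, o_j) in enumerate(cells)
            if j != i
        )
        if not dominated:
            front.append((r_i, o_i))
    return front
```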

Figure 17: All cells in (token reduction, over-budget) space; the headline point is on the frontier at over-budget.
Figure 18: Token reduction over (scope cap × route), per . Flat in route and in ; driven by the cap.

13  Why an Architectural Gate: the Soft-Penalty Baseline

The natural alternative to a hard gate is reward shaping: add a Lagrangian cost penalty to the objective and let the agent self-limit, admitting an action iff its value exceeds times its cost. We compare this soft penalty against the architectural gate on a stream with stochastic costs and a hard budget (Figure 19). Because the penalty is a per-action threshold blind to the remaining budget, no single both respects and matches the gate’s throughput: small admits almost everything and overspends (); large under-spends. Crucially, at the that matches in expectation, realized spend straddles and breaches it on of seeds. The architectural gate, admitting in arrival order while the calibrated forecast fits the live remaining budget, breaches on of seeds while filling of the budget. This is the cost-domain instance of SARC’s thesis that finite penalties cannot substitute for hard runtime constraints.

Figure 19: A soft Lagrangian penalty cannot guarantee the budget: tuned to match in expectation it breaches on of seeds (blue), while the architectural gate (green) never breaches and fills of .
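The structural difference between the two admission rules can be seen in a toy simulation. Everything below is an illustrative assumption: the cost, value, and forecast distributions, the function name, and in particular the idealization that realized cost never exceeds the forecast upper bound (the paper's guarantee is probabilistic, not absolute). The gate conditions on the live remaining budget; the penalty rule compares value to a fixed multiple of expected cost and never looks at the budget.

```python
import random


def simulate(seed, budget, lam, n=200):
    """Return (gate_spend, penalty_spend) on one synthetic action stream."""
    rng = random.Random(seed)
    gate_spend = penalty_spend = 0.0
    for _ in range(n):
        upper = rng.uniform(50.0, 150.0)       # forecast upper bound on cost
        cost = upper * rng.uniform(0.5, 1.0)   # realized cost, never above the bound here
        value = rng.uniform(0.0, 300.0)        # hypothetical task value
        # Architectural gate: admit iff the bound fits the remaining budget.
        if upper <= budget - gate_spend:
            gate_spend += cost
        # Soft penalty: admit iff value exceeds lam times expected cost (budget-blind).
        if value > lam * 0.75 * upper:
            penalty_spend += cost
    return gate_spend, penalty_spend
```

Under these assumptions the gate's spend cannot exceed the budget by construction, while the penalty rule's spend depends only on `lam` and the stream, so a weight that is too small overshoots regardless of the budget.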

14  Threat Model and Adversarial Robustness

A runtime gate invites the question: what does an attacker who knows the gate do? We give a partial, honest answer with a toy study against the same gate code path the benchmark exercises.

14.1  Attacker model

The attacker is a prompt author with white-box knowledge of the estimator, the scope cap, and the budget, who observes gate decisions (and, via timing channels, possibly residuals). The attacker cannot modify src/ or mint tokens from nothing; the cost is always realized by the model provider. This is a cost-side adversary, distinct from the safety-side prompt-injection threat studied by [23, 24]: the goal is not to make the agent misbehave but to make it overspend while passing the gate.

14.2  Three attack classes

We construct three attacks ( seeds, instances each, paired-bootstrap CIs). Continuation inflation: a prompt whose realized completion is the benign law (“continue indefinitely” semantics). Scope-cap-aware padding: a prompt sized to exactly tokens, maximizing admitted work per call. Model-substitution gaming: a prompt that declares the efficient model while the cost is realized at frontier rates (a misreported model id).

14.3  Results

Attack Admission Over-budget Realized/declared Gate failure mode
Continuation inflation under-estimates
Scope-cap-aware padding over-admits
Model-substitution gaming under-estimates
Table 9: Adversarial study ( seeds; CIs in figure_stats.json). “Over-budget” is the rate at which the realized cost exceeds the gate’s admitted bound (USD for the substitution attack).

14.4  What survives

Two honest negatives. First, scope-cap-aware padding defeats the gate by staying inside its admission contract: it pads to just under the cap and extracts maximum legitimate work, so the gate admits it () with no over-budget event (, at the noise floor) and realized cost below the bound (). This is a fundamental limitation of bounded-prompt-only governance and is exactly what the Phase-2 trajectory estimator — which reasons about the whole plan, not one padded step — is meant to address. The gate alone is not sufficient here, and we do not claim otherwise.

Second, continuation inflation and model substitution both defeat the forecast (realized cost and the admitted bound), but they are caught post hoc by the Post-Action Auditor, which logs predicted-vs-actual and feeds the discrepancy back to the estimator and the Escalation Router. The architectural response to a forecast-defeating attack is audit-then-revoke at the Auditor, not admission-time rejection at the gate; this is the cost-domain instance of SARC’s predict–act–log–retrain loop. The gate bounds expected cost; it does not bound an adversary who lies about the future, and the four-site architecture — not the gate in isolation — is what makes the residual detectable.
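The audit step can be pictured with a small sketch of a post-action check. The class name, the ratio threshold, and the log layout are hypothetical and are not the library's interface; the sketch only shows the idea of logging predicted-vs-actual and flagging actions whose realized cost far exceeds the bound they were admitted under.

```python
from dataclasses import dataclass, field


@dataclass
class AuditorSketch:
    """Hypothetical post-action audit log with a realized/admitted ratio check."""
    ratio_threshold: float = 1.5          # illustrative tolerance, not a shipped default
    log: list = field(default_factory=list)

    def record(self, action_id, admitted_bound, realized):
        """Log one action and return True if it should be escalated."""
        ratio = realized / admitted_bound
        flagged = ratio > self.ratio_threshold
        self.log.append((action_id, admitted_bound, realized, ratio, flagged))
        return flagged

    def flagged_ids(self):
        """Ids of all actions whose realized cost exceeded the tolerance."""
        return [entry[0] for entry in self.log if entry[4]]
```

A padded prompt that stays under its bound is never flagged by such a check, which is why the padding attack needs plan-level reasoning rather than per-step auditing.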

15  Limitations, Threats to Validity, and Future Work

Threats to validity.

Synthetic headline workload. The ablation, binding-budget, and sensitivity results (§8, §9, §12) use a synthetic IBP pipeline with stipulated Gaussian noise; the State-Snowball and gate mechanics are real code, but the cost distribution is constructed. We mitigate this with the calibration study of §10 and the end-to-end real-arrival ablation of §11; a residual gap remains in that BurstGPT is single-operator Azure traffic, so cross-operator traces (Mooncake, Alibaba) are listed as future validation.

Marginal coverage. Theorem 2 is marginal and assumes exchangeability (Remark 1); Theorem 3 additionally assumes sub-Gaussian increments. Workload drift violates exchangeability; §10 shows ACI restores coverage, and as of v0.3.0 both split-conformal and ACI ship in the runtime gate (§6.1), though conditional (not merely marginal) coverage remains open.

Carbon proxy. is only as faithful as , which varies in availability and granularity across regions. The §11.5 real-grid study uses a -hour window per zone (ElectricityMaps free-tier constraint); a full year would expose seasonal variation in renewable share (e.g. CAISO winter low-solar) but is not expected to alter the grid-invariance result.

Energy model. The per-token energy is a stipulated linear coefficient (the shipped default is  kWh/token, of order  J/token, the same order of magnitude as benchmarked GPU inference energy for large models [19, 18]). Measured inference energy is not linear in tokens: it varies with batch size, hardware, and utilization, and grows super-linearly with context length because attention scales quadratically in sequence length [19]. The bias has a definite sign here: because the State Snowball makes context grow with loop depth, a linear proxy under-counts the marginal energy (and hence carbon) of the deepest, most expensive steps, so the carbon savings we report from bounding loop depth are, if anything, conservative. A measured energy table per model/hardware (rather than one coefficient) is the faithful fix and fits the existing CostModel interface without changing it.

Operator readiness. A single-process threading.Lock Budget is authoritative for one replica; an experimental Redis backend (one atomic Lua script per reserve/commit/release, with TTL reclamation of crashed-client reservations) provides a shared transactional counter for multi-replica deployments behind a load balancer, atomic against a single Redis but with no cross-region reconciliation and no fair-share reservations yet (Phase 2). Production deployments needing a durable ledger should await the Postgres backend.
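The reserve/commit/release pattern mentioned for the single-process Budget can be sketched as follows. The class and method signatures are hypothetical illustrations of the pattern, not the library's actual interface: a reservation holds the forecast upper bound, a commit settles it against the realized cost, and a release returns it untouched.

```python
import threading


class BudgetSketch:
    """Hypothetical single-process budget guarded by one lock."""

    def __init__(self, total):
        self._lock = threading.Lock()
        self._remaining = float(total)
        self._reserved = {}
        self._next_id = 0

    def reserve(self, upper_bound):
        """Hold upper_bound tokens; return a reservation id, or None if it does not fit."""
        with self._lock:
            if upper_bound > self._remaining:
                return None
            self._remaining -= upper_bound
            rid = self._next_id
            self._next_id += 1
            self._reserved[rid] = upper_bound
            return rid

    def commit(self, rid, realized):
        """Settle a reservation: refund unused headroom, or charge any overrun."""
        with self._lock:
            held = self._reserved.pop(rid)
            self._remaining += held - realized

    def release(self, rid):
        """Cancel a reservation and return the held amount in full."""
        with self._lock:
            self._remaining += self._reserved.pop(rid)

    @property
    def remaining(self):
        with self._lock:
            return self._remaining
```

Holding the upper bound rather than the point forecast is the design choice that keeps concurrent admissions from jointly overdrawing the budget; a shared backend would need the same three operations to be atomic.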

Negative and null results.

The gate produces no token savings in the headline workload (§8); adding routing yields zero marginal token reduction (it trades models, saving USD/carbon only); the declared latency-headroom field is not enforced in Phase 1; and on real chat traffic the cumulative-prompt curvature is negative (§10), so the snowball is specific to naive orchestration, not universal. We report these rather than fold them into a single “governance helps” number.

Future work.

The leading item is to promote the split-conformal upper bound of §7 and the anytime-valid trajectory bound of Theorem 3 from paper-side analyses into the runtime gate, with adaptive conformal inference [11] under workload drift (§10 shows why this matters). Beyond that: a multi-step agent trace (e.g. SWE-bench / OpenHands trajectories) to exercise the breaker and State-Snowball dynamics that the single-step BurstGPT trace does not, and cross-operator traces (Mooncake, Alibaba) for the carbon and arrival-distribution generalization §15 flags; the Phase-2 full-trajectory estimator for plan-level rejection; time-of-day carbon-aware routing, which §11.5’s CAISO diurnal swing () shows is exploitable but the current router ignores; a multi-tenant distributed Budget with fair-share reservations; latency-headroom enforcement; and a production KAOS deployment of the gate as a sidecar. The single outstanding empirical action is the live governed-agent study (two arms, ungoverned vs full stack, over tasks on the Anthropic API): its harness ships at paper/scripts/run_live_study.py with an in-script USD ceiling and a probe-checkpoint spend sign-off, and is unit-tested offline against a mock transport, but the funded live run is deferred (it is the one result this paper does not yet report). These are roadmap directions, not claims.

16  Conclusion

Green SARC applies a correctness-governance architecture to the economics and ecology of inference, and develops its own theory with reproducible evidence. The State-Snowball theorem explains why unconstrained agents fail financially, and its closed form is confirmed exactly in the data. The predictive Pre-Action Gate generalizes the static accounting gate into a calibrated forecaster of which the rule is the zero-information limit, with a distribution-free budget-safety guarantee. And the soft-penalty comparison shows the guarantee is not free for the taking by reward shaping — it requires the architectural placement. The structural claim is that correctness, cost, and sustainability are instances of one problem: the runtime enforcement of declared constraints, where the only thing that changes between them is the predicate.

Appendix A Benchmark configuration

IBP defaults: SKUs, depth , base prompt , per-step increment , scope cap , max_tokens ; completion ; runaway fraction at depth; breaker ; of SKUs routed to the small model under +route; . Two model profiles (frontier/efficient) with distinct USD and energy rates; carbon under both fixed ( gCO2e/kWh) and a daily time-varying intensity curve.

Appendix B Estimator

Per key, an online least-squares fit of completion on prompt tokens via running sums (Welford-style) [25], predicting (completion clamped to ) with residual std supplied to the gate; below min_samples it defers to the zero-information cold-start forecast.
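
The estimator described above can be sketched as a streaming simple linear regression. This is a hedged illustration, not the shipped estimator: the class and method names are hypothetical, the clamp interval is assumed to be [0, max_tokens], and the cold-start forecast is assumed to return max_tokens with zero spread, since the extracted text does not preserve those details.

```python
import math
from dataclasses import dataclass


@dataclass
class OnlineLinearFit:
    """Streaming fit of completion ~ intercept + slope * prompt (hypothetical name)."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    sxx: float = 0.0  # running sum of squared x deviations
    syy: float = 0.0  # running sum of squared y deviations
    sxy: float = 0.0  # running sum of x-y co-deviations

    def update(self, x: float, y: float) -> None:
        """Welford-style update: old-mean deviation times new-mean deviation."""
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.sxx += dx * (x - self.mean_x)
        self.syy += dy * (y - self.mean_y)
        self.sxy += dx * (y - self.mean_y)

    def forecast(self, x: float, max_tokens: int, min_samples: int = 8):
        """Return (predicted completion, residual std) for prompt size x."""
        if self.n < max(min_samples, 3) or self.sxx == 0.0:
            # Assumed cold-start rule: fall back to the declared cap.
            return float(max_tokens), 0.0
        slope = self.sxy / self.sxx
        intercept = self.mean_y - slope * self.mean_x
        rss = max(self.syy - slope * self.sxy, 0.0)  # guard tiny negative rounding
        std = math.sqrt(rss / (self.n - 2))
        mean = min(max(intercept + slope * x, 0.0), float(max_tokens))  # assumed clamp
        return mean, std
```

Keeping one such object per key (for example in a dictionary keyed by model and task type) gives the per-key behaviour the appendix describes, with constant memory per key.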

Appendix C Conformal calibration and the anytime-valid bound

Split conformal (Theorem 2). Split the learned forecasts into calibration/test halves; one-sided scores ; the -th order statistic; report test coverage against nominal .
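
The calibration procedure can be sketched as below. The exact score and order-statistic index are not preserved in the extracted text, so this sketch assumes the standard one-sided split-conformal construction: scores are actual minus forecast, and the bound uses the ceil((n+1)(1-alpha))-th smallest calibration score. Function names and the synthetic data in the usage note are illustrative only.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """ceil((n+1)(1-alpha))-th order statistic; infinite if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # not enough calibration points for this level
    return sorted(scores)[k - 1]


def split_conformal_coverage(forecasts, actuals, alpha, seed=0):
    """Split into calibration/test halves and report the test coverage."""
    idx = list(range(len(forecasts)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    cal, test = idx[:half], idx[half:]
    # One-sided score: how far the actual cost exceeded the forecast.
    q = conformal_quantile([actuals[i] - forecasts[i] for i in cal], alpha)
    covered = sum(actuals[i] <= forecasts[i] + q for i in test)
    return q, covered / len(test)
```

For example, with skewed (exponential) residuals around a forecast, `split_conformal_coverage(forecasts, actuals, alpha=0.1)` returns an upper margin `q` and a test coverage that should sit near the nominal 0.9, with no Gaussian assumption on the residuals.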

Anytime-valid bound (Theorem 3). With centered residuals conditionally -sub-Gaussian, is a non-negative supermartingale for each (the increment’s conditional MGF is dominated by ). The mixture over a prior is again a non-negative supermartingale with ; the Gaussian integral evaluates in closed form to . Ville’s inequality [13] gives ; solving for yields the time-uniform boundary , which is the standard sub-Gaussian confidence sequence [12]. Optional stopping extends the bound from fixed to any stopping time .
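
The closed form and boundary can be checked numerically. The sketch below assumes one common parametrization, which may differ from the paper's: the mixing prior on the tilt parameter is N(0, rho^2), residuals are sigma-sub-Gaussian, S_t is the running sum of centered residuals, and the bound is two-sided at level delta. Under those assumptions the mixture equals exp(rho^2 S_t^2 / (2(1 + rho^2 sigma^2 t))) / sqrt(1 + rho^2 sigma^2 t), and setting it equal to 1/delta gives the boundary computed here.

```python
import math


def mixture_value(s, t, sigma, rho):
    """Closed-form N(0, rho^2) mixture of exp(lam*s - lam^2 * sigma^2 * t / 2)."""
    v = 1.0 + rho ** 2 * sigma ** 2 * t
    return math.exp(rho ** 2 * s ** 2 / (2.0 * v)) / math.sqrt(v)


def boundary(t, sigma, rho, delta):
    """Value of |S_t| at which the mixture reaches 1/delta (Ville threshold)."""
    v = 1.0 + rho ** 2 * sigma ** 2 * t
    return math.sqrt(2.0 * v / rho ** 2 * math.log(math.sqrt(v) / delta))
```

By construction, `mixture_value(boundary(t, sigma, rho, delta), t, sigma, rho)` equals `1 / delta` for every `t`, which is the sense in which the boundary is time-uniform: a single threshold on the mixture translates into a widening band on the running sum.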

Appendix D Binding-budget experiment

, with the mean full-snowball cost over seeds; ; the estimator warm-started on a -step independent stream. The soft-penalty frontier sweeps at reference budget . Script: paper/scripts/run_binding_budget.py.

Appendix E Real-trace replay

Dataset anon8231489123/ShareGPT_Vicuna_unfiltered (Hugging Face, permissive research license), streamed; up to conversations / assistant-turn pairs, tokenized with tiktoken cl100k_base, capped to an context window; token counts only, no LLM calls. Split conformal vs. Gaussian- coverage on a conversation-level split (whole conversations assigned to calibration or test, never split across turns, to preserve exchangeability); shift experiment trains on short-conversation residuals and deploys on long — regimes also assigned by whole conversation (each conversation classified by its maximum depth) — comparing fixed-quantile vs. ACI (). The extracted table is cached to paper/data/sharegpt_subset.parquet (git-ignored); committed JSONs are the provenance. Script: paper/scripts/run_realtrace_replay.py.
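
The fixed-quantile versus ACI comparison in the shift experiment can be sketched as follows, using the update rule of Gibbs and Candès [11]: after each step the working miscoverage level moves by gamma times (target alpha minus the observed miss indicator). This is an illustration on synthetic scores, not the replay script: the function names are hypothetical, and the step size used in the usage note is a placeholder because the paper's value is not preserved in the extracted text.

```python
import math
import random


def empirical_quantile(sorted_scores, level):
    """Empirical quantile with the ACI edge cases made explicit."""
    if level >= 1.0:
        return math.inf   # working alpha <= 0: always cover
    if level <= 0.0:
        return -math.inf  # working alpha >= 1: never cover
    k = max(1, math.ceil(level * len(sorted_scores)))
    return sorted_scores[k - 1]


def fixed_vs_aci(cal_scores, stream_scores, alpha, gamma):
    """Coverage on a deployment stream for a fixed quantile and for ACI."""
    cal = sorted(cal_scores)
    q_fixed = empirical_quantile(cal, 1.0 - alpha)
    fixed_hits = sum(s <= q_fixed for s in stream_scores)

    alpha_t, aci_hits = alpha, 0
    for s in stream_scores:
        covered = s <= empirical_quantile(cal, 1.0 - alpha_t)
        aci_hits += covered
        # ACI update: raise alpha_t after a cover, lower it after a miss.
        alpha_t += gamma * (alpha - (0 if covered else 1))
    n = len(stream_scores)
    return fixed_hits / n, aci_hits / n
```

Calibrating on scores from one regime and deploying on a shifted one (for instance, `cal` drawn around 0 and `stream` drawn around a larger mean) makes the fixed quantile under-cover, while the ACI coverage stays within (max(alpha, 1 - alpha) + gamma) / (gamma * T) of the target over a stream of length T.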

Appendix F Reproduction

make paper-data regenerates every committed JSON asset (ablation, learning curve, binding-budget sweep, real-trace calibration and shift); make paper-figures regenerates the eleven figures and figure_stats.json; paper/scripts/check_stats.py verifies that every statistic in the text resolves to figure_stats.json; make paper compiles this PDF (CI: .github/workflows/paper.yml). The benchmark’s reference run is checked under make verify.

Note on proofs. All proofs are self-contained and elementary. Theorem 2 is a direct specialization of the standard inductive-conformal quantile lemma [9, 10]; Theorem 3 is a standard anytime-valid argument via Ville’s inequality [12, 13].

References

  • [1] Besanson, G. (2026). SARC: A Governance-by-Architecture Framework for Agentic AI Systems: Compiling Regulatory Obligations into Runtime Constraints. Working paper, Universidad Torcuato Di Tella. arXiv:2605.07728. Code: https://github.com/besanson/sarc-governance.
  • [2] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguía, L.-M., Rothchild, D., So, D., Texier, M., & Dean, J. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350.
  • [3] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. In Proc. 57th ACL (pp. 3645–3650). doi:10.18653/v1/P19-1355.
  • [4] Lacoste, A., Luccioni, A., Schmidt, V., & Dandres, T. (2019). Quantifying the Carbon Emissions of Machine Learning. arXiv:1910.09700.
  • [5] FinOps Foundation (2023). FinOps Framework: Principles, Domains, and Capabilities for Cloud Financial Management. https://www.finops.org/framework/.
  • [6] Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, Í., Maleki, S., & Bianchini, R. (2024). Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proc. ISCA.
  • [7] Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., & Zhang, C. (2023). FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Proc. ICML.
  • [8] Li, Z., Zheng, L., Zhong, Y., Liu, V., Sheng, Y., Jin, X., Huang, Y., Chen, Z., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In Proc. OSDI.
  • [9] Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
  • [10] Angelopoulos, A. N., & Bates, S. (2023). Conformal Prediction: A Gentle Introduction. Foundations and Trends in Machine Learning, 16(4), 494–591.
  • [11] Gibbs, I., & Candès, E. (2021). Adaptive Conformal Inference Under Distribution Shift. In Advances in Neural Information Processing Systems (NeurIPS).
  • [12] Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2), 1055–1080.
  • [13] Ramdas, A., Grünwald, P., Vovk, V., & Shafer, G. (2023). Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38(4), 576–601.
  • [14] ShareGPT community (2023). ShareGPT_Vicuna_unfiltered: a corpus of real ChatGPT/GPT-4 conversations. Hugging Face Hub: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
  • [15] Wang, Y., Chen, Y., Li, Z., Kang, X., Fang, Y., Zhou, Y., Zheng, Y., Tang, Z., He, X., Guo, R., Wang, X., Wang, Q., Zhou, A. C., & Chu, X. (2025). BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. In Proc. 31st ACM SIGKDD V.2. arXiv:2401.17644. https://github.com/HPMLL/BurstGPT.
  • [16] Electricity Maps ApS. (2024). Electricity Maps API: hourly carbon intensity by zone. https://api.electricitymap.org/v3. Free tier, attribution required. Data sources documented at https://github.com/electricitymaps/electricitymaps-contrib.
  • [17] Nebius (2025). SWE-rebench OpenHands trajectories: 67k agent traces solving real GitHub issues with Qwen3-Coder-480B. CC-BY-4.0. https://huggingface.co/datasets/nebius/SWE-rebench-openhands-trajectories.
  • [18] Luccioni, A. S., Jernite, Y., & Strubell, E. (2024). Power Hungry Processing: Watts Driving the Cost of AI Deployment? In Proc. 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT). arXiv:2311.16863.
  • [19] Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., Bergeron, W., Kepner, J., Tiwari, D., & Gadepally, V. (2023). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. In Proc. IEEE High Performance Extreme Computing Conference (HPEC). arXiv:2310.03003.
  • [20] Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176.
  • [21] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665.
  • [22] Anderson, J. P. (1972). Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, USAF Electronic Systems Division, Hanscom AFB. (Origin of the reference monitor concept.)
  • [23] Carlini, N., Nasr, M., Choquette-Choo, C. A., et al. (2023). Are aligned neural networks adversarially aligned? In Advances in Neural Information Processing Systems (NeurIPS).
  • [24] Greshake, K., Abdelnabi, S., Mishra, S., et al. (2023). Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proc. 16th ACM Workshop on Artificial Intelligence and Security (AISec).
  • [25] Welford, B. P. (1962). Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3), 419–420.
  • [26] Badanidiyuru, A., Kleinberg, R., & Slivkins, A. (2018). Bandits with Knapsacks. Journal of the ACM, 65(3), 1–55.
  • [27] European Parliament and Council (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. (See Arts. 12, 13, 15.)
  • [28] European Parliament and Council (2022). Directive (EU) 2022/2464 amending Regulation (EU) No 537/2014 and Directives 2004/109/EC, 2006/43/EC and 2013/34/EU, as regards corporate sustainability reporting (CSRD). Official Journal of the European Union.