Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems

Gaston Besanson, Universidad Torcuato Di Tella

(Preprint, June 2026)

Abstract

Agentic AI systems act through tools and sub-agents, yet the controls meant to bound their financial and environmental cost still sit on dashboards evaluated beside or after execution. Green SARC applies the SARC governance-by-architecture framework, with its four enforcement sites in the agent loop, to FinOps and GreenOps, contributing the theory of what to enforce and how to predict it. We report four policy-independent results. (i) The unconstrained “State Snowball” is quadratic in loop depth; on real multi-step plans (SWE-rebench) it holds on , with median curvature exceeding the linear-accretion prediction: real plans accrete faster than the model (§11.6). (ii) On real residuals the Normal- gate under-covers ( at nominal ); split-conformal calibration holds (; Theorem 2). (iii) A soft Lagrangian penalty tuned to the budget in expectation breaches it on of seeds; the architectural gate breaches . (iv) Under binding budgets the gate’s over-budget incidence is on synthetic and real (BurstGPT) arrivals. End-to-end token/USD/carbon savings (–) are real but policy-dependent in magnitude: they are set by a scope-cap knob, not by gate rejections. The library is open-source, dependency-free, and ships a regeneration script for every cited number.

Keywords: agentic AI, governance-by-architecture, predictive FinOps, GreenOps, token economics, conformal prediction, runtime constraints, SARC.

1 Introduction

The cost center of artificial intelligence has shifted from training, whose resource envelope is fixed at design time, to the inference trajectory: the runtime-determined sequence of model calls, tool invocations, and conditional retries an agent emits while pursuing a goal. A classical inference call has a bounded, predictable cost. An agentic workflow has neither: the same task, executed twice, can differ by an order of magnitude in token consumption. Both the API bill and the energy draw are therefore stochastic quantities governed by the execution trace, not the specification.

Two instruments are commonly deployed against this volatility. Post-hoc auditing reconciles spend after the billing period closes. Policy-as-code encodes budget rules in a layer evaluated alongside, but not inside, the agent loop. Both inherit the defect SARC identified for correctness obligations: they evaluate constraints after, or beside, the execution they are meant to bound. A budget breach detected at month-end cannot un-spend the tokens; a carbon overage logged to a dashboard cannot un-emit the carbon.

Relationship to SARC.

SARC [1] is a governance-by-architecture framework that treats constraints as first-class specification objects and compiles them into four enforcement sites: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. Green SARC is an application of that architecture — we reuse the four sites unchanged — but carries its own theory, orthogonal to SARC’s correctness results. SARC governs whether the system is right; Green SARC governs what the system costs. The two are independent axes that happen to share enforcement sites.

Contributions.

1. State-Snowball theorem, formal and empirical (§4). Naive context accretion yields quadratic cumulative prompt cost (Theorem 1); the synthetic fit recovers the closed-form coefficient exactly, and on real ShareGPT traffic the cumulative-prompt curvature is negative: the snowball is an artifact of naive orchestration, not of chat itself (§4, §10).

2. Predictive Pre-Action Gate with calibration and an anytime-valid safety bound (§5, §7). We generalize the gate to a learned forecast (of which rule-based accounting is the zero-information limit), give split-conformal marginal safety (Theorem 2), and an anytime-valid trajectory over-spend bound via Ville’s inequality (Theorem 3).

3. Binding-budget gate evaluation on the Pareto frontier (§9). Across a budget grid the gate’s empirical over-budget incidence stays at or below while completing more work at zero overspend, dominating the soft-penalty frontier.

4. Real-trace coverage validation on ShareGPT (§10). On real, non-Gaussian residuals the Normal- gate under-covers at tight while split conformal holds nominal coverage; adaptive conformal restores coverage under distribution shift.

5. Ablation with paired-bootstrap CIs (§8). A four-condition ablation decomposes the saving by lever (scope, routing, breaker), each with a CI.

6. Real-arrival ablation on a public production trace (§11). The four-condition ablation re-run on the BurstGPT trace of real Azure OpenAI traffic reproduces the synthetic savings ordering under real burstiness and prompt/response distributions, with paired-bootstrap CIs.

7. Open-source library and reproducible benchmark (§6). A dependency-free implementation with a regression contract enforced in CI via make verify.

We also relate the gate to Bandits-with-Knapsacks as a feasibility oracle (Proposition 3). The decoupling of cost from correctness governance (§3) frames the artifact’s scope.

In one line: SARC gives us where to enforce; we contribute what to enforce, why it decouples from safety, how to predict it, and the evidence that the prediction is calibrated and the architecture is necessary.

2 Background and Related Work

SARC.

A SARC specification declares, per constraint, its source, class, predicate, verification point, response protocol, and operating point, and compiles these into the four enforcement sites named above. It formalizes the minimal invariants for specification–trace correspondence and argues that finite reward penalties do not in general substitute for hard runtime constraints — a claim we make quantitative for the cost domain in §13.

FinOps and GreenOps.

FinOps brings financial accountability to variable cloud spend; GreenOps extends the discipline to carbon and energy. The financial and environmental cost of large-scale AI compute has been quantified for the training regime [3, 2, 4]; the inference regime studied here shifts this cost into a runtime-variable, per-trajectory quantity. Both disciplines are predominantly practiced as observe-and-reconcile loops [5], a cadence adequate only when the consumption unit is predictable. Agentic inference violates that premise.

Efficient inference systems.

A large systems literature reduces the unit cost of inference: phase-split serving [6], high-throughput single-GPU offloading [7], and statistical multiplexing across models [8]. These optimize how a fixed set of calls is served; Green SARC is complementary and orthogonal: it governs which calls an agent is permitted to make, given a budget, before they are issued. The two compose cleanly. If a serving optimization divides the effective per-token cost by some efficiency factor greater than one, then a fixed token budget admits that factor times as much work, since dividing the forecast cost by the factor in the gate’s feasibility test is equivalent to multiplying the budget by it; the carbon ceiling scales identically through the proportional energy saving. Cheaper serving thus relaxes the gate’s effective budget by a known factor rather than changing its mechanism.
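The composition claim can be written out as a one-line worked equivalence. The symbols are illustrative assumptions rather than the paper's own notation: $\hat{c}$ is the gate's forecast cost of an action at the original per-token cost, $B$ is the remaining budget, and $\eta > 1$ is the serving efficiency factor.

```latex
\[
  \frac{\hat{c}}{\eta} \le B
  \quad\Longleftrightarrow\quad
  \hat{c} \le \eta B ,
  \qquad \eta > 1 .
\]
```

Read this way, a serving speed-up with $\eta = 2$ lets the unchanged gate admit actions whose original-cost forecast is up to twice the remaining budget; only the effective budget moves, not the test.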

Cost-aware routing and cascades.

A complementary line reduces cost by choosing which model answers a query: FrugalGPT learns an LLM cascade that escalates to a stronger model only when a cheaper one is judged inadequate [20], and RouteLLM trains a binary router between a strong and a weak model from preference data [21]. These optimize the per-query model choice to maximize quality at lower expected cost, but they offer no hard guarantee: a router tuned to spend less in expectation can still overrun any fixed budget on an adversarial or heavy-tailed query stream, exactly the soft-constraint failure we quantify in §13. Green SARC is orthogonal and composable: it is the enforcement contract a router runs inside, supplying the per-action feasibility test that turns an expected-cost heuristic into a budget-safe one (the CBwK feasibility-oracle framing of Proposition 3). We concede the overlap honestly: Green SARC’s energy-aware routing lever (§8) is mechanistically the same idea as FrugalGPT’s cascade — down-route when a cheaper model suffices — and our contribution there is not the router but the gate that bounds it.
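As a rough illustration of a router running inside the gate's contract, the sketch below walks a router's ranked candidates and returns the first one whose padded forecast fits the remaining budget. The function name, the model names, and the fixed `safety_margin` are all hypothetical; in practice the margin would come from a calibrated forecast rather than a constant.

```python
def route_within_budget(candidates, remaining_budget, safety_margin=0.0):
    """Return the first candidate whose padded forecast fits, else None.

    candidates: (model_name, forecast_cost) pairs in the router's preference
    order, for example strongest model first. All names are hypothetical.
    """
    for model_name, forecast_cost in candidates:
        # Per-action feasibility test: forecast plus margin must fit the budget.
        if forecast_cost + safety_margin <= remaining_budget:
            return model_name
    return None  # nothing fits: the caller would hand off to escalation


candidates = [("strong-model", 900.0), ("weak-model", 150.0)]
choice_roomy = route_within_budget(candidates, remaining_budget=1000.0, safety_margin=50.0)
choice_tight = route_within_budget(candidates, remaining_budget=400.0, safety_margin=50.0)
choice_empty = route_within_budget(candidates, remaining_budget=100.0, safety_margin=50.0)
```

The router still decides the preference order; the feasibility test only removes options that cannot be afforded, which is the sense in which the gate bounds the router rather than replacing it.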

Conformal prediction.

Our safety guarantee rests on split (inductive) conformal prediction, which converts any point predictor into a set/interval with distribution-free, finite-sample marginal coverage under exchangeability [9, 10]. Where residuals are non-exchangeable (distribution shift), adaptive conformal inference restores coverage online [11]; we flag this as the path to robustness on real traces (§15).
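A minimal sketch of how a split-conformal margin could feed a budget gate, assuming one-sided residuals (realized cost minus forecast cost) collected on a held-out calibration set. The function names and the sample residuals are illustrative and are not taken from the library.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal upper quantile of calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual, which gives
    marginal coverage of at least 1 - alpha under exchangeability.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]


def admit(forecast, remaining_budget, margin):
    """Admit the action only if forecast plus conformal margin fits the budget."""
    return forecast + margin <= remaining_budget


# Hypothetical calibration residuals, in tokens (realized minus forecast).
calibration = [10, -5, 30, 0, 20, 5, 15]
margin = conformal_quantile(calibration, alpha=0.25)
```

With too few calibration points for the requested risk level the quantile is infinite, so the gate refuses every action instead of silently under-covering.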

Constrained decision-making.

Admitting actions under a depletable budget is formally a Bandits-with-Knapsacks / constrained-MDP problem [26]. Green SARC does not solve the optimal-policy problem; it provides the enforcement primitive (a calibrated per-action feasibility test) that any such policy needs at runtime, and contrasts it with the soft-penalty (Lagrangian) relaxation in §13. We make the relationship precise in Proposition 3: the gate composes with any sublinear-regret CBwK policy as a feasibility oracle without changing its regret order.

Architectural lineage.

The four-site, enforce-in-the-loop design has ancestry well beyond the author’s own SARC framework. The reference monitor of Anderson’s 1972 security study — a mediation mechanism that must be invoked on every access, be tamper-proof, and be small enough to verify [22] — is the direct conceptual ancestor of the Pre-Action Gate: a non-bypassable check interposed before each consequential operation. Admission control in networking and queueing systems (admit a flow only if its reserved rate fits the remaining capacity) is the same two-phase reserve-then-commit primitive our Budget implements. And runtime verification — synthesizing monitors that check an execution against a specification as it runs — is the correctness-domain analogue of the Action-Time Monitor and Post-Action Auditor. Green SARC’s novelty is not the enforce-in-the-loop stance itself but its application to predicted cost and carbon, with a calibrated forecast standing in for the boolean access check.
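The reserve-then-commit primitive mentioned above can be sketched in a few lines. This is a hypothetical single-process illustration guarded by a `threading.Lock`; the method names and the settlement rule are assumptions, not the library's actual `Budget` API.

```python
import threading


class Budget:
    """Hypothetical two-phase token budget: reserve before acting, commit after."""

    def __init__(self, total):
        self._lock = threading.Lock()
        self._remaining = total  # tokens neither spent nor currently held
        self._reserved = {}      # reservation id -> tokens held
        self._next_id = 0

    def reserve(self, forecast):
        """Hold `forecast` tokens; return a reservation id, or None if it does not fit."""
        with self._lock:
            if forecast > self._remaining:
                return None
            self._remaining -= forecast
            rid = self._next_id
            self._next_id += 1
            self._reserved[rid] = forecast
            return rid

    def commit(self, rid, actual):
        """Settle a reservation: refund unused tokens, or charge any overage."""
        with self._lock:
            held = self._reserved.pop(rid)
            self._remaining += held - actual

    @property
    def remaining(self):
        with self._lock:
            return self._remaining


budget = Budget(100)
first = budget.reserve(60)       # fits: 40 tokens left
rejected = budget.reserve(50)    # does not fit while 60 are held
budget.commit(first, actual=45)  # refund 15 unused tokens: 55 left
```

Holding the forecast during execution is what prevents two concurrent actions from both being admitted against the same remaining capacity.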

What Green SARC is not.

Pure observability tools (LangSmith, Helicone, raw OpenTelemetry) give post-hoc cost without enforcement: they tell you what was spent, after it was spent. API-level rate limits (provider tier limits, sidecar throttling on request counts) enforce request counts, not predicted cost or carbon, so a single expensive call passes unchecked. In-agent budget tracking (framework callbacks) is in-process and bookkeeping-only, with no cross-process attribution and no enforcement contract spanning the four sites. Green SARC differs by being a four-site governance contract whose gate predicts cost before the action fires; the closest comparable systems govern only post hoc, or only on request counts.

Regulatory context.

Enterprise deployments increasingly must keep auditable records of automated decisions and account for system accuracy and the energy footprint of AI. The EU AI Act mandates automatic record-keeping/logging (Art. 12), transparency and information provision (Art. 13), and accuracy/robustness with documented metrics (Art. 15) [27]; the Corporate Sustainability Reporting Directive (CSRD) extends mandatory sustainability disclosure to in-scope undertakings [28]. Green SARC’s Post-Action Auditor produces, as a byproduct of execution, the attribution-preserving, predicted-vs-actual trace these regimes require, extended to per-trajectory token yield and a carbon proxy.

3 Decoupling FinOps Governance from Correctness Governance

We state the decoupling explicitly because it defines the scope of the artifact.

Proposition 1 (Independence of axes).

Let a governance layer be characterized by the predicate class it enforces. SARC’s correctness layer enforces predicates over action validity (is this action safe and permitted?). The Green SARC layer enforces predicates over resource consumption (does this action fit the cost and carbon budget?). The two predicate classes share enforcement sites but neither implies the other: a perfectly safe agent can be ruinously expensive, and a perfectly cheap agent can be unsafe.

Consequences for the artifact:

• Green SARC is deployable with no safety regime present. Its value derives from the cloud bill, which every operator incurs.

• It tracks cost and carbon only; correctness/accuracy is deliberately out of scope and is not logged as a governed quantity. The quality floor (§5) is the caller’s concern.

• It composes with SARC where both are wanted (the sites are shared) but does not depend on it. The reference implementation has no dependency on SARC.

4 The State-Snowball Cost Theorem

Definition 1 (State Snowball).

A multi-agent loop exhibits the State Snowball when each step re-submits the full accreted context, so the per-step prompt grows monotonically with step index.

Assumption 1 (Linear accretion).

The prompt at step $k$ (zero-indexed) is $P_k = P_0 + k\Delta$ tokens: a fixed base $P_0$ plus $\Delta$ tokens appended per hop. This is the regime in which the unconstrained loop is studied; sub-linear summarization is exactly the mitigation we analyze.

Theorem 1 (Quadratic cost of the unconstrained loop).

Under Assumption 1, the cumulative prompt-token cost over $N$ steps is

$$C(N) = \sum_{k=0}^{N-1} \left(P_0 + k\Delta\right) = N P_0 + \frac{\Delta}{2}\, N(N-1) = \Theta(N^2) \qquad (1)$$

with leading coefficient $\Delta/2$.

Proof.

Direct summation of the arithmetic series $\sum_{k=0}^{N-1} k = N(N-1)/2$. The dominant term is $\frac{\Delta}{2} N^2$, with leading coefficient $\Delta/2$. ∎
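The theorem is easy to check numerically. The sketch below assumes a base prompt of `p0` tokens and `delta` tokens appended per hop (illustrative names and values), compares the direct sum with the closed form, and reads the leading coefficient off the second finite difference.

```python
def cumulative_prompt_cost(p0, delta, n):
    """Sum the per-step prompt sizes p0 + k * delta for k = 0 .. n - 1."""
    return sum(p0 + k * delta for k in range(n))


def closed_form(p0, delta, n):
    """Arithmetic-series closed form: n * p0 + delta * n * (n - 1) / 2."""
    # n * (n - 1) is always even, so integer division is exact here.
    return n * p0 + delta * n * (n - 1) // 2


def second_difference(p0, delta, n):
    """C(n + 2) - 2 C(n + 1) + C(n); equals twice the quadratic coefficient."""
    c = cumulative_prompt_cost
    return c(p0, delta, n + 2) - 2 * c(p0, delta, n + 1) + c(p0, delta, n)


# Hypothetical parameters: 500-token base, 200 tokens appended per hop.
P0, DELTA = 500, 200
leading_coefficient = second_difference(P0, DELTA, 10) / 2
```

The second difference is constant and equal to the per-hop increment, so the quadratic coefficient is half of it, independent of the base prompt size.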

Empirical confirmation (synthetic).

Figure 1 plots the cumulative prompt cost of the benchmark’s baseline against loop depth, with , . A second-order least-squares fit recovers , identical to the closed-form ; the residual is numerically zero. This verifies that the simulator faithfully realizes Assumption 1 (the recovered coefficient is a property of the simulator’s construction, not independent evidence that real workloads accrete linearly). Bounding the per-hop increment with an Adapter Node (scope cap tokens) collapses the curve to linear: at depth the scoped cost is lower than the snowball. The Action-Time circuit breaker caps directly, bounding the other factor.
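To see why scoping collapses the curve, the sketch below compares the unconstrained loop with one in which the accreted part of each prompt is truncated to a fixed cap. That truncation rule is an assumption about what the Adapter Node does, and all parameter values are illustrative rather than the benchmark's.

```python
def snowball_cost(p0, delta, n):
    """Cumulative prompt tokens when the full accreted context is re-sent each step."""
    return sum(p0 + k * delta for k in range(n))


def scoped_cost(p0, delta, n, scope_cap):
    """Same loop, with the accreted part of each prompt truncated to scope_cap tokens."""
    return sum(p0 + min(k * delta, scope_cap) for k in range(n))


# Hypothetical values: 500-token base, 200 tokens per hop, 400-token scope cap.
unscoped = snowball_cost(500, 200, 10)
scoped = scoped_cost(500, 200, 10, scope_cap=400)
linear_bound = 10 * (500 + 400)  # n * (p0 + scope_cap)
```

Once the cap binds, every further step costs the same amount, so the cumulative cost is bounded by a line in the step count instead of growing quadratically.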

Real chat traffic.

Whether real multi-turn traffic accretes quadratically is a separate, empirical question. On the ShareGPT replay of §10 ( conversations, up to turns) we fit the cumulative billed prompt tokens against turn depth to a quadratic. The leading coefficient is with paired-bootstrap CI — significantly negative. Real conversations are concave in depth, not convex: humans and well-behaved assistants do not blindly re-submit the full transcript every turn. The snowball is therefore a failure mode of naive multi-agent orchestration (full-context re-submission), not an intrinsic property of conversation — which is precisely why the Adapter-Node scoping that prevents it is the highest-leverage lever in the ablation (§8).

Symbol	Meaning
	proposed action; its context
	live remaining token budget
	marginal carbon intensity (gCO₂e/kWh), region , time ; real grid data used in §11.5 [16], stipulated by default
	latency/SLA headroom (declared in the state; not enforced in Phase 1)
	forecast token cost; forecast carbon
	realized token cost; realized carbon
	estimator residual standard deviation (per key)
	gate risk level; operating-point confidence is
	split-conformal quantile of calibration residuals
	carbon ceiling; carbon already spent on trajectory

(5)	Constrained program; each constraint is bound to the site that enforces it:
	s.t.	budget (Pre-Action Gate)
		loop bound (Action-Time Monitor)
		ESG ceiling (Post-Action Auditor)
		quality floor (caller-owned)

Site	Green SARC predicate / role	Module
Pre-Action Gate	Predictive cost/carbon forecast; admit iff at and carbon fits.	gate.py, estimator.py
Action-Time Monitor	Circuit breaker on loop count / marginal cost; kills runaway retry/re-plan loops.	monitor.py
Post-Action Auditor	Logs predicted-vs-actual cost/carbon per action: ESG record and estimator training signal.	auditor.py
Escalation Router	Routes budget-/carbon-exhausted tasks to human review or a deterministic fallback.	escalation.py
State scoping	Adapter Node bounds the per-hop increment (§4).	scoping.py
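The Action-Time Monitor row can be illustrated with a small circuit breaker that trips on either loop count or marginal step cost. The class and method names are hypothetical and do not reflect the contents of monitor.py.

```python
class CircuitBreakerTripped(RuntimeError):
    """Raised when a trajectory exceeds its loop or marginal-cost limit."""


class ActionTimeMonitor:
    def __init__(self, max_loops, max_marginal_cost):
        self.max_loops = max_loops
        self.max_marginal_cost = max_marginal_cost
        self.loops = 0

    def record(self, step_cost):
        """Call once per step; raises if either limit is exceeded."""
        self.loops += 1
        if self.loops > self.max_loops:
            raise CircuitBreakerTripped(f"loop count {self.loops} exceeds {self.max_loops}")
        if step_cost > self.max_marginal_cost:
            raise CircuitBreakerTripped(f"step cost {step_cost} exceeds {self.max_marginal_cost}")


def run(step_costs, monitor):
    """Execute steps until the breaker trips; return how many completed."""
    completed = 0
    for cost in step_costs:
        try:
            monitor.record(cost)
        except CircuitBreakerTripped:
            break
        completed += 1
    return completed
```

For example, `run([100] * 10, ActionTimeMonitor(max_loops=3, max_marginal_cost=1000))` stops a runaway retry loop after three steps, while a single oversized step trips the marginal-cost limit immediately.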

Capability	Phase 1 (today)	Phase 2 (roadmap)
Pre-Action Gate (step)	Normal-; conformal opt-in via calibrator=	ACI as default + conditional coverage
Predictive forecast	OLS per	trajectory estimator
Action-Time Monitor	Loop / marginal / total cost breaker	latency-headroom enforcement
Post-Action Auditor	JSONL / SQLite	Parquet, multi-tenant attribution
Budget	Single-process threading.Lock; distributed Redis backend (experimental)	Postgres durable ledger + fair-share reservations
Escalation Router	Deterministic + log-only handlers	plan-level rejection on trajectory forecast
Adapters	MCP, PAIS sidecar, OTel SpanProcessor	cross-process OTLP receiver; MCP transport auth
Audit schema	plan_id, session_id, parent_action_id	Phase-2 trajectory schema (typed events)
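A predicted-vs-actual audit line in JSONL form might look like the sketch below. Only `plan_id`, `session_id`, and `parent_action_id` come from the audit schema row above; every other field name, and both helper functions, are assumptions made for illustration.

```python
import io
import json


def audit_record(plan_id, session_id, parent_action_id,
                 predicted_tokens, actual_tokens,
                 predicted_gco2e, actual_gco2e):
    """Build one audit row; the error fields double as estimator training signal."""
    return {
        "plan_id": plan_id,
        "session_id": session_id,
        "parent_action_id": parent_action_id,  # None for a root action
        "predicted_tokens": predicted_tokens,
        "actual_tokens": actual_tokens,
        "token_error": actual_tokens - predicted_tokens,
        "predicted_gco2e": predicted_gco2e,
        "actual_gco2e": actual_gco2e,
        "carbon_error": actual_gco2e - predicted_gco2e,
    }


def append_jsonl(stream, record):
    """Write one record as a single JSON line to any text stream."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


log = io.StringIO()  # stands in for an append-mode log file
append_jsonl(log, audit_record("p1", "s1", None, 1000, 1020, 0.40, 0.42))
row = json.loads(log.getvalue().splitlines()[0])
```

Storing the signed error alongside both values keeps each line self-contained, so a calibration job can read residuals directly without re-joining forecasts to outcomes.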

Attack	Admission	Over-budget	Realized/declared	Gate failure mode
Continuation inflation	under-estimates
Scope-cap-aware padding	over-admits
Model-substitution gaming	under-estimates

Condition	Token reduction	USD reduction	Carbon reduction (time-var.)
+scope
+scope+route
+full

	admission	over-budget	completed	MAE (tok)	tokens
	M
	M
	M
	M
	M
	M

Condition	Token reduction	USD reduction	Carbon reduction
+scope
+scope+route
+full

Condition	stipulated ()	IT ()	US-CAISO ()
+scope
+scope+route
+full

URL: https://arxiv.org/html/2606.15954v1