The energy efficiency of agent networks.
A controlled benchmark of how VDF AI reduces the energy footprint of enterprise AI by decomposing work into agent networks structured as directed acyclic graphs (DAGs) and dispatching each step through SEEMR self-evolving model routing. The result: up to a 94.9% reduction in predicted energy, with output quality held non-inferior in aggregate.
Most of the energy an AI system consumes in production is spent at inference, where a trained model is invoked again and again[6]. That energy is not a fixed property of a model. It is the outcome of a decision: which model runs, broken into how many steps, under what objective.
This paper reports a benchmark of that decision inside VDF AI. We compare a high-intensity baseline (one large model answering the whole task) against two compounding strategies: routing each request under an energy-aware objective, and decomposing a workload into a directed graph of smaller, independently routed stages. Across 71 configurations spanning four token budgets and five scenario families, energy-led routing reduced predicted energy by 81–95%, with a stable ~94.8% reduction for the frontier-versus-compact pairing.
Crucially, savings without quality are meaningless. In a separate execution benchmark with a quality score recorded per task, the routed condition reduced predicted energy by 94.9% while remaining non-inferior in aggregate under a margin fixed in advance — with the task-level exceptions disclosed in full. The contribution here is not a single number; it is an auditable account of how routing and decomposition turn energy into something an enterprise can measure, steer, and defend.
AT A GLANCE
Six numbers from the benchmark
predicted energy removed by eco routing vs. a pinned frontier baseline
less predicted energy per workload at the same task, frontier vs. routed
routed quality held within a pre-registered 0.10 margin in aggregate
71 configurations across five scenario families and four token budgets
81–95% reduction band observed across different model pairings
54% of predicted energy still avoided when one DAG stage deliberately keeps the frontier model
FIGURE 1
The same work, a fraction of the energy
Aggregate of the quality-constrained execution benchmark: a pinned high-intensity baseline versus energy-aware routing, with the quality guardrail satisfied.
Fig. 1. Predicted energy in watt-hours for an identical task set. Figures are coefficient-based predictions under benchmark conditions, not measured wall power.
Why inference energy is a decision, not a constant
A model is trained once and served billions of times. The integral of that serving tail now dominates the one-off training spike[6][8], which means the most leveraged place to reduce AI's footprint is the dispatcher that decides, per request, which model runs and how the work is split.
Enterprises increasingly have to attribute that energy — for sustainability reporting, for internal chargeback, and for procurement decisions that no longer accept a single annual number. So the question this paper answers is concrete: if you hold the task fixed and change only the routing and decomposition strategy, how much energy moves? And does quality survive the change?
We answer with a benchmark rather than an assertion. Two forms of evidence are reported: a coefficient-based comparison that isolates the effect of routing policy under fixed token assumptions, and a quality-constrained execution benchmark that pairs each energy figure with a measured quality score. The first tells us how big the lever is; the second tells us whether pulling it costs anything.
The routing objective is a dial you control
The same candidate pool, three presets. Eco leans into energy; Max-Quality deliberately holds the heavy model. That Max-Quality lands at exactly 0% saving is the point — it proves the savings come from the policy, not from a benchmark quietly favouring the small model.
Model pairings compared: frontier-class vs. compact local model; heavy tier vs. light tier.
The reduction is not a single magic figure. It scales with the energy gap between the candidates available to the router: a wide gap (frontier vs. compact) yields ~95%, while a narrower one (heavy tier vs. light tier) yields ~81%. We report the band honestly because that is what a buyer needs to size their own deployment.
| Token budget | Baseline (Wh) | Routed (Wh) | Energy avoided |
|---|---|---|---|
| 500 in · 500 out | 4.30 | 0.225 | 94.77% |
| 1 000 in · 1 000 out | 8.60 | 0.450 | 94.77% |
| 256 in · 512 out | 3.79 | 0.200 | 94.73% |
| 2 000 in · 500 out | 7.90 | 0.405 | 94.87% |
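The table rows are consistent with a simple linear model in which predicted energy is a per-token coefficient times the token count, summed over input and output. The sketch below illustrates that arithmetic. The coefficient values are an assumption: the page does not publish them, and these were back-solved so that the four rows above are reproduced after rounding. The names `EnergyCoefficients`, `BASELINE`, `ROUTED`, and `table_rows` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyCoefficients:
    """Per-token energy in watt-hours (hypothetical, back-solved values)."""
    wh_per_input_token: float
    wh_per_output_token: float

    def predict_wh(self, input_tokens: int, output_tokens: int) -> float:
        # Linear coefficient model: energy grows with token counts.
        return (self.wh_per_input_token * input_tokens
                + self.wh_per_output_token * output_tokens)


# Assumption: not stated in the source; chosen to match the table after rounding.
BASELINE = EnergyCoefficients(wh_per_input_token=0.0024, wh_per_output_token=0.0062)
ROUTED = EnergyCoefficients(wh_per_input_token=0.00012, wh_per_output_token=0.00033)

# (input tokens, output tokens) for each table row.
TOKEN_BUDGETS = [(500, 500), (1000, 1000), (256, 512), (2000, 500)]


def table_rows():
    """Return (in, out, baseline Wh, routed Wh, % avoided) for each budget."""
    rows = []
    for tokens_in, tokens_out in TOKEN_BUDGETS:
        base = BASELINE.predict_wh(tokens_in, tokens_out)
        routed = ROUTED.predict_wh(tokens_in, tokens_out)
        avoided_pct = 100.0 * (1.0 - routed / base)
        rows.append((tokens_in, tokens_out,
                     round(base, 2), round(routed, 3), round(avoided_pct, 2)))
    return rows


if __name__ == "__main__":
    for row in table_rows():
        print(row)
```

Under this model the percentage avoided depends only on the ratio of the two models' coefficients and the input/output mix, which would explain why it barely moves across token budgets.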
Don't send one big model. Send a network.
A monolithic call routes the entire workload to a single heavy model. A VDF agent network breaks the same workload into a directed graph of smaller stages — each routed on its own — so the expensive model is used only where it earns its keep.
Fig. 3. Fixed total workload (2 400 input · 1 800 output tokens). The last row keeps one stage on the frontier model on purpose and still avoids 54% of predicted energy — selective use, not all-or-nothing.
Energy fell. Quality was watched the whole time.
A separate execution benchmark scored routed output against the pinned baseline on a curated task set. In aggregate the routed arm stayed non-inferior under a 0.10 margin set in advance — and we publish the one task that slipped rather than hide it.
Two tasks preserved quality exactly while shedding ~95% of energy. One — factual recall — degraded at the task level, and one — exact arithmetic — was equally imperfect on both sides, so it neither helped nor hurt the comparison. The defensible claim is therefore precise: large energy reductions with quality non-inferior on average across the evaluated set, not a blanket promise that every single task is untouched. That distinction is what separates a credible result from a marketing number.
What produces the saving
Four mechanisms compound. None of them is exotic; the result comes from making each one explicit and letting them work together.
Energy as a first-class routing objective
Every candidate model is scored on quality, latency, cost, and energy together. Named presets — Eco, Balanced, Max-Quality — shift the weight on energy explicitly, so sustainability is a setting an operator chooses, not an accident of which model happened to be wired in.
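One way to picture a preset is as a set of weights over the four criteria. The sketch below is a minimal illustration under that assumption, not VDF AI's scoring function: the weights, the normalised candidate scores, and the names `Candidate`, `PRESETS`, and `route` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # 0..1, higher is better
    latency: float  # 0..1 normalised, lower is better
    cost: float     # 0..1 normalised, lower is better
    energy: float   # 0..1 normalised, lower is better


# Hypothetical weights; the source names the presets but not their values.
PRESETS = {
    "eco":         {"quality": 0.2, "latency": 0.1, "cost": 0.1, "energy": 0.6},
    "balanced":    {"quality": 0.4, "latency": 0.2, "cost": 0.2, "energy": 0.2},
    "max_quality": {"quality": 1.0, "latency": 0.0, "cost": 0.0, "energy": 0.0},
}


def score(candidate: Candidate, weights: dict) -> float:
    # Reward quality, penalise latency, cost, and energy.
    return (weights["quality"] * candidate.quality
            - weights["latency"] * candidate.latency
            - weights["cost"] * candidate.cost
            - weights["energy"] * candidate.energy)


def route(candidates: list, preset: str) -> Candidate:
    """Pick the highest-scoring candidate under the named preset."""
    weights = PRESETS[preset]
    return max(candidates, key=lambda c: score(c, weights))


# Illustrative candidate pool with made-up normalised scores.
POOL = [
    Candidate("frontier", quality=0.95, latency=0.8, cost=0.9, energy=1.0),
    Candidate("compact", quality=0.85, latency=0.2, cost=0.1, energy=0.05),
]

if __name__ == "__main__":
    for preset in PRESETS:
        print(preset, "->", route(POOL, preset).name)
```

With these illustrative numbers, the quality-only preset keeps the heavy model and the energy-weighted preset moves to the compact one, so the model choice follows from the weights rather than from the pool.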
DAG-based agent networks
Instead of sending an entire workload to one large model, a network decomposes it into a directed graph of smaller stages. Each stage is routed independently, so the heavy model is reserved only for the steps that genuinely need it.
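The decomposition can be sketched as a small graph whose stages each carry their own token budget and model assignment. Everything specific here is an assumption for illustration: the stage names, the way the 2 400 input and 1 800 output tokens are split, and the per-token coefficients are hypothetical and do not reproduce the benchmark's figure. The sketch uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to confirm the graph is acyclic.

```python
from graphlib import TopologicalSorter

# Hypothetical per-token coefficients in Wh: (input, output).
WH_PER_TOKEN = {
    "frontier": (0.0024, 0.0062),
    "compact": (0.00012, 0.00033),
}

# Hypothetical stages: name -> (dependencies, input tokens, output tokens, model).
# Token budgets sum to 2400 input and 1800 output.
STAGES = {
    "extract":   (set(),                  800, 400, "compact"),
    "classify":  ({"extract"},            400, 200, "compact"),
    "reason":    ({"extract", "classify"}, 800, 800, "frontier"),
    "summarise": ({"reason"},             400, 400, "compact"),
}


def stage_wh(model: str, tokens_in: int, tokens_out: int) -> float:
    wh_in, wh_out = WH_PER_TOKEN[model]
    return wh_in * tokens_in + wh_out * tokens_out


def network_wh(stages: dict) -> float:
    """Sum predicted energy over stages, visiting them in dependency order."""
    graph = {name: deps for name, (deps, _, _, _) in stages.items()}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    order = TopologicalSorter(graph).static_order()
    return sum(stage_wh(stages[n][3], stages[n][1], stages[n][2]) for n in order)


def monolith_wh(stages: dict, model: str = "frontier") -> float:
    """Predicted energy if one model handled the whole workload in one call."""
    total_in = sum(s[1] for s in stages.values())
    total_out = sum(s[2] for s in stages.values())
    return stage_wh(model, total_in, total_out)


if __name__ == "__main__":
    mono, net = monolith_wh(STAGES), network_wh(STAGES)
    print(f"monolith {mono:.2f} Wh, network {net:.2f} Wh, "
          f"avoided {100 * (1 - net / mono):.1f}%")
```

Note that this sketch assumes clean token partitioning between stages; a real workflow that repeats context across stages would carry more input tokens than the monolith.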
Self-evolving model routing (SEEMR)
Routing is a continuously learning decision rather than a fixed map. The dispatcher re-ranks candidates as evidence accumulates, converging on the lowest-energy model that still clears the quality bar for the task in front of it.
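The source does not describe how SEEMR learns, so the following is only a toy illustration of the general idea, not the SEEMR algorithm. It keeps a running mean of observed quality per model and picks the lowest-energy model whose estimate still clears the bar. The class names, the optimistic prior for untried models, and the energy figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    wh_per_request: float  # hypothetical predicted energy per request
    quality_sum: float = 0.0
    observations: int = 0

    def estimated_quality(self, prior: float) -> float:
        # Untried models fall back to the prior.
        if self.observations == 0:
            return prior
        return self.quality_sum / self.observations


class AdaptiveRouter:
    """Toy re-ranking router: cheapest model that clears the quality bar."""

    def __init__(self, models, quality_bar: float, prior: float = 1.0):
        self.models = {m.name: m for m in models}
        self.quality_bar = quality_bar
        self.prior = prior  # optimistic, so cheap models get tried first

    def choose(self) -> str:
        stats = list(self.models.values())
        eligible = [m for m in stats
                    if m.estimated_quality(self.prior) >= self.quality_bar]
        if not eligible:
            # Nothing clears the bar: fall back to the best estimated quality.
            return max(stats, key=lambda m: m.estimated_quality(self.prior)).name
        return min(eligible, key=lambda m: m.wh_per_request).name

    def record(self, name: str, quality: float) -> None:
        """Feed back an observed quality score for the model that ran."""
        model = self.models[name]
        model.quality_sum += quality
        model.observations += 1


if __name__ == "__main__":
    router = AdaptiveRouter(
        [ModelStats("frontier", 3.8), ModelStats("compact", 0.2)],
        quality_bar=0.8,
    )
    print(router.choose())       # cheapest model is tried first
    router.record("compact", 0.5)
    print(router.choose())       # estimate fell below the bar
```

A model that drops below the bar is never retried in this sketch; a production router would need some form of exploration and per-task-type statistics.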
Pre-registered quality guardrail
Energy savings are only meaningful if quality holds. A separate execution benchmark scores routed output against a pinned high-intensity baseline under a non-inferiority margin fixed in advance, so the quality claim is bounded and testable — not asserted.
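A minimal sketch of such a guardrail is shown below, assuming the check is applied to the mean per-task quality shortfall. The page does not state which statistical test was used, and a formal non-inferiority test would normally compare a confidence bound, not a point estimate, against the margin. The scores are invented for illustration, as are the first two task names.

```python
from statistics import mean


def non_inferiority_check(baseline_scores, routed_scores, margin=0.10):
    """Return (passes, mean shortfall, per-task shortfalls).

    Shortfall is baseline minus routed, so a positive value means
    the routed arm scored lower on that task.
    """
    if not baseline_scores or len(baseline_scores) != len(routed_scores):
        raise ValueError("need two non-empty score lists of equal length")
    shortfalls = [b - r for b, r in zip(baseline_scores, routed_scores)]
    mean_shortfall = mean(shortfalls)
    return mean_shortfall <= margin, mean_shortfall, shortfalls


def degraded_tasks(task_names, shortfalls):
    """List the tasks where the routed arm scored below the baseline."""
    return [name for name, s in zip(task_names, shortfalls) if s > 0]


# Hypothetical scores, not the benchmark's data.
TASKS = ["summarisation", "classification", "factual_recall", "exact_arithmetic"]
BASELINE_SCORES = [1.0, 1.0, 1.0, 0.5]
ROUTED_SCORES = [1.0, 1.0, 0.75, 0.5]

if __name__ == "__main__":
    ok, shortfall, per_task = non_inferiority_check(BASELINE_SCORES, ROUTED_SCORES)
    print("non-inferior in aggregate:", ok)
    print("mean shortfall:", shortfall)
    print("tasks to disclose:", degraded_tasks(TASKS, per_task))
```

Reporting the per-task shortfalls alongside the aggregate verdict keeps a task-level regression visible even when the average passes.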
What this looks like at enterprise scale
The per-task numbers are small by design. Their significance is in the multiplier. Take the aggregate quality-constrained result, 3.61 Wh of predicted energy avoided per task set, and apply it to a workload running that comparison one million times: 3.61 Wh × 1,000,000 = 3,610 kWh of predicted energy avoided.
The kWh figure scales directly from the benchmark's predicted savings; the carbon figure is an illustrative conversion at a stated grid intensity. Both are extrapolations from coefficient-based predictions, offered to convey magnitude — not as a measured datacenter result.
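The extrapolation is plain multiplication, sketched below. The 3.61 Wh and one million runs come from the text; the grid intensity of 400 g CO2e per kWh is a placeholder assumption for illustration only and is not the value used in the paper.

```python
def avoided_kwh(wh_avoided_per_run: float, runs: int) -> float:
    """Scale a per-run saving in Wh to a total in kWh."""
    return wh_avoided_per_run * runs / 1000.0


def avoided_kg_co2e(kwh: float, grid_g_co2e_per_kwh: float) -> float:
    """Convert avoided energy to avoided emissions at a given grid intensity."""
    return kwh * grid_g_co2e_per_kwh / 1000.0


if __name__ == "__main__":
    kwh = avoided_kwh(3.61, 1_000_000)
    # Placeholder grid intensity; substitute the figure for your own region.
    kg = avoided_kg_co2e(kwh, grid_g_co2e_per_kwh=400.0)
    print(f"{kwh:,.0f} kWh avoided, about {kg:,.0f} kg CO2e at 400 g/kWh")
```

Because the carbon result is linear in grid intensity, it should always be quoted together with the intensity that produced it.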
The strategic point is that this is a software lever. There is no capital expenditure and no migration: the same task runs through a network instead of a monolith, under an objective that an operator sets. For organisations running AI on their own infrastructure, that lever also compounds with the savings they already get from owning the silicon.
Limitations & honest framing
A result is only as strong as the caveats it is willing to state. These bound the claims above.
- Headline energy figures are predictions from per-model energy coefficients under controlled conditions, not direct wall-power measurements of a specific datacenter.
- The achievable saving depends on the energy gap between available candidates; a narrower gap yields a smaller reduction, which is why we report a band (81–95%) rather than one universal number.
- The quality benchmark uses a curated task set. Aggregate non-inferiority held, but one individual task showed measurable degradation — disclosed in Figure 4 rather than smoothed over.
- Staged-network figures assume clean token partitioning between stages and may understate the overhead of repeated context in some real workflows.
Stated conservatively: in a controlled benchmark using explicit per-model energy coefficients, energy-aware routing and DAG decomposition substantially reduced predicted energy across multiple token budgets and workflow shapes, and the routed condition remained non-inferior in aggregate under a pre-registered margin. That is a claim we can defend line by line — which is the only kind worth publishing.
Conclusion
Inference energy is a decision variable, and VDF AI exposes the decision. Choose an energy-aware objective and the router moves work to the most efficient model that still clears the bar. Express the work as a network rather than a monolith and the heavy model is reserved for the steps that need it. Done together, across 71 benchmark configurations, these moves removed 81–95% of predicted energy — and the quality guardrail held.
The headline is not a single percentage. It is that energy became visible, steerable, and accountable without sacrificing the answer — and visibility is the precondition for every improvement that follows.
References
- [1] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM 63(12), 54–63.
- [2] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
- [3] Patterson, D. et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350.
- [4] Henderson, P. et al. (2020). Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. JMLR 21(248).
- [5] Samsi, S. et al. (2023). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. IEEE HPEC.
- [6] Desislavov, R., Martínez-Plumed, F., & Hernández-Orallo, J. (2023). Trends in AI inference energy consumption. Sustainable Computing.
- [7] Dodge, J. et al. (2022). Measuring the Carbon Intensity of AI in Cloud Instances. FAccT.
- [8] Wu, C.-J. et al. (2022). Sustainable AI: Environmental Implications, Challenges and Opportunities. MLSys.
- [9] MLCommons (2023). MLPerf Power Benchmark — Methodology and Rules.
- [10] Piaggesi, D. et al. (2017). Non-inferiority testing: design and interpretation. Statistical methods reference.