BENCHMARK WHITE PAPER · v1.0 · June 2026 · VDF-WP-2026-003

The energy efficiency of agent networks.

A controlled benchmark of how VDF AI reduces the energy footprint of enterprise AI — by decomposing work into DAG-based agent networks and dispatching each step through SEEMR self-evolving model routing. The result: up to a 94.9% reduction in predicted energy, with output quality held non-inferior in aggregate.

License CC BY 4.0
ABSTRACT

Most of the energy an AI system consumes in production is spent at inference — the same request answered again and again[6]. That energy is not a fixed property of a model. It is the outcome of a decision: which model runs, broken into how many steps, under what objective.

This paper reports a benchmark of that decision inside VDF AI. We compare a high-intensity baseline, in which one large model answers the whole task, against two compounding strategies: routing each request under an energy-aware objective, and decomposing a workload into a directed graph of smaller, independently routed stages. Across 71 configurations spanning four token budgets and five scenario families, energy-led routing reduced predicted energy by 81–95%, with a stable ~94.8% reduction for the frontier-versus-compact pairing.

Crucially, savings without quality are meaningless. In a separate execution benchmark with a quality score recorded per task, the routed condition reduced predicted energy by 94.9% while remaining non-inferior in aggregate under a margin fixed in advance — with the task-level exceptions disclosed in full. The contribution here is not a single number; it is an auditable account of how routing and decomposition turn energy into something an enterprise can measure, steer, and defend.

Keywords: energy-aware inference · DAG agent networks · self-evolving routing · non-inferiority · Green AI · token-level attribution · sustainable enterprise AI

AT A GLANCE

Six numbers from the benchmark

  • Peak energy avoided: 94.9%. Predicted energy removed by Eco routing vs. a pinned frontier baseline.
  • Efficiency multiple: ≈20×. Less predicted energy for the same task, routed vs. frontier.
  • Quality outcome: non-inferior. Routed quality held within a pre-registered 0.10 margin in aggregate.
  • Benchmark depth: 71. Configurations across five scenario families and four token budgets.
  • Savings range: 81–95%. Reduction band observed across different model pairings.
  • Selective frontier: 54%. Energy still avoided when one DAG stage deliberately keeps the frontier model.

FIGURE 1

The same work, a fraction of the energy

Aggregate of the quality-constrained execution benchmark: a pinned high-intensity baseline versus energy-aware routing, with the quality guardrail satisfied.

  • Pinned frontier baseline: 3.80 Wh
  • VDF AI, routed: 0.19 Wh
  • Predicted energy avoided: 94.95%
  • Efficiency: ≈20× for the same task
  • Quality margin: ±0.10, held

Fig. 1. Predicted energy in watt-hours for an identical task set. Figures are coefficient-based predictions under benchmark conditions, not measured wall power.

SECTION 1

Why inference energy is a decision, not a constant

A model is trained once and served billions of times. The cumulative energy of that long serving tail now outweighs the one-off training spike[6][8], which means the highest-leverage place to reduce AI's footprint is the dispatcher that decides, per request, which model runs and how the work is split.

Enterprises increasingly have to attribute that energy — for sustainability reporting, for internal chargeback, and for procurement decisions that no longer accept a single annual number. So the question this paper answers is concrete: if you hold the task fixed and change only the routing and decomposition strategy, how much energy moves? And does quality survive the change?

We answer with a benchmark rather than an assertion. Two forms of evidence are reported: a coefficient-based comparison that isolates the effect of routing policy under fixed token assumptions, and a quality-constrained execution benchmark that pairs each energy figure with a measured quality score. The first tells us how big the lever is; the second tells us whether pulling it costs anything.

SECTION 2 · FIGURE 2

The routing objective is a dial you control

The same candidate pool, three presets. Eco leans into energy; Max-Quality deliberately holds the heavy model. That Max-Quality lands at exactly 0% saving is the point: it shows the savings come from the routing policy, not from a benchmark that quietly favours the small model.

Frontier-class vs. compact local model

  • Eco (energy-led objective): 94.8%
  • Balanced (resolves to the same efficient pick here): 94.8%
  • Max-Quality (holds the frontier model by design): 0%

Heavy tier vs. light tier

  • Eco (narrower energy gap between candidates): 81.4%
  • Balanced (matches Eco for this candidate set): 81.4%
  • Max-Quality (holds the heavy tier by design): 0%

The reduction is not a single magic figure. It scales with the energy gap between the candidates available to the router: a wide gap (frontier vs. compact) yields ~95%, a narrower one (heavy tier vs. light tier) yields ~81%. We report the band honestly because that is what a buyer needs to size their own deployment.
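The dependence on the candidate gap can be made explicit. The symbols below are introduced here for illustration and do not appear in the benchmark itself: $E_{\text{base}}$ and $E_{\text{routed}}$ are the predicted energies of the baseline and routed picks, $\rho$ is their ratio, and $s$ is the fractional saving.

```latex
s \;=\; 1 - \frac{E_{\text{routed}}}{E_{\text{base}}} \;=\; 1 - \frac{1}{\rho},
\qquad
\rho \;=\; \frac{E_{\text{base}}}{E_{\text{routed}}} \;=\; \frac{1}{1 - s}.
```

Read this way, the reported 94.8% corresponds to a gap of roughly $\rho \approx 19$ between the frontier and compact candidates, and 81.4% to roughly $\rho \approx 5.4$ between the heavy and light tiers. Because $s$ saturates as $\rho$ grows, widening an already large gap adds little, while a narrow gap caps the achievable saving.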

Table 1. Eco routing is stable across token budgets (frontier-class vs. compact local model). Predicted energy in watt-hours.

| Token budget | Baseline (Wh) | Routed (Wh) | Energy avoided |
|---|---|---|---|
| 500 in · 500 out | 4.30 | 0.225 | 94.77% |
| 1 000 in · 1 000 out | 8.60 | 0.450 | 94.77% |
| 256 in · 512 out | 3.79 | 0.200 | 94.73% |
| 2 000 in · 500 out | 7.90 | 0.405 | 94.87% |

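The rows of Table 1 are consistent with a simple linear model in which predicted energy is a per-input-token coefficient times input tokens plus a per-output-token coefficient times output tokens. The sketch below back-solves those coefficients from two rows and checks them against the others. The linear form and the resulting coefficient values are an inference from the published table, not figures stated in the paper, and the function names are hypothetical.

```python
# Rows of Table 1: (input tokens, output tokens, baseline Wh, routed Wh).
TABLE_1 = [
    (500, 500, 4.30, 0.225),
    (1000, 1000, 8.60, 0.450),
    (256, 512, 3.79, 0.200),
    (2000, 500, 7.90, 0.405),
]


def solve_coefficients(row_a, row_b):
    """Solve E = c_in * n_in + c_out * n_out from two (n_in, n_out, E) rows."""
    (in_a, out_a, e_a), (in_b, out_b, e_b) = row_a, row_b
    det = in_a * out_b - in_b * out_a  # 2x2 determinant (Cramer's rule)
    c_in = (e_a * out_b - e_b * out_a) / det
    c_out = (in_a * e_b - in_b * e_a) / det
    return c_in, c_out


def predict_wh(coeffs, n_in, n_out):
    """Predicted energy in Wh under the assumed linear model."""
    c_in, c_out = coeffs
    return c_in * n_in + c_out * n_out


# Use the first and last rows; their token mixes differ, so the system is solvable.
first, last = TABLE_1[0], TABLE_1[3]
baseline_coeffs = solve_coefficients(first[:3], last[:3])
routed_coeffs = solve_coefficients(first[:2] + first[3:], last[:2] + last[3:])

for n_in, n_out, _, _ in TABLE_1:
    base = predict_wh(baseline_coeffs, n_in, n_out)
    routed = predict_wh(routed_coeffs, n_in, n_out)
    avoided = 100 * (1 - routed / base)
    print(f"{n_in} in / {n_out} out: {base:.2f} Wh -> {routed:.3f} Wh, {avoided:.2f}% avoided")
```

Under this assumption the baseline works out to about 0.0024 Wh per input token and 0.0062 Wh per output token, and the routed model to about 0.00012 and 0.00033. The small variation in the "Energy avoided" column then follows from the input-to-output mix of each budget, since the two models' input and output coefficients are not in exactly the same ratio.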
SECTION 3 · FIGURE 3

Don't send one big model. Send a network.

A monolithic call routes the entire workload to a single heavy model. A VDF agent network breaks the same workload into a directed graph of smaller stages — each routed on its own — so the expensive model is used only where it earns its keep.

MONOLITH
Frontier model, whole task, one call: 16.92 Wh

AGENT NETWORK
Analyze (routed · compact), Synthesize (routed · compact): 0.88 Wh total

| Comparison | Before (Wh) | After (Wh) | Change |
|---|---|---|---|
| Frontier-class monolith → routed network | 16.92 | 0.88 | −94.8% |
| Mid-tier (~32B) monolith → routed network | 4.01 | 0.88 | −78.0% |
| Mid-tier (~30B) monolith → routed network | 3.73 | 0.88 | −76.4% |
| Selective frontier (one stage kept) → routed network | 16.92 | 7.73 | −54.3% |

Fig. 3. Fixed total workload (2 400 input · 1 800 output tokens). The last row keeps one stage on the frontier model on purpose and still avoids 54% of predicted energy — selective use, not all-or-nothing.
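The accounting behind Fig. 3 can be sketched as a sum of per-stage energies. The per-token coefficients below are back-solved from Table 1 under an assumed linear model, and the split of the 2 400 / 1 800 tokens between the two stages is hypothetical: the paper does not publish either, and it does not say which stage keeps the frontier model in the selective case. The split was chosen so that the sketch lands close to the reported figures.

```python
from dataclasses import dataclass

# Wh per (input token, output token). Inferred from Table 1, not stated in the paper.
COEFFS = {
    "frontier": (0.0024, 0.0062),
    "compact": (0.00012, 0.00033),
}


@dataclass(frozen=True)
class Stage:
    name: str
    model: str
    tokens_in: int
    tokens_out: int


def stage_wh(stage):
    """Predicted energy of one stage under the assumed linear model."""
    c_in, c_out = COEFFS[stage.model]
    return c_in * stage.tokens_in + c_out * stage.tokens_out


def network_wh(stages):
    """Total predicted energy: stages are assumed to share no repeated context."""
    return sum(stage_wh(s) for s in stages)


monolith = [Stage("whole task", "frontier", 2400, 1800)]

# Hypothetical partition of the fixed workload across the two stages.
routed = [
    Stage("analyze", "compact", 1200, 700),
    Stage("synthesize", "compact", 1200, 1100),
]
# Hypothetical choice: the first stage is the one kept on the frontier model.
selective = [
    Stage("analyze", "frontier", 1200, 700),
    Stage("synthesize", "compact", 1200, 1100),
]

base = network_wh(monolith)
for label, stages in [("routed", routed), ("selective", selective)]:
    total = network_wh(stages)
    print(f"{label}: {total:.2f} Wh, {100 * (1 - total / base):.1f}% avoided vs {base:.2f} Wh")
```

With an all-compact network the total does not depend on how tokens are split, because every stage uses the same coefficients. The selective case does depend on the split, which is why its saving is sensitive to which stage, and how much of the output, stays on the heavy model.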

SECTION 4 · FIGURE 4

Energy fell. Quality was watched the whole time.

A separate execution benchmark scored routed output against the pinned baseline on a curated task set. In aggregate the routed arm stayed non-inferior under a 0.10 margin set in advance — and we publish the one task that slipped rather than hide it.

| Task | Outcome | Predicted energy | Quality (baseline) | Quality (routed) |
|---|---|---|---|---|
| Structured data extraction | preserved | −94.7% | 1.00 | 1.00 |
| Semantic summarization | preserved | −94.9% | 1.00 | 1.00 |
| Factual recall | degraded at task level | −95.3% | 1.00 | 0.67 |
| Exact arithmetic | equal (both flagged) | −95.0% | 0.00 | 0.00 |

  • Aggregate energy avoided: 94.9%
  • Aggregate quality: 0.75 → 0.67 (Δ −0.08)
  • Non-inferior ✓ within the 0.10 margin

Two tasks preserved quality exactly while shedding ~95% of predicted energy. One (factual recall) degraded at the task level, from 1.00 to 0.67, and one (exact arithmetic) scored 0.00 on both sides, so it neither helped nor hurt the comparison. The defensible claim is therefore precise: large energy reductions with quality non-inferior on average across the evaluated set, not a blanket promise that every single task is untouched. That distinction is what separates a credible result from a marketing number.
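The aggregate check can be reproduced from the four task scores. The sketch below is a point-estimate comparison only: it averages the scores and compares the difference to the 0.10 margin. The paper does not spell out its exact decision rule, and a formal non-inferiority test would normally compare a confidence bound on the difference, not the raw difference, to the margin. Function names are hypothetical.

```python
from statistics import mean

MARGIN = 0.10  # non-inferiority margin stated in the paper

# (task, baseline quality, routed quality), as reported for the execution benchmark.
SCORES = [
    ("structured data extraction", 1.00, 1.00),
    ("semantic summarization", 1.00, 1.00),
    ("factual recall", 1.00, 0.67),
    ("exact arithmetic", 0.00, 0.00),
]


def aggregate_delta(scores):
    """Return mean baseline quality, mean routed quality, and their difference."""
    baseline = mean(b for _, b, _ in scores)
    routed = mean(r for _, _, r in scores)
    return baseline, routed, routed - baseline


def is_non_inferior(delta, margin=MARGIN):
    """Routed is non-inferior if it is no more than `margin` worse than baseline."""
    return delta >= -margin


baseline, routed, delta = aggregate_delta(SCORES)
degraded = [task for task, b, r in SCORES if r < b]

print(f"baseline {baseline:.2f}, routed {routed:.2f}, delta {delta:+.2f}")
print("non-inferior in aggregate:", is_non_inferior(delta))
print("task-level degradations:", degraded)
```

Reporting `degraded` alongside the aggregate verdict mirrors the disclosure above: the aggregate passes, and the one task that slipped is still listed by name.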

SECTION 5

What produces the saving

Four mechanisms compound. None of them is exotic; the result comes from making each one explicit and letting them work together.

01

Energy as a first-class routing objective

Every candidate model is scored on quality, latency, cost, and energy together. Named presets — Eco, Balanced, Max-Quality — shift the weight on energy explicitly, so sustainability is a setting an operator chooses, not an accident of which model happened to be wired in.
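A minimal sketch of preset-weighted scoring follows, to make the idea of a dial concrete. Everything numeric here is hypothetical: the weights, the normalised candidate attributes, and the linear scoring rule are illustrative assumptions, not VDF AI's actual objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # 0..1, higher is better
    latency: float  # 0..1 normalised, lower is better
    cost: float     # 0..1 normalised, lower is better
    energy: float   # 0..1 normalised, lower is better


# Hypothetical weights per preset: (quality, latency, cost, energy).
PRESETS = {
    "eco": (0.20, 0.10, 0.10, 0.60),
    "balanced": (0.35, 0.15, 0.15, 0.35),
    "max_quality": (1.00, 0.00, 0.00, 0.00),
}


def score(candidate, preset):
    """Weighted score: reward quality, penalise latency, cost and energy."""
    w_q, w_l, w_c, w_e = PRESETS[preset]
    return (
        w_q * candidate.quality
        - w_l * candidate.latency
        - w_c * candidate.cost
        - w_e * candidate.energy
    )


def route(candidates, preset):
    """Pick the highest-scoring candidate under the chosen preset."""
    return max(candidates, key=lambda c: score(c, preset))


# Illustrative two-model pool with made-up attribute values.
pool = [
    Candidate("frontier", quality=0.95, latency=0.60, cost=1.00, energy=1.00),
    Candidate("compact", quality=0.90, latency=0.20, cost=0.05, energy=0.05),
]

for preset in PRESETS:
    print(preset, "->", route(pool, preset).name)
```

With these assumed values, Eco and Balanced both resolve to the compact model while Max-Quality holds the frontier model, which is the same pattern the preset comparison in Section 2 reports.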

02

DAG-based agent networks

Instead of sending an entire workload to one large model, a network decomposes it into a directed graph of smaller stages. Each stage is routed independently, so the heavy model is reserved only for the steps that genuinely need it.
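As an illustration of the structure, the sketch below declares a small stage graph, orders it topologically with Python's standard `graphlib`, and assigns a model to each stage independently. The "verify" stage, the `needs_heavy` flag, and the `pick_model` rule are hypothetical stand-ins for the real router.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# "verify" is an illustrative extra stage, not one named in the benchmark.
GRAPH = {
    "analyze": set(),
    "verify": {"analyze"},
    "synthesize": {"analyze", "verify"},
}


def pick_model(stage, needs_heavy):
    """Stand-in for the router: the heavy model only for flagged stages."""
    return "frontier" if stage in needs_heavy else "compact"


def plan(graph, needs_heavy):
    """Return (stage, model) pairs in an order that respects dependencies."""
    order = TopologicalSorter(graph).static_order()
    return [(stage, pick_model(stage, needs_heavy)) for stage in order]


assignments = plan(GRAPH, needs_heavy={"verify"})
for stage, model in assignments:
    print(f"{stage}: {model}")
```

Because the model is chosen per stage, reserving the heavy model for one step is a local decision that leaves every other stage on the efficient pick.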

03

Self-evolving model routing (SEEMR)

Routing is a continuously learning decision rather than a fixed map. The dispatcher re-ranks candidates as evidence accumulates, converging on the lowest-energy model that still clears the quality bar for the task in front of it.
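The paper does not describe SEEMR's internals, so the following is not its algorithm. It is a toy illustration of the stated behaviour only: keep a running quality estimate per model, and choose the lowest-energy model whose estimate still clears the bar. The class name, the optimistic prior, and the running-mean update are all assumptions; the energy values are borrowed from Fig. 1 purely as illustrative magnitudes.

```python
from collections import defaultdict


class EvidenceRouter:
    """Toy re-ranking router based on running mean quality per model."""

    def __init__(self, energy_wh, quality_bar, prior=1.0):
        self.energy_wh = dict(energy_wh)  # model -> predicted Wh
        self.quality_bar = quality_bar
        self.prior = prior                # optimistic estimate before any evidence
        self.total = defaultdict(float)   # model -> sum of observed quality
        self.count = defaultdict(int)     # model -> number of observations

    def estimate(self, model):
        """Mean observed quality, or the prior if the model is untried."""
        n = self.count[model]
        return self.total[model] / n if n else self.prior

    def choose(self):
        """Lowest-energy model that clears the bar; else the best estimate."""
        by_energy = sorted(self.energy_wh, key=self.energy_wh.get)
        for model in by_energy:
            if self.estimate(model) >= self.quality_bar:
                return model
        return max(by_energy, key=self.estimate)

    def observe(self, model, quality):
        """Record one quality score for a model."""
        self.total[model] += quality
        self.count[model] += 1


router = EvidenceRouter({"frontier": 3.80, "compact": 0.19}, quality_bar=0.8)
first = router.choose()           # untried models get the optimistic prior
router.observe("compact", 0.5)
router.observe("compact", 0.6)
second = router.choose()          # compact's running mean is now below the bar
print(first, "->", second)
```

A plain running mean never revisits a model once it falls below the bar, so a production router would need some form of exploration or decay. That concern is outside what the paper reports and is noted here only as a limitation of the sketch.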

04

Pre-registered quality guardrail

Energy savings are only meaningful if quality holds. A separate execution benchmark scores routed output against a pinned high-intensity baseline under a non-inferiority margin fixed in advance, so the quality claim is bounded and testable — not asserted.

SECTION 6

What this looks like at enterprise scale

The per-task numbers are small by design. Their significance is in the multiplier. Take the aggregate quality-constrained result — 3.61 Wh of predicted energy avoided per task set — and apply it to a workload running that comparison one million times:

  • ≈3,610 kWh of predicted energy avoided per million comparable runs
  • ≈1.4 tonnes CO₂e (illustrative, at a ~400 g/kWh grid intensity)
  • 1 policy change: no new hardware, just routing and decomposition

The kWh figure scales directly from the benchmark's predicted savings; the carbon figure is an illustrative conversion at a stated grid intensity. Both are extrapolations from coefficient-based predictions, offered to convey magnitude — not as a measured datacenter result.
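The extrapolation is plain unit conversion, sketched below so the grid intensity can be swapped for a site-specific value. The function name is hypothetical; the inputs are the 3.61 Wh per task set, one million runs, and ~400 g/kWh quoted above.

```python
def extrapolate(wh_avoided_per_run, runs, grid_g_per_kwh):
    """Scale a per-run predicted saving to kWh and tonnes of CO2e."""
    kwh = wh_avoided_per_run * runs / 1_000        # Wh -> kWh
    tonnes = kwh * grid_g_per_kwh / 1_000_000      # grams -> tonnes
    return kwh, tonnes


kwh, tonnes = extrapolate(wh_avoided_per_run=3.61, runs=1_000_000, grid_g_per_kwh=400)
print(f"{kwh:,.0f} kWh avoided, {tonnes:.2f} t CO2e")
```

The carbon figure moves linearly with the grid intensity, so a deployment on a cleaner or dirtier grid should substitute its own value before quoting tonnes.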

The strategic point is that this is a software lever. There is no capital expenditure and no migration: the same task runs through a network instead of a monolith, under an objective that an operator sets. For organisations running AI on their own infrastructure, that lever also compounds with the savings they already get from owning the silicon.

SECTION 7

Limitations & honest framing

A result is only as strong as the caveats it is willing to state. These bound the claims above.

  • Headline energy figures are predictions from per-model energy coefficients under controlled conditions, not direct wall-power measurements of a specific datacenter.
  • The achievable saving depends on the energy gap between available candidates; a narrower gap yields a smaller reduction, which is why we report a band (81–95%) rather than one universal number.
  • The quality benchmark uses a curated task set. Aggregate non-inferiority held, but one individual task showed measurable degradation — disclosed in Figure 4 rather than smoothed over.
  • Staged-network figures assume clean token partitioning between stages and may understate the overhead of repeated context in some real workflows.
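To give a feel for the last caveat, the sketch below inflates the input side of a staged run when later stages re-read part of the context. It is a hypothetical overhead model, not something the benchmark measured: the `repeat_fraction` parameter is invented for illustration, and the per-token coefficients are back-solved from Table 1 under an assumed linear model.

```python
# Wh per (input token, output token). Inferred from Table 1, not stated in the paper.
FRONTIER = (0.0024, 0.0062)
COMPACT = (0.00012, 0.00033)


def wh(coeffs, n_in, n_out):
    """Predicted energy under the assumed linear per-token model."""
    c_in, c_out = coeffs
    return c_in * n_in + c_out * n_out


def staged_wh(coeffs, n_in, n_out, stages, repeat_fraction):
    """Energy when each stage after the first re-reads a fraction of the input."""
    extra_in = repeat_fraction * n_in * (stages - 1)
    return wh(coeffs, n_in + extra_in, n_out)


monolith = wh(FRONTIER, 2400, 1800)
for fraction in (0.0, 0.5, 1.0):
    staged = staged_wh(COMPACT, 2400, 1800, stages=2, repeat_fraction=fraction)
    avoided = 100 * (1 - staged / monolith)
    print(f"repeat {fraction:.0%}: {staged:.3f} Wh, {avoided:.1f}% avoided")
```

Under these assumptions, fully repeating the input in a second compact stage raises the network from about 0.88 Wh to about 1.17 Wh, so the saving shrinks modestly but does not disappear. Workflows with many stages or long shared context would see a larger effect.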

Stated conservatively: in a controlled benchmark using explicit per-model energy coefficients, energy-aware routing and DAG decomposition substantially reduced predicted energy across multiple token budgets and workflow shapes, and the routed condition remained non-inferior in aggregate under a pre-registered margin. That is a claim we can defend line by line — which is the only kind worth publishing.

SECTION 8

Conclusion

Inference energy is a decision variable, and VDF AI exposes the decision. Choose an energy-aware objective and the router moves work to the most efficient model that still clears the bar. Express the work as a network rather than a monolith and the heavy model is reserved for the steps that need it. Across 71 benchmark configurations, energy-led routing removed 81–95% of predicted energy, and in the separate execution benchmark the quality guardrail held in aggregate.

The headline is not a single percentage. It is that energy became visible, steerable, and accountable without sacrificing the answer — and visibility is the precondition for every improvement that follows.

REFERENCES

  [1] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM 63(12), 54–63.
  [2] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
  [3] Patterson, D. et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350.
  [4] Henderson, P. et al. (2020). Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. JMLR 21(248).
  [5] Samsi, S. et al. (2023). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. IEEE HPEC.
  [6] Desislavov, R., Martínez-Plumed, F., & Hernández-Orallo, J. (2023). Trends in AI inference energy consumption. Sustainable Computing.
  [7] Dodge, J. et al. (2022). Measuring the Carbon Intensity of AI in Cloud Instances. FAccT.
  [8] Wu, C.-J. et al. (2022). Sustainable AI: Environmental Implications, Challenges and Opportunities. MLSys.
  [9] MLCommons (2023). MLPerf Power Benchmark: Methodology and Rules.
  [10] Piaggesi, D. et al. (2017). Non-inferiority testing: design and interpretation. Statistical methods reference.
