How We Reduce Energy Consumption
Ten implementation-grounded mechanisms — from multi-objective routing to measured-power sampling — that move the energy footprint of enterprise LLM inference from an implicit externality to a first-class, measurable, and governable engineering objective.
Public discussion of AI energy use is dominated by training-era headline numbers, but the majority of the deployed footprint is consumed in repeated inference requests[10]. Energy at inference is not a fixed property of a model: it is a runtime decision variable that depends on which candidate is chosen, how it is served, and where its power is drawn from.
This white paper describes how VDF AI Networks treats inference energy as a first-class routing
objective. Each request is evaluated by a multi-objective scorer with explicit weights over Quality,
Latency, Cost, and Energy; local and cloud execution paths are tracked under a single
cloud-equivalence model; and every completed request produces a persisted
EnergyRecord that carries watt-hours, grams of CO₂e, phase-split attribution,
method provenance, and a calibrated confidence band.
We describe ten concrete mechanisms, each tied to a named subsystem of the production runtime.
We also report per-model energy intensities in the 0.09 – 6.20 Wh per thousand tokens range, a
spread of well over an order of magnitude between edge-class and frontier models, and the policy surface (the
EnergyBudget) that lets operators encode sustainability targets into the workflow
itself. The goal of the paper is not to claim a novel algorithm; it is to give an auditable,
implementation-level account of how a distributed AI platform can make its energy behaviour
measurable, explainable, and steerable.
AT A GLANCE
Six numbers that anchor the paper:
- 50%: of the routing score in Eco mode is energy
- ≈2×: lower Wh/1K tokens for comparable local vs cloud models
- 2: separate coefficients, one for prefill and one for generation
- 250 ms: default power sampling cadence during local inference
- 100%: every estimate is tagged with a calibrated confidence score
- >10×: spread in energy intensity between a 4B-tier model and a 405B frontier model
FIGURE 1. Energy reduction: per-request architecture. Each zone corresponds to a production module cited in Section 4; the bottom strip reports the per-model energy intensities that the scorer consumes, feeding the EnergyRecord and analytics.
Introduction & motivation
Two things have changed in the last twenty-four months. The first is that published estimates of inference energy have overtaken training in aggregate[10][9]: a model is trained once and served billions of times, and the integral of the serving tail dominates the one-off spike. The second is that enterprise buyers increasingly need to attribute energy and carbon to individual workloads — for regulatory reporting[14][15], for internal chargeback, and for the kind of sustainability commitments that can no longer be satisfied with a single annualised number.
The combination matters. If the majority of the footprint now lives in inference, then the most leveraged place to reduce it is the dispatcher that decides, per request, which model runs and where. And if reporting has to be attributable, then the dispatcher must also emit a measurement record that survives the request and can be aggregated later.
VDF AI Networks is built around that joint observation. Routing decisions and energy accounting are the same subsystem. We do not have a sustainability dashboard bolted to a black-box runtime; we have a runtime whose scoring function contains an energy term and whose output includes a persisted energy record. This paper documents how that works, end to end, so that the mechanism can be reviewed, criticised, and reproduced.
Scope and non-goals
This paper covers serving-time inference. It does not address training energy, does not propose a new hardware design, and does not claim to displace the carbon-aware-scheduling literature — it describes how the results of that literature are operationalised in a product. Where our mechanisms align with existing academic work (notably MLPerf Power[6] and carbon-intensity accounting[11]) we cite it rather than re-derive it.
Background & related work
Early work on AI energy framed the problem at the training scale. Strubell et al.[3] first popularised the carbon arithmetic of NLP training; Patterson et al.[1] produced the canonical analysis of large-model training emissions; Schwartz et al.[2] proposed "Green AI" as a research norm alongside accuracy. Luccioni et al.[7] gave a fully accounted training LCA for BLOOM.
Inference accounting has matured more recently. Henderson et al.[5] proposed systematic reporting norms; Dodge et al.[11] showed how cloud-region choice changes carbon by ~3× without changing the model; Samsi et al.[8] published "From Words to Watts", a watts-per-token characterisation that is now the conceptual counterpart of our registry; Desislavov et al.[10] modelled the aggregate inference trend.
At the infrastructure scale, Gupta et al.[4] argued that embodied carbon and operational carbon are co-equal, and Wu et al.[9] summarised the full sustainable-AI problem space. MLCommons' MLPerf Power[6] gave the community a shared benchmarking methodology and is the spiritual ancestor of our measured sampling path.
What is consistently missing from this literature is a routing-level account of how energy enters the per-request decision. Almost all prior work measures; very little dispatches on the measurement. The contribution of this paper is in that gap — not as a new theoretical result, but as an engineering pattern with a working reference implementation.
System architecture overview
VDF AI Networks is the orchestration tier of the VDF platform. A network is a declarative specification of nodes, dependencies, and routing policy; at execution time the engine resolves each node against a candidate pool that includes both cloud APIs and locally-hosted models. Two artefacts govern energy behaviour at this tier.
The first is EnergyBudget, declared on the network itself. It carries
max_watts, a strategy (balanced, energy, speed), a routing_mode
(eco, balanced, max_quality), and optional custom weights. The budget is part of the spec, not a
side-channel setting, so it is diffed and reviewed in the same pull request as any other logic
change.
The second is EnergyRecord, persisted per node per run in the Vault. Each record carries
avg_power_w, energy_wh, input_energy_wh,
output_energy_wh, co2_g with phase-split variants, a method
(measured, estimated_tokens, unknown), a source
attribution (nvml, rocm, prometheus, model_coeff,
manual), and a confidence score. The pair turns energy from a black-box
runtime property into a spec-declared budget with a per-request audit ledger.
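The shape of the two artefacts can be sketched as plain dataclasses. Field names follow the text above; the types, defaults, and constructor call are illustrative assumptions, not the production schema (the phase-split CO₂e variants are omitted for brevity):

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the spec-level budget declared on the network itself.
@dataclass
class EnergyBudget:
    max_watts: Optional[float] = None
    strategy: str = "balanced"          # balanced | energy | speed
    routing_mode: str = "balanced"      # eco | balanced | max_quality
    custom_weights: Optional[dict] = None

# Sketch of the per-node, per-run audit record persisted in the Vault.
@dataclass
class EnergyRecord:
    avg_power_w: float
    energy_wh: float
    input_energy_wh: float
    output_energy_wh: float
    co2_g: float
    method: str        # measured | estimated_tokens | unknown
    source: str        # nvml | rocm | prometheus | model_coeff | manual
    confidence: float  # calibrated, 0..1

budget = EnergyBudget(max_watts=350.0, routing_mode="eco")
```

Because the budget is an ordinary field of the network spec, a change to `routing_mode` shows up as a one-line diff in review, exactly like any other logic change.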
Methodology — ten mechanisms
Each subsection names one production mechanism, describes how it reduces energy, and identifies the runtime subsystem in which it lives.
Multi-objective routing with energy as a first-class objective
A four-dimensional objective scorer normalises Quality, Latency, Cost, and Energy for every candidate. The Eco preset raises the energy weight to 0.50 while Max-Quality drops it to 0.05, producing a dispatcher that is explicit rather than implicit about sustainability.
Routing layer · Multi-objective scoring

Token-based dual-phase energy estimation
Energy is not a single scalar per model. We store separate Wh/1K coefficients for input (prefill) and output (generation) tokens because their compute profiles differ by factors of 2–3× in modern decoder transformers.
Energy subsystem · Dual-phase estimator

Measured power sampling for local inference
Local and on-prem runs are wrapped with a sampler that polls hardware power telemetry (NVML, ROCm, or a Prometheus endpoint) at a configurable cadence (default 250 ms). We compute E = P̄ · Δt with an explicit confidence of 0.85.
Telemetry · Hardware-level sampling

Regional carbon intensity multiplier
Watts become grams of CO₂e through a deployment-level carbon-intensity parameter (g CO₂ / kWh). This parameterisation is the hook for time-of-use and renewables-aware scheduling: identical energy at a lower grid intensity produces less reportable carbon.
Energy subsystem · Carbon facade

EnergyBudget bound per network
Each network specification can declare an EnergyBudget with max_watts, a strategy (balanced / energy / speed), and a routing_mode override. Operators codify sustainability policy into the workflow itself rather than hope the runtime picks the right model.
Network spec · Declarative budget

Model energy registry with structured profiles
A structured registry records hardware_class, quantization (fp16 · fp8 · int4 · int8), batch_regime, and a confidence score per model. The scorer reads this directly so energy decisions are driven by catalog data, not implicit knowledge.
Configuration · Model energy registry

Heuristic size-tier fallback
Unknown or emerging models are placed on a size-tier curve (4B → 0.09, 8B → 0.14, 35B → 0.40, 80B → 0.80, 500B → 2.25 Wh/1K tokens). The tier attribution is persisted so analytics can distinguish measured data from heuristic estimates.
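As a sketch, the fallback can snap an unknown model to the nearest tier at or above its parameter count and return the tier alongside the coefficient so the attribution can be persisted. The snapping rule and function name are illustrative assumptions; the production curve may place models differently:

```python
import bisect

# Size-tier fallback curve from the text: parameters (B) -> Wh/1K tokens.
TIER_CURVE = [(4, 0.09), (8, 0.14), (35, 0.40), (80, 0.80), (500, 2.25)]

def tier_fallback(param_count_b: float) -> tuple:
    """Snap an unknown model to the nearest tier at or above its size.

    Returns (Wh/1K tokens, tier in B-params) so that analytics can
    distinguish this heuristic attribution from measured data.
    """
    sizes = [s for s, _ in TIER_CURVE]
    i = bisect.bisect_left(sizes, param_count_b)
    i = min(i, len(TIER_CURVE) - 1)   # clamp anything above 500B to the top tier
    size, wh = TIER_CURVE[i]
    return wh, size

# A 13B model snaps up, conservatively, to the 35B tier.
```

Snapping upward is the conservative choice: an unknown mid-size model is charged at the next tier rather than flattered by the one below it.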
Energy subsystem · Heuristic fallback

Local-vs-cloud equivalence tracking
Local runs also record a cloud_equivalent_watts estimate (a 3× baseline heuristic where the profile is missing). This lets a cost-aware router compare "run locally at 3× the wall power but zero egress" against "cloud API at lower on-site power but ongoing spend".
Routing layer · Equivalence model

Confidence tracking per estimation method
Every EnergyRecord carries method ∈ {measured, estimated_tokens, heuristic} with a confidence band (≈0.85 measured, 0.6–0.8 registry, 0.14–0.55 heuristic). Analytics dashboards can weight or gate on this so reported totals are never opaque.
Persistence · Vault record schema

Aggregated analytics & persisted audit trail
An Energy Analytics API exposes total_energy_wh, total_co2_g, input/output phase splits, method counts, and coverage_ratio. Persistence in the Vault gives each request a reproducible energy ledger — a prerequisite for any honest sustainability claim.
Persistence · Analytics surface

Empirical observations
Table 1 reports the energy intensity values the scorer consumes today, sourced from the registry, live measurement, and the size-tier heuristic fallback. The values are expressed in watt-hours per 1 000 tokens (Wh/1K), split between input (prefill) and output (generation) phases. Confidence is reported per entry; a confidence under 0.5 generally indicates a heuristic derivation and is surfaced as such in the analytics API.
| Model / tier | Class | Input | Output | Confidence | Source |
|---|---|---|---|---|---|
| GPT-4o-mini (cloud) | small · cloud | 0.22 | 0.65 | 0.70 | registry |
| Llama 3.2 (local, 8B-class) | local sweet spot | 0.12 | 0.33 | 0.80 | registry + measured |
| 70B-class, quantised (int4) | balanced | 0.38 | 0.95 | 0.65 | registry |
| Hermes 405B-class | frontier | 2.40 | 6.20 | 0.60 | registry |
| Heuristic · 4B tier | edge | 0.09 | 0.24 | 0.25 | size-tier fallback |
| Heuristic · 8B tier | edge / local | 0.14 | 0.40 | 0.35 | size-tier fallback |
| Heuristic · 35B tier | mid | 0.40 | 1.05 | 0.40 | size-tier fallback |
| Heuristic · 80B tier | compute-heavy | 0.80 | 2.10 | 0.40 | size-tier fallback |
| Heuristic · 500B tier | frontier-fallback | 2.25 | 5.60 | 0.30 | size-tier fallback |
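The estimator that consumes these coefficients is a weighted sum over the two phases. A minimal sketch using a slice of Table 1 (the registry keys and return shape are illustrative, not the production API):

```python
# Registry slice from Table 1: (input Wh/1K, output Wh/1K, confidence).
REGISTRY = {
    "gpt-4o-mini":  (0.22, 0.65, 0.70),
    "llama-3.2-8b": (0.12, 0.33, 0.80),
    "hermes-405b":  (2.40, 6.20, 0.60),
}

def estimate_energy_wh(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Dual-phase token estimate: E = n_in/1000 * c_in + n_out/1000 * c_out."""
    c_in, c_out, conf = REGISTRY[model]
    e_in = input_tokens / 1000 * c_in
    e_out = output_tokens / 1000 * c_out
    return {"input_energy_wh": e_in, "output_energy_wh": e_out,
            "energy_wh": e_in + e_out, "confidence": conf,
            "method": "estimated_tokens"}

# 2,000 prompt tokens + 500 generated tokens on the local 8B-class model:
# 2.0 * 0.12 + 0.5 * 0.33 = 0.405 Wh
```

The phase split matters in practice: a retrieval-heavy request with a long prompt and short answer is dominated by the cheaper prefill coefficient, which a single per-model scalar would miss.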
Table 2 gives the four built-in routing presets and the resulting weight vectors. The Eco preset
lifts Energy from 0.10 (Balanced) to 0.50 while compressing Latency; the Max-Quality preset pushes
Quality to 0.70 and deliberately de-prioritises Energy. In practice, routing an identical batch
through Eco vs Max-Quality shifts candidate selection toward smaller and local models; the
consequence is directly visible in the analytics API as a lower total_energy_wh per
equivalent token throughput.
| Mode | Quality | Latency | Cost | Energy | Typical use |
|---|---|---|---|---|---|
| Eco | 20% | 10% | 20% | 50% | High-volume, non-critical workloads |
| Balanced | 40% | 20% | 30% | 10% | Default for mixed enterprise traffic |
| Max-Quality | 70% | 15% | 10% | 5% | Hard-accuracy tasks; infrequent |
| Default | 35% | 25% | 25% | 15% | Fallback when no preset is declared |
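Selection under these presets reduces to a normalised weighted sum. A minimal sketch follows; the candidate fields, the normalisation direction, and the example values are illustrative assumptions, and the +15-point local bonus under Eco is the one described in the next subsection:

```python
WEIGHTS = {  # Table 2 weight vectors: (quality, latency, cost, energy)
    "eco":         (0.20, 0.10, 0.20, 0.50),
    "balanced":    (0.40, 0.20, 0.30, 0.10),
    "max_quality": (0.70, 0.15, 0.10, 0.05),
    "default":     (0.35, 0.25, 0.25, 0.15),
}

def score(cand: dict, mode: str = "balanced") -> float:
    """Higher is better. quality is 0..1; latency, cost, and energy are
    assumed pre-normalised to 0..1 where 1 is the worst candidate, so
    they enter with a negative sign."""
    wq, wl, wc, we = WEIGHTS[mode]
    quality = cand["quality"]
    if mode == "eco" and cand.get("local"):
        quality = min(1.0, quality + 0.15)   # +15-point local bonus under Eco
    return (wq * quality - wl * cand["latency"]
            - wc * cand["cost"] - we * cand["energy"])

cloud = {"quality": 0.90, "latency": 0.3, "cost": 0.6, "energy": 0.8, "local": False}
local = {"quality": 0.78, "latency": 0.5, "cost": 0.2, "energy": 0.2, "local": True}
# Under Eco the local candidate wins; under Max-Quality the cloud one does.
```

The point of the sketch is that the mode switch changes nothing but the weight vector, which is what makes the resulting shift in `total_energy_wh` attributable to policy rather than to a hidden heuristic.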
The +15% local-quality bonus — a deliberate trade-off
In Eco mode the scorer adds a small quality bonus (+15 percentage points in the normalised quality term) to candidates that run locally. The effect is to bias the router toward distributed and on-premises inference when the user has declared an energy-leaning policy. This is not an accuracy claim; it is a policy statement that says "under Eco, we are willing to accept a local model that would score slightly lower under Max-Quality". The bonus is tunable and fully logged, so downstream analytics can quantify its contribution to any headline saving.
Discussion, limitations, and future work
Where the numbers come from
No single source gives us all the coefficients. Measured sampling is authoritative only where we own
the hardware; cloud inference must be estimated from published or vendor-supplied data; and
emerging models need a fallback. The registry therefore tags each entry with a method and a
confidence, and the analytics API surfaces a coverage_ratio so downstream reports can
state honestly what fraction of a totalled energy_wh was measured vs estimated.
Over-claiming is the original sin of sustainability reporting[15];
confidence tracking is how we refuse to do it.
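A coverage-aware rollup over persisted records can be sketched as follows. The record shape is trimmed, and defining coverage_ratio as the measured share of total watt-hours is an assumption consistent with the text, not the production definition:

```python
def aggregate(records: list) -> dict:
    """Roll per-request records into the analytics surface described above.

    coverage_ratio here = share of total Wh backed by measured sampling,
    so a report can state what fraction of its headline number was
    measured rather than estimated.
    """
    total = sum(r["energy_wh"] for r in records)
    measured = sum(r["energy_wh"] for r in records if r["method"] == "measured")
    methods = {}
    for r in records:
        methods[r["method"]] = methods.get(r["method"], 0) + 1
    return {
        "total_energy_wh": total,
        "total_co2_g": sum(r["co2_g"] for r in records),
        "method_counts": methods,
        "coverage_ratio": measured / total if total else 0.0,
    }

runs = [
    {"energy_wh": 0.40, "co2_g": 160.0, "method": "measured"},
    {"energy_wh": 0.60, "co2_g": 240.0, "method": "estimated_tokens"},
]
# coverage_ratio = 0.40 / 1.00 = 0.4
```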
Grid variance and time-of-use
Carbon intensity (VDF_CARBON_INTENSITY_G_PER_KWH) is a parameter because grids are not
uniform and not stationary[11]. The present implementation accepts
a static value per deployment; a production extension is to receive a time-series from a grid
operator and schedule non-urgent batches against it. This is the carbon-aware-scheduling direction
of Gupta et al.[4] and we treat it as a near-term roadmap item,
not a finished capability.
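The facade itself is a single multiplication. A sketch that accepts either the static VDF_CARBON_INTENSITY_G_PER_KWH value or a time-indexed lookup (the time-series path stands in for the roadmap item above and is not a shipped feature):

```python
from datetime import datetime
from typing import Callable, Optional, Union

def to_co2_g(energy_wh: float,
             intensity: Union[float, Callable[[datetime], float]],
             at: Optional[datetime] = None) -> float:
    """grams CO2e = Wh / 1000 * (g CO2e per kWh).

    `intensity` is either the static per-deployment value or, in the
    roadmap extension, a lookup into a grid operator's time series.
    """
    g_per_kwh = intensity(at or datetime.now()) if callable(intensity) else intensity
    return energy_wh / 1000.0 * g_per_kwh

# 0.405 Wh on a 400 g/kWh grid -> 0.162 g CO2e
```

Because the estimator's output and the intensity are kept separate until this point, the same `energy_wh` ledger can be re-priced retroactively if better grid data arrives.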
Cold-start and short-request sampling
A 250 ms sampling interval is a deliberate compromise. Shorter intervals increase overhead; longer intervals smear over short requests. For sub-second inference the measured path degrades toward the estimated path, which is why the estimator is the primary source of truth and the sampler is a corroboration layer. We are experimenting with adaptive cadence (tighter during warmup) but keep the conservative default for now.
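The measured path in sketch form: poll a backend at the configured cadence, then apply E = P̄ · Δt. The `read_power_w` callable stands in for the NVML/ROCm/Prometheus backends; the names and return shape are illustrative:

```python
import time
from typing import Callable

def sample_energy_wh(read_power_w: Callable[[], float],
                     duration_s: float, cadence_s: float = 0.25) -> dict:
    """Poll power at `cadence_s` (default 250 ms), then E = mean(P) * dt.

    For sub-second requests too few samples accrue, which is why the
    token estimator remains the primary source of truth and this path
    serves as corroboration.
    """
    samples = []
    t0 = time.monotonic()
    while time.monotonic() - t0 < duration_s:
        samples.append(read_power_w())
        time.sleep(cadence_s)
    elapsed_h = (time.monotonic() - t0) / 3600.0
    avg_w = sum(samples) / len(samples) if samples else 0.0
    return {"avg_power_w": avg_w, "energy_wh": avg_w * elapsed_h,
            "method": "measured", "confidence": 0.85}
```

A one-second request at a steady 300 W yields roughly 0.083 Wh, but only four samples at the default cadence, which is exactly the short-request degradation discussed above.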
What we do not measure
The paper is deliberately narrow. We do not account for the embodied carbon of the hardware itself — that is a separate, important problem[4]. We do not model cooling or PUE at the datacentre level; that term is exposed via the carbon-intensity parameter but not disaggregated. And we do not perform a full LCA of the model weights; Luccioni et al.[7] is the reference for that workstream.
Conclusion
If inference has become the majority of the footprint, then reducing inference energy is a routing problem before it is a hardware problem. We have described, at source-of-record level, how VDF AI Networks makes energy a first-class term in its multi-objective scorer, how it persists an auditable per-request ledger, and how it exposes that ledger so that operators can state — and can be asked to defend — their aggregate numbers.
The mechanism is deliberately unglamorous. Ten small, composable decisions; one scoring function; one record schema; one analytics surface. The claim is not that this is optimal. The claim is that it is visible, and that visibility is the precondition for every improvement that follows.
References
- [1] Patterson, D. et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350.
- [2] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM 63(12), 54–63.
- [3] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
- [4] Gupta, U. et al. (2022). Chasing Carbon: The Elusive Environmental Footprint of Computing. HPCA.
- [5] Henderson, P. et al. (2020). Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. JMLR 21(248).
- [6] MLCommons (2023). MLPerf Power Benchmark — Methodology and Rules.
- [7] Luccioni, A. S., Viguier, S., & Ligozat, A.-L. (2023). Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. JMLR.
- [8] Samsi, S. et al. (2023). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. IEEE HPEC.
- [9] Wu, C.-J. et al. (2022). Sustainable AI: Environmental Implications, Challenges and Opportunities. MLSys.
- [10] Desislavov, R., Martínez-Plumed, F., & Hernández-Orallo, J. (2023). Trends in AI inference energy consumption. Sustainable Computing.
- [11] Dodge, J. et al. (2022). Measuring the Carbon Intensity of AI in Cloud Instances. FAccT.
- [12] Google Research (2022). The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. IEEE Computer.
- [13] Maslej, N. et al. (2024). The AI Index Report. Stanford Institute for Human-Centered AI.
- [14] ISO/IEC (2018). ISO 14067:2018 — Greenhouse gases — Carbon footprint of products.
- [15] WRI / WBCSD (2011). GHG Protocol Product Life Cycle Accounting and Reporting Standard.
- [16] NVIDIA Corporation. NVML / nvidia-smi Reference.
- [17] AMD Inc. ROCm SMI Library Reference.
- [18] Dean, J. (2020). The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design. ISSCC Plenary.