How We Reduce Energy Consumption
Ten implementation-grounded mechanisms — from multi-objective routing to measured-power sampling — that move the energy footprint of enterprise LLM inference from an implicit externality to a first-class, measurable, and governable engineering objective.
Public discussion of AI energy use is dominated by training-era headline numbers, but the majority of the deployed footprint is consumed in repeated inference requests[10]. Energy at inference is not a fixed property of a model: it is a runtime decision variable that depends on which candidate is chosen, how it is served, and where its power is drawn from.
This white paper describes how VDF AI Networks treats inference energy as a first-class routing
objective. Each request is evaluated by a multi-objective scorer with explicit weights over Quality,
Latency, Cost, and Energy; local and cloud execution paths are tracked under a single
cloud-equivalence model; and every completed request produces a persisted
EnergyRecord that carries watt-hours, grams of CO₂e, phase-split attribution,
method provenance, and a calibrated confidence band.
We describe ten concrete mechanisms, each tied to a named subsystem of the production runtime.
We also report per-model energy intensities in the 0.09 – 6.20 Wh per thousand tokens range, a
spread of well over an order of magnitude between edge-class and frontier models, and the policy surface (the
EnergyBudget) that lets operators encode sustainability targets into the workflow
itself. The goal of the paper is not to claim a novel algorithm; it is to give an auditable,
implementation-level account of how a distributed AI platform can make its energy behaviour
measurable, explainable, and steerable.
AT A GLANCE
Six numbers that anchor the paper:
- 50%: of the routing score in Eco mode is energy
- ≈2×: lower Wh/1K tokens for comparable local vs cloud models
- 2: separate coefficients, one for prefill and one for generation
- 250 ms: default power sampling cadence during local inference
- 100%: every estimate is tagged with a calibrated confidence score
- >10×: spread in energy intensity between a 4B-tier model and a 405B frontier model
FIGURE 1. Energy reduction: per-request architecture. Each zone corresponds to a production module cited in Section 4; the bottom strip reports the per-model energy intensities that the scorer consumes, feeding the EnergyRecord and analytics.
Introduction & motivation
Two things have changed in the last twenty-four months. The first is that published estimates of inference energy have overtaken training in aggregate[10][9]: a model is trained once and served billions of times, and the integral of the serving tail dominates the one-off spike. The second is that enterprise buyers increasingly need to attribute energy and carbon to individual workloads — for regulatory reporting[14][15], for internal chargeback, and for the kind of sustainability commitments that can no longer be satisfied with a single annualised number.
The combination matters. If the majority of the footprint now lives in inference, then the most leveraged place to reduce it is the dispatcher that decides, per request, which model runs and where. And if reporting has to be attributable, then the dispatcher must also emit a measurement record that survives the request and can be aggregated later.
VDF AI Networks is built around that joint observation. Routing decisions and energy accounting are the same subsystem. We do not have a sustainability dashboard bolted to a black-box runtime; we have a runtime whose scoring function contains an energy term and whose output includes a persisted energy record. This paper documents how that works, end to end, so that the mechanism can be reviewed, criticised, and reproduced.
Scope and non-goals
This paper covers serving-time inference. It does not address training energy, does not propose a new hardware design, and does not claim to displace the carbon-aware-scheduling literature — it describes how the results of that literature are operationalised in a product. Where our mechanisms align with existing academic work (notably MLPerf Power[6] and carbon-intensity accounting[11]) we cite it rather than re-derive it.
Background & related work
Early work on AI energy framed the problem at the training scale. Strubell et al.[3] first popularised the carbon arithmetic of NLP training; Patterson et al.[1] produced the canonical analysis of large-model training emissions; Schwartz et al.[2] proposed "Green AI" as a research norm alongside accuracy. Luccioni et al.[7] gave a fully accounted training LCA for BLOOM.
Inference accounting has matured more recently. Henderson et al.[5] proposed systematic reporting norms; Dodge et al.[11] showed how cloud-region choice changes carbon by ~3× without changing the model; Samsi et al.[8] published "From Words to Watts", a watts-per-token characterisation that is now the conceptual counterpart of our registry; Desislavov et al.[10] modelled the aggregate inference trend.
At the infrastructure scale, Gupta et al.[4] argued that embodied carbon and operational carbon are co-equal, and Wu et al.[9] summarised the full sustainable-AI problem space. MLCommons' MLPerf Power[6] gave the community a shared benchmarking methodology and is the spiritual ancestor of our measured sampling path.
What is consistently missing from this literature is a routing-level account of how energy enters the per-request decision. Almost all prior work measures; very little dispatches on the measurement. The contribution of this paper is in that gap — not as a new theoretical result, but as an engineering pattern with a working reference implementation.
System architecture overview
VDF AI Networks is the orchestration tier of the VDF platform. A network is a declarative specification of nodes, dependencies, and routing policy; at execution time the engine resolves each node against a candidate pool that includes both cloud APIs and locally-hosted models. Two artefacts govern energy behaviour at this tier.
The first is EnergyBudget, declared on the network itself. It carries
max_watts, a strategy (balanced, energy, speed), a routing_mode
(eco, balanced, max_quality), and optional custom weights. The budget is part of the spec, not a
side-channel setting, so it is diffed and reviewed in the same pull request as any other logic
change.
The second is EnergyRecord, persisted per node per run in the Vault. Each record carries
avg_power_w, energy_wh, input_energy_wh,
output_energy_wh, co2_g with phase-split variants, a method
(measured, estimated_tokens, unknown), a source
attribution (nvml, rocm, prometheus, model_coeff,
manual), and a confidence score. The pair turns energy from a black-box
runtime property into a spec-declared budget with a per-request audit ledger.
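The shape of the two artefacts can be sketched as plain dataclasses. Field names follow the text above; the types, defaults, and constructor call are illustrative assumptions, not the production schema (the phase-split CO₂e variants are omitted for brevity):

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the spec-level budget declared on the network itself.
@dataclass
class EnergyBudget:
    max_watts: Optional[float] = None
    strategy: str = "balanced"          # balanced | energy | speed
    routing_mode: str = "balanced"      # eco | balanced | max_quality
    custom_weights: Optional[dict] = None

# Sketch of the per-node, per-run audit record persisted in the Vault.
@dataclass
class EnergyRecord:
    avg_power_w: float
    energy_wh: float
    input_energy_wh: float
    output_energy_wh: float
    co2_g: float
    method: str        # measured | estimated_tokens | unknown
    source: str        # nvml | rocm | prometheus | model_coeff | manual
    confidence: float  # calibrated, 0..1

budget = EnergyBudget(max_watts=350.0, routing_mode="eco")
```

Because the budget is an ordinary field of the network spec, a change to `routing_mode` shows up as a one-line diff in review, exactly like any other logic change.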
Methodology — ten mechanisms
Each subsection names one production mechanism, describes how it reduces energy, and identifies the runtime subsystem in which it lives.
Multi-objective routing with energy as a first-class objective
A four-dimensional objective scorer normalises Quality, Latency, Cost, and Energy for every candidate. The Eco preset raises the energy weight to 0.50 while Max-Quality drops it to 0.05, producing a dispatcher that is explicit rather than implicit about sustainability.
Routing layer · Multi-objective scoring

Token-based dual-phase energy estimation
Energy is not a single scalar per model. We store separate Wh/1K coefficients for input (prefill) and output (generation) tokens because their compute profiles differ by factors of 2–3× in modern decoder transformers.
Energy subsystem · Dual-phase estimator

Measured power sampling for local inference
Local and on-prem runs are wrapped with a sampler that polls hardware power telemetry (NVML, ROCm, or a Prometheus endpoint) at a configurable cadence (default 250 ms). We compute E = P̄ · Δt with an explicit confidence of 0.85.
Telemetry · Hardware-level sampling

Regional carbon intensity multiplier
Watts become grams of CO₂e through a deployment-level carbon-intensity parameter (g CO₂ / kWh). This parameterisation is the hook for time-of-use and renewables-aware scheduling: identical energy at a lower grid intensity produces less reportable carbon.
Energy subsystem · Carbon facade

EnergyBudget bound per network
Each network specification can declare an EnergyBudget with max_watts, a strategy (balanced / energy / speed), and a routing_mode override. Operators codify sustainability policy into the workflow itself rather than hope the runtime picks the right model.
Network spec · Declarative budget

Model energy registry with structured profiles
A structured registry records hardware_class, quantization (fp16 · fp8 · int4 · int8), batch_regime, and a confidence score per model. The scorer reads this directly so energy decisions are driven by catalog data, not implicit knowledge.
Configuration · Model energy registry

Heuristic size-tier fallback
Unknown or emerging models are placed on a size-tier curve (4B → 0.09, 8B → 0.14, 35B → 0.40, 80B → 0.80, 500B → 2.25 Wh/1K tokens). The tier attribution is persisted so analytics can distinguish measured data from heuristic estimates.
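As a sketch, the fallback can snap an unknown model to the nearest tier at or above its parameter count and return the tier alongside the coefficient so the attribution can be persisted. The snapping rule and function name are illustrative assumptions; the production curve may place models differently:

```python
import bisect

# Size-tier fallback curve from the text: parameters (B) -> Wh/1K tokens.
TIER_CURVE = [(4, 0.09), (8, 0.14), (35, 0.40), (80, 0.80), (500, 2.25)]

def tier_fallback(param_count_b: float) -> tuple:
    """Snap an unknown model to the nearest tier at or above its size.

    Returns (Wh/1K tokens, tier in B-params) so that analytics can
    distinguish this heuristic attribution from measured data.
    """
    sizes = [s for s, _ in TIER_CURVE]
    i = bisect.bisect_left(sizes, param_count_b)
    i = min(i, len(TIER_CURVE) - 1)   # clamp anything above 500B to the top tier
    size, wh = TIER_CURVE[i]
    return wh, size

# A 13B model snaps up, conservatively, to the 35B tier.
```

Snapping upward is the conservative choice: an unknown mid-size model is charged at the next tier rather than flattered by the one below it.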
Energy subsystem · Heuristic fallback

Local-vs-cloud equivalence tracking
Local runs also record a cloud_equivalent_watts estimate (a 3× baseline heuristic where the profile is missing). This lets a cost-aware router compare "run locally at 3× the wall power but zero egress" against "cloud API at lower on-site power but ongoing spend".
Routing layer · Equivalence model

Confidence tracking per estimation method
Every EnergyRecord carries method ∈ {measured, estimated_tokens, heuristic} with a confidence band (≈0.85 measured, 0.6–0.8 registry, 0.14–0.55 heuristic). Analytics dashboards can weight or gate on this so reported totals are never opaque.
Persistence · Vault record schema

Aggregated analytics & persisted audit trail
An Energy Analytics API exposes total_energy_wh, total_co2_g, input/output phase splits, method counts, and coverage_ratio. Persistence in the Vault gives each request a reproducible energy ledger — a prerequisite for any honest sustainability claim.
Persistence · Analytics surface

Empirical observations
Table 1 reports the energy intensity values the scorer consumes today, sourced from the registry, live measurement, and the size-tier heuristic fallback. The values are expressed in watt-hours per 1 000 tokens (Wh/1K), split between input (prefill) and output (generation) phases. Confidence is reported per entry; a confidence under 0.5 generally indicates a heuristic derivation and is surfaced as such in the analytics API.
| Model / tier | Class | Input | Output | Confidence | Source |
|---|---|---|---|---|---|
| GPT-4o-mini (cloud) | small · cloud | 0.22 | 0.65 | 0.70 | registry |
| Llama 3.2 (local, 8B-class) | local sweet spot | 0.12 | 0.33 | 0.80 | registry + measured |
| 70B-class, quantised (int4) | balanced | 0.38 | 0.95 | 0.65 | registry |
| Hermes 405B-class | frontier | 2.40 | 6.20 | 0.60 | registry |
| Heuristic · 4B tier | edge | 0.09 | 0.24 | 0.25 | size-tier fallback |
| Heuristic · 8B tier | edge / local | 0.14 | 0.40 | 0.35 | size-tier fallback |
| Heuristic · 35B tier | mid | 0.40 | 1.05 | 0.40 | size-tier fallback |
| Heuristic · 80B tier | compute-heavy | 0.80 | 2.10 | 0.40 | size-tier fallback |
| Heuristic · 500B tier | frontier-fallback | 2.25 | 5.60 | 0.30 | size-tier fallback |
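The estimator that consumes these coefficients is a weighted sum over the two phases. A minimal sketch using a slice of Table 1 (the registry keys and return shape are illustrative, not the production API):

```python
# Registry slice from Table 1: (input Wh/1K, output Wh/1K, confidence).
REGISTRY = {
    "gpt-4o-mini":  (0.22, 0.65, 0.70),
    "llama-3.2-8b": (0.12, 0.33, 0.80),
    "hermes-405b":  (2.40, 6.20, 0.60),
}

def estimate_energy_wh(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Dual-phase token estimate: E = n_in/1000 * c_in + n_out/1000 * c_out."""
    c_in, c_out, conf = REGISTRY[model]
    e_in = input_tokens / 1000 * c_in
    e_out = output_tokens / 1000 * c_out
    return {"input_energy_wh": e_in, "output_energy_wh": e_out,
            "energy_wh": e_in + e_out, "confidence": conf,
            "method": "estimated_tokens"}

# 2,000 prompt tokens + 500 generated tokens on the local 8B-class model:
# 2.0 * 0.12 + 0.5 * 0.33 = 0.405 Wh
```

The phase split matters in practice: a retrieval-heavy request with a long prompt and short answer is dominated by the cheaper prefill coefficient, which a single per-model scalar would miss.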
Table 2 gives the four built-in routing presets and the resulting weight vectors. The Eco preset
lifts Energy from 0.10 (Balanced) to 0.50 while compressing Latency; the Max-Quality preset pushes
Quality to 0.70 and deliberately de-prioritises Energy. In practice, routing an identical batch
through Eco vs Max-Quality shifts candidate selection toward smaller and local models; the
consequence is directly visible in the analytics API as a lower total_energy_wh per
equivalent token throughput.
| Mode | Quality | Latency | Cost | Energy | Typical use |
|---|---|---|---|---|---|
| Eco | 20% | 10% | 20% | 50% | High-volume, non-critical workloads |
| Balanced | 40% | 20% | 30% | 10% | Default for mixed enterprise traffic |
| Max-Quality | 70% | 15% | 10% | 5% | Hard-accuracy tasks; infrequent |
| Default | 35% | 25% | 25% | 15% | Fallback when no preset is declared |
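Selection under these presets reduces to a normalised weighted sum. A minimal sketch follows; the candidate fields, the normalisation direction, and the example values are illustrative assumptions, and the +15-point local bonus under Eco is the one described in the next subsection:

```python
WEIGHTS = {  # Table 2 weight vectors: (quality, latency, cost, energy)
    "eco":         (0.20, 0.10, 0.20, 0.50),
    "balanced":    (0.40, 0.20, 0.30, 0.10),
    "max_quality": (0.70, 0.15, 0.10, 0.05),
    "default":     (0.35, 0.25, 0.25, 0.15),
}

def score(cand: dict, mode: str = "balanced") -> float:
    """Higher is better. quality is 0..1; latency, cost, and energy are
    assumed pre-normalised to 0..1 where 1 is the worst candidate, so
    they enter with a negative sign."""
    wq, wl, wc, we = WEIGHTS[mode]
    quality = cand["quality"]
    if mode == "eco" and cand.get("local"):
        quality = min(1.0, quality + 0.15)   # +15-point local bonus under Eco
    return (wq * quality - wl * cand["latency"]
            - wc * cand["cost"] - we * cand["energy"])

cloud = {"quality": 0.90, "latency": 0.3, "cost": 0.6, "energy": 0.8, "local": False}
local = {"quality": 0.78, "latency": 0.5, "cost": 0.2, "energy": 0.2, "local": True}
# Under Eco the local candidate wins; under Max-Quality the cloud one does.
```

The point of the sketch is that the mode switch changes nothing but the weight vector, which is what makes the resulting shift in `total_energy_wh` attributable to policy rather than to a hidden heuristic.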
The +15% local-quality bonus — a deliberate trade-off
In Eco mode the scorer adds a small quality bonus (+15 percentage points in the normalised quality term) to candidates that run locally. The effect is to bias the router toward distributed and on-premises inference when the user has declared an energy-leaning policy. This is not an accuracy claim; it is a policy statement that says "under Eco, we are willing to accept a local model that would score slightly lower under Max-Quality". The bonus is tunable and fully logged, so downstream analytics can quantify its contribution to any headline saving.
Discussion, limitations, and future work
Where the numbers come from
No single source gives us all the coefficients. Measured sampling is authoritative only where we own
the hardware; cloud inference must be estimated from published or vendor-supplied data; and
emerging models need a fallback. The registry therefore tags each entry with a method and a
confidence, and the analytics API surfaces a coverage_ratio so downstream reports can
state honestly what fraction of a totalled energy_wh was measured vs estimated.
Over-claiming is the original sin of sustainability reporting[15];
confidence tracking is how we refuse to do it.
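A coverage-aware rollup over persisted records can be sketched as follows. The record shape is trimmed, and defining coverage_ratio as the measured share of total watt-hours is an assumption consistent with the text, not the production definition:

```python
def aggregate(records: list) -> dict:
    """Roll per-request records into the analytics surface described above.

    coverage_ratio here = share of total Wh backed by measured sampling,
    so a report can state what fraction of its headline number was
    measured rather than estimated.
    """
    total = sum(r["energy_wh"] for r in records)
    measured = sum(r["energy_wh"] for r in records if r["method"] == "measured")
    methods = {}
    for r in records:
        methods[r["method"]] = methods.get(r["method"], 0) + 1
    return {
        "total_energy_wh": total,
        "total_co2_g": sum(r["co2_g"] for r in records),
        "method_counts": methods,
        "coverage_ratio": measured / total if total else 0.0,
    }

runs = [
    {"energy_wh": 0.40, "co2_g": 160.0, "method": "measured"},
    {"energy_wh": 0.60, "co2_g": 240.0, "method": "estimated_tokens"},
]
# coverage_ratio = 0.40 / 1.00 = 0.4
```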
Grid variance and time-of-use
Carbon intensity (VDF_CARBON_INTENSITY_G_PER_KWH) is a parameter because grids are not
uniform and not stationary[11]. The present implementation accepts
a static value per deployment; a production extension is to receive a time-series from a grid
operator and schedule non-urgent batches against it. This is the carbon-aware-scheduling direction
of Gupta et al.[4] and we treat it as a near-term roadmap item,
not a finished capability.
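The facade itself is a single multiplication. A sketch that accepts either the static VDF_CARBON_INTENSITY_G_PER_KWH value or a time-indexed lookup (the time-series path stands in for the roadmap item above and is not a shipped feature):

```python
from datetime import datetime
from typing import Callable, Optional, Union

def to_co2_g(energy_wh: float,
             intensity: Union[float, Callable[[datetime], float]],
             at: Optional[datetime] = None) -> float:
    """grams CO2e = Wh / 1000 * (g CO2e per kWh).

    `intensity` is either the static per-deployment value or, in the
    roadmap extension, a lookup into a grid operator's time series.
    """
    g_per_kwh = intensity(at or datetime.now()) if callable(intensity) else intensity
    return energy_wh / 1000.0 * g_per_kwh

# 0.405 Wh on a 400 g/kWh grid -> 0.162 g CO2e
```

Because the estimator's output and the intensity are kept separate until this point, the same `energy_wh` ledger can be re-priced retroactively if better grid data arrives.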
Cold-start and short-request sampling
A 250 ms sampling interval is a deliberate compromise. Shorter intervals increase overhead; longer intervals smear over short requests. For sub-second inference the measured path degrades toward the estimated path, which is why the estimator is the primary source of truth and the sampler is a corroboration layer. We are experimenting with adaptive cadence (tighter during warmup) but keep the conservative default for now.
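The measured path in sketch form: poll a backend at the configured cadence, then apply E = P̄ · Δt. The `read_power_w` callable stands in for the NVML/ROCm/Prometheus backends; the names and return shape are illustrative:

```python
import time
from typing import Callable

def sample_energy_wh(read_power_w: Callable[[], float],
                     duration_s: float, cadence_s: float = 0.25) -> dict:
    """Poll power at `cadence_s` (default 250 ms), then E = mean(P) * dt.

    For sub-second requests too few samples accrue, which is why the
    token estimator remains the primary source of truth and this path
    serves as corroboration.
    """
    samples = []
    t0 = time.monotonic()
    while time.monotonic() - t0 < duration_s:
        samples.append(read_power_w())
        time.sleep(cadence_s)
    elapsed_h = (time.monotonic() - t0) / 3600.0
    avg_w = sum(samples) / len(samples) if samples else 0.0
    return {"avg_power_w": avg_w, "energy_wh": avg_w * elapsed_h,
            "method": "measured", "confidence": 0.85}
```

A one-second request at a steady 300 W yields roughly 0.083 Wh, but only four samples at the default cadence, which is exactly the short-request degradation discussed above.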
What we do not measure
The paper is deliberately narrow. We do not account for the embodied carbon of the hardware itself — that is a separate, important problem[4]. We do not model cooling or PUE at the datacentre level; that term is exposed via the carbon-intensity parameter but not disaggregated. And we do not perform a full LCA of the model weights; Luccioni et al.[7] is the reference for that workstream.
Conclusion
If inference has become the majority of the footprint, then reducing inference energy is a routing problem before it is a hardware problem. We have described, at source-of-record level, how VDF AI Networks makes energy a first-class term in its multi-objective scorer, how it persists an auditable per-request ledger, and how it exposes that ledger so that operators can state — and can be asked to defend — their aggregate numbers.
The mechanism is deliberately unglamorous. Ten small, composable decisions; one scoring function; one record schema; one analytics surface. The claim is not that this is optimal. The claim is that it is visible, and that visibility is the precondition for every improvement that follows.
References
- [1] Patterson, D. et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350.
- [2] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM 63(12), 54–63.
- [3] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
- [4] Gupta, U. et al. (2022). Chasing Carbon: The Elusive Environmental Footprint of Computing. HPCA.
- [5] Henderson, P. et al. (2020). Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. JMLR 21(248).
- [6] MLCommons (2023). MLPerf Power Benchmark — Methodology and Rules.
- [7] Luccioni, A. S., Viguier, S., & Ligozat, A.-L. (2023). Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. JMLR.
- [8] Samsi, S. et al. (2023). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. IEEE HPEC.
- [9] Wu, C.-J. et al. (2022). Sustainable AI: Environmental Implications, Challenges and Opportunities. MLSys.
- [10] Desislavov, R., Martínez-Plumed, F., & Hernández-Orallo, J. (2023). Trends in AI inference energy consumption. Sustainable Computing.
- [11] Dodge, J. et al. (2022). Measuring the Carbon Intensity of AI in Cloud Instances. FAccT.
- [12] Google Research (2022). The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. IEEE Computer.
- [13] Maslej, N. et al. (2024). The AI Index Report. Stanford Institute for Human-Centered AI.
- [14] ISO/IEC (2018). ISO 14067:2018 — Greenhouse gases — Carbon footprint of products.
- [15] WRI / WBCSD (2011). GHG Protocol Product Life Cycle Accounting and Reporting Standard.
- [16] NVIDIA Corporation. NVML / nvidia-smi Reference.
- [17] AMD Inc. ROCm SMI Library Reference.
- [18] Dean, J. (2020). The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design. ISSCC Plenary.