An LLM routing formula is a scoring function that ranks candidate models for a given request by combining four normalized signals (cost, latency, quality, and security/governance), each weighted by the task's priorities. In its simplest form: Score = w_cost·Cost + w_latency·Latency + w_quality·Quality + w_security·Security. The router picks the highest-scoring model, which is usually the cheapest, fastest option that still clears the quality bar and satisfies policy, not the most capable model available.
Key takeaways
- A routing formula turns "which model should handle this?" into a weighted score across cost, latency, quality, and security/governance — computed per request, not set once.
- Each signal is normalized to 0–1 so they can be combined; the weights encode what the task actually cares about (a batch job weights cost, a live agent weights latency).
- Security/governance often acts as a hard gate, not just a soft weight: a model that would send regulated data off-premise can be scored to zero or blocked outright regardless of quality.
- The goal is not to always pick the strongest model — it is to pick the cheapest model that still meets the bar, which is where the cost savings of routing come from.
What is an LLM routing formula?
An LLM routing formula is the decision function inside an LLM router that chooses which model should handle a given request. Instead of hard-wiring one model for everything, the router evaluates a pool of candidate models — a small local model, a larger local model, a fine-tuned specialist, a frontier cloud API — and scores each one against the requirements of the specific task in front of it.
The formula combines several competing signals into a single comparable number. The canonical form is a weighted sum of normalized scores: Score(model, task) = w_cost·CostScore + w_latency·LatencyScore + w_quality·QualityScore + w_security·SecurityScore. Each component is scaled to a 0–1 range, with higher meaning better, so the components are directly comparable. The weights sum to 1 and express what this task prioritizes. The router then selects the model with the highest total score.
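As a minimal sketch of the weighted sum above, the Python below combines four normalized scores into one total. The function name `weighted_score`, the dictionary layout, and the validation checks are assumptions made for illustration, not part of any particular router. The signal values are the Local Llama 3.1 8B column from this article's worked example.

```python
from typing import Mapping

SIGNALS = ("cost", "latency", "quality", "security")


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the weighted sum of four normalized (0-1) signal scores."""
    # The weights are meant to sum to 1 so totals stay on the same 0-1 scale.
    if abs(sum(weights[s] for s in SIGNALS) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for s in SIGNALS:
        if not 0.0 <= scores[s] <= 1.0:
            raise ValueError(f"{s} score must lie in [0, 1]")
    return sum(weights[s] * scores[s] for s in SIGNALS)


weights = {"cost": 0.30, "latency": 0.25, "quality": 0.30, "security": 0.15}
scores = {"cost": 1.00, "latency": 0.95, "quality": 0.62, "security": 1.00}
total = weighted_score(scores, weights)  # roughly 0.87
```

Rejecting weights that do not sum to 1 is a design choice in this sketch; it keeps totals from different weight profiles comparable.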
The reason this matters is economic and operational. Sending every request to the most capable frontier model is simple but wasteful — most enterprise traffic is routine and can be served by a cheaper, faster, private model at a fraction of the cost and latency. A routing formula makes that trade-off explicit and automatic, per request, rather than leaving it to a static configuration or a developer's guess.
Cost score
The cost score rewards models that are cheaper to run for this request. It is derived from the model's price per token (or per GPU-second for local inference) multiplied by the expected input and output token counts for the task. A small local model served on already-amortized hardware approaches zero marginal cost; a large frontier API with a long context window is the expensive end of the spectrum.
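A rough sketch of that estimate, assuming token-based pricing with separate input and output rates (a common arrangement, but an assumption here). The function name and every price below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def estimate_request_cost(input_tokens: int, output_tokens: int,
                          price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimated cost: expected token counts times the per-token price."""
    return ((input_tokens / 1000) * price_in_per_1k
            + (output_tokens / 1000) * price_out_per_1k)


# Illustrative prices only (currency units per 1,000 tokens); not real provider rates.
frontier_cost = estimate_request_cost(2000, 500, price_in_per_1k=0.5, price_out_per_1k=1.5)

# A local model on amortized hardware is modeled here as zero marginal cost.
local_cost = estimate_request_cost(2000, 500, price_in_per_1k=0.0, price_out_per_1k=0.0)
```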
To normalize, the router maps the estimated cost of each candidate onto a 0–1 scale: the cheapest viable model scores near 1.0, the most expensive near 0.0. A common approach is CostScore = 1 − (cost − cost_min) / (cost_max − cost_min) across the candidate pool. Cost weight is turned up for high-volume, batch, or background workloads where spend dominates, and turned down for a handful of high-stakes requests where getting the answer right is worth far more than the token bill.
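The min-max normalization above could be written as follows. This is a sketch under assumptions: the model names and costs are made up, and returning 1.0 for every model when all costs are equal is one reasonable convention for the divide-by-zero case, not something the formula itself specifies.

```python
from typing import Dict


def cost_scores(costs: Dict[str, float]) -> Dict[str, float]:
    """Map each candidate's estimated cost to 0-1, where cheapest = 1.0."""
    lo, hi = min(costs.values()), max(costs.values())
    if hi == lo:
        # All candidates cost the same, so cost cannot separate them.
        return {name: 1.0 for name in costs}
    return {name: 1.0 - (c - lo) / (hi - lo) for name, c in costs.items()}


# Hypothetical per-request costs for a three-model pool.
costs = {"small-local": 0.0, "large-local": 0.5, "frontier-api": 2.0}
normalized = cost_scores(costs)
```

Because the scale is relative to the pool, adding or removing a candidate changes every other model's cost score.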
Latency score
The latency score rewards models that respond fast enough for the task's interaction pattern. It is estimated from the model's time-to-first-token and per-token generation speed on the target hardware, plus any network round-trip for remote endpoints. A local 8B model on a warm GPU might deliver first tokens in 50–100 ms; a frontier cloud API adds internet transit and provider queueing on top of its own compute.
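One simple way to turn those inputs into an end-to-end estimate is sketched below. It assumes generation time scales linearly with output length and that provider queueing is folded into time-to-first-token. The function name and all figures are illustrative, apart from the 80 ms first-token time, which sits inside the 50–100 ms range mentioned above.

```python
def estimate_latency_ms(ttft_ms: float, tokens_per_second: float,
                        output_tokens: int, network_rtt_ms: float = 0.0) -> float:
    """Network round-trip + time-to-first-token + time to generate the output."""
    generation_ms = output_tokens / tokens_per_second * 1000.0
    return network_rtt_ms + ttft_ms + generation_ms


# Hypothetical local model: fast first token, slower generation, no network hop.
local_ms = estimate_latency_ms(ttft_ms=80, tokens_per_second=50, output_tokens=100)

# Hypothetical remote endpoint: slower first token plus internet transit.
remote_ms = estimate_latency_ms(ttft_ms=400, tokens_per_second=100,
                                output_tokens=100, network_rtt_ms=120)
```

With these made-up numbers the remote endpoint finishes sooner overall despite a slower first token, which is why the estimate should use the task's expected output length rather than first-token time alone.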
Latency is normalized the same way as cost — faster is closer to 1.0 — but the crucial detail is the latency budget. A live customer-facing agent has a hard sub-second requirement, so any model that blows the budget should score near zero no matter how good it is. A nightly document-processing job has a budget of minutes, so latency barely matters and its weight drops close to zero. The weight on latency is really a statement about how synchronous and interactive the workload is.
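A sketch of how the budget can interact with normalization: latency is min-max scaled across the pool, as with cost, and any model over the budget is forced to 0. The function name, model names, latencies, and budgets are all assumptions for illustration.

```python
from typing import Dict


def latency_scores(latencies_ms: Dict[str, float], budget_ms: float) -> Dict[str, float]:
    """Min-max latency scores (fastest = 1.0); models over budget score 0."""
    lo, hi = min(latencies_ms.values()), max(latencies_ms.values())
    scores = {}
    for name, ms in latencies_ms.items():
        if ms > budget_ms:
            scores[name] = 0.0      # blows the budget, however good it is otherwise
        elif hi == lo:
            scores[name] = 1.0      # all candidates are equally fast
        else:
            scores[name] = 1.0 - (ms - lo) / (hi - lo)
    return scores


# Hypothetical end-to-end latencies in milliseconds.
latencies = {"local-8b": 200.0, "local-70b": 600.0, "frontier-cloud": 1800.0}

interactive = latency_scores(latencies, budget_ms=500)     # live agent
batch = latency_scores(latencies, budget_ms=60_000)        # nightly job
```

Under the 500 ms budget only the fastest model keeps a non-zero score; under the batch budget the mid-sized model recovers to 0.75, and in practice the latency weight would also be close to zero.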
Quality score
The quality score estimates how well a model will actually perform on this kind of task — which is harder to pin down than cost or latency because it is not a single number. It is typically built from evaluation results: benchmark scores, task-specific eval suites, historical accuracy on similar requests, and live signals like human thumbs-up rates or downstream validation pass rates.
The key insight is that quality is relative to the task, not absolute. A frontier model may top a general leaderboard, but for classifying a support ticket or extracting fields from an invoice, a small fine-tuned model can match or beat it at a tenth of the cost. So the quality score should be conditioned on task type — a routing system that treats quality as one global ranking will systematically over-pay. Mature routers also apply a quality floor: any model below the minimum acceptable quality for the task is disqualified before scoring, so cost and latency can never buy a wrong answer.
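To make the task-conditioned quality floor concrete, here is a small sketch. The task names, model names, quality values, and floors are invented for illustration; in a real system they would come from the evaluation results described above.

```python
from typing import Dict

# Hypothetical quality scores, keyed by task type and then by model.
QUALITY_BY_TASK: Dict[str, Dict[str, float]] = {
    "ticket_classification": {"small-finetuned": 0.93, "frontier": 0.94},
    "legal_reasoning": {"small-finetuned": 0.55, "frontier": 0.92},
}

# Hypothetical minimum acceptable quality per task type.
QUALITY_FLOOR: Dict[str, float] = {
    "ticket_classification": 0.85,
    "legal_reasoning": 0.85,
}


def models_above_floor(task_type: str) -> Dict[str, float]:
    """Keep only models whose task-specific quality meets the task's floor."""
    floor = QUALITY_FLOOR[task_type]
    return {m: q for m, q in QUALITY_BY_TASK[task_type].items() if q >= floor}


ticket_pool = models_above_floor("ticket_classification")  # both models remain
legal_pool = models_above_floor("legal_reasoning")         # only the frontier model remains
```

Applying the floor before scoring means a cheap, fast model cannot win a task it is not good enough for.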
Security/governance score
The security and governance score is what makes routing viable for regulated enterprises, and it is where naive cost/latency/quality routers fail. It encodes whether a model is allowed to handle this request: does the data classification permit sending it to that endpoint, does the model run inside the required perimeter, is it on the approved-for-production list, and does the request touch data with residency or sovereignty constraints?
Unlike the other signals, security often behaves as a hard gate rather than a soft weight. If a request contains regulated data — PHI, PII, financial records, classified material — a cloud endpoint that would move that data off-premise is not merely penalized; it is removed from the candidate pool entirely, regardless of how strong its cost, latency, or quality scores are. This is the difference between a routing formula that optimizes economics and one that is safe to point at sensitive workloads. See AI security and data sovereignty for why this must be enforced structurally.
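A hard gate of this kind might look like the sketch below. The field names (`inside_perimeter`, `approved_for_production`), the data-class labels, and the two-model pool are hypothetical; a real policy engine would carry more detail, such as residency regions.

```python
from typing import Dict, Iterable

# Data classes treated as regulated in this sketch.
REGULATED_CLASSES = frozenset({"PHI", "PII", "financial", "classified"})


def passes_security_gate(model: Dict, data_classes: Iterable[str]) -> bool:
    """Return False if policy forbids this model from seeing this request."""
    if not model["approved_for_production"]:
        return False
    touches_regulated = bool(REGULATED_CLASSES & set(data_classes))
    if touches_regulated and not model["inside_perimeter"]:
        return False  # regulated data would leave the perimeter
    return True


pool = [
    {"name": "local-8b", "inside_perimeter": True, "approved_for_production": True},
    {"name": "frontier-cloud", "inside_perimeter": False, "approved_for_production": True},
]

pii_eligible = [m["name"] for m in pool if passes_security_gate(m, {"PII"})]
public_eligible = [m["name"] for m in pool if passes_security_gate(m, set())]
```

The gate returns a boolean rather than a score, so no weighting of cost, latency, or quality can bring a blocked model back into the pool.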
Example routing decision
Consider a routine internal request — summarize and classify a support ticket that contains customer PII — with weights set for a high-volume, semi-interactive workload: cost 30%, latency 25%, quality 30%, security 15%. The router scores three candidates against these weights. Because the ticket contains PII, the frontier cloud model that would send data off-premise takes a heavy security penalty (and in a stricter policy would be gated out entirely).
The small local model wins — not because it is the most capable model in the pool, but because it clears the quality floor for this task while being the cheapest, fastest, and fully private option. That is the routing formula doing its job: reserving expensive capability for the requests that genuinely need it. Swap in a hard legal-reasoning task where quality is weighted 60% and the quality floor rises, and the same formula would route to the stronger local 70B model instead.
Why routing matters for on-prem AI
On-premise and sovereign AI deployments make routing more valuable, not less. Once you run a pool of local models on hardware you own, the marginal cost of the cheapest model approaches zero — so every request the router keeps away from an expensive frontier API is nearly free capacity you have already paid for. Routing is how you extract the economic advantage of owning your infrastructure instead of treating every task as if it needs the biggest model.
It is also how a hybrid estate stays governed. In a mixed local-plus-cloud setup, the security score is the mechanism that guarantees sensitive data never leaves the perimeter regardless of which model would otherwise score highest — the policy is enforced in the routing decision, not left to application code. Pair that with the cost and latency wins of running LLMs locally, and routing becomes the control plane that makes on-prem AI both affordable and compliant. The broader architecture is covered in fine-tuning vs routing vs smaller models.
How it works
- 01 Estimate the request: The router inspects the incoming request (task type, expected token counts, data classification, and latency budget) before choosing a model.
- 02 Filter the candidate pool: Models that fail a hard constraint are removed first: below the quality floor for the task, over the latency budget, or disallowed by the security/governance policy for this data.
- 03 Score the survivors: Each remaining model gets normalized cost, latency, quality, and security scores, combined into a weighted total using the task's priority weights.
- 04 Route and learn: The highest-scoring model handles the request; the outcome (cost, latency, and quality signals) feeds back to refine future scores.
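The filter, score, and route steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any specific router: the request estimate from step 01 is taken as already done and expressed through the `policy_ok` and `within_budget` flags, the feedback loop from step 04 is left out, and the quality floors are hypothetical. The signal scores and weights are the ones from this article's worked example.

```python
from typing import Dict, List, Optional

WEIGHTS = {"cost": 0.30, "latency": 0.25, "quality": 0.30, "security": 0.15}

CANDIDATES = [
    {"name": "local-8b", "policy_ok": True, "within_budget": True,
     "scores": {"cost": 1.00, "latency": 0.95, "quality": 0.62, "security": 1.00}},
    {"name": "local-70b", "policy_ok": True, "within_budget": True,
     "scores": {"cost": 0.72, "latency": 0.70, "quality": 0.88, "security": 1.00}},
    # Strict policy assumed here: the PII in the request may not leave the perimeter.
    {"name": "frontier-cloud", "policy_ok": False, "within_budget": True,
     "scores": {"cost": 0.25, "latency": 0.55, "quality": 0.97, "security": 0.40}},
]


def route(candidates: List[Dict], weights: Dict[str, float],
          quality_floor: float) -> Optional[str]:
    # Step 02: remove models that fail any hard constraint.
    survivors = [c for c in candidates
                 if c["policy_ok"] and c["within_budget"]
                 and c["scores"]["quality"] >= quality_floor]
    if not survivors:
        return None  # nothing eligible; the caller decides how to fail or escalate

    # Step 03: weighted total for each surviving model.
    def total(c: Dict) -> float:
        return sum(weights[k] * c["scores"][k] for k in weights)

    # Step 04: the highest-scoring survivor handles the request.
    return max(survivors, key=total)["name"]


routine = route(CANDIDATES, WEIGHTS, quality_floor=0.60)    # small local model wins
demanding = route(CANDIDATES, WEIGHTS, quality_floor=0.80)  # floor excludes the 8B model
```

Returning `None` when every candidate is filtered out keeps the failure explicit instead of silently falling back to a model that violates a constraint.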
Worked example: scoring three models for one task
The request is a PII-bearing support ticket, scored with weights of cost 30%, latency 25%, quality 30%, and security 15%. Scores are normalized 0–1; higher is better.
| Signal (weight) | Local Llama 3.1 8B | Local Llama 3.1 70B | Frontier Cloud API |
|---|---|---|---|
| Cost score (30%) | 1.00 | 0.72 | 0.25 |
| Latency score (25%) | 0.95 | 0.70 | 0.55 |
| Quality score (30%) | 0.62 | 0.88 | 0.97 |
| Security score (15%) | 1.00 | 1.00 | 0.40 |
| Weighted total | 0.87 | 0.81 | 0.56 |
| Router decision | Selected | Runner-up | Penalized / gated by policy |
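The weighted totals in the table can be reproduced with a short script, which also shows how the ranking shifts under a quality-heavy profile. The signal scores and the support-ticket weights come from the table. The second profile is hypothetical apart from its 60% quality weight, which matches the legal-reasoning scenario described earlier. Reusing the same signal scores for a different task is a simplification, since quality is task-relative.

```python
from typing import Dict

SIGNAL_SCORES = {
    "Local Llama 3.1 8B": {"cost": 1.00, "latency": 0.95, "quality": 0.62, "security": 1.00},
    "Local Llama 3.1 70B": {"cost": 0.72, "latency": 0.70, "quality": 0.88, "security": 1.00},
    "Frontier Cloud API": {"cost": 0.25, "latency": 0.55, "quality": 0.97, "security": 0.40},
}

TICKET_WEIGHTS = {"cost": 0.30, "latency": 0.25, "quality": 0.30, "security": 0.15}
# Hypothetical split; only the 60% quality weight comes from the article.
LEGAL_WEIGHTS = {"cost": 0.15, "latency": 0.10, "quality": 0.60, "security": 0.15}


def totals(weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted total per model for one weight profile."""
    return {model: sum(weights[k] * s[k] for k in weights)
            for model, s in SIGNAL_SCORES.items()}


ticket_totals = totals(TICKET_WEIGHTS)  # about 0.8735, 0.805, 0.5635
legal_totals = totals(LEGAL_WEIGHTS)    # about 0.767, 0.856, 0.7345

ticket_winner = max(ticket_totals, key=ticket_totals.get)
legal_winner = max(legal_totals, key=legal_totals.get)
```

The 70B total of 0.805 appears as 0.81 in the table after rounding. Under the quality-heavy weights the 70B model overtakes the 8B model even before any quality floor is applied.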
From concept to a governed, on-premise reality
VDF AI Router implements this exact scoring model as its runtime decision layer. Every request is evaluated against the available model pool on cost, latency, quality, and policy — so routine work runs on cheap private models and only the requests that need frontier capability pay for it, all inside infrastructure you control.
What makes it enterprise-grade is that governance is not an afterthought. Security and data-residency constraints are enforced as hard gates in the routing decision, and the router is self-evolving: it tunes its scores on live performance signals rather than a static rule table. The architecture behind this — how VDF AI treats routing as a governed, observable, self-improving primitive — is documented in the SEEMR architecture overview and the accompanying white paper.
Frequently asked questions
What is the LLM routing formula?
It is a scoring function that ranks candidate models for a request by combining normalized cost, latency, quality, and security/governance scores, each weighted by the task's priorities: Score = w_cost·Cost + w_latency·Latency + w_quality·Quality + w_security·Security. The router selects the highest-scoring model.
How are the four routing scores combined?
Each signal is normalized to a 0–1 scale so they are comparable, then combined as a weighted sum where the weights (summing to 1) express the task's priorities. Hard constraints — like a security policy or a quality floor — are usually applied as gates that remove a model from the pool before the weighted scoring happens.
Does the routing formula always pick the best model?
No — and that is the point. It picks the cheapest, fastest model that still clears the quality floor and satisfies policy. Reserving expensive frontier models for the minority of requests that genuinely need them is where routing generates its cost savings.
How does security factor into model routing?
Security and governance are typically enforced as a hard gate rather than a soft weight. If a request contains regulated data, any model that would move that data outside the allowed perimeter is removed from consideration entirely, no matter how well it scores on cost, latency, or quality.
How do you set the routing weights?
Weights encode what the workload cares about. High-volume batch jobs weight cost heavily and latency lightly; live customer-facing agents weight latency and quality; high-stakes reasoning tasks weight quality and raise the quality floor. The weights are per task type, not global.
Why is LLM routing especially valuable for on-premise AI?
When you own the hardware, the marginal cost of your cheapest local model is near zero, so every request that routing keeps off an expensive cloud API runs on capacity you have already paid for. Routing also enforces data-residency policy in a hybrid estate, guaranteeing sensitive requests never leave the perimeter.
Let the router score cost, latency, quality, and policy on every request.
VDF AI Router applies this scoring model — governed by policy, tuned on live performance signals — across local and cloud models inside your own environment. Explore the router, or read the SEEMR white paper for the full architecture.