WHITE PAPER v1.0 May 2026 VDF-WP-2026-002

The Self-Evolving Model Router.

A composable, six-tier dispatch architecture that turns model selection from a static configuration into a continuously-learning decision — combining policy enforcement, prompt-aware retrieval, rule-based filtering, predictive re-ranking, contextual bandits, and challenger exploration under a single, gracefully-degrading routing surface.

License CC BY 4.0
ABSTRACT

Enterprise dispatch of large language models has historically been a configuration decision: operators bind a model to a workload and live with the choice. Real fleets, however, are non-stationary. Provider quotas oscillate, latency drifts on shared cloud endpoints, capabilities evolve as new model families arrive weekly, and the cost-quality-energy frontier shifts under the operator's feet[15][13]. A static binding is therefore a slowly-failing decision, and the problem is not solved by adding an A/B test on top of a static dispatcher — it is solved by treating routing itself as a non-stationary contextual decision.

This white paper documents how VDF AI Networks operationalises that view. Every request flows through a six-tier dispatcher: policy enforcement, prompt-aware retrieval shortlisting, rule-based filtering with a multi-objective scorer, predictive re-ranking on per-arm history, contextual-bandit selection under a disjoint-per-arm LinUCB learner[2], and challenger exploration that dual-routes a small fraction of traffic for live preference learning. Each tier is independently feature-gated and degrades to the next-simpler strategy when its signal is unavailable. The composition, not any single tier, is the contribution.

The router is self-evolving in three coupled senses. Online, every completed request becomes a reward observation that updates the chosen arm via a rank-one Sherman–Morrison update[10]; failures are folded back as a bounded penalty rather than dropped; and an offline trainer batches the run vault to re-derive priors that are atomically swapped into the live policy. We describe the design parameters, the graceful-degradation envelope, and the position of the work relative to the recent cost-quality routing literature. The paper is a design account and deliberately avoids over-claiming measured outcomes.

Keywords contextual bandits · LinUCB · model routing · disjoint per-arm learning · prompt-embedding retrieval · multi-objective scoring · online/offline learning duality · LLM serving · graceful degradation · policy-bound dispatch
AT A GLANCE

Six numbers that anchor the paper

Decision tiers: 6 · independently feature-gated layers in the dispatch stack
Context dim: 64 · sparse hashed features encoded per request
Exploration: α = 0.8 · UCB confidence bonus on the contextual bandit
Window: ~200 obs · per-model rolling latency and throughput window
Challenger: ~2% · of traffic dual-routed for live preference learning
Failure reward: 0.15 · bounded penalty fed back to the bandit on timeout or error

FIGURE 1

The six-tier router — per-request lifecycle

Inputs arrive from the workflow specification on the left and exit as a routing decision and an ordered failover list on the right. Every tier is feature-gated; the dashed return loop depicts the online/offline learning duality that gives the router its name.

Fig. 1. Per-request routing lifecycle. Each tier is feature-gated and fails open to the next-simpler strategy when its signal is unavailable. The dashed return loop shows the online reward update and the offline retraining cycle that re-derives priors.
SECTION 1

Introduction & motivation

Three things change beneath an enterprise dispatcher in any given quarter. Provider quotas and rate limits drift, sometimes overnight; latency on shared cloud endpoints fluctuates with datacentre load and is correlated across tenants but invisible to any individual one; and the model catalog itself evolves — new families arrive, established ones deprecate, and the price-quality frontier moves[15]. None of these are visible to a dispatcher that selects models by static configuration.

A buyer accepting this state of affairs typically responds in one of three ways: pin the safest model and pay the premium, pin the cheapest model and absorb the variance, or layer an offline A/B test on top of a static dispatcher and update the configuration by hand. None of the three scales. The first wastes capacity; the second wastes outcomes; the third turns the dispatcher into a manual rebalancing job. What is needed is a routing layer that treats the choice of model as a non-stationary contextual decision — one that absorbs the drift instead of papering over it.

The Self-Evolving Model Router is the dispatch tier of VDF AI Networks. It is designed around the observation that every routing decision is a bandit problem with a context vector and a stream of delayed, partial rewards[2][5]: the dispatcher chooses an arm, the runtime returns an outcome (a quality score, a latency, an error), and the policy must update to make better choices the next time the same context recurs. The router solves the bandit problem with a per-arm linear UCB learner inside a broader, gracefully-degrading envelope of five sibling tiers, all of which can shape, accept, or override the bandit's recommendation depending on what the system knows about the request.

Scope and non-goals

This paper covers serving-time dispatch only. It does not propose a new bandit algorithm; the underlying linear UCB scheme is well-established[2][3]. The contribution is the composition — how policy, retrieval, multi-objective scoring, predictive re-ranking, online bandit learning, and challenger exploration are layered into one dispatcher with a clear graceful-degradation envelope and an online/offline learning duality. Where the design borrows from the literature we cite rather than re-derive. Empirical numbers beyond the design parameters are deliberately out of scope; the paper is a documented engineering pattern, not a benchmark.

SECTION 2

Background & related work

The theoretical backbone is the contextual multi-armed bandit. Auer[1] introduced the upper-confidence-bound family for the stochastic bandit; Li, Chu, Langford and Schapire[2] generalised it to the contextual case as LinUCB; Chu, Li, Reyzin and Schapire[3] gave the theoretical analysis of contextual bandits with linear payoffs. Agarwal et al.[4] established efficient algorithms for general contextual bandits, and the surveys of Lattimore and Szepesvári[5] and Slivkins[6] are the canonical references. The dispatcher's learning core is a faithful application of this line of work to a model-selection problem.

The exploration–exploitation literature offers two practical alternatives to UCB: Thompson sampling[8] and ε-greedy variants. Thompson sampling is attractive when posterior sampling is cheap and the reward distribution is well-modelled; UCB remains the more deterministic choice when telemetry is the primary debugging surface — every decision can be reproduced from the recorded arm statistics, which matters when an operator has to explain why one model was chosen over another. We chose UCB for the operational reproducibility, not for any sample-efficiency claim.

Within the LLM-routing literature, three lines of work are immediately relevant. FrugalGPT[13] frames routing as a cost-quality cascade; Hybrid LLM[14] formulates it as a query-router that switches between a strong and a weak model based on a difficulty estimator; RouteLLM[15] learns the router from preference data. All three are valuable, and all three concentrate the routing intelligence in a single learned function on a single objective axis. The contribution of the present paper is orthogonal: rather than propose another single-objective router, we describe a multi-objective, composable dispatcher in which the learned function is one tier among six.

A related but distinct body of work is the mixture-of-experts (MoE) literature. Shazeer et al.[11] and Fedus, Zoph and Shazeer[12] gate inside a model between expert sub-networks. Our dispatcher gates between independent models — different providers, different families, different deployment topologies. The two problems share a vocabulary (gating, routing, experts) but live at different levels of the stack.

The online-update mechanism — a rank-one Sherman–Morrison update to the per-arm regularised inverse Gram matrix — is the classical numerical-linear-algebra technique surveyed by Hager[10]. It permits exact incremental learning without recomputing matrix inverses, which is what makes the in-process online loop practical at serving rates.

What is consistently absent from the prior literature is a routing-layer account that ties policy, capability, cost, latency, energy, and continuous learning into one composable dispatch with a documented graceful-degradation envelope. That gap is what this paper documents.

SECTION 3

System architecture overview

The router is an in-process library rather than a standalone microservice. It is invoked once per node per request inside the orchestration engine, returns a routing decision with an ordered candidate list, and observes the reward asynchronously after the runtime completes the call. Two persistence surfaces support it: an in-memory rolling latency window of approximately two hundred observations per model — thread-safe, cleared on restart, used for live p50, p95, time-to-first-token, throughput, and timeout-rate statistics — and a vault-backed bandit state that stores, for each arm, the regularised inverse Gram matrix, the running reward vector, an observation count, and a cumulative reward sum.
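As a concrete illustration, the sketch below shows one plausible shape for such a process-local window, assuming Python and a fixed-length deque per model; the class and field names are illustrative, not the codebase's.

```python
import threading
from collections import deque

class RollingLatencyWindow:
    """Per-model rolling window (~200 observations) for live latency
    statistics: thread-safe, process-local, cleared on restart.
    A sketch only; names and the percentile arithmetic are illustrative."""

    def __init__(self, maxlen: int = 200):
        self._lock = threading.Lock()
        self._latency_ms = deque(maxlen=maxlen)   # end-to-end latency
        self._ttft_ms = deque(maxlen=maxlen)      # time-to-first-token
        self._timeouts = deque(maxlen=maxlen)     # 1 if the call timed out

    def observe(self, latency_ms: float, ttft_ms: float, timed_out: bool) -> None:
        with self._lock:
            self._latency_ms.append(latency_ms)
            self._ttft_ms.append(ttft_ms)
            self._timeouts.append(1 if timed_out else 0)

    def snapshot(self) -> dict:
        """Live p50 / p95 / TTFT / timeout-rate view (throughput elided)."""
        with self._lock:
            lats = sorted(self._latency_ms)
            n = len(lats)
            if n == 0:
                return {}
            return {
                "p50_ms": lats[n // 2],
                "p95_ms": lats[min(n - 1, int(n * 0.95))],
                "ttft_p50_ms": sorted(self._ttft_ms)[n // 2],
                "timeout_rate": sum(self._timeouts) / n,
                "n": n,
            }
```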

Hot-reload is a first-class capability. The orchestration engine reloads the bandit state from the vault on a configurable cadence (default approximately thirty seconds), so a fresh offline retrain can land in production without restarting workers. The latency window remains process-local and is rebuilt naturally from live traffic.

Failover is enumerative, not re-routed. The router returns up to five ordered candidates per decision, and the engine walks the list until one succeeds. The ordering deliberately prefers provider-diverse alternates first — escaping correlated outages is the most expensive failure mode in production — and same-family fallbacks second, on the principle that a near-equivalent model in the same family incurs less context-switch cost than a complete provider change. The ordering is exposed in telemetry alongside the routing reason code.
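The ordering rule is simple enough to state as code. A minimal sketch, assuming each candidate record carries `provider` and `family` fields (hypothetical names):

```python
def order_failover(primary, candidates, max_depth: int = 5):
    """Enumerative failover: provider-diverse alternates first (escaping
    correlated outages), same-family fallbacks second, capped at five."""
    alternates = [c for c in candidates if c is not primary]
    other_provider = [c for c in alternates if c.provider != primary.provider]
    same_family = [c for c in alternates
                   if c.provider == primary.provider and c.family == primary.family]
    rest = [c for c in alternates
            if c not in other_provider and c not in same_family]
    return ([primary] + other_provider + same_family + rest)[:max_depth]
```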

The router carries no notion of session affinity, no warm-pool management, and no batching. It is a pure decision function with side effects only on its own bandit state. This is a deliberate architectural choice: every other concern (caching, batching, autoscaling) is owned by tiers above or below, and the dispatcher remains testable as a function of its inputs alone.

SECTION 4

Methodology — six tiers

Each subsection names one tier, describes the signal it consumes, and identifies how it degrades when that signal is unavailable. Graceful degradation is not a feature added late; it is the central design constraint that lets the dispatcher remain stable under simultaneously-failing dependencies.

01

Policy enforcement — the inviolable layer

Pinned models and regulated-domain allow-lists are evaluated before any scoring. A request that targets a regulated workload but cannot be served by an approved candidate halts with an explicit, machine-readable reason code; a soft mismatch (no candidate carries a requested capability) degrades with a logged relaxation rather than a silent failure. Policy is the only tier that can return an unrecoverable error.

Routing layer · Policy short-circuit
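A minimal sketch of the policy short-circuit, assuming hypothetical request and catalog fields (`pinned_model`, `regulated`, `approved_for`); the reason-code strings are illustrative:

```python
class PolicyHalt(Exception):
    """Unrecoverable policy violation carrying a machine-readable reason code."""
    def __init__(self, reason_code: str):
        super().__init__(reason_code)
        self.reason_code = reason_code

def enforce_policy(request, catalog):
    """Tier 1: runs before any scoring; the only tier that can hard-halt."""
    if request.pinned_model is not None:
        pinned = [m for m in catalog if m.name == request.pinned_model]
        if not pinned and request.regulated:
            raise PolicyHalt("PINNED_MODEL_UNAVAILABLE_IN_REGULATED_DOMAIN")
        if pinned:
            return pinned, "policy_pin"
    if request.regulated:
        approved = [m for m in catalog if m.approved_for(request.domain)]
        if not approved:
            raise PolicyHalt("NO_APPROVED_CANDIDATE")
        return approved, "policy_allowlist"
    return list(catalog), "policy_pass"
```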
02

Prompt-aware retrieval shortlisting

A small index of prompt embeddings records, for each historical request, which models produced the best evaluation score. At decision time the live prompt is embedded once and queried against the index; the result is a shortlist of models that performed well on conceptually similar tasks. An observation minimum (default three) prevents shortlists from forming on insufficient data, and an empty result falls back transparently to the full catalog.

Routing layer · Retrieval shortlist
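A sketch of the shortlist step, assuming a vector index with a `nearest` method returning (model name, observation count) pairs; that interface is an assumption, not the product API:

```python
def retrieval_shortlist(prompt_embedding, index, catalog, k: int = 8,
                        min_obs: int = 3):
    """Tier 2: shortlist models that scored well on similar prompts.
    An empty or under-observed result falls back to the full catalog."""
    hits = index.nearest(prompt_embedding, k=k)            # assumed interface
    names = [name for name, obs in hits if obs >= min_obs]
    if not names:
        return list(catalog)                               # transparent fallback
    by_name = {m.name: m for m in catalog}
    return [by_name[n] for n in names if n in by_name]
```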
03

Rule-based filtering and multi-objective scoring

Allow- and deny-lists, the external-API toggle, capability matching, the context-window-versus-prompt-budget check, and latency or time-to-first-token thresholds run as fast, deterministic predicates. Survivors enter a multi-objective scorer with three presets — eco, balanced, and max-quality — that combines normalised quality, cost, latency, and energy on weighted axes. The eco preset adds a small, fully-logged local-model bonus to reflect operator policy.

Routing layer · Multi-objective scoring
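The scorer reduces to a weighted sum over normalised axes. The preset weights below are illustrative placeholders, not the shipped defaults; only the +0.15 eco local bonus is taken from the paper (Section 6):

```python
# Axes are normalised to [0, 1] with higher = better (cost, latency and
# energy are inverted upstream). Weights here are illustrative only.
PRESETS = {
    "eco":         {"quality": 0.30, "cost": 0.25, "latency": 0.15, "energy": 0.30},
    "balanced":    {"quality": 0.40, "cost": 0.20, "latency": 0.20, "energy": 0.20},
    "max_quality": {"quality": 0.70, "cost": 0.10, "latency": 0.15, "energy": 0.05},
}

LOCAL_BONUS = 0.15  # eco-mode local-model bonus on normalised quality (Section 6)

def score(candidate, preset: str) -> float:
    """Tier 3 scorer sketch; `candidate.norm` and `candidate.is_local`
    are hypothetical fields."""
    w = PRESETS[preset]
    axes = dict(candidate.norm)
    if preset == "eco" and candidate.is_local:
        # A logged policy lever, not an accuracy claim.
        axes["quality"] = min(1.0, axes["quality"] + LOCAL_BONUS)
    return sum(w[a] * axes[a] for a in w)
```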
04

Predictive re-ranking on per-arm history

Before the bandit step, a lightweight predictive layer composes a re-ranking signal from per-arm history: mean reward, recent fiftieth-percentile latency, and recent failure rate. The composite reduces variance in the next tier and meaningfully improves cold-start behaviour, because the bandit no longer has to learn what the registry already knows.

Routing layer · Predictive re-ranker
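One plausible shape for the composite, assuming hypothetical per-arm statistics and illustrative weights:

```python
def predictive_score(arm_stats, w_reward=0.6, w_latency=0.25, w_fail=0.15):
    """Tier 4: re-ranking signal from per-arm history. The weights and
    the 10 s latency normalisation are assumptions for the sketch."""
    latency_norm = min(arm_stats.p50_ms / 10_000.0, 1.0)  # crude [0, 1] scale
    return (w_reward * arm_stats.mean_reward
            - w_latency * latency_norm
            - w_fail * arm_stats.failure_rate)
```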
05

Contextual bandit selection (LinUCB, disjoint per-arm)

The learning core. Each candidate model is an arm with its own regularised inverse Gram matrix and reward vector. A sixty-four-dimensional sparse hashed context vector encodes domain, node type, requested capability, regulation status, prompt-size bucket, upstream fan-in count, tool usage, and local-runtime availability. The arm with the highest mean-plus-α·uncertainty score is selected; α = 0.8 sets the exploration–exploitation balance. Hybrid priors loaded from offline training initialise the arms when available; otherwise the prior is uniform.

Routing layer · Contextual bandit
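The selection rule is standard disjoint LinUCB[2]. A self-contained sketch with the Table 1 defaults (d = 64, λ = 1.0, α = 0.8); everything else about the class layout is illustrative:

```python
import numpy as np

D, LAMBDA, ALPHA = 64, 1.0, 0.8   # Table 1 defaults

class Arm:
    """Disjoint per-arm state: regularised inverse Gram matrix, running
    reward vector, observation count; exactly the fields the vault persists."""
    def __init__(self, d: int = D, lam: float = LAMBDA):
        self.A_inv = np.eye(d) / lam   # (lambda I)^-1 before any data
        self.b = np.zeros(d)           # running reward vector
        self.n = 0                     # observation count

    def ucb(self, x: np.ndarray, alpha: float = ALPHA) -> float:
        theta = self.A_inv @ self.b              # per-arm ridge estimate
        width = np.sqrt(x @ self.A_inv @ x)      # uncertainty term
        return float(theta @ x + alpha * width)

def select(arms: dict, x: np.ndarray) -> str:
    """Pick the arm with the highest mean-plus-alpha-times-uncertainty score."""
    return max(arms, key=lambda name: arms[name].ucb(x))
```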
06

Challenger exploration — keeping the policy honest

On approximately two percent of requests the dispatcher runs both the chosen primary and a challenger drawn by one of three strategies: next-ranked, random, or least-explored. The pairwise outcome is fed to the offline trainer as preference data. The mechanism is a deliberate, bounded tax on average quality that prevents the bandit from over-exploiting a temporarily-strong arm and lets new models earn their way in.

Routing layer · Live preference learning
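A sketch of the challenger draw, assuming `ranked` is the ordered candidate-name list from the earlier tiers and `arms` the per-arm state keyed by model name; the three strategy names come from the paper, the code around them does not:

```python
import random

def pick_challenger(ranked, primary, arms, strategy="next_ranked",
                    fraction=0.02, rng=random.random):
    """Tier 6: dual-route ~2% of traffic. Returns None when the throttle
    skips the request or there is no eligible second arm."""
    if rng() >= fraction:
        return None
    others = [m for m in ranked if m != primary]
    if not others:
        return None
    if strategy == "next_ranked":
        return others[0]
    if strategy == "random":
        return random.choice(others)
    if strategy == "least_explored":
        return min(others, key=lambda m: arms[m].n)  # fewest observations
    return None
```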

The online/offline learning duality

The six tiers compose a decision; the duality composes a policy update. Two clocks run side by side. The fast clock — online — fires whenever a request completes: the evaluation score, a value bounded to [0, 1], becomes a single reward observation, and the chosen arm's regularised inverse Gram matrix and reward vector are updated by a rank-one Sherman–Morrison step[10]. The slow clock — offline — runs in a separate process that batches the run vault, re-derives per-arm priors over a much larger window, and writes a fresh bandit-state snapshot back to the vault, which the live engine then hot-reloads atomically.
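For reference, the fast-clock step on the chosen arm a, with context vector x and reward r, is the standard rank-one identity[10] (notation follows the LinUCB literature):

```latex
A_a^{-1} \leftarrow A_a^{-1} - \frac{A_a^{-1} x\, x^\top A_a^{-1}}{1 + x^\top A_a^{-1} x},
\qquad
b_a \leftarrow b_a + r\,x,
\qquad
\hat{\theta}_a = A_a^{-1} b_a .
```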

The fast clock keeps the live policy current at minute timescales; the slow clock keeps the policy from drifting under short-horizon noise. Neither clock is sufficient alone — without online updates, the policy lags every interesting change in the fleet; without offline retraining, the policy is at the mercy of whatever traffic mix happened to arrive in the last window. The composition is what makes the router self-evolving rather than merely online.

SECTION 5

Design parameters & tier composition

Table 1 collects the design parameters that govern the learner and the surrounding tiers. Values are codebase defaults; each is exposed as a tunable for deployment-specific policy and all are recorded in telemetry on every decision so that a downstream analyst can reconstruct the routing function exactly.

Table 1. Design parameters of the dispatcher. Defaults shown; each is tunable.
Parameter | Value | Role
Context dimension | 64 | Sparse hashed features per request
Regularisation λ | 1.0 | Ridge prior on each arm's Gram matrix
UCB α | 0.8 | Confidence bonus on uncertainty term
Latency window | ~200 obs | Per-model rolling p50 / p95 / TTFT / throughput
Failure reward | 0.15 | Bounded penalty on timeout or error
Challenger fraction | ~2% | Dual-routed traffic for preference data
Failover depth | up to 5 | Ordered candidates returned per decision
Reload cadence | ~30 s | Bandit hot-reload interval from vault
Context-window reserve | 256 tokens | Headroom on prompt-budget overflow check

Table 2 records the tier-composition matrix. The two columns of operational interest are when the tier triggers degradation and what it degrades into. Read top to bottom, the matrix describes the worst-case path through the dispatcher: a request can in principle traverse the policy tier, find an empty retrieval shortlist, encounter a relaxed capability match in the rule filter, hit an arm with no per-arm history at the predictive stage, fall back from an unloaded bandit, and skip the challenger throttle — and still emerge with a sensible decision drawn from the rule-based order. No single tier failure terminates the dispatcher unless the failure is in the policy tier, which is the only layer whose violations are unrecoverable by design.

Table 2. Tier composition matrix: when each tier degrades and what it degrades into.
Tier | Required? | Triggers degradation when… | Falls back to…
Policy | Always | Pinned model missing in a regulated domain | Hard halt with reason code
Retrieval | Optional | Empty shortlist or insufficient observations | Full catalog
Rule filter | Always | No models match the requested capability | Relaxed capability with logged reason
Predictive | Optional | Insufficient per-arm history | Rule-based order
Bandit | Optional | Bandit state fails to load or arm is unseen | Predictive or rule-based order
Challenger | Optional | Throttle exceeded or no eligible second arm | Single-route (primary only)

The "self" in self-evolving

Three feedback loops drive the bandit. The first is the online loop already described: every completed request with an evaluation score becomes one reward observation and updates the chosen arm in place. The second is the failure-as-signal loop: timeouts and errors return a bounded penalty of 0.15 instead of being silently dropped, so failing arms are demoted on roughly the same time horizon that successful arms are reinforced. The third is the offline loop: batched retraining reads the run vault, re-derives priors over a long window, and atomically swaps the state into the live router via hot-reload.
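Continuing the LinUCB sketch from Section 4, the first two loops reduce to a single fold-back routine; only the 0.15 penalty and the [0, 1] bound come from the paper, the rest is illustrative:

```python
import numpy as np

FAILURE_REWARD = 0.15   # bounded penalty on timeout or error (Table 1)

def fold_back(arm, x: np.ndarray, score) -> None:
    """Online loop: one completed request becomes one reward observation;
    failures (score is None) are folded back as a bounded penalty rather
    than dropped. Updates the arm in place via Sherman-Morrison [10]."""
    r = FAILURE_REWARD if score is None else min(max(score, 0.0), 1.0)
    Ax = arm.A_inv @ x
    arm.A_inv -= np.outer(Ax, Ax) / (1.0 + float(x @ Ax))   # rank-one update
    arm.b += r * x
    arm.n += 1
```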

The composition of the three loops, not any single loop, is what makes the system continuously self-correcting under non-stationary load. Online updates alone leave the policy hostage to short-horizon noise. Offline retraining alone lags every interesting drift in the fleet by a full training window. Failure-as-signal alone produces an over-conservative policy that retreats from any model that ever errored. Together, they produce a policy that adapts at the right timescale for each kind of change.

SECTION 6

Discussion, limitations, and future work

Why disjoint per-arm parameters, not a shared model

A central design choice is that every model is its own arm with its own parameters; there are no shared weights between arms. The alternative — a single shared contextual model that predicts a score for any model given the context — is more sample-efficient on a stable catalog. We chose disjoint per-arm for operational robustness. Adding or removing a model from the catalog becomes a metadata operation rather than a retraining event, and a regression on one arm cannot contaminate any other. In a fleet where the catalog drifts week by week, that isolation is more valuable than the sample-efficiency premium we leave on the table.

Why hashed sparse features rather than raw prompt embeddings as context

The bandit context vector is sixty-four-dimensional and is computed from hashed metadata: the domain, the node type, the requested capability, the regulation status, the external-API policy, the prompt-size bucket, the upstream fan-in count, a tool-usage indicator, and a label hash that disambiguates synthesis nodes performing semantically similar work. Raw prompt embeddings are used only at the retrieval-shortlist tier, not in the bandit. The reason is threefold. First, the bandit context must be cheap to compute on every request with no external service dependency. Second, it must be deterministic — the same request should produce the same context key for telemetry, and embedding-model upgrades should not silently change the policy. Third, offline retraining must reproduce the context exactly; hashed metadata is reproducible where embeddings are not.
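A deterministic hashing sketch makes the reproducibility point concrete; the specific scheme (blake2b, one bump per key-value pair, L2 normalisation) is an illustrative assumption:

```python
import hashlib
import numpy as np

D = 64  # bandit context dimension (Table 1)

def hashed_context(meta: dict, d: int = D) -> np.ndarray:
    """Sparse hashed context from request metadata (domain, node type,
    capability, regulation status, ...). Deterministic: the same request
    always yields the same vector, with no external service dependency."""
    x = np.zeros(d)
    for key, value in meta.items():
        h = hashlib.blake2b(f"{key}={value}".encode(), digest_size=8)
        x[int.from_bytes(h.digest(), "big") % d] += 1.0
    norm = np.linalg.norm(x)
    return x / norm if norm else x
```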

The eco-mode local-quality bonus, in context

The multi-objective scorer in the third tier supports a small, fully-logged local-model bonus (≈ +15 percentage points in normalised quality) when the eco preset is active. That bonus is a policy lever, not an accuracy claim. It exists because under an eco policy the operator has declared a willingness to accept a marginally weaker model in exchange for lower energy and reduced data egress[16]. The bonus is tunable, fully logged, and its contribution to any headline saving is recoverable from telemetry — which is the only way it can defensibly remain in a routing function that is otherwise designed not to inject unmeasured biases.

What the dispatcher does not yet do

The catalog is still a metadata operation: new models are introduced by adding entries to the model registry, not auto-discovered. There is no cross-network learning transfer: each workflow's bandit is its own, and a policy trained in one deployment does not propagate to another. Multi-modal routing — image, audio, video — is out of scope; the dispatcher assumes text-token semantics throughout. None of these are difficult extensions, but they are roadmap items, not finished capabilities, and we are explicit about that distinction.

Failure modes we accept

Cold start on a freshly-introduced model is unavoidable without prior data; the predictive tier mitigates it by re-ranking on coarse-grained capability and cost signals, but it cannot eliminate it. Challenger exploration is a deliberate, bounded tax — approximately two percent of traffic — on average quality, taken in exchange for keeping the long-run policy unbiased and able to discover newly-promoted models. Under deeply non-stationary regimes (a model provider rate-limiting an entire family for an extended window) the bandit will eventually re-converge, but the convergence window is bounded below by the observation count required to dominate the prior, which is a tunable rather than an architectural property.

Position relative to RouteLLM and adjacent routers

RouteLLM[15], Hybrid LLM[14], and FrugalGPT[13] are all valuable contributions to the cost-quality routing problem and have served, individually, as productive points of comparison. The dispatcher described here differs in three respects. First, the learned router is one tier among six rather than the whole router, and is wrapped in a graceful-degradation envelope. Second, the optimisation is multi-objective from the outset — quality, latency, cost, and energy are first-class — rather than reduced to a single axis. Third, the duality of online and offline learning is explicit and operational, not assumed. The trade-off is that the system is more machinery for less novelty in any single component. The intent of the paper is to make that trade-off legible.

SECTION 7

Conclusion

If the fleet under a dispatcher is non-stationary — and at production scale it always is — then self-evolution is the price of admission, not a marketable add-on. The Self-Evolving Model Router is what that price looks like when paid in full: six independently feature-gated tiers, an online and offline learning duality, a documented graceful-degradation envelope, and a policy-bound surface that an operator can audit without reading source code.

The contribution of the paper is the composition. No single tier is new in isolation — contextual bandits, UCB, multi-objective scoring, prompt-embedding retrieval, and challenger exploration each have well-developed literatures of their own. The novelty, such as it is, is the gracefully-degrading composition: an engineering pattern that takes routing from a static configuration to a continuously-learning decision without sacrificing the operational invariants that make a production dispatcher trustworthy.

As with the companion energy paper, the claim is not that this is optimal. The claim is that it is visible: every parameter, every reason code, every reward, every failover step is emitted to telemetry and reproducible from the recorded state. Visibility is the precondition for every improvement that follows.

REFERENCES

References

[1] Auer, P. (2002). Using Confidence Bounds for Exploitation–Exploration Trade-offs. Journal of Machine Learning Research 3, 397–422.
[2] Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW.
[3] Chu, W., Li, L., Reyzin, L., & Schapire, R. E. (2011). Contextual Bandits with Linear Payoff Functions. AISTATS.
[4] Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., & Schapire, R. E. (2014). Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML.
[5] Lattimore, T., & Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
[6] Slivkins, A. (2019). Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learning 12(1–2).
[7] Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
[8] Russo, D., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning 11(1).
[9] Bouneffouf, D., Bouzeghoub, A., & Gançarski, A. L. (2012). A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender Systems. ICONIP.
[10] Hager, W. W. (1989). Updating the Inverse of a Matrix. SIAM Review 31(2), 221–239.
[11] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR.
[12] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
[13] Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176.
[14] Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Rühle, V., Lakshmanan, L. V. S., & Awadallah, A. H. (2024). Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. arXiv:2404.14618.
[15] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665.
[16] VDF AI Research Team (2026). How We Reduce Energy Consumption. VDF-WP-2026-001. Companion paper; describes the energy axis of the multi-objective scoring tier referenced here.
