How do neural networks extend contextual bandits beyond linear reward assumptions?
This explores how moving from classic linear contextual bandits (like LinUCB) to neural networks lets the reward model capture nonlinear structure — and what that trade costs you in exploration, which is the part linearity made easy.
This explores how neural networks replace the linear reward assumption baked into classic contextual bandits, and what breaks when they do. The starting point is the linear bandit itself: in LinUCB news recommendation, the reward of showing an article is modeled as a linear function of context features, and that linearity is precisely what makes the math tractable — you get closed-form confidence intervals, provable regret bounds, and a clean way to balance exploring uncertain articles against exploiting proven ones Can bandit algorithms beat collaborative filtering for news?. The moment real reward surfaces are nonlinear — user taste that interacts across features rather than adding up — that linear model underfits, and you reach for a neural network to learn the representation instead of assuming it.
The catch is that the linear assumption wasn't just modeling convenience; it was also your uncertainty estimate. UCB and Thompson sampling need to know how unsure the model is about a given action, and for a linear model that uncertainty has a closed form. Swap in a deep network and that vanishes — you can predict rewards but no longer cheaply know what you don't know. This is the gap the epistemic-neural-network work targets: it separates *aleatoric* uncertainty (irreducible noise in the data) from *epistemic* uncertainty (what more data would resolve), and spends compute only on the epistemic part that Thompson sampling actually needs. That focus is what makes neural Thompson sampling viable at recommendation scale, lifting click-through 9% while needing 29% fewer interactions Can neural networks explore efficiently at recommendation scale?. The lesson is that extending bandits beyond linearity is really two problems — a richer reward model *and* a replacement for the uncertainty estimate you lost.
There's a quieter counter-move worth knowing: maybe you don't need the expensive exploration machinery at all. Greedy bandits that purely exploit can match UCB-style regret when the context distribution has natural 'covariate diversity' — when incoming users are varied enough that they supply the randomization exploration would have injected deliberately When can greedy bandits skip exploration entirely?. That reframes the neural question: in high-traffic recommendation, the diversity of real users may do for a neural model what careful epistemic uncertainty does, letting you skip the hardest engineering. So the field splits between making exploration smart (epistemic nets) and arguing the data makes it unnecessary.
A different way to escape linearity is to keep the reward linear but make the *basis* learned and nonlinear. Reward factorization represents each user's preference as a linear combination of base reward functions — but those base functions are learned from data, and a handful of adaptive questions pins down the per-user coefficients Can user preferences be learned from just ten questions?. This is the kernel-trick spirit: nonlinearity lives in the features, linearity in the combination, so you keep cheap personalization and uncertainty reduction while escaping the flat linear-in-raw-features assumption. It's a middle path between LinUCB and a fully neural reward.
Finally, the frontier is dropping the parametric reward model entirely. Memory-based online RL treats adaptation as memory operations rather than weight updates, assigning credit and improving policy through stored cases instead of a fitted reward function Can agents learn continuously from experience without updating weights?, and trajectory-based in-context learning shows models can absorb sequential decision-making from context alone when given full trajectories rather than isolated examples Why do trajectories matter more than individual examples for in-context learning?. Read together, the corpus traces an arc: linear bandits → neural bandits with engineered uncertainty → learned-basis linear models → non-parametric memory and in-context approaches. 'Beyond linear' turns out to be less a single technique than a ladder, where each rung buys more expressive reward modeling at the price of harder uncertainty estimation — and where, surprisingly, sometimes the data's own diversity lets you skip a rung entirely.
Sources 6 notes
LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.
ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.
Contextual bandits using pure greedy exploitation can match UCB-style regret guarantees when the context distribution satisfies covariate diversity—a condition satisfied by many real continuous and discrete distributions where incoming users themselves provide sufficient randomization.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.