Why should bandit algorithms condition exploration on time-of-period as well as user state?

This explores why a bandit's exploration policy should treat *when* a choice happens (time-of-day, day-of-week, seasonality) as part of its context — not just *who* the user is.

The honest starting point: the corpus doesn't have a paper that argues this claim head-on, but several notes converge on the reasoning from different directions. The foundational move is that contextual bandits already condition exploration on *context*, and user state is simply one slice of that context. LinUCB-style news recommendation (Can bandit algorithms beat collaborative filtering for news?) treats each decision as 'given everything I know right now, how uncertain am I about this arm?' Time-of-period is just another coordinate to feed into that 'right now.'
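To make "time as another coordinate" concrete, here is a minimal sketch of a LinUCB-style arm whose context vector concatenates user features with a cyclical encoding of hour and weekday. The names `time_features`, `build_context`, and `LinUCBArm` are illustrative assumptions, not taken from the source notes, and the sin/cos encoding is one common choice rather than anything the notes prescribe.

```python
import numpy as np

def time_features(hour, weekday):
    """Encode hour (0-23) and weekday (0-6) on circles so 23:00 sits next to 00:00."""
    h = 2 * np.pi * hour / 24.0
    d = 2 * np.pi * weekday / 7.0
    return np.array([np.sin(h), np.cos(h), np.sin(d), np.cos(d)])

def build_context(user_vec, hour, weekday):
    """Hypothetical helper: user state and time-of-period in one context vector."""
    return np.concatenate([np.asarray(user_vec, dtype=float),
                           time_features(hour, weekday)])

class LinUCBArm:
    """One arm of a disjoint linear UCB model (ridge regression plus a bonus)."""

    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)      # regularised Gram matrix of seen contexts
        self.b = np.zeros(dim)    # reward-weighted sum of seen contexts
        self.alpha = alpha        # exploration strength (assumed value)

    def bonus(self, x):
        # Uncertainty term: large for contexts unlike those already observed.
        return self.alpha * float(np.sqrt(x @ np.linalg.inv(self.A) @ x))

    def ucb(self, x):
        theta = np.linalg.solve(self.A, self.b)
        return float(theta @ x) + self.bonus(x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

Because the time coordinates enter the same Gram matrix as the user features, an arm that has only been shown in the morning keeps a larger bonus for the same user in the evening, which is the behaviour the paragraph above describes.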

The reason time deserves its own coordinate is non-stationarity. News is the cleanest example: the value of an article decays, and the best arm at 8am is not the best arm at midnight. A bandit conditioned only on user state implicitly assumes a user's reward for an option is time-invariant — which is exactly the assumption dynamic content breaks. Once rewards drift on a daily or weekly cycle, what you learned exploiting last evening can actively mislead you this morning, so the exploration budget has to be spent *per regime*, not once globally.
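One way to read "spent per regime" is an exploration rate that decays with the number of observations in the current temporal regime rather than with the global count. The sketch below swaps in epsilon-greedy for simplicity. The bucketing in `regime_of`, the decay schedule `c / (c + n)`, and the Monday=0 weekday convention are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def regime_of(hour, weekday):
    """Hypothetical bucketing into (weekday/weekend, part of day); Monday is 0."""
    if hour < 6:
        part = "night"
    elif hour < 12:
        part = "morning"
    elif hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    return ("weekend" if weekday >= 5 else "weekday", part)

class PerRegimeEpsilonGreedy:
    """Epsilon-greedy whose statistics and exploration rate are kept per regime."""

    def __init__(self, n_arms, c=10.0, seed=0):
        self.n_arms = n_arms
        self.c = c  # assumed constant controlling how fast exploration decays
        self.rng = random.Random(seed)
        self.pulls = defaultdict(lambda: [0] * n_arms)
        self.means = defaultdict(lambda: [0.0] * n_arms)

    def epsilon(self, regime):
        # Decays with observations in THIS regime only; unseen regimes give 1.0.
        n = sum(self.pulls[regime])
        return self.c / (self.c + n)

    def select(self, regime):
        if self.rng.random() < self.epsilon(regime):
            return self.rng.randrange(self.n_arms)
        means = self.means[regime]
        return max(range(self.n_arms), key=means.__getitem__)

    def update(self, regime, arm, reward):
        self.pulls[regime][arm] += 1
        n = self.pulls[regime][arm]
        self.means[regime][arm] += (reward - self.means[regime][arm]) / n
```

With this layout, a thousand evening observations shrink the evening exploration rate but leave the morning rate untouched, so last evening's confidence cannot suppress exploration this morning.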

This complicates what otherwise looks like a free lunch. There's a striking result that greedy bandits can skip exploration entirely when the incoming context distribution is naturally diverse enough to randomize for you (When can greedy bandits skip exploration entirely?). Time-of-period looks like it should add diversity, but it adds *structured, cyclical* diversity, not random diversity. Cyclical structure creates correlated blind spots: if your traffic is thin at 3am, no amount of daytime randomization covers that regime, so the covariate-diversity escape hatch quietly closes and explicit exploration becomes necessary again precisely in the under-sampled hours.
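A rough diagnostic for those blind spots is to check, hour by hour, whether logged contexts are both numerous and spread out. The sketch below flags hours with too few samples or with a near-singular empirical second-moment matrix. This is only a crude proxy for the formal covariate-diversity condition, and the function name and both thresholds (`min_count`, `min_eig`) are assumptions chosen for illustration.

```python
import numpy as np

def undersampled_hours(contexts, hours, min_count=50, min_eig=1e-2):
    """Return the hours (0-23) where greedy exploitation looks unsafe.

    contexts: array of shape (n, d) with logged context vectors.
    hours:    array of shape (n,) with the hour each context arrived.
    """
    contexts = np.asarray(contexts, dtype=float)
    hours = np.asarray(hours)
    flagged = []
    for h in range(24):
        X = contexts[hours == h]
        if len(X) < min_count:
            flagged.append(h)          # traffic too thin to learn from
            continue
        second_moment = X.T @ X / len(X)
        # eigvalsh returns eigenvalues in ascending order; [0] is the smallest.
        if np.linalg.eigvalsh(second_moment)[0] < min_eig:
            flagged.append(h)          # contexts barely vary in some direction
    return flagged
```

Hours returned by this check are candidates for explicit exploration, while the remaining hours may be able to lean on the natural randomization of incoming traffic.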

That connects to where exploration *should* be aimed. Scalable neural bandits work by separating reducible (epistemic) uncertainty from irreducible noise and spending Thompson-sampling compute only where parameter uncertainty actually lives (Can neural networks explore efficiently at recommendation scale?). Epistemic uncertainty is not uniform across the day: it pools in the temporal regimes with sparse data. Conditioning exploration on time-of-period is, in this framing, just honest accounting. It routes exploration toward the hours the model genuinely hasn't learned yet, instead of re-exploring the well-sampled midday it already understands.
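The pooling of epistemic uncertainty in sparse hours can be illustrated with a much simpler model than a neural bandit. The sketch below is a Beta-Bernoulli Thompson sampler with one posterior per (time bucket, arm) pair. It is a stand-in chosen for clarity, not the ENR method from the note, and the class and method names are hypothetical.

```python
import numpy as np

class BucketedThompson:
    """Beta-Bernoulli Thompson sampling with separate posteriors per time bucket."""

    def __init__(self, n_buckets, n_arms, seed=0):
        # Beta(1, 1) priors: every (bucket, arm) pair starts fully uncertain.
        self.alpha = np.ones((n_buckets, n_arms))
        self.beta = np.ones((n_buckets, n_arms))
        self.rng = np.random.default_rng(seed)

    def select(self, bucket):
        # Sample a plausible click rate per arm; wide posteriors win more often.
        samples = self.rng.beta(self.alpha[bucket], self.beta[bucket])
        return int(np.argmax(samples))

    def update(self, bucket, arm, reward):
        # reward is assumed to be 0 or 1.
        self.alpha[bucket, arm] += reward
        self.beta[bucket, arm] += 1 - reward

    def epistemic_variance(self, bucket):
        # Variance of each Beta posterior: the reducible part of the uncertainty.
        a, b = self.alpha[bucket], self.beta[bucket]
        return a * b / ((a + b) ** 2 * (a + b + 1))
```

After heavy midday traffic, `epistemic_variance` for the midday bucket shrinks while a rarely visited 3am bucket stays at its prior value, so sampled scores there remain spread out and exploration concentrates where data is missing.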

The last, more lateral thread is about the *timing of the decision to explore* itself. Work on why LLMs under-explore finds a temporal mismatch inside the model: uncertainty signals arrive in early layers and commit the model to a choice before longer-horizon 'empowerment' signals can weigh in (Why do large language models explore less effectively than humans?). The parallel is suggestive rather than exact: a bandit blind to time-of-period commits as if the current moment were the whole story, foreclosing the long-horizon, across-the-cycle value that only becomes visible when 'when' is part of what it reasons over. The unifying lesson across all four notes: exploration is only as good as the context you let it see, and time is a context dimension most user-state models silently throw away.

Sources 4 notes

Can bandit algorithms beat collaborative filtering for news?

LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.

When can greedy bandits skip exploration entirely?

Contextual bandits using pure greedy exploitation can match UCB-style regret guarantees when the context distribution satisfies covariate diversity—a condition satisfied by many real continuous and discrete distributions where incoming users themselves provide sufficient randomization.

Can neural networks explore efficiently at recommendation scale?

ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.
