Off-Policy Evaluation for Large Action Spaces via Policy Convolution
For example, consider a movie recommendation platform whose logging policy, for a given segment of users, rarely recommends romantic movies. This can happen, for instance, when the platform believes those users will not enjoy certain types of movies. A target policy, whose value we aim to estimate, may now, for any number of reasons, choose to recommend romantic movies to the same user segment. This distribution shift can lead to irrecoverable bias in our estimates [29], making it difficult to accurately evaluate a target policy or to learn a better one, which typically involves optimizing over the value estimates [12, 41]. Typical off-policy estimators, such as those based on inverse propensity scoring (IPS), rely on the logging policy sufficiently covering the actions favored by the target policy, and they break down when this support is deficient.
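To make this failure mode concrete, here is a minimal sketch of vanilla IPS on a synthetic five-action problem in which the logging policy never plays the action the target policy prefers. All names (`mu`, `pi`, `q`) and the Bernoulli reward model are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples = 5, 100_000

# Logging policy mu: action 4 ("romantic movies") is never recommended.
mu = np.array([0.3, 0.3, 0.2, 0.2, 0.0])
# Target policy pi: now recommends action 4 frequently.
pi = np.array([0.1, 0.1, 0.1, 0.1, 0.6])
# True expected reward per action.
q = np.array([0.2, 0.3, 0.4, 0.5, 0.9])

true_value = float(pi @ q)  # V(pi) = sum_a pi(a) q(a)

# Collect logged data under mu, with rewards ~ Bernoulli(q[a]).
actions = rng.choice(n_actions, size=n_samples, p=mu)
rewards = rng.binomial(1, q[actions])

# Vanilla IPS: reweight each logged reward by pi(a) / mu(a).
weights = pi[actions] / mu[actions]
ips_estimate = float(np.mean(weights * rewards))

print(f"true V(pi)   = {true_value:.3f}")
print(f"IPS estimate = {ips_estimate:.3f}")
```

Because action 4 is never logged, IPS converges to roughly 0.14 while the true value is 0.68; collecting more data under the same logging policy cannot close this gap, which is what makes the bias irrecoverable.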
Such structure can occur naturally in different forms, such as action meta-data (text, images, etc.), action hierarchies, or categories, or it can be estimated using domain-specific representation learning techniques.
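As a hedged illustration of how such structure can help, the sketch below smooths a support-deficient logging policy with a Gaussian kernel over action embeddings, so that unsupported actions inherit propensity mass from similar, supported ones. The function `convolve_policy`, the kernel choice, and the bandwidth `tau` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def convolve_policy(policy, embeddings, tau=1.0):
    """Smooth a policy over similar actions via a kernel on embeddings.

    Each action's probability mass is spread to its neighbors in
    embedding space, so actions unsupported by the logging policy
    can inherit mass from similar, supported ones.
    """
    # Pairwise squared distances between action embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dist = (diff ** 2).sum(-1)
    kernel = np.exp(-sq_dist / (2 * tau ** 2))   # Gaussian similarity
    kernel /= kernel.sum(axis=1, keepdims=True)  # row-normalize
    smoothed = policy @ kernel
    return smoothed / smoothed.sum()             # renormalize to a distribution

# Toy example: 5 actions with 2-d embeddings; action 4 is close to action 3.
emb = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [2.1, 2.1]])
mu = np.array([0.3, 0.3, 0.2, 0.2, 0.0])  # deficient logging policy
mu_smooth = convolve_policy(mu, emb, tau=0.5)
print(mu_smooth)  # action 4 now carries non-zero propensity
```

The amount of smoothing trades bias for variance: a larger `tau` shares mass more aggressively across actions, repairing support at the cost of blurring distinctions between genuinely different actions.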