Data Scientist

2021-09-13

We have created a synthetic environment to simulate a simple, controlled case with definite personalization based on a single context feature (Desktop/Mobile). We want to determine how various exploration algorithms behave when a new option is introduced after 10k decisions. We are looking for an exploration algorithm that fulfills two criteria.

- The algorithm discovers that the new option is optimal for the correct context and exploits it only for that context.
- The algorithm adjusts its serving behavior smoothly and isn't subject to high-variance outcomes. (consistency)

`Add Option`

The environment rewards are binary outcomes generated from Bernoulli trials using the following reward rates.

| Action | Mobile | Desktop | Avg Reward Rate |
|---|---|---|---|
| Red | 0.05 | 0.03 | 0.04 |
| Green | 0.03 | 0.05 | 0.04 |
| Blue (added) | 0.06 | 0.01 | 0.035 |
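As a reference, the environment's reward generation can be sketched as a Bernoulli draw keyed by (action, context). This is an illustrative reimplementation using the rates from the table above; the names `REWARD_RATES` and `draw_reward` are assumptions, not the actual simulation code.

```python
import random

# Bernoulli reward rates from the table above, keyed by (action, context).
REWARD_RATES = {
    ("Red",   "Mobile"): 0.05, ("Red",   "Desktop"): 0.03,
    ("Green", "Mobile"): 0.03, ("Green", "Desktop"): 0.05,
    ("Blue",  "Mobile"): 0.06, ("Blue",  "Desktop"): 0.01,
}

def draw_reward(action, context, rng=random):
    """Return 1 with the table's probability for (action, context), else 0."""
    return 1 if rng.random() < REWARD_RATES[(action, context)] else 0
```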

A successful algorithm should start with a little exploration (random serving) and within 10k decisions converge to mostly exploiting **Green for Desktop** and **Red for Mobile**. Once the **Blue Option** is added, the algorithm should explore a bit more and eventually serve the **Blue Option** over the **Red Option** for the **Mobile context only**. Desktop should stay close to 100% **Green Option**.

The optimal exploration strategy for this environment is **Online Cover 16, Psi 0.1, NoUnif, LR 1e-4**, which effectively balances exploration and exploitation within the given 100k decision window. It makes sense to implement Online Cover with these parameters in cases where we want to add new options to an Online Bandit over time. The exploration algorithms evaluated here (loss estimation method in parentheses) are:

- Epsilon Greedy (MTR)
- Online Bagging (MTR)
- RegCB Elimination (MTR)
- RegCB Optimistic (MTR)
- Online Cover (DR)

This analysis will not cover how each of these algorithms works. For more details on the algorithm definitions and hyperparameters, see Section 3 of Bietti, Agarwal, and Langford, "A Contextual Bandit Bake-off" (2018).

All simulations will be run for 100k decisions, with the Blue Option added as an eligible option after decision 10k. All learning rates are static (power_t = 0) over the course of each simulation, but each simulation specifies its own learning rate so we can characterize the relationship, if any, between each exploration algorithm's hyperparameters and the learning rate.
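The protocol above can be sketched as a driver loop. Everything here (the function names, the context draw, the policy interface) is an illustrative assumption rather than the actual simulation harness; the key detail it captures is that "Blue" only becomes eligible at decision 10k.

```python
import random

# Bernoulli reward rates from the table above, keyed by (action, context).
RATES = {
    ("Red",   "Mobile"): 0.05, ("Red",   "Desktop"): 0.03,
    ("Green", "Mobile"): 0.03, ("Green", "Desktop"): 0.05,
    ("Blue",  "Mobile"): 0.06, ("Blue",  "Desktop"): 0.01,
}

def run_simulation(policy, n_decisions=100_000, add_at=10_000, seed=0):
    rng = random.Random(seed)
    actions = ["Red", "Green"]
    history = []
    for t in range(n_decisions):
        if t == add_at:
            actions = actions + ["Blue"]          # new option becomes eligible
        context = rng.choice(["Mobile", "Desktop"])
        action = policy(context, actions, rng)    # the exploration algorithm's choice
        reward = 1 if rng.random() < RATES[(action, context)] else 0
        history.append((context, action, reward))
        # (each algorithm would update its own model state here)
    return history

# Example: a uniform-random policy over the currently eligible actions.
history = run_simulation(lambda ctx, acts, rng: rng.choice(acts), n_decisions=20_000)
```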

The first algorithm to evaluate is Epsilon Greedy, which serves as our baseline: a conservative but robust exploration strategy. Its purely random component will likely discover the new option, but we expect long-run performance to be limited by that same random exploration, which keeps serving implausible actions in contexts where they clearly underperform. The other thing we have seen in online experiments is high variance in serving: sparse rewards collected on the random side of epsilon greedy lead to large inverse propensities in the off-policy evaluation corrections, especially early on.
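The epsilon-greedy mechanics described above can be sketched as follows. The class name, the empirical-mean learner, and the propensity bookkeeping are illustrative assumptions, not the production implementation; the point is that the served action's propensity `p` is what off-policy corrections divide by, so a non-greedy serve under a small epsilon produces a large `1/p` weight.

```python
import random
from collections import defaultdict

class EpsilonGreedy:
    """Sketch: explore uniformly w.p. epsilon, else exploit the empirical best."""

    def __init__(self, epsilon=0.05):
        self.epsilon = epsilon
        self.pulls = defaultdict(int)   # (context, action) -> times served
        self.wins = defaultdict(int)    # (context, action) -> rewards collected

    def mean(self, context, action):
        n = self.pulls[(context, action)]
        return self.wins[(context, action)] / n if n else 0.0

    def choose(self, context, actions, rng=random):
        if rng.random() < self.epsilon:
            action = rng.choice(actions)                                 # explore
        else:
            action = max(actions, key=lambda a: self.mean(context, a))   # exploit
        # Propensity of the served action under this policy; the IPS
        # correction weights the observed reward by 1/p.
        greedy = max(actions, key=lambda a: self.mean(context, a))
        p = self.epsilon / len(actions) + (1 - self.epsilon) * (action == greedy)
        return action, p

    def update(self, context, action, reward):
        self.pulls[(context, action)] += 1
        self.wins[(context, action)] += reward
```

With epsilon = 0.1 and two actions, a non-greedy serve has p = 0.05, so its reward is upweighted 20x in the off-policy estimate, which is the variance source described above.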

The following plots show the cumulative action distribution of the Epsilon Greedy simulations and the relationship between learning rate (top/columns) and epsilon (side/rows).