Bandit Algorithms in Information Retrieval
Suggested Citation: Dorota Głowacka (2019), “Bandit Algorithms in Information Retrieval”, Foundations and Trends® in Information Retrieval: Vol. 13, No. 4, pp. 299–424. DOI: 10.1561/1500000067.
Dorota Głowacka
University of Helsinki
[email protected]
This article may be used only for the purpose of research, teaching, and/or private study. Commercial use or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher approval.
Boston — Delft
Contents

1 Introduction

6 Recommendation
6.1 Personalization and the Cold Start Problem
6.2 Social Networks and Recommender Systems
6.3 Collaborative Filtering and Matrix Factorization
6.4 Feature Learning with Bandits
6.5 Recommendations with a Limited Lifespan
6.6 Simultaneous Multiple Arms Evaluation
6.7 Summary

Acknowledgements
Appendices
B Symbols
References
Bandit Algorithms in Information
Retrieval
Dorota Głowacka
University of Helsinki; [email protected]
ABSTRACT
Bandit algorithms, named after casino slot machines some-
times known as “one-armed bandits”, fall into a broad cate-
gory of stochastic scheduling problems. In the setting with
multiple arms, each arm generates a reward with a given
probability. The gambler’s aim is to find the arm producing
the highest payoff and then continue playing in order to
accumulate the maximum reward possible. However, having
only a limited number of plays, the gambler is faced with a
dilemma: should he play the arm currently known to produce
the highest reward or should he keep on trying other arms
in the hope of finding a better paying one? This problem
formulation is easily applicable to many real-life scenarios,
hence in recent years there has been an increased interest
in developing bandit algorithms for a range of applications.
In information retrieval and recommender systems, bandit
algorithms, which are simple to implement and do not re-
quire any training data, have been particularly popular in
online personalization, online ranker evaluation and search
engine optimization. This survey provides a brief overview of
bandit algorithms designed to tackle specific issues in infor-
mation retrieval and recommendation and, where applicable,
it describes how they were applied in practice.
Over the last decade there has been an increased interest in application
of bandit algorithms in information retrieval (IR) and recommender
systems. The aim of this survey is to provide an overview of bandit
algorithms inspired by various aspects of IR, such as click models,
online ranker evaluation, personalization or the cold-start problem.
Each section of the survey focuses on a specific IR problem and aims to
explain how it was addressed with various bandit approaches. Within
each section, all the algorithms are presented in chronological order. The
goal is to show how specific concepts related to bandit algorithms (e.g.
graph clustering with bandits) or a specific family of bandit algorithms
(e.g. dueling bandits) developed over time. Gathering all this information
in one place allows us to explain the impact of IR on the development
of new bandit algorithms as well as the impact of bandit algorithms
on the development of new methods in IR. The survey covers papers
published up to the end of 2017.
Why Bandits?
Bandit algorithms derive their name from casino slot machines, some-
times referred to as one-armed bandits. In this scenario, a gambler is
faced with a row of such machines. The gambler has to make a number
of decisions, such as which machines to play or how many times to play
each machine. The problem is that each machine provides a random
reward from a probability distribution specific to that machine. The
gambler aims to maximize the sum of the rewards by playing different
machines. Thus, the gambler needs to make a trade-off between exploit-
ing the machine with the highest expected payoff so far and exploring
other machines to get more information about their expected payoffs.
In the 1950’s Herbert Robbins realized the importance of the problem
and constructed convergent population selection strategies for sequential
design of experiments (Robbins, 1985). A couple of decades later John
Gittens constructed a theorem, called the Gittins index, that gave an
optimal policy for maximizing the expected discounted reward (Gittins,
1979). Later on, some approximate solutions based on epsilon strategies
(Sutton and Barto, 1998) as well as Bayesian methods, such as Thompson
sampling (Thompson, 1933), were developed to solve the bandit problem.
The last two decades has seen an immense interest in the study of bandit
algorithms, starting with the development of Upper Confidence Bound
(UCB) (Agrawal, 1995) strategies. In UCB algorithms, however, every
bandit arm is independent and does not pass any information about
its payoff generating distribution to other bandit arms. This led to the
development of linear and contextual bandits (Auer, 2002; Li et al.,
2010b), where a linear dependency between the expected payoff of an
arm and its context is assumed. In comparison to independent bandit
strategies, linear bandits can lead to elimination of arms with low payoff
earlier during the exploration phase thus allowing the player to focus
on trying arms with a potentially higher payoff.
There are a number of reasons why bandit algorithms have gained a
high level of popularity in many applications. They are quick and easy
to implement, they do not require any training data, and they allow
for continuous testing/learning, which makes them highly applicable to
any online application with a continuous stream of data. Thus, over the
years bandits have been applied in many areas: clinical trials (Villar
et al., 2015; Williamson et al., 2017), adaptive routing (Awerbuch and
Kleinberg, 2008), auctions (Nazerzadeh et al., 2016), financial portfolio
design (Shen et al., 2015), cognitive modelling (Głowacka et al., 2009),
of the reward – the faster the response, the higher the reward. These
basic concepts of reinforcement learning can be easily translated into
learning in interactive systems. For example, a recommender system (the
learner) makes product suggestions to the end user (the environment)
and obtains rewards from the user in the form of clicks or purchases.
The more items are clicked or purchased, the higher the reward.
A characteristic specific only to reinforcement learning is the trade-
off between exploration and exploitation. To accumulate a high reward,
the learner must select actions that, when tried in the past, produced
high rewards. However, in order to find actions that give high rewards,
the learner has to try actions that it has not selected before. Thus, the
agent has to exploit its current knowledge of actions that it has already
experienced in order to obtain reward, but it also has to explore new
actions in order to possibly improve the reward in the future. On a
stochastic task, each action must be tried many times to gain a reliable
estimate of its expected reward. For example, a recommender system
suggested movies from three categories to a specific user: crime drama,
action drama and romantic movies (the exploration phase). The user
only selected romantic movies to watch. At this stage, the recommender
system could simply decide that this is the only category of movies
that the user likes and only ever recommend that type of movie (the
exploitation phase). However, it may also be the case that there are
other movie categories never recommended by the system that the user
likes even more than romantic movies. It might be worthwhile for the system
to suggest movies from categories not presented before (continue
with exploration) in the hope that the user will find some
of the newly presented categories even more attractive than romantic
movies and thus buy/watch even more movies (increased reward for
the system).
Another important aspect of reinforcement learning is that it explic-
itly considers the whole problem of a goal-directed learner interacting
with an uncertain environment. This is in contrast to other areas of
machine learning that focus on specific subproblems, such as classifica-
tion or clustering, without specifying how the ability to perform such a
task would be useful. Reinforcement learning agents, on the other hand,
have explicit goals, can sense aspects of their environments, and can
choose actions that influence their environments.
ε-greedy

The ε-greedy algorithm is a popular heuristic for handling the exploration/exploitation trade-off. It is widely used due to its simplicity in terms of implementation and because it can easily generalize to many sequential decision problems (Kuleshov and Precup, 2014), where an agent’s choice of action now depends on the actions they will choose in the future, for example scheduling.

At each round t, ε-greedy selects the arm with the highest empirical mean with probability 1 − ε, and selects a random arm with probability ε. In other words, given initial empirical means $\hat{\mu}_1, \dots, \hat{\mu}_K$:

$$p_i(t+1) = \begin{cases} 1 - \epsilon + \epsilon/K & \text{if } i = \arg\max_{j=1,\dots,K} \hat{\mu}_j(t), \\ \epsilon/K & \text{otherwise.} \end{cases}$$
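As a rough illustration, the following sketch implements the ε-greedy selection rule above; the three Bernoulli arms, their payoff probabilities and the value of ε are made-up values for the example, not part of the original text.

```python
import random

def epsilon_greedy_choice(empirical_means, epsilon):
    """Exploit the best empirical mean with probability 1 - epsilon,
    otherwise explore a uniformly random arm."""
    if random.random() < epsilon:
        return random.randrange(len(empirical_means))  # explore
    return max(range(len(empirical_means)), key=lambda i: empirical_means[i])  # exploit

# Illustrative use: K = 3 Bernoulli arms with assumed payoff probabilities.
true_payoffs = [0.2, 0.5, 0.7]
means, counts = [0.0] * 3, [0] * 3
for t in range(1000):
    arm = epsilon_greedy_choice(means, epsilon=0.1)
    reward = 1.0 if random.random() < true_payoffs[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
```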
The UCB1 algorithm (Auer et al., 2002a) maintains the empirical mean reward $\hat{\mu}_i$ of each arm as well as the number of times that each arm has been played, $n_i(t)$. Initially, each arm is played once. At round t, the algorithm chooses the arm $a_i(t)$ to play as follows:

$$a_i(t) = \arg\max_{i=1,\dots,K} \left( \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}} \right).$$

The expected regret of UCB1 grows logarithmically in t and depends on the gaps $\Delta_i = \mu^\star - \mu_i$ between the expected payoff $\mu^\star$ of the best arm and the expected payoffs $\mu_i$ of the suboptimal arms.
Auer et al. (2002a) also propose UCB1-Tuned, which performs better
in practice than UCB1 but comes without theoretical guarantees. The
main feature of UCB1-Tuned is that it takes into account the variance
of each arm and not only its empirical mean. At round t, the algorithm
picks arm ai as follows:
$$a_i(t) = \arg\max_{i=1,\dots,K} \left( \hat{\mu}_i + \sqrt{\frac{\ln t}{n_i} \min\left(\tfrac{1}{4},\, V_i(n_i)\right)} \right),$$

where $V_i(t) = \hat{\sigma}_i^2(t) + \sqrt{\frac{2 \ln t}{n_i(t)}}$. The estimate of the variance $\hat{\sigma}_i^2(t)$ can be computed by maintaining the empirical sum of squares of the rewards, in addition to the empirical mean.
Due to the fact that in both algorithms each arm has to be played
multiple times in order to calculate its empirical mean and variance,
simple upper confidence bound strategies do not scale up to problems
that require a large number of arms. However, the UCB1 algorithm is
often used as a baseline in many experiments mentioned throughout
this survey.
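A minimal sketch of the UCB1 index and update loop described above; the simulated Bernoulli arms and the horizon are illustrative assumptions.

```python
import math
import random

def ucb1_index(mean, n_plays, t):
    """UCB1 index: empirical mean plus exploration bonus sqrt(2 ln t / n_i)."""
    return mean + math.sqrt(2.0 * math.log(t) / n_plays)

true_payoffs = [0.3, 0.6, 0.4]            # assumed Bernoulli arm parameters
K = len(true_payoffs)
means, counts = [0.0] * K, [0] * K

for t in range(1, 5001):
    if t <= K:
        arm = t - 1                        # play each arm once to initialise
    else:
        arm = max(range(K), key=lambda i: ucb1_index(means[i], counts[i], t))
    reward = 1.0 if random.random() < true_payoffs[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```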
In contextual bandits, the learning agent observes side information (context) about the arms and learns how the feature vectors and rewards of the arms relate to each other, so that the next best arm to play can be predicted by considering the feature vectors. For example, in a news personalization system, each news article could be treated as an arm, and the features of both articles and users as contexts. The system then selects articles for each user to maximize click-through rate or dwell time. Contextual bandits play an important role in recommender systems. In Chapter 6, we will look in more detail at how contextual bandits have been applied to improve personalization or to address the cold-start problem.
Many contextual bandits assume a linear dependency between the
expected reward of an arm and its context (linear bandits). LinRel
(Auer, 2002) was one of the first methods to extend the UCB algorithm
to contextual cases. LinRel assumes that for each arm ai there is an
associated feature vector xi and the expected reward of arm ai is linear
with respect to its feature vector. The algorithm maintains an implicit or
explicit representation of the estimate ŵ of the unknown weight vector
w which maps context features to relevance scores. In each round t,
LinRel algorithm obtains an estimate ŵt by solving the linear regression
problem yt ≈ Xt · ŵt , where yt = (y1 , . . . , yt−1 ) is the column vector of
relevance scores, or rewards, received so far, and Xt = (x1 , . . . , xt−1 )
is the matrix of row feature vectors of datapoints (arms) tried up to
time t. Based on the estimated weight vector ŵ, LinRel calculates an
estimated relevance score ŷi = xi · ŵ for each arm ai that has not yet
been tried by the player. The arm with the largest upper confidence
bound for the relevance score is presented to the player. The upper
confidence bound for an arm ai is calculated as ŷi + ασ̂i , where σ̂i is
an upper bound on the standard deviation of the relevance estimate
ŷi . The constant α is used to adjust the confidence level of the upper
confidence bound. A regularized version of the algorithm computes, in each round t and for each arm $a_i$, the same kind of upper confidence bound from a ridge regression estimate of w.
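A minimal sketch of a LinRel-style selection step under a ridge-regularized formulation, assuming the uncertainty term is taken as the usual matrix-weighted norm of the candidate feature vector; the feature values, α and λ below are illustrative and not the exact quantities of Auer (2002).

```python
import numpy as np

def linear_ucb_choice(X_hist, y_hist, candidates, alpha=1.0, lam=1.0):
    """Estimate w by ridge regression on past (feature, reward) pairs and
    pick the candidate arm with the largest upper confidence bound."""
    d = candidates.shape[1]
    A = X_hist.T @ X_hist + lam * np.eye(d)        # regularized Gram matrix
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ X_hist.T @ y_hist              # ridge estimate of w
    scores = candidates @ w_hat                    # estimated relevance scores
    widths = np.sqrt(np.einsum("ij,jk,ik->i", candidates, A_inv, candidates))
    return int(np.argmax(scores + alpha * widths)) # arm with largest score + alpha * width

# Illustrative use with 2-dimensional contexts.
X_hist = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_hist = np.array([0.2, 0.8, 0.9])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(linear_ucb_choice(X_hist, y_hist, candidates))
```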
Click models learn from user clicks to help understand and incorporate
users’ implicit feedback (Chuklin et al., 2015). Previous studies of user
click behaviour provide a spectrum of hypotheses and models on how
an average user examines and clicks documents returned by a search
engine with respect to the submitted query. This chapter provides
a brief overview of bandit algorithms inspired by click models, most
notably the Cascade Model (Craswell et al., 2008), the Dependent Click
Model (DCM) (Guo et al., 2009) and the Position Based Model (PBM)
(Richardson et al., 2007).
The chapter is divided into three sections. Section 3.1 is devoted to
the Cascade Model, Section 3.2 to the DCM, and Section 3.3 to the
PBM. In each section, we first introduce the basic tenets behind each
click model and then briefly describe the algorithms incorporating ideas
from each model in a chronological order.
clicked result are not examined. The user views search results from top
to bottom, deciding whether to click each result before moving to the
next. The documents before the first clicked result are assumed to be
not attractive because the user examines them without clicking on any
of them. The documents after the first clicked result are unobserved
because the user never examines them.
Formally, the model can be described as follows. Let us assume that there is a list of K items that come from a ground set E = {1, . . . , L}, such as a list of documents. The user traverses a list of K items $A = (a_1, \dots, a_K) \in \Pi_K(E)$, where $\Pi_K(E)$ is the set of K-permutations of set E. The model is parametrized by attraction probabilities $w \in [0, 1]^E$. After the user examines item $a_k$, the item attracts the user with probability $w(a_k)$, independently of the other items. If the user is attracted by item $a_k$, she clicks on it and stops the search. Thus, the probability that item $a_k$ is examined by the user is $\prod_{i=1}^{k-1}(1 - w(a_i))$, while the probability that the user finds at least one item attractive is $1 - \prod_{i=1}^{K}(1 - w(a_i))$. This objective is maximized by the K most attractive items.
At time t, the user examines the recommended list $A_t = (a_1^t, \dots, a_K^t)$ from the first item $a_1^t$ to the last $a_K^t$, and clicks on the first attractive item. If the user is not attracted by any item, the user does not click on any item. The agent at time t receives feedback

$$C_t = \arg\min\{1 \le k \le K : w_t(a_k^t) = 1\},$$

which is the position of the user's click. Since the user clicks on the first attractive item in the list, the observed weights of all recommended items at time t can be determined from $C_t$. The reward of the agent at time t can be written in several forms: for example, as $\max_k w_t(a_k^t)$, which indicates whether at least one item in list $A_t$ is attractive; or as $f(A_t, w_t)$, where

$$f(A, w) = 1 - \prod_{k=1}^{K}(1 - w(a_k)).$$
The weights of the items in the ground set E are distributed indepen-
dently and the weight of any item at time t is drawn independently of
the weights of the other items.
Kveton et al. (2015a) propose two algorithms to solve the learning
variant of the cascade model: CascadeUCB1 and CascadeKL-UCB. Cas-
cadeUCB1 is motivated by UCB1 (Auer et al., 2002a) (Section 2.2) and
CascadeKL-UCB is motivated by KL-UCB (Garivier and Cappé, 2011).
KL-UCB is an online, horizon-free index policy for stochastic bandit
problems shown to be more efficient and stable compared to various
UCB policies. The pseudocode for both algorithms is in Algorithm 2.
The algorithms differ only in how they estimate the upper confidence
bound (UCB) Ut (e) on the attraction probability of item e at time t.
After that, they recommend a list of K items with largest UCBs:
$$A_t = \arg\max_{A \in \Pi_K(E)} f(A, U_t).$$
After the user provides feedback Ct , the algorithms update their esti-
mates of the attraction probabilities w(e) for all e = atk .
The UCBs are computed as follows. The UCB on the attraction probability of item e at time t is

$$U_t(e) = \hat{w}_{T_{t-1}(e)}(e) + c_{t-1,\, T_{t-1}(e)},$$

where $\hat{w}_s(e)$ is the average of the s observed weights of item e, $T_t(e)$ is the number of times that item e has been observed in t steps, and

$$c_{t,s} = \sqrt{(1.5 \log t)/s}.$$
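A minimal sketch of how the CascadeUCB1 indices above could be computed and used to pick the top-K list; the data structures, the handling of unobserved items and the numbers are illustrative assumptions, not the survey's Algorithm 2.

```python
import math

def cascade_ucb1_list(w_hat, pulls, t, K):
    """Return the K items with the largest indices
    U_t(e) = w_hat(e) + sqrt(1.5 * log(t) / T(e))."""
    def index(e):
        if pulls[e] == 0:
            return float("inf")                     # unobserved items are tried first
        return w_hat[e] + math.sqrt(1.5 * math.log(t) / pulls[e])
    items = sorted(range(len(w_hat)), key=index, reverse=True)
    return items[:K]

# Illustrative use: 5 items with some observation statistics already collected.
w_hat = [0.1, 0.3, 0.2, 0.0, 0.25]   # average observed attraction weights
pulls = [4, 5, 3, 0, 6]              # T_t(e): how often each item was observed
print(cascade_ucb1_list(w_hat, pulls, t=20, K=3))
```

Because f(A, U) is maximized by the K items with the largest UCBs, sorting by index suffices to build the recommended list.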
The upper bounds of both algorithms are O(log t), linear in the
number of items L, and they improve as the number of recommended
items K increases. The bounds do not depend on the order of recom-
mended items. Experimental results based on simulations show that
CascadeKL-UCB outperforms CascadeUCB1 (Kveton et al., 2015a),
which is in concordance with the general observation that KL-UCB
tends to outperform UCB1 when the expected payoffs of arms are low
(Garivier and Cappé, 2011).
Second, both algorithms choose the optimal list At . Finally, they receive
feedback, and update Mt and Bt :
$$M_t = M_t + \sigma^{-2} x_e x_e^\top, \qquad B_t = B_t + x_e \mathbb{1}\{C_t = k\}.$$
the user and the corresponding base arm a. For example, if the user at time t is characterized by a feature vector $u_t$ and the base arm a has a feature vector $g_a$, then $x_{t,a} = u_t g_a^\top$ is the combined feature vector of
base arm a at time t. The learning agent also knows of the user’s past
history H, which contains feedback information at all time s < t, as
well as contextual information at time t.
Additionally, the reward received by the agent is subject to position
discounts γk ∈ [0, 1], k ≤ K. For example, if the user selects the first
item on the recommended list, the learning agent will receive reward 1
and if the user selects an item further down the list, the learning agent
will receive a discounted reward γk .
A solution to this problem is the C 3 -UCB algorithm. After comput-
ing the UCBs based on which a list of items is selected for presentation
to the user, the algorithm computes an estimate of θ̂t , which can be
viewed as a ridge regression problem:
$$\hat{\theta}_t = (X_t^\top X_t + \lambda I)^{-1} X_t^\top Y_t,$$
where X is a matrix with each row being the feature vector of items
presented to the user up to time t, i.e. γk xs,ak , and Yt is a column vector
whose elements are the weights of the corresponding feature vectors in
X, i.e. γk ws (ak ), with regularization parameter λ > 0. A new confidence
radius βt (σ) is also calculated:
$$\beta_t(\sigma) = R\sqrt{\ln\frac{\det(V_t)}{\alpha^d \sigma^2}} + \sqrt{\lambda},$$
where $V_t = X_t^\top X_t + \lambda I$ and R is the current regret. Finally, the UCB is defined as

$$U_t(a) = \min\{\hat{\theta}_{t-1}^\top x_{t,a} + \beta_{t-1}(\sigma)\, \|x_{t,a}\|_{V_{t-1}^{-1}},\ 1\}.$$
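A minimal sketch of the ridge-regression estimate and the truncated UCB score above; the confidence radius β is supplied as a constant here rather than computed from det(V_t), and all numbers are illustrative assumptions.

```python
import numpy as np

def c3ucb_scores(X, Y, candidates, beta, lam=1.0):
    """Ridge estimate theta_hat = (X^T X + lam I)^{-1} X^T Y and
    truncated UCBs min{theta_hat^T x + beta * ||x||_{V^{-1}}, 1}."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ X.T @ Y
    norms = np.sqrt(np.einsum("ij,jk,ik->i", candidates, V_inv, candidates))
    return np.minimum(candidates @ theta_hat + beta * norms, 1.0)

# Illustrative use with position-discounted features already folded into X.
X = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
Y = np.array([1.0, 0.0, 1.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0]])
print(c3ucb_scores(X, Y, candidates, beta=0.5))
```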
The cascade model (Section 3.1) assumes that a user abandons exami-
nation of the ranked list of items after the first click. However, in many
² Note that l in PIE(l) indicates the position in the list of presented K items; thus, the name of the algorithm should be PIE(k) in order to be consistent with the notation used in this chapter. However, the original name of the algorithm is PIE(l).
For example, suppose that the learning agent recommends the list of items (1, 2, 3, 4) and observes user clicks $c_t = (0, 1, 1, 0)$. This feedback can
be interpreted in two ways. The first explanation is that item 1 is not
attractive, items 2 and 3 are attractive, and that the user does not
exit at either positions 2 or 3. The second explanation is that item 1
is not attractive, items 2 and 3 are attractive, and that the user does
not exit at position 2, but exits at position 3. In the first case, the
learning agent receives no reward; in the second one, it does. Since
the reward is not directly observed, DCM bandits can be viewed as
an instance of partial monitoring. However, DCM bandits cannot be
solved efficiently by existing algorithms for partial monitoring because
the action set is combinatorial. In DCM bandits, an additional mild
assumption is imposed to allow efficient learning: the order of the
termination probabilities is known to the learning agent.
The proposed solution to DCM bandits is an algorithm called
dcmKL-UCB, which, similarly to CascadeKL-UCB (Section 3.1.1), is
motivated by KL-UCB (Garivier and Cappé, 2011). First, the algorithm
computes the UCBs Ut on the attraction probabilities of all items in E
at time t:
$$U_t(e) = \max\{q \in [w, 1] : w = \hat{w}_{T_{t-1}(e)}(e),\ T_{t-1}(e)\, D_{\mathrm{KL}}(w \,\|\, q) \le \log t + 3\log\log t\},$$
where DKL (p k q) is the Kullback–Leibler (KL) divergence between
Bernoulli random variables with means p and q; ŵs (e) is the average
of s observed weights of item e; and Tt (e) is the number of times that
item e is observed in t steps. Second, dcmKL-UCB recommends a list
of K items with largest UCBs:
$$A_t = \arg\max_{A \in \Pi_K(E)} f(A, U_t, v),$$

where v denotes the termination probabilities.
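Since the Bernoulli KL divergence is increasing in q for q ≥ w, the KL-UCB-style index above can be computed by a simple binary search; a minimal sketch, in which the iteration count and numerical safeguards are illustrative assumptions.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(w_hat, n_obs, t):
    """Largest q in [w_hat, 1] with n_obs * KL(w_hat || q) <= log t + 3 log log t."""
    if n_obs == 0:
        return 1.0
    budget = math.log(t) + 3 * math.log(max(math.log(t), 1e-12))
    lo, hi = w_hat, 1.0
    for _ in range(50):                      # binary search to fixed precision
        mid = (lo + hi) / 2
        if n_obs * bernoulli_kl(w_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(w_hat=0.3, n_obs=20, t=100))
```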
The Cascade Model (Section 3.1) and the DCM (Section 3.2) models
assume that a portion of the recommended list is explicitly examined
by the user, which allows the learning agent to access the rewards
corresponding to the unbiased user’s evaluation of each item. In the
Position-Based Model (PBM) (Richardson et al., 2007), each position
in the list is also endowed with a binary examination variable which is
equal to one only when the user paid attention to the corresponding item.
However, this variable, which is independent of the user’s evaluation of
the item, is not observable. Compared to the Cascade Model, the PBM is challenging because the learning agent only observes actual clicks; non-clicks may provide additional information, but this information tends to be ambiguous. Thus, combining observations
made at different positions becomes a non-trivial statistical task.
where $S_e(t) = \sum_{k=1}^{K} S_{e,k}(t)$ with $S_{e,k}(t) = \sum_{s=1}^{t-1} Z_k(s)\,\mathbb{1}\{A_k(s) = e\}$, and $N_e(t) = \sum_{k=1}^{K} N_{e,k}(t)$ with $N_{e,k}(t) = \sum_{s=1}^{t-1} \mathbb{1}\{A_k(s) = e\}$. The UCB also incorporates bias-corrected versions of the counts: $\tilde{N}_e(t) = \sum_{k=1}^{K} \tilde{N}_{e,k}(t)$, where $\tilde{N}_{e,k}(t) = \sum_{s=1}^{t-1} \kappa_k\,\mathbb{1}\{A_k(s) = e\}$. The algorithm sorts optimistic
indices in decreasing order and pulls the corresponding first K arms.
The PBM-PIE algorithm is an adaptation of the PIE(l) algorithm
(Combes et al., 2015) introduced for the Cascade Model (Section 3.1.4).
At each round, the learning agent potentially explores at position K with probability 1/2 using the UCB for each arm e, which also incorporates the examination parameters $\kappa_k$. In the other positions, k = 1, . . . , K − 1, PBM-PIE selects the arms with the largest estimates $\hat{\theta}_e(t) = S_e(t)/\tilde{N}_e(t)$. Simulation experiments show that PBM-PIE outperforms both PBM-UCB and the baseline KL-UCB; KL-UCB in turn outperforms PBM-UCB.
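A minimal sketch of the click and bias-corrected display statistics used by these PBM algorithms, assuming a simple log format of (item, clicked) pairs per position; the κ values and the data are illustrative assumptions.

```python
def pbm_estimates(logs, kappa, n_items):
    """Compute clicks S_e and kappa-weighted displays N_tilde_e, then the
    attraction estimates theta_hat_e = S_e / N_tilde_e.

    `logs` is a list of rounds; each round is a list of (item, clicked) pairs,
    one pair per position k = 1..K."""
    S = [0.0] * n_items
    N_tilde = [0.0] * n_items
    for round_ in logs:
        for k, (item, clicked) in enumerate(round_):
            S[item] += 1.0 if clicked else 0.0
            N_tilde[item] += kappa[k]              # bias-corrected display count
    return [S[e] / N_tilde[e] if N_tilde[e] > 0 else 0.0 for e in range(n_items)]

# Illustrative use: 3 items, 2 positions with examination probabilities (1.0, 0.6).
logs = [[(0, True), (1, False)], [(1, True), (2, False)], [(0, False), (2, True)]]
print(pbm_estimates(logs, kappa=[1.0, 0.6], n_items=3))
```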
The drawback of the approach proposed by Lagrée et al. (2016) is the
fact that the examination parameters are estimated off-line, which limits
the usage of these algorithms. Komiyama et al. (2017) extend the PBM by making an additional assumption about the examination parameters $\kappa_k$: it is natural that items higher up the list receive more clicks, so $\kappa_1 > \kappa_2 > \dots > \kappa_K > 0$ and this order is known. Additionally, $\kappa_1 = 1$. The model involves $E + K$ parameters $\{\theta_e^\star\}_{e\in[E]}$ and $\{\kappa_k^\star\}_{k\in[K]}$, where E is the number of arms and K is the number of slots in the list of items presented to the user. Each arm $e \in [E]$ is associated with a distinct parameter $\theta_e^\star \in (0, 1)$, and each slot $k \in [K]$ is associated with a parameter $\kappa_k^\star \in (0, 1]$. At each round t, the system selects K items (arms) for presentation and receives a corresponding
binary reward (click or non-click) for each item. The reward is drawn
from a Bernoulli distribution. The parameters except for κ1 = 1 are
unknown to the learning agent. The goal for the learning agent is to
maximize the cumulative rewards.
The solution to this problem is provided by Permutation Minimum
Empirical Divergence (PMED) algorithm. The algorithm contains two loops: $L_C = L_C(t)$ is the set of K-allocations in the current loop, and
3.4 Summary
$$F(A \mid w) = w^\top \big(F_1(A), \dots, F_i(A)\big),$$
$$w_t = M_t^{-1} C_t,$$
² See also Section 7.3 for an application of exploration scavenging to web-page layout optimization with bandits.
rank k, ERBA selects the second query as the next best arm excluding
the presented arms according to some criteria of the particular bandit
algorithm used to make a query suggestion at a given rank. To utilize
prior knowledge from query logs, Thompson Sampling (Chapelle and
Li, 2011) (Section 2.2.1) is used to select a query suggestion at each
position. Experimental results with Yahoo! search query logs show that
Thompson sampling ERBA substantially outperforms UCB1 ranked
bandits in terms of CTR for different lengths of queries.
A modified version of Thompson sampling (Chapelle and Li, 2011)
has also been applied to query recommendation (Hsieh et al., 2015). The
proposed algorithm makes a simplifying assumption that the engagement
rate on a specific search suggestion is independent of the other K − 1
suggestions present, which is necessary to accommodate multiple query
suggestions per request. This assumption allows the reduction of the
size of the action space. Based on observations of different search
scenarios in practice, two additional modifications were made. First,
the F parameter of the Beta distribution, indicating the number of failures, is updated by 1/(K − 1) instead of 1. In this way, “one” failure (but not K − 1 failures) is assigned if one success is conferred to some arm at each round, and the failure is shared by all the remaining arms.
This helps to prevent potential under-exploration that might occur as
a result of the independence assumptions. Second, a rate parameter
γ, to accommodate the ignorance of not observing a user action, was
introduced. Intuitively, γ is a scale factor to control the rate at which
the posterior uncertainty decreases when no user decision is spotted. For
example, at γ = K, the algorithm sees no-response as dissatisfaction.
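A minimal sketch of the modified Thompson sampling updates described above; the Beta(1,1) priors and the simulated interaction are illustrative assumptions, and the γ-scaled handling of no-response rounds is omitted.

```python
import random

def recommend(alpha, beta, K):
    """Sample from each Beta posterior and return the top-K query suggestions."""
    samples = {a: random.betavariate(alpha[a], beta[a]) for a in alpha}
    return sorted(samples, key=samples.get, reverse=True)[:K]

def update(alpha, beta, shown, clicked_arm, K):
    """Modified update: one success for the clicked arm, and a 1/(K-1) failure
    shared among the other shown arms instead of a full failure each."""
    if clicked_arm is None:
        return                                    # gamma-scaled no-response update omitted
    alpha[clicked_arm] += 1.0
    for a in shown:
        if a != clicked_arm:
            beta[a] += 1.0 / (K - 1)

# Illustrative use: 5 candidate suggestions, show K = 3 per request.
alpha = {a: 1.0 for a in range(5)}
beta = {a: 1.0 for a in range(5)}
shown = recommend(alpha, beta, K=3)
update(alpha, beta, shown, clicked_arm=shown[0], K=3)
```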
Li et al. (2016a) address the issue of the user’s complex information
needs. In a majority of retrieval systems, a single active query is used
to retrieve relevant documents. When the user’s information need has
multiple aspects, the query must represent the union of these aspects.
The solution is to keep multiple, simpler queries active and, based on
the user’s feedback, these queries take turns to retrieve documents. The
number of relevant documents, treated as rewards and obtained from
past result pages of each query, indicates which query to explore next.
The approach is based on UCB bandits with a sliding window (Garivier
and Moulines, 2011), which considers only the last τ plays. In many
cases, however, the subtopics of the information need of the user are not
defined upfront and gradually change during the search process. New
queries can be generated through an interactive retrieval process, for
example through relevance feedback, and added to the bandit algorithm
used to retrieve the initial search results. An algorithm inspired by
Monte Carlo tree search (Srinivasan et al., 2015) is proposed, where
the UCB score of a new query is estimated by its similarity to existing
queries in the pool. The similarity is calculated using a bag of words
representation of queries and Gaussian Process kernel (see Section 6.1
for more details). For a new query, a new arm n is introduced and its UCB score is estimated as

$$\mathrm{UCB}_n(t) = \frac{\hat{R}_n(t)}{\hat{N}_n(t)} + c\,\sqrt{\frac{\log \min(t, \tau) + \hat{N}_n(t)}{\hat{N}_n(t)}},$$
where $\hat{R}_n(t)$ is the estimated sum of rewards of the n-th arm in the last τ rounds, $\hat{N}_n(t)$ is the estimated number of times arm n has been played in the first t rounds, and c is a constant controlling the tendency to explore the new arm.
4.4 Summary
yielding k-armed dueling bandit problems with k ∈ {16, 32, 64}. In all
the experiments, RUCB accumulates the least regret, while the regret
of Beat-the-Mean grows linearly with time.
The Relative Minimum Empirical Divergence (RMED) algorithm
(Komiyama et al., 2015) improves upon the theoretical results for RUCB.
For each arm, the algorithm computes the empirical divergence defined
as

$$I_i(t) = \sum_{j:\, p_{i,j} < 1/2} d(\hat{p}_{i,j}(t), 1/2)\, N_{i,j}(t),$$
where p̂i,j (t) is the algorithm’s empirical estimate of the preference
probability pi,j at time t, and Ni,j (t) is the number of comparisons
between arm i and any arm j that beats i in the first t time-steps, while
d(p, q) is the KL-divergence between two Bernoulli distributions with
parameters p and q. The likelihood that arm i is a Condorcet winner
is exp(−Ii (t)) and it is used to pick the arms that will be compared
against each other at time t + 1.
Simulation results in a setting similar to the one used for testing
RUCB (Zoghi et al., 2014b) described above in this section show that
RMED significantly outperforms RUCB and similar dueling bandit
algorithms in terms of regret accumulation.
Relative Confidence Sampling (RCS) (Zoghi et al., 2014a) is an
algorithm related to RUCB. It differs from RUCB in one crucial respect:
the use of sampling when conducting a round robin tournament to select
a champion arm. The intuition behind this change is that improved
performance can be obtained by maintaining posterior distributions
over the expected value of each arm and sampling from those posteriors
to determine which arm to select. Sampling based methods, which rely
on posteriors, do not easily gravitate toward the extremes thus leading
to the selection of more appropriate arms.
The RCS algorithm takes as input a set of ranking functions (rankers)
and an oracle, e.g. an interleaved comparison method that returns a
noisy estimate of which ranker is the winner. RCS has one parameter α,
which controls the exploration rate of the algorithm: the higher the value
of α, the longer it takes to settle on a single ranker. RCS also maintains a
scoresheet where all the comparisons are stored. The algorithm proceeds
in two phases. In phase one, a tournament is simulated based on the
current scoresheet, i.e., samples $\theta_{ij}$ are collected for each pair of rankers (i, j) with i > j from the posterior Beta distribution. RCS sets $\theta_{ji} = 1 - \theta_{ij}$ and $\theta_{ii} = 1/2$. Ranker i beats ranker j in the simulated tournament if $\theta_{ij} > 1/2$. There are two ways in which a champion ranker could be
selected. In the first scenario, the champion ranker is the ranker that
beats all other rankers in this tournament. If no ranker beats all other
rankers, then the champion ranker is the one that has been the champion
least often. After a number of iterations, enough inferior rankers will be
eliminated for the algorithm to always select the best option. The second
stage of the algorithm is similar to RUCB (Zoghi et al., 2014b) described
above in this section, i.e. the UCB is calculated for all the arms with the
current champion as reference and using the same formula as RUCB.
RCS picks the ranker with the highest UCB. Finally, the champion
ranker and the ranker with the highest UCB are compared against each
other using a real interleaved comparison and the scoresheet is updated
accordingly.
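A minimal sketch of the first, sampling-based phase of RCS, i.e. simulating a tournament from the current scoresheet; the Beta(1,1) prior and the scoresheet values are illustrative assumptions.

```python
import random

def simulated_tournament_champion(wins, champion_counts):
    """wins[i][j] = number of times ranker i beat ranker j in past comparisons.
    Sample theta_ij ~ Beta(wins[i][j]+1, wins[j][i]+1) for i > j, set
    theta_ji = 1 - theta_ij, and return a champion ranker."""
    K = len(wins)
    theta = [[0.5] * K for _ in range(K)]
    for i in range(K):
        for j in range(i):
            theta[i][j] = random.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # A champion beats every other ranker in the simulated tournament.
    for i in range(K):
        if all(theta[i][j] > 0.5 for j in range(K) if j != i):
            champion_counts[i] += 1
            return i
    # Otherwise fall back to the ranker chosen as champion least often so far.
    return min(range(K), key=lambda i: champion_counts[i])

wins = [[0, 3, 5], [2, 0, 4], [1, 1, 0]]     # illustrative scoresheet
champion_counts = [0, 0, 0]
print(simulated_tournament_champion(wins, champion_counts))
```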
The RCS algorithm was compared against RUCB using two large-
scale learning to rank datasets: the Microsoft Learning to Rank dataset
and the Yahoo! Learning to Rank Challenge dataset. Probabilistic
interleave (Hofmann et al., 2011a) was used to compare a pair of
rankers and a probabilistic user model (Craswell et al., 2008) was used
to model the user’s click behaviour. RCS accumulates on average a
third less regret than RUCB, i.e. RCS finds the best ranker faster and
makes less severe errors compared to RUCB.
Although RUCB and RCS outperform earlier dueling bandit algo-
rithms, such as Interleave Filter (Yue et al., 2012a) and Beat-the-Mean
(Yue and Joachims, 2011) (Section 5.1), they have difficulty scaling
to large K. Interleave Filter and Beat-the-Mean make additional as-
sumptions to facilitate the identification of the best ranker, however,
these assumptions may be too restrictive for specific applications, such
as web search. mergeRUCB (Zoghi et al., 2015b) is an algorithm that
bridges the gap between these two approaches: it makes only weak as-
sumptions about the k-armed dueling bandit problem but requires only
O(K) comparisons and therefore performs well when many rankers need
to be compared. mergeRUCB groups rankers into small batches, which
reduces the number of comparisons before rankers can be eliminated.
Relevance judgements are used for creating test collections that are later
employed for evaluation of IR tasks and techniques. However, relevance
5.6 Summary
CGPrank was tested using the Yahoo! News dataset and Google
e-books recommendation. In various settings of the algorithm, CGPrank
consistently outperformed the benchmark of UCB1 (Auer et al., 2002a)
(Section 2.2) in terms of CTR.
probability the underlying clusterings over users. This works under the
assumption that the latent clusters are not too small.
COFIBA was compared against LinUCB, DynUCB and CLUB. In
terms of CTR assessed using the Yahoo! dataset, it performed marginally
better than LinUCB and CLUB, however it performed substantially
better than DynUCB. In two other datasets concerned with CTR on
adverts obtained from Telefonica and Avazu, COFIBA outperformed
all the benchmark algorithms by between 5% and 20%.
The Context Aware clustering of Bandits algorithm (CAB) (Gentile
et al., 2017) incorporates collaborative effects into inference by making
changes in the way items are recommended and the weights w are
updated. At time t, before recommending an item to the user, CAB
first computes for each item the set of users that are likely to give the
item a similar payoff. This set is the estimated neighborhood of user
ui with respect to item xk . Earlier approaches (described above in this
section) update the user proxies wi by solving a regularized least squares
problem with feature representations of items presented previously to
user i and the corresponding payoffs. CAB, on the other hand, allows a
user ui to inherit updates obtained through an item being presented to
another user uj if the two users agree on their opinion of item x with a
sufficient level of confidence. If CAB is not confident enough about the
opinion it has along the direction xk , then only the weight wi associated
with user ui is updated. However, if CAB is confident, then the proxy
updates are performed for all users in its estimated neighborhood. All
users in the neighbourhood undergo the same update. CAB is very
flexible in handling a fluid set of users. Due to its context-sensitive user
aggregation step, CAB allows users to be easily added or dropped.
In terms of CTR, CAB substantially outperformed the baselines of
CLUB, DynUCB and LinUCB when using the Telefonica dataset, the
Avazu dataset and the KDD Cup 2012 dataset.
feature hierarchy allows exploration in the full space when a user is not
perfectly modeled by the coarse space. CoFineUCB is the algorithm
that automatically balances exploration within the coarse-to-fine feature
hierarchy.
For example, there are features that correspond to interest in articles
about baseball and cricket. Prior knowledge suggests that users are
typically interested in one or the other. Then, a feature subspace can
be designed where baseball and cricket topics project along opposite
directions in a single dimension. A bandit algorithm can leverage this
structure by first exploring at a coarse level to determine whether
the user is more interested in articles about baseball or cricket. This
approach can be formalized as a hierarchy that is composed of the
full feature space and a subspace, where the matrix $X \in \mathbb{R}^{d\times n}$ denotes an n-dimensional subspace, constructed using the top n singular vectors of the matrix W containing the user profiles.
The user’s preferences are denoted by weight vector w = X w̄ + w⊥ ,
where w̄ is the projected user profile and w⊥ is the residual, or orthogonal
component, of w with respect to X. The principle can be extended to
multiple hierarchies:
where $\tilde{x}_i = X^\top x_i$ denotes the projected features of the action taken at time i, $C_i$ is the user feedback, and λ is the regularization parameter. Then, $w_t$ is estimated as

$$w_t = \arg\min_{w} \sum_{i=1}^{t-1} (w^\top x_i - C_i)^2 + \lambda\, \| w - X\bar{w}_t \|^2,$$

which regularizes $w_t$ to the projection of $\bar{w}_t$ back into the full space.
CoFineUCB then chooses the item with the largest potential reward to present to the user: $x_t = \arg\max_x\, w_t^\top x + c_t(x) + \tilde{c}_t(x)$, where $c_t(\cdot)$ and $\tilde{c}_t(\cdot)$ denote the confidence terms computed in the full space and in the subspace, respectively.
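A minimal sketch of the coarse-to-fine regularized estimate and selection step described above, using a single combined confidence term instead of separate full-space and subspace terms; the subspace, data and parameters are illustrative assumptions.

```python
import numpy as np

def cofine_select(X_sub, x_hist, c_hist, w_bar, candidates, lam=1.0, alpha=0.5):
    """Estimate w_t by ridge regression regularized towards X w_bar (the
    projected coarse profile), then pick the candidate with the best score."""
    d = x_hist.shape[1]
    A = x_hist.T @ x_hist + lam * np.eye(d)
    A_inv = np.linalg.inv(A)
    # Solution of min_w sum_i (w^T x_i - C_i)^2 + lam * ||w - X w_bar||^2.
    w_t = A_inv @ (x_hist.T @ c_hist + lam * (X_sub @ w_bar))
    widths = np.sqrt(np.einsum("ij,jk,ik->i", candidates, A_inv, candidates))
    return int(np.argmax(candidates @ w_t + alpha * widths))

X_sub = np.array([[1.0], [1.0]]) / np.sqrt(2)        # 1-dim coarse subspace of R^2
x_hist = np.array([[1.0, 0.0], [0.0, 1.0]])
c_hist = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(cofine_select(X_sub, x_hist, c_hist, np.array([0.5]), candidates))
```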
CoFineUCB was compared to LinUCB (Li et al., 2010a) in the
context of personalized news recommendation. CoFineUCB dramatically
outperforms LinUCB applied to the full data, while its performance
is comparable to LinUCB applied only to the subspace of the data.
Although the experimental results show the benefit of exploring in the
subspace, they also demonstrate that for atypical users, the subspace is
not sufficient to adequately learn their preferences resulting in linear
regret when only the subspace is used.
(Chapelle and Li, 2011) (Section 2.2.1) exploits dependencies among the
arms. Experimental results show that, compared to a traditional UCB
algorithm with independent arms, the proposed approach achieves a
substantially higher cumulative reward.
A similar approach was proposed by Daee et al. (2016), where
feedback from multiple domains is combined, i.e. retrieved items (docu-
ments) and their features (keywords). A probabilistic model is employed
to account for the relationship between the two domains. The approach
assumes that there is a document-keyword matrix where each entry spec-
ifies the likelihood of document ai being generated by keyword kj . This
matrix is generated from the data model that expresses how keywords
and the documents are related, e.g. normalized tf-idf representations of
documents generated from bag of words. The user provides relevance
feedback to both documents and keywords. The unknown parameter
θ defines the shared link between relevance (reward) distributions for
documents and keywords. After the user has interacted with a set of
documents and a set of keywords, the posterior distribution of θ is
updated. Thompson sampling (Chapelle and Li, 2011) is applied to bal-
ance the exploration and exploitation when selecting which documents
and keywords to present to the user.
Wang et al. (2016) describe how to learn the hidden features for
contextual bandit algorithms. Hidden features are explicitly introduced
in the reward generation assumption, in addition to the observable
contextual features. This approach can be particularly useful in col-
laborative filtering solutions based on latent factor models. Through
coupling the observed contextual features and the hidden features, the
reward can be formalized as follows:

$$\rho_{a_t,u} = (x_{a_t}, v_{a_t})^\top (\theta_u^x, \theta_u^v) + \eta_t,$$
where xat and vat are the observed and hidden features of item at , and θux
and θuv are the corresponding bandit parameters, while θu = (θux , θuv ) are
the unknown preference parameters of user u. The algorithm assumes
that the dimension d of hidden features is known to the learner ahead
of time.
Because only $x_{a_t}$ is shown to the user, the residual between the true reward and the user's estimate no longer has a zero mean.
but is never revealed to the algorithm. In the second case, called timed
death, new arms have an expected lifetime of l = 1/p.
The reward function is modelled in two ways. In the state-aware
(deterministic reward) case, the reward is ρit = µit , where µit is the
payoff of arm ai at time t. This provides complete information about
each ad immediately after it is displayed. This approach is implemented
through the DetOpt algorithm, where an arm with the highest payoff
so far (among all the active arms) is played until it expires. At that
point, a random arm is played and its payoff is compared against that
of the remaining active arms to find the best arm.
In the state-oblivious (stochastic reward) case, the reward is a
random variable that is 1 with probability µit and 0 otherwise. This ap-
proach is implemented through a modified version of DetOpt algorithm.
The intuition behind this modification, called Stochastic, is that instead
of pulling an arm once to determine its payoff µi , the algorithm pulls
each arm n times and abandons it unless it looks promising. A variant, called Stochastic with Early Stopping, abandons the arm earlier if its maximum possible future reward does not justify its retention.
Additionally, an epoch-greedy heuristic based on standard MAB is
proposed in Chakrabarti et al. (2009). The heuristic works in two stages:
(1) select a subset of k/c arms uniformly at random from the total k
arms at the beginning of each epoch; (2) operate a standard bandit
algorithm on these until the epoch ends, and repeat. Step 1 reduces the
load on the bandit algorithm, allowing it to explore less and converge
faster, in return for finding an arm that is probably optimal only among
the k/c subset. The constant c and the epoch length balance the speed
of convergence of the bandit algorithm, the arm lifetimes, and the arm
payoff distribution. The value of c is chosen empirically. This heuristic is implemented through an extension of the UCB1 algorithm called UCB1k/c.
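A minimal sketch of the two-stage epoch heuristic, with UCB1 as the inner bandit; the arm pool, epoch length, c and the reward simulation are illustrative assumptions.

```python
import math
import random

def run_epoch(live_arms, payoffs, c, epoch_len):
    """Stage 1: sample k/c of the currently live arms; stage 2: run UCB1 on
    that subset for one epoch and return the accumulated reward."""
    subset = random.sample(live_arms, max(1, len(live_arms) // c))
    means = {a: 0.0 for a in subset}
    counts = {a: 0 for a in subset}
    total = 0.0
    for t in range(1, epoch_len + 1):
        if t <= len(subset):
            arm = subset[t - 1]                    # initialise: play each arm once
        else:
            arm = max(subset, key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < payoffs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total

payoffs = {a: random.random() for a in range(100)}  # illustrative mortal arm pool
print(run_epoch(list(payoffs), payoffs, c=10, epoch_len=500))
```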
The mortal bandit setting requires different performance measures
than the ones used with static bandits. In the static setting, very little
exploration is needed once an optimal arm is identified and so the quality
measure is the total regret over time. In the mortal bandit setting the
algorithm needs to continuously explore newly available arms. What
is then taken into consideration is the long term, steady-state, mean
regret per time step of various solutions. This regret is defined as the
expected payoff of the best currently alive arm minus the payoff actually
obtained by the algorithm.
The UCB1k/c algorithm, Stochastic and Stochastic with Early Stop-
ping were tested using real-life data with the UCB1 algorithm as the
baseline. The test data consisted of approximately 300 real ads be-
longing to a shopping-related category when presented on web pages
classified as belonging to the same category. In terms of regret per time
step, Stochastic with Early Stopping performs best, while the regret of UCB1 remains more or less flat over time.
Komiyama and Qin (2014) formulate a variant of the bandit problem,
where new arms are dynamically added into the candidate set and the
expected reward of each arm decays as the rounds proceed. A time-
decaying MAB goes as follows. At each round t, a system selects one
arm ai from the candidate set At and receives a random reward ρi,t .
The reward information of the other arms is not available. The reward
of arm ai is drawn from a Bernoulli distribution parameterized by ρi,t
which consists of two parts: the sum of a constant part ρi and several
basic decaying functions fk (t − ti ), where t − ti is the number of rounds
since arm ai appears for the first time. The expected reward of arm ai
at round t can be modeled as $\rho_{i,t} = \rho_i + \sum_{k=1}^{n} w_{i,k} f_k(t - t_i)$, where $w_{i,k}$ is the weight associated with the k-th decaying function for arm $a_i$. The reward $\rho_{i,t}$ can be represented as $x_{i,t}^\top \theta_i$, where $x_{i,t}$ is the context vector for arm $a_i$ at time t, containing a constant term and the values of the functions $f_1(t - t_i), \dots, f_n(t - t_i)$, and $\theta_i$ contains the corresponding weights.
The key to solving the time-decaying MAB problem is to effectively
estimate θi , i.e. the weights of individual decaying functions for each
arm. This can be estimated through linear bandits, which perform an
online estimation of linear weights with bandit feedback. The difference
between a general linear bandit and the time-decaying bandit is that
the latter contains multiple linear bandits: each arm can be considered
as an instance of a linear bandit problem whose context consists of
a constant term and a series of temporal functions. At each round,
the time-decaying bandit algorithm constructs, for each arm $a_i$, a matrix $M_{i,t}$ and a vector $C_{i,t}$, which are the sum of the covariance
and the reward-weighted sum of features, respectively. The least squares estimate of the reward at round t, $\hat{\rho}_{i,t}$, is given as $x_{i,t}^\top M_{i,t}^{-1} C_{i,t}$. To guarantee a sufficient amount of exploration, the following confidence bound is introduced: $U_{i,t} = w_t\, \| x_{i,t} \|_{M_{i,t}^{-1}}$, where $\|x\|_M$ denotes the matrix norm $\sqrt{x^\top M x}$.
⁴ See also Kale et al. (2010) for a theoretical analysis of the slate selection problem, where the user can select a subset from K possible actions and receive rewards for the selected actions only. The problem formulation is also inspired by online ad or news selection.
6.7 Summary
Bandit algorithms have also had limited application in image and music
retrieval. PinView (Auer et al., 2010) is a content-based image retrieval
system that exploits implicit relevance feedback during a search session.
The system is based on LinRel (Auer, 2002) – a contextual bandit
algorithm described in more detail in Section 2.2, which allows the
system to learn a similarity metric between images based on the current
interests of the user.
A similar approach to interactive content-based image retrieval was
proposed in Hore et al. (2015) and Konyushkova and Glowacka (2013).
Hierarchical Gaussian Process bandits (Dorard et al., 2009) (Section
6.1.1) are used to capture the similarity between images. Self-Organizing
Maps (SOM) (Kohonen, 1998) of image features are used as layers in the
bandit hierarchy to improve the efficiency of the algorithm called GP-
SOM. SOM provides model vectors that are treated in the algorithm as
discretization of the input space. The algorithm applies a 2-layer bandit setting. First, a model vector is selected and then an image is sampled
from among the images associated with a particular model vector. The
selection is repeated K times to obtain K images to present to the
user. Experimental results show that for small values of K, GP-SOM
outperforms LinRel.
Interactive music recommendation can also be formulated as an
exploration-exploitation trade-off. The approach presented in Wang
et al. (2014) and Xing et al. (2014) focuses on audio content and
novelty. A user’s preference Pu can be represented as a linear function
of music audio content of a song x and parameter vector u representing
user preference of different music features: Pu = ux. The novelty of
a particular song decays immediately after listening to it and then
gradually recovers. The novelty recovers following the function $P_c = 1 - z^{-t/n}$, where t is the time elapsed since the last listening of the song, n is a parameter indicating the recovery speed and z is the decay
parameter. The complete user rating model Pl is a combination of user
preferences Pu and the novelty score Pc . The Bayes-UCB algorithm
The problem arises when trying to find the optimal policy as that
would require testing many bandit algorithms on live traffic. However,
off-line replay (exploration scavenging Langford et al., 2008; Li et al.,
2011a)1 can be adapted to provide an accurate estimator for the per-
formance of ad layout policies using only historical data about the
effectiveness of layouts.
Offline replay describes a class of sample estimators S:
$$\hat{S}(\pi) = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in A} \rho_s(x_t)\, \mathbb{1}[\pi(x_t) = s(x_t)]\, w_{t,a},$$
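A minimal sketch of such a replay-style estimator on logged data, assuming uniformly random logging so that the weights reduce to a simple average over matched events; the log format is an illustrative assumption, and the exact weighting $w_{t,a}$ in the formula above depends on the logging policy.

```python
def replay_estimate(logged_events, policy):
    """Average the logged reward over the events where the evaluated policy
    would have chosen the same layout as the logging policy did."""
    matched, total_reward = 0, 0.0
    for context, logged_action, reward in logged_events:
        if policy(context) == logged_action:
            matched += 1
            total_reward += reward
    return total_reward / matched if matched else 0.0

# Illustrative use: contexts are dicts, actions are layout identifiers.
log = [({"user": "a"}, "layout1", 1.0), ({"user": "b"}, "layout2", 0.0),
       ({"user": "c"}, "layout1", 0.0)]
print(replay_estimate(log, policy=lambda ctx: "layout1"))
```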
7.4 Summary
Appendices
A Algorithms and Methods Abbreviations
B Symbols
Symbol Meaning
A A list of items
ai ith item in list A
Bt Vector with past observations at time t
Ct User feedback/click at time t
DKL (pkq) KL divergence between two random variables with
means p and q
d Number of dimensions
E A ground set of items
γ Position discount
Id d × d identity matrix
K A list of K presented items
k Position in list K
κ Covariance/kernel function
l Position in list L
κk Probability of observing item in position k
L Number of items
λ Regularization parameter
Mt Positive definite matrix with past observations at
time t
Ne (t) Number of counts of item e at time t
Ot Index of item with weight 0
R Regret
rt Regret at time t
ρ Reward
s Number of observed weights
σ Parameter that controls learning rate
Tt (e) Number of times item e is observed in t steps
t Time/number of steps
θ Parameter vector
w Weight
w(a) Weight of item a
wt (a) Weight of item a at time t
U (e) Upper Confidence Bound of item e
Ut Upper Confidence Bound at time t
u User feature vector
v Termination probability
X Matrix with rows of feature vectors of items
x Item feature vector
Y Column vector of observed weights
Zk Observation at position k
References