Control of Sample Complexity and Regret in Bandits Using Fractional Moments
A Project Report
submitted by
ANANDA NARAYAN
Dr.B. Ravindran
Associate Professor
Dept. of Computer Science and Engineering
Dr.R. Aravind
Professor
Dept. of Electrical Engineering
IIT-Madras, 600036
Place: Chennai
Date: 10th of May, 2012
ACKNOWLEDGEMENTS
I am very thankful to my project guide Dr. Ravindran for his prolific encouragement
and guidance. If not for his receptivity towards his students, this work would not
have been possible. I would also like to thank Dr. Andrew, Dr. Aravind and Dr.
Venkatesh for the courses they offered that stood me in good stead during the course
of this project.
I would also like to thank my friends and colleagues for making the 5 years of my stay here most cherished. Their presence unfolded numerous eventful experiences that I will always remember.
Finally, I would like to thank my parents for all their love, affection and teaching. My learning from their life experiences has been instrumental in the success of this endeavour.
Control of Sample Complexity and Regret in
Bandits using Fractional Moments
Ananda Narayan
Abstract
One key facet of learning through reinforcements is the dilemma between exploration to find profitable actions and exploitation to act optimally according to the observations already made. We analyze this explore/exploit situation on Bandit problems in stateless environments. We propose a family of learning algorithms for bandit problems based on fractional expectations of the rewards acquired. The algorithms can be controlled, through a single parameter, to behave optimally with respect to either sample complexity or regret. The family is theoretically shown to contain algorithms that converge on an ε-optimal arm and achieve O(n) sample complexity, a theoretical minimum. The family is also shown to include algorithms that achieve the optimal logarithmic regret O(log(t)) proved by Lai & Robbins (1985), again a theoretical minimum. We also propose a method to tune for a specific algorithm from the family that can achieve this optimal regret. Experimental results support these theoretical findings, and the algorithms perform substantially better than state-of-the-art techniques in regret reduction like UCB-Revisited and other standard approaches.
Contents
1 Introduction
1.4 Applications
2.1 Notation
2.2 Motivation
3 Sample Complexity
4 Regret Analysis
List of Figures
6.1.1 Two variants of the proposed algorithm, Greedy and Probabilistic action selections on Ai, are compared against the SoftMax algorithm
6.2.1 Comparison of FMB with UCB-Rev [Auer & Ortner (2010)] on Bernoulli and Gaussian rewards, respectively
Chapter 1
Introduction
Artificial Intelligence is one of the promising frontiers that the world is looking toward. Moving from mere automation towards demonstrating intelligence requires a significant leap for algorithms. Intelligent agents not only should learn from their experiences, but also need to find out how such experiences can be acquired. They have to act in such a way that the experiences that follow prove noteworthy for improving their current understanding of the system at hand. They have to repeatedly interact with the system to learn decision-making on it. Reinforcement learning is one approach that provides ample scope for such agents. We will now introduce some of the salient aspects of reinforcement learning and discuss some of its applications.
Some of the well known models of machine learning, constituting supervised, unsupervised and semi-supervised approaches, have been extensively used for classification, regression, clustering, finding frequent patterns, and so on. But these approaches do not provide avenues for learning to cycle — that is, to learn decision-making on a system by repeatedly interacting with it. Reinforcement learning is an approach that allows for such trial and error based learning.
Figure 1.1.1: Reinforcement Learning framework

In the reinforcement learning framework (figure 1.1.1), an agent perceives states of its environment to perform a decision on a set of actions it can take on the environment. In deciding and taking an action, the agent receives feedback about how beneficial the action is for the corresponding state of the environment. This feedback is regarded as a 'Reward' signal, usually stochastic in nature. The higher the reward, the better the action selected was for the perceived state of the environment. And in taking an action, the environment changes its state, which is again usually stochastic. The learning task is to find a mapping from states to actions. The optimal mapping may be stochastic or deterministic depending on the stochasticity in the rewards and the state transitions on taking actions. In a nutshell, reinforcement learning enables an agent to perform decision-making on a system (environment), perceiving its state and the rewards it provides.
Example 1. Consider a stochastic maze environment and an agent that tries to solve it, obtaining rewards at a few destinations, as in figure 1.1.2 (source: Dayan & Abbott (2001)). When at any of the squares in the maze (state), the agent has a set of actions to choose from that is always a subset of the set S = {forward, backward, right, left, idle}. The particular set of actions available depends on the state of the agent in the environment. Taking one of the actions may not necessarily make the agent move in that particular direction. The environment is stochastic in changing the agent's state (as indicated in the figure), and the agent may move in any direction, or may remain idle, on taking a particular action, say moving forward. After an action is taken, the agent receives a reward that pertains to the triple containing the past state of the agent, the action taken by the agent, and the future state where the agent ended up due to the action taken. This reward is the numerical value indicated in the figure for the destination states and is zero for all other states. Once one of the destination states is reached, the maze is solved (with a finishing reward that the destination state provides) and learning stops. Note that the reward for selecting a path towards the optimal destination is only available at the end of the sequence of action selections. Hence, one needs to attribute every reward not only to the last action taken but to the entire sequence of actions that led to this reward. So the reward for selecting an action can be delayed. By performing these action selections, the task is to learn a mapping of states to actions so as to reach the destination states with the highest numerical reward.

Figure 1.1.2: A stochastic maze environment (source: Dayan & Abbott (2001))
1.2 Multi-arm Bandits
A multi-arm bandit problem requires finding optimal learning strategies for a simplified version of the reinforcement learning agent described above. The simplification is in the state space. In the multi-arm bandit problem, the environment is stateless, or can be said to be in the same state at all times. This reduces the complexity in learning by a large amount due to the absence of state transitions of the environment and the absence of reward dependence on state-action associations. So the rewards are only a function of the action taken, and there is a set of actions from which an optimal action needs to be selected.
The multi-arm bandit takes its name from a slot machine (a gambling machine that rotates reels when a lever or arm is pulled). A traditional slot machine is a one-arm bandit, and a multi-arm bandit has multiple levers or arms that can be pulled. Thus, there is a set of levers to choose from at all times, and the task is to find the lever that would give us the maximum return (or reward).
Two main objectives have been defined and analyzed for the multi-arm bandit problem in the literature. One of them relates to reducing the regret (or loss) incurred while trying to learn the best arm. The other relates to reducing the number of samples (or pulls of arms) to be drawn to find an arm close to the optimal arm with high probability. We define these quantifications of optimality formally below.
1.3.1.1 Regret Reduction
The traditional objective of the bandit problem is to maximize the total reward (or gain) given a specified number of pulls to be made, hence minimizing regret (or loss). Recall that an arm with mean equal to µ∗ is considered the best arm, as it is expected to give the maximum reward per pull. Let Zt be the total reward acquired over t successive pulls. Then the regret η is defined as the expected loss in total reward relative to repeatedly choosing the best arm right from the first pull, that is, η(t) = tµ∗ − E[Zt]. This quantification is also sometimes referred to as cumulative regret.
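As a concrete illustration of this definition, the following minimal Python sketch (with hypothetical arm means and a uniformly random policy, purely for illustration) estimates the cumulative regret of a simulated run:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.9, 0.7, 0.5])   # hypothetical true arm means; the best arm has mu* = 0.9
t = 1000

# A uniformly random policy, used only to illustrate eta(t) = t*mu* - E[Z_t];
# Z_t below is a single-run estimate of the total acquired reward.
choices = rng.integers(0, len(mu), size=t)
Z_t = sum(rng.normal(mu[a], 1.0) for a in choices)

print(t * mu.max() - Z_t)        # empirical cumulative regret of this run
```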
We will now introduce a Probably Approximately Correct (PAC) framework and the associated sample complexity, and an optimality criterion based on sample complexity reduction. We start with a definition for ε-optimality: an arm with mean µ is said to be ε-optimal if µ ≥ µ∗ − ε.
Definition 3. (Even-Dar et al. (2006)) An algorithm is an (ε, δ)-PAC algorithm for the multi-arm bandit with sample complexity T, if it outputs an ε-optimal arm with probability at least 1 − δ when it terminates, and the number of time steps the algorithm performs until it terminates is bounded by T.
The number of time steps is also known as the number of samples (pulls of arms).
Hence, the objective is then to reduce T , often referred to as sample complexity.
1.4 Applications
The multi-arm bandit problem has numerous applications in decision theory, evolutionary programming, reinforcement learning, search, recommender systems, and even in a few interdisciplinary areas like industrial engineering and simulation. We now describe a very brief collection of examples of such applications.
Figure 1.4.1: Yahoo Frontpage Today Module
Figure 1.4.1 (source: www.yahoo.com) shows the home page of the popular web service provider Yahoo!, featuring its Frontpage Today Module. This module uses a multi-arm bandit formulation to choose the top four articles to feature, those frequently viewed by users who visit the page. A set of editor-picked stories (sometimes called a story-bucket) is made available as the arms to choose from, and a bandit learner finds the top four optimal arms to display. This learning happens continuously, in an online manner, with the story-bucket changing dynamically as and when more relevant news articles arrive. The algorithm we will propose in chapter 2 is compatible with all the requirements mentioned above. For matching news articles to user interests, demography, etc., a set of input features specific to the user is often used in finding the best news stories to display. This introduces a formulation often called Contextual Bandits, owing to the use of contextual information in finding the best arm.
Figure 1.4.2 (source: www.google.com) depicts a typical set of ads chosen for a specific search query by Google. This problem is often analyzed using a multi-arm bandit formulation, with millions of ads relating to the set of arms to choose from. Contextual information specific to the search query at hand is of course used as well. This is one of the best examples of multi-arm bandit applications where sample complexity reduction is very important, since the number of arms (or ads, in this case) involved is very high.

Figure 1.4.2: Google Ads
A set of recommended films at the online movie database IMDB is shown in figure 1.4.3. This problem of selecting recommended items pertaining to some query is extensively researched under the title 'Recommender Systems'. Multi-arm bandit formulations are widely seen in recommender systems as well. Contextual information from the film names and categories can further improve the recommendations.
The popular online shopping store Amazon, on the other hand, selects recommended articles through previous user behavior, as seen in figure 1.4.4. Does this amount to a deterministic solution, with no need for any explore-exploit trade-off like in multi-arm bandits? No. The set of article buckets frequented by users is usually large, and the selection of the top three recommended articles does need to be learnt. Thus, the multi-arm bandit problem relates to applications that have far-reaching significance and value additions.
The multi-arm bandit problem has applications in many fields, including industrial engineering, simulation and evolutionary computation. For instance, Kim & Nelson (2001) discuss applications with regard to ranking, selection and multiple comparison procedures in statistics, and Schmidt et al. (2006) detail applications in evolutionary computation.
The traditional objective of the bandit problem is to maximize the total reward given a specified number of pulls, l, to be made, hence minimizing regret. Lai & Robbins (1985) showed that the regret must grow at least logarithmically and provided policies that attain this lower bound for specific probability distributions. Agrawal (1995) provided policies achieving the logarithmic bounds incorporating sample means that are computationally more efficient, and Auer et al. (2005) described policies that achieve the bounds uniformly over time rather than only asymptotically. Meanwhile, Even-Dar et al. (2006) provided another quantification of the objective, measuring quickness in determining the best arm. A Probably Approximately Correct (PAC) framework is incorporated, quantifying the identification of an ε-optimal arm with probability 1 − δ. The objective is then to minimize the sample complexity l, the number of samples required for such an arm identification. Even-Dar et al. (2006) describe the Median Elimination Algorithm achieving O(n) sample complexity. This is further extended to finding m arms that are ε-optimal with high probability by Kalyanakrishnan & Stone (2010).
We propose a family of learning algorithms for the Bandit problem and show that the family contains not only algorithms attaining O(n) sample complexity but also algorithms that achieve the theoretically lowest regret of O(log(t)). In addition, the algorithm presented allows learning to be controlled to reduce either sample complexity or regret, through a single parameter that it takes as input. To the best of our knowledge, this is the first work to introduce control on sample complexity or regret while learning. We address the n-arm bandit problem with generic probability distributions, without any restrictions on the means, variances or any higher moments that the distributions may possess. Experiments show the proposed algorithms perform substantially better compared to state-of-the-art algorithms like the Median Elimination Algorithm [Even-Dar et al. (2006)] and UCB-Revisited [Auer & Ortner (2010)]. While Even-Dar et al. (2006) and Auer & Ortner (2010) provide algorithms that are not parameterized for tunability, we propose a single-parameter algorithm that is based on fractional moments of the rewards acquired: the ith moment of a random variable R is defined as E[R^i], and fractional moments occur when the exponent i is fractional (rational or irrational). To the best of our knowledge, ours is the first work (a part of which appeared recently in UAI 2011 [Narayan & Ravindran (2011)]) to use fractional moments in bandit problems, although we have recently discovered their usage in other contexts in the literature. Min & Chrysostomos (1993) describe applications of such fractional (low order) moments in signal processing. They have been employed in many areas of signal processing, including image processing [Achim et al. (2005)] and communication systems [Xinyu & Nikias (1996); Liu & Mendel (2001)], mainly with regard to achieving more stable estimators. We theoretically show that the proposed algorithms can be controlled to attain O(n) sample complexity in finding an ε-optimal arm, or to incur the optimal logarithmic regret, O(log(t)). We also discuss the controllability of the algorithm. Experiments support these theoretical findings, showing that the algorithms incur substantially low regrets while learning. A brief overview of what is presented: chapter 2 describes the motivation for the proposed algorithms, followed by theoretical analysis of optimality and sample complexity in chapter 3. Then, in chapter 4, we prove that the proposed algorithms attain the theoretical lower bound for regret shown by Lai & Robbins (1985). We then provide a detailed analysis of the algorithms presented with regard to their properties, performance and tunability in chapter 5. Finally, experimental results and observations are presented in chapter 6, followed by conclusions and scope for future work.
Chapter 2
2.1 Notation
Consider a bandit problem with n arms, with action ai denoting the choice of pulling the ith arm. An experiment involves a finite number of arm-pulls in succession. An event of selecting action ai results in a reward that is sampled from a random variable Ri having a bounded support. In a particular such experiment, consider the event of selecting action ai for the kth time. Let ri,k be a sample of the reward acquired in this selection.
2.2 Motivation
When deciding about selecting action ai over any other action aj, we are concerned about the rewards we would receive. Though estimates of E(Ri) and E(Rj) are indicative of the rewards for the respective actions, estimates of the variances E[(Ri − µi)²] and E[(Rj − µj)²] would give more information in the beginning stages of learning, when the confidence in the estimates of the expectations is low. In other words, we would not yet have explored enough for the estimated expectations to reflect the true means.
Though the mean and variance together provide full knowledge of the stochasticity for some distributions, for instance the Gaussian, we want to handle generic distributions and hence need to consider additional higher moments. It is the distribution, after all, that gives rise to all the moments completely specifying the random variable. Thus we look at a generic probability distribution to model our policy to pull arms.
Consider the selection of action ai over action aj. The probability that action ai yields a higher reward than action aj is P(Ri > Rj), and to compute its estimate we propose the following discrete approximation: after selecting actions ai, aj for mi, mj times respectively, we can, in general, compute the probability estimate
$$\hat{P}(R_i > R_j) = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} \hat{P}(R_j = r_{j,k'}),$$
$$N_i = \{r_{i,k} : 1 \le k \le m_i\}, \qquad L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\} \qquad (2.2.1)$$
with random estimates $\hat{P}(R_i = r_{i,k})$ of the true probability $P(R_i = r_{i,k})$ calculated by
$$\hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i}. \qquad (2.2.2)$$
Note that thus far we are only taking into account the probability, ignoring the magnitude of the rewards. That is, if we have two instances of rewards, $r_{j,l}$ and $r_{m,n}$, with $P(R_j = r_{j,l}) = P(R_m = r_{m,n})$, then they contribute equally to the probabilities $P(R_i > R_j)$ and $P(R_i > R_m)$, though one of the rewards could be much larger than the other. For fair action selections we need a function monotonically increasing in the rewards, and so we formulate the preference function for action ai over action aj as
$$A_{ij} = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{P}(R_j = r_{j,k'}) \qquad (2.2.3)$$
where β determines how far we want to distinguish the preference functions with regard to the magnitude of the rewards. This way, for instance, arms constituting a higher variance are given more preference. For β = 1, we would have $A_{ij}$ proportional to $E\big[(R_i - R_j)\,\mathbb{1}\{R_i > R_j\}\big]$ when the estimates $\hat{P}$ approach the true probabilities. See that we have started with a simple choice for the monotonic function, the polynomial function, without restricting the exponent. Now, to find among a set of arms one single arm to pull, we need a quantification to compare different arms. With the introduction of the polynomial dependence on the reward magnitudes, the $A_{ij}$'s are no longer probabilities. To find a quantification on which to base the arm pulls, we propose the preference function for choosing ai over all other actions to be
$$A_i = \prod_{j \ne i} A_{ij}. \qquad (2.2.4)$$
For an n-arm bandit problem, our proposed class of algorithms first chooses each arm l times, in what is called an initiation phase. After this initiation phase, at all further action selections, where arm i has been chosen mi times, finding a reward ri,k when choosing the ith arm for the kth time, the class of algorithms computes the sets
$$N_i = \{r_{i,k} : 1 \le k \le m_i\}, \qquad L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\}$$
and, using
$$\hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i},$$
computes the indices
$$A_i = \prod_{j \ne i} A_{ij}, \qquad A_{ij} = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{P}(R_j = r_{j,k'})$$
to further choose arms.
Algorithm 2.1 Fractional Moments on Bandits (FMB)
Input: l, β
Initialization: Choose each arm l times
Define: ri,k, the reward acquired for the kth selection of arm i; mi, the number of selections made for arm i; and the sets $N_i = \{r_{i,k} : 1 \le k \le m_i\}$ and $L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\}$
Loop:
1. $\hat{p}_{ik} = \hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i}$ for 1 ≤ i ≤ n, 1 ≤ k ≤ mi
2. $A_{ij} = \sum_{r_{i,k} \in N_i} \hat{p}_{ik} \left\{ \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{p}_{jk'} \right\}$ for 1 ≤ i, j ≤ n, i ≠ j
3. $A_i = \prod_{j \ne i} A_{ij}$ for all 1 ≤ i ≤ n
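The following is a minimal Python sketch of the FMB loop as listed above. The helper names are ours, the reward histories are kept explicitly as the algorithm requires, and the final greedy selection on the indices Ai reflects the greedy variant of FMB discussed later; treat it as an illustration rather than a reference implementation.

```python
import numpy as np
from collections import Counter

def fmb_indices(rewards, beta):
    """Compute the indices A_i of Algorithm 2.1 from per-arm reward histories."""
    n = len(rewards)
    # Step 1: empirical pmf of each arm, p_hat(R_i = r) = |{k' : r_{i,k'} = r}| / m_i
    pmfs = [{r: c / len(obs) for r, c in Counter(obs).items()} for obs in rewards]
    A = np.ones(n)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            # Step 2: A_ij = sum_{r_i} p_i(r_i) * sum_{r_j < r_i} (r_i - r_j)^beta * p_j(r_j)
            a_ij = sum(p_i * sum((r_i - r_j) ** beta * p_j
                                 for r_j, p_j in pmfs[j].items() if r_j < r_i)
                       for r_i, p_i in pmfs[i].items())
            A[i] *= a_ij          # Step 3: A_i = prod_{j != i} A_ij
    return A

def fmb(pull, n_arms, horizon, l=1, beta=0.85):
    """Run FMB with greedy selection on A_i; pull(i) samples a reward for arm i."""
    rewards = [[pull(i) for _ in range(l)] for i in range(n_arms)]   # initiation phase
    for _ in range(horizon - n_arms * l):
        i = int(np.argmax(fmb_indices(rewards, beta)))
        rewards[i].append(pull(i))
    return rewards
```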
The family (of algorithms FMB, for various values of l and β) is shown to contain algorithms that achieve a sample complexity of O(n) in chapter 3. Algorithm 2.1 is also shown to incur the asymptotic theoretical lower bound for regret in chapter 4. Further, the algorithm is analyzed in detail in chapter 5. (Note that FMB, pFMB, and any other algorithm that uses Ai for action selection constitute a class of algorithms; FMB for varying input parameters constitutes a family of algorithms.)
Chapter 3
Sample Complexity
Our problem formulation considers a set of n bandit arms, with Ri representing the random reward on pulling arm i. We assume that the reward is binary, Ri ∈ {0, ri} for all i. Since the proposed algorithm assumes a discrete approximation for the reward probability distributions, an analysis of the Bernoulli formulation extends readily. This is true even when the true reward distributions are continuous, as the algorithm always functions on the discrete approximation of the rewards it has already experienced. Also, denote pi = P{Ri = ri} and µi = E[Ri] = ri pi, and let us assume for simplicity (constraint 3.1.1) that arm 1 is the optimal arm, µ1 = µ∗ = maxi µi. Define
$$I_{ij} = \begin{cases} 1 & \text{if } r_i > r_j \\ 0 & \text{otherwise} \end{cases} \qquad\text{and}\qquad \delta_{ij} = \begin{cases} \dfrac{r_j}{r_i} & \text{if } r_i > r_j \\ 1 & \text{otherwise.} \end{cases}$$
From (2.2.3), (2.2.2) and (2.2.1), we then obtain the expression for Ai under this Bernoulli formulation, referred to below as (3.1.2).
We now discuss Chernoff bounds and their extension to dependent random variables before the analysis of the proposed algorithm.
We start with a brief introduction to the Chernoff–Hoeffding bounds on a set of independent random variables. Let X1, ..., Xn be independent random variables with common range [0, 1], and let Sn = (X1 + ... + Xn)/n with mean µ = E[Sn]. Then, for any a > 0,
$$P\{S_n \ge \mu + a\} \le \exp(-2a^2 n), \qquad P\{S_n \le \mu - a\} \le \exp(-2a^2 n).$$
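A quick, self-contained Python check of the above bound (parameters chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, mu, trials = 50, 0.1, 0.5, 100_000

# S_n is the mean of n independent Bernoulli(mu) samples, which lie in [0, 1]
S = rng.binomial(1, mu, size=(trials, n)).mean(axis=1)

empirical_tail = np.mean(S >= mu + a)
hoeffding_bound = np.exp(-2 * a**2 * n)
print(empirical_tail, hoeffding_bound)   # the empirical tail stays below exp(-2*a^2*n)
```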
Chernoff–Hoeffding bounds on a set of dependent random variables are discussed here for the analysis of the proposed algorithm. We start by introducing the chromatic and fractional chromatic numbers of graphs. Clearly, χ′(G) ≤ χ(G). We now look at Chernoff–Hoeffding bounds when the random variables involved are dependent.
A function f assigns a unique index to every element in S; for each e ∈ S, define the random variable Ui with i = f(e), so that for e1, e2 ∈ S, Ui1 and Ui2 are dependent if and only if at least one dimension of e1 − e2 is null. Construct a graph G(V, E) on V = {Ui : 1 ≤ i ≤ q} with edges connecting every pair of dependent vertices.
Remark 8. Consider n arms where the kth arm is sampled mk times. Let $X_u^{(k)}$ be a random variable, with mean $\mu^{(k)}$, related to the reward acquired on the uth pull of the kth arm. Then, for $\{m_k : 1 \le k \le n\}$, $X_1^{(k)}, X_2^{(k)}, \ldots, X_{m_k}^{(k)}$ are random variables (with common range within [0, 1], say) such that $E[X_u^{(k)} \mid X_1^{(k)}, \ldots, X_{u-1}^{(k)}] = \mu^{(k)}$ for all $k, u : 1 \le u \le m_k$, and $X_u^{(i)}$ is independent of $X_v^{(j)}$ for $i \ne j$. Define the random variable $S_{m_k}^{(k)} = \frac{X_1^{(k)} + \cdots + X_{m_k}^{(k)}}{m_k}$, capturing the estimate of $X^{(k)}$, and define the random variable $T = \prod_k S_{m_k}^{(k)}$,
and consequently, on expanding this product, we find
$$T = \frac{\sum_{i=1}^{q} U_i}{q},$$
$$P\{T \ge \mu_T + a\} \le \exp\left(\frac{-2a^2 q}{\chi'(G)}\right) \le \exp\left(\frac{-2a^2 q}{\chi(G)}\right),$$
$$P\{T \le \mu_T - a\} \le \exp\left(\frac{-2a^2 q}{\chi'(G)}\right) \le \exp\left(\frac{-2a^2 q}{\chi(G)}\right),$$
with mean $\mu_T = E[T] = \prod_k \mu^{(k)}$, and $\chi(G)$, $\chi'(G)$ the chromatic and fractional chromatic numbers of the associated graph G.
Theorem 10. For an n-arm bandit problem and given ε and δ, algorithm FMB is an (ε, δ)-PAC algorithm, incurring a sample complexity of
$$O\left(n\left(\frac{1}{\mu_T^4}\ln^2\frac{1}{\delta}\right)^{\frac{1}{n}}\right)$$
where
$$\mu_T = \min_{i:\ \mu_i < \mu^* - \epsilon} \big|E[A_i - A^*]\big|$$
with Ai defined in (2.2.4), (2.2.3), (2.2.2) and (2.2.1), and µ∗, A∗ being the values that correspond to the optimal arm.
Following constraint 3.1.1 for simplicity, we will now prove the above theorem. To
start with, we will state a few lemmas that would be helpful in proving theorem 10.
Lemma 11. For a graph G(V, E) constructed on vertices Ui pertaining to Definition 7, the chromatic number is bounded as
$$\chi(G) \le 1 + \sqrt{n\left(q - \prod_k (m_k - 1)\right)}.$$
Proof. See that |V| = q, and each vertex is connected to $q - \prod_k (m_k - 1)$ other vertices. Then the total number of edges in the graph becomes $|E| = \frac{1}{2}\, n \left(q - \prod_k (m_k - 1)\right)$. The number of ways a combination of two colors can be picked is $\binom{\chi(G)}{2} = \frac{1}{2}\chi(G)(\chi(G) - 1)$. For an optimal coloring, there must exist at least one edge connecting any such pair of colors picked, and we have
$$\frac{1}{2}\chi(G)(\chi(G) - 1) \le |E|,$$
$$\chi(G)(\chi(G) - 1) \le n\left(q - \prod_k (m_k - 1)\right).$$
Next, we state a lemma on the boundedness of a function which we will come across
later in proving theorem 10.
Lemma 12. The function $g(n) = (n \ln^2 n)^{\frac{1}{n}}$ is decreasing for all n greater than some $n_0 \in \mathbb{N}$, with $\lim_{n\to\infty} g(n) = 1$.
Proof. With
$$g(n) = (n \ln^2 n)^{\frac{1}{n}},$$
$$\ln g(n) = \frac{\ln(n \ln^2 n)}{n} = \frac{\ln n + 2 \ln\ln n}{n}$$
$$\implies \frac{1}{g(n)}\, g'(n) = \frac{1 + \frac{2}{\ln n} - (\ln n + 2\ln\ln n)}{n^2}.$$
Thus, g is a decreasing function for $n > n_0 \in \mathbb{N}$, with the limit at infinity governed by
$$\lim_{n\to\infty} \ln(g(n)) = \lim_{n\to\infty} \frac{\ln n + 2\ln\ln n}{n} = \lim_{n\to\infty} \frac{\frac{\partial}{\partial n}(\ln n + 2\ln\ln n)}{\frac{\partial}{\partial n} n} = \lim_{n\to\infty}\left(\frac{1}{n} + \frac{2}{n \ln n}\right) = 0$$
$$\implies \lim_{n\to\infty} g(n) = 1.$$
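A quick numeric illustration of Lemma 12 in Python (values of n chosen arbitrarily):

```python
import numpy as np

n = np.array([10.0, 100.0, 1_000.0, 10_000.0, 100_000.0])
g = (n * np.log(n) ** 2) ** (1.0 / n)   # g(n) = (n ln^2 n)^(1/n)
print(g)   # decreases towards 1, so n * g(n) grows only linearly in n
```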
Now we prove theorem 10 with the help of the lemmas defined above.
Proof of theorem 10. Let i ≠ 1 be an arm which is not ε-optimal, µi < µ1 − ε. To prove that FMB is an (ε, δ)-PAC algorithm, we need to bound, for all i, the probability of selecting a non-ε-optimal arm, $P(A_i > A_j : \mu_i < \mu_1 - \epsilon,\ \forall j)$, which is in turn bounded as
$$P(A_i > A_j : \mu_i < \mu_1 - \epsilon,\ \forall j) \le P(A_i > A_1 : \mu_i < \mu_1 - \epsilon)$$
since arm 1 is optimal. In other words, we need to bound the probability of the event $A_i > A_1$ to restrict the policy from choosing the non-ε-optimal arm i, with Ai given by (3.1.2). See that Ai is of the form $\prod_k S_{m_k}^{(k)}$ in remark 8. Defining the random variable $T_i = A_i - A_1$ and expanding the products, we have
$$T_i = \frac{\sum_{k=1}^{q} U_k}{q}$$
where the Uk have the properties described in definition 7, with its associated graph G(V, E). So we have a sum of $q = \prod_j m_j$ dependent random variables Uk, with true mean $\mu_{T_i} = E[T_i] = E[A_i] - E[A_1]$. Applying corollary 9 to Ti to bound the probability of the event $T_i \ge 0$, we have, using the modified Chernoff–Hoeffding bounds,
the event Ti ≥ 0, we have using the modified Chernoff-Hoeffding Bounds,
−µ2T ln/2
δ
exp √ =
n n
n1
n 2 n
=⇒ l = ln (3.3.2)
µ4T δ
where
1
O(n(n ln2 n) n ) = O(n).
Hence FMB finds ε-optimal arms with probability 1 − δ, incurring a sample complexity of
$$O\left(n\left(\frac{1}{\mu_T^4}\ln^2\frac{1}{\delta}\right)^{\frac{1}{n}}\right).$$

[Figure: Sample Complexity per Arm — g(n) plotted against n.]
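For illustration, a small Python helper evaluating the per-arm initiation length l of (3.3.2) for hypothetical values of n, µT and δ (these inputs are not from the thesis; they are made up to show the shape of the dependence):

```python
import numpy as np

def fmb_initiation_length(n, mu_T, delta):
    """Per-arm initiation length from (3.3.2): l = (n / mu_T^4 * ln^2(n / delta))^(1/n)."""
    return (n / mu_T**4 * np.log(n / delta) ** 2) ** (1.0 / n)

# hypothetical problem sizes, purely for illustration
for n in (5, 10, 50):
    print(n, fmb_initiation_length(n, mu_T=0.5, delta=0.05))
```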
Chapter 4
Regret Analysis
In this chapter, we show that FMB (Algorithm 2.1) incurs a regret of O(log(t)), a
theoretical lower bound shown by Lai & Robbins (1985). We will start with the
definition of regret and then state and prove the regret bounds of FMB.
Define the random variable Fi(t) to be the number of times arm i was chosen in t plays. The expected regret η is then given by
$$\eta(t) = \big(t - E[F_1(t)]\big)\mu_1 - \sum_{i \ne 1} \mu_i E[F_i(t)]$$
which is
$$\eta(t) = \mu_1 \sum_{i \ne 1} E[F_i(t)] - \sum_{i \ne 1} \mu_i E[F_i(t)] = \sum_{i \ne 1} (\mu_1 - \mu_i) E[F_i(t)] = \sum_{i \ne 1} \Delta_i E[F_i(t)], \qquad (4.1.1)$$
say, where $\Delta_i = \mu_1 - \mu_i$.
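The two forms of (4.1.1) can be checked numerically; the sketch below uses hypothetical means and expected pull counts:

```python
import numpy as np

mu = np.array([0.9, 0.7, 0.5])            # hypothetical true means; arm 1 (index 0) is optimal
counts = np.array([800.0, 150.0, 50.0])   # hypothetical E[F_i(t)] after t = 1000 plays
t = counts.sum()

eta_direct = t * mu[0] - np.dot(mu, counts)    # t*mu_1 - sum_i mu_i E[F_i(t)]
eta_gaps = np.dot(mu[0] - mu[1:], counts[1:])  # sum_{i != 1} Delta_i E[F_i(t)]
print(eta_direct, eta_gaps)                    # both equal 50.0
```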
4.2 Bounds on Regret
We will now state and prove the boundedness of regret incurred by FMB.
Theorem 13. The total expected regret η(t), for t trials in an n-arm bandit problem, incurred by algorithm FMB is upper bounded by
$$\sum_{i \ne 1} \Delta_i\, O\big(l + \log(t)\big).$$
To prove theorem 13, we will now state a few lemmas that would be helpful. To start
with, we restate some of the definitions that were used in chapter 3 for convenience.
Definition 14. For an n-arm bandit problem where arm i has been chosen mi times (including the choice of each arm l times in the initiation phase), finding a reward ri,k when choosing the ith arm for the kth time, FMB computes the sets
$$N_i = \{r_{i,k} : 1 \le k \le m_i\}, \qquad L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\}$$
and, using
$$\hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i},$$
computes the indices
$$A_i = \prod_{j \ne i} A_{ij}, \qquad A_{ij} = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{P}(R_j = r_{j,k'})$$
to further choose arms. While doing this to find an ε-optimal arm with probability 1 − δ, FMB incurs a sample complexity of
$$O\left(n\left(\frac{1}{\mu_T^4}\ln^2\frac{1}{\delta}\right)^{\frac{1}{n}}\right)$$
where
$$\mu_T = \min_{i:\ \mu_i < \mu^* - \epsilon} |\mu_{T_i}|, \qquad \mu_{T_i} = E[T_i], \qquad T_i = A_i - A^*$$
with µ∗ and A∗ being the values that correspond to the optimal arm.
Lemma 16. Given the event Fi(t − 1) = y, the probability that FMB selects arm i at the tth play is bounded as
$$P\{\text{arm } i \text{ selected at } t \mid F_i(t-1) = y\} \le \exp\left(\frac{-\mu_{T_i}^2 \sqrt{y\,\big(t - y - 1 - l(n-2)\big)\, l^{\,n-2}}}{\sqrt{n}}\right).$$
Given any partial information about Fi(τ) for some τ < t and some i, this probability can be further bounded. Given that the event Fi(t − 1) = y is known to have occurred, this bound improves, and so we further bound as follows.
Now, the upper bound of $P(T_i(t) \ge 0)$ would be maximum when $\prod_k m_k$ is minimum. This is observed from (3.3.1) as
$$P(T_i(t) \ge 0) \le \exp\left(\frac{-2\mu_{T_i}^2\, q}{1 + \sqrt{n\left(q - \prod_k (m_k(t) - 1)\right)}}\right) \le \exp\left(\frac{-2\mu_{T_i}^2\, q}{1 + \sqrt{nq}}\right) \le \exp\left(\frac{-\mu_{T_i}^2 \sqrt{q}}{\sqrt{n}}\right) = \exp\left(\frac{-\mu_{T_i}^2 \sqrt{\prod_k m_k(t)}}{\sqrt{n}}\right). \qquad (4.2.4)$$
The minimum that $\prod_k m_k(t)$ would take for a given $\sum_k m_k(t) = t - 1$ is when the mi(t) are most unevenly distributed. When each arm is chosen l times in the initiation phase of the algorithm, we would have
$$\prod_k m_k(t) \ge y\,\big(t - 1 - l(n-2) - y\big)\, l^{\,n-2} \qquad (4.2.5)$$
which occurs when all the instances of arm choices (other than mi(t) = y and the initiation phase) correspond to a single particular arm other than i, say j. From (4.2.2), (4.2.3), (4.2.4) and (4.2.5), we would have the bound stated in lemma 16.
Lemma 17. For a, b ∈ ℕ, the sum of square roots of consecutive numbers is lower bounded as
$$\sum_{h=a}^{b} \sqrt{h} \ \ge\ \frac{2}{3}\left(b^{3/2} - (a-1)^{3/2}\right).$$
Proof.
$$\sum_{h=a}^{b} \sqrt{h} \ \ge\ \int_{a-1}^{b} \sqrt{\upsilon}\, d\upsilon \ =\ \frac{2}{3}\left(b^{3/2} - (a-1)^{3/2}\right).$$
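A one-line numerical check of Lemma 17 in Python, with arbitrary a and b:

```python
import math

a, b = 5, 100
lhs = sum(math.sqrt(h) for h in range(a, b + 1))
rhs = (2.0 / 3.0) * (b**1.5 - (a - 1)**1.5)
print(lhs >= rhs, round(lhs, 2), round(rhs, 2))   # True: the sum dominates the integral bound
```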
Proof of Theorem 13. From (4.1.1), we see that to bound the expected regret is to bound E[Fi(t)] for all i ≠ 1. The event {arm i selected at t} is $\{A_i(t) > A_j(t)\ \forall j\}$, of course with j ≠ i. Consider the expansion of E[Fi(t)] obtained by conditioning on Fi(t − 1), labelled (4.2.6), where the first step uses the law of total (or iterated) expectation. Choose ε such that there exists only one ε-optimal arm; then the summation in (4.1.1) runs over only non-ε-optimal arms, and for all such i we can use the bounds on $P(T_i(t) \ge 0 \mid m_i(t))$ from corollary 15 and lemma 16. Hence, bounding the probability of the event {arm i selected at t | Fi(t − 1) = y}, (4.2.6) turns into
$$E[F_i(t)] \le E\left[(y+1)\exp\left(\frac{-\mu_{T_i}^2\sqrt{y\,(t - y - 1 - l(n-2))\,l^{\,n-2}}}{\sqrt{n}}\right) + y\right].$$
Denoting $\frac{\mu_{T_i}^2}{\sqrt{n}}$ by the positive constant $\lambda_i$ and using E[y] = E[Fi(t − 1)], we have the relation
$$E[F_i(t)] \le E[F_i(t-1)] + E\left[(y+1)\exp\left(-\lambda_i \sqrt{y\,(t - y - 1 - l(n-2))\,l^{\,n-2}}\right)\right].$$
Unrolling this recursion from the end of the initiation phase,
$$E[F_i(t)] \le E[F_i(nl)] + \sum_{w=nl+1}^{t} E\left[(y_w+1)\exp\left(-\lambda_i \sqrt{y_w\,(w - y_w - 1 - l(n-2))\,l^{\,n-2}}\right)\right] \le l + \sum_{w=nl+1}^{t} E\left[(y_w+1)\exp\left(-\lambda_i \sqrt{y_w\,(w - y_w - 1 - l(n-2))\,l^{\,n-2}}\right)\right]$$
where yw corresponds to the number of times the arm under consideration was chosen in w − 1 plays. Defining
$$G_i(w) = E\left[(y_w+1)\exp\left(-\lambda_i \sqrt{y_w\,(w - y_w - 1 - l(n-2))\,l^{\,n-2}}\right)\right],$$
we have
$$\eta(t) \le \sum_{i \ne 1} \Delta_i\left(l + \sum_{w=nl+1}^{t} G_i(w)\right). \qquad (4.2.7)$$
Now,
$$G_i(w) = \sum_{d=l}^{\,w-1-l(n-1)} \Big[(d+1)\exp\left(-\lambda_i \sqrt{d\,(w - d - 1 - l(n-2))\,l^{\,n-2}}\right)\; P\big\{\|\{\tau : A_i(\tau) > A_j(\tau)\ \forall j,\ nl+1 \le \tau \le w-1\}\| = d - l\big\}\Big] \qquad (4.2.8)$$
where the probability is that which corresponds to arm i being selected d − l times between the (nl + 1)th and (w − 1)th plays, inclusive of the ends, with ‖S‖ denoting the cardinality of the set S. Denoting the plays (or epochs) in which arm i could have been selected by the indicator variable I(τ), we have
$$I(\tau) = \begin{cases} 1 & A_i(\tau) > A_j(\tau)\ \forall j \\ 0 & \text{otherwise.} \end{cases}$$
Defining $I_{max}$ as the assignment of the indicators $I(\tau)$ that maximizes this probability (4.2.9), with
$$P_{max} = \prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \prod_{\tau : I_{max}(\tau) = 0} P\{A_i(\tau) < A_j(\tau)\ \text{for some } j \mid I_{max}\},$$
$$P_{max} \le \prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \qquad (4.2.10)$$
follows. Also,
$$\prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \le \prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_1(\tau) \mid I_{max}\} \le \exp\left(-\lambda_i \sum_{\tau : I_{max}(\tau) = 1} \sqrt{\prod_k m_k(\tau)}\right).$$
A maximum upper bound would require $\prod_k m_k(\tau)$ to be at its minimum for all τ. This would occur as $I_{max}(\tau) = 1$ for all $\tau : nl+1 \le \tau \le nl + (d - l)$, when arm i is chosen repeatedly in the first d − l turns just after the initiation phase. Then, $\prod_k m_k(\tau)$ for $\tau : I_{max}(\tau) = 1$ are given by
$$\prod_k m_k(\tau) = (l + \tau - nl)\, l^{\,n-1} \qquad \forall \tau : nl+1 \le \tau \le nl + (d-l).$$
So,
$$\prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \le \exp\left(-\lambda_i \sum_{\tau = nl+1}^{nl+(d-l)} \sqrt{(l + \tau - nl)\, l^{\,n-1}}\right) \le \exp\left(-\lambda_i\, l^{\frac{n-1}{2}} \sum_{k=l+1}^{d} \sqrt{k}\right) \qquad (4.2.11)$$
which can be refined further using lemma 17. Using (4.2.9), (4.2.10), (4.2.11) and lemma 17, we bound Gi(w) in (4.2.8) as
$$G_i(w) \le \sum_{d=l}^{\,w-1-l(n-1)} (d+1)\binom{w - nl - 1}{d - l}\exp\left(-\lambda_i\left(\sqrt{d\,(w - d - 1 - l(n-2))\,l^{\,n-2}} + \frac{2}{3}\, l^{\frac{n-1}{2}}\left(d^{3/2} - l^{3/2}\right)\right)\right). \qquad (4.2.12)$$
Back to the original problem, if we show that E[Fi(t)] grows slower than log(t) asymptotically, for all i, then it is sufficient to prove the regret is O(log(t)). For this, see that
$$E[F_i(t)] \le l + \sum_{w=nl+1}^{t} G_i(w) = g(t),$$
say. Then, for the regret to grow slower than logarithmically in t, it is sufficient to show that the derivative of the upper bound g(t) grows slower than 1/t. Seeing that g(t) − g(t − 1) = Gi(t), it is enough to show that Gi(t) is bounded by 1/t asymptotically. Consider
$$\Gamma_d = (d+1)\binom{w - nl - 1}{d - l}\exp\left(-\lambda_i\left(\sqrt{d\,(w - d - 1 - l(n-2))\,l^{\,n-2}} + \frac{2}{3}\, l^{\frac{n-1}{2}}\left(d^{3/2} - l^{3/2}\right)\right)\right).$$
There exists $d^*$ such that $\Gamma_{d^*} \ge \Gamma_d$ for all $d \ne d^*$, and so we bound
$$G_i(w) \le (w - ln)\,\Gamma_{d^*}.$$
Now,
$$\lim_{t\to\infty} \frac{G_i(t)}{1/t} = \lim_{t\to\infty} t\, G_i(t) \le \lim_{t\to\infty} \frac{t\,(t - ln)\,(d^*+1)\binom{t - nl - 1}{d^* - l}}{\exp\left(\lambda_i\left(\sqrt{d^*\,(t - d^* - 1 - l(n-2))\,l^{\,n-2}} + \frac{2}{3}\, l^{\frac{n-1}{2}}\left(d^{*3/2} - l^{3/2}\right)\right)\right)}.$$
Since each term in the numerator is bounded by a polynomial while the exponential in the denominator is not,
$$\lim_{t\to\infty} \frac{G_i(t)}{1/t} = 0.$$
Since the growth of E[Fi(t)] is bounded by O(log(t)) for all i, we see that the proposed algorithm FMB has optimal regret as characterized by Lai & Robbins (1985), with the total regret incurred up to time t given by
$$\sum_{i \ne 1} \Delta_i\, O\big(l + \log(t)\big).$$
Chapter 5
In this chapter, we analyze the proposed algorithm FMB, with regard to its properties,
performance and tunability.
We now look at the sample complexity incurred for every arm by FMB in (3.3.2) and its dependence on ε. Unlike for MEA, this dependence of the sample complexity is quite different for FMB, and is illustrated in Figure 5.1.1. A decrease in ε would lead to an increase in complexity only when the set {i : µi < µ1 − ε} in (3.3.3) changes due to the decrease in ε. The set of arms that determine the complexity is related to the true means in the original problem. We also note that this should ideally be the case, in that a decrease in ε that does not lead to any decrease in the number of ε-optimal arms should not increase the sample complexity. This is because the algorithm still performs so as to select exactly the same ε-optimal arms as before the decrease in ε.
Figure 5.1.1: Comparison of FMB with MEA on the functional dependence of Sample Complexity on ε

While cumulative regret, referred to simply as regret in general, is defined in the expected sense as $\eta(t) = t\mu^* - E[Z_t]$, where Zt is the total reward acquired over t pulls and $\mu^* = \max_i \mu_i$, there exists another quantification of regret that is related to the sample complexity in some sense. Bubeck et al. (2009) discuss links between cumulative regret and this new quantification, called simple regret, defined as
$$\phi(t) = \mu^* - \mu_{\psi_t}$$
where $\mu_{\psi_t} = \sum_i \mu_i\, \psi_{i,t}$, with $\psi_{i,t}$ the probability of choosing arm i as output by the policy-learning algorithm for the (t + 1)th pull. Essentially, ψt is the policy learnt by the algorithm after t pulls. Note that φ(t) after an (ε, δ)-PAC-guaranteeing exploration is essentially related to ε, in the sense that φ(t) ≤ ε with probability at least 1 − δ.
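A minimal sketch of the simple-regret computation, with a made-up policy ψt for illustration:

```python
import numpy as np

mu = np.array([0.9, 0.7, 0.5])      # hypothetical true means; mu* = 0.9
psi = np.array([0.8, 0.15, 0.05])   # hypothetical policy psi_t: arm-selection probabilities after t pulls

simple_regret = mu.max() - np.dot(mu, psi)   # phi(t) = mu* - mu_{psi_t}
print(simple_regret)                         # 0.05
```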
Bubeck et al. (2009) find dependencies between φ(t) and η(t) and state that the smaller
φ(t) can get, the larger η(t) would have to be. We now analyze how FMB attains this
relation.
Consider improving the policy-learning algorithm with respect to φ(t). This could be achieved by keeping β constant and increasing l. This may either reduce µT as seen from (3.3.2), and hence ε as seen from (3.3.3), or simply reduce δ as seen from (3.3.2), both ways improving the simple regret. On the other hand, the regret η(t), otherwise specifically called cumulative regret, directly depends on l as seen from (4.2.7). For a constant number of pulls t, the summation of Gi(w) in (4.2.7) runs over fewer terms for an increasing l. As the cumulative regret is dominated by the regret incurred during the initiation phase, it is evident that by increasing l we obtain worse bounds on regret. Thus, we observe a Simple vs. Cumulative Regret trade-off in the proposed algorithm FMB.
The sample complexity in (3.3.2) depends on µT, which is given by
$$\mu_T = \min_{i:\ \mu_i < \mu^* - \epsilon} |\mu_{T_i}|$$
with
$$\mu_{T_i} = E[A_i - A_1] = \mu_i^{\,n-1}\prod_{j\ne i}\left[r_i^{\,\beta-1} + \frac{\mu_j}{r_i r_j}\left(I_{ij}\,(r_i - r_j)^{\beta} - r_i^{\beta}\right)\right] - \mu_1^{\,n-1}\prod_{j\ne 1}\left[r_1^{\,\beta-1} + \frac{\mu_j}{r_1 r_j}\left(I_{1j}\,(r_1 - r_j)^{\beta} - r_1^{\beta}\right)\right].$$
Consider
$$E[A_i] = \mu_i^{\,n-1}\prod_{j\ne i}\left[r_i^{\,\beta-1} + \frac{\mu_j}{r_i r_j}\left(I_{ij}\,(r_i - r_j)^{\beta} - r_i^{\beta}\right)\right] = \zeta_{i,1}\prod_{j\ne i}\left[\zeta_{i,2}\,\zeta_{i,3}^{\beta} + \zeta_{i,4}\,\zeta_{i,5}^{\beta} + \zeta_{i,6}\,\zeta_{i,7}^{\beta}\right]$$
where the ζi,j are constants given the problem at hand, essentially the µi and ri for all i. This can be further written as
$$E[A_i] = \sum_{(\theta_1,\theta_2,\theta_3):\ \sum_j \theta_j = n-1} \zeta_{i,1}\;\zeta_{i,2}^{\theta_1}\,\zeta_{i,4}^{\theta_2}\,\zeta_{i,6}^{\theta_3}\left\{\zeta_{i,3}^{\theta_1}\,\zeta_{i,5}^{\theta_2}\,\zeta_{i,7}^{\theta_3}\right\}^{\beta}$$
exhibiting monotonicity in β. Hence we see that µTi can be broken down into a finite summation,
$$\mu_{T_i} = \sum_{l}\left(\kappa_{i,1,l}\,\kappa_{i,2,l}^{\beta} - \kappa_{i,3,l}\,\kappa_{i,4,l}^{\beta}\right)$$
where the κi,j,l are constants given the problem at hand. Thus, by tuning β, we can control µTi and hence µT. For a better sample complexity, i.e. a lower l, we need a higher µT. This can be achieved by increasing β, since µT is monotonic in β. But there is a definite restriction on the variation of the parameter β. Correctness of the algorithm requires E[Ai] > E[Aj] for every µi > µj. Let us call the set of β adhering to this restriction βS. Since E[Ai] is monotonic in β, we see that βS must be an interval on the real line.
Now consider the bounds on regret, given together by (4.2.7) and (4.2.12). Given the sample complexity, or equivalently given l, the regret can still be improved, as it depends on $\lambda_i = \frac{\mu_{T_i}^2}{\sqrt{n}}$, which the algorithm has control over through β. So, β can be increased to improve the bounds on regret through an increase in λi, keeping l constant. But with an increase in λi, or equivalently |µTi|, µT is expected to increase as well. In this respect, we believe the β for the best regret, given an l, would be fractional, and would be that value which, when increased by the smallest amount, results in a decrease of the sample complexity parameter l by 1. While it has been observed experimentally that the best regret occurs for a fractional value of β, we cannot be sure whether 'the β' for the best regret was indeed achieved.
To summarize, we can control the algorithm FMB with regard to sample complexity
or regret. But this tuning will inevitably have trade-offs between simple and cumulative
regrets, φ(t) and η(t) respectively, consistent with the findings of Bubeck et al. (2009).
Nevertheless, there is a definite interval of restriction βS , defined by the problem at
hand, on the tuning of β.
Thus, if some β ∈ βS could achieve $\mu_T > \sqrt{\sqrt{n}\,\ln n}$, then we would have an algorithm achieving O(n) sample complexity with l = 1.
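A tiny Python check of this condition (the δ term is dropped here, matching the statement above; the µT values are hypothetical):

```python
import numpy as np

def single_initiation_pull_suffices(n, mu_T):
    """True when mu_T > sqrt(sqrt(n) * ln n), i.e. mu_T^4 > n ln^2 n,
    so the per-arm length from (3.3.2) drops to l = 1."""
    return mu_T > np.sqrt(np.sqrt(n) * np.log(n))

print(single_initiation_pull_suffices(10, mu_T=3.0),
      single_initiation_pull_suffices(10, mu_T=1.0))   # True False
```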
Chapter 6
A 10-arm bandit test bed with rewards formulated as Gaussian (with varying means and variances) or Bernoulli was developed for the experiments. Probabilistic and Greedy action selections (pFMB and FMB respectively) were performed on the quantities Ai.
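A sketch of such a test bed in Python; the means are drawn from a standard normal here purely for illustration (the thesis specifies varying means and unit variance but not how the means are generated):

```python
import numpy as np

def make_gaussian_testbed(n_arms=10, seed=0):
    """10-arm Gaussian test bed: each arm i returns N(mean_i, 1) rewards."""
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 1.0, n_arms)   # assumed way of picking the varying means
    def pull(i):
        return rng.normal(means[i], 1.0)
    return means, pull

means, pull = make_gaussian_testbed()
print(means.argmax(), pull(3))   # index of the optimal arm and one sampled reward
```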
Here we describe comparisons of FMB and pFMB with traditional approaches that do
not necessarily guarantee sample complexity or regret bounds. A 10-arm bandit test
bed with rewards formulated as Gaussian with varying means and a variance of 1 was
developed for the experiments. Figure 6.1.1 shows the average reward and cumulative optimal selections with plays, averaged over 2000 tasks on the test bed. 'Greedy Ai' and 'Exploring Ai' are the curves corresponding to the performances of FMB and pFMB respectively, compared against the SoftMax algorithm (temperature 0.24).
Figure 6.1.1: Two variants of the proposed algorithm, Greedy and Probabilistic action
selections on Ai , are compared against the SoftMax algorithm
[Figure: Average Reward and % Optimal Selections with plays, over 5000 plays.]
We now compare FMB with state-of-the-art algorithms that guarantee either sample complexity or regret bounds. Comparisons of FMB with UCB-Revisited, henceforth called UCB-Rev, which was shown to incur low regret [Auer & Ortner (2010)], are depicted in Figure 6.2.1. The experiments were conducted on the 10-arm bandit test bed with specific random seeds (Gaussian or Bernoulli rewards) and the results were averaged over 200 different tasks or trials. Note that UCB-Rev was provided with knowledge of the horizon, the number of plays T, which aids in its optimal performance. Furthermore, no parameter optimization was performed for FMB, and the experiments were conducted with β = 0.85. We see that FMB performs substantially better in terms of cumulative regret and its growth, even with the knowledge of the horizon provided to UCB-Rev.
Comparisons of FMB with the Median Elimination Algorithm (MEA) [Even-Dar et al. (2006)], which was also shown to achieve O(n) sample complexity, are shown in Figure 6.2.2. Here we compare the performances of the two algorithms when incurring the same sample complexity.
Figure 6.2.1: Comparison of FMB with UCB-Rev [Auer & Ortner (2010)] on Bernoulli
and Gaussian rewards, respectively
Figure 6.2.2: Comparison of FMB with Median Elimination Algorithm (MEA) [Even-
Dar et al. (2006)] on Bernoulli and Gaussian rewards, respectively
Chapter 7
The proposed class of algorithms is, to the best of our knowledge, the first to use fractional moments in the bandit literature. Specifically, the class is shown to possess algorithms that provide PAC guarantees with O(n) complexity in finding an ε-optimal arm, in addition to algorithms incurring the theoretically lowest regret of O(log(t)). Experimental results support this, showing that the algorithm achieves substantially lower regret not only when compared with parameter-optimized ε-greedy and SoftMax methods but also with state-of-the-art algorithms like MEA [Even-Dar et al. (2006)] (when compared in achieving the same sample complexity) and UCB-Rev [Auer & Ortner (2010)]. Minimizing regret has been a crucial factor in various applications; for instance, Agarwal et al. (2009) describe its relevance to content publishing systems that select articles to serve hundreds of millions of user visits per day. In this regard, FMB performs 20 times faster (empirically) compared to UCB-Rev. In addition to performance improvements, FMB provides a neat way of controlling sample complexity and regret with the help of a single parameter. To the best of our knowledge, this is the first work to introduce control of sample complexity and regret while learning bandits.
We note that as the reward distributions are relaxed from Ri ∈ {0, ri} to continuous probability distributions, the sample complexity in (3.3.2) improves further. To see this, consider a gradual blurring of the Bernoulli distribution into a continuous distribution. The probabilities P{Ai > A1} would increase due to the inclusion of new reward possibilities in the event space. So we expect even lower sample complexities (in terms of the constants involved) with continuous reward distributions. But the trade-off is really in the computations involved. The algorithm presented can be implemented incrementally, but it requires that the set of rewards observed so far be stored. On the other hand, the algorithm simplifies computationally to a much faster approach in the case of Bernoulli distributions (the essential requirement is rather a low-cardinality reward support), as the mean estimates can be used directly. As the cardinality of the reward set increases, so will the complexity of the computation. Since most rewards are encoded manually, specific to the problem at hand, we expect low-cardinality reward supports, where the low sample complexity achieved would greatly help without a major increase in computations.
The theoretical analysis assumes greedy action selections on the quantities Ai, and hence any further exploration beyond the initiation phase (for instance, the exploration performed by the algorithm pFMB) is not accounted for. Bounds on the sample complexity or regret of the exploratory algorithm pFMB would help in better understanding the explore-exploit situation, as to whether a greedy FMB could beat an exploratory FMB. While FMB incurs a sample complexity of O(n), determination of an appropriate l given ε and δ is another direction to pursue. In addition, note from (3.3.2) and (3.3.3) that we sample all arms with the worst li, where li is the necessary number of pulls for arm i to ensure PAC guarantees with O(n). We could reduce the sample complexity further if we formulate the initiation phase with li pulls of arm i, which requires further theoretical footing on the determination of appropriate values for l. The method of tuning the parameter β, and the estimation of βS, are aspects to pursue for better use of the algorithm in unknown environments with no knowledge of the reward support. A further variant of the proposed algorithms, allowing β to change while learning, could be an interesting possibility to look at. Extensions to the multi-slot framework, applications to contextual bandits and extensions to the complete Reinforcement Learning problem are some useful avenues to pursue.
Bibliography
Even-Dar E., Mannor S. and Mansour Y. (2006). Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7, 1079–1105.

Berry D. A. and Fristedt B. (1985). Bandit Problems. Chapman and Hall Ltd.

Auer P., Cesa-Bianchi N. and Fischer P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47, 235–256.

Lai T. and Robbins H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

Agrawal R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.

Min S. and Chrysostomos L. N. (1993). Signal Processing with Fractional Lower Order Moments: Stable Processes and Their Applications. Proceedings of the IEEE, 81(7), 986–1010.

Achim A. M., Canagarajah C. N. and Bull D. R. (2005). Complex wavelet domain image fusion based on fractional lower order moments. 8th International Conference on Information Fusion, Vol. 1, 25–28.

Xinyu M. and Nikias C. L. (1996). Joint estimation of time delay and frequency delay in impulsive noise using fractional lower order statistics. IEEE Transactions on Signal Processing, 44(11), 2669–2687.

Agarwal D., Chen B., Elango P., Motgi N., Park S., Ramakrishnan R., Roy S. and Zachariah J. (2009). Online models for content optimization. Advances in Neural Information Processing Systems 21.

Bubeck S., Munos R. and Stoltz G. (2009). Pure Exploration in Multi-Armed Bandit Problems. 20th International Conference on Algorithmic Learning Theory (ALT).

Auer P. and Ortner R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2), 55–65.

Kim S. and Nelson B. (2001). A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11(3), 251–273.