Control of Sample Complexity and Regret in Bandits Using Fractional Moments
A Project Report
submitted by
ANANDA NARAYAN
Dr.B. Ravindran
Associate Professor
Dept. of Computer Science and Engineering
Dr.R. Aravind
Professor
Dept. of Electrical Engineering
IIT-Madras, 600036
Place: Chennai
Date: 10th of May, 2012
ACKNOWLEDGEMENTS
I am very thankful to my project guide Dr. Ravindran for his prolific encouragement
and guidance. If not for his receptivity towards his students, this work would not
have been possible. I would also like to thank Dr. Andrew, Dr. Aravind and Dr.
Venkatesh for the courses they offered that stood me in good stead during the course
of this project.
I would also like to thank my friends and colleagues for making the 5 years of my stay here most cherished. Their presence unfolded numerous eventful experiences that I will always remember.
Finally, I would like to thank my parents for all their love, affection and teaching. My learning from their life experiences has been instrumental in the success of this endeavour.
Control of Sample Complexity and Regret in
Bandits using Fractional Moments
Ananda Narayan
Abstract
One key facet of learning through reinforcements is the dilemma between exploration to find profitable actions and exploitation to act optimally according to the observations already made. We analyze this explore/exploit situation on Bandit problems in stateless environments. We propose a family of learning algorithms for bandit problems based on fractional expectations of the rewards acquired. The algorithms can be controlled, through a single parameter, to behave optimally with respect to either sample complexity or regret. The family is theoretically shown to contain algorithms that converge on an ε-optimal arm and achieve O(n) sample complexity, a theoretical minimum. The family is also shown to include algorithms that achieve the optimal logarithmic regret O(log(t)) proved by Lai & Robbins (1985), again a theoretical minimum. We also propose a method to tune for a specific algorithm from the family that can achieve this optimal regret. Experimental results support these theoretical findings, and the algorithms perform substantially better than state-of-the-art techniques in regret reduction like UCB-Revisited and other standard approaches.
Contents
1 Introduction
1.4 Applications
2.1 Notation
2.2 Motivation
3 Sample Complexity
4 Regret Analysis
List of Figures
6.1.1 Two variants of the proposed algorithm, Greedy and Probabilistic action selections on Ai, are compared against the SoftMax algorithm
6.2.1 Comparison of FMB with UCB-Rev [Auer & Ortner (2010)] on Bernoulli and Gaussian rewards, respectively
Chapter 1
Introduction
Artificial Intelligence is one of the promising frontiers that the world is looking toward. Moving from mere automation towards demonstrating intelligence requires a significant leap for algorithms. Intelligent agents not only should learn from their experiences, but also need to find out how such experiences can be acquired. They have to act in such a way that the experiences that follow prove noteworthy for improving their current understanding of the system at hand. They have to repeatedly interact with the system to learn decision-making on it. Reinforcement learning is one approach that provides ample scope for such agents. We will now introduce some of the salient aspects of reinforcement learning and discuss some of its applications.
Some of the well known models of machine learning, constituting supervised, unsupervised and semi-supervised approaches, have been extensively used for classification, regression, clustering, finding frequent patterns, and so on. But these approaches do not provide avenues for learning to cycle — that is, to learn decision-making on a system by repeatedly interacting with it. Reinforcement learning is an approach that allows for such trial and error based learning.
Figure 1.1.1: Reinforcement Learning framework

In the reinforcement learning framework (figure 1.1.1), an agent perceives states of its environment to perform a decision on a set of actions it can take on the environment. In deciding and taking an action, the agent receives feedback about how beneficial the action is for the corresponding state of the environment. This feedback is regarded as a 'Reward' signal, usually stochastic in nature. The higher the reward, the better the action selected was for the perceived state of the environment. And in taking an action, the environment changes its state, which is again usually stochastic. The learning task is to find a mapping from states to actions. The optimal mapping may be stochastic or deterministic depending on the stochasticity in the rewards and the state transitions on taking actions. In a nutshell, reinforcement learning enables an agent to perform decision-making on a system (environment), perceiving its state and the rewards it provides.
Example 1. Consider a stochastic maze environment and an agent that tries to solve it, obtaining rewards at a few destinations, as in figure 1.1.2 (source: Dayan & Abbott (2001)). When at any of the squares in the maze (state), the agent has a set of actions to choose from that is always a subset of the set S = {forward, backward, right, left, idle}. The particular set of actions available depends on the state of the agent in the environment. Taking one of the actions may not necessarily make the agent move in that particular direction. The environment is stochastic in changing the agent's state (as indicated in the figure), and the agent may move in any direction, or may remain idle, on taking a particular action, say moving forward. After an action is taken, the agent receives a reward that pertains to the triple containing the past state of the agent, the action taken by the agent, and the future state where the agent ended up due to the action taken. This reward is the numerical value indicated in the figure for the destination states and is zero for all other states. Once one of the destination states is reached, the maze is solved (with a finishing reward that the destination state provides) and learning stops. Note that the reward for selecting a path towards the optimal destination is only available at the end of the sequence of action selections. Hence, one needs to attribute every reward not only to the last action taken but to the entire sequence of actions that led to this reward. So the reward for selecting an action can be delayed. By performing these action selections, the task is to learn a mapping of states to actions so as to reach the destination states with the highest numerical reward.

Figure 1.1.2: A stochastic maze environment (source: Dayan & Abbott (2001))
1.2 Multi-arm Bandits
A multi-arm bandit problem requires finding optimal learning strategies for a simplified version of the reinforcement learning agent described above. The simplification is in the state space. In the multi-arm bandit problem, the environment is stateless, or can be said to be in the same state at all times. This reduces the complexity in learning by a large amount due to the absence of state transitions of the environment and the absence of reward dependence on state-action associations. So the rewards are only a function of the action taken, and there is a set of actions from which an optimal action needs to be selected.
The multi-arm bandit takes its name from a slot machine (a gambling machine that rotates reels when a lever or arm is pulled). A traditional slot machine is a one-arm bandit, and a multi-arm bandit has multiple levers or arms that can be pulled. Thus, there is a set of levers to choose from at all times, and the task is to find the lever that would give us the maximum return (or reward).
Two main objectives have been defined and analyzed for the multi-arm bandit problem in the literature. One of them relates to reducing the regret (or loss) incurred while trying to learn the best arm. The other relates to reducing the number of samples (or pulls of arms) to be drawn to find an arm close to the optimal arm with high probability. We define these quantifications of optimality formally below.
1.3.1.1 Regret Reduction
The traditional objective of the bandit problem is to maximize the total reward (or gain) given a specified number of pulls to be made, hence minimizing regret (or loss). Recall that an arm with mean equal to µ∗ is considered the best arm, as it is expected to give the maximum reward per pull. Let Zt be the total reward acquired over t successive pulls. Then the regret η is defined as the expected loss in total reward relative to repeatedly choosing the best arm right from the first pull, that is, η(t) = tµ∗ − E[Zt]. This quantification is also sometimes referred to as cumulative regret.
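As a concrete illustration of this definition, the following minimal Python sketch (with hypothetical arm means and a uniformly random policy, purely for illustration) estimates the cumulative regret of a simulated run:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.9, 0.7, 0.5])   # hypothetical true arm means; the best arm has mu* = 0.9
t = 1000

# A uniformly random policy, used only to illustrate eta(t) = t*mu* - E[Z_t];
# Z_t below is a single-run estimate of the total acquired reward.
choices = rng.integers(0, len(mu), size=t)
Z_t = sum(rng.normal(mu[a], 1.0) for a in choices)

print(t * mu.max() - Z_t)        # empirical cumulative regret of this run
```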
We will now introduce a Probably Approximately Correct (PAC) framework and the associated sample complexity, and an optimality criterion based on sample complexity reduction. We start with a definition for ε-optimality: an arm with mean µ is said to be ε-optimal if µ ≥ µ∗ − ε.
Definition 3. (Even-Dar et al. (2006)) An algorithm is an (ε, δ)-PAC algorithm for the multi-arm bandit with sample complexity T, if it outputs an ε-optimal arm with probability at least 1 − δ when it terminates, and the number of time steps the algorithm performs until it terminates is bounded by T.
The number of time steps is also known as the number of samples (pulls of arms).
Hence, the objective is then to reduce T , often referred to as sample complexity.
1.4 Applications
The multi-arm bandit problem has numerous applications in decision theory, evolutionary programming, reinforcement learning, search, recommender systems, and even in a few interdisciplinary areas like industrial engineering and simulation. We now describe a very brief collection of examples of such applications.
Figure 1.4.1: Yahoo Frontpage Today Module
Figure 1.4.1 (source: www.yahoo.com) shows the home page of the popular web service provider Yahoo!, featuring its Frontpage Today Module. This module uses a multi-arm bandit formulation to choose the top four articles to feature, those frequently viewed by users who visit the page. A set of editor-picked stories (sometimes called a story-bucket) is made available as the arms to choose from, and a bandit learner finds the top four optimal arms to display. This learning happens continuously, in an online manner, with the story-bucket changing dynamically as and when more relevant news articles arrive. The algorithm we will propose in chapter 2 is compatible with all the requirements mentioned above. For matching news articles to user interests, demography, etc., a set of input features specific to the user is often used in finding the best news stories to display. This introduces a formulation often called Contextual Bandits, owing to the use of contextual information in finding the best arm.
Figure 1.4.2 (source: www.google.com) depicts a typical set of ads chosen for a specific search query by Google. This problem is often analyzed using a multi-arm bandit formulation, with millions of ads relating to the set of arms to choose from. Contextual information specific to the search query at hand is of course used as well. This is one of the best examples of multi-arm bandit applications where sample complexity reduction is very important, since the number of arms (or ads, in this case) involved is very high.

Figure 1.4.2: Google Ads
A set of recommended films at the online movie database IMDB is shown in figure 1.4.3. This problem of selecting recommended items pertaining to some query is extensively researched under the title 'Recommender Systems'. Multi-arm bandit formulations are widely seen in recommender systems as well. Contextual information from the film names and categories can further improve the recommendations.
The popular online shopping store Amazon, on the other hand, selects recommended articles through previous user behavior, as seen in figure 1.4.4. Does this amount to a deterministic solution, with no need for any explore-exploit trade-off like in multi-arm bandits? No. The set of article buckets frequented by users is usually large, and the selection of the top three recommended articles does need to be learnt. Thus, the multi-arm bandit problem relates to applications that have far-reaching significance and value additions.
The multi-arm bandit problem has applications in many fields, including industrial engineering, simulation and evolutionary computation. For instance, Kim & Nelson (2001) discuss applications with regard to ranking, selection and multiple comparison procedures in statistics, and Schmidt et al. (2006) detail applications in evolutionary computation.
The traditional objective of the bandit problem is to maximize the total reward given a specified number of pulls, l, to be made, hence minimizing regret. Lai & Robbins (1985) showed that the regret must grow at least logarithmically and provided policies that attain this lower bound for specific probability distributions. Agrawal (1995) provided policies achieving the logarithmic bounds incorporating sample means that are computationally more efficient, and Auer et al. (2005) described policies that achieve the bounds uniformly over time rather than only asymptotically. Meanwhile, Even-Dar et al. (2006) provided another quantification of the objective, measuring quickness in determining the best arm. A Probably Approximately Correct (PAC) framework is incorporated, quantifying the identification of an ε-optimal arm with probability 1 − δ. The objective is then to minimize the sample complexity l, the number of samples required for such an arm identification. Even-Dar et al. (2006) describe the Median Elimination Algorithm achieving O(n) sample complexity. This is further extended to finding m arms that are ε-optimal with high probability by Kalyanakrishnan & Stone (2010).
We propose a family of learning algorithms for the Bandit problem and show that the family contains not only algorithms attaining O(n) sample complexity but also algorithms that achieve the theoretically lowest regret of O(log(t)). In addition, the algorithm presented allows learning to be controlled to reduce either sample complexity or regret, through a single parameter that it takes as input. To the best of our knowledge, this is the first work to introduce control on sample complexity or regret while learning. We address the n-arm bandit problem with generic probability distributions, without any restrictions on the means, variances or any higher moments that the distributions may possess. Experiments show the proposed algorithms perform substantially better compared to state-of-the-art algorithms like the Median Elimination Algorithm [Even-Dar et al. (2006)] and UCB-Revisited [Auer & Ortner (2010)]. While Even-Dar et al. (2006) and Auer & Ortner (2010) provide algorithms that are not parameterized for tunability, we propose a single-parameter algorithm that is based on fractional moments of the rewards acquired: the ith moment of a random variable R is defined as E[R^i], and fractional moments occur when the exponent i is fractional (rational or irrational). To the best of our knowledge, ours is the first work (a part of which appeared recently in UAI 2011 [Narayan & Ravindran (2011)]) to use fractional moments in bandit problems, although we have recently discovered their usage in other contexts in the literature. Min & Chrysostomos (1993) describe applications of such fractional (low order) moments in signal processing. They have been employed in many areas of signal processing, including image processing [Achim et al. (2005)] and communication systems [Xinyu & Nikias (1996); Liu & Mendel (2001)], mainly with regard to achieving more stable estimators. We theoretically show that the proposed algorithms can be controlled to attain O(n) sample complexity in finding an ε-optimal arm, or to incur the optimal logarithmic regret, O(log(t)). We also discuss the controllability of the algorithm. Experiments support these theoretical findings, showing that the algorithms incur substantially low regrets while learning. A brief overview of what is presented: chapter 2 describes the motivation for the proposed algorithms, followed by theoretical analysis of optimality and sample complexity in chapter 3. Then, in chapter 4, we prove that the proposed algorithms attain the theoretical lower bound for regret shown by Lai & Robbins (1985). We then provide a detailed analysis of the algorithms presented with regard to their properties, performance and tunability in chapter 5. Finally, experimental results and observations are presented in chapter 6, followed by conclusions and scope for future work.
Chapter 2
2.1 Notation
Consider a bandit problem with n arms, with action ai denoting the choice of pulling the ith arm. An experiment involves a finite number of arm-pulls in succession. An event of selecting action ai results in a reward that is sampled from a random variable Ri having a bounded support. In a particular such experiment, consider the event of selecting action ai for the kth time. Let ri,k be a sample of the reward acquired in this selection.
2.2 Motivation
When deciding about selecting action ai over any other action aj, we are concerned about the rewards we would receive. Though estimates of E(Ri) and E(Rj) are indicative of the rewards for the respective actions, estimates of the variances E[(Ri − µi)²] and E[(Rj − µj)²] would give more information in the beginning stages of learning, when the confidence in the estimates of the expectations is low. In other words, we would not yet have explored enough for the estimated expectations to reflect the true means.
Though the mean and variance together provide full knowledge of the stochasticity for some distributions, for instance the Gaussian, we want to handle generic distributions and hence need to consider additional higher moments. It is the distribution, after all, that gives rise to all the moments completely specifying the random variable. Thus we look at a generic probability distribution to model our policy to pull arms.
Consider the selection of action ai over action aj. The probability that action ai yields a higher reward than action aj is P(Ri > Rj), and to compute its estimate we propose the following discrete approximation: after selecting actions ai, aj for mi, mj times respectively, we can, in general, compute the probability estimate
$$\hat{P}(R_i > R_j) = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} \hat{P}(R_j = r_{j,k'}),$$
$$N_i = \{r_{i,k} : 1 \le k \le m_i\}, \qquad L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\} \qquad (2.2.1)$$
with random estimates $\hat{P}(R_i = r_{i,k})$ of the true probability $P(R_i = r_{i,k})$ calculated by
$$\hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i}. \qquad (2.2.2)$$
Note that thus far we are only taking into account the probability, ignoring the magnitude of the rewards. That is, if we have two instances of rewards, $r_{j,l}$ and $r_{m,n}$, with $P(R_j = r_{j,l}) = P(R_m = r_{m,n})$, then they contribute equally to the probabilities $P(R_i > R_j)$ and $P(R_i > R_m)$, though one of the rewards could be much larger than the other. For fair action selections we need a function monotonically increasing in the rewards, and so we formulate the preference function for action ai over action aj as
$$A_{ij} = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{P}(R_j = r_{j,k'}) \qquad (2.2.3)$$
where β determines how far we want to distinguish the preference functions with regard to the magnitude of the rewards. This way, for instance, arms constituting a higher variance are given more preference. For β = 1, we would have $A_{ij}$ proportional to $E\big[(R_i - R_j)\,\mathbb{1}\{R_i > R_j\}\big]$ when the estimates $\hat{P}$ approach the true probabilities. See that we have started with a simple choice for the monotonic function, the polynomial function, without restricting the exponent. Now, to find among a set of arms one single arm to pull, we need a quantification to compare different arms. With the introduction of the polynomial dependence on the reward magnitudes, the $A_{ij}$'s are no longer probabilities. To find a quantification on which to base the arm pulls, we propose the preference function for choosing ai over all other actions to be
$$A_i = \prod_{j \ne i} A_{ij}. \qquad (2.2.4)$$
For an n-arm bandit problem, our proposed class of algorithms first chooses each arm l times, in what is called an initiation phase. After this initiation phase, at all further action selections, where arm i has been chosen mi times, finding a reward ri,k when choosing the ith arm for the kth time, the class of algorithms computes the sets
$$N_i = \{r_{i,k} : 1 \le k \le m_i\}, \qquad L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\}$$
and, using
$$\hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i},$$
computes the indices
$$A_i = \prod_{j \ne i} A_{ij}, \qquad A_{ij} = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{P}(R_j = r_{j,k'})$$
to further choose arms.
Algorithm 2.1 Fractional Moments on Bandits (FMB)
Input: l, β
Initialization: Choose each arm l times
Define: ri,k, the reward acquired for the kth selection of arm i; mi, the number of selections made for arm i; and the sets $N_i = \{r_{i,k} : 1 \le k \le m_i\}$ and $L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\}$
Loop:
1. $\hat{p}_{ik} = \hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i}$ for 1 ≤ i ≤ n, 1 ≤ k ≤ mi
2. $A_{ij} = \sum_{r_{i,k} \in N_i} \hat{p}_{ik} \left\{ \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{p}_{jk'} \right\}$ for 1 ≤ i, j ≤ n, i ≠ j
3. $A_i = \prod_{j \ne i} A_{ij}$ for all 1 ≤ i ≤ n
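The following is a minimal Python sketch of the FMB loop as listed above. The helper names are ours, the reward histories are kept explicitly as the algorithm requires, and the final greedy selection on the indices Ai reflects the greedy variant of FMB discussed later; treat it as an illustration rather than a reference implementation.

```python
import numpy as np
from collections import Counter

def fmb_indices(rewards, beta):
    """Compute the indices A_i of Algorithm 2.1 from per-arm reward histories."""
    n = len(rewards)
    # Step 1: empirical pmf of each arm, p_hat(R_i = r) = |{k' : r_{i,k'} = r}| / m_i
    pmfs = [{r: c / len(obs) for r, c in Counter(obs).items()} for obs in rewards]
    A = np.ones(n)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            # Step 2: A_ij = sum_{r_i} p_i(r_i) * sum_{r_j < r_i} (r_i - r_j)^beta * p_j(r_j)
            a_ij = sum(p_i * sum((r_i - r_j) ** beta * p_j
                                 for r_j, p_j in pmfs[j].items() if r_j < r_i)
                       for r_i, p_i in pmfs[i].items())
            A[i] *= a_ij          # Step 3: A_i = prod_{j != i} A_ij
    return A

def fmb(pull, n_arms, horizon, l=1, beta=0.85):
    """Run FMB with greedy selection on A_i; pull(i) samples a reward for arm i."""
    rewards = [[pull(i) for _ in range(l)] for i in range(n_arms)]   # initiation phase
    for _ in range(horizon - n_arms * l):
        i = int(np.argmax(fmb_indices(rewards, beta)))
        rewards[i].append(pull(i))
    return rewards
```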
The family (of algorithms FMB, for various values of l and β) is shown to contain algorithms that achieve a sample complexity of O(n) in chapter 3. Algorithm 2.1 is also shown to incur the asymptotic theoretical lower bound for regret in chapter 4. Further, the algorithm is analyzed in detail in chapter 5. (Note that FMB, pFMB, and any other algorithm that uses Ai for action selection constitute a class of algorithms; FMB for varying input parameters constitutes a family of algorithms.)
Chapter 3
Sample Complexity
Our problem formulation considers a set of n bandit arms, with Ri representing the random reward on pulling arm i. We assume that the reward is binary, Ri ∈ {0, ri} for all i. Since the proposed algorithm assumes a discrete approximation for the reward probability distributions, an analysis of the Bernoulli formulation extends readily. This is true even when the true reward distributions are continuous, as the algorithm always functions on the discrete approximation of the rewards it has already experienced. Also, denote pi = P{Ri = ri} and µi = E[Ri] = ri pi, and let us assume for simplicity (constraint 3.1.1) that arm 1 is the optimal arm, µ1 = µ∗ = maxi µi. Define
$$I_{ij} = \begin{cases} 1 & \text{if } r_i > r_j \\ 0 & \text{otherwise} \end{cases} \qquad\text{and}\qquad \delta_{ij} = \begin{cases} \dfrac{r_j}{r_i} & \text{if } r_i > r_j \\ 1 & \text{otherwise.} \end{cases}$$
From (2.2.3), (2.2.2) and (2.2.1), we then obtain the expression for Ai under this Bernoulli formulation, referred to below as (3.1.2).
We now discuss Chernoff bounds and their extension to dependent random variables before the analysis of the proposed algorithm.
We start with a brief introduction to the Chernoff–Hoeffding bounds on a set of independent random variables. Let X1, ..., Xn be independent random variables with common range [0, 1], and let Sn = (X1 + ... + Xn)/n with mean µ = E[Sn]. Then, for any a > 0,
$$P\{S_n \ge \mu + a\} \le \exp(-2a^2 n), \qquad P\{S_n \le \mu - a\} \le \exp(-2a^2 n).$$
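A quick, self-contained Python check of the above bound (parameters chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, mu, trials = 50, 0.1, 0.5, 100_000

# S_n is the mean of n independent Bernoulli(mu) samples, which lie in [0, 1]
S = rng.binomial(1, mu, size=(trials, n)).mean(axis=1)

empirical_tail = np.mean(S >= mu + a)
hoeffding_bound = np.exp(-2 * a**2 * n)
print(empirical_tail, hoeffding_bound)   # the empirical tail stays below exp(-2*a^2*n)
```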
Chernoff–Hoeffding bounds on a set of dependent random variables are discussed here for the analysis of the proposed algorithm. We start by introducing the chromatic and fractional chromatic numbers of graphs. Clearly, χ′(G) ≤ χ(G). We now look at Chernoff–Hoeffding bounds when the random variables involved are dependent.
A function f assigns a unique index to every element in S; for each e ∈ S, define the random variable Ui with i = f(e), so that for e1, e2 ∈ S, Ui1 and Ui2 are dependent if and only if at least one dimension of e1 − e2 is null. Construct a graph G(V, E) on V = {Ui : 1 ≤ i ≤ q} with edges connecting every pair of dependent vertices.
Remark 8. Consider n arms where the kth arm is sampled mk times. Let $X_u^{(k)}$ be a random variable, with mean $\mu^{(k)}$, related to the reward acquired on the uth pull of the kth arm. Then, for $\{m_k : 1 \le k \le n\}$, $X_1^{(k)}, X_2^{(k)}, \ldots, X_{m_k}^{(k)}$ are random variables (with common range within [0, 1], say) such that $E[X_u^{(k)} \mid X_1^{(k)}, \ldots, X_{u-1}^{(k)}] = \mu^{(k)}$ for all $k, u : 1 \le u \le m_k$, and $X_u^{(i)}$ is independent of $X_v^{(j)}$ for $i \ne j$. Define the random variable $S_{m_k}^{(k)} = \frac{X_1^{(k)} + \cdots + X_{m_k}^{(k)}}{m_k}$, capturing the estimate of $X^{(k)}$, and define the random variable $T = \prod_k S_{m_k}^{(k)}$,
and consequently, on expanding this product, we find
$$T = \frac{\sum_{i=1}^{q} U_i}{q},$$
$$P\{T \ge \mu_T + a\} \le \exp\left(\frac{-2a^2 q}{\chi'(G)}\right) \le \exp\left(\frac{-2a^2 q}{\chi(G)}\right),$$
$$P\{T \le \mu_T - a\} \le \exp\left(\frac{-2a^2 q}{\chi'(G)}\right) \le \exp\left(\frac{-2a^2 q}{\chi(G)}\right),$$
with mean $\mu_T = E[T] = \prod_k \mu^{(k)}$, and $\chi(G)$, $\chi'(G)$ the chromatic and fractional chromatic numbers of the associated graph G.
Theorem 10. For an n-arm bandit problem and given ε and δ, algorithm FMB is an (ε, δ)-PAC algorithm, incurring a sample complexity of
$$O\left(n\left(\frac{1}{\mu_T^4}\ln^2\frac{1}{\delta}\right)^{\frac{1}{n}}\right)$$
where
$$\mu_T = \min_{i:\ \mu_i < \mu^* - \epsilon} \big|E[A_i - A^*]\big|$$
with Ai defined in (2.2.4), (2.2.3), (2.2.2) and (2.2.1), and µ∗, A∗ being the values that correspond to the optimal arm.
Following constraint 3.1.1 for simplicity, we will now prove the above theorem. To
start with, we will state a few lemmas that would be helpful in proving theorem 10.
Lemma 11. For a graph G(V, E) constructed on vertices Ui pertaining to Definition 7, the chromatic number is bounded as
$$\chi(G) \le 1 + \sqrt{n\left(q - \prod_k (m_k - 1)\right)}.$$
Proof. See that |V| = q, and each vertex is connected to $q - \prod_k (m_k - 1)$ other vertices. Then the total number of edges in the graph becomes $|E| = \frac{1}{2}\, n \left(q - \prod_k (m_k - 1)\right)$. The number of ways a combination of two colors can be picked is $\binom{\chi(G)}{2} = \frac{1}{2}\chi(G)(\chi(G) - 1)$. For an optimal coloring, there must exist at least one edge connecting any such pair of colors picked, and we have
$$\frac{1}{2}\chi(G)(\chi(G) - 1) \le |E|,$$
$$\chi(G)(\chi(G) - 1) \le n\left(q - \prod_k (m_k - 1)\right).$$
Next, we state a lemma on the boundedness of a function which we will come across
later in proving theorem 10.
Lemma 12. The function $g(n) = (n \ln^2 n)^{\frac{1}{n}}$ is decreasing for all n greater than some $n_0 \in \mathbb{N}$, with $\lim_{n\to\infty} g(n) = 1$.
Proof. With
$$g(n) = (n \ln^2 n)^{\frac{1}{n}},$$
$$\ln g(n) = \frac{\ln(n \ln^2 n)}{n} = \frac{\ln n + 2 \ln\ln n}{n}$$
$$\implies \frac{1}{g(n)}\, g'(n) = \frac{1 + \frac{2}{\ln n} - (\ln n + 2\ln\ln n)}{n^2}.$$
Thus, g is a decreasing function for $n > n_0 \in \mathbb{N}$, with the limit at infinity governed by
$$\lim_{n\to\infty} \ln(g(n)) = \lim_{n\to\infty} \frac{\ln n + 2\ln\ln n}{n} = \lim_{n\to\infty} \frac{\frac{\partial}{\partial n}(\ln n + 2\ln\ln n)}{\frac{\partial}{\partial n} n} = \lim_{n\to\infty}\left(\frac{1}{n} + \frac{2}{n \ln n}\right) = 0$$
$$\implies \lim_{n\to\infty} g(n) = 1.$$
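A quick numeric illustration of Lemma 12 in Python (values of n chosen arbitrarily):

```python
import numpy as np

n = np.array([10.0, 100.0, 1_000.0, 10_000.0, 100_000.0])
g = (n * np.log(n) ** 2) ** (1.0 / n)   # g(n) = (n ln^2 n)^(1/n)
print(g)   # decreases towards 1, so n * g(n) grows only linearly in n
```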
Now we prove theorem 10 with the help of the lemmas defined above.
Proof of theorem 10. Let i ≠ 1 be an arm which is not ε-optimal, µi < µ1 − ε. To prove that FMB is an (ε, δ)-PAC algorithm, we need to bound, for all i, the probability of selecting a non-ε-optimal arm, $P(A_i > A_j : \mu_i < \mu_1 - \epsilon,\ \forall j)$, which is in turn bounded as
$$P(A_i > A_j : \mu_i < \mu_1 - \epsilon,\ \forall j) \le P(A_i > A_1 : \mu_i < \mu_1 - \epsilon)$$
since arm 1 is optimal. In other words, we need to bound the probability of the event $A_i > A_1$ to restrict the policy from choosing the non-ε-optimal arm i, with Ai given by (3.1.2). See that Ai is of the form $\prod_k S_{m_k}^{(k)}$ in remark 8. Defining the random variable $T_i = A_i - A_1$ and expanding the products, we have
$$T_i = \frac{\sum_{k=1}^{q} U_k}{q}$$
where the Uk have the properties described in definition 7, with its associated graph G(V, E). So we have a sum of $q = \prod_j m_j$ dependent random variables Uk, with true mean $\mu_{T_i} = E[T_i] = E[A_i] - E[A_1]$. Applying corollary 9 to Ti to bound the probability of the event $T_i \ge 0$, we have, using the modified Chernoff–Hoeffding bounds,
the event Ti ≥ 0, we have using the modified Chernoff-Hoeffding Bounds,
−µ2T ln/2
δ
exp √ =
n n
n1
n 2 n
=⇒ l = ln (3.3.2)
µ4T δ
where
1
O(n(n ln2 n) n ) = O(n).
Hence FMB finds ε-optimal arms with probability 1 − δ, incurring a sample complexity of
$$O\left(n\left(\frac{1}{\mu_T^4}\ln^2\frac{1}{\delta}\right)^{\frac{1}{n}}\right).$$

[Figure: Sample Complexity per Arm — g(n) plotted against n.]
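For illustration, a small Python helper evaluating the per-arm initiation length l of (3.3.2) for hypothetical values of n, µT and δ (these inputs are not from the thesis; they are made up to show the shape of the dependence):

```python
import numpy as np

def fmb_initiation_length(n, mu_T, delta):
    """Per-arm initiation length from (3.3.2): l = (n / mu_T^4 * ln^2(n / delta))^(1/n)."""
    return (n / mu_T**4 * np.log(n / delta) ** 2) ** (1.0 / n)

# hypothetical problem sizes, purely for illustration
for n in (5, 10, 50):
    print(n, fmb_initiation_length(n, mu_T=0.5, delta=0.05))
```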
Chapter 4
Regret Analysis
In this chapter, we show that FMB (Algorithm 2.1) incurs a regret of O(log(t)), a
theoretical lower bound shown by Lai & Robbins (1985). We will start with the
definition of regret and then state and prove the regret bounds of FMB.
Define the random variable Fi(t) to be the number of times arm i was chosen in t plays. The expected regret η is then given by
$$\eta(t) = \big(t - E[F_1(t)]\big)\mu_1 - \sum_{i \ne 1} \mu_i E[F_i(t)]$$
which is
$$\eta(t) = \mu_1 \sum_{i \ne 1} E[F_i(t)] - \sum_{i \ne 1} \mu_i E[F_i(t)] = \sum_{i \ne 1} (\mu_1 - \mu_i) E[F_i(t)] = \sum_{i \ne 1} \Delta_i E[F_i(t)], \qquad (4.1.1)$$
say, where $\Delta_i = \mu_1 - \mu_i$.
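The two forms of (4.1.1) can be checked numerically; the sketch below uses hypothetical means and expected pull counts:

```python
import numpy as np

mu = np.array([0.9, 0.7, 0.5])            # hypothetical true means; arm 1 (index 0) is optimal
counts = np.array([800.0, 150.0, 50.0])   # hypothetical E[F_i(t)] after t = 1000 plays
t = counts.sum()

eta_direct = t * mu[0] - np.dot(mu, counts)    # t*mu_1 - sum_i mu_i E[F_i(t)]
eta_gaps = np.dot(mu[0] - mu[1:], counts[1:])  # sum_{i != 1} Delta_i E[F_i(t)]
print(eta_direct, eta_gaps)                    # both equal 50.0
```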
4.2 Bounds on Regret
We will now state and prove the boundedness of regret incurred by FMB.
Theorem 13. The total expected regret η(t), for t trials in an n-arm bandit problem, incurred by algorithm FMB is upper bounded by
$$\sum_{i \ne 1} \Delta_i\, O\big(l + \log(t)\big).$$
To prove theorem 13, we will now state a few lemmas that would be helpful. To start
with, we restate some of the definitions that were used in chapter 3 for convenience.
Definition 14. For an n-arm bandit problem where arm i has been chosen mi times (including the choice of each arm l times in the initiation phase), finding a reward ri,k when choosing the ith arm for the kth time, FMB computes the sets
$$N_i = \{r_{i,k} : 1 \le k \le m_i\}, \qquad L^j_{i,k} = \{r_{j,k'} : r_{j,k'} < r_{i,k},\ 1 \le k' \le m_j\}$$
and, using
$$\hat{P}(R_i = r_{i,k}) = \frac{|\{k' : r_{i,k'} = r_{i,k}\}|}{m_i},$$
computes the indices
$$A_i = \prod_{j \ne i} A_{ij}, \qquad A_{ij} = \sum_{r_{i,k} \in N_i} \hat{P}(R_i = r_{i,k}) \sum_{r_{j,k'} \in L^j_{i,k}} (r_{i,k} - r_{j,k'})^{\beta}\, \hat{P}(R_j = r_{j,k'})$$
to further choose arms. While doing this to find an ε-optimal arm with probability 1 − δ, FMB incurs a sample complexity of
$$O\left(n\left(\frac{1}{\mu_T^4}\ln^2\frac{1}{\delta}\right)^{\frac{1}{n}}\right)$$
where
$$\mu_T = \min_{i:\ \mu_i < \mu^* - \epsilon} |\mu_{T_i}|, \qquad \mu_{T_i} = E[T_i], \qquad T_i = A_i - A^*$$
with µ∗ and A∗ being the values that correspond to the optimal arm.
Lemma 16. Given the event Fi(t − 1) = y, the probability that FMB selects arm i at the tth play is bounded as
$$P\{\text{arm } i \text{ selected at } t \mid F_i(t-1) = y\} \le \exp\left(\frac{-\mu_{T_i}^2 \sqrt{y\,\big(t - y - 1 - l(n-2)\big)\, l^{\,n-2}}}{\sqrt{n}}\right).$$
Given any partial information about Fi(τ) for some τ < t and some i, this probability can be further bounded. Given that the event Fi(t − 1) = y is known to have occurred, this bound improves, and so we further bound as follows.
Now, the upper bound of $P(T_i(t) \ge 0)$ would be maximum when $\prod_k m_k$ is minimum. This is observed from (3.3.1) as
$$P(T_i(t) \ge 0) \le \exp\left(\frac{-2\mu_{T_i}^2\, q}{1 + \sqrt{n\left(q - \prod_k (m_k(t) - 1)\right)}}\right) \le \exp\left(\frac{-2\mu_{T_i}^2\, q}{1 + \sqrt{nq}}\right) \le \exp\left(\frac{-\mu_{T_i}^2 \sqrt{q}}{\sqrt{n}}\right) = \exp\left(\frac{-\mu_{T_i}^2 \sqrt{\prod_k m_k(t)}}{\sqrt{n}}\right). \qquad (4.2.4)$$
The minimum that $\prod_k m_k(t)$ would take for a given $\sum_k m_k(t) = t - 1$ is when the mi(t) are most unevenly distributed. When each arm is chosen l times in the initiation phase of the algorithm, we would have
$$\prod_k m_k(t) \ge y\,\big(t - 1 - l(n-2) - y\big)\, l^{\,n-2} \qquad (4.2.5)$$
which occurs when all the instances of arm choices (other than mi(t) = y and the initiation phase) correspond to a single particular arm other than i, say j. From (4.2.2), (4.2.3), (4.2.4) and (4.2.5), we would have the bound stated in lemma 16.
Lemma 17. For a, b ∈ ℕ, the sum of square roots of consecutive numbers is lower bounded as
$$\sum_{h=a}^{b} \sqrt{h} \ \ge\ \frac{2}{3}\left(b^{3/2} - (a-1)^{3/2}\right).$$
Proof.
$$\sum_{h=a}^{b} \sqrt{h} \ \ge\ \int_{a-1}^{b} \sqrt{\upsilon}\, d\upsilon \ =\ \frac{2}{3}\left(b^{3/2} - (a-1)^{3/2}\right).$$
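A one-line numerical check of Lemma 17 in Python, with arbitrary a and b:

```python
import math

a, b = 5, 100
lhs = sum(math.sqrt(h) for h in range(a, b + 1))
rhs = (2.0 / 3.0) * (b**1.5 - (a - 1)**1.5)
print(lhs >= rhs, round(lhs, 2), round(rhs, 2))   # True: the sum dominates the integral bound
```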
Proof of Theorem 13. From (4.1.1), we see that to bound the expected regret is to bound E[Fi(t)] for all i ≠ 1. The event {arm i selected at t} is $\{A_i(t) > A_j(t)\ \forall j\}$, of course with j ≠ i. Consider the expansion of E[Fi(t)] obtained by conditioning on Fi(t − 1), labelled (4.2.6), where the first step uses the law of total (or iterated) expectation. Choose ε such that there exists only one ε-optimal arm; then the summation in (4.1.1) runs over only non-ε-optimal arms, and for all such i we can use the bounds on $P(T_i(t) \ge 0 \mid m_i(t))$ from corollary 15 and lemma 16. Hence, bounding the probability of the event {arm i selected at t | Fi(t − 1) = y}, (4.2.6) turns into
$$E[F_i(t)] \le E\left[(y+1)\exp\left(\frac{-\mu_{T_i}^2\sqrt{y\,(t - y - 1 - l(n-2))\,l^{\,n-2}}}{\sqrt{n}}\right) + y\right].$$
Denoting $\frac{\mu_{T_i}^2}{\sqrt{n}}$ by the positive constant $\lambda_i$ and using E[y] = E[Fi(t − 1)], we have the relation
$$E[F_i(t)] \le E[F_i(t-1)] + E\left[(y+1)\exp\left(-\lambda_i \sqrt{y\,(t - y - 1 - l(n-2))\,l^{\,n-2}}\right)\right].$$
Unrolling this recursion from the end of the initiation phase,
$$E[F_i(t)] \le E[F_i(nl)] + \sum_{w=nl+1}^{t} E\left[(y_w+1)\exp\left(-\lambda_i \sqrt{y_w\,(w - y_w - 1 - l(n-2))\,l^{\,n-2}}\right)\right] \le l + \sum_{w=nl+1}^{t} E\left[(y_w+1)\exp\left(-\lambda_i \sqrt{y_w\,(w - y_w - 1 - l(n-2))\,l^{\,n-2}}\right)\right]$$
where yw corresponds to the number of times the arm under consideration was chosen in w − 1 plays. Defining
$$G_i(w) = E\left[(y_w+1)\exp\left(-\lambda_i \sqrt{y_w\,(w - y_w - 1 - l(n-2))\,l^{\,n-2}}\right)\right],$$
we have
$$\eta(t) \le \sum_{i \ne 1} \Delta_i\left(l + \sum_{w=nl+1}^{t} G_i(w)\right). \qquad (4.2.7)$$
Now,
$$G_i(w) = \sum_{d=l}^{\,w-1-l(n-1)} \Big[(d+1)\exp\left(-\lambda_i \sqrt{d\,(w - d - 1 - l(n-2))\,l^{\,n-2}}\right)\; P\big\{\|\{\tau : A_i(\tau) > A_j(\tau)\ \forall j,\ nl+1 \le \tau \le w-1\}\| = d - l\big\}\Big] \qquad (4.2.8)$$
where the probability is that which corresponds to arm i being selected d − l times between the (nl + 1)th and (w − 1)th plays, inclusive of the ends, with ‖S‖ denoting the cardinality of the set S. Denoting the plays (or epochs) in which arm i could have been selected by the indicator variable I(τ), we have
$$I(\tau) = \begin{cases} 1 & A_i(\tau) > A_j(\tau)\ \forall j \\ 0 & \text{otherwise.} \end{cases}$$
Defining $I_{max}$ as the assignment of the indicators $I(\tau)$ that maximizes this probability (4.2.9), with
$$P_{max} = \prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \prod_{\tau : I_{max}(\tau) = 0} P\{A_i(\tau) < A_j(\tau)\ \text{for some } j \mid I_{max}\},$$
$$P_{max} \le \prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \qquad (4.2.10)$$
follows. Also,
$$\prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \le \prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_1(\tau) \mid I_{max}\} \le \exp\left(-\lambda_i \sum_{\tau : I_{max}(\tau) = 1} \sqrt{\prod_k m_k(\tau)}\right).$$
A maximum upper bound would require $\prod_k m_k(\tau)$ to be at its minimum for all τ. This would occur as $I_{max}(\tau) = 1$ for all $\tau : nl+1 \le \tau \le nl + (d - l)$, when arm i is chosen repeatedly in the first d − l turns just after the initiation phase. Then, $\prod_k m_k(\tau)$ for $\tau : I_{max}(\tau) = 1$ are given by
$$\prod_k m_k(\tau) = (l + \tau - nl)\, l^{\,n-1} \qquad \forall \tau : nl+1 \le \tau \le nl + (d-l).$$
So,
$$\prod_{\tau : I_{max}(\tau) = 1} P\{A_i(\tau) > A_j(\tau)\ \forall j \mid I_{max}\} \le \exp\left(-\lambda_i \sum_{\tau = nl+1}^{nl+(d-l)} \sqrt{(l + \tau - nl)\, l^{\,n-1}}\right) \le \exp\left(-\lambda_i\, l^{\frac{n-1}{2}} \sum_{k=l+1}^{d} \sqrt{k}\right) \qquad (4.2.11)$$
which can be refined further using lemma 17. Using (4.2.9), (4.2.10), (4.2.11) and lemma 17, we bound Gi(w) in (4.2.8) as
$$G_i(w) \le \sum_{d=l}^{\,w-1-l(n-1)} (d+1)\binom{w - nl - 1}{d - l}\exp\left(-\lambda_i\left(\sqrt{d\,(w - d - 1 - l(n-2))\,l^{\,n-2}} + \frac{2}{3}\, l^{\frac{n-1}{2}}\left(d^{3/2} - l^{3/2}\right)\right)\right). \qquad (4.2.12)$$
Back to the original problem, if we show that E[Fi(t)] grows slower than log(t) asymptotically, for all i, then it is sufficient to prove the regret is O(log(t)). For this, see that
$$E[F_i(t)] \le l + \sum_{w=nl+1}^{t} G_i(w) = g(t),$$
say. Then, for the regret to grow slower than logarithmically in t, it is sufficient to show that the derivative of the upper bound g(t) grows slower than 1/t. Seeing that g(t) − g(t − 1) = Gi(t), it is enough to show that Gi(t) is bounded by 1/t asymptotically. Consider
$$\Gamma_d = (d+1)\binom{w - nl - 1}{d - l}\exp\left(-\lambda_i\left(\sqrt{d\,(w - d - 1 - l(n-2))\,l^{\,n-2}} + \frac{2}{3}\, l^{\frac{n-1}{2}}\left(d^{3/2} - l^{3/2}\right)\right)\right).$$
There exists $d^*$ such that $\Gamma_{d^*} \ge \Gamma_d$ for all $d \ne d^*$, and so we bound
$$G_i(w) \le (w - ln)\,\Gamma_{d^*}.$$
Now,
$$\lim_{t\to\infty} \frac{G_i(t)}{1/t} = \lim_{t\to\infty} t\, G_i(t) \le \lim_{t\to\infty} \frac{t\,(t - ln)\,(d^*+1)\binom{t - nl - 1}{d^* - l}}{\exp\left(\lambda_i\left(\sqrt{d^*\,(t - d^* - 1 - l(n-2))\,l^{\,n-2}} + \frac{2}{3}\, l^{\frac{n-1}{2}}\left(d^{*3/2} - l^{3/2}\right)\right)\right)}.$$
Since each term in the numerator is bounded by a polynomial while the exponential in the denominator is not,
$$\lim_{t\to\infty} \frac{G_i(t)}{1/t} = 0.$$
Since the growth of E[Fi(t)] is bounded by O(log(t)) for all i, we see that the proposed algorithm FMB has optimal regret as characterized by Lai & Robbins (1985), with the total regret incurred up to time t given by
$$\sum_{i \ne 1} \Delta_i\, O\big(l + \log(t)\big).$$
Chapter 5
In this chapter, we analyze the proposed algorithm FMB, with regard to its properties,
performance and tunability.
We now look at the sample complexity incurred for every arm by FMB in (3.3.2) and its dependence on ε. Unlike for MEA, this dependence of the sample complexity is quite different for FMB, and is illustrated in Figure 5.1.1. A decrease in ε would lead to an increase in complexity only when the set {i : µi < µ1 − ε} in (3.3.3) changes due to the decrease in ε. The set of arms that determine the complexity is related to the true means in the original problem. We also note that this should ideally be the case, in that a decrease in ε that does not lead to any decrease in the number of ε-optimal arms should not increase the sample complexity. This is because the algorithm still performs so as to select exactly the same ε-optimal arms as before the decrease in ε.
Figure 5.1.1: Comparison of FMB with MEA on the functional dependence of Sample Complexity on ε

While cumulative regret, referred to simply as regret in general, is defined in the expected sense as $\eta(t) = t\mu^* - E[Z_t]$, where Zt is the total reward acquired over t pulls and $\mu^* = \max_i \mu_i$, there exists another quantification of regret that is related to the sample complexity in some sense. Bubeck et al. (2009) discuss links between cumulative regret and this new quantification, called simple regret, defined as
$$\phi(t) = \mu^* - \mu_{\psi_t}$$
where $\mu_{\psi_t} = \sum_i \mu_i\, \psi_{i,t}$, with $\psi_{i,t}$ the probability of choosing arm i as output by the policy-learning algorithm for the (t + 1)th pull. Essentially, ψt is the policy learnt by the algorithm after t pulls. Note that φ(t) after an (ε, δ)-PAC-guaranteeing exploration is essentially related to ε, in the sense that φ(t) ≤ ε with probability at least 1 − δ.
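A minimal sketch of the simple-regret computation, with a made-up policy ψt for illustration:

```python
import numpy as np

mu = np.array([0.9, 0.7, 0.5])      # hypothetical true means; mu* = 0.9
psi = np.array([0.8, 0.15, 0.05])   # hypothetical policy psi_t: arm-selection probabilities after t pulls

simple_regret = mu.max() - np.dot(mu, psi)   # phi(t) = mu* - mu_{psi_t}
print(simple_regret)                         # 0.05
```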
Bubeck et al. (2009) find dependencies between φ(t) and η(t) and state that the smaller
φ(t) can get, the larger η(t) would have to be. We now analyze how FMB attains this
relation.
Consider improving the policy-learning algorithm with respect to φ(t). This could be achieved by keeping β constant and increasing l. This may either reduce µT as seen from (3.3.2), and hence ε as seen from (3.3.3), or simply reduce δ as seen from (3.3.2), both ways improving the simple regret. On the other hand, the regret η(t), otherwise specifically called cumulative regret, directly depends on l as seen from (4.2.7). For a constant number of pulls t, the summation of Gi(w) in (4.2.7) runs over fewer terms for an increasing l. As the cumulative regret is dominated by the regret incurred during the initiation phase, it is evident that by increasing l we obtain worse bounds on regret. Thus, we observe a Simple vs. Cumulative Regret trade-off in the proposed algorithm FMB.
The sample complexity in (3.3.2) depends on µT, which is given by
$$\mu_T = \min_{i:\ \mu_i < \mu^* - \epsilon} |\mu_{T_i}|$$
with
$$\mu_{T_i} = E[A_i - A_1] = \mu_i^{\,n-1}\prod_{j\ne i}\left[r_i^{\,\beta-1} + \frac{\mu_j}{r_i r_j}\left(I_{ij}\,(r_i - r_j)^{\beta} - r_i^{\beta}\right)\right] - \mu_1^{\,n-1}\prod_{j\ne 1}\left[r_1^{\,\beta-1} + \frac{\mu_j}{r_1 r_j}\left(I_{1j}\,(r_1 - r_j)^{\beta} - r_1^{\beta}\right)\right].$$
Consider
$$E[A_i] = \mu_i^{\,n-1}\prod_{j\ne i}\left[r_i^{\,\beta-1} + \frac{\mu_j}{r_i r_j}\left(I_{ij}\,(r_i - r_j)^{\beta} - r_i^{\beta}\right)\right] = \zeta_{i,1}\prod_{j\ne i}\left[\zeta_{i,2}\,\zeta_{i,3}^{\beta} + \zeta_{i,4}\,\zeta_{i,5}^{\beta} + \zeta_{i,6}\,\zeta_{i,7}^{\beta}\right]$$
where the ζi,j are constants given the problem at hand, essentially the µi and ri for all i. This can be further written as
$$E[A_i] = \sum_{(\theta_1,\theta_2,\theta_3):\ \sum_j \theta_j = n-1} \zeta_{i,1}\;\zeta_{i,2}^{\theta_1}\,\zeta_{i,4}^{\theta_2}\,\zeta_{i,6}^{\theta_3}\left\{\zeta_{i,3}^{\theta_1}\,\zeta_{i,5}^{\theta_2}\,\zeta_{i,7}^{\theta_3}\right\}^{\beta}$$
exhibiting monotonicity in β. Hence we see that µTi can be broken down into a finite summation,
$$\mu_{T_i} = \sum_{l}\left(\kappa_{i,1,l}\,\kappa_{i,2,l}^{\beta} - \kappa_{i,3,l}\,\kappa_{i,4,l}^{\beta}\right)$$
where the κi,j,l are constants given the problem at hand. Thus, by tuning β, we can control µTi and hence µT. For a better sample complexity, i.e. a lower l, we need a higher µT. This can be achieved by increasing β, since µT is monotonic in β. But there is a definite restriction on the variation of the parameter β. Correctness of the algorithm requires E[Ai] > E[Aj] for every µi > µj. Let us call the set of β adhering to this restriction βS. Since E[Ai] is monotonic in β, we see that βS must be an interval on the real line.
Now consider the bounds on regret, given together by (4.2.7) and (4.2.12). Given the sample complexity, or equivalently given l, the regret can still be improved, as it depends on $\lambda_i = \frac{\mu_{T_i}^2}{\sqrt{n}}$, which the algorithm has control over through β. So, β can be increased to improve the bounds on regret through an increase in λi, keeping l constant. But with an increase in λi, or equivalently |µTi|, µT is expected to increase as well. In this respect, we believe the β for the best regret, given an l, would be fractional, and would be that value which, when increased by the smallest amount, results in a decrease of the sample complexity parameter l by 1. While it has been observed experimentally that the best regret occurs for a fractional value of β, we cannot be sure whether 'the β' for the best regret was indeed achieved.
To summarize, we can control the algorithm FMB with regard to sample complexity
or regret. But this tuning will inevitably have trade-offs between simple and cumulative
regrets, φ(t) and η(t) respectively, consistent with the findings of Bubeck et al. (2009).
Nevertheless, there is a definite interval of restriction βS , defined by the problem at
hand, on the tuning of β.
Thus, if some β ∈ βS could achieve $\mu_T > \sqrt{\sqrt{n}\,\ln n}$, then we would have an algorithm achieving O(n) sample complexity with l = 1.
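A tiny Python check of this condition (the δ term is dropped here, matching the statement above; the µT values are hypothetical):

```python
import numpy as np

def single_initiation_pull_suffices(n, mu_T):
    """True when mu_T > sqrt(sqrt(n) * ln n), i.e. mu_T^4 > n ln^2 n,
    so the per-arm length from (3.3.2) drops to l = 1."""
    return mu_T > np.sqrt(np.sqrt(n) * np.log(n))

print(single_initiation_pull_suffices(10, mu_T=3.0),
      single_initiation_pull_suffices(10, mu_T=1.0))   # True False
```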
Chapter 6
A 10-arm bandit test bed with rewards formulated as Gaussian (with varying means and variances) or Bernoulli was developed for the experiments. Probabilistic and Greedy action selections (pFMB and FMB respectively) were performed on the quantities Ai.
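A sketch of such a test bed in Python; the means are drawn from a standard normal here purely for illustration (the thesis specifies varying means and unit variance but not how the means are generated):

```python
import numpy as np

def make_gaussian_testbed(n_arms=10, seed=0):
    """10-arm Gaussian test bed: each arm i returns N(mean_i, 1) rewards."""
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 1.0, n_arms)   # assumed way of picking the varying means
    def pull(i):
        return rng.normal(means[i], 1.0)
    return means, pull

means, pull = make_gaussian_testbed()
print(means.argmax(), pull(3))   # index of the optimal arm and one sampled reward
```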
Here we describe comparisons of FMB and pFMB with traditional approaches that do
not necessarily guarantee sample complexity or regret bounds. A 10-arm bandit test
bed with rewards formulated as Gaussian with varying means and a variance of 1 was
developed for the experiments. Figure 6.1.1 shows the average reward and cumulative optimal selections with plays, averaged over 2000 tasks on the test bed. 'Greedy Ai' and 'Exploring Ai' are the curves corresponding to the performances of FMB and pFMB respectively, compared against the SoftMax algorithm (temperature 0.24).
Figure 6.1.1: Two variants of the proposed algorithm, Greedy and Probabilistic action
selections on Ai , are compared against the SoftMax algorithm
[Figure: Average Reward and % Optimal Selections with plays, over 5000 plays.]
We now compare FMB with state-of-the-art algorithms that guarantee either sample complexity or regret bounds. Comparisons of FMB with UCB-Revisited, henceforth called UCB-Rev, which was shown to incur low regret [Auer & Ortner (2010)], are depicted in Figure 6.2.1. The experiments were conducted on the 10-arm bandit test bed with specific random seeds (Gaussian or Bernoulli rewards) and the results were averaged over 200 different tasks or trials. Note that UCB-Rev was provided with knowledge of the horizon, the number of plays T, which aids in its optimal performance. Furthermore, no parameter optimization was performed for FMB, and the experiments were conducted with β = 0.85. We see that FMB performs substantially better in terms of cumulative regret and its growth, even with the knowledge of the horizon provided to UCB-Rev.
Comparisons of FMB with the Median Elimination Algorithm (MEA) [Even-Dar et al. (2006)], which was also shown to achieve O(n) sample complexity, are shown in Figure 6.2.2. Here we compare the performances of the two algorithms when incurring the same sample complexity.
Figure 6.2.1: Comparison of FMB with UCB-Rev [Auer & Ortner (2010)] on Bernoulli
and Gaussian rewards, respectively
Figure 6.2.2: Comparison of FMB with Median Elimination Algorithm (MEA) [Even-
Dar et al. (2006)] on Bernoulli and Gaussian rewards, respectively
Chapter 7
The proposed class of algorithms is, to the best of our knowledge, the first to use fractional moments in the bandit literature. Specifically, the class is shown to possess algorithms that provide PAC guarantees with O(n) complexity in finding an ε-optimal arm, in addition to algorithms incurring the theoretically lowest regret of O(log(t)). Experimental results support this, showing that the algorithm achieves substantially lower regret not only when compared with parameter-optimized ε-greedy and SoftMax methods but also with state-of-the-art algorithms like MEA [Even-Dar et al. (2006)] (when compared in achieving the same sample complexity) and UCB-Rev [Auer & Ortner (2010)]. Minimizing regret has been a crucial factor in various applications; for instance, Agarwal et al. (2009) describe its relevance to content publishing systems that select articles to serve hundreds of millions of user visits per day. In this regard, FMB performs 20 times faster (empirically) compared to UCB-Rev. In addition to performance improvements, FMB provides a neat way of controlling sample complexity and regret with the help of a single parameter. To the best of our knowledge, this is the first work to introduce control of sample complexity and regret while learning bandits.
We note that as the reward distributions are relaxed from Ri ∈ {0, ri} to continuous probability distributions, the sample complexity in (3.3.2) improves further. To see this, consider a gradual blurring of the Bernoulli distribution into a continuous distribution. The probabilities P{Ai > A1} would increase due to the inclusion of new reward possibilities in the event space. So we expect even lower sample complexities (in terms of the constants involved) with continuous reward distributions. But the trade-off is really in the computations involved. The algorithm presented can be implemented incrementally, but it requires that the set of rewards observed so far be stored. On the other hand, the algorithm simplifies computationally to a much faster approach in the case of Bernoulli distributions (the essential requirement is rather a low-cardinality reward support), as the mean estimates can be used directly. As the cardinality of the reward set increases, so will the complexity of the computation. Since most rewards are encoded manually, specific to the problem at hand, we expect low-cardinality reward supports, where the low sample complexity achieved would greatly help without a major increase in computations.
The theoretical analysis assumes greedy action selections on the quantities Ai, and hence any further exploration beyond the initiation phase (for instance, the exploration performed by the algorithm pFMB) is not accounted for. Bounds on the sample complexity or regret of the exploratory algorithm pFMB would help in better understanding the explore-exploit situation, as to whether a greedy FMB could beat an exploratory FMB. While FMB incurs a sample complexity of O(n), determination of an appropriate l given ε and δ is another direction to pursue. In addition, note from (3.3.2) and (3.3.3) that we sample all arms with the worst li, where li is the necessary number of pulls for arm i to ensure PAC guarantees with O(n). We could reduce the sample complexity further if we formulate the initiation phase with li pulls of arm i, which requires further theoretical footing on the determination of appropriate values for l. The method of tuning the parameter β, and the estimation of βS, are aspects to pursue for better use of the algorithm in unknown environments with no knowledge of the reward support. A further variant of the proposed algorithms, allowing β to change while learning, could be an interesting possibility to look at. Extensions to the multi-slot framework, applications to contextual bandits and extensions to the complete Reinforcement Learning problem are some useful avenues to pursue.
Bibliography
Even-Dar E., Mannor S. and Mansour Y. (2006). Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7, 1079–1105.

Berry D. A. and Fristedt B. (1985). Bandit Problems. Chapman and Hall Ltd.

Auer P., Cesa-Bianchi N. and Fischer P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47, 235–256.

Lai T. and Robbins H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

Agrawal R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.

Min S. and Chrysostomos L. N. (1993). Signal Processing with Fractional Lower Order Moments: Stable Processes and Their Applications. Proceedings of the IEEE, 81(7), 986–1010.

Achim A. M., Canagarajah C. N. and Bull D. R. (2005). Complex wavelet domain image fusion based on fractional lower order moments. 8th International Conference on Information Fusion, Vol. 1, 25–28.

Xinyu M. and Nikias C. L. (1996). Joint estimation of time delay and frequency delay in impulsive noise using fractional lower order statistics. IEEE Transactions on Signal Processing, 44(11), 2669–2687.

Agarwal D., Chen B., Elango P., Motgi N., Park S., Ramakrishnan R., Roy S. and Zachariah J. (2009). Online models for content optimization. Advances in Neural Information Processing Systems 21.

Bubeck S., Munos R. and Stoltz G. (2009). Pure Exploration in Multi-Armed Bandit Problems. 20th International Conference on Algorithmic Learning Theory (ALT).

Auer P. and Ortner R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2), 55–65.

Kim S. and Nelson B. (2001). A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11(3), 251–273.