Generalized Second Order Value Iteration in
Markov Decision Processes
Chandramouli Kamanchi*, Raghuram Bharadwaj Diddigi*, Shalabh Bhatnagar

*Equal contribution. The authors are with the Department of Computer Science and Automation, Indian Institute of Science (IISc), Bengaluru 560012, India (e-mails: {chandramouli, raghub, shalabh}@iisc.ac.in). Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J. C. Bose Fellowship, a project from DST under the ICPS Program, and the RBCCPS, IISc.

©2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. This paper is accepted for publication at IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2021.3112851
Abstract—Value iteration is a fixed point iteration technique
utilized to obtain the optimal value function and policy in a
discounted reward Markov Decision Process (MDP). Here, a
contraction operator is constructed and applied repeatedly to
arrive at the optimal solution. Value iteration is a first order
method and therefore it may take a large number of iterations
to converge to the optimal solution. Successive relaxation is a
popular technique that can be applied to solve a fixed point
equation. It has been shown in the literature that, under a
special structure of the MDP, successive over-relaxation technique
computes the optimal value function faster than standard value
iteration. In this work, we propose a second order value iteration
procedure that is obtained by applying the Newton-Raphson
method to the successive relaxation value iteration scheme. We
prove the global convergence of our algorithm to the optimal
solution asymptotically and show the second order convergence.
Through experiments, we demonstrate the effectiveness of our
proposed approach.
I. INTRODUCTION
In a discounted reward Markov Decision Process [2], the objective is to maximize the expected cumulative discounted reward. Reinforcement Learning (RL) deals with the algorithms
for solving an MDP problem when the model information (i.e.,
probability transition matrix and reward function) is unknown.
RL algorithms instead make use of state and reward samples
and estimate the optimal value function and policy. Due to the
success of deep learning [12], RL algorithms in combination
with deep neural networks have been successfully deployed
to solve many real world problems and games [13]. However,
there is ongoing research for improving the sample-efficiency
as well as convergence of RL algorithms [10].
Many RL algorithms can be viewed as stochastic approximation [3] variants of the Bellman equation [1] in MDPs. For
example, the popular Q-learning algorithm [19] can be viewed
as a stochastic fixed point iteration to solve the Q-Bellman
equation. Therefore, we believe that in order to improve the
performance of RL algorithms, a promising first step would
be to propose faster algorithms for solving MDPs when the
model information is known. To this end, we propose a second
order value iteration technique that has global convergence,
which is a desirable property. In [17], the successive over-relaxation technique is applied to the Bellman equation to
obtain a faster value iteration algorithm. In this work, we
propose a Generalized Second Order Value Iteration (G-SOVI)
method for computing the optimal value function and policy
when the model information is known. This is achieved by
the application of Newton-Raphson method to the Successive
relaxation variant of the Q-Bellman Equation (henceforth
denoted as SQBE). The key differences between G-SOVI and
standard SOVI algorithms are the incorporation of relaxation
parameter w and the construction of single-stage reward as
discussed in Section IV.
Note that we cannot directly apply the Newton-Raphson
method to SQBE because of the presence of max(.) operator
in the equation, which is not differentiable. Therefore, we
approximate the max operator by a smooth function gN [14],
where N is a given parameter. This approximation allows us
to apply the second order method, thereby ensuring a faster rate
of convergence.
The solution obtained by our second order technique on
the modified SQBE may be different from the solution of the
original MDP problem because of the approximation of the
max operator by gN . However, we show that our proposed
algorithm converges to the actual solution as N → ∞.
We show through numerical experiments that, given a finite
number of iterations, our proposed algorithm computes a solution that is closer to the actual solution
than the one obtained from standard value iteration. Moreover, under
a special structure of the MDP, the solution is also better than that of standard
second-order value iteration [18].
II. RELATED WORK AND OUR CONTRIBUTIONS
Value Iteration and Policy Iteration are two classical numerical techniques employed for solving MDP problems. In [16],
it has been shown that the Newton-Kantorovich (also known as the Newton-Raphson) method applied
to the exact Bellman equation gives rise to the policy iteration
scheme. In (Section 2.5 of [18]), a second-order value iteration
technique (we refer to it as standard SOVI) is proposed by
applying Newton-Kantorovich method to the smooth (softmax) Bellman equation and remarks about second-order rate
and global convergence are provided. However, convergence
analysis is not discussed for the SOVI technique. Approximate
Newton methods have been proposed in [6] for policy optimization in MDPs. A detailed analysis of the Hessian of the
objective function is provided and algorithms are derived. In
recent times, smooth Bellman equation has been successfully
used in the development of many RL algorithms. For instance,
in [9], a soft Q-learning algorithm has been proposed that
learns the maximum entropy policies. The algorithm makes
use of the smooth Q-Bellman equation with an additional
entropy term. In [4], SBEED (Smoothed Bellman Error Embedding) algorithm has been proposed which computes the
optimal policy by formulating the smooth Bellman equation
as a primal-dual optimization problem. In [5], a matrix-gain
learning algorithm namely Zap Q-learning has been proposed
which is seen to have similar performance as the Newton-Raphson method. Very recently, an accelerated value iteration
technique has been proposed in [8] by applying Nesterov's accelerated gradient technique to value iteration.
We now summarize the main contributions of our paper:
• We propose a generalized second order Q-value iteration
algorithm that is derived from the successive relaxation
technique as well as the Newton-Raphson method. In fact,
we show that standard SOVI is a special case of our
proposed algorithm.
• We prove the global convergence of our algorithm and
provide a second order convergence rate result.
• We derive a bound on the error defined in terms of the
value function obtained by our proposed method and the
actual value function and show that the error vanishes
asymptotically.
• Through experimental evaluation, we further confirm that
our proposed technique provides a better near-optimal
solution compared to that of the value iteration procedure
when run for the same (finite) number of iterations.
III. BACKGROUND AND PRELIMINARIES
A discounted reward Markov Decision Process (MDP)
is characterized via a tuple (S, A, p, r, γ), where S = {1, 2, . . . , i, . . . , j, . . . , M} denotes the set of states, A = {a_1, . . . , a_K} denotes the set of actions, and p is the transition probability rule, i.e., p(j|i, a) denotes the probability of transition from state i to state j when action a is chosen. Also,
r(i, a, j) denotes the single-stage reward obtained in state i
when action a is chosen and the next state is j. Finally,
0 ≤ γ < 1 denotes the discount factor. The objective in an
MDP is to learn an optimal policy π : S → A, where π(i)
denotes the action to be taken in state i, that maximizes the
cumulative discounted reward objective given by:

E[ Σ_{t=0}^{∞} γ^t r(s_t, π(s_t), s_{t+1}) | s_0 = i ].    (1)
In (1), s_t is the state at time t and E[·] is the expectation
taken over the entire trajectory of states obtained over times
t = 1, . . . , ∞. Let V^*(·) be the value function, with V^*(i)
being the value of state i that represents the total discounted
reward obtained starting from state i and following the optimal
policy π. The value function can be obtained by solving the
Bellman equation [2] given by:
V^*(i) = max_{a∈A} { Σ_{j=1}^{M} p(j|i, a) ( r(i, a, j) + γ V^*(j) ) },  ∀ i ∈ S.    (2)
We assume here for simplicity that all actions are feasible
in every state. Value iteration is a popular numerical scheme
employed to obtain the value function and thereby the optimal
policy. It works as follows: an initial estimate V_0 of the value
function is selected arbitrarily and a sequence V_n, n ≥ 1,
is generated in an iterative fashion as below:
V_n(i) = max_{a∈A} { Σ_{j=1}^{M} p(j|i, a) ( r(i, a, j) + γ V_{n-1}(j) ) },  n ≥ 1, ∀ i ∈ S.    (3)
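As a concrete illustration of the update (3), here is a minimal NumPy sketch; the array layout (p[i, a, j] = p(j|i, a), r[i, a, j] = r(i, a, j)) and the function name are our own assumptions, not code from the authors.

import numpy as np

# Sketch of standard value iteration, eq. (3); assumes p and r are arrays of
# shape (S, A, S) with p[i, a, j] = p(j|i, a) and r[i, a, j] = r(i, a, j).
def value_iteration(p, r, gamma, num_iters):
    S = p.shape[0]
    V = np.zeros(S)
    for _ in range(num_iters):
        # Q[i, a] = sum_j p(j|i, a) * (r(i, a, j) + gamma * V[j])
        Q = np.sum(p * (r + gamma * V[None, None, :]), axis=2)
        V = Q.max(axis=1)
    # greedy policy extracted from the final value estimate
    Q = np.sum(p * (r + gamma * V[None, None, :]), axis=2)
    return V, Q.argmax(axis=1)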
Let ζ denote the set of all bounded functions from S to R.
Note that equation (2) can be rewritten as:
V^* = T V^*,    (4)
where the operator T : ζ → ζ is defined by:

(T V)(i) = max_{a∈A} { r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) V(j) },

and r(i, a) = Σ_{j=1}^{M} p(j|i, a) r(i, a, j) is the expected single-stage
reward in state i when action a is chosen. It is easy to see that
T is a sup-norm contraction map with contraction factor γ, i.e.,
the discount factor. Therefore, from the contraction mapping
theorem, it is clear that the value iteration scheme given by
equation (3) converges to the optimal value function, i.e.,
V^* = lim_{n→∞} V_n = T V^*.    (5)
Let Q^*(i, a), (i, a) ∈ S × A, be defined as:

Q^*(i, a) := r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) V^*(j).    (6)
Here Q∗ (i, a) is the optimal Q-value function associated
with state i and action a. It denotes the total discounted
reward obtained starting from state i upon taking action a
and following the optimal policy in subsequent states. Then
from (2), it is clear that
V^*(i) = max_{a∈A} Q^*(i, a).    (7)
Therefore, equation (6) can be rewritten as follows:

Q^*(i, a) = r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) max_{b∈A} Q^*(j, b).    (8)
This is known as the Q-Bellman equation. We obtain the
optimal policy by letting
π(i) = arg max_{a∈A} Q^*(i, a).    (9)
In [17], a modified value iteration algorithm is proposed based
on the idea of successive relaxation. Let us define

w^* = 1 / ( 1 − γ min_{i,a} p(i|i, a) ).    (10)
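For illustration, w^* in (10) can be read off directly from the transition probabilities; the following sketch assumes (our own convention, not the authors' code) that p is stored as an array of shape (S, A, S) with p[i, a, j] = p(j|i, a).

import numpy as np

# Sketch: compute w* of eq. (10) from the self-transition probabilities p(i|i, a).
def optimal_relaxation(p, gamma):
    min_self_loop = np.einsum('iai->ia', p).min()  # min over (i, a) of p(i|i, a)
    return 1.0 / (1.0 - gamma * min_self_loop)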
Note that w^* ≥ 1. For 0 < w ≤ w^*, we define a modified
Bellman operator as follows:

T_w(V) = w(T V) + (1 − w)V,    (11)
where w is called the ‘relaxation’ parameter. It is easy to see
that the fixed point of Tw is also the optimal value function of
the MDP (fixed point of T ). Moreover, it is shown in [17] that
the contraction factor of T_w is 1 − w + wγ. Under a special
structure of the MDP, i.e., with p(i|i, a) > 0, ∀i, a, we have
w∗ > 1 (strictly greater than 1). Then, the relaxation parameter
w can be chosen in three possible ways:
1) If 0 < w < 1, then the contraction factor of Tw is more
than the contraction factor of T .
2) If w = 1, then T = Tw and hence the contraction factors
of both operators are the same.
3) If 1 < w ≤ w∗ , the contraction factor of Tw is less than
the contraction factor of T . This implies that the fixed
point iteration utilizing (11) generates the optimal value
function faster than the standard value iteration.
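As a hedged numerical illustration (the values here are our own, not taken from the paper): if γ = 0.9 and min_{i,a} p(i|i, a) = 0.2, then w^* = 1/(1 − 0.9 × 0.2) ≈ 1.22 and the contraction factor of T_{w^*} is 1 − w^* + w^*γ = 1 − w^*(1 − γ) ≈ 0.878, compared with γ = 0.9 for the standard operator T. Each application of (11) with w = w^* therefore shrinks the distance to the fixed point by roughly 12%, versus 10% for standard value iteration.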
In [11], a successive relaxation Q-Bellman equation (we
call it the Generalized Q-Bellman equation) is constructed as
follows:

Q_w(i, a) = w( r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) max_{b∈A} Q_w(j, b) ) + (1 − w) max_{c∈A} Q_w(i, c),    (12)

where 0 < w ≤ w^*. It has been shown that, although the
Q-values obtained by (12) can be different from the optimal
Q-values, the optimal value functions are still the same. That
is,

max_{a∈A} Q_w(i, a) = max_{a∈A} Q^*(i, a),  ∀ i ∈ S.    (13)

The Generalized Q-values (Q_w in (12)) are obtained as
follows. An initial estimate Q_0 of Q_w is arbitrarily selected
and a sequence Q_n, n ≥ 1, is obtained according to:

Q_n(i, a) = w( r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) max_{b∈A} Q_{n-1}(j, b) ) + (1 − w) max_{c∈A} Q_{n-1}(i, c),  ∀ (i, a).    (14)
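A minimal sketch of iteration (14) is given below; as before, the array layout (p[i, a, j] = p(j|i, a), r_bar[i, a] = r(i, a)) is our own assumption.

import numpy as np

# Sketch of the successive relaxation Q-iteration, eq. (14).
def sor_q_iteration(p, r_bar, gamma, w, Q0, num_iters):
    Q = np.array(Q0, dtype=float)
    for _ in range(num_iters):
        V = Q.max(axis=1)                                      # max_b Q_{n-1}(j, b)
        Q = w * (r_bar + gamma * np.einsum('iaj,j->ia', p, V)) \
            + (1.0 - w) * V[:, None]                           # (1 - w) max_c Q_{n-1}(i, c)
    return Q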
It is shown in [11] that the Q-values obtained by (14) converge to the generalized Q-values Q_w. In this way, we obtain
the optimal value function and optimal policy using the successive
relaxation Q-value iteration scheme. In this work, our objective
is to approximate the generalized Q-Bellman equation and
apply the Newton-Raphson second order technique to solve
for the optimal value function. Recall that we cannot apply
the second order method directly to equation (12), as the
max(.) operator on the RHS is not differentiable. Before
we propose our algorithm, we briefly discuss Newton's
second order technique [15] for solving a non-linear system
of equations.
Consider a function F : R^d → R^d that is twice differentiable. Suppose we are interested in finding a root of F, i.e., a
point x such that F(x) = 0. The Newton-Raphson method can
be applied to find a solution here. We select an initial point
x_0 and then proceed as follows:

x_n = x_{n-1} − J_F^{-1}(x_{n-1}) F(x_{n-1}),  n ≥ 1,    (15)

where J_F(x) is the Jacobian of the function F evaluated at the
point x. Under suitable hypotheses it can be shown that the
procedure (15) leads to second order convergence.

In the next section, we construct a function F for our
problem and apply the Newton-Raphson method to find the
optimal value function and policy pair.
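Before turning to the proposed algorithm, here is a generic sketch of iteration (15); the callable-based interface (F, jacobian) is our own illustration.

import numpy as np

# Generic Newton-Raphson iteration, eq. (15): x_n = x_{n-1} - J_F(x_{n-1})^{-1} F(x_{n-1}).
def newton_solve(F, jacobian, x0, num_iters=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        # solve J_F(x) * step = F(x) instead of forming the inverse explicitly
        step = np.linalg.solve(jacobian(x), F(x))
        x = x - step
    return x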
IV. PROPOSED ALGORITHM

We construct our modified SQBE as follows. We first
approximate the max(.) operator, i.e., the function f(x) =
max_{i=1}^{d} x_i, where x = (x_1, . . . , x_d), with g_N(x) =
(1/N) log Σ_{i=1}^{d} e^{N x_i}, as the max(.) operator is not differentiable.
We note here that g_N(x) is a smooth approximation of the max
operator f(x), as shown in Lemma 1. Then equation
(14) can be rewritten as follows:

Q_n(i, a) = w( r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) (1/N) log Σ_{b=1}^{|A|} e^{N Q_{n-1}(j,b)} ) + (1 − w)(1/N) log Σ_{b=1}^{|A|} e^{N Q_{n-1}(i,b)},  n ≥ 1,    (16)

starting with an initial Q_0 (arbitrarily chosen in general).
Therefore our modified Successive Q-Bellman (SQB) operator U : R^{|S|×|A|} → R^{|S|×|A|} is defined as follows. For
0 < w ≤ w^*,

U Q(i, a) = w( r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) (1/N) log Σ_{b=1}^{|A|} e^{N Q(j,b)} ) + (1 − w)(1/N) log Σ_{b=1}^{|A|} e^{N Q(i,b)}.    (17)

The numerical scheme (16) is thus Q_n(i, a) = U Q_{n-1}(i, a).
Finally, by an application of the Newton-Raphson method to
U, our Generalized Second Order Value Iteration (G-SOVI) is
obtained as described in Algorithm 1. Note that setting w = 1
in Step 4 of the algorithm yields the standard SOVI algorithm.
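For concreteness, here is a minimal sketch of one application of the operator U in (17), using a numerically stable log-sum-exp for g_N; the array layout (p[i, a, j] = p(j|i, a), r_bar[i, a] = r(i, a), Q of shape (S, A)) is our own assumption.

import numpy as np
from scipy.special import logsumexp

# Sketch of the SQB operator U, eq. (17).
def apply_U(Q, p, r_bar, gamma, w, N):
    g = logsumexp(N * Q, axis=1) / N          # g_N(Q(j, .)) for every state j
    return w * (r_bar + gamma * np.einsum('iaj,j->ia', p, g)) \
           + (1.0 - w) * g[:, None]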
Remark 1. Note that in our case, the function F in equation
(15) corresponds to F(Q) = Q − U Q, and J_F(Q) = I − J_U(Q)
is an |S × A| × |S × A| dimensional matrix.
Remark 2. Note that directly computing (I − J_U(Q))^{-1}(Q − U Q) would involve O(|S|^3 |A|^3) complexity. This computation
could be carried out by solving the system (I − J_U(Q)) Y =
Q − U Q for Y to avoid numerical stability issues. Moreover,
the per-iteration time complexity of Algorithm 1 is also
O(|S|^3 |A|^3).
Algorithm 1 Generalized Second Order Value Iteration (G-SOVI)
Input:
    (S, A, p, r, γ): MDP model
    0 < w ≤ w^*: prescribed relaxation parameter
    Q_0: initial Q-vector
    N: prescribed approximation factor
    Iter: total number of iterations
Output: Q_Iter
1: procedure G-SOVI:
2:    while n < Iter do
3:        compute the |S × A| × |S × A| matrix J_U(Q_n), whose ((i, a), (k, c))-th entry is given by
4:        J_U(Q_n)((i, a), (k, c)) =
              wγ p(k|i, a) · e^{N Q_n(k,c)} / Σ_{b∈A} e^{N Q_n(k,b)},               if k ≠ i
              (1 − w + wγ p(k|i, a)) · e^{N Q_n(k,c)} / Σ_{b∈A} e^{N Q_n(k,b)},     if k = i
5:        Q_{n+1} = Q_n − (I − J_U(Q_n))^{-1} (Q_n − U Q_n)
6:    return Q_Iter
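To make Steps 4 and 5 concrete, the sketch below assembles J_U(Q_n) and performs one G-SOVI update via the linear solve suggested in Remark 2; the array layout (p[i, a, k] = p(k|i, a), r_bar[i, a] = r(i, a)) is, again, our own assumption rather than the authors' code.

import numpy as np
from scipy.special import logsumexp, softmax

# Sketch of one G-SOVI iteration (Steps 4 and 5 of Algorithm 1).
def gsovi_step(Q, p, r_bar, gamma, w, N):
    S, A = Q.shape
    sm = softmax(N * Q, axis=1)               # sm[k, c] = e^{N Q(k,c)} / sum_b e^{N Q(k,b)}
    g = logsumexp(N * Q, axis=1) / N          # g_N applied to each row of Q
    UQ = w * (r_bar + gamma * np.einsum('iaj,j->ia', p, g)) + (1.0 - w) * g[:, None]

    # J_U[(i,a),(k,c)] = w*gamma*p(k|i,a)*sm[k,c], plus (1 - w)*sm[i,c] when k == i
    J = w * gamma * np.einsum('iak,kc->iakc', p, sm)
    idx = np.arange(S)
    J[idx, :, idx, :] += (1.0 - w) * sm[:, None, :]
    J = J.reshape(S * A, S * A)

    F = (Q - UQ).reshape(S * A)
    Y = np.linalg.solve(np.eye(S * A) - J, F)  # solve (I - J_U) Y = Q - U Q, as in Remark 2
    return Q - Y.reshape(S, A)

The dominant cost here is the dense linear solve, which is consistent with the O(|S|^3 |A|^3) per-iteration complexity noted in Remark 2.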
Remark 3. Note that G-SOVI reduces to standard SOVI in
the case w = 1. Moreover, the computational complexity for
both the algorithms is the same.

Remark 4. The space required for storing the Jacobian matrix
J_U is |S|^2 |A|^2. Hence the space complexity of Algorithm 1 is
O(|S|^2 |A|^2).

V. CONVERGENCE ANALYSIS

In this section we study the convergence analysis of our
algorithm. Note that the norm considered in the following analysis is the max-norm, i.e., ||x|| := max_{1≤i≤d} |x_i|. Throughout
this section, it is assumed that the relaxation parameter w
satisfies 0 < w ≤ w^*, where w^* is as defined in (10).

Lemma 1. Suppose f : R^d → R and f(x) :=
max{x_1, x_2, . . . , x_d}. Let g_N : R^d → R be defined as
g_N(x) := (1/N) log Σ_{i=1}^{d} e^{N x_i}. Then sup_{x∈R^d} |f(x) − g_N(x)| → 0
as N → ∞.

Proof: Let x_{i*} = max{x_1, x_2, . . . , x_d} (where i* denotes
the corresponding arg max). Now

|f(x) − g_N(x)| = | max{x_1, x_2, . . . , x_d} − (1/N) log Σ_{i=1}^{d} e^{N x_i} |
= | x_{i*} − (1/N) log [ Σ_{i=1}^{d} e^{N(x_i − x_{i*})} e^{N x_{i*}} ] |
= (1/N) log Σ_{i=1}^{d} e^{N(x_i − x_{i*})}
≤ (log d)/N → 0 as N → ∞.

Note that the inequality follows from the definition of
x_{i*} = max{x_1, x_2, . . . , x_d} and the fact that e^{N(x_i − x_{i*})} ≤
1 for 1 ≤ i ≤ d (since x_i ≤ x_{i*} ∀ i). Hence sup_{x∈R^d} |f(x) −
g_N(x)| → 0 as N → ∞ with the rate 1/N.

Lemma 2. Let U : R^{|S|×|A|} → R^{|S|×|A|} be defined as follows:

(U Q)(i, a) = w( r(i, a) + γ Σ_{j=1}^{M} p(j|i, a) (1/N) log Σ_{b=1}^{|A|} e^{N Q(j,b)} ) + (1 − w)(1/N) log Σ_{b=1}^{|A|} e^{N Q(i,b)}.

Then U is a max-norm contraction.

Proof: Given (i, a), let q(·|i, a) : {1, 2, . . . , |S|} → [0, 1]
be defined as follows:

q(k|i, a) = wγ p(k|i, a) / (1 − w + wγ),              k ≠ i,
q(k|i, a) = (1 − w + wγ p(i|i, a)) / (1 − w + wγ),    k = i.

Observe that q is a probability mass function. Let E[·] denote
the expectation with respect to q, and let ξ(j, ·) denote the point
that lies on the line joining P(j, ·) and Q(j, ·). Now for P, Q ∈
R^{|S|×|A|}, we have

|(U P)(i, a) − (U Q)(i, a)|
= (1 − w + wγ) | E[ (1/N) log Σ_{b=1}^{|A|} e^{N P(j,b)} − (1/N) log Σ_{b=1}^{|A|} e^{N Q(j,b)} ] |
= (1 − w + wγ) | E[ ( e^{N ξ(j,·)} / Σ_{b∈A} e^{N ξ(j,b)} )^T ( P(j, ·) − Q(j, ·) ) ] |
≤ (1 − w + wγ) E[ | ( e^{N ξ(j,·)} / Σ_{b∈A} e^{N ξ(j,b)} )^T ( P(j, ·) − Q(j, ·) ) | ]
≤ (1 − w + wγ) E[ max_b |P(j, b) − Q(j, b)| ]
≤ (1 − w + wγ) max_{(i,a)} |P(i, a) − Q(i, a)|
= (1 − w + wγ) ||P − Q||.

So ||U P − U Q|| = max_{(i,a)} |(U P)(i, a) − (U Q)(i, a)| ≤ (1 − w + wγ) ||P − Q||.

Hence U is a contraction with contraction factor (1 − w + wγ).
Here, the second equality follows from an application of the mean
value theorem in multivariate calculus.
Lemma 3. Let Q_w be as in equation (12) and let Q′ be the fixed point
of U. Then ||Q_w − Q′|| ≤ ((1 − w + wγ) / (N w(1 − γ))) log(|A|).

Proof: From equation (12), we have

Q_w(s, a) = w r(s, a) + (1 − w + wγ) E[ max_{b∈A} Q_w(Z, b) ].

Now Q′ is the unique fixed point of U (unique by virtue of
Lemma 2), so

Q′(s, a) = w r(s, a) + (1 − w + wγ) E[ (1/N) log Σ_{b∈A} e^{N Q′(Z,b)} ],

where Z is a random variable with probability mass function q and the expectation above is taken with respect
to the law given by the probability mass function q. Let c =
arg max_{b∈A} Q′(Z, b), i.e., Q′(Z, c) = max_{b∈A} Q′(Z, b). Now

|Q_w(s, a) − Q′(s, a)|
= (1 − w + wγ) | E[ max_{b∈A} Q_w(Z, b) − (1/N) log Σ_{b∈A} e^{N Q′(Z,b)} ] |
= (1 − w + wγ) | E[ max_{b∈A} Q_w(Z, b) − max_{b∈A} Q′(Z, b) − (1/N) log Σ_{b∈A} e^{N (Q′(Z,b) − Q′(Z,c))} ] |
≤ (1 − w + wγ) E[ | max_{b∈A} Q_w(Z, b) − max_{b∈A} Q′(Z, b) | + (1/N) log Σ_{b∈A} e^{N (Q′(Z,b) − Q′(Z,c))} ]
≤ (1 − w + wγ) ||Q_w − Q′|| + ((1 − w + wγ)/N) log |A|.

Hence

||Q_w − Q′|| ≤ (1 − w + wγ) ||Q_w − Q′|| + ((1 − w + wγ)/N) log |A|
⟹ ||Q_w − Q′|| ≤ ((1 − w + wγ) / (N w(1 − γ))) log |A|.

This completes the proof. This lemma shows that the approximation error ||Q_w − Q′|| → 0 as N → ∞.
Remark 5. It is easy to see that in the case of w > 1,

(1 − w + wγ) / w < γ.

This shows that the approximation error bound of G-SOVI is smaller
than that of standard SOVI in the case of w > 1.
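As a hedged numerical illustration of Lemma 3 and Remark 5 (with values of our own choosing): for γ = 0.9, min_{i,a} p(i|i, a) = 0.2 (so w^* ≈ 1.22), |A| = 5 and N = 35, the bound gives ||Q_{w^*} − Q′|| ≤ 0.878 × log 5 / (35 × 0.122) ≈ 0.33 for w = w^*, whereas for w = 1 (standard SOVI) the corresponding bound is γ log 5 / (N(1 − γ)) ≈ 0.41.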
We now invoke the following theorem from [15] to show
the global convergence of our second order value iteration.
Theorem 1 (Global Newton Theorem). Suppose that F :
R^d → R^d is continuous, component-wise concave on R^d, differentiable, and that F′(x) is non-singular and F′(x)^{-1} ≥ 0,
i.e., each entry of F′(x)^{-1} is non-negative, for all x ∈ R^d.
Assume, further, that F(x) = 0 has a unique solution x^* and
that F′ is continuous on R^d. Then for any x_0 ∈ R^d the Newton
iterates given by (15) converge to x^*.
Remark 6. Note that the above theorem is stated for convex
F in [15]. However, the theorem holds true even for concave
F.
Theorem 2. Let Q′ be the fixed point of the operator U. G-SOVI converges to Q′ for any choice of initial point Q_0.

Proof: G-SOVI computes the zeros of the equation Q −
U Q = 0. So we appeal to Theorem 1 with the choice of F
as I − U : R^{|S|×|A|} → R^{|S|×|A|}, where

(I − U)(Q)(i, a) = Q(i, a) − w r(i, a) − (1 − w + γw) E[ (1/N) log Σ_{b∈A} e^{N Q(Z,b)} ].

It is enough to verify the hypotheses of Theorem 1 for I − U.
Clearly I − U is continuous, component-wise concave and
differentiable with (I − U)′(Q) = I − J_U(Q), where

J_U(Q)((i, a), (k, c)) = (1 − w + wγ) q(k|i, a) e^{N Q(k,c)} / Σ_{b∈A} e^{N Q(k,b)}.

Note that J_U(Q) is an |S × A| × |S × A| dimensional matrix
with 1 ≤ i, k ≤ |S| and 1 ≤ a, c ≤ |A|. Now observe that
• each entry in the (i, a)-th row is non-negative;
• the sum of the entries in the (i, a)-th row is

Σ_{k=1}^{|S|} Σ_{c=1}^{|A|} (1 − w + wγ) q(k|i, a) e^{N Q(k,c)} / Σ_{b∈A} e^{N Q(k,b)} = (1 − w + wγ).

So J_U(Q) = (1 − w + wγ) Φ for an |S × A| × |S × A|
dimensional transition probability matrix Φ. It is easy to see
that (I − J_U(Q))^{-1} exists (see Lemma 4) with the power series
expansion

(I − J_U(Q))^{-1} = Σ_{r=0}^{∞} (1 − w + wγ)^r Φ^r.

Moreover, since each entry in Φ is non-negative, Φ ≥ 0. Hence
(I − J_U(Q))^{-1} ≥ 0. Also, from Lemma 2 it is clear that the
equation Q − U Q = 0 has a unique solution. This completes
the proof.
Lemma 4. ||(I − J_U(Q))^{-1}|| ≤ 1/(w(1 − γ)).

Proof: Note that

I − J_U(Q) = I − (1 − w + wγ)Φ,

for a given transition probability matrix Φ. Now suppose that
λ is an eigenvalue of I − (1 − w + wγ)Φ; then

1 − (1 − w + wγ) < λ ≤ 1 + (1 − w + wγ).

From 1 − (1 − w + wγ) > 0, we have

0 ∉ σ(I − J_U(Q)),

where σ(I − J_U(Q)) is the spectrum of I − J_U(Q). Hence
for any Q, (I − J_U(Q))^{-1} exists and we have

||(I − J_U(Q))^{-1}|| ≤ 1 / (1 − (1 − w + wγ)) = 1 / (w(1 − γ)).

This completes the proof.
The following theorem is an adaptation from [15].
Theorem 3. G-SOVI has second order convergence.
Proof: Recall that F(Q) = Q − U Q. Let Q^* be the
unique solution of F(Q) = 0 and {Q_n} be the sequence of
iterates generated by G-SOVI. Define e_n = ||Q_n − Q^*|| and
G(Q) = Q − F′(Q)^{-1} F(Q). As Q^* satisfies Q^* = U Q^*, it is
a fixed point of G. It is enough to show that e_{n+1} ≤ k e_n^2 for
a constant k. It can be shown that for our particular choice of
F, F′ is Lipschitz (with Lipschitz constant, say, L):

||F′(Q) − F′(Q^*)|| ≤ L ||Q − Q^*||
⟹ ||F(Q) − F(Q^*) − F′(Q^*)(Q − Q^*)|| ≤ (L/2) ||Q − Q^*||^2

(by an application of the fundamental theorem of calculus).
Utilizing the above properties, we have

e_{n+1} = ||Q_{n+1} − Q^*||
= ||G(Q_n) − G(Q^*)||
= ||Q_n − F′(Q_n)^{-1} F(Q_n) − Q^*||
≤ || F′(Q_n)^{-1} ( F(Q_n) − F(Q^*) − F′(Q^*)(Q_n − Q^*) ) || + || F′(Q_n)^{-1} ( F′(Q_n) − F′(Q^*) )(Q_n − Q^*) ||
≤ ||F′(Q_n)^{-1}|| ||F(Q_n) − F(Q^*) − F′(Q^*)(Q_n − Q^*)|| + ||F′(Q_n)^{-1}|| ||F′(Q_n) − F′(Q^*)|| ||Q_n − Q^*||
≤ (3/2) β ||Q_n − Q^*||^2
= k e_n^2,

where β = L ||F′(Q)^{-1}|| ≤ L/(w(1 − γ)) (from Lemma 4) and
k = (3/2) β.
VI. EXPERIMENTS
In this section, we describe the experimental results of
our proposed G-SOVI algorithm and compare them with
standard SOVI and standard value iteration. For this purpose, we
use the Python MDP toolbox [7] for generating the MDPs and
implementing standard value iteration². The generated MDPs
satisfy p(i|i, a) > 0, ∀ i, a, in order to ensure that w^* > 1.
We consider the error defined below as the metric for
comparison between algorithms. The error for a given algorithm at
iteration i, denoted E(i), is calculated as follows. We compute
the max-norm difference between the optimal value function
and the value function estimate at iteration i. That is,

E(i) = ||V^* − Q_i(·, a^*)||_∞,

where V^* is the optimal value function of the MDP, Q_i(·, ·) is the Q-value function estimate at iteration i and, for any state j, a^*_j = arg max_{a∈A} Q_i(j, a).

²The code for our experiments is available at: https://github.com/raghudiddigi/G-SOVI

[Figure 1: Error vs. number of iterations on the setting with 100 states and 10 actions, with w = w^* for Generalized SOVI (G-SOVI).]

Value of N | Standard Value Iteration | Standard SOVI   | G-SOVI
N = 20     | 0.1009 ± 0.0026          | 0.1205 ± 0.0372 | 0.1093 ± 0.0818
N = 25     | 0.1009 ± 0.0026          | 0.0822 ± 0.0273 | 0.0648 ± 0.0217
N = 30     | 0.1009 ± 0.0026          | 0.0611 ± 0.0211 | 0.0494 ± 0.017
N = 35     | 0.1009 ± 0.0026          | 0.0484 ± 0.0168 | 0.0397 ± 0.0136

Table I: Comparison of average error for different values of N on the 10 states and 5 actions setting at the end of 50 iterations. For the G-SOVI algorithm, the relaxation parameter is chosen to be the optimal relaxation parameter w^*, i.e., w = w^*. (The standard value iteration column does not depend on N.)

Value of w            | G-SOVI
w = 1 (Standard SOVI) | 0.04838 ± 0.017
w = 1.00001           | 0.04838 ± 0.017
w = 1.0001            | 0.04837 ± 0.017
w = 1.001             | 0.04830 ± 0.017
w = 1.01              | 0.0476 ± 0.017
w = 1.05              | 0.0448 ± 0.016
w = 1.1               | 0.0417 ± 0.014
w = w^*               | 0.0397 ± 0.014

Table II: Comparison of average error in G-SOVI for different values of w on the 10 states and 5 actions setting at the end of 50 iterations. The value of N is 35.
Setting                    | Standard Value Iteration | Standard SOVI | G-SOVI
States = 30, Actions = 10  | 6.471 ± 0.07             | 0.087 ± 0.01  | 0.079 ± 0.01
States = 50, Actions = 10  | 6.587 ± 0.07             | 0.114 ± 0.01  | 0.108 ± 0.01
States = 80, Actions = 10  | 6.754 ± 0.03             | 0.141 ± 0.01  | 0.136 ± 0.01
States = 100, Actions = 10 | 6.772 ± 0.03             | 0.152 ± 0.01  | 0.148 ± 0.01

Table III: Comparison of average error across four settings at the end of 10 iterations with N = 35. For the G-SOVI algorithm, the relaxation parameter is chosen to be the optimal relaxation parameter w^*, i.e., w = w^*.

Setting                    | Standard Value Iteration | Standard SOVI | G-SOVI
States = 30, Actions = 10  | 0.0008 ± 0.00            | 0.0154 ± 0.01 | 0.0267 ± 0.01
States = 50, Actions = 10  | 0.0009 ± 0.00            | 0.0242 ± 0.00 | 0.0488 ± 0.00
States = 80, Actions = 10  | 0.0011 ± 0.00            | 0.0532 ± 0.00 | 0.0988 ± 0.01
States = 100, Actions = 10 | 0.0026 ± 0.00            | 0.1202 ± 0.01 | 0.1343 ± 0.01

Table IV: Per-iteration execution time (in seconds) of the algorithms across four settings, with the relaxation parameter in G-SOVI chosen as w = w^*.
Configuration        | Computational Time (seconds) | Standard Value Iteration | Standard SOVI | G-SOVI
10 States, 5 Actions | 0.01                         | 25.485 ± 2.21            | 3.930 ± 0.92  | 3.885 ± 0.94
20 States, 5 Actions | 0.02                         | 18.291 ± 0.77            | 5.444 ± 0.51  | 5.473 ± 0.50
30 States, 5 Actions | 0.03                         | 7.327 ± 0.20             | 7.111 ± 0.32  | 7.118 ± 0.33

Table V: Average error vs. computational time (rounded off to the nearest millisecond). Initial Q-values for the algorithms are assigned random integers between 60 and 70. The discount factor is set to 0.99. G-SOVI is run with w = 1.00001.
First, we generate 100 independent MDPs each with 10
states, 5 actions and we set the discount factor to be 0.9 in
each case. We run all the algorithms for 50 iterations. The
initial Q-values of the algorithms are assigned random integers
between 10 and 20 (which are far away from the optimal
value function). In Table I, we indicate the average error value
(error averaged over 100 MDPs) at the end of 50 iterations
for all the algorithms, wherein for G-SOVI, we set w = w∗
as the relaxation parameter. We observe that standard SOVI
and G-SOVI with N = 25, 30, and 35 have low error at the
end of 50 iterations compared to the standard value iteration.
Moreover, the average error is the least for our proposed G-SOVI algorithm. Also, we find that the higher the value of N, the
smaller the error between the G-SOVI value function and
the optimal value function.
In Table II, we report the performance of G-SOVI for
different values of feasible successive relaxation parameters
w across the same 100 MDPs generated previously (in Table
I). The optimal successive relaxation parameter w∗ here lies
between 1.1 and 1.5. Recall that G-SOVI exhibits faster
convergence for any value of w that satisfies 1 < w ≤ w∗
when compared to standard SOVI (first row of Table II). From
Table II, we can conclude that G-SOVI with w ∈ (1, w∗ ]
performs at least as fast as the standard SOVI. Moreover, the
higher the value of w, the better is the performance, when the
algorithm is run for a sufficient number of iterations.
In Table III, we present the results of the three algorithms on
four different settings, averaged over 10 MDPs. The standard
SOVI and G-SOVI are run with N = 35. All the algorithms are
run for 10 iterations. We observe that standard SOVI and G-
SOVI have low error compared to the standard value iteration.
Moreover, the difference here is much more pronounced than
in Table I, where algorithms are run for 50 iterations. Recall
that the SOVI and G-SOVI algorithms with a fixed N give
near-optimal value functions. The advantage of using our
proposed algorithms is that the Q-value iterates converge to
the near-optimal Q-values rapidly. This can also be observed
in Figure 1, where we present the convergence of algorithms
over 50 iterations on 100 states and 10 actions setting. The
SOVI and G-SOVI algorithms converge rapidly to a value and
stay constant. In fact, we observe here that their error remains lower than
that obtained by the standard value iteration for the first 45 iterations.
Moreover, G-SOVI computes a solution that gives lower error
as compared to that obtained by SOVI.
In Table IV, we indicate the per-iteration execution time of
our algorithms across the four settings considered above. We
can see that, due to the Jacobian inversion operation in the second-order techniques, the standard SOVI and G-SOVI algorithms take more
time compared to the standard value iteration algorithm.
Recall that the advantage of second-order methods is that
even though the per-iteration computation is higher compared
to the first-order methods, the total number of iterations needed
to achieve a desired error threshold is much lower in general.
Hence, they are capable of achieving lower error in the same
computational time. We demonstrate this in Table V for three
settings. We select the parameters of this experiment (i.e., the
number of states and actions, the values of N and w, and the number of
iterations) such that the second-order methods compute better
solutions compared to the standard value iteration scheme (the choice w = 1.00001 respects the constraint w ≤ w^* in all three settings).
For example, consider the 10 states and 5 actions setting (first
row of Table V). The standard value iteration is run for 50
iterations. Its per-iteration time is 0.2 ms, which results in an
overall computational time of 0.0002 × 50 = 0.01 seconds. On
the other hand, the SOVI techniques (standard SOVI and G-SOVI) are run for just 3 iterations. However, their per-iteration
time is 0.0033 seconds and hence the overall computational
time is 0.01 seconds. We observe that, in 0.01 seconds, the
SOVI methods achieve lower error compared to the standard
value iteration. Similarly, in the other two settings in Table V,
we see that the second order SOVI algorithms achieve lower
error compared to the standard value iteration when run for
0.02 and 0.03 seconds, respectively.
It is important to note that this advantage need not hold,
in general, for MDPs with a large number of states and actions, as
the overhead of computing the Jacobian inverse in large MDPs
will be higher, which would affect the overall computational time.
If one could deploy techniques to improve the computation
time for matrix operations, G-SOVI would be preferred over
the standard value iteration for computing the optimal value
function.
VII. CONCLUSION
In this work, we have proposed a generalized second-order value iteration scheme based on the Newton-Raphson
method for faster convergence to a near optimal value function
in discounted reward MDP problems. The first step involved
constructing a differentiable Bellman equation through an
approximation of the max(.) operator. We then applied the second
order Newton method to arrive at the proposed algorithm. We
proved bounds on the approximation error and established the second
order convergence of the proposed scheme. Finally,
approaches geared towards easing the computational burden
associated with solving problems involving large state and
action spaces such as those based on approximate dynamic
programming can be developed in the context of G-SOVI
schemes in the future.
REFERENCES
[1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37,
1966.
[2] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming,
volume 5. Athena Scientific, Belmont, MA, 1996.
[3] Vivek S Borkar. Stochastic Approximation: A Dynamical Systems
Viewpoint. Cambridge Univ. Press, 2008.
[4] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu
Chen, and Le Song. Sbeed: Convergent reinforcement learning with
nonlinear function approximation. arXiv preprint arXiv:1712.10285,
2017.
[5] Adithya M Devraj and Sean Meyn. Zap Q-learning. In Advances in
Neural Information Processing Systems, pages 2235–2244, 2017.
[6] Thomas Furmston, Guy Lever, and David Barber. Approximate Newton
methods for policy search in Markov decision processes. The Journal
of Machine Learning Research, 17(1):8055–8105, 2016.
[7] GitHub. Python MDP toolbox. https://github.com/sawcordwell/pymdptoolbox.
[8] Vineet Goyal and Julien Grand-Clement. A first-order approach to
accelerated value iteration. arXiv preprint arXiv:1905.09963, 2019.
[9] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine.
Reinforcement learning with deep energy-based policies. In Proceedings
of the 34th International Conference on Machine Learning-Volume 70,
pages 1352–1361, 2017.
[10] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft
actor-critic: Off-policy maximum entropy deep reinforcement learning
with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[11] Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, and Shalabh
Bhatnagar. Successive over-relaxation Q-learning. IEEE Control Systems
Letters, 4(1):55–60, 2019.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
Nature, 521(7553):436, 2015.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves,
Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing
Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,
2013.
[14] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127–152, 2005.
[15] James M Ortega and Werner C Rheinboldt. Iterative solution of
nonlinear equations in several variables, volume 30. SIAM, 1970.
[16] Martin L Puterman and Shelby L Brumelle. On the convergence of
policy iteration in stationary dynamic programming. Mathematics of
Operations Research, 4(1):60–69, 1979.
[17] Dieter Reetz. Solution of a Markovian decision problem by successive
overrelaxation. Zeitschrift für Operations Research, 17(1):29–32, 1973.
[18] John Rust. Structural estimation of Markov decision processes. Handbook of econometrics, 4:3081–3143, 1994.
[19] Christopher J.C.H Watkins and Peter Dayan. Q-learning. Machine
learning, 8(3-4):279–292, 1992.