

Generalized Second Order Value Iteration in Markov Decision Processes

Chandramouli Kamanchi*, Raghuram Bharadwaj Diddigi*, Shalabh Bhatnagar

arXiv:1905.03927v3 [cs.LG] 18 Sep 2021. Accepted for publication in IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2021.3112851

*Equal contribution. The authors are with the Department of Computer Science and Automation, Indian Institute of Science (IISc), Bengaluru 560012, India (e-mails: {chandramouli, raghub, shalabh}@iisc.ac.in). Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J. C. Bose Fellowship, a project from DST under the ICPS Program, and the RBCCPS, IISc.

Abstract—Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov Decision Process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first order method and therefore it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second order value iteration procedure obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and establish its second order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.

I. INTRODUCTION

In a discounted reward Markov Decision Process [2], the objective is to maximize the expected cumulative discounted reward. Reinforcement Learning (RL) deals with algorithms for solving an MDP when the model information (i.e., the probability transition matrix and the reward function) is unknown. RL algorithms instead make use of state and reward samples to estimate the optimal value function and policy. Owing to the success of deep learning [12], RL algorithms in combination with deep neural networks have been successfully deployed to solve many real-world problems and games [13]. However, there is ongoing research on improving both the sample efficiency and the convergence of RL algorithms [10].

Many RL algorithms can be viewed as stochastic approximation [3] variants of the Bellman equation [1] for MDPs. For example, the popular Q-learning algorithm [19] can be viewed as a stochastic fixed point iteration for solving the Q-Bellman equation. Therefore, we believe that a promising first step towards improving the performance of RL algorithms is to propose faster algorithms for solving MDPs when the model information is known. To this end, we propose a second order value iteration technique that enjoys global convergence, which is a desirable property. In [17], the successive over-relaxation technique is applied to the Bellman equation to obtain a faster value iteration algorithm. In this work, we propose a Generalized Second Order Value Iteration (G-SOVI) method for computing the optimal value function and policy when the model information is known. This is achieved by applying the Newton-Raphson method to the successive relaxation variant of the Q-Bellman equation (henceforth denoted SQBE).
The key differences between G-SOVI and the standard SOVI algorithm are the incorporation of the relaxation parameter w and the construction of the single-stage reward, as discussed in Section IV. Note that we cannot directly apply the Newton-Raphson method to the SQBE because of the presence of the max(.) operator in the equation, which is not differentiable. Therefore, we approximate the max operator by a smooth function g_N [14], where N is a given parameter. This approximation allows us to apply the second order method, thereby ensuring a faster rate of convergence. The solution obtained by our second order technique on the modified SQBE may differ from the solution of the original MDP problem because of the approximation of the max operator by g_N. However, we show that our proposed algorithm converges to the actual solution as N → ∞. We show through numerical experiments that, given a finite number of iterations, our proposed algorithm computes a solution that is closer to the actual solution than the one obtained from standard value iteration. Moreover, under a special structure of the MDP, the solution is better than that of standard second-order value iteration [18].

II. RELATED WORK AND OUR CONTRIBUTIONS

Value iteration and policy iteration are two classical numerical techniques employed for solving MDP problems. In [16], it has been shown that the Newton-Kantorovich¹ method applied to the exact Bellman equation gives rise to the policy iteration scheme. In Section 2.5 of [18], a second-order value iteration technique (which we refer to as standard SOVI) is proposed by applying the Newton-Kantorovich method to the smooth (softmax) Bellman equation, and remarks about the second-order rate and global convergence are provided. However, a convergence analysis of the SOVI technique is not discussed. Approximate Newton methods have been proposed in [6] for policy optimization in MDPs, where a detailed analysis of the Hessian of the objective function is provided and algorithms are derived. In recent times, the smooth Bellman equation has been successfully used in the development of many RL algorithms. For instance, in [9], a soft Q-learning algorithm has been proposed that learns maximum entropy policies; the algorithm makes use of the smooth Q-Bellman equation with an additional entropy term. In [4], the SBEED (Smoothed Bellman Error Embedding) algorithm has been proposed, which computes the optimal policy by formulating the smooth Bellman equation as a primal-dual optimization problem. In [5], a matrix-gain learning algorithm, namely Zap Q-learning, has been proposed and is seen to have performance similar to the Newton-Raphson method. Very recently, an accelerated value iteration technique was proposed in [8] by applying Nesterov's accelerated gradient technique to value iteration.

¹Also known as the Newton-Raphson method.

We now summarize the main contributions of our paper:
• We propose a generalized second order Q-value iteration algorithm that is derived from the successive relaxation technique together with the Newton-Raphson method. In fact, we show that standard SOVI is a special case of our proposed algorithm.
• We prove the global convergence of our algorithm and provide a second order convergence rate result.
• We derive a bound on the error between the value function obtained by our proposed method and the actual value function, and show that this error vanishes asymptotically.
• Through experimental evaluation, we further confirm that our proposed technique provides a better near-optimal solution than the value iteration procedure when both are run for the same (finite) number of iterations.

III. BACKGROUND AND PRELIMINARIES

A discounted reward Markov Decision Process (MDP) is characterized by a tuple (S, A, p, r, γ), where S = {1, 2, ..., M} denotes the set of states, A = {a_1, ..., a_K} denotes the set of actions, and p is the transition probability rule, i.e., p(j|i, a) denotes the probability of transitioning from state i to state j when action a is chosen. Also, r(i, a, j) denotes the single-stage reward obtained in state i when action a is chosen and the next state is j. Finally, 0 ≤ γ < 1 denotes the discount factor. The objective in an MDP is to learn an optimal policy π : S → A, where π(i) denotes the action to be taken in state i, that maximizes the cumulative discounted reward objective given by

$$E\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t), s_{t+1}) \,\Big|\, s_0 = i\Big]. \tag{1}$$

In (1), s_t is the state at time t and E[.] is the expectation taken over the entire trajectory of states. Let V*(.) be the value function, with V*(i) being the value of state i, i.e., the total discounted reward obtained starting from state i and following the optimal policy π. The value function can be obtained by solving the Bellman equation [2] given by

$$V^*(i) = \max_{a \in A} \sum_{j=1}^{M} p(j|i,a)\big(r(i,a,j) + \gamma V^*(j)\big), \quad \forall i \in S. \tag{2}$$

We assume here for simplicity that all actions are feasible in every state. Value iteration is a popular numerical scheme employed to obtain the value function and thereby the optimal policy. It works as follows: an initial estimate V_0 of the value function is selected arbitrarily, and a sequence V_n, n ≥ 1, is generated iteratively as

$$V_n(i) = \max_{a \in A} \sum_{j=1}^{M} p(j|i,a)\big(r(i,a,j) + \gamma V_{n-1}(j)\big), \quad n \ge 1, \ \forall i \in S. \tag{3}$$

Let ζ denote the set of all bounded functions from S to R. Note that equation (2) can be rewritten as

$$V^* = T V^*, \tag{4}$$

where the operator T : ζ → ζ is defined by

$$(TV)(i) = \max_{a \in A} \Big\{ r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a)\, V(j) \Big\},$$

and r(i,a) = Σ_{j=1}^{M} p(j|i,a) r(i,a,j) is the expected single-stage reward in state i when action a is chosen. It is easy to see that T is a sup-norm contraction map with contraction factor γ, the discount factor. Therefore, from the contraction mapping theorem, it is clear that the value iteration scheme given by equation (3) converges to the optimal value function, i.e.,

$$V^* = \lim_{n \to \infty} V_n = T V^*. \tag{5}$$

Let Q*(i,a), (i,a) ∈ S × A, be defined as

$$Q^*(i,a) := r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a)\, V^*(j). \tag{6}$$

Here Q*(i,a) is the optimal Q-value function associated with state i and action a. It denotes the total discounted reward obtained starting from state i upon taking action a and following the optimal policy in subsequent states. Then from (2), it is clear that

$$V^*(i) = \max_{a \in A} Q^*(i,a). \tag{7}$$

Therefore, equation (6) can be rewritten as

$$Q^*(i,a) = r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a) \max_{b \in A} Q^*(j,b). \tag{8}$$

This is known as the Q-Bellman equation. We obtain the optimal policy by letting

$$\pi(i) = \arg\max_{a \in A} Q^*(i,a). \tag{9}$$
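As a concrete point of reference for the schemes that follow, the sketch below implements the plain Q-value iteration of equation (8), together with the value function (7) and greedy policy (9), on a small randomly generated tabular MDP. This is only an illustrative sketch under our own array layout and naming conventions (transition tensor p of shape (S, A, S), expected rewards r of shape (S, A)); it is not the authors' code.

```python
import numpy as np

def q_value_iteration(p, r, gamma, num_iters=200):
    """Plain Q-value iteration, i.e., repeated application of equation (8).

    p: transition probabilities, shape (S, A, S), with p[i, a, j] = p(j|i, a).
    r: expected single-stage rewards, shape (S, A), with r[i, a] = r(i, a).
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        # Q_n(i, a) = r(i, a) + gamma * sum_j p(j|i, a) * max_b Q_{n-1}(j, b)
        Q = r + gamma * (p @ Q.max(axis=1))
    V = Q.max(axis=1)           # equation (7)
    policy = Q.argmax(axis=1)   # equation (9)
    return Q, V, policy

# Small random MDP, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 10, 5, 0.9
p = rng.random((S, A, S))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((S, A))
Q, V, policy = q_value_iteration(p, r, gamma)
```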
In [17], a modified value iteration algorithm is proposed based on the idea of successive relaxation. Let us define

$$w^* = \frac{1}{1 - \gamma \min_{i,a} p(i|i,a)}. \tag{10}$$

Note that w* ≥ 1. For 0 < w ≤ w*, we define a modified Bellman operator as follows:

$$T_w(V) = w (T V) + (1-w) V, \tag{11}$$

where w is called the 'relaxation' parameter. It is easy to see that the fixed point of T_w is also the optimal value function of the MDP (the fixed point of T). Moreover, it is shown in [17] that the contraction factor of T_w is 1 − w + wγ. Under a special structure of the MDP, namely p(i|i,a) > 0, ∀i, a, we have w* > 1 (strictly greater than 1). The relaxation parameter w can then be chosen in three possible ways:
1) If 0 < w < 1, the contraction factor of T_w is greater than the contraction factor of T.
2) If w = 1, then T = T_w and hence the contraction factors of the two operators coincide.
3) If 1 < w ≤ w*, the contraction factor of T_w is smaller than the contraction factor of T. This implies that the fixed point iteration utilizing (11) generates the optimal value function faster than standard value iteration.

In [11], a successive relaxation Q-Bellman equation (which we call the Generalized Q-Bellman equation) is constructed as follows:

$$Q_w(i,a) = w\Big(r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a) \max_{b \in A} Q_w(j,b)\Big) + (1-w)\max_{c \in A} Q_w(i,c), \tag{12}$$

where 0 < w ≤ w*. It has been shown that, although the Q-values obtained from (12) can be different from the optimal Q-values, the optimal value functions are still the same. That is,

$$\max_{a \in A} Q_w(i,a) = \max_{a \in A} Q^*(i,a), \quad \forall i \in S. \tag{13}$$

The Generalized Q-values (Q_w in (12)) are obtained as follows. An initial estimate Q_0 of Q_w is arbitrarily selected and a sequence Q_n, n ≥ 1, is obtained according to

$$Q_n(i,a) = w\Big(r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a) \max_{b \in A} Q_{n-1}(j,b)\Big) + (1-w)\max_{c \in A} Q_{n-1}(i,c), \quad \forall (i,a). \tag{14}$$

It is shown in [11] that the Q-values obtained by (14) converge to the generalized Q-values Q_w. In this way, we obtain the optimal value function and the optimal policy using the successive relaxation Q-value iteration scheme. In this work, our objective is to approximate the generalized Q-Bellman equation and apply the Newton-Raphson second order technique to solve for the optimal value function. Recall that we cannot apply the second order method directly to equation (12), as the max(.) operator on the RHS is not differentiable.

Before we propose our algorithm, we briefly discuss Newton's second order technique [15] for solving a non-linear system of equations. Consider a function F : R^d → R^d that is twice differentiable. Suppose we are interested in finding a root of F, i.e., a point x such that F(x) = 0. The Newton-Raphson method can be applied to find a solution here. We select an initial point x_0 and then proceed as follows:

$$x_n = x_{n-1} - J_F^{-1}(x_{n-1})\, F(x_{n-1}), \quad n \ge 1, \tag{15}$$

where J_F(x) is the Jacobian of the function F evaluated at the point x. Under suitable hypotheses it can be shown that the procedure (15) exhibits second order convergence. In the next section, we construct a function F for our problem and apply the Newton-Raphson method to find the optimal value function and policy pair.
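For completeness, the snippet below shows the generic Newton-Raphson iteration (15) on a toy two-dimensional system. The example function, its Jacobian, and all names are our own and purely illustrative; as suggested later in Remark 2, each step solves a linear system rather than forming the Jacobian inverse explicitly.

```python
import numpy as np

def newton_raphson(F, JF, x0, num_iters=20):
    """Newton-Raphson iteration (15): x_n = x_{n-1} - J_F(x_{n-1})^{-1} F(x_{n-1})."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        # Solve J_F(x) y = F(x) instead of inverting the Jacobian.
        x = x - np.linalg.solve(JF(x), F(x))
    return x

# Toy system: F(x) = (x_0^2 + x_1 - 3, x_0 + x_1^2 - 5) = 0, which has the root (1, 2).
F = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
JF = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])
print(newton_raphson(F, JF, x0=[1.0, 1.0]))  # converges rapidly to [1., 2.]
```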
IV. PROPOSED ALGORITHM

We construct our modified SQBE as follows. We first approximate the max(.) operator, i.e., the function f(x) = max_{i=1,...,d} x_i with x = (x_1, ..., x_d), by

$$g_N(x) = \frac{1}{N}\log\sum_{i=1}^{d} e^{N x_i},$$

since the max(.) operator is not differentiable. We note here that g_N(x) is a smooth approximation of the max operator f(x), as shown in Lemma 1. Then equation (14) can be rewritten as

$$Q_n(i,a) = w\Big(r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a)\,\frac{1}{N}\log\sum_{b=1}^{|A|} e^{N Q_{n-1}(j,b)}\Big) + (1-w)\,\frac{1}{N}\log\sum_{b=1}^{|A|} e^{N Q_{n-1}(i,b)}, \quad n \ge 1, \tag{16}$$

starting with an initial Q_0 (arbitrarily chosen in general). Therefore our modified Successive Q-Bellman (SQB) operator U : R^{|S|×|A|} → R^{|S|×|A|} is defined as follows. For 0 < w ≤ w*,

$$U Q(i,a) = w\Big(r(i,a) + \gamma \sum_{j=1}^{M} p(j|i,a)\,\frac{1}{N}\log\sum_{b=1}^{|A|} e^{N Q(j,b)}\Big) + (1-w)\,\frac{1}{N}\log\sum_{b=1}^{|A|} e^{N Q(i,b)}. \tag{17}$$

The numerical scheme (16) is thus Q_n(i,a) = U Q_{n-1}(i,a). Finally, by an application of the Newton-Raphson method to U, our Generalized Second Order Value Iteration (G-SOVI) is obtained as described in Algorithm 1. Note that setting w = 1 in Step 4 of the algorithm yields the standard SOVI algorithm.

Algorithm 1: Generalized Second Order Value Iteration (G-SOVI)
Input: (S, A, p, r, γ): MDP model; 0 < w ≤ w*: prescribed relaxation parameter; Q_0: initial Q-vector; N: prescribed approximation factor; Iter: total number of iterations.
Output: Q_Iter
1: procedure G-SOVI:
2:   while n < Iter do
3:     compute the |S × A| × |S × A| matrix J_U(Q_n), whose ((i,a),(k,c))-th entry is given by
4:     $$J_U(Q_n)\big((i,a),(k,c)\big) = \begin{cases} w\gamma\, p(k|i,a)\,\dfrac{e^{N Q_n(k,c)}}{\sum_{b\in A} e^{N Q_n(k,b)}}, & (k,c) \ne (i,a), \\[2mm] \big(1 - w + w\gamma\, p(k|i,a)\big)\,\dfrac{e^{N Q_n(k,c)}}{\sum_{b\in A} e^{N Q_n(k,b)}}, & (k,c) = (i,a). \end{cases}$$
5:     Q_{n+1} = Q_n − (I − J_U(Q_n))^{-1} (Q_n − U Q_n)
6:   return Q_Iter

Remark 1. Note that in our case, the function F in equation (15) corresponds to F(Q) = Q − UQ, and J_F(Q) = I − J_U(Q) is an |S × A| × |S × A| dimensional matrix.

Remark 2. Note that directly computing (I − J_U(Q))^{-1}(Q − UQ) would involve O(|S|^3 |A|^3) complexity. This computation can instead be carried out by solving the system (I − J_U(Q)) Y = Q − UQ for Y, which avoids numerical instability. Moreover, the per-iteration time complexity of Algorithm 1 is also O(|S|^3 |A|^3).

Remark 3. Note that G-SOVI reduces to standard SOVI in the case w = 1. Moreover, the computational complexity of the two algorithms is the same.

Remark 4. The space required for storing the Jacobian matrix J_U is |S|^2 |A|^2. Hence the space complexity of Algorithm 1 is O(|S|^2 |A|^2).
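To make Algorithm 1 concrete, the following sketch assembles the pieces in NumPy/SciPy: the smoothed operator U of equation (17) via a log-sum-exp, the Jacobian J_U following the expression used in the convergence analysis of Section V, and the Newton update of Step 5 implemented by a linear solve as advised in Remark 2. It is an illustrative reimplementation under our own array layout and naming (not the authors' released code, which is linked in Section VI), and it assumes the same (S, A, S) transition tensor and (S, A) reward matrix as the earlier sketches.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def g_sovi(p, r, gamma, w, N, num_iters=10):
    """Sketch of G-SOVI (Algorithm 1): Newton-Raphson applied to F(Q) = Q - U(Q) = 0.

    p: transition probabilities, shape (S, A, S), p[i, a, j] = p(j|i, a).
    r: expected single-stage rewards, shape (S, A).
    w: relaxation parameter, 0 < w <= w*; N: smoothing parameter of g_N.
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    eye_sa = np.eye(S * A)
    for _ in range(num_iters):
        # g_N(Q(j, .)) = (1/N) log sum_b exp(N Q(j, b)), the smooth max, one value per state j.
        gN = logsumexp(N * Q, axis=1) / N                          # shape (S,)
        # Operator U of equation (17).
        UQ = w * (r + gamma * (p @ gN)) + (1.0 - w) * gN[:, None]  # shape (S, A)
        # Jacobian of U, as in the convergence analysis (Section V):
        #   dU(Q)(i,a)/dQ(k,c) = (w*gamma*p(k|i,a) + (1-w)*1{k=i}) * softmax_c(N Q(k, .)).
        sm = softmax(N * Q, axis=1)                                # shape (S, A)
        coef = w * gamma * p + (1.0 - w) * np.eye(S)[:, None, :]   # shape (S, A, S)
        JU = np.einsum('iak,kc->iakc', coef, sm).reshape(S * A, S * A)
        # Newton update (Step 5): solve (I - J_U) Y = Q - U(Q) rather than inverting (Remark 2).
        Y = np.linalg.solve(eye_sa - JU, (Q - UQ).reshape(-1))
        Q = Q - Y.reshape(S, A)
    return Q

# Usage on a random MDP with p(i|i, a) > 0, taking w = w* from equation (10).
rng = np.random.default_rng(0)
S, A, gamma, N = 10, 5, 0.9, 35
p = rng.random((S, A, S)) + 0.1
p /= p.sum(axis=2, keepdims=True)
r = rng.random((S, A))
w_star = 1.0 / (1.0 - gamma * p[np.arange(S), :, np.arange(S)].min())
Q = g_sovi(p, r, gamma, w=w_star, N=N)
policy = Q.argmax(axis=1)
```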
V. CONVERGENCE ANALYSIS

In this section we present the convergence analysis of our algorithm. Note that the norm considered in the following analysis is the max-norm, i.e., ||x|| := max_{1 ≤ i ≤ d} |x_i|. Throughout this section, it is assumed that the relaxation parameter w satisfies 0 < w ≤ w*, where w* is as defined in (10).

Lemma 1. Suppose f : R^d → R with f(x) := max{x_1, x_2, ..., x_d}, and let g_N : R^d → R be defined as g_N(x) := (1/N) log Σ_{i=1}^{d} e^{N x_i}. Then sup_{x ∈ R^d} |f(x) − g_N(x)| → 0 as N → ∞.

Proof: Let x_{i*} = max{x_1, x_2, ..., x_d} (where i* denotes the corresponding arg max). Now

$$\begin{aligned}
\big|f(x) - g_N(x)\big| &= \Big|\max\{x_1, x_2, \ldots, x_d\} - \frac{1}{N}\log\sum_{i=1}^{d} e^{N x_i}\Big| \\
&= \Big|x_{i^*} - \frac{1}{N}\log\Big[e^{N x_{i^*}}\sum_{i=1}^{d} e^{N(x_i - x_{i^*})}\Big]\Big| \\
&= \frac{1}{N}\log\sum_{i=1}^{d} e^{N(x_i - x_{i^*})} \\
&\le \frac{\log d}{N} \;\to\; 0 \ \text{ as } N \to \infty.
\end{aligned}$$

Note that the last inequality follows from the definition of x_{i*} = max{x_1, x_2, ..., x_d} and the fact that e^{N(x_i − x_{i*})} ≤ 1 for 1 ≤ i ≤ d (since x_i ≤ x_{i*} for all i). Hence sup_{x ∈ R^d} |f(x) − g_N(x)| → 0 as N → ∞, at the rate 1/N.
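Lemma 1 is easy to check numerically. The short script below (our own illustration, with an arbitrary dimension d and test distribution) verifies that the gap g_N(x) − f(x) stays within [0, log(d)/N] for the values of N used later in the experiments.

```python
import numpy as np
from scipy.special import logsumexp

# Empirical check of Lemma 1: 0 <= g_N(x) - max(x) <= log(d)/N for all x.
rng = np.random.default_rng(1)
d = 5
for N in (20, 25, 30, 35):
    x = rng.normal(scale=10.0, size=(100_000, d))        # random test points
    gap = logsumexp(N * x, axis=1) / N - x.max(axis=1)   # g_N(x) - f(x)
    assert gap.min() >= -1e-12 and gap.max() <= np.log(d) / N + 1e-12
    print(f"N = {N}: max gap = {gap.max():.5f}, bound log(d)/N = {np.log(d) / N:.5f}")
```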
Lemma 2. Let U : R^{|S|×|A|} → R^{|S|×|A|} be the operator defined in (17). Then U is a max-norm contraction.

Proof: Given (i,a), let q(·|i,a) : {1, 2, ..., |S|} → [0,1] be defined as follows:

$$q(k|i,a) = \begin{cases} \dfrac{w\gamma\, p(k|i,a)}{1-w+w\gamma}, & k \ne i, \\[2mm] \dfrac{1-w+w\gamma\, p(i|i,a)}{1-w+w\gamma}, & k = i. \end{cases}$$

Observe that q is a probability mass function. Let E[.] denote the expectation with respect to q, and let ξ(j,·) denote a point that lies on the line segment joining P(j,·) and Q(j,·). Now for P, Q ∈ R^{|S|×|A|}, we have

$$\begin{aligned}
(UP)(i,a) - (UQ)(i,a) &= (1-w+w\gamma)\, E\bigg[\frac{1}{N}\log\sum_{b=1}^{|A|} e^{N P(j,b)} - \frac{1}{N}\log\sum_{b=1}^{|A|} e^{N Q(j,b)}\bigg] \\
&= (1-w+w\gamma)\, E\bigg[\bigg(\frac{e^{N\xi(j,\cdot)}}{\sum_{b\in A} e^{N\xi(j,b)}}\bigg)^{\!T}\big(P(j,\cdot) - Q(j,\cdot)\big)\bigg] \\
&\le (1-w+w\gamma)\, E\Big[\max_{b}\big|P(j,b) - Q(j,b)\big|\Big] \\
&\le (1-w+w\gamma)\, \max_{(i,a)}\big|P(i,a) - Q(i,a)\big| = (1-w+w\gamma)\,\|P-Q\|.
\end{aligned}$$

So

$$\|UP - UQ\| = \max_{(i,a)}\big|(UP)(i,a) - (UQ)(i,a)\big| \le (1-w+w\gamma)\,\|P-Q\|.$$

Hence U is a contraction with contraction factor (1 − w + wγ). Here, the second equality follows from an application of the mean value theorem of multivariate calculus.

Lemma 3. Let Q_w be as in equation (12) and let Q' be the fixed point of U. Then

$$\|Q_w - Q'\| \le \frac{1-w+w\gamma}{N\, w\, (1-\gamma)}\, \log(|A|).$$

Proof: From equation (12), we have

$$Q_w(s,a) = w\, r(s,a) + (1-w+w\gamma)\, E\Big[\max_{b\in A} Q_w(Z,b)\Big].$$

Now Q' is the unique fixed point of U (unique by virtue of Lemma 2), so

$$Q'(s,a) = w\, r(s,a) + (1-w+w\gamma)\, E\Big[\frac{1}{N}\log\sum_{b\in A} e^{N Q'(Z,b)}\Big],$$

where Z is a random variable with probability mass function q, and the expectation above is taken with respect to the law given by q. Let c = arg max_{b∈A} Q'(Z,b), i.e., Q'(Z,c) = max_{b∈A} Q'(Z,b). Now

$$\begin{aligned}
Q_w(s,a) - Q'(s,a) &= (1-w+w\gamma)\, E\Big[\max_{b\in A} Q_w(Z,b) - \frac{1}{N}\log\sum_{b\in A} e^{N Q'(Z,b)}\Big] \\
&= (1-w+w\gamma)\, E\Big[\max_{b\in A} Q_w(Z,b) - \max_{b\in A} Q'(Z,b) - \frac{1}{N}\log\sum_{b\in A} e^{N(Q'(Z,b) - Q'(Z,c))}\Big] \\
&\le (1-w+w\gamma)\, E\Big[\big|\max_{b\in A} Q_w(Z,b) - \max_{b\in A} Q'(Z,b)\big| + \frac{1}{N}\log\sum_{b\in A} e^{N(Q'(Z,b) - Q'(Z,c))}\Big] \\
&\le (1-w+w\gamma)\,\|Q_w - Q'\| + \frac{1-w+w\gamma}{N}\log|A|.
\end{aligned}$$

Hence

$$\|Q_w - Q'\| \le (1-w+w\gamma)\,\|Q_w - Q'\| + \frac{1-w+w\gamma}{N}\log|A| \;\implies\; \|Q_w - Q'\| \le \frac{1-w+w\gamma}{N\, w\,(1-\gamma)}\log|A|.$$

This completes the proof. This lemma shows that the approximation error ||Q_w − Q'|| → 0 as N → ∞.

Remark 5. It is easy to see that in the case w > 1,

$$\frac{1-w+w\gamma}{w} < \gamma.$$

This shows that the approximation error of G-SOVI is smaller than that of standard SOVI in the case w > 1.

We now invoke the following theorem from [15] to show the global convergence of our second order value iteration.

Theorem 1 (Global Newton Theorem). Suppose that F : R^d → R^d is continuous, component-wise concave on R^d, and differentiable, and that F'(x) is non-singular with F'(x)^{-1} ≥ 0 (i.e., each entry of F'(x)^{-1} is non-negative) for all x ∈ R^d. Assume, further, that F(x) = 0 has a unique solution x* and that F' is continuous on R^d. Then for any x_0 ∈ R^d, the Newton iterates given by (15) converge to x*.

Remark 6. Note that the above theorem is stated for convex F in [15]. However, the theorem holds true for concave F as well.

Theorem 2. Let Q' be the fixed point of the operator U. G-SOVI converges to Q' for any choice of the initial point Q_0.

Proof: G-SOVI computes the zeros of the equation Q − UQ = 0. So we appeal to Theorem 1 with the choice of F as I − U : R^{|S|×|A|} → R^{|S|×|A|}, where

$$(I-U)(Q)(i,a) = Q(i,a) - w\, r(i,a) - (1-w+w\gamma)\, E\Big[\frac{1}{N}\log\sum_{b\in A} e^{N Q(Z,b)}\Big].$$

It is enough to verify the hypotheses of Theorem 1 for I − U. Clearly, I − U is continuous, component-wise concave and differentiable, with (I − U)'(Q) = I − J_U(Q), where

$$J_U(Q)\big((i,a),(k,c)\big) = (1-w+w\gamma)\, q(k|i,a)\,\frac{e^{N Q(k,c)}}{\sum_{b\in A} e^{N Q(k,b)}}.$$

Note that J_U(Q) is an |S × A| × |S × A| dimensional matrix with 1 ≤ i, k ≤ |S| and 1 ≤ a, c ≤ |A|. Now observe that:
• each entry in the (i,a)-th row is non-negative;
• the sum of the entries in the (i,a)-th row is

$$\sum_{k=1}^{|S|}\sum_{c=1}^{|A|} (1-w+w\gamma)\, q(k|i,a)\,\frac{e^{N Q(k,c)}}{\sum_{b\in A} e^{N Q(k,b)}} = (1-w+w\gamma).$$

So J_U(Q) = (1−w+wγ)Φ for an |S × A| × |S × A| dimensional transition probability matrix Φ. It is easy to see that (I − J_U(Q))^{-1} exists (see Lemma 4), with the power series expansion

$$\big(I - J_U(Q)\big)^{-1} = \sum_{r=0}^{\infty} (1-w+w\gamma)^r\, \Phi^r.$$

Moreover, since each entry of Φ is non-negative, Φ ≥ 0, and hence (I − J_U(Q))^{-1} ≥ 0. Also, from Lemma 2 it is clear that the equation Q − UQ = 0 has a unique solution. This completes the proof.

Lemma 4. ||(I − J_U(Q))^{-1}|| ≤ 1/(w(1−γ)).

Proof: Note that I − J_U(Q) = I − (1−w+wγ)Φ for a given transition probability matrix Φ. Now suppose that λ is an eigenvalue of I − (1−w+wγ)Φ; then

$$1 - (1-w+w\gamma) \le |\lambda| \le 1 + (1-w+w\gamma).$$

Since 1 − (1−w+wγ) > 0, we have 0 ∉ σ(I − J_U(Q)), where σ(I − J_U(Q)) is the spectrum of I − J_U(Q). Hence, for any Q, (I − J_U(Q))^{-1} exists and we have

$$\big\|\big(I - J_U(Q)\big)^{-1}\big\| \le \frac{1}{1-(1-w+w\gamma)} = \frac{1}{w(1-\gamma)}.$$

This completes the proof.

The following theorem is an adaptation from [15].

Theorem 3. G-SOVI has second order convergence.

Proof: Recall that F(Q) = Q − UQ. Let Q* be the unique solution of F(Q) = 0 and let {Q_n} be the sequence of iterates generated by G-SOVI. Define e_n = ||Q_n − Q*|| and G(Q) = Q − F'(Q)^{-1} F(Q). As Q* satisfies Q* = UQ*, it is a fixed point of G. It is enough to show that e_{n+1} ≤ k e_n^2 for a constant k. It can be shown that, for our particular choice of F, F' is Lipschitz (with Lipschitz constant, say, L):

$$\|F'(Q) - F'(Q^*)\| \le L\,\|Q - Q^*\| \;\implies\; \|F(Q) - F(Q^*) - F'(Q^*)(Q - Q^*)\| \le \frac{L}{2}\|Q - Q^*\|^2$$

(by an application of the fundamental theorem of calculus). Utilizing the above properties, we have

$$\begin{aligned}
e_{n+1} &= \|Q_{n+1} - Q^*\| = \|G(Q_n) - G(Q^*)\| = \|Q_n - F'(Q_n)^{-1}F(Q_n) - Q^*\| \\
&\le \big\|F'(Q_n)^{-1}\big(F(Q_n) - F(Q^*) - F'(Q^*)(Q_n - Q^*)\big)\big\| + \big\|F'(Q_n)^{-1}\big(F'(Q_n) - F'(Q^*)\big)(Q_n - Q^*)\big\| \\
&\le \|F'(Q_n)^{-1}\|\,\big\|F(Q_n) - F(Q^*) - F'(Q^*)(Q_n - Q^*)\big\| + \|F'(Q_n)^{-1}\|\,\big\|\big(F'(Q_n) - F'(Q^*)\big)(Q_n - Q^*)\big\| \\
&\le \frac{3}{2}\,\beta\,\|Q_n - Q^*\|^2 = k\, e_n^2,
\end{aligned}$$

where β = L ||F'(Q_n)^{-1}|| ≤ L/(w(1−γ)) (from Lemma 4) and k = (3/2)β.

VI. EXPERIMENTS

In this section, we describe the experimental results of our proposed G-SOVI algorithm and compare it with standard SOVI and value iteration. For this purpose, we use the Python MDP toolbox [7] for generating the MDPs and implementing standard value iteration². The generated MDPs satisfy p(i|i,a) > 0, ∀i, a, in order to ensure that w* > 1. We consider the error defined below as the metric for comparison between the algorithms. The error for a given algorithm at iteration i, denoted E(i), is calculated as the max-norm difference between the optimal value function and the value function estimate at iteration i. That is,

$$E(i) = \|V^* - Q_i(\cdot, a^*)\|_\infty,$$

where V* is the optimal value function of the MDP, Q_i(·,·) is the Q-value function estimate at iteration i and, for any state j, a*_j = arg max_a Q_i(j,a).

²The code for our experiments is available at: https://github.com/raghudiddigi/G-SOVI
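For reference, this error metric is straightforward to compute from a Q-table once V* is known; the helper below is a small illustration in our own notation and is not part of the authors' released code.

```python
import numpy as np

def error_metric(V_star, Q):
    """E(i) = || V* - Q_i(., a*) ||_inf, with a*_j = argmax_a Q_i(j, a)."""
    greedy_values = Q[np.arange(Q.shape[0]), Q.argmax(axis=1)]  # Q_i(j, a*_j) for every state j
    return np.max(np.abs(V_star - greedy_values))
```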
First, we generate 100 independent MDPs, each with 10 states and 5 actions, and set the discount factor to 0.9 in each case. We run all the algorithms for 50 iterations. The initial Q-values of the algorithms are assigned random integers between 10 and 20 (which are far away from the optimal value function). In Table I, we indicate the average error (error averaged over the 100 MDPs) at the end of 50 iterations for all the algorithms, wherein for G-SOVI we set w = w* as the relaxation parameter. We observe that standard SOVI and G-SOVI with N = 25, 30, and 35 have lower error at the end of 50 iterations compared to standard value iteration. Moreover, the average error is the least for our proposed G-SOVI algorithm. Also, we find that the higher the value of N, the smaller the error between the G-SOVI value function and the optimal value function.

Table I: Comparison of average error for different values of N on the 10 states and 5 actions setting at the end of 50 iterations. For the G-SOVI algorithm, the relaxation parameter is chosen to be the optimal relaxation parameter w*, i.e., w = w*.

Value of N | Standard Value Iteration              | Standard SOVI   | G-SOVI
N = 20     | 0.1009 ± 0.0026 (independent of N)    | 0.1205 ± 0.0372 | 0.1093 ± 0.0818
N = 25     |                                       | 0.0822 ± 0.0273 | 0.0648 ± 0.0217
N = 30     |                                       | 0.0611 ± 0.0211 | 0.0494 ± 0.017
N = 35     |                                       | 0.0484 ± 0.0168 | 0.0397 ± 0.0136

In Table II, we report the performance of G-SOVI for different values of the feasible successive relaxation parameter w across the same 100 MDPs generated previously (as in Table I). The optimal successive relaxation parameter w* here lies between 1.1 and 1.5. Recall that G-SOVI exhibits faster convergence than standard SOVI (first row of Table II) for any value of w that satisfies 1 < w ≤ w*. From Table II, we can conclude that G-SOVI with w ∈ (1, w*] performs at least as fast as standard SOVI. Moreover, the higher the value of w, the better the performance when the algorithm is run for a sufficient number of iterations.

Table II: Comparison of average error in G-SOVI for different values of w on the 10 states and 5 actions setting at the end of 50 iterations. The value of N is 35.

Value of w            | G-SOVI
w = 1 (Standard SOVI) | 0.04838 ± 0.017
w = 1.00001           | 0.04838 ± 0.017
w = 1.0001            | 0.04837 ± 0.017
w = 1.001             | 0.04830 ± 0.017
w = 1.01              | 0.0476 ± 0.017
w = 1.05              | 0.0448 ± 0.016
w = 1.1               | 0.0417 ± 0.014
w = w*                | 0.0397 ± 0.014
In Table III, we present the results of the three algorithms on four different settings, averaged over 10 MDPs. Standard SOVI and G-SOVI are run with N = 35. All the algorithms are run for 10 iterations. We observe that standard SOVI and G-SOVI have lower error compared to standard value iteration. Moreover, the difference here is much more pronounced than in Table I, where the algorithms are run for 50 iterations.

Table III: Comparison of average error across four settings at the end of 10 iterations with N = 35. For the G-SOVI algorithm, the relaxation parameter is chosen to be the optimal relaxation parameter w*, i.e., w = w*.

Setting                    | Standard Value Iteration | Standard SOVI | G-SOVI
States = 30, Actions = 10  | 6.471 ± 0.07             | 0.087 ± 0.01  | 0.079 ± 0.01
States = 50, Actions = 10  | 6.587 ± 0.07             | 0.114 ± 0.01  | 0.108 ± 0.01
States = 80, Actions = 10  | 6.754 ± 0.03             | 0.141 ± 0.01  | 0.136 ± 0.01
States = 100, Actions = 10 | 6.772 ± 0.03             | 0.152 ± 0.01  | 0.148 ± 0.01

Recall that the SOVI and G-SOVI algorithms with a fixed N give near-optimal value functions. The advantage of using our proposed algorithms is that the Q-value iterates converge to the near-optimal Q-values rapidly. This can also be observed in Figure 1, where we present the convergence of the algorithms over 50 iterations on the 100 states and 10 actions setting. The SOVI and G-SOVI algorithms converge rapidly to a value and then stay constant. In fact, we observe here that their error is less than that obtained by standard value iteration up to 45 iterations. Moreover, G-SOVI computes a solution that gives lower error than the one obtained by SOVI.

[Figure 1: Error vs. number of iterations on the setting with 100 states and 10 actions, with w = w* for Generalized SOVI (G-SOVI).]

In Table IV, we indicate the per-iteration execution time of the algorithms across the four settings considered above. We can see that, due to the Jacobian inversion operation in the second-order techniques, the standard SOVI and G-SOVI algorithms take more time per iteration than the standard value iteration algorithm.

Table IV: Per-iteration execution time of the algorithms (in seconds) across four settings, with the relaxation parameter in G-SOVI chosen as w = w*.

Setting                    | Standard Value Iteration | Standard SOVI | G-SOVI
States = 30, Actions = 10  | 0.0008 ± 0.00            | 0.0154 ± 0.01 | 0.0267 ± 0.01
States = 50, Actions = 10  | 0.0009 ± 0.00            | 0.0242 ± 0.00 | 0.0488 ± 0.00
States = 80, Actions = 10  | 0.0011 ± 0.00            | 0.0532 ± 0.00 | 0.0988 ± 0.01
States = 100, Actions = 10 | 0.0026 ± 0.00            | 0.1202 ± 0.01 | 0.1343 ± 0.01

Recall that the advantage of second-order methods is that, even though the per-iteration computation is higher than for first-order methods, the total number of iterations needed to achieve a desired error threshold is in general much lower. Hence, they are capable of achieving lower error in the same computational time. We demonstrate this in Table V for three settings. We select the parameters of this experiment (i.e., the number of states and actions, the values of N and w, and the number of iterations) such that the second-order methods compute better solutions than the standard value iteration scheme³. For example, consider the 10 states and 5 actions setting (first row of Table V). Standard value iteration is run for 50 iterations. Its per-iteration time is 0.2 ms, which results in an overall computational time of 0.0002 × 50 = 0.01 seconds. On the other hand, the SOVI techniques (standard SOVI and G-SOVI) are run for just 3 iterations. However, their per-iteration time is 0.0033 seconds, and hence the overall computational time is also 0.01 seconds. We observe that, in 0.01 seconds, the SOVI methods achieve lower error than standard value iteration. Similarly, in the other two settings in Table V, we see that the second order SOVI algorithms achieve lower error than standard value iteration when run for 0.02 and 0.03 seconds, respectively. It is important to note that this advantage need not hold in general for MDPs with a large number of states and actions, as the overhead of computing the Jacobian inverse in large MDPs will be higher and would affect the overall computational time. If one could deploy techniques to improve the computation time for matrix operations, G-SOVI would be preferred over standard value iteration for computing the optimal value function.

³The value of w = 1.00001 respects the constraint w ≤ w* in all three settings.

Table V: Average error vs. computational time (rounded off to the nearest millisecond). Initial Q-values for the algorithms are assigned random integers between 60 and 70. The discount factor is set to 0.99. G-SOVI is run with w = 1.00001.

Configuration        | Computational Time (seconds) | Standard Value Iteration | Standard SOVI | G-SOVI
10 States, 5 Actions | 0.01                         | 25.485 ± 2.21            | 3.930 ± 0.92  | 3.885 ± 0.94
20 States, 5 Actions | 0.02                         | 18.291 ± 0.77            | 5.444 ± 0.51  | 5.473 ± 0.50
30 States, 5 Actions | 0.03                         | 7.327 ± 0.20             | 7.111 ± 0.32  | 7.118 ± 0.33

VII. CONCLUSION

In this work, we have proposed a generalized second-order value iteration scheme based on the Newton-Raphson method for faster convergence to a near-optimal value function in discounted reward MDP problems. The first step involved constructing a differentiable Bellman equation through an approximation of the max(.) operator.
We then applied the second order Newton method to arrive at the proposed algorithm. We proved bounds on the approximation error and showed second order convergence to the optimal value function. Finally, approaches geared towards easing the computational burden associated with solving problems involving large state and action spaces, such as those based on approximate dynamic programming, can be developed in the context of G-SOVI schemes in the future.

REFERENCES

[1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.
[3] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[4] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.
[5] Adithya M. Devraj and Sean Meyn. Zap Q-learning. In Advances in Neural Information Processing Systems, pages 2235–2244, 2017.
[6] Thomas Furmston, Guy Lever, and David Barber. Approximate Newton methods for policy search in Markov decision processes. The Journal of Machine Learning Research, 17(1):8055–8105, 2016.
[7] GitHub. Python MDP toolbox. https://github.com/sawcordwell/pymdptoolbox.
[8] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. arXiv preprint arXiv:1905.09963, 2019.
[9] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1352–1361, 2017.
[10] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[11] Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, and Shalabh Bhatnagar. Successive over-relaxation Q-learning. IEEE Control Systems Letters, 4(1):55–60, 2019.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[14] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[15] James M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables, volume 30. SIAM, 1970.
[16] Martin L. Puterman and Shelby L. Brumelle. On the convergence of policy iteration in stationary dynamic programming. Mathematics of Operations Research, 4(1):60–69, 1979.
[17] Dieter Reetz. Solution of a Markovian decision problem by successive overrelaxation. Zeitschrift für Operations Research, 17(1):29–32, 1973.
[18] John Rust. Structural estimation of Markov decision processes. Handbook of Econometrics, 4:3081–3143, 1994.
[19] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.