Papers by Shalabh Bhatnagar
arXiv (Cornell University), Sep 29, 2016
In this paper, we provide a new algorithm for the problem of prediction in Reinforcement Learning (RL), i.e., estimating the Value Function of a Markov Reward Process (MRP) using the linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithm is a multi-timescale variant of the very popular Cross Entropy (CE) method, a model-based search method for finding the global optimum of a real-valued function. This is the first time a model-based search method has been used for the prediction problem, and the application of CE to a stochastic setting is a completely unexplored domain. A proof of convergence using the ODE method is provided. The theoretical results are supplemented with experimental comparisons. The algorithm achieves good performance fairly consistently on many RL benchmark problems, demonstrating its competitiveness against least squares and other state-of-the-art algorithms in terms of computational efficiency, accuracy and stability.
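For orientation, the generic Monte-Carlo CE search loop that this work builds on can be sketched as follows; the paper's algorithm is a multi-timescale, incremental variant of this, and the Gaussian model, sample size and elite fraction below are illustrative assumptions, not the paper's choices.

import numpy as np

def ce_minimize(objective, dim, n_samples=100, elite_frac=0.1, n_iters=50,
                rng=np.random.default_rng()):
    mu, sigma = np.zeros(dim), np.ones(dim)   # Gaussian model over solutions
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        samples = mu + sigma * rng.standard_normal((n_samples, dim))
        scores = np.array([objective(s) for s in samples])
        elites = samples[np.argsort(scores)[:n_elite]]   # best candidates
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-8  # refit model
    return mu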
arXiv (Cornell University), Dec 22, 2016
We obtain several informative error bounds on function approximation for the policy evaluation algorithm proposed by Basu et al. for the risk-sensitive cost criterion represented using exponential utility. The novelty of our approach is that we use the irreducibility of a Markov chain (via the existing Bapat and Lindqvist inequality as well as a new bound using Perron-Frobenius eigenvectors) to obtain the new bounds, whereas the earlier work used a spectral variation bound that holds for any matrix.
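For background, the Perron-Frobenius property exploited here is the standard one (a textbook fact, not the paper's bound itself):

$$A \ge 0 \text{ irreducible} \;\Longrightarrow\; \rho(A) \text{ is a simple eigenvalue of } A \text{ with an eigenvector } v > 0, \quad A v = \rho(A)\, v,$$

where $\rho(A)$ denotes the spectral radius.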
arXiv (Cornell University), Apr 23, 2015
We are interested in understanding stability (almost sure boundedness) of stochastic approximation algorithms (SAs) driven by a 'controlled Markov' process. Analyzing this class of algorithms is important, since many reinforcement learning (RL) algorithms can be cast as SAs driven by a 'controlled Markov' process. In this paper, we present easily verifiable sufficient conditions for stability and convergence of such SAs. Many RL applications involve continuous state spaces; while our analysis readily ensures stability for such continuous-state applications, traditional analyses do not. Compared to the literature, our analysis presents a twofold generalization: (a) the Markov process may evolve in a continuous state space, and (b) the process need not be ergodic under any given stationary policy. Temporal difference learning (TD) is an important policy evaluation method in reinforcement learning. The theory developed herein is used to analyze generalized TD(0), an important variant of TD, as well as a TD formulation of supervised learning for forecasting problems.
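For reference, the standard TD(0) recursion with linear function approximation that 'generalized TD(0)' extends has the form (notation assumed for illustration, not copied from the paper):

$$\theta_{n+1} = \theta_n + a(n)\,\delta_n\,\phi(X_n), \qquad \delta_n = r(X_n, X_{n+1}) + \gamma\,\theta_n^\top \phi(X_{n+1}) - \theta_n^\top \phi(X_n),$$

where $\phi$ is the feature map, $a(n)$ the step-size sequence, and $(X_n)$ the underlying Markov process.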
arXiv (Cornell University), May 8, 2019
In cooperative stochastic games, multiple agents work towards learning joint optimal actions in an unknown environment to achieve a common goal. In many real-world applications, however, constraints are often imposed on the actions that can be jointly taken by the agents. In such scenarios the agents aim to learn joint actions that achieve a common goal (minimizing a specified cost function) while meeting the given constraints (specified via certain penalty functions). In this paper, we consider a relaxation of the constrained optimization problem obtained by constructing the Lagrangian of the cost and penalty functions, and we propose a nested actor-critic solution approach to solve this relaxed problem. In this approach, an actor-critic scheme is employed to improve the policy for a given Lagrange parameter on the faster timescale, as in the classical actor-critic architecture. A meta actor-critic scheme that uses these faster-timescale policy updates is then employed to improve the Lagrange parameters on the slower timescale. Utilizing the proposed nested actor-critic schemes, we develop three Nested Actor-Critic (N-AC) algorithms. Through experiments on constrained cooperative tasks, we show the effectiveness of the proposed algorithms.
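A minimal sketch of the two-timescale Lagrangian-relaxation skeleton described above; all names (grad_lagrangian, avg_penalty, threshold) are illustrative placeholders, not the paper's interfaces, and the step-size exponents are assumptions.

import numpy as np

def nested_updates(theta, lam, grad_lagrangian, avg_penalty, threshold,
                   n, a0=1.0, b0=0.1):
    a_n = a0 / (1 + n) ** 0.6   # faster step size (policy)
    b_n = b0 / (1 + n)          # slower step size (multiplier); b_n / a_n -> 0
    theta = theta - a_n * grad_lagrangian(theta, lam)   # policy update at fixed lam
    lam = max(0.0, lam + b_n * (avg_penalty(theta) - threshold))  # dual ascent
    return theta, lam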
Stochastics An International Journal of Probability and Stochastic Processes, Jul 25, 2017
In this paper we study the asymptotic behavior of stochastic approximation schemes with set-valued drift functions and non-additive iterate-dependent Markov noise. We show that a linearly interpolated trajectory of such a recursion is an asymptotic pseudotrajectory for the flow of a limiting differential inclusion obtained by averaging the set-valued drift function of the recursion w.r.t. the stationary distributions of the Markov noise. The limit set theorem in [1] is then used to characterize the limit sets of the recursion in terms of the dynamics of the limiting differential inclusion. We then state two variants of the Markov noise assumption under which the analysis of the recursion is similar to the one presented in this paper. Scenarios where our recursion naturally appears are presented as applications; these include controlled stochastic approximation, subgradient descent, the approximate drift problem, and the analysis of discontinuous dynamics, all in the presence of non-additive iterate-dependent Markov noise.
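In symbols, one common presentation of this class of recursions is (generic form with assumed notation; the paper's exact assumptions may differ):

$$x_{n+1} = x_n + a(n)\left(y_n + M_{n+1}\right), \qquad y_n \in h(x_n, S_n),$$

where $h$ is the set-valued drift, $S_n$ the iterate-dependent Markov noise, $M_{n+1}$ a martingale difference term, and $a(n)$ the step-size sequence.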
arXiv (Cornell University), Feb 19, 2015
We present new algorithms for simulation optimization using random directions stochastic approximation (RDSA). These include first-order (gradient) as well as second-order (Newton) schemes. We incorporate both continuous-valued and discrete-valued perturbations into our algorithms: the former are independent and identically distributed (i.i.d.) symmetric uniformly distributed random variables, while the latter are i.i.d. asymmetric Bernoulli random variables. Our Newton algorithm, with a novel Hessian estimation scheme, requires N-dimensional perturbations and three loss measurements per iteration, whereas the simultaneous perturbation Newton search algorithm of [30] requires 2N-dimensional perturbations and four loss measurements per iteration. We prove the unbiasedness of both gradient and Hessian estimates and asymptotic (strong) convergence for both first-order and second-order schemes. We also provide asymptotic normality results, which in particular establish that the asymmetric Bernoulli variant of the Newton RDSA method performs better than 2SPSA of [30]. Numerical experiments are used to validate the theoretical results.
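An illustrative sketch of an RDSA-style two-measurement gradient estimate with uniform perturbations; the scaling 3 equals 1/E[d_i^2] for d_i ~ Unif[-1, 1]. This is a sketch of the estimator family the paper studies, not the paper's full algorithm.

import numpy as np

def rdsa_gradient(f, x, delta=1e-2, rng=np.random.default_rng()):
    d = rng.uniform(-1.0, 1.0, size=x.shape)              # random direction
    y_plus, y_minus = f(x + delta * d), f(x - delta * d)  # two loss measurements
    return 3.0 * d * (y_plus - y_minus) / (2.0 * delta)   # gradient estimate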
arXiv (Cornell University), Aug 8, 2018
We introduce deterministic perturbation schemes for the recently proposed random directions stochastic approximation (RDSA) [19], and propose new first-order and second-order algorithms. In the latter case, these are the first second-order algorithms to incorporate deterministic perturbations. We show that the gradient and/or Hessian estimates in the resulting algorithms with deterministic perturbations are asymptotically unbiased, so that the algorithms are provably convergent. Furthermore, we derive convergence rates to establish the superiority of the first-order and second-order algorithms, for the special cases of convex and quadratic optimization problems, respectively. Numerical experiments are used to validate the theoretical results.
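One classical deterministic-perturbation construction from the earlier SPSA literature cycles through columns of a Hadamard matrix so that perturbations average out over a cycle; whether this paper's sequences coincide with this construction is not stated in the abstract, so treat the sketch below as an assumed example.

import numpy as np

def hadamard(order):
    # Sylvester construction; order must be a power of two
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def deterministic_perturbations(dim):
    order = 1
    while order < dim + 1:
        order *= 2
    rows = hadamard(order)[1:dim + 1, :]   # drop the all-ones row; shape (dim, order)
    k = 0
    while True:
        yield rows[:, k % order]           # cycle through +/-1 direction vectors
        k += 1

Usage: gen = deterministic_perturbations(5); d = next(gen) gives the next +/-1 direction.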
arXiv (Cornell University), Oct 14, 2022
During the initial iterations of training in most Reinforcement Learning (RL) algorithms, agents perform a significant number of random exploratory steps. In the real world, this can limit the practicality of these algorithms, as it can lead to potentially dangerous behavior; safe exploration is therefore a critical issue in applying RL algorithms in the real world. This problem has recently been well studied under the Constrained Markov Decision Process (CMDP) framework, where in addition to single-stage rewards, an agent receives single-stage costs or penalties depending on the state transitions. The prescribed cost functions map undesirable behavior at any given time-step to a scalar value. The goal is then to find a feasible policy that maximizes reward returns while constraining the cost returns to be below a prescribed threshold during training as well as deployment. We propose an on-policy model-based safe deep RL algorithm in which we learn the transition dynamics of the environment in an online manner and find a feasible optimal policy using Lagrangian relaxation-based Proximal Policy Optimization. We use an ensemble of neural networks with different initializations to tackle the epistemic and aleatoric uncertainty issues faced during environment model learning. We compare our approach with relevant model-free and model-based approaches in constrained RL using the challenging safe reinforcement learning benchmark, the OpenAI Safety Gym. We demonstrate that our algorithm is more sample efficient and results in lower cumulative hazard violations than constrained model-free approaches. Further, our approach shows better reward performance than other constrained model-based approaches in the literature.
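A minimal sketch of the ensemble idea for model learning: several identically structured dynamics models with different random initializations, whose prediction spread serves as an epistemic-uncertainty signal. The interface (.fit/.predict regressors) is illustrative, not the paper's architecture.

import numpy as np

class EnsembleDynamics:
    def __init__(self, models):
        self.models = models   # list of regressors exposing .fit(X, y) / .predict(X)

    def fit(self, states_actions, next_states):
        for m in self.models:                    # each member sees the same data
            m.fit(states_actions, next_states)   # but starts from a different init

    def predict(self, states_actions):
        preds = np.stack([m.predict(states_actions) for m in self.models])
        return preds.mean(axis=0), preds.std(axis=0)   # prediction + disagreement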
arXiv (Cornell University), Nov 20, 2019
Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem to the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values with probability one. Finally, we illustrate the competitive advantage of learning options in the average reward setting on a grid-world environment with sparse rewards.
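The average reward criterion referred to here is the usual steady-state objective (standard definition, not specific to this paper):

$$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_{t+1}\right],$$

maximized in place of the discounted return $\mathbb{E}_\pi\!\left[\sum_{t} \gamma^{t} r_{t+1}\right]$, so no discount factor shortens the credit-assignment horizon.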
arXiv (Cornell University), Jan 10, 2016
This paper compiles several aspects of the dynamics of stochastic approximation algorithms with Markov iterate-dependent noise when the iterates are not known to be stable beforehand. We do so by extending the lock-in probability framework (i.e., the probability of convergence of the iterates to a specific attractor of the limiting o.d.e., given that the iterates are in its domain of attraction after a sufficiently large number of iterations, say n0) to such recursions. Specifically, under the more restrictive assumption of Markov iterate-dependent noise supported on a bounded subset of Euclidean space, we give a lower bound for the lock-in probability. We use these results to prove almost sure convergence of the iterates to the specified attractor when the iterates satisfy an asymptotic tightness condition. The novelty of our approach is that if the state space of the Markov process is compact, we prove almost sure convergence under much weaker assumptions compared to the work by Andrieu et al., which solves the general state space case under much more restrictive assumptions. We also extend our single-timescale results to the case where there are two separate recursions over two different timescales; this, in turn, is shown to be useful in analyzing the tracking ability of general adaptive algorithms. Additionally, we show that our results can be used to derive a sample complexity estimate of such recursions, which can then be used for step-size selection.
arXiv (Cornell University), May 19, 2016
In this paper we provide a rigorous convergence analysis of an 'off-policy' temporal difference learning algorithm with linear function approximation and per-time-step linear computational complexity in an 'online' learning environment. The algorithm considered here is TDC with importance weighting, introduced by Maei et al. We support our theoretical results with suitable empirical results on standard off-policy counterexamples.
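For reference, a standard presentation of the TDC update with importance weights $\rho_t = \pi(a_t \mid s_t)/\mu(a_t \mid s_t)$ is (notation assumed; see Maei et al. for the precise form analyzed):

$$\delta_t = r_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t,$$
$$\theta_{t+1} = \theta_t + \alpha_t\,\rho_t\left(\delta_t\,\phi_t - \gamma\,\phi_{t+1}\,(\phi_t^\top w_t)\right), \qquad w_{t+1} = w_t + \beta_t\left(\rho_t\,\delta_t - \phi_t^\top w_t\right)\phi_t,$$

with $\theta_t$ the value-function weights and $w_t$ an auxiliary weight vector estimated on a faster timescale.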
arXiv (Cornell University), Mar 31, 2015
We present for the first time an asymptotic convergence analysis of two-timescale stochastic approximation driven by 'controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both timescales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation, using our results.
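The coupled recursions analyzed are of the generic two-timescale form (illustrative notation):

$$x_{n+1} = x_n + a(n)\left[h(x_n, y_n, Z^{(1)}_n) + M^{(1)}_{n+1}\right], \qquad y_{n+1} = y_n + b(n)\left[g(x_n, y_n, Z^{(2)}_n) + M^{(2)}_{n+1}\right],$$

with $b(n)/a(n) \to 0$, so the $y$-recursion is the slower one; $Z^{(i)}_n$ are the controlled Markov noise processes and $M^{(i)}_{n+1}$ the martingale difference terms.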
arXiv (Cornell University), May 10, 2019
Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov Decision Process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first-order method and may therefore take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique for solving a fixed point equation, and it has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second-order value iteration procedure obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and establish its second-order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.
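A hedged sketch of relaxed value iteration: blend the Bellman backup with the current value using a relaxation factor w (w = 1 recovers standard value iteration; w > 1, over-relaxation, speeds convergence only under the special MDP structure cited above). The Newton-Raphson layer the paper adds on top of this fixed-point equation is not shown.

import numpy as np

def relaxed_value_iteration(P, R, gamma, w=1.0, n_iters=500):
    # P: (A, S, S) transition tensor, R: (A, S) rewards
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        Q = R + gamma * P @ V                 # (A, S) Bellman backups
        V = (1 - w) * V + w * Q.max(axis=0)   # relaxed update
    return V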
arXiv (Cornell University), Jan 31, 2018
The cross entropy (CE) method is a model-based search method for solving optimization problems where the objective function has minimal structure. The Monte-Carlo version of the CE method employs naive sample averaging, which is inefficient both computationally and in terms of storage. We provide a novel stochastic approximation version of the CE method in which the sample averaging is replaced with incremental geometric averaging; this approach can save considerable computational and storage costs. Our algorithm is incremental in nature and possesses additional attractive features such as accuracy, stability, robustness and convergence to the global optimum for a particular class of objective functions. We evaluate the algorithm on a variety of global optimization benchmark problems, and the results obtained corroborate our theoretical findings.
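The key computational device is the replacement of the running sample mean by an exponentially weighted recursion (generic form; the paper applies it to the CE model parameters):

$$\bar{g}_t = \frac{1}{t}\sum_{i=1}^{t} g_i \quad\longrightarrow\quad \hat{g}_{t+1} = (1-\lambda)\,\hat{g}_t + \lambda\, g_{t+1}, \quad 0 < \lambda < 1,$$

which needs O(1) memory per tracked statistic instead of storing or re-scanning past samples.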
arXiv (Cornell University), Jan 8, 2014
We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break this problem down into simpler sub-problems that ensure there is no Bellman error for a given state and agent. We then provide a characterization of the solution points of these sub-problems that correspond to Nash equilibria of the underlying game, and for this purpose we derive a set of necessary and sufficient SG-SP (Stochastic Game-Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single-state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with 810,000 states, we establish that ON-SGSP consistently outperforms the NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.
Adaptive Agents and Multi-Agents Systems, May 4, 2015
We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from [9] to a general N-player game setting. Next, we break down the optimization problem into simpler sub-problems that ensure there is no Bellman error for a given state and agent. We then provide a characterization of the solution points of these sub-problems that correspond to Nash equilibria of the underlying game, and for this purpose we derive a set of necessary and sufficient SG-SP (Stochastic Game-Sub-Problem) conditions. Using these conditions, we develop two provably convergent algorithms. The first algorithm, OFF-SGSP, is centralized and model-based, i.e., it assumes complete information of the game. The second algorithm, ON-SGSP, is an online model-free algorithm. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single-state non-generic game [12] as well as on a synthetic two-player game setup with 810,000 states, we establish that ON-SGSP consistently outperforms the NashQ [16] and FFQ [21] algorithms.
arXiv (Cornell University), Oct 10, 2022
We revisit the standard formulation of the tabular actor-critic algorithm as a two-timescale stochastic approximation with the value function computed on a faster timescale and the policy computed on a slower timescale, which emulates policy iteration. We observe that a reversal of the timescales will in fact emulate value iteration and is a legitimate algorithm in its own right. We provide a proof of convergence and compare the two empirically, with and without function approximation (with both linear and nonlinear function approximators), and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
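A small sketch of the timescale reversal in terms of step-size schedules: actor-critic runs the critic fast and the actor slow; critic-actor swaps them. The exponents are illustrative assumptions, not the paper's exact schedules.

def step_sizes(n, variant="critic-actor"):
    fast = 1.0 / (1 + n) ** 0.55   # faster (larger) step size
    slow = 1.0 / (1 + n) ** 0.9    # slower step size
    if variant == "actor-critic":
        return {"critic": fast, "actor": slow}   # emulates policy iteration
    return {"critic": slow, "actor": fast}       # reversal emulates value iteration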
arXiv (Cornell University), Nov 17, 2016
In this paper we study the asymptotic behavior of a stochastic approximation scheme on two timescales with set-valued drift functions and in the presence of non-additive iterate-dependent Markov noise. It is shown that the recursion on each timescale tracks the flow of a differential inclusion obtained by averaging the set-valued drift function in the recursion with respect to a set of measures which account both for the averaging with respect to the stationary distributions of the Markov noise terms and for the interdependence between the two recursions on different timescales. The framework studied in this paper builds on the work of A. Ramaswamy et al. by allowing for the presence of non-additive iterate-dependent Markov noise. As an application, we consider the problem of computing the optimum in a constrained convex optimization problem where the objective function and the constraints are averaged with respect to the stationary distribution of an underlying Markov chain. Further, the proposed scheme neither requires the differentiability of the objective function nor the knowledge of the averaging measure.