

Variable Impedance Control
A Reinforcement Learning Approach
Jonas Buchli, Evangelos Theodorou, Freek Stulp, Stefan Schaal

Abstract— One of the hallmarks of the performance, versatility, and robustness of biological motor control is the ability to adapt the impedance of the overall biomechanical system to different task requirements and stochastic disturbances. A transfer of this principle to robotics is desirable, for instance to enable robots to work robustly and safely in everyday human environments. It is, however, not trivial to derive variable impedance controllers for practical high DOF robotic tasks. In this contribution, we accomplish such gain scheduling with a reinforcement learning algorithm, PI2 (Policy Improvement with Path Integrals). PI2 is a model-free, sampling based learning method derived from first principles of optimal control. The PI2 algorithm requires no tuning of algorithmic parameters besides the exploration noise. The designer can thus fully focus on cost function design to specify the task. From the viewpoint of robotics, a particularly useful property of PI2 is that it can scale to problems of many DOFs, so that RL on real robotic systems becomes feasible. We sketch the PI2 algorithm and its theoretical properties, and how it is applied to gain scheduling. We evaluate our approach by presenting results on two different simulated robotic systems, a 3-DOF Phantom Premium Robot and a 6-DOF Kuka Lightweight Robot. We investigate tasks where the optimal strategy requires both tuning of the impedance of the end-effector and tuning of a reference trajectory. The results show that we can use path integral based RL not only for planning but also to derive variable gain feedback controllers in realistic scenarios. Thus, the power of variable impedance control is made available to a wide variety of robotic systems and practical applications.

I. INTRODUCTION

Biological motor systems excel in terms of versatility, performance, and robustness in environments that are highly dynamic, often unpredictable, and partially stochastic. In contrast to classical robotics, mostly characterized by high gain negative error feedback control, biological systems derive some of their superiority from low gain compliant control with variable and task dependent impedance. If we adapt this concept of adaptive impedance for PD negative feedback control, it translates into time varying proportional and derivative gains, also known as gain scheduling. Finding the appropriate gain schedule for a given task is, however, a hard problem.

One possibility to overcome such problems is Reinforcement Learning (RL). The idea of RL is that, given only a reward function, the learning algorithm finds strategies that yield high reward through trial and error. As a special and important feature, RL accomplishes such optimal performance without knowledge of the models of the motor system and/or the environment. However, so far, RL does not scale well to high-dimensional continuous state-action control problems.

Closely related to RL is optimal control theory, where gain scheduling is a natural part of many optimal control algorithms. However, optimal control requires model-based derivations, such that it is frequently not applicable to complex robotic systems and environments, where models are unknown.

In this paper, we present a novel RL algorithm that does scale to complex robotic systems, and that accomplishes gain scheduling in combination with optimizing other performance criteria. Evaluations on two simulated robotic systems demonstrate the effectiveness of our approach. In the following section, we will first motivate variable impedance control. Then, we sketch our novel RL algorithm, called PI2, and its applicability to learning gain scheduling. In the fourth section, we will present evaluation results on a 3 DOF and a 6 DOF robotic arm, where the task requires the robot to learn both a reference trajectory and the appropriate time varying impedance. We conclude with a review of related work and a discussion of future directions.

II. VARIABLE IMPEDANCE CONTROL

The classical approach to robot control is negative feedback control with high proportional-derivative (PD) gains. This type of control is straightforward to implement, robust towards modeling uncertainties, and computationally cheap. Unfortunately, high gain control is not ideal for many tasks involving interaction with the environment, e.g. force control tasks or locomotion. In contrast, impedance control [5] seeks to realize a specific impedance of the robot, either in end-effector or joint space. The issue of specifying the target impedance, however, has not been completely addressed yet. While for simple factory tasks, where the properties of the task and environment are known a priori, suitable impedance characteristics may be derivable, it is usually not easy to understand how impedance control should be applied to more complex tasks, such as a robot walking over difficult terrain or manipulating objects of daily life (e.g. pillows, hammers, cans, etc.). An additional benefit of variable impedance behavior in a robot comes from the added active safety due to soft "giving in", both for the robot and its environment.

In the following we consider robots with torque controlled joints. The motor commands u are calculated via a PD control law with a feedforward control term u_ff:

u = -K_P (q - q_d) - K_D (\dot{q} - \dot{q}_d) + u_{ff}    (1)

where K_P, K_D are the positive definite position and velocity gain matrices, q, q̇ are the joint positions and velocities, and q_d, q̇_d are the desired joint positions and velocities. The feedforward control term may come, for instance, from an inverse dynamics control component, or a computed torque control component [15].
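As a concrete reading of Eq. (1), the following Python sketch (NumPy; all names are ours and purely illustrative) evaluates the control law for one servo cycle. The feedforward term is assumed to be computed elsewhere, e.g. by an inverse dynamics routine, and the damping choice in the example is only one common heuristic, not a prescription from this paper.

```python
import numpy as np

def pd_feedforward_command(q, qdot, q_d, qdot_d, K_P, K_D, u_ff):
    """Eq. (1): u = -K_P (q - q_d) - K_D (qdot - qdot_d) + u_ff.

    q, qdot     : measured joint positions and velocities, shape (n,)
    q_d, qdot_d : desired joint positions and velocities, shape (n,)
    K_P, K_D    : positive definite position and velocity gain matrices, (n, n)
    u_ff        : feedforward torque, e.g. from inverse dynamics, shape (n,)
    """
    return -K_P @ (q - q_d) - K_D @ (qdot - qdot_d) + u_ff

# Example with low, diagonal gains for a hypothetical 3-DOF arm; the damping
# is chosen here as 2*sqrt(K_P), which is merely a common rule of thumb.
n = 3
K_P = np.diag([6.0, 6.0, 6.0])
K_D = np.diag(2.0 * np.sqrt(np.diag(K_P)))
u = pd_feedforward_command(np.zeros(n), np.zeros(n),
                           np.full(n, 0.1), np.zeros(n),
                           K_P, K_D, np.zeros(n))
```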
Thus, the impedance of a joint is parameterized by the choice of the gains K_P ("stiffness") and K_D ("damping").

For many applications, the joint space impedance is, however, of secondary interest. Most often, regulating impedance matters most at certain points of contact with the environment, e.g., the end-effectors of the robot. We therefore need to assess the impedance at these points of contact rather than at the joints. Joint space impedance is computed from the desired task space impedance K_{P,x}, K_{D,x} with the help of the Jacobian J of the forward kinematics of the robot as follows [15]:

K_{P,q} = J^T K_{P,x} J \quad \text{and} \quad K_{D,q} = J^T K_{D,x} J    (2)

Here we assume that the geometric stiffness due to the change of the Jacobian is negligible in comparison to the terms in Eq. (2). Regulating the task space impedance thus implies regulating the joint space impedance. Furthermore, this fundamental mathematical relationship between joint and task space also implies that a constant task stiffness in general means varying gains at the joint level.
algorithm that is applied to learning the time dependent gain The corresponding optimal control is a function of the state
matrices. and it is given by the equation:

III. R EINFORCEMENT LEARNING IN HIGH DIMENSIONS – u(xti ) = uti = −R−1 GTti (∇xti Vti ) (8)
2
THE PI ALGORITHM
We are leaving the standard development of this optimal
Reinforcement learning algorithms can be derived from
control problem by transforming the HJB equations with the
different frameworks, e.g., dynamic programming, optimal
substitution Vt = −λ log Ψt and by introducing a simplifica-
control, policy gradients, or probabilistic approaches. Recently,
tion λR−1 = Σǫ . In this way, the transformed HJB equation
an interesting connection between stochastic optimal control
becomes a linear 2nd order partial differential equation. Due
and Monte Carlo evaluations of path integrals was made [9].
to the Feynman-Kac theorem [13, 25], the solution for the
In [18] this approach is generalized, and used in the context
exponentially transformed value function becomes
of model-free reinforcement learning with parameterized poli-
cies, which resulted in the PI2 algorithm. In the following, we  
N −1

1
Z
provide a short outline of the prerequisites and the develop-
X
Ψti = lim p (τ i |xi ) exp − φtN + qtj dtdτ i
ment of the PI2 algorithm as needed in this paper. For more dt→0 λ j=0
details refer to [18]. (9)
The foundation of PI2 comes from (model-based) stochastic Thus, we have transformed our stochastic optimal control
optimal control for continuous time and continuous state- problem into an approximation problem of a path integral.
action systems. We assume that the dynamics of the control As detailed in [18], it is not necessary to compute the value
system is of the form function explicitly, but rather it is possible to derive the optimal
controls directly:
ẋt = f (xt , t) + G(xt ) (ut + ǫt ) = ft + Gt (ut + ǫt ) (3) Z
uti = P (τ i ) u (τ i ) dτ i (10)
with xt ∈ ℜn×1 denoting the state of the system, Gt =
G(xt ) ∈ ℜn×p the control matrix, ft = f (xt ) ∈ ℜn×1 −1
u(τ i ) = R−1 Gti T Gti R−1 Gti T (Gti ǫti − bti )
the passive dynamics, ut ∈ ℜp×1 the control vector and
ǫt ∈ ℜp×1 Gaussian noise with variance Σǫ . Many robotic where P (τ i ) is the probability of a trajectory τ i , and bti is
systems fall into this class of control systems. For the finite a more complex expression, beyond the space constraints of
horizon problem ti : tN , we want to find control inputs uti :tN this paper. The important conclusion is that it is possible to
which minimize the value function evaluate Eq. (10) from Monte Carlo roll-outs of the control
system, i.e., our optimal control problem can be solved as an
V (xti ) = Vti = min Eτ i [R(τ i )] (4)
u ti :tN estimation problem.
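The practical content of Eq. (10) is that the optimal control is a probability-weighted average over sampled roll-outs. A deliberately stripped-down sketch (scalar control; the projection matrix and the term b_ti are omitted) conveys this estimation view; the full weighting used by PI2 appears in Table I below.

```python
import numpy as np

def weighted_control_estimate(costs, noisy_controls, lam):
    """Simplest Monte Carlo reading of Eq. (10): a cost-weighted average.

    costs          : S(tau_k) of K sampled roll-outs, shape (K,)
    noisy_controls : the corresponding sampled controls (or noise), shape (K,)
    lam            : temperature lambda used in the exponentiation
    """
    expo = np.exp(-(costs - costs.min()) / lam)   # subtract min for stability
    P = expo / expo.sum()                         # probability of each roll-out
    return P @ noisy_controls                     # probability-weighted control

# Roll-outs with lower cost dominate the average; as lam -> 0 the estimate
# approaches the control of the single best roll-out.
```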
A. The PI2 Algorithm

The PI2 algorithm is a special case of the optimal control solution in Eq. (10), applied to control systems with a parameterized control policy:

a_t = g_t^T (\theta + \epsilon_t)    (11)

i.e., the control command is generated from the inner product of a parameter vector θ with a vector of basis functions g_t; the noise ε_t is interpreted as user controlled exploration noise.

A particular case of a control system with a parameterized policy is the Dynamic Movement Primitives (DMP) approach introduced by [6]:

\frac{1}{\tau} \dot{v}_t = f_t + g_t^T (\theta + \epsilon_t)    (12)
\frac{1}{\tau} \dot{q}_{d,t} = v_t
f_t = \alpha \left( \beta (g - q_{d,t}) - v_t \right)
\frac{1}{\tau} \dot{s}_t = -\alpha s_t    (13)
[g_t]_j = \frac{w_j s_t}{\sum_{k=1}^{p} w_k} (g - q_0)    (14)
w_j = \exp\left( -0.5 \, h_j (s_t - c_j)^2 \right)    (15)

The intuition of this approach is to create desired trajectories q_{d,t}, q̇_{d,t}, q̈_{d,t} = τ v̇_t for a motor task out of the time evolution of a nonlinear attractor system, where the goal g is a point attractor and q_0 the start state. The parameters θ determine the shape of the attractor landscape, which allows the representation of almost arbitrary smooth trajectories, e.g., a tennis swing, a reaching movement, or a complex dance movement. While leaving the details of the DMP approach to [6], for this paper the important ingredients of DMPs are that i) the attractor system Eq. (12) has the same form as Eq. (3), and that ii) the p-dimensional parameter vector can be interpreted as motor commands as used in the path integral approach to optimal control. Learning the optimal values for θ will thus create an optimal reference trajectory for a given motor task. The PI2 learning algorithm applied to this scenario is summarized in Table I below. As illustrated in [18, 19], PI2 outperforms previous RL algorithms for parameterized policy learning by at least one order of magnitude in learning speed, and also achieves lower final cost. As an additional benefit, PI2 has no open algorithmic parameters, except for the magnitude of the exploration noise ε_t (the parameter λ is set automatically, cf. [18]). We would like to emphasize one more time that PI2 does not require knowledge of the model of the control system or the environment.
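Before turning to the full algorithm in Table I, a short sketch of Eqs. (12)–(15) may help: it integrates a single DMP with simple Euler steps. The constants, basis centers and widths are illustrative assumptions of ours, not the exact settings used by the authors.

```python
import numpy as np

def integrate_dmp(theta, g, q0, tau=1.0, dt=0.002, alpha=25.0, beta=6.25,
                  alpha_s=8.0, n_basis=10, duration=1.0):
    """Euler integration of the DMP of Eqs. (12)-(15); returns q_d over time."""
    c = np.exp(-alpha_s * np.linspace(0, 1, n_basis))   # basis centers in s
    h = n_basis / c                                      # heuristic basis widths
    q_d, v, s = float(q0), 0.0, 1.0
    n_steps = int(duration / dt)
    traj = np.zeros(n_steps)
    for i in range(n_steps):
        w = np.exp(-0.5 * h * (s - c) ** 2)              # Eq. (15)
        g_t = w * s / w.sum() * (g - q0)                 # Eq. (14)
        f = alpha * (beta * (g - q_d) - v)               # nonlinear attractor
        v += dt * tau * (f + g_t @ theta)                # Eq. (12)
        q_d += dt * tau * v                              # second line of Eq. (12)
        s += dt * tau * (-alpha_s * s)                   # Eq. (13), canonical system
        traj[i] = q_d
    return traj

# With theta = 0 the DMP reduces to a plain point attractor towards the goal g;
# learning theta reshapes the transient into the desired reference trajectory.
traj = integrate_dmp(theta=np.zeros(10), g=1.0, q0=0.0)
```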
TABLE I
PSEUDOCODE OF THE PI2 ALGORITHM FOR A 1D PARAMETERIZED POLICY

• Given:
  – An immediate cost function r_t = q_t + θ_t^T R θ_t (cf. Eq. (5))
  – A terminal cost term φ_{t_N} (cf. Eq. (5))
  – A stochastic parameterized policy a_t = g_t^T (θ + ε_t) (cf. Eqs. (11) and (12))
  – The basis function g_{t_i} from the system dynamics (cf. Eq. (14))
  – The variance Σ_ε of the mean-zero noise ε_t
  – The initial parameter vector θ
• Repeat until convergence of the trajectory cost R:
  – Create K roll-outs of the system from the same start state x_0 using stochastic parameters θ + ε_t at every time step
  – For all K roll-outs, compute:
    ∗ P(τ_{i,k}) = exp(−S(τ_{i,k})/λ) / Σ_{k=1}^{K} exp(−S(τ_{i,k})/λ)
    ∗ S(τ_{i,k}) = φ_{t_N,k} + Σ_{j=i}^{N−1} q_{t_j,k} + (1/2) Σ_{j=i+1}^{N−1} (θ + M_{t_j,k} ε_{t_j,k})^T R (θ + M_{t_j,k} ε_{t_j,k})
    ∗ M_{t_j,k} = R^{−1} g_{t_j,k} g_{t_j,k}^T / (g_{t_j,k}^T R^{−1} g_{t_j,k})
  – For all i time steps, compute:
    ∗ δθ_{t_i} = Σ_{k=1}^{K} P(τ_{i,k}) M_{t_i,k} ε_{t_i,k}
  – Compute [δθ]_j = Σ_{i=0}^{N−1} (N−i) w_{j,t_i} [δθ_{t_i}]_j / Σ_{i=0}^{N−1} w_{j,t_i} (N−i)
  – Update θ ← θ + δθ
  – Create one noiseless roll-out to check the trajectory cost R = φ_{t_N} + Σ_{i=0}^{N−1} r_{t_i}. In case the noise cannot be turned off, i.e., for a stochastic system, multiple roll-outs need to be averaged.
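To make Table I concrete, the following sketch performs one PI2 parameter update for a single DMP, taking R as the identity so that the projection M_t reduces to the normalized outer product of the basis vector. Array shapes and names are ours; the quadratic control cost inside S and the per-basis activation weighting of the temporal average are simplified, as noted in the comments.

```python
import numpy as np

def pi2_update(theta, eps, g_basis, q_cost, phi_terminal, lam=0.1):
    """One PI2 parameter update (cf. Table I), with R = identity.

    theta        : current parameters, shape (p,)
    eps          : exploration noise per roll-out and time step, shape (K, N, p)
    g_basis      : basis vectors g_t per roll-out and time step, shape (K, N, p)
    q_cost       : state costs q(t) per roll-out and time step, shape (K, N)
    phi_terminal : terminal costs per roll-out, shape (K,)
    """
    K, N, p = eps.shape
    # Projected noise M_t eps_t with M_t = (g g^T) / (g^T g) for R = I.
    gTg = np.einsum('knp,knp->kn', g_basis, g_basis)
    g_dot_eps = np.einsum('knp,knp->kn', g_basis, eps)
    proj_eps = g_basis * (g_dot_eps / gTg)[..., None]

    # Cost-to-go S(tau_i,k): state cost from i to the end plus terminal cost.
    # (The quadratic control cost term of Table I is omitted for brevity.)
    S = np.cumsum(q_cost[:, ::-1], axis=1)[:, ::-1] + phi_terminal[:, None]

    # Probabilities P(tau_i,k): softmax over roll-outs at every time step.
    expS = np.exp(-(S - S.min(axis=0)) / lam)
    P = expS / expS.sum(axis=0)                          # shape (K, N)

    # Time-step-wise parameter change, then temporal averaging weighted by (N - i).
    # (Table I additionally weights by the basis activations w_{j,t}; omitted here.)
    dtheta_t = np.einsum('kn,knp->np', P, proj_eps)      # shape (N, p)
    w_time = (N - np.arange(N))[:, None]
    dtheta = (w_time * dtheta_t).sum(axis=0) / w_time.sum()
    return theta + dtheta
```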
Key Innovations in PI2: In summary, we list the key innovations in PI2 that we believe lead to its superior performance. These innovations make applications like the learning of gain schedules for high dimensional tasks possible.

• The basis of the derivation of the PI2 algorithm is the transformation of the optimal control problem from a constrained minimization to a maximum likelihood formulation. This transformation is critical, since there is no need to calculate a gradient, which is usually sensitive to noise and to large derivatives in the value function.
• Paths with higher cost have lower probability – a clear intuition that also has a rigorous mathematical representation through the exponentiation of the value function. This transformation is necessary for the linearization of the HJB equation into the Chapman-Kolmogorov PDE.
• With PI2 the optimal control problem is solved with the forward propagation of dynamics. Thus no backward propagation of approximations of the value function is required. This is a very important characteristic of PI2 that allows for sampling (i.e. roll-out) based estimation of the path integral.
• For high dimensional problems, it is not possible to sample the whole state space, and that is the reason for applying path integral control in an iterative fashion to update the parameters of the DMPs.
• The derivation of an RL algorithm from first principles largely eliminates the need for open parameters in the final algorithm.

IV. VARIABLE IMPEDANCE CONTROL WITH PI2

The PI2 algorithm as introduced above seems to be solely suited for optimizing a trajectory plan, and not directly the controller. Here we will demonstrate that this is not the case, and how PI2 can be used to optimize a gain schedule
simultaneously to optimizing the reference trajectory. For this purpose, it is important to realize how Eq. (3) relates to a complete robotics system. We assume a d DOF robot that obeys rigid body dynamics. q_v denotes the joint velocities, and q_p the joint angle positions. Every DOF has its own reference trajectory from a DMP, which means that Eqs. (12) are duplicated for every DOF, while Eqs. (13), (14), and (15) are shared across all DOFs – see [6] for explanations on how to create multi-dimensional DMPs. Thus, Eq. (3) applied to this context, i.e. using the rigid body dynamics equations with M, C, G denoting the inertia matrix, Coriolis/centripetal forces, and gravity forces respectively, becomes:

\dot{q}_v = M(q_p)^{-1} \left( -C(q_p, q_v) - G(q_p) + u \right)
\dot{q}_p = q_v    (16)
\frac{1}{\tau} \dot{s}_t = -\alpha s_t

where each element u_i of the control vector u is given by:

u_i = -K_{P,i} \left( q_i^p - q_{d,i}^p \right) - \xi_i \sqrt{K_{P,i}} \left( q_i^v - q_{d,i}^v \right) + u_{ff,i}    (17)

The terms q_{d,i}^p, q_{d,i}^v are the reference joint angle position and velocity of the i-th DOF, and they are given by the set of equations:

\frac{1}{\tau} \dot{q}_{d,i}^v = \alpha \left( \beta ( g_i - q_{d,i}^p ) - q_{d,i}^v \right) + g_t^{i,T} ( \theta_{ref}^i + \epsilon_t^i )
\frac{1}{\tau} \dot{q}_{d,i}^p = q_{d,i}^v    (18)

Note that in the control law in Eq. (17), we used Eq. (1) applied to every DOF individually with a time varying gain, and we inserted the common practice that the damping gain K_D^i is written as the square root of the proportional gain K_P^i with a user determined multiplier ξ_i. A critically important result of [18] is that for the application of PI2 only those differential equations in Eq. (16) matter that have learnable parameters θ_i. Moreover, the optimization of these parameters is accomplished by optimizing the parameter vector of each differential equation independently (as shown in Table I), despite the fact that the DOFs are coupled through the cost function. For this reason, PI2 operates in a model free mode, as only one DMP differential equation per DOF is required, and all other equations, including the rigid body dynamics model, drop out.

For variable stiffness control, we exploit these insights and add one more differential equation per DOF in Eq. (16):

\dot{K}_{P,i} = \alpha_K \left( g_{t,K}^{i,T} ( \theta_K^i + \epsilon_{K,t}^i ) - K_{P,i} \right)    (19)
[g_{t,K}]_j = \frac{w_j}{\sum_{k=1}^{p} w_k}    (20)

This equation models the time course of the position gains, coupled to Eq. (15) of the DMP. Thus, K_{P,i} is represented by a basis function representation that is linear with respect to the learning parameters θ_K^i, and these parameters are learned with the PI2 algorithm following Table I. We will assume that the time constant 1/α_K is so small that for all practical purposes we can assume that K_{P,i} = g_{t,K}^{i,T}(θ_K^i + ε_{K,t}^i) holds at all times. Essentially, Eqs. (16), (17), (18) and (20) are incorporated into one stochastic dynamical system of the form of Eq. (3). In conclusion, we achieved a novel formulation of learning both the reference trajectory and the gain schedule for a multi-dimensional robotic system with model-free reinforcement learning, using the PI2 algorithm and its theoretical properties as the foundation of our derivations.
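A sketch of the per-DOF control law of Eq. (17) together with the gain dynamics of Eq. (19). The numerical values (α_K, the gain bounds, the time step) are placeholders of ours, and the clipping mirrors the gain bounds mentioned in the footnote of Section V-A rather than an equation in the text.

```python
import numpy as np

def dof_command(q, qdot, q_d, qdot_d, K_P, xi, u_ff=0.0):
    """Eq. (17) for a single DOF: the damping gain is xi * sqrt(K_P)."""
    return -K_P * (q - q_d) - xi * np.sqrt(K_P) * (qdot - qdot_d) + u_ff

def gain_step(K_P, g_K, theta_K, eps_K, alpha_K=100.0, dt=0.002,
              K_min=1.0, K_max=200.0):
    """Eq. (19): first-order filter of the basis-function gain target.

    With a large alpha_K the filter is fast, so K_P effectively tracks
    g_K^T (theta_K + eps_K) at all times, as assumed in the text.
    """
    target = g_K @ (theta_K + eps_K)           # time varying gain target
    K_P += dt * alpha_K * (target - K_P)       # first-order gain dynamics
    return float(np.clip(K_P, K_min, K_max))   # pre-specified gain bounds
```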
V. RESULTS

We will now present results of applying the outlined algorithms to two simulated robot arms with 3 and 6 DOFs, respectively. For both robots, the immediate reward at time step t is given as

r_t = W_{gain} \sum_i K_{P,t}^i + W_{acc} \, ||\ddot{x}|| + W_{subgoal} \, C(t)    (21)

Here, Σ_i K_{P,t}^i is the sum of the proportional gains over all joints. The reasoning behind penalizing the gains is that low gains lead to several desirable properties of the system, such as compliant behavior (safety and/or robustness [2]), lowered energy consumption, and less wear and tear. The term ||ẍ|| is the magnitude of the acceleration of the end-effector. This quantity is penalized to avoid high-jerk end-effector motion. This penalty is low in comparison to the gain penalty.

The robot's primary task is to pass through an intermediate goal, either in joint space or end-effector space – such scenarios occur in tasks like playing tennis or table tennis. The component of the cost function C(t) that represents this primary task will be described individually for each robot in the next sections. Gains and accelerations are penalized at each time step, but C(t) only leads to a cost at specific time steps along the trajectory.

For both robots, the cost weights are W_subgoal = 2000, W_gain = 1/N, W_acc = 1/N. Dividing the weights by the number of time steps N is convenient, as it makes the weights independent of the duration of a movement.
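Evaluated on a discrete time grid, Eq. (21) becomes a simple per-time-step sum. A sketch with the weight values given above; C_t denotes the robot-specific subgoal cost defined in the following subsections.

```python
import numpy as np

def immediate_cost(K_P_joints, xdd_endeff, C_t, N, W_subgoal=2000.0):
    """Eq. (21): r_t = W_gain * sum_i K_P^i + W_acc * ||xdd|| + W_subgoal * C(t).

    K_P_joints  : current proportional gains of all joints, shape (d,)
    xdd_endeff  : end-effector acceleration vector, shape (3,)
    C_t         : task-specific intermediate-goal cost at this time step
    N           : number of time steps of the movement
    """
    W_gain = 1.0 / N     # weights divided by N, so the cost is
    W_acc = 1.0 / N      # independent of the movement duration
    return (W_gain * np.sum(K_P_joints)
            + W_acc * np.linalg.norm(xdd_endeff)
            + W_subgoal * C_t)
```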
A. Robot 1: 3-DOF Phantom

Fig. 1. 3-DOF Phantom simulation in SL.

The Phantom Premium 1.5 Robot is a 3 DOF, two link arm. It has two rotational degrees of freedom at the base and one in
the arm. We use a physically realistic simulation of this robot generated in SL [14], as depicted in Fig. 1.

The task for this robot is intentionally simple and aimed at demonstrating the ability to tune task relevant gains in joint space with straightforward and easy to interpret data.

The duration of the movement is 2.0s, which corresponds to 1000 time steps at a 500Hz servo rate. The intermediate goals for this robot are set as follows:

C(t) = \delta(t - 0.4) \cdot | q_{SR}(t) + 0.2 | + \delta(t - 0.8) \cdot | q_{SFE}(t) - 0.4 | + \delta(t - 1.2) \cdot | q_{EB}(t) - 1.5 |    (22)

This penalizes joint SR for not having an angle q_SR = −0.2 at time t = 0.4s. Joints SFE and EB are also required to go through (different) intermediate angles at times 0.8s and 1.2s, respectively.
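On the discrete 500 Hz grid, the delta functions in Eq. (22) simply select the time steps closest to 0.4 s, 0.8 s and 1.2 s. A sketch, assuming the joint vector is ordered [SR, SFE, EB] (our convention for illustration, not necessarily the simulator's):

```python
def phantom_subgoal_cost(t, q, dt=1.0 / 500.0):
    """Eq. (22): joint space subgoals at t = 0.4, 0.8 and 1.2 s.

    q is assumed to be ordered as [q_SR, q_SFE, q_EB].
    Returns a nonzero cost only at the three subgoal time steps.
    """
    on = lambda t_goal: abs(t - t_goal) < 0.5 * dt   # discrete delta(t - t_goal)
    C = 0.0
    if on(0.4):
        C += abs(q[0] + 0.2)   # q_SR should be -0.2
    if on(0.8):
        C += abs(q[1] - 0.4)   # q_SFE should be 0.4
    if on(1.2):
        C += abs(q[2] - 1.5)   # q_EB should be 1.5
    return C
```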
The initial parameters θ_i for the reference trajectory are determined by training the DMPs with a minimum jerk trajectory [26] in joint space from q_{t=0.0} = [0.0 0.3 2.0]^T to q_{t=2.0} = [−0.6 0.8 1.4]^T. The function approximator for the proportional gains of the 3 joints is initialized to return a constant gain of 6.0 Nm/rad. The initial trajectories are depicted as red, dashed plots in Fig. 2, where the angles and gains of the three joints are plotted against time. Since the task of PI2 is to optimize both trajectories and gains with respect to the cost function, this leads to a 6-D RL problem. The robot executes 100 parameter updates, with 4 noisy exploration trials per update. After each update, we perform one noise-less test trial for evaluation purposes.

Fig. 3 depicts the learning curve for the phantom robot, which is the overall cost of the noise-less test trial after each parameter update. The joint space trajectory and gain scheduling after 100 updates are depicted as blue, solid lines in Fig. 2.

Fig. 2. Initial (red, dashed) and final (blue, solid) joint trajectories and gain scheduling for each of the three joints of the phantom robot. Yellow circles indicate intermediate subgoals.

Fig. 3. Learning curve for the phantom robot.

From these graphs, we draw the following conclusions:

• PI2 has adapted the initial minimum jerk trajectories such that they fulfill the task and pass through the desired joint angles at the specified times with only small error. These intermediate goals are represented by the circles on the graphs. The remaining error is a result of the trade-off between the different factors of the cost function (i.e. penalty for distance to the goal vs. penalty for high gains).
• Because the magnitude of the gains is penalized in general, they are low when the task allows it. After t = 1.6s, all gains drop to the minimum value¹, because accurate tracking is no longer required to fulfill the goal. Once the task is completed, the robot becomes maximally compliant, as one would wish it to be.
• When the robot is required to pass through the intermediate targets, it needs better tracking, and therefore higher gains. Therefore, the peaks of the gains correspond roughly to the times where the joint is required to pass through an intermediate point.
PREPRINT May 28, 2010


To appear in Proceedings RSS 2010
(c) 2010
• Due to nonlinear effects, e.g., Coriolis and centripetal forces, the gain schedule shows more complex temporal behavior than one would initially assume from specifying three different joint space targets at three different times.

¹ We bound the gains between pre-specified maximum and minimum values. Too high gains would generate oscillations and can lead to instabilities of the robot, and too low gains lead to poor tracking, such that the robot frequently runs into the joint limits.

In summary, we achieved the objective of variable impedance control: the robot is compliant when possible, but has a higher impedance when the task demands it.

B. Robot 2: 6-DOF Kuka robot

Next we show a similar task on a 6 DOF arm, a Kuka Light-Weight Arm. This example illustrates that our approach scales well to higher-dimensional systems, and also that appropriate gain schedules are learned when intermediate targets are chosen in end-effector space instead of joint space.

The duration of the movement is 1.0s, which corresponds to 500 time steps. This time, the intermediate goal is for the end-effector x to pass through [0.7 0.3 0.1]^T at time t = 0.5s:

C(t) = \delta(t - 0.5) \, | x - [0.7 \; 0.3 \; 0.1]^T |    (23)

The six joint trajectories are again initialized as minimum jerk trajectories. As before, the resulting initial trajectory is plotted as a red, dashed line in Fig. 4. The initial gains are set to a constant [60, 60, 60, 60, 25, 6]^T. Given these initial conditions, finding the parameter vectors for the DMPs and gains that minimize the cost function leads to a 12-D RL problem. We again perform 100 parameter updates, with 4 exploration trials per update.

The learning curve for this problem is depicted in Fig. 5. The trajectory of the end-effector after 30 and 100 updates is depicted in Fig. 4. The intermediate goal at t = 0.5 is visualized by circles. Finally, Fig. 6 shows the gain schedules after 30 and 100 updates for the 6 joints of the Kuka robot.

Fig. 4. Initial (red, dotted), intermediate (green, dashed), and final (blue, solid) end-effector trajectories of the Kuka robot.

Fig. 5. Learning curve for the Kuka robot.

Fig. 6. Initial (red, dotted), intermediate (green, dashed), and final (blue, solid) joint gain schedules for each of the six joints of the Kuka robot.

From these graphs, we draw the following conclusions:

• PI2 has adapted the joint trajectories such that the end-effector passes through the intermediate subgoal at the right time. It learns to do so after only 30 updates (Fig. 5).
• After 100 updates the peaks of most gains occur just before the end-effector passes through the intermediate goal (Fig. 6), and in many cases decrease to the minimum gain directly afterwards. As with the phantom robot, we observe high impedance when the task requires accuracy, and more compliance when the task is relatively unconstrained.
• The second joint (GA2) has the most work to perform, as it must support the weight of all the more distal links. Its gains are by far the highest, especially at the intermediate goal, as any error in this DOF will lead to a large end-effector error.
• The learning has two phases. In the first phase (plotted as dashed, green), the robot is learning to make the end-effector pass through the intermediate goal. At this point, the basic shape of the gain scheduling has been determined. In the second phase, PI2 fine tunes the gains, and lowers them as much as the task permits.

VI. RELATED WORK

In optimal control and model based RL, Differential Dynamic Programming (DDP) [7] has been one of the most established and used frameworks for finite horizon optimal control problems. In DDP, both the state space dynamics and the cost function are approximated up to second order. The assumptions of stabilizability and detectability of the local approximation of the dynamics are necessary for the convergence of DDP. The resulting state space trajectory is locally optimal, while the corresponding control policy consists of open loop feedforward commands and closed loop gains relative to a nominal and optimal final trajectory. This characteristic allows the use of DDP for both planning and control gain scheduling problems.
In [4, 24] DDP was extended to incorporate constraints on states and controls. In [10] the authors suggest computational improvements to constrained DDP and apply the proposed algorithm to a low dimensional planning problem.

An example of a DDP application to robotics is [12]. In this work, a min-max or Differential Game Theory approach to optimal control is proposed. There is a strong link between robust control frequency design analysis, such as H∞ control, and the framework of Differential Game Theory [1]. Essentially, the min-max DDP results in robust feedback control policies with respect to model uncertainty and unknown dynamics. Although, in theory, min-max DDP should resolve the issue of model uncertainty, it can lead to overly conservative control policies. The conservatism results from the need to guarantee that the game theoretic approach will always be stabilizable, i.e. making sure that the stabilizing controller wins. For linear and time invariant systems, such a guarantee is feasible through γ-iteration [23]. However, for nonlinear systems, providing this guarantee is not trivial.

The work on Receding Horizon DDP [17] provided an alternative and rather efficient way of computing locally optimal feedback controls. Nevertheless, all the computations of optimal trajectories and controls take place off-line, and the model predictive component is only due to the fact that the final target state of the optimal control problem varies. Recent work on LQR-trees uses a simpler variation of DDP, the iterative Linear Quadratic Regulator (iLQR) [11], which is based on linear approximations of the state space dynamics, in combination with tools from Nonlinear Robust Control theory for region of attraction analysis. Given the locally optimal feedback control policies, a sums of squares optimization scheme is used to quantify the size of the basin of attraction, and provides so-called control funnels. These funnels improve sampling since they quantize the state space into attractor regions placed along the trajectories towards the target state. This is a model based approach and inherits all the problems of model based approaches to optimal control. In addition, even though sampling is improved, it is still an open issue how LQR-trees scale to high dimensional dynamical systems.

The path integral formalism for optimal control was introduced in [8, 9]. In this work, the role of noise in symmetry breaking phenomena was investigated in the context of stochastic optimal control. In [22] the path integral formalism is extended to stochastic optimal control of multi-agent systems, which is not unlike our multi degree-of-freedom control systems.

Recent work on stochastic optimal control by [21, 20, 3] shows that for a class of discrete stochastic optimal control problems, the Bellman equation can be written as the KL divergence between the probability distributions of the controlled and uncontrolled dynamics. Furthermore, it is shown that the class of discrete KL divergence control problems is equivalent to the continuous stochastic optimal control formalism with quadratic control cost function and under the presence of Gaussian noise. In all of the aforementioned work, both in the path integral formalism as well as in KL divergence control, the class of stochastic dynamical systems under consideration is rather restrictive, since the control transition matrix is state independent. Moreover, the connection to direct policy learning in RL and model-free learning was not made in any of the previous projects. In [3], the stochastic optimal control problem is investigated for discrete state-action spaces, and therefore it is treated as a Markov Decision Process (MDP). To apply our PI2 algorithm, we do not discretize the state space and we do not treat the problem as an MDP. Instead, we work in continuous state-action spaces, which are suitable for performing RL in high dimensional robotic systems. To the best of our knowledge, our results present RL in one of the most high-dimensional continuous state-action spaces. In our derivations, the probabilistic interpretation of control comes directly from the Feynman-Kac Lemma. Thus we do not have to impose any artificial pseudo-probability treatment
of the cost as in [3]. In addition, for continuous state-action spaces, we do not have to learn the value function, as is suggested in [3] via Z-learning. Instead, we directly obtain the controls based on our generalization of optimal controls. In the previous work, the problem of how to sample trajectories is not addressed. Sampling is performed with the hope of covering all of the relevant state space. We follow a rather different approach of incremental updating, which allows us to attack robotic learning problems of the complexity and dimensionality of complete humanoid robots.

VII. DISCUSSION

We presented a model-free reinforcement learning approach that can learn variable impedance control for robotic systems. Our approach is derived from stochastic optimal control with path integrals, a relatively new development that transforms optimal control problems into estimation problems. In particular, our PI2 algorithm goes beyond the original ideas of optimal control with path integrals by realizing its applicability to optimal control with parameterized policies and model-free scenarios.

The mathematical structure of the PI2 algorithm makes it suitable to optimize simultaneously both reference trajectories and gain schedules. This is similar to classical differential dynamic programming (DDP) methods, but completely removes the requirements of DDP that the model of the controlled system must be known, that the cost function has to be twice differentiable in both state and command cost, and that the dynamics of the control system have to be twice differentiable. The latter constraints make it hard to apply DDP to movement tasks with discrete events, e.g., as typical in force control and locomotion.

We evaluated our approach on two simulated robot systems, which posed learning problems of up to 12 dimensions in continuous state-action spaces. The goal was to learn compliant control while fulfilling kinematic task constraints, like passing through an intermediate target. The evaluations demonstrated that the algorithm behaves as expected: it increases gains when needed, but tries to maintain low gain control otherwise. The optimal reference trajectory always fulfilled the task goal. Learning speed was rather fast, i.e., within at most a few hundred trials, the task objective was accomplished. From a machine learning point of view, this performance of a reinforcement learning algorithm is very fast.

The PI2 algorithm inherits the properties of all trajectory-based learning algorithms in that it only finds locally optimal solutions. For high dimensional robotic systems, this is unfortunately all one can hope for, as exploring the entire state-action space in search of a globally optimal solution is impossible.

Future work will apply the suggested methods on actual robots for mobile manipulation and locomotion controllers. We believe that our methods are a major step towards realizing compliant autonomous robots that operate robustly in dynamic and stochastic environments without harming other beings or themselves.

REFERENCES

[1] T. Basar and P. Bernhard. H-Infinity Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Birkhäuser, Boston, 1995.
[2] J. Buchli, M. Kalakrishnan, M. Mistry, P. Pastor, and S. Schaal. Compliant quadruped locomotion over rough terrain. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 814–820, 2009.
[3] E. Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences USA, 106(28):11478–11483, 2009.
[4] R. Fletcher. Practical Methods of Optimization, volume 2. Wiley, 1981.
[5] N. Hogan. Impedance control - an approach to manipulation. Parts I–III. ASME Transactions, Journal of Dynamic Systems, Measurement, and Control, 107:1–24, 1985.
[6] A. J. Ijspeert. Vertebrate locomotion. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 649–654. MIT Press, 2003.
[7] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, 1970.
[8] H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95(20):200201, 2005.
[9] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.
[10] G. Lantoine and R. P. Russell. A hybrid differential dynamic programming algorithm for robust low-thrust optimization. In AAS/AIAA Astrodynamics Specialist Conference and Exhibit, 2008.
[11] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In Proceedings of the 1st International Conference on Informatics in Control, Automation and Robotics, pages 222–229, 2004.
[12] J. Morimoto and C. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in Neural Information Processing Systems 15, pages 1563–1570. MIT Press, Cambridge, MA, 2002.
[13] B. K. Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, Berlin, 6th edition, 2003.
[14] S. Schaal. The SL simulation and real-time control software package. Technical report, University of Southern California, 2007.
[15] L. Sciavicco and B. Siciliano. Modelling and Control of Robot Manipulators. Springer, London, 2000.
[16] R. F. Stengel. Optimal Control and Estimation. Dover Publications, New York, 1994.
[17] Y. Tassa, T. Erez, and W. Smart. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems 20, pages 1465–1472. MIT Press, Cambridge, MA, 2008.
[18] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning in high dimensional state spaces: A path integral approach. Journal of Machine Learning Research, 2010. Accepted for publication.
[19] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In Proceedings of the IEEE International Conference on Robotics and Automation, 2010.
[20] E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems 19, pages 1369–1376. MIT Press, Cambridge, MA, 2007.
[21] E. Todorov. General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, 2008.
[22] B. van den Broek, W. Wiegerinck, and B. Kappen. Graphical model inference in optimal control of stochastic multi-agent systems. Journal of Artificial Intelligence Research, 32:95–122, 2008.
[23] T. Vincent and W. Grantham. Nonlinear and Optimal Control Systems. John Wiley & Sons, 1997.
[24] S. Yakowitz. The stagewise Kuhn-Tucker condition and differential dynamic programming. IEEE Transactions on Automatic Control, 31(1):25–30, 1986.
[25] J. Yong. Relations among ODEs, PDEs, FSDEs, BSDEs, and FBSDEs. In Proceedings of the 36th IEEE Conference on Decision and Control, volume 3, pages 2779–2784, 1997.
[26] M. Zefran, V. Kumar, and C. B. Croke. On the generation of smooth three-dimensional rigid body motions. IEEE Transactions on Robotics and Automation, 14(4):576–589, 1998.
