Variable Impedance Control - A Reinforcement Learning Approach
Abstract— One of the hallmarks of the performance, versatility, and robustness of biological motor control is the ability to adapt the impedance of the overall biomechanical system to different task requirements and stochastic disturbances. A transfer of this principle to robotics is desirable, for instance to enable robots to work robustly and safely in everyday human environments. It is, however, not trivial to derive variable impedance controllers for practical high DOF robotic tasks. In this contribution, we accomplish such gain scheduling with a reinforcement learning algorithm, PI2 (Policy Improvement with Path Integrals). PI2 is a model-free, sampling based learning method derived from first principles of optimal control. The PI2 algorithm requires no tuning of algorithmic parameters besides the exploration noise. The designer can thus fully focus on cost function design to specify the task. From the viewpoint of robotics, a particularly useful property of PI2 is that it can scale to problems of many DOFs, so that RL on real robotic systems becomes feasible. We sketch the PI2 algorithm and its theoretical properties, and how it is applied to gain scheduling. We evaluate our approach by presenting results on two different simulated robotic systems, a 3-DOF Phantom Premium Robot and a 6-DOF Kuka Lightweight Robot. We investigate tasks where the optimal strategy requires both tuning of the impedance of the end-effector, and tuning of a reference trajectory. The results show that we can use path integral based RL not only for planning but also to derive variable gain feedback controllers in realistic scenarios. Thus, the power of variable impedance control is made available to a wide variety of robotic systems and practical applications.

I. INTRODUCTION

Biological motor systems excel in terms of versatility, performance, and robustness in environments that are highly dynamic, often unpredictable, and partially stochastic. In contrast to classical robotics, mostly characterized by high gain negative error feedback control, biological systems derive some of their superiority from low gain compliant control with variable and task dependent impedance. If we adapt this concept of adaptive impedance for PD negative feedback control, this translates into time varying proportional and derivative gains, also known as gain scheduling. Finding the appropriate gain schedule for a given task is, however, a hard problem.

One possibility to overcome such problems is Reinforcement Learning (RL). The idea of RL is that, given only a reward function, the learning algorithm finds strategies that yield high reward through trial and error. As a special and important feature, RL accomplishes such optimal performance without knowledge of the models of the motor system and/or the environment. However, so far, RL does not scale well to high-dimensional continuous state-action control problems.

Closely related to RL is optimal control theory, where gain scheduling is a natural part of many optimal control algorithms. However, optimal control requires model-based derivations, such that it is frequently not applicable to complex robotic systems and environments, where models are unknown.

In this paper, we present a novel RL algorithm that does scale to complex robotic systems, and that accomplishes gain scheduling in combination with optimizing other performance criteria. Evaluations on two simulated robotic systems demonstrate the effectiveness of our approach. In the following section, we will first motivate variable impedance control. Then, we sketch our novel RL algorithm, called PI2, and its applicability to learning gain scheduling. In the fourth section, we will present evaluation results on a 3 DOF and a 6 DOF robotic arm, where the task requires the robot to learn both a reference trajectory and the appropriate time varying impedance. We conclude with a review of related work and discussions of future directions.

II. VARIABLE IMPEDANCE CONTROL

The classical approach to robot control is negative feedback control with high proportional-derivative (PD) gains. This type of control is straightforward to implement, robust towards modeling uncertainties, and computationally cheap. Unfortunately, high gain control is not ideal for many tasks involving interaction with the environment, e.g. force control tasks or locomotion. In contrast, impedance control [5] seeks to realize a specific impedance of the robot, either in end-effector or joint space. The issue of specifying the target impedance, however, is not completely addressed as of yet. While for simple factory tasks, where the properties of the task and environment are known a priori, suitable impedance characteristics may be derivable, it is usually not easy to understand how impedance control is applied to more complex tasks, such as a robot walking over difficult terrain or the manipulation of objects of daily life (e.g. pillows, hammers, cans, etc.). An additional benefit of variable impedance behavior in a robot comes from the added active safety due to soft “giving in”, both for the robot and its environment.

In the following we consider robots with torque controlled joints. The motor commands u are calculated via a PD control law with a feedforward control term u_ff:

    u = -K_P (q - q_d) - K_D (\dot{q} - \dot{q}_d) + u_{ff}    (1)

where K_P, K_D are the positive definite position and velocity gain matrices, q, q̇ are the joint positions and velocities, and q_d, q̇_d are the desired joint positions and velocities. The feedforward control term may come, for instance, from an inverse dynamics control component, or a computed torque control component [15].
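To make the control law in Eq. (1) concrete, the following is a minimal numpy sketch of a gain-scheduled PD controller with a feedforward term. The function name, array shapes, and the damping heuristic are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pd_feedforward_control(q, qd, q_des, qd_des, K_P, K_D, u_ff):
    """PD control law with feedforward term, cf. Eq. (1).

    q, qd         : current joint positions and velocities
    q_des, qd_des : desired joint positions and velocities
    K_P, K_D      : (possibly time-varying) positive definite gain matrices
    u_ff          : feedforward torque, e.g. from an inverse dynamics model
    """
    return -K_P @ (q - q_des) - K_D @ (qd - qd_des) + u_ff

# Example: one time step for a 3-DOF arm with scheduled (time-varying) gains
n = 3
q, qd = np.zeros(n), np.zeros(n)
q_des, qd_des = np.array([0.3, -0.1, 0.2]), np.zeros(n)
K_P = np.diag([20.0, 15.0, 10.0])   # sample of a stiffness schedule at time t
K_D = 2.0 * np.sqrt(K_P)            # damping heuristic (an assumption, not from the paper)
u = pd_feedforward_control(q, qd, q_des, qd_des, K_P, K_D, u_ff=np.zeros(n))
```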
Thus, the impedance of a joint is parameterized by the choice of the gains K_P (“stiffness”) and K_D (“damping”).

For many applications, the joint space impedance is, however, of secondary interest. Most often, regulating impedance matters the most at certain points that are in contact with the environment, e.g., the end-effectors of the robot. We therefore need to assess the impedance at these points of contact rather than at the joints. Joint space impedance is computed from the desired task space impedance K_{P,x}, K_{D,x} with the help of the Jacobian J of the forward kinematics of the robot as follows [15]:

    K_{P,q} = J^T K_{P,x} J   and   K_{D,q} = J^T K_{D,x} J    (2)

Here we assume that the geometric stiffness due to the change of the Jacobian is negligible in comparison to the terms in Eq. (2). Regulating the task space impedance thus implies regulating the joint space impedance. Furthermore, this fundamental mathematical relationship between joint and task space also implies that a constant task stiffness in general means varying gains at the joint level.

In the next section we will sketch a reinforcement learning algorithm that is applied to learning the time dependent gain matrices.

III. REINFORCEMENT LEARNING IN HIGH DIMENSIONS – THE PI2 ALGORITHM

Reinforcement learning algorithms can be derived from different frameworks, e.g., dynamic programming, optimal control, policy gradients, or probabilistic approaches. Recently, an interesting connection between stochastic optimal control and Monte Carlo evaluations of path integrals was made [9]. In [18] this approach is generalized, and used in the context of model-free reinforcement learning with parameterized policies, which resulted in the PI2 algorithm. In the following, we provide a short outline of the prerequisites and the development of the PI2 algorithm as needed in this paper. For more details refer to [18].

The foundation of PI2 comes from (model-based) stochastic optimal control for continuous time and continuous state-action systems. We assume that the dynamics of the control system is of the form

    \dot{x}_t = f(x_t, t) + G(x_t)(u_t + \epsilon_t) = f_t + G_t (u_t + \epsilon_t)    (3)

with x_t ∈ ℜ^{n×1} denoting the state of the system, G_t = G(x_t) ∈ ℜ^{n×p} the control matrix, f_t = f(x_t) ∈ ℜ^{n×1} the passive dynamics, u_t ∈ ℜ^{p×1} the control vector and ε_t ∈ ℜ^{p×1} Gaussian noise with variance Σ_ε. Many robotic systems fall into this class of control systems. For the finite horizon problem t_i : t_N, we want to find control inputs u_{t_i:t_N} which minimize the value function

    V(x_{t_i}) = V_{t_i} = \min_{u_{t_i:t_N}} E_{\tau_i} [R(\tau_i)]    (4)

where R is the finite horizon cost over a trajectory starting at time t_i in state x_{t_i} and ending at time t_N

    R(\tau_i) = \phi_{t_N} + \int_{t_i}^{t_N} r_t \, dt    (5)

and where φ_{t_N} = φ(x_{t_N}) is a terminal reward at time t_N. r_t denotes the immediate reward at time t. τ_i are trajectory pieces starting at x_{t_i} and ending at time t_N.

As immediate reward we consider

    r_t = r(x_t, u_t, t) = q_t + \frac{1}{2} u_t^T R u_t    (6)

where q_t = q(x_t, t) is an arbitrary state-dependent reward function, and R is the positive semi-definite weight matrix of the quadratic control cost. From stochastic optimal control [16], it is known that the associated Hamilton Jacobi Bellman (HJB) equation is

    \partial_t V_t = q_t + (\nabla_x V_t)^T f_t - \frac{1}{2} (\nabla_x V_t)^T G_t R^{-1} G_t^T (\nabla_x V_t) + \frac{1}{2} \mathrm{trace}\left( (\nabla_{xx} V_t) G_t \Sigma_\epsilon G_t^T \right)    (7)

The corresponding optimal control is a function of the state and it is given by the equation:

    u(x_{t_i}) = u_{t_i} = -R^{-1} G_{t_i}^T (\nabla_{x_{t_i}} V_{t_i})    (8)

We depart from the standard development of this optimal control problem by transforming the HJB equation with the substitution V_t = -λ log Ψ_t and by introducing the simplification λR^{-1} = Σ_ε. In this way, the transformed HJB equation becomes a linear 2nd order partial differential equation. Due to the Feynman-Kac theorem [13, 25], the solution for the exponentially transformed value function becomes

    \Psi_{t_i} = \lim_{dt \to 0} \int p(\tau_i \mid x_i) \exp\left[ -\frac{1}{\lambda} \left( \phi_{t_N} + \sum_{j=0}^{N-1} q_{t_j} \, dt \right) \right] d\tau_i    (9)

Thus, we have transformed our stochastic optimal control problem into an approximation problem of a path integral. As detailed in [18], it is not necessary to compute the value function explicitly, but rather it is possible to derive the optimal controls directly:

    u_{t_i} = \int P(\tau_i) \, u(\tau_i) \, d\tau_i    (10)

    u(\tau_i) = R^{-1} G_{t_i}^T \left( G_{t_i} R^{-1} G_{t_i}^T \right)^{-1} (G_{t_i} \epsilon_{t_i} - b_{t_i})

where P(τ_i) is the probability of a trajectory τ_i, and b_{t_i} is a more complex expression, beyond the space constraints of this paper. The important conclusion is that it is possible to evaluate Eq. (10) from Monte Carlo roll-outs of the control system, i.e., our optimal control problem can be solved as an estimation problem.
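The following is a schematic numpy sketch of this estimation view of Eq. (10), assuming for illustration that each roll-out is summarized by a single sampled control vector and a scalar cost-to-go; the callables sample_rollout and trajectory_cost, the temperature lam, and the dummy usage at the end are placeholders, not the paper's implementation.

```python
import numpy as np

def path_integral_control_update(sample_rollout, trajectory_cost, K=20, lam=1.0):
    """Monte Carlo estimate of a probability-weighted control, in the spirit of Eq. (10).

    sample_rollout()     : returns (sampled_control, trajectory) for one noisy roll-out
    trajectory_cost(traj): returns the scalar cost-to-go S of that roll-out
    """
    controls, costs = [], []
    for _ in range(K):
        u_sample, traj = sample_rollout()
        controls.append(u_sample)
        costs.append(trajectory_cost(traj))
    S = np.asarray(costs)
    # Weights P(tau) proportional to exp(-S/lambda); subtract min(S) for numerical stability
    w = np.exp(-(S - S.min()) / lam)
    P = w / w.sum()
    # Probability-weighted average of the sampled controls
    return np.sum(P[:, None] * np.asarray(controls), axis=0)

# Dummy illustration with placeholder roll-outs (not the robots used in the paper)
rng = np.random.default_rng(0)
dummy_rollout = lambda: (rng.normal(size=2), None)
dummy_cost = lambda traj: float(rng.uniform(0.0, 10.0))
u_new = path_integral_control_update(dummy_rollout, dummy_cost, K=10, lam=0.5)
```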
A. The PI2 Algorithm

The PI2 algorithm is just a special case of the optimal control solution in Eq. (10), applied to control systems with a parameterized control policy:

    a_t = g_t^T (\theta + \epsilon_t)    (11)

i.e., the control command is generated from the inner product of a parameter vector θ with a vector of basis functions g_t – the noise ε_t is interpreted as user controlled exploration noise.

A particular case of a control system with parameterized policy is the Dynamic Movement Primitives (DMP) approach introduced by [6]:

    \frac{1}{\tau} \dot{v}_t = f_t + g_t^T (\theta + \epsilon_t)    (12)
    \frac{1}{\tau} \dot{q}_{d,t} = v_t
    f_t = \alpha (\beta (g - q_{d,t}) - v_t)

TABLE I
PSEUDOCODE OF THE PI2 ALGORITHM FOR A 1D PARAMETERIZED POLICY.

• Given:
  – An immediate cost function r_t = q_t + θ_t^T R θ_t (cf. Eq. (5))
  – A terminal cost term φ_{t_N} (cf. Eq. (5))
  – A stochastic parameterized policy a_t = g_t^T (θ + ε_t) (cf. Eqs. (11) and (12))
  – The basis function g_{t_i} from the system dynamics (cf. Eq. (14))
  – The variance Σ_ε of the mean-zero noise ε_t
  – The initial parameter vector θ
• Repeat until convergence of the trajectory cost R:
  – Create K roll-outs of the system from the same start state x_0 using stochastic parameters θ + ε_t at every time step
  – For all K roll-outs, compute:
    ∗ P(\tau_{i,k}) = \frac{e^{-\frac{1}{\lambda} S(\tau_{i,k})}}{\sum_{k=1}^{K} e^{-\frac{1}{\lambda} S(\tau_{i,k})}}
    ∗ S(\tau_{i,k}) = \phi_{t_N,k} + \sum_{j=i}^{N-1} q_{t_j,k} + \frac{1}{2} \sum_{j=i+1}^{N-1} (\theta + M_{t_j,k} \epsilon_{t_j,k})^T R \, (\theta + M_{t_j,k} \epsilon_{t_j,k})
    ∗ M_{t_j,k} = \frac{R^{-1} g_{t_j,k} \, g_{t_j,k}^T}{g_{t_j,k}^T R^{-1} g_{t_j,k}}
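As a concrete reading of the pseudocode fragment in Table I, the sketch below computes the projection matrices M_{t_j,k}, the cost-to-go values S(τ_{i,k}), and the probabilities P(τ_{i,k}) for K roll-outs of a 1D parameterized policy. The array shapes, argument names, and the temperature lam are assumptions chosen for illustration, not the paper's code.

```python
import numpy as np

def pi2_rollout_quantities(g, eps, theta, R, q_cost, phi_terminal, lam=1.0):
    """Per-roll-out quantities of Table I for a 1D parameterized policy (illustrative).

    g            : (K, N, p) basis function vectors g_{t_j,k}
    eps          : (K, N, p) exploration noise samples epsilon_{t_j,k}
    theta        : (p,) current policy parameters
    R            : (p, p) positive definite control cost matrix
    q_cost       : (K, N) state-dependent costs q_{t_j,k}
    phi_terminal : (K,) terminal costs phi_{t_N,k}
    """
    K, N, p = g.shape
    R_inv = np.linalg.inv(R)
    S = np.zeros((K, N))
    for k in range(K):
        for i in range(N):
            ctrl_cost = 0.0
            for j in range(i + 1, N):
                g_j = g[k, j]
                # Projection matrix M_{t_j,k} = R^{-1} g g^T / (g^T R^{-1} g)
                M = (R_inv @ np.outer(g_j, g_j)) / (g_j @ R_inv @ g_j)
                th_eps = theta + M @ eps[k, j]
                ctrl_cost += 0.5 * th_eps @ R @ th_eps
            # S(tau_{i,k}) = terminal cost + state costs from j=i + control costs from j=i+1
            S[k, i] = phi_terminal[k] + q_cost[k, i:].sum() + ctrl_cost
    # P(tau_{i,k}) = exp(-S/lambda), normalized over the K roll-outs at each time step i
    expS = np.exp(-(S - S.min(axis=0)) / lam)
    P = expS / expS.sum(axis=0)
    return S, P
```

Note that, as in the pseudocode, the probabilities are normalized over the K roll-outs separately at every time step i, so low-cost roll-outs dominate the subsequent parameter update.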