Estimating Probability Distribution With Q-Learning For Biped Gait Generation and Optimization
Abstract: A new biped gait generation and optimization method is proposed in the framework of Estimation of Distribution Algorithms (EDAs) combined with Q-learning. By formulating biped gait synthesis as a constrained multi-objective optimization problem, a dynamically stable and low-energy-cost biped gait is generated by EDA with Q-learning (EDA_Q), which estimates probability distributions derived from the objective function to be optimized in order to generate search points in the highly coupled, high-dimensional working space of biped robots. To obtain a preferable permutation of the interrelated parameters, Q-learning is incorporated to build and modify the probability models in the EDA autonomously. By making use of the global optimization capability of EDA, the proposed EDA_Q can also overcome the local minima problem of traditional Q-learning. On the other hand, with the learning agent, EDA_Q can evaluate the probability distribution model regularly without a pre-designed structure and updating rule. Simulation results show that faster and more accurate searching can be achieved to generate a preferable biped gait. The gait has been successfully used to drive a soccer-playing humanoid robot called Robo-Erectus, one of the foremost soccer-playing humanoid robots in the RoboCup Humanoid League.

Index Terms: Biped robot, gait optimization, estimation of distribution algorithm, probability distribution, Q-learning.
I. INTRODUCTION

Recent research on humanoid robot walking typically describes biped locomotion as a periodic movement pattern. One of the basic requirements for the gait is dynamically stable and low-energy-cost walking. Though some prototypes have performed successful walking [1] [2], their gaits have not yet reached the level of a complete solution with sufficient reliability and adaptability. Current research on generating the expected trajectory can be roughly divided into two kinds: biomechanical-data based and control-principle based. Two commonly used methods of the latter kind are exploiting the intrinsic dynamics of the biped structure and parametric optimization. The parametric optimization technique offers the possibility of coping with the multi-DOF structure that varies during walking. Correspondingly, many methods are available to approximate the parameters to be optimized in the high-dimensional coupled space. Besides polynomials [3] and spline functions [4], heuristic computation methods such as neural networks [5], fuzzy logic [6] and genetic algorithms [7] are among the commonly used strategies.

To avoid involving too many parameters, a problem that appears in the aforementioned methods especially in high-dimensional optimization cases, an extension of evolutionary computation named the Estimation of Distribution Algorithm (EDA) [8] has been introduced into this area [9][10][11]. It can successfully generate expected biped gaits that satisfy a predefined objective function under the assumption that the joint angles are independent variables. Accordingly, a nine-link biped robot is formulated as a nonlinear system with impulses to study the dynamic stability of biped locomotion in this paper. Biped walking, formed by double support and swing phases, is represented by key poses extracted from one cycle. The joint angles at these key poses are the parameters to be optimized by EDA.

Two important factors that affect the learning quality of EDAs are the probability model and the corresponding updating rule. To build and update the probability model autonomously from experience, instead of determining it exclusively in advance, Reinforcement Learning (RL) [12] is incorporated with EDA in this paper. RL is used to learn the unknown desired distribution by providing an agent with a suitable evaluation of its performance. The RL agent allows the probability models in EDAs to learn from their experiences and interrelationships instead of directly from the selected samples. The specialty of the proposed method, named EDA_Q, is the combination of Q-learning and EDA, which gives EDA the ability to deal with interrelated multi-variables without a pre-designed probability distribution function and updating rule. This is particularly suited to the optimization problem arising in biped gait generation, which to the best of our knowledge is the first application in this area.

The rest of the paper is organized as follows. In the following section, the biped pattern and the mechanical models are briefly introduced. Constraints used in gait generation are presented in Section III. In Section IV, gait optimization is abstracted as a multi-objective optimization problem with geometric and state constraints. The EDA_Q used to solve the problem is then presented in Section V, followed by the simulation results in Section VI. Section VII gives some concluding remarks.

II. BIPED GAIT SYNTHESIS

A. Mathematical Model

Consider an Nl-link biped robot consisting of a torso, two hips and two identical legs with ankles and knees; its
kinematic structure is shown in Fig. 1. The biped mechanism is assumed to be a planar rigid-link structure. All links lump their mass at the middle center, and the branched kinematic chains, interconnected by spherical or cylindrical joints, transform from open to closed chains during the walking cycle. The mathematical model of biped locomotion proposed in [13] is modified in this paper. The spatial model defined in the internal coordinate state space, as shown in Fig. 2, is used to synthesize the dynamic control of the biped mechanism and to verify the simulation results. The definition of the angular coordinates and the disposition of the masses are also indicated in Fig. 2. The supporting ankle, denoted l1, is adopted as the basic link of the mechanism.
Fig. 2. Spatial model of the biped mechanism: angular coordinates q3, ..., q9, link lengths L1, ..., L9 (with La, Lb), distances a3, ..., a9 to the mass centers, and lumped masses m2, ..., m8.
The dynamical model of the biped robot can be represented by two parts: the differential equations describing the dynamics of the swing and double support phases, and an impulse model of the contact event at the beginning of the double support phase. Let q be the coordinate set describing the link locomotion in joint space with respect to a world reference frame, and let $\dot q$ be the corresponding angular velocity. The dynamics of the biped mechanism can be expressed by a Newton-Euler equation similar to that of manipulators:

$$M(q)\ddot q + C(q,\dot q)\dot q + G(q) = \tau + F, \qquad (1)$$

where $M(q) = \{r_{ij}\cos(q_j - q_i)\}_{N_l \times N_l}$ is the full inertia matrix, $C(q,\dot q) = \{s_{ij}\sin(q_j - q_i)\}_{N_l \times N_l}$ gives the Coriolis and centrifugal generalized inertia forces, and $G(q) = \{h_i \sin q_i\}_{N_l \times 1}$ represents the generalized gravity forces acting at the $N_q$ actuated joints. The parameters $r_{ij}$, $s_{ij}$ and $h_i$, $i, j = 1, 2, \ldots, N_q$, are constants derived from Lagrange's equations, and $\tau$ is the vector of driving moments at the joints.

$F$ is the reaction force that arises when the swing foot contacts the ground. It equals zero during the swing phase and changes to $F_i$, the integral of the contact impulses, at the landing impact instant. After the landing impact, as the gait enters the double support phase, $F$ can be represented as $J^T(q)F_e$, where $J(q) = \partial C/\partial q$ is the Jacobian matrix, $F_e$ is the vector of external forces and moments acting at the end of the swing, and $C$ collects the geometric constraints imposed by the closed-chain requirement.

B. Biped Gait Pattern

Biped walking is assumed to be confined to the sagittal plane on a level surface. The main phases of one gait cycle are the swing phase and the double support phase, as shown in Fig. 1. The swing phase of the nth gait cycle starts when the swing foot toes off the ground and ends when its heel touches the ground. The swing foot reaches its highest point at nT + Ta and lands on the ground Tb later, where T = Ta + Tb + Tc is the period of one gait cycle, Tc is the duration of the double support phase, and n = 0, 1, ... counts the steps. The double support phase begins at the moment the front heel touches the ground and ends when the rear toe sways off at (n + 1)T (as shown in Fig. 1). The end configuration of this phase initiates the swing phase of the next gait cycle. The landing impact happens at the beginning of the double support phase, and the transition between the two phases is assumed to take place in an infinitesimal length of time. The ratio between the swing and double support periods is a function of the walking velocity [14]; as an example, in the simulation the swing phase is set to twenty percent of one walking cycle according to [14].

Fig. 1. Biped gait pattern in one gait cycle with three key poses: left support at nT, swing-foot apex at nT + Ta, touch-down at nT + Ta + Tb, right support until (n + 1)T.

In total Nkp = 3 key poses Zi (i = 1, 2, ..., Nkp) are chosen in one cycle. Poses between the key poses are approximated by third-order spline functions; a technique of this type has been implemented using splines of class C² [4]. Such splines guarantee continuous second-order derivatives at every point, in particular at the connection points, so the actuating torques are continuous at the knots.
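As an illustration of this interpolation scheme, the sketch below fits a periodic C² cubic spline through key-pose angles for a single joint; the key-pose times and angle values are invented placeholders, not the learned values from the paper.

```python
# A minimal sketch of the key-pose interpolation, assuming hypothetical
# key-pose times/angles for one joint (the paper learns these values).
import numpy as np
from scipy.interpolate import CubicSpline

T = 1.0                                      # gait cycle period T = Ta + Tb + Tc [s]
t_key = np.array([0.0, 0.75, 0.85, 1.0])     # key poses 1..3, then pose 1 of the next cycle
q_key = np.array([0.35, 0.10, 0.25, 0.35])   # hypothetical knee angles [rad]

# Periodic boundary conditions give C^2 continuity across cycle boundaries,
# which keeps the actuating torques continuous at the knots.
spline = CubicSpline(t_key, q_key, bc_type="periodic")

t = np.linspace(0.0, T, 20)                  # the simulation samples 20 poses per cycle
q, dq, ddq = spline(t), spline(t, 1), spline(t, 2)
```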
III. CONSTRAINTS

According to their definition space, the constraints are divided into two categories: geometric and state constraints.

A. Geometric Constraints

Geometric constraints (GCs) ensure the feasibility of biped gaits from the standpoint of the physical structure. Since one gait can be regarded as a pose sequence generated by interpolating key poses, GCs are constructed only at the key poses. Two geometric constraints are taken into account.

GC1(q): position limitations

$$A_p \le p(q) \le B_p, \qquad (2)$$

GC2(q): structure limitations

$$A_q \le q \le B_q, \qquad (3)$$

where $p = [p_{t,i}]^T$, $A_p = [A_{p_{t,i}}]^T$ and $B_p = [B_{p_{t,i}}]^T$. $p_{t,i} = [x_{t,i}, y_{t,i}, z_{t,i}]$ denotes the center position of link i at time t, $i = 1, 2, \ldots, N_l$, and $A_{p_{t,i}}$ and $B_{p_{t,i}}$ are the lower and upper boundaries of $p_{t,i}$. $q = [q_{t,i}]^T$ stands for the ith joint angle at time t, $i = 1, 2, \ldots, N_q$, with $A_q = [A_{q_{t,i}}]^T$ and $B_q = [B_{q_{t,i}}]^T$ its lower and upper boundaries. $N_l$ and $N_q$ are the numbers of links and actuated torques of the biped robot.

B. State Constraints

Three further kinds of state constraints, concerning force, velocity and stability, are also taken into consideration. Details of each constraint are as follows.

FC1(q): during the double support phase, the forces on the feet must satisfy

$$\sum_{i=1}^{N_l} m_i \ddot p_i = f_R + f_L + \sum_{i=1}^{N_l} m_i g, \qquad (4)$$

where $f_R$ and $f_L$ are the ground reaction forces at the right and left foot respectively, and $\ddot p_i$ is the acceleration of link i.

FC2(q): the internal force $f_d$ in the closed-loop structure must be minimized, which can be expressed as

$$f_d = \min(FI(f_R, f_L)), \qquad (5)$$

where $FI$ is the function that calculates the internal force.

VC(q): velocity constraints are considered to guarantee motion smoothness with respect to the mechanical limitations of the biped system. They can be written as

$$A_{\dot q} \le \dot q \le B_{\dot q}, \qquad (6)$$

where $A_{\dot q} = [A_{\dot q_{t,i}}]^T$ and $B_{\dot q} = [B_{\dot q_{t,i}}]^T$ are the lower and upper boundaries of $\dot q_{t,i}$.

ZC(q): according to the dynamical stability criterion defined by the ZMP [15], the position of the ZMP should stay within the stable region, which changes from the area under the standing foot to the convex polygon formed by the two feet when the gait changes from single to double support. This can be simplified as

$$A_{zmp} \le p_{zmp} \le B_{zmp}, \qquad (7)$$

where $p_{zmp} = [p_{zmp,i}]^T$ and $p_{zmp,i} = (x_{zmp,i}, y_{zmp,i}, 0)$ is the position of the ith ZMP. $A_{zmp} = [A_{zmp,i}]^T$ and $B_{zmp} = [B_{zmp,i}]^T$ are the lower and upper boundaries of the stable region respectively. For the kinematic chain shown in Fig. 2, the ZMP can be calculated by

$$x_{zmp} = \frac{\sum_{i=1}^{N_q} \left( m_i(\ddot z_i + g)x_i - m_i \ddot x_i z_i - (I_i\dot\omega_i)_y \right)}{\sum_{i=1}^{N_q} m_i(\ddot z_i + g)}, \qquad (8)$$

$$y_{zmp} = \frac{\sum_{i=1}^{N_q} \left( m_i(\ddot z_i + g)y_i - m_i \ddot y_i z_i + (I_i\dot\omega_i)_x \right)}{\sum_{i=1}^{N_q} m_i(\ddot z_i + g)}, \qquad (9)$$

where $m_i$ is the mass of link i and g is the gravity acceleration. The position of link i is described by $[x, y, z]$, and correspondingly its accelerations in the x, y and z directions are $\ddot x_i$, $\ddot y_i$ and $\ddot z_i$. $(I_i)_x$ and $(I_i)_y$ are the inertial components, and $(\dot\omega_i)_x$ and $(\dot\omega_i)_y$ are the absolute angular acceleration components around the x and y axes at the center of gravity of link i.

More constraints can be added to meet further practical requirements in biped gait generation and optimization; this affects neither the problem formulation nor the framework definition.
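A minimal numerical sketch of Equations (8) and (9) follows, assuming the per-link masses, positions, accelerations and inertial terms have already been obtained from the dynamic model; the function and argument names are illustrative.

```python
# Sketch of the ZMP condition in Eqs. (8)-(9).
import numpy as np

def zmp(m, x, y, z, ddx, ddy, ddz, Iw_x, Iw_y, g=9.81):
    """m[i]: link mass; (x, y, z)[i]: link CoM position;
    (ddx, ddy, ddz)[i]: CoM acceleration;
    Iw_x[i], Iw_y[i]: (I_i * dw_i) components about the x and y axes."""
    denom = np.sum(m * (ddz + g))
    x_zmp = np.sum(m * (ddz + g) * x - m * ddx * z - Iw_y) / denom
    y_zmp = np.sum(m * (ddz + g) * y - m * ddy * z + Iw_x) / denom
    return x_zmp, y_zmp
```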
IV. PERFORMANCE CRITERION

The proposed constrained dynamical model with geometric and state constraints bounds the search space. Within this framework, an optimization procedure is developed to select a continuous feasible solution by minimizing an appropriate cost function, since the analytical properties of the performance criterion affect the optimal search in terms of both dynamics and stability. The ZMP displacement $f(\Theta)$ and the energy cost $g(\Theta)$ are chosen to form the cost function. The performance criterion is defined as:

$$\text{Minimize } \Gamma(\Theta) = \omega_f f(\Theta) + \omega_g g(\Theta) \qquad (10)$$

subject to

$$\Theta = \left\{ \theta \;\middle|\; A_p \le p \le B_p,\; A_q \le q \le B_q,\; \sum_{i=1}^{N_l} m_i \ddot p_i = f_R + f_L + \sum_{i=1}^{N_l} m_i g,\; FC2,\; VC,\; ZC \right\},$$

where $f(\Theta) = \int_{nT}^{(n+1)T} \| p_{zmp}(t) - p^d_{zmp}(t) \| \, dt$ is the integral of the ZMP displacement during the nth gait cycle, $p_{zmp}(t)$ and $p^d_{zmp}(t)$ are the real and desired ZMP trajectories respectively, and $\|\cdot\|$ is the second-order norm. $g(\Theta) = \int_{nT}^{(n+1)T} \tau(t) \, dt$ estimates the dynamical load during one gait cycle $[nT, (n+1)T]$.
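In discrete form (the sampling scheme is described at the start of the next section), the criterion can be evaluated as in the sketch below; the function and argument names are illustrative, and the normalization operator is left as a comment since the paper does not specify its form.

```python
# Sketch of evaluating the weighted criterion in Eq. (10) from sampled poses.
import numpy as np

def gait_cost(zmp_err, torques, w_f=0.5, w_g=0.5):
    """zmp_err[i]  : ||p_zmp - p_zmp^d|| at the ith sampled pose.
    torques[i, j]  : torque of actuated joint j at the ith sampled pose.
    w_f, w_g       : objective weights, set to 0.5 in the simulation."""
    f = np.sum(zmp_err)            # discrete ZMP-displacement integral f(Theta)
    g = np.sum(np.abs(torques))    # discrete dynamical-load integral g(Theta)
    # The paper applies a normalization operator N to make the two targets
    # comparable before weighting; one simple (assumed) choice is dividing
    # each term by its value for the initial, unlearned gait.
    return w_f * f + w_g * g
```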
By sampling Ns poses in the nth gait cycle, we obtain the discrete $f_i(q)$ and $g_i(q)$, where $f_i = \|p^i_{zmp} - p^d_{zmp}\|$ is the Euclidean distance between the actual ZMP $p^i_{zmp}$ and the desired ZMP trajectory at the ith sample index, and $g_i = \sum_{j=1}^{N_q} \tau_{ij}$ sums the torques of all actuated joints at the ith sample index. $N$ is the normalization operator that makes the two targets comparable.

V. EDA_Q FOR BIPED GAIT OPTIMIZATION

For the given biped gait generation problem, many heuristic approaches such as neural networks, fuzzy systems and genetic algorithms are available. However, they require determining a large number of related parameters and are hard to transfer to other kinds of prototypes. The population-based heuristic method EDA, which depends on fewer parameters than these traditional intelligent methods, has been shown to solve such problems successfully [9][10][11]. However, as we found when using EDA for biped gait optimization, the parameters to be optimized are interrelated, since each joint angle is greatly affected by its value at the prior key pose. To exploit this relationship in the probability construction of EDA, we developed a Q-learning-based updating rule for EDA and thus propose a new algorithm, named EDA_Q, for biped gait generation and optimization.

A. EDA

Like other evolutionary algorithms, EDAs maintain and successively improve a population of solutions in order to find promising solutions. Unlike GAs, however, manipulations such as crossover or mutation are replaced in EDAs by a posterior probability distribution model. For most problems, there is no information about the functional form of the probability distribution model. Several different probability models have been proposed in EDAs for continuous optimization problems. The Gaussian model is the most common choice owing to its low computational load, but it cannot handle multi-modal objective functions; a Gaussian mixture model can in theory, but at high cost, especially for large-scale problems. Correspondingly, the updating rule must be designed separately for different probability models and problems. To build and update a discrete probability model autonomously and intelligently, RL is a good choice, particularly for interrelated multi-parameters.

B. Q-learning

Model-free delayed RL methods are derived by making suitable approximations to the computations in value iteration and policy iteration, so as to eliminate the need for a system model. Q-learning is one of the important methods resulting from such approximations. It solves problems with an agent that learns behavior through trial-and-error interactions with a dynamic environment, maintaining an estimate Q(s, a) of the value of taking action a in state s. Its update rule can be represented as:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right). \qquad (11)$$
Here $s'$ is the state following s after action a, and $a'$ ranges over the action set of state $s'$; r is the scalar feedback provided by the critic. Let $Q^*$ be the global optimum; it has been proved that if each action is executed in each state an infinite number of times on an infinite run and $\alpha$ is decayed appropriately, the Q values converge with probability one to $Q^*$ [16].

C. EDA_Q

Hereby, to speed up the search in the high-dimensional coupled space, we apply EDA_Q to each joint angle as follows.

1. Initialization: Set k = 1. Initialize the multivariate distribution models $Pro_{i,i,j}(k)$ and $Pro_{l,l+1,j}(k)$, $i = 1, \ldots, N_m$, $l = 1, \ldots, N_m - 1$, $j = 1, \ldots, N_q$.

2. Sampling: Generate $N_s$ samples $O^m_s(k)$, $m = 1, 2, \ldots, N_s$, from $Pro_{i,l,j}(k)$ to form the current population $O_s(k)$.

3. Selection: Select the $N_b$ best samples $O_b$ from $O_s$ according to the objective function values, with $N_b = \eta_s N_s$ $(0 < \eta_s < 1)$.

4. Updating: Update $Pro_{1,1,j}(k+1)$ and $Pro_{i,i+1,j}(k+1)$ by Equations (13) and (12) respectively, $i = 1, \ldots, N_m - 1$. Then modify $Pro_{l,l,j}(k+1)$ by Equation (14), $l = 2, 3, \ldots, N_m$.

5. Stopping: If the stop condition is not met, let k = k + 1 and go back to step 2.

The probability models of the joint angles at the first key pose are simply updated with the corresponding rewards, as shown in Equation (13), while the transfer probabilities between joint angles are updated by the Q-learning method, as shown in Equation (12), and the probability distribution functions for the later key poses are updated by Equation (14):

$$Pro_{i,i+1,j}(k+1) = Pro_{i,i+1,j}(k) + \alpha \left( r(q_{i,j})(k) + \gamma \max Pro_{i+1,i+1,j}(k) - Pro_{i,i+1,j}(k) \right), \qquad (12)$$

$$Pro_{1,1,j}(k+1) = (1 - \alpha) Pro_{1,1,j}(k) + \alpha\, r(q_{1,j})(k), \qquad (13)$$

$$Pro_{i+1,i+1,j}(k+1) = Pro_{i,i+1,j}(k)\, Pro_{i,i,j}(k). \qquad (14)$$
where $Pro_{i,l,j}(k)$ represents the transfer probability of the jth variable from the ith moment to the lth moment at the kth learning epoch, with $i, l = 1, 2, \ldots, N_m$ indexing the $N_m$ moments and $j = 1, 2, \ldots, N_q$ indexing the variables. When i = l, $Pro_{i,i,j}(k)$ stands for the probability model of the jth variable at the ith moment at the kth learning epoch. The reward r, representing the reward value at the current state, is calculated from the selected best samples as

$$r(q_{i,j}) = \omega_f\, r_{local}(q_{i,j}) + \omega_g\, r_{global}(q_i), \qquad (15)$$

where

$$r_{local}(q_{i,j}) = N\left( (\tau_{i,j})^{-1} \right), \qquad (16)$$

$$r_{global}(q_i) = N\left( f_i^{-1} \right). \qquad (17)$$
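The sketch below illustrates one updating pass of Equations (12)-(14) for a single joint variable, treating each probability model as a discrete histogram over Nd bins; the array layout, the renormalization step and the reward vector are simplifying assumptions, not the authors' implementation.

```python
# Illustrative updating pass of Eqs. (12)-(14) for one joint variable j.
import numpy as np

Nm, Nd = 3, 20            # number of key poses (moments) and histogram bins
alpha, gamma = 0.1, 0.1   # learning rate and discount, as in the simulation

# Pro[i, l] is a length-Nd histogram: transfer probability of the joint from
# moment i to moment l; Pro[i, i] is the marginal model at moment i.
Pro = np.full((Nm, Nm, Nd), 1.0 / Nd)

def update(Pro, r):
    """r[i] is a length-Nd reward vector for moment i, built from the
    selected best samples via Eqs. (15)-(17)."""
    Pro[0, 0] = (1 - alpha) * Pro[0, 0] + alpha * r[0]          # Eq. (13)
    for i in range(Nm - 1):
        Pro[i, i + 1] += alpha * (r[i] + gamma * Pro[i + 1, i + 1].max()
                                  - Pro[i, i + 1])              # Eq. (12)
        nxt = Pro[i, i + 1] * Pro[i, i]                         # Eq. (14)
        Pro[i + 1, i + 1] = nxt / nxt.sum()  # renormalize (an added assumption)
    return Pro
```

Sampling a candidate gait then draws the joint angle at the first key pose from Pro[0, 0] and the angles at the later key poses through the transfer models.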
Unlike traditional EDAs, which assume the variables to be independent individuals, the probability models of the joint angles are connected by the Q-learning method in the proposed algorithm. EDA_Q takes the parameters to be learned at each key pose as a combined unit.
Fig. 3. The humanoid robot Robo-Erectus (RE).
The interrelationship between them, which also indirectly affects the quality of the objective function, is applied to modify the probability model as well. Consequently the formed probability model reflects the actual distribution more precisely. Moreover, EDA_Q can update the probability model automatically for any kind of probability function without prior knowledge, which saves much human effort.

VI. SIMULATION RESULTS

To show the effectiveness of the proposed EDA_Q for biped gait optimization, it is applied to the simulation model of the humanoid robot Robo-Erectus (RE), shown in Fig. 3 [17]. The height and weight of RE are 600 mm and 4.6 kg respectively. RE has 23 DOFs in total: 6 per leg, 4 per arm, 1 for the head and 2 for the waist, driven by servo motors with a maximum torque of 30 kg·cm. A notebook running the Windows XP operating system is mounted on RE for online control.

The simplified model shown in Fig. 2 takes the basic structure parameters as: number of links Nl = 9, number of key poses Nkp = 3, number of samples in one gait cycle Ns = 10. The width of the hip is lh = 15 cm; the other links take the values: foot length l1 = 10 cm, la = 5 cm, lb = 5 cm, l2 = 4 cm, l3 = l4 = 8 cm, l5 = 20 cm; gait length Ds = 30 cm; maximum height of the swing foot Dh = 3 cm. The masses of the links are m1 = m2 = 0.1 kg, m3 = m4 = 0.2 kg, m5 = 1 kg. The desired ZMP is set as the middle line of the stable region. In total 20 poses are sampled in one gait cycle, of which the first, the sixteenth and the eighteenth are key poses. The parameters used for EDA_Q are Ns = 16, Nb = 4, ωf = ωg = 0.5, α = 0.1, γ = 0.1. The discrete probability model of each joint angle is represented by Nd = 20 uniform samples. EDA_Q stops learning after 100 iterations.
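For convenience, the settings above can be gathered into one configuration structure; the grouping and names are purely illustrative:

```python
# Simulation settings from the text, gathered for convenience (names illustrative).
ROBOT = dict(
    n_links=9, n_key_poses=3,
    hip_width=0.15, foot_length=0.10,              # lh, l1 [m]
    link_lengths=dict(la=0.05, lb=0.05, l2=0.04,
                      l3=0.08, l4=0.08, l5=0.20),  # [m]
    masses=dict(m1=0.1, m2=0.1, m3=0.2, m4=0.2, m5=1.0),  # [kg]
    gait_length=0.30, swing_height=0.03,           # Ds, Dh [m]
)
EDA_Q_PARAMS = dict(
    pop_size=16, n_best=4,          # Ns, Nb
    w_f=0.5, w_g=0.5,               # objective weights
    alpha=0.1, gamma=0.1,           # learning rate, discount
    n_bins=20, max_epochs=100,      # Nd uniform samples, stop condition
)
```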
The ZMP trajectories and joint torques of the learned gait are shown in Fig. 4 and Fig. 5 respectively. The ZMP trajectory does not lie very close to the desired middle line, since tracking it exactly would cost too much energy, but it concentrates along a straight line within the stable region. Moreover, the biped gait optimized by the proposed EDA_Q requires less torque (as shown in Fig. 5) in comparison with the gait before learning (as shown in Fig. 6).

Fig. 4. ZMP trajectories of the learned gait.
Therefore it can be concluded that the learned biped motion, having a large stability margin in the ZMP sense, is not only dynamically stable but also low in energy consumption. Fig. 7 displays the variation of the objective function value for EDA_Q. It can be seen that the objective function value drops sharply to half of its initial value during the first 10 iterations and converges to 1.2 after the allotted 100 learning epochs. Moreover, the peak value of the oscillation also shrinks as learning proceeds. The two merits inherited from Q-learning and EDA respectively, namely the learning capability and the insensitivity to parameter settings, give EDA_Q the remarkable ability to form any kind of probability model without prior knowledge within a short number of learning epochs. This saves considerable human effort and can be used to solve online biped gait optimization and learning problems.

VII. CONCLUSIONS

This paper formulates biped gait generation as a multi-objective optimization problem under multiple constraints. A new EDA-based evolutionary computation method with a Q-learning updating rule, named EDA_Q, is proposed to obtain precise solutions in the high-dimensional coupled space. In EDA_Q, the parameters to be optimized are not considered as independent individuals but as an interrelated unit, and the relationship between them is utilized to approximate the probability distribution of the parameters by the Q-learning method. Hereby, the probability model can be generated and modified autonomously without a pre-designed structure. It is used in this paper to address the problem of establishing a periodic orbit in biped models, in comparison with the classic EDA. The effectiveness and fast learning ability of EDA_Q are verified by the simulation results. The probability model presented in this paper is discrete for simplicity; for continuous problems, different models and related knowledge about the probability distribution function can easily be incorporated into the method, which will be studied in the future.
ACKNOWLEDGMENT
We would like to thank the staff and students at the Advanced Robotics and Intelligent Control Center (ARICC) of Singapore Polytechnic for their support in the development of the Robo-Erectus humanoid robots. The research described in this paper is jointly supported by the Singapore Tote Fund and the Singapore Polytechnic R&D Fund. It is also partially supported by the National Key Project for Basic Research of China (Grant No. 2002CB312205).

REFERENCES
[1] F. Tanaka, B. Fortenberry, K. Aisaka, and J. R. Movellan, "Plans for developing real-time dance interaction between QRIO and toddlers in a classroom environment," in Proc. of the 4th Int. Conf. on Development and Learning, 2005, pp. 142-147.
[2] K. Kaneko, F. Kanehiro, S. Kajita, H. Hirukawa, T. Kawasaki, M. Hirata, K. Akachi, and T. Isozumi, "Humanoid robot HRP-2," in Proc. of IEEE Int. Conf. on Robotics and Automation, vol. 2, 2004, pp. 1083-1090.
[3] T. Saidouni and G. Bessonnet, "Gait trajectory optimization using approximation functions," in Proc. of the 5th Int. Conf. on Climbing and Walking Robots, 2002, pp. 709-716.
[4] T. Saidouni and G. Bessonnet, "Generating globally optimized sagittal gait cycles of a biped robot," Robotica, vol. 21, pp. 199-210, 2003.
[5] J. Hu, J. Pratt, and G. Pratt, "Stable adaptive control of a bipedal walking robot with CMAC neural networks," in Proc. of IEEE Int. Conf. on Robotics and Automation, 1999, pp. 1950-1956.
[6] C. Zhou and Q. Meng, "Dynamic balance of a biped robot using fuzzy reinforcement learning agents," Fuzzy Sets and Systems, vol. 134, no. 1, pp. 169-187, 2003.
[7] G. Capi, S. Kaneko, K. Mitobe, L. Barolli, and Y. Nasu, "Optimal trajectory generation for a prismatic joint biped robot using genetic algorithms," Robotics and Autonomous Systems, vol. 38, no. 2, pp. 119-128, 2002.
[8] P. Larranaga and J. A. Lozano, Eds., Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, 2001.
[9] L. Hu, C. Zhou, and Z. Sun, "Biped gait optimization using estimation of distribution algorithm," in Proc. of IEEE-RAS Int. Conf. on Humanoid Robots, 2005, pp. 283-289.
[10] L. Hu, C. Zhou, and Z. Sun, "Biped gait optimization using spline function based probability model," in Proc. of IEEE Int. Conf. on Robotics and Automation, Orlando, USA, May 2006, pp. 830-835.
[11] L. Hu, C. Zhou, and Z. Sun, "Hybrid estimation of distribution algorithm with application to biped gait generation," Information Sciences, 2006 (accepted).
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, ser. Adaptive Computation and Machine Learning. MIT Press, 1998.
[13] J. Grizzle, G. Abba, and F. Plestan, "Asymptotically stable walking for biped robots: Analysis via systems with impulse effects," IEEE Trans. on Automatic Control, vol. 46, no. 1, pp. 51-64, 2001.
[14] Q. Huang, K. Yokoi, S. Kajita, K. Kaneko, H. Arai, N. Koyachi, and K. Tanie, "Planning walking patterns for a biped robot," IEEE Trans. on Robotics and Automation, vol. 17, no. 3, pp. 280-290, 2001.
[15] M. Vukobratovic, B. Borovac, D. Surla, and D. Stokic, Biped Locomotion: Dynamics, Stability, Control, and Application. New York: Springer-Verlag, 1990.
[16] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Psychology Department, University of Cambridge, 1989.
[17] C. Zhou and P. K. Yue, "Robo-Erectus: A low cost autonomous humanoid soccer robot," Advanced Robotics, vol. 18, no. 7, pp. 717-720, 2004.
Fig. 5. Joint torques of the gait optimized by EDA_Q (single support, double support and landing impact phases marked).

Fig. 6. Joint torques of the gait before learning (single support, double support and landing impact phases marked).
Fig. 7. Variation of the objective function value of EDA_Q over 100 learning epochs.