Open Access. Published by De Gruyter Open Access, September 11, 2013

Robot Skill Learning: From Reinforcement Learning to Evolution Strategies

  • Freek Stulp and Olivier Sigaud


Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, e.g. PoWER and PI², are now able to learn sophisticated high-dimensional robot skills. A side-effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms have become more and more similar to evolution strategies such as (μ_W, λ)-ES and CMA-ES. Evolution strategies treat policy improvement as a black-box optimization problem, and thus do not leverage the problem structure, whereas RL algorithms do. In this paper, we demonstrate how two straightforward simplifications to the state-of-the-art RL algorithm PI² suffice to convert it into the black-box optimization algorithm (μ_W, λ)-ES. Furthermore, we show that (μ_W, λ)-ES empirically outperforms PI² on the tasks in [36]. It is striking that PI² and (μ_W, λ)-ES share a common core, and that the simpler algorithm converges faster and leads to similar or lower final costs. We argue that this difference is due to a third trend in robot skill learning: the predominant use of dynamic movement primitives (DMPs). We show how DMPs dramatically simplify the learning problem, and discuss the implications of this for past and future work on policy improvement for robot skill learning.
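
The reward-weighted averaging that the abstract contrasts with gradient estimation can be illustrated with a short sketch. This is not the authors' implementation: the function names, the exponentiated and range-normalized cost weighting with constant h, the fixed decay of the exploration noise, and the quadratic test cost are all assumptions chosen to keep the example small. It treats the cost function as a black box, in the spirit of an evolution strategy with weighted recombination.

```python
import math
import random


def reward_weighted_average(samples, costs, h=10.0):
    """Average parameter vectors so that low-cost samples get high weight.

    The exponential weighting and the constant h are illustrative assumptions.
    """
    lo, hi = min(costs), max(costs)
    span = (hi - lo) or 1.0  # avoid division by zero when all costs are equal
    raw = [math.exp(-h * (c - lo) / span) for c in costs]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(samples[0])
    mean = [sum(w * s[d] for w, s in zip(weights, samples)) for d in range(dim)]
    return mean, weights


def optimize(cost_fn, theta, sigma=0.5, n_samples=10, n_updates=50,
             decay=0.95, seed=0):
    """Black-box policy improvement: perturb parameters, evaluate, average."""
    rng = random.Random(seed)
    for _ in range(n_updates):
        # Explore in parameter space with isotropic Gaussian noise.
        samples = [[t + rng.gauss(0.0, sigma) for t in theta]
                   for _ in range(n_samples)]
        # Each sample is scored by a single scalar cost (black-box view).
        costs = [cost_fn(s) for s in samples]
        # The new parameters are the reward-weighted average of the samples.
        theta, _ = reward_weighted_average(samples, costs)
        sigma *= decay  # hypothetical fixed decay schedule for exploration
    return theta


def sphere(theta):
    """Toy cost standing in for the cost of a robot rollout (assumption)."""
    return sum(t * t for t in theta)


if __name__ == "__main__":
    start = [3.0, -2.0, 1.0]
    result = optimize(sphere, start)
    print("cost before:", sphere(start), "cost after:", sphere(result))
```

No gradient of the cost is ever computed here; the update only ranks perturbed parameter vectors by their scalar cost, which is what makes the comparison between RL-style updates and evolution strategies natural.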


[1] L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles. Technical report, INRIA Saclay, 2011.

[2] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event systems, 13(1-2):41-77, 2003. doi:10.1023/A:1022140919877

[3] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies - a comprehensive introduction. Natural Computing, 1(1):3-52, 2002. doi:10.1023/A:1015059928466

[4] L. Busoniu, D. Ernst, B. De Schutter, and R. Babuska. Cross-entropy optimization of control policies with adaptive basis functions. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 41(1):196-209, 2011. doi:10.1109/TSMCB.2010.2050586

[5] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9:937-965, 2008.

[6] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195, 2001. doi:10.1162/106365601750190398

[7] Nikolaus Hansen. The CMA evolution strategy: A tutorial, June 2011.

[8] Verena Heidrich-Meisner and Christian Igel. Evolution strategies for direct policy search. In Proceedings of the 10th international conference on Parallel Problem Solving from Nature: PPSN X, pages 428-437, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-87699-1. doi:10.1007/978-3-540-87700-4_43

[9] Verena Heidrich-Meisner and Christian Igel. Similarities and differences between policy gradient methods and evolution strategies. In ESANN 2008, 16th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23-25, 2008, Proceedings, pages 149-154, 2008.

[10] A. Ijspeert, J. Nakanishi, P. Pastor, H. Hoffmann, and S. Schaal. Dynamical Movement Primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328-373, 2013.

[11] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2002.

[12] Shivaram Kalyanakrishnan and Peter Stone. Characterizing reinforcement learning methods through parameterized learning problems. Machine Learning, 84(1-2):205-247, 2011. doi:10.1007/s10994-011-5251-x

[13] H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005. doi:10.1088/1742-5468/2005/11/P11011

[14] S. Mohammad Khansari-Zadeh and Aude Billard. Learning stable non-linear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 2011. doi:10.1109/ROBOT.2010.5510001

[15] J. Kober, E. Oztop, and J. Peters. Reinforcement learning to adjust robot movements to new situations. In Proceedings of Robotics: Science and Systems, Zaragoza, Spain, June 2010. doi:10.15607/RSS.2010.VI.005

[16] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84:171-203, 2011. doi:10.1007/s10994-010-5223-6

[17] D. Marin and O. Sigaud. Towards fast and adaptive optimal control policies for robots: A direct policy search approach. In Proceedings Robotica, pages 21-26, Guimaraes, Portugal, 2012.

[18] Mustafa Parlaktuna, Doruk Tunaoglu, Erol Sahin, and Emre Ugur. Closed-loop primitives: A method to generate and recognize reaching actions from demonstration. In International Conference on Robotics and Automation, pages 2015-2020, 2012. doi:10.1109/ICRA.2012.6225039

[19] J. Peters and S. Schaal. Applying the episodic natural actor-critic architecture to motor primitive learning. In Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN 2007), pages 1-6, 2007.

[20] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008. doi:10.1016/j.neucom.2007.11.026

[21] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks: the official journal of the International Neural Network Society, 21(4):682-97, May 2008. ISSN 0893-6080. doi:10.1016/j.neunet.2008.02.003

[22] W. B. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. Wiley-Blackwell, 2007. doi:10.1002/9780470182963

[23] Martin Riedmiller, Jan Peters, and Stefan Schaal. Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 254-261. IEEE, April 2007. ISBN 1-4244-0706-0. doi:10.1109/ADPRL.2007.368196

[24] T. Rückstiess, M. Felder, and J. Schmidhuber. State-dependent exploration for policy gradient methods. In 19th European Conference on Machine Learning (ECML), 2010.

[25] Thomas Rückstiess, Frank Sehnke, Tom Schaul, Daan Wierstra, Yi Sun, and Jürgen Schmidhuber. Exploring parameter space in reinforcement learning. Paladyn. Journal of Behavioral Robotics, 1:14-24, 2010. ISSN 2080-9778. doi:10.2478/s13230-010-0002-4

[26] J.C. Santamaría, R.S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive behavior, 6(2):163-217, 1997. doi:10.1177/105971239700600201

[27] H.-P. Schwefel. Evolutionsstrategie und numerische Optimierung. PhD thesis, TU Berlin, 1975.

[28] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010. doi:10.1016/j.neunet.2009.12.004

[29] O. Sigaud and J. Peters. From motor learning to interaction learning in robots. In From Motor Learning to Interaction Learning in Robots, volume 264, pages 1-12. Springer-Verlag, 2010. doi:10.1007/978-3-642-05181-4_1

[30] Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 1679-1686, New York, NY, USA, July 2012. Omnipress. ISBN 978-1-4503-1285-1.

[31] Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[32] Freek Stulp, Evangelos Theodorou, Mrinal Kalakrishnan, Peter Pastor, Ludovic Righetti, and Stefan Schaal. Learning motion primitive goals for robust manipulation. In International Conference on Intelligent Robots and Systems (IROS), 2011. doi:10.1109/IROS.2011.6094877

[33] Freek Stulp, Evangelos Theodorou, and Stefan Schaal. Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE Transactions on Robotics, 28(6):1360-1370, 2012. King-Sun Fu Best Paper Award of the IEEE Transactions on Robotics for the year 2012. doi:10.1109/TRO.2012.2210294

[34] R. Sutton and A. Barto. Reinforcement Learning: an Introduction. MIT Press, 1998. doi:10.1109/TNN.1998.712192

[35] Minija Tamosiumaite, Bojan Nemec, Ales Ude, and Florentin Wörgötter. Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives. Robotics and Autonomous Systems, 59(11):910-922, 2011. doi:10.1016/j.robot.2011.07.004

[36] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137-3181, 2010.

[37] Julian Togelius, Tom Schaul, Daan Wierstra, Christian Igel, Faustino Gomez, and Jürgen Schmidhuber. Ontogenetic and phylogenetic reinforcement learning. Zeitschrift Künstliche Intelligenz - Special Issue on Reinforcement Learning, pages 30-33, 2009.

[38] Ales Ude, Andrej Gams, Tamim Asfour, and Jun Morimoto. Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics, 26(5):800-815, 2010. doi:10.1109/TRO.2010.2065430

[39] S. Vijayakumar and S. Schaal. Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional spaces. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 288-293, 2000.

[40] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In Proceedings of IEEE Congress on Evolutionary Computation (CEC), 2008. doi:10.1109/CEC.2008.4631255

[41] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992. doi:10.1007/BF00992696

Published Online: 2013-09-11
Published in Print: 2013-09-1

This content is open access.
