Paladyn, Journal of Behavioral Robotics

Editor-in-Chief: Schöner, Gregor

CiteScore 2017: 0.33

SCImago Journal Rank (SJR) 2017: 0.104

ICV 2017: 99.90

Open Access

Robot Skill Learning: From Reinforcement Learning to Evolution Strategies

Freek Stulp
  • Corresponding author
  • Robotics and Computer Vision, ENSTA-ParisTech, Paris, France
  • FLOWERS Team, INRIA Bordeaux Sud-Ouest, Talence, France

Olivier Sigaud
  • Corresponding author
  • Institut des Systèmes Intelligents et de Robotique, Université Pierre Marie Curie, CNRS UMR 7222, Paris
Published Online: 2013-09-11 | DOI: https://doi.org/10.2478/pjbr-2013-0003


Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, e.g. PoWER and PI², are now able to learn sophisticated high-dimensional robot skills. A side-effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms have become more and more similar to evolution strategies such as (μ_W, λ)-ES and CMA-ES. Evolution strategies treat policy improvement as a black-box optimization problem, and thus do not leverage the problem structure, whereas RL algorithms do. In this paper, we demonstrate how two straightforward simplifications to the state-of-the-art RL algorithm PI² suffice to convert it into the black-box optimization algorithm (μ_W, λ)-ES. Furthermore, we show that (μ_W, λ)-ES empirically outperforms PI² on the tasks in [36]. It is striking that PI² and (μ_W, λ)-ES share a common core, and that the simpler algorithm converges faster and leads to similar or lower final costs. We argue that this difference is due to a third trend in robot skill learning: the predominant use of dynamic movement primitives (DMPs). We show how DMPs dramatically simplify the learning problem, and discuss the implications of this for past and future work on policy improvement for robot skill learning.

Keywords: reinforcement learning; black-box optimization; evolution strategies; dynamic movement primitives
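
The abstract contrasts reward-weighted averaging in parameter space with gradient estimation. As a rough illustration only, the Python sketch below shows one way such an update can look, in the spirit of a (μ_W, λ)-ES with a fixed exploration magnitude. The function names, the quadratic toy cost, the exponential weighting with constant `h`, and all default values are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def reward_weighted_update(theta, cost_fn, n_samples=10, n_elite=5,
                           sigma=0.1, h=10.0, rng=random):
    """One update: sample lambda candidates, keep the mu best, average them.

    n_samples plays the role of lambda, n_elite the role of mu.
    The exponential weighting below is an illustrative assumption.
    """
    # Explore in parameter space: perturb the current parameters with
    # isotropic Gaussian noise of fixed standard deviation sigma.
    samples = [[t + sigma * rng.gauss(0.0, 1.0) for t in theta]
               for _ in range(n_samples)]
    costs = [cost_fn(s) for s in samples]

    # Black-box step: only the scalar cost of each sample is used.
    ranked = sorted(zip(costs, samples), key=lambda pair: pair[0])[:n_elite]

    # Map costs to weights: lower cost gives a higher weight.
    lowest, highest = ranked[0][0], ranked[-1][0]
    span = (highest - lowest) or 1.0  # avoid division by zero on ties
    weights = [math.exp(-h * (c - lowest) / span) for c, _ in ranked]
    total = sum(weights)
    weights = [w / total for w in weights]

    # New parameters are the weighted average of the selected samples;
    # no gradient is estimated anywhere.
    return [sum(w * s[i] for w, (_, s) in zip(weights, ranked))
            for i in range(len(theta))]


def sphere(theta):
    """Toy cost (hypothetical): squared distance to the origin."""
    return sum(t * t for t in theta)


def optimize(theta, cost_fn, n_updates=50, seed=0):
    """Apply the update repeatedly with a seeded random generator."""
    rng = random.Random(seed)
    for _ in range(n_updates):
        theta = reward_weighted_update(theta, cost_fn, rng=rng)
    return theta
```

Calling `optimize([1.0, -1.0, 0.5], sphere)` runs 50 such updates on the toy cost. In a robot skill learning setting, `cost_fn` would instead execute a rollout of the policy with the sampled parameters and return its total cost.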

  • [1] L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles. Technical report, INRIA Saclay, 2011.

  • [2] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event systems, 13(1-2):41-77, 2003.

  • [3] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies - a comprehensive introduction. Natural Computing, 1(1):3-52, 2002.

  • [4] L. Busoniu, D. Ernst, B. De Schutter, and R. Babuska. Cross-entropy optimization of control policies with adaptive basis functions. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 41(1):196-209, 2011.

  • [5] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9:937-965, 2008.

  • [6] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195, 2001.

  • [7] Nikolaus Hansen. The CMA evolution strategy: A tutorial, June 2011. http://www.lri.fr/hansen/cmatutorial.pdf.

  • [8] Verena Heidrich-Meisner and Christian Igel. Evolution strategies for direct policy search. In Proceedings of the 10th international conference on Parallel Problem Solving from Nature: PPSN X, pages 428-437, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-87699-1.

  • [9] Verena Heidrich-Meisner and Christian Igel. Similarities and differences between policy gradient methods and evolution strategies. In ESANN 2008, 16th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23-25, 2008, Proceedings, pages 149-154, 2008.

  • [10] A. Ijspeert, J. Nakanishi, P. Pastor, H. Hoffmann, and S. Schaal. Dynamical Movement Primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328-373, 2013.

  • [11] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2002.

  • [12] Shivaram Kalyanakrishnan and Peter Stone. Characterizing reinforcement learning methods through parameterized learning problems. Machine Learning, 84(1-2):205-247, 2011.

  • [13] H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.

  • [14] S. Mohammad Khansari-Zadeh and Aude Billard. Learning stable non-linear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 2011.

  • [15] J. Kober, E. Oztop, and J. Peters. Reinforcement learning to adjust robot movements to new situations. In Proceedings of Robotics: Science and Systems, Zaragoza, Spain, June 2010.

  • [16] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84:171-203, 2011.

  • [17] D. Marin and O. Sigaud. Towards fast and adaptive optimal control policies for robots: A direct policy search approach. In Proceedings Robotica, pages 21-26, Guimaraes, Portugal, 2012.

  • [18] Mustafa Parlaktuna, Doruk Tunaoglu, Erol Sahin, and Emre Ugur. Closed-loop primitives: A method to generate and recognize reaching actions from demonstration. In International Conference on Robotics and Automation, pages 2015-2020, 2012.

  • [19] J. Peters and S. Schaal. Applying the episodic natural actor-critic architecture to motor primitive learning. In Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN 2007), pages 1-6, 2007.

  • [20] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008.

  • [21] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks: the official journal of the International Neural Network Society, 21(4):682-97, May 2008. ISSN 0893-6080.

  • [22] W. B. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. Wiley-Blackwell, 2007.

  • [23] Martin Riedmiller, Jan Peters, and Stefan Schaal. Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 254-261. IEEE, April 2007. ISBN 1-4244-0706-0.

  • [24] T. Rückstiess, M. Felder, and J. Schmidhuber. State-dependent exploration for policy gradient methods. In 19th European Conference on Machine Learning (ECML), 2010.

  • [25] Thomas Rückstiess, Frank Sehnke, Tom Schaul, Daan Wierstra, Yi Sun, and Jürgen Schmidhuber. Exploring parameter space in reinforcement learning. Paladyn. Journal of Behavioral Robotics, 1:14-24, 2010. ISSN 2080-9778.

  • [26] J.C. Santamaría, R.S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive behavior, 6(2):163-217, 1997.

  • [27] H.-P. Schwefel. Evolutionsstrategie und numerische Optimierung. PhD thesis, TU Berlin, 1975.

  • [28] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010.

  • [29] O. Sigaud and J. Peters. From motor learning to interaction learning in robots. In From Motor Learning to Interaction Learning in Robots, volume 264, pages 1-12. Springer-Verlag, 2010.

  • [30] Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 1679-1686, New York, NY, USA, July 2012. Omnipress. ISBN 978-1-4503-1285-1.

  • [31] Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

  • [32] Freek Stulp, Evangelos Theodorou, Mrinal Kalakrishnan, Peter Pastor, Ludovic Righetti, and Stefan Schaal. Learning motion primitive goals for robust manipulation. In International Conference on Intelligent Robots and Systems (IROS), 2011.

  • [33] Freek Stulp, Evangelos Theodorou, and Stefan Schaal. Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE Transactions on Robotics, 28(6):1360-1370, 2012. King-Sun Fu Best Paper Award of the IEEE Transactions on Robotics for the year 2012.

  • [34] R. Sutton and A. Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.

  • [35] Minija Tamosiunaite, Bojan Nemec, Ales Ude, and Florentin Wörgötter. Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives. Robotics and Autonomous Systems, 59(11):910-922, 2011.

  • [36] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137-3181, 2010.

  • [37] Julian Togelius, Tom Schaul, Daan Wierstra, Christian Igel, Faustino Gomez, and Jürgen Schmidhuber. Ontogenetic and phylogenetic reinforcement learning. Zeitschrift Künstliche Intelligenz - Special Issue on Reinforcement Learning, pages 30-33, 2009.

  • [38] Ales Ude, Andrej Gams, Tamim Asfour, and Jun Morimoto. Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics, 26(5):800-815, 2010.

  • [39] S. Vijayakumar and S. Schaal. Locally weighted projection regression: An o(n) algorithm for incremental real time learning in high dimensional spaces. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 288-293, 2000.

  • [40] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In Proceedings of IEEE Congress on Evolutionary Computation (CEC), 2008.

  • [41] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

About the article

Published Online: 2013-09-11

Published in Print: 2013-09-01

Citation Information: Paladyn, Journal of Behavioral Robotics, Volume 4, Issue 1, Pages 49–61, ISSN (Print) 2081-4836, DOI: https://doi.org/10.2478/pjbr-2013-0003.

This content is open access.

Citing Articles

Crossref-listed publications in which this article is cited:

Carlos Celemin, Javier Ruiz-del-Solar, and Jens Kober
Autonomous Robots, 2018
Matthieu Zimmer and Stephane Doncieux
IEEE Transactions on Cognitive and Developmental Systems, 2018, Volume 10, Number 1, Page 102
Peter Englert and Marc Toussaint
The International Journal of Robotics Research, 2017, Page 027836491774379
Martina Zambelli and Yiannis Demiris
IEEE Transactions on Cognitive and Developmental Systems, 2017, Volume 9, Number 2, Page 113
Kenta Kato, Ryo Ariizumi, and Fumitoshi Matsuno
Journal of the Robotics Society of Japan, 2017, Volume 35, Number 2, Page 143
Georgios Pierris and Torbjorn S. Dahl
IEEE Transactions on Cognitive and Developmental Systems, 2017, Volume 9, Number 1, Page 30
René Felix Reinhart
Autonomous Robots, 2017, Volume 41, Number 7, Page 1521
Valentina Cristina Meola, Daniele Caligiore, Valerio Sperati, Loredana Zollo, Anna Lisa Ciancio, Fabrizio Taffoni, Eugenio Guglielmelli, and Gianluca Baldassarre
IEEE Transactions on Cognitive and Developmental Systems, 2016, Volume 8, Number 3, Page 152
Jim Mainprice, Rafi Hayne, and Dmitry Berenson
IEEE Transactions on Robotics, 2016, Volume 32, Number 4, Page 897
Wouter Caarls and Erik Schuitema
IEEE Transactions on Neural Networks and Learning Systems, 2016, Volume 27, Number 7, Page 1457
Olivier Sigaud and Alain Droniou
IEEE Transactions on Cognitive and Developmental Systems, 2016, Volume 8, Number 2, Page 99
Stephane Doncieux, Nicolas Bredeche, Jean-Baptiste Mouret, and Agoston E. (Gusz) Eiben
Frontiers in Robotics and AI, 2015, Volume 2
