Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, such as PoWER and PI², are now able to learn sophisticated high-dimensional robot skills. A side effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms have become more and more similar to evolution strategies such as (μW, λ)-ES and CMA-ES. Evolution strategies treat policy improvement as a black-box optimization problem, and thus do not leverage the problem structure, whereas RL algorithms do. In this paper, we demonstrate how two straightforward simplifications to the state-of-the-art RL algorithm PI² suffice to convert it into the black-box optimization algorithm (μW, λ)-ES. Furthermore, we show that (μW, λ)-ES empirically outperforms PI² on the tasks in . It is striking that PI² and (μW, λ)-ES share a common core, and that the simpler algorithm converges faster and leads to similar or lower final costs. We argue that this difference is due to a third trend in robot skill learning: the predominant use of dynamic movement primitives (DMPs). We show how DMPs dramatically simplify the learning problem, and discuss the implications of this for past and future work on policy improvement for robot skill learning.
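As an illustration of reward-weighted averaging in parameter space, the following is a minimal sketch of one black-box update step, not the algorithm evaluated in the paper. The function names, the fixed exploration covariance, the exponential cost-to-weight mapping with the constant `h`, and the quadratic toy cost are all assumptions made for this example.

```python
import numpy as np

def reward_weighted_update(mean, cov, cost_fn, n_samples=10, h=10.0, rng=None):
    """One illustrative update: sample parameters, weight by cost, average.

    All names and the weighting scheme are assumptions for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Explore directly in parameter space around the current mean.
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    # Black-box evaluation: only the scalar cost of each sample is used.
    costs = np.array([cost_fn(theta) for theta in samples])
    span = costs.max() - costs.min()
    if span < 1e-12:
        # All samples are equally good; fall back to a plain average.
        weights = np.full(n_samples, 1.0 / n_samples)
    else:
        # Low cost -> high weight (assumed exponential mapping).
        weights = np.exp(-h * (costs - costs.min()) / span)
        weights /= weights.sum()
    # The new mean is the weighted average of the samples; no gradient is estimated.
    return weights @ samples

def quadratic_cost(theta):
    """Hypothetical toy cost with its minimum at the origin."""
    return float(np.sum(theta ** 2))

rng = np.random.default_rng(0)
start = np.array([2.0, -1.5])
mean = start.copy()
cov = 0.1 * np.eye(2)  # fixed exploration covariance (assumption)
for _ in range(50):
    mean = reward_weighted_update(mean, cov, quadratic_cost, rng=rng)
final_cost = quadratic_cost(mean)
```

In this sketch every sample contributes to the average; restricting the weights to the best few samples would be a different design choice.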
 L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles. Technical report, INRIA Saclay, 2011.
 A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event systems, 13(1-2):41-77, 2003.
 Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies - a comprehensive introduction. Natural Computing, 1(1):3-52, 2002.
 L. Busoniu, D. Ernst, B. De Schutter, and R. Babuska. Cross-entropy optimization of control policies with adaptive basis functions. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 41(1):196-209, 2011.
 F. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9:937-965, 2008.
 N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195, 2001.
 Nikolaus Hansen. The CMA evolution strategy: A tutorial, June 2011. http://www.lri.fr/hansen/cmatutorial.pdf.
 Verena Heidrich-Meisner and Christian Igel. Evolution strategies for direct policy search. In Proceedings of the 10th International Conference on Parallel Problem Solving from Nature: PPSN X, pages 428-437, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-87699-1.
 Verena Heidrich-Meisner and Christian Igel. Similarities and differences between policy gradient methods and evolution strategies. In ESANN 2008, 16th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23-25, 2008, Proceedings, pages 149-154, 2008.
 A. Ijspeert, J. Nakanishi, P. Pastor, H. Hoffmann, and S. Schaal. Dynamical Movement Primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328-373, 2013.
 A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2002.
 Shivaram Kalyanakrishnan and Peter Stone. Characterizing reinforcement learning methods through parameterized learning problems. Machine Learning, 84(1-2):205-247, 2011.
 H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.
 S. Mohammad Khansari-Zadeh and Aude Billard. Learning stable non-linear dynamical systems with gaussian mixture models. IEEE Transactions on Robotics, 2011.
 J. Kober, E. Oztop, and J. Peters. Reinforcement learning to adjust robot movements to new situations. In Proceedings of Robotics: Science and Systems, Zaragoza, Spain, June 2010.
 J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84:171-203, 2011.
 D. Marin and O. Sigaud. Towards fast and adaptive optimal control policies for robots: A direct policy search approach. In Proceedings Robotica, pages 21-26, Guimaraes, Portugal, 2012.
 Mustafa Parlaktuna, Doruk Tunaoglu, Erol Sahin, and Emre Ugur. Closed-loop primitives: A method to generate and recognize reaching actions from demonstration. In International Conference on Robotics and Automation, pages 2015-2020, 2012.
 J. Peters and S. Schaal. Applying the episodic natural actor-critic architecture to motor primitive learning. In Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN 2007), pages 1-6, 2007.
 Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008.
 Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks: The Official Journal of the International Neural Network Society, 21(4):682-697, May 2008. ISSN 0893-6080.
 W. B. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. Wiley-Blackwell, 2007.
 Martin Riedmiller, Jan Peters, and Stefan Schaal. Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 254-261. IEEE, April 2007. ISBN 1-4244-0706-0.
 T. Rückstiess, M. Felder, and J. Schmidhuber. State-dependent exploration for policy gradient methods. In 19th European Conference on Machine Learning (ECML), 2010.
 Thomas Rückstiess, Frank Sehnke, Tom Schaul, Daan Wierstra, Yi Sun, and Jürgen Schmidhuber. Exploring parameter space in reinforcement learning. Paladyn. Journal of Behavioral Robotics, 1:14-24, 2010. ISSN 2080-9778.
 J.C. Santamaría, R.S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive behavior, 6(2):163-217, 1997.
 H.-P. Schwefel. Evolutionsstrategie und numerische Optimierung. PhD thesis, TU Berlin, 1975.
 Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010.
 O. Sigaud and J. Peters. From motor learning to interaction learning in robots. In From Motor Learning to Interaction Learning in Robots, volume 264, pages 1-12. Springer-Verlag, 2010.
 Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML '12, pages 1679-1686, New York, NY, USA, July 2012. Omnipress. ISBN 978-1-4503-1285-1.
 Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
 Freek Stulp, Evangelos Theodorou, Mrinal Kalakrishnan, Peter Pastor, Ludovic Righetti, and Stefan Schaal. Learning motion primitive goals for robust manipulation. In International Conference on Intelligent Robots and Systems (IROS), 2011.
 Freek Stulp, Evangelos Theodorou, and Stefan Schaal. Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE Transactions on Robotics, 28(6):1360-1370, 2012. King-Sun Fu Best Paper Award of the IEEE Transactions on Robotics for the year 2012.
 R. Sutton and A. Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.
 Minija Tamosiunaite, Bojan Nemec, Ales Ude, and Florentin Wörgötter. Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives. Robotics and Autonomous Systems, 59(11):910-922, 2011.
 Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137-3181, 2010.
 Julian Togelius, Tom Schaul, Daan Wierstra, Christian Igel, Faustino Gomez, and Jürgen Schmidhuber. Ontogenetic and phylogenetic reinforcement learning. Zeitschrift Künstliche Intelligenz - Special Issue on Reinforcement Learning, pages 30-33, 2009.
 Ales Ude, Andrej Gams, Tamim Asfour, and Jun Morimoto. Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics, 26(5):800-815, 2010.
 S. Vijayakumar and S. Schaal. Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional spaces. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 288-293, 2000.
 Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In Proceedings of IEEE Congress on Evolutionary Computation (CEC), 2008.
 R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8: 229-256, 1992.
Paladyn. Journal of Behavioral Robotics is a fully peer-reviewed, open access journal that publishes original, high-quality research works and review articles on topics broadly related to neuronally and psychologically inspired robots and other behaving autonomous systems. The journal is indexed in SCOPUS.