Paladyn, Journal of Behavioral Robotics

Editor-in-Chief: Schöner, Gregor

Open Access

Exploring Parameter Space in Reinforcement Learning

Thomas Rückstieß / Frank Sehnke
  • Technische Universität München, Institut für Informatik VI, Boltzmannstr. 3, 85748 Garching, Germany
/ Tom Schaul
  • Dalle Molle Institute for Artificial Intelligence (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland
/ Daan Wierstra
  • Dalle Molle Institute for Artificial Intelligence (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland
/ Yi Sun
  • Dalle Molle Institute for Artificial Intelligence (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland
/ Jürgen Schmidhuber
  • Dalle Molle Institute for Artificial Intelligence (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland
Published Online: 2010-03-31 | DOI: https://doi.org/10.2478/s13230-010-0002-4


This paper discusses parameter-based exploration methods for reinforcement learning. Parameter-based methods perturb parameters of a general function approximator directly, rather than adding noise to the resulting actions. Parameter-based exploration unifies reinforcement learning and black-box optimization, and has several advantages over action perturbation. We review two recent parameter-exploring algorithms: Natural Evolution Strategies and Policy Gradients with Parameter-Based Exploration. Both outperform state-of-the-art algorithms in several complex high-dimensional tasks commonly found in robot control. Furthermore, we describe how a novel exploration method, State-Dependent Exploration, can modify existing algorithms to mimic exploration in parameter space.
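The difference between the two exploration styles can be illustrated with a minimal sketch. It assumes a hypothetical linear controller and Gaussian noise; the function names, the noise scale, and the toy states are illustrative choices and are not taken from the paper.

```python
import random

def linear_policy(theta, state):
    """Deterministic linear controller: action = theta . state (illustrative)."""
    return sum(t * s for t, s in zip(theta, state))

def rollout_action_noise(theta, states, sigma, rng):
    """Action-space exploration: fresh noise is added to every single action."""
    return [linear_policy(theta, s) + rng.gauss(0.0, sigma) for s in states]

def rollout_parameter_noise(theta, states, sigma, rng):
    """Parameter-space exploration: perturb theta once, then act deterministically."""
    perturbed = [t + rng.gauss(0.0, sigma) for t in theta]
    return perturbed, [linear_policy(perturbed, s) for s in states]

rng = random.Random(0)
theta = [0.5, -1.0]                              # assumed policy parameters
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]    # the first state is visited twice

noisy_actions = rollout_action_noise(theta, states, 0.1, rng)
perturbed_theta, param_actions = rollout_parameter_noise(theta, states, 0.1, rng)

print("action noise:   ", noisy_actions)
print("parameter noise:", param_actions)
```

In this sketch the perturbed controller returns the same action whenever it revisits a state within a rollout, while action perturbation gives a different action each time. Because the whole rollout is attributed to one sampled parameter vector, the return can be treated as a black-box function of that vector.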

Keywords: reinforcement learning; optimization; exploration; policy gradients


About the article

Received: 2010-02-20

Accepted: 2010-03-16

Published Online: 2010-03-31

Published in Print: 2010-03-01

Citation Information: Paladyn, Journal of Behavioral Robotics, Volume 1, Issue 1, Pages 14–24, ISSN (Online) 2081-4836, DOI: https://doi.org/10.2478/s13230-010-0002-4.

© Thomas Rückstieß et al. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License (CC BY-NC-ND 3.0).
