
Exploring Parameter Space in Reinforcement Learning

  • Thomas Rückstieß, Frank Sehnke, Tom Schaul, Daan Wierstra, Yi Sun and Jürgen Schmidhuber

Abstract

This paper discusses parameter-based exploration methods for reinforcement learning. Parameter-based methods perturb parameters of a general function approximator directly, rather than adding noise to the resulting actions. Parameter-based exploration unifies reinforcement learning and black-box optimization, and has several advantages over action perturbation. We review two recent parameter-exploring algorithms: Natural Evolution Strategies and Policy Gradients with Parameter-Based Exploration. Both outperform state-of-the-art algorithms in several complex high-dimensional tasks commonly found in robot control. Furthermore, we describe how a novel exploration method, State-Dependent Exploration, can modify existing algorithms to mimic exploration in parameter space.
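The difference between the two kinds of exploration the abstract contrasts can be illustrated with a short sketch. The Python snippet below is a minimal illustration, not code from the paper: the linear policy, the toy state, the noise scales, and all variable names are our own assumptions. It shows action perturbation adding noise to the controller's output at every step, while parameter perturbation (in the spirit of PGPE and NES) draws one perturbed parameter vector and then acts deterministically with it.

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, action_dim = 4, 2
theta = rng.normal(size=(action_dim, state_dim))  # parameters of a linear policy
state = rng.normal(size=state_dim)                # an arbitrary toy state

def policy(params, s):
    """Deterministic linear controller: a = params @ s."""
    return params @ s

# Action perturbation: add independent noise to the resulting action,
# typically at every time step.
action_noise = rng.normal(scale=0.1, size=action_dim)
a_action_pert = policy(theta, state) + action_noise

# Parameter perturbation: perturb the controller's parameters directly
# (e.g. once per episode) and then act deterministically with them.
param_noise = rng.normal(scale=0.1, size=theta.shape)
a_param_pert = policy(theta + param_noise, state)

print(a_action_pert, a_param_pert)
```

Because the perturbed parameters can be held fixed for a whole rollout, the resulting behaviour is consistent across an episode rather than jittering from step to step, which is one way parameter-space exploration differs in practice from per-step action noise.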


Received: 2010-2-20
Accepted: 2010-3-16
Published Online: 2010-3-31
Published in Print: 2010-3-1

© Thomas Rückstieß et al.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
