
Journal of Artificial General Intelligence

The Journal of the Artificial General Intelligence Society

3 Issues per year

Open Access

Feature Reinforcement Learning: Part I. Unstructured MDPs

Marcus Hutter
Published Online: 2011-11-23 | DOI: https://doi.org/10.2478/v10229-011-0002-8

General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, has been an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II (Hutter, 2009c). The role of POMDPs is also considered there.

Keywords: Reinforcement learning; Markov decision process; partial observability; feature learning; explore-exploit; information & complexity; rational agents
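The criterion sketched in the abstract scores a candidate feature map Φ (from observation histories to MDP states) by how compactly the induced state and reward sequences can be coded: the best reduction is the map minimizing this code length. The following Python toy is only a rough illustration of that idea, not the paper's exact Cost criterion; the functions `phi_cost` and `code_length`, the frequency-plus-parameter code lengths, and the example maps are all simplifications invented here.

```python
import math
from collections import Counter, defaultdict

def code_length(symbols_by_context):
    """Approximate MDL code length in bits: the empirical entropy of the
    symbols observed in each context, plus (K-1)/2 * log2(n) bits per
    context for the K-1 free frequency parameters."""
    bits = 0.0
    for symbols in symbols_by_context.values():
        n = len(symbols)
        counts = Counter(symbols)
        bits += sum(c * math.log2(n / c) for c in counts.values())
        if n > 1:
            bits += (len(counts) - 1) / 2 * math.log2(n)
    return bits

def phi_cost(history, phi):
    """Cost(phi | history) ~ CL(states | actions) + CL(rewards | states, actions).
    history is a list of (observation, action, reward) triples; phi maps an
    observation prefix (a tuple) to a state label."""
    obs = [o for o, _, _ in history]
    states = [phi(tuple(obs[:t + 1])) for t in range(len(history))]
    transitions = defaultdict(list)  # next state, keyed by (state, action)
    rewards = defaultdict(list)      # reward, keyed by (state, action)
    for t, (_, a, r) in enumerate(history):
        rewards[(states[t], a)].append(r)
        if t + 1 < len(history):
            transitions[(states[t], a)].append(states[t + 1])
    return code_length(transitions) + code_length(rewards)

# Toy check: observations alternate 0,1 and the reward equals the last
# observation. A map that remembers the last observation makes states and
# rewards deterministic, so it should code more cheaply than lumping the
# whole history into a single aggregate state.
history = [(t % 2, 0, t % 2) for t in range(20)]
phi_last = lambda prefix: prefix[-1]   # state = last observation
phi_const = lambda prefix: 0           # state = one aggregated state
print(phi_cost(history, phi_last) < phi_cost(history, phi_const))  # True
```

Under the last-observation map both the transition and reward streams are deterministic within each (state, action) context, so their empirical code length is zero, while the aggregated map must pay for the now-random-looking rewards; this is the intuition behind preferring feature maps that trade model size against predictiveness.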

  • Aarts, E. H. L., and Lenstra, J. K., eds. 1997. Local Search in Combinatorial Optimization. Discrete Mathematics and Optimization. Chichester, England: Wiley-Interscience.

  • Banzhaf, W.; Nordin, P.; Keller, R.; and Francone, F. 1998. Genetic Programming. San Francisco, CA, U.S.A.: Morgan Kaufmann.

  • Barron, A. R. 1985. Logically Smooth Density Estimation. Ph.D. Dissertation, Stanford University.

  • Berry, D. A., and Fristedt, B. 1985. Bandit Problems: Sequential Allocation of Experiments. London: Chapman and Hall.

  • Brafman, R. I., and Tennenholtz, M. 2002. R-max - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3:213-231.

  • Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. Wiley-Interscience, 2nd edition.

  • Dearden, R.; Friedman, N.; and Andre, D. 1999. Model based Bayesian Exploration. In Proc. 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), 150-159.

  • Duff, M. 2002. Optimal Learning: Computational Procedures for Bayes-adaptive Markov Decision Processes. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts Amherst.

  • Dzeroski, S.; de Raedt, L.; and Driessens, K. 2001. Relational Reinforcement Learning. Machine Learning 43:7-52.

  • Fishman, G. 2003. Monte Carlo. Springer.

  • Givan, R.; Dean, T.; and Greig, M. 2003. Equivalence Notions and Model Minimization in Markov Decision Processes. Artificial Intelligence 147(1-2):163-223.

  • Goertzel, B., and Pennachin, C., eds. 2007. Artificial General Intelligence. Springer.

  • Gordon, G. 1999. Approximate Solutions to Markov Decision Processes. Ph.D. Dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

  • Grünwald, P. D. 2007. The Minimum Description Length Principle. Cambridge: The MIT Press.

  • Guyon, I., and Elisseeff, A., eds. 2003. Variable and Feature Selection. JMLR Special Issue: MIT Press.

  • Hastie, T.; Tibshirani, R.; and Friedman, J. H. 2001. The Elements of Statistical Learning. Springer.

  • Hutter, M. 2005. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin: Springer. http://www.hutter1.net/ai/uaibook.htm

  • Hutter, M. 2007. Universal Algorithmic Intelligence: A Mathematical Top-Down Approach. In Artificial General Intelligence. Berlin: Springer. 227-290.

  • Hutter, M. 2009a. Feature Dynamic Bayesian Networks. In Proc. 2nd Conf. on Artificial General Intelligence (AGI'09), volume 8, 67-73. Atlantis Press.

  • Hutter, M. 2009b. Feature Markov Decision Processes. In Proc. 2nd Conf. on Artificial General Intelligence (AGI'09), volume 8, 61-66. Atlantis Press.

  • Hutter, M. 2009c. Feature Reinforcement Learning: Part II: Structured MDPs. In progress. Will extend Hutter (2009a).

  • Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence 101:99-134.

  • Kearns, M. J., and Singh, S. 1998. Near-Optimal Reinforcement Learning in Polynomial Time. In Proc. 15th International Conf. on Machine Learning, 260-268. Morgan Kaufmann, San Francisco, CA.

  • Koza, J. R. 1992. Genetic Programming. The MIT Press.

  • Kumar, P. R., and Varaiya, P. P. 1986. Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice Hall.

  • Legg, S., and Hutter, M. 2007. Universal Intelligence: A Definition of Machine Intelligence. Minds & Machines 17(4):391-444.

  • Legg, S. 2008. Machine Super Intelligence. Ph.D. Dissertation, IDSIA, Lugano.

  • Li, M., and Vitányi, P. M. B. 2008. An Introduction to Kolmogorov Complexity and Its Applications. Berlin: Springer, 3rd edition.

  • Liang, P., and Jordan, M. 2008. An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. In Proc. 25th International Conf. on Machine Learning (ICML'08), volume 307, 584-591. ACM.

  • Liu, J. S. 2002. Monte Carlo Strategies in Scientific Computing. Springer.

  • Lusena, C.; Goldsmith, J.; and Mundhenk, M. 2001. Nonapproximability Results for Partially Observable Markov Decision Processes. Journal of Artificial Intelligence Research 14:83-103.

  • MacKay, D. J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge: Cambridge University Press.

  • Madani, O.; Hanks, S.; and Condon, A. 2003. On the Undecidability of Probabilistic Planning and Related Stochastic Optimization Problems. Artificial Intelligence 147:5-34.

  • McCallum, A. K. 1996. Reinforcement Learning with Selective Perception and Hidden State. Ph.D. Dissertation, Department of Computer Science, University of Rochester.

  • Ng, A. Y.; Coates, A.; Diel, M.; Ganapathi, V.; Schulte, J.; Tse, B.; Berger, E.; and Liang, E. 2004. Autonomous Inverted Helicopter Flight via Reinforcement Learning. In ISER, volume 21 of Springer Tracts in Advanced Robotics, 363-372. Springer.

  • Pankov, S. 2008. A Computational Approximation to the AIXI Model. In Proc. 1st Conference on Artificial General Intelligence, volume 171, 256-267.

  • Pearlmutter, B. A. 1989. Learning State Space Trajectories in Recurrent Neural Networks. Neural Computation 1(2):263-269.

  • Poland, J., and Hutter, M. 2006. Universal Learning of Repeated Matrix Games. In Proc. 15th Annual Machine Learning Conf. of Belgium and The Netherlands (Benelearn'06), 7-14.

  • Poupart, P.; Vlassis, N. A.; Hoey, J.; and Regan, K. 2006. An Analytic Solution to Discrete Bayesian Reinforcement Learning. In Proc. 23rd International Conf. on Machine Learning (ICML'06), volume 148, 697-704. Pittsburgh, PA: ACM.

  • Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY: Wiley.

  • Raedt, L. D.; Hammer, B.; Hitzler, P.; and Maass, W., eds. 2008. Recurrent Neural Networks - Models, Capacities, and Applications, volume 08041 of Dagstuhl Seminar Proceedings. IBFI, Schloss Dagstuhl, Germany.

  • Ring, M. 1994. Continual Learning in Reinforcement Environments. Ph.D. Dissertation, University of Texas, Austin.

  • Ross, S., and Pineau, J. 2008. Model-Based Bayesian Reinforcement Learning in Large Structured Domains. In Proc. 24th Conference on Uncertainty in Artificial Intelligence (UAI'08), 476-483. Helsinki: AUAI Press.

  • Ross, S.; Pineau, J.; Paquet, S.; and Chaib-draa, B. 2008. Online Planning Algorithms for POMDPs. Journal of Artificial Intelligence Research 32:663-704.

  • Russell, S. J., and Norvig, P. 2003. Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall, 2nd edition.

  • Sanner, S., and Boutilier, C. 2009. Practical Solution Techniques for First-Order MDPs. Artificial Intelligence 173(5-6):748-788.

  • Schmidhuber, J. 2004. Optimal Ordered Problem Solver. Machine Learning 54(3):211-254.

  • Schwarz, G. 1978. Estimating the Dimension of a Model. Annals of Statistics 6(2):461-464.

  • Singh, S.; Littman, M.; Jong, N.; Pardoe, D.; and Stone, P. 2003. Learning Predictive State Representations. In Proc. 20th International Conference on Machine Learning (ICML'03), 712-719.

  • Strehl, A. L.; Diuk, C.; and Littman, M. L. 2007. Efficient Structure Learning in Factored-State MDPs. In Proc. 22nd AAAI Conference on Artificial Intelligence, 645-650. Vancouver, BC: AAAI Press.

  • Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

  • Szita, I., and Lőrincz, A. 2008. The Many Faces of Optimism: A Unifying Approach. In Proc. 25th International Conference on Machine Learning (ICML'08), volume 307.

  • Wallace, C. S. 2005. Statistical and Inductive Inference by Minimum Message Length. Berlin: Springer.

  • Willems, F. M. J.; Shtarkov, Y. M.; and Tjalkens, T. J. 1997. Reflections on the Prize Paper: The Context-Tree Weighting Method: Basic Properties. IEEE Information Theory Society Newsletter, 20-27.

  • Wolpert, D. H., and Macready, W. G. 1997. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1):67-82.

About the article

Published Online: 2011-11-23

Published in Print: 2009-12-01

Citation Information: Journal of Artificial General Intelligence, Volume 1, Issue 1, Pages 3–24, ISSN (Online) 1946-0163, DOI: https://doi.org/10.2478/v10229-011-0002-8.

This content is open access.

Citing Articles

  • Alexey A. Melnikov, Adi Makmal, Vedran Dunjko, and Hans J. Briegel. Scientific Reports, 2017, Volume 7, Number 1

  • Nazanin Mohammadi Sepahvand, Elisabeth Stöttinger, James Danckert, Britt Anderson, and Joy J. Geng. PLoS ONE, 2014, Volume 9, Number 4, Page e94308

  • Zuo-wei Wang. Computational Intelligence and Neuroscience, 2016, Volume 2016, Page 1

  • Ronald Ortner. Minds and Machines, 2016, Volume 26, Number 3, Page 243

  • Rico Jonschkowski and Oliver Brock. Autonomous Robots, 2015, Volume 39, Number 3, Page 407

  • Bill Hibbard. Journal of Artificial General Intelligence, 2012, Volume 3, Number 1, Page 1

  • Hanns Sommer and Lothar Schreiber. Journal of Artificial General Intelligence, 2012, Volume 3, Number 1
