Reinforcement learning with Gaussian process regression using variational free energy

Abstract: The essential part of existing reinforcement learning algorithms that use Gaussian process regression involves a complicated online Gaussian process regression algorithm. Our study proposes online and mini-batch Gaussian process regression algorithms that are easier to implement and faster to estimate for reinforcement learning. In our algorithm, the Gaussian process regression updates the value function through the computation of only two equations, which we then use to construct reinforcement learning algorithms. Our numerical experiments show that the proposed algorithm works as well as those from previous studies.


Introduction
Reinforcement learning is learning how to behave so as to maximize reward. One of the most striking examples of its application is AlphaGo [1,2], which defeated a professional Go player. Various function approximations have been used in reinforcement learning algorithms, one of which is based on Gaussian process regression [3]. Gaussian process regression is a Bayesian nonparametric method often used as a standard nonlinear regression model; it has desirable properties such as a low risk of overfitting and the ability to express estimation uncertainty.
Reinforcement learning algorithms are often based on estimating a value function [4]. In classical reinforcement learning, the value function is represented in table form, i.e., as a matrix indexed by pairs of states and actions. However, as the sets of states and actions become larger, estimating the value function becomes harder. This problem is solved by representing the value function with a function approximation [4,5]. One function approximation used in reinforcement learning is Gaussian process regression, whose features are expected to be helpful here: its principal advantage is the high degree of model flexibility afforded by the kernel function, and through Bayesian learning, estimation uncertainty is expressed naturally.
However, due to its high computational cost, Gaussian process regression has not been suitable for reinforcement learning. In addition, the existing online algorithms in Gaussian process regression have been very complicated to implement. Thus, we need to develop a simple algorithm for online Gaussian process regression that is easier to implement for reinforcement learning.
Prior research on model-free reinforcement learning using Gaussian process regression includes state-action-reward-state-action (SARSA)-based methods and Q-learning-based methods. Several methods are based on SARSA, such as GP-SARSA [6,7] and iGP-SARSA [8]; however, these methods suffer from problems such as high computational costs. GPQ [9] is an algorithm based on Q-learning that uses the sparse online Gaussian process method [10]. These algorithms rely on complex methods to reduce the computational complexity of Gaussian process regression.
In this article, to overcome the above shortcomings, we propose a mini-batch-learnable variational free energy (VFE) method and a reinforcement learning algorithm based on the VFE method and Q-learning. The VFE method approximates the posterior distribution by variational inference. The offline VFE method [11] is widely used to reduce the computational complexity of Gaussian process regression.
The VFE method is one of the methods using inducing points and is expressed by a simple formula. The inducing points must be specified in advance to estimate with this method. Choosing the inducing points may be difficult for some environments, for example, higher-dimensional ones, but once they are chosen, estimation is straightforward. The computational complexity of our method is the same as that of the offline VFE method, $O(NM^2)$, i.e., linear in the number of data $N$. Mini-batch learning is more efficient than online learning in reinforcement learning.
Our main contributions are as follows:
- We extend the VFE method to allow online and mini-batch learning.
- We propose a reinforcement learning algorithm using the mini-batch-learnable VFE method.
In our experiments, we confirm that the proposed method learns just as well as the existing methods. We compare the proposed reinforcement learning algorithm with the algorithms from previous studies on a two-dimensional grid-world environment. Numerical results show that the proposed algorithm works as well as the current GPQ but is simpler and faster.
In Section 2, we explain the basics of reinforcement learning and Gaussian process regression in some detail. We then propose our reinforcement learning algorithm using the online Gaussian process regression with a mini-batch-learnable VFE method in Section 3. Section 4 shows the numerical results of experiments in a two-dimensional grid world. Finally, concluding remarks are given in Section 5.

Reinforcement learning
In reinforcement learning, the environment is often modeled as a Markov decision process (MDP) [12]. The goal is to find the action rule that maximizes the reward, i.e., to find $\pi^* = \arg\max_\pi E_\pi[R]$, where $E_\pi[\cdot]$ represents the expected value under the given policy $\pi$. To find $\pi^*$, we introduce a value function $Q^\pi(s, a) = E_\pi[R \mid s_0 = s, a_0 = a]$. Since $\pi^*$ can be obtained from the optimal value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, finding $\pi^*$ reduces to finding $Q^*$. Q-learning [13] is often used as a method to find $Q^*$. If the value function is in table form, it is updated by
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor. In the case of function approximation, the value of $Q(s_t, a_t)$ is updated to be closer to the target on the right-hand side.
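The tabular update above can be sketched in a few lines (a minimal NumPy illustration; the state/action encoding and the toy numbers are our own, not from the experiments):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage: 2 states, 2 actions, all values initially zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.99)
# Q[0, 1] is now 0.5 * (1.0 + 0.99 * 0 - 0) = 0.5
```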

Gaussian process regression
Gaussian process regression [3] is regarded as a Bayesian nonparametric regression method in which we estimate a function $f$. Let $D = \{(x_n, y_n)\}_{n=1}^N$ be the data of $N$ pairs of input and output variables. Assume that the output variable is accompanied by normally distributed noise, so that the model is $y_n = f(x_n) + \varepsilon_n$ with $\varepsilon_n \sim N(0, \sigma^2)$. The goal is to construct the predictive distribution of the output $y^*$ for a new input $x^*$. The prior over $f$ is a Gaussian process with covariance $k(\cdot, \cdot)$, a kernel function; in this article, we use the radial basis function (RBF) kernel. The predictive distribution is obtained from the formula for the posterior distribution of the normal distribution:
$$p(y^* \mid D) = N\!\left(k_*^\top (K_N + \sigma^2 I)^{-1} y,\; k(x^*, x^*) + \sigma^2 - k_*^\top (K_N + \sigma^2 I)^{-1} k_*\right),$$
where $K_N$ is the $N \times N$ Gram matrix with entries $k(x_i, x_j)$ and $k_* = (k(x^*, x_1), \ldots, k(x^*, x_N))^\top$. Computing the predictive distribution requires inverting an $N \times N$ matrix, at a cost of $O(N^3)$. One way to reduce the computational cost is to use inducing points. The partially independent training conditional [14], the fully independent training conditional [15], and the VFE [11] are well-known methods that use inducing points. In methods using inducing points, the computational cost is $O(NM^2)$, where $M$ is the number of inducing points and, in general, $N > M$. Below, we introduce the VFE method and explain it in detail.
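The exact predictive distribution above can be sketched directly (a minimal NumPy illustration; the helper names `rbf` and `gp_predict`, the length-scale `ell`, and the toy data are our own assumptions):

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    """RBF kernel matrix: k(x, x') = exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def gp_predict(X, y, Xs, sigma2=0.1, ell=1.0):
    """Exact GP predictive mean/variance; the (K + sigma^2 I) solve is the O(N^3) step."""
    K = rbf(X, X, ell) + sigma2 * np.eye(len(X))
    Ks = rbf(Xs, X, ell)                      # k_* between test and training inputs
    mean = Ks @ np.linalg.solve(K, y)
    var = (rbf(Xs, Xs, ell).diagonal() + sigma2
           - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T))
    return mean, var

# Toy usage: with small noise, the predictive mean passes close to the targets.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X[:, 0])
mean, var = gp_predict(X, y, X, sigma2=1e-4)
```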
Let $Z = \{z_m\}_{m=1}^M$ be the set of inducing points. The model for Gaussian process regression using inducing points augments $f$ with the inducing outputs $u = f(Z)$, and the VFE method approximates the posterior distribution over $u$ by variational inference; for details, see [11]. The predictive distribution is then obtained from the variational mean $\mu_N$ and covariance $\Sigma_N$, given by formulas (5)-(9). In the VFE method, we can estimate the predictive distribution by computing these equations. If we used all data each time for mini-batch learning, the computational cost would be high. Therefore, in the next section, we extend the method so that it can be estimated without changing the computational cost.
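The offline VFE predictive mean can be sketched under the standard parameterization of [11] (the variable names `Sigma`, the jitter term, and the toy data are our own; the exact notation of formulas (5)-(9) is not reproduced here):

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def vfe_predict(X, y, Z, Xs, sigma2=0.1, ell=1.0, jitter=1e-6):
    """Offline VFE predictive mean via inducing points Z; cost O(N M^2), not O(N^3)."""
    Kmm = rbf(Z, Z, ell) + jitter * np.eye(len(Z))
    Kmn = rbf(Z, X, ell)
    # Sigma = K_MM + sigma^{-2} K_MN K_NM  (the quantity updated incrementally later)
    Sigma = Kmm + Kmn @ Kmn.T / sigma2
    Ksm = rbf(Xs, Z, ell)
    # Predictive mean: sigma^{-2} k_*M Sigma^{-1} K_MN y
    return Ksm @ np.linalg.solve(Sigma, Kmn @ y) / sigma2

# Toy usage: 10 inducing points on a grid approximate the full GP fit of sin(x).
X = np.linspace(0, 4, 50)[:, None]
y = np.sin(X[:, 0])
Z = np.linspace(0, 4, 10)[:, None]
mean = vfe_predict(X, y, Z, X, sigma2=1e-2)
```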

Related work
Here, we introduce previous research on reinforcement learning using Gaussian process regression. Reinforcement learning algorithms can be divided into two categories: model-based and model-free. In model-based reinforcement learning algorithms, Gaussian process regression can be used not only for the value function but also for the environment model, as in [16][17][18]. Since the model of the environment is also estimated, the computational cost is higher, but the model of the environment does not need to be given in advance.
Next, we discuss model-free reinforcement learning algorithms. Reinforcement learning algorithms can be categorized into two types: on-policy learning and off-policy learning. SARSA [4] is a typical algorithm for on-policy learning. GP-SARSA [6,7] is an algorithm based on SARSA that learns a value function using a Gaussian process. iGP-SARSA [8] is a method that improves the exploration of GP-SARSA. On the other hand, Q-learning is a typical off-policy learning algorithm. GPQ [9] is a learning method based on Q-learning in which the value function is represented by a Gaussian process.
Next, we describe online learning for Gaussian process regression. GPQ uses the sparse online Gaussian process method [10] to construct its algorithm. While the Gaussian process regression algorithm used by these methods is complex, this article proposes an algorithm that can be updated with only two formulas. The difference between these Gaussian process regression methods and the proposed method is the use of inducing points. Offline learning methods for Gaussian process regression with inducing points include VFE [11], the fully independent training conditional (FITC) [15], and the partially independent training conditional (PITC) [14]. However, reinforcement learning algorithms require online or mini-batch training. Online learning methods have been proposed for FITC and PITC without changing the computational cost [19].
In this article, we use the VFE, which is widely used and can be computed with a simple formula. We also propose mini-batch learning, which has not been proposed in previous studies. The relationship between the previous studies and the proposed method is summarized in Table 1. We construct a Q-learning algorithm using our proposed mini-batch VFE. The difference from previous studies is the use of our proposed mini-batch VFE; the overall structure is the same as that of Q-learning.

Reinforcement learning with Gaussian process regression using VFE
We extend the VFE formulas (5)-(9) to be updatable online, using the methods of previous studies [19] as a guide. We first rewrite the covariance $\Sigma_N$ for $N$ pairs of data in a form suitable for online updating. Let $x_+$ and $y_+$ be the new input and output data. The covariance $\Sigma_{N+1}$ for $N+1$ pairs of data can then be obtained from $\Sigma_N$ and the vector of kernel values $k(z_m, x_+)$ between the inducing points and the new input, i.e., by a rank-one update. Using this formula, $\Sigma$ can be updated online. Then, $\mu$ can also be transformed in the same way: $\mu_{N+1}$ is obtained from $\Sigma_N$, $\Sigma_{N+1}$, and $\mu_N$. This update method allows us to learn with computational cost $O(NM^2)$, the same as offline learning. Furthermore, this method requires less memory because it does not store the input data but only keeps $\Sigma_N$ and $\mu_N$.
Next, we extend this online VFE learning to allow mini-batch learning, i.e., the case in which more than one piece of data arrives at each update. Let $D$ be the set of all data up to the present, and let $D_+$ be the set of data to be updated. The covariance for the set of data $D \cup D_+$ is denoted by $\Sigma_{N+}$. Since the kernel matrix $K_{M+}$ between the inducing points and the new inputs is now a matrix rather than a vector, the inverse of the sum of matrices takes a different form, but $\Sigma_{N+}$ can still be obtained directly from $\Sigma_N$. We transform $\mu_{N+}$ in the same way as in online learning. By using (19) and (20), we can directly obtain $\Sigma_{N+}$ and $\mu_{N+}$ for the set of data $D \cup D_+$ from $\Sigma_N$ and $\mu_N$ for the set of data $D$. Furthermore, the number of elements in $D_+$ does not have to be fixed during the learning process. Both online learning and mini-batch learning have the same computational complexity as, and return the same estimates as, the offline VFE formulas (5)-(9).
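Because the VFE statistics are sums over data points, incremental and offline computation agree exactly. The following sketch illustrates this under the standard VFE parameterization (the names `Sigma` and `b`, the jitter term, and the toy data are our own assumptions, not the paper's exact notation for (19) and (20)):

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

# Offline statistics for N points:
#   Sigma_N = K_MM + sigma^{-2} K_MN K_NM,   b_N = K_MN y.
# A mini-batch (X_B, y_B) only *adds* terms, so
#   Sigma_{N+} = Sigma_N + sigma^{-2} K_MB K_BM,   b_{N+} = b_N + K_MB y_B,
# at cost O(|B| M^2) per batch, with the same result as recomputing offline.
sigma2, ell = 0.1, 1.0
Z = np.linspace(0, 4, 8)[:, None]
Kmm = rbf(Z, Z, ell) + 1e-8 * np.eye(len(Z))

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(40, 1))
y = np.sin(X[:, 0])

# Incremental accumulation over mini-batches of 8.
Sigma, b = Kmm.copy(), np.zeros(len(Z))
for i in range(0, len(X), 8):
    Kmb = rbf(Z, X[i:i + 8], ell)
    Sigma += Kmb @ Kmb.T / sigma2
    b += Kmb @ y[i:i + 8]

# Offline computation over all data at once.
Kmn = rbf(Z, X, ell)
Sigma_off = Kmm + Kmn @ Kmn.T / sigma2
b_off = Kmn @ y
# Sigma and b match Sigma_off and b_off up to floating-point rounding.
```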
We propose an algorithm for Q-learning using this mini-batch-learnable VFE method. An algorithm for learning a GP from the supervised data obtained by Q-learning has been proposed in GPQ [9]. While GPQ is based on the sparse online Gaussian process regression algorithm [10], we use the formulas proposed above. The proposed algorithm is presented as Algorithm 1. Lines 2-6 of Algorithm 1 are the same as in Q-learning, and equations (19) and (20) of the proposed method are used in the updating part of the value function in lines 7-10.

Algorithm 1: Q-learning with the mini-batch-learnable VFE method
1: For the first data, use the offline VFE formulas (5)-(9)
2: for each time step t do
3:   Choose a_t from s_t using ε-greedy exploration
4:   Take action a_t, observe r_{t+1} and s_{t+1}
5:   Compute the target y_t = r_{t+1} + γ max_a Q(s_{t+1}, a)
6:   Set x_t = (s_t, a_t)
7:   Add (y_t, x_t) to D_+
8:   if |D_+| = N_batch then
9:     Compute μ_{N+} and Σ_{N+} according to (20) and (19)
10:    Reset D_+
11:  end if
12: end for

To learn with the mini-batch-learnable VFE method, a set of inducing points must be given. The inducing points should be evenly distributed across the product set of states and actions. Kernel functions can be selected according to the environment. In our algorithm, the supervised data are generated in the same way as in normal Q-learning. The generated supervised data are then used to learn the value function by Gaussian process regression. The batch size should be chosen depending on the environment. Empirically, a larger batch size is more efficient in a complex environment: the larger the batch size, the faster learning proceeds per datum. However, note that too large a batch size may slow the convergence of the proposed algorithm. The computational cost of Algorithm 1 is $O(NM^2)$. Since the proposed algorithm computes only two formulas in the update, we expect it to be faster than the GPQ method.
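The control flow of Algorithm 1 can be sketched as follows (a skeleton only: `env_step`, `q_value`, and `update` are hypothetical stand-ins for the environment, the GP value estimate, and the (19)-(20) update, and the toy numbers are our own):

```python
import random

def run_with_buffering(env_step, q_value, actions, n_batch, gamma, eps, steps, update):
    """Skeleton of Algorithm 1: epsilon-greedy acting, buffering supervised pairs
    (x_t, y_t), and firing a mini-batch update each time |D_+| reaches n_batch."""
    s = (0, 0)
    buffer = []                                       # the set D_+
    for t in range(steps):
        if random.random() < eps:                     # epsilon-greedy exploration
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a2: q_value(s, a2))
        r, s_next = env_step(s, a)
        target = r + gamma * max(q_value(s_next, a2) for a2 in actions)
        buffer.append(((s, a), target))               # supervised pair (x_t, y_t)
        if len(buffer) == n_batch:
            update(buffer)                            # stands in for (19) and (20)
            buffer = []                               # reset D_+
        s = s_next
    return buffer

# Toy usage with stub dynamics and a zero value function.
updates = []
leftover = run_with_buffering(
    env_step=lambda s, a: (0.0, s),
    q_value=lambda s, a: 0.0,
    actions=[0, 1, 2, 3],
    n_batch=4, gamma=0.99, eps=1.0, steps=10,
    update=updates.append,
)
# 10 steps with batch size 4: two updates fire, two pairs remain buffered.
```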

Experiments
In this section, we present several experiments showing that the proposed algorithm can perform as well as existing algorithms. We use a two-dimensional grid world [9] for our experiments. The state of this environment is represented by a $5 \times 5$ grid. Agents start at $(1, 1)$ and can move in four directions: up, down, left, and right. The transition is noisy: with probability 0.1, the agent remains in its current state. The agent obtains reward 1 in state $(5, 5)$. This environment has been used in previous experiments [9]. We use the GPQ method and tabular Q-learning for comparison.
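The environment dynamics can be sketched as follows (the boundary clipping and the convention that the reward is received on entering $(5, 5)$ are our assumptions, as the source does not spell them out):

```python
import random

def grid_step(s, a, rng=random):
    """One transition of the 5x5 grid world: with probability 0.1 the agent
    stays put; reward 1 is obtained in state (5, 5)."""
    moves = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    if rng.random() < 0.1:                 # noisy transition: remain in place
        s_next = s
    else:
        dx, dy = moves[a]
        # Clip to the grid so the agent cannot leave the 5x5 board (assumption).
        s_next = (min(max(s[0] + dx, 1), 5), min(max(s[1] + dy, 1), 5))
    r = 1.0 if s_next == (5, 5) else 0.0
    return r, s_next

# Toy usage with a seeded generator: moving right from (4, 5) reaches the goal.
r, s = grid_step((4, 5), 'right', rng=random.Random(1))
```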
In this experiment, $\gamma$ is set to 0.99. For all methods, we use the ε-greedy algorithm with exploration rate $1/t^{0.3}$. The learning rate used in Q-learning is set to $0.1/t^{0.5}$. We use the same RBF kernel with variance $\sigma^2 = 1$ for both methods. We set the GPQ parameter $\beta_{tol}$ to 0.75 and the kernel budget to 25. The batch size $N_{batch}$ is set to 16. The same estimation can be done with $N_{batch} = 1$, i.e., with online learning, but the calculation time increases. The number of inducing points in the proposed method is 36, arranged in a grid pattern in this experiment. This is the best combination of parameter settings for each algorithm in our experiments.
We perform ten independent experiments for each method. The average number of steps required to reach the goal is shown in Figure 1. We can see that each algorithm learns the optimal behavior. Figure 2 shows that the resulting value estimates are similar for the proposed method and GPQ. Numerically, there is no significant difference in the speed of convergence, and the algorithms perform similarly. The proposed algorithm computes only two equations without branches, and this experiment shows that for the same data size, our method terminates more than twice as fast as the GPQ method. The computational cost of the online Gaussian process regression algorithm used by GPQ is $O(N)$. Since the computational cost of both methods is $O(N)$ in the number of data, the difference in execution time depends on the size of the hyperparameters and the number of formulas to be calculated.
A limitation of the proposed method comes from the selection of inducing points, which must be given before learning. In a complex problem, the performance of our algorithm depends on the set of inducing points. Increasing the number of inducing points widens the estimated range of states and actions, but lengthens the time required for learning. If the range of states and actions is not large, this problem can be mitigated by rescaling the data.

Conclusion
In reinforcement learning, algorithms have been proposed in which the value function is represented by Gaussian process regression. Gaussian process regression is expected to be advantageous because of its high expressivity through kernel functions and its Bayesian treatment of uncertainty. However, the algorithms proposed in previous studies use complex online Gaussian process regression methods. We proposed an online and mini-batch Gaussian process regression method with VFE and inducing points that makes learning with Gaussian process regression easier. This method requires only the computation of two equations. We then constructed a Q-learning algorithm using these two equations. Our experiments show that the algorithm learns as well as those from previous studies.
The advantage of our algorithm is that it is easy to implement while expressing the value function in Gaussian process regression. In addition, our algorithm uses mini-batch learning, and experiments show that it can be estimated more efficiently than online learning.
Our proposed algorithm has the limitation that inducing points must be given before learning. In the case of an environment where inducing points are difficult to give, for example, when the behavior or state is high-dimensional, it becomes difficult to use our proposed algorithm.
In future work, we will improve the proposed method by taking fuller advantage of Gaussian process regression. Although we are able to learn with Gaussian process regression, we have not yet exploited it fully. Exploration in reinforcement learning is important for gathering useful data to update the value function, and we believe the ability of the Gaussian process to express uncertainty in the value function can be used for exploration. In addition, the choice of inducing points affects the estimation; how inducing points should be selected in reinforcement learning algorithms will also be investigated in a future study. Finally, our experiments used a classical reinforcement learning environment from previous studies; experiments in other environments are left for future research.