Transfer learning for photonic delay-based reservoir computing to compensate parameter drift

Abstract: Photonic reservoir computing has been demonstrated to be able to solve various complex problems. Although training a reservoir computing system is much simpler compared to other neural network approaches, it still requires considerable amounts of resources, which becomes an issue when retraining is required. Transfer learning is a technique that allows us to reuse information between tasks, thereby reducing the cost of retraining. We propose transfer learning as a viable technique to compensate for the unavoidable parameter drift in experimental setups. Solving this parameter drift usually requires retraining the system, which is very time and energy consuming. Based on numerical studies on a delay-based reservoir computing system with semiconductor lasers, we investigate the use of transfer learning to mitigate these parameter fluctuations. Additionally, we demonstrate that transfer learning applied to two slightly different tasks allows us to reduce the number of input samples required for training of the second task, thus reducing the amount of retraining.


Introduction
With the tremendously fast growth of the amount of information in our digital age, we are becoming increasingly reliant on machine learning to analyze this large quantity of information [1,2]. Currently, this is typically performed on digital hardware. However, due to the breakdown of Moore's law [3], there is considerable interest in resorting to analog systems to perform the computations required for machine learning. One example of such a machine learning strategy, suited for analog machines, is reservoir computing (RC). Reservoir computing systems were originally based on recurrent neural networks and consist of a large number of nodes with random but fixed interconnections. An RC system is generally divided into three separate layers: an input layer, a reservoir layer and an output layer. The input layer is used to inject the data into the reservoir. In the reservoir layer, the data is processed in a complex, non-linear dynamical system and sent through to the output layer. In this output layer, linear weights are used to calculate the reservoir's output. These weights are optimized during the training phase. Note that the internal weights of the reservoir itself are not optimized and remain constant during training. This makes RC systems very simple to train and very time and energy efficient. RC systems have so far been successfully applied to several tasks, including non-linear channel equalization [4], time-series prediction [5][6][7] and speech recognition [8][9][10]. In this work, we implement a reservoir computing system using opto-electronic components. These opto-electronic systems offer several advantages, including fast information processing rates combined with low energy consumption [11,12]. However, a problem with the physical implementations of photonic reservoir computing systems is their parameter drift during operation.
For example, varying room temperatures lead to temperature-induced internal length differences. This will lead to a difference in the feedback phase and ultimately results in a loss of performance, which we want to avoid as much as possible. Several publications have investigated the effects of these influences, including possible techniques to counteract this. This can for example be done by actively adapting the length changes of a coherent linear Fabry-Perot resonator using a control loop [13], reducing the noise sensitivity of the RC performance by improving the pre-processing of data in the input layer [14] or by expanding the output layer to improve performance [15]. However, further improvements are possible, which is why there is considerable interest in developing other approaches. In this paper, we numerically explore the use of a novel learning paradigm introduced in [16], referred to as transfer learning, applied to photonic reservoir computing. This technique builds upon the conventional training method and tries to enhance it by reusing information gained from previous training procedures and applying this information when training on different, but still similar problems. It offers the advantage of being able to find weights with a minimal amount of required retraining for the new problem. In this numerical study, we apply this transfer learning technique to photonic delay-based RC systems with semiconductor lasers. This paper is organised as follows. Section 2 gives a short introduction on delay-based reservoir computing and discusses the numerical model that we use to implement this RC system. In Section 3, we introduce and discuss two training methods: conventional training and transfer learning. In Section 4, we compare the RC performance when using transfer learning and conventional training. 
In Section 5, we investigate whether transfer learning can be used to mitigate the influence of small changes in the feedback phase of the RC system on the performance of the RC itself. In Section 6, we investigate the use of transfer learning when multiple parameters are varied, namely the injection rate, the feedback rate and the excess pump current. Section 7 concludes the paper.

Numerical implementation of delay-based reservoir computing using semiconductor lasers
In this paper, we use a reservoir computing system based on semiconductor lasers (SL) with delayed feedback [17]. Various types of photonic or electronic reservoir computing systems have already been realized using this delay-based technique [17][18][19][20]. In Figure 1, we show the structure of our RC system. It is composed of three layers: the input layer, the reservoir layer and the output layer. The input layer consists of a semiconductor laser coupled with an unbalanced Mach-Zehnder modulator which optically injects input samples into the reservoir layer. Before the data is injected into the reservoir, the input samples are first encoded via a mask m(t). The reservoir layer consists of a single-mode semiconductor laser (SL) with delayed optical feedback with a delay length τ. The time trace of the light intensity emitted by the SL is measured by a photodetector after the reservoir layer. This time trace is then sampled and used in the output layer, where the weights are calculated. This procedure is explained in further detail in Sections 3.1 and 3.2 [11,21]. The numerical simulations for our delay-based RC system are based on the following rate equations [22]:

dE(t)/dt = (1 + iα)/2 [ξN(t) − g] E(t) + κ e^{iφ_FB} E(t − τ) + η E_inj(t) + F(t),   (1)

dN(t)/dt = ΔI/e − N(t)/τ_c − [ξN(t) + g] |E(t)|²,   (2)

with E(t) the complex-valued slowly varying amplitude of the electric field of the laser and N(t) the excess number of available carriers, both of which are dimensionless. ξ and g represent the differential gain and the threshold gain of the laser, α is the linewidth enhancement factor, and κ and η are the feedback rate and the injection rate. ΔI/e is the excess pump current rate normalized with the elementary charge, with ΔI = I − I_thr, where I is the injected pump current and I_thr the threshold pump current. φ_FB is the feedback phase and τ_c the carrier lifetime. Complex Gaussian white noise F(t) is added to the system to model spontaneous emission noise; its strength is controlled by the spontaneous emission noise parameter β. The data injection is performed optically via E_inj(t), whose slowly varying envelope is

E_inj(t) = E_0 B(t),   (3)

where the injection occurs at the same frequency as the free-running laser, so that the injection frequency detuning is zero and remains constant in this work [23]. E_0 is the amplitude of the injected electric field, and B(t) represents the injected data signal, obtained by driving the Mach-Zehnder modulator with the masked signal S(t), with A and Φ the modulation amplitude and bias of the modulator. S(t) is defined as

S(t) = m(t) * Σ_{k=1}^{n} u_k δ(t − kNθ),   (4)

with * the convolution operator, u_k the k-th of n input samples, δ(t) the Dirac delta function and m(t) the mask. Because we use delay-based reservoir computing with delayed optical feedback, we first have to time-multiplex the input data. This is performed using the mask m(t), so that every input sample u_k is injected into the reservoir for a given time length. The mask is a piecewise-constant function, with sublevels randomly selected from 5 values: [0, 0.25, 0.5, 0.75, 1]. These sublevels are held constant for a duration equal to the node separation θ, so that the total duration of the mask equals the number of virtual nodes, N, multiplied by the node separation θ.

Figure 1: Illustration of a delay-based RC system using a semiconductor laser (SL). SL 1 drives the system and SL 2 is used to simulate the reservoir. A mask m(t) is used to encode the input data sample u_k, which is optically injected into the reservoir using a Mach-Zehnder modulator (MZM). Also shown are the node separation θ and delay time τ. The virtual nodes are represented by the light blue circles and the predicted target data by ŷ_k.
Every input data sample is injected for a duration equal to the mask length, Nθ, during which the input sample is multiplied with the mask, resulting in a masked input signal. The node separation and delay length are held constant in all simulations. We use the values θ = 20 ps and τ = 4 ns, which have proven to work well for our laser-based RC system [24]. These parameter values lead to a number of virtual nodes equal to N = 200. In this paper, we have opted to choose the period of the mask equal to the delay length τ, so that τ = Nθ. This is done purely for simplicity, even though an improved performance can be found when a small mismatch in the mask length is introduced [25]. A summary of all used parameters, taken from [23], can be found in Table 1.
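As an illustration, the masking step described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code; the helper name `masked_signal` is hypothetical, and only the sublevel structure is modelled, with time expressed in units of the node separation θ:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200  # number of virtual nodes; each sublevel lasts one node separation theta

# Piecewise-constant mask: N sublevels drawn from the 5 allowed values.
mask = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=N)

def masked_signal(u, mask):
    """Time-multiplex the input: every sample u_k is held for one full
    mask period (N sublevels) and multiplied sublevel-wise with the mask."""
    return np.concatenate([u_k * mask for u_k in u])

u = np.array([0.3, 0.8])    # two example input samples
S = masked_signal(u, mask)  # length 2 * N: one mask period per input sample
```

Each input sample thus sees the same mask, which maps a single scalar onto N differently scaled sublevels driving the virtual nodes.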

Training procedures
In this section, we explain the two different training procedures we use in this paper: conventional training and transfer learning.

Conventional training procedure
We obtain the output weights w corresponding to the N nodes of the reservoir in the training phase, where the training is performed off-line. To this aim, we use the normalized state matrix A of the RC system and the expected data y. Because the input samples are time-multiplexed in the input layer, we have to de-multiplex the output, represented by the light intensity |E(t)|² measured by a photodetector at the output of SL 2. This de-multiplexing is performed by sampling the intensity time trace once every node separation θ for each input data sample. The N sampled intensities are stored in the columns of A; this is done for all input samples, which are stored in the rows of A. The resulting state matrix A has dimensions (n × (N + 1)), where n and N represent the number of input samples and the number of nodes of the RC system. An additional bias node has been added to this matrix in order to account for a possible offset in the data. The state matrix can be used to find the weights w for the N nodes of the reservoir and the additional bias node, to match the expected data samples y. This is performed using a least-squares minimization and results in predicted values for the data samples ŷ, which we want to be as close to y as possible. In matrix notation, this translates to

ŷ = A w.   (5)

These weights dictate the scaling of the individual nodes of the RC network for the state matrix A. In practice, the weights w can be calculated by minimizing the squared error between the predicted data samples ŷ, resulting from the matrix multiplication in Eq. (5), and the expected data samples y. This can be implemented by calculating the Moore-Penrose pseudoinverse (denoted by the symbol †) for Eq. (5):

w = A† y.   (6)

This equation can be simplified even further by making use of the fact that the state matrix contains the light intensity and is thus real-valued, so that

w = (AᵀA)⁻¹ Aᵀ y.   (7)

Once the weights have been calculated in the training phase, we test how the RC performs on unseen data, which is referred to as the test phase. In order to quantify this performance, we use the normalized mean squared error (NMSE) between the expected output y and the predicted output ŷ,

NMSE = (1/n) Σ_{k=1}^{n} (ŷ_k − y_k)² / var(y).   (8)
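The conventional training step above is an ordinary least-squares fit. A minimal NumPy sketch, with hypothetical helper names `train_weights` and `nmse`, and a toy state matrix standing in for the sampled laser intensities:

```python
import numpy as np

def train_weights(A, y):
    """Least-squares readout weights, w = A^+ y (Moore-Penrose pseudoinverse),
    equivalent to w = (A^T A)^{-1} A^T y for a real, full-rank state matrix."""
    return np.linalg.pinv(A) @ y

def nmse(y_pred, y_true):
    """Normalized mean squared error used to quantify test performance."""
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)

# Toy state matrix: n samples, N node responses plus a constant bias column.
rng = np.random.default_rng(1)
n, N = 500, 20
A = np.hstack([rng.normal(size=(n, N)), np.ones((n, 1))])
w_true = rng.normal(size=N + 1)
y = A @ w_true           # noiseless toy targets

w = train_weights(A, y)  # recovers w_true up to numerical precision
```

On this noiseless toy data the weights are recovered exactly; in practice, the NMSE evaluated on unseen test data quantifies the performance.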

Transfer learning procedure
The concept of transfer learning builds upon the conventional training of weights explained in Section 3.1. The main advantage transfer learning offers is that, when we have already performed training on a particular task, we can reuse this information for different but still similar tasks. Therefore, a key difference is that instead of one general training dataset, we now have two different training datasets, referred to as the training source and training target datasets. The training source dataset contains the information of the first task, of which we assume here to have many data samples. The training target dataset contains information on the second task, which is similar to the first task, and typically contains fewer data samples than the training source dataset. This could be due to an inherently limited amount of data on the task, or because we want to minimize the required amount of training due to energy or time constraints.
We follow [26] in order to implement the transfer learning, which can be summarized as follows. One starts by finding the weights corresponding to the training source dataset, as explained in Section 3.1, which results in the training source weights w_S. The weights for the training target dataset are expected to be similar to those of the training source dataset, because both tasks are similar, and therefore only require a small correction Δw. The training target weights are therefore defined as

w_T = w_S + Δw,   (9)

such that the expected target data y_T is estimated by the predicted target data ŷ_T, using the state matrix of the target dataset A_T:

ŷ_T = A_T (w_S + Δw).   (10)

Instead of defining the plain squared error function between the predicted target data ŷ_T and the expected target data y_T, the L2-regularized version of the squared error function is defined as

ε(Δw) = ‖A_T (w_S + Δw) − y_T‖² + λ ‖Δw‖²,   (11)

where the parameter λ ∈ [0, +∞) is defined as the transfer rate. This transfer rate dictates the amount of information transferred from the training source to the training target domain and needs to be scanned for optimal performance. Minimizing Eq. (11) with respect to Δw results in an expression for the correction weights Δw:

Δw = (A_Tᵀ A_T + λI)⁻¹ A_Tᵀ (y_T − A_T w_S),   (12)

with I the identity matrix of size N + 1.
The use of transfer learning thus allows one to combine the information gained from two different datasets, via the transfer rate λ, so that an optimized performance can be achieved by varying this single parameter.
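The correction-weight formula of Eq. (12) is a ridge-regression-style update and is straightforward to implement. A minimal NumPy sketch (the function name `transfer_weights` is hypothetical); the toy example checks the two limiting regimes of the transfer rate discussed above:

```python
import numpy as np

def transfer_weights(A_T, y_T, w_S, lam):
    """Target weights w_T = w_S + dw, with the correction of Eq. (12):
    dw = (A_T^T A_T + lam * I)^{-1} A_T^T (y_T - A_T w_S).
    lam -> 0 recovers plain least squares on the target data;
    lam -> infinity keeps the source weights unchanged."""
    G = A_T.T @ A_T + lam * np.eye(A_T.shape[1])
    dw = np.linalg.solve(G, A_T.T @ (y_T - A_T @ w_S))
    return w_S + dw

# Toy check of the two limits of the transfer rate.
rng = np.random.default_rng(2)
A_T = rng.normal(size=(30, 21))           # small target dataset
w_true = rng.normal(size=21)
y_T = A_T @ w_true                        # noiseless target data
w_S = w_true + 0.3 * rng.normal(size=21)  # source weights: similar task
```

For an intermediate λ the update blends the source weights with the target data, which is where the performance gain reported below comes from.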

Results: transfer learning applied to different tasks
To illustrate the benefits of transfer learning for photonic reservoir computing, we apply this technique to two tasks. For the first task, we simulate data from two Lorenz systems with different parameter values, as is done in [26]. For both Lorenz systems, we sample one of their coordinates to obtain the input samples for the RC system. One of the other spatial coordinates of each Lorenz system serves as the target data we want to compute. This allows us to investigate whether transfer learning can reuse information from one Lorenz system for another Lorenz system with slightly different parameter values.
In the second task, we are still working with two Lorenz systems with different parameter values but now we want to quantify the performance of transfer learning when we limit the amount of input data samples for the training target dataset.

Predicting coordinates of Lorenz system
The first task is related to the Lorenz system of ordinary differential equations:

dx/dt = σ(y − x),
dy/dt = x(ρ − z) − y,
dz/dt = xy − βz,

where we fix the parameters σ = 10 and β = 8/3. The task we want to solve with our RC system consists of inferring one of the three coordinates, z(t), from one of the other coordinates, x(t). We simulate two Lorenz systems, with identical initial conditions (x(0) = y(0) = z(0) = 1) but with different values of the parameter ρ and different simulation times.
Both Lorenz systems include Gaussian noise terms for the three spatial coordinates, ξ_x(t), ξ_y(t) and ξ_z(t) respectively, each with a mean of zero and a standard deviation of 0.25. The resulting x-, y- and z-coordinates are normalized to a maximum value of 1.
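Generating such noisy, normalized Lorenz traces can be sketched as follows. The Euler scheme follows the paper; the step size and the choice to add the noise to the recorded traces (rather than inside the dynamics) are assumptions made for illustration, and the variable names are hypothetical:

```python
import numpy as np

def lorenz_euler(rho, n_steps, dt=0.005, sigma=10.0, beta=8.0 / 3.0,
                 noise_std=0.25, seed=0):
    """Euler integration of the Lorenz system for a given rho.
    Gaussian noise (std 0.25) is added to the recorded x and z traces,
    which are then normalized to a maximum absolute value of 1."""
    rng = np.random.default_rng(seed)
    x, y, z = 1.0, 1.0, 1.0                  # identical initial conditions
    xs, zs = np.empty(n_steps), np.empty(n_steps)
    for i in range(n_steps):
        x, y, z = (x + dt * sigma * (y - x),
                   y + dt * (x * (rho - z) - y),
                   z + dt * (x * y - beta * z))
        xs[i] = x + rng.normal(0.0, noise_std)
        zs[i] = z + rng.normal(0.0, noise_std)
    return xs / np.max(np.abs(xs)), zs / np.max(np.abs(zs))

x_src, z_src = lorenz_euler(rho=28.0, n_steps=2000)  # training source task
x_tgt, z_tgt = lorenz_euler(rho=42.0, n_steps=2000)  # training target task
```

The x-traces serve as RC input and the z-traces as targets, with ρ distinguishing the source and target tasks.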
We integrate the two systems using an Euler scheme and discard the initial samples of all three datasets in order to remove the effects of possible transients occurring when switching between datasets with the RC system. Figure 2 shows the time traces for the x- and z-coordinates of the Lorenz systems with different parameters. These input samples are then injected into the RC system, resulting in three state matrices (for the training source, training target and test datasets). By using different parameters for the input data, we are effectively changing the task, while still using the same RC system. Figure 3 shows the NMSE on the test dataset for ρ = 42 as a function of the transfer rate λ when we train on various datasets conventionally (without any transfer learning) and with transfer learning. We distinguish three different cases for conventional training: using only the training source dataset (shown in purple), using only the training target dataset (shown in red), or using a combined dataset of both the training source and training target datasets (shown in green). In these three cases, we use the weights found during training and apply them during testing, as explained in Section 3.1. All three cases result in horizontal lines in Figure 3, because no transfer learning has been applied and the results therefore do not depend on the transfer rate λ. The NMSE values for the three horizontal lines in Figure 3 can be understood conceptually. When we apply conventional training on only the training source dataset, we have a large number of training samples available, but test on a different Lorenz system (with ρ = 42 instead of ρ = 28). The resulting NMSE is the highest of the three horizontal lines.
When we use conventional training on only the training target dataset, we have the smallest number of training samples available, but they correspond to the same Lorenz system as the test dataset (both ρ = 42), which results in the best NMSE of the three horizontal lines. If we instead train on a combined dataset of both the training source and training target datasets, we have the largest dataset of the three cases, and might naively expect the best performance. This is not the case, however, because the combined dataset mixes data of two Lorenz systems with different parameters. This results in an NMSE situated between the previous two cases.
We can improve the total performance of the RC system by controlling the amount of information transferred from the training source dataset to the training target dataset, using transfer learning. We show the NMSE in the case where we apply transfer learning and scan the transfer rate λ (shown as blue dots) in Figure 3. We observe that when we apply transfer learning, the NMSE for λ → 0 corresponds to the horizontal line where we only train on the training target data. For λ → +∞, the NMSE corresponds to the horizontal line where we train conventionally on only the training source dataset. These two extreme regimes of the transfer rate can be explained as follows. For very low transfer rates (λ → 0), Eq. (11) reduces to a least-squares minimization on the training target dataset. This results in no information transfer from the training source dataset to the training target dataset, thus reducing the training to the conventional RC training on the training target dataset. If the transfer rate becomes very large (λ → +∞), the regularizing term of Eq. (11) becomes the dominant factor, resulting in Δw becoming very small. This implies that we add no correction to the weights found from the training source dataset, thus transferring no information from the training target dataset to the weights.
For λ values between these two extreme cases, we find a minimum in the NMSE around λ ≈ 10⁻², after which the NMSE increases drastically to a maximum value. This global NMSE minimum indicates that there exists an optimal transfer rate for which the NMSE is the lowest compared to all other cases, and which thus results in the best performance. At this value, the information from both the training source and training target datasets is combined optimally to achieve the lowest NMSE. Ultimately, Figure 3 shows that transfer learning results in at least the same NMSE as conventional training, and in general a lower NMSE, for these tasks where we are able to combine information and fine-tune the transfer rate λ. The main advantage of transfer learning lies in the fact that when we have already trained on a similar dataset (the training source dataset), we can reuse this information and use fewer data samples for a new dataset (the training target dataset). This means that less training needs to be performed, which results in a computational speed-up. This is investigated in Section 4.2.
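The scan over the transfer rate described above is a one-dimensional grid search. A minimal sketch with hypothetical helper names; by construction, the scan can do no worse than either extreme of the transfer rate:

```python
import numpy as np

def nmse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)

def scan_transfer_rate(A_T, y_T, w_S, A_test, y_test,
                       lams=np.logspace(-6, 6, 25)):
    """Scan the transfer rate on a logarithmic grid; return the lowest
    test NMSE together with the corresponding rate and weights."""
    best_err, best_lam, best_w = np.inf, None, None
    for lam in lams:
        G = A_T.T @ A_T + lam * np.eye(A_T.shape[1])
        w = w_S + np.linalg.solve(G, A_T.T @ (y_T - A_T @ w_S))
        err = nmse(A_test @ w, y_test)
        if err < best_err:
            best_err, best_lam, best_w = err, lam, w
    return best_err, best_lam, best_w

# Toy setup: few noisy target samples, plenty of clean test data.
rng = np.random.default_rng(4)
Nn = 10
w_target = rng.normal(size=Nn)
w_S = w_target + 0.5 * rng.normal(size=Nn)   # weights from a similar task
A_T = rng.normal(size=(8, Nn))               # fewer target samples than nodes
y_T = A_T @ w_target + 0.1 * rng.normal(size=8)
A_test = rng.normal(size=(500, Nn))
y_test = A_test @ w_target

best_err, best_lam, _ = scan_transfer_rate(A_T, y_T, w_S, A_test, y_test)
```

In an experiment the scan would be run on a held-out validation set rather than the test set; the toy setup here only illustrates the mechanics.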

Influence of training target dataset size
In order to investigate the influence of the size of the training target dataset, we again predict the z-coordinate of a normalized Lorenz system with noise from the x-coordinate, but now we change the number of training target data samples. We find that the transfer learning method, at optimal λ, has a lower NMSE for all training target dataset sizes compared to the conventional training method. This corresponds to the results in Section 4.1. As expected, we observe that for both training techniques the NMSE decreases with an increasing number of training target data samples. If the training target dataset becomes small, the NMSE of the conventional training increases drastically, whereas the NMSE for transfer learning remains fairly small. This can be explained by the fact that in the transfer learning case, we already have plenty of information gained from the training source dataset. This is not the case for the conventional training method, where the training target dataset is the only training dataset available. The decrease in NMSE with increasing size of the training target dataset continues up to a dataset size of around 2 × 10³ data samples, from where the NMSE saturates to a quasi-constant value.
The bottom panel of Figure 4 shows the optimal transfer rate found for each of the 7 iterations per size of the training target dataset. It demonstrates that for increasing size of the training target dataset, the optimal transfer rate gradually decreases. This is in agreement with the observations in Figure 3, where we show that small λ values correspond to giving more importance to the training target dataset, which is appropriate here when a large training target dataset is available. Figure 4 demonstrates that instead of retraining the RC system with a large training target dataset, we can reuse already trained weights from a training source dataset combined with transfer learning.

For example, we only have to use 1.5 × 10³ training target samples, combined with already trained weights from a training source dataset, to achieve the same NMSE as conventionally training on a training target dataset of around 2.5 × 10³ samples.
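The trade-off between target dataset size and training method can be sketched on toy linear data. `compare` is a hypothetical helper, not the paper's simulation, but it reproduces the qualitative behaviour: for scarce target data transfer learning helps most, while for abundant target data both methods converge:

```python
import numpy as np

def conv_train(A, y):
    return np.linalg.pinv(A) @ y                       # conventional training

def transfer_train(A, y, w_S, lam):
    G = A.T @ A + lam * np.eye(A.shape[1])
    return w_S + np.linalg.solve(G, A.T @ (y - A @ w_S))

def nmse(p, t):
    return np.mean((p - t) ** 2) / np.var(t)

def compare(n_T, Nn=10, seed=5):
    """Test NMSE of conventional training vs transfer learning (at the best
    transfer rate on a small grid) for a target dataset of n_T samples."""
    rng = np.random.default_rng(seed)
    w_target = rng.normal(size=Nn)
    w_S = w_target + 0.1 * rng.normal(size=Nn)         # from a similar task
    A_test = rng.normal(size=(500, Nn))
    y_test = A_test @ w_target
    A_T = rng.normal(size=(n_T, Nn))
    y_T = A_T @ w_target + 0.1 * rng.normal(size=n_T)
    e_conv = nmse(A_test @ conv_train(A_T, y_T), y_test)
    e_tl = min(nmse(A_test @ transfer_train(A_T, y_T, w_S, lam), y_test)
               for lam in np.logspace(-4, 4, 17))
    return e_conv, e_tl
```

Calling `compare` with a small and a large target dataset size mimics the sweep shown in Figure 4.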

Results: mitigating influence of feedback phase changes on performance of Santa Fe task
Any experimental setup is subject to variations of its internal system parameters, which can be induced by different effects during operation. For example, in our delay-based RC system a change in temperature can lead to a change in the optical length of the delay line. This eventually results in a fluctuating feedback phase, which leads to a worsening of the performance of the reservoir computing system. Typically, the temperature is controlled by thermoelectric cooling (e.g. using Peltier elements).
However, very small changes in the feedback phase can still occur, which is why we investigate the use of transfer learning to mitigate the performance degradation due to these effects. Instead of varying the task, as we have done in Sections 4.1 and 4.2, we now vary the internal RC parameters themselves. We thus study whether it is possible to compensate for the shift in the feedback phase by applying transfer learning. We aim to use transfer learning to quickly recalculate the weights, i.e. with a small number of input samples in the training target dataset used for retraining.
In order to quantify the performance of the RC setups with varying feedback phase, we choose a one-step-ahead prediction task on a time-series frequently used in the literature. The input dataset is the Santa Fe dataset, which consists of 9093 data samples recorded from a far-IR laser operating in a chaotic regime [27].
In order to investigate the influence of the feedback phase on the performance, we inject, as input data, the first 3010 normalized data samples of the discrete Santa Fe time-series into the RC system. The task is to predict sample k + 1 when data up until sample k has been injected. This results in the state matrix A which we use for calculating the weights during the training phase. As test dataset, we use the 1010 data samples of the Santa Fe time-series following the training dataset, after a 10-data-sample break, which we also inject into the RC system. After simulation, we discard the first 10 samples of both the training and test datasets to remove any effects of transients occurring from switching datasets. We perform a scan of the feedback phase of the RC system, with φ_FB ∈ [0, 2π], in order to find the optimal feedback phase for the RC system. This optimal feedback phase is defined as the feedback phase for which we achieve the best performance, i.e. the lowest NMSE. Figure 5 shows the NMSE as a function of the feedback phase φ_FB for the one-step-ahead prediction of the Santa Fe time-series. In the top right, a zoom of the plot near its minimum NMSE is shown. Simulations of RC systems with φ_FB = 0 typically result in NMSE values between 0.01 and 0.02 for the Santa Fe one-step-ahead prediction, which agrees with the values found in Figure 5 [14,24].
In Figure 5, we find that the performance of our RC system on this task is very sensitive to the feedback phase: only a narrow range of feedback phases results in the best NMSE, while feedback phases outside this range perform rather poorly, with higher NMSE. We find that a feedback phase around φ_FB ≈ 5.68 corresponds to the lowest NMSE. Therefore, we use this value as the optimal feedback phase for our RC systems.
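The construction of the one-step-ahead training and test sets described above can be sketched as follows. The indexing mirrors the description (3010/1010 injected samples with the first 10 of each discarded as transients, and a 10-sample break) but is an approximation for illustration, and the synthetic series is a stand-in for the actual Santa Fe data:

```python
import numpy as np

def one_step_ahead_split(series, n_train=3000, n_test=1000, gap=10):
    """Normalize the series and build (input, target) pairs for one-step-
    ahead prediction: the target for input sample k is sample k + 1."""
    s = series / np.max(np.abs(series))
    u_train, y_train = s[:n_train], s[1:n_train + 1]
    start = n_train + gap
    u_test = s[start:start + n_test]
    y_test = s[start + 1:start + n_test + 1]
    return u_train, y_train, u_test, y_test

# Hypothetical stand-in for the 9093-sample Santa Fe laser time-series.
t = np.arange(9093)
series = np.sin(0.3 * t) * (1.0 + 0.2 * np.cos(0.01 * t))
u_tr, y_tr, u_te, y_te = one_step_ahead_split(series)
```

Each input sequence is then masked and injected into the reservoir as in Section 2, and the weights are fitted on (state matrix, `y_tr`) pairs.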
Having found the optimal feedback phase, we again inject the first 3010 data samples of the normalized Santa Fe time-series into our RC system, which has a constant feedback phase φ_FB,S = 5.68. The resulting state matrix A_S corresponds to the training source response. In order to apply transfer learning, we need to define a training target response. This training target response is defined as the state matrix A_T resulting from injecting the first 510 data samples of the normalized Santa Fe time-series into our RC system, but with a different feedback phase φ_FB,T ∈ {5.70, 5.71, 5.72, 5.73}.
The values for these feedback phases are small deviations from the optimal feedback phase, with the maximum deviation being only 0.5%. In a similar fashion, we define the test response as the next 1010 data samples, after a 10-sample break, following the training source data samples of the same time-series, for the same feedback phase φ_FB,T as for the training target response. After simulation, we again discard the first 10 samples of the responses to remove any effects of transients occurring from switching responses. Figure 6 shows the NMSE corresponding to the different training schemes. It shows the performance when we conventionally train on only the training target response (in red), on only the training source response (in purple), on a combined response of both training source and training target data (in green), and when training on the optimal RC system (with the same optimal feedback phase φ_FB,S = 5.68 for the training response as for the test response).
This last NMSE is used as the reference value, because it corresponds to the best NMSE we can expect, since the φ_FB of the RC system is optimal and identical for both the training and test responses. All of the previously mentioned NMSE values do not vary with the transfer rate λ, since no transfer learning has been used. The training scheme where we apply transfer learning between the training source and training target responses (shown with blue dots) does vary with the transfer rate λ.
These results are shown in Figure 6 for the four investigated feedback phases φ_FB,T of the training target response. For the feedback phases closest to the optimal φ_FB,S, conventional training on only the training source response yields a better NMSE than conventional training on only the training target response. This is the case in Figure 6(a) and (b), where the feedback phase of the training source system is similar to that of the training target system. However, when the feedback phase of the training target response φ_FB,T differs too much from that of the training source response φ_FB,S, training on the training source response results in a large NMSE, surpassing that of the situation where the RC is trained on the training target response. This can be seen in Figure 6(c) and (d). This again shows that the performance of the RC is very sensitive to the value of the feedback phase of the training target response, as already indicated by Figure 5.
The performance of conventional training on a combined response consisting of both the training source and training target responses is also shown in Figure 6 for all four cases. This combined response can be seen as a response for which the feedback phase has changed during the experiment. It contains the largest number of data samples, 3500 in total. Training conventionally on this combined response always results in a better performance than purely training conventionally on either the training source or training target response. However, we can improve this performance further by introducing transfer learning, which controls the information transfer between the training source and training target responses.
We observe in Figure 6 that, similar to Figure 3, for small λ we recover the situation corresponding to conventionally training only on the training target response. When λ is further increased, the NMSE decreases until an optimal value is found. At this optimal transfer rate, around λ ≈ 10⁻² to λ ≈ 1, all four panels of Figure 6 show a minimum value of the NMSE. This point corresponds to the optimal transfer rate of information between the training source and training target responses and thus results in the best performance using both responses. For large λ, the situation is similar to training on only the training source response. This is to be expected and is also described in Section 4.1.
From Figure 6, we observe that if we start from the optimal feedback phase φ_FB,S and there is a small drift in the feedback phase, retraining using transfer learning is the best option. This is because it performs better than conventional training on only the training target response, and because it is more time and energy efficient than conventional training on the combined response of the training target and training source responses. We also observe that if φ_FB,T is strongly different from φ_FB,S, we do not get a good performance from transfer learning. This is in agreement with the results of Figure 5, where we have shown that achieving a good performance for these phases is simply not possible. We have investigated the effect of phase variations in the feedback term and found that the performance quickly deteriorates when this parameter is changed, even for small variations. We found that transfer learning can limit this worsening in performance somewhat, but only within a limited percentage from the optimal value of the feedback phase. We thus conclude that transfer learning can only be used when confronted with small changes in the feedback phase of delay-based photonic RC systems. In the next section, we investigate the performance when the RC system is retrained using transfer learning while three other parameters are varied. These parameters have less influence on the performance of the RC system than the feedback phase and are investigated for larger parameter variations.

Results: mitigating the influence of multiple parameter variations on the performance of the Santa Fe task
In this section, we investigate the performance of RC systems when they are retrained using transfer learning while three parameters are varied: the injection rate η, the feedback rate κ and the excess pump current rate ΔI/e.
The injection of Santa Fe data is identical to the procedure described in Section 5, with the only difference that instead of varying the feedback phase, we vary the injection rate, the feedback rate and the excess pump current rate. The optimal values of these three parameters are defined in Table 1 (denoted here as η_opt, κ_opt, ΔI_opt/e) and are used for creating the state matrix A_S. We again define the training target response as the state matrix A_T resulting from injecting normalized Santa Fe samples into our RC system, but with different values of η, κ and ΔI/e. The values of these three parameters are defined as xη_opt, yκ_opt, zΔI_opt/e, where (x, y, z) are three random numbers drawn from a Gaussian distribution with a mean of 1 and standard deviation σ. This σ dictates the amount of deviation from the optimal values η_opt, κ_opt, ΔI_opt/e and is varied over 20 different values ranging between 0.01 and 0.20. For every σ, we repeat the simulation 10 times, each time with a different (xη_opt, yκ_opt, zΔI_opt/e) combination, which yields better statistics. As defined in Section 5, the test response is created using the same parameters as the training target response.
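The sampling of perturbed parameter combinations described above can be sketched as follows; the "optimal" operating point is a placeholder, not the actual values of Table 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative optimal operating point (placeholder values, not Table 1):
# injection rate, feedback rate, excess pump current rate.
eta_opt, kappa_opt, dI_opt = 1.0, 0.5, 2.0
opt = np.array([eta_opt, kappa_opt, dI_opt])

sigmas = np.linspace(0.01, 0.20, 20)  # 20 deviation levels sigma
n_repeats = 10                        # 10 random parameter draws per sigma

# For each sigma, draw (x, y, z) ~ N(1, sigma^2) and scale the optimum,
# giving the multiplicatively perturbed parameters of the target response.
perturbed = {
    sigma: rng.normal(1.0, sigma, size=(n_repeats, 3)) * opt
    for sigma in sigmas
}
print(perturbed[sigmas[0]].shape)  # (10, 3): one row per repeat
```

Each row of a `perturbed[sigma]` array would then parameterize one simulation run used to build the training target and test responses.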
For each of the 10 iterations, we perform a scan for the value of the transfer rate λ which results in the lowest NMSE, in order to achieve the best possible performance for every iteration. Figure 7 shows the NMSE for the one-step-ahead prediction of the Santa Fe dataset. In this figure, we show the median and interquartile range, calculated over the 10 different parameter combinations, each with an identical mask. This is done for the case where we use transfer learning with the training source and target responses, at the optimal transfer rate (in blue), and for the case where we conventionally train on the training target response (in red). As a reference, we also show the performance without any deviations (i.e. σ = 0) of the RC's parameters, where we conventionally train on the training source response at the optimal values of the three parameters (in dark green). Additionally, we show the results when we inject the entire set of source input samples into an RC system with the same parameter combination as the test response and conventionally train on this dataset. These results are shown for various σ (in light green) and represent the best possible results. We use the median and interquartile range, instead of the mean and standard deviation, to show the spread of the NMSE, since they are less influenced by large outliers. Figure 7 shows that the median NMSE is consistently lower when using transfer learning than with conventional training on the training target response, as shown by the blue line always lying below the red line. The median NMSE when using transfer learning remains around 0.03 for σ ⪅ 5%, whereas for conventional training on the target response we typically obtain an NMSE around 0.05. This should be compared with the best possible result of an NMSE around 0.02 for conventional training.
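The per-iteration transfer-rate scan and the robust summary statistics can be sketched as follows. The NMSE definition (MSE normalized by the target variance) is standard; the error curve passed to the scan here is a toy stand-in for an actual retraining run, and all names are our own:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE divided by the target variance."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def best_transfer_rate(eval_nmse, lambdas):
    """Scan candidate transfer rates and keep the one with the lowest NMSE."""
    errors = np.array([eval_nmse(lam) for lam in lambdas])
    i = int(np.argmin(errors))
    return lambdas[i], errors[i]

# Log-spaced grid, since the optimal transfer rate can span decades.
lambdas = np.logspace(-4, 4, 17)

# Toy error curve with a minimum at lam = 1 (stand-in for a real run).
lam_best, err_best = best_transfer_rate(lambda lam: np.log10(lam) ** 2 + 0.02,
                                        lambdas)

# Summarize repeated runs with median and interquartile range, which are
# robust against large NMSE outliers (here one run at 0.35).
nmses = np.array([0.03, 0.02, 0.04, 0.35, 0.03, 0.02, 0.05, 0.03, 0.04, 0.03])
median = np.median(nmses)
q1, q3 = np.percentile(nmses, [25, 75])
print(lam_best, median)
```

The median and interquartile range stay near the typical run values even though one iteration produced an outlier, which is exactly why they are preferred over the mean and standard deviation here.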
This is referred to as the best result for conventional training, as it uses 3000 data samples in an RC system which has the same parameter combination as the test set for its training source response, as opposed to only 500 samples for the training target response. These best possible results thus correspond to fully retraining the system, which is very time-intensive and which we want to avoid, while with transfer learning we only partially retrain the system.
In Figure 7, we observe that the median NMSE for both transfer learning and conventional training increases when σ increases. The increase of the median NMSE with σ is, however, not monotonic in either case, because only 10 random iterations are used for the calculation of the median. It is therefore possible that, even at increased σ, certain parameter combinations were chosen for which a good performance, and thus a low NMSE, is found (e.g. around σ = 8%). We also observe that the increases and decreases in median NMSE with σ for transfer learning are mirrored in the conventional learning case. This can be explained by the fact that the medians are calculated with the same parameter combinations of η, κ and ΔI/e: a well-performing parameter combination leads to a good NMSE for both the transfer learning case and the conventional learning case.
Additionally, the interquartile range is smaller, with slightly lower NMSE, when using transfer learning, indicating that transfer learning is able to improve the performance of RC systems. However, in both cases, the fluctuations of the interquartile range grow with increasing parameter deviation. This can be explained by the fact that for increasing σ, the probability of drawing parameter combinations which result in poor performance increases, due to the larger parameter deviations. Because of these fluctuations, we limit the applicability of transfer learning to parameter deviations of around σ ≈ 10%, as the NMSE becomes too large for higher σ.
Finally, we conclude that transfer learning can be used for deviations of η, κ and ΔI/e up to σ ≈ 10%, where a lower NMSE is found compared to conventional training on the training target response and where the fluctuations in NMSE remain small.

Conclusions
In this work, we have numerically investigated the application of a novel training scheme, transfer learning, to delay-based reservoir computing with semiconductor lasers. With transfer learning, one is able to control the information transfer between two training sets, a training source and a training target dataset, via the transfer rate parameter λ. This allows one to combine previous information and reuse previously found weights, so that less training data is required in the training target dataset. We have found that by using transfer learning, we are able to improve the performance on predicting coordinates of a Lorenz system with different parameters, even with relatively little training data available for that target Lorenz system. We have also investigated how small this training target dataset can be made while still improving on the performance of conventional training on a training target dataset, when predicting the behaviour of a Lorenz system. We have found that transfer learning with only 1.5 × 10³ training target samples, combined with 10⁴ training source samples, achieves the same performance as conventional training on 2.5 × 10³ training target samples. Since we do not have to retrain the weights corresponding to the training source dataset, this implies that ultimately less retraining is required when those weights are already available. Finally, we have also investigated the possibility of using transfer learning to compensate for the degradation of reservoir computing performance caused by parameter variations. To study this, we first looked into the effect of changes in the feedback phase of the reservoir computing system.
We are able to update the weights (originally obtained at the optimal feedback phase φ_FB) when the feedback phase drifts, by using a limited amount of training target samples combined with transfer learning. By training on a reservoir computing system at the optimal feedback phase, we were able to mitigate, to a certain degree, this performance degradation for slightly varying feedback phases. When using transfer learning to counter parameter deviations of the injection rate, the feedback rate and the excess pump current rate, however, we were able to achieve better results, up to large parameter deviations. We therefore conjecture that transfer learning can be used to enhance the performance of other photonic RC systems which also suffer from internal parameter drift.