## 1 Introduction

In recent years, neural networks (NNs) have taken centre stage in advancing computation [1]. Optimized by training, such *learning* machines provide key advantages for solving abstract computational problems and already outperform humans in numerous tasks previously deemed impossible for classically (algorithmically) programmed computers [1], [2], [3].

However, NNs are still mostly emulated by traditional Turing/von Neumann computers. The absence of computing hardware supporting fully parallel NNs reduces energy efficiency and overall speed, and new paradigms addressing these problems are desirable. The lack of a parallel network substrate is a fundamental roadblock and has been an active area of research for decades, with current analogue hardware implementing either the full network [4], [5], [6] or the neurons [7], [8], [9], [10], [11]. An implementation of nonlinear neurons, fully parallel information transduction and learning at the substrate level promises a revolution, and photonic NNs [12], [13] remain a highly promising avenue [4], [5].

Noise is an inseparable companion of analogue hardware [14], yet the fundamental aspects of optimizing a noisy NN [15], [16], [17] have so far hardly been explored – neither in experiments [18], [19] nor in theory [14]. Here, we investigate the interactions between noise, learning rules and the topology of an error landscape for the first time. We experimentally implement a NN with 961 electro-optical neurons via a spatial light modulator (SLM) [17], use diffraction [4], [5], [10], [20] to physically realize the network’s internal connections, and use a digital micro-mirror device (DMD) for programmable Boolean readout weights [17]. The particular NN task consists of one-step-ahead prediction of the chaotic Mackey–Glass time series. Learning exclusively modifies the readout connections [21] via an evolutionary Boolean algorithm based on the error gradient only, and optimization uses either fully random (Markovian) or structured (greedy) exploration.

The statistics of the experimentally obtained learning trajectories prove that noise and exploration strategy strongly interact. Noise induces a kind of random forcing upon the descent algorithm, which strongly modifies the system’s path during its convergence towards a local minimum. We find that noise decorrelates the final weight configurations: starting from identical weight configurations and exploring the error landscape’s dimensions in identical sequences always leads to clearly differentiated local minima. Quite astonishingly, all minima are spaced at an almost constant distance from each other, which for the generally non-trivial error landscape topologies is unusual at the least. Noise therefore appears to arrange minimizers in periodic positions, much like competitive Brownian walkers with non-local interactions [22]. These fundamental effects highlight the importance of considering hardware architecture, noise and learning algorithm as intimately linked.

## 2 Neural network hardware

A recurrent NN inspired by reservoir computing, illustrated in Figure 1a, was our experimentally realized NN test bench. Figure 1b schematically depicts the experiment. An optical plane wave *E*^{0} illuminates the SLM’s pixels, and the reflected field is filtered by a polarizing beam splitter (PBS). The SLM combined with the PBS creates a cos(·) non-linearity, and the SLM’s pixels physically encode the NN’s state. A quarter-wave plate located between the PBS and the mirror directs the signal towards a camera, and a double pass through the diffractive optical element (DOE) establishes the recurrent connections *W*^{DOE} [10], [17], [20]. The camera state at time step *n* is combined with the input *u*(*n*+1) and sent to the SLM, creating the network’s state at step *n*+1. Here, *N* = 961 is the recurrent layer’s number of nodes, *β* = 0.8 the feedback gain, *γ* = 0.4 the input injection gain and *α* a normalization parameter. The optical electric field and the nonlinearity’s bias offset for node *i* are *E*_{i} and *θ*_{i}, respectively. Input information *u*(*n*+1) is injected into the system according to random connections *W*^{inj}.
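As a rough illustration of the update described above, the following Python sketch iterates a reservoir of *N* = 961 nodes with the stated gains *β* = 0.8 and *γ* = 0.4. The functional form (a cos² response acting on the sum of coupled state, injected input and phase offset), the random matrix standing in for *W*^{DOE}, the placeholder input, and all function and variable names are assumptions made for illustration; they do not reproduce the experiment’s exact state equation.

```python
import numpy as np

def reservoir_step(state, u_next, w_doe, w_inj, theta,
                   alpha=1.0, beta=0.8, gamma=0.4):
    """One update of an assumed cos^2 reservoir (illustrative form only)."""
    # Feedback through the coupling matrix, input injection and phase offset.
    drive = beta * (w_doe @ state) + gamma * w_inj * u_next + theta
    return alpha * np.cos(drive) ** 2

rng = np.random.default_rng(0)
n_nodes = 961                                   # N from the text

# Hypothetical stand-ins for optical coupling, injection weights and offsets.
w_doe = rng.random((n_nodes, n_nodes)) / n_nodes
w_inj = rng.random(n_nodes)
theta = rng.normal(0.0, 0.1, n_nodes)

state = np.zeros(n_nodes)
for u in np.sin(0.3 * np.arange(50)):           # placeholder input sequence
    state = reservoir_step(state, u, w_doe, w_inj, theta)
```

With `alpha = 1`, every node value stays between 0 and 1, which mirrors the bounded response of an intensity-based nonlinearity.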

The polarization reflected by the PBS is imaged onto the DMD, whose mirrors are programmed to fixed angles of ±12° from normal incidence. A photodiode only detects optical signals reflected off mirrors at −12°, thus implementing a Boolean readout weight matrix *W*^{DMD}(*k*), where *k* is the learning epoch and each weight determines whether the optical signal of node *i* arrives at the detector. As in reservoir computing, we restrict learning to the optimization of the readout weights. Finally, the absence of negative weights is partially mitigated by distributing the offset phases *θ*_{i} between *θ*_{0}+*δθ*_{i} and *θ*_{0}+Δ*θ*+*δθ*_{i}, where *δθ*_{i} follows a random Gaussian distribution [17]. Internal *W*^{DOE} and readout *W*^{DMD}(*k*) connections are, however, realized in passive and fully parallel photonic hardware.

As the network is constructed of physical neurons, it harbours noise, which can be either additive or multiplicative, as well as correlated or uncorrelated [14]. The main sources of noise in our experiment are the SLM and the camera, in relation to which the illumination laser and output detector can be considered noiseless, as can the internal coupling and readout matrices implemented by the DOE and DMD, respectively. All relevant noise sources are therefore reservoir-internal, and our following discussion is by no means limited to systems where the readout layer is implemented physically. More details about the theoretical treatment and propagation of noise in NNs as well as the individual noise sources and their respective amplitudes and statistics can be found in [14].

## 3 Boolean evolutionary learning

Most current learning techniques require complete knowledge of the internal network’s state [21], all connection weights and potentially all gradients [1]. In a hardware network this demands probing (and most probably externally storing) the value of each node and connection, which necessitates auxiliary circuitry of a complexity potentially exceeding the actual neural network. This jeopardizes precisely the benefits one targets when mapping a neural network onto hardware. We therefore employ learning that only tracks the computation error’s evolution, and hence imposes no constraint on the type of neurons, and more broadly, on hidden layers as a whole. Such an implementation’s complexity does not depend on, and hence does not limit the NN’s size, which is crucial considering the importance of scalability for computing.

Here, we optimize the DMD’s configuration simply by measuring the impact of output mirrors’ modifications on the computing error. Readout weights *W*^{DMD}(*k*) are modified during *k* = 1, 2, …, *K* learning epochs such that output *y*^{out}(*n* + 1) best approximates the target.

### 3.1 Mutation

We create a vector with *N* random elements, independently and identically distributed between 0 and 1 (rand(*N*)). *W*^{bias} offers the possibility of modifying the otherwise stochastic selection, and the selected position, *l*(*k*), determines the Boolean readout weight to be mutated.

A fully stochastic Markovian descent is obtained with *W*^{bias} *=* 1 and excluding Eq. (6). However, here we also investigate exploration which makes mutating a particular connection in near succession unlikely. There, *W*^{bias} is randomly initialized at *k* = 1, and at each epoch Eq. (6) increases the bias of all connections by 1/*N*, while the currently modified connection’s bias is set to zero. On average, the probability of again probing a particular weight reaches unity only after *N* learning epochs have passed, and we therefore refer to this biased exploration as *greedy* learning.

According to these instructions, our algorithm only probes, hence potentially mutates one mirror at a time. We have considered updating more than one mirror at each *k*, yet simplified numerical simulations indicate that convergence was significantly faster for updating only one weight at a time.
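The two exploration strategies can be sketched in a few lines of Python. The sketch assumes that *l*(*k*) is the position of the largest entry of rand(*N*) multiplied element-wise by *W*^{bias}; this selection rule, like the function names, is an assumption made for illustration, while the bias update (add 1/*N* everywhere, reset the probed weight to zero) follows the description above.

```python
import numpy as np

def select_weight(bias, rng):
    """Pick l(k): index of the largest entry of rand(N) * W_bias (assumed rule)."""
    return int(np.argmax(rng.random(bias.size) * bias))

def greedy_bias_update(bias, selected):
    """Raise every bias by 1/N and reset the just-probed weight to zero."""
    bias = bias + 1.0 / bias.size
    bias[selected] = 0.0
    return bias

def exploration_sequence(n_weights, epochs, greedy, seed=0):
    """Return the sequence of probed positions l(k) for one exploration run."""
    rng = np.random.default_rng(seed)
    # Greedy: random initial bias. Markovian: constant bias of 1, never updated.
    bias = rng.random(n_weights) if greedy else np.ones(n_weights)
    sequence = []
    for _ in range(epochs):
        l = select_weight(bias, rng)
        sequence.append(l)
        if greedy:
            bias = greedy_bias_update(bias, l)
    return np.array(sequence)

greedy_seq = exploration_sequence(100, 100, greedy=True)
markov_seq = exploration_sequence(100, 100, greedy=False)
print("distinct weights probed (greedy):   ", np.unique(greedy_seq).size)
print("distinct weights probed (Markovian):", np.unique(markov_seq).size)
```

Because the probed weight’s bias is reset to zero, the biased variant cannot select the same mirror in two consecutive epochs, whereas the Markovian variant can.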

### 3.2 Error and reward signals

The mean square error is obtained from *T* data points according to Eq. (7), and comparison to the previous error assigns a reward *r*(*k*) = 1 only if a modification reduced the error.

### 3.3 Descent action

Based on reward *r*(*k*), the DMD’s current configuration either accepts or rejects the previous modification, Eq. (11). For a noise-less system, reward *r*(*k*) is therefore simply based on the gradient found at position *l*(*k*). We will refer to this hypothetical gradient of a noise-less system as the *systematic* gradient.
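The mutate, measure, reward and accept-or-reject cycle of Sections 3.1 to 3.3 can be summarized in a toy Python loop. This is a minimal sketch under stated assumptions: the readout is a plain linear sum over the selected nodes (not the optical detection of the experiment), selection is Markovian, noise is added to the output as Gaussian white noise, and the node states and target are synthetic. All names are hypothetical.

```python
import numpy as np

def boolean_learning(states, target, epochs=400, noise_std=0.0, seed=0):
    """Toy Boolean descent: flip one readout weight, keep it only if rewarded."""
    rng = np.random.default_rng(seed)
    n_nodes = states.shape[1]
    weights = rng.integers(0, 2, n_nodes).astype(bool)

    def measure_error(w):
        # Assumed linear readout plus additive output noise.
        output = states[:, w].sum(axis=1) / n_nodes
        output = output + noise_std * rng.standard_normal(output.shape)
        return np.mean((target - output) ** 2)      # mean square error

    error = measure_error(weights)
    history = [error]
    for _ in range(epochs):
        l = rng.integers(n_nodes)                   # Markovian selection
        weights[l] = not weights[l]                 # mutation
        new_error = measure_error(weights)
        reward = new_error < error                  # r(k) = 1 only if error dropped
        if reward:
            error = new_error                       # accept the mutation
        else:
            weights[l] = not weights[l]             # reject: restore the weight
        history.append(error)
    return weights, np.array(history)

rng = np.random.default_rng(1)
states = rng.standard_normal((170, 40))             # T = 170 samples, 40 toy nodes
true_w = rng.integers(0, 2, 40).astype(bool)
target = states[:, true_w].sum(axis=1) / 40
_, clean = boolean_learning(states, target)
_, noisy = boolean_learning(states, target, noise_std=0.05)
print("noise-free error:", clean[0], "->", clean[-1])
print("noisy error:     ", noisy[0], "->", noisy[-1])
```

Without noise the recorded error can never increase, since a mutation is kept only when it lowers the error. With noise, the reward can be inverted relative to the systematic gradient, which is the effect examined in Section 5.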

## 4 Results

While such Boolean learning has been applied to a wide range of computational problems, recurrent neural networks have a particular relevance for dynamical signal processing, and we therefore explore one-step-ahead prediction of the chaotic Mackey–Glass sequence with a Lyapunov exponent of ∼3·10^{−3}. This particular input is a commonly employed benchmark test and our results are therefore directly comparable to other works such as Mackey-Glass prediction based on a semiconductor laser delay reservoir with weights optimized and applied in an offline procedure [23], as well as the seminal work on RC [21] – where however the time step was twice as large.

The chaotic sequence acting as input information *u*(*n*+1) has zero mean and is normalized to its standard deviation, so that the mean square error is expressed relative to the signal’s variance. We use 200 time steps of *u*(*n*+1), of which we however removed the first 30 due to their transient nature. The result is a training signal with *T* = 200 − 30 = 170 data points, for which the target is the sequence’s following value. The error is evaluated for each configuration *W*^{DMD}(*k*), and reward *r*(*k*) drives the configuration from *W*^{DMD}(1) to a local minimum at *W*^{DMD}(*k*^{min}). There our system will remain trapped due to an exploration step size of 1. We will refer to one complete learning process for *k*: 1→*k*^{min} as a *minimizer*.
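A one-step-ahead prediction task of this kind can be prepared as in the following Python sketch. The Mackey–Glass parameters (delay 17, coefficients 0.2 and 0.1, exponent 10), the Euler integration step, the sampling interval and the washout length are common textbook choices assumed here for illustration; they are not taken from the experiment and need not reproduce the quoted Lyapunov exponent. Only the 200 − 30 = 170 split follows the text.

```python
import numpy as np

def mackey_glass(n_points, tau=17.0, beta=0.2, gamma=0.1, power=10,
                 dt=0.1, step=1.0, x0=1.2, washout=1000):
    """Euler integration of dx/dt = beta*x(t-tau)/(1+x(t-tau)^power) - gamma*x."""
    delay = int(round(tau / dt))
    per_sample = int(round(step / dt))
    total = (washout + n_points) * per_sample
    x = np.full(delay + total + 1, x0)              # constant initial history
    for t in range(delay, delay + total):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (beta * x_tau / (1.0 + x_tau ** power) - gamma * x[t])
    start = delay + washout * per_sample            # skip the integrator transient
    return x[start::per_sample][:n_points]

def prepare_task(n_total=200, n_discard=30):
    """Return input u and one-step-ahead target, both with n_total - n_discard points."""
    series = mackey_glass(n_total + 1)
    series = (series - series.mean()) / series.std()   # zero mean, unit std
    u, target = series[:-1], series[1:]
    return u[n_discard:], target[n_discard:]

u, target = prepare_task()
print(len(u), "training points")
```

The target is simply the input shifted by one sample, so `target[n]` is the value that follows `u[n]` in the normalized sequence.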

Understanding why generalization is possible for a training set size (*T* = 170) not orders of magnitude larger than the number of weights to be optimized (*N* = 961) is an interesting question. Recent results on deep neural networks, triggered by the insightful analysis from Ref. [24], show that overparametrization may not preclude generalization. See Ref. [25] for an account of this phenomenon using random matrix theory, starting from simple linear models and generalizing to kernel estimation. In our setting, however, we might additionally postulate that we work below the overparametrization barrier due to the Boolean entries of *W*^{DMD}(*k*), which brings substantial rigidity into play. The price one pays is making the problem harder from a computational optimization viewpoint [26].

Typically, the main metrics for evaluating learning are speed of convergence *k*^{min} and final inference error *W*^{DMD}(1) and we therefore focus on the algorithm’s exploration of the error landscape. Results are shown in Figure 2, with individual learning curves as grey lines and their average as red crosses. Panel (a) shows data for greedy exploration, panel (b) for Markovian exploration.

### 4.1 Average and local features of convergence and minima

On average, the error landscape topology excellently follows an exponential decay for both exploration strategies, see fit (blue line) to the average error (red crosses) in Figure 2. Comparing individual trajectories, however, reveals strong inter-trial differences significantly exceeding the noise level. This diversity corresponds to the error landscape’s topological richness probed by the different random descents, and trajectories range from rather smooth descents to paths including steep drops. No correlation between the starting *k*^{min} = 973.6 ± 63.7 learning epochs, while Markovian exploration arrives at a slightly lower error *k*^{min} = 1856.5 ± 175.1. Crucially, the system’s testing error (green line in Figure 2) excellently matches its training error, hence ruling out overfitting. Notably, convergence in both cases scales linearly with network size *N* [27], yet greediness approximately halves *k*^{min} compared to Markovian descent.

Nevertheless, despite the small deviations of *W*^{DMD,a} (*k*^{a}) and *W*^{DMD,b} (*k*^{b}) is determined by Hamming distance *H* = 419 and with a half width at 1/*e* of 14. The data show a very specific and unusual error landscape topology: local minima appear to be neither irregularly distributed nor located in a particular region. Instead, the negligible correlations between the minima’s locations and the systematic and narrow distribution of inter-minima distances shown in Figure 3 reveal their almost uniform distribution across the error landscape. Again, we find that Markovian exploration results in identical behaviour.
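For reference, the Hamming distance between two Boolean readout configurations is simply the number of positions at which they differ. The short Python sketch below computes it for a pair of configurations and for all pairs in a set. The function names and the random example configurations are illustrative assumptions and are unrelated to the measured minima.

```python
import numpy as np

def hamming_distance(w_a, w_b):
    """Number of positions at which two Boolean weight vectors differ."""
    w_a = np.asarray(w_a, dtype=bool)
    w_b = np.asarray(w_b, dtype=bool)
    return int(np.count_nonzero(w_a != w_b))

def pairwise_hamming(configs):
    """Hamming distances between all distinct pairs of configurations."""
    configs = np.asarray(configs, dtype=bool)
    n = len(configs)
    return np.array([hamming_distance(configs[i], configs[j])
                     for i in range(n) for j in range(i + 1, n)])

# Hypothetical example: ten random configurations of N = 961 Boolean weights.
rng = np.random.default_rng(0)
configs = rng.integers(0, 2, size=(10, 961)).astype(bool)
distances = pairwise_hamming(configs)
print("mean inter-configuration distance:", distances.mean())
```

A histogram of `pairwise_hamming` applied to the final configurations of many minimizers would give the kind of inter-minima distance distribution discussed above.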

## 5 Noise sensitivity

To further investigate this phenomenon, we reduce the number of uncertainties during learning. We measure three minimizer paths starting at the same *r*(*k*) and hence independently evaluate mutating the same weight. Keeping the potentially systematic error of a slow experimental parameter drift in mind, the three systems are evaluated at each learning epoch *k* before advancing to *k* + 1. A single minimizer takes ∼20 h, and sequential evaluation would amplify susceptibility to slow experimental parameter drifts which take place on the scale of hours in our experiment.

Results are shown in Figure 4a. The blue, green and red lines correspond to the different errors *H*(*k*) of the two slaves to their master (red and green crosses) and between the two slaves (grey crosses); all three grow linearly at essentially the same rate. Without noise, each minimizer’s reward *r*(*k*) would be identical and they would consequently all follow the same trajectories *W*^{DMD}(*k*) and arrive at the same minimum *W*^{DMD}(*k*^{min}).

To understand this behaviour, we therefore have to consider the impact of noise and learning upon the system’s error. The relevant quantity is the mean modification of output *y*^{out}(*k*, *n*+1) within a certain window, which during training contains *T* sample points. Some general considerations regarding our system are in order. A weight mutation modifies *y*^{out}(*k*) according to a normalized Gaussian distribution with a width of *σ*^{l}, and the noise of *y*^{out}(*k*) is excellently approximated by Gaussian white noise with a width of *σ*^{n}. The entries of *W*^{DMD} remain approximately evenly distributed between zeros and ones for all *k*. Learning only modifies readout connections, and the widths of the modifications of *y*^{out} induced by learning and noise therefore remain constant for all *k*, which is expressed by Eq. (12).

The fact that, according to Eq. (12), noise (*σ*^{n}) and learning (*σ*^{l}) both modify the system’s error according to the same relationship is of general importance. During convergence, the ratio between *σ*^{n} and *σ*^{l} remains constant, as both scale with the same constant. Noise and weight modifications are therefore independent players, whose actions upon learning are in a sense competitive.

The objective of modifying a readout weight is to probe the error landscape’s systematic gradient. However, this action is contaminated by noise, which can exceed the systematic gradient in the opposite direction. The consequence is a change in the sign of the measured gradient, and reward *r*(*k*) is inverted. How likely such an inversion is depends on the relative amplitudes of *σ*^{n} and *σ*^{l}, and *C* is the constant probability of such an inversion occurring. The analytical derivation of *C* is possible, yet beyond the scope of this manuscript.

Probability *C* is the driving force behind the growing separation between two identical minimizers, and two situations are relevant. The first situation occurs when *r*(*k*) for one minimizer is inverted by noise while the other preserves its systematic value, which has a probability of *C*(1−*C*)+(1−*C*)*C* = 2*C*(1−*C*). The other situation is that both minimizers have an identical reward *r*(*k*), which can be the consequence either of both retaining their systematic result or of both being inverted by noise, with a combined probability of (1−*C*)^{2}+*C*^{2} = 1−2*C*(1−*C*). The first situation leads to *H*(*k* + 1) ≠ *H*(*k*), the second to *H*(*k* + 1) = *H*(*k*), and the Hamming distance’s rate equation is

*l*(*k*) to be identical or opposite, respectively. Using

The Hamming distance’s evolution is therefore governed by noise quantified through constant *C*.
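The role of *C* can be illustrated with a small Monte Carlo sketch of two twin minimizers. It rests on simplifying assumptions that are not taken from the experiment: both twins probe the same, uniformly random position at each epoch, they share the same systematic reward, and each twin’s reward is inverted independently with probability *C*. Under these assumptions a differing reward, which occurs with probability 2*C*(1−*C*), toggles whether the two weights at that position agree, giving the recursion *H*(*k*+1) = *H*(*k*) + 2*C*(1−*C*)(1 − 2*H*(*k*)/*N*). This recursion is our reading of the argument above, not a reproduction of the manuscript’s equations, and all names are hypothetical.

```python
import numpy as np

def simulate_twins(n_weights=961, flip_prob=0.1, epochs=3000, seed=0):
    """Hamming distance between two identically started, noisy minimizers."""
    rng = np.random.default_rng(seed)
    w_a = rng.integers(0, 2, n_weights).astype(bool)
    w_b = w_a.copy()                                # identical start, H = 0
    hamming = np.zeros(epochs + 1, dtype=int)
    for k in range(epochs):
        l = rng.integers(n_weights)                 # same position for both twins
        systematic = rng.random() < 0.5             # placeholder systematic reward
        for w in (w_a, w_b):
            inverted = rng.random() < flip_prob     # noise inverts the reward
            if systematic != inverted:              # effective reward is 1
                w[l] = not w[l]                     # mutation is kept
        hamming[k + 1] = np.count_nonzero(w_a != w_b)
    return hamming

def rate_model(n_weights, flip_prob, epochs):
    """Mean-field recursion for the Hamming distance under the same assumptions."""
    h = np.zeros(epochs + 1)
    rate = 2.0 * flip_prob * (1.0 - flip_prob)
    for k in range(epochs):
        h[k + 1] = h[k] + rate * (1.0 - 2.0 * h[k] / n_weights)
    return h

sim = simulate_twins()
model = rate_model(961, 0.1, 3000)
print("simulated H:", sim[-1], " model H:", round(model[-1], 1))
```

In this simplified picture the distance grows almost linearly at first, with slope 2*C*(1−*C*), and saturates at *N*/2, where the two configurations are fully decorrelated.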

For fully random mutation, the probability of a weight to be selected is identical at every *k*, and hence the Hamming distance at the previous epoch *k* determines the probability of two weights being opposite in their configuration: *a*. The probability of both minimizers to be configured opposite for all *a* is therefore their Hamming distance at the end of the previous interval:

Figure 4b shows the evolution of Hamming distance *W*^{DMD}(1) we always have *H*(1) = 0. Greedy mutations in the experiment (analytics) are the red line (black dashed line), while random mutations in the experiment (analytics) are the blue line (black solid line). For both scenarios, greedy and random descent, the experimental data are the average obtained from 20 minimizers. We then changed the starting conditions and realized two parallel minimizers which started with a separation *H*(1) > 0, see Figure 4c. In general, the evolution according to Eq. (15) reproduces the results of the highly different experimental learning scenarios very well. In particular for the averaged data, where we always arrive at

Different minimizers therefore always arrive at final readout configurations which share no common feature. This suggests a closer look into the role and relevance of individual weights: how many induce a systematic contribution to convergence at all, and if their gradients depend on the sequence of previous mutations. We optimized readout weights via two minimizers starting at different random positions *W*^{DMD,a}(1) and *W*^{DMD,b}(1), which arrived at two distinct local minima *m* weights where *M*_{b} differs from *M*_{a}. The list is randomly arranged in sequence *r* (*k*) is taking place. Starting from *M*_{a} (*M*_{b}), this results in a random path **l**(*k*) are the ones in an opposite configuration for *M*_{a} and *M*_{b}, *P*_{a} and *P*_{b} connect both minima along inverted trajectories, see Figure 5a. We probe error *P*_{a} and *P*_{b} and determine error gradients *m* = 430 different dimensions between *M*_{a} and *M*_{b}. Only

Weights insensitive (sensitive) to preceding optimizations correspond to linearly independent (linearly dependent) NN dimensions. Linearly independent NN dimensions must always induce the same gradient, regardless of the preceding optimization path, and they therefore have to be located on the red diagonal line in Figure 5b. The figure’s green area indicates the linearly independent criterion when considering the impact of noise *σ*^{n}, and we find *m*−1 dimensions, which for the 2^{429} possibilities is prohibitive to prove experimentally. The NN dimensions whose weight configuration depends on the previously optimized weights lie outside the grey and green areas. This is a sufficient criterion for linearly dependent NN dimensions, and we find

## 6 Discussion

Our experimental findings and analytical descriptions are the first of their kind and stimulate a fundamental discussion. Equation (12) is of interesting consequence for noisy hardware NNs comprising linear readout weights. It links the susceptibility of

We would like to also propose alternative noise-mitigation approaches which are a derivative of our findings. Simply suppressing noise on a hardware level is potentially expensive, and topological requirements can limit mitigation based on connectivity statistics [14]. One might therefore curb the impact of noise by modified learning strategies. Noise will first of all limit the absolute performance, but also cause this performance to fluctuate, which is an effect one could address for example by amending an optimization’s cost function by the gradients encountered in the proximity of a neighborhood. According to Eq. (12) the local gradients

Equation (15) shows that for *C* > 0 the Hamming distance between readout weights of two systems will always tend towards complete decorrelation as

We have shown that the large majority of our NN’s dimensions are most likely linearly dependent. What this means in practical terms is that each modification of a weight has to be interpreted in the context of all previous modifications. Each configuration *W*^{DMD}(*k*) therefore encodes the history of modifications to the reward due to noise during the previous learning epochs.

One direct consequence for applications is that one cannot simply transfer or swap weight configurations between optimized analogue neural networks, even for potentially available identical twin networks. The reason is that optimized configurations are not only the consequence of error landscape, system properties and noise, but also of the precise history of noise during an exploration path. Even almost perfectly reproducible hardware networks will therefore always have to be individually trained for optimal performance; simply uploading a configuration will potentially not work. A ‘school’ in which each neural network learns individually might therefore be required. Finally, our findings open a new field where such twin-minimizers could be considered for probing and interrogating unknown hardware neural networks. The average divergences shown in Figure 4b agree exceptionally well with our model, and based on such data one can therefore make accurate inferences about the noise properties of a hardware NN and about its error landscape exploration strategy.

## 7 Conclusions

In our work we have investigated the intricate interactions between different learning concepts and the noise inherently present in analogue neural networks. We experimentally showed that trajectories of individual minimizers (i.e. learning trajectories) strongly diverge, and were able to analytically link this divergence to a constant ratio between output error and noise susceptibility. Our analytical description only assumes a linear multiplication between a NN’s state and its readout weights, and hence should be generally applicable to this wide class of analogue hardware NNs.

The authors acknowledge the support of the Region Bourgogne Franche-Comté. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 860360 (POST DIGITAL) and No. 713694 (MULTIPLY). This work was also supported by the EUR EIPHI program (Contract No. ANR-17-EURE-0002), the BiPhoProc project (Contract No. ANR-14-OHRI-0002-02), and by the Volkswagen Foundation (NeuroQNet).


**Author contribution:** All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.


**Research funding:** This research was funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 860360 (POST DIGITAL) and No. 713694 (MULTIPLY). This work was also supported by the EUR EIPHI program (Contract No. ANR-17-EURE-0002), the BiPhoProc project (Contract No. ANR-14-OHRI-0002-02), and by the Volkswagen Foundation (NeuroQNet).


**Conflict of interest statement:** The authors declare no conflicts of interest regarding this article.

## References

- [2] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, OpenFace: A General-Purpose Face Recognition Library with Mobile Applications, Pittsburgh, Carnegie Mellon University, Tech. Rep., 2016.

- [3] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, IEEE, 2013, pp. 6645–6649.

- [4] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science, vol. 361, pp. 1004–1008, 2018.

- [5] Y. Shen, N. C. Harris, S. Skirlo, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics, vol. 11, pp. 441–446, 2017, https://doi.org/10.1038/nphoton.2017.93.

- [6] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: an integrated network for scalable photonic spike processing,” J. Lightwave Technol., vol. 32, no. 21, pp. 3427–3439, 2014.

- [7] L. Appeltant, M. C. Soriano, G. V. D. Sande, et al., “Information processing using a single dynamical node as complex system,” Nat. Commun., vol. 2, p. 468, 2011, https://doi.org/10.1038/ncomms1476.

- [8] F. Duport, B. Schneider, A. Smerieri, M. Haelterman, and S. Massar, “All-optical reservoir computing,” Opt. Express, vol. 20, pp. 22783–22795, 2012, https://doi.org/10.1364/OE.20.022783.

- [9] L. Larger, M. C. Soriano, D. Brunner, et al., “Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing,” Opt. Express, vol. 20, pp. 3241–3249, 2012, https://doi.org/10.1364/OE.20.003241.

- [10] D. Brunner and I. Fischer, “Reconfigurable semiconductor laser networks based on diffractive coupling,” Opt. Lett., vol. 40, pp. 3854–3857, 2015, https://doi.org/10.1364/OL.40.003854.

- [11] J. Torrejon, M. Riou, F. A. Araujo, et al., “Neuromorphic computing with nanoscale spintronic oscillators,” Nature, vol. 547, pp. 428–431, 2017, https://doi.org/10.1038/nature23011.

- [12] N. H. Farhat, D. Psaltis, A. Prata, and E. Paek, “Optical implementation of the Hopfield model,” Appl. Opt., vol. 24, no. 10, pp. 1469–1475, 1985, https://doi.org/10.1364/AO.24.001469.

- [13] D. Psaltis and N. Farhat, “Optical information processing based on an associative-memory model of neural nets with thresholding and feedback,” Opt. Lett., vol. 10, pp. 98–100, 1985, https://doi.org/10.1364/ol.10.000098.

- [14] N. Semenova, X. Porte, L. Andreoli, M. Jacquot, L. Larger, and D. Brunner, “Fundamental aspects of noise in analog-hardware neural networks,” Chaos, vol. 29, no. 10, p. 103128, 2019, https://doi.org/10.1063/1.5120824.

- [15] M. Hermans, P. Antonik, M. Haelterman, and S. Massar, “Embodiment of learning in electro-optical signal processors,” Phys. Rev. Lett., vol. 117, 2016, Art no. 128301. https://doi.org/10.1103/PhysRevLett.117.128301.

- [16] P. Antonik, M. Haelterman, and S. Massar, “Brain-inspired photonic signal processor for generating periodic patterns and emulating chaotic systems,” Phys. Rev. Appl., vol. 7, 2017, Art no. 054014. https://doi.org/10.1103/PhysRevApplied.7.054014.

- [17] J. Bueno, S. Maktoobi, L. Froehly, et al., “Reinforcement learning in a large scale photonic recurrent neural network,” Optica, vol. 5, pp. 756–760, 2018, https://doi.org/10.1364/OPTICA.5.000756.

- [18] R. Alata, J. Pauwels, M. Haelterman, and S. Massar, “Phase noise robustness of a coherent spatially parallel optical reservoir,” IEEE J. Select. Top. Quant. Electron., vol. 26, no. 1, pp. 1–10, 2020.

- [19] M. C. Soriano, S. Ortín, D. Brunner, et al., “Optoelectronic reservoir computing: tackling noise-induced performance degradation,” Opt. Express, vol. 21, pp. 12–20, 2013, https://doi.org/10.1364/OE.21.000012.

- [20] S. Maktoobi, L. Froehly, L. Andreoli, et al., “Diffractive coupling for photonic networks: how big can we go?,” IEEE J. Select. Top. Quant. Electron., vol. 26, no. 1, pp. 1–8, 2020.

- [21] H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” Science, vol. 304, pp. 78–80, 2004, https://doi.org/10.1126/science.1091277.

- [22] E. Heinsalu, E. Hernández-García, and C. López, “Competitive Brownian and Lévy walkers,” Phys. Rev. E Stat. Nonlinear Soft Matter Phys., vol. 85, no. 4, pp. 1–10, 2012, https://doi.org/10.1103/PhysRevE.85.041105.

- [23] J. Bueno, D. Brunner, M. Soriano, and I. Fischer, “Conditions for reservoir computing performance using semiconductor lasers with delayed optical feedback,” Opt. Express, vol. 25, no. 3, pp. 2401–2412, 2017, https://doi.org/10.1364/OE.25.002401.

- [24] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine learning and the bias-variance trade-off,” Proc. Natl. Acad. Sci., vol. 116, no. 32, pp. 15849–15854, 2019, https://doi.org/10.1073/pnas.1903070116.

- [25] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” arXiv preprint arXiv:1903.08560, 2019.

- [26] F. Hadaeghi and H. Jaeger, “Computing optimal discrete readout weights in reservoir computing is NP-hard,” Neurocomputing, vol. 338, pp. 233–236, 2019, https://doi.org/10.1016/j.neucom.2019.02.009.

- [27] X. Porte, L. Andreoli, M. Jacquot, L. Larger, and D. Brunner, “Reservoir-size dependent learning in analogue neural networks,” in Artificial Neural Networks and Machine Learning – ICANN 2019: Workshop and Special Sessions, I. V. Tetko, V. Kůrková, P. Karpov, and F. Theis, Eds., Cham, Springer International Publishing, 2019, pp. 184–192.

- [28] S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” in Advances in Neural Information Processing Systems, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, 2018, pp. 3727–3737.

- [29] M. Freiberger, A. Katumba, P. Bienstman, and J. Dambre, “Training passive photonic reservoirs with integrated optical readout,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 7, pp. 1943–1953, 2019, https://doi.org/10.1109/TNNLS.2018.2874571.

- [30] A. Suarez-Perez, G. Gabriel, B. Rebollo, et al., “Quantification of signal-to-noise ratio in cerebral cortex recordings using flexible MEAs with co-localized platinum black, carbon nanotubes, and gold electrodes,” Front. Neurosci., vol. 12, pp. 1–12, 2018, https://doi.org/10.3389/fnins.2018.00862.