Misalignment Resilient Diffractive Optical Networks

As an optical machine learning framework, Diffractive Deep Neural Networks (D2NN) take advantage of data-driven training methods used in deep learning to devise light-matter interaction in 3D for performing a desired statistical inference task. Multi-layer optical object recognition platforms designed with this diffractive framework have been shown to generalize to unseen image data, achieving, e.g., >98% blind inference accuracy for handwritten digit classification. The multi-layer structure of diffractive networks offers significant advantages in terms of their diffraction efficiency, inference capability and optical signal contrast. However, the use of multiple diffractive layers also brings practical challenges for the fabrication and alignment of these diffractive systems for accurate optical inference. Here, we introduce and experimentally demonstrate a new training scheme that significantly increases the robustness of diffractive networks against 3D misalignments and fabrication tolerances in the physical implementation of a trained diffractive network. By modeling the undesired layer-to-layer misalignments in 3D as continuous random variables in the optical forward model, diffractive networks are trained to maintain their inference accuracy over a large range of misalignments; we term this diffractive network design as vaccinated D2NN (v-D2NN). We further extend this vaccination strategy to the training of diffractive networks that use differential detectors at the output plane as well as to jointly-trained hybrid (optical-electronic) networks to reveal that all of these diffractive designs improve their resilience to misalignments by taking into account possible 3D fabrication variations and displacements during their training phase.

The output aperture array and the 3D-printed MNIST digits were coated with aluminum, except for the openings and the object features. Each aperture at the output plane is a square covering an area of 4.8 mm × 4.8 mm, matching the size assumed during training. The printed MNIST digits were 4 cm × 4 cm in size, sampled at a rate of 0.4 mm in both the x and y directions, matching the training forward model. A 3D-printed holder was used to align the 3D-printed input object, the 5 diffractive layers and the output aperture. Around the location of the 3rd layer, the holder had additional spatial features that allowed us to move this diffractive layer to 13 different locations, including the ideal one (see Fig. 5 of the main text).

Error-free D2NN
In a diffractive optical network, each diffractive feature of a layer represents a complex-valued transmittance learned during the training process, which optimizes the thickness, h, of the features based on the complex-valued refractive index of the 3D-fabrication material, τ = n + jκ. The characterization of the printing material in a THz-TDS setup revealed n = 1.7227 and κ = 0.031 for monochromatic THz light at 400 GHz. Our formulation represents the complex-valued transmittance function of a diffractive feature on layer l, at coordinates (x_q, y_q, z_l), as

t(x_q, y_q, z_l) = exp(-2πκ h(x_q, y_q, z_l)/λ) exp(j2π(n - n_air) h(x_q, y_q, z_l)/λ), (S1)

where h(x_q, y_q, z_l), n_air and z_l denote the thickness of a given feature, the refractive index of air and the axial location of layer l, respectively. From the Rayleigh-Sommerfeld theory of diffraction, every diffractive unit on layer l, at (x_q, y_q, z_l), can be interpreted as the source of a secondary wave,

w_q^l(x, y, z) = ((z - z_l)/r^2) (1/(2πr) + 1/(jλ)) exp(j2πr/λ), (S2)

where r = ((x - x_q)^2 + (y - y_q)^2 + (z - z_l)^2)^0.5. Therefore, the complex field coming out of the q-th feature of the (l+1)-th layer, u_q^(l+1)(x, y, z), can be written as

u_q^(l+1)(x, y, z) = t(x_q, y_q, z_(l+1)) w_q^(l+1)(x, y, z) ( Σ_(k∈l) u_k^l(x_q, y_q, z_(l+1)) ), (S3)

i.e., the secondary wave of a feature is modulated by its transmittance and weighted by the total field arriving from all the features of the previous layer.
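As a minimal numerical sketch of Eq. (S1), the snippet below evaluates the complex transmittance of one feature from its thickness, using the material parameters quoted above; the function name is ours, and λ = 0.75 mm is our assumption derived from c/400 GHz (consistent with the 0.4 mm = 0.53λ sampling stated below).

```python
import numpy as np

# Material parameters at 400 GHz from the THz-TDS characterization above.
n, kappa = 1.7227, 0.031
n_air = 1.0
wavelength = 0.75  # mm, assumed from c / 400 GHz

def feature_transmittance(h):
    """Complex transmittance of one diffractive feature of thickness h (mm), Eq. (S1)."""
    amplitude = np.exp(-2 * np.pi * kappa * h / wavelength)  # absorption term
    phase = 2 * np.pi * (n - n_air) * h / wavelength         # phase delay vs. air
    return amplitude * np.exp(1j * phase)
```

A zero-thickness feature gives unit transmittance, and thicker features attenuate and delay the field, as the two exponential factors in Eq. (S1) dictate.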
We sampled our diffractive fields and surfaces at an interval of 0.4 mm, which is equal to 0.53λ. The smallest diffractive feature size was also 0.4 mm. The learnable thickness of each feature, h, was defined over an auxiliary variable, h_a, as

h = q(h_m (sin(h_a) + 1)/2) + h_b, (S4)

where h_m and h_b denote the maximum modulation thickness and the base thickness, respectively. Taking h_b as 0.5 mm and h_m as 1 mm, we limited the printed thickness values to between 0.5 mm and 1.5 mm. The minimum thickness h_b was mainly used to ensure the mechanical stability of the 3D-printed layers against cracks and bending. The operator q(.) in Eq. (S4) represents the quantization operator; we quantized the thickness values to 16 discrete levels (0.0625 mm per step). For the initialization of the diffractive layers at the beginning of training, the thickness of each feature was taken as a uniformly distributed random variable between 0.9 mm and 1.1 mm, including the base thickness.
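The thickness parametrization can be sketched as follows. The sin-based mapping of the unconstrained auxiliary variable h_a into [0, h_m] is our assumption of the conventional D2NN construction (the original Eq. (S4) is not legible in this text); the quantization and the [0.5, 1.5] mm range follow the numbers stated above.

```python
import numpy as np

h_b, h_m = 0.5, 1.0  # base and maximum modulation thickness (mm)
levels = 16          # quantization levels, 0.0625 mm per step

def feature_thickness(h_a):
    """Map the auxiliary variable h_a to a printable thickness (sketch of Eq. (S4))."""
    h_mod = h_m * (np.sin(h_a) + 1.0) / 2.0  # confine modulation part to [0, h_m]
    step = h_m / levels                      # 0.0625 mm per quantization step
    h_q = np.round(h_mod / step) * step      # q(.): snap to the discrete levels
    return h_q + h_b                         # total thickness in [0.5, 1.5] mm
```

Because h_a is unconstrained, gradient descent can update it freely while the emitted thickness always stays within the printable range.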

Vaccinated D2NN
The training of the vaccinated diffractive optical networks follows the same optical forward model outlined in the previous section, except that it additionally introduces statistical variations that model the error sources in a diffractive network. The components of the 3D displacement vector of the l-th diffractive layer, D_l = (D_x^l, D_y^l, D_z^l), were defined as uniformly distributed random variables by Eq. (1) of the main text. The vaccination strategy uses a different set of displacement vectors at every iteration (batch) to introduce undesired misalignments of the diffractive layers during the training. With D^(l,i) = (D_x^(l,i), D_y^(l,i), D_z^(l,i)) denoting the random displacement that the l-th layer experiences at the i-th iteration, Eq. (S3) was adjusted according to the longitudinal shifts of the successive layers, D_z^(l,i) and D_z^(l+1,i), i.e., the light propagation distances between the diffractive layers were varied at every iteration. To implement the continuous lateral displacement of the diffractive layers, we used the Fourier shift theorem:

t^(l,i)(x - D_x^(l,i), y - D_y^(l,i)) = F^-1{ T^(l,i)(u, v) exp(-j2π(u D_x^(l,i) + v D_y^(l,i))) }, (S5)

where t^(l,i)(x, y) denotes the 2-dimensional complex modulation function of layer l at the i-th iteration, T^(l,i)(u, v) represents its spatial Fourier transform defined over the 2D spatial frequency space (u, v), and F^-1 denotes the inverse Fourier transform.
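The Fourier-transform-based lateral shift can be implemented with FFTs so that sub-pixel (continuous) displacements are handled exactly; below is a minimal NumPy sketch (function name ours) using the 0.4 mm sampling interval stated earlier.

```python
import numpy as np

def laterally_shift_layer(t, dx, dy, pixel=0.4):
    """Shift a complex layer modulation t(y, x) by a continuous (dx, dy) in mm
    via the Fourier shift theorem: multiply the spectrum by a linear phase ramp."""
    ny, nx = t.shape
    u = np.fft.fftfreq(nx, d=pixel)  # spatial frequencies along x (1/mm)
    v = np.fft.fftfreq(ny, d=pixel)  # spatial frequencies along y (1/mm)
    U, V = np.meshgrid(u, v)
    T = np.fft.fft2(t)
    phase_ramp = np.exp(-2j * np.pi * (U * dx + V * dy))
    return np.fft.ifft2(T * phase_ramp)
```

Shifting by exactly one pixel reproduces an integer translation, while fractional displacements produce the band-limited interpolation of the layer, which keeps the forward model differentiable with respect to the random displacements.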

Training of all-optical and hybrid classification systems
Loss function and class scores
In our forward training model, without loss of generality, we modeled our detectors as ratiometric sensors that report the ratio of the optical power incident on their active area, P_d, to the optical power incident on the object at the input plane, P_obj. Based on this, the optical signal vector collected by the output detectors, I_d, was formulated as

I_d = P_d / P_obj. (S6)

For all three diffractive object classification systems depicted in Fig. 1 of the main text, the cost function was defined as the widely used softmax-cross-entropy (SCE),

Loss = -Σ_(c=1)^C g_c log( exp(s_c) / Σ_(c'=1)^C exp(s_c') ), (S7)

where g_c, s_c and C denote the binary entry in the label vector, the computed class score for data class c, and the number of data classes in a given dataset (e.g., C = 10), respectively.
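The SCE loss on the class scores can be sketched as follows (function name ours); the max-subtraction is a standard numerical-stability step, not part of the definition.

```python
import numpy as np

def softmax_cross_entropy(scores, label_onehot):
    """SCE loss: -sum_c g_c * log(softmax(s)_c)."""
    z = scores - np.max(scores)              # stabilize the exponentials
    softmax = np.exp(z) / np.sum(np.exp(z))  # predicted class probabilities
    return -np.sum(label_onehot * np.log(softmax))
```

For uniform scores over C = 10 classes the loss equals log(10), its chance-level value, which is a quick sanity check during training.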
For the standard diffractive optical network architecture shown in Fig. 1A, the number of class detectors, N_d, is equal to the number of data classes, C. In this scheme, the class score vector, s, was computed by

s_c = T I_c / (Σ_(c'=1)^C I_c' + ε), (S8)

where T and ε are constants, i.e., non-trainable hyperparameters used during the training phase. The multiplicative factor T was empirically set to 10 to generate artificial signal contrast at the input of the softmax function for more efficient convergence of the training. The constant ε, on the other hand, was used to regularize the power efficiency of the standard diffractive object recognition systems. In particular, the standard diffractive neural network models presented in Figs. 2A, 2D and 3 of the main text, as well as in Supplementary Figs. S4A, S4D and S5, were trained by taking ε = 10^-4, which results in low power efficiency, η, and low signal contrast, ψ. The 3D-printed diffractive optical networks, on the other hand, were trained by setting ε = 10^-3 to circumvent the effects of the limited signal-to-noise ratio in our experimental system. Trained with a higher ε value, these diffractive networks offer slightly compromised blind testing accuracies while providing significantly improved power efficiency, η, and signal contrast, ψ, which are defined as

η = I_gt, ψ = (I_gt - I_sc) / I_gt, (S9)

where I_gt and I_sc denote the optical signals measured by the class detector representing the ground truth label of the input object and by its strongest competitor, i.e., the second maximum for a correctly classified input object, respectively. A comparison between the inference performances of low- and high-contrast variants of the vaccinated and non-vaccinated standard diffractive optical networks under various levels of misalignment is presented in Fig. S1. As depicted in Fig. S1A, the high-contrast, high-efficiency standard diffractive networks are more robust against undesired system variations/misalignments compared to their low-efficiency counterparts when both networks are trained under error-free conditions. Figure S1B, on the other hand, compares the standard diffractive network architectures tested within the same misalignment range used in their training. In this case, the low-contrast, power-inefficient diffractive networks show their higher inference capacity and adapt to the misalignments more effectively than the diffractive classification systems trained to favor higher power efficiency.
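The two figures of merit can be computed from the detector signal vector as in the sketch below, assuming η = I_gt and ψ = (I_gt - I_sc)/I_gt; this is our reading of the definitions of I_gt and I_sc given above, and the function name is ours.

```python
import numpy as np

def efficiency_and_contrast(I, gt_index):
    """Power efficiency and signal contrast of a correctly classified input.
    I is the detector signal vector (detector power normalized by object power),
    gt_index is the ground-truth class detector."""
    I_gt = I[gt_index]
    I_sc = np.max(np.delete(I, gt_index))  # strongest competitor signal
    return I_gt, (I_gt - I_sc) / I_gt
```

A contrast ψ near 1 means the correct detector dwarfs its strongest competitor, which is what makes a design robust to signal perturbations.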
In a differential diffractive optical network system, the number of detectors is doubled, i.e., N_d = 2C, where each pair contributes a negative signal, I_d-, and a positive signal, I_d+, to the normalized differential signal, I_(d,n) (see Fig. 1B of the main text), defined as

I_(d,n) = (I_d+ - I_d-) / (I_d+ + I_d-). (S10)

In parallel, the class scores of a differential diffractive object classification system, s, are calculated by replacing the optical signal vector, I_d, in Eq. (S8) with the normalized differential signals, I_(d,n), depicted in Eq. (S10).
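A minimal sketch of the per-pair differential normalization, assuming the standard form (I+ - I-)/(I+ + I-) for each class-detector pair (the original equation is not legible in this text):

```python
import numpy as np

def differential_signals(I_plus, I_minus):
    """Normalized differential class signals from paired detector readings.
    Each class has one positive and one negative detector; the ratio is
    bounded in [-1, 1] regardless of the absolute optical power."""
    return (I_plus - I_minus) / (I_plus + I_minus)
```

Because the normalization cancels common-mode power fluctuations within each detector pair, the differential scheme is inherently less sensitive to global intensity variations than a single-detector readout.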
It is important to note that Eqs. (S6), (S7) and (S8) concern only the training stage of the diffractive optical networks and the associated all-optical object classification systems. Once the training is completed, these equations are not used in the numerical and experimental blind testing, meaning that the class decision is made solely based on max(P_d) and max(P_(d,n)) in the standard and differential diffractive network systems, respectively.
In the hybrid neural network models, we jointly trained 5-layer diffractive optical networks with an electronic network consisting of a single fully-connected layer with only 110 trainable parameters (100 multiplicative weights + 10 bias terms). During the joint evolution of these two networks, we normalized the optical signal collected by the detectors, I_d, as depicted in Eq. (S8) with T = 1. These normalized detector signals were then fed into the fully-connected layer in the electronic domain to compute the class scores, s, which were used in Eq. (S7) to compute the classification loss before error backpropagation through both the electronic and the diffractive optical networks.
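The electronic part of the hybrid system is small enough to write out explicitly. The sketch below assumes a simple sum normalization of the 10 detector signals (standing in for Eq. (S8) with T = 1) and random initial weights; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10)) * 0.1  # 100 multiplicative weights
b = np.zeros(10)                         # 10 bias terms -> 110 parameters total

def hybrid_class_scores(I_d):
    """Class scores of the hybrid system: normalized detector signals pass
    through the single fully-connected layer of the electronic network."""
    I_n = I_d / np.sum(I_d)  # assumed normalization (T = 1)
    return W @ I_n + b
```

With only 110 parameters the electronic layer adds negligible digital compute, so nearly all of the inference workload remains in the diffractive optical front end.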

Other training related details
All network models used in this work were trained using Python (v3.6.5) and TensorFlow (v1.15.0, Google Inc.). We selected the Adam optimizer for the training of all models, with its parameters kept at their default TensorFlow values and identical across models. The learning rates of the diffractive optical networks and the electronic neural network were set to 0.001 and 0.0002, respectively. The handwritten-digit and fashion-product datasets were both divided into three parts: training, validation and testing, containing 55K, 5K and 10K images, respectively. All object recognition systems were trained for 50 epochs with a batch size of 50, and the best model was selected based on the highest classification performance on the validation dataset. In the training with MNIST digits, the image information was encoded in the amplitude channel at the object plane, while the Fashion-MNIST objects were assumed to be phase-only targets with their gray levels mapped to phase values between 0 and π.
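The schedule above implies 55,000/50 = 1,100 iterations (batches) per epoch, each drawing a fresh set of random displacement vectors in the vaccinated models. As a plain-Python sketch of this configuration (dictionary keys are illustrative, not from the original code):

```python
# Hypothetical training configuration mirroring the values stated in the text.
config = {
    "optimizer": "Adam",        # TensorFlow default beta/epsilon parameters
    "lr_diffractive": 1e-3,     # diffractive optical network learning rate
    "lr_electronic": 2e-4,      # electronic (fully-connected) learning rate
    "epochs": 50,
    "batch_size": 50,
    "splits": {"train": 55000, "val": 5000, "test": 10000},
}

iterations_per_epoch = config["splits"]["train"] // config["batch_size"]
```

Model selection uses the 5K-image validation split after every epoch, so the test split is touched only once, for the final blind accuracy.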

Fig. S2: Experimental image classification results as a function of misalignments. A The experimentally measured class scores for handwritten digit '0' selected from Set 2. B Same as A, except that the input object is now handwritten digit '7' selected from Set 2. The red dot within the coordinate system shown on the left-hand side represents the physical misalignment for each case (see Fig. 5B of the main text). Red (green) rectangles indicate incorrect (correct) inference results.

Fig. S3: Experimental image classification results as a function of misalignments. A The experimentally measured class scores for handwritten digit '2' selected from Set 1. B Same as A, except that the input object is now handwritten digit '3' selected from Set 2. The red dot within the coordinate system shown on the left-hand side represents the physical misalignment for each case (see Fig. 5B of the main text). Red (green) rectangles indicate incorrect (correct) inference results.

Fig. S4: The blind inference accuracies achieved by standard, differential and hybrid diffractive network systems for the classification of phase-encoded Fashion-MNIST images. Same as Fig. 2 of the main text, except that the image dataset is Fashion-MNIST. Unlike the amplitude-encoded MNIST images at the input plane, the fashion products were assumed to represent phase-only targets at the object/input plane, with their phase values restricted to between 0 and π.

Fig. S5: Direct comparison of blind inference accuracies achieved by standard, differential and hybrid diffractive network systems for the classification of phase-encoded fashion products. Same as Fig. 3 of the main text, except that the image dataset is Fashion-MNIST. Unlike the amplitude-encoded MNIST images at the input plane, the fashion products were assumed to represent phase-only targets at the object/input plane, with their phase values restricted to between 0 and π.