Integrated multi-operand optical neurons for scalable and hardware-efficient deep learning

The optical neural network (ONN) is a promising hardware platform for next-generation neuromorphic computing due to its high parallelism, low latency, and low energy consumption. However, previous integrated photonic tensor cores (PTCs) consume numerous single-operand optical modulators for signal and weight encoding, leading to large area costs and high propagation loss to implement large tensor operations. This work proposes a scalable and efficient optical dot-product engine based on customized multi-operand photonic devices, namely multi-operand optical neurons (MOON). We experimentally demonstrate the utility of a MOON using a multi-operand-Mach-Zehnder-interferometer (MOMZI) in image recognition tasks. Specifically, our MOMZI-based ONN achieves a measured accuracy of 85.89% in the street view house number (SVHN) recognition dataset with 4-bit voltage control precision. Furthermore, our performance analysis reveals that a 128x128 MOMZI-based PTCs outperform their counterparts based on single-operand MZIs by one to two order-of-magnitudes in propagation loss, optical delay, and total device footprint, with comparable matrix expressivity.


Introduction
The optical neural network (ONN) is an emerging analog artificial intelligence (AI) accelerator that leverages properties of photons, including low latency, wide bandwidth, and high parallelism [1][2][3], to address the growing demand for computing power required to implement deep neural network (DNN) models.Once weight parameters are set, photonic integrated circuits (PICs) can perform tensor operations with near-zero energy consumption at the speed of light [4,5], making them an ideal platform for accelerating multiply-accumulate (MAC) operations [6].However, the potential massive parallelism and ultra-high computing speed of ONNs are not fully unleashed with small-size photonic tensor cores (PTCs).To maximize the performance benefit of photonic computing in DNN acceleration, scalable and efficient photonic tensor core designs are in high demand.
The scalability of previous photonic tensor core designs is bottlenecked by the large spatial footprint and insertion loss [7].For instance, an MZI-based coherent PTC [8] requires ( 2 +  2 ) single-operand MZI modulators to construct an  ×  matrix, consuming huge area cost to implement large tensor operations (e.g., 128 × 128).Moreover, the large number (~2) of cascaded optical devices in the critical path of the circuit leads to unacceptable insertion loss.Even with low-loss MZIs such as thermo-optic MZIs (0.5-1 dB) [9], cascading 128 such devices will result in 64 to 128 dB propagation loss.In addition, single-operand-device-based PTCs suffer from nontrivial dynamic energy consumption to reconfigure weight parameters.Given the limited chip area and link budget, we have to serialize the matrix multiplication by repeatedly reusing small-size photonic tensor cores, which incurs much longer latency to implement one matrix-vector multiplication, potentially negating the speed advantage of ONNs over electronic analog AI accelerators [10].
Both circuit-and device-level optimizations have been explored to enhance the scalability of ONNs.Circuit-level approaches, such as the butterfly-style circuit mesh [11], have been explored to reduce hardware usage [12,13].Moreover, compact device-level photonic tensor cores, such as star couplers and metasurfaces [5,6], have been proposed to significantly reduce the device footprint and improve the hardware efficiency of tensor operations.However, one major challenge with compact photonic circuit mesh or passive device-level tensor cores is their limited matrix representability, which usually results in accuracy degradation when implementing complicated AI tasks.To address this challenge, we suggest using active devicelevel photonic tensor cores, which offer the potential to achieve both high representability and high hardware efficiency.Recently there has been a trend to use multi-operand devices for vector operations, which shows great potential to achieve efficiency and scalable breakthroughs.We partition the phase shifter into multiple small segments, each being independently controlled.By leveraging the underlying device transfer function, we can then realize vector operations with nearly the same device footprint and tuning range as the singleoperand one.In this work, for the first time, we officially name this photonic structure a multioperand optical neuron (MOON).Prior work has proposed a microring-based MOON and showed its advantages over standard single-operand micro-ring in neuromorphic computing through simulation [14].In this work, we introduce a new broadband device in this MOONfamily, a multi-operand MZI (MOMZI), and experimentally demonstrate its superior efficiency and scalability for next-generation photonic neuromorphic computing.
In this work, we customize a MOMZI, whose modulation arm is controlled by multiple independent signals, and leverage its transmission to realize vector-vector dot-product.A operand (-op) MOMZI can be used as a length- vector dot-product engine, directly saving the device usage by a factor of  compared to single-operand MZI arrays [8].Note that the device footprint and tuning range keep constant and will not scale with .By combing the result from multiple -op MOMZIs, we can efficiently scale up to operations with a large vector length with near-constant insertion loss.Using devices from foundry process design kits (PDKs) [15], 128×128 photonic tensor cores based on our MOMZIs show a 6.2× smaller total device footprint, 49× lower optical delay, and >256 dB lower propagation loss than previous single-operand MZI arrays [8].We experimentally demonstrated the representability and trainability of an ONN constructed by 4-op MOMZIs on the street view house number (SVHN) recognition task [16], achieving a measured accuracy of 85.89% with 4-bit voltage control precision.Our proposed MOMZI-based photonic tensor core enables the implementation of high-performance and energy-efficient neuromorphic computing with a small device footprint, low propagation delay, and low energy consumption.

Multi-operand optical neurons
A typical photonic tensor core to implement MAC operation is in Fig. 1(a), which contains photonic components to generate input signals, the weight matrix, and the outputs. highspeed modulators are needed in an -input, -output layer.Depending on the weight mapping approach, one needs (−1)+(−1) 2 + max (, ) [8] or  ×  active photonic components [17] to implement a  ×  weight matrix.Furthermore, ~2 active devices are cascaded in one optical path, resulting in nonnegligible propagation loss and requiring more laser power to drive the photonic neural chip.
In this study, we propose a novel approach to reduce the optical component usage by implementing the multiply-accumulate (MAC) operation using an array of multi-operand-modulator-based optical neurons (MOONs), as shown in Fig. 1(b).Depending on the area and reliability concerns, one MOON can be a multi-operand active photonic device of any waveguide structure, such as MZI modulators and microring modulators.As illustrated in Fig. 1(c), each row of the layer is divided into  * =   -operand modulators, and the output of each -operand modulator is accumulated using on-chip combiners or multiplexers to compute the final output of each row.Consequently, the total number of MOONs required for an -input, -output layer is   , significantly reducing the number of active optical components.
Unlike conventional PTCs designed for GEMM, the nonlinearity transfer function between the electrical signal and the transmission of the MOON needs to be considered when training DNN models.The input vector   is encoded as the amplitude of the optical signals and will also be partitioned into  * segments   = (   ,    , … ,    ).Thus, the output signals of one layer can be expressed as follows: Our MOON-based PTC significantly improves computational efficiency compared to previous GEMM-based PTCs [8].For example, a -operand MOMZI has a similar device footprint and dynamic tuning range to a single-operand MZI, but it can implement  MACs.This outperforms a single-operand MZI in area-and energy-efficiency since it can only perform approximately one MAC operation per device in single-operand MZI-based PTCs.The advantage of MOONs lies in their ability to perform multiple MAC operations using a single device, making them more computationally efficient than previous ONNs.
Moreover, as shown in Fig. 1(c), only one MOON is cascaded in one optical path of our circuit architecture, resulting in much smaller propagation loss compared to MZI-based or microring-based ONNs, where 2 + 1 MZIs or  microrings are cascaded.As a result, we can deploy compact but lossy optical modulators, e.g., plasmonic-on-silicon modulators [19], as MOONs in our PTC, trading higher insertion loss for a much smaller chip footprint and lower modulation power.Detailed performance evaluations will be provided in our discussions.

Multi-operand-MZI-based optical neural network
In this work, we demonstrate the use of -operand MZI modulators as the fundamental building blocks for constructing our MOMZI-PTC.Figure.1(d) shows the structure of a MOMZI.Unlike the traditional MZI modulators with one or two phase modulators, a -op MOMZI has  active phase shifters on each modulation arm, and each phase shifter is controlled by an independent signal.This structure is similar to lumped-segment MZIs used in optical communications [20], but the driving signals on each operand are independent and analog.For MZI modulators with dual modulation arms, the total number of operands can increase to 2 to enable both positive and negative phase shifts.Suppose each shifter contributes to a phase shift   , the output intensity of a MOMZI can be expressed as: where (•) = cos 2 ( The modulation mechanism of the MOMZI plays a critical role in determining their transfer function with an input voltage signal.As shown in Fig. 2, the transfer function of MZI modulators using the same foundry [15] can exhibit sinusoidal, quadratic (linear field intensity response), or other nonlinear transfer functions with the operating voltage .The specific shape of the transfer function depends on the modulation mechanism (  ± ()) and the modulator's waveguide structure ((•)).By optimizing these parameters, one can customize the transfer function of the MOMZI to realize certain nonlinear activation functions of DNNs.We will discuss this hereinafter.Supposing the dot product information   •   is directly encoded as the operating voltage   on each operand of the MOMZI, we can rewrite Eq. ( 2) as Eq. ( 3): In Eq. ( 3), positive or negative phase shifts are achieved by applying the operating voltages to each phase shifter's upper or lower arm.The phase bias   , of the th MOMZI on row  of MOMZI-PTC can be adjusted to improve the expressivity of our neural architecture.The constant  =   can be eliminated after photodetection.Using Eq. ( 3).we can model the MOMZI on commercial deep learning platforms, e.g., PyTorch, making it practical to train and deploy the DNN.

Experimental results
In this study, we designed and fabricated a 4-op MOMZI that is capable of implementing a 4×1 vector operation on the silicon photonics platform.The chip layout was drawn and verified using Synopsys OptoDesigner (version 2021) and then fabricated by AIM Photonics.The schematic of the MOMZI is illustrated in Fig. 3(b), while Fig. 3(a) shows close-up images of its components, including phase shifters, 50-50 directional couplers, and photodetectors. We We tune one additional phase shifter on the upper arm to let   ≈  2 to obtain a relatively linear and balanced output range.
The schematic of the testing setup is illustrated in Fig. 4. Continuous-wave (CW) light is coupled to the chip through an edge coupler.The MOMZI's phase shifters are programmed using a high-precision multi-channel digital-to-analog converter (DAC).The on-chip photodetector, along with an off-chip transimpedance amplifier (TIA), converts the output optical signal to electrical voltage outputs.These converted electrical outputs will subsequently be read using oscilloscopes.A microcontroller is used to program the electrical signals that represent   •   to the DAC and read the output signals in this work.We use computers to process the measurement data, train the DNN parameters, and implement the DNN model.Notably, current fabrication and co-packaging technologies enable the integration of electrical control circuits and the laser on a single substrate [21] or a single chip [22], resulting in much higher compactness, shorter interconnect paths, and higher efficiency.In this work, we construct a CNN with our MOMZI and benchmark its performance on a street view house number (SVHN) dataset.It is more complicated than the MNIST dataset [16] since each image contains color information and various natural backgrounds.To perform convolutional operations with our PTCs, we employ the widely-used tensor unrolling method (im2col) [23].Large-size tensor operations are partitioned into 4×1 blocks and mapped onto our MOMZI.We first calibrate the behavior of each phase shifter for training and model it using Eq. 4, as shown in Figure 5(a).Based on the chip measurement data, our proposed hardware-aware training framework can efficiently train the ONN weights while being fully aware of all the physical non-idealities during optimization, e.g., process variations, thermal crosstalk, and signal quantization [14].The dynamic noises are also measured (shown in Fig. 5(b)) and added to the training framework to improve the robustness of ONNs.Then we map the trained weights to our MOMZI.Finally, we evaluate the task performance of our photonic neural chip on different ML tasks, where partial accumulation, nonlinearity, and other postprocessing operations are offloaded to the digital computer.Figure 5(c) illustrates the network structure for training our MOMZI-ONN.
Our experiments show that under 4-bit voltage control resolution (16 phase shift levels for each operand), the inference accuracy of the CNN reaches ~85.89% in our experimental demonstration.The confusion matrix depicting the prediction results is shown in Fig. 5(d).Figure 5(e)(f) shows the tested probability distribution of different street-view numbers.As a reference, we can achieve 90.6% accuracy using an ideal CNN model with the same network structure on 64-bit computers.One can improve the task performance of MOMZIs using operands with more linear phase responses and higher control precision, which will be shown hereinafter.

Discussion
Expressivity evaluation.Our MOMZI-ONN exhibits comparable trainability and expressivity with ONNs designed for GEMMs with  times fewer optical component usage ( is the number of operands).By explicitly modeling the transfer function of the MOMZI during ONN training, we can efficiently learn the mapping from the software model to the MZI devices.Here, we simulate the task performance of our MOMZI-ONN with different numbers of operands on the SVHN dataset using the same NN model and control precision.An ideal CNN model with the same model architecture is also trained as a reference.In the evaluation, the phase response of each operand is  = , which can be realized on linear phase shifters such as lithium niobate EO phase shifters [24].In simulations, we add a phase bias   =  2 to enable a balanced output range.The evaluation results are shown in Fig. 6, showing that our MOMZI-ONNs can achieve >91% accuracy on the SVHN dataset, which has a <0.6% accuracy difference compared to the ideal CNN model.It should be noted that the task performance of MOMZI-ONN is insensitive to the number of operands once we properly normalize the operands.Moreover, the number of active photonic devices to implement an -input, -output linear layer are   -operand MOMZIs.Therefore, ONNs based on MOMZIs with a large number of operands will significantly reduce the hardware cost without accuracy loss.Propagation delay and loss.By minimizing the number of cascaded optical components in the critical path of PTCs, MOMZI-PTC outperforms single-operand MZI-PTC in both propagation loss and optical delay by one to two orders of magnitude.In this work, we evaluate the propagation delay and loss of MOMZI-PTC using the foundry's process-design-kit (PDK) libraries.The parameters of optical devices are given in Table.S1.As shown in Fig. 1(c), the MOMZIs in one optical path are placed parallelly in our PTC, so the insertion loss and propagation loss contributed by lossy MZIs will not accumulate when the size of the DNN model increases.As a result, the optical delay and the propagation loss of a MOMZI-PTC with n-inputs and m-outputs can be calculated as follows: Eq. ( 5)  − =   +   Eq. ( 6) In Eq. ( 5),   = 4.3 is the group index of silicon waveguides.  is the length of the MOMZI, which depends on the operands and the waveguides used to connect these operands.Since the tuning ranges of a MOMZI and a single-operand MZI are the same, the total length of the operands of MOMZI should also be the same as the length of a high-speed electro-optic (EO) modulator.Here we assume the distance between each operand to be  = 10  based on the device layout of a recently-published two-operand 10- -radius microring modulator [25].  is the length of the on-chip combiners/multiplexers, for microringfilter-based multiplexers,   =     .  is the insertion loss of one multioperand modulator,   is the total insertion loss of the combiner, which is     with add-drop microrings as multiplexers.Increasing the operand number k can potentially reduce both the IL and propagation delay.
On the other hand, the propagation loss of single-operand MZI-PTC can be estimated as ( +  + 1) () +  (ℎ) , while the total device length can be expressed as ( +  + 1) () +  (ℎ) .Here, (ℎ) denotes the high-speed EO modulators for input signal encoding, and () is the TO switch for weight encoding.Because the tuning range and the modulation mechanism of the MOMZI should be the same as that of the input EO modulator, we let   =  (ℎ) .The model parameters are available in Table .S1.
The results presented in Fig. 7(a) and 7(b) demonstrate that using the same component library [15], our MOMZI-PTC can achieve an optical delay that is approximately 49 times lower than that of a single-operand MZI-PTC.Furthermore, the propagation loss of our MOMZI-PTC is ~257 dB lower than that of the single-operand MZI-PTC, which results in lower laser power requirements to drive the ONN and a lower response time.Footprint.Our MOMZI-PTC significantly improves the area efficiency and reduces the number of active devices compared to a single-operand MZI-PTC [8].Unlike single-operand active devices such as single-operand MZI, our -op MOMZI is capable of implementing  dot products and  − 1 additions, which results in a much higher hardware efficiency in terms of #MAC/device.The total device footprint of -op MOMZI-PTCs can be estimated using Eq. ( 7): − = ×    +   Eq. ( 7) where we assume a distance of  = 10  between neighboring operands.Suppose the device footprint of a high-speed MZI modulator is  (ℎ) =  (ℎ) •  (ℎ) .The device footprint of one  − op MOMZI can then be estimated as   = � (ℎ) + ( − 1)� •  (ℎ) .Figure .7(c) shows the estimated device footprint of MOMZI-based PTC and single-operand MZI-PTC based on our assumptions.The estimated device footprint of MOMZI-PTC and MZI-PTC is shown in Fig. 7(c).When the matrix size is 128×128, our 128-op MOMZI-PTC consumes ~127× fewer MZI modulators, leading to ~6.2× footprint reduction compared to single-operand MZI-PTC [8] with the same matrix size and optical component selection.
From Eq. ( 7) and Fig. 7(c), MOMZI-based PTC will be more area efficient with a larger number of operands  on each MOMZI.Previous has shown that a 10-μm-radius silicon-based microring modulator can be divided into 32 independent active segments using a 45-nm technology node [27], where each operand only consumes 2μm in length.This means that an MZI-modulator with a 1.6 mm-length modulation arm can support up to 800 operands using current layout technology, which should be comparable with other analog electronic tensor cores in scalability, e.g., 256 × 256 memristor-based crossbar arrays [28].
Another big advantage of the proposed MOON-based PTC is its superior compatibility with compact, high-speed optical modulators, even with high insertion loss, e.g., plasmonic-onsilicon modulators with only 15 μm modulation length and 11.2 dB  [5].The fundamental reason is the small number of cascaded devices in the critical path.Figure 7 shows the normalized device footprint compared with silicon-based MZI-PTCs, which shows the plasmonic-on-silicon-MOMZI-PTC can reduce the footprint by 177× compared to singleoperand silicon-MZI-based PTC.Single-operand MZI-PTCs are not compatible with these compact high-loss modulators because there are 2 + 1 MZIs in the critical path.Using compact high-loss modulators for weight configuration will lead to significant propagation loss and require high laser power to drive the neural chip.
Finally, the hardware cost of MOMZI-PTC can be further optimized with operand pruning strategies.To implement an FC layer in DNN models, especially with sparse matrices [29], we only need to encode non-zero weights on the MOMZIs.The operands of MOMZIs with zero weight values can be either removed from the device to save footprint or power-gated to reduce energy consumption.Sparsity-aware training [30] can be applied to prune redundant MOMZI operands while maintaining task accuracy.For simplicity, we assume the entire waveguide length for connecting the operands is the same as the total length of the operands of MOMZIs, so   = 2 (ℎℎ−) .
Energy efficiency.MOMZI-based PTC is a more energy-efficient alternative to singleoperand MZI-PTCs for implementing large-tensor-size operations due to its lower propagation loss, which allows it to consume over 256 dB less laser power.The total power consumption of MOMZI-PTC for computing comprises the power required to drive the lasers, modulators, and photodetectors and for biasing the MOMZI, as well as the power needed to drive the digitalto-analog converters (DACs) and analog-to-digital converters (ADCs).The silicon-based carrier-depletion MZI's modulation energy consumption in previous work can achieve ~146 fJ/bit [31].Furthermore, the power to bias the MOMZI is ~2.5 mW per phase shifter if we use thermal phase shifters from foundry PDKs [32].
Using the parameters of existing technology provided in Table.S3, the optical part of MOMZI-PTC, accounts for <9% of total power consumption when the tensor size is 128 × 128.The power breakdown analysis in Fig. 8 indicates that our 128-op MOMZI-PTC can achieve ~56 TOPS/W at a 10 GHz clock rate, 100% higher than existing analog electronic tensor cores [33] with 100x faster operating speed.Currently, the energy efficiency of MOMZI-PTC is dominated by data converters such as ADCs.This work employs an 8-bit, 10 GSPS ADC that consumes 39 mW per channel [34].
The energy efficiency can be further improved to ~ 604 TOPS/W using emerging high-speed and energy-efficient data converters and EO modulators.Recent advances in energy-efficient active optical components, such as the plasmonic-on-silicon modulator that consumes approximately 0.1 fJ per bit modulation energy at 50 GHz operating frequency, have made it possible to reduce the power consumption of MOMZI further [19].The power to bias the MOMZI can be decreased to zero with phase change materials or nano-opto-electro-mechanical devices [35,36].Using energy-efficient modulators, the energy consumption of the optical computing part only accounts for <3% of the total power consumption, showing that large-size MOMZI-PTC will not bring scalability issues due to excessive laser power.Moreover, we can use energy-efficient analog content-addressable memory (ACM) to replace the ADCs [37], reducing the power consumption of ADCs by ~33x.The final power breakdown of MOMZI-PTC for computing shows our MOMZI-PTC can achieve a competitive energy efficiency of ~604 TOPS/W, 20x higher than existing memristor-based analog electronic tensor cores [33].More details of our power analysis are provided in supporting information note 2.  [37,38]) to boost the energy efficiency to ~604 TOPS/W.
In addition, our -op MOMZI-PTC can reduce the weight reconfiguration energy by  times compared to single-operand-device-based PTCs, which will bring considerable energy efficiency improvement, especially when the photonic tensor cores need to be frequently reconfigured to map a large number of matrix blocks in DNNs.The number of active optical devices in our MOMZI-PTC is only (   ), which is  times fewer than that of PTCs with single-operand devices (() [17] or (max( 2 ,  2 )) [39]) This feature of MOMZI-PTC is essential to implement modern DNNs, where weight loading takes nontrivial hardware costs [22].
Nonlinearity engineering.The nonlinearity of MOONs can be customized in various dimensions to achieve a desired activation function, potentially saving power for doing activation functions electronically.The built-in nonlinearity of MOON is contributed by the weight/signal encoding way and the nonlinear transfer function of the optical modulator with the input voltage.To customize such built-in nonlinearity, one can add electrical or optical components before or after photodetection to alter the optical outputs to implement the activation function.Previous work has widely investigated this approach [41][42][43].Typically, one can add saturable absorbers before photodetection with a linear optical modulator [44] to construct a ReLU-like MOON, reducing the hardware cost to realize activation functions electronically.
Depending on the transfer function of the MOON, the weight encoding approach can be designed to enable high-speed dynamic tensor operations beyond ones with stationary weights.Dynamic tensor operations mean both the inputs and the weights can be updated at high speed, which is crucial in emerging applications, such as the self-attention operation in Transformer [45] and on-chip training tasks for intelligent edge learning.A specific example of an optical modulator with a linear field response region with voltage (|Δ  | ∝ Δ) is provided here.Suppose the electrical modulation signal of the modulator is bidirectional; then, one can use two MOONs and one differential photodetector to implement high-speed vectorto-vector operations.As shown in Fig. 10, the weight and input voltage signal   and   are encoded with the same phases on operand  of the upper modulator, and high-speed signals   and   with opposite phases encoded on operand  of the downer modulator.After differential photodetection, one can obtain the output current signal as: where  is the modulation efficiency of each operand.

Conclusion
We have presented a scalable, energy-efficient optical neural network with customized multioperand optical neurons (MOONs).We have experimentally demonstrated a 4-operand siliconphotonic MOMZI on practical image recognition tasks.Compared to prior single-operand-MZI-based photonic tensor cores (PTCs), our MOMZI-based PTC design can achieve one to two orders-of-magnitude reduction in active optical component usage, footprint, latency and propagation loss.The speed, footprint, and energy efficiency of our MOON-based PTC can benefit from more advanced technologies, e.g., faster and more efficient data converters, optical devices, and nonlinearity engineering.Our customized MOON design provides a scalable solution for the next-generation photonic AI accelerators with extreme compute density and energy efficiency.    .S3.In Eq. (S1),   is the laser power.  represents the energy consumption of modulators,   and   include the power consumption for photodetection, amplification, and analog-to-digital conversion.  is the operating speed of modulators, as determined by the total delay of the MOMZI-ONN.In the MOMZI-ONN,  ℎ is the static power to tune the MOMZI to a bias point.Using thermal phase shifters for bias tuning,  ℎ = 2.5 .By utilizing energyefficient active optical components based on nano-opto-electro-mechanical systems or phase change materials [46,47], we can eliminate the power consumption for phase maintenance.To carrier-depletion-based silicon MZI modulators,   can achieve ~146 fJ/bit.Using energy-efficient plasmonic-on-silicon modulators ,   = 0.1 fJ/bit.One DAC's energy consumption can be estimated by [48]: =       /  Eq. (S2) where   is the DAC figure of merit,   is the DAC resolution,   is the sampling frequency,   is the bit rate.In our estimation,   = 35 fJ/step in a 7-nm microprocessor [42],   = 8 bit,   /  = 1.
The propagation loss, the photodetectors' minimum detectable power, and the outputs' precision dictate the laser power.The total laser power can then be calculated by the following equation [6]: where ℎ is the photon energy at 1.55 μm,  = ,  = 0.2 is the wall-plug efficiency of the laser [49], is the number of wavelengths used in the MOMZI-PTC.The precision of output signals is   bits.  is the capacitance of the photodetector while   is the operating voltage.Note that   = 0 in some zero-biased energy-efficient photodetectors [50,51],   is the baud rate of the intput signals.In this work, we choose   = 10 Baud/s and use a 10 GSPS ADC for reading the output.

• 2 )
,   + denotes the -th phase shifters on the upper arm of the modulator, and   − denotes that on the lower arm.Consequently, positive weight signals are encoded on upper modulation arms, while negative weight signals are encoded on the lower ones.  is the phase bias when no input signals are operated on the modulation arms, which is used to tune the transfer function of the MOMZI.

Fig. 1
Fig. 1 General architecture of the MOON-based photonic tensor core.(a) A conventional photonic tensor core based on single-operand modulators, which has an array of input modulators and () photonic devices to construct the weight matrix.(b) Schematic of the MOON-based PTC to implement an -input, -output layer.(c) Schematic of MOON array to implement a  × 1 vector product operation, which requires  * =   -operand MOONs.The output is obtained by accumulating the output signal of each MOON using multiplexers.(d) Schematic of a -operand MOMZI-based MOON, which consists of  operands on each arm.There are various approaches to encoding weight signals   and input signals   on each modulation region.For instance, one can use (e) programmable resistances to encode   and current signals to encode   , or (f) tunable amplifiers/attenuators to encode   and voltage signal to encode   , or (g) adjust modulation length to encode fixed weight signals and voltage signals to encode   .

Fig. 2
Fig. 2 Transfer function of different MZI modulators under different modulation mechanisms.All the data are experimental data from our measurement or the process design kit (PDK) model[15] on Lumerical Interconnect.  is the maximum allowed operating voltage.

Fig. 3
Fig. 3 Schematic of the 4-operand MOMZI.The micrographs of necessary optical components are highlighted in (a) and the full schematic of the MOON is shown in (b).

Fig. 4
Fig. 4 Experimental setup of MOMZI-ONN.(a) Schematic of our MOMZI-ONN test flow.The entire tensor operation is first partitioned into multiple 4×1 blocks, and each block is implemented optically on a 4-op MOMZI.(b) The weight parameters and the input signals are programmed by a multi-channel digital-to-analog converter (DAC).Shown in (c), the output optical signals are converted to photocurrents using on-chip photodetectors.We use an offchip TIA to convert the output photocurrent to electrical signals, which are then read by the oscilloscope.Both the oscilloscope and the DAC are controlled by a microcontroller.The tensor operation results are provided to the computer for data processing in order to train and deploy the DNN.

Fig. 5 2 , 2 ] 2 , 2 ]
Fig. 5 Experimental result of street view house number (SVHN) recognition with the MOMZI-ONN.(a) Our measured output data and curve fitting for training the MOMZI.The tuning range of the total phase shift of four operands is [−  2 ,  2 ].(b) Dynamic noise analysis of output signal of MOMZI, the measured standard deviation of the dynamic noise is ~0.5%.(c) Structure of the CNN, the convolution is realized by MOMZIs with im2col approach.The first convolutional layer has one input channel and 32 output channels with a stride of 2. The subsequent two convolutional layers have 32 input/output channels with a stride of 1.After adaptive average pooling, we use a linear classifier with 10 outputs for final recognition.(d) Our measured output data and curve fitting for training the MOMZI-ONN.The tuning range of the total phase shift of four operands is [−  2 ,  2 ].(c) The confusion matrix of the trained MOMZI-ONN on the SVHN dataset shows a measured accuracy of 85.89%.(e) and (f) show the predicted probability distribution of our MOMZI-ONN on two selected test digits in the SVHN dataset.

Fig. 6
Fig. 6 Task performance and hardware cost of MOMZI-ONN on SVHN dataset.Inference accuracies of MOMZI-ONNs with different operand numbers are shown.Using the same neural network structure, the accuracy of an ideal CNN model is 91.8%.The normalized total number of active photonic devices with different operand numbers of MOMZI-ONNs compared to ideal CNN models are shown.We suppose the matrix size is  × , the number of microring-and MZI-ONNs are normalized to 1.
0 is the output intensity of the modulator at the biased point.Compared to MOONs that use memristors to encode stationary weights, the dual-linear-modulator-based MOON shown in Fig.10can enable high-speed weight reprogramming/updates to implement high-speed dynamic tensor operations.One can investigate more efficient signal encoding approaches of MOONs to support more types of tensor operations in state-of-the-art DNNs.

Fig. 10 A 1 .
Fig. 10 A MOON with two linear modulators for matrix-matrix multiplications.The optical output power of each modulator is proportional to the electrical input power or  2 .Here we apply differential input signals ±   on upper/lower modulators, and put weight signals    on both upper/lower modulators.The output power after differential photodetection is then proportional to ∑   •    1

7. Supporting information 7 . 1
Parameter tables for the delay, propagation loss, and footprint estimation use two phase shifters on each modulation arm to enable both positive and negative weights during training.In experiments, we encode   ∝ |  ± | •   , where  1 + and  2 + are positive and  1 − and  2 − are negative,  1 and  2 are encoded on the upper arm, while  1 − and  2 − are encoded on the downer arm.The transfer function of our modulator can then be written as  = (Σ

Table .
a. Based on the layout picture of the MZI in the reference (~200 × 100  2 ).The highspeed phase shifter part is only 15  in length, so the size of the modulator can be further optimized with more compact directional couplers.