Deep learning for optical tweezers

Optical tweezers exploit light–matter interactions to trap particles ranging from single atoms to micrometer-sized eukaryotic cells. For this reason, optical tweezers are a ubiquitous tool in physics, biology, and nanotechnology. Recently, the use of deep learning has started to enhance optical tweezers by improving their design, calibration, and real-time control as well as the tracking and analysis of the trapped objects, often outperforming classical methods thanks to the higher computational speed and versatility of deep learning. In this perspective, we show how cutting-edge deep learning approaches can remarkably improve optical tweezers, and explore the exciting, new future possibilities enabled by this dynamic synergy. Furthermore, we offer guidelines on integrating deep learning with optical trapping and optical manipulation in a reliable and trustworthy way.


I. INTRODUCTION
Optical trapping and optical manipulation exploit light-matter interactions to trap and manipulate various types of micro- and nanoparticles. These techniques date back to Arthur Ashkin, who demonstrated in the 1970s that it is possible to levitate microparticles in a fluid using a focused laser beam [1][2][3][4]. Later, A. Ashkin and coworkers demonstrated that it is also possible to trap particles in 3D using a strongly focused laser beam [4], a technique now known as optical tweezers [5,6].
Deep learning is a collection of computer algorithms that can improve and adapt their solutions by learning the rules connecting input and output directly from data [25], solving problems ranging from particle tracking and characterization [26] to protein folding [27] and face recognition [28]. The first steps towards the deep learning revolution were taken in the 1940s with the mathematical modeling of biological neurons by neuroscientist Warren McCulloch and logician Walter Pitts [29]. The recent growth of deep learning has been driven largely by the increase in the computational power of processors and the size of datasets, but also by the spread of user-friendly, all-purpose deep learning frameworks, such as PyTorch [30,31] and Keras/TensorFlow [32,33], which enable quick and easy deployment of deep learning solutions for a wide range of tasks.
Several aspects of optical tweezers that are difficult to study theoretically, either due to the computational cost or because of the high modelling complexity, can now be addressed using deep learning. Deep learning can improve the calculation of optical forces by increasing its speed [34] and even its accuracy [35], helping to realistically simulate more complex systems. From an experimental standpoint, deep learning can enhance the calibration of optical tweezers [36] and improve the tracking of trapped particles [37]. Furthermore, recent progress in deep learning is also benefiting the real-time control of optical tweezers [38] and their design optimization [39].
This review presents an overview of optical tweezers and deep learning, highlighting their recent collaborative developments. We speculate on possible future innovations resulting from this synergy. To conclude, we suggest strategies for those aiming to utilize deep learning in combination with optical trapping and optical manipulation in a reliable and safe way.

II. OPTICAL TWEEZERS
Optical tweezers are a ubiquitous tool in science and they are contributing to the progress of fields like biology, physics, and nanotechnology [5,6]. As can be seen in Fig. 1, the field of optical trapping and optical manipulation is rapidly expanding. Based on light-matter interactions, optical forces can trap particles in the proximity of a focused laser beam. Furthermore, the trapping forces are typically so small that, by employing a trapped particle as a probe, it is possible to measure forces well below those reachable with an atomic force microscope (AFM) and micro-fabricated cantilevers [40]. Despite the recent progress in optical trapping, there are still many open challenges [6], including the calculation of optical forces, the efficient calibration of an optical trap, the position detection of a trapped particle, and the development of new optical trapping systems.
The calculation of optical forces has typically relied on approximations that depend on the trapping regime defined by the size of the particle [5,41]. The trapping regimes are the geometrical-optics regime, the Rayleigh regime, and the intermediate regime. The geometrical-optics regime is valid when the size of the particle is much larger than the wavelength λ0 of the trapping light. In this case, the wave nature of the light can be neglected and optical forces can be calculated using ray optics [42,43]. Instead, the Rayleigh regime occurs when the linear dimensions of the trapped object are much smaller than λ0. Thus, the trapped object behaves like a dipole and the optical forces are mostly proportional to the gradient of the light intensity [44]. Finally, the intermediate regime lies in between, where the linear dimensions of the trapped object are comparable with λ0. In this case, the optical forces need to be calculated from the electromagnetic fields obtained as an exact solution of the scattering problem, which can be a very complex and computationally intensive process [45][46][47]. Common to all the regimes is that the trapping forces for small displacements from the trapping position can be approximated as a harmonic force, F(r) = −k r, where k is the stiffness of the trap, r is the displacement from the equilibrium position, and F(r) is the optical force.
Calibrating optical tweezers consists of determining the relation between the position of a particle and the force it experiences. For small displacements from the equilibrium position, it is sufficient to determine the trap stiffness. The traditional approaches to calibration rely on explicit mathematical recipes such as the potential method [48], the autocorrelation method [49], the power spectrum analysis [50], the mean square displacement method, the equipartition method, or the maximum-likelihood-estimator analysis (FORMA) [51]. While these approaches perform well when the force field is static and conservative and a large amount of data is available, they present some limitations when the force field does not satisfy these assumptions.
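As an illustration of one of the classical recipes listed above, the following minimal sketch applies the equipartition method to a simulated trajectory. It assumes a one-dimensional, overdamped particle in a harmonic trap; the function names, the Euler-Maruyama integration scheme, and all numerical values (stiffness, particle radius, viscosity, temperature, time step) are illustrative assumptions rather than values taken from the cited works.

```python
import numpy as np

KB = 1.380649e-23  # Boltzmann constant, J/K


def simulate_trapped_particle(k, gamma, temperature, dt, n_steps, rng):
    """Overdamped Langevin dynamics in a 1D harmonic trap (Euler-Maruyama)."""
    diffusion = KB * temperature / gamma  # Einstein relation, m^2/s
    noise = rng.standard_normal(n_steps) * np.sqrt(2.0 * diffusion * dt)
    x = np.zeros(n_steps)
    for i in range(1, n_steps):
        # Restoring force F = -k x, plus the thermal displacement
        x[i] = x[i - 1] - (k / gamma) * x[i - 1] * dt + noise[i]
    return x


def stiffness_equipartition(x, temperature):
    """Equipartition estimate: k = kB T / variance of the position."""
    return KB * temperature / np.var(x)


rng = np.random.default_rng(0)
k_true = 1e-6                      # N/m (1 pN/um), assumed value
radius, eta = 0.5e-6, 1e-3         # particle radius (m), viscosity of water (Pa s)
gamma = 6 * np.pi * eta * radius   # Stokes friction coefficient
x = simulate_trapped_particle(k_true, gamma, 300.0, 1e-4, 200_000, rng)
k_est = stiffness_equipartition(x, 300.0)
print(f"true k = {k_true:.2e} N/m, estimated k = {k_est:.2e} N/m")
```

The estimate relies on the position variance alone, which is why it needs a long, stationary trajectory in a conservative force field to be reliable.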
In optical trapping experiments, the location of the particle is often the most critical parameter. Even though the previously mentioned calibration techniques differ in their approaches, they all rely on this knowledge. There are two main possibilities for tracking the position of the particle. For a single particle in an optical trap, one can use the trapping laser as a probe to determine its position, for instance using a quadrant photodiode (QPD) or a position sensitive detector (PSD). However, when there are multiple particles or multiple traps, interpreting the QPD signal becomes more complex and cameras are typically necessary. These cameras provide a larger view of the experimental system under investigation, containing much more information than the QPD/PSD signals but with the drawback of a lower acquisition rate.
Nowadays, in order to expand the applicability of optical trapping, new techniques to control optical tweezers are being developed. External real-time feedback makes it possible to correct the trapping force by adjusting either the intensity of the light or the position of the trap [52,53]. Introducing external feedback increases the effective trap stiffness but comes with the drawbacks of a limited bandwidth and of a higher sensitivity to errors in the detection of the position of the particle. To overcome these problems, automatic feedback control mechanisms have been postulated for plasmonic tweezers [24] and realized for intracavity optical trapping [54].

III. DEEP LEARNING
Deep learning is a branch of computer science that, by using artificial neural networks, allows computers to learn from data and improve their performance without explicit programming. It is a subset of machine learning, as shown in Fig. 2. Typically, deep learning approaches realize complex tasks such as image recognition, natural language processing, and speech synthesis with remarkable accuracy and efficiency. They achieve this by automatically learning hierarchical features from raw data, reducing the need for manual feature engineering, whereas traditional machine learning models, like linear regression, principal component analysis, or decision trees, often require explicit feature extraction. This has led to a near-exponential growth in the use of machine learning and, in particular, deep learning [25], as shown in Fig. 1.
Deep learning is typically based on deep (i.e., multilayer) artificial neural networks with many trainable parameters that transform input data into output data [25]. These parameters are automatically adjusted during the training process, in which the system learns the rules that connect the input data to the desired outputs by operating on known input/output pairs, called training data, using algorithms such as stochastic steepest descent and error backpropagation [55]. Thus, specific problems can be addressed reliably without explicitly knowing the rules connecting input and output, especially when the data to be analyzed closely resemble the training data.
The fundamental building block of neural networks is the artificial neuron [29]. The artificial neuron processes its inputs by performing a weighted sum and returning a transformation (typically a nonlinear activation function) of the resulting sum. During the training process, the trainable parameters, often referred to as weights, are tuned to optimize the output of the neuron. Artificial neurons can be connected in layers, with each neuron receiving input from neurons of the previous layer and passing its output to the next layer, forming the most standard artificial neural network.
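The weighted sum followed by a nonlinear activation can be written in a few lines. The sketch below is only illustrative: the choice of tanh as activation, the function names, and the numerical values are assumptions made for this example.

```python
import numpy as np


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a nonlinear activation (tanh here)."""
    return np.tanh(np.dot(weights, inputs) + bias)


def dense_layer(inputs, weight_matrix, biases):
    """A layer is a set of neurons sharing the same inputs: one row of weights per neuron."""
    return np.tanh(weight_matrix @ inputs + biases)


x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, 0.3])
single = neuron(x, w, bias=0.0)                           # tanh(0.05 - 0.2 + 0.6)
layer = dense_layer(x, np.vstack([w, -w]), np.zeros(2))   # two neurons with opposite weights
print(single, layer)
```

Training consists of adjusting `weights` and `bias` so that the outputs match the desired ones on the training data.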
Deep learning can be implemented through different network structures, i.e., different architectures, and choosing the right one depends on the task at hand. The efficiency and effectiveness of the solution are strongly influenced by the architecture because different problems have distinct data characteristics and complexities. Generally speaking, more complex data require more complex models in terms of the number of parameters needed for fitting and analysis. Moreover, deep learning can be used to generate synthetic data of high quality. For our purposes, it is convenient to divide the different architectures into three groups based on their purposes: data analysis (Dense Neural Networks, Convolutional Neural Networks, U-nets, Recurrent Neural Networks, Transformer Networks, Graph Neural Networks), data generation (Generative Adversarial Networks, Variational Autoencoders, Diffusion Models), and decision making (Deep Reinforcement Learning).

A. Data analysis
Dense neural networks (DNNs) are artificial networks in which all the nodes in each layer are connected to all the nodes in the adjacent layers. They have a structure characterized by a first layer referred to as the input layer and a last layer referred to as the output layer (dark circles in Fig. 2), and one or more layers in between referred to as hidden layers (gray circles in Fig. 2). They are used to deal with tabular data, sequential data, and data with small dimensions. When dealing with high-dimensional data, such as images, the number of connections between the layers increases drastically, leading to problems such as overfitting, meaning that the neural network performs exceptionally well on the training data but fails to generalize to new, unseen data.
To deal with high-dimensional data, convolutional neural networks (CNNs) employ 2D layers of neurons partially connected one to the other [56][57][58]. The key layers are the convolutional layers, which use filters to scan the input and perform convolutional operations, as shown in Fig. 2. A filter uses the same weights for different subsets of the input image, thus reducing the number of required trainable parameters and the risk of overfitting. More importantly, each filter corresponds to a feature map that detects a feature in the input data. In this way, the convolutional layer can detect different features of the input for each of its filters. Typically, the image size decreases as it passes through the layers, reducing the computational load and providing access to the information present at different length scales. Often, a dense neural network is added to the final layer of the convolutional neural network to generate an output representing comprehensive information associated with the input, for example, the coordinates of the position of a particle [59]. By reducing the dimensionality of the input, CNNs identify more abstract and high-level features from the data, such as the general shape of a particle or cell, at the expense of low-level features. Therefore, CNNs excel in image detection, recognition, and segmentation [60,61].
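The weight sharing performed by a convolutional filter can be made concrete with a short sketch. The example below slides a single, hand-chosen filter over a toy image; the function name, the 1x2 edge filter, and the 6x6 test image are assumptions introduced for illustration, and in a real CNN the filter weights would be learned rather than fixed.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide one filter over the image (stride 1, no padding).

    As in most deep learning frameworks, this is technically a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every image location
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map


image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # vertical edge between columns 2 and 3
edge_filter = np.array([[-1.0, 1.0]])    # responds to dark-to-bright steps along a row
fmap = conv2d_valid(image, edge_filter)
print(fmap)
```

The resulting feature map is nonzero only where the edge is located, and it is computed with just two weights regardless of the image size.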
U-nets [62] are characterized by their "U-shaped" design consisting of a contracting path (encoder) followed by an expanding path (decoder), with the two paths also linked by skip connections, as shown in Fig. 2. These skip connections bridge earlier and later layers in the network, ensuring that both low-level and high-level features are effectively combined by enabling the direct transfer of feature maps. The contracting path reduces the dimension of the input thanks to several convolutional layers, capturing and summarizing local information to learn high-level features. Instead, the expanding path consists of transposed convolutions (or deconvolutions) to up-sample the feature map, restoring the dimension of the input. Through the skip connections, the expanding path receives high-resolution feature maps, preserving the low-level features in the final output. Between the contracting and expanding paths, i.e., at the bottom of the U shape, there is a bottleneck layer holding the most abstract and high-level representation of the input data. Even if U-nets solve the loss of low-level features, they still need, like any CNN, a large amount of diverse training data to reach good performance and acceptable reliability. For example, U-nets have achieved significant success in the analysis of brain tumor images from MRI scans [63], denoising astronomical images [64], and characterizing the microstructure of samples imaged with scanning electron microscopy [65].
Unlike the previous architectures, recurrent neural networks (RNNs) retain and utilize information from previous time steps [66]. For this reason, RNNs incorporate memory gates that adjust their internal state based on prior data [55]. A fundamental characteristic of RNNs is their capability to establish recurrent connections, generating a feedback loop within the network. This enables the information to circulate within the network, making it responsive to the order and timing of input data. However, conventional recurrent neural networks encounter constraints resulting in difficulties in capturing prolonged dependencies effectively, including the vanishing gradient problem [67]. To address this issue, advanced models such as long short-term memory (LSTM) [68] and gated recurrent unit (GRU) [69] networks have been developed. These structures contain more advanced memory gates that can select and retain information over extended sequences, making them especially effective in tasks such as speech recognition, where long-term contextual information is crucial. Overall, RNNs excel in applications where the sequence of data elements is important, such as natural language processing [70], protein analysis [71,72], optical coherence tomography data segmentation [73], and adaptive optics control [74].
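The feedback loop of a recurrent network can be sketched with the simplest recurrent cell, in which the hidden state at each time step depends on the current input and on the previous hidden state. The sketch below is a plain (ungated) cell with random, untrained weights; the function name, the tanh activation, and the layer sizes are illustrative assumptions, and LSTM or GRU cells add gating on top of this basic recurrence.

```python
import numpy as np


def rnn_forward(sequence, w_in, w_rec, bias):
    """Plain recurrent cell: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in sequence:
        # The previous hidden state feeds back into the update
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden + bias)
        states.append(hidden)
    return np.array(states)


rng = np.random.default_rng(1)
n_in, n_hidden, n_steps = 2, 4, 5
w_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
w_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
bias = np.zeros(n_hidden)
sequence = rng.normal(size=(n_steps, n_in))

states = rnn_forward(sequence, w_in, w_rec, bias)
# Feeding the same inputs in reverse order gives a different final state
states_reversed = rnn_forward(sequence[::-1], w_in, w_rec, bias)
print(states[-1], states_reversed[-1])
```

Because the final state depends on the order of the inputs, the cell is sensitive to the temporal structure of the data, which a dense network applied to each time step separately would ignore.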
Attention-based transformer networks (ATNs) employ self-attention mechanisms to analyze sequential data, enabling them to identify how even distant elements in the sequence interact and influence each other [75], as shown in Fig. 2. The first step is to add position information to the sequential input data through positional encoding (typically creating a vector by applying the cosine function for every odd index and the sine function for every even index). Then, an encoder layer maps all the input sequences into a continuous representation. It is composed of two sub-modules: the multi-headed attention and the dense neural network. The multi-headed attention layer allows the model to focus on specific elements of the input data, assigning them different levels of importance during the learning process thanks to a scoring matrix (determining the amount of attention one element of the input should have on the others). The word "multi-headed" refers to the fact that this layer analyzes the input simultaneously with different attention sub-modules called "heads". The dense neural network, which follows the multi-headed attention, enhances the representations of the input elements to learn higher-level information. The output of the encoder is then sent to a decoder that has two multi-headed attention layers followed by a dense neural network. The first multi-headed attention layer receives the output of the encoder after positional encoding and sends its output to the second multi-headed layer, which combines it directly with the output of the encoder (without positional encoding), allowing the decoder to understand which encoder input is relevant to focus on. In the end, the dense neural network classifies the input and chooses the highest-probability prediction for the output. Transformers have proved very useful in language modeling [76], text generation [77], and image captioning [78].
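Two of the ingredients described above, the sinusoidal positional encoding and the scoring matrix of an attention layer, can be sketched as follows. This is a single attention head with random, untrained projection matrices; the function names, the base of 10000 in the encoding, the scaling by the square root of the key dimension, and the array sizes are assumptions commonly used in transformer implementations, not details taken from this article.

```python
import numpy as np


def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices (d_model even)."""
    positions = np.arange(n_positions)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000.0 ** (even_dims / d_model))
    encoding = np.zeros((n_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # scoring matrix
    scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ v, weights


rng = np.random.default_rng(0)
n_positions, d_model = 6, 8
tokens = rng.normal(size=(n_positions, d_model))
encoding = positional_encoding(n_positions, d_model)
w_q, w_k, w_v = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))
attended, weights = self_attention(tokens + encoding, w_q, w_k, w_v)
print(weights.round(2))
```

Each row of `weights` states how much one element of the sequence attends to every other element, irrespective of their distance in the sequence; a multi-headed layer runs several such heads in parallel with different projection matrices.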
Graph neural networks (GNNs) operate on data structured as graphs. A graph comprises a set of nodes (or vertices) linked by edges (or links). The nodes, in which information is stored within a vector known as a feature vector, correspond to the input data, while the edges represent the corresponding dependencies. The process begins by taking the input graph and passing it through a sequence of neural networks. This step transforms the structure of the input graph into a graph embedding (i.e., into vectors), preserving essential details about nodes, edges, and overall context. Next, the feature vectors associated with the nodes are passed to a neural network layer. These features are combined and aggregated within this layer, and the resulting information is then passed on to the next layer in the network. In this way, the GNN updates node representations iteratively to capture information from neighboring nodes, often by following a series of message-passing steps. During these steps, each node aggregates information from its neighbors, applies a learnable function, and updates its representation accordingly. The most immediate applications of GNNs are the classification of nodes and the completion of graphs with missing links. Other applications in which GNNs excel are, for example, web recommendation systems [82], traffic prediction [83], and the prediction of protein-protein interactions [84].
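A single message-passing step can be sketched on a toy graph. The example below assumes one particular, simple choice of aggregation (the mean over each node and its neighbors) followed by a linear map and a ReLU; the function name, the three-node path graph, and the weight values are illustrative assumptions, and practical GNN layers differ in how they aggregate and normalize messages.

```python
import numpy as np


def message_passing_step(features, adjacency, weight):
    """Each node averages its own and its neighbors' features, then applies a linear map and ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])    # add self-loops
    degree = a_hat.sum(axis=1, keepdims=True)
    aggregated = (a_hat @ features) / degree          # mean aggregation
    return np.maximum(aggregated @ weight, 0.0)       # learnable map (fixed here) + ReLU


# Path graph 0 - 1 - 2: nodes 0 and 2 are not directly connected
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
features = np.array([[1.0], [0.0], [0.0]])            # only node 0 carries a signal
weight = np.array([[1.0]])                            # would be trained in practice

h1 = message_passing_step(features, adjacency, weight)
h2 = message_passing_step(h1, adjacency, weight)
print(h1.ravel(), h2.ravel())
```

After one step the signal of node 0 has reached node 1 only, and after two steps it has reached node 2, which shows how stacking message-passing layers lets information travel across the graph.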

B. Data generation
Generative adversarial networks (GANs) create high-quality synthetic data by using a specific method called adversarial training [85]. This method uses two neural networks: the generator, which produces the synthetic data, and the discriminator, which verifies whether the data are real or fake, as shown in Fig. 2. The adversarial training improves the synthetic data generation by training the generator and discriminator in alternating steps. First, the generator produces synthetic data from the input data and the discriminator tries to classify them. Following this, by using both real and synthetic data, the discriminator is trained to better classify data. Finally, the generator is updated to produce more realistic data by using the results of the training of the discriminator. This adversarial process continues iteratively until the generator produces synthetic data able to deceive the discriminator.
Step by step, the generator can produce samples that are almost indistinguishable from real data, making GANs a powerful tool in data augmentation and data synthesis applications. A recent evolution of GANs, called Time-series GANs (TGANs), allows the generation of time-series data by taking into account the temporal correlations of the time-series data [86]. However, training GANs can be challenging because they might suffer from mode collapse (producing limited diversity in generated samples). GANs are used not only for data generation, but also for image-to-image translation [87], for enhancing the resolution of images [88], and for anomaly detection [89].
Variational autoencoders (VAEs) are generative models that combine deep neural networks with probabilistic modeling to learn representations of data and generate new samples by mapping input data into a continuous latent space [90]. VAEs use deep neural networks to produce a meaningful latent space representation of the input data, where a latent space is a lower-dimensional space in which the input data are mapped into a distribution (typically, a multivariate Gaussian). To do this, VAEs use an encoder and a decoder, as shown in Fig. 2. The encoder is a neural network (typically, a dense or convolutional neural network) that extracts from the input data the mean (µ) and the standard deviation (σ) of the distribution in the latent space. Once these two parameters are known, the encoder uses them to sample a point (z) from the latent space by using the reparameterization trick, i.e., z = µ + ϵ • σ, with ϵ a random noise term drawn from a standard Gaussian distribution. Then, the decoder uses a neural network to reconstruct the original input data from the latent space representation obtained with the encoder. In this way, it takes points from the latent space and generates new data samples that are similar to the input data. VAEs have proved useful for reconstructing complex many-body physics [91], for regressions [92], and for music generation [93].
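The reparameterization trick z = µ + ϵ • σ can be checked numerically with a short sketch. Here σ is treated as the standard deviation of the latent Gaussian and ϵ as standard Gaussian noise; the function name and the values of µ and σ are assumptions for illustration, whereas in a VAE they would be produced by the encoder network.

```python
import numpy as np


def reparameterize(mu, sigma, rng):
    """Sample z = mu + epsilon * sigma, with epsilon drawn from a standard Gaussian."""
    epsilon = rng.standard_normal(mu.shape)
    return mu + epsilon * sigma


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # hypothetical encoder output: latent means
sigma = np.array([0.5, 0.1])      # hypothetical encoder output: latent standard deviations
samples = np.array([reparameterize(mu, sigma, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

Since the randomness enters only through ϵ, the sample z remains a differentiable function of µ and σ, which is what allows the encoder to be trained by backpropagation.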
Diffusion models (DMs) are a deep learning architecture created to simulate the evolving changes in data over time or space, emulating the fundamental principles of diffusion processes and allowing the production of heterogeneous data [94]. These models add noise or perturbations to the input data over successive steps, converting them into an uncertain state, as shown in Fig. 2. Subsequently, the model is trained to reverse this process using a neural network to predict and control the noise reduction, gradually restoring the data point to its original or desired state. This approach to noise reduction produces data samples that reflect the underlying trends and variability of the data distribution while ensuring coherence, realism, and high heterogeneity thanks to the randomness of the process. This means that DMs have a distinctive ability to capture patterns and variations inherent in the data distribution. The adaptability of diffusion models covers a wide range of applications, including image generation [95][96][97][98] and natural language processing [99].
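The noising half of this process can be sketched in a few lines. The example below assumes a Gaussian, variance-preserving update with a linearly increasing noise schedule, which is one common choice in the literature; the function name, the schedule values, and the toy data are assumptions for illustration, and the learned reverse (denoising) network is not included.

```python
import numpy as np


def forward_diffusion(x0, betas, rng):
    """Gradually add Gaussian noise: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps."""
    x = x0.copy()
    history = [x.copy()]
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        history.append(x.copy())
    return np.array(history)


rng = np.random.default_rng(0)
x0 = np.full(2000, 3.0)                  # toy data: many copies of the value 3
betas = np.linspace(1e-4, 0.05, 400)     # hypothetical noise schedule
history = forward_diffusion(x0, betas, rng)
print(history[0].mean(), history[-1].mean(), history[-1].std())
```

After enough steps the samples are statistically indistinguishable from standard Gaussian noise, whatever the starting data; the generative model is the network trained to undo these steps one at a time.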

C. Decision making
Deep reinforcement learning (DRL) is a deep learning approach that combines deep neural networks with reinforcement learning techniques to learn sequential decision-making in complex environments through trial and error [100,101]. It is based on reinforcement learning, in which an agent learns to make sequential decisions in an environment to maximize a cumulative reward signal. In DRL, the agent employs a neural network that is trained using feedback from the environment, as shown in Fig. 2. This feedback consists of rewards or penalties for the agent based on its actions. Through iterative interactions with the environment, collecting experiences, and updating its neural network, the DRL agent gradually learns an optimal policy or value function, enabling it to make effective decisions in complex and high-dimensional environments. In this way, DRL can perform very complex tasks like playing Go [102], driving autonomous vehicles [103], and designing optical multilayer thin films [104].
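The trial-and-error loop of reinforcement learning can be illustrated on a toy problem. The sketch below deliberately replaces the neural network of DRL with a lookup table (tabular Q-learning), which is sufficient for an environment with only seven states; the one-dimensional environment, the reward values, the function name, and all hyperparameters are assumptions made for this example.

```python
import numpy as np


def train_q_table(n_states=7, target=6, episodes=500, alpha=0.5,
                  gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning for an agent that must walk along a line to a target site."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))                    # actions: 0 = left, 1 = right
    for _ in range(episodes):
        state = int(rng.integers(0, target))       # random start away from the target
        for _ in range(100):
            # Epsilon-greedy choice: mostly exploit, sometimes explore
            if rng.random() < epsilon:
                action = int(rng.integers(0, 2))
            else:
                action = int(np.argmax(q[state]))
            step = 1 if action == 1 else -1
            next_state = min(max(state + step, 0), n_states - 1)
            done = next_state == target
            reward = 1.0 if done else -0.01        # reward at the target, small penalty otherwise
            best_next = 0.0 if done else np.max(q[next_state])
            # Move the estimate towards reward + discounted future value
            q[state, action] += alpha * (reward + gamma * best_next - q[state, action])
            state = next_state
            if done:
                break
    return q


q_table = train_q_table()
policy = np.argmax(q_table, axis=1)                # greedy action for each state
print(policy[:6])
```

In DRL, the table `q` is replaced by a neural network that maps a (possibly high-dimensional) state, such as a camera image, to the value of each action, while the reward-driven update follows the same logic.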

IV. DEEP LEARNING FOR OPTICAL TWEEZERS
The advantages of machine learning, such as simplicity, versatility, and speed, enhance optical tweezers by improving particle detection and tracking, trajectory analysis and calibration, and optical force calculation, and by enabling tasks such as real-time control of optical traps and new designs. When automated without deep learning, these tasks typically require manual tuning of parameters, low-noise measurements, or extremely long calculations. This is undesirable because it is time consuming for the researchers and also risks introducing human biases. In the following subsections, we discuss different cases where deep learning has already been successfully combined with optical trapping and optical manipulation, and we propose new possible applications.

A. Particle tracking
In optical tweezers experiments, particle tracking is a key task. Deep neural networks have significantly enhanced this task, notably improving the speed and accuracy of detection. Leading tracking algorithms now frequently incorporate Convolutional Neural Networks (CNNs) [26,37,105]. These CNNs exhibit greater robustness against noise compared to classical algorithms. This prevents tracking errors due to the presence of noise in the particle video and increases the accuracy of the extracted particle trajectory, as shown in Fig. 3a. Nevertheless, acquiring enough training data from experiments is challenging because the true values of the position of the trapped particle are not known and may need to be collected manually or with standard methods. To solve this issue, it is possible to train the algorithms on simulated data [26,37].
An alternative approach that has shown promise is to exploit the symmetries inherent to the tracking problem. This approach is employed by the recently developed deep-learning method called LodeSTAR (Localization and detection from Symmetries, Translations, And Rotations) [106]. This approach is particularly beneficial as it enables training on small datasets, even with as little as a single image, without the necessity of ground truth.

FIG. 3: Deep learning for particle tracking. a. Trajectory of an optically trapped particle obtained from a noisy video by DeepTrack (orange) compared to that obtained with the classical radial symmetry algorithm (blue line). Reproduced from [59]. b. A U-net can be used to track trapped particles that approach one another, even when one particle overlaps with the other (defocused particle in the bottom picture on the left). c. A TGAN can fill missing frames in a video file (e.g., due to uneven sampling rate) and track the particles, allowing the application of calibration methods that require a constant sampling rate (e.g., those based on power spectral density, autocorrelation functions, and mean squared displacement). d. An ATN can find the trajectory of optically trapped particles in a video file and use it to determine the physical properties of the particles, such as their refractive index n_p and radius r, as well as information about the immersion medium, such as its viscosity η and its temperature T.
In addition to the position of the particle, deep learning can extract more information from images, such as the particle's size and orientation. For example, deep learning has recently been used to track the orientation of sperm cells in an optical trap, enabling the extraction of their rotation rates [107]. Furthermore, going beyond analyzing images acquired with digital video microscopy, deep learning can potentially also be applied to data acquired with methods based on quadrant photodiodes (QPDs) or position-sensitive detectors (PSDs). In these cases, deep learning can allow, for example, the extraction of the trajectory from noisy signals or from signals with frequencies higher than the detection bandwidth.
Importantly, deep learning often manages to excel even when standard methods fail. For example, U-nets can be used to track multiple trapped particles that approach one another, as schematically illustrated in Fig. 3b, a situation in which standard methods fail and require complex ad-hoc fixing [108]. This is especially relevant for multiple trapped particles and in case of defocusing (due, for example, to the overlapping of two or more particles).
TGANs could improve the tracking of particles from videos with missing frames or non-constant sampling frequencies thanks to their ability to generate data that respect the temporal correlation of the inputs and thus to generate the missing data from the properties of the phenomenon being studied, as schematically shown in Fig. 3c. It is possible, for instance, to create a video with a constant sampling rate from one with a non-constant sampling rate, enabling the use of calibration techniques based on power spectral density, autocorrelation functions, and mean squared displacement. Instead, ATNs can be used to locate trapped particles in a set of many particles and evaluate their properties (such as dimensions and refractive index) or the fluid properties (such as temperature and viscosity) by identifying how distant points of the trajectory of the particle interact and influence one another, as schematically shown in Fig. 3d.

B. Trajectory analysis and calibration
Deep learning has proven to be an efficient method for analyzing confined particle motion, especially when experimental conditions change, and has proven effective for calibrating optical tweezers in scenarios where traditional methods are inadequate, such as non-conservative force fields and limited data collection situations. Recently, trajectory analysis with deep learning allowed the estimation of rheological properties while reducing the amount of data needed [109], as schematically shown in Fig. 4a. This kind of analysis, which would ordinarily require measuring for several minutes, can now be performed in a matter of seconds. This result was made possible by training the neural network on simulated data, further showcasing the potential of synthetic data for training models. In this case, simulating the training data is both essential to obtain sufficient amounts of data and relatively simple, since the equations of motion of a trapped particle are well understood.
Deep learning has also been used to analyze particle trajectories within an optical trap, measured from the forward scattering captured by a quadrant photodiode, to discern different kinds of particles [110]. Potentially, deep learning architectures, such as diffusion models, can be utilized to estimate the properties of various diffusion processes experienced by a trapped particle, even when there are missing points in the trajectory. Indeed, the diffusion model can be employed to reconstruct the particle trajectory by effectively filling in the gaps and can estimate the required properties, as schematically illustrated in Fig. 4b.
Deep learning can also be used for calibration purposes. This was demonstrated in Ref. [36], where RNNs were used to estimate force fields with limited data available (trajectory length < 10 s) for harmonic potentials, as shown in Fig. 4c, as well as for more complex and time-varying force fields. Recent findings underscore the capabilities of neural networks to go beyond determining the stiffness of optical traps, and to estimate properties of trapped particles such as their refractive index or radius [111]. Deep learning, specifically transformer networks, can determine whether a trapped particle is in thermal equilibrium or not, as shown in Fig. 4d, a task that is challenging with standard methods.

C. Optical force calculations
Calculating optical forces can be computationally expensive, especially when optical forces require repeated calculations, such as when simulating the Brownian dynamics of an optically trapped particle [112], or for non-Gaussian beams, such as Laguerre-Gauss beams. Deep learning offers a solution to this problem. For example, neural networks have successfully predicted the forces acting on a spherical trapped particle both in the intermediate regime, even for complex beams [34], and in the geometrical-optics approximation [35]. Importantly, the improvement in speed does not come at the expense of accuracy. Quite the opposite, neural networks have also been shown to be able to overcome some artifacts caused by the restricted number of rays used in the geometrical-optics approximation [35]. Simple dense neural networks have been shown to perform well for this task, probably thanks to the low dimensionality of both inputs (e.g., the three coordinates of the particle position as well as some of the particle's physical properties) and outputs (e.g., the three components of the force). The enhanced computational speed enables simulations of scenarios previously unattainable using conventional computational methods. Examples include modeling a trapped particle that changes size [34] (Fig. 5a), improving the performance and accuracy of geometrical-optics calculations [35] (Fig. 5b), exploring the parameter space of an ellipsoid in a double-beam configuration [35], simulating the dynamics of a trapped red blood cell [113], and evaluating forces produced by beams with amplitude profiles of arbitrary complexity [114].
As a perspective, DMs and GANs could be used to evaluate the optical forces of complex light fields (including random fields, such as speckle fields [51,115,116,117]) from intensity images of the field acquired with a camera, as schematically shown in Fig. 5c. This is not possible with standard methods, whereas DMs and GANs can learn how an intensity image relates to a force field during the generation process.
Moreover, CNNs, possibly trained with an adversarial approach, could be used to evaluate the optical forces produced by near-field optical traps from the 2D design of the substrate, as schematically shown in Fig. 5d. Currently, this design process relies on numerical methods that require considerable computational power and time to produce acceptable results.

D. Controlling tweezers
Real-time control of optical tweezers using deep learning can improve their operational efficiency and reliability. In 2021 [118], a neural network was trained to guide optically trapped particles to precise target positions while avoiding collisions with other particles and obstacles. The first step in this process is to detect particles in images captured by a camera using a thresholding method. The particle positions are then used to determine the most efficient movements for the trapped particle, resulting in its alignment with the desired target. This is done by training a deep reinforcement learning algorithm in a simulated environment. In this way, the NN can determine the most suitable direction for guiding the trapped particle to its target position, all while avoiding potential collisions with other particles, as shown in Fig. 6a.

FIG. 4: Deep learning for trajectory analysis and calibration. a. A convolutional neural network is trained on simulated data in order to extract the medium viscosity η from the particle trajectory. Reproduced from [109]. b. A diffusion model can be used to extract information about the diffusion processes of a trapped particle when there are missing points in the trajectory. c. The DeepCalib method used a recurrent neural network trained on simulated data to extract the trap stiffness for a microparticle held in a harmonic potential. Reproduced from [36]. d. An attention-based transformer network can determine whether a trapped particle is in thermal equilibrium or in a non-equilibrium condition.
To achieve precise optical tweezers control, digital twins can be coupled with deep learning. Digital twins are virtual models of physical objects, systems, or processes, generated by collecting and integrating data from their physical counterparts [119,120,121]. By including optical tweezers within a digital twin framework, researchers can virtually manipulate and manage microscopic objects, such as individual molecules or nanoparticles, with great precision. This enables improved experimentation at the nanoscale and supplies an abundance of real-time data on the behavior and interactions of the objects. These data can then be analyzed by deep-learning algorithms to optimize experimental conditions and swiftly detect complex patterns and trends that may be difficult for human researchers to discern. For example, digital twins and VAEs can be used to automate experiments in which only particles with specific properties are trapped, as schematically shown in Fig. 6b. This experiment is not feasible using standard methods because the properties of the particle need to be inferred in real time.
Moreover, Bayesian deep learning can be incorporated into the control structure of optical tweezers to account for uncertainties such as sensor noise and variations in particle characteristics. Bayesian deep learning combines deep learning with Bayesian modeling, a statistical framework in which probabilities express the degree of belief in a specific outcome [122]. This, in turn, enables the precise and adaptable manipulation of particles, for example, for drug delivery, for studying biological processes, or for assembling microstructures, as schematically depicted in Figs. 6c and 6d. The Bayesian framework empowers the system to continuously update its beliefs concerning the state of the particles, thereby enhancing the robustness and efficiency of optical tweezers experiments.

E. Designing optical tweezers
Optical tweezers are complex systems whose design can be challenging, especially when using adaptive optics or plasmonic structures. Deep learning has the potential to improve and simplify this design process. However, until now only probabilistic techniques, such as simulated annealing, have been used to design custom nanostructures that improve the performance of plasmonic trapping [39,123]. By evaluating the optical force produced by different shapes of the nanoaperture, it is possible to optimize its shape, enhance its electromagnetic field, and, therefore, maximize the trapping force, as shown in Fig. 7a.
Deep learning is now widely used for designing nanophotonic devices [124], and its extension to the design of optical tweezers is straightforward. More advanced techniques, such as deep reinforcement learning combined with digital twins, may improve the design of plasmonic devices. For example, DRL might try different shapes of the nanodevice on the digital twin to find the shape that gives the best performance. Another way to design optical tweezers is to use a spatial light modulator (SLM) [125] together with deep learning algorithms to shape the beam. The beam shape can then be controlled by a diffusion model that generates the appropriate SLM mask, making it possible, for example, to trap multiple particles with different beam shapes and/or to compensate for the spherical aberrations of the optical system, as schematically shown in Fig. 7b.
In addition, digital twins might be used with VAEs to design the optical elements (e.g., trapping lens properties, laser wavelength) so that the optical trap has specific properties, such as a given stiffness or the ability to efficiently trap particles that are typically difficult to trap (e.g., gold nanoparticles, quantum dots, low-refractive-index particles).

V. GUIDELINES
Considering that many potential applications of deep learning in the optical tweezers domain remain to be developed, we provide here some guidelines. We also address some specific challenges, such as the availability of only limited datasets and the diversity of optical tweezers setups, which complicate the application of the same techniques broadly to different experiments.
The process of applying deep learning to solve an optical tweezers problem can be broadly split into the following steps: 1. Problem description. 2. Data collection/simulation. 3. Architecture selection. 4. Training. 5. Testing. Often, it is necessary to iterate the process multiple times before achieving an acceptable performance.

A. Problem description
The first step in implementing any deep learning model is to provide a detailed description of the problem, outlining what is known and what the deep learning model needs to predict. Knowledge of the input and output data, especially of the types of data they contain, is fundamental to choosing the proper deep-learning architecture. For instance, the algorithm could use images from a camera as inputs and return as output the commands to send to the laser to set the beam properties. A key aspect is to define the specific requirements for the sought-after solution. These could be that the output is needed quickly, such as for real-time feedback control, or that the output needs to be accurate, as for image analysis.
When using deep learning to control the experiment, the choice of an architecture able to communicate with the experimental setup and manage the input and output signals is fundamental. A simple solution is to run the deep learning model on a desktop computer connected to the experimental setup. However, more specialized solutions might also be required, for example employing microcontrollers or field programmable gate arrays (FPGAs) with pre-trained neural networks.
Instead, if deep learning is used for data analysis, providing the inputs to the network and retrieving its output is rarely a technical problem. However, it is still recommended to run the algorithm on specialized hardware (GPUs or TPUs), which is readily accessible through local computers, servers, or the cloud.
To enhance the effectiveness and simplify the training of the deep-learning algorithms, the problem needs to be formulated in terms that are as simple as possible. For example, the magnitude of the force applied on a sphere in standard optical tweezers depends only on two inputs (radial distance and height from the focus) and not on the three values of the Cartesian coordinates (x, y, z) because of symmetry arguments. By exploiting this symmetry in the modelling of the problem, the deep-learning model can perform more accurately and run faster, while reducing the amount of training data required and the effort spent on training.
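The symmetry argument above can be sketched as a pair of coordinate transformations wrapped around the model. This is a minimal illustration that assumes an axially symmetric trap whose force has only radial and axial components; the function names are hypothetical, and the force values in the usage lines stand in for the output of a trained model.

```python
import math


def to_reduced_inputs(x, y, z):
    """Map Cartesian coordinates to the two symmetry-reduced inputs (rho, z)."""
    rho = math.hypot(x, y)  # radial distance from the beam axis
    return rho, z


def to_cartesian_force(f_rho, f_z, x, y):
    """Map a predicted (radial, axial) force back to Cartesian components."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        # On the axis the radial direction is undefined; by symmetry the
        # transverse force vanishes there.
        return 0.0, 0.0, f_z
    return f_rho * x / rho, f_rho * y / rho, f_z


# Usage: the model sees only (rho, z); here its output is a made-up value.
rho, z = to_reduced_inputs(3.0, 4.0, 1.0)
fx, fy, fz = to_cartesian_force(-2.0, 0.5, 3.0, 4.0)
```

With this reduction, a model needs to be trained only on a two-dimensional grid of positions rather than on a three-dimensional volume.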
Also at the initial stage, it is critical to consider whether deep learning is the best fit for the problem of interest. There are situations in which standard methods perform as well as deep learning with the additional advantage of interpretability and explainability of the results. In contrast, a deep-learning model is intrinsically less transparent, as it learns through a training process that is difficult to interpret. In general, deep learning is preferable when there is plenty of data for training or when the relation between the inputs and outputs is too complicated to be described analytically or with simple computational models.

B. Data collection/simulation
Any deep learning approach will require training data to fit the parameters of the model, and these data will need to be as representative of the problem as possible. Depending on the problem at hand and the chosen architecture, the amount of data required for training the neural network will be different. Typically, the quantity of data should be substantial and diverse, representing the entire variable space of the problem. This can easily be the biggest obstacle when applying deep learning. For example, to track the position of a trapped particle, multiple images in different experimental conditions are required to achieve sufficient generality. Nevertheless, some cutting-edge techniques require only a single sample to complete the training, such as the LodeSTAR tracking algorithm [106].
In several situations, the training data can be produced through simulations, allowing access to potentially infinite amounts of data. Several software packages help with this, for example for simulating images of particles (such as DeepTrack [26,37]), for calculating optical forces [34,35], and for analysing trajectories [36]. However, the simulated data must be representative of the problem and, to ensure this, a small experimental dataset can be used as a validation set. Sometimes, combinations of simulated and experimental data can improve the learning process. Typically, one would then train the algorithm on the simulated data first and then fine-tune it on the experimental data.
It is important to highlight that the data should be split into three different subsets: a training set used to train the parameters of the architecture; a validation set used to tune its hyperparameters, i.e. the parameters related to the architecture properties (such as number of neurons, number of layers, dimensions of the layers); and a test set used to evaluate the final performance of the trained model on unseen data (these data should not be used during the optimization of the architecture or the training of the model).
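As a minimal sketch of this three-way split, the function below shuffles a dataset once and partitions it into disjoint training, validation, and test subsets. The function name and the 70/15/15 proportions are assumptions chosen for illustration, not a recommendation from the references.

```python
import random


def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]               # held out until the final evaluation
    val = shuffled[n_test:n_test + n_val]  # used to tune hyperparameters
    train = shuffled[n_test + n_val:]      # used to fit the model parameters
    return train, val, test


train_set, val_set, test_set = split_dataset(range(100))
```

For time series or videos, it may be preferable to split by recording rather than by individual sample, so that strongly correlated frames do not end up in both the training and the test subsets.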
Most algorithms employ supervised learning, which requires labelled data. This means that the data must be labelled with the ground truth, i.e., each input of the training dataset needs to be associated with a known desired output that the deep-learning model should provide. Knowing the ground truth is challenging and requires the use of standard methods or alternative experimental setups. There are also unsupervised techniques (e.g., VAEs) that do not need labelled data. In this case, preparing the training dataset is much easier, but validating the model becomes more challenging and often requires explicit analysis by the user.

C. Architecture selection
The choice of the architecture to use and its hyperparameters is a crucial point because it greatly influences the performance of the model. To assist with the selection of the appropriate architecture, we have compiled in Table I the most commonly utilized architectures for typical tasks relevant to optical trapping and optical manipulation. The first things to consider are the task to be achieved and the type of data to be analyzed.
In the case of tracking particles with digital video microscopy, the most commonly used architectures are variants of CNNs. If the goal is to track a single optically trapped particle, a standard CNN is often sufficient [37,105,110]. However, if many particles need to be tracked simultaneously, then using a U-net is often better than a standard CNN [26].
In the case of trajectory analysis and calibration, an architecture that can handle time series data is required [36,111]. RNNs have been used previously and will often suffice [36]. Also, TGANs and ATNs can perform well with various time series and are, therefore, a good option when there are missing data points or complex dependencies in the data. However, if one has a large number of particles that interact, then a GNN is a good choice, as demonstrated by the MAGIK algorithm [126].
To calculate optical forces, DNNs have been shown to work well [34,35,113,114] and should therefore be the starting point. If the number of input parameters is small (up to a few tens, e.g., the particle position, rotation, and a limited number of parameters describing its shape), then a DNN will almost certainly perform well. Instead, when the number of parameters increases, such as in the case of biological cells which are also deformable, CNNs may be a better choice due to their capacity to capture spatial dependencies and their lower number of fitting parameters.
When deciding on an algorithm to use for controlling optical tweezers, the choice naturally falls on DRL [118], digital twins, and Bayesian modeling. However, the specific architecture to use is less obvious and depends on the input data.
Designing optical tweezers with deep learning is an area in which there has not been much research yet, but we believe that generative models, such as GANs and DMs, might be appropriate to deal with the need to generate different designs to find the most efficient one.
There are also cases when one wishes to combine different data types, for example when they are acquired by different sensors in the same experimental setup. In this case, one option is to use separate models for the different data types, but this denies the algorithm the full picture and prevents it from exploiting correlations between the different data streams. A better option is to use hybrid models, which combine several architectures. For instance, to handle a time series from a photodiode in combination with images from a camera, one can combine an RNN and a CNN as backbones and make the prediction using a DRL network as a head.

D. Training
Training consists of adjusting the parameters of a deep learning model to enhance its performance on the specific problem to solve. It is convenient to use a standard library to implement the models. The two most commonly used are PyTorch [31] (which has been on the rise for several years) and Keras/TensorFlow [32] (which is being slowly abandoned). Often, it is also possible to find already implemented architectures that can be used as a starting point for training your models. For example, the DeepTrack library [26,127] offers an extensive toolkit for image analysis, which has been shown to work well on microscopy data. The training process is often computationally demanding, which explains why we recommend running it on specialized hardware (e.g., using a GPU).
Before starting training, it is necessary to select a loss function, a performance metric that quantifies how far the model is from the ground truth, providing a quantity to be optimized. The loss therefore plays a fundamental role during the training process, as its value quantifies the ability of the model to predict the real value of the desired parameter accurately. For example, this can be the squared distance between a predicted position and the actual position of a trapped particle, or the proportion of correctly classified samples.
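The two example metrics mentioned above can be written in a few lines. This is a minimal sketch with hypothetical function names; note that the proportion of correctly classified samples is not differentiable, so in practice it is usually monitored alongside a differentiable classification loss (such as cross-entropy) rather than optimized directly.

```python
def mean_squared_distance(predicted, actual):
    """Mean squared Euclidean distance between predicted and true positions."""
    total = 0.0
    for p, a in zip(predicted, actual):
        total += sum((pi - ai) ** 2 for pi, ai in zip(p, a))
    return total / len(predicted)


def accuracy(predicted_labels, true_labels):
    """Proportion of samples whose predicted class matches the true class."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Usage with made-up 2D positions and class labels.
position_loss = mean_squared_distance([(0, 0), (1, 1)], [(3, 4), (1, 1)])
class_accuracy = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
```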
Next, the initialization of the parameters is done, often automatically by the deep-learning framework being employed. Then, the training loop starts. In each iteration, known as an epoch, the training data are split into small batches on which the model is evaluated, and the loss is calculated. The loss is used with an optimization algorithm such as stochastic gradient descent to slightly change the weights of the model to minimize the loss value. Parallel to this, the value of the loss function is calculated on the validation set to see how well the model generalizes its prediction. Generally, the performance of the model will increase epoch by epoch, but only up to a certain point when measured on the validation set. Afterwards, the validation performance tends to drop due to overfitting. It can be hard to tell for sure if a model is overfitting; generally, the more parameters the model has and the smaller the dataset, the larger the risk of overfitting. To avoid overfitting, it is possible to stop the training when the performance on the validation set has plateaued and before it starts worsening. Often, tuning of the hyperparameters, such as the number of layers in a CNN, optimizes the results and, also, reduces the risk of overfitting.
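The training loop described above can be sketched without any deep-learning framework by fitting a two-parameter linear model with mini-batch stochastic gradient descent and early stopping. The model, the synthetic data, the function name, and the hyperparameter values (learning rate, batch size, patience) are all assumptions chosen to keep the example small; a real application would use a neural network and a framework such as PyTorch or Keras.

```python
import random


def train(train_set, val_set, lr=0.1, batch_size=8,
          max_epochs=200, patience=10, seed=0):
    """Mini-batch SGD on y = w*x + b with early stopping on validation loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                         # parameter initialization

    def mse(data, w, b):
        return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

    best_loss, best_w, best_b = float("inf"), w, b
    stale_epochs = 0
    for _ in range(max_epochs):             # one pass over the data = one epoch
        data = list(train_set)
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradients of the mean squared error on this batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        val_loss = mse(val_set, w, b)       # monitor generalization
        if val_loss < best_loss:
            best_loss, best_w, best_b = val_loss, w, b
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:    # validation loss has plateaued
                break
    return best_w, best_b, best_loss


# Synthetic data (assumed): y = -2 x + 0.5 plus Gaussian noise.
data_rng = random.Random(1)
points = []
for _ in range(100):
    x = data_rng.uniform(-1.0, 1.0)
    points.append((x, -2.0 * x + 0.5 + data_rng.gauss(0.0, 0.05)))
w_fit, b_fit, val_loss = train(points[:80], points[80:])
```

The function returns the parameters from the epoch with the lowest validation loss rather than those from the last epoch, which is the usual way of combining early stopping with model checkpointing.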

E. Testing
The final step is to test the model to ensure that it performs as desired when applied to new, never-before-seen data. The model is given as input a test dataset for which the desired outputs are known, and its output is compared with the expected one. If the performance is satisfactory, then the training process is finished. If the model has been trained on simulated data, then it is at this stage that the model is tested against real-world data or in an experimental setting. However, often the performance is not as good as desired. If the performance on simulated data is significantly better than that on real-world data, this may indicate a discrepancy between the simulations and the experiment. Similar problems may occur if the training data are experimental but gathered under different conditions (e.g., a different setup or a different type of sample). If this happens, it is necessary to retrain the model using a larger or more representative training dataset.
When employing the model in a real-time experimental setting, there is often a need for the model to make its predictions quickly. To achieve the required computational speed (especially when using the model in embedded systems, such as microcontrollers or FPGAs), connections or entire neurons may be removed from the neural network to reduce the size and increase the speed. This operation is called pruning. The aim is to strike an optimal trade-off between speed and accuracy for a real-time application and this requires further testing.
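One common way to decide which connections to remove is to discard the weights with the smallest absolute values. The sketch below applies this magnitude criterion to a single weight matrix stored as a list of lists; the criterion, the function name, and the example values are assumptions for illustration, since the text does not prescribe a specific pruning rule.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * fraction)
    if n_prune == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[n_prune - 1]
    # Weights tied with the threshold are also removed, so slightly more
    # than the requested fraction may be pruned when values repeat.
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


layer = [[0.5, -0.01, 0.3],
         [0.02, -0.8, 0.1]]
pruned = prune_by_magnitude(layer, fraction=0.5)
```

Zeroed weights only translate into faster inference when the hardware or the runtime exploits the resulting sparsity, and the pruned model should be re-evaluated on the test data to check the loss in accuracy.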

VI. CONCLUSIONS
In this perspective, we investigate the application of deep learning for the optical tweezers field. As examples, we discuss the improvements in particle tracking at low signal-to-noise levels [59] and in quantifying the rotation of trapped particles [107]. Furthermore, we highlight the use of deep learning to address cases that traditional methods cannot deal with, such as accurately tracking multiple particles when they are close together, filling in missed frames in videos, or selectively tracking particles with unique characteristics.
Then, we discuss the enhancement of trajectory analysis and optical tweezers calibration, which permits one to estimate rheological properties with only a few seconds of data instead of minutes [109] and, also, to discern different types of particles [110]. Moreover, we propose to use deep learning in some cases when standard methods fail: DMs may help to reconstruct trajectories with missing data points and estimate the desired properties; ATNs may help to determine whether a trapped particle is in thermal equilibrium or not.
Furthermore, deep learning has already improved the calculation of optical forces by increasing the computational speed and accuracy [35] and by allowing the study of nontrivial cases (such as with Laguerre-Gaussian beams) [34]. In this scenario, optical forces could be calculated also in cases where standard methods are not viable. Indeed, DMs and GANs can calculate the force field starting from intensity images of the optical field, while CNNs can do the same from the design of a near-field optical trap.
When real-time control of optical tweezers is necessary, standard methods are often too computationally slow. Recently, NNs have made it possible to move a trapped particle to a target position while avoiding collisions with real and virtual obstacles [118]. We believe that real-time control and automation of optical tweezers can be further improved using deep learning. Digital twins with VAEs may be suitable when the automatic trapping of specific particles is desired. DRL with Bayesian modeling may automate experiments, such as DNA pulling, optimizing the search for favorable experimental conditions. U-nets with Bayesian modeling may help automate the process of designing microstructures with optimal optical manipulation properties.
Designing optical tweezers can be challenging, especially when more complex designs are required. Deep learning can provide an effective solution for these requirements. Although the design of optical tweezers has thus far only utilized probabilistic methods like simulated annealing [123], deep learning has the potential to enhance this process. For example, DMs can design optical tweezers with a spatial light modulator for trapping multiple particles while enhancing the trapping force.
Finally, we provide guidelines for using deep learning in optical trapping and optical manipulation, highlighting step-by-step the process to follow to create an effective deep learning model, from the problem description to the model validation, while avoiding common pitfalls.

FIG. 1: The rise of optical trapping and deep learning in scientific publications. Number of articles published per year that use "Optical trapping" (blue line), "Machine learning" (gray line), or "Deep learning" (orange line) in their title, abstract, or keywords. Milestones in the development of these fields are highlighted with illustrations. Data obtained from Web of Science™ in November 2023.

FIG. 2: Machine learning and deep learning. Deep learning (orange rectangle) is a subset of machine learning (black rectangle). Machine learning approaches include linear regression, principal component analysis, and decision trees. Deep learning approaches include dense neural networks, convolutional neural networks, U-nets, attention-based transformer networks, graph neural networks, generative adversarial networks, variational autoencoders, diffusion models, and deep reinforcement learning.

FIG. 5: Deep learning for optical force calculation. a. Experimental (black symbols) and neural-network-simulated (orange line) rotation rates ω as a function of the parameter α of the superposition of two Laguerre-Gaussian beams, α LG 0,+5 + (1 − α) LG 0,−5. The error bars represent standard errors. Reproduced from [34]. b. A dense neural network calculates the optical forces in the geometrical-optics approximation, increasing not only the calculation speed but also the accuracy when compared to the conventional geometrical-optics approach. The neural network (orange line) has been trained with data generated with geometrical optics using 100 rays (purple line) and approximates the exact solution (black line) much better. Reproduced from [35]. c. A GAN could evaluate the force field (red arrows in the right panel) directly from images of the optical field (on the left). d. A CNN could be used to evaluate and optimize the trapping force directly from the 2D design of a near-field optical trap.

FIG. 6: Real-time control of optical tweezers with deep learning. a. Sketch of a trapped particle moved in real time by a neural network to avoid both physical (defocused particles) and virtual (white hollow circles) obstacles. The red solid line represents the trajectory, the white arrows the direction of the motion, and the green cross the destination point of the particle. Reproduced from [118]. b. Digital twins and VAEs can be used to automate experiments in which only particles with specific properties are trapped. c. Deep reinforcement learning and Bayesian modeling can be used to automate the DNA pulling experiment done with two optical traps. d. U-net and Bayesian modeling can improve the process of filling micro-holes in a microfluidic chamber with particles in order to create microstructures.

TABLE I: Summary of deep learning algorithms suitable for different problems related to optical tweezers. In the last column, we have listed references that deal with the technique on a general level or apply it in the context of optical trapping or a related field.