Deep Learning application for stellar parameters determination: I- Constraining the hyperparameters

Machine Learning is an efficient method for analyzing and interpreting the increasing amount of astronomical data that is available. In this study, we present a pedagogical approach that should benefit anyone willing to experiment with Deep Learning techniques in the context of stellar parameter determination. Using a Convolutional Neural Network architecture, we give a step-by-step overview of how to select the optimal network parameters for deriving the most accurate values of the stellar parameters T$_{\rm{eff}}$, $\log g$, [X/H], and $v_e \sin i$. Synthetic spectra with random noise were used to constrain this method and to mimic the observations. We found that each stellar parameter requires a different combination of network hyperparameters, and that the maximum accuracy reached depends on this combination as well as on the signal-to-noise ratio of the observations and the architecture of the network. We also show that this technique can be applied to other spectral types in different wavelength ranges after the technique has been optimized.


Introduction
Machine learning (ML) applications have been used extensively in astronomy over the last decade (Baron 2019). This is mainly due to the large amount of data recovered from space and ground-based observatories, which creates a need to analyse these data in an automated way. Statistical approaches, dimensionality reduction, wavelet decomposition, ML, and deep learning (DL) are all examples of attempts made to derive more accurate stellar parameters, such as the effective temperature (T$_{\rm{eff}}$), surface gravity ($\log g$), projected equatorial rotational velocity ($v_e \sin i$), and metallicity ([M/H]), using stellar spectra in different wavelength ranges (Guiglion et al. 2020, Passegger et al. 2020, Portillo et al. 2020, Wang et al. 2020, Zhang et al. 2020, Bai et al. 2019, Kassounian et al. 2019, Fabbro et al. 2018, Gill et al. 2018, Li et al. 2017, Gebran et al. 2016, Paletou et al. 2015a, b). DL is an ML method based on deep artificial neural networks (ANN) that does not usually require a specific statistical algorithm to predict a solution; instead, the solution is learned by experience, which requires a very large training dataset (Zhu et al. 2016) in order to perform properly.
An overview of the automated techniques used in stellar parameter determination can be found in the study of Kassounian et al. (2019). We will mention some of the recent studies that involve ML/DL. The increase in computational power and the wide availability of predefined optimized ML packages (in e.g. Python, C++, and R) have allowed astronomers to shift from classical techniques to ML when using large data. One of the first trials to derive the stellar parameters using neural networks was carried out by Bailer-Jones (1997). This work demonstrated that networks can give accurate spectral-type classifications across the spectral-type range B2-M7. Dafonte et al. (2016) presented an ANN architecture that learns the function relating the stellar parameters to the input spectra. They obtained residuals in the derivation of the metallicity below 0.1 dex for stars with Gaia magnitude $G_{\rm rvs} < 12$ mag, which accounts for a number in the order of four million stars to be observed by the radial velocity spectrograph of the Gaia satellite.¹ Ramírez Vélez et al. (2018) used an ML algorithm to measure the mean longitudinal magnetic field in stars from high-resolution polarized spectra. They found a considerable improvement of the results, allowing one to estimate the errors associated with the measurements of stellar magnetic fields at different noise levels. Parks et al. (2018) developed and applied a convolutional neural network (CNN) architecture using multitask learning to search for and characterize strong HI Lyα absorption in quasar spectra. Fabbro et al. (2018) applied a deep neural network architecture to analyse both SDSS-III APOGEE DR13 and synthetic stellar spectra. This work demonstrated that the stellar parameters are determined with similar precision and accuracy to the APOGEE pipeline. Sharma et al. (2020) introduced an automated approach for the classification of stellar spectra in the optical region using CNN. They also showed that DL methods with a larger number of layers allow the use of finer details in the spectrum, which results in improved accuracy and better generalization with respect to traditional ML techniques. Wang et al. (2020) introduced a DL method, SPCANet, which derived T$_{\rm{eff}}$, $\log g$, and 13 chemical abundances for LAMOST Medium-Resolution Survey data. These authors found abundance precision up to 0.19 dex for spectra with a signal-to-noise ratio (SNR) down to ~10. The results of SPCANet are consistent with those from other surveys, such as APOGEE, GALAH, and RAVE, and are also validated with previous literature values including clusters and field stars. Guiglion et al. (2020) derived the atmospheric parameters and abundances of different species for 420,165 RAVE spectra. They showed that CNN-based methods provide a powerful way to combine spectroscopic, photometric, and astrometric data without the need to apply any priors in the form of stellar evolutionary models.
More recently, Landa and Reuveni (2021) introduced a multi-layer CNN to forecast the occurrence probability of M- and X-class solar flare events. Chen et al. (2021) [...]
In this manuscript, we both present a new method to derive stellar atmospheric parameters and demonstrate the effect of each of the CNN parameters (such as the choice of the optimizer, loss function, and activation function) on the accuracy of the results. We provide the procedure that can be followed to find the most appropriate configuration independently of the architecture of the CNN. This is intended as the first in a series of papers that will help the astronomical community understand how most of the parameters, and the architecture of the network, affect the accuracy of the prediction. CNN parameters are numerous, and finding the optimal ones is a very hard task. To do so, we trained the CNNs with different configurations of the parameters using purely synthetic spectra for the three steps of training, cross-validation (hereafter called validation), and testing. Using synthetic spectra, we have access to the true parameters during our tests. Noisy spectra are tested in order to mimic observations.
We have limited our work to a specific type of object, A stars, because, as mentioned previously, the purpose is not to show how well we can derive the labelled stellar parameters but to show the effect of specific network parameters on stellar spectra analysis. By applying our models to A stars, we use previous results (Gebran et al. 2016, Kassounian et al. 2019) as a reference for the expected accuracy of the derived stellar parameters. In the same way, the wavelength range and the resolving power are chosen to be representative of values used by most available instruments. Once the calibration of the hyperparameters was performed, we tested our optimal network configurations on a set of FGK stars in Section 6, using the wavelength range of Paletou et al. (2015a).
¹ The limiting magnitude of the radial velocity spectrometer (RVS) is around 15.5 mag (Cropper et al. 2014).
The training, validation, and test data are explained in Section 2. Section 3 discusses the data preparation prior to training. The neural network construction and the parameter selection are explained in Section 4. Results are summarized in Section 5. The application of the optimal networks to FGK stars is performed in Section 6. Discussion and conclusion are gathered in Section 7.

Training spectra
Our learning or training databases (TDB) are constructed from synthetic spectra for stars having effective temperatures between 7,000 and 10,000 K, in the wavelength range of 4,450-5,000 Å. This range was selected because it is in the visible domain and contains metallic and Balmer lines sensitive to all stellar parameters (T$_{\rm{eff}}$, $\log g$, [M/H], $v_e \sin i$), especially for the spectral types selected in this work. This region is also insensitive to the microturbulent velocity, which was adopted to be $\xi_t = 2$ km s$^{-1}$ based on the work of Gebran et al. (2014, 2016). Surface gravity, $\log g$, is selected to be in the range of 2.0-5.0 dex. Projected rotational velocity, $v_e \sin i$, is calculated between 0 and 300 km s$^{-1}$. The metallicity, [M/H], is in the range of −1.5 to +1.5 dex. Table 1 displays the range of all stellar parameters. These spectra are used for both the training and the validation phases. Approximately 55,000 noise-free synthetic spectra were calculated using a random selection of the stellar parameters in the range of Table 1. These spectra are used instead of the observations (test data without noise). Gaussian noise, with SNR ranging between 5 and 300, was added to these test spectra in order to check the accuracy of the technique on noisy data (test data with noise).
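The noisy test spectra described above can be sketched as follows. This is only an illustration, not the authors' code: it assumes that the noise standard deviation is the mean flux divided by the SNR, and the function name `add_gaussian_noise` and the flat placeholder spectrum are hypothetical.

```python
import numpy as np

def add_gaussian_noise(flux, snr, rng):
    """Return a noisy copy of a normalized spectrum.

    Assumption: the noise standard deviation is mean(flux) / snr.
    """
    sigma = np.mean(flux) / snr
    return flux + rng.normal(0.0, sigma, size=flux.shape)

rng = np.random.default_rng(0)
flux = np.ones(10_800)        # placeholder flat continuum, not a real spectrum
snr = rng.uniform(5, 300)     # random SNR in the range used for the test data
noisy = add_gaussian_noise(flux, snr, rng)
```

Other SNR conventions (for example, per-pixel noise scaled by the local flux) would change only the line that computes `sigma`.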
Details for the calculations of the synthetic spectra can be found in the study of Gebran et al. (2016) or Kassounian et al. (2019). In summary, 1D plane-parallel model atmospheres were calculated using ATLAS9 (Kurucz 1992). These models are in local thermodynamic equilibrium (LTE) and in hydrostatic and radiative equilibrium. We have used the new opacity distribution function in the calculations (Castelli and Kurucz 2003) as well as a mixing length parameter of 0.5 for 7,000 K ≤ T$_{\rm{eff}}$ ≤ 8,500 K, and 1.25 for T$_{\rm{eff}}$ ≤ 7,000 K (Smalley 2004). We have used the SYNSPEC48 synthetic spectra code of Hubeny and Lanz (1992) to calculate all normalized spectra. The adopted line lists are detailed in the study of Gebran et al. (2016). This list is mainly compiled using the data from Kurucz gfhyperall.dat², VALD³, and the NIST⁴ databases.
Finally, the resolving power is simulated to $R = 60{,}000$. This value falls in the range between low and high resolution spectrographs. The technique that will be shown in the next sections can be used for any resolution. The construction and the size of the TDB will be discussed in Section 5. The use of synthetic spectra in ML to constrain the stellar parameters has been shown to suffer from the so-called synthetic gap (Fabbro et al. 2018, Passegger et al. 2020). This gap refers to the differences in feature distributions between synthetic and observed data. We have decided to limit our work to synthetic data for two reasons: first, we would like to remove the hassle of the data preparation steps (data reduction, flux calibration, flux normalization, radial velocity correction, and so on), and second, our intention is to find the strategy and technique that should be adopted in ML for deriving stellar parameters.
We are working on a future paper that deals with the architecture of the network as well as the choice of the kernel sizes and the number of neurons. Combining the best strategy to constrain the hyperparameters (this manuscript) with the optimal architecture (future studies) will allow us to use a combination of synthetic and observational data in our training database. Having well-known stellar parameters, these observational data will allow us to remove or minimize the synthetic gap and better constrain the stellar parameters.

Data preparation
The TDB contains $N_{\rm spectra}$ spectra that span the wavelength range of 4,450-5,000 Å. Having a wavelength step of 0.05 Å, this results in $N_\lambda = 10{,}800$ flux points per spectrum. The TDB can then be represented by a matrix $M$ of size $N_{\rm spectra} \times N_\lambda$. A colour map of a subsample of $M$ is displayed in Figure 1. Although the synthetic spectra are normalized, some wavelength points could have fluxes larger than unity. This is due to the noise that is incorporated during the so-called data augmentation procedure, which will be explained in Section 4.1.1.
Training a CNN using the matrix $M$ is time consuming, especially if one uses a larger wavelength range or a higher resolution. For that reason, we have applied a dimensionality reduction technique, i.e. principal component analysis (PCA), in order to reduce the size of the training TDB as well as the size of the validation, test, and noisy synthetic data. Although this step is optional, we recommend its use whenever the data can be represented by a small number of coefficients. The PCA can reduce the size of each spectrum from $N_\lambda$ to $n_k$. The choice of $n_k$ depends on many parameters: the size of the database, the wavelength range, and the shape of the spectral lines. As a first step, we need to find the principal components, and to do so, we proceed as follows.
The matrix $M$ is averaged along the $N_{\rm spectra}$-axis and the result is stored in a vector $\bar{M}$. Then, we calculate the eigenvectors $e_k(\lambda)$ of the variance-covariance matrix $C$ defined as

$$C = (M - \bar{M})^{\rm T} \cdot (M - \bar{M}), \qquad (1)$$

where the superscript "T" stands for the transpose operator. $C$ has a dimension of $N_\lambda \times N_\lambda$. Sorting the eigenvectors of the variance-covariance matrix in decreasing order of eigenvalue magnitude results in the "principal components." Each spectrum of $M$ is then projected on these principal components in order to find its corresponding coefficient $p_{jk}$ defined as

$$p_{jk} = (M_j - \bar{M}) \cdot e_k. \qquad (2)$$

The choice of the number of coefficients is regulated by the reconstructed error as detailed in the study of Paletou et al. (2015a):

$$E(n_k) = \left\langle \frac{\left| \bar{M} + \sum_{k=1}^{n_k} p_{jk} e_k - M_j \right|}{M_j} \right\rangle. \qquad (3)$$

We have opted for a value of $n_k$ that reduces the mean reconstructed error to a value of <0.5%. As an example, using a database of 25,000 spectra with stellar parameters ranging randomly between the values in Table 1 requires fewer than seven coefficients to reach an accuracy <1%, and a value of [...] Applying the same procedure to all our TDB and taking the maximum value to be used for all, we have adopted a constant value of $n_k = 50$. This value takes into account all the databases that will be dealt with in this work, especially since some will be data augmented, as will be explained in Section 4.1.1. This means that instead of training on a matrix having a dimension of $N_\lambda \times N_{\rm spectra}$, we are using one with a dimension of $n_k \times N_{\rm spectra}$. In that case, our new data consist of a matrix containing the coefficients that are calculated by projecting the spectra on the $n_k$ eigenvectors.
This projection procedure over the principal components is then applied to the validation, test, and noisy spectra datasets. With $n_k = 50$, the spectra can be reconstructed with more than 99.5% accuracy.
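The PCA steps of this section (mean spectrum, variance-covariance matrix, eigenvectors, projection, and reconstruction error) can be sketched with NumPy as follows. The toy data, array sizes, and variable names are assumptions made for illustration; they are not the TDB or the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_spectra, n_lambda, n_k = 200, 120, 10     # toy sizes, far smaller than the TDB

# Toy "spectra": continuum at 1 minus a mix of five random absorption-like components.
basis = rng.random((5, n_lambda))
M = 1.0 - 0.1 * rng.random((n_spectra, 5)) @ basis

M_bar = M.mean(axis=0)                      # mean spectrum, length n_lambda
C = (M - M_bar).T @ (M - M_bar)             # variance-covariance matrix (n_lambda x n_lambda)

eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
E = eigvecs[:, order[:n_k]]                 # first n_k principal components

P = (M - M_bar) @ E                         # coefficients p_jk, shape (n_spectra, n_k)
M_rec = M_bar + P @ E.T                     # spectra reconstructed from n_k coefficients
mean_error = np.mean(np.abs(M_rec - M) / M) # mean relative reconstruction error
```

The same `M_bar` and `E` would then be reused to project the validation, test, and noisy spectra, so that all datasets share one set of principal components.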
This section begins with a brief description of supervised⁵ learning. Given a dataset $(X, Y)$, the goal is to find a function $f$ such that $f(X)$ is as "close" as possible to $Y$. For example, $Y$ could be the effective temperature or surface gravity and $X$ the corresponding spectra. This "closeness" is typically measured by defining a loss function $L$ that measures the difference between the predicted and actual values. Therefore, the goal of the learning process is to find the $f$ that minimizes $L$ for a given dataset $(X, Y)$. Ultimately, the success of any learning method is assessed by how well it generalizes. In other words, once the optimal $f$ is found for the training set $(X, Y)$, and given another dataset $(X', Y')$, $f(X')$ should also be close to $Y'$.
One of the most successful methods in tackling this kind of problem is the ANN, a subset of ML. As the name suggests, an ANN is a set of connected building blocks called neurons, which are meant to mimic the operations of biological neurons (Anthony and Bartlett 1999, Wang 2003). Different kinds of ANNs can be built by varying the number of connections between individual neurons and the operations they perform. The operations performed by these neurons depend on a number of parameters called weights and on a nonlinear function called the activation. At a high level, an ANN is just the function $f$ that was described earlier. Since the network architecture is chosen at the start, finding the optimal $f$ boils down to finding the optimal weight parameters that minimize the cost function $L$.
Regardless of the type of ANN used, the process of finding the optimal weights is more or less the same, and works as follows. After the network architecture is chosen, the weights are initialized, then a variant of gradient descent is applied to the training data. Gradient descent changes the parameters iteratively, at a certain rate proportional to their gradient, until the loss value is sufficiently small (Ruder 2016). The proportionality constant is called the learning rate. While this process is well known, there is to date no clear prescription for the choice of the different components. The main difficulty arises from the fact that the loss function contains multiple minima with different generalization properties. In other words, not all minima of the loss function are equal in terms of generalization. Which minimum is reached at the end of the training phase depends on the initial values chosen for the weights, the optimization algorithm used (including the learning rate), and the training dataset (Zhang et al. 2016). In the absence of clear theoretical prescriptions for the components, one has to rely on experience and best practices (Bengio 2012).
One popular type of ANN is the feedforward network, where neurons are organized in layers, with the outputs of each layer fully connected to the inputs of the next.By increasing the number of layers (whence the "deep" in "DL"), many types of data can be modelled to a high degree of accuracy.Fully connected ANNs, however, have some shortcomings, such as the large number of parameters, slow convergence, overfitting, and most importantly, failure to detect local patterns.Almost all the aforementioned shortcomings are solved by using convolution layers.

CNN
A CNN is a multi-layer network where at least one of the layers is a convolution layer (LeCun 1989). As the name suggests, the output of a convolution layer is the result of a convolution operation on its input, rather than a matrix multiplication as in feedforward layers. Typically, this convolution operation is performed via a set of filters. CNNs have been very successful in image recognition tasks (Yim et al. 2015). Most commonly, CNNs are used in conjunction with pooling layers. In this work, since the input to the CNN has already been processed with PCA to reduce the dimension of the training database, we decided to omit pooling layers. Even though CNNs have mostly been used for processing image data, which can be viewed as 2D grid data, they can also be used for 1D data.
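To make the operation of a 1D convolution layer concrete, the sketch below slides a set of filters along a 1D input, as a convolution layer would do on a vector of PCA coefficients. It is a minimal NumPy illustration with hypothetical filter values, no padding, and a stride of 1; like most DL frameworks, it applies the filters without flipping them (strictly speaking, a cross-correlation).

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """Slide each kernel along a 1D input (no padding, stride 1).

    x: (length,), kernels: (n_filters, width), bias: (n_filters,)
    Returns an array of shape (length - width + 1, n_filters).
    """
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, width)  # (length - width + 1, width)
    return windows @ kernels.T + bias

x = np.arange(50, dtype=float)           # stand-in for 50 PCA coefficients
kernels = np.array([[1.0, 0.0, -1.0],    # two hypothetical filters of width 3
                    [1/3, 1/3, 1/3]])
out = conv1d_valid(x, kernels, bias=np.zeros(2))
```

In a trained network the filter values are not fixed by hand as here; they are weights learned during training.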
The architecture of a CNN differs among various studies. There is no perfect model; it all depends on the type and size of the input data, and on the type of the predicted parameters. In this work, we will not be constraining the architecture of the model; rather, we will be providing the best strategy to constrain the parameters of the model for a specific and defined architecture. Figure 3 shows a flow chart of a typical CNN. Table 2 represents the different layers, the output shape for each layer, and the number of parameters used in our model. In the same table, "Conv" stands for convolutional layer, "Flat" for flattening layer, which transforms the matrix of data into a one-dimensional array, and "Full" stands for fully connected layer. The total number of parameters to be trained at every iteration is 764,357. The choice of such an architecture is based on a trial and error procedure that we performed in order to find the best model that can handle all types of training databases used in this work. The strategy of selecting the number of hidden layers and the size of the convolution layers will be described in a future paper. We decided to do all our tests using the ML platform TensorFlow⁶ with the Keras⁷ interface. The reason is that these two options are open-source and written in Python.
Although the calculation time is an important parameter constraining the choice of a network, we have decided not to take it into consideration while selecting the optimal network. The reason is that the calculation time depends mainly on the network's architecture, which is not discussed in this article. Two other parameters also constrain the calculation time: the number of epochs and the batch size (related to the size of the TDB). Calculation time increases with increasing epoch number and decreases with increasing batch size. The main goal of this work is to find the optimal configuration for the parameters independently of the calculation time and the network architecture. As a rule of thumb, using a database of 70,000 spectra and 50 eigenvectors, it takes around 17 h to run the CNN over 2,000 epochs using 64 batches and a dropout of 30%. These calculations are done on an Intel Core i7-8750H CPU @ 2.20 GHz × 6.

Data augmentation
Data augmentation is a regularization technique that increases the diversity of the training data by applying different transformations to the existing data. It is usually used for image classification (Shorten and Khoshgoftaar 2019) and speech recognition (Jaitly and Hinton 2013). We tested this approach in our procedure in order to take into account some modifications that could occur in the shape of the observed spectra due to a bad normalization or inappropriate data reduction. We also took into account the fact that observed spectra are affected by noise and that the learning process should include the effect of this parameter.
For each spectrum in the TDB, five replicas were created. Each of these five replicas has different flux values, but they all have the same stellar labels T$_{\rm{eff}}$, $\log g$, [M/H], and $v_e \sin i$. The modifications are done as follows:
- Gaussian noise is added to the spectrum with an SNR ranging randomly between 5 and 300.
- The flux is multiplied in a uniform way by a scaling factor between 0.95 and 1.05.
- The flux is multiplied by a new scaling factor and noise is added.
- The flux is multiplied by a second-degree polynomial with values ranging between 0.95 and 1.05 and having its maximum randomly selected between 4,450 and 5,000 Å.
- The flux is multiplied by a second-degree polynomial and Gaussian noise is added to it.
The purpose of this choice is to increase the size of the TDB and to introduce some modifications in the training spectra that could appear in the observations that we need to analyse. Such modifications are the noise and the commonly observed departures from a perfect continuum normalization.
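The five modifications listed above can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the helper names are hypothetical, the noise level is assumed to be 1/SNR for a continuum near unity, and the polynomial is taken to be a downward parabola rescaled so that it spans exactly 0.95 to 1.05.

```python
import numpy as np

def polynomial_continuum(wave, rng):
    """Second-degree polynomial between 0.95 and 1.05 with a random peak position."""
    peak = rng.uniform(wave[0], wave[-1])
    curve = -(wave - peak) ** 2                                   # maximum at `peak`
    curve = (curve - curve.min()) / (curve.max() - curve.min())  # rescale to [0, 1]
    return 0.95 + 0.10 * curve

def add_noise(flux, rng):
    snr = rng.uniform(5, 300)
    return flux + rng.normal(0.0, 1.0 / snr, size=flux.shape)    # assumes continuum near 1

def augment(wave, flux, rng):
    """Return five modified replicas of one spectrum; the labels stay unchanged."""
    scaled = flux * rng.uniform(0.95, 1.05)
    return np.stack([
        add_noise(flux, rng),                                    # noise only
        scaled,                                                  # uniform scaling
        add_noise(flux * rng.uniform(0.95, 1.05), rng),          # new scaling plus noise
        flux * polynomial_continuum(wave, rng),                  # polynomial distortion
        add_noise(flux * polynomial_continuum(wave, rng), rng),  # polynomial plus noise
    ])

rng = np.random.default_rng(2)
wave = np.arange(4450.0, 5000.0, 0.05)
flux = np.ones_like(wave)                 # placeholder spectrum
replicas = augment(wave, flux, rng)       # shape (5, wave.size)
```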
Distortions in observed spectra can appear because of a poor selection of the continuum points. We have tested the two options, with and without data augmentation, and the results are shown in Section 5. Figure 4

Initializers: Kernel and bias
The initialization defines the way to set the initial weights. There are various ways to initialize, and we will be testing the following:
- Zeros: weights are initialized with 0. In that case, the activation in all neurons is the same and the derivative of the loss function is similar for every weight in every neuron. This results in a linear behaviour for the model.
- Ones: a similar behaviour to Zeros but using the value of 1 instead of 0.
- RandomNormal: initialization with a normal distribution.
- RandomUniform: initialization with a uniform distribution.
- TruncatedNormal: initialization with a truncated normal distribution.
- VarianceScaling: initialization that adapts its scale to the shape of the weights.
- Orthogonal: initialization that generates a random orthogonal matrix.
- Identity: initialization that generates the identity matrix.
For all of these initializers, the biases are initialized with a value of zero. It will be shown later that most of these initializers give the same accuracy, except for Zeros and Ones.

Optimizer
Once the (parameterized) network architecture is chosen, the next step is to find the optimal values for the parameters. If we denote by $\theta$ the collective set of parameters, then, by definition, the optimal values, $\theta^*$, are the ones that minimize a certain loss function $L(\theta)$, a measure of the difference between the predicted and the actual values. This optimization problem is typically solved in an iterative manner, by computing the gradient of the loss function with respect to the parameters.
Let $\theta_t$ denote the set of parameters at iteration $t$. The iterative optimization process produces a sequence of values, $\theta_1, \ldots, \theta^*$, that converges to the optimal values $\theta^*$. At a given step $t$ we define the history of that process as the set

$$H_t = \{\theta_s, \nabla L(\theta_s), L(\theta_s)\}_{s=1}^{t}.$$

The values $\theta_{t+1}$ are obtained from $\theta_t$ according to some update rule

$$\theta_{t+1} = F(H_t, \gamma_t),$$

where $\gamma_t$ is a set of hyperparameters such as the learning rate.
Different optimization techniques use different update rules. For example, in the so-called "vanilla" gradient descent, the update rule depends on the most recent gradient only:

$$\theta_{t+1} = \theta_t - \gamma_t \nabla L(\theta_t).$$

Other methods include the whole history with different functional dependence on the gradient and different rates for each step (see Choi et al. 2020 for a survey). Different optimization techniques are available in Keras and we will be testing the following:
- Adam: adaptive moment estimation, widely used for problems with noisy and sparse gradients. In practice, this optimizer requires little tuning for different problems.
- RMSprop: root mean square propagation, which iteratively updates the learning rate for each trainable parameter by using the running average of the squares of previous gradients.
- Adadelta: adaptive delta, where delta refers to the difference between the current weight and the newly updated weight. It also works as a stochastic gradient descent method.
- Adamax: an adaptive stochastic gradient descent method and a variant of Adam based on the infinity norm. It is also less sensitive to the learning rate than other optimizers.
- Nadam: Nesterov-accelerated Adam optimizer, used for gradients with noise or with high curvature. It uses an accelerated learning process by summing up the exponential decay of the moving averages for the previous and current gradient. It is also an adaptive learning rate algorithm and requires less tuning of the hyperparameters.
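As a point of comparison for the adaptive optimizers listed above, the "vanilla" gradient descent update can be written in a few lines. The sketch below uses a toy quadratic loss with a known minimum and a constant learning rate; both are assumptions chosen for illustration and are unrelated to the CNN loss surface.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])               # minimum of the toy loss (hypothetical)

def loss(theta):
    return 0.5 * np.sum((theta - TARGET) ** 2)   # toy quadratic loss

def grad(theta):
    return theta - TARGET                        # its exact gradient

theta = np.zeros(2)          # initial parameters
gamma = 0.1                  # constant learning rate
for t in range(200):
    theta = theta - gamma * grad(theta)      # update uses the most recent gradient only
```

Adaptive methods such as Adam replace the single `gamma * grad(theta)` term with a step built from running averages of past gradients, which is why they depend on the history of the process and not only on the latest gradient.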

Learning rate
As mentioned at the beginning of the section, the learning rate can affect the minimum reached by the loss function and therefore has a large effect on the generalization property of the solution. In this article, we followed the recommendation of Bengio (2012) and chose the learning rate value to be half of the largest rate that causes divergence.
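One possible reading of this recommendation is sketched below: scan trial learning rates in increasing order, find the point where training starts to diverge, and keep half of that value. The toy loss, the divergence criterion, and the grid of trial rates are all assumptions; in practice each trial would be a short training run of the network.

```python
import numpy as np

def diverges(rate, steps=100):
    """Run gradient descent on the toy loss L(w) = 2 * w**2 and report divergence."""
    w = 1.0
    for _ in range(steps):
        w -= rate * 4.0 * w                  # gradient of the toy loss is 4 * w
        if not np.isfinite(w) or abs(w) > 1e6:
            return True
    return False

candidates = np.logspace(-4, 1, 60)          # trial learning rates from 1e-4 to 10
first_bad = next(r for r in candidates if diverges(r))
learning_rate = first_bad / 2.0              # half of the rate at which divergence sets in
```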

Dropout
Dropout is a regularization technique for neural networks and DL models that prevents the network from overfitting (Srivastava et al. 2014). When dropout is applied, randomly selected neurons are removed at each iteration of the training: they do not contribute to the forward propagation, and no weight updates are applied to them during backward propagation. Statistically, this has the effect of performing an ensemble average over different sub-networks obtained from the original base network. We tried to find the optimal value for the fraction of neurons that are dropped out. Dropout layers are placed after each convolutional layer. Tests were performed with dropout fractions ranging between 0 and 1.
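The effect of a dropout layer on its input can be illustrated with the short NumPy sketch below. It uses the common "inverted dropout" convention, in which the surviving activations are rescaled by 1/(1 − rate) during training so that no rescaling is needed at prediction time; the function name and array sizes are hypothetical, and the rate must be strictly smaller than 1.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                   # dropout is inactive at prediction time
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(3)
a = np.ones((1000, 64))                      # placeholder layer output
dropped = dropout(a, rate=0.3, rng=rng)
```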

Pooling
A pooling layer is a way to downsample the features (i.e. to reduce the dimension of the data) in the database by grouping patches together during the training. The most common pooling methods are average and max pooling (Zhou and Chellappa 1988). Average pooling summarizes the mean intensity of the features in a patch, and max pooling considers only the highest value in a patch. The size of the patches and the number of filters used are decided by the user.
The standard way to do that is to add a pooling layer after the convolutional layer, and this can be repeated one or more times in a given CNN. However, pooling makes the output invariant to small translations of the input. In image detection, we need to know whether the features exist and not their exact position. That is why this technique has proven valuable when analysing images (Goodfellow et al. 2016). This is not the case for spectra, because the position of the lines needs to be well known (Section 5). In addition, as discussed previously, pooling layers are not needed in our case because the dimension of the TDB was already reduced drastically by applying PCA.
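For reference, max and average pooling over non-overlapping patches of a 1D signal can be sketched as follows. The function and the sample values are illustrative only; no pooling is used in the network of this work.

```python
import numpy as np

def pool1d(x, size, mode="max"):
    """Non-overlapping 1D pooling; trailing samples that do not fill a patch are dropped."""
    n = (x.size // size) * size
    patches = x[:n].reshape(-1, size)
    return patches.max(axis=1) if mode == "max" else patches.mean(axis=1)

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0, 7.0])
max_pooled = pool1d(x, size=2, mode="max")       # -> [3., 8., 5.]
avg_pooled = pool1d(x, size=2, mode="average")   # -> [2., 5., 4.5]
```

Both outputs are half the length of the input, and neither records where inside each patch the extreme value occurred, which is the loss of positional information discussed above.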

Activation functions
The activation function is a nonlinear transformation that is applied to the output of a layer; the transformed output is then sent to the next layer of neurons as input. Activation functions play a crucial role in deriving the output of a model, determining its accuracy and computational efficiency. In some cases, activation functions might prevent the network from converging.
The activation function for the inner layers of deep networks must be nonlinear; otherwise, no matter how deep the network is, it would be equivalent to a single layer (i.e. regression/logistic regression). Having said that, we have tested five activation functions that are as follows: sigmoid, [...]
It is important to note that in this section we discuss the choice of the activation function for inner layers only. The choice of the activation for the last layer is usually more or less fixed by the type of the problem and how one is modelling it. For example, if one is performing binary classification, then a sigmoid-like activation is usually used (or softmax for multiclass classification) and interpreted as a probability. However, for regression-like problems a linear activation is usually used for the last layer. In our case, which is a purely regression problem, the last layer will have a linear activation function.
The sigmoid and tanh restrict the magnitude of the output of the layer to be ≤1. Both, however, suffer from the vanishing gradient problem (Glorot et al. 2011). For inputs of relatively large magnitude, both functions saturate and their gradient becomes very small. Since deep networks rely on backpropagation for training, the gradients in the first few layers, being a product of the gradients of the succeeding layers, become increasingly small. The rectifier class of activations (relu, elu, and so on) seems to minimize the vanishing gradient problem. These activations also lead to sparse representations, which seem to give better results (He et al. 2015, Maas 2013).
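The saturation described above can be checked numerically. The sketch below evaluates the derivatives of the sigmoid, tanh, and relu functions at a few input values; the definitions are the standard ones, the elu is included with an assumed default of alpha = 1, and the input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() from overflowing on the branch that is discarded for x > 0
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-10.0, 0.0, 10.0])
grad_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the sigmoid
grad_tanh = 1.0 - np.tanh(x) ** 2                # derivative of tanh
grad_relu = (x > 0).astype(float)                # derivative of relu (taken as 0 at x = 0)
```

At |x| = 10 the sigmoid and tanh gradients are already several orders of magnitude below their peak values at x = 0, whereas the relu gradient stays at 1 for any positive input.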

Loss functions
The loss function controls the prediction error of an NN, as explained in Section 4. It is an important criterion in controlling the updates of the weights in an NN, mainly during the backward propagation. The type of loss function is selected depending on the type of output labels. If the output is a categorical variable, one can use the categorical crossentropy or the sparse categorical crossentropy. If we are dealing with a binary classification, binary crossentropy will be the normal choice for a loss function. Finally, in the case of a regression problem like the one encountered in the determination of stellar parameters from spectra, variants of mean squared error loss functions are used. In our work, we have tested the following functions:
- Mean squared error:
$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
- Mean squared logarithmic error:
$$\frac{1}{N} \sum_{i=1}^{N} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2$$
$y$ being the actual labels, $\hat{y}$ the predicted ones, and $N$ the number of spectra in the training dataset. Loss function selection can differ from one study to the other (Rosasco et al. 2004). For that reason, we have tested the above three functions in deriving the stellar parameters.
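The mean squared error and mean squared logarithmic error can be computed as in the sketch below, which follows their standard definitions. The label values are hypothetical and only serve to show the very different scales of the two losses.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mean_squared_log_error(y, y_hat):
    # log1p(v) = log(1 + v); labels and predictions must be greater than -1
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

y = np.array([7000.0, 8500.0, 10000.0])      # hypothetical true Teff values (K)
y_hat = np.array([7100.0, 8400.0, 10200.0])  # hypothetical predictions (K)
mse = mean_squared_error(y, y_hat)           # (100^2 + 100^2 + 200^2) / 3 = 20000
msle = mean_squared_log_error(y, y_hat)
```

Because the logarithmic variant is undefined for values at or below −1, labels that can reach such values (for example a metallicity of −1.5 dex) would need to be shifted or rescaled before this loss is applied.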

Epochs
The number of epochs is the number of times the whole dataset is used for the forward and the backward propagation. It controls the number of times the weights of the neurons are updated. As the number of epochs increases, the network can move from underfitting to overfitting, passing through the optimal solution.

Batches
Instead of passing the whole training dataset into the NN, we can divide it in N Batches batches and iterate on all batches per epoch.In that case, the number of iteration will be the number of batches needed to complete one Batches are used in order to avoid the saturation of the computer memory and the decrease of iterations speed.However, the selection of the optimal batch number is not straightforward.Adopted values are usually 32, 64, or 128 (Keskar et al. 2016).
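The relation between the dataset size, the batches, and the number of iterations per epoch can be illustrated as follows. The generator name is hypothetical, and the value 64 is read here as a batch size (an assumption), applied to a database of 70,000 spectra.

```python
import math
import numpy as np

def iterate_batches(n_samples, batch_size, rng):
    """Yield shuffled index arrays that together cover the dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(4)
n_samples, batch_size = 70_000, 64            # 64 is read here as a batch size (assumption)
batches = list(iterate_batches(n_samples, batch_size, rng))
iterations_per_epoch = len(batches)           # equals ceil(70000 / 64)
```

Each batch triggers one weight update, so the total number of updates in a run is the number of iterations per epoch multiplied by the number of epochs.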
One of the most important measures of the success for a deep neural network is how well it generalizes on some test data, not included in the training phase.In current deep neural networks, the loss function has multiple minima.Many experimental studies have shown that, during the training phase, the path to reaching a minimum is as important as the final value (Neyshabur et al. 2017, Zou et al. 2019, Zhang et al. 2016).A good rule of thumb is that a "small," less than 1% the size of the data, batch size generalizes better than "large" batches, about 10% of the training data (Keskar et al. 2016).
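The bookkeeping behind batches can be sketched as follows. The sketch assumes that the quoted values (32, 64, 128) are interpreted as the number of spectra per batch, which is an assumption on our side, and uses a dataset of 40,000 spectra as an example; the helper names are hypothetical.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see the whole dataset once."""
    return math.ceil(n_samples / batch_size)


def batch_fraction(n_samples, batch_size):
    """Batch size expressed as a fraction of the dataset size."""
    return batch_size / n_samples


def iterate_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


n_spectra = 40000  # example dataset size
for batch_size in (32, 64, 128):
    print(
        batch_size,
        iterations_per_epoch(n_spectra, batch_size),
        batch_fraction(n_spectra, batch_size) < 0.01,  # "small" batch rule of thumb
    )
```

Under this assumption, all three adopted values fall well below the 1% threshold of the rule of thumb for a dataset of this size.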

Results and analysis
The effect of each CNN parameter on the accuracy of the stellar parameters has been tested. To do so, we have used the same CNN with the same parameters for all our tests while changing only the concerned one each time. For example, to find the best epoch numbers, we fix the activation function, the optimizer, the number of batches, the dropout percentage, the loss function, and the kernel initializer while iterating on the number of epochs. The same parameters are used again for finding the optimal dropout percentage and so on. The fixed values used in these calculations are he_normal for the kernel initializer, the mean squared error for the loss function, the "ADAM" optimizer, the relu activation function, 50% of dropout, and 64 batches. These tests are performed with epochs of 100, 500, 1,000, 2,000, 3,000, 4,000, and 5,000. In all tests, the distribution of Training and Validation is 80% and 20%, respectively.
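The one-parameter-at-a-time strategy described above can be sketched as a generator of configurations. The baseline values and the list of epochs are those quoted in the text; the candidate values per hyperparameter are only an illustrative subset assumed for this example, and the dictionary keys are hypothetical names.

```python
# Fixed (baseline) values quoted in the text.
BASELINE = {
    "kernel_initializer": "he_normal",
    "loss": "mean_squared_error",
    "optimizer": "adam",
    "activation": "relu",
    "dropout": 0.5,
    "batches": 64,
}
EPOCHS = [100, 500, 1000, 2000, 3000, 4000, 5000]

# Illustrative candidate values (an assumed subset, not the full tested grid).
CANDIDATES = {
    "activation": ["relu", "elu", "tanh", "sigmoid"],
    "optimizer": ["adam", "adamax", "rmsprop"],
    "batches": [16, 32, 64, 128],
    "dropout": [0.1, 0.3, 0.5],
}


def one_at_a_time(baseline, candidates, epochs):
    """Yield (varied_name, config), changing a single hyperparameter per config."""
    for name, values in candidates.items():
        for value in values:
            for n_epochs in epochs:
                config = dict(baseline)  # copy so the baseline stays untouched
                config[name] = value
                config["epochs"] = n_epochs
                yield name, config


configs = list(one_at_a_time(BASELINE, CANDIDATES, EPOCHS))
print(len(configs))
```

Each configuration would then be used to train the same CNN, so that any change in accuracy can be attributed to the single hyperparameter that was varied.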
The results will be a combination of test errors spanning different numbers of epochs for each stellar parameter and CNN configuration. The variation with the number of epochs ensures that the trends are real and not due to local minima resulting from a low number of iterations. The tests are a collection of 110,000 synthetic spectra, half of them without noise and half with random noise as introduced in Section 2.
To better visualize the results and to reach a clearer conclusion about the optimal configurations, we display in Figures 5-8 the relative error of the observations. These errors are calculated by dividing the values by the maximum observation standard deviation in all configurations (i.e. including all epoch simulations). This will allow us to target the minimum values and pinpoint the best parameters.
In what follows, we show the results that were obtained using a training dataset of 40,000 randomly generated synthetic spectra in the ranges of Table 1. In Section 5.5, we discuss the effect of using a small or a large training database and the effect of using data augmentation or not.
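The normalization described above can be sketched as follows, assuming the standard deviations are stored per configuration and per number of epochs. The data layout, the helper names, and the numerical values are hypothetical and only serve to illustrate the calculation.

```python
def relative_errors(sigma_by_config):
    """Divide every standard deviation by the maximum over all configurations.

    sigma_by_config maps a configuration label to {n_epochs: sigma}.
    """
    max_sigma = max(
        sigma for per_epoch in sigma_by_config.values() for sigma in per_epoch.values()
    )
    return {
        label: {epochs: sigma / max_sigma for epochs, sigma in per_epoch.items()}
        for label, per_epoch in sigma_by_config.items()
    }


def best_configuration(relative):
    """Return the (label, n_epochs) pair with the smallest relative error."""
    pairs = [
        (value, label, epochs)
        for label, per_epoch in relative.items()
        for epochs, value in per_epoch.items()
    ]
    _, label, epochs = min(pairs)
    return label, epochs


# Hypothetical standard deviations (K) for two activation functions.
sigma = {
    "relu": {100: 200.0, 1000: 100.0},
    "tanh": {100: 400.0, 1000: 150.0},
}
relative = relative_errors(sigma)
print(relative)
print(best_configuration(relative))
```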

Effective temperature
According to Figure 5, the use of the relu or elu activation functions leads to similar results, within a difference of a few percent, independently of the number of epochs. As for the optimizer, Adam and Adamax seem to be consistently accurate across all epoch numbers. The optimal number of batches is found to be between 32 and 64. The number of epochs is tightly related to the batch number; however, in the case of 64 batches, the optimal number of epochs is found to be [...] RMSprop for the optimizer, a number of batches between 32 and 128, an epoch number of 3,000, a dropout fraction between 0.3 and 0.4, a mean squared logarithmic error loss function, and all kinds of initializers except for zeros and ones.

Metallicity
The metallicity parameter, [M/H], also behaves differently from T$_{\rm{eff}}$ and $\log g$. As seen in Figure 7, [M/H] requires a different combination of parameters in our CNN in order to reach optimal results. The tanh or relu activation functions give the least error for most epoch numbers. The Adam and RMSprop optimizers lead to similar results, within a few percent. A combination of 16 batches and 1,000 epochs is appropriate to derive [M/H] with low errors. A dropout between 10 and 30%, a mean absolute error loss function, and a RandomUniform kernel initializer are to be used in order to reach the highest possible accuracy for [M/H]. Our technique was applied to A stars and extrapolated to FGK stars (Section 6). However, specific considerations should be taken into account when deriving the metallicities of cool stars due to the forests of molecular lines that are present in their spectra (Passegger et al. 2021).
In the case of [M/H], the optimal configuration is found using the following parameters: Activation function: tanh. Optimizer: Adam. Batches: 16. [...] the same TDB1 parameter ranges. We have also checked the importance of using Data Augmentation as a regularization technique for deriving accurate parameters (see Section 4.1.1 for details).
For each stellar parameter, we used the optimal CNN with the configuration that was derived in Sections 5.1-5.4. Each configuration was tested with TDB1, TDB2, and TDB3 with and without Data Augmentation. Figure 9 displays the average relative standard deviation for each stellar parameter with respect to the maximum values, for the training, validation, test, and observation sets. In order to quantify these proxies for the uncertainties of the techniques, Table 3 collects the standard deviations for the four stellar parameters as a function of the training database.
According to Table 3, each parameter behaves differently with respect to the change of the databases. This is mainly due to the number of unique values of the parameter in the database. For that reason, [M/H] is well represented by TDB1 without data augmentation, whereas T$_{\rm{eff}}$, $\log g$, and $v_e \sin i$ require a larger database to be well represented. $\log g$ can be well represented with TDB3 with data augmentation, whereas T$_{\rm{eff}}$ can be predicted with TDB2 with data augmentation. Finally, $v_e \sin i$ can be predicted using TDB3 with data augmentation.

Accuracy for the optimal configuration
After selecting the optimal configuration for each stellar parameter, the predicted parameters are displayed in Figure 10 as a function of the input ones for the training, validation, and the two sets of test datasets. All data points are located around the $y = x$ line. The dispersion of the observations around that line is due to spectra with very low signal to noise. The accuracy that we found using our CNN architecture seems to be appropriate for A stars, as it is comparable to that of most previous studies using classical tools (Aydi et al. 2014) or more complicated statistical tools (Gebran et al. 2016, Kassounian et al. 2019). The same is true for all parameters.
In order to verify the effect of the noise on the predicted parameters, Figure 11 displays the variation of the accuracy of the predicted values with respect to the input SNR. The figure also displays the observations depending on the values of $v_e \sin i$. The reason is that increasing $v_e \sin i$ induces blending in the spectra, which leaves less information to be used in the prediction. This is reflected in the case of low $v_e \sin i$, for which the predicted values are found to be more accurate than in the case of large $v_e \sin i$.
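The dependence of the accuracy on the noise level can be examined by grouping the prediction errors in SNR bins. The sketch below is one possible way to do so; the bin edges, the helper name, and the toy values are assumptions made for the illustration and do not reproduce Figure 11.

```python
def mean_abs_error_by_snr(snr, y_true, y_pred, bin_edges):
    """Mean absolute prediction error in each [edge_i, edge_i+1) SNR bin.

    Returns None for bins that contain no spectrum.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, t, p in zip(snr, y_true, y_pred):
        for i in range(n_bins):
            if bin_edges[i] <= s < bin_edges[i + 1]:
                sums[i] += abs(t - p)
                counts[i] += 1
                break
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]


# Toy example: four noisy spectra of a star with an effective temperature of 8,000 K.
snr = [10, 20, 150, 250]
t_true = [8000.0, 8000.0, 8000.0, 8000.0]
t_pred = [8300.0, 8100.0, 8050.0, 8010.0]
print(mean_abs_error_by_snr(snr, t_true, t_pred, bin_edges=[5, 100, 300]))
```

The same grouping could be repeated for subsets selected by rotational velocity in order to separate the effect of the noise from that of line blending.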

Extrapolating to other spectral-types
In order to verify how universal the results are and to check that the optimization of the code does not depend on wavelength and/or spectral-type, we also tested the procedure on FGK stars. To do that, we have calculated a TDB specific for FGK stars using the parameters displayed in Table 4. The wavelength range was selected to coincide with the one of Paletou et al. (2015a). This range is sensitive to all the concerned stellar parameters. A database of 50,000 random synthetic spectra with known stellar labels is used in the training. About 20,000 test data, with and without noise, were calculated in the same range of Table 4 to be used for verification. The optimal NNs that were introduced in Section 5 were used again, as a proof of concept, for the FGK TDB. The results are displayed in [...], and [M/H], respectively (Table 5). These results are very promising, but we should be aware of the complications that would arise when using real observations, especially in the case of the cool M stars. These stars have been analysed in the context of exoplanet search (Shan et al. 2021, Passegger et al. 2020) and show complications in their spectra mainly related to the continuum normalization. Adapting the data preparation and the CNN will be inevitable in order to take these effects into account. These results also show that when deriving the stellar parameters for specific spectral-types, the wavelength region should be selected according to the spectral lines/bands most sensitive to the variations of the parameters one seeks.

Discussion and conclusion
The purpose of this work is not only to find the best tool for the accurate prediction of parameters but also to show the steps that should be taken in order to reach the optimal selection of the CNN parameters. Scientists often use DL as a black box without explaining the choice of the parameters and/or architecture. In this manuscript, we have explained the reason for selecting specific hyperparameters while emphasizing the pedagogical approach. To have a more effective tool, one should change the architecture of the model. The architecture of the model depends on the type and range of the input. In this work, we have fixed the architecture and iterated on the hyperparameters only. Sections 5.1-5.4 show that for each stellar parameter, the setup of the network should be changed. This means that for a specific network and a specific stellar parameter, a study should be made to find the optimal configuration of hyperparameters. This is due to the contribution of the specific stellar parameter to the shape of the input spectrum. Using the PCA decomposition, we have reduced the size of the input parameters to only 50 points per spectrum while keeping more than 99.5% of the information. This is recommended in the case of large databases and wide wavelength ranges and could avoid the use of extra pooling layers in the network. This projection technique is not only applicable to AFGK stars but can also be used for cooler stars. Although the CNN architecture was not optimized, we were able, using a strategy of finding the best hyperparameters, to reach a level of accuracy that is comparable to other adopted techniques. In fact, we found for A stars an average accuracy of 0.08 dex for $\log g$, 0.07 dex for [M/H], 3.90 km s$^{-1}$ for $v_e \sin i$, and 127 K for T$_{\rm{eff}}$. In the case of stars with $v_e \sin i$ less than 100 km s$^{-1}$, we found the accuracy to be 90 K, 0.06 dex, 0.06 dex, and 2.0 km s$^{-1}$ for T$_{\rm{eff}}$, $\log g$, [M/H], and $v_e \sin i$, respectively. These accuracy values depend on the signal to noise, and the errors decrease as the signal to noise increases. Extrapolating the technique to FGK stars also shows that the same network could be applied to different spectral-types and different wavelength ranges.
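The PCA projection mentioned above can be sketched with a singular value decomposition. This is a minimal illustration, assuming NumPy is available and using random toy data in place of synthetic spectra; the function names, the array sizes, and the choice of five components are assumptions of the example and not the setup used in this work.

```python
import numpy as np


def pca_fit(spectra, n_components):
    """Return the mean spectrum, the leading principal axes, and the
    fraction of the variance they retain."""
    mean = spectra.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return mean, vt[:n_components], explained[:n_components].sum()


def pca_project(spectra, mean, components):
    """Coefficients of each spectrum on the principal axes (the network input)."""
    return (spectra - mean) @ components.T


def pca_reconstruct(coefficients, mean, components):
    """Approximate spectra rebuilt from the projected coefficients."""
    return coefficients @ components + mean


# Toy stand-in for a training database: 200 "spectra" of 500 flux points,
# built from 5 latent profiles so that 5 components describe them fully.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
profiles = rng.normal(size=(5, 500))
spectra = 1.0 + 0.05 * latent @ profiles

mean, components, retained = pca_fit(spectra, n_components=5)
coefficients = pca_project(spectra, mean, components)
reconstructed = pca_reconstruct(coefficients, mean, components)
print(coefficients.shape, retained)
print(np.mean(np.abs(reconstructed - spectra)))  # mean reconstruction error
```

Repeating the fit for an increasing number of components and recording the mean reconstruction error gives a curve from which the number of coefficients to keep can be chosen.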
The technique that we followed in this article could be transferable to any classification problem that involves a neural network. In the future, we plan to develop a strategy to find the best CNN architecture depending on the input data and the type of the predicted parameters. Once the architecture and the configuration of the parameters are settled, we will test the procedure on observational spectra as we did in the studies of Paletou et al. (2015a), Paletou et al. (2015b), Gebran et al. (2016), and Kassounian et al. (2019). Using only observational data or a combination of synthetic spectra and real observations with well-known parameters will allow us to constrain the derived stellar labels while minimizing the critical synthetic gap (Fabbro et al. 2018). One more point that should be taken into account is that, when applying this technique to real observations, thorough data preparation should be done to take into account the characteristics of each spectral-type (e.g. continuum normalization in M and giant stars, and the low number of lines in hot stars).
[...] introduced an AGN recognition method based on a deep neural network. Almeida et al. (2021) used ML methods to generate model spectral energy distributions (SEDs) and fit sparse observations of low-luminosity active galactic nuclei. Rhea et al. (2020) and Rhea and Rousseau-Nepton (2021) used CNNs and different ANNs to estimate emission-line parameters and line ratios present in different filters of the SITELLE spectrometer. Curran et al. (2021) used DL combined with k-Nearest Neighbour and Decision Tree Regression algorithms to compare the accuracy of the predicted photometric redshifts of newly detected sources. Ofman et al. (2022) applied the ThetaRay Artificial Intelligence algorithms to 10,803 light curves of threshold crossing events and uncovered 39 new exoplanetary candidate targets. Bickley et al. (2021) reached a classification accuracy of 88% while investigating the use of a CNN for automated merger classification. Gafeira et al. (2021) used assisted inversion techniques based on CNNs for solar Stokes profile inversions. In the context of classification of galactic morphologies, Gan et al. (2021) used ML generative adversarial networks to convert blurred ground-based Subaru Telescope images into quasi Hubble Space Telescope images. Garraffo et al. (2021) presented StelNet, a deep neural network trained on stellar evolutionary tracks that quickly and accurately predicts mass and age from absolute luminosity and effective temperature for stars of solar metallicity.

[...] $n_k = 17$ to reach a 0.5% error, as shown in Figure 2. This technique has shown its efficiency when applied to synthetic and/or real observational data with T$_{\rm{eff}}$ > 4,000 K (see Gebran et al. 2016, Paletou et al. 2015a, b, for more details).

Figure 1: Colour map representing the fluxes for a sample of the training database using data augmentation. Wavelengths are in Å.

Figure 2: Mean reconstructed error as a function of the number of principal components used for the projection. The dashed lines represent the 1 and 0.5% errors, respectively. For $n_k > 17$ [...]

Figure 3: CNN architecture used in this work. A PCA dimension reduction transforms the spectra into a matrix of input coefficients. This input passes through several convolutional layers and fully connected layers in order to train the data and predict the stellar parameters.
[...] [M/H] = 0.0 dex as well as the extra five modifications that were performed on this spectrum. We have decided to use a continuous SNR between 5 and 300, but different modifications could be tested. As an example, González-Marcos et al. (2017) adapted the SNR of the spectra used in the training dataset to the SNR of the spectra for which the atmospheric parameters are needed (evaluation set). They concluded that in the case of T$_{\rm{eff}}$, only two regression models are needed (SNR = 50 and 10) to cover the entire SNR range.

Figure 4: The effect of the data augmentation on the shape of the spectra. Upper left: original synthetic spectrum. Upper middle: Gaussian noise added to the synthetic spectrum. Upper right: synthetic spectrum with the intensities multiplied by a constant scale factor. Bottom left: Gaussian noise added to the synthetic spectrum and multiplied by a constant scale factor. Bottom middle: synthetic spectrum with the intensities multiplied by a second-degree polynomial. Bottom right: Gaussian noise added to the synthetic spectrum and multiplied by a second-degree polynomial. All these spectra have the same stellar parameters (T$_{\rm{eff}}$ = 8,800 K, $\log g$ = 4.3 dex, $v_e \sin i$ = 45 km s$^{-1}$ [...]

Figure 6: Same as Figure 5 but for $\log g$.

Figure 8: Same as Figure 5 but for $v_e \sin i$.

Figure 9: Relative errors for each stellar parameter using TDB1, TDB2, and TDB3 with and without data augmentation as a training dataset.

Figure 10: Predicted stellar parameters using the optimal CNN configurations for T$_{\rm{eff}}$, $\log g$, $v_e \sin i$, and [M/H] as a function of the input ones for the training, validation, and test databases as well as for the noise-added observations.

Figure 11: Average error bars for the observation predicted stellar parameters as a function of the SNR and for different ranges of stellar rotation.

Table 1: Ranges of the parameters used for the calculation of the synthetic spectra TDBs.

Table 3: Derived standard deviation for each parameter using TDB1, TDB2, and TDB3 with and without data augmentation. The values for the Training, Validation, and the two sets of Test are depicted in this table.

Table 5: Derived standard deviation for each parameter using the TDB for FGK stars.

Furthermore, (Houdebine et al. 2016, Paletou et al. 2015b), Sarro et al. (2018) have applied a projection pursuit regression model based on the independent component analysis compression coefficients to derive T$_{\rm{eff}}$, $\log g$, and [M/H] of M-type stars.