Machine translation of English content: A comparative study of different methods

Abstract: Based on neural machine translation, this article introduced the ConvS2S system and the transformer system, designed a transformer system combined with semantic sharing to improve translation quality, and compared the three systems on the NIST dataset. The results showed that the computing speed of the transformer system combined with semantic sharing was the highest, reaching 3934.27 words per second, and that the BLEU value of the ConvS2S system was the smallest, followed by the transformer system and the transformer system combined with semantic sharing. Taking NIST08 as an example, the BLEU value of the designed system was 4.74 and 1.49 higher than those of the other two systems. The analysis of example sentences showed that the transformer system combined with semantic sharing had higher translation quality. The experimental results show that the proposed system is reliable in English content translation and can be further promoted and applied in practice.


Introduction
Natural language processing (NLP) mainly studies [1] how to realize effective communication between humans and computers through natural language and has been widely used in machine translation (MT) [2], public opinion monitoring [3], text classification [4], etc. The process of MT can be interpreted as decoding the source corpus and re-coding it into the target language; a deep understanding of the grammar and semantics of the language is necessary to ensure high-quality MT. Neural machine translation (NMT) is one kind of MT [5]. Choi et al. [6] contextualized the word embedding vector using a nonlinear bag-of-words representation of the source sentence and used typed symbols to represent special tokens, such as numbers, proper nouns, and acronyms. Experiments on En-Fr and En-De showed that the method could significantly improve the quality of translation. Wu et al. [7] pointed out the importance of grammar knowledge for translation performance, designed a grammar-aware encoder, and incorporated it into NMT. Through experiments, they found that the method could improve the quality of translation. Lee et al. [8] introduced an NMT model that mapped a source character sequence to a target character sequence without any segmentation. They used a character-level convolutional network with max-pooling at the encoder to reduce the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. The experiment found that the model showed better performance. Gu et al. [9] proposed a new MT method for languages with limited parallel data. It used transfer learning to share a target language across multiple source languages and to share the source encoder with other languages. The experiment found that the method could achieve a BLEU of 23 on Romanian-English WMT2016 using a tiny parallel corpus of 6k sentences.
Translation between English and Chinese has always been a difficult problem in MT. English belongs to the Germanic language family, which has morphological inflection and a relatively fixed word order. Chinese belongs to the Sino-Tibetan language family, which expresses grammatical relations through word order and function words. To be specific, English focuses on cohesion [10] and has a tight sentence structure; Chinese, which pays attention to coherence [11], has a relatively loose sentence structure, and it is often necessary to combine the context to understand the meaning of a sentence. In addition, there are great differences in thinking and culture between Chinese and English, which also leads to poor-quality translation results. Therefore, this study mainly analyzed MT methods for English content. Taking NMT as the subject, this study compared the translation performance of three different NMT methods. This work makes some contributions to the realization of better translation between English and Chinese.
Different neural MT methods

ConvS2S system
Convolutional neural networks (CNNs) have the characteristics of weight sharing, down-sampling, etc. [12]. They have significant advantages in image processing [13] and have been widely used in NLP [14], e.g., in semantic analysis [15] and language modeling [16], and can also be used in MT. In the ConvS2S system, each layer in the decoder contains an attention module, and sublayers are connected by residuals:

$$h^{l} = h^{l-1} + \mathrm{sublayer}(h^{l-1}),$$

where $h^{l}$ is the output of the $l$th sublayer and $\mathrm{sublayer}(\cdot)$ is the function performed by the layer, which is realized by a CNN. It is assumed that the weight of every convolution kernel is $w_{l}$ and the bias is $b_{w}$. The representations of $k$ words are merged:

$$X_{i} = \left[h^{l-1}_{i-\lfloor k/2\rfloor}; \dots; h^{l-1}_{i+\lfloor k/2\rfloor}\right]$$

and mapped to an output element:

$$Y_{i} = w_{l} X_{i} + b_{w}.$$

Then, the output of the $l$th sublayer can be written as:

$$h^{l}_{i} = v(Y_{i}) + h^{l-1}_{i},$$

where $v(\cdot)$ is the gated linear unit (GLU) applied to the convolution output. For an input matrix $[A; B]$, its operation is:

$$v([A; B]) = A \otimes \sigma(B),$$

where $A$ and $B$ are the inputs of the GLU, $\otimes$ is element-wise multiplication of the matrices, and $\sigma$ is a nonlinear activation function.
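A minimal numerical sketch of the sublayer computation above, assuming a width-$k$ convolution that doubles the channel dimension before the GLU; the zero-padding scheme and kernel layout are illustrative assumptions, not the exact ConvS2S implementation:

```python
import numpy as np

def glu(x):
    """Gated linear unit: split the last dimension into A and B,
    then return A * sigmoid(B) element-wise."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def conv_sublayer(h_prev, w, b_w, k):
    """One convolutional sublayer with a residual connection:
    merge the k word vectors around position i, map them to an
    output element Y = w X + b_w, apply the GLU, and add h_prev[i].
    h_prev: (seq_len, d); w: (k*d, 2*d); b_w: (2*d,); k: odd kernel width."""
    seq_len, d = h_prev.shape
    pad = k // 2
    padded = np.pad(h_prev, ((pad, pad), (0, 0)))  # zero-pad the sequence ends
    out = np.empty_like(h_prev)
    for i in range(seq_len):
        window = padded[i:i + k].reshape(-1)  # merge the k word vectors
        y = window @ w + b_w                  # map to an output element
        out[i] = glu(y) + h_prev[i]           # GLU + residual connection
    return out
```

With all-zero kernel weights the GLU outputs zero and the residual path passes the input through unchanged, which makes the residual structure easy to verify.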

Transformer system
The transformer system [17] abandons the recurrent neural network and uses a self-attention mechanism [18], generating three vectors, $q$, $k$, and $v$, from the input word vector. Self-attention is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$$

where $Q$, $K$, and $V$ are the matrices of the three vectors and $d_{k}$ is the dimension of $k$. The source language sequence is set as $x = (x_{1}, x_{2}, \dots, x_{n})$ and the target language sequence as $y = (y_{1}, y_{2}, \dots, y_{m})$. In the transformer system, the self-attention mechanism is realized by the multi-head attention module:

$$\mathrm{head}_{i} = \mathrm{Attention}\left(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}\right),$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}\right)W^{O}.$$

The output generated by the multi-head attention module enters the feed-forward neural network (FFN) [19] to generate the output of the encoder:

$$\mathrm{FFN}(x) = \max(0, xW_{1} + b_{1})W_{2} + b_{2},$$

where $W_{1}$ and $W_{2}$ are the weights and $b_{1}$ and $b_{2}$ the biases of the first and second mappings.
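The attention formulas above can be sketched directly in NumPy; the projection shapes below are illustrative assumptions, not the tensor2tensor implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_self_attention(X, heads, W_O):
    """Multi-head self-attention: each head projects X with its own
    (W_Q, W_K, W_V); head outputs are concatenated and projected by W_O."""
    outs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ W_O

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

Each row of the attention weight matrix sums to 1, so attention outputs are convex combinations of the value vectors.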
In the system, time sequence information is obtained by position coding:

$$PE_{(pos, 2i)} = \sin\left(pos/10000^{2i/d_{model}}\right),$$

$$PE_{(pos, 2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right),$$

where $d_{model}$ refers to the size of the model dimension, $pos$ is the position, and $i$ is the dimension index.
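The position coding above can be sketched as a direct transcription of the two formulas, assuming an even $d_{model}$:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position coding: sin on even dimensions,
    cos on odd dimensions, with wavelength 10000^(2i/d_model)."""
    pos = np.arange(seq_len)[:, None]      # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

At position 0 all sine entries are 0 and all cosine entries are 1, which is a quick sanity check on the layout.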
The decoder of the system is structured like the encoder, with a softmax layer added at the end. The final output of the system is a probability distribution over candidate target words. The cross-entropy function is used for training, and the optimizer is Adam.

Transformer system combined with semantic sharing
NMT can be regarded as a model of transformation between two semantic spaces. If it is combined with a cross-language shared semantic representation space, the semantic relevance of the model's translation results can be improved. Therefore, this article optimizes the transformer system with semantic sharing, including parameter sharing and representation sharing. First, the same parameters are shared when training the translation tasks. It is assumed that the two languages to be translated are $X$ and $Y$. The loss function is optimized as follows:

$$\min_{\theta}\; L(\theta_{X \to Y}) + L(\theta_{Y \to X}), \quad \theta_{X \to Y} = \theta_{Y \to X} = \{\theta_{enc}, \theta_{dec}, \theta_{s}, \theta_{t}\},$$

where $\theta_{X \to Y}$ refers to the parameters of the $X \to Y$ direction model, $\theta_{Y \to X}$ refers to the parameters of the $Y \to X$ direction model, $\theta_{enc}$ refers to the parameters of the encoder, $\theta_{dec}$ refers to the parameters of the decoder, $\theta_{s}$ refers to the source word representation, and $\theta_{t}$ refers to the target word representation.
Based on parameter sharing, the representations generated by the model are also shared. Taking the encoder as an example, when the X and Y representations are shared, the encoder not only learns to encode both languages with the same parameters but also learns the mapping from the word representation space to the hidden-layer representation space of sentences. The output of the transformer system at moment $j$ is:

$$P(y_{j} \mid y_{<j}, x) = \mathrm{softmax}(W z_{j}),$$

where $W$ is the transformation matrix and $z_{j}$ is the hidden-layer state obtained by the decoder. In representation sharing, the above formula is modified to tie $W$ to the shared target word representation $\theta_{t}$:

$$P(y_{j} \mid y_{<j}, x) = \mathrm{softmax}(\theta_{t} z_{j}).$$

The probability of reconstructing $Y$ is calculated as:

$$P(y \mid z) = \prod_{j} P(y_{j} \mid y_{<j}, z).$$

In order to realize representation sharing, i.e., to realize reconstruction without supervision, the loss function is optimized again:

$$L = L_{X \to Y} + L_{Y \to X} + \alpha L_{rec},$$

where $\alpha$ stands for the weight of the reconstruction constraint loss.
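A small sketch of the shared output layer and the combined objective; tying the transformation matrix to the shared target word representation, and the exact form of the reconstruction term, are assumptions about the scheme described above, not the paper's verified implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # numerically stable softmax
    return e / e.sum()

def output_distribution(z_j, theta_t):
    """Decoder output at moment j under representation sharing
    (assumption): the transformation matrix is tied to the shared
    target word representation theta_t of shape (vocab, d), so
    p(y_j | y_<j, x) = softmax(theta_t z_j)."""
    return softmax(theta_t @ z_j)

def total_loss(loss_xy, loss_yx, loss_rec, alpha):
    """Combined objective: both translation directions plus the
    reconstruction constraint weighted by alpha."""
    return loss_xy + loss_yx + alpha * loss_rec
```

The output distribution always sums to 1 regardless of the tied matrix, so the tying changes only which parameters are learned, not the form of the prediction.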

Experimental analysis

Experimental setup
The ConvS2S system was implemented with the fairseq open-source toolkit [20]. The number of convolution layers was 16, the dimension of the layers was 256, and the convolution kernel width was 3. The nag optimizer provided by fairseq was used with a learning rate of 0.25; the remaining parameters were kept at their default values. The transformer system was implemented on the open-source tensor2tensor [21]. The number of network layers was 6, the dimension of the layers was 512, the multi-head attention module used eight heads, the dimension of q, k, and v was 64, the dimension of the hidden layer was 2,048, and the beam size was 8. NIST data sets from LDC were used for the experiment [22]: NIST06 was used as the development set, and NIST02, NIST03, NIST04, NIST05, and NIST08 were used as the test sets. The translation performance of the three systems was compared.

Evaluation criteria
The results are evaluated by the BLEU value [23]. The $n$-gram precision is calculated as:

$$P_{n} = \frac{\sum_{c \in candidates} \sum_{n\text{-}gram \in c} Count_{clip}(n\text{-}gram)}{\sum_{c \in candidates} \sum_{n\text{-}gram \in c} Count(n\text{-}gram)},$$

where $P_{n}$ refers to the matching degree of the $n$th order, usually up to the fourth order, $candidates$ refers to the candidate translations, and $c$ refers to every sentence in the candidate translation. Finally, BLEU is calculated as:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_{n} \log P_{n}\right), \qquad BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases},$$

where $BP$ is the length penalty factor, $r$ and $c$ are the lengths of the reference translation and candidate translation, and $w_{n}$ is the weight coefficient.
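A sentence-level sketch of the BLEU computation above; real evaluations aggregate counts over the whole corpus, and the small smoothing constant here is an illustrative assumption to avoid taking the log of zero:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions P_n up to order
    max_n with uniform weights w_n = 1/max_n, multiplied by the
    brevity penalty BP (1 if c > r, else e^(1 - r/c))."""
    c, r = len(candidate), len(reference)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())  # Count_clip
        total = max(sum(cand.values()), 1)                          # Count
        log_p += math.log(max(clipped, 1e-9) / total) / max_n       # w_n log P_n
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p)
```

A candidate identical to its reference scores 1.0, while a candidate with no overlapping n-grams scores essentially 0.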

Comparison of results
The computing speed of the ConvS2S system, the transformer system, and the transformer system combined with semantic sharing is shown in Figure 1. The computing speed of the transformer system combined with semantic sharing was the fastest, reaching 3934.27 words per second; the ConvS2S system reached 2869.45 words per second, and the transformer system reached 3048.16 words per second. The computing speed of the transformer system combined with semantic sharing was 37.11% and 29.07% higher than that of the ConvS2S system and the transformer system, respectively.
The BLEU values of the ConvS2S system, the transformer system, and the transformer system combined with semantic sharing are shown in Figure 2. Figure 2 shows that the BLEU value of the ConvS2S system was the smallest, followed by the transformer system and the transformer system combined with semantic sharing.
Specifically, the BLEU value of the transformer system combined with semantic sharing exceeded those of the ConvS2S system and the transformer system by 4.46 and 1.52 in NIST02, by 4.03 and 2.12 in NIST03, by 5.5 and 1.22 in NIST04, by 3.87 and 1.05 in NIST05, and by 4.74 and 1.49 in NIST08, respectively. These results verified that the transformer system combined with semantic sharing achieved higher quality in English content translation.
The translation results of two sentences were analyzed, as shown in Table 1. The translations of the ConvS2S system and the transformer system differed from the reference translation; from a semantic perspective, the differences were large. The translations of the transformer system combined with semantic sharing were very similar to the reference translation, with stronger readability and higher translation quality.

Discussion
NMT maps sentences of the source language directly to sentences of the target language through an end-to-end method [24] and significantly outperforms statistical MT when parallel corpora are sufficient. It does not require separate modules for word alignment and reordering but outputs translation results directly through a neural network. It not only has a wide range of practical applications but also has important research value in the field of translation [25].
This study mainly compared two NMT methods, the ConvS2S system and the transformer system, and improved the transformer system with semantic sharing. Experiments were carried out on the NIST dataset. In terms of computing speed, the ConvS2S system was the slowest at 2869.45 words per second, showing high computational complexity; the transformer system reached 3048.16 words per second, 6.23% faster than the ConvS2S system. The improved system reached 3934.27 words per second, significantly higher than the first two systems, i.e., the improved system had an advantage in computational efficiency. The comparison of BLEU values showed the same ordering on the different data sets: ConvS2S system < transformer system < improved system. The average BLEU values of the three systems were 37.18, 40.22, and 41.7, respectively; the transformer system was 3.04 higher than the ConvS2S system, while the improved system was 4.52 higher than the ConvS2S system and 1.48 higher than the transformer system. These results revealed that the improved system performed better in both computational efficiency and translation quality. Finally, the comparison of translation results showed that the outputs of the ConvS2S and transformer systems deviated semantically from the English example sentences to different degrees, failing to fully express the meaning of the source sentences and falling short in completeness and readability, whereas the improved system translated the example sentences better and more accurately, showing higher translation quality. Some results have been achieved in this comparison of MT methods for English content; however, some shortcomings remain.
In future research, more NMT methods can be improved and compared, and experiments can be conducted on more datasets to further improve the efficiency and quality of English translation.

Conclusion
This study introduced two NMT methods for English content translation, the ConvS2S system and the transformer system, and designed a transformer system combined with semantic sharing to improve translation quality. The experiment on the NIST data set showed that the transformer system combined with semantic sharing had a better performance in computing speed and BLEU value, showing reliability in improving the efficiency and quality of English content translation. The designed system can be further promoted and applied in practice.

Conflict of interest:
The author states no conflict of interest.