An intelligent algorithm for fast machine translation of long English sentences

Abstract: Translation of long sentences in English is a complex problem in machine translation. This work briefly introduced the basic framework of the intelligent machine translation algorithm and improved the long short-term memory (LSTM)-based intelligent machine translation algorithm by introducing a long sentence segmentation module and a reordering module. Simulation experiments were conducted using a public corpus and a local corpus containing self-collected linguistic data. The improved algorithm was compared with machine translation algorithms based on a recurrent neural network and LSTM. The results suggested that the LSTM-based machine translation algorithm with the long sentence segmentation module and reordering module effectively segmented long sentences, translated long English sentences more accurately, and produced more grammatically correct translations.


Introduction
With the development of globalization, international communication has become increasingly frequent, and language is crucial in this process. Using a common language that both sides understand avoids misunderstandings and improves the efficiency of the division of labor. English is one of the most widely used common languages, but for non-native speakers, the cost of learning it is high, and achieving free communication is difficult [1]. In formal situations and when a large amount of information needs to be exchanged [2], human translation alone can no longer meet the growing demand. Simultaneous interpretation, for example, demands sustained, intense attention from interpreters, so they usually cannot work for long stretches. Thus, translation tools are needed to supplement human translation [3]. Long English sentences are common in actual communication [4], and the grammar of English differs from that of other languages. If a machine translation algorithm translates long English sentences word by word in one-to-one correspondence, the translation will be ungrammatical and, in severe cases, incorrect. The emergence of intelligent algorithms provides a new approach to English machine translation. Relevant studies on improving the efficiency and quality of English machine translation are reviewed below. Lin et al. [5] proposed a neural machine translation method based on a novel beam search evaluation function and found that it effectively improved the quality of English-Chinese translation. Luong and Manning [6] proposed a minimum-risk training method for end-to-end neural machine translation and verified its effectiveness through experiments. Choi et al. [7] contextualized word embedding vectors with a nonlinear bag-of-words representation of the source sentence.
They performed experiments and found that the proposed contextualization and symbolization methods greatly improved the translation quality of neural machine translation systems. This work described the basic framework of the intelligent machine translation algorithm and optimized the long short-term memory (LSTM)-based intelligent machine translation algorithm by introducing the long sentence segmentation module and the reordering module. Simulation experiments were conducted using the public corpus and the local corpus collected by the authors.

Intelligent machine translation algorithms for long English sentences
The intelligent machine translation algorithm utilizes a neural network algorithm. The basic framework of this machine translation algorithm is shown in Figure 1, and the main structure includes an encoder and a decoder. The intelligent machine translation algorithm first converts the original text into a coded string that is neither the original text nor the translation by the neural network algorithm in the encoder and then converts the coded string into the translation by the neural network algorithm in the decoder [8].
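As a toy illustration of this encoder-decoder idea (not the paper's network): the encoder maps a sentence of any length to one fixed-size vector, and the decoder maps that vector to a probability distribution over target words. The vocabularies, embedding table, and mean-pooling "encoder" here are hypothetical stand-ins for the neural networks described above:

```python
# Minimal encoder-decoder sketch: fixed-size code regardless of input length.
import numpy as np

rng = np.random.default_rng(0)
SRC_VOCAB = {"the": 0, "cat": 1, "sat": 2}      # hypothetical source vocabulary
EMB = rng.normal(size=(len(SRC_VOCAB), 8))      # hypothetical embedding table

def encode(tokens):
    """Compress a source sentence into one fixed-size vector (the 'coded string')."""
    ids = [SRC_VOCAB[t] for t in tokens]
    return EMB[ids].mean(axis=0)                # same size no matter how long the input

def decode(code, out_proj):
    """Map the fixed vector to a probability distribution over target words."""
    logits = code @ out_proj
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

out_proj = rng.normal(size=(8, 5))              # hypothetical 5-word target vocabulary
probs = decode(encode(["the", "cat", "sat"]), out_proj)
print(probs.shape)                              # one probability per target word
```

Because the code vector has a fixed size, a longer input must be compressed into the same amount of space, which is exactly the information-loss problem the segmentation module addresses.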
Compared with the traditional method of translating word by word with the help of dictionaries, the intelligent neural network-based machine translation algorithm learns hidden regularities, including grammatical rules, from corpus training samples during training [9]; thus, it translates better. However, translation quality decreases when the intelligent machine translation algorithm translates long English sentences. The reason is that when the encoder converts the original text into a vector code [10], the size of the vector code is fixed regardless of the length of the original text; the longer the original text, the more information is compressed into the vector code, and therefore the more semantic information is lost [11].
In order to improve the translation quality of long English sentences, the long sentence segmentation module and reordering module were added. The function of the long sentence segmentation module is to split a long English sentence into multiple short sentences according to certain rules; the translation algorithm then translates the short sentences individually and combines them, reducing the loss of semantic information caused by the fixed-size vector code [12]. The function of the reordering module is to reorder the short sentences obtained from segmentation so that, after combination, the translation better matches the target word order [13]. The flow of the improved intelligent machine translation algorithm is shown in Figure 2.

(1) A source text is input, i.e., the long English sentence.
(2) The long sentence segmentation module segments the long English sentence. It predicts the probability of every word in the long sentence being a segmentation word with the maximum entropy classifier [14] and uses the word with the highest probability as the segmentation word. The probability calculation formula is:

$$p(y \mid x(w)) = \frac{\exp\left(\sum_i \omega_i g_i(y, x(w))\right)}{\sum_{y'} \exp\left(\sum_i \omega_i g_i(y', x(w))\right)},$$

where $p(y \mid x(w))$ is the probability that word $w$ is used as the segmentation word in the long sentence, $x(w)$ is the contextual information of $w$ containing the word $w$, $y$ is the label of the segmentation word, $y'$ ranges over the set of "segmentation" and "non-segmentation" labels, $g_i(y, x(w))$ is the $i$-th characteristic function between $x(w)$ and $y$, which is 1 if there is a connection between them and 0 otherwise, and $\omega_i$ is the weight parameter of $g_i(y, x(w))$.

(3) The reordering module reorders the segmented short sentences and predicts the probability of the order of the originally adjacent short sentences with the maximum entropy classifier. The corresponding calculation formula is:

$$p(o \mid C_m^s, C_s^n) = \frac{\exp\left(\sum_i \omega_i g_i(o, C_m^s, C_s^n)\right)}{\sum_{o'} \exp\left(\sum_i \omega_i g_i(o', C_m^s, C_s^n)\right)},$$

where $C_m^s$ and $C_s^n$ are the phrases before and after the segmentation word $s$, respectively, $o$ is the "order" label, indicating that $C_m^s$ comes before $C_s^n$ after reordering, $o'$ ranges over the set of "order" and "reversed order" labels, and $p(o \mid C_m^s, C_s^n)$ is the probability of keeping the original order.

(4) The segmented and reordered short English sentences are input into the encoder.
The encoder uses an LSTM model to encode the source text, and the corresponding calculation formulas are:

$$f_t = \sigma\left(b_f + U_f x_t + W_f h_{t-1}\right),$$
$$g_t = \sigma\left(b_g + U_g x_t + W_g h_{t-1}\right),$$
$$s_t = f_t s_{t-1} + g_t\, \sigma\left(b + U x_t + W h_{t-1}\right),$$
$$q_t = \sigma\left(b_q + U_q x_t + W_q h_{t-1}\right),$$
$$h_t = \tanh(s_t)\, q_t,$$

where $x_t$ is the input at step $t$, $h_t$ is the hidden output, $\sigma$ is the sigmoid function, $f_t$ is the forgetting gate output, $b_f$, $U_f$, and $W_f$ are the bias term, input term weight, and forgetting gate weight in the forgetting gate [15], respectively, $s_t$ is the cyclic (state) unit output, $b$, $U$, and $W$ are the bias term, input term weight, and cyclic weight in the cyclic unit, respectively, $g_t$ is the external input gate output, $b_g$, $U_g$, and $W_g$ are the bias term, input term weight, and input gate weight in the input gate, respectively [16], $q_t$ is the output gate output, and $b_q$, $U_q$, and $W_q$ are the output gate bias term, input term weight, and output gate weight, respectively.

(5) After the vector code is obtained by the encoder, the decoder, which also adopts an LSTM, decodes it [17]. The probability distribution of the translated characters is obtained by forward calculation of the input vector code through the LSTM. Finally, the cluster search algorithm [18] is used to find the translated characters with the best probability from the probability distribution, and the final translation is obtained by arranging the characters in order.
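As an illustration of the maximum entropy classification used by the segmentation and reordering modules, the following sketch computes p(y|x) with a single hand-crafted feature function. The feature, its weight, and the label set are hypothetical, not taken from the paper:

```python
import math

def maxent_prob(features, weights, label, labels):
    """p(y|x) = exp(sum_i w_i * g_i(y, x)) / sum_{y'} exp(sum_i w_i * g_i(y', x))."""
    def score(y):
        return sum(w * g(y, features) for w, g in weights)
    z = sum(math.exp(score(y)) for y in labels)
    return math.exp(score(label)) / z

# Hypothetical binary feature: fires when the word is a conjunction and the
# candidate label is "seg" (segmentation word).
CONJ = {"and", "but", "because", "which"}
g0 = lambda y, x: 1.0 if (y == "seg" and x["word"] in CONJ) else 0.0
weights = [(2.0, g0)]            # weight chosen by hand for illustration
labels = ("seg", "non-seg")

p = maxent_prob({"word": "because"}, weights, "seg", labels)
print(round(p, 3))               # e^2 / (e^2 + 1), about 0.881
```

In the real modules, the feature functions cover richer context around each word and the weights are learned from the training corpus rather than set by hand.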
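The gate computations in the encoder can be sketched as one LSTM cell step. The sketch follows the gate names used in the text (forget gate f, external input gate g, output gate q, state s) with random toy weights; it is an illustrative minimal cell, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM step with forget (f), external input (g), and output (q) gates."""
    f_t = sigmoid(p["b_f"] + p["U_f"] @ x_t + p["W_f"] @ h_prev)   # forget gate
    g_t = sigmoid(p["b_g"] + p["U_g"] @ x_t + p["W_g"] @ h_prev)   # external input gate
    s_t = f_t * s_prev + g_t * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)
    q_t = sigmoid(p["b_q"] + p["U_q"] @ x_t + p["W_q"] @ h_prev)   # output gate
    h_t = np.tanh(s_t) * q_t                                       # hidden output
    return h_t, s_t

rng = np.random.default_rng(1)
d_in, d_h = 4, 3                                # toy dimensions
params = {}
for k in ("f", "g", "q", ""):                   # "" holds the state-unit terms b, U, W
    suf = f"_{k}" if k else ""
    params[f"b{suf}"] = np.zeros(d_h)
    params[f"U{suf}"] = rng.normal(scale=0.1, size=(d_h, d_in))
    params[f"W{suf}"] = rng.normal(scale=0.1, size=(d_h, d_h))

h, s = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, s.shape)
```

The forget gate f_t scales the previous state s_{t-1}, which is the mechanism the Discussion credits with reducing gradient explosion and vanishing on long sentences.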

Simulation experiments

Experimental data
The English-Chinese parallel corpus used for the simulation experiments was UM-Corpus [19], which provides two million aligned English-Chinese sentence pairs from eight text domains, including education, law, Weibo, news, science, speaking, subtitles, and essay. Ten thousand sentences were randomly selected as the training set, and 5,000 sentences as the test set. In addition, this study collected 3,000 English sentences from newspapers and movie reviews to test the improved intelligent machine translation algorithm.

Experimental setup
In the improved intelligent machine translation algorithm, the relevant parameters of the LSTM in the encoder are as follows: four hidden layers, 1,024 nodes per layer, and the sigmoid function as the hidden-layer activation. The parameters of the LSTM in the decoder are as follows: two hidden layers, 1,024 nodes per layer, and the sigmoid function as the hidden-layer activation. The cluster search algorithm was used to transform the calculated probability distribution of characters into the translation, and the window size of the "cluster" was 10. When the algorithm was trained with the training set, the stochastic gradient descent method was used to adjust the parameters; the learning rate was set to 0.1, and the maximum number of training iterations was 1,000.
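The experimental settings above can be collected into a single configuration sketch; the key names are illustrative, not taken from the paper's code:

```python
# Hyperparameters of the improved algorithm, as stated in the experimental setup.
config = {
    "encoder": {"type": "LSTM", "hidden_layers": 4, "nodes_per_layer": 1024,
                "activation": "sigmoid"},
    "decoder": {"type": "LSTM", "hidden_layers": 2, "nodes_per_layer": 1024,
                "activation": "sigmoid"},
    "cluster_search": {"window_size": 10},
    "training": {"optimizer": "SGD", "learning_rate": 0.1, "max_iterations": 1000},
}
print(config["training"]["learning_rate"])
```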
In addition to testing the improved intelligent machine translation algorithm, the machine translation algorithm without adding the long sentence segmentation module and the reordering module was also tested. The encoder and decoder in this machine translation algorithm also adopted the LSTM algorithm, and the parameter settings were consistent with the improved machine translation algorithm.
The machine translation algorithm that adopted the recurrent neural network (RNN) in the encoder and decoder was also tested. The relevant parameters of the RNN are as follows: four hidden layers for the encoder, two hidden layers for the decoder, 1,024 nodes per hidden layer, and the sigmoid function as the activation function. Training used the same training set and procedure as the improved machine translation algorithm.

Evaluation criteria
First, the long sentence segmentation in the improved intelligent machine translation algorithm was tested. Long sentence segmentation can be regarded as sequence annotation of the words in a sentence, so the segmentation effect was measured using the precision, recall rate, and F value.
The effect of segmenting long sentences was evaluated by the confusion matrix of binary classification; Table 1 shows the corresponding confusion matrix. The calculation formulas of the evaluation indexes are:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2PR}{P + R},$$

where $P$ is the precision, $R$ is the recall rate, $F$ is the combined value of precision and recall, and $TP$, $FP$, and $FN$ are the numbers of true positive, false positive, and false negative segmentation decisions in the confusion matrix.

The performance of the machine translation algorithm for English word translation was first evaluated by the word error rate [20]:

$$WER = \frac{X + Y + Z}{P},$$

where $X$ is the number of substituted words, $Y$ is the number of deleted words, $Z$ is the number of inserted words, and $P$ here denotes the total number of words in the test set.

When a machine translation algorithm translates English, it mostly handles long sentences or even long texts. Besides the accuracy of word translation, the grammatical differences between the source text and the translation must then also be considered, so the bilingual evaluation understudy (BLEU) index was also used to evaluate the translation as a whole:

$$BLEU = B \cdot \exp\left(\sum_{n=1}^{N} \omega_n \log p_n\right), \qquad B = \begin{cases} 1, & c > r, \\ e^{1 - r/c}, & c \le r, \end{cases}$$

where $N$ is the maximum order of the n-gram, $\omega_n$ is the weight of the n-gram, $p_n$ is the matching percentage of n-grams between the machine translation and the reference, $B$ is the brevity penalty factor, $c$ is the number of words in the machine translation, and $r$ is the number of words in the reference translation. The order of the n-gram is determined by the evaluation task, and the corresponding weights are set empirically.

Table 2 shows partial results of segmenting long English sentences with the improved intelligent machine translation algorithm and the overall segmentation performance.
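The three evaluation indexes defined in this section can be sketched in a few lines of Python; the counts below are hypothetical, not the paper's data:

```python
import math

def precision_recall_f(tp, fp, fn):
    """P, R, and their harmonic combination F from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def word_error_rate(substituted, deleted, inserted, total_words):
    """WER = (X + Y + Z) / P, with P the total word count of the test set."""
    return (substituted + deleted + inserted) / total_words

def bleu(p_ngrams, weights, c, r):
    """BLEU = B * exp(sum_n w_n log p_n), with brevity penalty B."""
    b = 1.0 if c > r else math.exp(1 - r / c)
    return b * math.exp(sum(w * math.log(p) for w, p in zip(weights, p_ngrams)))

# Hypothetical counts for illustration only.
p, r, f = precision_recall_f(tp=95, fp=3, fn=4)
wer = word_error_rate(substituted=2, deleted=1, inserted=1, total_words=400)
score = bleu([0.7, 0.5, 0.4, 0.3], [0.25] * 4, c=20, r=18)
print(round(f, 3), wer, round(score, 3))
```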
"(,)" in the segmentation results of the long English sentences in Table 2 indicates the segmentation boundaries given by the segmentation module. The long sentence segmentation module segmented the long sentences well, and the segmented short sentences contained conjunctions, which met the grammatical segmentation criteria. The precision, recall rate, and F value of segmentation on the test set were 97.6%, 96.9%, and 97.2%, respectively, indicating that the improved intelligent machine translation algorithm segmented long English sentences quite well.

Due to space limitations, only some translations of the English source texts by the three machine translation algorithms are shown, in Table 3. Comparing the translations with the reference translation showed that the translation obtained by the improved LSTM-based machine translation algorithm was the closest to the reference, while the translation obtained by the LSTM-based algorithm showed some differences. Although the translation obtained by the RNN-based machine translation algorithm conveyed a meaning consistent with the reference translation, its grammar did not conform to conventional rules, making the translation noticeably awkward to read.

Figure 3 shows the word error rates of the three machine translation algorithms on the corpus data test set and the self-collected local data test set. The word error rate of the RNN-based machine translation algorithm was 2.6% on the corpus data test set and 2.8% on the local data test set. The word error rates of the LSTM-based machine translation algorithm were 1.5% and 1.7%, and those of the improved LSTM-based machine translation algorithm were 0.9% and 1.1%, respectively.
It was found from the comparison that the translation of the improved LSTM-based machine translation algorithm had the lowest word error rate. Figure 4 shows the BLEU of the translations obtained by the three machine translation algorithms for the corpus and local data test sets. The BLEU of the RNN-based machine translation algorithm was 22.3 for the corpus data test set and 22.4 for the local data test set. The BLEU of the LSTM-based machine translation algorithm was 26.7 for the corpus data test set and 26.1 for the local data test set, while the BLEU of the improved LSTM-based machine translation algorithm was 31.3 for the corpus data test set and 30.9 for the local data test set. It was seen from the comparison that the translation of the improved LSTM-based machine translation algorithm had the highest BLEU, which suggested that this algorithm had the best translation performance.

Discussion
As international communication becomes increasingly frequent, English, as a common language, is difficult for non-native speakers to learn. Especially when a large number of texts need to be translated on formal occasions, manual translation alone is inefficient. With advances in computer technology, machine translation algorithms have gradually been applied to the rapid translation of large volumes of English text. However, in actual use, because grammar differs between languages, a machine translation algorithm that translates long English sentences in one-to-one correspondence may produce grammatically incorrect translations that are difficult to read. In this work, the intelligent translation algorithm cut a long English sentence into a collection of phrases, reordered the phrases in the collection to follow the grammar of the target language, and translated the reordered English phrases using the encoder and decoder of the LSTM. The intelligent translation algorithm was then simulated and compared with a machine translation algorithm using an RNN as the encoder and decoder and with the LSTM algorithm without the long sentence segmentation module. The final results are discussed in Section 3.
In the test of long English sentence segmentation, the proposed intelligent translation algorithm performed well. The subsequent comparison experiments also verified that the proposed intelligent translation algorithm had the best translation performance for long English sentences, the LSTM algorithm without the long sentence segmentation module was second best, and the RNN algorithm without the long sentence segmentation module was the worst. The reasons are as follows. Although the RNN is suitable for processing sequential data, it is prone to gradient explosion or vanishing gradients when processing long sequences, so its performance in translating long English sentences was poor. The LSTM originates from the RNN; its forgetting mechanism reduces the impact of gradient explosion or vanishing caused by long sentences, so the LSTM-based machine translation algorithm performed better than the RNN-based one. The improved LSTM-based machine translation algorithm added the long sentence segmentation module and the reordering module: segmenting long sentences reduced the loss of semantic information, and reordering improved the grammatical order of the translation, so its performance was better than that of the LSTM-based machine translation algorithm.

Conclusion
This study briefly introduced the basic framework of the intelligent machine translation algorithm, improved the LSTM-based intelligent machine translation algorithm by introducing the long sentence segmentation module and the reordering module, and conducted simulation experiments using two corpora. The results are summarized below. (1) The improved intelligent machine translation algorithm segmented long English sentences well. (2) The translation obtained by the improved LSTM-based machine translation algorithm was closest to the reference translation, the LSTM-based machine translation algorithm was second closest, and the translation obtained by the RNN-based machine translation algorithm did not conform to conventional grammar. (3) The translation obtained by the improved LSTM-based machine translation algorithm had the lowest word error rate, while the RNN-based machine translation algorithm had the highest. (4) On both test sets, the improved LSTM-based machine translation algorithm achieved the highest BLEU, the LSTM-based algorithm the second highest, and the RNN-based algorithm the lowest.

Conflict of interest:
The author declares no conflict of interest.