A lattice-transformer-graph deep learning model for Chinese named entity recognition

Abstract: Named entity recognition (NER) is the localization and classification of entities with specific meanings in text data, usually used for applications such as relation extraction, question answering, etc. Chinese is a language with Chinese characters as the basic unit, but a Chinese named entity is normally a word containing several characters, so both the relationships between words and those between characters play an important role in Chinese NER. At present, a large number of studies have demonstrated that reasonable word information can effectively improve deep learning models for Chinese NER. Besides, graph convolution can help deep learning models perform better for sequence labeling. Therefore, in this article, we combine word information and graph convolution and propose our Lattice-Transformer-Graph (LTG) deep learning model for Chinese NER. The proposed model pays more attention to additional word information through position-attention, and therefore can learn relationships between characters by using lattice-transformer. Moreover, the adapted graph convolutional layer enables the model to learn both richer character relationships and word relationships and hence helps to recognize Chinese named entities better. Our experiments show that compared with 12 other state-of-the-art models, LTG achieves the best results on the public datasets of Microsoft Research Asia, Resume, and WeiboNER, with F1 scores of 95.89%, 96.81%, and 72.32%, respectively.


Introduction
Named entity recognition (NER) extracts and classifies entities with specific meanings, such as person (PER), location (LOC), and organization (ORG). As an upstream process for a number of natural language processing (NLP) tasks, e.g., relation extraction, event extraction, and automatic question answering [1], it plays an irreplaceable role. Therefore, in-depth research on NER has high practical value [2].
In earlier studies, NER was solved using sequence annotation models such as the hidden Markov model (HMM), the maximum-entropy Markov model (MEMM) [3], and the conditional random field (CRF) [4]. With the development of deep learning, convolutional neural networks (CNN) and long short-term memory (LSTM) networks have been widely used for NER. For example, the LSTM-CRF architecture has been employed to improve the learning ability of neural networks for NER [5,6]. In addition, TENER [7] uses the transformer [8] as the encoder with specific modifications for NER. Nowadays, these methods constitute the fundamental structures for NER.
Besides, BERT [9], which is trained on a large dataset containing different languages, can be used as an enhanced feature extractor for NER. Moreover, since each annotation position can be considered as a node on a graph, the graph convolutional network (GCN) is utilized to solve the sequence annotation problem by constructing edges between nodes [10]. Other applications of GCN include text semantic role annotation [11,12], text sequence classification [13], etc. Furthermore, GCN can be concatenated after LSTM to improve NER [14]. However, these methods are designed for word-based languages, such as English. Since Chinese is a character-based language, directly adopting these methods for Chinese NER cannot achieve good performance.
As a character-based language, Chinese has no clear boundaries between named entities and other characters. For example, the Chinese sentence "北京市长安街", which means "Beijing Chang'an Street," has two location-type named entities, "北京市" (Beijing City) and "长安街" (Chang'an Street), but there are no obvious boundaries between these two words. If a model can segment the sentence into "北京市" and "长安街", it can recognize these two location-type named entities more easily. However, "市长", which means "Mayor," is a person-type Chinese named entity. If the model lacks the relationships between the characters, it is possible that this sentence is incorrectly split so that "市长" is treated as a word. Therefore, both the relationships between words and those between characters are crucial for Chinese NER.
Zhu and Wang try to obtain word information from input sentences directly through a convolutional attention network (CAN-NER), using the convolution kernel to help the model segment word boundaries [15], but the lengths of Chinese words are variable, so this method has great limitations. Since a Chinese sentence is a character sequence, it is helpful to utilize a lexicon as external knowledge to bring more relationships and linguistic features to the model. Zhang and Yang not only train a lexicon for Chinese NER but also propose a new model, Lattice-LSTM, to make good use of this lexicon [16]. However, Lattice-LSTM only adds word information from the lexicon after the last character of a word, so it fails to learn the inner relationships between characters. With regard to other studies, pre-trained language models based on the transformer architecture have made progress in NLP. For example, LEBERT [17] and ZEN [18] adapt BERT to make it more suitable for Chinese NLP, and BERT-wwm [19] provides a better way to train Chinese pre-trained language models. Other studies use pre-trained language models as the language embedding extractor (DGLSTM-CRF and GAT) [20,21]. However, these Chinese pre-trained language models always attend to words rather than characters, so they cannot learn the inner relationships between characters, either.
To learn the inner relationships between characters, the FLAT-Lattice structure [22] is proposed, which not only allows the neural network to extract more linguistic features from the lexicon but also uses relative position coding to enable the transformer model [8] to learn the relationships between different input positions, that is, the inner relationships. Besides, in ref. [23] (LGN), researchers try to use the graph neural network with the lexicon, and in ref. [24] (PLT), researchers try to use the transformer. Moreover, the lattice structure is reconstructed in ref. [22] (FLAT) through self-attention in the transformer model to extract more linguistic and relationship features from the lexicon. Although these studies can learn the inner relationships between characters, they ignore the relationships between words.
In order to learn both the relationships between words and those between characters, in this article, we propose a lattice-transformer-graph (LTG) deep learning model, which uses lattice-transformer to learn inner relationships between characters and modifies GCN to learn not only relationships between characters but also relationships between words from additional word information. Furthermore, as BERT has been adapted to process Chinese NER tasks better by integrating lattice word information [17,18], LTG can also be combined with BERT. We then conduct experiments on four public datasets to demonstrate the effectiveness of LTG. Our LTG model with BERT achieves the best results in comparison with 12 other state-of-the-art neural network architectures, including BiLSTM [5], TENER [7], CAN-NER [15], Lattice-LSTM [16], LGN [23], PLT [24], FLAT [22], LEBERT [17], ERNIE [25], ZEN [18], DGLSTM-CRF [20], and GAT [21].
Proposed methods

Structure of the LTG model
Our proposed LTG model is illustrated in Figure 1. The Position-Attention part works on characters and additional words, which are extracted from the input text so that the LTG model obtains more information. The outputs of Position-Attention are sent to Transformer-Encoder [8] to extract the features of the input text. After that, we use Bi-GCN [11], whose graph is built based on the positions of the additional words, to make the model know more about the structure of the input words. Finally, we put the output features of Bi-GCN into CRF to obtain labels.
Figure 1: The first part of the LTG model is Position-Attention, which helps the model to consider relevant relationships, shown by the red arrows, or irrelevant relationships, shown by the black arrows. Position-Attention can help the model extract more linguistic features from the input sequence and external words, and also the relationship features between them. The output embeddings of Position-Attention are sent to Transformer-Encoder to obtain features. Transformer-Encoder can handle long-distance dependencies better than RNN. After that, Bi-GCN calculates these features using the graph (the part in the red dashed box) built based on the additional words, and the graph can help the model learn deeper linguistic features and relationship features. Finally, the model sends the outputs of Bi-GCN to CRF to obtain labels.
In the LTG model, we use Transformer-Encoder [8] and Bi-GCN [11] to extract features. Bi-GCN makes the neural network focus more on the relationships between additional words and characters. The graph construction method used in our study is based on the positions of the additional words, which helps the model learn the context relationships between words and characters better. The reason for using additional word information is that, in Chinese NER, the nodes are Chinese characters, but most named entities are words that consist of several Chinese characters rather than a single one. By making the model pay more attention to the positions of additional words, the performance of Chinese NER can be improved.

Lattice
Because the lattice significantly improves Chinese NER, our study uses Flat-Lattice [22] as additional information. The input sequence of the model is composed of Chinese characters. In addition, every subsequence of the input that is contained in the lexicon is taken as a word. These characters and words are regarded as the lattice input, and they are transformed into embeddings by the trained lexicon. As shown in Figure 2, the input sentence is "北京市长安街" (Beijing Chang'an Street), which contains five words in the lexicon: "北京" (Beijing), "北京市" (Beijing City), "市长" (Mayor), "长安" (Chang'an), and "长安街" (Chang'an Street), so there are five additional words in the input. To allow parallel computing when training this model, the relationships between words and characters are considered before they are computed by the next layer. This study uses $W^{d}_{b,e}$ to represent these words, where $b$ and $e$ are the beginning position and the end position of the word, and $d$ is the dimension of the word embeddings. For example, "北京" (Beijing) can be written as $W^{d}_{1,2}$, and "长安街" (Chang'an Street) can be written as $W^{d}_{4,6}$.
Figure 2: The input text is "北京市长安街" (Beijing Chang'an Street), whose subsequences "北京" (Beijing), "北京市" (Beijing City), "市长" (Mayor), "长安" (Chang'an), and "长安街" (Chang'an Street) are in the lexicon. The subsequences are the additional words of the input. We use Position-Attention to consider every word and every character. The number pair $(D_{head}, D_{tail})$ above represents the distance not only between two characters but also between a character and a word. All of them will be sent to the Position-Attention layer.
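The lattice input described above can be illustrated with a minimal sketch. It assumes a toy lexicon held in a Python set and treats only subsequences of two or more characters as additional words; the function name `match_lexicon_words` and both of these choices are illustrative assumptions, not details given in the article.

```python
def match_lexicon_words(sentence, lexicon):
    """Return (word, b, e) for every subsequence of `sentence` found in `lexicon`.

    Positions b and e are 1-indexed and inclusive, matching the W_{b,e} notation.
    Assumption: only subsequences with at least two characters count as words,
    because single characters are already part of the character input.
    """
    n = len(sentence)
    spans = []
    for b in range(n):
        for e in range(b + 2, n + 1):
            word = sentence[b:e]
            if word in lexicon:
                spans.append((word, b + 1, e))
    return spans


# Toy lexicon containing the five words of the running example.
toy_lexicon = {"北京", "北京市", "市长", "长安", "长安街"}
lattice_words = match_lexicon_words("北京市长安街", toy_lexicon)
for word, b, e in lattice_words:
    print(word, b, e)
```

In this sketch, "北京" is returned with positions (1, 2) and "长安街" with positions (4, 6), which agrees with the $W^{d}_{1,2}$ and $W^{d}_{4,6}$ examples in the text. A real lexicon would more likely be stored in a trie for efficient matching.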

Position-attention
Since the model considers the relationships between each word and each character, irrelevant words and characters will also be calculated together, such as "北" (north) and "长安" (Chang'an), which will undoubtedly influence the subsequent procedure. Therefore, Position-Attention is needed to distinguish the relevant and irrelevant relationships. In order to obtain the Position-Attention of the context of words and characters, we calculate the position embedding representing the position relationships between them.
First, we calculate the relative positions of words and characters through (1) and (2), where $HEAD_{word}$ indicates the position of the first character in a word, $TAIL_{word}$ indicates the position of the last one, and $POS_{char}$ indicates the position of a single character. Combining (1) and (2), including the case $D < 0$, the two can be simplified to (3), by which the relative position embeddings of characters can also be calculated. Our study follows [8] to obtain the position embeddings of distances, shown by (4) and (5), where $D$ is the relative position calculated by (1), (2), and (3), $k$ is the current dimension, and $d_{model}$ is the dimension of the model. Each dimension of (4) and (5) corresponds to a sinusoidal curve [8]. The results of the position embeddings are the embeddings of the relative position distances.
The relative position embedding $P$ between word/character $i$ and $j$ is calculated from these distance embeddings. Through the position embedding of the relationship between each two words/characters, the output feature containing context and word information can be obtained. The feature is sent to the transformer layer, so the model obtains an input containing context information between words and characters.
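As an illustration of the sinusoidal distance embeddings that the text attributes to [8], the following sketch embeds a single relative distance $D$. It assumes the standard formulation of [8], with even dimensions using sine and odd dimensions using cosine at frequency $10000^{-2k/d_{model}}$; the function name is hypothetical, and how $D$ itself is obtained from (1) to (3) is not reproduced here.

```python
import math


def sinusoidal_embedding(D, d_model):
    """Embed a relative distance D (may be negative) into d_model dimensions.

    Assumption: the standard sinusoidal scheme of the transformer paper,
    where dimension 2k uses sin and dimension 2k + 1 uses cos.
    """
    embedding = []
    for dim in range(d_model):
        k = dim // 2
        angle = D / (10000 ** (2 * k / d_model))
        embedding.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return embedding


zero_distance = sinusoidal_embedding(0, 4)
negative_distance = sinusoidal_embedding(-3, 6)
```

Because sine is odd and cosine is even, a negative distance changes the sign of the sine dimensions only, so the embedding can distinguish a word that lies before a character from one that lies after it.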

Building graphs with word information
We use GCN to help the model learn the relationships between adjacent nodes, so we construct the graphs with the positions of words and the inner relationships of characters in words. A graph is represented as $G = (V, E)$, where $V$ is the node set, that is, the character set, and $E$ is the edge set.

Algorithm 1 Construct graph $G_1$
Input: The start positions and lengths of all additional words from the input text
Output: $G_1$, the graph that helps the model learn the relationships between characters
1: for each word ∈ Words do
2: …

We construct two graphs, $G_1$ representing the inner relationships in words and $G_2$ representing the relationships between words. In $G_1$, if two nodes $i$ and $j$ of a word satisfy $start \le i \le j - 1 \le end - 1$, we construct an edge from $i$ to $j$. The construction of $G_1$ is detailed in Algorithm 1. Benefiting from $G_1$, Bi-GCN helps the network learn the context of characters in words and extract more linguistic information from words.
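The construction of $G_1$ can be sketched as follows. This is one possible reading rather than the authors' exact procedure: it assumes that every ordered pair of distinct characters inside the same word receives a forward edge (from the earlier character to the later one), and it takes the start positions and lengths of the additional words as input, as stated for Algorithm 1. The function name `build_g1` is hypothetical.

```python
def build_g1(words):
    """Build the forward edges of G1 from (start, length) pairs.

    Positions are 1-indexed. Assumption: for each word spanning start..end,
    an edge (i, j) is added for every start <= i < j <= end.
    """
    edges = set()
    for start, length in words:
        end = start + length - 1
        for i in range(start, end):
            for j in range(i + 1, end + 1):
                edges.add((i, j))
    return edges


# The five additional words of "北京市长安街" as (start, length):
# 北京, 北京市, 市长, 长安, 长安街.
example_words = [(1, 2), (1, 3), (3, 2), (4, 2), (4, 3)]
g1_edges = build_g1(example_words)
print(sorted(g1_edges))
```

Only forward edges are produced here, which matches the statement that the algorithms build edges from front nodes to rear nodes and leave the reverse direction to the bidirectional GCN layer.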

Algorithm 2 Construct graph $G_2$
Input: The additional words from the input text
Output: $G_2$, the graph that helps the model learn the relationships between words
1: $Word_{now}$ ⇐ NULL
2: for Pos = 0 to $Length_{sentence}$ do
3:   Generate the set of the words whose first character is at Pos as $Word_{pos}$
4:   if $Word_{pos}$ is not empty then
5:     if the number of words in $Word_{pos}$ is greater than 1 then
6:       Push the words in $Word_{pos}$ into Stack in descending order of the lengths of these words
7:     while …
end for

$G_2$ is the graph of the relationships between words, which is constructed by Algorithm 2. The edge of $G_2$ is represented as $Num_{words}$. Bi-GCN uses $G_2$ to make the model extract more context information between words. For example, "北京" (Beijing) and "北京市" (Beijing City) have similar linguistic features, and both of them have strong correlations with "长安街" (Chang'an Street). However, some words are useless, such as "市长" (Mayor). Context information can also help the model reduce these useless words' weights, that is, the redundant information can be removed. $G_1$ and $G_2$ of this example, "北京市长安街" (Beijing Chang'an Street), are shown in Figure 3.
Figure 3: The red dotted line in $G_2$ represents an edge from a word to its next word. In order to make the model learn the context information at the same time, we use Bi-GCN, which only needs to build the forward edges in the graph. The upper figure can be simplified to the lower figure, and nodes $n_1$, $n_2$, $n_3$, $n_4$, $n_5$, and $n_6$ represent the characters 北, 京, 市, 长, 安, and 街, respectively.

Bidirectional graph convolutional layer
Algorithms 1 and 2 only show the construction of edges from front nodes to rear nodes. In practice, we use a Bi-GCN [11] layer, which changes the directed graph into an undirected graph and makes the model learn context information. Given a directed graph $G = (V, E)$ and the input with length $L$, $X = x_1, x_2, \ldots, x_L$, the feature $f_i$ for each node $i$ is computed by the GCN layer, where $W_f \in \mathbb{R}^{d_x \times d_f}$ contains learnable parameters and $b_f \in \mathbb{R}^{d_f}$ is the offset. $d_x$ is the dimension of the input, and $d_f$ is the dimension of the hidden layers in GCN. ReLU is a nonlinear activation function. The output features from the two graphs are concatenated to represent the output features of the GCN layer. Given these features of every node, CRF [4] can compute labels.
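A bidirectional graph convolution of this kind can be sketched in plain Python. The exact aggregation used in the article is not reproduced here, so the sketch makes several labeled assumptions: each node sums the inputs of its neighbors and of itself (a self-loop), applies one affine map followed by ReLU, and the forward and backward directions use separate parameters whose outputs are concatenated. Nodes are 0-indexed, and all function names are hypothetical.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]


def affine(x, W, b):
    """Compute x W + b for a vector x (d_x), a matrix W (d_x by d_f), and b (d_f)."""
    return [sum(x[a] * W[a][c] for a in range(len(x))) + b[c]
            for c in range(len(b))]


def gcn_direction(X, edges, W, b):
    """One GCN pass in which messages flow along each edge (src -> dst)."""
    n, d_x = len(X), len(X[0])
    incoming = {i: [i] for i in range(n)}  # self-loop: an assumption of this sketch
    for src, dst in edges:
        incoming[dst].append(src)
    features = []
    for i in range(n):
        aggregated = [sum(X[j][a] for j in incoming[i]) for a in range(d_x)]
        features.append(relu(affine(aggregated, W, b)))
    return features


def bi_gcn(X, edges, W_fwd, b_fwd, W_bwd, b_bwd):
    """Run the forward graph and its reversed copy, then concatenate per node."""
    forward = gcn_direction(X, edges, W_fwd, b_fwd)
    reversed_edges = [(dst, src) for src, dst in edges]
    backward = gcn_direction(X, reversed_edges, W_bwd, b_bwd)
    return [f + g for f, g in zip(forward, backward)]


# Tiny example: three nodes, a chain 0 -> 1 -> 2, identity weights, zero bias.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
output = bi_gcn(X, [(0, 1), (1, 2)], identity, zero, identity, zero)
```

Because the backward pass reuses the same edge list in reversed form, only forward edges have to be constructed, which is the property the text relies on. The same layer could be applied to $G_1$ and to $G_2$ separately, with the per-node outputs concatenated afterwards.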
Ontonotes 5.0: The dataset includes texts in English, Arabic, and Chinese, covering six areas: radio conversations, radio news, magazine articles, news-wires, telephone conversations, and web blogs [26]. In this study, only the Chinese text sequences and NER labels in this dataset are selected.
MSRA: It is a simplified Chinese dataset provided by Microsoft Research Asia (MSRA) for word segmentation and NER [27].
Resume: The dataset is constructed from resume data of Sina Finance, which include the resumes of executives of listed companies in the Chinese stock market. The team randomly selected 1,027 resume abstracts and manually annotated 8 types of named entities using the YEDDA system [30]. The labeling consistency between annotators is 97.1%, and the complete dataset is called Resume [16].
WeiboNER: The construction of the dataset follows the DEFT ERE guidelines and includes four main named entity types: person names, organization names, place names, and geopolitical names. The corpus includes 1,890 messages collected from Weibo between November 2013 and December 2014. The WeiboNER dataset is constructed by manually labeling these messages [29].
Although precision (P) and recall (R) are equally important in judging the performance, they are conflicting indicators. By predicting all the uncertain labels as non-named entities, P will increase. However, in this case, R decreases. On the contrary, by predicting all the uncertain labels as named entities, R will increase, but in this case, P decreases. Therefore, in our experiments, we use their combination $F_1 = 2 \cdot P \cdot R / (P + R)$ as the evaluation metric, which can balance P and R.
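The trade-off between P and R and the way F1 balances them can be made concrete with a short sketch. It assumes entity-level exact matching, where an entity is a (begin, end, type) triple and counts as correct only if all three fields agree with the gold annotation; this matching convention and the function name are assumptions, since the article does not spell them out.

```python
def precision_recall_f1(gold, predicted):
    """Return (P, R, F1) for two collections of (begin, end, type) entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


gold = {(1, 3, "LOC"), (4, 6, "LOC")}

# A model that also outputs an uncertain entity, which turns out to be wrong.
eager = precision_recall_f1(gold, {(1, 3, "LOC"), (3, 4, "PER")})

# A model that suppresses the uncertain entity: P rises while R stays low.
cautious = precision_recall_f1(gold, {(1, 3, "LOC")})
```

In the cautious case, P reaches 1.0 while R remains 0.5, and F1 lies between the two, which is the balancing behavior described in the text.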

State-of-the-art models for comparison
In our experiments, we assess the performance of our model LTG against those of 12 other state-of-the-art models.
- Lattice-LSTM [16]: it uses the lattice structure to add word information from the lexicon.
- LGN [23]: it uses the graph neural network with the lexicon.
- PLT [24]: it uses the transformer with the lexicon.
- FLAT [22]: it reconstructs the lattice structure to utilize the lexicon better.
- ZEN [18]: it uses n-gram representations to enhance the Chinese text encoder.
- GAT [21]: it uses BERT as the feature extractor with a structure enhancing entity boundary detection.
Among these models, BiLSTM, TENER, and CAN-NER do not use a lexicon and serve as the baseline models, so that we can demonstrate the advantage of using the lattice in Chinese NER. LTG is also compared with a number of recent well-performing models that do not use BERT, including Lattice-LSTM, LGN, PLT, and FLAT. In addition, our model using BERT as the feature extractor, that is, BERT+LTG, is compared with models that also use BERT or have a large number of parameters, such as LEBERT, ERNIE, ZEN, DGLSTM-CRF, and GAT.

Comparison with state-of-the-art models
In order to show the effectiveness of LTG and BERT+LTG, we compare them with all the 12 state-of-the-art models listed above. We conduct experiments on the public datasets of Ontonotes 5.0, MSRA, Resume, and WeiboNER described above. Table 2 details the experimental results of these models. It demonstrates that our model BERT+LTG achieves the best results on multiple datasets, especially on WeiboNER. Since WeiboNER is collected from Weibo, a social networking platform, most of the sentences are everyday Chinese, and the distribution of named entities is significantly sparser than in the other datasets. Therefore, GCN plays an obvious role in helping the network understand word information and the relationship between word information and context. At the same time, Position-Attention also helps the network pay more attention to key information. Hence, LTG performs better on WeiboNER than on other datasets.
Resume is a dataset from resume data of Sina Finance. For Resume, our LTG model can extract more features from word information to perform better than other models.
However, LTG does not obtain the best F1 score on Ontonotes 5.0. The reason is that LTG relies on the help of word information, but this dataset contains many long distances, which produce a large amount of word information and make it harder for LTG to handle. Nevertheless, LTG achieves comparable results on this dataset.
MSRA is also a dataset with long distances, but there are only three types of labels in this dataset, so LTG can recognize these labels without extracting a large number of linguistic and relation features from word information. Therefore, it performs best on this dataset. Furthermore, we aim to build an architecture that can work well both with a pre-trained language model, such as BERT, and without one. It can be seen from Table 3 that when the application scenario is closer to those of MSRA and Resume, the improvements from using BERT, that is, a large-parameter language model, are not obvious. Since BERT requires a lot of computation, LTG without BERT is more suitable in these scenarios. On the other hand, when the application scenario is closer to those of Ontonotes 5.0 and WeiboNER, the improvements from using BERT are obvious, and in these scenarios, BERT+LTG is a better choice.

Ablation experiments
In order to demonstrate the functionality of different components in our LTG model, we perform ablation experiments on the dataset of WeiboNER. Table 4 shows the results of ablation experiments.

Number of Bi-GCN layers
In many cases, adding layers makes a deep learning model work better. Therefore, we try adding Bi-GCN layers to our model. However, as shown in Table 4, the performance does not improve as the number of Bi-GCN layers increases. We conclude that the number of Bi-GCN layers makes no obvious difference in our model.

Bidirectional and unidirectional GCN
Because we only build a forward graph for GCN and context information plays an important part in a sequence labeling task, bidirectional GCN is necessary for the model to learn context information. From Table 4, we see that bidirectional GCN brings significant improvements compared with unidirectional GCN and with not using GCN. GCN uses the graph to extract more relationship and linguistic features, and the bidirectional structure helps to reduce the loss of context features.

Ways of embedding
Position-Attention helps the model learn more about the context information between words and characters. For comparison, we use another way to process the embedding, called "lattice only," which means adding the lattice in the same way as Lattice-LSTM [16]. The experimental results in Table 4 show that the Position-Attention proposed in this article performs much better than "lattice only." At the same time, these experimental results also show that obtaining more relationship features between the word information and the input sequence helps the model perform better.

Residual experiments
In order to further prove the effectiveness of GCN and the possibility of building a deeper network structure, we conduct residual experiments of GCN on WeiboNER, by adding the output before entering the GCN layer to the output of the GCN layer and sending the sum to the CRF layer for decoding.
In the experiments, the encoding $E_{pre}$ before passing through the GCN layer and the encoding $E_{GCN}$ after passing through the GCN layer are added under different weights. As illustrated in equation (13), the encoding entering the next layer is the weighted sum $\alpha E_{pre} + \sigma E_{GCN}$, where $\alpha$ and $\sigma$ are the weights of $E_{pre}$ and $E_{GCN}$, respectively. Experimental results obtained by adjusting different weights are shown in Figure 4. It is shown that when $\alpha = 0$ and $\sigma = 1.0$, the network performs best for F1, Precision, and Recall. That is, the network without residual calculation performs best. At the same time, all the experimental results with the GCN layer are better than those without the GCN layer, demonstrating that the encoding without GCN contains redundant information, which interferes with the encoding produced by the GCN layer after addition. Therefore, it further shows that the GCN layer can help the network extract key information.
As for the residual structure, although the bidirectional GCN retains the contextual features, the structure of the output features is not the same as the sequential structure of the original features, so adding the residuals of these two parts will bring confusing information. Therefore, the LTG model cannot build a deeper GCN through the residual structure, and hence cannot extract richer features for Chinese NER.
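The weighted combination of $E_{pre}$ and $E_{GCN}$ used in the residual experiments can be sketched as below. The encodings are represented as lists of per-position feature vectors of equal size, and the function name is hypothetical; the sketch only illustrates the weighted addition and says nothing about the actual feature dimensions of the model.

```python
def weighted_residual(e_pre, e_gcn, alpha, sigma):
    """Return alpha * e_pre + sigma * e_gcn, element by element."""
    return [
        [alpha * p + sigma * g for p, g in zip(row_pre, row_gcn)]
        for row_pre, row_gcn in zip(e_pre, e_gcn)
    ]


e_pre = [[1.0, 2.0], [0.5, 0.5]]
e_gcn = [[3.0, 4.0], [1.5, 2.5]]

# alpha = 0 and sigma = 1.0 is the setting reported as best: no residual path.
no_residual = weighted_residual(e_pre, e_gcn, 0.0, 1.0)

# An equal mix of both encodings.
equal_mix = weighted_residual(e_pre, e_gcn, 0.5, 0.5)
```

With $\alpha = 0$ the result is exactly the GCN output, so this setting corresponds to the network without the residual connection.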

Conclusions and future work
In this article, we proposed the LTG model and applied it to Chinese NER. LTG uses Position-Attention to improve the transformer model's ability to learn the relationships between characters and modifies GCN to improve the network's understanding of word information. The experimental results show that LTG achieves the best results so far for Chinese NER on the public datasets of MSRA, Resume, and WeiboNER. Moreover, LTG can be used both with and without BERT.
However, LTG does not work well on Ontonotes 5.0, due to too much redundant information, so in the future, we will try to reduce the redundant information in our model to improve its generalization ability. Besides, since the aim of NER is to serve downstream tasks and the results of the downstream tasks can give feedback to NER [31], we will combine NER more with its downstream tasks in the future work.
Funding information: This work was supported by the Fundamental Research Funds for the Central Universities (grant number 2021ZY87).

Conflict of interest: The authors declare that there is no conflict of interest.