Topic-Feature Lattices Construction and Visualization for Dynamic Topic Number

Topic recognition with a dynamic number of topics can update hyperparameters on the fly and obtain the probability distribution of topics along the time dimension, which facilitates the understanding and tracking of streaming text data. However, current topic recognition models tend to assume a fixed number of topics K and lack multi-granularity analysis of topic knowledge, so they cannot deeply perceive the dynamic change of topics in a time series. This paper introduces a novel approach based on the infinite latent Dirichlet allocation (ILDA) model to construct a topic feature lattice under a dynamic number of topics. In the model, documents, topics, and vocabularies are jointly modeled to generate two probability distribution matrices: document-topic and topic-feature word. Afterwards, the association strength between each topic and its feature vocabulary is computed to establish the topic formal context matrix. Finally, the topic feature lattice is induced according to formal concept analysis (FCA) theory. The topic feature lattice under dynamic topic number (TFL DTN) model is validated on a real dataset against mainstream methods. Experiments show that the model is more in line with actual needs and achieves better results in the semi-automatic modeling of topic visualization analysis.


Introduction
With the widespread application of Web 2.0, self-media platforms such as online forums and online communities have gradually become the main form of information exchange. While users enjoy convenient technology, they also face decision-making difficulties caused by explosive growth in review data. Topic modeling of a review dataset can realize a "short description" of each document, thus making it possible to mine the hidden semantic structure of large-scale datasets. However, in the process of topic recognition and evolution, the dynamic change of the number of topics makes it difficult to quantitatively analyze the relationship between the content relevance of a document and the number of topics [1]. In addition, current topic recognition models are mostly based on a fixed number of topics and cannot represent the semantic relevance between topics. At the same time, the recognition results depend only on the probability between topics, which makes it difficult to characterize the inherent hierarchical relationship of comment events. Therefore, it is extremely urgent to dig deeper topic relationships from review topics.
After years of research, topic detection and tracking [2] has gradually formed a relatively complete set of algorithms and systems, the goal of which is to classify massive texts according to topics and track their evolution. According to the different text representation models in the corpus, current topic evolution methods can be divided into two categories. The first type is cluster evolution analysis based on vector spaces. This type of method treats high-dimensional corpus text as an unordered set of low-dimensional words; it measures the similarity distance between texts and compares changes of topics at different times. Lu, et al. [3] proposed a K-means clustering method (EEAM) based on the multi-vector model. This method constructs topic events by calculating the similarity between sub-topics; topics at different moments are matched according to the similarity between event vectors to generate a topic evolution set. Lin, et al. [4] proposed a news review topic evolution model (WVCA) based on word vectors and clustering algorithms. This model first introduced the word vector model into text stream processing to construct word vectors in time series, and then used K-means clustering to extract topic keywords. Cigarrn, et al. [5] proposed an unsupervised topic detection algorithm (TDFCA) based on formal concept analysis (FCA). By combining similar content in formal concepts into concept lattices, formal concepts are used as the basic carrier to construct Twitter-based topic terms. Guesmi, et al. [6] proposed an event topic selection model (FCACIC) based on FCA. This method uses hierarchical clustering to detect common interest communities (CIC) in social networks, avoiding the introduction of new topics during topic detection. However, these methods cannot operate without human participation, as they only utilize the similarity distance between texts to determine the correlation between topics.
In order to cope with topic detection of massive documents in complex environments, some scholars have proposed probabilistic topic analysis. This type of method considers the topic to be smooth in the time dimension and uses the topic posterior probability of the t − 1 time slice as the prior probability of the t time slice; combined with the calculation of the similarity between topics, this reduces the calculation bias caused by part-of-speech differences. For example, the probabilistic latent semantic indexing (PLSI) model [7] and the latent Dirichlet allocation (LDA) model [8] map out the process of topic identification and evolution by establishing joint probabilities between texts, topics, and words. AlSumait, et al. [9] added an online text processing function on the basis of the LDA model and proposed an online Dirichlet probability model to achieve online tracking of topics. Although the focuses of the above studies differ, a common drawback is that the identification of topics relies heavily on the number of topics in text clustering or classification, and the number of topics needs to be specified in advance or obtained iteratively according to a given threshold, which cannot accommodate the topic evolution process. To meet this need, Herinrich [10] proposed the infinite latent Dirichlet allocation (ILDA) model, which implements topic classification based on the time-dependent relationship of the text. However, this method still suffers from "short-sightedness": in iterating toward the optimal number of topics it produces many meaningless topics, and it does not consider the weight of different topic feature words under a changing number of topics.
The approaches mentioned above have two drawbacks. First, these approaches rely on the number of topics used in text clustering. Specifically, topics are recognized and represented in a fixed way without considering the semantic changes of topic feature words under a dynamic number of topics, which fails to avoid false inheritance of topics. Second, the correlation strength of feature words under different topics in the ILDA model is weak, which makes it difficult to mine the inherent hierarchical relationships of events.
The motivation of this paper is to establish a partial order constraint relationship between topics and feature words. To achieve this goal, a model for building a topic feature lattice under a dynamic topic number (TFL DTN) is proposed, which realizes perception of the dynamic changes of topics in time series. Specifically, the TFL DTN model first obtains the topic-feature word probability matrix and the document-topic probability matrix by modeling documents, topics, and feature words. Then, the topic association matrix is established, and the features under different topics in a document are calculated according to the joint probability among them. Finally, multi-granularity topic networks are identified based on the characteristics of strongly correlated topics.

The Theory of ILDA
LDA is an unsupervised probabilistic model based on probabilistic latent semantic analysis (PLSA), which can implement implicit topic mining of documents [11]. The LDA model is a three-layer Bayesian network, where documents can be viewed as discrete mixtures of topics, and each topic converges to a limited mixture of topic feature words with probability, as shown in Figure 1. However, the hyperparameters α and β of this model need to be set in advance, and the manually set number of topics K determines the granularity of text division. In extreme cases, an inappropriate number of topics will either merge too many divided text topics or generate empty topics, making it impossible to obtain valid topic description information and failing to meet the actual needs of topic division. The ILDA model [12] instead exploits the time-dependent relationship of the text to realize topic classification under a dynamic number of topics. The model structure is shown in Figure 2. There are two main differences between the two models in Figures 1 and 2. First, on the basis of LDA, ILDA changes the number of topics K to a dynamic variable that can take any value in the interval [1, +∞). Second, the document-topic distribution matrix θ in LDA is determined by the Dirichlet distribution with hyperparameter α, whereas θ in ILDA is determined by the joint Dirichlet allocation process (DAP) rather than by the polynomial distribution of the hyperparameter α [13]. DAP is a prior distribution based on random probability, which can be obtained from the polynomial π. The polynomial π is a polynomial mixture that obeys the Griffiths-Engen-McCloskey (GEM) random measure distribution. The detailed calculation process can be found in [14]. The calculation of the base distribution O is shown in Equation (1) [15]. The advantage of DAP is that its input is not a fixed number, but a discrete variable that changes dynamically.
ILDA is a three-layer Bayesian network. By abstracting a document into a polynomial distribution over K topics, and abstracting a topic into a polynomial distribution over multiple feature words, it implements joint modeling of documents, feature words, and topics. At the same time, the number of topics depends on the random prior distribution of the hybrid model. It no longer requires that the topic priors of the document obey the Dirichlet distribution, thereby reducing the sensitivity of the topic model to the number of topics and improving the ability to model large corpora.
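As an illustration of the GEM measure used by DAP, the following sketch draws topic proportions via the standard stick-breaking construction (a truncated approximation; the function name and parameter values are our own, not from the model):

```python
import numpy as np

def gem_stick_breaking(alpha, n_sticks, rng=None):
    """Draw topic proportions from a truncated GEM (stick-breaking) measure.

    Each weight takes a Beta(1, alpha) fraction of the remaining stick, so
    the number of effectively non-zero topics adapts to alpha instead of
    being fixed in advance.
    """
    rng = np.random.default_rng(rng)
    breaks = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - breaks)[:-1]))
    weights = breaks * remaining
    return weights / weights.sum()  # renormalize the truncated tail

pi = gem_stick_breaking(alpha=2.0, n_sticks=50, rng=0)
```

A small concentration parameter α concentrates mass on a few sticks, which is why the effective number of topics can vary across time slices.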

The Theory of Formal Concept Analysis
FCA is a formal method that takes the formal context as its domain, which focuses on describing the hierarchical relationship between concepts [16] . This theory takes the partial order relationship between formal concepts as the core, and realizes the semi-automatic identification of multi-level ordered concept nodes by establishing the mapping relationship between description objects and attributes [17] . From the perspective of semantic relationship mining, the concept lattice construction process described by FCA theory can be regarded as the process of hierarchical relationship mining between topic nodes. Meanwhile, the association relationship between the topic concepts is obtained to enhance the semantic relationship between the feature words and the topic.
The mathematical foundation of FCA theory is lattice theory and order theory. The modeling process can be described as follows. First, based on the binary membership between objects and attributes, a ternary formal context (objects, attributes, relationship) is established. Afterwards, formal concepts that satisfy the partial order relationship are derived from the formal context. Finally, a formal concept lattice is established based on whether an order relationship exists between the concepts. In the above process, concept nodes at different levels can reflect different generalization and instantiation relationships between objects, which provides new ideas for obtaining the semantic correlation between topics and feature words.
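The derivation operators and concept-forming process described above can be sketched on a toy formal context as follows (the objects, attributes, and incidence relation are illustrative only, not from the paper's dataset):

```python
from itertools import combinations

# Toy formal context: topics (objects) x feature words (attributes).
objects = {"t1", "t2", "t3"}
attributes = {"brake", "torque", "price"}
I = {("t1", "brake"), ("t1", "torque"), ("t2", "torque"), ("t3", "price")}

def up(objs):
    """Attributes common to every object in objs (the derivation operator ')."""
    return {a for a in attributes if all((o, a) in I for o in objs)}

def down(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o in objects if all((o, a) in I for a in attrs)}

def concepts():
    """Enumerate all formal concepts (A, B) with A' = B and B' = A."""
    found = set()
    for r in range(len(objects) + 1):
        for subset in combinations(sorted(objects), r):
            A = down(up(set(subset)))  # closure of the object set
            B = up(A)
            found.add((frozenset(A), frozenset(B)))
    return found

for A, B in sorted(concepts(), key=lambda c: len(c[0])):
    print(sorted(A), sorted(B))
```

Ordering the resulting concepts by set inclusion of their extents yields the Hasse diagram of the concept lattice; the brute-force enumeration above is exponential and only suitable for small contexts.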

Construction of TFL DTN
Although the ILDA model can realize online topic identification under dynamic topic numbers, it determines the topic correlation degree only through the probability dependence relationship between topics. Besides, it does not take into account the changes in feature word weights that may be caused by changes in the topic number. At the same time, the model cannot effectively obtain the hidden hierarchical relationships between topics, and lacks the semantic modeling ability for multi-granularity knowledge. Therefore, this paper makes use of the good dynamic topic modeling ability of the ILDA model, introduces feature word weight parameters into the topic model, and combines the formal concept analysis method to establish the topic recognition model TFL DTN. The model first utilizes the ILDA model to simulate the dynamic topic generation process. Second, the strength of the connection between each topic and its feature words is determined based on the joint probability to establish the topic formal context. Finally, guided by concept features, the topic feature lattice is constructed to identify a multi-granular topic network including the document library, the topic array, and the feature word set, so as to realize the conceptual visual modeling of multi-layer network topics.

Model Construction based on TFL DTN
The topic modeling of TFL DTN can be divided into two sub-models: the self-adaptive topic analysis model (STAM) and the topic feature lattice construction model (TFLCM). First, the STAM model assumes that there is probability dependence between documents, topics, and feature words. Each document converges to a topic with some probability, and each topic emits feature words with a certain probability, thereby forming a three-layer generative probability distribution. Among them, the document is a topic polynomial distribution that obeys the Dirichlet distribution process, and the topic is likewise a feature word polynomial distribution that obeys the Dirichlet distribution, which is shared by the document set with different mixed topic proportions and feature word weights. For the convenience of explanation, the meanings of the variables and parameters in the model are shown in Table 1. The topic analysis process of the STAM model is as follows. First, Gibbs sampling is used to obtain the dynamic optimal number of topics, and a document-topic probability matrix and a topic-feature word probability matrix are established to extract topics and feature words respectively. Then, candidate feature words with the top N highest word frequencies are selected from the document-feature word matrix, on the basis of which feature words with higher weights are extracted. Finally, the above steps are iterated to obtain topics with their feature words. The STAM model reconstructs the probabilistic dependency relationship between topics and feature words on the basis of the ILDA model. In essence, the model does not change the generation of documents, topics, and feature words: the relationship still maps topics and feature words into the same semantic space through the probability selection model. Therefore, STAM can still be regarded as a three-layer Bayes network. The functional dependence of the variables and distribution matrices in the model is shown in Figure 3.
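A single collapsed-Gibbs update of the kind STAM's sampling loop relies on can be sketched as follows (a generic LDA-style update; the STAM-specific feature word weight terms are omitted, and all names are our own):

```python
import numpy as np

def gibbs_resample(doc_topic, topic_word, topic_counts, d, w, z_old,
                   alpha, beta, V, rng):
    """One collapsed-Gibbs update for the topic of word w in document d.

    doc_topic[d, k]  : count of words in doc d assigned to topic k
    topic_word[k, w] : count of word w assigned to topic k
    topic_counts[k]  : total words assigned to topic k
    """
    # Remove the current assignment from the counts.
    doc_topic[d, z_old] -= 1
    topic_word[z_old, w] -= 1
    topic_counts[z_old] -= 1
    # p(z = k) is proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta).
    p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
        / (topic_counts + V * beta)
    z_new = rng.choice(len(p), p=p / p.sum())
    # Add the new assignment back.
    doc_topic[d, z_new] += 1
    topic_word[z_new, w] += 1
    topic_counts[z_new] += 1
    return z_new
```

Sweeping this update over every word position until the count matrices stabilize is what "iterating until the probability is stable" amounts to in practice.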
The TFLCM model assumes that the probability value of a document-topic pair is positively correlated with the correlation strength of the corresponding topic-feature word pair: the greater the probability that a document selects a topic, the greater the probability that the topic selects a feature word. By setting a threshold, the strongly related topic features are filtered out and mapped into a formal context matrix, and the topic feature lattice is finally generated. The generation process of the TFLCM model is as follows. First, the association probability with the highest value is extracted from the document-feature word probability matrix, and the topic association matrix is obtained by calculating the feature word association strength under different topics in the document. Afterwards, the strongly correlated feature words are selected, and the correlation matrix of the topic formal context is generated. Finally, the generated topic feature lattice is reduced through formal concept analysis. The transformation relationship among the matrix variables in the model is shown in Figure 4. Based on the above analysis, the relevant definitions are given as follows.
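One plausible reading of this thresholding step can be sketched as follows, under the assumption that the association strength of a topic-word pair is the largest joint probability over documents (the function name and the max-over-documents choice are our own):

```python
import numpy as np

def topic_formal_context(Q, P, threshold):
    """Binarize topic-feature associations into a formal context matrix.

    Q[d, k] : document-topic probability matrix
    P[k, w] : topic-feature word probability matrix
    strength[k, w] = max_d Q[d, k] * P[k, w], i.e. the strongest joint
    probability of any document selecting topic k and topic k selecting
    word w; entries at or above the threshold become incidences (1s).
    """
    strength = Q.max(axis=0)[:, None] * P
    return (strength >= threshold).astype(int)
```

The resulting 0/1 matrix plays the role of the incidence relation I in the topic formal context F = (T, W, I).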
If the topic probability vectors q_{z_k} for topic z are generated, the sampling probability of topic z, denoted p(z_d | Q_d), can be obtained on the basis of the document-topic probability matrix Q_d = {q_{z_1}, q_{z_2}, ..., q_{z_k}}; if the feature word probability vectors p_{w_n} for the feature word w are generated, the sampling probability p(w_d | P_z) of feature word w in topic z_i can be obtained on the basis of the topic-feature word probability matrix P_z = {p_{w_1}, p_{w_2}, ..., p_{w_n}}.

Definition 4 (Feature Matrix of Feature Words): Let the dependence of the feature word probability w_i on the topic z_i under the number of topics K be s_{z_k}; the weight matrix of the feature words is then composed of these dependencies.

Definition 5 (Topic Association Matrix): Let the association set R_i = {r_{z_1}, r_{z_2}, ..., r_{z_k}} between topic z_i and the feature words w_i satisfy the given constraint; then R_i is called the topic association matrix under the topic z_i. In particular, if r_{z_i} ≥ r_{z_s}, where s = arg max_{i=1,...,k−1}(r_{z_i} − r_{z_{i+1}}), R_i is called a strong association matrix of z_i (denoted SR_i), and C_{SR_i} records the feature set of all topic associations that satisfy the constraint SR_i in the topic set.

Definition 6 (Topic Formal Context): Let the topic formal context be F = (T, W, I), where T = {t_1, t_2, ..., t_K} represents the topic set, W represents the feature word set, and I ⊆ T × W represents the mapping relationship between topics and feature words, on condition that (t_i, w_j) ∈ I implies t_i ∈ C_{SR_i}.

Definition 7 (Topic Feature Lattice): Let the topic formal context be F = (T, W, I). For A ⊆ T and B ⊆ W, any two-tuple (A, B) satisfying A* = B and B* = A is called a formal concept. When A_1 ⊆ A_2 (equivalently, B_2 ⊆ B_1), there is a partial ordering relationship ≺ such that (A_1, B_1) ≺ (A_2, B_2), where the * operation is defined as in Equations (1) and (2).
The partial order relationship set of all formal concepts in topic formal context F constitutes the topic feature lattice denoted as L(T, W, I).
To sum up, the topic in the TFL DTN model is a latent variable that depends on a mixture of document-topic polynomials, and the feature words depend on observable variables of the multimodal mixture between (topic, feature word) and (feature word, feature word weight). The core idea of the model is described as follows. First, potential semantic associations among variables are established through the probability dependence on documents, topics, feature words, and feature word weights, and the Dirichlet stochastic process is taken as the prior distribution of the Bayes network. Then, Gibbs sampling obtains the number of dynamic topics and establishes a document-topic probability matrix, a topic-feature word probability matrix, and a feature word weight matrix. Finally, the TFL DTN model calculates the topic association matrix to filter out strongly related topic features and maps them into a formal context association matrix; on this basis, a binary partial order relationship between topics and feature words is established to generate the topic feature lattice. The overall structure of the TFL DTN model is shown in Figure 5.

Figure 5 Overall structure of the TFL DTN model

Model Reasoning and Parameter Iteration
Since the derivation and parameter estimation of variables and distribution matrices in the TFL DTN model are mainly handled by the STAM model, while the TFLCM model mainly performs secondary filtering and correlation analysis of topic feature words, this section mainly discusses the estimation of the hidden variable z and the matrices P and Q. The matrix relationships of the TFLCM model are transformed into the algorithm description in Subsection 3.3.

Model Reasoning
The STAM model first introduces hyperparameters α and β for the topic probability distributions that represent mixed documents. Meanwhile, the parameter γ is utilized to represent the probability distribution of feature words for mixed topics. Afterwards, the topic of a word is drawn according to the topic probability distribution, and the feature words of the topic are generated on the basis of the feature word probability distribution. In the above process, since α in the STAM model undergoes multiple iterations, its initial value has little effect on the calculation of the model. The prior γ can be calculated from the GEM polynomial distribution. Therefore, to solve the joint probability distribution of the model, the posterior conditional probability of the variable w must be obtained first; it then serves as the prior conditional probability of the probability matrix P to calculate the topic polynomial distribution. Finally, the Gibbs sampling algorithm is used for approximate estimation, yielding steady-state distributions of the probability matrices P and Q. For the convenience of explanation, the meanings of the variables during parameter iteration are shown in Table 2.
The joint probability of all observable and hidden variables in the model with the hyperparameters is shown in Equation (4).

Table 2 Parameter description in STAM model: |M| denotes the total number of documents in the corpus.

p(w, z, r, P, Q | α, β, γ, S) = ∏_n p(w_n | r_n) p(r_n | S, P) p(P | β) p(z_n | Q) p(Q | α, γ). (4)

By solving the integrals for P and Q in the above formula, the probability dependence between variables can be further solved, as shown in Equation (5).

p(w, z, r, d) = p(w, r | z, d) p(z | d). (5)
The above formula can be further expressed as shown in Equation (6).
where p(P|β) represents the probability that the hyperparameter β generates the feature words under each topic, p(S|P) represents the probability that the feature word weight matrix depends on the feature word distribution, and p(Q|α, γ) represents the prior distribution of a Bayes network that depends on the Dirichlet random process. Equation (6) can be further expanded as Equation (7).
The posterior probability of the available document library D = {w_d}_{d=1}^{|M|} is shown in Equation (8).
From the above formula, the Gibbs sampling formula can be further obtained as shown in Equation (9).

Parameter Estimation
The STAM model first assigns random topics to candidate feature words, and then iteratively recomputes the probability distribution of feature words w until the probabilities stabilize (Equation (9)). After that, topic j is extracted from the Q matrix (Equation (10)), and feature words w can be extracted from the P matrix with the probability given in Equation (11).
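In the standard LDA setting, the point estimates behind Equations (10) and (11) are normalized smoothed counts; the following is a sketch under that assumption (omitting the STAM-specific weight matrix):

```python
import numpy as np

def estimate_matrices(doc_topic, topic_word, alpha, beta):
    """Point estimates of the document-topic matrix Q and the topic-feature
    word matrix P from the stabilized Gibbs counts.

    Q[d, k] = (n_dk + alpha) / (n_d + K*alpha)
    P[k, w] = (n_kw + beta)  / (n_k + V*beta)
    """
    K = doc_topic.shape[1]
    V = topic_word.shape[1]
    Q = (doc_topic + alpha) / (doc_topic.sum(axis=1, keepdims=True) + K * alpha)
    P = (topic_word + beta) / (topic_word.sum(axis=1, keepdims=True) + V * beta)
    return Q, P
```

Each row of Q and P is a proper probability distribution by construction, which is what the subsequent TFLCM filtering step consumes.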

Algorithm Description
According to the description mentioned above, the parameter iterative process of the STAM model, as well as the matrix relationship conversion process of the TFLCM model can be described in Algorithm 1.

Output: {L(T, W, I), T, P, Q, R}
The proposed TFL DTN algorithm can be divided into two sub-models: STAM and TFLCM. STAM starts by initializing the number of topics and generating the matrices Q and P on the basis of the Dirichlet distribution (steps 1 to 6). The algorithm then obtains the document-feature word matrix by calculating the feature word frequency vector (steps 7 to 10). The model iteratively samples the feature words and calculates the feature word weight matrix under different topic numbers to obtain the topic-feature word probability matrix and the document-topic probability matrix (steps 11 to 21). Consequently, topics and weighted feature words are extracted on the basis of the feature words with higher weights (steps 22 to 29). To build the topic feature lattice, TFLCM calculates the topic association matrix by extracting the association probability with the highest value from the document-feature word probability matrix, and generates the correlation matrix of the topic formal context (steps 30 to 35).

Preprocessing
We randomly select 1,583,275 online review data of the 20 automobile brand forums from the two websites of Auto Home and Netease Auto, from August 1, 2019 to September 20, 2019. First, the initial document is segmented, and the standard document corpus is obtained by removing data such as stop words, special symbols, and useless tags. Then, the text is converted into a set of review phrases, and a document-word matrix is established. Afterwards, the TF-IDF vector can be calculated to obtain attribute feature words of comment data.
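The document-word weighting step can be sketched with a minimal TF-IDF computation over a tokenized corpus (a toy illustration using raw term frequency and log inverse document frequency; the corpus shown is invented):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a tokenized corpus.

    Stop words and special symbols are assumed to be already removed.
    tf is the raw term frequency in a document; idf = log(|D| / df(w)),
    where df(w) is the number of documents containing w.
    """
    N = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [
        {w: tf * math.log(N / df[w]) for w, tf in Counter(doc).items()}
        for doc in docs
    ]

corpus = [["engine", "noise", "brake"], ["brake", "price"], ["price", "fuel"]]
weights = tfidf(corpus)
```

Words that appear in every document receive weight zero, so the surviving high-weight words serve as the candidate attribute feature words of the comment data.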

Comparison of the Optimal Number of Topics
In order to verify the rationality of the number of dynamic topics in the STAM model, we set α = 0.1, β = 0.1, γ = 0.2, the initial topic weight r_0 = 0.5, and the iteration threshold to 0.2. The number of topics in the Baseline method (the LDA model) is set manually in advance to K = 40. The STAM model only specifies the number of algorithm iterations (200 and 280, respectively), and determines the number of topics in the documents by week.
It can be seen in Figure 6 that although the content of the events described in the corpus is relatively fixed, the number of topics in different periods changes dynamically, which reflects the correlation between the evolution of topics and the number of topics. In addition, the real data were summarized by manual annotation, and the number of topics varies in the interval [15, 60], which is consistent with the experimental results of the STAM model. At the same time, there is no positive correlation between document set size and the number of topics; rather, the number of topics is related to the degree of clustering of the actual topics. For example, during the 200 iterations of the STAM model, the document set of the second week contains 847 texts, while the document set for Week 3 is composed of 561 texts; in contrast, the topic number of the former is only 32, while that of the latter is 49.
In addition, in order to test the topic prediction and text representation capabilities of the STAM model, the perplexity of the above models on the corpus documents is calculated. The smaller the perplexity value, the stronger the model's topic prediction capability for the documents. The calculation of perplexity is shown in Equation (12), and the experimental results are shown in Figure 7.
where p(w_d) denotes the probability that the model assigns to document d. Figure 7 shows that the perplexity curve of the STAM model is lower than that of the Baseline method as a whole, and the perplexity on the dataset gradually decreases as the number of topics increases. When the number of topics reaches K = 70, the Baseline curve changes little, which indicates that the topic distribution tends to be stable and the model achieves optimal performance, while the STAM model achieves its best performance at K = 62, indicating that it requires relatively fewer topics. In that case, the ability to capture the correlation between topics under a dynamic number of topics is stronger, which reduces the model's dependence on the number of topics K and improves the data representation ability on small-sample datasets.
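The perplexity of Equation (12) can be computed from the estimated matrices Q and P in the standard way (a sketch assuming p(w_d) is the product over word positions of the topic-mixed word probabilities, as in the usual LDA definition):

```python
import numpy as np

def perplexity(docs, Q, P):
    """Corpus perplexity: exp(-sum_d log p(w_d) / sum_d N_d), where
    p(w_d) = prod_n sum_k Q[d, k] * P[k, w_n].

    docs: list of documents, each a list of word ids
    Q:    document-topic probability matrix, shape (D, K)
    P:    topic-feature word probability matrix, shape (K, V)
    """
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(Q[d] @ P[:, w])  # mixture probability of word w
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```

As a sanity check, a model that assigns every word probability 1/V yields perplexity exactly V, the perplexity of random guessing over the vocabulary.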

The Construction of Topic Feature Lattice
When the model's iteration probability threshold is set to 0.01, the document-feature word matrix can be acquired from the corpus. At the same time, when the STAM model reaches a relatively stable state, both the document-topic probability matrix and the topic-feature word probability matrix are output. The top 10 feature words with higher probability are extracted, and their feature word weights are calculated separately. Due to the large number of topics, Table 3 lists only six relatively concentrated topics. Based on the identification results of topic feature words in Table 3, the content of the topic sets is analyzed manually and summarized as the following comment topics: Topic 1 (Topic 22) is security evaluation; Topic 2 (Topic 8) is economy evaluation; Topic 3 (Topic 34) is dynamic performance evaluation; Topic 4 (Topic 68) is comfort evaluation; Topic 5 (Topic 71) is service evaluation; Topic 6 (Topic 79) is manipulative evaluation. The top 10 associated topics of the reviews are listed in Table 4, in which the main clusters of the security-related topic are No. 13 and No. 22. Their feature words include braking, sideslip, blind area, early warning, etc. These words are highly associated with vehicle safety, which aligns strongly with the classification results of manual annotation. In addition, according to Algorithm 1, the document set is mined for strongly correlated topic features, and the relational matrix of the formal context is established to construct the topic feature lattice. The corresponding part of the Hasse structure of the topic feature lattice is shown in Figure 8: the closer a concept is to the top-level root node, the more generalized its topic words are, such as vehicle length, wheelbase, and weight; terms lower in the lattice are usually more specialized, such as acceleration, torque, vehicle power, and hill climbing associated with node Topic 34.
The results show that the topic feature lattice based on the TFLCM model can intuitively reveal the hierarchical relationships of different topic feature words, with good modeling ability in capturing the generalization and semantic relationships of topic words.

Figure 8 Hasse diagram of the topic feature lattice (partial)

Discussion
In order to verify the rationality of the TFL DTN model, the accuracy rate, recall rate, F1 value, and mean absolute error (MAE) are selected as the evaluation indicators. Meanwhile, comparison experiments are performed with the TFIDF algorithm [18], the TDFCA algorithm [15], and the ILDA algorithm [12] on the same dataset. Tables 5 and 6 present the comparison results of the evaluation indexes of the above algorithms. The results show that the prediction performance of the TFL DTN model is significantly better than the other methods on the six review topics. The accuracy, recall, and F1 value on the measured data can be maintained around 0.65, and the MAE value can be kept below 0.85. The reason is that the TFL DTN model combines the probabilistic relationship and partial order relationship between topic feature words and topics, which not only effectively reduces dimensionality but also improves the topic awareness of the document under the changing number of topics K.
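The evaluation indicators can be sketched as follows (plain per-label precision/recall/F1 and MAE; how the paper pairs predictions with annotations is not specified, so the inputs here are illustrative):

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one topic label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def mae(scores_true, scores_pred):
    """Mean absolute error between predicted and annotated scores."""
    return sum(abs(a - b) for a, b in zip(scores_true, scores_pred)) \
        / len(scores_true)
```

Averaging the per-label figures over the six review topics gives the aggregate values reported in Tables 5 and 6.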

Conclusions
The proposed TFL DTN method designs a topic recognition visualization to optimize the topic semantic correlation features generated by the ILDA model. The model iteratively generates the topic-feature word and document-topic probability distribution matrices based on the conditional probabilistic dependency relationships among topics, documents, and feature words. Through the calculation of feature word weights and the strong correlation matrix, a visual concept lattice for topic features is constructed, which realizes the generalization and specialization of semantic relationships between topic features.
Experiments show that the TFL DTN model has a good ability of topic recognition under dynamic topic numbers. In support of this, the following innovative points are made in this paper: 1) A method is proposed for calculating the correlation strength of feature words under different topics using the joint probability of topic-feature words.
2) A method is proposed to construct topic feature lattice in formal context association matrix at multi-granularity.
In order to improve the calculation accuracy of the topic prediction model, future research will focus on the semantic analysis of topic sentiment, to deeply mine online users' sentiment tendencies and establish text sentiment models for the hidden features of topics.