The relevance of reports in adjacent time periods must be determined when the evolution of topic content is analyzed. If a new report is related to an earlier report, it can be used to track the development of that earlier topic; if it is not relevant, the report is judged to describe a new event. Relevance judgment is also called correlation detection, which is the original problem of topic tracking. In text-mining methods, the similarity between two topics is generally calculated from the characteristic words in the text. Depending on the text representation model, topic similarity calculation methods can be summarized as the Jaccard coefficient and cosine similarity based on the bag-of-words model, the KL (Kullback-Leibler) distance method [7], and the word co-occurrence method based on language models.

The similarity algorithm most often used in conjunction with LDA is the KL algorithm; that is, the similarity between two topics is treated as the distance, in the vector space formed by the words of the two topics, between their word distributions, as follows:

$$KL(z_1, z_2) = D(P(w \mid z_1) \parallel P(w \mid z_2)) = \sum_{w \in W} P(w \mid z_1) \log \frac{P(w \mid z_1)}{P(w \mid z_2)}$$(1)

Where *D*(*P*(*w* ∣ *z*_{1}) ‖ *P*(*w* ∣ *z*_{2})) is the distance between the two vectors. The algorithm measures the difference between the probability distributions of two sets of characteristic words in the same semantic space. To improve the accuracy of this kind of algorithm, Chu proposed an improved similarity calculation method that combines the classical cosine similarity algorithm with the Jensen-Shannon divergence (JSD) [8], an improved form of the KL algorithm [9], hereinafter referred to as the JSD_COS algorithm:
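As an illustration of Eq. (1), the following sketch computes the KL distance between two topic-word distributions represented as word-probability dictionaries. The topics `z1` and `z2` are hypothetical examples, and the small `eps` constant is a common smoothing assumption to avoid division by zero when a word is absent from the second topic:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): sum over words of p(w) * log(p(w) / q(w)), per Eq. (1).

    eps smooths words missing from q, so the sum stays finite.
    """
    return sum(pw * math.log((pw + eps) / (q.get(w, 0.0) + eps))
               for w, pw in p.items() if pw > 0)

# Hypothetical topic-word distributions P(w | z1) and P(w | z2)
z1 = {"earthquake": 0.5, "haiti": 0.3, "rescue": 0.2}
z2 = {"earthquake": 0.4, "haiti": 0.4, "aid": 0.2}

print(kl_divergence(z1, z2))
```

Note that KL is asymmetric: `kl_divergence(z1, z2)` generally differs from `kl_divergence(z2, z1)`, which is one motivation for the symmetric JSD variant introduced below.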

$$JSD(z_1, z_2) = \frac{1}{2}\left(D(z_1 \parallel m) + D(z_2 \parallel m)\right), \quad m = \frac{1}{2}(z_1 + z_2)$$(2)

$$JSD\_COS(z_1, z_2) = \lambda\, COS(z_1, z_2) + (1 - \lambda)\, JSD(z_1, z_2) + \lambda$$(3)

Where *COS*(*z*_{1}, *z*_{2}) is the cosine similarity, i.e., the cosine of the angle between the two vectors composed of the characteristic words of the two topics, and *λ* is an adjustment coefficient.
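Eqs. (2) and (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the topic distributions are hypothetical dictionaries, and the final combination follows Eq. (3) exactly as printed above, including the trailing *λ* term:

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) with smoothing, as in Eq. (1)."""
    return sum(pw * math.log((pw + eps) / (q.get(w, 0.0) + eps))
               for w, pw in p.items() if pw > 0)

def jsd(p, q):
    """Jensen-Shannon divergence, Eq. (2): symmetric average of KL to the midpoint m."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(p, q):
    """Cosine of the angle between the two topic-word vectors."""
    vocab = sorted(set(p) | set(q))
    a = [p.get(w, 0.0) for w in vocab]
    b = [q.get(w, 0.0) for w in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def jsd_cos(p, q, lam=0.5):
    """Combined similarity, Eq. (3): lambda*COS + (1-lambda)*JSD + lambda."""
    return lam * cosine(p, q) + (1 - lam) * jsd(p, q) + lam

# Hypothetical topic-word distributions
z1 = {"earthquake": 0.5, "haiti": 0.3, "rescue": 0.2}
z2 = {"earthquake": 0.4, "haiti": 0.4, "aid": 0.2}
print(jsd_cos(z1, z2))
```

Unlike the raw KL distance, `jsd` is symmetric in its arguments, which makes the combined score in Eq. (3) independent of the order in which the two topics are compared.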

Because LDA and its improved models have shortcomings, such as the tendency of high-frequency words to be selected for the topic distribution, the above methods lack semantic analysis, and the semantics of the generated topics are therefore unclear.

On the other hand, some scholars use the co-occurrence relationship between characteristic words to express their semantic relations [10], so as to calculate the relevance between words and topics. The basic idea is that if several words often appear together in the same text, they express the semantic information of that text to a certain extent. For example, in a news report, if "Haiti" and "earthquake" appear together repeatedly, it can be inferred that this is a report on the earthquake in Haiti. Existing co-occurrence analysis models have mainly been used to extend VSM; no research has extended LDA in this way.
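The co-occurrence idea can be illustrated with a minimal sketch that counts how often each word pair appears together in the same document. The toy document collection is a hypothetical example echoing the "Haiti"/"earthquake" case from the text:

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized documents
docs = [
    ["haiti", "earthquake", "rescue"],
    ["haiti", "earthquake", "aid"],
    ["election", "vote"],
]

# Count document-level co-occurrences of each unordered word pair
cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1

print(cooc[("earthquake", "haiti")])  # → 2
```

Pairs with high counts, such as ("earthquake", "haiti") here, are taken to carry shared semantic information, which is the basis for using co-occurrence to score the relevance between words and a topic.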
