Published by De Gruyter Oldenbourg, November 1, 2022

Cross-Corpora Comparisons of Topics and Topic Trends

  • Victor Bystrov, Viktoriia Naboka, Anna Staszewska-Bystrova and Peter Winker

Abstract

Textual data have gained relevance as a novel source of information for applied economic research. When considering longer periods or international comparisons, different text corpora often have to be used and combined for the analysis. A methods pipeline is presented for identifying topics in different corpora, matching these topics across corpora and comparing the resulting time series of topic importance. The relative importance of topics over time in a text corpus is used as an additional indicator in econometric models and for forecasting as well as for identifying changing foci of economic studies. The methods pipeline is illustrated using scientific publications from Poland and Germany in English and German for the period 1984–2020. As methodological contributions, a novel tool for data-based model selection, sBIC, is implemented, and approaches for mapping topics across different corpora (including different languages) are presented.

JEL Classification: C49

1 Introduction

Textual data have gained relevance as a novel source of information for applied economic research. Examples include text-based indicators of economic uncertainty (Baker et al. 2016) and economic or political sentiment (Jentsch et al. 2020; Shapiro et al. 2022), the analysis of central bank communication (Hansen and McMahon 2016; Lüdering and Tillmann 2020), the use of textual information for nowcasting and forecasting (Ellingsen et al. 2022; Foltas 2022; Kalamara et al. 2020; Larsen and Thorsrud 2019; Thorsrud 2020), describing the diffusion of innovations (Lenz and Winker 2020) and innovation clusters (Krüger et al. 2020), and the link between real economic developments and scientific publications in economics (Lüdering and Winker 2016; Wehrheim 2019). Textual data have also been used during the Covid-19 pandemic for policy and impact analysis, e.g., by Debnath and Bardhan (2020), Mamaysky (2021), and Dörr et al. (2022).

In most of these applications, the main interest is in temporal patterns in the textual information and how these relate to other developments over time. However, a comparison of the content of different text corpora might also be of interest, e.g., when comparing sentiments in different countries or topics of economic research present in different scientific journals. When longer periods are considered, it might also become necessary to merge the information content of various textual data sources available for different sub-periods. In such settings, it is imperative to identify those topics which are common or at least similar across the corpora and to trace their evolution over time. Obviously, there is no guarantee of finding matching topics ex post. Therefore, it is also relevant to identify those topics which are specific to certain corpora only. Although such analyses are urgently needed, there is no consensus yet on which methods are most appropriate for specific applications.

In this paper, we extend the existing literature on topic modelling by proposing methods for comparing (matching) topics identified for various corpora. This approach also includes a data-based criterion for assessing the quality of a match, which eventually serves to identify real matches. Additionally, we suggest selecting the number of topics in each corpus based on the singular Bayesian information criterion (sBIC), which appears more robust than other tools commonly used for this model selection step.

The working of the methods is presented using economic text corpora in two languages. However, the proposed tools are generic and can also be applied beyond economic research. For example, the results described in this paper might be of interest to political and communication scientists who often need to consider multilingual textual sources (see e.g. Lucas et al. 2015; Maier et al. 2022).

The presented methods pipeline is not meant to be the only viable approach for such analyses, but rather a basic setting building mostly on established procedures. Although the approach is kept simple, it still involves a number of steps which require choosing several parameters. We stick as far as possible to standard parameter values and discuss some central aspects in more detail, in particular those related to choosing the number of topics in a corpus and the threshold values for defining a meaningful match of topics across corpora.

As an illustration of the application of the proposed methods pipeline, we consider two corpora of scientific publications in economics over the period 1984–2020, one published in Germany and one in Poland. These datasets and the basic methods pipeline building on latent Dirichlet allocation (LDA) (Blei et al. 2003) for topic modelling are introduced in Section 2. Methodological extensions are presented in Section 3, in particular the use of a singular Bayesian information criterion for selecting the number of topics in Subsection 3.1 and the approaches for matching topics across corpora in Subsections 3.2 and 3.3. The results of the application to scientific publications are provided in Section 4. Section 5 summarizes the findings and provides an outlook to further analysis.

2 Data and Methods

2.1 Topic Modelling and Corpora Comparisons

In this section, the text corpora are described. Figure 1 summarizes our research methodology.

Figure 1: Outline of the methods pipeline for corpora comparison.

Textual data for the current application consist of scientific articles published in Germany, in the Journal of Economics and Statistics (JES), and in Poland, in the proceedings of the Macromodels International Conference (MM) and the Central European Journal of Economic Modelling and Econometrics (CEJEME). A detailed description of the data sources is provided in Section 2.2. After the documents from the different sources were collected and prepared, the following common text preprocessing steps were applied (a code sketch of these steps is given after the list):[1]

  1. Removing all punctuation marks, special characters, numbers.

  2. Following Lüdering and Winker (2016), who also applied topic modelling to data from JES, we remove words that contain fewer than 3 or more than 20 characters in order to capture further stop words and reduce the vocabulary size.

  3. Removing English and German stop words (see Appendix B).

  4. Removing especially rare/common words: all words occurring in fewer than 2.5% or in more than 75% of the articles in the text corpus were removed. We prefer relative thresholds to absolute ones, since appropriate absolute values would depend on the size of the corpus and the vocabulary. Relative thresholds therefore appear more appropriate for establishing a standardized pipeline.

  5. Lemmatizing of texts, i.e., grouping inflected forms together as a single base form.

  6. Removing certain parts-of-speech (the so-called PoS tags), such as determiners, adpositions, conjunctions, and pronouns, to the extent that they are not contained in the usually rather short lists of stop words.
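The following minimal sketch illustrates preprocessing steps of this kind in Python. It is not the authors' original code: the spaCy model, the stop word list and the small example corpus are placeholder assumptions, and the relative document-frequency thresholds are passed to scikit-learn's CountVectorizer via min_df and max_df.

```python
# Illustrative preprocessing sketch (assumptions: the spaCy model
# "en_core_web_sm" is installed; spaCy's stop word list is used instead of
# the tm list from Appendix B; the German subset would need a German model).
import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# PoS tags to drop: determiners, adpositions, conjunctions, pronouns
DROP_POS = {"DET", "ADP", "CCONJ", "SCONJ", "PRON"}

def preprocess(text: str) -> str:
    text = re.sub(r"[^A-Za-zÄÖÜäöüß\s]", " ", text)   # 1. punctuation, numbers
    tokens = []
    for tok in nlp(text.lower()):
        if tok.is_stop or tok.pos_ in DROP_POS:        # 3./6. stop words, PoS
            continue
        lemma = tok.lemma_                             # 5. lemmatization
        if 3 <= len(lemma) <= 20:                      # 2. word-length filter
            tokens.append(lemma)
    return " ".join(tokens)

docs = ["Inflation and unemployment were modelled with a VAR.",
        "The central bank raised interest rates to curb inflation."]
cleaned = [preprocess(d) for d in docs]

# 4. relative document-frequency thresholds (2.5% and 75%)
vectorizer = CountVectorizer(min_df=0.025, max_df=0.75)
bow = vectorizer.fit_transform(cleaned)                # Bag-of-Words matrix
```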

The resulting Bag-of-Words (BoW) representations of the documents were further used to train LDA models for the Polish and German data sets. LDA is one of the best-known and most widely used topic modelling approaches (Blei et al. 2003). It is based on the assumption that each document in a corpus is a distribution over some latent topics and each topic is a distribution over a fixed corpus vocabulary. Therefore, the algorithm behind the LDA approach aims to identify this hidden structure and to uncover the underlying latent topics in a corpus.

For the German corpus, two different LDA models were trained for the two subsets in English and German. Running LDA on the combined corpus would result in two sets of topics which are language-specific, although some of them might cover the same semantic content. Therefore, separate modelling and matching post hoc appears preferable. As a robustness check, we also carried out an analysis on a joint corpus, for which all German texts were translated to English using DeepL API Pro (see Appendix C.1, pp. 32–33).[2] This robustness check using machine translation revealed that most of the topics from the joint German dataset can also be found among the topics uncovered in the English subset of the data. Furthermore, we show that using the joint LDA model for the identification of relevant matches between the two countries does not impact the results substantially. For the Polish corpus, only one LDA model was estimated. The models were trained using Python’s sklearn module. The implementation in sklearn follows Hoffman et al. (2010) and Hoffman et al. (2013) and provides a method for estimating LDA models based on the online variational Bayes algorithm. Except for the number of topics and the number of iterations in the training process, all parameters were kept at default values. Since there are several Python modules that implement the LDA algorithm, we also perform the analysis using gensim, another popular module for LDA topic modelling. In doing so, we aim to account for possible differences resulting from different LDA implementations. We show that the qualitative findings of the current work do not change. The results of this robustness check are described in Appendix C.2, p. 35.
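The estimation step could be sketched as follows. This is not the authors' code; the corpus, the number of topics and the number of iterations are placeholders (in the application, the number of topics is chosen by sBIC as described in Section 3.1).

```python
# Sketch of LDA estimation with scikit-learn's online variational Bayes
# implementation (Hoffman et al. 2010, 2013); all other parameters at defaults.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [  # placeholder for the preprocessed documents
    "inflation unemployment var model estimate",
    "interest rate monetary policy inflation forecast",
    "labour market wage unemployment panel data",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(
    n_components=3,            # number of topics H (selected via sBIC in the paper)
    max_iter=500,              # number of iterations of the training loop
    learning_method="online",
    random_state=0,
)
doc_topic = lda.fit_transform(X)           # N x H document-topic weights (theta)
topic_word = lda.components_               # H x M topic-word pseudo-counts
topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)  # normalize rows
```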

The choice of an optimal number of topics in LDA models still remains a challenge in applied research. Although there are several criteria for selecting the optimal number of themes, the ultimate choice is often based on human judgement concerning interpretability of selected topics. In the current work, we aim to avoid the subjectivity of topic selection and use sBIC to determine the optimal number of topics for each of the text corpora. We further discuss interpretability of topics selected by sBIC. To our knowledge this is the first application of sBIC to LDA modelling and so we provide more methodological and practical details in Section 3.1.

In the topic modelling stage, we obtained three different sets of topics corresponding to two countries and two languages. To distinguish these sets, we use the following notation: PL ENG, DE ENG and DE GER, where PL and DE indicate the country of publication of a corpus, i.e. Poland or Germany, while the superscripts ENG and GER indicate the language of publication.

We first focus on the matching between DE ENG and PL ENG topics. The matching of two topic sets of different LDA models can be done based on topic-word frequency vectors. Thereby, the distributions of topics over the vocabulary words are compared. However, since this standard approach to topic matching assumes that the vocabularies are in the same language, it cannot be applied to LDA models trained on corpora in different languages. For this reason, we also propose an embedding based approach to topic matching in order to compare topic sets in different languages. The topic-word frequency vector based matching and the embedding based matching are described in Sections 3.2 and 3.3, respectively.

In the final step, we qualitatively analyse the resulting topic matches and define thresholds for matches expected to be meaningful in a colloquial sense. We also construct topic time series based on the topic weights for each of the topic sets and descriptively analyse the time series trends of the matched topics.

2.2 Textual Data for Germany and Poland

The illustration of the methods pipeline is based on corpora of scientific articles published in Germany and Poland. Given the interest in comparing trends of research topics over time, it is important that both corpora cover a long period. For our application, the overlap of both corpora is from 1984 to 2020. While the time span of the sample is rather long, the number of documents per year is substantially smaller than in other applications covering recent years. Furthermore, scientific articles in economics are more focused than general interest documents. Therefore, the number of distinct topics to be expected in these corpora is rather small. Nevertheless, the example might well illustrate the general procedure of cross-corpora topic and topic trends comparison outlined in Subsection 2.1.

2.2.1 German Text Collection

The German textual data consist of articles published in the Journal of Economics and Statistics. The Journal has been published since 1863 containing articles that cover topics from economics with a focus on empirical economics and applied statistics. During the sample period 1984–2020 publications have been either in German or English. The distribution of the articles’ languages is presented in Figure 2.[3]

Figure 2: Language of the articles in the German text corpus.

The volumes of the journal published annually between 1984 and 2020 comprise usually 4–6 issues.[4] The details of data collection and further steps of preparation are described in Appendix A. The corpus used for the application comprises 903 articles in German and 704 articles in English.

2.2.2 Polish Text Collection

Two sources of textual data for Poland were considered. Firstly, proceedings of the Macromodels International Conference (MM) and joint meetings were used providing textual data for the years 1984–2011. Secondly, papers published in the Central European Journal of Economic Modelling and Econometrics (CEJEME) in the period 2009–2020 were analysed.

The Macromodels International Conference has been organised in Poland every year since 1974. The printed materials analyzed in this article also included papers presented at the meetings held jointly with MM, such as Econometric Modelling and Forecasting Socialist Economies (Models & Forecasts, MF), the Multivariate Statistical Analysis (MSA) conference and the Association for Modelling and Forecasting Economies in Transition (AMFET) meetings. As indicated at the Macromodels’ webpage (www.macromodels.uni.lodz.pl), the aim of the conference is to “bring together scientists who work in the field of econometric modelling […]. Within the scope of interest are issues such as the problems of estimation, simulation, developing econometric models and their use for policy analyses. Recently, a special attention has been given to modelling economies of new EU member countries”. Conference materials printed as books are available from 1984 up to 2011. Altogether 41 conference volumes comprising a total of 514 articles were used in the analysis. The language of the conference is English. After several preprocessing steps that are described in more detail in Appendix A, a structured database with bibliographic information for the articles was created in Python. The data include information on the year of publication, names of the authors, title of the paper, abstract and the main text.

The article collection from the Central European Journal of Economic Modelling and Econometrics includes 145 scientific articles which appeared in 46 issues of the journal, starting with the first issue from 2009 (January 2009) and ending with the fourth issue from the year 2020 (April 2020). As indicated by the aims and scope of CEJEME, the papers are focused on the theory and applications of mathematical and statistical models in economic sciences. All articles are in English. Detailed information on the preparation of the data from CEJEME publications can be found in Appendix A. The Polish data set used for the application consists of the main texts of the articles (without abstracts) from MM and CEJEME.

3 Methodological Advances

This section first describes the proposed information criterion for determining the optimal number of topics in a corpus. Then, we present a general topic matching approach for LDA models trained on different corpora in the same language. As the German corpus consists of texts in English and German, we also propose a further topic comparison approach based on multilingual word representations that can be applied to LDA models trained on corpora in different languages.

3.1 Topic Number Selection Based on Singular Bayesian Information Criterion

Several criteria, which are often used for selecting the number of topics in LDA modelling, are based on specific semantic properties of selected topics, such as similarity and coherence (see Cao et al. 2009; Mimno et al. 2011). The number of topics selected by these criteria frequently differs considerably and the final choice is based on human judgement concerning interpretability of selected topics. In this paper, the number of topics is chosen using an information criterion that does not directly quantify any semantic property of topics, but balances the goodness-of-fit and model complexity. The model selection procedure based on the information criterion does not rely on topic interpretability, but chooses the optimal number of topics that can be used for inference. Nevertheless, topics selected by the information criterion are expected to be interpretable for a text corpus generated by an LDA model.

The implementation of information criteria for topic number selection in the LDA analysis is complicated because LDA is a singular statistical model: its Fisher information matrix is not positive definite. The usual BIC cannot be used for the evaluation of singular models, as the penalty for model complexity in the BIC is too large for such models: too few topics would be selected in the LDA modelling if the regular BIC were used.

Drton and Plummer (2017) proposed a model selection criterion, called singular BIC (sBIC), that uses Bayesian model averaging and a smaller penalty than the one used in the regular BIC. Hayashi (2021) derived the asymptotic learning coefficient for LDA that can be used for the evaluation of the penalty in sBIC. In this paper, the model averaging method proposed by Drton and Plummer (2017) and the asymptotic learning coefficient derived in Hayashi (2021) are combined in order to implement sBIC for the selection of the number of topics in the LDA modelling. As this is a novel application of sBIC, the essential theoretical and practical details of the procedure are briefly described below.

In order to present the essential details of sBIC, let us consider a document corpus D that includes N documents and uses a vocabulary of M words. An LDA model is described by the N × H matrix θ of document-topic frequencies and the H × M matrix β of topic-word frequencies, with the dimensions of these matrices depending on the number of topics H. A set of candidate LDA models is thus determined by the numbers of topics in the candidate models: H ∈ {H_min, …, H_max}.

The marginal likelihood of corpus D given a model with H topics can be written as

$$L(D \mid H) = \int_{\theta, \beta} P(D \mid \theta, \beta, H)\, \mathrm{d}P(\theta, \beta \mid H),$$

where P(D | θ, β, H) is the likelihood of D given matrices θ and β. The Fisher matrix for LDA models is singular, and the quadratic approximation of the marginal likelihood, which is used in the derivation of the regular BIC, is not possible. But the singular BIC can be derived using the decomposition (Watanabe 2009)

$$\log L(D \mid H) = \log P(D \mid \hat{\theta}, \hat{\beta}, H) - \lambda(H)\log(n) + (m(H) - 1)\log\log(n) + O_p(1),$$

where θ̂ and β̂ are consistent estimators of the corresponding matrices, λ(H) is a learning coefficient measuring the stochastic complexity of a model with H topics, m(H) is the multiplicity of the learning coefficient, and n is the number of words in the document corpus. In practice, λ(H) and m(H) are not known as they depend on the true model dimension. This problem can be solved by the model averaging described in Drton and Plummer (2017).

An approximation of the marginal likelihood for a model with H topics, based on averaging over submodels with numbers of topics h ≤ H, is obtained as

(3.1)
$$L'(D \mid H) = \frac{\sum_{h \le H} L_{Hh}\, L(D \mid h)\, P(h)}{\sum_{h \le H} L(D \mid h)\, P(h)},$$

where P(h) is the prior of a model with h topics,

$$L_{Hh} = P(D \mid \hat{\theta}, \hat{\beta}, H)\, \frac{(\log n)^{m_{Hh} - 1}}{n^{\lambda_{Hh}}},$$

and the constants λ_Hh, m_Hh can be computed for any h ≤ H using formulas from Hayashi (2021). Following Drton and Plummer (2017), we replace the unknown marginal likelihoods L(D | h) on the right-hand side of (3.1) by their approximations L′(D | h),

$$L'(D \mid H) = \frac{\sum_{h \le H} L_{Hh}\, L'(D \mid h)\, P(h)}{\sum_{h \le H} L'(D \mid h)\, P(h)},$$

and define the singular Bayesian information criterion for a model with H topics as

$$\mathrm{sBIC}(H) = \log L'(D \mid H),$$

where L′(D | H) is the unique solution of the equation system

(3.2)
$$\sum_{h \le H} \left[ L'(D \mid H) - L_{Hh} \right] L'(D \mid h)\, P(h) = 0$$

that can be found recursively, starting with L′(D | H_min) = L_{H_min H_min} for the minimal model. The optimal number of topics maximizes sBIC over the set {H_min, …, H_max}.

For an empirical implementation of sBIC in the LDA modelling approach, it is essential to use high-precision computations in the case of large-scale data sets. Since the likelihood function for an LDA model is a product of a large number of word frequencies, it usually takes extremely small positive values. Correspondingly, the log-likelihood function takes negative values of extremely large modulus. To avoid exponent underflow and overflow in floating point computations, the outer limits allowed for exponents of floating-point numbers have to be sufficiently large. The computational precision (the size of the fractional part of floating-point numbers) has to be sufficiently high for solving the system of quadratic equations (3.2), which represents the bottleneck of the sBIC algorithm.

The selection of an appropriate precision needed to avoid rounding errors might depend on a particular dataset. However, as compared to the estimation time of LDA models the additional time needed for high-precision computations in the sBIC algorithm is not substantial.
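A hypothetical sketch of this computation is given below. It assumes that the maximized log-likelihoods log P(D | θ̂, β̂, H), the learning coefficients λ_Hh and their multiplicities m_Hh (Hayashi 2021) have already been computed, and that the priors over candidate models are uniform, so that P(h) cancels from (3.2). The function and variable names are ours, not the authors' implementation; the recursion solves the quadratic equation for each H with Python's decimal module to avoid underflow.

```python
# Sketch of the sBIC recursion of Drton and Plummer (2017) for LDA.
from decimal import Decimal, getcontext

getcontext().prec = 200          # high precision against rounding errors
getcontext().Emax = 10**9        # wide exponent range against overflow
getcontext().Emin = -(10**9)     # ... and underflow

def sbic(loglik, lam, mult, n, H_range):
    """Return {H: sBIC(H)} for the candidate numbers of topics in H_range.

    loglik[H]    -- maximized log-likelihood log P(D | theta_hat, beta_hat, H)
    lam[(H, h)]  -- learning coefficient lambda_{Hh}
    mult[(H, h)] -- multiplicity m_{Hh}
    n            -- number of words in the corpus
    H_range      -- sorted list of candidate numbers of topics
    """
    n = Decimal(n)
    logn = n.ln()

    def L_Hh(H, h):
        return (Decimal(loglik[H]).exp()
                * logn ** (mult[(H, h)] - 1) / n ** Decimal(lam[(H, h)]))

    Lprime, scores = {}, {}
    for i, H in enumerate(H_range):
        smaller = H_range[:i]                     # submodels with fewer topics
        if not smaller:
            Lprime[H] = L_Hh(H, H)                # minimal model: L'(D|H) = L_HH
        else:
            # equation (3.2) reduces to x^2 + b*x + c = 0; take the positive root
            b = sum(Lprime[h] for h in smaller) - L_Hh(H, H)
            c = -sum(L_Hh(H, h) * Lprime[h] for h in smaller)
            Lprime[H] = (-b + (b * b - 4 * c).sqrt()) / 2
        scores[H] = Lprime[H].ln()                # sBIC(H) = log L'(D|H)
    return scores
```

The optimal number of topics is then the H with the largest value in the returned dictionary.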

3.2 Topic Matching

The outcome of an LDA model is a matrix containing the probabilities of occurrence of each word in each topic. Therefore, a standard and intuitive way to compare two LDA models, or the hidden structures behind the data, is to compare the distributions of topics over the vocabulary words. Each topic can be represented as a vector with length equal to the vocabulary size.

For the comparison, the topic vectors from different LDA models should have the same length. However, it is quite improbable that the vocabularies from different corpora are exactly the same. There are two possibilities to create topic-word frequency vectors of the same length. First, one of the vocabularies can be considered as the base vocabulary. If some of the words are missing in the other vocabulary, the corresponding probabilities are set to zero. Alternatively, one can use only the intersection of the vocabularies of the considered corpora, i.e., only the words that occur in both corpora. In the current work, we use the second solution as only minor differences have been observed when comparing both approaches. In general, this choice bears the risk that matched topics can still differ substantially with regard to the non-overlapping parts of the vocabularies. Thus, in particular for less homogeneous corpora than considered in our application, one might also consider matching based on the union of the vocabularies.

In the next step, the similarities of the topic vectors are calculated. To this end, we consider two similarity measures:

  1. Jensen–Shannon divergence (JSD), which is closely related to the Kullback–Leibler divergence (KLD), measures the similarity between two probability distributions or, in the current case, two word-topic distributions. The Jensen–Shannon divergence between two probability distributions P and Q is calculated as follows:

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KLD}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KLD}(Q \,\|\, M),$$

where M = (P + Q)/2. The square root of the Jensen–Shannon divergence is a distance metric.

  2. Cosine similarity is an alternative measure of similarity of two vectors and is often used when working with textual data. Cosine similarity is the cosine of the angle between the two vectors. For example, a cosine similarity of 1 implies that two vectors have the same orientation in the corresponding vector space.

The final step is the actual matching of the topics. Again, there are two alternative approaches to matching that can be applied to obtain topic pairs. The first one is so-called one-to-one matching using the Hungarian algorithm (Kuhn 1955). The Hungarian algorithm is an optimization algorithm that, given a cost matrix containing the assignment costs between the topics of two LDA models, finds an optimal assignment of rows to columns with minimal costs. It is also possible to apply this algorithm if the numbers of topics of the two LDA models differ. In this case, some of the topics of the larger LDA model remain unmatched. One-to-one matching can be applied, for example, when the two corpora are expected to cover the same set of topics. When implementing one-to-one matching, distance metrics such as the Jensen–Shannon divergence or the cosine distance (defined as 1 − cosine similarity) should be used as the cost measure, since the Hungarian algorithm is formulated as a minimization problem.

The second option is best matching using the nearest neighbours approach, i.e., for each topic in a corpus, its nearest neighbour in the other corpus is chosen as its match. Hereby, the topics can be assigned multiple times. Best matching is a better choice when the thematic focus of the corpora to be compared is quite different and it is unclear whether each topic in one corpus can find a meaningful match in the other one.

However, given that each topic is assigned its nearest neighbour independently of the corresponding minimum distance, there is also no guarantee that all of the identified best matches actually correspond to a match according to the common understanding. For this reason, a cut-off value has to be set in order to select only topic pairs sharing a high enough similarity. At this point, it is important to mention that the best matching is a non-symmetric process. For example, if for the German Topic b the Polish Topic a is the nearest neighbour in the Polish topic set (direction Germany → Poland), it does not necessarily imply that for the Polish Topic a the German Topic b is the nearest neighbour in the German topic set (direction Poland → Germany). To account for this non-symmetry, it is advisable to check the topic assignments in both directions.

In the current application, we use the cosine similarity measure to evaluate the topic similarity and perform best matching. We set the cut-off value based on the distribution of the cosine similarity values between all possible topic pairs. Subsequently, we also perform the matching using Jensen–Shannon distance as a robustness check.
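The following sketch illustrates both matching variants under the assumptions of this section: the topic-word matrices are restricted to the intersection of the vocabularies, best matches are nearest neighbours by cosine similarity with a cut-off at the 95th percentile of all pairwise similarities, and one-to-one matching uses the Hungarian algorithm on cosine distances. Function and variable names are ours.

```python
# Sketch of topic matching between two LDA models (Section 3.2 assumptions).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.pairwise import cosine_similarity

def align_to_common_vocab(beta_a, vocab_a, beta_b, vocab_b):
    """Restrict two topic-word matrices to the intersection of the vocabularies."""
    common = sorted(set(vocab_a) & set(vocab_b))
    idx_a = [list(vocab_a).index(w) for w in common]
    idx_b = [list(vocab_b).index(w) for w in common]
    return beta_a[:, idx_a], beta_b[:, idx_b], common

def best_matches(beta_a, beta_b, quantile=0.95):
    """Nearest-neighbour matches (direction A -> B) above a data-based cut-off."""
    sim = cosine_similarity(beta_a, beta_b)       # H_A x H_B similarity matrix
    cutoff = np.quantile(sim, quantile)           # e.g. 0.265 in the application
    pairs = []
    for a in range(sim.shape[0]):
        b = int(np.argmax(sim[a]))
        if sim[a, b] >= cutoff:
            pairs.append((a, b, float(sim[a, b])))
    return pairs, cutoff

def one_to_one_matches(beta_a, beta_b):
    """One-to-one matching via the Hungarian algorithm on cosine distances."""
    cost = 1.0 - cosine_similarity(beta_a, beta_b)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```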

3.3 Embedding-Based Matching

The standard matching described in the previous subsection is restricted to the comparison of models in the same language. To enable multi-language analyses, we propose a further approach that uses so-called word embeddings. These word vector representations have attracted a lot of attention in recent years and are widely used in different applications, also beyond the natural language processing field. One of the most important characteristics of such word embeddings is the interpretability of the distances between them: semantically similar words tend to be close to each other in the shared vector space. For more details on how word embeddings are trained see Mikolov et al. (2013a, 2013b). Recently, such word embeddings have also been used in the context of topic modelling. For example, Dieng et al. (2020) introduce the embedded topic model (ETM), where each word and each topic in a corpus are represented in the same embedding space. The authors claim that the proposed approach addresses the drawbacks of a classical LDA model, namely dealing with large vocabularies. Empirically, it is shown that the method leads to better results compared to other approaches, including classical LDA, as measured by the coherence criterion introduced by Mimno et al. (2011). However, it is not discussed how ETM performs in a multilingual context when a dataset consists of texts in different languages, as in the current case. Since in the proposed approach word and topic embeddings are trained based on the underlying texts and word co-occurrences, applying it to a multilingual corpus would probably result in an embedding space that contains multiple sub-spaces related to the languages contained in the corpus. Therefore, we decided not to consider ETM further for our analysis.

Bianchi et al. (2020) address exactly this problem and develop a language-agnostic approach to topic modelling – Multilingual Contextualized Topic Modelling (MCTM). The authors develop a topic modelling approach that is based on document representations from SBERT, a Transformer-based technique for language modelling. The main advantage of the proposed approach is that a model can be trained on one corpus and topic distributions for documents in unseen languages can be inferred just based on the multilingual vector representations. In the current case, we could apply MCTM and train the model, for example, on the Polish dataset and infer topic distributions for the documents from the German dataset. In doing so, we would, however, restrict ourselves to the topics in the Polish dataset only. Some latent topics that are specific to the German dataset would be missing. Therefore, we decided to stick to our embedding based matching approach, which is described in more detail in the following.

In the last few years, many pre-trained word vectors have been released. For example, the fastText[5] library provides pre-trained word embeddings for over 150 different languages (Grave et al. 2018; Joulin et al. 2018). Many attempts have also been made to train multilingual word embeddings, i.e., a shared vector space for multiple languages. For example, Conneau et al. (2017) introduce both supervised and unsupervised approaches to learning cross-lingual word embeddings. The authors provide multilingual embeddings for 30 languages based on fastText monolingual word vectors.[6] These multilingual embeddings can be used to obtain language independent topic representations. Thereby, each topic can be represented as a vector in the shared multilingual vector space using the word embeddings of its most frequent words (see options 1–3 below). We consider the following options for obtaining multilingual topic vector representations:

  1. Represent a topic as the sum vector of n word vectors in the embedding space corresponding to its n most frequent words.

  2. Represent a topic vector as the weighted sum of n word vectors corresponding to its n most frequent words. The weights are given by the estimated LDA models and represent the probabilities of each word occurring in a certain topic. Thereby, rescale the original probabilities given by the LDA output depending on the number of words considered.

  3. Represent a topic vector as the weighted sum of all the vocabulary word vectors, i.e., the word embeddings of all the vocabulary words vectors multiplied by the probabilities of occurring given by the LDA output.

  4. “Translate” the words of one model into the language of the other model using word embeddings. For example, for each word in the English corpus vocabulary, search for the first nearest neighbour in German language and use the corresponding word as the translation of the English word. Afterwards, apply the standard matching approach described previously.

Further steps, i.e., calculating the similarity and applying one of the matching types, are performed analogously to the standard matching approach. In the current work, we use the first option and represent the topics as the sum vectors of the 100 word vectors corresponding to their 100 most frequent words (not weighted). While preliminary checks indicated no qualitative differences between the alternatives for our application, an in-depth comparison of their performance is left for future research.
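A sketch of the first option is given below. Loading of the aligned embeddings is schematic: the file names are placeholders for aligned monolingual vectors such as those released by Conneau et al. (2017), and the topic-word matrices and vocabularies are assumed to come from the LDA step.

```python
# Illustrative sketch of option 1: represent each topic by the (unweighted) sum
# of the embeddings of its 100 most frequent words in a shared vector space.
import io
import numpy as np

def load_vectors(path, limit=200_000):
    """Read word vectors in the fastText text format into a dictionary."""
    vectors = {}
    with io.open(path, encoding="utf-8", newline="\n", errors="ignore") as f:
        next(f)                                    # skip the header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=float)
    return vectors

def topic_embedding(topic_word_row, vocab, vectors, top_n=100):
    """Sum the embeddings of the top_n most frequent words of one topic."""
    top_idx = np.argsort(topic_word_row)[::-1][:top_n]
    vecs = [vectors[vocab[i]] for i in top_idx if vocab[i] in vectors]
    return np.sum(vecs, axis=0)

# emb_en = load_vectors("aligned.en.vec")   # placeholder file names for the
# emb_de = load_vectors("aligned.de.vec")   # aligned English/German embeddings
# vec_en = topic_embedding(beta_en[k], vocab_en, emb_en)
# vec_de = topic_embedding(beta_de[j], vocab_de, emb_de)
# ... then match topics via cosine similarity as in Section 3.2
```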

3.4 Topic Trends Comparison

The methods described in this section aim at identifying similar topics based on their textual content. A further aspect of interest is the development of these topics over time, namely the relative importance of the identified matched topics in their corpora at certain points in time. For this comparison, we construct topic weight time series and ask whether the identified topic matches exhibit similar dynamics over the considered period of time.

As described above, for each document in a corpus, the estimated LDA models provide the probabilities of each topic occurring in this document, i.e., each document is represented as a vector with length equal to the number of topics selected, which sums up to one. To construct topic time series, the probabilities of each topic occurring in documents of the corpus are aggregated over all documents published in a given year and averaged on an annual basis.

To construct time series for topic matches identified within the German corpus for the two different languages considered, the average of the individual topic time series was calculated. If one of the values was missing in one of the time series, this value was replaced with the value from the second time series. In doing so, we were able to provide longer time series for the DE ENG and DE GER matches, as the share of German articles in the German corpus was substantially higher until the early 2000s.

In order to describe similarity between the time series for the matched topics we perform two steps. Firstly, given that the trajectories are quite ragged due to the limited number of texts per year, to ease visual inspection we smooth the series using a two-sided filter. In the second step, we evaluate the correlation coefficient and compute the Euclidean distance for the pairs of filtered series.
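These steps could be sketched as follows, assuming doc_topic is the N × H document-topic matrix from the LDA step and years gives the publication year of each document; the window length and names are illustrative.

```python
# Sketch of the topic-trend comparison: annual averaging of document-topic
# weights, a centered equally weighted MA(5) filter, and the two measures used.
import numpy as np
import pandas as pd

def topic_series(doc_topic, years):
    """Average topic weights over all documents published in each year."""
    df = pd.DataFrame(doc_topic)
    df["year"] = years
    return df.groupby("year").mean()          # one column of weights per topic

def smooth(series, window=5):
    """Centered, equally weighted moving average (two-sided filter)."""
    return series.rolling(window=window, center=True).mean().dropna()

def compare(series_a, series_b):
    """Correlation and Euclidean distance of two smoothed topic-weight series."""
    a, b = smooth(series_a).align(smooth(series_b), join="inner")
    return float(np.corrcoef(a, b)[0, 1]), float(np.linalg.norm(a - b))
```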

4 Results

4.1 Number of Topics

As described in Section 2, the first step of the analysis consisted in identifying the optimal number of topics for each of the text corpora. The number of topics was selected by maximizing the singular BIC with a minimal number of topics set to H_min = 10 and a maximal number of topics set to H_max = 100. These boundaries were set based on an assessment of the variety of topics in the scientific publications considered. The models with different numbers of topics in the predefined range were assumed to have the same priors, i.e. P(H_min) = P(H_min + 1) = … = P(H_max). The values of the learning coefficients λ_Hh and their multiplicities m_Hh were obtained using the formulas provided by Hayashi (2021). High-precision computations were implemented using the Python module decimal.

Using the model selection procedure based on the sBIC, we identified 37 topics for the Polish data set, 20 topics for the DE GER data set and 60 topics for the DE ENG data set. The sBIC values for Poland and Germany depending on the number of topics are shown in Figures 3 and 4, respectively.[7] The red dashed lines indicate the selected number of topics for each corpus. For the DE ENG data set, maximizing the sBIC would lead to 74 topics. However, it can be seen in Figure 4b that the shape of the curve of sBIC values in the interval from 55 to 75 topics is almost flat. For this reason and due to a rather small data set consisting of 704 articles, we decided to consider 60 topics for this corpus.

Figure 3: Distribution of the sBIC values for the Polish corpus.

Figure 4: Distribution of the sBIC values for the German corpus.

For comparison, we also applied some of the techniques commonly used in the literature to choose the optimal number of topics. We used the Python module tmtoolkit and calculated the following evaluation metrics available in this module: Arun et al. (2010), Cao et al. (2009), perplexity, and coherence (Mimno et al. 2011). For our application, however, none of these metrics provides a clear indication of an optimal number of topics (see Figure 5 for the German corpora). In fact, the first two criteria always seem to suggest the largest number of topics, while the coherence criterion appears to favour a very small number of topics. Only perplexity suggests an interior solution for the smaller German corpus. The results for the Polish corpus are also inconclusive. Therefore, we stick to the novel sBIC measure with its strong theoretical background.

Figure 5: Evaluation metrics from tmtoolkit.

All topics identified in the LDA models selected by sBIC for both the German and Polish corpora are interpretable, i.e. by visual inspection of word clouds composed of the 50 most common words for each topic (see Section 4.2 and the online supplementary material E) we are able to link the topics to relevant economic issues. Thus, although sBIC does not directly measure any semantic quality of topics, the outcome of the model selection procedure using sBIC is a set of interpretable topics. If another criterion were used, the selected number of topics would be very large or very small compared to the number of topics selected by sBIC (see the discussion above). This would imply obtaining either a small model, in which some interpretable topics were omitted, or a large model, in which some topics might be meaningless.

4.2 Topics

Figure 6 shows some topics from the Polish corpus. The uncovered topics deal with different aspects of econometric models (Topics 3 and 36), forecasting (Topic 16), and modelling of macroeconomic indicators (Topics 9, 10, 17). Figure 7 presents some DE GER topics discussing unemployment (Topic 0), consumption and income (Topic 14), and government spending (Topic 15), as well as some DE ENG topics discussing business indicators (Topic 2), wages (Topic 6), and the stock market (Topic 14). The font size of the words in the presented word clouds corresponds to the relative importance of the words in a topic. The full set of topics obtained for all corpora can be found in the online supplementary material E.

Figure 6: PL ENG topics.

Figure 7: German topics.

4.3 Matching of Topics

In the matching stage, we first performed topic matching between DE ENG and PL ENG topics based on the topic-word vectors and the intersection of the two vocabularies (2523 words). We identified best matches based on the cosine similarity values.

Given that the topic matching procedure provides a match for every topic in the corpus considered, we have to differentiate between “sensible” matches, i.e., pairs of topics with high similarity, and best matches which nevertheless pair quite different topics. To this end, we propose to determine a cut-off value based on the distribution of the cosine similarity values between all possible topic pairs, which should provide an approximation of the values we might expect for random matches. Figure 8a presents this distribution of the cosine similarity values, which exhibits an “elbow” around a value of 0.2.

Figure 8: Distribution of cosine similarities between all possible matches.

We decided to use the 95th percentile (0.265) of the empirical distribution as the cut-off value. An alternative approach for determining this cut-off value could be based on Monte Carlo simulations for corpora with common and different topics. The computational resources required for such Monte Carlo simulations would be very high, and the setup would have to take into account how similar the topics within each corpus are, i.e., the results could be used only for a very specific setting. Therefore, we leave such an analysis to future work. Apart from defining a cut-off value, we also checked systematically whether there are multiple assignments, i.e., topics matched with the same topic in the targeted corpus. In this case, we only kept the pair with the highest cosine similarity value. At the same time, we took the non-symmetry of the best matching procedure into account and checked whether the topics in the selected matches are also nearest neighbours of each other when reversing the direction of matching.
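A minimal sketch of this filtering step, assuming the cosine similarity matrix between all topic pairs has already been computed, might look as follows (names are ours):

```python
# Post-processing of best matches: apply the cut-off, resolve multiple
# assignments, and keep only mutual nearest neighbours in both directions.
import numpy as np

def filter_matches(sim, cutoff):
    """sim: H_A x H_B cosine similarity matrix; returns filtered (a, b) pairs."""
    best_ab = np.argmax(sim, axis=1)          # direction A -> B
    best_ba = np.argmax(sim, axis=0)          # direction B -> A
    kept = {}                                 # candidate pairs, keyed by target b
    for a, b in enumerate(best_ab):
        if sim[a, b] < cutoff:                # drop pairs below the cut-off
            continue
        # multiple assignment: keep only the pair with the highest similarity
        if b not in kept or sim[a, b] > sim[kept[b], b]:
            kept[b] = a
    # keep only pairs that are mutual nearest neighbours in both directions
    return [(a, b) for b, a in kept.items() if best_ba[b] == a]
```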

Using this approach, a total number of 24 topic pairs were identified. Figures 9 and 10 show two of them. The matched topics seem to be quite similar as can be concluded from the word clouds comprising the 50 most frequent words. While the first one deals with international economic links, the second one is about business cycle analysis. Further matches deal with topics such as loan debt, hypothesis testing, forecasting methods, labour market and (un)employment, capital growth, oil shocks, inflation, income, trade etc. (see online supplementary material F). Results of a robustness check using Jensen–Shannon distance as the similarity measure instead of cosine similarity are provided in Appendix C.3.

Figure 9: Topic match “International economic relationships”.

Figure 10: Topic match “Business cycle”.

4.4 Embedding Based Matching of Topics

For the multilingual corpus, we applied the proposed embedding based approach to match the topics between the DE ENG and DE GER data subsets. To this end, each topic was represented as the sum vector of the 100 word vectors corresponding to its 100 most frequent words. Cosine similarity values were calculated between the topic pairs, and for each topic in one language its nearest neighbour in the other language was chosen as its match. Analogously to the topic-word based matching, we used the 95th percentile of the cosine similarity values between all possible topic pairs, 0.93, as the cut-off value to identify “sensible” matches (see Figure 8b). This approach resulted in 16 topic pairs within the German data set (see online supplementary material G). Finally, we made use of the English part of these matches to obtain overall matches between both German topic sets and the PL ENG topics.

Figures 11 and 12 show examples of these multilingual matches of the two corpora. In the first example, it becomes obvious that both German topics and the corresponding Polish topic deal with the labour market and unemployment. However, not all of the obtained DE ENG and DE GER topic pairs appear meaningful to the same extent. The second example shows that the DE GER topic deals with private consumption and income and might be related to the analysis of the life cycle of private households, while the matched DE ENG topic is about investment and capital growth, as is the one in PL ENG. This unsatisfactory outcome might be due to the specific multilingual embedding that was used for the matching. Therefore, further research is required for selecting or generating appropriate embeddings in order to improve the proposed approach to multilingual topic matching.

Figure 11: Topic match “Unemployment”.

Figure 12: Topic match “Capital growth”.

4.5 Time Series of Topic Weights

To enable a comparison of patterns in the series of weights, the time series for the PL ENG and DE ENG text corpora were filtered with the centered, equally weighted moving average computed using 5 observations, MA(5). In the next step, the Euclidean distance and correlation coefficients were evaluated for the smoothed series. The values of these measures as well as the cosine similarity scores are reported in Table 5 in Appendix D. Below we discuss relations between the weight series of two topic pairs.

Figures 13 and 14 present word clouds and weight series (both raw and filtered) for two selected pairs of topics from the PL ENG and DE ENG corpora. Analogous figures for all topic pairs are provided in online supplementary material F. Figure 13 shows the interest over time in the topics on international economic links. This match was characterised by a high cosine similarity score of 0.86146. The Euclidean distance between the filtered series amounted to 0.10798 and the coefficient of correlation had a value of 0.63348. The filtered series for the topics identified in both text corpora show a mild upward trend: an increasing interest in international links might be associated with an increasing openness and integration of the European Union economies. The high positive correlation can additionally be attributed to common patterns in the dynamics which are synchronised in time.

Figure 13: Topic match “International economic relationships”.

Figure 14: Topic match “Business cycle”.

Figure 14 shows word clouds and plots of the weight series for the pair of topics concerning the business cycle. Although the cosine similarity for this pair of topics is also high (0.83445), the Euclidean distance is larger (0.16063) and a negative correlation coefficient (−0.20020) indicates weaker co-movement. The negative correlation can be explained by the misalignment in time of interest in these topics due to the different economic circumstances of Germany and Poland. The creation of the euro area brought about an increased interest in business cycle studies in Germany. This can be explained by the need to better understand economic fluctuations in the common currency area. The importance of a similar topic for Poland grew later – after joining the common market.

5 Conclusions and Outlook

The present work considered scientific publications from Germany and Poland. The primary aim was to uncover main topics in the corpora using LDA modelling and to compare them with each other on the basis of the proposed matching approaches. The results of the current paper are a valuable contribution to the growing body of literature on text-as-data applications for several reasons.

First, we address one of the great challenges in topic modelling, namely the choice of the optimal number of topics. We suggest selecting the number of topics based on sBIC, a Bayesian information criterion adapted to the singularity of LDA models. Our analysis shows that the proposed information criterion leads to coherent topics in the considered text corpora. Second, we propose a topic matching approach that makes it possible to compare the topic-word distributions of two different LDA models and to identify suitable topic pairs across text corpora. This matching approach made it possible to find meaningful topic pairs describing similar concepts in the Polish and German corpora. Third, we suggest a data-based procedure for identifying potentially meaningful matches of topics across corpora. Using a data-based cut-off value for the minimum cosine similarity of matched topics, we were able to separate sensible matches from the remaining ones which do not correspond to similar topics in the colloquial sense. Fourth, we address the problem of topic matching between two LDA models trained on corpora in different languages by proposing a language agnostic topic matching approach using multilingual word embeddings.

The work could be extended along the following lines. Further research is required to examine the randomness component of the sBIC criterion, e.g., by conducting a simulation study for sBIC. Additional work is also needed to improve the proposed embedding based matching approach as well as to examine further possibilities for topic matching in a multilingual context. This is recommended since not all topics matched across different languages in the current study were fully convincing. Furthermore, some of the identified topic matches seem to be closely related to themes concerning real macroeconomic activities, e.g. inflation, unemployment, income etc. Further research will examine more closely the links between the corresponding topic time series and real macroeconomic variables with a focus on potential differences across countries.


Corresponding author: Peter Winker, Justus Liebig University Giessen, Licher Strasse 64, 35394 Giessen, Germany, E-mail:

Award Identifier / Grant number: WI 2024/8-1

Funding source: Narodowe Centrum Nauki

Award Identifier / Grant number: Beethoven Classic 3: UMO-2018/31/G/HS4/00869

  1. Research funding: Financial support from the German Research Foundation (DFG) (WI 2024/8-1) and the National Science Centre (NCN) (Beethoven Classic 3: UMO-2018/31/G/HS4/00869) for the project TEXTMOD is gratefully acknowledged.

Appendices

Appendix A: Data Preparation

A.1 German Data

In the first step, the data were downloaded from the De Gruyter website. Table 1 summarizes the number of articles published each year. Up to 2000, the volumes were available as scanned pdf files. Optical Character Recognition (OCR) was used to transform the existing pdf files into text files. The text files were then copied into Microsoft Word and saved again with a different encoding (Unicode UTF-8). After that, the new text files were again copied into Word and the following preparation steps were taken:

  1. Mark each issue number with heading 1.

  2. Mark each article title with heading 2.

  3. Remove the following non-textual elements:

    1. Table of contents,

    2. Author names and article numbers,

    3. Formulas and special characters,

    4. Bibliographies,

    5. Tables and appendices.

After these preparation steps, the data could be imported into Python and be further preprocessed and analysed.

Table 1:

Number of articles published in Journal of Economics and Statistics.

Year Volume Number of articles
1984 199 42
1985 200 46
1986 201&202 49
1987 203 49
1988 204&205 93
1989 206 52
1990 207 46
1991 208 53
1992 209&210 88
1993 211&212 81
1994 213 53
1995 214 53
1996 215 57
1997 216 45
1998 217 53
1999 218&219 84
2000 220 52
2001 221 39
2002 222 41
2003 223 46
2004 224 40
2005 225 45
2006 226 35
2007 227 42
2008 228 34
2009 229 41
2010 230 44
2011 231 46
2012 232 43
2013 233 35
2014 234 41
2015 235 39
2016 236 33
2017 237 26
2018 238 27
2019 239 36
2020 240 35

A.2 Polish Data

The texts from the two data sources for Poland had different forms. The conference proceedings were available as hard copies of the volumes, while CEJEME articles were digital and had the format of LaTeX or pdf files.

The available conference volumes (including more than 9000 pages) were scanned and saved as pdf files. The description of the MM data is provided in Table 2. Altogether, the data included 514 full length papers (with or without an abstract) and 231 abstracts (without the main text). In the next step, OCR was performed and the texts were saved as docx files. Over the years, the volumes were published by various publishing houses, using alternative typesetting styles and techniques. Thus, also the resulting source files differed considerably and extensive manual labour was needed to clean the texts. This preparatory step involved removing front and back matter, running heads and feet, tables, footnotes, figures, equations and other mathematical expressions as well as references. The beginning of each article was also manually marked. In addition, within each paper, information on the title, authors, affiliations, abstract (if present) and main body of the text were uniformly organized so that they could be easily identified by the code.

Table 2:

Proceedings from macromodels international conference and joint meetings.

Year of conference No. of volumes Meetings Contents
1984 1 MM and MF 17 full papers
1985 2 MM 26 full papers
1986 and 1987 1 MM 17 full papers
1988 and 1989 1 MM 18 full papers
1990 1 MM 11 full papers
1991 1 MM 12 full papers
1992 1 MM 16 full papers
1993 2 MM 24 full papers
1994 1 MM 11 full papers
1995 2 MM and MSA 29 full papers
1996 3 MM and MSA 46 full papers
1997 2 MM and AMFET 20 full papers
1998 1 MM and AMFET 8 full papers
1999 2 MM and AMFET 40 full papers
2000 2 MM and AMFET 14 full papers
2001 2 MM and AMFET 33 full papers
2002 2 MM and AMFET 25 full papers
2003 2 MM and AMFET 20 full papers
2004 2 MM and AMFET 23 full papers
2005 2 MM and AMFET 27 full papers
2006 2 MM and AMFET 25 full papers
2007 2 MM and AMFET 25 full papers
2008 1 MM and AMFET 10 full papers
2009 1 MM and AMFET 6 full papers
2010 1 MM and AMFET 6 full papers
2011 1 MM and AMFET 5 full papers

The input files from CEJEME used for modelling had LaTeX format.[8] Detailed information on the numbers of papers published in each volume and issue of CEJEME is presented in Table 3. All papers had an abstract.

Table 3:

Articles published in CEJEME.

Year Volume Issue Number of papers Year Volume Issue Number of papers
2009 1 1 5 2015 7 1 3
1 2 4 7 2 3
1 3 4 7 3 3
1 4 4 7 4 3
2010 2 1 4 2016 8 1 3
2 2 3 8 2 3
2 3 3 8 3 3
2 4 3 8 4 3
2011 3 1 3 2017 9 1 3
3 2 3 9 2 3
3 3 3 9 3 3
3 4 3 9 4 3
2012 4 1 3 2018 10 1 3
4 2 3 10 2 3
4 3 3 10 3 3
4 4 3 10 4 3
2013 5 1 3 2019 11 1 3
5 2 3 11 2 3
5 3 3 11 3 3
5 4 3 11 4 3
2014 6 1 3 2020 12 1 3
6 2 3 12 2 4
6 3 3 12 3 4
6 4 3 12 4 4

Initially, a structured database on the documents was created using Matlab. This step consisted in extracting from the LaTeX files information on the publication year, names of authors, title of each paper, key words, JEL codes and abstracts. Abstracts were cleaned of all mathematical expressions and LaTeX formatting commands. Gathering this information was facilitated by a relatively stable LaTeX template used in the publication process.

In the next step, to form the text corpus(es) appropriate for further probabilistic analysis, the text files had to be suitably prepared. Initial editing was done in two steps. In the first step, the original files with .tex extension were modified to obtain the main body of the text by removing the following elements:

  1. Initial article information including: the author name(s), affiliation(s), e-mail address(es), dates of submitting and accepting the article,

  2. The abstract, keywords and JEL classification codes,

  3. Text appearing in running head (the author name(s) and short title of the paper) and running foot (the author name(s) and information on the volume and issue) of the journal,

  4. Figures and tables,

  5. Formulas, mathematical symbols and Greek letters,

  6. References,

  7. Selected LaTeX commands which prevented compilation after the above alterations of the texts, e.g. those introducing line breaks.

In the second step, PDF files were generated on the basis of the filtered LaTeX files. The PDFs were then transformed to a text format.

Appendix B: Stopwords

English stopwords removed from article texts using the R package tm:

I, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, would, should, could, ought, I’m, you’re, he’s, she’s, it’s, we’re, they’re, I’ve, you’ve, we’ve, they’ve, I’d, you’d, he’d, she’d, we’d, they’d, I’ll, you’ll, he’ll, she’ll, we’ll, they’ll, isn’t, aren’t, wasn’t, weren’t, hasn’t, haven’t, hadn’t, doesn’t, don’t, didn’t, won’t, wouldn’t, shan’t, shouldn’t, can’t, cannot, couldn’t, mustn’t, let’s, that’s, who’s, what’s, here’s, there’s, when’s, where’s, why’s, how’s, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very.

Additional stopwords removed from the texts of articles:

appendix, acknowledgements, introduction.

German stopwords [9] removed from article texts:

a, ab, aber, ach, acht, achte, achten, achter, achtes, ag, alle, allein, allem, allen, aller, allerdings, alles, allgemeinen, als, also, am, an, ander, andere, anderem, anderen, anderer, anderes, anderm, andern, anderr, anders, au, auch, auf, aus, ausser, ausserdem, außer, außerdem, b, bald, bei, beide, beiden, beim, beispiel, bekannt, bereits, besonders, besser, besten, bin, bis, bisher, bist, c, d, d.h, da, dabei, dadurch, dafür, dagegen, daher, dahin, dahinter, damals, damit, danach, daneben, dank, dann, daran, darauf, daraus, darf, darfst, darin, darum, darunter, darüber, das, dasein, daselbst, dass, dasselbe, davon, davor, dazu, dazwischen, daß, dein, deine, deinem, deinen, deiner, deines, dem, dementsprechend, demgegenüber, demgemäss, demgemäß, demselben, demzufolge, den, denen, denn, denselben, der, deren, derer, derjenige, derjenigen, dermassen, dermaßen, derselbe, derselben, des, deshalb, desselben, dessen, deswegen, dich, die, diejenige, diejenigen, dies, diese, dieselbe, dieselben, diesem, diesen, dieser, dieses, dir, doch, dort, drei, drin, dritte, dritten, dritter, drittes, du, durch, durchaus, durfte, durften, dürfen, dürft, e, eben, ebenso, ehrlich, ei, ei, eigen, eigene, eigenen, eigener, eigenes, ein, einander, eine, einem, einen, einer, eines, einig, einige, einigem, einigen, einiger, einiges, einmal, eins, elf, en, ende, endlich, entweder, er, ernst, erst, erste, ersten, erster, erstes, es, etwa, etwas, euch, euer, eure, eurem, euren, eurer, eures, f, folgende, früher, fünf, fünfte, fünften, fünfter, fünftes, für, g, gab, ganz, ganze, ganzen, ganzer, ganzes, gar, gedurft, gegen, gegenüber, gehabt, gehen, geht, gekannt, gekonnt, gemacht, gemocht, gemusst, genug, gerade, gern, gesagt, geschweige, gewesen, gewollt, geworden, gibt, ging, gleich, gott, gross, grosse, grossen, grosser, grosses, groß, große, großen, großer, großes, gut, gute, guter, gutes, h, hab, habe, haben, habt, hast, hat, hatte, hatten, hattest, hattet, heisst, her, heute, hier, hin, hinter, hoch, hätte, hätten, i, ich, ihm, ihn, ihnen, ihr, ihre, ihrem, ihren, ihrer, ihres, im, immer, in, indem, infolgedessen, ins, irgend, ist, j, ja, jahr, jahre, jahren, je, jede, jedem, jeden, jeder, jedermann, jedermanns, jedes, jedoch, jemand, jemandem, jemanden, jene, jenem, jenen, jener, jenes, jetzt, k, kam, kann, kannst, kaum, kein, keine, keinem, keinen, keiner, keines, kleine, kleinen, kleiner, kleines, kommen, kommt, konnte, konnten, kurz, können, könnt, könnte, l, lang, lange, leicht, leide, lieber, los, m, machen, macht, machte, mag, magst, mahn, mal, man, manche, manchem, manchen, mancher, manches, mann, mehr, mein, meine, meinem, meinen, meiner, meines, mensch, menschen, mich, mir, mit, mittel, mochte, mochten, morgen, muss, musst, musste, mussten, muß, mußt, möchte, mögen, möglich, mögt, müssen, müsst, müßt, n, na, nach, nachdem, nahm, natürlich, neben, nein, neue, neuen, neun, neunte, neunten, neunter, neuntes, nicht, nichts, nie, niemand, niemandem, niemanden, noch, nun, nur, o, ob, oben, oder, offen, oft, ohne, ordnung, p, q, r, recht, rechte, rechten, rechter, rechtes, richtig, rund, s, sa, sache, sagt, sagte, sah, satt, schlecht, schluss, schon, sechs, sechste, sechsten, sechster, sechstes, sehr, sei, seid, seien, sein, seine, seinem, seinen, seiner, seines, seit, seitdem, selbst, sich, sie, sieben, siebente, siebenten, siebenter, siebentes, sind, so, solang, solche, solchem, solchen, solcher, solches, soll, sollen, sollst, sollt, sollte, sollten, sondern, sonst, soweit, sowie, später, startseite, 
statt, steht, suche, t, tag, tage, tagen, tat, teil, tel, tritt, trotzdem, tun, u, uhr, um, und, uns, unse, unsem, unsen, unser, unsere, unserer, unses, unter, v, vergangenen, viel, viele, vielem, vielen, vielleicht, vier, vierte, vierten, vierter, viertes, vom, von, vor, w, wahr, wann, war, waren, warst, wart, warum, was, weg, wegen, weil, weit, weiter, weitere, weiteren, weiteres, welche, welchem, welchen, welcher, welches, wem, wen, wenig, wenige, weniger, weniges, wenigstens, wenn, wer, werde, werden, werdet, weshalb, wessen, wie, wieder, wieso, will, willst, wir, wird, wirklich, wirst, wissen, wo, woher, wohin, wohl, wollen, wollt, wollte, wollten, worden, wurde, wurden, während, währenddem, währenddessen, wäre, würde, würden, x, y, z, z.b, zehn, zehnte, zehnten, zehnter, zehntes, zeit, zu, zuerst, zugleich, zum, zunächst, zur, zurück, zusammen, zwanzig, zwar, zwei, zweite, zweiten, zweiter, zweites, zwischen, zwölf, über, überhaupt, übrigens.
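
For illustration only, the following minimal sketch (not the authors' preprocessing code) shows how such a stopword list can be applied to tokenized article texts; the example tokens and the truncated stopword set are placeholders.

```python
# Minimal sketch (not the authors' preprocessing code): removing German
# stopwords from tokenized article texts. The stopword set is a small
# excerpt of the full list above; the example tokens are illustrative.
GERMAN_STOPWORDS = {"aber", "als", "auch", "auf", "aus", "bei", "das", "der",
                    "die", "ein", "eine", "für", "und", "von", "zu"}  # excerpt

def remove_stopwords(tokens, stopwords=GERMAN_STOPWORDS):
    """Keep only tokens that are not contained in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords(["Die", "Geldpolitik", "und", "die", "Inflation"]))
# -> ['Geldpolitik', 'Inflation']
```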

Appendix C: Robustness Check

C.1 Translation

As a robustness check, we translated the German texts from JES into English using the DeepL API. The vocabulary of the joint dataset overlaps to a large extent (about 85%) with the vocabulary of the English subset of the data. For the joint dataset DE JOINT, we identified the optimal number of topics using the proposed sBIC measure (see Figure 15).
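
As an illustration of this translation step, the sketch below uses the official deepl Python client; the authentication key, the example texts, and the absence of any batching or error handling are illustrative assumptions and do not reproduce the pipeline actually used.

```python
# Minimal sketch (not the authors' pipeline): translating German abstracts to
# English with the official `deepl` Python client before re-running the LDA
# step on the joint corpus. The auth key and input texts are placeholders.
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # hypothetical key

german_texts = [
    "Die Geldpolitik der Bundesbank in den achtziger Jahren ...",
    "Arbeitsmarktreformen und ihre Wirkung auf die Beschäftigung ...",
]

# Translate each German text into English; the translated texts can then be
# pooled with the original English articles to form the joint corpus DE JOINT.
translated = [
    translator.translate_text(t, source_lang="DE", target_lang="EN-US").text
    for t in german_texts
]
print(translated[0][:80])
```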

Figure 15: Distribution of the sBIC values for the joint German corpus.

In the next step, we performed standard topic matching in both directions, DE JOINT → DE ENG as well as DE ENG → DE JOINT, to account for the non-symmetry of the matching procedure. Of the 46 identified topic pairs, 34 are the best matches of each other.
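
A minimal sketch of this bidirectional matching step is given below; the randomly generated topic-word matrices and their dimensions merely stand in for the estimated LDA topics of the two corpora, which share an English vocabulary.

```python
# Minimal sketch (not the authors' code) of bidirectional topic matching:
# for two topic-word matrices over a shared vocabulary, find each topic's
# best match in the other corpus and keep only mutual best matches.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
topics_a = rng.dirichlet(np.ones(500), size=46)  # e.g. DE JOINT topics (illustrative)
topics_b = rng.dirichlet(np.ones(500), size=58)  # e.g. DE ENG topics (illustrative)

sim = cosine_similarity(topics_a, topics_b)      # similarity matrix, 46 x 58
best_ab = sim.argmax(axis=1)                     # best match A -> B
best_ba = sim.argmax(axis=0)                     # best match B -> A

# Mutual best matches: topic i in A whose best match j in B points back to i.
mutual = [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]
print(f"{len(mutual)} out of {len(topics_a)} topic pairs are mutual best matches")
```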

Using the proposed standard matching approach, we identified 24 sensible matches between the corpora for both countries, reported in Table 5 in Appendix D. Next, we analysed whether matching PL ENG → DE JOINT and then DE JOINT → DE ENG results in the same topic pairs as directly matching PL ENG → DE ENG. An example of this analysis is shown in Figure 16. The PL ENG Topic 10 was initially assigned to DE ENG Topic 52, both dealing with inflation and monetary policy. Using the DE JOINT LDA model, we find a similar topic that is the best match to both PL ENG Topic 10 and DE ENG Topic 52. For 17 out of the 24 relevant topic pairs, a similar result is obtained.

Figure 16: Topic matching using machine translation.

Therefore, machine translation might, in general, be considered a good alternative when dealing with multilingual corpora. However, the additional costs, the quality of translation for specific corpora, and the “black box” character of machine translation have to be taken into account.

C.2 Sklearn Versus Gensim

To account for possible differences in topic distributions resulting from different LDA implementations, we additionally considered the Python module gensim. For all considered datasets, namely PL ENG, DE ENG, and DE GER, we estimated LDA models using gensim with the number of topics selected according to sBIC. For each dataset, we calculated the following evaluation metrics: perplexity, average topic similarity (Cao et al. 2009), and average topic coherence (Mimno et al. 2011). The results are summarized in Table 4. According to the perplexity and average topic similarity measures, sklearn seems to perform better, while the average coherence scores of the resulting topics are quite similar.
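
The following sketch illustrates how such a comparison can be set up with both libraries; the toy corpus, the number of topics, and the cosine-based average topic similarity (a simplified proxy for the density-based measure of Cao et al. 2009) are illustrative assumptions rather than the actual evaluation code.

```python
# Minimal sketch (not the authors' code): fitting LDA with sklearn and gensim
# on the same toy corpus and computing perplexity plus a simple average
# pairwise topic similarity. All data and settings are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = ["monetary policy and inflation targeting",
        "labour market reforms and unemployment",
        "inflation expectations and central bank policy",
        "unemployment dynamics in the labour market"]
K = 2  # in the paper, the number of topics is chosen via sBIC

# --- sklearn ---
X = CountVectorizer().fit_transform(docs)
sk_lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(X)
sk_topics = sk_lda.components_ / sk_lda.components_.sum(axis=1, keepdims=True)
print("sklearn perplexity:", sk_lda.perplexity(X))

# --- gensim ---
tokenized = [d.split() for d in docs]
dictionary = Dictionary(tokenized)
bow = [dictionary.doc2bow(d) for d in tokenized]
gs_lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=K, random_state=0)
gs_topics = gs_lda.get_topics()  # K x V matrix of topic-word probabilities
print("gensim perplexity:", 2 ** (-gs_lda.log_perplexity(bow)))

# Average pairwise cosine similarity between the topics of one model
# (lower values indicate better separated topics).
def avg_topic_similarity(topic_word):
    sim = cosine_similarity(topic_word)
    return sim[np.triu_indices_from(sim, k=1)].mean()

print("sklearn avg. topic similarity:", avg_topic_similarity(sk_topics))
print("gensim avg. topic similarity:", avg_topic_similarity(gs_topics))
```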

Table 4:

Sklearn versus gensim: model evaluation.

                        PL ENG             DE ENG             DE GER
                        Gensim   Sklearn   Gensim   Sklearn   Gensim    Sklearn
Perplexity              873.4    842.8     998.02   949.4     1697.62   1641.45
Cao et al. (2009)       0.18     0.14      0.11     0.08      0.28      0.2
Mimno et al. (2011)     −0.76    −0.79     −1.04    −0.93     −0.92     −0.95

As we are most interested in topics, for each dataset we compared the topic-word distributions using the proposed standard matching approach. In doing so, we aimed to find out whether topics uncovered using the two different LDA implementations overlap to a large extent or not. We found that most of the topics that are later identified as meaningful matches can be found by means of both implementations (see examples below).

C.3 Similarity Measure

We also performed topic matching using a different similarity measure, the Jensen–Shannon (JS) distance, to see whether the main results and the resulting topic pairs change considerably. Analogously to the procedure presented in the main part of this paper, we first calculated the JS distances between all possible topic matches to derive a suitable cut-off value. Figure 17 presents the distribution of the JS distance values. The lower the distance between two topic vectors, the more similar they are to each other. We used the 0.05% percentile (0.64) as the cut-off value and then removed multiple assignments, keeping only the topic pair with the lowest distance. This resulted in 23 topic pairs. Four out of the 24 assignments were different compared to the results obtained when using cosine similarity. One possible reason is that the DE ENG topic set is larger and contains some quite similar topics, i.e. one PL ENG topic could be a suitable match for more than one of the DE ENG topics. An example is shown in Figure 18, where both DE ENG topics seem to be related to the PL ENG topic. Overall, the use of a different similarity measure does not change the results substantially.
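
A minimal sketch of this alternative matching is given below; the topic sets, the percentile level, and the simplified resolution of multiple assignments (only on the PL side) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): Jensen-Shannon distances between all
# topic pairs and a percentile-based cut-off, analogous to the cosine-based
# matching. Matrix sizes and the percentile level are illustrative.
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
topics_pl = rng.dirichlet(np.ones(500), size=36)   # e.g. PL ENG topics (illustrative)
topics_de = rng.dirichlet(np.ones(500), size=58)   # e.g. DE ENG topics (illustrative)

dist = np.array([[jensenshannon(p, q) for q in topics_de] for p in topics_pl])

cutoff = np.percentile(dist, 5)  # a low percentile: only very similar pairs pass
candidates = np.argwhere(dist <= cutoff)

# Simplified resolution of multiple assignments: for each PL topic, keep only
# the DE topic with the smallest distance among its candidate matches.
pairs = {}
for i, j in candidates:
    if i not in pairs or dist[i, j] < dist[i, pairs[i]]:
        pairs[i] = j
print(len(pairs), "topic pairs below the cut-off")
```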

Figure 17: Distribution of the Jensen–Shannon distances.

Figure 18: Differences in the assignment.

Appendix D: Time Series Analysis

Table 5:

Comparison of filtered weight time series.

PL ENG topic   DE ENG topic   Cosine similarity scorea   Euclidean distance   Correlation coefficient
0 4 0.84617 0.15223 −0.11350
1 15 0.45003 0.22556 −0.47935
2 47 0.58684 0.24628 0.58412
3 13 0.50440 0.07083 0.42358
4 12 0.75001 0.24791 −0.02885
6 36 0.48663 0.07005 −0.39230
10 52 0.69914 0.08590 −0.40108
11 20 0.72227 0.07436 0.60665
13 21 0.80610 0.20421 0.05272
14 33 0.86146 0.10798 0.63348
15 54 0.56860 0.14301 0.85228
16 1 0.88338 0.09773 0.03375
17 57 0.49535 0.13781 0.51459
21 43 0.63251 0.12119 −0.58288
22 46 0.54288 0.12549 −0.81556
23 2 0.83445 0.16063 −0.20020
25 39 0.50876 0.09650 −0.61320
26 37 0.65153 0.11825 −0.24760
27 26 0.85116 0.09879 0.00184
29 30 0.66695 0.30690 −0.44140
31 38 0.52991 0.06821 0.07816
32 53 0.70459 0.14300 −0.55365
34 31 0.52545 0.08092 0.41222
35 28 0.34393 0.10423 −0.59937
aThese values refer to the topics' content and were calculated between the word distributions of the topics, whereas the Euclidean distance and the correlation coefficient were calculated using the resulting topic time series.
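
For a single matched topic pair, the measures reported in Table 5 can be computed as in the following sketch; the word distributions and weight series are random placeholders, and the filtering of the weight series is omitted.

```python
# Minimal sketch (not the authors' code) of the measures in Table 5: cosine
# similarity between the word distributions of a matched topic pair, and
# Euclidean distance / Pearson correlation between their (filtered) annual
# weight time series. All inputs below are illustrative placeholders.
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
word_dist_pl = rng.dirichlet(np.ones(500))   # PL ENG topic-word distribution
word_dist_de = rng.dirichlet(np.ones(500))   # DE ENG topic-word distribution
weights_pl = rng.random(37)                  # topic weights, 1984-2020 (placeholder)
weights_de = rng.random(37)

cos_sim = 1 - cosine(word_dist_pl, word_dist_de)   # content similarity
eucl = np.linalg.norm(weights_pl - weights_de)     # distance between weight series
corr, _ = pearsonr(weights_pl, weights_de)         # co-movement of weight series
print(round(cos_sim, 5), round(eucl, 5), round(corr, 5))
```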

References

Arun, R., Suresh, V., Veni Madhavan, C.E., and Narasimha Murthy, M.N. (2010). On finding the natural number of topics with latent dirichlet allocation: some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B., and Pudi, V. (Eds.), Advances in knowledge discovery and data mining. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 391–402, https://doi.org/10.1007/978-3-642-13657-3_43.

Baker, S.R., Bloom, N., and Davis, S.J. (2016). Measuring economic policy uncertainty. Q. J. Econ. 131: 1593–1636, https://doi.org/10.1093/qje/qjw024.

Bianchi, F., Terragni, S., Hovy, D., Nozza, D., and Fersini, E. (2020). Cross-lingual contextualized topic models with zero-shot learning, arXiv preprint arXiv:2004.07737, https://doi.org/10.18653/v1/2021.eacl-main.143.

Blei, D.M., Ng, A.Y., and Jordan, M.I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3: 993–1022.

Cao, J., Xia, T., Li, J., Zhang, Y., and Tang, S. (2009). A density-based method for adaptive lda model selection. Neurocomputing 72: 1775–1781, https://doi.org/10.1016/j.neucom.2008.06.011.

Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word translation without parallel data. CoRR, abs/1710.04087. Available at: http://arxiv.org/abs/1710.04087.

Debnath, R. and Bardhan, R. (2020). India nudges to contain COVID-19 pandemic: a reactive public policy analysis using machine-learning based topic modelling. PLoS One 15: 1–25, https://doi.org/10.1371/journal.pone.0238972.

Dieng, A.B., Ruiz, F.J., and Blei, D.M. (2020). Topic modeling in embedding spaces. Trans. Assoc. Comput. Ling. 8: 439–453, https://doi.org/10.1162/tacl_a_00325.

Dörr, J.O., Kinne, J., Lenz, D., Licht, G., and Winker, P. (2022). An integrated data framework for policy guidance during the coronavirus pandemic: towards real-time decision support for economic policymakers. PLoS One 17: e0263898, https://doi.org/10.1371/journal.pone.0263898.

Drton, M. and Plummer, M. (2017). A Bayesian information criterion for singular models. J. Roy. Stat. Soc. B 79: 323–380, https://doi.org/10.1111/rssb.12187.

Ellingsen, J., Larsen, V.H., and Thorsrud, L.A. (2022). News media versus FRED-MD for macroeconomic forecasting. J. Appl. Econom. 37: 63–81, https://doi.org/10.1002/jae.2859.

Foltas, A. (2022). Testing investment forecast efficiency with forecasting narratives. J. Econ. Stat. 242: 191–222, https://doi.org/10.1515/jbnst-2020-0027.

Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the eleventh international conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan. Available at: https://www.aclweb.org/anthology/L18-1550.

Hansen, S. and McMahon, M. (2016). Shocking language: understanding the macroeconomic effects of central bank communication. J. Int. Econ. 99: S114–S133, https://doi.org/10.1016/j.jinteco.2015.12.008.

Hayashi, N. (2021). The exact asymptotic form of Bayesian generalization error in latent Dirichlet allocation. Neural Netw. 137: 127–137, https://doi.org/10.1016/j.neunet.2021.01.024.

Hoffman, M., Bach, F.R., and Blei, D.M. (2010). Online learning for latent dirichlet allocation. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (Eds.), Advances in neural information processing systems, 23. Curran Associates, Inc., La Jolla, CA, Red Hook, NY, pp. 856–864.

Hoffman, M.D., Blei, D.M., Wang, C., and Paisley, J.W. (2013). Stochastic variational inference. J. Mach. Learn. Res. 14: 1303–1347.

Jentsch, C., Lee, E.R., and Mammen, E. (2020). Time-dependent Poisson reduced rank models for political text data analysis. Comput. Stat. Data Anal. 142: 106813, https://doi.org/10.1016/j.csda.2019.106813.

Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., and Grave, E. (2018). Loss in translation: learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 conference on empirical methods in natural language processing, Association for Computational Linguistics, Brussels, Belgium, pp. 2979–2984. Available at: https://www.aclweb.org/anthology/D18-1330, https://doi.org/10.18653/v1/D18-1330.

Kalamara, E., Turrell, A., Redl, C., Kapetanios, G., and Kapadia, S. (2020). Making text count: economic forecasting using newspaper text, Bank of England working papers 865, Bank of England. Available at: https://ideas.repec.org/p/boe/boeewp/0865.html, https://doi.org/10.2139/ssrn.3610770.

Krüger, M., Kinne, J., Lenz, D., and Resch, B. (2020). The digital layer: how innovative firms relate on the web, Technical Report No. 20-003, ZEW – Centre for European Economic Research. Available at: https://ssrn.com/abstract=3530807.

Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2: 83–97, https://doi.org/10.1002/nav.3800020109.

Larsen, V.H. and Thorsrud, L.A. (2019). The value of news for economic developments. J. Econom. 210: 203–218, https://doi.org/10.1016/j.jeconom.2018.11.013.

Lenz, D. and Winker, P. (2020). Measuring the diffusion of innovations with paragraph vector topic models. PLoS One 15: e0226685, https://doi.org/10.1371/journal.pone.0226685.

Lucas, C., Nielsen, R.A., Roberts, M.E., Stewart, B.M., Storer, A., and Tingley, D. (2015). Computer-assisted text analysis for comparative politics. Polit. Anal. 23: 254–277, https://doi.org/10.1093/pan/mpu019.

Lüdering, J. and Tillmann, P. (2020). Monetary policy on Twitter and asset prices: evidence from computational text analysis. N. Am. J. Econ. Finance 51: 100875, https://doi.org/10.1016/j.najef.2018.11.004.

Lüdering, J. and Winker, P. (2016). Forward or backward looking? The economic discourse and the observed reality. J. Econ. Stat. 236: 483–515, https://doi.org/10.1515/jbnst-2015-1026.

Maier, D., Baden, C., Stoltenberg, D., Vries-Kedem, M.D., and Waldherr, A. (2022). Machine translation vs. multilingual dictionaries assessing two strategies for the topic modeling of multilingual text collections. Commun. Methods Meas. 16: 19–38, https://doi.org/10.1080/19312458.2021.1955845.

Mamaysky, H. (2021). News and markets in the time of COVID-19. SSRN. Available at: https://ssrn.com/abstract=3565597.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. In: Bengio, Y. and LeCun, Y. (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings. Available at: http://arxiv.org/abs/1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2: 3111–3119.

Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing, Association for Computational Linguistics, Edinburgh, Scotland, UK, pp. 262–272. Available at: https://aclanthology.org/D11-1024.

Shapiro, A.H., Sudhof, M., and Wilson, D.J. (2022). Measuring news sentiment. J. Econom. 228: 221–243, https://doi.org/10.1016/j.jeconom.2020.07.053.

Thorsrud, L.A. (2020). Words are the new numbers: a newsy coincident index of the business cycle. J. Bus. Econ. Stat. 38: 393–409, https://doi.org/10.1080/07350015.2018.1506344.

Watanabe, S. (2009). Algebraic geometry and statistical learning theory, Cambridge monographs on applied and computational mathematics. Cambridge University Press, Cambridge.

Wehrheim, L. (2019). Economic history goes digital: topic modeling the Journal of Economic History. Cliometrica 13: 83–125, https://doi.org/10.1007/s11698-018-0171-7.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/jbnst-2022-0024).


Received: 2022-04-26
Accepted: 2022-09-26
Published Online: 2022-11-01
Published in Print: 2022-08-26

© 2022 Walter de Gruyter GmbH, Berlin/Boston
