Abstract
Textual data have gained relevance as a novel source of information for applied economic research. When considering longer periods or international comparisons, different text corpora often have to be used and combined for the analysis. A methods pipeline is presented for identifying topics in different corpora, matching these topics across corpora and comparing the resulting time series of topic importance. The relative importance of topics over time in a text corpus is used as an additional indicator in econometric models and for forecasting, as well as for identifying changing foci of economic studies. The methods pipeline is illustrated using scientific publications from Poland and Germany in English and German for the period 1984–2020. As methodological contributions, a novel tool for data-based model selection, sBIC, is implemented, and approaches for mapping topics across different corpora (including different languages) are presented.
1 Introduction
Textual data have gained relevance as a novel source of information for applied economic research. Examples include text-based indicators of economic uncertainty (Baker et al. 2016) and economic or political sentiment (Jentsch et al. 2020; Shapiro et al. 2022), the analysis of central bank communication (Hansen and McMahon 2016; Lüdering and Tillmann 2020), using textual information for now- and forecasting (Ellingsen et al. 2022; Foltas 2022; Kalamara et al. 2020; Larsen and Thorsrud 2019; Thorsrud 2020) or describing the diffusion of innovations (Lenz and Winker 2020) and innovation clusters (Krüger et al. 2020), and the link between real economic developments and scientific publications in economics (Lüdering and Winker 2016; Wehrheim 2019). Textual data have also been used during the Covid-19 pandemic for policy and impact analysis, e.g., by Debnath and Bardhan (2020), Mamaysky (2021), and Dörr et al. (2022).
In most of these applications, the main interest lies in temporal patterns in the textual information and how these relate to other developments over time. However, a comparison of the content of different text corpora might also be of interest, e.g., when comparing sentiments in different countries or topics of economic research present in different scientific journals. When longer periods are considered, it might also become necessary to merge the information content of various textual data sources available for different subperiods. In such settings, it is imperative to identify those topics which are common or at least similar across the corpora, and their evolution over time. Obviously, there is no guarantee for finding matching topics ex post. Therefore, it is also relevant to identify those topics which are specific to certain corpora only. Although such analyses are urgently needed, there is no consensus yet on which methods are most appropriate for specific applications.
In this paper, we extend the existing literature on topic modelling by proposing methods for comparing (matching) topics identified for various corpora. This approach also includes a data-based criterion for assessing the quality of a match, which eventually serves to identify real matches. Additionally, we suggest selecting the number of topics in each corpus based on the singular Bayesian information criterion (sBIC), which appears more robust than other tools commonly used for this model selection step.
The working of the methods is presented using economic text corpora in two languages. However, the proposed tools are generic and can also be applied beyond economic research. For example, the results described in this paper might be of interest to political and communication scientists, who often need to consider multilingual textual sources (see e.g. Lucas et al. 2015; Maier et al. 2022).
The presented methods pipeline is not meant to be the only viable approach for such analyses, but rather a basic setting building mostly on established procedures. Although the approach is kept simple, it still involves a number of steps which require choosing several parameters. We will stick as far as possible to standard parameter values and discuss some central aspects in more detail, in particular those related to choosing the number of topics in a corpus and the threshold values for defining a meaningful match of topics across corpora.
As illustration of the application of the proposed methods pipeline, we consider two corpora of scientific publications in economics over the period 1984–2020, one published in Germany and one in Poland. These datasets and the basic methods pipeline building on latent Dirichlet allocation (LDA) (Blei et al. 2003) for topic modelling are introduced in Section 2. Methodological extensions are presented in Section 3, in particular using a singular Bayesian information criterion for selecting the number of topics in Subsection 3.1 and the approaches for matching topics across corpora in Subsections 3.2 and 3.3. The results of the application to scientific publications are provided in Section 4. Section 5 summarizes the findings and provides an outlook on further analyses.
2 Data and Methods
2.1 Topic Modelling and Corpora Comparisons
In this section, the text corpora are described. Figure 1 summarizes our research methodology.
Textual data for the current application consist of scientific articles published in Germany, in the Journal of Economics and Statistics (JES), and in Poland, in the proceedings of the Macromodels International Conference (MM) and the Central European Journal of Economic Modelling and Econometrics (CEJEME). A detailed description of the data sources is provided in Section 2.2. After the documents from the different sources were collected and prepared, the following common text preprocessing steps were applied:^{[1]}
Removing all punctuation marks, special characters and numbers.
Following Lüdering and Winker (2016), who also applied topic modelling to data from JES, we also decided to remove words that contain fewer than 3 or more than 20 characters in order to capture further stop words and reduce the vocabulary size.
Removing English and German stop words (see Appendix B).
Removing especially rare or common words: all words occurring in fewer than 2.5% or more than 75% of the articles in the text corpus were removed. We prefer relative thresholds to absolute ones, as suitable absolute values would depend on the size of the corpus and the vocabulary. Therefore, using relative thresholds appears more appropriate for establishing a standardized pipeline.
Lemmatizing of texts, i.e., grouping inflected forms together as a single base form.
Removing words with certain part-of-speech (PoS) tags, such as determiners, adpositions, conjunctions and pronouns, to the extent that they are not already contained in the usually rather short lists of stop words.
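As an illustration, the filtering steps above can be sketched in plain Python. This is a minimal sketch, not the pipeline actually used for the paper: lemmatization and PoS-tag filtering are omitted (in practice they require an NLP library such as spaCy), and the function name and its defaults are our own.

```python
import re
from collections import Counter

def preprocess(docs, stop_words, min_df=0.025, max_df=0.75):
    """Apply the basic filtering steps: keep alphabetic tokens only,
    filter by token length (3-20 characters), drop stop words and
    remove words that are especially rare or common across documents.

    Lemmatization and PoS filtering are omitted in this sketch.
    """
    token_docs = []
    for doc in docs:
        # alphabetic tokens only: drops punctuation, digits, symbols
        tokens = re.findall(r"[a-zäöüß]+", doc.lower())
        tokens = [t for t in tokens
                  if 3 <= len(t) <= 20 and t not in stop_words]
        token_docs.append(tokens)
    # relative document-frequency filter (share of documents containing w)
    n = len(token_docs)
    df = Counter(w for toks in token_docs for w in set(toks))
    keep = {w for w, c in df.items() if min_df <= c / n <= max_df}
    return [[t for t in toks if t in keep] for toks in token_docs]
```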
The resulting Bag-of-Words (BoW) representations of the documents were further used to train LDA models for the Polish and German data sets. LDA is one of the best-known and most widely used topic modelling approaches (Blei et al. 2003). It is based on the assumption that each document in a corpus is a distribution over some latent topics and each topic is a distribution over a fixed corpus vocabulary. Therefore, the algorithm behind the LDA approach aims to identify this hidden structure and to uncover the underlying latent topics in a corpus.
For the German corpus, two different LDA models were trained for the two subsets in English and German. Running LDA on the combined corpus would result in two sets of language-specific topics, even though some of them might cover the same semantic content. Therefore, separate modelling with post hoc matching appears preferable. As a robustness check, we also analysed a joint corpus, for which all German texts were translated to English using DeepL API Pro (see Appendix C.1, pp. 32–33).^{[2]} This robustness check using machine translation revealed that most of the topics from the joint German dataset can also be found among the topics uncovered in the English subset of the data. Furthermore, we show that using the joint LDA model for the identification of relevant matches between the two countries does not impact the results substantially. For the Polish corpus, only one LDA model was estimated. The models were trained using Python’s sklearn module. The implementation in sklearn follows Hoffman et al. (2010) and Hoffman et al. (2013) and provides a method for estimating LDA models based on the online variational Bayes algorithm. Except for the number of topics and the number of iterations in the training process, all parameters were kept at their default values. Since there are several Python modules that implement the LDA algorithm, we also perform the analysis using gensim, another popular module for LDA topic modelling. In doing so, we aim to account for possible differences resulting from different LDA implementations and show that the qualitative findings of the current work do not change. The results of this robustness check are described in Appendix C.2, p. 35.
The choice of an optimal number of topics in LDA models remains a challenge in applied research. Although there are several criteria for selecting the optimal number of topics, the ultimate choice is often based on human judgement concerning the interpretability of selected topics. In the current work, we aim to avoid the subjectivity of topic selection and use sBIC to determine the optimal number of topics for each of the text corpora. We further discuss the interpretability of the topics selected by sBIC. To our knowledge, this is the first application of sBIC to LDA modelling, and so we provide more methodological and practical details in Section 3.1.
In the topic modelling stage, we obtained three different sets of topics corresponding to two countries and two languages. To distinguish these sets, we will use the following notation: PL^{ENG}, DE^{ENG} and DE^{GER}, where PL and DE indicate the country of publication of a corpus, i.e., Poland or Germany, while the superscripts ENG and GER indicate the language of publication.
We first focus on the matching between DE^{ENG} and PL^{ENG} topics. The matching of two topic sets from different LDA models can be done based on topic-word frequency vectors. Thereby, the distributions of topics over the vocabulary words are compared. However, since this standard approach to topic matching assumes that the vocabularies are in the same language, it cannot be applied to LDA models trained on corpora in different languages. For this reason, we also propose an embedding-based approach to topic matching for comparing topic sets in different languages. The frequency-vector-based and the embedding-based matching approaches are described in Sections 3.2 and 3.3, respectively.
In the final step, we qualitatively analyse the resulting topic matches and define thresholds for matches expected to be meaningful in a colloquial sense. We also construct topic time series based on the topic weights for each of the topic sets and descriptively analyse the time series trends of the matched topics.
2.2 Textual Data for Germany and Poland
The illustration of the methods pipeline is based on corpora of scientific articles published in Germany and Poland. Given the interest in comparing trends of research topics over time, it is important that both corpora cover a long period. For our application, the overlap of both corpora is from 1984 to 2020. While the time span of the sample is rather long, the number of documents per year is substantially smaller than in other applications covering recent years. Furthermore, scientific articles in economics are more focused than general interest documents. Therefore, the number of distinct topics to be expected in these corpora is rather small. Nevertheless, the example might well illustrate the general procedure of cross-corpora topic and topic trend comparison outlined in Subsection 2.1.
2.2.1 German Text Collection
The German textual data consist of articles published in the Journal of Economics and Statistics. The Journal has been published since 1863 and contains articles that cover topics from economics, with a focus on empirical economics and applied statistics. During the sample period 1984–2020, publications have been either in German or in English. The distribution of the articles’ languages is presented in Figure 2.^{[3]}
The volumes of the journal published annually between 1984 and 2020 comprise usually 4–6 issues.^{[4]} The details of data collection and further steps of preparation are described in Appendix A. The corpus used for the application comprises 903 articles in German and 704 articles in English.
2.2.2 Polish Text Collection
Two sources of textual data for Poland were considered. Firstly, proceedings of the Macromodels International Conference (MM) and joint meetings were used providing textual data for the years 1984–2011. Secondly, papers published in the Central European Journal of Economic Modelling and Econometrics (CEJEME) in the period 2009–2020 were analysed.
The Macromodels International Conference has been organised in Poland every year since 1974. The printed materials analyzed in this article also included papers presented at the meetings held jointly with MM, such as Econometric Modelling and Forecasting Socialist Economies (Models & Forecasts, MF), the Multivariate Statistical Analysis (MSA) conference and the Association for Modelling and Forecasting Economies in Transition (AMFET) meetings. As indicated on the Macromodels’ webpage (www.macromodels.uni.lodz.pl), the aim of the conference is to “bring together scientists who work in the field of econometric modelling […]. Within the scope of interest are issues such as the problems of estimation, simulation, developing econometric models and their use for policy analyses. Recently, a special attention has been given to modelling economies of new EU member countries”. Conference materials printed as books are available from 1984 up to the year 2011. Altogether, 41 conference volumes comprising a total of 514 articles were used in the analysis. The language of the conference is English. After several preprocessing steps that are described in more detail in Appendix A, a structured database with bibliographic information for the articles was created in Python. The data include information on the year of publication, names of the authors, title of the paper, abstract and the main text.
The article collection from the Central European Journal of Economic Modelling and Econometrics includes 145 scientific articles which appeared in 46 issues of the journal, starting with the first issue from 2009 (January 2009) and ending with the fourth issue from the year 2020 (April 2020). As indicated by the aims and scope of CEJEME, the papers are focused on the theory and applications of mathematical and statistical models in economic sciences. All articles are in English. Detailed information on the preparation of the data from CEJEME publications can be found in Appendix A. The Polish data set used for the application consists of the main texts of the articles (without abstracts) from MM and CEJEME.
3 Methodological Advances
This section first describes the proposed information criterion for determining the optimal number of topics in a corpus. Then, we present a general topic matching approach for LDA models trained on different corpora in the same language. As the German corpus consists of texts in English and German, we also propose a further topic comparison approach based on multilingual word representations that can be applied to LDA models trained on corpora in different languages.
3.1 Topic Number Selection Based on Singular Bayesian Information Criterion
Several criteria, which are often used for selecting the number of topics in LDA modelling, are based on specific semantic properties of selected topics, such as similarity and coherence (see Cao et al. 2009; Mimno et al. 2011). The number of topics selected by these criteria frequently differs considerably and the final choice is based on human judgement concerning interpretability of selected topics. In this paper, the number of topics is chosen using an information criterion that does not directly quantify any semantic property of topics, but balances goodness-of-fit and model complexity. The model selection procedure based on the information criterion does not rely on topic interpretability, but chooses the optimal number of topics that can be used for inference. Nevertheless, topics selected by the information criterion are expected to be interpretable for a text corpus generated by an LDA model.
The implementation of information criteria for topic number selection in the LDA analysis is complicated because it is based on a singular statistical model: the Fisher information matrix is not positive definite. The usual BIC cannot be implemented for evaluation of singular models as the penalty for model complexity used in the BIC is too large for singular models: too few topics would be selected in the LDA modelling if the regular BIC was used.
Drton and Plummer (2017) proposed a model selection criterion, called singular BIC (sBIC), that uses Bayesian model averaging and a smaller penalty than the one used in the regular BIC. Hayashi (2021) derived the asymptotic learning coefficient for LDA that can be used for evaluating the penalty in sBIC. In this paper, the model averaging method proposed by Drton and Plummer (2017) and the asymptotic learning coefficient derived in Hayashi (2021) are combined in order to implement sBIC for the selection of the number of topics in LDA modelling. As this is a novel application of sBIC, essential theoretical and practical details of the procedure are briefly described below.
In order to present essential details of sBIC, let us consider a document corpus W consisting of n documents, and let M_{h} denote an LDA model with h topics, h = H_{min}, …, H_{max}. The marginal likelihood of corpus W under model M_{H} is

L(M_{H}) = ∫ p(W | θ_{H}, M_{H}) p(θ_{H} | M_{H}) dθ_{H},   (3.1)

where θ_{H} collects the parameters of the model, i.e., the topic-word and document-topic distributions, and p(θ_{H} | M_{H}) denotes their prior. For singular models such as LDA, the log marginal likelihood admits an asymptotic expansion of the form

log L(M_{H}) = log p(W | θ̂_{H}, M_{H}) − λ_{H} log n + (m_{H} − 1) log log n + O_{p}(1),

where θ̂_{H} is the maximum likelihood estimate, λ_{H} is the learning coefficient of the model and m_{H} its multiplicity; for LDA, the learning coefficients are derived in Hayashi (2021).

An approximation of the marginal likelihood for a model with H topics, based on averaging of submodels with number of topics h ≤ H, is obtained as the solution L′(M_{H}) of the system of equations

Σ_{h ≤ H} [L′(M_{H}) − L_{Hh}] L′(M_{h}) P(h) = 0,   (3.2)

where P(h) is the prior of a model with h topics, L′(M_{h}) are the corresponding approximations for the submodels, and the constants

L_{Hh} = p(W | θ̂_{H}, M_{H}) n^{−λ_{Hh}} (log n)^{m_{Hh} − 1}

combine the maximized likelihood with the learning coefficient λ_{Hh} and its multiplicity m_{Hh} for the pair of models (M_{H}, M_{h}). Following Drton and Plummer (2017), we take the unique positive solution of (3.2) and define the singular Bayesian information criterion for a model with H topics as

sBIC(M_{H}) = log L′(M_{H}),

where L′(M_{H}) is the positive root of a quadratic equation in one unknown,

P(H) x^{2} + [Σ_{h < H} P(h) L′(M_{h}) − P(H) L_{HH}] x − Σ_{h < H} P(h) L_{Hh} L′(M_{h}) = 0,

that can be found recursively with L′(M_{H_{min}}) = L_{H_{min}H_{min}} as the starting value, solving for one additional topic number at a time.
For an empirical implementation of sBIC in the LDA modelling approach, it is essential to use high-precision computations in the case of large-scale data sets. Since the likelihood function of an LDA model is a product of a large number of word probabilities, it usually takes extremely small positive values. Correspondingly, the log-likelihood function takes negative values of extremely large modulus. To avoid exponent underflow and overflow in floating-point computations, the limits allowed for the exponents of floating-point numbers have to be sufficiently wide. The computational precision (the size of the fractional part of floating-point numbers) has to be sufficiently high for solving the system of quadratic equations (3.2), which represents the bottleneck of the sBIC algorithm.
The selection of an appropriate precision needed to avoid rounding errors might depend on a particular dataset. However, as compared to the estimation time of LDA models the additional time needed for highprecision computations in the sBIC algorithm is not substantial.
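The recursion can be implemented with Python's standard decimal module, which allows setting both the precision and very wide exponent limits. The sketch below reflects our reading of the recursive solution in Drton and Plummer (2017); the input log_L is a hypothetical array of log L_{Hh} values (maximized log-likelihood minus the learning-coefficient penalty), and the function is not the implementation used for the results in this paper.

```python
from decimal import Decimal, getcontext

def sbic(log_L, priors=None, prec=80):
    """Recursive sBIC computation in high-precision arithmetic.

    log_L[i][j] holds log L_ij for j <= i, where L_ij is the BIC-type
    approximation of submodel j within model i. priors, if given, must
    be a list of Decimal model priors. Returns log L'(M_i) for all i.
    """
    getcontext().prec = prec
    n = len(log_L)
    P = priors if priors is not None else [Decimal(1) / n] * n
    Lp = []                                      # L'(M_j) values found so far
    for i in range(n):
        # exponentiate in high precision to avoid underflow
        Lij = [Decimal(str(v)).exp() for v in log_L[i]]
        if i == 0:
            Lp.append(Lij[0])                    # starting value L'(M_1) = L_11
            continue
        # positive root of the quadratic equation in L'(M_i)
        a = P[i]
        b = sum(P[j] * Lp[j] for j in range(i)) - P[i] * Lij[i]
        c = -sum(P[j] * Lij[j] * Lp[j] for j in range(i))
        Lp.append((-b + (b * b - 4 * a * c).sqrt()) / (2 * a))
    return [float(v.ln()) for v in Lp]
```

Because all arithmetic is done in Decimal, log-likelihoods of magnitude far beyond the double-precision exponent range (roughly ±700 on the natural-log scale) remain representable.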
3.2 Topic Matching
The outcome of an LDA model is a matrix containing the probabilities of occurrence of each word in each topic. Therefore, a standard and intuitive way to compare two LDA models, or the hidden structures behind the data, is to compare the distributions of topics over the vocabulary words. Each topic can be represented as a vector with length equal to the vocabulary size.
For the comparison, the topic vectors from different LDA models should have the same length. However, it is quite improbable that the vocabularies of different corpora are exactly the same. There are two possibilities to create topic-word frequency vectors of the same length. First, one of the vocabularies can be considered as the base vocabulary; if some of its words are missing in the other vocabulary, the corresponding probabilities are set to zero. Alternatively, one can use only the intersection of the vocabularies of the considered corpora, i.e., only the words that occur in both corpora. In the current work, we use the second solution, as only minor differences have been observed when comparing both approaches. In general, this choice bears the risk that matched topics can still differ substantially with regard to the non-overlapping parts of the vocabularies. Thus, in particular for less homogeneous corpora than considered in our application, one might also consider matching based on the union of the vocabularies.
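A sketch of the intersection-vocabulary alignment, assuming phi_a and phi_b are the topic-word matrices returned by two LDA models; the function and variable names are illustrative.

```python
import numpy as np

def align_topics(phi_a, vocab_a, phi_b, vocab_b):
    """Restrict two topic-word matrices to the intersection of their
    vocabularies and renormalize each topic to a proper distribution."""
    pos_a = {w: i for i, w in enumerate(vocab_a)}
    pos_b = {w: i for i, w in enumerate(vocab_b)}
    shared = sorted(set(vocab_a) & set(vocab_b))   # common vocabulary
    A = phi_a[:, [pos_a[w] for w in shared]]
    B = phi_b[:, [pos_b[w] for w in shared]]
    return (A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True), shared)
```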
In the next step, the similarities of the topic vectors are calculated. To this end, we consider two similarity measures:
Jensen–Shannon divergence (JSD), which is closely related to the Kullback–Leibler divergence (KLD), measures the similarity between two probability distributions or, in the current case, two word-topic distributions. The Jensen–Shannon divergence between two probability distributions P and Q is calculated as follows:

JSD(P ‖ Q) = (1/2) KLD(P ‖ M) + (1/2) KLD(Q ‖ M),

where M = (1/2)(P + Q) denotes the mixture of the two distributions and KLD(P ‖ M) = Σ_{i} P_{i} log(P_{i}/M_{i}). A value of zero indicates identical distributions, while larger values indicate greater dissimilarity.
Cosine similarity is an alternative measure of similarity of two vectors and is often used when working with textual data. Cosine similarity is the cosine of the angle between the two vectors. For example, a cosine similarity of 1 implies that two vectors have the same orientation in the corresponding vector space.
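The two similarity measures above can be computed in a few lines of NumPy. In the JSD sketch below, the small eps is an implementation convenience to guard against log(0), not part of the definition.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence of two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)                               # mixture distribution
    kld = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def cos_sim(p, q):
    """Cosine similarity: cosine of the angle between two vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
```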
The final step is the actual matching of the topics. Again, there are two alternative approaches to matching that can be applied to obtain topic pairs. The first one is so-called one-to-one matching using the Hungarian algorithm (Kuhn 1955). The Hungarian algorithm is an optimization algorithm that, given a cost matrix containing the assignment costs between the topics of two LDA models, aims to find an optimal assignment of rows to columns with minimal costs. It can also be applied if the numbers of topics of the two LDA models differ; in this case, some of the topics of the larger LDA model remain unmatched. One-to-one matching can be applied, for example, when the two corpora are expected to cover the same set of topics. When implementing one-to-one matching, it is advisable to use distance measures such as the Jensen–Shannon divergence or the cosine distance, defined as 1 − cosine similarity, as the cost, since the Hungarian algorithm is formulated as a minimization problem.
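A sketch of one-to-one matching with SciPy's linear_sum_assignment, which solves the assignment problem (including rectangular cost matrices, leaving surplus topics unmatched); cosine distance serves as the cost here, and the function name is our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_match(topics_a, topics_b):
    """One-to-one matching of topic vectors via the assignment problem,
    using cosine distance (1 - cosine similarity) as the cost."""
    A = topics_a / np.linalg.norm(topics_a, axis=1, keepdims=True)
    B = topics_b / np.linalg.norm(topics_b, axis=1, keepdims=True)
    cost = 1.0 - A @ B.T                         # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)     # minimal-cost assignment
    return list(zip(rows.tolist(), cols.tolist())), cost[rows, cols]
```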
The second option is best matching using the nearest neighbours approach, i.e., for each topic in a corpus, its nearest neighbour in the other corpus is chosen as its match. Hereby, the topics can be assigned multiple times. Best matching is a better choice when the thematic focus of the corpora to be compared is quite different and it is unclear whether each topic in one corpus can find a meaningful match in the other one.
However, given that each topic is assigned its nearest neighbour independently of the corresponding minimum distance, there is no guarantee that all of the identified best matches actually correspond to a match in the colloquial sense. For this reason, a cutoff value has to be set in order to select only topic pairs sharing a sufficiently high similarity. At this point, it is important to mention that best matching is a non-symmetric process. For example, if for the German Topic b the Polish Topic a is the nearest neighbour in the Polish topic set (direction Germany → Poland), this does not necessarily imply that for the Polish Topic a the German Topic b is the nearest neighbour in the German topic set (direction Poland → Germany). To account for this non-symmetry, it is advisable to check the topic assignments in both directions.
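A sketch of best matching with a cutoff, checking both directions as recommended above; sim is assumed to be a precomputed matrix of similarities between the topics of corpus A (rows) and corpus B (columns).

```python
import numpy as np

def best_matches(sim, cutoff):
    """Nearest-neighbour matching in both directions with a cutoff.

    Returns the A->B matches, the B->A matches and the pairs that are
    mutual nearest neighbours (a symmetric, stricter notion of match)."""
    a_to_b = [(i, int(np.argmax(row))) for i, row in enumerate(sim)
              if row.max() >= cutoff]
    b_to_a = [(int(np.argmax(col)), j) for j, col in enumerate(sim.T)
              if col.max() >= cutoff]
    mutual = sorted(set(a_to_b) & set(b_to_a))
    return a_to_b, b_to_a, mutual
```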
In the current application, we use the cosine similarity measure to evaluate the topic similarity and perform best matching. We set the cutoff value based on the distribution of the cosine similarity values between all possible topic pairs. Subsequently, we also perform the matching using Jensen–Shannon distance as a robustness check.
3.3 EmbeddingBased Matching
The standard matching described in the previous subsection is restricted to the comparison of models in the same language. To enable multilanguage analyses, we propose a further approach that uses so-called word embeddings. These word vector representations have attracted a lot of attention in recent years and are widely used in different applications, also beyond the natural language processing field. One of the most important characteristics of such word embeddings is the interpretability of the distances between them: semantically similar words tend to be close to each other in the shared vector space. For more details on how word embeddings are trained, see Mikolov et al. (2013a, 2013b). Recently, such word embeddings have also been used in the context of topic modelling. For example, Dieng et al. (2020) introduce the embedded topic model (ETM), where each word and each topic in a corpus are represented in the same embedding space. The authors claim that the proposed approach addresses a drawback of the classical LDA model, namely dealing with large vocabularies. Empirically, it is shown that the method leads to better results than other approaches, including classical LDA, as measured by the coherence criterion introduced by Mimno et al. (2011). However, it is not discussed how ETM performs in a multilingual context when a dataset consists of texts in different languages, as in the current case. Since in the proposed approach word and topic embeddings are trained based on the underlying texts and word co-occurrences, applying it to a multilingual corpus would probably result in an embedding space that contains multiple subspaces related to the languages contained in the corpus. Therefore, we decided not to consider ETM further for our analysis.
Bianchi et al. (2020) address exactly this problem and develop a language-agnostic approach to topic modelling: Multilingual Contextualized Topic Modelling (MCTM). The approach builds on document representations from SBERT, a Transformer-based technique for language modelling. Its main advantage is that a model can be trained on one corpus, and topic distributions for documents in unseen languages can be inferred just based on the multilingual vector representations. In the current case, we could apply MCTM and train the model, for example, on the Polish dataset and infer topic distributions for the documents from the German dataset. In doing so, we would, however, restrict ourselves to the topics in the Polish dataset only; latent topics that are specific to the German dataset would be missing. Therefore, we decided to stick to our embedding-based matching approach, which is described in more detail in the following.
In the last few years, many pretrained word vectors have been released. For example, the fastText^{[5]} library provides pretrained word embeddings for over 150 different languages (Grave et al. 2018; Joulin et al. 2018). Many attempts have also been made to train multilingual word embeddings, i.e., a shared vector space for multiple languages. For example, Conneau et al. (2017) introduce both supervised and unsupervised approaches to learning crosslingual word embeddings. The authors provide multilingual embeddings for 30 languages based on fastText monolingual word vectors.^{[6]} These multilingual embeddings can be used to obtain language-independent topic representations; each topic can then be represented as a vector in the shared multilingual vector space using the word embeddings of its most frequent words (see options 1–3 below). We consider the following options for obtaining multilingual topic vector representations:
Represent a topic as the sum vector of n word vectors in the embedding space corresponding to its n most frequent words.
Represent a topic vector as the weighted sum of n word vectors corresponding to its n most frequent words. The weights are given by the estimated LDA models and represent the probabilities of each word occurring in a certain topic. Thereby, rescale the original probabilities given by the LDA output depending on the number of words considered.
Represent a topic vector as the weighted sum of all the vocabulary word vectors, i.e., the word embeddings of all the vocabulary words vectors multiplied by the probabilities of occurring given by the LDA output.
“Translate” the words of one model into the language of the other model using word embeddings. For example, for each word in the English corpus vocabulary, search for the first nearest neighbour in German language and use the corresponding word as the translation of the English word. Afterwards, apply the standard matching approach described previously.
Further steps, i.e., calculating the similarity and applying one of the matching types, are performed analogously to the standard matching approach. In the current work, we use the first option and represent the topics as the unweighted sum vectors of the 100 word vectors corresponding to their 100 most frequent words. While preliminary analysis indicated no qualitative differences in the results for our analysis, an in-depth comparison of the performance of the different alternatives is left for future research.
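Option 1 can be sketched as follows. Here word_vectors stands for a hypothetical word-to-vector lookup into a shared multilingual embedding space (e.g. the aligned fastText vectors mentioned above); loading such embeddings is outside the scope of this sketch.

```python
import numpy as np

def topic_embedding(phi_row, vocab, word_vectors, n_top=100):
    """Option 1: represent a topic as the unweighted sum of the embedding
    vectors of its n_top most frequent words.

    phi_row is one row of the topic-word matrix; words without an entry
    in word_vectors (out-of-vocabulary words) are simply skipped."""
    top = np.argsort(phi_row)[::-1][:n_top]      # indices of most frequent words
    vecs = [word_vectors[vocab[i]] for i in top if vocab[i] in word_vectors]
    return np.sum(vecs, axis=0)
```

Since the resulting topic vectors live in one shared space regardless of language, they can be passed directly to the similarity and matching steps described in Section 3.2.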
3.4 Topic Trends Comparison
The methods described above aim at identifying similar topics based on their textual content. A further aspect of interest is the development of these topics over time, namely the relative importance of the identified matched topics in their corpora at certain points in time. For this comparison, we construct topic weight time series and examine whether the identified topic matches exhibit similar dynamics over the considered period of time.
As described above, for each document in a corpus, the estimated LDA models provide the probability of each topic occurring in this document, i.e., each document is represented as a vector whose length equals the selected number of topics and whose entries sum up to one. To construct topic time series, these probabilities are aggregated over all documents published in a given year and averaged on an annual basis.
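The annual aggregation can be sketched as follows, assuming doc_topics is the document-topic matrix from an estimated LDA model and years holds the publication year of each document.

```python
import numpy as np

def topic_time_series(doc_topics, years):
    """Average the document-topic probabilities over all documents
    published in the same year, yielding one weight series per topic."""
    doc_topics = np.asarray(doc_topics, float)
    years = np.asarray(years)
    uniq = np.unique(years)
    series = np.vstack([doc_topics[years == y].mean(axis=0) for y in uniq])
    return uniq, series  # series[t, k]: mean weight of topic k in year uniq[t]
```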
To construct time series for topic matches identified within the German corpus for the two different languages considered, the average of the individual topic time series was calculated. If one of the values was missing in one of the time series, this value was replaced with the value from the second time series. In doing so, we were able to provide longer time series for the DE ^{ ENG } and DE ^{ GER } matches, as the share of German articles in the German corpus was substantially higher until the early 2000s.
In order to describe the similarity between the time series of the matched topics, we perform two steps. First, given that the trajectories are quite ragged due to the limited number of texts per year, we smooth the series using a two-sided filter to ease visual inspection. Second, we evaluate the correlation coefficient and compute the Euclidean distance for the pairs of filtered series.
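Both steps can be sketched with a simple centred moving average standing in for the two-sided filter; the window length is illustrative, as the exact filter specification is not reproduced here.

```python
import numpy as np

def compare_series(x, y, window=5):
    """Smooth two topic-weight series with a centred moving average
    (a simple two-sided filter), then compare the filtered series via
    the correlation coefficient and the Euclidean distance."""
    kernel = np.ones(window) / window
    xs = np.convolve(x, kernel, mode="valid")   # "valid" avoids edge effects
    ys = np.convolve(y, kernel, mode="valid")
    corr = np.corrcoef(xs, ys)[0, 1]
    dist = np.linalg.norm(xs - ys)
    return corr, dist
```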
4 Results
4.1 Number of Topics
As described in Section 2, the first step of the analysis consisted in identifying the optimal number of topics for each of the text corpora. The number of topics was selected by maximizing the singular BIC with a minimal number of topics set to H_{min} = 10 and a maximal number of topics set to H_{max} = 100. These boundaries were set based on the assessment of the variety of topics in the scientific publications considered. The models with different numbers of topics in the predefined range were assumed to have the same priors, i.e., a uniform prior P(h) = 1/(H_{max} − H_{min} + 1) for h = H_{min}, …, H_{max}.
Using the model selection procedure based on the sBIC, we identified 37 topics for the Polish data set, 20 topics for the DE ^{ GER } data set and 60 topics for the DE ^{ ENG } data set. The sBIC values for Poland and Germany depending on the number of topics are shown in Figures 3 and 4, respectively.^{[7]} The red dashed lines indicate the selected number of topics for each corpus. For the DE ^{ ENG } data set, maximizing the sBIC would lead to 74 topics. However, it can be seen in Figure 4b that the shape of the curve of sBIC values in the interval from 55 to 75 topics is almost flat. For this reason and due to a rather small data set consisting of 704 articles, we decided to consider 60 topics for this corpus.
For comparison, we also applied some of the techniques commonly used in the literature to choose the optimal number of topics. We used the Python module tmtoolkit and calculated the following evaluation metrics available in this module: Arun et al. (2010), Cao et al. (2009), perplexity, and coherence (Mimno et al. 2011). For our application, however, none of these metrics provides a clear indication of an optimal number of topics (see Figure 5 for the German corpora). In fact, the first two criteria always seem to suggest the largest number of topics, while the coherence criterion appears to favour a very small number of topics. Only perplexity suggests an interior solution for the smaller German corpus. The results for the Polish corpus are also inconclusive. Therefore, we stick to the novel sBIC measure, which has a strong theoretical foundation.
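As an illustration of such a grid comparison, the sketch below evaluates a small range of topic numbers with scikit-learn's LDA implementation and its built-in perplexity; sBIC itself is not available in standard libraries, and the four-document corpus is a toy example only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["inflation rises with money supply",
        "unemployment falls as output grows",
        "money supply affects inflation expectations",
        "output growth reduces unemployment rates"]
X = CountVectorizer().fit_transform(docs)

# fit one model per candidate number of topics and record its perplexity
# (lower is better; in the paper only sBIC produced a clear optimum)
scores = {}
for h in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=h, random_state=0).fit(X)
    scores[h] = lda.perplexity(X)
best_h = min(scores, key=scores.get)
```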
All topics identified in the LDA models selected by sBIC for both the German and Polish corpora are interpretable, i.e., by visual inspection of word clouds composed of the 50 most common words for each topic (see Section 4.2 and the online supplementary material E) we are able to link the topics to relevant economic issues. Thus, although sBIC does not directly measure any semantic quality of topics, the outcome of the model selection procedure using sBIC is a set of interpretable topics. If another criterion were used, the selected number of topics would be very large or very small compared to the number of topics selected by sBIC (see the discussion above). This would imply obtaining either a small model, in which some interpretable topics are omitted, or a large model, in which some topics might be meaningless.
4.2 Topics
Figure 6 shows some topics from the Polish corpus. The uncovered topics deal with different aspects of econometric models (Topics 3 and 36), forecasting (Topic 16), and the modelling of macroeconomic indicators (Topics 9, 10, 17). Figure 7 presents some DE ^{ GER } topics discussing unemployment (Topic 0), consumption and income (Topic 14), and government spending (Topic 15), as well as some DE ^{ ENG } topics discussing business indicators (Topic 2), wages (Topic 6), and the stock market (Topic 14). The font size of the words in the presented word clouds corresponds to the relative importance of the words in a topic. The full set of all topics obtained for all corpora can be found in the online supplementary material E.
4.3 Matching of Topics
In the matching stage, we first performed topic matching between DE ^{ ENG } and PL ^{ ENG } topics based on the topic-word vectors and the intersection of the two vocabularies (2523 words). We identified best matches based on the cosine similarity values.
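This topic-word based matching step can be sketched as follows; the vocabularies and topic-word matrices are toy stand-ins for the corpus-specific LDA outputs:

```python
import numpy as np

def match_topics(tw_a, vocab_a, tw_b, vocab_b):
    """Nearest-neighbour matching of topics from model A to model B by
    cosine similarity, restricted to the shared vocabulary."""
    common = sorted(set(vocab_a) & set(vocab_b))
    A = tw_a[:, [vocab_a.index(w) for w in common]]
    B = tw_b[:, [vocab_b.index(w) for w in common]]
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = A @ B.T                   # (topics_a x topics_b) cosine matrix
    return sim, sim.argmax(axis=1)  # best match in B for each topic in A

# toy topic-word distributions over partly overlapping vocabularies
vocab_a = ["inflation", "price", "wage", "export"]
vocab_b = ["price", "inflation", "trade", "wage"]
tw_a = np.array([[0.60, 0.30, 0.05, 0.05],
                 [0.05, 0.05, 0.60, 0.30]])
tw_b = np.array([[0.50, 0.40, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.80]])
sim, best = match_topics(tw_a, vocab_a, tw_b, vocab_b)
```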
Given that the topic matching procedure provides a best match for every topic in the corpus considered, we have to differentiate between “sensible” matches, i.e., pairs of topics with high similarity, and mere best matches, which may pair quite different topics. To this end, we propose to determine a cutoff value based on the distribution of the cosine similarity values between all possible topic pairs, which provides an approximation of the values to be expected for random matches. Figure 8a presents this distribution of the cosine similarity values, which exhibits an “elbow” around a value of 0.2.
We decided to use the 95th percentile (0.265) of the empirical distribution as the cutoff value. An alternative approach for determining this cutoff value could be based on Monte Carlo simulations for corpora with common and different topics. However, the computational resources required for such simulations would be very high, and the setup would have to take into account how similar the topics within each corpus are, i.e., the results could be used only for a very specific setting. Therefore, we leave such an analysis to future work. Apart from defining a cutoff value, we also checked systematically whether there were multiple assignments, i.e., several topics matched with the same topic in the targeted corpus. In this case, we only kept the pair with the highest cosine similarity value. At the same time, we took the non-symmetry of the nearest-neighbour assignment into account and checked whether the topics in the selected matches are the nearest neighbours of each other also when the direction of matching is reversed.
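The cutoff and deduplication logic might look as follows; the similarity matrix and the helper name sensible_matches are illustrative, and np.percentile over the matrix stands in for the empirical distribution of all pairwise similarities:

```python
import numpy as np

def sensible_matches(sim, q=95):
    """Keep only mutual nearest-neighbour topic pairs whose similarity
    exceeds the q-th percentile of all pairwise similarities; if several
    topics map to the same target, keep the most similar pair."""
    cutoff = np.percentile(sim, q)
    pairs = {}
    for i in range(sim.shape[0]):
        j = int(sim[i].argmax())
        # require the match to hold in both directions and to beat the cutoff
        if int(sim[:, j].argmax()) == i and sim[i, j] >= cutoff:
            if j not in pairs or sim[i, j] > sim[pairs[j], j]:
                pairs[j] = i
    return sorted((i, j) for j, i in pairs.items())

sim = np.array([[0.90, 0.10, 0.20],
                [0.15, 0.80, 0.10],
                [0.20, 0.30, 0.25]])
matches = sensible_matches(sim, q=70)  # topic 2 has no sensible partner
```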
Using this approach, a total of 24 topic pairs was identified. Figures 9 and 10 show two of them. The matched topics appear quite similar, as can be concluded from the word clouds comprising the 50 most frequent words. While the first pair deals with international economic links, the second one is about business cycle analysis. Further matches cover topics such as loan debt, hypothesis testing, forecasting methods, the labour market and (un)employment, capital growth, oil shocks, inflation, income, trade, etc. (see online supplementary material F). Results of a robustness check using the Jensen–Shannon distance as the similarity measure instead of cosine similarity are provided in Appendix C.3.
4.4 Embedding-Based Matching of Topics
For the multilingual corpus, we applied the proposed embedding-based approach to match the topics between the DE ^{ ENG } and DE ^{ GER } data subsets. To this end, each topic was represented as the sum vector of the 100 word vectors corresponding to its 100 most frequent words. Cosine similarity values were calculated between the topic pairs, and for each topic in one language its nearest neighbour in the other language was chosen as its match. Analogously to the topic-word based matching, we used the 95th percentile of the cosine values between all possible topic pairs, 0.93, as the cutoff value to identify “sensible” matches (see Figure 8b). This approach resulted in 16 topic pairs within the German data set (see online supplementary material G). Finally, we made use of the English part of these matches to obtain overall matches between both German topic sets and the PL ^{ ENG } topics.
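A sketch of this embedding-based topic representation, using a tiny made-up bilingual embedding in place of real multilingual word vectors (two dimensions and two words per topic instead of the 100 most frequent words):

```python
import numpy as np

def topic_embedding(top_words, word_vectors):
    """Represent a topic as the sum of the embedding vectors of its most
    frequent words; words missing from the embedding are skipped."""
    return np.sum([word_vectors[w] for w in top_words if w in word_vectors],
                  axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up bilingual embedding: German and English words in a shared space
wv = {"inflation": np.array([1.00, 0.20]), "preis": np.array([0.90, 0.30]),
      "price": np.array([0.95, 0.25]), "arbeit": np.array([0.10, 1.00]),
      "labour": np.array([0.15, 0.95])}

de_topic = topic_embedding(["inflation", "preis"], wv)   # German price topic
en_topic = topic_embedding(["inflation", "price"], wv)   # English price topic
labour_topic = topic_embedding(["arbeit"], wv)           # German labour topic
```

With a well-aligned multilingual embedding, the two price topics end up close in the shared space, while the labour topic does not.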
Figures 11 and 12 show examples of these multilingual matches of the two corpora. In the first example, it becomes obvious that both German topics and the corresponding Polish topic deal with the labour market and unemployment. However, not all of the obtained DE ^{ ENG } and DE ^{ GER } topic pairs appear meaningful to the same extent. The second example shows that the DE ^{ GER } topic deals with private consumption and income and might be related to the analysis of the life cycle of private households. The matched DE ^{ ENG } topic, by contrast, is about investment and capital growth, as is the one in PL ^{ ENG }. This unsatisfactory outcome might be due to the specific multilingual embedding that was used for the matching. Therefore, further research is required for selecting or generating appropriate embeddings in order to improve the proposed approach to multilingual topic matching.
4.5 Time Series of Topic Weights
To enable a comparison of patterns in the series of weights, the time series for the PL ^{ ENG } and DE ^{ ENG } text corpora were filtered with a centered, equally-weighted moving average computed over 5 observations, MA(5). In the next step, the Euclidean distance and correlation coefficients were evaluated for the smoothed series. The values of these measures as well as the cosine similarity scores are reported in Table 5 in Appendix D. Below we discuss the relations between the weight series of two topic pairs.
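These smoothing and comparison steps can be sketched with synthetic series (the common trend and the noise level are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)
# two synthetic annual weight series sharing a mild upward trend
s1 = pd.Series(0.05 + 0.03 * t + rng.normal(0.0, 0.005, 30))
s2 = pd.Series(0.06 + 0.03 * t + rng.normal(0.0, 0.005, 30))

# centered, equally-weighted MA(5) filter (two observations lost per end)
f1 = s1.rolling(window=5, center=True).mean().dropna()
f2 = s2.rolling(window=5, center=True).mean().dropna()

corr = f1.corr(f2)                     # correlation of the filtered series
dist = float(np.linalg.norm(f1 - f2))  # Euclidean distance between them
```

Since both synthetic series share the trend, the filtered series are strongly positively correlated, while the level offset keeps the Euclidean distance positive.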
Figures 13 and 14 present word clouds and weight series (both raw and filtered) for two selected pairs of topics from the PL ^{ ENG } and DE ^{ ENG } corpora. Analogous figures for all topic pairs are provided in online supplementary material F. Figure 13 shows the development over time of interest in the topics on international economic links. This match was characterised by a high cosine similarity score of 0.86146. The Euclidean distance between the filtered series amounted to 0.10798 and the correlation coefficient had a value of 0.63348. The filtered series for the topics identified in both text corpora show a mild upward trend: an increasing interest in international links might be associated with the increasing openness and integration of the European Union economies. The high positive correlation can additionally be attributed to common patterns in the dynamics which are synchronised in time.
Figure 14 shows word clouds and plots of the weight series for a pair of topics concerning the business cycle. Although the cosine similarity for this pair of topics is also high (0.83445), the Euclidean distance is larger (0.16063) and a negative correlation coefficient (−0.20020) indicates weaker co-movement. The negative correlation can be explained by the misalignment in time of interest in these topics due to the different economic circumstances of Germany and Poland. The creation of the euro area brought about an increased interest in business cycle studies in Germany, which can be explained by the need to better understand economic fluctuations in the common currency area. The importance of a similar topic for Poland grew later – after joining the common market.
5 Conclusions and Outlook
The present work considered scientific publications from Germany and Poland. The primary aim was to uncover the main topics in the corpora using LDA modelling and to compare them with each other on the basis of the proposed matching approaches. The results of the current paper are a valuable contribution to the growing body of literature on text-as-data applications for several reasons.
First, we address one of the great challenges in topic modelling, namely the choice of the optimal number of topics. We suggest selecting the number of topics based on sBIC, a Bayesian information criterion adapted to the singularity of LDA models. Our analysis shows that the proposed information criterion leads to coherent topics in the considered text corpora. Second, we propose a topic matching approach that allows comparing the topic-word distributions of two different LDA models and identifying suitable topic pairs across text corpora. This matching approach made it possible to find meaningful topic pairs describing similar concepts in the Polish and German corpora. Third, we suggest a data-based procedure for identifying potentially meaningful matches of topics across corpora. Using a data-based cutoff value for the minimum cosine similarity of matched topics, we were able to separate sensible matches from the remaining ones, which do not correspond to similar topics in the colloquial sense. Fourth, we address the problem of topic matching between two corpora trained in different languages by proposing a language-agnostic topic matching approach using multilingual word embeddings.
The work could be extended along the following lines. Further research is required to examine the randomness component of the sBIC criterion, e.g., by conducting a simulation study for sBIC. Additional work is also needed to improve the proposed embedding based matching approach as well as to examine further possibilities for topic matching in a multilingual context. This is recommended since not all topics matched across different languages in the current study were fully convincing. Furthermore, some of the identified topic matches seem to be closely related to themes concerning real macroeconomic activities, e.g. inflation, unemployment, income etc. Further research will examine more closely the links between the corresponding topic time series and real macroeconomic variables with a focus on potential differences across countries.
Funding source: Deutsche Forschungsgemeinschaft
Award Identifier / Grant number: WI 2024/81
Funding source: Narodowe Centrum Nauki
Award Identifier / Grant number: Beethoven Classic 3: UMO2018/31/G/HS4/00869

Research funding: Financial support from the German Research Foundation (DFG) (WI 2024/81) and the National Science Centre (NCN) (Beethoven Classic 3: UMO2018/31/G/HS4/00869) for the project TEXTMOD is gratefully acknowledged.
Appendix A: Data Preparation
A.1 German Data
In the first step, the data were downloaded from the De Gruyter website. Table 1 summarizes the number of articles published each year. Up to 2000, the volumes were available as scanned pdf files. Optical Character Recognition (OCR) was used to transform the existing pdf files into text files. The text files were then copied into Microsoft Word and saved again with a different encoding (Unicode UTF-8). After that, the new text files were again copied into Word and the following preparation steps were taken:
Mark each issue number with heading 1.
Mark each article title with heading 2.
Remove the following nontextual elements:
Table of contents,
Author names and article numbers,
Formulas and special characters,
Bibliographies,
Tables and appendices.
After these preparation steps, the data could be imported into Python and be further preprocessed and analysed.
Year  Volume  Number of articles 

1984  199  42 
1985  200  46 
1986  201&202  49 
1987  203  49 
1988  204&205  93 
1989  206  52 
1990  207  46 
1991  208  53 
1992  209&210  88 
1993  211&212  81 
1994  213  53 
1995  214  53 
1996  215  57 
1997  216  45 
1998  217  53 
1999  218&219  84 
2000  220  52 
2001  221  39 
2002  222  41 
2003  223  46 
2004  224  40 
2005  225  45 
2006  226  35 
2007  227  42 
2008  228  34 
2009  229  41 
2010  230  44 
2011  231  46 
2012  232  43 
2013  233  35 
2014  234  41 
2015  235  39 
2016  236  33 
2017  237  26 
2018  238  27 
2019  239  36 
2020  240  35 
A.2 Polish Data
The texts from the two data sources for Poland had different forms. The conference proceedings were available as hard copies of the volumes, while the CEJEME articles were digital and had the format of LaTeX or pdf files.
The available conference volumes (including more than 9000 pages) were scanned and saved as pdf files. The description of the MM data is provided in Table 2. Altogether, the data included 514 full length papers (with or without an abstract) and 231 abstracts (without the main text). In the next step, OCR was performed and the texts were saved as docx files. Over the years, the volumes were published by various publishing houses, using alternative typesetting styles and techniques. Thus, also the resulting source files differed considerably and extensive manual labour was needed to clean the texts. This preparatory step involved removing front and back matter, running heads and feet, tables, footnotes, figures, equations and other mathematical expressions as well as references. The beginning of each article was also manually marked. In addition, within each paper, information on the title, authors, affiliations, abstract (if present) and main body of the text were uniformly organized so that they could be easily identified by the code.
Year of conference  No. of volumes  Meetings  Contents 

1984  1  MM and MF  17 full papers 
1985  2  MM  26 full papers 
1986 and 1987  1  MM  17 full papers 
1988 and 1989  1  MM  18 full papers 
1990  1  MM  11 full papers 
1991  1  MM  12 full papers 
1992  1  MM  16 full papers 
1993  2  MM  24 full papers 
1994  1  MM  11 full papers 
1995  2  MM and MSA  29 full papers 
1996  3  MM and MSA  46 full papers 
1997  2  MM and AMFET  20 full papers 
1998  1  MM and AMFET  8 full papers 
1999  2  MM and AMFET  40 full papers 
2000  2  MM and AMFET  14 full papers 
2001  2  MM and AMFET  33 full papers 
2002  2  MM and AMFET  25 full papers 
2003  2  MM and AMFET  20 full papers 
2004  2  MM and AMFET  23 full papers 
2005  2  MM and AMFET  27 full papers 
2006  2  MM and AMFET  25 full papers 
2007  2  MM and AMFET  25 full papers 
2008  1  MM and AMFET  10 full papers 
2009  1  MM and AMFET  6 full papers 
2010  1  MM and AMFET  6 full papers 
2011  1  MM and AMFET  5 full papers 
The input files from CEJEME used for modelling had LaTeX format.^{[8]} Detailed information on the numbers of papers published in each volume and issue of CEJEME is presented in Table 3. All papers had an abstract.
Year  Volume  Issue  Number of papers  Year  Volume  Issue  Number of papers 

2009  1  1  5  2015  7  1  3 
1  2  4  7  2  3  
1  3  4  7  3  3  
1  4  4  7  4  3  
2010  2  1  4  2016  8  1  3 
2  2  3  8  2  3  
2  3  3  8  3  3  
2  4  3  8  4  3  
2011  3  1  3  2017  9  1  3 
3  2  3  9  2  3  
3  3  3  9  3  3  
3  4  3  9  4  3  
2012  4  1  3  2018  10  1  3 
4  2  3  10  2  3  
4  3  3  10  3  3  
4  4  3  10  4  3  
2013  5  1  3  2019  11  1  3 
5  2  3  11  2  3  
5  3  3  11  3  3  
5  4  3  11  4  3  
2014  6  1  3  2020  12  1  3 
6  2  3  12  2  4  
6  3  3  12  3  4  
6  4  3  12  4  4 
Initially, a structured database on the documents was created using Matlab. This step consisted in extracting from the LaTeX files information on the publication year, names of authors, title of each paper, keywords, JEL codes and abstracts. Abstracts were cleaned of all mathematical expressions and LaTeX formatting commands. Gathering this information was facilitated by a relatively stable LaTeX template used in the publication process.
In the next step, to form the text corpus(es) appropriate for further probabilistic analysis, the text files had to be suitably prepared. Initial editing was done in two steps. In the first step, the original files with .tex extension were modified to obtain the main body of the text by removing the following elements:
Initial article information including: the author name(s), affiliation(s), email address(es), dates of submitting and accepting the article,
The abstract, keywords and JEL classification codes,
Text appearing in running head (the author name(s) and short title of the paper) and running foot (the author name(s) and information on the volume and issue) of the journal,
Figures and tables,
Formulas, mathematical symbols and Greek letters,
References,
Selected LaTeX commands which prevented compilation after the above alterations of the texts, e.g. those introducing line breaks.
In the second step, PDF files were generated from the filtered LaTeX files. The PDF files were then converted to plain text format.
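The listed removals were in fact achieved by filtering the LaTeX sources and compiling them to PDF; as a rough stand-alone illustration, the same kind of cleaning can be approximated with regular expressions (the function strip_tex and the sample are ours, and the patterns are deliberately simplistic):

```python
import re

def strip_tex(tex):
    """Crude approximation of the cleaning steps: drop comments, maths,
    table/figure/equation environments and LaTeX command names."""
    tex = re.sub(r"%.*", " ", tex)                                # comments
    tex = re.sub(r"\\begin\{(table|figure|equation)\*?\}.*?"
                 r"\\end\{\1\*?\}", " ", tex, flags=re.S)         # floats
    tex = re.sub(r"\$\$.*?\$\$|\$[^$]*\$", " ", tex, flags=re.S)  # formulas
    tex = re.sub(r"\\[a-zA-Z]+\*?", " ", tex)                     # commands
    tex = re.sub(r"[{}\[\]]", " ", tex)                           # delimiters
    return re.sub(r"\s+", " ", tex).strip()

sample = r"""GDP grows. % a comment
\begin{equation} y = x \end{equation}
See $\alpha$ and \textbf{inflation} here."""
clean = strip_tex(sample)
```

Compiling and re-extracting, as done in the paper, is more robust than such pattern matching, which is why the regex route is only a sketch.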
Appendix B: Stopwords
English stopwords removed from article texts using the R package tm:
I, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, would, should, could, ought, I’m, you’re, he’s, she’s, it’s, we’re, they’re, I’ve, you’ve, we’ve, they’ve, I’d, you’d, he’d, she’d, we’d, they’d, I’ll, you’ll, he’ll, she’ll, we’ll, they’ll, isn’t, aren’t, wasn’t, weren’t, hasn’t, haven’t, hadn’t, doesn’t, don’t, didn’t, won’t, wouldn’t, shan’t, shouldn’t, can’t, cannot, couldn’t, mustn’t, let’s, that’s, who’s, what’s, here’s, there’s, when’s, where’s, why’s, how’s, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very.
Additional stopwords removed from the texts of articles:
appendix, acknowledgements, introduction.
German stopwords ^{[9]} removed from article texts:
a, ab, aber, ach, acht, achte, achten, achter, achtes, ag, alle, allein, allem, allen, aller, allerdings, alles, allgemeinen, als, also, am, an, ander, andere, anderem, anderen, anderer, anderes, anderm, andern, anderr, anders, au, auch, auf, aus, ausser, ausserdem, außer, außerdem, b, bald, bei, beide, beiden, beim, beispiel, bekannt, bereits, besonders, besser, besten, bin, bis, bisher, bist, c, d, d.h, da, dabei, dadurch, dafür, dagegen, daher, dahin, dahinter, damals, damit, danach, daneben, dank, dann, daran, darauf, daraus, darf, darfst, darin, darum, darunter, darüber, das, dasein, daselbst, dass, dasselbe, davon, davor, dazu, dazwischen, daß, dein, deine, deinem, deinen, deiner, deines, dem, dementsprechend, demgegenüber, demgemäss, demgemäß, demselben, demzufolge, den, denen, denn, denselben, der, deren, derer, derjenige, derjenigen, dermassen, dermaßen, derselbe, derselben, des, deshalb, desselben, dessen, deswegen, dich, die, diejenige, diejenigen, dies, diese, dieselbe, dieselben, diesem, diesen, dieser, dieses, dir, doch, dort, drei, drin, dritte, dritten, dritter, drittes, du, durch, durchaus, durfte, durften, dürfen, dürft, e, eben, ebenso, ehrlich, ei, ei, eigen, eigene, eigenen, eigener, eigenes, ein, einander, eine, einem, einen, einer, eines, einig, einige, einigem, einigen, einiger, einiges, einmal, eins, elf, en, ende, endlich, entweder, er, ernst, erst, erste, ersten, erster, erstes, es, etwa, etwas, euch, euer, eure, eurem, euren, eurer, eures, f, folgende, früher, fünf, fünfte, fünften, fünfter, fünftes, für, g, gab, ganz, ganze, ganzen, ganzer, ganzes, gar, gedurft, gegen, gegenüber, gehabt, gehen, geht, gekannt, gekonnt, gemacht, gemocht, gemusst, genug, gerade, gern, gesagt, geschweige, gewesen, gewollt, geworden, gibt, ging, gleich, gott, gross, grosse, grossen, grosser, grosses, groß, große, großen, großer, großes, gut, gute, guter, gutes, h, hab, habe, haben, habt, hast, hat, hatte, hatten, hattest, hattet, heisst, her, heute, hier, 
hin, hinter, hoch, hätte, hätten, i, ich, ihm, ihn, ihnen, ihr, ihre, ihrem, ihren, ihrer, ihres, im, immer, in, indem, infolgedessen, ins, irgend, ist, j, ja, jahr, jahre, jahren, je, jede, jedem, jeden, jeder, jedermann, jedermanns, jedes, jedoch, jemand, jemandem, jemanden, jene, jenem, jenen, jener, jenes, jetzt, k, kam, kann, kannst, kaum, kein, keine, keinem, keinen, keiner, keines, kleine, kleinen, kleiner, kleines, kommen, kommt, konnte, konnten, kurz, können, könnt, könnte, l, lang, lange, leicht, leide, lieber, los, m, machen, macht, machte, mag, magst, mahn, mal, man, manche, manchem, manchen, mancher, manches, mann, mehr, mein, meine, meinem, meinen, meiner, meines, mensch, menschen, mich, mir, mit, mittel, mochte, mochten, morgen, muss, musst, musste, mussten, muß, mußt, möchte, mögen, möglich, mögt, müssen, müsst, müßt, n, na, nach, nachdem, nahm, natürlich, neben, nein, neue, neuen, neun, neunte, neunten, neunter, neuntes, nicht, nichts, nie, niemand, niemandem, niemanden, noch, nun, nur, o, ob, oben, oder, offen, oft, ohne, ordnung, p, q, r, recht, rechte, rechten, rechter, rechtes, richtig, rund, s, sa, sache, sagt, sagte, sah, satt, schlecht, schluss, schon, sechs, sechste, sechsten, sechster, sechstes, sehr, sei, seid, seien, sein, seine, seinem, seinen, seiner, seines, seit, seitdem, selbst, sich, sie, sieben, siebente, siebenten, siebenter, siebentes, sind, so, solang, solche, solchem, solchen, solcher, solches, soll, sollen, sollst, sollt, sollte, sollten, sondern, sonst, soweit, sowie, später, startseite, statt, steht, suche, t, tag, tage, tagen, tat, teil, tel, tritt, trotzdem, tun, u, uhr, um, und, uns, unse, unsem, unsen, unser, unsere, unserer, unses, unter, v, vergangenen, viel, viele, vielem, vielen, vielleicht, vier, vierte, vierten, vierter, viertes, vom, von, vor, w, wahr, wann, war, waren, warst, wart, warum, was, weg, wegen, weil, weit, weiter, weitere, weiteren, weiteres, welche, welchem, welchen, welcher, welches, wem, wen, 
wenig, wenige, weniger, weniges, wenigstens, wenn, wer, werde, werden, werdet, weshalb, wessen, wie, wieder, wieso, will, willst, wir, wird, wirklich, wirst, wissen, wo, woher, wohin, wohl, wollen, wollt, wollte, wollten, worden, wurde, wurden, während, währenddem, währenddessen, wäre, würde, würden, x, y, z, z.b, zehn, zehnte, zehnten, zehnter, zehntes, zeit, zu, zuerst, zugleich, zum, zunächst, zur, zurück, zusammen, zwanzig, zwar, zwei, zweite, zweiten, zweiter, zweites, zwischen, zwölf, über, überhaupt, übrigens.
Appendix C: Robustness Check
C.1 Translation
As a robustness check, we translated the German texts from JES into English using the DeepL API. The vocabulary of the joint dataset is almost identical (about 85% overlap) to the vocabulary of the English subset of the data. For the joint dataset
In the next step, we performed standard topic matching in both directions
Using the proposed standard matching approach we identified 24 sensible matches between the corpora for both countries reported in Table 5 in Appendix D. Next, we could analyse whether matching
Therefore, in general, machine translation might be considered a good alternative when dealing with multilingual corpora. However, the additional costs, the quality of translation for certain corpora, and the “black box” character of machine translations must be taken into account.
C.2 Sklearn Versus Gensim
To account for possible differences in topic distributions resulting from different LDA implementations, we considered a further Python module, gensim. For all the considered datasets, namely PL ^{ ENG }, DE ^{ ENG }, and DE ^{ GER }, we estimated LDA models using gensim with the number of topics selected according to sBIC. For each dataset, we calculated the following evaluation metrics: perplexity, average topic similarity (Cao et al. 2009) and average topic coherence (Mimno et al. 2011). The results are summarized in Table 4. According to the perplexity and average topic similarity measures, sklearn seems to perform better. As for the average coherence of the resulting topics, the scores are quite similar.
PL ^{ ENG }  DE ^{ ENG }  DE ^{ GER }  

Gensim  Sklearn  Gensim  Sklearn  Gensim  Sklearn  
Perplexity  873.4  842.8  998.02  949.4  1697.62  1641.45 
Cao et al. (2009)  0.18  0.14  0.11  0.08  0.28  0.2 
Mimno et al. (2011)  −0.76  −0.79  −1.04  −0.93  −0.92  −0.95 
As we are most interested in topics, for each dataset we compared the topicword distributions using the proposed standard matching approach. In doing so, we aimed to find out whether topics uncovered using the two different LDA implementations overlap to a large extent or not. We found that most of the topics that are later identified as meaningful matches can be found by means of both implementations (see examples below).
C.3 Similarity Measure
We also performed topic matching using a different similarity measure, the Jensen–Shannon (JS) distance, to see whether the main results and the resulting topic pairs change considerably. Analogously to the process presented in the main part of this paper, we first calculated the JS distances between all possible topic matches to derive a suitable cutoff value. Figure 17 presents the distribution of the JS distance values. The lower the distance between two topic vectors, the more similar they are to each other. We took the 5th percentile (0.64) as the cutoff value. We then removed multiple assignments, keeping just the topic pair with the lowest distance. This resulted in 23 topic pairs. Four out of the 24 assignments were different compared to the results when using cosine similarity. One possible reason for this is that the DE ^{ ENG } topic set is larger and contains some quite similar topics, i.e. one PL ^{ ENG } topic could be a suitable pair for more than one of the DE ^{ ENG } topics. An example is shown in Figure 18. Both DE ^{ ENG } topics seem to be related to the PL ^{ ENG } topic. Overall, it could be observed that the use of a different similarity measure does not impact the results significantly.
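The Jensen–Shannon variant reuses the same matching pipeline with the similarity measure swapped; a sketch using scipy, where the three distributions are toy topic-word vectors:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# toy topic-word distributions over a shared four-word vocabulary
p = np.array([0.50, 0.30, 0.10, 0.10])
q = np.array([0.45, 0.35, 0.10, 0.10])  # close to p
r = np.array([0.10, 0.10, 0.30, 0.50])  # far from p

# unlike cosine similarity, lower JS distance means more similar topics,
# so matching keeps pairs below (not above) the percentile cutoff
d_close = jensenshannon(p, q)
d_far = jensenshannon(p, r)
```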
Appendix D: Time Series Analysis
Topic number in PL ^{ ENG }  Topic number in DE ^{ ENG }  Cosine similarity score^{a}  Euclidean distance  Correlation coefficient 
0  4  0.84617  0.15223  −0.11350 
1  15  0.45003  0.22556  −0.47935 
2  47  0.58684  0.24628  0.58412 
3  13  0.50440  0.07083  0.42358 
4  12  0.75001  0.24791  −0.02885 
6  36  0.48663  0.07005  −0.39230 
10  52  0.69914  0.08590  −0.40108 
11  20  0.72227  0.07436  0.60665 
13  21  0.80610  0.20421  0.05272 
14  33  0.86146  0.10798  0.63348 
15  54  0.56860  0.14301  0.85228 
16  1  0.88338  0.09773  0.03375 
17  57  0.49535  0.13781  0.51459 
21  43  0.63251  0.12119  −0.58288 
22  46  0.54288  0.12549  −0.81556 
23  2  0.83445  0.16063  −0.20020 
25  39  0.50876  0.09650  −0.61320 
26  37  0.65153  0.11825  −0.24760 
27  26  0.85116  0.09879  0.00184 
29  30  0.66695  0.30690  −0.44140 
31  38  0.52991  0.06821  0.07816 
32  53  0.70459  0.14300  −0.55365 
34  31  0.52545  0.08092  0.41222 
35  28  0.34393  0.10423  −0.59937 

^{a}These values refer to the topics’ content and were calculated between the word distributions of the topics, whereas the Euclidean distance and correlation were calculated using the resulting topic time series.
References
Arun, R., Suresh, V., Veni Madhavan, C.E., and Narasimha Murthy, M.N. (2010). On finding the natural number of topics with latent dirichlet allocation: some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B., and Pudi, V. (Eds.), Advances in knowledge discovery and data mining. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 391–402.10.1007/9783642136573_43Search in Google Scholar
Baker, S.R., Bloom, N., and Davis, S.J. (2016). Measuring economic policy uncertainty. Q. J. Econ. 131: 1593–1636, https://doi.org/10.1093/qje/qjw024.Search in Google Scholar
Bianchi, F., Terragni, S., Hovy, D., Nozza, D., and Fersini, E. (2020). Crosslingual contextualized topic models with zeroshot learning, arXiv preprint arXiv:2004.07737.10.18653/v1/2021.eaclmain.143Search in Google Scholar
Blei, D.M., Ng, A.Y., and Jordan, M.I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3: 993–1022.Search in Google Scholar
Cao, J., Xia, T., Li, J., Zhang, Y., and Tang, S. (2009). A densitybased method for adaptive lda model selection. Neurocomputing 72: 1775–1781, https://doi.org/10.1016/j.neucom.2008.06.011.Search in Google Scholar
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word translation without parallel data. CoRR, abs/1710.04087. Available at: http://arxiv.org/abs/1710.04087.Search in Google Scholar
Debnath, R. and Bardhan, R. (2020). India nudges to contain COVID19 pandemic: a reactive public policy analysis using machinelearning based topic modelling. PLoS One 15: 1–25, https://doi.org/10.1371/journal.pone.0238972.Search in Google Scholar
Dieng, A.B., Ruiz, F.J., and Blei, D.M. (2020). Topic modeling in embedding spaces. Trans. Assoc. Comput. Ling. 8: 439–453, https://doi.org/10.1162/tacl_a_00325.Search in Google Scholar
Dörr, J.O., Kinne, J., Lenz, D., Licht, G., and Winker, P. (2022). An integrated data framework for policy guidance during the coronavirus pandemic: towards realtime decision support for economic policymakers. PLoS One 17: e0263898, https://doi.org/10.1371/journal.pone.0263898.Search in Google Scholar
Drton, M. and Plummer, M. (2017). A Bayesian information criterion for singular models. J. Roy. Stat. Soc. B 79: 323–380, https://doi.org/10.1111/rssb.12187.Search in Google Scholar
Ellingsen, J., Larsen, V.H., and Thorsrud, L.A. (2022). News media versus fredmd for macroeconomic forecasting. J. Appl. Econom. 37: 63–81, https://doi.org/10.1002/jae.2859.Search in Google Scholar
Foltas, A. (2022). Testing investment forecast efficiency with forecasting narratives. J. Econ. Stat. 242: 191–222, https://doi.org/10.1515/jbnst20200027.Search in Google Scholar
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the eleventh international conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan. Available at: https://www.aclweb.org/anthology/L18-1550.
Hansen, S. and McMahon, M. (2016). Shocking language: understanding the macroeconomic effects of central bank communication. J. Int. Econ. 99: S114–S133, https://doi.org/10.1016/j.jinteco.2015.12.008.
Hayashi, N. (2021). The exact asymptotic form of Bayesian generalization error in latent Dirichlet allocation. Neural Netw. 137: 127–137, https://doi.org/10.1016/j.neunet.2021.01.024.
Hoffman, M., Bach, F.R., and Blei, D.M. (2010). Online learning for latent Dirichlet allocation. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (Eds.), Advances in neural information processing systems, 23. Curran Associates, Inc., La Jolla, CA, Red Hook, NY, pp. 856–864.
Hoffman, M.D., Blei, D.M., Wang, C., and Paisley, J.W. (2013). Stochastic variational inference. J. Mach. Learn. Res. 14: 1303–1347.
Jentsch, C., Lee, E.R., and Mammen, E. (2020). Time-dependent Poisson reduced rank models for political text data analysis. Comput. Stat. Data Anal. 142: 106813, https://doi.org/10.1016/j.csda.2019.106813.
Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., and Grave, E. (2018). Loss in translation: learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 conference on empirical methods in natural language processing, Association for Computational Linguistics, Brussels, Belgium, pp. 2979–2984, https://doi.org/10.18653/v1/D18-1330.
Kalamara, E., Turrell, A., Redl, C., Kapetanios, G., and Kapadia, S. (2020). Making text count: economic forecasting using newspaper text, Bank of England working papers 865, Bank of England. Available at: https://ideas.repec.org/p/boe/boeewp/0865.html, https://doi.org/10.2139/ssrn.3610770.
Krüger, M., Kinne, J., Lenz, D., and Resch, B. (2020). The digital layer: how innovative firms relate on the web, Technical Report No. 20-003, ZEW – Centre for European Economic Research. Available at: https://ssrn.com/abstract=3530807.
Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2: 83–97, https://doi.org/10.1002/nav.3800020109.
Larsen, V.H. and Thorsrud, L.A. (2019). The value of news for economic developments. J. Econom. 210: 203–218, https://doi.org/10.1016/j.jeconom.2018.11.013.
Lenz, D. and Winker, P. (2020). Measuring the diffusion of innovations with paragraph vector topic models. PLoS One 15: e0226685, https://doi.org/10.1371/journal.pone.0226685.
Lucas, C., Nielsen, R.A., Roberts, M.E., Stewart, B.M., Storer, A., and Tingley, D. (2015). Computer-assisted text analysis for comparative politics. Polit. Anal. 23: 254–277, https://doi.org/10.1093/pan/mpu019.
Lüdering, J. and Tillmann, P. (2020). Monetary policy on Twitter and asset prices: evidence from computational text analysis. N. Am. J. Econ. Finance 51: 100875, https://doi.org/10.1016/j.najef.2018.11.004.
Lüdering, J. and Winker, P. (2016). Forward or backward looking? The economic discourse and the observed reality. J. Econ. Stat. 236: 483–515, https://doi.org/10.1515/jbnst-2015-1026.
Maier, D., Baden, C., Stoltenberg, D., Vries-Kedem, M.D., and Waldherr, A. (2022). Machine translation vs. multilingual dictionaries? Assessing two strategies for the topic modeling of multilingual text collections. Commun. Methods Meas. 16: 19–38, https://doi.org/10.1080/19312458.2021.1955845.
Mamaysky, H. (2021). News and markets in the time of COVID-19. SSRN. Available at: https://ssrn.com/abstract=3565597.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. In: Bengio, Y., and LeCun, Y. (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings. Available at: http://arxiv.org/abs/1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2: 3111–3119.
Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing, Association for Computational Linguistics, Edinburgh, Scotland, UK, pp. 262–272. Available at: https://aclanthology.org/D11-1024.
Shapiro, A.H., Sudhof, M., and Wilson, D.J. (2022). Measuring news sentiment. J. Econom. 228: 221–243, https://doi.org/10.1016/j.jeconom.2020.07.053.
Thorsrud, L.A. (2020). Words are the new numbers: a newsy coincident index of the business cycle. J. Bus. Econ. Stat. 38: 393–409, https://doi.org/10.1080/07350015.2018.1506344.
Watanabe, S. (2009). Algebraic geometry and statistical learning theory, Cambridge monographs on applied and computational mathematics. Cambridge University Press, Cambridge.
Wehrheim, L. (2019). Economic history goes digital: topic modeling the Journal of Economic History. Cliometrica 13: 83–125, https://doi.org/10.1007/s11698-018-0171-7.
Supplementary Material
The online version of this article offers supplementary material (https://doi.org/10.1515/jbnst-2022-0024).
© 2022 Walter de Gruyter GmbH, Berlin/Boston