Bill Clinton’s campaign in the early 1990s focused mainly on America’s declining economy, emphasizing a basic but effective slogan: “It’s the economy, stupid.” The implication was clear: President Bush was still gloating over the end of the Cold War and victory in the Persian Gulf while ignoring the more recent fact that the recession was hurting the “forgotten middle class.”
The 1992 race occurred in the aftermath of the 1990–1991 recession, which had been followed by a “jobless recovery.” Clinton’s campaign manager James Carville coined the phrase “the economy, stupid” as one of three guiding principles for the campaign. Despite an economic environment very similar to that of the 1992 election, as well as more imminent economic turmoil among foreign trading partners and “warnings” in the financial markets, today’s presidential candidates do not seem particularly focused on the recession or macroeconomic policy.
Is it merely in hindsight that we regard economic issues as a defining message of the ’92 Clinton campaign, or was this focus genuinely characteristic of its discourse? Viewed within the full context of his administration, which oversaw an extended period of marked growth, the message appears to have presaged a strong recovery. On the other hand, had the recovery continued to stagnate, perhaps we would not be so willing to assign significance to this point of campaign strategy. We believe that an objective analysis of primary sources from the ’92 campaign would provide a definitive test of this hypothesis.
Similarly, we might expect that our contemporary cultural biases will distort the vast quantity of communication disseminated over the course of the present contest. When candidates compete for attention, tailoring their messages depending on their audience, what we notice and remember is less a function of the actual spoken content than it is the secondary reactions from reporters, pundits, and social media echo chambers.
But is there a way to analyze election-year rhetoric that does not incorporate some level of bias on the part of those conducting the analysis? The key innovation of this paper is to use text data from transcripts and policy releases to explore recurring topics addressed directly by presidential candidates and their campaigns. To quantify text, we use probabilistic topic models, a class of machine learning algorithms that identify common topics in a set of documents and decompose documents into the fraction of words they spend covering a given topic. We generate word counts and topic coverage based on debates and policy releases, and use them to explore which topics or themes dominate this year’s election cycle and what the priorities of each presidential candidate are.
Results show that candidates have rarely addressed the economy in their speech, and when they did, it was either with an emphasis on Wall Street and its perceived evils or with promises of a balanced budget. More complex topics, such as taxes and long-term support of the economy, were absent from the debates and policy releases. This is itself an important finding, since it suggests that none of the potential future presidents is focused on the ongoing market turmoil, our troubled debt, or long-term interest rates near record lows.
2 The Current State of the Economy
Despite the fact that the unemployment rate fell from a high of 10 percent in 2009 to 4.9 percent today and job growth has been consistently strong, we are still living in the era of 2009 in the sense that deflation is our key challenge. Between 2007 and 2009, the economy contracted by 5 percent and the stock market lost around half its value. Unemployment rates remained persistently high despite the fact that employers were posting substantially more vacancies. About one-tenth of one percent of American households own almost as much wealth as the bottom 90 percent of Americans, and research shows income for the top 10 percent of Americans has grown three times faster than for the bottom 90 percent since 1980 (Ghayad and Cragg, 2015). With all these worrisome economic indicators, economists and policymakers nowadays should be concerned about the state of our economy.
Today, bond markets show long-term US rates near record lows. A significant volume of loans to energy companies, which were hit hard by plunging oil prices, poses a great risk to the economy. Moreover, the labor force participation rate remains well below pre-recession levels, at 30-year lows. All of this indicates that the economy is still struggling to emerge from its deepest recession since the Great Depression.
3 Identifying Presidential Candidates’ Preferences
Every time a candidate puts out a press release, policy document, or interview, the document can be interpreted to communicate intentions. Perhaps it is about the candidate’s proposed changes in tax policy, the minimum wage, or military strategy; perhaps it combines multiple themes of the candidate’s campaign, invoking the issue of veterans during a speech about foreign policy, or referencing working families in the context of taxes, the minimum wage, and healthcare. Press releases, debates, and opinion articles constitute a particularly useful medium for measuring candidates’ priorities: they provide distinct and politically important content that offers a comprehensive measure of candidates’ topics and priorities. This article analyzes the topics that presidential candidates emphasize and the priorities they express.
To provide a comprehensive measure of candidates’ priorities, we introduce a statistical model that helps uncover hidden themes in unlabeled text data without linking themes to particular word lists prior to estimation. To do this, the model makes use of the well-established idea that topics in texts are expressed with a distinctive set of words (Blei, Ng and Jordan, 2003; Hansen, McMahon and Prat, 2015). The model estimates a set of topics: clusters of words discussed in debates, press releases and other opinion articles.1 The model then assigns every press release, debate or article to one or more topics, and provides each topic with a document-specific weight. This paper looks only at each document’s highest-weighted topic.
As a measure of candidates’ preferences, the model measures the proportion of documents candidates allocate to each of the topics. This provides a measure of how candidates communicate their preferences. Consider, for example, Bernie Sanders in 2016. We measure how Sanders divided his attention across the 20 topics in the debates, press releases, and opinion articles. Collecting this allocation across all 20 topics, we have our measure of Sanders’s preferences in 2016. The model obtains the fraction of documents allocated to each topic for every candidate in a given year.
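This allocation can be sketched in a few lines, assuming we already have the model’s dominant-topic label for each document; the candidate names and counts below are illustrative toy data, not the paper’s corpus:

```python
from collections import Counter

# Hypothetical (candidate, dominant topic) labels as the model would assign
# them; the real corpus contains hundreds of documents.
doc_labels = [
    ("Sanders", 20), ("Sanders", 20), ("Sanders", 3), ("Sanders", 20),
    ("Clinton", 3), ("Clinton", 12), ("Clinton", 3), ("Clinton", 12),
]

def topic_shares(labels, candidate):
    """Fraction of a candidate's documents whose dominant topic is each topic."""
    topics = [t for c, t in labels if c == candidate]
    counts = Counter(topics)
    return {t: n / len(topics) for t, n in counts.items()}

print(topic_shares(doc_labels, "Sanders"))  # {20: 0.75, 3: 0.25}
```

Collecting these shares across all topics and candidates yields the preference measure described above.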
4 Measuring Communication
Topic modeling is an effective statistical tool for mining large collections of textual documents to identify themes or topics. By looking at the words that occur together most often across different documents, it identifies the underlying topics that candidates are most likely to invoke when constructing their texts. By examining the topics most associated with each candidate, we can draw conclusions about his or her priorities and values.
The documents in our model consist of transcripts from the nine Republican debates and six Democratic debates held in 2015 and 2016, policy position statements from presidential candidates’ websites, op-ed pieces written in newspapers and magazines, transcripts of candidate interviews, and speeches – both direct transcripts and remarks as prepared for delivery.
We analyze the frequency count of each word that appears in this document database. By analyzing which words tend to occur together, we identify themes most used by different candidates in a neutral way. To do so, we rely on a probabilistic topic model that is commonly referred to as Latent Dirichlet Allocation (LDA).
LDA is a statistical approach to document modeling that discovers latent semantic topics in large collections of text documents. LDA posits that words carry strong semantic information, and documents discussing similar topics will use a similar group of words. Each discovered topic is characterized by its own particular distribution over words. Each document is then characterized as a random mixture of topics indicating the proportion of words the document spends on each topic. The number of topics in our case was set to be 20 and the topic that was assigned the highest probability for a document was chosen as its topic. LDA is a widely used topic model and has been cited thousands of times since 2003 (Blei, Ng and Jordan, 2003).
The LDA model proposes a stochastic procedure by which words in documents are generated. Given a corpus of unlabeled text documents, the model discovers hidden topics as distributions over the words in the vocabulary. Here, words are modeled as observed random variables, while topics are latent random variables. Once the generative procedure is established, we define its joint distribution and then use statistical inference to compute the probability distribution over the latent variables, conditioned on the observed variables. The use of probability distributions is important because it allows the same word to appear in different topics with potentially different weights. In other words, a topic can be best described as a weighted list of words that all express the same underlying theme.2
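As a toy illustration of this idea (the topics and weights below are invented, not estimated from the campaign corpus), a topic is simply a probability distribution over the vocabulary, and the same word can carry different weights in different topics:

```python
# Toy illustration: each topic is a probability distribution over words,
# and the word "state" appears in both topics with different weights.
topic_foreign_policy = {"iran": 0.30, "nuclear": 0.25, "deal": 0.20,
                        "state": 0.15, "israel": 0.10}
topic_government_reform = {"education": 0.35, "state": 0.25,
                           "program": 0.20, "student": 0.20}

# Each topic's weights sum to one, so it is a proper distribution.
for topic in (topic_foreign_policy, topic_government_reform):
    assert abs(sum(topic.values()) - 1.0) < 1e-9

# Ranking a topic's words by weight reads it as a weighted word list.
print(sorted(topic_foreign_policy, key=topic_foreign_policy.get, reverse=True))
# ['iran', 'nuclear', 'deal', 'state', 'israel']
```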
5 The Big Issues of the 2016 Election Cycle
The topics for 2016 are displayed in Table 1 below. These topics are determined without any outside guidance, aside from specifying in advance the number of topics to generate. The benefit of this approach is that partisan biases play no role, but the cost is that topics can sometimes be difficult to interpret.
Table 1: Topics Based on the 2016 Election Cycle.

| Topic | Interpretation | Stems | Percentage of documents |
|-------|----------------|-------|-------------------------|
| 11 | Debate speech | Go, re, peopl, know, think | 15.5% |
| 15 | The presidency | American, presidency, year, one, govern | 9.4% |
| 3 | Women and families | Family, hillary, work, women, right | 8.6% |
| 17 | Government reforms | Federal, education, state, program, student | 7.2% |
| 16 | US constitution | Court, president, constitution, state, donald | 6.8% |
| 20 | Class | People, country, sanders, american, street | 6.1% |
| 5 | Foreign policy | Iran, israel, deal, nuclear, obama | 5.8% |
| 12 | Social factors | Need, would, peopl, work, us | 5.8% |
| 19 | National security | World, america, nation, military, china | 5.4% |
| 8 | Taxes | Tax, income, rate, business, plan | 4.7% |
| 4 | Energy and environment | Energy, clean, climate, invest, state | 4.0% |
| 2 | Terrorism | Isis, state, terrorist, syria, islam | 4.0% |
| 7 | Healthcare | Health, care, plan, drug, insurance | 3.2% |
| 6 | Syrian civil war | Tha, photo, syria, civil, hide | 3.2% |
| 14 | Economic growth | New, economic, job, business, innovation | 2.9% |
| 1 | Financial system | Financial, bank, would, street, wall | 2.5% |
| 18 | Immigration | Immigration, law, illegal, border, enforce | 2.2% |
| 9 | Veterans | Veteran, care, service, ensure | 2.2% |
| 13 | Trump | Trump, guy, deal, great, right | 0.7% |
| 10 | Stump speech | Ofth, applaus, presid, obama, ofour | 0.0% |
Each topic is accompanied by a cloud of the words that occur most often in the topic. The larger the font size, the stronger its association with the given topic. Words are stemmed so that different forms of a word are treated as one token (for example, “chang” can be “change,” “changes,” or “changing”).
It’s important not to conflate these word clouds with those generally seen elsewhere, which merely reflect the frequency of words in a given corpus. The words represented here are those most often found in proximity to one another, and hence correspond to what we would colloquially refer to as a particular “issue” or even mode of rhetoric. We can think of a topic model as a rough approximation of some of the unconscious heuristics a human reader will rely on to answer the sorts of questions generally seen in high school English classes: “what is this paragraph about?” “What is its tone?” and so forth.
LDA has the ability to estimate topics that appear natural despite having no pre-assigned labels. Table 1 above reports the top five tokens in each topic using the 2016 election cycle corpus. The numbered topics facilitate the interpretation of the estimated priorities: how presidential candidates divide their attention over the topics. The table shows the top words that dominate each topic as well as the percentage of documents for which that topic is the dominant one. In Table 2, we show the dominant topics for documents by each presidential candidate. For example, 43% of all the documents related to Sanders revolve around Topic 20; only 2% of the documents related to Hillary focus on Topic 20, while none of the Republican candidates focused on Topic 20.
Table 2: Topics by Candidate Based on the 2016 Election Cycle.

| Topic | | | | | | | |
|-------|---|---|---|---|---|---|---|
| Women and families | 30% | 16% | 4% | 0% | 3% | 1% | 0% |
| Energy and environment | 8% | 8% | 0% | 0% | 3% | 4% | 0% |
| Syrian civil war | 2% | 0% | 0% | 9% | 3% | 6% | 0% |
6 Priorities of the 2016 Presidential Candidates
This section discusses and labels the content that LDA estimates. The topics are represented as word clouds: each word cloud represents the probability distribution of words within a given topic, and the size of a word indicates its probability of occurring within that topic. A critical step is to use our judgment to attribute meaning to the probability distribution of words that creates a topic. The central finding of this paper is that there is little to no emphasis on economic themes and no topics that discuss macroeconomic policy or, in particular, what to do about recessions.
Table 2 reveals that Topics 3 and 12 are dominant for Hillary Clinton. Topic 3 (Figure 1) gathers many terms related to women and families. The dominant token is “family,” alongside other important words like “work,” “women,” and “right.” The only other candidate focused on this topic was Sanders. A uniquely dominant topic for Clinton was Topic 12 (Figure 2), which collects terms relating to society more generally, such as “people,” “work,” and “need.”
For Sanders, Table 2 shows that Topic 20 is uniquely dominant; like Clinton, Sanders also emphasized Topic 3, which focuses on women’s rights, children, and work. Topic 20 (Figure 3) represents Sanders’s emphasis on inequality, Wall Street, and wealth, reflecting his attacks on Wall Street and his criticism of big banks. In general, the words associated with this topic reflect Sanders’s focus on class.
Table 2 shows that Topic 11 (Figure 4), which per Table 1 is the most common topic overall, is by far the most dominant topic for any single candidate and is most strongly associated with Trump. It is the dominant topic for Trump, Kasich, and Carson.
Topic 13 (Figure 5) does not seem to revolve around any domestic economic themes either, but includes dominant words such as “great,” “deal,” and “billion,” as well as “Mexico” and “China,” which capture Trump’s foreign policy proposals. Republican candidate Donald Trump was also focused on Topic 15, which mainly addresses the presidency, as well as Topic 14 (Figure 6). Topic 14 contains optimistic rhetoric about jobs and economic growth through manufacturing and innovation, but does not address macroeconomic policy.
Topic 15 (Figure 7), one of the dominant topics among Republican candidates, was also a prevailing theme for Ted Cruz. Cruz was also uniquely focused on Topic 16 (Figure 8), which centered on the Constitution and included tokens such as “court,” “president,” “constitution,” and “state.”
One of the dominant topics for Kasich was Topic 17 (Figure 9). Topic 17 relates to government reforms and includes an emphasis on education and other government programs. Some of the dominant words associated with the educational system are “student,” “common” (as in “Common Core”), “college,” “tuition,” “teacher,” and a word reflecting the result of that system: “career.” “State,” “federal,” “program,” and “budget” are other dominant words under this topic, reflecting Kasich’s focus on government reforms. Carson was the only other candidate focused on this theme.
In contrast, Table 2 shows that Rubio’s emphasis was on Topics 5 and 15. Topic 15 (Figure 7), a theme also reflected in the Trump and Cruz rhetoric, shows a focus on President Obama, capturing Rubio’s pointed opposition to President Obama’s policies. Topic 5 (Figure 10), meanwhile, is focused on foreign policy and is uniquely associated with Rubio. Its dominant terms include “Iran,” “Israel,” and “nuclear,” topics Rubio emphasized while others did not.
Carson’s dominant topic was Topic 19 (Figure 11). Similar to other word clouds, this topic seems to be far from any economic theme and instead focuses on national security and foreign policy. Some of the terms that are dominant within this topic are “military,” “nation,” “defense,” and “threat.”
Over the last few elections, the economy was a central issue. Government spending has ballooned to lofty levels, and the national debt has increased dramatically. Because so much of the national discussion has been dominated by the economy, deficits, tax rates, and debt ceilings, it’s worrisome and strange that we have not heard much about it in this year’s election cycle.
Given the current state of the economy, there is no doubt that the real economic issues in this year’s election cycle bear an uncanny resemblance to those at the time of the election battle between Bill Clinton and George Bush. Twenty-four years ago, however, the election focused on economic issues, whereas our analysis shows little to no focus on the economy today. According to a CNN opinion research poll, voters rank the economy as the most important issue by a 35-to-25-percent margin, yet any discussion of macroeconomic policy seems to be absent.
While Sanders has focused on class issues, this has not led to a discussion of raising wages and reversing inequality in this year’s election debates. Neither Clinton nor the Republican candidates have made an effort to spell out a concrete plan for raising wages and closing the gap between working families and the top 1 percent. In 1992, healthcare ranked only after the economy and the deficit as the top concern among voters, but healthcare accounts for only a small share of documents this cycle.
Answers that address the core challenges working families face, and approaches that offer hope that those challenges can be overcome, are missing. The phrase “climate change” is not even something the presidential candidates seem to worry much about. Issues of urban segregation and unequal education are also absent from this year’s election cycle.
The fundamental problem in the world economy is much the same problem it has faced since the collapse of the housing bubble threw the world economy into a recession in 2008. There is no recognition in the presidential debates that global manufacturing employment has peaked, meaning that globally the service sector is the future. The underlying problem is weak aggregate demand arising from the wealth lost when the housing bubble burst. The economy is growing far too slowly to get us back to full employment, and the labor market is still not tight enough to allow workers to achieve substantial real wage gains. No presidential campaign focuses on these topics, despite the concerns Americans express in polling.
Before doing any analysis, we converted words into tokens using the Natural Language Toolkit in Python. This involves multiple steps, designed to filter out subtleties of language that do not contribute to meaning but could introduce noise into automated processes like Latent Dirichlet Allocation.
We used the following steps:
- Convert text to lowercase.
- Remove special characters, including punctuation and apostrophes.
- Remove words that are one character long or that contain numbers.
- Remove stopwords, commonly occurring words with little informational content (like “the,” “of,” “an”) using the Natural Language Toolkit stopword list.
- Use the Porter Stemmer to strip suffixes from words, reducing each word form to its root token. For example, “issue,” “issues” and “issuing” become “issu.”
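A minimal sketch of these steps, assuming NLTK is installed; the real pipeline uses NLTK’s full English stopword list, for which a short inline subset stands in here:

```python
import re
from nltk.stem import PorterStemmer

# Short stand-in for NLTK's full English stopword list.
STOPWORDS = {"the", "of", "an", "a", "and", "are", "is", "to", "in", "that"}
stemmer = PorterStemmer()

def tokenize(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # 2. strip punctuation/apostrophes
    tokens = text.split()
    tokens = [t for t in tokens                          # 3. drop 1-char and numeric tokens
              if len(t) > 1 and not any(c.isdigit() for c in t)]
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. drop stopwords
    return [stemmer.stem(t) for t in tokens]             # 5. Porter-stem to root tokens

print(tokenize("Issue, issues, and issuing"))  # ['issu', 'issu', 'issu']
```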
Once each document has been converted to a list of tokens, we construct a document-term matrix using the scikit-learn machine learning package in Python. This contains the counts of the number of times that each token appears in each document.
By using a document-term matrix, we inherently assume that each document is represented by a bag of words, where order of words does not matter. This is a useful simplification for constructing models of text, even though (strictly speaking) it is never true. For models like LDA, co-occurrence of words matters more than the order in which the words appear.
We then used the LDA package in Python to train the LDA model. We chose to use 20 topics in order to allow some differentiation among topics, without over-fitting our limited set of texts. Given the document-term matrix and the number of topics, the LDA package estimates the LDA parameters discussed below using a procedure called Gibbs sampling.
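As a stand-in sketch: the paper fits the model with the `lda` package (collapsed Gibbs sampling), while scikit-learn’s LatentDirichletAllocation, used below, fits the same model by variational inference; the corpus and settings are toy values, not the paper’s:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed campaign documents.
docs = [
    "tax incom rate busi plan", "tax rate busi plan incom",
    "health care drug insur plan", "health insur care drug plan",
]
dtm = CountVectorizer().fit_transform(docs)

# n_components plays the role of the paper's 20 topics; 2 suffices here.
model = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

doc_topics = model.transform(dtm)     # theta_d: topic proportions per document
dominant = doc_topics.argmax(axis=1)  # each document's highest-weighted topic
print(dominant)
```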
The Latent Dirichlet Allocation model assumes that documents are generated by the following process:
- Given the global topic parameter η: for each topic k, draw iid βk~Dirichlet(η). These are the word distributions for each topic; they are each represented as a vector, whose length is the number of unique words in the corpus.
- Given the global proportions parameter α: for each document d, draw iid θd~Dirichlet(α). These are the topic proportions for each document; each θd is represented as a vector, whose length is the number of topics.
- For each document d, draw the document’s nth word Wd,n as follows:
- Draw the word’s topic assignment Zd,n from the document’s topic distribution, Zd,n~Multinomial(θd).
- Draw the observed word Wd,n from the assigned topic’s word distribution, Wd,n~Multinomial(βZd,n).
Therefore, in order to train the LDA model, we need to estimate the latent variables, particularly βk and θd. The word clouds above are visual representations of βk for each k, while the assigned “dominant topic” for each document d is the mode of θd. The LDA package performs this estimation using a Bayesian method called Gibbs sampling (Griffiths and Steyvers, 2004).
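The generative process above can be simulated directly, assuming toy corpus sizes rather than the paper’s:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: K topics, V-word vocabulary, D documents of N words each,
# with symmetric Dirichlet priors eta and alpha.
K, V, D, N = 3, 8, 5, 20
eta, alpha = 0.1, 0.5

beta = rng.dirichlet(np.full(V, eta), size=K)     # beta_k: word distribution per topic
theta = rng.dirichlet(np.full(K, alpha), size=D)  # theta_d: topic proportions per document

docs = []
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])          # Z_{d,n}: topic assignments
    w = [rng.choice(V, p=beta[k]) for k in z]      # W_{d,n}: observed words
    docs.append(w)

dominant = theta.argmax(axis=1)  # mode of theta_d: each document's dominant topic
print(dominant)
```

Inference runs this process in reverse: given only the observed words, it recovers distributions over the latent βk and θd.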
The authors are employed by The Brattle Group. The authors would like to thank Leif Shen and Alexander Hoyle for providing research assistance.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Ghayad, Rand and Michael Cragg. 2015. “Growing Apart: The Evolution of Income vs. Wealth Inequality.” The Economists’ Voice. 12 (1): 1–12.
Griffiths, Thomas L., and Mark Steyvers. 2004. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101: 5228–5235.
Hansen, Stephen, Michael McMahon, and Andrea Prat. 2015. “Transparency and Deliberation within the FOMC: A Computational Linguistics Approach.” Working paper, November 4, 2015.