Equity returns and sentiment

Abstract: This paper analyzes approximately 100 gigabytes of raw text data from Twitter with the keywords "AAPL," "S&P 500," "FTSE100," and "NASDAQ" to explore the relationship between sentiment and the returns and prices on the Apple stock and the S&P 500, FTSE 100, and NASDAQ indices. The findings point to a significant relationship and dependence between sentiment measures and the S&P 500 and FTSE 100 indices' returns and prices. The econometric analysis of dependence between the aforementioned variables is presented in some detail to illustrate the methodology employed.


Introduction
Quantifying subjective opinion and using it as a predictor of stock market returns and prices has become an important topic of research and empirical analysis in academia and industry. According to the efficient-market hypothesis (EMH) developed by Fama (see Fama et al. [11], Fama [12], and the references therein), information, and thus the news, drives equity prices, as an "asset price reflects all available information." This paper asks whether public sentiment provides valuable information that affects stock prices and quantifies the significance of the effects of public opinion and sentiment on equity prices. These questions are motivated by research and developments in behavioral economics and behavioral finance. In particular, research in these fields suggests that asset prices could be affected by human psychological or behavioral factors (Bollen et al. [4]). For instance, according to Gruhl et al. [17], Liu et al. [24], and Mishne and de Rijke [27], sales of books, movies, and other products can be predicted by sentiment in social media such as blogs and Twitter posts. Hence, it is reasonable to assume that public feeling could affect the returns and prices of financial assets and indices. This is further consistent with research in psychology, which indicates that emotion plays a significant role in human decision-making (Damasio [9]). Studies in behavioral finance and related fields also indicate that emotion and sentiment contribute meaningfully to financial decisions and investors' performance [25,28]. Moreover, research indicates that downward pressure on market prices is related to high media pessimism in publications in the Wall Street Journal [33].
To study the effect of public mood on the stock price, we need a reliable, representative, and accessible proxy that can be used to construct time series of sentiment measures. Large-scale surveys of public mood are impractical: they are resource-intensive and have great difficulty producing time-sensitive data. On the other hand, Twitter, a popular social media website launched in 2006, has millions of posts per day. The average number of monthly active Twitter users is over 330 million. Users of Twitter come from a variety of backgrounds, including CEOs and analysts as well as the general public, which makes up the majority of users. Therefore, it is reasonable to choose the sentiment of Twitter posts as a proxy for the public mood [4]. Many papers in the literature have focused on the analysis of the relation between Twitter sentiment and stock market returns (see, among others, Behrendt and Schmidt [3], Corea [8], Groß-Klußmann et al. [16], Mittal and Goel [26], Ranco et al. [30], and Washha et al. [34]). In the first stage, this paper investigates the relationship between prices and public mood for three major indices, the S&P 500, NASDAQ, and FTSE 100, and one corporate stock, Apple Inc., using a large database of collected posts on Twitter with the indices' and the stock's tickers as keywords. In particular, the database includes approximately 8,029,963 posts for each keyword (3,000 randomly picked Tweets per day) from January 1, 2008 to April 1, 2016. We use Granger causality tests to investigate whether a change in public sentiment can cause a change in stock prices.
In the second stage, we explore models that may be able to further explain the relationship between Twitter sentiment and the returns and prices on the S&P 500 and FTSE 100 indices. We employ and estimate Nonlinear Generalized AutoRegressive Conditional Heteroskedasticity (NGARCH) time series to explain and quantify the relationship and to model the effects of Twitter sentiment on volatility clustering in financial markets.
In the third stage, we focus on the analysis of the effects of Twitter sentiment on market volatility using the fitted GARCH models and Granger causality tests.

Data acquisition
This section discusses where and how the stock price and Twitter data are acquired and describes the methods for quantifying the sentiment in the text from Twitter posts and generating the data on sentiment measures that can be used in the analysis, tests, and model development.
The data on stock prices for the assets considered are downloaded from Yahoo Finance. For the sentiment data, an R program was developed that uses the Twitter application programming interface (API) to download a fixed number of Twitter posts (tweets) containing a particular keyword for each day throughout the time interval under consideration. In particular, we used the stable CRAN version of the twitteR library [14], which is built on the standard Twitter API (Twitter application access API at http://dev.twitter.com/). We used the "searchTwitter" function in this package to obtain data for a given date, keywords, maximum number of returned tweets (limited by the API capability at the time), and other search strings. This program analyzed a total of 100 GB of data from Twitter. Following Tetlock [33], to quantify sentiment, this paper uses various types of sentiment measures as suggested by the General Inquirer categories in the Harvard Psychological Dictionary [13].
The general analysis procedure employed in the paper is depicted in the flow graph (Figure 1).

Twitter mining algorithm
This section describes the algorithm that was used to convert Twitter text data into time series of different sentiment measures.


First, for each day in the period from January 1, 2008 to January 4, 2016, a random sample of 3,000 Tweets was extracted. All Tweets from the same day were collected into one large text file that was used as a proxy for public comments on Twitter. For each of the downloaded daily text files, all punctuation and other symbols (e.g., "https://") were removed to form a crude corpus. The crude corpus was then filtered further to remove words that are meaningless for the purpose of sentiment quantification, such as "is" and "this," to form the final daily corpus.
Second, each daily corpus was checked against the Harvard Psychological Dictionary's General Inquirer categories [13], with the four broad classes Positive, Negative, Active, and Passive and their subclasses Affiliation, Hostile, Strong, and Weak, by counting the frequency of words in the corpus that fall into a particular category. Hence, for each day, eight values, one word frequency for each category, were obtained from the collected Tweets. The process was repeated for every day in the time interval considered. Following this procedure, time series of the raw sentiment data from Twitter were generated.
The example in Appendix A1 demonstrates how the algorithm works for five randomly acquired Tweets on a particular date. The Tweets considered in Appendix A1 are not sentimentally neutral and contain polarized words such as "drop," "Active," and "unable" that indicate their sentiment orientation to some extent. This observation also provides a logical justification for using the categorization method in the paper.
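The cleaning and counting steps described above can be sketched as follows. This is a minimal illustration and not the R program used in the paper: the mini-lexicon, the stop-word list, and the two sample Tweets are hypothetical stand-ins, and the real General Inquirer categories are far larger.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for the General Inquirer categories;
# the word lists are illustrative only.
LEXICON = {
    "Positive": {"gain", "strong", "rally"},
    "Negative": {"drop", "unable", "loss"},
    "Active": {"buy", "rally", "drop"},
    "Passive": {"wait", "hold"},
}
# Assumed stop-word list for the "meaningless words" filter.
STOP_WORDS = {"is", "this", "the", "a", "to", "and"}


def build_corpus(tweets):
    """Merge one day's tweets into a cleaned list of lowercase tokens."""
    text = " ".join(tweets).lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, digits, symbols
    return [w for w in text.split() if w not in STOP_WORDS]


def category_counts(tokens, lexicon=LEXICON):
    """Count how many tokens fall into each sentiment category."""
    freq = Counter(tokens)
    return {cat: sum(freq[w] for w in words) for cat, words in lexicon.items()}


# Two made-up tweets for a single day.
day_tweets = [
    "AAPL shares drop after earnings, unable to hold gains https://t.co/x",
    "This is a strong rally! Buy AAPL.",
]
counts = category_counts(build_corpus(day_tweets))
print(counts)
```

Repeating the call for each daily file would give one count per category per day, that is, the raw sentiment time series. Note that exact matching does not treat "gains" as "gain"; whether and how word forms are normalized is a design choice that this sketch leaves open.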
Third, to obtain time series for testing Granger causality between the Twitter sentiment and the stock price, several different sentiment measures are used in the paper. The sentiment measures considered are inspired by the analysis of Zhang and Skiena [35], with #Positive, #Negative, etc. standing for the number of words in the positive, negative, and other corresponding sentiment categories. The Polarity sentiment measure is defined as follows:
$$\text{Polarity} = \frac{\#\text{Positive} - \#\text{Negative}}{\#\text{Positive} + \#\text{Negative}}. \tag{2.1}$$
Unlike asset prices, the Polarity measure is not guaranteed to be positive, as the number of positive words in the Twitter posts considered is not necessarily larger than the number of negative words. To ensure positivity of the sentiment measures considered, without loss of generality, Zhang and Skiena's Polarity measure is modified into the "Relative Positive" measure. The Relative Affiliation, Relative Active, and Relative Strong measures are defined in a similar way (Eqs. (2.2)-(2.5)); the categories Affiliation and Strong are subclasses of the Positive and Active categories.
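As a small worked illustration of why Polarity can change sign, the sketch below evaluates the Zhang and Skiena polarity, (#Positive - #Negative) / (#Positive + #Negative), on hypothetical daily word counts. The counts and the handling of a day with no polar words are assumptions made for the example.

```python
def polarity(n_positive, n_negative):
    """Polarity in the sense of Zhang and Skiena: (P - N) / (P + N)."""
    total = n_positive + n_negative
    if total == 0:
        return 0.0  # assumption: treat a day with no polar words as neutral
    return (n_positive - n_negative) / total


# Hypothetical (#Positive, #Negative) counts for three days.
daily_counts = [(120, 80), (60, 90), (50, 50)]
series = [polarity(p, n) for p, n in daily_counts]
print(series)  # → [0.2, -0.2, 0.0]
```

The second day has more negative than positive words, so the measure is negative, which is the property that motivates the positive-valued relative measures.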

Granger causality analysis
Once the sentiment data have been acquired, Granger causality tests are performed to investigate whether there is causality between the Twitter sentiment and the stock price, which also gives a partial answer to the question of the relationship between these variables. The Granger causality tests are applied to time series confirmed to be stationary I(0) processes.

(Non-)stationarity analysis
First, we conduct augmented Dickey-Fuller (ADF) tests to investigate the presence of a unit root in the processes considered. Testing for a unit root in a time series $X_t$ is based on the ordinary least squares (OLS) regression
$$\Delta X_t = \beta_0 + \alpha t + \delta X_{t-1} + \sum_{i=1}^{p} \gamma_i \Delta X_{t-i} + \varepsilon_t,$$
where $\beta_0$ is a constant, $\alpha$ is a time trend coefficient, $\gamma_1, \ldots, \gamma_p$ are the coefficients on the lagged differences, and $\varepsilon_t$ is the innovation process with zero mean. Under $H_0: \delta = 0$, the process is nonstationary, and $H_a: \delta < 0$ corresponds to stationarity in $X_t$ (Dickey and Fuller [10], Ch. 17 in Hamilton [19], and Section 15.7 in Stock and Watson [31]).
The results of the ADF tests indicate that all the processes considered, including all the time series of sentiment measures and the logarithms of stock prices, are unit root I(1) processes. The analysis and tests of Granger causality in the paper are therefore based on the stationary I(0) first differences of these processes. In other words, in the following section, we focus on testing Granger causality between the changes in the log price and the changes in the Twitter sentiment measures considered.
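The move from I(1) levels to I(0) first differences can be illustrated with a simplified Dickey-Fuller regression. The sketch below is not the ADF test used in the paper: it is an assumption-laden simplification that omits the time trend and the lagged differences, and it runs on a simulated random walk rather than on actual prices. The statistic does not follow a standard normal distribution under the null; with an intercept only, the 5% critical value is approximately -2.86.

```python
import math
import random


def first_differences(series):
    """Return X_t - X_{t-1} for t = 1, ..., T-1."""
    return [b - a for a, b in zip(series, series[1:])]


def dickey_fuller_t(series):
    """t-statistic on delta in the OLS regression dX_t = b0 + delta * X_{t-1} + e_t."""
    y = first_differences(series)
    x = series[:-1]
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    delta = sxy / sxx
    b0 = y_bar - delta * x_bar
    rss = sum((yi - b0 - delta * xi) ** 2 for xi, yi in zip(x, y))
    se_delta = math.sqrt(rss / (n - 2) / sxx)
    return delta / se_delta


random.seed(0)
# Simulated log-price path: a random walk, hence I(1) by construction.
log_price = [0.0]
for _ in range(999):
    log_price.append(log_price[-1] + random.gauss(0.0, 0.01))

returns = first_differences(log_price)  # changes in the log price, i.e. returns
print("levels:     ", round(dickey_fuller_t(log_price), 2))
print("differences:", round(dickey_fuller_t(returns), 2))
```

For the differenced series, the statistic lies far below the critical value, so the unit root null is rejected, which matches the logic of running the Granger causality tests on first differences.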

Autoregressive distributed lag models in Granger causality tests
A process $X_t$ is said to Granger cause a process $Y_t$ if the lags of $X_t$ have useful predictive content for forecasting $Y_t$ above and beyond the other regressors, e.g., the lags of $Y_t$ itself, in the model (see, among others, Section 15.4 in [31]). The Granger causality test is usually carried out using autoregressive distributed lag (ADL) models
$$Y_t = \beta_0 + \sum_{j=1}^{J} \beta_j Y_{t-j} + \sum_{k=1}^{K} \gamma_k X_{t-k} + u_t, \tag{3.1}$$
where $J$ and $K$ are the numbers of lags of $Y_t$ and $X_t$ and $u_t$ is the error term. The test is an $F$-test on all the coefficients on the lags of $X_t$. The null hypothesis is $H_0: \gamma_1 = \cdots = \gamma_K = 0$, which means that $X_t$ is not a useful predictor of $Y_t$ given the lags of the latter process. The alternative hypothesis is $H_a: \gamma_k \neq 0$ for at least one $k$, which corresponds to the property that the lags of $X_t$ do have some useful predictive content for forecasting $Y_t$ beyond the lags of $Y_t$ itself.
Let $P_t$ denote the logarithm of the price of a stock or index considered and let $Sem_t$ denote a sentiment measure. The test of whether the Twitter sentiment Granger causes the stock price is based on the model
$$\Delta P_t = \beta_0 + \sum_{j=1}^{J} \beta_j \Delta P_{t-j} + \sum_{k=1}^{K} \gamma_k \Delta Sem_{t-k} + u_t.$$
Similarly, the test of whether the stock price Granger causes the Twitter sentiment is based on the model
$$\Delta Sem_t = \beta_0 + \sum_{j=1}^{J} \beta_j \Delta Sem_{t-j} + \sum_{k=1}^{K} \gamma_k \Delta P_{t-k} + u_t,$$
with coefficients, lag lengths, and errors specific to each regression.
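The mechanics of the Granger causality F-test can be sketched as below. This is an illustrative implementation under stated assumptions, not the paper's code: it uses the homoskedasticity-only F-statistic that compares the restricted regression (own lags only) with the unrestricted one (own lags plus lags of the other variable), a small hand-written OLS solver, and simulated series in which the "sentiment" changes lead the "returns" by construction. The names `ols_rss` and `granger_f` are hypothetical helpers.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def granger_f(y, x, J, K):
    """F-statistic for H0: all K coefficients on the lags of x are zero."""
    m = max(J, K)
    rows_u, rows_r, target = [], [], []
    for t in range(m, len(y)):
        own = [1.0] + [y[t - j] for j in range(1, J + 1)]
        rows_r.append(own)                                      # restricted model
        rows_u.append(own + [x[t - k] for k in range(1, K + 1)])  # unrestricted
        target.append(y[t])
    rss_r, rss_u = ols_rss(rows_r, target), ols_rss(rows_u, target)
    dof = len(target) - (1 + J + K)
    return ((rss_r - rss_u) / K) / (rss_u / dof)


random.seed(1)
n = 500
# Simulated data: sentiment changes lead returns by one period.
d_sent = [random.gauss(0, 1) for _ in range(n)]
d_price = [0.0] + [0.5 * d_sent[t - 1] + random.gauss(0, 1) for t in range(1, n)]

print("sentiment -> returns:", round(granger_f(d_price, d_sent, J=2, K=2), 2))
print("returns -> sentiment:", round(granger_f(d_sent, d_price, J=2, K=2), 2))
```

Under the null, the statistic is compared with critical values of the F distribution with K and T - J - K - 1 degrees of freedom, where T is the number of observations used in the regression.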

Determination of the number of lags using the BIC criterion
We use the Bayesian information criterion (BIC) to determine the number of lags of the processes $X_t$ and $Y_t$ ($\Delta P_t$ and $\Delta Sem_t$) in the ADL models considered. More precisely, as usual, the number $J$ of lags in autoregressive (AR) models for the process $Y_t$ is first determined on the basis of the BIC, and the criterion is then applied to determine the number of lags $K$ of the potential predictor $X_t$ in model (3.1) with the estimated number $J$. The results of the lag length selection on the basis of the BIC for Granger causality tests are provided in Appendix A2.
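The two-step lag selection can be sketched as follows. The sketch assumes the common form BIC = ln(RSS/T) + (number of coefficients) × ln(T)/T, evaluates all candidate models on the same sample, and uses simulated data; the maximum lag of 6 and the helper names `ols_rss`, `bic`, and `select_lags` are assumptions made for illustration.

```python
import math
import random


def ols_rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def bic(y, x, J, K, start):
    """BIC of the ADL(J, K) regression fitted on observations start, ..., T-1."""
    rows = [[1.0] + [y[t - j] for j in range(1, J + 1)]
            + [x[t - k] for k in range(1, K + 1)]
            for t in range(start, len(y))]
    target = y[start:]
    T = len(target)
    return math.log(ols_rss(rows, target) / T) + (1 + J + K) * math.log(T) / T


def select_lags(y, x, max_lag=6):
    """Step 1: choose J from AR models for y; step 2: choose K given J."""
    J = min(range(0, max_lag + 1), key=lambda j: bic(y, x, j, 0, max_lag))
    K = min(range(1, max_lag + 1), key=lambda k: bic(y, x, J, k, max_lag))
    return J, K


random.seed(2)
n = 800
# Simulated predictor and target; y depends on its own lag and on lagged x.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.4 * y[-1] + 0.5 * x[t - 1] + random.gauss(0, 1))

J, K = select_lags(y, x)
print("selected lags: J =", J, ", K =", K)
```

Fixing the estimation sample at `start = max_lag` keeps the criterion comparable across lag lengths, since every candidate model is then fitted to the same observations.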

The results of Granger causality tests
The results of tests of Granger causality between the considered asset returns and the corresponding Twitter sentiment are provided in Tables 1 and 2. Similar to Section 3.2, the null hypotheses in the tables are that the changes in sentiment do not Granger cause the change in the log prices, that is, the returns, and vice versa.
The results in Table 2 indicate that, somewhat surprisingly, the changes in (log) prices apparently do not Granger cause the Twitter sentiment. These conclusions are in contrast to the conventional belief that changes in asset prices affect public sentiment.
Further, the results in Tables 1 and 2 point to the conclusion that the changes in Twitter sentiment related to the S&P 500 and FTSE 100 indices Granger cause their price changes but not vice versa. In particular, according to Table 1, among all the sentiment measures and the assets considered, the effect of the Relative Positive measure on S&P 500 returns appears to be the most significant, with the test statistic significant at the 1% level. In contrast, the returns on the NASDAQ index and the Apple stock appear not to be Granger caused by the respective Twitter sentiment.

S&P 500: Granger causality tests using the big data
To further evaluate and confirm the results on Granger causality in the previous section, we conduct the tests of Granger causality between the S&P 500 returns and the respective sentiment measures using a very large-scale database.
In contrast to the first-stage analysis (3,000 randomly selected tweets with the target keywords, obtained by limiting the maximum number of returned tweets in the twitteR searchTwitter function), in this section we do not limit the maximum number of returned tweets in searchTwitter and instead request the maximum number that the Twitter application access API can provide in one request. Tweets with the indices and the stock tickers as keywords are used in the search. The twitteR library is based on the Twitter application access API, which limits not only the maximum number of tweets in each request but also the number of requests from a given IP in a given time period. To acquire large data for the analysis, we registered multiple accounts (hence multiple tokens) and switched the IP each time a request was sent. The analysis is based on all the data that could be acquired from the Twitter API for the index in the time period considered. The analysis is not conducted for the FTSE 100 index, as there are far fewer posts related to it than to the S&P 500. As indicated earlier, for the S&P 500, as much data as possible was extracted and analyzed for each day in the time period considered, with approximately 14,005 Tweets per day (an average over all the data obtained with the target keywords) and 42,478,072 Tweets in total.
The general observation is that the number of tweets obtained per day with a given keyword increases over time. This is consistent with the fact that Twitter has become increasingly well known, with more people posting their thoughts on it over time. For example, some keywords have only 5,000 tweets per day in 2008, but the number of tweets obtained per day grows in the following years. Eventually, the daily number of tweets obtained for a specific keyword is limited by the Twitter API we used.
The results of Granger causality tests for the S&P 500 returns and the sentiment measures using the large-scale database are provided in the second rows of Tables 3 and 4. For comparison, we also provide, in the first rows of the tables, the Granger causality test statistics for the S&P 500 from the previous section.
Notes: *** indicates significance at the 1% level and ** indicates significance at the 5% level.
The results in Tables 3 and 4 using the large-scale data confirm the results in the previous section that the Twitter sentiment related to the S&P 500 index appears to Granger cause its returns and (log) price changes but not vice versa.
In conclusion, the returns and the prices of the S&P 500 and FTSE 100 indices appear to be Granger caused by the public sentiment. On the other hand, according to the results in this and the previous section, the changes in prices of the assets considered appear not to Granger cause the respective sentiment.

Causality modeling: ADL and GARCH models
As discussed in the previous section, the Twitter sentiment appears to Granger cause the returns on the S&P 500 and FTSE 100 indices. In this section, we focus on the analysis of models for the relationship between the returns on the indices and the respective sentiment. In particular, we evaluate the ADL models for the relationship and further fit GARCH time series to model the effects of sentiment volatility on market volatility.

Volatility clustering
We first focus on the estimation of ADL models for the relationship between the returns on the S&P 500 and FTSE 100 indices and the lags of the sentiment measures. Similar to the analysis in the previous section, following the results in Section 3.1, the models are estimated for the stationary changes in the log prices (the returns) and the stationary changes in the sentiment measures considered.
The estimated ADL models include all the sentiment measures that were shown in the previous section to Granger cause the returns on the indices. The estimated models (4.1) are thus ADL regressions of the returns on their own lags and on the lags of the changes in these sentiment measures.

Notes: *** indicates significance at the 1% level and ** indicates significance at the 5% level.
ADL models (4.1) estimated by OLS result in a poor fit for the time series of the returns on both the S&P 500 and FTSE 100 indices. Further, the plots of the residuals from the ADL regressions point to pronounced volatility clustering in the errors of the estimated linear models.
The results on the poor fit of linear models for the returns and the presence of volatility clustering in the ADL regression errors and the returns are in accordance with the well known stylized facts of the absence of linear dependence and the presence of nonlinear dependence in financial returns (see Cont [7]).
Following the results, in the next section, we thus focus on the models capturing the volatility clustering in the ADL model errors and the returns on the indices considered.

Modeling Granger causality and volatility clustering: ADL models with NGARCH errors
To model Granger causality between the sentiment and the returns on the S&P 500 and FTSE 100 indices while accounting for volatility clustering in the returns, as usual, we employ GARCH-type time series. As is well known, GARCH-type processes can be used to capture and model most of the stylized facts of financial returns, including the absence of linear autocorrelations, the presence of volatility clustering and autocorrelations in squared returns, heavy-tailedness and conditional heavy-tailedness, and the leverage effect (see, among others, Alberg et al. [2], Christoffersen [6], and Cont [7]). In what follows, $\mathcal{F}_t$ denotes the filtration that contains all the information up to time $t$, and $N(0, 1)$ and $t(\nu)$ denote the standard normal and (heavy-tailed) Student-t distributions with $\nu$ degrees of freedom, respectively.
The ADL models with NGARCH errors exhibiting the aforementioned stylized facts are estimated by maximum likelihood with i.i.d. innovations $\varepsilon_t$ that have a standard normal or Student-t distribution, with the volatility dynamics given by the NGARCH model. The NGARCH specification for the errors in the ADL models for the index returns considered accounts for the absence of linear autocorrelations, volatility clustering, heavy-tailedness, and the leverage effect in the returns time series.
The models impose stationarity through the corresponding condition on the GARCH parameters. The results of the ML estimation of the aforementioned models are provided in the following sections.
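For orientation, the sketch below evaluates the conditional variance recursion and the Gaussian log-likelihood for one common NGARCH(1,1) parameterization, sigma2_t = omega + alpha * (e_{t-1} - theta * sigma_{t-1})^2 + beta * sigma2_{t-1}, with stationarity condition alpha * (1 + theta^2) + beta < 1. This specific form, the parameter names, the parameter values, and the residuals are assumptions made for illustration; the exact specification estimated in the paper may differ, and a full estimation would maximize the likelihood over the ADL and NGARCH parameters jointly.

```python
import math


def ngarch_variances(resid, omega, alpha, beta, theta):
    """Conditional variances of an assumed NGARCH(1,1) model for regression errors."""
    persistence = alpha * (1.0 + theta ** 2) + beta
    if not (omega > 0 and alpha >= 0 and beta >= 0 and persistence < 1):
        raise ValueError("parameters violate the stationarity condition")
    sigma2 = [omega / (1.0 - persistence)]  # start at the unconditional variance
    for e in resid[:-1]:
        s = math.sqrt(sigma2[-1])
        sigma2.append(omega + alpha * (e - theta * s) ** 2 + beta * sigma2[-1])
    return sigma2


def gaussian_loglik(resid, sigma2):
    """Log-likelihood of the errors under standard normal innovations."""
    return -0.5 * sum(math.log(2 * math.pi * s2) + e * e / s2
                      for e, s2 in zip(resid, sigma2))


# Hypothetical ADL regression errors and parameter values.
adl_resid = [0.010, -0.020, 0.015, -0.030, 0.005]
sigma2 = ngarch_variances(adl_resid, omega=1e-6, alpha=0.05, beta=0.9, theta=0.5)
loglik = gaussian_loglik(adl_resid, sigma2)
print([round(v, 8) for v in sigma2])
print(round(loglik, 4))
```

With theta > 0, a negative shock raises next-period variance more than a positive shock of the same size, which is how this parameterization represents the leverage effect.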

S&P 500
As shown in the results in Table 5, unlike linear ADL models estimated by the OLS, the ADL models with NGARCH errors described in the previous section provide an exceptional fit for the S&P 500 returns. The results in Table 5 further confirm that the changes in sentiment measures are useful predictors of the changes of the index prices and returns. In particular, in the case of the ADL models with NGARCH errors and heavy-tailed Student-t innovations, the lags of the changes in all the sentiment measures appear to be highly significant, with the corresponding p-values less than 0.001. Further, even in the case of ADL-NGARCH model errors with standard normal innovations, one of the sentiment measures, Relative Positive, exhibits statistical significance in predictive models for the S&P 500 returns.

FTSE 100
Similar to the S&P 500 case, the estimation results in Table 6 for predictive ADL regression models for FTSE 100 returns with NGARCH errors demonstrate statistical significance of the changes in the sentiment measures.
The results in Table 6 indicate statistical significance of the regressors, including all the sentiment measures considered, in the predictive ADL models for FTSE 100 returns both in the case of normal and nonnormal heavy-tailed Student-t innovations in NGARCH models for the regression errors. Similar to the previous section, the results confirm predictive power of the sentiment measures for prediction of the returns and further confirm the presence of volatility clustering and other stylized facts in the ADL regression errors and the returns dealt with, the properties not captured by ADL models estimated by the OLS.

Causality between asset price volatility and sentiment volatility
The results in Sections 3 and 4 indicate the presence of volatility clustering in the errors from the predictive ADL models, with the dynamics that can be modeled using NGARCH time series. In this section, we focus on the analysis of causality between the volatilities of the returns and sentiment processes. Similar to Patton [29], the analysis is based on Granger causality tests using the residuals from fitting GARCH-type models to both of the processes considered.

Models for causality between volatilities
Consider two time series $\{X_t\}$ and $\{Y_t\}$ and, as before, denote by $\mathcal{F}_t$ the filtration containing the information up to time $t$. The analysis of causality between the volatilities of the processes is based on Granger causality tests for the innovations (residuals) $z_t$ and $z'_t$ from the GARCH processes fitted to $\{X_t\}$ and $\{Y_t\}$. More precisely, the estimates of the GARCH parameters are obtained using maximum likelihood estimation (MLE), and tests of Granger causality are then conducted for the GARCH model residuals (standardized processes) $\varepsilon_t = X_t/\sigma_t$ and $\eta_t = Y_t/\delta_t$, where $\sigma_t$ and $\delta_t$ denote the fitted conditional volatilities of $\{X_t\}$ and $\{Y_t\}$. The Granger causality testing described in Section 3 is used to investigate whether there is causality between $\{\varepsilon_t\}$ and $\{\eta_t\}$. In particular, if the tests indicate that $\{\varepsilon_t\}$ Granger causes $\{\eta_t\}$, then this implies that the information contained in the past volatility of $\{X_t\}$ is useful for forecasting the volatility of $\{Y_t\}$. In the following analysis, the approach is applied with the time series $\{X_t\}$ and $\{Y_t\}$ being the processes of asset returns and the measures of Twitter sentiment considered.
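The first stage of this two-stage procedure, dividing each series by its fitted conditional volatility, can be sketched as below. The sketch assumes a plain zero-mean GARCH(1,1) filter with given parameters satisfying alpha + beta < 1; in the paper the parameters are estimated by MLE, whereas the values and the two short series here are hypothetical.

```python
import math


def garch_standardized(series, omega, alpha, beta):
    """Return series[t] / sigma_t for a GARCH(1,1) filter with given parameters."""
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    z = []
    for value in series:
        z.append(value / math.sqrt(sigma2))
        # Update the conditional variance for the next observation.
        sigma2 = omega + alpha * value ** 2 + beta * sigma2
    return z


# Hypothetical standardized returns and sentiment changes.
ret = [0.3, -1.2, 0.8, -0.1, 2.0, -1.5]
sent = [0.1, 0.4, -0.9, 1.1, -0.2, 0.6]

eps = garch_standardized(ret, omega=0.05, alpha=0.10, beta=0.85)
eta = garch_standardized(sent, omega=0.10, alpha=0.05, beta=0.85)
print([round(v, 3) for v in eps])
print([round(v, 3) for v in eta])
```

The two filtered series would then be passed to the Granger causality tests of Section 3 in place of the raw series.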

Data preparation
The analysis of Granger causality between volatilities is based on standardized returns and sentiment measures.
More precisely, given the observations on a time series $\{X_t\}$ (e.g., that of the returns or the sentiment measures), we consider its standardized version
$$STD(X_t) = \frac{X_t - \bar{X}}{s_X},$$
where, as usual, $\bar{X}$ and $s_X$ denote the sample mean and standard deviation of the time series observations. The analysis is based on the standardized time series $STD(\{r_t\})$ and $STD(\{Sem_t\})$ for the returns and sentiment processes $\{r_t\}$ and $\{Sem_t\}$, respectively. A few (approximately 5 out of 3,500) large outliers are observed in the standardized sentiment measure time series $STD(\{Sem_t\})$. The presence of the outliers may be due to a relatively large number of reposts of polarized, sentiment-oriented Tweets. This is similar to the observations and the discussion in Tetlock [32] on some news having textual similarity with others. In the case of Twitter, outliers caused by reposts may not represent the real sentiment, as people can write their own comments along with their reposts, and some of these comments could have a sentiment that is totally opposite to that of the reposted Tweets. Further, the presence of such large outliers may severely affect the fitting of GARCH models, in part because the analyzed sentiment measures are squared in the analysis.
To deal with the outliers, we assume that the maximum change in the standardized sentiment is the same as the maximum change in the standardized return and estimate a model with GARCH errors for the sentiment measures (see also Carnero et al. [5]). The estimation results in Appendices A4 and A5 indicate that the coefficient $k$ in this model is statistically significant and equal to 1 for all the sentiment measures considered. Hence, we further estimate the GARCH model for the correspondingly adjusted sentiment process, with the dynamics of the volatility $\sigma_t^2$ following Eq. (5.7).
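The standardization step and a simple screen for large outliers can be sketched as follows. The use of the n - 1 divisor for the sample standard deviation, the threshold of 3, and the data are assumptions made for the example; the paper does not state a numerical outlier threshold.

```python
import math


def standardize(series):
    """Subtract the sample mean and divide by the sample standard deviation."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / (n - 1))
    return [(v - mean) / sd for v in series]


def flag_outliers(z, threshold=3.0):
    """Indices of standardized observations larger than the threshold in absolute value."""
    return [i for i, v in enumerate(z) if abs(v) > threshold]


# Hypothetical sentiment changes with one repost-driven spike at index 10.
d_sem = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.1, -0.15, 0.0, 4.0,
         0.05, -0.1, 0.2, -0.2, 0.0, 0.1, -0.05, 0.05, -0.1]
z = standardize(d_sem)
print(flag_outliers(z))  # → [10]
```

Because GARCH fitting works with squared observations, a single spike of this kind can dominate the variance dynamics, which is the concern raised in the text.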

Granger causality tests for sentiment and return volatilities
The analysis and tests for Granger causality are similar to those discussed in Section 3. The Granger causality tests are conducted for the volatility of the standardized returns and the adjusted sentiment measures for the different categories, as discussed in Section 5.2. The results of the tests are as follows.
The results in Table 7 indicate that the volatility of the FTSE 100 index return appears to be Granger caused by the volatility of all the sentiment measures related to the index, while this is not the case for the S&P 500 return volatility. In addition, according to the results in Table 8, the volatility of the Affiliation sentiment measure related to the FTSE 100 index appears to be Granger caused by the index return volatility. The results in the tables further point to the absence of Granger causality between the return and sentiment volatility for the S&P 500.
1.503 × 10^−7 ***, 6.994 × 10^−8 ***, 8.891 × 10^−8 ***, 1.263 × 10^−7 ***
Notes: *** indicates significance at the 1% level.

Conclusion
This paper has focused on the problems of quantifying subjective sentiment and analyzing its use as a predictor for asset returns and prices. The study is based on a vast amount of data (approximately 100 GB) acquired from Twitter using text mining, with the sentiment quantified according to the General Inquirer categories of the Harvard Psychological Dictionary. The relation between the sentiment and the returns on the stock indices is analyzed using Granger causality tests and GARCH-type modeling.
The results of the analysis indicate that the Twitter sentiment apparently has predictive power for the returns on the S&P 500 and FTSE 100 financial indices. The results of the study further indicate that the volatility of the sentiment measures related to the FTSE 100 index appears to Granger cause the index return volatility, while this is not the case for the S&P 500 index.
An important problem that is left for further research is that of structural breaks in models for the relation and dependence between asset prices and returns and the sentiment, including the structural breaks due to the beginning of the on-going COVID-19 pandemic.
The paper uses Harvard Psychological Dictionary for sentiment analysis; further research may focus on applications of more recent sentiment analysis methods using artificial neural networks and other machine learning, such as Bidirectional Encoder Representations from Transformers (BERT) technique for natural language processing.
Due to the fact that the sentiment appears to Granger cause the returns on the financial indices considered, further analysis may also focus on predictive models incorporating the sentiment and other predictors, including the factors used in predictive regressions for financial returns and also on using the sentiment as a signal for trading. The analysis may be based on the widely used econometric methods as well as machine learning approaches.
As is common in the literature dealing with the analysis of dependence between financial and economic time series exhibiting volatility clustering, such as financial returns and foreign exchange rates (see, among others, Patton [29]), the analysis of Granger causality in the paper is based on estimated volatilities. Further research may focus on the development and the use of methods that account for the uncertainty introduced in the first stage of the analysis by volatility estimation. In particular, the robust t-statistic approaches to inference under heterogeneity, dependence, and heavy-tailedness developed by Ibragimov and Müller [21,22] (see also Section 3.3 in the study by Ibragimov et al. [23]) and their extensions appear to be promising in the context of econometric inference using general two-stage procedures, as the approaches do not require consistent estimation of the limiting variances of the estimators considered or of their standard errors. See, in particular, Ibragimov and Müller [21] for a discussion of the applicability and the properties of t-statistic approaches to inference in two-stage instrumental variable (IV) regressions and general GMM models, and Abduraimova [1] for applications of the approaches in IV regressions for the analysis of the effects of contagion on the tail risk in complex financial networks.

frequency (word count) of words that belong to a particular class in the Harvard Psychological Dictionary (Table A1).

A2 Lag length selection for the Granger causality tests
This appendix provides the results of the lag length selection for Granger causality tests using the BIC (Table A2).
A3 Volatility clustering in the residuals of the ADL linear models estimated by the OLS
Figures A1 and A2 provide the plots of the time series of the residuals from the ADL linear predictive models for the FTSE 100 and S&P 500 returns estimated by the OLS. According to the plots, pronounced volatility clustering is present in the time series of regression errors in the ADL models discussed in Section 4.1.

A4 GARCH model for FTSE 100 returns
The estimates of the parameters of the GARCH model fitted to the FTSE 100 returns are provided in the following table.

Figure A1: Residuals from the ADL linear predictive model for FTSE 100 returns.