Abstract
Predicting short-term stock returns remains a longstanding challenge in financial research due to high noise levels and the rapid incorporation of publicly available information into asset prices. While prior studies suggest that sentiment extracted from social media may improve market forecasting, many report implausibly strong results under evaluation frameworks that risk data leakage. This study examines whether sentiment derived from Twitter provides incremental predictive value for next-day stock returns when combined with historical price information under a leakage-free, time-series-appropriate evaluation protocol. Daily adjusted closing prices were collected for 25 publicly traded U.S. companies over a one-year period from September 2021 to September 2022, covering 252 trading days, and over 80,000 tweets referencing these firms were analyzed. Tweet sentiment was quantified using the VADER sentiment analyzer and aggregated at the daily, per-stock level using mean sentiment, tweet volume, and a missingness indicator. Linear regression, ridge regression, and gradient boosting regression models were evaluated using a strict chronological train–test split, with performance assessed using mean absolute error, root mean squared error, and R², and compared against a no-skill baseline predicting zero return. Results indicate that none of the evaluated models materially outperforms the baseline and that aggregated daily Twitter sentiment provides minimal incremental predictive value for next-day stock returns after controlling for lagged price information. These findings underscore the difficulty of short-horizon financial forecasting and the importance of leakage-free evaluation protocols.
Keywords: stock return prediction, investor sentiment, Twitter data, time-series forecasting, machine learning
Introduction
Investor sentiment has long been studied as a potential influence on financial market behavior1, particularly during periods of heightened uncertainty or market stress2, and broader research has examined the extent to which observable news and other public information explain short-term stock price movements3. With the growth of digital media, large-scale textual data from sources such as news articles and social media platforms have become increasingly accessible, motivating interest in sentiment-based approaches to financial prediction4,5,6,7,8,9. However, predicting short-horizon stock returns remains difficult, as financial markets rapidly incorporate publicly available information into prices, a core implication of the efficient market hypothesis10.
Despite extensive prior work, a gap remains in the literature regarding rigorous evaluation of sentiment-based prediction models under time-series–appropriate protocols. Many studies report unusually high predictive performance while relying on random data splits or same-day price targets11,12, raising concerns about data leakage and overstated conclusions13,14,15. This study addresses this gap by evaluating whether Twitter sentiment provides incremental predictive value for next-day stock returns when assessed under a leakage-free, chronological framework.
Twitter is selected as the sentiment source due to its widespread use in prior financial sentiment studies16,17 and the reproducibility of large-scale data collection. The scope of the study is limited to daily aggregation and next-day forecasting, with an emphasis on methodological rigor rather than model complexity18,19.
Accordingly, this study tests whether daily, Twitter-derived sentiment adds incremental predictive value for next-day stock returns after controlling for lagged price information, using a leakage-free chronological evaluation. The analysis is limited to U.S. equities over a one-year period with daily aggregation and a lexicon-based sentiment model, which may understate intraday effects and finance-specific language20,8,21,22,23.
Methods
This study employs an observational, retrospective time-series research design. The sample consists of 25 publicly traded U.S. companies selected based on data availability across both price and Twitter datasets. Daily adjusted closing price data were collected over 252 trading days from September 30, 2021 to September 29, 2022. Twitter data were collected over the same calendar period using keyword-based queries corresponding to company names and ticker symbols.
To ensure chronological consistency and prevent look-ahead bias, tweet timestamps were mapped to the next available trading day. Sentiment was computed using the VADER sentiment analyzer24, a rule-based model designed for short, informal text. For each stock and trading day, the mean sentiment score and tweet count were calculated, along with a binary indicator denoting days with no associated tweets.
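A minimal sketch of this aggregation step is shown below. The tiny trading calendar, column names, and sentiment scores are illustrative stand-ins, not the study's actual data or schema; the point is the calendar mapping and the per-stock daily aggregates (mean sentiment, tweet count, missingness flag).

```python
import pandas as pd

# Illustrative trading calendar and tweet table; names and values are
# assumptions for this sketch, not the study's actual schema.
trading_days = pd.DatetimeIndex(["2021-10-01", "2021-10-04", "2021-10-05"])

tweets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-10-01 09:00",  # trading day -> stays on Oct 1
        "2021-10-02 12:00",  # Saturday    -> rolled forward to Mon, Oct 4
        "2021-10-04 18:00",  # trading day -> Oct 4
    ]),
    "ticker": ["AAPL", "AAPL", "AAPL"],
    "vader_compound": [0.6, -0.2, 0.1],  # precomputed VADER compound scores
})

# Map each tweet's calendar date to the first trading day on or after it;
# look-ahead bias is then avoided by lagging these features at modeling time.
idx = trading_days.searchsorted(tweets["timestamp"].dt.normalize(), side="left")
tweets["trading_day"] = trading_days[idx]

# Per-stock, per-day aggregates: mean sentiment and tweet volume.
daily = (
    tweets.groupby(["ticker", "trading_day"])["vader_compound"]
    .agg(sent_mean="mean", tweet_count="count")
    .reset_index()
)

# Reindex on the full calendar so tweet-free days get a missingness flag.
full = daily.set_index("trading_day").reindex(trading_days)
full["sent_missing_flag"] = full["tweet_count"].isna().astype(int)
```

In this toy run the Saturday tweet is pooled with the Monday tweet, and the final calendar day, which has no tweets, receives the missingness flag.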
The dependent variable is the next-day log return25,26,27 of the adjusted closing price. Independent variables include lagged stock returns, rolling volatility measures, daily mean sentiment, tweet volume, and the sentiment missingness indicator. The analytical procedure consisted of data collection, preprocessing, feature construction, model training, and out-of-sample evaluation.
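The target and price-based features can be sketched as follows. The price series is a toy example and the 5-day window is an assumption chosen to mirror the `vol_5` feature named in the diagnostics table; the key point is that the target is shifted one step ahead of every feature.

```python
import numpy as np
import pandas as pd

# Toy adjusted-close series; real inputs would be per-stock price histories.
prices = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0, 103.0, 105.0, 106.0])

log_ret = np.log(prices).diff()          # same-day log return r_t = ln(P_t / P_{t-1})

features = pd.DataFrame({
    "ret_1_lag": log_ret,                # day-t return, a predictor for day t+1
    "vol_5": log_ret.rolling(5).std(),   # 5-day rolling volatility
})
target = log_ret.shift(-1)               # next-day log return (dependent variable)

# Rows where either the features or the target are undefined (series edges)
# would be dropped before model fitting.
```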
Multicollinearity diagnostics
To assess potential multicollinearity among predictors, Variance Inflation Factors (VIF) were computed for the full feature set used in model estimation. Table 2 reports VIF values for all predictors.
| Feature | VIF |
| --- | --- |
| ret_1_lag | 1.001 |
| vol_5 | 1.070 |
| ma_5 | 1.146 |
| sent_mean_lag1 | 1.208 |
| tweet_count_lag1 | 1.225 |
| Volume | 1.307 |
| sent_missing_flag_lag1 | 1.309 |
All VIF values are close to 1 and well below commonly cited thresholds (e.g., 5 or 10), indicating minimal multicollinearity and suggesting that coefficient instability due to correlated predictors is unlikely. Nevertheless, ridge regression is included as a robustness check, as regularization mitigates coefficient sensitivity in the presence of correlated inputs.
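The diagnostic follows directly from the definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A NumPy-only sketch on a synthetic feature matrix (the data and the 0.3 correlation coefficient are illustrative, not the study's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the predictor matrix; one mildly correlated pair.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 2] + 0.3 * X[:, 0]

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j
    (with an intercept) on all remaining columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])     # design with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS fit
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

vifs = vif(X)
```

With only mild correlation, all values stay near 1, matching the pattern in the table above; a strongly collinear column would push its VIF toward the cited thresholds of 5 or 10.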
Three models were evaluated: linear regression, ridge regression, and gradient boosting regression. All models were trained and tested using a strict chronological split, with earlier observations used for training and later observations reserved for testing. Performance was evaluated using mean absolute error (MAE), root mean squared error (RMSE), and R², and compared against a no-skill baseline that predicts a return of zero.

All data used are publicly available and contain no personal identifying information. No human subjects were involved, and the study complies with NHSJS ethical guidelines for publicly available data.
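The evaluation protocol can be sketched end to end. The closed-form ridge estimator, the penalty value, and the synthetic return series below are illustrative assumptions rather than the study's actual configuration; what matters is the chronological split, the three metrics, and the zero-return baseline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 252 "trading days" of features and noisy returns.
X = rng.normal(size=(252, 4))
y = 0.02 * rng.normal(size=252)  # next-day log returns (pure noise here)

# Strict chronological split: earlier 80% trains, later 20% tests.
split = int(0.8 * len(y))
X_tr, X_te = X[:split], X[split:]
y_tr, y_te = y[:split], y[split:]

# Ridge regression in closed form, beta = (A'A + alpha*P)^-1 A'y;
# alpha = 1.0 is an illustrative choice, not the study's tuned value.
alpha = 1.0
A = np.column_stack([np.ones(len(X_tr)), X_tr])
P = alpha * np.eye(A.shape[1])
P[0, 0] = 0.0                                  # leave the intercept unpenalized
beta = np.linalg.solve(A.T @ A + P, A.T @ y_tr)
pred = np.column_stack([np.ones(len(X_te)), X_te]) @ beta

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def r2(y, p):
    return float(1.0 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2))

baseline = np.zeros_like(y_te)                 # no-skill: always predict zero
```

On signal-free returns like these, a fitted model's test R² is typically at or below zero, which is exactly the pattern a conservative baseline comparison is designed to expose.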
Results
Table 1 summarizes out-of-sample predictive performance across all evaluated models on the held-out test set.
| Model | MAE | RMSE | R² |
| --- | --- | --- | --- |
| Baseline (0 return) | 0.01973 | 0.02806 | — |
| Linear Regression | 0.01973 | 0.02809 | -0.0117 |
| Ridge Regression (tuned) | 0.01973 | 0.02809 | -0.0117 |
| Gradient Boosting Regressor | 0.02040 | 0.02878 | -0.0620 |
| GBR (no sentiment) | 0.02034 | 0.02878 | -0.0622 |
Note: R² is undefined for the constant baseline predictor.
As shown in Table 1, predictive performance is similar across all evaluated models and closely matches that of the no-skill baseline28,29. R² values are negative for all models, indicating that none explain variance in next-day stock returns beyond the baseline. Models incorporating Twitter sentiment do not exhibit lower error metrics than price-only variants, suggesting minimal incremental contribution from aggregated sentiment features.
Discussion
The results indicate that aggregated Twitter sentiment provides little to no incremental predictive value for next-day stock returns when evaluated under a leakage-free, chronological framework. This finding is consistent with the efficient market hypothesis10, which posits that publicly available information is rapidly incorporated into asset prices30. The research objective of assessing the incremental value of sentiment under conservative evaluation assumptions was met, with results suggesting that daily aggregated sentiment is insufficient for short-horizon forecasting.

Several limitations should be acknowledged. Sentiment was aggregated at the daily level, potentially obscuring intraday dynamics. The analysis focuses on a one-day forecast horizon and does not evaluate economic profitability or trading strategies. Additionally, the use of a lexicon-based sentiment model may limit sensitivity to nuanced financial language20,21,22,23. Future research may explore higher-frequency data, alternative sentiment representations21,31, or longer-term prediction horizons to further investigate conditions under which sentiment may play a larger role.

Taken together, these findings emphasize that methodological rigor is essential when evaluating sentiment-based financial models and that conservative benchmarks are critical for avoiding misleading conclusions32,15.
References
- M. Baker, J. Wurgler. Investor sentiment in the stock market. Journal of Economic Perspectives. Vol. 21, pg. 129–152, 2007, https://doi.org/10.1257/jep.21.2.129.
- P. C. Tetlock. Giving content to investor sentiment: The role of media in the stock market. Journal of Finance. Vol. 62, pg. 1139–1168, 2007, https://doi.org/10.1111/j.1540-6261.2007.01232.x.
- D. M. Cutler, J. M. Poterba, L. H. Summers. What moves stock prices? Journal of Portfolio Management. Vol. 15, pg. 4–12, 1989, https://doi.org/10.3905/jpm.1989.409212.
- J. Bollen, H. Mao, X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science. Vol. 2, pg. 1–8, 2011, https://doi.org/10.1016/j.jocs.2010.12.007.
- W. Antweiler, M. Z. Frank. Is all that talk just noise? The information content of internet stock message boards. Journal of Finance. Vol. 59, pg. 1259–1294, 2004, https://doi.org/10.1111/j.1540-6261.2004.00662.x.
- P. C. Tetlock, M. Saar-Tsechansky, S. Macskassy. More than words: Quantifying language to measure firms’ fundamentals. Journal of Finance. Vol. 63, pg. 1437–1467, 2008, https://doi.org/10.1111/j.1540-6261.2008.01362.x.
- J. Smailović, M. Grčar, N. Lavrač, M. Žnidaršič. Stream-based active learning for sentiment analysis in the financial domain. Information Sciences. Vol. 285, pg. 181–203, 2014, https://doi.org/10.1016/j.ins.2014.04.034.
- A. K. Nassirtoussi, S. Aghabozorgi, T. Ying Wah, D. C. L. Ngo. Text mining for market prediction: A systematic review. Expert Systems with Applications. Vol. 41, pg. 7653–7670, 2014, https://doi.org/10.1016/j.eswa.2014.06.009.
- J. Engelberg, C. A. Parsons. The causal impact of media in financial markets. Journal of Finance. Vol. 66, pg. 67–97, 2011, https://doi.org/10.1111/j.1540-6261.2010.01626.x.
- E. F. Fama. Efficient capital markets: A review of theory and empirical work. Journal of Finance. Vol. 25, pg. 383–417, 1970, https://doi.org/10.2307/2325486.
- S. Gu, B. Kelly, D. Xiu. Empirical asset pricing via machine learning. Review of Financial Studies. Vol. 33, pg. 2223–2273, 2020, https://doi.org/10.1093/rfs/hhaa009.
- G. Ranco, M. Aleksovski, G. Caldarelli, M. Grčar, I. Mozetič. The effects of Twitter sentiment on stock price returns. PLOS ONE. Vol. 10, e0138441, 2015, https://doi.org/10.1371/journal.pone.0138441.
- M. López de Prado. The 7 reasons most machine learning funds fail. Journal of Portfolio Management. Vol. 44, pg. 120–133, 2018, https://doi.org/10.3905/jpm.2018.44.4.120.
- A. Goyal, I. Welch. A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies. Vol. 21, pg. 1455–1508, 2008, https://doi.org/10.1093/rfs/hhm014.
- C. R. Harvey, Y. Liu, H. Zhu. …and the cross-section of expected returns. Review of Financial Studies. Vol. 29, pg. 5–68, 2016, https://doi.org/10.1093/rfs/hhv059.
- T. O. Sprenger, A. Tumasjan, P. G. Sandner, I. M. Welpe. Tweets and trades: The information content of stock microblogs. European Financial Management. Vol. 20, pg. 926–957, 2014, https://doi.org/10.1111/j.1468-036X.2013.12007.x.
- S. Gu, B. Kelly, D. Xiu. Empirical asset pricing via machine learning. Review of Financial Studies. Vol. 33, pg. 2223–2273, 2020, https://doi.org/10.1093/rfs/hhaa009.
- M. López de Prado. Advances in financial machine learning. Wiley, Hoboken, NJ, 2018.
- J. Y. Campbell, S. B. Thompson. Predicting excess stock returns out of sample: Can anything beat the historical average? Review of Financial Studies. Vol. 21, pg. 1509–1531, 2008, https://doi.org/10.1093/rfs/hhm055.
- T. Loughran, B. McDonald. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance. Vol. 66, pg. 35–65, 2011, https://doi.org/10.1111/j.1540-6261.2010.01625.x.
- D. Araci. FinBERT: Financial sentiment analysis using pre-trained language models. Proceedings of the ACL Workshop on Financial Technology and Natural Language Processing. pg. 1–7, 2019.
- C. Kearney, S. Liu. Textual sentiment in finance: A survey of methods and models. International Review of Financial Analysis. Vol. 33, pg. 171–185, 2014, https://doi.org/10.1016/j.irfa.2014.02.006.
- T. Loughran, B. McDonald. Textual analysis in accounting and finance: A survey. Journal of Accounting Research. Vol. 54, pg. 1187–1230, 2016, https://doi.org/10.1111/1475-679X.12123.
- C. J. Hutto, E. Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. pg. 216–225, 2014, https://doi.org/10.1609/icwsm.v8i1.14550.
- B. G. Malkiel. The efficient market hypothesis and its critics. Journal of Economic Perspectives. Vol. 17, pg. 59–82, 2003, https://doi.org/10.1257/089533003321164958.
- J. Y. Campbell, R. J. Shiller. The dividend-price ratio and expectations of future dividends and discount factors. Review of Financial Studies. Vol. 1, pg. 195–228, 1988, https://doi.org/10.1093/rfs/1.3.195.
- N. Jegadeesh, D. Wu. Word power: A new approach for content analysis. Journal of Financial Economics. Vol. 110, pg. 712–729, 2013, https://doi.org/10.1016/j.jfineco.2013.08.007.
- J. Y. Campbell, S. B. Thompson. Predicting excess stock returns out of sample: Can anything beat the historical average? Review of Financial Studies. Vol. 21, pg. 1509–1531, 2008, https://doi.org/10.1093/rfs/hhm055.
- A. Goyal, I. Welch. A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies. Vol. 21, pg. 1455–1508, 2008, https://doi.org/10.1093/rfs/hhm014.
- B. G. Malkiel. The efficient market hypothesis and its critics. Journal of Economic Perspectives. Vol. 17, pg. 59–82, 2003, https://doi.org/10.1257/089533003321164958.
- C. Kearney, S. Liu. Textual sentiment in finance: A survey of methods and models. International Review of Financial Analysis. Vol. 33, pg. 171–185, 2014, https://doi.org/10.1016/j.irfa.2014.02.006.
- M. López de Prado. The 7 reasons most machine learning funds fail. Journal of Portfolio Management. Vol. 44, pg. 120–133, 2018, https://doi.org/10.3905/jpm.2018.44.4.120.