News Sentiment Embeddings for Stock Price Forecasting
A key focus of this work is using news headlines from the Wall Street Journal (WSJ) to predict the movement of stock prices on a daily timescale. OpenAI-based text embedding models create vector encodings of each headline, and principal component analysis (PCA) extracts the key features. The challenge of this work is to capture the time-dependent and time-independent, nuanced impacts of news on stock prices while handling potential lag effects and market noise. Financial and economic data, including the U.S. Dollar Index (DXY) and Treasury interest yields, were collected to improve model performance. Over 390 machine-learning inference models were trained. Preliminary results show that headline embeddings improve stock price prediction by at least 40% compared to a machine learning system trained and optimized without them.
B. Headlines in the Machine Learning Pipeline
One of the major goals of this project was to examine the impact of headlines on stock prices. As a result, many decisions were needed to best quantify and measure the impact of the headlines on the stock data. Short headlines can be integrated into a model in many different ways; commonly used techniques in the literature include one-hot encoding, semantic analysis, and embeddings. Each technique has advantages and disadvantages, but embeddings were selected over simply examining the raw text of the headlines. With the rise of Generative Pre-trained Transformer (GPT)-based Large Language Models (LLMs), new techniques can extract meaning and encode complex ideas as numbers. Headlines often contain or embody complex ideas abstracted into just a few words; examining only the semantic feedback from these headlines may not produce a measurable impact. The main motive for using an embedding model was that clustering techniques and Principal Component Analysis (PCA) could then be used to find structural relationships between headlines and their impacts [19]. This project moves one step closer to producing a mapping between stock prices and news headlines.
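As a rough illustration of the embedding step, the sketch below sends headlines to an OpenAI embedding model. The paper does not specify the exact model or client code; text-embedding-3-small is assumed here because it returns the 1,536-dimensional vectors referenced in the next paragraph, and the example headlines are purely illustrative.

```python
# Minimal sketch of the headline embedding step (assumed model and headlines).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

headlines = [
    "Fed Signals Rate Cuts May Come Later Than Markets Expect",  # illustrative
    "Tech Shares Rally as Chipmaker Earnings Beat Forecasts",    # illustrative
]

response = client.embeddings.create(
    model="text-embedding-3-small",  # assumption: any 1,536-dim embedding model
    input=headlines,
)

# One 1,536-dimensional vector per headline, returned in input order.
embeddings = [item.embedding for item in response.data]
print(len(embeddings), len(embeddings[0]))  # e.g. 2 1536
```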
This project used PCA to reduce the headline vector embeddings from their original 1,536 dimensions down to as few as two dimensions [19]. PCA identifies the principal components along which the data varies the most; these components are orthogonal directions formed from combinations of the original vector dimensions. They are found by computing the covariance matrix of the data and then determining the directions and magnitudes of variance, which are given by the eigenvectors and eigenvalues of that matrix [19]. The high-dimensional data is then projected onto the subspace spanned by the eigenvectors with the largest eigenvalues, so only the most essential structure is kept. The benefits of PCA are that it is computationally efficient, it maintains the global structure of the data, and the variance it captures is easily interpretable [19].
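A minimal scikit-learn sketch of this reduction follows. In practice the input would be the stacked headline embeddings from the previous step; a random stand-in array is used here so the snippet runs on its own, and the two-component setting matches the lowest dimensionality reported above.

```python
# Sketch of the PCA reduction from 1,536 dimensions down to two.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1536))   # stand-in for (n_headlines, 1536) embeddings

pca = PCA(n_components=2)           # keep only the two largest components
X_reduced = pca.fit_transform(X)    # shape: (1000, 2)

# Fraction of the total variance retained by the two kept components.
print(pca.explained_variance_ratio_.sum())
```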
One of the key implications of this research lies in its ability to transform a time-dependent sequence of data into a time-independent representation. This enables the model to make time-agnostic predictions while still capturing temporal patterns, thereby improving generalization and repeatability through higher-order pattern recognition [6], [9], [15]. Future development includes deploying this model onto custom hardware modules to dramatically accelerate inference and computation.
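To make the time-independent framing concrete, the sketch below treats each trading day as an independent row of features (PCA-reduced headline components plus market variables such as the DXY and Treasury yields) with a next-day movement target. The feature layout, model choice, and synthetic data are assumptions for illustration only, not the exact configuration used in this work.

```python
# Sketch of a time-agnostic inference model: each row is one day, with no
# explicit time index in the features. All data here is a synthetic stand-in.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_days = 500

features = np.hstack([
    rng.normal(size=(n_days, 2)),   # stand-in for 2-D PCA headline features
    rng.normal(size=(n_days, 2)),   # stand-in for DXY and Treasury yield values
])
next_day_move = rng.normal(size=n_days)  # stand-in target: daily price movement

# shuffle=False keeps chronological order so the test set lies after training data.
X_train, X_test, y_train, y_test = train_test_split(
    features, next_day_move, test_size=0.2, shuffle=False
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```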