StockGPT: A GenAI Model for Stock Prediction and Trading
Dat Mai
September 2024
Abstract
This paper introduces StockGPT, an autoregressive "number" model trained and tested on 70 million daily U.S. stock returns over nearly 100 years. Treating each return series as a sequence of tokens, StockGPT automatically learns the hidden patterns predictive of future returns via its attention mechanism. On a held-out test sample from 2001 to 2023, daily and monthly rebalanced long-short portfolios formed from StockGPT predictions yield strong performance. The StockGPT-based portfolios span momentum and long-/short-term reversals, eliminating the need for manually crafted price-based strategies, and yield highly significant alphas against leading stock market factors, suggesting a novel AI pricing effect. This highlights the immense promise of generative AI in surpassing humans in making complex financial investment decisions.
Key words: generative artificial intelligence, transformer, decoder, stock market, investment, trading, return prediction
∗ Dat Mai, PhD, CFA ([email protected]) is a quantitative researcher at MKT MediaStats, LLC. I have no conflicts of interest to disclose. The views expressed herein are solely my own and do not reflect those of my employer. This paper was written as part of my postdoctoral research at the University of Missouri-Columbia. I would like to thank Andrej Karpathy for publicly sharing his lecture and code on the GPT architecture. I acknowledge helpful comments from the participants at Citi's Data Science Seminar and the 2024 Chicago Quantitative Alliance (CQA) Fall Conference.
1 Introduction
Generative artificial intelligence (GenAI)—technology able to generate texts, images, videos, programming codes, or arts from instructions via sounds or texts—has taken society by storm and exerted wide-ranging influence on many aspects of the world economy (Baldassarre et al. 2023; Dell'Acqua et al. 2023; Mannuru et al. 2023; Noy and Zhang 2023; Otis et al. 2023; Sætra 2023). Although it had been around for years, GenAI came to public prominence with the introduction of ChatGPT in November 2022, a chatbot able to generate answers and reasoning from conversational prompts.
Since its introduction, ChatGPT and similar large language models have quickly made their way into the investment industry. One common use of ChatGPT for investment is to generate trading recommendations directly from news about a company (such as news articles or corporate communications) (Lopez-Lira and Tang 2023). A less direct approach is to rely on similar pretrained language models such as BERT (Devlin et al. 2018) and OPT (Zhang et al. 2022) to generate a sentiment score for each company, which is then used to make trading decisions. For example, Jiang, Kelly, and Xiu (2022) and Kirtac and Germano (2024) find that stock portfolios formed on such sentiment scores deliver strong returns.
This paper contributes to this fast-evolving field by applying the GenAI logic to numeric stock data. That is, I first train a new Generative Pretrained Transformer (GPT) model (Brown et al. 2020) from scratch on numeric stock data (hereafter StockGPT) and then show that StockGPT has the potential to produce strong investment performance.1 Unlike previous finance domain-specific language models that are pretrained on financial texts, such as FinBERT (Yang, UY, and Huang 2020) and BloombergGPT (Wu et al. 2023), to the best of my knowledge, StockGPT is the first of its kind to be trained directly on numeric stock data.
For trading purposes, using a model trained directly on stock data has three important advantages over models trained on texts: (i) the model learns price patterns directly from price data rather than from news about prices; (ii) the model's predictions are available for each stock at each point in time rather than dependent on the availability of news data about stocks; and (iii) the model predicts the whole distribution of future returns rather than a point estimate.
1 ChatGPT is GPT fine-tuned for conversational purposes.
Language models such as GPT operate by predicting the next most likely token given the previous ones, p(x_{t+1} | x_t, ..., x_1). This nature bears a strong resemblance to numeric time series data such as stock returns, where data points come in order and the next value is conditional on what comes before it. Hence the natural question is whether the architecture of language models can be applied to numeric time series data. To do so, one fundamental difference between texts and numbers needs to be addressed: texts are a collection of (vast but) discrete tokens while numeric time series are generally continuous. Therefore, to train a generative model for stock returns, I first discretize stock return data into intervals (or tokens) and then apply the language model architecture.
To build the StockGPT model, I adapt a lightweight version of the GPT architecture, which consists of four attention blocks and has about one million parameters. Input into the model is a sequence of 256 consecutive daily returns on each stock (i.e., the block size in language models), which approximates the number of trading days in a year.2 The training objective is to predict the next return value given its previous returns using the transformer architecture, which receives indexes (or positions) of the tokens in a sequence, retrieves their vector representations, and models their dependencies via a mechanism called attention (Vaswani et al. 2017). The training sample consists of around 50 million daily U.S. stock returns from 1926 to 2000, which covers almost all stocks that have ever been listed on the U.S. stock market during the 20th century. The model is tested on a hold-out sample of around 20 million daily U.S. stock returns from 2001 to 2023.
Notably, the model is trained only once using the training sample and applied off-the-shelf to the out-of-sample period. This study design serves two purposes: (i) it is the cleanest setup to test the effectiveness of the model, and (ii) it reduces computational costs. Despite this simple setup, the model still delivers strong performance up to 23 years after the period it is trained on. In practice, the model should be continually retrained as new financial data arrive to uphold its relevance and performance. This is especially needed in a dynamic environment like the stock market, featuring a low signal-to-noise ratio and constant distributional shifts (Kelly, Xiu, et al. 2023).
During the testing phase, for each stock on each trading day t, StockGPT uses the 256 daily returns from t − 255 to t to make a return forecast for t + 1. The evaluation of the forecasts consists of two steps. First, I examine the accuracy of the forecasts by running cross-sectional regressions of realized stock returns on day t + 1 onto return predictions for t + 1. The results indicate that StockGPT forecasts reliably track the cross section of realized returns.
2 It is a convention in machine learning to specify model parameters in powers of 2.
The second evaluation step entails building real-time trading portfolios based on StockGPT forecasts. At the market close of each trading day t, I build zero-cost portfolios by going long/short the top/bottom decile of stocks with the highest/lowest return forecasts for day t + 1 and rebalance the portfolio at the t + 1 market close. To avoid trading only micro-cap stocks, which are illiquid and incur high transaction and market impact costs, before forming the portfolio I remove stocks below the 10th percentile of market value at the market close.
Under the equal-weighting scheme, where each stock receives an equal weight in the portfolio, this daily rebalanced long-short portfolio earns an average annualized return of 119% from 2001 to 2023, achieving a Sharpe ratio of 6.5. This performance exceeds that of the best daily-rebalanced portfolio based on language model predictions in Jiang, Kelly, and Xiu (2022), which has an annual return of 50% and a Sharpe ratio of 4.8 from 2004 to 2019. It is noteworthy that while the prediction model in Jiang, Kelly, and Xiu (2022) is retrained every year,3 StockGPT is trained only once.
Under the value-weighting scheme, where stock weights in the portfolio are proportional to their market values, the StockGPT-based portfolio achieves an average annualized return of 27% and a Sharpe ratio of 1. Since value weighting gives more weight to stocks with higher market values, this result is consistent with the consensus view in asset pricing that small stocks are more predictable due to more mispricing and higher arbitrage costs (Baker and Wurgler 2006).
Since StockGPT makes its return forecasts using only historical price data, I examine how it relates to common price-based strategies such as momentum and long-/short-term reversals. I find that the StockGPT-based portfolios span these strategies in spanning tests. This suggests that AI is more effective than humans at designing trading strategies based on historical price movements. The StockGPT-based portfolios also encompass several factors of the Fama and French (2015) five-factor model.
It is noteworthy that unlike fields such as medicine or law, where GenAI is expected to generate (100%) correct responses, StockGPT does not need to accurately predict future returns on individual stocks to be useful. Instead, in the context of cross-sectional asset pricing, it is only required to identify the groups of stocks that are more likely to go up/down to facilitate profitable long-short portfolio formation.
3 Specifically, they retrieve contextual word embeddings from pretrained OPT and BERT and use these embeddings to retrain the return prediction model every year.
While the daily results provide proof of concept that the GPT model can be applied to numeric stock data to yield strong investment results, it is practically challenging to trade hundreds of stocks on a daily basis, especially small-cap ones. Therefore, I also experiment with a more realistic StockGPT model that makes monthly return predictions. Specifically, I train a new model to predict the return over the next 20 days instead of the next-day return as before. I then use this model to make monthly return forecasts and form monthly rebalanced portfolios.
On average, this strategy earns an annual return of 13% with a Sharpe ratio of 1 from 2001 to 2023, outperforming 11 common stock factors by a large margin (these factors include momentum, long-/short-term reversals, the five factors from Fama and French (2015), and three factors from Hou et al. (2021)). The performance persists if I focus on only the 50% largest companies by market cap or retain only stocks listed on NYSE. The StockGPT portfolio also earns a highly significant annual alpha of 16% (t-statistic of 4.7) against all of these factors combined, suggesting a new AI-based pricing effect not captured by standard asset pricing factors. In other words, StockGPT can be combined with standard pricing factors to improve the overall risk-return profile.
Although StockGPT shows promising performance, practitioners can enhance the model in several ways to achieve better results. First, the model should be retrained frequently (such as monthly) to maintain its relevance and performance. Second, StockGPT as introduced in this paper is a lightweight adoption of the GPT architecture; it is an open question whether extending the model along several of its parameter dimensions will yield better performance. Third, training StockGPT with high-frequency stock data can be a fruitful avenue since there is evidence of alpha to be extracted from the order book (Kolm, Turiel, and Westray 2023). These three enhancements can be readily implemented given enough computing power. In addition, one area that needs more exploration and research effort is how to modify the model itself, or its training, to work better with noisy financial data.
2 Model Architecture
2.1 Overview
StockGPT uses a vanilla decoder-only transformer architecture, which constitutes the second (decoder) part of the canonical transformer model developed by Vaswani et al. (2017). The decoder-only transformer is also the architecture of ChatGPT. Figure 1 sketches the architecture. Specifically, the decoder receives an input sequence of tokens x = (x_1, x_2, ..., x_{t-1}, x_t), transforms it via multiple layers of attention, and outputs the probability of each next token: p(x_2 | x_1), p(x_3 | x_2, x_1), ..., p(x_{t+1} | x_t, ..., x_1).
During the training phase, the model learns and updates its parameters by minimizing the cross-entropy loss between a token prediction and its actual value, l(x̂_{t+1}, x_{t+1}), averaged across all tokens across all sequences in a training batch. During the deployment phase, the decoder generates an output sequence of tokens (x_{t+1}, x_{t+2}, ..., x_{t+m}) one at a time: given the input sequence, it computes the probability distribution p(x_{t+1} | x_t, ..., x_1) and samples the next token from this distribution. The decoder model is autoregressive in the sense that it consumes its own generated output at each step as additional input to generate the next one, i.e., p(x_{t+2} | x̂_{t+1}, x_t, ..., x_1) where x̂_{t+1} was previously generated.
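To make this generation loop concrete, below is a minimal PyTorch-style sketch (not the author's actual code) of the autoregressive sampling just described, assuming a `model` that maps a (batch, time) tensor of token indices to next-token logits:

```python
import torch

@torch.no_grad()
def generate(model, tokens: torch.Tensor, n_new: int, block_size: int = 256) -> torch.Tensor:
    """Append n_new sampled tokens to a (batch, time) tensor of token indices."""
    for _ in range(n_new):
        context = tokens[:, -block_size:]                     # crop to the training block size
        logits = model(context)                               # (batch, time, vocab)
        probs = torch.softmax(logits[:, -1, :], dim=-1)       # p(x_{t+1} | x_t, ..., x_1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token
        tokens = torch.cat([tokens, next_token], dim=1)       # feed the output back in
    return tokens
```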
2.2 Details
Since computers do not understand human texts, the transformer first quantifies text tokens via token and positional embeddings. Token embedding retrieves a unique vector representation for each token in a dictionary containing all available tokens. Positional embedding vectorizes each token's position in an input sequence. Without positional embeddings, the transformer cannot understand the context and order of tokens. The transformer then sums the token and positional embedding vectors for each token. These embeddings are learnable parameters.
For example, in the sentence "The [firm] made a [firm] decision about its capital structure," the two [firm] words have the same token embedding but different positional embeddings due to their positions.
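As an illustration, the two embedding tables and their sum could be set up as follows (a sketch using StockGPT's dimensions; the variable names are mine):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 402, 256, 128   # StockGPT's sizes
token_emb = nn.Embedding(vocab_size, n_embd)     # one learnable vector per token
pos_emb = nn.Embedding(block_size, n_embd)       # one learnable vector per position

idx = torch.randint(0, vocab_size, (1, block_size))  # a batch of token indices
positions = torch.arange(block_size)                 # 0, 1, ..., 255
x = token_emb(idx) + pos_emb(positions)              # summed embeddings, shape (1, 256, 128)
```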
At the heart of the transformer model is the attention mechanism. For each token, the transformer generates three vectors from its embedding vector: key k, query q, and value v. The attention for each token t is the weighted sum of its v_t with all v_i's of the tokens preceding it, weighted by the product of its q_t with the k_i's of those tokens and a normalizing constant:

attention_t = Σ_{i=1}^{t} v_i × w_i,   with w_i = q_t · k_i × norm. const.   (2)
Intuitively, a token emits a query, and the previous tokens that match its query (i.e., that have a high q_t · k_i value) get its attention. k, q, and v are also learnable parameters.4 This mechanism constitutes a self-attention head and helps the transformer develop a contextual understanding of the tokens.5 In the above example, the attention mechanism helps the model understand that [its] refers to the first [firm]. Since each token is only influenced by the tokens before it, this setup is autoregressive.
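A minimal single-head implementation of equation (2) with the causal (autoregressive) mask might look as follows; here the scaling by 1/sqrt(head size) plays the role of the normalizing constant, and all names are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One self-attention head with the autoregressive (causal) mask."""
    def __init__(self, n_embd: int = 128, head_size: int = 32, block_size: int = 256):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)    # produces k_i
        self.query = nn.Linear(n_embd, head_size, bias=False)  # produces q_t
        self.value = nn.Linear(n_embd, head_size, bias=False)  # produces v_i
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # w_i ∝ q_t · k_i, scaled by the normalizing constant 1/sqrt(head_size)
        w = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])       # (B, T, T)
        w = w.masked_fill(self.mask[:T, :T] == 0, float("-inf"))   # attend only backwards
        w = F.softmax(w, dim=-1)
        return w @ v   # attention_t = sum_i w_i * v_i
```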
The transformer concatenates multiple attention heads into a multi-head module, which is sequentially followed by multiple linear layers to form an attention block.6 Multiple attention blocks are then stacked on top of each other. The last attention block is followed by a layer normalization and a linear layer whose output is converted into a vector of probabilities via a softmax activation. Specifically, at time step t, the transformer outputs p(x_{t+1} | x_t, ..., x_1), the multinomial distribution over all available tokens in the dictionary, conditional on all tokens up to t. Given this distribution, the model can sample the token at t + 1 from its dictionary.
4 Technically speaking, the model learns the weight matrices that produce these vectors.
5 This tutorial by Andrej Karpathy gives an intuitive step-by-step introduction to the GPT architecture: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=4901s.
6 Specifically, for StockGPT, in the first step of the attention block, the input goes through a layer normalization, then multi-head concatenation, followed by a linear layer with dropout. The output of this step is added to the input via a skip connection. In the second step, the output of the first step goes through a second layer normalization, followed by an expanding linear layer that increases the input dimension by 4 times, a ReLU activation, and a contracting layer that reverts to the input dimension, with dropout. Again, the output of this second step is added to its input via a second skip connection.
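A sketch of the block structure described in footnote 6, assuming a `multi_head` module that concatenates the 4 heads, projects back to the embedding dimension, and applies dropout (illustrative, not the author's code):

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Pre-norm attention block following footnote 6."""
    def __init__(self, multi_head: nn.Module, n_embd: int = 128, dropout: float = 0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = multi_head                  # multi-head attention + linear + dropout
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),      # expand the dimension 4x
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),      # contract back to the input dimension
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))  # first skip connection
        x = x + self.mlp(self.ln2(x))   # second skip connection
        return x
```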
2.3 StockGPT Specifics
Since the transformer can only work with discrete tokens, to use it on continuous stock return data, the first step is to discretize returns into intervals. Table 1 illustrates the discretization rule. Accordingly, I first convert returns into integer basis points by multiplying them by 10,000 and keeping the integer portion. Next, I cut the basis points into intervals of 50, closed on the right. Since stock prices cannot be negative, returns cannot be lower than -10,000 basis points (i.e., -100%); therefore, the first bin (-Inf, -10,000] contains only -10,000. The last closed interval is (9,950, 10,000]. Third, I use the mid value of each interval to represent its bin, with the exception that the first bin (-Inf, -10,000] is represented by -10,000 and the last bin (10,000, Inf) by 10,000. In other words, I treat all daily returns greater than 100% as 100%. Values above this threshold are extremely rare since the 1st to 99th percentile range of daily returns in the training set is from -9.6% to 11.1%. Finally, the bins are numbered from 0 to 401. Therefore, my return dictionary has a total of 402 tokens, where each token is a return bin midpoint. As an example of the discretization rule, the return sequence (-2.4%, 0%, 0%, 5%, 4.8%) is converted into the index sequence (196, 200, 200, 210, 210), which is input into StockGPT.
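The discretization rule can be expressed compactly as follows (a sketch; the function name is mine). It reproduces the example mapping above:

```python
import numpy as np

def returns_to_tokens(returns) -> np.ndarray:
    """Map daily returns to bin indices 0..401 (50-bp bins, closed on the right)."""
    bps = (np.asarray(returns) * 10_000).astype(int)   # keep the integer basis points
    bps = np.clip(bps, -10_000, 10_050)                # bins 0 and 401 absorb the tails
    # bin k >= 1 covers (-10,000 + (k-1)*50, -10,000 + k*50]; bin 0 is (-Inf, -10,000]
    return np.where(bps <= -10_000, 0, np.ceil((bps + 10_000) / 50)).astype(int)

print(returns_to_tokens([-0.024, 0.0, 0.0, 0.05, 0.048]))
# [196 200 200 210 210] -- matches the example in the text
```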
Besides the vocabulary size of 402, StockGPT has a block size (i.e., length of each input sequence) of 256, token and positional embedding sizes of 128, 4 attention blocks each consisting of 4 self-attention heads, and a dropout probability of 0.2 in various layers. Taken together, StockGPT has 0.93 million parameters. StockGPT is trained over 10,000 training steps, each consisting of 64 sequences (i.e., the batch size) drawn randomly from the training data.7 The probability of sampling each stock during training is proportional to the number of daily return observations it has.
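A back-of-the-envelope check of the stated parameter count (a sketch; the exact total depends on bias terms and implementation details):

```python
# Approximate parameter count for the stated configuration.
vocab, block, embd, n_blocks, n_heads = 402, 256, 128, 4, 4
head = embd // n_heads                     # 32-dimensional heads

embeddings = vocab * embd + block * embd   # token + positional embedding tables
per_block = (
    3 * embd * head * n_heads              # key/query/value projections
    + embd * embd                          # output projection after concatenation
    + embd * 4 * embd + 4 * embd * embd    # expanding + contracting MLP layers
    + 4 * embd                             # two layer norms (scale + shift)
)
lm_head = embd * vocab                     # final linear layer over the dictionary
total = embeddings + n_blocks * per_block + lm_head
print(f"{total / 1e6:.2f}M parameters")    # ~0.92M, close to the 0.93M in the text
```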
As discussed above, to make a return forecast given a 256-return input sequence (x_{t-255}, ..., x_t), StockGPT outputs p(x_{t+1} | x_t, ..., x_{t-255}), a multinomial distribution over the 402 return bins.8 The model produces output in terms of bin indexes, which are converted to numeric returns using the bin midpoints in Table 1. The expected return for day t + 1 is then the weighted average of the return bin midpoints, weighted by the corresponding bin probabilities given by p(x_{t+1} | x_t, ..., x_{t-255}). Alternatively, the expected return on day t + 1 can be computed by sampling many forecasts from p(x_{t+1} | x_t, ..., x_{t-255}) and averaging them. The two approaches produce the same results if the number of drawn samples is large, but the latter approach is more computationally intensive. To make return forecasts over the next m days, we can recursively sample several paths of forecasts and average across them.
7 During the training phase, the cross-entropy loss stabilizes at around 2.5 after 5,000 training steps. With 402 labels (the number of return bins), the maximum cross-entropy would be E = -Σ_{i=1}^{402} (1/402) × log(1/402) = log(402) ≈ 6. The model is fully trained locally on a MacBook M2 with 64GB RAM and 30 GPU cores.
8 Technically, an input sequence of any length from 1 to 256 (i.e., the block size during training) can be used to make forecasts. Analogously, ChatGPT prompts can be of any length up to a limit (around 2048 tokens). However, since StockGPT is trained with a block size of 256, I also use input sequences of 256 days when making forecasts to utilize all price patterns the model has discovered during training.
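The probability-weighted approach can be sketched as follows, with `bin_midpoints` holding the 402 representative returns from Table 1 (names are illustrative):

```python
import torch

# Representative return for each of the 402 bins (decimal units): bin 0 -> -100%,
# bins 1..400 -> interval midpoints, bin 401 -> +100%, per Table 1.
k = torch.arange(402, dtype=torch.float32)
bin_midpoints = (-10_000 + k * 50 - 25) / 10_000
bin_midpoints[0], bin_midpoints[-1] = -1.0, 1.0

@torch.no_grad()
def expected_return(model, context: torch.Tensor) -> torch.Tensor:
    """E[x_{t+1}]: probability-weighted average of the bin midpoints."""
    logits = model(context)                          # (batch, time, 402)
    probs = torch.softmax(logits[:, -1, :], dim=-1)  # p(x_{t+1} | x_t, ..., x_{t-255})
    return probs @ bin_midpoints                     # (batch,) expected next-day returns
```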
3 Data
Stock return data come from the Center for Research in Security Prices (CRSP), which collects all historical U.S. stock returns from 1926 to 2023. As is standard in asset pricing research, I include only common stocks with a share code of 10 or 11 traded on the three main exchanges: NYSE, AMEX, and NASDAQ. This sample consists of around 70 million stock observations from 1926 to 2023. The sample is then split into two parts: 1926 to 2000 for training and 2001 to 2023 for testing. Within the training sample, data from 1926 to 1990 are used for parameter optimization and data from 1991 to 2000 for hyperparameter tuning and evaluation.
During training evaluation, I document that the model using stocks from NYSE alone has a lower evaluation loss than the one using all three exchanges, 2.55 versus 2.72. This may be because NYSE is the world's largest stock exchange and lists high-quality large-cap stocks, while AMEX and NASDAQ list smaller stocks that add noise to the training process. Therefore, the main results focus on the model trained on NYSE data alone, while the results using the model trained on all three exchanges are reported in Appendix A.9
During the testing phase, stock returns from all three exchanges are used. As noted in the introduction, the model is trained only once using the training sample and kept unchanged during the entire testing period.
9 The latter model still produces an annual return of 83% with a Sharpe ratio of 5.
4 Results: Daily Prediction
To evaluate the quality of the return forecast for day t + 1, I first compare it against the actual return on that day. Specifically, for each trading day t, I run the following cross-sectional regression:

x_{i,t+1} = a_t + b_t × x̂_{i,t+1} + e_{i,t+1}   (3)

where x_{i,t+1} is the actual realized return of stock i on day t + 1 and x̂_{i,t+1} is its StockGPT return forecast. The slope b_t and regression adjusted R²_t are then averaged across all trading days in the test sample. These measure how well StockGPT forecasts track actual returns. This regression specification is known as the Fama-MacBeth regression in the asset pricing literature (Fama and MacBeth 1973).
Table 2 reports the results. The average slope coefficient is 0.5, indicating that a cross-sectional difference of 100 basis points (i.e., 1%) in StockGPT return predictions signals a difference of 50 basis points in realized returns. Moreover, the average cross-sectional R² is 1.2%, equivalent to an 11% cross-sectional correlation between return predictions and actual returns. For comparison, the average correlation between return forecasts based on language models and actual returns is around 2% in Jiang, Kelly, and Xiu (2022). I also examine the relation between return forecasts for day t + 1 and realized returns on day t + 2 (i.e., skipping one day). For this test, the slope coefficient is 0.09 and the R² is 0.4%, which translates into a 6% correlation. The slopes in both tests are highly significant, with t-statistics over 10. Overall, StockGPT forecasts track future returns reasonably well.
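A minimal sketch of the daily Fama-MacBeth procedure just described, assuming a pandas DataFrame with one row per stock-day and hypothetical column names; the Newey-West t-statistic reported in the paper is omitted for brevity:

```python
import numpy as np
import pandas as pd

def fama_macbeth(df: pd.DataFrame) -> pd.Series:
    """df columns: 'date', 'ret' (realized return on t+1), 'pred' (forecast for t+1)."""
    slopes, r2s = [], []
    for _, day in df.groupby("date"):          # one cross-sectional OLS per day
        x = np.column_stack([np.ones(len(day)), day["pred"]])
        b, res, *_ = np.linalg.lstsq(x, day["ret"], rcond=None)
        slopes.append(b[1])
        ss_tot = ((day["ret"] - day["ret"].mean()) ** 2).sum()
        # plain R^2 per day (the paper reports the adjusted version)
        r2s.append(1 - res[0] / ss_tot if len(res) else np.nan)
    return pd.Series({"avg_slope": np.mean(slopes), "avg_R2": np.nanmean(r2s)})
```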
The main empirical analysis examines the trading implications of StockGPT forecasts. To do so, on each trading day t, I buy the top stock decile with the highest return forecasts for t + 1 (High portfolio) and sell the bottom decile with the lowest return forecasts for t + 1 (Low portfolio). To avoid trading only micro-cap stocks, I remove stocks with market value below the 10th percentile each day. In Table 3, under equal weighting (EW), this long-short portfolio yields an average annual return of 119% with a Sharpe ratio (mean/standard deviation) of 6.5.10
If I remove stocks with prices below $1, $3, and $5, the mean returns (and Sharpe ratios) are 110% (6.3), 86% (5.2), and 74% (4.7), respectively. The left graph in Panel A of Figure 2 plots the log cumulative returns of these four long-short portfolios. These portfolios show a consistent upward trend throughout the 2001-2023 sample, with the biggest jump in 2009 after the financial crisis. The right graph in Panel A plots the cumulative returns of each long/short leg of the portfolios. It is clear that StockGPT can symmetrically predict both rising and falling stocks.
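The daily sort just described can be sketched as follows, assuming a per-day DataFrame with the forecast, market cap, and next-day realized return (column names are mine):

```python
import pandas as pd

def long_short_return(day: pd.DataFrame, equal_weight: bool = True) -> float:
    """day columns: 'pred' (forecast), 'mcap' (market value), 'ret_next' (realized t+1 return)."""
    day = day[day["mcap"] > day["mcap"].quantile(0.10)]   # drop micro caps
    decile = pd.qcut(day["pred"], 10, labels=False)       # sort forecasts into deciles
    high, low = day[decile == 9], day[decile == 0]        # High and Low portfolios
    if equal_weight:
        return high["ret_next"].mean() - low["ret_next"].mean()
    w_h = high["mcap"] / high["mcap"].sum()               # value weights
    w_l = low["mcap"] / low["mcap"].sum()
    return (w_h * high["ret_next"]).sum() - (w_l * low["ret_next"]).sum()
```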
While the annual return of 119% under equal weighting in the baseline model is before transaction costs, under the hypothetical worst-case scenario in which the portfolio replaces all of its constituents every day (i.e., a turnover of 400% in a long-short portfolio) and each trade costs 5 basis points (400% × 0.05% = 0.2% per day, or roughly 50% per year), the StockGPT-based strategy still realizes an annual return of 69% net of transaction costs. Under value weighting (VW), the long-short portfolio without a price filter (but still after removing the bottom decile based on market cap) earns an annual return of 27% and a Sharpe ratio of 1. The Sharpe ratios of the portfolios with price filters are 1, 0.9, and 0.8 for the $1, $3, and $5 price thresholds, respectively. Since value weighting gives more weight to large-cap stocks, this result indicates that StockGPT is more effective at forecasting returns of small-cap stocks. This is expected since small-cap stocks are more likely to be mispriced (Baker and Wurgler 2006).
Table 3 also reports the portfolio results when return forecasts for t + 1 are used to form the portfolio for t + 2. Under equal weighting, this skipping-one-day portfolio earns 26% annually with a Sharpe ratio of 1.7. Panel B of Figure 2 shows that when one day is skipped, StockGPT forecasts track the returns in the long leg better than in the short one.
Since StockGPT uses only historical market price data to make return forecasts, it is important to examine how the StockGPT-based portfolio relates to prominent trading strategies based on historical returns. The three most notable patterns are short-term reversal (using returns from month t − 1) by Jegadeesh (1990), momentum (using returns from months t − 2 to t − 12) by Jegadeesh and Titman (1993), and long-term reversal (using returns from months t − 13 to t − 60) by De Bondt and Thaler (1985). It is also interesting to examine how StockGPT performs relative to leading stock factors such as the five factors of Fama and French (2015) and the investment-based q5 factors of Hou et al. (2021).11
10 Without the market cap restriction, the StockGPT-based portfolio would earn 230% annually with a Sharpe ratio of 10. On the other hand, if stocks with market cap below the 30th percentile are removed, the resulting portfolio earns 50% annually with a Sharpe ratio of 2.9.
As is standard in asset pricing research, to examine whether a strategy earns abnormal returns relative to a set of other traded factors, we can run the following contemporaneous regression:

y_t = α + β × x_t + e_t   (4)

where y_t is the return of the target strategy and x_t is the set of benchmark factors. If α is significant, the target strategy earns abnormal returns not explained by the benchmark factors.
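Equation (4) with Newey-West standard errors can be run with statsmodels, for example (a sketch; `y` and `X` are assumed to be the strategy's daily return series and a DataFrame of factor returns):

```python
import statsmodels.api as sm

def spanning_test(y, X, lags: int = 20):
    """Regress strategy returns on factors; report the annualized alpha and its t-stat."""
    model = sm.OLS(y, sm.add_constant(X)).fit(cov_type="HAC", cov_kwds={"maxlags": lags})
    alpha, t_alpha = model.params["const"], model.tvalues["const"]
    return alpha * 252, t_alpha, model.rsquared   # x252 annualizes a daily alpha
```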
Panel A of Table 4 reports the results of the spanning tests in which y_t is the daily return on the StockGPT-based portfolios and x_t is the set of benchmark factors. Both the equal-weighted and value-weighted StockGPT portfolios earn sizable and highly significant alphas relative to all 11 benchmark factors (t-statistics are greater than 10 for EW and greater than 3 for VW).
In Panel B, I test whether the StockGPT portfolios span the benchmark factors. The equal-weighted StockGPT portfolio spans momentum, long-term reversal, value, and size, while the value-weighted StockGPT portfolio spans 9 out of 11 factors, the exceptions being profitability from the Fama-French model and earnings growth from the q5 model. That the value-weighted StockGPT portfolio spans the other factors better than the equal-weighted portfolio does is expected since those factors are also value-weighted.
Overall, the spanning tests show that when we let the stock data speak for itself via the attention mechanism, manually crafted price-based strategies such as momentum and long-term reversal are no longer needed. Notably, although StockGPT only learns from historical returns over the past 12 months, it completely encompasses the long-term reversal pattern based on returns beyond the past 12 months. Furthermore, the StockGPT-based portfolios encompass several factors of the Fama and French (2015) five-factor model.
11 The momentum, reversal, and five factors of Fama and French (2015) are available at https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html while the q5 factors of Hou et al. (2021) are at https://global-q.org/factors.html.
5 Results: Monthly Prediction
While the daily results provide proof of concept that StockGPT can deliver strong investment performance, it is costly and challenging to trade hundreds of stocks on a daily basis. In this section, I examine whether StockGPT can be used to make longer-term forecasts that support less frequent rebalancing.
There are two ways to produce long-term return forecasts over the next m days from StockGPT. The first approach is to produce several paths of forecasts x_j = (x_{t+1}, x_{t+2}, ..., x_{t+m}) and average across the paths to compute the expected return over the next m days. However, this approach is computationally expensive since there are on average 3,000 to 4,000 stocks traded in the cross section. For each stock on each rebalancing day, we would need to generate many m-day forecast paths and average them.
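Reusing the `generate()` and `bin_midpoints` sketches from Section 2, this first approach could be implemented as a Monte Carlo average over sampled paths (illustrative only):

```python
import torch

@torch.no_grad()
def m_day_forecast(model, context: torch.Tensor, bin_midpoints: torch.Tensor,
                   m: int = 20, n_paths: int = 100) -> torch.Tensor:
    """context: (1, 256) token indices for one stock; returns the expected m-day return."""
    ctx = context.repeat(n_paths, 1)               # simulate n_paths paths in one batch
    paths = generate(model, ctx, n_new=m)[:, -m:]  # sampled bin indices, (n_paths, m)
    rets = bin_midpoints[paths]                    # decode tokens back to daily returns
    cum = (1 + rets).prod(dim=1) - 1               # compound over the m days
    return cum.mean()                              # Monte Carlo expected m-day return
```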
The second approach is to train a new StockGPT model whose training target is the return over the next m days, i.e., p(x̄_{t+1→t+m} | x_t, ..., x_{t-255}) where x̄_{t+1→t+m} is the mean return over the next m days. I pursue this approach in this section. Specifically, I train a StockGPT model using historical returns to predict mean returns over the next 20 days (approximately the number of trading days in a month), with all other specifications kept unchanged from the daily model. As before, the model is trained only once using data up to 2000. During the testing phase, at the end of each month, for each stock, an input sequence of the 256 previous daily returns for that stock is used by the new StockGPT model to predict the next 20-day return (i.e., the return over the next month).
5.1 Fama-MacBeth Regression
To evaluate the quality of long-term forecasts by StockGPT, I regress realized monthly returns onto the 20-day return forecasts via the Fama-MacBeth test discussed above. As reported in Table 5, the average slope coefficient is 3 (significant at 5%), indicating that a cross-sectional difference of 100 basis points (i.e., 1%) in StockGPT return forecasts signals a difference of 300 basis points in realized returns. Moreover, the average cross-sectional R² is 0.55%, equivalent to a 7.4% cross-sectional correlation between return predictions and actual returns. When one month is skipped between the 20-day return forecasts and realized returns, the correlation shrinks to zero.
5.2 Portfolio Sorting
I then form monthly rebalanced long-short decile portfolios using the 20-day return forecasts and report the performance statistics over 2001-2023 in Panel A of Table 6. The equal-weighted portfolios, after removing the bottom stock decile based on market value, earn about 13% annually, significant at 1%, with Sharpe ratios around 1. To ensure the tradability of the strategy, I further remove stocks below the 30th market cap percentile, and the performance remains almost unchanged. When the bottom half of all stocks is removed, the annual mean return falls to about 10% with the Sharpe ratio falling to about 0.7. When only NYSE stocks are used, the average annual return is 15%.
For comparison, in Panel B of Table 6, I report the summary statistics for the 11 common stock factors. Only 5 factors yield significant returns: short-term reversal, market, and profitability from Fama and French (2015), and return on equity and earnings growth from Hou et al. (2021). Among these factors, short-term reversal yields the strongest result, with a mean return of 8.8% and a Sharpe ratio of 0.7. It is clear that the StockGPT portfolios across different specifications outperform these factors.
Panel A of Figure 3 plots the log cumulative returns on the StockGPT portfolios. The long-short portfolios see a stable upward trend from 2001 to 2023. Between the two legs of the strategy, StockGPT does better at predicting the future winners. Panel B plots the log cumulative returns of the baseline StockGPT portfolio and the 11 stock factors. Among the factors, short-term reversal outperformed StockGPT before the financial crisis but has lagged StockGPT by a large margin since then.
Table 7 reports the spanning tests. In Panel A, the monthly rebalanced equal-weighted StockGPT portfolio earns a significant annual alpha of about 15% (t-statistic of 4.7) against all of the factors. This suggests that StockGPT represents a new AI pricing effect not captured by standard factor models. In Panel B, I check whether the stock factors earn alphas against StockGPT. Long-term reversal, value, and investment (from both Fama and French (2015) and Hou et al. (2021)) are subsumed by StockGPT. Moreover, the market alpha against StockGPT is only marginally significant at the 10% level.
Overall, while the investment performance of the monthly StockGPT model is far less impressive than that of the daily model, it still outperforms all of the standard stock factors and yields highly significant alphas. The monthly results confirm that StockGPT can be used in practice to form profitable monthly rebalanced portfolios.
6 Conclusion
This paper introduces StockGPT, a decoder-only transformer model trained directly on U.S. stock returns. Instead of relying on manually crafted trading patterns based on historical stock prices, StockGPT automatically learns the hidden patterns most predictive of future returns via its attention mechanism. Even though it is trained on daily returns only up to 2000, StockGPT delivers strong investment performance through 2023. Its portfolios span manually designed trading strategies such as momentum and long-/short-term reversals and encompass several leading stock factors.
StockGPT can be enhanced in several ways. First, StockGPT should be retrained frequently as new stock data arrive to maintain its performance. Second, StockGPT as introduced in this paper is a lightweight model with only around one million parameters. The natural extension is to examine bigger models with more granular return intervals, a longer block size, a bigger embedding size, and more layers of attention blocks. Third, examining the long-term forecasts from the daily StockGPT model (as discussed in Section 5) can be a fruitful direction. Finally, training StockGPT with higher-frequency data, such as tick-level data, may yield promising results.
References
Baker, M., & Wurgler, J. (2006). Investor sentiment and the cross-section of stock returns. The
journal of Finance, 61 (4), 1645–1680.
Baldassarre, M. T., Caivano, D., Fernandez Nieto, B., Gigante, D., & Ragone, A. (2023). The social
impact of generative ai: An analysis on ChatGPT. Proceedings of the 2023 ACM Conference
on Information Technology for Social Good, 363–373.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances
in neural information processing systems, 33, 1877–1901.
De Bondt, W. F., & Thaler, R. (1985). Does the stock market overreact? Journal of Finance, 40 (3),
793–805.
Dell’Acqua, F., McFowland, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S.,
Krayer, L., Candelon, F., & Lakhani, K. R. (2023). Navigating the jagged technological
frontier: Field experimental evidence of the effects of AI on knowledge worker productivity
and quality. Harvard Business School Technology & Operations Mgt. Unit Working Paper,
(24-013).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116 (1), 1–22.
Fama, E. F., & MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests. Journal of
political economy, 81 (3), 607–636.
Hou, K., Mo, H., Xue, C., & Zhang, L. (2021). An augmented q-factor model with expected growth.
Review of Finance, 25 (1), 1–41.
Jegadeesh, N. (1990). Evidence of predictable behavior of security returns. Journal of Finance,
45 (3), 881–898.
Jegadeesh, N., & Titman, S. (1993). Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency. Journal of Finance, 48 (1), 65–91.
Jiang, J., Kelly, B. T., & Xiu, D. (2022). Expected returns and large language models. Available
at SSRN.
Kelly, B., Xiu, D., et al. (2023). Financial machine learning. Foundations and Trends® in Finance,
13 (3-4), 205–363.
Kirtac, K., & Germano, G. (2024). Sentiment Trading with Large Language Models. Finance Re-
search Letters.
Kolm, P. N., Turiel, J., & Westray, N. (2023). Deep order flow imbalance: Extracting alpha at
multiple horizons from the limit order book. Mathematical Finance, 33 (4), 1044–1081.
Lopez-Lira, A., & Tang, Y. (2023). Can ChatGPT forecast stock price movements? Return pre-
dictability and large language models. arXiv preprint arXiv:2304.07619.
Mannuru, N. R., Shahriar, S., Teel, Z. A., Wang, T., Lund, B. D., Tijani, S., Pohboon, C. O., Agbaji, D., Alhassan, J., Galley, J., et al. (2023). Artificial intelligence in developing countries: The impact of generative artificial intelligence (AI) technologies for development. Information Development.
Newey, W. K., & West, K. D. (1987). Hypothesis testing with efficient method of moments estima-
tion. International Economic Review, 777–787.
Noy, S., & Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381 (6654), 187–192.
Otis, N., Clarke, R. P., Delecourt, S., Holtz, D., & Koning, R. (2023). The uneven impact of
generative ai on entrepreneurial performance. Available at SSRN 4671369.
Sætra, H. S. (2023). Generative AI: Here to stay, but for good? Technology in Society, 75, 102372.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Wu, S., Irsoy, O., Lu, S., Dabravolski, V., Dredze, M., Gehrmann, S., Kambadur, P., Rosenberg,
D., & Mann, G. (2023). BloombergGPT: A large language model for nance. arXiv preprint
arXiv:2303.17564.
Yang, Y., UY, M. C. S., & Huang, A. (2020). FinBERT: A Pretrained Language Model for Financial
Communications.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin,
X. V., et al. (2022). OPT: Open pre-trained transformer language models. arXiv preprint
arXiv:2205.01068.
Figure 1: Sketch of the decoder-only transformer architecture. The input token sequence passes through stacked attention blocks (1 through 4), followed by a linear layer and a softmax that output the next-token probabilities (x_2, x_3, ..., x_{t+1}).
Figure 2: Log cumulative returns of the daily StockGPT portfolios. Panel A: Next Day (left: High-Low portfolio; right: High and Low portfolios).
Figure 3: Log cumulative returns of the monthly portfolios. Panel A: StockGPT Portfolios (left: High-Low portfolio; right: High and Low portfolios).
Table 2: Daily Fama-MacBeth Regression
This table reports the time-series averages of slopes and adjusted R²'s of the cross-sectional regression
x_{i,t+1} = a_t + b_t × x̂_{i,t+1} + e_{i,t+1}
where x_{i,t+1} is the actual realized return of stock i on day t + 1 and x̂_{i,t+1} is its StockGPT return forecast. Returns are in basis points and R² in percentage points. t is the t-statistic of the time-series mean of b_t computed using the Newey and West (1987) standard error with 20 lags. Horizon 1 (2) means comparing return forecasts for t + 1 with actual returns on t + 1 (t + 2). The sample is daily from January 2001 to December 2023.

Horizon    b       t        R²
1          0.50    25.18    1.19
2          0.09    10.20    0.41
Table 4: Daily Spanning Tests
This table reports results of the following spanning test:
y_t = α + β × x_t + e_t
In Panel A, y_t is one of the StockGPT-based portfolios and x_t are short-term reversal (ST Rev), momentum (Mom), long-term reversal (LT Rev), market (MKT), value (HML), size (SMB), profitability (RMW), and investment (CMA) from Fama and French (2015), as well as investment (R_IA), return on equity (R_ROE), and earnings growth (R_EG) from Hou et al. (2021). In Panel B, y_t is one of the factors and x_t is one of the StockGPT-based portfolios. α is in annualized percentage points and R² is in percentage points. t_β is computed with the Newey-West standard error using 20 lags. The sample is daily from January 2001 to December 2023.
Table 5: Monthly Fama-MacBeth Regression
This table reports the time-series averages of slopes and adjusted R²'s of the cross-sectional regression
x_{i,t+1} = a_t + b_t × x̂_{i,t+1} + e_{i,t+1}
where x_{i,t+1} is the actual realized return of stock i in month t + 1 and x̂_{i,t+1} is its StockGPT return forecast. Returns are in basis points and R² in percentage points. t is the t-statistic of the time-series mean of b_t computed using the Newey and West (1987) standard error with 4 lags. Horizon 1 (2) means comparing return forecasts for month t + 1 with actual returns in month t + 1 (t + 2). The sample is monthly from February 2001 to December 2023.

Horizon    b        t        R²
1          3.01     2.49     0.55
2          -0.08    -0.08    0.43
Table 6: Monthly Portfolio Statistics
Panel A reports the return statistics of the monthly equal-weighted long-short StockGPT-based portfolios. Mcap Filter refers to the monthly market cap percentile below which stocks are removed, and Price Filter refers to the price level below which stocks are removed. Panel B reports the return statistics of stock factors: short-term reversal (ST Rev), momentum (Mom), long-term reversal (LT Rev), market (MKT), value (HML), size (SMB), profitability (RMW), and investment (CMA) from Fama and French (2015), as well as investment (R_IA), return on equity (R_ROE), and earnings growth (R_EG) from Hou et al. (2021). Mean and SD (standard deviation) are in annualized percentage points; Mean/SD (Sharpe ratio) is annualized; Min, Max, and MDD (max drawdown) are in percentage points; and t-Mean is the t-statistic of the mean portfolio return using the Newey-West standard error with 4 lags. The sample is monthly from February 2001 to December 2023.
Table 7: Monthly Spanning Tests
This table reports results of the following spanning test:
y_t = α + β × x_t + e_t
In Panel A, y_t is the equal-weighted monthly-rebalanced StockGPT-based portfolio and x_t are short-term reversal (ST Rev), momentum (Mom), long-term reversal (LT Rev), market (MKT), value (HML), size (SMB), profitability (RMW), and investment (CMA) from Fama and French (2015), as well as investment (R_IA), return on equity (R_ROE), and earnings growth (R_EG) from Hou et al. (2021). In Panel B, y_t is one of the factors and x_t is the StockGPT-based portfolio. α is in annualized percentage points and R² is in percentage points. t_β is computed with the Newey-West standard error using 4 lags. The sample is monthly from February 2001 to December 2023.
Panel A: Stock Factors Span StockGPT (each β, tβ column pair is one regression specification)
β tβ β tβ β tβ β tβ
α 15.58 4.66 14.94 4.16 15.43 4.79 15.27 4.87
ST Rev -0.06 -0.66 -0.08 -0.77
Mom -0.21 -1.82 -0.25 -1.99
LT Rev -0.49 -3.25 -0.31 -2.71
MKT 0.04 0.70 0.08 1.10 0.06 0.84
HML 0.09 0.51 0.01 0.10
SMB -0.18 -1.54 -0.34 -2.75 -0.40 -2.93
RMW -0.32 -1.72 -0.34 -1.76
CMA -0.46 -0.93 -0.05 -0.23
R IA 0.78 1.73 0.03 0.16
R ROE 0.02 0.11 -0.31 -1.25
R EG 0.02 0.10 -0.01 -0.05
R2 12.82 9.65 4.94 4.72
Panel B: StockGPT Spans Stock Factors
ST Rev Mom LT Rev MKT HML SMB RMW CMA R_IA R_ROE R_EG
α 9.00 5.83 2.33 6.15 2.14 3.35 6.06 2.84 2.26 5.44 5.69
tα 4.03 1.98 0.97 1.65 0.74 1.91 3.10 1.50 1.11 2.77 2.95
β -0.02 -0.26 -0.14 0.10 -0.07 -0.08 -0.08 -0.04 -0.02 -0.09 -0.03
tβ -0.21 -1.60 -2.03 1.04 -0.78 -1.40 -1.39 -0.77 -0.33 -0.99 -0.46
R2 -0.33 5.05 3.46 0.53 0.43 1.34 2.08 0.36 -0.26 1.28 -0.17
A Daily Model Trained with All Stocks