
Received 17 July 2024, accepted 9 August 2024, date of publication 19 August 2024, date of current version 27 September 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3445413

Large Language Models and Sentiment Analysis in Financial Markets: A Review, Datasets, and Case Study
CHENGHAO LIU 1 , ARUNKUMAR ARULAPPAN 2 , (Member, IEEE),
RANESH NAHA 3 , (Member, IEEE), ANIKET MAHANTI 1,4 , (Senior Member, IEEE),
JOARDER KAMRUZZAMAN 5 , (Senior Member, IEEE),
AND IN-HO RA 6 , (Member, IEEE)
1 School of Computer Science, The University of Auckland, Auckland 1010, New Zealand
2 School of Computer Science Engineering and Information Systems, VIT University, Vellore 632014, India
3 School of Information Systems, Queensland University of Technology, Brisbane, QLD 4000, Australia
4 Department of Computer Science, University of New Brunswick, Saint John, NB E2K 5E2, Canada
5 Centre for Smart Analytics, Federation University Australia, Melbourne, VIC 3806, Australia
6 School of Software, Kunsan National University, Gunsan 54150, South Korea

Corresponding author: Arunkumar Arulappan ([email protected])


This work was supported by the School of Computer Science Engineering and Information Systems, Vellore Institute of Technology.

ABSTRACT This paper comprehensively examines Large Language Models (LLMs) in sentiment analysis,
specifically focusing on financial markets and exploring the correlation between news sentiment and Bitcoin
prices. We systematically categorize various LLMs used in financial sentiment analysis, highlighting their
unique applications and features. We also investigate the methodologies for effective data collection and
categorization, underscoring the need for diverse and comprehensive datasets. Our research features a case
study investigating the correlation between news sentiment and Bitcoin prices, utilizing advanced sentiment
analysis and financial analysis methods to demonstrate the practical application of LLMs. The findings reveal
a modest but discernible correlation between news sentiment and Bitcoin price fluctuations, with historical
news patterns showing a more substantial impact on Bitcoin’s longer-term price than immediate news events.
This highlights LLMs’ potential in market trend prediction and informed investment decision-making.

INDEX TERMS Large language model, Bitcoin price, sentiment analysis, machine learning, market
dynamics.
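The case study's headline result is a correlation between daily news sentiment and Bitcoin price movements. As a toy illustration of that kind of computation (the sentiment scores and prices below are synthetic placeholders, not the paper's data, and the paper's actual pipeline uses LLM-derived sentiment):

```python
# Illustrative sketch only: Pearson correlation between daily news-sentiment
# scores and next-day Bitcoin returns. All numbers are made-up placeholders.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical daily sentiment in [-1, 1] and daily closing prices in USD.
sentiment = [0.2, -0.1, 0.4, 0.0, 0.3, -0.3, 0.1]
prices = [42000, 42100, 41800, 42500, 42400, 42900, 42300, 42400]

# Next-day simple returns: r_t = (p_{t+1} - p_t) / p_t.
returns = [(b - a) / a for a, b in zip(prices, prices[1:])]

r = pearson(sentiment, returns)
print(f"sentiment/return correlation: {r:+.3f}")
```

A modest correlation, as reported in the paper, would show up here as a coefficient of small magnitude; lagged variants of the same computation capture the longer-term effect of historical news patterns.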

The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.

2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ (VOLUME 12, 2024, 134041)

I. INTRODUCTION

Sentiment analysis (SA) in financial markets has emerged as a critical study area, particularly given its widespread application in specific sectors like the stock market [1], [2], [3], [4]. This analytic approach primarily aims to discern individuals' attitudes, evaluations, and opinions regarding various entities and products. In this context, behavioral economics becomes pertinent as it delves into the psychological aspects of investor behaviors, considering the influence of social, cultural, and emotional factors on decision-making processes [5]. These factors often play a significant role in explaining market anomalies [6].

Furthermore, the sentiment expressed in news, especially news covering political, social, economic, or emotional events disseminated through social media, profoundly influences investor behavior [7], [8]. As a result, information sourced from online newsgroups, social networks, and stock discussion forums has become increasingly valuable for informed business decision-making. Recently, a significant amount of research has been carried out by fundamentally analyzing unstructured text data through machine learning, involving supervised and unsupervised learning methods. LLMs emerged due to large-scale data and increased

computational power available [9]. Armed with a wide range and variety of training data, these models have shown remarkable proficiency in mimicking human language skills, resulting in significant transformations across various fields, including the financial domain [10]. Applying LLMs to sentiment analysis represents an innovative shift, where traditional sentiment analysis challenges are reinterpreted and addressed through more advanced computational approaches [11]. Their effectiveness is particularly notable in tasks that require deep contextual understanding and nuanced language interpretation, such as predicting market trends, analyzing investor sentiments, and interpreting financial news [10], [12], [13].

Despite the growing interest in using LLMs for sentiment analysis, especially in financial markets, there remains a significant gap in understanding the extent and nature of their impact on financial instruments, particularly cryptocurrencies like Bitcoin. Existing literature predominantly focuses on the technical capabilities of LLMs without adequately exploring their practical implications in financial sentiment analysis. Our study seeks to bridge this gap by not only categorizing various LLMs and their applications in financial markets but also by empirically investigating the correlation between news sentiment, as processed by these models, and Bitcoin price movements. This approach aims to provide a more nuanced understanding of the role of media sentiment in cryptocurrency markets. To achieve this, our study will answer the following research questions (RQ):

1) RQ1: How does the classification, data collection, and application of LLMs in sentiment analysis influence their effectiveness in financial markets?
2) RQ2: What is the correlation between news sentiment, as analyzed by LLMs, and the price of cryptocurrencies like Bitcoin?

This paper makes a significant contribution to the field of financial sentiment analysis by integrating the advanced capabilities of LLMs with the dynamic realm of Bitcoin and cryptocurrency markets. The study stands out for its comprehensive examination of various LLMs, including BERT, FinBERT, and ChatGPT, within the specific context of financial sentiment analyses [12], [14], [15]. This area is particularly challenging due to the nuanced language and investor sentiments intrinsic to market dynamics [16]. The systematic categorization and analysis of these LLMs in the paper illuminate their individual strengths and collective potential in enhancing financial market analytics. The focus on the unique features and applications of these LLMs in the financial domain reveals new insights into their transformative role in market trend prediction and investment decision-making.

By identifying a modest but discernible correlation between news sentiment and Bitcoin prices, the paper contributes valuable empirical evidence to the understanding of cryptocurrency market dynamics. This insight is crucial for a range of stakeholders, including investors, financial analysts, and policymakers, who navigate the complexities of these emerging markets. Additionally, the discussion of challenges and future directions for LLMs in sentiment analysis highlights both the current capabilities and potential growth areas for these models in financial applications.

Fig. 1 illustrates the process of Bitcoin transactions within a peer-to-peer network, showcasing how transactions are signed and added to the blockchain through mining, while also highlighting various factors that can influence Bitcoin's market price.

FIGURE 1. Bitcoin's approach to transaction flow and validation.

This study is structured as follows: Section II presents the existing literature closely related to sentiment analysis in financial markets. Section III outlines our research methodology. Sections IV to VI are collectively focused on addressing RQ1. Section IV presents a detailed classification of LLMs in sentiment analysis. Section V discusses the data collection method and categorization. Section VI explores the applications of LLMs in sentiment analysis. Section VII provides a case study aimed at answering our RQ2, offering practical insights into applying these models in a real-world scenario. Sections VIII and IX discuss the challenges that should be overcome when employing LLMs to solve sentiment analysis tasks and highlight promising opportunities and directions for future research. The conclusions of our study are presented in Section X. The overall organization of the paper is presented in Fig. 2.

FIGURE 2. Structure of this paper.

II. LITERATURE REVIEW

Among the various LLMs, BERT (Bidirectional Encoder Representations from Transformers) [14] has set a new precedent in natural language processing by understanding the context of a word in a sentence more holistically. BERT's architecture has been utilized in the financial sector to create FinBERT [12], a model specifically fine-tuned to grasp the subtleties of financial jargon and sentiment. FinBERT [12] excels in interpreting complex financial reports, earnings calls, and market analysis, providing more accurate sentiment predictions than general-purpose models [17]. Additionally, Ploutos [18], another financial LLM, demonstrates superior performance in predicting stock movements. This model uniquely integrates textual and numerical data using a mixture of experts architecture, enhancing its ability to deliver precise explanations for its predictions. A further groundbreaking LLM is ChatGPT [15], which has been instrumental in enhancing interactive financial analysis. ChatGPT's ability to engage in human-like conversations and provide detailed, contextually relevant responses has been utilized in customer service automation, financial advisory, and real-time market analysis [19]. This model's sophisticated understanding of queries and ability to generate coherent and context-aware responses make it an invaluable tool in the dynamic world of finance.

A few recent studies focus on various applications and advancements of LLMs in financial sentiment analysis. In a recent study, Sharma et al. [20] explored the use of generative models like ChatGPT for sentiment analysis. These models enhance sentiment analysis by augmenting datasets with synthetic labeled data and simulating human sentiment expression, particularly for tasks like sarcasm detection. Key challenges include maintaining the quality and consistency of generated data and addressing inherent biases. By overcoming these issues, the potential for sentiment analysis in real-world applications can be significantly enhanced. The architectures and applications of large language models, including their use in sentiment analysis, are presented by Raiaan et al. [21]. They categorize different LLMs, such as GPT-3, and explore their applications in various domains, including finance. The paper also addresses the challenges and open issues in deploying LLMs for sentiment analysis, such as data scarcity and model interpretability. The increasing role of generative AI models, such as GPT-3, in business and finance is discussed in [22]. The work highlights the potential of these models to generate realistic financial data, perform sentiment analysis, and support decision-making processes. The paper also explores the ethical implications and regulatory challenges associated with using generative AI in financial markets.

Dong et al. [23] investigate the application of LLMs for extracting relevant information from financial documents. The authors employ GPT-3 to analyze annual reports, earnings call transcripts, and other financial texts to identify key sentiment indicators and predict stock price movements. The study shows that LLMs can effectively process and interpret large volumes of text data, providing valuable insights for investors and analysts. Farimani et al. [24] investigate the efficiency and accuracy of using LLMs like GPT-3 for sentiment analysis in the financial market. The authors compare the performance of LLMs with traditional models, demonstrating significant improvements in capturing the nuanced sentiments and predicting market trends based on financial news and social media data. Another early review [25] emphasizes the potential of both BERT and GPT-2 in advancing financial sentiment analysis through improved feature mapping techniques, leveraging their respective strengths in understanding context and generating relevant text.

While previous studies have significantly advanced financial sentiment analysis using models like FinBERT and integrated approaches combining sentiment indices with predictive models, our approach introduces a novel perspective by leveraging more specific aspects like classification, data collection, and application with a case study. A central element of this research is the empirical investigation into the correlation between news sentiment, as analyzed by LLMs, and Bitcoin price movements. This case study is particularly relevant given the growing influence of cryptocurrencies like Bitcoin in global financial markets. Bitcoin serves as a benchmark for the digital currency landscape, characterized by its volatility, decentralized nature, and sensitivity to public sentiment and news [26]. The study addresses the pressing need to understand Bitcoin's market behavior due to its escalating impact on retail and institutional investors and its potential in reshaping financial technology and monetary transactions [27].

III. RESEARCH METHOD

This literature review adheres to the methodology proposed by Kitchenham et al. [28], [29]. Following the guidelines provided by Kitchenham et al. [28], our methods included two main steps: planning and conducting the review. Established academic databases were utilized to gather the relevant literature, including Web of Science, IEEE Xplore, Springer, arXiv, and the UoA (University of Auckland) Library. The following sections describe the methodology used to source and evaluate the chosen literature. Specifically, Fig. 3 presents the structure of the literature review.

FIGURE 3. LLMs classification and literature review methodology in financial sentiment analysis.

Our manual search encompasses four critical databases known for their comprehensive collection of scientific papers. The methodology involved a multi-step process, beginning with creating a keyword dictionary instrumental in the initial search across these databases. Our search string should combine two sets of keywords: one related to sentiment analysis and the other to LLMs. If a paper contains both types of keywords, it is more likely to be one we need. The complete set of search keywords is as follows:

1) Keywords related to sentiment analysis: Sentiment detection, Opinion mining, Emotional analytics, Affective computing, Polarity classification, Subjectivity analysis, Sentiment scoring, Mood analysis, Opinion polarity, Sentiment quantification, Emotion recognition, Tone analysis, Sentiment lexicons, Sentiment metrics, Textual affect detection, Semantic orientation, Sentiment strength, Sentiment benchmarks, Sentimental analysis tools, Review analysis, Consumer sentiment, Investor sentiment, Market sentiment, Brand sentiment, Social sentiment, Sentiment correlation, Aspect-based sentiment analysis, Sentiment summarization, Sentimental classification, Sentimental interpretation.
2) Keywords related to LLMs: LLM, Language Model, Large Language Model, Pre-trained, PLM, Pre-training, NLP, Natural Language Processing, DL, Deep Learning, ML, Machine Learning, ChatGPT, Neural Network, Transfer Learning, Sequence Model, T5, GPT, Codex, BERT, Transformer, Attention Model, AI, Artificial Intelligence.

We included keywords like Machine Learning and Deep Learning, alongside other terms not directly related to Large Language Models (LLMs), in our search criteria. This broader approach aims to ensure we do not overlook any relevant research, thereby expanding our search scope during automated searches.

Upon retrieving these papers, the next step involved a detailed examination of the titles and abstracts, which allowed us to determine the relevance of each paper to the research objectives based on inclusion and exclusion criteria. We designed these criteria following several state-of-the-art papers [30], [31], as shown in Table 1, so that the selected documents directly address our topic. We dropped duplicated studies across multiple databases to refine our dataset, streamlining our literature collection.

TABLE 1. Inclusion criteria and exclusion criteria.

Following the curation of unique studies, we proceeded to a more in-depth review, scanning the full text of each selected paper. A thorough quality assessment can help mitigate biases that may arise from low-quality studies and guide readers on where to approach conclusions with caution [32]. We developed a set of ten Quality Assessment Criteria (QAC), detailed in Table 2. These criteria evaluate the papers' relevance, clarity, validity, and importance.

The final stage in our search process was to conduct a quality assessment of these primary studies, evaluating them against predefined criteria to ensure that only the most rigorous and relevant research was included in our analysis.

The systematic literature review on LLMs for sentiment analysis acknowledges the risk of missing key studies due to potential gaps in keyword summarization. To mitigate this, a dual approach combining manual review and automated searches was utilized, with keywords derived from authoritative sources and forward and backward snowballing techniques employed to ensure thoroughness. Additionally, to counter study selection bias, defined inclusion and exclusion criteria were established, and a QAC framework was implemented, with ambiguous cases receiving manual scrutiny. This blend of strategies aimed to balance efficiency with meticulousness, reducing biases and enhancing the review's validity.
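The keyword-dictionary search described above can be sketched as a small script. This is an illustrative sketch only: the boolean syntax is generic (each database, such as IEEE Xplore or Web of Science, uses its own query operators and field tags), and the record list is made-up.

```python
# Sketch of assembling a boolean search string from the two keyword groups
# and deduplicating records retrieved from several databases. Query syntax
# and records are illustrative, not tied to any specific database API.

sentiment_terms = ["sentiment analysis", "opinion mining", "polarity classification"]
llm_terms = ["large language model", "LLM", "BERT", "GPT"]

def build_query(group_a, group_b):
    """AND two OR-groups: a paper must match one term from each group."""
    a = " OR ".join(f'"{t}"' for t in group_a)
    b = " OR ".join(f'"{t}"' for t in group_b)
    return f"({a}) AND ({b})"

def deduplicate(records):
    """Drop duplicate studies found in multiple databases, keyed by normalized title."""
    seen, unique = set(), []
    for rec in records:
        key = rec["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

query = build_query(sentiment_terms, llm_terms)
records = [
    {"title": "FinBERT for Financial Sentiment", "db": "IEEE Xplore"},
    {"title": "finbert for financial sentiment ", "db": "Web of Science"},  # duplicate
    {"title": "LLMs and Bitcoin Prices", "db": "arXiv"},
]
print(query)
print(len(deduplicate(records)))
```

Title normalization is a crude dedup key; in practice DOIs, where available, are a more reliable choice.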


TABLE 2. Checklist of quality assessment criteria (QAC) for studies on LLMs in sentiment analysis.

IV. CLASSIFICATION OF LARGE LANGUAGE MODELS IN SENTIMENT ANALYSIS

This section explores the classification of LLMs in the context of sentiment analysis, emphasizing how their size and architecture impact their effectiveness. We categorize LLMs based on their structural design, distinguishing between encoder-only, encoder-decoder, and decoder-only models, each with distinct capabilities in processing natural language. For an in-depth exploration of the relevant literature, we have included a comprehensive summary in Table 4.

A. LARGE LANGUAGE MODELS

Pre-trained language models (PLMs) have proven highly effective in various natural language processing (NLP) tasks, as evidenced in several studies [33], [34]. Researchers have noted that increasing the size of these models significantly boosts their capabilities, particularly when the number of parameters exceeds a certain point [35], [36]. The designation ''Large Language Model'' (LLM) is used to differentiate language models based on their size, primarily referring to PLMs at a larger scale [37]. However, it is important to mention that there is no widely agreed-upon standard in the literature for the minimum parameter size of an LLM, as its efficiency is linked to the dataset's size and the total computing power used. In our study, we follow the classification and taxonomy of LLMs introduced by Pan et al. [38], dividing mainstream LLMs into three categories based on their architecture: encoder-only, encoder-decoder, and decoder-only. This classification and the corresponding models are depicted in Fig. 3.

1) ENCODER-ONLY LLMs

Encoder-only LLMs are a specific type of neural network framework that employs solely the encoder part of the model. The primary role of the encoder is to process and transform the input text into a hidden representation. This representation is critical in understanding the connections among words and the general context of the sentence. Prominent examples of encoder-only LLMs include BERT [14] and various adaptations of it [12], [39], [40]. BERT, in particular, is built on the encoder architecture of the Transformer [41]. Its unique feature is the bidirectional attention mechanism, which allows it to analyze the context to the left and right of each word concurrently during its training phase. In the financial domain, other prominent models like FinBERT [12], CryptoBERT [42], and SBERT [26] have been widely employed.

These models distinguish themselves from the original BERT [14] by enhancing the architecture to include novel pre-training tasks or adjusting to different data modalities, thereby improving their effectiveness for finance-related tasks. For instance, FinBERT [12] is an adaptation of BERT [14] that is pre-trained explicitly on financial corpora and fine-tuned to perform sentiment analysis within the financial domain, achieving an accuracy of 0.86 and an F1-score of 0.84. Similarly, CryptoBERT [42], which is also grounded in the BERT [14] model, undergoes fine-tuning on a cryptocurrency-specific corpus, yielding heightened accuracy in the sentiment classification of texts related to cryptocurrencies. It achieved an accuracy of 55.60 and an F1-score of 55.79 among five models on the StockTwits1 data, which contain 1.875 million posts. These models have demonstrated their proficiency in various applications, such as predicting market movements, analyzing investor sentiments, and automating financial report summaries, showcasing their transformative impact on the financial analytics landscape.

2) ENCODER-DECODER LLMs

Encoder-decoder LLMs integrate both the encoder and decoder components [41]. The encoder component converts the input text into a hidden representation, adeptly grasping the fundamental structure and meaning. This hidden representation acts as a transitional language, facilitating the connection between input and output formats. In turn, the decoder leverages this hidden representation to produce the desired output text, transforming the abstract representation into specific, contextually appropriate phrases. Within this context, the memory module of models like FINMEM [43] stands out. It mirrors human cognitive processes, providing clear interpretability and flexibility for real-time adjustments. This feature enhances the model's utility in financial trading by allowing it to hold on to essential information for extended periods, which is crucial for complex decision-making. FINMEM outperformed its peers in trading five different stocks, achieving the highest Sharpe ratio of 2.6789 and the lowest maximum drawdown of 10.7996%. Another example is TradingGPT [44], an innovative LLM multi-agent framework endowed with layered memories. The ability of TradingGPT to navigate through financial data and its application in trading exemplifies how encoder-decoder LLMs can be potent tools in enhancing trading strategies [44].

3) DECODER-ONLY LLMs

Decoder-only LLMs exclusively use the decoder module to produce the intended output text. They follow a unique training approach focusing on sequential prediction [45]. Contrary to the encoder-decoder framework, where the encoder handles the input text, the decoder-only structure starts from a base state and sequentially predicts tokens, thereby progressively constructing the output text. This method heavily depends on the model's proficiency in grasping and predicting language structure, syntax, and context. Key examples of this architecture include the GPT series models such as GPT-1, GPT-2, GPT-3, GPT-4, and their significant variant, ChatGPT2 [45], [46], [47], [48]. The GPT series has shown promising performance in financial sentiment analysis, not only for Twitter news but also in terms of accuracy, recall, and F1-score across different forex pair news [49], [50]. These models demonstrate their capability to excel in financial contexts, highlighting their potential for improving sentiment analysis and market prediction tasks.

These models can execute downstream tasks with minimal input, often requiring just a handful of examples or straightforward instructions. This attribute eliminates the need for additional prediction heads or extensive fine-tuning processes, rendering them particularly valuable in sentiment analysis research. For instance, recent developments in the industry have witnessed Google unveiling Bard, while Meta has introduced its models LLaMA [51] and LLaMA2 [52], alongside Microsoft's foray with Bing Chat.3 One application of LLaMA in the realm of financial sentiment analysis is demonstrated by FinMA, a version of LLaMA specifically fine-tuned for this task, which recorded the highest F1-score of 0.87 on the FiQA dataset [53]. Furthermore, LLaMA2 has proven effective, reaching an accuracy of 84.03% through supervised learning and aligning financial texts [54]. These developments highlight the capabilities of LLaMA models in sentiment analysis, particularly their proficiency in the precise interpretation and assessment of financial sentiments.

V. DATA ACQUISITION AND CLASSIFICATION FOR LLMs IN SENTIMENT ANALYSIS

This section examines the methodologies employed in collecting and utilizing datasets for sentiment analysis with LLMs. It underscores the pivotal role of data in training LLMs, emphasizing the need for diversity and comprehensiveness in dataset collection to enhance model performance in varied contexts [55]. We explore the systematic process of dataset categorization, preprocessing, and formatting, which is essential for aligning data with the model's training objectives and processing needs.

A. SOURCING DATASETS FOR TRAINING LARGE LANGUAGE MODELS

Data is a vital and essential component in training Large Language Models (LLMs), significantly influencing their generalization capabilities, efficiency, and overall performance [55]. An ample amount of high-quality and varied data enables models to thoroughly learn features and patterns, fine-tune their parameters, and maintain dependability during validation and testing.

Our initial focus is on examining the methodologies for dataset acquisition. Through this analysis of data collection techniques, we have categorized the sources of data into four groups: open-source datasets, datasets that are actively collected, datasets that are specifically constructed, and datasets derived from industrial sources. Open-source datasets [56], [57] are publicly available data compilations typically distributed via open-source platforms or repositories. An example of this is the FiQA [56] dataset, a substantial new dataset featuring question-answering pairs focused on financial reports crafted by experts in finance. The credibility of these datasets is bolstered by their open-source status, enabling community-based updates and ensuring their reliability for scholarly research.

The Financial PhraseBank, first introduced by Malo et al. [58], consists of 4,845 English sentences randomly selected from financial news articles in the LexisNexis database. These sentences were annotated by 16 experts in finance and business who evaluated how the information could influence the stock prices of the companies discussed. Furthermore, the dataset includes information about the level of agreement among the annotators regarding the sentiments expressed in the sentences.

TRC2-financial is a specialized subset of the TRC24 collection from Reuters, which encompasses 1.8 million news articles released between 2008 and 2010. This subset specifically contains 46,143 documents, totaling nearly 29 million words and close to 400,000 sentences [12]. SemEval 2017 Task 5 focuses on fine-grained sentiment analysis (FSA) of news headlines and microblogs [59]. The training set for this task includes 1,142 financial news headlines and 1,694 microblog posts, each annotated with target entities and their corresponding sentiment scores. The test set comprises 491 financial news headlines and 794 posts [11].

Collected datasets [26], [60] are compiled by researchers from diverse sources, such as significant websites, forums, blogs, and social media. Researchers often extract data from sources like Twitter and Reddit for datasets specifically tailored to their research inquiries. Constructed datasets [12] are researcher-generated datasets derived from modifying or enhancing collected datasets to align with specific research goals. Manual or semi-automatic modifications can include creating domain-specific tests, annotated datasets, or synthetic data.

1 https://stocktwits.com/
2 https://chat.openai.com/
3 https://www.microsoft.com/en-us/edge/features/bing-chat
4 https://trec.nist.gov/data/reuters/reuters.html
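As a concrete illustration of working with one of the open-source resources above: the commonly distributed Financial PhraseBank files store one example per line as "sentence@label" (labels: positive, negative, neutral), packaged by annotator-agreement level. A minimal loader, assuming that layout (verify it against the copy you obtain, as packaging varies):

```python
# Sketch of loading Financial PhraseBank-style data. The "sentence@label"
# line format below reflects the commonly distributed files; the three
# example lines are invented stand-ins for real corpus sentences.
from collections import Counter

raw = """The company reported record quarterly profits .@positive
Sales declined sharply amid weak demand .@negative
The board will meet next Tuesday .@neutral"""

def parse_phrasebank(text):
    """Split each line at the last '@' into a sentence and its sentiment label."""
    examples = []
    for line in text.strip().splitlines():
        sentence, _, label = line.rpartition("@")
        examples.append({"sentence": sentence.strip(), "label": label.strip()})
    return examples

data = parse_phrasebank(raw)
print(Counter(ex["label"] for ex in data))
```

Using `rpartition` rather than a plain split guards against the rare sentence that itself contains an "@" character.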


TABLE 3. Data types of datasets involved in prior studies.

Industrial datasets [10], sourced from commercial or industrial entities, contain proprietary data and are essential for research addressing real-world business contexts. Acknowledging that certain studies utilize diverse datasets encompassing various categories is essential. For instance, Wu et al. [10] trained BloombergGPT using multiple datasets, e.g., complex table datasets and question-answering pairs.

B. VARIETY OF DATASETS IN EXISTING LLMs FOR SENTIMENT ANALYSIS STUDIES

The data types are crucial in determining the architecture and choice of LLMs, as they directly affect the extraction of implicit features and the decisions made by the model [67]. The selection of specific data types can significantly influence the LLMs' overall effectiveness and ability to generalize. In our research, we explore and categorize various types of financial datasets used in studies of LLMs for sentiment analysis. By examining how data types relate to model architectures and their performance, we aim to highlight the importance of data types in the effectiveness of LLMs for sentiment analysis.

We classified the data types of all datasets into five categories: Twitter posts, Reddit posts, News articles, Annual reports, and Fi-QA. Table 3 describes the specific data included in the data types corresponding to the datasets we summarized from the 15 studies.

VI. APPLICATIONS OF LLMs IN FINANCIAL SENTIMENT ANALYSIS

This section delves into LLMs' diverse and transformative applications in financial sentiment analysis. In recent years, the integration of advanced LLMs into the financial sector has

VI. APPLICATIONS OF LLMs IN FINANCIAL SENTIMENT ANALYSIS
This section delves into LLMs' diverse and transformative applications in financial sentiment analysis. In recent years, integrating advanced LLMs into the financial sector has marked a significant evolution in how financial data, market trends, and investor sentiments are analyzed and interpreted. In particular, we examine how LLMs predict market trends, optimize trading strategies, and forecast stock prices.

A. PREDICTIVE ANALYTICS IN CRYPTOCURRENCY MARKETS USING LLMs
This section explores the application of LLMs for predicting cryptocurrency market trends, with a particular focus on integrating sentiment analysis into these predictions. The potential of LLMs to distill sentiment from vast datasets offers a novel dimension to forecasting models, as evidenced by several recent studies. Zou and Herremans [61] introduced a pioneering multimodal model, PreBit, specifically designed to anticipate significant Bitcoin price movements. Bashchenko [26] provided insights that counter the notion of Bitcoin's value being purely speculative, demonstrating that non-endogenous news carries fundamental information affecting Bitcoin prices.

Raheman et al. [62] highlighted the practical advantages of interpretable AI and NLP methods over non-explainable alternatives, suggesting that transparency in AI could lead to more valuable applications in the financial sector. Ider and Lessmann [68] demonstrated the advantages of refining FinBERT with weakly labeled data, illustrating how even imprecisely labeled datasets can significantly improve text-based feature prediction and forecasting accuracy for cryptocurrency returns. Their study utilized a dataset comprising 433 test samples, with a noteworthy agreement rate of 92.6% among all 16 expert labels. This approach facilitated the development of predictive models for Bitcoin and Ethereum that substantially outperformed baseline models, achieving gains of 0.572 and 0.501, respectively. This evidence underscores the efficacy of leveraging weak labels in enhancing the performance of financial prediction models, particularly in the volatile domain of cryptocurrency markets.

Ortu et al. [63] investigated cryptocurrency price prediction by analyzing social sentiment data from GitHub and Reddit, employing a pre-trained BERT-based model to synthesize emotional and sentiment indicators from social media commentary into hourly and daily series datasets. Their findings indicated that incorporating these social sentiment metrics markedly enhances the predictive accuracy for the daily pricing of Bitcoin and Ethereum. The research highlights a significant inverse relationship between negative sentiment and price volatility within the Bitcoin market, suggesting that users might interpret volatility as a speculative opportunity. In contrast, Ethereum market sentiment is predominantly influenced by emotional arousal, which shows a substantial positive correlation with negative sentiment, indicating that community reactions are more emotionally driven than directly related to price movements.

Building on these findings, Nguyen et al. [27] explored the distinctive impact of ChatGPT-based sentiment indicators on Bitcoin returns, revealing their adeptness at sentiment detection. The study's results prompt further investigation into how Generative AI might enhance financial data analysis and social media sentiment interpretation, potentially unlocking more sophisticated market insights. This research opens up new pathways for sentiment analysis in financial markets, leveraging AI technologies.

B. SENTIMENT-DRIVEN LLM STRATEGIES FOR FINANCIAL TRADING
Developing a robust trading strategy is crucial in the volatile realm of financial markets, where integrating sentiment analysis and LLMs can provide a competitive edge. Kim et al. [60] leveraged CBITS, an LLM adapted to the crypto domain, to parse crypto news sentiments. Their research demonstrates that trading strategies augmented with sentiment scores significantly outperform conventional models, underscoring the efficacy of sentiment-based trading approaches. Backtesting various Bitcoin trading strategies, their study reveals that models employing TabNet combined with RoBERTa, specifically the TabNet RoBERTa top 10, yield the highest profit, recording an impressive gain of 304.65%. In contrast, other models assessed during the same test period generated negative returns.

Yu et al. [43] introduced FINMEM, an innovative LLM-based framework crafted for financial decision-making. This framework is structured around three central modules: Profiling, which tailors the agent to specific investor profiles; Memory, which processes financial information in a layered manner akin to human cognitive structures, facilitating deeper assimilation of financial data; and Decision-making, which translates the processed information into actionable investment strategies. The adaptability of FINMEM, particularly its memory module, provides a level of interpretability that mirrors human trading logic, coupled with the capability for real-time adjustment to optimize trading decisions.

Li et al. [44] took the concept further by developing TradingGPT, an LLM multi-agent framework with layered memories. The LLMs at the heart of this framework act as decision-making cores for trading agents, utilizing the layered memory system to synthesize historical data and current market conditions. This innovative approach enables the agents to engage in strategic dialogues with peers, refine their investment choices, and uphold a diverse yet robust decision-making process informed by their unique trading personas.

Curtó et al. [69] provided empirical evidence showcasing the adaptability of LLM-informed strategies to the dynamic bandit problem, a standard paradigm in trading strategy formulation. Their experiments underscore the ability of LLMs to navigate the complexities of the financial markets, yielding a strategy that competes favorably with traditional methods even in unpredictable scenarios.
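The sentiment-augmented strategies surveyed above all share one mechanical core: a position rule conditioned on a sentiment signal, evaluated against buy-and-hold. The following minimal sketch illustrates that core with a long/flat rule on toy data; the threshold, the series, and the trade-on-yesterday convention are illustrative assumptions, not the cited papers' actual strategies.

```python
def backtest_long_flat(returns, scores, threshold=0.6):
    """Toy sentiment strategy: hold the asset on day t only when day
    t-1's sentiment score exceeds `threshold`, otherwise stay in cash.
    Returns the cumulative strategy return."""
    equity = 1.0
    # Trade on yesterday's sentiment to avoid look-ahead bias.
    for prev_score, ret in zip(scores, returns[1:]):
        if prev_score > threshold:
            equity *= 1.0 + ret
    return equity - 1.0

# Hypothetical daily returns and sentiment scores in [0, 1].
rets = [0.00, 0.02, -0.01, 0.03, -0.02]
sents = [0.9, 0.2, 0.8, 0.1, 0.5]
print(round(backtest_long_flat(rets, sents), 4))  # 0.0506
```

Real backtests such as those in [60] additionally account for transaction costs and out-of-sample test splits, which this sketch omits.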


Gupta [65] aimed to streamline the analysis of Annual Reports across various firms by harnessing the analytical prowess of LLMs. By distilling insights from the LLMs into a quantitatively styled dataset and supplementing it with historical stock prices, a machine learning model was trained using these insights as predictive features. The walk-forward testing indicated that such a model could significantly outperform benchmarks like the S&P 500 returns, underscoring the potential of GPT-3.5 to revolutionize trading strategies. The research revealed that the model, when used to select the top k stocks, consistently generated higher returns than the S&P 500. Notably, the returns were inversely related to the value of k, with lower k values correlating with higher returns. This outcome indicates that the stocks predicted as top performers by the GPT model indeed yielded better financial results.

C. ENHANCING STOCK MARKET FORECASTING WITH LLMs
Our analysis underscores the broad utility of LLMs in stock price prediction through sentiment analysis, showcasing their versatility across various financial applications. Araci introduced FinBERT, a model tailored for the financial sector, demonstrating superior capabilities in economic text mining and suggesting further application of FinBERT across different financial NLP tasks. FinBERT's utility could be significantly extended by integrating more extensive stock market datasets, presenting opportunities for more intricate market analysis and model refinement.

Mishev et al. [11] provided evidence that contextual embeddings substantially improve efficiency for sentiment analysis over traditional lexicons and static word encoders, a benefit that holds even in the absence of large datasets. This advancement points to the potential of LLMs to revolutionize sentiment analysis with a more profound understanding of contextual nuances in financial texts.

Deng et al. [66] revealed that LLMs can achieve remarkable outcomes in market sentiment analysis. The study showed that with minimal examples, it is possible to calibrate a 'student' model that matches or surpasses the performance of more extensive, state-of-the-art models, optimizing both effectiveness and computational efficiency.

Fazlija and Harder [64] identified that sentiment scores derived from news content play a critical role in predicting the direction of stock prices. The correlation between news sentiment and market performance underscores the value of high-quality, content-based sentiment indicators in forecasting models.

VII. CASE STUDY REGARDING THE CORRELATION BETWEEN NEWS SENTIMENT AND BITCOIN PRICE
This case study aims to explore the relationship between the sentiment expressed in cryptocurrency news articles and the price fluctuations of Bitcoin. Leveraging the power of sentiment analysis through advanced language models, this study seeks to provide a deeper understanding of how public sentiment, as reflected in media [16], can impact financial markets, particularly the volatile cryptocurrency sector.

FIGURE 4. Dataset creation process.

A. DATA COLLECTION AND ANALYSIS METHOD
1) CRYPTOCURRENCIES DATA
We collected comprehensive daily cryptocurrency data from the investing website, www.investing.com,5 to investigate this relationship. The dataset spans two years, from November 1, 2021, to November 1, 2023, and encompasses various metrics, including price, closing price, highest and lowest price of the day, opening price, and volume of transactions. Consistent with methodologies employed in similar studies [70], the price was chosen as the primary target variable. This decision follows Bashchenko [26], given the price's everyday use as a critical indicator of market sentiment in financial research, providing a reliable measure of the market's end-of-day valuation for Bitcoin.

Given the current existence of approximately 1,000 cryptocurrency coins, some of which suffer from incomplete information or delayed publication, our selection criteria focused on coins with at least 1,000 recorded observations. This threshold ensures the accumulation of sufficient data for our analyses. Importantly, the use of transfer entropy as a methodological approach in our study offers the advantage of not necessitating a balanced dataset, thus allowing for a broader inclusion of data points. Our dataset represents over 80% of the cryptocurrency market's total market capitalization, ensuring a comprehensive analysis scope.

2) NEWS DATA
To gather cryptocurrency-related news data, we employed an open-source Python library, snscrape,6 renowned for its efficiency in web scraping. This tool was instrumental in compiling a substantial dataset of tweets about various cryptocurrencies. To ensure a targeted and relevant data collection, we used a set of carefully selected search keywords for each cryptocurrency. For instance, in the case of Bitcoin, the search parameters included a combination of its name and symbol, such as 'BTC OR BITCOIN OR Bitcoin'. Aligning with the timeframe of our price data, the collection period for the news data was also set from November 1, 2021, to November 1, 2023.

5 https://www.investing.com/crypto/bitcoin/btc-usd
6 https://github.com/JustAnotherArchivist/snscrape
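The keyword-based collection described above can be sketched as a query builder plus a date-window filter. The keyword lists and helper names below are illustrative assumptions, not the study's exact configuration.

```python
from datetime import date

# Hypothetical keyword sets per coin (the study combines name and symbol).
KEYWORDS = {
    "bitcoin": ["btc", "BTC", "BITCOIN", "Bitcoin"],
    "ethereum": ["eth", "ETH", "ETHEREUM", "Ethereum"],
}

# Two-year study window matching the price data.
START, END = date(2021, 11, 1), date(2023, 11, 1)

def build_query(coin):
    """OR-join a coin's search terms, e.g. 'btc OR BTC OR ...'."""
    return " OR ".join(KEYWORDS[coin])

def in_window(posted):
    """Keep only articles posted inside the study window."""
    return START <= posted <= END

print(build_query("bitcoin"))        # btc OR BTC OR BITCOIN OR Bitcoin
print(in_window(date(2022, 6, 15)))  # True
```

In practice the query string would be passed to the scraping library together with `since:`/`until:` date operators rather than filtered after the fact; the post-hoc filter here keeps the sketch self-contained.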


TABLE 4. Summary of literature review.


Given the sheer volume of cryptocurrency-related news and the constraints of our computational resources, it was necessary to limit the quantity of news collected daily for each cryptocurrency. Without such a limit, the sentiment analysis process would have been impractically prolonged, potentially taking months. To achieve this, we implemented a timed request mechanism, where news was requested every 20 seconds using a single computer. This approach was crucial to avoid triggering anti-crawler mechanisms on the websites we scraped, while also ensuring a consistent data collection rate. By limiting requests in this manner, we capped the collection at a maximum of 5,000 news articles per day for each cryptocurrency, totaling 18,506 articles for the entire study period. For each piece of news, we meticulously recorded several key attributes: the date and time of the post (DateTime), the headline, the main text of the news article, the author's information, the URL, and a few other relevant features.

3) SENTIMENT CLASSIFIERS
FinBERT, introduced by Araci in 2019 [12], stands as the first finance domain-specific BERT model, pretrained on the expansive TRC2-financial corpus. This corpus, a specialized subset of Reuters' TRC2, comprises approximately 1.8 million news articles published between 2008 and 2010. Given the scope of this paper, a detailed exploration of the BERT architecture is beyond our purview, but readers are encouraged to consult Araci's original work for a comprehensive understanding.

The FinBERT model underwent further fine-tuning using the Financial Phrase Bank, a resource developed by Malo et al. in 2013 [71] specifically for sentiment classification tasks within the financial domain. FinBERT's performance in financial sentiment analysis tasks showed a notable 15% improvement over generic BERT models [12]. This enhancement in accuracy, together with the successful application of FinBERT in studies parallel to ours, such as those by Zou and Herremans [61] and Farimani et al. [72], underscored its suitability for our research objectives.

For our study, we opted to employ the FinBERT model in its pre-fine-tuned state, initially configured by Araci [12]. This decision was driven by the nature of our dataset, which primarily consists of unlabeled news articles. Further fine-tuning of FinBERT on other labeled datasets was deemed unnecessary, considering it has already been optimized for sentiment classification using the Financial Phrase Bank. By applying this ready-to-use, finely tuned FinBERT model to our news data, we aimed to leverage its advanced capabilities for accurate sentiment analysis in the financial sector without additional training. FinBERT outputs sentiment scores for each news article on a scale from 1 to 10, where 10 indicates a high confidence level in the news positively impacting Bitcoin prices. Finally, the data are stored in a database (Fig. 4).

4) ENGLE'S BIVARIATE DCC-GARCH TECHNIQUE IN FINANCIAL RETURN ANALYSIS
Engle's bivariate Dynamic Conditional Correlation-Extended Generalized Autoregressive Conditional Heteroskedasticity (DCC-GARCH) technique, introduced in 2002, is a cornerstone model for analyzing the co-movement of financial returns. This technique boasts two significant advantages over other variants in the GARCH model family, such as the Baba-Engle-Kraft-Kroner (BEKK) and Constant Conditional Correlation (CCC) models. Firstly, it exhibits a superior capacity to capture time-varying conditional covariance. This is achieved with less computational complexity than the BEKK model, making it more efficient and accessible for complex analyses. Secondly, unlike the CCC model, which assumes constant correlations over time, the DCC model allows for variation, adding flexibility and realism to the analysis.

The bivariate DCC model's simplicity is particularly beneficial in many return series contexts. A key strength is its ability to directly account for heteroscedasticity by calculating Dynamic Conditional Correlations (DCCs) from standardized residuals. As Chiang et al. [73] noted, this approach ensures that the DCCs are free from biases associated with volatility clustering, addressing concerns highlighted by Forbes and Rigobon [74]. Additionally, the DCC model's proficiency in generating accurate, time-varying estimates of volatilities and correlations is invaluable. It capably reflects the latest market news and responds to regime shifts triggered by shocks and crises. This dynamic analysis of correlations over time facilitates more informed asset allocation and hedging decisions.

Implementing the DCC model involves a two-step process to ascertain conditional correlations. A univariate GARCH model is initially estimated for each return series, yielding the conditional variance. Subsequently, dynamic conditional correlations are derived from these standardized residuals. Following the methodology described by Bauwens and Laurent, the model is delineated as follows:

R_t = µ_t + Σ_t^{1/2} Z_t    (1)

where the return vector R_t = (r_t^S, r_t^j)′ comprises the sector indices r_t^S and the selected alternative investments r_t^j; µ_t = (µ_t^S, µ_t^j)′ is the conditional mean process; and Z_t ~ iid N(0, 1) is a (2 × 1) vector of independent identically distributed random variables. The conditional covariance matrix is Σ_t = D_t C_t D_t, with the conditional correlation matrix

C_t = [ρ_t^{S/j}] = diag(Q_t)^{-1/2} Q_t diag(Q_t)^{-1/2}    (2)

and

D_t = diag(√(h_t^S), √(h_t^j))    (3)

where h_t^S and h_t^j denote the univariate GARCH variances. The (2 × 2) symmetric positive definite matrix Q_t is given by

Q_t = (1 − α − β) Q̄ + α η_{t−1} η′_{t−1} + β Q_{t−1}    (4)

where Q̄ is the unconditional correlation matrix of the standardized innovations η_t.


The positive scalars α and β are restricted by α + β < 1. We obtain the DCCs by

ρ_t^{S/j} = q_t^{S/j} / (q_t^S q_t^j)^{1/2}    (5)

5) TRANSFER ENTROPY
Transfer entropy offers distinct advantages over traditional methods, enhancing its capability to evaluate information flows, as highlighted by Barnett et al. [75]. Unlike conventional econometric models that rely heavily on domain-specific assumptions and constraints, transfer entropy facilitates a non-parametric analysis of time-series data, minimizing the need for extensive presumptions about stochastic processes. Fundamentally, transfer entropy is grounded in econophysics, focusing on quantifying the directional information flow of a variable over time, rooted in information theory. This concept was originally introduced by Shannon in 1948.

H_I = − Σ_i p(i) · log(p(i))    (6)

In this context, i denotes a discrete random variable characterized by its probability distribution, p(i), reflecting the various outcomes it may manifest, and H_I is known as Shannon entropy. Shannon's [76] seminal work in 1948 established the groundwork for this methodology, focusing on the uncertainty and dynamism in a variable's processes. Subsequently, Kullback and Leibler [77], in 1951, expanded upon this by integrating an additional element, referred to as process J. Notably, the concept of transfer entropy gains complexity with the inclusion of more variables and values, indicating a broader and more intricate understanding of entropy.

h_I(k) = − Σ_i p(i_{t+1}, i_t^{(k)}) · log(p(i_{t+1} | i_t^{(k)}))    (7)

To elaborate further, the marginal probability distributions p(i), p(j) and the joint probability distribution p(i, j) are expected to form a stationary time series. This implies that i_t^{(k)} = (i_t, ..., i_{t−k+1}) represents a sequence of values over time. Similarly, h_J(l) is defined for process J in a comparable manner. Kullback and Leibler [77], in their 1951 work, introduced a broader application of the Markov process to this context.

p(i_{t+1} | i_t^{(k)}) = p(i_{t+1} | i_t^{(k)}, j_t^{(l)})    (8)

Transfer entropy revolves around the likelihood of one variable obtaining information from its own past and from another variable (j_t). The core idea behind transfer entropy is to quantify the information exchange between two distinct random variables. Schreiber [78] elucidated this approach, where I and J represent two separate processes. The transfer entropy from J to I is defined as the difference between the information absorbed by a future instance of process I, i_{t+1}, from the past values of both I and J, and the information absorbed by the same future instance solely from the past values of I. In essence, transfer entropy seeks to measure the net information flow.

T_{J→I}(k, l) = Σ_{i,j} p(i_{t+1}, i_t^{(k)}, j_t^{(l)}) · log( p(i_{t+1} | i_t^{(k)}, j_t^{(l)}) / p(i_{t+1} | i_t^{(k)}) )    (9)

In this context, T_{J→I} is used to assess the flow of information from J to I. Dimpfl and Peter [79] introduced novel methods, including the Markov block bootstrap and the repeated bootstrap, to this field of study. They base their investigation on the null hypothesis which posits the absence of any information transfer.

RT_{J→I}(k, l) = (1 / (1 − q)) · log( Σ_i φ_q(i_t^{(k)}) · p^q(i_{t+1} | i_t^{(k)}) / Σ_{i,j} φ_q(i_t^{(k)}, j_t^{(l)}) · p^q(i_{t+1} | i_t^{(k)}, j_t^{(l)}) )    (10)

Here, J and I represent two distinct processes, while q > 0 is a positive weighting parameter applied to the individual probability function p(.) in the computations. Specifically, i_n refers to the nth element of the time series I, and j_n denotes the nth element of the time series J. It should be recognized that φ_q constitutes the escort distribution, defined by φ_q(i) = p^q(i) / Σ_i p^q(i). The primary purpose of introducing the Markov process into this analysis is to estimate the likelihood of transitioning from one state to another during information transfer, as well as to facilitate the prediction of potential transition matrix scenarios. According to Equation 10, and following the methodology suggested by Bekiros et al. [80], setting l = k = 1 allows for the denoising of the dataset and enables transfer entropy to detect asymmetrical interactions between pairs (X and Y) and (Y and X), thus offering valuable insights into the dynamics of information flow between two time series. In essence, transfer entropy relies on the logarithmic scale of the number of possible outcomes, determined by a given probability distribution, to analyze information flows.
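The plug-in estimator implied by Equation (9) with k = l = 1 can be sketched for discrete-valued series as follows; the binarized toy series and the variable names are illustrative, not the study's estimation code.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Plug-in estimate of Schreiber's transfer entropy T_{source->target}
    for discrete series, with history lengths k = l = 1 as in Eq. (9)."""
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_xyz = Counter(triples)                       # counts of (i_{t+1}, i_t, j_t)
    c_yz = Counter((y, z) for _, y, z in triples)  # counts of (i_t, j_t)
    c_xy = Counter((x, y) for x, y, _ in triples)  # counts of (i_{t+1}, i_t)
    c_y = Counter(y for _, y, _ in triples)        # counts of i_t
    te = 0.0
    for (x, y, z), c in c_xyz.items():
        cond_full = c / c_yz[(y, z)]       # p(i_{t+1} | i_t, j_t)
        cond_own = c_xy[(x, y)] / c_y[y]   # p(i_{t+1} | i_t)
        te += (c / n) * log2(cond_full / cond_own)
    return te

# A target that copies the source one step later receives close to 1 bit
# of information; the reverse direction carries almost none.
random.seed(7)
src = [random.randint(0, 1) for _ in range(600)]
tgt = [0] + src[:-1]
print(transfer_entropy(src, tgt) > 0.9)  # True
print(transfer_entropy(tgt, src) < 0.1)  # True
```

Applied to the case study, the binarized series would be the signs of daily Bitcoin returns and of the aggregated news sentiment scores; significance would then be assessed with the bootstrap procedures of Dimpfl and Peter [79] rather than by the point estimate alone.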


B. RESULTS
TABLE 5. Descriptive statistics.

Table 5 presents the descriptive statistics for a case study analyzing the volatility of Bitcoin prices and news sentiment from November 1, 2021, to November 1, 2023. This period witnessed significant fluctuations in Bitcoin prices, as evidenced by a minimum price of $15,766 on November 21, 2022, and a peak of $67,526 on November 8, 2021. Brown [81] suggests that kurtosis values typically range from −10 to +10, and skewness values between −3 and +3 are acceptable. In this context, the Bitcoin price exhibits a skewness greater than one and a kurtosis exceeding 3, indicating a distribution with a higher peak and thicker tails than a normal distribution, thereby implying a higher likelihood of extreme values. Regarding news sentiment, the skewness is 0.9311, denoting a moderate positive skew with a longer right tail. The kurtosis of 4.6131, above 3, categorizes the distribution as leptokurtic, suggesting that the news sentiment scores have a sharper peak and heavier tails compared to a normal distribution.

TABLE 6. Unit root test results.

Table 6 details the outcomes of unit root tests conducted on the variables utilized in this study, explicitly presenting the Augmented Dickey-Fuller (ADF) test results for each factor. Although stationarity is not a prerequisite for the transfer entropy approach, which can handle probability density functions from a single realization as highlighted by Wollstadt et al. [82], we nevertheless performed a stationarity test.

For the Bitcoin price, a first-order difference was applied. The ADF test result for Bitcoin price, with a statistic of −2.4729 and a higher p-value, indicates that the time series is non-stationary. This means that the null hypothesis of a unit root for Bitcoin price cannot be rejected at the 5% significance level, necessitating further analysis due to its non-stationary nature.

In contrast, the ADF test for News Sentiment yields a statistic of −6.0595 with a p-value of 0.01. This p-value, being below the commonly accepted significance level of 0.05, strongly refutes the null hypothesis of a unit root. Consequently, we can confidently reject the null hypothesis, affirming that the news sentiment time series is stationary.

Other cryptocurrencies show mixed results: BNB has an ADF test statistic of −2.9179, indicating non-stationarity as the null hypothesis cannot be rejected. ETH has an ADF test statistic of −2.4665, which is non-stationary. DOGE shows a statistic of −3.7383, which rejects the null hypothesis at the 5% significance level, indicating stationarity. TRON has an ADF test statistic of −2.7859, indicating non-stationarity. XRP has an ADF test statistic of −2.8493, also indicating non-stationarity. SOL and ADA show statistics of −2.8546 and −4.3383, respectively, indicating stationarity. These results suggest that while News Sentiment is stationary, most of the cryptocurrency prices exhibit non-stationary behavior, requiring second-order differencing for further analysis.

1) BITCOIN VOLATILITY: RESULTS FROM DCC-GARCH MODEL
The outcome of the DCC-GARCH model is presented in Table 7, illustrating the dynamic adjustments in conditional correlation within a multivariate DCC model's framework, complemented by the GARCH model's volatility insights. These statistical results reveal that the price of Bitcoin and news sentiment generally exhibit a similar directional movement; however, this relationship is notably weak. The ρ (Rho) value of 0.1145 indicates a low long-term correlation between Bitcoin prices and news sentiment, suggesting that, on average, they do not move together closely. The α (Alpha) value is 0.00107, which is very small, implying that recent news events exert minimal influence on the immediate volatility of Bitcoin's price. This means that new information or shocks from news have a negligible short-term impact on Bitcoin's volatility.

Conversely, the β (Beta) value is 0.9874, which is quite high, indicating that past volatility trends have a substantial and enduring impact on the volatility of Bitcoin's price. This high Beta value suggests that historical news patterns are a significant factor in the longer-term volatility of Bitcoin.

TABLE 7. Estimation results for the DCC-GARCH model.
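In the bivariate case, step two of the estimator (Equations (4) and (5)) reduces to a scalar recursion. The sketch below assumes the standardized residuals are already available from the univariate GARCH fits and fixes illustrative values of α and β; it is not the paper's estimation code, which fits these parameters by maximum likelihood.

```python
from math import sqrt

def dcc_correlations(eta1, eta2, alpha=0.05, beta=0.90):
    """Run the Q_t recursion of Eq. (4) on standardized residuals and
    return the dynamic conditional correlations rho_t of Eq. (5).
    Requires alpha + beta < 1."""
    n = len(eta1)
    # Q-bar: unconditional second moments of the standardized residuals.
    q11 = sum(e * e for e in eta1) / n
    q22 = sum(e * e for e in eta2) / n
    q12 = sum(a * b for a, b in zip(eta1, eta2)) / n
    Q11, Q22, Q12 = q11, q22, q12          # initialize Q_0 at Q-bar
    rhos = []
    for t in range(1, n):
        a, b = eta1[t - 1], eta2[t - 1]    # lagged innovations eta_{t-1}
        Q11 = (1 - alpha - beta) * q11 + alpha * a * a + beta * Q11
        Q22 = (1 - alpha - beta) * q22 + alpha * b * b + beta * Q22
        Q12 = (1 - alpha - beta) * q12 + alpha * a * b + beta * Q12
        rhos.append(Q12 / sqrt(Q11 * Q22))  # Eq. (5) normalization
    return rhos

# Perfectly co-moving residuals give rho_t = 1 at every step.
eta = [1.0, -1.0] * 50
print(all(abs(r - 1.0) < 1e-9 for r in dcc_correlations(eta, eta)))  # True
```

The normalization in Equation (5) keeps every ρ_t inside [−1, 1] regardless of the residual path, which is why the recursion can be run on raw standardized residuals without further constraints.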


2) NEWS-INDUCED SPILLOVER EFFECTS IN CRYPTOCURRENCY MARKETS
Transfer entropy values are calculated and detailed in Table 8 and Table 9. It is important to clarify that these values should not be confused with directional or signal relationships typical of correlations or coefficients. Instead, they should be understood as transfer entropy measures flowing from the 'Sender' to the 'Receiver', indicative of the information transfer between the two entities. Our findings highlight significant spillover effects within the cryptocurrency markets, as gauged by the transfer entropy method.

FIGURE 5. Correlation among cryptocurrencies and news sentiment.

Notably, cryptocurrencies with smaller market capitalization tend to react more sensitively compared to their larger counterparts. For instance, XRP (ranked 7th) and ADA (ranked 10th) emerge as the most notable recipients. As shown in Table 9, they are the most sensitive to changes, receiving signals from 7 other sources, indicating their high reactivity to market information, including news sentiment. BNB, on the other hand, sends signals to 7 other cryptocurrencies, showcasing its role as a significant influencer within the market.

Conversely, cryptocurrencies with the largest market capitalization exhibit lower levels of information exchange. For example, BTC sends information to 7 other cryptocurrencies, illustrating its central role in the market. Despite its significant influence, BTC receives only 4 signals, reflecting its relative stability and lower sensitivity to external shocks compared to smaller cryptocurrencies.

News events that substantially affect Bitcoin's valuation often initiate a domino effect, impacting the valuations of other cryptocurrencies. For instance, it has been observed that news impacts the prices of BNB, ETH, and XRP. As shown in Table 9, News Sentiment sends 4 shocks to other cryptocurrencies but receives only 1 shock from another cryptocurrency, highlighting its role as a significant source of market information.

Nonetheless, Bitcoin exerts a more profound influence on these cryptocurrencies. Given that Bitcoin accounts for approximately 50% of the total market capitalization of all cryptocurrencies, our observations align with those reported in the study by Zhang et al. [83]. This significant market share underscores Bitcoin's extensive connectivity and influence over other cryptocurrencies, including BNB, ETH, DOGE, TRON, XRP, SOL and ADA.

C. CASE STUDY LIMITATIONS AND CONSIDERATIONS
The findings of this case study reveal a discernible but modest correlation between news sentiment and Bitcoin price fluctuations. Utilizing the robust FinBERT [12] model for sentiment analysis and the DCC-GARCH technique for financial analysis, we gleaned significant insights into the dynamic interplay between public sentiment, as reflected in media, and Bitcoin's price volatility. Specifically, the statistical results from the DCC-GARCH model suggested that historical news patterns wield a more substantial impact on Bitcoin's longer-term volatility than immediate news events. These findings provide insights into the interrelationships between news and Bitcoin price, underscoring the importance of monitoring news for cryptocurrency.

This investigation is subject to certain constraints. Notably, the scope of the data utilized could be more extensive regarding the timeframe and the range of currencies examined, which may influence the perceived relationship between the variables. Including more data and additional cryptocurrencies in future analyses could alter the outcomes of this study. Furthermore, the diversity of methodologies employed in similar studies poses challenges in directly comparing their results. Exploring factors influencing cryptocurrency development is an evolving area of academic interest that warrants further exploration. Future studies aim to overcome these limitations by incorporating broader and more varied datasets and adopting more uniform research methods, contributing to a more cohesive understanding among scholars in the field.

VIII. CHALLENGES
This section delves into the multifaceted challenges and limitations of using LLMs in sentiment analysis. The technical difficulties are paramount, highlighted by the significant computational and storage demands of models that have evolved from GPT-1 [46] to those with trillions of parameters, raising concerns about accessibility in resource-limited contexts. LLMs also struggle with generalizability, often failing to maintain consistent performance across diverse domains and tasks. This points to a need for models that are more adaptable and versatile. Additionally, the interpretability and ethical usage of LLMs are crucial, especially in critical sectors like finance, where the opaque nature of these models can hinder trust and reliability.

A. CHALLENGES IN LLM APPLICABILITY
The evolution of LLMs has been characterized by a substantial increase in their size, with a progression from GPT-1's 117 million parameters [46] to GPT-2's 1.5 billion [45] and a dramatic leap to GPT-3's 175 billion parameters [47].


TABLE 8. Transfer entropy matrix.

TABLE 9. Summary of sending and receiving signals.

reaching into the trillions of parameters [84]. Such vast sizes present formidable challenges regarding storage, memory, and computational requirements. These challenges are particularly acute in scenarios with limited resources or real-time demands, especially when developers cannot access high-powered GPUs or TPUs. For instance, FinBERT is a pre-trained model with 110 million parameters, resulting in a considerable size of 438 MB [12]. The Hugging Face team [85] notes that training a 176 billion parameter model like BLOOM [86] on a 1.5 TB dataset consumes 1,082,880 GPU hours. Similarly, training the GPT-NeoX-20B model [87] on the Pile dataset [88], which includes over 825 GiB of raw text data, requires eight NVIDIA A100-SXM4-40GB GPUs. This extensive training can last up to 1,830 hours, or approximately 76 days.
Beyond the monetary costs, these models also incur significant energy expenses. Predictions indicate a massive increase in energy usage by platforms employing LLMs [89], raising environmental concerns. However, a growing body of research is aimed at mitigating these challenges. For example, Wang et al. [90] have demonstrated a distillation method that successfully compresses the MiniLM model to a mere 66 million parameters, significantly reducing its size while maintaining efficiency. Increasing LLM sizes pose a complex challenge, necessitating ongoing efforts toward more efficient deployment strategies.

B. CHALLENGES IN LLM GENERALIZABILITY
Generalizability in LLMs pertains to their capability to perform tasks accurately and consistently across various domains, datasets, or functions that differ from their initial training environment. Although LLMs are often trained on extensive datasets encompassing a broad range of knowledge, their efficacy can be less reliable when applied to unique or niche tasks outside their primary training scope. This limitation becomes evident in diverse applications, from coding projects to document analysis, where the context and semantics can vary significantly across different projects, languages, or domains.
To enhance the generalizability of LLMs, it is crucial to engage in meticulous fine-tuning, apply rigorous validation across diverse datasets, and establish continuous feedback mechanisms. These steps are vital to prevent models from becoming overly specialized in their training data, which can severely restrict their applicability in various real-world scenarios. However, despite these precautions, recent studies indicate that LLMs often fail to extend their high performance to inputs markedly different from their training data [91]. This limitation highlights a significant gap in the current capabilities of LLMs. The challenge, therefore, lies in developing LLMs that both possess extensive knowledge and understanding gleaned from large datasets and exhibit the flexibility and adaptability required to function effectively across a wide range of contexts. Addressing this challenge involves refining the training process and innovating in model architecture and learning algorithms.

C. CHALLENGES IN LLM INTERPRETABILITY, TRUSTWORTHINESS, AND ETHICAL USAGE
Interpretability and trustworthiness are pivotal in integrating LLMs for sentiment analysis tasks. The primary challenge lies in demystifying the decision-making processes of these models. Due to their 'black-box' nature, elucidating the mechanisms through which they discern sentiment from text is often challenging. Recent studies [92] have underscored this issue, revealing that while LLMs are proficient in sentiment analysis, their opaque internal workings remain a significant barrier. This obscurity in how these models arrive at their conclusions can generate apprehension and reluctance among users, particularly investors who rely on clear and logical reasoning for decision-making [93].
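One simple, model-agnostic way to surface which words drive a sentiment prediction is occlusion analysis: delete each token in turn and measure how the model's score changes. The sketch below is a minimal illustration under stated assumptions; the lexicon-based scorer is a hypothetical stand-in for a real LLM sentiment model, and the word lists are invented for the example.

```python
# Occlusion-based token attribution: a minimal, model-agnostic sketch.
# The toy lexicon scorer stands in for a real LLM; words are illustrative.

POSITIVE = {"surge", "gain", "bullish", "record"}
NEGATIVE = {"crash", "loss", "bearish", "fear"}

def sentiment_score(tokens):
    """Toy scorer: (#positive - #negative) / #tokens, in [-1, 1]."""
    if not tokens:
        return 0.0
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def occlusion_attribution(text, scorer=sentiment_score):
    """Attribute the score to each token by deleting it and re-scoring."""
    tokens = text.split()
    base = scorer(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        # Positive attribution means removing the token lowers the score.
        attributions.append((tok, base - scorer(reduced)))
    return attributions

if __name__ == "__main__":
    for tok, attr in occlusion_attribution("Bitcoin posts record surge amid fear"):
        print(f"{tok:>8s}  {attr:+.3f}")
```

The same loop works with any black-box scorer, including an LLM queried through an API, which is what makes occlusion attractive as a first interpretability check for investors who need to see which words moved a sentiment call.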

Investors are only likely to trust the outputs of LLMs when they have a transparent understanding of the underlying processes. To foster trust in LLMs, it is essential to develop and implement techniques and tools that shed light on the internal mechanics of these models. Such efforts would enable developers and users to trace and understand the rationale behind the outputs generated by LLMs. Improving interpretability and trustworthiness is both a technical necessity and a step towards broader acceptance and use of LLMs in sentiment analysis, leading to more efficient and effective practices in this field [94].
Another aspect contributing to the challenge is the closed nature of many LLMs. It is often unclear what data these models have been trained on, raising questions about the source training data's quality, representativeness, and ownership. This lack of transparency extends to concerns over the ownership of derivative data produced by the models [95]. Furthermore, the potential vulnerability of LLMs to various adversarial attacks, where inputs are maliciously designed to manipulate or confuse the models, adds another layer of complexity. These risks emphasize the need for robust security measures and ethical considerations in developing and deploying LLMs.

IX. FUTURE OPPORTUNITIES
This section highlights the future opportunities for LLMs in sentiment analysis. As these models evolve and gain prominence in academic research, we explore the emerging trends and potential advancements that could shape their role in sentiment analysis. It reflects on optimizing LLMs for greater efficiency and effectiveness, expanding their natural language processing capabilities to encompass a more comprehensive array of input forms, and enhancing their performance in existing sentiment analysis tasks.

A. OPTIMIZATION OF LLM FOR SENTIMENT ANALYSIS
The ascent of ChatGPT in academic research highlights its growing prominence and acceptance in scholarly circles. Researchers have increasingly favored ChatGPT over other LLMs and their applications since its release, primarily due to its computational efficiency, versatility in handling diverse tasks, and potential for cost-effectiveness [96]. Beyond its application in sentiment analysis, ChatGPT has spearheaded an era of enhanced collaboration in the financial sector. This trend marks a significant shift towards incorporating sophisticated natural language understanding into sentiment analysis [97]. By examining these evolving dynamics, we can anticipate the future trajectory of LLMs like ChatGPT in refining and revolutionizing sentiment analysis processes. These developments indicate the transformative potential of LLMs in sentiment analysis.
Regarding the utilization of LLMs, the choice between using commercially available pre-trained models like GPT-4 and opting for open-source alternatives such as LLaMA [51], LLaMA 2 [52], and Alpaca7 presents distinct avenues for customization in specialized tasks. The critical difference between these approaches lies in their level of control and personalization. Despite their proprietary nature, pre-trained models like GPT-4 enable quick, task-specific adaptations with minimal data requirements. This approach reduces computational demands and expedites deployment. In contrast, open-source frameworks like LLaMA provide a foundation for extensive tailoring. While these models arrive pre-trained, they can be further adapted, with organizations often modifying and retraining them on large-scale datasets specific to their needs [98]. Although this process demands substantial computational resources and investment, it allows for creating models intricately tailored to specific domains.

B. EXPANDING LLM'S NLP CAPABILITIES IN MORE SENTIMENT ANALYSIS PHASES
Throughout our analysis, it became apparent that most data inputs for LLMs in sentiment analysis were text-based. This finding aligns with traditional NLP approaches, yet there is a noticeable gap in the use of more diverse and complex datasets, particularly graph-based ones. Embracing a more comprehensive array of natural language inputs, such as spoken language, diagrams, and multimodal data, could significantly expand the capabilities of LLMs in capturing and interpreting varied forms of user sentiment [99]. Integrating spoken language into LLMs could enhance user interactions, enabling the models to process more natural and contextually rich conversations. This addition would allow LLMs to better understand nuances in tone, intonation, and colloquial expressions, which are often lost in text-based communication. Similarly, including diagrams could provide valuable visual representations of complex ideas or emotions, offering a unique dimension to sentiment analysis [100]. Diagrams can be a powerful tool to convey information that may be difficult to express through words alone.
Moreover, multimodal inputs that amalgamate text, audio, and visual elements could lead to a more holistic understanding of context. Such a comprehensive approach would likely result in more accurate and context-sensitive sentiment analysis outcomes. For instance, combining textual data with vocal intonations and facial expressions could yield a better understanding of the user's emotional state and intentions [101], [102].

C. ENHANCING LLM'S PERFORMANCE IN EXISTING SENTIMENT ANALYSIS TASKS
In academic research, establishing a universal and adaptable evaluation framework for LLMs in sentiment analysis is becoming increasingly imperative. Such a framework is essential for conducting systematic and consistent assessments of LLMs, focusing on their performance, efficacy, and potential limitations. This standardization would serve as a critical benchmark, enabling researchers to verify the practical readiness of these models for various applications.

7 https://github.com/tatsu-lab/stanford_alpaca
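A universal framework of the kind described above ultimately reduces to running every model against the same labeled datasets with the same metrics. The sketch below illustrates a minimal such harness under stated assumptions: the model interface (any callable mapping text to a label), the dataset names, and the toy examples are hypothetical placeholders, not an actual benchmark.

```python
# Minimal sketch of a standardized sentiment-evaluation harness.
# Each "model" is any callable text -> label; datasets are (text, label) pairs.
# All names and data here are illustrative placeholders.

from collections import defaultdict

def evaluate(model, dataset):
    """Accuracy plus per-class recall, the core of a comparable report."""
    correct = 0
    per_class = defaultdict(lambda: [0, 0])  # gold label -> [hits, total]
    for text, gold in dataset:
        pred = model(text)
        per_class[gold][1] += 1
        if pred == gold:
            correct += 1
            per_class[gold][0] += 1
    recalls = {label: hits / total for label, (hits, total) in per_class.items()}
    return {"accuracy": correct / len(dataset), "recall": recalls}

def run_benchmark(models, datasets):
    """Score every model on every dataset so results are directly comparable."""
    return {(m_name, d_name): evaluate(model, data)
            for m_name, model in models.items()
            for d_name, data in datasets.items()}

if __name__ == "__main__":
    datasets = {"toy_news": [("markets rally", "positive"),
                             ("bitcoin slumps", "negative")]}
    models = {"always_positive": lambda text: "positive"}
    for key, report in run_benchmark(models, datasets).items():
        print(key, report)
```

A fuller framework would add the other criteria discussed here, such as latency and cost per query for computational efficiency, and per-group metric gaps for bias and fairness, but the cross-product structure of `run_benchmark` is what puts all models on the same level playing field.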

A standardized evaluation framework would offer a comprehensive set of criteria and metrics against which to measure LLMs, ensuring that their capabilities are accurately and objectively assessed [103].
In academia, where rigorous analysis and validation are paramount, the absence of such a framework can lead to fragmented and inconsistent evaluations of LLMs, potentially impeding their development and adoption. By establishing a universally accepted framework, researchers can compare different LLMs on a level playing field, fostering a clearer understanding of each model's strengths and areas for improvement. This framework should ideally encompass a range of considerations, including accuracy in sentiment detection, adaptability to different linguistic contexts, computational efficiency, and ethical concerns such as bias and fairness [104], [105].
Furthermore, a universal evaluation framework would facilitate responsible LLM adoption in academic research [106]. It would provide scholars with the tools to decide which models best suit their research needs and objectives.

X. CONCLUSION
In this comprehensive literature review, we examine the intersection of LLMs and sentiment analysis within financial markets, providing a detailed exploration of LLMs' evolution, application, and future opportunities in this domain. The review navigates through the intricacies of sentiment analysis, underlining its significance in understanding market dynamics and investor behavior. Our analysis of LLMs, particularly their development from BERT [14] to more sophisticated models like FinBERT [12] and ChatGPT, reveals these models' substantial impact on financial sentiment analysis.
The review methodically dissects the role of LLMs in various financial contexts, from cryptocurrency market prediction to stock price forecasting, showcasing their capability to extract and interpret complex economic sentiments. The case study on Bitcoin price and news sentiment further exemplifies the practical application of LLMs, reinforcing that sentiment analysis, powered by advanced language models, is pivotal in deciphering market trends.
However, the review does not shy away from the challenges and limitations inherent in the current state of LLMs. Issues such as the immense computational requirements, difficulties in generalizability and interpretability, and ethical concerns are thoughtfully discussed, providing a balanced perspective. Our call for more efficient deployment strategies, improved generalizability, and enhanced interpretability is particularly compelling, indicating the need for continued innovation in this field.
Looking to the future, integrating more diverse data types and establishing a universal evaluation framework are essential steps toward enhancing the efficacy of LLMs in sentiment analysis. The potential expansion of LLM capabilities to include multimodal data inputs and the implementation of a standard evaluation framework are highlighted as promising avenues for research and development.

REFERENCES
[1] M. Baker and J. Wurgler, "Investor sentiment in the stock market," J. Econ. Perspect., vol. 21, no. 2, pp. 129–152, 2007.
[2] P. C. Tetlock, "Giving content to investor sentiment: The role of media in the stock market," J. Finance, vol. 62, pp. 1139–1168, Jun. 2007. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.2007.01232.x
[3] L. A. Smales, "The importance of fear: Investor sentiment and stock market returns," Appl. Econ., vol. 49, no. 34, pp. 3395–3421, Jul. 2017. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/00036846.2016.1259754
[4] T. Rao and S. Srivastava. (2012). Analyzing Stock Market Movements Using Twitter Sentiment Analysis. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.30 and https://repository.lincoln.ac.uk/articles/conference_contribution/Analyzing_stock_market_movements_using_Twitter_sentiment_analysis/25165223/2?file=44450105
[5] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Comput. Intell. Mag., vol. 9, no. 2, pp. 48–57, May 2014.
[6] V. Ramiah, X. Xu, and I. A. Moosa, "Neoclassical finance, behavioral finance and noise traders: A review and assessment of the literature," Int. Rev. Financial Anal., vol. 41, pp. 89–100, Oct. 2015.
[7] F. Wu, Y. Huang, and Y. Song, "Structured microblog sentiment classification via social context regularization," Neurocomputing, vol. 175, pp. 599–609, Jan. 2016.
[8] T. Al-Moslmi, S. Gaber, M. Albared, and N. Omar. (2016). Feature Selection Methods Effects on Machine Learning Approaches in Malay Sentiment Analysis. [Online]. Available: https://www.researchgate.net/publication/308968243
[9] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proc. ACL Conf. Short Papers, 2010, pp. 220–224.
[10] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, "BloombergGPT: A large language model for finance," 2023, arXiv:2303.17564.
[11] K. Mishev, A. Gjorgjevikj, I. Vodenska, L. T. Chitkushev, and D. Trajanov, "Evaluation of sentiment analysis in finance: From lexicons to transformers," IEEE Access, vol. 8, pp. 131662–131682, 2020.
[12] D. Araci, "FinBERT: Financial sentiment analysis with pre-trained language models," 2019, arXiv:1908.10063.
[13] P. Seroyizhko, Z. Zhexenova, M. Z. Shafiq, F. Merizzi, A. Galassi, and F. Ruggeri, "A sentiment and emotion annotated dataset for Bitcoin price forecasting based on Reddit posts," in Proc. 4th Workshop Financial Technol. Natural Lang. Process. (FinNLP), 2022, pp. 203–210. [Online]. Available: https://aclanthology.org/2022.finnlp-1.27
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Oct. 2018, pp. 4171–4186.
[15] J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz, K. Kanclerz, A. Kocon, B. Koptyra, W. Mieleszczenko-Kowszewicz, P. Milkowski, M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik, S. Wozniak, and P. Kazienko, "ChatGPT: Jack of all trades, master of none," Inf. Fusion, vol. 99, Nov. 2023, Art. no. 101861.
[16] M. Chakraborty and S. Subramaniam, "Does sentiment impact cryptocurrency?" J. Behav. Finance, vol. 24, no. 2, pp. 202–218, Apr. 2023. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/15427560.2021.1950723
[17] A. H. Huang, H. Wang, and Y. Yang, "FinBERT: A large language model for extracting information from financial text," Contemp. Accounting Res., vol. 40, no. 2, pp. 806–841, May 2023. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1111/1911-3846.12832
[18] H. Tong, J. Li, N. Wu, M. Gong, D. Zhang, and Q. Zhang, "Ploutos: Towards interpretable stock movement prediction with financial large language model," 2024, arXiv:2403.00782.
[19] A. S. George and A. H. George, "A review of ChatGPT AIs impact on several business sectors," Partners Universal Int. Innov. J., vol. 1, no. 1, pp. 9–23, 2023.

[20] N. A. Sharma, A. B. M. S. Ali, and M. A. Kabir, "A review of sentiment analysis: Tasks, applications, and deep learning techniques," Int. J. Data Sci. Anal., pp. 1–38, Jul. 2024, doi: 10.1007/s41060-024-00594-x.
[21] M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, "A review on large language models: Architectures, applications, taxonomies, open issues and challenges," IEEE Access, vol. 12, pp. 26839–26874, 2024.
[22] B. Chen, Z. Wu, and R. Zhao, "From fiction to fact: The growing role of generative AI in business and finance," J. Chin. Econ. Bus. Stud., vol. 21, no. 4, pp. 471–496, Oct. 2023.
[23] M. M. Dong, T. C. Stratopoulos, and V. X. Wang, A Scoping Review of ChatGPT Research in Accounting and Finance, T. C. Wang and V. Xiaoqi, Eds., Dec. 2023. [Online]. Available: https://ssrn.com/abstract=4680203 and http://dx.doi.org/10.2139/ssrn.4680203
[24] S. A. Farimani, M. V. Jahan, and A. M. Fard, "From text representation to financial market prediction: A literature review," Information, vol. 13, no. 10, p. 466, Sep. 2022.
[25] A. Koshiyama, N. Firoozye, and P. Treleaven, "Algorithms in future capital markets: A survey on AI, ML and associated algorithms in capital markets," in Proc. 1st ACM Int. Conf. AI Finance, 2020, pp. 1–8.
[26] O. Bashchenko, "Bitcoin price factors: Natural language processing approach," SSRN Electron. J., vol. 13, pp. 22–48, Mar. 2022. [Online]. Available: https://papers.ssrn.com/abstract=4079091
[27] B. N. Thanh, A. T. Nguyen, T. T. Chu, and S. Ha. (2023). ChatGPT, Twitter Sentiment and Bitcoin Return. [Online]. Available: https://papers.ssrn.com/abstract=4628097
[28] B. Kitchenham. (2007). Guidelines for Performing Systematic Literature Reviews in Software Engineering. [Online]. Available: https://www.researchgate.net/publication/302924724
[29] B. Kitchenham, L. Madeyski, and D. Budgen, "SEGRESS: Software engineering guidelines for REporting secondary studies," IEEE Trans. Softw. Eng., vol. 49, no. 3, pp. 1273–1298, Mar. 2023.
[30] H. Zhao, Z. Liu, Z. Wu, Y. Li, T. Yang, P. Shu, S. Xu, H. Dai, L. Zhao, G. Mai, N. Liu, and T. Liu, "Revolutionizing finance with LLMs: An overview of applications and insights," 2024, arXiv:2401.11641.
[31] K. Du, F. Xing, R. Mao, and E. Cambria, "Financial sentiment analysis: Techniques and applications," ACM Comput. Surv., vol. 56, no. 9, pp. 1–42, Oct. 2024.
[32] M. N. Ashtiani and B. Raahemi, "News-based intelligent prediction of financial markets using text mining and machine learning: A systematic literature review," Expert Syst. Appl., vol. 217, May 2023, Art. no. 119509.
[33] M. Shanahan, "Talking about large language models," 2022, arXiv:2212.03551.
[34] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 24824–24837.
[35] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," 2022, arXiv:2211.09085.
[36] J. Hoffmann et al., "Training compute-optimal large language models," 2022, arXiv:2203.15556.
[37] J. Xu Zhao, Y. Xie, K. Kawaguchi, J. He, and M. Q. Xie, "Automatic model selection with large language models for reasoning," 2023, arXiv:2305.14333.
[38] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying large language models and knowledge graphs: A roadmap," 2023, arXiv:2306.08302.
[39] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," in Proc. 8th Int. Conf. Learn. Represent. (ICLR), 2020, pp. 1–17.
[40] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[42] M. Kulakowski and F. Frasincar, "Sentiment classification of cryptocurrency-related social media posts," IEEE Intell. Syst., vol. 38, no. 4, pp. 5–9, Jul. 2023.
[43] Y. Yu, H. Li, Z. Chen, Y. Jiang, Y. Li, D. Zhang, R. Liu, J. W. Suchow, and K. Khashanah, "FinMem: A performance-enhanced LLM trading agent with layered memory and character design," 2023, arXiv:2311.13743.
[44] Y. Li, Y. Yu, H. Li, Z. Chen, and K. Khashanah, "TradingGPT: Multi-agent system with layered memory and distinct characters for enhanced financial trading performance," 2023, arXiv:2309.03736.
[45] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019. [Online]. Available: https://github.com/codelucas/newspaper
[46] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving Language Understanding by Generative Pre-training. Accessed: Feb. 3, 2024. [Online]. Available: https://gluebenchmark.com/leaderboard
[47] T. B. Brown et al., "Language models are few-shot learners," in Proc. NIPS, 2020, pp. 1877–1901. [Online]. Available: https://commoncrawl.org/the-data/
[48] J. Achiam et al., "GPT-4 technical report," 2023, arXiv:2303.08774.
[49] G. Fatouros, J. Soldatos, K. Kouroumali, G. Makridis, and D. Kyriazis, "Transforming sentiment analysis in the financial domain with ChatGPT," Mach. Learn. Appl., vol. 14, Dec. 2023, Art. no. 100508.
[50] B. Zhang, H. Yang, and X.-Y. Liu, "Instruct-FinGPT: Financial sentiment analysis by instruction tuning of general-purpose large language models," 2023, arXiv:2306.12659.
[51] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," 2023, arXiv:2302.13971.
[52] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," 2023, arXiv:2307.09288.
[53] Q. Xie, W. Han, X. Zhang, Y. Lai, M. Peng, A. Lopez-Lira, and J. Huang, "PIXIU: A large language model, instruction data and evaluation benchmark for finance," 2023, arXiv:2306.05443.
[54] B. Peng, E. Chersoni, Y.-Y. Hsu, L. Qiu, and C.-R. Huang, "Supervised cross-momentum contrast: Aligning representations with prototypical examples to enhance financial sentiment analysis," Knowl.-Based Syst., vol. 295, Jul. 2024, Art. no. 111683.
[55] C. He, C. Li, T. Han, and L. Shen, "Assessing and enhancing LLMs: A physics and history dataset and one-more-check pipeline method," in Proc. Int. Conf. Neural Inf. Process., 2024, pp. 504–517.
[56] Z. Chen, W. Chen, C. Smiley, S. Shah, I. Borova, D. Langdon, R. Moussa, M. Beane, T.-H. Huang, B. Routledge, and W. Y. Wang, "FinQA: A dataset of numerical reasoning over financial data," in Proc. Conf. Empirical Methods Natural Lang. Process., 2021, pp. 3697–3711.
[57] Z. Liu, D. Huang, K. Huang, Z. Li, and J. Zhao, "FinBERT: A pre-trained financial language representation model for financial text mining," in Proc. 29th Int. Joint Conf. Artif. Intell., Jul. 2020, pp. 4513–4519. [Online]. Available: http://commoncrawl.org/
[58] P. Malo, A. Sinha, P. Korhonen, J. Wallenius, and P. Takala, "Good debt or bad debt: Detecting semantic orientations in economic texts," J. Assoc. Inf. Sci. Technol., vol. 65, no. 4, pp. 782–796, Apr. 2014.
[59] K. Cortis, A. Freitas, T. Daudert, M. Huerlimann, M. Zarrouk, S. Handschuh, and B. Davis, "SemEval-2017 task 5: Fine-grained sentiment analysis on financial microblogs and news," in Proc. 11th Int. Workshop Semantic Eval. (SemEval), 2017, pp. 519–535.
[60] G. Kim, M. Kim, B. Kim, and H. Lim, "CBITS: Crypto BERT incorporated trading system," IEEE Access, vol. 11, pp. 6912–6921, 2023.
[61] Y. Zou and D. Herremans, "PreBit—A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin," Expert Syst. Appl., vol. 233, Dec. 2023, Art. no. 120838.
[62] A. Raheman, A. Kolonin, I. Fridkins, I. Ansari, and M. Vishwas, "Social media sentiment analysis for cryptocurrency market prediction," 2022, arXiv:2204.10185.
[63] M. Ortu, N. Uras, C. Conversano, S. Bartolucci, and G. Destefanis, "On technical trading and social media indicators for cryptocurrency price classification through deep learning," Expert Syst. Appl., vol. 198, Jul. 2022, Art. no. 116804.
[64] B. Fazlija and P. Harder, "Using financial news sentiment for stock price direction prediction," Mathematics, vol. 10, no. 13, p. 2156, Jun. 2022. [Online]. Available: https://www.mdpi.com/2227-7390/10/13/2156/htm
[65] U. Gupta, "GPT-InvestAR: Enhancing stock investment strategies through annual report analysis with large language models," 2023, arXiv:2309.03079.

[66] X. Deng, V. Bashlovkina, F. Han, S. Baumgartner, and M. Bendersky, "What do LLMs know about financial markets? A case study on Reddit market sentiment analysis," 2022, arXiv:2212.11311.
[67] H. Q. Abonizio, E. C. Paraiso, and S. Barbon, "Toward text data augmentation for sentiment analysis," IEEE Trans. Artif. Intell., vol. 3, no. 5, pp. 657–668, Oct. 2022.
[68] D. Ider and S. Lessmann, "Forecasting cryptocurrency returns from sentiment signals: An analysis of BERT classifiers and weak supervision," 2022, arXiv:2204.05781.
[69] J. de Curtò, I. de Zarzà, G. Roig, J. C. Cano, P. Manzoni, and C. T. Calafate, "LLM-informed multi-armed bandit strategies for non-stationary environments," Electronics, vol. 12, no. 13, p. 2814, Jun. 2023. [Online]. Available: https://www.mdpi.com/2079-9292/12/13/2814/htm
[70] M. Fernandes, S. Khanna, L. Monteiro, A. Thomas, and G. Tripathi, "Bitcoin price prediction," in Proc. Int. Conf. Adv. Comput., Commun., Control (ICAC3), Dec. 2021, pp. 1–4.
[71] P. Malo, A. Sinha, P. Takala, P. Korhonen, and J. Wallenius, "Good debt or bad debt: Detecting semantic orientations in economic texts," 2013, arXiv:1307.5336.
[72] S. A. Farimani, M. V. Jahan, A. M. Fard, and S. R. K. Tabbakh, "Investigating the informativeness of technical indicators and news sentiment in financial market price prediction," Knowl.-Based Syst., vol. 247, Jul. 2022, Art. no. 108742.
[73] T. C. Chiang, B. N. Jeon, and H. Li, "Dynamic correlation analysis of financial contagion: Evidence from Asian markets," J. Int. Money Finance, vol. 26, no. 7, pp. 1206–1228, Nov. 2007.
[74] K. J. Forbes and R. Rigobon, "No contagion, only interdependence: Measuring stock market comovements," J. Finance, vol. 57, no. 5, pp. 2223–2261, Oct. 2002. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1111/0022-1082.00494
[75] L. Barnett, A. B. Barrett, and A. K. Seth, "Granger causality and transfer entropy are equivalent for Gaussian variables," Phys. Rev. Lett., vol. 103, no. 23, Dec. 2009, Art. no. 238701.
[76] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
[77] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951.
[78] T. Schreiber, "Measuring information transfer," Phys. Rev. Lett., vol. 85, no. 2, pp. 461–464, Jul. 2000.
[79] T. Dimpfl and F. J. Peter, "Using transfer entropy to measure information flows between financial markets," Stud. Nonlinear Dyn. Econometrics, vol. 17, no. 1, pp. 85–102, 2013.
[80] S. Bekiros, D. K. Nguyen, L. S. Junior, and G. S. Uddin, "Information diffusion, cluster formation and entropy-based network dynamics in equity and commodity markets," Eur. J. Oper. Res., vol. 256, no. 3, pp. 945–961, Feb. 2017.
[81] T. A. Brown, Confirmatory Factor Analysis for Applied Research. NY, USA: Guilford Publications, 2015.
[82] P. Wollstadt, M. Martínez-Zarzuela, R. Vicente, F. J. Díaz-Pernas, and M. Wibral, "Efficient transfer entropy analysis of non-stationary neural time series," PLoS ONE, vol. 9, no. 7, Jul. 2014, Art. no. e102833.
[83] H. Zhang, H. Hong, Y. Guo, and C. Yang, "Information spillover effects from media coverage to the crude oil, gold, and Bitcoin markets during the COVID-19 pandemic: Evidence from the time and frequency domains," Int. Rev. Econ. Finance, vol. 78, pp. 267–285, Mar. 2022.
[84] S. Moss, "Google Brain unveils trillion-parameter AI language model, the largest yet," Tech. Rep., 2021.
[85] S. Bekman, "The technology behind BLOOM training," Tech. Rep., 2022.
[86] T. L. Scao et al., "BLOOM: A 176B-parameter open-access multilingual
[89] M. C. Rillig, M. Ågerstrand, M. Bi, K. A. Gould, and U. Sauerland, "Risks and benefits of large language models for the environment," Environ. Sci. Technol., vol. 57, no. 9, pp. 3464–3466, Mar. 2023. [Online]. Available: https://pubs.acs.org/doi/full/10.1021/acs.est.3c01106
[90] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, "MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 5776–5788.
[91] A. Albalak, A. Shrivastava, C. Sankar, A. Sagar, and M. Ross, "Data-efficiency with a single GPU: An exploration of transfer methods for small language models," 2022, arXiv:2210.03871.
[92] X. Deng, V. Bashlovkina, F. Han, S. Baumgartner, and M. Bendersky, "What do LLMs know about financial markets? A case study on Reddit market sentiment analysis," in Proc. ACM Web Conf., 2022, pp. 107–110. [Online]. Available: https://dl.acm.org/doi/10.1145/3543873.3587324
[93] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," 2022, arXiv:2210.03629.
[94] X. Li, H. Xiong, X. Li, X. Wu, X. Zhang, J. Liu, J. Bian, and D. Dou, "Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond," Knowl. Inf. Syst., vol. 64, no. 12, pp. 3197–3234, Dec. 2022. [Online]. Available: https://link.springer.com/article/10.1007/s10115-022-01756-8
[95] S. Sinha, H. Chen, A. Sekhon, Y. Ji, and Y. Qi, "Perturbing inputs for fragile interpretations in deep natural language processing," in Proc. 4th BlackboxNLP Workshop Analyzing Interpreting Neural Netw. NLP, 2021, pp. 420–434. [Online]. Available: https://aclanthology.org/2021.blackboxnlp-1.33
[96] M. T. R. Laskar, M. S. Bari, M. Rahman, M. A. H. Bhuiyan, S. Joty, and J. X. Huang, "A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2023, pp. 431–469.
[97] M. U. Haque, I. Dharmadasa, Z. T. Sworna, R. N. Rajapakse, and H. Ahmad, "'I think this is the most disruptive technology': Exploring sentiments of ChatGPT early adopters using Twitter data," 2022, arXiv:2212.05856.
[98] I. Gur, O. Nachum, Y. Miao, M. Safdari, A. Huang, A. Chowdhery, S. Narang, N. Fiedel, and A. Faust, "Understanding HTML with large language models," 2022, arXiv:2210.03945.
[99] X. Wang, J. He, Z. Jin, M. Yang, Y. Wang, and H. Qu, "M2Lens: Visualizing and explaining multimodal models for sentiment analysis," IEEE Trans. Vis. Comput. Graphics, vol. 28, no. 1, pp. 802–812, Jan. 2022.
[100] H. Song, J. Li, Z. Xia, Z. Yang, and X. Du, "Multimodal sentiment analysis based on pre-LN transformer interaction," in Proc. IEEE 6th Inf. Technol. Mechatronics Eng. Conf. (ITOEC), vol. 6, Mar. 2022, pp. 1609–1613.
[101] K. Dashtipour, M. Gogate, E. Cambria, and A. Hussain, "A novel context-aware multimodal framework for Persian sentiment analysis," Neurocomputing, vol. 457, pp. 377–388, Oct. 2021.
[102] U. Sehar, S. Kanwal, K. Dashtipur, U. Mir, U. Abbasi, and F. Khan, "Urdu sentiment analysis via multimodal data mining based on deep learning algorithms," IEEE Access, vol. 9, pp. 153072–153082, 2021.
[103] A. Oussous, F.-Z. Benjelloun, A. A. Lahcen, and S. Belfkih, "ASA: A framework for Arabic sentiment analysis," J. Inf. Sci., vol. 46, no. 4, pp. 544–559, Aug. 2020. [Online]. Available: https://journals.sagepub.com/doi/10.1177/0165551519849516
[104] Z. Ke, J. Sheng, Z. Li, W. Silamu, and Q. Guo, "Knowledge-guided sentiment analysis via learning from natural language explanations,"
language model,’’ 2022, arXiv:2211.05100. IEEE Access, vol. 9, pp. 3570–3578, 2021.
[87] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, [105] Q. Zhang, J. Zhou, Q. Chen, Q. Bai, J. Xiao, and L. He, ‘‘A
H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, knowledge-enhanced adversarial model for cross-lingual structured
S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach, sentiment analysis,’’ in Proc. Int. Joint Conf. Neural Netw., Jul. 2022,
‘‘GPT-NeoX-20B: An open-source autoregressive language model,’’ pp. 1–8.
2022, arXiv:2204.06745. [106] G. F. N. Mvondo, B. Niu, and S. Eivazinezhad, ‘‘Generative con-
[88] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, versational AI and academic integrity: A mixed method investigation
H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy, ‘‘The Pile: to understand the ethical use of LLM chatbots in higher educa-
An 800 GB dataset of diverse text for language modeling,’’ 2021, tion,’’ SSRN Electron. J., 2023. [Online]. Available: https://ssrn.com/
arXiv:2101.00027. abstract=4548263 and http://dx.doi.org/10.2139/ssrn.4548263

134060 VOLUME 12, 2024
CHENGHAO LIU received the B.Sc. degree in software engineering from Jiangxi University of Finance and Economics, in 2020. He is currently pursuing the master's degree with The University of Auckland. His research interests include machine learning and large language models for solving financial problems.

ARUNKUMAR ARULAPPAN (Member, IEEE) received the B.Tech. degree in information technology from Anna University, Chennai, India, the M.Tech. degree in computer science and engineering from Vellore Institute of Technology (VIT), Vellore, India, and the Ph.D. degree from the Faculty of Information and Communication Engineering, Anna University, in 2023. He is an Assistant Professor with the School of Computer Science Engineering and Information Systems (SCORE), VIT University. He is proficient with the simulator tools MATLAB, ns-3, Mininet, OpenNetVM, and P4 programming, and has experience with open-source tools, such as OpenStack, Cloudify, OPNFV, and Cloud Native Computing Foundation (CNCF) projects. His research interests include cloud-native deployment, SDN, NFV, 5G/6G networks, AI/ML-based networking, the Internet of Vehicles, and UAV communications.

RANESH NAHA (Member, IEEE) received the M.Sc. degree in parallel and distributed computing from Universiti Putra Malaysia, and the Ph.D. degree in information technology from the University of Tasmania, Australia. He is a Senior Lecturer of information systems with Queensland University of Technology (QUT). He has authored more than 50 peer-reviewed scientific research articles. His research interests include distributed computing (fog/edge/cloud), the Internet of Things (IoT), AI and ML, software-defined networking (SDN), cybersecurity, and blockchain.

ANIKET MAHANTI (Senior Member, IEEE) received the B.Sc. degree (Hons.) in computer science from the University of New Brunswick, Canada, and the M.Sc. and Ph.D. degrees in computer science from the University of Calgary, Canada. He is a Senior Lecturer (an Associate Professor) of computer science with The University of Auckland, New Zealand. His research interests include network science, distributed systems, and internet measurements.

JOARDER KAMRUZZAMAN (Senior Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical and electronic engineering from Bangladesh University of Engineering and Technology, Dhaka, and the Ph.D. degree in information systems engineering from the Muroran Institute of Technology, Hokkaido, Japan. He is a Professor of information technology and the Director of the Centre for Smart Analytics, Federation University Australia. Previously, he was with Monash University, Australia, as an Associate Professor; and Bangladesh University of Engineering and Technology, as a Professor. He has been listed in Stanford's Top 2% Scientists list, since 2020. He has published more than 300 peer-reviewed articles, which include over 110 journals and 180 conference papers. His publications are cited over 7300 times, and he has an H-index of 36, a g-index of 79, and an i-10 index of 115. He has received over A$5.0m in competitive research funding, including a highly prestigious Australian Research Council Grant and Large Collaborative Research Centre Grants. His research interests include the Internet of Things, machine learning, and cybersecurity. He was a recipient of the Best Paper Award at four international conferences, such as ICICS'15, Singapore; APCC'14, Thailand; IEEE WCNC'10, Sydney, Australia; and IEEE-ICNNSP'03, Nanjing, China. He has served many conferences in leadership capacities, including the program co-chair, the publicity chair, the track chair, and the session chair. Since 2012, he has been an Editor of the Journal of Network and Computer Applications (Elsevier). He served as the Lead Guest Editor for Future Generation Computer Systems (Elsevier).

IN-HO RA (Member, IEEE) received the Ph.D. degree in computer engineering from Chung-Ang University, Seoul, South Korea, in 1995. From February 2007 to August 2008, he was a Visiting Scholar with the University of South Florida, Tampa, FL, USA. He has been with the School of Computer, Information and Communication Engineering, Kunsan National University, where he is currently a Professor. His research interests include wireless ad hoc and sensor networks, blockchain, the IoT, PS-LTE, and microgrids.