Financial News With Supervised Learning
Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00 , www.liu.se
Copyright (Upphovsrätt)
This document is kept available on the Internet, or its future replacement, for 25 years from the date of publication, provided that no extraordinary circumstances arise.
Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the consent of the author. Technical and administrative solutions exist to guarantee authenticity, security, and accessibility.
The author's moral rights include the right to be named as the author to the extent required by good practice when the document is used as described above, and protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctive character.
For further information about Linköping University Electronic Press, see the publisher's home page
http://www.ep.liu.se/.
Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a
period of 25 years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work
is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures
for publication and for assurance of document integrity, please refer to its www home page:
http://www.ep.liu.se/.
Abstract
Financial data in banks are unstructured and complicated. Analyzing financial text manually is challenging, and only a small amount of labeled training data is available for it. Moreover, financial text uses the language of the economic domain, where a general-purpose model is not efficient. In this thesis, data is collected from MFN (Modular Finance) financial news; the data is scraped and persisted in a database, and price indices are collected from the Bloomberg terminal. A comprehensive study and tests are conducted to find state-of-the-art results for classifying sentiment using traditional classifiers like Naive Bayes and transfer learning models like BERT and FinBERT. FinBERT outperforms the Naive Bayes and BERT classifiers.
Time-series indices for the sentiments are built, and their correlations with the price indices are calculated using the Pearson correlation. The Augmented Dickey-Fuller (ADF) test is used to check whether both time series are stationary. Finally, the Granger causality test, a statistical hypothesis test, determines whether the sentiment time series helps predict price. The results show a significant correlation and causal relation between sentiment and price.
Acknowledgments
All praise and thanks to the Almighty Creator for His showers of blessings, which allowed me to complete this thesis successfully.
My sincere gratitude and thanks go to my supervisor, Maryna Prus, Ph.D., at the Division of Statistics and Machine Learning (STIMA), Department of Computer and Information Science (IDA), for her exceptional supervision and encouragement during the thesis work. Furthermore, I thank my external supervisor, Per von Rosen, Data Strategist and Architect at Swedbank AB, whose motivation and help contributed tremendously to the successful completion of the project, and Swedbank AB for providing the necessary infrastructure. I am also grateful to all supervisors at STIMA for their constructive feedback and valuable comments during this journey.
Lastly, I thank my family and friends, whose unconditional support and encouragement motivated me to work on the thesis.
Contents
Abstract
Acknowledgments
Contents
Glossaries
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
2 Transfer Learning
2.1 Transfer Learning
2.2 Formal Definition
2.3 Types of Transfer Learning
3 Theory
3.1 Sentiment Analysis
3.2 Naive Bayes
3.3 Neural Networks
3.4 Deep Learning
3.5 BERT
3.6 BERT Architecture
3.7 FinBERT
3.8 Pearson's Correlation
3.9 Time series
3.10 Augmented Dickey-Fuller Test
3.11 Granger Causality Test
3.12 Performance Measures
3.13 Related Work
4 Method
4.1 Data
4.2 Software Environment
4.3 Web Scraping
4.4 Data Preparation
4.5 Sentiment Analysis Pipeline
5 Results
5.1 Sentiment Prediction
5.2 Classified data
5.3 Time Series
5.4 Statistical Test
6 Discussion
6.1 Results
6.2 Future Work
7 Conclusion
Bibliography
List of Figures
3.1 Artificial neural network architectures with feed-forward network and back-propagation
3.2 Self Attention
3.3 High level Architecture of Transformer
3.4 Token Embedding
3.5 BERT Architecture
3.6 Confusion Matrix
3.7 Test Loss v/s Training size for different Models
5.1 Confusion matrix for Naive Bayes model for test data. The diagonal from top left corner to bottom right corner shows the percentage of correctly classified instances.
5.2 Confusion matrix for BERT model for test data. The diagonal from top left corner to bottom right corner shows the percentage of correctly classified instances.
5.3 Confusion matrix for FinBERT model for test data. The diagonal from top left corner to bottom right corner shows the percentage of correctly classified instances.
5.4 Plot Epochs v/s Loss
5.5 Plot Training data size v/s Test loss
5.6 Sentiment change in a week, 11th Oct - 15th Oct
5.7 Count of sentiments (a) in one week, 4th Oct - 8th Oct (b) in 7 months, 15th March - 15th Oct
5.8 Plot for cumulative sentiment
5.9 Plot for price time series and sentiment implication
5.10 (a) Price over time (b) Autocorrelation for non-stationary price time series
5.11 (a) Price difference over time (b) Autocorrelation for stationary price time series
5.12 (a) Correlation price v/s sentiment (b) Correlation price v/s sentiment with best fit line
List of Tables
5.1 Hyper-parameter considered for BERT and FinBERT to report training accuracy, test accuracy and test confusion matrix
5.2 Evaluation Metrics of Classifiers for Training data
5.3 Evaluation Metrics of Classifiers for Test data
5.4 BERT Hyper-parameter selection
5.5 FinBERT Hyper-parameter selection
5.6 Sentiment indices from 15th March to 19th March
5.7 ADF test statistics for sentiment time series
5.8 ADF test statistics for price time series
5.9 Second ADF test statistics for price time series
5.10 Results for Granger causality test 1
5.11 Results for Granger causality test 2
Glossaries
1 Introduction
1.1 Motivation
A vast amount of unstructured and semi-structured data is produced every day from social media posts, news, and other sources, making it critical to analyze unstructured data efficiently. Natural Language Processing (NLP) converts the unstructured text in a document or database into structured data that is suitable for analysis and prediction. NLP has gained importance in data science and machine learning in recent years. It has also transformed the banking sector by helping to analyze large volumes of financial text, for example through sentiment analysis and polarity analysis of economic text to quantify sentiment [1]. Sentiment analysis, a subfield of NLP, helps classify these texts as 'positive' or 'negative' and gives quick insight for making better decisions.
NLP is studied using machine learning methods [1, 2], and NLP applications are further enriched by neural networks and deep learning models with transfer learning. Transfer learning is a technique for transferring the knowledge learned from previous tasks to related new tasks. One of the most crucial components of transfer learning methods is further pre-training the model on domain-specific language so that it learns the semantic relationships in the text, particularly in a niche domain like the financial sector [2]. Transfer learning is explained in detail in section 2.
The principal research interest of the thesis is to select a model that classifies news sentiment accurately, because the particular language and unique vocabulary of financial corpora make it hard to use general machine learning models. The financial news classifier is selected based on its performance. It will help macro analysts and quant economists get quick insights into specific news for individual assessment, and the news sentiment index helps quantify the different sentiments. The results are stored together with metadata describing how the classification was performed. The sentiment results also need to be aggregated into indices and stored as a time series, because they are compared with the price implication over time.
The next step is to find the correlation between news sentiments and stock prices in order to quantify the relationship between them. The news sentiment labels 'positive', 'negative', and 'neutral' are converted to numerical values, where 'positive' is represented as +1, 'neutral' as 0, and 'negative' as -1. The creation of the sentiment indices is explained in detail in section 4.5.1. An interesting question is whether sentiments lead to price movements; in other words, whether news articles have an impact on stock prices and how news polarity affects market prices. A further goal is to fine-tune the classifier by adjusting hyperparameters to achieve the best performance [3]. Finally, a production pipeline will capture the financial news stored in the database, perform the necessary Extract, Transform, Load (ETL) steps, add metadata, and apply the classifier to produce the sentiments.
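The label-to-number conversion described above can be sketched as follows. This is only an illustration: the +1/0/-1 mapping comes from the text, while summing the scores per day, the `(day, label)` input layout, and the sample records are assumptions made here, not the construction used in section 4.5.1.

```python
from collections import defaultdict

# Mapping stated in the text: positive = +1, neutral = 0, negative = -1.
LABEL_SCORE = {"positive": 1, "neutral": 0, "negative": -1}

def daily_sentiment_index(records):
    """Sum the numeric sentiment scores per day.

    `records` is an iterable of (day, label) pairs; both the layout and
    the per-day summation are assumptions made for illustration.
    """
    index = defaultdict(int)
    for day, label in records:
        index[day] += LABEL_SCORE[label]
    return dict(sorted(index.items()))

def cumulative_index(daily):
    """Running total of the daily index, in day order."""
    total, out = 0, {}
    for day, value in daily.items():
        total += value
        out[day] = total
    return out

# Hypothetical classified news items.
news = [
    ("day-1", "positive"),
    ("day-1", "negative"),
    ("day-1", "positive"),
    ("day-2", "neutral"),
    ("day-2", "negative"),
]
daily = daily_sentiment_index(news)
cumulative = cumulative_index(daily)
```

A series built this way can be aligned with a price index by date before any correlation or causality test is applied.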
Figure 1.1 shows different machine learning approaches to the text sentiment classification task: extracting features from text [4], probabilistic classifiers, and deep learning models [5].
1.2 Aim
This thesis aims to explore sentiment classification approaches for text, especially financial text. The approach is to start with a simple machine learning model and then explore more advanced deep learning models, in the hope that these models perform better and that the best classifier for the task can be selected. The objective is to set up a pipeline that can scrape the data, filter the data, and predict the sentiment. Cumulative sentiment indices are constructed from the sentiments and compared with price indices for better predictions.
1.3 Research questions
1. What are the suitable models for sentiment prediction of financial text?
4. Is there a statistically significant relationship between the sentiment and price indices? Will it help in making better predictions?
2 Transfer Learning
In transfer learning, a new task can rely on previously learned tasks; this yields more accurate and faster results and requires less training data.
Figure 2.1 illustrates how different datasets for the same task can use the same model. The model is pre-trained on dataset A and can then be used with dataset B to perform a related task. Knowledge from previously trained models is leveraged to train newer models, which is especially useful when little data is available. The main advantage of pre-training is that a downstream task does not need a large amount of data to achieve good results.
2. Use the pre-trained model as a starting point to transfer knowledge to solving the problem for dataset B.
2.2 Formal Definition
2.3 Types of Transfer Learning
2.3.3.1 Zero-shot
No training procedure is applied to optimize or learn new parameters. The NLP task of question answering is a good example of zero-shot learning [7].
3 Theory
For binary classification problems, the sentiments are classified as 'negative' or 'positive', and for multi-class classification problems, the sentiments are 'positive', 'negative', or 'neutral'. In this thesis, multi-class classification is used to classify the financial text.
$$P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)} \tag{3.1}$$
where $P(A)$ is the prior probability, $P(A \mid B)$ is the posterior probability, and $P(B \mid A)$ is the likelihood.
the conditional probability. The equation below shows the parametric model used in text classification [8].
$$P(c \mid d) = \frac{P(c) \prod_{i=1}^{n} P(w_i \mid c)^{f_i}}{P(d)} \tag{3.2}$$
• $P(w_i \mid c)$ is the conditional probability that a word $w_i$ occurs in a document/text given a class $c$.
• $P(c)$ is the prior probability that a document with class label $c$ occurs in the document collection.
$$\hat{P}(w_i \mid c) = \frac{N_{ic}}{N_c} = \frac{N_{ic}}{\sum_{j=1}^{|V|} N_{jc}} \tag{3.3}$$
In equation 3.3, $N_{ic}$ is the number of occurrences of the word $w_i$ in the training documents with class label $c$, $N_c$ is the total number of word occurrences in the documents with class label $c$, and $|V|$ is the number of unique words in the documents.
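As a rough illustration of equations 3.2 and 3.3, the following sketch trains a multinomial Naive Bayes classifier on a few made-up sentences. It is not the thesis implementation: the toy training data, the function names, and the add-one (Laplace) smoothing are assumptions introduced here, and $P(d)$ is omitted because it is the same for every class.

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)          # N_ic for every word and class
    vocab = {w for tokens, _ in docs for w in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}   # P(c)
    return priors, word_counts, vocab

def log_posterior(tokens, c, priors, word_counts, vocab):
    """Unnormalised log P(c|d) = log P(c) + sum_i f_i * log P(w_i|c)."""
    n_c = sum(word_counts[c].values())             # N_c
    score = math.log(priors[c])
    for w, f in Counter(tokens).items():
        if w not in vocab:
            continue                               # skip words never seen in training
        # Add-one smoothing is an addition that is not part of equation 3.3.
        p = (word_counts[c][w] + 1) / (n_c + len(vocab))
        score += f * math.log(p)
    return score

def classify(tokens, priors, word_counts, vocab):
    return max(priors, key=lambda c: log_posterior(tokens, c, priors, word_counts, vocab))

# Hypothetical toy training set.
train = [
    ("profit rises sharply".split(), "positive"),
    ("sales growth strong".split(), "positive"),
    ("loss widens sharply".split(), "negative"),
    ("sales fall".split(), "negative"),
]
model = train_multinomial_nb(train)
prediction = classify("profit growth".split(), *model)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied.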
The estimate of $\hat{P}(w_i)$ is given in equation 3.4, where $N_i$ is the number of occurrences of the word $w_i$ in the text.
$$\hat{P}(w_i) = \frac{N_i}{\sum_{j=1}^{|V|} N_j} \tag{3.4}$$
The log likelihood is given in equation 3.5. The first term is the conditional log likelihood, which measures how well the classifier estimates the probability of the class given the words. The second term is the marginal log likelihood, which measures the joint distribution of the words in the text.
$$LL(T) = \sum_{t=1}^{|V|} \log \hat{P}(c \mid w^{t}) + \sum_{t=1}^{|V|} \log \hat{P}(w^{t}) \tag{3.5}$$
3.3 Neural Networks
Figure 3.1: Artificial neural network architectures with feed-forward network and back-
propagation
If the output of the activation function exceeds the threshold value, the unit produces a high-value output [11]. A neural network without activation functions is equivalent to a linear regression model. The output of the activation function moves to the next layer, and the same process is repeated. This forward movement of information is called forward propagation. The output of each neuron is calculated using equation 3.6.
$$y_j = f\left(\sum_{i=0}^{n} w_{ij} x_i + b\right) \tag{3.6}$$
In equation 3.6, $f$ is the activation function, $b$ is the bias, $x_i$ is the output of the $i$th neuron in the preceding layer, $w_{ij}$ is the corresponding weight from the $i$th neuron to the $j$th neuron in the current layer, and $n$ is the number of neurons.
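A minimal sketch of the forward step in equation 3.6 is given below. The sigmoid activation, the function names, and the example weights are assumptions chosen for illustration; the text does not prescribe a particular activation function.

```python
import math

def sigmoid(z):
    """One common choice of activation function (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Output of a single neuron: f(sum_i w_i * x_i + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def layer_output(inputs, weight_rows, biases, activation=sigmoid):
    """Forward-propagate through one layer: one weight row and one bias per neuron."""
    return [neuron_output(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

x = [1.0, 2.0]                                   # outputs of the preceding layer
hidden = layer_output(x, [[0.5, -0.25], [0.1, 0.2]], [0.0, -0.5])
```

Feeding `hidden` into another call of `layer_output` repeats the process for the next layer, which is the forward propagation described in the text.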
In back-propagation, the error is calculated from how far the generated output is from the actual value. Based on the error, the weights and biases of the neurons are updated [11]. In equation 3.7, the total error $E$ is calculated from the difference between the expected value $y$ and the value obtained at the output layer after applying the activation function $f$; $w_i$ is the weight of the neuron $x_i$.
$$E = \frac{1}{2}\left(y - f\left(\sum_{i=1}^{n} w_i x_i\right)\right)^2 \tag{3.7}$$
3.3.1 Optimization
Optimization algorithms in neural networks are used to minimize the cost function, which depends on the model's internal parameters, such as the weights and biases used in computing the output value. The internal parameters are learned and updated to reach an optimal solution.
$$\theta = \theta - \eta \cdot \nabla J(\theta) \tag{3.8}$$
Batch gradient descent calculates the derivative from all the training data before each update, whereas stochastic gradient descent calculates the derivative from each training instance and updates the parameters immediately. The main challenge is choosing a proper learning rate [11]. In equation 3.9, $x^{(i)}$ and $y^{(i)}$ are a training example.
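The difference between the two update schemes can be sketched on a one-parameter model $y = wx$. The squared-error cost, the learning rate, the number of epochs, and the data are illustrative assumptions; the stochastic variant visits the instances in a fixed order here only to keep the example reproducible.

```python
def grad(w, x, y):
    """Derivative of the squared error 0.5 * (y - w*x)**2 with respect to w."""
    return -(y - w * x) * x

def batch_gradient_descent(data, w=0.0, eta=0.05, epochs=200):
    for _ in range(epochs):
        # Average the gradient over ALL training data, then update once.
        g = sum(grad(w, x, y) for x, y in data) / len(data)
        w = w - eta * g
    return w

def stochastic_gradient_descent(data, w=0.0, eta=0.05, epochs=200):
    for _ in range(epochs):
        for x, y in data:
            # Update immediately after each training instance.
            w = w - eta * grad(w, x, y)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # generated from y = 2x
w_batch = batch_gradient_descent(data)
w_sgd = stochastic_gradient_descent(data)
```

Both variants recover a weight close to 2 on this toy problem; a learning rate that is too large would make either of them diverge, which is the challenge mentioned above.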
3.4.1 Transformer
A transformer is a deep learning model based on self-attention that is mainly used in natural language processing; language translation is one application of transformers in which attention boosts speed. Unlike a conventional neural network, which processes the input sequentially, a transformer processes the input data in parallel, which uses the Graphics Processing Unit (GPU) efficiently and increases training speed. Attention can be described as the way humans focus on specific words in a text while the rest of the information receives less resolution or importance. The self-attention layer also overcomes the gradient issues that arise where there is exponential growth in the model parameters. Transformers are built from encoders and decoders, which share similar properties; each encoder and decoder consists of a self-attention layer and a feed-forward neural network. The Bidirectional Encoder Representations from Transformers (BERT) model adopts the encoder layers of the transformer. Figure 3.3 shows the high-level architecture of the transformer [13, 14].
3.4.1.1 Encoder
The encoder and decoder can be illustrated with a language translation task. An encoder receives a list of input vectors $(x_1 \ldots x_n)$, passes these vectors through a self-attention layer and then through a feed-forward neural network, and sends the resulting encodings to the next encoder layer as input. Each encoder has two sub-layers.
3.4.1.2 Decoder
The decoder focuses on the appropriate words in the input during decoding by using the self-attention mechanism.
Both the encoder and decoder layers have a feed-forward neural network for additional processing and a normalization layer.
3.4.1.3 Self-attention
The self-attention mechanism takes the input sequence and decides which parts of the sequence are important. In simple terms, the words of the input text interact with each other and find out which other words they should pay attention to. Take the example "Sales are high." A trained self-attention layer will associate "sales" with "high" with a higher weight than with the word "are". From a linguistic perspective, these words share a subject-verb-object relationship.
Attention probability =

        Sales   are    high
Sales   0.8     0.1    0.2
are     0.1     0.65   0.1
high    0.2     0.05   0.65
The transformer uses three different representations: the queries $Q$, keys $K$, and values $V$. These are calculated by multiplying the input vector $X$ with the weight matrices $W_Q$, $W_K$, and $W_V$, which are learned during training.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3.10}$$
Attention is a scaled dot-product scoring function that represents the relation between two words, known as the attention weight. In equation 3.10, $\sqrt{d_k}$ is the square root of the dimension of the key vector. The softmax function is applied to obtain the probability distribution of the attention weights.
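Equation 3.10 can be sketched in plain Python as below. The matrices `q`, `k`, and `v` are made-up numbers supplied directly; in a real transformer they would be obtained by multiplying the input with the learned weight matrices, which is omitted here.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])                                   # dimension of the key vectors
    scores = matmul(q, transpose(k))                  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Hypothetical 3-token sequence with 2-dimensional queries, keys and values.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(q, k, v)
```

Each row of `weights` is a probability distribution over the tokens, so every output vector is a weighted average of the value vectors.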
3.5 BERT
Deep learning models like LSTM that are used for natural language processing of text process it unidirectionally, i.e., left-to-right or right-to-left, during the training phase. In contrast, BERT is a deep learning model based on the transformer's encoder representation.
BERT has achieved state-of-the-art results in various NLP tasks such as next sentence prediction, text sentiment classification, text summarization, and question answering. It is a transfer learning language model developed by Google. What makes it unique is that it is trained on text bidirectionally. BERT is trained on massive data such as Wikipedia and a book corpus with more than 3,500M words, so the pre-trained model serves as a starting point [13]. In BERT, each output element is connected to each input element, and the weights are computed dynamically based on their connection. The model uses transformers instead of an RNN. The following subsections explain the pre-training of BERT and the process [14].
3.5.1 Pre-Training
In artificial intelligence, pre-training resembles the way human beings learn new knowledge by building on what they have already learned. With transfer learning, the transferred knowledge helps new models perform a task instead of being trained from scratch.
3.5.1.3 Epochs
The number of epochs defines the number of times the model works through the entire training dataset. In each epoch, the learning algorithm gets an opportunity to update the internal parameters of the model. An epoch consists of one or more batches. The classifier performance is measured for different numbers of epochs [15].
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{3.11}$$
The next step is bias correction, in which the bias-corrected first and second moment estimates $\hat{m}_t$ and $\hat{v}_t$ are calculated as shown in equations 3.13 and 3.14.
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}} \tag{3.13}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}} \tag{3.14}$$
The next step is to update the parameters, using the moving averages to scale the learning rate individually for each parameter. In Adam, the weight update is performed as shown in equation 3.15.
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t \tag{3.15}$$
In equation 3.15, $w_t$ is the weight at time $t$, $w_{t-1}$ is the weight at time $t - 1$, $\eta$ is the step size, and $\epsilon$ is a small constant that prevents division by zero.
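The Adam steps in equations 3.11 and 3.13 to 3.15 can be sketched for a single parameter as follows. The second-moment update uses the standard Adam form (an exponential moving average of the squared gradient), which is assumed here because that equation is not shown above; the hyperparameter values and the quadratic objective are also illustrative choices.

```python
import math

def adam_minimize(grad_fn, w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a one-parameter function with Adam, given its gradient."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g            # first moment, eq. 3.11
        v = beta2 * v + (1 - beta2) * g * g        # second moment (standard Adam form)
        m_hat = m / (1 - beta1 ** t)               # bias correction, eq. 3.13
        v_hat = v / (1 - beta2 ** t)               # bias correction, eq. 3.14
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)   # weight update, eq. 3.15
    return w

# Hypothetical objective J(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_min = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
```

Because the update divides by the root of the second moment, the effective step size adapts to the scale of the gradients of each parameter.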
Example: Some of the words are replaced by the token [MASK], and the model tries to predict the masked words.
Input sentence : This year sales are high.
Masked sequence: This year sales are [MASK].
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right) \tag{3.16}$$
$$PE(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{2i/d}}\right) \tag{3.17}$$
In equations 3.16 and 3.17, $pos$ is the position of the word in the sequence; position 0 refers to the position embedding of the first word in the sequence. $d$ is the size of the token embedding, say 5, and $i$ refers to the dimensions (i.e., 0, 1, 2, 3, 4). The values of $i$ and $pos$ vary, whereas $d$ is fixed. Changing the value of $i$ in equations 3.16 and 3.17 therefore gives different position embedding values.
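A small sketch of equations 3.16 and 3.17 is shown below. The function name and the chosen sizes (4 positions, embedding size 6) are illustrative assumptions; even dimensions receive the sine term and odd dimensions the cosine term.

```python
import math

def positional_encoding(n_positions, d):
    """Sinusoidal position embeddings for `n_positions` tokens of size `d`."""
    pe = [[0.0] * d for _ in range(n_positions)]
    for pos in range(n_positions):
        for k in range(0, d, 2):                  # k plays the role of 2i in the equations
            angle = pos / (10000 ** (k / d))
            pe[pos][k] = math.sin(angle)          # eq. 3.16, even dimension
            if k + 1 < d:
                pe[pos][k + 1] = math.cos(angle)  # eq. 3.17, odd dimension
    return pe

pe = positional_encoding(n_positions=4, d=6)
```

Every position receives a distinct vector, and the vector for position 0 alternates between 0 and 1 because sin(0) = 0 and cos(0) = 1.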
For a tokenized input sequence of length $n$, the embeddings are summed element-wise to produce a single representation of shape (1, n, 768). The vector representations of the embeddings are as follows:
• Segment Embedding helps distinguish between pairs of input sequences with a vector
representation of (1, n, 768).
$$\mathrm{softmax}(\hat{y}) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \tag{3.18}$$
In equation 3.18, $z_i$ are the elements of the input vector to the softmax function; they can be positive or negative values, such as (−2.4, 8, 0.4, ...). The standard exponential function is applied to each element of the input vector. The denominator is the normalization term, which ensures that all the output values of the function sum to 1.
3.7 FinBERT
Training the BERT model from scratch may result in overfitting because of its architecture, a vast neural network with 110 million parameters. Therefore, the BERT model is further pre-trained on financial news data [2]; for this further pre-training, a vast financial corpus, TRC2, is used so that the model understands the financial context for better prediction. Previous research achieved state-of-the-art performance on financial sentiment classification [18], using the financial phrase bank [19] dataset to evaluate performance. TRC2-financial is a dataset [19] of 1.8M news articles published by Reuters between 2008 and 2010. This pre-trained model is fine-tuned on our financial sentiment classification task of classifying positive, negative, and neutral sentiments using our dataset.
The financial phrase bank [19] dataset was labelled by 16 people with a background in finance and business. 60% of the data is used for training, 20% is set aside for testing, and the remaining 20% is used as the validation set.
For the implementation of FinBERT, the authors [18] used a dropout probability of 0.1. Dropout is a regularization technique that drops a hidden unit along with its connections at training time with a specified probability, so that the network does not rely on a specific connection, which can lead to overfitting. A warm-up period of 10,000 steps with a learning rate of 2e-5 is used: during the training phase, the learning rate is increased linearly from approximately 0 to 2e-5 within the first 10,000 steps to give the learning rate time to adapt to the data. The model is trained with a maximum sequence length of 64 tokens, since the average sentence length does not exceed 64, and for 6 epochs. Some network layers are initially frozen, and gradually all the layers are unfrozen. At the start of training, only the classification layer is unfrozen; after each training epoch, the next layers are unfrozen. The initial layers learn a generic representation of the data.
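The warm-up schedule and the gradual unfreezing described above can be sketched as two small helper functions. The peak learning rate and the number of warm-up steps come from the text; keeping the rate constant after warm-up, unfreezing exactly one additional layer per epoch, and the layer names are assumptions made for illustration.

```python
def warmup_learning_rate(step, peak_lr=2e-5, warmup_steps=10_000):
    """Linear warm-up from about 0 to peak_lr.

    Holding the rate constant after the warm-up period is an assumption.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

def trainable_layers(epoch, layer_names):
    """Gradual unfreezing: only the last layer (the classifier) trains in epoch 0,
    and one more layer, counted from the top, is unfrozen after each epoch
    (the one-layer-per-epoch rate is an assumption)."""
    n = min(epoch + 1, len(layer_names))
    return layer_names[-n:]

# Hypothetical layer names, ordered from input to output.
layers = ["embeddings", "encoder_1", "encoder_2", "classifier"]
schedule = [trainable_layers(epoch, layers) for epoch in range(6)]
```

With this ordering, the layers closest to the input are the last to be updated, which matches the idea that they hold the most generic representation.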
$$\mathrm{corr}_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{3.19}$$
$X$ and $Y$ are random variables, cov is the covariance, $\sigma_X$ is the standard deviation of $X$, and $\sigma_Y$ is the standard deviation of $Y$.
1. A correlation of −1: the two variables are perfectly negatively linearly related.
2. A correlation of 0: the two variables do not have any linear relation.
3. A correlation of +1: the two variables are perfectly positively linearly related.
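Equation 3.19 can be checked on two small made-up series. The series names and values below are hypothetical; the function uses population (divide-by-n) moments, which cancel in the ratio.

```python
import math

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)

sentiment = [1, 2, 3, 4, 5]                       # hypothetical sentiment index
price_up = [10.0, 12.0, 14.0, 16.0, 18.0]         # rises linearly with sentiment
price_down = [18.0, 16.0, 14.0, 12.0, 10.0]       # falls linearly with sentiment
r_up = pearson(sentiment, price_up)
r_down = pearson(sentiment, price_down)
```

The first pair is perfectly positively related and the second perfectly negatively related, so the two coefficients lie at the two ends of the range listed above.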
auto correlation over time to make sure that the statistical properties it had in the past will be the same in the future. A stationary time series $y_t$ has the following properties:
• The mean $m_t$ and the variance $v_t$ of the series do not depend on time $t$.
• The autocovariance function $f(s, t)$ depends only on the time difference $|s - t|$, not on the individual time points $s$ and $t$.
$$y_t = T_t + S_t + N_t$$
where $y_t$ is a non-stationary time series, $T_t$ is the trend component, $S_t$ is the seasonal component, and $N_t$ is noise.
The Augmented Dickey-Fuller (ADF) test is used to test whether a time series is stationary. It is a unit root test that checks stationarity in a time series. The null hypothesis of the test is that a unit root is present in the time series; a time series with a time-dependent structure is not stationary. If the time series is stationary, we reject the null hypothesis and accept the alternative hypothesis.
– Null hypothesis ($H_0$): The time series has a unit root, meaning it is non-stationary. It has some time-dependent structure.
– Alternative hypothesis ($H_a$): The null hypothesis is rejected, which suggests that the time series does not have a unit root, meaning it is stationary. It has no time-dependent structure.
Three basic auto-regressive models in difference form are used to detect the presence of a unit root. The auto-regressive AR(1) model is given by
$$y_t = \alpha y_{t-1} + \varepsilon_t \tag{3.21}$$
where $\alpha$ is a constant, $y_t$ is the time series at time $t$, and $\varepsilon_t$ is the residual term.
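• Test 1: Test for a unit root with no drift (constant) and no trend. Subtracting $y_{t-1}$ from both sides of equation 3.21 gives equation 3.22, where $\gamma = \alpha - 1$.
$$\Delta y_t = \gamma y_{t-1} + \varepsilon_t \tag{3.22}$$
• Test 2: Test for a unit root with drift (constant). In equation 3.23, $\alpha$ is a drift term.
$$\Delta y_t = \alpha + \gamma y_{t-1} + \varepsilon_t \tag{3.23}$$
– $H_0$: $y_t$ is non-stationary ($\gamma = 0$, $\alpha = 0$)
– $H_a$: $y_t$ is stationary ($\gamma < 0$, $\alpha \neq 0$)
• Test 3: Test for a unit root with drift (constant) and trend. In equation 3.24, $\alpha$ is a drift term and $\beta$ is the trend term.
$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \varepsilon_t \tag{3.24}$$
– $H_0$: $y_t$ is non-stationary ($\gamma = 0$, $\alpha = 0$, $\beta = 0$)
– $H_a$: $y_t$ is stationary ($\gamma < 0$, $\alpha \neq 0$, $\beta \neq 0$)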
$$DF_T = \frac{\hat{\gamma}}{SE(\hat{\gamma})} \tag{3.25}$$
In equation 3.25, $DF_T$ is the Dickey-Fuller (DF) test statistic, $\hat{\gamma}$ is the estimated value of $\gamma$, and $SE(\hat{\gamma})$ is the standard error of $\hat{\gamma}$. The calculated Dickey-Fuller t-statistic is compared with the critical value of the DF t-distribution, taken from cumulative distribution table A of t in the Supplementary Manual [22]. The critical value at a given significance level (1%, 5%, 10%) is a cut-off point. If the DF test statistic is more negative than the critical value, the null hypothesis is rejected. Equivalently, if the p-value is less than 5%, we can reject the null hypothesis that there is a unit root.
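The test statistic of equation 3.25 can be sketched for the simplest case, the model without drift and trend. This is the plain Dickey-Fuller regression, without the lagged difference terms that the augmented test adds, and it does not compute critical values or p-values; the simulated series and the seed are assumptions for illustration.

```python
import math
import random

def dickey_fuller_statistic(y):
    """t-statistic gamma_hat / SE(gamma_hat) for dy_t = gamma * y_{t-1} + e_t
    (no drift, no trend, no lagged differences)."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    sxx = sum(v * v for v in lagged)
    gamma_hat = sum(l * d for l, d in zip(lagged, diffs)) / sxx   # OLS slope
    residuals = [d - gamma_hat * l for l, d in zip(lagged, diffs)]
    dof = len(diffs) - 1                       # one estimated parameter
    s2 = sum(e * e for e in residuals) / dof   # residual variance
    se = math.sqrt(s2 / sxx)                   # standard error of gamma_hat
    return gamma_hat / se

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(500)]   # stationary white noise

walk = [0.0]                                           # random walk: has a unit root
for e in noise:
    walk.append(walk[-1] + e)

stat_noise = dickey_fuller_statistic(noise)
stat_walk = dickey_fuller_statistic(walk)
```

The stationary series yields a strongly negative statistic, while the random walk yields a value much closer to zero; the decision itself still requires the Dickey-Fuller critical values mentioned in the text.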
$$d^{(1)}(t) = x(t) - x(t-1) \tag{3.26}$$
– $x(t-1)$: observation at time $t - 1$.
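First-order differencing as in equation 3.26 takes only a few lines. The price values are made up, and the `order` argument for repeated differencing is an addition for illustration.

```python
def difference(x, order=1):
    """Apply d(t) = x(t) - x(t-1) `order` times."""
    for _ in range(order):
        x = [b - a for a, b in zip(x[:-1], x[1:])]
    return x

prices = [100.0, 102.0, 105.0, 109.0, 114.0]   # hypothetical trending series
first_diff = difference(prices)                # [2.0, 3.0, 4.0, 5.0]
second_diff = difference(prices, order=2)      # [1.0, 1.0, 1.0]
```

Each round of differencing shortens the series by one observation, which matters when the differenced series is later aligned with another series.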
3.11 Granger Causality Test
– $H_a$: X Granger-causes Y.
Step 2: The best AR model for the time series $y_t$ and the best AR model for $x_t$ are calculated by choosing the lags. The best AR model shows how many lags are useful, i.e., how many observations at previous time steps help predict the value at the next time step. The Granger test is run several times with different lag values to check that the results are the same, since the results must not be sensitive to the choice of lags. In equations 3.27 and 3.28, $\alpha_i$ are the coefficients of the time series and $\varepsilon$ is the error term, normally distributed with mean zero and constant variance, $\varepsilon \sim N(0, \sigma^2)$.
m
ÿ
y t = α0 + α i y t ´i + ε i (3.27)
i =1
m
ÿ
x t = α0 + α i x t ´i + ε i (3.28)
i =1
Step 3: Fit the restricted regression model and the full regression model, and calculate the residual sum of squares (RSS) of each, denoted RSS restricted and RSS full. In time series analysis it is essential to test whether the residuals of the regression are correlated over different time periods. When the error term of one time period is correlated with the error term of subsequent time periods, it can lead to an incorrect conclusion. A smaller RSS indicates a better fit. Under the null hypothesis of no Granger causality, the coefficients β_j of the added series in the full model are all zero, so the full model fits no better than the restricted model. In equations 3.29 and 3.30, α and β are coefficients.
$$x_t\,(RSS_{Restricted}) = \alpha_0 + \sum_{i=1}^{m} \alpha_i x_{t-i} + \varepsilon_t \tag{3.29}$$
$$x_t\,(RSS_{full}) = \alpha_0 + \sum_{i=1}^{m} \alpha_i x_{t-i} + \sum_{j=1}^{m} \beta_j y_{t-j} + \varepsilon_t \tag{3.30}$$
Equation 3.29 is the restricted model for X, which excludes Y, and equation 3.30 is the unrestricted (full) model for X, which includes Y.
• n: number of observations.
3.12. Performance Measures
• RSS is the residual sum of squares of the model. F follows an F-distribution with (p, n - k) degrees of freedom. The null hypothesis is rejected if the F value calculated from the data is greater than the critical value of the F-distribution. The chosen significance level is 0.05.
The p-value is the probability, under the assumption that the null hypothesis is true, of obtaining an F value at least as large as the observed statistic f, i.e. P(F ≥ f | H0). If the p-value is less than 0.05, the null hypothesis is rejected.
Step 5: If the p-value is less than 0.05, which corresponds to an F statistic above the critical value, reject the null hypothesis.
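As an illustration of the F-test step, here is a minimal sketch that assumes the usual F statistic for nested regressions, F = ((RSS_restricted - RSS_full) / p) / (RSS_full / (n - k)). Here p is the number of added lag coefficients, k is the number of parameters in the full model, and n is the number of observations. The function name and the numbers are hypothetical.

```python
def granger_f_statistic(rss_restricted, rss_full, n_obs, n_added_lags, n_params_full):
    """F statistic comparing a restricted AR model with a full model that adds lags of the other series."""
    numerator = (rss_restricted - rss_full) / n_added_lags
    denominator = rss_full / (n_obs - n_params_full)
    return numerator / denominator


# Hypothetical values: two lags of the other series added, five parameters in the full model
f_value = granger_f_statistic(
    rss_restricted=120.0, rss_full=100.0, n_obs=105, n_added_lags=2, n_params_full=5
)
print(f_value)
```

A large F value means that adding the lags of the other series reduced the residual sum of squares by more than chance alone would explain.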
In figure 3.6, a 'True Positive (TP)' means the actual value was positive and the model predicted a positive value. A 'False Positive (FP)' means the actual value was negative but the model predicted a positive value. A 'False Negative (FN)' means the actual value was positive but the model predicted a negative value. A 'True Negative (TN)' means the actual value was negative and the model predicted a negative value.
A 'true positive' is a result in which the model accurately predicts the positive class. Likewise, a 'true negative' is a result in which the model accurately predicts the negative class. A 'false positive' is a result in which the model mistakenly predicts the negative class as the positive class. A 'false negative' is a result in which the model mistakenly predicts the positive class as the negative class. For a binary classification problem, the accuracy is calculated as in equation 3.32 [27].
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.32}$$
In multi-class classification, the classifier has three classes: 'positive', 'negative', and 'neutral'. In equations 3.33 and 3.34, k is the number of classes, and tp_i, tn_i, fp_i, and fn_i are the 'TP', 'TN', 'FP', and 'FN' counts for class i in the confusion matrix. The per-class values are averaged over the k classes [27].
$$Accuracy = \frac{1}{k} \sum_{i=1}^{k} \frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i} \tag{3.33}$$
3.12.3 Misclassification
The misclassification rate, or error rate, is defined as the number of incorrect predictions divided by the total number of predictions, averaged over the k classes.
$$Misclassification = \frac{1}{k} \sum_{i=1}^{k} \frac{fp_i + fn_i}{tp_i + tn_i + fp_i + fn_i} \tag{3.34}$$
3.12.4 Precision
Precision is another metric for the effectiveness of a classifier. It measures the agreement of the true class labels with those assigned by the classifier: the number of true positives (TP) divided by the total number of positive predictions, i.e., the sum of true positives (TP) and false positives (FP), averaged over the k classes [27].
$$Precision = \frac{1}{k} \sum_{i=1}^{k} \frac{tp_i}{tp_i + fp_i} \tag{3.35}$$
Thus, of all the instances predicted as positive, precision measures how many actually are positive.
3.12.5 Recall
Like precision, recall is also used to measure effectiveness; it is calculated as the number of true positives divided by the sum of true positives and false negatives, averaged over the k classes [27].
$$Recall = \frac{1}{k} \sum_{i=1}^{k} \frac{tp_i}{tp_i + fn_i} \tag{3.36}$$
Thus, of all the instances that were actually positive, recall measures how many were correctly identified as positive.
3.12.6 F1-Score
The F1-score is the harmonic mean of precision and recall, and it gives a better measure of the incorrectly classified cases than accuracy does. The F1-score is a better metric for evaluating a model when the class distribution is imbalanced, because it accounts for precision and recall simultaneously [27].
$$F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{3.37}$$
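The metrics in this section can be sketched from a three-class confusion matrix as follows. This is an illustrative example, not the evaluation code of the thesis: the counts are made up, the helper names are hypothetical, and the per-class values are averaged over the k classes (macro-averaging), which is an assumption about how the class-wise terms are combined.

```python
def per_class_counts(matrix):
    """matrix[i][j] = number of instances with true class i predicted as class j."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    counts = []
    for i in range(k):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                       # true class i, predicted otherwise
        fp = sum(matrix[r][i] for r in range(k)) - tp  # predicted class i, truly otherwise
        tn = total - tp - fn - fp
        counts.append((tp, tn, fp, fn))
    return counts


def macro_metrics(matrix):
    """Macro-averaged precision and recall, and the F1-score computed from them."""
    counts = per_class_counts(matrix)
    k = len(counts)
    # Assumes every class is predicted at least once, so no denominator is zero
    precision = sum(tp / (tp + fp) for tp, tn, fp, fn in counts) / k
    recall = sum(tp / (tp + fn) for tp, tn, fp, fn in counts) / k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Rows and columns in the order: positive, negative, neutral (hypothetical counts)
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
counts = per_class_counts(confusion)
precision, recall, f1 = macro_metrics(confusion)
print(precision, recall, f1)
```

Working from per-class counts keeps the code close to the tp_i, tn_i, fp_i, and fn_i notation used in the equations.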
3.12.7 Cross-Entropy
For fine-tuning the FinBERT model, the test loss and validation loss are calculated using the cross-entropy loss, which is the standard classification loss and is also known as log loss. The class prediction vector is obtained by multiplying the model output by the weights of the classification layer, and the loss is computed on the resulting class probabilities.
For a binary classification problem, where the number of classes equals 2, the cross-entropy is calculated as in equation 3.38, where y is the class label, log is the natural logarithm, and p is the predicted probability.
$$cross\text{-}entropy = -\left(y \log(p) + (1 - y) \log(1 - p)\right) \tag{3.38}$$
In the case of multi-class classification, a separate loss is calculated for each class per observation and the results are summed. In equation 3.39, y_{o,c} is a binary indicator of whether class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o belongs to class c.
$$cross\text{-}entropy = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \tag{3.39}$$
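Equations 3.38 and 3.39 can be checked numerically with a short sketch. The function names and the probabilities are illustrative assumptions; the natural logarithm is used, as stated in the text.

```python
import math


def binary_cross_entropy(y, p):
    """Equation 3.38 for a single observation with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_cross_entropy(one_hot, probs):
    """Equation 3.39 for a single observation; only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


loss_binary = binary_cross_entropy(1, 0.9)
# Hypothetical class order: positive, negative, neutral; the true class is the second one
loss_multi = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
print(loss_binary, loss_multi)
```

In both cases the loss reduces to the negative log of the probability assigned to the true class, so confident correct predictions give a loss close to zero.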
3.13. Related Work
Financial sentiment analysis is a challenging task due to the lack of labeled data and the domain-specific language used. In the study [33], various machine learning methods are compared based on their classification performance on Thomson Reuters News Archive data. In that research, neural networks outperform other machine learning techniques such as NB and Support Vector Machine (SVM). In [34], a hierarchical sentiment classifier is proposed for text classification with domain-specific lexicons. The polarity (sentiment) classifier is built using association rule mining, where minimum confidence and minimum support are used to find the frequent item sets. The rules generated from the frequent item sets are ordered by decreasing support, confidence, and antecedent length, and the resulting polarity predictions are used to find lagging and leading indicators [34].
Deep learning models such as convolutional and recurrent neural networks are used to perform sentiment analysis in [3]. Their classifier performance (accuracy, precision, recall, and F-measure) is compared with data mining approaches such as text categorization, information retrieval, and logistic regression on a bag-of-words representation. Convolutional networks outperform logistic regression in the extraction of sentiments [3]. Pre-trained models, which require less labeled data and can be adapted to a specific domain, such as Universal Language Model Fine-tuning for Text Classification (ULMFit), are considered for transfer learning in NLP applications; this approach suits domain-specific tasks and small datasets for financial learning [35]. Pre-trained language models such as ULMFit, Long Short-Term Memory (LSTM), LSTM with Embeddings from Language Models (ELMo), and BERT are compared based on their performance and test loss. Even with small datasets, for example 500 samples, FinBERT, which is further pre-trained from the BERT model, outperforms the other models while requiring less time [18]. Figure 3.7 [18], taken from the related work, shows the test loss for different training set sizes and how the test loss decreases as the training set size increases.
Figure 3.7: Test Loss v/s Training size for different Models
4 Method
This chapter covers the data description and pre-processing. The later sections describe how the thesis work was carried out.
4.1 Data
4.2. Software Environment
Company name    Weight
Swedbank        0.03700339
Nordea          0.0315533
Ericsson        0.05718687
Atlas Copco     0.1148021
Sinch AB        0.02323883
4.2.1 Python
Python is used for the experiments, including web scraping, data pre-processing, prediction, and visualization. It is an open-source programming language that provides good frameworks for Artificial Intelligence, Machine Learning, statistical analysis, and visualization. It supports many libraries with powerful, highly customizable features; the packages used in the thesis are described below [36].
4.2.2 Urllib
The urllib library is used for fetching and handling Uniform Resource Locators (URLs). Its urlopen function can fetch URLs using various protocols [37].
4.2.3 BeautifulSoup
The BeautifulSoup package is used for web scraping, i.e., extracting data from Hyper Text Markup Language (HTML) and Extensible Markup Language (XML) pages. A parse tree is constructed from the fetched pages and searched for the required tags. The package provides flexibility for navigating and searching through HTML elements. It is used to obtain the different elements of each page, such as the news content, news link, and news title, iteratively for data extraction [37].
4.2.4 Pandas
The pandas package is used for data pre-processing and data handling. For the subsequent steps it is important to create a data structure for the scraped data, and pandas provides fast and flexible structuring of data. All the data scraped for the thesis work is converted to a data frame for further analysis and prediction [36].
4.2.5 NumPy
NumPy, which stands for "Numerical Python", is used for numerical computations on vectors and matrices. Its arrays can be up to 50 times faster than Python lists. This library is used for data analysis and numerical calculation in the thesis [38].
4.2.6 nltk
Natural Language Toolkit (NLTK) is a standard library that eases the use and implementation
of natural language processing and information retrieval tasks like tokenization, stemming,
parsing, and semantic text relationships [39].
4.2.7 Sklearn
Scikit-learn provides tools and functionality for machine learning and statistical modeling, covering classification, clustering, and other prediction tasks. In the thesis it is used, for example, to split the data into train, validation, and test subsets and to create features from text inputs, such as tokens and count vectors (frequency counts) for tf-idf [39].
4.2.8 Transformer
Transformers provides thousands of pretrained models for NLP tasks such as text classification, summarization, and translation in more than 100 languages. It is used for BERT tokenization and for building the token sequences [39].
4.2.9 matplotlib
Matplotlib is a library for static, animated, and interactive visualization that helps in creating quantitative plots and layouts [36].
4.2.10 seaborn
Like matplotlib, the seaborn library is also used for data visualization and exploratory data
analysis, built on Matplotlib to create customized plots [36].
4.2.11 ggplot2
ggplot2, part of the tidyverse, is an open-source data visualization package for creating plots in the R statistical programming language [40].
4.3. Web Scraping
4.2.12 statsmodel
Statsmodels provides classes and functions for estimating statistical models and for performing statistical tests and data exploration, with a syntax similar to the R programming language. It is used for the correlation, Granger causality, and ADF (adfuller) tests [41].
4.2.13 TensorFlow
TensorFlow is an end-to-end open-source library for creating deep learning models to handle
extensive data and implementing complex models like BERT to simplify and speed up the
process [39].
4.2.14 Keras
Like TensorFlow, Keras is open-source software. It is a high-level Application Programming Interface (API) that provides a Python interface for artificial neural networks and acts as an interface for the TensorFlow library. It is more user-friendly and a little faster compared to TensorFlow. This package is used for the implementation of the Financial Bidirectional Encoder Representations from Transformers (FinBERT) model in the thesis [39].
4.2.15 Logging
Logging means keeping track of all data input, processes, and data output of a program. When running complex processes, it is important to record crashes so that defects can be traced.
4.2.16 GPU
GPUs can perform multiple computations simultaneously. This enables the distribution of training processes and can significantly speed up machine learning operations. With GPUs, many cores that use fewer resources can be combined without sacrificing efficiency or power. Because the GPU accelerates the training of the model, it is a better choice for training the deep learning model efficiently and effectively.
1. HTTP request is sent to a webpage that returns the HTML content to access the web-
page.
2. Parsing the HTML page’s content, parsing creates a tree-like nested structure of the
HTML data.
3. Navigating and searching the parsed tree created in step 2, traversing the tree for the
required elements.
4.4. Data Preparation
hierarchical layout that can be traversed via a library called BeautifulSoup [37]. In figure
4.2, each element that is scraped is highlighted. More details are given in the section describing
the software environment.
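The parse-and-traverse idea can be sketched offline as follows. The thesis uses urllib and BeautifulSoup; this example instead uses the standard-library `html.parser` on an inline HTML string so that it runs without network access or extra packages. The markup, class names, and links are hypothetical and do not reflect the structure of the MFN pages.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a fetched HTML response
SAMPLE_HTML = """
<html><body>
  <div class="news-item"><a class="title" href="/a/1">Company A raises guidance</a></div>
  <div class="news-item"><a class="title" href="/a/2">Company B reports loss</a></div>
</body></html>
"""


class TitleLinkParser(HTMLParser):
    """Collects (title, link) pairs from <a class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current_href = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == "title":
            self._current_href = attr_map.get("href")
            self._buffer = []

    def handle_data(self, data):
        # Text is collected only while inside a matching <a> element
        if self._current_href is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.items.append(("".join(self._buffer).strip(), self._current_href))
            self._current_href = None


parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
news = parser.items
print(news)
```

With BeautifulSoup the same traversal is usually shorter, since the library builds the tree and offers search helpers directly.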
The dataset has been split into training, validation, and test sets. The classifier uses the training set to fit the model's parameters, and the validation set is used to tune the hyperparameters, for example the number of hidden layers in a neural network. The test set is used to obtain an unbiased assessment of the performance of the final model. For the Naïve Bayes classifier, 75% of the data is used for training and 25% for testing. For BERT and FinBERT, 60% is used for training, 20% for validation, and the remaining 20% for testing.
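A 60/20/20 split of this kind can be sketched with the standard library. The thesis mentions scikit-learn for splitting; the helper below is a hypothetical stand-in, and the seed and sample data are illustrative.

```python
import random


def train_val_test_split(items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle once, then cut into train, validation, and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


samples = list(range(100))  # stand-in for labelled news items
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Shuffling before cutting avoids putting all news from one period into a single subset; for imbalanced labels a stratified split may be preferable.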
4.5. Sentiment Analysis Pipeline
word occurs [8]. The term-frequency part of tf-idf increases as the word appears more times in the text. The CountVectorizer function from the Sklearn package [39] transforms the input text
into a numerical vector based on the frequency count of each word.
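The idea of a count vector can be illustrated with a simplified stand-in for CountVectorizer. This sketch uses lower-casing and whitespace tokenization only, which is an assumption and differs from scikit-learn's default tokenizer. The function name and texts are made up.

```python
from collections import Counter


def count_vectorize(texts):
    """Build a sorted vocabulary and one frequency vector per text."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = count_vectorize(["profit up profit", "loss up"])
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, so every text is mapped to a vector of the same length.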
1. Special tokens such as [SEP] and [CLS] are added to each input text. [CLS], used for the classification task, is added at the beginning of the sequence. [SEP] is added at the end of every sentence.
2. All input sequences are truncated or padded to the maximum length parameter of the classifier. Max_length is the maximum length of the input sequence, which is 512 tokens.
3. Each token is mapped to its id in the BERT tokenizer vocabulary, as explained in subsection Token Embedding.
4. The text is converted to numeric TensorFlow tensors, the input format required by the BERT encoder, as explained in section BERT Architecture.
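The steps above can be sketched with a toy example. This is not the tokenizer from the Transformers library: the vocabulary, the token ids, and the short `max_length` of 8 are made up so that the effect of the special tokens and padding is easy to see.

```python
def build_bert_input(tokens, vocab, max_length=8):
    """Toy input preparation: special tokens, truncation, padding, and id lookup."""
    body = tokens[: max_length - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    attention_mask = [1] * len(sequence)
    padding = max_length - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding  # padded positions are masked out
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in sequence]
    return input_ids, attention_mask


# Hypothetical vocabulary; real BERT vocabularies use different ids
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "profit": 4, "rose": 5, "sharply": 6}
input_ids, attention_mask = build_bert_input(["profit", "rose", "sharply"], toy_vocab)
print(input_ids)
print(attention_mask)
```

The attention mask marks which positions hold real tokens, so the padded positions do not influence the encoder.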
1. The data, i.e., the news content together with metadata such as sector, industry, and country, is scraped from MFN.
2. For each day on which one or more news items are published for the top 30 companies, the sentiment of each news item is classified as positive, negative, or neutral.
4. The OMX Stockholm 30 (OMXS30) companies are filtered from the news database to calculate the sentiment index.
6. Each company has a corresponding weight in the OMXS30 price index, and the weights sum to 100%. The weight of each company, expressed as a fraction, is multiplied by the sentiment of its news, so that the sum of the sentiments on a particular day ranges from -1 to +1.
$$SI = \sum_{i=1}^{n} S_c \, W_p \tag{4.1}$$
7. For each news sentiment, the sentiment index is calculated using step 3 to ensure that the sentiment value always ranges between -1 and +1.
8. The data is grouped by date and the cumulative sum for each day is computed to construct the sentiment index time series.
9. The cumulative sum of the sentiments is a measure of the sentiment implication over time.
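The weighting and accumulation steps can be sketched as follows. The weights are taken from the table in Section 4.1; the numeric encoding of the labels (+1, 0, -1), the dates, and the news items are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import accumulate

# Assumed numeric encoding of the predicted labels
SENTIMENT_SCORE = {"positive": 1, "neutral": 0, "negative": -1}

# Index weights as fractions (values from the weight table)
weights = {"Swedbank": 0.03700339, "Ericsson": 0.05718687, "Atlas Copco": 0.1148021}

# Hypothetical classified news: (date, company, predicted sentiment)
news = [
    ("2021-10-11", "Swedbank", "positive"),
    ("2021-10-11", "Ericsson", "negative"),
    ("2021-10-12", "Atlas Copco", "positive"),
]

# Weighted sentiment summed per day, in the spirit of equation 4.1
daily = defaultdict(float)
for date, company, label in news:
    daily[date] += SENTIMENT_SCORE[label] * weights[company]

dates = sorted(daily)
daily_index = [daily[d] for d in dates]
cumulative_index = list(accumulate(daily_index))  # running sum over days
print(list(zip(dates, daily_index, cumulative_index)))
```

Because the weights are fractions of the whole index, the weighted sum for a day with at most one news item per company stays between -1 and +1.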
5 Results
This section will cover the results of the models, the correlation between sentiment indices
and price indices, and the causality relation between them. The results are calculated using
the evaluation metrics described in Section Performance Measures. The results are discussed
in Section Discussion. The evaluation metrics are calculated using the test data for all models,
consisting of 20% of the dataset. The predicted labels of the test data are compared with the
actual labels. The statistical hypothesis test results are presented.
5.1. Sentiment Prediction
The confusion matrices are presented in Figures 5.1, 5.2, and 5.3 in normalized form. The advantage of the normalized matrix is that it avoids differences in counts that arise from class imbalance. Every row in the confusion matrix corresponds to the actual instances of one class label ('positive', 'negative', or 'neutral'), and the row elements are divided by the sum of the entire row. Thus, each row of a normalized confusion matrix sums to 1.00, and each entry is the proportion of instances with that true label that the model assigned to each predicted class. The column sums may deviate from 1.00, depending on how many instances the model assigns to each class.
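Row normalization can be illustrated with a short sketch. The counts are made up and the helper name is hypothetical.

```python
def normalize_rows(matrix):
    """Divide every entry by the sum of its row (the number of actual instances of that class)."""
    return [[cell / sum(row) for cell in row] for row in matrix]


# Rows: actual class, columns: predicted class (hypothetical counts)
counts = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
normalized = normalize_rows(counts)
row_sums = [sum(row) for row in normalized]
column_sums = [sum(col) for col in zip(*normalized)]
print(normalized)
print(row_sums, column_sums)
```

In this example every row sums to 1, while the column sums differ from 1 because the model predicts some classes more often than others.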
Figure 5.1: Confusion matrix for Naive Bayes model for test data. The diagonal from top left
corner to bottom right corner shows the percentage of correctly classified instances.
Figure 5.2: Confusion matrix for BERT model for test data. The diagonal from top left corner
to bottom right corner shows the percentage of correctly classified instances.
Figure 5.3: Confusion matrix for FinBERT model for test data. The diagonal from top left
corner to bottom right corner shows the percentage of correctly classified instances.
epoch. The attention dropout probability is 0.1; this is the dropout ratio applied to the attention probabilities, which causes some of the weights to be zeroed out during training. The original model weights are used for initialization, i.e., the initial weights from the pre-trained BERT and FinBERT models. This helps the model retain the embeddings that have already been learned. In gradual unfreezing, only the unfrozen top layers replace the weights with their pre-trained weight values. When learning on new data the model might deviate from the pre-trained weights, so lower learning rates are used at the beginning of training.
The hyperparameters considered are maximum sequence length, number of epochs, learning rate, and batch size. Some of the combinations tried are presented in table 5.4. A few experiments were conducted to try out different combinations of hyperparameters; however, due to the limited computational resources, it was not possible to try out many combinations.
From table 5.4, the optimal hyperparameters chosen for BERT are a learning rate of 2e-5, 5 epochs, a batch size of 8, and max_sequence_length = 64, which gave the best accuracy of 91%. The selection of hyperparameters for the FinBERT model is presented in table 5.5. A few experiments were conducted to try out different combinations of hyperparameters.
From table 5.5, the optimal hyperparameters chosen for the FinBERT model are a learning rate of 2e-5, train_batch_size = 32, max_sequence_length = 56, and number_of_epochs = 6. Figure 5.4 shows the validation loss and test loss of the FinBERT model: with each additional epoch the model updates its weights and internal parameters, so learning improves, performance increases, and the loss decreases. As training is computationally expensive, only 6 epochs are used; with more epochs the performance might increase further. Figure 5.5 shows how the test loss decreases from 0.9 to 0.4 when the size of the training data is increased.
5.2. Classified data
5.3. Time Series
Figure 5.6: Change in sentiments in one week, 11th Oct - 15th Oct.
Figure 5.7: Count of sentiments (a) in one week, 4th Oct - 8th Oct, and (b) in seven months, 15th March - 15th Oct.
Figure 5.8 shows the sentiment time series and the price time series for the dataset. The cumulative sentiment graph shows a positive trend, as most news is predicted as positive or neutral. Figure 5.9 shows the price time series, a discrete time series of well-defined numerical price values at successive points in time, together with the sentiment implication over time.
Figure 5.9: Plot for price time series and sentiment implication
The first step is to check whether the time series data are stationary using the ADF test. This statistical hypothesis test is applied to both the sentiment and the price time series. If the p-value is below the 0.05 significance level, the null hypothesis is rejected and the alternative hypothesis that the series is stationary is accepted.
The p-value obtained in table 5.7 is less than the significance level of 0.05, and the ADF statistic is less than all critical values. Therefore, the null hypothesis is rejected and the sentiment time series is considered stationary.
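For the price time series the same rule applies: if the p-value is below the 0.05 significance level, the null hypothesis is rejected and the alternative hypothesis that the series is stationary is accepted; otherwise the series is treated as non-stationary.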
The results in table 5.8 show that the p-value obtained is greater than the significance level of 0.05 and that the ADF statistic is greater than all critical values. Therefore, the null hypothesis cannot be rejected, and the price time series is considered non-stationary.
To make the price time series stationary, the lag-1 difference of the series is taken and the test is re-run to check that the differenced series is stationary.
Table 5.9: Second ADF test statistics for price time series
5.4. Statistical Test
The p-value obtained in table 5.9 is less than the significance level of 0.05, and the ADF statistic is less than all critical values. Therefore, the null hypothesis is rejected and the differenced price time series is considered stationary.
The autocorrelation values are plotted along with the confidence band; the ACF plot shows how strongly the current values are correlated with past values. Figure 5.10 shows that the ACF of a non-stationary series trails slowly towards zero. It also indicates some measure of trend in the series.
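The slow decay of the ACF for a trending series can be reproduced with a small sketch. The thesis produces the ACF plots with library routines; the function below is a hand-written sample autocorrelation, and the linear trend is a made-up stand-in for a non-stationary price series.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((value - mean) ** 2 for value in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


trend = [float(t) for t in range(50)]  # hypothetical trending series
acf_values = [autocorrelation(trend, lag) for lag in range(0, 6)]
print(acf_values)
```

For this trending series the autocorrelation starts at 1 for lag 0 and decreases only gradually, which is the pattern described for the non-stationary price series.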
Figure 5.11: (a) Price difference over time (b) Autocorrelation for the stationary price time series
After taking the lag-1 difference, the price time series is stationary and no longer carries the effects of trend or seasonality; no trend is observed in the ACF plot in figure 5.11.
5.4.1 Correlation
First, Pearson's correlation is calculated to check whether there is any relation between sentiment and price movements; it measures the statistical association between them. The correlation value of 0.6613 indicates a strong association. Before calculating the correlation, the time series dependencies should be removed by making the data stationary. The scatter plot in figure 5.12 shows the strong association.
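Pearson's correlation coefficient can be computed by hand as in the sketch below. The two short series are made up for illustration and do not reproduce the value reported in the thesis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical stationary sentiment values and price changes
sentiment = [0.1, 0.3, 0.2, 0.5, 0.4]
price_change = [1.0, 2.0, 1.5, 3.5, 2.5]
r = pearson_r(sentiment, price_change)
print(r)
```

The coefficient is symmetric in its two arguments, so it measures association but says nothing about which series leads the other.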
Figure 5.12: (a) Correlation price v/s sentiment (b) Correlation price v/s sentiment with best fit line
5.4.2.1 Test1
— Null Hypothesis Ho: sentiment does not Granger-cause price.
In tables 5.10 and 5.11, different numbers of lags are considered to find the best AR model for the time series. The historical values of both X and Y are used to predict the value of Y, rather than only the historical values of Y. When the number of lags is large, the F-test can lose power. An alternative is to use the chi-square test, constructed with a likelihood ratio. The results of the F-test based on the sum of squared residuals (SSR) and of the chi-square test are presented. df stands for degrees of freedom and df_denom is the denominator degrees of freedom.
Granger causality test output for number of lags (no zero) 1 and 3.
In table 5.10 the p-values are very low and the F-test and chi-square statistics are high, so the null hypothesis cannot be accepted. The null hypothesis is rejected; hence sentiment Granger-causes price.
5.4.2.2 Test2
— Null Hypothesis Ho: price does not Granger-cause sentiment.
Granger causality test output for number of lags (no zero) 1 and 3.
In table 5.11 the p-values are considerably high and the F-test and chi-square statistics are low. The null hypothesis cannot be rejected; thus, price does not Granger-cause sentiment. In the Granger causality test, if a large number of lags is used, the F-test loses power; alternatively, the chi-square test can be considered.
6 Discussion
This chapter includes a discussion of the results achieved and directions for future improvement.
6.1 Results
The results achieved in Section 5 contribute to the understanding of sentiment analysis of unstructured financial data and the challenges involved in the process: the language used in the financial text domain makes a general-purpose model inefficient and difficult to use.
One of the objectives of the thesis was to find a suitable model for sentiment prediction. The data was scraped from https://mfn.se/all and labelled as described in Section Data Preparation. The models were trained and evaluated on the labelled data. The performance of the models is evaluated using accuracy, F1-score, precision, recall, and the confusion matrix. The test accuracy of the FinBERT model is 91%, which outperformed Naive Bayes with an accuracy of 67% and BERT with an accuracy of 88%. The Naive Bayes classifier distinguished between the classes to some extent: figure 5.1 shows that 80% of the positive sentiments are correctly classified by the model, whereas for the other classes, 'neutral' and 'negative', a large number of sentiments are wrongly classified.
One of the reasons why NB did not perform well is that it assumes all the features are independent, and the estimated probabilities are incorrect if this assumption does not hold. NB models are simple to implement and computationally fast. On the other hand, the BERT and FinBERT models are complex and achieved good accuracy: the confusion matrices in figures 5.2 and 5.3 show that these models classified the classes correctly the majority of the time. A plausible explanation for why the BERT and FinBERT models performed better is that they are trained on large amounts of data in a hierarchical learning process that extracts text semantics and relations. Since BERT is bidirectional, it learns depending on the context. One drawback of BERT is that training takes a significantly long time; running it on a CPU is very slow, so the GPU setting is used. The results are comparable to the related studies on financial sentiment analysis by Shi, Feng et al. [33] and Sahar et al. [3], which measured the classification metrics of different models and in which deep learning models such as RNN outperformed NB. Zhuang et al. [2] used BERT and FinBERT for financial tasks and report performance metrics similar to the thesis outcome. The performance metrics after fine-tuning the hyperparameters of the BERT and FinBERT models are presented in tables 5.4 and 5.5; the best hyperparameters are a maximum sequence length of 64 and 56, 5 and 6 epochs, a learning rate of 2e-5, and a batch size of 8 and 32, respectively.
The correlation coefficient between sentiment and stock price is 0.66 (66%), which can be interpreted as a good correlation. Correlation between these variables is symmetric and does not indicate a direction, so the Granger causality test is performed to check the causality between sentiment and price. The results of the first test are shown in table 5.10: the null hypothesis that sentiment does not Granger-cause price is rejected because of the low p-values, and the alternative hypothesis that sentiment Granger-causes price is accepted. In the second test, shown in table 5.11, the null hypothesis that price does not Granger-cause sentiment cannot be rejected because of the high p-values. Hence it is concluded from the results that sentiment Granger-causes price and price does not Granger-cause sentiment.
6.2. Future Work
We would like to extend this study by adding data from more companies, as only 30 companies are considered now, and to check the prediction accuracy. Textual analysis can also be applied to other languages such as Swedish and German, which are much more structured than English, where a word can have multiple grammatical purposes. It would also be interesting to apply textual analysis to other financial markets such as bonds, commodities, and derivatives. The textual information should be well explored to avoid faulty predictions for stock price data.
The FinBERT model, which is trained on the financial domain, can be further pre-trained, which would allow building a one-size-fits-all model for sentiment analysis of financial text and many other tasks. However, modelling the implicit semantic information is not straightforward and is a very challenging task. FinBERT can also be used for other natural language processing tasks in the financial domain, such as question answering, named entity recognition, and next sentence prediction.
7 Conclusion
• What are the suitable models for sentiment prediction of financial text?
This thesis studied different supervised machine learning algorithms for sentiment prediction in a comparative study. The machine learning models Naive Bayes, BERT, and FinBERT are compared. The accuracy of the FinBERT model on test data is 91%, which outperformed Naive Bayes with an accuracy of 67% and BERT with an accuracy of 88%. The Naive Bayes model often predicted the class positive even though the actual class was neutral, and predicted the class neutral even though the actual class was negative. The size of the training data for the NB model is 80%. The training accuracy of the NB model is 79%, so the model is likely overfitted. Possible reasons are the unequal train-to-test ratio and the imbalance of the labelled classes; cross-validation can be used to reduce overfitting. The training accuracy of the BERT model is 92%, slightly higher than the test accuracy. The training accuracy of the FinBERT model is slightly lower than the test accuracy. Overall, the FinBERT model performed better.
Choosing appropriate hyperparameters such as learning rate, batch size, and number of epochs helps to improve performance. The best hyperparameters for BERT are a maximum sequence length of 64, 5 epochs, a learning rate of 2e-5, and a batch size of 8. For the FinBERT model, the best hyperparameters are a maximum sequence length of 56, 6 epochs, a learning rate of 2e-5, and a batch size of 32. The results are presented in tables 5.4 and 5.5.
The domain-adapted BERT model performs better. FinBERT is a BERT model that is further trained on financial data, as explained in subsection FinBERT. The performance is increased: the accuracy of FinBERT is 91%, compared with 88% for BERT. When a small training dataset was used, the test accuracy of the models was low. These models are data-hungry and perform exceptionally well when the training dataset size increases.
• Is there any statistical significance between sentiment and price indexes? Will it help in
better predictions?
From the results, the correlation coefficient between sentiment and stock price is 0.66,
which can be interpreted as a good correlation. The Augmented Dickey-Fuller (ADF) test was
used to check whether both time series are stationary, and the Granger causality test was
used to determine whether the sentiment time series helps predict price. From the results,
it is concluded that sentiment Granger-causes price and that price does not Granger-cause
sentiment. This suggests that positive news makes a rise in the stock price more likely,
and that negative news may push the stock price down.
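To make the logic of the Granger causality test concrete, the following is a minimal sketch of the lag-1 F-statistic in plain Python. It compares a restricted model (the effect series regressed on its own lag) with an unrestricted model that also includes the lagged candidate cause. The function names ols_rss and granger_f and the synthetic series are assumptions made for illustration; they are not the thesis data or code, and in practice a library routine such as the one in statsmodels [41] would normally be used. Both synthetic series are stationary by construction, which the test requires.

```python
import random


def ols_rss(rows, y):
    """Residual sum of squares of a least-squares fit of y on the columns in rows."""
    k = len(rows[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )


def granger_f(cause, effect, lag=1):
    """F-statistic for H0: lagged `cause` does not help predict `effect`."""
    ts = range(lag, len(effect))
    y = [effect[t] for t in ts]
    # Restricted model: intercept plus own lags of the effect series.
    restricted = [[1.0] + [effect[t - i] for i in range(1, lag + 1)] for t in ts]
    # Unrestricted model: additionally includes lags of the candidate cause.
    unrestricted = [
        row + [cause[t - i] for i in range(1, lag + 1)]
        for row, t in zip(restricted, ts)
    ]
    rss_r, rss_u = ols_rss(restricted, y), ols_rss(unrestricted, y)
    dof = len(y) - len(unrestricted[0])
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Synthetic example: price changes follow yesterday's sentiment plus noise.
rng = random.Random(42)
sentiment = [rng.gauss(0, 1) for _ in range(300)]
price_change = [0.0] + [
    0.8 * sentiment[t - 1] + rng.gauss(0, 0.5) for t in range(1, 300)
]

f_sent_to_price = granger_f(sentiment, price_change)
f_price_to_sent = granger_f(price_change, sentiment)
```

A large F-statistic in one direction and a small one in the other corresponds to the pattern reported above, where sentiment Granger-causes price but not the reverse. With one lag and about 300 observations, the 5% critical value is roughly 3.9.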
Bibliography
[1] Casey Whitelaw, Navendu Garg, and Shlomo Argamon. “Using appraisal groups for
sentiment analysis”. In: Proceedings of the 14th ACM international conference on Informa-
tion and knowledge management. 2005, pp. 625–631.
[2] Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. “FinBERT: A Pre-
trained Financial Language Representation Model for Financial Text Mining.” In: IJCAI.
2020, pp. 4513–4519.
[3] Sahar Sohangir, Dingding Wang, Anna Pomeranets, and Taghi M Khoshgoftaar. “Big
Data: Deep Learning for financial sentiment analysis”. In: Journal of Big Data 5.1 (2018),
pp. 1–25.
[4] Basant Agarwal and Namita Mittal. “Machine learning approach for sentiment analy-
sis”. In: Prominent feature extraction for sentiment analysis. Springer, 2016, pp. 21–45.
[5] Oscar Araque, Ignacio Corcuera-Platas, J Fernando Sánchez-Rada, and Carlos A Igle-
sias. “Enhancing deep learning sentiment analysis with ensemble techniques in social
applications”. In: Expert Systems with Applications 77 (2017), pp. 236–246.
[6] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. “A survey of transfer learn-
ing”. In: Journal of Big data 3.1 (2016), pp. 1–40.
[7] Injy Sarhan and Marco Spruit. “Can we survive without labelled data in nlp, transfer
learning for open information extraction”. In: Applied Sciences 10.17 (2020), p. 5758.
[8] Jiang Su, Jelber Sayyad Shirab, and Stan Matwin. “Large scale text classification us-
ing semisupervised multinomial naive bayes”. In: International Conference on Machine
Learning (2011).
[9] Oludare Isaac Abiodun, Aman Jantan, Abiodun Esther Omolara, Kemi Victoria Dada,
Nachaat AbdElatif Mohamed, and Humaira Arshad. “State-of-the-art in artificial neu-
ral network applications: A survey”. In: Heliyon 4.11 (2018), e00938.
[10] Maher GM Abdolrasol, SM Hussain, Taha Selim Ustun, Mahidur R Sarker, Mahammad
A Hannan, Ramizi Mohamed, Jamal Abd Ali, Saad Mekhilef, and Abdalrhman Milad.
“Artificial Neural Networks Based Optimization Techniques: A Review”. In: Electronics
10.21 (2021), p. 2689.
[11] Agnes Lydia and F Sagayaraj Francis. “A Survey of Optimization Techniques for Deep
Learning Networks”. In: International Journal for Research in Engineering Application
Management (IJREAM) 05 (May 2019).
[12] Ajay Shrestha and Ausif Mahmood. “Review of deep learning algorithms and architec-
tures”. In: IEEE Access 7 (2019), pp. 53040–53065.
[13] Shanshan Yu, Jindian Su, and Da Luo. “Improving bert-based text classification with
auxiliary sentence and domain knowledge”. In: IEEE Access 7 (2019), pp. 176600–
176612.
[14] Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. “Aspect-based sentiment anal-
ysis using bert”. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics
(2019), pp. 187–196.
[15] Li Deng. “A tutorial survey of architectures, algorithms, and applications for deep
learning”. In: APSIPA Transactions on Signal and Information Processing 3 (2014).
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances
in neural information processing systems. 2017, pp. 5998–6008.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp
Hochreiter. “Gans trained by a two time-scale update rule converge to a local nash
equilibrium”. In: Advances in neural information processing systems 30 (2017).
[18] Allen Huang, Hui Wang, and Yi Yang. “FinBERT—A Deep Learning Approach to Ex-
tracting Textual Information”. In: Available at SSRN 3910214 (2020).
[19] Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. “Good
debt or bad debt: Detecting semantic orientations in economic texts”. In: Journal of the
Association for Information Science and Technology 65.4 (2014), pp. 782–796.
[20] R Werner, D Valev, and D Danov. “The Pearson’s correlation-a measure for the linear
relationships between time series”. In: Conference: Fundamental Space Research (2009).
[21] Rizwan Mushtaq. “Augmented dickey fuller test”. In: World Journal of Finance and In-
vestment Research (2011).
[22] W. Enders. Applied Econometric Time Series. Wiley Series in Probability and Statistics.
Wiley, 2014. ISBN: 9781118918616.
[23] Marco T Bastos, Dan Mercea, and Arthur Charpentier. “Tents, tweets, and events: The
interplay between ongoing protests and social media”. In: Journal of Communication 65.2
(2015), pp. 320–350.
[24] Paul-Francois Muzindutsi, Sanelisiwe Jamile, Nqubeko Zibani, and Adefemi A
Obalade. “The effects of political, economic and financial components of country risk
on housing prices in South Africa”. In: International Journal of Housing Markets and Anal-
ysis (2020).
[25] Clive WJ Granger. “Investigating causal relations by econometric models and cross-
spectral methods”. In: Econometrica: journal of the Econometric Society (1969), pp. 424–
438.
[26] Edward E Leamer. “Vector autoregressions for causal inference?” In: Carnegie-rochester
conference series on Public Policy. Vol. 22. North-Holland. 1985, pp. 255–304.
[27] Razan M AlZoman and Mohammed JF Alenazi. “A comparative study of traffic classi-
fication techniques for smart city networks”. In: Sensors 21.14 (2021), p. 4677.
[28] Bing Liu. “Sentiment analysis and opinion mining”. In: Synthesis lectures on human lan-
guage technologies 5.1 (2012), pp. 1–167.
[29] Justin Christopher Martineau and Tim Finin. “Delta tfidf: An improved feature space
for sentiment analysis”. In: Third international AAAI conference on weblogs and social me-
dia. 2009.
[30] Abinash Tripathy, Ankit Agrawal, and Santanu Kumar Rath. “Classification of senti-
ment reviews using n-gram machine learning approach”. In: Expert Systems with Appli-
cations 57 (2016), pp. 117–126.
[31] Mohamed Abdellatif and Ahmed Elgammal. “Text Classification Using Language
Modeling: Reproducing ULMFiT”. In: 12th International Conference on Language Re-
sources and Evaluation, LREC 2020 (2020), pp. 5579–5587.
[32] Lei Zhang, Shuai Wang, and Bing Liu. “Deep learning for sentiment analysis: A sur-
vey”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.4 (2018),
e1253.
[33] Li Guo, Feng Shi, and Jun Tu. “Textual analysis and machine leaning: Crack unstruc-
tured data in finance and accounting”. In: The Journal of Finance and Data Science 2.3
(2016), pp. 153–170.
[34] Srikumar Krishnamoorthy. “Sentiment analysis of financial news articles using perfor-
mance indicators”. In: Knowledge and Information Systems 56.2 (2018), pp. 373–394.
[35] MP Geetha and D Karthika Renuka. “Improving the performance of aspect based sen-
timent analysis using fine-tuned Bert Base Uncased model”. In: International Journal of
Intelligent Networks 2 (2021), pp. 64–69.
[36] Mark Lutz. Programming Python. O’Reilly Media, Inc., 2001.
[37] Vineeth G Nair. Getting started with beautiful soup. Packt Publishing Ltd, 2014.
[38] S Chris Colbert et al. “The NumPy array: a structure for efficient numerical computa-
tion”. In: Computing in Science & Engineering. Citeseer. 2011.
[39] Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Con-
cepts, tools, and techniques to build intelligent systems. O’Reilly Media, 2019.
[40] Virgilio Gómez-Rubio. “ggplot2-elegant graphics for data analysis”. In: Journal of Sta-
tistical Software 77 (2017), pp. 1–3.
[41] Skipper Seabold and Josef Perktold. “Statsmodels: Econometric and statistical model-
ing with python”. In: Proceedings of the 9th Python in Science Conference. Vol. 57. 2010, p. 61.
[42] Ling-Chu Lee, Pin-Hua Lin, Yun-Wen Chuang, and Yi-Yang Lee. “Research output and
economic productivity: A Granger causality test”. In: Scientometrics 89.2 (2011), pp. 465–
478.