Financial News With Supervised Learning

This document is the master's thesis of Syeda Farha Shazmeen from Linköping University. The thesis investigates sentiment analysis of financial news using supervised learning models. It conducts tests to find the best model for classifying sentiment, comparing Naive Bayes, BERT, and FinBERT. FinBERT outperforms the other models. Time series of sentiment indices are built and correlated with price indices. Tests are run to examine the relationship between sentiment and price movements. The results show sentiment has a significant correlation with and can help predict price changes.


Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning


2022 | LIU-IDA/STAT-A–22/003—SE

Sentiment Analysis of Financial News with Supervised Learning

Syeda Farha Shazmeen

Supervisor: Maryna Prus

Examiner: Oleg Sysoev

External supervisor: Per von Rosen

Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00, www.liu.se
Copyright (Upphovsrätt)
This document is kept available on the Internet, or its future replacement, for 25 years from the
date of publication, provided that no extraordinary circumstances arise.
Access to the document implies permission for anyone to read, download, and print single copies
for personal use, and to use it unchanged for non-commercial research and for teaching. A later
transfer of the copyright cannot revoke this permission. All other use of the document requires the
author's consent. Technical and administrative solutions exist to guarantee authenticity, security,
and accessibility.
The author's moral rights include the right to be named as the author to the extent required by
good practice when the document is used as described above, as well as protection against the
document being altered or presented in a form or context that is offensive to the author's literary
or artistic reputation or distinctive character.
For further information about Linköping University Electronic Press, see the publisher's website
http://www.ep.liu.se/.

Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a
period of 25 years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to
download, or to print out single copies for his/her own use and to use it unchanged for non-commercial
research and educational purposes. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work
is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures
for publication and for assurance of document integrity, please refer to its www home page:
http://www.ep.liu.se/.

© Syeda Farha Shazmeen


Abstract

Financial data in banks are unstructured and complicated. These texts are challenging to
analyze, partly because of the small amount of labeled training data available for financial
text. Moreover, financial text uses the language of the economic domain, for which a
general-purpose model is not efficient. In this thesis, data were collected from MFN (Modular
Finance) financial news; the news is scraped and persisted in a database, and price indices
are collected from the Bloomberg terminal. A comprehensive study and tests are conducted to
find state-of-the-art results for classifying sentiment, using a traditional classifier, Naive
Bayes, and the transfer learning models BERT and FinBERT. FinBERT outperforms the
Naive Bayes and BERT classifiers.

Time-series indices for the sentiments are built, and their correlations with price indices
are calculated using Pearson correlation. The Augmented Dickey-Fuller (ADF) test is used to
check whether both time series are stationary. Finally, the Granger causality test, a statistical
hypothesis test, determines whether the sentiment time series helps predict price. The results
show a significant correlation and causal relation between sentiments and price.

Keywords: Financial news, Transfer learning, Sentiment classification, BERT, FinBERT,
Time series indices, Causal inference.
Acknowledgments

All praises and thanks to the Almighty Creator for his showers of blessings to complete this
thesis successfully.

My sincere gratitude and thanks go to my supervisor, Maryna Prus, Ph.D., at the Division of
Statistics and Machine Learning (STIMA), Department of Computer and Information Science
(IDA), for her exceptional supervision and encouragement during the thesis work. Furthermore,
I thank my external supervisor, Per von Rosen, Data Strategist and Architect at Swedbank AB,
whose motivation and help contributed tremendously to the successful completion of the
project, and Swedbank AB for providing the necessary infrastructure. I am also grateful to
all supervisors of STIMA for their constructive feedback and valuable comments during this
journey.

Lastly, I thank my family and friends, whose unconditional support and encouragement
motivated me to work on the thesis.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Glossaries

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions

2 Transfer Learning
2.1 Transfer Learning
2.2 Formal Definition
2.3 Types of Transfer Learning

3 Theory
3.1 Sentiment Analysis
3.2 Naive Bayes
3.3 Neural Networks
3.4 Deep Learning
3.5 BERT
3.6 BERT Architecture
3.7 FinBERT
3.8 Pearson’s Correlation
3.9 Time series
3.10 Augmented Dickey-Fuller Test
3.11 Granger Causality Test
3.12 Performance Measures
3.13 Related Work

4 Method
4.1 Data
4.2 Software Environment
4.3 Web Scraping
4.4 Data Preparation
4.5 Sentiment Analysis Pipeline

5 Results
5.1 Sentiment Prediction
5.2 Classified data
5.3 Time Series
5.4 Statistical Test

6 Discussion
6.1 Results
6.2 Future Work

7 Conclusion

Bibliography
List of Figures

1.1 Machine Learning Models for sentiment Classification

2.1 Transfer Learning
2.2 Types of Transfer Learning

3.1 Artificial neural network architectures with feed-forward network and back-propagation
3.2 Self Attention
3.3 High level Architecture of Transformer
3.4 Token Embedding
3.5 BERT Architecture
3.6 Confusion Matrix
3.7 Test Loss v/s Training size for different Models

4.1 Data Set
4.2 Scraped Web Elements
4.3 Shows the Sentiment Analysis Pipeline

5.1 Confusion matrix for Naive Bayes model for test data. The diagonal from top left
corner to bottom right corner shows the percentage of correctly classified instances.
5.2 Confusion matrix for BERT model for test data. The diagonal from top left corner
to bottom right corner shows the percentage of correctly classified instances.
5.3 Confusion matrix for FinBERT model for test data. The diagonal from top left
corner to bottom right corner shows the percentage of correctly classified instances.
5.4 Plot Epochs v/s Loss
5.5 Plot Training data size v/s Test loss
5.6 Shows Sentiments change in a Week 11th Oct - 15th Oct
5.7 Count of Sentiments (a) in one week 4th Oct - 8th Oct (b) in 7 months 15th March - 15th Oct
5.8 Plot for cumulative sentiment
5.9 Plot for price time series and sentiment implication
5.10 (a) Price over time (b) AutoCorrelation for non stationary price time series
5.11 (a) Price Difference over time (b) AutoCorrelation for stationary price time series
5.12 (a) Correlation price v/s sentiment (b) Correlation price v/s sentiment with best fit line
List of Tables

4.1 Weighted Price Indices

5.1 Hyper-parameter considered for BERT and FinBERT to report training accuracy,
test accuracy and test confusion matrix
5.2 Evaluation Metrics of Classifiers for Training data
5.3 Evaluation Metrics of Classifiers for Test data
5.4 BERT Hyper-parameter selection
5.5 FinBERT Hyper-parameter selection
5.6 Sentiment indices from 15th March to 19th March
5.7 ADF test statistics for sentiment time series
5.8 ADF test statistics for price time series
5.9 Second ADF test statistics for price time series
5.10 Results for Granger causality test 1
5.11 Results for Granger causality test 2
Glossaries

Adam Adaptive Moment Estimation.
ADF Augmented Dickey-Fuller.
ANN Artificial Neural Networks.
API Application Programming Interface.
AR Auto Regressive.
BERT Bidirectional Encoder Representations from Transformers.
DF Dickey-Fuller.
ELMo Embeddings from Language Models.
ETL Extract, Transform, Load.
FFNN Feed-Forward Neural Network.
FinBERT Financial Bidirectional Encoder Representations from Transformers.
FP False Positive.
GPU Graphics Processing Unit.
HTML Hyper Text Markup Language.
HTTP Hypertext Transfer Protocol.
LSTM Long Short Term Memory.
MFN Modular Finance.
MLE Maximum Likelihood Estimator.
MLM Masked Language Modelling.
NB Naive Bayes.
NLP Natural Language Processing.
NLTK Natural Language Toolkit.
OMX Option Market Index.
RMSP Root Mean Square Propagation.
RNN Recurrent Neural Networks.
RSS Residual Sum of Squares.
SGD Stochastic Gradient Descent.
SSR Sum of Squares Regression.
STM Short Term Memory.
SVM Support Vector Machine.
TF-IDF Term Frequency-Inverse Document Frequency.
TP True Positive.
ULMFit Universal Language Model Fine-tuning for Text Classification.
URL Uniform Resource Locator.
XML Extensible Markup Language.

1 Introduction

1.1 Motivation
A vast amount of unstructured and semi-structured data is produced every day from
social media posts, news, and other sources, making it critical to analyze unstructured data
efficiently. Natural Language Processing (NLP) converts the unstructured text in a document
or database into structured data, which is suitable for analysis and prediction.
NLP has gained importance in data science and machine learning in recent years. It
has also transformed the banking sector by helping to analyze large volumes of financial text
and to quantify sentiment through sentiment and polarity analysis of economic text [1].
Sentiment analysis, a subfield of NLP, classifies these texts as ‘positive’ or ‘negative’ and
gives quick insight for making better decisions.

NLP is studied using machine learning methods [1, 2], and neural networks and deep
learning models with transfer learning are used to further enrich NLP applications.
Transfer learning is a technique for transferring the knowledge learned from previous tasks
to related new tasks. One of the most crucial components of transfer learning is to further
pre-train the model on domain-specific language so that it learns the semantic relationships
in the text, particularly in a niche domain such as the financial sector [2]. Transfer learning
is explained in detail in Chapter 2.

The principal research interest of the thesis is to select a model that classifies news
sentiment accurately, because the particular language and unique vocabulary of financial
corpora make it hard to use general machine learning models. The financial news classifier
is selected based on its performance. It will help macro analysts and quant economists get
quick insights into specific news for individual assessment, and the news sentiment index
helps predict and measure the different sentiments. The results are stored together with
metadata describing the classification that was performed. The sentiment results also need
to be aggregated into indices and stored as a time series, because they are compared with
the price implication over time.

The next step is to find the correlation between news sentiments and stock prices in order
to quantify the relationship between them. The news sentiment labels ’positive’, ’negative’,
and ’neutral’ are converted to numerical values, where ’positive’ is represented as +1,
’neutral’ as 0, and ’negative’ as -1. The creation of the sentiment indices is explained in
detail in section 4.5.1. An interesting question is whether sentiments lead to price
movements, in other words, how the polarity of news articles affects market prices. A further
goal is to fine-tune the classifier by adjusting hyperparameters to achieve the best
performance [3]. Finally, a production pipeline is constructed that captures the financial
news stored in the database, performs the necessary Extract, Transform, Load (ETL) steps,
and adds metadata and the classifier to process the sentiments.
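
The label-to-number conversion described above can be sketched in a few lines of Python. This is only an illustration: the per-day sum used as the aggregation rule, the function name `daily_sentiment_index`, and the sample headlines are assumptions made here, not the construction defined in section 4.5.1.

```python
from collections import defaultdict
from itertools import accumulate

# Mapping stated in the text: positive -> +1, neutral -> 0, negative -> -1.
LABEL_TO_SCORE = {"positive": 1, "neutral": 0, "negative": -1}

def daily_sentiment_index(classified_news):
    """Sum the numeric sentiment of all news items published on each day.

    `classified_news` is a list of (date_string, label) pairs. Summing per
    day is an assumed aggregation rule, chosen only for illustration.
    """
    totals = defaultdict(int)
    for date, label in classified_news:
        totals[date] += LABEL_TO_SCORE[label]
    return dict(sorted(totals.items()))

# Hypothetical classified headlines (dates and labels are made up).
news = [
    ("2021-03-15", "positive"),
    ("2021-03-15", "negative"),
    ("2021-03-15", "positive"),
    ("2021-03-16", "neutral"),
    ("2021-03-16", "negative"),
    ("2021-03-17", "positive"),
]

daily = daily_sentiment_index(news)
cumulative = list(accumulate(daily.values()))  # running total over the days
print(daily)
print(cumulative)
```

A running total such as `cumulative` gives one value per day, which is the shape needed to compare a sentiment series with a price series.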

Figure 1.1: Machine Learning Models for sentiment Classification

Figure 1.1 shows different machine learning approaches to text sentiment classification:
methods that focus on extracting features from text [4], probabilistic classifiers, and deep
learning models [5].

1.2 Aim
This thesis aims to explore sentiment classification approaches for text, especially
financial text. The approach is to start with a simpler machine learning model and then
explore more advanced deep learning models, in the hope that these models perform better
and that the best classifier for the task can be selected. The objective is to set up a
pipeline that can scrape the data, filter the data, and predict the sentiment. Cumulative
sentiment indices are constructed and compared with price indices for better predictions.


1.3 Research questions


This thesis addresses several research questions that were encountered during its different
phases. The following chapters cover the data analysis and findings.

1. What are suitable models for sentiment prediction of financial text?

2. When fine-tuning a classifier, what are the best hyperparameters?

3. Does domain adaptation of the classifier on financial data improve classification performance?

4. Is there a statistically significant relationship between the sentiment and price indices?
Will it help in making better predictions?

2 Transfer Learning

2.1 Transfer Learning


Training a deep neural network from scratch can be a tedious task; taking a trained network
and adapting it to a domain-specific task is easier. Pre-trained models are also flexible to
use, for example for pre-processing, and save a lot of time. In deep learning, transfer
learning is a technique for transferring the knowledge learned from previous tasks to
related new tasks, and it is becoming the most popular approach in NLP. Put differently, the
knowledge gained from training one model can be applied to another model [6]. A model
developed or trained for a specific task or problem can be used as the starting point for a
model on a second, related task. Advanced deep learning models that have been pre-trained on
Wikipedia and BookCorpus are good examples of transfer learning.

The traditional machine learning approach is tied to learning a specific task. Knowledge is
not transferred or retained; learning relies only on what a specific model has learned for
its own task. In transfer learning, by contrast, when two tasks are similar, the model
trained for the first task can be re-trained to perform the related task. In the traditional
approach, even two models meant for the same task are each trained from scratch.

In transfer learning, a new task can rely on previously learned tasks; this gives more
accurate results faster and requires less training data. Figure 2.1 illustrates how
different data sets for the same task can use the same model. The model is pre-trained on
data set A and can then be used with data set B to perform a related task. Knowledge from
previously trained models is leveraged to train newer models, which can rely on it when less
data is available. The main advantage of pre-training is that a downstream task does not
need much data to achieve good results.

Transfer learning generally has two steps:

1. Use data set A to train the model.

2. Use the pre-trained model as the starting point for solving the problem for data set B,
thereby transferring the knowledge.
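
The two steps can be illustrated with a deliberately small sketch. It uses a one-variable linear model instead of a deep network, and the data sets, learning rate, and step counts are assumptions chosen only to make the effect visible: parameters learned on a large data set A are used as the starting point for a small, related data set B.

```python
def train(xs, ys, w=0.0, b=0.0, lr=0.1, steps=100):
    """Fit y ~ w*x + b by batch gradient descent, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the linear model on a data set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Data set A (hypothetical source data): many points from y = 2x + 1.
xs_a = [i / 10 for i in range(-20, 21)]
ys_a = [2.0 * x + 1.0 for x in xs_a]

# Data set B (hypothetical target data): few points from a related task.
xs_b = [-1.0, 0.0, 1.0]
ys_b = [2.2 * x + 0.9 for x in xs_b]

# Step 1: train the model on data set A.
w_a, b_a = train(xs_a, ys_a, steps=500)

# Step 2: start from the pre-trained parameters and train briefly on B.
w_t, b_t = train(xs_b, ys_b, w=w_a, b=b_a, steps=5)

# Baseline: the same short training on B, but from scratch.
w_s, b_s = train(xs_b, ys_b, steps=5)

loss_transfer = mse(xs_b, ys_b, w_t, b_t)
loss_scratch = mse(xs_b, ys_b, w_s, b_s)
print(loss_transfer, loss_scratch)
```

With the same small training budget on data set B, the warm-started model ends with a lower loss than the model trained from scratch, which mirrors the claim that pre-training reduces the data and time needed downstream.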


Figure 2.1: Transfer Learning

2.2 Formal Definition


A more formal definition of transfer learning considers a source domain $D_S$, a source task
$T_S$ related to the source domain, a target domain $D_T$, and a target task $T_T$ pertaining
to the target domain [6]. The primary purpose of transfer learning is to facilitate learning
the target conditional probability distribution in the target domain $D_T$ with the
information gained from the source domain $D_S$, where $D_S \neq D_T$.

2.3 Types of Transfer Learning


Transfer learning is of three types: transductive, inductive, and unsupervised transfer
learning. Figure 2.2 shows these types.

Figure 2.2: Types of Transfer Learning

2.3.1 Transductive Transfer Learning


Transductive transfer learning is a form of supervised learning in which the model is
trained on labelled source data and can then be further pre-trained on other domains to
improve classifier performance by adapting to the new domain. Transductive transfer
learning can be divided into two categories, domain adaptation and cross-language learning,
as shown in Figure 2.2 [7].

2.3.1.1 Domain Adaptation


Domain adaptation refers to the process of adapting a model to a new domain by further
enhancing its knowledge. Here the task stays the same, but the target domain has no labelled
data or only a few labelled samples. For instance, in sentiment classification, reviews
written about specific products in the source domain can be used as a starting point for
sentiment prediction in other target domains [7].

2.3.1.2 Cross Language Learning


A trained model can also be further trained for cross-language learning. A high-resource
language such as English, for which many data resources exist, is used to learn the
corresponding task in a low-resource language, where resources are insufficient to build an
NLP application [7].

2.3.2 Inductive Transfer Learning


Inductive transfer learning is a form of supervised learning in which the model is trained
on labelled data in the target domain. Inductive transfer learning can be divided into two
categories, multi-task transfer learning and sequential transfer learning, as shown in
Figure 2.2 [7].

2.3.2.1 Multi-task Transfer Learning


Multi-task learning is the process of learning multiple tasks simultaneously. For example,
given a pre-trained model A, we want to transfer the learning to multiple tasks
$T_1, T_2, \ldots, T_n$ at the same time.

2.3.2.2 Sequential Transfer Learning


In sequential transfer learning, the source and target tasks are learned sequentially. For
example, given a pre-trained model A, we want to transfer the learning to other tasks
$T_1, T_2, \ldots, T_n$. At each time step t, one specific task is learned. Sequential
transfer is slow compared to multi-task learning [7].

2.3.3 Unsupervised Transfer Learning


Unsupervised transfer learning is a form of unsupervised learning used to solve text
classification when no training data is available to train the model; it can seamlessly
adapt to unseen classes [7].

2.3.3.1 Zero-shot
No training procedure is applied to optimize or learn new parameters. The NLP task of
question answering is a good example of zero-shot learning [7].

3 Theory

3.1 Sentiment Analysis


Sentiment analysis is a broad category of text classification. It is an NLP task used in
many applications, such as movie reviews, product reviews, and product recommendations. A
fundamental task in sentiment analysis is to classify a given input sequence, feature, or
entity as ’positive,’ ’negative,’ or ’neutral.’

For binary classification problems, the sentiments are classified as ’negative’ or ’positive,’ and
for multi-class classification problems, the sentiments are ’positive,’ ’negative,’ or ’neutral.’
In this thesis, multi-class classification is used to classify the financial text.

3.2 Naive Bayes


The Naive Bayes model is based on Bayes' theorem, which calculates the probability of an
event from prior knowledge, and on the assumption of strong independence between the
features. The conditional probability is defined as:

$$P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)} \tag{3.1}$$

where P(A) = prior probability,
P(A|B) = posterior probability,
P(B|A) = likelihood.
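
As a worked example of equation 3.1, the sketch below computes a posterior from a prior and a likelihood. The events and all probabilities are invented for illustration: A stands for "the headline is positive" and B for "the headline contains the word profit".

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / evidence_b

# Hypothetical probabilities, chosen only for illustration.
p_a = 0.3               # P(A): prior probability that a headline is positive
p_b_given_a = 0.5       # P(B|A): "profit" appears given a positive headline
p_b_given_not_a = 0.1   # P(B|not A): "profit" appears otherwise

# P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(round(p_a_given_b, 4))
```

Observing the word raises the probability of a positive headline from the prior of 0.3 to roughly 0.68 under these assumed numbers.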

3.2.1 Multinomial Naive Bayes


In natural language processing, Multinomial Naive Bayes (NB) is commonly used. It is a
probabilistic learning method based on Bayes' theorem. It uses the term frequency, i.e.,
the number of times a given word appears in the text. The term frequencies are normalized
by dividing them by the length of the document. After normalization, maximum likelihood
estimates are calculated from the training data to estimate the conditional probabilities.
Equation 3.2 shows the parametric model used in text classification [8].

$$P(c|d) = \frac{P(c) \prod_{i=1}^{n} P(w_i|c)^{f_i}}{P(d)} \tag{3.2}$$

• In equation 3.2, $f_i$ is the number of occurrences of a word $w_i$ in a document d, i.e.,
the frequency of the word.

• $P(w_i|c)$ is the conditional probability that a word $w_i$ occurs in a document/text given
a class c.

• n is the number of unique words in the document.

• P(c) is the prior probability that a document has class label c.
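
The quantities listed above can be combined into a small classifier sketch. It scores each class with the numerator of equation 3.2 in log space, which avoids numerical underflow for long documents, and it leaves out P(d) because that term is the same for every class. The priors and word probabilities are made-up values, and the function names are hypothetical.

```python
import math
from collections import Counter

def log_score(document_words, prior, word_probs):
    """Log of the numerator of eq. 3.2: log P(c) + sum_i f_i * log P(w_i|c)."""
    freqs = Counter(document_words)  # f_i for each unique word w_i
    return math.log(prior) + sum(
        f * math.log(word_probs[w]) for w, f in freqs.items()
    )

def classify(document_words, priors, cond_probs):
    """Return the class with the highest score; P(d) cancels across classes."""
    return max(
        priors,
        key=lambda c: log_score(document_words, priors[c], cond_probs[c]),
    )

# Hypothetical parameters for a two-class toy problem.
priors = {"positive": 0.5, "negative": 0.5}
cond_probs = {
    "positive": {"profit": 0.6, "loss": 0.1, "rises": 0.3},
    "negative": {"profit": 0.1, "loss": 0.6, "rises": 0.3},
}

doc = ["profit", "rises", "profit"]
label = classify(doc, priors, cond_probs)
print(label)
```

The sketch assumes every word of the document has a probability in `cond_probs`; a practical implementation would also need a rule for unseen words.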

3.2.2 Parameter Estimation


The parameters in equation 3.2 can be estimated using the Maximum Likelihood estimator
(MLE). The conditional probability $P(w_i|c)$ is calculated as the relative frequency of the
word $w_i$ in documents belonging to class c.

$$\hat{P}(w_i|c) = \frac{N_{ic}}{N_c} = \frac{N_{ic}}{\sum_{j=1}^{|V|} N_{jc}} \tag{3.3}$$

In equation 3.3, $N_{ic}$ is the number of occurrences of the word $w_i$ in the training
documents with class label c, $N_c$ is the total number of word occurrences in documents
with class label c, and |V| is the number of unique words in the documents (the vocabulary
size).
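
The relative-frequency estimate of equation 3.3 amounts to counting words per class. The following sketch shows this on a toy corpus; the documents and the function name `estimate_conditional` are assumptions for illustration. It applies no smoothing, so a word never seen with a class simply has no entry.

```python
from collections import Counter

def estimate_conditional(training_docs):
    """Estimate P(w_i|c) = N_ic / N_c from (words, label) pairs."""
    counts = {}
    for words, label in training_docs:
        counts.setdefault(label, Counter()).update(words)
    probs = {}
    for label, counter in counts.items():
        n_c = sum(counter.values())  # N_c: all word occurrences in class c
        probs[label] = {w: n_ic / n_c for w, n_ic in counter.items()}
    return probs

# Hypothetical labelled training documents.
training_docs = [
    (["profit", "rises", "profit"], "positive"),
    (["profit", "growth"], "positive"),
    (["loss", "falls"], "negative"),
    (["loss", "loss", "warning"], "negative"),
]

p_hat = estimate_conditional(training_docs)
print(p_hat["positive"]["profit"])
```

Because the estimates within a class are relative frequencies, they sum to one over the words observed for that class.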

The estimate $\hat{P}(w_i)$ is given in equation 3.4, where $N_i$ is the number of occurrences
of the word $w_i$ in the text.

$$\hat{P}(w_i) = \frac{N_i}{\sum_{j=1}^{|V|} N_j} \tag{3.4}$$

The log likelihood is given in equation 3.5. The first term is the conditional log
likelihood, which measures how well the classifier model estimates the probability of the
class given the words. The second term is the marginal log likelihood, which measures the
joint distribution of the words in the text.

$$LL(T) = \sum_{t=1}^{|V|} \log \hat{P}(c|w^t) + \sum_{t=1}^{|V|} \log \hat{P}(w^t) \tag{3.5}$$

3.3 Neural Networks


Artificial Neural Networks (ANN) are networks inspired by the human nervous system. In
recent years, ANNs have become a very popular type of machine learning for classification,
pattern recognition, and prediction in many fields. In a feed-forward Neural Network
(FFNN), information moves in only one direction, forward from the input nodes through the
hidden nodes (if there are any) to the output nodes; there are no cycles or loops in the
network [9]. In Figure 3.1 [10], every neuron in the input layer is connected to every
neuron in the hidden layer, and every neuron in the hidden layer is connected to the output
layer; such a network is known as a fully connected neural network. Each input value is
multiplied by the corresponding weight. The activation function decides whether a neuron is
activated.

Figure 3.1: Artificial neural network architectures with feed-forward network and
back-propagation

If the activation value is higher than the threshold value, the unit provides a high-value
output [11]. A neural network without activation functions is equivalent to a linear
regression model. The output of the activation function moves to the next layer, and the
same process is repeated. This forward movement of information is called forward
propagation. The output of each neuron is calculated using equation 3.6.

$$y_j = f\left(\sum_{i=0}^{n} w_{ij} x_i + b\right) \tag{3.6}$$

In equation 3.6, f is the activation function, b is the bias, $x_i$ is the output of the ith
neuron in the preceding layer, $w_{ij}$ is the weight from the ith neuron to the jth neuron in
the current layer, $y_j$ is the output of the jth neuron, and n is the number of neurons in
the preceding layer.
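
A forward pass through one fully connected layer, following equation 3.6, could look like the sketch below. The sigmoid activation, the per-neuron bias list, and the numeric weights are assumptions made for the example; the text does not fix a particular activation function.

```python
import math

def sigmoid(z):
    """Logistic activation, used here only as an example of f."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Compute the output of every neuron in one fully connected layer.

    weights[i][j] is the weight from input neuron i to neuron j of the
    current layer; biases[j] is the bias of neuron j.
    """
    outputs = []
    for j, b in enumerate(biases):
        z = sum(weights[i][j] * x for i, x in enumerate(inputs)) + b
        outputs.append(activation(z))
    return outputs

# Hypothetical layer with two inputs and two neurons.
x = [1.0, 2.0]
w = [[0.5, -1.0],
     [0.25, 0.5]]
b = [0.0, 0.0]

y = layer_forward(x, w, b)
print(y)
```

Feeding `y` into another call of `layer_forward` with the next layer's weights repeats the process, which is the forward propagation described above.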

In back-propagation, the error measures how far the generated output is from the actual
value. Based on the error value, the weights and biases of the neurons are then updated
[11]. In equation 3.7, the total error E is half the squared difference between the expected
value y and the value obtained at the output layer after applying the activation function
f; $w_i$ is the weight associated with the neuron input $x_i$.

$$E = \frac{1}{2}\left(y - f\left(\sum_{i=1}^{n} w_i x_i\right)\right)^2 \tag{3.7}$$
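
To make the error of equation 3.7 and the resulting weight update concrete, the sketch below treats a single neuron. It assumes a sigmoid activation, for which the chain rule gives the gradient of E with respect to each weight as -(y - f(z)) * f(z) * (1 - f(z)) * x_i; the input values, initial weights, and learning rate are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs):
    """f(sum_i w_i * x_i) with a sigmoid activation (assumed here)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def error(y, weights, inputs):
    """Equation 3.7: E = 1/2 * (y - f(sum_i w_i * x_i))^2."""
    return 0.5 * (y - neuron_output(weights, inputs)) ** 2

def update_weights(y, weights, inputs, lr=0.5):
    """One gradient step on E; the derivative uses sigmoid' = f * (1 - f)."""
    out = neuron_output(weights, inputs)
    delta = -(y - out) * out * (1.0 - out)  # dE/dz by the chain rule
    return [w - lr * delta * x for w, x in zip(weights, inputs)]

# Hypothetical training example and initial weights.
inputs = [1.0, 2.0]
target = 1.0
weights = [0.1, -0.2]

error_before = error(target, weights, inputs)
weights = update_weights(target, weights, inputs)
error_after = error(target, weights, inputs)
print(error_before, error_after)
```

One update moves the output closer to the target, so the error after the step is smaller than before.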

3.3.1 Optimization
Optimization algorithms in neural networks are used to minimize the cost function. It de-
pends on the model’s internal parameters like weights and bias used in computing the output

11
3.4. Deep Learning

value. The internal parameters are learned and updated to achieve an optimal solution.

Batch gradient descent is an optimization technique used to train a neural network by
updating its weights. For very large data sets, the computation is very slow, the data may
not fit in memory, and redundant computations are performed [11]. In equation 3.8, η is the
learning rate and ∇J(θ) is the gradient of the loss function J(θ) with respect to the
parameters θ.

$$\theta = \theta - \eta \cdot \nabla J(\theta) \tag{3.8}$$

Batch gradient descent computes the gradient from all the training data before each update,
whereas Stochastic Gradient Descent (SGD) computes the gradient from each training instance
and updates the parameters immediately. The main challenge here is choosing a proper
learning rate [11]. In equation 3.9, $x^{(i)}$ and $y^{(i)}$ are a training example and its
label.

$$\theta = \theta - \eta \cdot \nabla J(\theta; x^{(i)}; y^{(i)}) \tag{3.9}$$
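
The difference between the two update rules can be seen in a minimal sketch that fits a single parameter θ in the model y = θx. The squared-error loss, the noise-free data, and the learning rate are assumptions chosen so that both variants converge.

```python
import random

def gradient(theta, x, y):
    """Gradient of J = 1/2 * (theta*x - y)^2 with respect to theta."""
    return (theta * x - y) * x

def batch_gradient_descent(data, theta=0.0, lr=0.05, epochs=200):
    """Eq. 3.8: one update per epoch, using the gradient over all data."""
    for _ in range(epochs):
        grad = sum(gradient(theta, x, y) for x, y in data) / len(data)
        theta -= lr * grad
    return theta

def stochastic_gradient_descent(data, theta=0.0, lr=0.05, epochs=200, seed=0):
    """Eq. 3.9: one update per training example, in shuffled order."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            theta -= lr * gradient(theta, x, y)
    return theta

# Hypothetical noise-free data generated from y = 3x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

theta_batch = batch_gradient_descent(data)
theta_sgd = stochastic_gradient_descent(data)
print(theta_batch, theta_sgd)
```

Both variants recover a value close to 3 here; SGD makes four updates per pass over the data, while the batch variant makes one. With a learning rate that is too large, either loop would diverge, which is the sensitivity the text refers to.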

3.4 Deep Learning


Deep learning is a subfield of Artificial Intelligence. Deep learning architectures such as
deep neural networks and RNNs have been applied to speech recognition, natural language
processing, and bioinformatics. The complicated task of model training can be described as
a hierarchical structure, and when these tasks are built on top of each other, the structure
looks deep, with many layers. For this reason, the approach is termed deep learning.
Recurrent Neural Networks (RNN) and Long Short Term Memory (LSTM) are examples of deep
learning models [12].

3.4.1 Transformer
A transformer is a deep learning model based on self-attention that is used mainly in natural
language processing; language translation is one application of transformers in which
attention boosts the speed. Unlike a conventional recurrent neural network, which processes
the input sequentially, a transformer processes the input data in parallel, which uses the
Graphics Processing Unit (GPU) efficiently and increases training speed. Attention can be
described as the way humans focus on specific words in a text while the rest of the
information has less resolution or importance. The self-attention layer also overcomes the
gradient issues where there is exponential growth in the model parameters. Transformers are
built from encoders and decoders, which share similar properties; each encoder and decoder
consists of a self-attention layer and a feed-forward neural network. The Bidirectional
Encoder Representations from Transformers (BERT) model has adopted the encoder layer of the
transformer. Figure 3.3 shows the high-level architecture of the Transformer [13, 14].

3.4.1.1 Encoder
The encoder and decoder can be illustrated with a language translation task. An encoder
receives a list of input vectors (x_1, ..., x_n), passes these vectors through a
self-attention layer and then a feed-forward neural network, and sends the resulting
encodings to the next encoder layer as input. Each encoder has two sub-layers.

1. A self-attention mechanism on the input vectors.

2. A simple, fully connected feed-forward network.


3.4.1.2 Decoder
The decoder focuses on the appropriate words in the input during the decoding process using
the self-attention mechanism. Each decoder has three sub-layers.

1. A self-attention mechanism on the input vectors.

2. An encoder-decoder attention layer that connects the outputs received from the encoder layer.

3. A simple, fully connected feed-forward network.

Both the encoder and decoder layers contain a feed-forward neural network for additional
processing, together with a normalization layer.

3.4.1.3 Self-attention
The self-attention mechanism takes the input sequence and decides which parts of the
sequence are important. In simple terms, the words of the input text interact with each
other and determine which other words they should pay attention to. Take the example
"Sales are high." A trained self-attention layer will associate "sales" with "high" with a
higher weight than with the word "are". From a linguistic perspective, these words share a
subject-verb-object relationship.
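$$\text{Attention probability} =
\begin{array}{l|ccc}
 & \text{Sales} & \text{are} & \text{high} \\ \hline
\text{Sales} & 0.8 & 0.1 & 0.2 \\
\text{are} & 0.1 & 0.65 & 0.1 \\
\text{high} & 0.2 & 0.05 & 0.65
\end{array}$$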

The Transformer uses three different types of representations: the Queries Q, Keys K, and
Values V. These are calculated by multiplying the input vector X with the weight matrices
W_Q, W_K, and W_V, which are learnt in the training process.

Figure 3.2: Self Attention

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3.10}$$
Attention is a scaled dot-product scoring function that represents the relation between two
words, known as the attention weight. In equation 3.10, √d_k is the square root of the
dimension of the key vectors. The softmax function is applied to obtain the probability
distribution of the attention weights.
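
Equation 3.10 can be sketched in a few lines of NumPy. The random input embeddings and weight matrices below stand in for learnt values and are purely hypothetical; a trained model would use the W_Q, W_K, and W_V obtained during training.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row maximum keeps the exponentials stable.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in equation 3.10."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # three tokens ("Sales", "are", "high"), embedding size 4
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` is the attention distribution of one token over all tokens, so every row sums to one.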


Figure 3.3: High level Architecture of Transformer

3.5 BERT
Deep learning models like LSTM that are used for natural language processing read text
unidirectionally, i.e., left-to-right or right-to-left, during the training phase. BERT, in
contrast, is a deep learning model based on the encoder of the transformer and reads text in
both directions.

BERT has achieved state-of-the-art results in various NLP tasks such as next sentence
prediction, text sentiment classification, text summarization, and question answering. It is
a transfer learning language model developed by Google. What makes it unique is that it
applies bidirectional training to text. BERT is pre-trained on massive data, English
Wikipedia and the BooksCorpus, which together contain about 3,300M words. A pre-trained
model is therefore used as the starting point [13]. In BERT, each output element is
connected to every input element, and the weights between them are computed dynamically
based on their connection. The model uses transformers instead of RNNs. The following
subsections explain the pre-training of BERT and the process [14].

3.5.1 Pre-Training
In Artificial Intelligence, pre-training is analogous to how human beings learn new
knowledge by building on knowledge they have already acquired. Through transfer learning,
this knowledge helps a new model perform its task instead of being trained from scratch.

3.5.1.1 Fine Tuning


Fine-tuning uses the weights of an earlier deep neural network model to solve another,
similar problem. In this process the hyper-parameters, such as the learning rate, number of
epochs, maximum sequence length, and batch size, are adjusted to achieve the best possible
result.


3.5.1.2 Maximum Sequence Length


It is the maximum length of an input sequence. The values of max_seq_length usually
considered are 56 and 64.

3.5.1.3 Epochs
The number of epochs defines the number of times the model works through the entire
training dataset. In each epoch, the learning algorithm gets an opportunity to update the
internal parameters of the model. An epoch consists of one or more batches. The classifier
performance is measured for different numbers of epochs [15].

3.5.1.4 Learning Rate


The learning rate is a hyper-parameter in deep learning that controls how much the model
weights change in response to the error each time they are updated. Choosing a correct
learning rate is very challenging: a very small value results in a long training time, while
a value that is too large makes training too fast and unstable, leads to sub-optimal
learning, and in some cases results in over-fitting. The authors of BERT have suggested
using learning rates of [3e-4, 1e-4, 5e-5, 3e-5] [16]. According to the results, a learning
rate of 2e-5 is optimal. Larger learning rates did not improve the classifier predictions.

3.5.1.5 Batch size


The batch size is the number of training samples used in one iteration. Increasing the batch
size requires more memory [15].

3.5.1.6 Adaptive Moment Estimation


BERT uses the Adaptive Moment Estimation (Adam) technique. Adam is an optimization algorithm
for gradient descent. Gradient descent helps the model learn from the training data over
time and minimize the cost function of algorithms with large amounts of data or parameters
[17]. Adam combines Stochastic Gradient Descent (SGD) with momentum and Root Mean Square
Propagation (RMSP). It takes advantage of momentum by using a moving average of the
gradient, which can be termed stochastic gradient descent with momentum. Momentum can be
pictured as a ball running down a slope: it behaves like a heavy ball with friction, which
prefers flat minima in the error surface. Like RMSP, Adam uses the squared gradients to
scale the learning rate, so the learning rate is adapted for each weight of the neural
network. The decaying averages of the past gradients, m_t (the first moment, the mean), and
of the past squared gradients, v_t (the second moment, the uncentered variance), are
calculated as shown in equations 3.11 and 3.12.

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{3.11}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{3.12}$$


In equations 3.11 and 3.12, β_1 and β_2 are the moving-average parameters, g_t is the
gradient at step t, m_t is the aggregate of gradients at step t (initially m_0 = 0), v_t is
the decaying average of the squares of the past gradients (initially v_0 = 0), and m_{t-1}
and v_{t-1} are the corresponding values at the previous step t - 1.

The next step is referred to as bias correction, in which the bias-corrected first and
second moment estimates m̂_t and v̂_t are calculated as shown in equations 3.13 and 3.14.

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{3.13}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{3.14}$$
The final step uses these moving averages to scale the learning rate individually for each
parameter. The weight update in Adam is performed as shown in equation 3.15.

$$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t \tag{3.15}$$

In equation 3.15, w_t is the weight at time t, w_{t-1} is the weight at time t - 1, η is the
step size, and ϵ is a small constant that prevents division by zero.
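
Equations 3.11 to 3.15 translate into a short update routine. The sketch below is illustrative only: the default values of β_1, β_2, and ϵ are the commonly used ones and are an assumption here, and the quadratic objective J(w) = w^2 is a hypothetical stand-in for a network loss.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # equation 3.11
    v = beta2 * v + (1 - beta2) * grad ** 2       # equation 3.12
    m_hat = m / (1 - beta1 ** t)                  # equation 3.13
    v_hat = v / (1 - beta2 ** t)                  # equation 3.14
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # equation 3.15
    return w, m, v

# Minimise the toy objective J(w) = w^2, whose gradient is 2 * w.
w = np.array([5.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 5001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, eta=0.01)
```

Because of the bias correction, the very first update has a magnitude close to η regardless of the size of the gradient, which is one reason Adam is less sensitive to gradient scale than plain SGD.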

3.5.1.7 Masked Language Modelling


BERT is pre-trained with Masked Language Modelling (MLM). MLM is a task in which one or more
words in a sentence are masked (hidden) and the model tries to predict the masked word(s).
Given the input containing the masked token, the model generates the most probable
replacement, which helps it learn the properties of word sequences. Masking is essential in
the transformer architecture because without it the model can see the actual word and learn
to predict it trivially, without understanding the contextual representation. The model
learns the contextual representation of each input token. During training, tokens are
selected at random for prediction; a selected token is replaced by [MASK] 80% of the time,
replaced by a random token 10% of the time, and left as the exact original text 10% of the
time.

Example : Some of the words are replaced by token [MASK] and the model tries to pre-
dict the masked words.
Input sentence : This year sales are high.
Masked sequence: This year sales are [MASK].
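
The 80%/10%/10% rule can be sketched as follows. The function name, the toy vocabulary, and the selection probability of 0.15 (the value reported for the original BERT) are assumptions for illustration; this is not the tokenizer-level implementation used by BERT.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return the corrupted tokens and, per position, the target to predict (or None)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            masked.append(token)
            targets.append(None)       # not selected: nothing to predict here
            continue
        targets.append(token)          # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            masked.append(token)               # 10%: keep the original token
    return masked, targets

tokens = "This year sales are high .".split()
vocab = ["profit", "loss", "market"]   # hypothetical replacement vocabulary
masked, targets = mask_tokens(tokens, vocab)
```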

3.6 BERT Architecture


In BERT, the first step is pre-processing the input in the embedding layers, where the input
is tokenized and the token, segment, and position embeddings are learnt together to improve
learning. For example, a [CLS] token is placed at the start of the input, and a [SEP]
separator token is inserted between the input sentences. BERT has three embedding layers.
The detailed architecture is presented in figure 3.5.

3.6.1 Token Embedding


In BERT, WordPiece tokenization is used for token embedding. Token embedding is the model's
lowest layer, where each input token is transformed into its vector representation. Each
word piece is thus represented by a 768-dimensional vector. An example of token embedding is
illustrated in figure 3.4.

Example: Token Embedding


Input : This is financial text
Output: {This : 0, is : 2, financial : 4, text : 5 }


Figure 3.4: Token Embedding

3.6.2 Segment Embedding


The input sentences are processed in parallel in the model. Segment embeddings are used to
distinguish between the two sequences of an input pair. For example, if the input sequence
contains two sentences, then two segment vector representations are used.

3.6.3 Position Embedding


Knowing the position of a token plays an essential role in understanding its place in a
sentence. The (sin, cos) function pair helps the model learn patterns in token positions:
the cyclic sine and cosine functions encode the position of a word in the sentence and
ensure that words at different positions do not receive the same position embedding value.
Even embedding dimensions receive a sine value and odd embedding dimensions receive a cosine
value, as shown in equations 3.16 and 3.17.

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right) \tag{3.16}$$

$$PE(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{2i/d}}\right) \tag{3.17}$$
In equations 3.16 and 3.17, pos is the position of the word in the sequence; Pos_0 refers to
the position embedding of the first word in the series. d is the size of the token
embedding, say 5, and i refers to the embedding dimensions (i.e., 0, 1, 2, 3, 4). The values
of i and pos vary, whereas d is fixed. Changing the value of i in equations 3.16 and 3.17
therefore gives different position embedding values.
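
A minimal sketch of the sinusoidal position embedding is given below. It assumes the base 10000 of the original Transformer formulation; the function name and the small sizes are chosen only for the example.

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal position embeddings of shape (n_positions, d)."""
    pe = np.zeros((n_positions, d))
    positions = np.arange(n_positions)[:, None]
    even_dims = np.arange(0, d, 2)                         # the values 2i
    angles = positions / np.power(10000.0, even_dims / d)
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, : d // 2])              # odd dimensions: cosine
    return pe

pe = positional_encoding(n_positions=4, d=6)
```

At position 0 every sine entry is 0 and every cosine entry is 1; later positions receive distinct vectors because each dimension oscillates at a different frequency.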

For a tokenized input sequence of length n, the three embeddings are summed element-wise to
produce a single representation of shape (1, n, 768). The individual embeddings have the
following vector representations:

• Token Embedding has a vector representation of shape (1, n, 768).

• Segment Embedding helps distinguish between pairs of input sequences with a vector
representation of (1, n, 768).

• Position Embedding has a vector representation of shape (1, n, 768).


Figure 3.5: BERT Architecture

3.6.4 Normalization and Softmax Activation Function


The softmax function normalizes the outputs by converting them to probabilities that sum to
one. It is an activation function for multi-class classification problems. The results from
the encoder's hidden layer states are passed through softmax, which assigns higher
probabilities to high scores and lower probabilities to low scores. The largest value before
normalization remains the largest after normalization.

$$\mathrm{softmax}(\hat{y}) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \tag{3.18}$$
In equation 3.18, z_i are the elements of the input vector to the softmax function; they can
be positive or negative values such as (-2.4, 8, 0.4, ...). The standard exponential
function is applied to each element of the input vector. The term in the denominator is the
normalization term, which ensures that all the output values of the function sum to 1.
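
As a brief illustration, equation 3.18 can be applied to the example values mentioned above. Subtracting the maximum before exponentiating is a standard numerical safeguard added here; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1 (equation 3.18)."""
    z = np.asarray(z, dtype=float)
    exp = np.exp(z - z.max())   # shift by the maximum to avoid overflow
    return exp / exp.sum()

probs = softmax([-2.4, 8.0, 0.4])
```

The largest score (8) receives almost all of the probability mass, and the ordering of the scores is preserved.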

3.7 FinBERT
Training the BERT model from scratch may result in over-fitting because of its architecture,
a vast neural network with 110 million parameters. Therefore, the BERT model is further
pre-trained on financial news data [2]; a vast financial corpus, TRC2-financial, is used for
this further pre-training so that the model understands the financial context and makes
better predictions. Previous research achieved state-of-the-art performance on financial
sentiment classification [18], using the Financial PhraseBank [19] dataset to evaluate
performance. TRC2-financial is a dataset [19] of news articles published by Reuters,
consisting of 1.8M news articles published between 2008 and 2010. This pre-trained model is
then fine-tuned on our financial sentiment classification task, which classifies positive,
negative, and neutral sentiments, using our dataset. The Financial PhraseBank [19] dataset
is labelled by 16 people with a background in finance and business. 60% of the data is used
for training, 20% is set aside for testing, and the remaining 20% is used as the validation
set.

For the implementation of FinBERT, the authors [18] used a dropout probability of 0.1.
Dropout is a regularization technique that drops a hidden unit, along with its connections,
at training time with a specified probability, so that the network does not rely on specific
connections, which can lead to over-fitting. A warm-up period of 10,000 steps with a
learning rate of 2e-5 is used: during training, the learning rate is increased linearly from
approximately 0 to 2e-5 within the first 10,000 steps, which gives the model time to adapt
to the data. The model is trained with a maximum sequence length of 64 tokens, since the
average sentence length is not greater than 64, and for 6 epochs. The network layers are
initially frozen and then gradually unfrozen: at the start of training only the
classification layer is unfrozen, and after each training epoch the next layers are
unfrozen. The initial layers learn a generic representation of the data.
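
The linear warm-up described above can be sketched as a small schedule function. The function name is hypothetical, and holding the rate constant after the warm-up is an assumption, since the text only describes the warm-up phase.

```python
def warmup_learning_rate(step, peak_lr=2e-5, warmup_steps=10_000):
    """Learning rate at a given training step with a linear warm-up."""
    if step < warmup_steps:
        # Increase linearly from 0 to peak_lr over the warm-up period.
        return peak_lr * step / warmup_steps
    # Assumption: keep the peak rate once the warm-up has finished.
    return peak_lr

rates = [warmup_learning_rate(s) for s in (0, 5_000, 10_000, 20_000)]
```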

3.8 Pearson’s Correlation


Pearson's correlation coefficient is a statistic that measures the linear relationship
between two variables. It is a simple method that measures the association of the variables
using their covariance. The correlation coefficient always lies between -1 and +1. The
correlation is given in equation 3.19 [20].

$$\mathrm{corr}_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{3.19}$$

X and Y are random variables, cov is the covariance, σ_X is the standard deviation of X, and
σ_Y is the standard deviation of Y.

1. A correlation of -1: the two variables are perfectly negatively linearly related.

2. A correlation of 0: the two variables do not have any linear relation.

3. A correlation of 1: the two variables are perfectly positively linearly related.
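
Equation 3.19 can be computed directly from its definition. The sketch below uses hypothetical toy series; the population forms of the covariance and standard deviation are used consistently, so the normalisation factors cancel.

```python
import numpy as np

def pearson_correlation(x, y):
    """corr(X, Y) = cov(X, Y) / (sigma_X * sigma_Y), as in equation 3.19."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

sentiment = [0.1, 0.4, 0.2, 0.8, 0.5]        # hypothetical sentiment index values
price = [100.0, 103.0, 101.0, 108.0, 104.0]  # hypothetical price index values
r = pearson_correlation(sentiment, price)
```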

3.9 Time series


A series of data points listed in time order is known as a time series; most commonly it is
a sequence taken at equally spaced points in time. When the points of a time series are
plotted on a graph, one of the axes is ordered chronologically. Time series are used in
statistics, pattern recognition, mathematical finance, and many other application areas.

3.9.1 Stationary Time-series


Most statistical forecasting methods for time series data are based on the assumption of
stationarity. A time series is said to be stationary if it has a constant mean, variance,
and autocorrelation over time, so that the statistical properties observed in the past will
be the same in the future. For a stationary time series y_t:

• The mean m_t and the variance v_t of the series do not depend on time t.

• The autocovariance function f(s, t) between any two points in time s and t depends only on
their difference |s - t|.

3.9.2 Non-Stationary Time Series


A time series is non-stationary if its statistical properties, such as mean, variance, and
autocorrelation, are not constant over time. An example of a non-stationary time series is a
series with a trend that grows over time: the mean and variance of such a series grow as the
size of the sample increases. A non-stationary time series can be written as:

$$y_t = T_t + S_t + N_t$$

where y_t is a non-stationary time series, T_t is the trend component, S_t is the seasonal
component, and N_t is noise.

The Augmented Dickey-Fuller (ADF) test is used to test whether a time series is stationary.

3.10 Augmented Dickey-Fuller Test


The ADF test checks statistical significance in a series: it is a hypothesis test in which a
test statistic is used to calculate the p-value. It belongs to the "unit root test"
category, a family of methods for testing stationarity in a time series [21]. For example,
if a series is consistently increasing over time, the sample mean and variance grow with the
size of the sample, and the series is not stationary.

3.10.1 Unit root


If a time series has a unit root, it is non-stationary. A unit root is a stochastic trend in
a time series. In equation 3.20, a time series y at time t is said to have a unit root if
α = 1, where α is the coefficient of y_{t-1}. The unit root test determines how strongly a
trend defines a time series.

$$y_t = \alpha y_{t-1} + \beta X_e + \varepsilon \tag{3.20}$$

In equation 3.20, y_t is the time series at time t; α = 1 corresponds to a unit root of the
first degree; β is the coefficient of X_e, the explanatory variables from other time series;
and ε is the error term.

The ADF test is a unit root test that checks stationarity in a time series. The null
hypothesis of the test is that a unit root is present in the time series, i.e., the series
has some time-dependent structure and is not stationary. If the null hypothesis is rejected,
the alternative hypothesis is accepted and the time series is considered stationary.

— Null Hypothesis (H_0): The time series has a unit root, meaning it is non-stationary. It
has some time-dependent structure.

— Alternate Hypothesis (H_a): The time series does not have a unit root, meaning it is
stationary. It has a time-independent structure.


Three basic autoregressive models in difference form are used to detect the presence of a
unit root. The Auto Regressive AR(1) model is given as

$$y_t = \alpha y_{t-1} + \varepsilon_t \tag{3.21}$$

where α is a constant, y_t is the time series at time t, and ε_t is the residual term.

• Test 1: Test for a unit root with no drift (constant) and no trend. Subtracting y_{t-1}
from both sides of equation 3.21 gives equation 3.22, where γ = α - 1.

$$\Delta y_t = \gamma y_{t-1} + \varepsilon_t \tag{3.22}$$

– H_0: y_t is non-stationary, γ = 0

– H_a: y_t is stationary, γ < 0

• Test 2: Test for a unit root with drift (constant). In equation 3.23, α is a drift term.

$$\Delta y_t = \alpha + \gamma y_{t-1} + \varepsilon_t \tag{3.23}$$

– H_0: y_t is non-stationary, (γ = 0, α = 0)
– H_a: y_t is stationary, (γ < 0, α ≠ 0)

• Test 3: Test for a unit root with drift (constant) and trend. In equation 3.24, α is a
drift term and β is the coefficient of the time trend t.

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \varepsilon_t \tag{3.24}$$

– H_0: y_t is non-stationary, (γ = 0, α = 0, β = 0)
– H_a: y_t is stationary, (γ < 0, α ≠ 0, β ≠ 0)

The Dickey-Fuller test statistic is calculated as:

$$DF_T = \frac{\hat{\gamma}}{SE(\hat{\gamma})} \tag{3.25}$$

In equation 3.25, DF_T is the Dickey-Fuller (DF) test statistic, γ̂ is the estimated value
of γ, and SE(γ̂) is the standard error of γ̂. The calculated Dickey-Fuller t-statistic is
compared to the critical value of the DF t-distribution, taken from cumulative distribution
table A of t in the Supplementary Manual [22]. The critical value at a given significance
level (1%, 5%, 10%) is a cut-off point. If the DF test statistic is more negative than the
critical value, the null hypothesis is rejected. Equivalently, a p-value of less than 5%
means that we can reject the null hypothesis that there is a unit root.
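
The statistic in equation 3.25 can be illustrated for the simplest case, Test 1 (no drift, no trend, and no additional lagged differences, so this is the plain rather than the augmented Dickey-Fuller regression). The sketch estimates γ by ordinary least squares with NumPy; the simulated series are hypothetical, and the critical values still have to be taken from a Dickey-Fuller table.

```python
import numpy as np

def dickey_fuller_statistic(y):
    """t-statistic of gamma in  delta_y_t = gamma * y_{t-1} + eps_t."""
    y = np.asarray(y, dtype=float)
    lagged = y[:-1]                  # y_{t-1}
    delta = np.diff(y)               # y_t - y_{t-1}
    gamma_hat = np.dot(lagged, delta) / np.dot(lagged, lagged)
    residuals = delta - gamma_hat * lagged
    sigma2 = np.dot(residuals, residuals) / (len(delta) - 1)
    se_gamma = np.sqrt(sigma2 / np.dot(lagged, lagged))
    return gamma_hat / se_gamma

rng = np.random.default_rng(0)
noise = rng.normal(size=500)
random_walk = np.cumsum(noise)                 # alpha = 1: the series has a unit root
stat_walk = dickey_fuller_statistic(random_walk)
stat_noise = dickey_fuller_statistic(noise)    # white noise: a stationary series
```

The stationary series yields a strongly negative statistic, whereas the random walk yields a value much closer to zero, which is the pattern the test relies on.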

3.10.2 Discrete Difference


When a time series is tested for stationarity and found to be non-stationary, it is first
converted to stationary data by removing the trend. Discrete differencing makes the time
series stationary by taking the lag-1 difference, i.e., the first discrete difference in
equation 3.26.

$$d^{(1)}(t) = x(t) - x(t - 1) \tag{3.26}$$

— x(t): observation at time t.

— x(t - 1): observation at time t - 1.
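
A short sketch of equation 3.26 on a hypothetical trending series follows; `np.diff` computes the same lag-1 difference.

```python
import numpy as np

x = np.array([100.0, 102.0, 105.0, 109.0, 114.0])  # hypothetical trending series
first_difference = x[1:] - x[:-1]                  # d(t) = x(t) - x(t - 1)
```

The differenced series has one observation fewer than the original, because the first observation has no predecessor.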


3.11 Granger Causality Test


In most cases, regression only tells whether there is a relation between two variables, not
whether one variable causes the other. The Granger causality test checks for a predictive
causal relation between two time series, i.e., whether series x Granger-causes series y
[23, 24]. It is a statistical hypothesis test for determining whether one time series helps
to predict another: past values of one time series are significant in predicting the future
values of the other. If the p-value is less than the significance level, the null hypothesis
is rejected. The steps below follow the references [25, 26].
Step 1: State the null hypothesis and the alternate hypothesis.

— H_0: X does not Granger-cause Y.

— H_a: X Granger-causes Y.

— If H_0 is rejected, then X Granger-causes Y.

Step 2: The best AR model is determined for the time series y_t and for x_t by choosing the
lags. The best AR model shows how many lags, i.e., observations at previous time steps, are
useful for predicting the value at the next time step. The Granger test is run several times
with different lags to check that the results are the same for different lag values, so that
the results are not sensitive to the choice of lag. In equations 3.27 and 3.28, α_i are the
coefficients of the time series and ε_t is the error term, normally distributed with mean
zero and constant variance, ε_t ∼ N(0, σ²).

$$y_t = \alpha_0 + \sum_{i=1}^{m} \alpha_i y_{t-i} + \varepsilon_t \tag{3.27}$$

$$x_t = \alpha_0 + \sum_{i=1}^{m} \alpha_i x_{t-i} + \varepsilon_t \tag{3.28}$$

Step 3: Fit the restricted regression model and the full regression model and calculate the
residual sum of squares (RSS) of each, denoted RSS_restricted and RSS_full. In time series
analysis it is essential to test whether the residuals of the regression are correlated over
different time periods: when the error term of one time period is correlated with the error
term of subsequent time periods, it can lead to an incorrect conclusion. A smaller RSS
indicates a better fit. The null hypothesis corresponds to the restricted model, in which
all β_j = 0; the Granger test compares it with the full model. In equations 3.29 and 3.30,
α and β are coefficients.

$$x_t \;(RSS_{restricted}) = \alpha_0 + \sum_{i=1}^{m} \alpha_i x_{t-i} + \varepsilon_t \tag{3.29}$$

$$x_t \;(RSS_{full}) = \alpha_0 + \sum_{i=1}^{m} \alpha_i x_{t-i} + \sum_{j=1}^{m} \beta_j y_{t-j} + \varepsilon_t \tag{3.30}$$

The restricted model for x_t excludes the lags of y_t, and the full (unrestricted) model for
x_t includes them. To test the other direction, the roles of x_t and y_t are exchanged.

Step 4: Calculate the F-statistic using equation 3.31:

$$F = \frac{(RSS_{restricted} - RSS_{full}) / p}{RSS_{full} / (n - k)} \tag{3.31}$$

• n: number of observations.

• k: number of parameters in the full model.

• p: difference between the number of parameters in the full model and in the restricted
model.

• RSS is the residual sum of squares of the model. Under the null hypothesis, F has an
F-distribution with (p, n - k) degrees of freedom. The null hypothesis is rejected if the F
value calculated from the data is greater than the critical value of the F-distribution. The
chosen significance level is 0.05.

The p-value is the probability, under the assumption that the null hypothesis is true, of
observing an F value at least as large as the calculated statistic f, i.e., P(F ≥ f | H_0).
If the p-value is less than 0.05, the null hypothesis is rejected.

Step 5: If the p-value is less than 0.05, which corresponds to an F-statistic above the
critical value, reject the null hypothesis.
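
Steps 3 and 4 can be sketched with ordinary least squares in NumPy. The helper names and the simulated series are hypothetical; the F-statistic uses the standard form with (RSS_restricted - RSS_full) / p in the numerator and RSS_full / (n - k) in the denominator, and the p-value would additionally require the F-distribution (for example from a statistics library), which is left out here.

```python
import numpy as np

def lagged_matrix(series, m):
    """Columns hold series[t-1], ..., series[t-m] for t = m, ..., n-1."""
    n = len(series)
    return np.column_stack([series[m - i:n - i] for i in range(1, m + 1)])

def rss(design, target):
    # Residual sum of squares of the least-squares fit of target on design.
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(residuals @ residuals)

def granger_f_statistic(x, y, m):
    """F-statistic for H0: lags of y do not help to predict x (equations 3.29 to 3.31)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = x[m:]
    intercept = np.ones((len(target), 1))
    restricted = np.hstack([intercept, lagged_matrix(x, m)])   # equation 3.29
    full = np.hstack([restricted, lagged_matrix(y, m)])        # equation 3.30
    rss_restricted = rss(restricted, target)
    rss_full = rss(full, target)
    n, k, p = len(target), full.shape[1], m
    return ((rss_restricted - rss_full) / p) / (rss_full / (n - k))

# Simulated example: x follows y with a delay of one step, so y should help to predict x.
rng = np.random.default_rng(0)
y = rng.normal(size=300)
x = np.zeros(300)
x[1:] = 0.8 * y[:-1] + 0.1 * rng.normal(size=299)

f_y_to_x = granger_f_statistic(x, y, m=2)   # do lags of y help to predict x?
f_x_to_y = granger_f_statistic(y, x, m=2)   # do lags of x help to predict y?
```

Running the test for several values of m, as described in Step 2, shows whether the conclusion is sensitive to the choice of lag.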

3.12 Performance Measures


The performance of the classifier is measured with different metrics that show how well the
model has performed. As the data used for the thesis is text data and the task is
multi-class classification, suitable metrics are chosen. In multi-class classification, one
of the popular metrics is accuracy: the higher the accuracy, the better the model has
performed. After the classifier has made its predictions, a confusion matrix is generated,
which summarizes the results of the classification problem.

3.12.1 Confusion Matrix


For evaluating the performance of a classifier, a confusion matrix is used. It is an N x N
matrix. Here N is the number of target classes. It compares the actual values with the model
predicted values.

Figure 3.6: Confusion Matrix

In figure 3.6, a "True Positive (TP)" means the actual value was positive and the model
predicted a positive value. A "False Positive (FP)" means the actual value was negative but
the model predicted a positive value. A "False Negative (FN)" means the actual value was
positive but the model predicted a negative value. A "True Negative (TN)" means the actual
value was negative and the model predicted a negative value.

3.12.2 Classification Accuracy


Classification accuracy is defined as the number of correct predictions divided by the total
number of predictions made by the classifier. A true positive is a result in which the model
correctly predicts the positive class. Likewise, a true negative is a result in which the
model correctly predicts the negative class. A false positive is a result in which the model
mistakenly predicts the negative class as the positive class, and a false negative is a
result in which the model mistakenly predicts the positive class as the negative class. For
a binary classification problem, the accuracy is calculated as in equation 3.32 [27].

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.32}$$
In multi-class classification, the classifier here has three classes: "positive",
"negative", and "neutral". In equations 3.33 and 3.34, k is the number of classes, and tp_i,
tn_i, fp_i, and fn_i are the TP, TN, FP, and FN counts for class i in the confusion matrix.

$$Accuracy = \sum_{i=1}^{k} \frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i} \tag{3.33}$$

3.12.3 Misclassification
The misclassification rate, or error rate, is defined as the number of incorrect predictions
divided by the total number of predictions.

$$Misclassification = \sum_{i=1}^{k} \frac{fp_i + fn_i}{tp_i + tn_i + fp_i + fn_i} \tag{3.34}$$

3.12.4 Precision
Precision is another metric of classifier effectiveness. It is calculated as the number of
true positives (TP), i.e., positive predictions that actually belong to the positive class,
divided by the sum of all true positives and false positives (FP), and thus expresses the
agreement of the true class labels with those assigned by the classifier [27].

$$Precision = \sum_{i=1}^{k} \frac{tp_i}{tp_i + fp_i} \tag{3.35}$$

Precision measures how many of the instances that the classifier labels as positive are
actually positive.

3.12.5 Recall
Like Precision, Recall is also used to measure effectiveness; it is calculated as the true
positives divided by the sum of the true positives and false negatives [27].

$$Recall = \sum_{i=1}^{k} \frac{tp_i}{tp_i + fn_i} \tag{3.36}$$

Thus, Recall measures how many of the instances that were actually positive were correctly
identified as positive.

3.12.6 F1-Score
The F1-score is the harmonic mean of Precision and Recall and gives a better measure of the
incorrectly classified cases than Accuracy. It is a better metric for evaluating our model
when the class distribution is imbalanced. The F1-score combines Recall and Precision in a
single value, so both can be assessed simultaneously [27].

$$F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{3.37}$$
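
The metrics of this section can be sketched for the three sentiment classes as follows. The labels and predictions are hypothetical. Precision, recall, and F1 are reported per class here rather than combined over classes, and accuracy is computed as correct predictions over all predictions, following the definition in section 3.12.2; both choices are assumptions made for the illustration.

```python
import numpy as np

def confusion_matrix(actual, predicted, labels):
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[index[a], index[p]] += 1   # rows: actual class, columns: predicted class
    return matrix

def per_class_metrics(matrix):
    tp = np.diag(matrix).astype(float)
    fp = matrix.sum(axis=0) - tp          # predicted as class i but actually another class
    fn = matrix.sum(axis=1) - tp          # actually class i but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

labels = ["positive", "negative", "neutral"]
actual = ["positive", "positive", "negative", "neutral", "neutral", "negative"]
predicted = ["positive", "neutral", "negative", "neutral", "positive", "negative"]

matrix = confusion_matrix(actual, predicted, labels)
precision, recall, f1 = per_class_metrics(matrix)
accuracy = np.trace(matrix) / matrix.sum()
```

In this toy example the "negative" class is classified perfectly, while "positive" and "neutral" are confused with each other once in each direction.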


3.12.7 Cross-Entropy
For fine-tuning the FinBERT model, the test loss and validation loss are calculated as the
cross-entropy loss, which is the standard classification loss. Cross-entropy loss, also
known as log loss, is calculated from the class prediction vector produced with the weights
of the classification layer.

For a binary classification problem, where the number of classes is equal to 2,
cross-entropy is calculated as in equation 3.38, where y is the class label, log is the
natural logarithm, and p is the predicted probability.

$$\text{cross-entropy} = -\left(y \log(p) + (1 - y)\log(1 - p)\right) \tag{3.38}$$
In the case of multi-class classification, a separate loss is calculated for each class per
observation and the results are summed. In equation 3.39, y_{o,c} is a binary indicator of
whether class label c is the correct classification for observation o, p_{o,c} is the
predicted probability that observation o is of class c, and M is the number of classes.

$$\text{cross-entropy} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \tag{3.39}$$
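
Both forms of the loss can be evaluated directly. In the sketch below the predicted probabilities are hypothetical; in the multi-class case the true label is given as a one-hot vector, so only the probability assigned to the correct class contributes to the sum.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Equation 3.38 for a single observation with label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p):
    """Equation 3.39 for a single observation with M classes."""
    return -np.sum(y_onehot * np.log(p))

loss_binary = binary_cross_entropy(1, 0.5)
# Classes (positive, negative, neutral); the correct class is "negative".
loss_multi = categorical_cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.5, 0.3]))
```

Both calls return ln 2, because in each case the correct class is predicted with probability 0.5; a more confident correct prediction would give a smaller loss.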


3.13 Related Work


This section gives a brief summary of previous work on machine learning methods and deep
learning algorithms for sentiment prediction. In recent years, natural language processing
has attracted many data scientists [28]. The studies are divided into two categories: first,
the extraction of features from text by word counting with machine learning methods
[1, 4, 29, 30], and second, neural networks and deep learning models, in which sequence
embeddings and positional embeddings of the text are represented [5, 31, 32].

Financial sentiment analysis is a challenging task due to the lack of labeled data and the
language used. In one study, various machine learning methods are compared based on their
classification performance, using Thomson Reuters News Archive data [33]. In that research,
Neural Networks outperform other machine learning techniques such as NB and Support Vector
Machine (SVM) in classification performance. In another study, a hierarchical sentiment
classifier is proposed for text classification that handles domain-specific lexicons. The
polarity (sentiment) classifier is built as a hierarchical model using association rule
mining, with minimum confidence and minimum support thresholds for finding the frequent item
sets, and its polarity predictions serve as the sentiment predictions. The rules generated
from the frequent item sets are ordered by decreasing support, confidence, and antecedent
length. Lagging indicators and leading indicators are found [34].

Deep learning models such as convolutional and recurrent neural networks are used to perform
sentiment analysis. Their classifier performance (accuracy, Precision, Recall, and
F-measure) is compared with data mining approaches such as text categorization, information
retrieval, and logistic regression on a bag-of-words representation. Convolutional networks
outperform logistic regression in the extraction of sentiments [3]. Pre-trained models
require less labeled data and can be adapted to a specific domain; Universal Language Model
Fine-tuning for Text Classification (ULMFit) is a transfer learning approach for NLP
applications that is suitable for specific tasks and for small, domain-specific datasets in
financial learning [35]. Pre-trained language models such as ULMFit, LSTM, LSTM with
Embeddings from Language Models (ELMo), and BERT are compared based on their performance and
test loss. Even with small datasets, for example 500 samples, FinBERT, which is built on the
pre-trained BERT model, outperforms the other models and requires less time [18].
Figure 3.7 [18], taken from the related work, shows the test loss for different training set
sizes and how the test loss decreases as the training set size increases.

Figure 3.7: Test Loss v/s Training size for different Models

4 Method

This chapter covers the data description and pre-processing. The later sections describe how
the thesis work was carried out.

4.1 Data

4.1.1 Financial Data


The financial news feeds are scraped for seven months, from 15th March to 15th October. To
ensure that duplicate news is not stored, the field 'Newslink', which is unique for each news
item, is saved as the primary key in the database. For the thesis, the Option Market Index
(OMX) Stockholm 30 is used, which is an index of the thirty most traded shares on the
Stockholm Stock Exchange. This index is chosen so that its price indices can be correlated
with the extracted sentiment time series. The news of the OMX 30 companies is filtered from
the database using the field 'company name'. The data set used for the experiments has shape
(3562, 7), i.e. 3562 rows and 7 columns. The data fields are date, name of the company that
published the news, news link, news content, title, country of disseminating news, and
sector. The 'sector' field gives the industry type: IT, health, finance, mechanical, etc.
Figure 4.1 shows the different columns and the row count of each company. For example, the
number of news items (row count) over the 7 months is 80 for 'Swedbank' and 71 for 'Nordia'.
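
The duplicate handling described above can be sketched as follows. This is only a minimal illustration: the text does not name the database engine, so SQLite is assumed here, and the table name, column names and example URLs are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the (unnamed) database used in the thesis.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (news_link TEXT PRIMARY KEY, company_name TEXT, title TEXT)"
)

rows = [
    ("https://example.com/a", "Swedbank", "Interim report"),
    ("https://example.com/b", "Ericsson", "New contract"),
    ("https://example.com/a", "Swedbank", "Interim report"),  # same link scraped twice
]

# INSERT OR IGNORE skips any row whose primary key is already stored.
conn.executemany("INSERT OR IGNORE INTO news VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(stored)
```

Because the news link is the primary key, re-running the scraper over an overlapping date range does not create duplicate rows.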

4.1.2 Price Data


The price data is from OMX Stockholm. The OMX Stockholm 30 is an index of the 30 most-traded
stocks on the Nasdaq Stockholm stock exchange. The index is capitalization-weighted; table
4.1 shows the company weights, and the sum of the weights of all 30 companies is 1. The price
data comprises company name, date, weight index, and price value. The price data for the 7
months, from 15th March to 15th October, is sourced from the Bloomberg terminal. The size of
the price dataset (excluding the weekends) is (159, 4), i.e. 159 rows and 4 columns.


Figure 4.1: Data Set

Companyname Weight
Swedbank 0.03700339
Nordia 0.0315533
Ericsson 0.05718687
Atlas Copco 0.1148021
Sinch AB 0.02323883

Table 4.1: Weighted Price Indices

4.2 Software Environment

4.2.1 Python
Python is used for the experiments, including web scraping, data pre-processing, prediction,
and visualization. Python is an open-source programming language that provides good
frameworks for artificial intelligence, machine learning, statistical analysis, and
visualization. It supports a wide range of libraries with powerful, highly customizable
features; the packages used in the thesis are described below [36].

4.2.2 Urllib
The urllib library is used for handling and fetching Uniform Resource Locators (URLs). Its
urlopen function can fetch URLs using a variety of protocols [37].

4.2.3 BeautifulSoup
The purpose of the BeautifulSoup package is web scraping, i.e. extracting data out of Hyper
Text Markup Language (HTML) and Extensible Markup Language (XML) pages. From the fetched
pages a parse tree is constructed to search for the required tags. It provides flexibility
for navigating and searching through HTML elements. It is used to obtain the different
elements from each page, like news content, news link, and news title, iteratively for data
extraction [37].

4.2.4 Pandas
The pandas package is used for data pre-processing and data handling. For the next steps, it
is important to create a data structure for the scraped data; pandas provides fast and
flexible data structures for this. Its multipurpose functionality for handling data is an
advantage: all the data scraped for the thesis work is converted to a data frame for further
analysis and prediction [36].

4.2.5 NumPy
NumPy, which stands for "Numerical Python", is used for implementing numerical computations
on vectors and matrices. It is reported to be up to 50 times faster than computation on
Python lists. This library is used for data analysis and numerical calculation in the thesis
[38].

4.2.6 nltk
Natural Language Toolkit (NLTK) is a standard library that eases the use and implementation
of natural language processing and information retrieval tasks like tokenization, stemming,
parsing, and semantic text relationships [39].

4.2.7 Sklearn
Scikit-learn provides tools and functionality for machine learning and statistical modeling,
covering classification, clustering, and other predictions. For example, it can split data
into train, validation, and test subsets, and create features from text inputs, such as
tokens and count vectors (frequency counts) for tf-idf. For the classification task, it is
used to split the data into train and test sets [39].

4.2.8 Transformer
The Transformers library provides thousands of pre-trained models for NLP tasks such as text
classification, summarization, and translation in more than 100 languages. It is used here
for BERT tokenization and for building token sequences [39].

4.2.9 matplotlib
Matplotlib is a library for animated and interactive visualization that helps in creating
quantitative plots and layouts [36].

4.2.10 seaborn
Like matplotlib, the seaborn library is also used for data visualization and exploratory data
analysis, built on Matplotlib to create customized plots [36].

4.2.11 ggplot2
ggplot2, part of the tidyverse, is an open-source data visualization package for the R
statistical programming language [40].


4.2.12 statsmodel
Statsmodels provides classes and functions for estimating statistical models and for
performing statistical tests and data exploration, with a syntax similar to the R programming
language. Result statistics are available for the different estimators. It is used for
performing the correlation, Granger causality, and ADF (adfuller) tests [41].

4.2.13 TensorFlow
TensorFlow is an end-to-end open-source library for creating deep learning models to handle
extensive data and implementing complex models like BERT to simplify and speed up the
process [39].

4.2.14 Keras
Like TensorFlow, Keras is open-source software: a high-level Application Programming
Interface (API) that provides a Python interface for artificial neural networks and acts as
an interface for the TensorFlow library. It is more user-friendly and a little faster
compared to TensorFlow. This package is used for the implementation of the Financial
Bidirectional Encoder Representations from Transformers (FinBERT) model in the thesis [39].

4.2.15 Logging
Logging keeps track of all data input, processes, and data output of a program. When running
complex processes, it is important to log crashes in order to trace defects.

4.2.16 GPU
GPUs can perform many computations simultaneously. This enables the distribution of training
processes and can significantly speed up machine learning operations. With GPUs, you can
accumulate many cores that use fewer resources without sacrificing efficiency or power.
Furthermore, a GPU accelerates the training of the model. Hence, a GPU is the better choice
for training the deep learning models efficiently.

4.3 Web Scraping


Web scraping captures (extracts) data from a website by requesting a web page over the
Hypertext Transfer Protocol (HTTP). The web page content is fetched by web crawling, and the
required parts are located using web elements like class attributes, div tags, and anchor
tags. For example, the steps used for scraping https://mfn.se/all are:

1. An HTTP request is sent to the webpage, which returns the HTML content of the page.

2. The content of the HTML page is parsed; parsing creates a tree-like nested structure of
the HTML data.

3. The parse tree created in step 2 is navigated and searched, traversing the tree for the
required elements.
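
Steps 2 and 3 above can be sketched as follows. This is only an illustration under stated assumptions: the thesis uses BeautifulSoup, whereas this sketch swaps in Python's standard-library html.parser so that it runs without extra packages, and the HTML fragment, class name and link paths are made up because the real page markup is not reproduced in the text. Step 1 (the HTTP request) is left out to keep the sketch offline.

```python
from html.parser import HTMLParser

# Made-up page fragment; the real markup of the news site is not given in the text.
SAMPLE_HTML = """
<div class="news-item"><a href="/a/1">Company A publishes interim report</a></div>
<div class="news-item"><a href="/a/2">Company B wins new contract</a></div>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every anchor tag in the page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when an anchor tag opens.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Store the finished (link, title) pair when the anchor closes.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

With BeautifulSoup the same traversal would be expressed as searches on the parsed tree rather than as parser callbacks.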

4.3.1 Web scraping from the Modular Finance news website


The web scraping for each news feed was a challenging task. However, inspecting the page
helped in understanding the HTML tag elements in the web page. The parse tree constructed
from a page source has a hierarchical and structured format. The HTML parse tree is a simple
hierarchical layout that can be traversed via a library called BeautifulSoup [37]. In figure
4.2, each element that is scraped is highlighted. More details are given in the section
Software Environment.

Figure 4.2: Scraped Web Elements

4.4 Data Preparation


This section discusses the labelling of the data and different data preparations required before
performing the classification.

4.4.1 Training, Validation, and Test sets


Labelling the financial data set for sentiment is the most difficult part, as analysing
financial text requires expertise in finance and business. The 3562 news feeds used for the
experiments are labelled as 'positive', 'negative' and 'neutral' by three quantitative data
analysts with a background in finance and business. There are 1860 positive, 1026 neutral
and 676 negative instances. The positive class dominates: around 52% of the instances are
positive, around 29% are neutral and around 19% are negative.

The dataset has been segregated into training, validation, and test sets. The classifier
uses the training set to fit the model's parameters, and the validation set is used to
fine-tune the hyperparameters, for example selecting the number of hidden layers in a neural
network. The test set is used to provide an unbiased assessment of the performance of the
final model. For the Naïve Bayes classifier, 75% of the data is used for training and 25% for
testing. For BERT and FinBERT, 60% is used for training, 20% for validation, and the
remaining 20% for testing.
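
As a minimal sketch of the 60/20/20 split used for BERT and FinBERT, assuming a simple random shuffle: the thesis does not state how the split was drawn, so the function name, the seed and the use of integer indices in place of news items are all illustrative.

```python
import random


def split_dataset(items, train_frac, val_frac, seed=0):
    """Shuffle items and cut them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Indices standing in for the 3562 labelled news items.
samples = list(range(3562))
train, val, test = split_dataset(samples, 0.60, 0.20)
print(len(train), len(val), len(test))
```

A stratified split, which keeps the class proportions equal in every subset, would be a reasonable alternative given the class imbalance, but the text does not say whether one was used.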

4.4.2 Data Preparation for Naïve Bayes classifier


Before performing the classification using the Naïve Bayes classifier, the first
pre-processing step is converting the input text to lower case, which helps produce
consistent output, and removing the English stop words from the text. This helps the
classifier focus on more meaningful words for better accuracy. Term frequency-inverse
document frequency (TF-IDF) is a statistic that reflects how important a word is in a text,
based on the number of times the word occurs [8]. The tf-idf value increases as the word
appears more often in the text. The CountVectorizer function from the Sklearn package [39]
transforms the input text into a numerical vector based on the frequency count of each word.
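
The pre-processing and count-vector idea can be illustrated without any external package. This is a sketch of the principle only, not of scikit-learn's CountVectorizer itself: the stop-word list is a tiny hypothetical one and the two example sentences are made up.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full English list.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for"}


def preprocess(text):
    """Lower-case the text, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


docs = [
    "The profit of the bank increased",
    "Profit warning for the bank",
]
tokens = [preprocess(d) for d in docs]

# Sorted vocabulary so that every document maps to the same columns.
vocab = sorted({w for doc in tokens for w in doc})

# One row per document, one column per vocabulary word, entries are word counts.
vectors = [[Counter(doc)[w] for w in vocab] for doc in tokens]

print(vocab)
print(vectors)
```

Lower-casing makes "Profit" and "profit" share one column, which is the consistency the paragraph above refers to.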

4.4.3 BERT Tokenization


Data preparation is the first step for BERT. The text is tokenized, meaning it is split into
tokens.

1. The special tokens [CLS] and [SEP] are added to each input text. [CLS], used for the
classification task, is added to the beginning of the sequence. [SEP] is added at the end
of every sentence.

2. All input sequences are truncated or padded to the maximum length parameter of the
classifier. Max_length is the maximum length of the input sequence, which is limited to
512 tokens for BERT.

3. Each token is mapped to its id in the BERT tokenizer vocabulary, as explained in
subsection Token Embedding.

4. The text is converted to numeric TensorFlow tensors, because the BERT encoder requires
numeric inputs rather than string tensors, as explained in section BERT Architecture.

Example: [CLS] First sentence [SEP] second sentence [SEP]
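
The steps above can be sketched in plain Python. This is a simplified illustration and not the actual BERT tokenizer: the vocabulary and its ids are made up, words are used as tokens instead of WordPiece sub-words, and the attention mask (marking real tokens versus padding) is an addition that the text does not describe.

```python
def build_bert_input(tokens, vocab, max_length):
    """Add [CLS]/[SEP], truncate or pad to max_length, and map tokens to ids."""
    # Reserve two positions for the special tokens before truncating.
    tokens = tokens[: max_length - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    attention_mask = [1] * len(sequence)

    # Pad on the right up to max_length; padded positions get mask value 0.
    padding = max_length - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding

    # Unknown words fall back to the [UNK] id.
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sequence]
    return sequence, input_ids, attention_mask


# Toy vocabulary with made-up ids, not the real BERT vocabulary.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "profit": 4, "rises": 5}

sequence, input_ids, attention_mask = build_bert_input(
    ["profit", "rises", "sharply"], toy_vocab, max_length=8
)
print(sequence)
print(input_ids)
print(attention_mask)
```

In the thesis setting, max_length would be one of the sequence lengths tried during fine-tuning rather than 8.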

4.5 Sentiment Analysis Pipeline


The process of sentiment analysis is explained in this section. The pipeline structure is
shown in figure 4.3. The financial news feeds are scraped from Modular Finance (MFN) and
stored in the database with 'news link' as the primary key, which is unique for each news
item. The sector type and industry type are added as metadata. This additional information
helps to assign each news item to the correct sector/industry. The news is filtered based on
a keyword, i.e., the company name. The news is classified as positive, negative, or neutral
using a classifier. The date and the count of sentiments are then extracted, and the
sentiment indices are constructed.

Figure 4.3: Shows the Sentiment Analysis Pipeline


4.5.1 Construction of Sentiment Time series


The size of the considered sentiment dataset is (3562, 7), and the price data set is (159, 4).
To find the correlation (causality) between sentiment and price indices, the sentiment time
series is calculated using the following steps:

1. The data, i.e. news content with metadata like sector, industry, and country, is scraped
from MFN.

2. For each day on which one or more news items are published for the top 30 companies, the
sentiment of each news item is classified as positive, negative, or neutral.

3. Positive is represented as 1, negative as -1, and neutral as 0.

4. The news of the OMX Stockholm 30 companies is filtered from the news database to
calculate the sentiment index.

5. The sentiment value of each news item is multiplied by the weight of the company that
published it; for example, the news sentiment for Swedbank is multiplied by 0.037, giving
0.037 for positive, -0.037 for negative and 0 for neutral sentiment. In this way the
sentiment indices are given the same weights as the price indices, explained in section
Price Data.

6. In the OMXS 30 price index, each company has a corresponding weight, and the weights,
expressed as percentages, sum to 100. These weights are multiplied by the sentiments, so
that the sum of sentiment on a particular day ranges from -1 to +1.
$$ SI = \sum_{i=1}^{n} S_c \cdot W_p \qquad (4.1) $$

• In equation 4.1, SI is the sentiment index for each day.
• Sc is the sentiment (1: positive, -1: negative, 0: neutral) of each company's news, as
classified by the selected classifier.
• Wp is the price weight of the respective company.

7. For each news sentiment, the sentiment index is calculated using the values from step 3,
to ensure that the sentiment value always ranges between -1 and +1.

8. The data is grouped by date, and the cumulative sum over the days is computed to construct
the sentiment indices time series.

9. The cumulative sum of the sentiments is a measure of the sentiment implication over time.
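
The steps above can be sketched as follows. The company weights are taken from table 4.1; the classified news items are made up for illustration, and the variable names are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate

# Weights from table 4.1 (only three of the thirty companies shown).
WEIGHTS = {"Swedbank": 0.03700339, "Ericsson": 0.05718687, "Atlas Copco": 0.1148021}
LABEL_TO_VALUE = {"positive": 1, "negative": -1, "neutral": 0}

# Made-up classified news items: (date, company, predicted label).
news = [
    ("2021-03-15", "Swedbank", "positive"),
    ("2021-03-15", "Ericsson", "negative"),
    ("2021-03-16", "Atlas Copco", "positive"),
    ("2021-03-16", "Swedbank", "neutral"),
]

# Steps 5 and 6: weight each sentiment value, then sum per day (equation 4.1).
daily = defaultdict(float)
for date, company, label in news:
    daily[date] += LABEL_TO_VALUE[label] * WEIGHTS[company]

# Step 8: order by date and take the cumulative sum to get the time series.
dates = sorted(daily)
sentiment_index = [daily[d] for d in dates]
cumulative = list(accumulate(sentiment_index))

for d, s, c in zip(dates, sentiment_index, cumulative):
    print(d, round(s, 6), round(c, 6))
```

The two printed columns correspond to the daily sentiment and cumulative sum columns of the sentiment time series reported in the results.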

5 Results

This section will cover the results of the models, the correlation between sentiment indices
and price indices, and the causality relation between them. The results are calculated using
the evaluation metrics described in Section Performance Measures. The results are discussed
in Section Discussion. The evaluation metrics are calculated using the test data for all models,
consisting of 20% of the dataset. The predicted labels of the test data are compared with the
actual labels. The statistical hypothesis test results are presented.

5.1 Sentiment Prediction


This section presents the performance of each classifier: Multinomial Naïve Bayes, BERT, and
FinBERT. The input to the classifier is the text, and the output is the predicted class,
'positive', 'neutral' or 'negative'. The metrics used to measure the classifier performance
are accuracy, precision, recall, and F1-score. A detailed description of these metrics is
given in the section Performance Measures. The evaluation metrics for the training data are
presented in table 5.2 and for the test data in table 5.3. The hyper parameters used for the
BERT and FinBERT values reported in tables 5.2 and 5.3 are given in table 5.1.

Classifier   Maximum Sequence Length   Epochs   Learning Rate   Batch Size
BERT         128                       4        3e-5            16
FinBERT      64                        3        3e-5            8
Table 5.1: Hyper-parameter considered for BERT and FinBERT to report training accuracy,
test accuracy and test confusion matrix


Classifier   Training Accuracy   Precision   Recall   F1-Score
NaiveBayes   0.794               0.801       0.791    0.795
BERT         0.921               0.894       0.938    0.912
FinBERT      0.890               0.871       0.919    0.920
Table 5.2: Evaluation Metrics of Classifiers for Training data

Classifier   Test Accuracy   Precision   Recall   F1-Score
NaiveBayes   0.678           0.687       0.634    0.681
BERT         0.882           0.899       0.865    0.881
FinBERT      0.913           0.924       0.898    0.910
Table 5.3: Evaluation Metrics of Classifiers for Test data

The confusion matrices are presented in Figures 5.1, 5.2 and 5.3 in normalized form. The
advantage of the normalized matrix is that it avoids the differences in counts that arise
from class imbalance. Every row in the confusion matrix corresponds to the actual instances
of one class label ('positive', 'negative' or 'neutral'). The row elements are divided by the
sum of the entire row, so each row in a normalized confusion matrix sums to 1.00. Each entry
is then the proportion of instances with that true label that the model assigned to each
predicted class. The column sums may deviate from 1.00, depending on how many instances the
model assigns to each predicted class.
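
The row normalization can be sketched as follows. The counts are made up for illustration and are not the thesis results; rows are true labels and columns are predicted labels.

```python
def normalize_rows(matrix):
    """Divide every entry by its row total so that each row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against an empty class to avoid division by zero.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Made-up counts; rows = true class, columns = predicted class,
# both in the order positive, neutral, negative.
counts = [
    [80, 15, 5],
    [10, 30, 10],
    [20, 20, 60],
]

normalized = normalize_rows(counts)
column_sums = [sum(col) for col in zip(*normalized)]

for row in normalized:
    print([round(v, 2) for v in row])
print([round(v, 2) for v in column_sums])
```

The diagonal of the normalized matrix is the per-class recall, which is why it is comparable across classes of different size.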

Figure 5.1: Confusion matrix for Naive Bayes model for test data. The diagonal from top left
corner to bottom right corner shows the percentage of correctly classified instances.


Figure 5.2: Confusion matrix for BERT model for test data. The diagonal from top left corner
to bottom right corner shows the percentage of correctly classified instances.

Figure 5.3: Confusion matrix for FinBERT model for test data. The diagonal from top left
corner to bottom right corner shows the percentage of correctly classified instances.

5.1.1 Fine Tuning


The hyper parameters for the BERT model are explained in section Fine Tuning. For training
BERT and FinBERT, gradual unfreezing is set to True. In gradual unfreezing, rather than
fine-tuning all layers at once, the last layer is unfrozen first and fine-tuned for one
epoch while the remaining layers are frozen. Then the next layer is unfrozen in the next
epoch, and all the unfrozen top layers are fine-tuned. More and more layers are unfrozen in
each epoch. The attention dropout probability is 0.1, which is the dropout ratio for the
attention probabilities and causes some of the weights to be zeroed out during training. The
original model weights, i.e., the weights of the pre-trained BERT and FinBERT, are used for
initialization. This helps the model retain the embeddings that have already been learned.
In gradual unfreezing, only the unfrozen top layers move away from their pre-trained weight
values. When learning on new data, the model might deviate from what it has learned, so
lower learning rates are used at the beginning of training.
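
The unfreezing schedule described above can be sketched independently of any deep learning framework. The function below is hypothetical and only computes which layers would be trainable in each epoch, assuming one additional layer is unfrozen per epoch starting from the top; the layer count of 4 is chosen only to keep the output short.

```python
def trainable_layers(num_layers, epoch):
    """Return the indices of the unfrozen layers at a given 0-based epoch.

    Layer num_layers - 1 is the top (last) layer. One more layer is
    unfrozen in every epoch until the whole model is trainable.
    """
    n_unfrozen = min(epoch + 1, num_layers)
    return list(range(num_layers - n_unfrozen, num_layers))


schedule = [trainable_layers(4, epoch) for epoch in range(6)]
for epoch, layers in enumerate(schedule):
    print(epoch, layers)
```

In a framework such as Keras, the same schedule would typically be applied by toggling each layer's trainable flag before the corresponding epoch; how the thesis implementation does this is not shown in the text.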

The hyper parameters considered are maximum sequence length, epochs, learning rate, and
batch size. Some of the combinations tried are presented in table 5.4. A few experiments were
conducted to try out different combinations of hyper parameters; however, due to the limited
computational resources, it was not possible to try out many combinations.

Classifier   Maximum Sequence Length   Epochs   Learning Rate   Batch Size   Test Accuracy   F1-Score
BERT         128                       2        3e-5            8            0.79            0.80
BERT         128                       4        3e-5            16           0.882           0.881
BERT         64                        4        3e-5            16           0.89            0.92
BERT         128                       4        2e-5            16           0.84            0.87
BERT         64                        5        2e-5            8            0.91            0.93
Table 5.4: BERT Hyper-parameter selection

From table 5.4, the optimal hyper parameters chosen for BERT are learning rate 2e-5, number
of epochs 5, batch size 8 and max_sequence_length 64, giving the best test accuracy of 91%.
The selection of hyper parameters for the FinBERT model is presented in table 5.5. A few
experiments were conducted to try out different combinations of hyper parameters.

Classifier   Maximum Sequence Length   Epochs   Learning Rate   Batch Size   Test Accuracy   F1-Score
FinBERT      100                       2        3e-5            6            0.87            0.89
FinBERT      64                        3        3e-5            8            0.898           0.910
FinBERT      64                        4        4e-5            8            0.90            0.93
FinBERT      128                       5        2e-5            16           0.92            0.93
FinBERT      56                        6        2e-5            32           0.93            0.95
Table 5.5: FinBERT Hyper-parameter selection

From table 5.5, the optimal hyper parameters chosen for the FinBERT model are learning rate
2e-5, train_batch_size 32, max_sequence_length 56 and number_of_epochs 6. Figure 5.4 shows
the validation loss and test loss of the FinBERT model: with each epoch the model updates its
weights and internal parameters, so that performance increases and the loss decreases. As
training is computationally expensive, only 6 epochs are used; with more epochs the
performance might increase further. Figure 5.5 shows how the test loss decreases from 0.9 to
0.4 when the size of the training data is increased.


Figure 5.4: Plot Epochs v/s Loss

Figure 5.5: Plot Training data size v/s Test loss

5.2 Classified data


After the classification task, the text is classified into three classes (multi-class
classification): 'positive' represented as 1, 'negative' represented as -1, and 'neutral'
represented as 0. The classified sentiments for the OMX Stockholm 30 companies over the 7
months, i.e., 15th March to 15th October, are shown in the bar chart in figure 5.7b with the
total count of each sentiment. The highest count is for positive with 1860, followed by
neutral with 1026 and negative with 676. The positive class dominates: around 52% of the
instances are positive, around 29% are neutral and around 19% are negative. Figure 5.7a shows
a bar chart of the sentiment counts for one week, 4th October to 8th October. Figure 5.6
shows the sentiment change in the week 11th October to 15th October.


Figure 5.6: Shows Sentiments change in a Week 11th Oct - 15th Oct

(a) Count of Sentiments in one Week (b) Count of Sentiments in seven Months
Figure 5.7: Count of Sentiments (a) in one week 4th Oct - 8th Oct (b) in 7 months 15th March -
15th Oct .

5.3 Time Series


The sentiment time series is constructed by multiplying each company's weight by its
sentiment values and taking the cumulative sum of those values. Table 5.6 shows the
calculated sentiment time series. After calculating the sentiment time series, the shape of
the sentiment time series data is (155, 3), the same shape as the price dataset.


Date         Sentiment   cumsum
2021-03-15   -0.090509   -0.090509
2021-03-16    0.050629   -0.039880
2021-03-17    0.161965    0.122085
2021-03-18    0.015510    0.137594
2021-03-19    0.043711    0.181305

Table 5.6: Sentiment indices from 15th March to 19th March.

Figure 5.8 shows the cumulative sentiment time series for the dataset. The cumulative
sentiment graph shows a positive trend, as most news is predicted as positive or neutral.
Figure 5.9 shows the price time series, a discrete time series of well-defined numerical
price values at successive points in time, together with the sentiment implication over time.

Figure 5.8: Plot for cumulative sentiment

Figure 5.9: Plot for price time series and sentiment implication


The first step is to check whether the time series are stationary, using the ADF test as the
statistical hypothesis test for both the sentiment and the price time series. A series that
is not stationary is then transformed to make it stationary.

5.3.1 ADF Test for sentiment series


— Null Hypothesis Ho :: Sentiment time series is not stationary .

— Alternative Hypothesis Ha :: Sentiment time series is stationary

If the p-value < 0.05 significance level, reject the null hypothesis and accept the alternative
hypothesis that the series is stationary.

Test statistics : -9.84535529804266


p-value : 4.6353275090038285e-17
critical_value : ’1%’: -3.474712913481481, ’5%’ : -2.881008708148148, ’10%’ : -2.57715084444444

Table 5.7: ADF test statistics for sentiment time series

The p-value obtained in table 5.7 is less than the significance level of 0.05, and the ADF
statistic is less than all the critical values. Therefore, the null hypothesis is rejected,
and the sentiment time series is considered stationary.

5.3.2 ADF Test for price series


— Null Hypothesis Ho :: The price time series is not stationary .

— Alternative Hypothesis Ha :: The price time series is stationary

Test statistics : -1.8915472905221271


p-value : 0.33604790206698165
critical_value : ’1%’: -3.4753253063120644, ’5%’ : -2.891274703275266, ’10%’ : -2.5772928360116873

Table 5.8: ADF test statistics for price time series

If the p-value < 0.05 significance level, reject the null hypothesis and accept the
alternative hypothesis that the series is stationary.

The results in table 5.8 show that the p-value obtained is greater than the significance
level of 0.05 and the ADF statistic is greater than all the critical values. Therefore, the
null hypothesis cannot be rejected, and the price time series is considered non-stationary.

To make the price time series stationary, the lag-1 (first) discrete difference is taken, and
the test is re-run to check that the differenced series is now stationary.
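
The lag-1 difference can be sketched as follows. The price values are made up; only the differencing step is shown, since re-running the ADF test requires the statsmodels package described in the Software Environment section.

```python
def lag1_difference(series):
    """Return the first difference x[t] - x[t-1]; the result is one element shorter."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Made-up price values with an upward trend.
prices = [100.0, 101.5, 101.0, 103.0, 104.5]
price_diff = lag1_difference(prices)
print(price_diff)
```

Differencing removes the level of the series and keeps only the day-to-day changes, which is why one observation is lost at the start.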

Test statistics : -5.440189685982799


p-value : 2.787409279405758e-17
critical_value : ’1%’: -3.4753063120644, ’5%’ : -2.881274703275226, ’10%’ : -2.5772928360116873

Table 5.9: Second ADF test statistics for price time series


The p-value obtained in table 5.9 is less than the significance level of 0.05, and the ADF
statistic is less than all the critical values. Therefore, the null hypothesis is rejected,
and the differenced price time series is considered stationary.

5.3.3 Auto Correlation Function - ACF


Autocorrelation measures the similarity between a time series and its lagged values over
successive time intervals. The ACF plot is a visual representation of how the correlation in
the data changes over time.

(a) Price over time (b) AutoCorrelation for non-stationary time series
Figure 5.10: (a) Price over time (b) AutoCorrelation for non stationary price time series

The autocorrelation values are plotted along with a confidence band; the ACF plot shows how
strongly the current values are related to past values. Figure 5.10 shows that the ACF of the
non-stationary series decays slowly towards zero, which also indicates a trend in the series.

(a) Price Difference over time (b) AutoCorrelation for stationary time series
Figure 5.11: (a) Price Difference over time (b) AutoCorrelation for stationary price time series

After taking the lag-1 difference, the price time series is stationary and no longer carries
the effects of trend or seasonality; no trend is observed in the ACF plot in figure 5.11.

5.4 Statistical Test

5.4.1 Correlation
First, Pearson's correlation is calculated to check whether there is any relation between
sentiment and price movements. Before calculating the correlation, the time series
dependencies should be removed by making the data stationary. The correlation value of 0.6613
indicates a strong association between the two series. The scatter plot in figure 5.12
illustrates this strong relationship.
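
Pearson's correlation coefficient can be sketched directly from its definition. The two short series below are made up and stand in for the stationary sentiment and price series; they do not reproduce the reported value of 0.6613.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up stationary series (e.g. daily sentiment index and differenced price).
sentiment_series = [0.05, -0.02, 0.10, 0.01, -0.04]
price_series = [1.2, -0.3, 2.1, 0.4, -1.0]

r = pearson(sentiment_series, price_series)
print(round(r, 4))
```

The coefficient always lies between -1 and +1, with values near +1 meaning that the two series tend to move in the same direction.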

(a) Correlation price v/s sentiment (b) Correlation price v/s sentiment with best fit
Figure 5.12: (a) Correlation price v/s sentiment (b) Correlation price v/s sentiment with best
fit line

5.4.2 Granger Causality Test


It is a statistical hypothesis test to determine whether one time series provides helpful
information in forecasting another time series [42]. A detailed explanation of the Granger
causality test is presented in section Granger Causality Test.

5.4.2.1 Test1
— Null Hypothesis Ho : sentiment do not granger cause price.

— Alternative Hypothesis Ha :sentiment granger cause price.

In tables 5.10 and 5.11, different numbers of lags are considered to find the best AR model
for the time series. The historical values of both X and Y are used to predict the value of
Y, rather than only the historical values of Y. When the number of lags is large, the F-test
can lose power. An alternative is to use the chi-square test, constructed with a likelihood
ratio. The sum of squared residuals (SSR) based F test and chi-square results are presented.
df stands for degrees of freedom, df_denom is the denominator degrees of freedom, and df_num
is the numerator degrees of freedom.
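
For reference, the SSR-based F test compares two autoregressive models of Y. The notation below is a standard textbook formulation and is not taken from the thesis: p is the number of lags, T is the number of observations used in the regression, and the subscripts r and u denote the restricted and unrestricted models.

```latex
\begin{align}
\text{Restricted:}   \quad & Y_t = a_0 + \sum_{j=1}^{p} a_j Y_{t-j} + e_t \\
\text{Unrestricted:} \quad & Y_t = a_0 + \sum_{j=1}^{p} a_j Y_{t-j}
                             + \sum_{j=1}^{p} b_j X_{t-j} + u_t \\
H_0: \quad & b_1 = b_2 = \dots = b_p = 0 \\
F = {} & \frac{(SSR_r - SSR_u)/p}{SSR_u/(T - 2p - 1)}
\end{align}
```

Under this formulation the numerator degrees of freedom equal p and the denominator degrees of freedom equal T - 2p - 1, which appears consistent with the df_num and df_denom values reported in the tables (for example, df_num = 1 and df_denom = 137 at one lag would correspond to T = 140).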


Granger Causality
number of lags(no zero) 1

ssr based F test: F=10.5902 , p=0.021 , df_denom=137, df_num=1


ssr based chi2 test: chi2=11.6031 , p=0.009 , df=1
likelihood ratio test: chi2=9.6018 , p=0.006 , df=1
parameter F test: F=8.5902 , p=0.023 , df_denom=137, df_num=1
Granger Causality
number of lags(no zero) 2

ssr based F test: F=3.934 , p=0.021 , df_denom=134, df_num=2


ssr based chi2 test: chi2=8.657 , p=0.012 , df=2
likelihood ratio test: chi2=8.0354 , p=0.0181 , df=2
parameter F test: F=3.9146 , p=0.0265 , df_denom=134, df_num=2

Granger Causality
number of lags(no zero) 3

ssr based F test: F=2.4189 , p=0.0441 , df_denom=131, df_num=3


ssr based chi2 test: chi2=10.3240 , p=0.0167 , df=3
likelihood ratio test: chi2=4.3177 , p=0.0255 , df=3
parameter F test: F=2.4189 , p=0.0441 , df_denom=131, df_num=3
Granger Causality
number of lags(no zero) 4

ssr based F test: F=5.3566 , p=0.00061 , df_denom=128, df_num=4


ssr based chi2 test: chi2=10.5268 , p=0.0002 , df=4
likelihood ratio test: chi2=13.5184 , p=0.0017 , df=4
parameter F test: F=4.3566 , p=0.0059 , df_denom=128, df_num=4

Table 5.10: Results for granger causality test1.

In table 5.10, the p-values are below the 0.05 significance level for every lag, and the
F-test and chi-square statistics are correspondingly large. The null hypothesis is therefore
rejected; hence, sentiment Granger-causes price.


5.4.2.2 Test2
— Null Hypothesis Ho : price does not granger cause sentiment.

— Alternative Hypothesis Ha :price granger cause sentiment.

Granger Causality
number of lags(no zero) 1

ssr based F test: F=0.5902 , p=0.4437 , df_denom=137, df_num=1


ssr based chi2 test: chi2=0.6031 , p=0.4374 , df=1
likelihood ratio test: chi2=0.6028 , p=0.4379 , df=1
parameter F test: F=0.5902 , p=0.4437 , df_denom=137, df_num=1
Granger Causality
number of lags(no zero) 2

ssr based F test: F=0.4306 , p=0.6510 , df_denom=134, df_num=2


ssr based chi2 test: chi2=0.8934 , p=0.6397 , df=2
likelihood ratio test: chi2=0.8905 , p=0.6407 , df=2
parameter F test: F=0.4306 , p=0.6510 , df_denom=134, df_num=2
Granger Causality
number of lags(no zero) 3

ssr based F test: F=0.4189 , p=0.7397 , df_denom=131, df_num=3


ssr based chi2 test: chi2=1.3240 , p=0.7235 , df=3
likelihood ratio test: chi2=1.3177 , p=0.7249 , df=3
parameter F test: F=0.4189 , p=0.7397 , df_denom=131, df_num=3

Granger Causality
number of lags(no zero) 4

ssr based F test: F=0.3566 , p=0.8390 , df_denom=128, df_num=4


ssr based chi2 test: chi2=1.5268 , p=0.8219 , df=4
likelihood ratio test: chi2=1.5184 , p=0.8234 , df=4
parameter F test: F=0.3566 , p=0.8390 , df_denom=128, df_num=4

Table 5.11: Results for granger causality test2.

In table 5.11, the p-values are well above the 0.05 significance level for every lag, and the
F-test and chi-square statistics are small. The null hypothesis cannot be rejected; thus,
there is no evidence that price Granger-causes sentiment. In the Granger causality test, if a
large number of lags is used, the F-test loses power; alternatively, the chi-square test can
be considered.

6 Discussion

This chapter includes a discussion of the results and directions for future improvement.

6.1 Results
The results achieved in Section 5 help in understanding sentiment analysis of unstructured
financial data and the challenges involved in the process: the language used in the financial
domain makes general-purpose models inefficient for this task.

One of the objectives of the thesis was to identify a suitable model for sentiment
prediction. The data was scraped from https://mfn.se/all and labelled as described in
Section Data Preparation. The models were trained and evaluated on the labelled data. Model
performance was evaluated using accuracy, F1-score, precision, recall and the confusion
matrix. The FinBERT model achieved a test accuracy of 91%, outperforming Naive Bayes (67%)
and BERT (88%). The Naive Bayes classifier distinguished between the classes only to some
extent: Figure 5.1 shows that 80% of the positive sentiments are correctly classified,
whereas a large number of sentiments in the 'neutral' and 'negative' classes are
misclassified.
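
The metrics named above can all be derived from a confusion matrix. The sketch below is a
minimal illustration in plain Python; the counts are made up for the example and are not the
thesis results, and the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


labels = ["negative", "neutral", "positive"]
# Made-up counts for illustration only; not the thesis results.
confusion = [
    [5, 3, 2],
    [2, 10, 8],
    [1, 3, 16],
]
accuracy, report = per_class_metrics(confusion, labels)
print(f"accuracy = {accuracy:.2f}")
for label in labels:
    print(label, {name: round(value, 2) for name, value in report[label].items()})
```

Recall for a class is the share of its true samples that were classified correctly, which is
the quantity read off the diagonal of a row-normalised confusion matrix. Overall accuracy can
hide poor recall on minority classes, which is why the per-class figures are reported too.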

One of the reasons why NB did not perform well is that it assumes all features are
independent, and its probability estimates are unreliable when this assumption does not
hold. NB is, however, simple to implement and computationally fast. BERT and FinBERT, on the
other hand, are complex models that achieved good accuracy: the confusion matrices in
Figures 5.2 and 5.3 show that they classified the classes correctly most of the time. A
likely explanation for the better performance of BERT and FinBERT is that they are
pre-trained on large amounts of data in a hierarchical learning process that extracts text
semantics and relations. Since BERT is bidirectional, it learns the meaning of a word from
its context. One drawback of BERT is that training takes a long time; running it on a CPU is
very slow, so a GPU was used.

Comparing the results with related studies on financial sentiment analysis, Guo et al. [33]
and Sohangir et al. [3] measured the classification metrics of different models and found
that deep learning models such as RNNs outperformed NB. Liu et al. [2] used BERT and FinBERT
for financial tasks and report performance metrics similar to those obtained in this thesis.
The performance metrics after fine-tuning the hyperparameters of the BERT and FinBERT models
are presented in Tables 5.4 and 5.5. The best hyperparameters were a maximum sequence length
of 64, 5 epochs, a learning rate of 2e-5 and a batch size of 8 for BERT, and a maximum
sequence length of 56, 6 epochs, a learning rate of 2e-5 and a batch size of 32 for FinBERT.

The correlation coefficient between sentiment and stock price is 0.66, which can be
interpreted as a good correlation. Correlation between two variables is symmetric and says
nothing about direction, so the Granger causality test was performed to check the direction
of the relationship between sentiment and price. The results of the first test are shown in
Table 5.10: the null hypothesis that sentiment does not Granger-cause price is rejected
because of the low p-values, in favour of the alternative hypothesis that sentiment
Granger-causes price. In the second test, shown in Table 5.11, the null hypothesis that
price does not Granger-cause sentiment cannot be rejected given the high p-values. Hence, it
is concluded that sentiment Granger-causes price and that price does not Granger-cause
sentiment.
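
The symmetry of the correlation coefficient, which is the reason a directional test is
needed, can be checked directly. The sketch below computes Pearson's coefficient in plain
Python on made-up numbers (not the thesis data); the function name `pearson` is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only.
sentiment = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
price = [10.0, 10.5, 10.2, 11.4, 11.0, 11.9]

r_sp = pearson(sentiment, price)
r_ps = pearson(price, sentiment)
print(f"corr(sentiment, price) = {r_sp:.4f}")
print(f"corr(price, sentiment) = {r_ps:.4f}")
```

Swapping the arguments gives the same value, so the coefficient alone cannot indicate
whether sentiment leads price or price leads sentiment. The Granger test breaks this
symmetry by asking whether past values of one series improve the prediction of the other.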

6.2 Future Work


The knowledge gathered from this thesis has helped to understand sentiment analysis. In the
future, textual analysis is expected to play an important role in handling unstructured data
in the financial domain, where much qualitative information comes from sources such as news
and social media. Textual analysis and machine learning techniques have become increasingly
popular in recent years owing to the growing need to process large volumes of firm-specific
news text.

We would like to extend this study by adding data from more companies, as only 30 companies
are considered at present, and then re-checking the prediction accuracy. Textual analysis
could also be applied to other languages such as Swedish and German, which are much more
structured than English, where a word can serve multiple grammatical purposes. It would also
be interesting to apply textual analysis to other financial markets such as bonds,
commodities and derivatives. The textual information should be explored thoroughly to avoid
faulty predictions of stock prices.

The FinBERT model, which is trained on financial-domain text, can be further pre-trained,
which would allow building a one-size-fits-all model for financial sentiment analysis and
many other tasks. However, modelling implicit semantic information is not straightforward
and remains a very challenging task. FinBERT can also be used for other natural language
processing tasks in the financial domain, such as question answering, named entity
recognition and next sentence prediction.

7 Conclusion

This chapter answers the research questions.

• What are the suitable models for sentiment prediction of financial text?

This thesis compared supervised machine learning algorithms for sentiment prediction: Naive
Bayes, BERT and FinBERT. The FinBERT model achieved a test accuracy of 91%, outperforming
Naive Bayes (67%) and BERT (88%). The Naive Bayes model often predicted the class positive
when the actual class was neutral, and the class neutral when the actual class was negative.
The NB model was trained on 80% of the data. Its training accuracy is 79%, well above its
test accuracy, so the model is likely overfitted. Possible reasons are the unequal
train-to-test ratio and the imbalance of the labelled classes; cross-validation could be
used to reduce over-fitting. The training accuracy of the BERT model is 92%, slightly higher
than its test accuracy. The training accuracy of the FinBERT model is slightly lower than
its test accuracy. Overall, the FinBERT model performed best.
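
Because the answer above mentions both class imbalance and cross-validation, a stratified
k-fold split is a natural way to combine the two: every fold keeps roughly the same class
proportions as the full dataset. The following is a minimal sketch in plain Python with
made-up labels; it is not the procedure used in the thesis, and the function names are
hypothetical. Libraries such as scikit-learn provide an equivalent `StratifiedKFold` class.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


# Made-up, imbalanced labels for illustration only.
labels = ["positive"] * 10 + ["neutral"] * 5 + ["negative"] * 5
splits = list(train_test_splits(labels, k=5))
for train, test in splits:
    print(len(train), len(test), sorted(labels[i] for i in test))
```

Averaging the test accuracy over the folds gives a less optimistic estimate than a single
train/test split, and a large gap between training and held-out accuracy across folds is a
sign of over-fitting.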

• When fine-tuning a classifier, what are the best hyperparameters?

Choosing appropriate hyperparameters such as learning rate, batch size and number of epochs
helps to improve performance. The best hyperparameters for BERT are a maximum sequence
length of 64, 5 epochs, a learning rate of 2e-5 and a batch size of 8. For the FinBERT
model, the best hyperparameters are a maximum sequence length of 56, 6 epochs, a learning
rate of 2e-5 and a batch size of 32. The results are presented in Tables 5.4 and 5.5.
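
For reuse, the reported settings can be recorded as a small configuration object. The sketch
below is illustrative only: the class name `FineTuneConfig`, its field names and the helper
`steps_per_epoch` are assumptions rather than part of the thesis code, and the training-set
size of 1000 is hypothetical. Only the four values per model come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    max_seq_length: int
    epochs: int
    learning_rate: float
    batch_size: int


# Values as reported in the text; the structure around them is an assumption.
BEST_CONFIGS = {
    "BERT": FineTuneConfig(max_seq_length=64, epochs=5, learning_rate=2e-5, batch_size=8),
    "FinBERT": FineTuneConfig(max_seq_length=56, epochs=6, learning_rate=2e-5, batch_size=32),
}


def steps_per_epoch(n_train, config):
    """Optimizer steps per epoch, counting a final partial batch."""
    return -(-n_train // config.batch_size)  # ceiling division


n_train = 1000  # hypothetical training-set size
for name, config in BEST_CONFIGS.items():
    total_steps = steps_per_epoch(n_train, config) * config.epochs
    print(name, config, "total steps:", total_steps)
```

The smaller batch size used for BERT means many more parameter updates per epoch than for
FinBERT on the same data, which is one reason the two settings are not directly
interchangeable between the models.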

• Does classifier domain adaptation on financial data improve classification performance?

Yes. The domain-adapted model performs better: FinBERT, a BERT model that is further
trained on financial data (explained in subsection FinBERT), reached an accuracy of 91%,
compared with 88% for BERT. When only a small training dataset was used, the test accuracy
of the models was low. They are data-hungry and perform considerably better as the size of
the training dataset increases.

• Is there a statistically significant relationship between the sentiment and price
indices? Does it help to make better predictions?

The correlation coefficient between sentiment and stock price is 0.66, which can be
interpreted as a good correlation. The Augmented Dickey-Fuller (ADF) test was used to check
whether both time series are stationary. The Granger causality test was used to determine
whether the sentiment time series helps to predict price. From the results, it is concluded
that sentiment Granger-causes price and that price does not Granger-cause sentiment. In
practical terms, positive news increases the chance that the stock price rises, while
negative news may push the stock price down.

Bibliography

[1] Casey Whitelaw, Navendu Garg, and Shlomo Argamon. “Using appraisal groups for
sentiment analysis”. In: Proceedings of the 14th ACM international conference on Informa-
tion and knowledge management. 2005, pp. 625–631.
[2] Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. “FinBERT: A Pre-
trained Financial Language Representation Model for Financial Text Mining.” In: IJCAI.
2020, pp. 4513–4519.
[3] Sahar Sohangir, Dingding Wang, Anna Pomeranets, and Taghi M Khoshgoftaar. “Big
Data: Deep Learning for financial sentiment analysis”. In: Journal of Big Data 5.1 (2018),
pp. 1–25.
[4] Basant Agarwal and Namita Mittal. “Machine learning approach for sentiment analy-
sis”. In: Prominent feature extraction for sentiment analysis. Springer, 2016, pp. 21–45.
[5] Oscar Araque, Ignacio Corcuera-Platas, J Fernando Sánchez-Rada, and Carlos A Igle-
sias. “Enhancing deep learning sentiment analysis with ensemble techniques in social
applications”. In: Expert Systems with Applications 77 (2017), pp. 236–246.
[6] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. “A survey of transfer learn-
ing”. In: Journal of Big data 3.1 (2016), pp. 1–40.
[7] Injy Sarhan and Marco Spruit. “Can we survive without labelled data in nlp, transfer
learning for open information extraction”. In: Applied Sciences 10.17 (2020), p. 5758.
[8] Jiang Su, Jelber Sayyad Shirab, and Stan Matwin. “Large scale text classification us-
ing semisupervised multinomial naive bayes”. In: International Conference on Machine
Learning (2011).
[9] Oludare Isaac Abiodun, Aman Jantan, Abiodun Esther Omolara, Kemi Victoria Dada,
Nachaat AbdElatif Mohamed, and Humaira Arshad. “State-of-the-art in artificial neu-
ral network applications: A survey”. In: Heliyon 4.11 (2018), e00938.
[10] Maher GM Abdolrasol, SM Hussain, Taha Selim Ustun, Mahidur R Sarker, Mahammad
A Hannan, Ramizi Mohamed, Jamal Abd Ali, Saad Mekhilef, and Abdalrhman Milad.
“Artificial Neural Networks Based Optimization Techniques: A Review”. In: Electronics
10.21 (2021), p. 2689.
[11] Agnes Lydia and F Sagayaraj Francis. “A Survey of Optimization Techniques for Deep
Learning Networks”. In: International Journal for Research in Engineering Application
Management (IJREAM) 05 (May 2019).

[12] Ajay Shrestha and Ausif Mahmood. “Review of deep learning algorithms and architec-
tures”. In: IEEE Access 7 (2019), pp. 53040–53065.
[13] Shanshan Yu, Jindian Su, and Da Luo. “Improving bert-based text classification with
auxiliary sentence and domain knowledge”. In: IEEE Access 7 (2019), pp. 176600–
176612.
[14] Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. “Aspect-based sentiment anal-
ysis using bert”. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics
(2019), pp. 187–196.
[15] Li Deng. “A tutorial survey of architectures, algorithms, and applications for deep
learning”. In: APSIPA Transactions on Signal and Information Processing 3 (2014).
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances
in neural information processing systems. 2017, pp. 5998–6008.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp
Hochreiter. “Gans trained by a two time-scale update rule converge to a local nash
equilibrium”. In: Advances in neural information processing systems 30 (2017).
[18] Allen Huang, Hui Wang, and Yi Yang. “FinBERT—A Deep Learning Approach to Ex-
tracting Textual Information”. In: Available at SSRN 3910214 (2020).
[19] Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. “Good
debt or bad debt: Detecting semantic orientations in economic texts”. In: Journal of the
Association for Information Science and Technology 65.4 (2014), pp. 782–796.
[20] R Werner, D Valev, and D Danov. “The Pearson’s correlation-a measure for the linear
relationships between time series”. In: Conference: Fundamental Space Research (2009).
[21] Rizwan Mushtaq. “Augmented dickey fuller test”. In: World Journal of Finance and In-
vestment Research (2011).
[22] W. Enders. Applied Econometric Time Series. Wiley Series in Probability and Statistics.
Wiley, 2014. ISBN: 9781118918616.
[23] Marco T Bastos, Dan Mercea, and Arthur Charpentier. “Tents, tweets, and events: The
interplay between ongoing protests and social media”. In: Journal of Communication 65.2
(2015), pp. 320–350.
[24] Paul-Francois Muzindutsi, Sanelisiwe Jamile, Nqubeko Zibani, and Adefemi A
Obalade. “The effects of political, economic and financial components of country risk
on housing prices in South Africa”. In: International Journal of Housing Markets and Anal-
ysis (2020).
[25] Clive WJ Granger. “Investigating causal relations by econometric models and cross-
spectral methods”. In: Econometrica: journal of the Econometric Society (1969), pp. 424–
438.
[26] Edward E Leamer. “Vector autoregressions for causal inference?” In: Carnegie-rochester
conference series on Public Policy. Vol. 22. North-Holland. 1985, pp. 255–304.
[27] Razan M AlZoman and Mohammed JF Alenazi. “A comparative study of traffic classi-
fication techniques for smart city networks”. In: Sensors 21.14 (2021), p. 4677.
[28] Bing Liu. “Sentiment analysis and opinion mining”. In: Synthesis lectures on human lan-
guage technologies 5.1 (2012), pp. 1–167.
[29] Justin Christopher Martineau and Tim Finin. “Delta tfidf: An improved feature space
for sentiment analysis”. In: Third international AAAI conference on weblogs and social me-
dia. 2009.

[30] Abinash Tripathy, Ankit Agrawal, and Santanu Kumar Rath. “Classification of senti-
ment reviews using n-gram machine learning approach”. In: Expert Systems with Appli-
cations 57 (2016), pp. 117–126.
[31] Mohamed Abdellatif and Ahmed Elgammal. “Text Classification Using Language
Modeling: Reproducing ULMFiT”. In: 12th International Conference on Language Re-
sources and Evaluation, LREC 2020 (2020), pp. 5579–5587.
[32] Lei Zhang, Shuai Wang, and Bing Liu. “Deep learning for sentiment analysis: A sur-
vey”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.4 (2018),
e1253.
[33] Li Guo, Feng Shi, and Jun Tu. “Textual analysis and machine leaning: Crack unstruc-
tured data in finance and accounting”. In: The Journal of Finance and Data Science 2.3
(2016), pp. 153–170.
[34] Srikumar Krishnamoorthy. “Sentiment analysis of financial news articles using perfor-
mance indicators”. In: Knowledge and Information Systems 56.2 (2018), pp. 373–394.
[35] MP Geetha and D Karthika Renuka. “Improving the performance of aspect based sen-
timent analysis using fine-tuned Bert Base Uncased model”. In: International Journal of
Intelligent Networks 2 (2021), pp. 64–69.
[36] Mark Lutz. Programming python. O’Reilly Media, Inc., 2001.
[37] Vineeth G Nair. Getting started with beautiful soup. Packt Publishing Ltd, 2014.
[38] S Chris Colbert et al. “The NumPy array: a structure for efficient numerical computa-
tion”. In: Computing in Science & Engineering. Citeseer. 2011.
[39] Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Con-
cepts, tools, and techniques to build intelligent systems. O’Reilly Media, 2019.
[40] Virgilio Gómez-Rubio. “ggplot2-elegant graphics for data analysis”. In: Journal of Sta-
tistical Software 77 (2017), pp. 1–3.
[41] Skipper Seabold and Josef Perktold. “Statsmodels: Econometric and statistical model-
ing with python”. In: Proceedings of the 9th Python in Science Conference. Vol. 57. 2010, p. 61.
[42] Ling-Chu Lee, Pin-Hua Lin, Yun-Wen Chuang, and Yi-Yang Lee. “Research output and
economic productivity: A Granger causality test”. In: Scientometrics 89.2 (2011), pp. 465–
478.
