Enular: Stock Prediction Using Social Media Data
Abstract
Stock prediction is a highly competitive field and a heavily researched topic.
However, many of the current products on the market only utilize empirical data. This project
aims to investigate whether the movement of stock prices can be predicted solely from
social media sentiment. The system design was split into four main components: data ingestion,
sentiment and feature breakdown, prediction model training, and building a user application.
Twitter posts about companies and their products would be analyzed and stored on a database,
before being deconstructed and used for training various models. Applying the principle of
Granger causality, a previous day's data would be associated with price movements the next day.
The product was able to achieve 57.31% accuracy in predicting the movement of stock prices
at closing time the next day, with an expected annual return on investment of 50.12%. The data
ingestion process performed well, reliably producing consistent information about any topic
over any period of time, indicating high precision in converting public sentiment into data.
That being said, the performance of the classifier model was the main
limiting factor and requires further investigation. Overall, this product may be viable in the real
world, but as a supplement meant to be used in combination with other methods of prediction.
Acknowledgements
I would like to express my sincere gratitude to the University of Glasgow for offering me the
opportunity to complete my studies here.
In addition, I offer special thanks to Dr. Iadh Ounis, who has been my project supervisor and
supported me throughout my journey.
I am also hugely grateful to the members of Peak Capital Limited, who took me in and provided
me with invaluable mentorship and experience over the summer.
Finally, I would like to thank Mr. Zachary Hagan for his unending emotional support over the
course of this project.
I hereby grant my permission for this project to be stored, distributed and shown to other
University of Glasgow students and staff for educational purposes. Please note that you are
under no obligation to sign this declaration, but doing so would help future students.
Contents
1 Introduction
1.1 Motivations
1.2 Aims
1.3 Questions
1.4 Hypothesis
1.5 Dissertation Layout
2 Background
2.1 Essential background theory
2.1.1 Efficient Market Hypothesis
2.1.2 Random Walk Theory
2.1.3 Sentiment analysis
2.2 Related research
2.2.1 Twitter mood predicts the stock market
2.2.2 Stock prediction using twitter sentiment analysis
2.2.3 The effects of twitter sentiment on stock price returns
2.2.4 What we have learned
2.3 Similar products
2.3.1 TINO IQ
2.3.2 Market Sensei
2.3.3 I Know First
2.3.4 What we have learned
2.4 Summary
3 Requirements capturing
3.1 User Scenarios
3.2 Functional Requirements
3.2.1 Must have
3.2.2 Should have
3.2.3 Could have
3.2.4 Won’t have
3.3 Non-functional requirement
4 Design
4.1 System design and architecture
4.1.1 Initial architecture proposal
4.1.2 Early prototyping architecture
4.1.3 Developmental architecture
4.1.4 Final architecture
4.1.5 Summary
4.2 Technology prototyping, selection and justification
4.2.1 Data retrieval
4.2.2 Twitter
4.2.3 Secondary methods of retrieval
4.2.4 Stock price retrieval
5 Implementation
5.1 Data ingestion
5.1.1 Tweepy implementation
5.1.2 Diagram of how related keywords operated
5.1.3 Market data retrieval
5.1.4 Code snippet showing how historical data was collected
5.1.5 Preliminary feature deconstruction
5.1.6 Data storage
5.2 Sentiment analysis
5.2.1 Natural language processing
5.2.2 Collected data
5.3 Prediction
5.3.1 Data retrieval and structuring
5.3.2 Granger causality
5.3.3 Feature scaling
5.3.4 Principal component analysis
5.3.5 Classifier training
5.3.6 Prediction data retrieval
5.4 User application
5.4.1 Model saving
5.4.2 User interface
5.4.3 Standalone application
5.4.4 Web application
5.5 Summary
6 Evaluation
6.1 Performance analysis and refinement
6.1.1 Classifier metrics
6.1.2 Classifier comparison
6.1.3 Classifier configurations
6.1.4 PCA component analysis
6.1.5 Data reduction and filtering
6.1.6 Feature adding
6.1.7 Summary
6.2 Final evaluations
6.2.1 Unit testing
6.2.2 Ablation study
6.2.3 Training with fewer days
6.2.4 Random data split testing
6.2.5 Prediction consistency
6.2.6 Comparison with empirical standards
6.2.7 Summary
7 Conclusion
7.1 Requirements review
Appendices
A Appendices
A.1 Understanding the source code submission
Bibliography
1 Introduction
In this dissertation, we will explore the journey taken to produce an application that would predict
the movement of stock prices using social media sentiment. We will discuss the challenges, how
they were solved, and the barriers which must be overcome for future development.
In this chapter, I will go through the motivations and goals for the project, my initial hypothesis,
as well as how this dissertation has been laid out.
1.1 Motivations
Stock prediction has been a huge trend for the past decade. Today, almost three quarters of trades
are done by algorithms and computers [1]. Quant funds are some of the most profitable funds in
banks and are present in almost all large institutions. There are many types of automated trading
in the industry, with different frequencies, and a range of methodologies [5].
The use of computers to assist stock trading is not new. Electronic trading platforms have been
used since the 1970s to place orders for financial products over a network through financial
intermediaries and service providers [2]. These were used as a replacement for traditional floor
trading, where brokers had to handle transactions between themselves manually, as electronic
trading could be carried out by users from any location. These platforms provided essential
information, such as live market prices, volumes, and company statistics. As they developed, the
platforms started including tools which helped brokers predict future prices, such as charting
packages, news feeds, and technical analysis. Eventually, they would also allow traders to set up
automatic trading based on parameters they defined, enabling trades at a higher frequency than
humanly possible.
These parameters were based on existing trading models. Strategies for automated trading have
been developed and used since 1949. The most commonly used is trend following, where trends
in moving averages and price level movements are simply followed. Other examples include
volume weighted average price and mean reversion, but we will not be going into details here, as
this is not a finance paper.
It was inevitable that, sooner or later, institutions would combine automated trading, high
frequency trading, and technical analysis as a strategy in their management of funds. As long
as whatever they used to predict the movement of prices performed slightly better than the 50%
expected of random chance, they would profit over large volumes of trades. This would become the basis of
automated trading systems (ATS).
An automated trading system is a program that automatically generates orders and submits them
to an exchange in the market, following sets of predefined rules representing trading strategies
within which orders are generated. With rules based on technical analysis, such as theoretical
buy and sell prices based on current market price, and with trades and tasks being carried out at
orders of magnitude greater than human equivalencies, this would soon become the norm for
day trading brokers.
The usage of an ATS is also advantageous for a number of other reasons. On top of increased
trade frequency and calculations, emotion is eliminated completely from the process, which is
important while the market is volatile [3]. The delay between a market action and the corresponding
order is minimized, as trades are conducted almost instantaneously once rules are met.
It is also relatively quick and simple to test and evaluate a strategy before deployment into the
live market. The consistency of an ATS may entice investors who look for low risk investments,
and furthermore, the ability to diversify a portfolio is made easy as ATS allows for simultaneous
trading on multiple accounts, decreasing risk even more. However, systems often require careful
monitoring, as failures could incur heavy costs for the fund. A system could perform very well
during back testing but poorly when deployed into the live market. In 2012, Knight Capital
Group lost four times its net income in just 30 minutes due to a bug in one of their trading
algorithms [4].
One of the most dangerous scenarios is market disruption and manipulation. In 2010, during an
event known as the ’2010 Flash Crash’, the Dow Jones Industrial Average (DJIA) plummeted 1,000
points then recovered within minutes. New regulations had to be issued to control automated
trading market access.
As the automation of trading processes was perfected and high frequency execution was no longer
an issue, focus shifted towards the algorithms and rules that drove the trades. Over the past
decade, algorithmic trading has been gaining traction both with retail and professional investors.
It is now widely used by investment banks, pension funds, mutual funds, and hedge funds. In
2014, Virtu Financial, a high frequency algorithmic trading firm, reported that during a five
year period the firm was profitable 1,277 out of 1,278 trading days.
Figure 1.1: A graph showing the increase in use of algorithmic trading over the past two decades
Most of the strategies in algorithmic trading involve looking at numerical statistics. Quant funds
tend to hire Physics and Mathematics researchers to work on these funds, and attempting to
compete with these institutions as a student who is barely crawling through computing science
would be foolish. There are already methods in place for parsing and analyzing lengthy financial
reports for companies and generating a verdict on whether there has been an improvement since the last
report. However, it is only recently that algorithms have begun to look at news stories, and this is an
idea that seems promising. There is evidence that large institutions are already using strategies
related to sentiment, as mentions of specific words in news can be directly correlated to stock
price movements. Social media sentiment, on the other hand, is still at a very hypothetical stage,
as discussion on this topic lies mostly in academic papers.
1.2 Aims
In this project, I hope to devise a method of predicting stock prices using social media data.
This will be produced in the form of either a web application or a standalone application. The
application will have a way of ingesting social media data, deconstructing it into its bare features,
then using machine learning techniques to attempt to predict stock prices. The social media data
will be stored on a database, and models will be pre-built. The user can simply input a company
stock code, and the social media data related to that company will start being ingested. The user
may decide how much social media data to use, and afterwards, the application will process that
data through a pre-trained classifier model to get a prediction for their stock. The model will be
evaluated and tested.
1.3 Questions
In addition to building the application, I would like to decipher what factors in particular affect
stock prices the most, as well as the optimal conditions for training a classifier model. Will a
product like this work in the real world or is it completely outclassed by traditional methods of
algorithmic trading?
1.4 Hypothesis
I believe that it is possible to achieve a better than random level of accuracy with predictions
using social media sentiment. As stated previously, and assuming that daily fluctuations in prices
remain consistent over long periods of time, using this product should, in theory, net a profit
over the long run.
2 Background
In this chapter, we will be discussing existing market theories, what other researchers have done,
what companies with similar products are doing, and how we can learn from all of this.
2.1 Essential background theory
2.1.1 Efficient Market Hypothesis
Early studies on stock prediction were based on the efficient market hypothesis (EMH) [6]. EMH is
an investment theory suggesting that share prices reflect all available information, such as news,
investor reports, financial statements, and so on [8]. From this theory it was assumed that it
would not be possible to outperform the market on a consistent basis, because neither fundamental
nor technical analysis can produce excess returns relative to the market. Stocks would never
be undervalued or overvalued, always reflecting their true prices. Fluctuations and volatility were
caused by new information being released to the public.
2.1.2 Random Walk Theory
EMH was built upon by Random Walk Theory (RWT). Due to the unpredictability of news,
RWT essentially suggests that stocks take a random and unpredictable path. Changes in stock
prices have the same distribution and are independent of each other. Therefore, future stock
prices cannot be predicted using past movements or trends, and market indexes could not be
outperformed without assuming additional risk [7]. These two theories remain controversial.
Firstly, there are studies indicating that stock prices may not follow a random walk, and can be
predicted to a certain extent using historical data [9]. We will not look further into this. Instead,
we attack the other pillar on which this theory stands: whether or not news is truly unpredictable.
Numerous studies have shown that social media sentiment could potentially be a very early
indicator of changes in the economic or commercial fields. Analysis of online activity has been
used to predict book sales [10], movie sales [11], product sales [12], and even disease infection
rates and consumer spending [13]. This raises the question: can we not predict the stock market
the same way? Public mood and sentiment may play an equally important role as news in terms
of influence on the stock market. To further understand this concept, we must examine some of
the previous academic studies conducted on this subject.
2.2 Related research
2.2.1 Twitter mood predicts the stock market
In this influential study, Bollen et al. measured public mood from large-scale Twitter feeds using
the OpinionFinder and GPOMS mood tracking tools. A Granger causality analysis and a Self Organizing
Fuzzy Neural Network were then used to investigate the hypothesis that public mood states, as measured
by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values.
A DJIA value would be linked to the sentiment performance of the previous n days. They then combined
this model with existing DJIA prediction models, which they did not alter.
Their results indicate that the accuracy of DJIA predictions can be significantly improved by the
inclusion of specific public mood dimensions, namely ’sure’ and ’happiness’, but not others. They
found an accuracy of 87.6% in predicting the daily up and down changes in the closing values
of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%. This was
a 14.3% improvement over the baseline DJIA prediction model that they had built upon.
2.2.2 Stock prediction using twitter sentiment analysis
Building on the above, Mittal and Goel proposed a new cross validation method for financial data
and obtained 75.56% accuracy using Self Organizing Fuzzy Neural Networks (SOFNN) on the Twitter
feeds and DJIA values from the period June 2009 to December 2009. They also implemented a naive
portfolio management strategy based on their predicted values.
2.2.3 The effects of twitter sentiment on stock price returns
by Gabriele Ranco, Darko Aleksovski, Guido Caldarelli, Miha Grčar, Igor Mozetič, 2015 [22]
In one of the more recent papers, Ranco et al. investigated the relations between the well-known
micro-blogging platform Twitter and financial markets. In particular, they consider, over a period
of 15 months, the Twitter volume and sentiment about the 30 stock companies that form the
Dow Jones Industrial Average (DJIA) index. They find a relatively low Pearson correlation and
Granger causality between the corresponding time series over the entire time period. However,
they find a significant dependence between the Twitter sentiment and abnormal returns during
the peaks of Twitter volume. This is valid not only for the expected Twitter volume peaks
(e.g., quarterly announcements), but also for peaks corresponding to less obvious events. They
formalize the procedure by adapting the well-known 'event study' from economics and
finance to the analysis of Twitter data. The procedure automatically identifies events as
Twitter volume peaks, computes the prevailing sentiment (positive or negative) expressed in
tweets at these peaks, and finally applies the 'event study' methodology to relate them
to stock returns. They show that the sentiment polarity of Twitter peaks implies the direction of
cumulative abnormal returns. The magnitude of cumulative abnormal returns is relatively low
(about 1-2%), but the dependence is statistically significant for several days after the events.
2.2.4 What we have learned
There are several points we can take away from these papers. The first two papers tell us
that public sentiment and mood are likely correlated with stock price movements, and that
there must therefore be an emphasis on sentiment analysis features in our product. The third
paper contradicts the first two in that it found low correlation between a previous day's
data and the next day's stock price movements. However, it did find significance in large
volume peaks on Twitter, which we may be able to consider as another feature.
2.3 Similar products
2.3.1 TINO IQ
TINO IQ is a tool for precise predictions in stocks showing signs of artificial manipulation. They
boast years of research analyzing various factors impacting the market. Their algorithms are
designed to detect human sentiment and machine trading patterns, converting them into trading
opportunities. However, their take on sentiment seems to be based entirely on pattern analysis,
instead of using text and media data.
Every day, thousands of stocks are scanned by TINO for patterns. These patterns are checked
against models trained with 20 to 30 years of data. The patterns are scored and ones that show
high probability and effectiveness are recommended to users. Only blue chip stocks are analyzed.
2.3.2 Market Sensei
Market Sensei predicts a stock’s most likely low, high, opening and closing prices daily. Users
can view predictions up to the next 7 days. They offer information such as the best buy in price
for a stock, expectations for returns, and a stock’s range and volatility. They also offer a stock
training game with an educational aspect, as well as several other supplementary features.
Accuracy rates for stock predictions are updated on a daily basis. Historical predictions can be
viewed and compared with actual outcomes. They do not offer any information on how their algorithms
work. They offer a very affordable subscription model, but the product is only available as a mobile
application or an API.
2.3.3 I Know First
I Know First is a fintech company that provides self-learning, AI based algorithmic forecasting
for capital markets to uncover the best investment opportunities. They provide daily investment
forecasts to users.
Their algorithm was developed by a team of researchers led by Dr. Lipa Roitman, who has over
20 years of research and experience in artificial intelligence and machine learning and a long
record in computer modeling of processes.
The algorithm generates daily market predictions for over 10,000 financial assets, including
stocks, commodities, ETFs, interest rates, currencies, and world indices, for short, medium
and long term time horizons.
The system outputs the predicted trend as a number, positive or negative, along with a wave
chart that predicts how the waves will overlap the trend. This helps the trader to decide which
direction to trade, at what point to enter the trade, and when to exit. The algorithm produces a
forecast with a signal and a predictability indicator.
Since the model is 100% empirical, the results are based only on factual data, thereby avoiding
any biases or emotions that may accompany human derived assumptions.
Their pricing ranges from $170 USD a month to $439 USD a month, depending on the number of
stock picks the user wants per day.
2.3.4 What we have learned
There are many products available that offer stock price prediction, but few that make
use of social media sentiment. It may seem apparent that empirical data offers more accurate
results than sentiment, but this should nonetheless be investigated.
2.4 Summary
In both the research papers and competitor products, empirical data was used in some way. It
must therefore be investigated to answer the question of whether the analysis of social media
data alone could be viable in the market today.
3 Requirements capturing
In this chapter, we will go over the requirements for the project. The requirements were first
created before work on the product began, and have been added to as more research came to light
and through meetings with the client (supervisor). They have been updated constantly as practical
limitations presented themselves along the way. These revised requirements attempt
to demonstrate a product that is unique in its own way, having aspects that have not been done
before, while answering questions about the viability of social media data as a stock price indicator.
3.2 Functional Requirements
3.2.1 Must have
M1 User interface - the product must have a user interface so that the user can easily do what
they want to do without navigating through code. GUI libraries such as tkinter can be
used.
M2 Stock choice input - the user must be able to input the stock that they wish to have predicted,
and this should include any company that is publicly traded.
M3 Twitter data ingestion - the system must have a method of gathering large amounts of
social media data in an efficient way. The data must include tweet content, tweet date,
tweet type (retweets), information about the author (such as number of friends, followers),
how well the tweet was received, such as number of likes. Tools for this include the official
Twitter API, Twitter4j, and tweepy.
M4 Topic filtering - must be able to retrieve data specifically for a particular topic, while
filtering out unrelated tweets.
M5 Sentiment analysis - using the contents of a tweet, the application must be able to generate
a numerical sentiment score for each tweet. This can be done with a number of libraries
such as OpinionFinder.
M6 Stock code dictionary - given a company, the stock code should be linked to the company
name so that even if the user does not have the full name of the company, which might
be long and include company types (ltd), other components of the product will still be
functional.
M7 Financial data retrieval - with the stock code, the system should be able to retrieve that
stock's price information, specifically the change in price during that day. On top of that,
we should be able to get the price change for any day in the past, within reason.
M8 Feature breakdown - with the available Twitter data mentioned in M3, the system must
disassemble it into a number of individual features for classification.
M9 Database storage and retrieval - data from the social media ingestion must be stored and
retrieved efficiently in order to be used by the classifier to build a model. The database will
most likely be a noSQL database.
M10 Feature scaling - as we deconstruct the features, more likely than not they will not be on
the same scale. We must have a way to scale them so that it is fair when we train models.
M11 Classification - using collected data, we must train a classifier model such as Decision Tree
or Support Vector Machine that will be used to make stock predictions.
M12 Immediate prediction data retrieval - as the user inputs a company or stock of their choice,
the system must immediately and automatically retrieve social media data for that exact
company.
M13 Stock prediction - as the main functionality of the project, the application must be able to
offer the user a prediction for their selected stock.
M14 Evaluation - the product will be evaluated and tested so that improvements can be made.
3.2.2 Should have
S3 Standalone application - although a user interface will exist, we do not want the users to be
compiling and executing code every time they wish to run the product. On top of that,
there are many packages that will be required, meaning that the user must install them
on their own device. A standalone application will eliminate this problem as it can be run
without any dependencies.
S4 Web application - to make access even easier, a web application should be built so that
users will not even have to download the product.
S5 Empirical data comparison - to accurately assess the viability of stock prediction using
sentiment, we should compare it to a basic technical analysis baseline.
S6 Data clean up - tweets are messy by nature, causing problems such as inaccurate features
due to clutter. We should have a method of cleaning up tweets so that they only contain
information that we desire.
S7 Model saving - we do not want to train the classifier every time we wish to make a
prediction. This would also mean dependence on a database, which we would rather not allow
everyone access to. Trained models should be able to be saved and easily retrievable.
S8 Scaler saving - scaling features is an essential part of the project, but we want to have the
same scaling even as the user processes the data that they retrieved. Again, we do not wish
to reuse the training data, so saving the scaler model will be optimal.
S9 Profit testing - we should have a method of obtaining the expected returns of a stock in
addition to the accuracy. This way we can ensure that it will not net our users a loss, even
if the evaluation metrics of the classifier are high.
3.2.3 Could have
We discuss functionality that is desirable and would enhance user experience, but may be difficult
to implement or time consuming.
C1 Long term prediction - although the product is meant for short term prediction, we
could possibly adjust the classifier to output an aggregated prediction over a longer period,
although the accuracy of such prediction may be questionable.
C2 Exact price prediction - on top of up down prediction, we could create a regression model
that predicts the exact closing price of a stock. We have the data, but again, the reliability
of such product is in question.
C3 Text classification - we may implement our own text classification method of sentiment
analysis, so that we can favor certain finance related words over others. The problem with
this task is the sheer workload it involves, as it is a project in itself.
C4 Mobile application - a downloadable mobile app could be created for even greater outreach
and accessibility. Again, this will take a lot of work.
C5 Financial model comparison - on top of using empirical data for comparison, we could
implement some basic financial modeling scripts and compare those with our classifier.
These are highly accurate and well thought out equations used in professional
settings, and will probably blow my project out of the water.
C6 Tweet ranking - using the University of Glasgow’s own Terrier application, we could rank
the tweets and weigh certain samples more than others.
C7 Mood analysis - to even further develop our sentiment analysis, we could look into Google’s
famous GPOMS.
C8 Twitter volume analysis - as data is collected upon request of the user, the volume of
ingested tweets could be used as an additional feature.
3.2.4 Won’t have
For each of these features, we will discuss the reasons that they will not be included.
W1 Empirical data predictor - although the product will be tested against a regression model
that I wrote, this model will not be used in combination with the sentiment model to
improve accuracy, as the point of this project is to solely focus on sentiment.
W2 Financial models - Discounted Cash Flow and Three Statement Model are commonly used
models for calculating the value of a company, however, we will not be using these in our
project either.
W3 Macro sentiment analysis - linking macro-economic news such as political policies, country
GDPs, and trade wars, with changes in price for individual stocks. This was something
that I really wanted to include in my project, but I soon found out it was completely out of
my scope.
W4 Statistical data viewing - to keep the product flexible, automated retrieval of financial
statistics for the user will not be included. There are many stock exchanges, such as the
NASDAQ, the LSE, the NYSE, and countless more all around the world. Retrieval of
financial statistics would mean limiting the user to a certain group of exchanges. However,
we want the user to be able to get a prediction for any stock on any exchange in any part of
the world.
W5 Author and topic details - the product will retrieve social media data for whatever topic is
requested; this is settled. However, it will not provide the user with details about specifically
what is being said, nor who is saying it, simply outputting empirical information.
Privacy issues must be considered, as we do not wish for this application to be unethical or
controversial in any way.
4 Design
The design of the project has been split into two sections, one for the system design and architec-
ture and one for component prototyping and selection.
4.1 System design and architecture
4.1.1 Initial architecture proposal
The project was heavily researched over the summer leading up to fourth year. I was ambitious,
aiming to create the most accurate stock prediction tool. Needless to say, I was way out of my depth.
However, the research and ideas would come in handy as the project developed. The initial
architecture is shown below.
The initial idea was to get the prediction from a number of classifier components and combine
them in a central predictor. This predictor would determine the weights of each prediction,
and generate a final prediction which was to have the best accuracy. Components included
a tool which would automatically generate related words for a topic, a sentiment analyzer,
a macroeconomic sentiment analyzer, prediction using popular financial models, and pattern
recognition using neural networks. Some of the ideas we kept, but most of these either required
too much work, were pointless, or were completely out of scope. We will discuss some of the things
we cut out and why.
Related word generation would have been a mess to implement for several reasons. Firstly, finding
related words for companies and stock names is completely different from finding a related word in
a dictionary. Synonyms are established, but finding words related to 'Tencent' is another matter
altogether, almost requiring us to implement a search engine. The second reason is that even
if we had a tool for gathering related words on the spot, it would be extremely unreliable. For
example, http://relatedwords.org lists the top related word for 'Microsoft' as 'Nokia'. This would
have been a nightmare for stock sentiment ingestion, and for these reasons, this component was
replaced with a more sensible method.
At the start of the project, I imagined financial models being one of the foundational components.
They are used by professional analysts, researchers, and fund managers. On top of that, they
would be relatively simple to implement, compared to the rest of the project. It is for this very
reason that they were not included. Financial modeling is common and highly accessible. Most
trading software has it included by default. I had initially hoped that combining information
from models with predictions from my own product would enhance its accuracy, but results
from these models do not change day to day. They use statistics from a company's financial reports,
which remain consistent between reports, thereby only causing needless interference with the product.
The same went for regression models, which mainly extrapolated the past prices of individual stocks,
and pattern recognition using neural networks. However, these will come into play later, and
will be explained in more detail as we describe their implementation during the evaluation stages.
Macroeconomic sentiment analysis was likely going to be the most powerful component. Politics
and national news affect stock prices as much, if not more than company level news. Look at the
2008 financial crisis for example. However, despite how powerful this component would have
been, it was just not meant to be. There are again several problems with it. Firstly, we would
require a completely new method of ingesting data, crawling through various news sources, as
well as identifying the topic, the tone of the article, what stock it relates to, which countries and
industries are affected, and so on. We would essentially need to build a robot that could read
and understand news at a human level. On top of that, we would need advanced knowledge of
economics to even know what to do with the data, a subject which experts have studied for years.
We would not simply be able to apply machine learning techniques to this due to the uniqueness
of each case. Regretfully, but to no surprise, this was not going to be part of the project.
4.1.2 Early prototyping architecture
The first task at hand was to test out the three core components of the project: social media
ingestion, sentiment analysis, and classification, shown in 4.2. The other components used at this
stage would be temporary and quickly replaced down the line.
I implemented a Twitter search function that would gather a certain amount of tweets on a certain
topic. A script that processed the content of these tweets was written, and the output of these
results would be temporarily stored in text files, with different directories for different companies
and days. The changes in stock prices were manually input by me, and another script would then
retrieve the collection of data and run it through a classifier. Some arrays of arbitrary features
were hard coded to test whether the model was generating predictions.
Many different techniques were tried, especially for the ingestion component (details in the next
chapter). Once these three core components were working, we were ready to replace the rest of
the parts.
4.1.3 Developmental architecture
The next stage was to implement a proper data storage tool, as well as continue improving the
core components. I began by improving the ingestion component. Due to the limitations of a
search-based method of getting tweets, we began using a stream which live streamed data about
10 companies at a time while filtering out undesirable content. The data would be unpacked,
converted from JSON into a readable format, and then broken down into a barrage of features.
Then a noSQL database was set up to store the features for each company on each day, with an
automated method of retrieving the up and down movement of stocks on a particular day, given
the stock market was open. Social media data would be collected in the hours leading up to a
market open and paired with the change in stock price for that company the previous day, a
method designed around the principle of Granger causality.
With the training and validation data now in the database, we created another script to retrieve
and structure the data. We scaled the features due to the vast disparity in scale between them and ran
them through a classifier to create a trained model. A range of classifiers were used, and the
training data was split so that we had a portion for testing the evaluation metrics of different models.
The same process was carried out to obtain the data of the company we wanted to predict. It
would be processed the same way as the training data, and then run through one of the
models to obtain a prediction.
4.1.4 Final architecture
At this stage, it was a matter of improving usability, fine tuning the predictor for accuracy, and
setting up the evaluation techniques.
For the usability aspect, I wanted the whole product to make a prediction with only one user
action. To do this, I started working on the module that would turn out to be the application
itself. From the training stages of the project, all we needed were the classifier models, and
a way to save the scaling and feature reduction settings. Once we had those, we could set up
another module which retrieved and streamed only the social media data that we would need to
make a prediction, loaded the trained models, and processed the data directly, without the need for
external storage, to generate a prediction. A method of connecting companies to related words
was created to obtain a wider range of data for any topic. Finally, a graphical user interface
would then be implemented so that the product would become fully usable by anyone.
The prediction accuracy at this stage was being constantly tested and improved, from all angles.
Classifiers would be compared, classifier configurations would be tweaked, and features would be
adjusted, reduced, added and removed for the purpose of obtaining a slightly higher accuracy.
Testing and comparison techniques will be described in the evaluation chapter.
Finally, methods of evaluation were carried out that judged classifier performance. The predictions
from our product would be compared to predictions from other methods. The viability and real
life practicality of the product would also be considered.
4.1.5 Summary
In this section we have explored the gist of the inner workings of this project. The system
architecture essentially revolves around four main components.
1. Data ingestion
2. Sentiment and feature analysis
3. Prediction
4. User application
The outcome is a product that will retrieve social media data about a stock on demand, break the data
down into features, and make a prediction. Although reliant on Twitter's API for data retrieval,
it has no other dependencies and should be easy to maintain and improve.
4.2 Technology prototyping, selection and justification
4.2.1 Data retrieval
The first major component of the project is a method of ingesting social media data and retrieving
stock price information. Twitter had been determined to be the primary source of the social media
data since the start of the project. Several solutions were explored for this task.
4.2.2 Twitter
The official Twitter API was the only legitimate method of retrieving Twitter data, so it had
to be involved no matter the circumstances. This meant that registering as a developer
and authenticating was required, which was simple to achieve and quickly set up. The only
disadvantage was that I would be required to keep my authentication keys in the final
products and would need a way to protect them. There are a number of libraries which facilitate
the use of the Twitter API. The two that were considered were Twitter4J and Tweepy, which
the project would then be based around. Initially Twitter4J was the apparent choice, however a
short time later it was replaced with Tweepy due to Python being better suited for the machine
learning stages later on in the project. Both had similar functionality, so transitioning was not
difficult.
4.2.3 Secondary methods of retrieval
Other methods of retrieving social media data were also considered. The StockTwits and Reddit APIs
were planned for implementation at a later stage, but this never happened due to time
constraints.
4.2.4 Stock price retrieval
Again, several methods were considered for this task. Many of the APIs which retrieved live
stock quotes were subscription based with free trials and would soon require payment. This was
not an issue, since stock price retrieval would not be necessary for the standalone product, just
for the training process. That being said, I would still need access to price data for at least 6 months,
which narrowed down the options. Eventually, I stumbled upon a humble Python module named
iexfinance. Despite doubts about its dependability, I began using it as a temporary solution,
as it met all the functional requirements I needed. I figured I would continue with it until I was
forced to switch to something more reliable. It has, luckily, lasted the entire training process
without need of replacement.
At the start of my project, I was against the use of a database due to the badly designed initial system
architecture. I was convinced that the data I retrieved should be stored as a CSV dataset, so it
could be more easily used to train a classifier. This would also mean that it could be included
with the final packaged product, as opposed to having to set up hosting if I were to proceed with
a database. However, as discussed in the previous chapter, plans for the design of the product
quickly changed and hosting was no longer part of the specification. On the recommendation
of my supervisor, I set up a noSQL database using MongoDB, which conveniently had a Python
API, PyMongo.
Sentiment analysis was the second major component of my project, and it proved to be a challenging
process with a rather anticlimactic solution. This component was interestingly one of the first
and last concepts explored during my project, with two very different goals each time. At the
beginning, the task was to analyze the sentiment of the contents of a particular tweet, then
assign it one or several numerical scores. Near the end of the project, after a large part of the
evaluation had been done, feature adding was explored and we came back to sentiment analysis.
This time around, I focused a lot more on other natural language processing methods to extract
more features from our archived tweets. We will go through the tools chronologically, thereby
starting with finding a method of detecting the polarity of a tweet.
With much ambition, I had hoped to implement my own text classifier that would be tailored to
keywords which related to stocks, finance, and economics. I began with using a Twitter dataset
from Kaggle [14] to train a classifier which would take a string of text and output a polarity score
indicating whether the contents of a tweet held a positive or negative sentiment.
Only sentiment and tweet content were kept from the Kaggle dataset. The usernames, hashtags,
links, and stop words were then filtered out of each tweet. Individual words from each tweet were
split off to create a word list, which was then mapped onto a frequency distribution dictionary with
words as keys. The keys were then extracted to get a list of unique words.
To get the features, a method was implemented that took a document as an argument and
checked each word in the document against the list of unique words, returning a dictionary with
the list of unique words and a boolean indicating whether each word was in the document. This
was done to each element in the dataset to create a lazy-mapped training set containing the features
for each element along with the sentiment. This would then be used to train an NLTK Naïve
Bayes classifier to create a model which would predict whether a string of text had positive or
negative sentiment. After testing, an accuracy score of 0.8854 was found.
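To make the pipeline concrete, the sketch below reproduces the idea under stated assumptions: the two-element dataset stands in for the Kaggle data, and the cleaning rules are simplified to the ones described above.

```python
import re
import nltk  # requires nltk.download("stopwords") on first run

# Stand-in for the Kaggle dataset: (tweet text, sentiment label) pairs.
dataset = [("I love this stock! https://t.co/x", "pos"),
           ("Terrible earnings report, selling everything @broker", "neg")]

def clean(tweet):
    """Strip usernames, hashtags, links and stop words, then split into words."""
    tweet = re.sub(r"(@\w+|#\w+|https?://\S+)", "", tweet.lower())
    stops = set(nltk.corpus.stopwords.words("english"))
    return [w for w in tweet.split() if w.isalpha() and w not in stops]

labelled = [(clean(text), label) for text, label in dataset]

# Frequency distribution over every word, then the unique-word vocabulary.
freq = nltk.FreqDist(w for words, _ in labelled for w in words)
unique_words = list(freq.keys())

def features(document):
    """Map a document onto booleans: is each unique word present?"""
    present = set(document)
    return {word: (word in present) for word in unique_words}

# apply_features builds the training set lazily, one element at a time.
train_set = nltk.classify.apply_features(features, labelled)
model = nltk.NaiveBayesClassifier.train(train_set)
print(model.classify(features(clean("love this"))))  # -> 'pos'
```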
However, I soon found out there were several problems with using this method practically. First
of all, the training set used was of questionable quality, with tweets obtained over two days in
2015 and a very strong emphasis on the presidential election. It also only gave a prediction of
positive, negative, or neutral, as opposed to the continuous polarity score I was after. On top of that,
filtering or emphasizing words specifically related to stocks would be difficult to implement,
and time constraints had to be considered. Although there were solutions to these issues, this
method of sentiment analysis was disregarded, as several existing tools
would outdo the one I tried to produce myself.
There were a range of libraries which offered sentiment analysis for Python. At first, the Azure
Text Analytics API and Google Cloud Platform's Natural Language API were considered due to
their prominence. However, as sentiment analysis was a required part of the packaged product, I
wanted to minimize the use of tools which required authentication keys. Therefore, TextBlob
and VADER (via NLTK) were chosen instead. TextBlob was a module for natural language processing
which offered scores for sentiment polarity and subjectivity. VADER sentiment analysis was a
lexicon and rule based tool built specifically for the task at hand. It took negations, punctuation,
capitalization, slang, emoticons and acronyms into consideration, generating scores for positivity,
negativity and neutrality, along with a compound score which aggregated the three measurements.
Finally, SpaCy would be used after the evaluation stages to further add features for classifier
training.
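As an illustration, both tools reduce to one or two calls; the example tweet is made up, and the VADER import assumes the NLTK distribution of the analyzer:

```python
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # nltk.download("vader_lexicon") on first run

tweet = "$AAPL is looking GREAT today!!! :)"

blob = TextBlob(tweet)
print(blob.sentiment.polarity)      # -1.0 (negative) .. 1.0 (positive)
print(blob.sentiment.subjectivity)  # 0.0 (objective) .. 1.0 (subjective)

vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores(tweet))  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```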
4.2.9 Classification
The third major component of our project was building a predictor. The choice for this task was
fairly evident from the beginning: a range of tools from Sklearn would meet our requirements
for both training and evaluation.
The final task for the project was to package the code into a usable format with few dependencies.
Tkinter was used to create a user interface for the standalone application, and PyInstaller to
package the code into an executable for multiple operating systems. For the web application,
the PLAY and Django frameworks were explored, but in the end Flask was used due to the
lightweight nature of the task. Development of a mobile app using Android Studio was initiated,
but forgone due to time constraints.
4.2.11 Summary
A range of tools were explored, shortlisted, and selected during the planning stages of each
component. Now that we have everything we need, we are ready to discuss how the product
was to be built.
5 Implementation
The following two chapters will go through the implementation of the product in its entirety.
In this project, I went through two coding stages and two evaluation stages. Approximately
80% of the tangible content and code was completed in the first stage, which fully created a
product that met the requirements, after which evaluation was carried out. At that stage, I discovered
a number of additional features I could implement that could potentially further
elevate the quality of the project. Therefore, in this section, we will discuss the implementation
of the project, followed by initial analysis and improvements, followed by the final evaluations.
As with the previous chapter, we will split this section into the four main components, going
through their respective sub-components or related modules in each section.
5.1 Data ingestion
5.1.1 Tweepy implementation
In the previous chapter we discussed the usage of Tweepy as the main method of data retrieval.
Because it uses the official Twitter API, authentication was necessary but simple to achieve.
Tweepy offered both search and streaming capabilities. As the scale of the project grew, search
became less and less efficient, and data was thereafter retrieved solely via streaming.
In order to stream, a listener class was implemented which inherited properties and methods from
Tweepy's StreamListener class. A stream was then created using the Twitter API authentication
keys and the listener class. Streaming itself was then initiated by calling the stream's filter method.
The result was a JSON object every time a tweet was retrieved. This was converted into a
dictionary format using Python’s JSON module, which simplified the unpacking process.
Almost none of the tweets would have consistent keys once converted to dictionaries. Some
might have no location, no posting date, no author name, and so on. Many had vital data
missing which could not be replaced with substitute variables. Try and except clauses were used
to suppress exceptions during streaming, and a check was put in place to ensure that all necessary
data was present for each element before posting to the database. This ensured that features
would be consistent for each object.
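A minimal sketch of this listener, assuming Tweepy's pre-v4 API (the credential strings, required-key list, and track keywords are placeholders, not the project's actual values):

```python
import json
import tweepy

class TweetListener(tweepy.StreamListener):
    """Receives raw tweet JSON and keeps only complete records."""

    REQUIRED_KEYS = ("text", "created_at", "user")  # illustrative vital fields

    def on_data(self, raw_data):
        try:
            tweet = json.loads(raw_data)       # JSON -> dict for easy unpacking
            if all(key in tweet for key in self.REQUIRED_KEYS):
                pass                           # feature extraction and database post go here
        except Exception:
            pass                               # suppress errors so the stream keeps running
        return True                            # True keeps the stream alive

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=["Microsoft", "MSFT"])     # blocking call filtered by keyword
```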
The dates for the tweets had to be carefully monitored. Although it may seem that using the
datetime module or the tweet posting date would be the obvious choice for keeping track of
object dates, a severe issue would later present itself. Tweets collected during the period
before the stock market opened had to be dated the day prior, and tweets collected on
Monday mornings had to be dated the previous Friday. This meant that date input was
manual for every day the stream was run. The reason why will be explained further down in this
chapter, because it also involves the classification process.
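For illustration, the previous-working-day rule could be expressed as follows (a sketch that ignores market holidays, which is presumably part of why dates were entered manually):

```python
from datetime import date, timedelta

def label_date(collection_day):
    """Date a pre-market batch of tweets with the previous working day,
    so Monday-morning data maps back to Friday."""
    prior = collection_day - timedelta(days=1)
    while prior.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        prior -= timedelta(days=1)
    return prior

print(label_date(date(2019, 2, 25)))  # a Monday -> 2019-02-22 (the Friday before)
```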
The first order of business was to set up the querying mechanics for streaming. For each of
the 10 companies I focused on, there would be a company name, a stock code, and two related
words for that company. Each of these four words for all 10 companies would be used as filters
for the Twitter data stream. As elements obtained from the data stream did not indicate which
filter they came from, it was difficult to assign elements to companies. As an alternative to running
40 streams simultaneously with multiprocessing, a function was used to identify which keyword
was detected and which company that keyword belonged to, and the data would subsequently be
stored under that company in the database. The stock code and related words would be stored as
dictionary values with the overseeing company name as the key.
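A sketch of that lookup, with invented companies and related words standing in for the project's actual ten:

```python
# Company name -> stock code plus related words (values are illustrative).
COMPANIES = {
    "Microsoft": {"code": "MSFT", "related": ["Windows", "Xbox"]},
    "Apple":     {"code": "AAPL", "related": ["iPhone", "MacBook"]},
}

def match_company(tweet_text):
    """Return the company a tweet belongs to by scanning for any of its keywords."""
    text = tweet_text.lower()
    for company, info in COMPANIES.items():
        keywords = [company, info["code"]] + info["related"]
        if any(kw.lower() in text for kw in keywords):
            return company
    return None  # matched the stream filter in a field we do not inspect

print(match_company("New Xbox titles announced today"))  # -> 'Microsoft'
```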
5.1.3 Market data retrieval
Using the iexfinance module, the change in stock price would be obtained. However, two
methods were required for this. The first simply offered the most recent price change
of a stock, which was a built in function by default. The second would return the change in
the stock price on any given day. As mentioned above, tweets had to be dated the working
day before they were collected, and the same applied for stock prices. The historical data
retrieval function did not offer a value for the percentage change of the stock on a given day, only
the open and closing prices, from which the change could be derived with a brief formula.
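A sketch of the historical case; iexfinance's interface changed across versions, so the import path and call form below are assumptions, while the percentage formula itself is just (close - open) / open:

```python
from datetime import date
from iexfinance.stocks import get_historical_data  # import path varies by iexfinance version

def daily_percent_change(symbol, day):
    """Derive the day's % change from the open and close prices."""
    bars = get_historical_data(symbol, start=day, end=day, output_format="pandas")
    open_price, close_price = bars["open"].iloc[0], bars["close"].iloc[0]
    return (close_price - open_price) / open_price * 100

print(daily_percent_change("MSFT", date(2019, 2, 22)))
```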
5.1.5 Preliminary feature deconstruction
Approximately 90% of collected tweets were filtered out in the first stage. Tweet contents were
checked to ensure they were not spam, did not consist only of links, and were not gibberish, and
that each tweet's content contained at least one of the company's keywords, which was surprisingly
not guaranteed by default.
Using a mixture of available data and analysis of tweet content, each retrieved object was
deconstructed into an initial 15 features, shown in 5.4, before even being stored in the database. A
number of these features were already included in the JSON object by default, while others
required analysis, mainly of the tweet content.
Additionally, the features were split into four more or less even groups, as shown in 5.5. Each of
these groups represented an aspect of the tweet. Six more features were added at a later stage
during evaluation; please refer to 6.1.6.
5.1.6 Data storage
MongoDB was used as the sole method of data storage, accessed via the PyMongo module. Each viable
object was added to the database with a single post. The 15 features above were all empirical and
would be posted to the database along with the company name, stock code, stock price change,
date, and tweet content. This meant that further addition of features could be done at any stage.
Backups of the database were made frequently.
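A sketch of one such post; the connection string, database and collection names, and all field values are placeholders:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
tweets = client["enular"]["tweets"]                # names are illustrative

document = {
    "company": "Microsoft", "stock_code": "MSFT",
    "date": "2019-02-22", "price_change": -0.31,   # placeholder values
    "content": "cleaned tweet text",
    "followers": 1024, "likes": 3,                 # two of the 15 empirical features
}
tweets.insert_one(document)                        # one post per viable tweet
```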
5.2 Sentiment analysis
5.2.1 Natural language processing
SpaCy would be used to break a tweet down further. It was able to identify people, organizations, dates,
nouns, verbs, locations and so on. I would use these as additional features after the initial evaluation
had been carried out, to examine how language processing would improve performance.
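A sketch of how entity and part-of-speech counts can be turned into numeric features with spaCy (the sentence and chosen counts are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # install via: python -m spacy download en_core_web_sm
doc = nlp("Apple is reportedly meeting Samsung in Seoul next Tuesday.")

for ent in doc.ents:
    print(ent.text, ent.label_)     # e.g. Apple ORG, Seoul GPE, next Tuesday DATE

# Counts of entity or part-of-speech types can then serve as extra features.
num_orgs = sum(ent.label_ == "ORG" for ent in doc.ents)
num_verbs = sum(tok.pos_ == "VERB" for tok in doc)
print(num_orgs, num_verbs)
```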
5.2.2 Collected data
This gave us an aggregate of approximately 40 good quality (non-spam, non-gibberish) tweets per
sample, with 400 samples over 40 consecutive market days.
5.3 Prediction
Now that we had the data we needed, we needed to build a model which would give us a
prediction as to whether the stock would go up or down the next day.
5.3.1 Data retrieval and structuring
Extensive trialing and prototyping were conducted on the database, resulting in certain days
which could not be used. The posting structure gradually changed as more features and values were
added. For this reason, posts before a certain date were not to be used. Lists of the 10 companies
and viable dates were created manually. Simple loops were then implemented that would parse
every single viable post.
Figure 5.7: Code snippet showing how posts were iterated
Features in each post were appended to initial arrays for each feature. These initial arrays were
then aggregated to get an array of features corresponding to a single company on a single day.
This array was then further appended to the main training data set, which would be a 2D array
with an array of size 14 for each element. A similar process was carried out to get the validation
data set with price changes.
5.3.2 Granger causality
The data we had associated each sample with the price change of the previous day, since that was
the only information known at collection time. However, we want to see whether this data correlates
with price movements during the market hours later that day. Granger causality is a statistical concept
of causality that is based on prediction [16]. If a certain signal X causes signal Y, then past values of X
should contain information that helps predict Y. This is the principle we will use.
To implement this principle, we simply shifted the training arrays one day forward and the
validation data one day back, by deleting the last and first days' worth of values from the respective
arrays.
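In array terms the shift is one line each way; the toy values below just illustrate the alignment:

```python
import numpy as np

# Toy stand-ins: one row of aggregated tweet features per day, and the
# stock's up (1) / down (0) movement recorded for the same day.
daily_features = np.array([[0.2, 35.0], [0.5, 40.0], [0.1, 28.0]])
daily_moves = np.array([1, 0, 1])

X = daily_features[:-1]  # drop the final day's features...
y = daily_moves[1:]      # ...and the first day's movement,
# so X[i] is now paired with the *following* day's movement y[i].
print(X.shape, y.shape)  # (2, 2) (2,)
```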
5.3.3 Feature scaling
The disparity in scale between individual features was several orders of magnitude. For this
reason, it was a necessity to use a scaler to standardize features to unit variance, using the formula
z = (x - u) / s, where z is the scaled feature, x is the sample, u is the mean of the training samples,
and s is the standard deviation of the training samples.
5.3.4 Principal component analysis
Due to the large number of features, their wildly different scales, and the noisy nature of the data,
principal component analysis was used to reduce the complexity of the training data. Sklearn's PCA
module was used for this.
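Both steps are one-liners in Sklearn; this sketch uses random stand-in data and an illustrative component count (the actual count is examined in 6.1.4):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(400, 14) * 1000                       # stand-in: 400 samples, 14 features
X_scaled = StandardScaler().fit_transform(X)             # z = (x - u) / s per feature
X_reduced = PCA(n_components=8).fit_transform(X_scaled)  # 8 components is illustrative
print(X_reduced.shape)                                   # (400, 8)
```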
5.3.5 Classifier training
From Sklearn’s library, 8 classifiers, including a dummy, were implemented, trained, tuned, and
evaluated. This will be heavily explored in the next chapter as we discuss evaluation techniques.
5.3.6 Prediction data retrieval
Data for companies I wanted to make a prediction for was streamed the same way as the training
data. There were two ways I approached this. Firstly, the data could be streamed directly,
formatted into arrays, and run through the classifier models we trained previously. Secondly,
if we wanted to save the prediction to the database, we posted the data the same way as we did
the training data, but with an extra boolean value as part of the object to differentiate between
training and prediction data.
5.4 User application
5.4.1 Model saving
The database was not going to be hosted online for several reasons, including application dependence,
cost, and privacy issues. On top of that, we did not want the application to train a model
every time a prediction for a stock was to be made. The solution was simple: use Python's
Pickle module to save the classifier, scaler and PCA models, so that only those three external files
would need to be packaged with the application.
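A sketch of the save/load round trip; the dummy classifier stands in for the actual trained models, and the file name is illustrative:

```python
import pickle
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])  # stand-in model

with open("classifier.pkl", "wb") as f:   # repeated for the scaler and PCA objects
    pickle.dump(clf, f)

with open("classifier.pkl", "rb") as f:   # the packaged app only ships these files
    clf = pickle.load(f)
print(clf.predict([[0]]))
```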
5.4.2 User interface
The user interface would be built using Tkinter. It would allow the user to input the company
name and stock code, which would be required, along with two optional words related to that
company. The number of tweets to be retrieved would be up to the user. 5.11 shows how the
window appears on the Windows OS.
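The window boils down to a handful of labelled entries and a button; the layout below is a simplified sketch, not the actual interface code:

```python
import tkinter as tk

root = tk.Tk()
root.title("Enular")

fields = {}
for label in ("Company name", "Stock code", "Related word 1 (optional)",
              "Related word 2 (optional)", "Number of tweets"):
    tk.Label(root, text=label).pack()
    fields[label] = tk.Entry(root)
    fields[label].pack()

# The window closes while streaming runs in the background (see below).
tk.Button(root, text="Predict", command=root.destroy).pack()
root.mainloop()
```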
The streaming process would be difficult to show on the user interface. String variables used
by Tkinter would have to be updated every time new data was retrieved and the root would
have had to refresh. To combat this issue, I would close the window and allow the terminal to
run in the background while tweets were collected, and another pop up would appear once the
prediction was ready.
We would let the user know how many tweets were collected for each key word they input.
Company name and stock code would be considered the same category. We would also show
them an aggregated sentiment score. Finally, they would be given a prediction as to whether
their stock would go up or down at the closing time the next market day.
5.4.3 Standalone application
From the requirements, we had to make the product as easy to use as possible. This meant that
we would not be making the user install hundreds of Python modules in order to run the code.
To package the entirety of the project into an executable format for an operating system, without
dependencies, PyInstaller was used. This was a much more challenging process than expected, as
the project involved several heavy modules which all had to work together. This meant having to
go through modules I already had installed, then upgrade or downgrade them so that everything
would be compatible. As a fix for several recurring issues, changing variables in several methods
in the Python modules themselves was required. On top of that, many resources had to be
manually or specifically included.
Figure 5.13: Specific instructions for packaging this project, based on its dependencies
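To give a flavour of what these instructions look like, the packaging step can be driven from Python as below; the hidden import and data files are placeholders rather than the project's exact list, since the precise set depended on the installed module versions:

```python
import PyInstaller.__main__

# Roughly the shape of the packaging step used for the Windows build.
# The hidden import and bundled files below are illustrative only.
PyInstaller.__main__.run([
    "main.py",
    "--onefile",
    "--hidden-import=sklearn.utils._typedefs",
    "--add-data=classifier.pkl;.",   # ';' is the Windows path separator
    "--add-data=scaler.pkl;.",
    "--add-data=pca.pkl;.",
])
```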
Finally, a web application was built to allow for even easier access. The web app could not depend
on Tkinter for the user interface, so HTML had to be used instead, with GET methods.
(a) User interface of the web app (b) Results page of the web app
Using the Flask framework, the application was successfully adapted into a hosted web application.
However, the same problem was faced as with Tkinter: it was extremely difficult to show
streaming. On top of this, due to slightly different limitations, tweet limits could not be used,
and I implemented a time limit instead. The final screen would still show the same details as
the standalone application.
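The skeleton of such a Flask app, with the streaming and prediction pipeline reduced to a placeholder, could look as follows:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    # Minimal HTML form submitting its fields via GET.
    return ('<form action="/predict" method="get">'
            '<input name="company"> <input name="code">'
            '<input type="submit" value="Predict"></form>')

@app.route("/predict")
def predict():
    company = request.args.get("company")
    code = request.args.get("code")
    # Placeholder: the real app streamed tweets under a time limit and
    # ran the prediction pipeline here before rendering the results.
    return f"Prediction for {company} ({code}): ..."

if __name__ == "__main__":
    app.run()
```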
5.5 Summary
At this stage, the product had met most of the essential requirements. The goal now was to attain
reasonable results using the product we had built, and that would involve improving and adding
features, as well as configuring the prediction process.
6 Evaluation
This chapter will be split into two sections. In the first section, we will discuss the initial analysis
of the product's performance, and how and what was implemented to further improve its accuracy
and refine the product. In the second section, we will carry out our final evaluations to answer the
questions set out at the start of the project.
The first order of business was to have a method of measuring the performance of our classifiers.
Using the Sklearn metrics library, I was able to measure the F1, accuracy, precision, and recall for
each model. Accuracy measures the ratio of correctly predicted observations to total observations;
precision measures the ratio of correctly predicted positive observations to total predicted positive
observations; recall measures the ratio of correctly predicted positive observations to all
observations in the actual class; and the F1 score is the weighted average of precision and recall [18].
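These four metrics come straight from sklearn.metrics; a minimal usage example with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: y_true holds actual next-day movements, y_pred the model's output.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```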
(a) General accuracy comparison between classifiers (b) Detailed comparison between shortlisted classifiers
Figure 6.1: Comparing performance of various classifiers trained using the same data
Immediately, using the data that I had, I trained eight classifiers: decision tree (DTC), support
vector machine (SVM), Gaussian Naïve Bayes (GNB), k-nearest neighbors (KNN), multi-layer
perceptron (MLP), random forest (RFC), extra trees (ETC), and finally a random dummy classifier
(DUM). They were evaluated with different splits of the testing and training data to see which
of the classifiers would consistently perform well, adhering to the principle of cross validation.
An example of the results is shown in Figure 6.1a.
Over a number of evaluations, DTC, SVM, and GNB would consistently perform well. These
three classifiers, along with the random dummy, were shortlisted and compared further, with the
results shown in Figure 6.1b.
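A sketch of this repeated-split comparison over the shortlisted models, using placeholder data, might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier

X = np.random.rand(200, 2)         # placeholder for the reduced features
y = np.random.randint(0, 2, 200)   # placeholder for the up/down labels

models = {"DTC": DecisionTreeClassifier(), "SVM": SVC(),
          "GNB": GaussianNB(), "DUM": DummyClassifier(strategy="uniform")}

# Repeat with different random splits to see which model is consistently good.
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores = {name: round(m.fit(X_tr, y_tr).score(X_te, y_te), 3)
              for name, m in models.items()}
    print(seed, scores)
```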
The decision tree classifier outperformed the other classifiers in the majority of the tests. Because
of this, on top of the flexible nature of decision tree classifiers and the comprehensive analysis
they permit, I would be using this classifier for the rest of the evaluation. However, the choice of
classifier would be reexamined later on.
Figure 6.2: A graph showing the performance over a range of classifier configurations
The configuration of the classifier would be adjusted to try and find optimal settings for performance.
A loop was used to cycle through parameters such as maximum depth, minimum samples required
for splitting, weighted fraction, and so on. For many of these options, the default gave the best
performance; for those that showed an improvement, the improved value was used to replace the
default. An example of the results from this analysis is shown in Figure 6.2.
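The sweep itself is a simple loop of the following shape (one hyperparameter shown; the data is a placeholder):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 2)          # placeholder data
y = np.random.randint(0, 2, 200)

# Sweep one hyperparameter at a time, keeping whatever beats the default.
for max_depth in [None, 2, 4, 6, 8, 10]:
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"max_depth={max_depth}: {score:.3f}")
```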
Figure 6.3: A graph showing the performance over a range of PCA configurations
As we had 15 features at this point, I wanted to find out whether reducing them to fewer
components would improve the performance, with the results shown in Figure 6.3.
Reducing the data down to two components seemed to improve the performance drastically.
This was most likely due to the independent nature of the features, with many possibly being
arbitrary.
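The same sweeping approach applies to the number of retained components; a sketch using a Sklearn pipeline, again with placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 15)         # placeholder: 15 raw features
y = np.random.randint(0, 2, 200)

# Score the classifier for each number of retained principal components.
for n in range(1, 16):
    pipe = make_pipeline(StandardScaler(), PCA(n_components=n),
                         DecisionTreeClassifier(random_state=0))
    print(n, round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```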
The initial spam filtering stage had already removed more than 90% of our data. However, there
were still samples that would be of more use than others, such as tweets with more words. For
this reason, minimum word and character limits were implemented to see if they would make a
difference. First, I examined the distribution of tweets in my data set, as shown in Figure 6.4.
(a) Distribution of minimum word count (b) Distribution of minimum character count
With a close to linear distribution of samples, I created a loop which would give us the performance
of the classifier as each of these limits was put in place, plotting the results with Python's
matplotlib module.
(a) Effect of filtering minimum words (b) Effect of filtering minimum characters
Surprisingly, filtering on the word and character length of the tweets did not improve the
performance of the classifier. This was possibly due to the fact that word and character counts
were already included as features.
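The word-count half of that loop can be sketched as follows, with synthetic tweets of varying length standing in for the real data set:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic tweets of varying length stand in for the real data set.
rng = np.random.default_rng(0)
texts = [" ".join(["word"] * n) for n in rng.integers(1, 30, size=200)]
X_all = rng.random((200, 15))
y_all = rng.integers(0, 2, 200)

# Re-score the model under progressively stricter word-count filters.
for min_words in range(1, 11):
    mask = np.array([len(t.split()) >= min_words for t in texts])
    score = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X_all[mask], y_all[mask], cv=5).mean()
    print(f"min_words={min_words}: kept={mask.sum()}, accuracy={score:.3f}")
```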
At this time, I had a few ideas for extracting additional features out of the data that we already had.
Using the content of the tweet, additional methods would be implemented to produce extra values
for sentiment, subjectivity, and word types.
Figure 6.6: A graph showing performance with new feature groups added
Using several natural language processing libraries, including TextBlob, VaderSentiment, and SpaCy,
three more groups of features were generated, comprising six individual features. Using VaderSentiment,
values for positivity, negativity, and a compound score were used as features under the group
'Sentiment'. TextBlob's subjectivity analysis was used, and finally, the number of nouns and verbs
in each tweet was counted using SpaCy. The results are shown in Figure 6.6.
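A plausible extraction function for these six features, assuming SpaCy's small English model is installed, is:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
analyzer = SentimentIntensityAnalyzer()

def extra_features(text):
    # Three VaderSentiment values, TextBlob subjectivity, and SpaCy
    # noun/verb counts: six features in total.
    vader = analyzer.polarity_scores(text)
    subjectivity = TextBlob(text).sentiment.subjectivity
    doc = nlp(text)
    nouns = sum(token.pos_ == "NOUN" for token in doc)
    verbs = sum(token.pos_ == "VERB" for token in doc)
    return [vader["pos"], vader["neg"], vader["compound"],
            subjectivity, nouns, verbs]

print(extra_features("The new iPhone looks fantastic and sales are booming"))
```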
Interestingly, adding these additional features only reduced the performance of the classifier.
Perhaps this was because variations of these features already existed as part of the control
model, and adding them simply introduced more noise.
6.1.7 Summary
I was successful, to an extent, in improving the performance of the model. Many methods which I
thought would improve the prediction performance did not, and explanations were given to
the best of my ability.
The software practice of unit testing was used throughout the project. Every lower-level compo-
nent was tested separately and individually before being merged into a larger component, as
seen in section 4.1.4. This was convenient because every component had a very specific expected
output for a given situation. The larger components would then be tested on their own to
ensure that results were consistent, such as getting the same metrics from the classifier. Finally,
the product was tested as a whole, with print checks at every stage. This would be repeated any
time a large change was implemented.
The features that we used were split into four groups. The first group contained features describing
the author of the tweet. The second group was about the reception of the post itself. The third
group included information about the company, such as sentiment and the number of times it
was mentioned. Finally, the fourth group contained Twitter-specific information, such as the
number of hashtags. I then compared the accuracy of the classifier trained with each group
removed against the accuracy of the control. The results are shown in Figure 6.8.
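A sketch of this ablation, with hypothetical column indices standing in for the four groups:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 15)         # placeholder feature matrix
y = np.random.randint(0, 2, 200)

# Hypothetical column indices for the four feature groups.
groups = {"author": [0, 1, 2, 3], "reception": [4, 5, 6],
          "company": [7, 8, 9, 10], "twitter": [11, 12, 13, 14]}

clf = DecisionTreeClassifier(random_state=0)
print("control:", cross_val_score(clf, X, y, cv=5).mean())

# Drop one group at a time and compare against the control.
for name, cols in groups.items():
    X_drop = np.delete(X, cols, axis=1)
    print(f"without {name}:", round(cross_val_score(clf, X_drop, y, cv=5).mean(), 3))
```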
As we can see, removing group 1 in particular caused the accuracy of the classifier to deteriorate
drastically. Group 2, on the other hand, made no difference at all. This made sense, as I later
found out that group 2 contained no valuable information whatsoever. As tweets were collected
as they were posted, they would not have had time to accumulate likes and retweets. With this
information, I further expanded the study to find out which features in particular made the
biggest difference, as shown in Figure 6.9.
Figure 6.9: Graph showing the performance with various features removed
It seemed that the biggest indicators were the number of followers of the author, the length of the
tweet, the mentions of the company, the number of hashtags, and the sentiment of the content.
A fixed number of days was then set aside. Using the remaining training data, the model was
trained the same way to see if the results would be affected. It was expected that the more training
data I took away, the worse the performance would get.
Figure 6.10: A graph showing performance while training with fewer days
Although taking away 10, 15, and 20 days of training data worsened the performance, surprisingly,
taking away only 5 days improved it slightly. This could have been for multiple reasons, but the
most likely explanation is that the 5 days taken away were giving unusual results, possibly due to
macroeconomic changes.
Finally, with our changes made, we conducted one final test of our classifier. Using cross-validation
principles, we split our training and testing data randomly 10 times and obtained a value for the
accuracy of each model trained. The results are shown in Figure 6.11.
The model consistently achieved over 50% accuracy, with the average score being 0.5731.
Using the final product, we attempted to find out how consistent the predictions were for the same
stocks during the same period. Our goal was that, if the application were run 10 times, all runs
would give the same prediction as well as similar values for sentiment and tweet distribution.
We tested this with Apple, using the stock code AAPL and related words 'iPhone' and 'Mac',
retrieving 500 tweets each time.
Figure 6.12: A table showing the results from running the application 10 separate times
These results were extremely surprising. They indicated that the retrieved tweet data was not as
random as I had thought. They were also very suspicious: the consistency of the sentiment did not
seem possible, and for that reason I suspect it could be manipulation by marketing firms.
However, this is difficult to investigate and confirm.
A regression model was built using historical stock data for the purpose of comparing its results
with those of the sentiment model. It used many of the same libraries to retrieve stock data, and
used Sklearn's linear regression algorithm. Although the model was able to give numerical results,
it was not adjusted for day-to-day comparison with the sentiment model due to time constraints.
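Such a baseline is only a few lines with Sklearn; here is a sketch on synthetic prices, predicting the next close from the previous one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic closing prices stand in for the retrieved historical data.
prices = np.cumsum(np.random.randn(100)) + 100

# Predict the next day's close from the previous day's close.
X = prices[:-1].reshape(-1, 1)
y = prices[1:]

model = LinearRegression().fit(X, y)
print("next-day estimate:", model.predict([[prices[-1]]])[0])
```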
6.2.7 Summary
Firstly, features related to sentiment, such as content polarity and the number of mentions of the
search query, as well as tweet length and author friends, had the biggest negative impact when
removed from the training. We can conclude that these four features may be the greatest indicators
of stock price movements the following day.
Unsurprisingly, training with more data generally improved the performance of the classifier,
with the exception of unusual periods, such as national-level news affecting the economy.
Finally, the data retrieval process was extremely precise. Collecting a certain number of tweets
using the same parameters consecutively would give very consistent results, especially for senti-
ment values. To confirm that this was not an error in the code, the same search was performed
on another company with different keywords, giving very different results. Furthermore, the
experiment was performed on Apple again, using the same keywords, but on a different day at a
different time. Again, the results were completely different. This means that the randomness and
unpredictability of social media data was in fact not an issue, but rather a matter of what is done
with that data.
7 Conclusion
At last, we will discuss whether this project was successful, why or why not, and what more
could have been done.
The ’must have’ requirements were all met, as every one of them was essential for the product to
even exist. All the ’should have’ requirements were also met, with the exception of S5 and S9.
S5 was explained in section 6.2.6 to have been forgone due to time constraints, while S9 will be
addressed later in this chapter.
The ’could have’ requirements were slightly too ambitious. C1 and C2 were not feasible due to the
barely passable accuracy of the existing model. C3 was attempted as described in section 4.2.7,
but was met with failure. C4, C5, C6, C7, and C8 could not be completed simply due to
time constraints.
The product was informally trialed by colleagues and students from both technological and
non-technological backgrounds to gauge the product's general usability. Of the non-functional
requirements, NF4 and NF7 were not met, while NF2 and NF8 were only just acceptable. NF4,
an accuracy concern, will be elaborated on in this chapter. As for NF7, users felt that just a
prediction and a value for sentiment were not enough. They wanted to know why the prediction
was made, which could not simply be explained by saying ’trust the machine’, something I myself
do not believe. Although a web app was implemented for NF2, it was not hosted, meaning I
would have to send the standalone application in order for someone to use it. The standalone
was over 1GB, which inconvenienced some users. Finally, although the code is expandable,
whether or not NF8 was met is debatable due to the untidiness of the commenting and structuring
of the source files.
Out of the four major components, I believe that the data ingestion and sentiment/feature
breakdown were executed rather well, owing to the precision of the averaged features and the
consistency of predictions. The user applications satisfied their requirements. Generating an
accurate prediction using the retrieved data, however, was questionable at best, as is evident from
the mere 57.31% accuracy and the large deviation of performance metrics when configurations
were adjusted.
The overall accuracy obtained was 57.31%. This may seem awfully low compared to other
machine learning problems; however, one must take into consideration the largely independent
relationship between tweets and stock prices.
The average price increase was 0.54% and the average price decrease was 0.36%. Assuming we
only buy stocks that the predictor tells us will increase each day, we use the formula:

r = a × u - (1 - a) × d = 0.5731 × 0.54% - 0.4269 × 0.36%

where a is the prediction accuracy, u is the average increase, and d is the average decrease.
This gives us an expected return of 0.16% per day. If many trades are made, this value of profit
becomes consistent, although it does not account for trading fees. Over a 365-day year, the expected
return on one's portfolio is 50.12%.
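As a quick check of this arithmetic: the daily figure follows directly from the numbers above, while the annualization method is not stated, so compounding over roughly the 261 weekdays in a year is an assumption on my part that happens to reproduce the quoted figure:

```python
accuracy = 0.5731
avg_gain = 0.0054   # average price increase
avg_loss = 0.0036   # average price decrease

# Expected return of a single day's trade, buying only predicted risers.
daily = accuracy * avg_gain - (1 - accuracy) * avg_loss
print(f"daily return: {daily:.4%}")             # ~0.16%

# Assumption: compounding over roughly the 261 weekdays in a year
# reproduces the quoted ~50% annual figure.
print(f"annual return: {(1 + daily) ** 261 - 1:.2%}")
```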
7.2.3 Verdict
Although a return of 50.12% a year may seem good, we must take risk and reliability into
consideration. The data used to train the models was collected over a 3-month period with very
questionable political circumstances. As the world's situation changes, so might this model.
The 57.31% accuracy was based on training data; when applied to real-life scenarios, the results
may vary wildly.
On top of that, many professional prediction models achieve accuracies over 70%. My application
is most definitely unable to compete with them.
The main disadvantage of this product is that it only examines sentiment, which is highly
independent of stock movements, while other models mostly use empirical data from the stocks
themselves, which is much easier to correlate with and make accurate predictions from.
Overall, this product would not be viable on its own. That being said, it is not useless either, as
there are still improvements that can be made to bring viability to the product.
7.3 Limitations
In this section, we will briefly discuss why this project was particularly difficult.
The choice of classifier was questionable. Although the decision tree gave the best performance out
of all the default classifiers, the task could possibly have been more suitable for Random Forest
or Naïve Bayes. A graph of the training data with its components reduced to 2 is shown in
Figure 7.1, with the components being the x and y values and the color indicating the class each
sample belonged to. The decision tree was not the best option for this, but then, what was?
Using Sklearn's classifier comparison code [19], we get the results shown in Figure 7.2.
We find that no classifier really does well given the data set. This is a problem which must be
tackled at the lowest level.
In several of the papers we explored previously, mood analysis was heavily involved and was
believed to be a very strong indicator of stock price movements. Given more time, mood
analysis could have been implemented to derive a feature from each of a number of moods.
Another thing I would have liked to do was implement a method of ranking tweets according to
how significant they were, but this was not done due to time limitations. Glasgow's own
Terrier platform could be used for this.
With current standards of stock prediction heavily focused on using empirical data, it is essential
that this product be used in combination with those products to achieve maximum accuracy.
7.5 Summary
This was a difficult task from start to finish, particularly building a good model given the data that
was retrieved. However, the project achieved better-than-random predictions of the movement
of stock prices at closing time the next day. The data retrieval process was done well, achieving
extremely consistent information about any stock at any time and giving the same prediction
reliably, indicating a high precision in converting public sentiment into data. That being said,
the accuracy of the model is the main limiting factor and must be improved, with a lot of work
going into designing and perfecting the classifier. Overall, this product may be feasible in the
real world, but as a supplement meant to be combined with other methods of prediction.
A Appendices
Bibliography