This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE SYSTEMS JOURNAL 1

Forecasting Stock Market Movement Direction Using Sentiment Analysis and Support Vector Machine
Rui Ren, Member, IEEE, Desheng Dash Wu, Senior Member, IEEE, and Tianxiang Liu

Abstract—Investor sentiment plays an important role in the stock market. User-generated textual content on the Internet provides a precious source that reflects investor psychology and helps predict stock prices as a complement to stock market data. This paper integrates sentiment analysis into a machine learning method based on support vector machine. Furthermore, we take the day-of-week effect into consideration and construct more reliable and realistic sentiment indexes. Empirical results illustrate that the accuracy of forecasting the movement direction of the SSE 50 Index can be as high as 89.93%, a rise of 18.6 percentage points after introducing sentiment variables. Meanwhile, our model helps investors make wiser decisions. These findings also imply that sentiment probably contains precious information about the asset fundamental values and can be regarded as one of the leading indicators of the stock market.

Index Terms—Day-of-week effect, decision making, sentiment analysis, stock markets, text mining.

Manuscript received June 9, 2017; revised September 20, 2017 and December 27, 2017; accepted January 4, 2018. This work was supported in part by the Ministry of Science and Technology of China under Grant 2016YFC0503606, in part by the National Natural Science Foundation of China under Grant 71471055 and Grant 91546102, in part by the Chinese Academy of Sciences (CAS) Frontier Scientific Research Key Project under Grant QYZDB-SSW-SYS021, and in part by the CAS Strategic Research and Decision Support System Development under Grant GHJ-ZLZX-2017-36. (Corresponding author: Desheng Dash Wu.)
R. Ren and T. Liu are with the School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).
D. Wu is with the School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China, and also with the Stockholm Business School, Stockholm University, Stockholm SE-106 91, Sweden (e-mail: [email protected]).
Digital Object Identifier 10.1109/JSYST.2018.2794462

I. INTRODUCTION

Forecasting stock market trends has been treated as one of the most challenging but important tasks. The stock market is a nonlinear and dynamic system, and investor sentiment constitutes a key factor of the financial market [1]. With the proliferation of news, blogs, forums, and social networking websites, textual content on the Internet provides a precious source that reflects investor sentiment and helps predict stock prices as a complement to traditional stock market time series data. Hence an automated approach is required to distill knowledge from a large number of textual documents [2], [3]. Sentiment analysis is used to automatically extract views, attitudes, and emotions from opinionated content [4]. So, we employ sentiment analysis to construct sentiment indexes, and then aggregate them with stock market data to forecast movement direction.

In order to get an efficient and persuasive sentiment index, we take the day-of-week effect into consideration, which means that the average return on Mondays is much lower than that on the other days of the week [5]. It is one of the most well-known financial anomalies, dating back to 1930 when Fred C. Kelly revealed the phenomenon on the U.S. markets, where the returns had the tendency to decline on Mondays [6], [7]. The effect was later shown to exist in global stock markets [8]. One probable reason is that a much larger amount of information is produced on weekends than on weekdays. Most corporations tend to release news on Saturday, Sunday, or even on Friday nights just after the stock market is closed, so that people have enough time to digest bad news, which prevents dramatic fluctuations, or to remember good news, which boosts companies' images. However, the day-of-week effect is seldom mentioned when it comes to calculating sentiment indexes.

Another difficulty in predicting stock movement direction is attributed to its nonlinear, dynamic, and evolutionary properties. Support vector machine (SVM) has been widely utilized since it can solve the nonlinear problem by converting it to a quadratic programming problem. Moreover, the solution of SVM is unique and globally optimal [9]. It can also reduce the overfitting problem by selecting the maximal margin hyperplane in the feature space [10]. To further address the problem, we implement fivefold cross validation. However, it leads to look-ahead bias, so we integrate SVM with a realistic rolling window approach to eliminate the bias. Empirical results illustrate that combining sentiment features with stock market data outperforms using only stock market data in forecasting movement direction. So, we can deduce that investor sentiment plays an important role in the stock market. Furthermore, sentiment probably contains precious information about the asset fundamental values and can be regarded as one of the leading indicators of the stock market.

Moreover, we have developed a practical trading strategy. The prediction results also imply a trade order: 1 means a buy order, whereas -1 means a sell order. Thus, we can simulate how the strategy behaves if people make investment decisions solely based on the results in a real market environment. We assume that short selling is allowed and that there are no market frictions. The results show that by integrating sentiment indexes into the basic model, investors can make more profit and at the same time bear fewer risks. In addition, a stop-loss order strategy is applied to limit the potential losses, and it achieves a much better performance.

1937-9234 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The remainder of this paper is organized as follows. Section II highlights related literature. Section III puts forward a new sentiment analysis method and describes SVMs in detail. In Section IV, we describe data and present empirical results. Section V provides concluding remarks and future work.

II. RELATED WORK

A. Sentiment Analysis in Finance Industry
Sentiment is an opinion or feeling you have about something according to the Longman Dictionary. Sentiment analysis is the method to transfer unstructured textual contents to structured data, and distill views, attitudes, and emotions by language processing, data mining, and computational linguistics [11], [12]. Investor sentiment constitutes a key factor of the financial market [1]. Baker and Wurgler [13] employ the equity share in new issues, the dividend premium, and some other variables as sentiment proxies, and point out that investor sentiment affects the cross section of stock returns. Edmans et al. [14] use international soccer results as a mood variable and document a significant market decline after each loss. Afterwards, with the development of sentiment analysis, researchers started to deal with written text, which is a more direct way to express ideas and emotions. Tetlock [15] generates a pessimistic media factor in terms of the Wall Street Journal's "Abreast of the Market" column and finds that high pessimism has a negative effect on market prices followed by a subsequent reversion. Bollen et al. [16] present evidence that tweets posted on Twitter are a predictive factor of the Dow Jones Industrial Average (DJIA) values and find an accuracy of 86.7% in predicting the daily up and down changes in the closing values of the DJIA. Gillam et al. [17] concentrate on the volume of news to quantify the information incorporated in textual data and discover that it enhances earnings forecasting. Moreover, sentiment analysis is superior to the bag-of-words model at individual stock, sector, and index levels in predicting stock prices [12]. Oliveira et al. [18] propose an automated method to build a stock market sentiment lexicon to facilitate the research in the area. Nevertheless, the day-of-week effect is rarely mentioned in the study of investor sentiment.

B. SVM in Predicting Stock Market
SVM was proposed by Vapnik [19] and is a supervised learning method that can partially address the overfitting problem directly and formally [20]. With the help of kernel functions, such as the radial basis function (RBF) kernel and the polynomial kernel, it is able to solve the nonlinear problem by projecting it onto the high-dimensional feature space. Furthermore, it is a dynamic approach. The stock market is a nonlinear and dynamic system, so SVM has been widely applied to forecast stock prices, especially stock indexes. Huang et al. [9] employ SVM to predict the weekly movement direction of the NIKKEI 225 Index and show that SVM outperforms the other classification methods, such as random walk model, quadratic discriminant analysis, and Elman backpropagation neural networks. An evolving least squares SVM is proposed by Yu et al. [10] to explore the trends of three important stock indexes, the S&P 500 Index, the DJIA Index, and the New York Stock Exchange Index. Kara et al. [21] use SVM to forecast the daily Istanbul Stock Exchange National 100 Index, and the average prediction performance is 71.52%. Besides, SVM is combined with other methods to achieve a better performance. A model integrating SVM with several other classification methods, such as random walk model and Elman backpropagation neural networks, performs best in predicting the NIKKEI 225 Index [9]. Pai and Lin [22] put forward a hybrid of an autoregressive integrated moving average model and an SVM model for forecasting stock prices, and the experimental results prove promising. Wu et al. [23] propose a method that integrates sentiment analysis into SVM and generalized autoregressive conditional heteroskedasticity to explore the relationship between stock price volatility and stock forum sentiment, and the method can effectively predict financial risk measured in volatility terms. This paper not only combines SVM with a rolling window approach to make our method more meaningful and practical in a financial domain, but also integrates sentiment analysis into a machine learning method based on SVM to forecast stock market movement direction with consideration of human emotions.

III. METHODOLOGY
This paper aims to forecast stock market movement direction by not only using financial market data, but also combining them with sentiment features that incorporate investor psychology. The features are extracted from unstructured news data automatically and then are expressed as sentiment indexes. In order to make the indexes more realistic and reliable, we take the day-of-week effect into consideration. Next, we employ SVM to forecast stock market trends, adjust it to real market situations by use of a rolling window approach, and then compare the accuracy with the baseline method. Moreover, the prediction results are used to guide investment decisions, and the performance of three different trading strategies is evaluated and compared. The overview of the stock market prediction architecture is illustrated in Fig. 1.

A. Investor Sentiment
This section is made up of three steps. We first build a web crawler to download news documents automatically from the Internet, and then construct daily sentiment indexes based on the corpus. Finally, adjustments are made in consideration of the day-of-week effect.

Step 1 Web crawler: In this step, we aim to build a web crawler to automatically download the targeted textual documents from the Internet and store them in a database for further processing. The framework is shown in Fig. 2. The web crawler begins with the seeds in the form of a list of URLs. The scheduler manages the queue of URLs, deciding the priority and eliminating duplicates. Next, the downloader is responsible for acquiring the web pages from the Internet and providing them to the spider, which is used to parse the pages and extract the targeted contents. What we need to obtain is comprised of two sections: one is the textual news with the date from the websites, and

then we store this data in the database; the other is the URLs contained in the pages, which are then passed to the scheduler. The procedures are repeated until we get hold of all the targeted textual documents. Each of the documents is stored as time, headline, and contents in the database.

Fig. 1. Overview of stock market prediction architecture.

Fig. 2. Framework of the web crawler.

Step 2 Daily sentiment: A sentence-based sentiment analysis approach is used to process the textual data during a specific period. We regard a sentence, rather than a single word, as the unit for interpreting the meaning of the whole document because a sentence can express a relatively complete meaning and help address the ambiguity problem. As a result, a document is divided into sentences first. Next, we segment the sentences into separate words, then project the words onto the sentiment space, count the number of positive and negative words, assign a specific sentiment value, and decide the polarity of each sentence based on HowNet and Chinese Sentiment Analysis Ontology Base. HowNet is an online common-sense knowledge base unveiling interconceptual relationships and interattribute relationships of concepts as connoted in lexicons of the Chinese and their English equivalents [24]. Chinese Sentiment Analysis Ontology Base is constructed by Dalian University of Technology and depicts words and phrases from various aspects containing part of speech, polarity, and sentiment intensity. After that, we categorize each document. As there may be a large number of posts or articles in a day, a daily sentiment index $S_t$ is calculated as

$$
S_t =
\begin{cases}
\dfrac{2 M_t^{\mathrm{bull}}}{M_t^{\mathrm{bull}} + M_t^{\mathrm{bear}}} - 1, & M_t^{\mathrm{bull}} > M_t^{\mathrm{bear}} \\[2ex]
0, & M_t^{\mathrm{bull}} = M_t^{\mathrm{bear}} \\[1ex]
1 - \dfrac{2 M_t^{\mathrm{bear}}}{M_t^{\mathrm{bull}} + M_t^{\mathrm{bear}}}, & M_t^{\mathrm{bull}} < M_t^{\mathrm{bear}}
\end{cases}
\tag{1}
$$

where $M_t^{\mathrm{bull}}$ denotes the number of positive comments, whereas $M_t^{\mathrm{bear}}$ denotes the number of negative comments in day $t$. The value of $S_t$ ranges from -1 to 1, where 0 means people hold a neutral position. If the value is larger than 0, most people take a positive view; if the value is less than 0, most people take a negative view.
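The document categorization and the daily index in (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the encoding of sentence and document polarity as +1, 0, and -1 are assumptions, and the lexicon lookup that produces sentence polarities is left out. A document is labeled by comparing its counts of positive and negative sentences, as described in Section IV-B. Note that the first and third branches of (1) both simplify to $(M_t^{\mathrm{bull}} - M_t^{\mathrm{bear}})/(M_t^{\mathrm{bull}} + M_t^{\mathrm{bear}})$; the code keeps the three branches only to mirror the equation.

```python
from typing import Iterable


def classify_document(sentence_polarities: Iterable[int]) -> int:
    """Label a document +1, 0, or -1 from its sentence polarities.

    Assumption: each sentence polarity is already encoded as a positive,
    zero, or negative integer by a lexicon-based step not shown here.
    """
    polarities = list(sentence_polarities)
    positive = sum(1 for p in polarities if p > 0)
    negative = sum(1 for p in polarities if p < 0)
    if positive > negative:
        return 1
    if positive < negative:
        return -1
    return 0


def daily_sentiment(document_labels: Iterable[int]) -> float:
    """Daily sentiment index S_t following Eq. (1)."""
    labels = list(document_labels)
    bull = sum(1 for d in labels if d > 0)  # M_t^bull
    bear = sum(1 for d in labels if d < 0)  # M_t^bear
    if bull == bear:  # also covers a day with no polar documents
        return 0.0
    if bull > bear:
        return 2 * bull / (bull + bear) - 1
    return 1 - 2 * bear / (bull + bear)


# Three bullish documents and one bearish document on the same day.
print(daily_sentiment([1, 1, 1, -1]))  # 0.5
```

Neutral documents do not enter (1), so they affect neither count in this sketch.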

Step 3 Modified sentiment: The day-of-week effect is one of the most well-known financial anomalies [5], which means that the average return on a Monday is much lower than that on the other days of the week. One reason is that a large amount of news is reported on the weekend or on Friday just after the market is closed. With such considerable and valuable information to deal with, investors are very likely to change their mind and take actions on Mondays. Furthermore, corporations also tend to release important news on the weekend to ensure the stability of the stock and boost the public image. If it is bad news, investors will have enough time to digest and accept it, whereas if it is good news, companies can continuously spread the news to make it known by more and more people and expand their coverage.

In this procedure, we aim to gauge the effect from Saturday to Monday by using an exponential time function, as news has greater impact when it is more recent. Barberis et al. [25] introduced a measure of "sentiment" by using an exponential function on past price changes on the stock market; accordingly, we define sentiment on Monday as a weighted average of past sentiment where the weights decrease exponentially. The expression is

$$
S_m = e^{-\lambda t_1} S_1 + e^{-\lambda t_2} S_2 + e^{-\lambda t_3} S_3 \tag{2}
$$

where $S_1$, $S_2$, and $S_3$ stand for Saturday sentiment, Sunday sentiment, and Monday sentiment, respectively; $S_m$ is the modified Monday sentiment; $\lambda$ ($\lambda > 0$) is prescribed; $t_1 = 2$, $t_2 = 1$, and $t_3 = 0$.

Similarly, the stock market is also closed on national holidays or on some special days, so we generalize (2) to more common occasions. Assume there are $n$ holiday days on the stock market; then the sentiment on the $(n+1)$th day is represented as

$$
S_{n+1} = e^{-n\lambda} S_1 + e^{-(n-1)\lambda} S_2 + \cdots + e^{-\lambda} S_n + S_{n+1}. \tag{3}
$$

Here the left-hand side is the modified sentiment, as $S_m$ is in (2), and the terms on the right-hand side are the unmodified daily sentiments.

B. Support Vector Machine
SVM is a supervised machine learning model for classification, which was proposed by Vapnik in the 1990s [19]. Assume that there is an input space $X$, an output space $Y$, and a training dataset $T$

$$
T = \{(x_i, y_i),\ i = 1, \ldots, l\} \in (X \times Y)^l \tag{4}
$$

where $x_i \in R^n$, $y_i \in Y = \{-1, 1\}$. We then introduce a transformation $\mathbf{x} = f(x)$ such that $R^n \to H$, where $H$ is the Hilbert space and bold $\mathbf{x}$ denotes the image of $x$ in $H$, so the training set is then denoted as

$$
T_f = \{(\mathbf{x}_i, y_i),\ i = 1, \ldots, l\} \in (H \times Y)^l \tag{5}
$$

where $\mathbf{x}_i = f(x_i) \in H$ and $y_i \in Y = \{-1, 1\}$. Thus, we can find a linear separating hyperplane $(w^* \cdot \mathbf{x}) + b^* = 0$ in the Hilbert space, and then we can obtain a separating hyperplane $w^* \cdot f(x) + b^* = 0$ and a decision function $D(x) = \mathrm{sgn}((w^* \cdot \mathbf{x}) + b^*) = \mathrm{sgn}(w^* \cdot f(x) + b^*)$ in the original space $R^n$.

SVM is an optimization problem that aims to maximize the margin. The margin between two hyperplanes in the Hilbert space is $2/\|w\|$. The two hyperplanes are given by

$$
(w \cdot \mathbf{x}) + b = 1 \quad \text{and} \quad (w \cdot \mathbf{x}) + b = -1. \tag{6}
$$

An SVM model can be represented as

$$
\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \tag{7}
$$

$$
\text{s.t.} \quad y_i((w \cdot f(x_i)) + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l \tag{8}
$$

$$
\xi_i \ge 0, \quad i = 1, \ldots, l \tag{9}
$$

where $\xi_i$ is a tolerable training error, and $C$ is a positive constant parameter that controls the tradeoff between training errors and margin maximization. In order to solve the problem, we can transform it to its dual problem, and the solution set of the dual problem is the same as that of the QP problem shown in the following equations:

$$
\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (f(x_i) \cdot f(x_j)) - \sum_{j=1}^{l} \alpha_j \tag{10}
$$

$$
= \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{j=1}^{l} \alpha_j \tag{11}
$$

$$
\text{s.t.} \quad \sum_{i=1}^{l} y_i \alpha_i = 0 \tag{12}
$$

$$
0 \le \alpha_i \le C, \quad i = 1, \ldots, l \tag{13}
$$

where $\alpha = (\alpha_1, \ldots, \alpha_l)^T$ is the vector of Lagrange multipliers, $K(x_i, x_j)$ is defined as a kernel function with $K(x_i, x_j) = (f(x_i) \cdot f(x_j))$, and $(\cdot)$ denotes the inner product in the Hilbert space.

There are many kinds of kernel functions, such as the RBF kernel $K_{\mathrm{rbf}}(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ and the polynomial kernel $K_{\mathrm{poly}}(x_i, x_j) = ((x_i \cdot x_j) + 1)^d$, where $\gamma$ and $d$ are kernel parameters. If we choose a prescribed parameter $C$ and a proper kernel function $K(x_i, x_j)$, we can compute the solution $\alpha^* = (\alpha_1^*, \ldots, \alpha_l^*)^T$ of QP problem (11)–(13), and then $b^*$ is calculated as

$$
b^* = y_j - \sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x_j). \tag{14}
$$

Finally, we construct the decision function (15) that can be used to classify a new input $x$

$$
D(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x) + b^* \right). \tag{15}
$$

IV. EXPERIMENT

A. Data Description
We intend to explore the trend of a very important index in China, the SSE 50 Index, not only by using stock market data but also by exploiting news documents related to it and its constituents. The SSE 50 Index is a primarily blue-chip stock index on the Shanghai stock market, and it is made up of the 50 largest stocks of good liquidity and representativeness. Conventional time series data include opening price, closing price, high for the day, low for the day, trading volume in number of shares, trading volume in RMB, change in RMB, and change in percentage. We

download such data of the SSE 50 Index and its 50 constituents from the Wind Economic Database, which is the market leader in China's financial information service industry.

Accordingly, we applied the web crawler we built to download all the posts and documents of the 51 shares from the Sina stock forum and Eastmoney stock forum over the period between June 17th, 2014 and June 7th, 2016, including 485 trading days. The two forums are widely regarded as active and mainstream communities in China. The number of reviews of each stock is 37 855 on average, peaking at 23 236 and reaching the lowest point at 7797. The details are illustrated in the second column of Table I. The total number of the reviews on the Sina stock forum and Eastmoney stock forum is 1 930 592 after filtering and denoising during the given period.

TABLE I
STATISTICS OF SENTIMENT INDEXES

Stock  Number   Mean     Median   Std. Dev.  Skewness  Kurtosis
1      23236    0.35637  0.40967  0.26127    2.01860   8.80423
2      81248    0.53358  0.58165  0.20246    3.20545   20.72166
3      33196    0.21777  0.25837  0.21603    1.60730   9.98526
4      34591    0.32120  0.37197  0.21405    1.82987   8.50063
5      27811    0.47152  0.51065  0.17151    3.64575   25.45141
6      18754    0.29471  0.34524  0.21371    1.81329   9.32399
7      33537    0.36117  0.39134  0.17573    2.23436   15.70559
8      34968    0.52412  0.57139  0.17972    3.05854   19.38373
9      9853     0.28766  0.31879  0.19666    1.21639   9.65186
10     40644    0.42828  0.48220  0.22630    2.27251   10.46903
11     31386    0.25575  0.30260  0.20527    2.23891   11.81875
12     36994    0.45809  0.50773  0.20597    2.51271   14.05555
13     17032    0.18246  0.22299  0.25101    1.05804   5.89966
14     10766    0.19424  0.23456  0.20181    1.91364   10.09988
15     28200    0.39511  0.42887  0.16154    2.17121   16.48476
16     26608    0.13916  0.19250  0.21458    1.61710   7.60458
17     17052    0.17892  0.24357  0.24875    1.50026   5.94911
18     46914    0.18769  0.24859  0.23176    1.80984   8.71453
19     27024    0.64060  0.67804  0.16707    4.05740   31.82885
20     28158    0.19699  0.25102  0.20822    2.06887   10.11175
21     33764    0.24181  0.28347  0.20219    1.65773   8.80704
22     38000    0.31557  0.36938  0.20474    2.16160   11.06345
23     37925    0.40178  0.46159  0.21796    2.42526   11.64860
24     57848    0.63125  0.69425  0.25169    2.64671   12.65547
25     43594    0.04643  0.10727  0.26298    1.06220   5.51963
26     19409    0.37464  0.41017  0.20768    1.63800   10.20172
27     14538    0.40000  0.44882  0.19462    1.72010   10.35468
28     17470    0.39013  0.42699  0.17367    3.03270   20.76046
29     22766    0.41687  0.38462  0.14956    -0.78036  22.28815
30     26006    0.52197  0.57188  0.19350    2.46205   13.48589
31     41348    0.58017  0.63323  0.20431    2.50395   13.37901
32     30269    0.26305  0.30691  0.17272    1.92448   12.25382
33     37313    0.46347  0.52139  0.23625    1.91042   8.52865
34     24289    0.54739  0.62792  0.20351    3.30866   19.27568
35     16982    0.43495  0.48230  0.20400    2.15579   10.87906
36     79823    0.15132  0.20275  0.25479    0.94820   6.24546
37     21645    0.00751  0.01479  0.21188    -0.26603  7.62239
38     71082    0.35197  0.40319  0.20811    2.16227   11.59339
39     54386    0.50157  0.56257  0.21386    2.50068   13.03801
40     42746    0.30985  0.35117  0.19425    1.78007   10.41846
41     18339    0.59117  0.64865  0.20903    2.44041   13.36085
42     25493    0.53206  0.58415  0.20860    2.55991   14.14896
43     57196    0.53761  0.58005  0.17451    2.78345   18.64874
44     55932    0.44264  0.47679  0.14944    2.96612   22.85836
45     12858    0.11312  0.15774  0.20706    2.08796   11.04161
46     152533   0.17153  0.22287  0.23559    1.41102   6.72343
47     104189   0.58461  0.65802  0.23296    2.65091   13.82870
48     7797     0.55417  0.60022  0.16712    2.91624   20.59029
49     16293    0.31672  0.36630  0.24080    1.39273   7.65575
50     17807    0.48988  0.53705  0.19318    2.10251   11.62268
51     122980   0.42754  0.46735  0.19745    2.69997   16.05438

This table illustrates the major statistics of the 51 sentiment indexes. The second column shows the number of comments on each stock. The third to seventh columns, respectively, display the mean, median, standard deviation, skewness, and kurtosis of each stock sentiment index.

B. Sentiment Calculation
Under Steps 2–3 in Section III, we can compute 51 sentiment indexes for 51 stocks. In Step 2, we first segment each document into several sentences by identifying punctuation marks, such as "," "." and "!" Then sentences are divided into separate words, and if there appears a negative word, it is treated as a whole with the word next to it. For example, if people say "(I'm not satisfied with the stock)," after word segmentation, the sentence becomes four words "(I'm)" "(not)" "(satisfied with)" "(the)" "(stock)." If we directly project the words to the sentiment space, the program will tell us the sentence is optimistic because of the positive word "(satisfied with)." So, we need to treat "(not)" "(satisfied with)" as a whole "(not satisfied with)" so that we can find the true meaning. Then, we need to categorize each document. Assume there are $p_i$ positive sentences and $n_i$ negative sentences in document $i$; if $p_i > n_i$, the document is positive; if $p_i = n_i$, the document is neutral; if $p_i < n_i$, the document is negative. We then find that on day $t$, the number of positive comments is $M_t^{\mathrm{bull}}$ and the number of negative comments is $M_t^{\mathrm{bear}}$, so a daily sentiment index is calculated by using formula (1), with the value ranging from -1 to 1, where 0 means people hold a neutral position. If the value is between 0 and 1, people hold a positive view; if the value is between -1 and 0, people take a negative view. Then, by considering the day-of-week effect, the modified sentiment indexes are calculated according to (2) and (3). The third to seventh columns of Table I show the major statistics of the modified sentiment indexes, including the mean, median, standard deviation, skewness, and kurtosis, of each stock sentiment index. It demonstrates that the majority of investors have a positive view on the 51 stocks since the mean and median of most stocks are positive rather than negative. From Fig. 8 or 9, we can find that the price of the SSE 50 Index increases at first and then drops suddenly, but overall people are optimistic. It implies that people are not ready for the decrease. In fact, many individuals and companies lost a great deal during the period that is also called the stock market disaster in China.

Nevertheless, we do not utilize them directly to explore the stock market trend, but select 8 sentiment indicators based on 51 sentiment indexes. The reasons include that the number of the features of market data and sentiment indexes needs to be balanced. Although we have already computed 51 sentiment indexes, the number of market data is around 10, so it is unfair for the market attributes to some extent. Next, too many variables are inclined to cause the problem of overfitting. Furthermore, we find that using 8 sentiment features achieves a better result in forecasting the SSE 50 Index than using all 51 indexes. Inspired by the market attributes, sentiment features consist of the highest of modified sentiment indexes, the lowest of modified

TABLE II However, the two kinds of methods cannot be applied in


DESCRIPTION OF EIGHT FEATURES
forecasting stock market movement direction for the reason
that they lead to look-ahead bias, which is created by the use of
Features Description information or data that would not have been known or available
s1 The average of modified sentiment indexes during the period being analyzed. For example, we have some
s2 The highest of modified sentiment indexes data from January to May, and implement the fivefold cross
s3 The lowest of modified sentiment indexes
s4 The median of modified sentiment indexes
validation approach. When we use the data from February to
s5 The value between the highest and the lowest May as training set and the data in January as testing set, it is
s6 The change of the average impossible because in January, we will never know what will
s7 The percentage of the average change
s8 The standard deviation of modified sentiment indexes
happen from February to May. On the other hand, that does not
mean the method is useless, and it is an important procedure
to select the proper kernel functions and parameters as well
sentiment indexes, the median of modified sentiment indexes, as address the overfitting problem. In other words, the purpose
the average of modified sentiment indexes, the difference be- of the procedure is not to forecast but select the proper kernel
tween the highest and the lowest, the change of the average (a functions and parameters.
certain day’s average minus the last day’s average), the percent- We utilize a realistic rolling window approach to overcome
age of the average change (a certain day’s average minus the the challenge, and accordingly, we need to single out a best win-
last day’s average and divided by the last day’s average), and the dow for both experiments. The principle of choosing the rolling
standard deviation of 51 modified sentiment indexes. Table II windows is that we use n previous days to forecast the next
describes eight selected features. Fig. 3 sheds light on the trend of the eight variables after standardization over 484 trading days.

C. Prediction

First, we need to label the data according to the following equation:

\[
\mathrm{Label} =
\begin{cases}
1, & \mathrm{Close}_{t-1} < \mathrm{Close}_{t} \\
-1, & \mathrm{Close}_{t-1} > \mathrm{Close}_{t}
\end{cases}
\tag{16}
\]

where $\mathrm{Close}_{t}$ denotes the close price of the SSE 50 Index, and $\mathrm{Close}_{t-1}$ stands for the close price on the previous day. Besides, 1 also means buy order as it indicates the increase, whereas −1 means sell order as it implies the decline.

Next, we implement two experiments to predict the index movement direction. Experiment 1 uses market data only, which include opening price, closing price, high for the day, low for the day, trading volume in number of shares, trading volume in RMB, change in RMB, and change in percentage. Experiment 2 combines these market data with the sentiment features. We employ classification accuracy Acc to assess the performance, as shown in the following equation:

\[
\mathrm{Acc} = \frac{T_{++} + T_{--}}{T_{++} + T_{--} + F_{-+} + F_{+-}}
\tag{17}
\]

where $T_{++}$ denotes that the true value is +1 and the prediction value is also +1; $T_{--}$ denotes that the true value is −1 and the prediction value is also −1; $F_{+-}$ denotes that the true value is +1, whereas the prediction value is −1; $F_{-+}$ denotes that the true value is −1, whereas the prediction value is +1.

A fivefold cross-validation approach is adopted to train an SVM model. Eventually, we find the proper parameters and the kernel functions to achieve the best performance. Figs. 4 and 5 document the processes of parameter selection. Panel A of Table III sheds light on the prediction results. For Experiment 1, the accuracy can be 79.96%, and we use RBF kernel function, C = 256, γ = 0.9942; for Experiment 2, the accuracy can be as high as 97.73%, and we employ RBF kernel function, C = 181.0193, γ = 0.005524.

[...] day’s movement direction, repeat the procedure, and change the value of n until an SVM model achieves the highest accuracy with the parameters and the kernel function we have already selected. Figs. 6 and 7 display the rolling window choosing processes. For Experiment 1, the optimal rolling window is 68, and the highest accuracy is 71.33%; for Experiment 2, the optimal rolling window is 76, and the highest accuracy is 89.93%. It is clear from each figure that the accuracy is relatively stable around the optimal rolling window. The accuracy remains below its maximum for windows longer than 80 days in Fig. 6 and 90 days in Fig. 7; these ranges are not shown in the figures because of their size.

Panel B of Table III sheds light on the prediction accuracy of SVM with rolling windows. We can see from the table that adding sentiment features to the baseline model helps boost the prediction performance significantly. The accuracy of forecasting the market index movement direction can be as high as 89.93%, probably because investor sentiment plays a very important role in the stock market. Furthermore, sentiment contains valuable knowledge about the asset values and can be considered as one of the leading indicators of the stock market.

In addition, logistic regression (LR) is used to reexamine the conclusions. LR is a very important method in prediction, and it is good at modeling the probability of a response based on a set of predictor variables. In order to compare with SVM, fivefold cross validation is also applied to forecast the movement direction of the SSE 50 Index. For Experiment 1, the accuracy is 70.96%, obtained with gradient descent (GD) and a maximum of 600 iterations; for Experiment 2, the accuracy is 86.59%, obtained with stochastic GD and a maximum of 1000 iterations. Panel C of Table III confirms that investor sentiment is vital to the stock prices. It also illustrates that the accuracy of LR with fivefold cross validation is acceptable, but lower than that of SVM with fivefold cross validation and also lower than that of SVM with a rolling window approach, suggesting that our method is realistic and efficient.
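The labeling rule in (16) and the accuracy measure in (17) can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, and the handling of unchanged closing prices (labeled 0 here and left out of the accuracy count) is an assumption, since (16) does not define that case.

```python
def label_direction(closes):
    """Label each day t >= 1 following Eq. (16): +1 if the close rose, -1 if it fell."""
    labels = []
    for prev, cur in zip(closes, closes[1:]):
        if prev < cur:
            labels.append(1)
        elif prev > cur:
            labels.append(-1)
        else:
            labels.append(0)  # assumption: Eq. (16) leaves ties undefined
    return labels


def accuracy(y_true, y_pred):
    """Classification accuracy following Eq. (17), counting only +1/-1 pairs."""
    pairs = list(zip(y_true, y_pred))
    t_pp = sum(1 for t, p in pairs if t == 1 and p == 1)    # T++
    t_mm = sum(1 for t, p in pairs if t == -1 and p == -1)  # T--
    f_pm = sum(1 for t, p in pairs if t == 1 and p == -1)   # F+-
    f_mp = sum(1 for t, p in pairs if t == -1 and p == 1)   # F-+
    total = t_pp + t_mm + f_pm + f_mp
    if total == 0:
        raise ValueError("no +1/-1 pairs to score")
    return (t_pp + t_mm) / total


if __name__ == "__main__":
    closes = [100.0, 102.0, 101.0, 101.5]   # made-up prices, not SSE 50 data
    print(label_direction(closes))
    print(accuracy([1, -1, 1, -1], [1, -1, -1, -1]))
```

With the made-up prices above, three labels are produced for four closing prices, because the first day has no previous close to compare against.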

Fig. 3. Sentiment features in trading days.

Fig. 4. Parameter selection process for Experiment 1.

Fig. 5. Parameter selection process for Experiment 2.

D. Investment Performance

This section tries to discover if the prediction results are of benefit to the investment. Some measures are employed to evaluate and compare the performance of the methods. Accumulated income (AI) is computed based on the stock points. For example, if we buy a stock at the price of 100 and sell it at 150, then we earn 50 stock points and AI is 50 stock points; after that, we short the equity at 150 and liquidate the position at 120, then we make 30 stock points and AI becomes 80 stock points. Maximum drawdown (MDD) is the maximum decline of a series from a peak to a trough over a specified time period [26]. MDD at time T is expressed as

\[
\mathrm{MDD} = \sup_{t \in [0,T]} \left[ \sup_{s \in [0,t]} X(s) - X(t) \right]
\tag{18}
\]

where X(t) is a random process on [0, T]. MDD time illustrates when MDD occurs. The expected maximum drawdown (EMD) is an estimate of the average maximum loss, based on a geometric Brownian motion assumption. MDD and EMD are regarded as indicators of downside risk. Sharpe ratio (SR) is a way to gauge the performance of an investment by calculating the risk-adjusted return [27], which is defined as

\[
\mathrm{SR} = \frac{r_a - r_f}{\sigma_a}
\tag{19}
\]

where $r_a$ is the mean of the asset returns, $\sigma_a$ is the standard deviation of the asset returns, and $r_f$ denotes the risk-free rate, which is set to 0 in this paper.

We have previously mentioned that the label 1 means buy order, whereas −1 means sell order. Accordingly, we follow the prediction results to buy or sell, then find if it is beneficial to support investment decisions and reduce financial risk. We postulate that short selling mechanism is allowed and there are no market frictions. The prediction results of Experiment 1 and

TABLE III
PREDICTION RESULTS

Panel A: Prediction accuracy of support vector machine (SVM) with fivefold cross validation

                Accuracy    C           γ
Experiment 1    0.7996      256         0.9942
Experiment 2    0.9773      181.0193    0.0055

Panel B: Prediction accuracy of SVM with rolling windows

                Accuracy    C           γ         Rolling window (days)
Experiment 1    0.7133      256         0.9942    68
Experiment 2    0.8993      181.0193    0.0055    76

Panel C: Prediction accuracy of logistic regression (LR) with fivefold cross validation

                Accuracy    Max. iterations    Optimization
Experiment 1    0.7096      600                Gradient descent (GD)
Experiment 2    0.8659      1000               Stochastic GD

Fig. 6. Rolling window choosing process for Experiment 1.

Fig. 7. Rolling window choosing process for Experiment 2.
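The window-selection scan behind Figs. 6 and 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the model is refit on the previous n days and used to predict the next day's direction, and it uses a majority-vote rule as a stand-in classifier so that the block runs with the standard library only. In practice the `fit_predict` argument (a hypothetical hook) would wrap an RBF-kernel SVM with the C and γ values already selected.

```python
def rolling_window_accuracy(features, labels, n, fit_predict):
    """Fit on the previous n days, predict day t, then slide forward one day."""
    hits = 0
    trials = 0
    for t in range(n, len(labels)):
        pred = fit_predict(features[t - n:t], labels[t - n:t], features[t])
        hits += int(pred == labels[t])
        trials += 1
    return hits / trials if trials else float("nan")


def choose_window(features, labels, candidates, fit_predict):
    """Scan candidate window sizes and return the most accurate one with all scores."""
    scores = {n: rolling_window_accuracy(features, labels, n, fit_predict)
              for n in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def majority_vote(train_x, train_y, test_x):
    """Stand-in classifier (NOT an SVM): predict the most frequent training label."""
    return 1 if sum(train_y) >= 0 else -1


if __name__ == "__main__":
    # Toy data for illustration only; the features are ignored by majority_vote.
    toy_features = list(range(8))
    toy_labels = [1, 1, -1, -1, -1, 1, 1, 1]
    print(choose_window(toy_features, toy_labels, [2, 3], majority_vote))
```

Note that each window size is scored on a different number of test days in this sketch (the first n days are used only for training); whether the paper aligns the test periods across window sizes is not stated.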



Fig. 8. Accumulated income (AI) for Experiment 1.

Fig. 9. AI for Experiment 2.
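The measures used in this section (AI, MDD as in (18), and SR as in (19)) can be illustrated with a small Python sketch. The function names are hypothetical and several choices are assumptions rather than details given in the paper: the position is taken at one close and settled at the next, MDD is computed in absolute units exactly as (18) is written, and SR uses the sample standard deviation with the risk-free rate defaulting to 0.

```python
import statistics


def accumulated_income(prices, signals):
    """Cumulative stock points: signals[i] (+1 buy, -1 short) is held from prices[i] to prices[i + 1]."""
    total = 0.0
    curve = []
    for i, signal in enumerate(signals):
        total += signal * (prices[i + 1] - prices[i])
        curve.append(total)
    return curve


def max_drawdown(series):
    """Largest drop from a running peak to a later value, as in Eq. (18)."""
    peak = series[0]
    mdd = 0.0
    for x in series:
        peak = max(peak, x)
        mdd = max(mdd, peak - x)
    return mdd


def sharpe_ratio(returns, risk_free=0.0):
    """Eq. (19): (mean return - risk-free rate) / standard deviation of returns."""
    # Assumption: sample standard deviation; the paper does not say which estimator is used.
    return (statistics.mean(returns) - risk_free) / statistics.stdev(returns)


if __name__ == "__main__":
    # The worked example from the text: buy at 100, sell at 150, short at 150, cover at 120.
    print(accumulated_income([100.0, 150.0, 120.0], [1, -1]))
    print(max_drawdown([100.0, 120.0, 90.0, 110.0, 80.0]))
```

The first call reproduces the text's example of 50 stock points followed by 80 stock points.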

TABLE IV
INVESTMENT PERFORMANCE

                           AI (stock points)    SR        MDD       MDD time (trading days)    EMD
Experiment 1               404.8598             0.3263    0.4073    50–85 (35 days)            0.2546
Experiment 2               916.6264             0.8263    0.3770    157–185 (28 days)          0.1882
Experiment 2 (strategy)    1300.4639            1.2248    0.3034    306–384 (78 days)          0.1572

Experiment 2 are, respectively, utilized to compute AI and MDD. Figs. 8 and 9 demonstrate AI compared with the trend of the closing price of the SSE 50 Index, and highlight the MDD interval of AI simultaneously. It can be seen from the line graphs that the AI of Experiment 2 (916.6264 stock points) is more than twice that of Experiment 1 (404.8598 stock points). Moreover, although both the methods fail to detect the dramatic decline at first, Experiment 2 predicts the trend afterward, and is able to uncover the following rise. The sharp decrease is known as the Chinese stock market crash in 2015. Besides, the MDD of Experiment 2 is 0.3770, whereas the MDD of Experiment 1 is 0.4073. Similarly, the EMD of the next 30 days for Experiment 2 (0.1882) is also lower than that of Experiment 1 (0.2546). This implies that sentiment features help to reduce risks for investors and institutions. In addition, SR significantly went up by adding sentiment variables, which indicates that the results of

Fig. 10. Stop-loss order strategy for Experiment 2.

Experiment 2 can help investors make higher profits with the same risk.

Finally, we try to discover whether we can achieve a better performance based on the prediction results of Experiment 2. Hence a stop-loss order strategy is applied to limit the potential losses. We set the stop order to be 95 stock points, which means that we would stop trading if 95 stock points had already been lost in a trading day. The strategy accomplishes a much better performance, and all of the measures that are displayed in the third row of Table IV improve significantly. From Figs. 9 and 10, we can point out that a stop-loss order can put an end to a losing period, but it cannot turn it into a win. However, the reduced loss also means an increase in the final AI. As a result, we can deduce that our approach can be of great benefit to investors if combined with a proper strategy, for example a stop-loss order strategy. In other words, the method is useful to decision-making processes that are pervasive phenomena of nature [28].

V. CONCLUSION AND FUTURE WORK

In this paper, we aim to exploit investor sentiment to forecast stock market movement direction by emphasizing the role of investors. Investor psychology drives the stock market [1] and it matters for our research. Accordingly, user-generated content on the Internet provides a precious source to reflect investor psychology. Sentiment analysis is used to convert unstructured textual documents into daily sentiment indexes. Furthermore, the day-of-week effect, a financial anomaly in which the average return on Mondays is much lower than that on the other days of the week, probably influences the precision of the sentiment indexes, so we adjust the indexes by introducing an exponential function on past sentiment changes on weekends and then generalize to holidays. Correspondingly, Sina Finance and Eastmoney, two typical financial websites, were selected as experimental platforms to obtain a corpus of financial review data. Then, the machine learning model SVM is employed to predict a very important index in China, the SSE 50 Index, by implementing fivefold cross validation and a realistic rolling window approach. Empirical results illustrate that combining sentiment features with stock market data can achieve a much better performance than just using stock market data in forecasting movement direction. The accuracy can be as high as 89.93%, a rise of 18.6 percentage points after introducing sentiment variables. Furthermore, if combined with a stop-loss order strategy, our approach can help investors reduce risks and make wiser decisions. In addition, we find that sentiment probably contains precious information about the asset fundamental values and can be regarded as one of the leading indicators of the stock market. For future work, we consider expanding the time interval, which also means crawling more textual documents from the Internet. It is also imperative to boost the efficiency of the method to process voluminous data in real time.

REFERENCES

[1] R. J. Shiller, Irrational Exuberance. Princeton, NJ, USA: Princeton Univ. Press, 2000.
[2] I. Perikos and I. Hatzilygeroudis, “Recognizing emotions in text using ensemble of classifiers,” Eng. Appl. Artif. Intell., vol. 51, pp. 191–201, 2016.
[3] B. Wu, X. Zhou, Q. Jin, F. Lin, and H. Leung, “Analyzing social roles based on a hierarchical model and data mining for collective decision-making support,” IEEE Syst. J., vol. 11, no. 1, pp. 356–365, Mar. 2017.
[4] B. Liu and L. Zhang, “A survey of opinion mining and sentiment analysis,” Mining Text Data. New York, NY, USA: Springer, 2012.
[5] R. J. Shiller, “From efficient markets theory to behavioral finance,” J. Econ. Perspectives, vol. 17, no. 1, pp. 83–104, 2003.
[6] F. C. Kelly, Why You Win or Lose: The Psychology of Speculation. North Chelmsford, Massachusetts, USA: Courier Corp., 2003.
[7] E. D. Maberly, “Eureka! Eureka! Discovery of the Monday effect belongs to the ancient scribes,” Financial Anal. J., vol. 51, pp. 10–11, 1995.
[8] J. Zhang, Y. Lai, and J. Lin, “The day-of-the-week effects of stock markets in different countries,” Finance Res. Lett., vol. 20, pp. 47–62, 2017.
[9] W. Huang, Y. Nakamori, and S.-Y. Wang, “Forecasting stock market movement direction with support vector machine,” Comput. Oper. Res., vol. 32, no. 10, pp. 2513–2522, 2005.
[10] L. Yu, H. Chen, S. Wang, and K. K. Lai, “Evolving least squares support vector machines for stock market trend mining,” IEEE Trans. Evol. Comput., vol. 13, no. 1, pp. 87–102, 2009.

[11] C. C. Aggarwal and C. Zhai, Mining Text Data. New York, NY, USA: Springer, 2012.
[12] X. Li, H. Xie, L. Chen, J. Wang, and X. Deng, “News impact on stock price return via sentiment analysis,” Knowl.-Based Syst., vol. 69, pp. 14–23, 2014.
[13] M. Baker and J. Wurgler, “Investor sentiment and the cross-section of stock returns,” J. Finance, vol. 61, no. 4, pp. 1645–1680, 2006.
[14] A. Edmans, D. Garcia, and Ø. Norli, “Sports sentiment and stock returns,” J. Finance, vol. 62, no. 4, pp. 1967–1998, 2007.
[15] P. C. Tetlock, “Giving content to investor sentiment: The role of media in the stock market,” J. Finance, vol. 62, no. 3, pp. 1139–1168, 2007.
[16] J. Bollen, H. Mao, and X. Zeng, “Twitter mood predicts the stock market,” J. Comput. Sci., vol. 2, no. 1, pp. 1–8, 2011.
[17] R. A. Gillam, J. B. Guerard, and R. Cahan, “News volume information: Beyond earnings forecasting in a global stock selection model,” Int. J. Forecast., vol. 31, no. 2, pp. 575–581, 2015.
[18] N. Oliveira, P. Cortez, and N. Areal, “Stock market sentiment lexicon acquisition using microblogging data and statistical measures,” Decis. Support Syst., vol. 85, pp. 62–73, 2016.
[19] V. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer, 2013.
[20] M. Cecchini, H. Aytug, G. J. Koehler, and P. Pathak, “Detecting management fraud in public companies,” Manage. Sci., vol. 56, no. 7, pp. 1146–1160, 2010.
[21] Y. Kara, M. A. Boyacioglu, and Ö. K. Baykan, “Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul stock exchange,” Expert Syst. Appl., vol. 38, no. 5, pp. 5311–5319, 2011.
[22] P.-F. Pai and C.-S. Lin, “A hybrid ARIMA and support vector machines model in stock price forecasting,” Omega, vol. 33, no. 6, pp. 497–505, 2005.
[23] D. D. Wu, L. Zheng, and D. L. Olson, “A decision support approach for online stock forum sentiment analysis,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 44, no. 8, pp. 1077–1087, Aug. 2014.
[24] Z. Dong, Q. Dong, and C. Hao, “HowNet and its computation of meaning,” in Proc. 23rd Int. Conf. Comput. Linguistics, Demonstrations, 2010, pp. 53–56.
[25] N. Barberis, R. Greenwood, L. Jin, and A. Shleifer, “X-CAPM: An extrapolative capital asset pricing model,” J. Financial Econ., vol. 115, no. 1, pp. 1–24, 2015.
[26] M. Magdon-Ismail, A. F. Atiya, A. Pratap, and Y. S. Abu-Mostafa, “On the maximum drawdown of a Brownian motion,” J. Appl. Probab., vol. 41, no. 1, pp. 147–161, 2004.
[27] W. F. Sharpe, “The Sharpe ratio,” J. Portfolio Manage., vol. 21, no. 1, pp. 49–58, 1994.
[28] V. Shukla, G. Auriol, and K. W. Hipel, “Multicriteria decision-making methodology for systems engineering,” IEEE Syst. J., vol. 10, no. 1, pp. 4–14, Mar. 2016.

Rui Ren (M’17) is currently working toward the Ph.D. degree at the School of Economics and Management, University of Chinese Academy of Sciences, Beijing, China.
Her research interests include sentiment analysis, text mining, and behavioral finance.

Desheng Dash Wu (M’09–SM’14) is with the School of Economics and Management, University of Chinese Academy of Sciences, Beijing, China, and also with the Stockholm Business School, Stockholm University, Stockholm, Sweden. He has authored or coauthored more than 100 papers in refereed journals such as Production and Operations Management, Decision Support Systems, Decision Sciences, Risk Analysis, IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS, etc. He is the Editor of the Springer book series entitled “Computational Risk Management.” His research interests include enterprise risk management in operations, performance evaluation in financial industry, and decision sciences.
Dr. Wu has been an Associate Editor/Guest Editor for the IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS, Annals of Operations Research, Computers and Operations Research, International Journal of Production Economics, Omega, etc. He is elected member of the European Academy of Sciences and Arts.

Tianxiang Liu is currently working toward the Ph.D. degree at the School of Economics and Management, University of Chinese Academy of Sciences, Beijing, China.
His research interests include investment and asset pricing.
