Decision Tree Method in Financial Analysis of Listed Logistics Companies
Abstract—This paper first introduces the decision tree algorithm and the C5.0 algorithm in data mining. It then introduces financial analysis methods, the problems that need attention in their application, and the process of selecting attributes. Finally, we study the financial ratios of listed logistics companies using SPSS Clementine 12.0. The accuracy of the resulting model is as high as 95.83%.

Keywords-decision tree; listed logistics companies; profits; financial ratios

I. INTRODUCTION

With the rapid development of computer technology, various industries have accumulated large amounts of data, and the amount of data is increasing day by day. People are aware that these data hold a vast reservoir of knowledge. However, it is impossible to tap this knowledge by relying on human understanding alone. The community called for powerful data mining tools, so data mining came into being. The concept of "data mining" was first used by Usama Fayyad in 1995 in Montreal, Canada, at the first International Conference on Knowledge Discovery and Data Mining. The technical definition of data mining is as follows: data mining, also known as knowledge discovery in databases, is the non-trivial process of obtaining valid, novel, potentially useful and ultimately understandable patterns from large amounts of data. To put it simply, data mining is to extract, or "mine", knowledge from large amounts of data.

Currently, data mining has preliminary applications in customer relationship management, product design, finance and securities, telecommunications, the military, biomedicine and other fields. The application of data mining in the financial area focuses on the study of financial early-warning models. Of course, the financial early-warning model is a key point in the financial area, but other areas of finance cannot be ignored. In this paper, we use the decision tree method in data mining to analyze which financial ratios correlate strongly with the profit growth of listed logistics companies. We hope this paper can attract valuable opinions and lead more scholars to apply data mining in various financial fields.

II. DECISION TREE METHOD

A. Decision Tree Principles

The foundation of decision tree learning is the Concept Learning System (CLS) framework, which was proposed by Hunt et al. in 1960. A decision tree is a tree structure similar to a flow chart, in which each internal node (non-leaf node) represents a test on an attribute, that is, a split attribute. The basic steps of building a decision tree classification model are as follows. First, we divide the sample data into training samples and test samples according to a given proportion. Second, we generate a decision tree model from the training samples. There
Authorized licensed use limited to: IEEE Xplore. Downloaded on November 03,2010 at 10:23:33 UTC from IEEE Xplore. Restrictions apply.
are two key points in model generation. One is the selection of split attributes. The attribute selection criteria include information gain, which was introduced by Quinlan, the information gain ratio and the minimum Gini index. We usually hope that the tree can grow as much as possible. However, although this increases accuracy on the training samples, it reduces accuracy on the test samples; this is the phenomenon known as over-fitting. So the other key point is to handle over-fitting through pruning, which is divided into pre-pruning and post-pruning. Third, we use the decision tree to classify the test samples to obtain useful conclusions.

B. C5.0 Algorithm

Commonly used decision tree algorithms are ID3, C4.5, C5.0, CART, CHAID, PUBLIC, SLIQ and SPRINT. The ID series of algorithms is the most influential family of decision tree algorithms internationally, and the C5.0 algorithm is based on it. The attribute selection metric of the C5.0 algorithm, the gain ratio, is calculated as follows.

Information entropy: Information entropy measures the overall uncertainty of an information source X. Suppose X is a collection of sample data containing x samples. Assume that the class label attribute takes n different values, which define n different classes $C_i$ ($i = 1, \dots, n$). Suppose the number of samples in class $C_i$ is $x_i$; then the probability that any sample belongs to $C_i$ is $p_i = x_i / x$. The information entropy is then:

$$I(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

Amount of information: Suppose the training data set D consists of two sets, the YD set and the ND set, which contain y and n samples respectively (here n denotes the size of the ND set). Then the amount of information is:

$$I(y, n) = -\frac{y}{y+n}\log_2\frac{y}{y+n} - \frac{n}{y+n}\log_2\frac{n}{y+n}$$

Expected information: If attribute A is used as the root of the decision tree and A has v values $\{a_1, a_2, \dots, a_v\}$, then A divides D into v subsets $\{D_1, D_2, \dots, D_v\}$. Suppose subset $D_i$ contains $y_i$ samples belonging to the YD set and $n_i$ samples belonging to the ND set. Then the expected information required for the subsets is:

$$E(A) = \sum_{i=1}^{v} \frac{y_i + n_i}{y + n}\, I(y_i, n_i)$$

Information gain: The information gain obtained by using A as the root is:

$$Gain(A) = I(y, n) - E(A)$$

Split information: The search strategy of the ID series of algorithms favors shorter trees over longer ones, which leads to an inductive bias. For example, suppose a training data set has n samples and attribute A takes a different value for each sample. Then this attribute has the largest information gain on the training data set, because it alone perfectly predicts the target attribute of the training data. It will therefore be selected as the decision attribute of the root node, producing a decision tree that is very wide but has a depth of one. We can imagine that when such a decision tree model is applied to the test samples, its performance will be poor. Therefore, to avoid this bias and make up for the shortcoming of the ID series of algorithms, we use split information. Split information is used to measure the breadth and
uniformity of data. The formula for split information is:

$$SplitInfo(A) = -\sum_{i=1}^{v} \frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$

where $D_i$ is the subset of the training data set D on which attribute A takes its i-th value. So finally we obtain the gain ratio, where $Gain(A)$ is the information gain of A:

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}$$

In fact, C5.0 not only changes the metric, adding split information to deal with the inductive bias, but can also discretize continuous attributes. More importantly, the C5.0 algorithm is a classification algorithm that can be applied to large data sets, and it is improved in execution efficiency and memory use. In practice we usually deal with large data sets, so the C5.0 algorithm has more practical significance. Based on these advantages, we select the C5.0 algorithm to build the decision tree model.

III. FINANCIAL ANALYSIS

A. Introduction to Financial Analysis Methods

Financial analysis methods include trend analysis, ratio analysis and factor analysis.

Trend analysis compares the same indicator across two or more consecutive financial reports to determine the direction, amount and magnitude of its increase or decrease, and thereby explains the trend in the enterprise's financial condition or operating results. Trend analysis has the advantage of being simple and intuitive. Its shortcomings are as follows: the method requires comparing indicators from different periods, but the basis of calculation is sometimes inconsistent between periods; the method cannot exclude the impact of occasional items, so the data analyzed may not reflect normal operating conditions; and the method neither analyzes in depth the indicators that change significantly nor studies the causes of the change, so we are unable to take measures to achieve profit and avoid loss.

Ratio analysis determines the level of economic activity by calculating various ratios. Because a ratio is a relative number, this approach can turn non-comparable indicators into comparable ones. Ratio analysis has the advantages that the calculation is simple and the results are relatively easy to judge. But ratio analysis is not perfect, so we should pay attention to the following issues when applying it: the numerator and denominator used to calculate a ratio must have a logical relationship (such as a cause-and-effect relationship), so that the ratio is informative, in other words, so that it has financial significance; and the numerator and denominator must be consistent in the time period and scope of calculation.

Factor analysis is based on the relationship between the indicator under analysis and its driving factors, and determines quantitatively the direction and degree of each factor's impact. Factor analysis can analyze both the combined impact of various factors on an economic indicator and the separate impact of a single factor, so it is applied widely in financial analysis. In actual application, however, we should pay attention to the following issues: the factors that constitute the economic indicator must have an objective cause-and-effect relationship with it and must reflect the inherent causes of the differences in the indicator, otherwise the analysis loses its value; the factors must be substituted in a fixed order that follows their interdependence, otherwise different results will be obtained; and the chain of calculation must be maintained so that the sum of the effects of the individual factors equals the total change in the indicator under analysis, so that we can fully explain the reasons for the change.
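The chain (successive substitution) calculation used in factor analysis can be illustrated with a minimal Python sketch. It assumes a purely multiplicative indicator; the factor names and figures below are hypothetical and are not taken from the companies studied in this paper. The sketch shows the two properties noted for factor analysis: the individual effects sum to the total change, and the effect attributed to each factor depends on the substitution order.

```python
from typing import Dict, List


def chain_substitution(base: Dict[str, float],
                       actual: Dict[str, float],
                       order: List[str]) -> Dict[str, float]:
    """Attribute the change in a multiplicative indicator to its factors.

    Factors are switched from their base value to their actual value one
    at a time, in the given order; each factor's effect is the change in
    the indicator caused by its own substitution.
    """
    def indicator(values: Dict[str, float]) -> float:
        result = 1.0
        for name in order:
            result *= values[name]
        return result

    current = dict(base)            # start from the base-period values
    previous = indicator(current)
    effects: Dict[str, float] = {}
    for name in order:
        current[name] = actual[name]        # substitute one factor
        value = indicator(current)
        effects[name] = value - previous    # effect of this factor alone
        previous = value
    return effects


# Hypothetical figures: profit = volume * unit_price * margin
base = {"volume": 100.0, "unit_price": 8.0, "margin": 0.20}
actual = {"volume": 110.0, "unit_price": 9.0, "margin": 0.18}

effects = chain_substitution(base, actual, ["volume", "unit_price", "margin"])
reordered = chain_substitution(base, actual, ["margin", "unit_price", "volume"])
total_change = 110.0 * 9.0 * 0.18 - 100.0 * 8.0 * 0.20

print(effects)
print(reordered)
print(total_change)
```

Both orderings account for the same total change, but the share assigned to each factor differs, which is why the substitution order has to be fixed in advance.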
Solvency Short-term solvency enterprise's financial condition and operating
We obtained the 2007 financial ratios and the 2007 and 2008 net profits from the GuoTaiAn database. After deleting the companies whose data are incomplete, 24 logistics companies remain.
C. Modeling thought:
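Because the model is built with C5.0, whose attribute selection metric is the gain ratio of Section II.B, a small Python sketch of that metric may help. It is an illustration of the formulas only, not the C5.0 implementation used in Clementine; the function names and the toy data are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Information entropy (base 2) of the value distribution in items."""
    total = len(items)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(items).values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Gain ratio of one attribute: information gain / split information."""
    total = len(labels)
    subsets: dict = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Expected information after splitting on the attribute
    expected = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - expected
    # Split information is the entropy of the attribute's own values
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): four samples, two classes
labels: List[str] = ["Y", "Y", "N", "N"]
unique_id = [1, 2, 3, 4]          # a different value for every sample
group = ["a", "a", "b", "b"]      # two values, each one pure

print(gain_ratio(unique_id, labels))
print(gain_ratio(group, labels))
```

Both attributes have the same information gain on this toy data, but the attribute with a distinct value per sample has larger split information and therefore a lower gain ratio, which is the correction for inductive bias described in Section II.B.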
lose. We can extract the criteria for profit growth or reduction in 2008. The criterion for profit growth in 2008 is an interest coverage ratio less than or equal to 37.236. There are two criteria for a profit decline in 2008: one is an interest coverage ratio greater than 37.236, and the other is an asset-liability ratio greater than 0.440 and less than or equal to 0.553.

Experiment 2: Test the accuracy of the model

We add the "whether growth" model after the "type" node and then add the "analysis" node. After execution we get the following results:

TABLE 2: COMPARING $C-IS GROWTH AND WHETHER GROWTH

Result    Count   Percentage
Correct   23      95.83%
Mistake   1       4.17%
Total     24

As shown in the table, of the 24 selected companies the model judges 23 correctly and only one incorrectly. So the correct rate of the decision tree model is 95.83%, which shows that the model has high accuracy.

IV. CONCLUSION

In this paper, we analyze 15 financial ratios of listed logistics companies. We conclude that the interest coverage ratio and the asset-liability ratio play an important role in whether the next year's profit will rise. The model finds a split point of 37.236 for the interest coverage ratio and two split points, 0.440 and 0.553, for the asset-liability ratio. The correct rate of this model is 95.83%. By monitoring these two financial indicators we can predict whether next year's profit will rise or fall. A predicted decrease should arouse the attention of management, who should then find the reason and take measures before the enterprise falls into financial distress. In addition, when using this model we should pay attention to the treatment of missing values and the prevention of over-fitting.

ACKNOWLEDGMENT

This work was funded by the Beijing Municipal Education Commission project "Logistics cost research theory and methodology" (SM200810037006) and the Research and Innovation in Business Administration base of Beijing Wuzi University.