Clustering Approaches for Financial Data Analysis
N-A. Le-Khac: School of Computer Science & Informatics, University College Dublin, Ireland (Corresponding author: [email protected]).
F. Cai: School of Computer Science & Informatics, University College Dublin, Ireland ([email protected]).
M-T. Kechadi: School of Computer Science & Informatics, University College Dublin, Ireland ([email protected]).

Abstract—Nowadays, financial data analysis is becoming increasingly important in the business market. As companies collect more and more data from daily operations, they expect to extract useful knowledge from the collected data to help make reasonable decisions on new customer requests, e.g. user credit category, confidence of expected return, etc. Banking and financial institutions have applied different data mining techniques to enhance their business performance. Among these techniques, clustering is considered a significant method for capturing the natural structure of data. However, there are not many studies on clustering approaches for financial data analysis. In this paper, we evaluate different clustering algorithms on financial datasets ranging from time series to transactions. We also discuss the advantages and disadvantages of each method to improve the understanding of the inner structure of financial datasets as well as the capability of each clustering method in this context.

Keywords-clustering; partitioning clustering; density-based clustering; financial datasets

I. INTRODUCTION
Today, we have a deluge of financial datasets. Faster and cheaper storage technology allows us to store ever-greater amounts of data. Because the data sources are so large, it is not possible for a human analyst to come up with interesting information (or patterns) that will help in the decision-making process. Global competition, dynamic markets, and rapid development in information and communication technologies are some of the major challenges in today's financial industry. For instance, financial institutions are in constant need of more analysis of data that is becoming ever larger and more complex. As the amount of available data constantly increases, processing it becomes more and more difficult. Efficient discovery of useful knowledge from these datasets is therefore becoming a challenge and a massive economic need.

On the other hand, data mining (DM) is the process of extracting useful, often previously unknown information, so-called knowledge, from large datasets (databases or data). This mined knowledge can be used for various applications such as market analysis, fraud detection, customer retention, etc. Recently, DM has proven to be very effective and profitable in analysing financial datasets [1]. However, mining financial data presents special challenges: complexity, external factors, confidentiality, heterogeneity, and size. The data miner's challenge is to find the trends quickly while they are valid, as well as to recognise the time when the trends are no longer effective. Moreover, designing an appropriate process for discovering valuable knowledge in financial data is a very complex task.

Different DM techniques have been proposed in the literature for data analysis in various financial applications. For instance, decision trees [2] and first-order learning [3] are used in stock selection. Neural networks [4] and support vector machines [5] have been used to predict bankruptcy, and nearest-neighbour classification [6] for fraud detection. These techniques have also been used for analysing financial time series [7], imputed financial data [8], outlier detection [9], etc. However, few clustering techniques have been applied in this domain compared to other techniques such as classification and regression [2].

In this paper, we survey different clustering algorithms for analysing different financial datasets for a variety of applications: credit card fraud detection, investment transactions, stock market, etc. We discuss the advantages and disadvantages of each method in relation to a better understanding of the inner structure of financial datasets as well as the capability of each clustering method in this context. In other words, the purpose of this research is to provide an overview of how basic clustering methods have been applied to financial data analysis.

The rest of this paper is organised as follows. In Section II, we briefly present different financial data mining techniques that can be found in the literature. Section III briefly describes different clustering techniques used in this domain. We evaluate and discuss the advantages and disadvantages of these clustering methods in Section IV. We conclude and discuss some future directions in Section V.

II. DATA MINING IN FINANCE

A. Association Rules
Association rule mining is a DM technique, also known as association analysis, which is useful for discovering interesting relationships hidden in large datasets. These relationships can be represented in the form of association rules or sets of frequent itemsets [2]. This technique can be applied to analyse data in different domains such as finance, earth science, bioinformatics, medical diagnosis, web mining, and scientific computation.
In finance, association analysis is used, for instance, in customer profiling, which builds profiles of different groups from the company's existing customer database. The information obtained from this process can help in understanding business performance, making new marketing initiatives, analysing risks, and revising company customer policies. Moreover, loan payment prediction, customer credit policy analysis, marketing and customer care can also use association analysis to identify important factors and eliminate irrelevant ones.

B. Classification
Classification is another DM approach, which assigns objects to one of a set of predefined categories. It uses training examples, such as pairs of input and output targets, to find an appropriate target function, also known informally as a classification model. The classification model is useful for both descriptive and predictive modelling [2]. In finance, classification approaches are also used in customer profiling by building predictive models where predicted values are categorical. Financial market risk, credit scoring/rating, portfolio management, and trading also apply this approach to group similar data together.

Classification can be considered one of the important analytical methods in computational finance. Rule-based methods [2][3] can be used for stock selection. Bankruptcy prediction can use geometric methods [4][5], where classification functions are represented by a set of decision boundaries constructed by optimising certain error criteria. Other methods such as Naïve Bayes classifiers [10] and maximum entropy classifiers [11] have been applied in bond rating, and prototype-based classification methods such as nearest-neighbour classification have been used for fraud detection.

C. Clustering
Like classification, cluster analysis groups similar data objects into clusters [2]; however, the classes or clusters are not defined in advance. Normally, cluster analysis is a useful starting point for other purposes such as data summarisation. A cluster of data objects can be considered a form of data compression. Clustering techniques are applied to analyse data in different domains such as biology, information retrieval, medicine, etc. In business and finance, clustering can be used, for instance, to segment customers into a number of groups for additional analysis and marketing activities. As clustering is normally used for data summarisation or compression, there are not many financial applications that use this technique compared to classification and association analysis. We will survey some approaches in Section III.

D. Other methods
Other mining techniques that can be applied to financial datasets are grouped into three categories: optimisation, regression and simulation. For instance, portfolio selection, risk management and asset liability management can use different optimisation techniques such as genetic algorithms [12], dynamic programming [13], reinforcement learning [14], etc. Besides, linear regression [2] and wavelet regression [15] are popular methods in the domain of financial forecasting, option pricing and stock prediction.

III. CLUSTERING METHODS

A. Partitioning Methods
The k-means clustering method [16] aims to partition n observed examples into k clusters. Each example belongs to exactly one cluster. All examples are treated with equal importance, and thus the mean of the observations in a cluster is taken as its centroid. With a predetermined k, the algorithm proceeds by alternating between two steps: an assignment step and an update step. The assignment step assigns each example to its closest cluster (centroid). The update step uses the result of the assignment step to calculate the new means (centroids) of the newly formed clusters. The k-means algorithm converges quickly in practice, but the optimal value of k is not known in advance.

In [17], the author uses the k-means algorithm to categorise mutual funds. The created clusters are matched against self-declared investment objectives and compared to explain the difference between expectation and financial characteristics. Besides, in order to determine the number of clusters (k), the author applied Hartigan's rule by evaluating the following formula:

\[
\left( \frac{\sum_{i=1}^{k} ESS_i}{\sum_{i=1}^{k+1} ESS_i} - 1 \right) \times (n - k - 1) > 10 \tag{1}
\]

where the numerator is evaluated on the result with k clusters and the denominator on the result with k + 1 clusters, ESS_i represents the sum of squares of cluster i, and n is the dataset's size. The number of clusters is the minimum k such that (1) is false.

B. Density-based
Another clustering approach is density-based clustering [2], which does not partition the sample space by mean centroids; instead it uses density information, so that tangled, irregularly contoured but well-distributed datasets can be clustered correctly.

OPTICS [18] is a density-based clustering technique that gives insight into the density distribution of a dataset. It compensates for a weakness of the k-means algorithm, namely the lack of knowledge of how to choose the value k. OPTICS provides a perspective on the size of density-based clusters.

Unlike centroid-based clustering, OPTICS does not explicitly produce a clustering of a dataset from the first step. Instead, it creates an augmented ordering of examples based on the density distribution. This cluster ordering can be used by a broad range of density-based clustering methods, such as DBSCAN. In addition, OPTICS can present density information about the dataset graphically through the cluster reachability-plot [18], which makes it possible for the user to understand the density-based structure of the dataset.
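
The augmented ordering produced by OPTICS can be illustrated with a minimal pure-Python sketch. It is not the implementation of [18] or of any library: the function name `optics`, the parameters `eps` (generating distance) and `min_pts`, and the toy points are assumptions made for illustration, and MinPts is assumed to count the point itself. Plotting the returned reachability values in the returned order would give a reachability-plot.

```python
import heapq
import math


def optics(points, eps, min_pts):
    """Return (ordering, reachability, core_distance) for a list of points.

    Undefined distances are represented by math.inf.
    Assumption: min_pts counts the point itself.
    """
    n = len(points)
    reach = [math.inf] * n
    core = [math.inf] * n
    processed = [False] * n
    order = []

    def neighbours(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def visit(i, seeds):
        processed[i] = True
        order.append(i)
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            return  # not a core point: it cannot extend the ordering
        dists = sorted(math.dist(points[i], points[j]) for j in nbrs)
        core[i] = dists[min_pts - 1]
        for j in nbrs:
            if not processed[j]:
                new_reach = max(core[i], math.dist(points[i], points[j]))
                if new_reach < reach[j]:
                    reach[j] = new_reach
                    heapq.heappush(seeds, (new_reach, j))

    for start in range(n):
        if processed[start]:
            continue
        seeds = []
        visit(start, seeds)
        while seeds:
            _, j = heapq.heappop(seeds)
            if not processed[j]:  # skip stale heap entries
                visit(j, seeds)
    return order, reach, core


# Toy data (hypothetical): two dense groups that are far apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
order, reach, core = optics(points, eps=2.0, min_pts=3)
for i in order:
    print(i, reach[i])
```

In the printed ordering, the first point of each dense group has an undefined (infinite) reachability, and the remaining points of the group follow with small values; these "valleys" are what a reachability-plot makes visible.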
clusters by calculating the distance between examples and
existing cluster centres. If this distance is higher than a
threshold value, a new cluster is created and initialized by
the example. This clustering algorithm can be summarised in
three main steps:
C. Partitioning Methods
The work in [17], which groups mutual funds with different investment objectives, claims that cluster analysis is able to explain non-linear structural relationships in datasets whose structure is unknown. It found that over 40% of the mutual funds do not belong to their stated categories and that, despite the very large number of stated categories, three groups are very important. Clustering helps simplify the financial data classification problem by working on the characteristics of the data rather than on labels, such as nominal labels (customer gender, living area, income or the success of the last transaction, etc.). Besides, nominal labels may be missing or not provided. Thus our effort is to understand the detailed structure of financial data classification without the given class labels.

Figure II. DBI and DI of K-means clustering German dataset
We report the DBI and DI of k-means clustering on both the normalised and the un-normalised versions of two datasets (the German credit dataset and the Churn dataset) to determine the optimal k values for these datasets. To avoid overfitting and loss of generality, we test k from 2 to 20. We normalise the attribute values to the range [0, 1] so that large-scale attributes do not dominate the dataset features:

\[
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the rescaled attribute.
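
A minimal sketch of this rescaling, applied column by column, is given below. The function name `min_max_normalise` and the example records are hypothetical, and mapping a constant attribute to 0.0 is an assumption, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_normalise(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (x_max == x_min) maps to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical records: (loan duration in months, credit amount).
records = [[6, 1000], [24, 5000], [48, 9000]]
normalised = min_max_normalise(records)
print(normalised)
```

Without this step, the credit amount (thousands) would dominate any Euclidean distance over the duration (tens), which is the effect the normalisation is meant to avoid.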
From Figure II, k=12 is optimal by DBI and k=8 is optimal by DI for the original German credit dataset, while k=8 is the optimal value for the normalised German credit dataset by both DBI and DI. This result shows that attribute scale affects the clustering evaluation, since the DI of clustering the original dataset is around 0. Normalisation unifies the results of both average tightness and worst case.

Figure III. DBI and DI of K-means clustering churn dataset

From Figure III, k=12 is optimal by DBI and k=17 by DI for the original churn dataset, while k=2 is the optimal value by both DBI and DI for the normalised dataset. Again, we notice that normalisation unifies the optimal clustering scheme, whereas the original attribute scale gives two clustering solutions.
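
The selection procedure used here (run k-means for a range of k, then score each result) can be sketched as follows, assuming that DBI and DI denote the Davies-Bouldin index (lower is better) and the Dunn index (higher is better). The index formulas below are the common textbook forms and may differ in detail from the variants used for the figures. The toy data, the function names and the evenly spaced initial centroids are assumptions made to keep the sketch small and deterministic.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means: alternate the assignment step and the update step."""
    n = len(points)
    # Assumption: deterministic start from evenly spaced examples.
    centroids = [tuple(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def _groups(points, labels):
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    return list(groups.values())


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / separation."""
    groups = _groups(points, labels)
    cents = [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]
    scatter = [sum(math.dist(p, c) for p in g) / len(g)
               for g, c in zip(groups, cents)]
    k = len(groups)
    return sum(
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    groups = _groups(points, labels)
    max_diameter = max(math.dist(p, q) for g in groups for p in g for q in g)
    min_separation = min(math.dist(p, q)
                         for a in range(len(groups))
                         for b in range(a + 1, len(groups))
                         for p in groups[a] for q in groups[b])
    return min_separation / max_diameter


# Toy one-attribute data (hypothetical), with three visible groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (25.0,), (27.0,)]
scores = {}
for k in range(2, 5):
    labels = kmeans(points, k)
    scores[k] = (davies_bouldin(points, labels), dunn_index(points, labels))

best_by_dbi = min(scores, key=lambda k: scores[k][0])
best_by_di = max(scores, key=lambda k: scores[k][1])
print(scores, best_by_dbi, best_by_di)
```

Both scoring functions assume at least two non-empty clusters with distinct centroids and at least one cluster holding more than one example. As in the experiments above, the two indices need not agree on the same k.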
Figure V shows that the normalised German credit dataset is well density distributed. With MinPts = 10 and the reachability-distance set to 0.33, the dataset is partitioned into 23 density-based clusters and 1 noise cluster, with 841 valid examples and 159 noise examples. With MinPts = 20 and the same reachability-distance, the dataset is partitioned into 15 density-based clusters and 1 noise cluster, with 681 valid examples and 319 noise examples.
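
The text does not state how the clusters were cut from the OPTICS ordering. One common way, sketched below, is the DBSCAN-style extraction described alongside OPTICS: walk the ordering and start a new cluster whenever the reachability-distance exceeds the threshold but the point is still a core point at that threshold. The function name `extract_clusters` and all input values are hypothetical; only the threshold 0.33 echoes the setting above.

```python
import math


def extract_clusters(order, reach, core, threshold):
    """Label points along an OPTICS ordering; -1 marks noise."""
    labels = [-1] * len(order)
    current = -1
    for p in order:
        if reach[p] > threshold:
            if core[p] <= threshold:
                current += 1  # p is dense enough to start a new cluster
                labels[p] = current
            # otherwise p is left as noise
        else:
            labels[p] = current  # p is density-reachable from the current cluster
    return labels


# Hypothetical OPTICS output for seven examples (math.inf = undefined).
order = [0, 1, 2, 3, 4, 5, 6]
reach = [math.inf, 0.20, 0.25, math.inf, 0.30, 0.28, math.inf]
core = [0.20, 0.20, 0.25, 0.30, 0.28, 0.30, math.inf]

labels = extract_clusters(order, reach, core, threshold=0.33)
n_clusters = len(set(labels) - {-1})
n_noise = labels.count(-1)
print(labels, n_clusters, n_noise)
```

Counting the labels in this way gives the kind of summary reported above (number of clusters, valid examples and noise examples) for a chosen threshold.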
Despite the visualisation of the density distribution, Table III shows that the clustering suffers from a large proportion of noise, with larger DBI values and lower DI values than k-means clustering. We can conclude that the German credit dataset is more suitable for centroid-based clustering than for density-based clustering.

Figure IV. Reachability-plot of original German credit dataset
the data recorded should be generally trusted. Financial
datasets are not usually density distributed, and therefore,
density-based clustering is not appropriate.
REFERENCES
[1] A. Weigend, “Data Mining in Finance: Report from the
Post-NNCM-96 Workshop on Teaching Computer Intensive Methods
for Financial Modeling and Data Analysis”, Fourth International
Conference on Neural Networks in the Capital Markets NNCM-96,
1997, pp. 399-411.
[2] P-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining,
Addison Wesley, 2006, pp.150-172
[3] J. R. Quinlan, “Learning First-Order Definitions of Functions”,
Journal of Artificial Intelligence Research., vol. 5, 1996, pp. 139–161
[4] N. Cristianini, J-S. Taylor, An Introduction to Support Vector
Machines and Other Kernel-based Learning Methods. Cambridge
University Press, 2000.
[5] J. Han and M. Kamber, Data Mining: Concept and Techniques.
Morgan Kaufmann publishers, 2nd Eds., Nov. 2005.
[6] T. M. Cover, P. E. Hart, “Nearest Neighbor Pattern Classification”,
Journal of Knowledge Based Systems, vol. 8 no.6, 1995, pp. 373–389
[7] T. Wittman. (2002, December). Time-Series Clustering and
Association Analysis of Financial Data. Available:
http://www.math.ucla.edu/~wittman/thesis/project.pdf.
[8] H. Bensmail, R. P. DeGennaro. (2004, September). Analyzing
Imputed Financial Data: A New Approach to Cluster Analysis.
Available: http://www.frbatlanta.org/filelegacydocs/wp0420.pdf.
[9] S. Omanovic, Z. Avdagic, S. Konjicija, “On-line evolving clustering
for financial statements' anomalies detection”, International