
Clustering Approaches for Financial Data Analysis: a Survey

Fan Cai, Nhien-An Le-Khac, M-Tahar Kechadi


School of Computer Science & Informatics, University College Dublin, Ireland

Abstract—Nowadays, financial data analysis is becoming increasingly important in the business market. As companies collect more and more data from daily operations, they expect to extract useful knowledge from the collected data to help make reasonable decisions for new customer requests, e.g. user credit category, confidence of expected return, etc. Banking and financial institutes have applied different data mining techniques to enhance their business performance. Among these techniques, clustering has been considered a significant method for capturing the natural structure of data. However, there are not many studies on clustering approaches for financial data analysis. In this paper, we evaluate different clustering algorithms for analysing different financial datasets, varying from time series to transactions. We also discuss the advantages and disadvantages of each method, to enhance the understanding of the inner structure of financial datasets as well as the capability of each clustering method in this context.

Keywords-clustering; partitioning clustering; density-based clustering; financial datasets

Author contacts: N-A. Le-Khac (corresponding author, [email protected]); F. Cai ([email protected]); M-T. Kechadi ([email protected]), School of Computer Science & Informatics, University College Dublin, Ireland.

I. INTRODUCTION

TODAY, we have a deluge of financial datasets. Faster and cheaper storage technology allows us to store ever-greater amounts of data. Due to the large sizes of the data sources, it is not possible for a human analyst to come up with interesting information (or patterns) that will help in the decision-making process. Global competition, dynamic markets, and rapid development in information and communication technologies are some of the major challenges in today's financial industry. For instance, financial institutions are in constant need of more data analysis, and the data involved are becoming very large and complex. As the amount of data available constantly increases, our ability to process it becomes more and more difficult. Efficient discovery of useful knowledge from these datasets is therefore becoming a challenge and a massive economic need.

On the other hand, data mining (DM) is the process of extracting useful, often previously unknown information, so-called knowledge, from large datasets (databases or data). This mined knowledge can be used for various applications such as market analysis, fraud detection, customer retention, etc. Recently, DM has proven to be very effective and profitable in analysing financial datasets [1]. However, mining financial data presents special challenges: complexity, external factors, confidentiality, heterogeneity, and size. The data miners' challenge is to find the trends quickly while they are still valid, as well as to recognise when the trends are no longer effective. Moreover, designing an appropriate process for discovering valuable knowledge in financial data is a very complex task.

Different DM techniques have been proposed in the literature for analysing data in various financial applications. For instance, decision trees [2] and first-order learning [3] are used in stock selection. Neural networks [4] and support vector machines [5] were used to predict bankruptcy, and nearest-neighbours classification [6] for fraud detection. These techniques have also been used for analysing financial time series [7], imputed financial data [8], outlier detection [9], etc. However, not many clustering techniques have been applied in this domain compared to other techniques such as classification and regression [2].

In this paper, we survey different clustering algorithms for analysing different financial datasets for a variety of applications: credit card fraud detection, investment transactions, stock market data, etc. We discuss the advantages and disadvantages of each method in relation to a better understanding of the inner structure of financial datasets, as well as the capability of each clustering method in this context. In other words, the purpose of this research is to provide an overview of how basic clustering methods have been applied to financial data analysis.

The rest of this paper is organised as follows. In Section II, we briefly present different financial data mining techniques that can be found in the literature. Section III describes different clustering techniques used in this domain. We evaluate and discuss the advantages and disadvantages of these clustering methods in Section IV. We conclude and discuss some future directions in Section V.

II. DATA MINING IN FINANCE

A. Association Rules

Association rule mining is a DM technique known as association analysis, which is useful for discovering interesting relationships hidden in large datasets. These relationships can be represented in the form of association rules or sets of frequent itemsets [2]. This technique can be applied to analyse data in different domains such as finance, earth science, bioinformatics, medical diagnosis, web mining, and scientific computation.
In finance, association analysis is used, for instance, in customer profiling, which builds profiles of different groups from the company's existing customer database. The information obtained from this process can help in understanding business performance, making new marketing initiatives, analysing risks, and revising company customer policies. Moreover, loan payment prediction, customer credit policy analysis, marketing and customer care can also use association analysis to identify important factors and eliminate irrelevant ones.

B. Classification

Classification is another DM approach, which assigns objects to one of several predefined categories. It uses training examples, i.e. pairs of inputs and output targets, to find an appropriate target function, also known informally as a classification model. The classification model is useful for both descriptive and predictive modelling [2]. In finance, classification approaches are also used in customer profiling, by building predictive models where the predicted values are categorical. Financial market risk, credit scoring/rating, portfolio management, and trading also apply this approach to group similar data together.

Classification can be considered one of the most important analytical methods in computational finance. Rule-based methods [2][3] can be used for stock selection. Besides, bankruptcy prediction can use geometric methods [4][5], where classification functions are represented by a set of decision boundaries constructed by optimising certain error criteria. Other methods such as Naïve Bayes classifiers [10] and maximum entropy classifiers [11] have been applied to bond rating, and prototype-based classification methods such as nearest-neighbours classification have been used for fraud detection.

C. Clustering

Like classification, cluster analysis groups similar data objects into clusters [2]; however, the classes or clusters are not defined in advance. Normally, clustering analysis is a useful starting point for other purposes such as data summarisation. A cluster of data objects can be considered a form of data compression. Many domains apply clustering techniques to analyse data, such as biology, information retrieval, medicine, etc. In business and finance, clustering can be used, for instance, to segment customers into a number of groups for additional analysis and marketing activities. As clustering is normally used for data summarisation or compression, there are not many financial applications that use this technique compared to classification and association analysis. We survey some approaches in Section III.

D. Other methods

Other mining techniques that can be applied to financial datasets are grouped in three categories: optimisation, regression and simulation. For instance, portfolio selection, risk management and asset liability management can use different optimisation techniques such as genetic algorithms [12], dynamic programming [13], reinforcement learning [14], etc. Besides, linear regression [2] and wavelet regression [15] are popular methods in the domains of financial forecasting, option pricing and stock prediction.

III. CLUSTERING METHODS

A. Partitioning Methods

The k-means clustering method [16] aims to partition n observed examples into k clusters, where each example belongs to exactly one cluster. All examples are treated with equal importance, and thus the mean is taken as the centroid of the observations in each cluster. With a predetermined k, the algorithm proceeds by alternating between two steps: an assignment step and an update step. The assignment step assigns each example to its closest cluster centroid. The update step uses the result of the assignment step to compute the new means (centroids) of the newly formed clusters. The convergence of the k-means algorithm is fast in practice, but the optimal value of k is not known in advance.

In [17], the authors use the k-means algorithm to categorise mutual funds. The created clusters are compared against the funds' self-declared investment objectives to explain the difference between expectations and financial characteristics. Besides, in order to determine the number of clusters k, the authors applied Hartigan's rule by evaluating the following formula:

\left( \frac{\sum_{i=1}^{k} ESS_i}{\sum_{i=1}^{k+1} ESS_i} - 1 \right) \times (n - k - 1) > 10 \qquad (1)

where ESS_i is the within-cluster sum of squares of cluster i (summed over the k-cluster and (k+1)-cluster solutions respectively) and n is the dataset's size. The number of clusters is the minimum k such that (1) is false.

B. Density-based

Another clustering approach is density-based clustering [2], which does not partition the sample space around mean centroids; instead, density information is used, so that tangled, irregularly contoured but well-distributed datasets can be clustered correctly.

OPTICS [18] is a density-based clustering technique for gaining insight into the density distribution of a dataset. It makes up for a weakness of the k-means algorithm, namely the lack of knowledge of how to choose the value k, and provides a perspective on the size of density-based clusters.

Unlike centroid-based clustering, OPTICS does not produce a clustering of a dataset explicitly from the first step. It instead creates an augmented ordering of the examples based on the density distribution. This cluster ordering can be used by a broad range of density-based clustering algorithms, such as DBSCAN. Besides, OPTICS can present density information about the dataset graphically via the cluster reachability-plot [18], which makes it possible for the user to understand the density-based structure of the dataset.
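To make the partitioning procedure of Section III-A concrete, the following is a minimal, self-contained sketch of k-means (assignment step, update step) together with the left-hand side of Hartigan's rule (1). This is our own illustration, not the code used in [16] or [17]; all function and variable names are ours.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: alternate the assignment step and the update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each example to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new = []
        for c, members in zip(centroids, clusters):
            if members:
                dim = len(members[0])
                new.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(dim)))
            else:
                new.append(c)  # keep the old centroid for an empty cluster
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

def ess(centroids, clusters):
    """Total within-cluster sum of squared distances (ESS)."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, members in zip(centroids, clusters) for p in members)

def hartigan_stat(ess_k, ess_k1, n, k):
    """Left-hand side of rule (1): (ESS_k / ESS_{k+1} - 1) * (n - k - 1)."""
    return (ess_k / ess_k1 - 1.0) * (n - k - 1)
```

Following rule (1), one would keep increasing k while `hartigan_stat` stays above 10 and stop at the first k for which it does not.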
Figure I. 2-D dataset sample and corresponding reachability plot

Figure I gives the reachability-plot of a 2-dimensional dataset; the number of valleys indicates that there are 3 density-based clusters.

However, OPTICS needs some priors, such as a neighbourhood radius (ε) and a minimum number of objects (MinPts) within ε, from which directly density-reachable, density-connected, cluster and noise are defined as in [18]. DBSCAN [19] is based on density-connected ranges grown from arbitrary core objects, i.e. objects containing at least MinPts objects in their ε-neighbourhood. In OPTICS, cluster membership is not recorded from the start; instead, the order in which objects get clustered is stored. This information consists of two values: core-distance and reachability-distance. More details on DBSCAN and the OPTICS ordered dataset are provided in [18].

The core-distance of an object p is defined as:

\mathrm{core\text{-}distance}_{\varepsilon,\mathit{MinPts}}(p) =
\begin{cases}
\text{Undefined}, & \text{if } |\mathrm{neighbour}_{\varepsilon}(p)| < \mathit{MinPts} \\
\mathit{MinPts}\text{-}\mathrm{distance}(p), & \text{otherwise}
\end{cases}

The reachability-distance of an object q with respect to an object o is defined as:

\mathrm{reachability\text{-}distance}_{\varepsilon,\mathit{MinPts}}(q, o) =
\begin{cases}
\text{Undefined}, & \text{if } |\mathrm{neighbour}_{\varepsilon}(o)| < \mathit{MinPts} \\
\max(\mathrm{core\text{-}distance}(o), \mathrm{distance}(o, q)), & \text{otherwise}
\end{cases}

Since the reachability plot is insensitive to the input parameters, [18] suggests that the values should be "large" enough to yield a good result, with no undefined examples and a reachability-plot that does not look jagged. Experiments show that MinPts values between 10 and 20 always give good results with a large enough ε. Briefly, the reachability-plot is a very intuitive means of understanding the density-based structure of financial data, and its general shape is independent of the parameters used.

C. Data stream clustering

[9] applied an on-line evolving approach for detecting financial statements' anomalies. The on-line evolving method [20] is a dynamic technique for clustering data streams. This method dynamically increases the number of clusters by calculating the distance between examples and the existing cluster centres. If this distance is higher than a threshold value, a new cluster is created and initialised by the example. The clustering algorithm can be summarised in three main steps:

(1) Calculate the distance D_{iJ} between data object x_i and all existing cluster centres C_{cJ}, find the minimum distance D_{ik}, and compare it to the radius R_k of cluster C_k.
(2) If D_{ik} < R_k then x_i belongs to cluster C_k; else find the nearest cluster C_a and evaluate S_a = D_{ia} + R_a against a threshold δ.
(3) If S_a > δ then create a new cluster for x_i; else x_i belongs to cluster C_a and R_a is updated to S_a/2.

In this algorithm, the number of clusters is not predefined. However, the distance calculation and the threshold value need an expert to provide prior knowledge, as does the labelling of newly formed clusters.

[21] applied a hierarchical agglomerative clustering [2] approach to analyse stock market data. The authors proposed an efficient metric for measuring the similarity between clusterings, a key issue for hierarchical agglomerative clustering methods. The similarity between two clusterings C = {C_1, C_2, ..., C_k} and C' = {C'_1, C'_2, ..., C'_k} is defined as follows:

\mathrm{Sim}(C, C') = \frac{1}{k} \sum_{i} \max_{j} \mathrm{Sim}(C_i, C'_j)

where

\mathrm{Sim}(C_i, C'_j) = \frac{2\,|C_i \cap C'_j|}{|C_i| + |C'_j|}

The authors also mentioned that some pre-processing techniques, such as mapping, dimensionality reduction and normalisation, should be applied to improve the performance. Moreover, they used the Precision-Recall method [21] to increase the cluster quality.

[7] also applied [21]'s approach for analysing financial data, i.e. stock market data. Besides, the authors defined a new distance metric based on the time period to cope with time series data. Concretely, the distance between stock i and stock j is given by:

d(i, j) = \left\| P_i - P_j \right\|_2

where

P_i(t) = \frac{s_i(t+1) - s_i(t)}{s_i(t)} \times 100

and s_i(t) is the value of stock i at time t. The authors stated that hierarchical agglomerative clustering fed with the normalised percentage change after filtering outliers gives the best result. However, the identification of outliers needs an a priori threshold. Moreover, the authors combine neural networks and association analysis with the clustering technique to analyse stock market datasets.
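The three-step evolving procedure summarised in Section III-C can be sketched as below. This is a simplified illustration of the idea, not the implementation of [9] or [20]: in particular, the full evolving clustering method also moves cluster centres, which the three-step summary (and therefore this sketch) omits, and all names are our own.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def evolving_cluster(stream, delta):
    """On-line evolving clustering following the three summarised steps.

    Each cluster is a [centre, radius] pair; clusters are created on the
    fly, so the number of clusters is not fixed in advance.
    """
    clusters = []
    for x in stream:
        if not clusters:
            clusters.append([x, 0.0])  # first example starts the first cluster
            continue
        # Step 1: distance from x to every existing centre; nearest cluster C_a.
        d = [dist(x, centre) for centre, _ in clusters]
        a = min(range(len(clusters)), key=d.__getitem__)
        # Step 2: inside the nearest cluster's radius -> absorbed as-is.
        if d[a] < clusters[a][1]:
            continue
        s = d[a] + clusters[a][1]  # S_a = D_ia + R_a
        # Step 3: too far from every cluster -> start a new one;
        # otherwise join C_a and grow its radius to S_a / 2.
        if s > delta:
            clusters.append([x, 0.0])
        else:
            clusters[a][1] = s / 2.0
    return clusters
```

As the survey notes, the threshold δ (and the distance function itself) must come from prior expert knowledge; the sketch takes δ as a plain parameter.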
IV. EVALUATION AND ANALYSIS

A. Datasets

Different financial datasets have been discussed in this section. Some of them were selected according to the authors' approaches. For instance, [17] used data obtained from Morningstar including 904 different funds classified into seven different investment objectives: World Wide Bonds, Growth, SMEs, Municipal NY, Municipal CA, Municipal State and Municipal National. Each fund has 28 financial variables, and all are normalised before analysis. Meanwhile, [9] used synthetic datasets with 1000 documents containing financial statements. In [21], the authors used the Standard and Poor 500 index historical stock dataset: 500 stocks with daily prices, each stock being a sequence of some length l where l ≤ 252. In [7], the authors analysed stock price datasets from 91 different stocks, which can be found at http://finance.yahoo.com. The data cover three years, from November 1, 1999 to November 1, 2001.

We moreover analyse two financial datasets with k-means and density-based clustering approaches: German credit card and Churn. Both of these datasets are provided by the UCI machine learning repository [22]. The German credit dataset contains clients described by 7 numerical and 13 nominal attributes, labelled as good or bad credit risks; it contains 1000 sample cases. The Churn dataset is artificial but is claimed to be similar to real-world measurements. It concerns telecommunications churn and contains 5 nominal attributes, 15 numerical attributes and 3333 examples. We analyse the datasets without the help of the nominal attributes for several reasons: numerical attributes are recorded internally within the commercial activities or business market, while nominal attributes are stated by external concepts defined by market experts, whose significance is not guaranteed. Moreover, nominal attributes are usually hierarchically dependent and can be missing, while data mining models should have the capability to bypass these optional constraints in order to understand the structure of the sample cases.

B. Criteria

The criteria used to evaluate clustering methods depend on each approach. For instance, [17] selected a relevant value of k by using formula (1) and then discussed the results obtained from running the k-means algorithm to classify mutual funds.

[7] uses the normalised change P_i(t) of stock i to overcome the discrete nature of time and the difficulties of treating deviations or first differences of prices due to the wide range of possible stock prices. External clustering statistics such as entropy and purity are used to define the closeness within an industry, and internal statistics such as separation and the silhouette coefficient to tell to what degree the industries are separated from each other.

[9] does not give a clustering criterion, but claims that their work is the first step towards building a robust financial statements' anomaly detection system, although it highly depends on the operator monitoring the process.

In this paper, we use well-known internal criteria to evaluate the clustering behaviour. The Davies-Bouldin Index (DBI) [23] is used as a first internal criterion for clustering, and is defined as follows:

\mathrm{DBI} = \frac{1}{N} \sum_{i=1}^{N} D_i

where N is the number of clusters and D_i is the tightness criterion of a cluster C_i, which takes the worst-case scenario and is defined as:

D_i = \max_{j;\, i \neq j} R_{i,j}

where i and j are cluster indexes and

R_{i,j} = \frac{S_i + S_j}{M_{i,j}}

is a summary evaluation of two clusters: the ratio between the sum of the tightness of the two clusters and the looseness between their two centres. S_k is the average internal Euclidean distance of the cluster indexed by k, and M_{i,j} is the Euclidean distance between the two cluster centres:

S_i = \frac{1}{T_i} \sum_{j=1}^{T_i} \left\| X_j - A_i \right\|_2

M_{i,j} = \left\| A_i - A_j \right\|_2

where A_i is the centroid of cluster C_i, T_i is the size of C_i, and X_j is an n-dimensional feature vector assigned to C_i. The smaller the DBI value is, the more efficient the clustering is.

The Dunn index (DI) is used as a second internal criterion for clustering, and is defined by:

\mathrm{DI} = \min_{1 \le i \le N} \left\{ \min_{1 \le j \le N,\, j \neq i} \left\{ \frac{\delta(A_i, A_j)}{\max_{1 \le k \le N} \Delta_k} \right\} \right\}

where Δ_k is some notion of the size of a cluster: it could be the distance between the farthest two points inside the cluster, the mean distance between all pairs, or the distance of all the points from the mean. Here we take:

\Delta_k = \max_{x, y \in C_k} \left\| x - y \right\|

and δ(A_i, A_j) is the closest distance between two clusters:

\delta(A_i, A_j) = \min_{x_i \in C_i,\, x_j \in C_j,\, i \neq j} \left\| x_i - x_j \right\|

Unlike DBI, the larger DI is, the better the clustering. It evaluates the inter-cluster and intra-cluster distances. However, like DBI, the best clustering by this criterion loses most general structural information about the dataset. The main difference between DBI and DI is that DBI indicates the average tightness while DI is a worst-case indicator.
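As an illustration, the two internal criteria defined above can be computed with a short routine. This is our own sketch of the formulas (average-distance S_i, single-linkage δ, diameter Δ), not code from the paper or any particular library:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(cluster):
    """Mean point A_i of a cluster given as a list of tuples."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))

def davies_bouldin(clusters):
    """DBI = (1/N) * sum_i max_{j != i} (S_i + S_j) / M_ij; smaller is better."""
    cents = [centroid(c) for c in clusters]
    # S_k: average internal Euclidean distance to the cluster's centroid.
    S = [sum(dist(p, a) for p in c) / len(c) for c, a in zip(clusters, cents)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((S[i] + S[j]) / dist(cents[i], cents[j])
                     for j in range(n) if j != i)
    return total / n

def dunn(clusters):
    """DI = min inter-cluster distance / max cluster diameter; larger is better."""
    # Delta_k: farthest pair of points inside a cluster (0 for singletons).
    diam = max(max((dist(x, y) for x, y in combinations(c, 2)), default=0.0)
               for c in clusters)
    # delta(A_i, A_j): closest pair of points across two different clusters.
    sep = min(dist(x, y)
              for a, b in combinations(clusters, 2) for x in a for y in b)
    return sep / diam
```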

C. Partitioning Methods

As [17] groups mutual funds with different investment objectives, the authors claimed that cluster analysis is able to explain non-linear structural relationships in a dataset of unknown structure. They found that over 40% of the mutual funds do not belong to their stated categories and that, despite the very large number of stated categories, three groups are the most important. Clustering helps to simplify the financial data classification problem based on the data's characteristics rather than on labels, such as nominal labels (customer gender, living area, income, the success of the last transaction, etc.). Besides, nominal labels may be missing or not provided. Thus our effort is to understand the detailed structure of financial data classification without the given class labels.

We give the DBI and DI of k-means clustering of both the normalised and un-normalised versions of the two datasets (German credit and Churn) to find the optimal k values for the given datasets. To avoid information overfitting and loss of generality, we test k from 2 to 20. We normalise the attribute values to [0, 1] in order to avoid large-scale attributes dominating the dataset features:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

where x_max and x_min are the maximum and minimum values of the rescaled attribute.

Figure II. DBI and DI of K-means clustering of the German credit dataset

From Figure II, k=12 is optimal by DBI and k=8 is optimal by DI for the original German credit dataset, while k=8 is the optimal value for the normalised German credit dataset by both DBI and DI. From this result, we know that attribute scale affects the clustering evaluation, since the DI of clustering the original dataset is around 0. Normalisation unifies the results of both the average tightness and the worst case.

Figure III. DBI and DI of K-means clustering of the Churn dataset

From Figure III, k=12 is optimal by DBI and k=17 by DI for the original Churn dataset, while k=2 is the optimal value by both DBI and DI for the normalised dataset. Again, we notice that normalisation unifies the optimal clustering scheme, while the original attribute scales give two different clustering solutions.

Figure IV. Reachability-plot of the original German credit dataset

Figure V shows that the normalised German credit dataset is well density distributed. When MinPts=10, by setting the reachability-distance threshold to 0.33, the dataset is partitioned into 23 density-based clusters and 1 noise cluster, with 841 valid examples and 159 noise examples. When MinPts=20, with the same reachability distance, the dataset is partitioned into 15 density-closed clusters and 1 noise cluster, with 681 valid examples and 319 noise examples.

Despite the visualisation of the density distribution, Table III shows that the clustering suffers from a large proportion of noise, and from larger DBI values and lower DI values compared to k-means clustering. We can conclude that the German credit dataset is more suitable for centroid-based clustering than for density-based clustering.
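The min-max rescaling used in the experiments above can be sketched in a few lines. This is our own illustrative helper (names are ours), applying x' = (x - xmin) / (xmax - xmin) per attribute so that no large-scale attribute dominates the Euclidean distance:

```python
def minmax_rescale(rows):
    """Rescale every attribute (column) of the rows to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    # Constant columns carry no information; map them to 0.0.
    return [tuple((x - l) / (h - l) if h > l else 0.0
                  for x, l, h in zip(row, lo, hi))
            for row in rows]
```

For example, an attribute ranging over [100, 300] and one over [0, 10] end up on the same [0, 1] scale, which is why the normalised DBI and DI curves agree on a single optimal k above.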
Figure V. Reachability-plot of the normalised German credit dataset

Table III. DBSCAN clustering for the normalised German credit dataset

Reachability distance   MinPts   Noise   DBI     DI
0.33                    10       No      2.529   0.236
0.33                    10       Yes     2.843   0.033
0.33                    20       No      2.465   0.250
0.33                    20       Yes     2.793   0.020

Figure VI. Reachability-plot of the original Churn dataset

Figure VI shows that the original Churn dataset cannot be partitioned into clusters based on density; the entire dataset behaves as a whole.

Figure VII. Reachability-plot of the normalised Churn dataset

Figure VII shows that there are mainly two valleys when MinPts = 10 or 20, which indicates that there are two distinct clusters in the Churn dataset.

Table IV. DBSCAN clustering and DBI for the Churn dataset

Reachability distance   MinPts   Noise   DBI     DI
0.32                    10       No      1.596   0.182
0.32                    10       Yes     3.568   0.106
0.33                    20       No      1.572   0.195
0.33                    20       Yes     4.435   0.080

From Table IV, DBSCAN without noise examples gets a good DBI, but a poor DBI with noise. However, DBSCAN clustering again suffers from a large proportion of noise, with over 980 noise examples (around 30% of the dataset). For financial datasets, noise should be very small, and the data recorded should be generally trusted. Financial datasets are not usually density distributed, and therefore density-based clustering is not appropriate.

D. Data stream clustering

In [9] the authors use on-line evolving clustering to update two parameters: the cluster number and the cluster radius. Two levels of anomaly detection use different financial statement features. The first level is based on internal information related to the account, e.g. equipment, employees, etc. For every combination of the two parameters, at least one cluster is created, but the authors do not give a good reason for this. The second level is based on the document type. However, the distances among different types are different, which again is prior knowledge from an expert. The threshold values for creating new clusters are determined by the experts for the first level and by a pre-defined distance for the second level. The monitoring process also involves the experts heavily in approving or disapproving the documents. The authors categorise their method as a first step towards anomaly detection. They are committed to reducing the reliance on experts and combining off-line and on-line approaches in future work.

In [7] the authors use hierarchical agglomerative clustering for time-based normalised stock market data. Percentage change is chosen as a good comparative measure, and time-based normalisation is used to remove the overall trend of the stock market and to improve the accuracy affected by outliers. The approach removes items as outliers if their average normalised distance across all the items exceeds a specified threshold, which requires domain expert knowledge. Moreover, the degree of correlation of the time series is decided in advance. The authors found that complete link and Ward's method perform reasonably well, with better purity and fewer outlier stocks filtered out. As treating the outliers decreases the overall purity by only about 6%, the authors claim that time-series clustering can determine the industry classification of a stock given its historical price record.

However, we notice that data stream clustering needs too much prior or domain knowledge and a lot of tuning for different features of even a single domain. Clustering approaches in different fields differ in essence. Thus clustering is a good method for understanding financial time-series classification, but it is not logically clear and efficient. The distance measure becomes even more complex due to the time-related nature of the data, because clustering does not have the capability to scale time-related influence between examples intelligently; experts have to determine that instead, e.g. the length of the periodicity, etc. Recurrent neural networks [24] and Gaussian processes [25] are more promising approaches and are more likely to handle time-series or periodical financial data classification.

V. CONCLUSION AND FUTURE WORK

We show that density-based clustering does not suit financial datasets. Normalised centroid-based clustering with higher DI or lower DBI gives the best number of clusters to help understand financial data classification. Original attribute scales do not reflect behavioural similarity, since the Euclidean distance is dominated by large-scale attributes, and the best average tightness does not indicate the best worst case. However, we still find some constraints: e.g., k-means clustering tends to find spherical clusters, centroid-based clustering does not handle noise, etc.

This work can be seen as a first step in looking into the structure of financial datasets by using clustering. We will further apply other techniques to financial datasets. This includes: (1) exploring other centroid-based clustering approaches for financial datasets; (2) finding out whether nominal attributes are significant and introducing other criteria to evaluate the clusters; (3) introducing a weighted Euclidean distance instead of the standard Euclidean distance to re-evaluate centroid-based clusters, so as to overcome the limitations of k-means; and (4) introducing and comparing different kinds of nonlinear classifiers to strengthen the recall and accuracy and to improve the prediction and interpretability of the results. These techniques include decision trees, nonlinear SVMs, different structures of neural networks, and Gaussian processes with different kernel functions.

REFERENCES
[1] A. Weigend, "Data Mining in Finance: Report from the Post-NNCM-96 Workshop on Teaching Computer Intensive Methods for Financial Modeling and Data Analysis", Fourth International Conference on Neural Networks in the Capital Markets NNCM-96, 1997, pp. 399-411.
[2] P-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006, pp. 150-172.
[3] J. R. Quinlan, "Learning First-Order Definitions of Functions", Journal of Artificial Intelligence Research, vol. 5, 1996, pp. 139-161.
[4] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2nd ed., Nov. 2005.
[6] T. M. Cover, P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory, vol. 13, no. 1, 1967, pp. 21-27.
[7] T. Wittman, "Time-Series Clustering and Association Analysis of Financial Data", December 2002. Available: http://www.math.ucla.edu/~wittman/thesis/project.pdf
[8] H. Bensmail, R. P. DeGennaro, "Analyzing Imputed Financial Data: A New Approach to Cluster Analysis", September 2004. Available: http://www.frbatlanta.org/filelegacydocs/wp0420.pdf
[9] S. Omanovic, Z. Avdagic, S. Konjicija, "On-line evolving clustering for financial statements' anomalies detection", International Symposium on Information, Communication and Automation Technologies, ICAT 2009, pp. 1-4.
[10] P. Langley, W. Iba, K. Thompson, "An analysis of Bayesian classifiers", 10th National Conference on Artificial Intelligence, 1992, pp. 223-228.
[11] R. A. Bourne, S. Parsons, "Maximum Entropy and Variable Strength Defaults", 16th International Joint Conference on Artificial Intelligence, IJCAI 99, Stockholm, Sweden, July 31 - August 6, 1999, pp. 50-55.
[12] N-A. Le-Khac, M. T. Kechadi, "Application of Data Mining for Anti-money Laundering Detection: A Case Study", 10th IEEE International Conference on Data Mining Workshops, Sydney, Australia, 14 December 2010, pp. 577-584.
[13] S. R. Eddy, "What is dynamic programming?", Nature Biotechnology, vol. 22, 2004, pp. 909-910.
[14] R. S. Sutton, "Learning to predict by the method of temporal differences", Machine Learning, vol. 3, 1988, pp. 9-44.
[15] H. Wenying, "Wavelet Regression With an Emphasis on Singularity Detection", M.S. thesis, Dept. Mathematics and Statistics, Sam Houston State Univ., Texas, USA, 2003.
[16] J. A. Hartigan, Clustering Algorithms, Wiley, 1975.
[17] A. Marathe, H. A. Shawky, "Categorizing mutual funds using clusters", Advances in Quantitative Analysis of Finance and Accounting, vol. 7, 1999, pp. 199-211.
[18] M. Ankerst, M. M. Breunig, H-P. Kriegel, J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure", ACM SIGMOD International Conference on Management of Data, 1999, pp. 49-60.
[19] M. Ester, H-P. Kriegel, J. Sander, X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996, pp. 226-231.
[20] N. Kasabov, Evolving Connectionist Systems, Springer-Verlag, London, 2003, pp. 40-42.
[21] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, "Mining the Stock Market: Which Measure is Best?", Proc. of KDD 2000, pp. 487-496.
[22] H. Hofmann, Statlog (German Credit Data) Data Set; C. L. Blake and C. J. Merz, Churn Data Set, UCI Repository of Machine Learning Databases.
[23] D. L. Davies, D. W. Bouldin, "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, 1979, pp. 224-227.
[24] M. T. Hagan, H. B. Demuth, M. Beale, Neural Network Design, Boston, MA, USA: PWS Publishing Co., 1996.
[25] D. J. C. MacKay, "Gaussian Processes - A Replacement for Supervised Neural Networks?", 1997.