Using the Clustering Algorithms and Rule-based of Data Mining to Identify Affecting Factors in the Profit and Loss of Third Party Insurance, Insurance Company Auto
Faramarz Karamizadeh and Seyed Ahad Zolfagharifar
Indian Journal of Science and Technology, Vol 9(7), DOI: 10.17485/ijst/2016/v9i7/87846, February 2016 ISSN (Online) : 0974-5645
Abstract
Background/Objectives: Insurance data analysis can be considered a way of reducing losses by using data mining, which draws on machine learning, pattern recognition and database theory to discover unknown knowledge. Methods/Statistical Analysis: In this paper, the 2011 third party auto insurance data of the Iran Insurance Company in Kohgiluyeh and Boyer Ahmad is analyzed using data mining methods. Findings: The results show that clustering algorithms with an acceptable number of clusters are able to provide a model that identifies the affecting factors and determines their effect on the profit and loss of auto third party insurance. Applications/Improvements: The K-Means algorithm formed the best clustering, with 9 clusters of relatively good quality; that is, it was able to maximize the between-cluster distance and minimize the within-cluster distance.
Keywords: Clustering Algorithm, Data Mining, Insurance, Profit and Loss, Third Party
1. Introduction

Insurance data analysis can be considered a way of reducing insurance companies' losses, and data mining may lead to useful results. Data mining is the process of discovering unknown and useful knowledge and rules from mass data and databases1. It is a useful tool for exploring knowledge in large data sets2. Because data mining tools predict trends and future behavior by searching databases for hidden patterns, they make it possible to take knowledge-based decisions and to answer easily questions that were previously very time consuming3. Using data mining (supervised or unsupervised), the hidden rules in the data can be discovered4.

A methodology using clustering and decision tree data mining methods for the management of insurance customers was presented in5. The decision tree results, with 99.66% accuracy, showed that the main causes of customer churn are lack of satisfaction with the performance of the insurance company, high insurance premiums and so on. In6, research was carried out on the identification of fraud in auto insurance using data mining. The results show that the simple Bayes algorithm with an accuracy of 90.28%, then the decision tree with an accuracy of 88.9% and finally logistic regression with an accuracy of 86.1% were able to recognize false or fraudulent damage claims. In7, research was done on the classification of policy holders' risk in auto insurance using data mining algorithms. The aim of the researchers was to classify insurance policy holders according to the risk of receiving or not receiving compensation from the insurance company during the insurance period. First, they collected customer profile data comprising 13,768 records during the years 2009-2010 and, after the necessary pre-processing, ran the algorithms on them and compared their results8–11. The techniques used included six cases: decision tree, neural networks, Bayesian networks, support vector machines, logistic regression and discriminant analysis. The best accuracy among these algorithms belonged to the decision tree, which could detect whether a customer is high-risk or low-risk with an accuracy of 76.4%.

From the loss data set, only selected fields, such as the amount of damage and its details, were extracted. Unfortunately, more useful information such as the age of the at-fault driver, education, etc. was not available. Because key information from the issuing data is used at the time of recording the damage for an insurance policy, and the most important fields of the issued data are available from the previous stage, the integration of the damage fields with the issued fields gives access to complete information about a particular insurance policy (Tables 1 and 2). The selected fields include the amount of damage and the date of accident creation.
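To make the integration step concrete, the following is a minimal sketch of such a join in Python with pandas; the policy-number key and the column names are illustrative assumptions, not fields taken from Tables 1 and 2.

import pandas as pd

# hypothetical issued-policy records (one row per policy)
issued = pd.DataFrame({
    "policy_no": [1001, 1002, 1003],
    "vehicle_type": ["rides", "trolleys", "rides"],
    "premium": [3_500_000, 5_200_000, 2_900_000],
})

# hypothetical loss records (only policies with a damage claim appear here)
losses = pd.DataFrame({
    "policy_no": [1002],
    "damage_amount": [12_000_000],
    "accident_date": ["2011-06-14"],
})

# a left join keeps every issued policy and attaches its damage fields, if any
merged = issued.merge(losses, on="policy_no", how="left")
print(merged)

Policies without a claim keep missing values in the damage columns, which is what later allows profitable and loss-making policies to be separated.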
Table 3. Statistics of the third party insurance policies issued in Kohgiluyeh and Boyer Ahmad in 2011

Field name | Number missing | Removal method
System | 70 | Diagnosis according to other features
Type of vehicle | 33 | Diagnosis according to other features
More used | 11 | Diagnosis according to other features
Number of cylinders | 2 | Diagnosis according to other features
Governmental | 28 | Diagnosis from the plaque
Month | 130 | Diagnosis from the issued date
Insurance name | 49 | Diagnosis from the insurer

2.2 The Steps of the Research

2.2.1 Investigation of Missing Data
In the initial phase, an attempt was made to discover the missing values by sorting all the features in Microsoft Excel and guessing the missing amount of each record from its other characteristics. Missing values were also identified during the transfer of the data into the data mining environment; the fields with missing values and the corresponding troubleshooting methods are given in Table 4. The records that had missing values in several important features, about 350 cases, were removed.

Table 4. Fields with missing values and methods of troubleshooting

Type of field | Field name
Integer | Surplus commitment, Physical commitment, Financial commitment, Plaque, Capacity, Number of cylinders, Year of construction, Term of insurance, The number of injured stricken, The number of deceased stricken
Polynominal | More used, System, Type of vehicle, First injured insurer
Real | Late penalty, Add code of premium rate, Premium, Taxation, Seat premium, Legal party premium, The amount of damage
Binominal | Insurance policy of the previous year, Employee, Issued by branch
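Guessing a missing amount from the other characteristics of a record can be sketched as a group-wise imputation; the field and value names below are only illustrative stand-ins for those in Table 4.

import pandas as pd

# hypothetical records with a missing cylinder count
policies = pd.DataFrame({
    "system":    ["Pride", "Pride", "Peugeot", "Peugeot"],
    "cylinders": [4, None, 4, 6],
})

# fill a missing value with the most common value among records that
# share another characteristic (here: the same vehicle system)
policies["cylinders"] = policies.groupby("system")["cylinders"].transform(
    lambda s: s.fillna(s.mode().iloc[0]) if s.notna().any() else s)
print(policies)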
2.2.2 Outlier Data Detection
For the detection of outlier data, box plot graphs and the Minitab 15 software were used. In this graph, based on the percentile concept, the data between the 25th and 75th percentiles, the most important part of the data, are bounded by Q1 and Q3 respectively. X50% shows the median and is marked with a line in the middle of the graph. The Inter Quartile Range (IQR) is another concept, defined as IQR = Q3 - Q1. Values greater than Q3 + 1.5 × (Q3 - Q1) and less than Q1 - 1.5 × (Q3 - Q1) are outliers. To detect them, a box plot was drawn for each individual characteristic of the data, and the outlying data were corrected according to the results.
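The box-plot rule above translates directly into code; a minimal sketch follows (the premium values are made up):

import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    # flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], as in a box plot
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

premiums = pd.Series([120, 135, 128, 140, 132, 900])  # hypothetical values, one extreme
print(premiums[iqr_outliers(premiums)])               # only the value 900 is flagged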
2.2.3 The Type and Name of Data Fields
At this stage, given the knowledge derived from the data fields, the type of each data field is determined for the software. The selected fields are according to Table 5.

Table 5. Determination of type and field name

Rule | Result | Support | Confidence
No surplus commitment, add code of premium rate | Damage | 32% | 41%

2.3 Evaluation Criteria of the Rule-Based Algorithm (The Discovery of Association Rules)
Association rules produce many patterns, and not all of these models may be interesting for us. So, criteria should be defined for assessing the quality of the rules. If we have a rule that says "if A then B", dividing the number of records in which A and B are both present by the total number of records gives a measure named Support; its numerical value is between 0 and 1. Usually, in the search for better rules, a threshold is considered for support in order to limit the number of obtained rules.

The threshold value may cause rules whose support is lower than the threshold, but which are still valuable, not to be seen. So this criterion alone is not enough to determine the value of a rule. Confidence is another criterion whose value lies between 0 and 1. If this criterion for a rule shows a certainty of 0.98, it means that in 98% of cases, when the left side of the rule is true, the right side of the rule will be true too:

Confidence(A \rightarrow B) = \frac{SUP(A \cup B)}{SUP(A)}
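These two measures can be computed directly from a one-hot table of records; a small sketch with invented attributes:

import pandas as pd

# each row is a policy, each column a binary attribute (hypothetical data)
records = pd.DataFrame({
    "no_surplus_commitment": [1, 1, 0, 1, 0],
    "damage":                [1, 0, 0, 1, 0],
}).astype(bool)

def support(df, items):
    # fraction of records in which all the given items are present
    return df[list(items)].all(axis=1).mean()

def confidence(df, antecedent, consequent):
    # SUP(A and B) / SUP(A)
    return support(df, antecedent + consequent) / support(df, antecedent)

print(support(records, ["no_surplus_commitment", "damage"]))       # 0.4
print(confidence(records, ["no_surplus_commitment"], ["damage"]))  # about 0.67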
2.4 Evaluation Criteria of Clustering Algorithms
The evaluation of clustering algorithms is divided into two categories. One set of indicators, called internal or unsupervised, determines the quality of the clustering with respect to the information contained in the data set itself. The other category, called external or supervised, evaluates the performance of the clustering algorithm according to information outside the analyzed data set. In this study an unsupervised criterion is used: the Average Silhouette Coefficient, abbreviated ASC.

As we know, the duty of a clustering algorithm is to minimize the within-cluster distance, or cohesion (Coh), and to maximize the between-cluster distance, or separation (Sep). Because there are many unsupervised criteria, each criterion defines these two factors in a certain way. The ASC criterion defines them as follows:

Coh(i) = \frac{1}{m_i} \sum_{x, y \in c_i} dist(x, y)

Sep(i) = \min_{1 \le j \le n_c,\ j \ne i} \frac{1}{m_i} \sum_{x \in c_i,\ y \in c_j} dist(x, y)

So, ASC or the Silhouette Measure is defined as follows:

ASC = \frac{1}{n_c} \sum_{i=1}^{n_c} \frac{Sep(i) - Coh(i)}{\max(Sep(i), Coh(i))}

The maximum value of this measure is 1 and the minimum is -1. In the above formulas, dist(x, y) represents the distance between records x and y, and the Euclidean distance is used to calculate it. Also, c_i, m_i and n_c respectively represent the i-th cluster, the number of members of cluster i and the total number of clusters formed for the studied points. The Euclidean distance is as follows:

d_e(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

In the above equation, n represents the number of features (of the problem), and x_k and y_k are respectively the k-th features of the two records x and y.
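The ASC defined above can be computed with a few lines of NumPy. Note that this sketch follows the cluster-level formulas of this section, which is not identical to the per-record silhouette used by some libraries; the data are a small synthetic example.

import numpy as np

def euclid(A, B):
    # all pairwise Euclidean distances between rows of A and rows of B
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def asc(X, labels):
    clusters = np.unique(labels)
    scores = []
    for i in clusters:
        Xi = X[labels == i]
        mi = len(Xi)
        coh = euclid(Xi, Xi).sum() / mi                   # Coh(i)
        sep = min(euclid(Xi, X[labels == j]).sum() / mi   # Sep(i)
                  for j in clusters if j != i)
        scores.append((sep - coh) / max(sep, coh))
    return float(np.mean(scores))

# two well separated point groups (synthetic example)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(asc(X, np.array([0, 0, 1, 1])))   # close to 1 for well separated clusters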
2.5 Method
In this study, two rule-based algorithms, Apriori and FP-Growth, and three clustering algorithms, K-Means, Two-Step and Kohonen, are used. Their results are compared to each other, and the best rules and the most suitable characteristics are announced and extracted from each algorithm after consultation with insurance specialists and experts, who specify the loss of a cluster.

3. The Rule-based Algorithm (The Discovery of Association Rules)
The implementation of these algorithms, according to the working method of the software used, is like that of all other algorithms: first the data source is entered into the software and then the algorithm is run. The algorithms were run with the default software parameters; different parameters were also applied, but the best response was obtained with the default settings.

3.1 FP Growth Algorithm
The rules of this algorithm are as shown in Table 6.

Table 6. Extracted rules by the FP-Growth algorithm

Rule | Result | Support | Confidence
Used = trolleys, System = Nissan | Damage | 6% | 38%
No-damage discount less than 1.5 million rials, type of vehicle = rides, year of manufacture more than 2007 | Damage | 47% | 40%

3.2 Weka Apriori Algorithm
The rules of this algorithm are as shown in Table 7.

Table 7. Extracted rules by the Weka Apriori algorithm

Iteration | k | %Train partition | %Test partition | Partitioning
8 | 9 | 10 | 90 | YES
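For readers who want to reproduce this kind of rule extraction outside the original software, the sketch below uses the mlxtend implementations of FP-Growth and the rule generation step on an invented one-hot table; the column names and thresholds are assumptions, and this is not the authors' exact tool chain.

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# toy one-hot table of policy attributes (hypothetical)
data = pd.DataFrame({
    "used_trolleys": [1, 0, 1, 1, 0, 1],
    "system_nissan": [1, 0, 1, 0, 0, 1],
    "damage":        [1, 0, 1, 1, 0, 0],
}).astype(bool)

# frequent itemsets by FP-Growth (apriori() would yield the same itemsets)
itemsets = fpgrowth(data, min_support=0.05, use_colnames=True)

# keep rules above support/confidence thresholds, as discussed in Section 2.3
rules = association_rules(itemsets, metric="confidence", min_threshold=0.3)
print(rules[["antecedents", "consequents", "support", "confidence"]])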
3.3 Clustering Algorithms
The objective of this part is to use the K-Means, Kohonen and Two-Step clustering algorithms and to check whether these algorithms produce good output on the data or not. After running each algorithm, the output is evaluated with the ASC criterion. The algorithms were implemented according to the working method of the software used; various parameters were applied, and the best response was obtained with the parameters described below.

3.4 K-Means Algorithm
The best performance obtained for this algorithm has been with the following parameter settings (Table 8).
The fields, in order of importance according to the algorithm's detection, are (Figure 2):
• Surplus commitment.
• Physical commitment.
• Number of deceased stricken.
• Financial commitment.
• More used.
• Vehicle type.
• Month.
• Capacity.
• Seat premium.
• System.
• Taxation.
• Legal party premium.
• Premium.
• Agency code of major exporter.

Figure 4. The quality of clusters in the K-Means algorithm.

As specified in Figure 4, the best quality determined according to the Silhouette Measure criterion has been equal to 0.4, which is also acceptable.
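A comparable experiment can be sketched with scikit-learn's K-Means and silhouette score. The data below are a random stand-in for the encoded policy features, so the numbers will not match the paper, but the procedure of trying several cluster counts and keeping the silhouette for each is the same idea.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 6)))  # stand-in features

for k in (3, 6, 9, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels, metric="euclidean"), 3))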
Figure 8. The number of input and output neurons in Kohonen.

3.6 Two-Step Algorithm
The best performance obtained for the algorithm has been with the parameter settings of Table 10. The size of the clusters is shown in Figure 10.
Table 10. Two-step algorithm parameter settings

Algorithm name | Silhouette Measure | Number of clusters
K-Means | 0.4 | 9
Kohonen | 0.3 | 8
Two-Step | 0.2 | 3