
Indian Journal of Science and Technology, Vol 9(7), DOI: 10.17485/ijst/2016/v9i7/87846, February 2016
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Using the Clustering Algorithms and Rule-based of Data Mining to Identify Affecting Factors in the Profit and Loss of Third Party Insurance, Insurance Company Auto
Faramarz Karamizadeh1 and Seyed Ahad Zolfagharifar2*
1 Department of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran; [email protected]
2 Kohgiluyeh and Boyer Ahmad Science and Research Branch, Islamic Azad University, Iran; [email protected]

Abstract
Background/Objectives: Analyzing insurance data with data mining can be a way of reducing losses. Data mining draws on machine learning, pattern recognition and database theory to discover previously unknown knowledge. Methods/Statistical Analysis: In this paper, the 2011 auto third party insurance data of Iran Insurance Company in Kohgiluyeh and Boyer Ahmad are analyzed using data mining methods. Findings: The results show that clustering algorithms that produce clusters of acceptable quality can provide a model for identifying the factors that affect the profit and loss of auto third party insurance and for determining their effect. Applications/Improvements: The K-Means algorithm formed the best clustering, with 9 clusters of relatively good quality; that is, it was able to maximize the distance between clusters and minimize the distance within clusters.

Keywords: Clustering Algorithm, Data Mining, Insurance, Profit and Loss, Third Party

1. Introduction
Insurance data analysis can be considered a way of reducing insurance companies' losses, and data mining may lead to useful results. Data mining is the process of discovering previously unknown and useful knowledge and rules from large volumes of data and databases¹. It is a useful tool for exploring knowledge in large data sets². Because data mining tools predict trends and future behavior by searching databases for hidden patterns, they support knowledge-based decisions and make it easy to answer questions that were previously very time consuming³. Data mining, whether supervised or unsupervised, can uncover the hidden rules in the data⁴.
Allahyari and Vahidy⁵ presented a methodology that uses clustering and decision tree data mining methods for the management of insurance customers. The decision tree results, with 99.66% accuracy, showed that the main causes of customer churn are lack of satisfaction with the performance of the insurance company, high insurance premiums and so on.
Firuzi et al.⁶ studied the identification of fraud in auto insurance using data mining. Their results show that the simple Bayes algorithm with an accuracy of 90.28%, then the decision tree with an accuracy of 88.9% and finally logistic regression with an accuracy of 86.1% were able to recognize false or fraudulent damage claims.
Heydari et al.⁷ classified the risk of auto insurance policy holders by using data

* Author for correspondence



mining algorithms. The researchers aimed to classify insurance policy holders according to their risk of receiving or not receiving compensation during the insurance period. First, they collected 13768 customer profile records from the years 2009-2010 and, after the necessary pre-processing, ran the algorithms listed below on them and compared the results⁸⁻¹¹. Six techniques were used: decision tree, neural networks, Bayesian networks, support vector machines, logistic regression and discriminant analysis. The decision tree achieved the best accuracy among them, identifying whether a customer is high-risk or low-risk with an accuracy of 76.4%.

2. Parts of the Research
This section describes the data and the area of study. It then explains the stages of the research, the methods used and the results of each part.

2.1 Data and Area of Study
In this study, the third-party damage data and the insurance policies issued in 2011 were collected first (about 20 thousand records, of which 1500 had damage); the issued data include 179 fields. Then 137 fields that were not effective were omitted, which reduced the effective fields to 42. Insurance experts were also consulted when removing fields to reduce the scale of the problem.
From the loss data set, only the fields for the amount of damage and its details were extracted. Unfortunately, more useful information such as the age and education of the driver at fault was not available. Because the key information of the issuing data is used when damage is recorded for an insurance policy, and the most important fields of the issued data are available from the previous stage, integrating the damage fields with the issued fields gives complete information about a particular insurance policy (Tables 1 and 2).

Selected fields
The amount of damage
The date of accident creation
First injured insurer
The number of injured stricken
The number of deceased stricken

The data mining operations were carried out with RapidMiner software; Minitab and Clementine 12 were also used to optimize the responses and the quality of the results.

Table 2. Selected fields of insurance policy issued data
Total records | Total effective records | Fields (effective) | Fields (non-effective)
20000 | 1500 | 42 | 137

Table 1. Selected fields of insurance policy loss data
Line | Field name | Line | Field name | Line | Field name
1 | Month | 15 | Surplus commitment | 29 | Start date
2 | Year | 16 | Physical commitment | 30 | Date of issue
3 | Agency code of major exporter | 17 | Financial commitment | 31 | Organization Name
4 | Group discounts | 18 | Insurance policy of the previous year | 32 | Policy Issues
5 | Discount of no damage | 19 | Insurance | 33 | Employee
6 | Type of Document 1 | 20 | Plaque | 34 | Issued by branch
7 | Late penalty | 21 | More used | 35 | Government
8 | Add code of premium rate | 22 | Capacity | 36 | Representative of Issue place
9 | Premium | 23 | Number of cylinders | 37 | Damage?
10 | 29 Article complications | 24 | Year of construction | 38 | The amount of damage
11 | Taxation | 25 | System | 39 | Date of accident creation
12 | Seat premium | 26 | Type of vehicle | 40 | First injured insurer
13 | Surplus premium | 27 | Term of insurance | 41 | The number of injured stricken
14 | Legal party premium | 28 | Expiration date | 42 | The number of deceased stricken

2 Vol 9 (7) | February 2016 | www.indjst.org Indian Journal of Science and Technology

Table 3. Statistics of the third party insurance policies issued in Kohgiluyeh and Boyer Ahmad in 2011
Field name | Number | Missing removal method
System | 70 | Diagnosis according to other features
Type of vehicle | 33 | Diagnosis according to other features
More used | 11 | Diagnosis according to other features
Number of cylinders | 2 | Diagnosis according to other features
Governmental | 28 | Diagnosis of the plaque
Month | 130 | Diagnosis of the issued date
Insurance | 49 | Diagnosis of the insurer name

2.2 The Steps of the Research

2.2.1 Investigating Missing Data
In the initial phase, missing values were discovered by sorting all the features in Microsoft Excel, and each missing value was estimated from the other characteristics of its record. Missing values were also identified while the data were transferred to the data mining environment. The fields with missing values and the methods used to resolve them are given in Table 4.

Table 4. Fields with missing values and methods of trouble shooting
Type of field | Field name
Integer | Surplus commitment, Physical commitment, Financial commitment, Plaque, Capacity, Number of cylinders, Year of construction, Term of insurance, The number of injured stricken, The number of deceased stricken
Polynominal | More used, System, Type of vehicle, First injured insurer
Real | Late penalty, Add code of premium rate, Premium, Taxation, Seat premium, Legal party premium, The amount of damage
Binominal | Insurance policy of the previous year, Employee, Issued by branch

About 350 records that had missing values in several important features were removed.

2.2.2 Outlier Data Discovery
Box plot graphs in Minitab 15 were used to detect outliers. The graph is based on percentiles: the data between the 25% and 75% percentiles, denoted Q1 and Q3 respectively, form the most important part of the data. X50% is the median and is marked with a line in the middle of the graph. The Inter Quartile Range (IQR) is defined as IQR = Q3 - Q1. Values greater than Q3 + [(Q3 - Q1) × 1.5] or less than Q1 - [(Q3 - Q1) × 1.5] are outliers. A box plot was drawn for each individual feature of the data, and the outlying values were corrected according to the results.

2.2.3 The Type and Name of Data Fields
At this stage, based on the knowledge derived from the data fields, the type of each field was specified for the software. The selected fields are listed in Table 5.

Table 5. Determination of type and field name
Rule | Result | Support | Confidence
No surplus commitment, add code of premium rate | Damage | 32% | 41%

2.3 Evaluation Criteria Rule-Based Algorithm (The Discovery of Association Rules)
Association rules produce many patterns, not all of which are interesting. Criteria are therefore needed for assessing the quality of the rules. For a rule of the form "if A, then B", the number of records in which A and B are both present, divided by the total number of records, gives a measure called Support. Its value is between 0 and 1. When searching for good rules, a threshold on support is usually set to limit the number of rules obtained.
A threshold may hide rules whose support is below the threshold but which are nevertheless valuable, so this criterion alone is not enough to determine the value of a rule. Confidence is another criterion with a value between 0 and 1. A confidence of 0.98 for a rule means that in 98% of the cases in which the left side of the rule is true, the right side is true too.

$$\mathrm{Confidence}(A \rightarrow B) = \frac{\mathrm{SUP}(A \cup B)}{\mathrm{SUP}(A)}$$
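As an illustration of the two criteria, the following minimal sketch computes Support and Confidence for one rule over a handful of records. The records, the item names and the function names `support` and `confidence` are hypothetical and chosen only for this example; they are not taken from the insurance data set or from the software used in the study.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= set(record)) / len(records)


def confidence(records, antecedent, consequent):
    """SUP(A and B together) divided by SUP(A) for the rule A -> B."""
    sup_a = support(records, antecedent)
    if sup_a == 0:
        raise ValueError("the antecedent never occurs in the records")
    return support(records, set(antecedent) | set(consequent)) / sup_a


# Hypothetical records: each record is the set of conditions that hold for one policy.
records = [
    {"no_surplus_commitment", "damage"},
    {"no_surplus_commitment", "damage"},
    {"no_surplus_commitment"},
    {"damage"},
    {"late_penalty"},
]

# Rule: no_surplus_commitment -> damage
rule_support = support(records, {"no_surplus_commitment", "damage"})
rule_confidence = confidence(records, {"no_surplus_commitment"}, {"damage"})
```

Here two of the five records contain both sides of the rule and three contain its left side, so the rule has a support of 2/5 and a confidence of 2/3.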


2.4 Evaluation Criteria of Clustering Algorithms
Criteria for evaluating clustering algorithms fall into two categories. Internal, or unsupervised, indicators determine the quality of the clustering using only the information contained in the data set. External, or supervised, indicators evaluate the performance of the clustering algorithm using information from outside the analyzed data set. In this study an unsupervised criterion is used: the Average Silhouette Coefficient, abbreviated ASC.
The task of a clustering algorithm is to minimize the distance within each cluster, measured by cohesion (Coh), and to maximize the distance between clusters, measured by separation (Sep). There are many unsupervised criteria, and each defines these two factors in its own way. The ASC criterion defines them as follows:

$$\mathrm{Coh}(i) = \frac{1}{m_i} \sum_{x \in c_i,\; y \in c_i} \mathrm{dist}(x, y)$$

$$\mathrm{Sep}(i) = \min_{j \le n_c,\; j \ne i} \left\{ \frac{1}{m_i} \sum_{x \in c_i,\; y \in c_j} \mathrm{dist}(x, y) \right\}$$

The ASC, or Silhouette Measure, is then defined as follows:

$$\mathrm{ASC} = \frac{1}{n_c} \sum_{i=1}^{n_c} \frac{\mathrm{Sep}(i) - \mathrm{Coh}(i)}{\max\{\mathrm{Sep}(i), \mathrm{Coh}(i)\}}$$

The maximum value of this measure is 1 and the minimum is -1. In the above formulas, dist(x, y) is the distance between the records x and y, calculated as the Euclidean distance. Also, c_i, m_i and n_c denote the i-th cluster, the number of members of cluster i and the total number of clusters formed for the study points, respectively. The Euclidean distance is as follows:

$$d_e(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

In the above equation, n is the number of features (of the problem), and x_k and y_k are the k-th features of the two records x and y, respectively.

2.5 Method
In this study, two rule-based algorithms, Apriori and FP Growth, and three clustering algorithms, K-Means, two-step (Two Step) and Kohonen, are used. The results are compared with each other, and the best rules and the most suitable characteristics that specify the loss of a cluster are extracted from each algorithm and reported after consultation with insurance specialists and experts.

3. The Rule-based Algorithm (The Discovery of Association Rules)
As with all the other algorithms, these algorithms are applied in the software by first loading the data source and then running the algorithm. The algorithms were run both with the default software parameters and with different parameter values; the best response was obtained with the default software settings.

3.1 FP Growth Algorithm
The rules obtained by this algorithm are shown in Table 6.

Table 6. Extracted rules by the algorithm Fp growth
Rule | Result | Support | Confidence
Used = trolleys, Systems = Nissan | Damage | 6% | 38%
Discount no damage less than 1.5 million rials, the type of vehicle = rides, year of manufacture more than 2007 | Damage | 47% | 40%

3.2. Weka Apriori Algorithm
The rules obtained by this algorithm are shown in Table 7.

Table 7. Extracted rules by Weka Apriori algorithm
Iteration | k | %train Partition | % Test Partition | Partitioning
8 | 9 | 10 | 90 | YES

3.3 Clustering Algorithm
This part applies the K-Means, Kohonen and two-step clustering algorithms to the data and checks whether they produce good output. After each algorithm is run, its output is evaluated with the ASC criterion.
Various parameters were tried for each algorithm in the software; the best response was obtained with the parameters described below.

3.4 K-Means Algorithm
The best performance for this algorithm was obtained with the following parameter settings (Table 8):


Table 8. K-Means algorithm parameters settings
%train Partition | % Test Partition | Partitioning
40 | 60 | YES

As shown in Figure 1, the error percentage reached zero after the eighth iteration of the algorithm.

Figure 1. Error percentage reaching zero after 8 iterations.

The K-Means algorithm was run for 9 clusters. The 12 most effective fields for clustering, as detected by the algorithm, are shown in Figure 2.

Figure 2. Predictor importance for K-Means.

The fields in order of importance according to the detection of the algorithm are (Figure 2):
• Surplus commitment.
• Physical commitment.
• Number of deceased stricken.
• Financial commitment.
• Month.
• Add code of premium rate.
• Seat premium.
• Taxation.
• Legal party premium.
• Surplus premium.
• Premium.
• First injured insurer.

The size of these clusters is shown in Figure 3.

Figure 3. The size of the clusters and the ratio of the smallest cluster to the largest cluster in the K-Means algorithm.

Clustering quality was also found to be relatively good, as shown in Figure 4.

Figure 4. The quality of clusters in K-Means algorithm.

As shown, the best quality according to the Silhouette Measure was 0.4, which is acceptable.
• Financial commitment.


3.5 Kohonen Algorithm
The best performance for this algorithm was obtained with the parameter settings in Table 9.

Table 9. Kohonen algorithm parameters settings
%train Partition | % Test Partition | Partitioning
40 | 60 | YES

The best number of clusters detected by the algorithm was 8. The 12 most effective fields for clustering, as detected by the algorithm, are shown in Figure 5.

Figure 5. Predictor importance for Kohonen.

The fields in order of importance according to the detection of the algorithm are:
• Physical commitment.
• Surplus commitment.
• Number of deceased stricken.
• Month.
• Seat premium.
• Surplus premium.
• Taxation.
• Legal party premium.
• Financial commitment.
• Premium.
• Type of document 1.
• 29 Article complications.

The size of the clusters is shown in Figure 6.

Figure 6. The size of the clusters and the ratio of the smallest cluster to the largest cluster in the Kohonen algorithm.

Clustering quality was also found to be relatively good, as shown in Figure 7.

Figure 7. The quality of the clusters in Kohonen algorithm.

As is clear, the best quality according to the Silhouette Measure was 0.3, which is also acceptable. This algorithm is a type of neural network; its input layer has 76 neurons and its output layer has 12 neurons (Figure 8).
• Premium.


Figure 8. The number of input and output neurons in Kohonen.

3.6 Two-Step Algorithm
The best performance for this algorithm was obtained with the parameter settings in Table 10.

Table 10. A two-step algorithm parameters settings
Algorithm name | Silhouette Measure | The number of clusters
K-Means | 0.4 | 9
Kohonen | 0.3 | 8
Two steps | 0.2 | 3

The best number of clusters detected by the algorithm was 3. The 12 most effective fields for clustering, as detected by the algorithm, are shown in Figure 9.

Figure 9. Predictor importance for two step algorithm.

The fields in order of importance according to the detection of the algorithm are:
• Physical commitment.
• Surplus commitment.
• More used.
• Vehicle type.
• Month.
• Capacity.
• Seat premium.
• System.
• Taxation.
• Legal party premium.
• Premium.
• Agency code of major exporter.

The size of the clusters is shown in Figure 10.

Figure 10. The size of the clusters and the ratio of the smallest cluster to the largest cluster in the two-step algorithm.

Clustering quality was found to be poor, as shown in Figure 11.

Figure 11. Quality of clusters in the two-step algorithm.

As shown, the best quality according to the Silhouette Measure was 0.2, which is lower than normal.


4. Conclusions

4.1. Analysis of Clusters Result
This section compares the clustering algorithms.

Table 11. Comparison of clustering algorithms
Late penalty | Physical commitment | Built Year
Add code of premium rate | Surplus commitment | Type of vehicle | System
Premium | The number of injured stricken | Term of insurance | Financial commitment
Taxation | Number of cylinders | Employee | Insurance policy of the previous year
Seat premium | First injured insurer | Issued by branch | Plaque
Legal party premium | Capacity | The amount of damage | More used

The results show that the K-Means algorithm formed the best clustering, with 9 clusters of relatively good quality; that is, it was able to maximize the distance between clusters and minimize the distance within clusters. Kohonen then generated 8 clusters of lower but still acceptable quality. Finally, the two-step approach produced 3 clusters of poor quality. It can be concluded that 8 or 9 is the best number of clusters for this type of data.
In the K-Means algorithm, the eighth cluster, with a coverage of 10%, and then the second cluster, with a coverage of 4% of the total, are the most similar to the damage cases: their add code rate was more than 30, their premium was low and they paid a high late penalty. In the Kohonen algorithm, the cluster at X = 0, Y = 0 is the most similar to the damage cases, again because the add code rate was more than 30, the premium was low and a high late penalty was paid. This cluster accounts for 20% of the total. In the two-step algorithm, cluster 1, with 25% of the total volume, was the most similar to the damage cases: the add code rate was more than 30, the vehicle was a Peikan and a high late penalty was paid.
Therefore, any new record that these algorithms assign to one of the clusters described above may lead to damage in the future.
The clustering algorithms identified 12 effective features for clustering. These features are marked as underlined in the following table.
According to the results, clustering algorithms can provide a new model and approach in which new samples are allocated to a specific cluster in order to determine the possibility of loss on a policy.

4.2 The Result of Correlation Rules (Rule Base)
The best rules of the form "if A, then B" obtained by the algorithms for 3.154 records are given in the following Table 13.
Because the support of the rules obtained is low relative to the scientific criteria stated above, the rules and the conclusions drawn from them are not scientifically reliable.

Table 12. Obtained fields of clustering algorithms
Line | Rule | Result | Support | The number of records including the condition and the result | The number of records including the condition | Confidence
1 | No surplus commitments, Add Code of premium rate = 0 | Damage | 32% | 312 | 755 | 41%
2 | Used = trolleys, System = Nissan | Damage | 6% | 51 | 135 | 38%
3 | Discount no damage less than 1.5 million rials, vehicle type = Peikan, Year of built more than 2007 | Damage | 47% | 437 | 1090 | 40%
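The confidence column of the table above can be checked against its two record-count columns using the definition in Section 2.3: the number of records that satisfy both the rule and its result, divided by the number of records that satisfy the rule. The short sketch below does this arithmetic; the variable names are hypothetical and the counts are copied from the table.

```python
# (records matching the rule and the result, records matching the rule), keyed by table line.
rule_counts = {1: (312, 755), 2: (51, 135), 3: (437, 1090)}

# Confidence as a whole percentage, rounded to the nearest integer.
confidence_pct = {
    line: round(100 * both / rule_only)
    for line, (both, rule_only) in rule_counts.items()
}
```

The rounded values are 41, 38 and 40 percent, which agree with the confidence column.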


5. Recommendations
This study identified, to a certain extent, the defects and shortcomings of the current procedure for entering the information of the insured and the injured. The current approach to loss-making and claim-free policy holders has some defects; fixing them would bring insurance companies more profit. Accordingly, it is proposed to:
• Record the individual characteristics of the insured, such as age, occupation, education, date of issue of the certification, type of certification and individual health status, in the issued insurance policy for future use in data mining, which will lead to more definitive knowledge in this field.
• Record more detailed information about the accident, the scene, and the personal information of the injured and the responsible party, for future use in data mining.

6. References
1. Long L, Liang C, Hui Y. Efficient evolutionary data mining algorithms applied to the insurance fraud prediction. International Journal of Machine Learning and Computing. 2012 Jan; 2(3):308–14.
2. Patil SP, Patil UM, Borse S. The novel approach for improving Apriori algorithm for mining association rule. World Journal of Science and Technology. 2012; 2(3):75–8.
3. Ramamohan Y, Vasantharao K, Chakravarti CK, Ratnam ASK. A study of data mining tools in knowledge discovery process. International Journal of Soft Computing and Engineering (IJSCE). 2012 Jul; 2(3):191–4.
4. Saniee M. Applied data mining. First Printing. Tehran, Iran: Niyaz Danesh Publishing; 2012.
5. Allahyari R, Vahidy K. Applying data mining to insurance customer churn management. 3rd International Conference on Information Computing and Applications (ICICA 2012); Chengde, China. 2012 Sep. p. 14–6.
6. Firuzi M, Shakoori M, Kazemi L, Zahedi S. The identification of fraud in auto insurance using data mining method. Journal of Insurance (Insurance Industry of the Former). The Twenty-Sixth Year. 2011 Sep; 3(103):103–28.
7. Heydari N, Samrand K, Farahi A. The classification of the insured risk of auto insurance using data mining algorithms. Journal of Insurance (Former Insurance Industry). The Twenty-Sixth Year. 2011 Jan; 104:107–29.
8. Montazeri-Gh M, Mahmoodi-k M. Development a new power management strategy for power split hybrid electric vehicles. Transportation Research Part D: Transport and Environment. 2015 Jun; 37:79–96.
9. Delafrooz N, Farzanfar E. Determining the customer lifetime value based on the benefit clustering in the insurance industry. Indian Journal of Science and Technology. 2016 Jan; 8(1):1342–9.
10. Montazeri-Gh M, Mahmoodi-k M. An optimal energy management development for various configuration of plug-in and hybrid electric vehicle. Journal of Central South University. 2015 May; 22(5):1737–47.
11. Jeon Y, Lee J, Kwon D. Process innovation case study of insurance industry: Based on Case of H Company. Indian Journal of Science and Technology. 2015 Jan; 8(S1):20–7.
