Chapter 4: Implement Prediction Based Parallel Clustering Algorithm using Hybrid Optimization Technique
Data is the most essential part of any implementation. In our research we have used secondary data taken from the Amazon website and made available by Stanford University. With permission from Stanford University we accessed several Amazon product datasets and used them in our project; permission and access were granted by J. McAuley (2013). The dataset contains parameters such as product ID, title, price, user ID, reviewer profile name, helpfulness score, review score, summary, and review text.
In its raw form, data taken from a website may be unfiltered and cannot be processed directly by a tf.keras.Sequential model. To achieve this objective we applied several preprocessing methods. Initially we filtered the data line by line and then, based on the review score, divided it into two parts: reviews with a score greater than 2.5 were labeled positive and retained for further processing, while the remaining data were discarded from the system.
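A minimal sketch of this filtering step is shown below; the file name and column name are illustrative assumptions, not the project's actual schema, and the real dataset uses the fields listed above.

import pandas as pd

# Hypothetical file and column names for illustration only.
reviews = pd.read_csv("amazon_reviews.csv")
positive = reviews[reviews["review_score"] > 2.5]    # appended to the positive set
discarded = reviews[reviews["review_score"] <= 2.5]  # removed from the system
print(f"kept {len(positive)} of {len(reviews)} reviews")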
In sentiment analysis, finding the positivity score of a text requires training on previously labeled values. To achieve this objective in our project we used a Keras model. Keras works on the concepts of deep learning and neural networks: a network is divided into three layers, namely the input layer, hidden layer, and output layer, and the number of nodes in the hidden layer can be increased as per user requirements. Keras is an API used to build deep neural networks; its main advantages are that it is user-friendly, modular, composable, and easy to extend. We can import Keras as tf.keras, where tf is TensorFlow. To design a simple model with Keras we import the Sequential model, the most common way to implement a stack of layers. Keras layers expose constructor arguments such as activation, kernel_initializer and bias_initializer, and kernel_regularizer or bias_regularizer, for example:
layers.Dense(32, kernel_regularizer=tf.keras.regularizers.l1(0.01))
In the line above, a linear layer is created with an L1 regularization factor of 0.01 applied to the kernel matrix. Once the model is created, it is compiled using three important parameters: optimizer, loss, and metrics. The optimizer specifies the training procedure, the loss (here mean squared error) is the quantity minimized, and the metrics are used to monitor training.
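A minimal sketch of building and compiling such a model is given below; the 100-dimensional input, layer widths, and optimizer choice are illustrative assumptions rather than the project's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers

# Three-layer stack: hidden layers and a binary output.
model = tf.keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(100,)),              # hidden layer
    layers.Dense(32, kernel_regularizer=tf.keras.regularizers.l1(0.01)),  # L1-regularized linear layer
    layers.Dense(1, activation="sigmoid"),                                # output layer
])

model.compile(optimizer="adam",            # training procedure
              loss="mean_squared_error",   # loss minimized, as stated above
              metrics=["accuracy"])        # monitored during training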
In Keras, the Tokenizer class is used to tokenize the data, and we used the same function to preprocess the data in our research. It vectorizes a text corpus by converting each text into a sequence of integers or a vector, based on a binary indicator, word counts, or TF-IDF. Its arguments include num_words (the maximum number of words to keep), filters, a Boolean lower flag (whether to convert text to lowercase), split (the separator for splitting words), char_level, and oov_token, as illustrated below.
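For reference, the following sketch shows the Keras Tokenizer constructor with the default values of the arguments just listed; this mirrors the tf.keras API rather than project-specific code.

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=None,    # maximum number of words to keep
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,        # whether to convert text to lowercase
                      split=' ',         # separator for splitting words
                      char_level=False,  # tokenize at character level if True
                      oov_token=None)    # replacement for out-of-vocabulary words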
Once the data is tokenized with Keras, the output is passed to NMF-LDA. The Latent Dirichlet Allocation technique is used to remove unwanted data from the text. After removing all stopwords, irrelevant data, and unspecified characters, we obtained a final dataset of preprocessed, clean data.
At the initial level, we implemented a Naïve Bayes model to find the accuracy, precision, recall, F1-score, and support. Then, on the same dataset, we implemented the random forest and decision tree algorithms. This task was performed for comparative analysis.
Figure 4.1 Detailed flow of research
As shown in figure 4.1, the research proceeds in further stages. The actual implementation begins with the Grey Wolf Optimization (GWO) algorithm. At the first level we implemented GWO; then, by writing two functions, crossover and mutation, we implemented a genetic algorithm. In our project, crossover is performed between the beta and delta wolves, and mutation is then applied to the same wolves. If the resulting value is better than that of the normal search, the position is updated with the newer one. In this way we extracted the most optimized keywords from the dataset; a minimal sketch of this crossover-and-mutation step is given below.
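The sketch below shows one way to implement this step, assuming wolf positions are real-valued NumPy vectors; the crossover point, mutation rate, and step size are illustrative assumptions, not the thesis's actual settings.

import numpy as np

rng = np.random.default_rng(0)

def crossover(beta, delta):
    """Single-point crossover between the beta and delta wolf position vectors."""
    point = rng.integers(1, len(beta))
    return np.concatenate([beta[:point], delta[point:]])

def mutate(wolf, rate=0.1, scale=0.05):
    """Uniform mutation: perturb a fraction of the dimensions by a small random step."""
    mask = rng.random(len(wolf)) < rate
    wolf = wolf.copy()
    wolf[mask] += rng.uniform(-scale, scale, mask.sum())
    return wolf

# The offspring replaces a search agent only if it scores better than the normal search.
beta, delta = rng.random(10), rng.random(10)   # illustrative position vectors
child = mutate(crossover(beta, delta))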
The final stage of our research predicts the most relevant products for customers. Here we applied the parallel Fuzzy C-Means algorithm, a parallel clustering algorithm used to find the most relevant products. Finally, by generating a contingency matrix, we calculated the output parameters: accuracy, precision, recall, F1-score, and support.
PSO is a computational method used to optimize candidate solutions with respect to a given measure of quality. In this technique, candidates apply mathematical formulae to find the local best solution in the search space. Each movement is influenced by the particle's best-known position, but it is also guided toward the best-known position updated by the other particles, so the swarm is expected to move toward the best solution. It works on the concepts of parameter selection and convergence; fuzzy logic can be used for parameter selection in PSO. The basic variant of PSO maintains a population (swarm) of candidate solutions (particles).
To evaluate PSO against AACO, we considered the Traveling Salesman Problem (TSP) in both scenarios. Here a salesman has to visit all cities in minimum time, so we need to design an approach by which the salesman can do so. The detailed procedure for the execution is given as follows.
Procedure:
AACO_MetaHeuristic
    while (not_termination)
        artificial_ant()
        organic_ant()
        random_ant()
        abnormal_ant()
        generateSolutions()
        daemonActions()
        pheromoneUpdate()
    end while
end procedure
For edge selection, an ordinary ant moves from state 'x' to state 'y'. At each iteration, ordinary ant k computes the set of feasible expansions Ak(x) from its current state. For ordinary ant k, the probability P of moving from state 'x' to state 'y' depends on the attractiveness η_xy and the trail level τ_xy of the move. Trails are updated once all ants have completed their solutions, raising the pheromone concentration along the better paths. The probability of the k-th ant moving from state 'x' to state 'y' is
$$P_{xy}^{k} = \frac{\left(\tau_{xy}^{\alpha}\right)\left(\eta_{xy}^{\beta}\right)}{\sum_{z \in \mathrm{allowed}_y} \left(\tau_{xz}^{\alpha}\right)\left(\eta_{xz}^{\beta}\right)} \qquad (4.1)$$
where τ_xy is the amount of pheromone deposited on the transition from state x to y, α and β are parameters that control the influence of the trail and the attractiveness, respectively, and τ_xz and η_xz denote the trail level and attractiveness of all possible moves.
To update the paths, the trails are updated once all the ants have completed their tours:

$$\tau_{xy} \leftarrow (1 - \rho)\,\tau_{xy} + \sum_{k} \Delta\tau_{xy}^{k} \qquad (4.2)$$

where ρ is the pheromone evaporation coefficient and Δτ_xy^k is the amount of pheromone deposited by the k-th ant.
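The sketch below implements equations 4.1 and 4.2 directly; the 5-city distance matrix and the parameter values α, β, and ρ are illustrative assumptions, not the thesis's actual settings.

import numpy as np

def transition_probabilities(tau, eta, x, allowed, alpha=1.0, beta=2.0):
    """Equation 4.1: probability of an ant moving from state x to each allowed y."""
    weights = (tau[x, allowed] ** alpha) * (eta[x, allowed] ** beta)
    return weights / weights.sum()

def update_pheromone(tau, delta_tau, rho=0.5):
    """Equation 4.2: evaporation followed by the deposits of every ant k."""
    return (1.0 - rho) * tau + delta_tau.sum(axis=0)

# Toy 5-city instance: eta is 1/distance, tau starts uniform.
dist = np.array([[0, 2, 4, 1, 3],
                 [2, 0, 1, 5, 4],
                 [4, 1, 0, 2, 3],
                 [1, 5, 2, 0, 2],
                 [3, 4, 3, 2, 0]], dtype=float)
eta = np.where(dist > 0, 1.0 / np.where(dist == 0, 1.0, dist), 0.0)
tau = np.ones_like(dist)
print(transition_probabilities(tau, eta, x=0, allowed=[1, 2, 3, 4]))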
At the initial level, to achieve our objective, we implemented the same concept for solving the Travelling Salesman Problem. The steps involved in solving TSP are:
1) Visit each city exactly once.
2) Distant cities have a lower chance of being selected (lower visibility).
3) The higher the pheromone concentration on the edge between two cities, the higher the probability that the path is selected.
4) Once the journey is completed, ants deposit pheromone on all the paths traversed; the shortest path accumulates the highest pheromone concentration and is selected as the best path.
5) Pheromone on the remaining paths evaporates after the best path is selected.
AACO can be used in I-Commerce, but its accuracy and relevancy for product prediction are lower, and escaping local optima is more difficult than with other metaheuristics. So we implemented another technique called particle swarm optimization.
PSO is a computational method that optimizes a solution to a problem by iteratively improving its quality. It works on a population, called a swarm, of candidate solutions (particles). These particles move through the search space, and their movement is guided by their own best positions. When an improved position is discovered, it is propagated to the swarm and the process is repeated until the best solution is reached. PSO relies on two main principles, communication and learning: to find an updated solution in the search space a particle communicates with the other particles, and while communicating it emphasizes learning to find a stochastic solution in the search space.
For every particle i in the search space we initialize its position x_i with a uniformly distributed random vector. In the first step we initialize its best-known position to its initial position: p_i ← x_i. We then check all candidate solutions, and if the value of a particle's best-known position is better than the swarm's, we update the swarm's best-known position: g ← p_i. These steps are repeated for all particles in the search space: if f(x_i) < f(p_i), the particle's best-known position is updated (p_i ← x_i), and if additionally f(p_i) < f(g), the swarm's best-known position is updated (g ← p_i).
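A minimal PSO sketch following these update rules is given below; the inertia and acceleration coefficients (w, c1, c2) and the toy sphere objective are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def pso(f, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO sketch following the update rules described above."""
    x = rng.uniform(-1, 1, (n_particles, dim))      # positions (uniform random init)
    v = np.zeros_like(x)                            # velocities
    p = x.copy()                                    # personal bests: p_i <- x_i
    g = p[np.argmin([f(q) for q in p])].copy()      # swarm's best-known position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        x = x + v
        for i in range(n_particles):
            if f(x[i]) < f(p[i]):                   # update particle's best-known position
                p[i] = x[i]
                if f(p[i]) < f(g):                  # update swarm's best-known position
                    g = p[i].copy()
    return g

best = pso(lambda q: np.sum(q ** 2), dim=5)         # toy objective for illustration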
In the next step we applied both algorithms to the TSP graph shown in figure 4.2 to evaluate which performs better.
TSP is the traveling salesman problem, in which a salesman wants to visit all cities in minimum time. In our graph we considered five cities, numbered 0, 1, 2, 3, and 4. The salesman starts from position 0 and travels through all the cities, and we defined the distance between every pair of cities. Visiting the cities without prior knowledge is impractical; it may be time-consuming and waste effort. To avoid this problem, we first designed a TSP graph to find the most relevant solution. Figure 4.2 shows the TSP graph with the cities and the distances between them. In further steps we applied PSO and AACO to the graph to find relevant solutions.
Figure 4.2 TSP graph
As discussed above, AACO is first applied to the input data with 100 iterations. After applying AACO to the graph, it returns the best path with minimum cost, but its execution time is longer. The detailed output is shown in figure 4.3.
Figure 4.3 AACO output on the TSP graph. Every final iteration converges to the same tour, shortest_path: ([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9), with a runtime of 82.5 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
The table below gives a comparative analysis between AACO and PSO; for comparison we considered the mean time and standard deviation. For 100 iterations AACO takes 206 ms to execute the task, whereas PSO completes the same number of iterations in 18.3 ms.

Optimization Technique    No. of Iterations    Mean Time (ms)    Standard Deviation (ms)
AACO                      100                  206               23.7
PSO                       100                  18.3              4.86

Figure 4.6 Comparative analysis of AACO and PSO (iterations, mean time, standard deviation)
As shown in figure 4.6, PSO gives better results than AACO: it finds the relevant results more efficiently and performs the operation in less time with higher throughput.
Once the generation is created, the algorithm works with the following operators:
1) Selection operator: the idea behind selection is to give preference to individuals with good fitness scores so that they pass their genes to successive generations.
2) Crossover operator: it mates two individuals, selecting the best genes from both. Genes are exchanged at the crossover sites, producing new offspring, as shown in figure 4.7.
3) Mutation: the basic idea is to insert random genes to maintain diversity in the population and avoid premature convergence.
Figure 4.7 Crossover illustration: parents (A B C D E F) and (H I J K L M) produce the offspring (A M C D J L)
To demonstrate the concept of the genetic algorithm, we consider an example with the equation

$$Z = W_1X_1 + W_2X_2 + W_3X_3 + W_4X_4 + W_5X_5 + W_6X_6$$

Our target is to find the optimized weights that maximize this equation. To achieve this we applied the genetic algorithm with crossover and mutation; after n iterations the target reaches the optimal value or best possible solution. We start with a population P and calculate the fitness value of each member. After calculating the fitness values, we perform crossover to generate a new population, then perform mutation between different chromosomes and determine the fitness score again. This process is repeated until the best value is found. The detailed steps are explained in the algorithm given below.
Algorithm:
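The thesis's exact listing is not reproduced here; the following is a hedged sketch of a GA maximizing Z with single-point crossover and random mutation, where the inputs X1..X6, the population size, and the mutation range are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([4, -2, 3.5, 5, -11, -4.7])            # illustrative inputs X1..X6

def fitness(pop):
    """Z = W1X1 + ... + W6X6 evaluated for every chromosome (weight vector)."""
    return pop @ X

pop = rng.uniform(-4, 4, (8, 6))                    # initial population P of weight vectors
for _ in range(1000):                               # 'n' iterations
    order = np.argsort(fitness(pop))[::-1]
    parents = pop[order[:4]]                        # selection: fittest individuals mate
    point = rng.integers(1, 6)
    children = np.vstack([np.concatenate([parents[i, :point],
                                          parents[(i + 1) % 4, point:]])
                          for i in range(4)])       # single-point crossover
    children[:, rng.integers(0, 6)] += rng.uniform(-1, 1)   # mutation on a random gene
    pop = np.vstack([parents, children])            # next generation
print("best Z:", fitness(pop).max())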
After implementing the above algorithm, the output obtained at iteration 1000 is shown in figure 4.8. It has been observed that as the number of iterations increases, the fitness score gradually increases at each iteration level.
PSO is a computational method that optimizes a solution to a problem by iteratively improving its quality over a population (swarm) of candidate solutions (particles). We applied the genetic algorithm technique within PSO to address the convergence problem, performing crossover and mutation between different particles to determine the best-known position, as shown in figure 4.9. The detailed steps used during execution are explained below.
Algorithm:
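Again as a sketch only, not the author's exact listing: one way to embed the GA operators in a PSO loop is to recombine the two best particles and let the child replace the worst one; the objective, mutation rate, and ranges below are assumptions.

import numpy as np

rng = np.random.default_rng(1)
f = lambda q: np.sum(q ** 2)          # toy objective for illustration

def ga_step(particles, rate=0.1):
    """Hypothetical GA step inside a PSO loop: crossover between the two best
    particles, then uniform mutation; the child replaces the worst particle
    whenever it scores better, helping the swarm escape premature convergence."""
    scores = np.array([f(p) for p in particles])
    best, second = np.argsort(scores)[:2]
    point = rng.integers(1, particles.shape[1])
    child = np.concatenate([particles[best, :point], particles[second, point:]])
    mask = rng.random(child.shape) < rate
    child[mask] += rng.uniform(-0.1, 0.1, mask.sum())
    worst = int(np.argmax(scores))
    if f(child) < scores[worst]:
        particles[worst] = child
    return particles

particles = rng.uniform(-1, 1, (10, 5))
particles = ga_step(particles)        # called once per PSO iteration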
After implementing the above algorithm, it gives the best-known position, but the time required to execute this process is longer.
Data pre-processing is performed through lemmatization and removal of stop words and URLs.
Various Amazon datasets are available for research purposes, and we considered a few of them for our research. To demonstrate, we show sample data from the Amazon Instant Video file. As shown in sample 1, it contains the product ID, title, price, user ID, profile name, helpfulness, score, time, summary, and review text for each product, along with the details of the review given by each customer.
4.3.2.1 Lemmatization
Lemmatization is a technique used to transform a word into its dictionary form, fetching the proper lemma; each morphological variant of a word is identified using this technique. Lemmatization is closely related to stemming, but whereas lemmatization identifies the base form of 'adding' as 'add', a stemmer converts 'caring' into 'car', so lemmatization is better than stemming and is a very effective technique for identifying the proper word. After preprocessing Amazon's customer review data, positive and negative labels are created using the customer ratings: if the rating is between 1 and 2.5, the review is discarded and considered negative; if the rating is between 2.5 and 5, it is considered positive. Finally, the collected data is converted into sequences of numbers. The detailed procedure for lemmatization is given below.
Procedure:
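A hedged sketch of such a preprocessing procedure using NLTK's WordNetLemmatizer is given below; the regular expressions and resource downloads reflect one reasonable implementation rather than the thesis code.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)      # lemma dictionary
nltk.download("omw-1.4", quiet=True)
nltk.download("stopwords", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(review_text):
    """Remove URLs and stop words, then lemmatize the remaining tokens."""
    text = re.sub(r"https?://\S+", "", review_text.lower())   # strip URLs
    tokens = re.findall(r"[a-z]+", text)
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens if t not in stop_words]

print(preprocess("Adding this to my cart http://example.com"))   # ['add', 'cart']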
After running the above procedure, all outputs are stored in an array, and the result looks as shown in sample 2. Following the constraints given in the procedure, it outputs the product ID, title, price, user ID, profile name, helpfulness, score, time, summary, and review text for each product, along with the details of the review given by each customer.
Sample 2: Data extracted after performing lemmatization
After processing the data at the basic level, the text data is tokenized. The Keras tokenizer is used to process the review text; Keras itself is used to build and train the deep learning model. We considered a maximum of 20,000 features and a sequence length of 100 for the tokenize operation. At the initial stage we prepared the text data for deep learning using the following procedure.
Procedure
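A minimal sketch of this preparation step with the stated limits of 20,000 features and length 100 is given below; the sample reviews are placeholders.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_FEATURES, MAX_LEN = 20000, 100   # values stated in the text

texts = ["great product, works well", "poor quality, would not buy again"]  # placeholders
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(texts)                       # build the vocabulary
sequences = tokenizer.texts_to_sequences(texts)     # reviews -> integer sequences
padded = pad_sequences(sequences, maxlen=MAX_LEN)   # uniform length for the network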
LDA has a three-layered representation; the 𝜋 and 𝜇 parameters are analyzed during the generation of the corpus. For each document, document-level topic variables are investigated, and word-level variables are examined for every word in the document. The joint distribution over the random variables is represented as a generative process. The probability density function of a K-dimensional Dirichlet random variable is given by equation 4.7, while the joint distribution of a topic mixture and the probability of a corpus are evaluated using equations 4.8 and 4.9.
$$p(\aleph \mid \pi) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \pi_i\right)}{\prod_{i=1}^{k} \Gamma(\pi_i)}\, \aleph_1^{\pi_1 - 1} \cdots \aleph_k^{\pi_k - 1} \qquad (4.7)$$
In our code we initialized Latent Dirichlet Allocation and Non-negative Matrix Factorization with suitable parameters. Then, applying the stopword algorithm, we extracted the features from the input file and, from the extracted text features, fetched the sparse matrix. The output looks as shown in figure 4.14.
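A hedged sketch of this initialization with scikit-learn is given below; the number of topics, max_features, and the placeholder corpus are illustrative assumptions rather than the project's actual parameters.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["great product works well", "poor quality would not buy", "works well great quality"]

# English stop words are removed inside the vectorizers; the sparse matrices feed NMF and LDA.
tfidf = TfidfVectorizer(max_features=1000, stop_words="english").fit_transform(docs)
counts = CountVectorizer(max_features=1000, stop_words="english").fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                random_state=0).fit(counts)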
Once the sorted values are extracted, it is necessary to optimize the keywords to fetch relevant output. For this purpose we combined the Grey Wolf Optimization algorithm with a modified GWO, performing crossover between the beta and delta wolves and then adopting uniform mutation.
Here t denotes the current iteration, $\vec{X}_{prey}$ is the position vector of the prey, $\vec{A}$ and $\vec{C}$ are coefficient vectors, and $\vec{X}_{wolf}$ is the position vector of a grey wolf. Over the course of the iterations, $a$ decreases linearly from two to zero, and $r_1$ and $r_2$ are random vectors in the interval [0, 1]. The hunt is continuously guided by the alpha, with the beta and delta intermittently participating.
To mimic the hunting behavior of wolves, the three best solutions, namely the alpha, beta, and delta, are retained, and the remaining search agents (the omegas) are obliged to update their positions based on equations 4.14-4.20.
$$\vec{X}_1 = \vec{X}_{alpha} - \vec{A}_1 \cdot \vec{D}_{alpha} \qquad (4.17)$$
$$\vec{X}_2 = \vec{X}_{beta} - \vec{A}_2 \cdot \vec{D}_{beta} \qquad (4.18)$$
$$\vec{X}_3 = \vec{X}_{delta} - \vec{A}_3 \cdot \vec{D}_{delta} \qquad (4.19)$$
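A minimal sketch of this position update, assuming real-valued position vectors, is given below; the dimensionality, iteration counts, and random values are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def gwo_update(x, x_alpha, x_beta, x_delta, a):
    """Omega wolves move toward the mean of the positions driven by the alpha,
    beta, and delta (equations 4.14-4.20): X_i = X_leader - A_i * |C_i * X_leader - X|."""
    new = []
    for leader in (x_alpha, x_beta, x_delta):
        A = 2 * a * rng.random(x.shape) - a        # A = 2a*r1 - a
        C = 2 * rng.random(x.shape)                # C = 2*r2
        D = np.abs(C * leader - x)                 # distance to the leader
        new.append(leader - A * D)                 # X1, X2, X3
    return sum(new) / 3.0                          # averaged next position

# 'a' decreases linearly from two to zero over the course of the iterations.
t, max_iter = 10, 100
a = 2 - 2 * t / max_iter
x_next = gwo_update(rng.random(5), rng.random(5), rng.random(5), rng.random(5), a)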
The optimization loop then updates the positions, discretizes them, and re-ranks the feature index; if the stopping criterion is not met, the loop repeats, otherwise it returns the selected features of the alpha (or the mutated values) as the optimal feature subset.
Figure 4.17 Feature extraction after applying MGWO (topic weight vectors, e.g., Topic #199, the sorted score vector, and the ranked feature tokens)
4.3.5 Similar Data Clustering using Possibilistic Fuzzy C-Means
Once the keywords are extracted, similar data is clustered by applying the PFCM approach. Clustering is a technique in which groups of similar elements are created; it is used to identify hidden groups accurately. PFCM is an unsupervised approach, so clustering requires no historical knowledge of inputs and outputs. In our research, PFCM considers every object as a member of each cluster with a variable degree of "membership function", and it depends on the minimization of an objective function, which is represented mathematically in equations 4.21, 4.22, and 4.23.
In our research this function is computed over the corpus, i.e., the number of clusters and the degrees of membership, where the cluster centers and degrees of membership are given by equations 4.24, 4.25, and 4.26.
$$\mu_{ij} = \left[\sum_{k=1}^{c} \left(\frac{d_{x_j, v_i}}{d_{x_j, v_k}}\right)^{\frac{2}{m-1}}\right]^{-1}, \quad 1 \le i \le c, \; 1 \le j \le n \qquad (4.24)$$

$$t_{ij} = \left[\sum_{k=1}^{n} \left(\frac{d_{x_j, v_i}}{d_{x_j, v_k}}\right)^{\frac{2}{n-1}}\right]^{-1}, \quad 1 \le i \le c, \; 1 \le j \le n \qquad (4.25)$$

$$v_i = \frac{\sum_{k=1}^{n} \left(u_{ik}^{m} + t_{ik}^{n}\right) x_k}{\sum_{k=1}^{n} \left(u_{ik}^{m} + t_{ik}^{n}\right)}, \quad 1 \le i \le c \qquad (4.26)$$
Here $c$ denotes the number of cluster centers and $n$ the number of data points described by the coordinates $(x_j, v_i)$, which are used for calculating the distances between the cluster centers and the data points.
PFCM constructs possibilities and memberships with the usual function prototypes and a cluster center for every cluster. The choice of objective function is an important aspect in analyzing cluster performance; this method is used to accomplish better clustering, and the clustering performance depends on the objective function. To develop an effective objective function, the following requirements are considered:
1) Distances between the clusters are minimized.
2) Distances between data points and the clusters they are allocated to are minimized.
The desirability between the clusters and the data is modeled by the objective function. Further, the objective function of PFCM is improved by applying prototype-driven learning of the parameter 𝛼. The learning process depends on the exponential separation strength between the clusters and is updated in every iteration. The parameter 𝛼 is given in equation 4.27.
$$\alpha = \exp\!\left(- \min_{i \neq k} \frac{\lVert v_i - v_k \rVert^2}{\beta}\right) \qquad (4.27)$$
Then a weight parameter is introduced to determine a common value of 𝛼. Each point of the database carries a weight in relation to each cluster; using this weight function delivers a proper classification outcome, especially in the case of noisy data. The general weight function is calculated using equation 4.29.
$$w_{ji} = \exp\!\left(- \frac{\lVert x_j - v_i \rVert^2}{\left(\sum_{j=1}^{n} \lVert x_j - \bar{v} \rVert^2\right) \times c/n}\right) \qquad (4.29)$$
Here $w_{ji}$ denotes the weight function of point $j$ with respect to class $i$. The detailed procedure to calculate the centroid values is shown in figure 4.18: we initialize the clusters from the given dataset, then the similarity index is calculated within each cluster and the typicality and membership matrices are computed, as shown in figures 4.20 and 4.21.
1) Initialization of the clusters
2) Updating of the centroids
3) Calculation of the typicality matrix: after the distance matrix is estimated, the typicality matrices obtained from PFCM are evaluated.
Steps in PFCM:
1) Define the function pfcm
2) Define pstepfcm
3) Define the distance function for FCM
4) Define the predict_PFCM function
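As a sketch only, not the thesis code, the core update behind these functions can be written as follows, using a simplified typicality term; the exponents m and n_t and the toy data are assumptions.

import numpy as np

def pfcm_step(X, V, m=2.0, n_t=2.0, eps=1e-9):
    """One PFCM update sketch following equations 4.24-4.26:
    membership U, simplified typicality T, then new centroids V."""
    # Distance of every data point j to every centroid i (shape: c x n).
    d = np.linalg.norm(V[:, None, :] - X[None, :, :], axis=2) + eps
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))   # c x c x n
    U = 1.0 / ratio.sum(axis=1)                                    # eq 4.24
    T = 1.0 / (1.0 + d ** (2.0 / (n_t - 1.0)))                     # simplified form of eq 4.25
    w = U ** m + T ** n_t
    V_new = (w @ X) / w.sum(axis=1, keepdims=True)                 # eq 4.26
    return U, T, V_new

X = np.random.default_rng(0).random((20, 2))    # toy data points
V = X[:3].copy()                                # three initial cluster centers
for _ in range(10):
    U, T, V = pfcm_step(X, V)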
Executing these functions yields the objective function value, the cluster membership values, and the centroids.
The computed cluster membership values all lie close to 0.5, for example: 0.50035725, 0.49964654, 0.50096315, 0.50007308, 0.50129539, ...
4.4 Results
In the final step we calculated relevancy, accuracy, and the other parameters with the help of the contingency matrix. A contingency table, also called a cross table, is used to display the frequency distribution of data variants, and we used it to define all of these performance measures. Performance measurement is the regular measurement of outcomes and results, providing reliable information about the effectiveness of the proposed system; it is the process of collecting, analyzing, and reporting information about the performance of a group or individual. The equations for accuracy, F-measure, precision, and recall are given in equations 4.30, 4.31, 4.32, and 4.33.
$$Accuracy = \frac{TP + TN}{TP + TN + FN + FP} \times 100 \qquad (4.30)$$

$$F\text{-}measure = \frac{2\,TP}{2\,TP + FP + FN} \times 100 \qquad (4.31)$$

$$Precision = \frac{TP}{FP + TP} \times 100 \qquad (4.32)$$

$$Recall = \frac{TP}{FN + TP} \times 100 \qquad (4.33)$$
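A small helper computing equations 4.30-4.33 from the contingency-matrix counts is sketched below; the example counts are illustrative only.

def classification_metrics(tp, tn, fp, fn):
    """Equations 4.30-4.33 computed from the contingency (confusion) matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (fp + tp) * 100
    recall = tp / (fn + tp) * 100
    f_measure = 2 * tp / (2 * tp + fp + fn) * 100
    return accuracy, precision, recall, f_measure

# Illustrative counts only; the thesis values come from the Amazon review experiments.
print(classification_metrics(tp=30, tn=30, fp=20, fn=20))   # (60.0, 60.0, 60.0, 60.0)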
Table 4.3 Classification report: Precision, Recall, F1-Score, Support
The experimental results and discussion of the proposed system are detailed in this section, along with the experimental setup, performance measures, quantitative analysis, and comparative analysis. The proposed system was implemented in Python (Anaconda) on a machine with 8 GB RAM and a 3.0 GHz Intel i7 processor. Its performance was compared with other existing classifiers on the Amazon customer review dataset in order to assess the effectiveness and efficiency of the proposed system, evaluated by means of precision, recall, accuracy, F-measure, and AUC. In table 4.3 we calculated all the values using the above formulas; a detailed graphical analysis is shown in figure 4.22.
Figure 4.22 Classification report (precision, recall, F1-score, support)
The Amazon review dataset is used to find the various parameters: precision, recall, F-measure, support, and accuracy. We ran the proposed system on the Amazon Video dataset, and its accuracy was found to be better than that of the other systems.
Figure 4.23 Amazon Video dataset results after applying the Naïve Bayes classifier
As shown in figure 4.23, applying the Naïve Bayes classifier to the Amazon video review dataset gives the following output: precision 64%, recall 60%, F1-score 61%, support 20%, AUC 57.14%, accuracy 60%.
Figure 4.24 Amazon Video dataset results after applying the Random Forest classifier
As shown in figure 4.24, applying the Random Forest classifier to the Amazon video review dataset gives the following output: precision 54%, recall 57%, F1-score 54%, support 30%, AUC 43.65%, accuracy 56.66%.
Figure 4.25 Amazon Video dataset results after applying the Decision Tree classifier
As shown in figure 4.25, applying the decision tree classifier to the Amazon video review dataset gives the following output: precision 56%, recall 57%, F1-score 57%, support 35%, AUC 53.32%, accuracy 57.1%.
Figure 4.26 Amazon Video dataset results after applying the proposed system
As shown in figure 4.26, applying the proposed system to the Amazon video review dataset gives the following output: precision 63.82%, recall 66%, F1-score 61.47%, AUC 56.04%, accuracy 66%.
Table 4.4 Comparison between existing classifiers and the proposed system
Comparative results between the different algorithms and PCHOT are shown in table 4.4. It has been observed that the results obtained from the proposed system are better than those of the other classifiers. A detailed graphical analysis is shown in figure 4.27.
Figure 4.27 Graphical comparison of existing classifiers and the proposed system (precision, recall, F1-score, AUC, accuracy)
Figure 4.30 User interface of the proposed system after applying PCHOT on the Amazon Videos dataset
To demonstrate the significance of the algorithm's results, we applied it to different datasets. Figure 4.31 shows the relevant results after applying the PCHOT algorithm to the Kindle Store dataset: the five most relevant results are fetched and displayed in the user interface window for Kindles.
Figure 4.31 User interface of the proposed system after applying PCHOT on the Amazon Kindle dataset
To demonstrate the substantial results of the PCHOT algorithm, we applied it to the Baby Products dataset. Figure 4.32 shows the relevant results after applying PCHOT to this dataset: the five most relevant results are fetched and displayed in the user interface window for baby products.
Figure 4.32 User interface of the proposed system after applying PCHOT on the Amazon Baby Products dataset
To further establish the results and relevancy of the PCHOT algorithm, we applied it to the Amazon Mobile Devices dataset. Figure 4.33 shows the relevant results after applying PCHOT to this dataset: the five most relevant results are fetched and displayed in the user interface window for Amazon mobile devices.
Figure 4.33 User interface of the proposed system after applying PCHOT on the Amazon Mobile Devices dataset
To demonstrate the substantial results of the PCHOT algorithm, we also applied it to the Watch dataset. Figure 4.34 shows the relevant results after applying PCHOT to this dataset: the five most relevant results are fetched and displayed in the user interface window for watches.
Figure 4.34 User interface of the proposed system after applying PCHOT on the Amazon Watch dataset
4.5 Quantitative Analysis
In the quantitative analysis section, the Amazon customer review dataset is used to evaluate the performance of the proposed system and of existing classification approaches such as random forest, decision tree, and naive Bayes. In this analysis the collected data are classified into two classes: positive and negative. In tables 4.13, 4.14, 4.15, and 4.16, the performance of the proposed system and the existing classification approaches is evaluated by means of precision, recall, accuracy, F-measure, and AUC, with the evaluation validated on a split of 80% training data and 20% testing data. The datasets are taken from the Stanford University website and include books, kindle store, videos, accessories, child wear, movies, and media and camera datasets. From the 2,441,053 Amazon products, eight product categories are considered for experimental investigation: home and kitchen, electronics, baby products, office products, Amazon instant video, kindle, mobile phones and accessories, and shoes.
Initially, for comparison purposes, we determined the precision, recall, accuracy, F-measure, and AUC for the various products taken from the Amazon datasets. For most products, the accuracy, precision, recall, AUC, and F-measure of the proposed system are better than those of the other classifiers. For Amazon videos, after executing the proposed system on the Amazon videos dataset, the Gaussian naïve Bayes classifier achieves a precision of 64%, recall of 60%, F1-score of 61%, area under the curve of 57.14%, and product prediction accuracy of 60%. Similarly, the Random Forest classifier achieves a precision of 52%, recall of 57%, F1-score of 54%, AUC of 43.65%, and accuracy of 56.66%. After implementing and executing random forest and naïve Bayes, we tried the most popular machine learning classifier: for the decision tree, precision is 56%, recall is 57%, F1-score is 57%, AUC is 53.32%, and accuracy is 57.14%. Finally, to demonstrate the relevancy and accuracy of our system, we executed the proposed system on the Amazon video dataset; most parameter values increase for the proposed system compared to the other classifiers. For PCHOT, precision is 62.78%, recall is 66%, F1-score is 63.03%, AUC is 56.17%, and product prediction accuracy is 66%. The comparative analysis for all the products is shown in table 4.5, and the graphical analysis in figure 4.35.
Table 4.5 Performance analysis of proposed and existing classifiers for amazon videos
Amazon Video Precision Recall F1-Score AUC Accuracy
Figure 4.35 Performance evaluation of Amazon videos using different classifiers (precision, recall, F1-score, AUC, accuracy)
Table 4.6 Performance analysis of proposed and existing classifiers for Amazon shoes
Table 4.6 shows the comparative analysis between PCHOT and the other existing classifiers: Gaussian naïve Bayes, random forest, and decision tree. A detailed graphical analysis is shown in figure 4.36.
Figure 4.36 Performance evaluation of shoes using different classifiers (precision, recall, F1-score, AUC, accuracy)
Table 4.7 Performance analysis of proposed system and existing classifiers for Amazon
mobile and accessories
Cell Phones & Accessories Precision Recall F1-Score AUC Accuracy
Table 4.8 Performance analysis of proposed system and existing classifiers for Amazon electronics
Gaussian Naïve Bayes    70    55    59    53.12    55
To prove the relevancy of the proposed system and provide concrete evidence, we tried a few more datasets. For home and kitchen products, after executing the proposed system on the Amazon home and kitchen dataset, the Gaussian naïve Bayes classifier achieves a precision of 60%, recall of 50%, F1-score of 53%, AUC of 46.66%, and product prediction accuracy of 50%. Similarly, the Random Forest classifier achieves a precision of 75%, recall of 77%, F1-score of 76%, AUC of 54%, and accuracy of 76.66%. After implementing random forest and naïve Bayes, we tried the most popular machine learning classifier: for the decision tree, precision is 70%, recall is 66%, F1-score is 68%, AUC is 46.26%, and accuracy is 65.71%. Finally, to demonstrate the relevancy and accuracy of our system, we executed the proposed system on the Amazon home and kitchen dataset; here most parameter values are almost similar across classifiers. For PCHOT, precision is 76.03%, recall is 79%, F1-score is 77.06%, AUC is 58.54%, and accuracy is 79%. The comparative analysis for all the products is shown in table 4.9, and the graphical analysis in figure 4.39.
Table 4.9 Performance analysis of proposed system and existing classifiers for Amazon Home
and Kitchen Products
Home_&_Kitchen Precision Recall F1-Score AUC Accuracy
Figure 4.39 Performance Evaluation of home and kitchen products using different Classifiers
Similarly, for kindle products, after executing the proposed system on the Amazon kindle dataset, the Gaussian naïve Bayes classifier achieves a precision of 83%, recall of 70%, F1-score of 84%, AUC of 68.62%, and product prediction accuracy of 70%. The Random Forest classifier achieves a precision of 68%, recall of 73%, F1-score of 71%, AUC of 44%, and accuracy of 73.33%. After implementing random forest and naïve Bayes, we tried the most popular machine learning classifier: for the decision tree, precision is 72%, recall is 77%, F1-score is 75%, AUC is 45%, and accuracy is 77%. Finally, to demonstrate the relevancy and accuracy of our system, we executed the proposed system on the Amazon kindle dataset; here most parameter values are almost similar across classifiers. For PCHOT, precision is 85%, recall is 77%, F1-score is 79%, AUC is 70.42%, and accuracy is 77%. The comparative analysis for all the products is shown in table 4.10, and the graphical analysis in figure 4.40.
Table 4.10 Performance analysis of proposed system and existing classifiers for Amazon
kindle products
Decision Tree 72 77 75 45 77
PCHOT 85 77 79 70.42 77
Figure 4.40 Performance evaluation of kindle products using different classifiers (precision, recall, F1-score, AUC, accuracy)
Similarly, for baby products, after executing the proposed system on the Amazon baby products dataset, the Gaussian naïve Bayes classifier achieves a precision of 78%, recall of 50%, F1-score of 42%, AUC of 58.33%, and product prediction accuracy of 50%. The Random Forest classifier achieves a precision of 52%, recall of 57%, F1-score of 54%, AUC of 43.65%, and accuracy of 56.66%. After implementing random forest and naïve Bayes, we tried the most popular machine learning classifier: for the decision tree, precision is 57%, recall is 60%, F1-score is 59%, AUC is 48%, and accuracy is 60%. Finally, to demonstrate the relevancy and accuracy of our system, we executed the proposed system on the Amazon baby products dataset; here most parameter values are almost similar across classifiers. For PCHOT, precision is 72.35%, recall is 77%, F1-score is 73.06%, AUC is 55.89%, and accuracy is 77%. The comparative analysis for all the products is shown in table 4.11, and the graphical analysis in figure 4.41.
Table 4.11 Performance analysis of proposed system and existing classifiers for Amazon baby products
Baby Precision Recall F1-Score AUC Accuracy
Decision Tree 57 60 59 48 60
Figure 4.41 Performance evaluation of baby products using different classifiers (precision, recall, F1-score, AUC, accuracy)
Finally, for office products, after executing the proposed system on the Amazon office products dataset, the Gaussian naïve Bayes classifier achieves a precision of 84%, recall of 60%, F1-score of 68%, AUC of 55.55%, and product prediction accuracy of 60%. The Random Forest classifier achieves a precision of 74%, recall of 80%, F1-score of 77%, AUC of 46.15%, and accuracy of 80%. After implementing random forest and naïve Bayes, we tried the most popular machine learning classifier: for the decision tree, precision is 78%, recall is 83%, F1-score is 80%, AUC is 46.77%, and accuracy is 82.85%. Finally, to demonstrate the relevancy and accuracy of our system, we executed the proposed system on the Amazon office products dataset; here most parameter values are almost similar across classifiers. For PCHOT, precision is 83.5%, recall is 85%, F1-score is 84.15%, AUC is 61.94%, and accuracy is 85%. The comparative analysis for all the products is shown in table 4.12, and the graphical analysis in figure 4.42.
Table 4.12 Performance analysis of proposed system and existing classifiers for Amazon
office products
Figure 4.42 Performance evaluation of office products using different classifiers (precision, recall, F1-score, AUC, accuracy)
Tables 4.13 and 4.14 evaluate the performance of the proposed system and the existing classifiers for four Amazon products: Amazon videos, shoes, cellphone and accessories, and electronics. The average classification accuracy of the proposed system is 77.236%, while the existing classification approaches (random forest, decision tree, and naive Bayes) achieve 69.817%, 71.696%, and 54.5%, respectively. Similarly, the average precision, recall, F-measure, and AUC of the proposed system are better than those of the existing approaches, because the proposed system effectively captures both the linear and non-linear features of the collected data and significantly preserves the quantitative relationship between high- and low-level features. The graphical comparison of the proposed and existing classification methods for these Amazon products is given in figures 4.43 and 4.44.
Table 4.13 Performance analysis of proposed system and existing classifiers by means of
precision, recall, and f-measure
Classifiers             Dataset Type               Precision (%)   Recall (%)   F1-Score (%)
Gaussian Naïve Bayes    Amazon Videos              64              60           61
                        Shoes                      96              55           66
                        Cellphone & Accessories    57              25           23
                        Electronics                70              55           59
Random Forest           Amazon Videos              52              57           54
                        Shoes                      87              93           90
                        Cellphone & Accessories    67              53           59
                        Electronics                62              70           66
Decision Tree           Amazon Videos              56              57           57
                        Shoes                      89              94           92
                        Cellphone & Accessories    63              57           60
                        Electronics                67              74           71
Table 4.13 shows the detailed analysis for the various datasets available on Amazon; here we considered the Amazon videos, shoes, cellphone and accessories, and electronics datasets. We calculated the precision, recall, and F-measure for these datasets for comparison purposes, using the different classifiers and the proposed system. A detailed graphical analysis is given in figure 4.43.
Figure 4.43 Graphical comparison of proposed and existing classifiers by means of precision,
recall, and f-measure
Table 4.14 shows the detailed analysis for the same Amazon datasets. We calculated the area under the curve and the accuracy for these datasets for comparison purposes, using the different classifiers and the proposed system. A detailed graphical analysis is given in figure 4.44.
Table 4.14 Performance analysis of proposed system and existing classifiers by means of AUC
and accuracy
Classifiers Dataset Type AUC (%) Accuracy (%)
Gaussian Naïve Bayes    Shoes                      76.31    55
                        Cellphone & Accessories    43.75    25
                        Electronics                53.12    55
Random Forest           Shoes                      50       93.33
                        Cellphone & Accessories    39.99    53.33
                        Electronics                43.75    70
Decision Tree           Shoes                      50       94.28
                        Cellphone & Accessories    41.07    57.14
                        Electronics                59.89    80
When we compared the existing classifiers with our proposed system, it was observed that our system gives better results than the other classifiers. To demonstrate this, the detailed graphical analysis is shown in figure 4.44.
Figure 4.44 Graphical comparison of proposed and existing classifiers by means of AUC and
accuracy
In addition, a comparative study of the proposed and existing classification methods was carried out for another four Amazon products: kindle, baby, office, and home and kitchen products. Here, the performance evaluation is validated with 20% testing and 80% training data. Inspecting tables 4.13 and 4.14, the proposed system performs with an average classification accuracy of 77.236%, higher than the traditional classification methods (random forest, decision tree, and naïve Bayes). The existing classification methods also achieve lower precision, recall, F-measure, accuracy, and AUC compared to the proposed classification method. The graphical comparison of the proposed and existing classification methods for these products is given in figures 4.45 and 4.46.
Figure 4.45 Graphical comparison of proposed and existing classifiers by means of precision, recall, and f-measure (kindle, home and kitchen, office, and baby products)
Random forest, decision tree, and naïve Bayes are the machine learning classifiers used to predict the most relevant products in I-Commerce; many service providers such as Amazon, Trivago, and Shopclues use these classifiers. To illustrate the analysis, we took different Amazon datasets and applied all of the above classifiers together with the proposed system, as shown in table 4.15. A detailed graphical analysis is shown in figure 4.45.
Table 4.15 Performance analysis of proposed system and existing classifiers by means of
precision, recall, and f-measure
Classifiers Dataset Type Precision (%) Recall (%) F1-Score (%)
Gaussian Naïve Bayes    Kindle             83       70    84
                        Baby               78       50    42
                        Office Products    84       60    68
Random Forest           Kindle             68       73    71
                        Baby               52       57    54
                        Office Products    74       80    77
Decision Tree           Kindle             72       77    75
                        Baby               57       60    59
                        Office Products    78       83    80
PCHOT                   Kindle             85       77    79
                        Baby               72.35    77    73.06
                        Office Products    83.5     85    84.15
Table 4.16 Performance analysis of proposed system and existing classifiers by means of AUC and accuracy
Classifiers             Dataset Type       AUC (%)    Accuracy (%)
Gaussian Naïve Bayes    Kindle             68.62      70
                        Baby               58.33      50
                        Office Products    55.55      60
Random Forest           Kindle             44         73.33
                        Baby               43.65      56.66
                        Office Products    46.15      80
Decision Tree           Kindle             45         77
                        Baby               48         60
                        Office Products    46.77      82.85
PCHOT                   Kindle             70.42      77
                        Baby               55.89      77
                        Office Products    61.94      85
The area under the curve and accuracy are shown in table 4.16, and a detailed graphical analysis is given in figure 4.46. It has been observed that accuracy increases for almost all products with the proposed system compared to the existing classifiers.
Figure 4.46 Graphical comparison of proposed and existing classifiers by means of AUC and
accuracy
In this research study, a new recommendation system was developed that recommends products more accurately by analyzing the reviews users post for them. The main motivation behind this experiment was to develop a proper keyword extraction method and clustering approach for classifying products for customers into positive and negative forms using the Amazon customer review dataset. In this scenario, a keyword extraction method (LDA) along with a modified GWO algorithm is used for selecting appropriate keywords, and the similar keywords obtained are clustered using the PFCM algorithm. The developed automated recommendation system provides numerous advantages, such as the ability to identify fake products and to track overall customer satisfaction. Compared to existing classifiers, the proposed system delivered effective performance in both the quantitative and the comparative analysis: from the experimental analysis, the proposed system achieved an average classification accuracy of around 77.236%, whereas the existing methodologies attained limited accuracy on the Amazon customer review dataset. In future work, an effective system will be developed to further improve the classification accuracy of product recommendation.