
Chapter 4

Prediction Based Approach for I-Commerce

4.1 Ant Colony Optimization and Particle Swarm Optimization
4.2 Genetic Algorithms
4.3 Prediction Based PCHOT for I-Commerce
    4.3.1 Data Collection
    4.3.2 Data Preprocessing
    4.3.3 Keyword Extraction using Latent Dirichlet Allocation
    4.3.4 Keyword Optimization using Hybrid Optimization Algorithm
    4.3.5 Similar Data Clustering using Possibilistic Fuzzy C-Means
4.4 Results
4.5 Quantitative Analysis

Implementation is the action that follows planning: it executes the plan decided upon by the researcher. It covers the execution, methodology, design standards, and process of the research, and it fulfils the objectives by supporting the hypothesis defined in Chapter 3. Objectives are very hard to achieve without implementation; it is the platform on which the research idea is demonstrated, and the implementation phase leads to a proper conclusion by producing results and deliverables.

Data is the most essential part of any implementation. In our research we have used secondary data, taken from the Amazon website and hosted by Stanford University. With permission from Stanford University, granted by J. McAuley (2013), we accessed several Amazon product datasets and used them in our project. The dataset contains parameters such as product ID, title, price, user ID, reviewer profile name, helpfulness score, review score, summary, and review text.

Data taken directly from a website may contain unfiltered content and is not ready for a tf.keras.Sequential model. To address this we applied several preprocessing methods. Initially we filtered the data line by line, then categorized it into two parts by calculating the review score: reviews with a score greater than 2.5 are appended to the positive data and retained for further processing, while the remaining data is discarded.

In sentiment analysis, finding the positivity score of a text requires training on previously labelled values. To achieve this objective in our project we used a Keras model. Keras models are built on deep learning (neural network) concepts and are organized into three kinds of layers: an input layer, hidden layers, and an output layer. The number of nodes in the hidden layers can be increased as required. Keras is an API used to build deep neural networks; its main advantages are that it is user-friendly, modular, composable, and easy to extend. It can be imported as tf.keras, where tf is TensorFlow. To design a simple model with Keras we import the Sequential model, the most common way to implement a stack of layers. Keras layers expose constructor arguments such as activation, kernel_initializer and bias_initializer, and kernel_regularizer or bias_regularizer.

layers.Dense(32, kernel_regularizer=tf.keras.regularizers.l1(0.01))

In the snippet above, a linear layer of 32 units is created with an L1 regularization factor of 0.01 applied to the kernel matrix. Once the model is created it is compiled with three important parameters: optimizer, loss, and metrics. The optimizer specifies the training procedure, the loss (here mean squared error) is the quantity being minimized, and the metrics are used to monitor training.
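To make the build-and-compile step concrete, the following is a minimal sketch; the two-layer architecture and the choice of the Adam optimizer are our illustrative assumptions, not the exact thesis configuration.

# Minimal sketch: build and compile a small Sequential model.
# Layer sizes and the Adam optimizer are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(32, activation='relu',
                 kernel_regularizer=tf.keras.regularizers.l1(0.01)),
    layers.Dense(1, activation='sigmoid')   # binary positive/negative output
])

# Compile with the three parameters discussed above: optimizer, loss, metrics.
model.compile(optimizer='adam',
              loss='mse',                   # mean squared error, as stated above
              metrics=['accuracy'])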

In Keras, the Tokenizer class is used to tokenize data, and we used the same class to preprocess the data in our research. It vectorizes a text corpus by converting each text into a sequence of integers or a vector, based on word counts, tf-idf, or a binary indicator. Its signature is as given below:

keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?\t\n', lower=False, split=' ', char_level=False, oov_token=None, document_count=0)

It takes various arguments such as num_words (the maximum number of words to keep), filters (characters to strip), a Boolean lower (whether to convert text to lowercase), split (the separator for splitting words), char_level, and oov_token. Once the data is preprocessed with Keras, the output is passed to NMF-LDA. The Latent Dirichlet Allocation technique is used to remove unwanted data from the text. After removing all stopwords, irrelevant data, and unspecified characters, we obtained a final, clean, preprocessed dataset.
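A short sketch of this vectorization step follows; the demo review strings are placeholders and the vocabulary size is an illustrative assumption.

# Minimal sketch of Tokenizer usage; demo strings are placeholders.
from tensorflow.keras.preprocessing.text import Tokenizer

review_text = ["great product, works well", "poor quality, do not buy"]  # demo reviews

tokenizer = Tokenizer(num_words=20000, oov_token='<OOV>')
tokenizer.fit_on_texts(review_text)                    # build the word index
sequences = tokenizer.texts_to_sequences(review_text)  # texts -> integer vectors
print(sequences)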

At the initial level we implemented a Naïve Bayes model to find the accuracy, precision, recall, f1-score, and support. Then, on the same dataset, we implemented random forest and decision tree algorithms; this task was performed for comparative analysis.
Figure 4.1 Detailed flow of research
As shown in figure 4.1, the research then proceeds in further stages. The actual implementation of our research starts with the Grey Wolf Optimization algorithm. At the first level we implemented GWO; then, by writing two further functions, crossover and mutation, we implemented a genetic algorithm. In our project crossover is performed between the beta and delta wolves, and mutation is then performed on the same wolves. If the resulting value is better than the normal search result, it replaces it. In this way we extracted the most optimized keywords from the dataset. The final stage of the research predicts the most relevant products for customers: we applied the parallel (Possibilistic) Fuzzy C-Means algorithm, a parallel clustering algorithm used to find the most relevant products. Finally, by generating a contingency matrix, we calculated the output parameters: accuracy, precision, recall, f1-score, and support.

4.1 Ant Colony Optimization and Particle Swarm Optimization


AACO is used to overcome a conventional problem in the Ant Colony Optimization technique. In ant colony optimization, artificial ants search for paths according to the pheromones deposited by earlier ants: a path with a higher pheromone concentration is more likely to be selected. This leads to an uneven distribution of pheromones, so after some iterations the high-pheromone path dominates as the global optimum path, and it becomes difficult to find paths with a low initial pheromone concentration. To solve this particular problem, AACO is divided into two stages: an early stage and an elaborate stage. In the early stage, ants are divided into three groups: ordinary, abnormal, and random. Ordinary ants search for paths with probability proportional to the pheromone concentration; abnormal ants search with probability inversely proportional to it; and random ants search without any heuristic knowledge. Low-concentration paths can thus be found easily by the random and abnormal ants. Hence, AACO has better goal-searching ability.

PSO is a computational method that optimizes candidate solutions with respect to a measure of quality. In this technique, candidates apply simple mathematical formulae to find the locally best solution in the search space. Each particle's movement is influenced by its own best-known position, but it is also guided towards the best-known position updated by the other particles, so the swarm as a whole is expected to move towards the best solution. PSO relies on the concepts of parameter selection and convergence; parameter selection can be carried out using fuzzy logic. The basic variant of PSO works with a population (swarm) of candidate solutions (particles).

To evaluate PSO against AACO we considered the Traveling Salesman Problem (TSP) in both scenarios. Here a salesman has to visit all cities in minimum time, so we need to design an approach by which the salesman can do so. The detailed procedure for the execution is given as follows.

Procedure: AACO_MetaHeuristic
    while (not_termination)
        artificial_ant()
        organic_ant()
        random_ant()
        abnormal_ant()
        generate_solutions()
        daemon_actions()
        pheromone_update()
    end while
end procedure

For edge selection, ordinary ants move from state 'x' to state 'y'. In each iteration, an ordinary ant k computes the set A_k(x) of feasible moves. The probability P_xy^k that ant k moves from state x to state y depends on the attractiveness η_xy and the pheromone trail τ_xy of the move. Trails are updated after the low-pheromone-concentration path solutions have been passed to the ordinary ants and all ants have updated their solutions. The probability of the kth ant moving from state x to state y is

$$P_{xy}^{k} = \frac{\tau_{xy}^{\alpha}\,\eta_{xy}^{\beta}}{\sum_{z \in \mathrm{allowed}_y} \tau_{xz}^{\alpha}\,\eta_{xz}^{\beta}} \qquad \ldots (4.1)$$

where τ_xy is the amount of pheromone deposited on the move from state x to y, and α and β are parameters used to control the influence of the trail level τ_xz and the attractiveness η_xz over all possible moves z.

To update the paths, the trails are refreshed once all ants have completed their tours:

$$\tau_{xy} \leftarrow (1-\rho)\,\tau_{xy} + \sum_{k} \Delta\tau_{xy}^{k} \qquad \ldots (4.2)$$

where ρ is the pheromone evaporation coefficient and Δτ_xy^k is the amount of pheromone deposited by the kth ant.
At the initial level, to achieve our objective we implemented the same concept for solving the Travelling Salesman Problem. The steps involved in solving TSP are:
1) Visit each city exactly once.
2) A more distant city has less chance of selection, in terms of both trail strength and visibility.
3) If the pheromone concentration between two cities is higher, the probability of selecting that path is higher.
4) Once the journey is completed, ants deposit pheromone on all the paths; if a path is the shortest and the concentration of deposited pheromone is high, it is selected as the best path.
5) Paths with less pheromone evaporate after the best path is selected.
AACO can be used in I-Commerce, but its accuracy and relevancy for product prediction are lower, and escaping local optima is more difficult than with other metaheuristics. We therefore implemented another technique, particle swarm optimization.

PSO is a computational method that optimizes a solution to a problem by iterating to improve its quality. It works on a population, called a swarm, of candidate solutions (particles). The particles move through the search space, guided by their own best-known positions. When an improved position is discovered, it is propagated to the whole swarm and the process repeats until the best solution is reached. PSO relies on two main principles: communication and learning. To find updated solutions in the search space it communicates with the other particles, and while communicating it emphasizes learning in order to find a stochastic solution.

For every particle i in the search space we initialize its position with a uniformly distributed random vector:

$$x_i \sim U(b_{lo}, b_{up}), \quad i = 1, \ldots, S \qquad \ldots (4.3)$$

In the first step we initialize each particle's best-known position to its initial position, p_i ← x_i. We then check all candidate solutions, and if the value at a particle's best-known position is less than that of the swarm's, we update the swarm's best-known position: g ← p_i. Once this is done, we initialize the particle's velocity:

$$v_i \sim U(-|b_{up}-b_{lo}|,\ |b_{up}-b_{lo}|) \qquad \ldots (4.4)$$


Until the termination condition is satisfied, each particle picks random numbers r_p and r_g, and we update its velocity v_{i,d} and position x_i:

$$v_{i,d} \leftarrow \omega\, v_{i,d} + \varphi_p\, r_p\, (p_{i,d}-x_{i,d}) + \varphi_g\, r_g\, (g_d-x_{i,d}) \qquad \ldots (4.5)$$

Then update the particle's position:

$$x_i \leftarrow x_i + v_i \qquad \ldots (4.6)$$

These steps are performed for all particles in the search space: if f(x_i) < f(p_i), we update the particle's best-known position, p_i ← x_i, and if additionally f(p_i) < f(g), we update the swarm's best-known position, g ← p_i.
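A hedged Python sketch of this update loop (equations 4.3-4.6) follows, minimizing a toy sphere function; the parameter values ω, φ_p, and φ_g and the test function are illustrative assumptions.

# Minimal PSO sketch following equations 4.3-4.6; parameters are assumptions.
import numpy as np

def pso(f, b_lo, b_up, n_particles=30, n_dims=2, n_iters=100,
        omega=0.7, phi_p=1.5, phi_g=1.5):
    rng = np.random.default_rng(0)
    x = rng.uniform(b_lo, b_up, (n_particles, n_dims))            # eq. 4.3
    v = rng.uniform(-abs(b_up - b_lo), abs(b_up - b_lo),
                    (n_particles, n_dims))                        # eq. 4.4
    p = x.copy()                                                  # personal bests
    g = p[np.argmin([f(xi) for xi in p])].copy()                  # swarm best
    for _ in range(n_iters):
        rp = rng.random((n_particles, n_dims))
        rg = rng.random((n_particles, n_dims))
        v = omega * v + phi_p * rp * (p - x) + phi_g * rg * (g - x)  # eq. 4.5
        x = x + v                                                    # eq. 4.6
        for i in range(n_particles):
            if f(x[i]) < f(p[i]):          # update personal best
                p[i] = x[i]
                if f(p[i]) < f(g):         # update swarm best
                    g = p[i].copy()
    return g

best = pso(lambda z: np.sum(z ** 2), b_lo=-5.0, b_up=5.0)
print(best)   # approaches the origin, the minimum of the sphere function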

In the next step we applied both algorithms to the TSP to evaluate which performs best on the graph in figure 4.2.

TSP is the traveling salesman problem, in which a salesman wants to visit all cities in minimum time. In our graph we considered 5 cities, numbered 0 to 4. The salesman starts from position 0 and travels to all cities; we also defined the distance between every pair of cities. Visiting the cities without prior knowledge is tedious, time-consuming, and a waste of effort. To avoid this problem we first designed a TSP graph to find the most relevant solution. Figure 4.2 shows the details of the TSP graph, with the cities and the distances between them. We then applied PSO and AACO to the graph to find relevant solutions.
Figure 4. 2 TSP Graph
As discussed above, AACO was applied first to the input data with 100 iterations. After running AACO on the graph, it gives the best path with minimum cost, but its execution time is longer. Detailed output is shown in figure 4.3.

([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
shorted_path: ([(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)], 9)
82.5 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop
each)

Figure 4. 3 Output of AACO


After AACO, we applied PSO to the same graph with 100 iterations and found that it reaches the best solution with minimum cost in less time than AACO. Detailed output is shown in figure 4.4, which shows each particle's best solution across the iterations. In the first iteration a particle starts at the initial position, i.e. the source, and moves towards the destination, trying to find the minimum path along the way; it records the shortest distance found and compares it with all other iterations, finally returning the best solution from source to destination. In this example the best path found for the traveling salesman problem is [0, 3, 4, 2, 1], with a best source-to-destination distance of 9.

Showing particles...

pbest: [0, 3, 4, 2, 1] | cost pbest: 9  | current solution: [0, 4, 2, 3, 1] | cost current solution: 9
pbest: [0, 2, 4, 3, 1] | cost pbest: 9  | current solution: [0, 2, 4, 3, 1] | cost current solution: 9
pbest: [0, 1, 2, 4, 3] | cost pbest: 9  | current solution: [0, 3, 2, 4, 1] | cost current solution: 9
pbest: [0, 2, 4, 3, 1] | cost pbest: 11 | current solution: [0, 2, 4, 3, 1] | cost current solution: 11
...
pbest: [0, 4, 3, 2, 1] | cost pbest: 14 | current solution: [0, 4, 3, 2, 1] | cost current solution: 14

gbest: [0, 3, 4, 2, 1] | cost: 9

18.3 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Figure 4.4 Output of PSO


Figure 4.5 Particle movement across the iterations of AACO and PSO
As shown in figure 4.5, the movement of particles in AACO is greater than in PSO. Both diagrams proceed iteration by iteration, showing the movement of particles in the graph; here we have taken the final state after 100 iterations. The final output given by PSO is more efficient and more relevant than that of AACO.

Table 4.1 Comparison of different parameters in AACO and PSO

Optimization Technique    No. of Iterations    Mean Time (ms)    Standard Deviation (ms)
AACO                      100                  206               23.7
PSO                       100                  18.3              4.86

The table above gives a comparative analysis between AACO and PSO, considering mean time and standard deviation. For 100 iterations AACO takes 206 ms to execute the task, whereas PSO completes the same task in 18.3 ms.
[Bar chart comparing the number of iterations, mean time, and standard deviation for AACO and PSO]

Figure 4.6 Comparison of Different parameters in AACO and PSO

As shown in figure 4.6, PSO gives better results than AACO: it finds relevant results more efficiently and performs the operation in less time, with a much higher throughput.

4.2 Genetic Algorithms


The genetic algorithm concept is inspired by natural selection. These adaptive heuristic search algorithms are based on the concepts of selection, crossover, and mutation, and are used to generate high-quality solutions to search and optimization problems. In a genetic algorithm, each individual acts as a chromosome; a population of such individuals is maintained in the search space. Each individual encodes a solution to the problem as a chromosome and is assigned a fitness score. Individuals with the highest fitness scores get the greatest chance to reproduce, which results in better offspring; every new generation is expected to carry better genes. Multiple iterations over the chromosomes are performed to find the optimized solution.

Once a generation is created, the algorithm works with the following operators:
1) Selection operator: gives preference to individuals with good fitness scores to pass their genes on to the next generation.
2) Crossover operator: mates two individuals; crossover sites are chosen and genes are exchanged between the mates to produce new offspring, as shown in figure 4.7.
3) Mutation operator: inserts random genes to maintain diversity in the population and avoid premature convergence.

Parent 1:  A B C D E F
Parent 2:  H I J K L M
Offspring: A M C D J L

Figure 4.7 Demonstration of the crossover function

To demonstrate the concept of the genetic algorithm, we consider an example with the equation

Z = W1X1 + W2X2 + W3X3 + W4X4 + W5X5 + W6X6

where X_{i=1..6} = (2, 4, 5, 3.5, 6, 8) and the number of generations = 1000.

Our target is to find the optimized weights that maximize this equation. To achieve this we applied a genetic algorithm with crossover and mutation; after n iterations the target reaches the optimized value, i.e. the best possible solution. We started with a population P and calculated the fitness of each member of P. After calculating the fitness values, we performed crossover to generate a new population, then performed mutation between different chromosomes and determined the fitness scores. The process repeats until the best value is found. The detailed steps are explained in the algorithm given below.
Algorithm:

1) Initialize the value of Population ‘p’ randomly


2) Calculate the fitness value of population
3) Perform the following operations
a) Select Parents from P
b) Perform Crossover and generate new population
c) Perform Mutation on New population
d) Calculate fitness score
4) Repeat step 3 until convergence on the best value (see the sketch below).
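A minimal Python sketch of these steps, maximizing Z for the inputs X given above, follows; the population size, parent count, crossover point, and mutation range are illustrative assumptions.

# Minimal GA sketch: selection, crossover, mutation on weight chromosomes.
import numpy as np

X = np.array([2, 4, 5, 3.5, 6, 8])      # inputs from the example above
rng = np.random.default_rng(0)

def fitness(pop):                        # step 2: fitness of each chromosome
    return pop @ X

pop = rng.uniform(-4, 4, (8, 6))         # step 1: random population of weights
for generation in range(1000):
    scores = fitness(pop)
    parents = pop[np.argsort(scores)[-4:]]           # step 3a: select parents
    # step 3b: single-point crossover between consecutive parents
    children = parents.copy()
    children[:, 3:] = np.roll(parents, -1, axis=0)[:, 3:]
    # step 3c: uniform mutation of one random gene per child
    for child in children:
        child[rng.integers(6)] += rng.uniform(-1.0, 1.0)
    pop = np.vstack([parents, children])             # next generation
print("Best result:", fitness(pop).max())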

After implementing the above algorithm, the output obtained at iteration 1000 is shown in figure 4.8. It has been observed that as the number of iterations increases, the fitness score increases gradually at each iteration.

Figure 4.8 Iterations vs. fitness after implementing the genetic algorithm


Generation : 999
Fitness
[4081.6434547 4079.34677048 4074.17903111 4069.14153406 4083.8559
3298
4076.45570562 4071.50814659 4074.96396531]
Best result : 4083.8559329815625
Parents
[[ 1.46029159 2.45977548 263.37652502 -2.14258678 2.87840787
343.05527867]
[ 1.46029159 2.45977548 263.44473507 -2.14258678 2.87840787
342.7360876 ]
[ 1.46029159 2.45977548 261.90225374 -2.14258678 2.87840787
343.4130529 ]
[ 1.46029159 2.45977548 262.06768103 -2.14258678 2.87840787
342.94827774]]
Crossover
[[ 1.46029159 2.45977548 263.37652502 -2.14258678 2.87840787
342.7360876 ]
[ 1.46029159 2.45977548 263.44473507 -2.14258678 2.87840787
343.4130529 ]
[ 1.46029159 2.45977548 261.90225374 -2.14258678 2.87840787
342.94827774]
[ 1.46029159 2.45977548 262.06768103 -2.14258678 2.87840787
343.05527867]]
Mutation
[[ 1.46029159 2.45977548 262.86019423 -2.14258678 2.87840787
341.73879858]
[ 1.46029159 2.45977548 263.11259489 -2.14258678 2.87840787
342.57720523]
[ 1.46029159 2.45977548 262.17605692 -2.14258678 2.87840787
343.72974045]
[ 1.46029159 2.45977548 262.7447069 -2.14258678 2.87840787
343.21904154]]
Best solution : [[[ 1.46029159 2.45977548 263.37652502 -2.142
58678 2.87840787
343.05527867]]]
Best solution fitness : [4083.85593298]

Figure 4.9 Best Fitness solution after performing 1000 iterations

As noted earlier, PSO iteratively improves a population (swarm) of candidate solutions (particles). We applied genetic algorithm techniques inside PSO to solve its convergence problem: crossover and mutation are performed between different particles to determine the best-known position, as shown in figure 4.9. The detailed steps used during execution are explained below.
Algorithm:

PSO_GA (b_lo, b_up, n_dimensions, n_particles, n_iters, Swarm_size, g)

1) Generate Initial Population


2) Perform the crossover in different particles
3) Perform Mutation
4) Continue step 2-3 till best known position
5) Determine the fitness value and stop

After implementing the above algorithm, it gives a best-known position, but the time required for this process is longer.

Best known position (g) after 1000 iterations: [1.85302998, 1.83619793, 1.44658843, 1.43851125, -0.26322183, 1.64172167]
f(g): 14.122377896369798

Figure 4.10 Best known position after 1000 iteration


The best position after executing 1000 iterations is shown in figure 4.10, together with the best value found. The best fitness per iteration is also plotted, as shown in figure 4.11.

Figure 4.11 Best fitness by 1000 Iterations


Looking at the final output of GAPSO, it gives an efficient throughput, but the time required for execution is longer than for the other algorithms. Among the optimization algorithms considered, GWO has been observed to have better convergence velocity and searching precision than artificial bee colony optimization, particle swarm optimization, cuckoo search, and ant colony optimization; Wang (2019) compared these and found that GWO gives better results than the other optimization algorithms.

4.3 Prediction Based PCHOT for I-Commerce


Considerable research has been conducted on developing web-based product recommendation systems; this research was conducted to provide recommendations on the basis of customer preferences. Our research inspects how a product RS influences the quality of customer decisions and the effort they require. The five main processes involved in this research are data collection, pre-processing, keyword extraction, keyword optimization, and similar-data clustering. Data is collected from the Amazon customer review dataset. In the initial stage, preprocessing is carried out to enhance the quality of the data, in two steps: lemmatization, and removal of stop words and URLs. Latent Dirichlet Allocation together with a hybrid optimization algorithm, i.e. PSO and a modified Grey Wolf Optimization algorithm, is applied to the preprocessed data. Finally, the Possibilistic Fuzzy C-Means algorithm is applied to the optimized keywords. The flowchart for PCHOT is shown in figure 4.12. PCHOT is a parallel clustering hybrid optimization technique in which we combine a parallel clustering algorithm with a hybrid optimization algorithm to achieve the most relevant output. We applied our algorithm to I-Commerce data taken from Amazon. PCHOT performs processing in stages: data collection, data preprocessing, keyword extraction using LDA, keyword optimization using hybrid optimization techniques, and finally data clustering and output prediction using the Possibilistic Fuzzy C-Means algorithm.
Data collection: Amazon customer reviews dataset
    ↓
Data pre-processing: lemmatization, removal of stop words and URLs
    ↓
Keyword extraction using Latent Dirichlet Allocation
    ↓
Keyword optimization using the Hybrid Optimization Technique
    ↓
Data clustering using Possibilistic Fuzzy C-Means

Figure 4.12 Workflow of PCHOT

4.3.1 Data Collection


Initially, the input data is collected from the Amazon review dataset: Amazon customer reviews are used as the input to our system. The dataset contains the customer reviews available on the Amazon website, spanning roughly 18 years and including approximately 36 million reviews up to 2013. Each customer review contains various parameters such as product ID, product information, product rating, and review text. The characteristics of the Amazon customer reviews dataset are described in table 4.2, including the time span, median number of words, number of products, number of reviews, number of users, and the number of users with more than 50 reviews.

Table 4.2 Statistics of the Amazon customer review dataset

Time span                        June 1995 - March 2013
Median no. of words per review   82
Number of reviews                34,686,770
Number of products               2,441,053
Number of users                  6,643,669
Users with > 50 reviews          56,772

Various Amazon datasets are available for research purposes; we considered a few of them for our research. As a demonstration, sample 1 shows data from the Amazon Instant Video file: for each product it contains the product ID, title, price, user ID, profile name, helpfulness, score, time, summary, and review text, together with the details of the review given by each customer.

Sample 1: Sample of the Amazon customer review dataset

4.3.2 Data Preprocessing


After collecting the Amazon customer review data, data preprocessing is performed to enhance its quality. The data usually contains noise such as stopwords and URLs, which must be removed effectively; to achieve this objective, a lemmatization technique is applied.

4.3.2.1 Lemmatization

Lemmatization is a technique used to transform a word into its dictionary form, fetching the proper lemma; each morphological variant of a word is identified using this technique. Lemmatization is closely related to stemming: it identifies the base form of 'adding' as 'add', whereas a stemmer converts 'caring' into 'car', so lemmatization is better than stemming and is a very effective technique for identifying the proper word. After preprocessing the Amazon customer review data, positive and negative labels are created from the customer ratings: a rating between 1 and 2.5 is considered a negative review and the review is discarded, while a rating between 2.5 and 5 is considered a positive review. Finally, the collected data is converted into sequences of numbers. The detailed procedure for the lemmatization is given below.

Procedure:

labels, review_text = [], []
for i in reviews:
    # each review record spans 10 lines; line (i*10)-1 holds the review text
    text = lines[(i * 10) - 1].strip('review/text: ')
    review_text.append(text)
    if review_score > 2.4:   # review_score parsed from the 'review/score:' line
        labels.append(text)

After running the above procedure, all outputs are stored in the array and the result looks like sample 2. Under the constraints given in the procedure, it yields, for each product, the product ID, title, price, user ID, profile name, helpfulness, score, time, summary, and review text, together with the details of each customer's review.
Sample 2: Data extracted after performing lemmatization

After this basic processing, the text data is tokenized. The Keras model is used to process review_text; Keras is also used to build and train the deep learning model. We considered a maximum of 20,000 features and a sequence length of 100 for the tokenize operation. At the initial stage we prepared the text data for deep learning using the following procedure.

Procedure:

Split words with text_to_word_sequence
Encode the data (hash encoding)
Apply the Tokenizer API
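A hedged sketch of this preparation step follows; the demo strings are placeholders, while the 20,000-feature and length-100 settings follow the text above.

# Sketch: tokenize and pad to fixed length; demo strings are placeholders.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

review_text = ["lovely show, watched it twice", "not worth the rental price"]

tokenizer = Tokenizer(num_words=20000)           # 20,000 features, as above
tokenizer.fit_on_texts(review_text)
sequences = tokenizer.texts_to_sequences(review_text)
padded = pad_sequences(sequences, maxlen=100)    # fixed length 100, as above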
After implementing this procedure, review_text is encoded into tagged values and the output looks as shown in figure 4.13.
Figure 4.13 Data trained using the Keras library

4.3.3 Keyword Extraction using Latent Dirichlet Allocation


The LDA approach is used to extract keywords after preprocessing the Amazon data. LDA is a probabilistic model in which each document is represented by a random mixture of latent topics, and each topic is a distribution over a fixed vocabulary of words. It is used to identify the underlying topic structure on the basis of the observed data. Words are generated in a two-phase process for each document: in the first phase a distribution over topics is randomly selected for the document, and words are then drawn from it. In LDA, words come from a vocabulary indexed {1, ..., V}; a sequence of N words is denoted w = (w_1, w_2, ..., w_N) and a collection of M documents is denoted D = (w_1, w_2, ..., w_M). The LDA approach uses a three-level Bayesian graphical model, in which nodes represent random variables and edges represent possible dependencies between them.

In this three-layered representation, the parameters π and μ are estimated during the generation of the corpus. For each document, document-level topic variables are investigated, and word-level variables are examined for every word in the document. The generative process defines a joint distribution over the random variables. The probability density function of the K-dimensional Dirichlet random variable is given by equation 4.7; the joint distribution of the topic mixture and the probability of a corpus are given by equations 4.8 and 4.9.

$$p(\aleph \mid \pi) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\pi_i\right)}{\prod_{i=1}^{k}\Gamma(\pi_i)}\ \aleph_1^{\pi_1-1}\cdots\aleph_k^{\pi_k-1} \qquad \ldots (4.7)$$

$$p(\aleph, x, y \mid \pi, \mu) = p(\aleph \mid \pi)\prod_{n=1}^{N} p(x_n \mid \aleph)\, p(y_n \mid x_n, \mu) \qquad \ldots (4.8)$$

$$p(D \mid \pi, \mu) = \prod_{d=1}^{M}\int p(\aleph_d \mid \pi)\left(\prod_{n=1}^{N_d}\sum_{x_{dn}} p(x_{dn} \mid \aleph_d)\, p(y_{dn} \mid x_{dn}, \mu)\right) d\aleph_d \qquad \ldots (4.9)$$

where M is the number of documents, π is the Dirichlet parameter, N is the number of words, μ characterizes the topics, x is the per-word topic assignment, and y is the observed word.

Calculating the posterior distribution of the hidden variables given a document is the inference task. To solve this intractable problem, the LDA algorithm can be combined with Markov chain Monte Carlo, Laplace approximation, or Gibbs sampling for keyword extraction. The negative and positive words are extracted with individual weights and stored in a dictionary. In the testing phase, the test data is matched against this dictionary to obtain the negative and positive weights. After obtaining the positive and negative values, the keyword optimization process is carried out using the GWO algorithm.

In our code we initialized Latent Dirichlet Allocation and Non-Negative Matrix Factorization with the following parameters (n_components sets the number of topics):

LatentDirichletAllocation(n_components=200, max_iter=10,
                          learning_method='online',
                          learning_offset=50.,
                          random_state=0)
NMF(n_components=200, random_state=1, alpha=.1, l1_ratio=.5)
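To make this concrete, a minimal sketch of fitting such models with scikit-learn follows; the demo documents, the tiny component count, and the vectorizer settings are our assumptions (the thesis uses 200 components on the full corpus).

# Sketch: TF-IDF features fed to NMF and LDA; demo corpus is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

docs = ["great picture quality", "battery died quickly", "fast shipping"]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)          # sparse document-term matrix

nmf_features = NMF(n_components=2, random_state=1).fit_transform(tfidf)
lda_features = LatentDirichletAllocation(n_components=2, max_iter=10,
                                         learning_method='online',
                                         learning_offset=50.,
                                         random_state=0).fit_transform(tfidf)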

Then, after applying the stopword algorithm, we extracted the features from the input file, and from the text-feature file we fetched the sparse matrix. The output looks as shown in figure 4.14.

(0, 3069) 0.19343532464735028


(0, 2410) 0.21448656458735263
: :
(199, 464) 0.11058967197341349
(199, 456) 0.11058967197341349
(199, 447) 0.11058967197341349

Figure 4.14 Extracted text feature after applied NMF


By applying LDA to the text features, we extracted the transformed, sorted array shown in figure 4.15. The fitted model reports its parameters as:

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None, evaluate_every=-1,
    learning_decay=0.7, learning_method='online', learning_offset=50.0,
    max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001, n_components=200,
    n_jobs=None, perp_tol=0.1, random_state=0, topic_word_prior=None,
    total_samples=1000000.0, verbose=0)

array([[8.51296240e-04, 8.51296240e-04, 8.51296240e-04, ...,


8.51296240e-04, 8.51296240e-04, 8.30592048e-01],
[5.34010707e-04, 5.34010707e-04, 5.34010707e-04, ...,
5.34010707e-04, 5.34010707e-04, 8.93731869e-01],
[5.37285100e-04, 5.37285100e-04, 5.37285100e-04, ...,
5.37285100e-04, 5.40873251e-04, 8.93076677e-01],
...,
[4.41291744e-04, 4.41291744e-04, 4.41291744e-04, ...,
4.41291744e-04, 4.41291919e-04, 9.12182943e-01],
[4.74791349e-04, 4.74791349e-04, 4.74791349e-04, ...,
4.74791349e-04, 4.74791349e-04, 9.05516522e-01],
[5.32159553e-04, 5.32159553e-04, 5.32159553e-04, ...,
5.32159553e-04, 5.32161036e-04, 8.94100247e-01]])

Figure 4.15 Sorted Array using LDA

Once the sorted values are extracted, it is necessary to optimize the keywords to fetch relevant output. For this purpose we combined the Grey Wolf Optimization algorithm with a modified grey wolf optimization algorithm, performing crossover between the beta and delta wolves and then adopting uniform mutation.

4.3.4 Keyword optimization using Hybrid Optimization Algorithm


GWO is a metaheuristic algorithm based on the hunting mechanism and social hierarchy of grey wolves. The algorithm has three phases: encircling prey, hunting prey, and attacking prey. To model the leadership hierarchy of the wolves, the best solution is called alpha, the second best beta, and the third best delta; the remaining wolves are called omega. In the initial phase, the grey wolves encircle the prey during the hunt; this encircling behaviour is simulated using equations 4.10 and 4.11:

$$\vec{D} = \left|\vec{C}\cdot\vec{X}_{prey}(t) - \vec{X}_{wolf}(t)\right| \qquad \ldots (4.10)$$

$$\vec{X}_{wolf}(t+1) = \vec{X}_{prey}(t) - \vec{A}\cdot\vec{D} \qquad \ldots (4.11)$$

where t is the current iteration, $\vec{X}_{prey}$ is the position vector of the prey, $\vec{A}$ and $\vec{C}$ are coefficient vectors, and $\vec{X}_{wolf}$ is the position vector of a grey wolf. The vectors $\vec{A}$ and $\vec{C}$ are determined using equations 4.12 and 4.13:

$$\vec{A} = 2\vec{a}\cdot\vec{r}_1 - \vec{a} \qquad \ldots (4.12)$$

$$\vec{C} = 2\vec{r}_2 \qquad \ldots (4.13)$$

In the equations above, $\vec{a}$ is decreased linearly from two to zero over the course of the iterations, and $\vec{r}_1$ and $\vec{r}_2$ are random vectors in the interval [0, 1]. The hunt is guided primarily by the alpha value; intermittently the beta and delta values may also participate.

To mimic the hunting behaviour of the wolves, the three best solutions, alpha, beta, and delta, are used, and the remaining search agents (the omegas) are obliged to update their positions based on equations 4.14-4.20:

$$\vec{D}_{alpha} = |\vec{C}_1\cdot\vec{X}_{alpha} - \vec{X}| \qquad \ldots (4.14)$$

$$\vec{D}_{beta} = |\vec{C}_2\cdot\vec{X}_{beta} - \vec{X}| \qquad \ldots (4.15)$$

$$\vec{D}_{delta} = |\vec{C}_3\cdot\vec{X}_{delta} - \vec{X}| \qquad \ldots (4.16)$$

$$\vec{X}_1 = \vec{X}_{alpha} - \vec{A}_1\cdot\vec{D}_{alpha} \qquad \ldots (4.17)$$

$$\vec{X}_2 = \vec{X}_{beta} - \vec{A}_2\cdot\vec{D}_{beta} \qquad \ldots (4.18)$$

$$\vec{X}_3 = \vec{X}_{delta} - \vec{A}_3\cdot\vec{D}_{delta} \qquad \ldots (4.19)$$

$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \qquad \ldots (4.20)$$
Pseudocode of the GWO algorithm

Begin
    Initialize the parameters: population size popsize, upper bounds ub and
    lower bounds lb of the variables, and maximum number of iterations maxiter;
    Generate the initial positions of the grey wolves using lb and ub;
    Initialize a, A, and C;
    Evaluate the fitness of each grey wolf;
        alpha := grey wolf with the first maximum fitness;
        beta  := grey wolf with the second maximum fitness;
        delta := grey wolf with the third maximum fitness;
    While k < maxiter
        For i = 1 : popsize
            Update the position of the current grey wolf using equations (4.14)-(4.20);
        End for
        Update a, A, and C;
        Evaluate the fitness of all grey wolves;
        Update alpha, beta, and delta;
        k := k + 1;
    End while
    Return alpha;
End
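A minimal Python sketch of the update loop above follows, minimizing a toy sphere function; the population size, iteration count, and test function are illustrative assumptions.

# Minimal GWO sketch following equations 4.10-4.20; parameters are assumptions.
import numpy as np

def gwo(f, lb, ub, dim=6, popsize=20, maxiter=100):
    rng = np.random.default_rng(0)
    wolves = rng.uniform(lb, ub, (popsize, dim))
    for k in range(maxiter):
        fit = np.array([f(w) for w in wolves])
        alpha, beta, delta = wolves[np.argsort(fit)[:3]]   # three best wolves
        a = 2 - 2 * k / maxiter         # 'a' decreases linearly from 2 to 0
        for i in range(popsize):
            X = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2            # eqs 4.12 and 4.13
                D = np.abs(C * leader - wolves[i])       # eqs 4.14-4.16
                X += leader - A * D                      # eqs 4.17-4.19
            wolves[i] = np.clip(X / 3, lb, ub)           # eq 4.20
    return wolves[np.argmin([f(w) for w in wolves])]

print(gwo(lambda z: np.sum(z ** 2), lb=-5.0, ub=5.0))  # approaches the zero vector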
The pseudocode above describes standard Grey Wolf Optimization. In our research, after the alpha, beta, and delta values are updated, they are further modified based on genetic algorithm concepts: we apply crossover between the beta and delta values and then perform uniform mutation. If the resultant output is optimal, the value produced by mutation is kept; otherwise alpha remains the most optimal value. The flow chart of the modified grey wolf optimizer is shown in figure 4.16.
Parameter initialization → discretize the positions and re-rank the feature index → evaluate the fitness of the selected features → save the three grey wolves alpha, beta, and delta with the first, second, and third maximum fitness → update the positions, discretize them, and re-rank the feature index → evaluate the fitness of the selected features → update alpha, beta, and delta → crossover beta and delta → mutate the crossed-over values → if the stopping criterion is not met, repeat from the position update; otherwise return the selected features of alpha, or the mutated values, as the optimal feature subset.
Figure 4.16 Flow chart of modified grey wolf optimizer
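To illustrate the crossover-and-mutation step added in this flow, a hedged sketch follows; the single-point crossover and the mutation range are our illustrative assumptions.

# Sketch: crossover between the beta and delta wolves, then uniform mutation.
import numpy as np

def crossover_mutate(beta, delta, rng):
    point = rng.integers(1, beta.size)              # single crossover point
    child = np.concatenate([beta[:point], delta[point:]])
    gene = rng.integers(child.size)                 # uniform mutation: perturb
    child[gene] += rng.uniform(-1.0, 1.0)           # one randomly chosen gene
    return child

rng = np.random.default_rng(0)
beta, delta = rng.uniform(-1, 1, 6), rng.uniform(-1, 1, 6)
candidate = crossover_mutate(beta, delta, rng)
# keep the candidate only if it scores better than alpha, as described above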


After implementing the above algorithm, the optimized keywords are extracted. The algorithm runs for 200 iterations and the output then looks as shown in figure 4.17.

15
Topic #199:
0.28099209761443633
0.04816120212628072
0.06776271578954335
0.2336600805724257
0.3461654111245941
0.15508862677419016
0.04945173247087492
0.08901404696666354
0.18972953823639033
0.09903705805535212
0.10603221225446763
0.10802570740006906
0.06770453593432929
0.022904244084102215
0.23832149874454583
0.061186685647167116
0.11573850604791097
0.10135808948088434
0.13832359622052065
0.08443257813829846
[0.24412666 0.25848506 0.2594024 0.27212978 0.27501555 0.286257
41
0.28992477 0.29195642 0.30241421 0.30925507 0.31932865 0.323855
12
0.3278822 0.34387091 0.35922882]
15
14 13 10 000 100 141 14t 10th 00uhr 1001 11 01 04 12 00
Figure 4.17 Feature Extraction after applied MGWO
4.3.5 Similar Data Clustering using Possibilistic Fuzzy C-Means
Once the keywords are extracted, similar data is clustered by applying the PFCM approach. Clustering is a technique that creates groups of similar elements and is used to identify hidden groups accurately. PFCM is an unsupervised approach, so the clustering requires no historical knowledge of inputs and outputs. In our research, the PFCM approach is used to cluster by membership grade: PFCM considers every object as a member of each cluster with a variable degree of "membership function". In this scenario, PFCM depends on the minimization of an objective function, which is mathematically represented in equations 4.21, 4.22, and 4.23.

$$J_{PFCM}(U, T, V) = \sum_{i=1}^{C}\sum_{j=1}^{n}\left(u_{ij}^{\,m} + t_{ij}^{\,\eta}\right) d^{2}(x_j, v_i) \qquad \ldots (4.21)$$

where

$$\sum_{i=1}^{c}\mu_{ij} = 1,\ \forall j \in \{1,\ldots,n\} \qquad \ldots (4.22)$$

$$\sum_{j=1}^{n} t_{ij} = 1,\ \forall i \in \{1,\ldots,c\} \qquad \ldots (4.23)$$

where $V$ is the vector of cluster centres, $T$ the typicality matrix, $U$ the partition matrix, and $J_{PFCM}$ the objective function.

In our research this function is computed over the corpus, i.e. from the number of clusters and the degrees of membership, where the cluster centres and degrees of membership are given by equations 4.24, 4.25, and 4.26:

$$\mu_{ij} = \left[\sum_{k=1}^{c}\left(\frac{d_{x_j,v_i}}{d_{x_j,v_k}}\right)^{\!\frac{2}{m-1}}\right]^{-1}, \quad 1 \le i \le c,\ 1 \le j \le n \qquad \ldots (4.24)$$

$$t_{ij} = \left[\sum_{k=1}^{c}\left(\frac{d_{x_j,v_i}}{d_{x_j,v_k}}\right)^{\!\frac{2}{\eta-1}}\right]^{-1}, \quad 1 \le i \le c,\ 1 \le j \le n \qquad \ldots (4.25)$$

$$v_i = \frac{\sum_{k=1}^{n}\left(u_{ik}^{\,m} + t_{ik}^{\,\eta}\right) x_k}{\sum_{k=1}^{n}\left(u_{ik}^{\,m} + t_{ik}^{\,\eta}\right)}, \quad 1 \le i \le c \qquad \ldots (4.26)$$

where $c$ is the number of cluster centres and $n$ the number of data points described by the coordinates $(x_j, v_i)$, which are used to calculate the distances between cluster centres and data points.
PFCM constructs possibilities and memberships with the usual function prototypes and a cluster centre for every cluster. The choice of objective function is an important aspect of analyzing cluster performance; this method is used to accomplish better clustering, and the clustering performance depends on the objective function. To develop an effective objective function, the following requirements are considered:
 Distances between the clusters are minimized.
 Distances between data points and the clusters to which they are allocated are minimized.
The desirability between the clusters and the data is modelled by the objective function. Further, the PFCM objective function is improved by applying prototype-driven learning of the parameter α. The learning process depends on the exponential strength between the clusters, and α is updated in every iteration; it is given by equation 4.27.
$$\alpha = \exp\!\left(-\min_{i \neq k}\frac{\|v_i - v_k\|^{2}}{\beta}\right) \qquad \ldots (4.27)$$

where $\beta$ is the sample variance, given in equation 4.28:

$$\beta = \frac{\sum_{j=1}^{n}\|x_j - \bar{x}\|^{2}}{n}, \qquad \bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} \qquad \ldots (4.28)$$

A weight parameter is then introduced to determine a common value of $\alpha$. Each point of the database carries a weight in relation to each cluster; using the weight function delivers a proper classification outcome, especially in the case of noisy data. The general weight function is calculated using equation 4.29:

$$w_{ji} = \exp\!\left(-\frac{\|x_j - v_i\|^{2}}{\left(\sum_{j=1}^{n}\|x_j - \bar{v}\|^{2}\right)\times c/n}\right) \qquad \ldots (4.29)$$

where $w_{ji}$ is the weight function of point $j$ with respect to class $i$. The detailed procedure for calculating the centroid value is shown in figure 4.18: we initialize the clusters from the given dataset, then calculate the similarity index within each cluster and find the typicality matrix and membership matrix, as shown in figures 4.20 and 4.21.
Initialization of clusters

Calculation of similar clusters

Calculation of typicality matrix

Calculation of membership matrix

Update centroid

Figure 4.18 Step by step procedure of PFCM


 Initialization: the number of clusters is initially assumed by the user.
 Calculation of similarity distance: after assuming the number of clusters, evaluate the distance between the data points and centroids for each segment.
 Calculation of typicality matrix: after estimating the distance matrix, the typicality matrices obtained from PFCM are evaluated.
 Calculation of membership matrix: evaluate the membership matrix $M_{ik}$ by assessing the membership value of each data point gathered from PFCM.
 Update centroid: after generating the clusters, the centroids are updated.

This process is repeated until the updated centroid of each cluster becomes identical in successive iterations. The detailed procedure for PFCM is outlined below.

Steps in PFCM
Define the function pfcm
Define pstepfcm
Define the Distance FCM function
Define Predict_PFCM function
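A hedged sketch of one PFCM update pass following the form of equations 4.24-4.26 is given below; the exponents m and η and the random demo data are assumptions, not the thesis configuration.

# Sketch: one PFCM update pass (memberships, typicalities, centroids).
import numpy as np

def pfcm_step(X, V, m=2.0, eta=2.0):
    # squared distances d^2(x_j, v_i) between points (rows) and centroids
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12    # (n, c)
    ratio = d2[:, :, None] / d2[:, None, :]           # d(x_j,v_i) / d(x_j,v_k)
    U = 1.0 / (ratio ** (1.0 / (m - 1))).sum(axis=2)     # membership, eq. 4.24
    T = 1.0 / (ratio ** (1.0 / (eta - 1))).sum(axis=2)   # typicality, eq. 4.25
    w = U ** m + T ** eta                                # combined weights
    V_new = (w.T @ X) / w.sum(axis=0)[:, None]           # centroids, eq. 4.26
    return U, T, V_new

rng = np.random.default_rng(0)
X = rng.random((200, 2))                        # demo data points
V = X[rng.choice(len(X), 2, replace=False)]     # two initial centroids
for _ in range(50):                             # iterate until centroids stabilize
    U, T, V = pfcm_step(X, V)
labels = U.argmax(axis=1)                       # cluster membership, as in fig. 4.20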

After executing the above code, it gives the objective function, the cluster membership values, and the centroids.
0.50035725, 0.49964654, 0.50096315, 0.50007308, 0.50129539
,
0.50089867, 0.50097514, 0.50056768, 0.49994451, 0.49940925,
0.4994884, 0.49905699, 0.49934772, 0.49966253, 0.4996249,
0.50028074, 0.49891913, 0.50053236, 0.4993378, 0.50122665,
0.50022896, 0.49932323, 0.49988482, 0.49957806, 0.5004187,
0.49993463, 0.49989893, 0.50003675, 0.50087154, 0.49936221,
0.5011381, 0.50016753, 0.49909664, 0.49862943, 0.49985147,
0.50062837, 0.49938457, 0.49874154, 0.49898137, 0.5003785,
0.50019572, 0.49949399, 0.49940813, 0.49880648, 0.50033374,
0.49954368, 0.50026627, 0.50065016, 0.49975106, 0.50003185,
0.49980456, 0.49940109, 0.50192776, 0.49919474, 0.49923926,
0.49978018, 0.49950598, 0.4993499, 0.50056272, 0.4998832,
0.50013379, 0.50007046, 0.49876166, 0.50116598, 0.49960517,
0.49928816, 0.499926, 0.50109471, 0.50077733, 0.49948783,
0.50016189, 0.50005007, 0.4997749, 0.50098796, 0.49999751,
0.50043257, 0.49988557, 0.49994281, 0.49996019, 0.49961383,
0.49910133, 0.49902486, 0.49943232, 0.50005549, 0.50059075,
0.5005116, 0.50094301, 0.50065228, 0.50033747, 0.5003751,
0.49971926, 0.50108087, 0.49946764, 0.5006622, 0.49877335,
0.49977104, 0.50067677, 0.50011518, 0.50042194, 0.4995813,
0.50006537, 0.50010107, 0.49996325, 0.49912846, 0.50063779,
0.4988619, 0.49983247, 0.50090336, 0.50137057, 0.50014853,
0.49937163, 0.50061543, 0.50125846, 0.50101863, 0.4996215,
0.5007717, 0.50093994, 0.50045122, 0.50098485, 0.4985472,
0.49976564, 0.50042099, 0.50044988, 0.49938412, 0.49967631,
0.49968725, 0.50031574, 0.49957849, 0.50039597, 0.50004027,
0.50144354, 0.4993567, 0.49989254, 0.49921724, 0.5002391,
0.499521, 0.50036575, 0.49998441, 0.50007516, 0.49896532,
0.49970514, 0.50087323, 0.50014874, 0.4995155, 0.5005407,
0.49942135, 0.49876484, 0.49971073, 0.49974469, 0.49968295,
0.5004633, 0.4997123, 0.49922607, 0.49990858, 0.49921526,
0.499521, 0.50036575, 0.49998441, 0.50007516, 0.49896532,
0.49970514, 0.50087323, 0.50014874, 0.4995155, 0.5005407,
0.49942135, 0.49876484, 0.49971073, 0.49974469, 0.49968295,
0.5004633, 0.4997123, 0.49922607, 0.49990858, 0.49921526,
0.50035768, 0.50097162, 0.50060451, 0.50060917, 0.49953305,
0.4998682, 0.49973057, 0.49940346, 0.49985729, 0.50039173]]

Figure 4.19 Objective function calculated using PFCM


([1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1
,
1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1
,
1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1
,
0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0
,
0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1
,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1
,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1
,
0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0
,
0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0
,
0, 1], dtype=int64)

Figure 4.20 Cluster Membership calculated using PFCM

[[0.31013219, 0.3113137, 0.31077808, 0.31195162, 0.31577447,


0.31374068, 0.31431634, 0.31560903, 0.3138556, 0.31173575,
0.31338595, 0.31491368, 0.31193294, 0.31442465, 0.31182336]
,
[0.31008121, 0.31137137, 0.31074971, 0.31162773, 0.31551896,
0.31372008, 0.31415551, 0.31515127, 0.31385739, 0.31156399,
0.31293067, 0.31477319, 0.3117103, 0.31434503, 0.31150874]]

Figure 4.21 Centroid calculated using PFCM

4.4 Results
In the final step we calculated relevancy, accuracy, and the other parameters with the help of the contingency matrix. A contingency table, also called a cross table, displays the frequency distribution of the data variants; we used the contingency matrix to define all of the performance measures. A performance measure is the regular measurement of outcomes and results, producing reliable information about the effectiveness of the proposed system; it is the process of collecting, reporting, and analyzing information about the performance of a group or an individual. The mathematical equations for accuracy, f-measure, precision, and recall are given in equations 4.30, 4.31, 4.32, and 4.33.

$$Accuracy = \frac{TN + TP}{TP + TN + FN + FP} \times 100 \qquad \ldots (4.30)$$

$$F\text{-}measure = \frac{2\,TP}{2\,TP + FP + FN} \times 100 \qquad \ldots (4.31)$$

$$Precision = \frac{TP}{FP + TP} \times 100 \qquad \ldots (4.32)$$

$$Recall = \frac{TP}{FN + TP} \times 100 \qquad \ldots (4.33)$$

where $TP$ denotes true positives, $TN$ true negatives, $FP$ false positives, and $FN$ false negatives.
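As a quick check, a short sketch computing equations 4.30-4.33 from contingency-matrix counts follows; the example counts are illustrative, chosen to roughly reproduce the class-1 row of table 4.3.

# Sketch: metrics from contingency-matrix counts (eqs 4.30-4.33).
def classification_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn) * 100   # eq. 4.30
    f_measure = 2 * tp / (2 * tp + fp + fn) * 100       # eq. 4.31
    precision = tp / (fp + tp) * 100                    # eq. 4.32
    recall    = tp / (fn + tp) * 100                    # eq. 4.33
    return accuracy, f_measure, precision, recall

print(classification_metrics(tp=7, tn=2, fp=1, fn=5))
# -> accuracy 60.0, f-measure 70.0, precision 87.5, recall ~58.3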

Table 4.3 Classification report for the proposed system

                Precision   Recall   F1-Score   Support
0               0.29        0.67     0.40       3
1               0.88        0.58     0.70       12
Micro Avg       0.60        0.60     0.60       15
Macro Avg       0.58        0.62     0.55       15
Weighted Avg    0.76        0.60     0.64       15

The experimental results and discussion of the proposed system are detailed in this section, along with the experimental setup, performance measures, quantitative analysis, and comparative analysis. The proposed system was implemented in Python (Anaconda) on a machine with 8 GB RAM and a 3.0 GHz Intel i7 processor. Its performance was compared with other existing classifiers on the Amazon customer review dataset, in order to assess the effectiveness and efficiency of the proposed system, evaluated by means of precision, recall, accuracy, f-measure, and AUC. Table 4.3 gives all the values calculated with the formulas above; a detailed graphical analysis is shown in figure 4.22.

[Bar chart of the classification report: precision, recall, F1-score, and support for classes 0 and 1 and for the micro, macro, and weighted averages]

Figure 4.22 Classification Report


With the growth of I-Commerce, a real-time clustering-based recommendation system has been implemented. Traditional clustering has many disadvantages, and it is hard to obtain optimal solutions for large datasets. The primary objective of the research is to develop a new prediction-based parallel clustering algorithm using an optimization technique that identifies users' needs and predicts results matching their requirements. We built the recommendation system on a standard dataset, the Amazon customer review dataset. After collecting the data, preprocessing is performed in three ways: lemmatization, removal of stop words, and removal of URLs from the Amazon data. Preprocessing is then followed by an effective topic modeling approach, LDA. The extracted values of the test data are used for product recommendation via the PFCM algorithm, an effective algorithm for analyzing the most relevant membership values.

The Amazon review dataset is used to determine the various parameters: precision, recall, f-measure, support, and accuracy. We ran our proposed system on the Amazon Video dataset and found that the accuracy of the proposed system is better than that of the other systems.
Figure 4.23 Amazon Video dataset results after applying the Naïve Bayes classifier
As shown in figure 4.23, applying the Naïve Bayes classifier to the Amazon video review dataset gives the following output: precision 64%, recall 60%, F1-score 61%, support 20%, AUC 57.14%, accuracy 60%.

Figure 4.24 Amazon Video dataset results after applying the Random Forest classifier
As shown in figure 4.24, applying the Random Forest classifier to the Amazon video review dataset gives the following output: precision 54%, recall 57%, F1-score 54%, support 30%, AUC 43.65%, accuracy 56.66%.
Figure 4.25 Amazon Video dataset results after applying the Decision Tree classifier
As shown in figure 4.25, applying the decision tree classifier to the Amazon video review dataset gives the following output: precision 56%, recall 57%, F1-score 57%, support 35%, AUC 53.32%, accuracy 57.1%.

Figure 4.26 Amazon Video dataset results after applying the proposed system
As shown in figure 4.26, applying the proposed system to the Amazon video review dataset gives the following output: precision 63.82%, recall 66%, F1-score 61.47%, AUC 56.04%, accuracy 66%.
Table 4.4 Comparison between existing classifiers and the proposed system

Amazon Video            Precision   Recall   F1-Score   AUC     Accuracy
Gaussian Naïve Bayes    64          60       61         57.14   60
Random Forest           52          57       54         43.65   56.66
Decision Tree           56          57       57         53.32   57.14
PCHOT                   62.78       66       63.03      56.17   66

Comparative results between the different algorithms and PCHOT are shown in table 4.4. It has been observed that the results obtained from the proposed system are better than those of the other classifiers. A detailed graphical analysis is shown in figure 4.27.

[Bar chart: performance evaluation of Amazon Videos using different classifiers (Gaussian Naïve Bayes, Random Forest, Decision Tree, PCHOT) across precision, recall, F1-score, AUC, and accuracy]

Figure 4.27 Comparison between Existing Classifiers and Proposed System


Our algorithm is prediction based, so it is expected to give relevant recommendations to customers. By applying the algorithm above and mapping products to their product IDs, the final output is generated.

Figure 4.28 Predicted products

As shown in figure 4.28, the system produces a list of predicted products that can be recommended to the customer on the basis of reviews, ratings, and other parameters. B0050T2YVA is the product ID of a product in the Amazon database; using this product ID we can find the most relevant products for the customer.
To achieve our objectives and demonstrate the social impact of the system, we designed a UI for the proposed system. Initially we considered eight different Amazon review datasets; books, Kindle, videos, cameras, baby products, and clothing and accessories were among those considered for analysis. Figure 4.29 gives an impression of the user interface of the proposed system.

Figure 4.29 User Interface of Proposed System


The prediction-based parallel clustering hybrid optimization technique is used for I-Commerce to predict the most relevant products from the Amazon review dataset. As shown in figure 4.30, we applied the PCHOT algorithm to the Amazon Videos dataset; after running the algorithm, the top five most relevant results are fetched and displayed in the user interface window for Amazon Instant Videos.

Figure 4.30 User interface of the proposed system after applying PCHOT to the Amazon Videos dataset

To confirm the significance of the results, we applied the algorithm to different datasets. Figure 4.31 shows the relevant results after applying the PCHOT algorithm to the Kindle Store dataset: the top five most relevant results are fetched and displayed in the user interface window for Kindle products.

Figure 4.31 User interface of the proposed system after applying PCHOT to the Amazon Kindle dataset

To confirm the substantial results of the PCHOT algorithm, we also applied it to the Baby Products dataset. Figure 4.32 shows the relevant results: the top five most relevant results are fetched and displayed in the user interface window for baby products.

Figure 4.32 User interface of the proposed system after applying PCHOT to the Amazon Baby Products dataset

To further verify the relevancy of the PCHOT algorithm, we applied it to the Amazon Mobile Devices dataset. Figure 4.33 shows the relevant results: the top five most relevant results are fetched and displayed in the user interface window for Amazon mobile devices.

Figure 4.33 User interface of the proposed system after applying PCHOT to the Amazon Mobile Devices dataset

Finally, we applied the algorithm to the Watch dataset. Figure 4.34 shows the relevant results: the top five most relevant results are fetched and displayed in the user interface window for watches.

Figure 4.34 User interface of the proposed system after applying PCHOT to the Amazon Watch dataset
4.5 Quantitative Analysis
In the quantitative analysis section, the Amazon customer review dataset is used to evaluate the performance of the proposed system against existing classification approaches such as random forest, decision tree, and naïve Bayes. In this research analysis, the collected data are classified into two classes: positive and negative. In tables 4.13, 4.14, 4.15, and 4.16, the performance of the proposed system and the existing classification approaches is evaluated by means of precision, recall, accuracy, F-measure, and AUC. In this scenario, the performance evaluation is validated with an 80% training and 20% testing split of the data. The datasets are taken from the Stanford University website and cover categories such as books, Kindle store, videos, accessories, children's wear, movies, and camera products. Among 2,441,053 Amazon products, eight product categories are considered for experimental investigation: home and kitchen, electronics, baby products, office products, Amazon Instant Video, Kindle, mobile phones and accessories, and shoes.
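
For reference, precision, recall, F-measure, and accuracy follow directly from the confusion-matrix counts, while AUC is computed from ranked prediction scores; a minimal sketch in Python with illustrative counts (not taken from the experiments):

# Illustrative confusion-matrix counts (not from the experiments).
tp, fp, fn, tn = 40, 10, 12, 38

precision = tp / (tp + fp)                  # 0.80
recall = tp / (tp + fn)                     # ~0.77
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.78
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} f1={f1:.2f}")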

Initially, for the purpose of comparison, we determined the precision, recall, accuracy, F-measure, and AUC for the various products taken from the Amazon datasets. For most products, the accuracy, precision, recall, AUC, and F-measure of the proposed system are better than those of the other classifiers. For Amazon Videos, after executing the system on the Amazon Videos dataset, the Gaussian naïve Bayes classifier obtained a precision of 64%, a recall of 60%, an F1-score of 61%, an area under the curve of 57.14%, and a prediction accuracy of 60%. The random forest classifier obtained a precision of 52%, a recall of 57%, an F1-score of 54%, an area under the curve of 43.65%, and an accuracy of 56.66%. After random forest and naïve Bayes, we evaluated the decision tree, another widely used machine learning classifier: its precision is 56%, recall 57%, F1-score 57%, area under the curve 53.32%, and accuracy 57.14%. Finally, to demonstrate the relevancy and accuracy of our system, we executed the proposed system on the Amazon Videos dataset. It has been observed that most parameter values increase for the proposed system as compared to the other classifiers: for PCHOT, precision is 62.78%, recall 66%, F1-score 63.03%, area under the curve 56.17%, and prediction accuracy 66%. The comparative analysis is shown in table 4.5 and the graphical analysis in figure 4.35.
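
A minimal sketch of how such a classifier comparison can be reproduced with scikit-learn, assuming the review texts have already been converted into a feature matrix X with binary sentiment labels y (the placeholder data below stands in for those features; PCHOT itself is the thesis's own pipeline and is not reproduced here):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)

# Placeholder features and labels standing in for vectorized reviews.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

# 80% training / 20% testing split, as used in the experiments.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]  # scores for the ROC curve
    print(name,
          f"P={precision_score(y_te, pred):.2f}",
          f"R={recall_score(y_te, pred):.2f}",
          f"F1={f1_score(y_te, pred):.2f}",
          f"AUC={roc_auc_score(y_te, proba):.2f}",
          f"Acc={accuracy_score(y_te, pred):.2f}")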

Table 4.5 Performance analysis of proposed and existing classifiers for amazon videos
Amazon Video           Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   64         60      61        57.14  60
Random Forest          52         57      54        43.65  56.66
Decision Tree          56         57      57        53.32  57.14
PCHOT                  62.78      66      63.03     56.17  66

Figure 4.35 Performance Evaluation of Amazon Videos using different Classifiers


Similarly, for the shoes dataset, after executing the proposed system on the Amazon shoes dataset, the Gaussian naïve Bayes classifier obtained a precision of 96%, a recall of 55%, an F1-score of 66%, an area under the curve of 76.31%, and a prediction accuracy of 55%. The random forest classifier obtained a precision of 87%, a recall of 93%, an F1-score of 90%, an area under the curve of 50%, and an accuracy of 93.33%. The decision tree classifier obtained a precision of 89%, a recall of 94%, an F1-score of 92%, an area under the curve of 50%, and an accuracy of 94.28%. Finally, we executed the proposed system on the Amazon shoes dataset; here most parameter values are nearly similar across all classifiers. For PCHOT, precision is 88.3%, recall 93%, F1-score 90.59%, area under the curve 49.47%, and prediction accuracy 93.36%.

Table 4.6 Performance analysis of proposed and existing classifiers for Amazon shoes

Shoes                  Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   96         55      66        76.31  55
Random Forest          87         93      90        50     93.33
Decision Tree          89         94      92        50     94.28
PCHOT                  88.3       93      90.59     49.47  93.36

Table 4.6 shows the comparative analysis between PCHOT and the existing classifiers, namely Gaussian naïve Bayes, random forest, and decision tree. A detailed graphical analysis is shown in figure 4.36.
Figure 4.36 Performance Evaluation of shoes products using different Classifiers


Similarly, for cellphones and accessories, after executing the proposed system on the Amazon cellphone and accessories dataset, the Gaussian naïve Bayes classifier obtained a precision of 57%, a recall of 25%, an F1-score of 23%, an area under the curve of 43.75%, and a prediction accuracy of 25%. The random forest classifier obtained a precision of 67%, a recall of 53%, an F1-score of 59%, an area under the curve of 39.99%, and an accuracy of 53.33%. The decision tree classifier obtained a precision of 63%, a recall of 57%, an F1-score of 60%, an area under the curve of 41.07%, and an accuracy of 57.14%. Finally, we executed the proposed system on the Amazon mobile and accessories dataset; PCHOT achieves the best recall and accuracy among the classifiers, with a precision of 59.46%, a recall of 69%, an F1-score of 63.35%, an area under the curve of 46.82%, and a prediction accuracy of 69%. The comparative analysis is shown in table 4.7 and the graphical analysis in figure 4.37.

Table 4.7 Performance analysis of proposed system and existing classifiers for Amazon
mobile and accessories
Cell Phones & Accessories  Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes       57         25      23        43.75  25
Random Forest              67         53      59        39.99  53.33
Decision Tree              63         57      60        41.07  57.14
PCHOT                      59.46      69      63.35     46.82  69

Figure 4.37 Performance Evaluation of mobile and accessories products using different Classifiers
Similarly, for electronics products, after executing the proposed system on the Amazon electronics dataset, the Gaussian naïve Bayes classifier obtained a precision of 70%, a recall of 55%, an F1-score of 59%, an area under the curve of 53.12%, and a prediction accuracy of 55%. The random forest classifier obtained a precision of 62%, a recall of 70%, an F1-score of 66%, an area under the curve of 43.75%, and an accuracy of 70%. The decision tree classifier obtained a precision of 67%, a recall of 74%, an F1-score of 71%, an area under the curve of 44.82%, and an accuracy of 74.28%. Finally, we executed the proposed system on the Amazon electronics dataset; PCHOT achieves the highest values on all parameters, with a precision of 78.09%, a recall of 80%, an F1-score of 78.9%, an area under the curve of 59.89%, and a prediction accuracy of 80%. The comparative analysis is shown in table 4.8 and the graphical analysis in figure 4.38.

Table 4.8 Performance analysis of proposed system and existing classifiers for Amazon electronics

Electronics            Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   70         55      59        53.12  55
Random Forest          62         70      66        43.75  70
Decision Tree          67         74      71        44.82  74.28
PCHOT                  78.09      80      78.9      59.89  80


Figure 4.38 Performance Evaluation of electronics products using different Classifiers

To further prove the relevancy of the proposed system and provide concrete evidence, we tried a few more datasets. For home and kitchen products, after executing the proposed system on the Amazon home and kitchen dataset, the Gaussian naïve Bayes classifier obtained a precision of 60%, a recall of 50%, an F1-score of 53%, an area under the curve of 46.66%, and a prediction accuracy of 50%. The random forest classifier obtained a precision of 75%, a recall of 77%, an F1-score of 76%, an area under the curve of 54%, and an accuracy of 76.66%. The decision tree classifier obtained a precision of 70%, a recall of 66%, an F1-score of 68%, an area under the curve of 46.26%, and an accuracy of 65.71%. Finally, we executed the proposed system on the Amazon home and kitchen dataset; PCHOT achieves the highest values on all parameters, with a precision of 76.03%, a recall of 79%, an F1-score of 77.06%, an area under the curve of 58.54%, and a prediction accuracy of 79%. The comparative analysis is shown in table 4.9 and the graphical analysis in figure 4.39.
Table 4.9 Performance analysis of proposed system and existing classifiers for Amazon Home
and Kitchen Products
Home & Kitchen         Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   60         50      53        46.66  50
Random Forest          75         77      76        54     76.66
Decision Tree          70         66      68        46.26  65.71
PCHOT                  76.03      79      77.06     58.54  79

Figure 4.39 Performance Evaluation of home and kitchen products using different Classifiers
Similarly, for Kindle products, after executing the proposed system on the Amazon Kindle dataset, the Gaussian naïve Bayes classifier obtained a precision of 83%, a recall of 70%, an F1-score of 84%, an area under the curve of 68.62%, and a prediction accuracy of 70%. The random forest classifier obtained a precision of 68%, a recall of 73%, an F1-score of 71%, an area under the curve of 44%, and an accuracy of 73.33%. The decision tree classifier obtained a precision of 72%, a recall of 77%, an F1-score of 75%, an area under the curve of 45%, and an accuracy of 77%. Finally, we executed the proposed system on the Amazon Kindle dataset; PCHOT achieves the highest precision and AUC, with a precision of 85%, a recall of 77%, an F1-score of 79%, an area under the curve of 70.42%, and a prediction accuracy of 77%. The comparative analysis is shown in table 4.10 and the graphical analysis in figure 4.40.

Table 4.10 Performance analysis of proposed system and existing classifiers for Amazon
kindle products

Kindle                 Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   83         70      84        68.62  70
Random Forest          68         73      71        44     73.33
Decision Tree          72         77      75        45     77
PCHOT                  85         77      79        70.42  77
Figure 4.40 Performance Evaluation of kindle products using different Classifiers

Similarly, for baby products, after executing the proposed system on the Amazon baby products dataset, the Gaussian naïve Bayes classifier obtained a precision of 78%, a recall of 50%, an F1-score of 42%, an area under the curve of 58.33%, and a prediction accuracy of 50%. The random forest classifier obtained a precision of 52%, a recall of 57%, an F1-score of 54%, an area under the curve of 43.65%, and an accuracy of 56.66%. The decision tree classifier obtained a precision of 57%, a recall of 60%, an F1-score of 59%, an area under the curve of 48%, and an accuracy of 60%. Finally, we executed the proposed system on the Amazon baby products dataset; PCHOT achieves the best recall, F1-score, and accuracy, with a precision of 72.35%, a recall of 77%, an F1-score of 73.06%, an area under the curve of 55.89%, and a prediction accuracy of 77%. The comparative analysis is shown in table 4.11 and the graphical analysis in figure 4.41.

Table 4.11 Performance analysis of proposed system and existing classifiers for Amazon baby products
Baby                   Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   78         50      42        58.33  50
Random Forest          52         57      54        43.65  56.66
Decision Tree          57         60      59        48     60
PCHOT                  72.35      77      73.06     55.89  77

Figure 4.41 Performance Evaluation of baby products using different Classifiers

Finally, for office products, after executing the proposed system on the Amazon office products dataset, the Gaussian naïve Bayes classifier obtained a precision of 84%, a recall of 60%, an F1-score of 68%, an area under the curve of 55.55%, and a prediction accuracy of 60%. The random forest classifier obtained a precision of 74%, a recall of 80%, an F1-score of 77%, an area under the curve of 46.15%, and an accuracy of 80%. The decision tree classifier obtained a precision of 78%, a recall of 83%, an F1-score of 80%, an area under the curve of 46.77%, and an accuracy of 82.85%. Finally, we executed the proposed system on the Amazon office products dataset; PCHOT achieves the highest recall, F1-score, AUC, and accuracy, with a precision of 83.5%, a recall of 85%, an F1-score of 84.15%, an area under the curve of 61.94%, and a prediction accuracy of 85%. The comparative analysis is shown in table 4.12 and the graphical analysis in figure 4.42.

Table 4.12 Performance analysis of proposed system and existing classifiers for Amazon
office products

Office Products        Precision  Recall  F1-Score  AUC    Accuracy
Gaussian Naïve Bayes   84         60      68        55.55  60
Random Forest          74         80      77        46.15  80
Decision Tree          78         83      80        46.77  82.85
PCHOT                  83.5       85      84.15     61.94  85


Figure 4.42 Performance Evaluation of office products using different Classifiers

Tables 4.13 and 4.14 summarize the performance of the proposed system and the existing classifiers for four Amazon products: Amazon Instant Video, shoes, cellphones and accessories, and electronics. The average classification accuracy of the proposed system is 77.236%, whereas the existing classification approaches (random forest, decision tree, and naïve Bayes) achieved 69.817%, 71.696%, and 54.5%, respectively. Similarly, the average precision, recall, F-measure, and AUC of the proposed system are better than those of the existing classification approaches, because the proposed system effectively captures the non-linear and linear features of the collected data and also preserves the quantitative relationship between the high-level and low-level features. The graphical comparison of the proposed and existing classification methods for these Amazon products is represented in figures 4.43 and 4.44.
Table 4.13 Performance analysis of proposed system and existing classifiers by means of
precision, recall, and f-measure
Classifiers            Dataset Type              Precision (%)  Recall (%)  F1-Score (%)
Gaussian Naïve Bayes   Amazon Videos             64             60          61
Gaussian Naïve Bayes   Shoes                     96             55          66
Gaussian Naïve Bayes   Cellphone & Accessories   57             25          23
Gaussian Naïve Bayes   Electronics               70             55          59
Random Forest          Amazon Videos             52             57          54
Random Forest          Shoes                     87             93          90
Random Forest          Cellphone & Accessories   67             53          59
Random Forest          Electronics               62             70          66
Decision Tree          Amazon Videos             56             57          57
Decision Tree          Shoes                     89             94          92
Decision Tree          Cellphone & Accessories   63             57          60
Decision Tree          Electronics               67             74          71
PCHOT                  Amazon Videos             62.78          66          63.03
PCHOT                  Shoes                     88.3           93          90.59
PCHOT                  Cellphone & Accessories   59.46          69          63.35
PCHOT                  Electronics               78.09          80          78.9
Table 4.13 shows the detailed analysis for various datasets available on Amazon: Amazon Videos, Shoes, Cellphones & Accessories, and Electronics. We calculated precision, recall, and F-measure on these datasets to compare the different classifiers with the proposed system. A detailed graphical analysis is given in figure 4.43.
Figure 4.43 Graphical comparison of proposed and existing classifiers by means of precision, recall, and f-measure
Table 4.14 shows the detailed analysis for the same Amazon datasets: Amazon Videos, Shoes, Cellphones & Accessories, and Electronics. We calculated the area under the curve and the accuracy on these datasets to compare the different classifiers with the proposed system. A detailed graphical analysis is given in figure 4.44.
Table 4.14 Performance analysis of proposed system and existing classifiers by means of AUC
and accuracy
Classifiers            Dataset Type              AUC (%)  Accuracy (%)
Gaussian Naïve Bayes   Amazon Videos             57.14    60
Gaussian Naïve Bayes   Shoes                     76.31    55
Gaussian Naïve Bayes   Cellphone & Accessories   43.75    25
Gaussian Naïve Bayes   Electronics               53.12    55
Random Forest          Amazon Videos             43.65    56.66
Random Forest          Shoes                     50       93.33
Random Forest          Cellphone & Accessories   39.99    53.33
Random Forest          Electronics               43.75    70
Decision Tree          Amazon Videos             53.32    57.14
Decision Tree          Shoes                     50       94.28
Decision Tree          Cellphone & Accessories   41.07    57.14
Decision Tree          Electronics               44.82    74.28
PCHOT                  Amazon Videos             56.17    66
PCHOT                  Shoes                     49.47    93.36
PCHOT                  Cellphone & Accessories   46.82    69
PCHOT                  Electronics               59.89    80

When we compared the existing classifiers with our proposed system, it was observed that our system gives better results. These results are illustrated by the graphical analysis shown in figure 4.44.
Figure 4.44 Graphical comparison of proposed and existing classifiers by means of AUC and accuracy

In addition, a comparative study of the proposed and existing classification methods is carried out for another four Amazon products: home and kitchen, Kindle, baby, and office products. Here too, the performance evaluation is validated with 80% training and 20% testing of the data. By inspecting tables 4.15 and 4.16, the proposed system performs with an average classification accuracy of 77.236% as compared to the traditional classification methods (random forest, decision tree, and naïve Bayes). In addition, the existing classification methods achieved lower precision, recall, F-measure, accuracy, and AUC than the proposed classification method. The graphical comparison of the proposed and existing classification methods for these Amazon products is represented in figures 4.45 and 4.46.
Figure 4.45 Comparison between proposed system and existing classifiers by means of precision, recall, and f-measure

Random forest, decision tree, and naïve Bayes are machine learning classifiers used to predict the most relevant products in I-Commerce; many service providers such as Amazon, Trivago, and ShopClues use such classifiers. To illustrate the analysis, we applied all of the above classifiers along with the proposed system to different Amazon datasets, as shown in table 4.15. A detailed graphical analysis is shown in figure 4.45.
Table 4.15 Performance analysis of proposed system and existing classifiers by means of
precision, recall, and f-measure
Classifiers            Dataset Type      Precision (%)  Recall (%)  F1-Score (%)
Gaussian Naïve Bayes   Home & Kitchen    60             50          53
Gaussian Naïve Bayes   Kindle            83             70          84
Gaussian Naïve Bayes   Baby              78             50          42
Gaussian Naïve Bayes   Office Products   84             60          68
Random Forest          Home & Kitchen    75             77          76
Random Forest          Kindle            68             73          71
Random Forest          Baby              52             57          54
Random Forest          Office Products   74             80          77
Decision Tree          Home & Kitchen    70             66          68
Decision Tree          Kindle            72             77          75
Decision Tree          Baby              57             60          59
Decision Tree          Office Products   78             83          80
PCHOT                  Home & Kitchen    76.03          79          77.06
PCHOT                  Kindle            85             77          79
PCHOT                  Baby              72.35          77          73.06
PCHOT                  Office Products   83.5           85          84.15


Table 4.16 Performance analysis of proposed system and existing classifiers by means of AUC and accuracy

Classifiers            Dataset Type      AUC (%)  Accuracy (%)
Gaussian Naïve Bayes   Home & Kitchen    46.66    50
Gaussian Naïve Bayes   Kindle            68.62    70
Gaussian Naïve Bayes   Baby              58.33    50
Gaussian Naïve Bayes   Office Products   55.55    60
Random Forest          Home & Kitchen    54       76.66
Random Forest          Kindle            44       73.33
Random Forest          Baby              43.65    56.66
Random Forest          Office Products   46.15    80
Decision Tree          Home & Kitchen    46.26    65.71
Decision Tree          Kindle            45       77
Decision Tree          Baby              48       60
Decision Tree          Office Products   46.77    82.85
PCHOT                  Home & Kitchen    58.54    79
PCHOT                  Kindle            70.42    77
PCHOT                  Baby              55.89    77
PCHOT                  Office Products   61.94    85

The area under the curve and the accuracy are shown in table 4.16, and a detailed graphical analysis is given in figure 4.46. It has been observed that accuracy increases for almost all products with the proposed system as compared to the existing classifiers.
Figure 4.46 Graphical comparison of proposed and existing classifiers by means of AUC and accuracy
In this research study, a new recommendation system is developed for recommending products more accurately by analyzing the reviews users post for those products. The main motivation behind this experiment is to develop a proper keyword extraction method and clustering approach for recommending products to customers by classifying reviews into positive and negative forms, using the Amazon customer review dataset. In this scenario, a keyword extraction method (LDA) along with a modified GWO algorithm is used for selecting the appropriate keywords, and the obtained similar keywords are clustered using the PFCM algorithm. The developed automated recommendation system offers numerous advantages, such as the ability to identify fake products and track overall customer satisfaction. Compared to the existing classifiers, the proposed system delivered effective performance in both the quantitative and the comparative analysis. From the experimental analysis, the proposed system achieved an average classification accuracy of around 77.236%, whereas the existing methodologies attained lower accuracy on the Amazon customer review dataset. In future work, an effective system will be developed to further improve the classification accuracy of product recommendation.
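
As a pointer to the keyword-extraction stage, the following minimal LDA sketch uses scikit-learn; the sample reviews, topic count, and number of top keywords are placeholder assumptions, and the modified GWO selection and PFCM clustering stages are not reproduced here:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder reviews standing in for the preprocessed Amazon review text.
reviews = [
    "great camera battery lasts long",
    "poor battery short life bad camera",
    "kindle screen is crisp and light",
    "love reading on the kindle screen",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)

# Fit LDA with an assumed number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Report the top keywords per topic; these would then feed the
# keyword-optimization (modified GWO) stage described above.
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")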
