
International Journal of Computer Applications (0975 – 8887)

Volume 86 – No 14, January 2014

Recommendation of Web Pages using Weighted K-Means Clustering
R. Thiyagarajan, Department of Computer Applications, Nehru Institute of IT and Management, Coimbatore-641 105, India
K. Thangavel, Department of Computer Science, Periyar University, Salem-636 011, India
R. Rathipriya, Department of Computer Science, Periyar University, Salem-636 011, India

ABSTRACT
Web Recommendation Systems are implemented using the collaborative filtering approach, a specific type of information filtering that aims to predict the user's browsing activity and then recommend web pages that are likely to be of interest. In this paper, a new recommendation system is proposed that uses the Weighted K-Means clustering approach to predict the user's navigational behavior. The proposed recommendation system based on Weighted K-Means clustering performs well when compared to the K-Means algorithm. The comparative analysis is presented through experimental results.

Keywords
Web Usage Mining, Web recommendation system, K-Means clustering, Weighted K-Means clustering, Hamming distance, Mean square residue.

1. INTRODUCTION
The Web has become the backbone of information. The major problem for internet users is being unable to retrieve useful and relevant information. The browsing patterns of users can help organizations recommend more relevant web pages according to the current interests of the user.

Web Usage Mining (WUM) mines user access patterns from usage logs, which record the clicks made by every user. The output of WUM is a set of patterns that can serve as input to recommendation systems, one of the application areas of web usage mining, which predict the next page a given user will visit [6]. The main goal of a recommendation system is to improve web site usability by knowing the interests of the users. The web recommendation process consists of two components, online and offline, with respect to web server activity. The offline component builds the knowledge base by analyzing historical data, such as server access log files or web logs captured from the server. These web logs are then used in the online component to capture the intuition list of the user, so that page views can be recommended when the user next comes online [2].

In this paper, a framework is presented for generating recommendations, in the form of a recommendation list for the user, using Weighted K-Means clustering. A recommendation list consists of the pages visited by the user as well as the pages visited by other users with a similar usage profile. The rest of this paper is organized as follows: Section 2 discusses recommendation systems based on web usage mining. Section 3 presents the block diagram and the implementation of the usage based recommendation system using the K-Means and Weighted K-Means clustering algorithms. Results and discussion are given in Section 4. Finally, Section 5 concludes the paper with directions for future work.

2. RELATED WORK
Many researchers have proposed recommendation systems for online personalization through web usage mining [2]. This kind of recommendation system is used to predict user navigation behavior and preferences from web log data.

Bamshad Mobasher [1] presented a system called WebPersonalizer which provides dynamic recommendations, as a list of hypertext links, to users. In the preprocessing phase, data mining techniques (i.e. clustering, sequence pattern discovery and association rules) are used to obtain the aggregate usage profiles. In the offline phase, web server logs are converted into clusters of visited pages, each cluster being made up of a set of pages with common usage characteristics or behavior. In the online phase, the active user session is considered in order to find matches between the user's activities and the discovered usage profiles. Matching usage profiles are used to compute a set of recommendations, which is inserted into the last requested page as a list of hypertext links. Sumathi et al. [2] introduced recommender systems based on the user's navigational patterns using model based clustering, providing suitable recommendations to cater to the needs of the user. AlMurtadha et al. [3] focused on improving the prediction of the next visited web pages and recommending them to the current anonymous user by assigning that user to the best navigation profiles obtained from previous navigations of users with similar interests. NEWER, a usage-based Web recommendation system presented by Castellano, Fanelli and Torsello [4], exploits the potential of Computational Intelligence techniques to dynamically suggest interesting pages to users according to their preferences. AlMurtadha, Sulaiman, Mustapha and Udzir [5] focused on IPACT, an improved recommendation system using Profile Aggregation based on Clustering of Transactions. In [14], Fuzhi Zhang, Huilin Liu and Jinbo Chao presented a two-stage recommendation algorithm based on K-Means clustering in mobile e-commerce. K. Thangadurai, M. Uma and M. Punithavalli [12] presented a study on rough clustering in which rough K-Means clustering is compared with the traditional K-Means and Weighted K-Means clustering methods on different data sets available in the UCI data repository.

3. USAGE BASED RECOMMENDATION SYSTEM
The proposed framework consists of two main components, offline and online, as shown in Figure 1.


[Figure 1 block diagram. Offline: web logs, data pre-processing, clustering using K-Means and Weighted K-Means, aggregate usage profile. Online: active user session, similarity (Cosine, Hamming), recommended pages.]

Figure 1: Framework of Recommendation System for online users

The offline component has three important steps. The first step is to preprocess the web server logs or web usage data by applying data cleaning techniques and then to partition the web navigations into sessions determined by the period of browsing. The second step is to partition the filtered, sessionized page views into clusters of user navigation patterns with similar page view browsing activities using the K-Means algorithm [7] and the Weighted K-Means algorithm [12]. Finally, web navigation profiles are generated from the clusters formed. The online component matches the request of a new anonymous user (the current active session) to the profile that shares common interests with that user.

The usage profile contains only those web pages that pass a certain confidence support and weight value. The confidence support reflects how frequently those pages occur in the cluster. These profiles do not refer to specific users, since this study does not take the users' history into account during profile generation. The usage profile is constructed as a set of pairs, each consisting of a pageview and its weight, using equation (1).

$$\text{Usage profile} = \{(p, \mathrm{weight}(p)) \mid p \in P,\ \mathrm{weight}(p) \ge \mathrm{min\_weight}\} \qquad (1)$$

where P = {p1, p2, . . . , pn} is the set of n pageviews appearing in the transaction file, with each pageview uniquely represented by its associated Uniform Resource Locator (URL), and weight(p) is the mean value of the attribute's weights in the cluster.

3.1 Preprocessing of Click stream Data
Click stream data records the sequence of web pages viewed by a user, with the pages listed one per row in the order visited [10, 11]. Click analysis is the process of extracting knowledge from web logs. This analysis involves data preprocessing followed by the application of data mining techniques. Data preprocessing involves data extraction, cleaning and filtration, followed by identification of the user sessions.

3.2 K-Means Clustering Algorithm
Even though there are quite a number of clustering algorithms, the benchmark K-Means clustering technique [7] has been used to group the web users in this paper. Consider a data set containing the data points to be clustered, D = {X1, …, Xn}. First, K initial centroids are chosen randomly from these data points, where K is a user parameter, the number of clusters desired. K-Means is an iterative hill-climbing algorithm. The process of K-Means clustering is as follows [13]:

(i) The initial seeds for the chosen number of clusters, K, are selected, and an initial partition is built by using the seeds as the centroids of the initial clusters.
(ii) Each data point is assigned to the nearest centroid, thus forming a cluster.
(iii) Keeping the same number of clusters, the new centroid of each cluster is calculated.
(iv) Steps (ii) and (iii) are iterated until the clusters stop changing or the stop conditions are satisfied.

In K-Means clustering all pages are treated equally, although some pages have been visited by a larger number of users. To account for the pages visited by more users, a suitable weight is assigned to each page in order to give more importance to that page at the time of clustering. Hence, the Weighted K-Means algorithm is proposed in this paper and web page recommendation is done accordingly. In this paper, Weighted K-Means clustering is applied in the offline phase to generate similar user groups based on their usage behavior; these user groups or user clusters are then used to generate the usage profile using equation (1).

3.3 Weighted K-Means algorithm
The Weighted K-Means algorithm [12] is a clustering algorithm based on K-Means that incorporates weights into the calculation. A natural extension of the K-Means problem allows us to include some more information, namely a set of weights associated with the data points. These might represent a measure of importance, a frequency count, or some other information. Weighted K-Means attempts to decompose a set of objects into a set of disjoint clusters, taking into consideration the fact that the numerical attributes of objects in the set often do not come from independent identical normal distributions.

The Weighted K-Means algorithm uses a weight vector to decrease the effects of irrelevant attributes and to reflect the semantic information of objects. The algorithm is iterative and uses hill-climbing to find a solution (clustering), and thus usually converges to a local minimum. In this algorithm, the weights can be classified into two types. Dynamic weights are changed during the program. Static weights are not changed during the program. The Weighted K-Means algorithm is used to cluster the objects.

The working procedure of Weighted K-Means clustering is as follows.

Input: A set of n data points and the number of clusters (K)

Output: Centroids of the K clusters

(i) Initialize the number of clusters K.
(ii) Randomly select the centroids $(c_1, c_2, \ldots, c_K)$ from the data set.
(iii) Choose the static weight $w$, which ranges from 0 to 2.5 (or 5.0).
(iv) Find the distance between each data point and the centroids using the weighted Euclidean distance $d_{ij} = \sqrt{\sum_{l} w_l\,(x_{il} - c_{jl})^2}$, where $x_{il}$ is attribute $l$ of data point $i$, $c_{jl}$ is attribute $l$ of centroid $j$, and $w_l$ is the weight of attribute $l$.


(v) Update the centroids using this equation: [equation missing in source]
(vi) Stop the process when the new centroids are close to the old ones. Otherwise, go to step (iv).

3.4 Cosine Similarity [2,8]
The similarity of the active session with each of the discovered aggregate profiles is determined using the well-known Cosine similarity measure. If an active session $s_j$ is compared with cluster $c_k$, then their similarity can be measured as follows:

$$\mathrm{sim}(s_j, c_k) = \frac{\sum_i w_{i,j}\, w_{i,k}}{\sqrt{\sum_i w_{i,j}^2}\ \sqrt{\sum_i w_{i,k}^2}}$$

where $w_{i,j}$ represents the weight of page i in active session j and $w_{i,k}$ represents the weight of page i in cluster k.

3.5 Hamming Similarity [9]
Given a space of vectors, the Hamming distance between two vectors is defined as the number of components in which they differ. The Hamming distance is a distance measure: it cannot be negative, and if it is zero, then the vectors are identical. Most commonly, Hamming distance is used when the vectors are binary; they consist of 0's and 1's only. However, in principle, the vectors can have components from any set. For example, the Hamming distance between the vectors 10101 and 11110 is 3. That is, these vectors differ in the second, fourth, and fifth components, while they agree in the first and third components.

Hamming Similarity(ui, uj) = 1 - Hamming distance(ui, uj) / n

where n is the number of components of the vectors (n = 17 page categories in this study), so that the similarity lies between 0 and 1.

In the first phase, usage profiles are extracted. In the second phase, two different similarity measures, namely Cosine similarity and Hamming similarity, are used to measure the similarity between the active user and the extracted usage profiles. The recommendation list is generated from the nearest usage profiles. Usage profiles whose similarity value is greater than the threshold µ (here µ = 0.5) are considered the nearest profiles to the given active user.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Data Set
A real dataset is used for this experiment. The data set is taken from the UCI dataset repository (http://kdd.ics.uci.edu/) and consists of Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999 (Pacific Standard Time). Visits are recorded at the level of URL category and in time order. Each sequence in the dataset corresponds to the page views of a user during that twenty-four hour period. Each event in the sequence corresponds to a user's request for a page. Requests are not recorded at the finest level of detail, that is, at the level of URL; rather, they are recorded at the level of page category (as determined by a site administrator). The categories are "front page", "news", "tech", "local", "opinion", "on-air", "misc", "weather", "health", "living", "business", "sports", "summary", "bbs" (bulletin board service), "travel", "msn-news", and "msn-sports". Any page requests served via a caching mechanism were not recorded in the server logs and hence are not present in the data. This dataset is slightly changed to suit our experiment: if the user visits only the "front page", then 1 is recorded in the first position of the matrix and the other 16 columns (categories) are filled with 0 [2]. The details of the data set are provided in Table 1.

Table 1. Dataset used in the experiment

Dataset                               MSNBC
Total number of users                 989818
Average number of visits per user     5.7
Number of URLs for each category      10-5000

Users who visited more than 8 page view categories are considered for recommendation. The number of user sessions after this data filter step is 3408. In the first phase, the K-Means and Weighted K-Means clustering techniques are applied to the MSNBC dataset with K=10.

Table 2. List of MSR values and page view categories using Weighted K-Means and K-Means of 10 clusters

Cluster  Weighted K-Means  Pageview Categories              K-Means  Pageview Categories
Index    MSR               (Weighted K-Means)               MSR      (K-Means)
1        46.8082           1 2 3 4 5 6 7 10 11 12 14        39.0137  1 2 3 4 6 7 10 12 15
2        45.4358           1 2 3 4 5 6 7 10 11 12 14        34.0003  1 2 3 4 6 10 11 12 14 17
3        50.6072           1 2 3 4 5 6 7 10 11 12 14        39.2458  1 2 3 4 5 6 7 10 11 15
4        44.4727           1 2 3 4 5 6 7 10 11 12 14 15     42.9592  1 2 4 6 7 8 9 14
5        47.7274           1 2 3 4 5 6 7 10 11 12 14        32.1258  1 2 3 4 5 6 7 10 11
6        42.71             1 2 3 4 5 6 7 10 11 12 14 15     35.2479  1 2 4 6 7 10 11 12 14
7        45.3843           1 2 3 4 5 6 7 10 11 12 14        40.8253  1 2 4 6 7 9 11 12
8        43.6248           1 2 3 4 5 6 7 10 11 12 14        47.0673  1 2 3 4 6 7 9 12 13 14
9        42.1685           1 2 3 4 6 7 10 11 12 14 17       15.5187  1 2 3 4 5 6 7 8 10 11 12 14 15
10       42.5192           1 2 3 4 5 6 7 8 10 11 12 14 15   36.2595  1 2 3 4 5 6 10 11 12 14

Table 2 shows the Mean Square Residue (MSR) value of the 10 clusters and the pageview categories in the usage profile of each cluster, using K-Means and Weighted K-Means. In the second phase, the active user visits pageview categories 1 and 2. This is symbolically denoted as
A = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Table 3 tabulates the Cosine similarity and Hamming similarity values for the 10 clusters obtained using K-Means clustering with respect to the active user session.

Table 3. Similarity measures using K-Means clustering

Cluster Index        Cosine Similarity   Hamming Similarity
1                    1                   0.8235
2                    0.6493              0.8824
3                    0.974               0.8235
4                    0.6866              0.8824
5                    1                   0.8235
6                    0.6667              0.8824
7                    0.6866              0.8824
8                    0.974               0.8235
9                    0.9122              0.8235
10                   0.6493              0.8824
Average Similarity   0.81987             0.85295

Visited Pages = (3 6 7 9 10 12 13 14), which is denoted as A = [0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]

Figure 2 shows the graphical comparison of the Cosine and Hamming similarity measures for the clusters generated using the K-Means and Weighted K-Means clustering techniques. It shows that the Hamming similarity value is, on average, higher than the Cosine similarity value. This is because of the binary representation of the web data. Table 4 gives the Cosine similarity and Hamming similarity values for the 10 clusters obtained using Weighted K-Means clustering with respect to the active user session.

Table 4. Similarity measures using Weighted K-Means clustering

Cluster Index   Cosine Similarity   Hamming Similarity
1               0.9511              0.9991
2               0.9511              0.9991
3               0.9306              0.9991
4               0.974               0.9991
5               0.9511              0.9991
6               0.9511              0.9991
7               0.9511              0.9991
8               0.8954              0.9991
9               0.9122              0.9991
10              0.9511              0.9991

Figure 2: Comparison of Similarity Measures

Table 5. Recommendations using K-Means clustering

Similarity Measure   List of Recommended Pages         Recommendation Quality Percentage
Cosine               1 2 4 5 8 9 10 11 12 13 14 15     83.33
Hamming              1 2 4 5 8 9 10 11 12 13 14 15     83.33

Table 6. Recommendations using Weighted K-Means clustering

Similarity Measure   List of Recommended Pages         Recommendation Quality Percentage
Cosine               1 2 4 5 7 8 9 10 11 14 15 16      66.67
Hamming              1 2 4 7 8 9 10 11 12 13 14 16     100

Percentage of Recommendation Quality = Number of correctly recommended pages / (Total Number of Visited Pages - Number of Pages in the Active User Session) * 100
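As a worked check of the formula above, the following minimal sketch recomputes the quality percentages from the recommendation lists in Tables 5 and 6. It assumes that the active user session consists of the pages set to 1 in the vector A (pages 3 and 6) and that a page counts as correctly recommended when it is a visited page outside the active session; the function name `recommendation_quality` is illustrative and not taken from the paper.

```python
def recommendation_quality(recommended, visited, active_session):
    """Percentage of visited pages outside the active session that were recommended."""
    visited, active = set(visited), set(active_session)
    # Denominator of the formula: visited pages minus pages already in the session.
    held_out = visited - active
    if not held_out:
        raise ValueError("no visited pages outside the active session")
    correct = held_out & set(recommended)
    return 100.0 * len(correct) / len(held_out)


visited_pages = [3, 6, 7, 9, 10, 12, 13, 14]
active_session = [3, 6]  # assumption: the pages flagged with 1 in the vector A

kmeans_list = [1, 2, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15]       # Table 5 (both measures)
wkm_cosine_list = [1, 2, 4, 5, 7, 8, 9, 10, 11, 14, 15, 16]    # Table 6, Cosine
wkm_hamming_list = [1, 2, 4, 7, 8, 9, 10, 11, 12, 13, 14, 16]  # Table 6, Hamming

for name, pages in [("K-Means", kmeans_list),
                    ("Weighted K-Means, Cosine", wkm_cosine_list),
                    ("Weighted K-Means, Hamming", wkm_hamming_list)]:
    quality = recommendation_quality(pages, visited_pages, active_session)
    print(name, round(quality, 2))
# prints 83.33, 66.67 and 100.0 respectively
```

Under these assumptions the denominator is 8 - 2 = 6 pages, and the three lists contain 5, 4 and 6 of them, which agrees with the percentages reported in the tables.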

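The similarity values reported for the active session rely on the Cosine and Hamming measures of Sections 3.4 and 3.5. The sketch below shows one way these could be computed, together with the threshold test µ = 0.5 for selecting the nearest profiles. It assumes that the Hamming distance is divided by the vector length before being subtracted from 1, an assumption that is consistent with reported values such as 0.8235 = 14/17; the function names and the example profiles are hypothetical.

```python
import math


def cosine_similarity(session, profile):
    """Cosine of the angle between a session vector and a profile vector."""
    dot = sum(a * b for a, b in zip(session, profile))
    norm = math.sqrt(sum(a * a for a in session)) * math.sqrt(sum(b * b for b in profile))
    return dot / norm if norm else 0.0  # an all-zero vector matches nothing


def hamming_similarity(u, v):
    """1 minus the fraction of components in which u and v differ (assumed normalization)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    distance = sum(1 for a, b in zip(u, v) if a != b)
    return 1.0 - distance / len(u)


def nearest_profiles(session, profiles, similarity, mu=0.5):
    """Keys of the profiles whose similarity to the session exceeds the threshold mu."""
    return [key for key, profile in profiles.items() if similarity(session, profile) > mu]


# The example from Section 3.5: 10101 and 11110 differ in 3 of 5 components.
print(hamming_similarity([1, 0, 1, 0, 1], [1, 1, 1, 1, 0]))

# Hypothetical binary profiles over four page categories.
profiles = {"profile_a": [1, 1, 0, 0], "profile_b": [0, 0, 1, 1]}
print(nearest_profiles([1, 1, 0, 0], profiles, cosine_similarity))
```

Passing the similarity function as a parameter keeps the profile selection identical for both measures, so only the measure itself changes between the Cosine and Hamming rows of the result tables.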

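The offline clustering step of Section 3.3 and the usage profile of equation (1) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the static weights are taken to be one value per attribute, each centroid is updated as the plain mean of its members (the paper's update equation is not reproduced in the text), and the loop stops when no centroid moves by more than a small tolerance. All function and parameter names are hypothetical.

```python
import math
import random


def weighted_distance(x, c, w):
    """Weighted Euclidean distance between a data point x and a centroid c."""
    return math.sqrt(sum(wl * (xl - cl) ** 2 for xl, cl, wl in zip(x, c, w)))


def weighted_kmeans(points, k, weights, max_iter=100, tol=1e-6, seed=0):
    """Cluster points with static attribute weights; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid under the weighted distance.
        labels = [min(range(k), key=lambda j: weighted_distance(x, centroids[j], weights))
                  for x in points]
        new_centroids = []
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                # Assumed update rule: mean of the cluster members per attribute.
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        shift = max(weighted_distance(a, b, weights) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # new centroids are close to the old ones
            break
    return centroids, labels


def usage_profile(members, min_weight):
    """Pages (1-based) whose mean value in the cluster is at least min_weight."""
    means = [sum(col) / len(members) for col in zip(*members)]
    return {page + 1: w for page, w in enumerate(means) if w >= min_weight}


# Hypothetical binary sessions over four page categories, clustered into two groups.
sessions = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
centroids, labels = weighted_kmeans(sessions, k=2, weights=[2.5, 1.0, 1.0, 2.5])
for j in range(2):
    members = [s for s, lab in zip(sessions, labels) if lab == j]
    print(j, usage_profile(members, min_weight=0.5) if members else {})
```

The weights in the usage example stay within the 0 to 2.5 range mentioned in Section 3.3; giving a larger weight to a frequently visited page makes differences on that page count more in the distance, which is the effect the weighting is meant to achieve.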
Table 5 and Table 6 show the recommendation lists for the active user given above using traditional K-Means clustering and Weighted K-Means clustering respectively. The distance measure used in both clustering methods is the Euclidean distance. From this empirical study, it is observed that Hamming similarity using Weighted K-Means clustering gives better recommendation quality than the Cosine similarity measure for the binary web usage data (100% versus 66.67% in Table 6). The list of recommended pages is provided in Table 6.

5. CONCLUSION
This paper has focused on grouping users with similar usage behavior using the Weighted K-Means algorithm to build aggregated usage profiles, and a validation measure called MSR (Mean Square Residue) is applied to evaluate cluster quality. The results of this clustering approach are compared with the results of the traditional K-Means clustering. It was observed that the usage profile extracted from the MSNBC dataset using Weighted K-Means provides higher quality recommendations for the given active user than the results obtained using K-Means clustering. In future, overlapping clusters may be obtained and used for usage profile generation, so that more of the pages visited by users can be considered in the recommendation process.

6. REFERENCES
[1] Bamshad Mobasher, 2001. WebPersonalizer: A Server-Side Recommender System Based on Web Usage Mining. Technical Report TR01-010, School of Computer Science, Telecommunications and Information Systems, DePaul University, Chicago, IL, USA.

[2] C.P. Sumathi et al. / (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 09, 2010, 3046-3052.

[3] AlMurtadha, Y.M., M.N.B. Sulaiman, N. Mustapha and N.I. Udzir, 2010. Mining web navigation profiles for recommendation system. Inform. Technol. J., 9: 790-796. DOI: 10.3923/itj.2010.790.796

[4] Castellano, G., Fanelli, A.M., & Torsello, M.A. (2011). NEWER: A system for NEuro-fuzzy Web recommendation. Applied Soft Computing, 11(1), 793-806.

[5] AlMurtadha, Y., Sulaiman, M.N.B., N. Mustapha and N.I. Udzir (2011). IPACT: Improved web page recommendation System Using Profile Aggregation Based on Clustering of Transactions. American Journal of Applied Sciences, 8(3), 277-283.

[6] Yahya AlMurtadha, Md. Nasir Sulaiman, Norwati Mustapha and Nur Izura Udzir. Improved web page recommender System Based on Web Usage Mining. Proceedings of the 3rd International Conference on Computing and Informatics, ICOCI 2011, 8-9 June 2011, Bandung, Indonesia, Paper No. 079.

[7] Ms. Vinita Shrivastava, Mr. Neetesh Gupta. Performance Improvement Of Web Usage Mining By Using Learning Based K-Mean Clustering. International Journal of Computer Science and its Applications [ISSN 2250-3765].

[8] Khribi, M. K., Jemni, M., & Nasraoui, O. (2009). Automatic Recommendations for E-Learning Personalization Based on Web Usage Mining Techniques and Information Retrieval. Educational Technology & Society, 12(4), 30-42.

[9] F. Khalil, J. Li, H. Wang. An Integrated Model for Next Page Access Prediction. Copyright © 2009 Inderscience Enterprises Ltd.

[10] Cooley, R., B. Mobasher and J. Srivastava, 1997. Web mining: information and pattern discovery on the world wide web. Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, pp. 558-567. DOI: 10.1109/TAI.1997.632303.

[11] Ms. Dipa Dixit, Mr. Jayant Gadge. Automatic Recommendation for Online Users Using Web Usage Mining. International Journal of Managing Information Technology (IJMIT), Vol. 2, No. 3, August 2010.

[12] Dr. K. Thangadurai, M. Uma, Dr. M. Punithavalli. A Study On Rough Clustering. Global Journal of Computer Science and Technology, Vol. 10, Issue 5, Ver. 1.0, July 2010, p. 55.

[13] Kyoung-jae Kim, Hyunchul Ahn. A Recommender system using GA K-means clustering in an online shopping market. Expert Systems with Applications (2007), doi:10.1016/j.eswa.2006.12.025.

[14] Fuzhi Zhang, Huilin Liu, Jinbo Chao. A Two-stage Recommendation Algorithm based on K-means Clustering in Mobile E-commerce. Journal of Computational Information Systems 6:10 (2010) 3327-3334.