Recommendation of Web Pages Using Weighted KMeans Clustering
International Journal of Computer Applications (0975 – 8887)
Volume 86 – No 14, January 2014
[Figure 1 block labels: Online; Offline; Web logs; Data pre-processing; Clustering using
K-Means, Weighted K-Means; Aggregate usage profile; Active user session; Similarity
(Cosine, Hamming); Recommended pages]

Figure 1: Framework of Recommendation System for online users

In the offline component, three important steps are considered. The first step is to
preprocess the web server logs (web usage data) by applying data cleaning techniques and
then to partition the web navigations into sessions determined by the period of browsing.
The second is to partition the filtered, sessionized page views into clusters of user
navigation patterns with similar page-view browsing activities using the K-Means
algorithm [7] and the Weighted K-Means algorithm [12]. Finally, web navigation profiles are
generated based on the previously formed clusters. The online component matches the new
anonymous user request (the current active session) to the profile that shares common
interests with the user.

The usage profile contains only those web pages that pass certain confidence support and
weight values. The confidence support determines how frequently those pages occur in the
cluster. These profiles do not consider specific users, since this study does not take the
users' history into account during profile generation. The usage profile is constructed as
a set of (pageview, weight) pairs using equation (1).

Usage profile = {(p, weight(p)) | p ∈ P, weight(p) ≥ min_weight}. (1)

where P = {p1, p2, . . . , pn} is a set of n pageviews appearing in the transaction file,
with each pageview uniquely represented by its associated Uniform Resource Locator (URL),
and weight(p) is the (mean) value of the attribute's weights in the cluster.

3.1 Preprocessing of Click stream Data
Click stream data is the record of the sequence of web pages viewed by a user, with the
pages listed one by one in the order in which they were displayed [10, 11]. Analysis of
clicks is the process of extracting knowledge from web logs. This analysis involves data
preprocessing and then applying data mining techniques. Data preprocessing involves data
extraction, cleaning and filtration, followed by identification of the sessions.

3.2 K-Means Clustering Algorithm
Even though there are quite a number of algorithms for clustering, the benchmark K-Means
clustering technique [7] has been used to group the web users in this paper. Consider a
data set containing the data points to be clustered, D = {X1, …, Xn}. First, K initial
centroids are chosen randomly from these data points, where K is a user parameter, the
number of clusters desired. K-Means uses an iterative hill-climbing algorithm. The process
of K-Means clustering is explained as follows [13]:

(i) The initial seeds with the chosen number of clusters, K, are selected and an initial
partition is built by using the seeds as the centroids of the initial clusters.
(ii) Each data point is assigned to the centroid that is nearest, thus forming a cluster.
(iii) Keeping the same number of clusters, the new centroid of each cluster is calculated.
(iv) Iterate steps (ii) and (iii) until the clusters stop changing or the stop conditions
are satisfied.

In K-Means clustering, all the pages are considered equally, but some of the pages have
been visited by a larger number of users. To account for the pages that are visited by
more users, a suitable weight is assigned to each page in order to give more importance to
that page at the time of clustering. Hence, the Weighted K-Means algorithm has been
proposed in this paper and web page recommendation has been done accordingly. In this
paper, Weighted K-Means clustering is applied in the offline phase to generate similar
user groups based on their usage behavior, since these user groups or user clusters are
used to generate the usage profile using equation (1).

3.3 Weighted K-Means algorithm
The Weighted K-Means algorithm [12] is a clustering algorithm based on the K-Means
algorithm that computes with weights. A natural extension of the K-Means problem allows us
to include some more information, namely, a set of weights associated with the data
points. These might represent a measure of importance, a frequency count, or some other
information. Weighted K-Means attempts to decompose a set of objects into a set of
disjoint clusters, taking into consideration the fact that the numerical attributes of
objects in the set often do not come from independent identical normal distributions.

The Weighted K-Means algorithm uses a weight vector to decrease the effects of irrelevant
attributes and to reflect the semantic information of objects. The algorithm is iterative
and uses hill-climbing to find an optimal solution (clustering), and thus usually
converges to a local minimum. In this algorithm, the weights can be classified into two
types. Dynamic weights: the weights are changed during the program. Static weights: the
weights are not changed during the program. The Weighted K-Means algorithm is used to
cluster the objects.

The working procedure of Weighted K-Means clustering is as follows.

Input: A set of n data points and the number of clusters (K)

Output: Centroids of the K clusters

(i) Initialize the number of clusters K.
(ii) Randomly select the centroids (c1, c2, …, cK) in the data set.
(iii) Choose the static weight w, which ranges from 0 to 2.5 (or 5.0).
(iv) Find the distance between the data points and the centroids using the weighted
Euclidean distance equation, $d_{ij} = \sqrt{\sum w \cdot (x_i - c_j)^2}$, where x_i is a
data point and c_j is a centroid.
(v) Update the centroids using this equation.
(vi) Stop the process when the new centroids are near the old ones. Otherwise, go to
step (iv).

3.4 Cosine Similarity [2,8]
The similarity of the active session with each of the discovered aggregate profiles is
determined using the well-known Cosine similarity measure. If an active session si is
compared with the profile of cluster ck, then their similarity can be measured as follows:

$\mathrm{sim}(s_i, c_k) = \frac{s_i \cdot c_k}{\lVert s_i \rVert \, \lVert c_k \rVert}$

3.5 Hamming Similarity [9]
Given a space of vectors, the Hamming distance between two vectors is defined as the
number of components in which they differ. Hamming distance is a distance measure: it
cannot be negative, and if it is zero, then the vectors are identical. Most commonly,
Hamming distance is used when the vectors are binary; they consist of 0's and 1's only.
However, in principle, the vectors can have components from any set. For example, the
Hamming distance between the vectors 10101 and 11110 is 3. That is, these vectors differ
in the second, fourth, and fifth components, while they agree in the first and third
components.

Hamming Similarity (ui, uj) = 1 − Hamming distance(ui, uj)

In the first phase, usage profiles are extracted. In the second phase, two different
similarity measures, namely Cosine similarity and Hamming similarity, are used to measure
the similarity between the active user and the extracted usage profiles. The
recommendation list is generated from the nearest usage profiles. Usage profiles whose
similarity value is greater than the threshold µ (here µ = 0.5) are considered the nearest
profiles to the given active user.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Data Set
A real dataset is used for this experiment. The data set is taken from the UCI dataset
repository (http://kdd.ics.uci.edu/) and consists of Internet Information Server (IIS)
logs for msnbc.com and news-related portions of msn.com for the entire day of
September 28, 1999 (Pacific Standard Time). Visits are recorded at the level of URL
category and are recorded in time order. Each sequence in the dataset corresponds to the
page views of a user during that twenty-four hour period. Each event in the sequence
corresponds to a user's request for a page. Requests are not recorded at the finest level
of detail, that is, at the level of URL, but rather at the level of page category (as
determined by a site administrator). The categories are "front page", "news", "tech",
"local", "opinion", "on-air", "misc", "weather", "health", "living", "business", "sports",
"summary", "bbs" (bulletin board service), "travel", "msn-news", and "msn-sports". Any
page requests served via a caching mechanism were not recorded in the server logs and
hence are not present in the data. This dataset is slightly changed to fit our experiment:
if a user visits only the "front page", then 1 is recorded in the first position of the
matrix and the other 16 columns (categories) are filled with 0 [2]. The details of the
data set are provided in Table 1.

Table 1. Dataset used in the experiment

Dataset | MSNBC
Total number of users | 989818
Average number of visits per user | 5.7
Number of URLs for each category | 10-5000

Table 2. List of MSR values and page view categories using Weighted K-Means and K-Means
of 10 clusters

Cluster Index | Weighted K-Means MSR | Pageview Categories (Weighted K-Means) | Normal K-Means MSR | Pageview Categories (Normal K-Means)
1 | 46.8082 | 1 2 3 4 5 6 7 10 11 12 14 | 39.0137 | 1 2 3 4 6 7 10 12 15
2 | 45.4358 | 1 2 3 4 5 6 7 10 11 12 14 | 34.0003 | 1 2 3 4 6 10 11 12 14 17
3 | 50.6072 | 1 2 3 4 5 6 7 10 11 12 14 | 39.2458 | 1 2 3 4 5 6 7 10 11 15
4 | 44.4727 | 1 2 3 4 5 6 7 10 11 12 14 15 | 42.9592 | 1 2 4 6 7 8 9 14
5 | 47.7274 | 1 2 3 4 5 6 7 10 11 12 14 | 32.1258 | 1 2 3 4 5 6 7 10 11
6 | 42.71 | 1 2 3 4 5 6 7 10 11 12 14 15 | 35.2479 | 1 2 4 6 7 10 11 12 14
7 | 45.3843 | 1 2 3 4 5 6 7 10 11 12 14 | 40.8253 | 1 2 4 6 7 9 11 12
8 | 43.6248 | 1 2 3 4 5 6 7 10 11 12 | 47.0673 | 1 2 3 4 6 7 9 12 13 14
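The binary session encoding described in Section 4.1 can be illustrated with a minimal sketch. The column order is assumed to follow the order in which the 17 categories are listed in the text, and the names `CATEGORIES` and `encode_session` are hypothetical helpers introduced only for this illustration.

```python
# Assumption: column order follows the category list in Section 4.1,
# so index 0 corresponds to "front page".
CATEGORIES = [
    "front page", "news", "tech", "local", "opinion", "on-air", "misc",
    "weather", "health", "living", "business", "sports", "summary",
    "bbs", "travel", "msn-news", "msn-sports",
]


def encode_session(visited):
    """Return a 17-element 0/1 row marking which categories a user visited."""
    visited = set(visited)
    unknown = visited - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in visited else 0 for category in CATEGORIES]


# A user who visits only the front page gets 1 in the first column
# and 0 in the other 16 columns.
row = encode_session(["front page"])
print(row)
```

Repeated visits to the same category collapse to a single 1, which matches the binary web usage data discussed in the paper but discards visit counts.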
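The Weighted K-Means procedure of Section 3.3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weights are taken to be static and per attribute (one weight per page category), the centroid update is assumed to be the plain mean of the cluster members (the paper's own update equation is not reproduced here), and the function names, `tol`, `max_iter` and `seed` are hypothetical. Setting every weight to 1 gives ordinary K-Means as in Section 3.2.

```python
import math
import random


def weighted_distance(x, c, w):
    """Weighted Euclidean distance: sqrt(sum_l w[l] * (x[l] - c[l])**2)."""
    return math.sqrt(sum(wl * (xl - cl) ** 2 for xl, cl, wl in zip(x, c, w)))


def weighted_kmeans(points, k, weights, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Steps (i)-(ii): pick k data points at random as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step (iv): assign each point to the centroid with the smallest
        # weighted distance.
        labels = [
            min(range(k), key=lambda j: weighted_distance(p, centroids[j], weights))
            for p in points
        ]
        # Step (v), assumed form: new centroid = mean of the cluster members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step (vi): stop when no centroid moved by more than tol.
        shift = max(
            weighted_distance(a, b, weights) for a, b in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Toy usage: two well-separated groups of 2-D points with equal weights.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = weighted_kmeans(points, k=2, weights=[1.0, 1.0])
print(labels)
```

With static per-attribute weights, the weights change which centroid is nearest but not the location of the mean, which is why the plain mean is a reasonable stand-in for the update step in this sketch.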
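Equation (1) can be illustrated with a small sketch that builds a usage profile from the binary session vectors of one cluster. Here weight(p) is computed as the mean of column p over the cluster, following the description of equation (1); the function name `usage_profile`, the toy cluster and the value `min_weight=0.5` are assumptions made only for this example.

```python
def usage_profile(sessions, min_weight):
    """Return {page_index: weight} for pages whose mean weight in the cluster
    is at least min_weight, following equation (1)."""
    if not sessions:
        return {}
    n = len(sessions)
    profile = {}
    # zip(*sessions) iterates over the columns, i.e. one pageview at a time.
    for page, column in enumerate(zip(*sessions)):
        weight = sum(column) / n
        if weight >= min_weight:
            profile[page] = weight
    return profile


# Toy cluster of four sessions over four pageviews (hypothetical data).
cluster = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
print(usage_profile(cluster, min_weight=0.5))  # {0: 1.0, 1: 0.75}
```

Pages 2 and 3 each appear in only one of the four sessions (weight 0.25), so they fall below the threshold and are left out of the profile.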
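The matching step of Sections 3.4 and 3.5 can be sketched as below. Several points are assumptions rather than details from the paper: usage profiles are represented as binary vectors, the Hamming distance is divided by the vector length so that the similarity lies in [0, 1] and can be compared with the threshold µ = 0.5, and the recommended pages are taken to be those present in a nearest profile but not yet in the active session. All function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hamming_distance(u, v):
    """Number of components in which u and v differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def hamming_similarity(u, v):
    """1 minus the Hamming distance, normalized by length (an assumption)."""
    if not u:
        raise ValueError("vectors must be non-empty")
    return 1 - hamming_distance(u, v) / len(u)


def recommend(session, profiles, similarity, mu=0.5):
    """Pages found in profiles with similarity above mu and absent from the session."""
    recommended = set()
    for profile in profiles:
        if similarity(session, profile) > mu:
            recommended.update(
                i for i, (s, p) in enumerate(zip(session, profile)) if p and not s
            )
    return sorted(recommended)


# The worked example from Section 3.5: 10101 and 11110 differ in 3 positions.
print(hamming_distance([1, 0, 1, 0, 1], [1, 1, 1, 1, 0]))  # 3

# Hypothetical active session and two hypothetical binary usage profiles.
session = [1, 0, 1, 0, 0]
profiles = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
print(recommend(session, profiles, hamming_similarity))  # [1]
print(recommend(session, profiles, cosine_similarity))   # [1]
```

In this toy case both measures select only the first profile, so both recommend page index 1; on real data the two measures can rank profiles differently, which is the comparison the experiments report.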
Table 5 and Table 6 show the recommendation list for the active user as given above using
traditional K-Means clustering and Weighted K-Means clustering respectively. The distance
measure used in both clustering methods is Euclidean distance. From this empirical study,
it has been observed that Hamming similarity using Weighted K-Means clustering gives
better recommendation quality than the cosine similarity measure for the binary web usage
data. The list of recommended pages is provided in Table 6.

5. CONCLUSION
This paper has focused on grouping the similar usage behavior of users using the Weighted
K-Means algorithm for aggregated usage profiles, and a new validation measure called MSR
(Mean Square Residue) is applied to evaluate the clusters' quality. The results of this
clustering approach are compared with the results of the traditional clustering method,
K-Means. It was observed that the usage profile extracted from the MSNBC dataset using
Weighted K-Means provides higher-quality recommendations for the given active user than
the results obtained by using K-Means clustering. In future, overlapping clusters may be
obtained and these clusters may be used for usage profile generation. Hence more pages
visited by users can be considered for the recommendation process.

6. REFERENCES
[1] Bamshad Mobasher, 2001. WebPersonalizer: A Server-Side Recommender System Based on Web
Usage Mining. Technical Report TR01-010, School of Computer Science, Telecommunications
and Information Systems, DePaul University, Chicago, IL, USA.

[2] C.P. Sumathi et al. / (IJCSE) International Journal on Computer Science and
Engineering, Vol. 02, No. 09, 2010, 3046-3052.

[3] AlMurtadha, Y.M., M.N.B. Sulaiman, N. Mustapha and N.I. Udzir, 2010. Mining web
navigation profiles for recommendation system. Inform. Technol. J., 9: 790-796.
DOI: 10.3923/itj.2010.790.796.

[4] Castellano, G., Fanelli, A.M., & Torsello, M.A. (2011). NEWER: A system for
NEuro-fuzzy Web recommendation. Applied Soft Computing, 11(1), 793-806.

[5] AlMurtadha, Y., Sulaiman, M.N.B., N. Mustapha and N.I. Udzir (2011). IPACT: Improved
web page recommendation System Using Profile Aggregation Based on Clustering of
Transactions, American Journal of Applied Sciences, 8(3), 277-283.

[6] Yahya AlMurtadha, Md. Nasir Sulaiman, Norwati Mustapha and Nur Izura Udzir. Improved
web page recommender System Based on Web Usage Mining, Proceedings of the 3rd
International Conference on Computing and Informatics, ICOCI 2011, 8-9 June 2011,
Bandung, Indonesia, Paper No. 079.

[7] Ms. Vinita Shrivastava, Mr. Neetesh Gupta, Performance Improvement Of Web Usage Mining
By Using Learning Based K-Mean Clustering, International Journal of Computer Science and
its Applications [ISSN 2250-3765].

[8] Khribi, M. K., Jemni, M., & Nasraoui, O. (2009). Automatic Recommendations for
E-Learning Personalization Based on Web Usage Mining Techniques and Information Retrieval,
Educational Technology & Society, 12(4), 30–42.

[9] F. Khalil, J. Li, H. Wang. An Integrated Model for Next Page Access Prediction,
Copyright © 2009 Inderscience Enterprises Ltd.

[10] Cooley, R., B. Mobasher and J. Srivastava, 1997. Web mining: information and pattern
discovery on the world wide web. Proceedings of the 9th IEEE International Conference on
Tools with Artificial Intelligence, Newport Beach, CA, pp: 558-567.
DOI: 10.1109/TAI.1997.632303.

[11] Ms. Dipa Dixit, Mr. Jayant Gadge, Automatic Recommendation for Online Users Using Web
Usage Mining, International Journal of Managing Information Technology (IJMIT), Vol. 2,
No. 3, August 2010.

[12] Dr. K. Thangadurai, M. Uma, Dr. M. Punithavalli, A Study On Rough Clustering, Global
Journal of Computer Science and Technology, Vol. 10, Issue 5, Ver. 1.0, July 2010,
Page 55.

[13] Kyoung-jae Kim, Hyunchul Ahn, A Recommender system using GA K-means clustering in an
online shopping market, Expert Systems with Applications (2007),
doi:10.1016/j.eswa.2006.12.025.

[14] Fuzhi Zhang, Huilin Liu, Jinbo Chao, A Two-stage Recommendation Algorithm based on
K-means Clustering in Mobile E-commerce, Journal of Computational Information Systems
6:10 (2010) 3327-3334.
IJCATM : www.ijcaonline.org