Review On Exploring User's Surfing Behavior For Recommended Based System
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: [email protected], [email protected]
Volume 3, Issue 2, March – April 2014 ISSN 2278-6856
Abstract: Classification is an important technique of web usage mining, useful to researchers in web prediction, where a user's browsing behavior is predicted. The history of navigated web pages, recorded from users' browsing sequences in the web server log, is used to predict the future set of pages a user is likely to visit. Considerable research has been carried out on next-page prediction for various kinds of applications, including wireless applications, caching systems, search engines, personalization, and recommender systems. The main objective of this paper is to survey and analyze the prediction of users' future page requests for recommender systems. A detailed review is presented of the various kinds of prediction models used for future page request prediction. The study and comparison of these prediction models is intended to give researchers a way to go deeper and find the pearl in the shell.
Keywords: Prediction models, Web prediction, predicting factors.
1. INTRODUCTION
Classification is an important technique of web usage mining, helpful to researchers in web prediction, where a user's browsing behavior is predicted. As users browse web pages, the history of the browsed pages is recorded in logs on the server side. This recorded information is helpful for predicting users' web browsing behavior: the history of navigated web pages is used to predict the future set of pages a user is likely to visit. Such prediction can be applied in various applications, including recommender systems, wireless applications, caching, and search engines. The objective of this paper is to review different prediction models that focus on predicting, from the recorded web log history, the future set of pages a user may visit.
The ideas and solutions introduced in this survey may motivate recommender system developers to turn research into reality.
The organization of this paper is as follows. Section II presents the motivation behind this survey. Section III introduces the literature survey of various models proposed by researchers for recommendation. Section IV summarizes the survey, followed by the acknowledgement and references.
2. MOTIVATION
The Internet is a source of information that users treat as a first stop for guidance toward their desired purposes. In today's era, web browsing is common, and users access the web worldwide. Web browsing is an enlightening and descriptive tool that gives users an impression of content so that they can achieve their goals and objectives. Users' surfing activities generate large amounts of data, recorded in log files stored on the server side. Identifying users' browsing patterns from these log files and predicting future page requests for a recommender system is the main motivation of this work.
3. LITERATURE SURVEY
The main focus of this literature survey is to study and contrast the prediction models used to predict users' future web page requests. Prediction models are used to address the web prediction problem. The main aim is to study different prediction models that reduce users' access times and improve personalization while browsing the web, and, in addition, reduce network traffic by avoiding pages visited unintentionally and unnecessarily. Various prediction models, such as the Markov model, artificial neural networks (ANN), k-nearest neighbor (kNN), support vector machines (SVM), fuzzy inference, and Bayesian models, have been proposed by researchers to predict a user's future page request.
Prediction models can be classified into two categories: point-based and path-based. Path-based prediction uses the user's previous and historic path data, whereas point-based prediction is based on the user's current actions.
Joachims et al. [1] proposed a tour guide called WebWatcher, a path-based recommender model based on kNN and reinforcement learning. WebWatcher is a tour-guide agent that helps users according to their browsing interests by highlighting selected hyperlinks on the original page; highlighting is done by inserting an eyeball icon around each suggested hyperlink. WebWatcher acts as an interface agent between the user and the World Wide Web, as shown in Figure 1.
Mr. Ajeetkumar S. Patel 1 , Prof. Anagha P. Khedkar 2
1 Department of Computer Engineering, Matoshri College of Engineering and Research Center, Nashik, India
2 Department of Information Technology, Matoshri College of Engineering and Research Center, Nashik, India
Figure 1 WebWatcher: interface agent between the user and the World Wide Web [1]
The function of WebWatcher is to suggest an appropriate link given an interest and a web page. Fulfilling this need requires knowledge of the following target function:
LinkQuality : Page x Interest x Link -> [0, 1]   [1]
This gives the value of LinkQuality, interpreted as the probability that a user will select the link given the current page and interest. WebWatcher uses the k-nearest-neighbor model for knowledge discovery: the value of LinkQuality for each hyperlink is estimated as the average similarity of the k (here, 5) highest-ranked keyword sets associated with the hyperlink. If a hyperlink's LinkQuality value is above a threshold, it is suggested as matching the user's interest. Reinforcement learning is the second approach used in the WebWatcher agent; it allows learning control strategies that select optimal actions in certain settings. The main objective is to find paths through the web that maximize the amount of relevant information encountered [1]. Across all cases of recommendation, WebWatcher's experiments achieve only 48% accuracy because of the great diversity of individual users' interests; improving prediction accuracy by understanding users' interests more deeply is left as future work.
Nasraoui et al. [2] proposed a web recommender system that applies clustering to group profiles using hierarchical unsupervised niche clustering (HUNC). The HUNC algorithm clusters the sessions extracted from server logs into multi-resolution user session profiles and also identifies noisy sessions and profiles. Moreover, the clustering process hunts for context-sensitive associations between different URL addresses. The main advantage is that no additional cost is required, and the approach can be generalized to other transactional data. HUNC gives fast performance, and parallelization with genetic algorithms makes it even faster.
Mobasher et al. [3] proposed an association rule mining (ARM) technique and a recommender system framework to predict the pages a user is likely to visit next. Matching an active user session against frequent itemsets can be represented with frequent itemset graphs to predict next pages.
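The kNN estimate of LinkQuality described above can be sketched in code. This is a minimal illustration, not WebWatcher's actual implementation: the bag-of-words cosine similarity and the `link_quality` interface are assumptions.

```python
def cosine(a, b):
    # Cosine similarity between two bags of words (dicts mapping word -> count).
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (sum(v * v for v in a.values()) ** 0.5
           * sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def link_quality(interest, link_keyword_sets, k=5):
    # Estimate LinkQuality as the average similarity of the k highest-ranked
    # keyword sets associated with the hyperlink, as in WebWatcher [1].
    sims = sorted((cosine(interest, ks) for ks in link_keyword_sets),
                  reverse=True)[:k]
    return sum(sims) / len(sims) if sims else 0.0
```

Links whose score exceeds a chosen threshold would then be highlighted for the user.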
The framework is organized into three phases: data preparation, pattern discovery, and the recommendation engine.
Data cleansing, user identification, session identification, page-view identification, and transaction identification are performed in the data preparation phase. These preprocessing tasks ultimately result in a set of n page views, P = {p1, p2, p3, ..., pn}, and a set of m user transactions, T = {t1, t2, t3, ..., tm}, where each ti in T is a subset of P [3]. In the pattern discovery phase, association rules capture relationships among page views based on users' navigational patterns. The Apriori algorithm is used to find groups of items that occur together frequently in many transactions, referred to as frequent itemsets. The last phase takes a collection of frequent itemsets as input and generates a recommendation set for the user by matching the current user's actions against the training patterns. The recommendation engine is the online part of the framework; it includes a data structure for storing frequent itemsets, combined with a recommendation algorithm that produces real-time recommendations directly from the itemsets without first generating association rules. The following example illustrates the concept. Table 1 shows a set of transactions. The Apriori algorithm generates itemsets from the transactions in Table 1 using a frequency threshold of 4, as shown in Table 2. Based on Table 2, Figure 2 shows the corresponding frequent itemset graph.
Figure 2 Frequent itemset for an example [3]
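As a rough sketch of the pattern discovery phase, the level-wise Apriori search for frequent itemsets can be written as follows. The transaction data here is hypothetical (it is not the data of Table 1), and the pruning is simplified relative to the full algorithm.

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Return every itemset contained in at least min_support transactions,
    # growing candidate itemsets one level (one item) at a time [3].
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        level = {a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == k + 1}
        k += 1
    return frequent

freq = apriori([["A", "B"], ["A", "B", "C"], ["A", "C"], ["A"]], min_support=2)
```

The recommendation engine would then walk the resulting itemsets (organized as a graph) to extend the active session.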
Association rule mining suffers from scalability and efficiency problems in web prediction, which opens a path for further research.
Anderson et al. [4] proposed the adaptive-learning MINPATH algorithm, which uses dynamic links to provide shorter paths and save substantial surfing/navigational effort. The proposed system is specifically designed for wireless devices (PDAs, cell phones, tablets, and pagers) used to browse the web. In offline mode, MINPATH learns from the history of users' browsing behavior, which is then used at runtime to predict each user's ultimate destination. The authors also evaluated several web usage mining prediction models within MINPATH, such as the naive Bayes mixture model and the Markov mixture model; the mixture-of-Markov-models variant saves more than 40% of users' navigational effort. The proposed system reduces the number of links a visitor must follow, but only up to a point: beyond that, MINPATH cannot maintain consistency, because it is unable to handle larger sites with many pages and many links between them. The cost of predicting a shortcut link, together with the probability of erroneously following a shortcut, is a major issue in MINPATH, and the accuracy of the mixture Markov model needs some improvement.
Perkowitz et al. [5] challenged the AI community to create adaptive websites that automatically improve their organization and presentation as users access them. They suggested two ways to build adaptive websites: customization and optimization. Modifying web pages at runtime to suit the needs of individual users is referred to as customization; changing the site itself to make navigation easier for all users is referred to as optimization. Both require users' access patterns, which are available from server logs, but for both operations the available information is inadequate, so meta-information is needed to support adaptivity. Su et al.
[6] proposed the N-gram prediction model, which utilizes users' path profiles from very large server log datasets to predict users' future page requests. Two models are introduced: a path-based prediction model built on the web server log, and an N-gram prediction model constructed from the log data to predict users' clicks in real time. The following example elaborates the proposed model. Consider a log file consisting of the request paths given in Table 3.
Table 3: Example of a log file consisting of request paths [6]
p1, p2, p3, p4
p1, p2, p3, p5
p1, p2, p3, p5
p2, p3, p4, p6
p2, p3, p4, p6
p2, p3, p4, p5

Table 4: 3-gram prediction model H3() for the example [6]
N-gram        Prediction
p1, p2, p3    p5
p2, p3, p4    p6

Table 5: 2-gram prediction model H2() for the example [6]
N-gram    Prediction
p1, p2    p3
p2, p3    p4
p3, p4    p6
The first approach, the path-based model, builds an N-gram prediction model based on frequency of occurrence. Building a 3-gram model on the paths in Table 3 gives two 3-grams (p1, p2, p3 and p2, p3, p4), and the resulting model is returned as the hash table H3() shown in Table 4. Building a 2-gram model gives the contexts p1, p2; p2, p3; and p3, p4, and the resulting 2-gram prediction model H2() is given in Table 5. The second approach, the prediction algorithm N-gram+, consults a sequence of such models, with H3() and H2() as examples. Suppose the current clicking sequence is p4, p2, p3. The prediction algorithm first checks H3() for the index p4p2p3; since that index does not exist, it then checks the 2-gram model H2() for the index p2p3. That index exists, so the predicted next click is p4, as per H2(). These methods increase precision at the cost of decreased applicability; this work shows better prediction accuracy while maintaining reasonable applicability using N-gram models with n greater than 2. However, the N-gram+ algorithm is applicable only to a significant portion of the logs, which may degrade system performance both in predicting users' access patterns and in prediction time.
Levene et al. [7] presented a Markov chain model for analyzing users' web browsing behavior by computing the information contained in navigation trails. The main result is twofold: first, a large-deviation result for web trails, obtained through the Markov chain, is needed to compute the chain's entropy in terms of the mean waiting and recurrence times between states; second, an online algorithm is presented that computes the entropy of the Markov chain and converges to the true entropy. In addition, an error analysis is provided for the algorithm in terms of the relative entropy between the empirical and actual transition probabilities.
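The two tables above and the N-gram+ fallback lookup can be reproduced with a short sketch; the function names are illustrative, and ties are assumed to be broken by frequency.

```python
from collections import Counter, defaultdict

def build_ngram_model(paths, n):
    # Map each n-gram of consecutive clicks to the most frequent next click,
    # i.e., the hash tables H3() and H2() of Su et al. [6].
    nexts = defaultdict(Counter)
    for path in paths:
        for i in range(len(path) - n):
            nexts[tuple(path[i:i + n])][path[i + n]] += 1
    return {gram: c.most_common(1)[0][0] for gram, c in nexts.items()}

def ngram_plus_predict(models, clicks):
    # N-gram+: consult the longest model first, falling back to shorter ones.
    for n in sorted(models, reverse=True):
        gram = tuple(clicks[-n:])
        if len(gram) == n and gram in models[n]:
            return models[n][gram]
    return None

paths = [["p1", "p2", "p3", "p4"], ["p1", "p2", "p3", "p5"],
         ["p1", "p2", "p3", "p5"], ["p2", "p3", "p4", "p6"],
         ["p2", "p3", "p4", "p6"], ["p2", "p3", "p4", "p5"]]
models = {3: build_ngram_model(paths, 3), 2: build_ngram_model(paths, 2)}
```

For the clicking sequence p4, p2, p3 this reproduces the walkthrough above: H3() has no entry for (p4, p2, p3), and H2() maps (p2, p3) to p4.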
Table 6: Algorithm to estimate the entropy of a Markov chain [7]

Algorithm 1 (ENTROPY({m_i}, {m_i,j}))
1.  begin
2.    H, {m_i}, {m_i,j}, k := 0;
3.    s_1 := random({P_i});
4.    while not CONVERGED do
5.      k := k + 1;
6.      s_{k+1} := random(s_k, {P_kj});
7.      H := H + (m_{k,(k+1)} + 1) log((m_{k,(k+1)} + 1)/(m_k + 1)) - m_{k,(k+1)} log(m_{k,(k+1)}/m_k);
8.      for all j != (k + 1) do
9.        H := H + m_{k,j} log(m_k/(m_k + 1));
10.     end for
11.     m_k := m_k + 1;
12.     m_{k,(k+1)} := m_{k,(k+1)} + 1;
13.   end while
14.   return H/k;
15. end.
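For intuition, a simplified batch counterpart of this online estimator (an illustration, not the algorithm of [7] itself) computes the entropy rate directly from empirical transition counts:

```python
from math import log2

def markov_entropy(transitions):
    # Entropy rate H = -sum_i pi_i * sum_j p_ij * log2(p_ij), estimated from
    # empirical transition counts {(i, j): count}; the stationary weight
    # pi_i is approximated by the relative visit frequency of state i.
    total = sum(transitions.values())
    by_state = {}
    for (i, j), c in transitions.items():
        by_state.setdefault(i, {})[j] = c
    h = 0.0
    for i, nexts in by_state.items():
        n_i = sum(nexts.values())
        for c in nexts.values():
            p = c / n_i
            h -= (n_i / total) * p * log2(p)
    return h
```

A deterministic walk has zero entropy; a state with two equally likely successors contributes one bit.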
Table 6 shows the algorithm ENTROPY({m_i}, {m_i,j}), which returns an estimate of the entropy of the Markov model M by overloading the index k of state s_k so that it refers both to the kth state in the trail induced by the random walk and to the kth state in N. {m_i} and {m_i,j} denote the number of visits to s_i and the number of transitions from s_i to s_j so far, where i, j are in {1, ..., n}. CONVERGED is a Boolean variable that is initially false [7]. The function random({P_i}) returns an initial state distributed according to the initial probabilities {P_i}, and random(s_i, {P_ij}) returns a state adjacent to s_i, distributed according to the transition probabilities {P_ij}, where i, j are in {1, ..., n}.
Pitkow and Pirolli [8] proposed a technique called longest repeating subsequences (LRS), which reduces the modeling complexity of users' surfing patterns. To understand the LRS concept, consider the following example. Suppose a website contains pages p1, p2, p3, and p4, where p1 contains a hyperlink to p2 and p2 contains hyperlinks to both p3 and p4, as shown in Figure 3. In case 1, users repeatedly surf from p1 to p2, but only one user clicks through to p3 and one user clicks through to p4.
Figure 3 Example illustrating the formation of Longest Repeating Subsequences [8]
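The LRS notion illustrated above can be computed with a naive enumeration. The sketch below infers the definition from the example (a subpath is an LRS when it repeats and, at some occurrence, cannot be extended into another repeating subpath); it is an assumption drawn from the description, not the exact algorithm of [8], and only subpaths of length at least 2 are considered.

```python
from collections import Counter

def longest_repeating_subsequences(sessions, threshold=1):
    # Count every contiguous subpath of length >= 2 across all sessions.
    counts = Counter()
    for s in sessions:
        for i in range(len(s)):
            for j in range(i + 2, len(s) + 1):
                counts[tuple(s[i:j])] += 1
    lrs = set()
    for s in sessions:
        for i in range(len(s)):
            for j in range(i + 2, len(s) + 1):
                sub = tuple(s[i:j])
                if counts[sub] <= threshold:
                    continue  # "repeating": must occur more than once
                # "longest": this occurrence is not extendable into a
                # subpath that itself repeats.
                if j == len(s) or counts[tuple(s[i:j + 1])] <= threshold:
                    lrs.add(sub)
    return lrs
```

On case-1 data the only LRS is p1p2; once p1p2p4 repeats (case 2), it joins the set, and if every session continues through p4 (case 4), p1p2 drops out.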
From this we conclude that the longest repeating subsequence is p1p2. In case 2, if more than one user clicks through p2 to p4, then both p1p2 and p1p2p4 are longest repeating subsequences. In case 3, both p1p2p3 and p1p2p4 are longest repeating subsequences, since both occur more than once and are the longest such subsequences. In case 4, if all users click through p1p2p4, then p1p2 is not a longest repeating subsequence. Summarizing, longest repeating subsequences have several interesting properties. First, removing low-probability transitions reduces complexity. Second, the LRS bias toward specificity has a good impact on hit rate. Experiments with LRS suggest that improvements are needed in adaptation and in real-time modeling, and future work is proposed on the loss of pattern matching as well as on prediction accuracy for large input datasets.
Hassan et al. [9] present Bayesian models for web navigation patterns, focusing on the problems of session length (short and long sessions), kinds of pages, page-view ranges, and page-category rank. Navigation patterns are learned and predicted per user, per time slot, and per user/time-slot combination, employing Bayes' rule and Markov chain models to obtain both accuracy and simplicity. The Bayesian model is adopted for the following reasons: 1) it is simple and intuitive; 2) it adapts to concept drift; 3) it is efficient and accurate. Table 7 shows the four pattern problems and the classifier used for each [9].
Table 7: Classifier/model used for each pattern [9]
Pattern No.  Pattern Name                                               Classifier/Technique
1            Short and long visit sessions                              Naive Bayes classifier
2            Page categories visited in first N positions               Markov chain
3            Range of number of page views per page category visited    Bayes classifier
4            Rank of page categories in visit sessions                  Naive Bayes classifier
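Patterns 1 and 4 in Table 7 rely on naive Bayes classifiers over the pages in a session. A minimal sketch follows; the feature representation and Laplace smoothing are assumptions, and [9] describes the actual models.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(sessions, labels):
    # Per-class priors and per-class page counts for a naive Bayes model:
    # P(label) * product over pages of P(page | label).
    priors = Counter(labels)
    page_counts = defaultdict(Counter)
    for session, label in zip(sessions, labels):
        page_counts[label].update(session)
    return priors, page_counts

def classify_nb(model, session, alpha=1.0):
    priors, page_counts = model
    vocab = {p for c in page_counts.values() for p in c}
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, prior in priors.items():
        total = sum(page_counts[label].values())
        lp = log(prior / n)
        for page in session:
            # Laplace (add-alpha) smoothing for pages unseen under a class.
            lp += log((page_counts[label][page] + alpha)
                      / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The same generative structure supports all four patterns, with the class variable swapped for session length, category rank, and so on.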
The proposed model is generative in nature, and for pattern 1 the evaluation does not improve prediction performance. The time complexity for all patterns is O(D), where D is the total number of visit sessions in the data; the space complexity is the sum of the products of the variables' cardinalities, as defined by the number of probabilities required. The proposed model has the same prediction accuracy for learning and predicting sessions as a support vector machine, but is faster.
Awad et al. [10] proposed a fusion of several prediction models, namely the Markov model and support vector machines, with domain-knowledge exploitation to improve prediction accuracy. The proposed system draws on two sources of prediction: the SVM and the Markov model. For the SVM, two ideas are applied: first, features are extracted from users' browsing sessions and used to train the SVM; second, domain knowledge is integrated into prediction to improve accuracy. The second source of prediction is the probabilistic Markov model. Utilizing the best of both, the combination surpasses the prediction accuracy of all other techniques, but the proposed models suffer from prediction and training overhead. ARM and SVM models have limitations in scaling to large datasets, whereas SVM and ANN have difficulty handling multiclass problems efficiently because of the large number of labels/classes involved in web prediction.
Awad et al. [11] combine several classification techniques, namely the Markov model, ANN, and the All-Kth Markov model, to resolve prediction using Dempster's rule, which boosts accuracy on the large number of classes
of ANN, and engage a reduction technique with domain knowledge that reduces the number of classifiers, improving both the prediction accuracy and the prediction time of the ANNs. The fusion of prediction models outperforms on prediction accuracy, but compromises scalability in the number of paths, and the fusion may compromise prediction time.
To overcome the limitations of [10] and [11], Awad et al. [12] proposed a modified Markov model and a novel two-tier prediction framework, to improve prediction accuracy (by reducing the complexity of the original Markov model) and prediction time (without compromising prediction accuracy), respectively. Experiments were performed on three standard benchmark datasets: the NASA dataset, the University of Saskatchewan (UOFS) dataset, and the United Arab Emirates University (UAEU) dataset [13]. The proposed model requires background study and analysis of the N-gram representation of sessions, the Markov model, the All-Kth Markov model, and ARM. The N-gram representation is used in web prediction to represent a training session as the sequence of page clicks made by a user navigating a website [6], [12]; a sliding-window method is used when building and training on sessions. The Markov model implements web prediction by predicting the next page likely to be visited based on the history of previously visited pages, for an arbitrary order k, to achieve efficiency and good performance in model building and prediction time. In a fourth-order Markov model, the prediction of the future page is computed based only on the four web pages previously visited [2], [3]. In the All-Kth Markov model, all orders of the Markov model are used collectively in prediction to increase accuracy [2], [3]. ARM, a data mining technique, is also used for web prediction, supporting accuracy while maintaining relationships as the number of items varies.
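A compact sketch of a kth-order Markov model with the All-Kth fallback strategy follows; the interfaces are illustrative, and [12] adds pruning and the two-tier framework on top of this idea.

```python
from collections import Counter, defaultdict

def train_markov(sessions, k):
    # kth-order Markov model: count next-page frequencies conditioned on
    # the k previously visited pages.
    model = defaultdict(Counter)
    for s in sessions:
        for i in range(len(s) - k):
            model[tuple(s[i:i + k])][s[i + k]] += 1
    return model

def all_kth_predict(models_by_order, history):
    # All-Kth Markov prediction: try the highest order whose state covers
    # the current history, then fall back to lower orders [2][3].
    for k in sorted(models_by_order, reverse=True):
        state = tuple(history[-k:])
        if len(state) == k and state in models_by_order[k]:
            return models_by_order[k][state].most_common(1)[0][0]
    return None
```

Higher orders give more specific (and often more accurate) predictions, at the cost of coverage, which is exactly what the fallback compensates for.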
Figure 4 Different phases in two-tier prediction framework training process [12]
The proposed modified Markov model achieves better memory utilization and faster prediction by reducing the number of paths in the model. The key observation behind it is the user's browsing order on the web: repeated pages within a session's browsing order are discarded based on this observation. The second approach, the two-tier prediction framework, creates an example classifier (EC) based on the training examples and the generated classifiers, to improve prediction time; notably, it does not compromise prediction accuracy [12]. Figure 4 shows the two-tier framework's training process. In the proposed model, the first N orders of Markov models are generated, namely the first-order, second-order, ..., Nth-order Markov models, by applying a sliding window to the training set T, and then each training example in T is mapped to one or more orders of Markov model. In Figure 4, training example t3 is mapped to two classifier IDs, C1 and C2, i.e., the first-order and second-order Markov models, respectively. In the pruning/filtering process, examples mapped to more than one classifier are assigned to the classifier with the highest probability, and finally the EC is generated by training on the filtered dataset with one of the prediction techniques. SVM is used to generate the EC, but any learning technique, including ANNs or decision trees, could be used. Extracting statistical features may improve prediction accuracy over the proposed system; furthermore, using an ANN with error back-propagation instead of SVM to generate the example classifier may further improve the model's performance.
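The sliding-window step used above to build the N Markov orders can be sketched as follows (the function name is illustrative):

```python
def sliding_window_examples(session, w):
    # Slide a window of width w over the session, emitting one
    # (history, next page) training example per position [6][12].
    return [(tuple(session[i:i + w]), session[i + w])
            for i in range(len(session) - w)]

# One session yields training examples for every order up to N:
examples = {w: sliding_window_examples(["p1", "p2", "p3", "p4"], w)
            for w in (1, 2)}
```

Each (history, next page) pair then feeds the Markov model of the corresponding order, and the classifier-ID mapping is built from which orders cover each example.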
4. SUMMARY
In this paper, a survey is presented of several kinds of prediction models for the web prediction problem, focusing especially on improving the effectiveness of web prediction for recommender systems: prediction accuracy, prediction time, performance, efficiency, simplicity, memory utilization, applicability, and scalability, through session extraction from web logs, user understanding, model understanding, and document understanding with ranking and monitoring of evaluation and feedback. Many advanced prediction models have been developed, as well as mixtures of them; these models were applied either to large log datasets or to parts of datasets to enhance the power of existing work. Challenging and interesting problems remain in existing systems. These challenges can be categorized into two groups, preprocessing and prediction, and are highlighted by the following points:
- Domain knowledge hunting.
- Handling amounts of data larger than computer memory, to improve memory utilization.
- Session identification, to ease the working of the proposed methods.
- Reducing training time on long training datasets.
- Improving prediction time.
- Selection of the prediction model.
- Hybrid approaches for improving prediction time.
- Prediction accuracy.
- Modeling complexity.
- The trade-off between prediction accuracy, prediction time, and modeling complexity.
Acknowledgement
We sincerely thank all the anonymous researchers for their helpful opinions, findings, conclusions, and recommendations. Thanks also to our guide Prof. Anagha P. Khedkar, HOD Dr. Varsha H. Patil, Principal Dr. G. K. Kharate, and colleagues for their support and guidance.
References
[1] T. Joachims, D. Freitag, and T. Mitchell, "WebWatcher: A tour guide for the World Wide Web," in Proc. IJCAI, 1997, pp. 770-777.
[2] O. Nasraoui and R. Krishnapuram, "One step evolutionary mining of context sensitive associations and Web navigation patterns," in Proc. SIAM Int. Conf. Data Mining, Arlington, VA, Apr. 2002, pp. 531-547.
[3] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, "Effective personalization based on association rule discovery from Web usage data," in Proc. ACM Workshop WIDM, Atlanta, GA, Nov. 2001.
[4] C. R. Anderson, P. Domingos, and D. S. Weld, "Adaptive Web navigation for wireless devices," in Proc. IJCAI Workshop, Seattle, WA, 2001.
[5] M. Perkowitz and O. Etzioni, "Adaptive Web sites: An AI challenge," in Proc. IJCAI Workshop, Nagoya, Japan, 1997.
[6] Z. Su, Q. Yang, Y. Lu, and H. Zhang, "WhatNext: A prediction system for Web requests using n-gram sequence models," in Proc. 1st Int. Conf. Web Inf. Syst. Eng., Hong Kong, Jun. 2000, pp. 200-207.
[7] M. Levene and G. Loizou, "Computing the entropy of user navigation in the Web," Int. J. Inf. Technol. Decision Making, vol. 2, no. 3, pp. 459-476, 2003.
[8] J. Pitkow and P. Pirolli, "Mining longest repeating subsequences to predict World Wide Web surfing," in Proc. 2nd USITS, Boulder, CO, Oct. 1999.
[9] M. T. Hassan, K. N. Junejo, and A. Karim, "Learning and predicting key Web navigation patterns using Bayesian models," in Proc. Int. Conf. Comput. Sci. Appl. II, Seoul, Korea, 2009, pp. 877-887.
[10] M. Awad, L. Khan, and B. Thuraisingham, "Predicting WWW surfing using multiple evidence combination," VLDB J., vol. 17, no. 3, pp. 401-417, May 2008.
[11] M. Awad and L. Khan, "Web navigation prediction using multiple evidence combination and domain knowledge," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 37, no. 6, pp. 1054-1062, Nov. 2007.
[12] M. Awad and I. Khalil, "Prediction of user's Web-browsing behavior: Application of Markov model," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 42, no. 4, pp. 1131-1142, Aug. 2012.
[13] Internet Traffic Archive. [Online]. Available: http://ita.ee.lbl.gov/html/traces.html