This document discusses evaluation in information retrieval. It describes standard test collections which consist of a document collection, queries on the collection, and relevance judgments. It also discusses various evaluation measures used in information retrieval like precision, recall, F-measure, mean average precision, and kappa statistic which measure reliability of relevance judgments. R-precision and normalized discounted cumulative gain are also summarized as important single number evaluation measures.
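As a concrete refresher on the set-based measures listed above, here is a minimal sketch of precision, recall, F-measure, and average precision (the document ids are invented for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranked, relevant):
    """Precision averaged over the ranks at which relevant documents appear."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

p, r, f1 = precision_recall_f1(["d1", "d2", "d3"], ["d1", "d3", "d4"])
ap = average_precision(["d1", "d2", "d3"], ["d1", "d3", "d4"])
```

Mean average precision is then just this average precision averaged over a set of queries.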
This document discusses link analysis and PageRank, an algorithm for identifying important nodes in large network graphs. It begins with an overview of graph data structures and the goal of identifying influential nodes. It then introduces PageRank, explaining its basic assumptions and showing examples of how it calculates node importance scores. The document discusses problems with the initial PageRank approach and how it was improved with the Complete PageRank algorithm. Finally, it briefly introduces Topic-sensitive PageRank, which aims to identify important nodes related to specific topics.
This document gives a brief description of the three mining techniques, the differences and similarities between them, and finally the techniques they share.
The PageRank algorithm calculates the importance of web pages based on the structure of incoming links. It models a random web surfer that randomly clicks on links, and also occasionally jumps to a random page. Pages are given more importance if they are linked to by other important pages. The algorithm represents this as a Markov chain and computes the PageRank scores through an iterative process until convergence. It has the advantages of being resistant to spam and efficiently pre-computing scores independently of user queries.
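The iterative process described above can be sketched as a plain power iteration (the three-page graph is invented for illustration; dangling pages with no outlinks would need extra handling in a real implementation):

```python
def pagerank(links, damping=0.85, tol=1e-9):
    """Power iteration on the 'random surfer' Markov chain.
    links maps each page to the pages it links to; every page
    is assumed to have at least one outlink."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    while True:
        new = {}
        for p in pages:
            # teleport term plus rank flowing in from every page linking to p
            incoming = sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        if max(abs(new[p] - ranks[p]) for p in pages) < tol:
            return new
        ranks = new

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

Because the scores are query-independent, they can be computed offline exactly as the summary notes.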
Web mining is the application of data mining techniques to discover patterns from the World Wide Web. As the name suggests, this is information gathered by mining the web.
This document discusses web usage mining. It begins by defining web mining and its three categories: web content mining, web structure mining, and web usage mining. The main focus is on web usage mining, which involves discovering user navigation patterns and predicting user behavior. The key processes of web usage mining are preprocessing raw data, pattern discovery using algorithms, and pattern analysis. Pattern discovery techniques discussed include statistical analysis, clustering, classification, association rules, and sequential patterns. Potential applications are personalized recommendations, system improvements, and business intelligence. The document concludes by discussing future research directions such as usage mining on the semantic web and analyzing discovered patterns.
The document discusses the World Wide Web and information retrieval on the web. It provides background on how the web was developed by Tim Berners-Lee in 1990 using HTML, HTTP, and URLs. It then discusses some key differences in information retrieval on the web compared to traditional library systems, including the presence of hyperlinks, heterogeneous content, duplication of content, exponential growth in the number of documents, and lack of stability. It also summarizes some challenges in web search including the expanding nature of the web, dynamically generated content, influence of monetary contributions on search results, and search engine spamming.
This document discusses web scraping and data extraction. It defines scraping as converting unstructured data like HTML or PDFs into machine-readable formats by separating data from formatting. Scraping legality depends on the purpose and terms of service - most public data is copyrighted but fair use may apply. The document outlines the anatomy of a scraper including loading documents, parsing, extracting data, and transforming it. It also reviews several scraping tools and libraries for different programming languages.
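The load/parse/extract/transform anatomy outlined above can be sketched with Python's standard-library HTML parser; the `PriceScraper` class and the inline sample page are invented for illustration (in practice the document would be loaded over HTTP first):

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Extracts the text of <span class="price"> elements from a page."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []
    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True
    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False
    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

# Load: a fetched document stands in for this inline string.
html = ('<ul><li><span class="price">$9.99</span></li>'
        '<li><span class="price">$14.50</span></li></ul>')
scraper = PriceScraper()
scraper.feed(html)                                        # parse + extract
values = [float(p.lstrip("$")) for p in scraper.prices]   # transform
```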
Top (10) challenging problems in data mining (Ahmedasbasb)
This document outlines the top 10 challenging problems in data mining, as presented by Dr. Ali Haroun. It introduces data mining and some common techniques. The top 10 problems are then each described in one or two paragraphs: (1) developing a unifying theory of data mining, (2) scaling up for high dimensional and high speed data, (3) mining sequence and time series data, (4) mining complex knowledge from complex data, (5) data mining in a network setting, (6) distributed data mining and mining multi-agent data, (7) data mining for biological and environmental problems, (8) data mining process-related problems, (9) security, privacy, and data integrity, and (10) dealing with non-static, unbalanced, and cost-sensitive data.
The PageRank and HITS techniques are used for ranking the relevance of web pages through analysis of the hyperlink structure that links pages together.
Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts, identifying frequent concept sets and associations between concepts. Text mining systems aim to analyze document collections over time to identify trends, ephemeral relationships and anomalous patterns.
Data mining is the process of automatically discovering useful information from large data sets. It draws from machine learning, statistics, and database systems to analyze data and identify patterns. Common data mining tasks include classification, clustering, association rule mining, and sequential pattern mining. These tasks are used for applications like credit risk assessment, fraud detection, customer segmentation, and market basket analysis. Data mining aims to extract unknown and potentially useful patterns from large data sets.
This document presents an overview of web mining techniques. It discusses how web mining uses data mining algorithms to extract useful information from the web. The document classifies web mining into three categories: web structure mining, web content mining, and web usage mining. It provides examples and explanations of techniques for each category such as document classification, clustering, association rule mining, and sequential pattern mining. The document also discusses opportunities and challenges of web mining as well as sources of web usage data like server logs.
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
This document discusses data mining and its applications. It defines data mining as using algorithms to discover patterns in large data sets beyond simple analysis. It then provides examples of data mining applications, including market basket analysis, education, manufacturing, customer relationship management, fraud detection, research analysis, criminal investigation, and bioinformatics. The document also outlines the typical stages of the data mining process: data understanding, data preparation, modeling, evaluation, and deployment.
Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.
This document provides an overview of the PageRank algorithm. It begins with background on PageRank and its development by Brin and Page. It then introduces the concepts behind PageRank, including how it uses the link structure of webpages to determine importance. The core PageRank algorithm is explained, modeling the web as a graph and calculating page importance based on both the number and quality of inbound links. Iterative methods like power iteration are described for approximating solutions. Examples are given to illustrate PageRank calculations over multiple iterations. Implementation details, applications, advantages/disadvantages are also discussed at a high level. Pseudocode is included.
This document discusses text and web mining. It defines text mining as analyzing huge amounts of text data to extract information. It discusses measures for text retrieval like precision and recall. It also covers text retrieval and indexing methods like inverted indices and signature files. Query processing techniques and ways to reduce dimensionality like latent semantic indexing are explained. The document also discusses challenges in mining the world wide web due to its size and dynamic nature. It defines web usage mining as collecting web access information to analyze paths to accessed web pages.
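A minimal sketch of the inverted-index idea mentioned above (the toy document collection and helper names are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "web mining basics", 2: "text mining basics", 3: "web usage logs"}
index = build_inverted_index(docs)
# A conjunctive query intersects the posting lists of its terms
hits = set(index["web"]) & set(index["mining"])
```

Precision and recall can then be measured against such a query's results exactly as the summary describes.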
This document provides an overview of web mining and summarizes key concepts. It begins with definitions of data mining and web mining. The document then discusses three categories of web mining: web content mining, web usage mining, and web structure mining. Various matrix expressions used to represent web data are also introduced, including document-keyword co-occurrence matrices, adjacency matrices, and usage matrices. Finally, two common similarity functions - Pearson correlation coefficient and cosine similarity - are outlined.
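The two similarity functions mentioned above can be sketched as follows; the Pearson correlation is computed here as cosine similarity of mean-centred vectors, and the sample vectors are invented for illustration:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine_similarity([a - mu for a in u], [b - mv for b in v])

u, v = [1, 2, 3, 4], [2, 4, 6, 8]
# v is a scaled copy of u, so both measures reach their maximum of 1
cos_uv = cosine_similarity(u, v)
r_uv = pearson(u, v)
```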
This document provides an overview of how to become a data scientist from scratch. It discusses the key skills needed, which include mathematics/statistics, computer programming, and business knowledge. It then covers various topics required for a data science career like mathematics, programming languages, data wrangling, analysis, machine learning, deep learning, big data, and additional skills like NLP and CV. The document also lists learning outcomes, best online resources, blogs, books, and packages to learn data science from the ground up.
Data Mining: What is Data Mining?
History
How Data Mining Works
Data Mining Techniques
Data Mining Process (The Cross-Industry Standard Process)
Data Mining: Applications
Advantages and Disadvantages of Data Mining
Conclusion
The document provides an overview of data warehousing and OLAP technology. It defines a data warehouse as a subject-oriented, integrated collection of historical data used for analysis and decision making. It describes key properties of data warehouses including being subject-oriented, integrated, time-variant, and non-volatile. It also discusses dimensional modeling, data cubes, and OLAP for analyzing aggregated data.
Web mining involves applying data mining techniques to automatically discover and extract information from web documents and services. It has three main types: web content mining, which extracts useful information from web document contents; web structure mining, which analyzes the hyperlink structure of websites; and web usage mining, which involves discovering patterns from user interactions on websites. Popular algorithms for web mining include PageRank for web structure mining and HITS for determining both hub and authority pages.
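As a rough illustration of the hub/authority idea behind HITS mentioned above, here is a minimal sketch of the iterative updates (the tiny example graph is invented for illustration):

```python
import math

def hits(links, iterations=50):
    """Iterative hub/authority updates with L2 normalisation each round."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p: sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {p: a / norm for p, a in auth.items()}
        # hub of p: sum of authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
hub, auth = hits(graph)
```

Here A links out heavily and so scores as a hub, while C is heavily linked to and so scores as an authority.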
The document summarizes a technical seminar on web-based information retrieval systems. It discusses information retrieval architecture and approaches, including syntactical, statistical, and semantic methods. It also covers web search analysis techniques like web structure analysis, content analysis, and usage analysis. The document outlines the process of web crawling and types of crawlers. It discusses challenges of web structure, crawling and indexing, and searching. Finally, it concludes that as unstructured online information grows, information retrieval techniques must continue to improve to leverage this data.
This document discusses web mining and its various types. Web mining involves using data mining techniques to discover useful information from web documents and usage patterns. It can involve content mining of text, images, video and audio to extract useful information. It also includes structure mining, which analyzes the hyperlink structure between documents and within documents. Additionally, web usage mining analyzes log files from web servers and applications to discover interesting usage patterns. The document outlines the differences between traditional data mining and web mining. It provides examples of applications of web mining such as information retrieval, network management and e-commerce.
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
YouTube: https://youtu.be/xtOg44r6dsE
In this PPT on Supervised vs Unsupervised vs Reinforcement learning, we’ll be discussing the types of machine learning and we’ll differentiate them based on a few key parameters. The following topics are covered in this session:
1. Introduction to Machine Learning
2. Types of Machine Learning
3. Supervised vs Unsupervised vs Reinforcement learning
4. Use Cases
This document provides an overview of machine learning including: definitions of machine learning; types of machine learning such as supervised learning, unsupervised learning, and reinforcement learning; applications of machine learning such as predictive modeling, computer vision, and self-driving cars; and current trends and careers in machine learning. The document also briefly profiles the history and pioneers of machine learning and artificial intelligence.
1. The document proposes techniques to improve search performance by matching schemas between structured and unstructured data sources.
2. It involves constructing schema mappings using named entities and schema structures. It also uses strategies to narrow the search space to relevant documents.
3. The techniques were shown to improve search accuracy and reduce time/space complexity compared to existing methods.
ODAM: an optimized distributed association rule mining algorithm (synopsis) (Mumbai Academisc)
This document proposes ODAM, an optimized distributed association rule mining algorithm. It aims to discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. Modern organizations have geographically distributed data stored locally at each site, making centralized data mining infeasible due to high communication costs. Distributed data mining emerged to address this challenge. ODAM reduces communication costs compared to previous distributed ARM algorithms by mining patterns across distributed databases without requiring data consolidation.
Kuan-ming Lin is interested in data mining, particularly mining biological databases, web documents, and the semantic web. He has skills in data mining techniques including machine learning, feature selection, and support vector machines. He has published papers on data integration of microarray data and structure prediction of HIV coreceptors. He hopes to continue a career in data mining and cloud computing.
Data mining refers to the process of analysing data from different perspectives and summarizing it into useful information.
Data mining software is one of a number of tools used for analysing data. It allows users to analyse data from many different dimensions and angles, categorize it, and summarize the relationships identified.
Data mining is about techniques for finding and describing structural patterns in data.
Data mining is the process of finding correlations or patterns among fields in large relational databases.
It is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
The document provides an introduction to information retrieval, including its history, key concepts, and challenges. It discusses how information retrieval aims to retrieve relevant documents from a collection to satisfy a user's information need. The main challenge in information retrieval is determining relevance, as relevance depends on personal assessment and can change based on context, time, location, and device. The document outlines the major issues and developments in the field over time from the 1950s to present day.
The document provides an introduction to information retrieval, including its history, key concepts, and challenges. It discusses how information retrieval aims to retrieve relevant documents from a collection to satisfy a user's information need. The main challenge in information retrieval is determining relevance, as relevance depends on personal assessment, task, context, time, location, and device. Three main issues in information retrieval are determining relevance, representing documents and queries, and developing effective retrieval models and algorithms.
This document discusses different types of web mining techniques. It begins by defining web mining as the application of data mining techniques to discover and extract information from web data. The three main types of web mining are discussed as web content mining, web structure mining, and web usage mining. Web content mining involves mining the actual contents within web pages and documents. Web structure mining mines the hyperlink structure of websites to determine how web pages are linked together. Web usage mining mines web server logs to discover user browsing patterns and behaviors.
A web content mining application for detecting relevant pages using Jaccard ... (IJECE, IAES)
The tremendous growth in text data available from a variety of sources raises many obstacles to discovering meaningful information. This technological advance has dispersed texts across millions of web sites, and unstructured texts are densely packed with information, so discovering valuable and interesting relationships in them demands considerable computer processing. Text mining has therefore developed into an attractive area of study for obtaining organized and useful data. One purpose of this research is to discuss text pre-processing in the automobile marketing domain in order to create a structured database. Regular expressions were used to extract data from unstructured vehicle advertisements, resulting in a well-organized database; the authors manually developed rule-based methods for extracting structured data from unstructured web pages. The information retrieved from these advertisements supports a systematic search for noteworthy attributes. There are numerous approaches to query recommendation, and it is vital to understand which one should be employed. This research therefore also attempts to determine the optimal similarity value for query suggestions based on user-supplied parameters by comparing MySQL pattern matching and Jaccard similarity.
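As a rough sketch of token-set Jaccard similarity of the kind the paper compares against MySQL pattern matching (the advertisement strings below are invented, not taken from the paper's data):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the word-token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

ads = ["toyota corolla 2015 low mileage",
       "honda civic 2015",
       "toyota corolla 2018"]
query = "toyota corolla"
# Rank stored advertisements by similarity to the user's query
ranked = sorted(ads, key=lambda ad: jaccard(query, ad), reverse=True)
```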
Business Intelligence: A Rapidly Growing Option through Web MiningIOSR Journals
This document discusses web mining techniques for business intelligence. It begins with an introduction to web mining and its subfields of web content mining, web structure mining, and web usage mining. It then focuses on web usage mining, describing the process of preprocessing log data, discovering patterns using techniques like statistical analysis and association rule mining, and analyzing the patterns. The goal is to understand customer behavior and improve business functions like marketing through data collected from web servers, proxy servers, and clients.
A Study Web Data Mining Challenges And Application For Information ExtractionScott Bou
This document discusses challenges in web data mining for information extraction. It outlines how web data varies from structured to unstructured, posing challenges for data mining techniques. Some key challenges discussed are the quality of keyword-based searches, effectively extracting information from the deep web which contains searchable databases, limitations of manually constructed directories, and the need for semantics-based queries. The document argues that addressing these challenges will require improved web mining techniques to fully utilize the vast information available on the web.
This document discusses interactive visualization techniques for information retrieval. It begins by stating that information retrieval systems often return many results, some more relevant than others. While search engines have grown, problems remain with low precision and recall. Visualization techniques can help users better understand retrieval results. The document then reviews several visualization methods like tree views, title views, and bubble views that can enhance web information retrieval systems by helping users browse, filter, and reformulate queries. It argues visualization is an effective tool for dealing with large numbers of documents returned in web searches.
The International Journal of Engineering and Science (The IJES)theijes
The document provides an overview of various web content mining tools. It begins with an introduction to web mining, distinguishing between web structure mining, web content mining, and web usage mining. It then discusses web content mining in more detail. The document proceeds to describe several specific web content mining tools - Screen-scraper, Automation Anywhere 6.1, Web Info Extractor, Mozenda, and Web Content Extractor. It provides details on the features and capabilities of each tool. Finally, the document concludes by comparing the tools based on usability, ability to record data, and capability to extract structured and unstructured web data.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Abstract: In many fields, such as industry, commerce, government, and education, knowledge discovery and data
mining can be immensely valuable to the subject of Artificial Intelligence. Because of the recent increase in
demand for KDD techniques, such as those used in machine learning, databases, statistics, knowledge acquisition,
data visualisation, and high performance computing, knowledge discovery and data mining have grown in
importance. By employing standard formulas for computational correlations, we hope to create an integrated
technique that can be used to filter web world social information and find parallels between similar tastes of
diverse user information in a variety of settings
The document summarizes techniques for web mining, which involves mining web content, structure, and usage data. Web content mining extracts useful information from web page content and structures. Web structure mining analyzes the hyperlink structure between pages to determine important pages and group similar pages. Web usage mining analyzes server logs to discover general access patterns and customize websites for individual users based on their behavior. Text mining extends traditional data mining to unstructured text data through features like word occurrences and relationships.
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...IOSR Journals
This document discusses using web mining techniques like association rule mining to build an academic portal for Al-Imam Muhammad Ibn Saud Islamic University. It proposes building an information system where web data mining and semantic web technologies are applied using association rule algorithms. This would allow building ontologies for new knowledge and classifying that knowledge to add to composed knowledge databases. The paper examines using techniques like association rule mining on web server logs and document contents and structures to extract patterns and associate web pages and documents. This could help build a semantic portal and retrieve integrated information through the portal.
This document provides a literature survey and comparison of different techniques for web mining, including web structure mining, web usage mining, and web content mining. It summarizes various page ranking algorithms and models like PageRank, Weighted PageRank, HITS, General Utility Mining, and Topological Frequency Utility Mining. The document compares these algorithms and models based on the type of web mining activity, whether they consider website topology, their processing approach, and limitations. It aims to help compare techniques for analyzing the structure, usage, and content of websites.
Here is documentation part for the night vision technology which is a latest everything technology.Have the look at these and enjoy the documentation,for more updates please follow me @shobha rani
Hi guys , here is new presentation which is related to password authentication named as Graphical Password Authentication.Here i have covered all the topics which are related to GPA .I will also provide a documentation regarding this topic if u need .So please comment below for the document and fallow @shobha rani
Here is a another presentation based on latest data storage technology which is called as 3D optical data storage.here i have covered all the related topics.If u need documentation for this presentation please let me know in n=below comments.so that i will share u @shobha rani.
Here is a Presentation regarding web mining which is a blooming technology in the industry,here i have covered all the topics required for presentation. Hope u enjoy it.Please encourage to post more presentation documents.I can provide u the document also ,if anyone need comment below.
This document provides an overview of night vision technology. It discusses the history of night vision beginning in Germany in the 1930s. It describes how night vision works using either thermal imaging or image enhancement to detect infrared light. The document outlines the different generations of night vision devices and their improvements. It lists common night vision equipment like scopes, goggles, and cameras. Applications of night vision technology include military, hunting, surveillance, and automobiles. The future of night vision may allow sharing images between devices over long distances.
Hi guys this a PPT for brain gate technology which is a blooming technology in industry.Here i have explained what are req for presentation.Hope u enjoy and if u like my presentation please encourage me by fallowing,commenting and liking.If u need documentation part please comment below.
PPT based on Human Computer Interface whch is easier to understand and carryout the presentation in conferences..if u need documentation please make a comment down...enjoy the ppt..have a good luck
This document provides an overview of cluster computing. It defines a computer cluster as a set of loosely or tightly connected computers that work together to perform tasks like a single system. The document outlines how a cluster works by distributing tasks from a job to nodes in the cluster. It describes different types of clusters and components like nodes and networks. Advantages include reduced costs, scalability and availability, while disadvantages include need for parallel programming skills and more complex administration. Limitations include high latency and low bandwidth between nodes. In conclusion, cluster computing provides good price to performance and reliability due to lack of single point of failure.
Multi-currency in odoo accounting and Update exchange rates automatically in ...Celine George
Most business transactions use the currencies of several countries for financial operations. For global transactions, multi-currency management is essential for enabling international trade.
How to manage Multiple Warehouses for multiple floors in odoo point of saleCeline George
The need for multiple warehouses and effective inventory management is crucial for companies aiming to optimize their operations, enhance customer satisfaction, and maintain a competitive edge.
Exploring Substances:
Acidic, Basic, and
Neutral
Welcome to the fascinating world of acids and bases! Join siblings Ashwin and
Keerthi as they explore the colorful world of substances at their school's
National Science Day fair. Their adventure begins with a mysterious white paper
that reveals hidden messages when sprayed with a special liquid.
In this presentation, we'll discover how different substances can be classified as
acidic, basic, or neutral. We'll explore natural indicators like litmus, red rose
extract, and turmeric that help us identify these substances through color
changes. We'll also learn about neutralization reactions and their applications in
our daily lives.
by sandeep swamy
The ever evoilving world of science /7th class science curiosity /samyans aca...Sandeep Swamy
The Ever-Evolving World of
Science
Welcome to Grade 7 Science4not just a textbook with facts, but an invitation to
question, experiment, and explore the beautiful world we live in. From tiny cells
inside a leaf to the movement of celestial bodies, from household materials to
underground water flows, this journey will challenge your thinking and expand
your knowledge.
Notice something special about this book? The page numbers follow the playful
flight of a butterfly and a soaring paper plane! Just as these objects take flight,
learning soars when curiosity leads the way. Simple observations, like paper
planes, have inspired scientific explorations throughout history.
GDGLSPGCOER - Git and GitHub Workshop.pptxazeenhodekar
This presentation covers the fundamentals of Git and version control in a practical, beginner-friendly way. Learn key commands, the Git data model, commit workflows, and how to collaborate effectively using Git — all explained with visuals, examples, and relatable humor.
Dr. Santosh Kumar Tunga discussed an overview of the availability and the use of Open Educational Resources (OER) and its related various issues for various stakeholders in higher educational Institutions. Dr. Tunga described the concept of open access initiatives, open learning resources, creative commons licensing attribution, and copyright. Dr. Tunga also explained the various types of OER, INFLIBNET & NMEICT initiatives in India and the role of academic librarians regarding the use of OER.
In this ppt I have tried to give basic idea about Diabetic peripheral and autonomic neuropathy ..from Levine textbook,IWGDF guideline etc
Hope it will b helpful for trainee and physician
Unit 5: Dividend Decisions and its theoriesbharath321164
decisions: meaning, factors influencing dividends, forms of dividends, dividend theories: relevance theory (Walter model, Gordon model), irrelevance theory (MM Hypothesis)
Ultimate VMware 2V0-11.25 Exam Dumps for Exam SuccessMark Soia
Boost your chances of passing the 2V0-11.25 exam with CertsExpert reliable exam dumps. Prepare effectively and ace the VMware certification on your first try
Quality dumps. Trusted results. — Visit CertsExpert Now: https://www.certsexpert.com/2V0-11.25-pdf-questions.html
Envenomation is the process by which venom is injected by the bite or sting of a venomous animal such as a snake, scorpion, spider, or insect. Arthropod bite is nothing but a sharp bite or sting by ants, fruit flies, bees, beetles, moths, or hornets. Though not a serious condition, arthropod bite can be extremely painful, with redness and mild to severe swelling around the site of the bite
Geography Sem II Unit 1C Correlation of Geography with other school subjectsProfDrShaikhImran
The correlation of school subjects refers to the interconnectedness and mutual reinforcement between different academic disciplines. This concept highlights how knowledge and skills in one subject can support, enhance, or overlap with learning in another. Recognizing these correlations helps in creating a more holistic and meaningful educational experience.
Geography Sem II Unit 1C Correlation of Geography with other school subjectsProfDrShaikhImran
Web Mining
SEMINAR
ON
WEB MINING
Abstract: With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available online, i.e. Web content mining, and the discovery of user access patterns from Web servers, i.e. Web usage mining. In this paper we present a detailed statistical formulation and experimental results to show how web mining can be utilized to identify potential customers. Web usability is an important and sometimes controversial research area. We propose an integrated system for web mining and usability study where four core modules are designed to address the fundamental issues in usability analysis. As an example of cross-module analysis, we apply association rule mining to the link structure obtained from the web mining module to automatically discover menus and structures in a web site.
Keywords: web mining, potential customer.
1. INTRODUCTION
With the explosive growth of information sources
available on the World Wide Web, it has become
increasingly necessary for users to utilize auto-
mated tools in order to find, extract, filter, and
evaluate the desired information and resources. In
addition, with the transformation of the web into
the primary tool for electronic commerce, it is
imperative for organizations and companies, who
have invested millions in Internet and Intranet
technologies, to track and analyze user access
patterns. These factors give rise to the necessity of
creating server-side and client-side intelligent
systems that can
effectively mine for knowledge both across the
Internet and in particular web localities.
At present most users rely on search engines, such as www.google.com, to find the information they require. However, the target of a Web search engine is only to discover resources on the Web. Each search engine has its own characteristics and employs different algorithms to index, rank, and present web documents. But because all these search engines are built on exact keyword matching, and their query languages are artificial, with restricted syntax and vocabulary rather than natural language, there are defects that no search engine can overcome.
Narrow search scope: Web pages indexed by any search engine are only a tiny part of all the pages on the WWW, and the pages returned when a user submits a query are another tiny part of the pages the search engine has indexed.
Low precision: The user cannot browse all the pages one by one, and most pages are irrelevant to the user's intent; they are highlighted and returned by the search engine merely because they contain the keywords.
Web mining techniques could be used to solve the information overload problem directly or indirectly. However, Web mining techniques are not the only tools. Techniques and work from other research areas, such as Databases (DB), Information Retrieval (IR), Natural Language Processing (NLP), and the Web document community, could also be used.
Information retrieval
Information retrieval is the art and science of
searching for information in documents, searching
for documents themselves, searching for metadata
which describes documents, or searching within
databases, whether relational standalone databases
or hypertext networked databases such as the
Internet or intranets, for text, sound, images or
data.
Natural language processing
Natural language processing (NLP) is concerned
with the interactions between computers and
human (natural) languages. NLP is a form of
human-to-computer interaction where the
elements of human language, be it spoken or
written, are formalized so that a computer can
perform value-adding tasks based on that
interaction.
Natural language understanding is sometimes
referred to as an AI-complete problem, because
natural-language recognition seems to require
extensive knowledge about the outside world and
the ability to manipulate it.
The purpose of Web mining is to develop methods
and systems for discovering models
of objects and processes on the World Wide Web
and for web-based systems that show adaptive
performance. Web Mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and the World Wide Web, and the more recent Semantic Web.
The World Wide Web has made an enormous
amount of information electronically accessible.
The use of email, news and markup languages like
HTML allow users to publish and read documents
at a world-wide scale and to communicate via chat
connections, including information in the form of
images and voice records. The HTTP protocol that
enables access to documents over the network via
Web browsers created an immense improvement
in communication and access to information. For
some years these possibilities were used mostly in
the scientific world but recent years have seen an
immense growth in popularity, supported by the
wide availability of computers and broadband
communication. The use of the internet for tasks other than finding information and direct communication is increasing, as can be seen from the interest in "e-activities" such as e-commerce, e-learning, e-government, and e-science.
Independently of the development of the Internet,
Data Mining expanded out of the academic world
into industry. Methods and their potential became
known outside the academic world and
commercial toolkits became available that allowed
applications at an industrial scale. Numerous
industrial applications have shown that models
can be constructed from data for a wide variety of
industrial problems. The World-Wide Web is an
interesting area for Data Mining because huge
amounts of information are available. Data
Mining methods can be used to analyze the
behavior of individual users, access patterns of
pages or sites, properties of collections of
documents.
Almost all standard data mining methods are
designed for data that are organized as multiple
“cases” that are comparable and can be viewed as
instances of a single pattern, for example patients
described by a fixed set of symptoms and
diseases, applicants for loans, customers of a shop.
A “case” is typically described by a fixed set of
features (or variables). Data on the Web have a
different nature. They are not so easily
comparable and have the form of free text, semi-
structured text (lists, tables) often with images and
hyperlinks, or server logs. The aim to learn
models of documents has given rise to the interest
in Text Mining methods for modeling documents
in terms of properties of documents. Learning
from the hyperlink structure has given rise to
graph-based methods, and server logs are used to
learn about user behavior.
Instead of searching for a document that matches
keywords, it should be possible to combine
information to answer questions. Instead of
retrieving a plan for a trip to Hawaii, it should be
possible to automatically construct a travel plan
that satisfies certain goals and uses opportunities
that arise dynamically. This gives rise to a wide
range of challenges. Some of them concern the
infrastructure, including the interoperability of
systems and the languages for the exchange of
information rather than data. Many challenges are
in the area of knowledge representation, discovery
and engineering. They include the extraction of
knowledge from data and its representation in a
form understandable by arbitrary parties, the
intelligent questioning and the delivery of answers
to problems as opposed to conventional queries
and the exploitation of formerly extracted
knowledge in this process.
2. WEB MINING
Web mining is the integration of information
gathered by traditional data mining methodologies
and techniques with information gathered over the
World Wide Web.
Data mining is also called knowledge discovery and data mining (KDD). It is the extraction of useful patterns from data sources, e.g. databases, texts, the web, images, etc. Patterns must be valid, novel, potentially useful, and understandable. The classic data mining tasks are:
Classification: mining patterns that can classify future (new) data into known classes.
Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items. E.g., {Cheese, Milk} → Bread [sup = 5%, conf = 80%].
Clustering: identifying a set of similarity groups in the data.
Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
Fig. 1 The Data Mining (KDD) Process
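As a toy illustration of the association-rule task above, the support and confidence of a rule such as {Cheese, Milk} → Bread can be computed directly over a small transaction list (the baskets below are invented for the example):

```python
# Toy illustration: support and confidence of a rule X -> Y
# over a list of transactions (market baskets).

def rule_stats(transactions, x, y):
    """Return (support, confidence) of the rule X -> Y."""
    x, y = set(x), set(y)
    n = len(transactions)
    both = sum(1 for t in transactions if x | y <= set(t))
    antecedent = sum(1 for t in transactions if x <= set(t))
    support = both / n
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

baskets = [
    ["Cheese", "Milk", "Bread"],
    ["Cheese", "Milk", "Bread", "Eggs"],
    ["Milk", "Bread"],
    ["Cheese", "Milk"],
]
sup, conf = rule_stats(baskets, ["Cheese", "Milk"], ["Bread"])
print(f"sup={sup:.0%}, conf={conf:.0%}")  # sup=50%, conf=67%
```

A rule is usually kept only when both figures exceed user-chosen minimum support and confidence thresholds, which is exactly what the bracketed [sup, conf] values in the example rule express.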
Just as data mining aims at discovering valuable
information that is hidden in conventional
databases, the emerging field of web mining aims
at finding and extracting relevant information that
is hidden in Web-related data, in particular hyper-
text documents published on the Web. Web
Mining is the extraction of interesting and
potentially useful patterns and implicit
information from artifacts or activity related to the
World Wide Web. There are roughly three
knowledge discovery domains that pertain to web
mining: Web Content Mining, Web Structure
Mining, and Web Usage Mining. Web content
mining is the process of extracting knowledge
from the content of documents or their
descriptions. Web document text mining, resource
discovery based on concepts indexing or agent
based technology may also fall in this category.
Web structure mining is the process of inferring
knowledge from the World Wide Web
organization and links between references and
referents in the Web. Finally, web usage mining,
also known as Web Log Mining, is the process of
extracting interesting patterns in web access logs.
The Web is a collection of inter-related files on one or more Web servers. Web mining is a multi-disciplinary effort that draws techniques from
fields like information retrieval, statistics,
machine learning, natural language processing,
and others. Web mining has a different character from traditional data mining. First, the objects of Web mining are a large number of Web documents which are heterogeneously distributed, and each data source is itself heterogeneous; second, the Web document itself is semi-structured or unstructured and lacks semantics that a machine can understand.
3. HISTORY
The term "Web Mining" was first used in [E1996], where it was defined in a 'task-oriented' manner; an alternate 'data-oriented' definition was given in [CMS1997]. The first panel discussion on the topic was held at ICTAI 1997 [SM1997], and it remains a continuing forum:
WebKDD workshops with ACM SIGKDD, 1999, 2000, 2001, 2002, …; 60–90 attendees
SIAM Web analytics workshop 2001,
2002, …
Special issues of DMKD journal,
SIGKDD Explorations
Papers in various data mining conferences
& journals
Surveys [MBNL 1999, BL 1999,
KB2000]
This area of research is so huge today due to the
tremendous growth of information sources
available on the Web and the recent interest in e-
commerce. Web mining is used to understand
customer behavior, evaluate the effectiveness of a
particular Web site, and help quantify the success
of a marketing campaign.
3.1. Web Mining Subtasks
Web mining can be decomposed into the following subtasks:
1. Resource finding: the task of retrieving
intended Web documents. By resource
finding we mean the process of retrieving
the data that is either online or offline from
the text sources available on the web such
as electronic newsletters, electronic
newswire, the text contents of HTML
documents obtained by removing HTML
tags, and also the manual selection of Web
resources.
2. Information selection and pre-
processing: automatically selecting and
pre-processing specific information from
retrieved Web resources. It is a kind of
transformation processes of the original
data retrieved in the IR process. These
transformations could be either a kind of
pre-processing that are mentioned above
such as stop words, stemming, etc. or a
pre-processing aimed at obtaining the
desired representation such as finding
phrases in the training corpus,
transforming the representation to
relational or first order logic form, etc.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites. Machine
learning or data mining techniques are
typically used in the process of
generalization. Humans play an important
role in the information or knowledge
discovery process on the Web since the
Web is an interactive medium.
4. Analysis: validating and/or interpretation
of the mined patterns.
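The stop-word removal and stemming mentioned in subtask 2 can be sketched in a few lines. The stop-word list and suffix rules below are deliberately minimal illustrations, not a real stemmer such as Porter's:

```python
# Minimal sketch of text pre-processing: stop-word removal plus a
# crude suffix-stripping "stemmer" (word lists are assumed/toy).

STOP_WORDS = {"the", "of", "and", "to", "a", "is", "in"}

def crude_stem(word):
    # Strip a few common English suffixes, keeping a minimum stem length.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.lower() for w in text.split()]
    return [crude_stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("Mining the usage patterns of Web documents"))
# ['min', 'usage', 'pattern', 'web', 'document']
```

In a real system this step would use a proper tokenizer and an established stemmer, but the shape of the transformation, raw text in, normalized term list out, is the same.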
4. CHALLENGES OF WEB
MINING
1. Today World Wide Web is flooded with
billions of static and dynamic web pages
created with programming languages such
as HTML, PHP and ASP. It is a significant challenge to find useful and relevant information on the web.
2. Creating knowledge from available
information.
3. As the coverage of information is very
wide and diverse, personalization of the
information is a tedious process.
4. Learning customer and individual user
patterns.
5. Complexity of Web pages far exceeds the
complexity of any conventional text
document. Web pages on the internet lack
uniformity and standardization.
6. Much of the information present on web is
redundant, as the same piece of
information or its variant appears in many
pages.
7. The web is noisy i.e. a page typically
contains a mixture of many kinds of
information like, main content,
advertisements, copyright notice,
navigation panels.
8. The web is dynamic, information keeps on
changing constantly. Keeping up with the
changes and monitoring them are very
important.
9. The Web is not only about disseminating information but also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.
10. The most important challenge faced is
Invasion of Privacy. Privacy is considered
lost when information concerning an
individual is obtained, used, or
disseminated, when it occurs without their
knowledge or consent.
Techniques to Address the Problem
4.1 Preprocessing Technique: Web Robots
When attempting to detect web robots from a
stream it is desirable to monitor both the Web
server log and activity on the client-side. What we
are looking for is to distinguish single Web
sessions from each other. A Web session is a
series of requests to web pages, i.e. visits to web
pages. Since the navigation patterns of web robots differ from those of human users, the contribution from web robots has to be eliminated before proceeding with any further data mining, i.e. when we are looking into the web usage behaviour of real users.
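A minimal sessionization sketch is shown below, assuming time-sorted (ip, timestamp, url) log entries and a 30-minute inactivity timeout; both the log format and the timeout are illustrative assumptions:

```python
# Hypothetical sketch: split server-log hits into Web sessions.
# Requests from the same client separated by more than 30 minutes
# of inactivity start a new session.

from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(hits):
    """hits: list of (client_ip, timestamp, url), assumed time-sorted."""
    sessions = {}  # ip -> list of sessions, each a list of (time, url)
    for ip, ts, url in hits:
        user = sessions.setdefault(ip, [])
        if user and ts - user[-1][-1][0] <= TIMEOUT:
            user[-1].append((ts, url))       # continue current session
        else:
            user.append([(ts, url)])         # start a new session
    return sessions

t0 = datetime(2024, 1, 1, 10, 0)
hits = [
    ("1.2.3.4", t0, "/index.html"),
    ("1.2.3.4", t0 + timedelta(minutes=5), "/page2.html"),
    ("1.2.3.4", t0 + timedelta(hours=2), "/index.html"),  # new session
]
print(len(sessionize(hits)["1.2.3.4"]))  # 2
```

Once hits are grouped into sessions like this, robot sessions can be filtered out and the remaining sessions fed to the pattern-mining stage.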
One problem with identifying web robots is
that they might hide their identity behind a facade
looking a lot like conventional web browsers.
Standard approaches to robot detection will fail to
detect camouflaged web robots. As web robots are used for tasks like website indexing (e.g. by Google) or the detection of broken links, they have to exist. There is a special file on each domain called "robots.txt" which, according to the Robot Exclusion Standard [M. Koster, 1994], is examined by a robot in order to prevent it from visiting certain pages of no interest. Malicious web robots, however, are not guaranteed to follow the advice in robots.txt.
The classes chosen for evaluation are Temporal
Features, Page Features, Communication Features
and Path Features. It is desirable to be able to detect the presence of a web robot after as few requests as possible; this is of course a trade-off between computational effort and result accuracy. A simple decision model for determining the class of a visitor is as follows: first, check whether the visitor requested robots.txt, in which case it is labeled as a robot; second, match the visitor against a list of previously known robots; third, search for the empty referer "-", since robots seldom assign any value to the referer field, this is a rewarding place to look. If a robot is found, the list of known robots is updated with the new one.
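The three-step decision model just described can be sketched as follows; the session fields and the seed list of known robots are assumptions made for the example:

```python
# Sketch of the three-step robot-detection decision model.
# Session fields and the seed list of known robots are illustrative.

KNOWN_ROBOTS = {"Googlebot", "Bingbot"}  # hypothetical seed list

def classify_visitor(session, known_robots=KNOWN_ROBOTS):
    """session: dict with 'agent', 'requests' (paths) and 'referers'."""
    # Step 1: a request for robots.txt marks the visitor as a robot.
    if "/robots.txt" in session["requests"]:
        known_robots.add(session["agent"])  # remember the new robot
        return "robot"
    # Step 2: match against the list of previously known robots.
    if session["agent"] in known_robots:
        return "robot"
    # Step 3: an all-empty ("-") referer field is a strong robot signal.
    if session["referers"] and all(r == "-" for r in session["referers"]):
        known_robots.add(session["agent"])
        return "robot"
    return "human"

visitor = {"agent": "Mozilla/5.0", "requests": ["/robots.txt", "/"],
           "referers": ["-", "-"]}
print(classify_visitor(visitor))  # robot
```

Note that each rule alone is weak (a curious human may fetch robots.txt too); in practice these checks only seed the training labels for the feature-based classifiers mentioned above.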
4.1.1 Avoiding Mislabeled Sessions
To avoid mislabeling of sessions, an ensemble filtering
approach [C. Brodley et al., 1999] is used, where
the idea is to instead of just one model for
classification, build several models which are used
to find classification errors via finding single
mislabeled sessions.
The set of models acquired is used to classify all sessions respectively. For each session, the number of false negative and false positive classifications is counted. A large number of false positive classifications implies that the session is currently labeled as a non-robot despite being predicted to be a robot by most of the models. A large number of false negative classifications implies that the session may in fact be a non-robot even though it carries the robot label.
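The ensemble-filtering idea can be sketched like this: several models classify every session, and a session whose current label disagrees with the majority of the predictions is flagged as possibly mislabeled. The feature name and the toy thresholds below are invented for illustration:

```python
# Hedged sketch of ensemble filtering for mislabeled sessions.

def flag_mislabeled(sessions, models, threshold=0.5):
    """sessions: list of (features, is_robot_label); models: predict fns.
    Flag a session when more than `threshold` of the models disagree
    with its current label."""
    flagged = []
    for features, label in sessions:
        votes = [model(features) for model in models]
        disagreements = sum(1 for v in votes if v != label)
        if disagreements / len(models) > threshold:
            flagged.append((features, label))
    return flagged

# Three toy "models" that call a session a robot above a request rate.
models = [lambda f, t=t: f["req_per_min"] > t for t in (30, 50, 80)]
sessions = [({"req_per_min": 100}, False),  # labeled human, looks robotic
            ({"req_per_min": 2}, False)]    # labeled human, looks human
print(flag_mislabeled(sessions, models))  # flags only the first session
```

The flagged sessions would then be re-examined or relabeled before the final usage-mining pass, which is the point of the filtering step.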
4.2 Mining Issues
4.2.1 Indirect Association
Common association methods often employ patterns that connect objects to each other. Sometimes, on the other
hand, it might be valuable to consider indirect
association between objects. Indirect association is
used to e.g. represent the behaviour of distinct
user groups.
4.2.2 Clustering
With the growth of the World
Wide Web it can be very time consuming to
analyze every web page on its own. Therefore it is
a good idea to cluster web pages based on
attributes that can be considered similar to find
successful and less successful attributes and
patterns.
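The clustering idea above can be sketched with a greedy single pass over pages using Jaccard similarity of their word sets; the similarity threshold and the pages themselves are invented for the example:

```python
# Toy sketch: group web pages by Jaccard similarity of their word sets,
# merging each page into the first sufficiently similar cluster.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.4):
    """pages: dict url -> set of words. Greedy single-pass clustering."""
    clusters = []  # each entry: (representative word set, [urls])
    for url, words in pages.items():
        for rep, urls in clusters:
            if jaccard(words, rep) >= threshold:
                urls.append(url)
                break
        else:
            clusters.append((set(words), [url]))
    return [urls for _, urls in clusters]

pages = {
    "/a": {"web", "mining", "usage"},
    "/b": {"web", "mining", "content"},
    "/c": {"night", "vision"},
}
print(cluster_pages(pages))  # [['/a', '/b'], ['/c']]
```

Real systems would use richer feature vectors and a proper clustering algorithm (e.g. k-means), but the goal is the same: avoid analyzing every page on its own by grouping similar ones.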
5. TAXONOMY OF WEB
MINING
In general, Web mining tasks can be classified
into three categories:
1. Web content mining,
2. Web structure mining and
3. Web usage mining.
However, there are two other different approaches
to categorize Web mining. In both, the categories
are reduced from three to two: Web content
mining and Web usage mining. In one, Web
structure is treated as part of Web Content while
in the other Web usage is treated as part of Web
Structure. All of the three categories focus on the
process of knowledge discovery of implicit,
previously unknown and potentially useful
information from the Web. Each of them focuses
on different mining objects of the Web.
Fig. 2 Taxonomy of Web mining
5.1. Web content mining
Web content mining is an automatic process that
goes beyond keyword extraction. Since the
content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content in a
representation that could be exploited by
machines. The usual approach to exploit known
structure in documents is to use wrappers to map
documents to some data model. Techniques using
lexicons for content interpretation are yet to come.
There are two groups of web content mining
strategies: Those that directly mine the content of
documents and those that improve on the content
search of other tools like search engines.
Web Content Mining deals with discovering
useful information or knowledge from web page
contents. Web content mining analyzes the
content of Web resources. Content data is the
collection of facts that are contained in a web
page. It consists of unstructured data such as free
texts, images, audio, video, semi-structured data
such as HTML documents, and a more structured
data such as data in tables or database generated
HTML pages. The primary Web resources that are
mined in Web content mining are individual
pages. They can be used to group, categorize,
analyze, and retrieve documents. Web content mining can be examined from two points of view:
5.1.1. Agent-Based Approach
This approach aims to assist or improve information finding and to filter the information delivered to users. It can be placed into the following three categories:
a. Intelligent Search Agents: These agents
search for relevant information using
domain characteristics and user profiles to
organize and interpret the discovered
information.
b. Information Filtering/ Categorization:
These agents use information retrieval
techniques and characteristics of open
hypertext Web documents to automatically
retrieve, filter, and categorize them.
c. Personalized Web Agents: These agents
learn user preferences and discover Web
information based on these preferences,
and preferences of other users with similar
interest.
1. Intelligent Search Agents:
Several intelligent Web agents have been
developed that search for relevant information
using domain characteristics and user profiles
to organize and interpret the discovered
information. Agents such as Harvest, FAQ Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain
information about particular types of
documents, or on hard coded models of the
information sources to retrieve and interpret
documents. Agents such as ShopBot and ILA
(Internet Learning Agent) interact with and
learn the structure of unfamiliar information
sources. ShopBot retrieves product
information from a variety of vendor sites
using only general information about the
product domain. ILA learns models of various
information sources and translates these into
its own concept hierarchy.
2. Information Filtering/Categorization:
A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to
organize a collection of Web documents based on
conceptual information.
3. Personalized Web Agents:
This category of Web agents learns user preferences and discovers Web information sources based on these preferences, and those of other individuals with similar interests (using collaborative filtering). A few recent examples of such agents include WebWatcher, PAINT, and Syskill & Webert. For example, Syskill & Webert utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier.
5.1.2. Database Approach
The database approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it. The two main categories are:
Multilevel databases: The main idea behind this
approach is that the lowest level of the database
contains semi-structured information stored in
various Web sources, such as hypertext
documents. At the higher level(s) meta data or
generalizations are extracted from lower levels
and organized in structured collections, i.e.
relational or object-oriented databases.
Web query systems: Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries that are used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques. WebLog, a logic-based query language for restructuring, extracts information from Web information sources. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.
5.2. WEB STRUCTURE MINING
The World Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. This can be compared to bibliographical citations: when a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of the information conveyed by links to find pertinent web pages. By means of counters, higher levels accumulate the number of artifacts subsumed by the concepts they hold; counters of hyperlinks into and out of documents retrace the structure of the web artifacts summarized.
Web structure mining is the process of
discovering structure information from the web.
The structure of a typical web graph consists of
web pages as nodes, and hyperlinks as edges
connecting related pages. This can be further
divided into two kinds based on the kind of
structure information used.
Fig. 3 Web graph structure
Hyperlinks
A hyperlink is a structural unit that connects a
location in a web page to a different location,
either within the same web page or on a different
web page. A hyperlink that connects to a different
part of the same page is called an Intra-document
hyperlink, and a hyperlink that connects two
different pages is called an inter-document
hyperlink.
Document Structure
In addition, the content within a Web page can
also be organized in a tree structured format,
based on the various HTML and XML tags within
the page. Mining efforts here have focused on
automatically extracting document object model
(DOM) structures out of documents.
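A small illustration of extracting such a DOM-like tree view of a page, using only the standard library's HTMLParser. The class name, the nesting-path representation, and the HTML snippet are assumptions made for this sketch:

```python
# Sketch of mining a tree-structured (DOM-like) view of a page: record the
# nesting path of every start tag encountered while parsing the HTML.
from html.parser import HTMLParser

class DomSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.paths = []      # nesting path of each start tag seen

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = DomSketch()
parser.feed("<html><body><table><tr><td>cell</td></tr></table></body></html>")
print(parser.paths)
# ['html', 'html/body', 'html/body/table', 'html/body/table/tr',
#  'html/body/table/tr/td']
```

Real extraction systems build a full node tree rather than paths, but the same tag-nesting information drives both.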
Web structure mining focuses on the hyperlink
structure within the Web itself. The different
objects are linked in some way. Simply applying
the traditional processes and assuming that the
events are independent can lead to wrong
conclusions. However, the appropriate handling of
the links could lead to potential correlations, and
then improve the predictive accuracy of the
learned models.
Two algorithms that have been proposed to deal with those potential correlations are:
1. HITS and
2. PageRank.
5.2.1. PageRank
PageRank is a metric for ranking hypertext documents that determines the quality of these documents. The key idea is that a page has high rank if it is pointed to by many highly ranked pages, so the rank of a page depends upon the ranks of the pages pointing to it. This process is applied iteratively until the rank of every page is determined.
The rank of a page p can thus be written as:

PR(p) = d/n + (1 - d) * Σ_{(q,p) ∈ E} PR(q) / OutDegree(q)

Here, n is the number of nodes in the graph, E is the set of hyperlink edges (so the sum runs over all pages q that link to p), OutDegree(q) is the number of hyperlinks on page q, and the damping factor d is the probability that at each page the random surfer will get bored and request another random page.
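The iterative computation can be sketched directly from that formula. The toy graph, the iteration count, and the d = 0.15 teleport probability are illustrative assumptions:

```python
# Minimal sketch of iterative PageRank following
# PR(p) = d/n + (1-d) * sum over q->p of PR(q)/OutDegree(q),
# where d is the probability of jumping to a random page.

def pagerank(links, d=0.15, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}          # uniform starting ranks
    for _ in range(iters):
        new = {}
        for p in links:
            # Sum rank contributions from every page q that links to p.
            incoming = sum(pr[q] / len(links[q])
                           for q in links if p in links[q])
            new[p] = d / n + (1 - d) * incoming
        pr = new
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints "c": linked to by both other pages
```

Note the ranks remain a probability distribution (they sum to 1) because every page here has at least one outgoing link; dangling pages need extra handling in a full implementation.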
5.2.2. HITS
Hyperlink-induced topic search (HITS) is an
iterative algorithm for mining the Web graph to
identify topic hubs and authorities. Authorities are
the pages with good sources of content that are
referred by many other pages or highly ranked
pages for a given topic; hubs are pages with good
sources of links. The algorithm takes as input search results returned by traditional text indexing techniques and filters these results to identify
hubs and authorities. The number and weight of
hubs pointing to a page determine the page's
authority. The algorithm assigns weight to a hub
based on the authoritativeness of the pages it
points to. If many good hubs point to a page p,
then authority of that page p increases. Similarly if
a page p points to many good authorities, then hub
of page p increases.
After the computation, HITS outputs the pages
with the largest hub weight and the pages with the
largest authority weights, which is the search
result of a given topic.
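The mutual hub/authority update described above can be sketched as follows; the toy graph is an assumed example with two obvious hubs and two obvious authorities:

```python
# Sketch of the HITS iteration: a page's authority is the sum of the hub
# scores of pages pointing to it; its hub score is the sum of the authority
# scores of pages it points to; both vectors are normalized each round.
import math

def hits(links, iters=50):
    hubs = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}
    for _ in range(iters):
        # Authority of p grows with the hub weight of pages linking to p.
        auth = {p: sum(hubs[q] for q in links if p in links[q]) for p in links}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {p: a / norm for p, a in auth.items()}
        # Hub weight of p grows with the authority of the pages p points to.
        hubs = {p: sum(auth[t] for t in links[p]) for p in links}
        norm = math.sqrt(sum(h * h for h in hubs.values()))
        hubs = {p: h / norm for p, h in hubs.items()}
    return hubs, auth

links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "a1": [], "a2": []}
hubs, auth = hits(links)
print(max(auth, key=auth.get) in ("a1", "a2"))  # prints True
```

As the text says, the final output for a topic would be the pages with the largest hub and authority weights.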
5.3. WEB USAGE MINING
Web usage mining is a process of extracting useful information from server logs, i.e., users' history. Web usage mining is the process of finding out what users are looking for on the Internet.
Web usage mining focuses on techniques that
could predict the behavior of users while they are
interacting with the WWW. It collects the data
from Web log records to discover user access
patterns of Web pages. Usage data captures the
identity or origin of web users along with their
browsing behavior at a web site.
Web servers record and accumulate data about
user interactions whenever requests for resources
are received. Analyzing the web access logs of
different web sites can help understand the user
behavior and the web structure, thereby improving
the design of this colossal collection of resources.
There are two main tendencies in Web Usage
Mining driven by the applications of the
discoveries: General Access Pattern Tracking and
Customized Usage Tracking. The general access
pattern tracking analyzes the web logs to
understand access patterns and trends. These
analyses can shed light on better structure and
grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. We have designed a web
log data mining tool, Web Log Miner, and
proposed techniques for using data mining and
OnLine Analytical Processing (OLAP) on treated
and transformed web access files. Applying data
mining techniques on access logs unveils
interesting access patterns that can be used to
restructure sites in a more efficient grouping,
pinpoint effective advertising locations, and target
specific users for specific selling ads.
Customized usage tracking analyzes individual
trends. Its purpose is to customize web sites to
users. The information displayed, the depth of the
site structure and the format of the resources can
all be dynamically customized for each user over
time based on their access patterns.
While it is encouraging and exciting to see the
various potential applications of web log file
analysis, it is important to know that the success
of such applications depends on what and how
much valid and reliable knowledge one can
discover from the large raw log data. Current web
servers store limited information about the
accesses. Some scripts custom-tailored for some
sites may store additional information. However,
for an effective web usage mining, an important
cleaning and data transformation
step before analysis may be needed.
In the use and mining of Web data, the most direct source of data is the Web log files on the Web server. Web log files record the visitor's browsing behavior very clearly. Web log files include the server log, agent log and client log (IP address, URL, page reference, access time, cookies, etc.).
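Pulling the fields just listed out of a log line can be sketched with a regular expression over a Common Log Format entry with a referer field appended. The pattern and the sample line are illustrative assumptions, not a universal log grammar:

```python
# Sketch of extracting IP address, URL, access time, status and referer from
# a server log line (Common Log Format plus a quoted referer field).
import re

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+ "(?P<referer>[^"]*)"'
)

line = ('192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /index.html HTTP/1.0" 200 2326 "-"')
m = LOG_RE.match(line)
print(m.group("ip"), m.group("url"), m.group("referer"))
# 192.0.2.1 /index.html -
```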
There are several available research projects and
commercial products that analyze those patterns
for different purposes. The applications generated
from this analysis can be classified as
personalization, system improvement, site
modification, business intelligence and usage
characterization.
The Web Mining Architecture
Fig. 4 Web Usage Mining Process
Web usage mining can be decomposed into the following three main sub-tasks:
Fig. 5 Web usage mining process
5.3.1. Pre-processing
It is necessary to perform data preparation to convert the raw data for further processing. The data actually collected are generally incomplete, redundant and ambiguous. In order to mine the knowledge more effectively, pre-processing the collected data is essential. Preprocessing can provide accurate, concise data
for data mining. Data preprocessing includes data cleaning, user identification, user session identification, access path supplement and transaction identification.
The main task of data cleaning is to remove redundant Web log data that is not associated with the useful data, narrowing the scope of data objects. Identifying the individual user must be done after data cleaning. The purpose of user identification is to identify each user uniquely. It can be accomplished by means of cookie technology, user registration techniques and investigative rules.
User session identification should be done on the basis of the user identification. The purpose is to divide each user's access information into several separate session processes. The simplest way is to use a time-out estimation approach: when the time interval between page requests exceeds a given value, the user is assumed to have started a new session.
Because of the widespread use of page caching technology and proxy servers, the access path recorded by the Web server access logs may not be the complete access path of users. An incomplete access log does not accurately reflect the user's access patterns, so it is necessary to supplement the access path. Path supplement can be achieved by using the Web site topology to analyze the pages.
Transaction identification is based on the user's session recognition; its purpose is to divide or combine sessions into transactions according to the demands of the data mining tasks, in order to make them appropriate for data mining analysis.
5.3.2. Pattern discovery
Pattern discovery mines effective, novel,
potentially useful and ultimately understandable
information and knowledge using mining
algorithm. Its methods include statistical analysis,
classification analysis, association rule discovery,
sequential pattern discovery, clustering analysis,
and dependency modeling.
Statistical Analysis: Statistical analysts may perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) based on different variables such as page views, viewing time and length of a navigational path when analyzing the session file. By analyzing the statistical information contained in the periodic web system report, the extracted report can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
Association Rules: In the web domain, the
pages, which are most often referenced
together, can be put in one single server
session by applying the association rule
generation. Association rule mining
techniques can be used to discover
unordered correlation between items found
in a database of transactions.
Clustering analysis: Clustering analysis is
a technique to group together users or data
items (pages) with the similar
characteristics. Clustering of user
information or pages can facilitate the
development and execution of future
marketing strategies.
Classification analysis: Classification is
the technique to map a data item into one
of several predefined classes. The
classification can be done by using
supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc.
Sequential Pattern: This technique intends to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Sequential patterns
also include some other types of temporal
analysis such as trend analysis, change
point detection, or similarity analysis.
Dependency Modeling: The goal of this
technique is to establish a model that is
able to represent significant dependencies
among the various variables in the web
domain. The modeling technique provides
a theoretical framework for analyzing the
behavior of users, and is potentially useful
for predicting future web resource
consumption.
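The association rule discovery named above can be sketched for the pairwise case: count how often two pages co-occur in a session and keep rules whose support and confidence clear thresholds. The thresholds and sessions are assumed values; real systems use Apriori-style mining for larger itemsets:

```python
# Toy sketch of association rule discovery over sessions: mine rules
# X -> Y between pages that are often referenced together, reporting
# those above assumed support and confidence thresholds.
from itertools import combinations

def pair_rules(sessions, min_support=0.4, min_conf=0.6):
    n = len(sessions)
    counts, singles = {}, {}
    for s in sessions:
        items = set(s)
        for i in items:
            singles[i] = singles.get(i, 0) + 1
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    rules = []
    for (a, b), c in counts.items():
        if c / n >= min_support:                 # pair is frequent enough
            for x, y in ((a, b), (b, a)):
                conf = c / singles[x]            # P(y in session | x in session)
                if conf >= min_conf:
                    rules.append((x, y, round(conf, 2)))
    return rules

sessions = [["/a", "/b"], ["/a", "/b", "/c"], ["/a"], ["/b", "/a"]]
print(pair_rules(sessions))
# [('/a', '/b', 0.75), ('/b', '/a', 1.0)]
```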
5.3.3. Pattern Analysis
Pattern Analysis is a final stage of the whole web
usage mining. The goal of this process is to
eliminate the irrelevant rules or patterns and to
understand, visualize and to extract the interesting
rules or patterns from the output of the pattern
discovery process. The output of web mining
algorithms is often not in the form suitable for
direct human consumption, and thus need to be
transform to a format can be assimilate easily.
There are two most common approaches for the
patter analysis. One is to use the knowledge query
mechanism such as SQL, while another is to
construct multi-dimensional data cube before
perform OLAP operation.
6. APPLICATIONS OF WEB
MINING
Web mining techniques can be applied to understand and analyze such data and turn it into actionable information that can support a web-enabled electronic business in improving its marketing, sales and customer support operations.
Based on the patterns found and the original cache
and log data, many applications can be developed.
Some of them are:
In order to achieve personalized service, a business first has to obtain and collect information on clients to grasp customers' spending habits, hobbies, consumer psychology, etc.; it can then provide targeted, personalized service. Obtaining consumer spending behavior patterns is very difficult with the traditional marketing approach, but it can be done using Web mining techniques.
Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed: “In a traditional (brick-and-mortar) store, the main effort is in getting a customer to the store. Once a customer is in the store they are likely to make a purchase, since the cost of going to another store is high, and thus the marketing budget (focused on getting the customer to the store) is in general much higher than the in-store customer experience budget (which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store.” This fundamental observation has been the driving force behind Amazon's comprehensive approach to personalized customer experience, based on the mantra “a personalized store for every customer.” A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a store visit. Knowledge gained from Web mining is the key intelligence behind Amazon's features such as instant recommendations, purchase circles, wish-lists, etc.
6.1. Improve the website design
The attractiveness of a site depends on the reasonable design of its content and organizational structure. Web mining can provide details of user behavior, giving web site designers a basis for decision making to improve the design of the site.
6.2. System Improvement
Performance and other service quality attributes are crucial to user satisfaction with services such as databases, networks, etc. Similar qualities are expected by the users of Web services. Web usage mining provides the key to understanding Web traffic behavior, which can in turn be used to develop policies for Web caching, network transmission, load balancing, or data distribution.
Security is an acutely growing concern for Web-
based services, especially as electronic commerce
continues to grow at an exponential rate. Web
usage mining can also provide patterns which are
useful for detecting intrusion, fraud, attempted
break-ins, etc.
6.3. Predicting trends
Web mining can predict trends within the retrieved information to indicate future values. For
example, an electronic auction company provides
information about items to auction, previous
auction details, etc. Predictive modeling can be utilized to analyze the existing information and to estimate the values of auctioned items or the number of people participating in future auctions.
The predicting capability of the mining
application can also benefit society by identifying
criminal activities.
6.4. To carry out intelligent business
A customer's visit cycle in network marketing activities can be divided into four steps: being attracted, presence, purchase and departure. Web mining technology can uncover customers' motivations by analyzing customer click-stream information, in order to help sales staff make reasonable strategies, customize personalized pages for customers, and carry out targeted information feedback and advertising. In short, in e-commerce network marketing, using Web mining techniques to analyze large amounts of data can reveal the laws of consumption of goods and customers' access patterns, helping businesses develop effective marketing strategies and enhance enterprise competitiveness.
Companies can establish better customer relationships by giving customers exactly what they need. Companies can understand the needs of the customer better and react to customer needs faster. They can find, attract and retain customers; they can save on production costs by utilizing the acquired insight into customer requirements. They can increase profitability through target pricing based on the profiles created. They can even find the customer who might defect to a competitor; the company can then try to retain that customer with promotional offers, thus reducing the risk of losing the customer.
7. RESEARCH DIRECTIONS
The techniques being applied to Web content
mining draw heavily from the work on
information retrieval, databases, intelligent agents,
etc. Since most of these techniques are well
known and reported elsewhere, we have focused
on Web usage mining in this survey instead of
Web content mining. In the following we provide
some directions for future research.
7.1 Data Pre-Processing for Mining
Web usage data is collected in various ways, each
mechanism collecting attributes relevant for its
purpose. There is a need to pre-process the data to
make it easier to mine for knowledge.
Specifically, we believe that issues such as
instrumentation and data collection, data
integration and transaction identification need to
be addressed. Clearly improved data quality can
improve the quality of any analysis on it. A
problem in the Web domain is the inherent
conflict between the analysis needs of the analysts
(who want more detailed usage data collected),
and the privacy needs of users (who want as little
data collected as possible). This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard on collecting profile data may be a compromise on what can and will be collected. However, it is not clear how much compliance with it can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time. Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and
index server logs. Intelligent integration and
correlation of information from these diverse
sources can reveal usage information which may
not be evident from any one of them. Techniques
from data integration should be examined for this
purpose. Web usage data collected in various logs
is at a very fine granularity. Therefore, while it
has the advantage of being extremely general and
fairly detailed, it also has the corresponding
drawback that it cannot be analyzed directly, since
the analysis may start focusing on micro trends
rather than on the macro trends. On the other
hand, the issue of whether a trend is micro or
macro depends on the purpose of a specific
analysis.
Hence, we believe there is a need to group individual data collection events into groups, called Web transactions, before feeding them to the mining system. While techniques to do so have been proposed, more attention needs to be given to this issue.
7.2 The Mining Process
The key component of Web mining is the mining
process itself. As discussed in this paper, Web
mining has adapted techniques from the field of
data mining, databases, and information retrieval,
as well as developing some techniques of its own,
e.g. path analysis. A lot of work still remains to be
done in adapting known mining techniques as well
as developing new ones. Web usage mining
studies reported to date have mined for association
rules, temporal sequences, clusters, and path
expressions. As the manner in which the Web is
used continues to expand, there is a continual need
to figure out new kinds of knowledge about user
behavior that needs to be mined. The quality of a
mining algorithm can be measured both in terms
of how effective it is in mining for knowledge and
how efficient it is in computational terms. There
will always be a need to improve the performance
of mining algorithms along both these dimensions.
Usage data collection on the Web is incremental
in nature. Hence, there is a need to develop
mining algorithms that take as input the existing
data, mined knowledge, and the new data, and
develop a new model in an efficient manner.
Usage data collection on the Web is also
distributed by its very nature. If all the data were
to be integrated before mining, a lot of valuable
information could be extracted. However, an
approach of collecting data from all possible
server logs is both non-scalable and impractical.
Hence, there needs to be an approach where
knowledge mined from various logs can be
integrated together into a more comprehensive
model.
7.3 Analysis of Mined Knowledge
The output of knowledge mining algorithms is
often not in a form suitable for direct human
consumption, and hence there is a need to develop
techniques and tools for helping an analyst better
assimilate it. Issues that need to be addressed in
this area include usage analysis tools and
interpretation of mined knowledge.
There is a need to develop tools which incorporate
statistical methods, visualization, and human
factors to help better understand the mined
knowledge. Section 4 provided a survey of the
current literature in this area. One of the open
issues in data mining, in general, and Web mining,
in particular, is the creation of intelligent tools that
can assist in the interpretation of mined
knowledge. Clearly, these tools need to have
specific knowledge about the particular problem
domain to do any more than filtering based on
statistical attributes of the discovered rules or
patterns. In Web mining, for example, intelligent
agents could be developed that based on
discovered access patterns, the topology of the
Web locality, and certain heuristics derived from
user behavior models, could give
recommendations about changing the physical
link structure of a particular site.
8. WEB MINING PROS & CONS
8.1. PROS
Web mining essentially has many advantages, which makes this technology attractive to corporations, including government agencies. This technology has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies are using this technology to classify threats and fight against terrorism. The predicting capability of mining applications can benefit society
by identifying criminal activities.
Prospects
The future of Web Mining will to a large extent
depend on developments of the Semantic Web.
The role of Web technology continues to increase in industry, government, education and entertainment.
This means that the range of data to which Web
Mining can be applied also increases. Even
without technical advances, the role of Web
Mining technology will become larger and more
central. The main technical advances will be in
increasing the types of data to which Web Mining
can be applied. In particular Web Mining for text,
images and video/audio streams will increase the
scope of current methods. These are all active
research topics in Data Mining and Machine
Learning and the results of this can be exploited
for Web Mining.
The second type of technical advance comes from
the integration of Web Mining with other
technologies in application contexts. Examples are
information retrieval, ecommerce, business
process modeling, instruction, and health care.
The widespread use of web-based systems in these
areas makes them amenable to Web Mining.
In this section we outline current generic practical
problems that will be addressed, technology
required for these solutions, and research issues
that need to be addressed for technical progress.
Knowledge Management
Knowledge Management is generally viewed as a field of great industrial importance. Systematic management of the knowledge that is available in an organization can increase the organization's ability to make optimal use of that knowledge and to react effectively to new developments, threats and opportunities. Web Mining technology creates the opportunity to integrate knowledge management more tightly with business processes.
Standardization efforts that use Semantic Web technology and the availability of ever more data about business processes on the internet create opportunities for Web Mining technology.
widespread use of Web Mining for Knowledge
Management requires the availability of low-
threshold Web Mining tools that can be used by
non-experts and that can flexibly be integrated in a
wide variety of tools and systems.
E-commerce
The increased use of XML/RDF to describe
products, services and business processes
increases the scope and power of Data Mining
methods in e-commerce. Another direction is the
use of text mining methods for modeling
technical, social and commercial developments.
This requires advances in text mining and
information extraction.
E-learning
The Semantic Web provides a way of organizing
teaching material, and usage mining can be
applied to suggest teaching materials to a learner.
This opens opportunities for Web Mining. For
example, a recommending approach can be
followed to find courses or teaching material for a
learner. The material can then be organized with clustering techniques, and ultimately be shared on the web again, e.g., within a peer-to-peer network.
Web mining methods can be used to construct a
profile of user skills, competence or knowledge
and of the effect of instruction. Another possibility
is to use web mining to analyze student
interactions for teaching purposes. The internet
supports students who collaborate during learning.
Web mining methods can be used to monitor this
process, without requiring the teacher to follow
the interactions in detail. Current web mining
technology already provides a good basis for this.
Research and development must be directed
toward important characteristics of interactions
and to integration in the instructional process.
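The recommending approach mentioned above can be sketched in a minimal way. In this illustration the course descriptions, the learner-profile text, and the bag-of-words cosine similarity are all hypothetical stand-ins for the richer skill and usage profiles the text describes:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two bag-of-words term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(profile: str, materials: dict, k: int = 2):
    # rank teaching materials by similarity to the learner's profile text
    p = Counter(profile.lower().split())
    scored = [(cosine(p, Counter(text.lower().split())), title)
              for title, text in materials.items()]
    return [title for score, title in sorted(scored, reverse=True)[:k] if score > 0]

materials = {  # hypothetical course descriptions
    "Intro to SQL": "relational databases tables queries sql joins",
    "Web Mining 101": "web mining usage logs clustering patterns",
    "Poetry Workshop": "verse meter rhyme creative writing",
}
print(recommend("student interested in web usage mining and clustering", materials))
# -> ['Web Mining 101']
```

A real system would of course build the profile from observed interactions rather than a free-text description, but the ranking step has this general shape.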
E-government
Many activities in governments involve large
collections of documents. Think of regulations,
letters, announcements, reports. Managing access
and availability of this amount of textual
information can be greatly facilitated by a
combination of Semantic Web standardization and
text mining tools. Many internal processes in
government involve documents, both textual and
structured. Web mining creates the opportunity to
analyze these governmental processes and to
create models of the processes and the information
involved. It seems likely that standard ontologies
will be used in governmental organizations, and
the resulting standardization will make Web
Mining more widely applicable and more powerful
than it currently is. The issues involved are those
of Knowledge Management. Governmental
activities that involve the general public also offer
many opportunities for Web Mining. Like shops,
governments that offer services via the internet
can analyze their customers' behavior to improve
their services.
Information about social processes can be
observed and monitored using Web Mining, in the
the style of marketing analyses. Examples are the
analysis of research proposals for the European
Commission and the development of tools for
monitoring and structuring internet discussions on
non-political issues. The enabling technologies
here are more advanced information extraction
methods and tools.
Health care
Medicine is one of the Web’s fastest-growing
areas. It profits from Semantic Web technology in
a number of ways. First, as a means of organizing
medical knowledge: for example, the widely used
International Classification of Diseases taxonomy
and its variants serve to organize telemedicine
portal content and interfaces. The Unified
Medical Language System
(https://www.nlm.nih.gov/research/umls) integrates
this classification and many others. Second, health
care institutions can profit from interoperability
between the different clinical information systems
and semantic representations of member
institutions’ organization and services. Usage
analyses of medical sites can be employed for
purposes such as Web site evaluation and the
inference of design guidelines for international
audiences, or the detection of epidemics. In
general, similar issues arise, and the same
methods can be used for analysis and design as in
other content classes of Web sites. Some of the
facets of Semantic Web Mining that we have
mentioned in this article form specific challenges,
in particular: the privacy and security of patient
data, the semantics of visual material, and the
cost-induced pressure towards national and
international integration of Web resources.
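One facet mentioned above, detecting epidemics from usage of medical sites, can be illustrated with a deliberately crude sketch. The daily query counts and the threshold rule below are hypothetical, not a validated surveillance method:

```python
def spikes(series, window=5, factor=2.0):
    # flag a day whose count exceeds `factor` times the mean of the
    # preceding `window` days -- a crude signal of anomalous interest
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if series[i] > factor * baseline:
            flagged.append(i)
    return flagged

# hypothetical daily counts of symptom-related queries to a medical portal
counts = [12, 14, 11, 13, 12, 15, 13, 41, 55, 60]
print(spikes(counts))
# -> [7, 8, 9]
```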
E-science
In E-Science two main developments are visible.
One is the use of text mining and Data Mining for
information extraction to extract information from
large collections of textual documents. Much
information is “buried” in the huge scientific
literature and can be extracted by combining
knowledge about the domain and information
extraction. Enabling technology for this is
information extraction in combination with
knowledge representation and ontologies. The
other development is large scale data collection
and data analysis. This also requires a common
conceptualization and organization of the
information using ontologies. However, this form
of collaboration also needs a common
methodology, and it needs to be extended with
other means of communication.
Web mining for images and video and audio
streams
So far, efforts in Semantic Web research have
addressed mostly written documents. Recently this
has been broadened to include sound, voice and
images. Images and parts of images are annotated
with terms from ontologies.
Privacy and security
A factor that limits the application of Web Mining
is the need to protect users' privacy. Web Mining
uses data that are available on the web anyway,
but Data Mining makes it possible to induce
general patterns that, applied to personal data, can
infer information that should remain private.
Recent research addresses this problem by
searching for selective restrictions on access to
data that still allow the induction of general
patterns but at the same time preserve a preset
uncertainty about individuals, thereby protecting
their privacy.
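A minimal illustration of preserving a preset uncertainty about individuals is k-anonymity-style generalization. The ages and band widths below are invented for the example; real schemes generalize several quasi-identifiers at once:

```python
from collections import Counter

def band(age, width):
    # generalize an exact age into a band of the given width, e.g. 23 -> "20-29"
    lo = age - age % width
    return f"{lo}-{lo + width - 1}"

def k_anonymize_ages(ages, k):
    # widen the age bands until every band holds at least k individuals,
    # so no released value can be traced to fewer than k people
    for width in (1, 5, 10, 20, 50, 100):
        bands = [band(a, width) for a in ages]
        if all(c >= k for c in Counter(bands).values()):
            return bands
    return bands

ages = [23, 24, 27, 31, 34, 36, 62, 64]  # hypothetical user ages
print(k_anonymize_ages(ages, 2))
# -> ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39', '60-69', '60-69']
```

Patterns mined from the banded data still show the general age structure, but no band identifies fewer than k people.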
Information extraction with formalized
knowledge
We briefly reviewed the use of concept hierarchies
and thesauri for information extraction. If
knowledge is represented in more general formal
Semantic Web languages such as OWL, there are
in principle stronger possibilities to exploit this
knowledge for information extraction.
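As a toy sketch of thesaurus-driven extraction, surface terms can be mapped to concepts and then up a concept hierarchy. The terms and the two-level hierarchy below are invented, not taken from any real ontology:

```python
# toy thesaurus: surface term -> concept (hypothetical entries)
THESAURUS = {"flu": "Influenza", "influenza": "Influenza", "measles": "Measles"}
# toy hierarchy: concept -> parent concept (hypothetical)
HIERARCHY = {"Influenza": "ViralDisease", "Measles": "ViralDisease",
             "ViralDisease": "Disease"}

def extract_concepts(text):
    # map each known surface term to its concept, then add all ancestors
    found = set()
    for token in text.lower().replace(",", " ").split():
        concept = THESAURUS.get(token)
        while concept:
            found.add(concept)
            concept = HIERARCHY.get(concept)
    return found

print(sorted(extract_concepts("Reported flu cases rose sharply")))
# -> ['Disease', 'Influenza', 'ViralDisease']
```

Richer OWL knowledge would additionally allow reasoning over relations between concepts, not just subsumption as here.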
In summary, the main foreseen developments are:
– The extensive use of annotated documents
facilitates the application of Data Mining
techniques to documents.
– The use of a standardized format and a
standardized vocabulary for information on the
web will increase the effect and use of Web
Mining.
– The Semantic Web goal of large-scale
construction of ontologies will require the use of
Data Mining methods, in particular to extract
knowledge from text.
8.2. CONS
Web mining by itself does not create issues, but
when this technology is applied to data of a
personal nature it can raise concerns. The most
criticized ethical issue involving web mining is the
invasion of privacy. Privacy is considered lost
when information concerning an individual is
obtained, used, or disseminated, especially if this
occurs without their knowledge or consent. The
obtained data are analyzed and clustered to form
profiles; the data are made anonymous before
clustering so that no personal profiles result.
These applications thus de-individualize users by
judging them by their mouse clicks. De-
individualization can be defined as the tendency
to judge and treat people on the basis of group
characteristics instead of on their own individual
characteristics and merits.
Another important concern is that companies
collecting data for a specific purpose might use
them for a totally different purpose, which
essentially violates the user's interests. The
growing trend of selling personal data as a
commodity encourages website owners to trade
personal data obtained from their sites. This trend
has increased the amount of data being captured
and traded, increasing the likelihood of one's
privacy being invaded. The companies that buy
the data are obliged to make it anonymous, and
these companies are considered the authors of any
specific release of mining patterns. They are
legally responsible for the contents of the release;
any inaccuracies in the release can result in
serious lawsuits, but there is no law preventing
them from trading the data.
Some mining algorithms might use controversial
attributes such as sex, race, religion, or sexual
orientation to categorize individuals. Such
practices may violate anti-discrimination
legislation. The applications make it hard to
identify the use of such controversial attributes,
and there is no strong rule against using such
algorithms with such attributes. This could result
in the denial of a service or privilege to an
individual based on race, religion, or sexual
orientation; at present this situation can only be
avoided by the high ethical standards maintained
by the data mining company. The collected data
are made anonymous so that neither the data nor
the derived patterns can be traced back to an
individual. Although this may look as if it poses
no threat to privacy, a great deal of extra
information can be inferred by combining separate
pieces of data about a user.
9. CONCLUSION
The term Web mining has been used to refer to
techniques that encompass a broad range of issues.
However, while meaningful and attractive, this
very broadness has caused Web mining to mean
different things to different people, and there is a
need to develop a common vocabulary. Towards
this goal we proposed a definition of Web mining
and developed a taxonomy of the various ongoing
efforts related to it. Next, we presented a survey of
the research in this area, concentrating on Web
usage mining. We provided a detailed survey of
the efforts in this area, even though the survey is
short because of the area's newness. We provided
a general architecture of a system for Web usage
mining, and identified the issues and problems in
this area that require further research and
development.
As the Web and its usage continue to grow, so
does the opportunity to analyze Web data and
extract all manner of useful knowledge from it.
The past few years have seen the emergence of
Web mining as a rapidly growing area, due to the
efforts of the research community as well as
various organizations practicing it. The key
component of web mining is the mining process
itself. We have described the key computer
science contributions made in this field, including
an overview of web mining, a taxonomy of web
mining, the prominent successful applications,
and some promising areas of future research.