Web Data Mining - 5


DMW (UNIT – 5)

WEB DATA MINING

Web Data Mining


Web Data Mining is a field of data mining focused on extracting useful information, patterns, and
knowledge from data on the World Wide Web. The data in question can come from a variety of
sources, including web pages, hyperlinks, user behavior, server logs, social media, and e-commerce
platforms. Web data mining is essential for understanding and deriving insights from the large,
unstructured, and diverse datasets present on the web.

Web Data Mining can be broken down into three main types based on the nature of the data and the
tasks involved:

Types of Web Data Mining

1. Web Content Mining:

o Definition: Web content mining focuses on extracting useful data and information
from the content of web pages. This includes textual information, multimedia (images,
videos), and documents (e.g., PDFs, HTML).

o Examples:

▪ Extracting product descriptions and reviews from e-commerce websites.

▪ Mining textual content for sentiment analysis or topic modeling.

▪ Extracting useful information (e.g., events, news, trends) from blogs, forums,
and social media platforms.

o Techniques Used: Text mining, Natural Language Processing (NLP), and image
processing.

2. Web Structure Mining:

o Definition: Web structure mining deals with analyzing the structure of hyperlinks
between web pages. It focuses on discovering relationships between different
websites, web pages, and how they are interconnected.

o Examples:

▪ Identifying important web pages by analyzing their in-degree (number of links
pointing to a page).

▪ Analyzing the link structure of the web to identify communities or clusters of
related websites.

▪ Search engine optimization (SEO) for improving the ranking of pages.

o Techniques Used: Graph theory, network analysis, and link analysis algorithms like
PageRank.

3. Web Usage Mining:

o Definition: Web usage mining focuses on analyzing the behavior of web users as they
navigate websites. This includes studying click patterns, navigation paths, search
queries, and interactions with web pages to understand user preferences and improve
user experience.

o Examples:

▪ Analyzing server logs to track which pages are most visited and which links are
clicked most often.

▪ Personalizing content and product recommendations for users based on their
browsing history (e.g., Amazon recommendations).

▪ Predicting and recommending similar products or services based on browsing
patterns.

o Techniques Used: Data mining, clustering, association rule mining, and machine
learning.

Web Data Mining Process

The process of web data mining typically involves the following steps:

1. Data Collection:

o This involves gathering data from various web sources like websites, social media, and
user interaction logs. It can be done using web scraping (for content mining) or
analyzing server logs (for usage mining).

o Web Crawling: Automated scripts or bots that systematically browse the web and
download web pages are often used to gather data.

2. Data Preprocessing:

o Raw data collected from the web is often noisy, inconsistent, or unstructured.
Preprocessing is required to clean and format the data for further analysis.

o Tasks like text cleaning, removing stop words, stemming, and parsing HTML pages are
common during this stage.

3. Pattern Discovery:

o In this step, the data mining techniques (e.g., clustering, classification, association rule
mining) are applied to discover useful patterns.

o Frequent pattern mining might be used to find items or content that frequently co-
occur in user sessions.

o Classification might be used to classify web content or users into different categories
(e.g., customer segments).

4. Pattern Evaluation:

o Once patterns are discovered, they are evaluated for their usefulness. Not all patterns
are significant or actionable, so techniques like statistical analysis and validation are
used to assess the quality and relevance of the patterns.

5. Pattern Utilization:

o The discovered patterns are then used to make decisions, build predictive models, or
improve business processes. For example, using recommendations to suggest
products, improving web design, or adjusting marketing strategies.
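
The five steps above can be sketched end to end on a toy scale. The example below is only an illustration under stated assumptions: collection is simulated with three in-memory pages instead of a crawler, the stop-word list is a tiny hypothetical one, and "pattern discovery" is reduced to finding terms that occur in at least two pages. The names `TextExtractor`, `preprocess`, and `discover_patterns` are made up for this sketch.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in", "for"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def preprocess(html):
    """Step 2: strip markup, lowercase, tokenize, drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def discover_patterns(pages, min_pages=2):
    """Steps 3-4: count the pages each term occurs in, keep the frequent terms."""
    doc_freq = Counter()
    for html in pages:
        doc_freq.update(set(preprocess(html)))
    return {term: n for term, n in doc_freq.items() if n >= min_pages}


# Step 1 (collection) is simulated with in-memory pages instead of a crawler.
pages = [
    "<html><body><h1>Cheap laptop deals</h1><script>var x=1;</script></body></html>",
    "<html><body><p>The best laptop for students</p></body></html>",
    "<html><body><p>Phone deals of the week</p></body></html>",
]

# Step 5 (utilization) would act on these terms, e.g. to tag or recommend pages.
frequent_terms = discover_patterns(pages)
print(frequent_terms)
```

Stemming is left out to keep the sketch short; it would slot into `preprocess` after stop-word removal.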


Techniques in Web Data Mining

1. Clustering:

o Grouping web pages or users into clusters based on similar characteristics or behavior.
For example, clustering users who visit similar products or pages.

2. Classification:

o Assigning a category to web content or users. For example, classifying news articles
into categories like politics, sports, or entertainment.

3. Association Rule Mining:

o Finding relationships between different items or actions. For example, finding that
users who view product A are likely to view product B (cross-selling).

4. Text Mining and NLP:

o Extracting insights from textual data. For example, using NLP (Natural Language
Processing) techniques to understand sentiment in social media posts or news
articles.

5. Link Analysis:

o Analyzing the structure of hyperlinks on the web. For example, using PageRank to
assess the importance of web pages based on the link structure.

6. Anomaly Detection:

o Identifying unusual behavior patterns, which might indicate fraud or security
breaches.
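
As an illustration of association rule mining from the list above, the following sketch computes support and confidence for rules between pairs of viewed products. It is a simplified, assumption-laden example: it only considers pairs (not larger itemsets, as Apriori would), and the sessions, thresholds, and the function name `pair_rules` are invented for the illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for rules lhs -> rhs over item pairs."""
    n = len(sessions)
    item_count = Counter()
    pair_count = Counter()
    for session in sessions:
        items = sorted(set(session))          # ignore repeats within a session
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                   # share of sessions containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]   # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical sessions: each list holds the products viewed in one visit.
sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["B", "D"],
]
rules = pair_rules(sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "A -> B" corresponds to the cross-selling example in the text: users who view product A are likely to view product B.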

Challenges in Web Data Mining

• Data Quality: The web contains a lot of noisy, inconsistent, or irrelevant data that needs to be
cleaned and preprocessed.

• Data Privacy and Security: User data is often sensitive, so privacy concerns and ethical issues
arise when collecting and mining data.

• Scalability: The volume of web data is enormous, and processing it efficiently at scale is a
challenge.

• Dynamic Nature of the Web: The web is constantly changing (pages are updated, links are
added or removed), so the mining process must adapt to these changes.

• Heterogeneity: The web contains diverse types of data (structured, semi-structured,
unstructured), which makes it difficult to apply uniform data mining techniques.

Web Terminology in Data Warehouse Management (DWM)


Data Warehouse (DW): A centralized repository that stores data from multiple sources, allowing for
analysis and reporting. It is designed to facilitate business intelligence activities.

Data Mart: A subset of a data warehouse, focused on a specific business line or team. It contains only
the data relevant to that particular area, making it easier for users to access and analyze.

Metadata: Data that describes other data. In the context of DWM, metadata provides information about
the data warehouse's structure, contents, and usage, acting as a roadmap for users.

ETL (Extract, Transform, Load): A process used to extract data from various sources, transform it into
a suitable format, and load it into the data warehouse. This is crucial for maintaining data quality and
consistency.

OLAP (Online Analytical Processing): A category of software technology that enables analysts,
managers, and executives to gain insight into data through fast, consistent, interactive access in a
variety of ways.

Data Cube: A multi-dimensional array of values, used to represent data in a way that allows for
complex queries and analysis. It helps in visualizing data across multiple dimensions.

Data Mining: The process of discovering patterns and knowledge from large amounts of data. In DWM,
it involves analyzing data stored in the warehouse to extract useful information.

Business Intelligence (BI): Technologies and strategies used by enterprises for data analysis of business
information. BI tools help in making informed decisions based on data analysis.

Schema: The structure that defines the organization of data in the data warehouse. It includes the
tables, fields, relationships, and constraints.

Star Schema: A type of database schema that is optimized for data warehousing and OLAP
applications. It consists of a central fact table connected to multiple dimension tables.

Snowflake Schema: A more complex version of the star schema where dimension tables are
normalized into multiple related tables, reducing data redundancy.
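
A star schema can be illustrated with a small in-memory SQLite database. This is a minimal sketch, not a production design: the table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are hypothetical. The query joins the central fact table to its dimension tables and aggregates, which is the typical OLAP-style access pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table referencing two dimension tables (hypothetical names and data).
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales VALUES (1, 1, 900.0), (1, 2, 1100.0), (2, 1, 250.0);
""")

# Total sales per product category and year: fact table joined to both dimensions.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category
""").fetchall()

for category, year, total in rows:
    print(category, year, total)
```

In a snowflake schema, `category` would move out of `dim_product` into its own table, at the cost of one more join in the query.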

Data Governance: The management of data availability, usability, integrity, and security in the data
warehouse. It ensures that data is accurate and consistent across the organization.

Data Quality: Refers to the condition of the data based on factors such as accuracy, completeness,
reliability, and relevance. High data quality is essential for effective decision-making.

Characteristics of Web Data:

1. Subject-Oriented: Data warehouses are organized around key subjects of the business, such
as customers, products, or sales, rather than around specific applications.

2. Integrated: Data from various sources is integrated into a consistent format, ensuring that
users have a unified view of the data.

3. Time-Variant: Data warehouses store historical data, allowing for analysis over time. This
characteristic enables trend analysis and forecasting.

4. Non-Volatile: Once data is entered into the data warehouse, it is not changed or deleted. This
stability allows for consistent reporting and analysis.

5. Accessible: Data warehouses are designed to be easily accessible to users, providing tools and
interfaces for querying and reporting.

6. Scalable: DWM systems can grow in size and complexity as the organization’s data needs
increase, accommodating more data sources and users.


7. Performance: Optimized for fast query performance, data warehouses use indexing,
partitioning, and other techniques to ensure quick access to data.

Locality and Hierarchy in the web


Locality and Hierarchy are two important concepts when analyzing the structure and behavior of the
web. Both concepts help in understanding how data is organized, accessed, and how users navigate
the web.

1. Locality in the Web

Locality refers to the principle that web users tend to interact with or access certain parts of the web
that are geographically or conceptually closer to their current location or interest. In web mining,
locality can be understood in different contexts:

• Spatial Locality:

o In the context of web usage, spatial locality refers to how a user tends to access web
pages or resources that are close to each other in a structural sense. For example,
users often visit related pages within the same website, like going from a homepage
to a product page, then to a shopping cart.

o Example: On an e-commerce site, after viewing a product, users might go directly to
the checkout page, demonstrating locality in their navigation.

• Temporal Locality:

o Temporal locality arises when users access the same or similar web pages repeatedly
over time. For example, a user who frequently visits a particular news website or
social media platform exhibits temporal locality.

o Example: A user who regularly checks a specific blog will often revisit the same site
repeatedly, showing locality in their usage patterns.

• Geographic Locality:

o Users from different geographic locations might show patterns of accessing localized
content (e.g., a user in France might access a French version of a website rather than
the English version).

o Example: A website may serve different content based on the user's location, such as
local news, currency, or language preferences.

2. Hierarchy in the Web

Hierarchy in the web refers to the structural organization of web content, websites, and how data is
organized across different levels of a website or the entire web. It is critical in understanding the flow
of information, user navigation, and indexing for search engines.

• Website Hierarchy:

o Websites typically follow a hierarchical structure where the homepage is the starting
point, leading to categories, subcategories, and individual pages. This structure
mirrors traditional hierarchical systems like folders and subfolders in a file system.


o Example: An e-commerce website might have the following hierarchy:

▪ Homepage → Product Categories → Subcategories → Individual Product Pages

▪ This allows for easy navigation and organization of content.

• Web Page Hierarchy:

o Within a single web page, there can be a hierarchical structure that divides content
into sections like headings, subheadings, paragraphs, images, and links. This makes
the page more navigable and helps search engines understand the structure.

o Example: A blog post might have a title, followed by subheadings (H2, H3), and then
the body text. This hierarchical structure helps both users and search engines navigate
and understand the content better.

• Link Hierarchy (Link Structure):

o The web itself can be viewed as a giant graph of interlinked web pages, with a
hierarchical relationship based on how pages are linked to each other. Some pages
have many inbound links, indicating they are more important or authoritative (e.g.,
Google's PageRank algorithm ranks pages based on their link structure).

o Example: Wikipedia has a highly organized and hierarchical link structure, with major
topics linking to smaller subtopics and individual articles. This hierarchy supports both
navigation and knowledge discovery.

• Search Engine Hierarchy:

o Search engines like Google also organize web content hierarchically. They prioritize
web pages based on the relevance of content, the quality of links, and the website
structure. Higher-ranked pages are typically linked more frequently and have better
visibility.

o Example: A search query on Google may lead to a series of results ranked by relevance,
with top results typically from authoritative sites, demonstrating a kind of hierarchical
ranking in search engines.
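
The web page hierarchy described above (a title followed by H2 and H3 subheadings) can be recovered from raw HTML. The sketch below uses Python's standard `html.parser` module to build an indented outline of headings. The class name `HeadingOutline` and the sample HTML are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Records (level, text) for every h1-h6 element, in document order."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None     # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._buffer).strip()))
            self._level = None


# Hypothetical page content.
html = ("<h1>Web Mining</h1><p>Intro</p><h2>Content</h2>"
        "<h3>Text</h3><h2>Usage</h2>")
parser = HeadingOutline()
parser.feed(html)

# Indent each heading by its level to show the hierarchy.
for level, text in parser.outline:
    print("  " * (level - 1) + text)
```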

What is Web Content Mining?


Web Content Mining is one of the three types of Web Mining techniques, and it is the focus of this
section. It is the mining, extraction, and integration of useful data, information, and knowledge
from web page content.

In simple terms, it is the application of web mining that extracts relevant or useful information
from the content of the Web. Web content mining is related to, but different from, other mining
techniques such as data mining and text mining. Because web data is heterogeneous and lacks
structure, automated discovery of new knowledge patterns can be challenging.

Web data are generally semi-structured and/or unstructured, while data mining is primarily concerned
with structured data. Web content mining scans and mines text, images, and groups of web pages
according to the content of the input query, as a search engine does when it displays a list of results.


For example, if the user is searching for a particular song, the search engine will display or provide
suggestions relevant to it.

Web content mining deals with different kinds of data such as text, audio, video, image, etc.

Unstructured Web Data Mining

Unstructured data includes data such as audio, video, etc. Web content mining converts this
unstructured data into structured, useful information. The conversion process is as follows:

Unstructured Documents Feature Extraction:

1. Bag of words to represent unstructured documents

• Takes a single word as a feature.

• It ignores the sequence or order in which words occur.

2. Features could be:

• Boolean: Whether the word occurs in the document or not.

• Frequency-based: The number of times the word occurs in the document.

3. Variations of the feature selection include:

• Removal of case, punctuation, infrequent words, and stop words.

4. Features can be reduced using different feature selection techniques:

• Information gain: measures the difference between two probability distributions.

• Stemming: reduces words to their morphological roots.
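
The bag-of-words representation above can be sketched in a few lines. This example builds both Boolean and frequency-based feature vectors after removing case, punctuation, and stop words. The stop-word list, the sample documents, and the function names are assumptions for the illustration; stemming and information gain are not shown.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"the", "is", "a", "of", "and", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only (drops punctuation), remove stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def bag_of_words(documents):
    """Return the vocabulary plus Boolean and frequency vectors, one per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    frequency = []
    for tokens in tokenized:
        counts = Counter(tokens)
        frequency.append([counts[term] for term in vocabulary])
    boolean = [[1 if c > 0 else 0 for c in row] for row in frequency]
    return vocabulary, boolean, frequency


docs = ["The web is a graph of pages.", "Pages link to pages on the web!"]
vocabulary, boolean, frequency = bag_of_words(docs)
print(vocabulary)
print(boolean)
print(frequency)
```

Word order is discarded: the two vectors for a document only say which words occur and how often, which is exactly the simplification bag of words makes.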


Mining Techniques Using Agents and Databases:

1. Agent-Based Approaches:

• Intelligent Search: This type of search works toward a particular goal of the user and
returns results based on that goal.

• Information Filtering/Categorization: This type of search deals with the filtering of
data, i.e., removal of unwanted or redundant information using certain AI-based
methods. Examples include HyPursuit and BO (Bookmark Organizer).

• Growth of sophisticated AI systems that act on behalf of users in a fully or partly automated
manner. One of these is deep learning, wherein the system is trained by feeding it with certain
kinds of data.

2. Database Approaches:

Used for transforming unstructured data into a more structured and high-level collection of
resources, such as in relational databases, and using standard database querying mechanisms
and data mining techniques to access and analyze this information.

• Multilevel Databases:

o Lowest Level – semi-structured information is kept.

o High Level- generalization from lower levels organized into relations and objects.

• Web Query Systems:

o Web query systems use query languages such as SQL, as well as Natural Language
Processing, to extract data from the web.

Web Content Mining Techniques:

1. Pre-processing

2. Clustering

3. Classifying


4. Identifying the associations

5. Topic identification, tracking, and drift analysis

Applications of Web Content Mining:

1. Classifying the web documents into categories.

2. Identify topics of web documents.

3. Finding similar web pages across the different web servers.

4. Applications related to relevance.

What is Web Usage Mining?


Web usage mining, a subset of data mining, is the extraction of interesting patterns from the data
generated as users interact with pages on the World Wide Web (WWW). As an application of data mining
techniques, it helps analyze user activities on different web pages and track them over a period of
time. Web usage mining can be divided into three major subcategories based on the usage data involved.

There are 3 main types of web data:

1. Web Content Data: The common forms of web content data are HTML pages, images, audio, video,
etc. HTML is the most common of these. Although rendering may differ from browser to browser, the
basic layout and structure are the same everywhere. XML and dynamic server pages such as JSP and PHP
are also forms of web content data.

2. Web Structure Data: Within a web page, content is arranged according to HTML tags (known as
intra-page structure information). Web pages usually also have hyperlinks that connect one page to
other pages (known as inter-page structure information). The relationships and links that describe
the connections between web pages make up web structure data.

3. Web Usage Data: The main sources of this data are the web server and the application server. It
consists of log data collected by these two sources. Log files are created when a user interacts
with a web page. Usage data can be categorized into three types based on where it is collected:

• Server-side

• Client-side

• Proxy-side

There are also additional data sources, which include cookies, demographics, etc.

Types of Web Usage Mining based upon the Usage Data:

1. Web Server Data: Web server data generally includes IP addresses, browser logs, proxy
server logs, user profiles, etc. These user logs are collected by the web server.

2. Application Server Data: Commercial application servers allow applications to be built on
top of them. Application server data mainly consists of the business events that are tracked
and recorded in application server logs.

3. Application-Level Data: An application can define various new kinds of events. Enabling
logging for them provides a historical record of these events.
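
Web server data of this kind is often stored one request per line. As a hedged illustration, the sketch below parses lines in the widely used Common Log Format with a regular expression and counts page hits and pages per IP address. The log lines are invented (using documentation-range IP addresses), and real servers may be configured with a different format.

```python
import re
from collections import Counter

# Pattern for Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log lines, including one malformed entry to show cleaning.
log_lines = [
    '192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2024:13:56:01 +0000] "GET /products HTTP/1.1" 200 5120',
    '192.0.2.7 - - [10/Oct/2024:13:57:12 +0000] "GET /products HTTP/1.1" 200 5120',
    'malformed line',
]

records = [r for r in map(parse_line, log_lines) if r is not None]

# Which pages are visited most often?
page_hits = Counter(r["path"] for r in records)

# Which pages did each client request, in order? (A crude stand-in for sessions.)
pages_by_ip = {}
for r in records:
    pages_by_ip.setdefault(r["ip"], []).append(r["path"])

print(page_hits.most_common())
print(pages_by_ip)
```

Grouping by IP address alone is only an approximation of a user session; proxies and shared addresses blur it, which is one reason cookies and client-side data are also collected.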

Advantages of Web Usage Mining

• Government agencies benefit from this technology in countering terrorism.

• The predictive capabilities of mining tools have helped identify various criminal activities.

• Companies better understand customer relationships with the aid of these mining tools, which
helps them satisfy customer needs faster and more efficiently.

Disadvantages of Web Usage Mining

• Privacy stands out as a major issue. Analyzing data for the benefit of customers is good, but
using the same data for something else can be dangerous. Using it without the individual's
knowledge can pose a big threat to the company.

• If a data mining company lacks high ethical standards, two or more attributes can be
combined to derive personal information about the user, which is unethical.

Some Techniques in Web Usage Mining

Applications of Web Usage Mining

1. Personalization of Web Content: The World Wide Web holds a lot of information and is expanding
rapidly. The problem is that people's specific needs are growing every day, and they often do not
get the query results they want. One solution is web personalization, which may be defined as
catering to users' needs based on tracking their navigational behavior and their interests. Web
personalization includes recommender systems, check-box customization, etc. Recommender systems are
popular and are used by many companies.

2. E-commerce: Web usage mining plays a vital role in web-based companies, since their ultimate
focus is on customer attraction, customer retention, cross-sales, etc. To build a strong relationship
with customers, a web-based company relies on web usage mining to gain insight into customers'
interests. It also shows the company where its web design can be improved.

3. Prefetching and Caching: Prefetching means loading data before it is required in order to
reduce the time spent waiting for it, hence the term 'prefetch'. The results of web usage mining
can be used to produce prefetching and caching strategies, which in turn can greatly reduce
server response time.
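
One simple way usage data could drive prefetching is a first-order transition model: count which page most often follows each page in past sessions, and prefetch that page. The sketch below is an assumption-based illustration, not a description of any particular system; the sessions and the function names `build_transitions` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count how often each page is followed directly by each other page."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            transitions[current][nxt] += 1
    return transitions


def predict_next(transitions, page):
    """Return the most frequent successor of `page`, or None if it was never followed."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical navigation paths reconstructed from server logs.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/about"],
]
transitions = build_transitions(sessions)

# A server or browser could prefetch the predicted page while the user reads the current one.
print(predict_next(transitions, "/home"))
print(predict_next(transitions, "/products"))
```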

Web Structure Mining


Web Structure Mining is one of the three types of Web Mining techniques, and it is the focus of this
section. Web Structure Mining is the technique of discovering structure information from the web. It
uses graph theory to analyze the nodes and connections in the structure of a website.

Depending upon the type of Web Structural data, Web Structure Mining can be categorised into two
types:


1. Extracting patterns from hyperlinks in the Web: The Web works through a system of hyperlinks
using the Hypertext Transfer Protocol (HTTP). A hyperlink is a structural component that connects a
web page to a different location. Any page can create a hyperlink to any other page, and that
page can in turn link to other pages. The intertwined, self-referential nature of the web lends
itself to some unique network analysis algorithms. The structure of web pages can also be analyzed
to examine the pattern of hyperlinks among pages.

2. Mining the document structure: This is the analysis of the tree-like structure of a web page to
describe its HTML or XML tag usage.

There are different terms associated with Web Structure Mining:

• Web Graph: The directed graph representing the Web.

• Node: A node represents a web page in the graph.

• Edge(s): An edge represents a hyperlink between web pages in the web graph.

• In-degree: The number of hyperlinks pointing to a particular node in the graph.

• Degree(s): The number of links originating from a particular node, also called the
out-degree.
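
These terms can be made concrete with a tiny web graph stored as a list of directed edges. The sketch below computes the in-degree and out-degree of every node; the page names and links are made up for the illustration.

```python
from collections import defaultdict


def degree_counts(edges):
    """Return {node: (in_degree, out_degree)} for a directed graph given as (source, target) pairs."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1    # a link leaving `source`
        in_degree[target] += 1     # a link pointing to `target`
        nodes.update((source, target))
    return {n: (in_degree[n], out_degree[n]) for n in sorted(nodes)}


# Hypothetical web graph: each pair is a hyperlink from the first page to the second.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
degrees = degree_counts(edges)

for node, (ins, outs) in degrees.items():
    print(f"{node}: in-degree={ins}, out-degree={outs}")
```

Here page C has the highest in-degree, so by the simple in-degree measure it would be considered the most important page.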

All these terminologies will be more clear by looking at the following diagram of Web Graph:

Example of Web Structure Mining:

One such technique is the PageRank algorithm, which Google uses to rank web pages. The rank of a
page depends on the number and quality of the pages that link to it.

Web Structure Mining can therefore be performed either at the document level (intra-page) or at
the hyperlink level (inter-page). Research done at the hyperlink level is called Hyperlink
Analysis. The hyperlink structure can be used to retrieve useful information on the Web.
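
The idea behind PageRank can be sketched with the textbook power-iteration form of the algorithm. This is a simplified illustration, not Google's production ranking: the damping factor of 0.85 is the commonly quoted value, the fixed iteration count is an arbitrary choice, and the four-page graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link structure.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Page C ends up with the highest score because it is linked from three pages, while D, which nothing links to, keeps only the baseline share. A link from a highly ranked page contributes more than a link from a low-ranked one, which is what "quality of links" means here.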

Web structure mining has two main approaches, which correspond to two basic strategic models for
successful websites: hubs and authorities.

Hubs and Authorities

• Hubs: These are pages with a large number of interesting links. They serve as a hub or
gathering point that people visit to access a variety of information. More focused sites can
aspire to become a hub for newly emerging areas. The pages of the website themselves can
be analyzed for the quality of content that attracts the most users.

• Authorities: People usually gravitate towards pages that provide the most complete and
authentic information on a particular subject. This could be factual information, news, advice,
etc. These websites have the largest number of inbound links from other websites.
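
Hub and authority scores are commonly computed with the HITS algorithm, which the text above does not name explicitly; the sketch below assumes that algorithm. A page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph and page names are hypothetical.

```python
import math


def hits(links, iterations=20):
    """Iteratively compute hub and authority scores for a directed link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]

        # Hub: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}

        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link structure: a portal and a blog point at content pages.
links = {
    "portal": ["news", "wiki", "shop"],
    "blog": ["news", "wiki"],
    "news": ["wiki"],
}
hub, auth = hits(links)

print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this graph the portal is the strongest hub because it links to the most authoritative pages, and the wiki is the strongest authority because the most hubs point to it.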

Applications of Web Structure Mining:

• Information retrieval in social networks.

• To find out the relevance of each web page.

• Measuring the completeness of Websites.

• Used in Search engines to find the relevant information.

Comparison of Web Content Mining (IR view and DB view), Web Structure Mining, and Web Usage Mining:

• View of data:

o Web Content (IR view): unstructured, structured

o Web Content (DB view): semi-structured, website as DB

o Web Structure: link structure

o Web Usage: interactivity

• Main data:

o Web Content (IR view): text documents, hypertext documents

o Web Content (DB view): hypertext documents

o Web Structure: link structure

o Web Usage: server logs, browser logs

• Method:

o Web Content (IR view): machine learning, statistical (including NLP)

o Web Content (DB view): proprietary algorithms, association rules

o Web Structure: proprietary algorithms

o Web Usage: machine learning, statistical, association rules

• Representation:

o Web Content (IR view): bag of words, n-gram terms, phrases, concepts or ontology, relational

o Web Content (DB view): edge-labeled graph, relational

o Web Structure: graph

o Web Usage: relational table, graph

• Application categories:

o Web Content (IR view): categorization, clustering, finding extraction rules, finding patterns
in text

o Web Content (DB view): finding frequent substructures, website schema discovery

o Web Structure: categorization, clustering

o Web Usage: site construction, adaptation, and management

Web mining software


Web mining software encompasses a variety of tools and applications designed to extract, analyze,
and visualize data from the web. These tools can help in various aspects of web mining, including web
content mining, web usage mining, and web structure mining. Below are some popular software tools
and platforms categorized by their primary functions:

1. Web Content Mining Tools


2. Web Usage Mining Tools
3. Web Structure Mining Tools
4. General Data Mining and Analysis Tools
5. Web Scraping Frameworks and Libraries
