Business Data Mining Week 13

Business Data Mining

Uploaded by

pm6566
Copyright
© All Rights Reserved

Week 13 - LAQ's

Explain the role of web mining, specifically web content mining, web structure mining, and web usage mining, in business data mining. Discuss the application of the HITS and LOGSOM algorithms in these contexts.
----------------------------------------------------------------------------------
Web mining is the application of data mining techniques to automatically discover and extract information from web documents and services. Its main purpose is to discover useful information from the World Wide Web and its usage patterns.

Web mining sifts through the vast amount of data available on the World Wide Web to find and extract the information a task requires. A distinctive feature of web mining is the range of data types it works with, and each element of the web calls for a different mining method. For example, web pages are made up of text, pages are connected by hyperlinks, and web server logs record user behavior. Web mining is an interdisciplinary field that combines methods from data mining, machine learning, artificial intelligence, statistics, and information retrieval. Analyzing user behavior and website traffic is one basic example of web mining.

Applications of Web Mining


Web mining is the process of discovering patterns, structures, and relationships in web data.
It involves using data mining techniques to analyze web data and extract valuable insights.
The applications of web mining are wide-ranging and include:
• Personalized marketing: Web mining can be used to analyze customer behavior on websites and social media platforms. This information can be used to create personalized marketing campaigns that target customers based on their interests and preferences.
• E-commerce: Web mining can be used to analyze customer behavior on e-commerce websites. This information can be used to improve the user experience and increase sales by recommending products based on customer preferences.
• Search engine optimization: Web mining can be used to analyze search engine queries and search engine results pages (SERPs). This information can be used to improve the visibility of websites in search engine results and increase traffic to the website.
• Fraud detection: Web mining can be used to detect fraudulent activity on websites. This information can be used to prevent financial fraud, identity theft, and other types of online fraud.
• Sentiment analysis: Web mining can be used to analyze social media data and extract sentiment from posts, comments, and reviews. This information can be used to understand customer sentiment towards products and services and make informed business decisions.
• Web content analysis: Web mining can be used to analyze web content and extract valuable information such as keywords, topics, and themes. This information can be used to improve the relevance of web content and optimize search engine rankings.
• Customer service: Web mining can be used to analyze customer service interactions on websites and social media platforms. This information can be used to improve the quality of customer service and identify areas for improvement.
• Healthcare: Web mining can be used to analyze health-related websites and extract valuable information about diseases, treatments, and medications. This information can be used to improve the quality of healthcare and inform medical research.
Process of Web Mining

Web mining can be broadly divided into three types of techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.

Categories of Web Mining

• Web Content Mining: Web content mining is the extraction of useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data is the collection of facts that a web page is designed to convey. It can reveal useful and interesting patterns about user needs. Mining text documents draws on text mining, machine learning and natural language processing, so web content mining is sometimes loosely called text mining. This type of mining scans and mines the text, images and groups of web pages according to the content of the input.
• Web Structure Mining: Web structure mining is the discovery of structure information from the web. The web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining produces a structured summary of a particular website. It identifies relationships between web pages that are linked by related information or by a direct link. Web structure mining can be very useful, for example, for determining the connection between two commercial websites.
• Web Usage Mining: Web usage mining is the discovery of interesting usage patterns from large data sets. These patterns help in understanding user behavior. As users access the web, data about their activity is collected in the form of logs, so web usage mining is also called log mining.

1. What is Web Content Mining?

Pre-requisites: Web Mining
Web Content Mining is one of the three types of techniques in Web Mining. This section discusses only Web Content Mining. The mining, extraction, and integration of useful data, information, and knowledge from web page content is known as Web Content Mining. In simple words, it is the application of web mining that extracts relevant or useful information from the content of the Web. Web content mining is related to, but different from, other mining techniques such as data mining and text mining. Because web data is heterogeneous and lacks structure, automated discovery of new knowledge patterns can be challenging.
Web data is generally semi-structured or unstructured, while data mining is primarily concerned with structured data. Web content mining scans and mines text, images, and groups of web pages according to the content of the input, as search engines do when they display a list of results.
For example, if the user searches for a particular song, the search engine displays results or suggestions relevant to it.
Web content mining deals with different kinds of data such as text, audio, video, images, etc.

Unstructured Web Data Mining

Unstructured data includes data such as audio, video, etc. Web content mining converts this unstructured data into structured data, i.e., into useful information. The conversion process is described below:

Unstructured Documents Feature Extraction:

1. Bag of words to represent unstructured documents:
• Takes a single word as a feature.
• Ignores the sequence or order in which words occur.
2. Features could be:
• Boolean: whether the word occurs in the document or not.
• Frequency-based: the number of times the word occurs in the document.
3. Variations of feature selection include:
• Removal of case, punctuation, infrequent words, stop words, etc.
4. Features can be reduced using different feature selection techniques:
• Information gain: measures the difference between probability distributions.
• Stemming: reduces words to their morphological roots.
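The bag-of-words steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the stop-word list, the `tokenize` and `bag_of_words` helpers, and the two sample documents are all assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return the vocabulary plus Boolean and frequency vectors per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[int(c[w] > 0) for w in vocab] for c in counts]
    return vocab, boolean, freq

docs = ["The web is a graph of pages.", "Pages link to pages in the web."]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)    # ['graph', 'link', 'pages', 'web']
print(boolean)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
print(freq)     # [[1, 0, 1, 1], [0, 1, 2, 1]]
```

Word order is discarded on purpose: both vectors only record which words occur and how often, which is exactly what the bag-of-words representation means.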

Mining Techniques Using Agents and Databases:

1. Agent-Based Approaches:
• Intelligent search: the search is driven by a particular goal of the user and returns results based on that goal.
• Information filtering/categorization: filters data, i.e., removes unwanted or redundant information, using AI-based methods. Examples include HyPursuit and BO (Bookmark Organizer).
• Sophisticated AI systems: increasingly capable AI systems act on behalf of users, with or without supervision. One example is deep learning, in which the system is trained by feeding it certain kinds of data.

2. Database Approaches:
These transform unstructured data into a more structured, high-level collection of resources, such as a relational database, and then use standard database querying mechanisms and data mining techniques to access and analyze the information.
• Multilevel databases:
  • Lowest level: semi-structured information is kept.
  • Higher levels: generalizations from the lower levels, organized into relations and objects.
• Web query systems:
  • Web query systems use SQL-like query languages and natural language processing to extract data.
Web Content Mining Techniques:
1. Pre-processing
2. Clustering
3. Classifying
4. Identifying the associations
5. Topic identification, tracking, and drift analysis

Applications of Web Content Mining:


1. Classifying web documents into categories.
2. Identifying the topics of web documents.
3. Finding similar web pages across different web servers.
4. Applications related to relevance.

2. Web Structure Mining


Last Updated : 30 Nov, 2022
Pre-requisites: Web Mining
Web Structure Mining is one of the three types of techniques in Web Mining. This section discusses only Web Structure Mining. Web Structure Mining is the technique of discovering structure information from the web. It uses graph theory to analyze the nodes and connections in the structure of a website.
Depending upon the type of web structural data, Web Structure Mining can be categorized into two types:
1. Extracting patterns from hyperlinks in the Web: The Web works through a system of hyperlinks using the Hypertext Transfer Protocol (HTTP). A hyperlink is a structural component that connects a web page to a different location. Any page can link to any other page, and that page can in turn link to other pages. The intertwined, self-referential nature of the web lends itself to some unique network analysis algorithms. The structure of web pages can also be analyzed to examine the pattern of hyperlinks among pages.

2. Mining the document structure: This is the analysis of the tree-like structure of a web page to describe its HTML or XML tag usage.
There are several terms associated with Web Structure Mining:
• Web graph: the directed graph representing the Web.
• Node: a web page in the graph.
• Edge: a hyperlink between web pages in the graph.
• In-degree: the number of hyperlinks pointing to a particular node.
• Out-degree: the number of hyperlinks leaving a particular node.
[Figure: diagram of a web graph illustrating these terms]
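As a small illustration of the web graph terms, the sketch below counts in-degree and out-degree for a hypothetical four-page site. The page names and the `links` edge list are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy web graph: each pair is a hyperlink (source page, target page).
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

def degrees(edges):
    """Return {page: (in_degree, out_degree)} for a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1   # one more link leaving src
        in_deg[dst] += 1    # one more link pointing at dst
    nodes = sorted({n for edge in edges for n in edge})
    return {n: (in_deg[n], out_deg[n]) for n in nodes}

page_degrees = degrees(links)
print(page_degrees)
# {'A': (1, 2), 'B': (1, 1), 'C': (3, 1), 'D': (0, 1)}
```

Page C has the highest in-degree (three pages link to it), while page D is linked to by nobody.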
Example of Web Structure Mining:
One such technique is the PageRank algorithm, which Google uses to rank web pages. The rank of a page depends on the number and the quality of the pages that link to it.
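The idea that rank flows along links can be sketched with a simple power iteration. This is a textbook-style simplification, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page `web` graph are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to.

    Every link target must also appear as a key of graph.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy web graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page C ends up ranked highest because three pages link to it, and page D lowest because nothing links to it.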
Web Structure Mining can therefore be performed either at the document level (intra-page) or at the hyperlink level (inter-page). Research done at the hyperlink level is called hyperlink analysis. The hyperlink structure can be used to retrieve useful information on the Web.
Web structure mining has two main approaches, which also serve as two basic strategic models for successful websites:

• PageRank
• Hubs and Authorities

Hubs and Authorities

• Hubs: These are pages with a large number of interesting links. They serve as a hub or gathering point that people visit to access a variety of information. More focused sites can aspire to become a hub for newly emerging areas. The pages of a website can themselves be analyzed for the quality of content that attracts the most users.

• Authorities: People usually gravitate towards pages that provide the most complete and authentic information on a particular subject. This could be factual information, news, advice, etc. These websites have the largest number of inbound links from other websites.
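Hub and authority scores reinforce each other: a good hub links to good authorities, and a good authority is linked to by good hubs. The sketch below shows this mutual update in the style of the HITS algorithm. It is a minimal illustration; the page names, the `web` graph, and the fixed iteration count are assumptions.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy site: a portal and a blog that link out to content pages.
web = {
    "portal": ["docs", "news", "shop"],
    "blog": ["docs", "news"],
    "news": ["docs"],
    "docs": [],
    "shop": [],
}
hub, auth = hits(web)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

Here the portal comes out as the strongest hub because it links to the most authoritative pages, and the docs page as the strongest authority because the most hubs link to it.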

Applications of Web Structure Mining:

• Information retrieval in social networks.
• Finding the relevance of each web page.
• Measuring the completeness of websites.
• Finding relevant information in search engines.

3. What is Web Usage Mining?

Web usage mining, a subset of data mining, is the extraction of interesting information about how users interact with the pages of the World Wide Web (WWW). As an application of data mining techniques, it helps analyze user activity on different web pages and track it over a period of time. Web usage mining can be divided into subcategories based on the kind of usage data involved.
There are 3 main types of web data:

1. Web Content Data: The common forms are HTML web pages, images, audio, video, etc. The main form is HTML. Although rendering may differ from browser to browser, the basic layout/structure is the same everywhere, which is why HTML is the most popular form of web content data. XML and dynamic server pages such as JSP and PHP are also forms of web content data.
2. Web Structure Data: On a web page, content is arranged according to HTML tags; this is known as intra-page structure information. Web pages usually also have hyperlinks that connect one page to other pages; this is called inter-page structure information. Web structure data is thus the set of relationships/links describing the connections between web pages.
3. Web Usage Data: The main sources here are the web server and the application server. Usage data consists of the log data collected by these two sources. Log files are created when a user/customer interacts with a web page. This data can be categorized into three types based on where it is collected:
• Server side
• Client side
• Proxy side
Additional data sources include cookies, demographics, etc.

Types of Web Usage Mining based upon the Usage Data:

1. Web Server Data: Web server data generally includes IP addresses, browser logs, proxy server logs, user profiles, etc. The web server collects these user logs.
2. Application Server Data: Commercial application servers make it possible to build applications on top of them. Application server data mainly consists of the business events that are tracked and recorded in the application server logs.
3. Application-Level Data: An application can define various new kinds of events. Enabling logging for them provides a past record of those events.
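Before web server data can be mined, raw log lines have to be parsed and grouped into user sessions. The sketch below assumes the server writes the widely used Common Log Format and treats each IP address as one user, which is a simplification. The sample log lines, the 30-minute inactivity timeout, and the helper names are all illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Regular expression for one Common Log Format line (assumed log format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Turn one log line into a dict, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "ip": m["ip"],
        "time": datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "path": m["path"],
        "status": int(m["status"]),
    }

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group each IP's requests into sessions split by periods of inactivity."""
    sessions = []
    last = {}  # ip -> (time of last request, that user's current session)
    for r in sorted(records, key=lambda rec: rec["time"]):
        prev = last.get(r["ip"])
        if prev is None or r["time"] - prev[0] > timeout:
            session = [r["path"]]          # start a new session
            sessions.append(session)
        else:
            session = prev[1]              # continue the current session
            session.append(r["path"])
        last[r["ip"]] = (r["time"], session)
    return sessions

# Made-up sample log lines.
log_lines = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /home HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:10 +0000] "GET /products HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2023:13:57:00 +0000] "GET /home HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:15:10:00 +0000] "GET /cart HTTP/1.1" 200 128',
]
records = [r for r in map(parse_line, log_lines) if r is not None]
sessions = sessionize(records)
print(sessions)  # [['/home', '/products'], ['/home'], ['/cart']]
```

The last request from 10.0.0.1 arrives more than 30 minutes after the previous one, so it starts a new session. Sessions like these are the input to the usage mining techniques discussed next.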
Advantages of Web Usage Mining

• Government agencies benefit from this technology in countering terrorism.
• The predictive capabilities of mining tools have helped identify various criminal activities.
• Companies understand customer relationships better with the aid of these mining tools, which helps them satisfy customer needs faster and more efficiently.

Disadvantages of Web Usage Mining

• Privacy stands out as a major issue. Analyzing data for the benefit of customers is good, but using the same data for something else can be dangerous. Using it without the individual's knowledge can pose a big threat to the company.
• If a data mining company lacks high ethical standards, two or more attributes can be combined to derive personal information about a user, which is not acceptable.

Some Techniques in Web Usage Mining

1. Association Rules: The most used technique in web usage mining is association rules. This technique focuses on relations among web pages that frequently appear together in users' sessions. Pages accessed together are grouped into a single server session. Association rules help in restructuring websites using the access logs. Access logs generally contain information about the requests arriving at the web server. The major drawback of this technique is that it can produce so many rules that some of them are completely inconsequential and of no future use.
2. Classification: Classification maps a particular record to one of several predefined classes. The main goal in web usage mining is to develop profiles of users/customers that belong to a particular class/category. This requires extracting the features that best describe the associated class. Classification can be implemented with various algorithms, including support vector machines, K-nearest neighbors, logistic regression, decision trees, etc. For example, given customers' purchase history for the last 6 months, each customer can be classified as frequent or non-frequent. Other cases may involve more than two classes.
3. Clustering: Clustering groups together a set of things having similar features/traits. There are mainly 2 types of clusters: usage clusters and page clusters. In usage-based clustering, items that are commonly accessed/purchased together can be automatically organized into groups, and clustering of users establishes groups of users exhibiting similar browsing patterns. Page clustering can be readily performed based on the usage data; its basic aim is to help users reach related information on web pages quickly.
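To make the association rule technique concrete, the sketch below derives simple one-to-one rules (page X implies page Y) from a handful of sessions using support and confidence thresholds. The sessions, the thresholds, and the `pair_rules` helper are assumptions for illustration; real systems typically use algorithms such as Apriori to handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the set of pages each visitor viewed.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/blog"},
    {"/products", "/cart"},
]

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) for pairs of pages."""
    n = len(sessions)
    page_counts = Counter(p for s in sessions for p in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of sessions containing both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the sessions with lhs, how many also have rhs.
            confidence = count / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

rules = pair_rules(sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f} confidence={confidence:.2f}")
```

The thresholds address the drawback noted above: without a minimum support and confidence, every co-occurring pair would yield a rule, and most of those rules would be inconsequential.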

Applications of Web Usage Mining

1. Personalization of Web Content: The World Wide Web holds a lot of information and is expanding rapidly. The problem is that people's specific needs keep growing, and quite often a query does not return the result they want. A solution to this is web personalization, which may be defined as catering to a user's needs based on tracking of their navigational behavior and interests. Web personalization includes recommender systems, check-box customization, etc. Recommender systems are popular and are used by many companies.
2. E-commerce: Web usage mining plays a vital role in web-based companies, since their ultimate focus is on customer attraction, customer retention, cross-sales, etc. To build a strong relationship with the customer, a web-based company needs to rely on web usage mining, which provides many insights into customers' interests. It also shows the company where its web design can be improved.
3. Prefetching and Caching: Prefetching means loading data before it is required in order to reduce the time spent waiting for it, hence the term 'prefetch'. The results of web usage mining can be used to design prefetching and caching strategies, which in turn can greatly reduce server response time.
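One simple way usage data can drive prefetching is to count which page most often follows each page and prefetch that page. The sketch below does this with first-order transition counts. The sessions and helper names are assumptions, and this is one possible strategy rather than a standard prescribed by the text.

```python
from collections import Counter, defaultdict

def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, page):
    """Return the most frequent follower of page, or None if none was seen."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical ordered sessions reconstructed from access logs.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/blog"],
    ["/products", "/cart"],
]
transitions = build_transitions(sessions)
print(predict_next(transitions, "/home"))      # /products
print(predict_next(transitions, "/products"))  # /cart
print(predict_next(transitions, "/cart"))      # None
```

A server or browser could use such a prediction to load `/products` in the background while the user is still on `/home`.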

Challenges of Web Mining

• Complexity of web pages: There is no cohesive framework across a site's pages, so compared with conventional text documents they are far more intricate. The web's digital library contains a vast number of documents, and they are not arranged in any set order.
• Dynamic data sources: Online data is updated in real time. News, weather, fashion, finance, sports, and so forth change constantly, which makes them hard to capture accurately.
• Data relevancy: A particular person is typically concerned with only a small portion of the web; the remaining portion contains data that is unfamiliar to the user and may produce unexpected results.
• Size of the web: The web is growing very quickly and appears to be too big for data mining and data warehousing.

Comparison between Data Mining and Web Mining

| Parameters | Data Mining | Web Mining |
| --- | --- | --- |
| Definition | Data mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system. | Web mining is the application of data mining techniques to automatically discover and extract information from web documents. |
| Application | Data mining is very useful for web page analysis. | Web mining is very useful for a particular website and e-service. |
| Target Users | Data scientists and data engineers. | Data scientists along with data analysts. |
| Structure | Data mining gets its information from explicit structure. | Web mining gets its information from structured, unstructured and semi-structured web pages. |
| Problem Type | Clustering, classification, regression, prediction, optimization and control. | Web content mining, web structure mining. |
| Tools | It includes tools like machine learning algorithms. | Special tools for web mining are Scrapy, PageRank and Apache logs. |
| Skills | It includes approaches for data cleansing, machine learning algorithms, statistics and probability. | It includes application-level knowledge and data engineering with mathematical modules like statistics and probability. |

********************************

Application of HITS Algorithm:

**1. Search Engine Ranking:**

- The HITS (Hyperlink-Induced Topic Search) algorithm can be used by search engines to rank web pages based on their authority and hub scores.
- Search engines analyze the link structure of the web to identify hubs (pages that link to many good authorities) and authorities (pages that are linked to by many good hubs).
- Pages with many incoming links from good hubs are considered high-quality and are ranked higher in search results.

**2. Social Network Analysis:**

- The HITS algorithm can be applied to social networks to identify influential users (hubs) and relevant content (authorities).
- In social networks, influential users who connect to a lot of valuable content are considered hubs, while content that is widely shared or referenced is considered an authority.

**3. Recommendation Systems:**

- The HITS algorithm can be used in recommendation systems to identify hub items and authority items.
- For example, in a product recommendation system, products that link to or reference many other good products can be considered hubs, while products that are frequently referenced, co-purchased, or recommended together with others can be considered authorities.

Application of LOGSOM Algorithm:

**1. Text Mining and Information Retrieval:**

- LOGSOM is a self-organizing map (SOM) technique that organizes web pages on a two-dimensional map according to the user navigation patterns recorded in web logs. Self-organizing maps are also commonly used for clustering and visualizing textual data extracted from web pages, documents, or social media.
- In web content mining, a SOM can analyze the textual content of web pages to discover patterns, topics, and relationships.
- It can be applied to categorize documents, identify themes, and visualize the semantic structure of textual data.

**2. Sentiment Analysis:**

- SOM-based clustering can be employed in sentiment analysis applications to analyze the sentiment expressed in user-generated content on the web, such as product reviews, social media posts, or forum discussions.
- By clustering and visualizing textual data with a SOM, businesses can gain insights into customer opinions, preferences, and sentiment trends.

**3. Personalization and Recommendation Systems:**

- The LOGSOM algorithm can enhance personalization and recommendation systems by clustering pages and users based on browsing behavior.
- By analyzing user interactions and content consumption patterns, LOGSOM can identify clusters of users with similar interests or preferences, enabling targeted content recommendations or personalized experiences on websites.
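The clustering idea behind a self-organizing map can be sketched in plain Python. This is a generic, minimal SOM applied to made-up page-by-session vectors, not the published LOGSOM implementation; the grid size, learning rate schedule, seed, and the `page_visits` data are all assumptions for illustration.

```python
import math
import random

def best_unit(weights, vector):
    """Return the grid cell whose weight vector is closest to vector."""
    return min(
        weights,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(weights[cell], vector)),
    )

def train_som(vectors, rows=2, cols=2, epochs=100, lr0=0.5, radius0=1.0, seed=0):
    """Train a small self-organizing map and return {grid cell: weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # shrink learning rate and radius
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.01)
        for vector in vectors:
            bmu = best_unit(weights, vector)  # best matching unit on the grid
            for cell, w in weights.items():
                grid_dist2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                # Pull the cell's weights toward the input, more so near the BMU.
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
    return weights

# Hypothetical data: for each page, 1 if it was visited in session 1..4, else 0.
page_visits = {
    "/home":     [1, 1, 1, 0],
    "/products": [1, 1, 0, 1],
    "/cart":     [1, 0, 0, 1],
    "/checkout": [1, 0, 0, 1],
    "/blog":     [0, 0, 1, 0],
}
weights = train_som(list(page_visits.values()))
page_cells = {page: best_unit(weights, v) for page, v in page_visits.items()}
for page, cell in page_cells.items():
    print(page, "-> map cell", cell)
```

Pages that are visited in the same sessions have similar vectors and therefore land in the same or neighboring map cells, which is what makes the map usable for grouping related pages or users.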

Conclusion:
The HITS algorithm is primarily used for analyzing hyperlink structures and identifying hub and authority pages. It finds applications in search engine ranking, social network analysis, and recommendation systems.
The LOGSOM algorithm, on the other hand, uses a self-organizing map to cluster web pages from usage (log) data, and the same SOM approach can be applied to textual data extracted from web pages. It finds applications in text mining, sentiment analysis, and personalization systems. Both algorithms play important roles in extracting valuable insights from web data, enabling businesses to enhance their operations, understand user behavior, and improve decision-making.
