Business Data Mining Week 13
Web mining is the practice of sifting through the vast amount of data available on the World Wide Web to find and extract information relevant to a given requirement. The web offers several distinct kinds of raw material, and each leads to a different mining method: web pages contain text, pages are connected by hyperlinks, and web server logs record user behavior. Web mining is therefore an interdisciplinary field, combining methods from data mining, machine learning, artificial intelligence, statistics, and information retrieval.
Analyzing user behavior and website traffic is one basic example of web mining. Web mining is broadly divided into three techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.
Categories of Web Mining
Web Content Mining: Web content mining is the extraction of useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data is the collection of facts a web page is designed to convey, and it can reveal effective and interesting patterns about user needs. Because text documents dominate, this area draws on text mining, machine learning, and natural language processing, and is also known as text mining. It scans and mines text, images, and groups of web pages according to the content of the input query.
Web Structure Mining: Web structure mining is the discovery of structural information from the web. The web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining produces a structured summary of a particular website and identifies relationships between web pages linked by shared information or direct link connections. For example, it can be used to determine the connection between two commercial websites.
Web Usage Mining: Web usage mining is the discovery of interesting usage patterns from large data sets; these patterns help in understanding user behavior. In web usage mining, users' access data on the web is collected in the form of logs, so web usage mining is also called log mining.
Boolean representation: each term either occurs or does not occur in a document.
Pre-processing: removal of case, punctuation, very infrequent words, and stop words.
Intelligent search: a search driven by a particular goal of the user, returning results based on conclusions drawn about that goal.
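The Boolean representation and pre-processing steps above can be sketched as follows. This is a minimal illustration; the mini-corpus and the stop-word list are fabricated for the example, not taken from any real system.

```python
import re

# Illustrative (not exhaustive) stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def boolean_vector(doc_tokens, vocabulary):
    """1 if the term occurs in the document, 0 otherwise."""
    terms = set(doc_tokens)
    return [1 if term in terms else 0 for term in vocabulary]

docs = ["The price of the product", "Product reviews and ratings"]
tokenized = [preprocess(d) for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
vectors = [boolean_vector(doc, vocabulary) for doc in tokenized]
```

Each document becomes a 0/1 vector over the shared vocabulary, which is the simplest content representation used in web content mining.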
2. Database Approaches:
These transform unstructured web data into a more structured, higher-level collection of resources, such as relational databases, so that standard database querying mechanisms and data mining techniques can be used to access and analyze the information.
Multilevel Databases:
High level: generalizations of the lower levels, organized into relations and objects.
Web query systems are developed, using query languages such as SQL together with natural language processing, to extract data.
Web Content Mining Techniques:
1. Pre-processing
2. Clustering
3. Classifying
4. Identifying the associations
5. Topic identification, tracking, and drift analysis
2. Mining the document structure: the analysis of the tree-like structure of a web page to describe HTML or XML tag usage. There are several terms associated with web structure mining:
Edge(s): an edge represents a hyperlink between web pages in the web graph.
Degree(s): the degree of a node is the number of links attached to it; links originating from a node are counted as its out-degree.
These terms become clearer in the following diagram of a web graph:
Example of Web Structure Mining:
One well-known technique is the PageRank algorithm, which Google uses to rank its web pages. The rank of a page depends on the number and quality of the links pointing to it.
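The idea can be sketched in a few lines of Python. This is a minimal PageRank sketch on a fabricated four-page web graph, with the conventional damping factor of 0.85; it is an illustration, not Google's production algorithm.

```python
# links maps each page to the pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
# Page C receives links from every other page, so it ends up ranked highest.
```

Note that a page's rank grows with both how many pages link to it and how highly those linking pages are themselves ranked, which is exactly the "number and quality of links" idea above.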
Web structure mining can therefore be performed either at the document level (intra-page) or at the hyperlink level (inter-page). Research done at the hyperlink level is called hyperlink analysis, and the hyperlink structure can be used to retrieve useful information from the web.
Web structure mining identifies two basic strategic models for successful websites:
Hubs: pages with a large number of interesting outgoing links. They serve as gathering points that people visit to access a variety of information. Focused sites can aspire to become hubs for newly emerging areas, and the pages on such sites can themselves be analyzed for the quality of content that attracts the most users.
Authorities: people gravitate towards pages that provide the most complete and authentic information on a particular subject, whether factual information, news, advice, etc. These websites have the largest number of inbound links from other websites.
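The hub/authority distinction is the basis of the HITS algorithm, which can be sketched as a pair of mutually reinforcing scores. The toy graph below is fabricated for illustration: a "portal" page links out to content pages, and "news" receives the most inbound links.

```python
# Minimal HITS sketch: links maps each page to the pages it links to.
def hits(links, iterations=50):
    pages = set(links) | {t for out in links.values() for t in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        for p in pages:
            auth[p] = sum(hub[q] for q, out in links.items() if p in out)
        norm = sum(a * a for a in auth.values()) ** 0.5
        auth = {p: a / norm for p, a in auth.items()}
        # Hub score: sum of authority scores of pages linked to.
        for p in pages:
            hub[p] = sum(auth[t] for t in links.get(p, []))
        norm = sum(h * h for h in hub.values()) ** 0.5
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth

links = {"portal": ["news", "advice"], "blog": ["news"], "news": [], "advice": []}
hub, auth = hits(links)
# "portal" links to both authorities, so it gets the top hub score;
# "news" has the most inbound links, so it gets the top authority score.
```

Good hubs point to good authorities and good authorities are pointed to by good hubs, so the two scores are refined together until they stabilize.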
1. Web Content Data: The common forms of web content data are HTML web pages, images, audio, video, etc., with HTML being the most common. Although rendering may differ from browser to browser, the basic layout/structure is the same everywhere. XML and dynamic server pages such as JSP and PHP are other forms of web content data.
2. Web Structure Data: On a web page, content is arranged according to HTML tags (known as intra-page structure information). Web pages also contain hyperlinks connecting the main page to sub-pages; this is called inter-page structure information. The relationships/links describing the connections between web pages are web structure data.
3. Web Usage Data: The main sources of data here are the web server and the application server. Usage data consists of the log files created when a user/customer interacts with a web page, and it can be categorized into three types based on where it is collected:
Server-side
Client-side
Proxy-side
There are additional data sources as well, including cookies, demographics, etc.
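Server-side log data of the kind mentioned above is usually stored one request per line. The sketch below parses a single entry in the widely used Common Log Format; the log line itself is fabricated for illustration.

```python
import re

# Fields of an Apache/Nginx-style Common Log Format entry.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the named fields of one log entry, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = '192.168.0.7 - - [13/Apr/2024:10:02:11 +0000] "GET /products/42 HTTP/1.1" 200 5120'
record = parse_log_line(line)
```

Turning raw log lines into structured records like this is the pre-processing step on which all subsequent web usage mining rests.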
The predictive capabilities of mining tools have helped identify various criminal activities. Companies also understand customer relationships better with the aid of these tools, helping them satisfy customer needs faster and more efficiently.
Privacy stands out as a major issue. Analyzing data for the benefit of customers is fine, but using the same data for other purposes can be dangerous, and using it without the individual's knowledge poses a serious threat. Without high ethical standards in a data mining company, two or more attributes can be combined to reveal personal information about a user, which again is not acceptable.
1. Association Rules: The most used technique in web usage mining is association rules. This technique focuses on relations among web pages that frequently appear together in users' sessions; pages accessed together are grouped into a single server session. Association rules help in restructuring websites using access logs, which generally contain information about requests reaching the web server. The major drawback of this technique is that producing so many rules at once may leave some of them completely inconsequential and of no future use.
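The two measures behind association rules, support and confidence, can be computed directly from sessions. The page names and session data below are fabricated for illustration; this is a minimal sketch, not a full Apriori implementation.

```python
# Toy sessions: each is the set of pages a user accessed in one server session.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]

def support(itemset, sessions):
    """Fraction of sessions containing every page in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(antecedent, consequent, sessions):
    """Estimate of P(consequent pages | antecedent pages)."""
    return support(antecedent | consequent, sessions) / support(antecedent, sessions)

# Example rule: sessions that view "products" also reach "cart".
sup = support({"products", "cart"}, sessions)
conf = confidence({"products"}, {"cart"}, sessions)
```

Rules with high support and confidence suggest pages worth linking directly, which is how association rules inform website restructuring; thresholds on both measures are what keep the inconsequential rules mentioned above out of the result set.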
2. Classification: Classification maps a particular record to one of several predefined classes. In web usage mining, the main goal is to develop profiles of users/customers associated with a particular class/category, which requires extracting the features best suited to each class. Classification can be implemented by various algorithms, including support vector machines, k-nearest neighbours, logistic regression, and decision trees. For example, given customers' purchase histories over the last six months, customers can be classified into frequent and non-frequent categories; other cases may involve more than two classes.
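The frequent/non-frequent example can be sketched with k-nearest neighbours, one of the algorithms listed above. The customer features (purchase count and average order value over six months) and labels are fabricated for illustration.

```python
from collections import Counter

# (purchases in last 6 months, average order value) -> class label
train = [
    ((14, 60.0), "frequent"),
    ((11, 45.0), "frequent"),
    ((2, 80.0), "non-frequent"),
    ((1, 30.0), "non-frequent"),
]

def knn_predict(x, train, k=3):
    """Predict the majority class among the k nearest training records."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, features)) ** 0.5, label)
        for features, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

label = knn_predict((12, 50.0), train)
# A customer with 12 recent purchases sits nearest the "frequent" examples.
```

In practice the features would be scaled first, since raw purchase counts and currency amounts are on very different ranges.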
3. Clustering: Clustering groups together items with similar features/traits. There are two main kinds of clusters: usage clusters and page clusters. In usage-based clustering, items that are commonly accessed or purchased together are automatically organized into groups, and clustering users establishes groups that exhibit similar browsing patterns. In page clustering, the basic idea is to group related pages so that information can be retrieved quickly over the web.
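Usage-based clustering can be sketched with a tiny k-means loop over session vectors. Each vector below marks which of four hypothetical pages (home, products, cart, blog) a session visited; the data and starting centroids are fabricated for illustration.

```python
# Index of the centroid closest to a point (squared Euclidean distance).
def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def kmeans(points, centroids, iterations=10):
    """Alternate assigning points to centroids and recomputing centroid means."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, [nearest(p, centroids) for p in points]

# Session vectors over [home, products, cart, blog].
sessions = [(1, 1, 1, 0), (1, 1, 0, 0), (1, 0, 0, 1), (0, 0, 0, 1)]
centroids, labels = kmeans(sessions, centroids=[(1, 1, 1, 0), (0, 0, 0, 1)])
# Shopper-like sessions land in one cluster, reader-like sessions in the other.
```

The resulting groups are exactly the "users exhibiting similar browsing patterns" described above, and clustering pages instead of sessions works the same way with the matrix transposed.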
********************************
Conclusion:
The HITS algorithm is primarily used for analyzing hyperlink structures and identifying authoritative web pages and hubs. It finds applications in search engine ranking, social network analysis, and recommendation systems.
The LOGSOM algorithm, on the other hand, uses a self-organizing map to organize web pages according to users' navigation behavior recorded in server logs. It finds applications in web usage analysis and personalization systems. Both algorithms play crucial roles in extracting valuable insights from web data, enabling businesses to enhance their operations, understand user behavior, and improve decision-making processes.