DM-UNIT ADVANCED CONCEPTS

The document provides an overview of web and text mining, detailing various types such as web content mining, web structure mining, and web usage mining. It emphasizes the importance of extracting useful information from the vast data available on the web and discusses techniques like natural language processing for text mining. Additionally, it covers challenges and processes involved in text mining, including feature selection and the use of n-grams.

Uploaded by

shinyyy2303

https://jntuhbtechadda.blogspot.com/

Syllabus
Web and Text Mining: Introduction, web mining, web content mining, web structure mining, web usage
mining. Text mining: unstructured text, episode rule discovery for texts, hierarchy of categories, text
clustering.


Why Web mining is needed


• Enormous wealth of information on the Web
    • Financial information (e.g. stock quotes)
    • Book/CD/Video stores (e.g. Amazon)
    • Restaurant information (e.g. Zagat's)
    • Car prices (e.g. CarPoint)
• Lots of data on user access patterns
    • Web logs contain the sequence of URLs accessed by users
• Possible to mine interesting nuggets of information
    • Example: tech stocks have corrections in the summer and rally from November until February


Web Mining
• Web mining is the use of data mining techniques to automatically
discover and extract information from web documents and services

• Discovering useful information from the World Wide Web and its usage
patterns

• Using data mining techniques to make the web more useful and more
profitable (for some) and to increase the efficiency of our interaction with
the web

Web Mining
Learning about Individual Users

• Knowing the customers and what they want

Web Mining Taxonomy


Web Content Mining


Web content mining is the process of extracting useful information
from the content of web documents.

Web content consists of several types of data: text, images, audio, video,
etc.

Content data is the collection of facts that a web page is designed to convey.
Mining it can reveal useful and interesting patterns about user needs.

The representation of a set of documents as vectors in a common vector
space is known as the vector space model.
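
As a rough illustration of the vector space model, the sketch below maps a few made-up documents onto term-count vectors over a shared vocabulary and compares them with cosine similarity. The sample documents and the function names (`to_vector`, `cosine_similarity`) are assumptions chosen for this example, and raw term counts are only one of several possible weightings.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a document as term counts, one dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents, used only for illustration.
documents = [
    "web mining extracts patterns from web data",
    "text mining extracts patterns from text",
    "restaurants publish menus online",
]

# The common vector space: one axis per distinct word in the collection.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
vectors = [to_vector(doc, vocabulary) for doc in documents]

print(cosine_similarity(vectors[0], vectors[1]))  # documents sharing several terms
print(cosine_similarity(vectors[0], vectors[2]))  # documents sharing no terms
```

Because every document lives in the same space, any pair can be compared directly, which is what later steps such as clustering rely on.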

Vision-based Page Segmentation

• Aims to extract the semantic structure of a web page based on its visual presentation


Web Structure Mining


Web structure mining is the process of discovering structure information from the web.

The structure of the web graph consists of web pages as nodes and hyperlinks as edges
connecting related pages.

Structure mining essentially produces a structured summary of a particular website.

It identifies relationships between web pages that are linked by information or by a direct link.

Web structure mining can be very useful for determining the connection between two
commercial websites.

The graph model in block-level link analysis is induced from two kinds of relationships:
block-to-page (link structure) and page-to-block (page layout).

Web Structure Terminology


• Web-graph: A directed graph that represents the Web.

• Node: Each Web page is a node of the Web-graph.

• Link: Each hyperlink on the Web is a directed edge of the Web-graph.

• In-degree: The number of distinct links that point to a node.

• Out-degree: The number of distinct links originating at a node that point to other nodes.

• Directed path: A sequence of links, starting from a node r, that can be followed to reach another node t.

• Shortest path: The path with the shortest length out of all the paths between nodes p and q.

• Diameter: The maximum of the shortest-path lengths between p and q, taken over all pairs of nodes p and q in
the Web-graph.
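
The terms above can be made concrete on a toy graph. The sketch below is only an illustration: the four-page `web_graph` is invented, the function names are assumptions, and the diameter is computed over reachable pairs only (an assumption needed because some pages in a directed graph cannot reach others).

```python
from collections import deque

# Hypothetical Web-graph: each key is a page, each value lists the pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(graph, node):
    """Number of distinct links leaving the node."""
    return len(set(graph.get(node, [])))

def in_degree(graph, node):
    """Number of distinct pages that link to the node."""
    return sum(1 for targets in graph.values() if node in targets)

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of links, or None if unreachable."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return distance + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None

def diameter(graph):
    """Longest shortest path, taken over pairs where a directed path exists."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    lengths = [shortest_path_length(graph, p, q) for p in nodes for q in nodes if p != q]
    reachable = [d for d in lengths if d is not None]
    return max(reachable) if reachable else 0

print(in_degree(web_graph, "C"), out_degree(web_graph, "A"))
print(shortest_path_length(web_graph, "D", "B"), diameter(web_graph))
```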


Web Structure Mining


Hyperlink-Induced Topic Search (HITS) is applied to a subgraph after a search is done on the
complete graph. It defines the authority ranking problem through mutual reinforcement between the
so-called hub and authority scores of graph nodes.

A hub is one or a set of web pages that provides a collection of links to authorities. An authority,
in turn, is a page that is pointed to by many good hubs.
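
The mutual reinforcement between hub and authority scores can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a minimal sketch under assumptions: the toy `link_graph`, the fixed iteration count, and the Euclidean normalization are illustrative choices, not details taken from the slides.

```python
import math

def hits(graph, iterations=50):
    """Iteratively update hub and authority scores on a dict-of-lists link graph."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    hub = {n: 1.0 for n in nodes}
    authority = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: add up the hub scores of all pages that link here.
        authority = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                authority[target] += hub[source]
        # Hub update: add up the authority scores of all pages this page links to.
        hub = {n: sum(authority[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize both score vectors so the values do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        authority = {n: v / a_norm for n, v in authority.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, authority

# Hypothetical subgraph: three pages that link out, two pages that are linked to.
link_graph = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1"],
    "hub3": ["auth1"],
}
hub_scores, authority_scores = hits(link_graph)
print(max(authority_scores, key=authority_scores.get))  # page with the top authority score
print(max(hub_scores, key=hub_scores.get))              # page with the top hub score
```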


Page rank algorithm


Transverse vs. Intrinsic Links

• A link is transverse if it is between pages with different domain names.

• A link is intrinsic if it is between pages with the same domain name.

• Since intrinsic links very often exist purely to allow for navigation of the infrastructure of a
site, they convey much less information than transverse links about the authority of the
pages they point to.

• Thus, we delete all intrinsic links from the graph G[Sσ], keeping only the edges
corresponding to transverse links; this results in a graph Gσ.
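
A minimal sketch of this filtering step follows. It assumes that "domain name" can be approximated by the hostname part of each URL, which is a simplification; the example URLs and the function names are hypothetical.

```python
from urllib.parse import urlparse

def is_transverse(source_url, target_url):
    """A link is transverse when source and target have different hostnames."""
    return urlparse(source_url).hostname != urlparse(target_url).hostname

def keep_transverse(edges):
    """Drop intrinsic links, keeping only edges between different hostnames."""
    return [(source, target) for source, target in edges if is_transverse(source, target)]

# Hypothetical edge list of (source page, target page) pairs.
edges = [
    ("http://example.com/index.html", "http://example.com/about.html"),  # intrinsic
    ("http://example.com/index.html", "http://example.org/paper.html"),  # transverse
    ("http://example.org/paper.html", "http://example.com/index.html"),  # transverse
]
transverse_edges = keep_transverse(edges)
print(len(transverse_edges))
```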


Web Usage Mining


Web usage mining refers to the automatic discovery and analysis of patterns and associated data
collected or generated as a result of user interactions with Web resources on one or more Web
sites.

The goal is to capture, model, and analyze the behavioral patterns and profiles of users
interacting with a Web site.

The discovered patterns are usually represented as collections of pages, objects, or resources
that are frequently accessed by groups of users with common needs or interests.


Web Usage Mining


• Web servers, Web proxies, and client applications can quite easily capture Web usage data.

• Web server log: a file created by the server to record all the activities it performs.

• For example, an entry is written when a user enters a URL into the browser's address bar or requests a page by clicking on a link.

• For each page request sent to the web server, the log records information such as the requested
URL, whether the request was successful, the user's IP address, and the date and time.
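
As an illustration of the fields listed above, the sketch below parses one line in the widely used Common Log Format. The sample line is invented, the slides do not say which log format is meant, and treating 2xx and 3xx status codes as "successful" is an assumption made for this example.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the request fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Assumption: 2xx and 3xx responses count as successful requests.
    entry["successful"] = 200 <= entry["status"] < 400
    return entry

# Hypothetical log line, for illustration only.
sample_line = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/index.html HTTP/1.1" 200 2326'
)
entry = parse_log_line(sample_line)
print(entry["ip"], entry["url"], entry["successful"], entry["time"].isoformat())
```

A sequence of such entries, grouped by IP address and ordered by time, gives the per-user sequence of URLs that usage mining works on.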


Text Mining
• The amount of information available on the Web has increased rapidly (the information-explosion
era); by one commonly quoted estimate, the world's data doubles every 18 months

• Users demand useful and reliable information from the Web in the shortest time possible

• Obstacles to fulfilling this demand include language barriers, diverse users, and the fact that
users may provide only vague specifications of the information they want

• We must therefore search for and extract information from Web texts using NLP
technologies


Text Mining
● Data mining: extraction of interesting information (or patterns) from structured data

● An estimated 80-90% of all data is held in various unstructured formats

● Useful information can be derived from this unstructured data

● Intelligence in text mining is based on NLP techniques

● NLP can be used as a preprocessing technique to harvest data and get an initial
understanding of the patterns that exist in the data

● Text mining is the analysis of data contained in natural language text

Text mining = statistical NLP (to produce structured data) + data mining (pattern discovery)


Unstructured Data Examples

• Customer complaint letters


• Contracts
• Transcripts of phone calls with customers
• Technical documents
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios


How Text Mining Differs from Data Mining

Data Mining                 Text Mining
• Identify data sets        • Identify documents
                            • Extract features
• Select features           • Select features by algorithm
• Prepare data              • Prepare data
• Analyze distribution      • Analyze distribution


Challenges
Statistical NLP
• POS tagging
• Ambiguity
• Tokenization / sentence detection / parsing
• Context
• Stemming
• Synonymy and polysemy

Data Mining
• Massive amounts of data
• No training data available

Text Mining Process


a. Text Pre-processing

It involves a series of steps, as shown below:

• Text cleanup: removing any unnecessary or unwanted information, for example removing ads from web pages or normalizing text converted from
binary formats.

• Tokenization: splitting the text into tokens, in the simplest case at white space.

• Part-of-speech (POS) tagging: assigning a word class to each token. Its input is the tokenized text. Taggers have to cope with
unknown words (the out-of-vocabulary, or OOV, problem) and ambiguous word-tag mappings.

b. Text Transformation (Attribute Generation)

A text document is represented by the words it contains and their occurrences. Two main approaches to document representation are:

i. Bag of words

ii. Vector space
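
The cleanup, tokenization, and bag-of-words steps can be sketched with the standard library alone. This is an illustrative sketch, not a full pipeline: the sample page is invented, the tokenizer uses a regular expression that also drops punctuation (a small departure from pure white-space splitting), and POS tagging is left out because it needs a trained tagger.

```python
import re
from collections import Counter

def clean_text(raw):
    """Text cleanup: strip HTML-like tags and collapse runs of white space."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()

def tokenize(text):
    """Tokenization: lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(tokens):
    """Bag of words: map each word to its number of occurrences, ignoring order."""
    return Counter(tokens)

# Hypothetical page fragment, for illustration only.
raw_page = "<p>Text mining finds patterns in text.</p>  <div>Buy now!</div>"
tokens = tokenize(clean_text(raw_page))
bag = bag_of_words(tokens)
print(tokens)
print(bag.most_common(1))
```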


Text Mining Process


c. Feature Selection (Attribute Selection)

Feature selection, also known as variable selection, is the process of selecting a subset of important
features for use in model creation. Redundant features are those that provide no extra information
beyond the features already selected. Irrelevant features provide no useful information in any context.
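
One simple heuristic for this step is document-frequency thresholding: drop terms that occur in almost every document (they rarely distinguish anything) and terms that occur in too few documents. The slides do not name a method, so this is a stand-in chosen for illustration; the thresholds `min_df` and `max_df_ratio` and the sample documents are assumptions.

```python
def select_features(documents, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and at most max_df_ratio of them."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)
    vocabulary = set().union(*doc_sets)
    selected = []
    for term in sorted(vocabulary):
        # Document frequency: number of documents that contain the term.
        df = sum(1 for words in doc_sets if term in words)
        if df >= min_df and df / n_docs <= max_df_ratio:
            selected.append(term)
    return selected

# Hypothetical documents, for illustration only.
documents = [
    "the web mining course",
    "the text mining course",
    "the web usage logs",
    "the text clustering notes",
]
features = select_features(documents)
print(features)
```

Here "the" is dropped because it appears in every document, and words that appear in only one document are dropped as too rare under the chosen threshold.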

d. Data Mining

At this point, the text mining process merges with the traditional data mining process. Classic data mining
techniques are applied to the structured database that resulted from the previous stages.

e. Evaluate

Evaluate the result. After evaluation, the result can be discarded.


Technology premise of Text Mining


• Summarization: the process of producing a summary of a document that contains a large amount of
information while preserving the theme or main idea of the document.

• Information extraction: uses the relations within the text, typically found by pattern matching.

• Categorization: a supervised learning technique that assigns a document to a category according to its
content. Document categorization is widely used in libraries.

• Visualization: uses computer graphics to represent information and reveal relationships.


Technology premise of Text Mining


• Clustering: an unsupervised technique, based on the textual similarity of documents, that is used in data
analysis to divide texts into mutually exclusive groups.

• Question answering: takes natural language queries or questions and decides how to find the most
suitable answer to a particular question.

• Sentiment analysis: also known as opinion mining, it classifies the user's emotion, mostly into
several classes such as positive, negative, neutral, and mixed. It is mainly used to learn people's views or
attitudes towards something, including services and products.

Text Mining: Analysis

• Which words are most frequent?

• Which words are most interesting?

• Which words help define the document?

• What are the interesting text phrases?


Unstructured Text


N-grams
N-grams of text are extensively used in text mining and natural language processing tasks. They are essentially
sets of co-occurring words within a given window. When computing the n-grams you typically move one word forward
(although you can move X words forward in more advanced scenarios).

For example, take the sentence “The cow jumps over the moon”. If N=2 (known as bigrams), then the n-grams would be:

• the cow

• cow jumps

• jumps over

• over the

• the moon


N-grams
So there are 5 bigrams in this case. Notice that we moved from the->cow to cow->jumps to
jumps->over, etc., essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:

• the cow jumps

• cow jumps over

• jumps over the

• over the moon
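
The sliding-window procedure described above can be sketched in a few lines. The function name `ngrams` and its `step` parameter (how many words to move forward between windows) are assumptions introduced for this example.

```python
def ngrams(sentence, n, step=1):
    """Slide a window of n words over the sentence, moving `step` words each time."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words) - n + 1, step)]

sentence = "The cow jumps over the moon"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)
print(ngrams(sentence, 2, step=2))  # moving two words forward instead of one
```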


N-grams
• When N=1, the n-grams are called unigrams; these are essentially the individual words in a
sentence.

• When N=2, they are called bigrams.

• When N=3, they are called trigrams.

• When N>3, they are usually referred to as four-grams, five-grams, and so on.


How many N-grams in a sentence?

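If X is the number of words in a given sentence K, the number of n-grams for sentence K would be:

$$\text{Ngrams}_K = X - (N - 1)$$

For the six-word sentence “The cow jumps over the moon”, this gives 6 - (2 - 1) = 5 bigrams and 6 - (3 - 1) = 4 trigrams.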


What are N-grams used for?

• N-grams are used for a variety of different tasks.

• For example, when developing a language model, n-grams are used to develop not just unigram
models but also bigram and trigram models.
• Google and Microsoft have developed web-scale n-gram models that can be used in a variety of
tasks such as spelling correction, word breaking, and text summarization.
• N-grams are also used to develop features for supervised machine learning models such as SVMs, MaxEnt
models, Naive Bayes, etc.
• The idea is to use tokens such as bigrams in the feature space instead of just unigrams.
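
To make the language-model use concrete, the sketch below estimates bigram probabilities by maximum likelihood: the probability of a word given the previous word is the count of that bigram divided by the number of bigrams that start with the previous word. The two-sentence corpus and the function names are assumptions, and no smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and how often each word occurs as the first word of a bigram."""
    bigram_counts = Counter()
    history_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            bigram_counts[(first, second)] += 1
            history_counts[first] += 1
    return bigram_counts, history_counts

def bigram_probability(model, first, second):
    """Maximum-likelihood estimate of P(second | first); 0.0 for an unseen history."""
    bigram_counts, history_counts = model
    if history_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / history_counts[first]

# Hypothetical toy corpus, for illustration only.
corpus = ["the cow jumps over the moon", "the cow eats grass"]
model = train_bigram_model(corpus)
print(bigram_probability(model, "the", "cow"))
print(bigram_probability(model, "the", "moon"))
```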


Episode rule discovery of texts


Hierarchy of categories


Text Clustering
Text clustering is the application of cluster analysis to text-based documents. It uses machine learning and
natural language processing (NLP) to understand and categorize unstructured, textual data.

How it works

Typically, descriptors (sets of words that describe the topic matter) are extracted from the document first. Then
they are analyzed for how frequently they occur in the document compared to other terms. After
that, clusters of descriptors can be identified and then auto-tagged.
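
A very small version of this idea can be sketched with term-count vectors and single-pass "leader" clustering: each document joins the first cluster whose first member is similar enough, or starts a new cluster otherwise. This is a deliberately simple stand-in, not the method of any particular system; the sample documents, the function names, and the similarity threshold of 0.3 are assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    """Descriptor counts for one document (here simply its lowercase words)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def leader_clustering(documents, threshold=0.3):
    """Assign each document to the first cluster whose leader is similar enough."""
    clusters = []  # each cluster keeps its leader vector and its member indices
    for index, doc in enumerate(documents):
        vector = term_vector(doc)
        for cluster in clusters:
            if cosine(vector, cluster["leader"]) >= threshold:
                cluster["members"].append(index)
                break
        else:
            clusters.append({"leader": vector, "members": [index]})
    return [cluster["members"] for cluster in clusters]

# Hypothetical documents, for illustration only.
documents = [
    "stock prices rally in november",
    "restaurant menu and restaurant reviews",
    "tech stock prices fall in summer",
    "reviews of a new restaurant",
]
groups = leader_clustering(documents)
print(groups)
```

The result depends on document order and on the threshold, which is one reason iterative methods are usually preferred for larger collections.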

From there, the information can be used in any number of ways. Web search is probably the most widely
known example. When you search for a term on Google, it pulls up pages that apply to that
term, and it has to analyze billions of web pages to deliver an accurate and fast result.

Text clustering is one of the techniques that can support this kind of task: unstructured data from web pages is
broken down and turned into a matrix model, and pages are tagged with keywords that are then used in search results.


Scatter/Gather


Applications
