DM-UNIT ADVANCED CONCEPTS
Syllabus
Web and Text Mining: Introduction, web mining, web content mining, web structure mining, web usage
mining, text mining – unstructured text, episode rule discovery for texts, hierarchy of categories, text
clustering.
https://jntuhbtechadda.blogspot.com/
Web Mining
• Web mining is the use of data mining techniques to automatically
discover and extract information from web documents and services.
• It discovers useful information from the World Wide Web and from its usage
patterns.
• It applies data mining techniques to make the web more useful and more
profitable (for some) and to increase the efficiency of our interaction with
the web.
Learning about Individual Users
Web content consists of several types of data: text, image, audio, video,
etc.
Content data is the collection of facts that a web page is designed to convey. Mining it can
reveal useful and interesting patterns about user needs.
The web graph consists of web pages as nodes and hyperlinks as edges
connecting related pages.
Web structure mining identifies relationships between web pages that are linked by shared information or by a direct link.
For example, it can be used to determine the connection between two commercial websites.
In block-level link analysis, the graph model is induced from two kinds of relationships:
block-to-page (link structure) and page-to-block (page layout).
Out-degree: The number of distinct links originating at a node that point to other nodes.
Directed path: A sequence of links, starting from a node r, that can be followed to reach another node t.
Shortest path: The path of minimum length among all the paths between nodes p and q.
Diameter: The maximum, over all pairs of nodes p and q in the Web graph, of the length of the shortest path
between p and q.
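The graph terms above can be made concrete with a small sketch. The toy graph and the function names below are hypothetical, chosen only for illustration; the diameter is computed over ordered pairs that are actually connected by a directed path, which is an assumption the definitions above do not state.

```python
from collections import deque

# Hypothetical toy Web graph: each page maps to the pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["D"],
    "D": ["A"],
}

def out_degree(graph, node):
    """Number of distinct links originating at `node`."""
    return len(set(graph.get(node, [])))

def shortest_path_length(graph, source, target):
    """Length in links of the shortest directed path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search visits nodes in order of distance
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def diameter(graph):
    """Largest shortest-path length over all connected ordered pairs."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    longest = 0
    for p in nodes:
        for q in nodes:
            if p == q:
                continue
            d = shortest_path_length(graph, p, q)
            if d is not None:
                longest = max(longest, d)
    return longest

print(out_degree(web_graph, "A"))                 # 2
print(shortest_path_length(web_graph, "A", "D"))  # 2
print(diameter(web_graph))                        # 3
```

Breadth-first search is used because every link counts as one step, so the first time a node is reached is also the shortest way to reach it.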
A hub is a web page, or a set of web pages, that provides a collection of links to authorities.
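Hub and authority scores are commonly computed with the iterative scheme known as HITS. The slides do not spell out the update rules, so the sketch below is an assumption based on the standard formulation: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph and page names are hypothetical.

```python
import math

def hits(graph, iterations=20):
    """Return (hub, authority) score dicts for {page: [pages it links to]}."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in set(targets):
                auth[target] += hub[source]
        # Hub update: sum the authority scores of all pages linked to.
        hub = {n: sum(auth[t] for t in set(graph.get(n, []))) for n in nodes}
        # Normalize both score vectors so the values stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / auth_norm for n, v in auth.items()}
        hub = {n: v / hub_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical example: "portal" links to three sites, "blog" to one of them.
toy_graph = {
    "portal": ["site1", "site2", "site3"],
    "blog": ["site1"],
}
hub, auth = hits(toy_graph)
best_hub = max(hub, key=hub.get)          # "portal"
best_authority = max(auth, key=auth.get)  # "site1"
```

In this toy graph the page that links to the most authorities ends up as the strongest hub, and the page linked to by both hubs ends up as the strongest authority.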
• Intrinsic links (links between pages with the same domain name) very often exist purely to allow
navigation of the infrastructure of a site, so they convey much less information than transverse links
(links between pages with different domain names) about the authority of the pages they point to.
• Thus, we delete all intrinsic links from the graph G[Sσ], keeping only the edges
corresponding to transverse links; this results in a graph Gσ.
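A minimal sketch of the link-filtering step just described. It assumes that "same site" can be approximated by comparing the host part of each URL; the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def transverse_links(links):
    """Keep only (source, target) pairs whose URLs have different hosts."""
    kept = []
    for source, target in links:
        source_host = urlparse(source).netloc.lower()
        target_host = urlparse(target).netloc.lower()
        if source_host != target_host:  # intrinsic links are dropped
            kept.append((source, target))
    return kept

links = [
    ("http://example.com/index.html", "http://example.com/about.html"),
    ("http://example.com/index.html", "http://example.org/paper.html"),
    ("http://example.org/paper.html", "http://example.com/index.html"),
]
filtered = transverse_links(links)
print(len(filtered))  # 2: the example.com -> example.com link is removed
```

Comparing full host names treats `www.example.com` and `example.com` as different sites; a real crawler would likely need a more careful notion of domain.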
The goal of web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users
interacting with a Web site.
The discovered patterns are usually represented as collections of pages, objects, or resources
that are frequently accessed by groups of users with common needs or interests.
Web server log: A file created by the server to record all the activities it performs.
For example, a page request is sent to the web server when a user enters a URL into the browser's address bar or clicks on a link.
For each request, the server records in its log information such as the requested URL, whether the request
was successful, the user's IP address, and the date and time.
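To illustrate the fields listed above, here is a sketch that parses one log line. It assumes the server writes entries in the widely used Common Log Format; the sample line, the field names, and the rule that 2xx and 3xx status codes count as "successful" are assumptions made for this example.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    # Assumption: treat 2xx and 3xx responses as successful requests.
    entry["successful"] = 200 <= entry["status"] < 400
    return entry

sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(sample)
print(entry["ip"], entry["url"], entry["successful"])
```

Once each line is a dictionary, requests can be grouped by IP address and time to approximate user sessions, which is the usual starting point for usage mining.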
Text Mining
• The amount of information available on the Web has increased rapidly (the information-explosion
era); the world's data is said to double every 18 months.
• Users demand useful and reliable information from the Web in the shortest time possible.
• Obstacles to fulfilling this demand include language barriers, diversified users, and users
who provide only vague specifications of the information they want.
• We must search for and extract information from Web texts using NLP
technologies.
• Data mining: extraction of interesting information (or patterns) from structured data.
• 80-90% of all data is held in various unstructured formats.
• NLP can be used as a preprocessing technique to harvest data and get an initial
understanding of the patterns that exist in the data.
Text Mining = Statistical NLP (structuring the data) + Data Mining (pattern discovery)
Data Mining                 Text Mining
• Identify data sets        • Identify documents
                            • Extract features
• Select features           • Select features by algorithm
• Prepare data              • Prepare data
• Analyze distribution      • Analyze distribution
Challenges
Statistical NLP
• POS tagging
• Ambiguity
Data Mining
• Massive amounts of data
• No training data available
• Text cleanup: removing unnecessary or unwanted information, for example removing ads from web pages or normalizing text converted from
binary formats.
• Part-of-speech tagging: assigning a word class to each token. Its input is the tokenized text. Taggers have to cope with
unknown words (the out-of-vocabulary, or OOV, problem) and ambiguous word-tag mappings.
A text document is represented by the words it contains and their occurrences. Two main approaches to document representation are:
i. Bag of words
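A minimal sketch of the bag-of-words representation, in which a document is reduced to its words and their occurrence counts and word order is discarded. The tokenizer (lowercasing and keeping runs of letters, digits, and apostrophes) and the example sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each word in `text` to the number of times it occurs."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

doc = "The cow jumps over the moon. The moon is bright."
bow = bag_of_words(doc)
print(bow["the"], bow["moon"], bow["cow"])  # 3 2 1
```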
Feature selection, also known as variable selection, is the process of selecting a subset of important
features for use in model creation. Redundant features are those that provide no extra information beyond the features already selected.
Irrelevant features provide no useful information in any context.
d. Data Mining
At this point, the text mining process merges with the traditional data mining process. Classic data mining
techniques are applied to the structured database that resulted from the previous stages.
e. Evaluate
Information Extraction: It identifies relations within the text, using pattern matching to find them.
Sentiment Analysis: Also known as opinion mining, it classifies the emotion expressed by users, mostly into
several classes such as positive, negative, neutral, and mixed. It is mainly used to gauge people's views or
attitudes towards something, such as a service or a product.
Text Mining: Analysis
Unstructured Text
N-grams
N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of
co-occurring words within a given window. When computing the n-grams you typically move one word forward
(although you can move X words forward in more advanced scenarios).
For example, for the sentence “The cow jumps over the moon”, if N=2 (known as bigrams), then the n-grams would be:
• the cow
• cow jumps
• jumps over
• over the
• the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to
jumps->over, etc, essentially moving one word forward to generate the next bigram.
• When N=1, the n-grams are referred to as unigrams; these are essentially the individual words in a
sentence.
• When N=2 they are called bigrams, and when N=3, trigrams.
• When N>3 they are usually referred to as four-grams, five-grams, and so on.
If X is the number of words in a given sentence K, the number of n-grams for sentence K would be: X - (N - 1)
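The sliding-window procedure can be sketched in a few lines. The function name and the whitespace tokenization are assumptions; the window moves one word forward at a time, so a sentence of X words yields X - (N - 1) n-grams when X is at least N.

```python
def ngrams(sentence, n):
    """Return the word n-grams of `sentence`, moving one word forward each step."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
bigrams = ngrams(sentence, 2)
print(bigrams)
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
print(len(bigrams) == len(sentence.split()) - (2 - 1))  # True
```

If the sentence has fewer than N words, the range is empty and the function returns an empty list.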
Hierarchy of categories
Text Clustering
Text clustering is the application of cluster analysis to text-based documents. It uses machine learning and
natural language processing (NLP) to understand and categorize unstructured, textual data.
How it works
Typically, descriptors (sets of words that describe the topic matter) are extracted from the document first. Then
they are analyzed for how frequently they occur in the document compared to other terms. After
that, clusters of descriptors can be identified and then auto-tagged.
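The three steps above (extract descriptors, compare their frequencies, group and tag) can be sketched as follows. This is a deliberately simple single-pass clustering under stated assumptions, not a production method: the stop list, the similarity threshold of 0.3, the function names, and the four example documents are all hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"from", "by", "the", "a", "of"}  # hypothetical, tiny stop list

def descriptors(text):
    """Term-frequency vector of the non-stopword terms in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster(docs, threshold=0.3):
    """Assign each document to the first cluster whose seed is similar enough."""
    vectors = [descriptors(d) for d in docs]
    clusters = []  # lists of document indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters, vectors

def tag(members, vectors, k=1):
    """Auto-tag a cluster with its k most frequent descriptor terms."""
    total = Counter()
    for i in members:
        total.update(vectors[i])
    return [term for term, _ in total.most_common(k)]

docs = [
    "web mining extracts patterns from web pages",
    "text clustering groups similar text documents",
    "mining web pages reveals usage patterns",
    "clustering documents by similar text content",
]
clusters, vectors = cluster(docs)
tags = [tag(members, vectors) for members in clusters]
print(clusters)  # [[0, 2], [1, 3]]
print(tags)      # [['web'], ['text']]
```

Real systems usually weight terms (for example with TF-IDF) and use an algorithm such as k-means or hierarchical clustering; the greedy pass here only shows how frequency vectors lead to groups and tags.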
From there, the information can be used in any number of ways. Web search is probably the best-known
setting: a search engine such as Google has to analyze billions of web pages in order to return pages that
apply to a search term quickly and accurately.
Text clustering is one of the techniques that helps here: unstructured data from web pages is broken down and turned
into a matrix model, and pages are tagged with keywords that can then be used in search results.
Scatter/Gather
Applications