Web Content Mining
Web content mining extracts useful information from the
content of the web: text, images, audio, video, metadata
and hyperlinks.
It examines both the content of the web and the results
of web searches.
Web mining helps to understand customer behavior and to
evaluate the performance of a web site, and the research
done in web content mining indirectly helps to boost
business.
Web content mining also examines the results returned by
search engines. Doing this manually consumes a lot of
time, and when the data to be analyzed is large it is
hard to find the relevant data. As in every other field,
technology has replaced manual work on the internet as
well. In the early stages the Web contained only a small
amount of data, so there was no need for web mining
tools. As the years passed the Web accumulated a large
amount of data, and retrieving data according to a
user's need became a hard task. Web mining came as a
rescue for this problem.
Web content mining can be further classified into:
● Web page content mining
The traditional search of web pages by their content.
● Search result mining
A further search of the pages found by a previous
search.
Two approaches are used in web content mining:
1) Agent-based approach
2) Database approach
1) Agent-based approach
There are three types of agents:
● Intelligent search agents
● Information filtering/categorizing agents
● Personalized web agents
Intelligent search agents automatically search for
information matching a particular query, using domain
characteristics and user profiles.
Information filtering/categorizing agents use a number
of techniques to filter data according to predefined
instructions.
Personalized web agents learn user preferences and
discover documents related to those user profiles.
2) Database approach
The database approach relies on a well-formed database
containing schemas and attributes with defined domains.
Web content mining becomes complicated when it has to
mine unstructured, structured, semi-structured and
multimedia data.
The figure explains the web content mining techniques.
[Figure: web content mining techniques]
Unstructured Data Mining Techniques
Content mining can be done on unstructured data such as
text. Mining unstructured data yields previously unknown
information.
Text mining is the extraction of previously unknown
information from different text sources. Content mining
requires the application of data mining and text mining
techniques.
Basic content mining is a type of text mining. Some of
the techniques used in text mining are:
● Information Extraction
● Topic Tracking
● Summarization
● Categorization
● Clustering
● Information Visualization
Information Extraction (IE)
Pattern matching is used to extract information from
unstructured data. It traces the keywords and phrases and
then finds the connections between the keywords within
the text. This technique is very useful when there is a
large volume of text. IE is the basis of many other
techniques used for mining unstructured data. Its output
can be passed to a KDD (knowledge discovery in
databases) module, because information extraction
transforms unstructured text into more structured data.
First, information is mined from the extracted data;
then, using different types of rules, the missing
information is found. Extraction results that lead to
incorrect predictions on the data are discarded.
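The pattern-matching step can be sketched as follows. This is a minimal illustration, not a method taken from the slides: the regular expression, the sample sentences and the company names are all made up, and a real extractor would need far more robust patterns.

```python
import re

# Hypothetical pattern: "<Buyer> acquired <Target>", optionally "in <year>".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][a-zA-Z]+) acquired (?P<target>[A-Z][a-zA-Z]+)"
    r"(?: in (?P<year>\d{4}))?"
)

def extract_acquisitions(text):
    """Return one structured record (a dict) per pattern match in the text."""
    return [match.groupdict() for match in ACQUISITION.finditer(text)]

# Made-up unstructured input text.
sample = (
    "Analysts noted that Acme acquired Globex in 2019. "
    "Later, Initech acquired Hooli."
)
records = extract_acquisitions(sample)
```

Each record is a small structured row (buyer, target, year) that a later mining step could consume; a missing year is kept as `None` rather than guessed.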
Topic Tracking
Topic tracking checks the documents viewed by a user and
studies the user's profile. For each user it predicts
other documents related to that user's interests. In the
topic tracking offered by Yahoo, a user can give a
keyword and is notified whenever anything related to the
keyword appears. The same can be applied to mining
unstructured data. For example, a company can select a
competitor's name, and whenever that name comes up in
the news the information is passed to the company.
Topic tracking can be applied in many fields, for
example medicine and education. In medicine, doctors can
easily learn about the latest treatments. In education,
topic tracking can be used to find the latest references
for research-related work. Topic tracking helps to track
all subsequent stories in a news stream.
A disadvantage of topic tracking is that a search for a
topic may return information that is not related to the
user's interest. For example, if a user sets an alert
for 'web mining', it can return topics related to
mineral mining, which are not useful to the user.
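The alert behavior and its disadvantage can be sketched with a naive keyword matcher. The function name and the headlines below are hypothetical; the point is only that matching on individual words also catches unrelated "mining" stories.

```python
def matching_stories(alert, stories):
    """Return the stories that contain any single word of the alert phrase."""
    words = set(alert.lower().split())
    return [story for story in stories if words & set(story.lower().split())]

# Made-up news headlines.
stories = [
    "New survey of web mining techniques published",
    "Copper mining output rises in Chile",
    "Local team wins the cup",
]

# Word-level matching: the mineral mining story is a false positive.
hits = matching_stories("web mining", stories)

# Matching the whole phrase is one simple way to reduce such noise.
phrase_hits = [story for story in stories if "web mining" in story.lower()]
```

Here `hits` contains both the web mining and the copper mining headline, while `phrase_hits` keeps only the first.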
Summarization
Summarization reduces the length of a document while
keeping its main points. It helps users decide whether
they should read the document or not. The time the
technique takes to summarize a document is less than the
time a user takes to read its first paragraph. The
challenge in summarization is to teach software to analyze
semantics and to interpret meaning. Summarization software
statistically weighs the sentences and then extracts the
important ones from the document.
To identify the key points, a summarization tool searches
for headings and subheadings. The tool also lets the user
select what percentage of the total text is extracted as
the summary. It can work along with other tools, such as
topic tracking and categorization, to summarize the
document. An example of text summarization is Microsoft
Word's AutoSummarize.
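A statistical sentence-weighting summarizer with a user-chosen percentage might look like the sketch below. The scoring rule (average word frequency per sentence) is an assumption made for illustration; it is not how AutoSummarize or any specific tool is documented to work, and it ignores headings and stop words.

```python
import re
from collections import Counter

def summarize(text, percent):
    """Keep roughly `percent` percent of the sentences, chosen by word-frequency score."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def weight(sentence):
        # Average document-wide frequency of the words in this sentence.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * percent / 100))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weight(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)

text = "Web mining finds patterns. Web mining uses web data. Cats sleep."
short = summarize(text, 34)
```

With these three sentences and `percent=34`, one sentence is kept: the one whose words are most frequent across the whole text.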
Categorization
Categorization identifies the main themes of documents by
placing them into a predefined set of groups. The
technique counts the words in a document; it does not
process the actual information. It decides the main topic
from the counts and ranks the document according to the
topics. Documents whose content is mostly about a
particular topic are ranked first. Categorization can be
used in business and industry to provide customer support.
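Count-based categorization into predefined groups can be sketched as below. The two categories and their keyword lists are hypothetical, chosen to echo the customer support use case; a real system would use much larger vocabularies.

```python
from collections import Counter

# Hypothetical predefined groups with illustrative keyword lists.
CATEGORIES = {
    "billing": {"invoice", "refund", "payment"},
    "technical": {"error", "crash", "install"},
}

def categorize(document):
    """Return (category, keyword count) pairs, highest count first."""
    counts = Counter(document.lower().split())
    scores = {
        name: sum(counts[word] for word in keywords)
        for name, keywords in CATEGORIES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = categorize("refund for invoice after payment error")
```

The first entry of `ranking` is the main topic; the message above would be routed to billing because three of its words belong to that group and only one to the technical group.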
Clustering
Clustering is a technique used to group similar
documents. In clustering, grouping is not based on
predefined topics; it is done on the fly. The same
document can appear in different groups. As a result,
useful documents are not omitted from the search
results. Clustering helps the user to easily select the
topic of interest. Clustering technology is useful in
management information systems.
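One simple way to group documents on the fly, with a document allowed in several groups, is to cluster by shared terms. This is only an illustrative scheme assumed here, not a standard algorithm such as k-means and not one named in the slides; the stop-word list and sample documents are made up.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "in", "and", "for"}  # illustrative, not exhaustive

def cluster_by_shared_terms(documents):
    """Group document indices under every non-stop word shared by 2+ documents."""
    groups = defaultdict(set)
    for index, text in enumerate(documents):
        for word in set(text.lower().split()) - STOP_WORDS:
            groups[word].add(index)
    # Groups are discovered from the data; no topics are defined in advance.
    return {word: sorted(ids) for word, ids in groups.items() if len(ids) >= 2}

docs = [
    "jaguar speed in the wild",
    "jaguar car speed test",
    "car insurance for a new car",
]
clusters = cluster_by_shared_terms(docs)
```

Document 1 ends up in the "jaguar", "speed" and "car" groups, so it is not lost to a user who browses only one of them.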
Information Visualization
Visualization uses feature extraction and key term
indexing to build a graphical representation. Through
visualization, similar documents are found. Large textual
materials are represented as a visual hierarchy or as
maps that can be browsed. This helps the user to analyze
the contents visually. The user can interact with the
graph by zooming, creating sub-maps and scaling. This
technique is useful for finding related topics in a very
large number of documents.
Structured Data Mining Techniques
Web Crawler
Crawlers are computer programs that traverse the
hypertext structure of the web. There are two types of
web crawler: external and internal. An external crawler
crawls through unknown websites. An internal crawler
crawls through the internal pages of the websites
returned by the external crawler.
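The traversal can be sketched as a breadth-first walk over a link graph. To keep the example self-contained, the "web" here is a hypothetical in-memory dictionary rather than real HTTP requests, and the `same_site_only` flag is an assumed way to model the internal crawler.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> list of outgoing links.
LINKS = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news"],
    "http://b.example/news": [],
}

def crawl(start, same_site_only):
    """Breadth-first traversal of the link graph starting at `start`."""
    site = urlparse(start).netloc
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            if same_site_only and urlparse(link).netloc != site:
                continue  # an internal crawl never leaves the starting site
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

external = crawl("http://a.example/", same_site_only=False)  # follows links to other sites
internal = crawl("http://b.example/", same_site_only=True)   # stays within one site
```

The `seen` set prevents the crawler from looping on pages that link back to each other.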
Wrapper Generation
Wrapper generation provides information on the
capability of sources. Web pages are already ranked by
traditional search engines, and pages are retrieved for
a query by using their page rank value. The capability
of a source covers which queries it will answer and its
output types. The wrappers also provide a variety of
meta information about the sources, e.g. domains,
statistics and index look-up.
Page Content Mining
Page content mining is a structured data extraction
technique that works on the pages ranked by traditional
search engines. It classifies the pages by comparing
their page content rank.
Semi-Structured Data Mining Techniques
Object Exchange Model (OEM)
Relevant information is extracted from semi-structured
data, embedded in a group of useful information and
stored in the Object Exchange Model (OEM). This helps
the user to understand the information structure on the
web more accurately. OEM is best suited for
heterogeneous and dynamic environments. A main feature
of the object exchange model is that it is
self-describing: there is no need to describe the
structure of an object in advance.
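The self-describing property can be sketched with a small data class in which every object carries its own label and type. The class name, field names and sample book record are assumptions made for illustration; object identifiers, which OEM objects can also carry, are left out for brevity.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class OEMObject:
    label: str                    # says what the object is
    type: str                     # "string", "integer" or "set"
    value: Union[str, int, list]  # atomic value, or a list of OEMObject

def describe(obj, depth=0):
    """Render an object using only the labels and types it carries itself."""
    pad = "  " * depth
    if obj.type == "set":
        lines = [f"{pad}{obj.label}: set"]
        for child in obj.value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [f"{pad}{obj.label}: {obj.type} = {obj.value!r}"]

# Made-up record; no schema is declared anywhere beforehand.
page = OEMObject("book", "set", [
    OEMObject("title", "string", "Web Mining"),
    OEMObject("year", "integer", 2010),
])
lines = describe(page)
```

`describe` needs no schema: it reads the structure from the objects, which is why differently shaped records from heterogeneous sources can share one model.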
Top-down Extraction
Top-down extraction extracts complex objects from a set
of rich web sources and converts them into less complex
objects until atomic objects have been extracted.
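The repeated decomposition can be sketched as a recursive function. Nested Python dicts and lists stand in for the complex web objects here, which is an assumption for illustration; the catalog data is made up.

```python
def extract_atomic(obj, path=()):
    """Recursively split a complex object into (path, atomic value) pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return [(path, obj)]  # atomic object reached: stop descending
    result = []
    for key, child in items:
        result.extend(extract_atomic(child, path + (key,)))
    return result

catalog = {"store": "Books", "items": [{"title": "Web Mining", "price": 30}]}
atoms = extract_atomic(catalog)
```

Each pair in `atoms` records where the atomic value came from, so the original nesting is not lost.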
Web Data Extraction Language
A web data extraction language converts web data into
structured data and delivers it to end users. It stores
the data in the form of tables.
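The web-data-to-table step can be sketched with Python's standard `html.parser` module. This is not an extraction language itself, only a hand-written example of the conversion such a language performs; the class name and the HTML snippet are hypothetical.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

html = ("<table><tr><th>Title</th><th>Price</th></tr>"
        "<tr><td>Web Mining</td><td>30</td></tr></table>")
parser = TableExtractor()
parser.feed(html)
parser.close()
table = parser.rows
```

`table` is a list of rows, each a list of cell strings, which could then be written to a database table or a CSV file.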
Multimedia Data Mining Techniques
SKICAT
SKICAT is a successful astronomical data analysis and
cataloging system that produces a digital catalog of sky
objects. It uses machine learning techniques to convert
these objects into human-usable classes. It integrates
techniques for image processing and data classification,
which helps to classify very large classification sets.
Color Histogram Matching
Color histogram matching consists of color histogram
equalization and smoothing. Equalization tries to find
the correlation between color components. The problem
faced by equalization is the sparse data problem, which
shows up as unwanted artifacts in the equalized images.
This problem is solved by smoothing.
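The slides do not say which equalization or smoothing method is meant, so the sketch below falls back on textbook choices: standard histogram equalization of a single channel and a 3-bin moving average. It treats one channel at a time and does not model any correlation between color components.

```python
def equalize(pixels, levels=256):
    """Standard histogram equalization of one channel (flat list of ints)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to spread out
        return list(pixels)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1)) for p in pixels]

def smooth(hist):
    """3-bin moving average that fills the gaps of a sparse histogram."""
    padded = [hist[0]] + list(hist) + [hist[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(hist))]

equalized = equalize([0, 1, 1, 3], levels=4)
smoothed = smooth([3, 0, 3])
```

Equalization stretches the used intensities over the full range, which can leave empty bins in the histogram; the moving average is one simple way to fill such gaps.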
Multimedia Miner
Multimedia Miner comprises four major components: an
image excavator for extracting images and videos; a
preprocessor for extracting image features, which are
stored in a database; a search kernel for matching
queries against the images and videos available in the
database; and a discovery module that performs image
information mining routines to trace patterns in images.
Shot Boundary Detection
Shot boundary detection is a technique that
automatically detects the boundaries between shots in a
video.
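A common simple approach, assumed here because the slides do not name one, is to compare the intensity histograms of consecutive frames and report a cut when they differ strongly. Frames are modeled as flat lists of pixel intensities of equal length; the bin count and threshold are illustrative values.

```python
def histogram(frame, bins=4, levels=256):
    """Count the pixel intensities of one frame into coarse bins."""
    hist = [0] * bins
    for p in frame:
        hist[p * bins // levels] += 1
    return hist

def shot_boundaries(frames, threshold=0.5):
    """Return the indices i at which frame i starts a new shot."""
    cuts = []
    for i in range(1, len(frames)):
        prev, curr = histogram(frames[i - 1]), histogram(frames[i])
        # Normalized histogram difference: 0.0 (identical) to 1.0 (disjoint).
        diff = sum(abs(a - b) for a, b in zip(prev, curr)) / (2 * len(frames[i]))
        if diff > threshold:
            cuts.append(i)
    return cuts

# Made-up video: two dark frames followed by two bright frames.
frames = [
    [10, 20, 30, 40],
    [12, 22, 28, 41],
    [200, 210, 220, 230],
    [205, 215, 225, 235],
]
cuts = shot_boundaries(frames)
```

Small changes inside a shot leave the coarse histogram unchanged, while the jump from dark to bright frames is reported as a boundary at index 2.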
Web Content Mining Tools
Web content mining tools are software that helps users
download essential information. They collect appropriate
and well-fitting information. Some of them are Web Info
Extractor, Mozenda, Screen-Scraper, Web Content
Extractor and Automation Anywhere 5.5.
Web content mining is used in various areas, such as:
● Mining online news sites
● Distance learning
Problems faced by web content mining include:
● Extracting information from a heterogeneous environment
● Redundancy
● The linked nature of the web
● The dynamic and noisy nature of the web
Web content mining can also be integrated into web usage
mining. In one proposed system, the textual content of
web pages is extracted through frequent word sequences
and then combined with web server logs to study
association rules of user behavior. The results help in
better recommendation, web personalization, web
construction and web user profiling.
Web content mining can also be connected with web
structure mining. In this approach the web page content
is compared with the information defined by the
structure of the web site. Each web page is described by
a set of keywords. This information is combined with the
link structure to generate a context-based description.
The comparison helps in finding semantic information
about a web page and its neighborhood.
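Combining page keywords with the link structure can be sketched as below. Taking the union of a page's keywords with those of the pages it links to is an assumption made for illustration; the slides do not specify how the context-based description is built, and the pages and keywords are made up.

```python
def context_description(page, keywords, links):
    """Union of a page's own keywords and those of the pages it links to."""
    context = set(keywords.get(page, ()))
    for neighbor in links.get(page, ()):
        context |= set(keywords.get(neighbor, ()))
    return context

# Hypothetical site: keyword sets per page and the link structure.
keywords = {
    "/": {"university"},
    "/courses": {"data", "mining"},
    "/staff": {"lecturers"},
}
links = {"/": ["/courses", "/staff"], "/courses": ["/"]}

home_context = context_description("/", keywords, links)
```

The description of the home page now also reflects what its neighborhood is about, which its own keywords alone would not show.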
Ad

More Related Content

What's hot (20)

Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)
Amir Fahmideh
 
Web mining (1)
Web mining (1)Web mining (1)
Web mining (1)
ajaybabu1314
 
web mining
web miningweb mining
web mining
Arpit Verma
 
Web Mining
Web MiningWeb Mining
Web Mining
Ziyad Abid
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
maha797959
 
Web data mining
Web data miningWeb data mining
Web data mining
Institute of Technology Telkom
 
4.5 mining the worldwideweb
4.5 mining the worldwideweb4.5 mining the worldwideweb
4.5 mining the worldwideweb
Krish_ver2
 
Web content mining
Web content miningWeb content mining
Web content mining
Akanksha Dombe
 
Page rank algortihm
Page rank algortihmPage rank algortihm
Page rank algortihm
Siddharth Kar
 
Web mining
Web miningWeb mining
Web mining
Rashmi Bhat
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Yashwant Rautela
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
Ankit Raj
 
Twitter sentiment analysis ppt
Twitter sentiment analysis pptTwitter sentiment analysis ppt
Twitter sentiment analysis ppt
AntaraBhattacharya12
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
Sai Kumar Ale
 
Web mining
Web mining Web mining
Web mining
TeklayBirhane
 
Web Mining
Web Mining Web Mining
Web Mining
guestb73ec6
 
Temporal databases
Temporal databasesTemporal databases
Temporal databases
Dabbal Singh Mahara
 
WEB MINING.
WEB MINING.WEB MINING.
WEB MINING.
Sushil kasar
 
Text mining
Text miningText mining
Text mining
Koshy Geoji
 
Pagerank Algorithm Explained
Pagerank Algorithm ExplainedPagerank Algorithm Explained
Pagerank Algorithm Explained
jdhaar
 

Similar to Web Content Mining (20)

Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
inventionjournals
 
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Editor IJCATR
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
IRJET Journal
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured Data
Melinda Watson
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimization
BookStoreLib
 
International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technology
anchalsinghdm
 
A detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesA detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniques
ijctet
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
inventionjournals
 
DWM-MODULE 6.pdf
DWM-MODULE 6.pdfDWM-MODULE 6.pdf
DWM-MODULE 6.pdf
nikshaikh786
 
WEBMINING_SOWMYAJYOTHI.pdf
WEBMINING_SOWMYAJYOTHI.pdfWEBMINING_SOWMYAJYOTHI.pdf
WEBMINING_SOWMYAJYOTHI.pdf
SowmyaJyothi3
 
WEB MINING.pptx
WEB MINING.pptxWEB MINING.pptx
WEB MINING.pptx
HarshithRaj21
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
ijsrd.com
 
Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categories
theijes
 
Research Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish KumarResearch Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish Kumar
Nithish Kumar
 
Research report nithish
Research report nithishResearch report nithish
Research report nithish
Nithish Kumar
 
Web Search Engine, Web Crawler, and Semantics Web
Web Search Engine, Web Crawler, and Semantics WebWeb Search Engine, Web Crawler, and Semantics Web
Web Search Engine, Web Crawler, and Semantics Web
Aatif19921
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
dannyijwest
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
dannyijwest
 
Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)
Mumbai Academisc
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
inventionjournals
 
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...Quest Trail: An Effective Approach for Construction of Personalized Search En...
Quest Trail: An Effective Approach for Construction of Personalized Search En...
Editor IJCATR
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
IRJET Journal
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured Data
Melinda Watson
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimization
BookStoreLib
 
International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technology
anchalsinghdm
 
A detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesA detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniques
ijctet
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
inventionjournals
 
WEBMINING_SOWMYAJYOTHI.pdf
WEBMINING_SOWMYAJYOTHI.pdfWEBMINING_SOWMYAJYOTHI.pdf
WEBMINING_SOWMYAJYOTHI.pdf
SowmyaJyothi3
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
ijsrd.com
 
Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categories
theijes
 
Research Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish KumarResearch Report on Document Indexing-Nithish Kumar
Research Report on Document Indexing-Nithish Kumar
Nithish Kumar
 
Research report nithish
Research report nithishResearch report nithish
Research report nithish
Nithish Kumar
 
Web Search Engine, Web Crawler, and Semantics Web
Web Search Engine, Web Crawler, and Semantics WebWeb Search Engine, Web Crawler, and Semantics Web
Web Search Engine, Web Crawler, and Semantics Web
Aatif19921
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
dannyijwest
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
dannyijwest
 
Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)
Mumbai Academisc
 
Ad

More from Daminda Herath (8)

Data mining
Data miningData mining
Data mining
Daminda Herath
 
Data mining
Data miningData mining
Data mining
Daminda Herath
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage Mining
Daminda Herath
 
XML
XMLXML
XML
Daminda Herath
 
Social Aspect of the Internet
Social Aspect of the InternetSocial Aspect of the Internet
Social Aspect of the Internet
Daminda Herath
 
Personal web usage mining
Personal web usage miningPersonal web usage mining
Personal web usage mining
Daminda Herath
 
JavaScript Libraries
JavaScript LibrariesJavaScript Libraries
JavaScript Libraries
Daminda Herath
 
1. Overview of Distributed Systems
1. Overview of Distributed Systems1. Overview of Distributed Systems
1. Overview of Distributed Systems
Daminda Herath
 
Ad

Web Content Mining

  • 2. Web Content Mining Web Content Mining mines the content like text, image, audio, video, metadata, hyperlinks and extracts useful information. Since Web content mining examines the content of the web as well as the result of the search. Web Content mining mines. Web mining helps to understand customer behavior, helps to evaluate the performance of a web site and the research done in web content mining indirectly helps to boost business.
  • 3. Web Content Mining Web content mining examines the search result of search engine. Manually doing things consumes a lot of time. When the data to be analyzed is in large quantities, then it is hard to find out the relevant data. Since now in every field of life manual work is replaced by technology. Same happened in the case of internet. As people already admit that internet is really a magic of technology. Web Mining became a boon to this magic. In the early stages Web contained few amount of data. So there was no need of web mining tools. As years passed Web got accumulated with large amount of data. Then retrieval of data according to users need became hard task. Web mining came as a rescue for this problem.
  • 4. Web Content Mining It can be further classified into ● Web page content mining Web page Content mining is a traditional search of web page via content. ● Search result mining. Search result mining is a further search of pages found from previous search.
  • 5. Web Content Mining Two approaches used in web content mining 1)Agent based approach 2)Database approach
  • 6. Web Content Mining 1)Agent based approach The three types of agents ● Intelligent search agents ● Information filtering/Categorizing agent ● Personalized web agents.
  • 7. Web Content Mining Intelligent Search agents automatically searches for information according to a particular query using domain characteristics and user profiles. Information agents used number of techniques to filter data according to the predefine instructions. Personalized web agents learn user preferences and discovers documents related to those user profiles. In Database approach it consists of well formed database containing schemas and attributes with defined domains.
  • 8. Web Content Mining Web content mining becomes complicated when it has to mine unstructured, structured, semi structured and multimedia data. Figure explains the web content mining techniques.
  • 10. Web Content Mining Unstructured Data Mining Techniques Content mining can be done on unstructured data such as text. Mining of unstructured data give unknown information. Text mining is extraction of previously unknown information by extracting information from different text sources. Content mining requires application of data mining and text mining techniques.
  • 11. Web Content Mining Unstructured Data Mining Techniques Basic Content Mining is a type of text mining.Some of the techniques used in text mining are Information. ● Extraction ● Topic Tracking ● Summarization ● Categorization ● Clustering ● Information Visualization.
  • 12. Web Content Mining Information Extraction (IE) To extract information from unstructured data, pattern matching is used. It traces out the keyword and phrases and then finds out the connection of the keywords within the text. This technique is very useful when there is large volume of text. IE is the basis of many other techniques used for unstructured mining. Information extraction can be provided to KDD module because information extraction has to transform unstructured text to more structured data. First the information is mined from the extracted data and then using different types of rules, the missed out information are found out. IE that makes incorrect predictions on data are discarded.
  • 13. Web Content Mining Topic Tracking Topic Tracking is a technique in which it checks the documents viewed by the user and studies the user profiles. According to each user it predicts the other documents related to users interest. In Topic Tracking applied by yahoo, user can give a keyword and if anything related to the keyword pops up then it will be informed to the user. Same can be applied in the case of mining unstructured data. An example for topic tracking is that if we select the competitors name then if at anytime their name will come up in the news then this information will be passed to the company.
  • 14. Web Content Mining Topic Tracking Topic tracking can be applied in many fields. Two such areas are medical field and education field. In medical field doctors can easily come to know latest treatments. In education field topic tracking can be used to find out the latest reference for research related work. Topic tracking helps to track all subsequent stories in the news stream. Disadvantage of topic tracking is that when we search for topics we may be provided with information which is not related to our interest. For example if user sets an alert for ‘web mining’ it can provide us with topics related to mineral mining etc. which are not useful for user.
  • 15. Web Content Mining Summarization Summarization is used to reduce the length of the document by maintaining the main points. It helps the user to decide whether they should read this topic or not. The time taken by the technique to summarize the document is less than the time taken by the user to read the first paragraph. The challenge in summarization is to teach software to analyze semantics and to interpret the meaning. This software statistically weighs the sentence and then extracts important sentences from the document.
  • 16. Web Content Mining Summarization To understand the key points summarization tool search for headings and sub headings to find out the important points of that document. This tool also give the freedom to the user to select how much percentage of the total text they want extracted as summary. It can work along with other tools such as Topic tracking and categorization to summarize the document. An example for text Summarization is Microsoft word’s AutoSummarize.
  • 17. Web Content Mining Categorization Categorization is the technique of identifying main themes by placing the documents into a predefined set of group. This technique counts the number of words in a document. It does not process the actual information. It decides the main topic from the counts. It ranks the document according to the topics. Documents having majority content on a particular topic are ranked first. Categorization can be used in business and industries to provide customer support.
  • 18. Web Content Mining Clustering Clustering is a technique used to group similar documents. Here in clustering grouping is not done based on predefined topic. It is done based on fly. Same documents can appear in different group. As a result useful documents will not be omitted from the search results. Clustering helps the user to easily select the topic of interest. Clustering technology is useful in management information system.
  • 19. Web Content Mining Information Visualization Visualization utilizes feature extraction and key term indexing to build a graphical representation. Through visualization, documents having similarity are found out. Large textual materials are represented as visual hierarchy or maps where browsing facility is allowed. It helps the user to visually analyze the contents. User can interact with the graph by zooming, creating sub maps and scaling. This technique is useful to find out related topic from a very large amount of documents.
  • 20. Web Content Mining Information Visualization Visualization utilizes feature extraction and key term indexing to build a graphical representation. Through visualization, documents having similarity are found out. Large textual materials are represented as visual hierarchy or maps where browsing facility is allowed. It helps the user to visually analyze the contents. User can interact with the graph by zooming, creating sub maps and scaling. This technique is useful to find out related topic from a very large amount of documents.
  • 21. Web Content Mining Structured Data Mining Techniques Web Crawler Crawlers are computer programs that traverse the hypertext structure of the web. There are two types of web crawler: external and internal. An external crawler crawls through unknown websites. An internal crawler crawls through the internal pages of the websites returned by the external crawler.
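The division of labor between the two crawler types can be sketched with a breadth-first traversal over a small in-memory link graph. The URLs, the `WEB` dictionary, and the function names below are hypothetical; a real crawler would fetch pages over the network and parse their links.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory web: each URL maps to the links found on that page.
WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news"],
    "http://b.example/news": [],
}


def host(url):
    """Return the host part of a URL."""
    return urlparse(url).netloc


def crawl(start, follow):
    """Breadth-first traversal that follows a link only if follow(src, dst) is true."""
    seen, queue = [start], deque([start])
    while queue:
        page = queue.popleft()
        for link in WEB.get(page, []):
            if link not in seen and follow(page, link):
                seen.append(link)
                queue.append(link)
    return seen


def external_crawl(start):
    """Discover site entry pages by following only links that leave the current site."""
    return crawl(start, lambda src, dst: host(src) != host(dst))


def internal_crawl(entry):
    """Visit the pages of one site by following only same-host links."""
    return crawl(entry, lambda src, dst: host(src) == host(dst))
```

In this arrangement the external crawl yields the entry pages of newly found sites, and an internal crawl is then started from each of them.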
  • 22. Web Content Mining Wrapper Generation Wrapper generation provides information on the capabilities of sources, that is, which queries they will answer and what output types they produce. Web pages are already ranked by traditional search engines, and pages are retrieved for a query by using their page rank value. The wrappers also provide a variety of meta-information about the sources, e.g. domains, statistics, and index lookup. Page Content Mining Page content mining is a structured data extraction technique that works on the pages ranked by traditional search engines. It classifies the pages by comparing their page content rank.
  • 23. Web Content Mining Semi-Structured Data Mining Techniques Object Exchange Model (OEM) Relevant information is extracted from semi-structured data, gathered into a group of useful information, and stored in the Object Exchange Model (OEM). This helps the user to understand the information structure on the web more accurately. OEM is best suited to heterogeneous and dynamic environments. Its main feature is that it is self-describing: there is no need to describe the structure of an object in advance.
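The self-describing property can be illustrated with a small data structure in which every object carries its own label and value, so no schema has to be declared beforehand. The class below is a simplified sketch; the class name, the `children` helper, and the sample data are assumptions for illustration, and object identifiers are left out.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OEMObject:
    """A labeled object whose value is either atomic or a list of sub-objects."""

    label: str
    value: Union[str, int, float, List["OEMObject"]]

    @property
    def type(self):
        """The type is derived from the value itself, not from a separate schema."""
        return "complex" if isinstance(self.value, list) else type(self.value).__name__

    def children(self, label):
        """Return the sub-objects that carry the given label."""
        if not isinstance(self.value, list):
            return []
        return [child for child in self.value if child.label == label]


# Hypothetical sample: objects with the same label may repeat or be absent.
page = OEMObject("page", [
    OEMObject("title", "Web Content Mining"),
    OEMObject("keyword", "crawler"),
    OEMObject("keyword", "wrapper"),
    OEMObject("slides", 29),
])
```

Because labels travel with the data, two pages with different sets of fields can be stored side by side, which is what makes the model a fit for heterogeneous sources.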
  • 24. Web Content Mining Semi-Structured Data Mining Techniques Top-down Extraction Top-down extraction takes complex objects from a set of rich web sources and breaks them down into less complex objects until atomic objects have been extracted. Web Data Extraction Language A web data extraction language converts web data into structured data and delivers it to end users. It stores the data in the form of tables.
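Both ideas can be shown together in a short sketch: a nested object is decomposed recursively until only atomic values remain, and the result is stored as table rows. The representation of complex objects as Python dictionaries and lists, and the function names, are assumptions made for this example.

```python
def extract_atomic(obj, path=()):
    """Recursively break a complex object into (path, atomic value) pairs."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from extract_atomic(value, path + (key,))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from extract_atomic(value, path + (index,))
    else:
        # Anything that is not a container is treated as atomic.
        yield path, obj


def to_table(obj):
    """Store the extracted atomic objects as rows of (path, value)."""
    return [
        ("/".join(str(part) for part in path), value)
        for path, value in extract_atomic(obj)
    ]
```

For instance, `to_table({"book": {"title": "T", "authors": ["A", "B"]}})` produces one row per atomic value, with the path recording where each value came from.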
  • 25. Web Content Mining Multimedia Data Mining Techniques SKICAT SKICAT is a successful astronomical data analysis and cataloging system that produces a digital catalog of sky objects. It uses machine learning techniques to convert these objects into human-usable classes. It integrates techniques for image processing and data classification, which helps it to classify very large sets of objects. Color Histogram Matching Color histogram matching consists of color histogram equalization and smoothing. Equalization tries to find the correlation between color components. The problem faced by equalization is the sparse data problem, which shows up as unwanted artifacts in the equalized images. This problem is solved by smoothing.
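The interplay of equalization and smoothing can be sketched on a single color channel. The code below is standard single-channel histogram equalization with an optional moving-average smoothing of the histogram; it is an illustrative assumption and not necessarily the exact procedure the slide refers to. The function names and the `levels` and `radius` parameters are hypothetical.

```python
def histogram(values, levels=8):
    """Count how often each intensity level (0 .. levels-1) occurs."""
    hist = [0] * levels
    for value in values:
        hist[value] += 1
    return hist


def smooth(hist, radius=1):
    """Moving average over neighboring bins; windows are clipped at the edges."""
    out = []
    for i in range(len(hist)):
        window = hist[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def equalize(values, levels=8, radius=0):
    """Map each level through the normalized cumulative histogram.

    With radius > 0 the histogram is smoothed first, which softens the
    effect of sparse (mostly empty) histograms on the mapping.
    """
    hist = histogram(values, levels)
    if radius:
        hist = smooth(hist, radius)
    total = sum(hist)
    mapping, running = [], 0.0
    for count in hist:
        running += count
        mapping.append(round((levels - 1) * running / total))
    return [mapping[value] for value in values]
```

Applied to a channel whose values occupy only the low levels, `equalize` spreads them over the full range, and a non-zero `radius` makes the mapping less sensitive to empty bins.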
  • 26. Web Content Mining Multimedia Miner Multimedia Miner comprises four major components: an image excavator for extracting images and videos; a preprocessor for extracting image features, which are stored in a database; a search kernel for matching queries against the images and videos available in the database; and a discovery module that performs image information mining routines to trace out the patterns in images. Shot Boundary Detection Shot boundary detection is a technique that automatically detects the boundaries between shots in a video.
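A common way to detect shot boundaries is to compare the intensity histograms of consecutive frames and report a boundary where the difference is large. The slide does not name a specific method, so the histogram-difference rule, the threshold, and the function names below are assumptions; frames are represented as flat lists of quantized pixel intensities.

```python
def frame_histogram(frame, levels=4):
    """Count how often each intensity level occurs in a frame."""
    hist = [0] * levels
    for pixel in frame:
        hist[pixel] += 1
    return hist


def histogram_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames, threshold, levels=4):
    """Return the indices of frames that start a new shot."""
    if not frames:
        return []
    boundaries = []
    previous = frame_histogram(frames[0], levels)
    for index in range(1, len(frames)):
        current = frame_histogram(frames[index], levels)
        # A large jump between neighboring frames suggests a cut.
        if histogram_distance(previous, current) > threshold:
            boundaries.append(index)
        previous = current
    return boundaries
```

The threshold has to be tuned per video: too low and camera motion is reported as a cut, too high and gradual transitions are missed.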
  • 27. Web Content Mining Web Content Mining Tools Web content mining tools are software that helps users to download the essential information. They collect the information that is appropriate and relevant to the user. Some of them are Web Info Extractor, Mozenda, Screen-Scraper, Web Content Extractor, and Automation Anywhere 5.5.
  • 28. Web Content Mining Web content mining is used in various areas, such as ● Mining online news sites ● Distance learning Problems faced by web content mining include ● Extracting information from a heterogeneous environment ● Redundancy ● The linked nature of the web ● The dynamic and noisy nature of the web
  • 29. Web Content Mining Integration of web content mining into web usage mining is also possible. In this approach, the textual content of the web pages is extracted as frequent word sequences. These are then combined with web server logs to study association rules of users' behavior. The results help in better recommendation, web personalization, web construction, and web user profiling. Connection between Web Content Mining and Web Structure Mining In this approach, the web page content is compared with the information defined by the structure of the website. Each web page is described by a set of keywords. This information is combined with the link structure to generate a context-based description. The comparison helps in finding the semantic information of a web page and its neighborhood.
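The first step of the content and usage integration can be sketched as follows: frequent word sequences are extracted from page text and then attached to the sessions recorded in a server log. The use of word bigrams, the support threshold, the log format (session id mapped to visited URLs), and the function names are all assumptions for illustration; the association rule mining that would follow is not shown.

```python
from collections import Counter
import re


def word_sequences(text, n=2):
    """Return the set of n-word sequences that occur in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def frequent_sequences(pages, min_support=2, n=2):
    """Keep the sequences that occur on at least min_support pages."""
    support = Counter()
    for text in pages.values():
        support.update(word_sequences(text, n))
    return {seq for seq, count in support.items() if count >= min_support}


def sequences_per_session(pages, sessions, min_support=2, n=2):
    """Combine content with usage: frequent sequences seen in each logged session."""
    frequent = frequent_sequences(pages, min_support, n)
    result = {}
    for session_id, visited in sessions.items():
        seen = set()
        for url in visited:
            seen |= word_sequences(pages.get(url, ""), n) & frequent
        result[session_id] = seen
    return result
```

The per-session sets produced here are the kind of transactions on which association rules about user behavior could then be mined.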