Spatial and Web Mining

Uploaded by

samanthaargent21


ADVANCED DATA MINING

WEB MINING

Web Mining Issues

◼ Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
◼ Diverse types of data

Web Data
◼ Web pages
◼ Intra-page structures
◼ Inter-page structures
◼ Usage data
◼ Supplemental data
– Profiles
– Registration information
– Cookies

Web Mining Taxonomy

Web Content Mining
◼ Extends work of basic search engines
◼ Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

Crawlers
◼ Robot (spider) traverses the hypertext
structure in the Web.
◼ Collect information from visited pages
◼ Used to construct indexes for search engines
◼ Traditional Crawler – visits entire Web (?)
and replaces index
◼ Periodic Crawler – visits portions of the Web
and updates subset of index
◼ Incremental Crawler – selectively searches
the Web and incrementally modifies index
◼ Focused Crawler – visits pages related to a
particular subject

Focused Crawler
◼ Only visit links from a page if that page
is determined to be relevant.
◼ Classifier is static after learning phase.
◼ Components:
– Classifier which assigns relevance score to
each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier
and distiller scores.

Focused Crawler
◼ Classifier relates documents to topics
◼ Classifier also determines how useful
outgoing links are
◼ Hub Pages contain links to many
relevant pages. Must be visited even if
not high relevance score.
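
The interplay of classifier and distiller can be sketched as a best-first crawl. This is only a minimal illustration under stated assumptions: the toy link graph `LINKS`, the precomputed `RELEVANCE` scores (standing in for a trained classifier), the `hub_score` heuristic (standing in for the distiller), and the 0.5 threshold are all hypothetical and not taken from the slides. A real crawler would also have to fetch a page before scoring it.

```python
import heapq

# Hypothetical toy Web: page -> outgoing links.
LINKS = {
    "seed": ["hub", "offtopic"],
    "hub": ["a", "b", "c"],
    "offtopic": ["x"],
    "a": [], "b": [], "c": [], "x": [],
}
# Hypothetical classifier output: relevance of each page to the crawl topic.
RELEVANCE = {"seed": 0.9, "hub": 0.2, "offtopic": 0.1,
             "a": 0.8, "b": 0.9, "c": 0.7, "x": 0.1}


def hub_score(page, threshold=0.5):
    """Distiller stand-in: fraction of out-links that lead to relevant pages."""
    out = LINKS[page]
    if not out:
        return 0.0
    return sum(RELEVANCE[q] >= threshold for q in out) / len(out)


def focused_crawl(seed, threshold=0.5):
    """Visit pages best-first; follow links only from relevant pages or hubs."""
    frontier = [(-RELEVANCE[seed], seed)]  # max-heap via negated priority
    seen = {seed}
    visited = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        # A hub is expanded even when its own relevance score is low.
        if RELEVANCE[page] < threshold and hub_score(page) < threshold:
            continue
        for q in LINKS[page]:
            if q not in seen:
                seen.add(q)
                priority = max(RELEVANCE[q], hub_score(q))
                heapq.heappush(frontier, (-priority, q))
    return visited


order = focused_crawl("seed")
```

In this toy graph the low-relevance `hub` page is still expanded because most of its links lead to relevant pages, while links from `offtopic` are never followed.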

Focused Crawler

Context Focused Crawler

◼ Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links
to documents at next higher level.
– Updated during crawl itself.
◼ Approach:
1. Construct context graph and classifiers using
seed documents as training data.
2. Perform crawling using classifiers and context
graph created.
© Prentice Hall

Context Graph

Virtual Web View

◼ Multiple Layered DataBase (MLDB) built on top
of the Web.
◼ Each layer of the database is more generalized
(and smaller) and centralized than the one
beneath it.
◼ Upper layers of MLDB are structured and can be
accessed with SQL type queries.
◼ Translation tools convert Web documents to XML.
◼ Extraction tools extract desired information to
place in first layer of MLDB.
◼ Higher levels contain more summarized data
obtained through generalizations of the lower
levels.

Personalization
◼ Web access or contents tuned to better fit the
desires of each user.
◼ Manual techniques identify user’s preferences
based on profiles or demographics.
◼ Collaborative filtering identifies preferences
based on ratings from similar users.
◼ Content based filtering retrieves pages
based on similarity between pages and user
profiles.
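
Collaborative filtering can be illustrated with a small user-based sketch. The `RATINGS` data, the choice of cosine similarity, and the similarity-weighted scoring are assumptions made for illustration; the slides do not prescribe a particular similarity measure.

```python
import math

# Hypothetical ratings: user -> {page: rating}.
RATINGS = {
    "ann": {"p1": 5, "p2": 4, "p3": 1},
    "bob": {"p1": 5, "p2": 5, "p4": 4},
    "cat": {"p1": 1, "p3": 5, "p5": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user):
    """Rank unseen pages by similarity-weighted ratings of other users."""
    mine = RATINGS[user]
    scores = {}
    for other, theirs in RATINGS.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        for page, rating in theirs.items():
            if page not in mine:
                scores[page] = scores.get(page, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


suggestions = recommend("ann")
```

Here `ann` rates pages much like `bob`, so the page only `bob` has rated is ranked ahead of the page only `cat` has rated.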

Web Structure Mining

◼ Mine structure (links, graph) of the Web
◼ Techniques
– PageRank
– CLEVER
◼ Create a model of the Web organization.
◼ May be combined with content mining to
more effectively retrieve important pages.

PageRank
◼ Used by Google
◼ Prioritize pages returned from search by
looking at Web structure.
◼ Importance of page is calculated based
on number of pages which point to it –
Backlinks.
◼ Weighting is used to provide more
importance to backlinks coming from
important pages.

PageRank (cont’d)
◼ PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank for a page i which points
to target page p.
– Ni: number of links coming out of page i
– c: constant between 0 and 1 used for normalization
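
The formula can be evaluated iteratively. The sketch below is an assumption-laden illustration rather than Google's implementation: the three-page graph `LINKS` is hypothetical, every linked page is assumed to appear as a key, and instead of fixing c in advance the ranks are rescaled to sum to 1 after each pass, which plays the role of the normalization constant.

```python
def pagerank(links, iterations=100):
    """Repeatedly apply PR(p) = c * sum(PR(i) / N_i) over pages i linking to p."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, outs in links.items():
            for p in outs:
                # Page i shares its rank equally among its N_i out-links.
                new[p] += pr[i] / len(outs)
        # Rescale so that the ranks sum to 1 (normalization step).
        total = sum(new.values())
        pr = {p: v / total for p, v in new.items()}
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

For this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4: C collects rank from both A and B, and passes all of it back to A.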

CLEVER
◼ Identify authoritative and hub pages.
◼ Authoritative Pages :
– Highly important pages.
– Best source for requested information.
◼ Hub Pages :
– Contain links to highly important pages.

HITS
◼ Hyperlink-Induced Topic Search
◼ Based on a set of keywords, find set of
relevant pages – R.
◼ Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or
from R.
– Calculate weights for authorities and hubs.
◼ Pages with highest ranks in R are returned.
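
The weight calculation for hubs and authorities can be sketched as a mutual-reinforcement loop over the base set. The graph `BASE`, the fixed iteration count, and the Euclidean normalization are illustrative assumptions; building R and B from a keyword query is not shown.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weights for a link graph given as page -> out-links."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the weights stay bounded.
        norm_a = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        norm_h = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / norm_a for p, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return hub, auth


# Hypothetical base set: h1 and h2 are link pages, a1 and a2 are content pages.
BASE = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(BASE)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

In this example `a1` ends up as the strongest authority because both link pages point to it, and `h1` as the strongest hub because it points to both authorities.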

HITS Algorithm

Web Usage Mining

◼ Mines Web usage data, such as Web server
logs, to discover how users access pages.

Web Usage Mining Applications
◼ Personalization
◼ Improve structure of a site’s Web pages
◼ Aid in caching and prediction of future
page references
◼ Improve design of individual pages
◼ Improve effectiveness of e-commerce
(sales and advertising)

Web Usage Mining Activities

◼ Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
◼ Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
◼ Pattern Analysis
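
The preprocessing and pattern discovery steps can be sketched on a tiny log. The `LOG` entries, the 30-minute inactivity timeout used to split sessions, and the restriction to contiguous page sequences are assumptions for illustration; the slides do not fix a session definition.

```python
from collections import Counter

# Hypothetical cleansed log entries: (user, timestamp in seconds, page).
LOG = [
    ("u1", 0, "A"), ("u1", 60, "B"), ("u1", 120, "C"),
    ("u2", 30, "A"), ("u2", 90, "B"),
    ("u1", 4000, "A"), ("u1", 4050, "C"),
]


def sessionize(log, timeout=1800):
    """Split each user's requests into sessions at gaps longer than timeout."""
    sessions = []
    last = {}  # user -> (last timestamp, index of that user's open session)
    for user, ts, page in sorted(log, key=lambda e: (e[0], e[1])):
        if user in last and ts - last[user][0] <= timeout:
            idx = last[user][1]
        else:
            sessions.append([])
            idx = len(sessions) - 1
        sessions[idx].append(page)
        last[user] = (ts, idx)
    return sessions


def count_patterns(sessions, length=2):
    """Count in how many sessions each ordered, contiguous page sequence occurs."""
    counts = Counter()
    for s in sessions:
        found = {tuple(s[i:i + length]) for i in range(len(s) - length + 1)}
        counts.update(found)
    return counts


sessions = sessionize(LOG)
patterns = count_patterns(sessions)
```

Unlike an ordinary itemset, the pattern `("A", "B")` is distinct from `("B", "A")`, which reflects the point that order is important.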

Web Usage Mining Issues
◼ Identification of exact user not possible.
◼ Exact sequence of pages referenced by
a user not possible due to caching.
◼ Session not well defined
◼ Security, privacy, and legal issues

Spatial Mining

Spatial Object
◼ Contains both spatial and nonspatial
attributes.
◼ Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
◼ May retrieve object using either (or
both) spatial or nonspatial attributes.

Spatial Data Mining Applications

◼ Geology
◼ GIS Systems
◼ Environmental Science
◼ Agriculture
◼ Medicine
◼ Robotics
◼ May involve both spatial and temporal
aspects

Spatial Queries
◼ Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
◼ Region (Range) Query – find objects that
intersect a given region.
◼ Nearest Neighbor Query – find the object closest
to an identified object.
◼ Distance Scan – find object within a certain
distance of an identified object where distance is
made increasingly larger.
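
The three query types can be sketched with brute-force scans over point objects. The `OBJECTS` data and function names are hypothetical, and a real system would answer these queries through a spatial index rather than by scanning every object.

```python
import math

# Hypothetical spatial objects: name -> (x, y) point location.
OBJECTS = {"school": (1, 1), "park": (4, 5), "lake": (9, 9), "mall": (2, 3)}


def range_query(objects, xmin, ymin, xmax, ymax):
    """Region query: names of objects that fall inside the rectangle."""
    return sorted(n for n, (x, y) in objects.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)


def nearest_neighbor(objects, name):
    """Name of the object closest to the identified object."""
    origin = objects[name]
    others = (n for n in objects if n != name)
    return min(others, key=lambda n: math.dist(origin, objects[n]))


def distance_scan(objects, name, radii):
    """Try increasingly larger radii until at least one object is found."""
    origin = objects[name]
    for r in radii:
        found = sorted(n for n in objects
                       if n != name and math.dist(origin, objects[n]) <= r)
        if found:
            return r, found
    return None


in_region = range_query(OBJECTS, 0, 0, 5, 5)
closest = nearest_neighbor(OBJECTS, "school")
scan = distance_scan(OBJECTS, "school", [1, 2, 3])
```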

Spatial Data Structures

◼ Data structures designed specifically to store or
index spatial data.
◼ Often based on B-tree or Binary Search Tree
◼ Cluster data on disk based on geographic
location.
◼ May represent complex spatial structure by
placing the spatial object in a containing structure
of a specific geographic shape.
◼ Techniques:
– Quad Tree
– R-Tree
– k-D Tree

MBR
◼ Minimum Bounding Rectangle
◼ Smallest rectangle that completely
contains the object
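
A minimal sketch of computing an MBR, assuming the object is given as a list of (x, y) vertices and the rectangle is axis-aligned. The tuple layout `(xmin, ymin, xmax, ymax)` and the helper `mbrs_intersect` are illustrative choices, not part of the slides.

```python
def mbr(points):
    """Axis-aligned minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a, b):
    """Cheap filter: the objects can only intersect if their MBRs do."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


triangle = [(1, 2), (4, 1), (3, 5)]
square = [(6, 6), (8, 6), (8, 8), (6, 8)]
box_t = mbr(triangle)
box_s = mbr(square)
```

Comparing MBRs first is a common filtering step: only when two rectangles intersect is the more expensive exact geometry test needed.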

MBR Examples

Quad Tree
◼ Hierarchical decomposition of the space
into quadrants (MBRs)
◼ Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
◼ Each level is a more exact representation
of the object.
◼ The number of levels is determined by
the degree of accuracy desired.
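
The level-by-level refinement can be sketched as follows, assuming the object is itself an axis-aligned rectangle and the space is a square. The function names and the example coordinates are hypothetical.

```python
def quadrants(cell):
    """Split a cell (x0, y0, x1, y1) into its four quadrants."""
    x0, y0, x1, y1 = cell
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]


def overlaps(a, b):
    """True if rectangles a and b share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def quad_levels(obj, space, depth):
    """For each level, list the quadrants containing any portion of obj."""
    levels = []
    cells = [space]
    for _ in range(depth):
        cells = [q for c in cells for q in quadrants(c) if overlaps(q, obj)]
        levels.append(cells)
    return levels


levels = quad_levels(obj=(1, 1, 3, 3), space=(0, 0, 8, 8), depth=3)
cells_per_level = [len(cells) for cells in levels]
area_per_level = [sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cells)
                  for cells in levels]
```

The covered area never grows from one level to the next and here ends at the object's own area of 4, which is the sense in which each level is a more exact representation.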

Quad Tree Example

R-Tree
◼ As with Quad Tree the region is divided
into successively smaller rectangles
(MBRs).
◼ Rectangles need not be of the same
size or number at each level.
◼ Rectangles may actually overlap.
◼ Lowest level cell has only one object.
◼ Tree maintenance algorithms similar to
those for B-trees.

R-Tree Example

k-D Tree
◼ Designed for multi-attribute data, not
necessarily spatial
◼ Variation of binary search tree
◼ Each level is used to index one of the
dimensions of the spatial object.
◼ Lowest level cell has only one object
◼ Divisions not based on MBRs but
successive divisions of the dimension
range.
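
A minimal sketch of building and searching a k-D tree for 2-D points. It uses a common variant that stores one point at every node and splits at the median, and it assumes coordinate values are distinct along each dimension; the sample points and dictionary node layout are illustrative only.

```python
def build_kd(points, depth=0):
    """Build a k-D tree; each level splits on the next dimension in turn."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kd(pts[:mid], depth + 1),
        "right": build_kd(pts[mid + 1:], depth + 1),
    }


def contains(node, target):
    """Exact-match search, comparing one dimension per level."""
    while node is not None:
        if node["point"] == target:
            return True
        axis = node["axis"]
        if target[axis] < node["point"][axis]:
            node = node["left"]
        else:
            node = node["right"]
    return False


POINTS = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd(POINTS)
```

The root splits on x, its children split on y, and so on, which mirrors the statement that each level indexes one dimension.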

k-D Tree Example

Topological Relationships
◼ Disjoint
◼ Overlaps or Intersects
◼ Equals
◼ Covered by or inside or contained in
◼ Covers or contains
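
For axis-aligned rectangles these relationships can be tested with simple coordinate comparisons. The sketch below is illustrative: the rectangle layout `(xmin, ymin, xmax, ymax)` is an assumption, and rectangles that merely touch along an edge are treated here as overlapping, which is a convention choice rather than something the slides specify.

```python
def relationship(a, b):
    """Topological relationship of rectangle a to rectangle b."""
    if a == b:
        return "equals"
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0:
        return "disjoint"
    if bx0 <= ax0 and by0 <= ay0 and ax1 <= bx1 and ay1 <= by1:
        return "inside"      # a is covered by b
    if ax0 <= bx0 and ay0 <= by0 and bx1 <= ax1 and by1 <= ay1:
        return "contains"    # a covers b
    return "overlaps"


examples = [
    relationship((0, 0, 1, 1), (2, 2, 3, 3)),
    relationship((0, 0, 2, 2), (1, 1, 3, 3)),
    relationship((1, 1, 2, 2), (1, 1, 2, 2)),
    relationship((1, 1, 2, 2), (0, 0, 5, 5)),
    relationship((0, 0, 5, 5), (1, 1, 2, 2)),
]
```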

Distance Between Objects

◼ Euclidean
◼ Manhattan
◼ Extensions:
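
The two point distances can be written directly from their definitions. The third function shows one possible way to extend a point distance to objects made of several points, by taking the minimum over all point pairs; that particular extension is an assumption chosen for illustration and is not taken from the slide.

```python
import math


def euclidean(p, q):
    """Straight-line distance between points p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences between p and q."""
    return sum(abs(a - b) for a, b in zip(p, q))


def min_distance(obj_a, obj_b, dist=euclidean):
    """Assumed extension: smallest distance between any pair of points."""
    return min(dist(p, q) for p in obj_a for q in obj_b)


d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
d_objects = min_distance([(0, 0), (1, 1)], [(4, 5), (1, 3)])
```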

Spatial Data Dominant Algorithm

STING
◼ STatistical Information Grid-based
◼ Hierarchical technique to divide area
into rectangular cells
◼ Grid data structure contains summary
information about each cell
◼ Hierarchical clustering
◼ Similar to quad tree
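
The grid hierarchy can be sketched by summarizing the finest cells first and deriving each parent cell from its four children. This is a simplified illustration, not the published STING algorithm: only a count and a mean are kept per cell, and the function name, the square space, and the sample points are hypothetical.

```python
def build_sting(points, size, levels):
    """Return grids from top (1 cell) to bottom; each maps (row, col) -> (count, mean).

    points are (x, y, value) triples inside a size x size square.
    """
    n = 2 ** (levels - 1)          # cells per side at the bottom level
    cell = size / n
    bottom = {}
    for x, y, value in points:
        key = (int(y // cell), int(x // cell))
        count, total = bottom.get(key, (0, 0.0))
        bottom[key] = (count + 1, total + value)
    grids = [bottom]
    while n > 1:
        n //= 2
        parent = {}
        # Each parent summary is derived from its children, not from raw data.
        for (row, col), (count, total) in grids[0].items():
            key = (row // 2, col // 2)
            p_count, p_total = parent.get(key, (0, 0.0))
            parent[key] = (p_count + count, p_total + total)
        grids.insert(0, parent)
    return [{k: (count, total / count) for k, (count, total) in g.items()}
            for g in grids]


POINTS = [(0.5, 0.5, 10), (0.6, 0.4, 20), (3.5, 3.5, 30), (1.5, 0.5, 40)]
grids = build_sting(POINTS, size=4, levels=3)
```

A query can then start at the top grid and descend only into cells whose summaries look promising, much as with a quad tree.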

STING

STING Build Algorithm

STING Algorithm

Spatial Rules
◼ Characteristic Rule
The average family income in Dallas is $50,000.
◼ Discriminant Rule
The average family income in Dallas is $50,000,
while in Plano the average income is $75,000.
◼ Association Rule
The average family income in Dallas for families
living near White Rock Lake is $100,000.

Spatial Association Rules
◼ Either antecedent or consequent must
contain spatial predicates.
◼ View underlying database as set of
spatial objects.
◼ May create using a type of progressive
refinement

Spatial Association Rule Algorithm

Spatial Classification
◼ Partition spatial objects
◼ May use nonspatial attributes and/or
spatial attributes
◼ Generalization and progressive
refinement may be used.

Spatial Clustering
◼ Detect clusters of irregular shapes
◼ Use of centroids and simple distance
approaches may not work well.
◼ Clusters should be independent of order
of input.

CLARANS Extensions
◼ Remove main memory assumption of
CLARANS.
◼ Use spatial index techniques.
◼ Use sampling and R*-tree to identify
central objects.
◼ Change cost calculations by reducing
the number of objects examined.
◼ Voronoi Diagram

Voronoi

SD(CLARANS)
◼ Spatial Dominant
◼ First clusters spatial components using
CLARANS
◼ Then iteratively replaces medoids, but
limits number of pairs to be searched.
◼ Uses generalization
◼ Uses a learning tool to derive a description
of the cluster.

SD(CLARANS) Algorithm

Aggregate Proximity
◼ Aggregate Proximity – measure of how
close a cluster is to a feature.
◼ Aggregate proximity relationship finds the
k closest features to a cluster.
◼ CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull
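
Finding the k closest features can be sketched with a brute-force measure. The measure used here, the sum over cluster points of the distance to the nearest point of the feature, is only one possible definition and is an assumption; the sketch does not implement the CRH algorithm or its circle, rectangle, and hull filters. All data and names are hypothetical.

```python
import math


def aggregate_proximity(cluster, feature):
    """Assumed measure: total distance from each cluster point to the feature."""
    return sum(min(math.dist(p, f) for f in feature) for p in cluster)


def k_closest_features(cluster, features, k):
    """Names of the k features with the smallest aggregate proximity value."""
    ranked = sorted(features,
                    key=lambda name: aggregate_proximity(cluster, features[name]))
    return ranked[:k]


CLUSTER = [(0, 0), (1, 0)]
FEATURES = {
    "school": [(0, 1)],
    "park": [(5, 5), (6, 5)],
    "river": [(2, 0), (2, 1)],
}
closest_two = k_closest_features(CLUSTER, FEATURES, k=2)
```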
