
Deep Web

Under the guidance of


Prof. Pushpak Bhattacharyya

Presented by -
Jayanta Das (11305R012)
Souvik Pal (113059003)
Subhro Bhattacharyya (113059005)
(Group 4)
Introduction

What is Deep Web


Introduction: What is Deep Web
• Modern Internet: the most effective source of information.
• Most popular search engine: Google
• In 2008, Google added the trillionth (10^12) web link to its index database!
• It stores several billion documents!
• Even so, we are often not satisfied with the search results.
  – 43% of users report dissatisfaction with the results
Real Life Example
Motivation: Why Deep Web

• Then why does Google fail?

• Most of the Web's information is buried far down on dynamically generated sites.
  – Traditional web crawlers cannot reach there.
  – A large portion of the data is literally 'unexplored'
    • Quest for exploration of the unknown – a human instinct
  – Need for more specific information stored in databases
    • It can only be obtained if we have access to the database containing the information.
Evolution of Deep Web
• Early days: static HTML pages that crawlers could easily reach
• Mid-90s: introduction of dynamic pages, generated in response to a query
• In 1994, Jill Ellsworth used the term "Invisible Web" to refer to these websites
• In 2001, Bergman coined the term "Deep Web"
Measuring the Deep Web (1)
• "… when you can measure what you are speaking about, and express it in numbers, you know something about it …" – Lord Kelvin
• First attempt: Bergman (2000)
  – Size of the surface web is around 19 TB
  – Size of the deep web is around 7,500 TB
  – The deep web is nearly 400 times larger than the surface web (7,500 / 19 ≈ 395)
Measuring the Deep Web (2)
• In 2004, He et al. classified the deep web more accurately
• Most of the HTML forms are found either on the first hop or the second hop from the home page
Measuring the Deep Web (3)
• Unstructured: data objects as unstructured media (text, images, audio, video)
  – e.g., www.cnn.com
• Structured: data objects as structured "relational" records with attribute-value pairs
Deep Resources
• Dynamic web pages
  – returned in response to a submitted query or accessed only through a form
• Unlinked content
  – pages without any backlinks
• Private web
  – sites requiring registration and login (password-protected resources)
• Limited-access web
  – sites with CAPTCHAs or no-cache pragma HTTP headers
• Scripted pages
  – pages produced by JavaScript, Flash, AJAX, etc.
• Non-HTML content
  – multimedia files, e.g., images or videos
Approach towards Crawling the Deep Web
Timeline: How it all started!
• 2001: Raghavan et al. -> Hidden Web Exposer (HiWE)
  – a domain-specific, human-assisted crawler
• 2002: StumbleUpon used human crawlers
  – human crawlers can find relevant links that algorithmic crawlers miss
• 2003: Bergman introduced LexiBot
  – used for quantifying the deep web
• 2004: Yahoo! Content Acquisition Program
  – paid inclusion for webmasters
Timeline contd…
• 2005: Yahoo! Subscriptions
  – Yahoo! started searching subscription-only sites
    • e.g., WSJ
• 2005: Ntoulas et al. -> Hidden Web Crawler
  – automatically generated meaningful queries to issue against search forms
• 2005: Google Sitemaps
  – allows webmasters to inform search engines about URLs on their websites that are available for crawling
Present Deep Web Search Scenario
• Federated Search
• Google’s surfacing
Federated Search
• Federated search is the process of performing a real-time search of multiple diverse and distributed sources from a single search page, with the federated search engine acting as intermediary.

• Why "federated"?
  – Content from different sources is combined instead of searching the sources one at a time.
Federated Search: Properties (1)
• Real time
  – Fed search occurs live and results are current.
• Diverse and distributed sources
  – Multiple sources present in different locations on the web are searched. Sources are diverse in nature, containing text, documents, PDFs, presentations, etc.
Federated Search: Properties (2)
• Single Search page
– Fed search engines provide a single point of
searching.
• Fed Search engine acts as intermediary
– User does not communicate directly with the
content sources when performing searches. The
search engine does it on the user’s behalf.
Federated Search Method
• Works by filling out forms on web pages.

• The search engine is programmed with the knowledge of each form that it has to search.

• It knows how to fill out the form, press the 'submit' button and retrieve the results, as sketched below.
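To make the mechanics concrete, here is a minimal sketch in Python of how a federated search engine might submit one pre-programmed form and scrape the result links. The URL, parameter names, and link extraction are hypothetical placeholders, not any real engine's internals.

# Minimal sketch of programmatic form submission (all names hypothetical).
import requests
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collects href values from <a> tags on a results page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def search_source(query):
    # The engine is programmed with this source's form details:
    # action URL, HTTP method (GET here) and input field names.
    form_action = "https://example.org/search"      # hypothetical source
    response = requests.get(form_action, params={"q": query}, timeout=10)
    response.raise_for_status()
    parser = ResultLinkParser()
    parser.feed(response.text)
    return parser.links           # the retrieved result links

print(search_source("deep web")[:5])

In a real engine one such routine exists per source, each programmed with that source's form action, field names, and result-page layout.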
Web Form example

A web form that a normal search engine cannot crawl. Searching it involves filling in the textbox, clicking 'search' and retrieving the results.
Federated search example

WorldWideScience.org: searches science content from all over the world, from government agencies, research and academic organizations.
Fed Search In Action

Incremental search: federated search engines do not wait for results from all sources. To improve response time, results are displayed in chunks while the search continues in the background. When a new result set is available, the user is prompted.
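The incremental behaviour can be sketched with asyncio: each source's results are displayed the moment that source responds, instead of waiting for the slowest one. Source names and latencies below are invented for illustration.

# Sketch of incremental federated search: show each source's results
# as soon as they arrive instead of waiting for every source.
import asyncio
import random

SOURCES = ["agency-a", "journal-b", "university-c"]   # hypothetical sources

async def query_source(name, query):
    # Stand-in for a real network call; every source has its own latency.
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return name, [f"{name}: result {i} for {query!r}" for i in range(3)]

async def federated_search(query):
    tasks = [asyncio.create_task(query_source(s, query)) for s in SOURCES]
    # as_completed yields tasks in completion order, so results appear
    # in chunks while slower searches continue in the background.
    for finished in asyncio.as_completed(tasks):
        name, results = await finished
        print(f"[{name} responded]")
        for line in results:
            print("  ", line)

asyncio.run(federated_search("deep web"))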
Metasearch vs Fed Search
• Metasearch is similar to federated search.
• Here the search engine searches other search engines in real time.
• Even though it queries the underlying search engines in real time, those engines may not have the most current information, as they are themselves crawler-based.
• It is NOT a deep web search!
  – People often confuse metasearch with fed search.
Metasearch example
Federated Search (Advantages)
• Efficiency, time savings
  – Instead of querying many search engines one at a time, the federated search engine does it on the user's behalf.
• Quality of results
  – Searches only authoritative sources, since it has been programmed to do so.
• Most current content
  – Searches in real time.
Federated Search (Challenges)
• Aggregation
  – The process of combining search results from different sources in some helpful way, e.g., sorting by date, title, author
• Ranking
  – Displaying results relevant to the search
• De-duplication
  – A federated search engine may retrieve the same result from multiple sources; duplicates must be detected and removed, as in the sketch below.
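As an illustration of the de-duplication and aggregation challenges, the following sketch merges result lists from several sources, dropping duplicates by a normalised URL key and then aggregating by date. The normalisation rules are deliberately simplistic assumptions, not a production scheme.

# Sketch: merge results from multiple sources, de-duplicate by
# normalised URL, then aggregate (sort) by date.
from urllib.parse import urlsplit

def normalise(url):
    # Crude normalisation: lowercase the host, drop any trailing slash.
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return f"{parts.scheme}://{parts.netloc.lower()}{path}?{parts.query}".rstrip("?")

def merge(result_lists):
    seen, merged = set(), []
    for results in result_lists:          # one list per source
        for item in results:              # item: {"url": ..., "date": ...}
            key = normalise(item["url"])
            if key not in seen:           # skip results already collected
                seen.add(key)
                merged.append(item)
    return sorted(merged, key=lambda r: r["date"], reverse=True)

source_a = [{"url": "http://Example.org/paper/", "date": "2009-02-23"}]
source_b = [{"url": "http://example.org/paper",  "date": "2009-02-23"},
            {"url": "http://example.org/other",  "date": "2008-07-25"}]
print(merge([source_a, source_b]))   # the duplicate paper appears once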
Google’s reasons to move away from Fed Search
• Federated search works quite well when it is restricted to one domain.

• In the case of general search involving multiple domains, it is not as effective:
  – The number of domains is extremely large.
  – Defining the boundary of a domain is difficult.
  – Mapping a query to a domain is difficult.
  – It is dependent on the latency of deep web sources.
Case Study: Google’s Crawling
Case Study: Google’s crawling (1)
• Two approaches for Deep Web Crawling:
  – Virtual Integration
  – Surfacing
Case Study: Google’s crawling (2)
• Virtual Integration (domain specific)
  – A mediator form is created for each domain.
  – Semantic mappings are built between the individual data sources' forms and the mediator form.
  – Queries are translated through these mappings in real time, as sketched below.
  – Drawbacks:
    • Cost of building the mediator form and its mappings to the deep-web sources.
    • Identifying relevant queries for a particular domain.
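A minimal sketch of the semantic-mapping idea, assuming a made-up "used cars" domain: one mediated form, with per-source mappings that translate its fields into each source's native parameter names at query time. Domain, field names, and sources are all invented for illustration.

# Sketch of virtual integration: one mediated form per domain, with
# per-source semantic mappings applied at query time (all names invented).

# Each deep-web source names the same concepts differently.
SOURCE_MAPPINGS = {
    "cars-r-us": {"make": "mk", "model": "mdl", "max_price": "price_lt"},
    "auto-mart": {"make": "brand", "model": "series", "max_price": "budget"},
}

def translate(query, source):
    """Rewrite a mediated-form query into one source's native parameters."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[field]: value for field, value in query.items()}

query = {"make": "Honda", "model": "Civic", "max_price": 5000}
for source in SOURCE_MAPPINGS:
    print(source, "->", translate(query, source))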
Case Study: Google’s crawling (3)
• Surfacing:
  – Precomputes the most relevant form values for 'interesting' HTML forms
  – The resulting URLs are generated offline and indexed
  – Helps retain the existing infrastructure while including the Deep Web
  – Covers the maximum number of web pages while bounding the total number of web form submissions
  – GET vs POST: only GET forms are surfaced, since a filled-in GET form reduces to a self-contained URL that can be indexed like any static page
Case Study: Google’s crawling (4)
• Challenges:
  – Which form inputs to fill
  – Appropriate values for those inputs
• Google’s approach:
  – Selecting wildcards for form submission
    • Some fields are mandatory
  – Query templates
  – Testing with all possible values in a select menu
  – Predicting form values from data types
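A hedged sketch of how surfacing with query templates might look: take a GET form, bind a subset of its inputs (the query template) to candidate values, and enumerate the combinations offline into indexable URLs. The form URL, input names, and candidate values below are hypothetical.

# Sketch of surfacing: precompute GET URLs for a form offline by
# enumerating value combinations for a chosen query template.
from itertools import product
from urllib.parse import urlencode

FORM_ACTION = "https://example.org/find"          # hypothetical GET form
# Query template: the subset of inputs we bind; others get wildcards/defaults.
TEMPLATE = {
    "category": ["books", "music", "movies"],     # values from a select menu
    "keyword": ["history", "physics"],            # predicted from data types
}

def surface_urls(action, template):
    fields = sorted(template)
    for values in product(*(template[f] for f in fields)):
        yield f"{action}?{urlencode(dict(zip(fields, values)))}"

# 3 x 2 = 6 URLs: each submission is now an indexable, bookmarkable page.
for url in surface_urls(FORM_ACTION, TEMPLATE):
    print(url)

Because the template binds only two inputs with a handful of values each, the number of form submissions stays bounded, which is the trade-off the surfacing approach aims for.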
Subconscious Mind and Deep Web
• Inspiration behind exploration of the deep web

• Analogy
  – Iceberg example
  – Real life example
References (1)
1. Wikipedia, http://en.wikipedia.org/wiki/Deep_web
2. Bergman, Michael K., "The Deep Web: Surfacing Hidden Value", The Journal of Electronic Publishing, August 2001.
3. Alex Wright, "Exploring a 'Deep Web' That Google Can’t Grasp", The New York Times, Feb 23, 2009. http://www.nytimes.com/2009/02/23/technology/internet/23search.html
4. Jesse Alpert & Nissan Hajaj, "We knew the web was big…", 2008. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
5. He, Bin; Patel, Mitesh; Zhang, Zhen; Chang, Kevin Chen-Chuan, "Accessing the Deep Web: A Survey", Communications of the ACM (CACM), May 2007.
References (2)
6. Madhavan, Jayant; Ko, David; Kot, Łucja; Ganapathy, Vignesh; Rasmussen, Alex; Halevy, Alon, "Google’s Deep-Web Crawl", 2008.
7. Maureen Flynn-Burhoe, "Timeline of events related to the Deep Web", 2008. http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/
8. Darcy Pedersen, "Federated Search Finds Content that Google Can’t Reach, Part I of III", 2009. http://deepwebtechblog.com/federated-search-finds-content-that-google-can’t-reach-part-i-of-iii/
9. Darcy Pedersen, "A Federated Search Primer – Part II of III", 2009. http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii/
10. Darcy Pedersen, "A Federated Search Primer – Part III of III", 2009. http://deepwebtechblog.com/a-federated-search-primer-part-iii-of-iii/
THANK YOU
