Unit 5 - Data Science & Big Data - WWW - Rgpvnotes.in
Unit-5
Topics to be covered
Introduction to Web Search & Big Data: Crawling and Indexes, Search Engine architectures, Link
Analysis and ranking algorithms such as HITS and PageRank, Hadoop File system & MapReduce
Paradigm
------------------------------------------------------------------------------------------------------------------------------------------
Indexing
After a web page or document has been detected by crawlers, all its accessible data is stored
(cached) on search engine servers so it can be retrieved when a user performs a search query.
Indexing serves two purposes:
to return results related to a search engine user's query
to rank those results in order of importance and relevancy
The order of ranking depends on each search engine's ranking algorithm. These
algorithms are highly complex formulas, made even more advanced by the relationship your
website has with external sites and its on-page SEO factors.
To sum up, indexing exists to ensure that users' questions are answered as quickly as possible.
Ranking
As SEOs, this is the area we are most concerned with and the part that allows us to show clients
tangible progress.
Once a keyword is entered into a search box, search engines will check for pages within their
index that are the closest match; a score will be assigned to these pages based on an algorithm
consisting of hundreds of different ranking signals.
These pages (or images & videos) will then be displayed to the user in order of score.
So in order for your site to rank well in search results pages, it's important to make sure search
engines can crawl and index your site correctly - otherwise they will be unable to appropriately
rank your website's content in search results.
Each search engine follows a similar methodology to the process described above.
Search Engine Architecture
A search engine broadly works with the following components:
1. Web Crawler
2. Database
3. Search Interfaces
Web crawler
It is also known as a spider or bot. It is a software component that traverses the web to gather
information.
Database
All the information on the web is stored in a database. It consists of huge web resources.
Search Interfaces
This component is an interface between the user and the database. It helps the user to search
through the database.
Evaluation
It monitors and measures the effectiveness and efficiency of the search engine. This is done offline.
Examples
The following are several of the search engines available today:
Google: It was originally called BackRub. It is the most popular search engine globally.
Bing: It was launched in 2009 by Microsoft. It is the latest web-based search engine that also delivers Yahoo's results.
Ask: It was launched in 1996 and was originally known as Ask Jeeves. It includes support for math, dictionary, and conversation questions.
AltaVista: It was launched by Digital Equipment Corporation in 1995. Since 2003, it has been powered by Yahoo technology.
AOL.Search: It is powered by Google.
LYCOS: It is a top 5 internet portal and the 13th largest online property according to Media Matrix.
Alexa: It is a subsidiary of Amazon and is used for providing website traffic information.
Link Analysis
Link analysis is literally about analyzing the links between objects, whether they are physical,
digital or relational. This requires diligent data gathering. For example, in the case of a website
where all of the links and back links that are present must be analyzed, a tool has to sift through
all of the HTML code and various scripts in the page and then follow all the links it finds in
order to determine what sort of links are present and whether they are active or dead. This
information can be very important for search engine optimization, as it allows the analyst to
determine whether the search engine is actually able to find and index the website.
In networking, link analysis may involve determining the integrity of the connection between
each network node by analyzing the data that passes through the physical or virtual links. With
the data, analysts can find bottlenecks and possible fault areas and are able to patch them up
more quickly or even help with network optimization.
Link analysis has three primary purposes:
Find matches for known patterns of interests between linked objects.
Find anomalies by detecting violated known patterns.
Find new patterns of interest (for example, in social networking, marketing, and
business intelligence).
HITS Algorithm
The HITS (Hyperlink-Induced Topic Search) algorithm assigns each page two scores, a Hub score and an Authority score, and refines them iteratively:
1. Start by giving every page in the base set a Hub score and an Authority score of 1.
2. Authority update: set each page's Authority score to the sum of the Hub scores of the pages that link to it.
3. Hub update: set each page's Hub score to the sum of the Authority scores of the pages it links to.
4. Normalize the values by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and dividing each Authority score by the square root of the sum of the squares of all Authority scores.
5. Repeat from the second step as necessary.
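A minimal sketch of these update and normalisation steps in Python, run on a small made-up link graph (the pages and links below are purely illustrative and not taken from the text):

```python
import math

# Hypothetical link graph (made-up pages): page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)

# Step 1: start every Hub and Authority score at 1
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):  # repeat the update steps a fixed number of times
    # Step 2 (Authority update): sum the Hub scores of pages linking to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Step 3 (Hub update): sum the Authority scores of the pages p links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Step 4 (Normalise): divide by the square root of the sum of squares
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", auth)  # page C, with the most in-links, should score highest
print("hubs:", hub)
```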
HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the
documents on the web. However it does have some major differences:
It is query dependent, that is, the (Hubs and Authority) scores resulting from the link
analysis are influenced by the search terms;
As a corollary, it is executed at query time, not at indexing time, with the associated hit
on performance that accompanies query-time processing.
It is not commonly used by search engines. (Though a similar algorithm was said to be
used by Teoma, which was acquired by Ask Jeeves/Ask.com.)
It computes two scores per document, hub and authority, as opposed to a single score;
It is processed on a small subset of 'relevant' documents (a 'focused subgraph' or base
set), not all documents as was the case with PageRank.
PageRank Algorithm
Introduction
PageRank is a topic much discussed by Search Engine Optimization (SEO) experts. At the heart
of PageRank is a mathematical formula that seems scary to look at but is actually fairly simple
to understand.
Despite this many people seem to get it wrong! In particular Chris Ridings of
www.searchenginesystems.net has written a paper entitled "PageRank Explained: Everything
you've always wanted to know about PageRank", pointed to by many people, that contains a
fundamental mistake early on in the explanation! Unfortunately this means some of the
recommendations in the paper are not quite accurate.
By showing code to correctly calculate real Page Rank I hope to achieve several things in this
response:
1. Clearly explain how PageRank is calculated.
2. Go through every example in Chris' paper, and add some more of my own, showing the
correct PageRank for each diagram. By showing the code used to calculate each
diagram I've opened myself up to peer review - mostly in an effort to make sure the
examples are correct, but also because the code can help explain the PageRank
calculations.
3. Describe some principles and observations on website design based on these correctly
calculated examples.
Any good web designer should take the time to fully understand how PageRank really works - if
you don't then your site's layout could be seriously hurting your Google listings!
[Note: I have nothing in particular against Chris. If I find any other papers on the subject I'll try
to comment evenly]
How is PageRank Used?
PageRank is one of the methods Google uses to determine a page's relevance or importance. It
is only one part of the story when it comes to the Google listing, but the other aspects are
discussed elsewhere (and are ever changing) and PageRank is interesting enough to deserve a
paper of its own.
PageRank is also displayed on the toolbar of your browser if you've installed the Google toolbar
(http://toolbar.google.com/). But the Toolbar PageRank only goes from 0 – 10 and seems to be
something like a logarithmic scale:
PR: Shorthand for PageRank: the actual, real, page rank for each page as calculated by Google. As we'll see later this can range from 0.15 to billions.
Toolbar PR: The PageRank displayed in the Google toolbar in your browser. This ranges from 0 – 10.
So what is PageRank?
In short PageRank is a "vote", by all the other pages on the Web, about how important a page
is. A link to a page counts as a vote of support. If there's no link there's no support (but it's an
abstention from voting rather than a vote against the page).
Quoting from the original Google paper, PageRank is defined like this:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter
d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There
are more details about d in the next section. Also C(A) is defined as the number of links
going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of
all web pages' PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds
to the principal eigenvector of the normalized link matrix of the web.
But that's not too helpful so let's break it down into sections.
1. PR(Tn) - Each page has a notion of its own self-importance. That's "PR(T1)" for the first
page in the web all the way up to "PR(Tn)" for the last page.
2. C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The
count, or number, of outgoing links for page 1 is "C(T1)", "C(T2)" for page 2, and so on
for all pages.
3. PR(Tn)/C(Tn) - so if our page (page A) has a backlink from page "n" the share of the vote
page A will get is "PR(Tn)/C(Tn)".
4. d(... - All these fractions of votes are added together but, to stop the other pages having
too much influence, this total vote is "damped down" by multiplying it by 0.85 (the
factor "d").
5. (1 - d) - The (1 - d) bit at the beginning is a bit of probability math magic so "the sum of
all web pages' PageRanks will be one": it adds in the bit lost by the d(.... It also means
that if a page has no links to it (no backlinks) even then it will still get a small PR of 0.15
(i.e. 1 - 0.85). (Aside: the Google paper says "the sum of all pages" but they mean "the
normalised sum" - otherwise known as "the average" to you and me.)
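As a quick worked example (the numbers here are invented purely for illustration): suppose page A has two backlinks, one from page T1 which has PR(T1) = 1 and 2 outgoing links, and one from page T2 which has PR(T2) = 0.5 and 4 outgoing links. With d = 0.85:

PR(A) = (1 - 0.85) + 0.85 × (1/2 + 0.5/4) = 0.15 + 0.85 × 0.625 ≈ 0.68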
How is PageRank Calculated?
This is where it gets tricky. The PR of each page depends on the PR of the pages pointing to it.
But we won't know what PR those pages have until the pages pointing to them have their PR
calculated (and so on...). And when you consider that page links can form circles it seems
impossible to do this calculation!
But actually it's not that bad. Remember this bit of the Google paper:
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds
to the principal eigenvector of the normalized link matrix of the web.
The simplest example is two pages, A and B, which link to each other. Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) = 1).
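Below is a minimal sketch, in Python, of the "simple iterative algorithm" applied to this two-page example. The starting guesses and the iteration count are arbitrary choices for illustration; this is not Google's actual implementation:

```python
# Two pages, A and B, that link only to each other, so C(A) = C(B) = 1.
d = 0.85                 # damping factor from the formula above
pr_a, pr_b = 0.0, 0.0    # arbitrary starting guesses (starting at 1.0 works just as well)

for _ in range(40):
    # Apply PR(A) = (1 - d) + d * (PR(B)/C(B)), and the same for B
    pr_a = (1 - d) + d * (pr_b / 1)
    pr_b = (1 - d) + d * (pr_a / 1)

print(round(pr_a, 3), round(pr_b, 3))  # both settle at 1.0, the average PageRank for a two-page web
```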
Hadoop Architecture
The Hadoop framework includes the following four modules:
Hadoop Common: These are the Java libraries and utilities required by other Hadoop
modules. These libraries provide filesystem and OS-level abstractions and contain the
necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-
throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data
sets.
Together, these four modules form the core of the Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but
also to the collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
MapReduce Paradigm
MapReduce Algorithm
MapReduce is a distributed data processing algorithm, introduced by Google in its MapReduce
tech paper. The MapReduce algorithm is mainly inspired by the functional programming model.
It is mainly useful for processing huge amounts of data in a parallel, reliable and
efficient way in cluster environments. It divides the input task into smaller, manageable sub-
tasks (which should be executable independently) so they can be executed in parallel.
MapReduce Algorithm Steps
MapReduce Algorithm uses the following three main steps:
1. Map Function
2. Shuffle Function
3. Reduce Function
Here we are going to discuss each function's role and responsibility in the MapReduce algorithm.
Map Function
Map Function is the first step in the MapReduce Algorithm. It takes input tasks and divides them
into smaller sub-tasks. Then it performs the required computation on each sub-task in parallel.
This step performs the following two sub-steps:
1. Splitting
2. Mapping
The Splitting step takes the input DataSet from the source and divides it into smaller Sub-Datasets.
The Mapping step takes those smaller Sub-Datasets and performs the required action or
computation on each Sub-Dataset.
The output of this Map Function is a set of key and value pairs of the form <Key, Value>.
Shuffle Function
It is the second step in the MapReduce Algorithm. The Shuffle Function is also known as the
"Combine Function".
It performs the following two sub-steps:
1. Merging
2. Sorting
It takes the list of outputs coming from the Map Function and performs these two sub-steps on
each and every key-value pair.
The Merging step combines all key-value pairs which have the same keys (that is, it groups key-
value pairs by comparing Keys). This step returns <Key, List<Value>>.
The Sorting step takes the input from the Merging step and sorts all key-value pairs by their Keys.
This step also returns <Key, List<Value>> output, but with sorted key-value pairs.
Finally, the Shuffle Function returns a list of sorted <Key, List<Value>> pairs to the next step.
Reduce Function
It is the final step in the MapReduce Algorithm. It performs only one step: the Reduce step.
It takes the list of sorted <Key, List<Value>> pairs from the Shuffle Function and performs the
reduce operation on each pair to produce the final output.
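To make the three functions concrete, here is a minimal word-count sketch in Python that simulates Map (Splitting + Mapping), Shuffle (Merging + Sorting) and Reduce on in-memory data. It only illustrates the paradigm; it is not Hadoop's actual API, and the function names and sample lines are made up:

```python
from collections import defaultdict

def map_fn(line):
    """Map: split a line (Splitting) and emit a <word, 1> pair for every word (Mapping)."""
    return [(word, 1) for word in line.split()]

def shuffle_fn(pairs):
    """Shuffle: merge pairs that share a key, then sort by key -> <Key, List<Value>>."""
    groups = defaultdict(list)
    for key, value in pairs:        # Merging
        groups[key].append(value)
    return sorted(groups.items())   # Sorting

def reduce_fn(key, values):
    """Reduce: collapse each List<Value> into a single result per key."""
    return key, sum(values)

# Made-up input lines standing in for the input DataSet
lines = ["big data is big", "data science uses big data"]

mapped = [pair for line in lines for pair in map_fn(line)]
shuffled = shuffle_fn(mapped)
results = [reduce_fn(k, vs) for k, vs in shuffled]
print(results)  # [('big', 3), ('data', 3), ('is', 1), ('science', 1), ('uses', 1)]
```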
How Does Hadoop Work?
Stage 1
A user/application can submit a job to Hadoop (a Hadoop job client) for the required processing by
specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes in the form of a jar file containing the implementation of the map and reduce
functions.
3. The job configuration, set through different parameters specific to the job.
Stage 2
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the
JobTracker, which then assumes the responsibility of distributing the software/configuration to
the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information
to the job-client.
Stage 3
The TaskTrackers on different nodes execute the task as per the MapReduce implementation, and
the output of the reduce function is stored in the output files on the file system.
Advantages of Hadoop
Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures at
the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible
with all platforms since it is Java based.