TOPPER'S SOLUTIONS
....In Search of Another Topper
There are many existing paper solutions available in the market, but Topper's Solutions is the one students will always prefer once they refer to it ;) Topper's Solutions is not just paper solutions; it includes many other questions which are important from the examination point of view. Topper's Solutions are solutions written by toppers for the students who want to be the upcoming toppers of the semester.
It has been said that "Action Speaks Louder than Words", so the Topper's Solutions team works on the same principle. Diagrammatic representation of an answer is considered easier and quicker to understand, so our major focus is on diagrams and on how answers should be presented in examinations.
"Education is Free.... But it is the Technology used & Efforts utilized which we charge"
It takes a lot of effort to search out each and every question and transform it into short and simple language. The entire community is working for the betterment of students, so do help us.
Syllabus:

Exam  | TT-1 | TT-2 | AVG | Term Work | Oral/Practical | End of Exam | Total
Marks | 20   | 20   | 20  | 25        | 25             | 80          | 150
Q1. Give the difference between the traditional data management and analytics approach versus the Big Data approach.
COMPARISON BETWEEN TRADITIONAL DATA MANAGEMENT AND ANALYTICS APPROACH & BIG DATA APPROACH:

Traditional data management and analytics approach | Big data approach
Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured and unstructured data.
Traditional data is generated at the enterprise level. | Big data is generated both outside and inside the enterprise.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
Data integration is very easy. | Data integration is very difficult.
The size of the data is very small. | The size is larger than traditional data sizes.
Its data model is strict schema based and it is static. | Its data model is flat schema based and it is dynamic.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data etc. | Its data sources include social media, device data, sensor data, video, images, audio etc.
A sample from a known population is considered as the object of analysis. | The entire population is considered as the object of analysis.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
The traditional data source is centralized and it is managed in centralized form. | The big data source is distributed and it is managed in distributed form.
-- EXTRA QUESTIONS --
BIG DATA:
1. Data is defined as the quantities, characters, or symbols on which operations are performed by a computer.
2. Data may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
3. Big Data is also data, but with a huge size.
4. Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
5. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.
6. Examples:
a. The New York Stock Exchange generates about one terabyte of new trade data per day.
b. Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in terms of photo and video uploads, message exchanges, posting comments etc.
TYPES:
Big Data is classified into three types: Structured, Unstructured and Semi-structured.
I) Structured:
1. Any data that can be stored, accessed and processed in the form of a fixed format is termed Structured Data.
2. It accounts for about 20% of the total existing data and is used the most in programming and computer-related activities.
3. There are two sources of structured data: machines and humans.
4. All the data received from sensors, weblogs, and financial systems is classified under machine-generated data.
5. These include medical devices, GPS data, and data on usage statistics captured by servers and applications.
6. Human-generated structured data mainly includes all the data a human inputs into a computer, such as his name and other personal details.
7. When a person clicks a link on the internet, or even makes a move in a game, data is created.
8. Example: An 'Employee' table in a database is an example of Structured Data.
Employee_ID | Employee_Name | Gender
II) Unstructured:
1. Any data with an unknown form or structure is classified as unstructured data.
2. The rest of the data created, about 80% of the total, accounts for unstructured big data.
3. Unstructured data is also classified based on its source into machine-generated or human-generated.
4. Machine-generated data accounts for all the satellite images, the scientific data from various experiments and the radar data captured by various facets of technology.
5. Human-generated unstructured data is found in abundance across the internet, since it includes social media data, mobile data, and website content.
6. This means that the pictures we upload to Facebook or Instagram, the videos we watch on YouTube and even the text messages we send all contribute to the gigantic heap that is unstructured data.
7. Examples of unstructured data include text, video, audio, mobile activity, social media activity, satellite imagery, surveillance imagery etc.
8. Unstructured data is further divided into:
a. Captured data:
• It is the data based on the user's behavior.
• The best example to understand it is GPS via smartphones, which helps the user at each and every moment and provides real-time output.
b. User-generated data:
• It is the kind of unstructured data where the user itself puts data on the internet at every moment.
• For example, Tweets and Re-tweets, Likes, Shares, Comments, on YouTube, Facebook, etc.
9. Tools that process unstructured data:
a. Hadoop
b. HBase
c. Hive
d. Pig
e. MapR
f. Cloudera
III) Semi-Structured:
1. Semi-structured data does not reside in fixed tables, but it contains tags or markers that separate and label the data elements.
2. Examples: XML files and JSON documents.
Q2. Explain Characteristics of Big Data or Define the three V's of Big Data
Ans: [P | High]
I) Variety:
1. Variety of Big Data refers to the structured, unstructured, and semi-structured data that is gathered from multiple sources.
2. The type and nature of the data show great variety.
3. During earlier days, spreadsheets and databases were the only sources of data considered by most applications.
4. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications.
[Figure: Types of Big Data - Structured, Unstructured, Semi-structured]
II) Velocity:
1. Velocity refers to the speed at which data is generated and processed.
2. Data pours in at an enormous rate from sources such as business processes, application logs, networks, social media sites, sensors and mobile devices.
[Figure: Data generated every minute - 100,000+ tweets, 695,000+ Facebook status updates, 11,000,000+ instant messages, 700,445+ Google searches, 168,000,000+ emails, 220+ new mobile users, 2000+ TB of data created]
[Figure: Data growth chart, 2011-2020]
III) Programmable:
1. With big data it is possible to explore all types of data using programming logic.
2. Programming can be used to perform any kind of exploration because of the scale of the data.
IV) Veracity:
1. The data captured is not in a certain format.
2. Captured data can vary greatly.
3. Veracity refers to the trustworthiness and quality of data.
4. It is necessary that the veracity of the data is maintained.
5. For example, think about Facebook posts, with hashtags, abbreviations, images, videos, etc., which make them unreliable and hamper the quality of their content.
6. Collecting loads and loads of data is of no use if the quality and trustworthiness of the data is not up to the mark.
Ans: [P | Medium]
II) Academia:
1. Big Data is also helping enhance education today.
2. Education is no longer limited to the physical bounds of the classroom - there are numerous online educational courses to learn from.
3. Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.
III) Banking:
1. The banking sector relies on Big Data for fraud detection.
2. Big Data tools can efficiently detect fraudulent acts in real time, such as misuse of credit/debit cards, archival of inspection tracks, faulty alteration of customer stats, etc.
IV) Manufacturing:
1. According to the TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improving supply strategies and product quality.
2. In the manufacturing sector, Big Data helps create a transparent infrastructure, thereby predicting uncertainties and incompetencies that can affect the business adversely.
V) IT:
1. Among the largest users of Big Data, IT companies around the world are using Big Data to optimize their functioning, enhance employee productivity, and minimize risks in business operations.
2. By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions even for the most complex of problems.
Ans: [P | Medium]
HADOOP:
1. Hadoop is an open-source software programming framework for storing a large amount of data and performing the computation.
2. Its framework is based on Java programming with some native code in C and shell scripts.
3. The Apache Software Foundation is the developer of Hadoop, and its co-founders are Doug Cutting and Mike Cafarella.
4. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.
5. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
FEATURES:
1. Low Cost
2. High Computing Power
3. Scalability
4. Huge & Flexible Storage
5. Fault Tolerance & Data Protection
HADOOP ARCHITECTURE:
[Figure: Hadoop architecture - MapReduce (distributed computation), HDFS (distributed storage), YARN framework, common utilities]
MapReduce:
1. MapReduce is a parallel programming model for writing distributed applications.
2. It is used for efficient processing of large amounts of data (multi-terabyte data-sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
3. The MapReduce program runs on Hadoop, which is an Apache open-source framework.
ADVANTAGES:
1. Ability to store a large amount of data.
2. High flexibility.
3. Cost effective.
4. High computational power.
5. Tasks are independent.
6. Linear scaling.
DISADVANTAGES:
1. Not very effective for small data.
Ans: [P | Medium]
1. Hadoop is an open-source software framework which provides huge data storage.
2. Running Hadoop means running a set of resident programs.
3. These resident programs are also known as daemons.
4. These daemons may be running on the same server or on different servers in the network.
5. Figure 1.5 shows the Hadoop cluster topology.
COMPONENTS:
I) Name Node:
1. It is the master of HDFS (Hadoop file system).
2. It contains the Job Tracker, which keeps track of files distributed to different data nodes.
3. The Name Node directs the Data Nodes regarding low-level I/O tasks.
4. Failure of the Name Node will lead to the failure of the full Hadoop system.
III) Job Tracker:
1. The Job Tracker determines which files to process.
2. There can be only one Job Tracker per Hadoop cluster.
3. The Job Tracker runs on a server as a master node of the cluster.
Ans: [P | High]
[Figure: Hadoop cluster - MapReduce layer (JobTracker with TaskTrackers) and HDFS cluster (Name Node with Data Nodes 1..N)]
Characteristics:
1. Highly fault tolerant.
2. High throughput.
3. Supports applications with massive datasets.
4. Streaming access to file system data.
5. Can be built out of commodity hardware.
Architecture:
Figure 1.7 shows the HDFS architecture.
[Figure 1.7: HDFS architecture - the client performs metadata ops against the Name Node (metadata: name, replicas, ...) and block ops (read/write) against Data Nodes arranged in racks 1 and 2]
HDFS follows the master-slave architecture and it has the following elements.
I) Namenode:
1. It is a daemon which runs on the master node of the Hadoop cluster.
2. There is only one namenode in a cluster.
3. It contains the metadata of all the files stored on HDFS, which is known as the namespace of HDFS.
4. It maintains two files, i.e. the EditLog and the FsImage.
5. The EditLog is used to record every change that occurs to the file system metadata (transaction history).
6. The FsImage stores the entire namespace, the mapping of blocks to files and the file system properties.
7. The FsImage and the EditLog are central data structures of HDFS.
8. The system hosting the namenode acts as the master server and it does the following tasks:
a. Manages the file system namespace.
b. Regulates clients' access to files.
c. It also executes file system operations such as renaming, closing, and opening files and directories.
II) Datanode:
1. It is a daemon which runs on the slave machines of the Hadoop cluster.
2. There are a number of datanodes in a cluster.
3. It is responsible for serving read/write requests from the clients. It also performs block creation, deletion, and replication upon instruction from the Namenode.
4. It also sends a Heartbeat message to the namenode periodically about the blocks it holds.
5. Namenode and Datanode machines typically run a GNU/Linux operating system (OS).
III) Block:
1. Generally the user data is stored in the files of HDFS.
2. A file in the file system is divided into one or more segments and/or stored in individual data nodes.
3. These file segments are called blocks.
4. In other words, the minimum amount of data that HDFS can read or write is called a Block.
5. The default block size is 64 MB, but it can be increased as per need by changing the HDFS configuration (a small sketch of the resulting block layout follows).
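A minimal sketch, assuming nothing beyond the 64 MB default cited above, of how a file of a given size maps onto blocks:

```python
# Sketch: how a file is split into fixed-size HDFS blocks.
# BLOCK_SIZE matches the 64 MB default cited above.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def split_into_blocks(file_size_bytes):
    """Return the sizes (in bytes) of the blocks for a file of this size."""
    full_blocks = file_size_bytes // BLOCK_SIZE
    remainder = file_size_bytes % BLOCK_SIZE
    return [BLOCK_SIZE] * full_blocks + ([remainder] if remainder else [])

# A 200 MB file occupies 4 blocks: three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks), [b // (1024 * 1024) for b in blocks])  # 4 [64, 64, 64, 8]
```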
MAPREDUCE:
1. MapReduce is a software framework.
2. MapReduce is the data processing layer of Hadoop.
3. It is a software framework that allows you to write applications for processing a large amount of data.
4. MapReduce runs these applications in parallel on a cluster of low-end machines.
5. It does so in a reliable and fault-tolerant manner.
6. In MapReduce an application is broken down into a number of small parts.
7. These small parts are also called fragments or blocks.
8. These blocks can then be run on any node in the cluster.
9. Data processing is done by MapReduce.
10. MapReduce scales and runs an application across different cluster machines.
11. There are two primitives used for data processing by MapReduce, known as Mappers & Reducers.
12. MapReduce uses lists and key/value pairs for processing data.
a. Partition Function: With the given key and number of reducers, it finds the correct reducer.
b. Compare Function: Map intermediate outputs are sorted according to this compare function.
IV) Function Reducing:
1. Intermediate values are reduced to smaller solutions and given to the output.
V) Write Output:
Gives the file output.
[Figure: MapReduce data flow - INPUT DATA split across Map tasks, shuffled to Reduce tasks, producing OUTPUT DATA]
Example (word count): (1) Map1 and Map2 emit a (word, 1) pair for each word, (2) Combine groups the pairs per word, (3) Reduce sums the counts, producing <Babita, 2>, <Jethalal, 2>, <Good night, 2>, <Hello, 2>. A minimal sketch of this flow follows.
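An in-process Python sketch of the word-count flow above; the input lines are invented for illustration and the shuffle is simulated with a dictionary:

```python
# Minimal MapReduce word count simulated in one process:
# mapper emits (word, 1); shuffle groups by key; reducer sums the counts.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return (key, sum(values))

lines = ["Hello Babita", "Hello Jethalal", "Babita Jethalal"]
pairs = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'Hello': 2, 'Babita': 2, 'Jethalal': 2}
```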
Ans: [P | Medium]
HADOOP ECOSYSTEM:
1. The core Hadoop ecosystem is nothing but the different components that are built on the Hadoop platform directly.
2. Figure 1.9 represents the Hadoop ecosystem.
[Figure 1.9: Hadoop ecosystem - BI Reporting and RDBMS on top; Pig (data flow), Hive (SQL) and Sqoop; MapReduce (job scheduling/execution system); HBase; HDFS at the base; Zookeeper coordinating across layers]
II) MapReduce:
1. MapReduce is the data processing component of Hadoop.
2. It applies the computation on sets of data in parallel, thereby improving the performance.
3. MapReduce works in two phases:
a. Map Phase: This phase takes input as key-value pairs and produces output as key-value pairs. You can write custom business logic in this phase. The map phase processes the data and gives it to the next phase.
b. Reduce Phase: The MapReduce framework sorts the key-value pairs before giving the data to this phase. This phase applies summary-type calculations to the key-value pairs.
III) Hive:
1. Hive is a data warehouse project built on top of Apache Hadoop which provides data query and analysis.
2. It has a language of its own called HQL, or Hive Query Language.
3. HQL automatically translates the queries into the corresponding map-reduce jobs.
4. The main parts of Hive are:
a. MetaStore: It stores the metadata.
b. Driver: Manages the lifecycle of an HQL statement.
c. Query Compiler: Compiles HQL into a DAG, i.e. Directed Acyclic Graph.
d. Hive Server: Provides a JDBC/ODBC server interface.
IV) Pig:
1. Pig is a SQL-like language used for querying and analyzing data stored in HDFS.
2. Yahoo was the original creator of Pig.
3. It uses the Pig Latin language.
4. It loads the data, applies a filter to it and dumps the data in the required format.
5. Pig also consists of a JVM called Pig Runtime. Various features of Pig are as follows:
a. Extensibility: For carrying out special-purpose processing, users can create their own custom functions.
b. Optimization opportunities: Pig automatically optimizes the query, allowing users to focus on semantics rather than efficiency.
c. Handles all kinds of data: Pig analyzes both structured as well as unstructured data.
V) HBase:
1. HBase is a NoSQL database built on top of HDFS.
2. HBase is an open-source, non-relational, distributed database.
3. It imitates Google's Bigtable and is written in Java.
4. It provides real-time read/write access to large datasets.
VI) Zookeeper:
1. Zookeeper coordinates between the various services in the Hadoop ecosystem.
2. It saves the time required for synchronization, configuration maintenance, grouping, and naming.
3. The following are the features of Zookeeper:
a. Speed: Zookeeper is fast in workloads where reads of data are more common than writes. A typical read:write ratio is 10:1.
b. Organized: Zookeeper maintains a record of all transactions.
c. Simple: It maintains a single hierarchical namespace, similar to directories and files.
d. Reliable: We can replicate Zookeeper over a set of hosts and they are aware of each other. There is no single point of failure. As long as a majority of the servers are available, Zookeeper is available.
VII) Sqoop:
1. Sqoop imports data from external sources into compatible Hadoop ecosystem components like HDFS, Hive, HBase etc.
2. It also transfers data from Hadoop to other external sources.
3. It works with RDBMSs like Teradata, Oracle, MySQL and so on.
4. The major difference between Sqoop and Flume is that Flume does not work with structured data.
5. But Sqoop can deal with structured as well as unstructured data.
HADOOP:
1. Hadoop is an open-source software programming framework for storing a large amount of data and performing the computation.
2. Its framework is based on Java programming with some native code in C and shell scripts.
3. The Apache Software Foundation is the developer of Hadoop, and its co-founders are Doug Cutting and Mike Cafarella.
4. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.
5. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
FEATURES OF HADOOP:
1. Low Cost
2. High Computing Power
3. Scalability
4. Huge & Flexible Storage
5. Fault Tolerance & Data Protection
Characteristics:
1. Highly fault tolerant.
2. High throughput.
3. Supports applications with massive datasets.
4. Streaming access to file system data.
5. Can be built out of commodity hardware.
Architecture:
Figure 2.1 shows the HDFS architecture.
[Figure 2.1: HDFS architecture - clients perform metadata ops against the Name Node (metadata: name, replicas, ...) and block ops against Data Nodes in racks 1 and 2]
HDFS follows the master-slave architecture and it has the following elements.
I) Namenode:
1. It is a daemon which runs on the master node of the Hadoop cluster.
2. There is only one namenode in a cluster.
3. It contains the metadata of all the files stored on HDFS, which is known as the namespace of HDFS.
4. It maintains two files, i.e. the EditLog and the FsImage.
5. The EditLog is used to record every change that occurs to the file system metadata (transaction history).
6. The FsImage stores the entire namespace, the mapping of blocks to files and the file system properties.
7. The FsImage and the EditLog are central data structures of HDFS.
8. The system hosting the namenode acts as the master server and it does the following tasks:
a. Manages the file system namespace.
b. Regulates clients' access to files.
c. It also executes file system operations such as renaming, closing, and opening files and directories.
II) Datanode:
1. It is a daemon which runs on the slave machines of the Hadoop cluster.
2. There are a number of datanodes in a cluster.
3. It is responsible for serving read/write requests from the clients. It also performs block creation, deletion, and replication upon instruction from the Namenode.
4. It also sends a Heartbeat message to the namenode periodically about the blocks it holds.
5. Namenode and Datanode machines typically run a GNU/Linux operating system (OS).
III) Block:
1. Generally the user data is stored in the files of HDFS.
2. A file in the file system is divided into one or more segments and/or stored in individual data nodes.
3. These file segments are called blocks.
4. In other words, the minimum amount of data that HDFS can read or write is called a Block.
5. The default block size is 64 MB, but it can be increased as per need by changing the HDFS configuration.
Q2. What is the role of JobTracker and TaskTracker in MapReduce? Illustrate MapReduce.
MAPREDUCE:
1. MapReduce is a software framework.
2. MapReduce is the data processing layer of Hadoop.
3. Similar to HDFS, MapReduce also exploits a master/slave architecture in which the JobTracker runs on the master node and a TaskTracker runs on each slave node.
4. TaskTrackers are processes running on data nodes.
5. They monitor the map and reduce tasks executed on the node and coordinate with the JobTracker.
6. The JobTracker monitors the entire MR job execution.
7. JobTracker and TaskTracker are the 2 essential processes involved in MapReduce execution in MRv1 (or Hadoop version 1).
8. Both processes are now deprecated in MRv2 (or Hadoop version 2) and replaced by the Resource Manager, Application Master and Node Manager daemons.
JOB TRACKER:
1. The JobTracker is an essential daemon for MapReduce execution in MRv1.
2. It is replaced by the Resource Manager / Application Master in MRv2.
3. The JobTracker receives requests for MapReduce execution from the client.
4. The JobTracker talks to the NameNode to determine the location of the data.
5. The JobTracker finds the best TaskTracker nodes to execute tasks based on data locality (proximity of the data) and the available slots to execute a task on a given node.
6. The JobTracker monitors the individual TaskTrackers and then submits the overall status of the job back to the client.
7. If a task fails, the JobTracker can reschedule it on a different TaskTracker.
8. When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot be started and the existing MapReduce jobs will be halted.
TASKTRACKER:
1. The TaskTracker runs on Data Nodes.
2. The TaskTracker is replaced by the Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on Data Nodes administered by TaskTrackers.
4. TaskTrackers are assigned Mapper and Reducer tasks to execute by the JobTracker.
5. The TaskTracker is in constant communication with the JobTracker, signaling the progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker will assign the tasks executed by that TaskTracker to another node.
-- EXTRA QUESTIONS --
Q2. Explain Union, Intersection & Difference operation with MapReduce Techniques
Ans: [P | Medium]
UNION WITH MAPREDUCE:
1. For the union operation, the map phase has the responsibility of converting the tuple values 't' from relation 'R' to a key-value pair format.
2. The reducer has the task of assigning the values to the same key 't'.
3. Figure 2.3 shows the union operation in MapReduce format.
3. Figure 2.3 shows union operat ion with MapReduce Format.
..- .
Figure 2.3: Union Operation with MapReduce Format.
--
INTERSECTION WITH MAPREDUCE:
1. For the intersection operation, the map phase has the same responsibility as in the union operation.
2. That is, to convert the tuple values 't' from relation 'R' to a key-value pair format.
3. The reducer is responsible for generating output by evaluating an if-else condition.
4. Only then will output be generated in key-value format; otherwise no output will be produced.
5. Figure 2.4 shows the intersection operation in MapReduce format.
5. Figure 2.4 shows intersect ion operation w ith MapReduce For mat.
11-Q
l
"""'
t, •
Produ« the
TUplet
key,,vakl•
Pair (t. A)
Olo.totlon A
0,~5
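A small Python sketch of both operations following the scheme above; the relations R and S and the run() driver are illustrative assumptions, not Hadoop APIs:

```python
# Relational union and intersection expressed as map/reduce steps,
# with the tuple t itself acting as the key.
from collections import defaultdict

def run(map_fn, reduce_fn, inputs):
    groups = defaultdict(list)
    for relation, tuples in inputs.items():
        for t in tuples:
            for k, v in map_fn(t, relation):
                groups[k].append(v)
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return out

# Union: every tuple maps to (t, t); the reducer emits each key once,
# so duplicates across R and S collapse automatically.
union = run(lambda t, rel: [(t, t)],
            lambda t, vs: [t],
            {"R": [(1,), (2,)], "S": [(2,), (3,)]})

# Intersection: map tags each tuple with its relation name; the reducer
# emits t only if it was seen in both R and S (the if-else in the text).
inter = run(lambda t, rel: [(t, rel)],
            lambda t, rels: [t] if {"R", "S"} <= set(rels) else [],
            {"R": [(1,), (2,)], "S": [(2,), (3,)]})

print(sorted(union))  # [(1,), (2,), (3,)]
print(inter)          # [(2,)]
```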
7. If n = 100, we do not want to use a DFS or MapReduce for this calculation.
8. But such kinds of calculations are required while maintaining the page ranking of web pages for a given search.
9. In real time, the value of n is in the billions.
10. Let us first assume that n is large, but not so large that vector v cannot fit in main memory and thus be available to every Map task.
11. The matrix M and the vector v will each be stored in a file of the DFS.
12. We assume that the row-column coordinates of each matrix element will be discoverable, either from its position in the file, or because it is stored with explicit coordinates, as a triple (i, j, mij).
13. We also assume the position of element vj in the vector v will be discoverable in the analogous way.
14. The Map Function:
a. The Map function is written to apply to one element of matrix M.
b. However, if vector v is not already read into main memory at the compute node executing a Map task, then v is first read, in its entirety, and subsequently will be available to all applications of the Map function performed at this Map task.
c. Each Map task will operate on a chunk of the matrix M.
d. From each matrix element mij it produces the key-value pair (i, mij * vj).
e. Thus, all terms of the sum that make up the component xi of the matrix-vector product will get the same key, i.
15. The Reduce Function:
a. The Reduce function simply sums all the values associated with a given key i.
b. The result will be a pair (i, xi).
c. We can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes, of the same height.
d. Our goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node.
16. Figure 2.6 shows what the partition looks like if the matrix and vector are each divided into four stripes.
[Figure 2.6: Matrix M and vector v each divided into four stripes]
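A compact Python sketch of the Map and Reduce functions described above, simulated in one process; the 2x2 matrix and vector are invented for illustration:

```python
# Matrix-vector product via map/reduce: each element m[i][j] is mapped to
# the pair (i, m_ij * v_j); the reducer sums all values sharing row key i.
from collections import defaultdict

M = [[1, 2], [3, 4]]   # conceptually stored as triples (i, j, m_ij)
v = [10, 20]           # assumed small enough to fit in main memory

def map_fn(i, j, m_ij):
    yield (i, m_ij * v[j])   # all terms of component x_i share key i

def reduce_fn(i, values):
    return (i, sum(values))  # x_i = sum over j of m_ij * v_j

groups = defaultdict(list)
for i, row in enumerate(M):
    for j, m_ij in enumerate(row):
        for key, val in map_fn(i, j, m_ij):
            groups[key].append(val)

x = [reduce_fn(i, vals)[1] for i, vals in sorted(groups.items())]
print(x)  # [50, 110]
```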
Q4. What are Combiners? Explain the position and significance of combiners
Ans: [P | Medium]
COMBINERS:
1. A combiner is also known as a semi-reducer.
2. It is a mediator between the mapper phase & the reducer phase.
3. The use of combiners is totally optional.
4. It accepts the output of the map phase as input and passes the key-value pairs to the reduce operation.
5. The main function of a combiner is to summarize the map output records with the same key.
6. This is also known as grouping by key.
7. The output (key-value collection) of the combiner is sent over the network to the actual reducer task as input.
8. The Combiner class is used between the Map class and the Reduce class to reduce the volume of data transferred between Map and Reduce.
9. Usually, the output of the map task is large and the data transferred to the reduce task is high.
10. Figure 2.7 below shows the position and working mechanism of the combiner.
10. The following figure 2.7 show s p osit ion and w orking mechanism of combiner.
Combiner
·-------
<kl, vl> '"ki, VJ>
<IQ,VS,. <ki, v.i>
I
c.,ouping values w,th
same key
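A small Python sketch of the combiner's effect: the same summing logic as the reducer, applied locally to one mapper's output so fewer pairs cross the network:

```python
# The combiner pre-aggregates a single mapper's (word, 1) pairs,
# shrinking the number of key-value pairs shuffled to the reducer.
from collections import Counter

def mapper(line):
    return [(w, 1) for w in line.split()]

def combiner(pairs):
    # Same summing logic as the reducer, restricted to one mapper's output.
    return list(Counter(w for w, _ in pairs).items())

map_output = mapper("hello hello world hello world")
combined = combiner(map_output)
print(len(map_output), map_output)  # 5 pairs before combining
print(len(combined), combined)      # 2 pairs sent to the reducer
```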
CHAP - 3: NOSQL
Q1. When it comes to Big Data, how does NoSQL score over RDBMS?
1. Schema-less: NoSQL databases, being schema-less, do not define any strict data structure.
2. Dynamic and Agile:
a. NoSQL databases have a good tendency to grow dynamically with changing requirements.
b. They can handle structured, semi-structured and unstructured data.
3. Scales Horizontally:
a. In contrast to SQL databases, which scale vertically, NoSQL scales horizontally by adding more servers and using the concepts of sharding and replication.
b. This behavior of NoSQL fits with cloud computing services such as Amazon Web Services (AWS), which allow you to handle virtual servers that can be expanded horizontally on demand.
4. Better Performance:
a. All the NoSQL databases claim to deliver better and faster performance as compared to traditional RDBMS implementations.
b. Since NoSQL is an entire set of databases (and not a single database), the limitations differ from database to database.
c. Some of these databases do not support ACID transactions, while some of them might be lacking in reliability. But each one of them has its own strengths, due to which it is well suited for specific requirements.
5. Continuous Availability:
a. The various relational databases may offer moderate to high availability for data transactions, while NoSQL databases excel at providing continuous availability, coping with different sorts of data transactions at any point of time and in difficult situations.
6. Ability to Handle Changes:
a. The schema-less structure of NoSQL databases helps them cope easily with the changes coming with time.
b. There is a universal index provided for the structure, values and text found in the data, and hence it is easy for organizations to cope with the changes immediately using this information.
Q2. Explain different ways by which big data problems are handled by NoSQL
c. Hypertable
[Figure: Column-oriented database example (employee columns: No, DeptNo, HireDate, EmpName) and a document store holding JSON documents]
6. Traversing relationships is fast, as they are already captured in the DB and there is no need to calculate them.
7. Graph-based databases are mostly used for social networks, logistics, and spatial data.
8. Examples:
a. Neo4J
b. Infinite Graph
c. OrientDB
d. FlockDB
[Figure: Graph database - nodes with attributes connected by edges]
-- EXTRA QUESTIONS --
FEATURES:
I) Non-relational:
1. NoSQL databases never follow the relational model.
2. They never provide tables with flat fixed-column records.
3. They work with self-contained aggregates or BLOBs.
4. They don't require object-relational mapping and data normalization.
5. No complex features like query languages, query planners, referential-integrity joins, or ACID.
II) Schema-free:
1. NoSQL databases are either schema-free or have relaxed schemas.
2. They do not require any sort of definition of the schema of the data.
3. They offer heterogeneous structures of data in the same domain.
III) Simple API:
1. They offer easy-to-use interfaces for storing and querying the data provided.
2. APIs allow low-level data manipulation & selection methods.
3. Text-based protocols are mostly used, with HTTP REST and JSON.
4. Mostly, no standards-based query language is used.
5. Web-enabled databases run as internet-facing services.
IV) Distributed:
1. Multiple NoSQL databases can be executed in a distributed fashion.
2. They offer auto-scaling and fail-over capabilities.
3. Often the ACID concept is sacrificed for scalability and throughput.
4. Mostly there is no synchronous replication between distributed nodes; instead, asynchronous multi-master replication, peer-to-peer, and HDFS-style replication are used.
5. They provide only eventual consistency.
6. Shared-nothing architecture. This enables less coordination and higher distribution.
ADVANTAGES:
1. Good resource scalability.
2. No static schema.
DISADVANTAGES:
1. Not a defined standard.
2. Limited query capabilities.
Ans: [P | Medium]
DYNAMODB:
[Figure: DynamoDB hierarchy - an AWS account contains regions; each region contains tables; tables contain items; items contain attributes, e.g. ID = 1]
Example item:
ID = 2
SubjectName = "Artificial Intelligence & Soft Computing"
Authors = ["Sagar Narkar"]
Price = 100
PageCount = 93
Publication = "BackkBenchers Publication House"
9. Data types:
a. Scalar types: Number, String, Binary, Boolean and NULL.
b. Document types: List and Map.
c. Set types: String Set, Number Set and Binary Set.
10. Data Access:
a. DynamoDB is a web service that uses HTTP and HTTPS as transport-layer services.
b. JSON is used as the message serialization format.
c. It is possible to use the AWS software development kits (SDKs).
d. Application code makes requests to the DynamoDB web service API, as sketched below.
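A minimal, hedged sketch of accessing DynamoDB from Python with the boto3 SDK; the table name "Books", its "ID" key schema and the region are assumptions for illustration:

```python
# Sketch: writing and reading one item through the DynamoDB web service
# API using boto3. Assumes a "Books" table with partition key "ID" exists.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Books")  # hypothetical table

# Items are schema-free; this one carries the attributes from the example.
table.put_item(Item={
    "ID": 2,
    "SubjectName": "Artificial Intelligence & Soft Computing",
    "Authors": {"Sagar Narkar"},  # serialized as a String Set
    "Price": 100,                 # a Number attribute
    "PageCount": 93,
    "Publication": "BackkBenchers Publication House",
})

response = table.get_item(Key={"ID": 2})
print(response.get("Item"))
```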
Ans: [P | Medium]
SHARED-NOTHING ARCHITECTURE:
1. There are three ways that resources can be shared between computer systems: shared RAM, shared disk, and shared-nothing.
2. Of the three alternatives, a shared-nothing architecture is the most cost effective in terms of cost per processor when you're using commodity hardware.
3. A shared-nothing architecture (SN) is a distributed-computing architecture.
4. Each processor has its own local memory and local disk.
5. A processor at one node may communicate with another processor using a high-speed communication network.
6. Any terminal can act as a node which functions as a server for data that is stored on its local disk.
7. Interconnection networks for shared-nothing systems are usually designed to be scalable.
8. Figure 3.6 shows the shared-nothing architecture.
[Figure 3.6: Shared-nothing architecture - processors, each with local memory and disk, connected by an interconnection network]
DISADVANTAGES:
1. It requires rigid data partitioning.
2. The cost of communication and of non-local disk access is higher.
APPLICATIONS:
1. The Teradata database machine uses the shared-nothing architecture.
2. Shared-nothing is popular for web development.
[Figure: Data-stream management system - streams entering, a stream processor answering standing and ad-hoc queries, limited working storage and archival storage]
3. The data-stream management system architecture is very similar to that of a conventional relational database management system architecture.
4. The basic difference is that the processor block, or more specifically the query processor (query engine), is replaced with a specialized block known as the stream processor.
5. The first block in the system architecture shows the input part.
6. A number of data streams generated from different data sources enter the system.
7. Every data stream has its own characteristics, such as:
a. Every data stream can schedule and rearrange its own data items.
b. Every data stream involves heterogeneity, i.e. in each data stream we can find different kinds of data, such as numerical data, alphabets, alphanumeric data, graphics data, textual data, binary data or any converted, transformed or translated data.
c. Every data stream has a different input data rate.
d. No uniformity is maintained by the elements of different data streams while entering the stream processor.
8. In the second block of the system architecture, there are two different subsystems: one takes care of storing the data stream, and the other is responsible for fetching the data stream from secondary storage and processing it by loading it into main memory.
9. Hence the rate at which a stream enters the system is not the burden of the subsystem involved in stream processing.
10. It is controlled by the other subsystem, which is involved in storage of the data stream.
11. The third block represents the active storage, or working storage, for processing the different data streams.
12. The working storage area may also contain sub-streams which are an integral part of the main core stream, used to generate the result for a given query.
13. Working storage is basically main memory, but if the situation demands, data items within the stream can be fetched from secondary storage.
14. The major issue associated with working storage is its limited size.
15. The fourth block of the system architecture is known as archival storage.
16. As the name indicates, this block is responsible for maintaining the details of every transaction within the system architecture.
17. It is also responsible for maintaining the edit logs.
18. Edit logs are nothing but updations of data (i.e. values from users or the system).
19. In some special-purpose cases, archival storage will be used to process or retrieve previously used data items which are not present in secondary storage.
20. The fifth block is responsible for displaying or delivering the output stream generated as a result of the processing done by the stream processor, usually with the support of working storage and occasionally with the support of archival storage.
Q2. What do you mean by Counting Distinct Elements in a stream? Illustrate with an example.
7. The addition of a new user to an existing list according to a decided key is very easy with this structure.
8. So it seems to be very easy to count the distinct elements within the given stream.
9. The major problem arises when the number of distinct elements is too large and, to make it worse,
10. a number of such data streams enter the system at the same time.
11. E.g. an organization like Yahoo or Google requires a count of the users who see each of their different web pages for the first time in a given month.
12. Here, for every page we need to maintain the above-said structure, which becomes a complicated problem.
13. As an alternate solution to this problem, we can "scale out machines".
14. In this approach, we can add new commodity hardware (an ordinary server) to the existing hardware, so that the load of processing and counting the distinct users on every web page of an organization is distributed.
15. Additionally, we can make use of secondary memory and batch systems.
FLAJOLET-MARTIN ALGORITHM:
1. The problem of counting distinct elements can be solved with the help of an ordinary hashing technique.
2. A hash function is applied to a given set, which generates a bit-string as the result of hashing.
3. A constraint on the above process is that there should be more possible hash results than elements present inside the set.
4. The general procedure is to pick different hash functions and apply them to every element in the given data stream.
5. The significant property of a hash function is that, whenever applied to the same data element in a given data stream, it will generate the same hash value.
6. So, the Flajolet-Martin algorithm has extended this hashing idea and these properties to count distinct elements.
7. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in 1984.
8. The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one pass.
9. If the stream contains n elements with m of them unique, this algorithm runs in O(n) time and needs O(log(m)) memory.
10. So the real innovation here is the memory usage, in that an exact, brute-force algorithm would need O(m) memory (e.g. think "hash map").
11. As noted, this is an approximate algorithm. It gives an approximation for the number of unique objects, along with a standard deviation σ, which can then be used to determine bounds on the approximation with a desired maximum error ε, if needed.
ALGORITHM:
1. Create a bit vector of sufficient length L, such that 2^L > n, the number of elements in the stream. Usually a 64-bit vector is sufficient, since 2^64 is quite large for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash value whose binary representation ends in i zeros. So initialize each bit to 0.
3. Generate a good, random hash function that maps input (usually strings) to natural numbers.
4. Read the input. For each word, hash it and determine the number of trailing zeros. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
5. Once the input is exhausted, get the index of the first 0 in the bit array (call this R). This is just the number of consecutive 1s (i.e. we have seen 0, 00, ..., a run of R-1 zeros as outputs of the hash function) plus one.
EXAMPLE:
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
Assume |b| = 5 bits.

x | h(x) | Rem | Binary | r(a)
1 |  7   |  2  | 00010  | 1
3 | 19   |  4  | 00100  | 2
2 | 13   |  3  | 00011  | 0
1 |  7   |  2  | 00010  | 1
2 | 13   |  3  | 00011  | 0
3 | 19   |  4  | 00100  | 2
4 | 25   |  0  | 00000  | 5
3 | 19   |  4  | 00100  | 2
1 |  7   |  2  | 00010  | 1
2 | 13   |  3  | 00011  | 0
3 | 19   |  4  | 00100  | 2
1 |  7   |  2  | 00010  | 1

R = max(r(a)) = 5
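A small Python sketch of the algorithm; the hash h(x) = (6x + 1) reduced mod 5 and the 5-bit width are assumptions read off the worked example above (a production implementation would average over many random hash functions):

```python
# Flajolet-Martin sketch: hash each element, track the maximum number of
# trailing zeros R, and estimate the distinct count as 2^R.

def trailing_zeros(value, bits=5):
    """Trailing zeros in the `bits`-bit representation (all-zero => bits)."""
    if value == 0:
        return bits
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    R = 0
    for x in stream:
        R = max(R, trailing_zeros(h(x)))
    return R, 2 ** R  # max trailing zeros seen, and the 2^R estimate

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
R, estimate = flajolet_martin(stream, h=lambda x: (6 * x + 1) % 5)
print(R, estimate)  # R = 5, as in the example; estimate 2^R = 32
```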
-- EXTRA QUESTIONS --
EXAMPLE:
Consider the following stream of bits: ...11101011100110001011011011000
11010111 00 110001
t-3
5. This is the bucket which is only partly inside the window, because the bits to the left of bit t-8 (t-9 and so on) fall outside it.
6. So bits up to t-8 will be included.
7. Therefore the quoted answer for the number of 1s in the last k = 10 bits will be 6, but the actual answer is 5.
Q2. Explain the Bloom filter. Explain the Bloom filtering process with a neat diagram.
Ans: [P | Medium]
BLOOM FILTER:
1. A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
2. It uses a bit array of m bits and k independent hash functions; inserting an item sets the k bits its hashes select, and a lookup reports "possibly present" only if all k bits are set.
STRENGTHS:
1. Space-efficient: Bloom filters take up O(1) space, regardless of the number of items inserted.
2. Fast: Insert and lookup operations are both O(1) time.
WEAKNESSES:
1. Probabilistic: Bloom filters can only definitively identify true negatives. They cannot identify true positives. If a Bloom filter says an item is present, that item might actually be present (a true positive) or it might not (a false positive).
2. Limited Interface: Bloom filters only support the insert and lookup operations. You can't iterate through the items in the set or delete items.
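A minimal Python sketch of a Bloom filter; the sizes m = 1024 and k = 3 and the SHA-256-based hashing are illustrative choices, not a standard:

```python
# Bloom filter sketch: k hash functions set/test bits in an m-bit array.
# Lookups can return false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions by salting the item with an index before hashing.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("alice@example.com")
print(bf.lookup("alice@example.com"))  # True (possibly present)
print(bf.lookup("bob@example.com"))    # almost certainly False
```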
Y = A C F D E G
    1 2 3 4 5 6  (position)
4. Clearly, positions 2, 3 and 6 of y differ in their contents from x. So make the necessary insertions and deletions in accordance with these 3 positions:
x has B at position 2; y has C, F and G at positions 2, 3 and 6.
5. Delete the character 'B' in string x from position no. 2.
6. Shift the other characters to the left-hand side.
7. Now the current status of string x is:
X = A C D E
    1 2 3 4
8. Now insert the character 'F' at position 3, i.e. after character 'C' and before character 'D'.
9. Therefore, the status of string x will now be:
X = A C F D E
    1 2 3 4 5
10. Lastly, append the character 'G' to string X at the 6th position.
11. The status of string X will be:
X = A C F D E G
    1 2 3 4 5 6
12. Hence in the above example, we have:
No. of deletions = 1
No. of insertions = 2
Edit distance between X and Y:
d(x, y) = No. of deletions + No. of insertions = 1 + 2 = 3
II) Longest Common Sequence Method:
1. The longest common sequence can be found by comparing the character positions in the respective strings.
2. Suppose we have two points x and y represented as strings.
3. The edit distance between points x and y will then be:
d(x, y) = (Length of string x + Length of string y) - 2 x (Length of the longest common sequence)
4. Suppose:
X = A B C D E
Y = A C F D E G
5. Here:
The longest common sequence (LCS) is (A, C, D, E), of length 4.
Length of string x = 5
Length of string y = 6
Therefore, d(x, y) = 5 + 6 - 2 x 4 = 11 - 8 = 3
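A short Python sketch computing this insert/delete edit distance through the LCS formula above:

```python
# Edit distance with only insertions and deletions, via the longest
# common subsequence: d(x, y) = len(x) + len(y) - 2 * LCS(x, y).

def lcs_length(x, y):
    """Classic dynamic-programming LCS length."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if cx == cy else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    return len(x) + len(y) - 2 * lcs_length(x, y)

# The worked example: X = ABCDE, Y = ACFDEG, LCS = ACDE (length 4).
print(lcs_length("ABCDE", "ACFDEG"))     # 4
print(edit_distance("ABCDE", "ACFDEG"))  # 5 + 6 - 8 = 3
```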
Q2. What are the challenges in clustering of data streams? Explain a stream clustering algorithm in detail.
Ans: [10M - DEC19]
2. Sensors with small memory are not able to keep a big amount of data, so a new method for data stream clustering should manage this limitation.
III) High dimensional data streams:
1. There are high dimensional data sets (e.g. image processing, personal similarity, customer preference clustering, network intrusion detection, wireless sensor networks and, generally, time series data) which have to be managed through the processing of data streams.
2. In huge databases, data complexity can be increased by the number of dimensions.
INITIALISING BUCKETS:
1. Initialisation of the buckets is the first step of the BDMO algorithm.
2. The algorithm uses the smallest bucket size p, a power of two.
3. It creates a new bucket with the most recent p points for every p stream elements.
4. The timestamp of the most recent point in the bucket is the timestamp of the new bucket.
5. After this, we may choose to leave every point in a cluster by itself or perform clustering using an appropriate clustering method.
6. For example, if the k-means clustering method is used, it clusters the points into k clusters.
7. For the initialisation of the bucket using the selected clustering method, it calculates the centroid or clustroid for the clusters and counts the points in each cluster.
8. All this information is stored and becomes a record for each cluster.
9. The algorithm also calculates the other parameters required for the merging process.
MERGING BUCKETS:
1. After the initialisation of the buckets, the algorithm needs to review the sequence of buckets.
2. If there happens to be a bucket with a timestamp more than N time units prior to the current time, then nothing of that bucket is in the window.
3. In such a case, the algorithm drops it from the list.
4. In case we have created three buckets of size p, we must merge the oldest two of the three buckets.
5. The merger may create two buckets of size 2p, and this may require us to merge buckets of increasing sizes recursively.
6. For merging two consecutive buckets, the algorithm needs to perform the following steps:
a. The size of the merged bucket is twice the sizes of the two buckets being merged.
b. The timestamp of the merged bucket is the timestamp of the more recent of the two consecutive buckets.
c. In addition, it is necessary to calculate the parameters of the merged clusters.
ANSWERING QUERIES:
1. A query in the stream-computing model is a length of a suffix of the sliding window.
2. The algorithm takes all the clusters in all the buckets that are at least partially within the suffix and then merges them using some method.
3. The answer to the query is the resulting clusters.
4. For the clustering of the streams, the stream-computing model finds the answer to the query "What are the clusters of the last or most recent m points in the stream, for m <= N?".
5. During the initialisation the k-means method is used, and for merging the buckets the timestamp is used.
6. Hence the algorithm may be unable to find a set of buckets that covers exactly the last m points.
7. However, we can choose the smallest set of buckets that covers the last m points, and include in these buckets no more than the last 2m points.
8. After this, the algorithm generates the answer in response to the query as "the centroids or clustroids of all the points in the selected buckets".
-- EXTRA QUESTIONS --
d(x, y) = √( Σ from i = 1 to n of (xi − yi)² )   (Euclidean distance)
V) Hamming Distance:
1. The space is nothing but a collection of points.
2. These points can be represented as vectors.
3. Every vector is composed of different components, such as magnitude, direction etc.
4. The number of components in which two vectors differ is known as the Hamming distance between them.
5. As this distance calculation depends on the difference operation, it satisfies all the constraints, such as non-negativity of distances, positivity of distance, symmetry and the triangle inequality. A small sketch follows.
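A tiny Python sketch of the Hamming distance as the count of differing components:

```python
# Hamming distance: the number of positions at which two
# equal-length vectors differ.
def hamming(u, v):
    assert len(u) == len(v), "Hamming distance needs equal-length vectors"
    return sum(a != b for a, b in zip(u, v))

print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
print(hamming("10111", "10010"))            # 2 (works on strings too)
```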
Ans: [P | High]
CURE ALGORITHM:
Outline: make a partitioning of a sample, partially cluster the partitions, eliminate the outliers, then cluster the partial clusters.
INITIALIZATION IN CURE:
1. Take a small sample of the data and cluster it in main memory.
2. In principle, any clustering method could be used, but as CURE is designed to handle oddly shaped clusters, it is often advisable to use a hierarchical method in which clusters are merged when they have a close pair of points.
3. Select a small set of points from each cluster to be representative points, as shown in figure 5.3.
[Figure 5.3: Select representative points from each cluster, as far from one another as possible]
4. These points should be chosen to be as far from one another as possible, using the k-means method.
5. Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster.
6. Perhaps 20% is a good fraction to choose. Note that this step requires a Euclidean space, since otherwise there might not be any notion of a line between two points.
[Figure 5.4: Moving the representative points 20% of the distance to the cluster's centroid]
APPLICATIONS:
1. Telephone Networks: Here the nodes represent phone numbers, which are really individuals. There is an edge between two nodes if a call has been placed between those phones in some fixed period of time, such as last month, or "ever". The edges could be weighted by the number of calls made between these phones during the period.
2. Email Networks: The nodes represent email addresses, which are again individuals. An edge represents the fact that there was at least one email in at least one direction between the two addresses.
3. Collaboration Networks: Nodes represent individuals who have published research papers. There is an edge between two individuals who published one or more papers jointly.
4. Other examples include: information networks (documents, web graphs, patents), infrastructure networks (roads, planes, water pipes, power grids), biological networks (genes, proteins, food webs of animals eating each other), as well as other types, like product co-purchasing networks (e.g., Groupon).
1. Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm.
2. HITS rates Web pages; it was developed by Jon Kleinberg.
3. Hubs and authorities are fans and centers in a bipartite core of a web graph.
4. A good hub page is one that points to many good authority pages.
5. A good authority page is one that is pointed to by many good hub pages.
6. Figure 6.1 represents hubs and authorities.
[Figure 6.1: Hubs and authorities]
STRUCTURE OF WEB:
a. Tendrils: The nodes reachable from IN that can't reach the giant SCC, and the nodes that can reach OUT but can't be reached from the giant SCC. If a tendril node satisfies both conditions, then it is part of a tube that travels from IN to OUT without touching the giant SCC.
b. Disconnected: Nodes that belong to none of the previous categories.
8. Taken as a whole, the bow-tie structure of the Web provides a high-level view of the Web's structure, based on its reachability properties and how its strongly connected components fit together.
[Figure: Bow-tie structure of the Web - IN (44 million nodes) → SCC (56 million nodes) → OUT (44 million nodes), with tendrils (44 million nodes), tubes and disconnected components]
Q3. What is the use of a Recommender System? How is a classification algorithm used in a recommendation system?
RECOMMENDATION SYSTEMS:
There is an extensive class of Web applications that involve predicting user responses to options. Such a facility is called a recommendation system.
CLASSIFICATION ALGORITHM:
1. The classification algorithm is a completely different approach to a recommendation system: instead of using item profiles and utility matrices directly, it treats the problem as one of machine learning.
2. Regard the given data as a training set, and for each user, build a classifier that predicts the rating of all items.
3. A decision tree is a collection of nodes, arranged as a binary tree.
4. The leaves render decisions; in our case, the decision would be "likes" or "doesn't like".
5. Each interior node is a condition on the objects being classified; in our case the condition would be a predicate involving one or more features of an item.
6. To classify an item, we start at the root, and apply the predicate at the root to the item.
7. If the predicate is true, go to the left child, and if it is false, go to the right child.
8. Then repeat the same process at the node visited, until a leaf is reached.
9. That leaf classifies the item as liked or not.
10. Construction of a decision tree requires selection of a predicate for each interior node.
11. There are many ways of picking the best predicate, but they all try to arrange that one of the children gets all or most of the positive examples in the training set (i.e., the items that the given user likes, in our case) and the other child gets all or most of the negative examples (the items this user does not like).
12. Once we have selected a predicate for a node N, we divide the items into two groups: those that satisfy the predicate and those that do not.
13. For each group, we again find the predicate that best separates the positive and negative examples in that group.
14. These predicates are assigned to the children of N.
15. This process of dividing the examples and building children can proceed to any number of levels.
16. We can stop, and create a leaf, if the group of items for a node is homogeneous; i.e., they are all positive or all negative examples.
17. However, we may wish to stop and create a leaf with the majority decision for a group, even if the group contains both positive and negative examples. A small sketch with an off-the-shelf decision tree follows.
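A hedged sketch of this per-user classifier using scikit-learn's DecisionTreeClassifier; the keyword features and the likes row are invented for illustration:

```python
# Treating "will user U like this item?" as a classification problem:
# one decision tree is trained per user from item keyword features.
from sklearn.tree import DecisionTreeClassifier

# Item profiles: 1/0 flags for the keywords ["homerun", "batter", "Yankees"].
item_features = [
    [1, 0, 0],  # baseball article, no Yankees
    [0, 1, 0],  # baseball article, no Yankees
    [1, 0, 1],  # baseball article about the Yankees
    [0, 0, 0],  # non-baseball article
]
likes = [1, 1, 0, 0]  # utility-matrix row for user U (1 = liked)

clf = DecisionTreeClassifier().fit(item_features, likes)
# A fully grown tree separates these training items perfectly:
print(clf.predict([[1, 0, 0], [1, 0, 1]]))  # [1 0]: likes baseball, not Yankees
```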
Example:
1. Suppose our items are news articles, and the features are the high-TF.IDF words (keywords) in those documents.
2. Further suppose there is a user U who likes articles about baseball, except articles about the New York Yankees.
3. The row of the utility matrix for U has 1 if U has read the article and is blank if not.
4. We shall take the 1s as "like" and the blanks as "doesn't like".
5. Predicates will be Boolean expressions of keywords.
6. Since U generally likes baseball, we might find that the best predicate for the root is "homerun" OR ("batter" AND "pitcher").
7. Items that satisfy the predicate will tend to be positive examples (articles with 1 in the row for U in the utility matrix), and items that fail to satisfy the predicate will tend to be negative examples (blanks in the utility-matrix row for U).
8. Figure 6.3 shows the root as well as the rest of the decision tree.
9. Suppose that the group of articles that do not satisfy the predicate includes sufficiently few positive examples that we can conclude all of these items are in the "don't-like" class.
10. We may then put a leaf with decision "don't like" as the right child of the root.
11. However, the articles that satisfy the predicate include a number of articles that user U doesn't like; these are the articles that mention the Yankees.
12. Thus, at the left child of the root, we build another predicate.
13. We might find that the predicate "Yankees" OR "Jeter" OR "Teixeira" is the best possible indicator of an article about baseball and about the Yankees.
14. Thus, we see in Fig. 6.3 the left child of the root, which applies this predicate.
15. Both children of this node are leaves, since we may suppose that the items satisfying this predicate are predominantly negative and those not satisfying it are predominantly positive.
16. Unfortunately, classifiers of all types tend to take a long time to construct.
17. For instance, if we wish to use decision trees, we need one tree per user.
18. Constructing a tree not only requires that we look at all the item profiles, but we have to consider many different predicates, which could involve complex combinations of features.
19. Thus, this approach tends to be used only for relatively small problem sizes.
"'Homerun n OR
r batter" A N D
"'pit cher"}
Q4. Describe Girwan - Newman Algorithm. For the following graph show how the Girvan Newman
STEPS:
1. Find the edge with the highest betweenness (or the multiple edges of highest betweenness, if there is a tie) and remove those edges from the graph. This may cause the graph to separate into multiple components. If so, this is the first level of regions in the partitioning of the graph.
2. Now recalculate all betweenness values and again remove the edge or edges of highest betweenness. This may break some existing components into smaller ones; if so, these are regions nested within the larger regions.
3. Keep repeating this process: recalculate all betweenness values and remove the edge or edges having the highest betweenness.
EXAMPLE:
Steps 1 to 3: [Figures showing the graph after each round of highest-betweenness edge removal; the drawings could not be recovered from the scan.]
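The steps above map directly onto a short loop. A minimal sketch, assuming the networkx library and a hypothetical six-node graph, since the question's graph is not legible in the scan: two triangles {A, B, C} and {D, E, F} joined by the single bridge C-D.

# Girvan-Newman loop: repeatedly remove the highest-betweenness edge(s)
# until the graph splits into components (the first level of communities).
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"),
              ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")])

while nx.number_connected_components(G) == 1:
    # Recompute betweenness each round, then remove all top-scoring edges.
    eb = nx.edge_betweenness_centrality(G)
    top = max(eb.values())
    for edge, score in eb.items():
        if score == top:
            G.remove_edge(*edge)

print(list(nx.connected_components(G)))  # e.g. [{'A','B','C'}, {'D','E','F'}]

On this hypothetical graph the bridge C-D carries every shortest path between the two triangles, so it is removed first and the two triangles emerge as the communities.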
Q5. Compute the page rank of each page after running the PageRank algorithm for two iterations with teleportation factor Beta (β) value = 0.8
Ans: [10M - DEC19]
[Figure: the web graph for this question could not be recovered from the scan; the visible edge labels 1/2 are transition probabilities.]
The transition matrix M (rows and columns in the order A, B, C, D, E, F; entries marked X are illegible in the original):

      A     B     C     D  E  F
A     0     1/4   1/2   X  X  X
B     1/2   0     0     X  X  X
C     1/2   0     0     X  X  X
D     0     1/4   0     X  X  X
E     0     1/4   0     X  X  X
F     0     1/4   1/2   X  X  X
Iteration I:
We compute the first iteration vector (writing M for the transition matrix above):
M · v = (0.6, 0.4, 0.4, 0.2, 0.2, 0.6) for pages (A, B, C, D, E, F).

Iteration II:
M^2 · v = M · (M · v) = (0.4, 0.2, 0.2, 0.05, 0.05, 0.4).

Therefore, the page rank computed is:
A = 0.4, B = 0.2, C = 0.2, D = 0.05, E = 0.05, F = 0.4
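For reference, PageRank with teleportation follows the power-iteration update v = beta*M*v + (1-beta)*e/n. Here is a minimal Python sketch with beta = 0.8; since the question's graph is illegible in the scan, the 3-page matrix below is purely hypothetical:

# Minimal PageRank power iteration: v <- beta*M*v + (1-beta)*e/n.
# M is column-stochastic: M[i][j] is the probability of moving to page i
# from page j. This 3-page matrix is a made-up example, not the exam graph.
n = 3
beta = 0.8
M = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
v = [1.0 / n] * n  # start from the uniform distribution

for _ in range(2):  # two iterations, as the question asks
    v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta) / n
         for i in range(n)]

print(v)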
-- EXTRA QUESTIONS --
Ans: [P | Medium]
PAGE RANK:
1. PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results.
2. PageRank was named after Larry Page, one of the founders of Google.
3. It is a way of measuring the importance of website pages.
4. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is.
5. The underlying assumption is that more important websites are likely to receive more links from other websites.
6. Search engines rely on a component called a "Web Crawler".
7. The Web Crawler is the web component whose responsibility is to identify and list the different terms found on every web page it encounters.
8. This listing of different terms is stored inside a specialized data structure known as an "Inverted Index".
9. Figure 6.4 shows the inverted index functionality.
10. Every term from the inverted index is extracted and analyzed for the usage of that term within the web page.
Figure 6.4: An inverted index mapping terms to Web Pages A, B, and C.
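The idea behind Figure 6.4 is simply a map from each term to the set of pages containing it. A minimal sketch, with page contents invented for illustration:

# Build a toy inverted index: term -> set of pages containing the term.
# The page texts are invented examples only.
from collections import defaultdict

pages = {
    "Web Page A": "big data needs distributed storage",
    "Web Page B": "page rank measures page importance",
    "Web Page C": "big data and page rank",
}

inverted_index = defaultdict(set)
for page, text in pages.items():
    for term in text.split():
        inverted_index[term].add(page)

print(sorted(inverted_index["page"]))  # ['Web Page B', 'Web Page C']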
Q2. Define content-based recommendation system. How can it be used to provide recommendations to users?
Ans: [P | Medium]
CONTENT BASED RECOMMENDATION:
1. A content-based recommender works with data that the user provides, either explicitly (ratings) or implicitly (clicking on a link).
2. Based on that data, a user profile is generated, which is then used to make suggestions to the user.
3. As the user provides more inputs or takes actions on the recommendations, the engine becomes more and more accurate.
4. Item profiles in content-based systems focus on items, while user profiles take the form of weighted lists.
5. Profiles are helpful for discovering the properties of items.
6. Consider the examples below:
a. Some students prefer to be guided by a few teachers only.
b. Some viewers prefer dramas or movies featuring their favorite actors only.
c. A few viewers prefer old songs, while others prefer new songs only, depending on the user's sorting of songs by year.
7. In general, there are many classes which provide such data.
8. A few domains have common features; for example, a college has a set of students and professors, while a movie has a set of actors and a director.
9. A certain ratio is maintained in such cases: every college and every movie has year-wise datasets, since a movie is released in a year by a director and actors, and a college has students passing out every year.
10. Music albums and books have similar value features, like the songwriter/poet, year of release, publication year, etc.
11. Consider Figure 6.6, which shows the recommendation system parameters.
Figure 6.6: Recommendation system parameters: content, item value, and community data feed the recommendation component, which produces a list of recommendations for users.
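As a concrete illustration of items 1 to 5 above, here is a hedged sketch: item profiles are keyword-weight vectors with invented values, the user profile is the average of the liked items' profiles, and candidate items are ranked by cosine similarity:

# Content-based recommendation sketch. Item profiles are keyword-weight
# vectors (values invented); the user profile is the mean of liked items.
import math

items = {
    "article1": {"homerun": 0.9, "batter": 0.7},
    "article2": {"yankees": 0.8, "batter": 0.3},
    "article3": {"election": 0.9, "senate": 0.6},
}
liked = ["article1"]

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# User profile: average weight of each keyword over the liked items.
profile = {}
for name in liked:
    for k, w in items[name].items():
        profile[k] = profile.get(k, 0.0) + w / len(liked)

ranked = sorted((n for n in items if n not in liked),
                key=lambda n: cosine(profile, items[n]), reverse=True)
print(ranked)  # article2 (shares "batter") ranks above article3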
Q3. What are the different recommender systems? Explain any one with an example.
Ans: [P | Medium]
RECOMMENDATION SYSTEM:
1. Recommendation systems are widely used nowadays.
2. A recommendation system is a subclass of information filtering system.
3. It is used to provide recommendations for games, movies, music, books, social tags, articles, etc.
4. It is useful for experts, financial services, life insurance, and social-media-based organizations.
5. There are two types of recommendation systems:
a) Collaborative filtering.
b) Content-based filtering.
In collaborative filtering, the similarity between two users a and b is measured over the items both have rated. The printed formula is not legible in the scan; the standard Pearson-correlation form consistent with the variables defined below is:

sim(a, b) = Σp∈P (ra,p - r̄a)(rb,p - r̄b) / [ √(Σp∈P (ra,p - r̄a)²) · √(Σp∈P (rb,p - r̄b)²) ]

Where a, b = users
ra,p = rating of user 'a' for item 'p'
r̄a, r̄b = average ratings of users 'a' and 'b'
P = set of items rated by both a and b
Advantages:
1. Continuous learning from the market process.
2. No knowledge-engineering efforts needed.
Disadvantages:
1. Ratings & feedback are required.
2. New items and users face the cold-start problem.
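A minimal sketch of the user-user similarity defined above, assuming ratings stored as per-user dicts; the rating values are invented:

# Pearson correlation between two users over their co-rated items.
import math

def pearson_sim(ra, rb):
    common = set(ra) & set(rb)           # P: items rated by both users
    if not common:
        return 0.0
    mean_a = sum(ra.values()) / len(ra)  # user a's average rating
    mean_b = sum(rb.values()) / len(rb)
    num = sum((ra[p] - mean_a) * (rb[p] - mean_b) for p in common)
    den_a = math.sqrt(sum((ra[p] - mean_a) ** 2 for p in common))
    den_b = math.sqrt(sum((rb[p] - mean_b) ** 2 for p in common))
    return num / (den_a * den_b) if den_a and den_b else 0.0

alice = {"item1": 5, "item2": 3, "item3": 4}
bob = {"item1": 3, "item2": 1, "item3": 2}
print(pearson_sim(alice, bob))  # 1.0: their ratings move together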
Q4. What are social network graphs? How does clustering of social network graphs work?
Ans: [P I Medium]
III) Betweenness:
1. Since there are problems with standard clustering methods, several specialized clustering techniques have been developed to find communities in social networks.
2. The simplest one is based on finding the edges that are least likely to be inside a community.
3. Define the betweenness of an edge (a, b) to be the number of pairs of nodes x and y such that the edge (a, b) lies on the shortest path between x and y.
4. To be more precise, since there can be several shortest paths between x and y, edge (a, b) is credited with the fraction of those shortest paths that include it.
5. As in golf, a high score is bad.
6. It suggests that the edge (a, b) runs between two different communities; that is, a and b do not belong to the same community.
7. Betweenness is used for:
a. Community detection.
b. Measuring edge betweenness among all existing edges.
c. Removing edges having large betweenness values.
d. Obtaining an optimized modularity function.
8. It also checks for edge betweenness centrality and vertex betweenness centrality.
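To see the definition in items 3 and 4 in action: in the path graph a-b-c, edge (a, b) lies on the shortest paths for the pairs (a, b) and (a, c), so its raw betweenness is 2. A quick check using the networkx library (an assumed dependency):

# Edge betweenness on the path graph a-b-c. Each edge lies on the shortest
# paths of exactly two node pairs, so its raw (unnormalized) count is 2.
import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c")])
eb = nx.edge_betweenness_centrality(G, normalized=False)
print(eb)  # {('a', 'b'): 2.0, ('b', 'c'): 2.0}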
Q2. (a) What is Hadoop? Describe HDFS architecture with diagram. [10]
Ans: (Chapter No. 02 | Page No. 17)
(b) Explain with block diagram the architecture of a Data Stream Management System. [10]
Ans: (Chapter No. 04 | Page No. 33)
Q3. (a) What is the use of a Recommender System? How is a classification algorithm used in a recommendation system? [10]
Ans: (Chapter No. 06 | Page No. 50)
(b) Explain the following terms with diagram [10]
1) Hubs and Authorities
2) Structure of the Web
Ans: (Chapter No. 06 | Page No. 48)
Q4. (a) What do you mean by Counting Distinct Elements in a stream? Illustrate with an example the working of the Flajolet-Martin Algorithm used to count the number of distinct elements. [10]
Ans: (Chapter No. 04 | Page No. 34)
(b) Explain different ways by which big data problems are handled by NoSQL. [10]
Ans: (Chapter No. 03 | Page No. 26)
Q5. (a) Describe the Girvan-Newman Algorithm. For the following graph show how the Girvan-Newman algorithm finds the different communities. [10]
[Figure: a graph on nodes A, B, C (top row) and D, E, F (bottom row) with edges A-B, B-C, D-E, and E-F; any edges between the two rows could not be recovered from the scan.]
Q6. (a) Compute the page rank of each page after running the PageRank algorithm for two iterations with teleportation factor Beta (β) value = 0.8
[Figure: the question's graph could not be recovered from the scan.]
Note: We have tried to cover almost every important question listed in the syllabus. If you feel any other question is important and it is not covered in this solution, do mail the question to [email protected] or WhatsApp us on +91-9930038388 / +91-7507531198.
Join BackkBenchers Community & become the Student Ambassador to represent your college & earn
15% Discount.
We organize IV for students as well at a low package. Contact us for more details.
Buy & Sell Final Year Projects with BackkBenchers. Project charges up to 10,000.