0% found this document useful (0 votes)

53 views

Bda Toppers Solution

Uploaded by

forlogin830

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

53 views

Bda Toppers Solution

Uploaded by

forlogin830

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 71

lOMoARcPSD|48592555

BDA Topper Solution - Best for final exam Preparation. You

can get all previous year mumbai university
Big Data Analytics (University of Mumbai)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university

Downloaded by asda sd ([email protected])
lOMoARcPSD|48592555

Big Data Analytics BE I SEM -7

As per Revised Syllabus w.e.f 2019-20

Downloaded by asda sd ([email protected])

lOMoARcPSD|48592555

Big Data Analytics BE I SEM - 7

TOPPER'S SOLUTIONS
....In Search of Another Topper

There are many existing pa per solution available in market, bu t Topper's Solution is the one which studen ts
will always prefer if they refer ... ;) Topper's Solutions is not j ust paper solutions, it includes many other
important questions which are important from examination point of view. Topper's Solutions are the
solution w ritten by the Toppers for the students to be the upcoming Topper of t he Semester.

It has been said that "Action Speaks Louder than Words" So Topper's Solutions Team works on same
principle. Diagrammatic representation of answer is considered to be easy & q u icker to understand. So our
major focus is on diagrams & representation how answers shou ld be answered in examinations.

Why Topper's Solutions:

❖ Point wise answers which are easy to und erstand & remem ber.
❖ Diagrammat ic representation for better understand ing,
❖ Add itiona l im portant questions from university exams point of view.
❖ Covers almost every important question.
❖ In search of another topper.

"Education is Free.... But its Technology used & Efforts utilized which
we charge"

It t akes lot of efforts for search ing out each & every question and transfor ming it into Short & Simple
Language. Entire Community is working out for betterment of students, do help us.

Thanks for Purchasing & Best Luck for Exams

---- In Association with BackkBenchers Community ----

BACl<K B.a. NCHERS

Downloaded by asda sd ([email protected])

lOMoARcPSD|48592555

Big Data An alytics BE I SEM - 7

Practice like you never WON.

Perform like you never LOST.
---- By A nonym ous.

This E-Book is Published Specially for Last Moment Tuitions Viewers

For Video Lectures visit: https;//lastmomenttujtjons,com/

Syllabus:

Downloaded by asda sd ([email protected])

lOMoARcPSD|48592555

Big Data Analytics BE I SEM - 7

Exam TT-1 TT-2 AVG Term Work Oral/Practical End of Exam Total

Marks 20 20 20 25 25 80 150

# Module Details Contents N o.

1. In t roduct ion to Big • Introd uction to Big Data 01
• Big Data characteristics, types of Big Data
Data and Hadoop
• Trad it ional vs. Big Data business ap proach
• Case Study o f Big Data Solutions
• Concept ofHadoop
• Core Hadoop Components; Hadoop Ecosystem
2. Hadoop HDFS and • Distributed File Systems: Physica l Organiza tion of Compute 17
Nodes, Large-Scale File-System Organization.
Map Red uce
• MapReduce: The Map Tasks, Grouping by Key, The Reduce
Tasks, Combiners, Details o f MapReduce Execut ion, Coping
Wit h Node Fa ilures.
• Algorit hms Using MapReduce: Mat rix-Vector Mu ltiplication by
MapRed uce, Relational-Algebra Operations, Computing
Selections by MapReduce, Computing Projections by
MapReduce, Union, Intersection, and Difference by
MapRed uce
• Hadoop Lim itations
3. No SQL • Introd uction to NoSQL, NoSQL Business Drivers 25
• NoSQL Data Architecture Patterns: Key-va lu e stores, Graph
stores, Column fam ily (Bigta b le) stores, Document stores,
Va riations of NoSQL arch itectura l patterns, NoSQL Case Study
• NoSQL solution for big data, Understanding the types of b ig
d ata problems; Analyzing big d ata w ith a shared -nothing
architectu re; Choosing distrib ution models: master-slave
versus p eer-to-peer; NoSQL systems to handl e big data
problems.
4. M ining Data • The Stream Data Model: A Data-Stream-Management System, 33
Examples of Stream Sources, Stream Queries, Issues in Stream
Streams
Processing.
• Sampl ing Data techniques in a Stream
• Filteri ng Streams: Bloom Filter with Ana lysis.
• Cou nting Distinct Elements in a St ream, Count-Distinct
Problem, Flajolet -Martin Algorithm, Combining Estimates,
Space Requirements
• Counting Frequent Items in a Stream, Sampl ing Methods for
Streams, Frequent ltemsets in Decaying Windows.
• Counting Ones in a W in dow: The Cost of Exact Counts, The
Datar-Gionis-lndyk-Motwani Algorithm, Query Answ ering in
the DGIM Algor it hm, Decaying W indows.
5. Find ing Similar • Distance M easures: Definit ion of a Distance Measure, 40
Euclidean Distances, Jaccard Distance, Cosine Distance, Edit
Items and
Dist ance, Hamming Distance.
Cl ustering • CURE Algorithm, Stream-Computing , A Stream-Clustering
Alaorit hm, lnitializina & Merain a Buckets, Answ erina Queries
6. Real-Time Big Data • PageRank overview, Efficient computation of PageRank: 48
PageRank Iteration Using Map Reduce, Use of Combiners to
Models
Consolida te the Resu lt Vector.
• A Model for Recommendation Systems, Content-Based
Recommendations, Collaborative Filtering.
• Social Networks as Graphs, Clustering of Social-Network
Graphs, Direct Discovery of Communities in a socia l graph.

CHAP - 1: INTRODUCTION T BIG DATA & HADOOP

Downloaded by asda sd ([email protected])

lOMoARcPSD|48592555

1 1 Introduction to Big Data & Hadoop BE I SEM - 7

Ql. Give difference between Traditional data management and analytics approach Versus Big data

approach

Ans: [SM - DEC 19)

COMPARISON BETWEEN TRADITIONAL DATA MANAGEMENT AND ANALYTICS APPROACH & BIG
DATA APPROACH:
Traditional data management and analytics Big data approach
approach
Traditional database system deals w ith struct ured Big data system deals w ith structured, semi
data. struct ured and unstructured data.
Traditional data is generated in enterprise level. Big data is generated in outside and enterprise
level.
Its volu me ranges from Gigabytes to Tera bytes Its volume ranges from Petabytes to Zetta bytes
or Exa bytes.
Data integ ration is very easy. Data integration is very difficult .

The size of the data is very sma II. The size is more than the traditional data size
Its data model is strict schema b ased and it is stat ic. Its data model is flat schema based and it is
dynamic.
It is easy to manage and man ipulate the data. It is d ifficu lt to manage and manipulate the data.

Its data sources includ es ERP transaction data, CRM Its data sources includes social med ia, device
t ransaction d ata, financial data, organizat ional data, data, sensor dat a, v ideo, images, aud io etc.
web transaction data etc.
The sample from known popu lation is considered as Entire population is considered as object of
object of analysis. analysis.
Normal funct ions can manipulate d ata. Specia l kind of functions can manipu late data.

Tradit ional data b ase to ols are required to perform Special kind of d ata base tools are requ ired to
any d ata base operation. perform any d ata base operation.
Tradit ional data source is centralized and it is Big data source is d istributed and it is managed
managed in centralized form. in d istr ibuted form.

- - EXTRA QUESTIONS --

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 2 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

Ql. Explain Big Data & Types of Big Data

Ans: [PI High)

BIG DATA:
1. Data is d efined as t h e quantities, characters, or symbols on which operations are p erformed by a
computer.
2. Data may be stored and t ransmitted in the form of electr ica l signa ls and record ed on magnetic, optical,
or mechanical recording med ia.
3. Big Data is also data but w ith a huge size.
4. Big Data is a term used to describe a collection o f data that is huge in size and yet g rowing
exponentia lly w it h t ime.
5. In short such data is so large and complex tha t none of t he traditional data management tools are able
to store it o r process it efficien tly.
6. Examples:
a. The New York Stock Exchange generates about one terabyte of new trade d ata per d ay.
b. The statist ic show s that 500+ terabytes of new data get in gested in to the d atabases of social med ia
site Facebook. every day. This data is ma in ly g enerated in terms of photo and video uploads,
message exchanges, putting comments etc.

TYPES:

Big Data

I
Structured
I
Unstructured
l
Semi-structured

Figure 1.1: Types of Big Data

I) Structured:
1. Any data that can be stored, accessed and processed in t he form of fixed format is termed as a
Structured Data.
2. It accounts for about 20% of the total existing data and is used the most in programm ing and
compu ter-related activities.
3. Th ere are two sou rces of structured d ata• machines and humans.
4. All t he data received from sensors, weblogs, and financial systems are classified under mach ine-
generated data.
5. These include med ical devices, CPS d ata, data of usage st atist ics captured by servers and applications.
6. Human-generated struct u red data m ainly includes all the data a human input into a computer, such
as his name and o ther persona l details.
7. When a person clicks a link on the internet, or even makes a move in a game, data is created .
8. Example: An 'Em p loyee' ta ble in a database is an example of Struct u red Data.
l Employee_lD l Employee_Name I Gender
• Handcrafted by BackkBenchers Community
Downloaded by asda sd ([email protected]) Page 3 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

420 Angel Priya Ma le

100 Babu Bhaiya Male
202 Babita Ji Female
400 Jethalal Tapu Ke Papa Gada Male
007 Dhinchak Pooj a Fema le

9. Tools g enerates structured data:

a. Data Marts
b . RDBMS
c. Greenplum
d. TeraData

II) Unstructured:
l. Any data with unknown form or the structure is classified as unstructured data.
2. The rest of the data created, about 80% of the total account for unst ructured big data.
3. Unstruct u red data is also c lassified based on its source, into machine-generated or human-gene rated.
4. Machine-generated data accounts for all the satellite images, the scien tific data f rom va rious
experiments and radar data captured by various facets of technology.
5. Human-generated unst ructured data is found in ab undance across the in ternet since it includes social
media data, mobile data, and website content.
6. Th is means that the p ictures we upload to Facebook or lnstagram handle, t he videos we watch on
You Tube and even the text messages w e send a ll contribute to the gigantic heap t h at is unstructured
data.
7. Examples of unstructured data include text, video, audio, mobile activity, social med ia activ ity, satellite
imagery, surveillance imagery etc.
8. The Unst ructured data is fu rt her divided into:
a. Captured data:
• It is the data based on t he user's behavior.
• The best example to understand it is GPS v ia smart phones which help the user each and every
moment and provides a rea l-time output.
b. User-generated data:
• It is the kind of unst ructured data w here the user itself will p ut data on t he int ernet every
movement.
• For example, Tweets and Re-tweets, Likes, Shares, Comments, on YouTube, Facebook, etc.
9. Tools generates unstructured data:
a. Hadoop
b. HBase
C. Hive
d. Pig
e. MapR
f. Cloudera

1111 Semi-Structured:

• Ha ndcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page4of65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

1. Semi Structured dat a is information that does not reside in a RDBMS.

2. Information that is not in t he t raditional database format as structured data, but contains some
organizationa l properties which make it easier to process, are included in semi-structured data.
3. It may organized in t ree p attern which is easier to analyze in some cases.
4. Examples of semi structured data m ight include XM L documents and NoSQL databases.
Personal data stored in an XM L fi le
<rec><name>Angel Priya</name><sex>Ma le</sex></rec>
<rec><name>Babu Bhaiya</name><sex>Male</sex></rec>
<rec><name>Babita Ji</name><sex>Female</sex></rec>
<rec><name>Jetha lal Tapu Ke Papa Gada</name><sex>Male</sex></rec>
<rec><name>Dh inchak Pooja</name><sex>Female</sex></rec>

Q2. Explain Characteristics of Big Data or Define the three V's of Big Data

Ans: [P I High]

CHARACTERISTICS OF BIG DATA:

I) Variety:

1. Variety of Big Data refers to structured, unstructured, and semi structured data that is gathered from
multiple sources.
2. Th e type and natu re of data is having great variety.
3. During earlier days, spreadsheets and databases were the only sources of data considered by most of
the applicat ions.
4. Nowadays, data in the form of ema ils, photos, videos, monitoring devices, PDFs, audio, etc. are also
being considered in the an alysis applicat ions.

Big Data

Stru<;tur9d
l
Unstruc-tured Semi•nruc.tured

11) Velocity:

1. The term velocity refers to the speed of genera tion of data.

2. Big Dat a Velocity deals w ith t he speed at which data flows in from sou rces like business processes,
appl ication logs, networks, and socia l med ia sit es, sensors, Mobile devices, etc.
3. The flow of data is massive and con tinuous.
4. The speed of data accumulation also plays a role in determ ining whether the data is categorized into
big data or norma l data.
S. As can be seen from the figure 1.2 below, at first, ma inframes were used wherein fewer people used
computers.
6. Then came the client/server model and more and more computers were evolved.
7. After this, the web appl ications came into the pict ure and sta rted increasing over the Internet.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page S of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

8. Then, everyone began using these applications.

9. These applicat ions were then used by more and more d evices such as mobiles as they were very easy
to access. Hence, a lot of da ta'
Every 60 second

100,000• TW1'ets
6~5.000• F'B Status Updates
11.000,000+ Instant Messages
700,445• Google Searches
168,ooo,ooo..- Emalls
220• New Mobele users
2000 TB• Data Created

Main Cllent/
Frame Server

Figure 1.2: Big Data Velocity

111) Volume:

1. Th e name Big Data itself is related to a size which is enormous.

2. Size of d ata p lays a very crucia l ro le in d ete rm ining va lue out of data.
3. Also, w hether a part icular data can actua lly be considered as a Big Data or not, is d ependent u p on the
volume of data.
4. Hence, 'Volume' is one character istic which need s to be considered wh ile dealing w ith Big Data.
5. Th is refers to t he data tha t is tremendously large.
6. As shown in figu re 1.3 b e low, t h e volume of d ata is risi ng exponentia lly.
7. In 2016, the data c reated was only 8 ZB and it is exp ected t h at, by 2020, the data would rise up to 40
ZB, which is extremely large.

Data Growth
4S
40
3S
., 30
!:!
]25
3
U 20
~
:;; 15
0

10
s
0
2011 2012 2013 2016 2020
Ytar

Figu re 1.3: Big Data Volume

OTHER CHAQACTEQISJICS OF BIG PATA:

• Ha ndcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 6 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

IJ Programmable;
1. It is possible w ith big data to explore all types by programming logic.
2. Programming can b e use d to perform any kind o f explorat ion because of t he scale of the data.

II) Data Driven:

1. The data driven ap proach is p ossible for scientists.
2. As data collected is huge amount.

Ill) Multi Attributes:

1. It is possible to dea l w ith many gigabytes of d ata that consist of thousands o f attributes.
2. As all d ata operations are now happening on a larger scale.

IV] Veracity:
1. Th e data captured is not in certain for mat.
2. Data capt ured can vary g reatly.
3. Veracity means t he trustworthiness and qua lity of d ata.
4. It is n ecessary that the veracity of the data is mainta ined.
5. For example, th ink about Facebo ok posts, w ith hashtags, abbreviat ions, imag es, videos, etc., which
make them unreliable and hamper the q uality of their content.
6. Collecting loads and loads o f d ata is o f no use if the quality an d trustwort h iness of t he d ata is not up to
the mar k.

Q4. Applications of Big Bata

Ans: (P I Medium)

APPLICATIONS OF BIG BATA:

I) Healthcare & Public Health Industry:

1. Big Data has already st arted to create a huge d ifference in the healthcare sector.
2. W ith the help of predictive analytics, medical professionals and HCPs are now able to provide
personalized healthcare services to individ ua l pa tients.
3. Like entire DNA str ings can be decoded in m inutes.
4. Apar t from that, fitness wearab le's, telemedicine, remote monitoring - all powere d by Big Data and A l
- are helping change lives for the better.

11) Academia
1. Big Data is also helping enhance education today.
2. Education is no more lim ited t o the physica l bound s of the classroom - there are numerous online
educational courses to learn from.
3. Acad emic institutions are investing in digital cou rses powered by Big Data technologies to aid the all -
ro und development of b udding learners.

111) Banking
1. The banking sector relies on Big Data for fraud detection.
2. Big Data to ols can efficient ly detect fraudulent acts in real-time such as m isuse of cred it/debit cards,
archival of inspection t racks, faulty alteration in customer stats, etc.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page? of65
lOMoARcPSD|48592555

1 1 Introduction to Big Data & Hadoop BE I SEM - 7

IV) Manufacturjna
1. Accordin g t o TCS Globa l Trend St u dy, the most significant b enefit of Big Dat a in manufacturing is
improving the supply strategies and product qua lity.
2. In t h e manufacturing sector, Big data helps cr eate a transparent infrastructu re, thereby, predicti ng
uncertainties and incompetence's that can affect the business adversely.

V) IT
1. One o f the largest users of Big Data, IT compan ies around the world a re using Big Data to opti m ize
their function ing, enhance employee product ivity, a nd m in im ize risks in busi ness operations.
2. By combining Big D ata technologies w ith M L and A l, the IT sector is continually power ing innova tion
to find solutions even for the most complex of problems.

Q4. Write short notes on Hadoop

Ans: [P I Medium]

HADOOP:

l. Hadoop is an open source softwa re prog ramming fram ework for storing a larg e amount of data a nd
performing the computation.
2. Its framework is based on Java p rogramming with some n ative co de in C and shell scripts.
3. Apache Software Foundation is the developers o f Hadoop, an d its co-found ers are Doug
Cutting and Mike Cafarella.
4. Th e Hadoop framewor k ap p licat ion works in an e nvironment that provides
distributed storage and computation across clusters of co mputers.
5. Hadoop is designed to sca le up from sing le server to thousands of m ach ines, each offering local
computation and st orage.

FEATURES:

1. Low Cost
2. Hig h Computi ng Power
3. Scalability
4. Huge & Flexib le Storage
5. Faul t Tole rance & Data Protect ion

HADOOP ARCHITECTURE:

l. Figure 1.4 shows architecture of Hadoop.

2. At its core, H adoop has two major layers namely:
a. Processing/Computation layer (MapReduce), an d
b. Storage layer {Hadoop Distributed File System).

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 8 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

Hadoop

M apR.cfuc•
(Oistrlbuted Computation)

HOFS
(Distributed Storage)

YARN Common
Fram.work Utllltlos

Figure 1.4: Hadoop Architecture

MapReduce:
l. Map Red uce is a parallel programm ing model for w rit ing distr ibuted appl ications.
2. It is used for effic ient pro cessing of la rge amounts o f data (mu lt i-terabyte data-sets), on larg e cl usters
(thousands of n o des) o f commo dity hardware in a re lia ble, fault-tolerant manner.
3. Th e M apRed uce progra m runs on H adoop w h ich is an Apache open-source f ramework.

Hadoop Distributed File System:

l. The Hadoop Distributed File System (HDFS) is based on the Google File System (CFS)
2. It prov ides a d istributed fi le system t hat is designed to run on commod ity h ardwa re.
3. It has many sim ilarit ies w ith existing distr ibuted file systems.
4. However, the d iffere nces from other distr ibut ed file systems a re significant.
5. It is highly fault-tolerant and is designed to b e d e ployed on low-cost hardwa re.
6. It provides high throughput access to applicat ion data and is suitable for applications having large
datasets.

Hadoop framework also includes the following two modules:

l. Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.
2. Hadoop YARN: Th is is a framework for job scheduling and cluster resource management.

ADVANTAGES:
l. Abil ity to store a large amount of dat a.
2. High flexibility.
3. Cost effective.
4. High computa tional power.
5. Tasks are ind ep endent.
6. Linear scaling.

PISAPYANIAGES:
l. Not very effective for small data.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 9 of 65
lOMoARcPSD|48592555

1 1 Introduction to Big Data & Hadoop BE I SEM - 7

2. Hard cluster management.

3. Has sta b ility issues.
4. Securi ty concerns.

QS. Write physical architecture of Hadoop

Ans: [P I Medium]

PHYSICAL ARCHITECTURE OF HADOOP:

l. Hadoop is a n o pen source software framework which provides huge data sto rage.
2. Runn in g Hadoop means runn in g a set of resident prog rams.
3. Th ese resident program s a re a lso know n as daemons.
4. These daemons may be running on the same server or on the different servers in the network.
5. Figure 1.5 show s H adoop cl uster topology.

Figure 15: Hadoop Cluster Topology

WORKING:

l. Whe n the cl ie nt w ill submit his job, it will g o t o the n am e n od e.

2. Now name n od e w ill d ecide whether to accep t the job or not.
3. After acce pt ing t h e job, the na m e n o de will t ra nsfer t h e j ob t o the j ob tracker.
4. Then t he job t racker w ill d ivide the job in to com p o nents and transfer them to data nod es.
5. Now data n o des w ill further t ra nsfer t he j o b s to t he task t racke r.
6. Now the actual p ro cessing w ill b e d one h ere, means t he execut ion of the job sub mitted is d o ne here.
7. Now , after completing the part o f the jobs assigned to them , the task t rack er w ill subm it t h e completed
t ask to the j o b track er via the d at a node.
8. Now , com ing on second ary name nod e, the task of secondary name n o de is to j ust m o n itor t he whole
process ong o ing.
9. Now, physical architecture of Hadoop is a Master-slave process, here name nod e is a master, job
tracker is a part of mast er and d ata nod es a re the slaves.

COMPONENTS:
I) Name Node:
l. It is the master of HDFS (Hadoop fi le system).
2. It contains Job Tracker, which k eeps tracks of a file d istributed to d ifferent data nodes.
3. Na me Node d irects Da ta Node regarding t he low level 1/0 tasks.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 10 of 65
lOMoARcPSD|48592555

1 1 Introduction to Big Data & Hadoop BE I SEM - 7

4. Failure of Name Node wil l lead to the failure of the full Hadoop system.

111 Data node;

1. Da ta node is the slave of HDFS.
2. A data node can communicate with each oth er through t h e name node to avoid replication in t h e
provided task.
3. For replication of data a data node may commun icates with other d ata nodes.
4. Data node continually informs local change updates to name nodes.
5. To create, move or delete blocks, d ata node receives instructions from t h e local disk

Il l) Job Tracker:
l. Job Tracker det erm ines which file to process.
2. Th ere can be on ly one job t racker for p er Hadoop cluster.
3. Job Tracker r uns on a server as a master n o de o f the cluster.

IV] Task Tracker:

1. On ly single task tracker is present per slave n o de.
2. Task Tracker performs tasks given by j ob tracker and also contin uously communicates with the job
tracker.

V) SSN {Secondary Name Node)

1. Its main purpose is to monitor.
2. State Mon itoring o f cluster HDFS is done by SNN.
3. SNN resides on its own machine also.
4. One SSN is present per clust er.

Q6. Write short notes on Core Hadoop Components

Ans: (P J High)

CORE HADOOP COMPONENTS:

I
Map ~·uce
I I JobTracku
I I Task Trac.ker
I I TukTracker
I

HOS:$ Clu-1ter
I I Harne Node
I I Data Node
I I Oata Node
I
I 1 N

Figure 1.6: Hadoop Core Components

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 11 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

1. Hadoop has a master-slave topology.

2. In t his topology, w e h ave one m aster node and multiple slave n odes.
3. Master node's function is to assign a task to various slave nodes and manage resources. The slave nodes
d o t h e actual com putin g .
4. Slave nodes store the real data whereas on master we have metadata.
5. Figure 1.6 shows Hadoop core components.
HADOOP DISTRIBUTED FILE SYSTEM IHDFS):
1. H DF5 is a file system for Hadoop.
2. HDFS is based on Goog le File System (GFS).
3. It runs on clusters on commodity hard ware.
4. Th e file system h as several similar it ies with the existing distr ibuted file systems.

Characteristics:
1. High Fau lt Tolerant.
2. High throug h p ut.
3. Su pports applicat ion w it h m assive datasets.
4. Streaming access to file system data.
5. Can b e b uilt out o f commod ity ha rdware.

Architecture:
Fig u re 1.7 shows HDFS A rch itectu re.

Meud.ita {N~me,
Replicas.. )
Nam-e Node
Metadata ops ...
,,,,···

Block ops
Client

Re.id Dua nodes Data nodes

Replicatlon

□
□ [] □
Writ e
Rack 2
Aackl

Figure 1.7: HDFS Architecture

HDFS follows the master -slave architecture and it has the following elemen ts.
11 Namenode:
1. It is a deamon w h ich runs on master node o f had oop cluster.
2. There is on ly one n amenode in a clust er.
3. It contains metadata of all the files stored on HDFS which is known as namespace of HDFS.
4. It maint ain two fi les i.e. Edit Log & Fslmage.

• Ha ndcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 12 of 65
lOMoARcPSD|48592555

1 1 Introduction to Big Data & Hadoop BE I SEM - 7

5. Editlog is used to record every change t hat occurs to file system m e tadata (transaction hist ory)
6. Fslmage stores entire namespace, mapping of b locks to files and fi le system properties.
7. The Fslmage and the Editlog are centra l data structures of HDFS.
8. The system having the namenode acts as t he master server and it does the following tasks:
a. Manages the file system namespace.
b. Regulates client's access to files.
c. It also executes file system operations such as renaming, closing, and opening files and directories.

II} Datanode:
l. It is a deamon which runs on slave machines of H adoop cluster.
2. There are number of datanodes in a cluster.
3. It is responsib le for serving read/write req uest from t he cl ients. It also perform s b lock c reation, delet ion,
and repl ication upon inst ruction from the Nameno de.
4. It also sends a H eartbeat message to t he namenode periodica lly about the b locks it hold.
5. Namenode and D atanod e machines typically run a GNU/Linux operatin g system (OS).

Ill} Block:
l. Generally the user data is stored in the files of HDFS.
2. Th e fi le in a file system will be divided into one or more segments an d/or stored in in d iv idual data nodes.
3. These file segments are ca lled as b locks.
4. In other words, the m inimum amount of data that HDFS can read o r write is called a Block.
5. Th e default block size is 64MB, but it can be increased as per t he need to ch ange in HDF5 configurat ion.

MAPREPUCE:
1. Map Red uce is a software framework.
2. Map Red uce is the d ata processing layer of H adoop .
3. It is a software framework that allows you to w r ite applications for processing a large amount of data.
4. Map Reduce runs these applications in p arallel on a cluster of low-end machines.
5. It does so in a reliab le and fault-tolerant manner.
6. In MapReduce an a pplication is broken down into numb er of small parts.
7. These sma ll par ts are also called as fragments or b locks.
8. Th ese blocks t hen can be run on any node in the cluster.
9. Data Processing is done by MapReduce.
10. Map Reduce sca les and runs an application to d ifferent clutter machines-
11. There are two primitives used for data processing by MapReduce known as Mappers & Reducers.
12. Map Reduce use lists and key/valu e pairs for processing of data.

MapReduce Core Functions:

I} Read Jnpyt;
l. It d ivides input into small blocks.
2. These blocks then get assigned to a Map function.

111 function Mappjna;

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 13 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

1. It converts file data to sma ller, intermediate <key, value> pairs.

Ill) Partition, Compare & Sort:

a. partjtion Function; W ith the g iven key and number o f reducers it finds the correct reducer.
b. Compare Function: Map intermediate outputs are sorted according to this compare function.
IV) Function Reducing:
1. Intermediate values are red uced to smaller solut ions and g iven to output.
VJ Write Output:
Gives file outpu t

lnl)Vt M•p Shuffle & Sort Reduce Output

M•p
0
I
N
u
Reduce
p T
p
u
M•p u
T
T

D Reduce D
A
A
T Map
T
A
A

Figure 1.8: General MapReduce DataFlow

Example:
Fi le 1: "Hello Babita Hello Jethalal"
Fi le 2: "Goodnight Babita Goodnight Jethalal"
Operations:
(l) Map:

Mapl Map2

<Hello, 1> <Goodnight, l>

<Babita, 1> <Babita, 1>
<Hello. l> <Goodnight, 1>
<Jethalal, 1> <Jethalal, l>

(2) Combine·

Combine Map 1 Combine Map 2

<Babita, 1> <Babita, 1>

<Jethalal, l > <Jet halal, l>
<Hello, 2> <Goodnight 2>

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 14 of 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

(3) Reduce

<Babita, 2>
<Jethalal, 2>
<Good night, 2>
<Hello, 2>

Q7. Expla in Hadoop Ecosystem

Ans: [P I Medium]

HADOOP ECOSYSTEM:

l. Core Hadoop ecosystem is nothing but the different components that are built on the Hadoop
platform directly.
2. Figure 1.9 represents Hadoop Ecosystem.

! Bl Reportlng ,i i ROSMS i
.' -------·----' :======
!_
- ··:.;':
z Pf9 10•u Flow) I LI_ _ H_"'_•_1s_Q_L1_.... I ...I __5q_oop
_ ___,

0
0 MaipRe-duce (Job Scheduling / Execution System)
A
K
V
E
HBas• R
E
0
p
E
R HOFS

Figure 1.9: Hadoop Ecosystem.

I) Hadoop Distributed Elle System (HPES):
1. HDFS is the foundation of Hadoop and hence is a very important component of t he Hadoop
ecosystem.
2. It is Java software that provides many features like sca lability, high ava ila bility, fault tolerance, cost
effectiveness etc.
3. It also provides rob ust d ist r ibuted data storage for Hadoop.
4. We can deploy many other software frameworks over HOFS.

II) MapReduce:
1. MapReduce is the data processing component of Hadoop.
2. It applies the computation on sets of data in parallel thereby improving the performance.
3. MapReduce works in two phases:
a. Map Phase; This phase takes input as key-value pairs and produces output as key-va lue pairs. It can
w rite custom business logic in th is phase. Map phase processes the data and g ives it to the next
phase.
b. Reduce Phase; The MapReduce framework sorts the key-value pair before giving the data to th is
phase. This phase applies the summary type of calcu lations to the key-value pairs.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 1Sof 65
lOMoARcPSD|48592555

11 Introduction to Big Data & Hadoop BE I SEM -7

Ill) ~
1. Hive is a d ata warehouse project b u ilt on the top of Apache Hadoop which provides data query and
ana lysis.
2. It has got the language of its own call HQL o r Hive Query Language.
3. HQL automatically translates the qu eries into the corresponding map-reduce job.
4. Main p arts of the Hive are -
a. MetaStore: It stores metadata
b . Driver; Manages the lifecycle of HQ L statement
c. Query Compiler: Compiles HQL into DAG i.e. Direct ed Acycl ic Graph
d . Hive Server: Provides interface for JDBC/ODBC server.

IV] Pig:
1. Pig is a SQL lik e lan g uage used for q ueryin g and a nalyzing d ata stored in HDFS.
2. Yahoo was the origina l creator o f t he Pig .
3. It uses p ig lati n la nguage.
4. It loads t h e data, appl ies a filter to it and d u m p s the data in t h e req uired format .
5. Pig a lso consists of JVM called Pig Runtime. Various features of Pig a re as follows:-
a. Extensibility: For carrying o ut sp ecia l purpo se processi ng, users ca n cre ate t h eir own custom
function .
b. Optimization opportunities: Pig a ut omatically optim izes t h e q ue ry allow in g users to focus on
semantics rather than efficiency.
c. Handles all kinds of data: Pig an alyzes b oth stru ctured as w e ll as unstruct u red.

V) HBase:
1. HBase is a NoSQL d ata base b u ilt on the top of HDFS.
2. The vario us features of HBase are that it is open-source, non-relat iona l, d istributed d atabase.
3. It im itates Google's Bigtable and written in Java.
4. It provides real-time read/write access to la rge datasets.

VI) Zookeeper:
1. Zookee per coord in ates betwee n var ious se rvices in t he Hadoop ecosyst em.
2. It saves the time req uired for synchronization, configura tion maintenance, g ro uping, and naming.
3. Following are the features of Zookeeper:
a. Speed: Zookeepe r is fast in workload s where reads to da ta are more than write. A typical read: write
ratio is 10:1.
b. Organized: Zook eeper maintains a record of all t ransact ions.
c. Simple: It ma intains a sing le h ie rarchical namespace, sim ilar to d irectories and files.
d. Re lia ble; We can replicate Zook eeper over a set of hosts and they are aware o f each other. There is
no single point of fa ilure. As long as major servers a re ava ilable zookeeper is available.

Vll)Sqoop:
1. Sqoop imports data from externa l sources into compatible Hadoop Ecosystem components like HDFS,
Hive, HBase etc.
2. It a lso transfers d ata from Hadoop to ot her externa l sou rces.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 16 of 65
lOMoARcPSD|48592555

1 1 Introduction to Big Data & Hadoop BE I SEM - 7

3. It works w ith RDBMS like Tera Data, Oracle, MySQL and so on.
4. The major difference between Sqoop and Flume is that Flume does not work with structured data.
5. But Sqoop can deal w ith structured as well as unstructured data.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 17 of 65
lOMoARcPSD|48592555

2 I Hadoop HDFS & MapReduce BE I SEM - 7

CHAP· 2: HADOOP HDFS & MAPREDUCE

Ql. What is Hadoop? Describe HDFS architecture with diagram

Ans: [l0M - DEC 19]

HADOOP:

l. Hadoop is an open source software prog ramming framework for storing a larg e amount of data and
performing the computation.
2. Its framework is based on Java programming w ith some native cod e in C and shell scripts.
3. Apache Software Foundation is t he developers of Hadoop, and its co-founders are Doug
Cutting and Mike Cafarella.
4. The Hadoop framework application works in an environment tha t provides
distr ibuted storage an d computation across clusters o f computers.
5. Hadoop is designed to sca le up from sing le server to thousands of machines, each offering local
computation and storage.

FEATURES OF HADOOP:

l. Low Co st
2. High Computing Power
3. Scalability
4. Huge & Flexib le Storage
5. Fault Tolerance & Data Protection

HADOOP DISTRIBUTED FILE SYSTEM (HDFSl:

l. H DFS is a file system for Hadoop.
2. HDFS is based on Goog le File System (GFS).
3. It ru ns on clusters on commodity hardware.
4. Th e file system has several sim ilar it ies w ith the existing distr ibut ed file systems.

Characteristics:
1. High Fau lt Tolerant.
2. High throughp ut.
3. Supports application w it h massive datasets.
4. Streaming access to file system da ta.
5. Can be bu ilt out of commod ity hardware.

Architecture:
Figure 2.1 shows HDFS Arch itecture.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 18 of 6S
lOMoARcPSD|48592555

2 I Hadoop HDFS & Map Reduce BE I SEM - 7

Metadata (Name,

Name Node
......,.•·····•·"· Repllcas..)

Block ops
Client

Read Data nodes

Data nodes
Repllcatlon

□
□ [] □

Rack l Rack2
Client

Figure 2.1: HDFS Architecture

HDFS follows the master-slave architecture and it has the following elements.
I) Namenode:
l. It is a deamon which runs on master nod e of had oop cluster.
2. There is only one namenode in a cluster.
3. It contains m etadata of all t h e files sto red on HDFS which is known as n am esp ace of HDFS.
4. It m aintain two files i.e. Edit Log & Fslm age.
5. Edit l o g is used t o record eve ry change that o ccurs to file system m etadata {transaction hist ory)
6. Fslm age stores entire namespace, mapping of b locks to files and file system p roperties.
7. Th e Fslmag e an d the Edit l og are centra I d ata stru cture s of H DFS.
8. Th e syst em having the namenod e acts as t he master se Ner an d it does the following task s:
a. Manages the file system namespace.
b. Regulat es client's access to files.
c. It a lso executes file system operations such as renaming, closing, and opening files and d irectories.

11) Datanode:
l. It is a d eamon which runs on slave mach ines of Hadoop cluster.
2. There are number of dat anod es in a cluster.
3. It is responsib le for seNing read/write request from the cl ients. It a lso performs b lock creation, delet ion,
and replicat ion upon instruct ion from th e Namenode.
4. It a lso sends a Heartbeat message to the namenode period ica lly about the b locks it hold .
5. Namenod e and Datanode machines typical ly run a GNU/Linux o p erating system {OS).

111) Block:
1. Generally the user data is st ored in the files of HDFS.
2. The file in a file system will be divided into one or more segm ents and/or stored in ind ividual data nodes.
3. These file seg ments are called as b locks.

• Handcrafted by BackkBe nchers Community

Downloaded by asda sd ([email protected]) Page 19 of 65
lOMoARcPSD|48592555

2 I Hadoop HDFS & MapReduce BE I SEM - 7

4. In ot her words, the minimum amount of data that HDFS can read or write is cal led a Block.
5. The default bloc k size is 64MB, but it can be increased as per the need to change in HDFS configuration.

Q2. What is the role of JobTracker and Task.Tracker in MapReduce. Illustrate Map Reduce

execution pipeline with Word count example

Ans: [10M - DEC 19)

MAPREDUCE:
l. MapReduce is a software framework.
2. Map Red uce is the data processing layer of Hadoop.
3. Sim ilar to HDFS, MapReduce also exploi ts master/slave architecture in which JobTracker runs on
master nod e and Task.Tracker runs on each salve node.
4. Task Trackers are processes ru nning on data nodes.
5. These monitors the map s and reduce tasks executed on the node and coordinates w ith Job tracker.
6. Job Tracker monitors t h e ent ire MR j ob execut ion.
7, JobTracker and Task.Tracker are 2 essential process involved in MapReduce execution in MRvl (or
Hadoop version 1).
8. Both processes are now deprecated in MRv2 (or Hadoop version 2) and replaced by Resource Manager,
Applicat ion Master and Node Manager Daemons,

JOB TRACKER:
l. JobTracker is an essent ial Daemon for MapReduce execution in MRvl.
2. It is replaced by Resource Manager/ Appl icat ion Master in MRv2.
3. JobTracker receives the requ ests for MapReduce execution from t he cl ient.
4. JobTracker t alks t o the Nam eNod e t o determ ine t h e location o f t he data.
5. JobTracker finds the best Task Tracker nodes to execute tasks b ased on the data locality (proxim ity of
the d ata) and t h e available slots to execute a task on a g iven node.
6. JobTracker mon itors the individua l Task Trackers and the su bm its b ack the overall status of the job
back to t he client.
7. If a task fails, t he JobTracker can reschedu le it on a different TaskTrackers.
8. When the JobTracker is down, HDFS wil l still be functiona l bu t the Map Reduce execution cannot be
started and the existi ng MapRed uce jobs wil l be halted.

TASKTRACKER:
1. TaskTracker ru n s on Data Node.
2. TaskTracker is replaced by Nod e Manager in MRv2.
3. Mapper and Reducer tasks are executed on Data No des administered by TaskTrackers.
4. TaskTrackers w ill be assigned Mapper and Reducer tasks to execu te by JobTracker.
5. TaskTracker wil l be in constant communication w ith the JobTracker signal ing the p rog ress of the task
in execution.
6. TaskTracker failu re is not considered fatal. When a TaskTracker becomes unresponsive, Job Tracker w ill
assig n the task executed by the Task.Tracker to another node.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page20of65
lOMoARcPSD|48592555

2 I Hadoop HDFS & MapReduce BE I SEM - 7

A WORD COUNT EXAMPLE OF MAPREPUCE:

l. Let us underst and, how a Ma pRed uce works by taking a n exa m ple where I have a text file called
example.txt whose conten ts are as fol lows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
2. Now, suppose, we have to perform a word count on the sample.txt using MapReduce.
3. So, we w ill be find ing t he unique words and the number of occurrences of those unique words.

Input Splitting Reduclng Final Result

.... "2 V2) K2 l.iti (V2)

Bear I\. 1) }-

Li:s.tl)O VJI

a ,.1
c.,.1
Atv@f, I

Figure 2.2: A WORD COUNT EXAMPLE OF MAPREDUCE

4. First, we divide the input in to three spl its as shown in t he figure 2.2.
5. This w ill d istr ibute the work among all the map nod es.
6. Th e n, w e toke n ize t he words in each of the m appers a nd g ive a h ard coded value {l ) t o each of t he
tokens or words.
7. Th e rat ionale behind g iving a ha rdco ded value equal t o 1 is t hat every w ord, in itself, w ill o ccu r once.
8. Now, a list of key-value pair will be created where t he key is nothing but the individual words and value
is one.
9. So, for the first line (Dear Bear River) we have 3 key-va lue pairs - Dea r, 1; Bear, 1; River, l.
10. The mapping process remains the same on al l the nodes.
11. After the mapp er p hase, a partition process t akes p lace w h ere sort ing and shuffling ha p pen so that all
the tuples w ith the same key are sent to the correspond ing reducer.
12. So, after the sorting and shuffl ing p hase, each red ucer will have a unique key and a list o f values
correspond ing to that very key. For example, Bear, [1,1); Car, [1,1,1].., etc.
13. Now, each Reducer counts t he values which are present in that list of values.
14. As shown in the figu re 2.2, reducer get s a list of values which is (1,1] forthe key Bear.
15. Then, it counts the numb er of ones in the very list and g ives the final output as - Bear, 2.
16. Finally, all the output key/va lue pai rs are then collected and written in the output file.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 21 of 6S
lOMoARcPSD|48592555

2 I Hadoop HDFS & MapReduce BE I SEM - 7

•· EXTRA QUESTIONS ••

Ql. List out limitations of Hadoop

Ans: [P I Low]
HADOOP LIM ITATION:
1. Issue w ith small files.
2. Slow processing speed.
3. Latency.
4. No Real Time Data Processing.
5. Support for Batch Processing only .
6. No Caching.
7. Not Ease of Use.
8. No De lta Iterat ion.

Q2. Explain Union, Intersection & Difference operation with MapReduce Techniques
Ans: (P I Medium]
UNION WITH MAPREDUCE:
1. For union operation, map phase has responsib ility of convert ing the t uple 't' values from Relation 'R' t o
a key value pa ir format.
2. Reducer has the t ask to assign t h e values to same key 't'
3. Figure 2.3 shows union operat ion with MapReduce Format.

..- .
Figure 2.3: Union Operation with MapReduce Format.
--
INTERSECTION WITH MAPQEDUCE:
1. For intersection operation, map phase has same responsibility as that of union operation.
2. That is to convert tuple 't' values from Relation 'R' to a key va lue pa ir format.
3. Reducer is responsible for generating output by evaluating if else condition.
4. That is then only output w ill be generated in key va lue format e lse no o u tput w ill be produce.
5. Figure 2.4 shows intersect ion operation w ith MapReduce For mat.

11-Q
l
"""'
t, •

Figure 2.4: Intersection Operation with MapReduce Format.

--
• Handcrafted by BackkBenchers Community
Downloaded by asda sd ([email protected]) Page 22 of 65
lOMoARcPSD|48592555

2 I Hadoop HDFS & MapReduce BE I SEM - 7

DIFFERENCE WITH MAPREDUCE:

l. For d ifference operat ion, In Ma p p hase consider tw o relations n amely R & S.
2. Then the t uple 't' in Relation Rand tuple 't' in Relation Swill produce a key -value p air (t, R) an d (t, S) for
Relation R and Relat ion S resp ectively.
3. Then t he outpu t of map phase i.e. key -va lue pair wil l b e su b mitt ed to reducer as an in put.
4. Reducer w ill n ow g enerate the key -value p air (t1, t) in generalized for m for key 't' of every relation if and
only if correspond ing relation (R or SJ has an asso ciated collect ion of item i.e. list [R] else NULL w ill be
output.
5. Figure 2.5 sh ow s d ifference operat ion wit h MapRed uce.

Produ« the
TUplet
key,,vakl•
Pair (t. A)

Olo.totlon A

For ea.ch ksy 't' if.,.... h.ilw ~l;rrt~ kt

Pfoduce the (R) \Mn ol,ftpUt b k ~ SMir (t. tl e4s.e NIJU.
Tuple t

0,~5

Figure 2.S: Difference Operation with MapReduce Format.

Q3. Explain Matrix Vector Multiplication by MapReduce

Ans: IP I High)
MATRIX VECTOR MULTIPLICATION:
l. Consider a Mat rix 'M' o f size n x n.
2. Let 'i' represents t he row s and 'j' represents t h e colu mns.
3. Any element is Mat r ix w ill be represent ed as M ;
4. Assume that there is an n-dimensiona l vector V, th e first j elements d en oted asv,.
5. So, mat r ix M an d t he vector V is the result o f an n -d imen sional vect or x.
6. The first 'i' elements X; is g iven as:

?. If n = 100, we d o n ot want to use a DFS or Map Red uce for t h is calcula tion.
8. But such kind of ca lcu lations are req u ired while maint ain ing th e pag e rankin g of web pages for a g iven
search.
9. In rea l time, the value of n is in some b illions.
10. Let us first assume th at n is large, but not so large that vector v cannot fit in ma in m emory and th us be
ava ilab le to every Map task.
11. The mat rix Mand t h e vector v each w ill be stored in a file o f the DFS.
12. We assume tha t the row-column coord in ates o f each ma tr ix element w ill be d iscoverab le, either from
its position in the file, or because it is stored w it h exp licit co ordin ates, as a t rip le (i, j, m ij ).

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 23 of 65
lOMoARcPSD|48592555

2 I Hadoop HDFS & Map Reduce BE I SEM - 7

13. We also assume the position of element vj in the vector v w ill be discoverable in the analogous way.
14. The Map Function:
a. The Map function is written to apply to one element of Matrix M.
b. However, if vector v is not already read in to main memory at t he comput e node executing a Map
task, then vis first read, in its entirety, and subsequently w ill be avai lable to all applications of the
Map function performed at t h is Map task.
c. Each Map task will operate on a chunk of the matr ix M .
d . From each matrix element mij it produces the key-value pair {i, mij' vj).
e. Th us, all t er ms o f the sum that m ake u p t h e component xi o f the m at rix-vector p roduct w ill get the
same key, i.
15. The Reduce Function:
a. The Reduce function simply sums all the values associated w ith a given key i.
b. The re su lt w ill b e a p air {i, xi).
c. W e can divide t h e m atrix int o vertical st r ip es of equal width and divide t h e vector into an equa l
number of horizontal stripes, of the same height.
d . Our goal is to use enoug h str ip es so that t he portion of the vector in one str ip e can fit conven ien t ly
into main memory at a compute node.
16. Figure 2.6 shows what t h e partition lo oks like if the m at rix and vector are each divided into four st r ipes.

MatrlxM Vectorv

Fig ure 2 .6 : Matrix Vector Multiplication

Q4. What are Combiners? Explain the position and significance of combiners
Ans: [P I Medium]
COMBINERS·
l. A combiner is also know n as a semi-reducer.
2. It is one of med iator between t he mapp er p hase & the reducer phase.
3. The use of combiners is totally optiona l.
4. It accep ts the out p ut of map phase as an in p u t and pass t he key-value pa ir to the reduce operation.
S. The main function of a Combin er is to summarize the map output record s w ith the same key.
6. It is also known as grouping by key.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page24of65
lOMoARcPSD|48592555

2 I Hadoop HDFS & Map Reduce BE I SEM - 7

7. The output (key-value collection) o f the combiner will be sent over the network to the actual Reducer
t ask as inp ut.
8. The Combiner class is used in be tween the Map class and the Reduce class to reduce the volume of
data transfer betw een Map and Reduce.
9. Usua lly, the output of the map task is larg e and the data t ransferred to the reduce task is high.
10. The following figure 2.7 show s p osit ion and w orking mechanism of combiner.

Input Input Input

Split 1 Split 2 Split 4

Combiner

·-------
<kl, vl> '"ki, VJ>
<IQ,VS,. <ki, v.i>

I
c.,ouping values w,th
same key

kl, Ust!Vl, 112) k2, Ust(vS, v2, v:S. v4)

Output of Reducer ---0

Fig ure 2.7: Position and working mechanism of combiner
Working:
1. A com biner d oes not have a predefined interface and it must im plem ent the Reducer interface' s
redu ce () method.
2. A combiner operates on each m ap output key.
3. It m ust have the same output key -value types as the Red ucer class.
4. A combiner can produce summary information from a la rge dataset because it replaces the origina l
Map output.

• Handcrafted by BackkBe nche rs Commubynity

Downloaded asda sd ([email protected]) Page 25 of 65
lOMoARcPSD|48592555

3 I NoSQL BE I SEM-7

CHAP - 3: NOSOL
Ql. When it comes to big data how NoSQL scores over RDBMS

Ans: [SM - DEC19]

Benefits of NoSOL over RDBMS:

1. Schema Less: NoSQL databases bein g schema-less do not define any strict data structure.
2. Pvnamic and Aaile:
a. NoSQL databases h ave good tendency to g row dynamica lly w ith changing requirements.
b. It can h andle structured, semi-structured and unstruct ured data.
3. Scales Horizontally:
a. In contrast to SQL databases which scale vertically, NoSQL scales horizonta lly by adding more
seNers and u sing concepts of shard ing and repl ication.
b. This beh avior of NoSQL fits with t he cloud computing services such as Amazon Web Serv ices {AWS)
which allows you to hand le virt ua l servers which can b e expanded horizon tally on demand.
4. Better Performance:
a. A ll t he NoSQL d atab ases cla im to d eliver bet ter and faster performance as comp ared to t raditional
RDBMS implementations.
b. Since NoSQL is an entire set of databases (and not a sing le database), the lim itations d iffer from
d atabase to database.
c. Some o f th ese datab ases do not support ACID transactions while some of them m ig ht be lacking
in relia bi lity. But each one of them has their own strengths due to which they are wel l su ited for
sp ecific requirements
5. Continuous Availability:
a. The var ious relational data b ases may show up mod ern to h igh availabi lity for the d ata t ran sactions
while this is m uch b etter w ith t he NoSQL d ata b ases which excel lently show u p cont inuous
availability to cope u p wit h different sorts of data transactions at any p oint of time and in difficult
situations.
6. Ability to handle changes
a. The schema-less structure of the NoSQ L datab ases h elps it cope up easily with the changes comin g
w it h t ime.
b. There is a universal index provided for struct ure, va lues an d text found in t he data an d hence, it's
easy for t he organ izations to cope w ith the changes immediately using t h is in forma tion.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 26 of 65
lOMoARcPSD|48592555

3 I NoSQL BE I SEM-7

Q2. Explain different ways by which big data problems are handled by NoSQL

Q3. Explain 4 ways of NoSQL to operate Big Data Problems

Q4. Write short notes on different architectural Patterns in NoSQL

Ans: [10M - DEC19]

DIFFERENT WAYS BY WHICH BIG DATA PROBLEMS ARE HANDLED BY NOSOL:

I) Key Value Store Databases:
1. It is one of the most basic types of NoSQL databases.
2. Th is kind of NoSQL database is used as a collection, d ictionar ies, associat ive arrays, etc.
3. Data is stored in key/value pa irs.
4. It is designed in such a way to hand le lots of data and heavy load .
5. Key -value pair storage datab ases st ore data as a h ash table w here each key is uniqu e, and the value
can be a JSON, BLOB (Bina ry Large Objects), st ring, etc.
6. Th ey work best for shopping cart contents.
7. Examples:
a. Azure Tab le Storage (ATS)
b. DyanmoDB
a. Limitations:
a. It may w ork w ell for complex queries attempt ing to connect mu lt ip le relations of d ata.
b. If data conta ins lot of many-to-many relationships, a key value store is likely to show poor
performance.

Key:1 10,420 Arst Name: Jethalal

Key:2 Email: B•btt~thalaJOgmaiJ.com location: Mumb.11 Age:20

Key:3 Fae.book 10, AngelPtlya Password: Ange1Prlya98

Figure 3.1: Example of unstructured data for user records

11) Column Store Database:

1. Instead of storing d ata in relational tu ples, it is stored in cel ls grou ped in columns.
2. Column-oriented databases work on columns and are b ased on BigTa ble paper by Goog le.
3. Every column is treated separat ely.
4. Values of sing le colu mn d atabases are stored contiguously.
5. They del iver h igh performance on aggregation queries like SUM, COU NT, AVG, MIN etc. as t he data is
readily availa b le in a colu mn.
6. Column-based No5QL databases are widely used to manage data warehouses, business intelligence,
CRM, Library ca rd cata logs,
7. Examples:
a. HBase.
b. Cassandra

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page27 of65
lOMoARcPSD|48592555

3 I NoSQL BE I SEM-7

c. Hypertable

'-- ~
.....No O.pUO Hlr• Oat• [mpM.affltl
I 1 2001·01•01 .,_RM
..Ju ,__ I I II I
I
1001-01-0l S.buA-'O
I
I • I I 2002-02-0, I
2 1 2002-CU-01
l 1 1002-o.4-0I Ghanshyam
I A,Ju
I
• 2 200J-ot-01 AM.lradha
I • I I 200•-•• I
I Ch.lntttyam
I
• 2 1004-ol-01 Kac:h,.s.th

I
Column Orienltd Database

I 1 I 2 I l I • I • I
I 1 I 1 I 1 I I 2 I 2

I 200,..,,..,, 1200.......,, 12002..,...,, 1 1-,..,,1 200J.01-o1

..... ~

Figure 3.2: Example of Column store database

111) Document Database:

l. Document -Or iented NoSQL DB stores and retrieves data as a key value pa ir but the value pa rt is stored
as a document.
2. Th e document is stored in JSON or XML formats.
3. Every document contains a unique key, used to retrieve the document.
4. Key is u sed for storing, retr iev ing and managing document oriented informat ion also know n as semi
st ructu red data.
S. Examples:
a. MongoDB
b. Couch DB
6. Umitations:
a. It's challeng ing for document store to hand le a t ran sact ion t hat on multiple documents.
b. Documen t databases may not be good if data is req uired in aggregation.

~ tj~ r~I_ ____ '

( '
'
DODD
DODD lSON

DODD

Figure 3.3: Example of document database

IV) Graph Database:

l. A graph type d at abase st ores entities as well the relations amongst t hose entit ies.
2. The entity is stored as a node w ith t he relationsh ip as ed ges.
3. An edge gives a relationship between nod es.
4. Every node and edg e has a unique iden t ifier.
5. Compared to a relat iona l da tabase where tables are loosely connected, a Graph database is a mu lt i-
relat ional in nature.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 28 of 65
lOMoARcPSD|48592555

3 I NoSQL BE I SEM-7

6. Traversing rela tionship is fast as they are al ready captured into the DB, and there is no need to calcu late
them .
7. Graph base data base mostly used for socia l networks, logistics, and spatial data.
8. Examples:
a. Neo4J
b. Infinite Graph.
c. OrientDB.
d . FlockDB

Hod•s
Attrlbut•s

Edges

Figure 3.4: Example of graph database

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 29 of 65
lOMoARcPSD|48592555

3 I NoSQL BE I SEM - 7

-- EXTRA QUESTIONS --

Ql. Write short notes on NoSQL

Ans: [P I Low]
NoSOL:
l. NoSQL stands for Not Only SQL.
2. NoSQL is a non-relational DMS, t hat does not require a fixed schema, avoids joins, and is easy to scale.
3. NoSQL database is used for distributed data stores w ith humongous data storage needs.
4. NoSQL is used for Big data and real-t im e w eb apps.
5. For example, companies like Tw itter, Facebook, Google that collect terabytes of user data every sing le
d ay.
6. NoSQL generally avoids join operations.

FEATURES:

I) Non-relational:
1. NoSQL databases never follow the relational model.
2. Never provide tables w it h flat fixed-column record s.
3. Work w it h self-cont ained aggregate s or BLOBs.
4. Doesn't require object-relational mapping and data norma lization.
5. No comp lex features like query languages, q uery p lanners, referent ial integ rity j oins, ACI D.
II) Schema-free
1. NoSQL databases are either schema-free or have relaxed schemas.
2. Do not req u ire any sort of d efinition of t h e sch em a o f the data.
3. Offers heterogeneous structures o f data in t he same domain.
Ill) Simple API
l. Offers easy to use interfaces for st orage and q ueryin g data provided.
2. APls allow low-level d ata man ipulation & selection methods.
3. Text-based p rotocols mostly used w ith HTTP REST w ith JSON.
4. Mostly used no standa rd b ased query language.
5. Web -enabled d atabases running as internet-facing services
IV) Distributed
1. M u lt ip le NoSQL databases can be executed in a distri b ut ed fash ion.
2. Offers au to-sca ling and fail-over capabilities.
3. Often ACID concept can be sacr ificed for sca lability and throughput.
4. Mostly no synchronous rep lication between distrib uted nod es Asynchronous Mult i-Master Replication,
peer-to-peer, HDFS Replicat ion.
5. Only providing eventua l consistency.
6. Shared Nothing Architecture. Th is ena bles less coordina tion and h igher d istr ibution.

APYANJAGES:
l. Good Resources Scalabi lity.
2. No Static Schema.

• Handcrafted by BackkBe nchers Community

Downloaded by asda sd ([email protected]) Page 30 of 6 S
lOMoARcPSD|48592555

3 I NoSQL BE I SEM - 7

3. Fast er Data Processing.

4 . Lower Operational Cost.
5. Support Sem i Struct u re Data.
6. Relatively simple data models.

DISADVANTAGES:
l. Not a d efined standard.
2. Lim it ed q uery capabilities.

Q2. Write short notes on DynamoDB

Ans: [P I Medium]

DYNAMODB:

l. Dyn am oDB is a fu lly m an aged NoSQL database service provided by Amazon.

2. It is database service that provides fast and p red ictable perform ance wit h sea m less scalability.
3. It sup ports key-va lue an d document data str uctu res.
4 . Dyn amoDB data m odel cont ains:
a. Tables
b. Item s
c. Attr ibutes
5. DynamoDB requires on ly a p rimary key and does not req u ire to d e fine all o f t he attrib ute names and
dat a types in advance.
6. Each att ribute o f DynamoDB in an item is a n ame-va lue p air.
7. Fig ure 3.5 sh ow s Dyn amoDB Nam esp ace.

AWS
Account

Reglonl Reglon l

Ta~e 1

Ite m 1 ltem2

Attribute 1 (Hash Keyl

Attribute 2 (R•nv• Key)

Attribute l

Figure 3 .5: DynamoDB Namespace.

a. Example:

ID= 1

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 31 of 6S
lOMoARcPSD|48592555

3 I NoSQL BE I SEM - 7

SubjectName = "Big Data &Analytics"

Authors = ("Sagar Narkar")
Price = 100
PageCount = 75
Publication = BackkBenchers Publ ication House
},

ID= 2
Subj ectName = "Artificia l Intell igence & Soft Computing"
Authors = ("Sagar Narkar'')
Price = 100
PageCount = 93
Publication = BackkBenchers Publication House

9. Pata type;
a. Scalar types: Num ber, Str ing, Binary, Boolean and NULL
b. Document types: List and Map
c. Set types: String Set, Number Set and Binary Set
10. Data Access:
a. DynamoDB is a w eb service uses HTTP and HTTPS as a t ransport layer services.
b. JSON can be u sed as a message ser ialization format .
c. It is possible to use AWS software development kits (SD Ks)
d. Applicat ion co de makes requests to t h e DynamoDB web service API.

Q3. Expla in Shared-Nothing Architecture in detail

Ans: IP I Medium]

SHAQED-NOJHINC AQCHIJECJUQE:

1. Th ere are three ways t hat resources can be shared between computer systems: shared RAM, shared
disk, and sha red-noth ing
2. Of the t hree a lt ernat ives, a shared-not hing architecture is most cost effective in terms of cost per
processor when you're using commodity hardware.
3. A shared-nothing architecture (SN) is a distr ibuted -computing architecture.
4. Each processor has its own lo cal memory and local disk.
5. A processor at one node may communicate w ith another processor using h igh speed communication
network.
6. Any term inal can act as a node which functions as Server for data that is stored on local d isk.
7. Interconnection networks for sha red nothing systems are usual ly designed to be scalable.
8. Figure 3.6 shows shared nothing architecture.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 32 of 6S
lOMoARcPSD|48592555

3 I No SQL BE I SEM - 7

Interconnection Network

Processor

M Local Memory

D Locail Memory Disk

Fig ure 3 .6: Shared Noting Architecture

ADVANTAGES:
l. It is more scalable.
2. Easily su pport large numbers o f processors.
3. It can achieve high d egree of pa rallelism.

DISADVANTAGES:
l. It requires rigid data p art it ioning.
2. Cost of communication a nd of non-local d isk access is h ig her.

APPLICATIONS:
l. Teradata database machine uses share nothing architect ure.
2. Shared-nothing is popu lar for w eb development.

• Handcrafted by BackkBe nchers Community

Downloaded by asda sd ([email protected]) Page 33 of 6S
lOMoARcPSD|48592555

4 I Mining Data Streams BE I SEM -7

CHAP· 4 : MINING DATA STREAMS

Ql. Explain with block diagram architecture of Data Stream Management System

Ans: [lOM - DEC19]

DATA STREAM MANAGEMENT SYSTEM:

l. For hand ling and controlling the data streams, we require a concrete, standardized model or
framework so that data stream w ill b e processed in a proper ly defined manner.
2. Figure 4.1 shows the b lock diagram of data stream management system.

Ad-hoc Queries

Streams entering

1, 3, 4, 0, 9, 7, S Standing Output Streams

q, w 1 e, r, t, y, u, I Queries
o, 1, 1, o, ,, o, o, 0
- Stream Processor

~nm..

Limited Working
Archival Storage
Storage

Figure 4.1: Block diagram of data stream management system

3. The data-stream management system architect ure is very much sim ilar to that of conventional
relational data b ase management system archit ecture.
4. Th e basic d ifference is t h at processor block or more specifically a query processor (query engine) is
rep laced w ith the specialized block known as stream processor.
5. Th e first b lock in the system architecture shows t h e inp ut part.
6. Number of data stream generated from d ifferent data sources will enter into the system.
7. Every Data stream has its own characteristics such as:
a. Every data stream can sched ule and rearrange its own data items.
b. Every data stream involved heterogeneity i.e. in each data stream we can found d ifferent kinds of
d ata such as numerical data, alphabets, a lphanumeric data, graphics data, text ual data, binary data
or any converted transformed or t ranslated data.
c. Every data stream has different input d ata rate.
d . No uniformity is ma intained by the elements o f d ifferent data st reams while entering into the
stream processor.
8. In the second block of system architecture, it is abstracted that, there are two d ifferent sub systems
exists one of which w ill take care of storing t he data st ream and other is responsible for fetching the
data stream from secondary storage and processing it by loading in to main memory.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page34of65
lOMoARcPSD|48592555

4 I Mining D at a St reams BE I SEM -7

9. Hence the rate at which stream enters into the system is not the burden of sub-system which is
involved in the stream processing.
10. It w ill be con trolled by other sub-system wh ich involved in storage of data stream.
11. The third block r epresents the active storage or wor king storage for processing t he different data
streams.
12. The wor king storage area may also contain sub-st reams which ar e integral p art of main -core stream
to generate result for a given query.
13. Working storage basically a main memory but situation demands then data items within the st ream
can be fetched from the secondary storage.
14. The major issue associated w ith working storage is its limited size.
15. The fourth block of system architecture is known as archival storage.
16. As name indicates, this b lock is responsible for maintain ing the details of very transaction within the
syst em architecture.
17. It is also responsib le to maintain the ed it logs.
18. Ed it logs are not hing but updation of data {i.e. va lues form users or system) .
19. In some sp ecial purpose case, archival storage w ill be used to process or to retrieve previously used
data items which are not present on the seconda ry storage.
20. Th e fifth block is responsible for d isp laying or delivery the output stream g en erated as a resu lt of
processing done by the stream processor usually by taking the support of workin g storage and
occasionally by taking support of arch iva l storage.

Q2. What do you mean by Counting Distinct Elements in a stream. Illustrate with an example

working of a Flajolet - Martin Algorithm used to count number of distinct e lements

Ans: [l0M - DEC19]

COUNT DISTINCT PROBLEM:

1. In comp uter science, the count-distinct problem is the problem o f finding the number of distinct
elements in a data st ream wit h rep eated elements.
2. Consider an example of counting numb er of newly added users who will view a g iven web page 1" t ime
in t he last month.
3. Clearly, the standard approach to solve this problem is, make a list o f all elemen ts of user who has seen
before in a given data st ream.
4. Convert the data elements in the list into some efficient data structu re, more specifical ly a search
struct ure such as tree.
5. The major benefits of this kind of struct u re is we can have bucket like structure where element from
some category will be found in some bucket.
6. E.g. list of user who has seen a given web page on some d ata.

Date & Dilly Userl llffr2 !····- I Umn

Key Values

• Handcrafted by BackkBe nche rs Community

Downloaded by asda sd ([email protected]) Page 3S of 65
lOMoARcPSD|48592555

4 I Mining Data Streams BE I SEM -7

7. The addition of new user to an existing list according to decided key is very easy with th is structure.
8. So, it seems to be very easy to count the distinct element w ith in the given stream.
9. The major problem arises, when the number of distinct element are too much in number and to make
it worst.
10. lfwe have number o f such data stream who will enter into the system at the same t ime.
11. E.g. If organization like Yahoo or Goog le requires a count of user who w ill see their different web pages
1" t ime for a given mont h.
12. Here, for every page we need to maintain above said problem which seems to be a compl icated
problem.
13. To have alternate solution to this p roblem, we can have "scale out o f machines".
14. In this approach, we can add new commodity hardware (an ordinary server) to existing hardware so
that load of processing and counting the distinct user on every web p age of an organization will be
distr ibuted.
15. Another add it ional thing is we can have t he use of secondary memory can batch systems.

FLAJOLET-MARTIN ALGORITHM:
1. The Problem of counting t he distinct element can be solved w ith t h e help of ordinary hashing
technique.
2. A hash funct ion will be applied to a given set which generate a b it-string as a resu lt of hashing.
3. A constraint should be appl ied to above process is, there should enough hashing resu lts t h an elements
present inside the set.
4. The general procedure while applying hash function is to p ick different hash function for and apply
these every element in g iven data stream.
5. The significant property of flash funct ion is that, whenever applied to the same data elem ent in a g iven
data stream it will generate the same hash va lue.
6. So, Flajolet-Martin algorithm has extended this hashing idea and properties to count distinct
elements.
7. The algorithm was int roduced by Philippe Flajolet and C. Nigel Martin in their 1984
8. Flajolet-Martin algorithm approximates the numb er of un ique objects in a stream or a d atabase in one
pass.
9. If the stream contains n elements w ith m of them unique, this algorithm r uns in O{n) t ime and
needs O(log(m)) memory.
10. So the real innovation here is the memory usage, in that an exact, brute-force algorithm would
need O(m) memory (e.g . think "hash map").
11. As noted, this is an approxima te algorithm. It g ives an approximat ion for the number of uniq ue objects,
along w ith a standard deviation a, which can then be used to deter mine bounds on the approximation
w ith a desired maximum error E, if needed.

ALGOQIIHM
1. Create a bit vector of sufficient length L, such that 2L > n, the number of elements in the stream. Usually
a 64-b it vector is sufficient since 264 is quite large for most purposes.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 36 of 65
lOMoARcPSD|48592555

4 I Mining Data Streams BE I SEM -7

2. The i'" bit in this vector/array represents whether we have seen a hash function va lue whose binary
representation ends in Oi. So initialize each bit to 0.
3. Genera te a good, random hash function that maps inpu t (usua lly strings) to natura l numbers.

4. Read input. For each w ord, hash it an d determine the number of t railing zeros. If t h e number of t railing
zeros is k, set the k-th bit in the bi t vector to 1.

5. Once input is exhausted, get the index of the first O in the bit array (call this R). By the way, this is just
the number of consecutive ls {i.e. w e have seen o, 00, ..., o• 1 as the output of the hash function) p lus
one.

6. Calcu late t he number of unique words as 2R/cp, where <I> is 0.77351.

7. The standard deviation of R is a constant: a(R) = 1.12. (In o ther words, R can b e off by about 1 for 1-
0.68=32% o f t h e observat ions, off by 2 for about 1-0.95=5% of the observat ions, off by 3 for 1-0.997=0.3%
of the observations using the Empirica l rule of statistics).
8. This im pl ies that ou r count can b e off by a factor of 2 for 32% of the observations, off by a factory of 4 for
5% of the observations, off by a factor of 8 for 0.3% of the observations and so on.

EXAMPLE:
S=l, 3, 2, 1, 2, 3, 4 , 3, 1, 2 , 3, l
Assume lbl = 5
X h(x) Rem Binary r{a)
l 7 2 00010 l

3 19 4 00100 2
2 13 3 00011 0
1 7 2 00010 1
2 13 3 00011 0
3 19 4 00100 2
4 25 0 00000 5
3 19 4 00100 2
1 7 2 00010 1
2 13 3 00011 0
3 19 4 00100 2
l 7 2 00010 l

R = max (r(a)) = 5

So no. of distin ct elements =N

= 2•
=2•
=32

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 37 of 65
lOMoARcPSD|48592555

4 I Mining Data Streams BE I SEM -7

-- EXTRA QUESTIONS --

Ql. Explain Datar-Gionis-lndyk-Motwani (DGIM) Algorithm in details

Ans: (P I High]
DGIM:
1. DGIM stands for Datar-Gionis-lndyk-Motwani Algorithm.
2. It is designed to find t he number l's in a data set.
3. Th is algorithm uses O(log 2 N) b its to represent a w indow of N bit, allows to est imate the number of l's
in t h e w indow wit h and error o f no more than 50%.
4. In DGIM algorithm, each bit that arr ives has a t imestamp, for the posit ion at which it ar rives.
5. If the first b it has a t imestamp 1, t h e second b it has a t imestamp 2 and so on.
6. The positions are recognized w ith t he window size N (the w indow sizes are usually taken as a mult iple
of 2).
7. Th e w indows are divided in to buckets consisting o f l's and O's.

RULES FOR FORMING THE BUCKETS:

1. Th e rig h t sid e of t he bucket shou ld alw ays start w ith 1. (if it starts w ith a 0, it is to be neglected) E.g. •
1001011 ➔ a bucket of size 4, having four l's and starting w ith 1on it's right end.
2. Every bucket should have at least one 1, else no b u cket can be formed.
3. Al I buckets shou Id be in pow ers o f 2.
4. Th e buckets cannot decrease in size as we move to the left. (move in increasing order towards left)

EXAMPLE:
Consider the fol low ing stream of bits: ...111010111001100010110110110 00

11010111 00 110001

Size= 8 Size= 4 Size= 4 Size= 2 Size = 1 Size = l

STORAGE SPACE REQUIREMENTS:

The sto rage space required for the DGI M Algorithm can be determined as,
1. A single bit represented as O(Log NJ
2. Number o f baskets O (Lo g N)
3. Aggregate space requirement = O(Log2N)

QUERY - ANSWERING in DGIM:

1. Given k~N, we want to know h ow many of the last k bits w ere 1-s.
2. To find number o f l's in g iven stream say last 10 bits.
111010 11 100 11 000 101 100 1 10 1 101
3. For last 10 bits, K = 10
4. Therefore, p art of stream u nder considerat ion w ill be,

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 38 of 65
lOMoARcPSD|48592555

4 I Mining Data Streams BE I SEM -7

t -8 t-2 t-1 Tlmestamp = t

t-3

5. Th is is the b aske t which is partly in represen tation because b it left to the bit t - 8 w ill t - 9 and so on.
6. So till t- 8 w ill b e included.
7. Th erefore quoted answ er for number o f l's in last 10 bits K =10 w ill be 6 b ut actu al answ er is 5.

Q2. Explain bloom filter? Explain bloom filtering process with neat diagram

Ans: [P I Medium]

BLOOM FILTER:

l. Bloom filter is a tech niqu e to rectify d ata stream for sa id cr iteria.

2. It is a d ata structure.
3. Th e efficiency of bloom filter is that it is p robabil istic d a ta struct ure w h ich tells us that the elemen t is
either present in the set or not .
4. Th e b ase data str uct ure o f a Bloom filter is a Bit Vector.
5. Bloom filter consists of:
a. An Array.
b. Hash Fu n ction .
c. A Set.
6. Figure 4.2 represents Bloom Filter.

Input Oat• Sun m

Hash F"uneUon Set

An Atray
H

(====~=:::::>-- a,e not In Ht ·s· kit)"I

Ef.tfM'ntlwt'IOM

Ele~ntt whose keys

ar• In Mt'$'

Figure 4.2: Bloom Filter

BLOOM FILTERING PROCESS:
l. Bloom filter in g is an array of 'n' bit size.
2. Th is array is initia lize to '0' for all n b it s.
3. Pic k every key and value present inside th e d ata stream's'.
4. Apply every hash fu nction h ; where (i is between 1.... K)
5. Set val u e 1 to every b it of data stream for h ;(k) in t he sets 'S'.
6. Check the values for every h;applied on bit in 'S'.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 39 of 65
lOMoARcPSD|48592555

4 I Mining Data Streams BE I SEM -7

7. If the value of K is 1 'rt (For All), h , - i var ies from 1 to k

8. If for all h ,(k) values is l , then set its bit value to land allow t hat element o f stream to be involved in a
sample.
9. If anyone or more h,(k) valu e is not evaluates to l t hen it is conducted that 'K' doesn't sat isfies the criteria.
Hence it will not b e in sample.

STRENGTHS:
l. Space-efficient: Bloom filters take up 0(1) space, regard less of the number of it ems inserted .
2. Fast: Insert and lookup operations are both 0(1) time.

WEAKNESSES:
l. Probabilistic: Bloom filters can only definitively identify tr ue negat ives. They cannot ident ify t rue
posit ives. If a bloom fi lter says an item is present, that item might act ua lly b e p resent (a true positive)
or it m ig ht not (a false posit ive).
2. Limited Interface: Bloom filters on ly support the insert and lookup operations. You can't iterate
t hrough t h e item s in t h e set or d elete items.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page40of65
lOMoARcPSD|48592555

5 I Finding Similar Items and Clustering BE I SEM -7

CHAP - 5: FINDING SIMILAR ITEMS AND CLUSTERING

Ql. Explain Edit distance measure with an example
Ans: (SM - DEC19)

EDIT DISTANCE MEASURE:

l. Edit distance is a way of quantifying how dissimi lar two strings (e.g., words) are to one another by
cou nting t he minimum number o f operat ions req uir ed to t ransform one st ring in to the o ther.
2. This distance makes sense when points are strings.
3. As in edit d istance we have str ing represent ation so, it w ill satisfy negativity, posit ivity and symmetry
and triangle in equating const raints.
4 . The d istance between two strings x = xl , x2 • • • xn and y = yl, y2 • • • ym is the smallest number of
insertions and deletions of single characters that w ill convert x toy.
5. There are two variants of edit d istance i.e. classical method and longest common sequence method.
I) Classical Method:
l. Consider following representation of points x and yin a g iven space havin g string representation
as,
X = ABCDE
Y = ACFDEG
2. To calculate the distance between x and ywe should convert x st ring toy string.
3. Now compa ring the character sequence in both the str ings.
X = A B C D E

Y = A C F D E G
2 3 4 s 6 Posit ion

4. Clearly, posit ion 2, 3, 6 d iffer in their contents from x and y. So make necessa ry insertion and
d eletions at in accordance w ith these 3 p osit ions.
B C
C F G
2 3 6
5. Delete the cha racter 'B' in string x from p osit ion no. 2
6. Shift the oth er characters to left hand side.
7. Now current status of str ing x is,
X = A C D E
2 3 4
8. Now insert the character 'F' at position 3 i.e. after character 'C' and before character 'D'
9. Therefore, now the status of string x w ill be,
X = A C F D E
2 3 4 5
10. Lastly, append the cha racters Gin the X string get 6 th position.
ll. Status of string X w ill be,

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 41 of 65
lOMoARcPSD|48592555

5 I Finding Similar Items and Clustering BE I SEM - 7

X = A C F D E G
l 2 3 4 5 6
12. Hence in above example, we have
No. of deletions = l
No. of insertion = 2
Edit distance between X, Y
d (x, y) = No. of deletion+ No. of insertions
=1 + 2 = 3
II) Longest Common Sequence Method:
1. The longest common seq uence can b e d eveloped by performing d etect ion operat ions on t h e
character positions in respect ive str ing.
2. Suppose we have two point x and y represented as strings.
3. Therefore, edit d istance between p oin ts x and y will b e,
d (x, y) = (Length of string x + Length of string y) - 2 x Length of longest common sequence)
4. Suppose,
X = A B C D E
y = A C F D E G
5. Here,
Longest common seq uence {LCS) is nothing, but the (a, c, d, e) = 4
Length of string x = 5
Length of st ringy= 6
Therefore, d(x, y) = 5+6 - 2x4
=5 + 6 - 8
=11-8
=3

Q2. What are the challenges in clustering of Data streams. Explain stream clustering algorithm in
detail
Ans: [10M - DEC19]

CHALLENGES IN CLUSTERING OF DATA STREAMS:

I) Cluster Validity:
1. Recent developments in data stream clust ering have heightened the need for determin ing suita ble
criteria to validate result s.
2. Most outcomes of methods are depended to specific appl ication.
3. However, employing suitable criteria in results evaluation is one o f the most important challenges in
this arena.
111 Space liroitatjon:
l. Not on ly time and concept drift are the ma in complexity in data stream clustering b u t also space
complexity in some applications (e.g. w ireless sensor network monitoring and controlling) can be
caused difficu lties in processing.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page42of65
lOMoARcPSD|48592555

5 I Find ing Similar Items and Clust e ring BE I SEM - 7

2. Sensors w ith small memory are not able to keep a big amount of data so new method for data stream
clusterin g should be managed t his limitation.
1111 High dimensional data stream;
l. There are hig h dimensional data sets (e.g . image processing, personal sim ilarity, customer preferences
clustering, network intrusion d e tection, wireless sensors network and generally time series da ta) which
should be managed through t he processing of d ata stream.
2. In hug e data bases, data complexity can be increased by number of dimensions.

BDMO ALGORITHM / STREAM CLUSTERING ALGORITHM:

l. Stream clustering algorit hm is also referred as BDMO Algorithm.
2. BDMO Algorithm is d esigned by Bahcock, Datar, Motwan i and Ocallaghan.
3. It is based on K-Means.
4. Th e BDMO algorithm follow s the concept of 'counting ones' method, w hich means that there is a
w indow of length N on a bi nary stream and it cou nts the numb er o f l s that comes in t he last k bits
where k ~ N.
5. Th e BDMO algorithm uses the b ucket w ith al lowa ble b ucket sizes t hat forms a sequence w here each
size is twice of the prev ious size.
6. In the algorithm, th e number o f points represents the size of the b ucket.
7. It does not consider that the sequence of allowable b ucket sizes starts w ith l b ut consider only formin g
a sequence such as 2, 4, 6, 8 ... Where each size is twice the previous size.
8. For ma inta in ing the buckets, t he algorithm considers the size of the bucket with the power of two.
9. In addit ion, t h e numb er of b uckets of each size is eit her one or two t hat form a sequence of non-
d ecreasing size.
10. Th e b uckets that are used in the algor ithm, contains the size and t imestamp of the most recent p oints
of t he stream.
11. Along w ith t h is, t he b ucket also contains a col lection o f records t hat represents the clusters into which
the points of that bucket h ave b een p art it ioned.
12. This record conta ins the number of p oints in the cluster, the centroid, or clustroid of t he clust er, and
other parameters that are req uired to merge and maintain the clusters.
13. The major step s of t he BDMO algor ithm are as follows:
a. Initialising b uckets
b. Merging b uckets.
c. Answering queries.

INITIALISING BUCKETS:
l. In it ialisation of the b ucket is t he first step of the BDMO algorithm.
2. The algor ithm uses the smallest bucket size that is p w ith a power of two.
3. It creates a new b ucket w ith the most recent p p oin ts for p stream elements.
4. The t imestamp of the most recen t point in the bucket is the t imestamp of the new bucket.
5. After t his, we may choose to leave every point in a cluster by itself or perform c lustering using an
appropriate clustering method.
6. For example, if k-means clustering method is used, it clusters the k points into k clusters.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page43of65
lOMoARcPSD|48592555

5 I Finding Similar Items and Clustering BE I SEM - 7

7. For the In itialisat ion o f t h e bucket using selected clustering m ethods, it calcu lates t he cen troid or
clustroid for t he clusters and counts t he points in each cluster.
8. Al l th is in formation is stored and becomes a record for each cluster.
9. The algor ithm also calculates the o t her required parameters for t he merg ing process.

M ERGI N G BUCKETS:
l. After t he In it ialisat ion o f the b ucket, the algorith m n eed s to review t he sequence of a b ucket.
2. If t here happens to be a bucket w ith a ti mestamp more than N t ime units prior to the current t ime
then nothing of th at bucket is in t he w indow.
3. In such a case, t he algorithm d rops it from the list
4. In case we h ad created t hree b ucke ts o f size p , then w e m u st m erge the old est two o f the t hree b uckets.
5. In this case, the merger can create tw o buckets of size 2p, t h is may req u ire us to merge buckets o f
increasi ng sizes rec ursively.
6. For merging two consecutive buckets, the algorithm needs to p erform t he followin g step s:
a. For m erging, t h e size o f the bucket should be twice the sizes of the two b uckets to be m erged.
b. The t imestamp of the merged bucket is the timestamp of th e more recent of the two consecutive
b uckets.
c. In add it ion, it is necessary to calculate t he parameters of the merged clusters.

ANSWERING QUERIES:
l. A query in the st ream-comp uting model is a length o f a suffix of the slidin g w indow .
2. Any algori thm takes all the clust ers in all the buckets that are at least par tially w ithin the suffix and
t hen merges them using some met hod.
3. The answer of the query is t he resu lt ing clusters.
4. For the clustering of the st rea ms, t he stream-computing mod el finds out t h e answer to the query
'W h at are the clusters of t he last or more recent m points in the st rea m for in ms N'.
5. During the In itial isation, the k-means method is used and for merg ing the buckets t imestamp is u sed .
6. Hence the algor ithm is unab le t o find a set of buckets t hat covers the last in p oints.
7. However, we can choose the smal lest set o f buckets that covers the last m poin ts and include in these
buckets no more t h an th e last 2m point s.
8. After this, t he algorithm g enera tes the answer in response to the query as 'the centroids or clustroids
of all the poin ts in the selected b uckets'.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page44of65
lOMoARcPSD|48592555

S I Finding Similar Items and Clustering BE I SEM - 7

-- EXTRA QUESTIONS --

Ql. Write short notes on distance measures in big data

Ans: (P I High]
DISTANCE MEASURES:
1. A set of po in t is known as a space.
2. By u si ng the space as a p latform w e can ca lculate the d istance measure by app lying t he function 'd '
on any given two points x and y in a p lane.
3. Th e dist a nee fu net ion d (x, y) generates a real number.
4. Numb er of const raints can be imposed on the d ist ance measures as follows:
a. Negative of Distances: The d istance betw een any two po in ts say x and y cannot be negative
id(x, YI ~o
b. Posjtjyjty of Distances; The d istance between any two points say x and y is sa id to b e zero if and
only if x and y has same co-ordinates i.e. x =y
d(x, YI =0 iff X = Y
c. Symmetry of Distance: The distan ce between any two p oints say x and y are in one d irection. i.e.
d istance from x and y is same as t hat of distance from y to x.
d (x, y) = d(y, x)
d . Triangular inequality of distances: When we are dealing w ith the termin ologies like d istance. We
st rive to have the m in imum t ime, distance to From 1 point to other. To achieve the m in imum
d istance betw een two points if w e in troduce some other p oint in between these two p oint. Then it
doesn't p rove to be an efficient solution.
d(x, YI s d(x, z) + d(z, y)
TYPES OF DISTANCE MEASURES:
I) Euclidean Distances:
1. Eucl idean distan ce is the fu ndamenta l technique to measure the distance.
2. The sp ace in Euclidean d istance is known as Euclidean sp ace.
3. Eucl idea n dist ance is the "ordinary" straight-line d istance b etween two p oints in Eu cl idean space.
4. In this Eucl idean sp ace every point is noth ing b u t t he vector which contains 'n' rea l numbers.
5. To ca lculate, the distance between any g iven two points w e can h ave fol low ing form u la known as L2
Norm .

d...,(x, y) = I ) x; - y;)2
i- 1

111 Jaccard Distance:

1. Jaccard d istance is a measure of h ow d issim ilar two sets are.
2. It can be calculat es w it h t he help of following form u la:
J• = d(x, YI= 1 - SIM(x, YI
3. That is, the Jacca rd d istance is 1 minus the ratio o f the sizes o f the intersection and union of sets x & y

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page4Sof6S
lOMoARcPSD|48592555

5 I Finding Sim ilar Items a nd Clust e ring BE I SEM - 7

Ill) Cosine Distance;

l. Cosine d istance method is usefu l in space.
2. Here the space is nothing but the Euclidean space which contain set of points which are represented
as vector along w ith an integer or Boolea n values.
3. The cosine distance between any two given vector is nothing but the angle made by these vectors.
4. These angles range from 0° t o 180° irrespective of no. of d imensions.
5. Hence to ca lculate cosine d istance between any given no, of vectors:
a. Calcu late the ang le made by them.
b. Apply arc cosine funct ion.
c. Cosine angle is nothing b ut the dot product o f two vectors which should be d ivided by L, Norms of
x and y. i.e. Euclidean d istance.

IV) Edit Distance:

1. Ed it distance is a way of quantifying how dissim ilar two strings (e.g., words) are to one another by
counting t he minim u m number o f operations required to transform one st ri ng into the other.
2. This distance m akes sense when poin ts are strings.
3. The d istance between two strings x = xl , x2 • • • xn and y = yl , y2 • • • ym is the smallest number of
insertions and d eletions of single charact ers that w ill convert x to y.

V) Hamming Distance:
1. The sp ace is not h ing but col lection of points,
2. Th ese points can be represented as vector.
3. Every vector is composed o f d ifferent components such as magnitude, directions etc.
4. When vector differs in t heir com p onents then that d ifference betw een one or more vectors is known
as hamming distance.
5. As this d istance ca lcu lation depends on the d ifference operat ion so it w ill satisfy all constraints such as
ne gativity of distances, positivity o f distance, symmetry and tria ng le in equa lity.

• Ha ndcrafted by BackkBe nche rs Commubynity

Downloaded asda sd ([email protected]) Page46 of 65
lOMoARcPSD|48592555

5 I Finding Similar Items and Clustering BE I SEM - 7

Q2. Write short notes on CURE Algorithm

Ans: [P I High]

CURE ALGORITHM:

1. Cure stands for Clustering Using Representatives Algorithm.

2. It is an efficient data clustering algorithm for large databases.
3. CURE A lgorithm works better in spherical as well as non-spherical clusters.
4. CURE uses random sampling and partitioning to sp eed up clustering.
Data

Make Random Sample

I
Make Partitioning of Sample

I
Partially Cluster Partitions

I
Eliminate Outliers

I
Cluster Partial Clusters

Label Data In Disk

Fig ure 5.1: CURE overview

5. Th e CURE algorithm is d ivided into phases:

a. Initialization in CURE
b. Completion o f the CURE A lgorithm

Figure S.2: Two c lusters, one surrounding the other

6. Fig ure 5.2 is an ill ustration of two clusters.
7. Th e inner cluster is an ordin ary circle, while the second is a ring around the circle.
8. This arrangemen t is not completely p athological.

JNJTIAUZATJON JN CURE:
1. Take a smal l sample o f the d ata and cluster it in ma in memory.
2. In principle, any c lusterin g method cou ld b e used, but as CURE is designed to h andle oddly shaped
clusters, it is often advisable to use a hierarchica l method in w hich clusters are merged w h en they have
a close pair of poin ts.
3. Select a small set of points from each cluster to be representative poin ts as shown in figure 5.3.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page47of65
lOMoARcPSD|48592555

5 I Find ing Similar Items and Clust e ring BE I SEM - 7

0
Figure 5.3: Select representative points from each cl uster, as far from one another as possible
4. These po ints shou ld be chosen to be as fa r from one another as possib le, using the K- means method.
5. Move each of t he represen t at ive points a fixed fraction of the d ista nce b etw een its location a nd t h e
centroid of its cluster.
6. Per haps 20% is a good fraction to choose. Note that t h is step requ ires a Eucl idea n space, si nce
otherwise, there m ight not be a ny notion of a lin e between two points .

•
•

0 • •
• •
• •
•
Figure 5.4: Moving the representative points 20% of the dista nce to the cluster's centroid

COMPLETION OF THE CURE ALGORITHM:

l. Th e n ext phase of CURE is to merge two clust ers if t h ey have a pa ir of representat ive p oints, one from
each cluster, that are sufficien t ly close.
2. The user may p ick the d ist ance that d efines "close."
3. This merging step can repeat, until t here are n o more sufficiently close clusters.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page48of65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

CHAP· 6: REAL TIME BIG DATA MODELS

Ql. Give Applications of Social Network Mining

Ans: (SM - DEC19)

SOCIAL NETWORK M INING AS GRAPHS:

l. Socia l networks are naturally modeled as graphs, which we sometimes refer to as a socia l graph.
2. The entit ies are the nodes, and an edge connects two nodes if the nodes are related by the relationship
that cha racterizes the network.
3. If there is a degree associated w ith the relationship, t his degree is represented by labeling the edges.
Often, social g raphs are und irected , as for the Face book friend's graph.
4. But they can be directed g raphs, as for example the graphs of followers on Tw itter or Google+

APPLICATIONS:
l. Telephone Networks: Here the nodes represent phone numbers, which are rea lly individuals. There is
an ed ge between two nodes if a call has been p laced between those phones in some fixed period of
time, such as last month, or "ever." The edges cou ld b e weighted by the number of ca lls mad e between
these phones during the period.
2. Email Networks: The nodes represen t email addresses, which are again in dividua ls. An edge
represents the fact that there was at least one ema il in at least one d irection b etween the two
addresses.
3. Collaboration Networks: Nodes represent individua ls who have published resea rch papers. There is
an edge b etween two individuals who p ublished one or more papers jointly
4. Other examples include: information networks (documents, w eb g raphs, p atents), infrastruct ure
networks (roads, planes, water p ipes, powerg rids), b iologica l networks (genes, proteins, food-webs of
anima ls eating each other), as well as other types, like product co-purchasing networks (e.g., Grou p on).

Q2. Explain the following terms with diagram

1) Hubs and Authorities

2) Structure of the Web

Ans: [10M - DEC19)

HUBS AND AUTHORITIES:

l. Hyperlink-lnduced Topic Sea rch (HITS; also known as hubs and authorit ies) is a link ana lysis algorithm.
2. HITS rates Web pages, developed by Jon Kleinberg.
3. Hubs and authorities are fans and centers in a bipartite core of a web g ra ph.
4. A good hub page is one that points to many good authority pages.
5. A good author ity page is one that is pointed to by many good hub pages.
6. Figure 6.1 represents hubs and authorities.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page49 of 65
lOMoARcPSD|48592555

6 I Rea l Time Big Data Models BE I SEM - 7

Hut>s

Figure 6 .1: Hubs and authorities

7. In the HITS a lgorith m, the first step is to ret rieve the m o st relevant pages to the search que ry.
8. Th is set is called t h e root set and can be obtained by taking the top p ages returned by a t ext-based
search a lgorithm.
9. A base set is generated by a ugment ing the root set w ith a ll the w eb pages that are linked from it and
some of the pages that li nk to it.
10. Th e w eb pages in the base set an d all hyper links among those pages form a focu sed subgraph.
11. The HITS computat ion is performed only on t h is focused subgraph.
12. According to Kle inberg the reason fo r const ructing a base set is to ensure that most (or many) of the
st rongest authorit ies a re included.
13. Aut hor ity and hub values are d efined in terms o f o ne another in a mutual recursion.
14. An auth ority valu e is computed as t h e sum of the scaled hub va lues that po int to that page.
15. A hub va lue is the sum of the scaled aut hor ity va lues of t he pag es it points to.
16. Som e impl ementations also conside r t h e relevan ce of the linked pages.
17. Th e algor ithm performs a series of iterat ions, each consist ing o f two b asic steps:
a. Authority update: Upd ate each nod e's aut hority score to be equa l to t he sum o f the hub scores o f
each n o d e t h at points to it. Th at is, a node is given a h igh aut hority score by being linked from pages
that are recognized as Hubs fo r information.
b. Hub update: Up date each node's hub sco re to be equal to t h e sum of the authority scores of each
node that it points to. Th at is, a nod e is g iven a h igh hub score by linking to nodes that a re
co nsidered to be auth orities on the subject.

STRUCTURE OF WEB:

l. If we consider web pages as vertices and hyperl inks as edges.

2. Then, the web can be rep resented as a d irected g ra p h.
3. The struct ure be low in figure 6.2 shows a la rge strongly connected component (SCC), but there were
several other port ions that were almost as large.
4. The two regions of a pproximat ely equa l size on the two sides of CORE are named as:
a. IN: N odes that can reach the g iant SCC but cannot be reached from it e.g . new web p ages, and
b. QUI; N odes that can be reached from the g iant sec but cannot reach it e.g. corp o rate websites.
5. Th is struct ure o f web is known as the Bowtie structure.
6. There are pages that belong to none o f IN, OUT, o r the g ian t SCC i.e. they can neither reach the g iant
sec nor be reached from it.
7. These are classified as:

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 50 of 6S
lOMoARcPSD|48592555

6 I Rea l Time Big Data Models BE I SEM - 7

a. Tendrils: The nodes reachable from IN that can't reach the giant sec,and the nodes that can reach
OUT but can't be reached from t he giant sec. If a tendril node sat isfies bot h condit ions then it's
pa rt o f a t ube that travels from IN to OUT w ithout touching the g iant sec, and
b. Disconnected: Nod es t hat belong to none of the p revious categories.
8. Taken as a whole. the bow-tie str ucture of the Web p rovides a h ig h -level view of the Web's structure,
based on its reachabil ity properties and h ow its strongly connected components fit together.

~Tendtil$ ~

1
,_...,.....,,.. 44 Mi11ion
nodes
------ !1
,' '
I
I

-----
IN
44 Million nodes
sec
56 MllliQn nodes

,•

' .I
T...,_s
0
00
C) ~ OiSoonnec18d oomponents

Figure 6.2: Bow-tie structure of the web

Q3. What is the use of Recommender System? How is classification algorithm used in

recommendation system?

Ans: [10M - DEC19)

RECOMMENDATION SYSTEMS:

There is an extensive class of Web applications that involve p red icting user responses to options. Such a
facility is called a recommendation system.

USE OF RECOMMENDER SYSTEM:

1. Recommender systems are primarily used in commercial applications.
2. Recommender systems are u t il ized in a variety of areas and are most commonly reco gnized as playlist
gen erators for video and music services like Netflix, YouTube and Spotify, p ro d uct recommenders for
services such as Amazon, or conten t recommen d ers for social media p latforms such
as Facebook and Twitter.
3. Product Recommendations; Perhaps t he most important use o f recommendation systems is at on-
line retailers.
4. Moyje Recommendations: Netflix offers its c ustomers recommendations of movies they m ight like.
These recommen dat ions are based on rat ings provided by users
S. News Articles; News services have attempted to identify articles of interest to readers, based on the
articles that they have read in the past.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 51 of 65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

CLASSIFICATION ALGORITHM;
l. Classificat ion algorithm is completely d ifferent ap proach to a recommendat ion system using item
profiles and utility matrices is to t reat the problem as one of machine learning.
2. Regard t he given data as a train ing set, and for each user, build a classifier that pred icts t he rating o f
all items.
3. A d ecision t ree is a collection o f nodes, arranged as a b inary tree.
4. The leaves render decisions; in our case, the d ecision would be "likes" or "doesn't like."
5. Each interior node is a condit ion on the objects b eing classified; in our case the condition wou ld be a
predicate involving one or mor e features of an item .
6. To classify an item, we start at the ro ot, and apply the predicate at the rootto the item.
7. If the predicate is true, go to the left ch ild, and if it is false, go to the right child.
8. Then repeat the same process at the node visited, until a leaf is reached.
9. Th at leaf classifies the item as liked or not.
10. Construction of a d ecision tree requ ires selection of a predicate for each inter ior node.
11. There are many ways o f picking the best p redicate, but they al l try to arrange that one o f the children
gets al l or most of t h e positive examples in the training set (i.e, the items that the given user likes, in
our case) and t he other child gets all or most of t he negat ive examples (the items t h is user does n ot
like)
12. Once we have selected a pred icate for a n od e N, we divide t he items into the two g roups: those that
sat isfy the pred icate and t h ose that do not.
13. For each grou p, w e aga in find the predicate that best separates the posit ive and negative examples in
that group.
14. Th ese pred icat es are assigned to the child ren of N.
15. This process of d ividing the examples and b u ilding children can proceed to any n umber of levels.
16. W e can stop, and create a leaf, if t h e group of items for a n ode is homogeneous; i.e., they are all positive
or all negative examples.
17. However, we may w ish to stop and create a leaf w ith the majority decision for a gro up, even if th e gro up
contains both positive and negat ive examples.

Example:
l. Suppose our items are news articles, and features are the highTF.IDF words (keywords) in t h ose
documents.
2. Further suppose there is a user U who likes articles about b aseball, except articles about the New York
Yankees.
3. Th e row of the util ity ma tr ix for Uhas l if U h as read t he article and is blank if not.
4. We shall t ake the l's as "like" and the b lanks as "doesn't like."
5. Pred icates wil l be Boolean expressions of keywords.
6. Since U g enerally likes basebal l, we might find that the best pred icate for the root is "hom erun" OR
("batter" A ND "pitcher").
7. Items that satisfy the p redicate w ill tend to be p osit ive examples (articles w ith l in the row for U in the
utility matrix), and items that fai l to sa tisfy the predicate w ill tend t o be negative examples (blanks in
the utility-mat rix row for U).

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 52 of 65
lOMoARcPSD|48592555

6 I Rea l Time Big Data Models BE I SEM - 7

8. Figure 6.3 shows the root as well as t he rest of the decision tree.
9. Suppose t hat the g rou p of articles that do not satisfy the p red icate includes sufficient ly few posit ive
examples that we can conclude all o f t h ese items are in the "don't-like" class.
10. W e may t hen put a leaf w ith decision "d on't like" as the rig ht child of t he root.
11. However, the articles that satisfy the p red icate incl udes a number of articles that user U doesn't like;
these ar e the articles that mention the Yankees.
12. Thus, at the left child of the root, we bu ild another predicate.
13. We m ight find that the predicate "Yankees" OR "Jeter" OR "Teixeira" is the best possib le indicator of an
article about b asebal l and about the Yankees.
14. Thus, we see in Fig. 6.3 t he left child of the root, which applies t h is predicate.
15. Bot h ch ildren of this node are leaves, since w e may suppose that t h e items sat isfyin g t his pred icate are
predominantly negative and those not satisfying it are predominantly p ositive.
16. Unfortun ately, classifiers of al l types tend to take a long t ime to con str uct.
17. For instance, if w e wish to u se decision t rees, w e need one t ree per user.
18. Const ructing a t ree not only requires that we look at all the item profiles, bu t we h ave to consider many
different p red icates, which could involve comp lex combinat ions offeatures.
19. Th us, this app roach tends to be used only for relatively sma ll problem sizes.

"'Homerun n OR
r batter" A N D
"'pit cher"}

"Yan.kHS" OR "l a t er"

Doesn't Like
OR ''Teixeira"

0OHn't Like Uko•

Figure 6.3: Example of decision tree

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 53 of 65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

Q4. Describe Girwan - Newman Algorithm. For the following graph show how the Girvan Newman

algorithm finds the different communities

Ans: [10M - DEC19)

GIRVAN NEWMAN ALGORITHM:

1. Girvan- Newman a lgorithm is a hierarchical method used to detect communities in complex
systems.
2. It is published in 2002 by M ichel le Girvan and mark Newman.
3. It is used for:
a. Commun ity detection.
b. To measure edge - betweenness among all existing edges.
c. To remove edge having large valu ed betweenness.
d . To opt ion optim ized modular function.
4. It also ch ecks for edge betw eenness centra lity and vertex betw eenness centra lity.
5. Vertex betw eenness centra lity is total number of short est p ath that pass through each vertex on the
network.
6. If any ambiguity found w it h above {vertex betweenness centra lity) then eve ry pat h is adjusted t o equal
weight 1/N among a ll N p aths between two vertices.
7. Edge betw een ness centra lity is numb er of shortest path wh ich pass through g iven edge.
8. In order to find out b etween edges, we need to ca lcu late sho rtest paths from going through each of
the edges.
9. For the remova l of each edge, the calcu lat ion of edge betweenness is O(EN); t herefore, this algorithm's
time complexity is O(E'").

STEPS:
1. Find edge w ith hig hest betweenness o f multiple edges of highest betweenness if there is a t ie and
remove those edges from graph. It may affect to graph to get separate into mu lt ip le components. If so,
this is first level of regions in the po rt ioning of graph.
2. Now , recalcu late al l betweenness and again remove the edge or edges of h ighest betweenness. It w ill
break few existin g component into smal ler, if so, these are regions nested w ithin la rger region.
3. Keep repeatat ion of tasks by recalcula ting a ll betweenness an d removing the edge or edges having
highest betweenness.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page54of65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

EXAMPLE:

Step 1:

Step 2:

A 0 0 V
Step 3:
\/ \/ l

0
0 0
0 0
0 0 0
0 0 0 0
0 0

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 55 of 65
lOMoARcPSD|48592555

6 I Rea l Ti me Big Data Models BE I SEM - 7

QS. Com pute the page rank of eac h page after running the PageRank algorithm for two ite rations
with teleportation factor Beta (Pl value = 0.8
Ans: [10M - DEC19)

Weights assigned to every edge:

1/2
AJ-===C
v•

1/2 1/2

Utility Mat rix:

A B C D E F

A 0 ¼ ½ X X X
B ½ 0 0 X X X
C ½ 0 0 X X X
D 0 ¼ 0 X X X
E 0 ¼ 0 X X X
F 0 ¼ 1/2 X X X

He re: D, E, Fare dead end s and d enoted as X o r - in mat rix

08
08
As telepo rtation factor, fl =0.8, w e c reate an eigenvector V =
0.8
0.8
08
0.8
So, to calculate p age rank, w e n eed to carry out 2 it erat ions, as m e nt ioned in t he sum.
0 1/4 1/2 -
1/2 0 0 - - -
1/2 0 0 - - -
Let 's say m at rix A =
0 1/4 0 - - -
0 1/4 0 - - -
0 1/4 1/ 2 -

It e ration I:
We compute A
0.6
0.4
A' V = 0.4
. 0.2
0.2
0.6

• Ha ndcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 56 of 65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

Iteration II:
A 2.V = A. {A1.V)
0.4
0.2
= 0.2
0.05
0.05
0.4
Therefore, the page rank computed is:
A =0 .4 , B = 0 .2, C = 0 .2, D = 0 .05, E = 0.05, F = 0.4

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page57 of65
lOMoARcPSD|48592555

6 I Rea l Time Big Data Models BE I SEM - 7

-- EXTRA QUESTIONS --

Ql. What is Page Rank'? Explain the Inverted Index

Ans: [P I Medium]

PAGE RANK:

l. PageRank (PR) is an algorithm u sed by Google Sea rch to rank w ebsites in their search engine results.
2. PageRank was named after Larry Page, one of the founders of Google.
3. It is a w ay of measuring the im p ort an ce of w ebsite pages.
4. It works by countin g t he number and quality of links to a p age to determine a rough estimate of h ow
important t h e w ebsite is.
5. The underly ing assumption is that more important websites are likely to receive more links from other
websites.
6. Search engine work equivalent to "Web Crawler"
7. Web Crawler is the web component whose responsibility is to identify and list down the different terms
found on every w eb page encountered by it.
8. Th is listing of different terms w ill be stored inside t he sp ecialized data st ructure known as an "Inverted
Index"
9. Figure 6.4 shows inversed index funct iona lity.
10. Every term from the inver ted index w ill be extracted and an alyzed for t he u sage of that term w ithin the
w eb page.

Inverted Index

List - - - Term l Term2 Te rmn

Source 1 Source 2 Source n

Fig ure 6.4: Inversed index functionality

LINKS IN PAGE RANKNG:

l. Assume we have 3 pages A, BC in a g iven domain of web sites.
2. As shown in figure 6.5 they have interconnection links between them .
3. Links can be back links and forward links.
4. Back links in dicates given web page is referred by how many number of other web pages.
5. As shown in figure 6.5, A and Bare b ack links of web page C
6. Forward links indicates how many web pages w ill be referred by given web page.
7. A web page which conta ins number of back links is sa id to be important web page and w ill get upper
posit ion in Ranking.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 58 of 65
lOMoARcPSD|48592555

6 I Rea l Time Big Data Models BE I SEM - 7

Web Page A
W ebPageC

W ebPageB

Fig ure 6.S

8. In the g eneral case, the PageRank va lue for any page u can be expressed as:

PR(u ) = E veB.. pin)>

9. i.e. the Pag eRank value for a p age u is depend ent on the PageRank va lues for each pag e v conta ined
in t h e set Bu (t h e set containing all p ages linkin g to page u), d ivided by the number L(v) of links from
p age v.

Q2. Define content based recommendation system. How it can be used to provide
recommendations to users.
Ans: IP I Medium)
CONTENT BASED RECOMMENDATION:
l. A content b ased recomm ender works w ith d ata t hat t he user provides, either explicitly (rating) or
implicitly (cl ickin g on a link).
2. Based on that data. a user profi le is generated. which is then used to make suggestions to the user.
3. As the user provides m ore inputs or takes actions on the recommend ations, the eng ine becomes more
and m ore accurate.
4. Item profile in content b ased systems focuses on items and user p rofiles in form of weighted lists.
5. Profile are helpful to discover proper ties of items.
6. Consider the below examples:
a. Some students prefer to be guided by few teachers on ly.
b. Some Viewers prefer drama or movie by t heir favor ite actors on ly.
c. Few Viewers prefer o ld songs on other h and few viewers may prefer new songs only d epend ing
upon users sorti ng of songs b ased on year.
7. In g eneral, there are so many classes which p rovides such data.
8. Few domains has common feat ures for example a college and movie it has students, professors set
and actors, d irector set respectively.
9. Certain ratio is maintained in such cases like every coll ege and movie has yea r w ise datasets as movie
released in a year by d irector and actor and colleg e has passing st udent every year etc.
10. Music song album and book has same va lue featu re like songs writers/ poet, year of release and
publication year e tc.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 59 of 65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

11. Consider the figure 6.6 which shows recommendation system parameters.

Product whh ~HtUrel

Content

Item value

Recommenctadon
Communlty 011ta
Component

List of Atteommend.titlon

UHn

Source of pronK- and contextuaJ data

Figure 6 .6: Recommendation system parameters

Q3. What are different recommender systems? Explain any one with example.
Ans: (P I Medium)

RECOMMENDATION SYSTEM:
1. Recommendation system is w idely used now-a-days.
2. It is su b class of information filtering system .
3. It is used to provide recommendation for games, movies, music, b ooks, social tags, art icles etc.
4. It is useful for experts, financia l seN ices, life in su rance and social m ed ia based organization.
5. There are two types o f recommendation systems:
a) Coll aborative filtering.
b) Content based filter ing.

COLLABORATIVE FILTERING SYSTEMS:

1. It uses community data from peer groups for recommendat ion.
2. These exhibits all those things tha t are p opu lar among the peers.
3. It uses user's past behaviou r and apply some p redication about user may like and accordingly p ost
dat a.
4. These filtering systems recommend items based on similarity measure between users and/or items.
5. Here user profile and context ual parameters along with the community data are used by the
recommender systems to personalize the recommendation list.
6. This is t he most prominent approach in e-commerce website.
7. The basic assumption for collaborative filtering is:
a. User g ives ra tings to item in the cata log.
b. Customer who had similar taste in past w ill have similar taste in future.
c. Users who ag reed in their subjective eva luations in the past w ill agree in the future too.
d. To find out sim ilarity we can use Pearson's correlation co-efficient as:

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page60of65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

sim( a, b) = r: ep (r<>,p ro)(rb,p rb)

P JE, cp(ra,p- ra) ,(rb,p- rb) 2 2

Where A, b = users
Ra, p = rating of user 'a' for item 'p'
P = Set of items rated by bo th a an d b

e. We can use this formula for prediction as :

1:, cn..,m(n,b) x (qb,p rb)

pred(a, p) = ra + I:, ensim(a,b)

Advantages:
l. Continuous learn ing for market process.
2. No knowledge engineering e fforts needed .

Disadvantages:
l. Rat ing & feedback is requ ired.
2. New items and users faces to cold star t.

Q4. What are social network graphs? How does clustering of social network graphs work?
Ans: [P I Medium]

SOCIAL NETWORK GRAPH:

l. Socia l networ ks are naturally modeled as g raphs, wh ich we sometimes refer to as a social graph.
2. The entities are t he nodes, and an edge connects two nodes if t he nodes are related by the relationship
that characterizes the n etwor k.
3. If there is a degree associated w ith the relat ionship, th is degree is represented by labeling the edges.
4. Often, socia l graphs are und irected , as for the Facebook friend 's g rap h.
5. But t hey can be directed g raphs, as for example the graphs of followers on Tw itter or Google+.
6. Consider t he example of sma ll social networ k as shown in figure 6.7

Figure 6.7: Example of social network graph

7. Th e entities are the no des A through G.
8. The relat ionship , which we m ight think of as "friends," is represented by the edges.
9. For in stance, B is friends w ith A, c, and D.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 61 of 65
lOMoARcPSD|48592555

6 I Rea l Time Big Data Models BE I SEM - 7

CLUSTERING OF SOCIAL-NETWORK GRAPHS;

Cl usterin g o f the g raph is considered as a way to ident ify communities. Cluste ring of g raphs involves
follow ing steps:

I) Distance Measures for Social-Network Graphs:

1. lf we wan t to apply st andard clustering tech niques to a social-n etw o rk graph, our first step wou ld be
to defi ne a distan ce measure.
2. Whe n the edg es o f the gra ph have labels, these labels m ig ht be usable as a d ista nce m easure,
d e p ending on what they represented.
3. But when t h e edges are un labeled, as in a "frie n d s" graph, t h e re is not m uch w e can d o to d efine a
suitable d istan ce.
4. Our first instinct is t o assume that nodes are close if t hey h ave a n edge betw een t he m and d istan t if
not .
5. Th us, we co u ld say that th e dista n ce d {x, y) is O if t h ere is an edg e {x, y) an d 1 if t h ere is no such ed g e.
6. W e cou ld use a ny other two values, such as 1 an d = , as long as the distance is clo ser when there is an
edge.

II) Applying Standard Clustering Methods:

1. Th ere are tw o genera l ap p ro aches to cluster in g: hierarchical (agglomerative) and point-assignment.
2. Hierarc hica l clusterin g of a social-n etwork graph starts by combin in g some tw o no d es t h at are
connec ted by an edg e.
3. Successively, edg es t h at are not between two n o d es of the same cl uste r w ould be cho sen ra ndom ly to
co mbi ne t he clusters to wh ich t heir two nodes belong.
4. The choices w o u ld b e rando m , b ecau se all d istances re p resented by an ed ge are the same.
5. In Point-assignment approach t o clustering so cia l network s, all edges are at t he same distance w ill
in t rod uce a numbe r o f random fac tors that will lead to so me n odes b eing assig ned to t he wrong
cluster.

Il l) Betweenness:
1. Since t h ere a re problems w ith st andard cl usteri ng meth o d s, severa l specia lized clustering t echn iq ues
have been develop ed to fi nd communities in social networks.
2. The simplest one is b ased on find ing the edg es that are least likely to be inside the comm u n ity.
3. Defin e the b etweenness of an ed ge {a, b ) to be t he n u m b e r of p airs o f nodes x an d y such that the edge
(a, b ) lies on the shortest p ath between x and y .
4. To b e m ore p recise, sin ce t h ere can b e severa l short est p aths betw een x and y, edge (a, b) is c redit ed
w ith t he fraction o f t h ose shortest pat hs that inclu d e the edge {a, b ).
5. As in golf, a h igh score is b ad.
6. It suggests t h at the ed g e (a, b) runs between two d ifferent co mmu n ities; that is, a an d b do n ot b elo ng
to the same community.

1V) The Girvan-Newman Algorithm;

1. Girvan- Newman a lgorithm is a hierarchical method used to detect communities in complex
systems.
2. It is published in 2002 by M ich elle Girvan an d mark Newman .

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 62 of 65
lOMoARcPSD|48592555

6 I Real Time Big Data Models BE I SEM - 7

3. It is used for:
a. Commun ity detection.
b. To measure edge - betweenness among all existing edges.
c. To remove edge having large valued betweenness.
d . To option optimized modular function.
4. It also checks for edge betw eenness centra lity and vertex betw eenness centra lity.

VJ Using betweenness to find communities:

l. The betweenness scores for the edges of a g raph behave something like a distance measure on the
nodes of the graph.
2. It is not exactly a d istance measure, b eca use it is not defined for pa irs of nodes that are unconnected
by an edge, and m ight not satisfy t h e t ria ng le inequality even when defined.
3. However, we can cluster by taking the edges in o rder o f increasing betweenness and add t hem to the
g raph one at a t im e.
4. At each step, the connected com ponents of the g raph form some clusters.
5. The higher the betweenness we allow , t he m ore edges we get, and the la rger the clusters become.
6. More commonly, t his idea is expressed as a process of edge removal.
7. Start with the graph and all its edges; then remove edges w ith the highest b etweenness, until the
graph has broken in to a suitable number o f connected comp onents.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 63 of 65
lOMoARcPSD|48592555

Question Pa per BE I SEM - 7

DEC-2019 [3 Hours] - (80 Marks]

Ql. (a) Expla in Edit d istance measure w ith an example. [OS)

Ans: [Chapter No. OS I Page No. 40)
(b) When it comes to b ig d ata how NoSQL scores over ROB MS. [OS)
Ans: [Chapter No. 03 I Page No. 2S)
(c) Give d ifference between Traditional data management and analytics approach Versus Big data
Approach [OS)
Ans: [Chapter No. 01 I Page No. 01)
(d) Give Applicat ions of Social Network Mining [05)
Ans: [Chapter No. 06 I Page No. 48)

Q2. (a) What is Hadoop? Describ e HDFS architect ure w ith diag ra m. [10)
Ans: (Chapter No. 02 I Page No. 17)
(b) Explain w ith b loc k d iag ram architect u re of Data Stream Management System. [10]
Ans: (Chapter No. 04 I Page No. 33)

Q3. (a) What is th e use of Recommend er System? How is classification algorithm used in
recommen d ation system? [10)
Ans: (Chapter No. 06 I Page No. SO)
(b) Explain the following ter ms w ith d iagram [10 )
1) Hub s and Aut hor ities
2) Structu re of the Web
Ans: (Chapter No. 06 I Page No. 48)

Q4. (a) What d o you mean by Counting Distinct Elements in a stream . Ill ustrate w ith an exam ple
working of a Flajolet - Mart in Algorithm used to cou nt n u m b er o f d ist inct elements. [10 )
Ans: [Chapter No. 04 I Page No. 34)
(b) Expla in d ifferent ways by which big data problems are h andled by NoSQL. [10]
Ans: [Chapter No. 03 I Page No. 26)

QS. (a) Descr ib e Girwan - Newman A lgorit hm. For t he following g raph show how t he Girvan Newman
algorithm find s the different commun ities. [10 )

A l----1 8 1----< C

o }-----l E I----I F

Ans: [Chapter No. 06 I Page No. 53)

(b) What is the role of JobTracker and TaskTracker in Map Reduce. Il lustrate Map Reduce execution
p ipeline w ith Word count example. [10)
Ans: [Chapter No. 02 I Page N o. 19)

• Handcrafted by BackkBe nchers Community

Downloaded by asda sd ([email protected]) Page64of 65
lOMoARcPSD|48592555

Question Paper BE I SEM - 7

Q6. (a) Compute the page rank of each page after running the PageRank algorithm for two iterations
w ith teleportation fact or Bet a(~) value = 0.8

C>--------c

Ans: [Chapter No. 06 I Page No. 55]

(b) What are the chal lenges in clusterin g of Data streams. Explain stream clustering algorithm in
d eta il.
Ans: [Chapter No. 05 I Page No. 41]

Note: We have tried to cover almost every important question(s) listed in syllabus. If you feel any other
question is important and it is not cover in this solution then do mail the question on
[email protected] or Whatsapp us on +91-9930038388 / +91-7507531198

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page 65 of 65
lOMoARcPSD|48592555

Big Data Analytics BE I SEM - 7

Join BackkBenchers Community & become the Student Ambassador to represent your college & earn

15% Discount.

SB Communfty

We organ ize IV for students as well in low p ackage. Con tact u s for m ore details.

Buy & Sell Fina l Year Projects with BackkBenchers. Project Charge upto 10,000.

•••
(/)
•••
--
-- ,_-
r I I ~
-
:z:

Follow us on Social Media Profiles to get notified

[@ BackkBenchersCommunity {9 +91-9930038388 (I BackkBenchersCommunity

E-Solutions Will be Available @ BackkBenchers Website Soon.

• Handcrafted by BackkBenchers Community

Downloaded by asda sd ([email protected]) Page66of6S
lOMoARcPSD|48592555

Downloaded by asda sd ([email protected])

Big Data Black Book
16% (25)
Big Data Black Book
2 pages
cp5293 Big Data Analytics Question Bank
0% (1)
cp5293 Big Data Analytics Question Bank
13 pages
Big Data Black Book PDF
15% (20)
Big Data Black Book PDF
2 pages
BDA Notes
No ratings yet
BDA Notes
70 pages
BDA Unlocked
100% (1)
BDA Unlocked
69 pages
Cp5293 Big Data Analytics Question Bank
0% (1)
Cp5293 Big Data Analytics Question Bank
13 pages
Bda Solved Sample Question Paper 70 Marks
No ratings yet
Bda Solved Sample Question Paper 70 Marks
29 pages
Big Data Analytics (R18a0529)
No ratings yet
Big Data Analytics (R18a0529)
134 pages
big data sv publication
No ratings yet
big data sv publication
142 pages
Big Data Analytics Digital Notes
No ratings yet
Big Data Analytics Digital Notes
119 pages
Syllabus
No ratings yet
Syllabus
3 pages
Big Data
No ratings yet
Big Data
76 pages
Data Science and Big Data Analytics_ Unit_1
No ratings yet
Data Science and Big Data Analytics_ Unit_1
47 pages
BDA_Notes
No ratings yet
BDA_Notes
68 pages
BDA Unit 1
No ratings yet
BDA Unit 1
36 pages
BDA-1st unit
No ratings yet
BDA-1st unit
39 pages
CS8091 BDA Unit1
No ratings yet
CS8091 BDA Unit1
63 pages
Big Data Analytics
No ratings yet
Big Data Analytics
31 pages
Big Data Analytics
No ratings yet
Big Data Analytics
2 pages
BIG Data_Unit_1
No ratings yet
BIG Data_Unit_1
24 pages
CS8091-Big-Data-Analytics
No ratings yet
CS8091-Big-Data-Analytics
28 pages
Unit 2 Da
No ratings yet
Unit 2 Da
69 pages
CS8091 LN
No ratings yet
CS8091 LN
68 pages
Techknowledge Publication: Big Data Analytics
No ratings yet
Techknowledge Publication: Big Data Analytics
156 pages
Big Data 2022 Notes
No ratings yet
Big Data 2022 Notes
118 pages
BD IMP QUES 1
No ratings yet
BD IMP QUES 1
22 pages
Unit 1-Big Data Analytics & Lifecycle
No ratings yet
Unit 1-Big Data Analytics & Lifecycle
130 pages
Unit I LM
No ratings yet
Unit I LM
12 pages
IT_(R20)_4-1_BIG DATA ANALYTICS_DIGITAL NOTES (1)
No ratings yet
IT_(R20)_4-1_BIG DATA ANALYTICS_DIGITAL NOTES (1)
117 pages
Pub Res Feb 20231
No ratings yet
Pub Res Feb 20231
5 pages
Big Data Analysis Using Hadoop: A Survey: August 2015
No ratings yet
Big Data Analysis Using Hadoop: A Survey: August 2015
6 pages
Bda Sem 7 Book
No ratings yet
Bda Sem 7 Book
188 pages
Big Data Analytics
100% (3)
Big Data Analytics
79 pages
Big Data With Cloud Computing Discussions and Challenges
No ratings yet
Big Data With Cloud Computing Discussions and Challenges
9 pages
Business Intelligence & Big Data Analytics-CSE3124Y
No ratings yet
Business Intelligence & Big Data Analytics-CSE3124Y
25 pages
Introduction To Bda
No ratings yet
Introduction To Bda
67 pages
Big Data Analytics Unit1
No ratings yet
Big Data Analytics Unit1
20 pages
(15) Big Data
No ratings yet
(15) Big Data
10 pages
Unit 1
No ratings yet
Unit 1
14 pages
No SQL Database in Bda
No ratings yet
No SQL Database in Bda
84 pages
Big Data
No ratings yet
Big Data
25 pages
Big Data 2022 Notes
No ratings yet
Big Data 2022 Notes
118 pages
PPT01-Introduction To Big Data
No ratings yet
PPT01-Introduction To Big Data
34 pages
big-data-2022-notes
No ratings yet
big-data-2022-notes
118 pages
BDA U1 copy
No ratings yet
BDA U1 copy
78 pages
Prepared by Richa Btech (Cse) 6 Sem Dav University Jalandhar
No ratings yet
Prepared by Richa Btech (Cse) 6 Sem Dav University Jalandhar
30 pages
PHD CSE Seminar in Course Work
0% (1)
PHD CSE Seminar in Course Work
17 pages
Big Data
No ratings yet
Big Data
1 page
Digital Notes IDBA Final Original
No ratings yet
Digital Notes IDBA Final Original
156 pages
hadoop-big-data-unit-2
No ratings yet
hadoop-big-data-unit-2
23 pages
Big Data Analytics
No ratings yet
Big Data Analytics
36 pages
Introduction To Big Data Computing
No ratings yet
Introduction To Big Data Computing
25 pages
21cs71BDA Question bank
No ratings yet
21cs71BDA Question bank
4 pages
Big Data Lec4
No ratings yet
Big Data Lec4
38 pages
May_Jun_2024
No ratings yet
May_Jun_2024
2 pages
BIG DATA_UNIT-I
No ratings yet
BIG DATA_UNIT-I
17 pages
Practical Data Strategies and Recipes
From Everand
Practical Data Strategies and Recipes
Tom Henricksen
No ratings yet
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
From Everand
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
Eric Tome
No ratings yet
Spark for Data Science
From Everand
Spark for Data Science
Srinivas Duvvuri
No ratings yet
Learning Cascading
From Everand
Learning Cascading
Michael Covert
No ratings yet
Security Vulnerability Matrix PDF
No ratings yet
Security Vulnerability Matrix PDF
1 page
Assignment 1
No ratings yet
Assignment 1
3 pages
MVB-MCM Datasheet
No ratings yet
MVB-MCM Datasheet
23 pages
Interaction With The Physical World: Embedded Software
No ratings yet
Interaction With The Physical World: Embedded Software
12 pages
Experiment - 1 Addition of Two 8 Bit Numbers
No ratings yet
Experiment - 1 Addition of Two 8 Bit Numbers
20 pages
Configuring and Administering Server
100% (5)
Configuring and Administering Server
100 pages
Pseries 650 Model 6M2 Technical Overview and Introduction - Redp0194
No ratings yet
Pseries 650 Model 6M2 Technical Overview and Introduction - Redp0194
74 pages
Technology Livelihood Education-Computer System Servicing: Quarter 2 - Module 1.2 Microsoft Windows 7 Installation
No ratings yet
Technology Livelihood Education-Computer System Servicing: Quarter 2 - Module 1.2 Microsoft Windows 7 Installation
21 pages
8.4.1.2 Packet Tracer - Skills Integration Challenge Instructions IG
100% (1)
8.4.1.2 Packet Tracer - Skills Integration Challenge Instructions IG
2 pages
Network Device Protection Profile (NDPP) Extended Package Stateful Traffic Filter Firewall
No ratings yet
Network Device Protection Profile (NDPP) Extended Package Stateful Traffic Filter Firewall
36 pages
1616402895708-Subjective Questions On Computer Awareness MRT-14
No ratings yet
1616402895708-Subjective Questions On Computer Awareness MRT-14
4 pages
How To Create Iris Service
No ratings yet
How To Create Iris Service
15 pages
Onboard Azure
No ratings yet
Onboard Azure
9 pages
Ex - No:1 TCP Socket Date
No ratings yet
Ex - No:1 TCP Socket Date
47 pages
Aswin TS Working Principle of Dpist Sys Simplified Notes Unit 1
No ratings yet
Aswin TS Working Principle of Dpist Sys Simplified Notes Unit 1
4 pages
An1218 PDF
No ratings yet
An1218 PDF
56 pages
AirTeach Software Flyer EN
No ratings yet
AirTeach Software Flyer EN
4 pages
Win32.Alman.B Submission Summary
No ratings yet
Win32.Alman.B Submission Summary
6 pages
Distributing Computing
No ratings yet
Distributing Computing
212 pages
Computer Network Tutorial - Javatpoint
No ratings yet
Computer Network Tutorial - Javatpoint
7 pages
AIST Jupyter Notebook Instructions
No ratings yet
AIST Jupyter Notebook Instructions
5 pages
ENARSI SA Troubleshooting
No ratings yet
ENARSI SA Troubleshooting
4 pages
How Network Address Translation Works
No ratings yet
How Network Address Translation Works
8 pages
MRE Development FAQ
No ratings yet
MRE Development FAQ
35 pages
Technical Information Interfacing With VE Bus Products MK2 Protocol 3 14
No ratings yet
Technical Information Interfacing With VE Bus Products MK2 Protocol 3 14
34 pages
PowerScale OneFS Technical Specifications Guide 9.2.1.0
No ratings yet
PowerScale OneFS Technical Specifications Guide 9.2.1.0
18 pages
Reference Guide: HP Deskjet Ink Advantage Ultra 4800
No ratings yet
Reference Guide: HP Deskjet Ink Advantage Ultra 4800
4 pages
Assignment 2
No ratings yet
Assignment 2
52 pages
Manifest NonUFSFiles Win64
No ratings yet
Manifest NonUFSFiles Win64
10 pages
Computer Knowledge For Ibps Specialist Officer Question Bank 5
No ratings yet
Computer Knowledge For Ibps Specialist Officer Question Bank 5
5 pages