10

The document contains practice exercises related to storing and querying large datasets using distributed systems and Apache Spark. The exercises cover topics like key-value stores, document stores, joins, and window functions.

Uploaded by

atik

CHAPTER

10
Big Data
Practice Exercises
10.1 Suppose you need to store a very large number of small files, each of size say 2
kilobytes. If your choice is between a distributed file system and a distributed
key-value store, which would you prefer, and explain why.
Answer:
The key-value store, since the distributed file system is designed to store a
moderate number of large files. With each file block being multiple megabytes,
kilobyte-sized files would result in a lot of wasted space in each block and poor
storage performance.
10.2 Suppose you need to store data for a very large number of students in a
distributed document store such as MongoDB. Suppose also that the data for
each student correspond to the data in the student and the takes relations.
How would you represent the above data about students, ensuring that all the
data for a particular student can be accessed efficiently? Give an example of
the data representation for one student.
Answer:
We would store the student data as a JSON object, with the takes tuples for
the student stored as a JSON array of objects, each object corresponding to a
single takes tuple. Give example ...
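
One possible shape for such a document is sketched below in Python. The attribute names follow the student and takes relations of the university schema; all values are hypothetical sample data, and the helper function is only there to show that one lookup of the document gives access to every course the student has taken.

```python
import json

# Hypothetical sample values; attribute names follow the student and takes relations.
student_doc = {
    "ID": "12345",
    "name": "Shankar",
    "dept_name": "Comp. Sci.",
    "tot_cred": 32,
    # One embedded object per takes tuple. The student ID is not repeated,
    # since it is implied by the enclosing document.
    "takes": [
        {"course_id": "CS-101", "sec_id": "1", "semester": "Fall",
         "year": 2017, "grade": "C"},
        {"course_id": "CS-190", "sec_id": "2", "semester": "Spring",
         "year": 2017, "grade": "A"},
    ],
}

def course_ids(doc):
    """Return the course IDs stored inside a single student document."""
    return [t["course_id"] for t in doc["takes"]]

# The document serializes directly to JSON, the format a document store holds.
print(json.dumps(student_doc, indent=2))
print(course_ids(student_doc))
```

Embedding the takes tuples trades some redundancy-free design for locality: everything about one student is fetched with a single key lookup.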
10.3 Suppose you wish to store utility bills for a large number of users, where each
bill is identified by a customer ID and a date. How would you store the bills in
a key-value store that supports range queries, if queries request the bills of a
specified customer for a specified date range.
Answer:

Create a key by concatenating the customer ID and date (with date represented
in the form year/month/date, e.g., 2018/02/28) and store the records indexed
on this key. Because the year comes first and the month and day are zero-padded,
the keys of one customer sort in date order, so the required records can be
retrieved by a range query.
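
The key construction can be sketched as follows. SortedStore is a toy in-memory stand-in for a key-value store with range queries, not a real library; the "/" separator between customer ID and date, the fixed-width customer IDs, and the sample bills are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def make_key(customer_id, date):
    """Build 'customer/YYYY/MM/DD'; zero-padding makes string order match date order."""
    year, month, day = date
    return f"{customer_id}/{year:04d}/{month:02d}/{day:02d}"

class SortedStore:
    """Toy stand-in for a key-value store that keeps keys sorted (hypothetical)."""

    def __init__(self):
        self._keys = []      # sorted list of keys
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def range(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi, in key order."""
        i = bisect_left(self._keys, lo)
        j = bisect_right(self._keys, hi)
        return [(k, self._values[k]) for k in self._keys[i:j]]

store = SortedStore()
store.put(make_key("C001", (2018, 1, 15)), 40.0)
store.put(make_key("C001", (2018, 2, 10)), 42.5)
store.put(make_key("C001", (2018, 2, 28)), 39.0)
store.put(make_key("C001", (2018, 3, 5)), 41.0)
store.put(make_key("C002", (2018, 2, 12)), 55.0)

# Bills of customer C001 for February 2018, fetched with one range query.
feb_bills = store.range(make_key("C001", (2018, 2, 1)),
                        make_key("C001", (2018, 2, 28)))
print(feb_bills)
```

Putting the customer ID first keeps all bills of one customer contiguous in key order, which is what lets a single range scan answer the query.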


10.4 Give pseudocode for computing a join $r \bowtie_{r.A = s.A} s$ using a single
MapReduce step, assuming that the map() function is invoked on each tuple of r and s.
Assume that the map() function can find the name of the relation using
context.relname().
Answer:

With the map function, output records from both the input relations, using the
join attribute value as the reduce key. The reduce function gets records from
both relations with matching join attribute values and outputs all matching
pairs.
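
The answer above can be sketched as a small single-process simulation in Python. This is not a MapReduce framework: run_join plays the role of the shuffle phase, tuples are assumed to be dictionaries with a join attribute "A", and the relation name is passed to map_fn as a parameter in place of context.relname().

```python
from collections import defaultdict

def map_fn(relname, tup):
    """Emit (join key, (relation name, tuple)) so the reducer can tell r from s."""
    yield tup["A"], (relname, tup)

def reduce_fn(key, values):
    """Pair every r tuple with every s tuple that shares this join key."""
    r_tuples = [t for name, t in values if name == "r"]
    s_tuples = [t for name, t in values if name == "s"]
    for rt in r_tuples:
        for st in s_tuples:
            yield rt, st

def run_join(r, s):
    """Simulate one MapReduce step: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for relname, relation in (("r", r), ("s", s)):
        for tup in relation:
            for key, value in map_fn(relname, tup):
                groups[key].append(value)
    result = []
    for key, values in groups.items():
        result.extend(reduce_fn(key, values))
    return result

r = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
s = [{"A": 1, "C": "p"}, {"A": 1, "C": "q"}, {"A": 3, "C": "z"}]
joined = run_join(r, s)
print(joined)
```

Tagging each emitted value with its relation name is the essential step; without it the reducer could not distinguish r tuples from s tuples that arrive under the same key.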
10.5 What is the conceptual problem with the following snippet of Apache Spark
code meant to work on very large data. Note that the collect() function returns
a Java collection, and Java collections (from Java 8 onwards) support map and
reduce functions.

JavaRDD<String> lines = sc.textFile("logDirectory");

int totalLength = lines.collect().map(s -> s.length())
                       .reduce(0, (a, b) -> a + b);

Answer:
The problem with the code is that the collect() function gathers the RDD data
at a single node, and the map and reduce functions are then executed on that
single node, not in parallel as intended.
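
The difference can be illustrated with a plain-Python analogy, in which a list of lists stands in for the partitions of an RDD. This is not Spark code, and both function names are hypothetical. In Spark the usual remedy is to call map() and reduce() on the RDD itself and bring only the final number back to the driver.

```python
def total_length_collect_first(partitions):
    """Mirror of the problematic snippet: copy every line to one place, then compute."""
    all_lines = [line for part in partitions for line in part]  # like collect()
    return sum(len(line) for line in all_lines)

def total_length_per_partition(partitions):
    """Compute a partial sum where each partition lives, then combine the partials."""
    # In a real cluster each partial sum would be computed on the node holding
    # that partition, in parallel with the others.
    partials = [sum(len(line) for line in part) for part in partitions]
    # Only one small integer per partition has to travel to the driver.
    return sum(partials)

partitions = [["error a", "ok"], ["warn bb"], ["ok", "ok", "fail"]]
print(total_length_collect_first(partitions))
print(total_length_per_partition(partitions))
```

Both functions return the same total; what differs is how much data has to be moved to a single node before any work is done.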
10.6 Apache Spark:
a. How does Apache Spark perform computations in parallel?
b. Explain the statement: Apache Spark performs transformations on
RDDs in a lazy manner.
c. What are some of the benefits of lazy evaluation of operations in Apache
Spark?

Answer:

a. RDDs are stored partitioned across multiple nodes. Each transformation
operation on an RDD is executed in parallel on multiple nodes.
b. Transformations are not executed immediately but postponed until the
result is required for functions such as collect() or saveAsTextFile().
c. The operations are organized into a tree, and query optimization can
be applied to the tree to speed up computation. Also, answers can be
pipelined from one operation to another, without being written to disk,
to reduce time overheads of disk storage.
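
Lazy evaluation and pipelining can be illustrated with Python generators. This is only an analogy and not how Spark is implemented: build_pipeline and its two stages are hypothetical names, and the log list exists purely to record when each stage actually runs.

```python
def build_pipeline(lines, log):
    """Chain two lazy stages; nothing runs until a consumer asks for results."""

    def read_lines():
        for line in lines:
            log.append(f"read {line}")
            yield line

    def lengths(source):
        for line in source:
            log.append(f"len {line}")
            yield len(line)

    return lengths(read_lines())

log = []
pipeline = build_pipeline(["ab", "cde"], log)
steps_before_action = list(log)   # still empty: the stages were only described

total = sum(pipeline)             # the "action" that forces evaluation
print(total)
print(log)
```

The log shows each line passing through both stages before the next line is read, so no intermediate collection of all lines is ever materialized between the stages.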

10.7 Given a collection of documents, for each word $w_i$, let $n_i$ denote the number of
times the word occurs in the collection. Let $N$ be the total number of word
occurrences across all documents. Next, consider all pairs of consecutive words
$(w_i, w_j)$ in the document; let $n_{i,j}$ denote the number of occurrences of the word
pair $(w_i, w_j)$ across all documents.

Write an Apache Spark program that, given a collection of documents in a
directory, computes $N$, all pairs $(w_i, n_i)$, and all pairs $((w_i, w_j), n_{i,j})$. Then output
all word pairs such that $n_{i,j}/N \geq 10 \times (n_i/N) \times (n_j/N)$. These are word pairs
that occur 10 times or more as frequently as they would be expected to occur
if the two words occurred independently of each other.
You will find the join operation on RDDs useful for the last step, to bring
related counts together. For simplicity, do not bother about word pairs that
cross lines. Also assume for simplicity that words only occur in lowercase and
that there are no punctuation marks.
Answer:
FILL IN ANSWER (available with SS)
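
The sketch below is not the Spark program the exercise asks for and is not the book's answer. It is a single-machine Python illustration of the counting and filtering logic only, assuming the input is already available as an iterable of lowercase, punctuation-free lines; the function name and the sample lines are hypothetical.

```python
from collections import Counter

def frequent_pairs(lines, factor=10):
    """Return N, the word counts, the pair counts, and the pairs passing the test."""
    word_counts = Counter()   # n_i for each word w_i
    pair_counts = Counter()   # n_{i,j} for each consecutive pair (w_i, w_j)
    for line in lines:
        words = line.split()
        word_counts.update(words)
        # Consecutive pairs within one line; pairs crossing lines are ignored.
        pair_counts.update(zip(words, words[1:]))

    n_total = sum(word_counts.values())   # N
    selected = []
    for (wi, wj), nij in pair_counts.items():
        # n_ij/N >= factor * (n_i/N) * (n_j/N), multiplied through by N^2
        # to avoid floating-point division.
        if nij * n_total >= factor * word_counts[wi] * word_counts[wj]:
            selected.append((wi, wj))
    return n_total, word_counts, pair_counts, sorted(selected)

sample_lines = ["new york", "the the the the the the the the the the"]
n_total, word_counts, pair_counts, selected = frequent_pairs(sample_lines)
print(n_total, selected)
```

In a Spark version, the lookups into word_counts would become the joins that the exercise hints at, bringing $n_i$ and $n_j$ together with each $n_{i,j}$.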
10.8 Consider the following query using the tumbling window operator:

select item, System.Timestamp as window_end, sum(amount)
from order timestamp by datetime
group by itemid, tumblingwindow(hour, 1)

Give an equivalent query using normal SQL constructs, without using the
tumbling window operator. You can assume that the timestamp can be converted
to an integer value that represents the number of seconds elapsed since (say)
midnight, January 1, 1970, using the function to_seconds(timestamp). You can
also assume that the usual arithmetic functions are available, along with the
function floor(a) which returns the largest integer ≤ a.
Answer:
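Divide to_seconds(datetime) by 3600 and take the floor to get an hour number,
then group by itemid and that hour number. To output the timestamp of the
window end, add 1 to the hour number and multiply by 3600, which gives the
window end in seconds.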
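
The arithmetic in this answer can be checked with a short Python sketch. It assumes each order is given as a tuple of (itemid, seconds since the epoch, amount), which stands in for applying to_seconds to the datetime column; the function name and the sample orders are hypothetical.

```python
from collections import defaultdict

def hourly_totals(orders):
    """Group by (itemid, hour number) and report each group's window end in seconds."""
    totals = defaultdict(float)
    for itemid, seconds, amount in orders:
        hour = seconds // 3600             # floor(seconds / 3600)
        window_end = (hour + 1) * 3600     # end of that one-hour window
        totals[(itemid, window_end)] += amount
    return dict(totals)

orders = [
    ("a", 10, 5.0),
    ("a", 3599, 2.0),    # same window as the first order
    ("a", 3600, 1.0),    # first second of the next window
    ("b", 7300, 4.0),
]
print(hourly_totals(orders))
```

With the floor convention used in the answer, a timestamp that falls exactly on an hour boundary is assigned to the window that starts at that boundary.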
10.9 Suppose you wish to model the university schema as a graph. For each of the
following relations, explain whether the relation would be modeled as a node
or as an edge:
(i) student, (ii) instructor, (iii) course, (iv) section, (v) takes, (vi) teaches.
Does the model capture connections between sections and courses?
Answer:

Each relation corresponding to an entity (student, instructor, course, and
section) would be modeled as a node. Takes and teaches would be modeled as
edges. There is a further edge between course and section, which has been
merged into the section relation and cannot be captured with the above schema.
It can be modeled if we create a separate relation that links sections to courses.
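
A minimal Python sketch of this graph model follows. Node identifiers, the edge label "section_of" for the extra section-to-course link, and all sample values are hypothetical choices made for illustration.

```python
# Node ids are (relation name, primary key); all sample values are hypothetical.
student = ("student", "12345")
instructor = ("instructor", "10101")
course = ("course", "CS-101")
section = ("section", ("CS-101", "1", "Fall", 2017))

nodes = {student, instructor, course, section}

# Each edge is (source node, label, target node).
edges = [
    (student, "takes", section),
    (instructor, "teaches", section),
    # Extra edge standing in for a separate relation linking sections to courses.
    (section, "section_of", course),
]

def neighbors(node, label):
    """Targets reachable from node over edges carrying the given label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]

def courses_of_student(student_node):
    """Follow takes edges to sections, then section_of edges to courses."""
    return [c for sec in neighbors(student_node, "takes")
            for c in neighbors(sec, "section_of")]

print(courses_of_student(student))
```

Without the section_of edge, the second hop in courses_of_student would have nothing to traverse, which is the gap the answer points out.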
