10
Big Data
Practice Exercises
10.1 Suppose you need to store a very large number of small files, each of size say 2
kilobytes. If your choice is between a distributed file system and a distributed
key-value store, which would you prefer, and explain why.
Answer:
The key-value store, since the distributed file system is designed to store a
moderate number of large files. With each file block being multiple megabytes,
kilobyte-sized files would result in a lot of wasted space in each block and poor
storage performance.
10.2 Suppose you need to store data for a very large number of students in a
distributed document store such as MongoDB. Suppose also that the data for
each student correspond to the data in the student and the takes relations.
How would you represent the above data about students, ensuring that all the
data for a particular student can be accessed efficiently? Give an example of
the data representation for one student.
Answer:
We would store the student data as a JSON object, with the takes tuples for
the student stored as a JSON array of objects, each object corresponding to a
single takes tuple. Give example ...
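One possible shape for such a document is sketched below in Python. The field names are assumed to follow the textbook university schema, student(ID, name, dept_name, tot_cred) and takes(ID, course_id, sec_id, semester, year, grade); all values are made up for illustration.

```python
import json

# Hypothetical student document. The takes tuples are nested inside the
# student object, so a single lookup by ID returns all data for the student.
# The student ID is not repeated inside each takes entry.
student_doc = {
    "ID": "00128",
    "name": "Zhang",
    "dept_name": "Comp. Sci.",
    "tot_cred": 102,
    "takes": [
        {"course_id": "CS-101", "sec_id": "1", "semester": "Fall",
         "year": 2017, "grade": "A"},
        {"course_id": "CS-347", "sec_id": "1", "semester": "Fall",
         "year": 2017, "grade": "A-"},
    ],
}


def to_json(doc):
    """Serialize the document as it might be sent to a document store."""
    return json.dumps(doc, indent=2)


print(to_json(student_doc))
```

Because the takes records are embedded, no join or second lookup is needed to fetch a student's enrollment history.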
10.3 Suppose you wish to store utility bills for a large number of users, where each
bill is identified by a customer ID and a date. How would you store the bills in
a key-value store that supports range queries, if queries request the bills of a
specified customer for a specified date range?
Answer:
Create a key by concatenating the customer ID and date (with date represented
in the form year/month/date, e.g., 2018/02/28) and store the records indexed
on this key. Now the required records can be retrieved by a range query.
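The key construction can be sketched as follows. `RangeKVStore` is a toy in-memory stand-in written for this illustration, not the API of any real key-value store, and the sketch assumes fixed-width customer IDs so that keys of different customers never interleave.

```python
import bisect


class RangeKVStore:
    """Toy in-memory stand-in for a key-value store with range queries."""

    def __init__(self):
        self._keys = []     # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range(self, low, high):
        """Return (key, value) pairs with low <= key <= high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_right(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[start:end]]


def bill_key(customer_id, year, month, day):
    # Zero-padded year/month/day makes string order agree with date order.
    return f"{customer_id}:{year:04d}/{month:02d}/{day:02d}"


store = RangeKVStore()
store.put(bill_key("C042", 2018, 1, 31), {"amount": 61.20})
store.put(bill_key("C042", 2018, 2, 28), {"amount": 58.75})
store.put(bill_key("C042", 2018, 3, 31), {"amount": 49.10})
store.put(bill_key("C043", 2018, 2, 28), {"amount": 80.00})

# Bills of customer C042 dated from 2018/02/01 through 2018/03/31.
result = store.range(bill_key("C042", 2018, 2, 1),
                     bill_key("C042", 2018, 3, 31))
```

Putting the customer ID first keeps all bills of one customer contiguous in key order, so a date range for that customer maps to a single contiguous key range.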
10.4 Give pseudocode for computing a join $r \bowtie_{r.A = s.A} s$ using a single
MapReduce step, assuming that the map() function is invoked on each tuple of
r and s. Assume that the map() function can find the name of the relation using
context.relname().
Answer:
With the map function, output records from both the input relations, using the
join attribute value as the reduce key. The reduce function gets records from
both relations with matching join attribute values and outputs all matching
pairs.
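The approach above can be sketched in Python as follows. This is a single-process simulation, not code for a real MapReduce framework: `run_join` stands in for the framework's shuffle phase, and the relation name is passed to the map function as an argument in place of context.relname(). Tagging each record with its relation name is an addition made here so that the reduce function can tell the two inputs apart.

```python
from collections import defaultdict
from itertools import product


def map_fn(relname, tup, emit):
    # The join attribute A is the reduce key; each record is tagged with
    # the name of the relation it came from.
    emit(tup["A"], (relname, tup))


def reduce_fn(key, tagged_records, emit):
    # Separate the records by relation, then output every matching pair.
    r_tuples = [t for name, t in tagged_records if name == "r"]
    s_tuples = [t for name, t in tagged_records if name == "s"]
    for rt, st in product(r_tuples, s_tuples):
        emit((rt, st))


def run_join(r, s):
    """Single-process stand-in for the framework, including the shuffle."""
    groups = defaultdict(list)
    for relname, relation in (("r", r), ("s", s)):
        for tup in relation:
            map_fn(relname, tup, lambda k, v: groups[k].append(v))
    output = []
    for key, records in groups.items():
        reduce_fn(key, records, output.append)
    return output


r = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
s = [{"A": 1, "C": "p"}, {"A": 1, "C": "q"}, {"A": 3, "C": "z"}]
joined = run_join(r, s)
```

In this sample, only the key A = 1 appears in both relations, so the r tuple with A = 1 is paired with each of the two matching s tuples and the other tuples produce no output.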
10.5 What is the conceptual problem with the following snippet of Apache Spark
code meant to work on very large data? Note that the collect() function returns
a Java collection, and Java collections (from Java 8 onwards) support map and
reduce functions.
Answer:
The problem with the code is that the collect() function gathers the RDD data
at a single node, and the map and reduce functions are then executed on that
single node, not in parallel as intended.
10.6 Apache Spark:
a. How does Apache Spark perform computations in parallel?
b. Explain the statement: Apache Spark performs transformations on
RDDs in a lazy manner.
c. What are some of the benefits of lazy evaluation of operations in Apache
Spark?
Answer:
10.7 Given a collection of documents, for each word $w_i$, let $n_i$ denote the number of
times the word occurs in the collection. Let $N$ be the total number of word
occurrences across all documents. Next, consider all pairs of consecutive words
$(w_i, w_j)$ in the document; let $n_{i,j}$ denote the number of occurrences of the word
Give an equivalent query using normal SQL constructs, without using the
tumbling window operator. You can assume that the timestamp can be converted
to an integer value that represents the number of seconds elapsed since (say)
midnight, January 1, 1970, using the function to_seconds(timestamp). You can
also assume that the usual arithmetic functions are available, along with the
function floor(a) which returns the largest integer ≤ a.
Answer:
Divide to_seconds(timestamp) by 3600, take the floor, and group by the resulting
hour number. To output the timestamp of the window end, add 1 to the hour
number and multiply by 3600.
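The arithmetic can be checked with a short Python sketch. The event values and the choice of sum as the aggregate are assumptions made for illustration; timestamps are given directly as seconds since the epoch, as to_seconds(timestamp) would return them.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 3600  # one-hour tumbling windows


def window_end(ts_seconds):
    """End of the hourly window containing ts_seconds, in seconds."""
    hour = math.floor(ts_seconds / WINDOW_SECONDS)
    return (hour + 1) * WINDOW_SECONDS


def aggregate_by_window(events):
    """events: iterable of (ts_seconds, value). Returns {window_end: sum}."""
    totals = defaultdict(int)
    for ts, value in events:
        totals[window_end(ts)] += value   # plays the role of GROUP BY
    return dict(totals)


# Hypothetical events as (seconds since epoch, value).
events = [(10, 5), (3599, 1), (3600, 7), (7300, 2)]
totals = aggregate_by_window(events)
```

A timestamp that falls exactly on an hour boundary, such as 3600, belongs to the window that starts there, so its window end is 7200.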
10.9 Suppose you wish to model the university schema as a graph. For each of the
following relations, explain whether the relation would be modeled as a node
or as an edge:
(i) student, (ii) instructor, (iii) course, (iv) section, (v) takes, (vi) teaches.
Does the model capture connections between sections and courses?
Answer:
merged into the section relation and cannot be captured with the above schema.
It can be modeled if we create a separate relation that links sections to courses.