The document contains practice exercises related to storing and querying large datasets using distributed systems and Apache Spark. The exercises cover topics like key-value stores, document stores, joins, and window functions.

Uploaded by

atik
Copyright
© © All Rights Reserved
CHAPTER

10
Big Data
Practice Exercises
10.1 Suppose you need to store a very large number of small files, each of size say 2
kilobytes. If your choice is between a distributed file system and a distributed
key-value store, which would you prefer, and explain why.
Answer:
The key-value store, since the distributed file system is designed to store a
moderate number of large files. With each file block being multiple megabytes,
kilobyte-sized files would result in a lot of wasted space in each block and poor
storage performance.
10.2 Suppose you need to store data for a very large number of students in a
distributed document store such as MongoDB. Suppose also that the data for
each student correspond to the data in the student and the takes relations.
How would you represent the above data about students, ensuring that all the
data for a particular student can be accessed efficiently? Give an example of
the data representation for one student.
Answer:
We would store the student data as a JSON object, with the takes tuples for
the student stored as a JSON array of objects, each object corresponding to a
single takes tuple. Give example ...
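
One possible shape for such a document is sketched below in Java, with nested maps and lists standing in for the JSON object and its takes array. The attribute names (ID, name, dept_name, tot_cred, course_id, sec_id, semester, year, grade) and all sample values are assumptions made for illustration; the exercise itself does not list them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StudentDocumentSketch {
    // One student document: the student attributes plus an embedded array of
    // takes tuples. All names and values here are hypothetical sample data.
    static Map<String, Object> sampleStudent() {
        return Map.of(
            "ID", "12345",
            "name", "Shankar",
            "dept_name", "Comp. Sci.",
            "tot_cred", 32,
            "takes", List.of(
                Map.of("course_id", "CS-101", "sec_id", "1",
                       "semester", "Fall", "year", 2017, "grade", "C"),
                Map.of("course_id", "CS-190", "sec_id", "2",
                       "semester", "Spring", "year", 2017, "grade", "A")));
    }

    // Everything about the student, including the courses taken, is reachable
    // from the single document, so no second lookup is needed.
    @SuppressWarnings("unchecked")
    static List<String> courseIds(Map<String, Object> student) {
        List<String> ids = new ArrayList<>();
        for (Map<String, Object> t : (List<Map<String, Object>>) student.get("takes")) {
            ids.add((String) t.get("course_id"));
        }
        return ids;
    }
}
```

Embedding the takes tuples in the student document trades some redundancy for the ability to fetch all data for one student with a single key lookup.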
10.3 Suppose you wish to store utility bills for a large number of users, where each
bill is identified by a customer ID and a date. How would you store the bills in
a key-value store that supports range queries, if queries request the bills of a
specified customer for a specified date range.
Answer:
Create a key by concatenating the customer ID and date (with date represented
in the form year/month/date, e.g., 2018/02/28) and store the records indexed
on this key. Now the required records can be retrieved by a range query.
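
The key design above can be tried out with a sorted in-memory map. In the sketch below, java.util.TreeMap stands in for a key-value store with range queries; the class BillStore, its methods, and the "/" separator between customer ID and date are hypothetical. It assumes customer IDs never contain "/" and that dates are zero-padded, so that string order matches date order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class BillStore {
    // Sorted map standing in for a key-value store that supports range queries.
    private final TreeMap<String, String> store = new TreeMap<>();

    // Composite key: customer ID, then the date as year/month/day (zero-padded).
    static String key(String customerId, String date) {
        return customerId + "/" + date;
    }

    void put(String customerId, String date, String bill) {
        store.put(key(customerId, date), bill);
    }

    // Range query over [from, to], both inclusive, for one customer.
    List<String> billsInRange(String customerId, String from, String to) {
        return new ArrayList<>(
            store.subMap(key(customerId, from), true, key(customerId, to), true).values());
    }
}
```

Because both range bounds share the prefix formed by the customer ID and the separator, every key inside the range belongs to that customer.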


10.4 Give pseudocode for computing a join $r \bowtie_{r.A = s.A} s$ using a single MapReduce
step, assuming that the map() function is invoked on each tuple of r and s.
Assume that the map() function can find the name of the relation using
context.relname().
Answer:

With the map function, output records from both the input relations, using the
join attribute value as the reduce key. The reduce function gets records from
both relations with matching join attribute values and outputs all matching
pairs.
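
The description above can be sketched as a single-process Java simulation. This is not code for Hadoop or any other MapReduce framework: the shuffle is imitated with a sorted map, and the class, the Tagged type, and the joinIndex parameter are hypothetical. Tagging each record with its relation name (what context.relname() would supply) is how the reduce step tells the two inputs apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceJoinSketch {
    // A tuple tagged with the name of the relation it came from.
    static final class Tagged {
        final String relname;
        final String[] tuple;
        Tagged(String relname, String[] tuple) {
            this.relname = relname;
            this.tuple = tuple;
        }
    }

    // map(): emit (value of join attribute A, tagged tuple).
    static void map(String relname, String[] tuple, int joinIndex,
                    Map<String, List<Tagged>> shuffle) {
        shuffle.computeIfAbsent(tuple[joinIndex], k -> new ArrayList<>())
               .add(new Tagged(relname, tuple));
    }

    // reduce(): split the values by relation and output every (r, s) pair.
    static List<String> reduce(List<Tagged> values) {
        List<String[]> rTuples = new ArrayList<>();
        List<String[]> sTuples = new ArrayList<>();
        for (Tagged v : values) {
            if (v.relname.equals("r")) {
                rTuples.add(v.tuple);
            } else {
                sTuples.add(v.tuple);
            }
        }
        List<String> out = new ArrayList<>();
        for (String[] rt : rTuples) {
            for (String[] st : sTuples) {
                out.add(String.join(",", rt) + "|" + String.join(",", st));
            }
        }
        return out;
    }

    // Driver imitating the framework: map every tuple, group by key, reduce.
    static List<String> join(List<String[]> r, int rJoinIndex,
                             List<String[]> s, int sJoinIndex) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        for (String[] t : r) map("r", t, rJoinIndex, shuffle);
        for (String[] t : s) map("s", t, sJoinIndex, shuffle);
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : shuffle.values()) out.addAll(reduce(group));
        return out;
    }
}
```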
10.5 What is the conceptual problem with the following snippet of Apache Spark
code meant to work on very large data. Note that the collect() function returns
a Java collection, and Java collections (from Java 8 onwards) support map and
reduce functions.

JavaRDD<String> lines = sc.textFile("logDirectory");
int totalLength = lines.collect().map(s -> s.length())
                       .reduce(0, (a,b) -> a+b);

Answer:
The problem with the code is that the collect() function gathers the RDD data
at a single node, and the map and reduce functions are then executed on that
single node, not in parallel as intended. Applying map() and reduce() directly
to the RDD, without calling collect() first, would keep the computation parallel.
10.6 Apache Spark:
a. How does Apache Spark perform computations in parallel?
b. Explain the statement: “Apache Spark performs transformations on
RDDs in a lazy manner.”
c. What are some of the benefits of lazy evaluation of operations in Apache
Spark?

Answer:

a. RDDs are stored partitioned across multiple nodes. Each of the
transformation operations on an RDD is executed in parallel on multiple
nodes.
b. Transformations are not executed immediately but postponed until the
result is required for functions such as collect() or saveAsTextFile().
c. The operations are organized into a tree, and query optimization can
be applied to the tree to speed up computation. Also, answers can be
pipelined from one operation to another, without being written to disk,
to reduce time overheads of disk storage.
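
Lazy evaluation as described in part b can be observed on a single machine with Java streams, which also defer intermediate operations until a terminal operation asks for a result. The sketch below is only an analogy and does not use Spark; the class name and the call counter are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class LazyEvaluationAnalogy {
    // Returns {map calls before the terminal operation, total length, map calls after}.
    static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        // Declaring the transformation does not execute it.
        Stream<Integer> lengths = List.of("a", "bb", "ccc").stream()
            .map(s -> {
                calls.incrementAndGet();
                return s.length();
            });
        int before = calls.get();
        // The terminal operation triggers the postponed map calls.
        int total = lengths.reduce(0, Integer::sum);
        int after = calls.get();
        return new int[] {before, total, after};
    }
}
```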

10.7 Given a collection of documents, for each word $w_i$, let $n_i$ denote the number of
times the word occurs in the collection. Let $N$ be the total number of word
occurrences across all documents. Next, consider all pairs of consecutive words
$(w_i, w_j)$ in the document; let $n_{i,j}$ denote the number of occurrences of the word
pair $(w_i, w_j)$ across all documents.

Write an Apache Spark program that, given a collection of documents in a
directory, computes $N$, all pairs $(w_i, n_i)$, and all pairs $((w_i, w_j), n_{i,j})$. Then output
all word pairs such that $n_{i,j}/N \geq 10 \times (n_i/N) \times (n_j/N)$. These are word pairs
that occur 10 times or more as frequently as they would be expected to occur
if the two words occurred independently of each other.
You will find the join operation on RDDs useful for the last step, to bring
related counts together. For simplicity, do not bother about word pairs that
cross lines. Also assume for simplicity that words only occur in lowercase and
that there are no punctuation marks.
Answer:
FILL IN ANSWER (available with SS)
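
To pin down the counts and the filter condition, the following single-machine Java sketch works on an in-memory list of lines. It is not the requested Spark program and not the textbook's answer; the class and method names are hypothetical. Multiplying both sides of the condition by $N^2$ gives the integer test $n_{i,j} \times N \geq 10 \times n_i \times n_j$, which avoids floating-point division.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordPairSketch {
    // Returns the qualifying word pairs as "wi wj" strings, sorted alphabetically.
    static List<String> frequentPairs(List<String> lines) {
        Map<String, Long> wordCount = new HashMap<>();  // n_i
        Map<String, Long> pairCount = new HashMap<>();  // n_{i,j}
        long n = 0;                                     // N
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i < words.length; i++) {
                wordCount.merge(words[i], 1L, Long::sum);
                n++;
                // Pairs are formed within a line only, as the exercise allows.
                if (i + 1 < words.length) {
                    pairCount.merge(words[i] + " " + words[i + 1], 1L, Long::sum);
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : pairCount.entrySet()) {
            String[] pair = e.getKey().split(" ");
            long nij = e.getValue();
            long ni = wordCount.get(pair[0]);
            long nj = wordCount.get(pair[1]);
            // n_ij / N >= 10 * (n_i / N) * (n_j / N), rewritten without division.
            if (nij * n >= 10 * ni * nj) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

In a Spark version, the two count maps would correspond to pair RDDs, and the lookups of $n_i$ and $n_j$ for each pair would become the joins that the exercise hints at.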
10.8 Consider the following query using the tumbling window operator:

select item, System.Timestamp as window_end, sum(amount)
from order timestamp by datetime
group by itemid, tumblingwindow(hour, 1)

Give an equivalent query using normal SQL constructs, without using the
tumbling window operator. You can assume that the timestamp can be converted
to an integer value that represents the number of seconds elapsed since (say)
midnight, January 1, 1970, using the function to_seconds(timestamp). You can
also assume that the usual arithmetic functions are available, along with the
function floor(a) which returns the largest integer ≤ a.
Answer:
Divide to_seconds(datetime) by 3600 and take the floor to get the hour number,
then group by itemid and that hour number. To output the timestamp of the
window end, add 1 to the hour number and multiply by 3600.
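
The arithmetic in this answer, which in SQL would be the expression (floor(to_seconds(datetime)/3600) + 1) * 3600, can be checked with a small Java sketch. The class and method names are hypothetical, and the input is assumed to already be the number of seconds that to_seconds() would return.

```java
final class TumblingWindowSketch {
    static final long WINDOW_SECONDS = 3600;  // one-hour tumbling window

    // Hour number of the window containing the timestamp: floor(seconds / 3600).
    static long windowNumber(long epochSeconds) {
        return Math.floorDiv(epochSeconds, WINDOW_SECONDS);
    }

    // Window end in seconds: (hour number + 1) * 3600.
    static long windowEnd(long epochSeconds) {
        return (windowNumber(epochSeconds) + 1) * WINDOW_SECONDS;
    }
}
```

With this formula a timestamp that falls exactly on the hour is assigned to the window that starts at that instant.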
10.9 Suppose you wish to model the university schema as a graph. For each of the
following relations, explain whether the relation would be modeled as a node
or as an edge:
(i) student, (ii) instructor, (iii) course, (iv) section, (v) takes, (vi) teaches.
Does the model capture connections between sections and courses?
Answer:

Each relation corresponding to an entity (student, instructor, course, and
section) would be modeled as a node. Takes and teaches would be modeled as
edges. There is a further edge between course and section, which has been
merged into the section relation and cannot be captured with the above schema.
It can be modeled if we create a separate relation that links sections to courses.
