BigData-Assignment3-CSP 554
1. HDFS uses replication to ensure data availability and reliability by storing multiple copies of
each data block on different nodes, which also improves fault tolerance and throughput.
Dynamic replication schemes go further and adjust the replication factor according to data
popularity. When the replication factor is lowered, however, replica deletion can leave the data
unevenly distributed. This creates “hot spots” where some nodes are overutilized while
others are underutilized. To rebalance the distribution, the authors proposed WBRD
(Workload-Aware Balanced Replica Deletion), but it performs sub-optimally on clusters
with different hardware configurations because it does not account for node heterogeneity.
HaRD (Heterogeneity-aware Replica Deletion) addresses this gap. It evaluates each node’s
processing power from the number of containers the node runs simultaneously and takes this into
account when deleting replicas, so that more powerful nodes keep more replicas. This reduces
data transfer overhead, improves data locality and, as a result, improves performance.
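To make the idea concrete, here is a minimal sketch of a capability-aware deletion choice. It is not HaRD's actual algorithm: the function `pick_replica_to_delete`, the dictionaries, and the three-node cluster are all hypothetical, and the only assumption taken from the paper is that a node's capability is measured by its container count. The sketch deletes the replica from the node that holds the most blocks in excess of its capacity-proportional share.

```python
def pick_replica_to_delete(replica_nodes, blocks_per_node, containers_per_node):
    """Return the replica-holding node that is most overloaded relative to
    its share of the cluster's container capacity (illustrative only)."""
    total_blocks = sum(blocks_per_node.values())
    total_containers = sum(containers_per_node.values())

    def overload(node):
        # A node's target share of blocks is proportional to its container count.
        target = total_blocks * containers_per_node[node] / total_containers
        return blocks_per_node[node] - target

    # sorted() makes tie-breaking deterministic (alphabetical by node name).
    return max(sorted(replica_nodes), key=overload)


# Hypothetical cluster: "big" runs 8 containers, the two small nodes run 2 each.
containers = {"big": 8, "small1": 2, "small2": 2}
blocks = {"big": 40, "small1": 30, "small2": 20}
victim = pick_replica_to_delete(["big", "small1", "small2"], blocks, containers)
print(victim)  # small1: it holds 30 blocks against a target share of 15
```

Note that "big" holds the most blocks in absolute terms but is still below its proportional share (60), so a scheme that only balances block counts would pick it, while a capability-aware one would not.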
The paper also presents an extensive experimental evaluation on a 23-node Hadoop cluster,
comparing HaRD’s performance with default Hadoop and WBRD. HaRD reduces
average execution time by up to 60.3% compared to default Hadoop and 17% compared to WBRD.
The experiments covered read-intensive, write-intensive, and network-intensive
tasks. With many simultaneous users, HaRD also performs better, reducing
execution time by up to 60% compared to Hadoop. Finally, HaRD
improves data locality, which lowers network bandwidth usage compared to
Hadoop and WBRD (it showed a 6.9% lower network utilization), and it introduces negligible
overhead: the time required for replica deletion operations is small (on the order of
milliseconds).
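As a side calculation, the reported reductions in execution time can be converted into speedup factors. The helper `speedup_from_reduction` below is our own hypothetical name; only the two percentages come from the paper.

```python
def speedup_from_reduction(reduction_pct):
    """Convert a percentage reduction in execution time into a speedup factor."""
    remaining = 1.0 - reduction_pct / 100.0  # fraction of the baseline time still needed
    return 1.0 / remaining


for label, pct in [("vs. default Hadoop", 60.3), ("vs. WBRD", 17.0)]:
    print(f"{label}: {pct}% less time = {speedup_from_reduction(pct):.2f}x speedup")
```

A 60.3% reduction therefore corresponds to running about 2.5 times faster in the best case, and a 17% reduction to about 1.2 times faster.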
To conclude, HaRD deletes replicas effectively, adapts dynamically to changes in node
capabilities, and can be installed very easily.
2. Copy of the program WordCount2.py :
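from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            first_letter = word[0].lower()  # assumption: the comparison ignores case
            if 'a' <= first_letter <= 'n':
                yield 'a_to_n', 1
            else:
                yield 'other', 1
    def reducer(self, key, counts):
        yield key, sum(counts)  # total number of words in each bucket

if __name__ == '__main__':
    MRWordCount.run()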
Screenshot of the output :
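The bucketing rule implied by the keys `a_to_n` and `other` can also be checked locally without Hadoop or mrjob. The sketch below is plain Python. The helpers `bucket` and `count_buckets` and the sample lines are hypothetical, and it assumes the first letter is compared case-insensitively.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def bucket(word):
    """Map a word to 'a_to_n' or 'other' based on its first character."""
    first_letter = word[0].lower()
    return 'a_to_n' if 'a' <= first_letter <= 'n' else 'other'


def count_buckets(lines):
    """Simulate the map and reduce steps in memory."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[bucket(word)] += 1
    return dict(counts)


sample = ["Hadoop stores big data", "the quick zebra"]
result = count_buckets(sample)
print(result)  # {'a_to_n': 3, 'other': 4}
```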
3. Copy of the program Salaries2.py :
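from mrjob.job import MRJob

class MRSalaries(MRJob):
    def mapper(self, _, line):
        (name,jobTitle,agencyID,agency,hireDate,annualSalary,grossPay) = line.split('\t')
        annualSalary = float(annualSalary)
        # Assumed thresholds: the original conditions are missing from this listing.
        if annualSalary >= 100000.00:
            yield 'High', 1
        elif annualSalary >= 50000.00:
            yield 'Medium', 1
        else:
            yield 'Low', 1
    def reducer(self, key, counts):
        yield key, sum(counts)  # number of employees in each salary band

if __name__ == '__main__':
    MRSalaries.run()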
Screenshot of the output :
4. Copy of the program movie.py :
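from mrjob.job import MRJob
class MRUserMovieReview(MRJob):
    def mapper(self, _, line):
        user_id = line.split('\t')[0]  # assumption: tab-separated input, user ID first
        yield user_id, 1
    def reducer(self, user_id, counts):
        yield user_id, sum(counts)  # number of reviews written by each user
if __name__ == '__main__':
    MRUserMovieReview.run()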
Screenshot of the output :
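The per-user count can be reproduced in plain Python to sanity-check the job's output on a small sample. This is only a sketch: the helper `reviews_per_user` and the sample rows are hypothetical, and it assumes each input line is tab-separated with the user ID in the first field (pass a different `sep` if the file uses another delimiter).

```python
from collections import Counter


def reviews_per_user(lines, sep='\t'):
    """Count how many review lines each user ID appears in."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id = line.split(sep)[0]  # assumption: user ID is the first field
        counts[user_id] += 1
    return dict(counts)


# Hypothetical rows in the form: user ID, movie ID, rating
sample = ["u1\tm1\t4", "u2\tm1\t3", "u1\tm2\t5", ""]
result = reviews_per_user(sample)
print(result)  # {'u1': 2, 'u2': 1}
```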