BigData-Assignment3-CSP 554

Uploaded by

emile.mondon.r

CSP 554 – Big Data Technologies : Assignment 3

1. HDFS uses replication to ensure data availability and reliability by storing multiple copies of
each data block on different nodes, which also improves fault tolerance and throughput. The
replication factor can be adjusted dynamically according to data popularity, but removing replicas
when the factor is lowered can leave the data unevenly distributed. The result is “hot spots”: some
nodes are overutilized while others are underutilized. To even out the distribution, the authors
proposed WBRD (Workload-Aware Balanced Replica Deletion), but it did not work as expected: because
it does not account for node heterogeneity, it is inefficient on clusters whose nodes have different
hardware configurations. HaRD takes this heterogeneity into account. It estimates each node’s
processing power from the number of containers the node runs simultaneously and distributes
replicas accordingly. This minimizes data transfer overhead, improves data locality and,
ultimately, improves performance.
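
As an illustration of the idea only, and not the algorithm from the paper, the sketch below assumes that a node's processing power is approximated by its container count and that, when one replica of a block must be removed, it is taken from the node holding the most blocks per container. The names `Node`, `load_ratio`, and `pick_node_for_deletion`, as well as the sample figures, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    containers: int  # assumed proxy for the node's processing power
    blocks: int      # replicas currently stored on the node


def load_ratio(node: Node) -> float:
    """Blocks stored per container; a higher value means a more loaded node."""
    return node.blocks / node.containers


def pick_node_for_deletion(holders: List[Node]) -> Node:
    """Among the nodes holding a replica, pick the most loaded one."""
    return max(holders, key=load_ratio)


# Hypothetical cluster: three nodes that all hold a replica of the same block.
holders = [
    Node("small", containers=2, blocks=30),   # 15.0 blocks per container
    Node("medium", containers=4, blocks=40),  # 10.0 blocks per container
    Node("large", containers=8, blocks=60),   # 7.5 blocks per container
]

victim = pick_node_for_deletion(holders)
victim.blocks -= 1  # the surplus replica is removed from this node
print(victim.name)  # small
```

A policy that looked only at raw block counts would instead delete from `large`, the node best able to process the data it stores.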

The paper also presents an extensive experimental evaluation on a 23-node Hadoop cluster,
comparing HaRD’s performance to default Hadoop and WBRD. HaRD reduces average execution time by
up to 60.3% compared to default Hadoop and 17% compared to WBRD. The experiments covered
read-intensive, write-intensive, and network-intensive tasks. With many simultaneous users, HaRD
also performs better, reducing execution time by up to 60% compared to Hadoop. Finally, HaRD
improves data locality, which lowers network bandwidth usage compared to Hadoop and WBRD (it
showed a 6.9% lower network utilization), and it introduces negligible overhead: replica deletion
operations take on the order of milliseconds.

To conclude, HaRD deletes replicas effectively, adapts dynamically to changes in node
capabilities, and can be installed very easily.
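
2. Copy of the program WordCount2.py :

from mrjob.job import MRJob
import re

# A word is a run of letters, digits, underscores, or apostrophes.
WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit one count per word, bucketed by its first character.
        for word in WORD_RE.findall(line):
            first_letter = word[0]
            # Case-sensitive: only lowercase 'a'..'n' fall into the first bucket.
            if 'a' <= first_letter <= 'n':
                yield 'a_to_n', 1
            else:
                yield 'other', 1

    def combiner(self, word, counts):
        # Pre-sum the counts on the mapper side to reduce shuffled data.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Final total per bucket.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
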
Screenshot of the output :
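
The bucketing logic of WordCount2.py can be checked on a small sample without Hadoop or mrjob. The sketch below is a plain-Python stand-in for the map, group, and reduce steps; `run_local` and the sample lines are hypothetical and not part of the assignment.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit one (bucket, 1) pair per word, using the same test as WordCount2.py."""
    for word in WORD_RE.findall(line):
        if 'a' <= word[0] <= 'n':
            yield 'a_to_n', 1
        else:
            yield 'other', 1


def run_local(lines):
    """Group the mapper output by key, then sum each group (the reduce step)."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


sample = ["big data needs hadoop", "Apple pie and zebra"]
result = run_local(sample)
print(result)  # {'a_to_n': 5, 'other': 3}
```

Note that "Apple" is counted under 'other' because the comparison is case-sensitive; lowercasing the first letter before the test would change that, if the assignment calls for it.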

3. Copy of the program Salaries2.py :

from mrjob.job import MRJob

class MRSalaries(MRJob):

    def mapper(self, _, line):
        # Each input line holds seven tab-separated fields.
        (name, jobTitle, agencyID, agency, hireDate, annualSalary, grossPay) = line.split('\t')
        annualSalary = float(annualSalary)
        # Chained lower bounds leave no gap between the Medium and High ranges.
        if annualSalary >= 100000.00:
            yield 'High', 1
        elif annualSalary >= 50000.00:
            yield 'Medium', 1
        else:
            yield 'Low', 1

    def combiner(self, salary_cat, counts):
        # Pre-sum the counts on the mapper side to reduce shuffled data.
        yield salary_cat, sum(counts)

    def reducer(self, salary_cat, counts):
        # Final total per salary category.
        yield salary_cat, sum(counts)

if __name__ == '__main__':
    MRSalaries.run()

Screenshot of the output :
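
The salary thresholds can be checked locally at their boundaries. This is a minimal sketch: the helper `categorize` is hypothetical and the rows are invented for illustration, following the seven tab-separated fields the mapper expects. It uses chained lower bounds, which agree with the ranges in Salaries2.py for salaries given to two decimal places.

```python
from collections import Counter


def categorize(annual_salary):
    """Return the salary bucket: High >= 100000, Medium >= 50000, else Low."""
    if annual_salary >= 100000:
        return 'High'
    if annual_salary >= 50000:
        return 'Medium'
    return 'Low'


# Invented rows: name, jobTitle, agencyID, agency, hireDate, annualSalary, grossPay
rows = [
    "Doe,Jane\tEngineer\tA01\tPublic Works\t01/02/2010\t100000.00\t98000.00",
    "Roe,Rick\tAnalyst\tA02\tFinance\t03/04/2012\t99999.99\t97000.00",
    "Poe,Pat\tClerk\tA03\tLibrary\t05/06/2014\t50000.00\t49000.00",
    "Moe,Max\tAide\tA04\tParks\t07/08/2016\t49999.99\t48000.00",
]

# Field index 5 is annualSalary.
counts = Counter(categorize(float(row.split('\t')[5])) for row in rows)
print(dict(counts))  # {'High': 1, 'Medium': 2, 'Low': 1}
```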
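
4. Copy of the program movie.py :

from mrjob.job import MRJob

class MRUserMovieReview(MRJob):

    def mapper(self, _, line):
        # Each input line is "user_id,movie_id,rating,timestamp".
        (user_id, movie_id, rating, timestamp) = line.split(',')
        # One count per review, keyed by the reviewing user.
        yield user_id, 1

    def combiner(self, user_id, counts):
        # Pre-sum the counts on the mapper side to reduce shuffled data.
        yield user_id, sum(counts)

    def reducer(self, user_id, counts):
        # Total number of reviews per user.
        yield user_id, sum(counts)

if __name__ == '__main__':
    MRUserMovieReview.run()
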
Screenshot of the output :
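
In all three programs the combiner repeats the reducer's `sum`. This is safe because addition is associative and commutative, so partial sums computed per input split add up to the same totals. The sketch below shows this in plain Python for the review counts; `map_and_combine`, `reduce_partials`, and the sample rows are hypothetical.

```python
from collections import Counter


def map_and_combine(split):
    """Map each 'user,movie,rating,timestamp' line to (user_id, 1) and pre-sum locally."""
    partial = Counter()
    for line in split:
        user_id, movie_id, rating, timestamp = line.split(',')
        partial[user_id] += 1
    return partial


def reduce_partials(partials):
    """Sum the partial counts per user, as the reducer does."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Invented rows spread over two input splits.
split_a = ["1,31,2.5,1260759144", "1,1029,3.0,1260759179", "2,10,4.0,835355493"]
split_b = ["2,17,5.0,835355681", "1,1061,3.0,1260759182"]

# With a combiner: one partial result per split.
with_combiner = reduce_partials([map_and_combine(split_a), map_and_combine(split_b)])
# Without a combiner: every line reaches the reducer as its own count of 1.
without_combiner = reduce_partials([map_and_combine([line]) for line in split_a + split_b])

print(with_combiner)  # {'1': 3, '2': 2}
```

Both paths give the same totals, which is why the job's result does not depend on whether, or how often, the combiner runs.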
