2023MCS320004 HEMANTH TARRA - Assignment 9

Task 1: Word Count Problem

Write a PIG Latin program to count the frequency of the words in the document
(iiitkottayam.txt) using the Hadoop framework.

a. Text file iiitkottayam.txt was already moved to the Hadoop filesystem (used
   for Assignment 2).
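
If the file still needs to be copied into HDFS, a typical command is shown below
(it assumes the file is in the current local directory and that the target
directory /user/hadoop already exists):

hdfs dfs -put iiitkottayam.txt /user/hadoop/iiitkottayam.txt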

• Enter PIG's interactive shell (Grunt) in MapReduce mode, so that the
  statements run as jobs on the Hadoop cluster rather than locally:

pig -x mapreduce

• Load the Data:


text = LOAD '/user/hadoop/iiitkottayam.txt' AS (line:chararray);

• Split each line into individual words: TOKENIZE splits the line (on whitespace
  and similar delimiters) into a bag of words, and FLATTEN turns that bag into
  one record per word.


words = FOREACH text GENERATE FLATTEN(TOKENIZE(line)) AS word;

• Filter Out Any Null or Blank Words:


clean_words = FILTER words BY word IS NOT NULL AND word != '';

• Group by each word and count the occurrences.

word_group = GROUP clean_words BY word;

word_count = FOREACH word_group GENERATE group AS word, COUNT(clean_words) AS frequency;

• Use DUMP to check intermediate results if required.

DUMP word_count;

• Save the results to an HDFS directory.

STORE word_count INTO '/user/hadoop/iiitkottayam_wordcount_output' USING PigStorage(',');
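
To inspect the stored result from the command line after leaving the grunt
shell, the part files in the output directory can be listed and printed (the
wildcard is used because the exact part-file names depend on how the job ran):

hdfs dfs -ls /user/hadoop/iiitkottayam_wordcount_output
hdfs dfs -cat /user/hadoop/iiitkottayam_wordcount_output/part-*
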
Task 2: Data Analytics using PIG

Create cities.txt using nano and place the file in the Hadoop filesystem.
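
The assignment does not show the contents of cities.txt; a hypothetical example
in the comma-separated format expected by the LOAD statement below
(name,country,population; the figures are illustrative only) could look like:

Mumbai,India,20400000
Delhi,India,31800000
Tokyo,Japan,37400000
Osaka,Japan,19200000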

1. Load the data


cities = LOAD '/user/hadoop/cities.txt' USING PigStorage(',')
         AS (name:chararray, country:chararray, population:int);

2. Group Data by Country:


grouped_cities = GROUP cities BY country;

3. Find the Most Populated City in Each Country (an alternative using PIG's
   built-in TOP function is sketched after these steps)

max_population_cities = FOREACH grouped_cities {
    sorted = ORDER cities BY population DESC;
    top_city = LIMIT sorted 1;
    GENERATE FLATTEN(top_city);
};

4. View the results:

DUMP max_population_cities;
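
As an alternative to the nested FOREACH in step 3, PIG's built-in TOP function
can pick the highest-population tuple in each group directly. This is only a
sketch; it assumes population is the third field (0-based index 2) of cities,
as in the LOAD statement above:

-- TOP(n, column, relation) returns a bag holding the n tuples with the largest
-- value in the given 0-based column; here, the single most populated city.
max_population_cities = FOREACH grouped_cities GENERATE FLATTEN(TOP(1, 2, cities));
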
Task 3: Perform a JOIN operation using multiple keys
1. Save employee.txt and employee_contact.txt in the local filesystem and move
   them to the Hadoop filesystem.

2. Create the data sets from the files

employee = LOAD '/user/hadoop/employee.txt' USING PigStorage(',')
           AS (id:chararray, firstname:chararray, lastname:chararray, age:int,
               post:chararray, jobid:int);

employee_contact = LOAD '/user/hadoop/employee_contact.txt' USING PigStorage(',')
                   AS (id:chararray, mobileno:chararray, mail:chararray, age:int,
                       city:chararray, jobid:int);

3. Perform JOIN on multiple keys (id and jobid)

joined_data = JOIN employee BY (id, jobid), employee_contact BY (id, jobid);

4. Select and rename the needed fields (the :: prefix disambiguates fields that
   exist in both relations)

result = FOREACH joined_data GENERATE
    employee::id AS id,
    employee::firstname AS firstname,
    employee::lastname AS lastname,
    employee::age AS age,
    employee::post AS post,
    employee::jobid AS jobid,
    employee_contact::mobileno AS mobileno,
    employee_contact::mail AS mail,
    employee_contact::city AS city;

5. View the result

DUMP result;
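
If the joined result also needs to be saved back to HDFS (this is not part of
the assignment steps), it can be stored the same way as in Task 1; the output
path below is only an assumed example:

STORE result INTO '/user/hadoop/employee_join_output' USING PigStorage(',');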
