Lecture38 PDF
HDFS (Hadoop Distributed File System)
High-level approaches for specifying Hadoop jobs
Load
• ORDER
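-- Pig Latin string literals use single quotes
emp = LOAD 'input/Employees' USING PigStorage(',') AS
(name:chararray, age:int, zip:int, salary:double);
-- ascending by default; append DESC for descending order
sorted = ORDER emp BY salary;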
• LIMIT
emp = LOAD 'input/Employees' USING PigStorage(',') AS
(name:chararray, age:int, zip:int, salary:double);
agegroup = GROUP emp BY age;
-- keep at most 100 groups; which ones is not guaranteed without a prior ORDER
shortlist = LIMIT agegroup 100;
• JOIN
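emp = LOAD 'input/Employees' USING PigStorage(',') AS
(name:chararray, age:int, zip:int, salary:double);
pbk = LOAD 'input/Phonebook' USING PigStorage(',') AS
(name:chararray, phone:chararray);
-- inner equi-join on name; the result keeps both name fields (emp::name, pbk::name)
contact = JOIN emp BY name, pbk BY name;
-- DESCRIBE prints the schema of contact; DUMP prints its tuples
DESCRIBE contact;
DUMP contact;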
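The ORDER, GROUP/LIMIT, and JOIN statements above can be mimicked on a small in-memory data set to see what each relation holds. This is only a sketch in plain Python, not how Pig executes these operators. The sample rows are hypothetical, since the slides do not show the contents of input/Employees or input/Phonebook.

```python
from collections import defaultdict

# Hypothetical sample rows (name, age, zip, salary); not taken from the slides.
emp = [
    ("alice", 30, 77005, 85000.0),
    ("bob", 45, 77030, 62000.0),
    ("carol", 30, 77005, 91000.0),
]
# Hypothetical phonebook rows (name, phone).
pbk = [("alice", "555-0100"), ("carol", "555-0199"), ("dave", "555-0142")]

# ORDER emp BY salary: ascending sort on the salary field (index 3).
sorted_emp = sorted(emp, key=lambda row: row[3])

# GROUP emp BY age: one (key, bag of rows) pair per distinct age.
groups = defaultdict(list)
for row in emp:
    groups[row[1]].append(row)
agegroup = list(groups.items())

# LIMIT agegroup 100: keep at most 100 groups. Pig does not promise
# which ones; this sketch simply takes the first ones seen.
shortlist = agegroup[:100]

# JOIN emp BY name, pbk BY name: inner equi-join. Each output tuple
# concatenates the fields of both inputs, so name appears twice.
phones = defaultdict(list)
for name, phone in pbk:
    phones[name].append((name, phone))
contact = [e + p for e in emp for p in phones.get(e[0], [])]

print(sorted_emp)
print(shortlist)
print(contact)
```

Note that "bob" (no phonebook entry) and "dave" (no employee record) are absent from contact, as expected for an inner join.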
Hive Query Language
• SELECT
$ hive
hive> SELECT * FROM WetOnes;
• Hive supports ORDER BY, but its result differs from SQL's
• Only one Reduce step is applied and partial results are
broadcast and combined
• No need for any intermediate files
• This allows optimization to a single MapReduce step
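One way to picture the single Reduce step mentioned in the ORDER BY bullets: each map task sorts its own partition, and one reducer merges the sorted runs into a total order. This is a minimal sketch under that assumption, not Hive's actual implementation. The function names and salary values are hypothetical.

```python
import heapq

# Hypothetical salary values, split across three map tasks.
partitions = [[62000.0, 91000.0], [85000.0], [40000.0, 70000.0]]

def map_task(rows):
    """Sort one partition locally (the partial result)."""
    return sorted(rows)

def single_reducer(sorted_runs):
    """Merge all sorted partial results into one totally ordered list."""
    return list(heapq.merge(*sorted_runs))

ordered = single_reducer(map_task(p) for p in partitions)
print(ordered)
```

Because every row passes through the one reducer, the output is globally sorted, at the cost of that reducer becoming a bottleneck on large inputs.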