LABORATORY MANUAL
B.Tech., Semester –VI
Subject Code: KCS-651
Roll. No.:
Group/Branch:
JSS MAHAVIDYAPEETHA
DEPARTMENT OF INFORMATION TECHNOLOGY
JSS ACADEMY OF TECHNICAL EDUCATION
C-20/1, SECTOR-62, NOIDA
Big Data and Analytics Lab (KCS-651)
Table of Contents
1. Vision and Mission of the Institute
2. Vision and Mission of the Department
3. Programme Educational Objectives (PEOs)
4. Programme Outcomes (POs)
5. Programme Specific Outcomes (PSOs)
6. University Syllabus
7. Course Outcomes (COs)
8. CO- PO and CO-PSO mapping
9. Course Overview
10. List of Experiments
11. DOs and DON’Ts
12. General Safety Precautions
13. Guidelines for Students for Report Preparation
14. Lab Assessment Criteria
15. Details of Conducted Experiments
16. Lab Experiments
Mission:
M1: Develop a platform for achieving globally acceptable level of intellectual acumen and
technological competence.
M2: Create an inspiring ambience that raises the motivation level for conducting quality
research.
M3: Provide an environment for acquiring ethical values and positive attitude.
Vision:
“To become a Centre of Excellence in teaching and research in Information Technology for
producing skilled professionals having a zeal to serve society”
Mission:
M1: To create an environment where students can be equipped with strong fundamental
concepts, programming and problem solving skills.
M2: To provide an exposure to emerging technologies by providing hands on experience for
generating competent professionals.
M3: To promote Research and Development in the frontier areas of Information Technology
and encourage students for pursuing higher education
M4: To inculcate in students ethics, professional values, team work and leadership skills.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend
and write effective reports and design documentation, make effective presentations, and
give and receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
PSO2: Design, implement and evaluate processes, components and/or programs using modern
techniques, skills and tools of core Information Technologies to effectively integrate
secure IT-based solutions into the user environment.
PSO3: Develop impactful IT solutions by using research based knowledge and research
methods in the fields of integration, interface issues, security & assurance and
implementation.
University Syllabus
1. Downloading and installing Hadoop; Understanding different Hadoop modes. Startup
scripts, Configuration files.
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and
copies them into HDFS using one of the above command line utilities
4. Write a Map Reduce program that mines weather data. Hint: Weather sensors collecting
data every hour at many locations across the globe gather a large volume of log data,
which is a good candidate for analysis with Map Reduce, since it is semi structured and
record-oriented
5. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
10. Write PIG Commands: Write Pig Latin scripts to sort, group, join, project, and filter your
data.
12. Run the Pig Latin Scripts to find a max temp for each and every year.
CO1: Optimize business decisions and create competitive advantage with Big Data analytics.
CO2: Know the Java concepts required for developing MapReduce programs.
CO3: Implement the architectural concepts of Hadoop and introduce the MapReduce paradigm.
CO4: Demonstrate PIG and HIVE in the Hadoop ecosystem and Hadoop development.
CO5: Implement best practices for Hadoop development.
CO-PO Mapping
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 3 2 3 3 2 2 1 2 2
CO2 2 3 2 3 3 2 2 1 2 2
CO3 2 3 2 3 3 2 2 1 2 2
CO4 2 3 2 3 3 2 2 1 2 2
CO5 2 3 2 3 3 2 2 1 2 2
COs 2 3 2 3 3 2 2 1 2 2
CO-PSO Mapping
Course Overview
The ability to analyze more data at a faster rate can provide big benefits to
an organization, allowing it to more efficiently use data to answer important
questions. Big data analytics is important because it lets organizations use colossal
amounts of data in multiple formats from multiple sources to identify opportunities
and risks, helping organizations move quickly and improve their bottom lines. Big
data analytics is the process of collecting, examining, and analysing large amounts of
data to discover market trends, insights, and patterns that can help companies make
better business decisions. This information is available quickly and efficiently so that
companies can be agile in crafting plans to maintain their competitive
advantage. Technologies such as business intelligence (BI) tools and systems help
organisations take unstructured and structured data from multiple sources.
List of Experiments
Sl. No.  Program Name  Course Outcome
1. Downloading and installing Hadoop; Understanding different Hadoop modes. Startup scripts,
Configuration files. (CO3)
2. Implement the following file management tasks in Hadoop: i. Adding files and directories
ii. Retrieving files iii. Deleting files.
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them
into HDFS using one of the above command line utilities. HDFS various commands: touch, touchz,
cat, moveFromLocal, moveToLocal, copyFromLocal, copyToLocal, tail, expunge, du, df, chown,
chgrp, setrep, chmod, appendToFile, checksum, count, find, stat, test, text, getmerge, help, etc. (CO3)
9. Practice importing and exporting data from various databases. (CO4)
10. Write PIG Commands: Write Pig Latin scripts to sort, group, join, project, and filter your data. (CO4)
11. Run the Pig Latin Scripts to find Word Count. (CO4)
12. Run the Pig Latin Scripts to find a max temp for each and every year. (CO4)
DON’Ts
1. Do not share your username and password.
2. Do not remove or disconnect cables or hardware parts.
3. Do not personalize the computer setting.
4. Do not run programs that continue to execute after you log off.
5. Do not download or install any programs, games or music on computer in Lab.
6. Personal Internet use, chat rooms for Instant Messaging (IM), and similar sites are strictly
prohibited.
7. No Internet gaming activities allowed.
8. Tea, Coffee, Water & Eatables are not allowed in the Computer Lab.
Note:
1. Students must bring their lab record along with them whenever they come for the lab.
LAB EXPERIMENTS
LAB EXPERIMENT 1
STEP 1: Setup Configuration
Edit core-site.xml
$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/cse/hdata</value>
</property>
</configuration>
Edit hdfs-site.xml
$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.data.dir</name>
<value>/home/cse/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Edit mapred-site.xml
$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
# Add the below lines in this file (between <configuration> and </configuration>)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Edit yarn-site.xml
$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
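After saving the configuration files, the NameNode is formatted and the Hadoop daemons are started before
verifying the installation. A minimal sketch of the usual sequence for a single-node setup (assuming the Hadoop
bin and sbin directories are on the PATH) is:
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
$ jps
The jps output should then list the running daemons, as shown below.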
6775 DataNode
7209 ResourceManager
7017 SecondaryNameNode
6651 NameNode
7339 NodeManager
7663 Jps
e) Visit http://localhost:9870 in a browser.
Result: Hadoop was downloaded and installed, the different Hadoop modes were understood, and the startup
scripts and configuration files were successfully set up.
LAB EXPERIMENT 2
BRIEF DESCRIPTION:
HDFS is a scalable distributed file system designed to scale to petabytes of data while running on
top of the underlying file system of the operating system. HDFS keeps track of where the data
resides in a network by associating the name of its rack (or network switch) with the dataset. This
allows Hadoop to efficiently schedule tasks to those nodes that contain data, or which are nearest to
it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work
similarly to the Linux file commands, and serve as your primary interface with HDFS.
We're going to have a look into HDFS by interacting with it from the command line.
We will take a look at the most common file management tasks in Hadoop, which include:
1. Adding files and directories to HDFS
2. Retrieving files from HDFS to local file system
3. Deleting files from HDFS
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as
the data nodes as a cluster.
$ start-dfs.sh
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Transfer and store a data file from the local system to the Hadoop file system using the put
command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3: You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Step 4: Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for
retrieving the required file from the Hadoop file system.
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Step 5: Deleting Files from HDFS
$ hadoop fs -rm file.txt
Step 6: Shutting Down the HDFS
You can shut down the HDFS by using the following command.
$stop-dfs.sh
Result:
Thus the file management tasks in Hadoop (adding, retrieving and deleting files in HDFS) have been
successfully completed.
LAB EXPERIMENT 3
if curr_index == prev_index:
$ chmod +x ~/Desktop/mr/matrix/Mapper.py
$ chmod +x ~/Desktop/mr/matrix/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -input /user/cse/matrices/ \
> -output /user/cse/mat_output \
> -mapper ~/Desktop/mr/matrix/Mapper.py \
> -reducer ~/Desktop/mr/matrix/Reducer.py
Step 5: To view the full output
Result:
Thus the Map Reduce program to implement Matrix Multiplication was successfully executed.
LAB EXPERIMENT 4
BRIEF DESCRIPTION:
NOAA’s National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and
providing public access to weather data. NCDC provides access to daily data from the U.S. Climate
Reference Network / U.S. Regional Climate Reference Network (USCRN/USRCRN) via anonymous ftp at:
Dataset: ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01
After going through the word count MapReduce guide, you now have the basic idea of how a MapReduce
program works. So, let us look at a more complex MapReduce program on a weather dataset. Here we use one of
the 2015 datasets for Austin, Texas. We will do analytics on the dataset and classify whether each day was a hot
day or a cold day depending on the temperature recorded by NCDC.
NCDC gives us all the weather data we need for this MapReduce project. The dataset which we will be using is:
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/2015/CRND0103-2015-TX_Austin_33_NW.txt
Step 1: Download the complete project using the link below.
https://drive.google.com/file/d/0B2SFMPvhXPQ5bUdoVFZsQjE2ZDA/view?usp=sharing
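The downloaded project contains the complete driver, mapper and reducer classes, which are not reproduced in
this manual. A minimal illustrative sketch of the mapper idea is shown below; the class name HotColdMapper,
the assumed column positions of the date and temperature fields, and the hot/cold thresholds (30 and 15 degrees)
are all assumptions that must be adapted to the actual daily01 record layout used in the project.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: classify each daily record as a hot or cold day.
public class HotColdMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length < 7) {
            return; // skip malformed records
        }
        String date = fields[1];                      // assumed date column
        float maxTemp = Float.parseFloat(fields[5]);  // assumed daily maximum temperature column
        float minTemp = Float.parseFloat(fields[6]);  // assumed daily minimum temperature column
        if (maxTemp > 30.0f) {
            context.write(new Text("Hot Day " + date), new Text(String.valueOf(maxTemp)));
        } else if (minTemp < 15.0f) {
            context.write(new Text("Cold Day " + date), new Text(String.valueOf(minTemp)));
        }
    }
}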
LAB EXPERIMENT 5
OBJECTIVE: Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
BRIEF DESCRIPTION: In a given file, the Map function takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-value pairs).
Example–(Map function in Word Count)
Input Set of data: Bus,Car,bus,car,train,car,bus,car,train,bus,TRAIN,BUS,buS,caR,CAR,car,BUS, TRAIN
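The steps that follow compile a file named WordCount.java, which is not listed in this manual. For reference, a
minimal sketch of a typical WordCount program using the standard org.apache.hadoop.mapreduce API is shown
below; it is an illustrative implementation, not necessarily the exact file used in the lab.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), " ,\t\n");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that with this sketch the tokens Bus, bus and BUS in the sample input above are counted as three different
words; lower-casing the token in the mapper would merge them.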
Step 5: Type the following command to export the Hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6: It is time to create these directories on HDFS rather than locally.
Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
Step 8: Then, back on the local machine, we compile the WordCount.java file.
Assuming we are currently in the Desktop directory:
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
Put the output files in one jar file (there is a dot at the end):
jar -cvf WordCount.jar -C tutorial_classes .
OUTPUT:
Result:
Thus the Word Count Map Reduce program to understand Map Reduce Paradigm was
successfully executed.
LAB EXPERIMENT 6
BRIEF DESCRIPTION:
MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into
the data and starts pulling out information as needed. Based on the MapReduce design, records are processed
in isolation via tasks called Mappers. The output from the Mapper tasks is further processed by a second set
of tasks, the Reducers, where the results from the different Mapper tasks are merged together. The Map and
Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map
takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2)
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each
call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them
together, creating one group for each key. The Reduce function is then applied in parallel to each group,
which in turn produces a collection of values in the same domain: Reduce(k2, list (v2)) → list(v3)
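As a small worked instance of these two signatures (a hypothetical illustration using word counting, which is
treated in the next experiment): Map("doc1", "car bus car") → list((car,1), (bus,1), (car,1)); after the framework
groups the pairs by key, Reduce(car, list(1,1)) → list(2) and Reduce(bus, list(1)) → list(1).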
Algorithm for Mapper
Input: A set of objects X = {x1, x2, …, xn}, and a set of initial centroids C = {c1, c2, …, ck}
Output: An output list which contains pairs (Ci, xj), where 1 ≤ i ≤ k and 1 ≤ j ≤ n
Procedure
M1 ← {x1, x2, …, xm}
current_centroids ← C
distance(p, q) = √( Σ i=1..d (pi − qi)² ), where pi (or qi) is the coordinate of p (or q) in dimension i
for all xi ∈ M1 such that 1 ≤ i ≤ m do
    bestCentroid ← null
    minDist ← ∞
    for all c ∈ current_centroids do
        dist ← distance(xi, c)
        if (bestCentroid = null || dist < minDist) then
            minDist ← dist
            bestCentroid ← c
        end if
    end for
    emit(bestCentroid, xi)
    i += 1
end for
return Output list
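A minimal Java sketch of this mapper logic is shown below. It assumes that each input line is one object stored as
comma-separated coordinates and that the current centroids have already been loaded into a list in setup() (for
example from a file distributed with the job); the class and variable names are illustrative only.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative K-Means mapper: assigns each object to its nearest centroid.
public class KMeansMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Assumed to be filled in setup(), e.g. from a centroids file referenced in the job configuration.
    private List<double[]> currentCentroids = new ArrayList<>();

    private static double distance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += (p[i] - q[i]) * (p[i] - q[i]);
        }
        return Math.sqrt(sum);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double[] point = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            point[i] = Double.parseDouble(parts[i].trim());
        }
        int bestCentroid = -1;
        double minDist = Double.MAX_VALUE;
        for (int c = 0; c < currentCentroids.size(); c++) {
            double d = distance(point, currentCentroids.get(c));
            if (bestCentroid == -1 || d < minDist) {
                minDist = d;
                bestCentroid = c;
            }
        }
        // emit (bestCentroid, xi) as in the pseudocode above
        context.write(new Text("centroid_" + bestCentroid), value);
    }
}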
Algorithm for Reducer
Input: (Key, Value), where Key = bestCentroid and Value = the objects assigned to that centroid by the
mapper
Output: (Key, Value), where Key = oldCentroid and Value = newBestCentroid, which is the new centroid
value calculated for that bestCentroid
Procedure
The final output of the program will be the cluster name and, for each cluster, the number of text documents
that belong to that cluster.
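A matching reducer sketch, again only illustrative, recomputes each centroid as the mean of the objects that the
mappers assigned to it:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative K-Means reducer: the new centroid is the component-wise mean of all
// objects assigned to the old centroid by the mappers.
public class KMeansReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double[] sum = null;
        long count = 0;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            if (sum == null) {
                sum = new double[parts.length];
            }
            for (int i = 0; i < parts.length; i++) {
                sum[i] += Double.parseDouble(parts[i].trim());
            }
            count++;
        }
        StringBuilder newCentroid = new StringBuilder();
        for (int i = 0; i < sum.length; i++) {
            if (i > 0) {
                newCentroid.append(",");
            }
            newCentroid.append(sum[i] / count);
        }
        // emit (oldCentroid, newCentroid) as described above
        context.write(key, new Text(newCentroid.toString()));
    }
}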
LAB EXPERIMENT 7
step 3:
source ~/.bashrc
step 4:
Edit the hive-config.sh file
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/cse/hadoop-3.3.6
step 5:
Create Hive directories in HDFS
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
step 6:
Fixing the guava problem (additional step)
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
step 7: Configure hive-site.xml File (Optional)
cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the nano text editor:
sudo nano hive-site.xml
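The copied hive-site.xml template already contains default values; typically only a few properties are reviewed at
this point. A minimal sketch of the kind of entries commonly checked for a local, Derby-backed metastore is shown
below; the values are assumptions and should be adapted to your installation.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>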
RESULT:
Thus Hive was successfully installed and executed.
LAB EXPERIMENT 8
Step 1: Make sure that Java is installed on your machine. To verify, run: java -version
If any error occurs while executing this command, then Java is not installed on your system. To install
Java: sudo apt install openjdk-8-jdk -y
Step 2: Download HBase
wget https://dlcdn.apache.org/hbase/2.5.5/hbase-2.5.5-bin.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
Step 5: Edit the .bashrc file
Open the .bashrc file and mention the HBASE_HOME path as shown below:
export HBASE_HOME=/home/prasanna/hbase-2.5.5
Here you can change the name according to your local machine name,
e.g.: export HBASE_HOME=/home/<your_machine_name>/hbase-2.5.5
export PATH=$PATH:$HBASE_HOME/bin
Note: make sure that the hbase-2.5.5 folder is in the home directory before setting the HBASE_HOME
path; if not, move the hbase-2.5.5 folder to the home directory.
Go to the hbase-2.5.5/bin folder.
Result:
* To create a table
syntax:
create 'Table_Name','col_fam_1','col_fam_2', ... ,'col_fam_n'
code:
create 'aamec','dept','year'
* Insert data
syntax:
put 'table_name','row_key','column_family:attribute','value'
code:
This will insert data into the dept column family:
put 'aamec','cse','dept:studentname','prasanna'
put 'aamec','cse','dept:year','third'
put 'aamec','cse','dept:section','A'
This will insert data into the year column family:
put 'aamec','cse','year:joinedyear','2021'
put 'aamec','cse','year:finishingyear','2025'
* Scan table
Displays the rows of the table (similar to SELECT * in an RDBMS)
syntax: scan 'table_name'
code: scan 'aamec'
* To get specific data
syntax: get 'table_name','row_key'[,'column_family:attribute' (optional)]
code: get 'aamec','cse'
* Update a table value: The same put command is used to update a table value. If the row key is already
present in the table, the data is updated with the new value; if it is not present, a new row is created
with the given row key.
Previously the value for the section in cse is A; after running the following command the value will be
changed to B:
put 'aamec','cse','dept:section','B'
* Delete a value
syntax:
delete 'table_name','row_key','column_family:attribute'
code: delete 'aamec','cse','year:joinedyear'
8. Delete Table
syntax:
disable 'table_name'
drop 'table_name'
code:
disable 'aamec'
drop 'aamec'
Result:
Thus the HBase installation and the basic table operations (create, put, scan, get, delete) were executed
successfully.
LAB EXPERIMENT 9
Step 2: MySQL Installation
COMMANDS: ~$ sudo su
After this, enter your Linux user password; then the root mode will be opened and here we
do not need any authentication for MySQL.
~root$ mysql
mysql> CREATE USER 'bigdata'@'localhost' IDENTIFIED BY 'bigdata';
mysql> grant all privileges on *.* to bigdata@localhost;
Note: This step is not required if you just use the root user to make CRUD operations in MySQL.
mysql> CREATE USER 'bigdata'@'127.0.0.1' IDENTIFIED BY 'bigdata';
mysql> grant all privileges on *.* to 'bigdata'@'127.0.0.1';
Note: Here, *.* means that the user we create has all the privileges on all the tables of all the databases.
Now, we have created user profiles which will be used to make CRUD operations in MySQL.
Step 4: SQOOP INSTALLATION:
After downloading Sqoop, go to the directory where we downloaded it
and then extract it using the following command:
$ tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
Then enter into the super user: $ su
Next, move it to /usr/lib, which requires super user privileges:
$ mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
Then exit: $ exit
Go to .bashrc: $ sudo nano .bashrc, and then add the following:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
$ source ~/.bashrc
Then configure Sqoop: go to the conf directory of SQOOP_HOME and rename the
template file to the environment file.
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Then open the sqoop-env.sh file and add the following:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Note: Here we add the path of the Hadoop libraries and files, and it may be different from the path
mentioned here. So, add the Hadoop path based on your installation.
Next, to extract the MySQL connector file and place it in the lib folder of Sqoop:
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Note: This library file is very important; don't skip this step because it contains the libraries (the JDBC driver)
needed to connect the MySQL databases to JDBC.
hive> create database sqoop_example;
hive> use sqoop_example;
hive> create table sqoop(usr_name string, no_ops int, ops_names string);
Hive commands are much like MySQL commands. Here, we just create the structure to store the
data which we want to import into Hive.
sqoop import --connect jdbc:mysql://localhost/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1
OUTPUT:
Result:
Thus the import and export of data between MySQL and Hive using Sqoop were completed successfully.
LAB EXPERIMENT 10
OBJECTIVE: To implement PIG Commands: Write Pig Latin scripts to sort, group, join, project, and
filter your data.
Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
Terms
alias         The name of a relation.
*             The designator for a tuple.
field_alias   A field in the relation. The field must be a simple type.
ASC           Sort in ascending order.
DESC          Sort in descending order.
PARALLEL n    Increase the parallelism of a job by specifying the number of reduce tasks, n.
              For more information, see Use the Parallel Features.
Usage
Note: ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the
order in which these records are returned is not defined and is not guaranteed to be the same
from one run to the next.
In Pig, relations are unordered (see Relations, Bags, Tuples, Fields):
• If you retrieve relation X (DUMP X;) the data is guaranteed to be in the order you
specified (descending).
• However, if you further process relation X (Y = FILTER X BY $0 > 1;)
there is no guarantee that the data will be processed in the order you originally specified
(descending).
• Pig currently supports ordering on fields with simple types or by tuple designator (*).
You cannot order on fields with complex types or by expressions.
A = LOAD 'mydata' AS (x:int, y:map[]);
Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the
order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,4,3)
(8,3,4)
(1,2,3)
(4,3,3)
(4,2,1)
RANK
When no field to sort on is specified, the RANK operator simply prepends a sequential value to each
tuple. Otherwise, the RANK operator uses each field (or set of fields) to sort the relation. The
rank of a tuple is one plus the number of different rank values preceding it. If two or more
tuples tie on the sorting field values, they will receive the same rank.
NOTE: When using the option DENSE, ties do not cause gaps in ranking values.
Examples
Suppose we have relation A.
A = LOAD 'data' AS (f1:chararray, f2:int, f3:chararray);
DUMP A;
(David,1,N)
(Tete,2,N)
(Ranjit,3,M)
(Ranjit,3,P)
(David,4,Q)
(David,4,Q)
(Jillian,8,Q)
(JaePak,7,Q)
Big Data & Analytics Lab (KCS-651)
(Michael,8,T)
(Jillian,8,Q)
(Jose,10,V)
In this example, the RANK operator does not change the order of the relation and simply
prepends to each tuple a sequential value.
B = rank A;
dump B;
(1,David,1,N)
(2,Tete,2,N)
(3,Ranjit,3,M)
(4,Ranjit,3,P)
(5,David,4,Q)
(6,David,4,Q)
(7,Jillian,8,Q)
(8,JaePak,7,Q)
(9,Michael,8,T)
(10,Jillian,8,Q)
(11,Jose,10,V)
In this example, the RANK operator works with the f1 and f2 fields, each one with a
different sorting order. RANK sorts the relation on these fields and prepends the rank
value to each tuple. Otherwise, the RANK operator uses each field (or set of fields) to
sort the relation. The rank of a tuple is one plus the number of different rank values
preceding it. If two or more tuples tie on the sorting field values, they will receive the
same rank.
C = rank A by f1 DESC, f2 ASC;
dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(4,Michael,8,T)
(5,Jose,10,V)
(6,Jillian,8,Q)
(6,Jillian,8,Q)
(8,JaePak,7,Q)
(9,David,1,N)
(10,David,4,Q)
(10,David,4,Q)
Same example as previous, but DENSE. In this case there are no gaps in ranking values.
C = rank A by f1 DESC, f2 ASC DENSE;
dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(3,Michael,8,T)
(4,Jose,10,V)
(5,Jillian,8,Q)
(5,Jillian,8,Q)
(6,JaePak,7,Q)
(7,David,1,N)
(8,David,4,Q)
(8,David,4,Q)
LAB EXPERIMENT 11
Hive Script
Commands to create a table in Hive and to find the average temperature:
LOAD DATA LOCAL INPATH '/home/student3/Project/Project_Output/outpt1.txt'
select count(*) from w_hd96;
SELECT * from w_hd9467 limit 5;
MaxTemperature.java
import
Expected Output:
Actual Output:
Pig Latin Scripts to find Word Count:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Run the Pig Latin Scripts to find a max temp for each and every year
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;
Result:
OUTPUT:
Dump of the loaded records:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
Final output of the script (DUMP max_temp):
(1949,111)
(1950,22)