
BIG DATA AND ANALYTICS LAB

LABORATORY MANUAL
B.Tech., Semester –VI
Subject Code: KCS-651

Session: 2023-24, Even Semester


Name:

Roll. No.:

Group/Branch:

JSS MAHAVIDYAPEETHA
DEPARTMENT OF INFORMATION TECHNOLOGY
JSS ACADEMY OF TECHNICAL EDUCATION
C-20/1, SECTOR-62, NOIDA

Table of Contents
1. Vision and Mission of the Institute
2. Vision and Mission of the Department
3. Programme Educational Objectives (PEOs)
4. Programme Outcomes (POs)
5. Programme Specific Outcomes (PSOs)
6. University Syllabus
7. Course Outcomes (COs)
8. CO- PO and CO-PSO mapping
9. Course Overview
10. List of Experiments
11. DOs and DON’Ts
12. General Safety Precautions
13. Guidelines for Students for Report Preparation
14. Lab Assessment Criteria
15. Details of Conducted Experiments
16. Lab Experiments


Vision and Mission of the Institute


Vision:

“JSS Academy of Technical Education Noida aims to become an Institution of excellence


in imparting quality Outcome Based Education that empowers the young generation with
Knowledge, Skills, Research, Aptitude and Ethical values to solve Contemporary
Challenging Problems”

Mission:
M1: Develop a platform for achieving globally acceptable level of intellectual acumen and
technological competence.
M2: Create an inspiring ambience that raises the motivation level for conducting quality
research.

M3: Provide an environment for acquiring ethical values and positive attitude.

Vision and Mission of the Department


Vision:

“To become a Centre of Excellence in teaching and research in Information Technology for
producing skilled professionals having a zeal to serve society”

Mission:

M1: To create an environment where students can be equipped with strong fundamental
concepts, programming and problem solving skills.
M2: To provide an exposure to emerging technologies by providing hands on experience for
generating competent professionals.
M3: To promote Research and Development in the frontier areas of Information Technology
and encourage students for pursuing higher education
M4: To inculcate in students ethics, professional values, team work and leadership skills.


Programme Educational Objectives (PEOs)


PEO1: To provide students with a sound knowledge of mathematical, scientific and
engineering fundamentals required to solve real world problems.
PEO2: To develop research oriented analytical ability among students and to prepare them for
making technical contribution to the society.
PEO3: To develop in students the ability to apply state-of-the–art tools and techniques for
designing software products to meet the needs of Industry with due consideration for
environment friendly and sustainable development.
PEO4: To prepare students with effective communication skills, professional ethics and
managerial skills.
PEO5: To prepare students with the ability to upgrade their skills and knowledge for life-long
learning.

Program Outcomes (POs)


PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with
appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data,
and synthesis of the information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.


PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend
and write effective reports and design documentation, make effective presentations, and
give and receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.

Program Specific Outcomes (PSOs)


PSO1: Analyze, identify and clearly define a problem for solving user needs by selecting,
creating and evaluating a computer based system through an effective project plan.

PSO2: Design, implement and evaluate processes, components and/or programs using modern
techniques, skills and tools of core Information Technologies to effectively integrate
secure IT-based solutions into the user environment.

PSO3: Develop impactful IT solutions by using research based knowledge and research
methods in the fields of integration, interface issues, security & assurance and
implementation.


University Syllabus
1. Downloading and installing Hadoop; Understanding different Hadoop modes. Startup
scripts, Configuration files.

2. Implement the following file management tasks in Hadoop:

i. Adding files and directories

ii. Retrieving files

iii. Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and
copies them into HDFS using one of the above command line utilities

3. Implementation of Matrix Multiplication with Hadoop Map Reduce

4. Write a Map Reduce program that mines weather data. Hint: Weather sensors collecting
data every hour at many locations across the globe gather a large volume of log data,
which is a good candidate for analysis with Map Reduce, since it is semi structured and
record-oriented

5. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.

6. Implementation of K-means clustering using Map Reduce

7. Installation of Hive along with practice examples.

8. Installation of HBase, Installing thrift along with Practice examples

9. Practice importing and exporting data from various databases.

10. Write PIG Commands: Write Pig Latin scripts sort, group, join, project, and filter your
data.

11. Run the Pig Latin Scripts to find Word Count.

12. Run the Pig Latin Scripts to find a max temp for each and every year.


Course Outcomes (COs)

Upon successful completion of the course, the students will be able to

CO 1: Optimize business decisions and create competitive advantage with Big data analytics
CO2: Know the java concepts required for developing map reduce programs.
CO3: Implement the architectural concepts of Hadoop and introducing map reduce paradigm.
CO4: Demonstrate the PIG, HIVE in Hadoop Eco system and Hadoop development.
CO5: Implement best practices for Hadoop development.

CO-PO Mapping
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 3 2 3 3 2 2 1 2 2
CO2 2 3 2 3 3 2 2 1 2 2
CO3 2 3 2 3 3 2 2 1 2 2
CO4 2 3 2 3 3 2 2 1 2 2
CO5 2 3 2 3 3 2 2 1 2 2
COs 2 3 2 3 3 2 2 1 2 2

CO-PSO Mapping

PSO1 PSO2 PSO3


CO1 3 3 3
CO2 3 3 3
CO3 3 3 3
CO4 3 3 3
CO5 3 3 3
COs 3 3 3


Course Overview
The ability to analyze more data at a faster rate can provide big benefits to
an organization, allowing it to more efficiently use data to answer important
questions. Big data analytics is important because it lets organizations use colossal
amounts of data in multiple formats from multiple sources to identify opportunities
and risks, helping organizations move quickly and improve their bottom lines. Big
data analytics is the process of collecting, examining, and analysing large amounts of
data to discover market trends, insights, and patterns that can help companies make
better business decisions. This information is available quickly and efficiently so that
companies can be agile in crafting plans to maintain their competitive
advantage. Technologies such as business intelligence (BI) tools and systems help
organizations take unstructured and structured data from multiple sources.

Users (typically employees) input queries into these tools to understand
business operations and performance. For example, big data analytics is integral to
the modern health care industry, where systems must manage thousands of patient
records, insurance plans, prescriptions, and vaccine records. Big data analytics handles
this volume quickly and efficiently so that health care providers can use the
information to make informed, life-saving diagnoses.


List of Experiments mapped with COs

1. Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts and configuration files. (CO3)
2. Implement the following file management tasks in Hadoop: (i) adding files and directories, (ii) retrieving files, (iii) deleting files. Hint: a typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the command line utilities. Practice the various HDFS commands: touch, touchz, cat, moveFromLocal, moveToLocal, copyFromLocal, copyToLocal, tail, expunge, du, df, chown, chgrp, setrep, chmod, appendToFile, checksum, count, find, stat, test, text, getmerge, help, etc. (CO3)
3. Implementation of Matrix Multiplication with Hadoop Map Reduce. (CO3)
4. Write a Map Reduce program that mines weather data. Hint: weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with Map Reduce, since it is semi-structured and record-oriented. (CO1, CO3)
5. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm. (CO2)
6. Implementation of K-means clustering using Map Reduce. (CO1, CO2)
7. Installation of Hive along with practice examples. (CO4)
8. Installation of HBase, installing Thrift along with practice examples. (CO5)
9. Practice importing and exporting data from various databases. (CO4)
10. Write PIG commands: write Pig Latin scripts to sort, group, join, project, and filter your data. (CO4)
11. Run the Pig Latin scripts to find Word Count. (CO4)
12. Run the Pig Latin scripts to find a max temp for each and every year. (CO4)


DOs and DON’Ts


DOs
1. Log in with your username and password.
2. Log off the Computer every time when you leave the Lab.
3. Arrange your chair properly when you are leaving the lab.
4. Put your bags in the designated area.
5. Ask permission to print.

DON’Ts
1. Do not share your username and password.
2. Do not remove or disconnect cables or hardware parts.
3. Do not personalize the computer setting.
4. Do not run programs that continue to execute after you log off.
5. Do not download or install any programs, games or music on computer in Lab.
6. Personal Internet use, chat rooms for Instant Messaging (IM), and similar sites are strictly
prohibited.
7. No Internet gaming activities allowed.
8. Tea, Coffee, Water & Eatables are not allowed in the Computer Lab.


General Safety Precautions


Precautions (In case of Injury or Electric Shock)
1. To separate the victim from a live electric source, use an insulator such as dry wood or
plastic to break the contact. Do not touch the victim with bare hands, to avoid the risk of
electrifying yourself.
2. Unplug the faulty equipment. If the main circuit breaker is accessible, turn the circuit
off.
3. If the victim is unconscious, start resuscitation immediately: use your hands to press the
chest in and out to continue the breathing function. Use mouth-to-mouth resuscitation if
necessary.
4. Immediately call medical emergency and security. Remember! Time is critical; be swift.

Ambulance : 9810611477(Fortis Ambulance)


120-2400222(Fortis Ambulance)
Security : 260 (Gate No.1)
230 (Gate No.2)

Precautions (In case of Fire)


1. Turn the equipment off. If the power switch is not immediately accessible, pull out the plug.
2. If the fire continues, try to curb it if possible by using a fire extinguisher or by
covering it with a heavy cloth; if possible, isolate the burning equipment from the other
surrounding equipment.
3. Sound the fire alarm by activating the nearest alarm switch located in the hallway.
4. Call security and emergency department immediately:

Emergency : 219 (Reception)


298(Health Center)
Security : 260 (Gate No.1)
230 (Gate No.2)


Guidelines to students for report preparation


All students are required to maintain a record of the experiments conducted by them.
Guidelines for its preparation are as follows:-
1) All files must contain a title page followed by an index page. The files will not be
signed by the faculty without an entry in the index page.
2) Student’s Name, Roll number and date of conduction of experiment must be written on all
pages.
3) For each experiment, the record must contain the following
(i) Aim/Objective of the experiment
(ii) Pre-experiment work (as given by the faculty)
(iii) Lab assignment questions and their solutions
(iv) Results/ output

Note:

1. Students must bring their lab record along with them whenever they come for the lab.

2. Students must ensure that their lab record is regularly evaluated.


Lab Assessment Criteria


An estimated 10 lab classes are conducted in a semester for each lab course. These lab
classes are assessed continuously. Each lab experiment is evaluated based on 5 assessment
criteria as shown in following table. Assessed performance in each experiment is used to
compute CO attainment as well as internal marks in the lab course.
Grading criteria (Exemplary = 4, Competent = 3, Needs Improvement = 2, Poor = 1):

AC1: Pre-Lab written work (for the last lab class, this may be assessed through viva)
Exemplary (4): Complete procedure with the underlined concept is properly written.
Competent (3): Underlined concept is written but the procedure is incomplete.
Needs Improvement (2): Not able to write the concept and procedure.
Poor (1): Underlined concept is not clearly understood.

AC2: Program Writing / Modeling
Exemplary (4): Assigned problem is properly analyzed, a correct solution is designed, appropriate language constructs/tools are applied, and the program/solution written is readable.
Competent (3): Assigned problem is properly analyzed, a correct solution is designed, and appropriate language constructs/tools are applied.
Needs Improvement (2): Assigned problem is properly analyzed and a correct solution is designed.
Poor (1): Assigned problem is properly analyzed.

AC3: Identification & Removal of errors/bugs
Exemplary (4): Able to identify errors/bugs and remove them.
Competent (3): Able to identify errors/bugs and remove them with a little guidance.
Needs Improvement (2): Dependent totally on someone else for identification of errors/bugs and their removal.
Poor (1): Unable to understand the reason for errors/bugs even after they are explicitly pointed out.

AC4: Execution & Demonstration
Exemplary (4): All variants of input/output are tested, the solution is well demonstrated, and the implemented concept is clearly explained.
Competent (3): All variants of input/output are not tested; however, the solution is well demonstrated and the implemented concept is clearly explained.
Needs Improvement (2): Only a few variants of input/output are tested; the solution is well demonstrated but the implemented concept is not clearly explained.
Poor (1): The solution is not well demonstrated and the implemented concept is not clearly explained.

AC5: Lab Record Assessment
Exemplary (4): All assigned problems are well recorded with objective, design constructs and solution, along with performance analysis using all variants of input and output.
Competent (3): More than 70% of the assigned problems are well recorded with objective, design constructs and solution, along with performance analysis done with all variants of input and output.
Needs Improvement (2): Less than 70% of the assigned problems are well recorded with objective, design constructs and solution, along with performance analysis done with all variants of input and output.
Poor (1): Less than 40% of the assigned problems are well recorded with objective, design constructs and solution, along with performance analysis done with all variants of input and output.


LAB EXPERIMENTS


LAB EXPERIMENT 1

OBJECTIVE: To download and install Hadoop; understand the different Hadoop modes, startup scripts, and configuration files.

PROCEDURE: Prerequisites to Install Hadoop on Ubuntu


Hardware requirement:
The machine should have at least 4 GB RAM and a minimum of 60 GB hard disk for better
performance.
Check the Java version: it is recommended to install Oracle Java 8. The user can check the
version of Java with the command below.
$ java -version
Part I
STEP 1: Set up passwordless SSH
a) Install OpenSSH server and client
We will now set up the passwordless SSH client with the following command.

1. $ sudo apt-get install openssh-server openssh-client

b) Generate public & private key pairs
1. $ ssh-keygen -t rsa -P ""
c) Configure passwordless SSH
2. $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


d) Now verify the working of passwordless SSH

$ ssh localhost

e) Now install rsync with the command

$ sudo apt-get install rsync
STEP 2: Configure and set up Hadoop
Downloading Hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ tar xzf hadoop-3.3.6.tar.gz
Result:
Hadoop was downloaded and installed, and the different Hadoop modes were explored successfully.

Part II: Startup scripts and configuration files


STEP 1: Setup configuration

a) Setting up the environment variables

Edit .bashrc and add Hadoop to the path:
$ nano ~/.bashrc
export HADOOP_HOME=/home/cse/hadoop-3.3.6
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Source .bashrc in the current login session in the terminal:

$ source ~/.bashrc
b) Hadoop configuration file changes
Edit hadoop-env.sh
Edit the hadoop-env.sh file, which is in etc/hadoop inside the Hadoop installation
directory.
$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
The user can set JAVA_HOME:
export JAVA_HOME=<root directory of Java installation> (eg: /usr/lib/jvm/jdk1.8.0_151/)


Edit core-site.xml
$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/cse/hdata</value>
</property>
</configuration>

Edit hdfs-site.xml
$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

# Add the lines below in this file (between <configuration> and </configuration>)

<property>
<name>dfs.namenode.name.dir</name>
<value>/home/cse/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/cse/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

Edit mapred-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
# Add the lines below in this file (between <configuration> and </configuration>)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Edit yarn-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

# Add the lines below in this file (between <configuration> and </configuration>)

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>


Step 2: Start the cluster

We will now start the single-node cluster with the following commands.
a) Format the namenode
$ hdfs namenode -format

b) Start HDFS

$ start-all.sh
c) Verify that all processes started
$ jps

6775 DataNode
7209 ResourceManager
7017 SecondaryNameNode
6651 NameNode
7339 NodeManager
7663 Jps


d) Web interface: to view the NameNode Web UI,

e) visit http://localhost:9870

Result: Hadoop was downloaded and installed, the different Hadoop modes were understood, and the startup scripts and
configuration files were successfully configured.


LAB EXPERIMENT 2

OBJECTIVE: To implement the following file management tasks in Hadoop:


1. Adding files and directories
2. Retrieving files
3. Deleting Files

BRIEF DESCRIPTION:

HDFS is a scalable distributed file system designed to scale to petabytes of data while running on
top of the underlying file system of the operating system. HDFS keeps track of where the data
resides in a network by associating the name of its rack (or network switch) with the dataset. This
allows Hadoop to efficiently schedule tasks to those nodes that contain data, or which are nearest to
it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work
similarly to the Linux file commands, and serve as your primary interface with HDFS.
We're going to have a look into HDFS by interacting with it from the command line.
We will take a look at the most common file management tasks in Hadoop, which include:
1. Adding files and directories to HDFS
2. Retrieving files from HDFS to local file system
3. Deleting files from HDFS

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step 1: Starting HDFS

Initially you have to format the configured HDFS file system, open the namenode (HDFS server),
and execute the following command.
$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as
the data nodes as a cluster.
$ start-dfs.sh

Listing Files in HDFS
After loading the information in the server, we can find the list of files in a directory and the status of a file
using ls. Given below is the syntax of ls; you can pass a directory or a file name as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS

Assume we have data in a file called file.txt in the local system which ought to be saved in
the HDFS file system. Follow the steps given below to insert the required file in the Hadoop file
system.
Step 2: Adding Files and Directories to HDFS
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Transfer and store a data file from the local system to the Hadoop file system using the put
command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3: You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Step 4: Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for
retrieving the required file from the Hadoop file system.
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Step 5: Deleting Files from HDFS
$ hadoop fs -rm file.txt
Step 6: Shutting Down HDFS
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
Result:
Thus the file management tasks in Hadoop (adding, retrieving and deleting files) were successfully completed.


LAB EXPERIMENT 3

OBJECTIVE: To Develop a Map Reduce program to implement Matrix Multiplication.

BRIEF DESCRIPTION: In mathematics, matrix multiplication or the matrix product is a binary


operation that produces a matrix from two matrices. The definition is motivated by linear equations and
linear transformations on vectors, which have numerous applications in applied mathematics, physics,
and engineering. In more detail, if A is an n × m matrix and B is an m × p matrix, their matrix product
AB is an n × p matrix, in which the m entries across a row of A are multiplied with the m entries down a
column of B and summed to produce an entry of AB. When two linear transformations are represented by
matrices, then the matrix product represents the composition of the two transformations.

Algorithm for the Map Function

a. For each element mij of M, produce (key, value) pairs as ((i, k), (M, j, mij)), for k = 1, 2, 3, ... up to the number of columns of N.
b. For each element njk of N, produce (key, value) pairs as ((i, k), (N, j, njk)), for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs such that each key (i, k) has a list with values (M, j, mij) and (N, j, njk) for all possible values of j.

Algorithm for the Reduce Function

d. For each key (i, k) do
e. sort values beginning with M by j in list M, and sort values beginning with N by j in list N; multiply mij and njk for the j-th value of each list.
f. Sum up mij x njk and return ((i, k), Σj mij x njk).

Step 1: Creating a directory for the matrices

Then open matrix1.txt and matrix2.txt and put the matrix values in those text files.
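A convenient input format for the streaming job below (an assumption for illustration, matching what the mapper script expects) is one matrix element per line, written as matrix_name,row,column,value, with the number of rows of the first matrix and the number of columns of the second matrix stored in cache.txt. For example, for two 2 x 2 matrices A and B:

matrix1.txt:
A,0,0,1
A,0,1,2
A,1,0,3
A,1,1,4

matrix2.txt:
B,0,0,5
B,0,1,6
B,1,0,7
B,1,1,8

cache.txt:
2,2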
Step 2: Creating the mapper file for Matrix Multiplication.

#!/usr/bin/env python
import sys

# cache.txt holds the number of rows of A and the number of columns of B
cache_info = open("cache.txt").readlines()[0].split(",")
row_a, col_b = map(int, cache_info)

for line in sys.stdin:
    matrix_index, row, col, value = line.rstrip().split(",")
    if matrix_index == "A":
        for i in range(0, col_b):
            key = row + "," + str(i)
            print("%s\t%s\t%s" % (key, col, value))
    else:
        for j in range(0, row_a):
            key = str(j) + "," + col
            print("%s\t%s\t%s" % (key, row, value))

Step 3: Creating the reducer file for Matrix Multiplication.

#!/usr/bin/env python
import sys
from operator import itemgetter

prev_index = None
value_list = []

for line in sys.stdin:
    curr_index, index, value = line.rstrip().split("\t")
    index, value = map(int, [index, value])
    if curr_index == prev_index:
        value_list.append((index, value))
    else:
        if prev_index:
            # pair up matching j values from the two matrices and accumulate the products
            value_list = sorted(value_list, key=itemgetter(0))
            i = 0
            result = 0
            while i < len(value_list) - 1:
                if value_list[i][0] == value_list[i + 1][0]:
                    result += value_list[i][1] * value_list[i + 1][1]
                    i += 2
                else:
                    i += 1
            print("%s,%s" % (prev_index, str(result)))
        prev_index = curr_index
        value_list = [(index, value)]

# emit the result for the last key
if curr_index == prev_index:
    value_list = sorted(value_list, key=itemgetter(0))
    i = 0
    result = 0
    while i < len(value_list) - 1:
        if value_list[i][0] == value_list[i + 1][0]:
            result += value_list[i][1] * value_list[i + 1][1]
            i += 2
        else:
            i += 1
    print("%s,%s" % (prev_index, str(result)))
Step 4: To view the mapper output locally using the cat command

$ cat *.txt | python mapper.py

$ chmod +x ~/Desktop/mr/matrix/Mapper.py
$ chmod +x ~/Desktop/mr/matrix/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -input /user/cse/matrices/ \
> -output /user/cse/mat_output \
> -mapper ~/Desktop/mr/matrix/Mapper.py \
> -reducer ~/Desktop/mr/matrix/Reducer.py
Step 5: View the full output.

Result:

Thus the Map Reduce program to implement Matrix Multiplication was successfully executed.


LAB EXPERIMENT 4

OBJECTIVE: Implement a Map Reduce program that mines weather data.

BRIEF DESCRIPTION:
NOAA's National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and
providing public access to weather data. NCDC provides access to daily data from the U.S. Climate
Reference Network / U.S. Regional Climate Reference Network (USCRN/USRCRN) via anonymous FTP at
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01. After going through the Word Count
MapReduce guide, you now have the basic idea of how a MapReduce program works. So, let us see a more
complex MapReduce program on a weather dataset. Here we use one of the datasets of year 2015 of
Austin, Texas. We will do analytics on the dataset and classify whether it was a hot day or a cold day
depending on the temperature recorded by NCDC.
NCDC gives us all the weather data we need for this MapReduce project. The dataset which we will be
using looks like the snapshot at ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/2015/CRND0103-2015-TX_Austin_33_NW.txt
Step 1: Download the complete project using the link below.
https://drive.google.com/file/d/0B2SFMPvhXPQ5bUdoVFZsQjE2ZDA/view?usp=sharing
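The downloaded project contains the full Java implementation. As a minimal illustration only, a Hadoop Streaming mapper and reducer in Python for the hot/cold-day classification might look like the sketch below; it assumes the raw NCDC lines have already been reduced to two whitespace-separated fields, date and daily maximum temperature, and the 30 degree Celsius cut-off for a "hot" day is likewise an assumed value.

#!/usr/bin/env python
# weather_mapper.py -- label each day as hot or cold (sketch; field layout is assumed)
import sys

HOT_THRESHOLD = 30.0  # assumed cut-off in degrees Celsius

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue                 # skip malformed lines
    date, temp = fields[0], fields[1]
    try:
        t = float(temp)
    except ValueError:
        continue                 # skip headers or missing values
    label = "hot" if t >= HOT_THRESHOLD else "cold"
    print("%s\t%s\t%s" % (date, t, label))

#!/usr/bin/env python
# weather_reducer.py -- count how many days were labelled hot vs cold
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 3:
        counts[parts[2]] += 1

for label, n in sorted(counts.items()):
    print("%s\t%d" % (label, n))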

LAB EXPERIMENT 5

OBJECTIVE: Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
BRIEF DESCRIPTION: In a given file, the Map function takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-value pairs).
Example–(Map function in Word Count)
Input Set of data: Bus,Car,bus,car,train,car,bus,car,train,bus,TRAIN,BUS,buS,caR,CAR,car,BUS, TRAIN

Output Convert into another set of data(Key,Value)


(Bus,1),(Car,1),(bus,1),(car,1),(train,1),(car,1),(bus,1),(car,1),(train,1),(bus,1),
(TRAIN,1),(BUS,1),(buS,1),(caR,1),(CAR,1),(car,1),(BUS,1),(TRAIN,1)
Reduce Function–Takes the output from Map as an input and combines
those data tuplesinto a smaller set of tuples.
Example–(Reduce function in Word Count)
Input Set of Tuples (output of Map function)
(Bus,1),(Car,1),(bus,1),(car,1),(train,1),(car,1),(bus,1),(car,1), (train,1),
(bus,1),(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1),(car,1),(BUS,1),(TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)

Workflow of Map Reduce consists of 5 steps:

1. Splitting – the splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group
them in the "Reduce" phase, the similar KEY data should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – the last phase, where all the data (individual result sets from each cluster) are
combined together to form a result. A short Python sketch of the map and reduce functions is given below.
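As a compact illustration of the Map and Reduce functions described above (this is a Hadoop Streaming style sketch in Python, not the WordCount.java file attached to this experiment), the two scripts could look like this; the streaming framework delivers the mapper output to the reducer sorted by key:

#!/usr/bin/env python
# mapper.py -- Map function: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- Reduce function: sum the counts of each word (input arrives sorted by word)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))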
Now let's see the Word Count program in Java.
Step 1: Make sure Hadoop and Java are installed properly.
hadoop version
javac -version
Step 2: Create a directory on the Desktop named Lab and inside it create two folders: one called "Input" and the other called "tutorial_classes".
[You can do this step using the GUI normally or through terminal commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes

Step 3: Add the file attached with this document, "WordCount.java", to the directory Lab.
Step 4: Add the file attached with this document, "input.txt", to the directory Lab/Input.

Step 5: Type the following command to export the Hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6: It is time to create these directories on HDFS rather than locally. Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input

Step 7: Go to localhost:9870 from the browser, open "Utilities → Browse File System", and
you should see the directories and files we placed in the file system.

Step 8: Then, back on the local machine, we compile the WordCount.java file.
Assuming we are currently in the Desktop directory:
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java

Put the output files in one jar file (there is a dot at the end):
jar -cvf WordCount.jar -C tutorial_classes .

Step 9: Now, we run the jar file on Hadoop.

hadoop jar WordCount.jar WordCount /WordCountTutorial/Input /WordCountTutorial/Output

Step 10: Output the result:

hadoop dfs -cat /WordCountTutorial/Output/*
The output is stored in the part-r-00000 file inside the output directory.

OUTPUT:

Result:
Thus the Word Count Map Reduce program to understand Map Reduce Paradigm was
successfully executed.

LAB EXPERIMENT 6

OBJECTIVE: To implement K-means clustering using Map Reduce

BRIEF DESCRIPTION:
MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into
the data and starts pulling out information as needed. Based on the MapReduce design, records are processed
in isolation via tasks called Mappers. The output from the Mapper tasks is further processed by a second set
of tasks, the Reducers, where the results from the different Mapper tasks are merged together. The Map and
Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map
takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1,v1) → list(k2,v2)

The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each
call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them
together, creating one group for each key. The Reduce function is then applied in parallel to each group,
which in turn produces a collection of values in the same domain: Reduce(k2, list (v2)) → list(v3)

Algorithm for Mapper

Input: A set of objects X = {x1, x2, ..., xn} and a set of initial centroids C = {c1, c2, ..., ck}
Output: An output list which contains pairs (ci, xj), where 1 ≤ i ≤ k and 1 ≤ j ≤ n
Procedure
M1 ← {x1, x2, ..., xm}
current_centroids ← C
distance(p, q) = √( Σ over the d dimensions of (pi − qi)² ), where pi (or qi) is the coordinate of p (or q) in dimension i
for all xi ∈ M1 such that 1 ≤ i ≤ m do
    bestCentroid ← null
    minDist ← ∞
    for all c ∈ current_centroids do
        dist ← distance(xi, c)
        if (bestCentroid = null || dist < minDist) then
            minDist ← dist
            bestCentroid ← c
        end if
    end for
    emit (bestCentroid, xi)
    i += 1
end for
return output list
Algorithm for Reducer

Input: (Key, Value), where Key = bestCentroid and Value = the objects assigned to that centroid by the
mapper. Output: (Key, Value), where Key = oldCentroid and Value = newBestCentroid, the new centroid
value calculated for that bestCentroid

Procedure

outputlist ← output list from the mappers
grouped ← {}
newCentroidList ← null
for all β ∈ outputlist do
    centroid ← β.key
    object ← β.value
    grouped[centroid] ← object
end for
for all centroid ∈ grouped do
    newCentroid ← null; sumOfObjects ← 0; numOfObjects ← 0
    for all object ∈ grouped[centroid] do
        sumOfObjects += object
        numOfObjects += 1
    end for
    newCentroid ← sumOfObjects / numOfObjects
    emit (centroid, newCentroid)
end for
end
The outcome of the k-means map reduce algorithm is the cluster points along with bounded documents as <key,
value> pairs, where key is the cluster id and value contains in the form of vector: weight. The weight indicates the
probability of vector be a point in that cluster. For Example: Key: 92: Value: 1.0: [32:0.127,79:0.114, 97:0.114,
157:0.148 ...].

The final output of the program will be the cluster name, filename: number of text documents that belong to that
cluster.
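As an illustration of the mapper algorithm above, a Hadoop Streaming style sketch in Python is shown below. The file name centroids.txt, its one-comma-separated-vector-per-line format, and the comma-separated point format of the input lines are assumptions made for the sketch, not part of the original procedure.

#!/usr/bin/env python
# kmeans_mapper.py -- assign each input point to its nearest centroid (sketch)
import sys
import math

# assumed: the current centroids are stored locally, one comma-separated vector per line
centroids = [tuple(map(float, line.strip().split(",")))
             for line in open("centroids.txt") if line.strip()]

def distance(p, q):
    # Euclidean distance, as used in the mapper algorithm above
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

for line in sys.stdin:
    if not line.strip():
        continue
    point = tuple(map(float, line.strip().split(",")))
    # index of the closest centroid
    best = min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))
    # emit (centroid_id, point); the reducer averages the points of each centroid
    print("%d\t%s" % (best, ",".join(map(str, point))))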

LAB EXPERIMENT 7

OBJECTIVE: Installation of Hive along with practice examples.

BRIEF DESCRIPTION: Steps for Hive installation

• Download and unzip Hive
• Edit the .bashrc file
• Edit the hive-config.sh file
• Create Hive directories in HDFS
• Initiate the Derby database
• Configure the hive-site.xml file
Step 1:
Download and unzip Hive
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar xzf apache-hive-3.1.2-bin.tar.gz
Step 2:
Edit the .bashrc file
sudo nano .bashrc
export HIVE_HOME=/home/hdoop/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

Step 3:
source ~/.bashrc
Step 4:
Edit the hive-config.sh file
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/cse/hadoop-3.3.6

Step 5:
Create Hive directories in HDFS
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

Step 6:
Fixing the guava problem (additional step)
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
Step 7: Configure the hive-site.xml file (optional)

Use the following command to locate the correct file:


cd $HIVE_HOME/conf

List the files contained in the folder using the ls command.

Use the hive-default.xml.template to create the hive-site.xml file:

cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the nano text editor:
sudo nano hive-site.xml

Step 8: Initiate the Derby database

$HIVE_HOME/bin/schematool -dbType derby -initSchema
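A small practice session of the kind intended by the objective might look like the following HiveQL sketch (the table name, columns and file path are assumptions for illustration):

hive> CREATE TABLE student(id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/cse/students.csv' INTO TABLE student;
hive> SELECT COUNT(*) FROM student;
hive> SELECT * FROM student LIMIT 5;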

RESULT:
Thus Hive was successfully installed and the practice examples executed.

LAB EXPERIMENT 8

OBJECTIVE: Installation of HBase, Installing thrift along with Practice examples.

BRIEF DESCRIPTION: Pre-requisite:

Ubuntu 16.04 or higher installed on a virtual machine.

Step 1: Make sure that Java is installed on your machine; to verify this, run java -version

If any error occurs while executing this command, then Java is not installed in your system. To install
Java: sudo apt install openjdk-8-jdk -y
Step 2: Download HBase
wget https://dlcdn.apache.org/hbase/2.5.5/hbase-2.5.5-bin.tar.gz

Step 3: Extract the hbase-2.5.5-bin.tar.gz file by using the command tar xvf hbase-2.5.5-bin.tar.gz



Step 4: Go to the hbase-2.5.5/conf folder and open the hbase-env.sh file

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64

Step 5: Edit the .bashrc file

Then open the .bashrc file and mention the HBASE_HOME path as shown below
export HBASE_HOME=/home/prasanna/hbase-2.5.5
Here you can change the name according to your local machine name
eg: export HBASE_HOME=/home/<your_machine_name>/hbase-2.5.5
export PATH=$PATH:$HBASE_HOME/bin

Note: *make sure that the hbase-2.5.5 folder is in the home directory before setting the HBASE_HOME
path; if not, then move the hbase-2.5.5 folder to the home directory*

Step 6: Add properties in the hbase-site.xml



Put the properties below between the <configuration> and </configuration> tags


<property>
<name>hbase.rootdir</name>
<value>file:///home/prasanna/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/prasanna/HBASE/zookeeper</value>
</property>

Step 7: Go to the /etc/ folder, open the hosts file (sudo nano /etc/hosts), and configure it:
by default the IP in line 2 is 127.0.1.1; change it to 127.0.0.1 (in the second line only).

Step 8: Starting HBase

Go to the hbase-2.5.5/bin folder and run ./start-hbase.sh

After this, run the jps command to ensure that HBase is running.

Visit http://localhost:16010 to see the HBase web UI.


Step 9: Access the HBase shell by running the ./hbase shell command.

Result:

HBase was successfully installed on Ubuntu 18.04.

** HBase, Installing thrift along with Practice examples

EXAMPLE: Using HBase on Ubuntu 18.04 in standalone mode

* To create a table
syntax:
create 'Table_Name','col_fam_1','col_fam_2',...,'col_fam_n'
code :

create 'aamec','dept','year'

* List All Tables


code: list

* Insert data
syntax:
put 'table_name','row_key','column_family:attribute','value'

here row_key is a unique key used to retrieve data

code :

These commands will enter data into the dept column family:

put 'aamec','cse','dept:studentname','prasanna'
put 'aamec','cse','dept:year','third'
put 'aamec','cse','dept:section','A'

These commands will enter data into the year column family:
put 'aamec','cse','year:joinedyear','2021'
put 'aamec','cse','year:finishingyear','2025'

* Scan table
Displays the contents of a table (similar to viewing a table in an RDBMS)
syntax: scan 'table_name'
code: scan 'aamec'

* To get specific data
syntax: get 'table_name','row_key',[optional column_family:attribute]
code: get 'aamec','cse'

* Update table value: the same put command is used to update a table value. If the row key is already
present in the database then it will update the data according to the value; if not present, it will create a new
row with the given row key.

Previously the value for the section in cse was A; after running the command
put 'aamec','cse','dept:section','B' the value will be changed to B.

7) To delete data

syntax:
delete 'table_name','row_key','column_family:attribute'
code : delete 'aamec','cse','year:joinedyear'
8) Delete table

First we need to disable the table before dropping it.

To disable:

syntax: disable 'table_name'
code: disable 'aamec'

To drop:
syntax: drop 'table_name'
code: drop 'aamec'

Result:

HBase was successfully installed with an example on Ubuntu18.04.



LAB EXPERIMENT 9

OBJECTIVE: To practice importing and exporting data, preserving the order of columns, between MySQL and Hive.

BRIEF DESCRIPTION: Pre-requisites

Hadoop and Java; MySQL, Hive, SQOOP
Step 1: Start HDFS

Step 2: MySQL installation

sudo apt install mysql-server (use this command to install MySQL server)

COMMANDS: ~$ sudo su
After this, enter your Linux user password; then root mode will be open and we do not
need any further authentication for MySQL.
~root$ mysql

Creating user profiles and granting them permissions:

mysql> CREATE USER 'bigdata'@'localhost' IDENTIFIED BY 'bigdata';

mysql> GRANT ALL PRIVILEGES ON *.* TO 'bigdata'@'localhost';
Note: this step is not required if you just use the root user to make CRUD operations in MySQL.

mysql> CREATE USER 'bigdata'@'127.0.0.1' IDENTIFIED BY 'bigdata';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'bigdata'@'127.0.0.1';
Note: here, *.* means that the user we create has all the privileges on all the tables of all the databases.

Now we have created user profiles which will be used to make CRUD operations in MySQL.

Step 3: Create a database and table in MySQL and insert data.

Example: create database Employe;
create table Employe.Emp(author_name varchar(65), total_no_of_articles int, phone_no int, address
varchar(65));
insert into Employe.Emp values("Rohan",10,123456789,"Lucknow");
Step 4: Create a database and table in Hive where the data should be imported.
create table geeks_hive_table(name string, total_articles int, phone_no int, address string) row format
delimited fields terminated by ',';

Step 5: SQOOP installation

After downloading Sqoop, go to the directory where we downloaded it
and then extract it using the following command:
$ tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
Then enter the super user: $ su
Next, move it to /usr/lib, which requires super user privileges:
$ mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
Then exit: $ exit
Go to .bashrc: $ sudo nano .bashrc, and then add the following:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
$ source ~/.bashrc

Then configure Sqoop: go to the conf folder of SQOOP_HOME and then copy the
contents of the template file to the environment file.

$ cd $SQOOP_HOME/conf

$ mv sqoop-env-template.sh sqoop-env.sh

Then open the sqoop-env.sh file and add the following:
export HADOOP_COMMON_HOME=/usr/local/hadoop

export HADOOP_MAPRED_HOME=/usr/local/hadoop
Note: here we add the path of the Hadoop libraries and files; it may differ from the path
mentioned here, so add the Hadoop path based on your installation.

Step 6: Download and configure mysql-connector-java:

We can download the mysql-connector-java-5.1.30.tar.gz file from the following link.

Next, extract the file and place it in the lib folder of Sqoop:
$ tar -zxf mysql-connector-java-5.1.30.tar.gz

$ su
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Note: this library file is very important; don't skip this step because it contains the libraries to
connect the MySQL databases through JDBC.

Verify Sqoop: sqoop-version

Step 7: Hive database creation

hive> create database sqoop_example;
hive> use sqoop_example;

hive> create table sqoop(usr_name string, no_ops int, ops_names string);
Hive commands are much like MySQL commands. Here, we just create the structure to store the
data which we want to import into Hive.

Step 8: Importing data from MySQL to Hive:

sqoop import --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1
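Only the import direction is shown above. Exporting data from HDFS/Hive back into MySQL uses the analogous sqoop export command; the sketch below assumes the target table already exists in MySQL, and the export directory and field delimiter are placeholders to be adapted:

sqoop export --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--export-dir /user/hive/warehouse/database_name_in_hive.db/table_name_in_hive \
--input-fields-terminated-by ',' \
-m 1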

OUTPUT:

Result:
Thus importing and exporting data between MySQL and Hive, preserving the order of columns, was
carried out successfully.

LAB EXPERIMENT 10

OBJECTIVE: To implement PIG Commands: Write Pig Latin scripts sort, group, join, project, and
filter your data.

BRIEF DESCRIPTION: ORDER BY

Sorts a relation based on one or more fields.

Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] ...] } [PARALLEL n];
Terms
alias        The name of a relation.
*            The designator for a tuple.
field_alias  A field in the relation. The field must be a simple type.
ASC          Sort in ascending order.
DESC         Sort in descending order.
PARALLEL n   Increase the parallelism of a job by specifying the number of reduce tasks, n.

Usage
For more information, see Use the Parallel Features.

Note: ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the
order in which these records are returned is not defined and is not guaranteed to be the same
from one run to the next.

In Pig, relations are unordered (see Relations, Bags, Tuples, Fields):

• If you order relation A to produce relation X (X = ORDER A BY * DESC;)


relations A and X still contain the same data.

• If you retrieve relation X (DUMP X;) the data is guaranteed to be in the order you
specified (descending).
• However, if you further process relation X (Y = FILTER X BY $0 > 1;)
there is no guarantee that the data will be processed in the order you originally specified
(descending).
• Pig currently supports ordering on fields with simple types or by tuple designator (*).
You cannot order on fields with complex types or by expressions.

A = LOAD 'mydata' AS (x: int, y: map[]);

B = ORDER A BY x; -- this is allowed because x is a simple type

B = ORDER A BY y; -- this is not allowed because y is a complex type

B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression

Examples
Suppose we have relation A.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the
order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;

DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
RANK

Returns each tuple with the rank within a relation.
Syntax
alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [,
field_alias [ASC|DESC] …] } [DENSE] ];
Terms
alias        The name of a relation.
*            The designator for a tuple.
field_alias  A field in the relation. The field must be a simple type.
ASC          Sort in ascending order.
DESC         Sort in descending order.
DENSE        No gap in the ranking values.
Usage
When specifying no field to sort on, the RANK operator simply prepends a sequential value
to each tuple.

Otherwise, the RANK operator uses each field (or set of fields) to sort the relation. The
rank of a tuple is one plus the number of different rank values preceding it. If two or more
tuples tie on the sorting field values, they will receive the same rank.

NOTE: When using the option DENSE, ties do not cause gaps in ranking values.

Examples
Suppose we have relation A.

A = load 'data' AS (f1:chararray,f2:int,f3:chararray); DUMP A;

(David,1,N)
(Tete,2,N)

(Ranjit,3,M)

(Ranjit,3,P)

(David,4,Q)

(David,4,Q)

(Jillian,8,Q)

(JaePak,7,Q)

(Michael,8,T)

(Jillian,8,Q)
(Jose,10,V)

In this example, the RANK operator does not change the order of the relation and simply
prepends to each tuple a sequential value.

B = rank A;

dump B;
(1,David,1,N)

(2,Tete,2,N)

(3,Ranjit,3,M)

(4,Ranjit,3,P)
(5,David,4,Q)

(6,David,4,Q)
(7,Jillian,8,Q)

(8,JaePak,7,Q)

(9,Michael,8,T)

(10,Jillian,8,Q)

(11,Jose,10,V)

In this example, the RANK operator works with f1 and f2 fields, and each one with
different sorting order. RANK sorts the relation on these fieldsand prepends the rank
value to each tuple. Otherwise, the RANK operator uses each field (or set of fields) to
sort the relation. The rank of a tuple is one plus the number of different rank values
preceding it. If two or more tuples tie on the sorting field values, they will receive the
same rank.

C = rank A by f1 DESC, f2 ASC;

dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)

(2,Ranjit,3,P)
(4,Michael,8,T)


(5,Jose,10,V)

(6,Jillian,8,Q)

(6,Jillian,8,Q)
(8,JaePak,7,Q)

(9,David,1,N)

(10,David,4,Q)

(10,David,4,Q)

Same example as previous, but DENSE. In this case there are no gaps in ranking values.

C = rank A by f1 DESC, f2 ASC DENSE;

dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)

(2,Ranjit,3,P)
(3,Michael,8,T)

(4,Jose,10,V)

(5,Jillian,8,Q)

(5,Jillian,8,Q)
(6,JaePak,7,Q)

(7,David,1,N)

(8,David,4,Q)

(8,David,4,Q)
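The objective also calls for grouping, joining, projecting and filtering. A short Pig Latin sketch covering those operators is given below; the relation names, file names and field names are assumptions for illustration:

emp   = LOAD 'emp.txt'  USING PigStorage(',') AS (id:int, name:chararray, deptno:int, salary:int);
depts = LOAD 'dept.txt' USING PigStorage(',') AS (deptno:int, dname:chararray);

high_paid = FILTER emp BY salary > 50000;            -- filter rows
by_dept   = GROUP emp BY deptno;                     -- group rows by department
joined    = JOIN emp BY deptno, depts BY deptno;     -- join the two relations
names     = FOREACH emp GENERATE name, salary;       -- project two columns
sorted    = ORDER names BY salary DESC;              -- sort the projection
DUMP sorted;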


LAB EXPERIMENT 11

OBJECTIVE: To develop a program to calculate the maximum recorded temperature year-wise
for the weather dataset in Pig Latin, and to run the Pig Latin scripts to find Word Count.
BRIEF DESCRIPTION: The National Climatic Data Center (NCDC) is the world's largest
active archive of weather data. We downloaded the NCDC data for year 1930 and loaded it into
the HDFS system. We implemented a MapReduce program and Pig and Hive scripts to find the min,
max and average temperature for different stations.
Compiled the Java files: javac -classpath /home/student3/hadoop-common-
2.6.1.jar:/home/student3/hadoop-mapreduce-client-core-
2.6.1.jar:/home/student3/commons-cli-2.0.jar -d . MaxTemperature.java
MaxTemperatureMapper.java MaxTemperatureReducer.java

Created the JAR file: jar -cvf hadoop-project.jar *.class

Executed the jar file: hadoop jar hadoop-project.jar MaxTemperature /home/student3/Project/
/home/student3/Project_output111

Copy the output file to local: hdfs dfs -copyToLocal
/home/student3/Project_output111/part-r-00000

PIG script: pig -x local
grunt> records = LOAD '/home/student3/Project/Project_Output/output111.txt' AS (year:chararray, temperature:int);
grunt> DUMP records;
grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
grunt> max_temp = FOREACH grouped_records GENERATE group, MAX(records.temperature);
grunt> DUMP max_temp;

Hive script
Commands to create a table in Hive and to find the average temperature:

DROP TABLE IF EXISTS w_hd9467;

CREATE TABLE w_hd9467 (year STRING, temperature INT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/student3/Project/Project_Output/outpt1.txt'
OVERWRITE INTO TABLE w_hd9467;

SELECT count(*) FROM w_hd9467;

SELECT * FROM w_hd9467 LIMIT 5;


Query to find the average temperature:

SELECT year, AVG(temperature) FROM w_hd9467 GROUP BY year;

MaxTemperature.java
import

Expected Output:
Actual Output:
Pig Latin script to find Word Count:

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Run the Pig Latin script to find the max temp for each and every year

max_temp.pig: finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;

Result:

OUTPUT:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

