Big Data Analytics lab file

The document provides detailed installation instructions for Apache Hive and HBase, including prerequisites, setup steps, and configuration commands. It also includes examples of Pig Latin scripts for data manipulation tasks such as sorting, grouping, joining, filtering, and counting words. Additionally, it outlines steps to find the maximum temperature for each year using Pig Latin.


GL BAJAJ Institute of Technologies & Management, Greater Noida
[Approved by AICTE, Govt. of India & Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow, U.P., India]
Department of Applied Computational Science & Engineering

Program-6

Aim: Installation of Hive along with practice examples.


Theory: Apache Hive is a distributed, fault-tolerant data warehouse system that enables
analytics at a massive scale. A data warehouse provides a central store of
information that can easily be analyzed to make informed, data-driven decisions.
Hive allows users to read, write, and manage petabytes of data using SQL.
Prerequisites:
 JDK (Java) installed on the system
 Hadoop installed
 7 Zip

Steps:
Step 1: Check whether Java is installed using the following command.
$ java -version
Step 2: Check whether Hadoop is installed using the following command.

$ hadoop version
Step 3: Download the Apache Hive archive apache-hive-2.3.9-bin.tar.gz from the Apache Hive
downloads page.

Step 4: Unzip and Install Hive:


The following commands are used to extract the Hive archive and verify the download:

$ tar zxvf apache-hive-2.3.9-bin.tar.gz


$ ls

On successful extraction, you get to see the following response:


apache-hive-2.3.9-bin apache-hive-2.3.9-bin.tar.gz
Step 5: Copying files to the /usr/local/hive directory:
We need to copy the files as the superuser ("su -"). The following commands are
used to move the extracted directory to the /usr/local/hive directory.

$ su -
password:

# cd /home/cloudera/user/Downloads
# mv apache-hive-2.3.9-bin /usr/local/hive
# exit

Step 6: Setting up environment for Hive:


You can set up the Hive environment by appending the following lines to ~/.bashrc file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

The following command is used to execute ~/.bashrc file:


$ source ~/.bashrc

Step 7: Configuring Hive:


To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed
in the $HIVE_HOME/conf directory. The following commands change to the Hive config folder
and copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:


export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully.

Step 8: Verify the Hive installation with the following commands:

$ cd $HIVE_HOME
$ bin/hive
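As a practice example, the following HiveQL statements can be tried once Hive starts. This is a minimal sketch: the database name (practice), table name (employee), and data file path (/home/cloudera/emp.txt) are illustrative assumptions, not part of the installation steps above. Saved as, say, practice.hql, the script can be run with hive -f practice.hql, or typed line by line at the hive> prompt.

-- create a database and a comma-delimited table (names are assumed for illustration)
CREATE DATABASE IF NOT EXISTS practice;
USE practice;
CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- load a local comma-separated file into the table (path is an assumption)
LOAD DATA LOCAL INPATH '/home/cloudera/emp.txt' INTO TABLE employee;
-- simple queries on the loaded data
SELECT name, salary FROM employee WHERE salary > 30000;
SELECT COUNT(*) FROM employee;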

Program-7

Aim: Installation of HBase and Thrift, along with practice examples.


Theory: HBase provides low latency random read and write access to petabytes of data by
distributing requests from applications across a cluster of hosts. Each host has
access to data in HDFS and S3, and serves read and write requests in milliseconds.

Steps:
Step 1: Download the HBase file hbase-3.0.0-beta-1-bin.tar.gz from Apache website.

Step 2: Unzip and Move:

$ cd /home/cloudera/user/Downloads
$ sudo tar -zxvf hbase-3.0.0-beta-1-bin.tar.gz
$ sudo mv hbase-3.0.0-beta-1 /usr/local/hbase

Step 3: Edit hbase-env.sh and hbase-site.xml:

$ cd /usr/local/hbase/conf

In the hbase-env.sh file, you need to export the JAVA_HOME path. First, check the
JAVA_HOME path on your system. You can display your Java home using the command
given below:

$ echo $JAVA_HOME

On this system the above command gives the output /usr/lib/jvm/java-7-openjdk-amd64.
To update hbase-env.sh, we need to run the command given below:

$ sudo nano hbase-env.sh

Copy and paste the lines below into the hbase-env.sh file:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HBASE_REGIONSERVERS=/usr/local/hbase/conf/regionservers
export HBASE_MANAGES_ZK=true

Use Ctrl+X and Y to save.



Your hbase-env.sh file will look like the image given below:

Now update the .bashrc file to export hbase variables:

$ sudo nano ~/.bashrc

Copy and paste the below lines at the end of .bashrc:

#HBASE VARIABLES START


export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
#HBASE VARIABLES END

Image below explains how to append the HBASE variables:

Use Ctrl+X and Y to save.

To apply the above changes to .bashrc in the current shell session, we need to run the
command given below:

$ source ~/.bashrc

Now to update the hbase-site.xml file, use the command given below:

$ cd /usr/local/hbase/conf
$ sudo nano hbase-site.xml

Add the following properties between the <configuration> tags, so the file contains:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/hadoop/zookeeper</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>

Use Ctrl+X and Y to save

Your hbase-site.xml file will look like the image given below:

Step 4: Starting HBASE:

$ cd /usr/local/hbase/bin
$ sudo chown -R hduser:hadoop /usr/local/hbase/
$ ./start-hbase.sh

At start-up, you will get a message that looks like the one shown in the image given below:

Now run the jps command to check the HBase daemons:

$ jps
You will get the output as shown in the image given below:

To check the HBase Directory in HDFS:

$ hadoop fs -ls hdfs://localhost:54310/hbase/

You will get the output as shown in the image given below:
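Once the daemons are running, a few practice commands can be tried in the HBase shell. This is a minimal sketch; the table name (student) and column family (info) are assumptions made for illustration.

$ hbase shell
hbase> create 'student', 'info'                # table with one column family
hbase> put 'student', '1', 'info:name', 'Asha' # insert a cell into row '1'
hbase> put 'student', '1', 'info:marks', '85'
hbase> get 'student', '1'                      # read back a single row
hbase> scan 'student'                          # list all rows in the table
hbase> list                                    # show all tables
hbase> disable 'student'                       # a table must be disabled before dropping
hbase> drop 'student'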

Program-8

Aim: Write Pig commands: write Pig Latin scripts to sort, group, join, project, and filter
your data.

Pig Latin Script for Sorting:


The ORDER BY operator is used to display the contents of a relation in a sorted order based
on one or more fields.
Syntax:

grunt> Relation2_name = ORDER Relation1_name BY field_name (ASC|DESC);
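For example, assuming a hypothetical comma-separated file student.txt with id, name, and age fields, the records can be sorted by age in descending order as follows:

grunt> students = LOAD '/user/cloudera/student.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
grunt> sorted_students = ORDER students BY age DESC;   -- highest age first
grunt> DUMP sorted_students;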

Pig Latin Script for Group:


The GROUP operator is used to group the data in one or more relations. It collects the data
having the same key.
Syntax:
grunt> Group_data = GROUP Relation_name BY age;
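Using the same hypothetical student relation, grouping by age collects all tuples that share the same age value:

grunt> students = LOAD '/user/cloudera/student.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
grunt> group_by_age = GROUP students BY age;
grunt> DUMP group_by_age;   -- each output tuple is (age, {bag of student tuples with that age})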

Pig Latin Script for Join:


The JOIN operator is used to combine records from two or more relations. While performing
a join operation, we declare one (or a group of) fields from each relation as keys. When
these keys match, the two particular tuples are matched; otherwise the records are dropped.
Joins can be of the following types −

 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
Syntax:
Given below is the syntax for performing a self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Here is the syntax for performing an inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
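As an illustration, assume two hypothetical comma-separated files, customers.txt (id, name) and orders.txt (order_id, customer_id, amount). An inner join on the customer id can be written as:

grunt> customers = LOAD '/user/cloudera/customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> orders = LOAD '/user/cloudera/orders.txt' USING PigStorage(',') AS (order_id:int, customer_id:int, amount:double);
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;   -- keeps only matching keys
grunt> DUMP customer_orders;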

Pig Latin Script for Filter:


The FILTER operator is used to select the required tuples from a relation based on a
condition.
Syntax:

grunt> Relation2_name = FILTER Relation1_name BY (condition);
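For example, keeping only the students older than 20 from the hypothetical student relation used above:

grunt> students = LOAD '/user/cloudera/student.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
grunt> adults = FILTER students BY age > 20;   -- keep tuples satisfying the condition
grunt> DUMP adults;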



Program-9

Aim: Run Pig Latin scripts to find the word count.


Program:
Assume we have data in a file like the lines below:
This is a hadoop class
hadoop is a bigdata technology
We want to generate the count of each word as output, like below:

(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)

Steps to generate this using Pig Latin:

Step 1: Load the data from HDFS


Use the LOAD statement to load the data into a relation.
The AS keyword is used to declare column names; since the file has no defined columns, we
declare only one column, named line. (Note: input is a reserved keyword in Pig, so the relation is named lines here.)
lines = LOAD '/path/to/file/' AS (line:chararray);
Step 2: Convert the sentences into words
The data we have is in sentences, so we have to convert it into words using the
TOKENIZE function:

TOKENIZE(line)
or, if we have a delimiter such as a space, we can specify it as:
TOKENIZE(line,' ')

Output will be like this:


({(This),(is),(a),(hadoop),(class)})

({(hadoop),(is),(a),(bigdata),(technology)})

but we have to convert it into multiple rows like below:


(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
Convert columns into rows

We have to convert every line of data into multiple rows; for this, Pig provides the
FLATTEN function.

Using the FLATTEN function, the bag is converted into tuples, i.e. the array of strings is
converted into multiple rows.

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;

Then the output is like below:

(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)

Step 3: Apply GROUP BY

We have to count the occurrences of each word; for that, we have to group all the words.
Grouped = GROUP words BY word;

Step 4: Generate word count


wordcount = FOREACH Grouped GENERATE group, COUNT(words);

We can print the word count on console using Dump.


DUMP wordcount;

Output will be like below:


(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)

Below is the complete program for the same:

lines = LOAD '/path/to/file/' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;
Grouped = GROUP words BY word;
wordcount = FOREACH Grouped GENERATE group, COUNT(words);
DUMP wordcount;
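If the statements above are saved in a script file (say wordcount.pig, an assumed name), the script can be run either in local mode or against the cluster:

$ pig -x local wordcount.pig    # run using the local file system
$ pig wordcount.pig             # run in MapReduce mode against HDFS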

Program-10

Aim: Run Pig Latin scripts to find the maximum temperature for each year.

Steps for Word Count using Pig Latin:


Step 1:
1. Create a text file having a few lines of text and save it as bd.txt.
2. Create a directory in HDFS named wc.
3. Copy the bd.txt file from the local file system to the HDFS directory wc.

Step 2:
inputline = load '/user/cloudera/wc/bd.txt' using PigStorage('\t') as (data:chararray);
words = FOREACH inputline GENERATE FLATTEN(TOKENIZE(data)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;

You can use the below command to save the result in HDFS:
grunt> STORE ordered_word_count INTO '/user/cloudera/wc/output/';

Find the maximum temperature in a given dataset using Pig


Option 1: Using GROUP ALL and MAX
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MAX(A.Temp);
DUMP C;

or

Option 2: Using ORDER and LIMIT


A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = ORDER A BY Temp DESC;
C = LIMIT B 1;
D = FOREACH C GENERATE Temp;
DUMP D;
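Note that both options above return a single overall maximum temperature. To match the aim of this program, i.e. the maximum temperature for each year, the relation should be grouped by Year instead of using GROUP ALL. A minimal sketch, assuming the same tab-delimited (Year, Temp) input:

A = LOAD 'input' USING PigStorage() AS (Year:int, Temp:int);
B = GROUP A BY Year;                               -- one group per year
C = FOREACH B GENERATE group AS Year, MAX(A.Temp) AS MaxTemp;
DUMP C;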
