Big Data Analytics lab file
GL BAJAJ Institute of Technology & Management, Greater Noida
[Approved by AICTE, Govt. of India & Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow, U.P., India]
Department of Applied Computational Science & Engineering
Program-6
Aim: To install Apache Hive on top of Hadoop.
Steps:
Step 1: Check whether Java is installed using the following command.
$ java -version
Step 2: Check whether Hadoop is installed using the following command.
$ hadoop version
Step 3: Download the Apache Hive archive apache-hive-2.3.9-bin.tar.gz from the Apache Hive downloads page.
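If the file is not already available, it can also be fetched from the Apache archive (the mirror URL below is an assumption and may differ):
$ wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz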
$ su -
password:
# cd /home/cloudera/user/Download
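If the archive has not been extracted yet, unpack it first (this assumes the tarball was downloaded into this directory):
# tar -zxvf apache-hive-2.3.9-bin.tar.gz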
# mv apache-hive-2.3.9-bin /usr/local/hive
# exit
Add the following lines to the ~/.bashrc file to set up the Hive environment:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
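After saving ~/.bashrc, reload it so the variables take effect in the current shell:
$ source ~/.bashrc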
To verify the installation, start the Hive shell from the Hive home directory:
$ bin/hive
Program-7
Aim: To install and configure Apache HBase on top of Hadoop.
Steps:
Step 1: Download the HBase file hbase-3.0.0-beta-1-bin.tar.gz from Apache website.
$ cd /home/cloudera/user/Downloads
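If the file is not already present in this directory, it can be fetched from the Apache archive (the mirror path below is an assumption and may differ):
$ wget https://archive.apache.org/dist/hbase/3.0.0-beta-1/hbase-3.0.0-beta-1-bin.tar.gz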
$ sudo tar -zxvf hbase-3.0.0-beta-1-bin.tar.gz
$ sudo mv hbase-3.0.0-beta-1 /usr/local/hbase
$ cd /usr/local/hbase/conf
In the hbase-env.sh file, you need to export the JAVA_HOME path. First, check the JAVA_HOME path on your system. You can display it using the command given below:
$ echo $JAVA_HOME
On this system the above command gives the output /usr/lib/jvm/java-7-openjdk-amd64.
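The file can then be opened for editing (nano is used here, as elsewhere in this manual; any editor works):
$ sudo nano hbase-env.sh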
In hbase-env.sh, add or update the following lines:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HBASE_REGIONSERVERS=/usr/local/hbase/conf/regionservers
export HBASE_MANAGES_ZK=true
Save and close the hbase-env.sh file.
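Optionally, HBase can also be added to the shell environment so that the hbase command is available from any directory. A minimal sketch of the ~/.bashrc entries, assuming HBase was moved to /usr/local/hbase:
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin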
After adding these entries to ~/.bashrc, reload the file so they take effect in the current shell:
$ source ~/.bashrc
Now to update the hbase-site.xml file, use the command given below:
$ cd /usr/local/hbase/conf
$ sudo nano hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/hadoop/zookeeper</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
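Note that the host and port in hbase.rootdir must match the HDFS NameNode address configured in Hadoop's core-site.xml. A sketch of the matching Hadoop property, assuming this setup (check your own core-site.xml for the actual value; on newer Hadoop versions the property is named fs.defaultFS):
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>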
Save and close the hbase-site.xml file. Now set the ownership of the HBase directory and start HBase using the commands given below:
$ cd /usr/local/hbase/bin
$ sudo chown -R hduser:hadoop /usr/local/hbase/
$ ./start-hbase.sh
Messages confirming that the HBase daemons are starting will be printed at this point.
Now run the jps command to check the HBase daemons:
$ jps
The output should list the HBase daemons (HMaster, HRegionServer, and HQuorumPeer) along with the Hadoop daemons.
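As a further check, the HBase shell can be started and the cluster status queried (this assumes the daemons above are running):
$ /usr/local/hbase/bin/hbase shell
hbase> status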
Program-8
Aim: Write Pig commands: write Pig Latin scripts to sort, group, join, project, and filter your data.
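A minimal sketch of the sort, group, project, and filter operations, assuming a tab-separated file /user/cloudera/pig/students.txt with (id, name, marks) fields (the file name and schema are illustrative):
grunt> students = LOAD '/user/cloudera/pig/students.txt' USING PigStorage('\t') AS (id:int, name:chararray, marks:int);
grunt> passed = FILTER students BY marks >= 40;    -- filter rows
grunt> names = FOREACH passed GENERATE id, name;   -- project columns
grunt> by_marks = ORDER students BY marks DESC;    -- sort
grunt> by_name = GROUP students BY name;           -- group
grunt> DUMP by_marks;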
The JOIN operator supports the following types of join:
Self-join
Inner-join
Outer-join: left join, right join, and full join
Syntax:
Given below is the syntax for performing a self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
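For a self-join, the same data is loaded under two different aliases, since a relation cannot be joined with itself directly. A sketch, assuming a customers.txt file with (id, name) fields (the file and names are illustrative):
grunt> customers1 = LOAD '/user/cloudera/pig/customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> customers2 = LOAD '/user/cloudera/pig/customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
grunt> DUMP customers3;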
Here is the syntax for performing an inner-join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
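The outer joins use the same JOIN operator with the LEFT, RIGHT, or FULL keyword (the OUTER keyword is optional). A sketch with illustrative relation and column names:
grunt> left_result = JOIN relation1 BY columnname LEFT OUTER, relation2 BY columnname;
grunt> right_result = JOIN relation1 BY columnname RIGHT OUTER, relation2 BY columnname;
grunt> full_result = JOIN relation1 BY columnname FULL OUTER, relation2 BY columnname;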
Program-9
Aim: Write a Pig Latin script to count the number of occurrences of each word in a text file (word count).
For the sample input lines "This is a hadoop class" and "hadoop is a bigdata technology", the expected word-count output is:
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
The TOKENIZE function splits each line into a bag of words:
(TOKENIZE(line));
or, if the data has a delimiter such as a space, it can be specified as:
(TOKENIZE(line,' '));
For example, the line "hadoop is a bigdata technology" becomes the bag:
({(hadoop),(is),(a),(bigdata),(technology)})
Each line must then be converted into multiple rows, one row per word. Pig provides the FLATTEN function for this: FLATTEN un-nests the bag produced by TOKENIZE, so the bag of words is turned into separate tuples (rows). After applying FLATTEN, the words appear as individual rows:
(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
To count how many times each word occurs, the words must first be grouped:
Grouped = GROUP words BY word;
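Putting these steps together, a minimal word-count sketch (the input path and relation names are illustrative):
grunt> lines = LOAD '/user/cloudera/wc/input.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> Grouped = GROUP words BY word;
grunt> wordcount = FOREACH Grouped GENERATE group AS word, COUNT(words) AS count;
grunt> DUMP wordcount;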
Program-10
Aim: Run Pig Latin scripts to find the maximum temperature for each year.
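A minimal sketch of such a script, assuming a tab-separated input file with (year, temperature) fields (the path, file name, and schema are illustrative):
grunt> records = LOAD '/user/cloudera/temp/temperature.txt' USING PigStorage('\t') AS (year:chararray, temperature:int);
grunt> grouped = GROUP records BY year;
grunt> max_temp = FOREACH grouped GENERATE group AS year, MAX(records.temperature) AS max_temperature;
grunt> DUMP max_temp;
The word-count script recorded below follows the same LOAD, GROUP, FOREACH, and ORDER pattern.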
Step 2:
inputline = load '/user/cloudera/wc/bd.txt' using PigStorage('\t') as (data:chararray);
words = FOREACH inputline GENERATE FLATTEN(TOKENIZE(data)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;
You can use the command below to save the result in HDFS:
grunt> STORE ordered_word_count INTO '/user/cloudera/wc/output/';
or