
Prestige Institute of Engineering Management & Research, Indore
Department of Computer Science & Engineering
Session: 2023-2024

Practical File
Big Data

Submitted to:                              Submitted by:
Dr. Aradhana Negi                          Mahesh Chouhan
Assistant Professor                        0863CS191076
Department of CSE                          BTech CSE
PIEMR, Indore                              Semester: VII

LIST OF EXPERIMENTS

S.No.   Name of Experiment                                                Date of Experiment    Faculty Signature

1. Study and explore the Hadoop installation
   ● Hadoop Installation: Ubuntu operating system in stand-alone mode
   ● Hadoop Installation: Pseudo-distributed mode (locally)
   ● Hadoop Installation: Pseudo-distributed mode (YARN)
2. Accomplish the file management tasks in Hadoop
3. Word count MapReduce program to understand the MapReduce paradigm
4. Implementing matrix multiplication with Hadoop MapReduce
5. Pig Latin scripts to use basic commands, sort, group, join, and filter the data
6. Execute different Hive commands
7. Explore MongoDB for Big Data

Experiment-1

Aim: Installation of Hadoop

Hadoop software can be installed in three modes of operation:


1. Stand-Alone Mode: Hadoop is distributed software and is designed to run on a cluster of commodity machines. However, we can install it on a single node in stand-alone mode. In this mode, the Hadoop software runs as a single monolithic Java process. This mode is extremely useful for debugging purposes. You can first test-run your MapReduce application in this mode on small data before actually executing it on a cluster with big data.
2. Pseudo-Distributed Mode: In this mode also, the Hadoop software is installed on a single node. The various Hadoop daemons run on the same machine as separate Java processes. Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, run on a single machine.
3. Fully Distributed Mode: In fully distributed mode, the daemons NameNode, JobTracker and SecondaryNameNode (optional, and can be run on a separate node) run on the master node. The daemons DataNode and TaskTracker run on the slave nodes.

Hadoop Installation: Ubuntu Operating System in stand-alone mode

Steps for Installation

I. sudo apt-get update

II. In this step, we will install the latest version of JDK (1.8) on the machine.

The Oracle JDK is the official JDK; however, it is no longer provided by Oracle as a default installation for Ubuntu. You can still install it using apt-get.

To install any version, first execute the following commands:

● sudo apt-get install python-software-properties
● sudo add-apt-repository ppa:webupd8team/java
● sudo apt-get update

Then, depending on the version you want to install, execute one of the following commands:

Oracle JDK 7: sudo apt-get install oracle-java7-installer

Oracle JDK 8: sudo apt-get install oracle-java8-installer

III. Now, let us set up a new user account for the Hadoop installation. This step is optional, but recommended because it gives you the flexibility of a separate account for Hadoop, keeping this installation isolated from other software installations.
● a. sudo adduser hadoop_dev (Upon executing this command, you will be prompted to enter a password for this user. Enter the password and the other details, and don't forget to save the details at the end.)


● b. su - hadoop_dev (Switches from the current user to the newly created user, i.e. hadoop_dev.)

IV. Download the latest Hadoop distribution.

● a. Visit the Apache Hadoop releases page and choose one of the mirror sites. You can copy the download link and use "wget" to download it from the command prompt:

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

V. Untar the file: tar xvzf hadoop-2.7.0.tar.gz

VI. Rename the folder to hadoop2: mv hadoop-2.7.0 hadoop2

VII. Edit the configuration file /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh

• vim /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh

• Uncomment JAVA_HOME and update it with the following line: export JAVA_HOME=/usr/lib/jvm/java-8-oracle (Please check your relevant Java installation and set this value accordingly. Recent versions of Hadoop require JDK 1.7 or later.)
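If you are unsure of the exact path on your machine, the following command (available on most Ubuntu systems) prints the location of the active Java binary; strip the trailing /bin/java to obtain the directory to use for JAVA_HOME:

readlink -f $(which java)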

VIII. Let us verify whether the installation is successful (change to the installation directory: cd /home/hadoop_dev/hadoop2/):

● bin/hadoop (running this command should prompt you with various options)

IX. This finishes the Hadoop setup in stand-alone mode.

X. Let us run a sample Hadoop program that is provided in the download package:

● $ mkdir input (create the input directory)
● $ cp etc/hadoop/*.xml input (copy all the xml files to the input folder)
● $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+' (find all the lines matching the pattern 'dfs[a-z.]+' and write the matches to the output directory)
● $ cat output/* (look for the output in the output directory that Hadoop creates for you)

Hadoop Installation: Pseudo-Distributed Mode (Locally)

Steps for Installation

I. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/core-site.xml as below:

<configuration>


<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Note: This change sets the namenode ip and port.
II. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/hdfs-site.xml as below:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Note: This change sets the default replication count for blocks used by HDFS.
III. We need to set up passwordless login so that the master will be able to do a passwordless ssh to start the daemons on all the slaves.

Check whether an ssh server is running on your host:

● ssh localhost (enter your password; if you are able to log in, then the ssh server is running)
● If you are unable to log in in the step above, install ssh as follows:

sudo apt-get install ssh

● Set up passwordless login as below:

01. ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
02. cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
IV. We can run Hadoop jobs locally or on YARN in this mode. Here, we will focus on running the jobs locally.

V. Format the file system. Formatting the NameNode initializes the metadata related to the DataNodes. By doing that, all the information on the DataNodes is lost and they can be reused for new data:

bin/hdfs namenode -format

VI. Start the daemons: sbin/start-dfs.sh (starts NameNode and DataNode). You can check whether the NameNode has started successfully by using the following web interface: http://0.0.0.0:50070. If you are unable to see this, check the logs in the /home/hadoop_dev/hadoop2/logs folder.


VII. You can check whether the daemons are running or not by issuing the jps command.

VIII. This finishes the installation of Hadoop in pseudo-distributed mode.

IX. Let us run the same example:

● Create a new directory on HDFS:
bin/hdfs dfs -mkdir -p /user/hadoop_dev
● Copy the input files for the program to HDFS:
bin/hdfs dfs -put etc/hadoop input
● Run the program:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
● View the output on HDFS:
bin/hdfs dfs -cat output/*
X. Stop the daemons when you are done executing the jobs, with the below command:
● sbin/stop-dfs.sh

Hadoop Installation: Pseudo-Distributed Mode (YARN)

Steps for Installation

I. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/mapred-site.xml as below:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

II. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/yarn-site.xml as below:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>


Note: This particular configuration tells MapReduce how to do its shuffle. In this
case it uses the mapreduce_shuffle.
III. Format the NameNode:

bin/hdfs namenode -format

IV. Start the daemons using the command:


● sbin/start-yarn.sh

This starts the ResourceManager and NodeManager daemons. Once this command is run, you can check whether the ResourceManager is running by visiting the following URL in the browser: http://0.0.0.0:8088. If you are unable to see this, check the logs in the directory /home/hadoop_dev/hadoop2/logs.

V. To check whether the services are running, issue the jps command. The following shows all the services necessary to run YARN on a single server:

$ jps
15933 Jps
15567 ResourceManager
15785 NodeManager
VI. Let us run the same example as we ran before:

● Create a new directory on HDFS:
bin/hdfs dfs -mkdir -p /user/hadoop_dev
● Copy the input files for the program to HDFS:
bin/hdfs dfs -put etc/hadoop input
● Run the program:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
● View the output on HDFS:

bin/hdfs dfs -cat output/*

VII. Stop the daemons when you are done executing the jobs, with the below command:
● sbin/stop-yarn.sh

This completes the installation part of Hadoop.

Experiment-2

Aim: To accomplish the file management tasks in Hadoop


1. Create a directory in HDFS at given path(s).
Syntax:
hadoop fs -mkdir <paths>


Example: hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. List the contents of a directory.


Syntax :
hadoop fs -ls <args>
Example: hadoop fs -ls /user/saurzcode

3. Upload and download a file in HDFS.

Upload: hadoop fs -put:


Copy single src file, or multiple src files from local file system to the Hadoop data
file system
Syntax:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Download: hadoop fs -get:

Copies/downloads files to the local file system.
Syntax:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. View the contents of a file. Same as the Unix cat command.
Syntax:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt

5. Copy a file from source to destination.

This command also allows multiple sources, in which case the destination must be a directory.
Syntax:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

6. Copy a file from/to the local file system to/from HDFS.

copyFromLocal
Syntax:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Similar to the put command, except that the source is restricted to a local file reference.


copyToLocal
Syntax:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file
reference.

7. Move a file from source to destination.

Note: Moving files across filesystems is not permitted.
Syntax:
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

8. Remove a file or directory in HDFS.

Removes the files specified as arguments. Deletes a directory only when it is empty.
Syntax:
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
Recursive version of delete:
Syntax:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/

9. Display the last few lines of a file. Similar to the tail command in Unix.
Syntax:
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt

10. Display the aggregate length of a file.

Syntax:
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt

Experiment-3

Aim: Word count MapReduce program to understand the MapReduce paradigm

Source code:
import java.io.IOException;


import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path outputPath = new Path(args[1]);
        //Configuring the input/output path from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        //deleting the output path automatically from hdfs so that we don't have to delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath);
        //exiting the job only if the flag value becomes false
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The entire MapReduce program can be fundamentally divided into three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code

Mapper code:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}

I. We have created a class Map that extends the class Mapper, which is already defined in the MapReduce framework.

II. We define the data types of the input and output key/value pairs after the class declaration, using angle brackets.

III. Both the input and the output of the Mapper are key/value pairs.
• Input:
◦ The key is nothing but the offset of each line in the text file: LongWritable
◦ The value is each individual line: Text
• Output:
◦ The key is the tokenized word: Text
◦ The value is hardcoded to 1 in our case: IntWritable
◦ Example – Dear 1, Bear 1, etc.
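For instance, for a hypothetical input line "Dear Bear River Dear", the map() method above would emit the pairs:

(Dear, 1), (Bear, 1), (River, 1), (Dear, 1)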


• We have written Java code that tokenizes each word and assigns it a hardcoded value equal to 1.

Reducer Code:

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

I. We have created a class Reduce which extends the class Reducer, just like the Mapper.
II. We define the data types of the input and output key/value pairs after the class declaration, using angle brackets, as done for the Mapper.
III. Both the input and the output of the Reducer are key/value pairs.
• Input:
◦ The key is nothing but the unique words that have been generated after the sorting and shuffling phase: Text
◦ The value is a list of integers corresponding to each key: IntWritable
◦ Example – Bear, [1, 1], etc.
• Output:
◦ The key is all the unique words present in the input text file: Text
◦ The value is the number of occurrences of each of the unique words: IntWritable
◦ Example – Bear, 2; Car, 3, etc.
IV. We have aggregated the values present in the list corresponding to each key and produced the final answer.
V. In general, the reduce method is called once for each unique word (key); the number of reducer tasks can be configured in mapred-site.xml.
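Continuing the hypothetical line "Dear Bear River Dear" from the Mapper section, after sorting and shuffling the reducer receives the grouped lists on the left and emits the counts on the right:

(Bear, [1])     ->  (Bear, 1)
(Dear, [1, 1])  ->  (Dear, 2)
(River, [1])    ->  (River, 1)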

Driver Code:

Configuration conf= new Configuration();


Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);


job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

I. In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
II. We specify the name of the job and the data types of the input/output of the mapper and reducer.
III. We also specify the names of the mapper and reducer classes.
IV. The paths of the input and output folders are also specified.
V. The method setInputFormatClass() is used to specify how a Mapper will read the input data, i.e. what the unit of work will be. Here, we have chosen TextInputFormat so that the mapper reads a single line at a time from the input text file.
VI. The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job.

Run the MapReduce code:

The command for running a MapReduce code is:

hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
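Before this command can be run, the classes above have to be compiled against the Hadoop libraries and packaged into a jar. A minimal sketch, assuming the source is saved as WordCount.java and the jar is named hadoop-mapreduce-example.jar to match the command above:

mkdir -p classes
javac -classpath $(hadoop classpath) -d classes WordCount.java     (compile against the Hadoop jars)
jar -cvf hadoop-mapreduce-example.jar -C classes/ .                (package the compiled classes)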

Experiment-4

Aim: Implementing Matrix Multiplication with Hadoop Map Reduce

Source Code

import java.io.IOException;
import java.util.*;


import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoStepMatrixMultiplication {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] indicesAndValue = line.split(",");
            Text outputKey = new Text();
            Text outputValue = new Text();
            if (indicesAndValue[0].equals("A")) {
                outputKey.set(indicesAndValue[2]);
                outputValue.set("A," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            } else {
                outputKey.set(indicesAndValue[1]);
                outputValue.set("B," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String[] value;
            ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer, Float>>();


            ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer, Float>>();
            for (Text val : values) {
                value = val.toString().split(",");
                if (value[0].equals("A")) {
                    listA.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
                } else {
                    listB.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
                }
            }
            String i;
            float a_ij;
            String k;
            float b_jk;
            Text outputValue = new Text();
            for (Entry<Integer, Float> a : listA) {
                i = Integer.toString(a.getKey());
                a_ij = a.getValue();
                for (Entry<Integer, Float> b : listB) {
                    k = Integer.toString(b.getKey());
                    b_jk = b.getValue();
                    outputValue.set(i + "," + k + "," + Float.toString(a_ij * b_jk));
                    context.write(null, outputValue);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "MatrixMatrixMultiplicationTwoSteps");
        job.setJarByClass(TwoStepMatrixMultiplication.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);


        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:9000/matrixin"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/matrixout"));
        job.waitForCompletion(true);
    }
}
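The mapper splits every input line on commas, so the files under /matrixin are assumed to hold one matrix entry per line in the form matrixName,row,column,value. A hypothetical input for two 2x2 matrices A and B would look like:

A,0,0,1.0
A,0,1,2.0
A,1,0,3.0
A,1,1,4.0
B,0,0,5.0
B,0,1,6.0
B,1,0,7.0
B,1,1,8.0

The reducer writes one partial product per line in the form i,k,A(i,j)*B(j,k); summing these partial products per (i,k) cell is left to a second job, which is why the class is named TwoStepMatrixMultiplication.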

Experiment-5

Aim: Pig Latin scripts to use basic commands, sort, group, join, and
filter the data.

Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.


Pig Latin statements are used to process the data. A statement is an operator that accepts a relation as input and produces another relation as output.

• It can span multiple lines.
• Each statement must end with a semicolon.
• It may include expressions and schemas.
• By default, these statements are processed using multi-query execution.
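These statements can be entered interactively in the Grunt shell or collected in a script file and submitted to Pig. A minimal sketch, assuming the statements of the following sections are saved in a hypothetical file named student.pig:

pig -x local student.pig          (run against local files, without a Hadoop cluster)
pig -x mapreduce student.pig      (run as MapReduce jobs against data in HDFS)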

Apache Pig FILTER Operator

The Apache Pig FILTER operator is used to select the tuples of a relation that satisfy a condition; tuples that do not satisfy the condition are dropped from the result.

Input: Student (rollno:int, name:chararray, gpa:float)

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = FILTER A BY gpa > 4.0;

DUMP B;

Apache Pig GROUP Operator

The Apache Pig GROUP operator is used to group the data in one or more relations. It groups together the tuples that contain the same group key. If the group key has more than one field, the key is a tuple; otherwise it is of the same type as the group key. The result is a relation that contains one tuple per group.

Input: Student (rollno:int, name:chararray, gpa:float)

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = GROUP A BY gpa;

DUMP B;

Apache Pig JOIN Operator

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one field (or a group of fields) from each relation as the join key. When these keys match, the two particular tuples are joined; otherwise the records are dropped.

Input: Student (rollno:int, name:chararray, gpa:float)


Department (rollno:int, deptno:int, deptname:chararray)

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = LOAD '/pigdemo/department.tsv' AS (rollno:int, deptno:int, deptname:chararray);

C = JOIN A BY rollno, B BY rollno;

DUMP C;

DUMP B;

Apache Pig ORDER BY Operator

The ORDER BY operator is used to display the contents of a relation in sorted order based on one or more fields.

Input: Student (rollno:int, name:chararray, gpa:float)

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = ORDER A BY name;

DUMP B;

About Pig


Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.

Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig. Through the User Defined Functions (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems. Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Distributed File System.

Experiment-6

Aim: Execute different hive commands.

Apache Hive is a data warehouse and ETL tool that provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop. It is a software project that provides data query and analysis. It facilitates reading, writing and handling large datasets stored in distributed storage, queried using an SQL-like syntax. It is not built for Online Transaction Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault tolerance and loose coupling with its input formats.

Hive includes many features that make it a useful tool for big data analysis, including support for
partitioning, indexing, and user-defined functions (UDFs). It also provides a number of
optimization techniques to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL (extract,
transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big data industry,
especially in companies that have adopted the Hadoop ecosystem as their primary data processing
platform.

Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and user-defined functions (UDFs).

Hive DDL Commands

• Create Database Statement

A database in Hive is a namespace or a collection of tables.

hive> CREATE SCHEMA userdb;

hive> SHOW DATABASES;

Drop database

hive> DROP DATABASE IF EXISTS userdb;

Creating Hive Tables

Create a table called Sonoo with two columns, the first being an integer and the other a string.

hive> CREATE TABLE Sonoo(foo INT, bar STRING);

Create a table called HIVE_TABLE with two columns and a partition column called ds. The
partition column is a virtual column. It is not part of the data itself but is derived from the partition
that a particular dataset is loaded into. By default, tables are assumed to be of text input format and
the delimiters are assumed to be ^A(ctrl-a).

hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds STRING);
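Data can then be loaded into one specific partition of such a table. A minimal sketch, assuming a local file kv2.txt and an arbitrary partition value:

hive> LOAD DATA LOCAL INPATH './kv2.txt' OVERWRITE INTO TABLE HIVE_TABLE PARTITION (ds='2023-11-01');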

Browse the table


hive> Show tables;

Altering and Dropping Tables

hive> ALTER TABLE Sonoo RENAME TO Kafka;


hive> ALTER TABLE Kafka ADD COLUMNS (col INT);
hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a comment');
hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight STRING, baz INT COMMENT 'baz replaces new_col2');

Hive DML Commands

• LOAD DATA

hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;

• SELECTS and FILTERS

hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';

• GROUP BY

hive> SELECT E.Address, count(*) FROM Employee E GROUP BY E.Address;

Hive Sort By vs Order By

Hive sort by and order by commands are used to fetch data in sorted order. The main difference between them is that SORT BY sorts the data within each reducer (so the overall output may be only partially sorted), whereas ORDER BY guarantees a total ordering of the output by passing all the data through a single reducer.

• Sort by

hive> SELECT E.EMP_ID FROM Employee E SORT BY E.empid;

• Order by

hive> SELECT E.EMP_ID FROM Employee E ORDER BY E.empid;


Hive Join

• Inner joins

Select * from employee join employeedepartment ON (employee.empId = employeedepartment.empId);

Output : <<InnerJoin.png>>

• Left outer joins

Select e.empId, empName, department from employee e Left outer join employeedepartment ed on (e.empId = ed.empId);

Output : <<LeftOuterJoin.png>>

• Right outer joins

Select e.empId, empName, department from employee e Right outer join employeedepartment ed on (e.empId = ed.empId);

Experiment-7

Aim: To explore MongoDB commands.

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. In simple words, MongoDB is a document-oriented database. It is an open-source product, developed and supported by a company named 10gen. MongoDB is available for free under a public license, and it is also available under a commercial license from the manufacturer.

MongoDB Database commands

The MongoDB database commands are used to create, modify, and update the database.


1. show databases
In MongoDB, you can use the show dbs command to list all databases on a MongoDB server. This
will show you the database name, as well as the size of the database in gigabytes.

Example:

> show dbs;


admin (empty)
local 0.078GB
test 0.078GB
>

2. create database

In MongoDB a database is created implicitly: switching to it with use creates it as soon as data is stored in it. After connecting using mongosh, you can see which database you are currently using by typing db in your terminal.

>use myDB;
switched to db myDB
>

3. use database
MongoDB use DATABASE_NAME is used to create database. The command will create a new
database if it doesn't exist, otherwise it will return the existing database.

>use mydb
switched to db mydb

4. drop database
MongoDB's db.dropDatabase() command is used to drop an existing database.
>db.dropDatabase()

>db.dropDatabase()
>{ "dropped" : "myDB", "ok" : 1 }
>

5. Create a collection
You can create a collection using the createCollection() database method.

>db.createCollection("Students");
{ "ok" : 1 }
>

6. drop collection
MongoDB's db.collection.drop() is used to drop a collection from the database.

>db.mycollection.drop()
true
>

7. insert document
To insert data into MongoDB collection, you need to use MongoDB's insert() or save() method.

> db.users.insert({
... _id : ObjectId("507f191e810c19729de860ea"),
... title: "MongoDB Overview",
... description: "MongoDB is no sql database",
... by: "tutorials point",
... url: "http://www.tutorialspoint.com",
... tags: ['mongodb', 'database', 'NoSQL'],
... likes: 100
... })
WriteResult({ "nInserted" : 1 })
>

8. drop document
MongoDB's remove() method is used to remove a document from the collection. remove() accepts two parameters: the first is the deletion criteria and the second is the optional justOne flag.

db.COLLECTION_NAME.remove(DELETION_CRITERIA)

>db.Students.remove (
{ "Student": "Michelle" });
>
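If several documents match the criteria, passing true as the optional second argument (the justOne flag) removes only a single matching document. A minimal sketch on the same hypothetical collection:

>db.Students.remove({ "Student": "Michelle" }, true);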

9. find document
To find documents that match a set of selection criteria, call find() with the <criteria> parameter.


>db.Students.find();
{
  "_id" : 1,
  "Student" : "Michelle"
}
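To fetch only the documents that match a condition, pass a criteria document to find(). A minimal sketch against the same hypothetical collection:

>db.Students.find({ "Student": "Michelle" });
{ "_id" : 1, "Student" : "Michelle" }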

10. pretty print documents

Adding .pretty() to a find() query formats the output with proper indentation and line breaks to make it more readable.
>db.Students.find().pretty();
{
  "_id" : 1,
  "Student" : "Michelle"
}
