Session: 2023-2024
Practical File
Big Data
Experiment-1
Aim: To install Hadoop on a single node in standalone and pseudo-distributed modes.
II. In this step, we will install JDK 1.8 on the machine.
The Oracle JDK is the official JDK; however, it is no longer provided by Oracle as a default
installation for Ubuntu. You can still install it using apt-get.
Then, depending on the version you want to install, execute one of the following commands:
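For example, OpenJDK 8 can be installed as follows (shown here as a commonly available alternative if the Oracle packages are not in your repositories; the exact package names depend on your Ubuntu release):
● sudo apt-get update
● sudo apt-get install openjdk-8-jdk
● java -version ( verify that the installation was successful )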
III. Now, let us set up a new user account for the Hadoop installation. This step is optional, but
recommended because it gives you the flexibility of a separate account for Hadoop, keeping this
installation isolated from other software on the machine.
● a. sudo adduser hadoop_dev ( Upon executing this command, you will be prompted to enter a
password for this user. Enter the password and the other details, and don't forget to confirm
them at the end )
● b. su - hadoop_dev ( Switches from the current user to the newly created user, i.e.
hadoop_dev )
● a. Visit this URL and choose one of the mirror sites. You can copy the download link and
also use “wget” to download it from command prompt:
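For example, assuming Hadoop 2.6.0 (the version used elsewhere in this file) and the Apache archive as the mirror:
● wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
● tar xzf hadoop-2.6.0.tar.gz
● mv hadoop-2.6.0 /home/hadoop_dev/hadoop2 ( so that the paths used below match )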
VIII. Let us verify whether the installation is successful or not ( change to the home directory:
cd /home/hadoop_dev/hadoop2/ ):
● bin/hadoop ( running this command should prompt you with various options)
X. Let us run a sample Hadoop program that is provided in the download package:
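For example, the grep job that ships with the examples jar (the same example is run again on YARN later in this file); the input and output directory names here are just illustrations:
● mkdir input
● cp etc/hadoop/*.xml input
● bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
● cat output/*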
I. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/core-site.xml as below (this begins the pseudo-distributed setup):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Note: This change sets the NameNode IP and port.
II. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/hdfs-site.xml as below:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Note: This change sets the default replication count for blocks used by HDFS.
III. We need to set up passwordless login so that the master will be able to do a password-less
ssh to start the daemons on all the slaves.
● ssh localhost ( enter your password; if you are able to log in, then the ssh server is running )
● If you were unable to log in in the previous step, install and configure ssh as follows:
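A typical sequence (assuming Ubuntu's openssh-server package and an RSA key) is:
● sudo apt-get install openssh-server
● ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
● cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
● chmod 0600 ~/.ssh/authorized_keys
● ssh localhost ( it should now log in without prompting for a password )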
VII. Format the file system. Formatting the NameNode wipes the metadata about the DataNodes;
all the information on the DataNodes is lost and they can be reused for new data:
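From the Hadoop home directory (/home/hadoop_dev/hadoop2/), this is typically done with:
● bin/hdfs namenode -format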
VIII. Start the daemons:
● sbin/start-dfs.sh ( starts the NameNode and DataNode )
You can check if the NameNode has started successfully by using the following web interface:
http://0.0.0.0:50070 . If you are unable to see this, check the logs in the
/home/hadoop_dev/hadoop2/logs folder.
IX. You can check whether the daemons are running or not by issuing a jps command.
To run MapReduce jobs on YARN, edit two more configuration files as follows.
I. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/mapred-site.xml as below:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
II. Edit the file /home/hadoop_dev/hadoop2/etc/hadoop/yarn-site.xml as below:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Note: This particular configuration tells MapReduce how to do its shuffle. In this
case it uses the mapreduce_shuffle.
III. Format the NameNode, if it was not already formatted during the HDFS setup above.
IV. Start the daemons:
● sbin/start-yarn.sh
This starts the daemons ResourceManager and NodeManager. Once this command is run,
you can check if the ResourceManager is running or not by visiting http://0.0.0.0:8088 in the
browser. If you are unable to see this, check the logs in the directory
/home/hadoop_dev/hadoop2/logs.
VII. To check whether the services are running, issue a jps command. The following shows all
the services necessary to run YARN on a single server:
$ jps
15933 Jps
15567 ResourceManager
15785 NodeManager
VII. Let us run the same example as we ran before:
i) Create a new directory on HDFS:
● bin/hdfs dfs -mkdir -p /user/hadoop_dev
ii) Copy the input files for the program to HDFS:
● bin/hdfs dfs -put etc/hadoop input
iii) Run the program:
● bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input
output 'dfs[a-z.]+'
iv) View the output on HDFS:
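One way to do this (assuming the output directory used in step iii) is:
● bin/hdfs dfs -cat output/*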
VIII. Stop the daemons when you are done executing the jobs, with the below command:
● sbin/stop-yarn.sh
Experiment-2
Aim: To study and execute basic HDFS file system (hadoop fs) commands.
Copy a file from source to destination in HDFS. This command allows multiple sources as well,
in which case the destination must be a directory.
Syntax:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
copyToLocal
Similar to the get command, except that the destination is restricted to a local file reference.
Syntax:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
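Example (the local destination directory here is only an illustration):
hadoop fs -copyToLocal /user/saurzcode/dir1/abc.txt /home/saurzcode/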
8. Display the last few lines of a file. Similar to tail command in Unix.
Syntax :
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
Experiment-3
Aim: Word count MapReduce program to understand the MapReduce paradigm.
Source code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//deleting the output path automatically from hdfs so that we don't have to delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath);
//exiting with status 0 if the job completed successfully, or 1 otherwise
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The entire MapReduce program can be fundamentally divided into three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code
Mapper code:
I. We have created a class Map that extends the class Mapper which is already defined in the
MapReduce Framework.
II. We define the data types of input and output key/value pair after the class declaration using
angle brackets.
III. Both the input and the output of the Mapper are key/value pairs.
• Input:
◦ The key is nothing but the offset of each line in the text file: LongWritable
◦ The value is each individual line: Text
• Output:
◦ The key is the tokenized words: Text
◦ The value is hardcoded to 1 in our case: IntWritable
◦ Example – Dear 1, Bear 1, etc.
• In the Java code, we tokenize each line into words and assign each word the hardcoded
value 1.
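The full Mapper is not reproduced in the source listing above; a minimal sketch consistent with this description (class and field names are illustrative) is:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // the hardcoded count of 1
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit (word, 1) for every token
        }
    }
}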
Reducer Code:
I. We have created a class Reduce which extends class Reducer like that of Mapper.
II. We define the data types of input and output key/value pair after the class declaration using
angle brackets as done for Mapper.
III. Both the input and the output of the Reducer are key/value pairs.
• Input:
◦ The key is nothing but the unique words which have been generated after the
sorting and shuffling phase: Text
◦ The value is a list of integers corresponding to each key: IntWritable
◦ Example – Bear, [1, 1], etc.
• Output:
◦ The key is all the unique words present in the input text file: Text
◦ The value is the number of occurrences of each of the unique words: IntWritable
◦ Example – Bear, 2; Car, 3, etc.
VI. We have aggregated the values present in each of the lists corresponding to each key and
produced the final answer.
VII. In general, the reduce method is called once for each unique key; the number of reducer
tasks can be configured (for example in mapred-site.xml or through the job configuration).
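Likewise, a minimal Reducer sketch consistent with this description is:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // add up all the 1s emitted by the mappers for this word
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum)); // (word, total count)
    }
}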
Driver Code:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
I. In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
II. We specify the name of the job and the data types of the input/output of the mapper and reducer.
III. We also specify the names of the mapper and reducer classes.
IV. The path of the input and output folder is also specified.
V. The method setInputFormatClass() is used for specifying how a Mapper will read the input
data, or what the unit of work will be. Here, we have chosen TextInputFormat so that the
mapper reads a single line at a time from the input text file.
VI. The main () method is the entry point for the driver. In this method, we instantiate a new
Configuration object for the job.
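A minimal main() consistent with the driver fragments above (the enclosing class name WordCount is assumed) might look like:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    Path outputPath = new Path(args[1]);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, outputPath);

    // delete the output path if it already exists so the job can be re-run
    outputPath.getFileSystem(conf).delete(outputPath, true);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}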
Experiment-4
Aim: Matrix multiplication using a two-step MapReduce program.
Source Code
import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] indicesAndValue = line.split(","); // each line: matrixName,rowIndex,colIndex,value
    Text outputKey = new Text();
    Text outputValue = new Text();
    if (indicesAndValue[0].equals("A")) {
        outputKey.set(indicesAndValue[2]); // join key j = column index of A
        outputValue.set("A," + indicesAndValue[1] + "," + indicesAndValue[3]);
    } else {
        outputKey.set(indicesAndValue[1]); // join key j = row index of B
        outputValue.set("B," + indicesAndValue[2] + "," + indicesAndValue[3]);
    }
    context.write(outputKey, outputValue);
}
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String[] value;
    String i;
    float a_ij;
    String k;
    float b_jk;
    context.write(null, outputValue); }
}}}
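The reduce() listing above is truncated. A full version consistent with its declared variables and with the map() output (the "A,i,value" / "B,k,value" layout is an assumption carried over from map()) might look like:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String[] value;
    // collect the A entries (row i, a_ij) and B entries (column k, b_jk) that share this join key j
    ArrayList<Entry<String, Float>> listA = new ArrayList<Entry<String, Float>>();
    ArrayList<Entry<String, Float>> listB = new ArrayList<Entry<String, Float>>();
    for (Text val : values) {
        value = val.toString().split(",");
        if (value[0].equals("A")) {
            listA.add(new SimpleEntry<String, Float>(value[1], Float.parseFloat(value[2])));
        } else {
            listB.add(new SimpleEntry<String, Float>(value[1], Float.parseFloat(value[2])));
        }
    }
    String i;
    float a_ij;
    String k;
    float b_jk;
    Text outputValue = new Text();
    // every (i, j) element of A is multiplied with every (j, k) element of B
    for (Entry<String, Float> a : listA) {
        i = a.getKey();
        a_ij = a.getValue();
        for (Entry<String, Float> b : listB) {
            k = b.getKey();
            b_jk = b.getValue();
            outputValue.set(i + "," + k + "," + Float.toString(a_ij * b_jk));
            context.write(null, outputValue); // emits partial products keyed by (i, k) in the value
        }
    }
}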
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "MatrixMatrixMultiplicationTwoSteps");
    job.setJarByClass(TwoStepMatrixMultiplication.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Experiment-5
Aim: Pig Latin scripts to use basic commands, sort, group, join, and
filter the data.
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a
textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as
input and produces another relation as output.
The Apache Pig FILTER operator is used to select the tuples of a relation that satisfy a condition;
tuples that do not satisfy it are removed.
Input: Student(rollno:int,name:chararray,gpa:float)
A = load '/pigdemo/student.tsv' as (rollno:int,name:chararray,gpa:float);
B = FILTER A BY gpa > 3.0;    -- example condition; any boolean expression can be used
DUMP B;
The Apache Pig GROUP operator is used to group the data in one or more relations. It groups the
tuples that contain the same group key. If the group key has more than one field, the key is a tuple;
otherwise it is of the same type as the group key. As a result, it produces a relation that
contains one tuple per group.
Input: Student(rollno:int,name:chararray,gpa:float)
A = load '/pigdemo/student.tsv' as (rollno:int,name:chararray,gpa:float);
B = GROUP A BY gpa;
DUMP B;
The JOIN operator is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) field(s) from each relation as keys. When these keys
match, the two particular tuples are matched, else the records are dropped.
Input: Student(rollno:int,name:chararray,gpa:float)
Department(rollno:int,deptno:int,deptname:chararray)
A = load '/pigdemo/student.tsv' as (rollno:int,name:chararray,gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int,deptno:int,deptname:chararray);
C = JOIN A BY rollno, B BY rollno;
DUMP C;
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one
or more fields.
Input: Student(rollno:int,name:chararray,gpa:float)
A = load '/pigdemo/student.tsv' as (rollno:int,name:chararray,gpa:float);
B = ORDER A BY name;
DUMP B;
Experiment-6
Apache Hive is a data warehouse and ETL tool that provides an SQL-like interface between the
user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop and provides
data query and analysis. It facilitates reading, writing, and managing large datasets that reside
in distributed storage, queried using an SQL-like syntax (HiveQL). It is not built for Online
Transaction Processing (OLTP) workloads. It is frequently used for data warehousing tasks such as
data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed for scalability,
extensibility, performance, fault tolerance, and loose coupling with its input formats.
Hive includes many features that make it a useful tool for big data analysis, including support for
partitioning, indexing, and user-defined functions (UDFs). It also provides a number of
optimization techniques to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL (extract,
transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big data industry,
especially in companies that have adopted the Hadoop ecosystem as their primary data processing
platform.
Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
user-defined functions (UDFs).
Drop database
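For example (the database name demo_db is only an illustration):
hive> DROP DATABASE IF EXISTS demo_db;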
Create a table called Sonoo with two columns, the first being an integer and the other a string.
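For example (the column names foo and bar are only illustrations, mirroring the partitioned table below):
hive> CREATE TABLE Sonoo(foo INT, bar STRING);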
Create a table called HIVE_TABLE with two columns and a partition column called ds. The
partition column is a virtual column. It is not part of the data itself but is derived from the partition
that a particular dataset is loaded into. By default, tables are assumed to be of text input format and
the delimiters are assumed to be ^A(ctrl-a).
hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds S
TRING);
• LOAD DATA
• GROUP BY
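Illustrative statements for these two commands (the local file path, partition value, and grouping column are assumptions based on the HIVE_TABLE definition above):
hive> LOAD DATA LOCAL INPATH './hive_table.txt' OVERWRITE INTO TABLE HIVE_TABLE PARTITION (ds='2023-01-01');
hive> SELECT bar, COUNT(*) FROM HIVE_TABLE GROUP BY bar;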
Hive sort by and order by commands are used to fetch data in sorted order. The main differences
between the sort by and order by commands are given below.
• Sort by – sorts the rows within each reducer; when there is more than one reducer, the overall
output is only partially ordered.
• Order by – guarantees a total ordering of the output, but does so by sending all the data through
a single reducer.
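Illustrative queries (assuming the employee table used in the join examples below):
hive> SELECT empId, empName FROM employee SORT BY empName;
hive> SELECT empId, empName FROM employee ORDER BY empName;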
Hive Join
• Inner join
Select e.empId, empName, department from employee e Inner join employeedepartment ed on(e.empId=ed.empId);
Output : <<InnerJoin.png>>
• Left outer join
Select e.empId, empName, department from employee e Left outer join employeedepartment ed on(e.empId=ed.empId);
Output : <<LeftOuterJoin.png>>
• Right outer join
Select e.empId, empName, department from employee e Right outer join employeedepartment ed on(e.empId=ed.empId);
Experiment-7
MongoDB is an open-source document database that provides high performance, high availability,
and automatic scaling. In simple words, you can say that MongoDB is a document-oriented
database. It is an open-source product, originally developed and supported by a company named
10gen (now MongoDB, Inc.). The community edition of MongoDB is available for free, and it is
also available under a commercial license from the vendor.
1. show databases
In MongoDB, you can use the show dbs command to list all databases on a MongoDB server. This
will show you the database name, as well as the size of the database in gigabytes.
Example:
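On a fresh server the output looks something like this (the exact databases and sizes depend on your installation):
> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB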
2. create database
In MongoDB, a database is created by switching to it with use and then storing data in it. After
connecting using mongosh, you can see which database you are using by typing db in your terminal.
> use myDB
switched to db myDB
>
3. use database
MongoDB's use DATABASE_NAME command is used to create a database. The command creates a new
database if it doesn't exist; otherwise, it switches to the existing database.
>use mydb
switched to db mydb
4. drop database
MongoDB's db.dropDatabase() command is used to drop an existing database.
> db.dropDatabase()
{ "dropped" : "myDB", "ok" : 1 }
>
5. Create a collection
You can create a collection using the createCollection() database method.
> db.createCollection("Students")
{ "ok" : 1 }
>
6. drop collection
MongoDB's db.collection.drop() is used to drop a collection from the database.
>db.mycollection.drop()
true
>
7. insert document
To insert data into MongoDB collection, you need to use MongoDB's insert() or save() method.
> db.users.insert({
... _id : ObjectId("507f191e810c19729de860ea"),
... title: "MongoDB Overview",
... description: "MongoDB is no sql database",
... by: "tutorials point",
... url: "http://www.tutorialspoint.com",
... tags: ['mongodb', 'database', 'NoSQL'],
... likes: 100
... })
WriteResult({ "nInserted" : 1 })
>
8. drop document
MongoDB's remove() method is used to remove a document from a collection. The remove() method
accepts two parameters: the deletion criteria and an optional justOne flag.
db.COLLECTION_NAME.remove(DELETION_CRITERIA)
>db.Students.remove({ "Student": "Michelle" });
>
9. find document
To find documents that match a set of selection criteria, call find() with the <criteria> parameter.
>db.Students.find();
{ "_id" : 1, "Student" : "Michelle" }
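For example, to match documents of the Students collection on a field value:
> db.Students.find({ "Student" : "Michelle" })
{ "_id" : 1, "Student" : "Michelle" }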