Big Data Lab Manual
LIST OF EXPERIMENTS
EXPERIMENT NO. 01
Practical Objectives:
Theory:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop Ecosystem:
Hadoop has gained its popularity due to its ability to store, analyze and access large amounts of data quickly and cost-effectively across clusters of commodity hardware. It would not be wrong to say that Apache Hadoop is actually a collection of several components and not just a single product.
Within the Hadoop ecosystem there are several commercial as well as open source products which are broadly used to make Hadoop accessible to laymen and more usable.
The following sections provide additional information on the individual components:
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. In terms of programming, there are two functions which are most common in MapReduce.
The Map Task: The master node takes the input, divides it into smaller parts and distributes them to the worker nodes. Each worker node solves its own small problem and returns its answer to the master node.
The Reduce Task: The master node combines all the answers coming from the worker nodes into some form of output, which is the answer to the original distributed problem.
Generally both the input and the output are stored in a file system. The framework is responsible for scheduling tasks, monitoring them and re-executing the failed tasks.
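The following is a minimal plain-Java sketch (not the Hadoop API) that imitates the two phases for a word count: a map step that turns every input line into (word, 1) pairs, and a reduce step that combines the values for each key. The class name and the sample input are illustrative.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "big cluster", "data data");

        // Map phase: every input line is turned into (word, 1) key-value pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Reduce phase: values with the same key are combined (here, summed).
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);

        System.out.println(counts); // prints {big=2, cluster=1, data=3}
    }
}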
HDFS:
HDFS is the distributed storage layer of Hadoop. The NameNode acts as the master and holds the file system metadata, while the other components are:
DataNodes: They are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from clients.
Secondary NameNode: It is responsible for performing periodic checkpoints. In the event of a NameNode failure, you can restart the NameNode using the checkpoint.
Hive:
Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop. It is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
It provides a mechanism to project structure onto this data and to query the data using an SQL-like language called HiveQL. Hive also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
The main building blocks of Hive are –
1. Metastore – To store metadata about columns, partitions and the system catalogue.
2. Driver – To manage the lifecycle of a HiveQL statement.
3. Query Compiler – To compile HiveQL into a directed acyclic graph of tasks.
4. Execution Engine – To execute, in the proper order, the tasks produced by the compiler.
5. HiveServer – To provide a Thrift interface and a JDBC/ODBC server.
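Since HiveServer exposes a JDBC/ODBC interface, a HiveQL statement can also be submitted from a Java program through the standard JDBC API. The sketch below is only illustrative; the connection URL, credentials and the 'employees' table are assumptions for a local HiveServer2 instance.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and credentials.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // A simple HiveQL query against a hypothetical 'employees' table.
        ResultSet rs = stmt.executeQuery("SELECT dept, COUNT(*) FROM employees GROUP BY dept");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " : " + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}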
HBase:
HBase is the Hadoop column-oriented database that runs on top of HDFS. Its main components are:
HMaster: It coordinates the RegionServers and maintains the state of the cluster. It is not part of the actual data storage or retrieval path.
RegionServer: It is deployed on each machine and hosts data and processes I/O requests.
Zookeeper:
ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization and providing group services which
are very useful for a variety of distributed systems. HBase is not operational
without ZooKeeper.
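As an illustration of how an application might use ZooKeeper for configuration information and naming, the sketch below uses the ZooKeeper Java client to create a znode and read it back. The connection string, znode path and stored data are assumptions.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is an assumption).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Store a small piece of configuration under a top-level znode (path is illustrative).
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the configuration back and print it.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}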
Mahout:
Mahout is a scalable machine learning library that implements many different machine learning approaches. At present Mahout contains four main groups of algorithms:
Recommendations, also known as collaborative filtering
Classifications, also known as categorization
Clustering
Frequent itemset mining, also known as parallel frequent pattern mining
Algorithms in the Mahout library belong to the subset that can be executed in a
distributed fashion and have been written to be executable in MapReduce. Mahout
is scalable along three dimensions: It scales to reasonably large data sets by
leveraging algorithm properties or implementing versions based on Apache
Hadoop.
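As an example of the recommendation (collaborative filtering) group, Mahout's Taste API can be used from Java to build a simple user-based recommender. The sketch below is a rough illustration only; the ratings file, neighbourhood size and user ID are assumptions.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv is assumed to hold lines of the form userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend up to 3 items for user 1 (IDs are illustrative).
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}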
Apache Spark:
Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark is built on HDFS but bypasses MapReduce and instead uses its own data processing framework. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
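A small illustration of Spark's Java API is given below: it counts the lines of a text file that contain a given word. The master URL and the file path are assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkLineCount {
    public static void main(String[] args) {
        // Run locally with all available cores (master URL is an assumption).
        SparkConf conf = new SparkConf().setAppName("SparkLineCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a text file (path is illustrative) and count lines containing "error".
        JavaRDD<String> lines = sc.textFile("input.txt");
        long errors = lines.filter(line -> line.toLowerCase().contains("error")).count();

        System.out.println("Lines containing 'error': " + errors);
        sc.stop();
    }
}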
Pig:
Pig is a platform for analyzing and querying huge data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Pig has three key properties:
Extensibility
Optimization opportunities
Ease of programming
Conclusion:
Hadoop is powerful because it is extensible and easy to integrate with other components. Its popularity is due in part to its ability to store, analyze and access large amounts of data quickly and cost-effectively across clusters of commodity hardware. Apache Hadoop is not actually a single product but rather a collection of several components; when all these components are combined, they make Hadoop very user friendly.
EXPERIMENT NO. 02
Practical Objectives:
Theory:
Hadoop provides faster processing of large data sets and has the following advantages:
1. Scalable
2. Fault tolerant
3. Economical
4. Handles hardware failure
To install a Hadoop core cluster, the prerequisites are:
a) Install Java on the computer
b) Install VMware
c) Download the VM file
d) Load it into VMware and start it
The steps to be followed for installation of Hadoop using IBM InfoSphere BigInsights are:
Output:
Conclusion:
EXPERIMENT NO. 03
Theory:
The Hadoop Distributed File System (HDFS) was developed using distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture; its main elements are the NameNode (master) and the DataNodes (slaves).
Steps:
1. Open the web client.
2. From here you can also upload files into Hadoop.
3. Have a look at all the components of the web console: Dashboard, Cluster Status, Files, Applications, Application Status, BigSheets.
Output:
Step 2: Open Web Client
Step 3: From here also, you can Upload File into Hadoop
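Files can also be uploaded programmatically through the HDFS Java API instead of the web client. The sketch below is illustrative; the NameNode URI and the local and HDFS paths are assumptions for a default single-node setup.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadExample {
    public static void main(String[] args) throws Exception {
        // Connect to HDFS (NameNode address is an assumption).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

        // Copy a local file into HDFS (paths are illustrative).
        fs.copyFromLocalFile(new Path("/home/user/sample.txt"), new Path("/user/hadoop/sample.txt"));

        // List the target directory to confirm the upload.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}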
Conclusion:
HDFS is a distributed file system designed for very large data sets. Here, we have studied HDFS and executed basic commands and file operations on Hadoop.
EXPERIMENT NO. 04
Objective:
Theory:
NoSQL databases have grown in popularity with the rise of Big Data applications.
In comparison to relational databases, NoSQL databases are much cheaper to scale,
capable of handling unstructured data, and better suited to current agile
development approaches.
The advantages of NoSQL technology are compelling but the thought of replacing a
legacy relational system can be daunting. To explore the possibilities of NoSQL in
your enterprise, consider a small-scale trial of a NoSQL database like MongoDB.
NoSQL databases are typically open source so you can download the software and
try it out for free. From this trial, you can assess the technology without great risk
or cost to your organization.
Commands of Neo4j:
1. Create database
CREATE (emp:
MATCH (dept:Dept) RETURN dept.deptno,
MATCH (dept:Dept) RETURN dept
5. Movie graph
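The same kind of Cypher statements can also be issued from a Java program through the official Neo4j driver. The sketch below is only illustrative; the Bolt URL, credentials, node label and properties are assumptions.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class Neo4jCypherExample {
    public static void main(String[] args) {
        // Connection details are assumptions for a local Neo4j instance.
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
        try (Session session = driver.session()) {
            // Create a department node (label and properties are illustrative).
            session.run("CREATE (dept:Dept {deptno: 10, dname: 'ACCOUNTING'})");

            // Return selected properties of every Dept node.
            Result result = session.run("MATCH (dept:Dept) RETURN dept.deptno, dept.dname");
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("dept.deptno").asInt() + " " +
                        record.get("dept.dname").asString());
            }
        }
        driver.close();
    }
}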
Output:
Query Editor
Conclusion:
EXPERIMENT NO. 05
Objective:
Theory:
MapReduce
MapReduce is a Java-based system, based on the model created by Google, in which the actual data from the HDFS store gets processed efficiently. MapReduce breaks a big data processing job down into smaller tasks and is responsible for analyzing large datasets in parallel before reducing them to find the results.
PIG
Commonly used Pig commands:
cat – to display the contents of a file
load – to load a file into a relation
dump – to display a relation
limit – to limit the number of tuples, e.g. Abc = LIMIT abcd 2;
describe – to show the schema definition of a relation
group – to group a relation by a field, e.g. GROUP abcd BY id;
Output:
Upload the file into Hadoop.
Step 3: Open Pig
Step 4: Type the command:
Abc = LOAD 'path' AS (string:chararray);
DUMP Abc;
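For reference, the same Pig Latin statements can also be embedded in a Java program through Pig's PigServer API. The sketch below runs Pig in local mode; the input file name and aliases are assumptions.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode; use ExecType.MAPREDUCE to run against the cluster instead.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register the same statements used in the steps above (path is illustrative).
        pig.registerQuery("abc = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("firstTwo = LIMIT abc 2;");

        // Iterate over the result, which is what DUMP prints on the console.
        Iterator<Tuple> it = pig.openIterator("firstTwo");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}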
Conclusion: Hence, we have implemented and run the Hello World program successfully.
EXPERIMENT NO. 06
Practical Objectives:
Theory:
For the implementation of frequent itemsets using Pig we use the Apriori algorithm. The Apriori algorithm for finding frequent pairs is a two-pass algorithm that limits the amount of main memory needed by using the downward-closure property of support to avoid counting pairs that will turn out to be infrequent in the end.
Let s be the minimum support required and let n be the number of items. In the first pass, we read the baskets and count in main memory the occurrences of each item. We then remove all items whose frequency is less than s to get the set of frequent items. This requires memory proportional to n.
In the second pass, we read the baskets again and count in main memory only those pairs where both items are frequent. This pass requires memory proportional to the square of the number of frequent items (for the counts), plus a list of the frequent items (so you know what must be counted). The figure illustrates the use of main memory in the two passes of Apriori.
Apriori Algorithm:
1. Load text
2. Tokenize text
3. Retain first letter
4. Group by letter
5. Count occurrences
6. Grab first element
7. Display/store results
The Apriori algorithm uses the monotonicity property to reduce the number of pairs that must be counted, at the expense of performing two passes over the data rather than one.
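To make the two passes concrete, here is a minimal plain-Java sketch of the approach described above (it does not use Hadoop or Pig). The sample baskets and the minimum support value are illustrative.

import java.util.*;

public class AprioriPairs {
    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("bread", "milk"),
                Arrays.asList("bread", "butter", "milk"),
                Arrays.asList("milk", "butter"));
        int s = 2; // minimum support (an assumption for this example)

        // Pass 1: count individual items and keep only the frequent ones.
        Map<String, Integer> itemCounts = new HashMap<>();
        for (List<String> basket : baskets)
            for (String item : basket)
                itemCounts.merge(item, 1, Integer::sum);
        Set<String> frequentItems = new HashSet<>();
        for (Map.Entry<String, Integer> e : itemCounts.entrySet())
            if (e.getValue() >= s) frequentItems.add(e.getKey());

        // Pass 2: count only pairs whose two items are both frequent.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> basket : baskets) {
            List<String> items = new ArrayList<>(basket);
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++) {
                    String a = items.get(i), b = items.get(j);
                    if (frequentItems.contains(a) && frequentItems.contains(b)) {
                        String pair = a.compareTo(b) < 0 ? a + "," + b : b + "," + a;
                        pairCounts.merge(pair, 1, Integer::sum);
                    }
                }
        }

        // Report the frequent pairs.
        for (Map.Entry<String, Integer> e : pairCounts.entrySet())
            if (e.getValue() >= s)
                System.out.println("{" + e.getKey() + "} : " + e.getValue());
    }
}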
INPUT:
OUTPUT:
EXPERIMENT NO. 07
Practical Objectives:
Theory:
To implement the word count program using MapReduce, execute the following steps:
Open Eclipse and create the mapper class:
Name: MapperAnalysis
i/p Key: LongWritable
i/p Value: Text
o/p Key: Text
o/p Value: IntWritable
Click on next, then create the reducer class:
Name: ReducerAnalysis
o/p Key: Text
o/p Value: IntWritable
Click on next, then create the driver class:
Name: DriverAnalysis
Click on finish.
Click MapperAnalysis.java and complete the code; a reference sketch of the three classes is given below.
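The sketch below is one possible completed version of the generated classes, following the key/value types listed in the steps above (LongWritable/Text in, Text/IntWritable out). In the lab the mapper, reducer and driver live in separate files; they are combined here for brevity, and the input/output paths taken from the command line are an assumption.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverAnalysis {

    // Mapper: emits (word, 1) for every word in the input line.
    public static class MapperAnalysis extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class ReducerAnalysis extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires the mapper and reducer together and sets the input/output paths.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(DriverAnalysis.class);
        job.setMapperClass(MapperAnalysis.class);
        job.setReducerClass(ReducerAnalysis.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}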
Output:
Step 5: Create the Java MapReduce files in a new Java project
Click on next
Name: DriverAnalysis
Click on finish
Step 7: Open ReducerAnalysis.java
Step 9: Run procedure
Open a terminal, go to the path of the file and type the command chmod o+w 'name of file', then hit Enter. Add the input and output paths and run the job (publish and run).
Conclusion: Hence, we have implemented the word count problem in Eclipse using the MapReduce technique successfully.
EXPERIMENT NO. 08
Practical Objectives:
Theory:
MapReduce
MapReduce is a style of computing that has been implemented in several systems,
including Google’s internal implementation (simply called MapReduce) and the
popular open-source implementation Hadoop which can be obtained, along with
the HDFS file system from the Apache Foundation. You can use an
implementation of MapReduce to manage many large-scale computations in a
way that is tolerant of hardware faults. All you need to write are two functions,
called Map and Reduce, while the system manages the parallel execution,
coordination of tasks that execute Map or Reduce, and also deals with the
possibility that one of these tasks will fail to execute. In brief, a MapReduce
computation executes as follows:
1. Some number of Map tasks each are given one or more chunks from a
distributed file system. These Map tasks turn the chunk into a sequence of key-
value pairs. The way key- value pairs are produced from the input data is
determined by the code written by the user for the Map function.
2. The key-value pairs from each Map task are collected by a master controller
and sorted by key. The keys are divided among all the Reduce tasks, so all
key-value pairs with the same key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values
associated with that key in some way. The manner of combination of
values is determined by the code written by the user for the Reduce
function.
Matrix Multiplication
Suppose we have an n x n matrix M, whose element in row i and column j is denoted mij. Suppose we also have a vector v of length n, whose jth element is vj. Then the matrix-vector product is the vector x of length n, whose ith element is xi = Σj mij vj. Matrix-matrix multiplication extends the same idea: the (i, k) element of the product of matrices A and B is Σj aij bjk, and the program below computes the partial products aij * bjk with MapReduce.
JAVA PROGRAM
import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMultiplication {
// NOTE: the enclosing class and the Map class are reconstructed so that the listing
// compiles; the class name is illustrative, and the mapper assumes input lines of the
// form "A,i,j,value" for matrix A and "B,j,k,value" for matrix B.
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
String[] parts = value.toString().split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (parts[0].equals("A")) {
// Key on the shared index j; keep the row index i and the element value.
outputKey.set(parts[2]);
outputValue.set("A," + parts[1] + "," + parts[3]);
} else {
// Key on the shared index j; keep the column index k and the element value.
outputKey.set(parts[1]);
outputValue.set("B," + parts[2] + "," + parts[3]);
}
context.write(outputKey, outputValue);
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
String[] value;
ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer,
Float>>();
ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer,
Float>>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
listA.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),
Float.parseFloat(value[2])));
} else {
listB.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),
Float.parseFloat(value[2])));
}
}
String i;
float a_ij;
String k;
float b_jk;
Text outputValue = new Text();
for (Entry<Integer, Float> a : listA) {
i = Integer.toString(a.getKey());
a_ij = a.getValue();
for (Entry<Integer, Float> b : listB) {
k = Integer.toString(b.getKey());
b_jk = b.getValue();
outputValue.set(i + "," + k + "," + Float.toString(a_ij*b_jk));
context.write(null, outputValue);
}
}
}
}
public static void main(String[] args) throws Exception {
// Driver reconstructed; input and output paths are taken from the command line.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "MatrixMultiplication");
job.setJarByClass(MatrixMultiplication.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Output
EXPERIMENT NO. 09
Practical Objectives:
IBM technologies enrich this open source framework with analytical software,
enterprise software integration, platform extensions, and tools. BigSheets is a
browser-based analytic tool initially developed by IBM's Emerging Technologies
group. Today, BigSheets is included with BigInsights to enable business users and
non-programmers to explore and analyze data in distributed file systems. BigSheets
presents a spreadsheet-like interface so users can model, filter, combine, explore,
and chart data collected from various sources. The BigInsights web console
includes a tab at top to access BigSheets.
Figure 1 depicts a sample data workbook in BigSheets. While it looks like a typical
spreadsheet, this workbook contains data from blogs posted to public websites, and
analysts can even click on links included in the workbook to visit the site that
published the source content.
Figure 1 - BigSheets workbook based on social media data, with links to source content
After defining a BigSheets workbook, an analyst can filter or transform its data
as desired. Behind the scenes, BigSheets translates user commands, expressed
through a graphical interface, into Pig scripts executed against a subset of the
underlying data. In this manner, an analyst can iteratively explore various
transformations efficiently. When satisfied, the user can save and run the
workbook, which causes BigSheets to initiate MapReduce jobs over the full set
of data, write the results to the distributed file system, and display the contents of the new workbook. Analysts can page through or manipulate the full set of data as desired.
Figure 1 : Extract data to BigSheets
Figure 2 : Graph Generation
EXPERIMENT NO. 10
Practical Objectives:
Theory
K-Means clustering groups the data into similar groups. The algorithm is as follows:
1. Choose the number of clusters, K.
2. Select K points at random as the initial centroids (not necessarily from the given data).
3. Assign each data point to the closest centroid; this forms K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to its new closest centroid and repeat until the assignments no longer change.
After the final reassignment, name the clusters as the final clusters.
The Dataset
The Iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica, Iris versicolor). It is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". Four features were measured from each sample, i.e. the length and width of the sepals and petals, and based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
# Loading data
data(iris)
# Structure
str(iris)
We apply the K-Means clustering algorithm to the iris dataset, which contains 150 samples and 5 variables or attributes.
# Installing Packages
install.packages("ClusterR")
install.packages("cluster")
# Loading packages
library(ClusterR)
library(cluster)

# Removing the species label so that only the numeric features are clustered
# (reconstructed step; the original listing omits it)
iris_1 <- iris[, -5]

# Fitting the K-Means clustering model
# (reconstructed; 3 centers are assumed, matching the output described below)
set.seed(240)
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)

# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm
## Plotting the clusters and their centers
# (the initial scatter plot is reconstructed; cex is font size, pch is symbol)
plot(iris_1[, c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster,
     main = "K-means with 3 clusters")
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)
## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')
Output:
Model kmeans.re:
Three clusters are formed, of sizes 50, 62 and 38 respectively. The ratio of the between-cluster sum of squares to the total sum of squares is 88.4%.
Cluster identification:
The clusters correspond closely to the actual species labels, as the confusion matrix below shows, which indicates that the clustering is good.
Confusion Matrix:
So all 50 Setosa samples fall in their own cluster. The cluster of 62 contains 48 Versicolor and 14 Virginica, and the cluster of 38 contains 36 Virginica and 2 Versicolor.
The model shows the 3 clusters in three different colors, plotted against Sepal.Length and Sepal.Width.
Plotting cluster centers:
In the plot, the centers of the clusters are marked with cross signs in the same color as their cluster.
Plot of clusters: