Big Data Lab Manual

Big Data

LIST OF EXPERIMENTS

01. To study the Hadoop ecosystem
02. To study installation of Hadoop on VMware
03. To study and run file operations in Hadoop
04. To study and implement a NoSQL program using Neo4j
05. To implement a Hello World program in MapReduce using Pig
06. To implement the frequent itemset algorithm by MapReduce using Pig
07. To implement word count by MapReduce using Eclipse
08. To implement matrix multiplication using MapReduce
09. To analyze and summarize large data with graphical representation using R programming
10. To implement a clustering program using R programming
EXPERIMENT NO. 01

Aim: To study Hadoop Ecosystem.

Practical Objectives:

After completing this experiment, students will be able to:

1. Understand the Hadoop ecosystem.
2. Understand the basics of Hadoop.

Theory:

The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a highly
available service on top of a cluster of computers, each of which may be prone
to failures.

Hadoop Ecosystem:
Hadoop has gained its popularity due to its ability to store, analyze and access large amounts of data quickly and cost-effectively through clusters of commodity hardware. It won't be wrong to say that Apache Hadoop is actually a collection of several components and not just a single product.
Within the Hadoop ecosystem there are several commercial as well as open source products which are broadly used to make Hadoop more accessible and usable.
The following sections provide additional information on the individual components:

MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. In terms of programming, there are two functions which are most common in MapReduce.
 The Map Task: The master node takes the input, divides it into smaller parts and distributes them to worker nodes. Each worker node solves its own small problem and returns its answer to the master node.
 The Reduce Task: The master node combines all the answers coming from the worker nodes and forms them into the output, which is the answer to our big distributed problem.
Generally both the input and the output are stored in a file system. The framework is responsible for scheduling tasks, monitoring them and re-executing failed tasks.
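As a minimal illustration of these two functions, the sketch below sums integer values per key with the Hadoop Java API; the class names, input format and field names are ours for illustration and are not taken from this manual.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: split each input line "key value" and emit (key, value).
class SumMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\s+");
        context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
    }
}

// Reduce task: add up all the values that arrive for the same key.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}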

Hadoop Distributed File System (HDFS):


HDFS is a distributed file-system that provides high throughput access to data.
When data is pushed to HDFS, it automatically splits up into multiple blocks and
stores/replicates the data thus ensuring high availability and fault tolerance.
Note: A file consists of many blocks (large blocks of 64MB and above).
Here are the main components of HDFS:
 NameNode: It acts as the master of the system. It maintains the name
system i.e., directories and files and manages the blocks which are
present on the DataNodes.

 DataNodes: They are the slaves which are deployed on each machine and
provide the actual storage. They are responsible for serving read and
write requests for the clients.
 Secondary NameNode: It is responsible for performing periodic
checkpoints. In the event of NameNode failure, you can restart the
NameNode using the checkpoint.

Hive:
Hive is part of the Hadoop ecosystem and provides an SQL like interface to
Hadoop. It is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop
compatible file systems.
It provides a mechanism to project structure onto this data and query the data using a
SQL- like language called HiveQL. Hive also allows traditional map/reduce
programmers to plug in their custom mappers and reducers when it is inconvenient
or inefficient to express this logic in HiveQL.
The main building blocks of Hive are:
1. Metastore – stores metadata about columns, partitions and the system catalogue.
2. Driver – manages the lifecycle of a HiveQL statement.
3. Query Compiler – compiles HiveQL into a directed acyclic graph of tasks.
4. Execution Engine – executes, in the proper order, the tasks produced by the compiler.
5. HiveServer – provides a Thrift interface and a JDBC/ODBC server.
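Because HiveServer exposes a JDBC endpoint, HiveQL can be issued from plain Java. The sketch below is only an illustration: it assumes the Hive JDBC driver is on the classpath, HiveServer2 is listening on its default port, and the employees table exists (none of which comes from this manual).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 JDBC URL; adjust host, port and database for your cluster.
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // A simple HiveQL query against a hypothetical table.
        ResultSet rs = stmt.executeQuery("SELECT dept, COUNT(*) FROM employees GROUP BY dept");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}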

HBase (Hadoop DataBase):


HBase is a distributed, column-oriented database that uses HDFS for its underlying storage. As said earlier, HDFS follows a write-once, read-many-times pattern, but this is not always what we need. We may require real-time read/write random access to a huge dataset; this is where HBase comes into the picture. HBase is built on top of HDFS as a distributed, column-oriented database.

Here are the main components of HBase:


 HBase Master: It is responsible for negotiating load balancing across all RegionServers and maintains the state of the cluster. It is not part of the actual data storage or retrieval path.
 RegionServer: It is deployed on each machine, hosts data and processes I/O requests.
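A minimal sketch of writing and reading one cell through the HBase Java client API; the table name, column family and values are invented for illustration, and the code assumes the HBase client library and a reachable cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Amartya"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}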

Zookeeper:
ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization and providing group services which
are very useful for a variety of distributed systems. HBase is not operational
without ZooKeeper.

Mahout:
Mahout is a scalable machine learning library that implements various approaches to machine learning. At present Mahout contains four main groups of algorithms:
 Recommendations, also known as collaborative filtering
 Classifications, also known as categorization
 Clustering
 Frequent itemset mining, also known as parallel frequent pattern mining

Algorithms in the Mahout library belong to the subset that can be executed in a
distributed fashion and have been written to be executable in MapReduce. Mahout
is scalable along three dimensions: It scales to reasonably large data sets by
leveraging algorithm properties or implementing versions based on Apache
Hadoop.
Apache Spark:
Apache Spark is a general compute engine that offers fast data analysis on a large
scale. Spark is built on HDFS but bypasses MapReduce and instead uses its own
data processing framework. Common use cases for Apache Spark include real-
time queries, event stream processing, iterative algorithms, complex operations and
machine learning.

Pig:
Pig is a platform for analyzing and querying huge data sets that consist of a high-
level language for expressing data analysis programs, coupled with infrastructure

for evaluating these programs. Pig’s built-in operations can make sense of semi-
structured data, such as log files, and the language is extensible using Java to add
support for custom data types and transformations.
Pig has three main key properties:
 Extensibility
 Optimization opportunities
 Ease of programming

The salient property of Pig programs is that their structure is amenable to


substantial parallelization, which in turn enables them to handle very large data
sets. At the present time, Pig’s infrastructure layer consists of a compiler that
produces sequences of MapReduce programs.
Oozie:
Apache Oozie is a workflow/coordination system to manage Hadoop jobs.

Flume:
Flume is a framework for harvesting, aggregating and moving huge amounts of
log data or text files in and out of Hadoop. Agents are populated throughout
one's IT infrastructure, inside web servers, application servers and mobile
devices. Flume itself has a query processing engine, so it’s easy to transform
each new batch of data before it is shuttled to the intended sink.
Ambari:
Ambari was created to help manage Hadoop. It offers support for many of the
tools in the Hadoop ecosystem including Hive, HBase, Pig, Sqoop and
Zookeeper. The tool features a management dashboard that keeps track of
cluster health and can help diagnose performance issues.

Conclusion:
Hadoop is powerful because it is extensible and it is easy to integrate with any
component. Its popularity is due in part to its ability to store, analyze and access
large amounts of data, quickly and cost effectively across clusters of commodity
hardware. Apache Hadoop is not actually a single product but instead a collection of
several components. When all these components are merged, they make Hadoop very user-friendly.
EXPERIMENT NO. 02

Aim: To study the installation of Hadoop on VMware.

Practical Objectives:

After completing this experiment, students will be able to:

1. Understand the installation of Hadoop and deal with its basic commands.

Resources: Computer, VMware installed, IBM Infosphere VM.

Theory:

Hadoop

Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each providing computation and storage.

Hadoop, in short, is an open source software framework for storing and processing big data in a distributed way on large clusters of commodity hardware. Basically, it accomplishes the following two tasks:

1. Massive data storage

2. Faster processing

The main goals that Hadoop follows are:

1. Scalability
2. Fault tolerance
3. Economy
4. Handling hardware failure

To install a Hadoop core cluster, the steps needed are:
a) Install Java on the computer
b) Install VMware
c) Download the VM file
d) Load it into VMware and start it

Steps to be followed for installation of Hadoop using IBM InfoSphere BigInsights:

Step 1: Check that VT-x mode is enabled. Required configuration: 8 GB RAM for better performance, minimum an i3 processor with 8 GB of space.

Step 2: Open the VM file in VMware and log in as user biadmin. This starts the OS (Red Hat). The VM contains Python, Java, IBM InfoSphere and the Eclipse IDE.

Step 3: Start Hadoop:

cd /opt/ibm/biginsights/bin
./start-all.sh

All the Hadoop components get started and Hadoop starts successfully.

Output:

Conclusion:

Hence we have installed Hadoop successfully.
EXPERIMENT NO. 03

Aim: To study and run File Operations in Hadoop.


Practical Objectives:

After completing this experiment students will be able to

1. Work with Hadoop Distributed File System and its operations.

Resources: Hadoop (IBM BigInsights software), computer

Theory:
Hadoop File System was developed using distributed file system design. It is run
on commodity hardware. Unlike other distributed systems, HDFS is highly fault
tolerant and designed using low-cost hardware.

HDFS holds very large amounts of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

 It is suitable for the distributed storage and processing.

 Hadoop provides a command interface to interact with HDFS.

 The built-in servers of namenode and datanode help users to


easily check the status of cluster.

 Streaming access to file system data.

 HDFS provides file permissions and authentication.

HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.

Basic Commands of Hadoop:

 hadoop fs -ls
 hadoop fs -mkdir
 hadoop fs -rmdir
 hadoop fs -help
 hadoop fs -put <localsrc> ... <HDFS_dest_Path>
 hadoop fs -get <hdfs_src> <localdst>
 hadoop fs -cat <path[filename]>
 hadoop fs -cp <source> <dest>
 hadoop fs -mv <src> <dest>

The same file operations can also be performed programmatically; see the Java sketch below.
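A minimal sketch of equivalent operations through the Hadoop FileSystem Java API; the paths used here are placeholders, not taken from this manual.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/biadmin/demo"));                        // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),                 // hadoop fs -put
                             new Path("/user/biadmin/demo/sample.txt"));

        for (FileStatus status : fs.listStatus(new Path("/user/biadmin/demo"))) {  // hadoop fs -ls
            System.out.println(status.getPath());
        }

        fs.copyToLocalFile(new Path("/user/biadmin/demo/sample.txt"),     // hadoop fs -get
                           new Path("/tmp/sample_copy.txt"));
        fs.close();
    }
}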

Steps:
 Open the web client.
 From here you can also upload files into Hadoop.
 Take a look at all the components of the InfoSphere BigInsights web console:
 Dashboard, Cluster Status, Files, Applications, Application Status, BigSheets.

Output:

Step 1: Basic commands of Hadoop: hadoop fs -ls, hadoop fs -mkdir, etc.

Step 2: Open Web Client

Step 3: From here also, you can upload files into Hadoop.

Conclusion:
The distributed file system is used for very large datasets. Here, we have studied HDFS and executed basic commands and file operations in Hadoop.
EXPERIMENT NO.4

Aim: To study and implement a NoSQL program using Neo4j.

Objective:

After completing this experiment students will be able to

1. Acquire knowledge of NoSQL queries.


2. Design NoSQL queries.

Resources: Computer, Neo4j.

Theory:

NoSQL databases have grown in popularity with the rise of Big Data applications.
In comparison to relational databases, NoSQL databases are much cheaper to scale,
capable of handling unstructured data, and better suited to current agile
development approaches.

The advantages of NoSQL technology are compelling but the thought of replacing a
legacy relational system can be daunting. To explore the possibilities of NoSQL in
your enterprise, consider a small-scale trial of a NoSQL database like MongoDB.
NoSQL databases are typically open source so you can download the software and
try it out for free. From this trial, you can assess the technology without great risk
or cost to your organization.

Commands of Neo4j:

1. Create a node:

CREATE (emp:Employee)

2. Insert data:

CREATE (dept:Dept {deptno: 10, dname: "Accounting", location: "Hyderabad"})

3. Display data:

MATCH (dept:Dept) RETURN dept.deptno, dept.dname

4. Return whole nodes:

MATCH (dept:Dept) RETURN dept

5. Movie graph (the built-in example graph in the Neo4j browser)
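The same Cypher statements can also be run from Java. The sketch below is only an illustration: it assumes the official Neo4j Java driver (4.x) is available and that the database is reachable on the default Bolt port with the credentials shown, none of which comes from this manual.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class Neo4jExample {
    public static void main(String[] args) {
        // Connection details are placeholders; adjust them for your installation.
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
        try (Session session = driver.session()) {
            // Insert a node, then read it back with the queries shown above.
            session.run("CREATE (dept:Dept {deptno: 10, dname: 'Accounting', location: 'Hyderabad'})");
            Result result = session.run("MATCH (dept:Dept) RETURN dept.deptno AS deptno, dept.dname AS dname");
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("deptno").asInt() + " " + record.get("dname").asString());
            }
        }
        driver.close();
    }
}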

Output:

Query Editor
Conclusion:

Hence, we have studied NoSQL databases and executed basic NoSQL (Cypher) queries using Neo4j.
EXPERIMENT NO. 05

Aim: To implement a Hello World program in MapReduce using Pig.

Objective:

After completing this experiment you will be able to

1. Implement a simple MapReduce program using Pig.

Resources: Computer, VMware installed, IBM Infosphere VM,Pig.

Theory:

Map reduce

MapReduce is a Java-based system created by Google in which the actual data from the HDFS store gets processed efficiently. MapReduce breaks a big data processing job down into smaller tasks and is responsible for analyzing large datasets in parallel before reducing them to find the results.

PIG

Pig is one of the data access components in the Hadoop ecosystem. It is a convenient tool developed by Yahoo for analyzing huge data sets efficiently. It provides a high-level data flow language that is optimized, extensible and easy to use.

Basic commands in Pig are as follows:

cat – display a file
load – load a file
dump – display a relation
limit – limit the range, e.g. Abc = LIMIT abcd 2;
describe – show the schema definition
group – group by an entity type, e.g. GROUP abcd BY id;
explain – describe the MapReduce plan
foreach – similar to a for loop: FOREACH <bag> GENERATE <data>, e.g. count = FOREACH abcd GENERATE id;
tokenize – split a string into tokens
flatten($0) – record the output in a simple manner, e.g. Likhesh = FOREACH line GENERATE TOKENIZE(text);

Hello World in Pig

1. Create a text file containing "hello world".
2. Upload it into Hadoop.
3. Open Pig.
4. Type the commands:

Abc = LOAD 'path' AS (string:chararray);
DUMP Abc;

Output:

Step 1: Create a text file containing "hello world".

Step 2: Upload it into Hadoop.

Step 3: Open Pig.

Step 4: Type the commands:

Abc = LOAD 'path' AS (string:chararray);
DUMP Abc;

Conclusion: Hence we have implemented and run the Hello World program successfully.
EXPERIMENT NO. 06

Aim: To implement the frequent itemset algorithm by MapReduce using Pig.

Practical Objectives:

After completing this experiment students will be able to

1. Implement logic and execute data mining algorithms using mapreduce.

Resources: Computer, VMware installed, IBM Infosphere VM,Pig.

Theory:

For the implementation of frequent itemsets using Pig we use the Apriori algorithm. The Apriori algorithm for finding frequent pairs is a two-pass algorithm that limits the amount of main memory needed by using the downward-closure property of support to avoid counting pairs that will turn out to be infrequent in the end.

Let s be the minimum support required and let n be the number of items. In the first pass, we read the baskets and count in main memory the occurrences of each item. We then remove all items whose frequency is less than s to get the set of frequent items. This requires memory proportional to n.

In the second pass, we read the baskets again and count in main memory only those pairs in which both items are frequent. This pass requires memory proportional to the square of the number of frequent items (for the counts), plus a list of the frequent items (so we know what must be counted). (Figure: main memory usage in the two passes of Apriori.)
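As an illustrative calculation (our numbers, not from the manual): with n = 100,000 items, counting all pairs would need about n(n-1)/2 ≈ 5 x 10^9 counters, but if only m = 1,000 items turn out to be frequent, the second pass needs only about m(m-1)/2 ≈ 5 x 10^5 pair counters plus the list of the m frequent items.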

Apriori Algorithm:

1. Load text
2. Tokenize text
3. Retain first letter
4. Group by letter
5. Count occurrences

6. Grab first element
7. Display/store results

The Apriori algorithm uses the monotonicity property to reduce the number of pairs that must be counted, at the expense of performing two passes over the data rather than one.

INPUT:

Apriori algorithm and code:

1. Load text:
abcd = LOAD 'path' AS (text:chararray);

2. Tokenize text:
tokens = FOREACH abcd GENERATE FLATTEN(TOKENIZE(text)) AS token:chararray;

3. Retain the first letter:
letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;

4. Group by letter:
lettergroup = GROUP letters BY letter;

5. Count occurrences:
countper = FOREACH lettergroup GENERATE group, COUNT(letters);

6. Grab the first element:
orderedcount = ORDER countper BY $1 DESC;
result = LIMIT orderedcount 1;

7. Display/store the result:
DUMP result;

OUTPUT:

EXPERIMENT NO. 07

Aim: To implement word count by Map Reduce using Eclipse.

Practical Objectives:

After completing this experiment students will be able to

1. Implement logic and execute data mining algorithms using mapreduce.

Resources: Computer, VMware installed, IBM Infosphere VM,Eclipse.

Theory:

To implement the word count program using MapReduce, execute the following steps:

Open Eclipse.

Go to New -> File -> BigInsights Project and create a new project.

Go to New -> File -> Java MapReduce Program and create the Java MapReduce files:

 Name: MapperAnalysis
 Input key: LongWritable
 Input value: Text
 Output key: Text
 Output value: IntWritable
 Click Next.
 Name: ReducerAnalysis
 Output key: Text
 Output value: IntWritable
 Click Next.
 Name: DriverAnalysis
 Click Finish.
Click MapperAnalysis.java and complete its code. Click ReducerAnalysis.java and complete its code. Click DriverAnalysis.java and edit its code. A sketch of the three classes is given below.
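A minimal sketch of what the three generated classes can contain for word count; the manual itself does not list this code, and the input/output paths are supplied as job arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Mapper: input key LongWritable, input value Text, output key Text, output value IntWritable.
class MapperAnalysis extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word in the line
        }
    }
}

// Reducer: output key Text, output value IntWritable.
class ReducerAnalysis extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // total count for this word
    }
}

// Driver: wires the mapper and reducer together; args[0] is the input path, args[1] the output path.
public class DriverAnalysis {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(DriverAnalysis.class);
        job.setMapperClass(MapperAnalysis.class);
        job.setReducerClass(ReducerAnalysis.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}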

 Go to Run Configuration.
 Add the project name and main class.
 The running environment is local.
 Job arguments:
 Before adding the job arguments, open a terminal, go to the path of the file, type the command chmod o+w 'name of file' and hit Enter.
 Add the input and output paths.
 Run (publish and run).

Output:

Step 1: Open Eclipse.

Step 2: Go to New -> File -> BigInsights Project.

Step 3: Create a new project.

Step 4: Go to New -> File -> Java MapReduce Program.

Step 5: Create the Java MapReduce files:

Name: MapperAnalysis, input key: LongWritable, input value: Text, output key: Text, output value: IntWritable. Click Next.

Name: ReducerAnalysis, output key: Text, output value: IntWritable. Click Next.

Name: DriverAnalysis. Click Finish.

Step 6: Open MapperAnalysis.java.

Step 7: Open ReducerAnalysis.java.

Step 8: Open DriverAnalysis.java.

Step 9: Run procedure:

Go to Run Configuration >> add the project name and main class >> the running environment is local >> job arguments. Before adding the job arguments, open a terminal, go to the path of the file, type the command chmod o+w 'name of file' and hit Enter. Then add the input and output paths and Run (publish and run).

Conclusion: Hence we have successfully implemented the word count problem in Eclipse using the MapReduce technique.
EXPERIMENT NO. 8

Aim: To implement matrix multiplication using Map reduce.

Practical Objectives:

After completing this experiment students will be able to

1. Implement logic and execute a complex program with external resources using MapReduce.

Theory:

Map reduce
MapReduce is a style of computing that has been implemented in several systems,
including Google’s internal implementation (simply called MapReduce) and the
popular open-source implementation Hadoop which can be obtained, along with
the HDFS file system from the Apache Foundation. You can use an
implementation of MapReduce to manage many large-scale computations in a
way that is tolerant of hardware faults. All you need to write are two functions,
called Map and Reduce, while the system manages the parallel execution,
coordination of tasks that execute Map or Reduce, and also deals with the
possibility that one of these tasks will fail to execute. In brief, a MapReduce
computation executes as follows:
1. Some number of Map tasks each are given one or more chunks from a
distributed file system. These Map tasks turn the chunk into a sequence of key-
value pairs. The way key- value pairs are produced from the input data is
determined by the code written by the user for the Map function.
2. The key-value pairs from each Map task are collected by a master controller
and sorted by key. The keys are divided among all the Reduce tasks, so all
key-value pairs with the same key wind up at the same Reduce task.

3. The Reduce tasks work on one key at a time, and combine all the values
associated with that key in some way. The manner of combination of
values is determined by the code written by the user for the Reduce
function.

Matrix Multiplication

Suppose we have an n x n matrix M, whose element in row i and column j is denoted by Mij. Suppose we also have a vector v of length n, whose jth element is Vj. Then the matrix-vector product is the vector x of length n, whose ith element is xi = Σj Mij Vj, the sum over j of Mij times Vj.

JAVA PROGRAM

import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoStepMatrixMultiplication {

public static class Map extends Mapper<LongWritable, Text, Text, Text> {


public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
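// Each input line is assumed to have the form "matrixName,rowIndex,colIndex,value",
// e.g. "A,0,1,3.5" or "B,1,2,0.75" (the sample values are ours). A entries are keyed
// by their column index j and B entries by their row index j, so entries that must be
// multiplied together meet at the same reducer.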
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
outputKey.set(indicesAndValue[2]);
outputValue.set("A," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
} else {
outputKey.set(indicesAndValue[1]);
outputValue.set("B," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}

public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
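// For one join key j, collect the A entries (i, a_ij) and the B entries (k, b_jk),
// then emit every partial product "i,k,a_ij*b_jk". A second job (the second "step",
// not listed in this manual) would then sum the partial products for each (i,k) pair.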
String[] value;
ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer,
Float>>();
ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer,
Float>>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
listA.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),
Float.parseFloat(value[2])));
} else {
listB.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),
Float.parseFloat(value[2])));
}
}
String i;
float a_ij;
String k;
float b_jk;
Text outputValue = new Text();
for (Entry<Integer, Float> a : listA) {
i = Integer.toString(a.getKey());
a_ij = a.getValue();
for (Entry<Integer, Float> b : listB) {
k = Integer.toString(b.getKey());
b_jk = b.getValue();
outputValue.set(i + "," + k + "," + Float.toString(a_ij*b_jk));
context.write(null, outputValue);
}
}
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();

Job job = new Job(conf, "MatrixMatrixMultiplicationTwoSteps");


job.setJarByClass(TwoStepMatrixMultiplication.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(Map.class);

job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:9000/matrixin"));


FileOutputFormat.setOutputPath(job, new
Path("hdfs://127.0.0.1:9000/matrixout"));

job.waitForCompletion(true);
}
}

Output

EXPERIMENT NO. 9

Aim: To analyze and summarize large data with Graphical


Representation Using Bigsheets.

Practical Objectives:

After completing this experiment students will be able to

1. Create a graph for a large amount of filtered or non-filtered data.

Resources: Computer, VMware installed, IBM Infosphere VM, BigSheets.

Theory:

IBM technologies enrich this open source framework with analytical software,
enterprise software integration, platform extensions, and tools. BigSheets is a
browser-based analytic tool initially developed by IBM's Emerging Technologies
group. Today, BigSheets is included with BigInsights to enable business users and
non-programmers to explore and analyze data in distributed file systems. BigSheets
presents a spreadsheet-like interface so users can model, filter, combine, explore,
and chart data collected from various sources. The BigInsights web console
includes a tab at top to access BigSheets.

Figure 1 depicts a sample data workbook in BigSheets. While it looks like a typical
spreadsheet, this workbook contains data from blogs posted to public websites, and
analysts can even click on links included in the workbook to visit the site that
published the source content.

Figure 1 - BigSheets workbook based on social media data, with links to source content

After defining a BigSheets workbook, an analyst can filter or transform its data
as desired. Behind the scenes, BigSheets translates user commands, expressed
through a graphical interface, into Pig scripts executed against a subset of the
underlying data. In this manner, an analyst can iteratively explore various
transformations efficiently. When satisfied, the user can save and run the
workbook, which causes BigSheets to initiate MapReduce jobs over the full set
of data, write the results to the distributed file system, and display the contents of the new workbook. Analysts can page through or manipulate the full set of data as desired.

Figure 1: Extract data to BigSheets

Complementing BigSheets are a number of ready-made sample applications that


business users can launch from the BigInsights web console to collect data from
websites, relational database management systems (RDBMS), remote file systems,
and other sources. We'll rely on two such applications for the work described here.
However, it's important to realize that programmers and administrators can use other
BigInsights technologies to collect, process, and prepare data for subsequent
analysis in BigSheets. Such technologies include Jaql, Flume, Pig, Hive,
MapReduce applications, and others.

Figure 2 : Graph Generation

EXPERIMENT NO. 10

Aim: To implement clustering program using R programming.

Practical Objectives:

After completing this experiment students will be able to

1. Create clusters for a large amount of filtered or non-filtered data.

Resources: Computer, R Language, VMware installed

K-means clustering in R programming is an unsupervised, non-linear algorithm that clusters data into groups based on similarity. It seeks to partition the observations into a pre-specified number of clusters. Segmentation of the data takes place by assigning each training example to a segment called a cluster. Being unsupervised, the algorithm relies heavily on the raw data itself, with a large expenditure on manual review to judge relevance. It is used in a variety of fields like banking, healthcare, retail, media, etc.

Theory
K-Means clustering groups the data on similar groups. The algorithm is as follows:
1. Choose the number K clusters.
2. Select at random K points, the centroids(Not necessarily from the given data).
3. Assign each data point to closest centroid that forms K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to its closest new centroid, repeating steps 4-5 until no reassignments occur.
After the final reassignment, the resulting clusters are the final clusters.

The Dataset
Iris dataset consists of 50 samples from each of 3 species of Iris(Iris setosa, Iris virginica, Iris
versicolor) and a multivariate dataset introduced by British statistician and biologist Ronald
Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. Four
features were measured from each sample, i.e. the length and width of the sepals and petals, and
based on the combination of these four features, Fisher developed a linear discriminant model
to distinguish the species from each other.

# Loading data
data(iris)

# Structure
str(iris)

Performing K-Means Clustering on Dataset

Using the K-means clustering algorithm on the iris dataset, which includes 150 observations and 4 numeric attributes (the Species label is removed before clustering):

# Installing Packages
install.packages("ClusterR")
install.packages("cluster")

# Loading package
library(ClusterR)
library(cluster)

# Removing initial label of


# Species from original dataset
iris_1 <- iris[, -5]

# Fitting K-Means clustering Model


# to training dataset
set.seed(240) # Setting seed
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

# Cluster identification for


# each observation
kmeans.re$cluster

# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm

# Model Evaluation and visualization


plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
col = kmeans.re$cluster,
main = "K-means with 3 clusters")

## Plotting cluster centers


kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]

# cex is font size, pch is symbol
points(kmeans.re$centers[, c("Sepal.Length",
"Sepal.Width")],
col = 1:3, pch = 8, cex = 3)

## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')

Output:
 Model kmeans.re:

Three clusters are formed, of sizes 50, 62 and 38 respectively. The ratio of the between-cluster sum of squares to the total sum of squares is 88.4%.

Cluster identification:

The cluster vector assigns each of the 150 observations to one of the three clusters; comparing it with the true species labels (the confusion matrix below) shows how well the clusters match the species.

Confusion Matrix:

So, all 50 setosa fall in their own cluster. Of the 50 versicolor, 48 fall in the versicolor cluster and 2 in the virginica cluster. Of the 50 virginica, 36 fall in the virginica cluster and 14 in the versicolor cluster.

K-means with 3 clusters plot:

The plot shows the 3 clusters in three different colors, plotted against Sepal.Length and Sepal.Width.

Plotting cluster centers:

In the plot, the centers of the clusters are marked with cross symbols in the same color as their cluster.

Plot of clusters:
