A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters

A performance analysis of OpenStack
Cloud vs
Real System on Hadoop Clusters
A performance analysis of OpenStack
Cloud vs
Real System on Hadoop Clusters
BY : KUMARI SURABHI

INDEX TERMS
• Openstack Cloud
• Hadoop
• Distributed system,
• Virtualization
• Big data.

INTRODUCTION
• Computers are revolutionizing the human era, especially the IT field.
• With new technologies and ideas, smart, efficient and faster computers
and frameworks are introduced in market every day.
• Advancement of intelligent machines and ideas makes computation,
storage and transaction of data faster and accurate which eventually
makes institutions, companies and individuals solve their problems with
ease.
• Among so many computation improvements, cloud computing and
distributed system are the main focus of this seminar.

CLOUD COMPUTING
• Cloud computing is the emerging technology which is being exploited in
every aspects of technology.
• Cloud computing is an abstract term describing the use of resources,
which do not belong to the user to perform required task and then
disconnect from the resources when not in use.
• Most obvious examples would be Gmail, Google doc, Amazon EC2 and
storage, social networks such as Facebook and many more.

BIG DATA
• Describe the exponential growth, availability and use of information, both
structured and unstructured.
• Concept of Big Data has three basic dimensions: volume, variety, velocity
and other dimensions are veracity and complexity.

PROBLEM STATEMENTS AND PRELIMINARIES
The purpose of this seminar is to answer the following
questions:
●
Is performance of OpenStack cloud is better than real system on
Hadoop cluster?
●
Is it feasible to run image processing MapReduce job in a Hadoop
cluster on OpenStack cloud?
●
Are the technical difficulties of converting image files to pdf using
MapReduce framework in distributed system?

CLOUD COMPUTING & OPENSTACK CLOUD
• Cloud computing is the use of computing resources (hardware and
software) that are delivered as a service over a network (typically the
Internet).
• Cloud services include the delivery of software, infrastructure, and
storage over the Internet.
• Based on technology used, cloud computing can be sub categorized into:
Public, Private, Community and Hybrid clouds.

Cloud Computing continued...
cloud computing can be broadly divided into:
●
Software as a Service
(SaaS),
●
Platform as a Service
(PaaS),
●
Infrastructure as a
Service (IaaS).

CLOUD COMPUTING continued...
Based on technology used, cloud computing can be sub categorized into:
• Public,
• Private,
• Community
• Hybrid clouds.

CLOUD COMPUTING continued...
There are many platforms available to set up a cloud such as :
• CloudStak (Foundation)
• DevStack,
• Openstack,
• Eucalyptus,
• Nebula and many more.
Note : Openstack is chosen as the framework for implementing
cloud for the article.

OPENSTACK CLOUD
• It is open-source which is used to open to pick and mix any hardware
needs.
• Open to design own networks.
• Open to use any virtualization technology.
• Open to other needed features and so on.

OPENSTACK CLOUD
• OpenStack is an open source cloud framework, originally launched by
Rackspace and NASA, with the aim to promote cloud standards and
provide a solid foundation for cloud development.
• It is most widely used tool for setting a private and public cloud.
• Big companies like Dell, AMD, Cisco, HP and Rackspace are using it.
• Linux heavyweights like Red Hat and Ubuntu are implementing it.
• Amazon is available on OpenStack with compatible API with Amazon.

OPENSTACK CLOUD continued...
• OpenStack is an Infrastructure as a Service (IaaS) cloud computing project
that is free open source software.
• It is revolutionizing the cloud computing world
• Aim to create a system where storage, resources and performance would
scale everything up quickly and efficiently.

The OpenStack cloud currently consists
of six projects:
• Nova,
• Swift, Glance,
• Keystone,
• Quantum,
• Horizon.

●
Nova :
Nova is the computing fabric controller for the OpenStack cloud.
●
Swift :
Swift is the storage system for OpenStack which is analogous to Amazon
Web Service Simples Storage system.
●
Glance :
It is an imaging service for OpenStack which is responsible for discovery,
registration and delivery services for disk and server images.

●
Keystone :
Keystone is OpenStack identity service which provides authentication
and authorization for all components of OpenStack.
●
Horizon :
Horizon is web based dashboard that provides administrators and users a
graphical interface to access, provision and automate cloud-based
resources.

HADOOP
• There are many distributed systems available to comply with the big data
problems faced by big companies.
• Hadoop is one of the available frameworks.
• Hadoop makes data mining, analytics, and processing of big data cheap
and fast.
• Hadoop is an open source project and is made to deal with terabytes of
data in minutes.
• Hadoop stores and processes any kind of data.
• Hadoop is natively written in Java but can be accessed using other
languages such as SQL-inspired language (Hive), c/c++, python and many
more.

HADOOP continued...
• An open source web search engine which was based on Googles
MapReduce.
• Hadoop works on commodity hardware.

HDFS(Hadoop Distributed File System)
●
Hadoop Distributed File System provides unrestricted, high-
speed access to the data application.
●
A scalable, fault tolerant, high performance distributed file
system.
●
Namenode holds filesystem metadata.
●
Files are broken up and spread over datanodes.
●
Data divided into 64MB(default) or 128 blocks, each block
replicated 3 times(default) .

MAPREDUCE
●
Map Reduce programs are executed in two main phases,
called mapping and reducing.
●
Each phase is defined by a data processing function, and
these functions are called mapper and reducer, respectively.
●
In the mapping phase, Map Reduce takes the input data and
feeds each data element to the mapper.
●
In the reducing phase, the reducer processes all the outputs
from the mapper and arrives at the final result.
●
In simple terms, the mapper is meant to filter and transform
the input into something that the reducer can aggregate over.

PERFORMANCE ANALYSIS MODEL
●
In the performance analysis model, we will discusse use of
two basic applications :
1. WordCount Application
2. Imagetopdf Conversion Application.
●
WordCount is a common MapReduce program which is used
to count total number of word found in the document.
●
Imagetopdf Conversion program is used for converting image
into pdf.

PERFORMANCE ANALYSIS MODEL continued...
●
These two programs are executed in commodity computers
cluster as well as OpenStack Cloud virtual instance cluster.
●
Analyse the performance by changing the number of nodes
and size of the data.
●
The performance analysis has been done for both the
applications.

WORD COUNT APPLICATION
●
WordCount is a simple application that counts the number of
occurrences of each word in a given input set.
●
The purpose of this program is to calculate the total number of repetition
of words in a particular document.
●
The pseudo code for the Mapper and Reducer for WordCount program is
outlined in Algorithm 1 and Algorithm 2 respectively.

Mapper function for WordCount Program
●
Input: String filename, String document
●
Output: String token , 1
1) Map(String filename, String document)
2) {
3) List<String> T = tokenize(document);
4) For each token in T
5) {
6) emit ((String) token, (Integer) 1);
7) }
8) }

Reducer function for WordCount Program
●
Input: String token, 1
●
Output: (String) token , sum
1) Reduce(String token, List<Integer> values)
2) {
3) Integer sum = 0;
4) For each value in values
5) {
6) Sum = sum + value;
7) }
8) emit ((String) token, (Integer) sum);
9) }
10) }

Image to Pdf Conversion Application
●
Hadoop is popular for processing textual big data so there are a lot of
materials available if an application related with texts is to be developed.
●
But few works has been done on image data processing in Hadoop.
●
So there were a lot of challenges while developing the application.
●
Some of the difficulties faced were serialization issues with images,
splitting of images by hadoop to its default blocks, image to pdf
conversion, text to pdf conversion and many more.

Work flow of the application
●
Under Map Reduce model, data processing primitives are called mappers
and reducers.

Mapper function for ImagetoPdf Program
●
Input: String key, KUPDF value
●
Output: filename, KUPDF value(pdf file)
1) Map(String key, KUPDF value)
2) {
3) For each bufferList in value
4) {
5) write (filename, value)
6) }
7) }

Reducer function for ImagetoPdf Program
●
Input: String key, KUPDF value
●
Output: filename, KUPDF value(pdf file)
1) Reduce(String key, KUPDF values)
2) {
3) For each value in values
4) {
5) concat value as separate page of pdf
6) }
7) write (key, final pdf)
8) }
9) }

MAPPER AND REDUCER
●
In the mapping phase, Map Reduce takes the input data and feeds each
data element to the mapper.
●
In the reducing phase, the reducer processes all the outputs from the
mapper and arrives at the final result.
●
The mapper is meant to filter and transform the input into something
that the reducer can aggregate over.
●
PDFMapper and PDFReducer class in the application does the above
mentioned jobs in the application developed with image files and pdf
files.

METHODS OF PERFORMANCE EVALUATIONS
Cloud Cluster Setup
• quad core Intel® Xenon(R) CPU of 64 bits
• 16GB RAM
• 1TB ATA disk
• 500GB ATA disk as storage
• 32 bits dual gigabit network interfaces was used.
• Ubuntu 14.04 server was installed for Operating System.

Cloud Cluster Setup continued...
●
Installation and configuration of Openstack Essex cloud was done by the
tutorial of Openstack.
●
Appropriate images were created using virtualization tool, such as
QEMU, supporting KVM or XEN and by using terminal commands.
●
virtual systems for cloud cluster setup, were created by using terminal
commands and by accessing web-interface of openstack after successful
network configuration for fixed and floating ips and other security
parameters by using the images created.

Commodity Computer Cluster Setup
The four-node cluster of commodity computers is setup on :
●
Intel i5 quad core 64-bit CPU with 2GB RAM
●
One with 160GB ATA disk
●
Other three with 80GB ATA disk
●
32 bit gigabit network interface.

2. Commodity Computer Cluster Setup continued...
●
Passwordless secure shell was confugured
●
Java 7 was installed and
●
Hadoop 0.20.2 was configured on all four instances.

Cloud Cluster Setup continued...
●
Four-node cluster, one acting as master/slave and the rest three as
slaves, was created by using Ubuntu 14.04 image.
●
Passwordless secure shell was confugured, java 7 was installed and
Hadoop 0.20.2 was configured on all four instances.

Experimental Results
After the successful configuration of the clusters, three jobs:
●
Two jobs to convert image files to pdf files and
●
One job of word count were run on both systems.
●
The first two jobs were based on image and pdf files being serialized in
map reduce framework.
●
The last job was implemented based on the standard WordCount
program available in the Hadoop package.
●
The algorithms are run first on two node cluster with a Master and two
slaves and then scaled up to four node cluster of a Master and four slaves
(Master running slave machine as well).

Directory-wise Image to PDF continued...
●
The first jobs algorithm is designed to search images directory-wise and
convert each image files to pdf file with same directory tree as the input
image files.
●
TABLE III : SUMMARY OF FIRST JOB ON FOUR NODES
Commodity
Cluster
23 folder | 94
image files |
169 MB
23 folder |
94 pdf files|
90.1 MB
3 minute and
8 second
Cloud
Cluster
23 folder | 94
image files|
169 MB
1 folder |
94 pdf files|
90.1 MB
1 minute and
31 second

GRAPHICAL REPRESENTATION
Time taken for First Job

EXPLANATION
The input to the job contained :
●
23 folder
●
94 files and
●
In total 169 MB in size.
The output was the conversion of each image files to pdf with same
directory and file names and configured to generate 90.1 MB in size for
both the processing.
●
The processing were repeated three times to get the average.

2) Multiple Images to Single PDF:
●
Modified version of the first one explained above.
●
All images are converted to final single pdf output file.
●
This processing is also done first in two node and later scaled up to four
node cluster as done in first algorithm.
●
The processing are repeated three time to get the average.
●
The experiments are summarized in Table IV and Table V.

Multiple Images to Single PDF continued...
●
The input contained 476 image files in one directory and its size was 926
MB.
●
The output was a single pdf file of 200.1 MB in size.
TABLE IV. SUMMARY OF 2ND JOB ON TWO NODES
Commodity
Cluster
1 folder |
476 image
files |
926 MB
1 pdf
files |
200.1
MB
11 minute and
29 second
Cloud
Cluster
1 pdf
files |
200.1
MB
1 pdf
files |
200.1
MB
12 minute and
28 second

Multiple Images to Single PDF continued...
TABLE V. SUMMARY OF 2ND JOB ON FOUR NODES
Commodity
Cluster
1 folder |
476 image
files |
926 MB
1 pdf
files |
200.1
MB
7 minute and
51 second
Cloud
Cluster
1 pdf
files |
200.1
MB
1 pdf
files |
200.1
MB
9 minute and
22 second

This Shows that commodity computer clusters are more efficient than
virtual node cluster in Openstack cloud.
Time taken for Second Job

EXPLANATION
●
The first two jobs were processed on mapping of small image files that
are no so effective in Hadoop system as hadoop performs really well with
large data sets as input.
●
So in order to test the real performance of Hadoop on big data, the
default word count job of Hadoop system was also run.
●
Hadoop was designed for text processing rather than image processing
so the textual processing was also chosen to analyse the Hadoop clusters.

TABLE VI. SUMMARY OF THIRD JOB ON TWO NODES
The inputs for the job was a text file of 1.1 GB and the output was a file
containing list of words which was 364.6KB in size.
Commodity
Cluster
1 folder |
476 image
files |
926 MB
1 pdf
files |
200.1
MB
7 minute and
51 second
Cloud
Cluster
1 pdf
files |
200.1
MB
1 pdf
files |
200.1
MB
9 minute and
22 second

TABLE VII. SUMMARY OF THIRD JOB ON FOUR NODES
Commodity
Cluster
1 text file |
1.1 GB
1 text files
with counts |
361.6 KB
4 minute and
0 second
Cloud
Cluster
1 text file |
1.1 GB
1 text files
with counts |
361.6 KB
5 minute and
1 second

It proves that Hadoop cluster in commodity computers performed well
than the Hadoop cluster in the cloud.
Time taken for Third Job

PERFORMANCE ANALYSIS
●
The Hadoop distributed system set up on personal computers is certain
to be more efficient and faster than the cloud system.
●
First reason is that Hadoop is developed with commodity machines in
mind.
●
Second obvious reason is that the processing is done in physical
hardware without any resources sharing as compared to cloud systems.

CONTRADICTION
●
The first job is contradictory with the points discussed above and with
other two jobs.
REASON :
●
The job has to recursively read and write files, thus has to cache all the
bytes read and to be written,which is faster in cloud as the nodes are in
one server and there is no wire communication between nodes.

CONCLUSION
●
An analysis of running a Hadoop cluster incloud and in real system and
identifying the best solution by running simple Hadoop jobs in the
configured clusters.
●
It concludes that running a Hadoop cluster in cloud for data storage and
analysis is more flexible and easily-scalable than the real system cluster.
●
The two nodes to four nodes experiments proved the easy-scalability
where cloud cluster scaled up with creation of an instance from already
configured image.
●
The case was not the same in real system where we needed to get the
machine, download the softwares, adjust configuration to join the new
machine to the cluster.

●
The failed nodes in cloud cluster could be terminated and can be
replaced with a new instance in seconds but the same is not possible in
real system.
●
The cluster in real system computers are faster than the cloud clusters.
●
But due to different advantageous features of the cloud computing
system such as quick termination of servers (nodes) if problems arise and
creation of the node from the same state the machine was terminated,
automatic networking, instant creation of nodes and cluster and many
such features cloud Hadoop clusterwould be more favorable.
●
Despite the difficulties in writing algorithms related with images in map
reduce framework and serialization errors of images, and despite the
popularity of text processings in Hadoop, It is still possible to perform
image processings in distributed framework as hadoop.

FUTURE SCOPE
To perform the same algorithms using different cloud frameworks and
comparing the commodity cluster performance versus new cloud virtual
cluster Or analysis and comparison of Openstack cloud virtual cluster
versus new cloud virtual cluster.

REFERENCES
●
[1] Jinesh Varia, Sajee Mathew, “Overview of Amazon Web
Services,”Amazon Web Services, 2014.
●
[2] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg,
Ivona Brandic, “Cloud Computing and emerging IT platforms: Vision, hype
and reality for delivering computing as the 5th utility,” Future Generation
Computer Systems, 2009, Elsevier.
●
[3] Dai Yuefa, Wu Bo, Gu Yaqiang, Zhang Quan, Tang Chaojin. “Data
Security Model for Cloud Computing, Proceedings of the 2009
International Workshop on Information Security and Application ,”
(IWISA 2009), (pp. 21-22). China.

●
[4] Qiao Lian, Wei Chen, Zheng Zhang, “On the impact of replica
placement to the reliability of distributed brick storage systems,”
Proceedings of the 25th IEEE International Conference on Distributed
Computing Systems, (pp. 187-196), 2005, IEEE.
●
[5] Daniel Ford et al., “Availability in globally distributed storage system,”
Google Inc.
●
[6] HDFS Architecture Guide,“from Address:,
”http://hadoop.apache.org/docs/r1.0.4/hdfs design.html.

A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters

Recommended

More Related Content

What's hot (20)

Similar to A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters (20)

Recently uploaded (20)

A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters