A performance analysis of OpenStack
Cloud vs
Real System on Hadoop Clusters
BY: KUMARI SURABHI
INDEX TERMS
• OpenStack Cloud
• Hadoop
• Distributed Systems
• Virtualization
• Big Data
INTRODUCTION
• Computers are revolutionizing the modern era, especially the IT field.
• New technologies and ideas bring smarter, more efficient and faster
computers and frameworks to market every day.
• Advances in intelligent machines make computation, storage and transfer
of data faster and more accurate, which lets institutions, companies and
individuals solve their problems with ease.
• Among these computational advances, cloud computing and distributed
systems are the main focus of this seminar.
CLOUD COMPUTING
• Cloud computing is an emerging technology that is being exploited in
every aspect of technology.
• Cloud computing is an abstract term describing the use of resources that
do not belong to the user: the user performs the required task and then
disconnects from the resources when they are not in use.
• The most obvious examples are Gmail, Google Docs, Amazon EC2 and its
storage services, social networks such as Facebook, and many more.
WHY CLOUD COMPUTING?
BIG DATA
• Describes the exponential growth, availability and use of information,
both structured and unstructured.
• The concept of big data has three basic dimensions: volume, variety and
velocity; veracity and complexity are further dimensions.
PROBLEM STATEMENTS AND PRELIMINARIES
The purpose of this seminar is to answer the following questions:
● Is the performance of an OpenStack cloud better than that of a real
system for a Hadoop cluster?
● Is it feasible to run an image-processing MapReduce job in a Hadoop
cluster on an OpenStack cloud?
● What are the technical difficulties of converting image files to PDF
using the MapReduce framework in a distributed system?
CLOUD COMPUTING & OPENSTACK CLOUD
• Cloud computing is the use of computing resources (hardware and
software) that are delivered as a service over a network (typically the
Internet).
• Cloud services include the delivery of software, infrastructure, and
storage over the Internet.
• Based on deployment model, cloud computing can be subcategorized into:
Public, Private, Community and Hybrid clouds.
Cloud Computing continued...
Based on service model, cloud computing can be broadly divided into:
● Software as a Service (SaaS),
● Platform as a Service (PaaS),
● Infrastructure as a Service (IaaS).
CLOUD COMPUTING continued...
Based on deployment model, cloud computing can be subcategorized into:
• Public,
• Private,
• Community, and
• Hybrid clouds.
CLOUD COMPUTING continued...
There are many platforms available to set up a cloud, such as:
• CloudStack (Apache Software Foundation),
• DevStack,
• OpenStack,
• Eucalyptus,
• Nebula, and many more.
Note: OpenStack is chosen as the framework for implementing the cloud
in this article.
OPENSTACK CLOUD
• It is open source, leaving users open to pick and mix any hardware they
need.
• Open to design their own networks.
• Open to use any virtualization technology.
• Open to other needed features, and so on.
OPENSTACK CLOUD
• OpenStack is an open source cloud framework, originally launched by
Rackspace and NASA, with the aim to promote cloud standards and
provide a solid foundation for cloud development.
• It is the most widely used tool for setting up private and public clouds.
• Big companies like Dell, AMD, Cisco, HP and Rackspace are using it.
• Linux heavyweights like Red Hat and Ubuntu are implementing it.
• OpenStack exposes APIs compatible with Amazon EC2 and S3.
OPENSTACK CLOUD continued...
• OpenStack is an Infrastructure as a Service (IaaS) cloud computing project
that is free and open source software.
• It is revolutionizing the cloud computing world.
• It aims to create a system where storage, resources and performance
scale up quickly and efficiently.
OPENSTACK CLOUD continued...
The OpenStack cloud currently consists of six projects:
• Nova,
• Swift,
• Glance,
• Keystone,
• Quantum,
• Horizon.
OPENSTACK CLOUD continued...
● Nova: the computing fabric controller for the OpenStack cloud.
● Swift: the storage system for OpenStack, analogous to Amazon Web
Services' Simple Storage Service (S3).
● Glance: an imaging service for OpenStack, responsible for discovery,
registration and delivery services for disk and server images.
OPENSTACK CLOUD continued...
● Keystone: the OpenStack identity service, which provides authentication
and authorization for all components of OpenStack.
● Horizon: a web-based dashboard that gives administrators and users a
graphical interface to access, provision and automate cloud-based
resources.
OPENSTACK CLOUD ARCHITECTURE
HADOOP
• There are many distributed systems available to tackle the big data
problems faced by big companies.
• Hadoop is one of the available frameworks.
• Hadoop makes data mining, analytics, and processing of big data cheap
and fast.
• Hadoop is an open source project and is made to deal with terabytes of
data in minutes.
• Hadoop stores and processes any kind of data.
• Hadoop is natively written in Java but can be accessed from other
languages such as the SQL-inspired Hive, C/C++, Python and many more.
HADOOP continued...
• Hadoop grew out of an open source web search engine project and was
based on Google's MapReduce.
• Hadoop works on commodity hardware.
HDFS(Hadoop Distributed File System)
● The Hadoop Distributed File System provides unrestricted, high-speed
access to application data.
● A scalable, fault-tolerant, high-performance distributed file system.
● The namenode holds filesystem metadata.
● Files are broken up and spread over datanodes.
● Data is divided into 64 MB (default) or 128 MB blocks, each block
replicated 3 times (default).
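As a rough illustration of these defaults (a hypothetical helper, not part of Hadoop), the block count and total replicated copies for a file can be computed like this:

```python
import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Return (block count, total replicated copies) for a file
    stored in HDFS with the given block size and replication factor."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# A 169 MB input (the size of the first job's data set):
print(hdfs_blocks(169))  # (3, 9): 3 blocks, 9 copies across the datanodes
```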
ARCHITECTURE OF HDFS
MAPREDUCE
● MapReduce programs are executed in two main phases, called mapping
and reducing.
● Each phase is defined by a data processing function; these functions are
called the mapper and reducer, respectively.
● In the mapping phase, MapReduce takes the input data and feeds each
data element to the mapper.
● In the reducing phase, the reducer processes all the outputs from the
mapper and arrives at the final result.
● In simple terms, the mapper is meant to filter and transform the input
into something that the reducer can aggregate over.
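The two phases above can be sketched as a minimal single-process simulation (illustrative only; `run_mapreduce` is a hypothetical helper, not the Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    # Mapping phase: feed each input element to the mapper,
    # which emits (key, value) pairs.
    grouped = defaultdict(list)
    for element in inputs:
        for key, value in mapper(element):
            grouped[key].append(value)  # shuffle: group values by key
    # Reducing phase: the reducer aggregates each key's values.
    return {key: reducer(key, values) for key, values in grouped.items()}

# Example: count token occurrences across two records.
result = run_mapreduce(
    ["ab cd", "ab"],
    mapper=lambda doc: [(tok, 1) for tok in doc.split()],
    reducer=lambda key, values: sum(values),
)
print(result)  # {'ab': 2, 'cd': 1}
```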
MAPREDUCE continued...
PERFORMANCE ANALYSIS MODEL
● In the performance analysis model, we discuss the use of two basic
applications:
1. WordCount application
2. ImagetoPdf conversion application
● WordCount is a common MapReduce program used to count the total
number of words found in a document.
● The ImagetoPdf conversion program converts images into PDF.
PERFORMANCE ANALYSIS MODEL continued...
● These two programs are executed on a commodity computer cluster as
well as on an OpenStack cloud virtual instance cluster.
● The performance is analysed by changing the number of nodes and the
size of the data.
● The performance analysis has been done for both applications.
WORD COUNT APPLICATION
● WordCount is a simple application that counts the number of
occurrences of each word in a given input set.
● The purpose of this program is to calculate the total number of
repetitions of each word in a particular document.
● The pseudocode for the mapper and reducer of the WordCount program
is outlined in Algorithm 1 and Algorithm 2, respectively.
Mapper function for WordCount Program
● Input: String filename, String document
● Output: (String) token, 1

Map(String filename, String document)
{
    List<String> T = tokenize(document);
    for each token in T
    {
        emit((String) token, (Integer) 1);
    }
}
Reducer function for WordCount Program
● Input: (String) token, List<Integer> values
● Output: (String) token, sum

Reduce(String token, List<Integer> values)
{
    Integer sum = 0;
    for each value in values
    {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}
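Algorithms 1 and 2 can be transcribed directly into runnable Python (a local sketch of the same logic, not the Hadoop Java API; the driver that groups tokens stands in for Hadoop's shuffle):

```python
from collections import defaultdict

def wc_map(filename, document):
    # Algorithm 1: emit (token, 1) for every token in the document.
    for token in document.split():
        yield token, 1

def wc_reduce(token, values):
    # Algorithm 2: sum the counts emitted for this token.
    return token, sum(values)

def word_count(documents):
    grouped = defaultdict(list)          # shuffle stand-in
    for name, text in documents.items():
        for token, one in wc_map(name, text):
            grouped[token].append(one)
    return dict(wc_reduce(t, vs) for t, vs in grouped.items())

print(word_count({"doc1": "to be or not to be"}))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```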
Image to Pdf Conversion Application
● Hadoop is popular for processing textual big data, so plenty of material
is available when developing text-related applications.
● But little work has been done on image data processing in Hadoop.
● So there were a lot of challenges while developing the application.
● Some of the difficulties faced were serialization issues with images, the
splitting of images by Hadoop into its default blocks, image-to-PDF
conversion, text-to-PDF conversion and many more.
Work flow of the application
● Under the MapReduce model, data processing primitives are called
mappers and reducers.
Mapper function for ImagetoPdf Program
● Input: String key, KUPDF value
● Output: filename, KUPDF value (pdf file)

Map(String key, KUPDF value)
{
    for each bufferList in value
    {
        write(filename, value);
    }
}
Reducer function for ImagetoPdf Program
● Input: String key, KUPDF values
● Output: filename, KUPDF value (pdf file)

Reduce(String key, KUPDF values)
{
    for each value in values
    {
        concat value as a separate page of the pdf;
    }
    write(key, final pdf);
}
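The flow of the two functions above can be modeled in plain Python (a toy simulation: `KUPDF` is the application's custom type, so here lists of strings stand in for image buffers and PDF pages):

```python
def pdf_map(key, value):
    # Mapper: pass every buffered image through under its output key,
    # so all pages destined for one PDF meet at the same reducer.
    for buffer in value:
        yield key, buffer

def pdf_reduce(key, values):
    # Reducer: concatenate each value as a separate page of the PDF.
    final_pdf = []
    for value in values:
        final_pdf.append(value)  # stand-in for appending a PDF page
    return key, final_pdf

# Driver: group mapper output by key, then reduce each group.
grouped = {}
for k, page in pdf_map("album.pdf", ["img-1", "img-2", "img-3"]):
    grouped.setdefault(k, []).append(page)
results = dict(pdf_reduce(k, v) for k, v in grouped.items())
print(results)  # {'album.pdf': ['img-1', 'img-2', 'img-3']}
```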
MAPPER AND REDUCER
● In the mapping phase, MapReduce takes the input data and feeds each
data element to the mapper.
● In the reducing phase, the reducer processes all the outputs from the
mapper and arrives at the final result.
● The mapper is meant to filter and transform the input into something
that the reducer can aggregate over.
● The PDFMapper and PDFReducer classes do these jobs in the
application developed, working with image files and PDF files.
METHODS OF PERFORMANCE EVALUATIONS
Cloud Cluster Setup
• Quad-core Intel Xeon 64-bit CPU
• 16 GB RAM
• 1 TB ATA disk
• 500 GB ATA disk as storage
• Dual gigabit network interfaces
• Ubuntu 14.04 Server as the operating system
Cloud Cluster Setup continued...
● Installation and configuration of the OpenStack Essex cloud was done
following the OpenStack tutorial.
● Appropriate images were created using a virtualization tool supporting
KVM or Xen, such as QEMU, and terminal commands.
● Virtual systems for the cloud cluster were created from these images,
using terminal commands and the OpenStack web interface, after
successful network configuration of fixed and floating IPs and other
security parameters.
Commodity Computer Cluster Setup
The four-node cluster of commodity computers was set up with:
● Intel i5 quad-core 64-bit CPUs with 2 GB RAM
● One node with a 160 GB ATA disk
● The other three with 80 GB ATA disks
● Gigabit network interfaces
Commodity Computer Cluster Setup continued...
● Passwordless secure shell was configured,
● Java 7 was installed, and
● Hadoop 0.20.2 was configured on all four machines.
Cloud Cluster Setup continued...
● A four-node cluster, one node acting as master/slave and the other
three as slaves, was created using the Ubuntu 14.04 image.
● Passwordless secure shell was configured, Java 7 was installed and
Hadoop 0.20.2 was configured on all four instances.
Configuration of Experiments
Commodity Computer vs Cloud Server | Commodity Computer Details | Cloud Server Details
Master vs cse-dcg | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
Slave1 vs user1   | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
Slave2 vs user2   | 2 GB RAM, 2 VCPU, 80 GB storage  | 2 GB RAM, 2 VCPU, 80 GB storage
Slave3 vs user3   | 2 GB RAM, 2 VCPU, 80 GB storage  | 2 GB RAM, 2 VCPU, 80 GB storage
Experimental Results
After the successful configuration of the clusters, three jobs were run on
both systems:
● two jobs converting image files to PDF files, and
● one word count job.
● The first two jobs were based on image and PDF files being serialized in
the MapReduce framework.
● The last job was implemented based on the standard WordCount
program available in the Hadoop package.
● The algorithms were run first on a two-node cluster with a master and
two slaves, and then scaled up to a four-node cluster with a master and
four slaves (the master running a slave process as well).
1) Directory-wise Image to PDF:
The results of the first job are summarized in Table II and Table III:

TABLE II: SUMMARY OF FIRST JOB ON TWO NODES

                  | INPUT                              | OUTPUT                            | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 5 min 20 s
Cloud Cluster     | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB   | 3 min 43 s
Directory-wise Image to PDF continued...
● The first job's algorithm is designed to search images directory-wise and
convert each image file to a PDF file with the same directory tree as the
input image files.

TABLE III: SUMMARY OF FIRST JOB ON FOUR NODES

                  | INPUT                              | OUTPUT                            | TIME TAKEN
Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 3 min 8 s
Cloud Cluster     | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB   | 1 min 31 s
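Comparing Tables II and III, doubling the node count translates into the following speedups (simple arithmetic on the reported times):

```python
def seconds(minutes, secs):
    return minutes * 60 + secs

# First job: two-node vs four-node times from Tables II and III.
commodity_2n, commodity_4n = seconds(5, 20), seconds(3, 8)   # 320 s, 188 s
cloud_2n, cloud_4n = seconds(3, 43), seconds(1, 31)          # 223 s, 91 s

print(round(commodity_2n / commodity_4n, 2))  # ~1.7x faster on four nodes
print(round(cloud_2n / cloud_4n, 2))          # ~2.45x faster on four nodes
```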
GRAPHICAL REPRESENTATION
Time taken for First Job
EXPLANATION
The input to the job contained:
● 23 folders,
● 94 files, and
● 169 MB of data in total.
The output was the conversion of each image file to PDF with the same
directory and file names, amounting to 90.1 MB in both runs.
● The processing was repeated three times to obtain the average.
2) Multiple Images to Single PDF:
● A modified version of the first job explained above.
● All images are converted into a single final PDF output file.
● This processing is also done first on the two-node cluster and later
scaled up to the four-node cluster, as in the first algorithm.
● The processing was repeated three times to obtain the average.
● The experiments are summarized in Table IV and Table V.
Multiple Images to Single PDF continued...
● The input contained 476 image files in one directory; its size was
926 MB.
● The output was a single PDF file of 200.1 MB.

TABLE IV: SUMMARY OF SECOND JOB ON TWO NODES

                  | INPUT                             | OUTPUT               | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 11 min 29 s
Cloud Cluster     | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 12 min 28 s
Multiple Images to Single PDF continued...
TABLE V: SUMMARY OF SECOND JOB ON FOUR NODES

                  | INPUT                             | OUTPUT               | TIME TAKEN
Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 7 min 51 s
Cloud Cluster     | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 9 min 22 s
GRAPHICAL REPRESENTATION
This shows that the commodity computer cluster is more efficient than
the virtual node cluster in the OpenStack cloud.
Time taken for Second Job
EXPLANATION
● The first two jobs mapped small image files, which are not so effective
in the Hadoop system, as Hadoop performs really well with large data
sets as input.
● So, in order to test the real performance of Hadoop on big data, the
default word count job of the Hadoop system was also run.
● Hadoop was designed for text processing rather than image processing,
so textual processing was also chosen to analyse the Hadoop clusters.
TABLE VI: SUMMARY OF THIRD JOB ON TWO NODES

The input for the job was a text file of 1.1 GB, and the output was a file
containing the list of words, 364.6 KB in size.

                  | INPUT               | OUTPUT                            | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 364.6 KB | 7 min 51 s
Cloud Cluster     | 1 text file, 1.1 GB | 1 text file with counts, 364.6 KB | 9 min 22 s
TABLE VII: SUMMARY OF THIRD JOB ON FOUR NODES

                  | INPUT               | OUTPUT                            | TIME TAKEN
Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 4 min 0 s
Cloud Cluster     | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 5 min 1 s
GRAPHICAL REPRESENTATION
This shows that the Hadoop cluster on commodity computers performed
better than the Hadoop cluster in the cloud.
Time taken for Third Job
PERFORMANCE ANALYSIS
● The Hadoop distributed system set up on physical computers is certain
to be more efficient and faster than the cloud system.
● The first reason is that Hadoop was developed with commodity
machines in mind.
● The second obvious reason is that the processing is done on physical
hardware without any resource sharing, as compared to cloud systems.
CONTRADICTION
● The first job contradicts the points discussed above and the other two
jobs.
REASON:
● The job has to recursively read and write files, and thus has to cache
all the bytes read and to be written; this is faster in the cloud because
the nodes are on one server and there is no wire communication
between nodes.
CONCLUSION
● This seminar analysed running a Hadoop cluster in the cloud and on a
real system, identifying the best solution by running simple Hadoop
jobs in the configured clusters.
● It concludes that running a Hadoop cluster in the cloud for data storage
and analysis is more flexible and more easily scalable than a real
system cluster.
● The two-node to four-node experiments proved the easy scalability,
where the cloud cluster scaled up by creating an instance from an
already configured image.
● The case was not the same on the real system, where we needed to get
the machine, download the software, and adjust the configuration to
join the new machine to the cluster.
● Failed nodes in the cloud cluster could be terminated and replaced with
a new instance in seconds; the same is not possible on a real system.
● The cluster on real system computers is faster than the cloud cluster.
● But due to the advantageous features of cloud computing, such as quick
termination of servers (nodes) if problems arise, creation of a node
from the same state in which the machine was terminated, automatic
networking, and instant creation of nodes and clusters, a cloud Hadoop
cluster would be more favorable.
● Despite the difficulties of writing image-related algorithms in the
MapReduce framework and the serialization errors of images, and
despite the popularity of text processing in Hadoop, it is still possible
to perform image processing in a distributed framework such as
Hadoop.
FUTURE SCOPE
To perform the same algorithms using different cloud frameworks,
comparing commodity cluster performance versus a new cloud virtual
cluster, or analysing and comparing the OpenStack cloud virtual cluster
versus a new cloud virtual cluster.
THANK YOU
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Ad

A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters

  • 1. A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters, BY: KUMARI SURABHI
  • 2. INDEX TERMS • OpenStack Cloud • Hadoop • Distributed system • Virtualization • Big data
  • 3. INTRODUCTION • Computers are revolutionizing the human era, especially the IT field. • With new technologies and ideas, smart, efficient and faster computers and frameworks are introduced to the market every day. • Advances in intelligent machines and ideas make computation, storage and transaction of data faster and more accurate, which eventually lets institutions, companies and individuals solve their problems with ease. • Among the many computation improvements, cloud computing and distributed systems are the main focus of this seminar.
  • 4. CLOUD COMPUTING • Cloud computing is an emerging technology that is being exploited in every aspect of technology. • Cloud computing is an abstract term describing the use of resources that do not belong to the user to perform a required task, disconnecting from the resources when they are not in use. • The most obvious examples are Gmail, Google Docs, Amazon EC2 and storage, social networks such as Facebook, and many more.
  • 7. BIG DATA • Describes the exponential growth, availability and use of information, both structured and unstructured. • The concept of Big Data has three basic dimensions, volume, variety and velocity; further dimensions are veracity and complexity.
  • 8. PROBLEM STATEMENTS AND PRELIMINARIES The purpose of this seminar is to answer the following questions: ● Is the performance of the OpenStack cloud better than the real system on a Hadoop cluster? ● Is it feasible to run an image-processing MapReduce job in a Hadoop cluster on the OpenStack cloud? ● What are the technical difficulties of converting image files to PDF using the MapReduce framework in a distributed system?
  • 9. CLOUD COMPUTING &amp; OPENSTACK CLOUD • Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet). • Cloud services include the delivery of software, infrastructure, and storage over the Internet. • Based on the deployment model, cloud computing can be subcategorized into Public, Private, Community and Hybrid clouds.
  • 10. CLOUD COMPUTING continued... Based on the service model, cloud computing can be broadly divided into: ● Software as a Service (SaaS), ● Platform as a Service (PaaS), ● Infrastructure as a Service (IaaS).
  • 11. CLOUD COMPUTING continued... Based on the deployment model, cloud computing can be subcategorized into: • Public, • Private, • Community and • Hybrid clouds.
  • 12. CLOUD COMPUTING continued... There are many platforms available to set up a cloud, such as: • CloudStack, • DevStack, • OpenStack, • Eucalyptus, • Nebula and many more. Note: OpenStack is chosen as the framework for implementing the cloud in this article.
  • 13. OPENSTACK CLOUD • It is open source, which means it is open to pick and mix any hardware needed. • Open to design your own networks. • Open to use any virtualization technology. • Open to other needed features, and so on.
  • 14. OPENSTACK CLOUD • OpenStack is an open source cloud framework, originally launched by Rackspace and NASA, with the aim of promoting cloud standards and providing a solid foundation for cloud development. • It is the most widely used tool for setting up private and public clouds. • Big companies like Dell, AMD, Cisco, HP and Rackspace are using it. • Linux heavyweights like Red Hat and Ubuntu are implementing it. • OpenStack exposes an API compatible with Amazon's.
  • 15. OPENSTACK CLOUD continued... • OpenStack is an Infrastructure as a Service (IaaS) cloud computing project that is free, open source software. • It is revolutionizing the cloud computing world. • It aims to create a system where storage, resources and performance can scale up quickly and efficiently.
  • 16. OPENSTACK CLOUD continued... The OpenStack cloud currently consists of six projects: • Nova, • Swift, • Glance, • Keystone, • Quantum, • Horizon.
  • 17. OPENSTACK CLOUD continued... ● Nova: the computing fabric controller for the OpenStack cloud. ● Swift: the storage system for OpenStack, analogous to Amazon Web Services' Simple Storage Service. ● Glance: an imaging service for OpenStack, responsible for discovery, registration and delivery services for disk and server images.
  • 18. OPENSTACK CLOUD continued... ● Keystone: the OpenStack identity service, which provides authentication and authorization for all components of OpenStack. ● Horizon: a web-based dashboard that gives administrators and users a graphical interface to access, provision and automate cloud-based resources.
  • 20. HADOOP • There are many distributed systems available to address the big data problems faced by big companies. • Hadoop is one of the available frameworks. • Hadoop makes data mining, analytics, and processing of big data cheap and fast. • Hadoop is an open source project made to deal with terabytes of data in minutes. • Hadoop stores and processes any kind of data. • Hadoop is natively written in Java but can be accessed from other languages, such as the SQL-inspired Hive language, C/C++, Python and many more.
  • 21. HADOOP continued... • Hadoop grew out of an open source web search engine based on Google's MapReduce. • Hadoop works on commodity hardware.
  • 22. HDFS (Hadoop Distributed File System) ● The Hadoop Distributed File System provides unrestricted, high-speed access to data for applications. ● A scalable, fault-tolerant, high-performance distributed file system. ● The Namenode holds the filesystem metadata. ● Files are broken up and spread over Datanodes. ● Data is divided into blocks of 64 MB (default) or 128 MB, with each block replicated 3 times (default).
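As a back-of-the-envelope illustration of the block and replication figures above (a hypothetical sketch, not part of Hadoop itself; the function name and defaults are assumptions for illustration):

```python
import math

def hdfs_block_stats(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of HDFS blocks for a file, total block replicas stored)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# With the defaults above, a 169 MB file (the image data set used later in the
# experiments) occupies 3 blocks, stored as 9 replicas across the Datanodes.
blocks, replicas = hdfs_block_stats(169)
```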
  • 24. MAPREDUCE ● MapReduce programs are executed in two main phases, called mapping and reducing. ● Each phase is defined by a data processing function; these functions are called the mapper and the reducer, respectively. ● In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. ● In the reducing phase, the reducer processes all the outputs from the mapper and arrives at the final result. ● In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
  • 26. PERFORMANCE ANALYSIS MODEL ● In the performance analysis model, we discuss the use of two basic applications: 1. the WordCount application and 2. the ImageToPdf conversion application. ● WordCount is a common MapReduce program used to count the total number of words found in a document. ● The ImageToPdf conversion program converts images into a PDF.
  • 27. PERFORMANCE ANALYSIS MODEL continued... ● These two programs are executed on a cluster of commodity computers as well as on a cluster of OpenStack cloud virtual instances. ● We analyse the performance by varying the number of nodes and the size of the data. ● The performance analysis has been done for both applications.
  • 28. WORD COUNT APPLICATION ● WordCount is a simple application that counts the number of occurrences of each word in a given input set. ● The purpose of the program is to calculate the total number of repetitions of each word in a particular document. ● The pseudo code for the Mapper and Reducer of the WordCount program is outlined in Algorithm 1 and Algorithm 2, respectively.
  • 29. Mapper function for WordCount Program
  Input: String filename, String document
  Output: (String token, 1)
  Map(String filename, String document) {
      List<String> T = tokenize(document);
      for each token in T {
          emit((String) token, (Integer) 1);
      }
  }
  • 30. Reducer function for WordCount Program
  Input: (String token, list of 1s)
  Output: (String token, sum)
  Reduce(String token, List<Integer> values) {
      Integer sum = 0;
      for each value in values {
          sum = sum + value;
      }
      emit((String) token, (Integer) sum);
  }
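Algorithms 1 and 2 can be simulated outside Hadoop with a few lines of plain Python (an illustrative sketch, not the deck's actual Hadoop code; tokenize here is a simple whitespace split, which is an assumption):

```python
from collections import defaultdict

def map_phase(filename, document):
    # Algorithm 1: emit (token, 1) for every token in the document
    return [(token, 1) for token in document.split()]

def reduce_phase(pairs):
    # shuffle step: group the emitted 1s by token
    grouped = defaultdict(list)
    for token, one in pairs:
        grouped[token].append(one)
    # Algorithm 2: sum the values for each token
    return {token: sum(values) for token, values in grouped.items()}

counts = reduce_phase(map_phase("doc.txt", "to be or not to be"))
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```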
  • 31. Image to Pdf Conversion Application ● Hadoop is popular for processing textual big data, so there is a lot of material available for developing applications that work with text. ● But little work has been done on image data processing in Hadoop. ● So there were many challenges while developing the application. ● Some of the difficulties faced were serialization issues with images, the splitting of images by Hadoop into its default blocks, image-to-pdf conversion, text-to-pdf conversion and many more.
  • 32. Workflow of the application ● Under the MapReduce model, data processing primitives are called mappers and reducers.
  • 33. Mapper function for ImagetoPdf Program
  Input: String key, KUPDF value
  Output: (filename, KUPDF value (pdf file))
  Map(String key, KUPDF value) {
      for each bufferList in value {
          write(filename, value);
      }
  }
  • 34. Reducer function for ImagetoPdf Program
  Input: String key, KUPDF values
  Output: (filename, KUPDF value (pdf file))
  Reduce(String key, KUPDF values) {
      for each value in values {
          concat value as a separate page of the pdf
      }
      write(key, final pdf);
  }
  • 35. MAPPER AND REDUCER ● In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. ● In the reducing phase, the reducer processes all the outputs from the mapper and arrives at the final result. ● The mapper is meant to filter and transform the input into something that the reducer can aggregate over. ● The PDFMapper and PDFReducer classes do these jobs in the developed application, working on image files and pdf files.
  • 36. METHODS OF PERFORMANCE EVALUATION Cloud Cluster Setup • quad-core Intel Xeon 64-bit CPU • 16 GB RAM • 1 TB ATA disk • 500 GB ATA disk as storage • 32-bit dual gigabit network interfaces were used. • Ubuntu 14.04 Server was installed as the operating system.
  • 37. Cloud Cluster Setup continued... ● Installation and configuration of the OpenStack Essex cloud followed the official OpenStack tutorial. ● Appropriate images were created with a virtualization tool supporting KVM or XEN, such as QEMU, using terminal commands. ● After the network was configured with fixed and floating IPs and other security parameters, the virtual systems for the cloud cluster were created from those images, using terminal commands and the OpenStack web interface.
  • 38. Commodity Computer Cluster Setup The four-node cluster of commodity computers was set up with: ● Intel i5 quad-core 64-bit CPUs with 2 GB RAM ● one machine with a 160 GB ATA disk ● the other three with 80 GB ATA disks ● 32-bit gigabit network interfaces.
  • 39. Commodity Computer Cluster Setup continued... ● Passwordless secure shell was configured, ● Java 7 was installed, and ● Hadoop 0.20.2 was configured on all four machines.
  • 40. Cloud Cluster Setup continued... ● A four-node cluster, with one node acting as master/slave and the other three as slaves, was created using the Ubuntu 14.04 image. ● Passwordless secure shell was configured, Java 7 was installed and Hadoop 0.20.2 was configured on all four instances.
  • 41. Configuration of Experiments
  Commodity Computer vs Cloud Server | Commodity Computer Details | Cloud Server Details
  Master vs cse-dcg | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
  Slave1 vs user1 | 2 GB RAM, 2 VCPU, 160 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
  Slave2 vs user2 | 2 GB RAM, 2 VCPU, 80 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
  Slave3 vs user3 | 2 GB RAM, 2 VCPU, 80 GB storage | 2 GB RAM, 2 VCPU, 80 GB storage
  • 42. Experimental Results After the successful configuration of the clusters, three jobs were run on both systems: ● two jobs converting image files to pdf files and ● one word-count job. ● The first two jobs were based on image and pdf files serialized in the MapReduce framework. ● The last job was implemented based on the standard WordCount program available in the Hadoop package. ● The algorithms were run first on a two-node cluster with a master and two slaves, and then scaled up to a four-node cluster with a master and four slaves (the master running a slave machine as well).
  • 43. 1) Directory-wise Image to PDF: The results of the first job are summarized in Table II and Table III.
  TABLE II. SUMMARY OF FIRST JOB ON TWO NODES
  Cluster | Input | Output | Time taken
  Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 5 min 20 s
  Cloud Cluster | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB | 3 min 43 s
  • 44. Directory-wise Image to PDF continued... ● The first job's algorithm is designed to search images directory-wise and convert each image file to a pdf file with the same directory tree as the input image files.
  TABLE III. SUMMARY OF FIRST JOB ON FOUR NODES
  Cluster | Input | Output | Time taken
  Commodity Cluster | 23 folders, 94 image files, 169 MB | 23 folders, 94 pdf files, 90.1 MB | 3 min 8 s
  Cloud Cluster | 23 folders, 94 image files, 169 MB | 1 folder, 94 pdf files, 90.1 MB | 1 min 31 s
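The times in Tables II and III reduce to a simple scaling speedup (illustrative arithmetic only; the figures are copied from the tables above):

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

# First job (directory-wise image to PDF), keyed by number of nodes
commodity = {2: to_seconds(5, 20), 4: to_seconds(3, 8)}   # Tables II and III
cloud     = {2: to_seconds(3, 43), 4: to_seconds(1, 31)}

# Speedup from scaling two nodes up to four nodes
commodity_speedup = commodity[2] / commodity[4]  # 320 s / 188 s, about 1.70x
cloud_speedup     = cloud[2] / cloud[4]          # 223 s / 91 s, about 2.45x
```

The cloud cluster not only finished faster on this job but also scaled better from two to four nodes.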
  • 46. EXPLANATION The input to the job contained: ● 23 folders, ● 94 files, ● 169 MB in total. The output was the conversion of each image file to pdf with the same directory and file names, generating 90.1 MB in both runs. ● The processing was repeated three times to get the average.
  • 47. 2) Multiple Images to Single PDF: ● A modified version of the first job explained above. ● All images are converted into a single final pdf output file. ● This processing was also done first on a two-node cluster and later scaled up to a four-node cluster, as with the first algorithm. ● The processing was repeated three times to get the average. ● The experiments are summarized in Table IV and Table V.
  • 48. Multiple Images to Single PDF continued... ● The input contained 476 image files in one directory, 926 MB in size. ● The output was a single pdf file of 200.1 MB.
  TABLE IV. SUMMARY OF SECOND JOB ON TWO NODES
  Cluster | Input | Output | Time taken
  Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 11 min 29 s
  Cloud Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 12 min 28 s
  • 49. Multiple Images to Single PDF continued...
  TABLE V. SUMMARY OF SECOND JOB ON FOUR NODES
  Cluster | Input | Output | Time taken
  Commodity Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 7 min 51 s
  Cloud Cluster | 1 folder, 476 image files, 926 MB | 1 pdf file, 200.1 MB | 9 min 22 s
  • 50. GRAPHICAL REPRESENTATION This shows that the commodity computer cluster is more efficient than the virtual-node cluster in the OpenStack cloud for this job. (Figure: time taken for the second job.)
  • 51. EXPLANATION ● The first two jobs mapped over small image files, which is not so effective in Hadoop, as Hadoop performs really well with large data sets as input. ● So, in order to test the real performance of Hadoop on big data, the default word-count job of the Hadoop system was also run. ● Hadoop was designed for text processing rather than image processing, so textual processing was also chosen to analyse the Hadoop clusters.
  • 52. TABLE VI. SUMMARY OF THIRD JOB ON TWO NODES The input for the job was a text file of 1.1 GB, and the output was a file containing the list of word counts, 364.6 KB in size.
  Cluster | Input | Output | Time taken
  Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts | 7 min 51 s
  Cloud Cluster | 1 text file, 1.1 GB | 1 text file with counts | 9 min 22 s
  • 53. TABLE VII. SUMMARY OF THIRD JOB ON FOUR NODES
  Cluster | Input | Output | Time taken
  Commodity Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 4 min 0 s
  Cloud Cluster | 1 text file, 1.1 GB | 1 text file with counts, 361.6 KB | 5 min 1 s
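From Table VII, the relative slowdown of the cloud cluster on the word-count job works out as follows (illustrative arithmetic on the four-node figures above):

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

commodity_4n = to_seconds(4, 0)   # 240 s on four commodity nodes (Table VII)
cloud_4n     = to_seconds(5, 1)   # 301 s on four cloud nodes (Table VII)

# The cloud cluster took roughly a quarter longer than the commodity cluster.
slowdown = (cloud_4n - commodity_4n) / commodity_4n
```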
  • 54. GRAPHICAL REPRESENTATION It shows that the Hadoop cluster on commodity computers performed better than the Hadoop cluster in the cloud. (Figure: time taken for the third job.)
  • 55. PERFORMANCE ANALYSIS ● The Hadoop distributed system set up on physical computers is certain to be more efficient and faster than the cloud system. ● The first reason is that Hadoop was developed with commodity machines in mind. ● The second, obvious reason is that the processing is done on physical hardware without any resource sharing, in contrast to cloud systems.
  • 56. CONTRADICTION ● The first job contradicts the points discussed above and the other two jobs. REASON: ● The job has to recursively read and write files, and thus has to cache all the bytes read and to be written; this is faster in the cloud, as the nodes reside on one server and there is no wire communication between nodes.
  • 57. CONCLUSION ● An analysis of running a Hadoop cluster in the cloud and on a real system, identifying the best solution by running simple Hadoop jobs in the configured clusters. ● It concludes that running a Hadoop cluster in the cloud for data storage and analysis is more flexible and more easily scalable than a real-system cluster. ● The two-node to four-node experiments proved the easy scalability: the cloud cluster scaled up by creating an instance from an already configured image. ● The case was not the same in the real system, where we needed to get the machine, download the software, and adjust the configuration to join the new machine to the cluster.
  • 58. ● Failed nodes in the cloud cluster could be terminated and replaced with a new instance in seconds, which is not possible in the real system. ● The cluster on real-system computers is faster than the cloud cluster. ● But due to the various advantageous features of the cloud computing system, such as quick termination of servers (nodes) if problems arise, re-creation of a node from the same state in which the machine was terminated, automatic networking, and instant creation of nodes and clusters, a cloud Hadoop cluster would be more favorable. ● Despite the difficulties of writing image-related algorithms in the MapReduce framework and the serialization errors with images, and despite the popularity of text processing in Hadoop, it is still possible to perform image processing in a distributed framework such as Hadoop.
  • 59. FUTURE SCOPE To run the same algorithms on different cloud frameworks, comparing the commodity-cluster performance against a new cloud virtual cluster, or an analysis and comparison of the OpenStack cloud virtual cluster versus a new cloud virtual cluster.
  • 60. REFERENCES ● [1] Jinesh Varia, Sajee Mathew, "Overview of Amazon Web Services," Amazon Web Services, 2014. ● [2] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, Ivona Brandic, "Cloud Computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, 2009, Elsevier. ● [3] Dai Yuefa, Wu Bo, Gu Yaqiang, Zhang Quan, Tang Chaojin, "Data Security Model for Cloud Computing," Proceedings of the 2009 International Workshop on Information Security and Application (IWISA 2009), pp. 21-22, China.
  • 61. ● [4] Qiao Lian, Wei Chen, Zheng Zhang, "On the impact of replica placement to the reliability of distributed brick storage systems," Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, pp. 187-196, 2005, IEEE. ● [5] Daniel Ford et al., "Availability in globally distributed storage systems," Google Inc. ● [6] HDFS Architecture Guide, http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html.