Unit 4

Aneka: Cloud Application Platform


1. Framework Overview
2. Anatomy of the Aneka Container
3. Building Aneka Clouds
4. Cloud Programming and Management
5. Data Intensive Computing Map-Reduce
Programming
6. What is Data-Intensive Computing?
7. Technologies for Data-Intensive Computing
8. Aneka MapReduce Programming.
Framework overview
• Aneka is Manjrasoft's solution for developing, deploying, and
managing cloud applications.
• Manjrasoft is a start-up focused on developing next-generation
.NET-based cloud computing technologies that ultimately save
you time and money.
• ANEKA is a patented cloud computing technology building block
that enhances:
• Application development, through support for the rapid creation of
new and legacy applications using innovative parallel and
distributed programming models.
• The ability of organisations to harness computing resources within an
enterprise for accelerating the execution of "compute"- or "data"-
intensive applications.
• ANEKA allows servers and desktop PCs to be linked
together to form a very powerful computing
infrastructure. This allows companies to become
energy efficient and save money without investing in
greater numbers of computers to run their complex
applications.
• One of the key advantages of Aneka is its extensible
set of APIs associated with different types of
programming models, such as Task, Thread, and
MapReduce, which are used for developing distributed
applications.
• It offers services like coordinating the
execution of applications, helping
administrators to monitor the status of the
cloud, and providing integration with existing
cloud technologies.
Aneka framework overview
• Aneka is a pure PaaS solution for cloud computing.
• A collection of interconnected containers constitutes
the Aneka cloud.
• The container features three different classes of
services:
1. Fabric services – infrastructure management
2. Foundation services – supporting services for the cloud
3. Execution services – application management and
execution
These services cover the following:
1. Elasticity and scaling
With the dynamic provisioning service, Aneka supports up-sizing and
down-sizing of the infrastructure available to applications.
2. Runtime management
The runtime machinery is responsible for keeping the
infrastructure up and running and serves as a hosting
environment for services.
3. Resource management
Aneka is an elastic infrastructure where resources are added and
removed dynamically, according to application needs and
user requirements.
4. Application management
Different services, such as scheduling, execution,
monitoring, and storage, are devoted to managing
applications.
5. User management
Aneka is a multi-tenant distributed environment where
multiple applications belonging to different users
are executed.
6. QoS/SLA management and billing
Application execution is metered and billed.
Anatomy of the Aneka container
• The Aneka container constitutes the building
block of Aneka clouds and represents the runtime
machinery available to services and applications.
• The main role of the container is to provide a
lightweight environment in which to deploy services,
together with some basic capabilities such as communication
channels for interaction with other nodes in the
Aneka cloud.
• Almost all operations performed within Aneka are
carried out by the services managed by the
container.
• The services installed in the Aneka container are
classified into three categories:
1. Fabric services
2. Foundation services
3. Application services
• Fabric services
- Lowest level of the software stack
- Define the basic infrastructure management
features of the system
- Provide access to the resource provisioning
subsystem and to the monitoring facilities.
Main services are: (i) profiling and monitoring,
(ii) resource management
• Foundation services
- Related to the logical management of the cloud
- Provide supporting services for the execution of applications
- The services are:
i. Storage management for applications
ii. Accounting, billing, and resource pricing
iii. Resource reservation
. Basic reservation
. Libra reservation
. Relay reservation
• Application services
- Manage the execution of applications
- The two services are:
(i) Scheduling, which coordinates with the
. Resource provisioning service
. Reservation service
. Accounting service
. Reporting service
Common tasks performed by the scheduling component are:
. Job-to-node mapping
. Rescheduling of failed jobs
. Job status monitoring
. Application status monitoring
(ii) Execution, whose tasks include:
. Unpacking the jobs received from the
scheduler
. Retrieval of input files
Building Aneka Clouds
• Aneka is a platform for developing distributed
applications for clouds.
Cloud programming and management
• The purpose of Aneka is to provide a scalable
middleware for executing distributed
applications.
• Application development and management
are the two features exposed to developers.
• In order to simplify these activities, Aneka
provides developers with APIs.
• The APIs are concentrated in the Aneka SDK.
• The SDK provides support for both
programming models and services by means
of the Application Model and the Service Model.
Data intensive computing: Map Reduce
programming
• Data-intensive computing focuses on a class of
applications that deal with a large amount of data.
• Several application fields, ranging from
computational science to social networking,
produce large volumes of data that need to be
efficiently stored, made accessible, indexed, and
analyzed.
• These tasks become challenging because the
quantity of information increases over time at
ever-higher rates.
• Distributed computing definitely helps in
addressing these challenges by providing
more scalable and efficient storage architectures
and better support for data computation and processing.
• MapReduce is a programming model for
creating data-intensive applications and
deploying them on clouds.
What is data-intensive computing?
• It is concerned with the production, manipulation, and analysis of large-scale
data in the range of hundreds of megabytes (MB) to petabytes (PB) and
beyond.
• Examples:
1. Scientific applications: telescopes mapping the sky produce hundreds of
gigabytes of data, amounting to petabytes over a year.
2. Bioinformatics applications mine large databases, and earthquake simulators
produce terabytes of data.
3. The customer data of a telecom company ranges from 10 to 100 terabytes.
4. Mobile handset traffic reached 8 petabytes per month and was expected to
grow to 327 petabytes per month by 2015.
5. Google processes about 24 petabytes of information per day.
6. Social networking: Facebook inbox search operations involve crawling
about 150 terabytes of data.
7. The Zynga social gaming platform moves about 1 petabyte of data.
Big data
• Extremely large data sets that may be analysed
computationally to reveal patterns, trends, and
associations, especially relating to human behaviour
and interactions.
• Big data refers to the large, diverse sets of information
that grow at ever-increasing rates.
• An example of big data might be petabytes (1,024
terabytes) or exabytes (1,024 petabytes)
of data consisting of billions to trillions of records of
millions of people, all from different sources (e.g.,
the Web, sales, customer contact centers, social media,
mobile data, and so on).
• Big data is used to better understand
customers and their behaviours and
preferences. Companies are keen to expand
their traditional data sets with social
media data, browser logs as well as text
analytics and sensor data to get a more
complete picture of their customers.
• The five V's of big data are volume, variety,
velocity, veracity, and value.
Technologies for data-intensive
computing
• Data-intensive computing concerns the
development of applications that are mainly
focused on processing large quantities of data.
• Therefore, (1) storage systems and (2)
programming models are the two technologies
that support data-intensive computing.
1. Storage systems
a. High-performance distributed file systems and
storage clouds:
1. Lustre
2. IBM General Parallel File System (GPFS)
3. Google File System (GFS)
4. Sector
5. Amazon Simple Storage Service (S3)
b. Not only SQL (NoSQL) systems
• Storage systems.
Traditionally, DBMSs constituted the storage
support for several types of applications. Due to
the explosion of unstructured data, this approach
no longer seems to be the preferred solution for data
analytics.
Distributed file systems constitute the primary
support for the management of data. They
provide an interface for storing information in
the form of files and later accessing it for read
and write operations.
1. High-performance distributed file systems and
storage clouds.
(a). Lustre.
The Lustre file system is a massively parallel distributed
file system that covers needs ranging from a small
workgroup cluster to a large-scale computing
cluster.
Lustre is designed to provide access to petabytes
(PB) of storage and throughput of hundreds of
gigabytes per second.
(b). IBM General Parallel File System (GPFS) is a high-
performance distributed file system developed by
IBM to support the RS/6000 supercomputer and
Linux computing clusters. It provides transparent access to
the file system and eliminates single points of failure.
GPFS is built on the concept of shared disks,
where a collection of disks is attached to the file
system nodes by means of some switching fabric.
(c). Google File System (GFS).
GFS is the storage infrastructure supporting the
execution of distributed applications in Google's
computing cloud.
The system has been designed to be a fault-tolerant,
highly available distributed file system built on
commodity hardware and a standard Linux OS.
The architecture of the file system is organized into a
single master, containing the metadata of the entire
file system, and a collection of chunk servers, which
provide storage space.
(d). Sector.
Sector is the storage cloud supporting the execution of
data-intensive applications defined according to the Sphere
framework.
It is a user-space file system that can be deployed on
commodity hardware across a wide area network.
Compared to other file systems, Sector does not partition a
file into blocks but replicates entire files on multiple
nodes, allowing users to customize the replication
strategy for better performance. The architecture of the
system is composed of four types of nodes: a security server, one or
more master nodes, slave nodes, and client machines.
(e). Amazon Simple Storage Service (S3).
Amazon S3 is the online storage service provided by Amazon.
Even though its internal details are not revealed, the system
is claimed to support high availability, reliability, scalability,
infinite storage, and low latency at commodity cost.
The storage space is organized into buckets, which are
attached to an AWS account.
Each bucket can store multiple objects, each of them
identified by a unique key.
Objects are identified by unique URLs and exposed through
the HTTP protocol.
Because of the use of the HTTP protocol, there is no need for
any specific library to access the storage system.
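
Since each S3 object is exposed at a unique URL over HTTP, a plain HTTP GET is enough to read it. Below is a minimal Java sketch (using the standard java.net.http client available since Java 11); the bucket name and object key are hypothetical, and the object is assumed to be publicly readable (private objects additionally require request signing).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class S3HttpGet {
    public static void main(String[] args) throws Exception {
        // Hypothetical, publicly readable object: bucket "example-bucket", key "data/report.txt".
        // S3 exposes each object at a URL of the form https://<bucket>.s3.amazonaws.com/<key>.
        URI objectUrl = URI.create("https://example-bucket.s3.amazonaws.com/data/report.txt");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(objectUrl).GET().build();

        // A plain HTTP GET retrieves the object; no S3-specific library is required.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
        System.out.println(response.body());
    }
}
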
2. Not only SQL (NoSQL) systems.
- The term was originally coined in 1998 to identify a relational database
that did not expose a SQL interface to manipulate and query
data, but relied on a set of UNIX shell scripts and commands to
operate on text files containing the actual data.
- In a strict sense, this original NoSQL cannot be considered a relational
database; it is a collection of scripts that allow users to manage
most of the simplest and more common database tasks by using
text files as the information store.
- Nowadays the term "NoSQL" is a big umbrella encompassing all
the storage and database management systems that differ in
some way from the relational model.
- Two main reasons have determined the growth of the NoSQL movement:
(1) In many cases, simple data models are enough to represent the
information used by applications.
(2) The quantity of information contained in unstructured formats has
grown considerably in the last decade.
- A broad classification distinguishes NoSQL
implementations into:
1. Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, and
Terrastore)
2. Graphs (AllegroGraph, Neo4j, FlockDB, and Cerebrum)
3. Key-value stores
4. Multi-value databases (OpenQM, Rocket U2, and OpenInsight)
5. Object databases (ObjectStore, JADE, and ZODB)
6. Tabular stores (Google Bigtable, Hadoop HBase, and Hypertable)
7. Tuple stores (Apache River)
• Some prominent implementations supporting data-intensive
applications:
(a). Apache CouchDB and MongoDB
- These are two examples of document stores.
- They provide a schema-less store where the primary objects are
documents, organized into a collection of key-value fields.
- They allow querying and indexing of data.
- CouchDB ensures ACID properties on data. (ACID refers to
the four key properties of a transaction: atomicity, consistency,
isolation, and durability.) All changes to data are performed as if
they were a single operation; that is, either all the changes are
performed or none of them are.
- MongoDB supports sharding, which is the ability to
distribute the content of a collection among different nodes
(a minimal usage sketch follows).
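
To show what working with a document store looks like in practice, here is a minimal sketch using the MongoDB synchronous Java driver; the connection string, database, collection, and field names are all hypothetical.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreExample {
    public static void main(String[] args) {
        // Hypothetical local MongoDB instance, database, and collection.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("analytics");
            MongoCollection<Document> users = db.getCollection("users");

            // A schema-less document: just a collection of key-value fields.
            Document doc = new Document("name", "alice")
                    .append("country", "IN")
                    .append("visits", 42);
            users.insertOne(doc);

            // Indexing and querying by field value.
            users.createIndex(new Document("country", 1));
            Document found = users.find(eq("country", "IN")).first();
            System.out.println(found != null ? found.toJson() : "not found");
        }
    }
}
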
(b). Amazon Dynamo.
- Dynamo is the distributed key-value store supporting the
management of information for several of Amazon's services.
- It provides an incrementally scalable and highly
available storage system.
(c). Google Bigtable.
- Bigtable is the distributed storage system designed to scale
up to petabytes of data across thousands of servers.
- It provides storage support to several Google
applications.
- Bigtable organizes the data storage in tables.
(d). Apache Cassandra.
- Cassandra is a distributed object store for managing large amounts
of structured data.
- It provides storage support for several very large Web
applications such as Facebook, Digg, and Twitter.
(e). Hadoop HBase.
- HBase is the distributed database supporting the storage needs
of the Hadoop distributed programming platform.
- Its main goal is to offer real-time read/write operations for
tables with billions of rows and millions of columns by
leveraging clusters of commodity hardware (a minimal client
sketch follows).
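
To illustrate the real-time read/write model, here is a minimal sketch using the HBase Java client API; the table name, column family, qualifier, and row key are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical table "webtable" with column family "content".
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Write: a row key plus (column family, qualifier, value) cells.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the same cell back.
            Get get = new Get(Bytes.toBytes("com.example/index.html"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(value));
        }
    }
}
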
MapReduce programming model
- MapReduce is a programming platform introduced by Google for
processing large quantities of data.
- It is a processing technique and a programming model for distributed
computing; its most widely used open-source implementation, Hadoop, is based on Java.
- The model contains two important tasks, Map and Reduce.
- Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
- Reduce takes the output from a map as input and combines those data
tuples into a smaller set of tuples.
- As the name MapReduce implies, the reduce task is
always performed after the map job.
- The model is expressed in the form of two functions:
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
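
To make the two functions concrete, the following minimal Java sketch (not tied to any particular framework; the class and method names are illustrative) shows what they could look like for a word-count application: map turns a document into a list of (word, 1) pairs, and reduce sums the counts collected for one word.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountModel {

    // map(k1, v1) -> list(k2, v2): a document id and its text
    // are turned into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(k2, list(v2)) -> list(v2): a word and the list of counts
    // emitted for it are combined into a single total.
    static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return List.of(sum);
    }
}

The grouping of all values emitted for the same key between the two functions (the shuffle) is performed by the runtime, not by user code.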
• How does MapReduce work?
• The whole process goes through four phases of
execution: splitting, mapping, shuffling,
and reducing.
• Let's understand this with an example.
Consider six text documents distributed across a
cluster of three servers, and suppose we want to
count how many times the words Apache, Hadoop,
Class, and Track occur across all the documents.
• 1. First, in the map stage, the input data (the
six documents) is split and distributed across
the cluster (the three servers). In this case,
each map task works on a split containing two
documents. During mapping, there is no
communication between the nodes. They
perform independently.
• 2. Then, the map tasks create a <key, value> pair for every
word. These pairs show how many times a word
occurs: the word is the key, and its count is the value. For
example, one document contains three of the four words
we are looking for: Apache 7 times, Class 8
times, and Track 6 times. The key-value pairs in that
map task's output look like this:
– <apache, 7>
– <class, 8>
– <track, 6>
• This process is done in parallel tasks on all nodes for all
documents, and each map task produces its own output.
3. After input splitting and mapping complete, the
outputs of every map task are shuffled. This is the
first step of the Reduce stage. Since we are
looking for the frequency of occurrence for four
words, there are four parallel Reduce tasks. The
reduce tasks can run on the same nodes as the
map tasks, or they can run on any other node.
• The shuffle step ensures the keys Apache,
Hadoop, Class, and Track are sorted for the
reduce step. This process groups the values by
keys in the form of <key, value-list> pairs.
• 4. In the reduce step of the Reduce stage, each of the
four tasks processes a <key, value-list> to produce a final
key-value pair. The reduce tasks also run at the
same time and work independently.
• In our example, the reduce tasks produce
the following individual results:
<apache, 22>
<hadoop, 20>
<class, 18>
<track, 22>
• 5. Finally, the data from the Reduce stage is
grouped into one output. MapReduce now
shows us how many times the words Apache,
Hadoop, Class, and Track appeared in all the
documents. The aggregated data is, by default,
stored in HDFS.
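
For reference, this word-count computation can be written against the Hadoop MapReduce API roughly as follows. This is a sketch of the standard Hadoop WordCount example; the input and output HDFS paths passed on the command line are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token.toLowerCase());
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts received for each word after the shuffle.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths for the input documents and the aggregated output.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Here the combiner reuses the reducer to pre-aggregate counts on the map side, which reduces the amount of data shuffled across the network.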
