Unit 4
Aneka: Cloud Application Platform
1. Framework Overview
2. Anatomy of the Aneka Container
3. Building Aneka Clouds
4. Cloud Programming and Management
5. Data-Intensive Computing: MapReduce Programming
6. What is Data-Intensive Computing?
7. Technologies for Data-Intensive Computing
8. Aneka MapReduce Programming

Framework overview
• Aneka is Manjrasoft's solution for developing, deploying, and managing cloud applications.
• Manjrasoft is a start-up focused on developing next-generation .NET-based cloud computing technologies that ultimately save time and money.
• Aneka is a patented cloud computing technology building block that enhances:
  • application development, through support for the rapid creation of legacy and new applications using innovative parallel and distributed programming models;
  • the ability of organisations to harness computing resources within an enterprise to accelerate the execution of compute- or data-intensive applications.
• Aneka allows servers and desktop PCs to be linked together to form a very powerful computing infrastructure. This lets companies become energy efficient and save money without investing in greater numbers of computers to run their complex applications.
• One of the key advantages of Aneka is its extensible set of APIs associated with different types of programming models, such as Task, Thread, and MapReduce, used for developing distributed applications (a conceptual sketch of the Task model idea follows the services list below).
• It offers services such as coordinating the execution of applications, helping administrators monitor the status of the cloud, and providing integration with existing cloud technologies.

Aneka framework overview
• Aneka is a pure PaaS solution for cloud computing.
• A collection of interconnected containers constitutes the Aneka cloud.
• The containers feature three different classes of services:
  1. Fabric services: infrastructure management.
  2. Foundation services: supporting services for the cloud.
  3. Execution services: application management and execution.
• These services involve the following:

Services
1. Elasticity and scaling: through the dynamic provisioning service, Aneka supports up-sizing and down-sizing of the infrastructure available to applications.
2. Runtime management: the runtime machinery is responsible for keeping the infrastructure up and running and serves as a hosting environment for services.
3. Resource management: Aneka is an elastic infrastructure in which resources are added and removed dynamically according to application needs and user requirements.
4. Application management: different services, such as scheduling, execution, monitoring, and storage, are devoted to managing applications.
5. User management: Aneka is a multi-tenant distributed environment in which multiple applications belonging to different users are executed.
6. QoS/SLA management and billing: application execution is metered and billed.
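The Task, Thread, and MapReduce programming models mentioned above are exposed through Aneka's .NET APIs, which are not reproduced here. As a rough analogy only, the following Java sketch expresses an application as a bag of independent tasks submitted to an executor for parallel execution, which is the core idea the Task model applies across the nodes of an Aneka cloud. All class and task names are made up for illustration and do not correspond to the Aneka SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative analogy of the "bag of independent tasks" idea behind the Task model.
// Not the Aneka API: Aneka distributes such tasks across cloud nodes rather than local threads.
public class TaskModelAnalogy {
    public static void main(String[] args) throws Exception {
        // The executor stands in for the middleware that maps tasks onto resources.
        ExecutorService executor = Executors.newFixedThreadPool(4);

        // Each task is self-contained and independent of the others.
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            final int id = i;
            tasks.add(() -> "result of independent task " + id);
        }

        // Submit the whole bag of tasks and collect the results once they complete.
        List<Future<String>> results = executor.invokeAll(tasks);
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        executor.shutdown();
    }
}
```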
Anatomy of the Aneka container
• The Aneka container constitutes the building block of Aneka clouds and represents the runtime machinery available to services and applications.
• The main role of the container is to provide a lightweight environment in which to deploy services, together with some basic capabilities such as communication channels for interaction with the other nodes of the Aneka cloud.
• Almost all operations performed within Aneka are carried out by the services managed by the container.
• The services installed in the Aneka container are classified into three categories:
  1. Fabric services
  2. Foundation services
  3. Application services

• Fabric services
  - Lowest level of the software stack.
  - Define the basic infrastructure management features of the system.
  - Provide access to the resource-provisioning subsystem and to the monitoring facilities.
  - The main services are:
    i) Profiling and monitoring
    ii) Resource management

• Foundation services
  - Related to the logical management of the cloud.
  - Provide supporting services for the execution of applications.
  - The services are:
    i) Storage management for applications
    ii) Accounting, billing, and resource pricing
    iii) Resource reservation (basic reservation, Libra reservation, relay reservation)

• Application services
  - Manage the execution of applications.
  - The two services are:
    i) Scheduling, which interacts with the resource provisioning service, the reservation service, the accounting service, and the reporting service. Common tasks performed by the scheduling component are (a toy illustration follows this list):
      . Job-to-node mapping
      . Rescheduling of failed jobs
      . Job status monitoring
      . Application status monitoring
    ii) Execution, whose common tasks include unpacking the jobs received from the scheduler and retrieval of the input files required for job execution.
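To make the scheduling responsibilities above a little more concrete, here is a toy round-robin sketch of job-to-node mapping with rescheduling of failed jobs. It is purely illustrative and says nothing about how Aneka's scheduler is actually implemented; the Job and Node types are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy illustration of job-to-node mapping and rescheduling of failed jobs.
// Not Aneka's scheduler: the types and the round-robin policy are hypothetical.
public class ToyScheduler {
    record Job(String id) {}

    record Node(String name) {
        boolean execute(Job job) {
            boolean ok = Math.random() > 0.2;   // simulate roughly 20% job failures
            System.out.println(name + " ran " + job.id() + (ok ? " -> OK" : " -> FAILED"));
            return ok;
        }
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(new Node("node-1"), new Node("node-2"));
        Queue<Job> pending = new ArrayDeque<>(List.of(new Job("j1"), new Job("j2"), new Job("j3")));

        int next = 0;
        while (!pending.isEmpty()) {
            Job job = pending.poll();
            Node node = nodes.get(next++ % nodes.size());   // job-to-node mapping (round robin)
            if (!node.execute(job)) {
                pending.add(job);                           // reschedule the failed job
            }
        }
    }
}
```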
Building Aneka Clouds
• Aneka is a platform for developing distributed applications for clouds.

Cloud programming and management
• The purpose of Aneka is to provide a scalable middleware on which to execute distributed applications.
• Application development and management are the two features exposed to developers.
• In order to simplify these activities, Aneka provides developers with APIs.
• The APIs are concentrated in the Aneka SDK.
• The SDK provides support for both programming models and services by means of the Application Model and the Service Model.

Data-intensive computing: MapReduce programming
• Data-intensive computing focuses on a class of applications that deal with large amounts of data.
• Several application fields, ranging from computational science to social networking, produce large volumes of data that need to be efficiently stored, made accessible, indexed, and analyzed.
• These tasks become challenging because the quantity of information increases over time at ever higher rates.
• Distributed computing helps address these challenges by providing more scalable and efficient storage architectures and faster data computation and processing.
• MapReduce is a programming model for creating data-intensive applications and deploying them on clouds.

What is data-intensive computing?
• Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data, ranging from hundreds of megabytes (MB) to petabytes (PB) and beyond.
• Examples:
  1. Scientific applications: telescopes mapping the sky produce hundreds of gigabytes of data, and petabytes over a year.
  2. Bioinformatics applications that mine databases and earthquake simulators produce terabytes of data.
  3. The customer data of a telecom company ranges from 10 to 100 terabytes.
  4. Handset mobile traffic reached 8 petabytes per month and was expected to grow to 327 petabytes per month by 2015.
  5. Google processes about 24 petabytes of information per day.
  6. Social networking: Facebook inbox search operations involve crawling about 150 terabytes of data.
  7. The Zynga social gaming platform moves 1 petabyte of data.

Big data
• Big data refers to extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.
• It also refers to the large, diverse sets of information that grow at ever-increasing rates.
• An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people, all from different sources (e.g. Web, sales, customer contact centre, social media, mobile data, and so on).
• Big data is used to better understand customers and their behaviours and preferences. Companies are keen to expand their traditional data sets with social media data, browser logs, text analytics, and sensor data to get a more complete picture of their customers.
• The five V's of big data are volume, variety, velocity, veracity, and value.

Technologies for data-intensive computing
• Data-intensive computing concerns the development of applications that are mainly focused on processing large quantities of data.
• Therefore, (1) storage systems and (2) programming models are the two technologies that support data-intensive computing.

1. Storage systems
   a. High-performance distributed file systems and storage clouds:
      1. Lustre
      2. IBM General Parallel File System (GPFS)
      3. Google File System (GFS)
      4. Sector
      5. Amazon Simple Storage Service (S3)
   b. Not only SQL (NoSQL) systems

• Storage systems. Traditionally, database management systems constituted the storage support for several types of applications. With the explosion of unstructured data, this approach no longer seems to be the preferred solution for data analytics. Distributed file systems now constitute the primary support for the management of data. They provide an interface through which information is stored in the form of files and later accessed for read and write operations.

1. High-performance distributed file systems and storage clouds
(a) Lustre. The Lustre file system is a massively parallel distributed file system that covers the needs of anything from a small workgroup of clusters to a large-scale computing cluster. Lustre is designed to provide access to petabytes (PB) of storage with a throughput of hundreds of gigabytes per second.
(b) IBM General Parallel File System (GPFS). GPFS is a high-performance distributed file system developed by IBM to support the RS/6000 supercomputer and Linux computing clusters. It provides transparent access to the file system and eliminates single points of failure. GPFS is built on the concept of shared disks, where a collection of disks is attached to the file system nodes by means of some switching fabric.
(c) Google File System (GFS). GFS is the storage infrastructure supporting the execution of distributed applications in Google's computing cloud. The system has been designed as a fault-tolerant, highly available, distributed file system built on commodity hardware and a standard Linux OS. The architecture of the file system is organized into a single master, containing the metadata of the entire file system, and a collection of chunk servers, which provide storage space.
(d) Sector. Sector is the storage cloud supporting the execution of data-intensive applications defined according to the Sphere framework. It is a user-space file system that can be deployed on commodity hardware across a wide-area network. Compared to other file systems, Sector does not partition a file into blocks but replicates the entire file on multiple nodes, allowing users to customize the replication strategy for better performance. The architecture of the system is composed of four types of nodes: a security server, one or more master nodes, slave nodes, and client machines.
(e) Amazon Simple Storage Service (S3). Amazon S3 is the online storage service provided by Amazon. Even though its internal details are not revealed, the system is claimed to support high availability, reliability, scalability, infinite storage, and low latency at commodity cost. The storage space is organized into buckets, which are attached to an AWS account. Each bucket can store multiple objects, each identified by a unique key. Objects are addressable through unique URLs and exposed via the HTTP protocol. Because of the use of HTTP, no specific client library is strictly required for accessing the storage system (a minimal sketch follows below).
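Because each S3 object is exposed through a URL over HTTP, a publicly readable object can be fetched with nothing more than a plain HTTP client, as the Java sketch below shows. The bucket and key names are made up for illustration, and a private object would additionally require authenticated (signed) requests, typically via the AWS SDK.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Fetch a (hypothetical) publicly readable S3 object using only HTTP.
// Bucket and key are illustrative; private objects require signed requests (e.g. via the AWS SDK).
public class S3HttpGet {
    public static void main(String[] args) throws Exception {
        // The object is addressed by its URL: bucket name + key.
        String url = "https://example-bucket.s3.amazonaws.com/reports/summary.txt";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println(response.body());
    }
}
```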
2. Not only SQL (NoSQL) systems
- The term was originally coined in 1998 to identify a relational database that did not expose a SQL interface for manipulating and querying data but relied on a set of UNIX shell scripts and commands to operate on text files containing the actual data.
- In a strict sense, that system was not a relational database at all; it was a collection of scripts that allowed users to manage most of the simplest and most common database tasks by using text files as the information store.
- Nowadays, the term NoSQL is a big umbrella encompassing all the storage and database management systems that differ in some way from the relational model.
- Two main reasons have driven the growth of the NoSQL movement:
  (1) In many cases, simple data models are enough to represent the information used by applications.
  (2) The quantity of information contained in unstructured formats has grown considerably in the last decade.
- A broad classification distinguishes NoSQL implementations into:
  1. Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, and Terrastore)
  2. Graph databases (AllegroGraph, Neo4j, FlockDB, and Cerebrum)
  3. Key-value stores
  4. Multi-value databases (OpenQM, Rocket U2, and OpenInsight)
  5. Object databases (ObjectStore, JADE, and ZODB)
  6. Tabular stores (Google Bigtable, Hadoop HBase, and Hypertable)
  7. Tuple stores (Apache River)

• Some prominent implementations supporting data-intensive applications:
(a) Apache CouchDB and MongoDB
  - Two examples of document stores.
  - They provide a schema-less store in which the primary objects are documents organized into a collection of key-value fields.
  - They allow data to be queried and indexed (see the sketch after this list).
  - CouchDB ensures ACID properties on data. (ACID refers to the four key properties of a transaction: atomicity, consistency, isolation, and durability. All changes to data are performed as if they were a single operation; either all of the changes are performed or none of them are.)
  - MongoDB supports sharding, which is the ability to distribute the content of a collection among different nodes.
(b) Amazon Dynamo
  - The distributed key-value store supporting the management of information for several Amazon services.
  - Its main goal is to provide an incrementally scalable and highly available storage system.
(c) Google Bigtable
  - The distributed storage system designed to scale up to petabytes of data across thousands of servers.
  - Provides storage support for several Google applications.
  - Bigtable organizes the data storage in tables.
(d) Apache Cassandra
  - A distributed object store for managing large amounts of structured data.
  - It provides storage support for several very large Web applications such as Facebook, Digg, and Twitter.
(e) Hadoop HBase
  - The distributed database supporting the storage needs of the Hadoop distributed programming platform.
  - Its main goal is to offer real-time read/write operations on tables with billions of rows and millions of columns by leveraging clusters of commodity hardware.
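To make the document-store model concrete, the sketch below inserts and queries a schema-less document using the MongoDB Java sync driver. The connection string, database, collection, and field names are all illustrative, and a MongoDB server is assumed to be running locally.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;

// Minimal document-store example (MongoDB sync driver); all names are illustrative.
public class DocumentStoreSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("demo");
            MongoCollection<Document> users = db.getCollection("users");

            // A schema-less document: just a collection of key-value fields.
            Document alice = new Document("name", "alice")
                    .append("age", 30)
                    .append("interests", java.util.List.of("mapreduce", "nosql"));
            users.insertOne(alice);

            // Query by a field value; the driver also supports indexing on chosen fields.
            Document found = users.find(Filters.eq("name", "alice")).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}
```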
MapReduce programming model
- MapReduce is a programming platform introduced by Google for processing large quantities of data.
- It is both a processing technique and a programming model for distributed computing; its most widely used open-source implementation, Hadoop, is based on Java.
- The model contains two important tasks, Map and Reduce.
- Map takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs).
- Reduce takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
- As the name MapReduce implies, the reduce task is always performed after the map job.
- The model is expressed in the form of two functions:
    map(k1, v1) -> list(k2, v2)
    reduce(k2, list(v2)) -> list(v2)

How MapReduce works
• The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.
• Let's understand this with an example. A tiny input for a MapReduce word-count program might consist of lines such as "Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad". The walkthrough below uses a slightly larger example: counting how many times the words Apache, Hadoop, Class, and Track occur in six documents stored on a cluster of three servers.
• 1. First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping there is no communication between the nodes; they work independently.
• 2. Then, the map tasks create a <key, value> pair for every word. These pairs record how many times a word occurs: the word is the key and its count is the value. For example, one document contains three of the four words we are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in that map task's output look like this:
  <apache, 7>
  <class, 8>
  <track, 6>
• This process runs as parallel tasks on all nodes for all documents, and each map task produces its own output.
• 3. After input splitting and mapping complete, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel reduce tasks. The reduce tasks can run on the same nodes as the map tasks or on any other node.
• The shuffle step ensures that the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This process groups the values by key in the form of <key, value-list> pairs.
• 4. In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-list> to produce a final key-value pair. The reduce tasks also run at the same time and work independently.
• In our example, the reduce tasks produce the following individual results:
  <apache, 22>
  <hadoop, 20>
  <class, 18>
  <track, 22>
• 5. Finally, the data from the Reduce stage is grouped into a single output. MapReduce now shows how many times the words Apache, Hadoop, Class, and Track appeared in all the documents. The aggregate data is, by default, stored in HDFS.
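The walkthrough above is essentially the classic word-count job as run on Hadoop. Aneka provides its own MapReduce APIs through its .NET SDK, which are not reproduced here; instead, the sketch below shows the standard Hadoop Java version, which implements exactly the map(k1,v1) -> list(k2,v2) and reduce(k2,list(v2)) -> list(v2) functions defined earlier. Input and output paths are supplied on the command line and are assumed to point to HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic Hadoop word count: map emits <word, 1>, reduce sums the counts per word.
public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // map(k1, v1) -> list(k2, v2): one <word, 1> pair per token in the input line
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // reduce(k2, list(v2)) -> list(v2): sum all counts shuffled to this key
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input documents (e.g. in HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```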