
CS6703 GRID AND CLOUD COMPUTING

Unit 4

Dr Gnanasekaran Thangavel
Professor and Head
Faculty of Information Technology
R M K College of Engineering and Technology
UNIT IV PROGRAMMING MODEL
Open source grid middleware packages – Globus Toolkit
(GT4) Architecture , Configuration – Usage of Globus –
Main components and Programming model -
Introduction to Hadoop Framework - Mapreduce, Input
splitting, map and reduce functions, specifying input and
output parameters, configuring and running a job – Design
of Hadoop file system, HDFS concepts, command line and
java interface, dataflow of File read & File write.

2 Dr Gnanasekaran Thangavel 09/20/2022


Open source grid middleware packages

The Open Grid Forum and the Object Management Group are two well-established organizations behind the grid standards.
Middleware is the software layer that connects software components. It lies between the operating system and the applications.
Grid middleware is a specially designed layer between hardware and software that enables the sharing of heterogeneous resources and manages the virtual organizations created around the grid.
Popular grid middleware packages include:
1. BOINC -Berkeley Open Infrastructure for Network Computing.
2. UNICORE - Middleware developed by the German grid computing community.
3. Globus (GT4) - A middleware library jointly developed by Argonne National Lab., Univ.
of Chicago, and USC Information Science Institute, funded by DARPA, NSF, and NIH.
4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a middleware library
developed by 20 top universities in China as part of the ChinaGrid Project
3 Dr Gnanasekaran Thangavel 09/20/2022
Open source grid middleware packages (continued)

5. Condor-G - Originally developed at the Univ. of Wisconsin for general distributed computing, and later extended to Condor-G for grid job management.
6. Sun Grid Engine (SGE) - Developed by Sun Microsystems for business grid applications. Applied to private grids and local clusters within enterprises or campuses.
7. gLite - Born from the collaborative efforts of more than 80 people in 12 different academic and industrial research centers as part of the EGEE Project, gLite provides a framework for building grid applications that tap into the power of distributed computing and storage resources across the Internet.
4 Dr Gnanasekaran Thangavel 09/20/2022
The Globus Toolkit Architecture (GT4)
The Globus Toolkit is an open middleware library for the grid computing communities. These open source software libraries support many operational grids and their applications on an international basis.
The toolkit addresses common problems and issues related to grid resource discovery, management, communication, security, fault detection, and portability. The software itself provides a variety of components and capabilities.
The library includes a rich set of service implementations. The implemented software supports grid infrastructure management, provides tools for building new web services in Java, C, and Python, builds a powerful standards-based security infrastructure and client APIs (in different languages), and offers comprehensive command-line programs for accessing various grid services.
The Globus Toolkit was initially motivated by a desire to remove obstacles that prevent seamless collaboration, and thus sharing of resources and services, in scientific and engineering applications. The shared resources can be computers, storage, data, services, networks, science instruments (e.g., sensors), and so on. The Globus library version GT4 is shown conceptually in the following figure.

5 Dr Gnanasekaran Thangavel 09/20/2022


The Globus Toolkit

6 Dr Gnanasekaran Thangavel 09/20/2022


The Globus Toolkit
GT4 offers the middle-level core services in grid applications.
The high-level services and tools, such as MPI, Condor-G, and Nimrod/G, are developed by third parties for general-purpose distributed computing applications.
The local services, such as LSF, TCP, Linux, and Condor, are at the bottom level and are fundamental tools supplied by other developers.
As a de facto standard in grid middleware, GT4 is based on industry-standard web service technologies.
7 Dr Gnanasekaran Thangavel 09/20/2022
Functionalities of GT4
Global Resource Allocation Manager (GRAM) - Grid Resource Access and Management (HTTP-based)
Communication (Nexus) - Unicast and multicast communication
Grid Security Infrastructure (GSI) - Authentication and related security services
Monitoring and Discovery Service (MDS) - Distributed access to structure and state information
Health and Status (HBM) - Heartbeat monitoring of system components
Global Access of Secondary Storage (GASS) - Grid access of data in remote secondary storage
Grid File Transfer (GridFTP) - Inter-node fast file transfer

8 Dr Gnanasekaran Thangavel 09/20/2022


Globus Job Workflow

9 Dr Gnanasekaran Thangavel 09/20/2022


Globus Job Workflow
A typical job execution sequence proceeds as follows:
1. The user delegates his credentials to a delegation service.
2. The user submits a job request to GRAM with the delegation identifier as a parameter.
3. GRAM parses the request, retrieves the user proxy certificate from the delegation service, and then acts on behalf of the user.
4. GRAM sends a transfer request to the RFT (Reliable File Transfer) service, which uses GridFTP to bring in the necessary files.
5. GRAM invokes a local scheduler via a GRAM adapter, and the SEG (Scheduler Event Generator) initiates a set of user jobs.
6. The local scheduler reports the job state to the SEG. Once the job is complete, GRAM uses RFT and GridFTP to stage out the resultant files. The grid monitors the progress of these operations and sends the user a notification.

10 Dr Gnanasekaran Thangavel 09/20/2022
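A hedged command-line sketch of this workflow (assuming a standard GT4 client installation and a reachable WS GRAM container; the executable and the exact flags may differ by installation):

    grid-proxy-init                            # create a proxy certificate from the user's credentials (delegation)
    globusrun-ws -submit -s -c /bin/hostname   # submit a simple job to WS GRAM and stream its output back

grid-proxy-init establishes the delegated credential; globusrun-ws then submits the job request that GRAM parses and forwards to the local scheduler.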


Client-Globus Interactions

There are strong interactions between provider programs and user code. GT4 makes heavy use of industry-standard web service protocols and mechanisms for service description, discovery, access, authentication, and authorization.
GT4 makes extensive use of Java, C, and Python for writing user code. Web service mechanisms define specific interfaces for grid computing.
Web services provide flexible, extensible, and widely adopted XML-based interfaces.

11 Dr Gnanasekaran Thangavel 09/20/2022


Data Management Using GT4
Grid applications often need to provide access to and/or integrate large quantities of data at multiple sites. The GT4 tools can be used individually or in conjunction with other tools to develop interesting solutions for efficient data access. The following list briefly introduces these GT4 tools:
1. GridFTP supports reliable, secure, and fast memory-to-memory and disk-to-disk data movement over high-bandwidth WANs. Based on the popular FTP protocol for Internet file transfer, GridFTP adds features such as parallel data transfer, third-party data transfer, and striped data transfer. In addition, GridFTP benefits from the strong Globus Security Infrastructure for securing data channels with authentication and reusability. It has been reported that the grid has achieved 27 Gbit/second end-to-end transfer speeds over some WANs.
2. RFT (Reliable File Transfer) provides reliable management of multiple GridFTP transfers. It has been used to orchestrate the transfer of millions of files among many sites simultaneously.
3. RLS (Replica Location Service) is a scalable system for maintaining and providing access to information about the location of replicated files and data sets.
4. OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) tools were developed by the UK eScience program and provide access to relational and XML databases.
12 Dr Gnanasekaran Thangavel 09/20/2022
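As an illustration of GridFTP in practice, a parallel third-party transfer can be driven with the globus-url-copy client. This is only a sketch; the endpoint URLs below are hypothetical, and -p sets the number of parallel data streams:

    globus-url-copy -p 4 \
        gsiftp://source.example.org/data/input.dat \
        gsiftp://dest.example.org/data/input.dat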
Introduction to Hadoop Framework
Hadoop is an Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Hadoop offers a software platform that was originally developed by a Yahoo! group. The package enables users to write and run applications over vast amounts of distributed data.
Users can easily scale Hadoop to store and process petabytes of data in the web space.
Hadoop is economical in that it comes with an open source version of MapReduce that minimizes overhead in task spawning and massive data communication.
It is efficient, as it processes data with a high degree of parallelism across a large number of commodity nodes, and it is reliable in that it automatically keeps multiple data copies to facilitate redeployment of computing tasks upon unexpected system failures.
13 Dr Gnanasekaran Thangavel 09/20/2022
Hadoop
• Hadoop: an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data

14 Dr Gnanasekaran Thangavel 09/20/2022
Hadoop Framework Tools

15 Dr Gnanasekaran Thangavel 09/20/2022


Hadoop’s Architecture
Distributed, with some centralization

The main nodes of the cluster are where most of the computational power and storage of the system lies

Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store needed blocks as close as possible

The central control node runs the NameNode to keep track of HDFS directories and files, and the JobTracker to dispatch compute tasks to TaskTrackers

Written in Java; also supports Python and Ruby

16 Dr Gnanasekaran Thangavel 09/20/2022


Hadoop’s Architecture
• Hadoop Distributed File System (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB)
• Location awareness of DataNodes in the network

17 Dr Gnanasekaran Thangavel 09/20/2022
Hadoop’s Architecture

NameNode:
Stores metadata for the files, like the directory structure of a typical FS.
The server holding the NameNode instance is quite crucial, as there is only one.
Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks
or file-streams, only metadata.
Handles creation of more replica blocks when necessary after a DataNode failure
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
18 Dr Gnanasekaran Thangavel 09/20/2022
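The division of labor between the NameNode (metadata) and the DataNodes (blocks) is hidden behind the HDFS Java interface. A minimal sketch of reading a file through org.apache.hadoop.fs.FileSystem follows; the class name and the HDFS URI are illustrative only:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                      // e.g. hdfs://namenode:9000/user/demo/file.txt (hypothetical)
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml settings
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    InputStream in = null;
    try {
      // The client asks the NameNode for block locations, then streams
      // the blocks directly from the DataNodes that hold them.
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}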
Hadoop’s Architecture

19 Dr Gnanasekaran Thangavel 09/20/2022


Hadoop’s Architecture
MapReduce Engine:
JobTracker & TaskTracker
The JobTracker splits up data into smaller tasks ("Map") and sends them to the TaskTracker process in each node
The TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce"), or requests new jobs
None of these components is necessarily limited to using HDFS
Many other distributed file systems with quite different architectures work
Many other software packages besides Hadoop's MapReduce platform make use of HDFS

20 Dr Gnanasekaran Thangavel 09/20/2022
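As a concrete sketch of the map and reduce functions, here is the classic word count written against the old org.apache.hadoop.mapred API that the rest of this unit walks through. The class names are illustrative, not part of Hadoop itself:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: for every input line, emit (word, 1) for each token.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);
    }
  }
}

// Reduce: sum all the counts emitted for the same word.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The JobTracker runs many map tasks in parallel over the input splits and then routes each word's counts to a reduce task.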


Hadoop in the Wild
Hadoop is in use at many organizations that handle big data:
Yahoo!
Facebook
Amazon
Netflix
Etc…

Some examples of scale:

Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search

Facebook's Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at about ½ PB/day (Nov 2012)

21 Dr Gnanasekaran Thangavel 09/20/2022


Three main applications of Hadoop
Advertisement (Mining user behavior to generate recommendations)

Searches (group related documents)

Security (search for uncommon patterns)

22 Dr Gnanasekaran Thangavel 09/20/2022


Hadoop Highlights
Distributed File System
Fault Tolerance
Open Data Format
Flexible Schema
Queryable Database
Why use Hadoop?
Need to process multi-petabyte datasets
Data may not have a strict schema
Expensive to build reliability into each application
Nodes fail every day
Need a common infrastructure
Very large distributed file system
Assumes commodity hardware
Optimized for batch processing
Runs on heterogeneous OSes
23 Dr Gnanasekaran Thangavel 09/20/2022
DataNode

A Block Server
Stores data in the local file system
Stores metadata of a block - checksum
Serves data and metadata to clients
Block Report
Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
Forwards data to other specified DataNodes

24 Dr Gnanasekaran Thangavel 09/20/2022


Block Placement

Replication Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
Clients read from nearest replica

25 Dr Gnanasekaran Thangavel 09/20/2022
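The replication factor behind this placement strategy is configurable, both cluster-wide (dfs.replication in hdfs-site.xml) and per file. A minimal sketch of changing it from client code; the file path is hypothetical:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                 // an existing HDFS file, e.g. hdfs://namenode:9000/user/demo/data.txt

    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");     // default replication for files created by this client

    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    // Raise the replication factor of an existing file; the NameNode then
    // schedules the additional replicas on suitable DataNodes.
    fs.setReplication(new Path(uri), (short) 5);
  }
}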


Data Correctness

Use checksums to validate data - CRC32

File Creation
The client computes a checksum per 512 bytes
The DataNode stores the checksums
File Access
The client retrieves the data and checksum from the DataNode
If validation fails, the client tries other replicas

26 Dr Gnanasekaran Thangavel 09/20/2022
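To make the checksum idea concrete, the sketch below computes a CRC32 value for each 512-byte chunk of a buffer, mirroring what the HDFS client does before handing data to a DataNode. This is an illustration only, not HDFS source code:

import java.util.zip.CRC32;

public class ChunkChecksum {
  public static void main(String[] args) {
    byte[] data = new byte[2048];          // stand-in for file data
    int chunkSize = 512;                   // HDFS checksums data in 512-byte chunks

    CRC32 crc = new CRC32();
    for (int off = 0; off < data.length; off += chunkSize) {
      int len = Math.min(chunkSize, data.length - off);
      crc.reset();
      crc.update(data, off, len);
      System.out.printf("chunk at offset %d: crc32 = %d%n", off, crc.getValue());
    }
  }
}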


Data Pipelining

The client retrieves a list of DataNodes on which to place replicas of a block
The client writes the block to the first DataNode
The first DataNode forwards the data to the next DataNode in the pipeline
When all replicas are written, the client moves on to write the next block in the file

27 Dr Gnanasekaran Thangavel 09/20/2022
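From the application's point of view the pipeline is transparent: the client simply writes to an output stream obtained from the FileSystem API, and HDFS forwards each block down the replica pipeline. A minimal sketch (paths are illustrative):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];             // local file to copy
    String dst = args[1];                  // e.g. hdfs://namenode:9000/user/demo/out.txt (hypothetical)

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    // create() returns a stream; under the hood the client writes each block
    // to the first DataNode, which forwards it along the replica pipeline.
    OutputStream out = fs.create(new Path(dst));
    IOUtils.copyBytes(in, out, 4096, true);   // true = close both streams when done
  }
}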


MapReduce Usage

Log processing
Web search indexing
Ad-hoc queries

28 Dr Gnanasekaran Thangavel 09/20/2022


MapReduce Process (org.apache.hadoop.mapred)
JobClient
Submits the job
JobTracker
Manages and schedules the job, splits the job into tasks
TaskTracker
Starts and monitors task execution
Child
The process that actually executes the task

29 Dr Gnanasekaran Thangavel 09/20/2022
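The JobClient side of this process is what user code drives when configuring and running a job. A hedged sketch of a driver that specifies the input and output parameters and submits a word-count job; it assumes the WordCountMapper and WordCountReducer classes sketched earlier in this unit:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Key/value types produced by the reducer.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Map and reduce implementations (from the earlier sketch).
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // How input is split into records and how output is written.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output paths are taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // runJob() submits the job and polls the JobTracker until completion;
    // JobClient.submitJob() would instead return immediately with a handle.
    JobClient.runJob(conf);
  }
}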


Inter Process Communication
Protocols
JobSubmissionProtocol
JobClient <-------------> JobTracker
InterTrackerProtocol
TaskTracker <------------> JobTracker
TaskUmbilicalProtocol
TaskTracker <-------------> Child
The JobTracker implements both the JobSubmissionProtocol and the InterTrackerProtocol, and acts as the server in both IPC channels.
The TaskTracker implements the TaskUmbilicalProtocol; the Child gets task information and reports task status through it.
30 Dr Gnanasekaran Thangavel 09/20/2022
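These protocols are ordinary Java interfaces served over Hadoop's own IPC layer. The toy example below is not Hadoop source; the names are invented, and it is sketched against the 0.20-era org.apache.hadoop.ipc API: define a VersionedProtocol, export an implementation with RPC.getServer, and call it through RPC.getProxy:

import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.ipc.VersionedProtocol;

// A toy protocol in the spirit of TaskUmbilicalProtocol: the interface is the
// contract, the server exports an implementation, the client talks to a proxy.
interface PingProtocol extends VersionedProtocol {
  long versionID = 1L;
  String ping(String message) throws IOException;
}

public class PingServer implements PingProtocol {
  public String ping(String message) { return "pong: " + message; }

  public long getProtocolVersion(String protocol, long clientVersion) {
    return versionID;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Export the implementation on localhost:9900 (hypothetical address/port).
    Server server = RPC.getServer(new PingServer(), "localhost", 9900, conf);
    server.start();

    // Client side: obtain a proxy and make the remote call.
    PingProtocol proxy = (PingProtocol) RPC.getProxy(
        PingProtocol.class, PingProtocol.versionID,
        new InetSocketAddress("localhost", 9900), conf);
    System.out.println(proxy.ping("hello"));
  }
}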
JobClient.submitJob - 1

Check input and output, e.g. check whether the output directory already exists
job.getInputFormat().validateInput(job);
job.getOutputFormat().checkOutputSpecs(fs, job);
Get InputSplits, sort, and write output to HDFS
InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
writeSplitsFile(splits, out); // out is $SYSTEMDIR/$JOBID/job.split

31 Dr Gnanasekaran Thangavel 09/20/2022


JobClient.submitJob - 2

The JAR file and configuration file will be uploaded to the HDFS system directory
job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
JobStatus status = jobSubmitClient.submitJob(jobId);
This is an RPC invocation; jobSubmitClient is a proxy created during initialization

32 Dr Gnanasekaran Thangavel 09/20/2022


Job initialization on JobTracker - 1

JobTracker.submitJob(jobID) <-- receive RPC invocation request


JobInProgress job = new JobInProgress(jobId, this, this.conf)
Add the job into Job Queue
jobs.put(job.getProfile().getJobId(), job);
jobsByPriority.add(job);
jobInitQueue.add(job);

33 Dr Gnanasekaran Thangavel 09/20/2022


Job initialization on JobTracker - 2

Sort by priority
resortPriority();
Compare the JobPriority first, then compare the JobSubmissionTime
Wake the JobInitThread
jobInitQueue.notifyAll();
job = jobInitQueue.remove(0);
job.initTasks();

34 Dr Gnanasekaran Thangavel 09/20/2022


JobTracker Task Scheduling - 1

Task getNewTaskForTaskTracker(String taskTracker)
Compute the maximum number of tasks that can be running on taskTracker
int maxCurrentMapTasks = tts.getMaxMapTasks();
int maxMapLoad = Math.min(maxCurrentMapTasks,
    (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));

35 Dr Gnanasekaran Thangavel 09/20/2022


JobTracker Task Scheduling - 2

int numMaps = tts.countMapTasks(); // number of running tasks
If numMaps < maxMapLoad, more tasks can be allocated: based on priority, pick the first job from the jobsByPriority queue, create a task, and return it to the TaskTracker
Task t = job.obtainNewMapTask(tts, numTaskTrackers);

36 Dr Gnanasekaran Thangavel 09/20/2022


Start TaskTracker - 1

initialize()
Remove the original local directory
RPC initialization
TaskReportServer = RPC.getServer(this, bindAddress, tmpPort, max, false, this, fConf);
InterTrackerProtocol jobClient = (InterTrackerProtocol) RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, this.fConf);

37 Dr Gnanasekaran Thangavel 09/20/2022


Run Task on TaskTracker - 1

TaskTracker.localizeJob(TaskInProgress tip);
launchTasksForJob(tip, new JobConf(rjob.jobFile));
tip.launchTask(); // TaskTracker.TaskInProgress
tip.localizeTask(task); // create folder, symbol link
runner = task.createRunner(TaskTracker.this);
runner.start(); // start TaskRunner thread

38 Dr Gnanasekaran Thangavel 09/20/2022


Run Task on TaskTracker - 2

TaskRunner.run();
Configure child process’ jvm parameters, i.e. classpath,
taskid, taskReportServer’s address & port
Start Child Process
 runChild(wrappedCommand, workDir, taskid);

39 Dr Gnanasekaran Thangavel 09/20/2022


Child.main()

Create RPC Proxy, and execute RPC invocation


TaskUmbilicalProtocol umbilical = (TaskUmbilicalProtocol)
RPC.getProxy(TaskUmbilicalProtocol.class,
TaskUmbilicalProtocol.versionID, address, defaultConf);
Task task = umbilical.getTask(taskid);
task.run(); // mapTask / reduceTask.run

40 Dr Gnanasekaran Thangavel 09/20/2022


Finish Job - 1

Child
task.done(umbilical);
RPC call: umbilical.done(taskId, shouldBePromoted)

TaskTracker
done(taskId, shouldPromote)
TaskInProgress tip = tasks.get(taskid);
tip.reportDone(shouldPromote);
taskStatus.setRunState(TaskStatus.State.SUCCEEDED)

41 Dr Gnanasekaran Thangavel 09/20/2022


Finish Job - 2

JobTracker
TaskStatus report: status.getTaskReports();
TaskInProgress tip = taskidToTIPMap.get(taskId);
JobInProgress update JobStatus
 tip.getJob().updateTaskStatus(tip, report, myMetrics);
 One task of current job is finished
 completedTask(tip, taskStatus, metrics);
 If (this.status.getRunState() == JobStatus.RUNNING && allDone)
{this.status.setRunState(JobStatus.SUCCEEDED)}

42 Dr Gnanasekaran Thangavel 09/20/2022


Demo

Word Count
hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
Hive
hive -f pagerank.hive
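Before and after running the word-count example above, the HDFS command-line interface can stage the input and inspect the output; the directory names here are only placeholders:

hadoop fs -mkdir /user/demo/wcinput
hadoop fs -put local.txt /user/demo/wcinput/
hadoop jar hadoop-0.20.2-examples.jar wordcount /user/demo/wcinput /user/demo/wcoutput
hadoop fs -ls /user/demo/wcoutput
hadoop fs -cat /user/demo/wcoutput/part-*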

43 Dr Gnanasekaran Thangavel 09/20/2022


References
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, “Distributed and Cloud Computing: From Parallel Processing to the Internet of Things”, First Edition, Morgan Kaufmann Publishers, an imprint of Elsevier, 2012.
2. www.csee.usf.edu/~anda/CIS6930-S11/notes/hadoop.ppt
3. www.ics.uci.edu/~cs237/lectures/cloudvirtualization/Hadoop.pptx

44 Dr Gnanasekaran Thangavel 09/20/2022


Other presentations
http://www.slideshare.net/drgst/presentations

45 Dr Gnanasekaran Thangavel 09/20/2022


Thank You

Questions and Comments?

46 Dr Gnanasekaran Thangavel 09/20/2022
