CS6703 Grid and Cloud Computing
Unit 4
Dr Gnanasekaran Thangavel
Professor and Head
Faculty of Information Technology
R M K College of Engineering and Technology
UNIT IV PROGRAMMING MODEL
Open source grid middleware packages – Globus Toolkit (GT4) Architecture,
Configuration – Usage of Globus – Main components and Programming model –
Introduction to Hadoop Framework – MapReduce, Input splitting, map and reduce
functions, specifying input and output parameters, configuring and running a job –
Design of Hadoop file system, HDFS concepts, command line and Java interface,
data flow of file read and file write.
The Open Grid Forum and the Object Management Group are two well-established organizations
behind the standards.
Middleware is the software layer that connects software components. It lies between the
operating system and the applications.
Grid middleware is a specially designed layer between hardware and software that enables the
sharing of heterogeneous resources and manages the virtual organizations created around the
grid.
The popular grid middleware packages are:
1. BOINC -Berkeley Open Infrastructure for Network Computing.
2. UNICORE - Middleware developed by the German grid computing community.
3. Globus (GT4) - A middleware library jointly developed by Argonne National Lab., Univ.
of Chicago, and USC Information Science Institute, funded by DARPA, NSF, and NIH.
4. CGSP in ChinaGrid - The CGSP (ChinaGrid Support Platform) is a middleware library
developed by 20 top universities in China as part of the ChinaGrid Project
Open source grid middleware packages (continued)
There are strong interactions between provider programs and user code. GT4 makes heavy use of industry-
standard web service protocols and mechanisms for service description, discovery, access, authentication,
and authorization.
GT4 makes extensive use of Java, C, and Python for writing user code. Web service mechanisms define
specific interfaces for grid computing.
Web services provide flexible, extensible, and widely adopted XML-based interfaces.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and unstructured data
• Simple programming models
• Fault-tolerance
The main nodes of the cluster are where most of the computational power and storage of the
system lie.
Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a
DataNode to store the needed blocks as close as possible.
The central control node runs the NameNode to keep track of HDFS directories and files, and
the JobTracker to dispatch compute tasks to TaskTrackers.
NameNode:
Stores metadata for the files, like the directory structure of a typical FS.
The server holding the NameNode instance is quite crucial, as there is only one.
Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks
or file-streams, only metadata.
Handles creation of more replica blocks when necessary after a DataNode failure
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
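To make the NameNode/DataNode split concrete, the write path can be sketched with the standard HDFS Java interface (FileSystem). This is a minimal sketch, not part of the original slides: the NameNode URI hdfs://namenode:9000, the path /user/demo/hello.txt, and the class name HdfsWriteExample are all placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; in practice it comes from the cluster configuration.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    // create() contacts the NameNode to allocate blocks; the data itself is
    // streamed to a pipeline of DataNodes chosen by the replication policy.
    FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"), true); // true = overwrite
    out.writeUTF("Hello HDFS");
    out.close();   // flushes the last packets and commits the file at the NameNode

    fs.close();
  }
}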
Hadoop’s Architecture
Facebook's Hadoop cluster hosted more than 100 PB of data (July 2012), growing at roughly 0.5 PB/day
(Nov 2012).
A Block Server
Stores data in the local file system
Stores metadata of a block (e.g., checksum)
Serves data and metadata to clients
Block Report
Periodically sends a report of all existing
blocks to NameNode
Facilitate Pipelining of Data
Forwards data to other specified DataNodes
Replication Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
Clients read from nearest replica
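Since clients read from the nearest replica, the matching read path can be sketched with the same Java interface. As with the write sketch above, the URI, path, and class name are placeholders and not from the original slides.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    // open() asks the NameNode for the block locations; the returned stream
    // then reads each block from the closest DataNode holding a replica.
    InputStream in = null;
    try {
      in = fs.open(new Path("/user/demo/hello.txt"));
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy file contents to stdout
    } finally {
      IOUtils.closeStream(in);
      fs.close();
    }
  }
}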
Log processing
Web search indexing
Ad-hoc queries
TaskUmbilicalProtocol
TaskTracker <-------------> Child
JobTracker implements both protocols and works as the server in both IPCs.
TaskTracker implements the TaskUmbilicalProtocol; the Child
gets task information and reports task status through it.
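As a rough illustration of what such an IPC interface looks like, here is a simplified sketch in the style of Hadoop's versioned protocols. The interface and method names below are hypothetical, invented for illustration; the real TaskUmbilicalProtocol has more methods and richer argument types.

import java.io.IOException;
import org.apache.hadoop.ipc.VersionedProtocol;

// Hypothetical child-to-TaskTracker protocol (illustrative only).
public interface ChildProtocolSketch extends VersionedProtocol {
  long versionID = 1L;

  // Child asks the TaskTracker which task it should run (serialized task info).
  String getTask(String taskId) throws IOException;

  // Child periodically reports progress back to the TaskTracker.
  void statusUpdate(String taskId, float progress) throws IOException;

  // Child signals that its task has finished.
  void done(String taskId) throws IOException;
}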
JobClient.submitJob - 1
Sort by priority
resortPriority();
compare the JobPriority first, then compare the JobSubmissionTime
Wake JobInitThread
jobInitQueue.notifyAll();
job = jobInitQueue.remove(0);
job.initTasks();
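The queue/notify pattern listed above can be illustrated with a small, self-contained sketch. This is not the real JobTracker code; the class names and the assumption that a larger priority value means higher priority are mine.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class JobInitQueueSketch {
  static class PendingJob {
    final int priority;          // assumption: larger value = higher priority
    final long submissionTime;
    PendingJob(int priority, long submissionTime) {
      this.priority = priority;
      this.submissionTime = submissionTime;
    }
    void initTasks() { /* split input, create map/reduce tasks */ }
  }

  private final List<PendingJob> jobInitQueue = new ArrayList<>();

  // Producer side: called when a new job is submitted.
  public synchronized void submit(PendingJob job) {
    jobInitQueue.add(job);
    // Sort by priority first, then by submission time (the "resortPriority" step).
    jobInitQueue.sort(Comparator
        .comparingInt((PendingJob j) -> -j.priority)
        .thenComparingLong(j -> j.submissionTime));
    notifyAll();   // wake the JobInitThread
  }

  // Consumer side: body of the JobInitThread.
  public void initLoop() throws InterruptedException {
    while (true) {
      PendingJob job;
      synchronized (this) {
        while (jobInitQueue.isEmpty()) {
          wait();
        }
        job = jobInitQueue.remove(0);
      }
      job.initTasks();
    }
  }
}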
initialize()
Remove original local directory
RPC initialization
TaskReportServer = RPC.getServer(this, bindAddress, tmpPort, max,
false, this, fConf);
InterTrackerProtocol jobClient = (InterTrackerProtocol)
RPC.waitForProxy(InterTrackerProtocol.class,
InterTrackerProtocol.versionID, jobTrackAddr, this.fConf);
TaskTracker.localizeJob(TaskInProgress tip);
launchTasksForJob(tip, new JobConf(rjob.jobFile));
tip.launchTask(); // TaskTracker.TaskInProgress
tip.localizeTask(task); // create folder, symbolic link
runner = task.createRunner(TaskTracker.this);
runner.start(); // start TaskRunner thread
TaskRunner.run();
Configure the child process's JVM parameters, i.e. classpath,
task id, taskReportServer's address & port
Start Child Process
runChild(wrappedCommand, workDir, taskid);
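The TaskRunner steps above boil down to building a java command line for the child JVM and launching it as a separate process. A very rough sketch of that pattern follows; the class names (ChildLauncherSketch, org.example.ChildMain) are hypothetical and this is not the real TaskRunner code.

import java.io.File;
import java.util.Arrays;
import java.util.List;

public class ChildLauncherSketch {
  public static int launchChild(String classpath, String taskId,
                                String umbilicalHost, int umbilicalPort,
                                File workDir) throws Exception {
    List<String> command = Arrays.asList(
        "java",
        "-classpath", classpath,                       // job jar + framework classes
        "org.example.ChildMain",                       // hypothetical child entry point
        umbilicalHost, String.valueOf(umbilicalPort),  // taskReportServer's address & port
        taskId);
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.directory(workDir);                             // task-local working directory
    pb.inheritIO();                                    // forward child stdout/stderr
    Process child = pb.start();
    return child.waitFor();                            // exit code of the child JVM
  }
}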
Child
task.done(umbilical);
RPC call: umbilical.done(taskId, shouldBePromoted)
TaskTracker
done(taskId, shouldPromote)
TaskInProgress tip = tasks.get(taskid);
tip.reportDone(shouldPromote);
taskStatus.setRunState(TaskStatus.State.SUCCEEDED)
JobTracker
TaskStatus report: status.getTaskReports();
TaskInProgress tip = taskidToTIPMap.get(taskId);
JobInProgress update JobStatus
tip.getJob().updateTaskStatus(tip, report, myMetrics);
One task of current job is finished
completedTask(tip, taskStatus, metrics);
if (this.status.getRunState() == JobStatus.RUNNING && allDone)
  { this.status.setRunState(JobStatus.SUCCEEDED); }
Word Count
hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
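For reference, here is a minimal WordCount sketch in the spirit of the bundled example, written against the org.apache.hadoop.mapreduce API that ships with Hadoop 0.20. The class names are illustrative and the bundled example differs in detail.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure input/output paths, mapper, combiner, reducer, then run the job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");            // Job.getInstance() in newer releases
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // <input dir>
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // <output dir>
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}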
Hive
hive -f pagerank.hive