Distributed Computing
Agenda
• Concurrency
– The capacity of the system to handle shared
resources can be increased by adding more
resources to the network.
• No global clock
– Programs coordinate only by sending messages through a
network, so they share no single notion of time.
• Independent failures
– The programs may not be able to detect whether
the network has failed or has become unusually
slow.
Resource
• Services
– A distinct part of a computer system that manages a
collection of related resources and presents their
functionality to users and applications.
– HTTP, Telnet, POP3...
• Server
– A running program (a process) on a networked computer
that accepts requests from programs running on other
computers to perform a service, and responds
appropriately.
– IIS, Apache...
• Client
– The requesting processes.
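The server and client roles above can be illustrated with a minimal sketch, assuming a TCP connection on the local machine. The function names (`serve_once`, `request`, `demo`) and the "echo" service are hypothetical and chosen only for illustration.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Server: accept one request and respond to it."""
    conn, _addr = listener.accept()
    with conn:
        payload = conn.recv(1024)          # small request, one read is enough here
        conn.sendall(b"echo: " + payload)  # the "service" is a simple echo


def request(port: int, payload: bytes) -> bytes:
    """Client: the requesting process."""
    with socket.create_connection(("127.0.0.1", port), timeout=5.0) as sock:
        sock.sendall(payload)
        return sock.recv(1024)


def demo() -> bytes:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5.0)               # do not wait forever for a client
    listener.bind(("127.0.0.1", 0))        # port 0: the OS picks a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()
    try:
        return request(port, b"hello")
    finally:
        server.join()
        listener.close()
```

Calling `demo()` runs the server in a background thread and the client in the main thread. A real server would loop over `accept()` and handle many clients.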
Challenges
• Heterogeneity
• Openness
• Security
• Scalability
• Failure handling
• Concurrency
• Transparency
Heterogeneity
What is Distributed Computing/System?
• Distributed computing
– A field of computing science that
studies distributed systems.
– The use of distributed systems to
solve computational problems.
• Distributed system
– There are several autonomous
computational entities, each of which has its
own local memory.
– The entities communicate with each other by
message passing.
– Operating System Concept
• The processors communicate with one
another through various communication
lines, such as high-speed buses or
telephone lines.
• Each processor has its own local memory.
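As a rough model of "local memory plus message passing", the sketch below runs one entity in a thread and talks to it only through queues. This is an assumption made for illustration: threads in one process do share memory, so the isolation here is by convention only, and the names `entity` and `run` are hypothetical.

```python
import queue
import threading


def entity(inbox: queue.Queue, outbox: queue.Queue) -> None:
    local_total = 0                # local memory: no other entity reads it
    while True:
        message = inbox.get()      # the only input is a received message
        if message is None:        # sentinel: sender has finished
            break
        local_total += message
    outbox.put(local_total)        # the only output is a sent message


def run(values):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=entity, args=(inbox, outbox))
    worker.start()
    for value in values:
        inbox.put(value)
    inbox.put(None)
    worker.join()
    return outbox.get()
```

For example, `run([1, 2, 3])` returns the sum computed by the entity from the messages it received.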
Why Distributed Computing?
• Data intensive
– Tasks that handle a large number of files or very large files. For example,
Facebook, LHC (Large Hadron Collider).
• Robustness
– No SPOF (Single Point Of Failure)
– Other nodes can re-execute a task that was running
on a failed node.
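Re-executing a task on another node can be sketched as follows. Nodes are modeled as plain callables that raise `ConnectionError` when they are down; that model, and every name in the block, is an assumption for illustration.

```python
from typing import Callable, Sequence

# Hypothetical node model: takes a task name, returns a result string.
Node = Callable[[str], str]


def run_with_failover(task: str, nodes: Sequence[Node]) -> str:
    """Try each node in turn until one completes the task."""
    for node in nodes:
        try:
            return node(task)
        except ConnectionError:
            continue               # this node failed, try the next one
    raise RuntimeError("task failed on every node")


def failed_node(task: str) -> str:
    raise ConnectionError("node is down")


def healthy_node(task: str) -> str:
    return f"done: {task}"
```

With `[failed_node, healthy_node]` the task still completes, which is the "no single point of failure" property in miniature. It only holds if the task is safe to run more than once.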
Common properties
– Fault tolerance
• When one or more nodes fail, the whole system keeps working, although
performance may degrade.
• Need to check the status of each node
– Each node plays a partial role
• Each computer has only a limited, incomplete view of the system. Each
computer may know only one part of the input.
– Resource sharing
• Each user can share the computing power and storage resource in the
system with other users
– Load Sharing
• Dispatching tasks across the nodes spreads the load over the
whole system.
– Easy to expand
• Adding nodes should take little time, ideally none.
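Load sharing can be sketched with a least-loaded dispatcher: each task goes to the node with the smallest accumulated cost so far. The task costs, node names, and the `dispatch` function are hypothetical; real schedulers use richer signals.

```python
import heapq


def dispatch(task_costs, node_names):
    """Assign each task index to the currently least-loaded node."""
    heap = [(0, name) for name in node_names]   # (current load, node)
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    for index, cost in enumerate(task_costs):
        load, name = heapq.heappop(heap)        # least-loaded node
        assignment[name].append(index)
        heapq.heappush(heap, (load + cost, name))
    return assignment
```

For costs `[5, 1, 1, 1]` and nodes `["a", "b"]`, node `a` takes the expensive task and node `b` takes the three cheap ones.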
Centralized vs. Distributed Computing
A distributed system is a collection of independent computers,
interconnected via a network, capable of collaborating on a task.
Distributed computing is computing performed in a distributed
system.
Distributed computing has become increasingly common due
to advances that have made both machines and networks
cheaper and faster.
[Figure: centralized computing (terminals attached to a mainframe computer) compared with distributed computing (workstations and network hosts joined by network links)]
A typical portion of the Internet
[Figure: intranets and ISPs connected by a backbone and a satellite link. Key: desktop computer, server, network link]
Computers in a Distributed System
Message Passing
• Robustness
– The system stays safe when one or several nodes fail
– Failed nodes should rejoin once they are back online, with
little or no further action needed
• Condor – restart daemon
– Failure detection
• When a node fails, the master node can detect it.
– E.g., heartbeat detection
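Heartbeat detection can be sketched as a table of "last heard from" times kept by the master. The class name, the timeout value, and passing the clock in explicitly are assumptions made to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Master-side record of when each node last sent a heartbeat."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen = {}        # node name -> time of last heartbeat

    def beat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def failed_nodes(self, now: float):
        """Nodes that have been silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Usage: call `beat` whenever a heartbeat arrives and poll `failed_nodes` periodically. A silent node may have crashed or may only be behind a slow network; the master cannot tell the two apart, so the timeout is a trade-off between fast detection and false alarms.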
• Network issue
– Bandwidth
• Consider bandwidth when a task must run on nodes that do not hold its
data, because the files have to be copied there from another node first.
• Scalability
– Easy to expand
• Hadoop – modify the configuration and start the daemon
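For the bandwidth point above, a back-of-the-envelope estimate of copy time is often enough to decide whether moving data to a node is worthwhile. The function below is a sketch that assumes an idealized link and ignores latency, protocol overhead, and contention.

```python
def transfer_seconds(size_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Lower bound on the time to copy size_bytes over the given link."""
    return size_bytes * 8 / bandwidth_bits_per_s


one_gigabyte = 10**9
print(transfer_seconds(one_gigabyte, 10**9))        # 1 Gbit/s link   -> 8.0
print(transfer_seconds(one_gigabyte, 100 * 10**6))  # 100 Mbit/s link -> 80.0
```

If the task itself runs for only a few seconds, an 80-second copy dominates, which is why it is often cheaper to send the task to the data.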
• Optimization
– What can we do if some nodes perform
poorly?
• Monitor the performance of each node
– Based on exchanged information such as heartbeats or logs
• Rerun the same task on another node
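One way to act on the monitoring data is to flag nodes whose reported progress is far below the typical value and rerun their tasks elsewhere. The progress figures, the 0.5 threshold, and the function name are hypothetical.

```python
from statistics import median


def find_stragglers(progress, threshold=0.5):
    """Return nodes whose progress is below threshold * median progress.

    progress maps node name -> fraction of its task completed (0.0 to 1.0),
    for example as reported in heartbeats.
    """
    typical = median(progress.values())
    return sorted(
        node for node, done in progress.items() if done < threshold * typical
    )
```

For `{"n1": 0.9, "n2": 0.8, "n3": 0.2}` the median is 0.8, so only `n3` falls below the 0.4 cutoff and becomes a candidate for rerunning on another node.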
Best Practice
• App/User
– should not need to know how nodes communicate
with each other
– User mobility – users can access the system from
a designated entry point or from anywhere
• Grid – UI (User interface)
• Condor – submit machine
Components of Distributed Software Systems
• Distributed systems
• Middleware
• Distributed applications
Middleware
• Client-server
• Group-oriented/Peer-to-Peer
– Applications that require reliability, scalability
Clients invoke individual servers
[Figure: clients invoke servers and receive results. Key: process, computer]
A service provided by multiple servers
[Figure: clients access a service made up of several servers]
Web proxy server
[Figure: clients reach web servers through a proxy server]
A distributed application based on peer processes
[Figure: peers 1 to N each run the application and share sharable objects]
Distributed applications
Transparency in Distributed Systems
Case study - Hadoop
• HDFS
– NameNode:
• manages the file system namespace and regulates access to files by
clients.
• determines the mapping of blocks to DataNodes.
– DataNode:
• manages the storage attached to the node it runs on
• stores CRC checksums
• sends heartbeats to the NameNode.
• Each file is split into blocks (chunks), and each block is stored on one
or more DataNodes.
– Secondary NameNode:
• responsible for merging the fsImage and the EditLog
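The block mapping and CRC ideas above can be sketched in a few lines. This is not HDFS code: the tiny block size, the round-robin placement, and the function names are assumptions for illustration, and real HDFS placement follows its own policy.

```python
import zlib


def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, datanodes, replication):
    """Build a NameNode-style map: block index -> checksum and holders.

    Placement is simple round-robin, an assumption made for this sketch.
    """
    block_map = {}
    for index, block in enumerate(blocks):
        holders = [
            datanodes[(index + r) % len(datanodes)] for r in range(replication)
        ]
        block_map[index] = {"crc": zlib.crc32(block), "nodes": holders}
    return block_map


def is_intact(block: bytes, expected_crc: int) -> bool:
    """DataNode-style check that a stored block has not been corrupted."""
    return zlib.crc32(block) == expected_crc
```

Usage: `place_blocks(split_into_blocks(b"abcdefghij", 4), ["dn1", "dn2", "dn3"], 2)` maps three blocks onto pairs of nodes, and `is_intact` detects a block whose bytes changed after the checksum was recorded.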
Case study - Hadoop
• Map-reduce Framework
– JobTracker
• Responsible for dispatching jobs to each TaskTracker
• Job management such as removing and scheduling jobs.
– TaskTracker
• Responsible for executing jobs. The TaskTracker usually launches another JVM
to execute the job.
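The slide names the components that run jobs but not the programming model itself. As an illustration only, the sketch below shows the map, shuffle, and reduce steps of a word count in a single process; it does not use Hadoop's API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a cluster, each call to `map_phase` or `reduce_phase` could run on a different node, which is the work the framework distributes.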
Case study - Hadoop
• Data replication
– Data are replicated to different nodes
• Reduces the possibility of data loss
• Data locality: a job is sent to the node that holds its data.
• Robustness
– One DataNode fails
• We can get the data from other nodes.
– Recovery
• Only the daemon needs to be restarted once the failed node is back online
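Replication and data locality combine naturally: prefer a live node that already holds a replica, and fall back to any live node otherwise. The function and node names below are hypothetical, and the fallback rule is an assumption.

```python
def choose_node(replica_holders, live_nodes):
    """Pick a node for a task that reads one block.

    Returns (node, is_local). is_local is False when no live node holds
    the data, so the block must be copied over the network first.
    """
    for node in replica_holders:
        if node in live_nodes:
            return node, True          # data-local execution
    if not live_nodes:
        raise RuntimeError("no live nodes available")
    return sorted(live_nodes)[0], False  # deterministic fallback
```

If `dn1` has failed but `dn2` also holds the block, `choose_node(["dn1", "dn2"], {"dn2", "dn3"})` still finds a data-local node, which is how replication supports both robustness and locality.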
Case study - Hadoop
• Resource sharing
– Each Hadoop user can share computing power
and storage space with other Hadoop users.
• Synchronization
– No synchronization
• Failure detection
– The NameNode/JobTracker detects when a
DataNode/TaskTracker fails
• Based on heartbeat