BIG DATA ANALYTICS

DISTRIBUTED FILE SYSTEMS
+ The main purpose of a Distributed File System (DFS) is to allow users of physically
distributed systems to share their data and resources exactly as they do local ones.
+ The performance and reliability of such access should be comparable to that of files stored
locally.
+ Recent advances in the bandwidth of switched local networks and in disk
organization have led to high-performance, highly scalable file systems.
FEATURES OF A DISTRIBUTED FILE SYSTEM
+ Transparency
+ Concurrent Updates
+ Replication
+ Fault Tolerance
+ Consistency
+ Platform Independence
+ Security
+ Efficiency

TRANSPARENCY
+ The illusion that all files are similar. Includes:
+ Access transparency - a single set of operations; clients that work on local files can work with remote files
+ Location transparency - clients see a uniform name space; files can be relocated without changing their path names
+ Mobility transparency - files can be moved without modifying programs or changing system tables
+ Performance transparency - within limits, local and remote file access meet performance standards
+ Scaling transparency - increased loads do not degrade performance significantly; capacity can be expanded

CONCURRENT UPDATES
+ Changes to a file from one client should not interfere with changes from other clients,
even if the changes happen at the same time
+ Solutions often include:
+ File- or record-level locking

REPLICATION
+ A file may have several copies of its data at different locations
+ Often for performance reasons
+ Requires updating the other copies when one copy is changed
+ Simple solution:
+ Change the master copy and periodically refresh the other copies
+ More complicated solution:
+ Multiple copies can be updated independently at the same time; this needs finer-grained
refresh and/or merge

FAULT TOLERANCE
+ Function even when clients or servers fail
+ Detect, report, and correct faults that occur
+ Solutions often include:
+ Redundant copies of data, redundant hardware, backups, transaction logs and other measures
+ Stateless servers
+ Idempotent operations

CONSISTENCY
+ Data must always be complete, current, and correct
+ A file seen by one process looks the same for all processes accessing it
+ Consistency is a special concern whenever data is duplicated
+ Solutions often include:
+ Timestamps and ownership information

PLATFORM INDEPENDENCE
+ Access files even though the hardware and OS are completely different in design,
architecture and functioning, from different vendors
+ Solutions often include:
+ A flexible communication protocol between clients and servers

EFFICIENCY
+ Overall, we want the same power and generality as local file systems
+ In the early days, the goal was to share an "expensive" resource: the disk
+ Now, the goal is convenient access to remotely stored files

SECURITY
+ File systems must be protected against unauthorized access, data
corruption, loss and other threats
+ Solutions include:
+ Access control mechanisms (ownership, permissions)
+ Encryption of commands or data to prevent "sniffing"

HADOOP

+ Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.
+ Open-source data management with scale-out storage and distributed processing.

HISTORY
Timeline (from the history figure):
+ Development started as a Lucene sub-project; Google published the GFS and MapReduce papers
+ Doug Cutting added DFS & MapReduce; Yahoo! started working on Hadoop
+ Hadoop became a top-level Apache project and ran on a 1000-node cluster
+ Hadoop defeated a supercomputer (sorting benchmark); Doug Cutting joined Cloudera
+ Facebook contributed SQL support for Hadoop
+ Apache released the first stable version, 1.0
+ Hadoop 2.0, which contains YARN, was released, followed by Hadoop 3.0
+ Hadoop 3.3.4 is the latest version of Hadoop covered here

WHO USES HADOOP?
+ Amazon
+ Facebook
+ Google
+ New York Times
+ Veoh
+ Yahoo!
+ ... many more
(Figure: a Hadoop cluster with a Job Tracker and Name Node running on admin nodes.)

Hadoop is a system for large-scale data processing. It has two main components:
+ HDFS - Hadoop Distributed File System (storage)
+ Distributed across "nodes"
+ Natively redundant
+ Self-healing, high bandwidth, clustered storage
+ The NameNode tracks block locations
+ MapReduce (processing)
+ Splits a task across processors, "near" the data, and assembles the results
+ The JobTracker manages the TaskTrackers

NAMENODE AND DATANODES
+ NameNode:
+ the master of the system
+ maintains and manages the blocks which are present on the DataNodes
+ DataNodes:
+ slaves which are deployed on each machine and provide the actual storage
+ responsible for serving read and write requests for the clients

SECONDARY NAMENODE
+ Secondary NameNode:
+ Not a hot standby for the NameNode
+ Connects to the NameNode every hour
+ Housekeeping, backup of NameNode metadata
+ The saved metadata can be used to rebuild a failed NameNode

JOBTRACKER AND TASKTRACKER
+ JobTracker
+ Determines the execution plan for the job
+ Assigns individual tasks
+ TaskTracker
+ Keeps track of the performance of an individual mapper or reducer

HADOOP DISTRIBUTED FILE SYSTEM: OVERVIEW
+ Responsible for storing data on the cluster
+ Data files are split into blocks and distributed across the nodes in the cluster
+ Each block is replicated multiple times

HDFS BASIC CONCEPTS
+ The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across
machines in a large cluster. It is inspired by the Google File System.
+ Large data files are distributed into blocks
+ Blocks are managed by different nodes in the cluster
+ Each block is replicated on multiple nodes

HOW ARE FILES STORED
+ Files are split into blocks
+ Blocks are split across many machines at load time
+ Different blocks from the same file will be stored on different machines
+ Blocks are replicated across multiple machines
+ The NameNode keeps track of which blocks make up a file and where they are stored

DATA REPLICATION
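The splitting at load time described above can be sketched in a few lines. This is illustrative only, not Hadoop code: `split_into_blocks` is a hypothetical helper, and the block size here is tiny so the example is readable (HDFS defaults to 128 MB blocks).

```python
# Sketch: splitting a file's bytes into fixed-size blocks, the way HDFS
# splits files at load time. BLOCK_SIZE is tiny for illustration only.
BLOCK_SIZE = 8

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return the list of blocks (byte chunks) that make up the file."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"hello distributed world")
print(len(blocks))   # 23 bytes at 8 bytes/block -> 3 blocks
print(blocks[0])     # b'hello di'
```

Each of these blocks would then be shipped to (and replicated on) different machines, with the NameNode recording where each one went.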
+ Default replication is 3-fold

(Figure: HDFS data distribution - an input file's blocks are spread, with
replicas, across the cluster's nodes.)

DATA RETRIEVAL
+ When a client wants to retrieve data:
+ It communicates with the NameNode to determine which blocks make up a file
and on which DataNodes those blocks are stored
+ It then communicates directly with the DataNodes to read the data

FUNCTIONS OF A NAMENODE
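The two-step read path just described (ask the NameNode, then read from DataNodes) can be modelled as a toy. Every name here (`block_map`, `locations`, `storage`, `read_file`) is hypothetical, not the real HDFS client API.

```python
# Toy model of the HDFS read path.
# NameNode-side metadata: file -> ordered block ids, block -> replica nodes
block_map = {"/logs/app.log": ["blk_1", "blk_2"]}
locations = {"blk_1": ["dnA", "dnB", "dnC"], "blk_2": ["dnB", "dnC", "dnD"]}
# DataNode-side storage: (datanode, block) -> bytes
storage = {("dnA", "blk_1"): b"first-", ("dnB", "blk_2"): b"second"}

def read_file(path):
    data = b""
    for blk in block_map[path]:            # 1. ask the NameNode for blocks
        for dn in locations[blk]:          # 2. try replica locations in order
            if (dn, blk) in storage:       # 3. read directly from a DataNode
                data += storage[(dn, blk)]
                break
    return data

print(read_file("/logs/app.log"))  # b'first-second'
```

Note that file bytes never flow through the NameNode: it only answers the metadata question, and the client talks to DataNodes directly.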
+ Manages the file system namespace
+ Maps a file name to a set of blocks
+ Maps a block to the DataNodes where it resides
+ Cluster configuration management
+ Replication engine for blocks

NAMENODE METADATA
+ Types of metadata:
+ List of files
+ List of blocks for each file
+ List of DataNodes for each block
+ File attributes, e.g. creation time, replication factor
+ A transaction log
+ Records file creations, file deletions, etc.

DATANODE
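The metadata types listed above can be captured in a small illustrative model. The class and its dict-based structures are assumptions for the sketch, not HDFS internals; the point is the shape of the data plus the append-only transaction log.

```python
# Illustrative model of NameNode metadata: per-file block lists, per-block
# DataNode lists, file attributes, and a transaction log of namespace changes.
import time

class NameNodeMetadata:
    def __init__(self):
        self.files = {}            # path -> {"blocks": [...], "replication": n, "ctime": t}
        self.block_locations = {}  # block id -> [datanode, ...]
        self.transaction_log = []  # records of creations / deletions

    def create_file(self, path, blocks, replication=3):
        self.files[path] = {"blocks": blocks, "replication": replication,
                            "ctime": time.time()}
        self.transaction_log.append(("create", path))

    def delete_file(self, path):
        for blk in self.files.pop(path)["blocks"]:
            self.block_locations.pop(blk, None)   # forget block locations too
        self.transaction_log.append(("delete", path))

md = NameNodeMetadata()
md.create_file("/data/a.txt", ["blk_1", "blk_2"])
md.delete_file("/data/a.txt")
print(md.transaction_log)  # [('create', '/data/a.txt'), ('delete', '/data/a.txt')]
```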
+ A block server
+ Stores data in the local file system (e.g. ext3)
+ Stores metadata of a block (e.g. CRC)
+ Serves data and metadata to clients
+ Block report
+ Periodically sends a report of all existing blocks to the NameNode
+ Facilitates pipelining of data
+ Forwards data to other specified DataNodes

BLOCK PLACEMENT
+ Current strategy:
+ One replica on the local node
+ Second replica on a remote rack
+ Third replica on the same remote rack
+ Additional replicas are placed randomly
+ Clients read from the nearest replica

HEARTBEATS
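The placement strategy above can be sketched under simplified assumptions: each DataNode carries a rack label, the writer runs on `local_node`, and `place_replicas` (a hypothetical function, not HDFS code) picks the local node, a node on a remote rack, a second node on that same remote rack, then random nodes for extras.

```python
# Sketch of the default block placement policy (illustrative only).
import random

def place_replicas(local_node, racks, n_replicas=3, seed=0):
    """racks: dict node -> rack id. Returns the list of chosen nodes."""
    rng = random.Random(seed)
    chosen = [local_node]                                 # replica 1: local node
    remote = [n for n in racks if racks[n] != racks[local_node]]
    first_remote = rng.choice(remote)
    chosen.append(first_remote)                           # replica 2: remote rack
    same_remote_rack = [n for n in remote
                        if racks[n] == racks[first_remote] and n != first_remote]
    if same_remote_rack:
        chosen.append(rng.choice(same_remote_rack))       # replica 3: same remote rack
    while len(chosen) < n_replicas:                       # extras: random nodes
        n = rng.choice(list(racks))
        if n not in chosen:
            chosen.append(n)
    return chosen

racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(place_replicas("n1", racks))
```

The design trades safety for bandwidth: two replicas share one remote rack (cheap inter-node copy), while the rack split still survives the loss of a whole rack.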
+ DataNodes send heartbeats to the NameNode
+ Once every 3 seconds
+ The NameNode uses heartbeats to detect DataNode failure

REPLICATION ENGINE
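A toy failure detector for the heartbeat scheme above: nodes report in periodically, and a node is declared dead once no heartbeat has arrived within a timeout. The function names and the 30-second timeout are assumptions for the sketch (real HDFS waits considerably longer before declaring a DataNode dead).

```python
# Toy heartbeat-based failure detection.
last_heartbeat = {}            # datanode -> time of its last heartbeat

def record_heartbeat(node, now):
    last_heartbeat[node] = now

def dead_nodes(now, timeout=30):
    """Nodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

record_heartbeat("dnA", now=0)
record_heartbeat("dnB", now=0)
record_heartbeat("dnA", now=27)    # dnA keeps heartbeating; dnB goes silent
print(dead_nodes(now=40))          # ['dnB']
```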
+ The NameNode detects DataNode failures
+ Chooses new DataNodes for new replicas
+ Balances disk usage
+ Balances communication traffic to DataNodes

NAMENODE FAILURE
+ The NameNode is a single point of failure
+ The transaction log is stored in multiple directories:
+ A directory on the local file system
+ A directory on a remote file system (NFS/CIFS)

DATA PIPELINING
+ The client retrieves a list of DataNodes on which to place replicas of a block
+ The client writes the block to the first DataNode
+ The first DataNode forwards the data to the next node in the pipeline
+ When all replicas are written, the client moves on to write the next block of the file

REBALANCER
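The forwarding chain above can be sketched as a recursive hand-off: the client sends the block only to the first DataNode, and each node stores its replica before passing the data on. `pipeline_write` and the `stored` dict are illustrative stand-ins, not the HDFS wire protocol.

```python
# Sketch of the replica write pipeline.
stored = {}   # (datanode, block_id) -> data

def pipeline_write(block_id, data, pipeline):
    """Write to the first node; each node stores, then forwards downstream."""
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    stored[(head, block_id)] = data        # this node stores its replica
    pipeline_write(block_id, data, rest)   # ...and forwards to the next node

pipeline_write("blk_7", b"payload", ["dnA", "dnB", "dnC"])
print(sorted(dn for dn, blk in stored if blk == "blk_7"))  # ['dnA', 'dnB', 'dnC']
```

Pipelining means the client uploads each block once, rather than three times, and the replication bandwidth is spread across the DataNodes.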
+ Goal: the percentage of disk used should be similar across DataNodes
+ Usually run when new DataNodes are added
+ The cluster stays online while the Rebalancer is active
+ The Rebalancer is throttled to avoid network congestion
+ Command-line tool

SECONDARY NAMENODE
+ Copies the FsImage and transaction log from the NameNode to a temporary directory
+ Merges the FsImage and transaction log into a new FsImage in the temporary directory
+ Uploads the new FsImage to the NameNode
+ The transaction log on the NameNode is then purged

MAPREDUCE
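The checkpoint merge just described boils down to replaying the logged operations on top of the saved image. The data shapes here (`fsimage` as a path-to-blocks dict, a list-of-tuples log) are assumptions for the sketch, not the real FsImage format.

```python
# Sketch of the Secondary NameNode checkpoint: replay the transaction log
# over the last FsImage, producing a new image so the log can be purged.
def merge_checkpoint(fsimage, txlog):
    image = dict(fsimage)                  # namespace as of the last checkpoint
    for op, path, blocks in txlog:         # replay logged changes in order
        if op == "create":
            image[path] = blocks
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/a": ["blk_1"]}
txlog = [("create", "/b", ["blk_2"]), ("delete", "/a", None)]
new_image = merge_checkpoint(fsimage, txlog)
print(new_image)                           # {'/b': ['blk_2']}
new_txlog = []                             # the log is purged after upload
```

Doing this merge off the NameNode keeps the log short, so a restarted NameNode replays minutes of edits instead of days.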
Distributing computation across nodes

WHY MAPREDUCE?
+ Before MapReduce:
+ Concurrent systems
+ Grid computing
+ Rolling your own solution
+ Considerations:
+ Threading is hard!
+ How do you scale to more machines?
+ How do you handle machine failures?
+ How do you facilitate communication between nodes?
+ Does your solution scale?
+ Scale out, not up!

THE MAPREDUCE PARADIGM
+ A platform for reliable and scalable computing
+ Runs over distributed file systems:
+ Google File System
+ Hadoop Distributed File System (HDFS)

MAPREDUCE PROGRAMMING MODEL
+ Inspired by the map and reduce operations commonly used in functional
programming languages like Lisp.
+ Input: a set of key/value pairs
+ The user supplies two functions:
+ map(k, v) -> list(k1, v1)
+ reduce(k1, list(v1)) -> v2
+ (k1, v1) is an intermediate key/value pair
+ The output is the set of (k1, v2) pairs

+ MapReduce is a programming model used to process large data sets in a
batch-processing manner.
+ A method for distributing computation across multiple nodes
+ A MapReduce program comprises:
+ a Map() procedure that performs filtering and sorting (such as sorting
students by last name into queues, one queue for each name)
+ a Reduce() procedure that performs a summary operation (such as counting
the number of students in each queue, yielding name frequencies)

THE MAPPER
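The students example above can be written out directly: map() emits a (last-name, student) pair per record so students queue up by last name, and reduce() summarizes each queue into a count. The function and field names are illustrative, not a fixed API.

```python
# The "students into queues" example as map and reduce functions.
from collections import defaultdict

def map_student(student):
    """Emit (key, value) pairs: queue each student under their last name."""
    return [(student["last"], student)]

def reduce_queue(last_name, students):
    """Summarize one queue: how many students share this last name."""
    return (last_name, len(students))

students = [{"last": "Smith"}, {"last": "Jones"}, {"last": "Smith"}]
queues = defaultdict(list)
for s in students:
    for key, value in map_student(s):        # map phase
        queues[key].append(value)            # (shuffle: group by key)
results = sorted(reduce_queue(k, v) for k, v in queues.items())
print(results)  # [('Jones', 1), ('Smith', 2)]
```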
+ Each block is processed in isolation by a map task, called a mapper
+ The map task runs on the node where the block is stored
(Figure: the mapper turns input key-value pairs, e.g. (doc-id, doc-content),
into intermediate key-value pairs, e.g. (word, wordcount-in-a-doc).
Adapted from Jeff Ullman's course slides.)

SHUFFLE AND SORT
+ Output from the mapper is sorted by key
+ All values with the same key are guaranteed to go to the same machine

THE REDUCER
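A minimal stand-in for the shuffle-and-sort step above, feeding the reducers: mapper output is sorted by key, and `itertools.groupby` then hands over every value for a given key together, mirroring the guarantee that all values for one key land on one machine.

```python
# Shuffle and sort in miniature: sort mapper output by key, then group.
from itertools import groupby
from operator import itemgetter

mapper_output = [("bear", 1), ("dog", 1), ("bear", 1), ("cat", 1)]
shuffled = sorted(mapper_output, key=itemgetter(0))   # sort pairs by key
groups = {k: [v for _, v in g]                        # one group per key
          for k, g in groupby(shuffled, key=itemgetter(0))}
print(groups)  # {'bear': [1, 1], 'cat': [1], 'dog': [1]}
```

Each of these groups is exactly what one reducer invocation receives.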
+ Consolidates results from the different mappers
+ Produces the final output

(Figure: intermediate (word, wordcount-in-a-doc) pairs are grouped into
(word, list-of-wordcount) key-value groups - like a SQL GROUP BY - and
reduced to final (word, final-count) pairs - like a SQL aggregation.
Adapted from Jeff Ullman's course slides.)

JOB TRACKER
+ The Job Tracker is the node to which client applications submit MapReduce programs (jobs).
+ The Job Tracker schedules client jobs and allocates tasks to the slave Task Trackers that are
running on individual worker machines (data nodes).
+ The Job Tracker manages the overall execution of a MapReduce job.
+ The Job Tracker manages the resources of the cluster:
+ Manages the data nodes, i.e. the Task Trackers
+ Keeps track of consumed and available resources
+ Keeps track of already running tasks, provides fault tolerance for tasks, etc.

TASK TRACKER
+ Each Task Tracker is responsible for executing and managing the individual tasks assigned by
the Job Tracker.
+ The Task Tracker also handles the data motion between the map and reduce phases.
+ One prime responsibility of the Task Tracker is to constantly communicate the status of its
tasks to the Job Tracker.
+ If the Job Tracker fails to receive a heartbeat from a Task Tracker within a specified amount
of time, it will assume the Task Tracker has crashed and will resubmit the corresponding tasks
to other nodes in the cluster.

HOW THE MAPREDUCE ENGINE WORKS
+ Client applications submit jobs to the Job Tracker.
+ The Job Tracker talks to the NameNode to determine the location of the data.
+ The Job Tracker locates Task Tracker nodes with available slots at or near the data.
+ The Job Tracker submits the work to the chosen Task Tracker nodes.
+ The Task Tracker nodes are monitored. If they do not submit heartbeat signals often enough,
they are deemed to have failed and the work is scheduled on a different Task Tracker.
+ A Task Tracker notifies the Job Tracker when a task fails. The Job Tracker decides what
to do then: it may resubmit the job elsewhere, it may mark that specific record as something
to avoid, and it may even blacklist the Task Tracker as unreliable.
+ When the work is completed, the Job Tracker updates its status.

WORD COUNT EXAMPLE
(Figure: the overall MapReduce word count process - Input -> Splitting ->
Mapping -> Shuffling -> Reducing -> Final Result. An input line such as
"Deer Bear River" is split and mapped to (K2, V2) pairs like (Bear, 1);
shuffling groups them into (K2, list(V2)) such as (Bear, (1, 1)); reducing
yields the final (K3, V3) counts.)
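The whole word-count pipeline in the figure can be run end to end in a few lines. This is a single-process toy, not Hadoop: `mapper`, `reducer`, and `word_count` are illustrative names, but the phases match the figure exactly.

```python
# End-to-end toy word count: Splitting -> Mapping -> Shuffling -> Reducing.
from collections import defaultdict

def mapper(line):
    """Map one line to (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(word, counts):
    """Reduce one group to its final count."""
    return (word, sum(counts))

def word_count(text):
    groups = defaultdict(list)
    for line in text.splitlines():         # splitting: one record per line
        for word, one in mapper(line):     # mapping: emit (word, 1)
            groups[word].append(one)       # shuffling: group by word
    return dict(reducer(w, c) for w, c in sorted(groups.items()))  # reducing

print(word_count("Deer Bear River\nCar Car River\nDeer Car Bear"))
# {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

On a real cluster the same mapper and reducer run unchanged; Hadoop supplies the splitting, shuffling, distribution, and fault tolerance around them.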