DECISION ANALYTICS: BIG DATA

DISTRIBUTED FILE SYSTEMS
* The main purpose of a Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources exactly as they do local ones.
* The performance and reliability of such access should be comparable to that of files stored locally.
* Recent advances in the bandwidth of switched local networks and in disk organization have led to high-performance, highly scalable file systems.

FEATURES OF A DISTRIBUTED FILE SYSTEM
* Transparency
* Concurrent Updates
* Replication
* Fault Tolerance
* Consistency
* Platform Independence
* Security
* Efficiency

TRANSPARENCY
* The illusion that all files are similar. Includes:
  - Access transparency: a single set of operations; clients that work on local files can work with remote files.
  - Location transparency: clients see a uniform name space; files can be relocated without changing their path names.
  - Mobility transparency: files can be moved without modifying programs or changing system tables.
  - Performance transparency: within limits, local and remote file access meet performance standards.
  - Scaling transparency: increased loads do not degrade performance significantly; capacity can be expanded.

CONCURRENT UPDATES
* Changes to a file from one client should not interfere with changes from other clients, even if the changes happen at the same time.
* Solutions often include:
  - File- or record-level locking (a minimal sketch follows)
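As a concrete illustration of file-level locking, here is a minimal single-machine sketch using java.nio. A real DFS coordinates such locks across clients (for example through a lock service), but the protection each writer gets is the same. The file name is hypothetical.

```java
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockedUpdate {
    public static void main(String[] args) throws Exception {
        // Open the shared file for reading and writing (path is hypothetical).
        try (RandomAccessFile file = new RandomAccessFile("shared-ledger.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // Block until we hold an exclusive lock on the whole file;
            // other cooperating clients calling lock() will wait here.
            try (FileLock lock = channel.lock()) {
                // Safe to update: no other locking client can interleave writes.
                file.seek(file.length());
                file.writeBytes("new record\n");
            } // The lock is released when the try-with-resources block exits.
        }
    }
}
```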
REPLICATION
* A file may have several copies of its data at different locations, often for performance reasons.
* Requires updating the other copies when one copy is changed.
* Simple solution: change a master copy and periodically refresh the other copies.
* More complicated solution: multiple copies can be updated independently at the same time, which needs finer-grained refresh and/or merge.

FAULT TOLERANCE
* The system should keep functioning when clients or servers fail, and should detect, report, and correct faults that occur.
* Solutions often include:
  - Redundant copies of data, redundant hardware, backups, transaction logs, and other measures
  - Stateless servers
  - Idempotent operations

CONSISTENCY
* Data must always be complete, current, and correct.
* A file seen by one process should look the same to all processes accessing it.
* Consistency is a special concern whenever data is duplicated.
* Solutions often include:
  - Timestamps and ownership information

PLATFORM INDEPENDENCE
* Access should work even when hardware and operating systems are completely different in design, architecture, and functioning, and come from different vendors.
* Solutions often include:
  - A flexible communication protocol between clients and servers

EFFICIENCY
* Overall, we want the same power and generality as local file systems.
* In the early days the goal was to share an "expensive" resource: the disk.
* Now the goal is convenient access to remotely stored files.

SECURITY
* File systems must be protected against unauthorized access, data corruption, loss, and other threats.
* Solutions include:
  - Access control mechanisms (ownership, permissions)
  - Encryption of commands or data to prevent "sniffing"

HADOOP
* Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
* Open-source data management with scale-out storage and distributed processing.

HISTORY
* Started as a Lucene sub-project (Nutch); Google published the GFS and MapReduce papers.
* Doug Cutting added a distributed file system and MapReduce to the project and started working at Yahoo!.
* Hadoop ran on a 1000-node cluster, became an Apache top-level project, and defeated supercomputers in the terabyte-sort benchmark.
* Facebook added SQL support for Hadoop; Doug Cutting joined Cloudera.
* Apache released the first stable version, 1.0; Hadoop 2.0, which contains YARN, followed, then Hadoop 3.0. Hadoop 3.3.4 is the latest version as of these slides.

WHO USES HADOOP?
* Amazon
* Facebook
* Google
* New York Times
* Veoh
* Yahoo!
* ... many more

[Diagram: Hadoop cluster with JobTracker, Admin Node, and NameNode.]

Hadoop is a system for large-scale data processing. It has two main components:
* HDFS, the Hadoop Distributed File System (storage):
  - Distributed across "nodes"
  - Natively redundant; self-healing, high-bandwidth, clustered storage
  - The NameNode tracks block locations
* MapReduce (processing):
  - Splits a task across processors, "near" the data, and assembles the results
  - The JobTracker manages the TaskTrackers

* NameNode:
  - The master of the system
  - Maintains and manages the blocks which are present on the DataNodes
* DataNodes:
  - Slaves which are deployed on each machine and provide the actual storage
  - Responsible for serving read and write requests for the clients

SECONDARY NAMENODE
* Not a hot standby for the NameNode
* Connects to the NameNode every hour (by default)
* Does housekeeping and keeps a backup of the NameNode metadata
* The saved metadata can be used to rebuild a failed NameNode

JOBTRACKER AND TASKTRACKER
* JobTracker:
  - Determines the execution plan for a job
  - Assigns individual tasks
* TaskTracker:
  - Keeps track of the performance of an individual mapper or reducer

HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW
* Responsible for storing data on the cluster
* Data files are split into blocks and distributed across the nodes in the cluster
* Each block is replicated multiple times

HDFS BASIC CONCEPTS
* The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
* Large data files are divided into blocks.
* Blocks are managed by different nodes in the cluster.
* Each block is replicated on multiple nodes.

HOW ARE FILES STORED?
* Files are split into blocks.
* Blocks are spread across many machines at load time.
* Different blocks from the same file will be stored on different machines.
* Blocks are replicated across multiple machines.
* The NameNode keeps track of which blocks make up a file and where they are stored.

DATA REPLICATION
* Default replication is 3-fold.
[Figure: HDFS data distribution; the blocks of an input file are spread and replicated across the cluster nodes.]

DATA RETRIEVAL
* When a client wants to retrieve data, it:
  - Communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored
  - Then communicates directly with the DataNodes to read the data (a sketch of this read path follows)
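A minimal sketch of this read path using the standard HDFS Java client (org.apache.hadoop.fs.FileSystem). The block lookup against the NameNode and the direct DataNode reads happen inside open() and the subsequent reads; the cluster URI and file path here are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the client talks to the NameNode for metadata only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // open() fetches the block locations from the NameNode; the stream then
        // reads each block directly from a (preferably nearby) DataNode.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```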
FUNCTIONS OF A NAMENODE
* Manages the file system namespace:
  - Maps a file name to a set of blocks
  - Maps a block to the DataNodes where it resides
* Cluster configuration management
* Replication engine for blocks

NAMENODE METADATA
* Types of metadata:
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
* A transaction log:
  - Records file creations, file deletions, etc.

DATANODE
* A block server:
  - Stores data in the local file system (e.g. ext3)
  - Stores metadata of a block (e.g. CRC)
  - Serves data and metadata to clients
* Block report:
  - Periodically sends a report of all existing blocks to the NameNode
* Facilitates pipelining of data:
  - Forwards data to other specified DataNodes

BLOCK PLACEMENT
* Current strategy:
  - First replica on the local node
  - Second replica on a remote rack
  - Third replica on the same remote rack
  - Additional replicas are placed randomly
* Clients read from the nearest replica.

HEARTBEATS
* DataNodes send heartbeats to the NameNode once every 3 seconds.
* The NameNode uses heartbeats to detect DataNode failure.

REPLICATION ENGINE
* The NameNode detects DataNode failures:
  - Chooses new DataNodes for new replicas
  - Balances disk usage
  - Balances communication traffic to the DataNodes

NAMENODE FAILURE
* The NameNode is a single point of failure.
* The transaction log is therefore stored in multiple directories:
  - A directory on the local file system
  - A directory on a remote file system (NFS/CIFS)

DATA PIPELINING
* The client retrieves a list of DataNodes on which to place replicas of a block.
* The client writes the block to the first DataNode.
* The first DataNode forwards the data to the next node in the pipeline.
* When all replicas are written, the client moves on to write the next block of the file.

REBALANCER
* Goal: the percentage of disk in use should be similar across DataNodes.
* Usually run when new DataNodes are added.
* The cluster stays online while the Rebalancer is active.
* The Rebalancer is throttled to avoid network congestion.
* It is a command-line tool (hdfs balancer).

SECONDARY NAMENODE
* Copies the FsImage and transaction log from the NameNode to a temporary directory.
* Merges them into a new FsImage in the temporary directory.
* Uploads the new FsImage to the NameNode, whose transaction log is then purged.

MAPREDUCE: DISTRIBUTING COMPUTATION ACROSS NODES

WHY MAPREDUCE?
* Before MapReduce: concurrent systems, grid computing, or rolling your own solution.
* Considerations:
  - Threading is hard!
  - How do you scale to more machines?
  - How do you handle machine failures?
  - How do you facilitate communication between nodes?
  - Does your solution scale? Scale out, not up!

THE MAPREDUCE PARADIGM
* A platform for reliable and scalable computing.
* Runs over distributed file systems:
  - Google File System
  - Hadoop Distributed File System (HDFS)

MAPREDUCE PROGRAMMING MODEL
* Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
* Input: a set of key/value pairs.
* The user supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
* (k1, v1) is an intermediate key/value pair; the output is the set of (k1, v2) pairs.
* MapReduce is a programming model used to process large data sets in a batch-processing manner; it is a method for distributing computation across multiple nodes.
* A MapReduce program comprises:
  - a Map() procedure that performs filtering and sorting (such as sorting students by last name into queues, one queue for each name), and
  - a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
(A minimal in-memory sketch of the whole model follows.)
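Before turning to Hadoop's own classes, here is a minimal single-machine sketch of the model above: a map step emits intermediate (key, value) pairs, a shuffle step groups them by key, and a reduce step summarizes each group. All names are illustrative and no Hadoop code is involved; on a cluster each phase would run in parallel on many nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> documents = List.of("the quick fox", "the lazy dog");

        // Map phase: map(k, v) -> list(k1, v1); here each word becomes (word, 1).
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) {
                intermediate.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group all values with the same key together, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: reduce(k1, list(v1)) -> v2; here, sum the counts.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int total = group.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(group.getKey() + "\t" + total);
        }
    }
}
```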
THE MAPPER
* Each block is processed in isolation by a map task called a mapper.
* The map task runs on the node where the block is stored.
* The mapper transforms input key-value pairs, e.g. (doc-id, doc-content), into intermediate key-value pairs, e.g. (word, wordcount-in-a-doc). (Adapted from Jeff Ullman's course slides.)

SHUFFLE AND SORT
* Output from the mapper is sorted by key.
* All values with the same key are guaranteed to go to the same machine.

THE REDUCER
* Consolidates the results from different mappers and produces the final output.
* Intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped into key-value groups, e.g. (word, list-of-wordcount), much like SQL GROUP BY, and then reduced to final key-value pairs, e.g. (word, final-count), much like SQL aggregation. (Adapted from Jeff Ullman's course slides.)

JOB TRACKER
* The JobTracker is the component to which client applications submit MapReduce programs (jobs).
* The JobTracker schedules client jobs and allocates tasks to the slave TaskTrackers that run on individual worker machines (data nodes).
* The JobTracker manages the overall execution of a MapReduce job and the resources of the cluster:
  - Manages the data nodes, i.e. the TaskTrackers
  - Keeps track of consumed and available resources
  - Keeps track of already running tasks and provides fault tolerance for tasks, etc.

TASK TRACKER
* Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker.
* The TaskTracker also handles the data motion between the map and reduce phases.
* One prime responsibility of the TaskTracker is to constantly communicate the status of its tasks to the JobTracker.
* If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

HOW THE MAPREDUCE ENGINE WORKS
* Client applications submit jobs to the JobTracker.
* The JobTracker talks to the NameNode to determine the location of the data.
* The JobTracker locates TaskTracker nodes with available slots at or near the data.
* The JobTracker submits the work to the chosen TaskTracker nodes.
* The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
* A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
* When the work is completed, the JobTracker updates its status.

WORD COUNT EXAMPLE
[Figure: the overall MapReduce word count process: Input → Splitting → Mapping (K1,V1 → list(K2,V2)) → Shuffling (K2, list(V2)) → Reducing → Final Result (list(K3,V3)); e.g. the pairs ("Bear", 1) and ("Bear", 1) are grouped and reduced to ("Bear", 2).]
(A complete Hadoop implementation follows.)
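Here is word count written against the classic Hadoop MapReduce Java API (the org.apache.hadoop.mapreduce package), closely following Hadoop's well-known WordCount example; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: (byte offset, line of text) -> (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: (word, [1, 1, ...]) -> (word, total count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The program would typically be packaged into a jar and launched with hadoop jar, passing the HDFS input and output paths as arguments; the JobTracker (or, under YARN, the ResourceManager) then schedules the map and reduce tasks near the data as described above.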
