Gerardo Barberena
gerardo.barberena@sensecloudnetwork.com
Hadoop
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.
A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem.
The second problem, combining the data for analysis, is solved by a simple programming model: MapReduce. Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
Introduction
Big Data:
• Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates.
• Data that would take too much time and cost too much money to load into a relational database for analysis.
• Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20
terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, produces about 15 petabytes of
data per year.
What Is the Problem?
 The transfer speed of a disk is around 100 MB/s
 A standard disk holds 1 terabyte
 Many concurrent users
 Time to read an entire disk = 10,000 seconds, or about 3 hours!
 Adding more processing power may not be as helpful, because
• Network bandwidth is now more of a limiting factor
• Physical limits of processor chips have been reached
Distributed Computing vs. Parallelization
 Parallelization: multiple processors or CPUs in a single machine
 Distributed computing: multiple computers connected via a network
Distributed Computing
The key issues involved in this solution:
 Hardware failure
 Combining the data after analysis
 Network-associated problems
Problems in Distributed Computing
• Hardware Failure:
As soon as we start using many pieces of
hardware, the chance that one will fail is fairly
high.
• Combine the data after analysis:
Most analysis tasks need to be able to combine
the data in some way; data read from one
disk may need to be combined with the data
from any of the other 99 disks.
Key Components of Hadoop
A reliable shared storage and analysis system.
There are other subprojects of Hadoop that provide complementary services, or build on the core to add higher-level abstractions. The various subprojects of Hadoop include:
1. Core
2. Avro
3. Pig
4. HBase
5. Zookeeper
6. Hive
Hadoop's Functionality in Distributed Computing
 A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines.
 Hadoop ties these smaller and more reasonably priced machines together into a single cost-effective compute cluster.
 Hadoop provides a simplified programming model which allows the user to quickly write and test distributed systems; its efficient, automatic distribution of data and work across machines in turn exploits the underlying parallelism of the CPU cores.
MapReduce
 Hadoop limits the amount of communication that can be performed between processes: each individual record is processed by a task in isolation from the others
 By restricting the communication between nodes, Hadoop makes the distributed system
much more reliable. Individual node failures can be worked around by restarting tasks
on other machines.
 The other workers continue to operate as though nothing went wrong, leaving the
challenging aspects of partially restarting the program to the underlying Hadoop layer.
Map:    (in_key, in_value) → list(out_key, intermediate_value)
Reduce: (out_key, list(intermediate_value)) → list(out_value)
What Is MapReduce?
 MapReduce is a programming model
 Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines
 MapReduce is an associated implementation for processing and generating
large data sets.
MapReduce Programming Model
 Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key I and passes them to the Reduce
function.
 The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values.
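As a concrete illustration of these two user-written functions, the following is a minimal word-count sketch against the Hadoop Java MapReduce API. The class and variable names are illustrative choices, not taken from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: (file offset, line of text) -> (word, 1) for every word in the line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);              // emit intermediate (word, 1)
        }
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));  // emit final (word, count)
    }
  }
}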
Node Orientation
Data Locality Optimization:
The compute nodes and the storage nodes are the same. The Map-Reduce
framework and the Distributed File System run on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where
data is already present, resulting in very high aggregate bandwidth across the
cluster.
If this is not possible: The computation is done by another processor on the same
rack.
“Moving Computation is Cheaper than Moving Data”
How MapReduce Works
 A Map-Reduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner.
 The framework sorts the outputs of the maps, which are then input to the reduce tasks.
 Typically both the input and the output of the job are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the failed
tasks.
 A MapReduce job is a unit of work that the client wants to be performed: it consists of
the input data, the MapReduce program, and configuration information. Hadoop runs
the job by dividing it into tasks, of which there are two types: map tasks and reduce
tasks
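A minimal sketch of such a unit of work, assuming the hypothetical WordCount classes from the earlier sketch: the driver packages the input path, the MapReduce program, and the configuration into a Job and submits it, and Hadoop divides it into map and reduce tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // The MapReduce program: which mapper and reducer to run.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The input data and where the output should go (paths given on the command line).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and wait for it to finish, printing progress as it runs.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}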
Fault Tolerance
 There are two types of nodes that control the job execution process: tasktrackers and
jobtrackers
 The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.
 Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record
of the overall progress of each job.
 If a task fails, the jobtracker can reschedule it on a different tasktracker.
Input Splits
 Input splits: Hadoop divides the input to a MapReduce job into fixed-size
pieces called input splits, or just splits. Hadoop creates one map task for each
split, which runs the user-defined map function for each record in the split.
 The quality of the load balancing increases as the splits become more fine-
grained.
 BUT if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.
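Split sizes can also be bounded explicitly in the job driver. A minimal sketch follows; the 64 MB and 128 MB bounds are illustrative, and job is the Job object from a driver such as the one sketched earlier.

// Bound the sizes of the input splits Hadoop will create for this job.
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB minimum
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB maximum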
WHY do map tasks write their output to local disk, and not to HDFS?
 Map output is
intermediate output: it’s processed by reduce tasks to produce the final output,
and once the job is complete the map output can be thrown away. So storing it
in HDFS, with replication, would be a waste of time. It is also possible that the
node running the map task fails before the map output has been consumed by
the reduce task.
Input to the Reduce Tasks
 Reduce tasks don’t have the advantage of
data locality—the input to a single reduce
task is normally the output from all mappers.
[Figure: MapReduce data flow with a single reduce task]
Combiner Functions
•Many MapReduce jobs are limited by the bandwidth available on the cluster.
•In order to minimize the data transferred between the map and reduce tasks, combiner functions are introduced.
•Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function's output forms the input to the reduce function.
•Combiner functions can help cut down the amount of data shuffled between the maps and the reduces.
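In code, the combiner is registered on the job just like the mapper and reducer. A minimal sketch, reusing the word-count SumReducer from the earlier example as the combiner, which is valid here because summing partial counts is associative and commutative:

// Added to the job driver: run SumReducer over each map task's output before
// the shuffle, so only partial sums travel across the network to the reducers.
job.setCombinerClass(WordCount.SumReducer.class);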
Hadoop Streaming
•Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java.
•Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.
Hadoop Pipes
•Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
•Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)
 Filesystems that manage the storage across a network of machines are called distributed filesystems.
 Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
 HDFS is a distributed filesystem designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
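A minimal sketch of reading a file from HDFS through the Java FileSystem API. The file path is hypothetical, and the cluster address is assumed to be already configured (for example via fs.defaultFS in core-site.xml).

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster configuration, including the namenode address.
    FileSystem fs = FileSystem.get(new Configuration());
    try (InputStream in = fs.open(new Path("/user/data/example.txt"))) {
      // Stream the file's blocks from the datanodes to standard output.
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}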
Problems with HDFS
Building a distributed filesystem is more complex than building a regular disk filesystem, because the data is spread over multiple nodes and all the complications of network programming kick in.
•Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing
part of the file system’s data. The fact that there are a huge number of components and that
each component has a non-trivial probability of failure means that some component of HDFS
is always non-functional. Therefore, detection of faults and quick, automatic recovery from
them is a core architectural goal of HDFS.
•Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. Thus, HDFS is tuned to support large files. It should provide high
aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should
support tens of millions of files in a single instance.
HDFS Goals
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are
not general purpose applications that typically run on general purpose file systems.
HDFS is designed more for batch processing rather than interactive use by users.
The emphasis is on high throughput of data access rather than low latency of data
access. POSIX imposes many hard requirements that are not needed for
applications that are targeted for HDFS. POSIX semantics in a few key areas have
been traded to increase data throughput rates.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high throughput data access. A Map/Reduce
application or a web crawler application fits perfectly with this model. There is a plan
to support appending-writes to files in the future.
 Low-latency data access
Applications that require low-latency access to data, in the tens-of-milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
 Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are
always made at the end of the file. There is no support for
multiple writers, or for modifications at arbitrary offsets in the
file. (These might be supported in the future, but they are likely
to be relatively inefficient.)
HDFS Concepts
Block Abstraction
 Blocks:
• A block is the minimum amount of data that can be read or
written.
• 64 MB by default.
• Files in HDFS are broken into block-sized chunks, which are
stored as independent units.
• HDFS blocks are large compared to disk blocks, and the
reason is to minimize the cost of seeks. By making a block
large enough, the time to transfer the data from the disk can be
made to be significantly larger than the time to seek to the start
of the block. Thus the time to transfer a large file made of
multiple blocks operates at the disk transfer rate.
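As an illustrative sketch, the block size can also be chosen per file when it is created through the FileSystem API; the path, the 128 MB block size, and the replication factor of 3 below are assumptions, not values taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    long blockSize = 128L * 1024 * 1024;             // 128 MB blocks for this file
    try (FSDataOutputStream out = fs.create(
        new Path("/user/data/large-file.bin"),       // hypothetical path
        true,                                        // overwrite if it already exists
        4096,                                        // I/O buffer size
        (short) 3,                                   // replication factor
        blockSize)) {
      out.writeBytes("example payload\n");           // the file is split into blocks as it grows
    }
  }
}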
Benefits of Block Abstraction
 A file can be larger than any single disk in the network. There’s
nothing that requires the blocks from a file to be stored on the
same disk, so they can take advantage of any of the disks in
the cluster.
 Making the unit of abstraction a block rather than a file
simplifies the storage subsystem.
 Blocks provide fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is
replicated to a small number of physically separate machines
(typically three). If a block becomes unavailable, a copy can be
read from another location in a way that is transparent to the
client.
Hadoop Archives
 HDFS stores small files inefficiently, since each file is stored in
a block, and block metadata is held in memory by the
namenode. Thus, a large number of small files can eat up a lot
of memory on the namenode.
 Hadoop Archives, or HAR files, are a file archiving facility that
packs files into HDFS blocks more efficiently, thereby reducing
namenode memory usage while still allowing transparent
access to files.
 Hadoop Archives can be used as input to MapReduce.
Limitations of Archiving
 There is currently no support for archive
compression, although the files that go into
the archive can be compressed
 Archives are immutable once they have been
created. To add or remove files, you must
recreate the archive
Namenodes and Datanodes
 An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
 The namenode manages the filesystem namespace. It
maintains the filesystem tree and the metadata for all the files
and directories in the tree.
 Datanodes are the work horses of the filesystem. They store
and retrieve blocks when they are told to (by clients or the
namenode), and they report back to the namenode periodically
with lists of blocks that they are storing.
 Without the namenode, the filesystem cannot
be used. In fact, if the machine running the
namenode were obliterated, all the files on
the filesystem would be lost since there
would be no way of knowing how to
reconstruct the files from the blocks on the
datanodes.
 It is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this:
1. The first is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems.
2. The second is to run a secondary namenode. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
File System Namespace
 HDFS supports a traditional hierarchical file organization. A user or an
application can create and remove files, move a file from one directory
to another, rename a file, create directories and store files inside these
directories.
 HDFS does not yet implement user quotas or access permissions.
HDFS does not support hard links or soft links. However, the HDFS
architecture does not preclude implementing these features.
 The Namenode maintains the file system namespace. Any change to
the file system namespace or its properties is recorded by the
Namenode. An application can specify the number of replicas of a file
that should be maintained by HDFS. The number of copies of a file is
called the replication factor of that file. This information is stored by the
Namenode.
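A minimal sketch of setting the replication factor of an existing file through the FileSystem API; the path and the value of 3 are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask HDFS to keep three copies of this file; the Namenode records the
    // new replication factor and schedules re-replication as needed.
    boolean changed = fs.setReplication(new Path("/user/data/example.txt"), (short) 3);
    System.out.println("Replication change accepted: " + changed);
  }
}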
Data Replication
 The blocks of a file are replicated for fault tolerance.
 The NameNode makes all decisions regarding replication of
blocks. It periodically receives a Heartbeat and a Blockreport
from each of the DataNodes in the cluster. Receipt of a
Heartbeat implies that the DataNode is functioning properly.
 A Blockreport contains a list of all blocks on a DataNode.
 When the replication factor is three, HDFS’s placement policy
is to put one replica on one node in the local rack, another on a
different node in the local rack, and the last on a different node
in a different rack.
Thank You
Editor's Notes
• On small files and HDFS blocks: note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.