Module 2.1
• DataNodes:
• Slaves deployed on each machine that provide the actual storage
• Responsible for serving read and write requests from clients (see the client sketch below)
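A minimal client sketch, assuming the standard HDFS Java API (FileSystem); the class name, path, and file contents here are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/hello.txt"); // hypothetical path

    // Write: the client asks the NameNode for target DataNodes,
    // then streams the bytes directly to those DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello hdfs\n");
    }

    // Read: the NameNode returns block locations; the bytes
    // themselves are served by the DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}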
NameNode and DataNode Block Replication
How is a 400 MB file saved on HDFS with an HDFS block size of 100 MB?
Example
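A worked answer, assuming HDFS's default replication factor of 3:
• 400 MB / 100 MB per block = 4 blocks
• Each block is replicated 3 times, so 12 block replicas are distributed across the DataNodes
• Raw storage consumed: 4 × 100 MB × 3 = 1200 MB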
MapReduce
• The input data set is split into independent chunks.
• The Mapper:
Each block is processed in isolation by a map task, called a mapper
The map task runs on the node where the block is stored
• The Reducer:
Consolidates the results from the different mappers
Produces the final output (see the WordCount sketch below)
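A minimal WordCount sketch of the two phases, using the classic org.apache.hadoop.mapreduce API; the class names (WordCount, TokenMapper, SumReducer) are illustrative, not from the slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: runs once per input split, ideally on the node holding the block.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE); // emit (word, 1) key-value pairs
      }
    }
  }

  // Reducer: consolidates the (word, 1) pairs from all mappers.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum)); // final output: (word, total count)
    }
  }
}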
MapReduce Programming Phases and Daemons
Phases:
• Map – converts input into key-value pairs
• Reduce – combines the output of the mappers and produces a result set
Daemons:
• JobTracker – master; schedules tasks
• TaskTracker – slave; executes tasks
• JobTracker:
• Takes care of all job scheduling and assigns tasks to TaskTrackers (see the driver sketch below)
• TaskTracker:
• A node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker
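A minimal job-submission sketch, reusing the hypothetical WordCount classes above. On classic Hadoop 1.x the submitted tasks are scheduled by the JobTracker onto TaskTrackers; on Hadoop 2.x the same client code is scheduled through YARN instead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenMapper.class);  // map phase
    job.setReducerClass(WordCount.SumReducer.class);  // reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submit the job and wait; the cluster's master daemon
    // assigns the map and reduce tasks to worker nodes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}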
MapReduce Architecture
Versions of Hadoop
YARN
Hadoop Ecosystem
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory Data Processing
• PIG, HIVE -> Data processing using SQL-like queries
• HBase -> NoSQL Database
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• Zookeeper -> Managing Cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data Ingesting Services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain cluster