Apache Hadoop
MASTER NODE
• Name Node stores the metadata.
• Determines the mapping of data blocks to the Data Nodes.
SLAVE NODE
• Data Nodes store the actual data.
• Responsible for serving read and write requests from the client.
• Performs block creation, deletion, and replication upon instruction from the Name Node (see the client sketch below).
• TaskTracker:
• Executes tasks upon instruction from the Master.
• Handles data motion between the map and reduce phases.
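To make the read and write path concrete, here is a minimal HDFS client sketch in Java. The Name Node address and the file path are assumptions for illustration; the client obtains metadata from the Name Node and then streams bytes directly to and from the Data Nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Name Node address for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write: the Name Node chooses which Data Nodes hold each block;
        // the client streams the bytes to those Data Nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read: the Name Node returns block locations; the bytes are read
        // directly from the Data Nodes.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}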
YARN (Yet Another Resource Negotiator)
• The brain of the Hadoop ecosystem: it coordinates all processing, including resource allocation, job scheduling, and job monitoring.
• YARN’s dynamic allocation of cluster resources improves utilization.
• YARN’s Resource Manager delegates per-application work to Application Masters, so its Scheduler can focus purely on scheduling.
YARN ARCHITECTURE
• Components of YARN :
• Resource Manager
• Node Manager
• Application Master
RESOURCE MANAGER
• Arbitrator of all cluster resources.
• Two parts:
• 1. Scheduler : Responsible for allocating resources to the various running applications.
• 2. Applications Manager : Responsible for accepting job submissions and negotiating the first container for each application’s Application Master.
NODE MANAGER
• Per-machine framework agent that launches containers and monitors their resource usage (CPU, memory, disk, network).
• Reports this usage to the Resource Manager/Scheduler.
APPLICATION MASTER
• The Application Master is where the job resides.
• The per-application Application Master is a framework-specific library tasked with negotiating resources from the Resource Manager.
• Works with the Node Manager(s) to execute and monitor the tasks.
• Works as a job life-cycle manager.
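As a concrete illustration of how a client talks to the Resource Manager, here is a minimal sketch using the YarnClient API to list the applications the cluster knows about; it assumes the cluster configuration (Resource Manager address, etc.) is available from yarn-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the Resource Manager for a report on every application;
        // each report reflects the state tracked via its Application Master.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}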
MAP REDUCE
• A combination of two operations, Map and Reduce.
• “Map” sends a query to the various data nodes for processing, and “Reduce” collects the results of these queries.
• The Map function performs filtering, grouping, and sorting, while the Reduce function summarizes and aggregates the results produced by the Map function.
EXAMPLE…
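A minimal sketch of the classic word-count job, the canonical MapReduce example: Map emits a (word, 1) pair for every word, the shuffle groups the pairs by word, and Reduce sums each group. Input and output paths are assumed to arrive as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this mapper's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: after the shuffle groups pairs by word, sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // assumed input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // assumed output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}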
SOME IMPORTANT POINTS….
• Input data can be divided into any number of chunks (splits), depending on the amount of data.
• All the chunks are processed in parallel.
• Shuffling then aggregates values that share the same key.
• Reducers combine each group into a consolidated output according to the job’s logic (see the short trace below).
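A tiny worked trace of these points on an assumed two-chunk word-count input:

Chunks:  "cat dog"            "dog dog"
Map:     (cat,1) (dog,1)      (dog,1) (dog,1)     <- chunks processed in parallel
Shuffle: cat -> [1]           dog -> [1, 1, 1]    <- similar keys aggregated
Reduce:  (cat, 1)             (dog, 3)            <- consolidated output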
OTHER COMPONENTS
• Apache Pig : A procedural data-flow language, an alternative to writing Java MapReduce, used to process large data sets in parallel.
• HBase : An open-source, non-relational (NoSQL) database that can handle any data type inside a Hadoop system; a short client sketch follows this list.
• Mahout : Provides an environment for developing machine learning applications that perform filtering, clustering, and classification.
• Zookeeper : Known as the king of coordination; provides reliable, fast, and organized operational services for Hadoop clusters.
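As a brief illustration of HBase’s data model, here is a minimal Java client sketch that writes and reads one cell. The table name "users" and column family "info" are hypothetical and must already exist; connection settings are assumed to come from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "users" is a hypothetical, pre-created table.
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Store one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}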
COMPONENTS…
• Oozie : Performs job scheduling and works like an alarm-and-clock service inside the Hadoop ecosystem.
• Ambari : Makes the Hadoop ecosystem more manageable by provisioning, managing, and monitoring Hadoop clusters.
• Hive : Gives an SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop; a short JDBC sketch follows this list.
• Sqoop : A command-line application for transferring data between relational databases and Hadoop.
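A minimal sketch of Hive’s SQL-like interface through JDBC; the HiveServer2 host and port, the credentials, and the "users" table are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and database.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // "users" is a hypothetical table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, COUNT(*) AS n FROM users GROUP BY name")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getLong("n"));
            }
        }
    }
}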
REAL-WORLD USE CASES
• Financial services companies use it to assess risks and build investment models.
• Retail websites use it to analyze structured and unstructured data to better understand and serve their customers.
• Companies can even use it to understand what people think about them
through data mining and machine learning.
• Companies such as Amazon, Microsoft, Intel etc. use Hadoop to store and
analyze their data.
ADVANTAGES OF HADOOP
• Scalability : Highly scalable storage platform.
• Fast : Hadoop’s distributed file system processes data at a very rapid rate.
• Flexible : Hadoop enables businesses to easily access new data sources.
• Fault Tolerance : Because data is replicated across nodes, individual hardware failures do not cause data loss or stop processing.
CONCLUSION
• Hadoop is a natural platform with which enterprise IT can now apply data science to a huge variety of business problems, such as product recommendation, data analysis, and sentiment analysis.
• It is rapidly becoming a central store for big data in many industries.