Unit 2 - Intro To Hadoop
• Introducing Hadoop
• Why Hadoop?
• Why not RDBMS?
• RDBMS versus Hadoop
• Distributed computing challenges
• History of Hadoop
• Hadoop overview
• Use case of Hadoop
• Hadoop distributors
• HDFS (Hadoop distributed file system)
• Processing Data with Hadoop
• Managing resources and Applications with Hadoop YARN
2.1 Introduction to Hadoop:
• Hadoop is an open-source, distributed computing framework designed to
store and process large-scale data sets across clusters of commodity
hardware.
• It was initially developed by Doug Cutting and Mike Cafarella in 2005 and
is now maintained by the Apache Software Foundation.
• Hadoop is widely used in the industry to handle big data applications
efficiently and cost-effectively.
The key components of Hadoop are:
• Hadoop Distributed File System (HDFS): HDFS is a distributed file system that
allows data to be stored across multiple machines in a Hadoop cluster. It breaks
large files into smaller blocks and replicates them across the cluster to ensure fault
tolerance and high availability.
• MapReduce: MapReduce is a programming model and processing engine that
enables distributed processing of data stored in HDFS. It works by splitting the
processing tasks into smaller sub-tasks, mapping data elements, and then reducing
the results of those mappings into the final output. MapReduce is designed to
handle parallel processing of vast amounts of data across the cluster.
• YARN (Yet Another Resource Negotiator): YARN is the resource management
layer of Hadoop. It manages and schedules resources in the cluster, ensuring
efficient utilization and allocation of computational resources to various
applications.
The typical workflow in Hadoop is as follows:
• Data Ingestion: Large amounts of data from various sources are ingested into the
Hadoop cluster, often through HDFS.
• Data Processing: Hadoop processes the data using the MapReduce paradigm,
where data is mapped into key-value pairs, processed in parallel across the cluster,
and reduced to produce the final output (a minimal word-count sketch follows the
Data Analysis step below).
• Data Analysis: The processed data can be further analyzed using various tools,
such as Apache Hive (a data warehouse infrastructure built on top of Hadoop),
Apache Pig (a high-level scripting language for data analysis), or Apache Spark (a
fast and general-purpose data processing engine).
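To make the map-and-reduce step above concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API. This is the classic introductory example rather than anything specific to this unit; the class names and the command-line input/output paths are illustrative.

// Minimal word-count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // key-value pair: (word, 1)
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final output: (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits (word, 1) pairs for each input split, the framework groups the pairs by key, and the reducer sums the counts; the job would typically be packaged into a jar and submitted to the cluster with the hadoop jar command.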
• Hadoop's distributed nature and fault-tolerant design make it suitable for processing
and analyzing large datasets that would be impractical or impossible to handle using
traditional single-server systems.
• It is particularly popular in applications involving data warehousing, data mining,
machine learning, and real-time analytics.
• Modern big data solutions often include other tools and technologies, such as
Apache Spark, Apache HBase, and cloud-based platforms like AWS EMR and
Google Cloud Dataproc, alongside or instead of traditional Hadoop deployments.
• These technologies provide more flexibility and speed in processing large-scale data.
2.2 Why Hadoop?
Hadoop is one of the pioneering and fundamental technologies used for big data
processing. It gained popularity because it effectively addresses several challenges that
arise when dealing with massive volumes of data. Here are some reasons why Hadoop
is well-suited for big data:
1. Fault Tolerance: With Hadoop's distributed storage system (HDFS) and data
replication capabilities, data is redundantly stored across multiple nodes in the
cluster. This redundancy ensures that even if some machines fail, the data remains
accessible and processing can continue uninterrupted (a small sketch of adjusting the
replication factor follows this list).
2. Cost-Effectiveness: Hadoop runs on commodity hardware, which is less expensive
compared to specialized high-end servers. This cost-effectiveness makes it more
affordable for organizations to store and process large volumes of data without
significant infrastructure investment.
3. Flexibility: Hadoop is schema-on-read, meaning that it can handle data with varying
structures and formats without requiring strict pre-defined schemas. This flexibility
allows organizations to store and process data of different types, such as structured,
semi-structured, and unstructured data.
4. Ecosystem and Integration: Hadoop has a rich ecosystem with various tools and
frameworks that complement its capabilities. Tools like Apache Hive, Apache Pig, and
Apache Spark provide high-level abstractions and facilitate data analysis and
processing tasks, making it easier for data engineers and analysts to work with big
data.
5. Data Locality: Hadoop leverages data locality, which means that data processing
tasks are scheduled on nodes where the data is already stored. This reduces data
movement across the network, optimizing performance and reducing the strain on the
cluster.
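As a small illustration of the replication behind the Fault Tolerance point above, the sketch below reads and then changes the replication factor of a single HDFS file through the Java FileSystem API. The file path /chp/data.txt and the factor 3 are example values only; the cluster-wide default comes from the dfs.replication property in hdfs-site.xml.

// Illustrative sketch: inspecting and changing HDFS replication for one file.
// Assumes a configured Hadoop client; path and factor are example values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/chp/data.txt");

    // Current replication factor, as reported by the NameNode.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication: " + status.getReplication());

    // Request a new replication factor; the NameNode schedules the extra copies.
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}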
2.3 Why not RDBMS?
While Relational Database Management Systems (RDBMS) are highly valuable for
managing structured data and handling transactional workloads, they may not be the
best choice for big data processing. There are several reasons why RDBMS might not
be the ideal solution for big data:
• Scalability:
– RDBMS systems are not designed to handle the scale of data typically encountered in
big data scenarios.
– As data volumes grow, scaling an RDBMS becomes challenging, often requiring
expensive high-end hardware upgrades.
• Cost:
– RDBMS solutions can be costly, especially when dealing with large datasets.
Licensing fees, hardware costs, and maintenance expenses can be prohibitive
for big data workloads, particularly when compared to more cost-effective
open-source big data technologies like Hadoop and Apache Spark.
• Latency and Performance:
– RDBMS engines are tuned for transactional workloads; analytical queries over very
large tables require heavy joins and full scans, and strict consistency guarantees add
latency, so performance degrades as data volumes grow.
Besides RDBMS, several other approaches were used to scale data processing before
Hadoop, each with its own limitations:
1. Parallel Databases:
• A parallel database is one that uses multiple processors and disks working in
parallel to provide database services.
• A parallel database system seeks to improve performance by parallelizing operations
such as loading data, building indexes, and evaluating queries; parallel systems
improve processing and I/O speeds by using multiple CPUs and disks in parallel.
• Parallel databases can be applied to big data scenarios to some extent, but there are
certain limitations and challenges that might make them less suitable for certain
aspects of big data processing.
• Here are a few reasons why parallel databases might face challenges in handling certain big
data scenarios:
Scalability: While parallel databases are horizontally scalable to a certain extent, they might face
limitations when dealing with the massive scale of data that big data technologies like Hadoop
are designed to handle.
Data Variety and Flexibility: Big data often includes a variety of data types, including
structured, semi-structured, and unstructured data. Parallel databases are optimized for structured
data with well-defined schemas, which might not accommodate the flexibility required for
handling diverse data formats commonly found in big data.
2. Grid computing:
Grid computing is a distributed computing model that involves connecting multiple computers,
often geographically dispersed, to work together as a unified resource to solve complex
computational problems or perform large-scale tasks.
Unlike traditional supercomputers that consist of a single powerful machine, grid computing
harnesses the collective processing power and resources of a network of interconnected
computers, making it suitable for tackling tasks that require significant computational resources.
– However, managing the coordination, communication, and synchronization of these
distributed resources could be challenging.
– Grid computing was often used for scientific computing but lacked the specialized tools
and optimizations needed for general big data processing.
3. Custom Solutions:
• A "custom solution" in the context of distributed computing refers to a tailored or specialized
approach that an organization develops to address specific challenges or requirements related
to data processing, storage, and computation within a distributed environment.
• Unlike off-the-shelf software or standardized frameworks, a custom solution is designed and
implemented specifically to meet the unique needs of the organization's computing
infrastructure, data types, and processing workflows.
Concept and Purpose:
• A custom solution is developed when an organization's requirements cannot be fully
met by existing commercial products or open-source frameworks. This approach
allows organizations to have full control over the design, architecture, and features
of the solution, ensuring that it aligns perfectly with their business goals and
technical needs.
Challenges:
• Development Time and Cost: Developing a custom solution requires time, effort,
and financial investment in software development expertise and resources.
• Maintenance Complexity: Ongoing maintenance, updates, and support for the
custom solution can become complex, especially as technology evolves.
• Integration Issues: Integrating the custom solution with existing systems and tools
may require additional effort and compatibility testing.
2.5 Distributed computing challenges:
Before Hadoop was introduced, distributed computing for data analytics faced
several significant challenges. Some of the key challenges were:
– Scalability: Traditional databases and computing systems struggled to
handle the rapidly growing volume of data. As data sizes increased, the
single-machine architecture became a bottleneck, resulting in
performance degradation.
– Cost: Scaling up traditional hardware was expensive, and the cost of
building and maintaining large-scale computing infrastructures was
prohibitive for many organizations.
– Complexity: Writing custom code for distributed data processing was
complex and error-prone. Developers had to handle data partitioning,
distribution, fault tolerance, and synchronization manually.
– Data Heterogeneity: Data sources were diverse, with varying formats,
structures, and levels of cleanliness. Integrating and processing data from
multiple sources was a challenging task.
– Fault Tolerance: Traditional systems were not designed to handle node
failures gracefully. Recovering from failures and ensuring data integrity
was a significant concern.
– Processing Speed: Analyzing large datasets using traditional batch
processing methods was slow. Real-time or near-real-time analytics were
difficult to achieve.
– Latency: Traditional systems suffered from high latency when dealing
with big data. This latency made it challenging to gain insights quickly
from vast datasets.
– Resource Management: Optimally managing and sharing computing
resources in a distributed environment was challenging. Efficiently
allocating resources for different tasks was not straightforward.
• Hadoop, introduced by Doug Cutting and Mike Cafarella in 2005, addressed
many of these challenges. It brought the Hadoop Distributed File System
(HDFS) for efficient data storage and replication across nodes, and the
MapReduce programming model for parallel data processing.
• Hadoop's design allowed for distributed processing of large datasets across a
cluster of commodity hardware, enabling organizations to scale their data
analytics capabilities cost-effectively.
• Additionally, its fault tolerance mechanisms and automatic data replication
provided a higher level of data integrity and reliability in the face of
hardware failures.
• With Hadoop, developers could write data processing jobs using the
MapReduce model, abstracting away much of the complexities of distributed
computing.
• This allowed organizations to harness the power of distributed computing
without having to deal with the low-level intricacies.
• As a result, Hadoop became a cornerstone technology in the big data
ecosystem and revolutionized how organizations approached data analytics
on a large scale.
2.6 Hadoop history:
• Hadoop is an open-source framework for distributed storage and processing
of large-scale data. It was inspired by two papers published by Google: "The Google
File System" (2003) and "MapReduce: Simplified Data Processing on Large Clusters"
(2004), which described Google's proprietary infrastructure for handling vast amounts
of data in a scalable and fault-tolerant manner.
• The origins of Hadoop can be traced back to Doug Cutting and Mike Cafarella's
work on the Apache Nutch web-crawler project. In 2005 they built an open-source
implementation of Google's MapReduce and distributed file system within Nutch,
and in 2006 Cutting joined Yahoo!, where the work was spun out as a separate
project named Hadoop.
• The name "Hadoop" itself comes from Doug Cutting's son's toy elephant,
which he named Hadoop. The toy elephant became the project's logo and
symbolized the idea of handling large amounts of data.
• The development of Hadoop progressed rapidly; it became an Apache subproject in
2006 and a top-level Apache Software Foundation project in 2008.
• Hadoop's MapReduce model enables parallel processing of data by dividing
the input into smaller chunks and distributing them across the cluster for
processing. The results are then aggregated to produce the final output. This
approach makes it possible to process massive datasets that are too large to
be handled by a single machine.
• Hadoop's design and architecture have made it a fundamental component in
the big data ecosystem. It has been widely adopted by various industries for
data-intensive tasks, such as data warehousing, data processing, log
processing, and machine learning.
• Over the years, Hadoop has continued to evolve and improve, with many
organizations contributing to its development. However, as the big data
landscape has evolved, other distributed data processing frameworks like
Apache Spark have gained popularity due to their faster performance and
more flexible APIs.
• Nonetheless, Hadoop remains an essential technology for managing and
processing vast amounts of data in many organizations around the world,
playing a crucial role in the big data revolution.
2.9 Hadoop distributors:
• Hadoop distributors are companies or organizations that provide a packaged version
of the open-source Apache Hadoop software, along with additional tools, services,
and support to make it easier for enterprises to deploy, manage, and utilize Hadoop
in their data infrastructure.
• These distributions aim to simplify the adoption of Hadoop and related big data
technologies by offering pre-configured solutions that include various Hadoop
ecosystem components and integration with other data management and analytics
tools.
2.10 HDFS:
Functions of HDFS NameNode
• It executes the file system namespace operations like opening, renaming, and
closing files and directories.
• NameNode manages and maintains the DataNodes.
• NameNode receives heartbeats and block reports from all DataNodes, which confirm
that the DataNodes are alive.
• If the DataNode fails, the NameNode chooses new DataNodes for new replicas.
Functions of DataNode
• DataNode is responsible for serving the client read/write requests.
• Based on instructions from the NameNode, DataNodes perform block creation,
replication, and deletion.
• DataNodes send a heartbeat to NameNode to report the health of HDFS.
• DataNodes also send block reports to the NameNode, listing the blocks they contain.
1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
Eg. hdfs dfs -mkdir /chp
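The same kind of operation can also be performed programmatically. The sketch below uses the HDFS Java FileSystem API to create the same /chp directory, copy in a local file, and list the directory contents; the local file name data.txt is a placeholder, and a configured Hadoop client (core-site.xml pointing at the cluster) is assumed.

// Minimal sketch of HDFS operations via the Java FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

    Path dir = new Path("/chp");                // same directory as the shell example
    fs.mkdirs(dir);                             // equivalent of: hdfs dfs -mkdir /chp

    // Copy a local file into the new directory (local file name is a placeholder).
    fs.copyFromLocalFile(new Path("data.txt"), dir);

    // List the directory contents, similar to: hdfs dfs -ls /chp
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.close();
  }
}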
Apache Spark and Apache Hadoop are both powerful frameworks for processing large-scale
data, but they have different strengths and use cases. Here are some advantages of using
Apache Spark over Hadoop:
• In-Memory Processing: Spark is designed for in-memory data processing, which can
significantly speed up data analytics tasks compared to Hadoop's disk-based MapReduce.
By keeping intermediate data in memory, Spark reduces the need for time-consuming
disk I/O operations (a short sketch of this appears at the end of this list).
• Ease of Use: Spark provides high-level APIs in multiple programming languages (Scala,
Java, Python, and R), making it more accessible to developers. The API includes built-in
libraries for machine learning (MLlib), graph processing (GraphX), and SQL-based
queries (Spark SQL), simplifying complex data processing tasks.
• Iterative Processing: Spark is well-suited for iterative algorithms commonly used in
machine learning and graph processing. It can cache data in memory between iterations,
eliminating the need to read data from storage repeatedly, which is a limitation of
Hadoop's disk-based MapReduce.
• Real-Time Processing: Spark Streaming, a part of Spark, allows for real-time data
processing and analytics. You can process data streams and make real-time
decisions, whereas Hadoop primarily focuses on batch processing.
• Unified Platform: Spark provides a unified platform for batch processing,
interactive queries, machine learning, and graph processing. In contrast, Hadoop
often requires integrating multiple tools (e.g., MapReduce, Hive, Pig) for different
tasks, leading to complexity.
• Lazy Evaluation: Spark uses lazy evaluation, meaning it doesn't execute operations
until an action is called. This optimization allows Spark to reduce unnecessary
computations and improve performance.
• Integration with Hadoop: Spark can run on Hadoop clusters, and it can read data
from Hadoop Distributed File System (HDFS). This means you can leverage existing
Hadoop investments while taking advantage of Spark's benefits.
• Advanced Analytics: Spark's MLlib library offers scalable machine learning
algorithms, making it a natural choice for building and deploying machine learning
models alongside data processing.
• Interactive Data Analysis: Spark SQL allows users to perform interactive SQL
queries on data stored in Spark, enabling data analysts and data scientists to explore
data more easily.
• Community and Ecosystem: Spark has a rapidly growing and active community,
leading to a rich ecosystem of libraries and tools that extend its capabilities. Hadoop
has a strong community as well, but Spark's growth has been notable.
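To tie the in-memory processing, lazy evaluation, and Spark SQL points together, here is a short, illustrative Spark sketch using the Java API. The input path hdfs:///data/events.json and the column name "category" are placeholders, not part of the course material.

// Illustrative Spark sketch: lazy transformations, in-memory caching, and Spark SQL.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("SparkSketch")
        .getOrCreate();

    // Transformations are lazy: nothing is read or computed yet.
    Dataset<Row> events = spark.read().json("hdfs:///data/events.json");
    Dataset<Row> filtered = events.filter("category IS NOT NULL");

    // cache() keeps the filtered data in memory for reuse across multiple actions.
    filtered.cache();

    // Actions trigger execution of the whole lineage.
    long total = filtered.count();
    System.out.println("Rows: " + total);

    // Spark SQL over the same in-memory data.
    filtered.createOrReplaceTempView("events");
    spark.sql("SELECT category, COUNT(*) AS cnt FROM events GROUP BY category")
         .show();

    spark.stop();
  }
}

Nothing is read from HDFS until count() or show() is called, and because the filtered data is cached, the SQL query runs against the in-memory copy rather than re-reading the source files.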