Unit 2 - Intro To Hadoop
• Introducing Hadoop
• Why Hadoop?
• Why not RDBMS?
• RDBMS versus Hadoop
• Distributed computing challenges
• History of Hadoop
• Hadoop overview
• Use case of Hadoop
• Hadoop distributors
• HDFS (Hadoop distributed file system)
• Processing Data with Hadoop
• Managing resources and Applications with Hadoop YARN
2.1 Introduction to Hadoop:
• Hadoop is an open-source, distributed computing framework designed to
store and process large-scale data sets across clusters of commodity
hardware.
• It was initially developed by Doug Cutting and Mike Cafarella in 2005 and
is now maintained by the Apache Software Foundation.
• Hadoop is widely used in the industry to handle big data applications
efficiently and cost-effectively.
The key components of Hadoop are:
• Hadoop Distributed File System (HDFS): HDFS is a distributed file system that
allows data to be stored across multiple machines in a Hadoop cluster. It breaks
large files into smaller blocks and replicates them across the cluster to ensure fault
tolerance and high availability.
• MapReduce: MapReduce is a programming model and processing engine that
enables distributed processing of data stored in HDFS. It works by splitting the
processing tasks into smaller sub-tasks, mapping data elements, and then reducing
the results of those mappings into the final output. MapReduce is designed to
handle parallel processing of vast amounts of data across the cluster.
• YARN (Yet Another Resource Negotiator): YARN is the resource management
layer of Hadoop. It manages and schedules resources in the cluster, ensuring
efficient utilization and allocation of computational resources to various
applications.
The typical workflow in Hadoop is as follows:
• Data Ingestion: Large amounts of data from various sources are ingested into the
Hadoop cluster, often through HDFS.
• Data Processing: Hadoop processes the data using the MapReduce paradigm,
where data is mapped into key-value pairs, processed in parallel across the cluster,
and reduced to produce the final output (a minimal word-count sketch follows the
Data Analysis step below).
• Data Analysis: The processed data can be further analyzed using various tools,
such as Apache Hive (a data warehouse infrastructure built on top of Hadoop),
Apache Pig (a high-level scripting language for data analysis), or Apache Spark (a
fast and general-purpose data processing engine).
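To make the map-and-reduce step above concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API. This is the classic introductory example rather than anything specific to this unit; the class names and the command-line input/output paths are illustrative.

// Minimal word-count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // key-value pair: (word, 1)
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final output: (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits (word, 1) pairs for each input split, the framework groups the pairs by key, and the reducer sums the counts; the job would typically be packaged into a jar and submitted to the cluster with the hadoop jar command.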
• Hadoop's distributed nature and fault-tolerant design make it suitable for processing
and analyzing large datasets that would be impractical or impossible to handle using
traditional single-server systems.
• It is particularly popular in applications involving data warehousing, data mining,
machine learning, and real-time analytics.
• Modern big data solutions often include other tools and technologies, such as
Apache Spark, Apache HBase, and cloud-based platforms like AWS EMR and
Google Cloud Dataproc, alongside or instead of traditional Hadoop deployments.
• These technologies provide more flexibility and speed in processing large-scale data.
2.2 Why Hadoop?
Hadoop is one of the pioneering and fundamental technologies used for big data
processing. It gained popularity because it effectively addresses several challenges that
arise when dealing with massive volumes of data. Here are some reasons why Hadoop
is well-suited for big data:
1. Fault Tolerance: With Hadoop's distributed storage system (HDFS) and data
replication capabilities, data is redundantly stored across multiple nodes in the
cluster. This redundancy ensures that even if some machines fail, the data remains
accessible and processing can continue uninterrupted (a small sketch of adjusting the
replication factor follows this list).
2. Cost-Effectiveness: Hadoop runs on commodity hardware, which is less expensive
compared to specialized high-end servers. This cost-effectiveness makes it more
affordable for organizations to store and process large volumes of data without
significant infrastructure investment.
3. Flexibility: Hadoop is schema-on-read, meaning that it can handle data with varying
structures and formats without requiring strict pre-defined schemas. This flexibility
allows organizations to store and process data of different types, such as structured,
semi-structured, and unstructured data.
4. Ecosystem and Integration: Hadoop has a rich ecosystem with various tools and
frameworks that complement its capabilities. Tools like Apache Hive, Apache Pig, and
Apache Spark provide high-level abstractions and facilitate data analysis and
processing tasks, making it easier for data engineers and analysts to work with big
data.
5. Data Locality: Hadoop leverages data locality, which means that data processing
tasks are scheduled on nodes where the data is already stored. This reduces data
movement across the network, optimizing performance and reducing the strain on the
cluster.
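As a small illustration of the replication behind the Fault Tolerance point above, the sketch below reads and then changes the replication factor of a single HDFS file through the Java FileSystem API. The file path /chp/data.txt and the factor 3 are example values only; the cluster-wide default comes from the dfs.replication property in hdfs-site.xml.

// Illustrative sketch: inspecting and changing HDFS replication for one file.
// Assumes a configured Hadoop client; path and factor are example values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/chp/data.txt");

    // Current replication factor, as reported by the NameNode.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication: " + status.getReplication());

    // Request a new replication factor; the NameNode schedules the extra copies.
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}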
2.3 Why not RDBMS?
While Relational Database Management Systems (RDBMS) are highly valuable for
managing structured data and handling transactional workloads, they may not be the
best choice for big data processing. There are several reasons why RDBMS might not
be the ideal solution for big data:
• Scalability:
– RDBMS systems are not designed to handle the scale of data typically encountered in
big data scenarios.
– As data volumes grow, scaling an RDBMS becomes challenging, often requiring
expensive high-end hardware upgrades.
• Cost:
– RDBMS solutions can be costly, especially when dealing with large datasets.
Licensing fees, hardware costs, and maintenance expenses can be prohibitive
for big data workloads, particularly when compared to more cost-effective
open-source big data technologies like Hadoop and Apache Spark.
• Latency and Performance:
– RDBMS engines are tuned for transactional workloads; analytical queries over very
large tables require heavy joins and full scans, and strict consistency guarantees add
latency, so performance degrades as data volumes grow.
Besides RDBMS, several other approaches were used to scale data processing before
Hadoop, each with its own limitations:
1. Parallel Databases:
• A parallel database is one that uses multiple processors and disks working in
parallel to provide database services.
• A parallel database system seeks to improve performance by parallelizing operations
such as loading data, building indexes, and evaluating queries; parallel systems
improve processing and I/O speeds by using multiple CPUs and disks in parallel.
• Parallel databases can be applied to big data scenarios to some extent, but there are
certain limitations and challenges that might make them less suitable for certain
aspects of big data processing.
• Here are a few reasons why parallel databases might face challenges in handling certain big
data scenarios:
Scalability: While parallel databases are horizontally scalable to a certain extent, they might face
limitations when dealing with the massive scale of data that big data technologies like Hadoop
are designed to handle.
Data Variety and Flexibility: Big data often includes a variety of data types, including
structured, semi-structured, and unstructured data. Parallel databases are optimized for structured
data with well-defined schemas, which might not accommodate the flexibility required for
handling diverse data formats commonly found in big data.
2. Grid computing:
Grid computing is a distributed computing model that involves connecting multiple computers,
often geographically dispersed, to work together as a unified resource to solve complex
computational problems or perform large-scale tasks.
Unlike traditional supercomputers that consist of a single powerful machine, grid computing
harnesses the collective processing power and resources of a network of interconnected
computers, making it suitable for tackling tasks that require significant computational resources.
– However, managing the coordination, communication, and synchronization of these
distributed resources could be challenging.
– Grid computing was often used for scientific computing but lacked the specialized tools
and optimizations needed for general big data processing.
3. Custom Solutions:
• A "custom solution" in the context of distributed computing refers to a tailored or specialized
approach that an organization develops to address specific challenges or requirements related
to data processing, storage, and computation within a distributed environment.
• Unlike off-the-shelf software or standardized frameworks, a custom solution is designed and
implemented specifically to meet the unique needs of the organization's computing
infrastructure, data types, and processing workflows.
Concept and Purpose:
• A custom solution is developed when an organization's requirements cannot be fully
met by existing commercial products or open-source frameworks. This approach
allows organizations to have full control over the design, architecture, and features
of the solution, ensuring that it aligns perfectly with their business goals and
technical needs.
Challenges:
• Development Time and Cost: Developing a custom solution requires time, effort,
and financial investment in software development expertise and resources.
• Maintenance Complexity: Ongoing maintenance, updates, and support for the
custom solution can become complex, especially as technology evolves.
• Integration Issues: Integrating the custom solution with existing systems and tools
may require additional effort and compatibility testing.
2.5 Distributed computing challenges:
Before Hadoop was introduced, distributed computing for data analytics faced
several significant challenges. Some of the key challenges were:
– Scalability: Traditional databases and computing systems struggled to
handle the rapidly growing volume of data. As data sizes increased, the
single-machine architecture became a bottleneck, resulting in
performance degradation.
– Cost: Scaling up traditional hardware was expensive, and the cost of
building and maintaining large-scale computing infrastructures was
prohibitive for many organizations.
– Complexity: Writing custom code for distributed data processing was
complex and error-prone. Developers had to handle data partitioning,
distribution, fault tolerance, and synchronization manually.
– Data Heterogeneity: Data sources were diverse, with varying formats,
structures, and levels of cleanliness. Integrating and processing data from
multiple sources was a challenging task.
– Fault Tolerance: Traditional systems were not designed to handle node
failures gracefully. Recovering from failures and ensuring data integrity
was a significant concern.
– Processing Speed: Analyzing large datasets using traditional batch
processing methods was slow. Real-time or near-real-time analytics were
difficult to achieve.
– Latency: Traditional systems suffered from high latency when dealing
with big data. This latency made it challenging to gain insights quickly
from vast datasets.
– Resource Management: Optimally managing and sharing computing
resources in a distributed environment was challenging. Efficiently
allocating resources for different tasks was not straightforward.
• Hadoop, introduced by Doug Cutting and Mike Cafarella in 2005, addressed
many of these challenges. It brought the Hadoop Distributed File System
(HDFS) for efficient data storage and replication across nodes, and the
MapReduce programming model for parallel data processing.
• Hadoop's design allowed for distributed processing of large datasets across a
cluster of commodity hardware, enabling organizations to scale their data
analytics capabilities cost-effectively.
• Additionally, its fault tolerance mechanisms and automatic data replication
provided a higher level of data integrity and reliability in the face of
hardware failures.
• With Hadoop, developers could write data processing jobs using the
MapReduce model, abstracting away much of the complexities of distributed
computing.
• This allowed organizations to harness the power of distributed computing
without having to deal with the low-level intricacies.
• As a result, Hadoop became a cornerstone technology in the big data
ecosystem and revolutionized how organizations approached data analytics
on a large scale.
2.6 Hadoop history:
• Hadoop is an open-source framework for distributed storage and processing
of large-scale data. It was inspired by two papers published by Google: "The Google
File System" (2003) and "MapReduce: Simplified Data Processing on Large Clusters"
(2004), which described Google's proprietary infrastructure for handling vast amounts
of data in a scalable and fault-tolerant manner.
• The origins of Hadoop can be traced back to Doug Cutting and Mike Cafarella's
work on the Apache Nutch web-crawler project. In 2005 they built an open-source
implementation of Google's MapReduce and distributed file system within Nutch,
and in 2006 Cutting joined Yahoo!, where the work was spun out as a separate
project named Hadoop.
• The name "Hadoop" itself comes from Doug Cutting's son's toy elephant,
which he named Hadoop. The toy elephant became the project's logo and
symbolized the idea of handling large amounts of data.
• The development of Hadoop progressed rapidly; it became an Apache subproject in
2006 and a top-level Apache Software Foundation project in 2008.
• Hadoop's MapReduce model enables parallel processing of data by dividing
the input into smaller chunks and distributing them across the cluster for
processing. The results are then aggregated to produce the final output. This
approach makes it possible to process massive datasets that are too large to
be handled by a single machine.
• Hadoop's design and architecture have made it a fundamental component in
the big data ecosystem. It has been widely adopted by various industries for
data-intensive tasks, such as data warehousing, data processing, log
processing, and machine learning.
• Over the years, Hadoop has continued to evolve and improve, with many
organizations contributing to its development. However, as the big data
landscape has evolved, other distributed data processing frameworks like
Apache Spark have gained popularity due to their faster performance and
more flexible APIs.
• Nonetheless, Hadoop remains an essential technology for managing and
processing vast amounts of data in many organizations around the world,
playing a crucial role in the big data revolution.
2.9 Hadoop distributors:
• Hadoop distributors are companies or organizations that provide a packaged version
of the open-source Apache Hadoop software, along with additional tools, services,
and support to make it easier for enterprises to deploy, manage, and utilize Hadoop
in their data infrastructure.
• These distributions aim to simplify the adoption of Hadoop and related big data
technologies by offering pre-configured solutions that include various Hadoop
ecosystem components and integration with other data management and analytics
tools.
2.10 HDFS:
Functions of HDFS NameNode
• It executes the file system namespace operations like opening, renaming, and
closing files and directories.
• NameNode manages and maintains the DataNodes.
• NameNode receives heartbeats and block reports from all DataNodes, which confirm
that the DataNodes are alive.
• If the DataNode fails, the NameNode chooses new DataNodes for new replicas.
Functions of DataNode
• DataNode is responsible for serving the client read/write requests.
• Based on instructions from the NameNode, DataNodes perform block creation,
replication, and deletion.
• DataNodes send a heartbeat to NameNode to report the health of HDFS.
• DataNodes also send block reports to the NameNode, listing the blocks they contain.
1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
Eg. hdfs dfs -mkdir /chp
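The same kind of operation can also be performed programmatically. The sketch below uses the HDFS Java FileSystem API to create the same /chp directory, copy in a local file, and list the directory contents; the local file name data.txt is a placeholder, and a configured Hadoop client (core-site.xml pointing at the cluster) is assumed.

// Minimal sketch of HDFS operations via the Java FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

    Path dir = new Path("/chp");                // same directory as the shell example
    fs.mkdirs(dir);                             // equivalent of: hdfs dfs -mkdir /chp

    // Copy a local file into the new directory (local file name is a placeholder).
    fs.copyFromLocalFile(new Path("data.txt"), dir);

    // List the directory contents, similar to: hdfs dfs -ls /chp
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.close();
  }
}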
Apache Spark and Apache Hadoop are both powerful frameworks for processing large-scale
data, but they have different strengths and use cases. Here are some advantages of using
Apache Spark over Hadoop:
• In-Memory Processing: Spark is designed for in-memory data processing, which can
significantly speed up data analytics tasks compared to Hadoop's disk-based MapReduce.
By keeping intermediate data in memory, Spark reduces the need for time-consuming
disk I/O operations (a short sketch of this appears at the end of this list).
• Ease of Use: Spark provides high-level APIs in multiple programming languages (Scala,
Java, Python, and R), making it more accessible to developers. The API includes built-in
libraries for machine learning (MLlib), graph processing (GraphX), and SQL-based
queries (Spark SQL), simplifying complex data processing tasks.
• Iterative Processing: Spark is well-suited for iterative algorithms commonly used in
machine learning and graph processing. It can cache data in memory between iterations,
eliminating the need to read data from storage repeatedly, which is a limitation of
Hadoop's disk-based MapReduce.
• Real-Time Processing: Spark Streaming, a part of Spark, allows for real-time data
processing and analytics. You can process data streams and make real-time
decisions, whereas Hadoop primarily focuses on batch processing.
• Unified Platform: Spark provides a unified platform for batch processing,
interactive queries, machine learning, and graph processing. In contrast, Hadoop
often requires integrating multiple tools (e.g., MapReduce, Hive, Pig) for different
tasks, leading to complexity.
• Lazy Evaluation: Spark uses lazy evaluation, meaning it doesn't execute operations
until an action is called. This optimization allows Spark to reduce unnecessary
computations and improve performance.
• Integration with Hadoop: Spark can run on Hadoop clusters, and it can read data
from Hadoop Distributed File System (HDFS). This means you can leverage existing
Hadoop investments while taking advantage of Spark's benefits.
• Advanced Analytics: Spark's MLlib library offers scalable machine learning
algorithms, making it a natural choice for building and deploying machine learning
models alongside data processing.
• Interactive Data Analysis: Spark SQL allows users to perform interactive SQL
queries on data stored in Spark, enabling data analysts and data scientists to explore
data more easily.
• Community and Ecosystem: Spark has a rapidly growing and active community,
leading to a rich ecosystem of libraries and tools that extend its capabilities. Hadoop
has a strong community as well, but Spark's growth has been notable.
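To tie the in-memory processing, lazy evaluation, and Spark SQL points together, here is a short, illustrative Spark sketch using the Java API. The input path hdfs:///data/events.json and the column name "category" are placeholders, not part of the course material.

// Illustrative Spark sketch: lazy transformations, in-memory caching, and Spark SQL.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("SparkSketch")
        .getOrCreate();

    // Transformations are lazy: nothing is read or computed yet.
    Dataset<Row> events = spark.read().json("hdfs:///data/events.json");
    Dataset<Row> filtered = events.filter("category IS NOT NULL");

    // cache() keeps the filtered data in memory for reuse across multiple actions.
    filtered.cache();

    // Actions trigger execution of the whole lineage.
    long total = filtered.count();
    System.out.println("Rows: " + total);

    // Spark SQL over the same in-memory data.
    filtered.createOrReplaceTempView("events");
    spark.sql("SELECT category, COUNT(*) AS cnt FROM events GROUP BY category")
         .show();

    spark.stop();
  }
}

Nothing is read from HDFS until count() or show() is called, and because the filtered data is cached, the SQL query runs against the in-memory copy rather than re-reading the source files.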