
Introduction to Big Data

Unit - I
What is Big Data?
Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. Big data is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.
"Big data refers to massive, complex, structured and unstructured data sets that are rapidly generated and transmitted from a wide variety of sources."

Examples of Big Data


Social Media: More than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments, etc.
Stock Market: A stock exchange is another example of Big Data; it generates about one terabyte of new trade data per day.
Types of Big Data
Following are the types of Big Data:
● Structured
● Unstructured
● Semi-structured
Structured
Any data that can be stored, accessed and processed in the form of fixed format is termed as a ‘structured’
data. Over the period of time, talent in computer science has achieved greater success in developing
1
techniques for working with such kind of data (where the format is well known in advance) and also
deriving value out of it. However, nowadays, we are foreseeing issues when a size of such data grows to
a huge extent, typical sizes are being in the rage of multiple zettabytes. ‘Employee’ table in a database is
an example of Structured Data

Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Organizations today have a wealth of data available to them but, unfortunately, they do not know how to derive value from it because the data is in its raw, unstructured form. The output returned by a Google search is an example of unstructured data.

Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
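For instance, a minimal illustrative fragment (the element names are invented for this example) shows how XML carries its own tags as loose structure without a fixed relational schema:

    <employee>
        <name>Asha</name>
        <department>Sales</department>
        <skills>
            <skill>SQL</skill>
            <skill>Python</skill>
        </skills>
    </employee>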

Characteristics of Big Data
Big data can be described by the following characteristics:
● Volume
● Variety
● Velocity
● Variability
Volume: The name Big Data itself relates to an enormous size. The size of data plays a crucial role in determining its value, and whether a particular data set can actually be considered Big Data depends on its volume. Hence, 'Volume' is one characteristic that needs to be considered when dealing with Big Data solutions.
Variety: The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses issues for storing, mining and analyzing data.
Velocity: The term 'velocity' refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines its real potential. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Variability: This refers to the inconsistency that data can show at times, which hampers the process of handling and managing the data effectively.

Advantages of Big Data
✔ Businesses can utilize outside intelligence while making decisions
✔ Improved customer service
✔ Early identification of risks to products/services, if any
✔ Better operational efficiency
Why Big Data is Important?
Big Data refers to massive amounts of data produced by different sources such as social media platforms, web logs, sensors, IoT devices, and many more. It can be structured (like tables in a DBMS), semi-structured (like XML files), or unstructured (like audio, video and images). Traditional database management systems are not able to handle this vast amount of data. Big Data helps companies generate valuable insights. Companies use Big Data to refine their marketing campaigns and techniques, and they use it in machine learning projects to train models, in predictive modeling, and in other advanced analytics applications. Big Data initiatives were rated as "extremely important" by 93% of companies. Leveraging a Big Data analytics solution helps organizations unlock strategic value and take full advantage of their assets. It helps organizations:
✔ Understand where, when and why their customers buy
✔ Protect the company's client base with improved loyalty programs
✔ Seize cross-selling and upselling opportunities
✔ Provide targeted promotional information
✔ Optimize workforce planning and operations
✔ Eliminate inefficiencies in the company's supply chain
✔ Predict market trends
✔ Predict future needs
✔ Make companies more innovative and competitive
✔ Discover new sources of revenue
Importance of Big Data
Big Data's importance lies in how a company utilizes the gathered data. Every company uses its collected data in its own way; the more effectively a company uses its data, the more rapidly it grows. Companies in the present market need to collect and analyze data because:
● Cost Savings: Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help organizations in
identifying more effective ways of doing business.
● Time-Saving: Real-time in-memory analytics helps companies to collect data from various
sources. Tools like Hadoop help them to analyze data immediately thus helping in making quick
decisions based on the learnings.
● Understand the Market Conditions: Big Data analysis helps businesses get a better understanding of market conditions. For example, analyzing customer purchasing behavior helps companies identify the products that sell most and produce those products accordingly. This helps companies get ahead of their competitors.
● Social Media Listening: Companies can perform sentiment analysis using Big Data tools. These
enable them to get feedback about their company, that is, who is saying what about the company.
Companies can use big data tools to improve their online presence.
● Boost Customer Acquisition and Retention: Customers are a vital asset on which any business depends. No business can succeed without building a robust customer base, but even with a solid customer base, companies cannot ignore the competition in the market.
● Solve Advertisers' Problems and Offer Marketing Insights: Big data analytics shapes all business operations. It enables companies to meet customer expectations, helps in changing the company's product line, and ensures powerful marketing campaigns.
● Driver of Innovation and Product Development: Big data makes companies capable of innovating and redeveloping their products.
A Brief History of Hadoop
Hadoop is an open-source framework introduced by the Apache Software Foundation, written in Java, for storing and processing huge datasets on clusters of commodity hardware. There are mainly two problems with big data: the first is storing such a huge amount of data, and the second is processing the stored data. Traditional approaches such as RDBMS are not sufficient due to the heterogeneity of the data. Hadoop emerged as the solution to the problem of big data, i.e., storing and processing big data with some extra capabilities.
● In 2002, Doug Cutting and Mike Cafarella started to work on the Apache Nutch project. After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which was very expensive.
● In 2003, they came across a paper published by Google describing the architecture of Google's distributed file system, GFS (Google File System), for storing large data sets. But this paper was only half the solution to their problem.
● In 2004, Google published another paper, on the MapReduce technique, which was the solution for processing those large datasets. This paper was the other half of the solution for Doug Cutting and Mike Cafarella's Nutch project. Both techniques (GFS and MapReduce) existed only as papers; Google had not released an implementation of either.
● In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System), which was limited to clusters of 20 to 40 nodes.
● In 2006, Doug Cutting joined Yahoo, taking the Nutch project with him, and introduced a new open-source, reliable, scalable computing framework that he named Hadoop, after a yellow toy elephant owned by his son. Both techniques (GFS and MapReduce) were integrated, and the first version, Hadoop 0.1.0, was released.
● In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
● In 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software Foundation). Hadoop became the fastest system to sort 1 TB of data, doing so in 209 seconds on a 900-node cluster. Later, the Apache Software Foundation successfully tested a 4000-node cluster with Hadoop.
● In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. Doug Cutting then left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other industries.
● In 2011, the Apache Software Foundation released Apache Hadoop version 1.0.
● In 2013, Apache Hadoop Version 2.0 was available.
● In 2017, Apache Hadoop version 3.0 was released.
● In 2018, Apache Hadoop version 3.1 released.
Apache Hadoop
Apache Hadoop is an open-source framework used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. It can easily handle large amounts of data at low cost on a cluster of simple hardware, and it is scalable. Hadoop is not only a storage system; data can also be processed using it. Apache Hadoop is intended to make interaction with big data easier.
Apache Hadoop is a Java-based software platform that manages data processing and storage for big data applications. It comprises two main sub-projects (HDFS and MapReduce). Hadoop works by distributing large data sets and analytics jobs across the nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process structured and unstructured data and scale up reliably from a single server to thousands of machines.
Features of Hadoop
1. Open Source: Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to understand or modify as per their industry requirements.
2. Highly Scalable Cluster: Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. Traditional RDBMS (Relational Database Management Systems) cannot be scaled to handle such large amounts of data.
3. Fault Tolerance is Available: Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures that data remains available even if one of your systems crashes. If one machine faces a technical issue, the same data can still be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default. By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed via the replication property in the hdfs-site.xml file.
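For illustration, the replication factor can be lowered or raised with an entry like the following in hdfs-site.xml (a minimal sketch; the value 2 here is only an example):

    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>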
4. High Availability is Provided: Fault tolerance provides high availability in the Hadoop cluster. High availability means the availability of data on the Hadoop cluster: thanks to fault tolerance, if any DataNode goes down, the same data can be retrieved from any other node where it is replicated. A highly available Hadoop cluster also has two or more NameNodes, i.e., an active NameNode and a passive NameNode (also known as a standby NameNode). If the active NameNode fails, the passive node takes over its responsibility and serves the same data to the user.
5. Cost-Effective: Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with big data. The problem with traditional relational databases is that storing massive volumes of data is not cost-effective, so companies started discarding raw data, which may not reflect the true picture of their business. Hadoop therefore provides two main cost benefits: it is open-source (free to use), and it runs on inexpensive commodity hardware.
6. Hadoop Provides Flexibility: Hadoop is designed to deal with any kind of dataset, such as structured (MySQL data), semi-structured (XML, JSON) and unstructured (images and videos), very efficiently. It can process any kind of data regardless of its structure, which makes it highly flexible. This is very useful for enterprises: they can easily process large datasets, so businesses can use Hadoop to derive valuable insights from sources such as social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.
7. Easy to Use: Hadoop is easy to use since developers need not worry about the distributed processing work; Hadoop manages it itself. The Hadoop ecosystem is also very large and comes with many tools such as Hive, Pig, Spark, HBase, Mahout, etc.
8. Hadoop uses Data Locality: The concept of data locality is used to make Hadoop processing fast. With data locality, the computation logic is moved near the data rather than moving the data to the computation logic. Moving data within HDFS is costly; with data locality, bandwidth utilization in the system is minimized.
9. Provides Faster Data Processing: Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. In a distributed file system, a large file is broken into small blocks that are distributed among the nodes of the Hadoop cluster. Because this massive number of file blocks is processed in parallel, Hadoop is fast and provides high-level performance compared with traditional database management systems.
Why we should Use Hadoop?
✔ Hadoop solutions are very popular; they have captured at least 90% of the big data market, so Hadoop is well suited for big data.
✔ It has some unique features.
✔ It is scalable.
✔ Its solutions are fault tolerant.
✔ It is flexible, because data can be stored in structured, unstructured and semi-structured form.
Hadoop Architecture

Hadoop has a Master-Slave Architecture for data storage and distributed data processing using
MapReduce and HDFS methods.
NameNode: The NameNode represents every file and directory used in the namespace.
o Also called the MasterNode.
o It maintains and manages the DataNodes.
o Records metadata such as the location of stored blocks, file sizes, permissions, hierarchy, etc.
o Receives heartbeats and block reports from all the DataNodes.
DataNode: The DataNode manages the state of an HDFS node and allows you to interact with the blocks.
o Also called a SlaveNode.
o Stores the actual data.
o Serves read and write requests.
MasterNode: The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
SlaveNode: The slave nodes are the additional machines in the Hadoop cluster that allow you to store data and conduct complex calculations. Moreover, every slave node comes with a TaskTracker and a DataNode, which synchronize its processes with the JobTracker and the NameNode respectively.
Hadoop Ecosystem
The Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open-source projects as well as a complete range of complementary tools. It is a platform or suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. Some of the best-known tools of the Hadoop ecosystem are HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase, Oozie, Sqoop, Zookeeper, Ambari, etc.
HDFS
Hadoop Distributed File System (HDFS) is one of the largest Apache projects and the primary storage system of Hadoop. It employs a NameNode and DataNode architecture. It is a distributed file system able to store large files across a cluster of commodity hardware. HDFS consists of two core components, the NameNode and the DataNode. The NameNode is the prime node; it holds the metadata (data about data) and requires comparatively fewer resources than the DataNodes, which store the actual data.
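As a quick illustration, a few common HDFS shell commands look like the following (the paths and file name are hypothetical):

    hdfs dfs -mkdir -p /user/cloudera/data      # create a directory in HDFS
    hdfs dfs -put sales.csv /user/cloudera/data # copy a local file into HDFS
    hdfs dfs -ls /user/cloudera/data            # list the directory contents
    hdfs dfs -cat /user/cloudera/data/sales.csv # print the file stored in HDFS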
Hive
Hive is an ETL and data warehousing tool used to query and analyze large datasets stored within the Hadoop ecosystem. Hive has three main functions: data summarization, querying, and analysis of unstructured and semi-structured data in Hadoop. It features a SQL-like interface: its HQL language works much like SQL and automatically translates queries into MapReduce jobs.
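As a rough sketch, a simple HQL query can be run from the command line with hive -e; the table and column names below are hypothetical:

    hive -e "SELECT department, COUNT(*) AS emp_count
             FROM employees
             GROUP BY department;"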
Apache Pig
This is a high-level scripting language used to execute queries over large datasets within Hadoop. Pig's simple SQL-like scripting language is known as Pig Latin, and its main objective is to perform the required operations and arrange the final output in the desired format.
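A minimal Pig Latin sketch, assuming a comma-separated input file with the hypothetical fields shown, could be saved as filter_logs.pig and run with pig -x local filter_logs.pig:

    -- load a comma-separated file into a relation with named, typed fields
    logs = LOAD '/user/cloudera/logs.csv' USING PigStorage(',')
           AS (user:chararray, bytes:int);
    -- keep only the rows where more than 1000 bytes were transferred
    big = FILTER logs BY bytes > 1000;
    -- print the result to the console
    DUMP big;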
MapReduce
This is another data processing layer of Hadoop. It has the capability to process large structured and unstructured data as well as to manage very large data files in parallel by dividing a job into a set of independent tasks (sub-jobs). MapReduce makes use of two functions, Map() and Reduce():
Map(): performs sorting and filtering of the data, organizing it in the form of intermediate key-value pairs.
Reduce(): as the name suggests, performs summarization by aggregating the mapped data.
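As a concrete illustration, the word-count example program that ships with Hadoop can be run from the command line roughly as follows (the jar path varies by distribution, and the input/output paths are hypothetical):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount /user/cloudera/input /user/cloudera/wc_output  # map: emit (word, 1); reduce: sum counts per word
    hdfs dfs -cat /user/cloudera/wc_output/part-r-00000          # view the aggregated word counts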
YARN
YARN stands for Yet Another Resource Negotiator, but it is commonly referred to by the acronym alone. It is one of the core components of open-source Apache Hadoop and is suitable for resource management. It is responsible for managing workloads, monitoring, and implementing security controls. It also allocates system resources to the various applications running in a Hadoop cluster and assigns which tasks should be executed by each cluster node. YARN has two main components:
Resource Manager
Node Manager
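For instance, the YARN command-line client can be used to inspect a running cluster (a sketch; the output depends on the cluster):

    yarn node -list          # show the NodeManagers registered with the ResourceManager
    yarn application -list   # show applications currently running on the cluster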
Spark
Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways; it offers APIs in Java, Python, Scala, and R, and supports SQL, streaming data, machine learning, and graph processing, which can be used together in an application.
HBase
It is a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides capabilities similar to Google's Bigtable and is therefore able to work on big data sets effectively.

Sqoop
Ingesting data is an important part of the Hadoop ecosystem, and Sqoop provides data integration services. Sqoop can import structured data from an RDBMS or enterprise data warehouse into HDFS, and export it back again, whereas Flume only ingests unstructured or semi-structured data into HDFS.
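A hedged example of a Sqoop import follows; the JDBC URL, database, table and target directory are all hypothetical:

    # -P prompts for the database password; names and paths here are examples only
    sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username dbuser -P \
        --table customers \
        --target-dir /user/cloudera/customers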
Zookeeper
There was a huge issue with coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie
Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit. There are two kinds of jobs, i.e.
1. Oozie workflow jobs
2. Oozie coordinator jobs
Ambari
Ambari is a web-based tool and an Apache Software Foundation project that aims at making the Hadoop ecosystem more manageable. It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
Linux Refresher (Overview of Linux)
LINUX is an operating system, or more precisely a kernel, distributed under an open-source license. Its functionality is quite similar to UNIX's. The kernel is the program at the heart of the Linux operating system that takes care of fundamental tasks, such as letting hardware communicate with software.
Basic Features of Linux
Following are some of the important features of Linux Operating System.
● Portable: Portability means the software can work on different types of hardware in the same way. The Linux kernel and application programs support installation on any kind of hardware platform.
● Open Source: The Linux source code is freely available and Linux is a community-based development project. Multiple teams work in collaboration to enhance the capabilities of the Linux operating system, and it is continuously evolving.
● Multi-User: Linux is a multi-user system, meaning multiple users can access system resources like memory/RAM/application programs at the same time.
● Multiprogramming: Linux is a multiprogramming system, meaning multiple applications can run at the same time.
● Hierarchical File System: Linux provides a standard file structure in which system files and user files are arranged.
● Shell: Linux provides a special interpreter program that can be used to execute commands of the operating system. It can be used to perform various types of operations, call application programs, etc.

● Security: Linux provides user security using authentication features like password protection, controlled access to specific files, and encryption of data.
Architecture

The architecture of a Linux system consists of the following layers:

● Hardware layer: Hardware consists of all peripheral devices (RAM, HDD, CPU, etc.).
● Kernel: The core component of the operating system; it interacts directly with the hardware and provides low-level services to the upper-layer components.
● Shell: An interface to the kernel that hides the complexity of the kernel's functions from users. The shell takes commands from the user and executes the kernel's functions.
● Utilities: Utility programs that provide the user with most of the functionality of an operating system.

Linux Commands
o pwd: The pwd command is used to display the location of the current working directory.
Syntax:pwd

o mkdir: The mkdir command is used to create a new directory under any directory.
Syntax: mkdir <directory name>

o rmdir: The rmdir command is used to delete a directory.


Syntax: rmdir <directory name>

o ls: The ls command is used to display a list of the contents of a directory.

Syntax: ls

o cd: The cd command is used to change the current directory.


Syntax: cd <directory name>

o touch: The touch command is used to create empty files. We can create multiple empty files with a single invocation.
Syntax: touch <file name>
touch <file1> <file2> ....

o cat: The cat command is a multi-purpose utility in the Linux system. It can be used to create a file, display the contents of a file, copy the contents of one file to another file, and more.
Syntax: cat [OPTION]... [FILE]..
To create a file, execute it as follows:
cat > <file name>
// Enter file content
Press "CTRL+ D" keys to save the file. To display the content of the file, execute it as follows:
cat <file name>

o rm: The rm command is used to remove a file.
Syntax: rm <file name>

o cp:The cp command is used to copy a file or directory.


Syntax: To copy in the same directory:
cp <existing file name> <new file name>
To copy in a different directory:
cp <existing file name> <directory path>
o mv: The mv command is used to move a file or a directory from one location to another.
Syntax: mv <file name> <directory path>
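Putting these commands together, a short practice session might look like the following (the file and directory names are arbitrary):

    pwd                    # show the current working directory
    mkdir demo             # create a new directory
    cd demo                # move into it
    touch a.txt b.txt      # create two empty files in one go
    cat > a.txt            # type some text, then press CTRL+D to save
    cat a.txt              # display the contents of the file
    cp a.txt copy.txt      # copy the file within the same directory
    mv copy.txt /tmp       # move the copy to another location
    rm a.txt b.txt         # remove the remaining files
    cd ..                  # go back to the parent directory
    rmdir demo             # delete the now-empty directory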

VMWare Installation of Hadoop


VMware Player is a free desktop application from a company called VMware that runs on Windows and Linux. This application enables you to create, configure and run virtual machines. A virtual machine allows you to run one operating system emulated within another operating system.
Installation steps
1. Download the “VMware player” from the link https://www.vmware.com and install it.

2. Download the "Cloudera setup file" from the link https://ccp.cloudera.com and extract
the zipped file onto your hard drive.

Scroll down and select Accept button.

3. Start VMware Player and click Open a Virtual Machine.

Browse to the extracted folder.

The screen below is then displayed.

It will take a couple of minutes to start. The login credentials are:

Machine login credentials: Username: cloudera, Password: cloudera
Cloudera Manager credentials: Username: admin, Password: admin
Click on the black box shown in the image below to start the terminal.

4. Checking your Hadoop cluster
Type: sudo su hdfs

Execute a command, e.g.: hadoop fs -ls /

5. Download the list of Hadoop commands from the link https://hadoop.apache.org and
try them out.
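A few simple checks (a sketch; the exact output depends on the Cloudera release) can confirm that the cluster is working:

    hadoop version                       # print the installed Hadoop version
    hdfs dfs -ls /                       # list the root of the Hadoop file system
    hdfs dfs -mkdir /user/cloudera/test  # create a test directory in HDFS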
Meet Hadoop Data
Big data is nothing but a collection of large and complex datasets which are difficult to store and process using available data management tools or traditional data processing tools. Big data is a collection of structured, unstructured and semi-structured data, and storing and processing it in one place is difficult. When maintaining such data in one dataset, the following problems occur.
Problems with Big Data
● Data storage and analysis (storing huge and exponentially growing datasets)
● Comparison with other systems (processing data having a complex structure)
● Grid computing (binding huge amounts of data to computational units)
Data storage and analysis: Big data storage is a compute-and-storage architecture that collects and manages large data sets and enables real-time data analytics. Companies apply big data analytics to extract greater intelligence from metadata. In most cases, big data storage uses low-cost hard disk drives, although moderating prices for flash have opened the door to using flash in servers and storage systems as the foundation of big data storage. These systems can be all-flash or hybrids mixing disk and flash storage.
Comparison with other systems: Big data allows any kind of data, be it structured, unstructured or semi-structured. These are all stored in different ways, and processing them is complex because they do not share a single format; in other words, the data being processed has a complex structure.
Grid computing: Grid computing is a processor architecture that combines computer resources from various domains to reach a main objective.

To overcome the above problems of big data, users turn to Hadoop, because Hadoop provides two sub-projects: HDFS and MapReduce.
HDFS
✔ Storage unit of Hadoop.
✔ A distributed file system.
✔ Divides files into smaller chunks and stores them across the cluster.
✔ Scales horizontally as per requirement.
✔ Stores any kind of data.
✔ No schema validation is done while dumping data.
✔ Stores the data in the form of blocks.
✔ Block size can be configured based on requirements.
MapReduce
✔ Processing unit of Hadoop.
✔ Processes data in parallel.
✔ Splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner.
Features of Hadoop
Hadoop data has the following features:
Suitable for big data analysis: Big data tends to be distributed and unstructured in nature, so Hadoop clusters are best suited for analysis of big data. Since it is the processing logic that flows to the computing nodes, less bandwidth is consumed.
Scalability: Hadoop clusters can easily be scaled to any extent by adding additional cluster nodes and
thus allows for growth of big data.
Fault tolerance: The Hadoop ecosystem has a provision to replicate the input data onto other cluster nodes. In the event of a cluster node failure, data processing can still proceed by using the data stored on another cluster node.

