BDA UNIT - 1
• Big Data Analytics is “the process of examining large data sets containing a variety
of data types – i.e., Big Data – to uncover hidden patterns, unknown correlations,
market trends, customer preferences, and other useful information that can help
organizations make informed business decisions.”
• Big Data Analytics often delivers several business benefits, including more effective
marketing campaigns, the discovery of new revenue opportunities, improved
customer service delivery, more efficient operations, and competitive advantages.
• Big Data Analytics gives analytics professionals, such as data scientists and
predictive modelers, the ability to analyze Big Data from multiple and varied sources,
including transactional data and other structured data.
How is Big Data actually used?
4. The emphasis is on high throughput of data access rather than low latency of data
access
Cont..
4. Hadoop provides an interface for many applications to move themselves closer to where
the data is located
5. After Google published technical papers detailing its Google File System (GFS) and
MapReduce programming framework in 2003 and 2004, respectively, Cutting and Cafarella
modified their earlier technology plans and developed a Java-based MapReduce implementation
and a file system modeled on Google's, which together became the “Hadoop Framework”
6. Hadoop is an Apache top-level project being built and used by a global community
of contributors and users. It is licensed under the Apache License 2.0.
Features (or) Benefits of Hadoop
1. It is completely open source and written in Java
2. Highly Scalable: A Hadoop cluster is scalable, meaning we can add any number of nodes
(horizontal scaling) or increase the hardware capacity of individual nodes (vertical scaling) to
achieve higher computation power. This provides horizontal as well as vertical scalability to
the Hadoop framework.
3. Computing power: It uses a distributed computing framework designed to provide
rapid data access across the nodes in a cluster
4. Fault-tolerant:
– It provides fault-tolerant capabilities so applications can continue to run if individual
nodes fail.
– HDFS in Hadoop 2.0 uses a replication mechanism to provide fault tolerance. It creates
a replica of each block on different machines depending on the replication factor (by
default, it is 3). So if any machine in a cluster goes down, data can be accessed from the
other machines containing a replica of the same data.
5. Cost effective: It doesn't require expensive, highly reliable hardware or networks, i.e. it runs on
clusters of commodity servers and can scale up to support thousands of hardware
nodes and massive amounts of data.
6. Faster in Data Processing: Hadoop stores data in a distributed fashion, which
allows any kind of data (structured, semi-structured or unstructured) to be processed
in parallel across a cluster of nodes. This gives the Hadoop framework its very fast
processing capability
7. High availability:
– This feature of Hadoop ensures the high availability of the data, even in
unfavorable conditions
– Due to the fault tolerance feature of Hadoop, if any of the DataNodes goes down,
the data remains available to the user from other DataNodes containing a copy of the
same data.
Cont..
8. Ensures Data Reliability:
Due to the replication of data across the cluster, data is stored reliably on the
cluster machines despite machine failures.
9. Used for batch processing (not for online/real-time analytical processing)
10. Flexibility:
Store any amount of any kind of data
11. Data Locality concept:
Hadoop is popularly known for its data locality feature, which means moving the computation
logic to the data rather than moving the data to the computation logic. This feature of
Hadoop reduces the bandwidth utilization in the system.
Hadoop Ecosystem
Following are the components that collectively form a Hadoop ecosystem:
HDFS:
•HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes, and it thereby maintains
the metadata in the form of log files.
•HDFS consists of two core components i.e.
• Name node
• Data Node
•Name Node is the prime node which contains metadata (data about data), requiring
comparatively fewer resources than the data nodes that store the actual data. These data nodes
are commodity hardware in the distributed environment, undoubtedly making Hadoop cost
effective.
•HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
YARN:
•Yet Another Resource Negotiator, as the name implies, YARN helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.
•Consists of three major components i.e.
• Resource Manager
• Node Manager
• Application Manager
•The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and Node
Manager and performs negotiations as per the requirement of the two.
MapReduce:
•By making use of distributed and parallel algorithms, MapReduce makes it possible
to carry the processing logic over to the data and helps to write applications which transform big
data sets into manageable ones.
•MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are
(a minimal Java sketch follows this list):
• Map() performs sorting and filtering of data, thereby organizing it in the
form of groups. Map generates key-value pair based results which are later
processed by the Reduce() method.
• Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as input
and combines those tuples into a smaller set of tuples.
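The following is a minimal Java sketch, not part of the original slides, of the word count pattern that the Map() and Reduce() descriptions above refer to, written against Hadoop's org.apache.hadoop.mapreduce API. The class names TokenizerMapper and IntSumReducer are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): tokenizes each input line and emits a (word, 1) key-value pair per word.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);            // key-value pair handed to the framework
        }
    }
}

// Reduce(): receives (word, [1, 1, ...]) and summarizes it into a smaller set of tuples.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // aggregated count for this word
    }
}

Such a pair of classes is wired into a Job and submitted to the cluster; the framework performs the grouping and sorting of the mapped key-value pairs between the two phases.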
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
•It is a platform for structuring the data flow, processing and analyzing huge data sets.
•Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
•The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the
way Java runs on the JVM.
•Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:
•With the help of an SQL-like methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query
Language).
•It is highly scalable, as it supports both batch processing and interactive querying.
Also, all the SQL data types are supported by Hive, thus making query processing
easier.
•Similar to other query processing frameworks, HIVE comes with two
components: JDBC Drivers and the HIVE Command Line (a connection sketch follows this list).
•JDBC, along with ODBC drivers, works on establishing the data storage permissions
and connection, whereas the HIVE Command Line helps in the processing of queries.
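As a rough illustration of the JDBC driver mentioned above (not from the original slides), the sketch below opens a connection to a HiveServer2 instance and runs an HQL query through the standard java.sql API. The hostname, port 10000, database name and the employees table are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connection URL is an assumption: local HiveServer2 on its usual port, 'default' database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // HQL looks like SQL; the 'employees' table is hypothetical.
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}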
Spark:
•It's a platform that handles all the process-consumptive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, and visualization, etc.
•It consumes in-memory resources, hence being faster than the prior frameworks in terms of
optimization.
•Spark is best suited for real-time data whereas Hadoop is best suited for structured data or
batch processing, hence both are used in most of the companies interchangeably.
Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:
•Solr, Lucene: These are two services that perform the task of searching and indexing with
the help of some Java libraries. Lucene in particular is a Java library that also provides a spell check
mechanism; Solr is built on top of Lucene.
•Zookeeper: There was a huge issue of management, coordination and synchronization among
the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper
overcame all these problems by performing synchronization, inter-component communication,
grouping, and maintenance.
•Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator
jobs. Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas
Oozie coordinator jobs are triggered when some data or an external stimulus is given to
them.
Components of Hadoop
The Hadoop ecosystem is a platform or a suite which provides various services to solve
big data problems. It includes Apache projects and various commercial tools and
solutions.
Benefits:
Resource Management and Accessibility
Fault Tolerance
Workload Management
Difference between Distributed System and Centralized
System
Google File System(GFS) Introduction
• Google File System is a proprietary distributed file system developed by Google Inc. for
its own use.
• It is designed to provide efficient, reliable data access using large clusters of commodity
hardware
• GFS is made up of several storage systems built from low-cost commodity hardware
components
• GFS was implemented especially for meeting the rapidly growing demands of
Google’s data processing needs.
• A new version of the Google File System, code-named Colossus, was released
in 2010
What is GFS
• Google File System is essentially a distributed file storage system.
• In any given cluster of the Google File System, there can be hundreds (or) thousands of commodity servers
• This cluster provides an interface for any number of clients to read from a file or write into a file.
Design Considerations
1. Commodity Hardware:
Google was still a young company. Instead of buying expensive servers, they
chose to buy off-the-shelf commodity hardware, firstly because it is cheap and
secondly because, using a lot of such servers, they could scale horizontally given
the right software layer created on top of it.
2. Large Files:
The second design consideration is that the Google File System is optimized to store
and read large files.
A typical file in GFS ranges from 100 MB to multiple GB.
3. File Operations:
The third design consideration of GFS concerns two kinds of file
operations:
• Writes to any file (generally appends only), with no random writes in the file
• Sequential reads on the file
4. Chunks
– A single file is not stored on a single server. It is subdivided into multiple
chunks, and each chunk is 64 MB. For example, a 1 GB file is stored as 16 chunks of
64 MB each, spread across different chunkservers.
GFS ChunkServers
• Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-
MB file chunks. The chunkservers don't send chunks to the master server. Instead,
they send requested chunks directly to the client.
• Neither the client nor the chunkserver caches file data. Chunkservers need not
cache file data because chunks are stored as local files, and so Linux's buffer cache
already keeps frequently accessed data in memory.
• If a chunkserver is down, the master ensures all chunks that were on it are copied to
other servers.
• This ensures replica counts remain the same
GFS Features Include:
• Fault tolerance
• Critical data replication
• Automatic and efficient data recovery
• High aggregate throughput
• Reduced client and master interaction because of the large chunk size
• Namespace management and locking
• High availability
Hadoop Distributed File System(HDFS)
• HDFS is the primary storage system used by Hadoop applications
• It is designed for storing very large files with streaming data access patterns, running
on clusters of commodity hardware
• HDFS is a distributed, scalable and portable file system written in Java for the Hadoop
Framework
• HDFS stores files as blocks distributed across the cluster, typically of 128 MB (the default
block size for Hadoop 2.0)
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes
throughout a cluster to enable reliable, extremely rapid computations
Features of HDFS:
It contains a master-slave architecture
Hadoop provides a command-line interface to interact with HDFS
It provides a single namespace for the entire cluster
HDFS is optimized for throughput over latency
It is very efficient at streaming read requests for large files but poor at
seek requests for many small files
The built-in servers of the NameNode and DataNode help users to easily
check the status of the cluster
Difference between Google File System and Hadoop
Distributed File System
Cont..
HDFS Architecture
The File System Namespace
• HDFS supports a traditional hierarchical file organization. A user or an
application can create directories and store files inside these directories.
• The file system namespace hierarchy is similar to most other existing file
systems; one can create and remove files, move a file from one directory to
another, or rename a file (see the sketch after this list). HDFS does not support hard links or soft links.
• The NameNode maintains the file system namespace.
• Any change to the file system namespace or its properties is recorded by the
NameNode. An application can specify the number of replicas of a file that
should be maintained by HDFS.
• The number of copies of a file is called the replication factor of that file. This
information is stored by the NameNode.
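A minimal sketch, not in the original slides, of the namespace operations listed above (create a directory, move/rename, remove) using Hadoop's Java FileSystem API; the /user/demo paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // Create a directory in the namespace (like 'hadoop fs -mkdir').
        fs.mkdirs(new Path("/user/demo/reports"));

        // Move/rename a file from one directory to another.
        fs.rename(new Path("/user/demo/raw.log"), new Path("/user/demo/reports/raw.log"));

        // Remove a file; the boolean enables recursive deletion for directories.
        fs.delete(new Path("/user/demo/old.log"), false);

        // Each of these namespace changes is recorded by the NameNode.
        fs.close();
    }
}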
Replication in HDFS
• HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance.
• The replication factor represents the number of copies of a block that must be present in the
cluster.
• This value is 3 by default (one original block and 2 replicas). So, every
file we create in HDFS will have a replication factor of 3 unless specified otherwise.
• The hdfs-site.xml configuration file is used to control the default HDFS replication factor,
as shown below; a per-file sketch follows the snippet.
<property>
<name>dfs.replication</name>
<value>3</value>
<description>BlockReplication</description>
</property>
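Beyond the cluster-wide default set in hdfs-site.xml above, the replication factor of an individual file can be inspected or changed, e.g. with the hadoop fs -setrep shell command or, as in the hedged sketch below, through the Hadoop Java FileSystem API. The file path is reused from the command examples later in this unit and the target factor of 2 is arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // dfs.replication from hdfs-site.xml supplies the cluster-wide default.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/555/ArrayListDemo.java");   // example path from the slides

        // Read the replication factor currently applied to this file.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Override the replication factor for this one file; the NameNode then
        // schedules extra copies or removes surplus replicas accordingly.
        fs.setReplication(file, (short) 2);
    }
}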
Building Blocks or Daemons of Hadoop
On a fully configured cluster, “running Hadoop” means running a set of
daemons, or resident programs, on the different servers in our network. These
daemons have specific roles; some exist only on one server, some exist across
multiple servers.
The daemons include:
NameNode (Master Node)
DataNode (Slave Node)
Secondary NameNode (Check-point node)
Job Tracker
Task Tracker
Note: A daemon is a computer program that runs as a background process rather than
being under the direct control of an interactive user
NameNode
• A Hadoop cluster consists of a single NameNode
• The NameNode is the main central component of the HDFS architecture; it
directs the slave DataNodes to perform low-level I/O tasks
• The NameNode doesn't store any user data (actual data) or perform any computation
• It is the bookkeeper of HDFS, i.e. it keeps track of file metadata, how files are
broken down into file blocks, which nodes store those blocks, and the overall health of the
distributed file system
• The NameNode is a highly available server that manages the File System
Namespace and controls access to files by clients.
Functions of Namenode
• It is the master daemon that maintains and manages the DataNodes (slave nodes)
and assigns tasks to them
• It records the metadata of all the files stored in the cluster, e.g. the location of the blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
• FsImage: It contains the complete state of the file system namespace since the
start of the NameNode.
• EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage
• It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
Cont..
• It regularly receives Heartbeat signals and block reports from all the DataNodes in
the cluster to ensure that the DataNodes are alive.
• The NameNode executes file system namespace operations like opening, closing and
renaming files and directories
• It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are
located
• The NameNode is also responsible for maintaining the replication factor of all the
blocks.
Cont..
• In case of a DataNode failure, the NameNode chooses new DataNodes for
new replicas, balances disk usage and manages the communication traffic to the
DataNodes.
• There is unfortunately a negative aspect of the NameNode, i.e. it is a single point of
failure in the Hadoop cluster.
• For any of the other daemons, if their host nodes fail for software or hardware
reasons, the Hadoop cluster will likely continue to function smoothly, or you
can quickly restart it. But when the NameNode is down, the HDFS/Hadoop
cluster is inaccessible and considered down.
DataNode
• DataNodes are the slave nodes of HDFS; they store the actual data blocks and perform the
low-level read/write operations requested by clients, as directed by the NameNode.
Hadoop Operational Modes
• Hadoop is a framework specially designed for the distributed batch processing and
storage of enormous datasets on commodity hardware.
• Hadoop being an open source framework can be utilized on a single machine or even
a cluster of machines
• Hadoop is very efficient at distributing huge datasets across commodity
hardware
• Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of
the three supported modes:
Standalone mode (single node cluster)
Pseudo distributed mode (single node cluster)
Fully distributed mode (multi node cluster)
Standalone(Local) mode
• The standalone mode is the default mode in which Hadoop runs
• With empty configuration files, Hadoop runs completely on a single machine
(non-distributed mode, as a single Java process)
• Because there's no need to communicate with other nodes, the standalone mode
doesn't use HDFS, nor will it launch any of the Hadoop daemons; everything
runs in a single JVM instance
• Its primary use is for developing and debugging the application logic of a
MapReduce program without the additional complexity of interacting with the
daemons
• Standalone mode is usually the fastest Hadoop mode as it uses the local file
system for all the input and output
• When Hadoop works in this mode there is no need to configure the files
hdfs-site.xml, mapred-site.xml and core-site.xml for the Hadoop environment. In this mode, all
of your processes run on a single JVM (Java Virtual Machine), and this mode is
suitable only for small development purposes.
Pseudo Distributed mode
• The pseudo-distributed mode is also known as a single-node cluster, where both the
NameNode and DataNode reside on the same machine.
• In pseudo-distributed mode, all the Hadoop daemons run on a single
node. Such a configuration is mainly used while testing, when we don't need to think
about the resources and other users sharing the resources.
• In this architecture, a separate JVM is spawned for every Hadoop component,
and they communicate across network sockets, effectively producing a fully
functioning mini-cluster on a single host
• It uses HDFS for storage, and YARN is also used for managing the resources in the
Hadoop installation
• The replication factor will be ONE per block
• Changes in configuration files will be required for all three files: mapred-site.xml,
core-site.xml, hdfs-site.xml
Fully distributed mode
• This is the production mode of Hadoop, where multiple nodes are running. Here data
is distributed across several nodes and processing is done on each node.
• Master and slave services run on separate nodes in fully distributed Hadoop
mode.
• The following three kinds of servers are used to set up a full cluster:
– the master node of the cluster, which hosts the NameNode and JobTracker
daemons
– the server that hosts the Secondary NameNode daemon
– the slave boxes of the cluster (slave1, slave2, slave3, …), running both the DataNode
and TaskTracker daemons.
Multiple nodes are used to operate Hadoop in Fully Distributed Mode
Hadoop Operational Modes-Summary
Configuring XML files
The following files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
Core-site.xml:
• The core-site.xml file informs Hadoop where the NameNode runs in the cluster. It
contains configuration settings for Hadoop Core, such as I/O settings that are
common to HDFS and MapReduce
• The core-site.xml file contains information such as the port number used for the
Hadoop instance, memory allocated for the file system, memory limit for storing
data, and the size of read/write buffers (a small client-side sketch follows the snippet below)
Location of the file: /etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
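As a small client-side sketch (not from the original slides) of how this file is consumed: instantiating org.apache.hadoop.conf.Configuration loads core-site.xml from the classpath, and FileSystem.get() then connects to the NameNode address given by fs.default.name (called fs.defaultFS in newer releases).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        // new Configuration() reads core-site.xml (and the built-in defaults) from the classpath.
        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));

        // With the snippet above, this prints the hdfs://localhost:9000 file system URI.
        System.out.println("FileSystem URI  = " + FileSystem.get(conf).getUri());
    }
}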
Hdfs-site.xml
• The hdfs-site.xml file contains the configuration settings for HDFS daemons; the
NameNode, the Secondary NameNode, and the DataNodes.
• This xml file also provides paths of NameNode and DataNode
• Here, we can configure hdfs-site.xml to specify default block replication and
permission checking on HDFS.
• The actual number of replicas can also be specified when the file is created. The
default is used if replication is not specified at create time.
<configuration>
<!--To configure/specify the default replication factor-->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!--To configure/specify the NameNode metadata storage location-->
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<!--To configure/specify the DataNode block data storage location-->
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Mapred-site.xml
• It is one of the important configuration files required for the runtime
environment settings of Hadoop.
• This file contains the configuration settings for the MapReduce daemons: the JobTracker
and the TaskTrackers.
• The mapred.job.tracker parameter is a hostname (or IP address) and port pair on
which the JobTracker listens for RPC communication. This parameter specifies the
location of the JobTracker to the TaskTrackers and MapReduce clients.
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Yarn-site.xml:
• This file is used to configure YARN in Hadoop.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Verifying Hadoop Installation
Step 1: Name Node Setup
Set up the NameNode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
Step 2: Verifying Hadoop dfs (Start the HDFS Daemons)
The following command is used to start dfs. Executing this command
will start your Hadoop file system
$ start-dfs.sh
Step 3: Start the MapReduce daemons using the following command:
$ start-mapred.sh
Step 4: Once started, check the status of the daemons on the master and slaves by using the jps (Java
Process Status) command
$ jps
We will get output similar to the following:
14799 NameNode
15314 Jps
16977 SecondaryNameNode
15183 DataNode
Working with files in HDFS
• A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into
HDFS using one of the command line utilities.
• HDFS is not a native UNIX file system. So, standard UNIX file tools such as ls and
cp don't work on it, and neither do standard file read and write operations such as
fopen() and fread().
• Hadoop provides a set of command line utilities that work similarly to the Linux file
commands. After the files are copied into HDFS, MapReduce programs process this data,
but they don't read HDFS files directly. Instead, they rely on the MapReduce framework
to read and parse the HDFS files into individual records (key-value pairs), which are the
unit of data a MapReduce program works on. A programmatic sketch of reading and
writing HDFS files follows.
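A minimal sketch, assuming the Hadoop Java FileSystem API and a hypothetical /555/notes.txt path, of how a client program writes and then reads an HDFS file in place of fopen()/fread():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/555/notes.txt");      // hypothetical path

        // Write: the HDFS counterpart of opening a local file for writing.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read: stream the file's contents back to the console.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}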
Basic File Commands
• The Hadoop get command is the reverse of the put command. It copies files from HDFS to the local file system.
hadoop fs -get /555/ArrayListDemo.java
6. Display the content of HDFS files:
• The Hadoop cat command allows us to display the content of an HDFS file.
hadoop fs -cat /555/ArrayListDemo.java
7. Deleting files:
• The rm command is used to remove files and empty directories.
hadoop fs -rm /555/ArrayListDemo.java
8. Looking up help:
• We can use hadoop fs (with no parameters) to get a complete list of all available
commands in Hadoop. We can also use help to display the usage and a short description
of each command.
hadoop fs -help ls
9. Shutting Down the HDFS
• You can shut down the HDFS by using the following command.
$ stop-dfs.sh
HDFS Goals
• HDFS is designed to store very large files reliably on clusters of commodity hardware, with
an emphasis on high throughput of data access rather than low latency.
Review Questions
• Explain the uses of the NameNode, DataNode and Secondary NameNode in the Hadoop
Distributed File System.
• What is the replication factor in HDFS and what is its default value?
• Define a Hadoop cluster. How can you configure a Hadoop cluster?
• What are the advantages and disadvantages of Hadoop?
• Define a DataNode. How does the NameNode tackle DataNode failures?
• Discuss in brief the NameNode, DataNode, Checkpoint node and Backup
node.
• What are the real-time industry applications of Hadoop?