Hadoop Disaster Recovery Solution
By
Sandeep Kumar
Approach 1: HDP 2.1 HA cluster across two data centers (DCs) in different geo locations
Deploy a backup NameNode and a backup JobTracker at the remote geo location, along with
replicated DataNodes.
• Here we introduce a backup NameNode and a backup JobTracker in the remote DC (geolocation 2).
• In this HA cluster setup, the backup NameNode and JobTracker are set up in the remote geography.
• The backup NameNode and JobTracker stay in standby mode while the primary NameNode and JobTracker
are active.
• The backup NameNode and JobTracker are kept in sync with the primary NameNode and JobTracker.
• As soon as the primary NameNode goes down or is destroyed, the backup NameNode becomes active and
functional, so there is no impact on the data or on the Hadoop cluster, because the data and the
cluster services are also available in DC 2.
1. The HDFS replication factor will be 3, and the third replica will always be placed in DC 2 (a configuration sketch follows this list).
2. The secondary NameNode (now the backup NameNode) and the backup JobTracker will be in DC 2.
3. All restore points (applications, application servers, edit logs, namespace) will be in DC 2.
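The snippet below is a minimal, hypothetical Java sketch of what this layout implies; in practice these settings live in hdfs-site.xml / core-site.xml (or Ambari). The nameservice name, host names, ZooKeeper quorum, and topology script path are placeholders, and steering the third replica into DC 2 depends on rack awareness (and, strictly, the block placement policy), not on the replication factor alone.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Approach1Settings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Three replicas per block. Placing the third replica in DC 2 relies on
    // rack awareness (a topology script mapping DC 2 hosts to their own racks).
    conf.setInt("dfs.replication", 3);
    conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology_script.py");

    // NameNode HA pair: active NameNode in DC 1, standup ("backup") NameNode in DC 2.
    conf.set("dfs.nameservices", "drcluster");
    conf.set("dfs.ha.namenodes.drcluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.drcluster.nn1", "dc1-nn.example.com:8020");
    conf.set("dfs.namenode.rpc-address.drcluster.nn2", "dc2-nn.example.com:8020");
    conf.set("dfs.client.failover.proxy.provider.drcluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

    // Raise the replication factor of data that already exists in HDFS.
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/user/hdfs/ida/storage"), (short) 3);
  }
}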
Additional storage can be provided on network-attached storage (NAS), and that data can be loaded into HDFS
when it needs to be processed by Hadoop. NAS nodes can also be commissioned into the Hadoop cluster in case of
multiple node failures. The restore point and replication for the additional NAS will be at the remote geography.
1. LZO, gzip, or Snappy compression should be used for HDFS data compression.
Note: LZO-compressed files can be indexed, which makes them splittable.
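As one illustration, compression can be enabled through the standard Hadoop job configuration. This is a minimal sketch assuming Snappy for both intermediate and final output; LZO needs the external hadoop-lzo codec (plus an index step) to stay splittable, which is why it is called out above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate (map) output to reduce shuffle traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-output-job");

    // Compress the final output written back to HDFS.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

    // ... set mapper, reducer, and input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}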
Compression Solution 2:
Split the file into chunks and compress each chunk separately.
In this case we should choose the chunk size so that the compressed chunks are
approximately the size of an HDFS block (a sketch follows the note below).
Note: Data compression shrinks the data, so it consumes less network bandwidth when transferred over the
network and also saves storage capacity.
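A minimal sketch of the chunk-and-compress idea, assuming gzip and a rough 4:1 compression ratio when picking the raw chunk size; the input path and the ratio are placeholders to tune for real data.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class ChunkedCompress {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // Raw bytes per chunk, chosen so the *compressed* chunk lands near one
    // 128 MB HDFS block; the 4:1 ratio is a rough guess for text data.
    long rawChunkSize = 4L * 128 * 1024 * 1024;

    Path src = new Path(args[0]);          // uncompressed source file in HDFS
    byte[] buf = new byte[64 * 1024];
    int part = 0;

    try (InputStream in = fs.open(src)) {
      int n = in.read(buf);
      while (n > 0) {
        Path dst = new Path(src + ".part" + part++ + codec.getDefaultExtension());
        long written = 0;
        try (CompressionOutputStream out = codec.createOutputStream(fs.create(dst))) {
          // Fill this chunk, then roll over to the next part file.
          while (n > 0 && written < rawChunkSize) {
            out.write(buf, 0, n);
            written += n;
            n = in.read(buf);
          }
        }
      }
    }
  }
}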
Advantage:
• Because the HDFS replicas and backup nodes are at a remote geography, we can recover from any kind of
disaster within hours.
• Because the data is compressed, transfers over the network consume less bandwidth and complete faster
than usual.
• The impact on data processing in HDFS is minimal, since the NameNode directs clients to the nearest copy
of the data during processing.
Disadvantage:
• Because additional DataNodes, the backup NameNode, and the backup JobTracker sit in DC 2, the cluster
consumes WAN bandwidth for node-to-node communication.
• Cluster performance will be somewhat slower compared to Approach 2.
• The cluster is expensive.
Approach 2: HDP 2.1 HA cluster, with two parallel clusters in two different geo locations
Create two parallel HA NameNode Hadoop clusters in two different geolocations and periodically copy
incremental data to the backup cluster.
• Here we introduce two parallel HA Hadoop clusters in two different geographical locations, with the same
configuration.
• The secondary cluster's storage can be smaller than the primary cluster's, since it is a backup cluster. This
decision can be made according to the business requirement and the available budget.
• The secondary data center will be the restore/recovery point for all applications, application servers, and
data storage units.
• We periodically copy useful data to the remote backup cluster to prevent any kind of data loss (see the
DistCp sketch after this list).
• If the primary cluster goes down or is destroyed for any reason, we simply point to the backup Hadoop
cluster as the production cluster.
• The backup cluster then becomes active and functional, so there is no impact on business processes, because
the data and applications are already available on the secondary Hadoop cluster at the distributed geolocation.
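A hedged sketch of the periodic incremental copy using the Hadoop 2.x DistCp tools API (later releases moved DistCpOptions to a builder); in practice this is usually the "hadoop distcp -update -delete" CLI run from cron or Oozie. The NameNode addresses and paths are placeholders.

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class PeriodicBackupCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hypothetical NameNode addresses and paths; substitute the real clusters.
    Path source = new Path("hdfs://dc1-nn.example.com:8020/user/hdfs/ida/storage");
    Path target = new Path("hdfs://dc2-nn.example.com:8020/user/hdfs/ida/storage");

    DistCpOptions options =
        new DistCpOptions(Collections.singletonList(source), target);
    options.setSyncFolder(true);      // like -update: copy only new/changed files
    options.setDeleteMissing(true);   // like -delete: remove files gone on the primary

    // Runs as a MapReduce job and blocks until the copy completes.
    new DistCp(conf, options).execute();
  }
}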
Additional storage can be provided on network-attached storage (NAS), and that data can be loaded into HDFS
when it needs to be processed by Hadoop. NAS nodes can also be commissioned into the Hadoop cluster in case of
multiple node failures. The restore point and replication for the additional NAS will be at the remote geography.
1. LZO, gzip, or Snappy compression should be used for HDFS data compression.
Note: LZO-compressed files can be indexed, which makes them splittable.
Compression Solution 2:
Split the file into chunks and compress each chunk separately.
In this case we should choose the chunk size so that the compressed chunks are
approximately the size of an HDFS block.
Note: Data compression shrinks the data, so it consumes less network bandwidth when transferred over the
network and also saves storage capacity.
Configuration
1. The HDFS replication factor will be 2 in both Hadoop clusters: the first replica on DataNode 1 and the second
replica on DataNode 2.
2. All backup/restore points (applications, application servers, NAS) will be in DC 2.
3. External data sources (FTP, NAS) can be kept in external storage and loaded into HDFS whenever they need to
be processed by Hadoop.
4. Directory structure: we should create an application-wise HDFS directory structure, restricted to specific
users. This helps to easily identify sensitive and useful data for a particular operation (a sketch follows this
list). Example:
1. /user/hdfs/ida/processing/
2. /user/hdfs/ida/storage/
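A minimal sketch of creating the per-application directories with restricted permissions; the owner and group names are placeholders, and changing ownership requires HDFS superuser privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CreateAppDirectories {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Per-application directories, restricted to the owning user and group.
    // "ida" and "hadoop" are placeholder principals.
    String[] dirs = {"/user/hdfs/ida/processing", "/user/hdfs/ida/storage"};
    for (String d : dirs) {
      Path p = new Path(d);
      fs.mkdirs(p, new FsPermission((short) 0750));   // rwxr-x--- : owner and group only
      fs.setOwner(p, "ida", "hadoop");
    }
  }
}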
Things to take care of:
• Periodically copy data to the secondary cluster.
• Keep all Hive tables and applications consistent across both clusters.
• External Hive tables should be created on top of the HDFS location. This avoids the overhead of loading data
into Hive tables (see the sketch after this list).
• All applications should be available on both clusters.
• All backup and restore points should be in data center 2 for faster recovery.
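To illustrate the external-table point, a hedged sketch that replays the same DDL over HiveServer2 JDBC; the host, credentials, and table schema are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint, credentials, and table schema.
    String url = "jdbc:hive2://hiveserver2.example.com:10000/default";
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(url, "hdfs", "");
         Statement stmt = conn.createStatement()) {
      // EXTERNAL keeps the data in place under the HDFS location, so the same
      // DDL can simply be replayed on the backup cluster over replicated files.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS ida_events (id STRING, payload STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        + "LOCATION '/user/hdfs/ida/storage/'");
    }
  }
}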
Advantage:
• Because the secondary cluster is at a remote geography, we can recover from any kind of disaster within hours.
• Because the data is compressed, transfers over the network consume less bandwidth and complete faster than usual.
Disadvantage:
• Additional storage is required.
• Periodically copying data to the remote geography consumes network bandwidth.
Approach 3: HDP 2.2 HA cluster, with two parallel clusters in two different geo locations
Create two parallel HA Hadoop clusters in two different geolocations and periodically copy HDFS snapshots
to the backup cluster using Apache Falcon.
HDFS snapshots:
• HDFS snapshots are read-only, point-in-time copies of the file system. Snapshots can be taken on a subtree
of the file system or on the entire file system.
• Common use cases for snapshots are data backup, protection against user errors, and disaster recovery.
• Snapshots can be taken on any directory once the directory has been made snapshottable (a sketch follows
this list).
• There is no limit on the number of snapshottable directories.
• Snapshots do not adversely affect regular HDFS operations: modifications are recorded in reverse
chronological order so that the current data can be accessed directly. The snapshot data is computed by
subtracting the modifications from the current data.
• This feature is available in Apache Hadoop 2.6.0 (HDP 2.2).
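A minimal sketch of taking a snapshot through the HDFS client API (the CLI equivalents are "hdfs dfsadmin -allowSnapshot" and "hdfs dfs -createSnapshot"); the directory path is a placeholder, and allowSnapshot requires superuser privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotBackup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path dir = new Path("/user/hdfs/ida/storage");   // placeholder directory

    // One-time admin step (superuser): mark the directory snapshottable.
    // CLI equivalent: hdfs dfsadmin -allowSnapshot /user/hdfs/ida/storage
    dfs.allowSnapshot(dir);

    // Read-only, point-in-time copy; it appears under <dir>/.snapshot/<name>.
    // CLI equivalent: hdfs dfs -createSnapshot /user/hdfs/ida/storage <name>
    String name = "backup-" + System.currentTimeMillis();
    Path snapshot = dfs.createSnapshot(dir, name);
    System.out.println("Created snapshot at " + snapshot);

    // The .snapshot path is a stable source for DistCp (or a Falcon feed) to
    // replicate to the backup cluster.
  }
}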
Apache Falcon:
• Apache Falcon is a data processing and management solution for Hadoop designed for data motion,
coordination of data pipelines, lifecycle management, and data discovery.
• Falcon enables end consumers to quickly onboard their data and its associated processing and management
tasks on Hadoop clusters.
• Hadoop applications can now rely on the Apache Falcon framework for these functions.
• Data set replication
– Replicate data sets (whether HDFS or Hive tables) as part of your disaster recovery, backup,
and archival solution. Falcon can trigger retries and handle late data arrival.
• Data set lifecycle management
– Establish retention policies for data sets, and Falcon will schedule eviction.
• Data set traceability/lineage
– Use Falcon to view coarse-grained dependencies between clusters, data sets, and processes.
• Falcon is integrated with HDP 2.2.
This solution can be used after upgrading the Hadoop cluster to HDP 2.2.
Framework entities:
• The Falcon framework defines the fundamental building blocks for data processing applications
using entities such as feeds, processes, and clusters. A Hadoop user can establish entity
relationships, and Falcon handles the management, coordination, and scheduling of data set
processing.
• Cluster: represents the "interfaces" to a Hadoop cluster.
• Feed: defines a dataset with its location, replication schedule, and retention policy.
• Process: consumes and processes feeds.
Reference: Hortonworks website, Falcon documents.
