Hadoop Disaster Recovery Solution
By
Sandeep Kumar
Approach 1: HDP 2.1 HA cluster across two data centers (DCs) in different geo locations
Deploy a backup NameNode and a backup JobTracker at the remote geo location, along with
replicated DataNodes.
• Here we introduce a backup NameNode and a backup JobTracker in the remote DC (geolocation 2).
• In this HA cluster setup, the backup NameNode and JobTracker are set up in the remote geography.
• The backup NameNode and JobTracker stay in standby mode while the primary NameNode and JobTracker
are active.
• The backup NameNode and JobTracker are kept in sync with the primary NameNode and JobTracker.
• As soon as the primary NameNode goes down or is destroyed, the backup NameNode becomes active and
functional, so there is no impact on the data or on the Hadoop cluster, because the data and the
cluster services are also available in DC 2.
1. The HDFS replication factor will be 3, and the third replica will always be placed in DC 2 (a configuration sketch follows this list).
2. The secondary NameNode (now the backup NameNode) and the backup JobTracker will be in DC 2.
3. All restore points (applications, application servers, edit logs, namespace) will be in DC 2.
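The snippet below is a minimal, hypothetical Java sketch of what this layout implies; in practice these settings live in hdfs-site.xml / core-site.xml (or Ambari). The nameservice name, host names, ZooKeeper quorum, and topology script path are placeholders, and steering the third replica into DC 2 depends on rack awareness (and, strictly, the block placement policy), not on the replication factor alone.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Approach1Settings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Three replicas per block. Placing the third replica in DC 2 relies on
    // rack awareness (a topology script mapping DC 2 hosts to their own racks).
    conf.setInt("dfs.replication", 3);
    conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology_script.py");

    // NameNode HA pair: active NameNode in DC 1, standup ("backup") NameNode in DC 2.
    conf.set("dfs.nameservices", "drcluster");
    conf.set("dfs.ha.namenodes.drcluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.drcluster.nn1", "dc1-nn.example.com:8020");
    conf.set("dfs.namenode.rpc-address.drcluster.nn2", "dc2-nn.example.com:8020");
    conf.set("dfs.client.failover.proxy.provider.drcluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

    // Raise the replication factor of data that already exists in HDFS.
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/user/hdfs/ida/storage"), (short) 3);
  }
}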
Additional storage can be provided on network-attached storage (NAS), and that data can be loaded into HDFS
when it needs to be processed by Hadoop. NAS nodes can also be commissioned into the Hadoop cluster in case of
multiple node failures. The restore point and replication for the additional NAS will be at the remote geography.
1. LZO, gzip, or Snappy compression should be used for HDFS data compression.
Note: LZO-compressed files can be indexed, which makes them splittable.
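As one illustration, compression can be enabled through the standard Hadoop job configuration. This is a minimal sketch assuming Snappy for both intermediate and final output; LZO needs the external hadoop-lzo codec (plus an index step) to stay splittable, which is why it is called out above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate (map) output to reduce shuffle traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-output-job");

    // Compress the final output written back to HDFS.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

    // ... set mapper, reducer, and input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}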
Compression Solution 2:
Split the file into chunks and compress each chunk separately.
In this case we should choose the chunk size so that the compressed chunks are
approximately the size of an HDFS block (a sketch follows the note below).
Note: Data compression shrinks the data, so it consumes less network bandwidth when transferred over the
network and also saves storage capacity.
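A minimal sketch of the chunk-and-compress idea, assuming gzip and a rough 4:1 compression ratio when picking the raw chunk size; the input path and the ratio are placeholders to tune for real data.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class ChunkedCompress {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // Raw bytes per chunk, chosen so the *compressed* chunk lands near one
    // 128 MB HDFS block; the 4:1 ratio is a rough guess for text data.
    long rawChunkSize = 4L * 128 * 1024 * 1024;

    Path src = new Path(args[0]);          // uncompressed source file in HDFS
    byte[] buf = new byte[64 * 1024];
    int part = 0;

    try (InputStream in = fs.open(src)) {
      int n = in.read(buf);
      while (n > 0) {
        Path dst = new Path(src + ".part" + part++ + codec.getDefaultExtension());
        long written = 0;
        try (CompressionOutputStream out = codec.createOutputStream(fs.create(dst))) {
          // Fill this chunk, then roll over to the next part file.
          while (n > 0 && written < rawChunkSize) {
            out.write(buf, 0, n);
            written += n;
            n = in.read(buf);
          }
        }
      }
    }
  }
}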
Advantage:
• Because the HDFS replicas and backup nodes are at a remote geography, we can recover from any kind of
disaster within hours.
• Because the data is compressed, transfers over the network consume less bandwidth and complete faster
than usual.
• The impact on data processing in HDFS is minimal, since the NameNode directs clients to the nearest copy
of the data during processing.
Disadvantage:
• Because additional DataNodes, the backup NameNode, and the backup JobTracker sit in DC 2, the cluster
consumes WAN bandwidth for node-to-node communication.
• Cluster performance will be somewhat slower compared to Approach 2.
• The cluster is expensive.
Approach 2: HDP 2.1 HA cluster, with two parallel clusters in two different geo locations
Create two parallel HA NameNode Hadoop clusters in two different geolocations and periodically copy
incremental data to the backup cluster.
• Here we introduce two parallel HA Hadoop clusters in two different geographical locations, with the same
configuration.
• The secondary cluster's storage can be smaller than the primary cluster's, since it is a backup cluster. This
decision can be made according to the business requirement and the available budget.
• The secondary data center will be the restore/recovery point for all applications, application servers, and
data storage units.
• We periodically copy useful data to the remote backup cluster to prevent any kind of data loss (see the
DistCp sketch after this list).
• If the primary cluster goes down or is destroyed for any reason, we simply point to the backup Hadoop
cluster as the production cluster.
• The backup cluster then becomes active and functional, so there is no impact on business processes, because
the data and applications are already available on the secondary Hadoop cluster at the distributed geolocation.
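A hedged sketch of the periodic incremental copy using the Hadoop 2.x DistCp tools API (later releases moved DistCpOptions to a builder); in practice this is usually the "hadoop distcp -update -delete" CLI run from cron or Oozie. The NameNode addresses and paths are placeholders.

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class PeriodicBackupCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hypothetical NameNode addresses and paths; substitute the real clusters.
    Path source = new Path("hdfs://dc1-nn.example.com:8020/user/hdfs/ida/storage");
    Path target = new Path("hdfs://dc2-nn.example.com:8020/user/hdfs/ida/storage");

    DistCpOptions options =
        new DistCpOptions(Collections.singletonList(source), target);
    options.setSyncFolder(true);      // like -update: copy only new/changed files
    options.setDeleteMissing(true);   // like -delete: remove files gone on the primary

    // Runs as a MapReduce job and blocks until the copy completes.
    new DistCp(conf, options).execute();
  }
}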
Additional storage can be provided on network-attached storage (NAS), and that data can be loaded into HDFS
when it needs to be processed by Hadoop. NAS nodes can also be commissioned into the Hadoop cluster in case of
multiple node failures. The restore point and replication for the additional NAS will be at the remote geography.
1. LZO, gzip, or Snappy compression should be used for HDFS data compression.
Note: LZO-compressed files can be indexed, which makes them splittable.
Compression Solution 2:
Split the file into chunks and compress each chunk separately.
In this case we should choose the chunk size so that the compressed chunks are
approximately the size of an HDFS block.
Note: Data compression shrinks the data, so it consumes less network bandwidth when transferred over the
network and also saves storage capacity.
Configuration
1. The HDFS replication factor will be 2 in both Hadoop clusters: the first replica on DataNode 1 and the second
replica on DataNode 2.
2. All backup/restore points (applications, application servers, NAS) will be in DC 2.
3. External data sources (FTP, NAS) can be kept in external storage and loaded into HDFS whenever they need to
be processed by Hadoop.
4. Directory structure: we should create an application-wise HDFS directory structure, restricted to specific
users. This helps to easily identify sensitive and useful data for a particular operation (a sketch follows this
list). Example:
1. /user/hdfs/ida/processing/
2. /user/hdfs/ida/storage/
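A minimal sketch of creating the per-application directories with restricted permissions; the owner and group names are placeholders, and changing ownership requires HDFS superuser privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CreateAppDirectories {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Per-application directories, restricted to the owning user and group.
    // "ida" and "hadoop" are placeholder principals.
    String[] dirs = {"/user/hdfs/ida/processing", "/user/hdfs/ida/storage"};
    for (String d : dirs) {
      Path p = new Path(d);
      fs.mkdirs(p, new FsPermission((short) 0750));   // rwxr-x--- : owner and group only
      fs.setOwner(p, "ida", "hadoop");
    }
  }
}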
Things to take care of:
• Periodically copy data to the secondary cluster.
• Keep all Hive tables and applications consistent across both clusters.
• External Hive tables should be created on top of the HDFS location. This avoids the overhead of loading data
into Hive tables (see the sketch after this list).
• All applications should be available on both clusters.
• All backup and restore points should be in data center 2 for faster recovery.
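To illustrate the external-table point, a hedged sketch that replays the same DDL over HiveServer2 JDBC; the host, credentials, and table schema are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint, credentials, and table schema.
    String url = "jdbc:hive2://hiveserver2.example.com:10000/default";
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(url, "hdfs", "");
         Statement stmt = conn.createStatement()) {
      // EXTERNAL keeps the data in place under the HDFS location, so the same
      // DDL can simply be replayed on the backup cluster over replicated files.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS ida_events (id STRING, payload STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        + "LOCATION '/user/hdfs/ida/storage/'");
    }
  }
}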
Advantage:
• Because the secondary cluster is at a remote geography, we can recover from any kind of disaster within hours.
• Because the data is compressed, transfers over the network consume less bandwidth and complete faster than usual.
Disadvantage:
• Additional storage is required.
• Periodically copying data to the remote geography consumes network bandwidth.
Approach 3: HDP 2.2 HA cluster, with two parallel clusters in two different geo locations
Create two parallel HA Hadoop clusters in two different geolocations and periodically copy HDFS snapshots
to the backup cluster using Apache Falcon.
HDFS snapshots:
• HDFS snapshots are read-only, point-in-time copies of the file system. Snapshots can be taken on a subtree
of the file system or on the entire file system.
• Common use cases for snapshots are data backup, protection against user errors, and disaster recovery.
• Snapshots can be taken on any directory once the directory has been made snapshottable (a sketch follows
this list).
• There is no limit on the number of snapshottable directories.
• Snapshots do not adversely affect regular HDFS operations: modifications are recorded in reverse
chronological order so that the current data can be accessed directly. The snapshot data is computed by
subtracting the modifications from the current data.
• This feature is available in Apache Hadoop 2.6.0 (HDP 2.2).
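A minimal sketch of taking a snapshot through the HDFS client API (the CLI equivalents are "hdfs dfsadmin -allowSnapshot" and "hdfs dfs -createSnapshot"); the directory path is a placeholder, and allowSnapshot requires superuser privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotBackup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path dir = new Path("/user/hdfs/ida/storage");   // placeholder directory

    // One-time admin step (superuser): mark the directory snapshottable.
    // CLI equivalent: hdfs dfsadmin -allowSnapshot /user/hdfs/ida/storage
    dfs.allowSnapshot(dir);

    // Read-only, point-in-time copy; it appears under <dir>/.snapshot/<name>.
    // CLI equivalent: hdfs dfs -createSnapshot /user/hdfs/ida/storage <name>
    String name = "backup-" + System.currentTimeMillis();
    Path snapshot = dfs.createSnapshot(dir, name);
    System.out.println("Created snapshot at " + snapshot);

    // The .snapshot path is a stable source for DistCp (or a Falcon feed) to
    // replicate to the backup cluster.
  }
}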
Apache Falcon:
• Apache Falcon is a data processing and management solution for Hadoop designed for data motion,
coordination of data pipelines, lifecycle management, and data discovery.
• Falcon enables end consumers to quickly onboard their data and its associated processing and management
tasks on Hadoop clusters.
• Hadoop applications can now rely on the Apache Falcon framework for these functions.
• Data set replication
– Replicate data sets (whether HDFS or Hive tables) as part of your disaster recovery, backup,
and archival solution. Falcon can trigger retries and handle late data arrival.
• Data set lifecycle management
– Establish retention policies for data sets, and Falcon will schedule eviction.
• Data set traceability/lineage
– Use Falcon to view coarse-grained dependencies between clusters, data sets, and processes.
• Falcon is integrated with HDP 2.2.
This solution can be used after upgrading the Hadoop cluster to HDP 2.2.
Framework entities:
• The Falcon framework defines the fundamental building blocks for data processing applications
using entities such as feeds, processes, and clusters. A Hadoop user can establish entity
relationships, and Falcon handles the management, coordination, and scheduling of data set
processing.
• Cluster: represents the "interfaces" to a Hadoop cluster.
• Feed: defines a dataset with its location, replication schedule, and retention policy.
• Process: consumes and processes feeds.
Reference: Hortonworks website, Falcon documents.
