WANdisco Fusion
Active-active data replication solution for total data protection and availability
across Hadoop distributions and storage
Brett Rudenstein – Director of Product Management
WD Fusion
Non-Intrusive
Provides Continuous Replication
Across the LAN/WAN
Active/Active
Key Issue For Sharing Data Across Clusters
LAN / WAN
• Require Continuous Availability
– SLAs, Regulatory Compliance
– Regional datacenter failure
• Require Hadoop Deployed Globally
– Share Data Between Data Centers
– Data is Strictly Consistent, Not Eventually Consistent
• Ease Administrative Burden
– Reduce Operational Complexity
– Simplify Disaster Recovery
– Lower RTO/RPO
• Allow Maximum Utilization of
Resources
– Within the Data Center
– Across Data Centers
Enterprise Ready Hadoop
Characteristics of Mission Critical Applications
Standby Datacenter
• Idle Resource
– Single Data Center Ingest
– Disaster Recovery Only
• One way synchronization
– DistCp
• Error Prone
– Clusters can diverge over time
• Difficult to scale > 2 Data Centers
– Complexity of sharing data
increases
Active / Active
• DR Resource Available
– Ingest at all Data Centers
– Run Jobs in both Data Centers
• Replication is Multi-Directional
– active/active
• Absolute Consistency
– Single Virtual NameSpace spans
locations
• ‘N’ Data Center support
– Global Hadoop shares only
appropriate data
Active/Active vs. Active/Passive Data Centers
What’s in a Data Center
Coordinated Replication of
HCFS Namespace
Distributed Coordination Engine
Fault-tolerant coordination using multiple acceptors
• Distributed Coordination Engine operates on participating nodes
– Roles: Proposer, Learner, and Acceptor
– Each node can combine multiple roles
• Distributed coordination
– Proposing nodes submit events as
proposals to a quorum of acceptors
– Acceptors agree on the order of each
event in the global sequence of events
– Learners learn agreements in the same
deterministic order
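The role split above can be sketched in a few lines. This is a toy, single-process illustration and not WANdisco's implementation: the names `AcceptorGroup`, `Learner`, and `propose` are hypothetical, and the acceptors are collapsed into one object that simply hands out the next position in the global sequence.

```python
class Learner:
    """Applies agreed events strictly in sequence order."""

    def __init__(self):
        self.log = []

    def learn(self, seq, event):
        # Refuse gaps so every learner ends up with an identical log.
        if seq != len(self.log) + 1:
            raise ValueError(f"expected agreement {len(self.log) + 1}, got {seq}")
        self.log.append(event)


class AcceptorGroup:
    """Stand-in for the acceptors: assigns each event one global position."""

    def __init__(self):
        self.next_seq = 1

    def agree(self, event):
        seq = self.next_seq
        self.next_seq += 1
        return seq, event


def propose(acceptors, learners, event):
    """Proposer role: submit an event, then deliver the agreement to learners."""
    seq, agreed = acceptors.agree(event)
    for learner in learners:
        learner.learn(seq, agreed)
    return seq


acceptors = AcceptorGroup()
learners = [Learner(), Learner(), Learner()]
propose(acceptors, learners, "create /data/a")
propose(acceptors, learners, "rename /data/a /data/b")
```

Because each learner receives the same events in the same order, all three logs end up identical.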
Consensus Algorithms
Consensus is the process of agreeing on one result among a group of participants
• Coordination Engine guarantees the same state of the learners at a given GSN
– Each agreement is assigned a unique Global Sequence Number (GSN)
– GSNs form a monotonically increasing number series – the order of agreements
– Learners start from the same initial state and apply the same deterministic agreements in the same
deterministic order
– GSN represents “logical” time in coordinated systems
• Paxos is a consensus algorithm
proven to tolerate a variety of failures
– Quorum-based Consensus
– Deterministic State Machine
– Leslie Lamport:
The Part-Time Parliament (1990)
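For readers new to Paxos, the single-decree form (agreeing on one value) can be sketched as follows. This is a textbook-style, in-memory simplification with no networking or failure handling. It is not the Coordination Engine's code, and `Acceptor` and `propose` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None  # (ballot, value) last accepted

    def prepare(self, ballot: int):
        """Phase 1: promise to ignore lower ballots; report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot: int, value: str) -> bool:
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(acceptors: List[Acceptor], ballot: int, value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if no quorum was won."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [accepted for ok, accepted in replies if ok]
    if len(granted) < quorum:
        return None
    # If any acceptor already accepted a value, adopt the one with the
    # highest ballot instead of our own; this keeps a chosen value stable.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes >= quorum else None
```

With three acceptors, a first proposal is chosen as submitted; a later proposal with a higher ballot learns and re-proposes the already chosen value, and a proposal with a stale ballot is rejected.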
Replicated Virtual Namespace
Coordination Engine provides equivalence of multiple namespace replicas
• Coordinated Virtual Namespace controlled by Fusion Node
– Is a client that acts as a proxy to other client interactions
– Reads are not coordinated
– Writes (Open, Close, Append, etc…) are coordinated
• The namespace events are consistent with each other
– Each Fusion server maintains a log of changes that would occur in the namespace
– Any Fusion Node can initiate an update, which is propagated to all other Fusion Nodes
• Coordination Engine establishes the global order of namespace updates
– Fusion servers ensure deterministic updates in the same deterministic order to underlying
file system
– Systems, which start from the same state and apply the same updates, are equivalent
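The read/write split described above can be illustrated with a small sketch. It assumes a namespace modeled as a set of paths and a single in-process `Coordinator` standing in for the Coordination Engine; `FusionProxy`, `Coordinator`, and the `create`/`delete` operations are hypothetical names chosen for the example.

```python
class Coordinator:
    """Stand-in for the Coordination Engine: one global order of updates."""

    def __init__(self):
        self.log = []
        self.proxies = []

    def submit(self, op, path):
        self.log.append((op, path))
        # Every proxy applies the same update in the same order.
        for proxy in self.proxies:
            proxy.apply(op, path)


class FusionProxy:
    """Hypothetical proxy in front of one replica of the namespace."""

    def __init__(self, coordinator):
        self.namespace = set()
        self.coordinator = coordinator
        coordinator.proxies.append(self)

    def read(self, path):
        # Reads are answered from the local replica, with no coordination.
        return path in self.namespace

    def write(self, op, path):
        # Writes go through the coordinator before touching any replica.
        self.coordinator.submit(op, path)

    def apply(self, op, path):
        if op == "create":
            self.namespace.add(path)
        elif op == "delete":
            self.namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")


coordinator = Coordinator()
site_a = FusionProxy(coordinator)
site_b = FusionProxy(coordinator)
site_a.write("create", "/data/x")
```

After the write at `site_a`, a read at `site_b` sees `/data/x`, while reads themselves never add entries to the coordinator's log.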
Strict Consistency Model
One-Copy Equivalence as known in replicated databases
• Coordination Engine sequences file open and close
proposals into the global sequence of agreements
– Applied to individual replicated folder namespace in the order of
their Global Sequence Number
• Fusion Replicated Folders have identical states when
they reach the same GSN
• One-copy equivalence
– Folders may have different states at a given moment of “clock” time,
as the rate of consuming agreements may vary
– Provides same state in logical time
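A small sketch can make one-copy equivalence concrete. It assumes a folder modeled as a set of paths and a hypothetical `ReplicatedFolder` class that records its state after each agreement; the agreements themselves are made-up examples.

```python
# Agreed updates keyed by Global Sequence Number (illustrative data only).
agreements = {
    1: ("create", "/a"),
    2: ("create", "/b"),
    3: ("delete", "/a"),
}


class ReplicatedFolder:
    def __init__(self):
        self.gsn = 0          # last agreement applied
        self.paths = set()
        self.state_at = {}    # gsn -> snapshot of the folder at that GSN

    def consume(self, agreements, upto):
        """Apply agreements in GSN order until `upto` has been applied."""
        while self.gsn < upto:
            self.gsn += 1
            op, path = agreements[self.gsn]
            if op == "create":
                self.paths.add(path)
            else:
                self.paths.discard(path)
            self.state_at[self.gsn] = frozenset(self.paths)


fast, slow = ReplicatedFolder(), ReplicatedFolder()
fast.consume(agreements, upto=3)   # this replica has caught up
slow.consume(agreements, upto=1)   # this replica is lagging

differ_now = fast.paths != slow.paths              # different at this "clock" moment
same_at_gsn_1 = fast.state_at[1] == slow.state_at[1]  # identical in logical time
```

Once the slower replica consumes the remaining agreements, the two folders are identical again.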
Scaling Hadoop Across Data Centers
Continuous Availability and Disaster Recovery over the WAN
• The system should appear, act, and be operated as a single cluster
– Instant and automatic replication of data and metadata
• Parts of the cluster on different data centers should have equal roles
– Data could be ingested or accessed through any of the centers
• Data creation and access should typically be at LAN speed
– A job executed on one data center should run as fast as if there were no other centers
• Failure scenarios: the system should provide service and remain consistent
– Any Fusion node can fail and replication continues
– Fusion nodes can fail simultaneously in two or more data centers and replication
continues
– WAN Partitioning does not cause a data center outage
– RPO is as low as possible because replication is continuous rather than
periodic
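The RPO point can be made with simple arithmetic. The sketch below assumes a periodic copy that starts every `interval_min` minutes and takes `copy_duration_min` minutes to finish, and it approximates continuous replication's exposure by its replication lag. The function names and all numbers are hypothetical.

```python
def worst_case_rpo_periodic(interval_min, copy_duration_min):
    """Data written just after a copy starts is only protected once the
    next copy has started and finished."""
    return interval_min + copy_duration_min


def worst_case_rpo_continuous(replication_lag_min):
    """With continuous replication, exposure is roughly the replication lag."""
    return replication_lag_min


periodic = worst_case_rpo_periodic(interval_min=360, copy_duration_min=45)
continuous = worst_case_rpo_continuous(replication_lag_min=2)
print(periodic, continuous)  # prints: 405 2
```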
• Majority Quorum
– A fixed number of participants
– The Majority must agree for change
• Failure
– Failed nodes are unavailable
– Normal operation continues on nodes
with quorum
• Recovery / Self Healing
– Nodes that rejoin stay in safe mode
until they are caught up
• Disaster Recovery
– A complete loss can be brought back
from another replica
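The majority rule above reduces to a little arithmetic. The helper names below are illustrative and not part of any WANdisco API.

```python
def majority(participants: int) -> int:
    """Smallest number of participants that forms a majority."""
    return participants // 2 + 1


def tolerated_failures(participants: int) -> int:
    """How many participants can be down while a majority remains."""
    return participants - majority(participants)


def has_quorum(participants: int, reachable: int) -> bool:
    return reachable >= majority(participants)
```

With five participants, three form a majority and two may fail. With four participants, a two/two split leaves neither side with quorum, which is one reason odd membership counts are commonly preferred.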
How DConE Works
WANdisco Active/Active Replication
[Diagram: nodes A, B and C each hold the same transaction log, TX id 168 to 173. Proposals 170 to 173 are submitted, and every node receives agreements 170 to 173.]
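The diagram shows three nodes ending up with the same transactions in the same order. One way a node could preserve that order even if agreements reach it out of sequence is to buffer them until the gap is filled. Out-of-order arrival is an assumption made for this sketch, not something the deck states, and `OrderedLearner` is a hypothetical name.

```python
class OrderedLearner:
    """Buffers agreements and applies them strictly in sequence order."""

    def __init__(self, last_applied):
        self.last_applied = last_applied  # e.g. 169 if TX 168 and 169 are done
        self.pending = {}                 # gsn -> event waiting for its turn
        self.applied = []                 # events applied so far, in order

    def on_agreement(self, gsn, event):
        self.pending[gsn] = event
        # Apply every buffered agreement that is next in line.
        while self.last_applied + 1 in self.pending:
            self.last_applied += 1
            self.applied.append(self.pending.pop(self.last_applied))


node = OrderedLearner(last_applied=169)
for gsn in (170, 172, 173, 171):          # 171 arrives last
    node.on_agreement(gsn, f"tx-{gsn}")
```

Agreements 172 and 173 wait in the buffer until 171 arrives, after which all three are applied in order.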
Fusion Architecture
Architecture Principles
Strict consistency of metadata with fast data ingest
1. Synchronous replication of metadata between data centers
– Using Coordination Engine
– Provides strict consistency of the namespace
2. Asynchronous replication of data over the WAN
– Data replicated in the background
– Allows fast LAN-speed data creation
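The two principles can be sketched together. The example assumes each site holds a metadata map and a block store, updates metadata at every site before a write returns, and queues the data transfer for later. `Site`, `Replicator`, and `drain` are hypothetical names, and the in-process loop stands in for the Coordination Engine and the WAN.

```python
from collections import deque


class Site:
    def __init__(self, name):
        self.name = name
        self.metadata = {}  # path -> size in bytes
        self.blocks = {}    # path -> file contents


class Replicator:
    def __init__(self, sites):
        self.sites = sites
        self.transfer_queue = deque()  # pending (source, target, path) copies

    def write(self, origin, path, data):
        # 1. Metadata is updated at every site before the write returns.
        for site in self.sites:
            site.metadata[path] = len(data)
        # 2. Data lands locally right away; remote copies are only queued.
        origin.blocks[path] = data
        for site in self.sites:
            if site is not origin:
                self.transfer_queue.append((origin, site, path))

    def drain(self):
        """Background step: ship queued data to the other sites."""
        while self.transfer_queue:
            source, target, path = self.transfer_queue.popleft()
            target.blocks[path] = source.blocks[path]


site_a, site_b = Site("A"), Site("B")
replicator = Replicator([site_a, site_b])
replicator.write(site_a, "/logs/part-0", b"abc")
```

Right after the write, site B already knows the file exists but does not yet hold its contents; calling `drain()` completes the transfer.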
How does it work?
Coordinating writes
Inter Hadoop Communication Service
• Uses HCFS API and communicates directly with Hadoop Compatible
storage systems
– Isilon
– MapR
– HDFS
– S3
• NameNode and DataNode operations are unchanged
Technical Comparison
Periodic Synchronization
DistCp
Parallel Data Ingest
Load Balancer, Streaming
Multi Data Center Hadoop Today
What's wrong with the status quo
Periodic Synchronization
DistCp
Multi Data Center Hadoop Today
Hacks currently in use
• Runs as MapReduce
• DR Data Center is read only
• Over time, Hadoop clusters
become inconsistent
• Manual and labor intensive
process to reconcile differences
• Inefficient use of the network
• N-to-N DataNode communication
Parallel Data Ingest
Load Balancer, Flume
Multi Data Center Hadoop Today
Hacks currently in use
• Hiccups in either Hadoop
cluster cause the two file
systems to diverge
• Potential to run out of buffer when
WAN is down
• Requires constant attention and
sys-admin hours to keep running
• Data created on the cluster is not
replicated
• Streaming technologies
(like Flume) used for data redirection
only handle streaming data
Use Cases
• Data is as current as possible (no
periodic syncs)
• Virtually zero downtime to recover
from regional data center failure
• Meets or exceeds strict regulatory
compliance around disaster
recovery
Disaster Recovery
• Ingest and analyze anywhere
• Analyze Everywhere
– Fraud Detection
– Equity Trading Information
– New Business
– Etc…
• Backup Datacenter(s) can be used
for work
– No idle resource
Multi Data-Center
Ingest and multi-tenant workloads
• Maximize Resource Utilization
– No idle standby
• Isolate Dev and Test Clusters
– Share data, not resources
• Carve off hardware for a specific
group
– Prevents a bad map/reduce job from
bringing down the cluster
• Guarantee Consistency of data
Zones
• Mixed Hardware Profiles
– Memory, Disk, CPU
– Isolate memory-hungry
processing (Storm/Spark)
from regular jobs
• Share data, not processing
– Isolate lower priority
(dev/test) work
Heterogeneous Hardware (Zones)
In memory analytics
• Basel III
– Consistency of Data
• Data Privacy Directive
– Data Sovereignty
• Data doesn’t leave its country of
origin
Compliance
Regulation
Guidelines
Regulatory Compliance
• Fast network protocols can keep
up with demanding network
replication
• Hadoop clusters do not require
direct communication with each
other.
– No n × m communication among
DataNodes across data centers
– Reduced firewall / SOCKS
complexities
• Reduced Attack Surface
Use Case
Security Between Data Centers
Q & A
Question and Answer
Feel free to submit your questions
Thank you
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Recently uploaded (20)

Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Asthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdfAsthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdf
VanessaRaudez
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical DebtBuckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Lynda Kane
 
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from AnywhereAutomation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Lynda Kane
 
Learn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step GuideLearn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step Guide
Marcel David
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Automation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From AnywhereAutomation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From Anywhere
Lynda Kane
 
Automation Dreamin' 2022: Sharing Some Gratitude with Your Users
Automation Dreamin' 2022: Sharing Some Gratitude with Your UsersAutomation Dreamin' 2022: Sharing Some Gratitude with Your Users
Automation Dreamin' 2022: Sharing Some Gratitude with Your Users
Lynda Kane
 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Image processinglab image processing image processing
Image processinglab image processing  image processingImage processinglab image processing  image processing
Image processinglab image processing image processing
RaghadHany
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
"Client Partnership — the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership — the Path to Exponential Growth for Companies Sized 50-5..."Client Partnership — the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership — the Path to Exponential Growth for Companies Sized 50-5...
Fwdays
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Asthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdfAsthma presentación en inglés abril 2025 pdf
Asthma presentación en inglés abril 2025 pdf
VanessaRaudez
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical DebtBuckeye Dreamin 2024: Assessing and Resolving Technical Debt
Buckeye Dreamin 2024: Assessing and Resolving Technical Debt
Lynda Kane
 
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from AnywhereAutomation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Lynda Kane
 
Learn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step GuideLearn the Basics of Agile Development: Your Step-by-Step Guide
Learn the Basics of Agile Development: Your Step-by-Step Guide
Marcel David
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Automation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From AnywhereAutomation Dreamin': Capture User Feedback From Anywhere
Automation Dreamin': Capture User Feedback From Anywhere
Lynda Kane
 
Automation Dreamin' 2022: Sharing Some Gratitude with Your Users
Automation Dreamin' 2022: Sharing Some Gratitude with Your UsersAutomation Dreamin' 2022: Sharing Some Gratitude with Your Users
Automation Dreamin' 2022: Sharing Some Gratitude with Your Users
Lynda Kane
 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Image processinglab image processing image processing
Image processinglab image processing  image processingImage processinglab image processing  image processing
Image processinglab image processing image processing
RaghadHany
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
"Client Partnership — the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership — the Path to Exponential Growth for Companies Sized 50-5..."Client Partnership — the Path to Exponential Growth for Companies Sized 50-5...
"Client Partnership — the Path to Exponential Growth for Companies Sized 50-5...
Fwdays
 

Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm

  • 1. WANdisco Fusion: Active-active data replication solution for total data protection and availability across Hadoop distributions and storage. Brett Rudenstein – Director of Product Management
  • 2. WD Fusion: Non-Intrusive; Provides Continuous Replication Across the LAN/WAN; Active/Active
  • 3. Key Issue For Sharing Data Across Clusters: LAN / WAN
  • 4. Enterprise Ready Hadoop: Characteristics of Mission Critical Applications • Require Continuous Availability – SLAs, Regulatory Compliance – Regional datacenter failure • Require Hadoop Deployed Globally – Share Data Between Data Centers – Data is Consistent and Not Eventual • Ease Administrative Burden – Reduce Operational Complexity – Simplify Disaster Recovery – Lower RTO/RPO • Allow Maximum Utilization of Resource – Within the Data Center – Across Data Centers
  • 5. Active/Active vs. Active/Passive Data Centers: What’s in a Data Center. Standby Datacenter: • Idle Resource – Single Data Center Ingest – Disaster Recovery Only • One way synchronization – DistCp • Error Prone – Clusters can diverge over time • Difficult to scale > 2 Data Centers – Complexity of sharing data increases. Active / Active: • DR Resource Available – Ingest at all Data Centers – Run Jobs in both Data Centers • Replication is Multi-Directional – active/active • Absolute Consistency – Single Virtual NameSpace spans locations • ‘N’ Data Center support – Global Hadoop shared only appropriate data
  • 7. Distributed Coordination Engine: Fault-tolerant coordination using multiple acceptors • Distributed Coordination Engine operates on participating nodes – Roles: Proposer, Learner, and Acceptor – Each node can combine multiple roles • Distributed coordination – Proposing nodes submit events as proposals to a quorum of acceptors – Acceptors agree on the order of each event in the global sequence of events – Learners learn agreements in the same deterministic order
  • 8. Consensus Algorithms: Consensus is the process of agreeing on one result among a group of participants • Coordination Engine guarantees the same state of the learners at a given GSN – Each agreement is assigned a unique Global Sequence Number (GSN) – GSNs form a monotonically increasing number series – the order of agreements – Learners have the same initial state, apply the same deterministic agreements in the same deterministic order – GSN represents “logical” time in coordinated systems • PAXOS is a consensus algorithm proven to tolerate a variety of failures – Quorum-based Consensus – Deterministic State Machine – Leslie Lamport: Part-Time Parliament (1990)
  • 9. Replicated Virtual Namespace: Coordination Engine provides equivalence of multiple namespace replicas • Coordinated Virtual Namespace controlled by Fusion Node – Is a client that acts as a proxy to other client interactions – Reads are not coordinated – Writes (Open, Close, Append, etc…) are coordinated • The namespace events are consistent with each other – Each Fusion server maintains a log of changes that would occur in the namespace – Any Fusion Node can initiate an update, which is propagated to all other Fusion Nodes • Coordination Engine establishes the global order of namespace updates – Fusion servers ensure deterministic updates in the same deterministic order to underlying file system – Systems that start from the same state and apply the same updates are equivalent
  • 10. Strict Consistency Model: One-Copy Equivalence as known in replicated databases • Coordination Engine sequences file open and close proposals into the global sequence of agreements – Applied to individual replicated folder namespace in the order of their Global Sequence Number • Fusion Replicated Folders have identical states when they reach the same GSN • One-copy equivalence – Folders may have different states at a given moment of “clock” time as the rate of consuming agreements may vary – Provides same state in logical time
  • 11. Scaling Hadoop Across Data Centers: Continuous Availability and Disaster Recovery over the WAN • The system should appear, act, and be operated as a single cluster – Instant and automatic replication of data and metadata • Parts of the cluster on different data centers should have equal roles – Data could be ingested or accessed through any of the centers • Data creation and access should typically be at LAN speed – Running time of a job executed on one data center should be as if there were no other centers • Failure scenarios: the system should provide service and remain consistent – Any Fusion node can fail and the system still provides replication – Fusion nodes can fail simultaneously on two or more data centers and the system still provides replication – WAN partitioning does not cause a data center outage – RPO is as low as possible due to continuous replication as opposed to periodic
  • 12. How DConE Works: WANdisco Active/Active Replication • Majority Quorum – A fixed number of participants – The majority must agree for change • Failure – Failed nodes are unavailable – Normal operation continues on nodes with quorum • Recovery / Self Healing – Nodes that rejoin stay in safe mode until they are caught up • Disaster Recovery – A complete loss can be brought back from another replica [Diagram: nodes A, B, and C each hold transaction log entries TX id 168 to 173 and exchange Proposal and Agree messages for 170 to 173]
  • 14. Architecture Principles: Strict consistency of metadata with fast data ingest 1. Synchronous replication of metadata between data centers – Using Coordination Engine – Provides strict consistency of the namespace 2. Asynchronous replication of data over the WAN – Data replicated in the background – Allows fast LAN-speed data creation
  • 15. How does it work? Coordinating writes
  • 16. Inter Hadoop Communication Service • Uses HCFS API and communicates directly with Hadoop Compatible storage systems – Isilon – MapR – HDFS – S3 • NameNode and DataNode operations are unchanged
  • 18. Multi Data Center Hadoop Today: What's wrong with the status quo • Periodic Synchronization: DistCp • Parallel Data Ingest: Load Balancer, Streaming
  • 19. Multi Data Center Hadoop Today: Hacks currently in use. Periodic Synchronization (DistCp) • Runs as MapReduce • DR Data Center is read only • Over time, Hadoop clusters become inconsistent • Manual and labor-intensive process to reconcile differences • Inefficient use of the network • N to N datanode communication
  • 20. Multi Data Center Hadoop Today: Hacks currently in use. Parallel Data Ingest (Load Balancer, Flume) • Hiccups in either Hadoop cluster cause the two file systems to diverge • Potential to run out of buffer when WAN is down • Requires constant attention and sys-admin hours to keep running • Data created on the cluster is not replicated • Use of streaming technologies (like Flume) for data redirection is only for streaming
  • 22. Disaster Recovery • Data is as current as possible (no periodic synchs) • Virtually zero downtime to recover from regional data center failure • Meets or exceeds strict regulatory compliance around disaster recovery
  • 23. Multi Data-Center Ingest and multi-tenant workloads • Ingest and analyze anywhere • Analyze Everywhere – Fraud Detection – Equity Trading Information – New Business – Etc… • Backup Datacenter(s) can be used for work – No idle resource
  • 24. Zones • Maximize Resource Utilization – No idle standby • Isolate Dev and Test Clusters – Share data not resource • Carve off hardware for a specific group – Prevents a bad map/reduce job from bringing down the cluster • Guarantee Consistency of data
  • 25. Heterogeneous Hardware (Zones): In memory analytics • Mixed Hardware Profiles – Memory, Disk, CPU – Isolate memory-hungry processing (Storm/Spark) from regular jobs • Share data, not processing – Isolate lower priority (dev/test) work
  • 26. Compliance Regulation Guidelines: Regulatory Compliance • Basel III – Consistency of Data • Data Privacy Directive – Data Sovereignty • Data doesn’t leave country of origin
  • 27. Use Case: Security Between Data Centers • Fast network protocols can keep up with demanding network replication • Hadoop clusters do not require direct communication with each other – No n x m communication among datanodes across datacenters – Reduced firewall / SOCKS complexities • Reduced Attack Surface
  • 28. Q & A: Question and Answer. Feel free to submit your questions
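
The Consensus Algorithms, Strict Consistency Model, and How DConE Works slides describe learners that apply agreements in Global Sequence Number (GSN) order, and a majority quorum over a fixed membership. The sketch below is only an illustration of that idea, not WANdisco code: the `Learner` class, the `has_majority` helper, and the three operation names are hypothetical, and the consensus round that produces the agreements is assumed to have already happened.

```python
from dataclasses import dataclass, field


def has_majority(agreeing: int, participants: int) -> bool:
    """True when strictly more than half of a fixed membership agrees."""
    return agreeing > participants // 2


@dataclass
class Learner:
    """Hypothetical replica that applies agreements strictly in GSN order."""
    namespace: dict = field(default_factory=dict)
    last_gsn: int = 0
    pending: dict = field(default_factory=dict)

    def deliver(self, gsn: int, op: tuple) -> None:
        # Agreements may arrive out of order; buffer them until the gap closes.
        self.pending[gsn] = op
        while self.last_gsn + 1 in self.pending:
            self.last_gsn += 1
            self._apply(self.pending.pop(self.last_gsn))

    def _apply(self, op: tuple) -> None:
        # Deterministic updates: the same op always has the same effect.
        kind, path = op
        if kind == "create":
            self.namespace[path] = "open"
        elif kind == "close":
            self.namespace[path] = "closed"
        elif kind == "delete":
            self.namespace.pop(path, None)


# Agreements already ordered by the coordination engine (assumed input).
agreements = {
    1: ("create", "/data/a"),
    2: ("close", "/data/a"),
    3: ("create", "/data/b"),
    4: ("delete", "/data/a"),
}

site_a, site_b = Learner(), Learner()
for gsn in (1, 2, 3, 4):      # site A receives agreements in order
    site_a.deliver(gsn, agreements[gsn])
for gsn in (3, 1, 4, 2):      # site B receives them out of order
    site_b.deliver(gsn, agreements[gsn])
```

Both learners end with the same namespace once they reach the same GSN, even though they received the agreements at different times and in a different arrival order. That is the "same state in logical time" property the slides call one-copy equivalence.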

Editor's Notes

  • #9: The core of a distributed CE are consensus algorithms
  • #10: Double determinism is important for equivalent evolution of the systems
  • #12: Unlike multi-cluster architecture, where clusters run independently on each data center mirroring data between them
  • #16: Fusion service: 1 or more Fusion servers that act as a proxy for clients writing into HCFS and write replicated data into the local file system (Ref: Fusion technical paper). IHC service: 1 or more IHC servers that know how to read from the local underlying file system in order to send data to other clusters (Ref: Fusion technical paper). Although the diagram shows two data centers, there is no limit on how many data centers you can use, and you can have more than one cluster in a data center. The labels on the lines indicate the purpose and direction of data flow: IHC reads from the file system, Fusion writes into it, and there is coordination between Fusion servers. The color coding indicates coherent paths as one write comes into the HCFS and is replicated across to the other data center, but it shows functions, not an accurate timeline of events. For that, see the Fusion tech paper or the sequence diagram in the reference deck. It is important to stress that active-active replication provides single copy consistency: a user or application can use the data equally from either data center. Finally, note that there are few cross-cluster network connections, which simplifies network security and management.
  • #25: Maximize resource utilization: no idle standby. Isolate dev and test clusters: share data, not resource. Carve off hardware for a specific group: prevents a bad map/reduce job from bringing down the cluster. Guarantee consistency and availability of data: data is instantly available.
  • #27: Optimized hardware profiles for job-specific tasks: batch, real-time, NoSQL (HBase). Set replication factors per sub-cluster. Use at LAN or WAN scope. Resilient to NameNode failures.
  • #29: Fusion can be set up to replicate data between the Fusion servers without directly accessing DataNodes across the WAN. This is unique compared with DistCp and could be a large selling point, as standard implementations using DistCp require all-node-to-all-node connectivity. This model would only require the Fusion servers to talk between data centers, protecting direct node access.
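
The Architecture Principles slide pairs synchronous metadata replication with asynchronous data replication. A minimal sketch of that split follows, under loud assumptions: `Site`, `Replicator`, and their methods are hypothetical names, the synchronous step is modeled as a plain loop rather than a real consensus round, and the background transfer is an in-memory copy rather than a WAN stream.

```python
from collections import deque


class Site:
    """Hypothetical data center: a namespace (metadata) plus file contents."""

    def __init__(self, name):
        self.name = name
        self.metadata = {}   # path -> expected length in bytes
        self.data = {}       # path -> bytes actually present at this site


class Replicator:
    """Hypothetical coordinator; not the Fusion or IHC implementation."""

    def __init__(self, sites):
        self.sites = sites
        self.transfers = deque()   # pending (source, target, path) copies

    def write(self, origin, path, payload):
        # Synchronous part: every site records the namespace change
        # before the write is acknowledged.
        for site in self.sites:
            site.metadata[path] = len(payload)
        # The origin stores the bytes locally, at LAN speed.
        origin.data[path] = payload
        # Asynchronous part: remote copies are only queued here.
        for site in self.sites:
            if site is not origin:
                self.transfers.append((origin, site, path))

    def drain(self):
        # Stand-in for the background data transfer between sites.
        while self.transfers:
            source, target, path = self.transfers.popleft()
            target.data[path] = source.data[path]


dc1, dc2 = Site("dc1"), Site("dc2")
replicator = Replicator([dc1, dc2])
replicator.write(dc1, "/logs/x", b"hello")

visible_before = "/logs/x" in dc2.metadata   # namespace entry already present
data_before = "/logs/x" in dc2.data          # contents not yet transferred
replicator.drain()
```

Right after the write, the remote site already lists the file in its namespace while the bytes are still queued; after `drain()` the contents match. The design choice the slide highlights is that the client only waits for the small metadata agreement, not for the bulk data to cross the WAN.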