Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Falcon
Hadoop Data Governance
Hortonworks. We do Hadoop.
Venkatesh Seetharam
Architect, Data Management
Hortonworks Inc.
PMC, Apache Falcon
PMC, Apache Knox
Proposed Apache Atlas
Agenda
• Overview
• Components
• Features
• Governance
Motivation for Apache Falcon
Simple Data Pipeline…
Diagram: an Oozie workflow on HDFS and YARN runs a Pig job and a Hive job between a Landing table, source_db.raw_input_table, and a Materialized Views table, source_db.input_table. Both tables are partitioned by hour (2014-01-01-10, 2014-01-01-12, ..., N).
Add Data Management Capability to the Pipeline
Diagram: the same Oozie pipeline (Landing table source_db.raw_input_table, Pig job, Hive job, Materialized Views table source_db.input_table), now with data management requirements added:
• Frequent feeds
• Late data arrival
• Replication
• Retention
• Archival
• Exception handling
• Lineage
• Audit
• Monitoring
Pipeline Becomes Considerably More Complex
Data management requirements result in many complex Oozie workflows (Pig job, Hive job):
frequent feeds, late data arrival, replication, retention, archival, exception handling, lineage, audit, monitoring
Introduction to Apache Falcon
Falcon Overview
Centrally Manage Data Lifecycle
– Centralized definition & management of pipelines for data ingest, process & export
Business Continuity & Disaster Recovery
– Out of the box policies for data replication & retention
– End to end monitoring of data pipelines
Address audit & compliance requirements
– Visualize data pipeline lineage
– Track data pipeline audit logs
– Tag data with business metadata
The data traffic cop
Complicated Pipeline Simplified with Apache Falcon
Falcon generates and instruments Oozie workflows
Diagram: Falcon entities (cluster, cluster, feed, feed, feed, process) are submitted and scheduled on the Falcon engine, which takes over the data management requirements: frequent feeds, late data arrival, replication, retention, archival, exception handling, lineage, audit, monitoring.
Falcon Architecture
Centralized Falcon Orchestration Framework
Diagram components: data stewards and Hadoop admins; Falcon server (API & UI) and Ambari; entity specs; Oozie (scheduled jobs); JMS (process status); HDFS / Hive; Hadoop ecosystem tools (MapReduce, Pig, Hive, Sqoop, Flume, DistCp).
Falcon Basic Concepts
• Cluster: Represents the “interfaces” to a Hadoop cluster
• Feed: Defines a “dataset”: a file, Hive table, or stream
• Process: Consumes feeds, invokes processing logic & produces feeds
All these put together represent ‘Data Pipelines’ in Hadoop
Diagram: a FEED (aka DATASET) is INPUT TO a PROCESS, and a PROCESS CREATES a FEED; feeds and processes are defined against a CLUSTER.
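The relationship between the three entity types can be sketched in a few lines of Python. This is an illustrative model only, not Falcon's API or its entity schema; the class names, field names, and the example entities (`primary`, `raw_input`, `clean_input`, `cleanse`) are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Interfaces to one Hadoop cluster (hypothetical fields)."""
    name: str
    interfaces: tuple = ()


@dataclass(frozen=True)
class Feed:
    """A dataset (file, Hive table, or stream) bound to one or more clusters."""
    name: str
    clusters: tuple


@dataclass(frozen=True)
class Process:
    """Consumes input feeds, runs processing logic, produces output feeds."""
    name: str
    cluster: Cluster
    inputs: tuple
    outputs: tuple


def pipeline_edges(processes):
    """Return (source, target) name pairs: feed -> process and process -> feed."""
    edges = []
    for proc in processes:
        edges += [(feed.name, proc.name) for feed in proc.inputs]
        edges += [(proc.name, feed.name) for feed in proc.outputs]
    return edges


# Hypothetical entities, defined separately and then linked together.
primary = Cluster("primary", interfaces=("hdfs", "oozie", "hive"))
raw_input = Feed("raw_input", clusters=(primary,))
clean_input = Feed("clean_input", clusters=(primary,))
cleanse = Process("cleanse", primary, inputs=(raw_input,), outputs=(clean_input,))

edges = pipeline_edges([cleanse])
```

Here `edges` holds the "input to" and "creates" relationships of the pipeline, which is the information a dependency view would be built from.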
Data Pipeline: Definition
• Flexible pipeline specification
– JAXB / JSON / JAVA / XML
– Modular: clusters, feeds & processes are defined separately and then linked together
– Easy to re-use across multiple pipelines
• Out of the box policies
– Predefined policies for replication, late data handling & eviction
– Easy customization of policies
• Extensible
– Plug in external solutions at any step of the pipeline
– E.g., invoke third-party data obfuscation components
Flexibility in Processing
Common types of processing engines can be tied to Falcon processes: Oozie workflows, Pig scripts, and HQL scripts
Data Pipeline: Monitoring
Centralized monitoring of data pipelines with Falcon + Ambari:
• Pipeline scheduling
• Pipeline run history
• Pipeline run alerts
Diagram: data moves through raw, clean, and prep stages on Hadoop Cluster-1 (primary site) and Hadoop Cluster-2 (DR site).
Replication with Falcon
Diagram: on the primary Hadoop cluster, data moves through Staged, Cleansed, Conformed, and Presented stages. Staged Data and Presented Data are replicated to the failover Hadoop cluster. BI / Analytics: BusinessObjects BI.
• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters
Data Retention with Falcon
Retention policy (diagram):
• Staged Data: retain 5 years
• Cleansed Data: retain 3 years
• Conformed Data: retain 3 years
• Presented Data: retain last copy only
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
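To make "retention policies expressed in one place" concrete, here is a minimal Python sketch. It is not how Falcon implements eviction; the `RETENTION` table, the function name, the pairing of limits to datasets (read in order from the slide's diagram), and the 365-day year are all assumptions of this sketch.

```python
from datetime import datetime, timedelta

# Assumed pairing of datasets to limits; None means "keep only the last copy".
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,
}


def instances_to_evict(dataset, instance_times, now):
    """Return the instance timestamps that fall outside the dataset's policy."""
    limit = RETENTION[dataset]
    ordered = sorted(instance_times)
    if limit is None:
        # Keep only the newest instance, evict everything older.
        return ordered[:-1]
    cutoff = now - limit
    return [t for t in ordered if t < cutoff]


now = datetime(2014, 1, 1)
instances = [datetime(2008, 6, 1), datetime(2010, 6, 1), datetime(2013, 6, 1)]

evict_staged = instances_to_evict("staged", instances, now)
evict_cleansed = instances_to_evict("cleansed", instances, now)
evict_presented = instances_to_evict("presented", instances, now)
```

The point of the design is that the policy lives in a single table next to the dataset definitions, so no workflow has to carry its own cleanup logic.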
Late Data Handling with Falcon
Diagram: online transaction data (via Sqoop) and web log data (via FTP) land as staged data and are joined into combined data. Processing waits up to 4 hours for the FTP data to arrive.
• Processing waits until all required input data is available
• Checks for late data arrivals and retriggers processing as necessary
• Eliminates writing complex data handling rules within applications
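The waiting and retriggering rules above can be sketched as two small Python functions. This is a simplified illustration, not Falcon's scheduler; the function names, the `MAX_WAIT` constant, and the `timed_out` state are assumptions (the slide does not say what happens once the wait expires).

```python
from datetime import datetime, timedelta

# "Wait up to 4 hours" from the slide; the constant name is hypothetical.
MAX_WAIT = timedelta(hours=4)


def decide(nominal_time, required, arrivals, now, max_wait=MAX_WAIT):
    """Return 'run', 'wait', or 'timed_out' for one process instance.

    arrivals maps an input name to the time its data landed;
    an input that has not arrived yet is simply absent.
    """
    if all(name in arrivals for name in required):
        return "run"
    if now - nominal_time <= max_wait:
        return "wait"
    return "timed_out"  # assumption: behaviour after the wait is not specified


def needs_rerun(processed_at, arrivals):
    """True if any input data landed after the instance was processed."""
    return any(landed > processed_at for landed in arrivals.values())


nominal = datetime(2014, 1, 1, 10)
required = ("transactions", "weblogs")
arrivals = {"transactions": datetime(2014, 1, 1, 10, 5)}

# Web logs are still missing: inside the window we wait, outside it we stop.
state_at_noon = decide(nominal, required, arrivals, now=datetime(2014, 1, 1, 12))
state_at_three = decide(nominal, required, arrivals, now=datetime(2014, 1, 1, 15))

# Once the FTP data lands, the instance can run.
arrivals["weblogs"] = datetime(2014, 1, 1, 13, 30)
state_after_ftp = decide(nominal, required, arrivals, now=datetime(2014, 1, 1, 13, 30))
```

Keeping these checks in the scheduling layer is what lets applications avoid writing their own late-data rules.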
Metadata Services with HCatalog
Apache Falcon provides metadata services via HCatalog
• Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via REST API
Shared table and schema management opens the platform
Diagram: raw Hadoop data is inconsistent or unknown, with tool-specific access; HCatalog adds table access, aligned metadata, and a REST API.
Data Governance in Apache Falcon
Data Pipeline: Tracing
Diagram: Purchase, Customer, Product, and Store feeds
• Data pipeline dependencies: view dependencies between clusters, datasets and processes
• Data pipeline tagging: add arbitrary tags to feeds & processes
Coming soon
• Data pipeline audits: know who modified a dataset, when, and into what
• Data pipeline lineage: analyze how a dataset reached a particular state
Custom Metadata in Falcon
• Metadata on Ingest (Content)
– What is the format I expect my data to be in?
– Which source systems did the data come from, and who are the owners?
– Answer: ingest descriptors + HCatalog schema versioning
• Metadata for Security (Access Controls)
– How is each column blinded or encrypted?
– Can I trust that I can join data across tables? What if email is encrypted differently?
– Answer: security descriptors
• Metadata for lineage (Source, History)
– How do I chase down sources of data leading to reports and data?
– Answer: lineage carried forward per workflow
• Metadata for marts (Usage Constraints, Enrichment)
– How do I materialize views and drop views as needed?
Entity Dependency in Falcon
• Dependencies between Falcon entity definitions: cluster, feed & process
– Lineage attributes: workflows, input/output feed windows, user, input and output paths, workflow engine, input/output size
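Answering "how did this dataset get here?" amounts to walking dependency edges upstream. The sketch below uses a plain dictionary rather than a graph database, and every entity name in it is hypothetical; it illustrates the traversal only, not Falcon's lineage API.

```python
from collections import deque

# Hypothetical lineage edges: each key was produced from the listed sources.
PRODUCED_FROM = {
    "sales_report": ["enrich_purchases"],
    "enrich_purchases": ["purchase_feed", "customer_feed"],
    "purchase_feed": [],
    "customer_feed": [],
}


def upstream(node, produced_from):
    """Breadth-first walk returning every ancestor of node, nearest first."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in produced_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream("sales_report", PRODUCED_FROM)
```

The `seen` set keeps the walk from revisiting an entity that feeds several downstream processes.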
Lineage in Falcon
Audit, Tagging and Access Control
• Tagging
– Allows custom tags in entities
– Can decorate process entities with pipeline names
• Access Control
– Support for ACL in entities
– Authorization is driven by ACLs in entities
• Audit
– Each execution is controlled by Falcon and runs are audited
– Correlate the execution with Lineage (Design)
• Search
– Search based on Tags, Pipelines, etc.
– Full-text search
Technology
• Metadata Repository
– Titan Graph Database
– Pluggable backing store: BerkeleyDB JE or HBase
• Entity Metadata
– Tags and entities are stored in the repository
• Execution Metadata
– Execution metadata are stored in the repository as well – this is unique to Falcon
– Optional inputs
• Search
– Pluggable backend: Solr or Elasticsearch
New in Apache Falcon 0.6.0
What is coming soon?
DR Mirroring of HDFS with Recipes
• Mirroring for disaster recovery and business continuity use cases
• Customizable for multiple targets and frequency of synchronization
• Recipes: template model for re-use of complex workflows
Diagram: a recipe combines properties with a workflow template; example recipes: Reduce, Cleanse, Replicate.
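The "properties + workflow template" idea can be illustrated with Python's standard `string.Template`. The template text, the `$` placeholder syntax, and all property names and values below are assumptions of this sketch and do not reproduce Falcon's recipe format.

```python
from string import Template

# Hypothetical workflow template; the $-placeholders are this sketch's own syntax.
WORKFLOW_TEMPLATE = Template(
    "replicate $source_dir on $source_cluster "
    "to $target_dir on $target_cluster every $frequency"
)


def instantiate(template, properties):
    """Fill the template; substitute() raises KeyError if a property is missing."""
    return template.substitute(properties)


# One template, re-used with a different properties set per mirroring job.
dr_mirror = instantiate(WORKFLOW_TEMPLATE, {
    "source_cluster": "primary",
    "source_dir": "/data/presented",
    "target_cluster": "failover",
    "target_dir": "/data/presented",
    "frequency": "hours(1)",
})
```

Failing loudly on a missing property is deliberate here: a half-filled replication job is harder to diagnose than one that never gets submitted.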
Replication to Cloud
• Seamlessly replicate to cloud targets
• Replicate from the cloud as a source
• Support for Amazon S3 and Microsoft Azure
Diagram: on-prem cluster replicating to and from Azure and Amazon S3
Q & A
Thank you!
Learn more at:
hortonworks.com/hadoop/falcon/
Ad

More Related Content

What's hot (20)

Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.final
Hortonworks
 
HDP Next: Governance
HDP Next: GovernanceHDP Next: Governance
HDP Next: Governance
DataWorks Summit
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Hortonworks
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark Summit
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
Kiran Kamreddy
 
Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'
Hortonworks
 
Hortonworks and Clarity Solution Group
Hortonworks and Clarity Solution Group Hortonworks and Clarity Solution Group
Hortonworks and Clarity Solution Group
Hortonworks
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
DataWorks Summit/Hadoop Summit
 
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop Search
Hortonworks
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Hortonworks
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
Hortonworks
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
Hortonworks
 
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasGDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
DataWorks Summit
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance Initiative
DataWorks Summit
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
DataWorks Summit
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
DataWorks Summit/Hadoop Summit
 
Hortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinarHortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinar
Hortonworks
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
Madhan Neethiraj
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.final
Hortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Hortonworks
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark Summit
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
Kiran Kamreddy
 
Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'
Hortonworks
 
Hortonworks and Clarity Solution Group
Hortonworks and Clarity Solution Group Hortonworks and Clarity Solution Group
Hortonworks and Clarity Solution Group
Hortonworks
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
DataWorks Summit/Hadoop Summit
 
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop Search
Hortonworks
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Hortonworks
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
Hortonworks
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
Hortonworks
 
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasGDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
DataWorks Summit
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance Initiative
DataWorks Summit
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
DataWorks Summit
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
DataWorks Summit/Hadoop Summit
 
Hortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinarHortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinar
Hortonworks
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
Madhan Neethiraj
 

Viewers also liked (9)

Smarter Analytics: Big Data and Predictive Governance
Smarter Analytics: Big Data and Predictive GovernanceSmarter Analytics: Big Data and Predictive Governance
Smarter Analytics: Big Data and Predictive Governance
IBM Danmark
 
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...
DATAVERSITY
 
Big data governance as a corporate governance imperative
Big data governance as a corporate governance imperativeBig data governance as a corporate governance imperative
Big data governance as a corporate governance imperative
Guy Pearce
 
Why You Need to Govern Big Data
Why You Need to Govern Big DataWhy You Need to Govern Big Data
Why You Need to Govern Big Data
IBM Analytics
 
Data Governance in the Big Data Era
Data Governance in the Big Data EraData Governance in the Big Data Era
Data Governance in the Big Data Era
Pieter De Leenheer
 
Big Data Security and Governance
Big Data Security and GovernanceBig Data Security and Governance
Big Data Security and Governance
DataWorks Summit/Hadoop Summit
 
Apache Atlas. Data Governance for Hadoop. Strata London 2015
Apache Atlas. Data Governance for Hadoop. Strata London 2015Apache Atlas. Data Governance for Hadoop. Strata London 2015
Apache Atlas. Data Governance for Hadoop. Strata London 2015
Sean Roberts
 
Implementing Effective Data Governance
Implementing Effective Data GovernanceImplementing Effective Data Governance
Implementing Effective Data Governance
Christopher Bradley
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs
 
Smarter Analytics: Big Data and Predictive Governance
Smarter Analytics: Big Data and Predictive GovernanceSmarter Analytics: Big Data and Predictive Governance
Smarter Analytics: Big Data and Predictive Governance
IBM Danmark
 
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...
Real-World Data Governance Webinar: Big Data Governance - What Is It and Why ...
DATAVERSITY
 
Big data governance as a corporate governance imperative
Big data governance as a corporate governance imperativeBig data governance as a corporate governance imperative
Big data governance as a corporate governance imperative
Guy Pearce
 
Why You Need to Govern Big Data
Why You Need to Govern Big DataWhy You Need to Govern Big Data
Why You Need to Govern Big Data
IBM Analytics
 
Data Governance in the Big Data Era
Data Governance in the Big Data EraData Governance in the Big Data Era
Data Governance in the Big Data Era
Pieter De Leenheer
 
Apache Atlas. Data Governance for Hadoop. Strata London 2015
Apache Atlas. Data Governance for Hadoop. Strata London 2015Apache Atlas. Data Governance for Hadoop. Strata London 2015
Apache Atlas. Data Governance for Hadoop. Strata London 2015
Sean Roberts
 
Implementing Effective Data Governance
Implementing Effective Data GovernanceImplementing Effective Data Governance
Implementing Effective Data Governance
Christopher Bradley
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs
 
Ad

Similar to Driving Enterprise Data Governance for Big Data Systems through Apache Falcon (20)

Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Hortonworks
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
DataWorks Summit/Hadoop Summit
 
Falcon Meetup
Falcon Meetup Falcon Meetup
Falcon Meetup
Hortonworks
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
Bryan Bende
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Haimo Liu
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
Yifeng Jiang
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
DataWorks Summit
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
Milind Pandit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Hortonworks
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
DataWorks Summit/Hadoop Summit
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
Bryan Bende
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Haimo Liu
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
Yifeng Jiang
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
DataWorks Summit
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
Milind Pandit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
 
Ad

Driving Enterprise Data Governance for Big Data Systems through Apache Falcon

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Falcon Hadoop Data Governance Hortonworks. We do Hadoop.
  • 2. Venkatesh Seetharam: Architect, Data Management, Hortonworks Inc.; PMC, Apache Falcon; PMC, Apache Knox; Proposed Apache Atlas
  • 3. Agenda: Overview, Components, Features, Governance
  • 4. Motivation for Apache Falcon
  • 5. Simple Data Pipeline… (diagram labels: HDFS, YARN, Landing, Materialized Views, Oozie Workflow, Pig Job, Hive Job; tables source_db.raw_input_table and source_db.input_table, each with partitions 2014-01-01-10, 2014-01-01-12, N)
  • 6. Add Data Management Capability to the Pipeline (same diagram as slide 5, with added requirements: Frequent Feeds, Late Data Arrival, Replication, Retention, Archival, Exception Handling, Lineage, Audit, Monitoring)
  • 7. Pipeline Becomes Considerably More Complex: the data management requirements (Frequent Feeds, Late Data Arrival, Replication, Retention, Archival, Exception Handling, Lineage, Audit, Monitoring) applied to the Oozie workflow of Pig and Hive jobs result in many complex Oozie workflows
  • 8. Introduction to Apache Falcon
  • 9. Falcon Overview: the data traffic cop • Centrally Manage Data Lifecycle – Centralized definition & management of pipelines for data ingest, process & export • Business Continuity & Disaster Recovery – Out of the box policies for data replication & retention – End to end monitoring of data pipelines • Address audit & compliance requirements – Visualize data pipeline lineage – Track data pipeline audit logs – Tag data with business metadata
  • 10. Complicated Pipeline Simplified with Apache Falcon: Falcon Generates and Instruments Oozie Workflows (diagram labels: Submit & Schedule Falcon Entities: Cluster, Feed, Process; Falcon Engine; Frequent Feeds, Late Data Arrival, Replication, Retention, Archival, Exception Handling, Lineage, Audit, Monitoring)
  • 11. Falcon Architecture: Centralized Falcon Orchestration Framework (diagram labels: Data stewards + Hadoop admins; API & UI; Ambari; Falcon Server; JMS; Entity Specs; Scheduled Jobs; Process Status; Oozie; HDFS / Hive; Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCP)
  • 12. Falcon Basic Concepts • Cluster: Represents the “interfaces” to a Hadoop cluster • Feed: Defines a “dataset”: File, Hive Table or Stream • Process: Consumes feeds, invokes processing logic & produces feeds • All these put together represent ‘Data Pipelines’ in Hadoop (diagram labels: CLUSTER, FEED aka DATASET, PROCESS, INPUT TO, CREATES)
  • 13. Data Pipeline: Definition • Flexible pipeline specification – JAXB / JSON / JAVA / XML – Modular: clusters, feeds & processes defined separately and then linked together – Easy to re-use across multiple pipelines • Out of the box policies – Predefined policies for replication, late data handling & eviction – Easy customization of policies • Extensible – Plug in external solutions at any step of the pipeline – E.g. invoke third party data obfuscation components
  • 14. Flexibility in Processing: Common types of processing engines can be tied to Falcon processes: Oozie workflows, Pig scripts, HQL scripts
  • 15. Data Pipeline: Monitoring • Centralized monitoring of data pipeline with Falcon + Ambari: pipeline run alerts, pipeline run history, pipeline scheduling (diagram labels: Primary site, Hadoop Cluster-1; DR site, Hadoop Cluster-2; data stages raw, clean, prep on each cluster)
  • 16. Replication with Falcon • Falcon manages workflow and replication • Enables business continuity without requiring full data reprocessing • Failover clusters can be smaller than primary clusters (diagram labels: Primary Hadoop Cluster: Staged Data, Cleansed Data, Conformed Data, Presented Data; Failover Hadoop Cluster: Staged Data, Presented Data; Replication; BI / Analytics; BusinessObjects BI)
  • 17. Data Retention with Falcon • Sophisticated retention policies expressed in one place • Simplify data retention for audit, compliance, or for data re-processing (diagram labels: Retention Policy; Staged Data, Presented Data, Cleansed Data, Conformed Data; Retain 5 Years, Retain Last Copy Only, Retain 3 Years, Retain 3 Years)
  • 18. Late Data Handling with Falcon • Processing waits until all required input data is available • Checks for late data arrivals and re-triggers processing as necessary • Eliminates writing complex data handling rules within applications (diagram labels: Online Transaction Data (via Sqoop); Web Log Data (via FTP); Staged Data; Combined Data; Wait up to 4 hours for FTP data to arrive)
  • 19. Metadata Services with HCatalog: Apache Falcon provides metadata services via HCatalog • Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive) • Accessibility: share data as tables in and out of HDFS • Availability: enables flexible, thin-client access via REST API • Shared table and schema management opens the platform (diagram labels: Raw Hadoop data: inconsistent, unknown, tool-specific access; HCatalog: table access, aligned metadata, REST API)
  • 20. Data Governance in Apache Falcon
  • 21. Data Pipeline: Tracing • Data pipeline dependencies: View dependencies between clusters, datasets and processes • Data pipeline tagging: Add arbitrary tags to feeds & processes • Coming Soon • Data pipeline audits: Know who modified a dataset when and into what • Data pipeline lineage: Analyze how a dataset reached a particular state (diagram labels: Purchase feed, Customer feed, Product feed, Store feed)
  • 22. Custom Metadata in Falcon • Metadata on Ingest (Content) – What is the format I expect my data to be in? – What source systems did the data come from, owners? – Answer: ingest descriptors + HCat schema versioning • Metadata for Security (Access Controls) – How is each column blinded or encrypted? – Can I trust that I can join data across tables? What if email is encrypted differently? – Answer: security descriptors • Metadata for lineage (Source, History) – How do I chase down sources of data leading to reports and data? – Answer: lineage carried forward per workflow • Metadata for marts (Usage Constraints, Enrichment) – How do I materialize views and drop views as needed?
  • 23. Entity Dependency in Falcon • Dependencies between Falcon entity definitions: cluster, feed & process – Lineage attributes: workflows, input/output feed windows, user, input and output paths, workflow engine, input/output size
  • 24. Lineage in Falcon
  • 25. Audit, Tagging and Access Control • Tagging – Allows custom tags in entities – Can decorate process entities with pipeline names • Access Control – Support for ACL in entities – Authorization driven by ACLs in entities • Audit – Each execution is controlled by Falcon and runs are audited – Correlate the execution with Lineage (Design) • Search – Search based on Tags, Pipelines, etc. – Full-text search
  • 26. Technology • Metadata Repository – Titan Graph Database – Pluggable backing store: BerkeleyDB JE, HBase • Entity Metadata – Tags, Entities are stored in the repository • Execution Metadata – Execution metadata are stored in the repository as well – this is unique to Falcon – Optional inputs • Search – Pluggable backend – Solr or Elasticsearch
  • 27. New in Apache Falcon 0.6.0: What is coming soon?
  • 28. DR Mirroring of HDFS with Recipes • Mirroring for Disaster Recovery and Business continuity use cases • Customizable for multiple targets and frequency of synchronization • Recipes: template model for re-use of complex workflows (diagram labels: Recipe = Properties + Workflow Template; Reduce, Cleanse, Replicate)
  • 29. Replication to Cloud • Seamlessly replicate to Cloud targets • Replicate from Cloud as a source • Support for Amazon S3 and Microsoft Azure (diagram labels: On Prem Cluster, Amazon S3, Azure)
  • 30. Page30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 31. Page31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 32. Page32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 33. Page33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 34. Page34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 35. Page35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 36. Q & A
  • 37. Thank you! Learn more at: hortonworks.com/hadoop/falcon/
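
The entity model on slide 12 and the per-dataset retention idea on slide 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (Feed, Process, expired_instances, upstream_feeds) are hypothetical and are not Falcon's API, and the retention periods assigned to each feed are made up for the example, since the slide text does not make clear which policy belongs to which dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Feed:
    """A dataset with a retention limit (hypothetical model, not Falcon's API)."""
    name: str
    retention: timedelta


@dataclass(frozen=True)
class Process:
    """Consumes input feeds and produces output feeds, referenced by name."""
    name: str
    inputs: tuple
    outputs: tuple


def expired_instances(feed, instance_times, now):
    """Return the instance timestamps that are older than the feed's retention limit."""
    cutoff = now - feed.retention
    return [t for t in instance_times if t < cutoff]


def upstream_feeds(feed_name, processes):
    """Walk process definitions backwards to collect every feed that leads to feed_name."""
    found = set()
    frontier = [feed_name]
    while frontier:
        current = frontier.pop()
        for proc in processes:
            if current in proc.outputs:
                for src in proc.inputs:
                    if src not in found:
                        found.add(src)
                        frontier.append(src)
    return found


# Illustrative values only: the retention periods per feed are assumptions.
staged = Feed("staged", timedelta(days=5 * 365))
cleansed = Feed("cleansed", timedelta(days=3 * 365))
processes = [
    Process("cleanse", inputs=("staged",), outputs=("cleansed",)),
    Process("conform", inputs=("cleansed",), outputs=("conformed",)),
]
now = datetime(2014, 6, 1)
instances = [datetime(2010, 1, 1), datetime(2012, 1, 1), datetime(2014, 1, 1)]

print(expired_instances(cleansed, instances, now))
print(sorted(upstream_feeds("conformed", processes)))
```

Keeping feeds and processes as separate definitions that refer to each other by name mirrors the "defined separately and then linked together" point on slide 13, and the backward walk is one simple way to answer the lineage question of how a dataset reached its current state.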

Editor's Notes

  • #4: ET: Tactical POS Added benefits – consortium
  • #5: Transition to Andrew
  • #9: Transition to Andrew
  • #10: Thanks Justin. Here are Falcon’s primary features. 1. The first is to manage the data lifecycle in one common place. 2. The second is to facilitate quick deployment of replication for business continuity and disaster recovery use cases. This includes monitoring and a base set of policies for replication and retention. 3. Lastly, Falcon provides foundational audit and compliance features: visualization and tracking of entity lineage and collection of audit logs.
  • #12: This is the high-level Falcon architecture. Falcon runs as a standalone server as part of your Hadoop cluster. A user creates entity specifications and submits them to Falcon using the API. Falcon validates and saves entity specifications to HDFS. Falcon uses Oozie as its default scheduler. The Falcon UI provides a dashboard for viewing entities. Ambari integration is provided for management.
  • #13: Feeds have location, replication schedule and retention policies. Meta info includes frequency, where data is coming from (source), where to replicate (target), and how long to retain.
  • #14: Let’s take a look at the data pipeline or workflow. ** read high level **
  • #15: Hive – HQL scripts; Pig scripts; Oozie workflows
  • #16: Once a pipeline is created you’ll want to run it. This means you probably want monitoring as well. Falcon in conjunction with Ambari has centralized monitoring. ** bullets **
  • #17: OK, let’s chat about replication with Falcon, which is very efficient. This example has a primary cluster with a typical workflow. There is a business requirement to replicate this to a failover cluster. ** bullets **
  • #18: Falcon has flexible data retention policies; it’s able to model the business compliance requirements. Sophisticated retention policies are expressed in one place. This simplifies data retention for audit, compliance, or data re-processing. In this example, different datasets in a workflow can have different retention policies.
  • #19: We realize that many types of workflow have inputs from different systems, which may be in different regions. Falcon has built-in logic to handle this potentially tricky situation.
  • #20: HCatalog – metadata shared across the whole platform. File locations become abstract (not hard-coded). Data types become shared (not redefined per tool). Partitioning and HDFS-optimized.
  • #21: Transition to Andrew
  • #22: Last but not least, you’ll want to trace or track the data pipeline. We trace:
  • #29: The first is DR mirroring with Recipes. Recipes can actually be used in a number of different use cases, but we’ll just focus on mirroring.
  • #30: Placeholder pic
  • #31: Dashboard view. Summary counts. In-place filters by user-defined tags.
  • #32: The entity creation interface is contextual and has field-level semantic checks to help the user along.
  • #33: As you can see on the right, the actual XML is generated as the UI fields are filled out.
  • #34: This can be helpful if you want to copy portions and skip writing an entity from scratch.
  • #36: Lastly, the new UI allows you to drill down to the detail level for each entity type.
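
The "potentially tricky situation" in note #19 (late-arriving input, with the 4-hour wait shown on slide 18) can be illustrated with a small decision function. This is a simplified sketch, not Falcon's actual logic: the function name next_action, its parameters, and the four returned action names are all hypothetical, and the rule that a rerun is only triggered for changes arriving before nominal time plus the cut-off is an assumption made for the example.

```python
from datetime import datetime, timedelta


def next_action(nominal_time, cut_off, inputs_ready, last_run, last_input_change):
    """Decide what to do for one process instance (hypothetical logic, not Falcon's code).

    nominal_time      -- the time the instance was scheduled for
    cut_off           -- how long after nominal_time late data is still accepted
    inputs_ready      -- True once every required input is available
    last_run          -- when the instance last ran, or None if it has not run
    last_input_change -- when an input last changed, or None if unknown
    """
    if last_run is None:
        # Processing waits until all required input data is available.
        return "run" if inputs_ready else "wait"
    deadline = nominal_time + cut_off
    # Late data counts only if it arrived after the run and before the deadline.
    if last_input_change is not None and last_run < last_input_change <= deadline:
        return "rerun"
    return "done"


nominal = datetime(2014, 1, 1, 10, 0)
cut_off = timedelta(hours=4)  # mirrors the "wait up to 4 hours" label on slide 18
ran_at = datetime(2014, 1, 1, 10, 5)

print(next_action(nominal, cut_off, False, None, None))
print(next_action(nominal, cut_off, True, ran_at, datetime(2014, 1, 1, 12, 0)))
print(next_action(nominal, cut_off, True, ran_at, datetime(2014, 1, 1, 15, 0)))
```

Putting this rule in one place, rather than inside each application, is the point the slide makes about eliminating hand-written data handling rules.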