SECURING DATA IN HYBRID ENVIRONMENTS USING APACHE RANGER
Madhan Neethiraj, Hortonworks (Apache Ranger PMC, Apache Atlas PMC)
Don Bosco Durai, Privacera (Apache Ranger PMC)
Disclaimer
‣ This document may contain product features and technology directions that are under
development, may be under development in the future or may ultimately not be developed.
‣ Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from
inception to release through Apache; however, technical feasibility, market demand, user feedback
and the overarching Apache Software Foundation community development process can all affect
timing and final delivery.
‣ This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in any
generally available product.
‣ Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
‣ Since this document contains an outline of general product development plans, customers should
not rely upon it when making purchasing decisions.
Agenda
• Hybrid Environment Use Cases
• Security Requirements for Cloud
• Implementation Challenges
• Hybrid Environment Security Implementation Flow
• Demo
• Roadmap
• Q & A
HYBRID ENVIRONMENT
- Predictable workload
- ETL and data wrangling
- BI and DW use cases
- Multiple Hadoop-enabled services and tools
- On-demand processing units
- Analytical and other services from Cloud provider
- Share data with 3rd party vendors
- Micro-Services
[Diagram labels: On Premise, DATA]
PREFERRED SECURITY REQUIREMENTS FOR CLOUD
1. Access permissions should be consistent in all environments
2. Protect personal and sensitive data by anonymizing or tokenizing
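The second requirement can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC-SHA256) as the pseudonymization method; the deck does not say which method the demo tools use, and the key, function names and token length here are hypothetical.

```python
import base64
import hashlib
import hmac

# Hypothetical demo key; a real deployment would fetch this from a key manager.
SECRET_KEY = b"demo-only-secret"

def tokenize(value: str, length: int = 6, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic token for a sensitive value (same input, same token)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Base32 yields only A-Z and 2-7, so tokens are safe in CSV and SQL text.
    return base64.b32encode(digest).decode("ascii")[:length]

def anonymize_row(row: dict, sensitive_columns: set) -> dict:
    """Tokenize only the sensitive columns and leave the others in the clear."""
    return {
        column: tokenize(value) if column in sensitive_columns else value
        for column, value in row.items()
    }

row = {"Name": "John Doe", "Street": "345 First St", "City": "Eureka", "State": "CA"}
masked = anonymize_row(row, {"Name", "Street"})
print(masked)
```

Because the token is deterministic, joins and group-by operations on the tokenized column still work in the cloud copy. A short token such as this one can collide on large data sets, so the length would need tuning.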
IMPLEMENTATION CHALLENGES
● Permission models in Hadoop and the cloud are not the same
● Users/groups in the two systems might not be the same
● Data constantly moves within the cluster and permissions might change along with it
Hybrid Environment: overview
[Diagram: Raw Data, Hive, Atlas, Ranger, Discovery (Privacera, Hortonworks DSS), S3 and ACL Sync; users Sally and Mark]
Flow steps: 1. Ingest, 2. Scan, 3. Classify, 4. ETL, 5. Metadata, Lineage, 6. Scan, 7. Export

Clear data:
Name        Street        City    State
John Doe    345 First St  Eureka  CA
Jane Smith  876 Main St   Newark  NJ

Anonymized data (shown twice in the diagram):
Name    Street      City    State
BXPHDE  YNiIkjoiTH  Eureka  CA
HNEQON  WNDUHNd     Newark  NJ
DISCOVERY AND CLASSIFICATION
● Centrally store classifications in Apache Atlas
● Classify resources via
○ Apache Atlas UI
○ Apache Atlas API
● Auto classify using discovery tools like Privacera,
Hortonworks Data Steward Studio
● Different types of tags
○ Security (SSN, CC, EMAIL, NAME, ADDRESS, etc.)
○ Business (SALES, HR, FINANCE, MARKETING, etc.)
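Classifying a resource through the Apache Atlas API can be sketched as below. This is a minimal sketch under assumptions: the endpoint path follows the Atlas v2 REST API as commonly documented (verify it against your Atlas version), the host and entity GUID are hypothetical, and authentication is omitted. The request is only built, not sent.

```python
import json
import urllib.request

def build_classification_request(atlas_url: str, entity_guid: str, tags: list) -> urllib.request.Request:
    """Build (but do not send) a request that attaches classifications to an entity."""
    # Each classification is an object naming an existing classification type.
    body = json.dumps([{"typeName": tag} for tag in tags]).encode("utf-8")
    url = f"{atlas_url.rstrip('/')}/api/atlas/v2/entity/guid/{entity_guid}/classifications"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Hypothetical Atlas host and entity GUID, used only for illustration.
req = build_classification_request(
    "http://atlas.example.com:21000", "hypothetical-guid-1234", ["SALES", "EMAIL"]
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a call to urllib.request.urlopen(req) with credentials added; the classification types (SALES, EMAIL) must already be defined in Atlas.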
RANGER TAG BASED ACCESS POLICY
RANGER TAG BASED DYNAMIC MASKING
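The idea behind tag-based access and masking policies can be sketched with a toy evaluator. This is not Ranger's policy engine or its API; the policy tables, group names and the fixed "****" mask are assumptions for illustration, with the users and sample row taken from the demo slides.

```python
# Hypothetical policy tables keyed by classification tag.
ACCESS_POLICIES = {"SALES": {"sales", "marketing"}}   # groups allowed to read
MASK_POLICIES = {                                      # groups that see masked values
    "NAME": {"marketing"},
    "EMAIL": {"marketing"},
    "ADDRESS": {"marketing"},
}

TABLE_TAGS = {"SALES"}
COLUMN_TAGS = {"email": {"EMAIL"}, "name": {"NAME"}, "street": {"ADDRESS"}}

def read_row(user_groups: set, row: dict) -> dict:
    """Apply the access policy first, then mask tagged columns for the caller."""
    # Deny by default: some table tag must grant access to one of the user's groups.
    allowed = any(user_groups & ACCESS_POLICIES.get(tag, set()) for tag in TABLE_TAGS)
    if not allowed:
        raise PermissionError("access denied by tag-based policy")
    result = {}
    for column, value in row.items():
        masked = any(
            user_groups & MASK_POLICIES.get(tag, set())
            for tag in COLUMN_TAGS.get(column, set())
        )
        result[column] = "****" if masked else value
    return result

row = {"item_id": 435, "amount": 439.34, "email": "jd@y.com",
       "name": "John Doe", "street": "345 First St", "city": "Eureka", "state": "CA"}

sally_view = read_row({"sales"}, row)      # Sally (Sales): clear data
mark_view = read_row({"marketing"}, row)   # Mark (Marketing): masked data
print(sally_view)
print(mark_view)
```

The policies refer only to tags, never to table or column names, so a newly classified table picks up the same rules without a policy change.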
DATA UPLOAD TO S3
● Use Hive Export feature
● Dynamic Anonymization based on ETL User (e.g. s3_etl)
● Classification Based Access Authorization Policies to
block copying highly restricted data
● Row level filter
● Ranger policy to restrict access to S3 from Hive
ANONYMIZED DATA IN S3
ACL SYNC TO S3
● S3 Permission model
○ Bucket Policies (ACL Sync sets bucket policies)
○ User Policies
○ Object ACLs
ACL SYNC TO S3
[Diagram: Ranger, ACL Sync, S3]
1. Kafka Notification
   entity_type: aws_s3_pseudo_dir
   qualifiedName: s3a://dws2018-demo/sales/sales_may_2018
   tags: SALES
2. Retrieve tag ACLs
3. Set ACLs in S3
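The three steps above can be sketched as a function that turns a notification plus tag ACLs into an S3 bucket policy document. This is a simplified sketch, not the sync tool's actual code: the notification is reduced to the three fields shown on the slide, the tag-to-principal mapping and the account ID are hypothetical, and only read access (s3:GetObject) is granted.

```python
import json
from urllib.parse import urlparse

# Hypothetical mapping from classification tag to AWS principals allowed to read.
TAG_ACLS = {
    "SALES": [
        "arn:aws:iam::123456789012:user/sally",
        "arn:aws:iam::123456789012:user/mark",
    ],
}

def s3_resource_arn(qualified_name: str) -> str:
    """Turn s3a://bucket/prefix into an ARN covering every object under the prefix."""
    parsed = urlparse(qualified_name)
    if parsed.scheme not in ("s3", "s3a", "s3n") or not parsed.netloc:
        raise ValueError(f"not an S3 location: {qualified_name}")
    prefix = parsed.path.strip("/")
    key_pattern = f"{prefix}/*" if prefix else "*"
    return f"arn:aws:s3:::{parsed.netloc}/{key_pattern}"

def build_bucket_policy(notification: dict, tag_acls: dict) -> dict:
    """Build a bucket policy granting read access to the principals of each tag."""
    principals = sorted(
        {arn for tag in notification["tags"] for arn in tag_acls.get(tag, [])}
    )
    statements = []
    if principals:  # a statement with no principal would be invalid, so skip it
        statements.append({
            # Assumes tag names are alphanumeric, which keeps the Sid simple.
            "Sid": "TagSync" + "".join(notification["tags"]),
            "Effect": "Allow",
            "Principal": {"AWS": principals},
            "Action": ["s3:GetObject"],
            "Resource": s3_resource_arn(notification["qualifiedName"]),
        })
    return {"Version": "2012-10-17", "Statement": statements}

notification = {
    "entity_type": "aws_s3_pseudo_dir",
    "qualifiedName": "s3a://dws2018-demo/sales/sales_may_2018",
    "tags": ["SALES"],
}
policy = build_bucket_policy(notification, TAG_ACLS)
print(json.dumps(policy, indent=2))
```

Applying the result would be a separate step, for example passing json.dumps(policy) to boto3's put_bucket_policy; a real sync would also merge with the statements already present on the bucket rather than replace them.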
POLICIES ON S3
DEMO - SALES DATA
User              Access  View
Sally (Sales)     Y       Clear/Raw
Mark (Marketing)  Y       Anonymized
Henry (HR)        X       X

Item Id  Amount  Email     Name        Street        City    State
435      439.34  jd@y.com  John Doe    345 First St  Eureka  CA
894      592.02  js@m.com  Jane Smith  876 Main St   Newark  NJ
DEMO FLOW
Step 1: Copy Sales Data into HDFS
Step 2: Create Hive Table from Sales Data
  1. Atlas will create table metadata and lineage
  2. Privacera will scan table and send tags to Atlas
Step 3: Export Hive table to S3
  1. Privacera will anonymize data
  2. Atlas will create lineage to S3
  3. Ranger-S3 Policy Sync will set permissions on S3 for the resource
TOOLS USED IN DEMO
Function                           Tool
Metadata, Lineage, Classification  Apache Atlas
Auto Discovery                     Privacera
Tag Based Policies                 Apache Ranger
Data Transfer                      Apache Hive
Anonymization/Tokenization         Privacera
Ranger to S3 Policy Sync           Privacera
OTHER DATA MOVEMENT TOOLS
● Kafka Connect - Transformer Plugin
● Apache NiFi - Processor Plugin
● Apache Spark SQL - UDF
● Java API integrated with Ranger & Atlas
PENDING TASKS/OPEN ISSUES
● Support for mapping Ranger permissions for S3 resources
● Support advanced Ranger policies like wildcards
● Monitor permission changes on the cloud side and take actions (e.g. disallow them)
● Mapping of Ranger group-level permissions to S3 ACLs
● AWS S3 bucket policy has a maximum size of 20 KB
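The size limit in the last item matters because every synced resource adds statements to the same bucket policy. A minimal sketch of a pre-flight check follows; it assumes the limit is 20 * 1024 bytes measured on the compact JSON text, and the helper names and sample statement are hypothetical.

```python
import json

# Assumption: the limit from the slide, read as 20 KB of policy JSON.
MAX_BUCKET_POLICY_BYTES = 20 * 1024

def policy_size_bytes(policy: dict) -> int:
    """Size of the policy when serialized as compact UTF-8 JSON."""
    return len(json.dumps(policy, separators=(",", ":")).encode("utf-8"))

def fits_bucket_policy_limit(policy: dict) -> bool:
    return policy_size_bytes(policy) <= MAX_BUCKET_POLICY_BYTES

def make_policy(statement_count: int) -> dict:
    """Build a policy with one read statement per synced directory (illustrative)."""
    statements = [
        {
            "Sid": f"TagSync{i}",
            "Effect": "Allow",
            "Principal": {"AWS": ["arn:aws:iam::123456789012:user/sally"]},
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::dws2018-demo/sales/dir_{i}/*",
        }
        for i in range(statement_count)
    ]
    return {"Version": "2012-10-17", "Statement": statements}

small = make_policy(5)
large = make_policy(1000)
print(policy_size_bytes(small), fits_bucket_policy_limit(small))
print(policy_size_bytes(large), fits_bucket_policy_limit(large))
```

A sync tool could run such a check before writing the policy and fall back to coarser statements (for example one per parent prefix) when the limit would be exceeded.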
APACHE JIRAS
https://issues.apache.org/jira/browse/RANGER-1974 - Ranger Authorizer and Audits for AWS S3
https://issues.apache.org/jira/browse/ATLAS-2708 - AWS S3 data lake typedefs for Atlas (Barbara Eckman)
https://issues.apache.org/jira/browse/ATLAS-2760 - Atlas Hive hook updates to create lineage between Hive table and S3 entities
QUESTIONS & ANSWERS


Editor's Notes

  • #4: We have a lot to cover, want to apologize in advance