SECURITY UPDATES:
More Seamless Access Controls with
Apache Spark and Apache Ranger
Dongjoon Hyun @ Hortonworks Spark Team
Jason Dere @ Hortonworks Hive Team
June 2017
Dongjoon Hyun
3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Security Issues
Goals
Components
How it works
Demo
Background – Security
• One of the fundamental features for enterprise adoption
– Multi-tenancy: Billing team / Data science team / Marketing teams
• Row- and column-level access control for SQL users
– Row filtering
– Column masking
• Shared policies must be enforced across multiple SQL engines simultaneously
– E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
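To make the two policy types concrete, here is a minimal Python sketch of what row filtering and column masking do to a result set for different users. The `POLICIES` table, the `mask_name` helper, and the sample rows are hypothetical and only model the idea; they are not Ranger's data model or API.

```python
# Hypothetical illustration of row filtering and column masking.
# None of these names come from Ranger; they only model the concept.

ROWS = [
    {"name": "Alice", "state": "CA", "gender": "F"},
    {"name": "Bob", "state": "NY", "gender": "M"},
    {"name": "Carol", "state": "CA", "gender": "F"},
]


def mask_name(value):
    """Keep the first character and replace the rest with 'x'."""
    return value[:1] + "x" * (len(value) - 1)


# Per-user policy: an optional row predicate and per-column mask functions.
POLICIES = {
    "billing": {"row_filter": None, "masks": {}},
    "datascience": {
        "row_filter": lambda row: row["state"] == "CA",
        "masks": {"name": mask_name},
    },
}


def read_table(user, rows=ROWS):
    """Return the rows as the given user is allowed to see them."""
    policy = POLICIES[user]
    row_filter = policy["row_filter"]
    # Row filtering: drop rows the user must not see.
    visible = [r for r in rows if row_filter is None or row_filter(r)]
    # Column masking: transform protected column values, pass others through.
    return [
        {col: policy["masks"].get(col, lambda v: v)(val) for col, val in r.items()}
        for r in visible
    ]
```

In this sketch `read_table("billing")` returns every row unchanged, while `read_table("datascience")` returns only the `CA` rows with the `name` column masked. The same query yields different results per user without duplicating data.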
Issue 1
• Spark reads all or nothing
• Directory/file-based permissions are insufficient for fine-grained access control
Apache Spark is a general data processing engine
scala> val textFile = sc.textFile("/apps/hive/warehouse/…")
textFile: org.apache.spark.rdd.RDD[String] = …
Issue 2
• Permission 777 on the warehouse?
Security starts from storage
(Figure: two screenshots labeled "Bad" and "Good")
Issue 3
• New policies for SparkSQL?
• Rewrite Spark apps?
– Special data source tables
• Duplicated data maintained manually
– Filtered rows
– Removed or masked columns
Overhead of setting up and maintaining security policies
Goal 1: Spark SQL Apps
Support row/column-level security with the batch apps
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("SELECT * FROM db_common.t_customer").show()
db_common
t_customer
…
Goal 2: Spark shells (1/2)
Support row/column-level security in all shells
spark-shell
pyspark
Goal 2: Spark shells (2/2)
Support row/column-level security in all shells
sparkR
spark-sql
Goal 3: Spark Thrift Server
Support row/column-level security with Spark Thrift Server
Login as `billing`
Login as `datascience`
Jason Dere @ Hortonworks Hive Team
What is required?
• Apache Ranger
• Apache Hive with LLAP
• Spark-LLAP (Apache License)
– A library and patches that integrate the above technologies with SparkSQL
Apache Ranger
Provides a standard authorization method across many Hadoop components
https://hortonworks.com/apache/ranger/#section_2
Ranger Policies – Column Access
Ranger Policies – Column Masking
Ranger Policies – Row Filtering
Apache Hive with LLAP
Diagram components: Client App, HiveServer2, Hive Query Coordinator, LLAP Daemons, YARN Cluster
SQL Query: select name from users
Plan Generation:
TableScan: users
Projection: name
1. Client sends query to HiveServer2.
2. Query plan generation by HiveServer2.
3. Query plan sent to query coordinator.
4. Query plan sent to LLAP daemons for execution.
5. Results consolidated and sent to client.
Hive Security with Ranger
• Seamless integration with Ranger user-level access policies
– Column/row-based security policies are applied automatically
– Hive query plans are rewritten to apply masking/filtering functions on top of the base table data
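The plan rewrite described above can be sketched in a few lines of Python. This is a simplified model, not Hive's planner: the plan is a list of `(operator, argument)` tuples, and `apply_policies`, the `mask` function name, and the policy arguments are hypothetical stand-ins for what HiveServer2 would obtain from Ranger.

```python
# Hypothetical model of policy-driven plan rewriting; not Hive's actual planner.

def apply_policies(plan, row_filter=None, masks=None):
    """Insert a Filter after the TableScan and wrap masked columns in the Projection.

    plan       -- list of (operator, argument) tuples, scan first
    row_filter -- SQL predicate string from a row-filtering policy, or None
    masks      -- dict mapping column name to masking function name
    """
    masks = masks or {}
    rewritten = []
    for op, arg in plan:
        if op == "Projection":
            # Column masking: replace each protected column with mask_fn(column).
            arg = [f"{masks[c]}({c})" if c in masks else c for c in arg]
        rewritten.append((op, arg))
        if op == "TableScan" and row_filter is not None:
            # Row filtering: the predicate is applied directly on top of the scan.
            rewritten.append(("Filter", row_filter))
    return rewritten


user_plan = [("TableScan", "users"), ("Projection", ["name"])]
secured_plan = apply_policies(user_plan, "state = 'CA'", {"name": "mask"})
```

The user still writes `select name from users`; only the generated plan changes, which is why no application code has to be rewritten.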
HiveServer2 + LLAP
Diagram components: Client App, HiveServer2, Hive Query Coordinator, LLAP Daemons, YARN Cluster
SQL Query: select name from users
Plan Generation:
TableScan: users
Projection: name
1. Client sends query to HiveServer2.
2. Query plan generation by HiveServer2.
3. Query plan sent to query coordinator.
4. Query plan sent to LLAP daemons for execution.
5. Results consolidated and sent to client.
HiveServer2 + LLAP + Ranger
Diagram components: Client App, HiveServer2, Ranger (Dynamic Policies), Hive Query Coordinator, LLAP Daemons, YARN Cluster
SQL Query: select name from users
Plan Generation:
TableScan: users
Filter: state = 'CA'
Projection: mask(name)
1. Client sends query to HiveServer2.
2. Query plan generation by HiveServer2. Ranger security policies applied. Plan modified based on dynamic security policies.
3. Query plan sent to query coordinator.
4. Query plan sent to LLAP daemons for execution. Filtering/masking performed.
5. Results consolidated and sent to client.
External LLAP Client
• LLAP Daemon
– Persistent daemons combining query execution and in-memory caching
– External applications are also able to use LLAP to retrieve data
  • Provides a secure relational datanode view of the data
External LLAP Client
Diagram components: Client App (with LLAP InputFormat), HiveServer2, Hive Query Coordinator, LLAP Daemons, YARN Cluster
SQL Query: select name from users
Plan Generation:
TableScan: users
Projection: name
1. Client requests data locations known as "splits" from HiveServer2.
2. Query plan generation by HiveServer2.
3. Splits returned to client, which include the signed query plan.
4. LLAP splits used by client to securely submit query plan to LLAP. Data returned to client.
External LLAP Client + Ranger
Diagram components: Client App (with LLAP InputFormat), HiveServer2, Ranger (Dynamic Policies), Hive Query Coordinator, LLAP Daemons, YARN Cluster
SQL Query: select name from users
Plan Generation:
TableScan: users
Filter: state = 'CA'
Projection: mask(name)
1. Client requests data locations known as "splits" from HiveServer2.
2. Query plan generation by HiveServer2. Ranger security policies applied. Plan modified based on dynamic security policies.
3. Splits returned to client, which include the signed query plan.
4. LLAP splits used by client to securely submit query plan to LLAP. Filtering/masking performed. Data returned to client.
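The reason the plan is signed is that the client, not HiveServer2, hands it to the LLAP daemons, so the daemons need a way to reject a plan whose filter or mask was stripped. The slides do not say how the signature is computed; the sketch below assumes an HMAC-SHA256 over a JSON-serialized plan with a hypothetical shared secret, purely to illustrate the idea.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret known only to HiveServer2 and the LLAP daemons.
# The real signing mechanism is not described in the slides; HMAC is an assumption.
SECRET = b"demo-secret-not-for-production"


def sign_plan(plan):
    """HiveServer2 side: serialize the (already policy-rewritten) plan and sign it."""
    payload = json.dumps(plan, sort_keys=True).encode("utf-8")
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"plan": payload, "signature": signature}


def execute_split(split):
    """LLAP daemon side: accept the plan only if the signature still matches."""
    expected = hmac.new(SECRET, split["plan"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, split["signature"]):
        raise PermissionError("query plan was modified after signing")
    return json.loads(split["plan"])


split = sign_plan(
    {"scan": "users", "filter": "state = 'CA'", "project": ["mask(name)"]}
)
plan_to_run = execute_split(split)
```

A client that edits the serialized plan, for example replacing `mask(name)` with `name`, no longer has a valid signature, and `execute_split` raises `PermissionError`.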
Spark-LLAP
• Spark connector library + patches on top of Spark
• Table data read securely through LLAP
• Leverages standard Ranger policies to control per-user access/masking/filtering of data
Spark-LLAP: Credentials
Get and renew delegation tokens
• HDFS Delegation Token (existing)
– HDFSCredentialProvider gets it from the NameNode
• Hive Metastore Delegation Token (existing)
– HiveCredentialProvider gets it from Hive Metastore
• HiveServer2 Delegation Token (added by Spark-LLAP)
– HiveServer2CredentialProvider gets it from HiveServer2
Spark-LLAP: LlapMetastoreCatalog
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
SELECT gender, count(*)
FROM db_common.t_customer
WHERE name LIKE '%Obama'
GROUP BY gender
Parsed Logical Plan:
Aggregate: gender
  Filter: name like %Obama
    UnresolvedRelation
Analyzed Logical Plan:
Aggregate: gender
  Filter: name like %Obama
    SubqueryAlias
      LlapRelation
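The relation swap can be pictured as a tree rewrite that touches only the leaf. The Python sketch below is a toy model, not Spark's Catalyst API (which is written in Scala): `Node` and `replace_relation` are hypothetical names, and only the node labels are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A toy logical-plan node: an operator name plus child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)


def replace_relation(node, old="MetastoreRelation", new="LlapRelation"):
    """Return a copy of the plan tree with every `old` node renamed to `new`."""
    name = new if node.name == old else node.name
    return Node(name, [replace_relation(c, old, new) for c in node.children])


# The plan as it would be resolved without Spark-LLAP.
without_llap = Node(
    "Aggregate",
    [Node("Filter", [Node("SubqueryAlias", [Node("MetastoreRelation")])])],
)
with_llap = replace_relation(without_llap)
```

Everything above the leaf (`Aggregate`, `Filter`, `SubqueryAlias`) is untouched, which is consistent with the claim that existing queries run unchanged once the catalog is swapped.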
Spark-LLAP: LlapMetastoreCatalog
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
Without Spark-LLAP
With Spark-LLAP
Spark-LLAP: LlapRelation
Uses LLAP external client API to read table data
Diagram components: LlapRelation (with LLAP InputFormat), HiveServer2, Ranger (Dynamic Policies), Hive Query Coordinator, LLAP Daemons, YARN Cluster
SQL Query: select name from users
Plan Generation:
TableScan: users
Filter: state = 'CA'
Projection: mask(name)
Diagram steps 1–4 follow the External LLAP Client + Ranger flow, with LlapRelation in place of the Client App.
Spark-LLAP: LlapRelation
LlapRelation supports predicate pushdown and column pruning
Analyzed Logical Plan:
Aggregate: gender
  Filter: name like %Obama
    SubqueryAlias
      LlapRelation
Optimized Logical Plan:
Aggregate: gender
  Project: gender
    Filter: EndsWith(name, Obama)
      LlapRelation
Physical Plan:
…
  HashAggregate: gender
    Project: gender
      Filter: EndsWith(name, Obama)
        Scan LlapRelation
          PushedFilter: StringEndsWith(…)
          ReadSchema: gender
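As a rough illustration of how a `LIKE '%Obama'` predicate can become a pushed `StringEndsWith` filter while the scan reads only the selected column, consider the Python sketch below. The filter names mirror Spark's data source filter classes, but `like_to_pushed_filter` and `plan_scan` are hypothetical helpers, not part of Spark or Spark-LLAP.

```python
# Hypothetical sketch of predicate pushdown and column pruning decisions.

def like_to_pushed_filter(column, pattern):
    """Translate a simple LIKE pattern into a pushable filter tuple, or None."""
    body = pattern.strip("%")
    if not body or "%" in body or "_" in body:
        return None  # too complex for this sketch; evaluate it in Spark instead
    leading, trailing = pattern.startswith("%"), pattern.endswith("%")
    if leading and trailing:
        kind = "StringContains"
    elif leading:
        kind = "StringEndsWith"
    elif trailing:
        kind = "StringStartsWith"
    else:
        kind = "EqualTo"
    return (kind, column, body)


def plan_scan(table, select_columns, like_predicates):
    """Describe a scan: which filters are pushed down and which columns are read."""
    pushed = [
        f
        for f in (like_to_pushed_filter(c, p) for c, p in like_predicates)
        if f is not None
    ]
    # Column pruning: only the columns the query outputs appear in the read schema.
    return {
        "relation": table,
        "pushed_filters": pushed,
        "read_schema": list(select_columns),
    }


scan = plan_scan("db_common.t_customer", ["gender"], [("name", "%Obama")])
```

For the query on the slide this yields one pushed `StringEndsWith` filter on `name` and a read schema containing only `gender`.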
Using Spark-LLAP
spark-submit \
  --jars spark-llap.jar \
  --conf spark.sql.hive.llap=true \
  --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
  --master yarn \
  --deploy-mode cluster \
  sql.py
• Launch Spark jobs with the library attached; the `--packages` option is supported, too
• `spark.sql.hive.llap=true`: easy to turn on/off
• `spark.yarn.security.credentials.hiveserver2.enabled=true`: only used for YARN cluster mode
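Because the feature is controlled by configuration alone, a launcher script can toggle it without touching application code. The helper below is a hypothetical sketch that only assembles the two `--conf` settings named on the slide; it assumes that setting `spark.sql.hive.llap=false` is how the feature is switched off, which the slide implies but does not state.

```python
# Hypothetical helper; only the two property names come from the slide.

def llap_conf_args(enable_llap=True, yarn_cluster_mode=False):
    """Build the spark-submit --conf arguments for Spark-LLAP as a list."""
    args = ["--conf", "spark.sql.hive.llap=" + str(enable_llap).lower()]
    if enable_llap and yarn_cluster_mode:
        # The HiveServer2 credential provider is only needed in YARN cluster mode.
        args += [
            "--conf",
            "spark.yarn.security.credentials.hiveserver2.enabled=true",
        ]
    return args
```

The returned list can be spliced into a `subprocess` call that invokes `spark-submit` with the rest of the job's arguments.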
Milestone
• HDP 2.5.X: Spark-LLAP for Spark 1.6 (TP)
– Use Ranger for SELECT statement
– Use LlapContext
• HDP 2.6.0: Spark-LLAP for Spark 2.1.0 (TP)
– Use Ranger for more statements (in STS)
– No need to rewrite code
– Support all languages and shells
• HDP 2.6.1: Spark-LLAP for Spark 2.1.1 (TP)
– Support YARN cluster mode
– Support Hive complex types
• HDP X.X.X: Spark-LLAP for Spark 2.2.0
– Available soon on GitHub
Resources
• GitHub
– https://github.com/hortonworks-spark/spark-llap
• Maven
– http://repo.hortonworks.com/content/groups/public/com/hortonworks/spark/spark-llap_2.11/
• YouTube Demo
– https://www.youtube.com/watch?v=_-oYpQGWm5k (HDP 2.6.1)
• Hortonworks Blog
– https://hortonworks.com/blog/row-column-level-control-apache-spark/
• Hortonworks Community Connection Article
– https://community.hortonworks.com/articles/101181/rowcolumn-level-security-in-sql-for-apache-spark-2.html
• Support Matrix
– https://github.com/hortonworks-spark/spark-llap/wiki/7.-Support-Matrix
Summary
• Supports row/column-level security with
– Spark apps in YARN client/cluster mode
– Spark shells
– Spark Thrift Server
• You can use your existing Spark 2.X SQL apps and scripts
• Easy to turn on/off with configuration only
• Ranger enforces policies on Hive and Spark simultaneously and consistently
Spark-LLAP with HDP 2.6.1 is a Technical Preview (TP)
Acknowledgement
• Apache Hive / Apache Spark / Apache Ranger Community
• Bikas Saha, Mingjie Tang, Saisai Shao, Siddharth Seth, Sergey Shelukhin, Thejas Nair, Zhan Zhang, and many others
Thank you

Security Updates: More Seamless Access Controls with Apache Spark and Apache Ranger

  • 1. SECURITY UPDATES: More Seamless Access Controls with Apache Spark and Apache Ranger Dongjoon Hyun @ Hortonworks Spark Team Jason Dere @ Hortonworks Hive Team June 2017
  • 2. SECURITY UPDATES: More Seamless Access Controls with Apache Spark and Apache Ranger Dongjoon Hyun
  • 3. 3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Agenda Security Issues Goals Components How it works Demo
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Background – Security  One of fundamental features for enterprise adoption – Multi-tenancy: Billing team / Data science team / Marketing teams  Row and column-level access control for SQL users – Row filtering – Column masking  Must enforce shared policies to various SQL engines simultaneously – E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Issue 1  Spark reads all or nothing  Directory/file-based permissions are insufficient for fine-grained access control Apache Spark is a general data processing engine scala> val textFile = sc.textFile(“/apps/hive/warehouse/…") textFile: org.apache.spark.rdd.RDD[String] = …
  • 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Issue 2  Permission 777 on warehouse? Security starts from storage Bad Good
  • 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Issue 3  New policies for SparkSQL?  Rewrite Spark apps? – Special data source tables  Duplicated data maintained manually – Filtered rows – Removed or masked columns Overhead during starting and maintaining security policies
  • 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Agenda Security Issues Goals Components How it works Demo
  • 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Goal 1: Spark SQL Apps Support row/column-level security with the batch apps from pyspark.sql import SparkSession spark = SparkSession .builder .enableHiveSupport() .getOrCreate() spark.sql("SELECT * FROM db_common.t_customer").show() db_common t_customer …
  • 10. 10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Goal 2: Spark shells (1/2) Support row/column-level security in all shells spark-shell pyspark
  • 11. 11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Goal 2: Spark shells (2/2) Support row/column-level security in all shells sparkR spark-sql
  • 12. 12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Goal 3: Spark Thrift Server Support row/column-level security with Spark Thrift Server Login as `billing` Login as `datascience`
  • 13. SECURITY UPDATES: More Seamless Access Controls with Apache Spark and Apache Ranger Jason Dere @ Hortonworks Hive Team
  • 14. 14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Agenda Security Issues Goals Components How it works Demo
  • 15. 15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved What are required?  Apache Ranger  Apache Hive with LLAP  Spark-LLAP (Apache License) – A library and patches to integrate above tech with SparkSQL
  • 16. 16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache Ranger Provide a standard authorization method across many Hadoop components https://hortonworks.com/apache/ranger/#section_2
  • 17. 17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Ranger Policies – Column Access
  • 18. 18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Ranger Policies – Column Masking
  • 19. 19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Ranger Policies – Row Filtering
  • 20. 20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved YARN Cluster HiveServer2 Client App Hive Query Coordinator SQL Query: select name from users 1 Apache Hive with LLAP 5 3 4 1.Client sends query to HiveServer2. 2.Query plan generation by HiveServer2. 3.Query plan sent to query coordinator 4.Query plan sent to LLAP daemons for execution. 5.Results consolidated and sent to client Plan Generation TableScan: users Projection: name 2 LLAP LLAP LLAP Daemons
  • 21. 21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive Security with Ranger  Seamless integration with Ranger user-level access policies – Column/row based security policies are applied automatically – Hive query plans rewritten to apply masking/filtering functions on top of the base table data.
  • 22. 22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved YARN Cluster HiveServer2 Client App Hive Query Coordinator SQL Query: select name from users 1 HiveServer2 + LLAP 5 3 4 1.Client sends query to HiveServer2. 2.Query plan generation by HiveServer2. 3.Query plan sent to query coordinator 4.Query plan sent to LLAP daemons for execution. 5.Results consolidated and sent to client Plan Generation TableScan: users Projection: name 2 LLAP LLAP LLAP Daemons
  • 23. 23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved YARN Cluster HiveServer2 Client App Hive Query Coordinator Plan Generation TableScan: users Filter: state = ‘CA’ Projection: mask(name) SQL Query: select name from users 1.Client sends query to HiveServer2. 2.Query plan generation by HiveServer2. Ranger security policies applied. Plan modified based on dynamic security policies. 3.Query plan sent to query coordinator 4.Query plan sent to LLAP daemons for execution. Filtering/masking performed. 5.Results consolidated and sent to client 1 HiveServer2 + LLAP + Ranger Ranger Dynamic Policies 5 2 3 4 LLAP LLAP LLAP Daemons
  • 24. 24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved External LLAP Client  LLAP Daemon – Persistent daemons combining query execution and in-memory caching – External applications also able to use LLAP to retrieve data • Provide a secure relational datanode view of the data
  • 25. 25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved LLAP LLAP LLAP Daemons YARN Cluster HiveServer2 Hive Query Coordinator Plan Generation TableScan: users Projection: name 1.Client requests data locations known as “splits” from HiveServer2. 2.Query plan generation by HiveServer2. 3.Splits returned to client which include signed query plan. 4.LLAP splits used by client to securely submit query plan to LLAP. Data returned to client. 1 External LLAP Client 3 2 4 Client App LLAP InputFormat SQL Query: select name from users
  • 26. 26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved YARN Cluster HiveServer2 Client App Hive Query Coordinator Plan Generation TableScan: users Filter: state = ‘CA’ Projection: mask(name) 1.Client requests data locations known as “splits” from HiveServer2. 2.Query plan generation by HiveServer2. Ranger security policies applied. Plan modified based on dynamic security policies. 3.Splits returned to client which include signed query plan. 4.LLAP splits used by client to securely submit query plan to LLAP. Filtering/masking performed. Data returned to client. 1 External LLAP Client + Ranger Ranger Dynamic Policies 3 2 LLAP InputFormat SQL Query: select name from users LLAP LLAP LLAP Daemons 4
  • 27. 27 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Agenda Security Issues Goals Components How it works Demo
  • 28. 28 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Spark-LLAP  Spark connector library + patches on top of Spark  Table data read securely through LLAP  Leverages standard Ranger policies to control per-user access/masking/filtering of data
  • 29. 29 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Spark-LLAP: Credentials  HDFS Delegation Token – HDFSCredentialProvider gets it from namenode  Hive Metastore Delegation Token – HiveCredentialProvider gets it from Hive Metastore  HiveServer2 Delegation Token – HiveServer2CredentialProvider gets it from HiveServer2 Get and renew delegation tokens Spark-LLAP Existing
  • 30. Spark-LLAP: LlapMetastoreCatalog. LlapMetastoreCatalog replaces MetastoreRelation with LlapRelation. Query: SELECT gender, count(*) FROM db_common.t_customer WHERE name LIKE '%Obama' GROUP BY gender. Parsed Logical Plan (top to bottom): Aggregate: gender; Filter: name like %Obama; UnresolvedRelation. Analyzed Logical Plan (top to bottom): Aggregate: gender; Filter: name like %Obama; SubqueryAlias; LlapRelation.
  • 31. Spark-LLAP: LlapMetastoreCatalog. LlapMetastoreCatalog replaces MetastoreRelation with LlapRelation. [Figure: side-by-side comparison, without Spark-LLAP and with Spark-LLAP]
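The relation swap can be pictured as a small tree rewrite. The sketch below is a plain-Python toy, not Spark's analyzer or the Spark-LLAP source: `Node`, `resolve`, `show`, and the `use_llap` switch are hypothetical, and plans are simplified to single-child chains.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One operator in a simplified, single-child logical plan."""
    name: str
    child: Optional["Node"] = None


def resolve(plan: Node, use_llap: bool) -> Node:
    """Replace the unresolved leaf with a catalog relation under a SubqueryAlias."""
    if plan.child is None:
        if plan.name == "UnresolvedRelation":
            leaf = Node("LlapRelation" if use_llap else "MetastoreRelation")
            return Node("SubqueryAlias", leaf)
        return plan
    return replace(plan, child=resolve(plan.child, use_llap))


def show(plan: Optional[Node]) -> str:
    """Render the chain from the top operator down to the leaf."""
    names = []
    while plan is not None:
        names.append(plan.name)
        plan = plan.child
    return " > ".join(names)


parsed = Node("Aggregate: gender",
              Node("Filter: name like %Obama",
                   Node("UnresolvedRelation")))

print(show(resolve(parsed, use_llap=False)))
print(show(resolve(parsed, use_llap=True)))
```

Only the leaf differs between the two outputs, which matches the point of the slide: the operators above the relation are untouched.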
  • 32. Spark-LLAP: LlapRelation. Uses the LLAP external client API to read table data. SQL query: select name from users. Components: LlapRelation with LLAP InputFormat; HiveServer2 (Hive Query Coordinator, plan generation) with Ranger dynamic policies; LLAP daemons on the YARN cluster; the diagram numbers the interactions between them 1 to 4. Generated plan: TableScan: users; Filter: state = 'CA'; Projection: mask(name).
  • 33. Spark-LLAP: LlapRelation. LlapRelation supports predicate pushdown and column pruning. Analyzed Logical Plan (top to bottom): Aggregate: gender; Filter: name like %Obama; SubqueryAlias; LlapRelation. Optimized Logical Plan (top to bottom): Aggregate: gender; Project: gender; Filter: EndsWith(name, Obama); LlapRelation. Physical Plan (top to bottom): …; HashAggregate: gender; Project: gender; Filter: EndsWith(name, Obama); Scan LlapRelation with PushedFilter: StringEndsWith(…) and ReadSchema: gender.
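A minimal sketch of what pushdown and pruning buy, again as a plain-Python toy rather than Spark's data source API: `like_to_filter`, `ToyRelation`, and the sample rows are hypothetical. The relation receives the translated filter and the list of required columns, so only matching rows and needed columns leave the scan.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
Filter = Tuple[str, str, str]  # (kind, column, value)


def like_to_filter(column: str, pattern: str) -> Optional[Filter]:
    """Translate LIKE '%suffix' into an ends-with filter; None for other shapes."""
    rest = pattern[1:]
    if pattern.startswith("%") and "%" not in rest and "_" not in rest:
        return ("StringEndsWith", column, rest)
    return None


class ToyRelation:
    """Stand-in for a relation that accepts pushed filters and a read schema."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    @staticmethod
    def _matches(row: Row, flt: Filter) -> bool:
        kind, column, value = flt
        if kind == "StringEndsWith":
            return row[column].endswith(value)
        raise ValueError(f"unsupported filter kind: {kind}")

    def scan(self, required_columns: List[str], pushed: List[Filter]) -> List[Row]:
        # Filter first (may use columns that are not returned), then prune.
        return [
            {col: row[col] for col in required_columns}
            for row in self.rows
            if all(self._matches(row, flt) for flt in pushed)
        ]


# Made-up sample rows for db_common.t_customer.
relation = ToyRelation([
    {"name": "A Obama", "gender": "M", "city": "X"},
    {"name": "B Obama", "gender": "F", "city": "Y"},
    {"name": "C Smith", "gender": "M", "city": "Z"},
])

pushed_filter = like_to_filter("name", "%Obama")
scanned = relation.scan(["gender"], [pushed_filter])
counts = Counter(row["gender"] for row in scanned)
print(scanned)
print(dict(counts))
```

The aggregate above the scan sees two single-column rows instead of three full rows; on a real table the saving is in data shipped from the LLAP daemons to Spark.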
  • 34. Using Spark-LLAP. Launch Spark jobs:  spark-submit --jars spark-llap.jar --conf spark.sql.hive.llap=true --conf spark.yarn.security.credentials.hiveserver2.enabled=true --master yarn --deploy-mode cluster sql.py. Notes: the `--packages` option is supported, too; spark.sql.hive.llap makes the feature easy to turn on/off; spark.yarn.security.credentials.hiveserver2.enabled is only used for YARN cluster mode.
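For scripted launches, the command can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of Spark or Spark-LLAP: `build_submit_command` and its parameters are invented names. It passes the jar with `--jars`, the standard spark-submit option for local jar files, and it assumes that the HiveServer2 credential setting is the one that applies only to YARN cluster mode.

```python
import shlex
from typing import List


def build_submit_command(app: str, jar: str,
                         enable_llap: bool = True,
                         cluster_mode: bool = True) -> List[str]:
    """Assemble a spark-submit argument list for a Spark-LLAP job (sketch)."""
    cmd = [
        "spark-submit",
        "--jars", jar,
        # Single switch that turns the LLAP read path on or off.
        "--conf", f"spark.sql.hive.llap={str(enable_llap).lower()}",
    ]
    if cluster_mode:
        # Assumption: the HiveServer2 delegation token is only needed here.
        cmd += ["--conf",
                "spark.yarn.security.credentials.hiveserver2.enabled=true"]
    cmd += ["--master", "yarn",
            "--deploy-mode", "cluster" if cluster_mode else "client",
            app]
    return cmd


command = build_submit_command("sql.py", "spark-llap.jar")
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` on a host where `spark-submit` is on the `PATH`.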
  • 35. Agenda: Security Issues; Goals; Components; How it works; Demo
  • 36. Milestone. HDP 2.5.X: Spark-LLAP for Spark 1.6 (TP) • Use Ranger for SELECT statements • Use LlapContext. HDP 2.6.0: Spark-LLAP for Spark 2.1.0 (TP) • Use Ranger for more statements (in STS) • No need to rewrite code • Support all languages and shells. HDP 2.6.1: Spark-LLAP for Spark 2.1.1 (TP) • Support YARN cluster mode • Support Hive complex types. HDP X.X.X: Spark-LLAP for Spark 2.2.0 • Available soon on GitHub
  • 37. Resources  GitHub – https://github.com/hortonworks-spark/spark-llap  Maven – http://repo.hortonworks.com/content/groups/public/com/hortonworks/spark/spark-llap_2.11/  YouTube Demo – https://www.youtube.com/watch?v=_-oYpQGWm5k (HDP 2.6.1)  Hortonworks Blog – https://hortonworks.com/blog/row-column-level-control-apache-spark/  Hortonworks Community Connection Article – https://community.hortonworks.com/articles/101181/rowcolumn-level-security-in-sql-for-apache-spark-2.html  Support Matrix – https://github.com/hortonworks-spark/spark-llap/wiki/7.-Support-Matrix
  • 38. Summary  Supports row/column-level security with – Spark apps in YARN client/cluster mode – Spark shells – Spark Thrift Server  You can use existing Spark 2.X SQL apps and scripts  Easy to turn on/off through configuration alone  Ranger enforces policies on Hive and Spark simultaneously and consistently. Note: Spark-LLAP with HDP 2.6.1 is TP.
  • 39. Acknowledgement  Apache Hive / Apache Spark / Apache Ranger Community  Bikas Saha, Mingjie Tang, Saisai Shao, Siddharth Seth, Sergey Shelukhin, Thejas Nair, Zhan Zhang, and many others
  • 40. Thank you