Kerberizing Spark. Spark Summit East
Jorge López-Malla Matute
jlopezm@stratio.com
Abel Rincón Matarranz
arincon@stratio.com
INDEX
1. Kerberos
● Introduction
● Key concepts
● Workflow
● Impersonation
2. Use Case
● Definition
● Workflow
● Crossdata in production
3. Stratio Solution
● Prerequisites
● Driver side
● Executor side
● Final result
4. Demo time
● Demo
● Q&A
Presentation
JORGE LÓPEZ-MALLA
After working with traditional processing methods, I started doing some R&D Big Data projects and fell in love with the Big Data world. Currently I'm working on some awesome Big Data projects and tools at Stratio.
SKILLS
Presentation
ABEL RINCÓN MATARRANZ
SKILLS
Our company
Presentation
Our product
Presentation
Kerberizing Spark. Spark Summit East
Kerberos
Kerberos
• What is Kerberos?
○ An authentication protocol / standard / service
■ Secure
■ Single sign-on
■ Trust based
■ Mutual authentication
Kerberos key concepts
Kerberos
• Client/Server → Do you need an explanation???
• Principal → Identifies a unique client or service
• Realm → Identifies an environment, company, domain …
○ e.g. DEMO.EAST.SUMMIT.SPARK.ORG (see the principal examples below)
• KDC → The actor that manages the tickets (Key Distribution Center)
• TGT → Ticket that holds the client session
• TGS → Ticket that holds the client-service session
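A couple of illustrative principal strings for the realm above (user and host names are hypothetical): user principals follow primary@REALM, service principals follow service/host@REALM.
val userPrincipal    = "user1@DEMO.EAST.SUMMIT.SPARK.ORG"
val servicePrincipal = "crossdata/node-1.demo.org@DEMO.EAST.SUMMIT.SPARK.ORG"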
Kerberos Workflow
Kerberos
1. The client retrieves its principal and secret
2. The client performs a TGT request
3. The KDC returns the TGT
4. The client requests a TGS with the TGT
5. The KDC returns the TGS
6. The client requests a service session using the TGS
7. The service establishes a secure connection directly with the client (see the sketch below)
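A minimal Scala sketch of this flow from a JVM service, assuming Hadoop's UserGroupInformation API and a hypothetical principal and keytab path (the programmatic equivalent of kinit plus one authenticated call):
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()                 // picks up core-site.xml / hdfs-site.xml
UserGroupInformation.setConfiguration(hadoopConf)
// Steps 1-3: log in with the principal and its keytab secret, obtaining a TGT
val serviceUgi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "crossdata@DEMO.EAST.SUMMIT.SPARK.ORG", "/etc/security/keytabs/crossdata.keytab")
// Steps 4-7: a Hadoop call issued inside doAs requests the TGS and talks to the service (the NameNode here)
val homeExists = serviceUgi.doAs(new PrivilegedExceptionAction[Boolean] {
  def run: Boolean = FileSystem.get(hadoopConf).exists(new Path("/user/crossdata"))
})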
Kerberos workflow 2
Kerberos
[Diagram: the Client, the Service and the Backend each log in to the AS/KDC with their own principal (user1, service1, backend1) and obtain their own TGT. The Client presents tgsUS (user1 → service1) to the Service, and the Service presents tgsSB (service1 → backend1) to the Backend, so every hop is authenticated with its own service ticket and identity.]
Kerberos workflow - Impersonation
Kerberos
[Diagram: same actors and tickets as before, but now the Service calls the Backend on behalf of user1. It still authenticates with its own ticket (tgsSB), yet the Backend sees and audits the request as user1: the Service impersonates the Client. A code sketch of this pattern follows.]
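A minimal sketch of that impersonation pattern with Hadoop's UserGroupInformation, assuming serviceUgi is the service identity from the previous sketch and "user1" only needs proxy grants on the cluster side, not a keytab of its own:
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

def listHomeAs(serviceUgi: UserGroupInformation, proxyName: String): Array[FileStatus] = {
  // the proxy UGI carries user1's identity but is backed by the service's credentials
  val proxyUgi = UserGroupInformation.createProxyUser(proxyName, serviceUgi)
  proxyUgi.doAs(new PrivilegedExceptionAction[Array[FileStatus]] {
    def run: Array[FileStatus] =
      FileSystem.get(new Configuration()).listStatus(new Path(s"/user/$proxyName"))
  })
}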
Kerberizing Spark. Spark Summit East
Use Case
• Stratio Crossdata is a distributed framework and a fast, general-purpose computing system powered by Apache Spark
• It can be used both as a library and as a server.
• Crossdata Server: provides a multi-user environment for SparkSQL, giving a reliable architecture with high availability and scalability out of the box
• To do so, it uses both native queries and Spark
• Crossdata Server had a single long-lived SparkContext to execute all its Spark queries
• Crossdata can use YARN, Mesos and Standalone as resource managers
Use Case
Crossdata as Server
[Diagram: a user runs "select * from table1" from the Crossdata shell against a Crossdata server that acts as the Spark Driver. The job goes through the Spark Master to Worker-1 and Worker-2 (Executor-0/Task-0 and Executor-1/Task-1), which read table1 from a kerberized HDFS, and the result rows (id, name) come back to the shell.]
• Projects in production need runtime impersonation to comply with AAA (Authentication, Authorization and Audit) at the storage layer.
• Crossdata allows several users per execution
• None of Spark's resource managers allows us to impersonate at runtime.
• Moreover, Standalone as resource manager does not provide any Kerberos feature at all.
Crossdata in production
Use Case
Kerberizing Spark. Spark Summit East
Prerequisites
Stratio solution
• The keytab has to be accessible on all the cluster machines
• The keytab principal must have proxy grants (see the sketch below)
• The Hadoop client configuration must be available on the cluster
• Each user, both proxy and real, must have a home directory in HDFS
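The proxy grants mentioned above normally map to Hadoop's proxy-user settings in the cluster's core-site.xml (server side). A sketch of the relevant keys, assuming a hypothetical service principal named "crossdata":
val proxyUserGrants = Map(
  "hadoop.proxyuser.crossdata.hosts"  -> "*",   // hosts from which crossdata may impersonate
  "hadoop.proxyuser.crossdata.groups" -> "*"    // groups whose users may be impersonated
)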
Introduction
Stratio solution
• Spark accesses the storage system both in the Driver and in the Executors.
• On the Driver side, both Spark Core and SparkSQL access the storage system.
• Executors always access it through Tasks.
• As Streaming uses the same classes as Spark Core and SparkSQL, the same solution is also usable by Streaming jobs (a configuration sketch follows).
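A minimal configuration sketch, assuming the two Kerberos options read by the patched code on the following slides (the values shown are hypothetical):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("crossdata-server")
  .set("spark.executor.kerberos.principal", "crossdata@DEMO.EAST.SUMMIT.SPARK.ORG")
  .set("spark.executor.kerberos.keytab", "/etc/security/keytabs/crossdata.keytab")
The per-query proxy user is then passed through the DataFrameReader/Writer "user" option shown later.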
KerberosUser (Utils)
object KerberosUser extends Logging with UserCache {
def setProxyUser(user: String): Unit = proxyUser = Option(user)
def getUserByName(name: Option[String]): Option[UserGroupInformation] = {
if (getConfiguration.isDefined) {
userFromKeyTab(name)
}
else None
}
private def userFromKeyTab(proxyUser: Option[String]): Option[UserGroupInformation] = {
if (realUser.isDefined) realUser.get.checkTGTAndReloginFromKeytab()
(realUser, proxyUser) match {
case (Some(_), Some(proxy)) => users.get(proxy).orElse(loginProxyUser(proxy))
case (Some(_), None) => realUser
case (None, None) => None
}
}
private lazy val getConfiguration: Option[(String, String)] = {
val principal = env.conf.getOption("spark.executor.kerberos.principal")
val keytab = env.conf.getOption("spark.executor.kerberos.keytab")
(principal, keytab) match {
case (Some(p), Some(k)) => Option(p, k)
case _ => None
}
}
Stratio solution
Callouts: setProxyUser sets the proxy user (global); getUserByName is the public method that retrieves a user; userFromKeyTab chooses between the real and the proxy user; getConfiguration reads the Kerberos configuration.
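For completeness, a minimal sketch of the members elided on the slide, assuming they live inside the KerberosUser object above; the names match the calls in the snippet, but the details are assumptions:
import org.apache.hadoop.security.UserGroupInformation
import scala.collection.mutable

@volatile private var proxyUser: Option[String] = None
private val users: mutable.Map[String, UserGroupInformation] = mutable.Map.empty

// The real (service) user, logged in once from the configured principal and keytab
private lazy val realUser: Option[UserGroupInformation] = getConfiguration.map {
  case (principal, keytab) =>
    UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
}

// Proxy users are created on demand from the real user and cached
private def loginProxyUser(proxy: String): Option[UserGroupInformation] =
  realUser.map { real =>
    val ugi = UserGroupInformation.createProxyUser(proxy, real)
    users.put(proxy, ugi)
    ugi
  }

// Used by the wrappers on the next slide and by the DAGScheduler change
def getUser: Option[UserGroupInformation] = getUserByName(proxyUser)
def getMaybeUser: Option[String] = proxyUser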
Wrappers (Utils)
def executeSecure[U, T](proxyUser: Option[String],
funct: (U => T),
inputParameters: U): T = {
KerberosUser.getUserByName(proxyUser) match {
case Some(user) => {
user.doAs(new PrivilegedExceptionAction[T]() {
@throws(classOf[Exception])
def run: T = {
funct(inputParameters)
}
})
}
case None => {
funct(inputParameters)
}
}
}
def executeSecure[T](exe: ExecutionWrp[T]): T = {
KerberosUser.getUser match {
case Some(user) => {
user.doAs(new PrivilegedExceptionAction[T]() {
@throws(classOf[Exception])
def run: T = {
exe.value
}
})
}
case None => exe.value
}
}
class ExecutionWrp[T](wrp: => T) {
lazy val value: T = wrp
}
Stratio solution
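A usage sketch of the two wrappers above (the HDFS calls and the proxy user name are hypothetical; KerberosFunction is the object that hosts them, as used on the following slides):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// By-name wrapper: the block is evaluated only inside doAs, as the real or current proxy user
val home = KerberosFunction.executeSecure(
  new ExecutionWrp(FileSystem.get(new Configuration()).getHomeDirectory))

// Function-plus-parameter version, forcing a concrete proxy user for this call
val exists: Path => Boolean = p => FileSystem.get(new Configuration()).exists(p)
val found = KerberosFunction.executeSecure(Some("user1"), exists, new Path("/user/user1"))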
Driver Side
Stratio Solution
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
...
/**
* Get the array of partitions of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def partitions: Array[Partition] = {
checkpointRDD.map(_.partitions).getOrElse {
if (partitions_ == null) {
partitions_ = KerberosFunction.executeSecure(new ExecutionWrp(getPartitions))
partitions_.zipWithIndex.foreach { case (partition, index) =>
require(partition.index == index,
s"partitions($index).partition == ${partition.index}, but it should equal $index")
}
}
partitions_
}
}
Callout: wrapping the parameterless getPartitions method in an ExecutionWrp.
class PairRDDFunctions[K, V](self: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
...
def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
// Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
val internalSave: (JobConf => Unit) = (conf: JobConf) => {
val hadoopConf = conf
val outputFormatInstance = hadoopConf.getOutputFormat
val keyClass = hadoopConf.getOutputKeyClass
val valueClass = hadoopConf.getOutputValueClass
...
val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
….
}
self.context.runJob(self, writeToFile)
writer.commitJob()
}
KerberosFunction.executeSecure(internalSave, conf)
}
Driver Side
Stratio Solution
Callouts: saveAsHadoopDataset is the RDD save function for Hadoop datastores; internalSave is the wrapped function that will run in the cluster; executeSecure turns it into a Kerberos-authenticated save.
Driver Side
class InMemoryCatalog(
conf: SparkConf = new SparkConf,
hadoopConfig: Configuration = new Configuration)
override def createDatabase(
dbDefinition: CatalogDatabase,
ignoreIfExists: Boolean): Unit = synchronized {
def inner: Unit = {
...
try {
val location = new Path(dbDefinition.locationUri)
val fs = location.getFileSystem(hadoopConfig)
fs.mkdirs(location)
} catch {
case e: IOException =>
throw new SparkException(s"Unable to create database ${dbDefinition.name} as failed " +
s"to create its directory ${dbDefinition.locationUri}", e)
}
catalog.put(dbDefinition.name, new DatabaseDesc(dbDefinition))
}
}
KerberosFunction.executeSecure(KerberosUser.principal, new ExecutionWrp(inner))
}
Stratio Solution
Callout: Spark creates the database directory in HDFS; the inner function is wrapped so that the mkdirs call runs under the Kerberos principal.
* Interface used to load a [[Dataset]] from external storage systems (e.g. file systems,
...
class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
...
def load(): DataFrame = {
load(Seq.empty: _*) // force invocation of `load(...varargs...)`
}
...
def load(paths: String*): DataFrame = {
val proxyuser = extraOptions.get("user")
if (proxyuser.isDefined) KerberosUser.setProxyUser(proxyuser.get)
val dataSource = KerberosFunction.executeSecure(proxyuser,
DataSource.apply,
sparkSession,
source,
paths, userSpecifiedSchema, Seq.empty, None, extraOptions.toMap)
val baseRelation = KerberosFunction.executeSecure(proxyuser, dataSource.resolveRelation, false)
KerberosFunction.executeSecure(proxyuser, sparkSession.baseRelationToDataFrame, baseRelation)
}
Driver Side
Stratio Solution
Callouts: the parameterless load() is the method for loading data from sources without a path; the proxy user is read from the dataset options ("user"); the baseRelation obtained from the DataSource is also resolved under that user.
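A usage sketch of the patched reader, assuming an existing SparkSession named spark and a hypothetical user and path:
val events = spark.read
  .format("parquet")
  .option("user", "analyst1")     // picked up by load() and turned into the proxy user
  .load("/user/analyst1/events")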
* Interface used to write a [[Dataset]] to external storage systems (e.g. file systems,
...
class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
...
/**
* Saves the content of the [[DataFrame]] as the specified table.
...
def save(): Unit = {
assertNotBucketed("save")
...
val maybeUser = extraOptions.get("user")
def innerWrite(modeData: (SaveMode, DataFrame)): Unit = {
val (mode, data) = modeData
dataSource.write(mode, data)
}
if (maybeUser.isDefined) KerberosUser.setProxyUser(maybeUser.get)
KerberosFunction.executeSecure(maybeUser, innerWrite, (mode, df))
}
Driver Side
Stratio Solution
Callouts: save() is the method for saving data to external sources; the proxy user is read from the dataset options ("user"); innerWrite wraps the save execution.
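And the matching usage sketch for the patched writer, reusing the hypothetical DataFrame from the previous sketch:
events.write
  .mode("overwrite")
  .format("parquet")
  .option("user", "analyst1")     // the same option selects the proxy user on save
  .save("/user/analyst1/events_out")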
class DAGScheduler(...){
…
…
KerberosUser.getMaybeUser match {
case Some(user) => properties.setProperty("user", user)
case _ =>
}
...
val tasks: Seq[Task[_]] = try {
stage match {
case stage: ShuffleMapStage =>
partitionsToCompute.map { id =>
...
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
...
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
}
...
Driver Side
Stratio Solution
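This change rides on Spark's job-local properties, which travel from the driver to every task. A sketch of that propagation with the public API, assuming an existing SparkContext named sc (the patch sets the same "user" key from KerberosUser inside the scheduler):
sc.setLocalProperty("user", "analyst1")
sc.parallelize(1 to 4, numSlices = 2).foreachPartition { _ =>
  // the patched Task.run reads the same key from the deserialized properties (next slide)
  val proxyCandidate = org.apache.spark.TaskContext.get.getLocalProperty("user")
  println(s"task would impersonate: $proxyCandidate")
}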
private[spark] abstract class Task[T](
val stageId: Int,
val stageAttemptId: Int,
val partitionId: Int,
// The default value is only used in tests.
val metrics: TaskMetrics = TaskMetrics.registered,
@transient var localProperties: Properties = new Properties) extends Serializable {
...
final def run(
...
try {
val proxyUser =
Option(Executor.taskDeserializationProps.get().getProperty("user"))
KerberosFunction.executeSecure(proxyUser, runTask, context)
} catch {
…
def runTask(context: TaskContext): T
Executor Side
Stratio Solution
Callouts: the properties are loaded on the Driver side; Task.run gets the proxy user and wraps the execution; runTask is the method implemented by the Task subclasses.
Demo time
• Merge this code into Apache Spark (SPARK-16788)
• Pluggable authorization
• Pluggable secret management (why always use Hadoop delegation tokens?)
• Distributed cache.
• ...
Next Steps
Q & A