Talk about adding a proxy user at Spark task execution time, given at Spark Summit East 2017 by Jorge López-Malla and Abel Rincón.
Full video:
https://www.youtube.com/watch?v=VaU1xC0Rixo&feature=youtu.be
Kerberizing Spark: Spark Summit East talk by Abel Rincón and Jorge López-Malla (Spark Summit)
Spark has deservedly become the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies, so their combination is one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge explain the adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, with any of its cluster managers, without degrading Spark's outstanding performance.
Kerberizing Spark. Spark Summit East
2. INDEX
Jorge López-Malla Matute ([email protected])
Abel Rincón Matarranz ([email protected])
1 Kerberos
● Introduction
● Key concepts
● Workflow
● Impersonation
2 Use Case
● Definition
● Workflow
● Crossdata in production
3 Stratio Solution
● Prerequirements
● Driver side
● Executor side
● Final result
4 Demo time
● Demo
● Q&A
3. Presentation
JORGE LÓPEZ-MALLA
After working with traditional processing methods, I started to do some R&D Big Data projects and I fell in love with the Big Data world. Currently I'm doing some awesome Big Data projects and tools at Stratio.
SKILLS
8. Kerberos
• What is Kerberos?
○ Authentication protocol / standard / service
■ Secure
■ Single sign-on
■ Trust based
■ Mutual authentication
9. Kerberos key concepts
• Client/Server → Do you need an explanation???
• Principal → Identifies a unique client or service
• Realm → Identifies an environment, company, domain...
○ DEMO.EAST.SUMMIT.SPARK.ORG
• KDC → The actor that manages the tickets
• TGT → Ticket that holds the client session
• TGS → Ticket that holds the client-service session
10. Kerberos Workflow
1. The client retrieves its principal and secret
2. The client performs a TGT request
3. The KDC returns the TGT
4. The client requests a TGS with the TGT
5. The KDC returns the TGS
6. The client requests a service session using the TGS
7. The service establishes a secure connection directly with the client
11. Kerberos workflow 2
[Diagram: Client, Service and Backend each have their own principal (user1, Service1, backend1) and obtain a TGT from the AS/KDC. The client obtains a TGS for user1-service1 (tgsUS) and presents it to the service; the service in turn obtains a TGS for service1-backend1 (tgsSB) and presents it to the backend, so each hop authenticates with its own identity.]
12. Kerberos workflow - Impersonation
[Diagram: the same flow as the previous slide, but with impersonation the service uses its TGS towards the backend on behalf of user1, so the backend sees the request as coming from user1 instead of service1.]
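The impersonation step above is the building block the Stratio solution relies on later. Outside Spark it can be reproduced directly with Hadoop's UserGroupInformation API; the following is a minimal sketch (the principal, keytab path and user names are hypothetical), assuming the KDC and HDFS are configured to let the service principal proxy other users:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object ImpersonationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml from the classpath
    UserGroupInformation.setConfiguration(conf)

    // Real (service) user: logs in with its own keytab and obtains a TGT
    val service = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "service1/[email protected]",        // hypothetical principal
      "/etc/security/keytabs/service1.keytab")             // hypothetical keytab path

    // Proxy user: has no credentials of its own, it is trusted through proxy grants
    val proxied = UserGroupInformation.createProxyUser("user1", service)

    // Every storage access inside doAs is performed as user1, not as service1
    proxied.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(conf)
        fs.listStatus(new Path("/user/user1")).foreach(status => println(status.getPath))
      }
    })
  }
}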
14. Use Case
• Stratio Crossdata is a distributed framework and a fast, general-purpose computing system powered by Apache Spark
• It can be used both as a library and as a server
• Crossdata Server provides a multi-user environment for SparkSQL, giving a reliable architecture with high availability and scalability out of the box
• To do so it uses both native queries and Spark
• Crossdata Server has a single long-lived SparkContext to execute all its Spark queries
• Crossdata can use YARN, Mesos or Standalone as its resource manager
15. Crossdata as Server
[Diagram: a Crossdata shell sends "select * from table1" to the Crossdata server, which acts as the Spark Driver. The driver submits the work to the Master; Executor-0 and Executor-1 on Worker-1 and Worker-2 run Task-0 and Task-1 against Kerberized HDFS, and the resulting rows (id, name) are returned to the shell.]
16. Crossdata in production
• Projects in production need runtime impersonation in order to comply with AAA (Authentication, Authorization and Audit) at the storage layer
• Crossdata allows several users per execution
• None of Spark's resource managers allows us to impersonate at runtime
• Moreover, Standalone as a resource manager does not provide any Kerberos feature
18. Prerequirements
• The keytab has to be accessible on all the cluster machines
• The keytab must provide proxy grants
• The Hadoop client configuration must be located in the cluster
• Each user, both proxy and real, must have a home directory in HDFS
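As a rough illustration of how these prerequisites surface in a job configuration, the sketch below sets the custom properties that the KerberosUser utility shown later reads; the master URL, principal and keytab path are placeholders, and the proxy grants themselves still have to be configured on the Hadoop side (hadoop.proxyuser.* in core-site.xml).

import org.apache.spark.SparkConf

// Hypothetical values; the keytab must exist at this path on every cluster machine
val conf = new SparkConf()
  .setMaster("spark://master:7077") // Standalone, Mesos or YARN
  .setAppName("kerberized-crossdata")
  .set("spark.executor.kerberos.principal", "[email protected]")
  .set("spark.executor.kerberos.keytab", "/etc/security/keytabs/crossdata.keytab")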
19. Introduction
• Spark accesses the storage system both in the Driver and in the Executors
• On the Driver side both Spark Core and SparkSQL access the storage system
• Executors always access it via Tasks
• As Streaming uses the same classes as Spark Core and SparkSQL, the same solution is usable for Streaming jobs
20. KerberosUser (Utils)
object KerberosUser extends Logging with UserCache {

  // Sets the proxy user (global)
  def setProxyUser(user: String): Unit = proxyUser = Option(user)

  // Public method to retrieve a user
  def getUserByName(name: Option[String]): Option[UserGroupInformation] = {
    if (getConfiguration.isDefined) {
      userFromKeyTab(name)
    }
    else None
  }

  // Chooses between the real and the proxy user
  private def userFromKeyTab(proxyUser: Option[String]): Option[UserGroupInformation] = {
    if (realUser.isDefined) realUser.get.checkTGTAndReloginFromKeytab()
    (realUser, proxyUser) match {
      case (Some(_), Some(proxy)) => users.get(proxy).orElse(loginProxyUser(proxy))
      case (Some(_), None) => realUser
      case (None, None) => None
    }
  }

  // Configuration: principal and keytab read from the Spark conf
  private lazy val getConfiguration: Option[(String, String)] = {
    val principal = env.conf.getOption("spark.executor.kerberos.principal")
    val keytab = env.conf.getOption("spark.executor.kerberos.keytab")
    (principal, keytab) match {
      case (Some(p), Some(k)) => Option(p, k)
      case _ => None
    }
  }
21. Wrappers (Utils)
// Wraps a function and its input parameters: if a user is resolved, run inside its doAs
def executeSecure[U, T](proxyUser: Option[String],
                        funct: (U => T),
                        inputParameters: U): T = {
  KerberosUser.getUserByName(proxyUser) match {
    case Some(user) =>
      user.doAs(new PrivilegedExceptionAction[T]() {
        @throws(classOf[Exception])
        def run: T = funct(inputParameters)
      })
    case None => funct(inputParameters)
  }
}

// Wraps a lazy block: run it as the globally configured user, or directly if there is none
def executeSecure[T](exe: ExecutionWrp[T]): T = {
  KerberosUser.getUser match {
    case Some(user) =>
      user.doAs(new PrivilegedExceptionAction[T]() {
        @throws(classOf[Exception])
        def run: T = exe.value
      })
    case None => exe.value
  }
}

// By-name wrapper so the block is only evaluated inside the doAs
class ExecutionWrp[T](wrp: => T) {
  lazy val value: T = wrp
}
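A hypothetical usage of these wrappers (the user name, paths and context objects are made up) could look like the sketch below: any storage-touching code is passed through executeSecure so that it runs inside the resolved user's doAs block, or directly when no Kerberos configuration is present.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Assumes KerberosFunction exposes the executeSecure wrappers shown above
def wrapperExamples(sc: SparkContext, hadoopConf: Configuration): Unit = {
  // Overload 1: run a one-argument function as the proxy user "analyst1"
  val lineCount: Long = KerberosFunction.executeSecure(
    Some("analyst1"),
    (path: String) => sc.textFile(path).count(),
    "/user/analyst1/events")

  // Overload 2: set the proxy user globally, then run a by-name block under it
  KerberosUser.setProxyUser("analyst1")
  val statuses = KerberosFunction.executeSecure(
    new ExecutionWrp(FileSystem.get(hadoopConf).listStatus(new Path("/user/analyst1"))))

  statuses.foreach(status => println(status.getPath))
  println(s"lines read: $lineCount")
}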
22. Driver Side
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  ...
  /**
   * Get the array of partitions of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        // Wrapping the parameterless getPartitions method
        partitions_ = KerberosFunction.executeSecure(new ExecutionWrp(getPartitions))
        partitions_.zipWithIndex.foreach { case (partition, index) =>
          require(partition.index == index,
            s"partitions($index).partition == ${partition.index}, but it should equal $index")
        }
      }
      partitions_
    }
  }
}
23. Driver Side
class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
...
// Hadoop datastore RDD save function
def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  // Inner wrapper function whose tasks will run in the cluster
  val internalSave: (JobConf => Unit) = (conf: JobConf) => {
    val hadoopConf = conf
    val outputFormatInstance = hadoopConf.getOutputFormat
    val keyClass = hadoopConf.getOutputKeyClass
    val valueClass = hadoopConf.getOutputValueClass
    ...
    val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
      ...
    }
    self.context.runJob(self, writeToFile)
    writer.commitJob()
  }
  // Kerberos-authenticated save function
  KerberosFunction.executeSecure(internalSave, conf)
}
24. Driver Side
class InMemoryCatalog(
    conf: SparkConf = new SparkConf,
    hadoopConfig: Configuration = new Configuration)

override def createDatabase(
    dbDefinition: CatalogDatabase,
    ignoreIfExists: Boolean): Unit = synchronized {
  def inner: Unit = {
    ...
    try {
      // Spark creates a directory in HDFS for the new database
      val location = new Path(dbDefinition.locationUri)
      val fs = location.getFileSystem(hadoopConfig)
      fs.mkdirs(location)
    } catch {
      case e: IOException =>
        throw new SparkException(s"Unable to create database ${dbDefinition.name} as failed " +
          s"to create its directory ${dbDefinition.locationUri}", e)
    }
    catalog.put(dbDefinition.name, new DatabaseDesc(dbDefinition))
  }
  KerberosFunction.executeSecure(KerberosUser.principal, new ExecutionWrp(inner))
}
25. Driver Side
/**
 * Interface used to load a [[Dataset]] from external storage systems (e.g. file systems, ...)
 */
class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
  ...
  // Method for loading data from sources that take no path
  def load(): DataFrame = {
    load(Seq.empty: _*) // force invocation of `load(...varargs...)`
  }
  ...
  def load(paths: String*): DataFrame = {
    // Get the proxy user from the dataset options
    val proxyuser = extraOptions.get("user")
    if (proxyuser.isDefined) KerberosUser.setProxyUser(proxyuser.get)
    val dataSource = KerberosFunction.executeSecure(proxyuser,
      DataSource.apply,
      sparkSession,
      source,
      paths, userSpecifiedSchema, Seq.empty, None, extraOptions.toMap)
    // Obtaining the baseRelation from the datasource under the resolved user
    val baseRelation = KerberosFunction.executeSecure(proxyuser, dataSource.resolveRelation, false)
    KerberosFunction.executeSecure(proxyuser, sparkSession.baseRelationToDataFrame, baseRelation)
  }
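From the caller's point of view the proxy user is just another read option; a hypothetical example (format, path and user name are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kerberized-read").getOrCreate()

// load() picks up the "user" option and resolves the relation under that user's doAs
val events = spark.read
  .format("parquet")
  .option("user", "analyst1")
  .load("/user/analyst1/events")

events.show(5)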
26. Driver Side
/**
 * Interface used to write a [[Dataset]] to external storage systems (e.g. file systems, ...)
 */
class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
  ...
  /**
   * Saves the content of the [[DataFrame]] as the specified table.
   */
  ...
  // Method for saving data to external sources
  def save(): Unit = {
    assertNotBucketed("save")
    ...
    // Get the proxy user from the dataset options
    val maybeUser = extraOptions.get("user")
    def innerWrite(modeData: (SaveMode, DataFrame)): Unit = {
      val (mode, data) = modeData
      dataSource.write(mode, data)
    }
    if (maybeUser.isDefined) KerberosUser.setProxyUser(maybeUser.get)
    // Wrapping the save execution
    KerberosFunction.executeSecure(maybeUser, innerWrite, (mode, df))
  }
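The write path mirrors the read path; a hypothetical example of saving a DataFrame as the same proxy user (format, path and user name are placeholders):

// Assumes `events` is the DataFrame loaded in the previous read example
events.filter("id > 0").write
  .format("parquet")
  .mode("append")
  .option("user", "analyst1")
  .save("/user/analyst1/events_clean")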
27. Driver Side
class DAGScheduler(...) {
  ...
  // Propagate the current proxy user to the tasks through the job properties
  KerberosUser.getMaybeUser match {
    case Some(user) => properties.setProperty("user", user)
    case _ =>
  }
  ...
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          ...
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
          ...
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
        }
  ...
28. Executor Side
private[spark] abstract class Task[T](
    val stageId: Int,
    val stageAttemptId: Int,
    val partitionId: Int,
    // The default value is only used in tests.
    val metrics: TaskMetrics = TaskMetrics.registered,
    @transient var localProperties: Properties = new Properties) extends Serializable {
  ...
  final def run(
    ...
    try {
      // Get the proxy user from the properties loaded on the Driver side and wrap the execution
      val proxyUser =
        Option(Executor.taskDeserializationProps.get().getProperty("user"))
      KerberosFunction.executeSecure(proxyUser, runTask, context)
    } catch {
      ...
  // Method implemented by the Task subclasses
  def runTask(context: TaskContext): T