Virtualizing Apache Spark with Justin Murray

Jun 16, 2017

This talk explains why virtualizing Spark, in-house or elsewhere, is a requirement in today’s fast-moving and experimental world of data science and data engineering. Different teams want to spin up a Spark cluster “on the fly” to carry out research and quickly answer business questions. They are not concerned with the availability of the server hardware, or with what any other team might be doing on it at the time. Virtualization provides a sandbox of your own in which to try out a new query or machine learning algorithm. Detailed performance test results are shown, demonstrating that Spark and ML programs perform about as well on virtual machines as they do in native implementations. An early introduction is given to the best practices you should follow when virtualizing Spark. If time allows, a short demo shows how to create an ephemeral, single-purpose Spark cluster, run an ML test application on that cluster, and bring it down when finished.

Justin Murray, VMware
Virtualizing Spark on VMware vSphere

Use Cases: Virtualization of Big Data
• IT wants to provide Spark clusters as a service on-demand for its end users
• Enterprises have development, test, pre-prod staging and production clusters that are required to be separated from each other and provisioned independently
• Organizations need different versions of Spark to be available to different teams, with possibly different services available
• Enterprises do not wish to dedicate a specific set of hardware to each different requirement above, and want to reduce overall costs
CONFIDENTIAL 3

The Traditional Hadoop Architecture
[Diagram: master roles (Namenode, Resourcemanager) and a submitted Job; an input file divided into Split 1, Split 2 and Split 3 of 64 MB each; Worker Nodes 1 to 3, each with a Datanode and a Nodemanager, holding Block 1, Block 2 and Block 3 of 64 MB each; AppMaster-1, Container-2 and Container-3 on the worker nodes.]
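The split arithmetic in the diagram can be sketched in a few lines of Python. This is only an illustrative sketch: the 64 MB size comes from the slide, while the helper name `split_sizes_mb` and the 192 MB example file are assumptions made for this example.

```python
BLOCK_SIZE_MB = 64  # split/block size shown on the slide


def split_sizes_mb(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the splits produced for a file.

    Hypothetical helper: every split is a full block except possibly the last.
    """
    if file_size_mb <= 0:
        return []
    n_full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * n_full
    if remainder:
        sizes.append(remainder)
    return sizes


# An assumed 192 MB input file yields three 64 MB splits, as in the diagram.
print(split_sizes_mb(192))  # [64, 64, 64]
```

A file that is not an exact multiple of the block size simply ends with one smaller split, for example `split_sizes_mb(200)` gives `[64, 64, 64, 8]`.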

Hadoop – in Virtual Machines
[Diagram: master roles (Namenode, ResourceManager) and a submitted Job; an input file divided into Split 1, Split 2 and Split 3 of 64 MB each; Worker Nodes 1 to 3, each with a Datanode and a Nodemanager, holding Block 1, Block 2 and Block 3 of 64 MB each; AppMaster-1, Container-2 and Container-3 on the worker nodes.]

The Spark Architecture – Standalone
[Diagram: a Driver and its Job, with Executor JVMs spread across Worker Nodes 1 to 3.]

Spark Standalone - Virtualized
[Diagram: a Driver and its Job, with Executor JVMs spread across Worker Nodes 1 to 3; the nodes are labeled as virtual machines.]

The Spark Architecture (on YARN)
[Diagram: Namenode, Resourcemanager and a submitted Job; Worker Nodes 1 to 3, each with a Nodemanager and a Datanode, holding Block 1, Block 2 and Block 3 of 64 MB each; AppMaster-1 hosting the Driver, with Executors in Container-2 and Container-3.]

Combined Model: Two Virtual Machines on a Host
[Diagram: one virtualization host server running two virtual machines, Hadoop Node 1 and Hadoop Node 2. Each virtual machine runs a Datanode and a Nodemanager and has six or more local DAS disks, each presented as a VMDK formatted with Ext4.]

Workloads - Spark
• Two standard analytic programs from the Spark MLlib (Machine Learning Library)
– Support Vector Machine
– Logistic Regression
• Driven using SparkBench (https://github.com/SparkTC/spark-bench)

Spark Support Vector Machine Performance
[Performance chart]

Spark Logistic Regression Performance
[Performance chart]

Results - Spark
• The Support Vector Machines workload, which stayed in memory, ran about 10% faster in virtualized form than on bare metal
• The Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage to bare metal
– part of the dataset was cached to disk
– the larger memory of the bare metal Spark executors may help
• Both workloads showed linear scaling from 5 to 10 hosts and as dataset size increased
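To make terms such as "about 10% faster" and "linear scaling" concrete, here is a minimal sketch of how such figures could be computed from run times. The timings below are hypothetical placeholders, not measurements from the talk. The function names are likewise assumptions, and "faster" is taken here to mean a reduction in elapsed time.

```python
def percent_faster(baseline_seconds, candidate_seconds):
    """Reduction in elapsed time of the candidate run, as a percentage of the baseline."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


def scaling_efficiency(time_small, hosts_small, time_large, hosts_large):
    """Observed speedup divided by ideal speedup for a fixed dataset size.

    A value of 1.0 corresponds to linear scaling.
    """
    observed_speedup = time_small / time_large
    ideal_speedup = hosts_large / hosts_small
    return observed_speedup / ideal_speedup


# Hypothetical timings in seconds, NOT results from the talk.
bare_metal_s = 1000.0
virtualized_s = 900.0
print(percent_faster(bare_metal_s, virtualized_s))  # 10.0

# Hypothetical: doubling hosts from 5 to 10 halves the run time.
print(scaling_efficiency(1000.0, 5, 500.0, 10))  # 1.0
```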

NUMA and Virtual Machine Placement
[Diagram: a server with 1 TB RAM; each NUMA node has 1024/2 = 512 GB; each VM is given 482 GB RAM.]
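The sizing on this slide can be worked through as a small calculation. This is a minimal sketch, assuming RAM is divided evenly across two NUMA nodes and one VM is placed per node. The function names are hypothetical, and the slide does not say what the memory left over on each node is used for.

```python
def numa_node_memory_gb(server_ram_gb, numa_nodes):
    """Memory local to each NUMA node, assuming RAM is spread evenly."""
    return server_ram_gb / numa_nodes


def fits_in_numa_node(vm_ram_gb, server_ram_gb, numa_nodes):
    """True if the VM's memory fits within a single NUMA node."""
    return vm_ram_gb <= numa_node_memory_gb(server_ram_gb, numa_nodes)


server_ram_gb = 1024  # 1 TB RAM on the server (from the slide)
numa_nodes = 2        # the slide divides 1024 by 2
vm_ram_gb = 482       # RAM given to each VM (from the slide)

per_node_gb = numa_node_memory_gb(server_ram_gb, numa_nodes)
print(per_node_gb)              # 512.0
print(per_node_gb - vm_ram_gb)  # 30.0 GB left on each node
print(fits_in_numa_node(vm_ram_gb, server_ram_gb, numa_nodes))  # True
```

Keeping each VM's memory within one NUMA node's 512 GB means the VM does not need to span both nodes.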

Conclusions
• Spark workloads work very well on VMware vSphere
– Various performance studies have shown that any difference between virtualized performance and native performance is minimal
– Follow the general best practice guidelines that VMware has published
– Design patterns such as data-compute separation can be used to provide elasticity of your Spark cluster

Thank You.
Contact jmurray@vmware.com or bigdata@vmware.com




Finally we will present that up to 2X performance boost can be achieved on Spark ML tests after applying this non-volatile computing approach that removed SerDe, caching hot data, and reducing GC pause time dramatically.

Introduce_non-volatile_generic_object_programming_model_for_In-Memory_ComputingYanpingWang

This document introduces Apache Mnemonic, an open source project that provides a non-volatile programming model for Java applications to improve performance of in-memory computing frameworks like Spark. It describes how Mnemonic allows data to be stored and processed directly in persistent memory rather than being serialized to disk. Experimental results show Mnemonic can significantly reduce Spark MLlib Kmeans execution time by avoiding object serialization and spilling to disk.

Spark introduction and architectureSohil Jain

Sparkfatemehjamalii

This document provides an overview of Apache Spark, including: - Spark is an open-source cluster computing framework that supports in-memory processing of large datasets across clusters of computers using a concept called resilient distributed datasets (RDDs). - RDDs allow data to be partitioned across nodes in a fault-tolerant way, and support operations like map, filter, and reduce. - Spark SQL, DataFrames, and Datasets provide interfaces for structured and semi-structured data processing. - The document discusses Spark's performance advantages over Hadoop MapReduce and provides examples of common Spark applications like word count, Pi estimation, and stream processing.

Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf

Scaling Spark – Vertically: The mantra of Spark technology is divide and conquer, especially for problems too big for a single computer. The more you divide a problem across worker nodes, the more total memory and processing parallelism you can exploit. This comes with a trade-off. Splitting applications and data across multiple nodes is nontrivial, and more distribution results in more network traffic which becomes a bottleneck. Can you achieve scale and parallelism without those costs? We’ll show results of a variety of Spark application domains including structured data, graph processing and common machine learning in a single, high-capacity scaled-up system versus a more distributed approach and discuss how virtualization can be used to define node size flexibly, achieving the best balance for Spark performance.

Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj

This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It begins by introducing the speaker and GigaSpaces. It then discusses how modern applications require both online transaction processing and real-time operational intelligence. The document presents examples from retail and IoT and the goals of minimizing latency while maximizing data analytics locality. It provides an overview of in-memory computing options and describes how GigaSpaces uses an in-memory data grid combined with Spark to achieve HTAP. The document includes deployment diagrams and discusses data grid RDDs and pushing predicates to the data grid. It describes how this was productized as InsightEdge and provides additional innovations and reference architectures.

Jump Start on Apache® Spark™ 2.x with Databricks Databricks

Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data. In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Agenda: • Overview of Spark Fundamentals & Architecture • What’s new in Spark 2.x • Unified APIs: SparkSessions, SQL, DataFrames, Datasets • Introduction to DataFrames, Datasets and Spark SQL • Introduction to Structured Streaming Concepts • Four Hands On Labs You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL. Level: Beginner to intermediate, not for advanced Spark users. Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional. Bio: Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks

Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao

Big Telco - Yousun JeongSpark Summit

Big Telco Real-Time Network AnalyticsYousun Jeong

Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesDatabricks

Introduction to Apache Geode (Cork, Ireland)Anthony Baker

The Fast Path to Building Operational Applications with SparkSingleStore


Important JavaScript Concepts Every Developer Must Knowyashikanigam1

Mastering JavaScript requires a deep understanding of key concepts like closures, hoisting, promises, async/await, event loop, and prototypal inheritance. These fundamentals are crucial for both frontend and backend development, especially when working with frameworks like React or Node.js. At TutorT Academy, we cover these topics in our live courses for professionals, ensuring hands-on learning through real-world projects. If you're looking to strengthen your programming foundation, our best online professional certificates in full-stack development and system design will help you apply JavaScript concepts effectively and confidently in interviews or production-level applications.

national income & related aggregates (1)(1).pptxj2492618

Time series analysis & forecasting day 2.pptxAsmaaMahmoud89

TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfNhiV747372

What is ETL? Difference between ETL and ELT?.pdfSaikatBasu37

Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Jayantilal Bhanushali

Red Hat Openshift Training - openshift (1).pptxssuserf60686

Responsible Data Science for Process MinersProcess mining Evangelist

Wil van der Aalst gave the closing keynote at camp. He started with giving an overview of the progress that has been made in the process mining field over the past 20 years. Process mining unlocks great potential but also comes with a huge responsibility. Responsible data science focuses on positive technological breakthroughs and aims to prevent “pollution” by “bad data science”. Wil gave us a sneak peek at current responsible process mining research from the area of ‘fairness’ (how to draw conclusions from data that are fair without sacrificing accuracy too much) and ‘confidentiality’ (how to analyze data without revealing secrets). While research can provide some solutions by developing new techniques, understanding these risks is a responsibility of the process miner.

Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxssuserd1f4a3

Lesson 6-Interviewing in SHRM_updated.pdfhemelali11

英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理Taqyea

在线制作本科文凭利物浦约翰摩尔斯大学英文学位证书影本英国成绩单利物浦约翰摩尔斯大学文凭【q微1954292140】高仿真还原英国文凭证书和外壳，定制英国利物浦约翰摩尔斯大学成绩单和信封。成绩单丢失补办LJMU毕业证【q微1954292140】办理英国利物浦约翰摩尔斯大学毕业证(LJMU毕业证书)【q微1954292140】学历认证定制利物浦约翰摩尔斯大学offer/学位证成绩单英文版、留信官方学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作。帮你解决利物浦约翰摩尔斯大学学历学位认证难题。《利物浦约翰摩尔斯大学毕业证书英国毕业证书办理LJMU成绩单详解细节》【q微1954292140】学位证1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。主营项目： 1、真实教育部国外学历学位认证《英国毕业文凭证书快速办理利物浦约翰摩尔斯大学毕业证购买》【q微1954292140】《论文没过利物浦约翰摩尔斯大学正式成绩单》，教育部存档，教育部留服网站100%可查. 2、办理LJMU毕业证，改成绩单《LJMU毕业证明办理利物浦约翰摩尔斯大学文凭在线制作》【Q/WeChat：1954292140】Buy Liverpool John Moores University Certificates《正式成绩单论文没过》，利物浦约翰摩尔斯大学Offer、在读证明、学生卡、信封、证明信等全套材料，从防伪到印刷，从水印到钢印烫金，高精仿度跟学校原版100%相同. 3、真实使馆认证（即留学人员回国证明），使馆存档可通过大使馆查询确认. 4、留信网认证，国家专业人才认证中心颁发入库证书，留信网存档可查. 利物浦约翰摩尔斯大学offer/学位证、留信官方学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作【q微1954292140】Buy Liverpool John Moores University Diploma购买美国毕业证，购买英国毕业证，购买澳洲毕业证，购买加拿大毕业证，以及德国毕业证，购买法国毕业证（q微1954292140）购买荷兰毕业证、购买瑞士毕业证、购买日本毕业证、购买韩国毕业证、购买新西兰毕业证、购买新加坡毕业证、购买西班牙毕业证、购买马来西亚毕业证等。包括了本科毕业证，硕士毕业证。特殊原因导致无法毕业，也可以联系我们帮您办理相关材料：１：在利物浦约翰摩尔斯大学挂科了，不想读了，成绩不理想怎么办？？？ 2：打算回国了，找工作的时候，需要提供认证《LJMU成绩单购买办理利物浦约翰摩尔斯大学毕业证书范本》【Q/WeChat：1954292140】Buy Liverpool John Moores University Diploma《正式成绩单论文没过》有文凭却得不到认证。又该怎么办？？？英国毕业证购买，英国文凭购买，【q微1954292140】英国文凭购买，英国文凭定制，英国文凭补办。专业在线定制英国大学文凭，定做英国本科文凭，【q微1954292140】复制英国Liverpool John Moores University completion letter。在线快速补办英国本科毕业证、硕士文凭证书，购买英国学位证、利物浦约翰摩尔斯大学Offer，英国大学文凭在线购买。办理利物浦约翰摩尔斯大学学位证(LJMU毕业证书)毕业证详解细节【q微1954292140】帮您解决在英国利物浦约翰摩尔斯大学未毕业难题（Liverpool John Moores University）文凭购买、毕业证购买、大学文凭购买、大学毕业证购买、买文凭、日韩文凭、英国大学文凭、美国大学文凭、澳洲大学文凭、加拿大大学文凭（q微1954292140）新加坡大学文凭、新西兰大学文凭、爱尔兰文凭、西班牙文凭、德国文凭、教育部认证，买毕业证，毕业证购买，买大学文凭，购买日韩毕业证、英国大学毕业证、美国大学毕业证、澳洲大学毕业证、加拿大大学毕业证（q微1954292140）新加坡大学毕业证、新西兰大学毕业证、爱尔兰毕业证、西班牙毕业证、德国毕业证，回国证明，留信网认证，留信认证办理，学历认证。从而完成就业。利物浦约翰摩尔斯大学毕业证办理，利物浦约翰摩尔斯大学文凭办理，利物浦约翰摩尔斯大学成绩单办理和真实留信认证、留服认证、利物浦约翰摩尔斯大学学历认证。学院文凭定制，利物浦约翰摩尔斯大学原版文凭补办，Diploma，扫描件文凭定做，100%文凭复刻。

Virtualizing Apache Spark with Justin Murray

1. Virtualizing Spark on VMware vSphere
Justin Murray, VMware

2. Why Virtualize Spark?

3. Use Cases: Virtualization of Big Data
• IT wants to provide Spark clusters as a service, on demand, for its end users
• Enterprises have development, test, pre-production staging and production clusters that must be separated from each other and provisioned independently
• Organizations need different versions of Spark to be available to different teams, possibly with different services available
• Enterprises do not wish to dedicate a specific set of hardware to each requirement above, and want to reduce overall costs

4. The Traditional Hadoop Architecture
[Diagram labels: Job; Input File divided into Split 1, Split 2 and Split 3 (64 MB each); Worker Nodes 1 to 3, each with a Nodemanager and a Datanode; Block 1, Block 2 and Block 3 (64 MB each); AppMaster-1, Container-2, Container-3; Master Roles: Resourcemanager, Namenode]
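The 64 MB splits and blocks shown in the diagram can be illustrated with a little arithmetic. The following is a minimal sketch, not Hadoop's actual implementation: the function names `split_sizes` and `assign_round_robin` are hypothetical, the 192 MB input size is assumed so that it yields the three splits on the slide, and the round-robin placement is a simplification (in a real cluster the Namenode decides block placement and also handles replication).

```python
MB = 1024 * 1024


def split_sizes(file_size_bytes, block_size_bytes=64 * MB):
    """Return the sizes of the fixed-size blocks a file is cut into.

    Every block is block_size_bytes long except possibly the last one,
    which holds whatever remains.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def assign_round_robin(num_blocks, workers):
    """Map block numbers (starting at 1) to workers in round-robin order.

    Assumption for illustration only: real HDFS placement is more involved.
    """
    return {i + 1: workers[i % len(workers)] for i in range(num_blocks)}


workers = ["Worker Node 1", "Worker Node 2", "Worker Node 3"]
blocks = split_sizes(192 * MB)  # assumed input size: 3 x 64 MB
placement = assign_round_robin(len(blocks), workers)
for block_no, worker in placement.items():
    print(f"Block {block_no} ({blocks[block_no - 1] // MB} MB) -> {worker}")
```

With the assumed 192 MB input, each of the three worker nodes holds one 64 MB block, which matches the layout on the slide.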

5. Hadoop in Virtual Machines
[Diagram labels: Job; Input File divided into Split 1, Split 2 and Split 3 (64 MB each); Worker Nodes 1 to 3, each with a Nodemanager and a Datanode; Block 1, Block 2 and Block 3 (64 MB each); AppMaster-1, Container-2, Container-3; Master Roles: ResourceManager, Namenode]

6. The Spark Architecture - Standalone
[Diagram labels: Job; Driver; Worker Nodes 1 to 3; six Executor JVMs]

7. Spark Standalone - Virtualized
[Diagram labels: Job; Driver; Worker Nodes 1 to 3; six Executor JVMs; Virtual Machine]

8. The Spark Architecture (on YARN)
[Diagram labels: Job; Resourcemanager; Namenode; Worker Nodes 1 to 3, each with a Nodemanager and a Datanode; Block 1, Block 2 and Block 3 (64 MB each); AppMaster-1, Container-2, Container-3; Driver; two Executors]

9. Reference Architectures

10. Combined Model: Two Virtual Machines on a Host
[Diagram labels: Virtualization Host Server; Hadoop Node 1 Virtual Machine and Hadoop Node 2 Virtual Machine, each with a Datanode and a Nodemanager; six or more local DAS disks per virtual machine; VMDK and Ext4 labels on the disks]

11. #1 Reference Architecture from Cloudera

12. Performance

13. Workloads - Spark
• Two standard analytic programs from the Spark MLlib (Machine Learning Library)
  – Support Vector Machine
  – Logistic Regression
• Driven using SparkBench (https://github.com/SparkTC/spark-bench)

14. Spark Support Vector Machine Performance
[Chart]

15. Spark Logistic Regression Performance
[Chart]

16. Results - Spark
• The Support Vector Machine workload, which stayed in memory, ran about 10% faster in virtualized form than on bare metal
• The Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage for bare metal
  – Part of the dataset was cached to disk
  – The larger memory of the bare-metal Spark executors may help
• Both workloads showed linear scaling from 5 to 10 hosts and as dataset size increased

17. NUMA and Virtual Machine Placement
• 1 TB RAM on the server
• Each NUMA node has 1024/2 = 512 GB
• 482 GB RAM for each VM
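The sizing arithmetic on this slide can be checked with a short sketch. It assumes the server's RAM is divided evenly between its NUMA nodes (two nodes, as implied by the slide's 1024/2) and that a VM should fit entirely within one node. The function names are hypothetical, and the slide does not say what the memory left over on each node is used for.

```python
def numa_node_memory_gb(server_ram_gb, numa_nodes):
    """Memory local to each NUMA node, assuming RAM is split evenly."""
    if numa_nodes <= 0:
        raise ValueError("numa_nodes must be positive")
    return server_ram_gb / numa_nodes


def vm_fits_in_numa_node(vm_ram_gb, server_ram_gb, numa_nodes):
    """True if the VM's memory fits within a single NUMA node."""
    return vm_ram_gb <= numa_node_memory_gb(server_ram_gb, numa_nodes)


# Figures taken from the slide.
SERVER_RAM_GB = 1024  # 1 TB of RAM on the server
NUMA_NODES = 2        # implied by "1024/2"
VM_RAM_GB = 482       # RAM given to each VM

per_node_gb = numa_node_memory_gb(SERVER_RAM_GB, NUMA_NODES)
headroom_gb = per_node_gb - VM_RAM_GB

print(f"Memory per NUMA node: {per_node_gb:.0f} GB")
print(f"VM size: {VM_RAM_GB} GB, fits in one node: "
      f"{vm_fits_in_numa_node(VM_RAM_GB, SERVER_RAM_GB, NUMA_NODES)}")
print(f"Memory left on the node: {headroom_gb:.0f} GB")
```

With the slide's figures, each node has 512 GB, so a 482 GB VM fits inside one NUMA node with 30 GB to spare on that node.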

18. Conclusions
• Spark workloads work very well on VMware vSphere
• Various performance studies have shown that any difference between virtualized performance and native performance is minimal
• Follow the general best-practice guidelines that VMware has published
• Design patterns such as data-compute separation can be used to provide elasticity for your Spark cluster

19. Thank You. Contact [email protected] or [email protected]
