Justin Murray, VMware
Virtualizing Spark on VMware vSphere
Why Virtualize Spark?
Use Cases: Virtualization of Big Data
• IT wants to provide Spark clusters as a service on-demand for its
end users
• Enterprises have development, test, pre-prod staging and production
clusters that are required to be separated from each other and
provisioned independently
• Organizations need different versions of Spark to be available to
different teams - with possibly different services available
• Enterprises do not wish to dedicate a specific set of hardware to each
different requirement above, and want to reduce overall costs
CONFIDENTIAL 3
The Traditional Hadoop Architecture
[Diagram: an input file is divided into Split 1, Split 2 and Split 3 (64MB each), stored as Block 1, Block 2 and Block 3 (64MB each) across Worker Node 1, Worker Node 2 and Worker Node 3. Each worker node runs a Datanode and a Nodemanager. The master roles are the Namenode and the Resourcemanager, shown with the Job. AppMaster - 1, Container - 2 and Container - 3 run on the worker nodes.]
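To make the split and block layout in the diagram concrete, here is a minimal sketch of how an input file breaks into 64MB blocks spread over three worker nodes. The 192 MB file size and the round-robin placement are assumptions for illustration only; real HDFS placement also takes replication and rack topology into account.

```python
BLOCK_SIZE_MB = 64  # block/split size shown in the diagram


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block an input file is divided into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block may be smaller than 64MB
    return sizes


def assign_round_robin(block_sizes, workers):
    """Assign blocks to workers in turn (a simplification of HDFS placement)."""
    placement = {worker: [] for worker in workers}
    for index, size in enumerate(block_sizes):
        worker = workers[index % len(workers)]
        placement[worker].append((index + 1, size))  # (block number, size in MB)
    return placement


workers = ["Worker Node 1", "Worker Node 2", "Worker Node 3"]
blocks = split_into_blocks(192)  # hypothetical file size: three full 64MB blocks
placement = assign_round_robin(blocks, workers)
for worker, assigned in placement.items():
    print(worker, assigned)
```

With these assumed inputs each worker node holds exactly one 64MB block, which matches the one-block-per-node picture on the slide.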
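Hadoop – in Virtual Machines
[Diagram: the same layout as the traditional Hadoop architecture, placed in virtual machines. An input file is divided into Split 1, Split 2 and Split 3 (64MB each), stored as Block 1, Block 2 and Block 3 (64MB each) across Worker Node 1, Worker Node 2 and Worker Node 3. Each worker node runs a Datanode and a Nodemanager. The master roles are the Namenode and the ResourceManager, shown with the Job. AppMaster - 1, Container - 2 and Container - 3 run on the worker nodes.]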
The Spark Architecture – Standalone
[Diagram: a Driver with its Job, and six Executors, each in its own JVM, across Worker Node 1, Worker Node 2 and Worker Node 3.]
Spark Standalone - Virtualized
[Diagram: a Driver with its Job, and six Executors, each in its own JVM, across Worker Node 1, Worker Node 2 and Worker Node 3, with a Virtual Machine label. The extracted labels also include Nodemanager three times.]
The Spark Architecture (on YARN)
[Diagram: Worker Node 1, Worker Node 2 and Worker Node 3 each run a Datanode, holding Block 1, Block 2 and Block 3 (64MB each). The Job is shown with the Resourcemanager and the Namenode. AppMaster - 1, Container - 2 and Container - 3 host the Driver and two Executors.]
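As a rough illustration of how the same application is pointed at a standalone master or at YARN, the sketch below only assembles `spark-submit` argument lists; it does not launch anything. The host name `spark-master.example.com`, the application file `my_job.py` and the helper `spark_submit_command` are hypothetical. The executor count of two mirrors the diagram and is not a recommendation.

```python
def spark_submit_command(master, application, deploy_mode=None,
                         num_executors=None, executor_memory=None):
    """Build a spark-submit argument list for the given cluster manager."""
    command = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        command += ["--deploy-mode", deploy_mode]
    if num_executors is not None:
        # --num-executors is honoured on YARN, not by the standalone manager
        command += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        command += ["--executor-memory", executor_memory]
    command.append(application)
    return command


APPLICATION = "my_job.py"  # hypothetical application file

# Standalone: the master URL names the (hypothetical) Spark master host.
standalone = spark_submit_command("spark://spark-master.example.com:7077",
                                  APPLICATION)

# YARN in cluster deploy mode: the driver runs inside the application master,
# and the executors run in YARN containers.
on_yarn = spark_submit_command("yarn", APPLICATION,
                               deploy_mode="cluster", num_executors=2)

print(" ".join(standalone))
print(" ".join(on_yarn))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to a process launcher.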
Reference Architectures
Combined Model: Two Virtual Machines on a Host
#1 Reference Architecture from Cloudera
[Diagram: a virtualization host server runs two virtual machines, Hadoop Node 1 and Hadoop Node 2. Each virtual machine runs a Datanode and a Nodemanager on Ext4 file systems, each backed by a VMDK. Six or more local DAS disks per virtual machine.]
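One way to picture the "one Ext4 file system per VMDK" layout is as the directory lists handed to the Datanode and Nodemanager inside each virtual machine. The sketch below is an assumption-laden illustration: the mount points `/data/1` to `/data/6` and the subdirectory names are hypothetical, while `dfs.datanode.data.dir` and `yarn.nodemanager.local-dirs` are the standard Hadoop property names for these lists.

```python
def mount_points(disk_count, prefix="/data"):
    """Return one hypothetical mount point per VMDK-backed Ext4 file system."""
    return [f"{prefix}/{number}" for number in range(1, disk_count + 1)]


def directory_list(mounts, subdirectory):
    """Join one directory per mount point into a comma-separated value."""
    return ",".join(f"{mount}/{subdirectory}" for mount in mounts)


DISKS_PER_VM = 6  # the slide calls for six or more local DAS disks per VM

mounts = mount_points(DISKS_PER_VM)
config = {
    # HDFS Datanode block storage, spread across all disks
    "dfs.datanode.data.dir": directory_list(mounts, "dfs/dn"),
    # YARN Nodemanager scratch space, spread across the same disks
    "yarn.nodemanager.local-dirs": directory_list(mounts, "yarn/nm"),
}

for name, value in config.items():
    print(f"{name}={value}")
```

Listing every disk in both properties lets the Datanode and the Nodemanager spread I/O over all of the virtual machine's disks.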
Performance
Workloads - Spark
• Two standard analytic programs from the Spark MLlib (Machine Learning Library):
– Support Vector Machine
– Logistic Regression
• Driven using SparkBench (https://github.com/SparkTC/spark-bench)
Spark Support Vector Machine Performance
Spark Logistic Regression Performance
Results - Spark
• The Support Vector Machine workload, which stayed in memory, ran about 10% faster virtualized than on bare metal
• The Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage for bare metal
– Part of the dataset was cached to disk
– The larger memory of the bare metal Spark executors may have helped
• Both workloads scaled linearly from 5 to 10 hosts and as dataset size increased
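For readers who want to reproduce a figure such as "about 10% faster" from their own runs, a minimal sketch follows. It assumes "faster" means a reduction in elapsed time relative to the bare metal run. The run times used are hypothetical placeholders, not measurements from this study.

```python
def percent_faster(baseline_seconds, candidate_seconds):
    """Percent reduction in elapsed time of the candidate versus the baseline.

    A positive result means the candidate finished sooner than the baseline.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Hypothetical elapsed times, chosen only to illustrate the arithmetic.
bare_metal_seconds = 1000.0
virtualized_seconds = 900.0

advantage = percent_faster(bare_metal_seconds, virtualized_seconds)
print(f"Virtualized run is {advantage:.1f}% faster than bare metal")
```

A negative result from the same function would indicate an advantage for bare metal.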
NUMA and Virtual Machine Placement
• 1 TB RAM on server
• Each NUMA node has 1024/2 = 512 GB
• 482 GB RAM for each VM
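The sizing arithmetic on this slide can be checked with a short sketch. The figures (1 TB of host RAM, two NUMA nodes, 482 GB per virtual machine) come from the slide; the function names are illustrative. Reading the remaining 30 GB per node as headroom is an interpretation rather than something the slide states.

```python
def numa_node_memory_gb(host_memory_gb, numa_nodes):
    """Memory attached to each NUMA node, assuming an even split."""
    return host_memory_gb / numa_nodes


def fits_in_numa_node(vm_memory_gb, host_memory_gb, numa_nodes):
    """True if a VM's memory fits within a single NUMA node."""
    return vm_memory_gb <= numa_node_memory_gb(host_memory_gb, numa_nodes)


HOST_MEMORY_GB = 1024  # 1 TB RAM on the server
NUMA_NODES = 2         # implied by the slide's 1024/2
VM_MEMORY_GB = 482     # RAM for each VM

per_node_gb = numa_node_memory_gb(HOST_MEMORY_GB, NUMA_NODES)
headroom_gb = per_node_gb - VM_MEMORY_GB

print(f"Memory per NUMA node: {per_node_gb:.0f} GB")
print(f"VM fits in one node: {fits_in_numa_node(VM_MEMORY_GB, HOST_MEMORY_GB, NUMA_NODES)}")
print(f"Left over per node: {headroom_gb:.0f} GB")
```

Sizing each virtual machine below the per-node figure lets one virtual machine sit within each NUMA node, so its memory accesses can stay local to that node.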
Conclusions
• Spark workloads work very well on VMware vSphere
– Various performance studies have shown that any difference between virtualized performance and native performance is minimal
– Follow the general best practice guidelines that VMware has published
– Design patterns such as data-compute separation can be used to provide elasticity for your Spark cluster
Thank You.
Contact jmurray@vmware.com or
bigdata@vmware.com
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
RomiRomeo
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
DATA ANALYST and Techniques in Kochi Explore cutting-edge analytical skills ...
DATA ANALYST  and Techniques in Kochi Explore cutting-edge analytical skills ...DATA ANALYST  and Techniques in Kochi Explore cutting-edge analytical skills ...
DATA ANALYST and Techniques in Kochi Explore cutting-edge analytical skills ...
aacj102006
 
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOTTYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
TYPES OF SOFTWARE_ A Visual Guide.pdf CA SUVIDHA CHAPLOT
CA Suvidha Chaplot
 
Important JavaScript Concepts Every Developer Must Know
Important JavaScript Concepts Every Developer Must KnowImportant JavaScript Concepts Every Developer Must Know
Important JavaScript Concepts Every Developer Must Know
yashikanigam1
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Time series analysis & forecasting day 2.pptx
Time series analysis & forecasting day 2.pptxTime series analysis & forecasting day 2.pptx
Time series analysis & forecasting day 2.pptx
AsmaaMahmoud89
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Red Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptxRed Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptx
ssuserf60686
 
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxConcrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
ssuserd1f4a3
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理
英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理
英国学位证(利物浦约翰摩尔斯大学本科毕业证)LJMU文凭证书办理
Taqyea
 

Virtualizing Apache Spark with Justin Murray

  • 1. Justin Murray, VMware Virtualizing Spark on VMware vSphere
  • 3. Use Cases: Virtualization of Big Data
      • IT wants to provide Spark clusters as an on-demand service for its end users
      • Enterprises have development, test, pre-prod staging and production clusters that must be separated from each other and provisioned independently
      • Organizations need different versions of Spark to be available to different teams, possibly with different services available
      • Enterprises do not wish to dedicate a specific set of hardware to each of the requirements above, and want to reduce overall costs
  • 4. The Traditional Hadoop Architecture (diagram): Worker Nodes 1 to 3, each with a Nodemanager and a Datanode; an Input File shown as three 64MB splits (Split 1 to 3) and three 64MB blocks (Block 1 to 3); a Job with AppMaster - 1, Container - 2 and Container - 3; master roles: Resourcemanager and Namenode
  • 5. Hadoop in Virtual Machines (diagram): the same components as slide 4 (Worker Nodes 1 to 3 with Nodemanager and Datanode, three 64MB splits and blocks, AppMaster - 1, Container - 2, Container - 3, ResourceManager and Namenode master roles), placed in virtual machines
  • 6. The Spark Architecture, Standalone (diagram): a Driver running a Job; six Executor JVMs across Worker Nodes 1 to 3
  • 7. Spark Standalone - Virtualized (diagram): a Driver running a Job; six Executor JVMs across Worker Nodes 1 to 3, each worker node labeled as a Virtual Machine
  • 8. The Spark Architecture (on YARN) (diagram): Resourcemanager and Namenode; Worker Nodes 1 to 3, each with a Nodemanager and a Datanode; three 64MB blocks (Block 1 to 3); a Job with AppMaster - 1, Container - 2 and Container - 3 hosting the Driver and two Executors
  • 10. Combined Model: Two Virtual Machines on a Host (diagram): a Virtualization Host Server running Hadoop Node 1 and Hadoop Node 2 as virtual machines, each with a Datanode and a Nodemanager on Ext4 file systems backed by VMDKs; six or more local DAS disks per virtual machine
  • 11. #1 Reference Architecture from Cloudera
  • 13. Workloads - Spark
      • Two standard analytic programs from the Spark MLlib (Machine Learning Library): Support Vector Machine and Logistic Regression
      • Driven using SparkBench (https://github.com/SparkTC/spark-bench)
  • 14. Spark Support Vector Machine Performance
  • 16. Results - Spark
      • The Support Vector Machines workload, which stayed in memory, ran about 10% faster in virtualized form than on bare metal
      • The Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage to bare metal
          – Part of the dataset was cached to disk
          – The larger memory of the bare metal Spark executors may help
      • Both workloads showed linear scaling from 5 to 10 hosts and as dataset size increased
  • 17. NUMA and Virtual Machine Placement
      • 1 TB RAM on the server
      • Each NUMA node has 1024/2 = 512 GB
      • 482 GB RAM for each VM
  • 18. Conclusions
      • Spark workloads work very well on VMware vSphere
      • Various performance studies have shown that any difference between virtualized performance and native performance is minimal
      • Follow the general best practice guidelines that VMware has published
      • Design patterns such as data-compute separation can be used to provide elasticity for your Spark cluster
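
The NUMA sizing arithmetic from slide 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two-node count is read off the slide's "1024/2", and the slide does not say why 482 GB was chosen, so the remaining memory is simply reported as headroom.

```python
def numa_node_memory_gb(server_ram_gb: float, numa_nodes: int) -> float:
    """Memory per NUMA node, assuming RAM is spread evenly across nodes."""
    if numa_nodes <= 0:
        raise ValueError("numa_nodes must be positive")
    return server_ram_gb / numa_nodes


def fits_in_numa_node(vm_ram_gb: float, server_ram_gb: float, numa_nodes: int) -> bool:
    """True if a VM's configured memory fits inside a single NUMA node."""
    return vm_ram_gb <= numa_node_memory_gb(server_ram_gb, numa_nodes)


# Values taken from the slide: 1 TB server, 2 NUMA nodes, 482 GB per VM.
server_ram_gb = 1024
numa_nodes = 2
vm_ram_gb = 482

per_node_gb = numa_node_memory_gb(server_ram_gb, numa_nodes)
headroom_gb = per_node_gb - vm_ram_gb  # memory left over on the node (purpose not stated on the slide)

print(f"Per NUMA node: {per_node_gb:.0f} GB")
print(f"VM fits in one node: {fits_in_numa_node(vm_ram_gb, server_ram_gb, numa_nodes)}")
print(f"Headroom per node: {headroom_gb:.0f} GB")
```

With the slide's numbers, one 482 GB VM per NUMA node stays within the 512 GB node boundary, which matches the "Two Virtual Machines on a Host" model shown on slide 10.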