Storage requirements for running Spark workloads on Kubernetes
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs
About Me
• Advisory Software Engineer @ IBM India Software Labs
• General Purpose Developer
• Love Containers & Kubernetes
• Conference traveler
• Upcoming book on Hadoop and Its Ecosystem
• Cricket fan, Foodie
Spark
Unified, open source, parallel data processing framework for Big Data Analytics
• Spark Core Engine
• Schedulers: YARN, Mesos, Standalone Scheduler, Kubernetes
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
Typical Big Data Application
• Secure
• Catalog and Search
• Ingest & Store
• Prepare
• Analyze
• Visualize
• Roles: Data Engineer, Data Scientist, Application Developer
Evolution of Spark Analytics
On-Prem Install
• Acquire hardware
• Prepare machine
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics
What Does Kubernetes Bring In?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources and the persistence I want
• Kubernetes has many add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization
Typical Spark deployment
Storage Requirements
• Distributed File System
• Local Scratch Space
• Fast disk writes – DO NOT write to containers!!
• User Library
• Logs
• History Server Events
• Configs
• Secrets
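One item from the list above, History Server events, can be sketched as Spark properties. This is a minimal sketch, not the deployment used in the talk: the `spark.eventLog.*` and `spark.history.fs.logDirectory` property names are standard Spark settings, while the shared directory `file:///mnt/spark-events` and the helper function names are assumptions made for illustration. The point it shows is that applications and the History Server must agree on one storage location that outlives any single container.

```python
def application_event_log_conf(shared_dir):
    """Properties a Spark application sets so it writes event logs to shared_dir."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": shared_dir,
    }


def history_server_conf(shared_dir):
    """Property the History Server sets so it reads the same directory."""
    return {"spark.history.fs.logDirectory": shared_dir}


def to_submit_args(conf):
    """Render a property dict as spark-submit --conf arguments, sorted by key."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Hypothetical mount point backed by shared storage (for example an NFS volume).
    shared = "file:///mnt/spark-events"
    print(" ".join(to_submit_args(application_event_log_conf(shared))))
    print(history_server_conf(shared))
```

If the directory lived on the container's own filesystem, the events would disappear with the pod, which is why this requirement appears on the storage list at all.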
What can we leverage
• Distributed
• NFS
• PV to PVC (1-to-1 mapping with most cloud providers)
• Big NFS – multiple PVs – quota
• HDFS – no direct support; it can be configured to work, but without data locality
• DBFS – the Databricks File System, an S3-based distributed file system
• S3/Object Storage – performance concerns
• Portworx – under exploration
• GlusterFS
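The NFS and PV-to-PVC bullets above can be illustrated with a pair of Kubernetes manifests built as plain Python dictionaries and printed as JSON, which `kubectl apply -f` accepts. This is a sketch under stated assumptions: the field names follow the Kubernetes core v1 API, but the server address, export path, object names, namespace, and size are hypothetical placeholders, and the helper functions are not part of any library.

```python
import json


def nfs_persistent_volume(name, server, path, size):
    """Statically provisioned PersistentVolume backed by an NFS export."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            # ReadWriteMany lets driver and executor pods on different nodes share it.
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": server, "path": path},
        },
    }


def claim_for(pv, claim_name, namespace):
    """PersistentVolumeClaim pinned to one specific volume (the 1-to-1 mapping)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            "accessModes": list(pv["spec"]["accessModes"]),
            # An empty storage class asks for a pre-created volume, not dynamic provisioning.
            "storageClassName": "",
            "volumeName": pv["metadata"]["name"],
            "resources": {"requests": {"storage": pv["spec"]["capacity"]["storage"]}},
        },
    }


if __name__ == "__main__":
    # All values below are illustrative placeholders.
    pv = nfs_persistent_volume("spark-data-pv", "nfs.example.internal", "/exports/spark", "100Gi")
    pvc = claim_for(pv, "spark-data", "analytics")
    print(json.dumps(pv, indent=2))
    print(json.dumps(pvc, indent=2))
```

Because a claim binds to exactly one volume, carving one large NFS server into per-tenant storage means creating one PV per export, which is where the "multiple PVs" and quota concerns on the slide come from.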
What can we leverage
• Local temp dir scratch space
• emptyDir
• Clean delete? Need to return machines
• hostPath
• You manage deletion
• Logs
• emptyDir vs NFS
• Push to object store using fluentd (sidecar containers)
• Roll over
• Do not write to containers
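The scratch-space and log-shipping ideas above can be sketched as a single pod manifest, again built as a Python dictionary. This is an illustration of the pod layout only, not how the talk's platform creates executors: the field names follow the Kubernetes core v1 Pod API and `SPARK_LOCAL_DIRS` is the environment variable Spark reads for its local directories, while the image names, mount paths, volume names, and the helper function are hypothetical.

```python
import json


def executor_pod_with_scratch(name, spark_image, log_shipper_image, scratch_size="10Gi"):
    """Pod with an emptyDir for shuffle/spill data and a second emptyDir shared with a log sidecar."""
    volumes = [
        # emptyDir is removed together with the pod, so cleanup is automatic.
        {"name": "spark-scratch", "emptyDir": {"sizeLimit": scratch_size}},
        {"name": "spark-logs", "emptyDir": {}},
    ]
    spark_container = {
        "name": "spark",
        "image": spark_image,
        # Point Spark's local directories at the volume instead of the container filesystem.
        "env": [{"name": "SPARK_LOCAL_DIRS", "value": "/scratch"}],
        "volumeMounts": [
            {"name": "spark-scratch", "mountPath": "/scratch"},
            {"name": "spark-logs", "mountPath": "/var/log/spark"},
        ],
    }
    log_shipper = {
        "name": "log-shipper",
        "image": log_shipper_image,
        # The sidecar only reads the logs and forwards them, for example to an object store.
        "volumeMounts": [
            {"name": "spark-logs", "mountPath": "/var/log/spark", "readOnly": True},
        ],
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"volumes": volumes, "containers": [spark_container, log_shipper]},
    }


if __name__ == "__main__":
    # Image names are placeholders, not real published images.
    pod = executor_pod_with_scratch("spark-exec-1", "example/spark:latest", "example/log-shipper:latest")
    print(json.dumps(pod, indent=2))
```

Swapping the `emptyDir` entry for a `hostPath` volume would keep the data on the node after the pod exits, which is the "you manage deletion" trade-off noted on the slide.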
What are we looking for?
• Image as Volume
• https://github.com/kubernetes/kubernetes/issues/831
• Flex Volume Plugin
• CSI
• Encrypted PVC options – Portworx
• PV to PVC 1-to-many mapping with isolation
• ConfigMap: better support for updates
• Local
• Clean delete for HIPAA
• Distributed
• Clean delete for HIPAA
• PVC transfer across namespaces
References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler Design & Discussion
• Kubernetes Clusters on IBM Cloud
Thank you
Rachit Arora
rachitar@in.ibm.com
@rachit1arora

Spark volume requirements 2018


Editor's Notes

  • #4: Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core is the foundation of Spark and provides many of the libraries for scheduling and basic I/O. Spark offers more than 100 high-level operators that make it easy to build parallel apps. Spark also includes prebuilt machine-learning and graph-analysis algorithms that are written specifically to execute in parallel and in memory. It also supports interactive SQL queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R, and Scala. You can run Spark in its standalone cluster mode, in the cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
  • #5: Prepare: even though you have the right data, it may not be in the right format or structure for analysis. That is where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives (on premises, in the cloud, or on your desktop), where it can then be shaped, transformed, explored, and prepared for analysis. Data scientist: primarily responsible for building predictive analytic models and deriving insights. The data scientist analyzes data that has been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and builds applications using Jupyter Notebooks and RStudio. After the data scientist shares the analytical outputs, the application developer can build apps such as a cognitive chatbot. As the chatbot engages with customers, it continuously improves its knowledge and helps uncover new insights.
  • #6: What I was required to do as a data scientist. On-prem to virtualization: as demand for the service increased in my organization, I decided to move to virtualized VMs to handle many requests on demand, but that was still painful. Then I decided to try services offered on the cloud, such as EMR, IBM Analytics Engine, or Microsoft Insights, but there I need to order clusters, configure them to suit my workloads, and keep them running even when I do not want to use them. Cover what it takes to install a Hadoop/Spark cluster.
  • #13: IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework. You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio. The IBM Watson apps are fully integrated and use the same user interface and framework. You can pick whichever apps and tools you need for your organization. Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data. What is Analytics Engine? You can use Analytics Engine (AE) to build and deploy clusters within minutes with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.