Spark volume requirements 2018

Aug 26, 2018

This document discusses storage requirements for running Spark workloads on Kubernetes. It recommends using a distributed file system like HDFS or DBFS for distributed storage and emptyDir or NFS for local temp scratch space. Logs can be stored in emptyDir or pushed to object storage. Features that would improve Spark on Kubernetes include image volumes, flexible PV to PVC mappings, encrypted volumes, and clean deletion for compliance. The document provides an overview of Spark, Kubernetes benefits, and typical Spark deployments.

Storage requirements for running Spark workloads on Kubernetes
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs

About Me
• Advisory Software Engineer @ IBM India Software Labs
• General Purpose Developer
• Love Containers & Kubernetes
• Conference traveler
• Upcoming book on Hadoop and Its Ecosystem
• Cricket fan, Foodie

Spark
Unified, open source, parallel data processing framework for Big Data Analytics
• Spark Core Engine
• Schedulers: YARN, Mesos, Standalone Scheduler, Kubernetes
• Spark SQL – Interactive Queries
• Spark Streaming – Stream processing
• Spark MLlib – Machine Learning
• GraphX – Graph Computation

Typical Big Data Application
• Secure
• Catalog and Search
• Ingest & Store, Prepare, Analyze, Visualize
• Data Engineer, Data Scientist, Application Developer

Evolution of Spark Analytics
On Prem Install
• Acquire hardware
• Prepare machine
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics

What Does Kubernetes Bring In?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Many add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization

Storage Requirements
• Distributed File System
• Local Scratch Space
• Fast disk writes – DO NOT write to containers!
• User Library
• Logs
• History Server Events
• Configs
• Secrets
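Two of the requirements above, History Server events and secrets, map onto Spark configuration properties. The following is a minimal sketch rather than the deck's own setup: the property names are taken from the Spark configuration and Spark-on-Kubernetes documentation, while the event log path, secret name, and mount path passed in are hypothetical values chosen for illustration.

```python
def storage_conf(event_log_dir, secret_name, secret_mount_path):
    """Return Spark properties for history-server events and secrets.

    All argument values are illustrative assumptions; only the property
    names come from the Spark documentation.
    """
    return {
        # Write application event logs where the History Server can read them.
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
        # Mount an existing Kubernetes Secret into driver and executor pods.
        f"spark.kubernetes.driver.secrets.{secret_name}": secret_mount_path,
        f"spark.kubernetes.executor.secrets.{secret_name}": secret_mount_path,
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

The event log directory should sit on storage that outlives the pods (for example a mounted volume), since anything written to the container filesystem disappears with the container.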

What can we leverage
• Distributed
• NFS
• PV to PVC (1-to-1 mapping on most cloud providers)
• Big NFS – Multiple PVs – quota
• HDFS – No direct support; it can be configured to work, but without data locality
• DBFS – Databricks File System, an S3-based distributed file system
• S3/Object Storage – Performance concerns
• Portworx – under exploration
• GlusterFS
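The "Big NFS, multiple PVs" option above can be sketched as one PersistentVolume per tenant directory on a shared NFS export, each bound 1-to-1 to its own claim. This is an illustrative sketch only: the manifests are built as plain Python dicts using standard Kubernetes API fields, and the server name, export paths, tenant names, and sizes are hypothetical. Kubernetes uses the declared capacity of an NFS volume for matching claims and does not enforce it, so a real quota would have to be applied on the NFS side.

```python
import json


def nfs_volume_pair(name, server, path, size):
    """Build a PersistentVolume on an NFS export and a claim bound to it."""
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            # Keep the data when the claim is deleted; cleanup is manual.
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": server, "path": path},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{name}-claim"},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            # Empty storage class plus volumeName pins the claim to this PV.
            "storageClassName": "",
            "volumeName": name,
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


if __name__ == "__main__":
    # Hypothetical tenants sharing one large NFS export.
    for tenant in ["team-a", "team-b"]:
        for manifest in nfs_volume_pair(
            tenant, "nfs.example.internal", f"/exports/{tenant}", "50Gi"
        ):
            print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.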

What can we leverage
• Local temp dir scratch space
• emptyDir
• Clean delete? Need to return machines
• hostPath
• You manage deletion
• Logs
• emptyDir vs NFS
• Push to object store using fluentd (sidecar containers)
• Roll over
• Do not write to containers
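The scratch-space and logging options above can be combined in a single pod: one emptyDir for Spark's local scratch directory, a second emptyDir for logs, and a sidecar container that reads the logs and ships them to an object store. The sketch below builds such a pod manifest as a Python dict. It is an assumption-laden illustration, not the deck's deployment: the container names, image names, and mount paths are hypothetical, and the fluentd configuration itself is left out. `SPARK_LOCAL_DIRS` is the standard Spark environment variable for scratch directories.

```python
def spark_worker_pod(name, spark_image, log_shipper_image):
    """Build a pod with emptyDir scratch space and a log-shipping sidecar."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "volumes": [
                # Both volumes are removed when the pod goes away.
                {"name": "spark-scratch", "emptyDir": {}},
                {"name": "spark-logs", "emptyDir": {}},
            ],
            "containers": [
                {
                    "name": "spark",
                    "image": spark_image,
                    # Point shuffle and spill files at the volume,
                    # not at the container's writable layer.
                    "env": [{"name": "SPARK_LOCAL_DIRS", "value": "/scratch"}],
                    "volumeMounts": [
                        {"name": "spark-scratch", "mountPath": "/scratch"},
                        {"name": "spark-logs", "mountPath": "/var/log/spark"},
                    ],
                },
                {
                    # Sidecar (for example fluentd) that reads the shared
                    # log volume and pushes files to an object store.
                    "name": "log-shipper",
                    "image": log_shipper_image,
                    "volumeMounts": [
                        {
                            "name": "spark-logs",
                            "mountPath": "/var/log/spark",
                            "readOnly": True,
                        },
                    ],
                },
            ],
        },
    }
```

With hostPath in place of emptyDir the data would survive the pod, which is why the slide notes that deletion then becomes your own responsibility.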

What are we looking for?
• Image as Volume
• https://github.com/kubernetes/kubernetes/issues/831
• Flex Volume Plugin
• CSI
• Encrypted PVC options – Portworx
• PV to PVC 1-to-many mapping with isolation
• ConfigMap: better support for updates
• Local
• Clean delete for HIPAA
• Distributed
• Clean delete for HIPAA
• PVC transfer across namespaces


Thank you
Rachit Arora
rachitar@in.ibm.com
@rachit1arora

Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It allows for publishing and subscribing to streams of records known as topics in a fault-tolerant, scalable, and fast manner. Producers publish data to topics while consumers subscribe to topics and process the data streams. The Kafka cluster stores these topic partitions across servers and replicates the data for fault tolerance. It provides ordering and processing guarantees through offsets as it retains data for a configurable period of time.

Azure Cosmos DB Kafka Connectors | Abinav Rameesh, MicrosoftHostedbyConfluent

The document discusses Kafka connectors for Cosmos DB that allow for seamless integration between the two services without requiring complex application code. It provides an overview of Kafka Connect and connectors, use cases for integrating Cosmos DB and Kafka, and the architecture of source and sink connectors that can read from and write to Cosmos DB and Kafka. It also previews a demo of the connectors and suggests ways to take integration further.

Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...HostedbyConfluent

TerraformOtto Jongerius

Terraform is a tool used by Atlassian for building, changing, and versioning infrastructure safely and efficiently. It manages both popular cloud services and in-house solutions through its infrastructure-as-code approach. Atlassian uses Terraform for its build pipelines via a Python wrapper and fork of Terraform, taking advantage of its modular and extendable design as well as its large, active community for support.

Benchmarking Aerospike on the Google Cloud - NoSQL Speed with EaseLynn Langit

Apache Superset at AirbnbBill Liu

beSharp a serverless approach to big data on awsClaudio Pontili

Claudio Pontili, a senior cloud solution architect at beSharp, presented on using serverless architectures for big data on AWS. He discussed using Lambda for ETL processes and Glue for managed ETL jobs. He also covered CI/CD for deploying Lambda and Glue code, data warehousing on Aurora Serverless v1, and a fully serverless big data architecture. Some key learnings included using serverless for high availability and scalability with no effort, pausing Aurora Serverless v1 clusters when not in use, and using infrastructure as code to deploy architectures.

Crash Course in Cloud ComputingAll Things Open

Serverless RealityLynn Langit

This document discusses serverless computing and compares it to traditional server-based computing. It defines serverless computing and provides examples of serverless technologies like AWS Lambda. It also outlines common use cases for serverless computing like handling dynamic workloads and scheduled tasks. Finally, it compares different services between server-based and serverless models like compute, files, databases, data pipelines, machine learning, and IoT.

Mining public datasets using opensource tools: Zeppelin, Spark and Jujuseoul_engineer

Beyond RelationalLynn Langit

The document discusses building data pipelines in the cloud. It covers serverless data pipeline patterns using services like BigQuery, Cloud Storage, Cloud Dataflow, and Cloud Pub/Sub. It also compares Cloud Dataflow and Cloud Dataproc for ETL workflows. Key questions around ingestion and ETL are discussed, focusing on volume, variety, velocity and veracity of data. Cloud vendor offerings for streaming and ETL are also compared.

SQL Server on Google Cloud PlatformLynn Langit

SQL Server can run fast and well-priced on Google Cloud Platform infrastructure, with data centers opening locally in Australia in 2017. GCP services like Google Compute Engine offer on-demand virtual machines in various sizes running Linux, Windows, and more. A demo showed how to set up and use SQL Server 2016 with its new features on GCP, with step-by-step guides, best practices, and load testing tutorials available.

Azure Cosmos DB: Features, Practical Use and Optimization "GlobalLogic Ukraine

This presentation is dedicated to Azure Cosmos DB, it's history, characteristics, tasks and solutions. The presentation deals with performance optimization, practical experience of usage and an overview of the news about Cosmos DB from Microsoft Build 2017 conference (https://build.microsoft.com). This presentation by Andriy Gorda (Engineering Manager & Lead Software Engineer, Consultant, GlobalLogic Kharkiv) was delivered at GlobalLogic Kharkiv MS TechTalk on June 13, 2017.

Elastic Stack roadmap deep diveElasticsearch

- Elastic provides a search and analytics platform called the Elastic Stack that includes the Elastic Stack, Beats data shippers, and Kibana analytics and visualization tools. - The presentation discussed updates to Elastic's products including performance improvements to search, new features for distributed search across data centers, and enhanced security options for authentication and authorization. - Elastic aims to provide customizable and extensible solutions for users to ingest, store, search, analyze and visualize large volumes of data from various sources.

DBaaS at ScaleMike Faraponov

This document discusses Database as a Service (DBaaS). It begins by defining DBaaS and describing the tasks involved in managing databases, such as setup, configuration, monitoring, backups and upgrades. It then lists the benefits of DBaaS such as reduced time spent on management tasks and increased scalability. Requirements for DBaaS like security, availability and ease of use are outlined. Different DBaaS architectures including proxy-based and DNS-based models are presented, along with examples. The document concludes by comparing several DBaaS providers and listing useful ScaleChamp features and links.

Better Search and Business Analytics at Southern Glazer’s Wine & SpiritsElasticsearch

Matt Chung (Independent) - Serverless application with AWS Lambda Outlyer

The talk will focus on how we are utilizing AWS Lambda for certain applications and the advantages/disadvantages, and the challenges we discovered along the way. It would help those who are looking to reduce technical debt with the infrastructure and costs. Previously a Director of technical operations at fox networks (21st Century Fox/News Corporation) responsible for infrastructure and building deployment pipelines. Currently a Python programmer / DevOps engineer with roots in systems/networks administration. Focus is on infrastructure and application automation. Worked as an engineer for Cisco Systems with emphasis on video conferencing. Built microwave networks at Bel Air Internet. Find me on github and twitter @itsmemattchung Video: https://www.youtube.com/watch?v=BLcElBUhfrQ Join DevOps Exchange London here: http://www.meetup.com/DevOps-Exchange-London Follow DOXLON on twitter http://www.twitter.com/doxlon

Big Data Platform at PinterestQubole

This document discusses Pinterest's data architecture and use of Pinball for workflow management. Pinterest processes 3 petabytes of data daily from their 60 billion pins and 1 billion boards across a 2000 node Hadoop cluster. They use Kafka, Secor and Singer for ingesting event data. Pinball is used for workflow management to handle their scale of hundreds of workflows, thousands of jobs and 500+ jobs in some workflows. Pinball provides simple abstractions, extensibility, reliability, debuggability and horizontal scalability for workflow execution.

DevOps in real lifeDataArt

Александр Снеговой, DevOps Software Engineer в DataArt Kherson. Более шести лет в IT. Сертифицированный AWS Solutions Architect Associate. Докладчик на международных научных конференциях. Религиозный фанат Docker. Презентация: 1. Докеризация приложения. 2. Настройка CI/CD. 3. Развертывание инфраструктуры в AWS с помощью Terraform.

Introducing Kubestr - A New Way to Explore Your Kubernetes Storage OptionsLibbySchulze

Kubestr is a tool to help users identify, validate and evaluate the various storage options in their Kubernetes cluster. It can identify the different storage options present, validate that they are configured correctly, and evaluate the performance of storage using benchmarking tools like FIO to understand if the right storage is being used for their workloads and applications. The goal is to make it easy for users to debug, validate and benchmark their Kubernetes storage.

Ejecución del Elastic Stack en KubernetesElasticsearch

Apache Cassandra in the CloudInstaclustr

Instaclustr provides Cassandra as a service running in the cloud on AWS and Azure. It allows companies to focus on their applications instead of managing Cassandra infrastructure. Instaclustr's fully managed service handles deploying and operating Cassandra clusters in the cloud at global scale. An advertising company was able to improve the performance of their application serving targeted ads by moving their Cassandra cluster to Instaclustr's cloud service for flexibility and reduced management burden.

Wikipedia Cloud Search WebinarSearch Technologies

View this webinar presented by Search Technologies' Chief Architect Paul Nelson on cloud search and a Wikipedia use case. Webinar given in conjunction with Amazon Cloud Search. Search Technologies provides implementation and consulting services for Amazon CloudSearch. For further information, see http://www.searchtechnologies.com/amazon-cloudsearch-services.html http://www.searchtechnologies.com/

KEDA OverviewJeff Hollan

Kubernetes is a system for orchestrating containerized workloads and services across many nodes that provides tools for managing replication, scaling, and state. KEDA allows Kubernetes to automatically scale function apps in response to events from sources like message queues or serverless triggers by integrating with functions running as pods and scaling them based on metrics and triggers. KEDA is useful for running serverless functions on Kubernetes in environments like on-premises, at the edge, or alongside other Kubernetes workloads where full control over scaling is needed.

Cloudsolutionday 2016: Getting Started with Severless ArchitectureAWS Vietnam Community

The document is a presentation on serverless architectures given by Lê Thanh Sang, a senior developer at GO1. It begins with an introduction of the speaker and overview of GO1. The bulk of the presentation defines what serverless computing is, highlights the benefits, and provides examples of serverless products and architectures using various AWS services. It concludes with a demo of a serverless note taking application built on S3, API Gateway, Lambda, and DynamoDB and a Q&A section.

Azuresatpn19 - An Introduction To Azure Data FactoryRiccardo Perico

Building a unified data pipeline in Apache SparkDataWorks Summit

This document discusses Apache Spark, an open-source distributed data processing framework. It describes how Spark provides a unified platform for batch processing, streaming, SQL queries, machine learning and graph processing. The document demonstrates how in Spark these capabilities can be combined in a single application, without needing to move data between systems. It shows an example pipeline that performs SQL queries, machine learning clustering and streaming processing on Twitter data.

Ejecución del Elastic Stack en KubernetesElasticsearch

Meetup Kubernetes Rhein-Neckerinovex GmbH

In order to provide prompt results and efficiently deal with data-intensive workloads, Big Data applications execute their jobs on compute slots across large clusters. Also, for optimal performance, these applications should be as close as possible to the data they use. Data-aware scheduling is the way to achieve that optimization and can conveniently be set up using Kubernetes. We’ll present two different use cases: First, we’ll make use of how Big Data applications like Hadoop and Spark can use their native HDFS protocol for data-aware scheduling. Second, we’ll demonstrate an efficient way to write a data-aware scheduler for Kubernetes that satisfies not just your application’s requirements, but also keeps your admins happy. As a bonus, it’ll also allows us to run data-aware scheduling on applications other than Big Data. Event: Kubernetes Meetup Rhein-Neckar, 18.10.2017 Speaker: Johannes M. Scheuermann weiter Tech-Vorträge: https://www.inovex.de/de/content-pool/vortraege/ Tech-Artikel in unserem Blog: https://www.inovex.de/blog/

Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit

Building and deploying an analytic service on Cloud is a challenge. A bigger challenge is to maintain the service. In a world where users are gravitating towards a model where cluster instances are to be provisioned on the fly, in order for these to be used for analytics or other purposes, and then to have these cluster instances shut down when the jobs get done, the relevance of containers and container orchestration is more important than ever. Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share the experience of building such a service and deploying it on a Kubernetes cluster. In this talk, we will discuss all the requirements which an enterprise grade Hadoop/Spark cluster running on containers bring in for a container orchestrator. This talk will cover in details how Kubernetes orchestrator can be used to meet all our needs of resource management, scheduling, networking, and network isolation, volume management, etc. We will discuss how we have replaced our home grown container orchestrator with Kubernetes which used to manage the container lifecycle and manage resources in accordance to our requirements. We will also discuss the feature list as container orchestrator which is helping us deploy and patch 1000s of containers and also a list which we believe need improvement or can be enhanced in a container orchestrator. Speaker Rachit Arora, SSE, IBM

More Related Content

What's hot (20)

Serverless RealityLynn Langit

Mining public datasets using opensource tools: Zeppelin, Spark and Jujuseoul_engineer

Beyond RelationalLynn Langit

SQL Server on Google Cloud PlatformLynn Langit

Azure Cosmos DB: Features, Practical Use and Optimization "GlobalLogic Ukraine

Elastic Stack roadmap deep diveElasticsearch

DBaaS at ScaleMike Faraponov

Better Search and Business Analytics at Southern Glazer’s Wine & SpiritsElasticsearch

Matt Chung (Independent) - Serverless application with AWS Lambda Outlyer

Big Data Platform at PinterestQubole

DevOps in real lifeDataArt

Introducing Kubestr - A New Way to Explore Your Kubernetes Storage OptionsLibbySchulze

Ejecución del Elastic Stack en KubernetesElasticsearch

Apache Cassandra in the CloudInstaclustr

Wikipedia Cloud Search WebinarSearch Technologies

KEDA OverviewJeff Hollan

Cloudsolutionday 2016: Getting Started with Severless ArchitectureAWS Vietnam Community

Azuresatpn19 - An Introduction To Azure Data FactoryRiccardo Perico

Building a unified data pipeline in Apache SparkDataWorks Summit

Ejecución del Elastic Stack en KubernetesElasticsearch

Serverless RealityLynn Langit

Mining public datasets using opensource tools: Zeppelin, Spark and Jujuseoul_engineer

Beyond RelationalLynn Langit

SQL Server on Google Cloud PlatformLynn Langit

Azure Cosmos DB: Features, Practical Use and Optimization "GlobalLogic Ukraine

Elastic Stack roadmap deep diveElasticsearch

DBaaS at ScaleMike Faraponov

Better Search and Business Analytics at Southern Glazer’s Wine & SpiritsElasticsearch

Matt Chung (Independent) - Serverless application with AWS Lambda Outlyer

Big Data Platform at PinterestQubole

DevOps in real lifeDataArt

Introducing Kubestr - A New Way to Explore Your Kubernetes Storage OptionsLibbySchulze

Ejecución del Elastic Stack en KubernetesElasticsearch

Apache Cassandra in the CloudInstaclustr

Wikipedia Cloud Search WebinarSearch Technologies

KEDA OverviewJeff Hollan

Cloudsolutionday 2016: Getting Started with Severless ArchitectureAWS Vietnam Community

Azuresatpn19 - An Introduction To Azure Data FactoryRiccardo Perico

Building a unified data pipeline in Apache SparkDataWorks Summit

Ejecución del Elastic Stack en KubernetesElasticsearch

Similar to Spark volume requirements 2018 (20)

Meetup Kubernetes Rhein-Neckerinovex GmbH

Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit

Webinar - DreamObjects/Ceph Case StudyCeph Community

This document summarizes DreamObjects, an object storage platform powered by Ceph. It discusses the hardware used in storage and support nodes, including Intel and AMD processors, RAM, disks, and networking components. The document also provides details on Ceph configuration including replication, CRUSH mapping, OSD configuration, and application tuning. Monitoring tools discussed include Chef, pdsh, Sensu, collectd, graphite, logstash, Jenkins and future plans.

Netflix oss season 2 episode 1 - meetup Lightning talksRuslan Meshenberg

The lightning talks covered various Netflix OSS projects including S3mper, PigPen, STAASH, Dynomite, Aegisthus, Suro, Zeno, Lipstick on GCE, AnsWerS, and IBM. 41 projects were discussed and the need for a cohesive Netflix OSS platform was highlighted. Matt Bookman then gave a presentation on running Lipstick and Hadoop on Google Cloud Platform using Google Compute Engine and Cloud Storage. He demonstrated running Pig jobs on Compute Engine and discussed design considerations for cloud-based Hadoop deployments. Finally, Peter Sankauskas from @Answers4AWS discussed initial ideas around CloudFormation for Asgard and deploying various Netflix OSS

State of the Container EcosystemVinay Rao

This document discusses containerization and the Docker ecosystem. It begins by describing the challenges of managing different software stacks across multiple environments. It then introduces Docker as a solution that packages applications into standardized units called containers that are portable and can run anywhere. The rest of the document covers key aspects of the Docker ecosystem like orchestration tools like Kubernetes and Docker Swarm, networking solutions like Flannel and Weave, storage solutions, and security considerations. It aims to provide an overview of the container landscape and components.

What are clouds made fromJohn Garbutt

Clouds are made of on-demand, scalable computing resources that are accessed as a service via the internet. There are different cloud deployment models (public, private, hybrid) and service models (IaaS, PaaS, SaaS). Infrastructure as a service (IaaS) clouds provide fundamental computing resources like storage, networking and virtual machines, while platform as a service (PaaS) clouds provide additional services like databases, messaging queues and development tools. Choosing between IaaS and PaaS involves considering factors like lock-in to the cloud vendor, control over the infrastructure, and application requirements.

Lessons learned from running Spark on DockerDataWorks Summit

Today, most any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment. Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment. There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today and some that may be available tomorrow including:] • Mesos • Kubernetes • Docker Swarm Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers. This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment. Speaker Thomas Phelan, Chief Architect, Blue Data, Inc

Solr + Hadoop: Interactive Search for Hadoopgregchanan

This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.

Apache Cassandra training. Overview and BasicsOleg Magazov

This document provides an overview of Apache Cassandra, including: - Its history originating from Facebook's need to solve an inbox search problem. - Its key features like high availability, linear scalability, fault tolerance and tunable consistency. - Its architecture based on consistent hashing and a ring topology for data distribution. - Its data model using keyspaces, column families, rows, and columns differently than a relational database. - Examples of using the Cassandra CLI to create a schema, insert data, and perform queries.

Serverless sparkMamathaBusi

Move your on prem data to a lake in a Lake in CloudCAMMS

With the boom in data; the volume and its complexity, the trend is to move data to the cloud. Where and How do we do this? Azure gives you the answer. In this session, I will give you an introduction to Azure Data Lake and Azure Data Factory, and why they are good for the type of problem we are talking about. You will learn how large datasets can be stored on the cloud, and how you could transport your data to this store. The session will briefly cover Azure Data Lake as the modern warehouse for data on the cloud,

Trend Micro Big Data Platform and Apache BigtopEvans Ye

Fusion on Kubernetes - Alan Eugenio & Joe Streeky, LucidworksLucidworks

Fusion is Lucidworks' data platform that includes connectors, search, analytics, and machine learning capabilities. It can be deployed on Kubernetes to provide scalability and automation. Previously, Fusion was deployed as a single container using Docker Compose. Now, Helm charts are used to define Fusion services as independent containers/pods allowing for horizontal scaling. Roadmap includes optimizing connectors for distributed environments, improving Solr autoscaling, independently upgradeable services, and adding more AI/analytics capabilities. Operators will be developed for Solr and Fusion to provide lifecycle management and automated scaling of services on Kubernetes.

Intro Docker october 2013dotCloud

This document provides an introduction and overview of Docker. It discusses why Docker was created to address issues with managing applications across different environments, and how Docker uses lightweight containers to package and run applications. It also summarizes the growth and adoption of Docker in its first 7 months, and outlines some of its core features and the Docker ecosystem including integration with DevOps tools and public clouds.

Hadoop in the cloud – The what, why and how from the expertsDataWorks Summit

Kubernetes – An open platform for container orchestrationinovex GmbH

Achieving Infrastructure Portability with ChefMatt Ray

Deploying to the cloud has made it easy to run large numbers of servers, but users may become dissatisfied with their particular cloud platform for reasons such as price, support and performance. There are a number of vendor lock-ins to avoid, this talk discusses how to do so with the open source configuration management and infrastructure automation platform Chef. Chef makes it easy to deploy to nearly every public and private cloud platform as well as virtualized and physical servers. Chef may also be used to deploy cloud infrastructures such as OpenStack, Eucalyptus or CloudStack. By abstracting away the platform, infrastructure becomes portable and you are free to deploy wherever necessary.

Big Data in the Cloud - The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit

Hadoop has traditionally been an on-premises workload, with very few notable implementations on the cloud. With Organizations either having jumped on the cloud bandwagon or have started planning their expansion into the ecosystem, it is imperative for us to explore how Hadoop conforms to the cloud paradigm. With the coming off age of some very useful cloud paradigms and the nature of Big Data with high seasonality of workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To be able to implement effective solutions for Big Data in the cloud it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritties of implementing Big Data in the cloud and the various options therein. Big Data + Cloud is definitely a deadly combination.

Hadoop ppt1chariorienit

[Spark Summit 2017 NA] Apache Spark on KubernetesTimothy Chen

This document summarizes a presentation about running Apache Spark on Kubernetes. It discusses how Spark jobs can be scheduled and run on Kubernetes, including scheduling the driver and executor pods. Key points of the design include the Kubernetes scheduler backend for Spark and components like the file staging server. The roadmap outlines upcoming support for features like Spark Streaming and improvements to dynamic allocation.

Meetup Kubernetes Rhein-Neckerinovex GmbH

Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit

Webinar - DreamObjects/Ceph Case StudyCeph Community

Netflix oss season 2 episode 1 - meetup Lightning talksRuslan Meshenberg

State of the Container EcosystemVinay Rao

What are clouds made fromJohn Garbutt

Lessons learned from running Spark on DockerDataWorks Summit

Solr + Hadoop: Interactive Search for Hadoopgregchanan

Apache Cassandra training. Overview and BasicsOleg Magazov

Serverless sparkMamathaBusi

Move your on prem data to a lake in a Lake in CloudCAMMS

Trend Micro Big Data Platform and Apache BigtopEvans Ye

Fusion on Kubernetes - Alan Eugenio & Joe Streeky, LucidworksLucidworks

Intro Docker october 2013dotCloud

Hadoop in the cloud – The what, why and how from the expertsDataWorks Summit

Kubernetes – An open platform for container orchestrationinovex GmbH

Achieving Infrastructure Portability with ChefMatt Ray

Big Data in the Cloud - The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit

Hadoop ppt1chariorienit

[Spark Summit 2017 NA] Apache Spark on KubernetesTimothy Chen

1. Storage requirements for running Spark workloads on Kubernetes
Rachit Arora, rachitar@in.ibm.com
IBM, India Software Labs

2. About Me
• Advisory Software Engineer @ IBM India Software Labs
• General Purpose Developer
• Love Containers & Kubernetes
• Conference traveler
• Upcoming book on Hadoop and Its Ecosystem
• Cricket fan, Foodie

3. Spark
Unified, open source, parallel data processing framework for Big Data Analytics
• Spark Core Engine
• Runs on: YARN, Mesos, Standalone Scheduler, Kubernetes
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation

4. Typical Big Data Application
• Secure
• Catalog and Search
• Ingest & Store, Prepare, Analyze, Visualize
• Data Engineer, Data Scientist, Application Developer

5. Evolution of Spark Analytics
On-Prem Install
• Acquire hardware
• Prepare machine
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics

6. What Does Kubernetes Bring In?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Kubernetes – lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization

7. Typical Spark deployment

8. Storage Requirements
• Distributed File System
• Local Scratch Space
• Fast disk writes – DO NOT write to containers!!
• User Library
• Logs
• History Server Events
• Configs
• Secrets
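
One way to keep scratch writes off the container's writable layer is to give each executor a mounted volume for Spark's local directories. The sketch below is a minimal illustration, not the speaker's setup: it assumes a Spark release that understands the `spark.kubernetes.executor.volumes.*` properties and that treats volumes named `spark-local-dir-*` as local storage. The helper names `scratch_volume_conf` and `to_submit_args` and the `/scratch` mount path are hypothetical.

```python
def scratch_volume_conf(mount_path="/scratch", name="spark-local-dir-1"):
    """Build Spark properties that mount an emptyDir volume into each executor.

    Assumption: the Spark version in use supports the
    spark.kubernetes.executor.volumes.* properties and uses volumes
    whose name starts with "spark-local-dir-" as scratch space.
    """
    prefix = f"spark.kubernetes.executor.volumes.emptyDir.{name}"
    return {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Print the arguments one per line so they can be pasted into a command.
    for arg in to_submit_args(scratch_volume_conf()):
        print(arg)
```

The same pattern would apply to the driver by swapping `executor` for `driver` in the property prefix, again assuming the Spark version supports it.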

9. What can we leverage
• Distributed
• NFS
• PV to PVC (1-to-1 mapping in most cloud providers)
• Big NFS – multiple PVs – quota
• HDFS – no direct support; it can be configured to work, but without data locality
• DBFS – Databricks File System (DBFS), an S3-based distributed file system
• S3/Object Storage – performance concerns
• Portworx – under exploration
• GlusterFS
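
The 1-to-1 PV-to-PVC point can be illustrated with a toy model. The sketch below is not the Kubernetes binder; it is a simplified stand-in that assumes each claim takes the smallest free volume that satisfies it and that a bound volume is exclusive to its claim. The function name `bind_claims` and the volume and claim names are hypothetical.

```python
def bind_claims(volumes, claims):
    """Toy model of exclusive PV-to-PVC binding.

    volumes: {pv_name: capacity_gi}
    claims:  {pvc_name: requested_gi}
    Returns {pvc_name: pv_name or None}; None means the claim stays pending.
    """
    free = dict(volumes)
    bound = {}
    for claim, wanted in claims.items():
        # Candidate volumes are those still unbound and large enough.
        candidates = [(cap, pv) for pv, cap in free.items() if cap >= wanted]
        if not candidates:
            bound[claim] = None
            continue
        _, pv = min(candidates)
        bound[claim] = pv
        del free[pv]  # a bound volume cannot serve a second claim
    return bound


if __name__ == "__main__":
    # One large NFS-backed volume serves only one of the two claims.
    print(bind_claims({"nfs-big": 1000}, {"team-a": 100, "team-b": 100}))
    # Carving the same storage into several volumes lets both claims bind.
    print(bind_claims({"nfs-a": 100, "nfs-b": 100}, {"team-a": 100, "team-b": 100}))
```

Under this model a single large NFS export backs only one claim, which is why the slide pairs "Big NFS" with multiple PVs and quotas.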

10. What can we leverage
• Local temp dir scratch space
• emptyDir
• Clean delete? Need to return machines
• hostPath
• You manage delete
• Logs
• emptyDir vs NFS
• Push to object store using fluentd (sidecar containers)
• Roll over
• Do not write to containers
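
The emptyDir and sidecar bullets can be sketched as a pod manifest built in Python. This is an illustration under stated assumptions rather than the deployment from the talk: the function name `pod_with_log_sidecar`, the container names, the mount paths, and the image names passed in are all hypothetical, and the log shipper's own configuration (for example fluentd output settings) is omitted.

```python
import json


def pod_with_log_sidecar(name, spark_image, shipper_image):
    """Build a pod manifest with emptyDir scratch and a log-shipping sidecar.

    Both volumes are emptyDir, so their contents are removed when the
    pod is deleted from the node.
    """
    log_mount = {"name": "logs", "mountPath": "/var/log/spark"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "volumes": [
                {"name": "scratch", "emptyDir": {}},
                {"name": "logs", "emptyDir": {}},
            ],
            "containers": [
                {
                    "name": "spark",
                    "image": spark_image,
                    # Point Spark's local directories at the mounted volume.
                    "env": [{"name": "SPARK_LOCAL_DIRS", "value": "/scratch"}],
                    "volumeMounts": [
                        {"name": "scratch", "mountPath": "/scratch"},
                        log_mount,
                    ],
                },
                {
                    "name": "log-shipper",
                    "image": shipper_image,
                    # The sidecar only reads the logs it forwards.
                    "volumeMounts": [dict(log_mount, readOnly=True)],
                },
            ],
        },
    }


if __name__ == "__main__":
    manifest = pod_with_log_sidecar(
        "spark-exec-1", "example/spark:latest", "example/fluentd:latest"
    )
    print(json.dumps(manifest, indent=2))
```

Replacing `{"emptyDir": {}}` with a `hostPath` entry would keep the data on the node after the pod is gone, which is the "you manage delete" trade-off the slide notes.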

11. What are we looking for?
• Image as Volume
• https://github.com/kubernetes/kubernetes/issues/831
• Flex Volume Plugin
• CSI
• Encrypted PVC options – Portworx
• PV to PVC 1-to-many mapping with isolation
• ConfigMap: better support for updates
• Local
• Clean delete for HIPAA
• Distributed
• Clean delete for HIPAA
• PVC transfer across namespaces

12. References
• IBM Watson Studio https://datascience.ibm.com
• IBM Watson https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora, rachitar@in.ibm.com, @rachit1arora

13. Thank you
Rachit Arora, rachitar@in.ibm.com, @rachit1arora

Editor's Notes

#4: Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core is the foundation of Spark and provides the libraries for scheduling and basic I/O. Spark offers more than 100 high-level operators that make it easy to build parallel apps. Spark also includes prebuilt machine-learning and graph analysis algorithms that are written specifically to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R and Scala. You can run Spark using its standalone cluster mode, on Cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes. It can access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
#5: Prepare: Even though you have the right data, it may not be in the right format or structure for analysis. That is where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives (on premises, in the cloud, or on your desktop), where it can then be shaped, transformed, explored, and prepared for analysis. Data scientist: Primarily responsible for building predictive analytic models and insights. The data scientist analyzes data that has been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and builds applications using Jupyter Notebooks and RStudio. After the data scientist shares the analytical outputs, an application developer can build apps such as a cognitive chatbot. As the chatbot engages with customers, it continuously improves its knowledge and helps uncover new insights.
#6: What I was required to do as a data scientist. From on-prem to virtualization: as demand for the service increased in my organization, I decided to move to virtualized VMs to handle many requests on demand, but the pain only grew. Then I decided to try services offered on the cloud, such as EMR, IBM Analytics Engine, or Microsoft Insights, but there I need to order clusters, configure them to suit my workloads, and keep them running even when I do not want to use them. Cover what it takes to install a Hadoop/Spark cluster.
#13: IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework. You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio. The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization. Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data. What is Analytics Engine? You can use AE to build and deploy clusters within minutes with a simplified user experience, scalability, and reliability. You can custom configure the environment and scale on demand.
