Apache spark presentation

Jan 25, 20180 likes77 views

Talk on Apache Spark I gave at Hyderabad Software Architects meetup on 20-Jan-2018. Source code and commands are at http://www.mediafire.com/file/tzmzahftxnabs0g/HSA-Spark-20-Jan-2018.zip

Apache Spark
Presenter: MH
http://bit.ly/mahboob

Spark Overview
● General purpose cluster computing system
● High-level APIs in Java, Scala, Python and R
● Supports general execution graphs
● Spark SQL
● MLlib
● GraphX
● Spark Streaming

Deployment Modes
● Amazon EC2:
○ scripts that let you launch a cluster on EC2 in about 5 minutes
● Standalone Deploy Mode:
○ launch a standalone cluster quickly without a third-party cluster manager
● Mesos:
○ deploy a private cluster using Apache Mesos
● YARN:
○ deploy Spark on top of Hadoop NextGen (YARN)

FAQ
Hadoop vs Spark
File system based | Memory based
Map-Reduce Paradigm | Any distributed computing workload
Details:
https://docs.google.com/document/d/1hcv3JOc009AVer6bVFEeGnHrt_Jb0v3v3g2qGPgRy1Y

The Spark Driver
Source:
http://mapr.com

Very good resource
https://mapr.com/ebooks/spark/
Downloadable as pdf

K-Means Clustering
First, decide the number of clusters k.
Then:
1. Initialize the center of the clusters
2. Attribute the closest cluster to each
data point
3. Set the position of each cluster to the
mean of all data points belonging to
that cluster
4. Repeat steps 2-3 until convergence
Create k points for starting centroids (often randomly)
While any point has changed cluster assignment
for every point in the dataset:
for every centroid
calculate distance between centroid and point
assign point to the cluster with the lowest distance
for every cluster calculate mean of points in that cluster
assign the centroid to the mean

Example 1
Applying k-Means Algorithm using Python
Program: kMeans.py*
Data: x-y coordinates in file testSet.txt
* Source: Chapter 10, Machine Learning in Action

Example 2
Applying k-Means Algorithm using Java API to Spark
Program: Spark_KMeans.java
Data: Random numbers in marks.txt

Example 3
https://mapr.com/ebooks/spark/08-machine-learning-mllib-spark-use-case.html
Techniques applied for NLP
● Hashing
● Stemming
● TF-IDF
● Naive Bayes
● Random Forest Regression
● Gradient Boosted Trees Regression
● Code repository: https://github.com/joebluems/Mockingbird/issues
Is Harper Lee’s Go Set a Watchman a discarded rough
draft that was to become the universally beloved
classic To Kill AMockingbird, or was it a truly separate
work?

Who is using Spark?
Source:
https://medium.com/ai
rbnb-
engineering/data-
infrastructure-at-
airbnb-8adfb34f169c

Spark @AirBnB
Spark is also an important tool for Airbnb. The team actually built something called
Airstream, which is a computational framework that sits on top of Spark Streaming and
Spark SQL, allowing engineersand the data team to get quick insights. Ultimately, for an
organization that depends on predictions and machine learning, something like Spark -
alongsideother open sourcemachinelearning libraries- iscrucial in theAirbnb stack.
Source:
https://www.packtpub.com/books/content/what-software-stack-does-airbnb-use

Watch the recording of the speech done at Spark Summit Brussles 2016 here: https://www.youtube.com/watch?v=wyfTjd9z1sY Data Science with SparkML on DataBricks is a perfect platform for application of Ensemble Learning on massive a scale. This presentation describes Prediction-as-a-Service platform which can predict trends on 1 billion observed prices daily. In order to train ensemble model on a multivariate time series in thousands/millions dimensional space, one has to fragment the whole space into subspaces which exhibit a significant similarity. In order to achieve this, the vastly sparse space has to undergo dimensionality reduction into a parameters space which then is used to cluster the observations. The data in the resulting clusters is modeled in parallel using machine learning tools capable of coefficient estimation at the massive scale (SparkML and Scikit Learn). The estimated model coefficients are stored in a database to be used when executing predictions on demand via a web service. This approach enables training models fast enough to complete the task within a couple of hours, allowing daily or even real time updates of the coefficients. The above machine learning framework is used to predict the airfares used as support tool for the airline Revenue Management systems.

Harnessing Big Data with SparkAlpine Data

Low Latency Execution For Apache SparkJen Aman

This document describes Drizzle, a low latency execution engine for Apache Spark. It addresses the high overheads of Spark's centralized scheduling model by decoupling execution from scheduling through batch scheduling and pre-scheduling of shuffles. Microbenchmarks show Drizzle achieves milliseconds latency for iterative workloads compared to hundreds of milliseconds for Spark. End-to-end experiments show Drizzle improves latency for streaming and machine learning workloads like logistic regression. The authors are working on automatic batch tuning and an open source release of Drizzle.

Prototyping Data Intensive Apps: TrendingTopics.orgPeter Skomoroch

Elasticwulf Pycon TalkPeter Skomoroch

This document discusses using Python and Amazon EC2 for parallel programming and clustering. It introduces ElasticWulf, which provides Amazon Machine Images preconfigured for clustering. It also covers MPI (message passing interface) basics in Python, including broadcasting, scattering, gathering, and reducing data across nodes. A demo is given of launching an ElasticWulf cluster on EC2, configuring it for MPI, and running a simple parallel pi calculation example using mpi4py.

Spark tuning2016may11bidaAnya Bida

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed in the context of GPU enabled Kubernetes clusters.

Google colab installing ml librariesSaravanakumar viswanathan

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks

You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing. Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks. We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.

Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks

This document discusses computationally intensive machine learning at large scales. It compares the algorithmic and statistical perspectives of computer scientists and statisticians when analyzing big data. It describes three science applications that use linear algebra techniques like PCA, NMF and CX decompositions on large datasets. Experiments are presented comparing the performance of these techniques implemented in Spark and MPI on different HPC platforms. The results show Spark can be 4-26x slower than optimized MPI codes. Next steps proposed include developing Alchemist to interface Spark and MPI more efficiently and exploring communication-avoiding machine learning algorithms.

Giraph주영 송

Apache Giraph is an iterative graph processing system like Google Pregel, built for high scalability on Hadoop. It uses the bulk synchronous parallel (BSP) model where computation proceeds in supersteps with message passing between vertices in a graph. Giraph provides fault tolerance through checkpointing to storage and master/worker processing on Hadoop infrastructure. Developers define graph algorithms by overriding the compute method to process messages and update vertex values.

Enterprise Scale Topological Data Analysis Using SparkAlpine Data

This document discusses scaling topological data analysis (TDA) using the Mapper algorithm to analyze large datasets. It describes how the authors built the first open-source scalable implementation of Mapper called Betti Mapper using Spark. Betti Mapper uses locality-sensitive hashing to bin data points and compute topological summaries on prototype points to achieve an 8-11x performance improvement over a naive Spark implementation. The key aspects of Betti Mapper that enable scaling to enterprise datasets are locality-sensitive hashing for sampling and using prototype points to reduce the distance matrix computation.

Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit

HPC in Barcelona is centered around the MareNostrum supercomputer and BSC's 425-person team from 40 countries. MareNostrum allows simulation and analysis in fields like life sciences, earth sciences, and engineering. To meet new demands of big data analytics, BSC developed the Spark4MN module to run Spark workloads on MareNostrum. Benchmarking showed Spark4MN achieved good speed-up and scale-out. Further work profiles Spark using BSC tools and benchmarks workloads like image analysis on different hardware. BSC's vision is to advance understanding through technologies like cognitive computing and deep learning.

A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit

This document summarizes research on implementing deep learning models using Spark. It describes: 1) Implementing a multilayer perceptron (MLP) model for digit recognition in Spark using batch processing and matrix optimizations to improve efficiency. 2) Analyzing the tradeoffs of computation and communication in parallelizing the gradient calculation for batch training across multiple nodes to find the optimal number of workers. 3) Benchmark results showing Spark MLP achieves similar performance to Caffe on a single node and outperforms it by scaling nearly linearly when using multiple nodes.

Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013Nick Galbreath

Briefing on the Modern ML Stack with RDatabricks

We provide an update on developments in the intersection of the R and the broader machine learning ecosystems. These collections of packages enable R users to leverage the latest technologies for big data analytics and deep learning in their existing workflows, and also facilitate collaboration within multidisciplinary data science teams. Topics covered include – MLflow: managing the ML lifecycle with improved dependency management and more deployment targets – TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability – Spark: latest improvements and extensions, including text processing at scale with SparkNLP

Easily reduce runtimes with cythonMichal Mucha

The document discusses how Cython can be used to easily reduce runtimes by up to 3 orders of magnitude. It provides examples showing how applying Cython techniques like declaring types and using NumPy arrays reduced a Pandas apply function runtime from 175ms to 1ms and a convolution runtime from 3310ms to 13.6ms. The document encourages typing variables, matching Python and C types, using Jupyter's Cython magic, and keeping Cython optimization in mind during design to take advantage of these performance gains.

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks

DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is a Open Source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well written code and good model design. In this talk I will walk through some of those problems and will present some best practices to prevent them, coming from lessons learned when putting things in production.

High Performance Python on Apache SparkWes McKinney

This document contains the slides from a presentation given by Wes McKinney on high performance Python on Apache Spark. The presentation discusses why Python is an important and productive language, defines what is meant by "high performance Python", and explores techniques for building fast Python software such as embracing limitations of the Python interpreter and using native data structures and compiled extensions where needed. Specific examples are provided around control flow, reading CSV files, and the importance of efficient in-memory data structures.

Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks

Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into – the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code. – how to work with complex data such as images in Spark and Deep Learning Pipelines. – how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.

Data Stream Algorithms in Storm and RRadek Maciaszek

Streaming data presents new challenges for statistics and machine learning on extremely large data sets. Tools such as Apache Storm, a stream processing framework, can power range of data analytics but lack advanced statistical capabilities. These slides are from the Apache.con talk, which discussed developing streaming algorithms with the flexibility of both Storm and R, a statistical programming language. At the talk I dicsussed issues of why and how to use Storm and R to develop streaming algorithms; in particular I focused on: • Streaming algorithms • Online machine learning algorithms • Use cases showing how to process hundreds of millions of events a day in (near) real time See: https://apacheconna2015.sched.org/event/09f5a1cc372860b008bce09e15a034c4#.VUf7wxOUd5o

Re-Architecting Spark For Performance UnderstandabilityJen Aman

The document describes a new architecture called "monotasks" for Apache Spark that aims to make reasoning about Spark job performance easier. The monotasks architecture decomposes Spark tasks so that each task uses only one resource (e.g. CPU, disk, network). This avoids issues where Spark tasks bottleneck on different resources over time or experience resource contention. With monotasks, dedicated schedulers control resource contention and monotask timing data can be used to model ideal performance. Results show monotasks match Spark's performance and provide clearer insight into bottlenecks.

Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit

While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses on big and complex data in actionable timeframes, too often, the process of manually configuring the underlying Spark jobs (including the number and size of the executors) can be a significant and time consuming undertaking. Not only it does this configuration process typically rely heavily on repeated trial-and-error, it necessitates that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement, and develop algorithms that can be used to automatically tune Spark jobs with minimal user involvement, In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations being used in the flow, the cluster size, configuration and real-time utilization, to automatically determine the optimal Spark job configuration for peak performance.

Highly Available GraphiteMatthew Barlocker

Initially presented at OpenWest 2014 conference. Graphite and StatsD gather line series data and offer a robust set of APIs to access that data. While the tools are robust, the dashboards are straight from 1992 and alerting off the data is nonexistent. Nark, an opensource project, solves both of these problems. It provides easy to use dashboards and readily available alerts and notifications to users. It has been used in production at Lucid Software for almost a year. Related to Nark are the tools required to make Graphite highly available.

Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman

This document summarizes a presentation about using Elasticsearch and Lucene for text processing and machine learning pipelines in Apache Spark. Some key points: - Elasticsearch provides text analysis capabilities through Lucene and can be used to clean, tokenize, and vectorize text for machine learning tasks. - Elasticsearch integrates natively with Spark through Java/Scala APIs and allows indexing and querying data from Spark. - A typical machine learning pipeline for text classification in Spark involves tokenization, feature extraction (e.g. hashing), and a classifier like logistic regression. - The presentation proposes preparing text analysis specifications in Elasticsearch once and reusing them across multiple Spark pipelines to simplify the workflows and avoid data movement between systems

Brief introduction to Distributed Deep LearningAdam Gibson

Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf

Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way! In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.

Scaling Graphite At YelpPaul O'Connor

This document discusses metrics and the Graphite monitoring system. It describes the main components of Graphite including Carbon, which persists metrics to disk and supports replication and sharding. It also describes Whisper for data storage and aggregation, and the web interface for rendering graphs. The document provides an overview of how these components work together and tips for optimizing performance such as aggregating metrics before ingestion and controlling the metrics that can be sent. It also briefly mentions alternative time-series databases like InfluxDB and Cassandra that could be used in the future.

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan

As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit

With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.

More Related Content

What's hot (20)

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks

Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks

Giraph주영 송

Enterprise Scale Topological Data Analysis Using SparkAlpine Data

Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit

A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit

Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013Nick Galbreath

Briefing on the Modern ML Stack with RDatabricks

Easily reduce runtimes with cythonMichal Mucha

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks

High Performance Python on Apache SparkWes McKinney

Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks

Data Stream Algorithms in Storm and RRadek Maciaszek

Re-Architecting Spark For Performance UnderstandabilityJen Aman

Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit

Highly Available GraphiteMatthew Barlocker

Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman

Brief introduction to Distributed Deep LearningAdam Gibson

Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf

Scaling Graphite At YelpPaul O'Connor

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks

Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks

Giraph주영 송

Enterprise Scale Topological Data Analysis Using SparkAlpine Data

Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit

A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit

Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013Nick Galbreath

Briefing on the Modern ML Stack with RDatabricks

Easily reduce runtimes with cythonMichal Mucha

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks

High Performance Python on Apache SparkWes McKinney

Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks

Data Stream Algorithms in Storm and RRadek Maciaszek

Re-Architecting Spark For Performance UnderstandabilityJen Aman

Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit

Highly Available GraphiteMatthew Barlocker

Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman

Brief introduction to Distributed Deep LearningAdam Gibson

Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf

Scaling Graphite At YelpPaul O'Connor

Similar to Apache spark presentation (20)

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit

Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks

Profiling & Testing with SparkRoger Rafanell Mas

Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.

Apache spark - InstallationMartin Zapletal

This document provides an overview of installing and deploying Apache Spark, including: 1. Spark can be installed via prebuilt packages or by building from source. 2. Spark runs in local, standalone, YARN, or Mesos cluster modes and the SparkContext is used to connect to the cluster. 3. Jobs are deployed to the cluster using the spark-submit script which handles building jars and dependencies.

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks

As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

Spark on YARNAdarsh Pannu

Spark supports four cluster managers: Local, Standalone, YARN, and Mesos. YARN is highly recommended for production use. When running Spark on YARN, careful tuning of configuration settings like the number of executors, executor memory and cores, and dynamic allocation is important to optimize performance and resource utilization. Configuring queues also allows separating different applications by priority and resource needs.

Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Akhil Das

This document discusses running Spark Streaming jobs over an Apache Mesos high availability cluster to provide fully fault tolerant streaming workflows at scale. It describes how Spark Streaming chops live data streams into batches, Spark processes the batches using RDD operations, and the results are returned in batches. Fault tolerance is achieved through Mesos' high availability architecture, Spark and RDDs' ability to recover from node failures, and Spark Streaming's use of checkpointing and write ahead logs. The document also provides an example of a simple fault tolerant streaming pipeline running over Mesos and scaling the pipeline to process millions of events per second by choosing the appropriate cluster resources.

Spark application on ec2 clusterChao-Hsuan Shen

This document summarizes a project using Apache Spark on an AWS EC2 cluster to classify images using the Naive Bayes classifier algorithm. It first provides an overview of the key aspects of the project, including the dataset used (ImageNet), AWS architecture, Spark RDDs, and obstacles faced in setting up the cluster. It then goes into more detail on how to set up the EC2 cluster, use Spark for distributed processing, and compares the Mahout and Spark MLlib machine learning libraries.

Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss

Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks

A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.

Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks

With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.

Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju

BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro

BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro

Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks

"Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It contains three major projects: 1) barrier execution mode 2) optimized data exchange and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps. First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges to implement those integrations and future work. Second, we will outline on-going work for optimized data exchange. Its target scenario is distributed model inference. We will present how we do performance testing/profiling, where the bottlenecks are, and how to improve the overall throughput on Spark. If time allows, we might also give updates on accelerator-aware scheduling. "

Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks

Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models AI has always been on of the most exciting applications of big data and Apache Spark. Increasingly Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training. On the other side, increasingly DL/AI users want to handle large and complex data scenarios needed for their production pipelines. This talk introduces a new project that substantially improves the performance and fault-recovery of distributed deep learning and machine learning frameworks on Spark. We will introduce the major directions and provide progress updates, including 1) barrier execution mode for distributed DL training, 2) fast data exchange between Spark and DL frameworks, and 3) accelerator-awareness scheduling.

BigDL webinar - Deep Learning Library for SparkDESMOND YUEN

Apache spark-melbourne-april-2015-meetupNed Shawa

This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.

Distributed Deep Learning on Hadoop ClustersDataWorks Summit/Hadoop Summit

This document discusses distributed deep learning on Hadoop clusters using CaffeOnSpark. CaffeOnSpark is an open source project that allows deep learning models defined in Caffe to be trained and run on large datasets distributed across a Spark cluster. It provides a scalable architecture that can reduce training time by up to 19x compared to single node training. CaffeOnSpark provides APIs in Scala and Python and can be easily deployed on both public and private clouds. It has been used in production at Yahoo since 2015 to power applications like Flickr and Yahoo Weather.