The freedom to iterate quickly on distributed deep learning tasks is crucial for smaller companies trying to win competitive advantage and market share from the big tech giants. HorovodRunner brings this capability to relatively accessible Spark clusters.
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
"Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It contains three major projects: 1) barrier execution mode 2) optimized data exchange and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.
First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges in implementing those integrations and future work. Second, we will outline ongoing work on optimized data exchange. Its target scenario is distributed model inference. We will present how we do performance testing/profiling, where the bottlenecks are, and how to improve the overall throughput on Spark. If time allows, we might also give updates on accelerator-aware scheduling.
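For orientation, here is a minimal sketch (not the speakers' demo) of the barrier execution mode primitive available in PySpark 2.4+: every task in a barrier stage starts together, can synchronize at a barrier, and can discover its peers, which is what integrations such as Horovod build on. The function body and data are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark import BarrierTaskContext

spark = SparkSession.builder.appName("barrier-demo").getOrCreate()

def train_partition(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()  # wait until every task in the stage reaches this point
    addresses = [info.address for info in ctx.getTaskInfos()]
    # a real integration (e.g. Horovod) would bootstrap its ring all-reduce here
    yield (ctx.partitionId(), addresses)

# the cluster must offer at least as many task slots as partitions in a barrier stage
rdd = spark.sparkContext.parallelize(range(8), numSlices=4)
print(rdd.barrier().mapPartitions(train_partition).collect())
```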
"
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
Deep Reinforcement Learning (DRL) is a thriving area in the current AI battlefield. AlphaGo by DeepMind is a very successful application of DRL which has drawn the attention of the entire world. Besides playing games, DRL also has many practical uses in industry, e.g. autonomous driving, chatbots, financial investment, inventory management, and even recommendation systems. Although DRL applications have something in common with supervised Computer Vision or Natural Language Processing tasks, they are unique in many ways.
For example, they have to interact with (explore) the environment to obtain training samples during optimization, and the method for improving the model usually differs from common supervised applications. In this talk we will share our experience of building Deep Reinforcement Learning applications on BigDL/Spark. BigDL is a well-developed deep learning library on Spark which is handy for Big Data users, but it has mostly been used for supervised and unsupervised machine learning. We have made extensions particularly for DRL algorithms (e.g. DQN, PG, TRPO and PPO), implemented classical DRL algorithms, built applications with them, and done performance tuning. We are happy to share what we have learned during this process.
We hope our experience will help our audience learn how to build an RL application of their own for their production business.
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with TensorFlow on ROCm, from Horovod to HopsML to Databricks' Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
End-to-End Deep Learning with Horovod on Apache SparkDatabricks
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
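As a hedged illustration of that kind of integration (not the session's code), Horovod's horovod.spark module can launch a training function as Spark tasks on the same cluster that ran the ETL; the model, optimizer settings, and process count below are placeholders.

```python
import horovod.spark

def train():
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd
    hvd.init()
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001))
    model.compile(optimizer=opt, loss="mse")
    # each worker would read its shard of the preprocessed data and call model.fit here
    return hvd.rank()

# launches `train` as 4 Spark tasks and returns each task's return value
results = horovod.spark.run(train, num_proc=4)
print(results)
```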
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks
"Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place - from highly reliable and performant data pipelines to state-of-the-art Machine Learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready to use pre-packaged clusters with optimized Apache Spark and various ML frameworks coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor Databricks is also a user of UAP.
So, what have we learned by eating our own dogfood? Attend a “from the trenches report” from Suraj Acharya, Director Engineering responsible for Databricks’ in-house data engineering team how his team put Databricks technology to use, the lessons they have learned along the way and best practices for using Databricks for data engineering.
"
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleDatabricks
Hari Subramanian presented on Uber's journey to enable data agility and advanced analytics at scale. He discussed Uber's large and growing data platform that processes millions of daily trips and terabytes of data. He then described Uber's Data Science Workbench, which aims to democratize data science by providing self-service access to infrastructure, tools, and data to support various users from data scientists to business analysts. Finally, he presented a case study on COTA, a deep learning model for customer support ticketing that was developed and deployed using Uber's data platform and workflow.
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Databricks
"In addition to the many data engineering initiatives at Starbucks, we are also working on many interesting data science initatives. The business scenarios involved in our deep learning initatives include (but are not limited to) planogram analysis (layout of our stores for efficient partner and customer flow) to predicting product pairings (e.g. purchase a caramel machiato and perhaps you would like caramel brownie) via the product components using graph convolutional networks.
For this session, we will be focusing on how we can run distributed Keras (TensorFlow backend) training to perform image analytics. This will be combined with MLflow to showcase the data science lifecycle and how Databricks + MLflow simplifies it. "
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs together. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
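A minimal sketch of the RayDP usage pattern described above (the configuration values are illustrative): start a Spark session inside a Ray program, run preprocessing with it, then hand results to other Ray libraries without leaving Python.

```python
import ray
import raydp

ray.init()  # or ray.init(address="auto") to join an existing Ray cluster

spark = raydp.init_spark(
    app_name="raydp-demo",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
)

# ordinary Spark preprocessing; the result can be passed on to Ray-based training libraries
df = spark.range(0, 1000).withColumnRenamed("id", "feature")
print(df.count())

raydp.stop_spark()
ray.shutdown()
```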
Scaling Machine Learning To Billions Of ParametersJen Aman
This document summarizes scaling machine learning to billions of parameters using Spark and a parameter server architecture. It describes the requirements for supporting both batch and sequential optimization at web scale. It then outlines the Spark + Parameter server approach, leveraging Spark for distributed processing and the parameter server for synchronizing model updates. Examples of distributed L-BFGS and Word2Vec training are presented to illustrate batch and sequential optimization respectively.
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
This document discusses building deep learning pipelines on Apache Spark for ad optimization. It begins by discussing how data has become a new form of colonialism. It then explains why deep learning should be done on Apache Spark rather than just TensorFlow. The remainder of the document discusses machine learning pipelines on Apache Spark, how machine learning and deep learning can be used for ad optimization, and various approaches to deep learning on Apache Spark using tools like MMLSpark, Databricks, DL4J, BigDL, and SystemML.
Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.
There are also many libraries trying to scale pandas APIs, such as Vaex, Modin, and so on. Dask is one of them and very popular among pandas users, and also works on its own cluster similar to Koalas which is on top of Spark cluster. In this talk, we will introduce Koalas and its current status, and the comparison between Koalas and Dask, including benchmarking.
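A minimal sketch of the idea (column names made up for illustration): pandas-style code that Koalas executes as distributed Spark jobs under the hood.

```python
import databricks.koalas as ks

kdf = ks.DataFrame({
    "city": ["sf", "sf", "nyc", "nyc"],
    "sales": [10, 20, 30, 40],
})

# familiar pandas syntax, executed on Spark
summary = kdf.groupby("city")["sales"].sum().sort_index()
print(summary.to_pandas())
```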
Apache Spark is a unified analytics engine for large-scale, distributed data processing. And Spark MLlib (Machine Learning library) is a scalable Spark implementation of some common machine learning (ML) functionality, as well as associated tests and data generators.
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed in the context of GPU enabled Kubernetes clusters.
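As a hedged sketch of that baseline setup (not the presenters' notebook), the snippet below pairs a GPU-accelerated cuML model with MLflow tracking backed by a local SQLite store; the experiment name and parameters are assumptions, and a RAPIDS environment with a GPU is required.

```python
import mlflow
from cuml.ensemble import RandomForestClassifier
from cuml.datasets import make_classification

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # simple local SQLite backend
mlflow.set_experiment("rapids-rf-demo")         # hypothetical experiment name

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

with mlflow.start_run():
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X, y)                       # trains on the GPU via cuML
    acc = model.score(X, y)
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("train_accuracy", float(acc))
```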
Extending Machine Learning Algorithms with PySparkDatabricks
1. The document discusses using PySpark and Pandas UDFs to perform machine learning at scale for genomic data. It describes a genomics use case called GloWGR that uses this approach.
2. Three key problems are identified with existing tools: genomic data is growing too quickly; bioinformaticians are unfamiliar with Scala; and ML algorithms are difficult to write in Spark SQL. The solutions proposed are to use Spark, provide a Python client, and write algorithms in Python linked to Spark (see the sketch after this list).
3. GloWGR is presented as a novel whole genome regression and association study algorithm built with PySpark. It uses Pandas UDFs to parallelize the REGENIE method and perform tasks like dimensionality
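A hedged sketch of the general pattern, not GloWGR itself: a grouped pandas UDF lets per-block model fitting run as ordinary Python/NumPy code while Spark distributes the groups. The column names and the toy regression are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

schema = StructType([
    StructField("block", StringType()),
    StructField("coef", DoubleType()),
])

def fit_block(pdf: pd.DataFrame) -> pd.DataFrame:
    # stand-in for a per-block regression step such as those used in REGENIE-style methods
    coef = (pdf["x"] * pdf["y"]).sum() / (pdf["x"] ** 2).sum()
    return pd.DataFrame({"block": [pdf["block"].iloc[0]], "coef": [coef]})

df = spark.createDataFrame(
    [("b1", 1.0, 2.0), ("b1", 2.0, 4.1), ("b2", 1.0, 0.9)],
    ["block", "x", "y"],
)
df.groupBy("block").applyInPandas(fit_block, schema=schema).show()
```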
Deep Dive of ADBMS Migration to Apache Spark—Use Cases SharingDatabricks
eBay has been using an enterprise ADBMS for over a decade, and our team began migrating batch workloads from ADBMS to Spark in 2018. We gathered many experiences and lessons during the whole migration journey (85% automated + 15% manual migration), during which we exposed many unexpected issues and gaps between ADBMS and Spark SQL. We made a lot of decisions to fill those gaps in practice and contributed many fixes to Spark core in order to unblock ourselves. This should be an interesting and helpful session for many folks, especially data and software engineers planning and executing their own migration work. During this session we will share many very specific issues we encountered and how the team resolved or worked around them in the real migration process.
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
Deploying machine learning models from training to production requires companies to deal with the complexity of moving workloads through different pipelines and re-writing code from scratch.
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them, coming from lessons learned when putting things in production.
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Databricks
Data is the key ingredient to building high-quality, production AI applications. It comes in during the training phase, where more and higher-quality training data enables better models, as well as during the production phase, where understanding the model’s behavior in production and detecting changes in the predictions and input data are critical to maintaining a production application. However, so far most data management and machine learning tools have been largely separate.
In this presentation, we’ll talk about several efforts from Databricks, in Apache Spark, as well as other open source projects, to unify data and AI in order to make it significantly simpler to build production AI applications.
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
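For point (1), a minimal sketch (not the live demo) of fitting a Pipeline, saving it, and loading it back for serving on another Spark deployment; the stages, data, and path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-persistence").getOrCreate()

train = spark.createDataFrame(
    [("spark is great", 1.0), ("boring text", 0.0)], ["text", "label"]
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])

model = pipeline.fit(train)
model.save("/tmp/spam_pipeline")                 # persists every stage, including feature transforms
reloaded = PipelineModel.load("/tmp/spam_pipeline")
reloaded.transform(train).select("text", "prediction").show()
```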
High Performance Python on Apache SparkWes McKinney
This document contains the slides from a presentation given by Wes McKinney on high performance Python on Apache Spark. The presentation discusses why Python is an important and productive language, defines what is meant by "high performance Python", and explores techniques for building fast Python software such as embracing limitations of the Python interpreter and using native data structures and compiled extensions where needed. Specific examples are provided around control flow, reading CSV files, and the importance of efficient in-memory data structures.
How to use Apache TVM to optimize your ML modelsDatabricks
Apache TVM is an open source machine learning compiler that distills the largest, most powerful deep learning models into lightweight software that can run on the edge. This allows the output model to run inference much faster on a variety of target hardware (CPUs, GPUs, FPGAs & accelerators) and save significant costs.
In this deep dive, we’ll discuss how Apache TVM works, share the latest and upcoming features and run a live demo of how to optimize a custom machine learning model.
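A hedged sketch of the typical TVM flow (API details vary by release; the ONNX file, input name, and shapes are assumptions): import a trained model, compile it for a target, and run inference with the graph executor.

```python
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")            # hypothetical model file
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm"  # e.g. "cuda" for NVIDIA GPUs
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
print(module.get_output(0).numpy().shape)
```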
The document summarizes several open source big data analytics toolkits that support Hadoop, including RHadoop, Mahout, MADLib, HiveMall, H2O, and Spark-MLLib. It describes each toolkit's key features such as algorithms supported, performance, ease of use, and architecture. Spark-MLLib provides machine learning algorithms in a distributed in-memory framework for improved performance compared to disk-based approaches. MADLib and H2O also operate directly on data in memory for faster iterative modeling.
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
Apache Spark has the ‘speculative execution’ feature to handle slow tasks in a stage caused by environment issues such as a slow network or disk. If one task is running slowly in a stage, the Spark driver can launch a speculation task for it on a different host. Between the regular task and its speculation task, the Spark system will later take the result from the first successfully completed task and kill the slower one.
When we first enabled the speculation feature for all Spark applications by default on a large cluster of 10K+ nodes at LinkedIn, we observed that the default values set for Spark’s speculation configuration parameters did not work well for LinkedIn’s batch jobs. For example, the system launched too many fruitless speculation tasks (i.e. tasks that were killed later). Besides, the speculation tasks did not help shorten the shuffle stages. In order to reduce the number of fruitless speculation tasks, we tried to find out the root cause, enhanced Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, number of fruitful versus fruitless speculation tasks, and their corresponding cpu-memory resource consumption in terms of gigabytes-hours. We were able to reduce the average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24% in a heavily utilized multi-tenant environment on a large cluster. In this talk, we will share our experience on enabling the speculative execution to achieve good job elapsed time reduction at the same time keeping a minimal overhead.
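For reference, a minimal sketch of the configuration knobs involved (the values shown are illustrative, not LinkedIn's tuned settings): enabling speculation and tightening when a speculative copy may launch.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("speculation-demo")
    .config("spark.speculation", "true")
    .config("spark.speculation.interval", "1s")    # how often to check for slow tasks
    .config("spark.speculation.multiplier", "3")   # task must be this many times slower than the median
    .config("spark.speculation.quantile", "0.9")   # fraction of tasks that must finish before speculating
    .getOrCreate()
)
```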
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowDatabricks
This document summarizes Holden Karau's presentation on augmenting Spark ML pipelines with Kubeflow and TensorFlow. The presentation explored splitting a Spark ML pipeline into feature preparation in Spark and model training in TensorFlow, saving the Spark output in a TF-compatible format, and executing the components as part of a Kubeflow pipeline that uses the Spark operator. It noted challenges with Kubeflow's current stability but provided options for integrating Spark jobs using the operator or notebooks. The presentation concluded by discussing alternatives to this approach and some ending notes of caution.
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, new features in Spark 2.0 like unified APIs, and workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half hour segments.
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluated Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compared Clipper to the Tensorflow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which may lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how a GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Jason Dai
This document summarizes a CVPR 2020 tutorial on the Analytics Zoo platform for automated machine learning workflows for distributed big data using Apache Spark. The tutorial covers an overview of Analytics Zoo and the BigDL distributed deep learning framework. It demonstrates distributed training of deep learning models using TensorFlow and PyTorch on Spark, and features of Analytics Zoo like end-to-end pipelines, ML workflow for automation, and model deployment with cluster serving. Real-world use cases applying Analytics Zoo at companies like SK Telecom, Midea, and MasterCard are also presented.
Using GPUs to Handle Big Data with JavaTim Ellison
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
In this deck, Huihuo Zheng from Argonne National Laboratory presents: Data Parallel Deep Learning.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lsl
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
The BigDL framework scales deep learning for large data sets using Apache Spark. However there is significant scheduling overhead from Spark when running BigDL at large scale. In this talk we propose a new parameter manager implementation that along with coarse-grained scheduling can provide significant speedups for deep learning models like Inception, VGG etc. Aggregation functions like reduce or treeReduce that are used for parameter aggregation in Apache Spark (and the original MapReduce) are slow as the centralized scheduling and driver network bandwidth become a bottleneck especially in large clusters.
To reduce the overhead of parameter aggregation and allow for near-linear scaling, we introduce a new AllReduce operation, a part of the parameter manager in BigDL which is built directly on top of the BlockManager in Apache Spark. AllReduce in BigDL uses a peer-to-peer mechanism to synchronize and aggregate parameters. During parameter synchronization and aggregation, all nodes in the cluster play the same role and driver’s overhead is eliminated thus enabling near-linear scaling. To address the scheduling overhead we use Drizzle, a recently proposed scheduling framework for Apache Spark. Currently, Spark uses a BSP computation model, and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overheads and results in decreased throughput and increased latency.
Drizzle introduces group scheduling, where multiple iterations (or a group of iterations) are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortizes the costs of task serialization and launch. Finally we will present results from using the new AllReduce operation and Drizzle on a number of common deep learning models including VGG and Inception. Our benchmarks run on Amazon EC2 and Google DataProc will show the speedups and scalability of our implementation.
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Chris Fregly
http://pipeline.ai
Applying my Netflix experience to a real-world problem in the ML and AI world, I will demonstrate a full-featured, open-source, end-to-end TensorFlow Model Training and Deployment System using the latest advancements from Kubernetes, Istio, and TensorFlow.
In addition to training and hyper-parameter tuning, our model deployment pipeline will include continuous canary deployments of our TensorFlow Models into a live, hybrid-cloud production environment.
This is the holy grail of data science - rapid and safe experiments of ML / AI models directly in production.
Following the Successful Netflix Culture that I lived and breathed (https://www.slideshare.net/reed2001/culture-1798664/2-Netflix_CultureFreedom_Responsibility2), I give Data Scientists the Freedom and Responsibility to extend their ML / AI pipelines and experiments safely into production.
Offline, batch training and validation is for the slow and weak. Online, real-time training and validation on live production data is for the fast and strong.
Learn to be fast and strong by attending this talk.
http://pipeline.ai
Graduate Research Assistant at Multimedia Processing Laboratory, University of Texas at Arlington. MS in EE with focus on Embedded Systems & Image Processing
1) The document provides an introduction to GPGPU programming with CUDA, outlining goals of providing an overview and vision for using GPUs to improve applications.
2) Key aspects of GPU programming are discussed, including the large number of cores devoted to data processing, example applications that are well-suited to parallelization, and the CUDA tooling in Visual Studio.
3) A hands-on example of matrix multiplication is presented to demonstrate basic CUDA programming concepts like memory management between host and device, kernel invocation across a grid of blocks, and using thread IDs to parallelize work.
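A hedged Python rendering of those concepts using Numba's CUDA JIT rather than native CUDA C: host-to-device memory transfers, a kernel launched over a grid of blocks, and per-thread indices selecting which output element to compute.

```python
import numpy as np
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    i, j = cuda.grid(2)  # this thread's (row, col) in the output matrix
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

A = np.random.rand(256, 128).astype(np.float32)
B = np.random.rand(128, 64).astype(np.float32)

d_A, d_B = cuda.to_device(A), cuda.to_device(B)                  # host -> device copies
d_C = cuda.device_array((A.shape[0], B.shape[1]), dtype=np.float32)

threads_per_block = (16, 16)
blocks = (
    (A.shape[0] + threads_per_block[0] - 1) // threads_per_block[0],
    (B.shape[1] + threads_per_block[1] - 1) // threads_per_block[1],
)
matmul[blocks, threads_per_block](d_A, d_B, d_C)                 # kernel launch over the grid
C = d_C.copy_to_host()                                           # device -> host copy
```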
The document provides details about the candidate's experience and skills in big data technologies like Hadoop, Hive, Pig, Spark, Sqoop, Flume, and HBase. The candidate has over 1.5 years of experience learning and working with these technologies. He has installed and configured Hadoop clusters from different versions and used distributions from MapR. He has in-depth knowledge of Hadoop architecture and frameworks and has performed various tasks in a Hadoop environment including configuration of Hive, writing Pig scripts, using Sqoop and Flume, and writing Spark programs.
The document provides an overview of parallel development and Microsoft's investments in parallel computing technologies. It discusses the difficulty of writing parallel code and introduces some of Microsoft's tools and APIs to help developers write parallel and concurrent applications more easily, including the Task Parallel Library (TPL) and Parallel LINQ (PLINQ). It encourages developers to experiment with and provide feedback on these new parallel programming models and tools.
The document discusses deep learning applications design, development and deployment in IoT edge. It describes using a Power9 system to train artificial neural network models using the MNIST dataset. It also covers building inference engines for Android phones and deploying visual recognition models to IBM Watson Studio.
The document discusses different machine learning techniques including regression, classification, clustering, anomaly detection, and recommendation. It then provides examples of data and labels that could be used for training models with these techniques. It also discusses topics like updating model weights, learning rates, and derivatives or gradients of cost functions. Finally, it provides examples of using Azure machine learning services to train models with cloud resources and deploy them for consumption.
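A minimal sketch, independent of Azure ML, of the weight-update ideas mentioned: the gradient of a squared-error cost drives each update, scaled by the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
learning_rate = 0.1
for step in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error w.r.t. w
    w -= learning_rate * grad              # move against the gradient
print(w)  # should land close to true_w
```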
Scalable TensorFlow Deep Learning as a Service with Docker, OpenPOWER, and GPUsIndrajit Poddar
This document discusses scaling TensorFlow deep learning using Docker, OpenPOWER systems, and GPUs. It provides an overview of distributing TensorFlow in a cluster with parameter servers and worker nodes. Example Dockerfiles are shown for creating deep learning images. The discussion also covers infrastructure components like Docker, OpenStack, and Mesos for managing compute resources in a deep learning cluster as a service.
The document discusses machine learning and TensorFlow. It provides three ways to get started with machine learning of varying complexity: using cloud APIs, retraining existing models, or developing new models. It then discusses Google Cloud machine learning APIs for vision, natural language, speech and translation. It provides examples of using the vision API and mobile vision API. It introduces TensorFlow as an open source machine learning library for research and production with Python and C++ frontends.
Application Optimisation using OpenPOWER and Power 9 systemsGanesan Narayanasamy
This document discusses various ways to accelerate applications using GPUs and CUDA programming. It provides examples of using libraries, programming languages like OpenACC and CUDA, and tools like Nsight to add GPU acceleration. It also highlights many success stories and how applications from fields like HPC, deep learning, and computational chemistry have achieved speedups using these techniques. Resources and compilers are available to help developers get started with GPU programming.
This document provides an overview of cloud computing, including its evolution, key characteristics, how to develop cloud applications using frameworks like MapReduce and Hadoop, and who might need cloud computing services. It discusses how cloud computing provides on-demand access to computing resources and data from any device, and how developers' key technical concern is services and data accessible over the internet. It also gives examples of major cloud computing providers like Amazon Web Services, Microsoft Azure, and Google App Engine.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
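An illustrative PySpark validation in the spirit of the capability flagged above (not Zillow's platform code; the dataset path, columns, and thresholds are hypothetical): check a dataset against simple expectations and fail it before downstream use.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-check").getOrCreate()
df = spark.read.parquet("/data/listings")  # hypothetical dataset

total = df.count()
null_prices = df.filter(F.col("price").isNull()).count()
dupes = total - df.dropDuplicates(["listing_id"]).count()

checks = {
    "null_price_rate_below_1pct": null_prices / max(total, 1) < 0.01,
    "no_duplicate_listing_ids": dupes == 0,
}
failed = [name for name, ok in checks.items() if not ok]
if failed:
    # flag the dataset before downstream consumers pick it up
    raise ValueError(f"Data quality checks failed: {failed}")
```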
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training with the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
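To make the idea concrete, here is a minimal PySpark sketch of the stage level scheduling API described above (not code from the talk; the resource amounts, discovery script path, and function names such as train_partition are illustrative assumptions):

from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder

# Request GPU-heavy containers for the training stage only.
exec_reqs = ExecutorResourceRequests().cores(4).memory("16g") \
    .resource("gpu", 1, discoveryScript="/opt/spark/getGpus.sh")   # illustrative path
task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
rp = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

# The ETL stages run with the default profile; only this stage asks for GPUs.
train_rdd = preprocessed_df.rdd.withResources(rp)                  # preprocessed_df is assumed to exist
train_rdd.barrier().mapPartitions(train_partition).collect()       # train_partition is hypothetical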
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
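For reference, the converter workflow described above typically looks like the following hedged sketch (the cache path, batch size, and the DataFrame and model names are placeholders, not code from the talk):

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Cache location for the intermediate Parquet files the converter materializes.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")                  # placeholder path

converter = make_spark_converter(preprocessed_df)                   # preprocessed_df is a Spark DataFrame

with converter.make_tf_dataset(batch_size=64) as train_ds:
    # train_ds is a tf.data.Dataset of namedtuple batches; model is assumed to be a compiled Keras model
    model.fit(train_ds, steps_per_epoch=len(converter) // 64, epochs=5)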
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
Understanding key traits of Apache Spark on Kubernetes
Things to know when running Apache Spark on Kubernetes, such as autoscaling
Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis hashes as distributed counters (see the sketch after this list)
· Precautions for retries and speculative execution
· Pipelining to improve performance
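A hedged sketch of the distributed-counter niche (not from the talk itself): each executor increments a Redis hash field from inside foreachPartition, in place of a Spark accumulator. The host, key, and field names are assumptions.

import redis

def count_partition(rows):
    # one connection per partition; hypothetical endpoint
    r = redis.Redis(host="redis-host", port=6379)
    n = sum(1 for _ in rows)
    # HINCRBY is atomic, so concurrent executors can safely update the same counter;
    # retries and speculative execution can still double-count (see the precautions above)
    r.hincrby("job:counters", "rows_processed", n)

df.rdd.foreachPartition(count_partition)
print(redis.Redis(host="redis-host", port=6379).hget("job:counters", "rows_processed"))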
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open-source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner Enabled Apache Spark Clusters
1. Benchmark Tests and How-tos of Distributed Deep Learning on HorovodRunner
Jing Pan and Wendao Liu
2. 2020 Copyright eHealth Insurance
ABOUT US
Wendao Liu, Sr. Data Scientist at eHealth, Inc
§ Wears many hats: data science/machine learning, data pipelines, end-to-end data products
§ Currently pursuing a Doctorate in Business Administration
Jing Pan, PhD, Sr. Staff User Experience Researcher at eHealth, Inc
§ Architect of customer-facing machine learning models
§ Expert in application of deep learning models on Spark
§ Author of multiple patents and speaker at top AI conferences (KDD, AAAI)
3. 2020 Copyright eHealth Insurance
AGENDA
§ Horovod
§ HorovodRunner
§ HorovodRunner Benchmark
§ How to Use HorovodRunner
4. 2020 Copyright eHealth Insurance
WHY DISTRIBUTED DEEP LEARNING?
Rapidly Growing Data
§ ImageNet has 1.3M images (150 GB)
§ Amazon has 143 million product reviews (20 GB)
Increasing Model Complexity
§ AlexNet with batch size 128 requires 1.1 GB of memory (5 conv layers and 3 fully connected layers)
§ VGG-16 with batch size 128 requires 14 GB of memory; batch size 256 requires 28 GB
5. 2020 Copyright eHealth Insurance
MEET HOROVOD
§ Uber's open source distributed deep learning library
§ Easy to use
§ Slightly modify single-node DL code to make it distributed using Horovod
§ Great scaling efficiency
§ Supports four popular frameworks
§ TensorFlow, Keras, PyTorch, MXNet
§ Supports both data and model parallelism
Horovod GitHub
Courtesy of Uber
8. 2020 Copyright eHealth Insurance
HOROVOD – RING-ALLREDUCE
[Ring diagram: 16 processing units (0–15) connected in a ring]
Horovod Size: Number of processing units, e.g. 16
Horovod Rank: Ordinal rank of processing units, e.g. 0-15
Courtesy of Uber
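As a quick illustration of the size/rank terminology above, a minimal sketch (not from the deck):

import horovod.tensorflow.keras as hvd

def report_topology():
    hvd.init()                                # must run before size()/rank() are queried
    print("size:", hvd.size())                # total number of processes, e.g. 16
    print("rank:", hvd.rank())                # global ordinal, 0 .. size-1
    print("local_rank:", hvd.local_rank())    # ordinal within one machine, used later to pin a GPU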
9. 2020 Copyright eHealth Insurance
HOROVOD BENCHMARK
§ Great scaling efficiency, but requires dedicated engineering resources to set it up
• Container, MPI, and NCCL
§ Fine-tuning infra is not trivial
§ A previous in-house Horovod implementation gained no overall scaling benefit (Wu et al. '18)
Courtesy of Uber
10. 2020 Copyright eHealth Insurance
HOROVODRUNNER – DATABRICKS
HorovodRunner is a general API to run distributed deep learning workloads on Databricks using Uber's Horovod framework
§ Built on top of Horovod
§ No need to set up the underlying infrastructure
• Supports AWS and Azure
§ Runs on Databricks Spark
§ Data prep and training in one place
§ Takes care of random shuffling, fault tolerance, etc.
§ Barrier execution mode
Non-Endorsement Disclaimer
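The basic call pattern, which this deck also uses later in the hvd.timeline section, is roughly the following (np and the hyperparameter values are placeholders):

from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)                 # np = number of Horovod processes; a negative np typically runs locally on the driver
hr.run(train_hvd, learning_rate=0.1)     # train_hvd is your single-node training function adapted with Horovod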
11. 2020 Copyright eHealth Insurance
HOROVODRUNNER DIAGRAM
Courtesy of Databricks
§ A Spark driver and a number of executors that run Horovod
§ Barrier execution mode
§ Enables synchronous training
§ Starts all tasks together
§ Restarts all tasks in case of failure
12. 2020 Copyright eHealth Insurance
HOROVODRUNNER BENCHMARK – MNIST
Dataset: MNIST
Instance: C4.2xlarge
Instance Type: CPU
Model: Simple CNN (2 convolutional layers)
Epochs: 50
Network: 10 Gbps
Demonstrated scaling efficiency for a simple CNN running on CPU clusters.
13. 2020 Copyright eHealth Insurance
HOROVODRUNNER BENCHMARK
Achieved good scaling efficiency using HorovodRunner for both models:
Inception V3 (79.7%~48.9%) and VGG-16 (49.0%~18.5%)
14. 2020 Copyright eHealth Insurance
HOROVODRUNNER BENCHMARK OTHERS
§ GCN
§ Currently no scaling efficiency
§ The adjacency matrix is the input and cannot be partitioned
§ Stochastic GCN might be able to help
§ Multi-GPU instances
§ Horovod usually outperforms multithreading
22. 2020 Copyright eHealth Insurance
PIN GPUs
def train_hvd(learning_rate=0.1):
    from tensorflow.keras import backend as K
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # initialize Horovod before querying local_rank()
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # pin this process to a single GPU, identified by its local rank
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    K.set_session(tf.Session(config=config))
[Diagram: GPUs 0–15, one per Horovod process]
§ For ring-allreduce to function properly
§ Find all GPU device ids on the slaves
§ Assign an invariant ordinal rank to each GPU device id
23. 2020 Copyright eHealth Insurance
DATA PARALLELISM
def get_dataset(num_classes, rank=0, size=1):
    from tensorflow import keras
    (x_train, y_train), (x_test, y_test) = \
        keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
    # each process keeps only every size-th row, starting at its own rank
    x_train = x_train[rank::size]
    y_train = y_train[rank::size]

def train_hvd(batch_size=512,
              epochs=12, learning_rate=0.1):
    (x_train, y_train), (x_test, y_test) = \
        get_dataset(num_classes, hvd.rank(), hvd.size())

Conceptually, the data inside the train_hvd function = the data on one GPU.
[Diagram: the entire data set is split into chunks 0 … k, one chunk per rank (slave GPU)]
Graphics for conceptual illustration purpose only, not for backend implementation
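The rank::size slicing above gives each worker a disjoint stripe of rows. A tiny self-contained illustration (not from the deck), with 10 rows and 4 workers:

import numpy as np

rows = np.arange(10)            # stand-in for the training rows
size = 4                        # what hvd.size() would return with 4 workers
for rank in range(size):        # each worker sees only its own rank
    print(rank, rows[rank::size])
# 0 [0 4 8]
# 1 [1 5 9]
# 2 [2 6]
# 3 [3 7]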
24. 2020 Copyright eHealth Insurance
GET DATA – INDEXED SOLUTION
def get_dataset(num_classes, rank=0, size=1):
    from tensorflow import keras
    (x_train, y_train), (x_test, y_test) = \
        keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
    x_train = x_train[rank::size]
    y_train = y_train[rank::size]

def train_hvd(batch_size=512,  # on 1 GPU
              epochs=12, learning_rate=0.1):
    (x_train, y_train), (x_test, y_test) = \
        get_dataset(num_classes, hvd.rank(), hvd.size())
Graphics for conceptual illustration purpose only, not for backend implementation
GPU (rank) | Slice           | Row_ind
0          | rank_0 + size*0 | 0
1          | rank_1 + size*0 | 1
...        | ...             | ...
k          | rank_k + size*0 | ...
0          | rank_0 + size*1 | k+1
1          | rank_1 + size*1 | k+2
...        | ...             | ...
k          | rank_k + size*1 | k+size
...        | ...             | ...
k          | rank_k + size*n | i

N is how many rows end up on each GPU: N = I // hvd.size(), where I is the total number of rows.
25. 2020 Copyright eHealth Insurance
GET DATA – INDEX SOLUTION
PROBLEM?
§ At each step, in each GPU, are the rows the same?
§ Yes
§ No shuffle, no representativeness.
§ Solution for parquet files on S3
§ Petastorm, which shuffles by default
https://github.com/uber/petastorm
§ Image files?
[Same GPU / slice / row-index table as on the previous slide]
Graphics for conceptual illustration purpose only, not for backend implementation
26. 2020 Copyright eHealth Insurance
GET DATA – GENERATOR SOLUTION
Generator-based solution will shuffle by default at each epoch.
train_generator, validation_generator = get_dataset()  # shuffle set to True
step_size_train = train_generator.n // train_generator.batch_size
step_size_validation = validation_generator.n // validation_generator.batch_size
history = model.fit_generator(
    generator=train_generator,
    steps_per_epoch=step_size_train // hvd.size(),
    validation_data=validation_generator,
    validation_steps=step_size_validation // hvd.size(),
    epochs=epochs,
    callbacks=callbacks,
    verbose=2
)
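The deck does not show get_dataset() for the generator case; one plausible shape for image folders is sketched below (the directory layout, image size, and validation split are assumptions, not the authors' code):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

def get_dataset(img_dir="/dbfs/path/to/images", batch_size=64):      # hypothetical path
    gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
    train_generator = gen.flow_from_directory(
        img_dir, target_size=(224, 224), batch_size=batch_size,
        subset="training", shuffle=True)                              # reshuffles every epoch
    validation_generator = gen.flow_from_directory(
        img_dir, target_size=(224, 224), batch_size=batch_size,
        subset="validation", shuffle=True)
    return train_generator, validation_generator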
27. 2020 Copyright eHealth Insurance
GET DATA – GENERATOR SOLUTION
§ Entire data set
step_size_train = train_generator.n//train_generator.batch_size
§ Inside each GPU
steps_per_epoch = step_size_train // hvd.size()
GPU rank | Step in a GPU | Entire step | Batch      | img_ind (n total)
0        | 0             | 0           | Batch_size | 346, …, 29
0        | 1             | 1           | Batch_size | 420, …, 1032
0        | 2             | 2           | Batch_size | 75, …, 89
0        | 3             | 3           | Batch_size | ...
1        | 0             | 4           | Batch_size | ...
1        | 1             | 5           | Batch_size | ...
1        | 2             | ...         | Batch_size | ...
1        | 3             | ...         | Batch_size | ...
...      | ...           | ...         | Batch_size | ...
k        | 0             | ...         | Batch_size | ...
k        | 1             | ...         | Batch_size | ...
k        | 2             | ...         | Batch_size | ...
k        | 3             | m           | Batch_size | ...
§ Ensures no repetition of images within an epoch
§ How?
§ Why?
Images are shuffled
28. 2020 Copyright eHealth Insurance
DISTRIBUTED MODEL RETRIEVAL
§ Why
Every GPU will load the model structure at the beginning of training; loading from GitHub can hit "too many requests" errors
§ How
Save the model to S3 or DBFS:

example_model = get_model()
example_model.save("path_on_master/vgg_model.h5")
shutil.copy("path_on_master/vgg_model.h5",
            "dbfs_or_s3_path/vgg_model.h5")

§ Then, in train_hvd, replace
model = get_model()
with
model = keras.models.load_model("dbfs_or_s3_path_to/vgg_model.h5")
29. 2020 Copyright eHealth Insurance
WRAP THE OPTIMIZER
# single-machine optimizer
optimizer = keras.optimizers.Adadelta(lr=learning_rate * hvd.size())

# wrap with the Horovod distributed optimizer
optimizer = hvd.DistributedOptimizer(optimizer)

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Paper by Facebook: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
§ Preserve the same number of epochs in HorovodRunner as on a single machine so the model converges and accuracy is preserved
§ By linearly scaling the learning rate with the batch size
§ Synchronous HorovodRunner batch size = batch_size * hvd_size
§ LR_n = LR_1 * N
§ HorovodRunner's steps_per_epoch is inversely proportional to the number of GPUs
§ Same epochs * fewer steps per epoch = faster training time
§ Same epochs ~ comparable accuracy
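A quick worked instance of the linear scaling rule (the numbers are illustrative only, not from the benchmark):

base_lr = 0.1                       # learning rate tuned on a single GPU
hvd_size = 16                       # number of GPUs / Horovod processes
scaled_lr = base_lr * hvd_size      # LR_n = LR_1 * N  ->  1.6
steps_per_epoch = 1000 // hvd_size  # 16x fewer steps per worker, same number of epochs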
30. 2020 Copyright eHealth Insurance
RECTIFIED ADAM OPTIMIZER
§ Why
§ Fast convergence
§ Accurate initial direction finding to avoid bad local optima
§ Setting
§ Cluster: install keras-rectified-adam
§ Notebook: set %env TF_KERAS=1
§ RAdam optimizer setting

optimizer = RAdam(total_steps=5000, warmup_proportion=0.1,
                  learning_rate=learning_rate * hvd.size(), min_lr=1e-5)
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)]

On the Variance of the Adaptive Learning Rate and Beyond (Liu et al., 2020)
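The RAdam snippet above assumes the keras-rectified-adam package; installation and import typically look like this (a hedged sketch, so check the package README for your versions):

# on the cluster / in the notebook:
#   %pip install keras-rectified-adam
#   %env TF_KERAS=1                  # make the package use tf.keras

from keras_radam import RAdam        # import assumed from the keras-rectified-adam package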
31. 2020 Copyright eHealth Insurance
SYNCHRONIZE & CHECKPOINT
Checkpoint model parameters from GPU 0:

if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint(
        checkpoint_dir + '/checkpoint-{epoch}.ckpt',
        save_weights_only=True))

Synchronize parameters from GPU 0:

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

[Diagram: GPUs 0–15; GPU 0 broadcasts the updated parameters to the others]
Graphics for conceptual illustration purpose only, not for backend implementation
At the end of a synchronous step:
§ GPU 0 gets the averaged gradient from ring-allreduce
§ and sends the updated parameters to the rest of the GPUs (broadcast)
§ The weights from each step are saved from GPU 0
32. 2020 Copyright eHealth Insurance
AVOID HVD.TIMELINE
§ Why: hvd.timeline = no scaling efficiency
§ How: add timestamp to standard output
Redirect HorovodRunner output to a log:

reset_stdout()
redirect_stdout(output_dir + filename)

hr = HorovodRunner(np=np_setup)
hr.run(train_hvd, learning_rate=learning_rate)

# The checkpointed model is on the master.
# If you want to keep your model after the cluster goes down:
save_model_to_s3()
move_log_to_s3()

import logging

def redirect_stdout(log_filename):
    class StreamToLogger:
        …
    stdout_logger = logging.getLogger('STDOUT')
    sl = StreamToLogger(stdout_logger, logging.INFO)
    sys.stdout = sl
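The StreamToLogger class is elided on the slide; a minimal sketch of what such a class could look like is below (an assumption, not the authors' implementation), using logging's formatter to prepend the timestamp:

import logging
import sys

class StreamToLogger(object):
    # File-like object that forwards stdout writes to a logger, which adds timestamps.
    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, message):
        for line in message.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())

    def flush(self):
        pass

def redirect_stdout(log_filename):
    logging.basicConfig(filename=log_filename, level=logging.INFO,
                        format="%(asctime)s %(message)s")   # the added timestamp
    stdout_logger = logging.getLogger("STDOUT")
    sys.stdout = StreamToLogger(stdout_logger, logging.INFO)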
33. 2020 Copyright eHealth Insurance
EXAMPLE TIMESTAMP ADDED OUTPUT
[Example output with the added timestamp; each line shows hvd.rank, current step, total steps per epoch, current epoch, and total epochs]
34. 2020 Copyright eHealth Insurance
SUMMARY
HorovodRunner is great for distributed deep learning
§ Unlike Horovod, it does not require engineering resources to set up infrastructure
§ Simplicity of coding inherited from Horovod
§ Scaling efficiency is good; has room for improvement
§ Choose instances with better network bandwidth
§ Change AWS S3 to EC2 instance store
§ Works best if the data can be divided
§ Horovod Timeline adversely impacts performance
§ Security
§ Since Open MPI does not use encrypted communication and can launch new processes, it's recommended to use network-level security to isolate Horovod jobs from potential attackers
35. 2020 Copyright eHealth Insurance
LINK TO CODE AND PAPER
§ Code:
https://github.com/psychologyphd/horovodRunnerBenchMark_IPython
§ Paper (AAAI 2020 Workshop 8 accepted poster):
http://arxiv.org/abs/2005.05510
or
https://deep-learning-graphs.bitbucket.io/dlg-aaai20/accepted_papers/DLGMA_2020_paper_23.pdf
36. 2020 Copyright eHealth Insurance
FEEDBACK
Thank you!
Your feedback is important to us.
Don’t forget to rate and review the sessions.
38. 2020 Copyright eHealth Insurance
APPENDIX
Some things we found that may be useful to share:
§ When training some NLP models where we need to determine certain constraints like the vocab, create the vocab first and read it on each worker. During training, each worker still processes only a subset of the data independently.
§ Shuffle
§ The default is random shuffling, but it only works on parquet data. https://github.com/uber/petastorm
Save the dataframe to parquet and use Petastorm for data ingestion.
§ Horovod supports N-gram readouts, assuming it might be able to shuffle the data by order.
§ The Keras data_generator also shuffles randomly by default
§ Rectified Adam
https://www.zhihu.com/question/340834465
§ Real-time serving:
§ Same model as the single-machine trained model. Kubernetes + Docker, or SageMaker. Check the other sessions today.
§ Ring-allreduce is bandwidth-optimized
§ https://databricks.com/blog/2019/08/15/how-not-to-scale-deep-learning-in-6-easy-steps.html
39. Some tips/takeaways:
1. You can use HorovodRunner out of the box and it works great
2. Do not use Horovod Timeline
3. Use an init script and disable HIPAA mode to run HorovodRunner
4. Not all optimizers are well supported; some learning rates require special settings
5. Make sure everything, including import statements, is wrapped inside the function so it can be serialized
6. Don't use many GPU instances blindly; there is network cost. Instead, run a few smaller samples and check GPU memory usage first
7. You will still gain performance from a single machine with multiple GPUs
(See the appendix for the rest of the tips.)
40. Wendao, see here.
https://stackoverflow.com/questions/44788946/shuffling-training-data-with-lstm-rnn
Stateful LSTM is a special case. Brandon, correct me if I am wrong: I don’t think HorovodRunner can handle the shuffle of a stateful LSTM.
--
Jing Pan, Ph.D
Quantitative User Researcher
eHealth
From: Wendao Liu <[email protected]>
Date: Tuesday, June 2, 2020 at 4:04 PM
To: Brandon Williams <[email protected]>
Cc: Ryan O'Rourke <[email protected]>, Jing Pan <[email protected]>
Subject: Re: [EXTERNAL] Re: Question regarding HorovodRunner architecture
Thanks, Brandon, for the quick reply!
The first question makes total sense; the entire process will fail.
For the second question, yes, I mean the time only; the data is organized in chronological order. Sorry my question wasn’t really clear, so I am adding more context here:
Let’s say we have 5 years of historical Amazon stock data, our goal is to predict the future Amazon stock price, the data is organized in chronological order, and each row is at the day level. In this case, if we train a model such as an LSTM, we want to preserve the time order, and direct random shuffling probably won’t work as it breaks the sequence of the stock prices. Do you have any suggestions on how to train such a model on Horovod? Especially on how to shuffle the data in a meaningful way. I hope this helps to clarify the problem.
Thanks a lot!
From: Brandon Williams <[email protected]>
Date: Tuesday, June 2, 2020 at 2:47 PM
To: Wendao Liu <[email protected]>
Cc: Ryan O'Rourke <[email protected]>, Jing Pan <[email protected]>
Subject: [EXTERNAL] Re: Question regarding HorovodRunner architecture
Hi Wendao,
41. Hi Wendao,
+1 Jing Pan as well.
Regarding recommendations on shuffling in a meaningful way given your case, one approach is to pre-transform the data into (overlapping) arrays of contiguous time steps. Then each row is a chunk of time and can be read fairly independently, so shuffling would be fine. That may cost a fair amount of storage, but it is worth a try.
Also, Petastorm looks like it shuffles by row group, so that should be fine since the data is ordered chronologically, as each row group should be contiguous in time. Following that, you should be able to generate the overlapping windows of data on the fly from that batch, as normal. Our ML team believes this is also a good approach to test out, albeit not a trivial task.
42. Logging: get the log from the master.
If you want to retrieve logs from the slaves, use Databricks MLflow.