Scaling out TensorFlow-as-a-Service on Spark and commodity GPUs, covering AllReduce, Horovod, and how commodity GPU servers, such as the DeepLearning11, are likely to gain adoption.
Scaling TensorFlow with Hops, Global AI Conference Santa Clara - Jim Dowling
The document discusses scaling TensorFlow to hundreds of GPUs using Spark and Hops Hadoop. It describes how Hops enables running TensorFlow jobs on Spark clusters and supports distributed deep learning using algorithms like Ring AllReduce. Hops provides a single platform for machine learning and big data workloads by running TensorFlow, Spark and other frameworks on a HopsFS storage system and YARN cluster manager.
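Ring AllReduce averages gradients directly between workers, with no central parameter server. As a rough sketch of the pattern (this is generic Horovod-style Keras code, not the Hops API; the model and data are illustrative):

```python
# Minimal data-parallel training with ring AllReduce via Horovod (a sketch).
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU; the ranks form the reduction ring

# Pin this process to one local GPU (no-op on CPU-only machines).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

x = np.random.rand(1024, 20).astype("float32")  # synthetic training data
y = np.random.randint(0, 10, size=1024)

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
# Wrap the optimizer so gradients are averaged across workers with ring
# AllReduce; the learning rate is scaled by the number of workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

# Broadcast initial weights from rank 0 so all workers start identically.
model.fit(x, y, batch_size=64, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])
```

Launched with one process per GPU (for example via horovodrun or, on Hops, one Spark executor per GPU), each worker trains on its own data shard while AllReduce keeps the replicas in sync.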
At StampedeCon 2014, John Tran of NVIDIA presented "GPUs in Big Data." Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.
GPUIterator: Bridging the Gap between Chapel and GPU Platforms - Akihiro Hayashi
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which can lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how a GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
The document summarizes a presentation given by Chris Fregly on end-to-end real-time analytics using Apache Spark. It discusses topics like Spark streaming, machine learning, and tuning Spark for performance, and includes live demos of sorting, matrix multiplication, and thread synchronization optimized for CPU cache. The presentation emphasizes techniques like cache-friendly data layouts, prefetching, and lock-free algorithms to improve Spark performance.
1. The document discusses GPUs and their advantages for machine learning tasks like deep learning and parallel computing. GPUs have many parallel processors that can accelerate matrix multiplications and other computations used in machine learning algorithms.
2. It introduces CUDA and how it allows GPUs to be programmed for general purpose processing through a parallel computing model. Examples are given of how matrix multiplications and convolutional neural network operations can be parallelized on GPUs.
3. H2O is presented as a machine learning platform that supports GPU acceleration for algorithms like gradient boosted machines, enabling faster training on large datasets. Instructions are provided on getting started with CUDA, cuDNN, and using GPUs for machine learning; a minimal kernel sketch follows this list.
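As a rough illustration of the parallelization described above, here is a minimal CUDA matrix-multiply kernel written in Python with Numba (a sketch, not H2O's implementation; the sizes are arbitrary):

```python
# One GPU thread computes one element of C = A @ B.
import numpy as np
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    i, j = cuda.grid(2)  # this thread's (row, column) in the output
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
C = np.zeros((256, 256), dtype=np.float32)

threads = (16, 16)  # 256 threads per block
blocks = ((C.shape[0] + 15) // 16, (C.shape[1] + 15) // 16)
matmul[blocks, threads](A, B, C)  # Numba copies the arrays to/from the GPU
```

Because every output element is independent, thousands of GPU threads run the inner loop concurrently, which is why matrix-heavy ML workloads map so well to GPUs.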
The document summarizes Kazuaki Ishizaki's talk on making hardware accelerators easier to use. Some key points:
- Programs are becoming simpler while hardware is becoming more complicated, with commodity processors including hardware accelerators like GPUs.
- The speaker's recent work focuses on generating hardware accelerator code from high-level programs without needing specific hardware knowledge.
- An approach using a Java JIT compiler was presented that can generate optimized GPU code from parallel Java streams, requiring programmers to only express parallelism.
- The JIT compiler performs optimizations like aligning arrays, using read-only caches, reducing data transfer, and eliminating exception checks.
- Benchmarks show the generated GPU code achieves large speedups over both single-threaded and multi-threaded CPU execution.
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
Analyzing OS X Systems Performance with the USE Method - Brendan Gregg
Talk for MacIT 2014. This talk is about systems performance on OS X, and introduces the USE Method to check for common performance bottlenecks and errors. This methodology can be used by beginners and experts alike, and begins by constructing a checklist of the questions we’d like to ask of the system, before reaching for tools to answer them. The focus is resources: CPUs, GPUs, memory capacity, network interfaces, storage devices, controllers, interconnects, as well as some software resources such as mutex locks. These areas are investigated by a wide variety of tools, including vm_stat, iostat, netstat, top, latency, the DTrace scripts in /usr/bin (which were written by Brendan), custom DTrace scripts, Instruments, and more. This is a tour of the tools needed to solve our performance needs, rather than understanding tools just because they exist. This talk will make you aware of many areas of OS X that you can investigate, which will be especially useful for the time when you need to get to the bottom of a performance issue.
Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk. Tutorial slides for O'Reilly Velocity SC 2015, by Brendan Gregg.
There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well as observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.
This tutorial updates and extends an earlier talk that summarized the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation.
Tracing Summit 2014, Düsseldorf. What can Linux learn from DTrace: what went well, and what didn't go well, on its path to success? This talk will discuss not just the DTrace software, but lessons from the marketing and adoption of a system tracer, and an inside look at how DTrace was really deployed and used in production environments. It will also cover ongoing problems with DTrace, and how Linux may surpass them and continue to advance the field of system tracing. A world expert and core contributor to DTrace, Brendan now works at Netflix on Linux performance with the various Linux tracers (ftrace, perf_events, eBPF, SystemTap, ktap, sysdig, LTTng, and the DTrace Linux ports), and will summarize his experiences and suggestions for improvements. He has also been contributing to various tracers: recently promoting ftrace and perf_events adoption through articles and front-end scripts, and testing eBPF.
Introduction to DTrace (Dynamic Tracing), written by Brendan Gregg and delivered in 2007. While aimed at a Solaris-based audience, this introduction is still largely relevant today (2012). Since then, DTrace has appeared in other operating systems (Mac OS X, FreeBSD, and is being ported to Linux), and, many user-level providers have been developed to aid tracing of other languages.
TensorFlow is a dataflow-like model that runs on a wide variety of hardware platforms. It uses tensors and a directed graph to describe computations. Operations are abstract computations implemented by kernels that run on different devices like CPUs and GPUs. The core C++ implementation defines the framework and kernel functions, while the Python implementation focuses on operations, training, and providing APIs. Additional libraries like Keras, TensorFlow Slim, Skflow, PrettyTensor, and TFLearn build on TensorFlow to provide higher-level abstractions.
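A minimal graph-mode example makes that model concrete: operations are nodes in a directed graph, tensors flow along its edges, and kernels are placed per device (a sketch in the TF 1.x style, via the compat API):

```python
# Build a tiny dataflow graph and run it on a chosen device.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

g = tf.Graph()
with g.as_default():
    with tf.device("/cpu:0"):       # kernels for these ops run on the CPU
        a = tf.constant([[1.0, 2.0]])
        b = tf.constant([[3.0], [4.0]])
        c = tf.matmul(a, b)         # a MatMul node in the directed graph

with tf.Session(graph=g) as sess:
    print(sess.run(c))              # executing the graph prints [[11.]]
```

Swapping the device string to "/gpu:0" dispatches the same graph node to the GPU kernel of MatMul, which is the hardware portability the summary describes.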
What Linux can learn from Solaris performance and vice-versa - Brendan Gregg
The document discusses performance differences between Linux and Solaris (SmartOS). It begins by providing an example of a Perl program that runs 14% slower on one system versus the other. This example is used to explore potential reasons for performance differences between operating systems. The document then categorizes differences into major ("big") differences, such as kernel features, and minor ("small") differences, such as tunable parameters. Several major performance-related features of both Linux and Solaris are highlighted. The document cautions against a "Not Invented Here" viewpoint and suggests areas where each system could potentially learn from the other to improve performance.
Overview of myHadoop 0.30, a framework for deploying Hadoop on existing high-performance computing infrastructure. Discussion of how to install it, spin up a Hadoop cluster, and use the new features.
myHadoop 0.30's project page is now on GitHub (https://github.com/glennklockwood/myhadoop) and the latest release tarball can be downloaded from my website (glennklockwood.com/files/myhadoop-0.30.tar.gz)
Easy and High Performance GPU Programming for Java Programmers - Kazuaki Ishizaki
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code on average, achieving good performance gains.
This document discusses using Jupyter Notebook for machine learning projects with Spark. It describes running Python, Spark, and pandas code in Jupyter notebooks to work with data from various sources and build machine learning models. Key points include using notebooks for an ML pipeline, running Spark jobs, visualizing data, and building word embedding models with Spark. The document emphasizes how Jupyter notebooks allow integrating various tools for an ML workflow.
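A typical cell sequence along those lines might look like this (a sketch; the dataset path and column names are illustrative):

```python
# Notebook flow: query with Spark, downsample to pandas, plot inline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # usually provided by the notebook kernel
df = spark.read.parquet("events.parquet")    # hypothetical dataset path

daily = (df.groupBy(F.to_date("ts").alias("day"))
           .agg(F.count("*").alias("events")))

pdf = daily.orderBy("day").toPandas()        # pull the small aggregate to pandas
pdf.plot(x="day", y="events")                # renders inline in Jupyter
```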
This document summarizes Kazuaki Ishizaki's keynote presentation at the Fourth International Symposium on Computing and Networking (CANDAR'16) on transparent GPU exploitation for Java. The presentation covered Ishizaki's research history developing compilers and optimizing code for GPUs. It described a Java just-in-time compiler that can generate optimized GPU code from parallel loops in Java programs without requiring programmers to manage low-level GPU operations like data transfers and memory allocation themselves. The compiler implements optimizations like array alignment, read-only caching, and reducing data copying to improve GPU performance. The goal is to make GPU programming easier and more portable across hardware for Java programmers.
Kazuaki Ishizaki is a research staff member at IBM Research - Tokyo who is interested in compiler optimizations, language runtimes, and parallel processing. He has worked on the Java virtual machine and just-in-time compiler for over 20 years. His message is that Spark can utilize GPUs to accelerate computation-heavy applications in a transparent way. He proposes new binary columnar and GPU enabler components that would efficiently store and handle data on GPUs without requiring changes to Spark programs. This could be implemented either through a Spark plugin for RDDs or by enhancing the Catalyst optimizer in Spark to generate GPU code.
Performance Analysis: new tools and concepts from the cloud - Brendan Gregg
Talk delivered at SCaLE10x, Los Angeles 2012.
Cloud Computing introduces new challenges for performance analysis, for both customers and operators of the cloud. Apart from monitoring a scaling environment, issues within a system can be complicated when tenants are competing for the same resources, and are invisible to each other. Other factors include rapidly changing production code and wildly unpredictable traffic surges. For performance analysis in the Joyent public cloud, we use a variety of tools including Dynamic Tracing, which allows us to create custom tools and metrics and to explore new concepts. In this presentation I'll discuss a collection of these tools and the metrics that they measure. While these are DTrace-based, the focus of the talk is on which metrics are proving useful for analyzing real cloud issues.
The document compares on-heap and off-heap caching options. It discusses heap memory usage in the JVM and alternatives like off-heap memory using memory mapped files, ByteBuffers, and Unsafe. Popular off-heap caches like Chronicle, Hazelcast, and Redis are presented along with comparisons of their features, performance, and garbage collection impact. The document aims to help developers choose the most suitable cache for their application needs.
Published on 11 May, 2018
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
How Netflix Tunes EC2 Instances for Performance - Brendan Gregg
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg.
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year.
ChainerUI v0.3 was released with new features like sampled log visualization and performance tuning. It also introduced the experimental ImageReport extension for visualizing images generated during training. Examples shown include using ImageReport with a DCGAN and pix2pix model to display generated images. Future work includes improving the usability of ImageReport, adding support for charts, logging improvements, and enhancing the user experience of ChainerUI.
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
GPU enablement for data science on OpenShift | DevNation Tech Talk - Red Hat Developers
Data scientists use graphics processing units (GPUs) to achieve the highest performance for deep learning training and inference. However, managing those hardware resources efficiently involves complexity that may be outside the scope of the data scientists’ expertise. OpenShift is the ideal platform for simplifying that complexity by providing powerful abstractions for scalable cloud computing. This session will review the value of GPUs in data science, how modern deep learning software frameworks consume GPU resources, and the operator-based architecture that enables GPUs in OpenShift today.
This document discusses using Kubernetes to cluster Raspberry Pi devices running TensorFlow. It begins by introducing Kubernetes, TensorFlow, and the Raspberry Pi. It then covers setting up a Kubernetes cluster across multiple Raspberry Pis, including installing Docker, configuring the master and nodes, and deploying networking. Next, it discusses deploying TensorFlow jobs in a distributed manner across the Kubernetes cluster using strategies like in-graph replication. It also proposes using Docker images and Ansible scripts to simplify and automate the cluster setup. Finally, it outlines how the cluster could be used for applications involving hyperparameter tuning, scaling ML APIs, and ensemble/data parallelism with TensorFlow.
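The deck predates today's tf.distribute API, but the same idea, one TensorFlow worker per pod coordinated through a cluster spec, can be sketched like this (hostnames and the task index are illustrative and would be templated per pod):

```python
# Each pod sets TF_CONFIG describing the cluster, then joins synchronous training.
import json, os
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["tf-worker-0:2222", "tf-worker-1:2222"]},
    "task": {"type": "worker", "index": 0},   # set to this pod's rank
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()  # synchronous data parallelism
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) would then train across both pods over the cluster network
```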
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup... - Chris Fregly
Using the latest advancements from TensorFlow including the Accelerated Linear Algebra (XLA) Framework, JIT/AOT Compiler, and Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow Models in a GPU-based production environment.
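In current TensorFlow, the XLA JIT can be enabled for a single function (a sketch; the deck itself demonstrates the TF 1.x tooling named above):

```python
# jit_compile=True fuses these ops into one XLA-compiled kernel.
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
y = dense_relu(x, w, b)   # the first call triggers JIT compilation
```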
This talk contains many Spark ML and TensorFlow AI demos using PipelineIO's 100% Open Source Community Edition. All code and Docker images are available to reproduce on your own CPU or GPU-based cluster.
Chris Fregly is Founder and Research Engineer at PipelineIO, a Streaming Machine Learning and Artificial Intelligence Startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly training and video series titled "High Performance TensorFlow in Production."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
https://www.meetup.com/TensorFlow-Chicago/events/240267321/
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/240587698/
http://pipeline.io
https://github.com/fluxcapacitor/pipeline
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc... - Chris Fregly
Using the latest advancements from TensorFlow including the Accelerated Linear Algebra (XLA) Framework, JIT/AOT Compiler, and Graph Transform Tool, Chris will demonstrate how to optimize, profile, and deploy TensorFlow Models in a GPU-based production environment. This talk is 100% demo-based with open source tools and completely reproducible through Docker on your own GPU cluster.
https://github.com/fluxcapacitor/pipeline/gpu.ml
http://pipeline.io
SaltConf14 - Eric Johnson, Google - Orchestrating Google Compute Engine with ... - SaltStack
Google is making the power of its datacenter, network, and technology innovations available to the world through its Cloud services. This presentation will provide an overview of the Google Cloud Platform and a deeper dive on Google Compute Engine. Google recently made an open source contribution to SaltStack, and you can now use Salt Cloud to manage your Compute Engine resources (IaaS virtual machine services). Come find out more about Google's Cloud Platform and how you can leverage Google scale with SaltStack.
This document provides an overview of programming for GPUs. It discusses how GPUs have many more cores than CPUs and are better suited for data-parallel work. The main challenges of GPU programming are different memory architectures, branch divergence, and complexity. It presents CUDA and OpenCL as common approaches for GPU programming and provides an example of a reduction kernel written in CUDA/OpenCL using shared memory and synchronization barriers between threads. Recent advances that help with GPU programming include kernel calls from the GPU, multi-GPU support, unified memory, task parallelism, better profilers, and C++ language support.
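Here is that reduction pattern, shared memory plus synchronization barriers, sketched in Python with Numba's CUDA dialect rather than raw CUDA C (the array size is arbitrary):

```python
# Block-level sum reduction: each block reduces TPB elements in shared memory.
import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block (a power of two)

@cuda.jit
def block_sum(x, partial):
    sm = cuda.shared.array(TPB, dtype=float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    sm[tid] = x[i] if i < x.size else 0.0
    cuda.syncthreads()                 # barrier: shared memory fully loaded
    s = TPB // 2
    while s > 0:                       # tree reduction within the block
        if tid < s:
            sm[tid] += sm[tid + s]
        cuda.syncthreads()             # barrier between reduction steps
        s //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = sm[0]   # one partial sum per block

x = np.random.rand(1 << 20).astype(np.float32)
blocks = (x.size + TPB - 1) // TPB
partial = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, TPB](x, partial)
print(partial.sum(), x.sum())          # partial sums are combined on the host
```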
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ... - DataWorks Summit
Using the latest advancements from TensorFlow including the Accelerated Linear Algebra (XLA) Framework, JIT/AOT Compiler, and Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow Models in a GPU-based production environment.
This talk contains many Spark ML and TensorFlow AI demos using PipelineIO's 100% Open Source Community Edition. All code and Docker images are available to reproduce on your own CPU or GPU-based cluster.
* Bio *
Chris Fregly is Founder and Research Engineer at PipelineIO, a Streaming Machine Learning and Artificial Intelligence Startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Video Series High Performance TensorFlow in Production.
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.
Linux Performance Analysis: New Tools and Old Secrets - Brendan Gregg
Talk for USENIX/LISA2014 by Brendan Gregg, Netflix. At Netflix performance is crucial, and we use many high- to low-level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit, and are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc.
Travis Oliphant "Python for Speed, Scale, and Science" - Fwdays
Python is sometimes discounted as slow because of its dynamic typing and interpreted nature, and as unsuitable for scale because of the GIL. But in this talk, I will show how, with the help of talented open-source contributors around the world, we have been able to build systems in Python that are fast and scalable to many machines, and how this has helped Python take over science.
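A small example of that approach: keep the algorithm in Python, but JIT-compile the hot loop with Numba so it runs as machine code (a sketch; any numerical kernel would do):

```python
# A plain Python loop, compiled to machine code by Numba.
import numpy as np
from numba import njit

@njit
def dot_loop(a, b):
    s = 0.0
    for i in range(a.size):
        s += a[i] * b[i]
    return s

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(dot_loop(a, b), np.dot(a, b))  # same result; the loop now runs compiled
```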
This document discusses Linux performance analysis tools. It introduces tpoint, a tool for tracing Linux tracepoints. Some example one-liners are provided that demonstrate how to use tpoint to trace disk I/O and see the tasks and processes performing I/O. The document also summarizes ftrace, a Linux kernel tracing tool that can be used to analyze performance issues.
On the necessity and inapplicability of python - Yung-Yu Chen
Python is a popular scripting language adopted by numerical software vendors to help users solve challenging numerical problems. It provides an easy-to-use interface and offers decent speed through array operations, but it is not suitable for engineering low-level constructs. To make good numerical software, developers need to be familiar with C++ and computer architecture. The gap in understanding between high-level applications and low-level implementation motivated me to organize a course to train computer scientists in what it takes to build numerical software that the users (application experts) want. This talk will give a bird's-eye view of the advantages and disadvantages of Python, and where and how C++ should be used in the context of numerical software. The information may be used to map out a plan to acquire the necessary skill sets for making the software.
Recording: https://www.youtube.com/watch?v=OwA-Xt_Ke3Y
On the Necessity and Inapplicability of Python - Takeshi Akutsu
This document discusses the use of Python for numerical software development. It begins by introducing the author and their background in computational mechanics. It then discusses PyHUG, the Python user group in Taiwan, and PyCon Taiwan 2020.
The document notes that while Python is slow for number crunching, NumPy can provide reasonably fast performance. It explains that a hybrid architecture is commonly used, with the core computing kernel written in C++ for speed and Python used for the user-level API to describe complex problems more easily. An example of solving the Laplace equation is provided to demonstrate the speed differences between pure Python, NumPy, and C++ implementations.
The document advocates for training computer scientists in a hybrid approach through a numerical software course.
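The Laplace example's NumPy variant, for instance, replaces the doubly nested Python loop with whole-array slicing, which is where most of the speedup over pure Python comes from (a sketch of one Jacobi sweep; grid size and boundary values are arbitrary):

```python
# One Jacobi iteration for the 2-D Laplace equation, vectorized with NumPy.
import numpy as np

def jacobi_step(u):
    un = u.copy()
    # Each interior point becomes the average of its four neighbours.
    un[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                             u[1:-1, :-2] + u[1:-1, 2:])
    return un

u = np.zeros((128, 128))
u[0, :] = 1.0                  # fixed boundary condition on one edge
for _ in range(500):
    u = jacobi_step(u)
```

The C++ kernel in the hybrid design performs the same sweep in a tight compiled loop, with Python retained as the orchestration layer.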
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Chris Fregly
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs @ Strata London, May 24 2017
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Advanced Spark and TensorFlow Meetup May 23 2017 @ Hotels.com London
We'll discuss how to deploy TensorFlow, Spark, and Scikit-learn models on GPUs with Kubernetes across multiple cloud providers including AWS, Google, and Azure - as well as on-premise.
In addition, we'll discuss how to optimize TensorFlow models for high-performance inference using the latest TensorFlow XLA (Accelerated Linear Algebra) framework including the JIT and AOT Compilers.
Github Repo (100% Open Source!)
https://github.com/fluxcapacitor/pipeline
http://pipeline.io
The document provides an overview of big data analysis and parallel programming tools for R. It discusses what constitutes big data, popular big data applications, and relevant hardware and software. It then covers parallel programming challenges and approaches in R, including using multicore processors with the multicore package, SMP and cluster programming with foreach and doMC/doSNOW, NoSQL databases like Redis with doRedis, and job scheduling. The goal is to help users effectively analyze big data in R by leveraging parallelism.
This document provides information about installing and using Torch 7, an open-source deep learning framework. It discusses installing Torch 7 on Linux, installing deep learning libraries for natural language processing and CUDA. It also gives examples of using Torch 7 for image processing and classification, including loading an image dataset and defining different types of neural network models like linear, MLP and convolutional networks.
In this deck from Switzerland HPC Conference, Gunter Roeth from NVIDIA presents: Deep Learning on the SaturnV Cluster.
"Machine Learning is among the most important developments in the history of computing. Deep learning is one of the fastest growing areas of machine learning and a hot topic in both academia and industry. It has dramatically improved the state-of-the-art in areas such as speech recognition, computer vision, predicting the activity of drug molecules, and many other machine learning tasks. The basic idea of deep learning is to automatically learn to represent data in multiple layers of increasing abstraction, thus helping to discover intricate structure in large datasets. NVIDIA has invested in SaturnV, a large GPU-accelerated cluster, (#28 on the November 2016 Top500 list) to support internal machine learning projects. After an introduction to deep learning on GPUs, we will address a selection of open questions programmers and users may face when using deep learning for their work on these clusters."
Watch the video: http://wp.me/p3RLHQ-gDv
Learn more: http://www.nvidia.com/object/dgx-saturnv.html
and
http://hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For... - Jen Aman
This document discusses optimizations made to Apache Spark MLlib algorithms to better support sparse data at large scale. It describes how KMeans, linear methods, and other ML algorithms were modified to use sparse vector representations to reduce memory usage and improve performance when working with sparse data, including optimizations made for clustering large, high-dimensional datasets. The optimizations allow these algorithms to be applied to much larger sparse datasets and high-dimensional problems than was previously possible with MLlib.
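For context, MLlib's sparse vectors store only the non-zero (index, value) pairs, which is what these optimizations exploit for high-dimensional data (a minimal PySpark sketch; dimensions and values are illustrative):

```python
# Cluster 10,000-dimensional points that carry only a few non-zeros each.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
data = [(Vectors.sparse(10000, {0: 1.0, 17: 3.0}),),
        (Vectors.sparse(10000, {5: 2.0, 9999: 1.0}),)]
df = spark.createDataFrame(data, ["features"])

model = KMeans(k=2, seed=1).fit(df)   # operates on the sparse representation
print(model.clusterCenters()[0][:5])
```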
Tim Hunter presented on TensorFrames, which allows users to run TensorFlow models on Apache Spark. Some key points:
- TensorFrames embeds TensorFlow computations into Spark's execution engine to enable distributed deep learning across a Spark cluster.
- It offers performance improvements over other options like Scala UDFs by avoiding serialization and using direct memory copies between processes.
- The demo showed how TensorFrames can leverage GPUs both on Databricks clusters and locally to accelerate numerical workloads like kernel density estimation and deep dream generation.
- Future work includes better integration with Tungsten and MLlib data types as well as official GPU support on Databricks clusters. TensorFrames aims to provide a simple API for distributed numerical computing on Spark; a usage sketch follows this list.
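The project's published examples show usage roughly like the following (a sketch; tensorframes is archived and targets TF 1.x and Spark 2.x, so treat the exact API as historical):

```python
# Map a small TensorFlow graph over a DataFrame column with TensorFrames.
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    x = tfs.block(df, "x")        # placeholder bound to the "x" column
    z = tf.add(x, 3, name="z")    # computation executed inside the executors
    df2 = tfs.map_blocks(z, df)   # returns a DataFrame with an added "z" column

df2.show()
```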
When working with big data or complex algorithms, we often look to parallelize our code to optimize runtime. By taking advantage of a GPU's 1,000+ cores, a data scientist can quickly scale out solutions inexpensively and sometimes more quickly than with traditional CPU cluster computing. In this webinar, we will present ways to incorporate GPU computing to complete computationally intensive tasks in both Python and R.
See the full presentation here: 👉 https://vimeo.com/153290051
Learn more about the Domino data science platform: https://www.dominodatalab.com
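On the Python side, one low-friction route is CuPy, which mirrors the NumPy API on the GPU (a sketch; requires a CUDA-capable GPU, and sizes are arbitrary):

```python
# The same array expressions, executed on the GPU's cores via CuPy.
import numpy as np
import cupy as cp

x_cpu = np.random.rand(4096, 4096).astype(np.float32)
x_gpu = cp.asarray(x_cpu)          # host -> device copy

y_gpu = cp.tanh(x_gpu @ x_gpu.T)   # matmul + elementwise op on the GPU
y_cpu = cp.asnumpy(y_gpu)          # copy back only when results are needed
```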
ARVC and flecainide case report[EI] Jim.docx.pdf - Jim Dowling
This case report describes a patient diagnosed with arrhythmogenic right ventricular cardiomyopathy (ARVC) due to a mutation in the titin gene. Initial treatment with beta-blockers for exercise-induced ventricular arrhythmias was ineffective. Treatment with flecainide dramatically improved the patient's symptoms. After 6 years of flecainide treatment, the patient can engage in low-intensity activities without issues. The report highlights the potential efficacy of flecainide for ARVC patients with exercise-induced arrhythmias and preserved heart function.
PyData Berlin 2023 - Mythical ML Pipeline.pdf - Jim Dowling
This talk is a mental map for building ML systems as ML Pipelines that are factored into Feature Pipelines, Training Pipelines, and Inference Pipelines.
Serverless ML Workshop with Hopsworks at PyData Seattle - Jim Dowling
1. The document discusses building a minimum viable prediction service (MVP) to predict air quality using only Python and free serverless services in 90 minutes.
2. It describes creating feature, training, and inference pipelines to build an air quality prediction service using Hopsworks, Modal, and Streamlit/Gradio.
3. The pipelines would extract features from weather and air quality data, train a model, and deploy an inference pipeline to make predictions on new data; a feature-pipeline sketch follows this list.
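The feature-pipeline step might look roughly like this with the Hopsworks Python API (a sketch; the feature group name, keys, and the rows inserted are illustrative):

```python
# Write freshly engineered features to the Hopsworks feature store.
import hopsworks
import pandas as pd

project = hopsworks.login()          # authenticates with an API key
fs = project.get_feature_store()

aq_df = pd.DataFrame({"city": ["seattle"],
                      "date": ["2023-04-01"],
                      "pm25": [7.2]})

air_quality_fg = fs.get_or_create_feature_group(
    name="air_quality", version=1,
    primary_key=["city", "date"],
    description="Daily air-quality measurements",
)
air_quality_fg.insert(aq_df)         # training/inference pipelines read from here
```

Run on a schedule (for example from Modal), this keeps the feature store current so the inference pipeline always predicts from fresh data.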
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf - Jim Dowling
This document discusses building machine learning systems using serverless services and Python. It introduces the Iris flower classification dataset as a case study. The key steps outlined are to: create accounts on Hopsworks, Modal, and HuggingFace; build and run feature, training and inference pipelines on Modal to classify Iris flowers; and create a predictive user interface using Gradio on HuggingFace to allow users to input Iris flower properties and predict the variety. The document emphasizes that serverless infrastructure allows building operational and analytical ML systems without managing underlying infrastructure.
_Python Ireland Meetup - Serverless ML - Dowling.pdf - Jim Dowling
This document summarizes a presentation about building serverless machine learning applications in Python. It discusses refactoring monolithic ML pipelines into separate feature engineering, training, and inference pipelines. This allows historical data to be used for backfilling features and new models to be trained on schedules. Online inference pipelines retrieve pre-computed features from a feature store and compute additional features from application data. The document provides examples using a feature store on Hopsworks to build batch prediction services for iris flower data as a case study. It promotes serverless ML with Hopsworks which provides an unlimited free tier.
Building Hopsworks, a cloud-native managed feature store for machine learning - Jim Dowling
Cloud Native London talk about the control layer of Hopsworks.ai and our choice of cloud native services. We built our own multi-tenant services as cloud native services, for the most part.
Ml ops and the feature store with hopsworks, DC Data Science Meetup - Jim Dowling
1) MLOps and the Feature Store with Hopsworks discusses how a feature store can be used to orchestrate machine learning pipelines, including feature engineering, model training, model serving, and model monitoring.
2) It provides an overview of the key components in an MLOps workflow including feature groups, training datasets, transformations, and how these interact with roles like data engineers, data scientists, and ML engineers.
3) The document demonstrates how the Hopsworks feature store API can be used to manage the machine learning lifecycle, from raw data ingestion and feature engineering through training dataset creation and model training to model deployment and monitoring; a serving-side sketch follows this list.
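The serving side of that lifecycle can be sketched with the current Hopsworks API (names are illustrative): a feature view selects the features, the offline store backs training data, and the online store serves feature vectors at prediction time:

```python
# Read features for training (offline) and for low-latency serving (online).
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

fv = fs.get_or_create_feature_view(
    name="transactions_view", version=1,
    query=fs.get_feature_group("transactions", version=1).select_all(),
)

X, y = fv.training_data()                            # offline store: model training
vector = fv.get_feature_vector({"account_id": 42})   # online store: one prediction
```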
Hops fs huawei internal conference july 2021 - Jim Dowling
This document discusses the evolution of distributed databases and file systems. It provides an overview of HopsFS, a scale-out metadata file system that uses NewSQL databases like RonDB to distribute metadata. HopsFS offers high availability, free-text search capabilities, and integration with object stores. The document also describes how HopsFS can provide a unified platform for machine learning workflows by integrating features stores, data lakes, model serving, and more.
- Jim Dowling, CEO of Logical Clocks, discusses breaking up monolithic ML pipelines into feature pipelines and training pipelines using a feature store.
- A feature store allows teams to centrally manage features and training data, enabling more modular development and improved collaboration across roles like data scientists, data engineers, and ML engineers.
- Feature pipelines are used to engineer, validate, and manage features over time, while training pipelines focus on model training, evaluation, and deployment.
Hopsworks Feature Store 2.0 a new paradigm - Jim Dowling
The document discusses Hopsworks Feature Store 2.0 and its capabilities for managing machine learning workflows. Key points include:
- Hopsworks Feature Store allows for ingesting, storing, and reusing features to support tasks like training, serving, and experimentation.
- It provides APIs for creating feature groups, training datasets, and joining features across groups. This enables end-to-end ML pipelines; a join-and-train sketch follows this list.
- The feature store supports both online and offline usage, versioning of features and schemas, and time travel capabilities to access past feature values.
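The join-and-train flow might look like this with the hsfs library of that era (a sketch; feature group and column names are illustrative):

```python
# Join features across feature groups, then materialize a training dataset.
import hsfs

connection = hsfs.connection()            # reads credentials from the environment
fs = connection.get_feature_store()

trans_fg = fs.get_feature_group("transactions", version=1)
profile_fg = fs.get_feature_group("profiles", version=1)

# Features join on the groups' shared primary key.
query = trans_fg.select_all().join(profile_fg.select(["age", "segment"]))

td = fs.create_training_dataset(name="fraud_td", version=1, data_format="csv")
td.save(query)                            # versioned, reusable training data
```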
Metadata and Provenance for ML Pipelines with Hopsworks - Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
The document discusses using generative adversarial networks (GANs) to improve anti-money laundering (AML) detection. It describes training a GAN on a large transaction dataset, using Spark for feature engineering and TensorFlow for training. The GAN was able to classify transactions as either suspected of money laundering or clean. It also discusses challenges of training GANs, such as mode collapse, and techniques to address them, like using multiple generators. Finally, it proposes candidate features for an AML model, such as graph-based, frequency, amount, time-since, and velocity-change features.
Berlin buzzwords 2020-feature-store-dowling - Jim Dowling
This document provides information about Logical Clocks and their Feature Store product. It discusses key leadership and offices for Logical Clocks. It then provides an overview of Feature Engineering and the Feature Store concepts including feature transformations, feature groups, training/test datasets, and online/offline feature stores. It demonstrates how to register and access feature groups from the feature store to create training datasets. Finally, it discusses online model serving from the online feature store and the Hopsworks platform more broadly.
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University - Jim Dowling
GPUs are well-suited for distributed deep learning tasks like matrix multiplications and convolutions using techniques like SIMD. Frameworks like CUDA and ROCm allow programming GPUs in a SIMT fashion with hierarchical threads. Distributed training uses techniques like data parallelism with synchronous or asynchronous SGD over high-speed networks. Larger batch sizes and learning rate scaling can improve training speed without loss of accuracy.
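A compact sketch of synchronous data-parallel SGD with that linear learning-rate scaling rule, using tf.distribute (model and batch size are illustrative):

```python
# Synchronous SGD across local GPUs; LR and global batch scale with replicas.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # all-reduce across local GPUs
n = strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer=tf.keras.optimizers.SGD(0.01 * n),  # linear scaling
                  loss="sparse_categorical_crossentropy")

# model.fit(dataset.batch(64 * n), epochs=1)  # global batch grows with replicas
```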
Hopsworks data engineering melbourne april 2020 - Jim Dowling
This document provides information about Logical Clocks and their open-source Hopsworks platform for data-intensive AI with a feature store. It lists their leadership and offices in Stockholm, London, and Silicon Valley. It then provides details about Hopsworks and how it is used in production for finance, healthcare, and other industries. It describes common feature stores used in production and outlines key feature store concepts like features, feature groups, and training/test datasets. It shows how different types of data are ingested at different cadences into an online and offline feature store. Finally, it demonstrates how to register a feature group, create training datasets, and build feature vectors for model prediction using Hopsworks' feature store.
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy - Jim Dowling
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
Hopsworks at Google AI Huddle, Sunnyvale - Jim Dowling
Hopsworks is a platform for designing and operating end-to-end machine learning pipelines using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store. Hopsworks is open source.
Hopsworks in the cloud Berlin Buzzwords 2019 - Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Role of Data Annotation Services in AI-Powered Manufacturing - Andrew Leo
From predictive maintenance to robotic automation, AI is driving the future of manufacturing. But without high-quality annotated data, even the smartest models fall short.
Discover how data annotation services are powering accuracy, safety, and efficiency in AI-driven manufacturing systems.
Precision in data labeling = Precision on the production floor.
How Can I use the AI Hype in my Business Context? - Daniel Lehner
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a LinkedIn webinar for Tecnovy on 28.04.2025.
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On... - Aqusag Technologies
In late April 2025, a significant portion of Europe, particularly Spain, Portugal, and parts of southern France, experienced widespread, rolling power outages that continue to affect millions of residents, businesses, and infrastructure systems.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul
Artificial intelligence is changing how businesses operate. Companies are using AI agents to automate tasks, reduce time spent on repetitive work, and focus more on high-value activities. Noah Loul, an AI strategist and entrepreneur, has helped dozens of companies streamline their operations using smart automation. He believes AI agents aren't just tools—they're workers that take on repeatable tasks so your human team can focus on what matters. If you want to reduce time waste and increase output, AI agents are the next move.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Mobile App Development Company in Saudi ArabiaSteve Jonas
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxJustin Reock
Building 10x Organizations with Modern Productivity Metrics
10x developers may be a myth, but 10x organizations are very real, as proven by the influential study performed in the 1980s, ‘The Coding War Games.’
Right now, here in early 2025, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience. It seems that with every new method we invent for the delivery of products, whether physical or virtual, we reinvent productivity philosophies to go alongside them.
But which of these approaches actually work? DORA? SPACE? DevEx? What should we invest in and create urgency behind today, so that we don’t find ourselves having the same discussion again in a decade?
What is Model Context Protocol(MCP) - The new technology for communication bw...Vishnu Singh Chundawat
The MCP (Model Context Protocol) is a framework designed to manage context and interaction within complex systems. This SlideShare presentation will provide a detailed overview of the MCP Model, its applications, and how it plays a crucial role in improving communication and decision-making in distributed systems. We will explore the key concepts behind the protocol, including the importance of context, data management, and how this model enhances system adaptability and responsiveness. Ideal for software developers, system architects, and IT professionals, this presentation will offer valuable insights into how the MCP Model can streamline workflows, improve efficiency, and create more intuitive systems for a wide range of use cases.
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungenpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-und-verwaltung-von-multiuser-umgebungen/
HCL Nomad Web wird als die nächste Generation des HCL Notes-Clients gefeiert und bietet zahlreiche Vorteile, wie die Beseitigung des Bedarfs an Paketierung, Verteilung und Installation. Nomad Web-Client-Updates werden “automatisch” im Hintergrund installiert, was den administrativen Aufwand im Vergleich zu traditionellen HCL Notes-Clients erheblich reduziert. Allerdings stellt die Fehlerbehebung in Nomad Web im Vergleich zum Notes-Client einzigartige Herausforderungen dar.
Begleiten Sie Christoph und Marc, während sie demonstrieren, wie der Fehlerbehebungsprozess in HCL Nomad Web vereinfacht werden kann, um eine reibungslose und effiziente Benutzererfahrung zu gewährleisten.
In diesem Webinar werden wir effektive Strategien zur Diagnose und Lösung häufiger Probleme in HCL Nomad Web untersuchen, einschließlich
- Zugriff auf die Konsole
- Auffinden und Interpretieren von Protokolldateien
- Zugriff auf den Datenordner im Cache des Browsers (unter Verwendung von OPFS)
- Verständnis der Unterschiede zwischen Einzel- und Mehrbenutzerszenarien
- Nutzung der Client Clocking-Funktion
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
1. SPARK & TENSORFLOW AS-A-SERVICE
Jim Dowling: Assoc Prof, KTH; Senior Researcher, RISE SICS; CEO, Logical Clocks AB
#EUai8
Hops
2. Newton confirmed what many suspected
• In August 1684, Halley visited Newton: "What type of curve does a planet describe in its orbit about the sun, assuming an inverse square law of attraction?"
3. Facebook confirmed what many suspected
• In June 2017, Facebook showed how to reduce training time on ImageNet for a deep CNN from 2 weeks to 1 hour by scaling out to 256 GPUs.
https://arxiv.org/abs/1706.02677
4. AI Hierarchy of Needs
From the base of the pyramid to the top:
• Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion
• B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
• A/B Testing, Experimentation, ML
• Deep Learning, RL, Automated ML
• DDL (Distributed Deep Learning)
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
5. AI Hierarchy of Needs
The same pyramid, annotated: the lower layers deliver Analytics, the upper layers deliver Prediction.
6. AI Hierarchy of Needs
The same pyramid again, with Hops positioned as a single platform covering every layer.
7. Deep Learning Hierarchy of Scale
From bottom to top: Single GPU; Many GPUs on a Single GPU Server; Parallel Experiments on GPU Servers; DDL with GPU Servers and Parameter Servers; DDL AllReduce on GPU Servers.
Training time for ImageNet falls as you climb the hierarchy: from weeks (single GPU), through days and hours, down to minutes (DDL AllReduce).
8. Deep Learning Hierarchy of Scale
• Single Host DL: Single GPU; Many GPUs on a Single GPU Server
• Distributed DL: Parallel Experiments on GPU Servers; DDL with GPU Servers and Parameter Servers; DDL AllReduce on GPU Servers
Both tiers run either in public clouds or on-premise.
9. DNN Training Time and Researcher Productivity
• Distributed Deep Learning: interactive analysis, instant gratification
• Single Host Deep Learning: Google-envy
[Cartoon: "My Model's Training."]
10. What Hardware do you Need?
• SingleRoot PCI Complex Server*: 10 Nvidia GTX 1080Ti (11 GB memory each), 256 GB RAM, 2 Intel Xeon CPUs, 2x56 Gb Infiniband. ~15K Euro.
• Nvidia DGX-1: 8 Nvidia Tesla P100/V100 (16 GB memory each), 512 GB RAM, 2 Intel Xeon CPUs, 4x100 Gb Infiniband, NVLink**. Up to 150K Euro.
*https://www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems
**https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/
11. SingleRoot Complex Server with 10 GPUs
[Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
12. Tensorflow GAN Training Example*
*https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1080-ti-single-root-deep-learning-server-part-1/
13. Cluster of Commodity GPU Servers
• Servers interconnected with InfiniBand
• Max 1-2 GPU servers per rack (2-4 kW per server)
14. Spark and TF – Cluster Integration
[Diagram: a cluster manager schedules jobs onto GPU servers, backed by a shared training data and model store.]
• Job types: single-GPU experiment, parallel experiments (hyperparameter tuning), distributed training job
• A mix of commodity GPUs and more powerful GPUs works well for (1) parallel experiments and (2) distributed training
15. GPU Resource Requests in Hops
HopsYARN (supports GPUs-as-a-Resource, on top of HopsFS). Example requests:
• 4 GPUs on any host
• 10 GPUs on 1 host
• 100 GPUs on 10 hosts with 'Infiniband'
• 20 GPUs on 2 hosts with 'Infiniband_P100'
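For comparison, stock Spark (3.0 and later) exposes similar GPU requests through resource configurations rather than HopsYARN's constraint strings; a minimal PySpark sketch, with illustrative amounts and discovery-script path:

# Request one GPU per executor and per task in vanilla Spark 3.x.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gpu-job")
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/scripts/getGpusResources.sh")  # script that reports GPU addresses
         .getOrCreate())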
16. HopsFS: Next Generation HDFS*
• Faster: 16x the throughput of HDFS
• Bigger: 37x the number of files
• Improved small-file performance**
• Scale Challenge Winner (2017)
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
17. TensorFlow Spark API Integration
• Tight Integration
– Databricks' TensorFrames and Deep Learning Pipelines
• Loose Integration
– TensorFlow-on-Spark, Hops TfLauncher (PySpark as a wrapper for TensorFlow)
18. Deep Learning Pipelines

# Abbreviated sparkdl example from the slide; the elided arguments (…) are as presented.
graph = tf.Graph()
with tf.Session(graph=graph) as sess:
    image_arr = utils.imageInputPlaceholder()
    frozen_graph = tfx.strip_and_freeze_until(…)
transformer = TFImageTransformer(…)
image_df = readImages("/data/myimages")
processed_image_df = transformer.transform(image_df)
…

Inference is also possible from SparkSQL:

select image, driven_by_007(image) as probability from car_examples
order by probability desc limit 6
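The driven_by_007 UDF in the SQL above must first be registered against a trained model; with sparkdl this looked roughly like the following sketch (the model file path is a hypothetical example):

# Register a SparkSQL UDF backed by a Keras image model (illustrative path).
from sparkdl import registerKerasImageUDF
registerKerasImageUDF("driven_by_007", "/models/car_classifier.h5")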
19. Hops TfLauncher – TF in Spark

def model_fn(learning_rate, dropout):
    import tensorflow as tf
    from hops import tensorboard, hdfs, devices
    …

from hops import tflauncher
args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
tflauncher.launch(spark, model_fn, args_dict)

Launches TF jobs as mappers in Spark, with "pure" TensorFlow code running in each Executor.
20. Hops TfLauncher – Parallel Experiments

def model_fn(learning_rate, dropout):
    …

from hops import tflauncher
args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6, 0.7]}
tflauncher.launch(spark, model_fn, args_dict)

Launches 3 Executors with 3 different hyperparameter settings. Each Executor can have 1-N GPUs.
22. Distributed TensorFlow
• AllReduce
– Horovod by Uber with MPI/NCCL
– Baidu AllReduce/MPI in TensorFlow/contrib
• Distributed Parameter Servers
– TensorFlow-on-Spark
– Distributed TensorFlow
[Recap of the hierarchy: DDL AllReduce on GPU Servers; DDL with GPU Servers and Parameter Servers]
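To make the AllReduce option concrete, here is a minimal sketch in the style of Horovod's TensorFlow 1.x-era examples; build_loss is a hypothetical model-building helper and the learning rate is illustrative:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                          # one process per GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # pin this process to its GPU

loss = build_loss()                                 # hypothetical: builds the model and loss
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), 0.9)  # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                 # averages gradients via ring-AllReduce
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]       # rank 0 broadcasts initial weights
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)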
23. Asynchronous SGD vs Synchronous SGD
• Synchronous Stochastic Gradient Descent (SGD) now dominant, due to improved convergence guarantees:
– "Revisiting Synchronous SGD", Chen et al, ICLR 2016
https://research.google.com/pubs/pub45187.html
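The synchronous update itself is simple: each worker computes a gradient on its shard of the global batch, the gradients are averaged (the AllReduce step), and every replica applies the identical update. A toy NumPy sketch, with grad_fn as a hypothetical per-worker gradient function:

import numpy as np

def sync_sgd_step(w, shards, grad_fn, lr=0.01):
    # grad_fn(w, shard) returns the loss gradient on one worker's data shard.
    grads = [grad_fn(w, shard) for shard in shards]  # in reality: computed in parallel
    avg_grad = np.mean(grads, axis=0)                # the AllReduce/averaging step
    return w - lr * avg_grad                         # identical update on every replica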
24. Distributed TF with Parameter Servers
[Diagram: synchronous SGD with data parallelism; workers pull parameters from, and push gradients to, the parameter servers.]
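For reference, a condensed sketch of the classic TensorFlow 1.x parameter-server setup the diagram depicts; the host addresses are placeholders, and build_loss is a hypothetical helper:

import tensorflow as tf

job_name, task_index = "worker", 0   # normally supplied by the cluster manager
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                    # parameter servers just host the variables
else:
    # Variables are placed on the ps job, compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        loss = build_loss()          # hypothetical model-building helper
        opt = tf.train.GradientDescentOptimizer(0.01)
        # SyncReplicasOptimizer aggregates gradients for synchronous SGD.
        opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2)
        train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())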
25. Tensorflow-on-Spark (Yahoo!)
• Rewrite TensorFlow apps to Distributed TensorFlow
• Two modes:
1. feed_dict: RDD.mapPartitions()
2. TFReader + queue_runner: direct HDFS access from TensorFlow
[Image from https://www.slideshare.net/Hadoop_Summit/tensorflowonspark-scalable-tensorflow-learning-on-spark-clusters]
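A minimal launch sketch in the style of the TensorFlowOnSpark examples; map_fun, args, and images_labels_rdd are assumed to be defined by the application:

from tensorflowonspark import TFCluster

# 4 executors, 1 of them a parameter server; InputMode.SPARK is mode 1 above.
cluster = TFCluster.run(sc, map_fun, args, 4, 1, False, TFCluster.InputMode.SPARK)
cluster.train(images_labels_rdd, 1)   # RDD partitions feed the workers via feed_dict
cluster.shutdown()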
26. TFonSpark with Spark Streaming
[Image from https://www.slideshare.net/Hadoop_Summit/tensorflowonspark-scalable-tensorflow-learning-on-spark-clusters]
35. Parameter Server vs AllReduce (Uber)*
Setup: 16 servers with 4 P100 GPUs each, connected by a 40 Gbit/s network (synthetic data).
[Benchmark chart; the VGG model is larger, so parameter synchronization costs it more.]
*https://github.com/uber/horovod
36. Distributed Synchronous SGD: N/W is the Bottleneck
[Chart: per-step time split between computation and network (N/W) communication, for 1 GPU vs 4 GPUs; as more GPUs share each step, the network share comes to dominate.]
Goal: reduce N/W communication time, increase computation time.
Amdahl's Law
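Amdahl's Law makes the slide's point precise: if a fraction p of each training step is parallelizable computation and the remaining 1 - p is serial network communication, the speedup on n GPUs is bounded by

    S(n) = \frac{1}{(1 - p) + p/n} \le \frac{1}{1 - p}

so adding GPUs cannot overcome a large communication share; only shrinking network time raises the ceiling.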
38. Hopsworks: Full AI Hierarchy of Needs
Workflow: Develop, Train, Test, Deploy
[Architecture diagram, bottom up:]
• Storage and metadata: HopsFS / YARN, MySQL Cluster, Hive, InfluxDB, ElasticSearch, Kafka
• Frameworks: Spark, Flink, Tensorflow
• Tooling: Jupyter, Zeppelin, Jobs, Kibana, Grafana
• REST API, with Projects, Datasets, and Users as first-class concepts
39. Hopsworks Abstractions
A Project is a grouping of users and data.
[Diagram: projects such as Proj-42, Proj-X, and Proj-AllCompanyDB, with a shared Kafka topic and a shared dataset /Projs/My/Data crossing project boundaries.]
Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017
43. Conclusions
• Many good frameworks for TF and Spark
– TensorFlowOnSpark, Deep Learning Pipelines
• Hopsworks support for TF and Spark
– GPUs-as-a-Resource in HopsYARN
– TfLauncher, TensorFlow-on-Spark, Horovod
– Jupyter with Conda support
• More on GPU servers at www.logicalclocks.com
44. Active: Jim Dowling, Seif Haridi, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Alumni: Roberto Bampi, ArunaKumari Yedurupaka, Tobias Johansson, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K "Sri" Srijeyanthan, Jude D'Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Please Follow Us! @hopshadoop (Hops Heads)
Please Star Us! http://github.com/hopshadoop/hopsworks