Leveraging Open Source for Large Scale
Analytics on HPC Systems
Rob Vesse, Software Engineer, Cray Inc
C O M P U T E | S T O R E | A N A L Y Z E
Overview
● Background
● Challenges
● Packaging and Deployment
● Input/Output
● Scaling Analytics
● Python Data Science
● Machine Learning
Slides: https://cray.box.com/v/sw-data-july-2018
Copyright Cray Inc 2018
Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to
any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to
change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced
for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising,
promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the
approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware
or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and
design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL,
CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL,
THREADSTORM. The following system family marks, and associated model number marks, are trademarks of
Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a
sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other
trademarks used in this document are the property of their respective owners.
Background
● About Me
● Software Engineer in the Analytics R&D Group
● Develop hardware and software solutions across Cray's product portfolio
● Primarily focused on integrating open source software into a coherent,
user-friendly product
● Involved in open source for ~15 years, committer at Apache Software Foundation
since 2012, and member since 2015
● Definition - High Performance Computing (HPC)
● Any sufficiently large high performance computer
● Typically costs $500,000 or more
● As small as 10s of nodes up to 10,000s of nodes
● Creates some interesting scaling and implementation challenges for analytics
● Why analytics on HPC Systems?
● Scale
● Productivity
● Utilization
Packaging and Deployment
● Challenges
● HPC Systems are highly
controlled environments
● Users are granted the
minimum permissions
possible
● Many open source packages
have extensive dependencies
or expect users to bring in
their own
Solution - Containers
● An easy solution, right?
● HPC Sysadmins are really paranoid
● Docker still considered insecure by many
● NERSC Shifter
● An HPC-centric containerizer, used on our top-end systems
● Designed to scale out massively
● Forces the containerized process to run as the launching user's UID
● Can consume Docker images but has its own image gateway and
format
● Docker
● Currently used for our cluster systems
● Eventually will be used on our next generation supercomputers
Containers - Shifter vs Docker
● Both are open source so why choose Docker?
● https://github.com/NERSC/shifter
● https://github.com/docker
● Docker has a far more vibrant community
● Many of its shortcomings for HPC have been or are being addressed
● E.g. Container access to hardware devices like GPUs
● NVidia Docker - https://github.com/NVIDIA/nvidia-docker
● It's Open Container Initiative (OCI) compliant
● Docker can be used with other key technologies e.g.
Kubernetes
Orchestration
● For distributed applications we need something to tie the
containers together
● Also want to support multi-tenant isolation
● Kubernetes
● Fastest growing container orchestrator out there
● Open APIs and highly extensible
● Declaratively specify complex applications and self-service
configuration via APIs
● E.g. Deploying Apache Spark on Kubernetes using Bloomberg's
Kerberos support mods
● Biggest problem for us is networking!
Kubernetes Cluster Networking
● Kubernetes has a networking model that supports
customizable network providers
● Differing capabilities, bare networking through to network
traffic policy management
● E.g. isolate Tenant A from Tenant B
● Different providers use different approaches e.g.
● Flannel and Weave use VXLAN
● Cilium uses eBPF
● Calico and Romana use static routing
● Our Aries network doesn't support VLANs and our kernel
doesn't support eBPF!
● Therefore we chose Romana
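As a rough illustration of the tenant isolation mentioned above, the sketch below builds a Kubernetes NetworkPolicy manifest as a Python dictionary and prints it as JSON. It assumes one namespace per tenant (the name `tenant-a` and the policy name are hypothetical) and that the chosen network provider enforces NetworkPolicy objects; neither detail comes from the slides.

```python
import json


def tenant_isolation_policy(namespace):
    """Build a NetworkPolicy that only admits traffic from the same namespace.

    Assumption: one namespace per tenant. The policy name is illustrative.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "deny-from-other-namespaces", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector under "from" matches pods in this namespace
            # only, so ingress from other tenants' namespaces is not allowed.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    # "tenant-a" is a hypothetical namespace name.
    print(json.dumps(tenant_isolation_policy("tenant-a"), indent=2))
```

Because Kubernetes accepts JSON manifests as well as YAML, the printed output could be applied directly, for example by piping it to `kubectl apply -f -`.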
Input/Output Challenges
● Lots of analytics
frameworks e.g. Apache
Hadoop Map/Reduce,
Apache Spark rely on local
storage
● E.g. temporary scratch space
● BUT many HPC systems
have no local storage
[Diagram: Spark shuffle. Labels: map task thread, block manager, disk, reduce task thread, request (TCP), Spark scheduler, shuffle write, shuffle read, metadata]
Virtual Local Storage
● tmpfs/ramfs
● Standard temporary file system for *nix OSes
● Stored in RAM
● tmpfs is preferred as it can be given a maximum size
● BUT competes with your analytics frameworks for memory
● Use the system's parallel file system e.g. Lustre
● Unfortunately these aren't designed for small file IO
● Deadlocks the metadata servers causing significant slowdown for
everyone!
● Using Linux loopback mounts to solve this
● Short lived files never leave OS disk cache i.e. still in memory
● OS can flush OS disk cache as needed
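The choice between these scratch locations can be sketched as a small helper that picks the first candidate directory with enough free space and points Spark at it through the `spark.local.dir` property. This is only a sketch under assumptions: the candidate paths, their order and the free-space threshold are hypothetical, and a real deployment may set the location through the cluster manager instead.

```python
import shutil
import tempfile


def pick_scratch_dir(candidates, min_free_bytes):
    """Return the first candidate directory with at least min_free_bytes free.

    Candidates that do not exist are skipped. Returns None if nothing fits.
    """
    for path in candidates:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            continue  # path is missing or unreadable on this node
        if free >= min_free_bytes:
            return path
    return None


def spark_scratch_conf(scratch_dir):
    """Spark takes its shuffle and spill location from spark.local.dir."""
    return {"spark.local.dir": scratch_dir}


if __name__ == "__main__":
    # Hypothetical preference order: a loopback-mounted file system first,
    # then tmpfs (/dev/shm), then the default temporary directory.
    candidates = ["/mnt/loopback-scratch", "/dev/shm", tempfile.gettempdir()]
    chosen = pick_scratch_dir(candidates, min_free_bytes=1)
    print(spark_scratch_conf(chosen))
```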
Python Data Science
● Challenges
● Managing dependencies
● Compute nodes typically have
no external network
connectivity
● Distributed computation
● Maximising hardware
utilization for performance
Dependency Management
● Using Anaconda to solve this
● Have to resolve the environments up front
● Compute nodes can't access external network
● Also need to project environments onto compute nodes
as needed
● For containers use volume mounts and environment variable
injection into the container
● For standard jobs need to store environments on a file system
visible to compute nodes
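For the container case, the volume mount plus environment variable injection might look roughly like the sketch below, which builds `docker run` flags (`-v` and `-e`) for a pre-resolved environment. All paths are hypothetical, and setting `CONDA_PREFIX` and `PATH` only approximates what a real Conda activation does (it skips any activate.d scripts, for example).

```python
import posixpath


def conda_env_vars(env_prefix, base_path):
    """Environment variables that approximate activating a Conda environment.

    env_prefix: where the pre-resolved environment is mounted in the container.
    base_path:  the PATH value the container would otherwise start with.
    """
    return {
        "CONDA_PREFIX": env_prefix,
        # Put the environment's bin directory ahead of everything else.
        "PATH": ":".join([posixpath.join(env_prefix, "bin"), base_path]),
    }


def docker_run_args(host_env_dir, env_prefix, base_path):
    """Build docker run flags: a read-only volume mount plus the variables."""
    args = ["-v", "{}:{}:ro".format(host_env_dir, env_prefix)]
    for name, value in sorted(conda_env_vars(env_prefix, base_path).items()):
        args += ["-e", "{}={}".format(name, value)]
    return args


if __name__ == "__main__":
    # Hypothetical locations on the shared file system and in the container.
    print(docker_run_args("/lus/envs/analysis", "/opt/envs/analysis", "/usr/bin:/bin"))
```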
Distributed Computation - Dask
● Distributed work
scheduling library for
Python
● Integrates with
common data science
libraries
● Numpy, Pandas,
SciKit-Learn
● Familiar Pythonic API
for scaling out
workloads
● Can be installed as part
of the Conda
environment
>>> from dask.distributed import Client
>>> client = Client(scheduler_file='/path/to/scheduler.json')
>>> def square(x):
...     return x ** 2
...
>>> def neg(x):
...     return -x
...
>>> A = client.map(square, range(10))
>>> B = client.map(neg, A)
>>> total = client.submit(sum, B)
>>> total  # Function hasn't yet completed
<Future: status: waiting, key: sum-58999c52e0fa35c7d7346c098f5085c7>
>>> total.result()
-285
>>> client.gather(A)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Dask - Scheduler & Environment Setup
● Using Dask requires running scheduler and worker
processes on our compute resources
● We don't necessarily know the set of physical nodes we will get
ahead of time
● Dask provides a scheduler file mechanism for this
● Need to start a scheduler and worker on each physical
node
● We use the entry point scripts of our container images to do this
● Also need to integrate with the user's Conda environment
● MUST activate the volume mounted environments prior to
starting Dask
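A container entry point along these lines could be sketched as follows. The sketch assumes the `dask-scheduler` and `dask-worker` command line tools with their `--scheduler-file` option, that exactly one node is chosen to host the scheduler (how it is chosen is left open), and that the scheduler file is JSON with an `address` field. The helper names are hypothetical, not taken from the actual images.

```python
import json
import os
import time


def dask_commands(scheduler_file, run_scheduler):
    """Commands an entry point script might launch on one node.

    Every node starts a worker; the node chosen to host the scheduler
    starts that as well.
    """
    commands = []
    if run_scheduler:
        commands.append(["dask-scheduler", "--scheduler-file", scheduler_file])
    commands.append(["dask-worker", "--scheduler-file", scheduler_file])
    return commands


def wait_for_scheduler(scheduler_file, timeout=60.0, poll=0.5):
    """Poll until the scheduler file exists, then return its address field."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(scheduler_file):
            try:
                with open(scheduler_file) as handle:
                    return json.load(handle)["address"]
            except (ValueError, KeyError):
                pass  # file may be partially written; retry
        time.sleep(poll)
    raise TimeoutError("no scheduler file at {}".format(scheduler_file))
```

Each command list could be handed to `subprocess.Popen` once the user's Conda environment has been activated, so that the Dask processes see the same packages as the user's code.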
Maximising Performance
● To take full advantage of HPC hardware we need to use
appropriately optimized libraries
● Option 1 - Custom Anaconda Channels
● E.g. Intel Distribution for Python
● Uses Intel AVX and MKL (Math Kernel Library) underneath popular
libraries
● Option 2 - ABI Injection
● Where a library uses a defined ABI, e.g. mpi4py, ensure it is
compiled against the generic ABI
● At runtime use volume mounts to mount the platform specific
ABI implementation at the appropriate location
● E.g. Cray MPICH, Open MPI, Intel MPI
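The runtime half of ABI injection can be sketched as a helper that builds read-only `docker run -v` mounts, overlaying the host's MPI libraries onto the paths where the image keeps its generic build. The directories and the library file name in the example are illustrative assumptions; the real names depend on the MPI implementation and on how the image was built.

```python
import posixpath


def abi_mounts(host_lib_dir, container_lib_dir, library_names):
    """Build docker run -v flags that overlay host MPI libraries read-only.

    Each library the image built against the generic ABI is shadowed by the
    platform-specific file of the same name taken from the host.
    """
    args = []
    for name in library_names:
        source = posixpath.join(host_lib_dir, name)
        target = posixpath.join(container_lib_dir, name)
        args += ["-v", "{}:{}:ro".format(source, target)]
    return args


if __name__ == "__main__":
    # Hypothetical host directory and library name, for illustration only.
    print(abi_mounts("/opt/host-mpi/lib", "/usr/lib", ["libmpi.so.12"]))
```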
Machine Learning
● Challenges
● How do we take advantage of
both GPUs and CPUs?
● Efficiently scale out onto
distributed systems
GPUs vs CPUs
● GPUs typically best suited
to training models
● More time and resource
intensive
● CPUs typically best suited
to inference
● i.e. Make predictions using a
trained model
● Need different hardware optimisations for each
● Don't necessarily know where our code will run ahead of time
● Therefore compile separately for each environment and
select desired build via container entry point script
● This requires a container runtime that supports GPUs e.g. Shifter or
NVidia Docker
● NB - We're trading off image size for performance
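Selecting the build in an entry point script might look like the sketch below. It assumes the image ships both builds side by side under a hypothetical layout (`<build_root>/gpu` and `<build_root>/cpu`) and uses the presence of an NVIDIA device node as a rough GPU check; a real script may well probe the hardware differently.

```python
import os
import posixpath


def select_build(build_root, gpu_device="/dev/nvidia0", exists=os.path.exists):
    """Pick the GPU or CPU build directory depending on what the node exposes.

    The exists parameter is injectable so the choice can be exercised
    without real hardware.
    """
    variant = "gpu" if exists(gpu_device) else "cpu"
    return posixpath.join(build_root, variant)


if __name__ == "__main__":
    # "/opt/ml-builds" is a hypothetical location inside the image.
    print(select_build("/opt/ml-builds"))
```

The entry point could then put the chosen directory on `PYTHONPATH` before handing over to the user's command.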
Distributed Training
● Framework support for
distributed training is not
well optimized
● Typically TCP/IP based
protocols e.g. gRPC
● Esoteric to configure
● Want to utilize full
capabilities of the network
● Uber's Horovod
● https://github.com/uber/horovod
● Uses MPI to better leverage the
network (InfiniBand/RoCE)
● Minor changes needed to your
ML scripts
● Interleaves computation and
communication
● Uses more efficient MPI
collectives where possible
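To make the role of the collectives concrete, the sketch below simulates, in a single process, the result an averaging allreduce delivers: every worker contributes its local gradients and every worker ends up with the same element-wise mean. It uses neither MPI nor the Horovod API and says nothing about how Horovod implements the operation; the function name is hypothetical.

```python
def allreduce_average(per_worker_gradients):
    """Return what every worker holds after an averaging allreduce.

    per_worker_gradients: one equal-length list of floats per worker.
    """
    if not per_worker_gradients:
        raise ValueError("need at least one worker")
    n_workers = len(per_worker_gradients)
    length = len(per_worker_gradients[0])
    if any(len(g) != length for g in per_worker_gradients):
        raise ValueError("all workers must hold gradients of the same shape")
    # Element-wise mean across workers.
    averaged = [sum(column) / n_workers for column in zip(*per_worker_gradients)]
    # Every worker receives an identical copy of the result.
    return [list(averaged) for _ in range(n_workers)]


if __name__ == "__main__":
    print(allreduce_average([[1.0, 2.0], [3.0, 6.0]]))
```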
Horovod vs gRPC Performance
https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy#slide15
Conclusions
● Scaling open source analytics has some non-obvious
gotchas
● Often assumes a traditional cluster environment
● Most challenges revolve around IO and Networking
● There are some promising open source efforts to solve these
more thoroughly
● Our Roadmap
● Looking to have stock Docker running on next generation
systems
● Leverage more of Kubernetes' features to provide a cloud-like,
self-service HPC model
Questions?
rvesse@cray.com
https://cray.box.com/v/sw-data-july-2018
References - Containers
Tool             Project Homepage/Repository
NERSC Shifter    https://github.com/NERSC/shifter
Docker           https://docker.com
NVidia Docker    https://github.com/NVIDIA/nvidia-docker
Kubernetes       https://kubernetes.io
Flannel          https://coreos.com/flannel
Weave            https://www.weave.works
Cilium           https://cilium.io
Calico           https://www.projectcalico.org
Romana           https://romana.io
References - Analytics & Data Science
Tool                            Project Homepage/Repository
Apache Hadoop                   https://hadoop.apache.org
Anaconda                        https://conda.io/docs/
Dask                            http://dask.pydata.org/en/latest/
NumPy                           http://www.numpy.org
xarray                          http://xarray.pydata.org/en/stable/
SciPy                           https://www.scipy.org
Pandas                          https://pandas.pydata.org
mpi4py                          http://mpi4py.scipy.org/docs/
Intel Distribution of Python    https://software.intel.com/en-us/distribution-for-python
References - Machine Learning
Tool          Project Homepage/Repository
TensorFlow    https://www.tensorflow.org
gRPC          https://grpc.io
Horovod       https://github.com/uber/horovod
Ad

More Related Content

What's hot (20)

LAS16-TR06: Remoteproc & rpmsg development
LAS16-TR06: Remoteproc & rpmsg developmentLAS16-TR06: Remoteproc & rpmsg development
LAS16-TR06: Remoteproc & rpmsg development
Linaro
 
Foss Gadgematics
Foss GadgematicsFoss Gadgematics
Foss Gadgematics
Bud Siddhisena
 
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Linaro
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
Linaro
 
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
Linaro
 
PGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGai
PGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGaiPGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGai
PGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGai
Equnix Business Solutions
 
LAS16-109: LAS16-109: The status quo and the future of 96Boards
LAS16-109: LAS16-109: The status quo and the future of 96BoardsLAS16-109: LAS16-109: The status quo and the future of 96Boards
LAS16-109: LAS16-109: The status quo and the future of 96Boards
Linaro
 
LAS16-310: Introducing the first 96Boards TV Platform: Poplar by Hisilicon
LAS16-310: Introducing the first 96Boards TV Platform: Poplar by HisiliconLAS16-310: Introducing the first 96Boards TV Platform: Poplar by Hisilicon
LAS16-310: Introducing the first 96Boards TV Platform: Poplar by Hisilicon
Linaro
 
ODP Presentation LinuxCon NA 2014
ODP Presentation LinuxCon NA 2014ODP Presentation LinuxCon NA 2014
ODP Presentation LinuxCon NA 2014
Michael Christofferson
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
AMD Developer Central
 
LAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoTLAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoT
Linaro
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
AMD Developer Central
 
Programming the Network Data Plane
Programming the Network Data PlaneProgramming the Network Data Plane
Programming the Network Data Plane
C4Media
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
AMD Developer Central
 
LAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
LAS16-500: The Rise and Fall of Assembler and the VGIC from HellLAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
LAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
Linaro
 
LAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development LifecycleLAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development Lifecycle
Linaro
 
LAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android NLAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android N
Linaro
 
LAS16-209: Finished and Upcoming Projects in LMG
LAS16-209: Finished and Upcoming Projects in LMGLAS16-209: Finished and Upcoming Projects in LMG
LAS16-209: Finished and Upcoming Projects in LMG
Linaro
 
DPDK In Depth
DPDK In DepthDPDK In Depth
DPDK In Depth
Kernel TLV
 
LAS16-TR06: Remoteproc & rpmsg development
LAS16-TR06: Remoteproc & rpmsg developmentLAS16-TR06: Remoteproc & rpmsg development
LAS16-TR06: Remoteproc & rpmsg development
Linaro
 
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Linaro
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
Linaro
 
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
Linaro
 
PGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGai
PGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGaiPGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGai
PGConf.ASIA 2019 Bali - Keynote Speech 3 - Kohei KaiGai
Equnix Business Solutions
 
LAS16-109: LAS16-109: The status quo and the future of 96Boards
LAS16-109: LAS16-109: The status quo and the future of 96BoardsLAS16-109: LAS16-109: The status quo and the future of 96Boards
LAS16-109: LAS16-109: The status quo and the future of 96Boards
Linaro
 
LAS16-310: Introducing the first 96Boards TV Platform: Poplar by Hisilicon
LAS16-310: Introducing the first 96Boards TV Platform: Poplar by HisiliconLAS16-310: Introducing the first 96Boards TV Platform: Poplar by Hisilicon
LAS16-310: Introducing the first 96Boards TV Platform: Poplar by Hisilicon
Linaro
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
AMD Developer Central
 
LAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoTLAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoT
Linaro
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
AMD Developer Central
 
Programming the Network Data Plane
Programming the Network Data PlaneProgramming the Network Data Plane
Programming the Network Data Plane
C4Media
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
AMD Developer Central
 
LAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
LAS16-500: The Rise and Fall of Assembler and the VGIC from HellLAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
LAS16-500: The Rise and Fall of Assembler and the VGIC from Hell
Linaro
 
LAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development LifecycleLAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development Lifecycle
Linaro
 
LAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android NLAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android N
Linaro
 
LAS16-209: Finished and Upcoming Projects in LMG
LAS16-209: Finished and Upcoming Projects in LMGLAS16-209: Finished and Upcoming Projects in LMG
LAS16-209: Finished and Upcoming Projects in LMG
Linaro
 

Similar to Leveraging open source for large scale analytics (20)

Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applications
Rogue Wave Software
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Containerized MySQL OpenWorld talk
Containerized MySQL OpenWorld talkContainerized MySQL OpenWorld talk
Containerized MySQL OpenWorld talk
Patrick Galbraith
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
SeungYong Oh
 
Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)
Ricardo Amaro
 
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
OpenShift Origin
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Muga Nishizawa
 
Benchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutionsBenchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutions
Zhidong Yu
 
Crossplane Graduation Review related presentation
Crossplane Graduation Review related presentationCrossplane Graduation Review related presentation
Crossplane Graduation Review related presentation
kedofef453
 
Performance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and CassandraPerformance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and Cassandra
Dave Bechberger
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
MayaData Inc
 
Host Data Plane Acceleration: SmartNIC Deployment Models
Host Data Plane Acceleration: SmartNIC Deployment ModelsHost Data Plane Acceleration: SmartNIC Deployment Models
Host Data Plane Acceleration: SmartNIC Deployment Models
Netronome
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For Architects
Kevin Brockhoff
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in Containerization
Ryan Hunter
 
Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)
Rogue Wave Software
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use Cases
All Things Open
 
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
ArangoDB Database
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
NETWAYS
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509
Linaro
 
Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applications
Rogue Wave Software
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Containerized MySQL OpenWorld talk
Containerized MySQL OpenWorld talkContainerized MySQL OpenWorld talk
Containerized MySQL OpenWorld talk
Patrick Galbraith
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
SeungYong Oh
 
Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)
Ricardo Amaro
 
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
OpenShift Origin
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Muga Nishizawa
 
Benchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutionsBenchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutions
Zhidong Yu
 
Crossplane Graduation Review related presentation
Crossplane Graduation Review related presentationCrossplane Graduation Review related presentation
Crossplane Graduation Review related presentation
kedofef453
 
Performance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and CassandraPerformance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and Cassandra
Dave Bechberger
 



Leveraging open source for large scale analytics

  • 1. Leveraging Open Source for Large Scale Analytics on HPC Systems
      Rob Vesse, Software Engineer, Cray Inc
  • 2. C O M P U T E | S T O R E | A N A L Y Z E
      Overview
      ● Background
      ● Challenges
      ● Packaging and Deployment
      ● Input/Output
      ● Scaling Analytics
      ● Python Data Science
      ● Machine Learning
      Slides: https://cray.box.com/v/sw-data-july-2018
      Copyright Cray Inc 2018
  • 3. Legal Disclaimer
      Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.
  • 4. Background
      ● About Me
      ● Software Engineer in the Analytics R&D Group
      ● Develop hardware and software solutions across Cray's product portfolio
      ● Primarily focused on integrating open source software into a coherent, user-friendly product
      ● Involved in open source for ~15 years, committer at the Apache Software Foundation since 2012, and member since 2015
      ● Definition - High Performance Computing (HPC)
      ● Any sufficiently large high performance computer
      ● Typically $500,000 plus
      ● As small as 10s of nodes up to 10,000s of nodes
      ● Creates some interesting scaling and implementation challenges for analytics
      ● Why analytics on HPC Systems?
      ● Scale
      ● Productivity
      ● Utilization
  • 5. Packaging and Deployment
      ● Challenges
      ● HPC Systems are highly controlled environments
      ● Users are granted the minimum permissions possible
      ● Many open source packages have extensive dependencies or expect users to bring in their own
  • 6. Solution - Containers
      ● An easy solution, right?
      ● HPC Sysadmins are really paranoid
      ● Docker is still considered insecure by many
      ● NERSC Shifter
      ● An HPC-centric containerizer, used on our top end systems
      ● Designed to scale out massively
      ● Forces the containerized process to run as the launching user's UID
      ● Can consume Docker images but has its own image gateway and format
      ● Docker
      ● Currently used for our cluster systems
      ● Eventually will be used on our next generation supercomputers
  • 7. Containers - Shifter vs Docker
      ● Both are open source so why choose Docker?
      ● https://github.com/NERSC/shifter
      ● https://github.com/docker
      ● Docker has a far more vibrant community
      ● Many of its shortcomings for HPC have been or are being addressed
      ● E.g. Container access to hardware devices like GPUs
      ● NVIDIA Docker - https://github.com/NVIDIA/nvidia-docker
      ● It's Open Container Initiative (OCI) compliant
      ● Docker can be used with other key technologies e.g. Kubernetes
  • 8. Orchestration
      ● For distributed applications we need something to tie the containers together
      ● Also want to support multi-tenant isolation
      ● Kubernetes
      ● Fastest growing container orchestrator out there
      ● Open APIs and highly extensible
      ● Declaratively specify complex applications and self-service configuration via APIs
      ● E.g. Deploying Apache Spark on Kubernetes using Bloomberg's Kerberos support mods
      ● Biggest problem for us is networking!
  • 9. Kubernetes Cluster Networking
      ● Kubernetes has a networking model that supports customizable network providers
      ● Capabilities differ, from bare networking through to network traffic policy management
      ● E.g. isolate Tenant A from Tenant B
      ● Different providers use different approaches e.g.
      ● Flannel and Weave use VXLAN
      ● Cilium uses eBPF
      ● Calico and Romana use static routing
      ● Our Aries network doesn't support VLANs and our kernel doesn't support eBPF!
      ● Therefore we chose Romana
  • 10. Input/Output Challenges
      ● Lots of analytics frameworks, e.g. Apache Hadoop Map/Reduce and Apache Spark, rely on local storage
      ● E.g. temporary scratch space
      ● BUT many HPC systems have no local storage
      [Figure: Spark shuffle diagram. Labels: Map task thread, Block manager, Disk, Reduce task thread, Request, TCP, Spark Scheduler, Shuffle write, Shuffle read, Metadata]
  • 11. Virtual Local Storage
      ● tmpfs/ramfs
      ● Standard temporary file system for *nix OSes
      ● Stored in RAM
      ● tmpfs is preferred as it can be given a maximum size
      ● BUT competes with your analytics frameworks for memory
      ● Use the system's parallel file system e.g. Lustre
      ● Unfortunately these aren't designed for small file IO
      ● Deadlocks the metadata servers, causing significant slowdown for everyone!
      ● Using Linux loopback mounts to solve this
      ● Short-lived files never leave the OS disk cache, i.e. they are still in memory
      ● OS can flush the OS disk cache as needed
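
The loopback approach on this slide can be sketched as a short sequence of standard Linux commands: create one large backing file on the parallel file system, format it, and mount it through a loop device so that the metadata servers see a single file instead of many small ones. The sketch below only builds the command lists and does not run them. The paths, the 16 GB size and the choice of ext4 are illustrative assumptions, not details from the talk, and the mount step normally needs root.

```python
from typing import List


def loopback_scratch_commands(image_path: str, mount_point: str, size_gb: int) -> List[List[str]]:
    """Return the commands that would create and mount a loopback scratch file system.

    The image file lives on the parallel file system, so its metadata servers
    track one large file rather than many small temporary files.
    """
    return [
        # Create a sparse file of the requested size to back the file system.
        ["truncate", "-s", f"{size_gb}G", image_path],
        # Format the backing file; -F is needed because it is not a block device.
        ["mkfs.ext4", "-F", "-q", image_path],
        # Make sure the mount point exists on the compute node.
        ["mkdir", "-p", mount_point],
        # Mount the image through a loop device (normally requires root).
        ["mount", "-o", "loop", image_path, mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in loopback_scratch_commands("/lus/scratch/user/spark.img", "/tmp/spark-local", 16):
        print(" ".join(cmd))
```

A framework's scratch directory would then be pointed at the mount point; how that is configured depends on the framework.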
  • 12. Python Data Science
      ● Challenges
      ● Managing dependencies
      ● Compute nodes typically have no external network connectivity
      ● Distributed computation
      ● Maximising hardware utilization for performance
  • 13. Dependency Management
      ● Using Anaconda to solve this
      ● Have to resolve the environments up front
      ● Compute nodes can't access external network
      ● Also need to project environments onto compute nodes as needed
      ● For containers use volume mounts and environment variable injection into the container
      ● For standard jobs need to store environments on a file system visible to compute nodes
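
As a minimal sketch of the container case above, the function below builds a `docker run` argument list that bind-mounts a pre-resolved Conda environment read-only and injects its location as an environment variable. The variable name `USER_CONDA_ENV`, the mount location and the image name are hypothetical; the talk does not say which names Cray's images use.

```python
from typing import List


def container_args(image: str, env_dir: str, mount_at: str = "/opt/conda-env") -> List[str]:
    """Build a `docker run` argument list that projects a Conda environment into a container."""
    return [
        "docker", "run", "--rm",
        # Read-only bind mount of the environment that was resolved up front.
        "-v", f"{env_dir}:{mount_at}:ro",
        # Tell the entry point script where the environment was mounted
        # (USER_CONDA_ENV is a made-up name for this sketch).
        "-e", f"USER_CONDA_ENV={mount_at}",
        image,
    ]


if __name__ == "__main__":
    print(" ".join(container_args("analytics:latest", "/lus/home/user/envs/myenv")))
```

Because the environment is resolved before the job starts, nothing inside the container needs external network access.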
  • 14. Distributed Computation - Dask
      ● Distributed work scheduling library for Python
      ● Integrates with common data science libraries
      ● NumPy, Pandas, scikit-learn
      ● Familiar Pythonic API for scaling out workloads
      ● Can be installed as part of the Conda environment
      >>> from dask.distributed import Client
      >>> client = Client(scheduler_file='/path/to/scheduler.json')
      >>> def square(x): return x ** 2
      >>> def neg(x): return -x
      >>> A = client.map(square, range(10))
      >>> B = client.map(neg, A)
      >>> total = client.submit(sum, B)
      >>> total  # Function hasn't yet completed
      <Future: status: waiting, key: sum-58999c52e0fa35c7d7346c098f5085c7>
      >>> total.result()
      -285
      >>> client.gather(A)
      [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
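
For readers without a Dask cluster to hand, the same map, map, sum pipeline can be tried locally with the standard library's `concurrent.futures`. This is a stand-in rather than Dask itself: it runs on threads in one process, whereas the Dask client distributes the tasks across worker nodes. It shows the same futures pattern, where `submit` returns a handle and `result()` blocks until the value is ready.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x ** 2


def neg(x):
    return -x


def run_pipeline(n: int = 10):
    """Mirror the map -> map -> sum pipeline from the slide on local threads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        squares = list(pool.map(square, range(n)))  # analogous to client.map(square, ...)
        negated = list(pool.map(neg, squares))      # analogous to client.map(neg, A)
        total = pool.submit(sum, negated)           # a Future, like client.submit(sum, B)
        return squares, total.result()              # result() blocks until the task finishes


if __name__ == "__main__":
    squares, total = run_pipeline()
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    print(total)    # -285
```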
  • 15. Dask - Scheduler & Environment Setup
      ● Using Dask requires running scheduler and worker processes on our compute resources
      ● We don't necessarily know the set of physical nodes we will get ahead of time
      ● Dask provides a scheduler file mechanism for this
      ● Need to start a scheduler and worker on each physical node
      ● We use the entry point scripts of our container images to do this
      ● Also need to integrate with the user's Conda environment
      ● MUST activate the volume mounted environments prior to starting Dask
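
The scheduler file mechanism works because the scheduler writes its connection details to a file on a shared file system, and workers and clients read that file instead of needing a hostname up front. Below is a rough sketch of the waiting side, as an entry point script might do it. It assumes the file is JSON with an `address` field, which matches what Dask writes as far as I know, but treat the exact layout as an assumption. The function name, timeout and polling interval are illustrative.

```python
import json
import os
import time


def wait_for_scheduler(scheduler_file: str, timeout: float = 60.0, poll: float = 0.5) -> str:
    """Block until the scheduler file exists and return the scheduler address in it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(scheduler_file):
            try:
                with open(scheduler_file) as fh:
                    return json.load(fh)["address"]
            except (json.JSONDecodeError, KeyError):
                pass  # The file may be only partially written; retry.
        time.sleep(poll)
    raise TimeoutError(f"No usable scheduler file at {scheduler_file} after {timeout}s")
```

In practice the `dask-scheduler` and `dask-worker` commands both accept a `--scheduler-file` option, so a worker can be handed the path directly; an explicit wait like this is mainly useful when the entry point needs to sequence other setup, such as activating the Conda environment, around scheduler start-up.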
  • 16. Maximising Performance
      ● To take full advantage of HPC hardware we need to use appropriately optimized libraries
      ● Option 1 - Custom Anaconda Channels
      ● E.g. Intel Distribution for Python
      ● Uses Intel AVX and MKL (Math Kernel Library) underneath popular libraries
      ● Option 2 - ABI Injection
      ● Where a library uses a defined ABI, e.g. mpi4py, ensure it is compiled against the generic ABI
      ● At runtime use volume mounts to mount the platform specific ABI implementation at the appropriate location
      ● E.g. Cray MPICH, Open MPI, Intel MPI
  • 17. Machine Learning
      ● Challenges
      ● How do we take advantage of both GPUs and CPUs?
      ● Efficiently scale out onto distributed systems
  • 18. GPUs vs CPUs
      ● GPUs typically best suited to training models
      ● More time and resource intensive
      ● CPUs typically best suited to inference
      ● i.e. Make predictions using a trained model
      ● Need different hardware optimisations for each
      ● Don't necessarily know where our code will run ahead of time
      ● Therefore compile separately for each environment and select desired build via container entry point script
      ● This requires a container runtime that supports GPUs e.g. Shifter or NVIDIA Docker
      ● NB - We're trading off image size for performance
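
The "select desired build via container entry point script" step could look roughly like the sketch below, which picks between two pre-compiled builds shipped in the same image. The detection rule (checking for the `/dev/nvidia0` device node) and the `/opt/ml/gpu` and `/opt/ml/cpu` layout are assumptions for illustration; the talk does not describe how Cray's entry point actually detects GPUs.

```python
import os


def select_build(root: str = "/opt/ml", gpu_device: str = "/dev/nvidia0") -> str:
    """Pick the GPU or CPU build directory depending on whether a GPU device node is visible."""
    variant = "gpu" if os.path.exists(gpu_device) else "cpu"
    return os.path.join(root, variant)


if __name__ == "__main__":
    # An entry point script would typically put the selected build on the
    # Python or library path before starting the user's program.
    print(select_build())
```

Shipping both builds is what makes the image larger, which is the size-for-performance trade-off noted on the slide.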
  • 19. Distributed Training
      ● Framework support for distributed training is not well optimized
      ● Typically TCP/IP based protocols e.g. gRPC
      ● Esoteric to configure
      ● Want to utilize full capabilities of the network
      ● Uber's Horovod
      ● https://github.com/uber/horovod
      ● Uses MPI to better leverage the network (InfiniBand/RoCE)
      ● Minor changes needed to your ML scripts
      ● Interleaves computation and communication
      ● Uses more efficient MPI collectives where possible
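
To make the role of MPI collectives concrete, the sketch below simulates in plain Python what an allreduce averaging step computes during data-parallel training: every worker contributes its local gradient and every worker receives the same element-wise mean. It is a single-process illustration of the result only. It does not use MPI or Horovod and does not model how the data moves over the network, which is where a real collective gains its efficiency.

```python
from typing import List


def allreduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an allreduce that averages gradients across workers.

    Each inner list is one worker's gradient vector; the return value gives
    every worker its own copy of the element-wise mean.
    """
    n_workers = len(worker_grads)
    mean = [sum(column) / n_workers for column in zip(*worker_grads)]
    return [list(mean) for _ in range(n_workers)]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 6.0]]  # two workers, two parameters each
    print(allreduce_mean(grads))      # [[2.0, 4.0], [2.0, 4.0]]
```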
  • 20. Horovod vs gRPC Performance
      https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy#slide15
  • 21. Conclusions
      ● Scaling open source analytics has some non-obvious gotchas
      ● Often assumes a traditional cluster environment
      ● Most challenges revolve around IO and Networking
      ● There are some promising open source efforts to solve these more thoroughly
      ● Our Roadmap
      ● Looking to have stock Docker running on next generation systems
      ● Leverage more of Kubernetes' features to provide a cloud-like self-service HPC model
  • 22. Questions?
      [email protected]
      https://cray.box.com/v/sw-data-july-2018
  • 23. References - Containers
      Tool | Project Homepage/Repository
      NERSC Shifter | https://github.com/NERSC/shifter
      Docker | https://docker.com
      NVIDIA Docker | https://github.com/NVIDIA/nvidia-docker
      Kubernetes | https://kubernetes.io
      Flannel | https://coreos.com/flannel
      Weave | https://www.weave.works
      Cilium | https://cilium.io
      Calico | https://www.projectcalico.org
      Romana | https://romana.io
  • 24. References - Analytics & Data Science
      Tool | Project Homepage/Repository
      Apache Hadoop | https://hadoop.apache.org
      Anaconda | https://conda.io/docs/
      Dask | http://dask.pydata.org/en/latest/
      NumPy | http://www.numpy.org
      xarray | http://xarray.pydata.org/en/stable/
      SciPy | https://www.scipy.org
      Pandas | https://pandas.pydata.org
      mpi4py | http://mpi4py.scipy.org/docs/
      Intel Distribution of Python | https://software.intel.com/en-us/distribution-for-python
  • 25. References - Machine Learning
      Tool | Project Homepage/Repository
      TensorFlow | https://www.tensorflow.org
      gRPC | https://grpc.io
      Horovod | https://github.com/uber/horovod