The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111.
Heterogeneous HPC Computing in the DeepHealth Project
José Flich (UPV)
Mónica Caballero (everis)
European Big Data Value Forum (EBDVF) 2019
15 October 2019, Helsinki
About DeepHealth
Aim & Goals
• Facilitate the daily work and increase the productivity of medical personnel and IT professionals in image processing and in the use and training of predictive models, without the need to combine numerous tools.
• Offer a unified framework adapted to exploit underlying heterogeneous HPC and Big Data architectures, supporting state-of-the-art and next-generation Deep Learning (AI) and Computer Vision algorithms to enhance European-based medical software platforms.
• Put HPC computing power at the service of biomedical applications with DL needs and, through an interdisciplinary approach, apply DL techniques to large and complex biomedical image datasets to support new and more efficient ways of diagnosing, monitoring and treating diseases.
Duration: 36 months
Starting date: Jan 2019
Budget: €14,642,366
EU funding: €12,774,824
21 partners from 9 countries: research centers, health organizations, large industries and SMEs
About DeepHealth
Expected results
• The DeepHealth toolkit: free and open-source software with two core technology libraries and a dedicated front-end.
  • EDDL: the European Distributed Deep Learning Library
  • ECVL: the European Computer Vision Library
• Ready to run algorithms on hybrid HPC + Big Data architectures with heterogeneous hardware
• Seven biomedical and AI software platforms will integrate the DeepHealth libraries to improve their potential.
Use cases
• 14 pilot test-beds in 3 areas:
  • Neurological diseases
  • Tumor detection and early cancer prediction
  • Digital pathology and automated image annotation
• The pilots will make it possible to train models and evaluate the performance of the proposed solutions in terms of time and accuracy.
DeepHealth HPC Goals
DeepHealth Goals
• Develop a European Distributed Deep-Learning Library (EDDL)
• Develop a European Computer Vision Library (ECVL)
• Adapt EDDL/ECVL to HPC infrastructure
  • Heterogeneous architectures
• Apply the EDDL/ECVL to 7 European platforms for medical applications
• Apply the DeepHealth solution to 14 use cases (pilots) for medical diagnosis
[Figure labels: development, adaptation, use]
HPC Goals and Related Challenges
• Adapt the EDDL and ECVL libraries to HPC infrastructure
  • Computation
    • CPUs, GPUs, FPGAs
  • Communication
    • Distribution of the training process
• KPI
  • 4X performance improvement and 7X better power efficiency for the target DeepHealth infrastructure with advanced HPC technologies (combining manycores with vector units, GPUs, FPGAs, and low-latency interconnects) compared to standard HPC infrastructure
Challenges
At different levels
[Figure: layered stack with use cases on top of three platforms, the EDDL and ECVL libraries, and a heterogeneous HPC layer (CPUs, GPUs, FPGAs, interconnect); markers numbered 1 to 4 annotate the layers]
• Develop EDDL/ECVL
• Adapt Platforms
• Adapt Use Cases
• Adapt HPC
  • computation, runtime, distribution, interconnect
Implementation challenge: adapting new libraries (for performance) as they are being implemented and tested
Types of Systems
Heterogeneity support
[Figure: example system configurations, each a group of nodes joined by an interconnect; the nodes combine CPUs, GPUs and FPGAs in different ways, including CPU-only, CPU+GPU, multi-GPU, CPU+FPGA and mixed nodes]
DeepHealth HPC Goals
• Reinvest in FET-HPC projects (MANGO)
• Large FPGA cluster for heterogeneous HPC Exploration
Target HPC Systems
MareNostrum 4
Total peak performance: 13.7 Pflops
• General Purpose Cluster: 11.15 Pflops (1.07.2017)
• CTE1-P9+Volta: 1.57 Pflops (1.03.2018)
• CTE2-Arm V8: 0.5 Pflops (????)
• CTE3-KNH?: 0.5 Pflops (????)
MareNostrum 1: 2004, 42.3 Tflops, 1st Europe / 4th World
MareNostrum 2: 2006, 94.2 Tflops, 1st Europe / 5th World
MareNostrum 3: 2012, 1.1 Pflops, 12th Europe / 36th World
MareNostrum 4: 2017, 11.1 Pflops, 2nd Europe / 13th World
[Figure labels: "New technologies" blocks shown alongside the systems]
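The total peak figure for MareNostrum 4 can be cross-checked against the four partitions listed on the slide. The short sketch below only adds up the numbers given above; the two 0.5 Pflops entries are the planned values shown with "(????)" dates, so the total is a projection rather than a measured figure.

```python
# Peak performance per MareNostrum 4 partition, in Pflops, as listed on the slide.
clusters_pflops = {
    "General Purpose Cluster": 11.15,
    "CTE1-P9+Volta": 1.57,
    "CTE2-Arm V8": 0.5,   # planned, no date on the slide
    "CTE3-KNH?": 0.5,     # planned, no date on the slide
}

total_pflops = sum(clusters_pflops.values())
general_purpose_share = clusters_pflops["General Purpose Cluster"] / total_pflops

print(f"total: {total_pflops:.2f} Pflops")
print(f"general purpose share: {general_purpose_share:.0%}")
```

The sum rounds to the 13.7 Pflops quoted as total peak performance, and it shows that the general purpose cluster accounts for most of the machine's peak.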
BSC HPC Infrastructures
• General Purpose Cluster (in production)
  • 48 racks with 3456 nodes, each with 2 Intel Xeon Platinum processors
  • Total of 11.15 PFLOPS in double precision
  • System with a total of 165,888 processor cores and 390 TB of main memory
  • 29th fastest supercomputer in the TOP500, 7th fastest supercomputer in Europe
• CTE1-P9+VOLTA (in production)
  • 54 nodes, each with 2 POWER9 processors, 4 Volta GPUs and 6.4 TB of NVMe storage
  • Total of 1.57 PFLOPS in double precision
  • Same node as the Sierra supercomputer at LLNL (2nd fastest supercomputer in the TOP500)
  • Suitable for HPC and Machine Learning workloads
BSC HPC Infrastructures
• CTE2-Arm v8 (to be deployed in 2020)
  • Same processor as in the future post-K supercomputer in Japan
  • Targets Exascale workloads: 2.7 TFLOPS in double precision, 5.4 TFLOPS in single precision, 10.8 TFLOPS in half precision (16 bits)
  • HPC and AI convergence: up to 21.6 TOPS in 8-bit integer precision
  • 7 nm technology; 48 cores; 4 stacks of 8 GB HBM2 (32 GB in total)
  • Novel 512-bit SVE extension with specific instructions for machine learning
  • Might be interesting as a cutting-edge system by the end of DeepHealth
• Mont-Blanc 3 prototype (in production)
  • 48 nodes, 2 processors/node (96 processors in total)
  • Cavium ThunderX2 processor: 32-core Arm v8, 4-way SMT, up to 2.5 GHz
  • Targets HPC workloads in datacenters
  • System with up to 3K cores and 12K threads
  • Liquid cooling
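The rounded figures on this slide follow from the per-node numbers. The sketch below recomputes the Mont-Blanc 3 core and thread counts, and the CTE2-Arm v8 peak rates under the assumption (read off the slide's own numbers, not stated there as a rule) that each halving of the operand width doubles the peak rate.

```python
# Mont-Blanc 3 prototype: numbers taken from the slide.
nodes, processors_per_node = 48, 2
cores_per_processor, smt_ways = 32, 4

processors = nodes * processors_per_node   # 96
cores = processors * cores_per_processor   # 3072, the "3K cores"
threads = cores * smt_ways                 # 12288, the "12K threads"

# CTE2-Arm v8: peak rate per operand width, starting from 2.7 TFLOPS at 64 bits.
# Assumption: peak scales with 64 / width, which matches 5.4, 10.8 and 21.6.
fp64_tflops = 2.7
peak_by_width = {bits: fp64_tflops * (64 // bits) for bits in (64, 32, 16, 8)}

print(processors, cores, threads)
for bits, rate in peak_by_width.items():
    print(f"{bits:>2}-bit operands: {rate:.1f} T(FL)OPS")
```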
MANGO prototype
From FET-HPC MANGO project
• 16 (interconnected) clusters, each with
  • One server node
  • 12 FPGAs (Lego system)
    • Xilinx 7-series, Zynq-7000, Kintex UltraScale+
    • Intel Stratix 10
  • DDR3, DDR4 pluggable memory modules
  • Connections: PCI Express Gen 2/3 lanes, 40 Gbps QSFP
[Figures: the prototype; one cluster]
PROD: Development of a customized FPGA-based PCIe Board
• Based on the latest Intel or Xilinx FPGA technology (TBD)
• High-bandwidth, low-latency PCIe interface for data exchange with the host
• Modular peripherals (memories, interfaces) - TBD
The DeepHealth Computing Infrastructure
Overview
[Figure: the DeepHealth SW architecture on top of the DeepHealth HPC and cloud HW resources. The labels that survive from the diagram are listed below.]
• API provided to ECVL and EDDL developers (WP2/WP3): programming models and access methods for EDDL and ECVL development
• Layer labels: Multiple Workloads Scheduling; Single Workload Scheduling; Container-based; (Parallel) Programming Models; HW
• Workloads: EDDL workload (e.g., training); EDDL workload (e.g., inference)
• Software components: Global Resource Manager (Slurm-based); non-functional requirements description; COMPSs; Distributed Programming Model (e.g., M/R, task-based); Cloud API; Parallel Programming Models (e.g., CUDA, OpenCL, OpenMP); Parallel Run-time; Mango Run-time; Netlist Partitioning; Vivado tools; N2D2 framework; OpenStack platform
• Hardware resources (grouped in the figure as DeepHealth HPC HW Resources and DeepHealth Cloud HW Resources): MareNostrum 4 (Intel); Arm ThunderX2; POWER9+Voltas Cluster; Mango Cluster; Tailored FPGA PCIe card; 1200 cores cluster (x86); Private Cloud (x86+NVIDIA T4); Private (NVIDIA) + Public Cloud
• Partner labels: BSC, UNITO, PROD, UPV, UNITO, TREE
• Caption: the DeepHealth computing infrastructure, including HPC and big-data cloud-based resources
COMPSs
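• Framework (programming model + runtime system) for developing parallel applications for distributed infrastructures
• Abstract model: exposes parallelism while hiding the infrastructure
• Agnostic of the computing platform
• Task-based programming model built on top of general-purpose sequential programming languages (Python, C, C++, Java)
Sequential code:
    def display(c):
        …
    def add(a, b, c):
        c = a + b  # schematic: stands for writing the sum into c
    for i in range(MSIZE):
        add(A[i], B[i], C[i])
    display(C)
The same code annotated as COMPSs tasks:
    from pycompss.api.task import task
    from pycompss.api.parameter import IN, INOUT, OUT

    @task(c=INOUT)  # display reads and updates c
    def display(c):
        …
    @task(a=IN, b=IN, c=OUT)  # add reads a and b, produces c
    def add(a, b, c):
        c = a + b  # schematic: stands for writing the sum into c
    for i in range(MSIZE):
        add(A[i], B[i], C[i])
    display(C)
[Figure: task graph with MSIZE independent add tasks feeding one display task]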
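The task graph on this slide (MSIZE independent add tasks, then one display task that needs all of their results) can be imitated with the Python standard library alone. The sketch below is not the COMPSs API and does not distribute anything across nodes; it uses a local thread pool as a stand-in runtime, and MSIZE, A and B are hypothetical example values.

```python
from concurrent.futures import ThreadPoolExecutor

MSIZE = 4  # hypothetical problem size


def add(a, b):
    """One independent task: no add call depends on any other."""
    return a + b


def display(c):
    """The final task: it can only run once every element of c exists."""
    return "C = " + ", ".join(str(x) for x in c)


def run(A, B):
    with ThreadPoolExecutor() as pool:
        # Submit all add tasks at once; the pool is free to run them in parallel.
        futures = [pool.submit(add, A[i], B[i]) for i in range(MSIZE)]
        # Waiting on every future plays the role of the data dependency
        # between the add tasks and display.
        C = [f.result() for f in futures]
    return display(C)


A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
summary = run(A, B)
print(summary)
```

In this stand-in the dependencies are written out by hand through futures; the point of the annotated version on the slide is that the parameter directions (IN, OUT, INOUT) give the runtime enough information to derive the same graph itself.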
EPFL: Multi-objective RM policies
• Power/performance/accuracy-aware runtime resource management (RM) policies
  • Automatic selection of the most efficient resources
  • Adding one new axis: accuracy!
• Heuristic, ML-based and hyper-heuristic RM policies (algorithms)
  • Single node: selection of accelerators (allocation), DVFS settings
  • Multiple nodes (Global RM of MANGO)
• Integrated with the DeepHealth SW stack
  • MANGO API + COMPSs + Slurm
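To make the idea of a power/performance/accuracy-aware policy concrete, here is a minimal heuristic sketch: treat accuracy as a constraint, then rank the remaining configurations by a weighted trade-off between throughput and power. This is an illustrative heuristic only, not the EPFL policies; the `Config` type, the weights and every number in the candidate list are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    throughput: float  # images per second, higher is better
    power_w: float     # watts, lower is better
    accuracy: float    # fraction of correct predictions, higher is better


def pick(configs, min_accuracy, w_perf=0.5, w_power=0.5):
    """Return the best configuration that meets the accuracy floor."""
    feasible = [c for c in configs if c.accuracy >= min_accuracy]
    if not feasible:
        raise ValueError("no configuration meets the accuracy requirement")
    # Normalize both axes so that the weights are comparable.
    max_tp = max(c.throughput for c in feasible)
    max_pw = max(c.power_w for c in feasible)

    def score(c):
        return w_perf * c.throughput / max_tp - w_power * c.power_w / max_pw

    return max(feasible, key=score)


# Hypothetical measurements for one inference workload.
candidates = [
    Config("cpu", 100.0, 150.0, 0.92),
    Config("gpu", 900.0, 300.0, 0.92),
    Config("fpga-int8", 400.0, 60.0, 0.89),
    Config("fpga-fp16", 300.0, 75.0, 0.915),
]

print(pick(candidates, min_accuracy=0.90).name)
print(pick(candidates, min_accuracy=0.85).name)
print(pick(candidates, min_accuracy=0.90, w_perf=1.0, w_power=0.0).name)
```

With these made-up numbers the choice changes with the accuracy floor and with the weights, which is the effect of adding accuracy as a third axis: a low-precision accelerator is only eligible when the application tolerates its accuracy loss.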
Data Parallelism
• Training batch distribution
• Gradient collection and weights distribution
• AllReduce, Broadcast support to be exploited
• Different strategies will be implemented and evaluated
• Synchronization primitives (relaxed models)
[Figure: CPU, GPU and FPGA nodes attached to one interconnect, annotated "High pressure on the interconnect"]
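The bullets above describe synchronous data-parallel training: each worker computes a gradient on its own shard of the batch, the gradients are combined with an AllReduce, and every replica applies the same update. The sketch below simulates that loop in plain Python for a one-parameter model y = w * x. It is a single-process illustration, not EDDL code; the workers, the shards and the learning rate are hypothetical.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Simulated AllReduce: every worker ends up with the same averaged value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(shards, w0=0.0, lr=0.01, steps=200):
    weights = [w0] * len(shards)  # one model replica per worker
    for _ in range(steps):
        # 1. Training batch distribution: each worker uses only its shard.
        grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
        # 2. Gradient collection: this is the step that loads the interconnect.
        averaged = allreduce_mean(grads)
        # 3. Identical update on every replica keeps the weights in sync.
        weights = [w - lr * g for w, g in zip(weights, averaged)]
    return weights


# Hypothetical data for y = 3x, split evenly across four workers.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
    [(5.0, 15.0), (6.0, 18.0)],
    [(7.0, 21.0), (8.0, 24.0)],
]
final_weights = train(shards)
print(final_weights)
```

Because the shards have equal size, the averaged gradient equals the full-batch gradient, so the result matches single-worker training. The relaxed synchronization models mentioned on the slide would loosen step 2, for example by letting replicas proceed with slightly stale gradients.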
Netlist partitioning (CEA)
• Use a multi-FPGA platform as a single large virtual FPGA
  • For very large inference networks that do not fit into a single FPGA
  • Direct IO-to-IO connection between FPGAs
• Optimized partitioning of the netlist into several netlists
  • Combinatorial optimization model, taking into account critical paths and resource quantities in each FPGA
  • Several state-of-the-art optimization methods, from Kernighan-Lin to simulated annealing
• Execution of the design on the multi-FPGA platform
  • Multiplexing of signals to deal with the limited interconnection
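The partitioning step can be illustrated on a toy netlist. The sketch below splits a small graph across two FPGAs and repeatedly applies the single node swap that most reduces the number of cut connections. This is a greedy simplification in the spirit of Kernighan-Lin, not the full algorithm (which evaluates whole sequences of tentative swaps) and not the CEA tool; it also ignores critical paths and models resource limits only by keeping both halves the same size. The netlist is hypothetical.

```python
def cut_size(edges, side):
    """Number of connections that cross between the two FPGAs."""
    return sum(1 for u, v in edges if side[u] != side[v])


def greedy_swap_partition(edges, side):
    """Swap one node from each side while that reduces the cut.

    Swapping keeps the number of nodes per FPGA constant, a crude stand-in
    for the per-FPGA resource quantities.
    """
    side = dict(side)
    while True:
        base = cut_size(edges, side)
        best_gain, best_pair = 0, None
        left = sorted(n for n, s in side.items() if s == 0)
        right = sorted(n for n, s in side.items() if s == 1)
        for a in left:
            for b in right:
                side[a], side[b] = 1, 0          # try the swap
                gain = base - cut_size(edges, side)
                side[a], side[b] = 0, 1          # undo it
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:                    # no improving swap is left
            return side
        a, b = best_pair
        side[a], side[b] = 1, 0


# Hypothetical netlist: two tightly connected blocks joined by one wire.
block_a = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
block_b = [(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
edges = block_a + block_b + [(3, 4)]

# Deliberately poor starting assignment: each block is split across both FPGAs.
initial = {0: 0, 1: 0, 4: 0, 5: 0, 2: 1, 3: 1, 6: 1, 7: 1}
final = greedy_swap_partition(edges, initial)
print(cut_size(edges, initial), "->", cut_size(edges, final))
```

Fewer cut connections means fewer signals that have to be multiplexed over the limited FPGA-to-FPGA links. A greedy pass like this can stall in a local minimum on larger netlists, which is one reason to consider methods such as simulated annealing.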
Heterogeneous Computing
• Deep Learning and Computer Vision kernels to be deployed for
  • CPU
    • Math processing routines (MKL, Eigen)
  • GPU
    • CUDA vs. OpenCL programming
  • FPGA
    • OpenCL vs. HLS vs. RTL programming
    • Intel/Altera vs. Xilinx platforms
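One common way to support several device types behind a single library interface is a kernel registry keyed by operation and device, with a CPU fallback when no specialized kernel exists. The sketch below shows that pattern under stated assumptions: it is not how EDDL or ECVL are implemented, all names are hypothetical, and the "gpu" kernel is a plain Python placeholder standing in for a CUDA or OpenCL launch.

```python
KERNELS = {}  # maps (operation, device) to an implementation


def kernel(op, device):
    """Decorator that registers an implementation for one device type."""
    def register(fn):
        KERNELS[(op, device)] = fn
        return fn
    return register


@kernel("relu", "cpu")
def relu_cpu(xs):
    # Reference implementation; a real library might call a tuned math routine.
    return [max(0.0, x) for x in xs]


@kernel("relu", "gpu")
def relu_gpu(xs):
    # Placeholder: stands in for launching a device kernel.
    return [x if x > 0.0 else 0.0 for x in xs]


def dispatch(op, device, *args):
    """Run the kernel for the requested device, falling back to the CPU."""
    fn = KERNELS.get((op, device)) or KERNELS[(op, "cpu")]
    return fn(*args)


data = [-1.0, 0.5, 2.0]
print(dispatch("relu", "gpu", data))
print(dispatch("relu", "fpga", data))  # no FPGA kernel registered: CPU fallback
```

The fallback lets kernels be ported one device at a time, which fits a project where the libraries are adapted to new hardware while they are still being written.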
HPC Things to Explore in DeepHealth
• Communication impact
  • Will the network become the bottleneck?
• Use case sizes
  • Accuracy vs. performance trade-off
• FPGA suitability for training (floating-point precision requirement)
  • Will FPGAs be energy efficient for such a large challenge?
  • Which FPGA devices will perform better (accuracy vs. energy trade-off)?
• Scalability of the solution (EDDL/ECVL)
  • Will it perform well on any end-user HPC-like platform?
• … so, a challenging future lies ahead for the DeepHealth HPC teams!
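The first question, whether the network becomes the bottleneck, can be approached with a back-of-the-envelope model before any measurement. The sketch below uses the standard bandwidth term of a ring AllReduce, in which each worker sends and receives about 2(N-1)/N times the gradient size. It ignores latency and any overlap of communication with computation, and the model size, link speed and compute time are hypothetical inputs.

```python
def ring_allreduce_seconds(n_workers, grad_bytes, bandwidth_bytes_per_s):
    """Bandwidth-only estimate of one ring AllReduce (latency ignored)."""
    if n_workers < 2:
        return 0.0
    return 2.0 * (n_workers - 1) / n_workers * grad_bytes / bandwidth_bytes_per_s


def comm_fraction(n_workers, grad_bytes, bandwidth_bytes_per_s, compute_s):
    """Share of a training step spent communicating, assuming no overlap."""
    comm = ring_allreduce_seconds(n_workers, grad_bytes, bandwidth_bytes_per_s)
    return comm / (comm + compute_s)


# Hypothetical case: 25 million float32 parameters, 8 workers,
# 40 Gbit/s links (5e9 bytes/s), 0.1 s of computation per step.
grad_bytes = 25_000_000 * 4
step_comm_s = ring_allreduce_seconds(8, grad_bytes, 5e9)
fraction = comm_fraction(8, grad_bytes, 5e9, compute_s=0.1)

print(f"AllReduce per step: {step_comm_s * 1000:.1f} ms")
print(f"communication share of a step: {fraction:.0%}")
```

In this model the communication time barely changes with the number of workers, while the compute time per step shrinks as faster accelerators are used, so the communication share grows exactly when the compute side improves.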
José Flich (jflich@disca.upv.es)
Mónica Caballero (monica.caballero.galeote@everis.com)
Thank you!

More Related Content

Similar to Heterogeneous HPC Computing in the DeepHealth Project (20)

PDF
Deep-Learning and HPC to Boost Biomedical applications for health
Big Data Value Association
 
PDF
EuroHPC Joint Undertaking. Accelerating the convergence between Big Data and ...
Big Data Value Association
 
PDF
EuroHPC AI in DAPHNE
University of Maribor
 
PDF
An energy efficient programmable many core accelerator for personalized biome...
Nxfee Innovation
 
PDF
Scalable and Distributed DNN Training on Modern HPC Systems
inside-BigData.com
 
PPTX
High performance computing for research
Esteban Hernandez
 
PDF
Nikravesh big datafeb2013bt
Masoud Nikravesh
 
PDF
An Update on Arm HPC
inside-BigData.com
 
PPTX
High-Performance Computing Research in Europe
Govnet Events
 
PPTX
AI Hardware Landscape 2021
Grigory Sapunov
 
PDF
RDA Europe 4.0 - Kick-Off Spanish Node - BSC presentation
Research Data Alliance
 
PDF
El nuevo superordenador Mare Nostrum y el futuro procesador europeo
AMETIC
 
PDF
Exascale Update from Hyperion Research
inside-BigData.com
 
PPTX
Deep Hybrid DataCloud
EOSC-hub project
 
PDF
Automatic generation of hardware memory architectures for HPC
Facultad de Informática UCM
 
PDF
Implementing AI: High Performace Architectures
KTN
 
PDF
The Birth of HPC Cuba
inside-BigData.com
 
PPTX
OpenACC Monthly Highlights: June 2021
OpenACC
 
PDF
HPC DAY 2017 | HPE Strategy And Portfolio for AI, BigData and HPC
HPC DAY
 
PPTX
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
inside-BigData.com
 
Deep-Learning and HPC to Boost Biomedical applications for health
Big Data Value Association
 
EuroHPC Joint Undertaking. Accelerating the convergence between Big Data and ...
Big Data Value Association
 
EuroHPC AI in DAPHNE
University of Maribor
 
An energy efficient programmable many core accelerator for personalized biome...
Nxfee Innovation
 
Scalable and Distributed DNN Training on Modern HPC Systems
inside-BigData.com
 
High performance computing for research
Esteban Hernandez
 
Nikravesh big datafeb2013bt
Masoud Nikravesh
 
An Update on Arm HPC
inside-BigData.com
 
High-Performance Computing Research in Europe
Govnet Events
 
AI Hardware Landscape 2021
Grigory Sapunov
 
RDA Europe 4.0 - Kick-Off Spanish Node - BSC presentation
Research Data Alliance
 
El nuevo superordenador Mare Nostrum y el futuro procesador europeo
AMETIC
 
Exascale Update from Hyperion Research
inside-BigData.com
 
Deep Hybrid DataCloud
EOSC-hub project
 
Automatic generation of hardware memory architectures for HPC
Facultad de Informática UCM
 
Implementing AI: High Performace Architectures
KTN
 
The Birth of HPC Cuba
inside-BigData.com
 
OpenACC Monthly Highlights: June 2021
OpenACC
 
HPC DAY 2017 | HPE Strategy And Portfolio for AI, BigData and HPC
HPC DAY
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
inside-BigData.com
 

More from Big Data Value Association (20)

PDF
Data Privacy, Security in personal data sharing
Big Data Value Association
 
PDF
Key Modules for a trsuted and privacy preserving personal data marketplace
Big Data Value Association
 
PDF
GDPR and Data Ethics considerations in personal data sharing
Big Data Value Association
 
PPTX
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
Big Data Value Association
 
PPTX
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
Big Data Value Association
 
PPTX
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
Big Data Value Association
 
PDF
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
Big Data Value Association
 
PDF
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
Big Data Value Association
 
PDF
BDV Skills Accreditation - EIT labels for professionals
Big Data Value Association
 
PDF
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
Big Data Value Association
 
PDF
BDV Skills Accreditation - Objectives of the workshop
Big Data Value Association
 
PDF
BDV Skills Accreditation - Welcome introduction to the workshop
Big Data Value Association
 
PDF
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
Big Data Value Association
 
PDF
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
Big Data Value Association
 
PDF
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
Big Data Value Association
 
PPTX
Virtual BenchLearning - Data Bench Framework
Big Data Value Association
 
PPTX
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Big Data Value Association
 
PDF
Policy Cloud Data Driven Policies against Radicalisation - Technical Overview
Big Data Value Association
 
PDF
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
Big Data Value Association
 
PDF
Policy Cloud Data Driven Policies against Radicalisation
Big Data Value Association
 
Data Privacy, Security in personal data sharing
Big Data Value Association
 
Key Modules for a trsuted and privacy preserving personal data marketplace
Big Data Value Association
 
GDPR and Data Ethics considerations in personal data sharing
Big Data Value Association
 
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
Big Data Value Association
 
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
Big Data Value Association
 
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
Big Data Value Association
 
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
Big Data Value Association
 
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
Big Data Value Association
 
BDV Skills Accreditation - EIT labels for professionals
Big Data Value Association
 
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
Big Data Value Association
 
BDV Skills Accreditation - Objectives of the workshop
Big Data Value Association
 
BDV Skills Accreditation - Welcome introduction to the workshop
Big Data Value Association
 
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
Big Data Value Association
 
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
Big Data Value Association
 
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
Big Data Value Association
 
Virtual BenchLearning - Data Bench Framework
Big Data Value Association
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Big Data Value Association
 
Policy Cloud Data Driven Policies against Radicalisation - Technical Overview
Big Data Value Association
 
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
Big Data Value Association
 
Policy Cloud Data Driven Policies against Radicalisation
Big Data Value Association
 
Ad

Recently uploaded (20)

PDF
Context Engineering vs. Prompt Engineering, A Comprehensive Guide.pdf
Tamanna
 
PDF
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
PDF
Early_Diabetes_Detection_using_Machine_L.pdf
maria879693
 
PDF
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
PDF
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
PDF
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
PDF
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
PPTX
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
PDF
How to Avoid 7 Costly Mainframe Migration Mistakes
JP Infra Pvt Ltd
 
PDF
Choosing the Right Database for Indexing.pdf
Tamanna
 
PPTX
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
PDF
2_Management_of_patients_with_Reproductive_System_Disorders.pdf
motbayhonewunetu
 
PPTX
Climate Action.pptx action plan for climate
justfortalabat
 
PPTX
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PDF
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
PPTX
Rational Functions, Equations, and Inequalities (1).pptx
mdregaspi24
 
DOCX
AI/ML Applications in Financial domain projects
Rituparna De
 
PDF
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
Context Engineering vs. Prompt Engineering, A Comprehensive Guide.pdf
Tamanna
 
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
Early_Diabetes_Detection_using_Machine_L.pdf
maria879693
 
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
How to Avoid 7 Costly Mainframe Migration Mistakes
JP Infra Pvt Ltd
 
Choosing the Right Database for Indexing.pdf
Tamanna
 
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
2_Management_of_patients_with_Reproductive_System_Disorders.pdf
motbayhonewunetu
 
Climate Action.pptx action plan for climate
justfortalabat
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
Rational Functions, Equations, and Inequalities (1).pptx
mdregaspi24
 
AI/ML Applications in Financial domain projects
Rituparna De
 
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
Ad

Heterogeneous HPC Computing in the DeepHealth Project

  • 1. 1 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111.The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. Heterogeneous HPC Computing in the DeepHealth Project José Flich (UPV) Monica Caballero (everis) European Big Data Value Forum (EBDVF) 2019 15 October 2019, Helsinki
  • 2. 2 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. About DeepHealth Aim & Goals § Facilitate the daily work and increase the productivity of medical personnel and IT professionals in terms of image processing and the use and training of predictive models without the need of combining numerous tools. § Offer a unified framework adapted to exploit underlying heterogeneous HPC and Big Data architectures supporting state-of-the-art and next-generation Deep Learning (AI) and Computer Vision algorithms to enhance European-based medical software platforms. § Put HPC computing power at the service of biomedical applications with DL needs and, through an interdisciplinary approach, apply DL techniques on large and complex image biomedical datasets to support new and more efficient ways of diagnosis, monitoring and treatment of diseases. Duration: 36 months Starting date: Jan 2019 Budget 14.642.366 € EU funding 12.774.824 € 21 partners from 9 countries: Research centers, Health organizations, large industries and SMEs
  • 3. 3 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. About DeepHealth • The DeepHealth toolkit: Free and open-source software with two core technology libraries and a dedicated front-end. • EDDLL: The European Distributed Deep Learning Library • ECVL: the European Computer Vision Library • Ready to run algorithms on Hybrid HPC + Big Data architectures with heterogeneous hardware • Seven biomedical and AI software platforms will integrate the DeepHealth libraries to improve their potential. Use-cases • 14 pilot test-beds in 3 areas: • Neurological diseases • Tumor detection and early cancer prediction • Digital pathology and automated image annotation. • Pilots will allow to train models and evaluate the performance of the proposed solutions in terms of time and accuracy. Expected results
  • 4. 4 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111.The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. DeepHealth HPC Goals
  • 5. 5 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. DeepHealth Goals • Develop a European Distributed Deep-Learning Library (EDDL) • Develop a European Computer Vision Library (ECVL) • Adapt EDDL/ECVL to HPC infrastructure • Heterogeneous Architectures • Apply the EDDL/ECVL to 7 European Platforms for Medical applications • Apply the DeepHealth solution to 14 use cases (pilots) for medical diagnosis development adaptation use
  • 6. 6 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. HPC Goals and Related Challenges • Adapt EDDL and ECVL libraries to HPC infrastructure • Computation • CPUs, GPUs, FPGAs • Communication • Distribution of training process • KPI • 4X performance improvement and 7X better power efficiency for target DeepHealth infrastructure with advanced HPC technologies (combining manycores with vectorial units, GPUs, FPGAs, and low- latency interconnects) compared to standard HPC infrastructure
  • 7. 7 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. Platform Platform Platform Challenges At different levels EDDL library ECVL library Use case Heterog.HPC CPU CPU CPU GPU GPU GPU FPGA FPGA FPGA FPGA Interconnect Use caseUse case Use caseUse caseUse case • Develop EDDL/ECVL • Adapt Platforms • Adapt Use Cases • Adapt HPC • computation, runtime, distribution, interconnect 1 1 1 2 2 3 3 3 4 4 4 4 4 4 4 Implementation Challenge: Adapting new libraries (for performance) as they are being implemented and tested
  • 8. 8 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. Types of Systems Heterogeneity support CPU GPU Interconnect CPU GPU CPU GPU CPU GPU CPU GPU Interconnect CPU GPU CPU FPGA CPU FPGA CPU Interconnect CPU CPU CPU CPU GPU Interconnect GPU CPU GPU GPU CPU GPU GPU CPU GPU Interconnect CPU GPU CPU FPGA FPGA GPU
  • 9. 9 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. DeepHealth HPC Goals • Reinvest in FET-HPC projects (MANGO) • Large FPGA cluster for heterogeneous HPC Exploration
  • 10. 10 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111.The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. Target HPC Systems
  • 11. 11 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. MareNostrum 4 Total peak performance: 13,7 Pflops General Purpose Cluster: 11.15 Pflops (1.07.2017) CTE1-P9+Volta: 1.57 Pflops (1.03.2018) CTE2-Arm V8: 0.5 Pflops (????) CTE3-KNH?: 0.5 Pflops (????) MareNostrum 1 2004 – 42,3 Tflops 1st Europe / 4th World New technologies MareNostrum 2 2006 – 94,2 Tflops 1st Europe / 5th World New technologies MareNostrum 3 2012 – 1,1 Pflops 12th Europe / 36th World MareNostrum 4 2017 – 11,1 Pflops 2nd Europe / 13th World New technologies
  • 12. 12 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. BSC HPC Infrastructures PUT YOUR SMART SUBTITLE HERE • General Purpose Cluster (in production) • 48 racks with 3456 nodes, each with 2 Intel Xeon Platinum proc. • Total of 11.15 PFLOPs in Double Precision • System with total of 165888 processors and 390TB of main memory • 29th fastest supercomputer in top500, 7th fastest supercomputer in Europe • CTE1-P9+VOLTA (in production) • 54 nodes, each with 2 POWER9 proc., 4 Volta GPUs, 6.4TB NVMe • Total of 1.57 PFLOPs in Double Precision • Same node as Sierra supercomputer at LLNL (2nd fastest supercomputer in top500) • Suitable for HPC and Machine Learning workloads
  • 13. 13 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. BSC HPC Infrastructures PUT YOUR SMART SUBTITLE HERE • CTE2-Arm v8 (to be deployed in 2020) • Same processor as in the future post-K supercomputer in Japan • Targets Exascale workloads: 2.7 TFLOPS double precision compute power, 5.4 TFLOPS in single precision; 10.8 TFLOPS in half-precision (16 bits) • HPC and AI convergence: up to 21.6 TOPS in 8-bit int precision • 7nm technology; 48 cores; 4 stacks of 8GB HBM2 (total of 32GB) • Novel 512-bit SVE ext. with specific instructions for machine learning • Might be interesting as a cutting edge system by the end of DeepHealth • Mont-Blanc 3 prototype (in production) • 48 nodes, 2 processors/node (96 processors in total) • Cavium Thunder X2 processor: 32-core Arm v8, 4-way SMT, up to 2.5GHz • Targets HPC workloads in datacenters • System with up to 3K cores and 12K threads • Liquid cooling
  • 14. 14 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. MANGO prototype From FET-HPC MANGO project • 16 (interconnected) clusters, each with • One Server node • 12 FPGAs (lego system) • Xilinx 7–series, Zynq-7000, Kintex Ultrascale+ • Intel Stratix-10 • DDR3, DDR4 pluggable memory modules • Connections: PCIe Express Gen 2/3 lanes, 40Gbps QSFP prototype onecluster
  • 15. PROD: Development of a customized FPGA-based PCIe board
      • Based on the latest Intel or Xilinx FPGA technology (TBD)
      • High-bandwidth, low-latency PCIe interface for data exchange with the host
      • Modular peripherals (memories, interfaces): TBD
  • 16. The DeepHealth Computing Infrastructure: Overview
      • [Figure: layered diagram of the DeepHealth SW architecture on top of the DeepHealth HPC and cloud HW resources; the labels recoverable from it are listed below]
      • Captions
          • Programming models and access methods for EDDLL and ECVL development
          • The DeepHealth computing infrastructure, including HPC and big-data cloud-based resources
      • Workloads and scheduling
          • EDDLL workload (e.g., training); EDDLL workload (e.g., inference)
          • Multiple-workload scheduling; single-workload scheduling
      • DeepHealth SW architecture
          • API provided to ECVL and EDDLL developers (WP2/WP3)
          • Non-functional requirements description
          • COMPSs
          • Global Resource Manager (Slurm-based)
          • Distributed programming model (e.g., M/R, task-based)
          • Parallel programming models (e.g., CUDA, OpenCL, OpenMP)
          • Container-based (parallel) programming models
          • Parallel run-time; Mango run-time
          • Netlist partitioning; Vivado tools; N2D2 framework
          • Cloud API; OpenStack platform
      • DeepHealth HPC HW resources
          • MareNostrum 4 (Intel); Arm ThunderX2; POWER9+Voltas cluster
          • Mango cluster
          • Tailored FPGA PCIe card
          • 1200-core cluster (x86)
      • DeepHealth cloud HW resources
          • Private (NVIDIA) + public cloud
          • Private cloud (x86 + NVIDIA T4)
      • Partners shown: BSC, UNITO, PROD, UPV, UNITO, TREE
  • 17. COMPSs
      • Framework (programming model + runtime system) to develop parallel applications for distributed infrastructures
      • Abstract model: exposes parallelism while hiding the infrastructure
      • Agnostic of the computing platform
      • Task-based programming model built on top of general-purpose sequential programming languages (Python, C, C++, Java)
      • Sequential code:
            def display(c):
                …

            def add(a, b, c):
                c = a + b

            for i in range(MSIZE):
                add(A[i], B[i], C[i])
            display(C)
      • The same code with COMPSs task annotations (each parameter is marked with its direction):
            from pycompss.api.task import task
            from pycompss.api.parameter import IN, INOUT, OUT

            @task(c=INOUT)
            def display(c):
                …

            @task(a=IN, b=IN, c=OUT)
            def add(a, b, c):
                c = a + b

            for i in range(MSIZE):
                add(A[i], B[i], C[i])
            display(C)
      • [Figure: task graph with MSIZE independent add tasks feeding a single display task]
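The task graph on this slide (MSIZE independent add tasks, then one display task that depends on all of them) can be tried without a COMPSs installation. The sketch below is a stand-in, not the COMPSs runtime: the `task` decorator and `_resolve` helper are hypothetical and built on Python's standard `concurrent.futures`. Unlike the slide's code, tasks here return values instead of writing to OUT parameters.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def _resolve(value):
    """Replace futures (also inside lists) with their results."""
    if isinstance(value, Future):
        return value.result()
    if isinstance(value, list):
        return [_resolve(v) for v in value]
    return value


def task(func):
    """Hypothetical stand-in for a task annotation: calling the function
    submits it to a thread pool and immediately returns a future."""
    def submit(*args):
        def run():
            # Waiting on input futures enforces the data dependencies.
            return func(*[_resolve(a) for a in args])
        return _pool.submit(run)
    return submit


@task
def add(a, b):
    return a + b


@task
def display(c):
    return "C = " + str(c)


MSIZE = 4
A = [1, 2, 3, 4]
B = [10, 20, 30, 40]

# MSIZE independent tasks: they may run in parallel.
C = [add(A[i], B[i]) for i in range(MSIZE)]
# display depends on every add task, so it runs after all of them.
message = display(C).result()
print(message)
```

The main program stays sequential in style. Parallelism comes only from the dependencies between task inputs and outputs, which is the idea the slide illustrates.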
  • 18. EPFL: Multi-objective RM policies
      • Power/performance/accuracy-aware runtime resource management (RM) policies
      • Automatic selection of the most efficient resources
      • Adding one new axis: accuracy!
      • Heuristic, ML-based and hyper-heuristic RM policies (algorithms)
          • Single node: selection of accelerators (allocation), DVFS settings
          • Multiple nodes (Global RM of MANGO)
      • Integrated with the DeepHealth SW stack
          • MANGO API + COMPSs + Slurm
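One simple way to picture a power/performance/accuracy-aware selection is a Pareto filter followed by a constrained choice. The sketch below is not the EPFL policy. The configuration names and all numbers are made-up placeholders for illustration, and the selection rule (fastest configuration within a power budget and an accuracy floor) is an assumption.

```python
from typing import List, NamedTuple, Optional


class Config(NamedTuple):
    name: str
    power_w: float      # lower is better
    latency_ms: float   # lower is better
    accuracy: float     # higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on every axis and better on one."""
    no_worse = (a.power_w <= b.power_w and a.latency_ms <= b.latency_ms
                and a.accuracy >= b.accuracy)
    better = (a.power_w < b.power_w or a.latency_ms < b.latency_ms
              or a.accuracy > b.accuracy)
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def pick(configs: List[Config], max_power_w: float,
         min_accuracy: float) -> Optional[Config]:
    """Fastest Pareto-optimal configuration that meets both constraints."""
    feasible = [c for c in pareto_front(configs)
                if c.power_w <= max_power_w and c.accuracy >= min_accuracy]
    return min(feasible, key=lambda c: c.latency_ms) if feasible else None


# Hypothetical candidates; the values are placeholders, not measurements.
CANDIDATES = [
    Config("cpu-fp32", 150.0, 40.0, 0.92),
    Config("gpu-fp32", 250.0, 8.0, 0.92),
    Config("gpu-fp16", 200.0, 5.0, 0.91),
    Config("fpga-int8", 40.0, 12.0, 0.88),
    Config("cpu-int8", 160.0, 45.0, 0.87),   # dominated by cpu-fp32
]

front = pareto_front(CANDIDATES)
choice = pick(CANDIDATES, max_power_w=220.0, min_accuracy=0.90)
print([c.name for c in front], choice.name if choice else None)
```

Treating accuracy as a third axis means a low-precision accelerator can stay on the Pareto front even though it is less accurate, as long as it is better on power or latency.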
  • 19. Data Parallelism
      • Training batch distribution
      • Gradient collection and weights distribution
      • AllReduce, Broadcast support to be exploited
      • Different strategies will be implemented and evaluated
      • Synchronization primitives (relaxed models)
      • [Figure: CPU, GPU and FPGA devices attached to a shared interconnect; annotation: high pressure on the interconnect]
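The steps on this slide (distribute the batch, compute gradients locally, combine them with an AllReduce, update every replica) can be sketched on a toy problem. The code below simulates the workers inside a single process. It is not EDDLL code: the function names are hypothetical and the one-parameter model y = w * x is chosen only to keep the example short.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an AllReduce: every worker receives the average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def split(batch, workers):
    """Equal contiguous shards (assumes len(batch) is divisible by workers)."""
    size = len(batch) // workers
    return [batch[i * size:(i + 1) * size] for i in range(workers)]


def train(batch, workers, steps=200, lr=0.01):
    """Synchronous data-parallel gradient descent with one replica per worker."""
    weights = [0.0] * workers          # identical replicas, as after a broadcast
    shards = split(batch, workers)     # training batch distribution
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = allreduce_mean(grads)  # gradient collection
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


BATCH = [(float(x), 3.0 * x) for x in range(1, 9)]   # samples of y = 3x
replicas = train(BATCH, workers=4)
print(replicas)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so all replicas stay identical. The cost is one collective operation per step, which is where the pressure on the interconnect comes from.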
  • 20. Netlist partitioning (CEA)
      • Use a multi-FPGA platform as a single large virtual FPGA
          • For very large inference networks that do not fit into a single FPGA
          • Direct IO-to-IO connection between FPGAs
      • Optimized partitioning of the netlist into several netlists
          • Combinatorial optimization model, taking into account critical paths and resource quantities in each FPGA
          • Several state-of-the-art optimization methods, from Kernighan-Lin to simulated annealing
      • Execution of the design on the multi-FPGA platform
          • Multiplexing of signals to deal with the limited interconnection
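To give a feel for the optimization problem, the sketch below splits a tiny netlist graph across two FPGAs with simulated annealing, minimizing the number of cut connections while keeping both halves the same size. It is a toy model, not the CEA tool: it ignores critical paths and per-FPGA resource quantities, and the graph, cooling schedule and parameters are assumptions chosen for illustration.

```python
import math
import random


def cut_size(edges, side):
    """Number of connections whose endpoints sit on different FPGAs."""
    return sum(1 for u, v in edges if side[u] != side[v])


def anneal_partition(num_cells, edges, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Balanced two-way partition by simulated annealing (num_cells even)."""
    rng = random.Random(seed)
    side = [i % 2 for i in range(num_cells)]     # balanced starting point
    current = cut_size(edges, side)
    best, best_cut = side[:], current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Swap one cell from each side so the partition stays balanced.
        a = rng.choice([i for i in range(num_cells) if side[i] == 0])
        b = rng.choice([i for i in range(num_cells) if side[i] == 1])
        side[a], side[b] = 1, 0
        candidate = cut_size(edges, side)
        delta = candidate - current
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate                  # accept the move
            if current < best_cut:
                best, best_cut = side[:], current
        else:
            side[a], side[b] = 0, 1              # reject: undo the swap
    return best, best_cut


# Toy netlist: two fully connected blocks of 4 cells joined by one wire.
BLOCK_X, BLOCK_Y = [0, 1, 2, 3], [4, 5, 6, 7]
EDGES = [(u, v) for blk in (BLOCK_X, BLOCK_Y) for u in blk for v in blk if u < v]
EDGES.append((3, 4))

initial_cut = cut_size(EDGES, [i % 2 for i in range(8)])
partition, final_cut = anneal_partition(8, EDGES)
print(initial_cut, final_cut, partition)
```

Accepting some worsening moves at high temperature is what distinguishes annealing from a purely greedy swap heuristic. In a real flow the cost function would also include the timing and resource terms mentioned on the slide.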
  • 21. Heterogeneous Computing
      • Deep Learning and Computer Vision kernels to be deployed for
          • CPU
              • Math processing routines (MKL, Eigen)
          • GPU
              • CUDA vs OpenCL programming
          • FPGA
              • OpenCL vs HLS vs RTL programming
              • Intel/Altera vs Xilinx platforms
  • 22. HPC Things to Explore in DeepHealth
      • Communication impact
          • Will the network become the bottleneck?
      • Use case sizes
          • Accuracy vs performance trade-off
      • FPGA suitability for training (floating-point precision requirement)
          • Will FPGAs be energy efficient for such a large challenge?
          • Which FPGA devices will perform better (accuracy vs. energy trade-off)?
      • Scalability of the solution (EDDLL/ECVL)
          • Will it perform well on any end-user HPC-like platform?
      • … so, a challenging future lies ahead for the DeepHealth HPC teams!
  • 23. Thank you!
      • José Flich
      • Mónica Caballero