SlideShare a Scribd company logo
Massachusetts Institute of Technology
Ji Lin
Tiny Inference and Scalable Training
for Efficient Video Recognition
04/09/2020
2
Background
• Videos are growing explosively: 105 hours of videos are uploaded to YouTube
per day
• Efficient Video processing is essential for both Cloud and Edge (e.g., hospital)
T
A Challenge for Modern Deep Learning
Moore’s
Law
Data
• We are solving more complicated AI problems with larger datasets,
which requires more computation.
• However, Moore’s Law is slowing down; the amount of computation
per unit cost is no longer increasing at its historic rate.
4
Overview
• Efficient spatial-temporal modeling is important for video understanding
• 2D CNN is more efficient, but it cannot handle temporal modeling
• 3D CNN can perform joint spatial-temporal feature learning, but it is
computationally expensive
• We aim to achieve 3D CNN performance at 2D complexity
5
Temporal Shift Module (TSM)
• Bi-directional TSM shifts part of the channels along the temporal
dimension to facilitate information exchange among neighboring frames
• Uni-directional TSM shifts channels from past to future for online video
understanding.
• It can be inserted into off-the-shelf 2D CNN to enable temporal modeling at
the cost of zero FLOPs and zero parameters
* Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
6
TSM Video Model
• Offline TSM video models
• Online TSM video models
A Simple Implementation of TSM
# shape of x: [N, T, C, H, W]
out = torch.zeros_like(x)
fold = c // fold_div
out[:, :-1, :fold] = x[:, 1:, :fold] # shift left
out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold] # shift right
out[:, :, 2 * fold:] = x[:, :, 2 * fold:] # not shift
return out
* Naive implementation, involves large memory consumption and increases training memory consumption
8
Datasets
• Less temporal related: UCF101, HMDB51, Kinetics
• Temporal related: Something-Something (V1&V2), Jester
Improving over 2D Baseline
• TSM can improve over 2D baseline (TSN) at no computation
Cost vs. Accuracy
• It consumes 3× less computation than the ECO family, 6× less
computation than the Non-local I3D family while achieving better
performance on Something-Something dataset
Latency Comparison
Batch size=1. Measured on NVIDIA Tesla P100.
Each row represents a video.
I3D:
Latency: 164.3 ms/Video Something-V1 Acc.: 41.6%
TSM:
Latency: 17.4 ms/Video Something-V1 Acc.: 43.4%
Speed-up: 9x
Throughput Comparison
Batch size=16. Measured on NVIDIA Tesla P100.
Each square represents a video.
I3D:
Throughput: 6.1 video/s
Something-V1 Acc.: 41.6%
TSM:
Throughput: 77.4 video/s
Something-V1 Acc.: 43.4%
12.7x larger throughput
Online Video Recognition
13
14
Improving the Robustness of Online Video Detection
Improving the Robustness of Online Video Detection
15
Scaling Down: Low-Latency Low-Power
Deployment
16
LED Bulb Level!
Scaling Up: Large-Scale Distributed Training with Summit
Super Computer
SUMMIT Super Computer:
• CPU: 2 x 16 Core IBM POWER9 (connected
via dual NVLINK bricks, 25GB/s each side)

• GPU: 6 x NVIDIA Tesla V100

• RAM: 512 GB DDR4 memory

• Data Storage: HDD

• Connection: Dual-rail EDR InfiniBand
network of 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
Scaling Up: Large-Scale Distributed Training with Summit
Super Computer
Scalable Hardware + Scalable Model Design

TSM is hardware friendly for distributed training:

• Arithmetic efficiency: fewer FLOPs compared to 3D models

• Data I/O efficiency: fewer frames (32->8), no downsampling

• Networking efficiency: fewer parameters
● We are able to speedup the training by 200x, from 2 days to 14minutes.
● Model setup: 8-frame ResNet-50 TSM for video recognition
● Dataset: Kinetics (240k training videos) x 100 epoch
Training Time Accuracy Peak GPU
Performance
Speed-up
1 SUMMIT Nodes 

(6 GPUs)
49h 50min 74.1% 46.5TFLOP/s Theoretical: 128x

Actual: 106x

Theoretical: 256x

Actual: 211x
128 SUMMIT Nodes 

(768 GPUs)
28min 74.1% 5,989TFLOP/s
256 SUMMIT Nodes 

(1536 GPUs)
14min 74.0% 11,978TFLOP/s
0 12.5 25 37.5 50
Time (h)
1 SUMMIT Node
128 SUMMIT Node
106x
Scaling Up: Large-Scale Distributed Training with SUMMIT
Super Computer
● The performance of TSM model does not degrade when we scale up the mini-batch
size to 12k.
211x
Accuracy v.s. Batch size
Training Curves
Scalability
aining and validation curve for baseline training and large-batch distributed trainin
. The performance does not degrade for batch size 6k and 12k, while degrades for a
4k
16k
64k
256k
images/second
e throughput and scalability of distributed synchronous SGD training. Considering
The throughput and scalability of distributed synchronous SGD training. Considering the massive
number of GPUs, he system achieves a good scalability (>80%). The most of the communication
overhead is hidden by computation
Scalability v.s. Model
● TSM model achieves 1.6x and 2.9x higher training throughput compared to
previous I3D models
TSM Dissection: Spatial-Temporal Localization
24
• Each channel learns different semantics
• Channel 5: Move something away
TSM Dissection: Spatial-Temporal Localization
25
• Each channel learns different semantics
• Channel 162: Wiping
TSM Dissection: Spatial-Temporal Localization
26
• Each channel learns different semantics
• Channel 446: Push to left
27
Demo: Hand Gesture Recognition with TSM
28
70 FPS on $99 Jetson Nano
Demo: Google Map Navigation with Gesture
29
Demo Video on Something-Something
Acknowledgement
31
Song Han

MIT
Chuang Gan

MIT-IBM Watson AI Lab
John Cohn

IBM
Thank you!
32
Papers
1. Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
2. Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on
Videos, arXiv 1811.08383
Media Coverage:
Website: tsm-hanlab.mit.edu
Code Released! Including gesture recognition demo.

More Related Content

What's hot (20)

OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
Ganesan Narayanasamy
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
BSC LMS DDL
BSC LMS DDL BSC LMS DDL
BSC LMS DDL
Ganesan Narayanasamy
 
Optimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for Energy
David Lecomber
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
Ganesan Narayanasamy
 
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
Ganesan Narayanasamy
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
Ganesan Narayanasamy
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
Ganesan Narayanasamy
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
Ganesan Narayanasamy
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
WML OpenPOWER presentation
WML OpenPOWER presentationWML OpenPOWER presentation
WML OpenPOWER presentation
Ganesan Narayanasamy
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
Ryousei Takano
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 
OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar
Ganesan Narayanasamy
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
inside-BigData.com
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Ganesan Narayanasamy
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdac
Ganesan Narayanasamy
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
Edge AI and Vision Alliance
 
AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group
Ganesan Narayanasamy
 
Optimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for Energy
David Lecomber
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
Ganesan Narayanasamy
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
Ganesan Narayanasamy
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
Ryousei Takano
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
inside-BigData.com
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Ganesan Narayanasamy
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdac
Ganesan Narayanasamy
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
Edge AI and Vision Alliance
 
AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group AI OpenPOWER Academia Discussion Group
AI OpenPOWER Academia Discussion Group
Ganesan Narayanasamy
 

Similar to MIT's experience on OpenPOWER/POWER 9 platform (20)

“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Deep gradient compression
Deep gradient compressionDeep gradient compression
Deep gradient compression
David Tung
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learned
Wee Hyong Tok
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
 
Accelerate Machine Learning on Google Cloud
Accelerate Machine Learning on Google CloudAccelerate Machine Learning on Google Cloud
Accelerate Machine Learning on Google Cloud
Samantha Guerriero
 
IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...
IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...
IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...
IRJET Journal
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
Ganesan Narayanasamy
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Eric Haibin Lin
 
Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
Roberto Iacoviello
 
Weakly Supervised Whole Slide Image Analysis Using Cloud Computing
Weakly Supervised Whole Slide Image Analysis Using Cloud ComputingWeakly Supervised Whole Slide Image Analysis Using Cloud Computing
Weakly Supervised Whole Slide Image Analysis Using Cloud Computing
Sean Yu
 
Presentation - webinar embedded machine learning
Presentation - webinar embedded machine learningPresentation - webinar embedded machine learning
Presentation - webinar embedded machine learning
Sirris
 
Jc nov.07.2019
Jc nov.07.2019Jc nov.07.2019
Jc nov.07.2019
carvalhoschubert
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
Intel Nervana
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
Bill Liu
 
Efficient video perception through AI
Efficient video perception through AIEfficient video perception through AI
Efficient video perception through AI
Qualcomm Research
 
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
Edge AI and Vision Alliance
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clusters
Joy Qiao
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
inside-BigData.com
 
C3 w3
C3 w3C3 w3
C3 w3
Ajay Taneja
 
Quad Core Processors - Technology Presentation
Quad Core Processors - Technology PresentationQuad Core Processors - Technology Presentation
Quad Core Processors - Technology Presentation
vinaya.hs
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Deep gradient compression
Deep gradient compressionDeep gradient compression
Deep gradient compression
David Tung
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learned
Wee Hyong Tok
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
 
Accelerate Machine Learning on Google Cloud
Accelerate Machine Learning on Google CloudAccelerate Machine Learning on Google Cloud
Accelerate Machine Learning on Google Cloud
Samantha Guerriero
 
IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...
IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...
IRJET- A Hybrid Image and Video Compression of DCT and DWT Techniques for H.2...
IRJET Journal
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Eric Haibin Lin
 
Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
Roberto Iacoviello
 
Weakly Supervised Whole Slide Image Analysis Using Cloud Computing
Weakly Supervised Whole Slide Image Analysis Using Cloud ComputingWeakly Supervised Whole Slide Image Analysis Using Cloud Computing
Weakly Supervised Whole Slide Image Analysis Using Cloud Computing
Sean Yu
 
Presentation - webinar embedded machine learning
Presentation - webinar embedded machine learningPresentation - webinar embedded machine learning
Presentation - webinar embedded machine learning
Sirris
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
Intel Nervana
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
Bill Liu
 
Efficient video perception through AI
Efficient video perception through AIEfficient video perception through AI
Efficient video perception through AI
Qualcomm Research
 
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
Edge AI and Vision Alliance
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clusters
Joy Qiao
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
inside-BigData.com
 
Quad Core Processors - Technology Presentation
Quad Core Processors - Technology PresentationQuad Core Processors - Technology Presentation
Quad Core Processors - Technology Presentation
vinaya.hs
 

More from Ganesan Narayanasamy (20)

Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Empowering Engineering Faculties: Bridging the Gap with Emerging TechnologiesEmpowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Ganesan Narayanasamy
 
Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
Ganesan Narayanasamy
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
Ganesan Narayanasamy
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
Ganesan Narayanasamy
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
Ganesan Narayanasamy
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
Ganesan Narayanasamy
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
Ganesan Narayanasamy
 
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Empowering Engineering Faculties: Bridging the Gap with Emerging TechnologiesEmpowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Ganesan Narayanasamy
 
Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 

Recently uploaded (20)

Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 

MIT's experience on OpenPOWER/POWER 9 platform

  • 1. Massachusetts Institute of Technology Ji Lin Tiny Inference and Scalable Training for Efficient Video Recognition 04/09/2020
  • 2. 2 Background • Videos are growing explosively: 105 hours of videos are uploaded to YouTube per day • Efficient Video processing is essential for both Cloud and Edge (e.g., hospital) T
  • 3. A Challenge for Modern Deep Learning Moore’s Law Data • We are solving more complicated AI problems with larger datasets, which requires more computation. • However, Moore’s Law is slowing down; the amount of computation per unit cost is no longer increasing at its historic rate.
  • 4. 4 Overview • Efficient spatial-temporal modeling is important for video understanding • 2D CNN is more efficient, but it cannot handle temporal modeling • 3D CNN can perform joint spatial-temporal feature learning, but it is computationally expensive • We aim to achieve 3D CNN performance at 2D complexity
  • 5. 5 Temporal Shift Module (TSM) • Bi-directional TSM shifts part of the channels along the temporal dimension to facilitate information exchange among neighboring frames • Uni-directional TSM shifts channels from past to future for online video understanding. • It can be inserted into off-the-shelf 2D CNN to enable temporal modeling at the cost of zero FLOPs and zero parameters * Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
  • 6. 6 TSM Video Model • Offline TSM video models • Online TSM video models
  • 7. A Simple Implementation of TSM # shape of x: [N, T, C, H, W] out = torch.zeros_like(x) fold = c // fold_div out[:, :-1, :fold] = x[:, 1:, :fold] # shift left out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold] # shift right out[:, :, 2 * fold:] = x[:, :, 2 * fold:] # not shift return out * Naive implementation, involves large memory consumption and increases training memory consumption
  • 8. 8 Datasets • Less temporal related: UCF101, HMDB51, Kinetics • Temporal related: Something-Something (V1&V2), Jester
  • 9. Improving over 2D Baseline • TSM can improve over 2D baseline (TSN) at no computation
  • 10. Cost vs. Accuracy • It consumes 3× less computation than the ECO family, 6× less computation than the Non-local I3D family while achieving better performance on Something-Something dataset
  • 11. Latency Comparison Batch size=1. Measured on NVIDIA Tesla P100. Each row represents a video. I3D: Latency: 164.3 ms/Video Something-V1 Acc.: 41.6% TSM: Latency: 17.4 ms/Video Something-V1 Acc.: 43.4% Speed-up: 9x
  • 12. Throughput Comparison Batch size=16. Measured on NVIDIA Tesla P100. Each square represents a video. I3D: Throughput: 6.1 video/s Something-V1 Acc.: 41.6% TSM: Throughput: 77.4 video/s Something-V1 Acc.: 43.4% 12.7x larger throughput
  • 14. 14 Improving the Robustness of Online Video Detection
  • 15. Improving the Robustness of Online Video Detection 15
  • 16. Scaling Down: Low-Latency Low-Power Deployment 16 LED Bulb Level!
  • 17. Scaling Up: Large-Scale Distributed Training with Summit Super Computer SUMMIT Super Computer: • CPU: 2 x 16 Core IBM POWER9 (connected via dual NVLINK bricks, 25GB/s each side) • GPU: 6 x NVIDIA Tesla V100 • RAM: 512 GB DDR4 memory • Data Storage: HDD • Connection: Dual-rail EDR InfiniBand network of 23 GB/s Acknowledgment: IBM and Oak Ridge National Lab * Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
  • 18. Scaling Up: Large-Scale Distributed Training with Summit Super Computer Scalable Hardware + Scalable Model Design TSM is hardware friendly for distributed training: • Arithmetic efficiency: fewer FLOPs compared to 3D models • Data I/O efficiency: fewer frames (32->8), no downsampling • Networking efficiency: fewer parameters
  • 19. ● We are able to speedup the training by 200x, from 2 days to 14minutes. ● Model setup: 8-frame ResNet-50 TSM for video recognition ● Dataset: Kinetics (240k training videos) x 100 epoch Training Time Accuracy Peak GPU Performance Speed-up 1 SUMMIT Nodes 
 (6 GPUs) 49h 50min 74.1% 46.5TFLOP/s Theoretical: 128x Actual: 106x Theoretical: 256x Actual: 211x 128 SUMMIT Nodes 
 (768 GPUs) 28min 74.1% 5,989TFLOP/s 256 SUMMIT Nodes 
 (1536 GPUs) 14min 74.0% 11,978TFLOP/s 0 12.5 25 37.5 50 Time (h) 1 SUMMIT Node 128 SUMMIT Node 106x Scaling Up: Large-Scale Distributed Training with SUMMIT Super Computer
  • 20. ● The performance of TSM model does not degrade when we scale up the mini-batch size to 12k. 211x Accuracy v.s. Batch size
  • 22. Scalability aining and validation curve for baseline training and large-batch distributed trainin . The performance does not degrade for batch size 6k and 12k, while degrades for a 4k 16k 64k 256k images/second e throughput and scalability of distributed synchronous SGD training. Considering The throughput and scalability of distributed synchronous SGD training. Considering the massive number of GPUs, he system achieves a good scalability (>80%). The most of the communication overhead is hidden by computation
  • 23. Scalability v.s. Model ● TSM model achieves 1.6x and 2.9x higher training throughput compared to previous I3D models
  • 24. TSM Dissection: Spatial-Temporal Localization 24 • Each channel learns different semantics • Channel 5: Move something away
  • 25. TSM Dissection: Spatial-Temporal Localization 25 • Each channel learns different semantics • Channel 162: Wiping
  • 26. TSM Dissection: Spatial-Temporal Localization 26 • Each channel learns different semantics • Channel 446: Push to left
  • 27. 27
  • 28. Demo: Hand Gesture Recognition with TSM 28 70 FPS on $99 Jetson Nano
  • 29. Demo: Google Map Navigation with Gesture 29
  • 30. Demo Video on Something-Something
  • 32. Thank you! 32 Papers 1. Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19 2. Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383 Media Coverage: Website: tsm-hanlab.mit.edu Code Released! Including gesture recognition demo.