MIT's experience on OpenPOWER/POWER 9 platform

Massachusetts Institute of Technology
Ji Lin
Tiny Inference and Scalable Training
for Efficient Video Recognition
04/09/2020

Background
• Videos are growing explosively: 10^5 hours of video are uploaded to YouTube
per day
• Efficient video processing is essential for both cloud and edge (e.g., hospital)

A Challenge for Modern Deep Learning
[Figure: data growth vs. Moore’s Law]
• We are solving more complicated AI problems with larger datasets,
which requires more computation.
• However, Moore’s Law is slowing down; the amount of computation
per unit cost is no longer increasing at its historic rate.

Overview
• Efficient spatial-temporal modeling is important for video understanding
• 2D CNN is more efficient, but it cannot handle temporal modeling
• 3D CNN can perform joint spatial-temporal feature learning, but it is
computationally expensive
• We aim to achieve 3D CNN performance at 2D complexity
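
To make the 2D vs. 3D cost gap concrete, here is a rough sketch that counts multiply-accumulates (MACs) for a single convolution layer. The helper names and the layer sizes (64 channels, 3x3 kernel, 56x56 feature map, temporal kernel of 3) are illustrative assumptions, not numbers from the talk.

```python
def conv2d_macs(c_in: int, c_out: int, k: int, h: int, w: int) -> int:
    """MACs for one k x k convolution on one frame (stride 1, same padding)."""
    return c_in * c_out * k * k * h * w


def conv3d_macs(c_in: int, c_out: int, k: int, k_t: int, h: int, w: int) -> int:
    """Same layer with a temporal kernel extent of k_t, counted per output frame."""
    return conv2d_macs(c_in, c_out, k, h, w) * k_t


# Hypothetical layer sizes, chosen only for illustration.
macs_2d = conv2d_macs(c_in=64, c_out=64, k=3, h=56, w=56)
macs_3d = conv3d_macs(c_in=64, c_out=64, k=3, k_t=3, h=56, w=56)

print(f"3D / 2D cost per frame: {macs_3d / macs_2d:.1f}x")
```

Under these assumptions the 3D layer costs k_t times the 2D layer per output frame, which is the overhead a shift-based design tries to avoid.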

Temporal Shift Module (TSM)
• Bi-directional TSM shifts part of the channels along the temporal
dimension to facilitate information exchange among neighboring frames
• Uni-directional TSM shifts channels from past to future for online video
understanding.
• It can be inserted into off-the-shelf 2D CNN to enable temporal modeling at
the cost of zero FLOPs and zero parameters
* Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
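
A minimal sketch of the uni-directional (online) idea, assuming one frame arrives at a time: the first `fold` channels of each frame are replaced by the ones cached from the previous frame. The class `OnlineShift` and its method names are hypothetical, and spatial dimensions are omitted to keep the example small.

```python
from typing import List, Optional, Sequence


class OnlineShift:
    """Hypothetical helper: uni-directional shift over a stream of frames."""

    def __init__(self, fold: int) -> None:
        self.fold = fold                           # number of channels to shift
        self.cache: Optional[List[float]] = None   # channels saved from the previous frame

    def step(self, frame: Sequence[float]) -> List[float]:
        """frame: per-channel features of the current frame."""
        current = list(frame)
        # Channels taken from the past frame; zeros for the very first frame.
        past = self.cache if self.cache is not None else [0.0] * self.fold
        # Remember this frame's first `fold` channels for the next call.
        self.cache = current[: self.fold]
        # Shifted channels come from the past, the rest stay in place.
        return past + current[self.fold:]


shift = OnlineShift(fold=2)
first = shift.step([1.0, 2.0, 3.0, 4.0])
second = shift.step([5.0, 6.0, 7.0, 8.0])
```

Only past information is used, so each frame can be processed as soon as it arrives.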

TSM Video Model
• Offline TSM video models
• Online TSM video models

A Simple Implementation of TSM
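import torch

def temporal_shift(x, fold_div):
    # x has shape [N, T, C, H, W]; 1/fold_div of the channels move in each direction
    c = x.size(2)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]  # shift left: frame t receives channels from t+1
    out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold]  # shift right: frame t receives channels from t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]  # not shifted
    return out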
* Naive implementation: it copies the whole activation tensor, which increases training memory consumption
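
To see which values land where, here is a tiny worked example of the same bi-directional shift in plain Python. It is a sketch only: the function `shift_frames` is a hypothetical stand-in that works on a list of T frames with C channel values each, with the N, H and W dimensions dropped.

```python
def shift_frames(frames, fold):
    """frames: list of T frames, each a list of C channel values."""
    t = len(frames)
    c = len(frames[0])
    out = [[0] * c for _ in range(t)]
    for i in range(t):
        for ch in range(c):
            if ch < fold:          # shift left: take the channel from the next frame
                src = i + 1
            elif ch < 2 * fold:    # shift right: take the channel from the previous frame
                src = i - 1
            else:                  # not shifted
                src = i
            if 0 <= src < t:       # positions with no neighbor stay zero
                out[i][ch] = frames[src][ch]
    return out


frames = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
shifted = shift_frames(frames, fold=1)
for row in shifted:
    print(row)
```

Channel 0 moves one frame earlier, channel 1 moves one frame later, and the boundary frames are zero-padded where no neighbor exists.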

Datasets
• Less temporally related: UCF101, HMDB51, Kinetics
• Temporally related: Something-Something (V1 & V2), Jester

Improving over 2D Baseline
• TSM improves over the 2D baseline (TSN) at no extra computation

Cost vs. Accuracy
• TSM consumes 3× less computation than the ECO family and 6× less
computation than the Non-local I3D family while achieving better
performance on the Something-Something dataset

Latency Comparison
Batch size = 1. Measured on NVIDIA Tesla P100.
[Figure: each row represents a video]
I3D: latency 164.3 ms/video, Something-V1 accuracy 41.6%
TSM: latency 17.4 ms/video, Something-V1 accuracy 43.4%
Speed-up: 9x

Throughput Comparison
Batch size = 16. Measured on NVIDIA Tesla P100.
[Figure: each square represents a video]
I3D: throughput 6.1 video/s, Something-V1 accuracy 41.6%
TSM: throughput 77.4 video/s, Something-V1 accuracy 43.4%
12.7x larger throughput
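
The speed-up figures on the latency and throughput slides can be re-derived from the reported measurements. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
# Numbers from the latency and throughput slides (NVIDIA Tesla P100).
i3d_latency_ms, tsm_latency_ms = 164.3, 17.4      # batch size 1
i3d_videos_per_s, tsm_videos_per_s = 6.1, 77.4    # batch size 16

latency_speedup = i3d_latency_ms / tsm_latency_ms       # lower latency is better
throughput_gain = tsm_videos_per_s / i3d_videos_per_s   # higher throughput is better

print(f"latency speed-up: {latency_speedup:.1f}x")
print(f"throughput gain:  {throughput_gain:.1f}x")
```

The latency ratio comes out slightly above the rounded 9x on the slide, and the throughput ratio matches the 12.7x figure.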

Improving the Robustness of Online Video Detection


Scaling Down: Low-Latency Low-Power Deployment
LED Bulb Level!

Scaling Up: Large-Scale Distributed Training with Summit Supercomputer
SUMMIT Supercomputer:
• CPU: 2 x 16-core IBM POWER9 (connected via dual NVLink bricks, 25 GB/s each side)
• GPU: 6 x NVIDIA Tesla V100
• RAM: 512 GB DDR4 memory
• Data storage: HDD
• Connection: dual-rail EDR InfiniBand network of 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1910.00932

Scaling Up: Large-Scale Distributed Training with Summit Supercomputer
Scalable Hardware + Scalable Model Design

TSM is hardware friendly for distributed training:
• Arithmetic efficiency: fewer FLOPs compared to 3D models
• Data I/O efficiency: fewer frames (32 -> 8), no downsampling
• Networking efficiency: fewer parameters

● We are able to speed up the training by 200x, from 2 days to 14 minutes.
● Model setup: 8-frame ResNet-50 TSM for video recognition
● Dataset: Kinetics (240k training videos) x 100 epochs

Setup                         Training Time  Accuracy  Peak GPU Performance  Speed-up
1 SUMMIT node (6 GPUs)        49h 50min      74.1%     46.5 TFLOP/s          (baseline)
128 SUMMIT nodes (768 GPUs)   28min          74.1%     5,989 TFLOP/s         Theoretical: 128x, Actual: 106x
256 SUMMIT nodes (1536 GPUs)  14min          74.0%     11,978 TFLOP/s        Theoretical: 256x, Actual: 211x

[Figure: training time in hours, 1 SUMMIT node vs. 128 SUMMIT nodes (106x)]
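
The gap between theoretical and actual speed-up can be expressed as a scaling efficiency, i.e. the actual speed-up divided by the number of nodes. The sketch below uses only the reported speed-ups and the 6 GPUs per node from the Summit description; the function name is illustrative.

```python
def scaling_efficiency(actual_speedup: float, nodes: int) -> float:
    """Fraction of the ideal (linear) speed-up that was actually achieved."""
    return actual_speedup / nodes


GPUS_PER_NODE = 6  # from the Summit node description

# (nodes, actual speed-up over one node), as reported in the talk
runs = [(128, 106.0), (256, 211.0)]

for nodes, actual in runs:
    eff = scaling_efficiency(actual, nodes)
    print(f"{nodes} nodes ({nodes * GPUS_PER_NODE} GPUs): {eff:.1%} of linear scaling")
```

Both runs land a little above 82% of linear scaling, and efficiency drops only slightly when the node count doubles.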
Scaling Up: Large-Scale Distributed Training with SUMMIT Supercomputer

● The performance of the TSM model does not degrade when we scale up the mini-batch
size to 12k.
[Figure: accuracy vs. batch size]

Scalability
[Figure: training and validation curves for baseline training and large-batch distributed training. The performance does not degrade for batch sizes 6k and 12k, while it degrades for a …]
[Figure: training throughput in images/second; axis ticks at 4k, 16k, 64k, 256k]
The throughput and scalability of distributed synchronous SGD training. Considering the massive
number of GPUs, the system achieves good scalability (>80%). Most of the communication
overhead is hidden by computation.

Scalability vs. Model
● The TSM model achieves 1.6x and 2.9x higher training throughput compared to
previous I3D models

TSM Dissection: Spatial-Temporal Localization
• Each channel learns different semantics
• Channel 5: Move something away
• Channel 162: Wiping
• Channel 446: Push to left

Demo: Hand Gesture Recognition with TSM
70 FPS on $99 Jetson Nano

Demo: Google Maps Navigation with Gesture

Demo Video on Something-Something

Acknowledgement
Song Han, MIT
Chuang Gan, MIT-IBM Watson AI Lab
John Cohn, IBM

Thank you!
Papers
1. Lin et al., TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
2. Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1910.00932
Website: tsm-hanlab.mit.edu
Code Released! Including gesture recognition demo.
