Introduction to PyTorch Training Optimization Techniques Using NVIDIA Profilers (Revised Edition)

2021.08.30
Mana Murakami, Solution Architect, NVIDIA

2
AGENDA
1. The importance of profiling
2. DLProf & Nsight Systems
3. Summary

3
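The importance of profiling
Frequently asked questions
• Training got faster once I moved it to a GPU, but I can't tell whether it could be faster still
• I use a GPU for training, but I don't really know how heavily the GPU is being utilized
• I don't know what the optimization steps are in the first place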

4
Several useful tools are available for bottleneck analysis.

5
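Limits of performance optimization
Amdahl's law: when one part of the training session (the part running on the GPU) is accelerated, the remaining part (the part running on the CPU) becomes the performance bottleneck.
[Figure: training session time, Single Precision (FP32) vs. Mixed Precision (TF32/FP16/BF16); bars labeled GPU and CPU]
• MATH (linear, conv, matmul): 8-16x
• MEMORY (pointwise, reductions): 1-2x
• OTHER (data pipeline, communication): 1x
• Overall: ~2x faster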
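A minimal sketch of the arithmetic behind the "~2x overall" figure. The per-category time shares below are assumed for illustration and are not taken from the slide; only the speedup ranges (8-16x, 1-2x, 1x) come from it. The function name `overall_speedup` is hypothetical.

```python
def overall_speedup(fractions, speedups):
    """Amdahl-style estimate: new total time is the sum of each part's
    original share divided by that part's own speedup."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    new_time = sum(fractions[part] / speedups[part] for part in fractions)
    return 1.0 / new_time


# Assumed share of FP32 training time per category (illustrative, not measured).
fractions = {"math": 0.50, "memory": 0.25, "other": 0.25}
# Per-category speedups picked from the ranges shown on the slide.
speedups = {"math": 8.0, "memory": 1.5, "other": 1.0}

print(f"overall speedup: {overall_speedup(fractions, speedups):.2f}x")
```

With these assumed shares the estimate comes out at roughly 2.1x, even though the math portion alone runs 8x faster. The parts that do not speed up dominate the result, which is why profiling the whole session matters.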

6
NVIDIA profiling stack
A layered profiling stack whose tools can be chosen to suit the task
Stack (top to bottom):
• DLProf Viewer
• Deep Learning Profiler (DLProf)
• Nsight Systems
• NVTX for TensorFlow, NVTX Plugins, NVTX for PyTorch
• NGC Optimized Framework Containers
• NVIDIA COMPUTING PLATFORM
NEW V1.0.0
• Nsight Systems and Nsight Compute are profilers for GPU applications, built on CUPTI (CUDA Profiling Tools Interface)
• NVTX (NVIDIA Tools Extension Library) is a CUDA library for annotating source code
• DLProf runs Nsight Systems internally to collect profile data, then formats and visualizes it in a form that data scientists can readily understand

8
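Useful tools for performance optimization
DLProf and Nsight Systems
• DLProf <Nsight Systems w/ NVTX>
  • For data scientists and applied researchers (developers of models and applications for specific domains)
  • Formats and visualizes the analysis results in a form data scientists can easily understand, to support optimization of training code
• NVTX, Nsight Systems, Nsight Compute
  • For researchers and developers (algorithm developers)
  • Profiling tools for algorithm and framework developers
  • Low overhead and lightweight, and able to show the flow of CUDA processing in fine detail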

10
What is DLProf?
DLProf: a profiling tool for deep learning models
Displays analysis results and optimization advice
Supports TensorFlow, PyTorch, and TensorRT

11
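What is DLProf?
Dashboard
• GPU utilization chart:
  • Shows the percentage of wall clock time during which the GPU was active; with multiple GPUs, shows the average utilization across all GPUs
• Operation GPU time chart:
  • Classifies all operations into three groups and charts them: "operations that used Tensor Cores", "operations that could have used Tensor Cores but did not", and "operations that cannot use Tensor Cores"
• CUDA kernel GPU time chart:
  • Classifies the total CUDA kernel execution time into three groups and charts them: "time spent using Tensor Cores inside kernels", "time spent on memory operations inside kernels", and "all other processing inside kernels"
• Tensor Core utilization chart:
  • Shows the time spent in operations that used Tensor Cores as a percentage of total GPU time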
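The two utilization figures described above can be sketched as follows. This is only an illustration of the definitions given on the slide, not DLProf's actual implementation; the function names and the interval data are hypothetical.

```python
def merge_busy_time(intervals):
    """Total length of the union of (start, end) intervals, so that
    overlapping kernels are not counted twice."""
    busy = 0.0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def gpu_utilization(per_gpu_intervals, wall_clock):
    """Fraction of wall clock time the GPU was active, averaged over all GPUs."""
    per_gpu = [merge_busy_time(iv) / wall_clock for iv in per_gpu_intervals]
    return sum(per_gpu) / len(per_gpu)


def tensor_core_utilization(tensor_core_time, total_gpu_time):
    """Share of total GPU time spent in operations that used Tensor Cores."""
    return tensor_core_time / total_gpu_time


# Hypothetical kernel activity (seconds) on two GPUs over a 10-second run.
gpu0 = [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)]
gpu1 = [(0.0, 1.0)]
print(f"GPU utilization: {gpu_utilization([gpu0, gpu1], wall_clock=10.0):.1%}")
print(f"Tensor Core utilization: {tensor_core_utilization(1.5, 5.0):.1%}")
```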

12
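What is DLProf?
Dashboard
• Performance summary:
  • Lists the key metrics of the run (execution time, Tensor Core utilization, GPU utilization, etc.)
• Iteration summary:
  • A bar chart of the time taken by each iteration during the run. For every iteration it shows the breakdown into time using Tensor Cores, time using the GPU without Tensor Cores, and time not using the GPU.
• Top 10 GPU operations:
  • Lists the 10 operations with the longest execution time, in sorted order. Useful for pinpointing bottlenecks.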

13
Installing DLProf
How do I use DLProf?
1. Use the DLProf bundled in the TensorFlow and PyTorch containers distributed on NGC (PyTorch and TensorFlow (1.x/2.x))
• TensorFlow https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
• PyTorch https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
2. Install via Python pip (PyTorch and TF1.x only)
• PyTorch example (installs the NVIDIA Python package index, DLProf and its dependencies, and the DLProf Viewer Plugin for TensorBoard):
$ pip install nvidia-pyindex
$ pip install nvidia-dlprof[pytorch]
$ pip install nvidia-tensorboard-plugin-dlprof
https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/#profiling

14
NOTE:
For how to run the Deep Learning containers distributed on NGC with Singularity, see the blog post listed in the Appendix.
The DLProf version bundled in each container can be checked in the following document:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/

15
NOTE:
DLProf depends on the CUDA toolkit and driver, so you need to install a version that is compatible with the CUDA version of your environment.
(Reference)
https://docs.nvidia.com/deeplearning/frameworks/dlprof-release-notes/index.html

16
Steps for profiling a PyTorch script
1. Add the following to the PyTorch code to be profiled
    import torch
    import nvidia_dlprof_pytorch_nvtx as nvtx

    # Enable the DLProf NVTX annotations for PyTorch
    nvtx.init(enable_function_stack=True)

    # Emit NVTX ranges for every autograd operation inside this block
    with torch.autograd.profiler.emit_nvtx():
        for iter in range(iters):
            # forward
            # backward
            pass  # placeholder for the actual training step
2. Run DLProf
    $ dlprof --mode=pytorch python main.py
3. Visualize the results with DLProf Viewer
    $ dlprofviewer -b 0.0.0.0 -p 8000 dlprof_dldb.sqlite
Reference: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/index.html#quickstart_topic

17
Only three lines need to be added to the training script.

18
DLProf analysis takes a long time, so it is best to profile only a small number of iterations (around 10-20 mini-batches).
The --delay option can also be used to skip the warm-up phase when profiling.
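One simple way to keep a profiling run short is to cap the number of mini-batches the loop consumes. The sketch below assumes the data loader is an ordinary Python iterable; the helper name `first_batches` and the stand-in loader are hypothetical.

```python
from itertools import islice


def first_batches(loader, max_batches=20):
    """Iterate over at most max_batches mini-batches of any iterable loader."""
    return islice(loader, max_batches)


# Stand-in for a real DataLoader: any iterable of batches works the same way.
fake_loader = range(1000)

processed = 0
for batch in first_batches(fake_loader, max_batches=20):
    # forward / backward would go here
    processed += 1

print(f"profiled {processed} mini-batches")
```

Because `islice` stops pulling from the loader once the cap is reached, the rest of the dataset is never loaded during the profiling run.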

19
When DLProf is run without specifying output file names, it writes "dlprof_dldb.sqlite" and "nsys_profile.sqlite".
Pass "dlprof_dldb.sqlite" to dlprofviewer.
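Both output files are SQLite databases, so they can be inspected with Python's standard `sqlite3` module. The schema is not described in these slides, so the sketch below only lists whatever tables a file contains. The helper name `list_tables` and the demo table are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    if not os.path.exists(db_path):
        # sqlite3.connect would silently create an empty file otherwise.
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo on a throwaway database; the table name here is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.sqlite")
    with closing(sqlite3.connect(demo_db)) as conn:
        conn.execute("CREATE TABLE example_kernels (name TEXT, gpu_time_ns INTEGER)")
        conn.commit()
    print(list_tables(demo_db))
```

For a real run, the call would be `list_tables("dlprof_dldb.sqlite")` in the directory where DLProf wrote its output.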

21
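Example: profiling results from DLProf + DLProf Viewer
Before GPU optimization (AMP not used / small batch size)
It is clear at a glance that most of the total execution time is spent in CPU processing.
"Problem detected:" and "Recommended Change:" are displayed, making the problems easy to identify.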

23
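NSIGHT tools workflow
The new CUDA profiling tool suite
• Nsight Systems (start here): comprehensive system-level performance analysis
• Nsight Compute: detailed performance analysis of CUDA kernels, per kernel, using metrics and counters
• Nsight Graphics: detailed frame/render performance analysis, with detailed profiling per graphics frame
After working in Nsight Compute or Nsight Graphics, re-check the overall performance.
https://developer.nvidia.com/nsight-systems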

24
Nsight Systems
The new CUDA profiling tool suite
Key features:
• System-wide algorithm optimization
• Support for analyzing multi-process applications
• Effective for finding bottlenecks within an application
• Visualizes millions of events on a very fast GUI timeline
• Works from both the command line and IDEs (integrated development environments)
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
$ nsys profile -t cuda,osrt,nvtx,cudnn,cublas -o inference_result.qdstrm -w true python inference.py
https://developer.nvidia.com/nsight-systems

25
[Figure: Nsight Systems timeline, with the following rows labeled]
• CPU utilization
• Processes & threads
• OS runtime APIs
• CUDA & cuBLAS APIs
• GPU CUDA kernels & memory transfers
• CPU IP & backtrace sample data
• NVTX annotations
• NVTX projected on GPU CUDA streams

26
NSIGHT SYSTEMS
If Nsight Systems is not installed in your development environment
Setting Up and Using Nsight Systems Inside Containers
CUDA 11.4: install
CUDA 11.3: install
CUDA 11.2: install
$ apt-get update -y
$ apt-get install -y cuda-nsight-systems-11-3 nsight-systems-2021.1.3
Mapping an Nsight Systems Host Installation into a Container
$ docker run --rm -it --network=host --gpus=all -v /opt/nvidia/nsight-systems/2021.1.3:/opt/nvidia/nsight-systems/2021.1.3 nvcr.io/nvidia/pytorch:21.08-py3 bash

28
NSIGHT SYSTEMS
How do I use Nsight Systems?
Example
$ nsys profile -t nvtx,cuda,osrt,cublas \
    --stats=true \
    -f true \
    -o pusch_result \
    python main.py
• -t nvtx,cuda,osrt,cublas: APIs to be traced
  • cuda: GPU kernel
  • osrt: OS runtime
  • nvtx: NVIDIA Tools Extension
  • cublas: CUDA BLAS library
• --stats=true: Outputs profiling information similar to nvprof
• -f true: Overwrite the output
• -o pusch_result: Output filename
Other Useful Options
• --delay (-y): Collection start delay in seconds
• --duration (-d): Collection duration in seconds
• --capture-range (-c): none/cudaProfilerApi/nvtx
etc.
https://docs.nvidia.com/nsight-systems/2020.3/profiling/index.html#cli-options
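When profiling runs are launched from a script, the command line can be assembled programmatically. The sketch below builds the same `nsys profile` invocation using only the flags listed on the slide. The helper name `build_nsys_command` and its parameters are hypothetical, and the resulting list would be handed to something like `subprocess.run` on a machine where `nsys` is installed.

```python
import shlex


def build_nsys_command(script, output, traces=("nvtx", "cuda", "osrt", "cublas"),
                       stats=True, overwrite=True, delay=None, duration=None):
    """Assemble an `nsys profile` argument list from the options on the slide."""
    cmd = ["nsys", "profile", "-t", ",".join(traces)]
    cmd.append(f"--stats={'true' if stats else 'false'}")
    cmd += ["-f", "true" if overwrite else "false"]
    if delay is not None:
        cmd += ["-y", str(delay)]      # collection start delay in seconds
    if duration is not None:
        cmd += ["-d", str(duration)]   # collection duration in seconds
    cmd += ["-o", output, "python", script]
    return cmd


cmd = build_nsys_command("main.py", "pusch_result", delay=10, duration=30)
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting problems. `shlex.join` is used here only to print a copy-pasteable command.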

29
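Example: Nsight Systems + NVTX
Nsight Systems profiling result (with NVTX)
[Figure: timeline with NVTX ranges; preprocessing: 11.07 sec, inference (10 iterations): 28.924 sec, with a single iteration marked]
Annotating the code makes the processing much easier to follow on the timeline!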
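A minimal sketch of how such annotations might be added from Python, using PyTorch's `torch.cuda.nvtx.range_push` and `range_pop`. The wrapper `nvtx_range` and the range names are hypothetical. The fallback to a no-op when PyTorch or CUDA is unavailable is an assumption made so the example runs anywhere.

```python
import contextlib

try:
    import torch
    _HAS_NVTX = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: annotations become no-ops
    torch = None
    _HAS_NVTX = False


@contextlib.contextmanager
def nvtx_range(name):
    """Mark the enclosed block as a named NVTX range when CUDA is available."""
    if _HAS_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAS_NVTX:
            torch.cuda.nvtx.range_pop()


with nvtx_range("preprocess"):
    data = [x * 2 for x in range(8)]   # stand-in for real preprocessing

for i in range(10):
    with nvtx_range(f"iteration {i}"):
        total = sum(data)              # stand-in for one inference step
```

When the script is run under `nsys profile -t nvtx,...`, each named range should appear as a labeled bar on the timeline, which is what makes the preprocessing and per-iteration phases easy to tell apart.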

30
Appendix. Technical blogs and related sessions
Deep Learning Examples
• https://github.com/NVIDIA/DeepLearningExamples/
How to Run NGC Deep Learning Containers with Singularity
• https://developer.nvidia.com/blog/how-to-run-ngc-deep-learning-containers-with-singularity/
Profiling and Optimizing Deep Neural Networks with DLProf and PyProf (TensorFlow)
• https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/
Deep Learning Performance Optimization with Profiling Tools
• https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31228/
PyTorch Performance Tuning Guide
Introduction to PyTorch Training Optimization Techniques Using NVIDIA Profilers
