This document introduces the deep reinforcement learning model 'A3C' in Japanese.
The original paper is "Asynchronous Methods for Deep Reinforcement Learning" by V. Mnih et al.
This document discusses self-supervised representation learning (SRL) for reinforcement learning tasks. SRL learns state representations by using prediction tasks as an auxiliary objective. The key ideas are: (1) SRL learns an encoder that maps observations to states using a prediction task like modeling future states or actions; (2) The learned state representations improve generalization and exploration in reinforcement learning algorithms; (3) Several SRL methods are discussed, including world models, inverse models, and causal infoGANs.
A Random Forest using a Multi-valued Decision Diagram on an FPGA (Hiroki Nakahara)
Presentation slides from ISMVL (Int'l Symp. on Multiple-Valued Logic), held on May 22nd, 2017 in Novi Sad, Serbia. They describe a machine learning method that achieves high performance at low power.
A digital spectrometer using an FPGA is proposed for use on a radio telescope. The spectrometer would provide high-resolution spectral analysis of wideband radio frequency signals received by the telescope. To achieve high throughput on the FPGA, a nested residue number system is used to implement the fast Fourier transforms in the spectrometer. This decomposes large moduli into smaller nested ones, allowing uniform circuit sizes and enabling fully parallel implementation of the arithmetic.
Paper introduction: Dueling network architectures for deep reinforcement learning (Kazuki Adachi)
Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:1995-2003, 2016.
FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customiz... (Hiroki Nakahara)
This document presents a method for high-throughput convolutional neural network (CNN) inference on an FPGA using customized JPEG compression. It decomposes convolutions using channel shift and pointwise operations, employs binary weight quantization, and uses a fully pipelined architecture. Experimental results show the proposed JPEG compression achieves an 82x speedup with 0.3% accuracy drop. When implemented on an FPGA, the CNN achieves 3,321 frames per second at 75 watts, providing over 100x and 10x speedups over CPU and GPU respectively.
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ... (Hiroki Nakahara)
The document discusses implementing a deep neural network object detector called YOLOv2 on an FPGA using a technique called Nested Residue Number System (NRNS). Key points:
1. YOLOv2 is used for real-time object detection but requires high performance and low power.
2. NRNS decomposes large integer operations into smaller ones using a nested set of prime number moduli, enabling parallelization on FPGA.
3. The authors implemented a Tiny YOLOv2 model using NRNS on a NetFPGA-SUME board, achieving 3.84 FPS at 3.5W power and 1.097 FPS/W efficiency.
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network (Hiroki Nakahara)
This document summarizes a research paper that proposes a ternary weight binary input convolutional neural network (CNN).
The paper proposes using ternary (-1, 0, +1) weights instead of binary weights to improve recognition accuracy over binary CNNs. By setting many weights to zero, computations can be skipped, reducing operations. Experimental results show the ternary CNN model reduced non-zero weights to 5.3% while maintaining accuracy comparable to binary CNNs. Implementation on an ARM processor demonstrated the ternary CNN was 8 times faster than a binary CNN.
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto... (Hiroki Nakahara)
This document presents a mixed-precision convolutional neural network (CNN) called a Lightweight YOLOv2 for real-time object detection on an FPGA. The network uses binary precision for the feature extraction layers and half precision for the localization and classification layers. An FPGA implementation of the network achieves 40.81 FPS for object detection, outperforming an embedded GPU and CPU. Future work will apply this approach to other CNN-based applications such as semantic segmentation and pose estimation.
FPT17: An object detector based on multiscale sliding window search using a f... (Hiroki Nakahara)
1) The document describes an object detection system that uses a multiscale sliding window approach with fully pipelined binarized convolutional neural networks (BCNNs) implemented on an FPGA.
2) The system detects and classifies multiple objects in images by applying BCNNs to windows at different scales and locations, and suppresses overlapping detections.
3) Experimental results on a Zynq UltraScale+ MPSoC FPGA demonstrate that the proposed pipelined BCNN architecture can achieve higher accuracy than GPU-based detectors while using less than 5W of power.
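21. Artificial Neuron (AN)
(Figure: inputs x0 = 1, x1, x2, ..., xN are multiplied by weights w0 (bias), w1, w2, ..., wN, summed into u, and passed through f(u) to produce y.)
xi: Input signal
wi: Weight
u: Internal state
f(u): Activation function (Sigmoid, ReLU, etc.)
y: Output signal
$y = f(u)$
$u = \sum_{i=0}^{N} w_i x_i$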
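The neuron equations above can be sketched in a few lines of Python. This is only an illustration: the function names (`neuron`, `sigmoid`, `relu`) and the sample weights are assumptions chosen for the example, not part of the slides.

```python
import math


def sigmoid(u):
    """Sigmoid activation: squashes u into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def relu(u):
    """ReLU activation: passes positive values, clamps negatives to 0."""
    return max(0.0, u)


def neuron(x, w, f):
    """Compute y = f(u) with u = sum_{i=0..N} w_i * x_i.

    x: inputs x1..xN (x0 = 1 is prepended here)
    w: weights w0..wN, where w0 acts as the bias
    f: activation function
    """
    xs = [1.0] + list(x)  # x0 = 1 so that w0 is the bias term
    if len(xs) != len(w):
        raise ValueError("expected one weight per input plus a bias weight")
    u = sum(wi * xi for wi, xi in zip(w, xs))  # internal state
    return f(u)


# Hypothetical values: u = 0.5*1 + 1.0*2.0 + 2.0*(-1.0) = 0.5
y = neuron([2.0, -1.0], [0.5, 1.0, 2.0], relu)
print(y)
```

Swapping `relu` for `sigmoid` changes only the activation; the weighted sum is the same.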
23. LeNet-5
• The basis of CNNs (Prof. Fukushima had already published the Neocognitron in 1980!)
• Convolution (feature extraction) → fully connected (classification)
• 5 layers
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
29. AlexNet
• The landmark CNN that ignited the deep learning boom
• Winner of ILSVRC'12 (error rate 16%)
• Training data increased through data augmentation
• 8 layers, Dropout, ensemble of CNNs, ReLU activation function
A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
32. GoogLeNet
• Network-in-network
• Winner of ILSVRC'14 (error rate 6.7%)
• 22 layers, Inception operations, no fully connected layers
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich: Going deeper with convolutions. CVPR 2015: 1-9
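36. Can a network be made arbitrarily deep?
• The answer is NO → vanishing/exploding gradient problem
• Because the update repeatedly multiplies derivatives of the activation function (errors) across layers
Backpropagation: $(0.1)^{100} \to 0$; forward propagation: $(1.1)^{100} \to \infty$
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
(Figure: adding layers (= making the network deeper) degrades recognition accuracy.)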
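The two products on this slide are easy to check numerically. The sketch below multiplies a per-layer factor over 100 layers; the helper name `product_over_layers` is an assumption for illustration. Note that 1.1 raised to the 100th power is large but finite; it diverges only as the depth keeps growing.

```python
def product_over_layers(factor, layers=100):
    """Multiply the same per-layer factor across `layers` layers."""
    result = 1.0
    for _ in range(layers):
        result *= factor
    return result


shrinking = product_over_layers(0.1)  # backward pass with factor 0.1
growing = product_over_layers(1.1)    # forward pass with factor 1.1

print(f"0.1^100 = {shrinking:.3e}")   # collapses toward 0
print(f"1.1^100 = {growing:.3e}")     # already in the tens of thousands
print(f"1.1^1000 = {product_over_layers(1.1, 1000):.3e}")  # keeps growing with depth
```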
37. ResNet
• Winner of ILSVRC'15 (error rate 3.57%)
• 152 layers!!, Batch Normalization
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
We want the output H(x) to match the input x
→ the CNN learns the residual F(x) = H(x) − x
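A minimal sketch of the residual idea, H(x) = F(x) + x, is shown below. To keep it dependency-free, F(x) is modeled with two small fully connected layers instead of convolutions, and the batch normalization and final ReLU of the original block are omitted; all function names and weights are assumptions for illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Return H(x) = F(x) + x, where F(x) = W2 * relu(W1 * x)."""
    f = linear(W2, relu(linear(W1, x)))       # residual branch F(x)
    return [fi + xi for fi, xi in zip(f, x)]  # skip connection adds x


x = [1.0, -2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zero, zero))
# With identity weights F(x) = relu(x), so H(x) = relu(x) + x.
print(residual_block(x, identity, identity))
```

The first call illustrates why the skip connection helps: learning the identity mapping only requires driving the residual branch to zero.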
38. Batch Normalization
• After normalization, apply scaling and shifting
• Suppresses divergence during training
Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” ICML2015.
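The two steps on this slide (normalize, then scale and shift) can be sketched for a single feature over one mini-batch. The parameter names `gamma`, `beta`, and `eps` follow common usage and are assumptions here; in practice `gamma` and `beta` are learned per feature.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalars, then scale by gamma and shift by beta."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n    # biased batch variance
    normalized = [(x - mean) / math.sqrt(var + eps) for x in xs]
    return [gamma * v + beta for v in normalized]  # scaling and shifting


batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))                       # roughly zero mean, unit variance
print(batch_norm(batch, gamma=2.0, beta=3.0))  # mean moves to about 3, spread doubles
```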
54. Binarization/ternarization on FPGAs is the trend
• FPT2016 (held in December)
• E. Nurvitadhi (Intel) et al., “Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC”
• H. Nakahara (Tokyo Tech), “A Memory-Based Realization of a Binarized Deep Convolutional Neural Network”
• ISFPGA2017 (held last week)
• Ritchie Zhao et al., “Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs”
• Y. Umuroglu (Xilinx) et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”
• H. Nakahara, H. Yonekawa (Tokyo Tech), et al., “A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA”
• Y. Li et al., “A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks”
• G. Lemieux, “TinBiNN: Tiny Binarized Neural Network Overlay in Less Than 5,000 4-LUTs”
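55. (Aside) PYNQ development status
• New libraries (Arduino, PMOD, etc.) are under development
• As easy as ever (driven from Python + Jupyter) and high performance
# Import the I/O processor drivers and port identifiers
from pynq.iop import Pmod_PWM
from pynq.iop import PMODA, PMODB
from pynq.iop import Adafruit_LCD18_DDR
from pynq.iop import ARDUINO
# LCD attached through the Arduino interface
lcd = Adafruit_LCD18_DDR(ARDUINO)
# PWM output on the PMODA port, index 0
pwm_A = Pmod_PWM(PMODA, 0)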
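69. Performance improvement with directives (1)
• Original loop → sequential execution
• #pragma HLS unroll → replicates the operators (loop unrolling)
• #pragma HLS pipeline → pipelines the loop

Original loop:
for (int i = 0; i < N; i++) {
    op_Read;
    op_MAC;
    op_Write;
}
(Timing: RD MAC WR, RD MAC WR, ...; each iteration starts after the previous one finishes.)
Throughput: 3 cycles
Latency: 3 cycles
Processing rate: 1/3 data/cycle

With unroll (factor 3):
for (int i = 0; i < N; i++) {
#pragma HLS unroll factor=3
    op_Read;
    op_MAC;
    op_Write;
}
(Timing: three RD MAC WR sequences run side by side.)
Throughput: 3 cycles
Latency: 3 cycles
Processing rate: 1 data/cycle

With pipeline:
for (int i = 0; i < N; i++) {
#pragma HLS pipeline
    op_Read;
    op_MAC;
    op_Write;
}
(Timing: a new RD MAC WR sequence starts every cycle, overlapping the previous ones.)
Throughput: 1 cycle
Latency: 3 cycles
Processing rate: 1 data/cycle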
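The throughput figures on this slide can be reproduced with a simple cycle-count model. This is a sketch under idealized assumptions taken from the slide (three stages, RD, MAC, and WR, of one cycle each, with no memory or resource conflicts); the function names are hypothetical and the model is not an HLS tool report.

```python
import math

STAGES = 3  # RD, MAC, WR: one cycle each, as drawn on the slide


def cycles_sequential(n):
    """Original loop: each iteration waits for the previous one to finish."""
    return STAGES * n


def cycles_unrolled(n, factor=3):
    """Unrolled loop: `factor` iterations run side by side on replicated operators."""
    return STAGES * math.ceil(n / factor)


def cycles_pipelined(n, ii=1):
    """Pipelined loop: a new iteration starts every `ii` cycles (initiation interval)."""
    if n == 0:
        return 0
    return STAGES + ii * (n - 1)


n = 300
for name, cycles in [
    ("sequential", cycles_sequential(n)),
    ("unroll x3", cycles_unrolled(n)),
    ("pipeline", cycles_pipelined(n)),
]:
    print(f"{name:10s}: {cycles:4d} cycles, {n / cycles:.2f} data/cycle")
```

Both directives approach 1 data/cycle in this model, but unrolling does so by tripling the operators while pipelining reuses a single set of stages.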
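70. Performance improvement with directives (2)
• #pragma HLS unroll → replicates the operators (loop unrolling)
• #pragma HLS pipeline → pipelines the loop

Array partition with unroll (factor 3):
int X[100];
#pragma HLS array_partition variable=X
for (int i = 0; i < N; i++) {
#pragma HLS unroll factor=3
    op_Read;
    op_MAC;
    op_Write;
}
(Timing: three RD MAC WR sequences run side by side.)
Throughput: 3 cycles
Latency: 3 cycles
Processing rate: 1 data/cycle

Array partition with pipeline:
int X[100];
#pragma HLS array_partition variable=X
for (int i = 0; i < N; i++) {
#pragma HLS pipeline
    op_Read;
    op_MAC;
    op_Write;
}
(Timing: a new RD MAC WR sequence starts every cycle.)
Throughput: 1 cycle
Latency: 3 cycles
Processing rate: 1 data/cycle

(Figure: the array is split across multiple memories (Mem) that feed the RD → MAC → WR datapath.)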
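Why partitioning matters can be illustrated with a small model of memory banks. The slide does not say which partitioning scheme is used; the sketch below assumes a cyclic scheme (element i goes to bank i mod factor) and one read port per bank, and the function names are hypothetical.

```python
import math
from collections import Counter


def cyclic_partition(data, factor):
    """Split `data` into `factor` banks: element i goes to bank i % factor."""
    banks = [[] for _ in range(factor)]
    for i, value in enumerate(data):
        banks[i % factor].append(value)
    return banks


def cycles_to_read(indices, factor, ports_per_bank=1):
    """Cycles needed to read all `indices` when each bank serves
    `ports_per_bank` reads per cycle (assumed port count)."""
    per_bank = Counter(i % factor for i in indices)
    return max(math.ceil(count / ports_per_bank) for count in per_bank.values())


X = list(range(100))
banks = cyclic_partition(X, 3)
print([len(b) for b in banks])  # bank sizes after partitioning

# Three unrolled copies want X[0], X[1], X[2] in the same cycle.
print(cycles_to_read([0, 1, 2], factor=1))  # single memory: reads are serialized
print(cycles_to_read([0, 1, 2], factor=3))  # three banks: all reads in one cycle
```

Without partitioning, the replicated or pipelined operators would stall on the single memory, so the gains from unroll and pipeline depend on this step.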