Mask-RCNN for Instance Segmentation
Visual perception tasks
Agenda
• Visual perception tasks
• Mask-RCNN
• Mask-RCNN architecture
• Feature Pyramid Network
• Region Proposal Network
• RoIAlign
• Mask-RCNN head network
• Result
• Summary
Introduction to MaskRCNN
• Mask-RCNN stands for Mask Region-based Convolutional Neural Network
• State-of-the-art algorithm for Instance Segmentation
• Evolved through 4 main versions:
• RCNN → Fast-RCNN → Faster-RCNN → Mask-RCNN
• The first 3 versions are for Object Detection
• Improvements over Faster-RCNN:
• Uses RoIAlign instead of RoIPool
• Employs a Fully Convolutional Network (FCN) for mask prediction, predicting a mask for each class independently
• 2 main stages:
• 1st stage: use Region Proposal Network (RPN) to propose candidate object bounding boxes
• 2nd stage: classify the candidate boxes, refine the boxes and predict masks
Terms
• Bounding box: rectangle identifying location of an object
• Mask: set of pixels which belong to an object
• Anchor: a bounding box generated independently of image content
• RoI: Region of Interest, a bounding box which may contain an object
• Non-Maximum Suppression (NMS): a method to eliminate duplicate bounding boxes using scores and an IoU threshold
• IoU: Intersection over Union, a metric measuring how much two areas overlap (area of intersection divided by area of union)
• RoIAlign: a method to extract features for RoIs from feature maps
• Feature Pyramid Network (FPN): a neural network to extract feature maps at different scales
• Region Proposal Network (RPN): a neural network to propose RoIs for an image
• Fully Convolutional Network (FCN): a convolution-based neural network to predict masks
MaskRCNN architecture
(Architecture diagram; the class output covers 1 background + number of classes)
Multi-scale problem
Approaches for multiple scaled objects
Feature Pyramid Network (FPN)
Bounding box regression
• Used to detect boundaries of objects
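The formulas on this slide did not survive extraction, so the following is only a minimal sketch of the usual R-CNN-style parameterization behind the (dy, dx, dh, dw) values used later in the deck. The box layout (y1, x1, y2, x2) and the helper names `encode_box` and `decode_box` are assumptions made for illustration.

```python
import numpy as np

def encode_box(anchor, gt):
    """Return (dy, dx, dh, dw) that map `anchor` onto `gt`.

    Both boxes are (y1, x1, y2, x2); this layout is an assumption.
    """
    ay1, ax1, ay2, ax2 = anchor
    gy1, gx1, gy2, gx2 = gt
    ah, aw = ay2 - ay1, ax2 - ax1
    gh, gw = gy2 - gy1, gx2 - gx1
    acy, acx = ay1 + 0.5 * ah, ax1 + 0.5 * aw
    gcy, gcx = gy1 + 0.5 * gh, gx1 + 0.5 * gw
    # Centre shifts are relative to the anchor size; size changes are log-ratios.
    return np.array([(gcy - acy) / ah, (gcx - acx) / aw,
                     np.log(gh / ah), np.log(gw / aw)])

def decode_box(anchor, deltas):
    """Apply (dy, dx, dh, dw) to `anchor` and return the refined box."""
    ay1, ax1, ay2, ax2 = anchor
    dy, dx, dh, dw = deltas
    ah, aw = ay2 - ay1, ax2 - ax1
    cy = ay1 + 0.5 * ah + dy * ah
    cx = ax1 + 0.5 * aw + dx * aw
    h, w = ah * np.exp(dh), aw * np.exp(dw)
    return np.array([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w])
```

Encoding a ground-truth box against an anchor and decoding the result with the same anchor should return the ground-truth box, which is what makes these values usable as regression targets.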
Region Proposal Network
• Anchor generator
• Proposal layer: filter out negative anchors with rpn_probs and Non-Maximum Suppression
RPN head network
Anchor generator
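As a rough illustration of how anchors can be generated independently of image content, the sketch below places one anchor per aspect ratio at every feature-map cell. The function name `generate_anchors`, the ratio set, the definition of ratio as width / height, and the (y1, x1, y2, x2) layout are all assumptions, since the slide's own parameters are not recoverable.

```python
import numpy as np

def generate_anchors(feature_h, feature_w, stride, scale, ratios=(0.5, 1.0, 2.0)):
    """Return anchors of shape (feature_h * feature_w * len(ratios), 4).

    Each anchor is (y1, x1, y2, x2) in image pixels and has area scale**2.
    `stride` is the image-to-feature-map downsampling factor.
    """
    ratios = np.asarray(ratios, dtype=np.float64)
    heights = scale / np.sqrt(ratios)   # ratio is taken as width / height
    widths = scale * np.sqrt(ratios)
    ys = np.arange(feature_h) * stride  # anchor centres in image coordinates
    xs = np.arange(feature_w) * stride
    cy, cx = np.meshgrid(ys, xs, indexing="ij")
    cy = cy.reshape(-1, 1)
    cx = cx.reshape(-1, 1)
    boxes = np.stack([cy - heights / 2, cx - widths / 2,
                      cy + heights / 2, cx + widths / 2], axis=-1)
    return boxes.reshape(-1, 4)
```

With a feature pyramid, one plausible use is to call this once per level with that level's stride and scale and concatenate the results.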
IoU – Intersection over Union
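A minimal sketch of the computation: IoU is the area of the intersection of two boxes divided by the area of their union. The (y1, x1, y2, x2) box layout and the function name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (y1, x1, y2, x2)."""
    # Corners of the intersection rectangle.
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes get an empty intersection.
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1, disjoint boxes give 0, and two 2x2 boxes shifted by one unit along one axis give 2 / 6 = 1/3.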
Proposal layer
• Sort all anchors by rpn_probs (how likely an anchor contains an object)
• Choose the top N anchors and discard the rest (e.g., N ~ 6000)
• Apply Non-Maximum Suppression (NMS) to eliminate duplicate boxes, keeping up to M anchors (e.g., M ~ 2000)
Non-Maximum Suppression (NMS)
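Greedy NMS, and the way the proposal layer uses it, can be sketched as follows. This is an illustrative NumPy version, not the implementation behind the slides: the names `nms` and `proposal_layer`, the (y1, x1, y2, x2) layout and the default IoU threshold of 0.7 are assumptions, while N = 6000 and M = 2000 follow the example values given for the proposal layer.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (y1, x1, y2, x2)."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold, max_keep):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < max_keep:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # survivors of this round
    return keep

def proposal_layer(boxes, rpn_probs, top_n=6000, max_keep=2000, iou_threshold=0.7):
    """Sort by rpn_probs, keep the top N, then apply NMS and keep up to M."""
    top = np.argsort(-rpn_probs)[:top_n]
    kept = nms(boxes[top], rpn_probs[top], iou_threshold, max_keep)
    return top[kept]  # indices into the original `boxes` array
```

The sketch assumes boxes with positive area. A lower `iou_threshold` suppresses more aggressively, because boxes need less overlap with a kept box to be treated as duplicates.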
Train the RPN
• Positive boxes: IoU >= 0.7 with any GT box
• Negative boxes: IoU < 0.3 with all GT boxes
• Ratio of positive boxes: 1/3
• Fixed number of anchors sampled per image for training: 256
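The labelling and sampling rules above can be sketched as below. The input is assumed to be a precomputed IoU matrix of shape (num_anchors, num_gt) with at least one ground-truth box; the function names and the use of -1 for anchors that are neither positive nor negative are illustrative choices.

```python
import numpy as np

def label_anchors(iou, pos_thr=0.7, neg_thr=0.3):
    """Label anchors from an IoU matrix of shape (num_anchors, num_gt).

    Returns 1 for positive, 0 for negative and -1 for ignored anchors.
    """
    max_iou = iou.max(axis=1)
    labels = np.full(iou.shape[0], -1, dtype=np.int64)
    labels[max_iou < neg_thr] = 0   # IoU < 0.3 with all GT boxes
    labels[max_iou >= pos_thr] = 1  # IoU >= 0.7 with any GT box
    return labels

def sample_anchors(labels, num_samples=256, pos_fraction=1 / 3, rng=None):
    """Pick up to `num_samples` anchors, at most `pos_fraction` of them positive."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    num_pos = min(len(pos), int(num_samples * pos_fraction))
    num_neg = min(len(neg), num_samples - num_pos)
    # Shuffle, then truncate, so that the selection is random.
    return rng.permutation(pos)[:num_pos], rng.permutation(neg)[:num_neg]
```

Anchors whose best IoU falls between the two thresholds are left out of training entirely in this sketch.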
Loss function
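$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$
• $i$ is the index of an anchor in a mini-batch
• $p_i$ is the predicted probability of anchor $i$ being an object
• The ground-truth label $p_i^*$ is 1 if the anchor is positive, and 0 if the anchor is negative
• $t_i$ is a vector representing the 4 parameterized coordinates (dy, dx, dh, dw) of the predicted bounding box
• $t_i^*$ is that of the ground-truth box associated with a positive anchor
• The classification loss $L_{cls}$ is log loss over two classes (object vs. not object)
• For the regression loss $L_{reg}$, use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $R$ is the smooth L1 function defined as:
$R(x) = 0.5 x^2$ if $|x| < 1$, and $|x| - 0.5$ otherwise
• While both positive and negative anchors contribute to the classification loss, only positive anchors contribute to the regression loss (through the factor $p_i^*$)
• The classification term is normalized by the mini-batch size ($N_{cls} = 256$), the regression term is normalized by the number of anchor locations ($N_{reg} \approx 2400$), and the balancing weight is set to $\lambda = 10$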
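A minimal NumPy sketch of this loss, under the assumption that the default normalizers and weight are the values reported in the Faster R-CNN paper [2] (N_cls = 256, N_reg of about 2400, lambda = 10). The function names and argument layout are illustrative.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x**2 where |x| < 1, and |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """RPN loss over one mini-batch of sampled anchors.

    p:      (N,)   predicted object probabilities
    p_star: (N,)   ground-truth labels, 1 = positive, 0 = negative
    t:      (N, 4) predicted (dy, dx, dh, dw)
    t_star: (N, 4) target (dy, dx, dh, dw); ignored for negative anchors
    """
    eps = 1e-7
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    l_cls = -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # Multiplying by p_star makes only positive anchors contribute to regression.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```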
Mask-RCNN for Instance Segmentation
RoIAlign
• Input image: 1024 x 1024, containing a 540 x 540 object
• Feature map: 16X less, i.e. 1024/16 = 64 per side; the RoI covers 540/16 = 33.75 cells per side
• Small feature map (for each RoI): 7x7, so each bin covers 33.75 / 7 = 4.82 cells
• Use bilinear interpolation to calculate the exact value at each bin
• No quantization
• The small feature map is then passed to the FCN
(From [1])
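The idea can be sketched for a single-channel feature map as below. This is a simplified version that samples one point at the centre of each bin, whereas the method in [1] samples several points per bin and aggregates them. The function names, the (y1, x1, y2, x2) layout and the convention that cell centres sit at integer coordinates are assumptions.

```python
import numpy as np

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at continuous (y, x)."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0, x0] + wx * feature[y0, x1]
    bottom = (1 - wx) * feature[y1, x0] + wx * feature[y1, x1]
    return (1 - wy) * top + wy * bottom

def roi_align(feature, roi, out_size=7):
    """Pool `roi` = (y1, x1, y2, x2), in feature-map coordinates, to out_size x out_size."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size  # e.g. 33.75 / 7 = 4.82, kept fractional
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        for j in range(out_size):
            # No rounding of bin boundaries: sample at the exact bin centre.
            cy = y1 + (i + 0.5) * bin_h
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear(feature, cy, cx)
    return out
```

RoIPool would instead round 33.75 and the bin edges to integers, which shifts the sampled region relative to the object.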
Identify Feature Pyramid level for RoIs
Target level k of a RoI is identified by:
$k = \lfloor k_0 + \log_2(\sqrt{wh} / 224) \rfloor$
• w, h: width & height of a RoI
• 224: canonical ImageNet pre-training size
• k0: target level of the RoI whose w*h = 224^2 (here, k0 = 5)
Crop the RoIs on their feature map (P2, P3, P4 or P5), then resize
Intuitions:
• Features of large RoIs come from a smaller feature map (high semantic level)
• Features of small RoIs come from a larger feature map (low semantic level)
(From [6])
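The level assignment from [6], k = floor(k0 + log2(sqrt(w*h) / 224)), can be sketched as a small function. Clamping the result to the levels P2 to P5 is an assumption based on the levels shown on the slide, and k0 is left as an explicit argument rather than given a default.

```python
import math

def fpn_level(w, h, k0, k_min=2, k_max=5, canonical=224.0):
    """Pyramid level for a RoI of width w and height h (in image pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the available levels (assumed here to be P2..P5).
    return max(k_min, min(k_max, k))
```

With k0 = 5 as on the slide, a 224 x 224 RoI maps to P5 and every halving of the RoI side moves one level down, so a 112 x 112 RoI maps to P4.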
Mask-RCNN head network
• A classifier to identify the class for each RoI: K classes + background
• A regressor to predict the 4 values dy, dx, dh, dw for each RoI
• Fully Convolutional Network (FCN) [5] to predict mask per class
• Represent a mask as m x m matrix
• For each RoI, try to predict mask for each class
• Use sigmoid to predict the probability for each pixel
• Use binary cross-entropy loss to train the network
Mask-RCNN head network architecture
Box and class branch:
• Input: 7x7x256 small feature map (for each RoI)
• Conv1: 7x7 (1024 filters), then Conv2: 1x1 (1024 filters); fully connected layers implemented by CNN, with shared weights over multiple RoIs
• Output: 1024 features
• Dense, K+1 outputs (BG + num classes) with Softmax: BG vs K classes
• Dense, (K+1) x 4 outputs: 4 box regression values dy, dx, dh, dw
Mask branch:
• Input: 14x14x256
• Conv1 ... Conv4 (x 4 conv layers): 3x3 (256 filters), each giving 14x14x256
• Conv Transpose (up sampling): 2x2 (256 filters, stride 2), giving 28x28x256
• Conv: 1x1 (K+1 filters), giving 28x28x(K+1)
• Sigmoid activation, giving 28x28x(K+1): predict mask per class
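The tensor sizes in this diagram can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below assumes that the 3x3 convolutions use padding 1 (so that 14x14 is preserved) and that the 7x7 convolution uses no padding; both choices are inferred from the sizes shown and are not stated on the slide.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride):
    """Spatial output size of a transposed convolution without padding."""
    return (size - 1) * stride + kernel

def head_shapes(num_classes_with_bg):
    """Trace output shapes of both head branches for K+1 classes."""
    k1 = num_classes_with_bg
    # Box and class branch: 7x7x256 -> 7x7 conv -> 1x1 conv -> two dense layers.
    s = conv_out(7, kernel=7)  # 7x7 kernel on a 7x7 map acts like a dense layer
    s = conv_out(s, kernel=1)
    class_branch = {"features": (s, s, 1024),
                    "class_probs": (k1,),
                    "box_deltas": (k1, 4)}
    # Mask branch: 14x14x256 -> four 3x3 convs -> 2x2 transposed conv -> 1x1 conv.
    m = 14
    for _ in range(4):
        m = conv_out(m, kernel=3, padding=1)
    m = conv_transpose_out(m, kernel=2, stride=2)
    m = conv_out(m, kernel=1)
    return class_branch, (m, m, k1)
```

For example, `head_shapes(81)` (80 classes plus background) gives a 1x1x1024 feature vector for the box and class branch and a 28x28x81 mask output.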
Loss functions
• For each sampled RoI, a multi-task loss is applied:
$L = L_{cls} + L_{loc} + L_{mask}$
where
• Lcls is the classification loss
• Lloc is the bounding-box regression loss
• Lmask is the mask loss
• The final loss is calculated as the mean of the loss over samples
Classification loss Lcls
• For a RoI, denote:
• $u$: true class of the RoI
• $p = (p_0, \dots, p_K)$: predicted probability distribution over K+1 classes
• The classification loss Lcls for a RoI is a log-loss calculated as:
$L_{cls}(p, u) = -\log p_u$
Bounding-box regression loss Lloc
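• For a RoI, denote:
• $u$: true class of the RoI
• $v = (v_{dy}, v_{dx}, v_{dh}, v_{dw})$: true bounding-box regression targets of the RoI
• $t^u = (t^u_{dy}, t^u_{dx}, t^u_{dh}, t^u_{dw})$: predicted bounding-box regression for the class u
• The bounding-box regression loss Lloc for the RoI is calculated as:
$L_{loc}(t^u, v) = \sum_{i \in \{dy, dx, dh, dw\}} \mathrm{smooth}_{L1}(t^u_i - v_i)$
where
$\mathrm{smooth}_{L1}(x) = 0.5 x^2$ if $|x| < 1$, and $|x| - 0.5$ otherwise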
Mask loss Lmask
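• For a RoI, denote:
• $u$: true class of the RoI
• $y$, $\hat{y}^u$: the true mask and the predicted mask for the class of the RoI respectively (both m x m)
• The mask loss Lmask for the RoI is the average binary cross-entropy loss, calculated as:
$L_{mask} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[ y_{ij} \log \hat{y}^u_{ij} + (1 - y_{ij}) \log (1 - \hat{y}^u_{ij}) \right]$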
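The three head losses for a single RoI can be sketched together as below. The sketch assumes a foreground RoI, so that box and mask targets exist, and it selects the box deltas and the mask predicted for the true class only. The array layouts and the name `head_loss` are illustrative.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x**2 where |x| < 1, and |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def head_loss(p, u, t, v, mask_true, mask_pred):
    """Multi-task loss for one foreground RoI.

    p:         (K+1,)      softmax class probabilities
    u:         int         true class index
    t:         (K+1, 4)    predicted (dy, dx, dh, dw) per class
    v:         (4,)        target (dy, dx, dh, dw)
    mask_true: (m, m)      binary ground-truth mask
    mask_pred: (K+1, m, m) per-class sigmoid outputs
    Returns (l_cls, l_loc, l_mask, total).
    """
    eps = 1e-7
    l_cls = -np.log(max(p[u], eps))
    l_loc = smooth_l1(t[u] - v).sum()  # only the true class's box is used
    y_hat = np.clip(mask_pred[u], eps, 1.0 - eps)  # only the true class's mask
    l_mask = -np.mean(mask_true * np.log(y_hat)
                      + (1.0 - mask_true) * np.log(1.0 - y_hat))
    return l_cls, l_loc, l_mask, l_cls + l_loc + l_mask
```

Because only the mask of the true class enters the loss, the per-class masks do not compete with each other, which matches the use of a per-pixel sigmoid rather than a softmax across classes.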
Mask-RCNN on COCO data
(From [1])
Evolution of R-CNN
• Mask R-CNN [1] = Faster R-CNN [2] + Fully Convolutional Network [5] (RoIPool → RoIAlign; per-pixel softmax → per-pixel sigmoid)
• Faster R-CNN = Fast R-CNN [3] + Region Proposal Network
• Fast R-CNN = R-CNN [4] + ConvNet on whole input image first, then apply RoIPooling layer
• R-CNN [4]: region proposal on input image, then a ConvNet and a classifier applied to each region
Summary
• Introduced MaskRCNN, an algorithm for Instance Segmentation
• Detects both bounding boxes and masks of objects in an end-to-end neural network
• Improves on RoIPool from Faster-RCNN by using RoIAlign
• Employs a Fully Convolutional Network for mask prediction
References
[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[2] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[3] R. Girshick. Fast R-CNN. In ICCV, 2015.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[5] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[6] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
by Nguyen Phuoc Tat Dat
Appendix: Some popular DL-based algorithms for visual perception tasks
Visual perception tasks and their algorithms:
• Image Classification: AlexNet, Inception, GoogLeNet/Inception v1, ResNet, VGGNet
• Object Detection: Fast/Faster R-CNN, SSD, YOLO
• Semantic Segmentation: Fully Convolutional Network (FCN), U-Net
• Instance Segmentation: Mask R-CNN
Thank you for listening!