Mask-RCNN for Instance Segmentation


Visual perception tasks
[Figure: visual perception tasks]

Agenda
• Visual perception tasks
• Mask-RCNN
• Mask-RCNN architecture
• Feature Pyramid Network
• Region Proposal Network
• RoIAlign
• Mask-RCNN head network
• Result
• Summary

Introduction to MaskRCNN
• Mask-RCNN stands for Mask-Region Convolutional Neural Network
• State-of-the-art algorithm for Instance Segmentation
• Evolved through 4 main versions:
• RCNN → Fast-RCNN → Faster-RCNN → Mask-RCNN
• The first 3 versions are for Object Detection
• Improvement over Faster-RCNN: uses RoIAlign instead of RoIPool
• Employs a Fully Convolutional Network (FCN) for mask prediction, predicting a mask for each
class independently
• 2 main stages:
• 1st stage: use Region Proposal Network (RPN) to propose candidate object bounding boxes
• 2nd stage: classify the candidate boxes, refine the boxes and predict masks

Terms
• Bounding box: rectangle identifying location of an object
• Mask: set of pixels which belong to an object
• Anchor: a bounding box generated independently of image content
• RoI: Region of Interest, a bounding box which may contain an object
• Non-Maximum Suppression (NMS): a method to eliminate duplicated bounding boxes using scores
and an IoU threshold
• IoU: Intersection over Union, a metric measuring how much two areas overlap (area of intersection divided by area of union)
• RoIAlign: a method to extract features for RoIs from feature maps
• Feature Pyramid Network (FPN): a neural network to extract feature maps at different scales
• Region Proposal Network (RPN): a neural network to propose RoIs for an image
• Fully Convolutional Network (FCN): a convolution-based neural network to extract masks

MaskRCNN architecture
[Figure: Mask-RCNN architecture. The classifier output covers background + number of classes.]

Approaches for multiple scaled objects
[Figure: approaches for multiple scaled objects]

Feature Pyramid Network (FPN)
[Figure: Feature Pyramid Network]

To detect boundaries of objects: bounding box regression

Bounding box regression
[Figure: bounding box regression]
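Bounding box regression is usually expressed as four offsets between a reference box (for example an anchor) and a target box. The sketch below assumes the common Faster R-CNN style parameterization, with the (dy, dx, dh, dw) ordering used elsewhere in this deck and boxes given as (y1, x1, y2, x2). The function names `encode_box` and `decode_box` are illustrative only.

```python
import numpy as np

def _center_size(box):
    """Return (center_y, center_x, height, width) of a (y1, x1, y2, x2) box."""
    h, w = box[2] - box[0], box[3] - box[1]
    return box[0] + 0.5 * h, box[1] + 0.5 * w, h, w

def encode_box(anchor, target):
    """Offsets (dy, dx, dh, dw) that move `anchor` onto `target`."""
    acy, acx, ah, aw = _center_size(anchor)
    tcy, tcx, th, tw = _center_size(target)
    return np.array([(tcy - acy) / ah,      # shift in y, relative to anchor height
                     (tcx - acx) / aw,      # shift in x, relative to anchor width
                     np.log(th / ah),       # log scale change in height
                     np.log(tw / aw)])      # log scale change in width

def decode_box(anchor, delta):
    """Apply offsets (dy, dx, dh, dw) to `anchor`; inverse of encode_box."""
    acy, acx, ah, aw = _center_size(anchor)
    cy, cx = acy + delta[0] * ah, acx + delta[1] * aw
    h, w = ah * np.exp(delta[2]), aw * np.exp(delta[3])
    return np.array([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w])
```

Encoding a box against itself gives all-zero offsets, and decoding the encoded offsets recovers the target box.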

Region Proposal Network
• Anchor generator
• Proposal layer: filter out negative anchors with rpn_probs and Non-Maximum Suppression

RPN head network
[Figure: RPN head network]

Anchor generator
[Figure: anchor generator]
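Because anchors are generated independently of image content, they can be computed from the feature map shape alone. The following is a minimal sketch under assumptions that are not taken from the slides: one anchor scale per feature map, aspect ratio defined as width / height, anchors centered on the top-left corner of each stride cell, and boxes in (y1, x1, y2, x2) image coordinates. The name `generate_anchors` and its parameters are hypothetical.

```python
import numpy as np

def generate_anchors(scale, ratios, feature_shape, feature_stride):
    """Return an (N, 4) array of anchors (y1, x1, y2, x2) in image pixels.

    One anchor per ratio is placed at every feature-map cell, so
    N = feature_shape[0] * feature_shape[1] * len(ratios).
    """
    ratios = np.asarray(ratios, dtype=float)
    # Keep the area at scale**2 while varying the aspect ratio (assumption).
    heights = scale / np.sqrt(ratios)
    widths = scale * np.sqrt(ratios)

    # Anchor centers in image coordinates.
    ys = np.arange(feature_shape[0]) * feature_stride
    xs = np.arange(feature_shape[1]) * feature_stride
    cy, cx = np.meshgrid(ys, xs, indexing="ij")
    cy, cx = cy.reshape(-1, 1), cx.reshape(-1, 1)

    # Broadcast every center against every (height, width) pair.
    boxes = np.stack([cy - 0.5 * heights, cx - 0.5 * widths,
                      cy + 0.5 * heights, cx + 0.5 * widths], axis=-1)
    return boxes.reshape(-1, 4)
```

With an FPN, a function like this would typically be called once per pyramid level with that level's stride and scale.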

IoU – Intersection over Union
[Figure: IoU examples for pairs of boxes with different overlap]
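IoU is the area of the intersection of two boxes divided by the area of their union. A minimal sketch, assuming axis-aligned boxes given as (y1, x1, y2, x2); the function name `iou` is illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (y1, x1, y2, x2) boxes."""
    # Corners of the intersection rectangle.
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])

    # Height or width is clamped to zero when the boxes do not overlap.
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1, disjoint boxes give 0, and two 10 x 10 boxes shifted by half their width give 50 / 150 = 1/3.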

Proposal layer
• Sort all anchors by rpn_probs (how likely an anchor contains an object)
• Choose the top N anchors and discard the rest (e.g., N ~ 6000)
• Apply Non-Maximum Suppression (NMS) to eliminate duplicated boxes.
Keep up to M anchors (e.g., M ~ 2000).

Non-Maximum Suppression (NMS)
[Figure: NMS procedure and example]
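The proposal layer steps (sort by score, keep the top N, apply NMS, keep up to M) can be sketched together with a greedy NMS. This is an illustration under assumptions: boxes are (y1, x1, y2, x2) with positive area, and the IoU threshold of 0.7 is a placeholder value that does not come from the slides. The names `nms` and `proposal_layer` are hypothetical.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU of one box against an (N, 4) array of boxes."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold, max_keep):
    """Greedy NMS: repeatedly keep the best box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]          # indices, highest score first
    keep = []
    while order.size > 0 and len(keep) < max_keep:
        best = order[0]
        keep.append(int(best))
        overlaps = iou_one_to_many(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps <= iou_threshold]
    return keep

def proposal_layer(boxes, rpn_probs, top_n=6000, max_out=2000, iou_threshold=0.7):
    """Sort by rpn_probs, keep top_n, run NMS, return up to max_out boxes."""
    top = np.argsort(rpn_probs)[::-1][:top_n]
    keep = nms(boxes[top], rpn_probs[top], iou_threshold, max_out)
    return boxes[top][keep]
```

For example, with two heavily overlapping boxes and one separate box, a threshold of 0.5 keeps only the higher-scoring box of the overlapping pair plus the separate box.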

Train the RPN
• Positive boxes: IoU >= 0.7 with any GT box
• Negative boxes: IoU < 0.3 with all GT boxes
• Ratio of positive boxes: 1/3
• Fixed number of anchors sampled per image for training: 256
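The positive and negative rules above can be sketched as a labeling step over an anchor-by-GT IoU matrix. Marking the remaining anchors (IoU between 0.3 and 0.7) with -1 so that they are ignored is an assumption of this sketch, as are the function and parameter names.

```python
import numpy as np

def label_anchors(iou_matrix, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors for RPN training.

    iou_matrix[a, g] is the IoU between anchor a and ground-truth box g.
    Returns 1 for positive, 0 for negative, -1 for ignored anchors.
    """
    max_iou = iou_matrix.max(axis=1)                 # best overlap per anchor
    labels = np.full(iou_matrix.shape[0], -1, dtype=int)
    labels[max_iou < neg_thresh] = 0                 # IoU < 0.3 with all GT boxes
    labels[max_iou >= pos_thresh] = 1                # IoU >= 0.7 with any GT box
    return labels
```

A sampler would then draw the fixed number of anchors per image from the positive and negative sets according to the chosen ratio.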

Loss function
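• The RPN loss is defined as:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
• $i$ is the index of an anchor in a mini-batch
• $p_i$ is the predicted probability of anchor $i$ being an object
• Ground truth label $p_i^*$ is 1 if the anchor is positive, and is 0 if the anchor is negative
• $t_i$ is a vector representing the 4 parameterized coordinates (dy, dx, dh, dw) of the predicted bounding box
• $t_i^*$ is that of the ground-truth box associated with a positive anchor
• Classification loss $L_{cls}$ is log loss over two classes (object vs. not object)
• For regression loss $L_{reg}$, use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $R$ is smooth L1 defined as:
$$R(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
• While both positive and negative anchors contribute to classification loss, only positive anchors contribute to regression loss (the factor $p_i^*$ is 0 for negative anchors).
• The classification term is normalized by the mini-batch size ($N_{cls} = 256$), the regression term is normalized by the number of anchor locations ($N_{reg} \approx 2400$ in [2]), and [2] sets $\lambda = 10$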
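The RPN loss described above can be sketched numerically as follows. This is a minimal sketch, assuming p holds predicted objectness probabilities, p_star the 0/1 anchor labels, and t and t_star the predicted and target (dy, dx, dh, dw) rows; the argument names and the clipping constant are choices of this example.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x**2 where |x| < 1, |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam):
    """Classification log loss plus regression loss over positive anchors."""
    p = np.clip(p, 1e-7, 1 - 1e-7)                        # avoid log(0)
    l_cls = -(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))
    l_reg = smooth_l1(t - t_star).sum(axis=1)             # sum over 4 coordinates
    # p_star masks the regression term so only positive anchors contribute.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```

Multiplying the regression term by p_star is what removes negative anchors from the regression loss.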


RoIAlign
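• Input image: 1024 x 1024, containing a 540 x 540 object
• The feature map is 16X smaller than the input image: 1024/16 = 64
• The RoI on the feature map is 540/16 = 33.75 wide, kept as 33.75 (no quantization)
• Small feature map (for each RoI): 7x7, so each bin spans 33.75 / 7 = 4.82
• Use bilinear interpolation to calculate the exact value at each bin
• The small feature map is passed on to the FCN
(From [1])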
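The RoIAlign idea (fractional RoI coordinates, bilinear interpolation, no rounding) can be sketched for a single-channel feature map. This is a simplified illustration and not the reference implementation: it averages a small grid of sample points per bin, ignores half-pixel offset conventions, and uses hypothetical names `bilinear` and `roi_align`.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional point (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    top = feat[y0, x0] * (1 - lx) + feat[y0, x1] * lx
    bottom = feat[y1, x0] * (1 - lx) + feat[y1, x1] * lx
    return top * (1 - ly) + bottom * ly

def roi_align(feat, roi, out_size=7, samples=2):
    """Pool a fractional RoI (y1, x1, y2, x2) into an out_size x out_size map."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size          # e.g. 33.75 / 7 = 4.82, never rounded
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            # Regularly spaced sample points inside bin (i, j).
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[i, j] = np.mean(vals)
    return out
```

Using the numbers from the slide, `roi_align(feat, (0, 0, 33.75, 33.75))` on a 64 x 64 feature map yields a 7 x 7 output whose bins each cover about 4.82 feature-map cells.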

Identify Feature Pyramid level for RoIs
• Target level k of a RoI is identified by:
$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh} / 224\right) \right\rfloor$$
• w, h: width & height of a RoI
• 224: canonical ImageNet pre-training size
• k0: target level of the RoI whose w*h = 224^2 (here, k0 = 5)
• Crop the RoIs on their feature map (P2, P3, P4 or P5), then resize
• Intuitions:
• Features of large RoIs come from a smaller feature map (high semantic level)
• Features of small RoIs come from a larger feature map (low semantic level)
(From [6])
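The level assignment can be sketched as a small function. It assumes the formula from [6] with the k0 = 5 stated on the slide; clamping the result to the P2..P5 range is an assumption of this sketch, and the name `fpn_level` is illustrative.

```python
import math

def fpn_level(w, h, k0=5, k_min=2, k_max=5, canonical=224):
    """Pyramid level for a RoI of width w and height h (in image pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the available pyramid levels P2..P5 (assumption).
    return max(k_min, min(k_max, k))
```

With these defaults a 224 x 224 RoI maps to level 5, a 112 x 112 RoI to level 4, and very small RoIs are clamped to level 2.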

Mask-RCNN head network
• A classifier to identify the class for each RoI: K classes + background
• A regressor to predict the 4 values dy, dx, dh, dw for each RoI
• Fully Convolutional Network (FCN) [5] to predict mask per class
• Represent a mask as an m x m matrix
• For each RoI, predict a mask for each class
• Use sigmoid to predict the probability for each pixel
• Use binary cross-entropy loss to train the network

Mask-RCNN head network architecture
Class and box branch (fully connected layers implemented by CNN, shared weights over multiple RoIs):
• Input: 7x7x256 small feature map (for each RoI)
• Conv1: 7x7 (1024 filters)
• Conv2: 1x1 (1024 filters), output 1024
• Dense + Softmax: K+1 outputs (BG + num classes), BG vs K classes
• Dense: (K+1) x 4 outputs, 4 box regression values: dy, dx, dh, dw
Mask branch:
• Input: 14x14x256 small feature map (for each RoI)
• Conv1 ... Conv4: x 4 conv layers, each 3x3 (256 filters), output 14x14x256
• Conv Transpose (up sampling): 2x2 (256 filters) (stride 2), output 28x28x256
• Conv: 1x1 (K+1 filters), output 28x28x(K+1)
• Sigmoid activation: output 28x28x(K+1), predict mask per class

Loss functions
• For each sampled RoI, a multi-task loss is applied:
$$L = L_{cls} + L_{loc} + L_{mask}$$
where
• Lcls is classification loss
• Lloc is bounding-box regression loss
• Lmask is mask loss
• The final loss is calculated as the mean of the loss over samples

Classification loss Lcls
• For a RoI, denote:
• $u$: true class of the RoI
• $p = (p_0, \dots, p_K)$: predicted probability distribution over K+1 classes
• The classification loss Lcls for a RoI is a log-loss calculated as:
$$L_{cls}(p, u) = -\log p_u$$

Bounding-box regression loss Lloc
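• $v$: true bounding-box regression targets of the RoI
• $t^u$: predicted bounding-box regression for the class u.
• The bounding-box regression loss Lloc for the RoI is calculated as:
$$L_{loc}(t^u, v) = \sum_{i \in \{dy, dx, dh, dw\}} \mathrm{smooth}_{L1}(t^u_i - v_i)$$
where
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$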

Mask loss Lmask
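• $y, \hat{y}$: the true mask and the predicted mask for the class of the RoI
respectively (both m x m)
• The mask loss Lmask for the RoI is the average binary cross-entropy
loss, calculated as:
$$L_{mask} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log(1 - \hat{y}_{ij}) \right]$$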
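The three head losses for a single RoI can be sketched as below. This is a minimal sketch under assumptions: p is the softmax output over K+1 classes, u the true class index, t_u and v the predicted and target box offsets for class u, and y, y_hat the true and predicted m x m masks for that class. All names are placeholders chosen for this example.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x**2 where |x| < 1, |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def cls_loss(p, u):
    """Log loss of the true class u under the predicted distribution p."""
    return -np.log(p[u])

def loc_loss(t_u, v):
    """Smooth L1 summed over the 4 box regression values."""
    return smooth_l1(t_u - v).sum()

def mask_loss(y, y_hat, eps=1e-7):
    """Average per-pixel binary cross-entropy between masks y and y_hat."""
    y_hat = np.clip(y_hat, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def roi_loss(p, u, t_u, v, y, y_hat):
    """Multi-task loss for one RoI: classification + box + mask."""
    return cls_loss(p, u) + loc_loss(t_u, v) + mask_loss(y, y_hat)
```

Only the mask predicted for the RoI's true class enters the mask loss, which is why masks for different classes do not compete with each other.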

Mask-RCNN on COCO data
(From [1])

Evolution of R-CNN
• R-CNN [4]: region proposal on input image
• Fast R-CNN [3] = R-CNN [4] + ConvNet on whole input image first, then apply RoIPooling layer
• Faster R-CNN [2] = Fast R-CNN [3] + Region Proposal Network
• Mask R-CNN [1] = Faster R-CNN [2] + Fully Convolutional Network [5], with RoIPool replaced by RoIAlign and per-pixel softmax replaced by per-pixel sigmoid

Summary
• Introduced Mask-RCNN, an algorithm for Instance Segmentation
• Detects both bounding boxes and masks of objects in an end-to-end
neural network
• Replaces RoIPool from Faster-RCNN with RoIAlign
• Employs a Fully Convolutional Network for mask prediction

References
[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[2] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection
with region proposal networks. In NIPS, 2015.
[3] R. Girshick. Fast R-CNN. In ICCV, 2015.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object
detection and semantic segmentation. In CVPR, 2014.
[5] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In CVPR, 2015.
[6] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid
networks for object detection. In CVPR, 2017.
by Nguyen Phuoc Tat Dat

Appendix: Some popular DL-based algorithms for visual perception tasks
Visual perception tasks and their algorithms:
• Image Classification: AlexNet, Inception, GoogLeNet/Inception v1, ResNet, VGGNet
• Object Detection: Fast/Faster R-CNN, SSD, YOLO
• Semantic Segmentation: Fully Convolutional Network (FCN), U-Net
• Instance Segmentation: Mask R-CNN
