Online video object segmentation via convolutional trident network

Online Video Object Segmentation via
Convolutional Trident Network
Seminar at Naver
Won-Dong Jang
Korea University
2017-08-22

Video object segmentation
• Clustering pixels in videos into objects or background
• Unsupervised
• Supervised
• Semi-supervised
Segment track

• Unsupervised video object segmentation
• Discover and segment a primary object in a video
• Without user annotations
Input videos, ground-truth segmentation labels

• Unsupervised video object segmentation
• Saliency detection-based approach
• Visual saliency
• Motion saliency
Examples of saliency detection

• Supervised video object segmentation
• A user can annotate mislabeled pixels in any frame
• Interaction between an algorithm and a user
Annotation
at the first frame
Segmentation results after adding the annotation
Segmentation results
Additional
user annotation

• Semi-supervised video object segmentation
• Discover and segment a primary object in a video
• With user annotations in the first frame
Annotated target object
in the first frame
Segmentation results

Related works
• Object proposal-based algorithm
• Generate object proposals in each frame
• Select object proposals that are similar to the target object
[2015][ICCV][Perazzi] Fully Connected Object Proposals for Video Segmentation
Generated object proposals
Segment track

Related works
• Superpixel-based algorithm
• Over-segment each frame into superpixels
• Trace the target object based on the inter-frame matching
[2015][CVPR][Wen] Joint Online Tracking and Segmentation
Inter-frame matching, segment track

In this presentation
• A semi-supervised online segmentation algorithm
• Online method
• Offline techniques require a huge memory space for a long video
• Deep learning-based approach
• Connection between a convolutional neural network and an MRF
optimization strategy
• Remarkable performance on the DAVIS benchmark dataset

Overview
• Framework
• Inter-frame propagation
• Inference via convolutional trident network (CTN)
• Yields three tailored probability maps for the MRF optimization
• MRF optimization

Inter-frame propagation
• Propagation from the previous frame
• To roughly locate the target object in the current frame
• Using backward optical flow
• From 𝑡 to 𝑡 − 1
Segmentation label map
for frame 𝑡 − 1
Backward motion
(optical flow)
Propagation map
for frame 𝑡
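The propagation step can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes the backward flow stores, for each pixel of frame 𝑡, a (dx, dy) displacement pointing to its position in frame 𝑡 − 1, and it uses nearest-neighbor lookup. The function name `propagate_labels` is hypothetical.

```python
def propagate_labels(prev_labels, backward_flow):
    """Warp the label map of frame t-1 into frame t.

    prev_labels: 2-D list of 0/1 labels for frame t-1.
    backward_flow: 2-D list of (dx, dy) vectors; pixel (x, y) in frame t
    is assumed to correspond to (x + dx, y + dy) in frame t-1.
    """
    height, width = len(prev_labels), len(prev_labels[0])
    propagated = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = backward_flow[y][x]
            # Nearest-neighbor lookup in frame t-1 (an assumption of this sketch).
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                propagated[y][x] = prev_labels[src_y][src_x]
    return propagated


# Toy example: the object sits at x=0 in frame t-1 and moves one pixel right.
prev_labels = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
backward_flow = [[(-1, 0)] * 3 for _ in range(3)]
propagation_map = propagate_labels(prev_labels, backward_flow)
```

Pixels whose source falls outside frame 𝑡 − 1 are left as background here; that choice is also an assumption.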

Convolutional trident network
• Infer segmentation information
• The propagation map may be inaccurate
• The inferred information is effectively used to solve a binary
labeling problem

Network architecture
• Encoder-decoder architecture
• Single encoder
• VGG encoder
• Three decoders
• Separative decoder
• Definite foreground decoder
• Definite background decoder

• VGG encoder
• Trained for image classification
• Using ImageNet dataset
22K categories
14M images

Figure: convolutional layers followed by fully connected layers, classifying an input image into categories such as fox, cat, lion, dog, zebra, umbrella, tank, lamp, desk, orange, kite, …

• Separative decoder (SD)
• Separate a target object from the background
• Using a down-sampled foreground propagation patch
Figure: the VGG encoder (Conv1_1 through Conv5_3, with four pooling layers) feeds the separative decoder (SD-Dec1 through SD-Dec5 with four unpooling layers, followed by SD-Pred), with skip connections. Inputs: image patch and foreground propagation patch. Output: separative probability patch.

• Definite foreground and background decoders
• Definite pixels indicate locations that should be labeled as the
foreground or the background indubitably
• Fixing labels in definite pixels improves labeling accuracies
• Inspired by the image matting problem
Input image, tri-map, matting result

• Definite foreground decoder (DFD)
• Identifies definite foreground pixels
Figure: the VGG encoder (Conv1_1 through Conv5_3, with four pooling layers) feeds the definite foreground decoder (DFD-Dec1 through DFD-Dec5 with four unpooling layers, followed by DFD-Pred). Inputs: image patch and foreground propagation patch. Output: definite foreground probability patch.

• Definite background decoder (DBD)
• Finds definite background pixels
Figure: the VGG encoder (Conv1_1 through Conv5_3, with four pooling layers) feeds the definite background decoder (DBD-Dec1 through DBD-Dec5 with four unpooling layers, followed by DBD-Pred). Inputs: image patch and background propagation patch. Output: definite background probability patch.

• Implementation issues in decoders
• Prediction layer
• Sigmoid layer is used to yield normalized outputs within [0, 1]
• Rectified linear unit (ReLU) + Batch normalization
• Kernel size
• 3 × 3 kernels are used in the prediction layers
• 5 × 5 kernels are used in the other convolution layers

Training phase
• Lack of video object segmentation dataset
• There are several datasets for video object segmentation
• However, each of them contains only a small number of videos, from 12 to 59
• Instead,
• We use the PASCAL VOC 2012 dataset
• Object segmentation
• 26,844 object masks
• 11,355 images

Training phase
• Preprocessing of training data
• Ground-truth masks for the DFD and DBD are not available
• Hence, we generate them through simple image processing

Training phase
• Preprocessing of training data
• Degrade the object mask to imitate propagation errors
• By performing suppression and noise addition
• Synthesize the ground-truth masks for the DFD and DBD
• By applying erosion and dilation
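The erosion and dilation step can be illustrated as follows. This is a sketch under assumptions the slides do not state: a square structuring element of a chosen radius, the definite foreground taken as the eroded object mask, and the definite background taken as the complement of the dilated object mask. The degradation step (suppression and noise addition) is not covered here, and all function names are hypothetical.

```python
def _morph(mask, radius, combine):
    """Apply min (erosion) or max (dilation) over a square window."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # The window is clipped at the image border.
            window = [
                mask[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = combine(window)
    return out


def erode(mask, radius=1):
    return _morph(mask, radius, min)


def dilate(mask, radius=1):
    return _morph(mask, radius, max)


def synthesize_definite_masks(object_mask, radius=1):
    """Return (definite foreground mask, definite background mask)."""
    definite_fg = erode(object_mask, radius)
    dilated = dilate(object_mask, radius)
    definite_bg = [[1 - v for v in row] for row in dilated]
    return definite_fg, definite_bg


# Toy example: a 3 x 3 object centered in a 7 x 7 image.
object_mask = [[1 if 2 <= x <= 4 and 2 <= y <= 4 else 0 for x in range(7)]
               for y in range(7)]
definite_fg, definite_bg = synthesize_definite_masks(object_mask)
```

Pixels that fall in neither mask form an uncertain band around the object boundary, similar in spirit to the unknown region of a matting tri-map.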

Training phase
• Implementation issues
• Caffe library
• Cross-entropy losses
$$-\frac{1}{N} \sum_{n=1}^{N} \left[ p_n \log \hat{p}_n + (1 - p_n) \log (1 - \hat{p}_n) \right]$$
• Minibatch
• Eight training samples per minibatch
• Learning rate
• 1e-3 for the first 55 epochs
• 1e-4 for the next 35 epochs
• Stochastic gradient descent
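For reference, the cross-entropy loss and the two-stage learning rate listed above can be written out in plain Python. This is only an illustrative sketch (the actual training used Caffe); the clamping constant `eps` and the 0-based epoch convention are assumptions added for the example.

```python
import math


def cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between ground-truth p_n and predicted p_hat_n."""
    total = 0.0
    for p, p_hat in zip(targets, predictions):
        # Clamp predictions to avoid log(0); eps is an assumption of this sketch.
        p_hat = min(max(p_hat, eps), 1.0 - eps)
        total += p * math.log(p_hat) + (1.0 - p) * math.log(1.0 - p_hat)
    return -total / len(targets)


def learning_rate(epoch):
    """1e-3 for the first 55 epochs, 1e-4 afterwards (epochs counted from 0)."""
    return 1e-3 if epoch < 55 else 1e-4


loss = cross_entropy([1.0, 0.0], [0.5, 0.5])
```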

Inference phase
• Set input data
• By cropping the frame and propagation map
• CTN outputs three probability patches
• Separative probability map 𝑅S
• Definite foreground probability map 𝑅F
• Definite background probability map 𝑅B
Figure: pipeline for the input at frame 𝑡. Inter-frame propagation: the segmentation label at frame 𝑡 − 1 and the optical flow produce the propagation map at frame 𝑡. Inference via convolutional encoder-decoder network: the encoder takes the image patch; the separative decoder and the definite foreground decoder use the foreground propagation patch, and the definite background decoder uses the background propagation patch. Outputs: separative, definite foreground, and definite background probability patches.

Inference phase
• Classification
• Separative probability map 𝑅S
• If 𝑅S(𝐩) > 𝜃sep, pixel 𝐩 is classified as the foreground
• Let ℒ be the coordinate set of such foreground pixels
• Definite foreground probability map 𝑅F
• If 𝑅F(𝐩) > 𝜃def, pixel 𝐩 is classified as the definite foreground
• ℱ denotes the set of the definite foreground pixels
• Definite background probability map 𝑅B
• If 𝑅B(𝐩) > 𝜃def, pixel 𝐩 is classified as the definite background
• ℬ indicates the set of the definite background pixels
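The three thresholding rules can be sketched as below. The function name and the threshold values in the example are placeholders; the slides do not give the values of 𝜃sep and 𝜃def.

```python
def classify_pixels(r_sep, r_fg, r_bg, theta_sep, theta_def):
    """Return the pixel sets L (foreground), F (definite fg), B (definite bg).

    Each map is a 2-D list of probabilities; pixels are stored as (x, y).
    """
    foreground, definite_fg, definite_bg = set(), set(), set()
    for y in range(len(r_sep)):
        for x in range(len(r_sep[0])):
            if r_sep[y][x] > theta_sep:
                foreground.add((x, y))
            if r_fg[y][x] > theta_def:
                definite_fg.add((x, y))
            if r_bg[y][x] > theta_def:
                definite_bg.add((x, y))
    return foreground, definite_fg, definite_bg


# Toy 1 x 2 maps with placeholder thresholds (not taken from the slides).
L_set, F_set, B_set = classify_pixels(
    r_sep=[[0.9, 0.2]], r_fg=[[0.95, 0.1]], r_bg=[[0.0, 0.99]],
    theta_sep=0.5, theta_def=0.9,
)
```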

MRF optimization
• Solve two-class MRF optimization problem
• To improve the segmentation quality further
• Define a graph 𝐺 = (𝑁, 𝐸)
• Nodes are pixels in the current frame
• Each node is connected to its four neighbors by edges

MRF optimization
• MRF energy function
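$$\mathcal{E}(S) = \sum_{\mathbf{p} \in N} \mathcal{D}(\mathbf{p}, S) + \gamma \times \sum_{(\mathbf{p}, \mathbf{q}) \in E} \mathcal{Z}(\mathbf{p}, \mathbf{q}, S)$$
Figure: image patch, separative probability patch, definite foreground probability patch, definite background probability patch
• Unary cost 𝒟
• Returns an extremely high cost when a definite foreground (DF) pixel is labeled as the background or a definite background (DB) pixel is labeled as the foreground
• Pairwise cost 𝒵
• Encourages neighboring pixels to have the same label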

MRF optimization
• Unary cost computation
• Build the RGB color Gaussian mixture models (GMMs) of the
foreground and the background, respectively
• 𝐾 = 10 for both GMMs
• Use pixels in ℒ to construct the foreground GMMs
• Use pixels in $\mathcal{L}^c$ (the complement of ℒ) to construct the background GMMs
• Gaussian cost
$$\psi(\mathbf{p}, s) = \min_{k} \left[ -\log f(\mathbf{p}; \mathcal{M}_{s,k}) \right]$$
• Unary cost
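$$\mathcal{D}(\mathbf{p}, S) = \begin{cases} \infty & \text{if } \mathbf{p} \in \mathcal{F} \text{ and } S(\mathbf{p}) = 0 \\ \infty & \text{if } \mathbf{p} \in \mathcal{B} \text{ and } S(\mathbf{p}) = 1 \\ \psi(\mathbf{p}, S(\mathbf{p})) & \text{otherwise} \end{cases}$$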
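A small sketch of the unary cost follows. It assumes label 1 is the foreground and label 0 the background, and it models each GMM component as an isotropic Gaussian given by a (mean, variance) pair; the slides do not specify the covariance form, and mixture weights are ignored. All names are hypothetical.

```python
import math


def neg_log_gaussian(color, mean, variance):
    """-log density of an isotropic Gaussian over a color vector."""
    dim = len(color)
    sq_dist = sum((c - m) ** 2 for c, m in zip(color, mean))
    return 0.5 * sq_dist / variance + 0.5 * dim * math.log(2.0 * math.pi * variance)


def gaussian_cost(color, components):
    """Minimum negative log-likelihood over the components of one GMM."""
    return min(neg_log_gaussian(color, mean, var) for mean, var in components)


def unary_cost(pixel, label, color, gmms, definite_fg, definite_bg):
    """Infinite cost for contradicting a definite pixel, Gaussian cost otherwise."""
    if pixel in definite_fg and label == 0:
        return math.inf
    if pixel in definite_bg and label == 1:
        return math.inf
    return gaussian_cost(color, gmms[label])


# Toy GMMs: one background component and two foreground components.
gmms = {0: [((0.0, 0.0, 0.0), 1.0)],
        1: [((1.0, 1.0, 1.0), 1.0), ((5.0, 5.0, 5.0), 1.0)]}
forbidden = unary_cost((0, 0), 0, (1.0, 1.0, 1.0), gmms, {(0, 0)}, set())
allowed = unary_cost((0, 0), 1, (1.0, 1.0, 1.0), gmms, {(0, 0)}, set())
```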

MRF optimization
• Pairwise cost computation
• Pairwise cost
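$$\mathcal{Z}(\mathbf{p}, \mathbf{q}, S) = \begin{cases} \exp(-d(\mathbf{p}, \mathbf{q})) & \text{if } S(\mathbf{p}) \neq S(\mathbf{q}) \\ 0 & \text{otherwise} \end{cases}$$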
• Graph-cut optimization
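$$\mathcal{E}(S) = \sum_{\mathbf{p} \in N} \mathcal{D}(\mathbf{p}, S) + \gamma \times \sum_{(\mathbf{p}, \mathbf{q}) \in E} \mathcal{Z}(\mathbf{p}, \mathbf{q}, S)$$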
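To make the energy concrete, the sketch below evaluates ℰ(𝑆) for a given labeling on a 4-connected grid. It only evaluates the energy; the minimization by graph cut is not implemented here. The `unary` and `distance` callables stand in for 𝒟 and 𝑑, whose exact forms are supplied by the caller, and the function name is hypothetical.

```python
import math


def mrf_energy(labels, unary, distance, gamma):
    """Evaluate the MRF energy of a labeling on a 4-connected pixel grid.

    labels: 2-D list of 0/1 labels.
    unary(x, y, label): unary cost of giving `label` to pixel (x, y).
    distance(p, q): dissimilarity d between neighboring pixels p and q.
    """
    height, width = len(labels), len(labels[0])
    energy = 0.0
    for y in range(height):
        for x in range(width):
            energy += unary(x, y, labels[y][x])
            # Visit right and down neighbors so each edge is counted once.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height and labels[y][x] != labels[ny][nx]:
                    energy += gamma * math.exp(-distance((x, y), (nx, ny)))
    return energy


# Toy example: constant unary cost and zero distance between neighbors.
mixed = mrf_energy([[0, 1]], lambda x, y, s: 1.0, lambda p, q: 0.0, gamma=2.0)
uniform = mrf_energy([[1, 1]], lambda x, y, s: 1.0, lambda p, q: 0.0, gamma=2.0)
```

In the toy example the mixed labeling pays the pairwise penalty once, so it costs more than the uniform labeling.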

Reappearing object detection
• Identification of reappearing parts
• A target object may disappear from view or be occluded by other objects, and reappear later
• Use backward-forward motion consistency

Experimental results
• The DAVIS benchmark dataset
• 50 videos
• 854 × 480 resolution
• Number of frames
• From 25 to 104
• Difficulties
• Fast motion
• Occlusion
• Object deformation

• Performance measures
• Region similarity
• Jaccard index
• Contour accuracy
• F-measure
• Statistics
• Mean
• Recall
• Decay
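The region similarity measure and two of the statistics can be sketched as follows. The Jaccard index is the standard intersection over union of the predicted and ground-truth masks. The recall threshold of 0.5 and the value returned for two empty masks are assumptions of this sketch; decay and the contour F-measure are not covered.

```python
def jaccard_index(predicted, ground_truth):
    """Intersection over union of two binary masks (2-D lists of 0/1)."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (an assumption).
    return intersection / union if union else 1.0


def mean_and_recall(scores, threshold=0.5):
    """Mean score and fraction of scores above the threshold."""
    mean = sum(scores) / len(scores)
    recall = sum(1 for s in scores if s > threshold) / len(scores)
    return mean, recall


j = jaccard_index([[1, 1, 0]], [[0, 1, 1]])
mean_score, recall_score = mean_and_recall([0.2, 0.6, 0.8, 0.4])
```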

• Performance comparison

• Qualitative results

• The SegTrack dataset
• 5 videos
• Performance comparison
• Jaccard index

• Ablation studies
• Effectiveness of each decoder
• Efficacy of MRF optimization

• Running time analysis
• Prop-Q
• Uses the state-of-the-art optical flow technique
• Prop-F
• Adopts a much faster optical flow technique
• Parameter selection

Conclusions
• A semi-supervised online video object segmentation
algorithm is introduced in this presentation
• Deep learning-based semi-supervised video object
segmentation algorithm
• Tailored network for MRF optimization
• Remarkable performance on the DAVIS dataset
• Q&A
• Thank you
