
SOLUTION FOR SQUAREMIND__MLCV_CHALLENGE_01

FIRST STEP: CHECK THE RESULT OF EXISTING ONLINE APPLICATIONS


The first step that came to my mind when receiving this challenge was to see what existing software can do.
Let's take a look at that:
I tried the online application by PhotoRoom at https://www.photoroom.com/background-remover/, which is based on a Transformer model (SegFormer; I had a call with them). The model does quite well, but it still cannot fully remove the background: parts of the background are left in the result. That means the problem is challenging.

Figure 1: First attempt with PhotoRoom


So what makes the model confused?
• The indoor environment makes everything dark, so the difference between background and foreground is very small in RGB color; thus the model cannot distinguish background from foreground.
• The model is trained to segment any kind of foreground object rather than a specific class like humans, so it can mistakenly treat the region marked in green in the figure as foreground.

Let's make another test with a human-segmentation model:


Now it makes more sense: the object marked in green in Figure 1 is now treated as background. But the object marked in red is still considered foreground.

Figure 2: Second attempt with U2Net and Efficient-Unet, trained on a human dataset
This shows that the dataset matters. We will take note of that and try to make the model more accurate.
SECOND STEP: START PROCESSING THE IMAGE
In this step, I do some processing to align the depth to the color camera, and use a classical method to estimate the foreground using only depth information.
The first step is to align the depth with the color image using the camera intrinsic and extrinsic parameters. The process uses the depth-camera intrinsics to back-project the depth into a point cloud, transforms that point cloud from the depth-camera frame into the color-camera frame, and then projects the transformed points into the RGB camera. In this step, the image must be flipped horizontally back to the original orientation to get the desired result. The resulting aligned depth is followed by a closing operator to fill the undesired occlusions caused by the alignment process.

$UV_{RGB} \sim K_{RGB} \left( R \, z \, K_{IR}^{-1} \, UV_{IR} + T \right)$
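As an illustration, here is a minimal NumPy/OpenCV sketch of this alignment step, assuming pinhole intrinsics and a known depth-to-color extrinsic; the function name, argument layout and the 5×5 closing kernel are illustrative choices, not the exact code used for the challenge:

```python
import numpy as np
import cv2

def align_depth_to_color(depth, K_ir, K_rgb, R, T, rgb_shape):
    """Reproject a depth map from the depth/IR camera into the color camera frame.

    depth       : (H, W) depth map in meters, 0 where invalid
    K_ir, K_rgb : (3, 3) intrinsic matrices
    R, T        : rotation (3, 3) and translation (3,) from depth to color camera
    rgb_shape   : (H_rgb, W_rgb) of the color image
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0

    # Back-project valid depth pixels to 3D points in the depth-camera frame:
    # X = z * inv(K_ir) * [u, v, 1]^T
    uv1 = np.stack([u.ravel(), v.ravel(), np.ones_like(z)], axis=0)[:, valid]
    pts_ir = np.linalg.inv(K_ir) @ uv1 * z[valid]

    # Transform into the color-camera frame and project with K_rgb
    pts_rgb = R @ pts_ir + T.reshape(3, 1)
    proj = K_rgb @ pts_rgb
    u_rgb = np.round(proj[0] / proj[2]).astype(int)
    v_rgb = np.round(proj[1] / proj[2]).astype(int)

    # Scatter the depth values onto the color image grid
    aligned = np.zeros(rgb_shape, dtype=np.float32)
    ok = (u_rgb >= 0) & (u_rgb < rgb_shape[1]) & (v_rgb >= 0) & (v_rgb < rgb_shape[0])
    aligned[v_rgb[ok], u_rgb[ok]] = z[valid][ok]

    # Closing operator to fill small holes created by the reprojection
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(aligned, cv2.MORPH_CLOSE, kernel)
```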


Here are the results:

Figure 3: Aligning the depth camera to the color camera. Left: depth before alignment; right: depth after alignment

As we can see, the depth is now perfectly aligned with the color image.
So the question is: what if we only use the depth for segmentation?
Depth is a good source of information for segmentation because the difference between background and foreground is large (if the object is far from the background), so the edges around the object are clear. Moreover, depth (assuming perfect computation) is invariant to illumination changes and shadows, which makes it well suited for indoor scenes, where RGB struggles with exactly that.
Figure 4: Left image (left) and depth map (right) captured using a stereo camera
However, it has several drawbacks:
• Sensitive to occlusion: if occlusion happens, the computed depth will have many NaN values around that region. For example, in Figure 4 the depth was computed with my Stereolabs ZED 2 camera at home; where my hand occludes my body, the depth in that region (marked in red) cannot be computed.
• Another problem is the foot region: the depth of the feet and the depth of the floor are almost the same, so the transition between the feet and the background is very small, which makes it very difficult to segment.

THIRD STEP: TRY CLASSICAL METHOD ON DEPTH


A simple and naive approach is to threshold the depth to get the foreground. This makes the assumption that the depth value is similar for all pixels in the foreground region.

Figure 5: Histogram of depth map

First, let's rescale the depth so that its maximum value is 255 and plot the histogram of the depth map. Then select the value with the highest frequency. Under the assumption that the depth value with the highest frequency mostly belongs to the foreground (in practice, a cropped foreground obtained by a person detector, for example, is needed for this assumption to hold), we threshold the depth within a delta of 20 around that value, followed by an opening operator to refine the mask. We then keep the largest connected component of the mask to get the output.
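A minimal sketch of this thresholding pipeline with OpenCV; the delta of 20 comes from the description above, while the histogram handling, kernel size and function name are illustrative assumptions:

```python
import numpy as np
import cv2

def foreground_from_depth(depth, delta=20):
    """Estimate a foreground mask from an aligned depth map by histogram-peak thresholding."""
    # Rescale the depth to 8-bit so the histogram peak can be read off 256 bins
    d = depth.astype(np.float32)
    d8 = np.uint8(255 * d / max(float(d.max()), 1e-6))

    # Peak of the histogram, ignoring zero (invalid) pixels
    hist = cv2.calcHist([d8], [0], None, [256], [0, 256]).ravel()
    hist[0] = 0
    peak = int(hist.argmax())

    # Threshold around the peak, then clean up with an opening operator
    mask = ((d8 >= peak - delta) & (d8 <= peak + delta)).astype(np.uint8) * 255
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # Keep only the largest connected component
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        mask = np.uint8(labels == largest) * 255
    return mask
```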
Figure 6: The mask computed by thresholding the depth map
Not too bad, but so far we have made too many assumptions:
1. The depth value with the highest frequency mostly belongs to the foreground.
2. The depth value is similar for all pixels in the foreground region.
3. There is no occlusion in the foreground.

For assumptions (1) and (2) to hold, the camera must face the person directly. It will not work, for example, with a top-view camera, where depth values differ across the foreground pixels. Figure 7 shows such a top-view scenario, where depth values in the head region are small and depth values in the lower regions (legs) are larger.

Figure 7: Example of a scenario where thresholding the depth does not work. Left: RGB image; right: colored depth image
Before switching to deep learning approaches, let's evaluate the mask obtained by the simple thresholding method above.

Given a ground-truth mask G and an output segmentation M, there are several important accuracy metrics to evaluate the performance of segmentation results: region similarity, precision, recall and F1 score. Depending on the purpose or context of the system, some metrics may be of more importance than others; e.g., accuracy may be expendable up to a certain point in favor of execution speed for a real-time application.

• Region similarity J: This is the standard metric for segmentation. It measures region-based segmentation similarity using the Jaccard index J, or IoU, the ratio between the intersection and the union of the ground-truth set and the predicted segmentation:

$J = IoU = \frac{|G \cap M|}{|G \cup M|}$

This ratio counts the true positives (intersection) over the sum of true positives, false negatives and false positives (union).
• Precision and recall P, R: These are the simplest metrics. Precision is the ratio between the number of properly classified pixels (true positives) and the total number of predicted foreground pixels:

$P = \frac{|G \cap M|}{|M|}$

Recall is the ratio between the number of properly classified pixels (true positives) and the number of ground-truth foreground pixels (positives):

$R = \frac{|G \cap M|}{|G|}$

Precision and recall are used to describe under- and over-segmentation.
• Dice coefficient (F1 score): the harmonic mean of precision and recall:

$F_1 = \frac{2PR}{P + R}$
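For reference, these four metrics can be computed directly from the set definitions above; this small helper is a sketch, not the exact evaluation script used for the tables below:

```python
import numpy as np

def segmentation_metrics(gt_mask, pred_mask):
    """Compute Jaccard (IoU), precision, recall and F1 between two binary masks."""
    G = gt_mask.astype(bool)
    M = pred_mask.astype(bool)
    inter = np.logical_and(G, M).sum()   # |G ∩ M| = true positives
    union = np.logical_or(G, M).sum()    # |G ∪ M| = TP + FP + FN

    iou = inter / union if union else 1.0
    precision = inter / M.sum() if M.sum() else 0.0
    recall = inter / G.sum() if G.sum() else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"jaccard": iou, "precision": precision, "recall": recall, "f1": f1}
```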
The evaluation of the methods is shown in Table 1. As we can see, for the depth thresholding method, recall is noticeably higher than precision, which means our method over-segments the object.

Table 1: Evaluation of the classical method

Method                Jaccard    Precision    Recall    F1 score
Depth thresholding    0.7478     0.8120       0.9043    0.8557
FOURTH STEP: DEEP LEARNING-BASED METHOD
Based on the result in Figure 2, we pick the Unet model with an EfficientNet encoder as our baseline. The model is described in this paper: https://openaccess.thecvf.com/content_CVPRW_2020/html/w22/Baheti_Eff-UNet_A_Novel_Architecture_for_Semantic_Segmentation_in_Unstructured_Environment_CVPRW_2020_paper.html
This method outperforms other state-of-the-art methods such as DeepLab or UNet with a ResNet50 encoder.
Let's test this method with encoder = EfficientNet-B3.
We take this model as our baseline, implemented using pytorch-lightning.
The model is trained on multiple datasets containing only people:
TRAIN SET:
• Mapillary Vistas Commercial 1.2 (train)
• COCO (train)
• Pascal VOC (train)
• Human Matting
VALIDATION SET:
• Mapillary Vistas Commercial 1.2 (val)
• COCO (val)
• Pascal VOC (val)
• Supervisely
Together, these form a very large dataset.
The result is very impressive with IoU = 0.839 using the pretrained model.

Figure 8: Result of Eff-Unet using the pretrained model on a human segmentation dataset

Very impressive, but let’s check if we can improve this model:


Looking for datasets:
In the first step, I want to find a dataset with a configuration similar to the challenge:
• People segmentation
• RGB + depth available
• Indoor environment
Not many public datasets have this configuration, but I managed to find one with a very similar setup: the Trimodal dataset (https://www.kaggle.com/aalborguniversity/trimodal-people-segmentation).

“The dataset features a total of 5724 annotated frames divided in three indoor scenes.
Activity in scene 1 and 3 is using the full depth range of the Kinect for XBOX 360 sensor whereas activity in
scene 2 is constrained to a depth range of plus/minus 0.250 m in order to suppress the parallax between the
two physical sensors. Scene 1 and 2 are situated in a closed meeting room with little natural light to disturb
the depth sensing, whereas scene 3 is situated in an area with wide windows and a substantial amount of
sunlight. For each scene, a total of three persons are interacting, reading, walking, sitting, reading, etc.

Every person is annotated with a unique ID in the scene on a pixel-level in the RGB modality. For the thermal
and depth modalities, annotations are transferred from the RGB images using a registration algorithm.”

After downloading the dataset, I followed the guide to extract it and perform registration to align the depth to the RGB image, and removed frames where no people are present. Then I sampled each sequence with a sampling ratio of 5 (select 1 frame in every 5), took the first 75% of each video for training, and kept the remaining 25% for validation. The processed dataset is included in the zip folder.
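A minimal sketch of this sampling and chronological split; the directory layout and file naming in the usage comment are hypothetical, not the actual Trimodal structure:

```python
from pathlib import Path

def sample_and_split(frame_paths, sampling_ratio=5, train_fraction=0.75):
    """Keep 1 frame out of every `sampling_ratio`, then split the sequence chronologically."""
    frames = sorted(frame_paths)[::sampling_ratio]   # select 1 frame in every 5
    split = int(len(frames) * train_fraction)        # first 75% -> train, last 25% -> val
    return frames[:split], frames[split:]

# Hypothetical usage on one scene of the extracted dataset:
# train_frames, val_frames = sample_and_split(Path("trimodal/scene1/rgb").glob("*.png"))
```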
Let’s do some finetuning using this dataset:

The model is finetuned from the pretrained weights with the following configuration:

Scheduler: cosine annealing with warm restarts
LR: 0.00009
weight_decay: 0.0003
num_epoch: 20
Loss function: w * Jaccard Loss + (1 - w) * Focal Loss, with w = 0.9
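A sketch of this combined loss, assuming the JaccardLoss and FocalLoss implementations from segmentation_models_pytorch (the library and class names are an assumption about the setup; only the loss formula is given above):

```python
import torch
import segmentation_models_pytorch as smp

class CombinedLoss(torch.nn.Module):
    """w * Jaccard loss + (1 - w) * focal loss, as used for finetuning (w = 0.9)."""
    def __init__(self, w: float = 0.9):
        super().__init__()
        self.w = w
        self.jaccard = smp.losses.JaccardLoss(mode="binary")
        self.focal = smp.losses.FocalLoss(mode="binary")

    def forward(self, logits, target):
        return self.w * self.jaccard(logits, target) + (1 - self.w) * self.focal(logits, target)
```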

Here is the WandB dashboard:

Figure 9: WandB dashboard of Eff-Unet trained with RGB images


We can see that the validation accuracy is very high; this is because the training and validation sets are very similar.
Here is the result:
The mask is more accurate around the edges of the person, as shown in Figure 10.
As we can see in Table 2, with simple finetuning we can boost the performance by ~5%.
Figure 10: Mask obtained by Eff-Unet after finetuning on the Trimodal dataset
Now let's add the depth information to see if the network can improve.
The main challenge for RGB-D segmentation is how to represent and fuse the RGB and depth channels so that the strong correlation between them is exploited. Hence, different policies are used for the encoder and decoder blocks. These policies include early fusion, middle fusion and late fusion, as shown in Figure 11.
Figure 11: Different strategies to fuse depth and RGB

We chose early fusion because of:

• Simplicity.
• The model is pretrained on RGB and the dataset is small, so the weights will change little; we should therefore introduce as few new weights as possible.

Some methods using early-fusion strategies in the literature:

ShapeConv (shape-aware convolution): https://openaccess.thecvf.com/content/ICCV2021/html/Cao_ShapeConv_Shape-Aware_Convolutional_Layer_for_Indoor_RGB-D_Semantic_Segmentation_ICCV_2021_paper.html
Lucid Data Dreaming: https://github.com/ankhoreva/LucidDataDreaming
Figure 12: We fuse the RGB image with the depth to form a 4-channel image
We also start from the pretrained weights: I modify the conv_stem layer of the EfficientNet-B3 encoder to accept 4-channel images and initialize the 4th channel of its kernel with zeros.
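A sketch of this stem modification, assuming the model is built through segmentation_models_pytorch with a timm EfficientNet-B3 backbone that exposes a conv_stem attribute (the encoder name string and attribute path may differ between library versions):

```python
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp

# Build the RGB-pretrained Unet, then widen its stem convolution to accept RGB-D input.
model = smp.Unet(encoder_name="timm-efficientnet-b3", encoder_weights="imagenet", classes=1)

stem = model.encoder.conv_stem                           # pretrained 3-channel stem conv
new_weight = torch.zeros(stem.out_channels, 4, *stem.kernel_size)
new_weight[:, :3] = stem.weight.detach()                 # reuse the pretrained RGB kernels
new_weight[:, 3:] = 0.0                                  # zero-init the extra depth channel
stem.weight = nn.Parameter(new_weight)
stem.in_channels = 4

# At initialization, a 4-channel (RGB + depth) input behaves exactly like the RGB model:
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 4, 256, 256))             # (1, 1, 256, 256) logits
print(out.shape)
```

Zero-initializing the fourth channel means the network initially ignores the depth and reproduces the RGB-pretrained behavior, so finetuning can introduce the depth information gradually.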

Here is the result:

Figure 13: Mask produced by Efficient-Unet trained on the RGBD dataset


The result is quite good: recall is slightly higher, but overall it is a bit below the RGB-only model. This is because the backbone was pretrained on RGB, so it fits RGB input well.

Table 2: Evaluation of deep learning-based methods

Method                               Jaccard    Precision    Recall    F1 score
Depth thresholding                   0.7478     0.8120       0.9043    0.8557
Efficientb3-Unet (pre-trained)       0.8395     0.8716       0.9579    0.9127
Efficientb3-Unet (finetune)          0.8850     0.9338       0.9442    0.9390
Efficientb3-Unet (finetune, RGBD)    0.8691     0.9125       0.9482    0.9300

MODEL EVALUATION:
The inference time of our model is ~0.02 s (50 FPS) on GPU and ~0.25 s on CPU, which is fast enough for real-time use on GPU.
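For reference, a rough way to measure such forward-pass times in PyTorch; the input size, iteration counts and encoder configuration below are illustrative, not the exact benchmark setup:

```python
import time
import torch
import segmentation_models_pytorch as smp

# Rough GPU timing of one forward pass (weights are irrelevant for speed, so none are loaded).
model = smp.Unet(encoder_name="timm-efficientnet-b3", encoder_weights=None, classes=1).eval().cuda()
x = torch.randn(1, 3, 480, 640, device="cuda")

with torch.no_grad():
    for _ in range(10):                          # warm-up iterations
        model(x)
    torch.cuda.synchronize()                     # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
print(f"mean forward time: {(time.perf_counter() - start) / 100 * 1000:.1f} ms")
```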

GPU Memory:

Total flops: 15.44

Total params: 13.2M

Total estimated model params size: 26.3MB


BONUS QUESTION:
If the image is corrupted (e.g., during transmission), there are two different scenarios:

1. The depth is not corrupted


2. The depth is corrupted

If the depth is not corrupted:


Let’s check our model with this image.

Figure 14: Mask using Eff-Unet RGB

Figure 15: Mask using Eff-Unet RGBD


As we can see, even when using the depth, the model cannot segment the legs of the person, because it relies mostly on the RGB. To handle this, we could add another input channel, a "corrupted mask", that marks the corrupted regions. The model could then use the depth instead of the RGB for those regions, as sketched below.
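A minimal sketch of that idea: build a binary corruption mask and stack it with the RGB and depth channels as a 5-channel network input. The corruption criterion and function name here are purely illustrative:

```python
import numpy as np

def build_corruption_aware_input(rgb, depth, corrupted_value=0):
    """Stack RGB, depth and a binary 'corrupted' mask into one network input.

    The mask marks pixels whose RGB content is missing (here: exactly `corrupted_value`
    in all three channels, which is only a toy criterion for illustration).
    """
    corrupted = np.all(rgb == corrupted_value, axis=-1, keepdims=True).astype(np.float32)
    rgb_f = rgb.astype(np.float32) / 255.0
    depth_f = depth.astype(np.float32)[..., None]
    return np.concatenate([rgb_f, depth_f, corrupted], axis=-1)  # (H, W, 5)
```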

If the depth is also corrupted:

An interesting idea is to fit a 3D human model to the person and then project it back into the image domain. Since both 2D and 3D information are available, we can fit the 3D model by regressing it from the visible parts of the person. This is the result I obtained with a model called ROMP: https://github.com/Arthur151/ROMP

By fitting the model to the person, we can recover the missing parts of the human body.

Figure 16: Fitting a 3D human model to the existing pose to recover the missing parts
References:
Baheti, Bhakti, et al. "Eff-unet: A novel architecture for semantic segmentation in unstructured environment."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.

Cao, Jinming, et al. "ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation."
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Khoreva, Anna, et al. "Lucid data dreaming for object tracking." The DAVIS challenge on video object segmentation.
2017.
