
SOLUTION FOR SQUAREMIND__MLCV_CHALLENGE_01

FIRST STEP: CHECK THE RESULT OF EXISTING ONLINE APPLICATIONS


The first step that came to my mind when receiving this challenge was to see what existing software can do.
Let's take a look at that:
I tried the online application by PhotoRoom at https://www.photoroom.com/background-remover/, which is based on a Transformer model (SegFormer; I had a call with them). The model does quite well, but it still cannot fully remove the background: parts of the background are left in the result. That means the problem is challenging.

Figure 1: First attempt with PhotoRoom


So what makes the model confused?
• The indoor environment makes everything dark, so the difference between background and foreground is very small in RGB color; thus the model cannot distinguish background from foreground.
• The model is trained to segment any kind of foreground object rather than a specific class like humans, so it can mistakenly treat the region marked in green in the figure as foreground.

Let's make another test with a human-segmentation model:


Now it makes more sense: the object marked in green in Figure 1 is now treated as background. But the object marked in red is still considered foreground.

Figure 2: Second attempt with U2Net and Efficient-Unet, trained on a human dataset
This shows that the dataset matters. We will take note of that and try to make the model more accurate.
SECOND STEP: START PROCESSING THE IMAGE
In this step, I do some processing to align the depth to the color camera, and use a classical method to estimate the foreground using only depth information.
The first step is to align the depth with the color image using the camera intrinsic and extrinsic parameters. The process uses the depth-camera intrinsics to back-project the depth into a point cloud, transforms that point cloud from the depth-camera frame into the color-camera frame, and then projects the transformed points into the RGB camera. In this step, the image must be flipped horizontally back to the original orientation to get the desired result. The resulting aligned depth is followed by a closing operator to fill the undesired occlusions caused by the alignment process.

$UV_{RGB} \sim K_{RGB} \left( R \, z \, K_{IR}^{-1} \, UV_{IR} + T \right)$
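As an illustration, here is a minimal NumPy/OpenCV sketch of this alignment step, assuming pinhole intrinsics and a known depth-to-color extrinsic; the function name, argument layout and the 5×5 closing kernel are illustrative choices, not the exact code used for the challenge:

```python
import numpy as np
import cv2

def align_depth_to_color(depth, K_ir, K_rgb, R, T, rgb_shape):
    """Reproject a depth map from the depth/IR camera into the color camera frame.

    depth       : (H, W) depth map in meters, 0 where invalid
    K_ir, K_rgb : (3, 3) intrinsic matrices
    R, T        : rotation (3, 3) and translation (3,) from depth to color camera
    rgb_shape   : (H_rgb, W_rgb) of the color image
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0

    # Back-project valid depth pixels to 3D points in the depth-camera frame:
    # X = z * inv(K_ir) * [u, v, 1]^T
    uv1 = np.stack([u.ravel(), v.ravel(), np.ones_like(z)], axis=0)[:, valid]
    pts_ir = np.linalg.inv(K_ir) @ uv1 * z[valid]

    # Transform into the color-camera frame and project with K_rgb
    pts_rgb = R @ pts_ir + T.reshape(3, 1)
    proj = K_rgb @ pts_rgb
    u_rgb = np.round(proj[0] / proj[2]).astype(int)
    v_rgb = np.round(proj[1] / proj[2]).astype(int)

    # Scatter the depth values onto the color image grid
    aligned = np.zeros(rgb_shape, dtype=np.float32)
    ok = (u_rgb >= 0) & (u_rgb < rgb_shape[1]) & (v_rgb >= 0) & (v_rgb < rgb_shape[0])
    aligned[v_rgb[ok], u_rgb[ok]] = z[valid][ok]

    # Closing operator to fill small holes created by the reprojection
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(aligned, cv2.MORPH_CLOSE, kernel)
```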


Here are the results:

Figure 3: Aligning the depth camera to the color camera. Left: depth before alignment; right: depth after alignment

As we can see, the depth is now perfectly aligned with the color image.
So the question is: what if we only use the depth for segmentation?
Depth is a good source of information for segmentation because the difference between background and foreground is large (if the object is far from the background), so the edges around the object are clear. Moreover, depth (assuming perfect computation) is invariant to illumination changes and shadows, which makes it well suited for indoor scenes, where RGB struggles with exactly that.
Figure 4: Left image (left) and depth map (right) captured using a stereo camera
However, it has several drawbacks:
• Sensitive to occlusion: if occlusion happens, the computed depth will have many NaN values around that region. For example, in Figure 4 the depth was computed with my Stereolabs ZED 2 camera at home; where my hand occludes my body, the depth in that region (marked in red) cannot be computed.
• Another problem is the foot region: the depth of the feet and the depth of the floor are almost the same, so the transition between the feet and the background is very small, which makes it very difficult to segment.

THIRD STEP: TRY CLASSICAL METHOD ON DEPTH


A simple and naive approach is to threshold the depth to get the foreground. This makes the assumption that the depth value is similar for all pixels in the foreground region.

Figure 5: Histogram of depth map

First, let's rescale the depth so that its maximum value is 255 and plot the histogram of the depth map. Then select the value with the highest frequency. Under the assumption that the depth value with the highest frequency mostly belongs to the foreground (in practice, a cropped foreground obtained by a person detector, for example, is needed for this assumption to hold), we threshold the depth within a delta of 20 around that value, followed by an opening operator to refine the mask. We then keep the largest connected component of the mask to get the output.
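A minimal sketch of this thresholding pipeline with OpenCV; the delta of 20 comes from the description above, while the histogram handling, kernel size and function name are illustrative assumptions:

```python
import numpy as np
import cv2

def foreground_from_depth(depth, delta=20):
    """Estimate a foreground mask from an aligned depth map by histogram-peak thresholding."""
    # Rescale the depth to 8-bit so the histogram peak can be read off 256 bins
    d = depth.astype(np.float32)
    d8 = np.uint8(255 * d / max(float(d.max()), 1e-6))

    # Peak of the histogram, ignoring zero (invalid) pixels
    hist = cv2.calcHist([d8], [0], None, [256], [0, 256]).ravel()
    hist[0] = 0
    peak = int(hist.argmax())

    # Threshold around the peak, then clean up with an opening operator
    mask = ((d8 >= peak - delta) & (d8 <= peak + delta)).astype(np.uint8) * 255
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # Keep only the largest connected component
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        mask = np.uint8(labels == largest) * 255
    return mask
```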
Figure 6: The mask computed by thresholding the depth map
Not too bad, but so far we have made too many assumptions:
1. The depth value with the highest frequency mostly belongs to the foreground.
2. The depth value is similar for all pixels in the foreground region.
3. There is no occlusion in the foreground.

For assumptions (1) and (2) to hold, the camera must face the person directly. It will not work, for example, with a top-view camera, where depth values differ across the foreground pixels. Figure 7 shows such a top-view scenario, where depth values in the head region are small and depth values in the lower regions (legs) are larger.

Figure 7: Example of a scenario where thresholding the depth does not work. Left: RGB image; right: colored depth image
Before switching to deep learning approaches, let's evaluate the mask obtained by the simple thresholding method above.

Given a ground-truth mask G and an output segmentation M, there are several important accuracy metrics to evaluate the performance of segmentation results: region similarity, precision, recall and F1 score. Depending on the purpose or context of the system, some metrics may be of more importance than others; e.g., accuracy may be expendable up to a certain point in favor of execution speed for a real-time application.

• Region similarity J: This is the standard metric for segmentation. It measures region-based segmentation similarity using the Jaccard index J, or IoU, the ratio between the intersection and the union of the ground-truth set and the predicted segmentation:

$J = IoU = \frac{|G \cap M|}{|G \cup M|}$

This ratio counts the true positives (intersection) over the sum of true positives, false negatives and false positives (union).
• Precision and recall P, R: These are the simplest metrics. Precision is the ratio between the number of properly classified pixels (true positives) and the total number of predicted foreground pixels:

$P = \frac{|G \cap M|}{|M|}$

Recall is the ratio between the number of properly classified pixels (true positives) and the number of ground-truth foreground pixels (positives):

$R = \frac{|G \cap M|}{|G|}$

Precision and recall are used to describe under- and over-segmentation.
• Dice coefficient (F1 score): the harmonic mean of precision and recall:

$F_1 = \frac{2PR}{P + R}$
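For reference, these four metrics can be computed directly from the set definitions above; this small helper is a sketch, not the exact evaluation script used for the tables below:

```python
import numpy as np

def segmentation_metrics(gt_mask, pred_mask):
    """Compute Jaccard (IoU), precision, recall and F1 between two binary masks."""
    G = gt_mask.astype(bool)
    M = pred_mask.astype(bool)
    inter = np.logical_and(G, M).sum()   # |G ∩ M| = true positives
    union = np.logical_or(G, M).sum()    # |G ∪ M| = TP + FP + FN

    iou = inter / union if union else 1.0
    precision = inter / M.sum() if M.sum() else 0.0
    recall = inter / G.sum() if G.sum() else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"jaccard": iou, "precision": precision, "recall": recall, "f1": f1}
```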
The evaluation of the methods is shown in Table 1. As we can see, for the depth thresholding method, recall is noticeably higher than precision, which means our method over-segments the object.

Table 1: Evaluation of the classical method

Method                Jaccard    Precision    Recall    F1 score
Depth thresholding    0.7478     0.8120       0.9043    0.8557
FOURTH STEP: DEEP LEARNING-BASED METHOD
Based on the result in Figure 2, we pick the Unet model with an EfficientNet encoder as our baseline. The model is described in this paper: https://openaccess.thecvf.com/content_CVPRW_2020/html/w22/Baheti_Eff-UNet_A_Novel_Architecture_for_Semantic_Segmentation_in_Unstructured_Environment_CVPRW_2020_paper.html
This method outperforms other state-of-the-art methods such as DeepLab or UNet with a ResNet50 encoder.
Let's test this method with encoder = EfficientNet-B3.
We take this model as our baseline, implemented using pytorch-lightning.
The model is trained on multiple datasets containing only people:
TRAIN SET:
• Mapillary Vistas Commercial 1.2 (train)
• COCO (train)
• Pascal VOC (train)
• Human Matting
VALIDATION SET:
• Mapillary Vistas Commercial 1.2 (val)
• COCO (val)
• Pascal VOC (val)
• Supervisely
Together, these form a very large dataset.
The result is very impressive with IoU = 0.839 using the pretrained model.

Figure 8: Result of Eff-Unet using the pretrained model on a human segmentation dataset

Very impressive, but let’s check if we can improve this model:


Looking for datasets:
In the first step, I want to find a dataset with a configuration similar to the challenge:
• People segmentation
• RGB + depth available
• Indoor environment
Not many public datasets have this configuration, but I managed to find one with a very similar setup: the Trimodal dataset (https://www.kaggle.com/aalborguniversity/trimodal-people-segmentation).

“The dataset features a total of 5724 annotated frames divided in three indoor scenes.
Activity in scene 1 and 3 is using the full depth range of the Kinect for XBOX 360 sensor whereas activity in
scene 2 is constrained to a depth range of plus/minus 0.250 m in order to suppress the parallax between the
two physical sensors. Scene 1 and 2 are situated in a closed meeting room with little natural light to disturb
the depth sensing, whereas scene 3 is situated in an area with wide windows and a substantial amount of
sunlight. For each scene, a total of three persons are interacting, reading, walking, sitting, reading, etc.

Every person is annotated with a unique ID in the scene on a pixel-level in the RGB modality. For the thermal
and depth modalities, annotations are transferred from the RGB images using a registration algorithm.”

After downloading the dataset, I followed the guide to extract it and perform registration to align the depth to the RGB image, and removed frames where no people are present. Then I sampled each sequence with a sampling ratio of 5 (select 1 frame in every 5), took the first 75% of each video for training, and kept the remaining 25% for validation. The processed dataset is included in the zip folder.
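A minimal sketch of this sampling and chronological split; the directory layout and file naming in the usage comment are hypothetical, not the actual Trimodal structure:

```python
from pathlib import Path

def sample_and_split(frame_paths, sampling_ratio=5, train_fraction=0.75):
    """Keep 1 frame out of every `sampling_ratio`, then split the sequence chronologically."""
    frames = sorted(frame_paths)[::sampling_ratio]   # select 1 frame in every 5
    split = int(len(frames) * train_fraction)        # first 75% -> train, last 25% -> val
    return frames[:split], frames[split:]

# Hypothetical usage on one scene of the extracted dataset:
# train_frames, val_frames = sample_and_split(Path("trimodal/scene1/rgb").glob("*.png"))
```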
Let’s do some finetuning using this dataset:

The model is finetuned from the pretrained weights with the following configuration:

Scheduler: cosine annealing with warm restarts
LR: 0.00009
weight_decay: 0.0003
num_epoch: 20
Loss function: w * Jaccard Loss + (1 - w) * Focal Loss, with w = 0.9
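A sketch of this combined loss, assuming the JaccardLoss and FocalLoss implementations from segmentation_models_pytorch (the library and class names are an assumption about the setup; only the loss formula is given above):

```python
import torch
import segmentation_models_pytorch as smp

class CombinedLoss(torch.nn.Module):
    """w * Jaccard loss + (1 - w) * focal loss, as used for finetuning (w = 0.9)."""
    def __init__(self, w: float = 0.9):
        super().__init__()
        self.w = w
        self.jaccard = smp.losses.JaccardLoss(mode="binary")
        self.focal = smp.losses.FocalLoss(mode="binary")

    def forward(self, logits, target):
        return self.w * self.jaccard(logits, target) + (1 - self.w) * self.focal(logits, target)
```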

Here is the WandB dashboard:

Figure 9: WandB dashboard of Eff-Unet trained with RGB images


We can see that the validation accuracy is very high; this is because the training and validation sets are very similar.
Here is the result:
The mask is more accurate around the edges of the person, as shown in Figure 10.
As we can see in Table 2, with simple finetuning we can boost the performance by ~5%.
Figure 10: Mask obtained by Eff-Unet after finetuning on the Trimodal dataset
Now let's add the depth information to see if the network can improve.
The main challenge for RGB-D segmentation is how to represent and fuse the RGB and depth channels so that the strong correlation between them is exploited. Hence, different policies are used for the encoder and decoder blocks. These policies include early fusion, middle fusion and late fusion, as shown in Figure 11.
Figure 11: Different strategies to fuse depth and RGB

We chose early fusion because of:

• Simplicity.
• The model is pretrained on RGB and the dataset is small, so the weights will change little; we should therefore introduce as few new weights as possible.

Some methods using early-fusion strategies in the literature:

ShapeConv (shape-aware convolution): https://openaccess.thecvf.com/content/ICCV2021/html/Cao_ShapeConv_Shape-Aware_Convolutional_Layer_for_Indoor_RGB-D_Semantic_Segmentation_ICCV_2021_paper.html
Lucid Data Dreaming: https://github.com/ankhoreva/LucidDataDreaming
Figure 12: We fuse the RGB image with the depth to form a 4-channel image
We also start from the pretrained weights: I modify the conv_stem layer of the EfficientNet-B3 encoder to accept 4-channel images and initialize the 4th channel of its kernel with zeros.
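A sketch of this stem modification, assuming the model is built through segmentation_models_pytorch with a timm EfficientNet-B3 backbone that exposes a conv_stem attribute (the encoder name string and attribute path may differ between library versions):

```python
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp

# Build the RGB-pretrained Unet, then widen its stem convolution to accept RGB-D input.
model = smp.Unet(encoder_name="timm-efficientnet-b3", encoder_weights="imagenet", classes=1)

stem = model.encoder.conv_stem                           # pretrained 3-channel stem conv
new_weight = torch.zeros(stem.out_channels, 4, *stem.kernel_size)
new_weight[:, :3] = stem.weight.detach()                 # reuse the pretrained RGB kernels
new_weight[:, 3:] = 0.0                                  # zero-init the extra depth channel
stem.weight = nn.Parameter(new_weight)
stem.in_channels = 4

# At initialization, a 4-channel (RGB + depth) input behaves exactly like the RGB model:
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 4, 256, 256))             # (1, 1, 256, 256) logits
print(out.shape)
```

Zero-initializing the fourth channel means the network initially ignores the depth and reproduces the RGB-pretrained behavior, so finetuning can introduce the depth information gradually.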

Here is the result:

Figure 13: Mask produced by Efficient-Unet trained on the RGBD dataset


The result is quite good: recall is slightly higher, but overall it is a bit below the RGB-only model. This is because the backbone was pretrained on RGB, so it fits RGB input well.

Table 2: Evaluation of deep learning-based methods

Method                               Jaccard    Precision    Recall    F1 score
Depth thresholding                   0.7478     0.8120       0.9043    0.8557
Efficientb3-Unet (pre-trained)       0.8395     0.8716       0.9579    0.9127
Efficientb3-Unet (finetune)          0.8850     0.9338       0.9442    0.9390
Efficientb3-Unet (finetune, RGBD)    0.8691     0.9125       0.9482    0.9300

MODEL EVALUATION:
The inference time of our model is ~0.02 s (50 FPS) on GPU and ~0.25 s on CPU, which is fast enough for real-time use on GPU.
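For reference, a rough way to measure such forward-pass times in PyTorch; the input size, iteration counts and encoder configuration below are illustrative, not the exact benchmark setup:

```python
import time
import torch
import segmentation_models_pytorch as smp

# Rough GPU timing of one forward pass (weights are irrelevant for speed, so none are loaded).
model = smp.Unet(encoder_name="timm-efficientnet-b3", encoder_weights=None, classes=1).eval().cuda()
x = torch.randn(1, 3, 480, 640, device="cuda")

with torch.no_grad():
    for _ in range(10):                          # warm-up iterations
        model(x)
    torch.cuda.synchronize()                     # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
print(f"mean forward time: {(time.perf_counter() - start) / 100 * 1000:.1f} ms")
```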

GPU Memory:

Total flops: 15.44

Total params: 13.2M

Total estimated model params size: 26.3MB


BONUS QUESTION:
If the image is corrupted (e.g., during transmission), there are two different scenarios:

1. The depth is not corrupted


2. The depth is corrupted

If the depth is not corrupted:


Let’s check our model with this image.

Figure 14: Mask using Eff-Unet RGB

Figure 15: Mask using Eff-Unet RGBD


As we can see, even when using the depth, the model cannot segment the legs of the person, because it relies mostly on the RGB. To handle this, we could add another input channel, a "corrupted mask", that marks the corrupted regions. The model could then use the depth instead of the RGB for those regions, as sketched below.
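A minimal sketch of that idea: build a binary corruption mask and stack it with the RGB and depth channels as a 5-channel network input. The corruption criterion and function name here are purely illustrative:

```python
import numpy as np

def build_corruption_aware_input(rgb, depth, corrupted_value=0):
    """Stack RGB, depth and a binary 'corrupted' mask into one network input.

    The mask marks pixels whose RGB content is missing (here: exactly `corrupted_value`
    in all three channels, which is only a toy criterion for illustration).
    """
    corrupted = np.all(rgb == corrupted_value, axis=-1, keepdims=True).astype(np.float32)
    rgb_f = rgb.astype(np.float32) / 255.0
    depth_f = depth.astype(np.float32)[..., None]
    return np.concatenate([rgb_f, depth_f, corrupted], axis=-1)  # (H, W, 5)
```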

If the depth is also corrupted:

An interesting idea is to fit a 3D human model to the person and then project it back into the image domain. Since both 2D and 3D information are available, we can fit the 3D model by regressing it from the visible parts of the person. This is the result I obtained with a model called ROMP: https://github.com/Arthur151/ROMP

By fitting the model to the person, we can recover the missing parts of the human body.

Figure 16: Fitting a 3D human model to the existing pose to recover the missing parts
References:
Baheti, Bhakti, et al. "Eff-unet: A novel architecture for semantic segmentation in unstructured environment."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.

Cao, Jinming, et al. "ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation."
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Khoreva, Anna, et al. "Lucid data dreaming for object tracking." The DAVIS challenge on video object segmentation.
2017.
