Solution For SquareMind
Figure 2: Second attempt with U2Net and Efficient-Unet trained on a human dataset
This shows that the dataset is important. We will take note of that and try to make the model more accurate.
SECOND STEP: START PROCESSING THE IMAGE
In this step, I do some processing to align the depth to the color camera and, using a classical method, try to estimate the foreground from the depth information only.
The first step is to align the depth with the color image using the camera intrinsic and extrinsic parameters. The process uses the depth camera intrinsics to back-project the depth map into a point cloud, transforms the point cloud from the depth camera's coordinate frame into the color camera's coordinate frame using the extrinsics, and then projects the transformed points onto the RGB image plane. In this step, the image must be flipped horizontally back to its original orientation to get the desired result. The resulting aligned depth is followed by a closing operator to fill the undesired occlusion holes caused by the alignment process.
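A minimal sketch of this alignment step, assuming pinhole intrinsics given as 3x3 matrices and depth-to-color extrinsics (R, t); the function name, the 5x5 kernel and the NumPy/OpenCV implementation are my assumptions, not the original code:

import numpy as np
import cv2

def align_depth_to_color(depth, K_depth, K_color, R, t, color_shape):
    """Reproject a depth map (meters, 0 = invalid) into the color camera frame."""
    h, w = depth.shape
    # 1) Back-project every valid depth pixel to a 3D point in the depth-camera frame
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    u, v, z = u.ravel(), v.ravel(), depth.ravel()
    keep = z > 0
    u, v, z = u[keep], v[keep], z[keep]
    x = (u - K_depth[0, 2]) * z / K_depth[0, 0]
    y = (v - K_depth[1, 2]) * z / K_depth[1, 1]
    pts = np.stack([x, y, z])                              # 3 x N point cloud

    # 2) Transform the point cloud into the color-camera coordinate frame
    pts_c = R @ pts + t.reshape(3, 1)

    # 3) Project the transformed points onto the RGB image plane
    zc = pts_c[2]
    front = zc > 0
    uc = np.round(K_color[0, 0] * pts_c[0, front] / zc[front] + K_color[0, 2]).astype(int)
    vc = np.round(K_color[1, 1] * pts_c[1, front] / zc[front] + K_color[1, 2]).astype(int)
    zc = zc[front]

    hc, wc = color_shape
    inside = (uc >= 0) & (uc < wc) & (vc >= 0) & (vc < hc)
    aligned = np.zeros((hc, wc), np.float32)
    aligned[vc[inside], uc[inside]] = zc[inside]           # no z-buffering, last write wins

    # Depending on the sensor, the result may need a horizontal flip: cv2.flip(aligned, 1)
    # 4) Closing operator to fill the occlusion holes created by the reprojection
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(aligned, cv2.MORPH_CLOSE, kernel)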
Figure 3: Aligning the depth camera to the color camera. Left: depth before alignment, right: depth after alignment
As we can see, the depth is perfectly aligned with the color image.
So the question is: what if we only use the depth for segmentation?
The depth information is a good source for segmentation because the difference between background and foreground is large (if the object is far from the background), so the edges around the object are clear. Moreover, depth (assuming perfect computation) is invariant to illumination changes and shadows, which makes it well suited for indoor scenes (where RGB struggles with both).
Figure 4: Left camera image (left) and depth map (right) captured using a stereo camera
However, it has several drawbacks:
• Sensitive to occlusion: if occlusion happens, the computed depth will have many NaN values around that region. For example, in Figure 4 the depth was computed using my Stereolabs ZED 2 camera at home. As we can see, when my hand occludes my body, the depth in that region (marked in red) cannot be computed.
• Another problem is the foot region: the depth of the foot and the depth of the floor are almost the same, so the transition between the foot and the background is very small, which makes it very difficult to segment.
First, let's scale the depth back to a maximum value of 255 and plot the histogram of the depth map, then select the value with the highest frequency. Under the assumption that the depth value with the highest frequency mostly belongs to the foreground (in practice a cropped foreground, obtained for example from a person detector, is needed for this assumption to hold), we threshold the depth with a delta of 20 around that value, followed by an opening operator to refine the mask. We then keep the largest connected component of the mask as the output.
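A minimal sketch of this classical pipeline with OpenCV; delta = 20 follows the text, while the 5x5 kernel and the min-max rescaling are assumptions:

import numpy as np
import cv2

def segment_by_depth_mode(depth, delta=20):
    """Threshold the depth around its most frequent value (assumed to be foreground)."""
    # Rescale the depth to [0, 255] and build its histogram
    d = np.nan_to_num(depth, nan=0.0)
    d = cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    hist = cv2.calcHist([d], [0], None, [256], [0, 256]).ravel()
    hist[0] = 0                                    # ignore invalid (zero) depth
    peak = int(np.argmax(hist))                    # depth value with highest frequency

    # Keep pixels within +/- delta of the peak, then refine with an opening operator
    mask = ((d > peak - delta) & (d < peak + delta)).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # Keep only the largest connected component as the output mask
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n > 1:
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        mask = np.where(labels == largest, 255, 0).astype(np.uint8)
    return mask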
Figure 6: The mask computed by thresholding the depth map
Not too bad, but so far we have made too many assumptions:
1. The depth value with the highest frequency mostly belongs to the foreground.
2. The depth values are similar for pixels in the foreground region.
3. There is no occlusion in the foreground.
For assumptions (1) and (2) to hold, the camera must face directly toward the person; it will not work, for example, with a top-view camera placement, where depth values differ across pixels of the foreground. Below is an example with a top-view camera where depth values in the head region are small and depth values in the bottom region (legs) are larger.
Figure 7: Example of a scenario where thresholding the depth does not work. Left: RGB image, right: colored depth image
Before switching to deep learning approaches, let's evaluate the mask obtained by the simple thresholding method above:
Given a ground truth mask G and an output segmentation M, there are several important accuracy metrics to evaluate the performance of segmentation results: region similarity, precision, recall and F1 score. Depending on the purpose or the context of the system, some metrics might matter more than others; for example, accuracy may be expendable up to a certain point in favor of execution speed for a real-time application.
• Region Similarity J: This is the standard metric for segmentation. It measures region-based segmentation similarity using the Jaccard index J, or IoU, which is the ratio between the intersection and the union of the ground truth set and the predicted segmentation:
J = IoU = |G ∩ M| / |G ∪ M|
This ratio calculates the number of true positives (intersection) over the sum of true positives, false negatives, and
false positives (union).
• Precision and Recall P, R: These are the simplest metrics. Precision is the ratio between the number of properly classified pixels (true positives) and the total number of predicted pixels:
P = |G ∩ M| / |M|
Recall is the ratio between the number of correctly classified pixels (true positives) and the number of ground-truth pixels (positives):
R = |G ∩ M| / |G|
Precision and recall are used to describe the under-/over-segmentation phenomenon.
• Dice Coefficient (F1 score): is the harmonic mean of precision and recall:
F1 = 2PR / (P + R)
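For reference, all four metrics can be computed directly from binary masks; a small sketch (function and variable names are mine):

import numpy as np

def segmentation_metrics(gt, pred):
    """IoU (J), precision, recall and F1 for binary masks G (ground truth) and M (prediction)."""
    G, M = gt.astype(bool), pred.astype(bool)
    tp = np.logical_and(G, M).sum()                  # |G ∩ M|
    union = np.logical_or(G, M).sum()                # |G ∪ M|
    iou = tp / union if union else 0.0
    precision = tp / M.sum() if M.sum() else 0.0     # |G ∩ M| / |M|
    recall = tp / G.sum() if G.sum() else 0.0        # |G ∩ M| / |G|
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"J": iou, "P": precision, "R": recall, "F1": f1}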
The evaluation of the methods is shown in Table 1:
As we can see, for the depth thresholding method the recall is high; that means our method over-segments the object (the predicted mask covers the ground truth but also includes extra background pixels).
“The dataset features a total of 5724 annotated frames divided in three indoor scenes.
Activity in scene 1 and 3 is using the full depth range of the Kinect for XBOX 360 sensor whereas activity in
scene 2 is constrained to a depth range of plus/minus 0.250 m in order to suppress the parallax between the
two physical sensors. Scene 1 and 2 are situated in a closed meeting room with little natural light to disturb
the depth sensing, whereas scene 3 is situated in an area with wide windows and a substantial amount of
sunlight. For each scene, a total of three persons are interacting, reading, walking, sitting, reading, etc.
Every person is annotated with a unique ID in the scene on a pixel-level in the RGB modality. For the thermal
and depth modalities, annotations are transferred from the RGB images using a registration algorithm.”
After downloading the dataset, I followed the guide to extract it and performed registration to align the depth to the RGB images, removing frames where no person is detected. I then sampled each sequence with a sampling ratio of 5 (selecting 1 frame in every 5 frames), used the first 75% of each video for training, and kept the remaining 25% for validation. The processed dataset is included in the zip folder.
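A small sketch of the sampling and splitting step, assuming the frames of one sequence have already been extracted as image files (directory layout and file extension are placeholders):

import glob, os, shutil

def sample_and_split(frame_dir, out_dir, step=5, train_ratio=0.75):
    """Keep 1 frame in every `step`, then put the first 75% in train and the rest in val."""
    frames = sorted(glob.glob(os.path.join(frame_dir, "*.png")))[::step]
    n_train = int(len(frames) * train_ratio)
    for split, subset in (("train", frames[:n_train]), ("val", frames[n_train:])):
        dst = os.path.join(out_dir, split)
        os.makedirs(dst, exist_ok=True)
        for f in subset:
            shutil.copy(f, dst)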
Let's do some fine-tuning using this dataset:
• Simplicity.
• The model is pretrained on RGB, the dataset is small, and the weight changes will be small; therefore we should introduce as few new weights to the model as possible.
Shape-convolution (ShapeConv): https://openaccess.thecvf.com/content/ICCV2021/html/Cao_ShapeConv_Shape-Aware_Convolutional_Layer_for_Indoor_RGB-D_Semantic_Segmentation_ICCV_2021_paper.html
Lucid Data Dreaming: https://github.com/ankhoreva/LucidDataDreaming
Figure 12: We fuse the RGB image with the depth to form a 4-channel image
The model also starts from the pretrained weights; I modify the conv_stem layer of the EfficientNet-B3 encoder to accept 4-channel images and initialize the 4th input channel of the kernel with zeros.
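A minimal sketch of this modification, assuming a timm EfficientNet-B3 encoder whose stem convolution is exposed as conv_stem (as named above); the rest of the Eff-Unet decoder wiring is omitted:

import torch
import torch.nn as nn
import timm

encoder = timm.create_model("efficientnet_b3", pretrained=True)

old = encoder.conv_stem                           # Conv2d(3, 40, 3x3, stride 2) in timm
new = nn.Conv2d(4, old.out_channels,
                kernel_size=old.kernel_size, stride=old.stride,
                padding=old.padding, bias=old.bias is not None)
with torch.no_grad():
    new.weight[:, :3] = old.weight                # reuse the pretrained RGB filters
    new.weight[:, 3:] = 0.0                       # zero-init the extra depth channel
encoder.conv_stem = new

# A 4-channel RGB-D tensor now flows through the pretrained stem unchanged for RGB
rgbd = torch.randn(1, 4, 300, 300)
features = encoder.forward_features(rgbd)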
MODEL EVALUATION:
The inference time of our model is ~0.02 s (50 FPS) on GPU and ~0.25 s (4 FPS) on CPU, which is very fast on GPU.
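As a side note, GPU timings of this kind are usually measured with explicit synchronization; a hedged sketch with a dummy stand-in network (not the actual model):

import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.Conv2d(16, 1, 1)).cuda().eval()
x = torch.randn(1, 4, 480, 640, device="cuda")

with torch.no_grad():
    for _ in range(10):                           # warm-up to exclude one-time CUDA costs
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                      # wait for all queued kernels to finish
print(f"mean inference time: {(time.perf_counter() - t0) / 100:.4f} s")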
GPU Memory:
An interesting idea I have is to fit the person with a 3D model and then project it back into the image domain. Since both the 2D and 3D information is available, we can fit a 3D human model by regressing from the visible parts of the body. This is a result I tested with a model called ROMP: https://github.com/Arthur151/ROMP
By fitting the model to the person, we can recover the missing parts of the human body.
Figure 16: Fit a 3D human model to an existing pose to recover the missing parts
References:
Baheti, Bhakti, et al. "Eff-unet: A novel architecture for semantic segmentation in unstructured environment."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.
Cao, Jinming, et al. "ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation."
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Khoreva, Anna, et al. "Lucid data dreaming for object tracking." The DAVIS challenge on video object segmentation.
2017.