EDGE-Net: Efficient Deep-Learning Gradients Extraction Network
Abstract. Deep Convolutional Neural Networks (CNNs) have achieved impressive performance in
edge detection tasks, but their large number of parameters often leads to high memory and energy
costs for implementation on lightweight devices. In this paper, we propose a new architecture, called
Efficient Deep-learning Gradients Extraction Network (EDGE-Net), that integrates the advan-
tages of Depthwise Separable Convolutions and deformable convolutional networks (Deformable-
ConvNet) to address these inefficiencies. By carefully selecting proper components and utilizing
network pruning techniques, our proposed EDGE-Net achieves state-of-the-art accuracy in edge
detection while significantly reducing complexity. Experimental results on BSDS500 and NYUDv2
datasets demonstrate that EDGE-Net outperforms current lightweight edge detectors with only
500k parameters, without relying on pre-trained weights.
Keywords: Efficient edge detection, Lightweight deep neural network, Enhanced receptive field
1 Introduction
Figure 1 provides insight into the detection accuracy and complexity (model size) of several well-known deep learning-based methods. The orange dot on the graph represents our model, which closely aligns with human perception in terms of accuracy despite its small number of parameters.
Many deep learning-based edge detectors use VGGNet (Visual Geometry Group) [6] as their feature extractor due to its impressive performance. However, the
network’s extensive backbone and high parameter count make it more appropriate
for complex tasks such as object recognition and image segmentation. Our moti-
vation for this study arises from the fact that edge detection is a low-level image-
processing task that does not require complex networks for feature extraction.
To decrease the number of parameters and floating point operations (FLOPs), we leverage depthwise separable convolutions [7], which disentangle the spatial and channel interactions that occur during regular convolution operations. However, this technique can reduce performance compared to conventional convolutions. To compensate for this reduction, we increase the receptive field by selecting appropriate lightweight components for edge detection purposes. The details of this approach are discussed in Section 3.
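As an illustrative sketch, the depthwise and pointwise factorization can be written in a few lines of PyTorch; the channel sizes below are chosen only to show the parameter savings and are not the exact EDGE-Net configuration.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (per-channel spatial filtering) followed by a
    pointwise 1x1 convolution (channel mixing)."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # groups=in_channels: each filter sees exactly one input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # 1x1 convolution recombines the channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for illustrative sizes (64 -> 128 channels, 3x3 kernel)
std = nn.Conv2d(64, 128, 3, padding=1)
sep = DepthwiseSeparableConv(64, 128, 3)
print(sum(p.numel() for p in std.parameters()))  # 73,856
print(sum(p.numel() for p in sep.parameters()))  # 8,960
```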
Fig. 1. Comparison of complexity and accuracy performance among various edge detection schemes. Our proposed method is shown in orange.
In Section 4, the results of the experiments are presented, and a comparison is made between the proposed model and state-of-the-art edge detector networks using the Berkeley Segmentation Dataset 500 (BSDS500) [8] and NYUDv2 [9] datasets. Finally, Section 5 presents concluding remarks and outlines future research directions.
2 Related Work
3 The Proposed EDGE-Net
This section presents our proposed EDGE-Net, a lightweight neural network that
provides high running efficiency. It offers a solution to the efficiency concerns of
the models discussed in the previous section. Figure 2 illustrates the architecture
of EDGE-Net. We train the network from scratch to optimize its performance. In
the following paragraphs, we provide a detailed review of the components utilized
in EDGE-Net.
The majority of deep learning-based edge detectors, such as those proposed in [15–
17], utilize VGGNet as their feature extraction backbone. However, we posit that
edge detection is a task that can be accomplished using a less complex backbone. We
achieve this by incorporating lightweight components that maintain high efficiency.
In order to achieve a pyramid structure, we employ three stages, with a max-
pooling operation for downsampling the features between stages. This results in a
decrease in the dimension of output feature maps as we progress through the stages.
As the complexity of the patterns increases in the subsequent stages, we increase
the number of feature channels (i.e., the number of filters) to capture a greater
number of combinations. The channel numbers for stages 1, 2, and 3 are 16, 64,
and 256, respectively. Our backbone comprises mainly deformable and customized
depthwise separable convolutions. To create the fused output, we first upsample the low-resolution features using standard bilinear interpolation and then concatenate all of the stage outputs. In the following sections, we provide detailed explanations of the layers and components used in EDGE-Net.
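A rough structural sketch of this pyramid is given below; it reuses the DepthwiseSeparableConv block from the earlier example, and the per-stage depth, side-output layers, and 1x1 fusion convolution are simplifying assumptions rather than the exact EDGE-Net layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBackbone(nn.Module):
    """Three stages with 16, 64, and 256 channels; max-pooling halves the
    resolution between stages, and stage outputs are upsampled and fused."""
    def __init__(self, channels=(16, 64, 256)):
        super().__init__()
        self.stages, self.side_outs = nn.ModuleList(), nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                DepthwiseSeparableConv(in_ch, out_ch), nn.ReLU(inplace=True),
                DepthwiseSeparableConv(out_ch, out_ch), nn.ReLU(inplace=True)))
            # one single-channel edge map per stage (side output)
            self.side_outs.append(nn.Conv2d(out_ch, 1, kernel_size=1))
            in_ch = out_ch
        self.pool = nn.MaxPool2d(2, 2)
        self.fuse = nn.Conv2d(len(channels), 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        side_maps = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            side = self.side_outs[i](x)
            # bilinear upsampling of low-resolution side outputs
            side_maps.append(F.interpolate(side, size=(h, w),
                                           mode='bilinear', align_corners=False))
            if i < len(self.stages) - 1:
                x = self.pool(x)
        fused = self.fuse(torch.cat(side_maps, dim=1))
        return side_maps, fused
```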
Standard convolution kernels, with their fixed structure, have limitations in capturing geometric transformations. Deformable convolutions, on the other hand, offer a more efficient solution to this problem. These convolutions possess the ability to adapt their kernel shape and parameters to the image content, thereby accommodating geometric variations. By adding learned 2D offsets to the regular sampling locations of the standard convolution, deformable convolutions enable the network to have different receptive fields, depending on the scale of the objects. The 2D offsets are learned from the preceding feature maps using additional convolutional layers and can be trained end-to-end with standard back-propagation. In order to keep the network light in terms of parameters and computation, we add this module at the end of each stage to strengthen the features before transferring them to the next stage [22].
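The sketch below shows one way such a module could be attached at the end of a stage using torchvision's DeformConv2d; the 3x3 kernel and the single convolution used to predict the offsets are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Predicts 2D offsets from the incoming features, then applies a
    deformable convolution whose sampling grid follows those offsets."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # two offset values (dy, dx) for each of the k*k kernel positions
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)        # learned, content-dependent offsets
        return self.deform_conv(x, offsets)  # adaptive receptive field

feats = torch.randn(1, 64, 80, 80)
print(DeformableBlock(64)(feats).shape)      # torch.Size([1, 64, 80, 80])
```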
Maxout Layer Before transferring the inputs to the side output layers (from
left to right) at each stage, we perform a Maxout operation rather than using
the standard concatenation block. The Maxout activation reduces the number of
parameters significantly compared to classical dense blocks by inducing competition
between feature maps and accelerating network convergence. Instead of stacking
the outputs of previous layers on top of each other at each stage, we keep only the
maximum value at each position. This approach reduces the number of parameters
and improves the model’s performance.
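A minimal sketch of this position-wise Maxout fusion, assuming the stage produces a list of equally-shaped feature maps:

```python
import torch

def maxout_fuse(feature_maps):
    """Keep only the maximum value at each position instead of
    concatenating the feature maps along the channel dimension."""
    stacked = torch.stack(feature_maps, dim=0)   # (n_maps, N, C, H, W)
    return stacked.max(dim=0).values             # (N, C, H, W)

# Concatenation would give 3*C channels; Maxout keeps C channels, so the
# layer that consumes the result needs three times fewer input weights.
maps = [torch.randn(1, 16, 64, 64) for _ in range(3)]
print(maxout_fuse(maps).shape)   # torch.Size([1, 16, 64, 64])
```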
The channel attention block redistributes the channel feature responses to en-
hance the importance of specific channels while attenuating others. To calculate the
channel attention, the spatial dimension of the input feature map is first reduced,
a process known as squeezing, as proposed by Woo et al. [24].
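A sketch of a channel attention block along these lines is shown below, following the CBAM formulation of squeezing with both average and max pooling; the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze the spatial dimensions with average and max pooling, pass both
    descriptors through a shared MLP, and rescale the channels accordingly."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        # squeeze: H x W -> 1 x 1 per channel, two pooled descriptors
        attn = self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x))
        # excite: per-channel weights in (0, 1) redistribute the responses
        return x * torch.sigmoid(attn)
```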
In an image, the distribution of edge and non-edge pixel data is often imbalanced.
While CNN models may achieve high accuracy by predicting the majority class,
they may overlook the minority class, leading to a misleading accuracy estimate.
To address this issue, we adopt the weighted Cross-Entropy loss function proposed
by Liu et al. [17].
During network training, we compare all stages and fused outputs to the ground
truth. Specifically, we use the following equation to compare each pixel of each image
to its corresponding label.
$$
L(x_i; W) = \begin{cases} \alpha \cdot \log\bigl(1 - P(x_i; W)\bigr) & \text{if } y_i = 0 \\ 0 & \text{if } 0 < y_i \le \eta \\ \beta \cdot \log P(x_i; W) & \text{otherwise,} \end{cases} \tag{1}
$$

in which

$$
\alpha = \lambda \cdot \frac{|Y^+|}{|Y^+| + |Y^-|}, \qquad \beta = \frac{|Y^-|}{|Y^+| + |Y^-|}. \tag{2}
$$

The total loss over the side outputs of all stages and the fused output is then

$$
L(W) = \sum_{i=1}^{|I|} \left( \sum_{k=1}^{|K|} L\bigl(x_i^k; W\bigr) + L\bigl(x_i^{\mathrm{fuse}}; W\bigr) \right). \tag{3}
$$
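A sketch of how Equations (1)-(2) could be implemented for a single output map is given below; it assumes pred is the sigmoid output of the network, and the values of lambda and eta are illustrative since they are not specified above. The total loss of Equation (3) is obtained by summing this term over all side outputs and the fused output.

```python
import torch

def weighted_bce(pred, label, lmbda=1.1, eta=0.5):
    """Class-balanced cross-entropy of Eqs. (1)-(2), written as a negative
    log-likelihood. Pixels with 0 < y <= eta (ambiguous annotations) are
    ignored; the remaining pixels are weighted by inverse class frequency."""
    pos = (label > eta).float()                 # edge pixels
    neg = (label == 0).float()                  # non-edge pixels
    n_pos, n_neg = pos.sum(), neg.sum()
    alpha = lmbda * n_pos / (n_pos + n_neg)     # weight for non-edge pixels
    beta = n_neg / (n_pos + n_neg)              # weight for edge pixels
    eps = 1e-6
    loss = -(beta * pos * torch.log(pred + eps)
             + alpha * neg * torch.log(1.0 - pred + eps))
    return loss.sum()
```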
4 Experiments

4.1 Implementation Details

We implemented our backbone networks using PyTorch and initialized their stages with a zero-mean Gaussian distribution with a standard deviation of 0.01. The learning rate was set to 0.01 initially and then decayed by a factor of 0.1 every two epochs. We used stochastic gradient descent (SGD) as the optimizer and terminated the training process after eight epochs. We conducted all the experiments on a single NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory.
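In PyTorch, this training setup corresponds roughly to the following sketch; the momentum value is an assumption, and PyramidBackbone stands in for the EDGE-Net model sketched earlier.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Zero-mean Gaussian initialization with a standard deviation of 0.01."""
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = PyramidBackbone()   # stand-in for the EDGE-Net model sketched earlier
model.apply(init_weights)

# SGD with an initial learning rate of 0.01, decayed by 0.1 every two epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

for epoch in range(8):      # training is terminated after eight epochs
    # ... one pass over the training set, minimizing the loss of Eq. (3) ...
    scheduler.step()
```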
4.2 Dataset
For the BSDS500 dataset [8], we follow the same data augmentation techniques as in RCF [17], including using the PASCAL VOC dataset [25] and its flipped images as additional training data.
The NYUD dataset consists of 1449 densely labeled pairs of aligned RGB and
depth images (HHA), captured by Microsoft Kinect cameras in various indoor
scenes. The dataset includes 381 training, 414 validation, and 654 testing images.
To augment the dataset, we rotated the images and corresponding annotations to
four different angles (0, 90, 180, and 270 degrees) and flipped them at each angle,
following the approach in RCF [17].
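A sketch of this eight-fold augmentation (four rotations, each also flipped), applied identically to an image and its annotation, is shown below using PIL for illustration.

```python
from PIL import Image

def augment_pair(image, label):
    """Rotate an image/annotation pair to 0, 90, 180, and 270 degrees and
    additionally flip each rotation, yielding eight pairs per original."""
    pairs = []
    for angle in (0, 90, 180, 270):
        img_r = image.rotate(angle, expand=True)
        lbl_r = label.rotate(angle, expand=True)
        pairs.append((img_r, lbl_r))
        pairs.append((img_r.transpose(Image.FLIP_LEFT_RIGHT),
                      lbl_r.transpose(Image.FLIP_LEFT_RIGHT)))
    return pairs
```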
There are two standard ways to determine the threshold used to binarize the network output so that it can be compared with the binarized ground truth, each yielding a corresponding F-score (a simplified computation is sketched after the list below).
– Optimal Dataset Scale (ODS): iterates over all possible thresholds and sets a single threshold for the entire dataset. The threshold that gives the best F-score over the whole dataset is used to calculate the ODS score.
– Optimal Image Scale (OIS): finds the best threshold and corresponding F-score for each image individually. The OIS F-score is calculated by averaging these per-image F-scores.
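The two protocols can be summarized by the simplified sketch below; it computes pixel-wise F-scores directly, whereas the standard BSDS benchmark additionally matches predicted and ground-truth edges within a small spatial tolerance.

```python
import numpy as np

def f_score(pred_bin, gt, eps=1e-9):
    """Pixel-wise F-score between a binarized prediction and binary ground truth."""
    tp = np.logical_and(pred_bin, gt).sum()
    precision = tp / (pred_bin.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

def ods_ois(preds, gts, thresholds=np.linspace(0.01, 0.99, 99)):
    """ODS: best single threshold for the whole dataset;
    OIS: best threshold chosen per image, then averaged."""
    per_image = np.array([[f_score(p > t, g) for t in thresholds]
                          for p, g in zip(preds, gts)])
    ods = per_image.mean(axis=0).max()   # one dataset-wide threshold
    ois = per_image.max(axis=1).mean()   # per-image best, averaged
    return ods, ois
```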
Fig. 11. Precision-Recall curves of our models and some competitors on the BSDS500 dataset.
On the NYUD dataset - Table 2 presents the comparison results for the NYUD dataset, while Figure 12 depicts the precision-recall curves. To test our model on NYUD, we adopt network settings similar to those used for BSDS500. Some studies train two separate models on the RGB images and the HHA feature images of NYUD and report the evaluation metrics on the average of the two models' outputs. However, our network is tested on RGB images only. Therefore, to ensure a fair evaluation, we compare our model's output with those of models that were also tested solely on RGB images.
Fig. 12. Precision-Recall curves of our models and some competitors on the NYUD dataset.
5 Conclusion
References
1. Victor Wiley and Thomas Lucas. Computer vision and image processing: a paper review.
International Journal of Artificial Intelligence Research, 2(1):29–36, 2018.
2. Ronald J Holyer and Sarah H Peckinpaugh. Edge detection applied to satellite imagery of the
oceans. IEEE transactions on geoscience and remote sensing, 27(1):46–56, 1989.
3. Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed Shaharyar Khwaja. Deep learn-
ing for object detection and scene perception in self-driving cars: Survey, challenges, and open
issues. Array, 10:100057, 2021.
4. Wei-Chun Lin and Jing-Wein Wang. Edge detection in medical images with quasi high-pass
filter based on local statistics. Biomedical Signal Processing and Control, 39:294–302, 2018.
5. Jan Kristanto Wibisono and Hsueh-Ming Hang. Traditional method inspired deep neural
network for edge detection. In 2020 IEEE International Conference on Image Processing
(ICIP), pages 678–682. IEEE, 2020.
6. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
7. Yunhui Guo, Yandong Li, Liqiang Wang, and Tajana Rosing. Depthwise convolution is all
you need for learning multiple visual domains. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 33, pages 8368–8375, 2019.
8. Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection
and hierarchical image segmentation. IEEE transactions on pattern analysis and machine
intelligence, 33(5):898–916, 2010.
9. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European conference on computer vision, pages 746–760. Springer, 2012.
10. O Rebecca Vincent, Olusegun Folorunso, et al. A descriptive algorithm for Sobel image edge detection. In Proceedings of informing science & IT education conference (InSITE), volume 40, pages 97–107, 2009.
11. Renjie Song, Ziqi Zhang, and Haiyang Liu. Edge connection based Canny edge detection algorithm. Pattern Recognition and Image Analysis, 27(4):740–747, 2017.
12. Scott Konishi, Alan L. Yuille, James M. Coughlan, and Song Chun Zhu. Statistical edge
detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 25(1):57–74, 2003.
13. Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. IEEE
transactions on pattern analysis and machine intelligence, 37(8):1558–1570, 2014.
14. Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE
international conference on computer vision, pages 1395–1403, 2015.
15. Yupei Wang, Xin Zhao, and Kaiqi Huang. Deep crisp boundaries. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 3892–3900, 2017.
16. Dan Xu, Wanli Ouyang, Xavier Alameda-Pineda, Elisa Ricci, Xiaogang Wang, and Nicu Sebe.
Learning deep structured multi-scale features using attention-gated crfs for contour prediction.
Advances in neural information processing systems, 30, 2017.
17. Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional
features for edge detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 3000–3009, 2017.
18. Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict
crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV),
pages 562–578, 2018.
19. Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, and Tiejun Huang. Bi-directional
cascade network for perceptual edge detection. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 3828–3837, 2019.
20. Jan Kristanto Wibisono and Hsueh-Ming Hang. FINED: Fast inference network for edge detection. arXiv preprint arXiv:2012.08392, 2020.
21. Xavier Soria Poma, Edgar Riba, and Angel Sappa. Dense extreme inception network: Towards a robust CNN model for edge detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1923–1932, 2020.
22. Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. De-
formable convolutional networks. In Proceedings of the IEEE international conference on
computer vision, pages 764–773, 2017.
23. Leonie Henschel, Sailesh Conjeti, Santiago Estrada, Kersten Diers, Bruce Fischl, and Martin Reuter. FastSurfer - a fast and accurate deep learning based neuroimaging pipeline. NeuroImage, 219:117012, 2020.
24. Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
25. Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler,
Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic
segmentation in the wild. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 891–898, 2014.