MULTIRESOLUTION MIXTURE GENERATIVE ADVERSARIAL NETWORK FOR IMAGE SUPER-RESOLUTION
Yudiao Wang1, Xuguang Lan2, Yinshu Zhang1, Ruixue Miao3, Zhiqiang Tian1
1 School of Software, Xi’an Jiaotong University
2 School of Electronics and Information Engineering, Xi’an Jiaotong University
3 School of Information Sciences and Technology, Northeast Normal University
{wangyd@stu, xglan@mail}.xjtu.edu.cn, [email protected]
[email protected], [email protected]
Fig. 1. Multiresolution mixture network (a). The x1, x2, and x4 denote the scales of the feature maps. The start module (b) of the multiresolution mixture network, with the corresponding kernel size (k), number of feature maps (n), and stride (s) indicated for each convolutional layer.
which only learned the difference from the HR image to the SR image [2]. Kim J et al. proposed a deep network for SR which used recursive supervision and skip connections [13]. To reduce the number of training parameters, Tai Y et al. [14] proposed a deep CNN model called the deep recursive residual network. However, the inputs of these models have the same size as the ground-truth image, which is time consuming. To make training easier and faster, some pioneering work improved SR at the network-architecture level [2], handling the three RGB channels at the same time and making the whole network more lightweight and faster than SRCNN. Lai WS et al. proposed the Laplacian pyramid network for residual reconstruction, which made training faster [15].

With the development of the generative model GAN, excellent performance has been shown in the generation of image details. SRGAN made great progress in perceptual quality, which benefits from using a GAN to solve the SR problem. Xu X et al. used a GAN model and a better loss function to solve text and face deblurring problems [16], which also improved perceptual quality.

However, most existing SR models only explore networks that hold feature maps of a single resolution at a time during training. In this work, we propose MRMNet, which holds feature maps of different resolutions at the same time during training.

3. METHOD

In this section, we introduce the details of the proposed method, namely the generator and the loss function of the generator. For the discriminator and its loss function, we follow SRGAN.

3.1. Generator

In order to further improve SR image quality, we propose the multiresolution mixture network (MRMNet), which can be used as the generator in adversarial training. The details of MRMNet are shown in Figure 1(a).

Generally speaking, when viewed from the horizontal direction, MRMNet has three routes of feature-map resolution, corresponding to the original (x1) resolution, the x2 resolution, and the x4 resolution (the target resolution) in Figure 1(a). Between these routes there are exchange units, which are responsible for resolution scaling and for exchanging features between feature maps of different resolutions.

Specifically, MRMNet is composed of a start module, exchange units (EU), and residual modules. The start module is shown in Figure 1(b); it is worth emphasizing that the feature maps are added after the convolution layer and the batch-norm layer. The exchange unit is shown in Figure 2; it has "n" inputs and "m" outputs. Figure 2 shows three inputs and two outputs: the resolutions of the input feature maps are x1, x2, and x4, and the resolutions of the output feature maps are x2 and x4. In the exchange unit, lower-resolution feature maps are converted to higher-resolution feature maps through deconvolution layers; conversely, higher-resolution feature maps are converted to lower-resolution (or same-resolution) feature maps through convolution layers. To produce an output, every input feature map is first brought to the output resolution through these conversions, and the converted maps are then summed. There are three exchange units in Figure 1(a); from their specific inputs and outputs, the internal details of each exchange unit can be derived. The residual module is a classic residual block with a convolution kernel size of 3 and a stride of 1. The numbers of feature maps at the x1, x2, and x4 resolutions are 128, 64, and 32, respectively. Together, these modules make an LR image gradually recover into an HR one.

Most previous methods enlarge the image to the target resolution at the end of the network, or use a target-resolution image as the network input. In contrast, our proposed MRMNet holds feature maps of multiple resolutions at the same time during training. Feature compensation is achieved by the exchange units of MRMNet; in other words, lost image features can be compensated by extracting features from the feature maps of other resolutions. Experimental results confirm that our method is feasible; more details are given in Section 4.
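To make the exchange-unit mechanics concrete, the following is a minimal TensorFlow 1.x sketch (matching the TensorFlow 1.10 setup reported in Section 4.2) of one exchange unit with three inputs (the x1, x2, and x4 routes with 128, 64, and 32 feature maps) and two outputs (x2 and x4). The function names, the kernel sizes, and the use of a 1x1 convolution for same-resolution conversion are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf


def _convert(x, scale_diff, out_channels, name):
    """Convert a feature map to another resolution route.
    scale_diff > 0: upscale by 2**scale_diff with a deconvolution layer;
    scale_diff < 0: downscale by 2**(-scale_diff) with a strided convolution;
    scale_diff = 0: 1x1 convolution to match the channel count (assumption)."""
    with tf.variable_scope(name):
        if scale_diff > 0:
            stride = 2 ** scale_diff
            return tf.layers.conv2d_transpose(x, out_channels, 2 * stride,
                                              strides=stride, padding='same')
        if scale_diff < 0:
            return tf.layers.conv2d(x, out_channels, 3,
                                    strides=2 ** (-scale_diff), padding='same')
        return tf.layers.conv2d(x, out_channels, 1, strides=1, padding='same')


def exchange_unit(f_x1, f_x2, f_x4, name='exchange_unit'):
    """Exchange unit with three inputs (x1/x2/x4 routes, 128/64/32 channels)
    and two outputs (x2 and x4 routes): every input is converted to each
    output resolution and the converted feature maps are summed."""
    with tf.variable_scope(name):
        out_x2 = tf.add_n([_convert(f_x1, +1, 64, 'x1_to_x2'),
                           _convert(f_x2, 0, 64, 'x2_to_x2'),
                           _convert(f_x4, -1, 64, 'x4_to_x2')])
        out_x4 = tf.add_n([_convert(f_x1, +2, 32, 'x1_to_x4'),
                           _convert(f_x2, +1, 32, 'x2_to_x4'),
                           _convert(f_x4, 0, 32, 'x4_to_x4')])
        # During training the paper clips exchange-unit outputs to [-5, 5].
        return (tf.clip_by_value(out_x2, -5.0, 5.0),
                tf.clip_by_value(out_x4, -5.0, 5.0))
```

A full generator would combine the start module, the per-route residual modules, and several such exchange units as laid out in Figure 1(a).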
3.2. Loss Function

3.2.1. Content loss

For the content loss, $\phi_{i,j}$ denotes the feature map acquired by the fourth convolution (after activation) before the fifth max-pooling layer within the VGG19 [17] network, and $W_{i,j}$ and $H_{i,j}$ denote the dimensions of the feature maps. $I^{LR}$ is the LR image version of its high-resolution counterpart $I^{HR}$.

3.2.2. Reconstruction loss

The reconstruction loss is designed to improve the PSNR and SSIM of the SR image. According to different image sources, we define two kinds of reconstruction loss, as shown in formulas (4) and (5):

$$ l_{rec}^{pixel} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|I^{HR}_{x,y} - G(I^{LR})_{x,y}\right|, \qquad (4) $$

$$ l_{rec}^{feat} = \frac{1}{W_{i,j}H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\left|\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G(I^{LR}))_{x,y}\right|. \qquad (5) $$
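As a hedged illustration, formulas (4) and (5) can be written in TensorFlow 1.x as follows. The sketch assumes a generator output sr, the ground truth hr, and a hypothetical helper vgg54_features that extracts the VGG19 feature maps defined above; the mean-absolute-difference normalization mirrors the formulas.

```python
import tensorflow as tf


def pixel_reconstruction_loss(hr, sr):
    """Formula (4): mean absolute difference between the HR image and the
    super-resolved image G(I^LR), averaged over all pixels and channels."""
    return tf.reduce_mean(tf.abs(hr - sr))


def feature_reconstruction_loss(hr, sr, vgg54_features):
    """Formula (5): mean absolute difference between the VGG19 feature maps
    phi_{i,j} (fourth convolution after activation, before the fifth
    max-pooling layer) of the HR image and of the super-resolved image.
    `vgg54_features` is an assumed helper returning those feature maps."""
    return tf.reduce_mean(tf.abs(vgg54_features(hr) - vgg54_features(sr)))
```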
The first stage of the loss function focuses on the generation of images with high perceptual quality. The second stage mainly improves PSNR and SSIM, meanwhile recovering the contours of the image better and further improving perceptual quality. The third stage strengthens the loss function of the second stage.

4. EXPERIMENT

4.1. Data

The training dataset consists of the DIV2K dataset [18], the Flickr2K dataset [19], and OutdoorSceneTraining (OST) [20]. DIV2K is a high-quality (2K resolution) dataset that contains 800 images for image restoration tasks. The Flickr2K dataset contains 2,650 2K HR images collected from the Flickr website. The OST dataset is used to further enrich our training data. At the same time, we use image augmentation during training, which includes random rotation and flipping, as sketched below. For evaluating the performance of our model, the widely used benchmark datasets Set5 [21], Set14 [22], BSD100 [23], and Urban100 [24] were chosen.
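The augmentation and the LR-patch generation can be sketched as follows in TensorFlow 1.x. The helper names are illustrative, and the 96x96 HR patch size is only implied by the 24x24 LR crop and the 4x scale factor reported in Section 4.2.

```python
import tensorflow as tf


def augment_hr_patch(hr_patch):
    """Random rotation (multiples of 90 degrees) and random flipping of a
    single HR training patch, as used for data augmentation."""
    k = tf.random_uniform([], minval=0, maxval=4, dtype=tf.int32)
    hr_patch = tf.image.rot90(hr_patch, k=k)
    hr_patch = tf.image.random_flip_left_right(hr_patch)
    return tf.image.random_flip_up_down(hr_patch)


def make_lr_patch(hr_patch, hr_size=96, scale=4):
    """LR patch obtained by bicubic downsampling of the HR patch; a 24x24
    LR crop at the 4x setting corresponds to a 96x96 HR crop."""
    lr = tf.image.resize_bicubic(hr_patch[tf.newaxis],
                                 [hr_size // scale, hr_size // scale])
    return lr[0]
```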
4.2. Training Details

Following SRGAN, all LR images are magnified four times to obtain the SR images. During training, we scale the range of the LR input images to [0, 1] and that of the HR images to [-1, 1]. As a training trick, the output values of the exchange unit are clipped to [-5, 5] to prevent image noise. The LR training patch, whose crop size is 24x24, is obtained by bicubic downsampling of the HR image. The batch size is set to 16.

The whole training process is divided into three stages. At the first stage, the learning rate is 1e-4 and the loss function is consistent with SRGAN, in which the adversarial term is weighted by 1e-3. At the second stage, the learning rate is still 1e-4 and the second-stage loss function is used. At the last stage, the learning rate is 1e-5 and the third-stage loss function is used. The whole training takes 300,000 steps in total, with 100,000 steps per stage.

For the optimization method, we use Adam [25] throughout training, with β1 = 0.9 and β2 = 0.99. The whole training is carried out under TensorFlow 1.10 on a Tesla P100-PCIE GPU with 16 GB of memory.
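The staged schedule above can be summarized with a short sketch. The three stage losses are assumed to be given as tensors (their exact terms follow the Stage1/Stage2/Stage3 columns of Table 1), and feed_fn as well as the function names are illustrative rather than the authors' code.

```python
import tensorflow as tf

BETA1, BETA2 = 0.9, 0.99   # Adam settings reported above
STEPS_PER_STAGE = 100000   # 3 stages x 100,000 steps = 300,000 steps


def build_train_ops(stage_losses):
    """stage_losses: the three stage loss tensors (assumed given).
    Stages 1 and 2 use a learning rate of 1e-4, stage 3 uses 1e-5."""
    learning_rates = [1e-4, 1e-4, 1e-5]
    return [tf.train.AdamOptimizer(lr, beta1=BETA1, beta2=BETA2).minimize(loss)
            for loss, lr in zip(stage_losses, learning_rates)]


def train(sess, train_ops, feed_fn):
    """Run each stage for 100,000 steps. `feed_fn` is an assumed helper that
    returns a feed_dict with a batch of 16 LR crops (24x24, scaled to [0, 1])
    and the matching HR crops (scaled to [-1, 1])."""
    sess.run(tf.global_variables_initializer())
    for train_op in train_ops:
        for _ in range(STEPS_PER_STAGE):
            sess.run(train_op, feed_dict=feed_fn())
```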
4.3. Results

For evaluating the perceptual quality of the experimental results, the recently proposed perceptual metric Learned Perceptual Image Patch Similarity (LPIPS) [26] was used. When calculating LPIPS, the mode parameter is set to net-lin and the net parameter is set to alex. The smaller the value of LPIPS, the better the perceptual quality of the result. The traditional evaluation indicators PSNR and SSIM were not used, because they cannot evaluate perceptual quality very well.
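For reference, the following sketch computes LPIPS with the lpips Python package, whose default configuration (net='alex' with linear calibration) corresponds to the net-lin/alex setting stated above; the wrapper function is illustrative.

```python
import torch
import lpips

# net='alex' with the default linear calibration corresponds to the
# mode='net-lin', net='alex' setting of the reference implementation.
loss_fn = lpips.LPIPS(net='alex')


def lpips_distance(sr_image, hr_image):
    """sr_image and hr_image are float tensors of shape (1, 3, H, W) scaled
    to [-1, 1]; smaller values indicate better perceptual quality."""
    with torch.no_grad():
        return loss_fn(sr_image, hr_image).item()
```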
4.3.1. Comparison of different network architectures

To study the performance of MRMNet and of MRMNet without the exchange unit (MRMNet-noEU), we calculated the LPIPS values on the four evaluation datasets. MRMNet-noEU enlarges the resolution by deconvolution, while the feature compensation of the exchange unit is removed. Meanwhile, SRResNet [10] was used as the comparison baseline. Figure 3 shows the results obtained with Loss 5 (left) and Loss 6 (right) as the loss function; the details of Loss 5 and Loss 6 are given in the Loss-Name and Loss columns of Table 1.

As we can see from Figure 3, the perceptual quality of MRMNet is better than that of SRResNet and MRMNet-noEU, because MRMNet obtains smaller LPIPS values.

[Figure 3: two grouped bar charts of LPIPS (0 to 0.3) on the four evaluation datasets for SRResNet, MRMNet-noEU, and MRMNet.]

Fig. 3. Comparison of the LPIPS values on different datasets between SRResNet, MRMNet-noEU, and MRMNet, using Loss 5 (left) and Loss 6 (right) as the loss function.

Table 1. Different losses.

  stage loss function?   Loss-Name   Loss
  No                     Loss 1
  No                     Loss 2      +
  No                     Loss 3      +
  No                     Loss 4      +
                                     Stage1   Stage2   Stage3
  Yes                    Loss 5
  Yes                    Loss 6
  Yes                    Loss 7

4.3.2. Comparison of different loss functions

In order to study the effect of different loss functions, we designed different combinations of loss functions. Specifically, we used the stage loss function and designed seven different loss functions. The details are given in the Loss column of Table 1, and the corresponding names are given in the Loss-Name column.
Table 2. Comparison of Bicubic, SRCNN [1], EDSR [5], VDSR [2], DRCN [27], SRGAN [10], and MRMGAN (ours) on the benchmark datasets Set5, Set14, BSDS100, and Urban100. Best LPIPS values in bold.
Dataset Bicubic SRCNN EDSR VDSR DRCN SRGAN MRMGAN
Set5 0.3397 0.1769 0.1733 0.1798 0.1820 0.1036 0.0880
Set14 0.44 0.2788 0.2870 0.3002 0.3044 0.1803 0.1651
BSDS100 0.5087 0.3788 0.3562 0.3759 0.3835 0.1989 0.1959
Urban100 0.4728 0.3004 0.2283 0.2729 0.2869 0.1801 0.1407
We then calculated the LPIPS values on the evaluation datasets for performance comparison. Figure 4 shows the LPIPS results on the datasets Set5, Set14, BSDS100, and Urban100. From Figure 4, we can see that the LPIPS value becomes smaller and smaller as the loss function changes from Loss 1 to Loss 7, which means our loss function is effective.

[Figure 4: grouped bar chart of LPIPS (0 to 0.6) for Loss 1 through Loss 7 on Set5, Set14, BSDS100, and Urban100.]

Fig. 4. Comparison of the LPIPS values under different loss functions. The abscissa represents the loss functions defined in Table 1.

4.3.3. Performance of the proposed method

Our method can recover contours better and obtain higher perceptual quality than previous methods. In addition, the ratio of the different feature maps used when performing feature exchange in the exchange unit can be refined to obtain better results.

[Qualitative x4 comparison on images 86000 and 210088 from BSDS, with LPIPS values: Bicubic (0.42 / 0.34), SRCNN (0.29 / 0.14), EDSR (0.20 / 0.05), VDSR (0.25 / 0.07), DRCN (0.25 / 0.07), SRGAN (0.17 / 0.08), Ours (0.11 / 0.04), HR (0 / 0).]
6. REFERENCES

[1] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295-307, 2015.
[2] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646-1654.
[3] C. Dong, C. C. Loy, and X. Tang, "Accelerating the super-resolution convolutional neural network," in European Conference on Computer Vision, Springer, 2016, pp. 391-407.
[4] W. Shi et al., "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874-1883.
[5] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136-144.
[6] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun, "Meta-SR: A magnification-arbitrary network for super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1575-1584.
[7] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, IEEE, 2003, pp. 1398-1402.
[8] P. Gupta, P. Srivastava, S. Bhardwaj, and V. Bhateja, "A modified PSNR metric based on HVS for quality assessment of color images," in 2011 International Conference on Communication and Industrial Application, IEEE, 2011, pp. 1-4.
[9] I. Goodfellow et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.
[10] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681-4690.
[11] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision, Springer, 2016, pp. 694-711.
[12] Z. Wang, J. Chen, and S. C. Hoi, "Deep learning for image super-resolution: A survey," arXiv preprint arXiv:1902.06068, 2019.
[13] J. Kim, J. Kwon Lee, and K. Mu Lee, "Deeply-recursive convolutional network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637-1645.
[14] Y. Tai, J. Yang, and X. Liu, "Image super-resolution via deep recursive residual network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3147-3155.
[15] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 624-632.
[16] X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister, and M.-H. Yang, "Learning to super-resolve blurry face and text images," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 251-260.
[17] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[18] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126-135.
[19] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114-125.
[20] X. Wang, K. Yu, C. Dong, and C. Change Loy, "Recovering realistic texture in image super-resolution by deep spatial feature transform," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606-615.
[21] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding," 2012.
[22] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in International Conference on Curves and Surfaces, Springer, 2010, pp. 711-730.
[23] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the IEEE International Conference on Computer Vision, 2001.
[24] J.-B. Huang, A. Singh, and N. Ahuja, "Single image super-resolution from transformed self-exemplars," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197-5206.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[26] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586-595.
[27] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, "Deep reconstruction-classification networks for unsupervised domain adaptation," in European Conference on Computer Vision, Springer, 2016, pp. 597-613.