Content-Aware Image Compression With Convolutional Neural Networks
Abstract: Traditional image compression algorithms treat all image regions equally, regardless of their
content, often resulting in reconstructed images that do not correlate well with human perception.
Content-aware compression, on the other hand, prioritizes image regions that are more relevant to the
interpretation of an image and encodes them at a higher bitrate, i.e. without loss or with less loss, than
the rest of the image. Our paper explores the multi-structure region of interest (MS-ROI) model, a
convolutional neural network, which enables the localization of multiple regions of interest (ROIs) in an
image. The localization is expressed as a corresponding saliency map, which identifies the relevance of
individual image regions and provides a saliency value for each pixel of the given image. This information
is then used to guide the compression. The saliency values are discretized into multiple levels and more
important levels are encoded with a higher quality factor Q than the less important ones, allowing for
most of the reduction in image resolution to occur in non-salient image regions. Because the generated
saliency maps produce soft boundaries between salient and non-salient image regions, smooth transitions
between these regions are achieved. The obtained image is then encoded further using the standard JPEG
algorithm with a uniform Q factor, resulting in the final image of the standard JPEG format. Our model
was trained on the Caltech-101 image dataset and its performance was tested on two other image
datasets. We present the obtained saliency maps for several images, as well as the results of
content-aware compression, which are compared to the standard JPEG compression at different Q factors. For an
objective comparison and evaluation of the quality of the obtained images, several standard quality
metrics were used, namely mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity
index (SSIM) and multi-scale structural similarity index (MS-SSIM).
Key words: convolutional neural networks, image compression, JPEG, saliency maps, MS-ROI
1. INTRODUCTION
The primary objective of image compression is to reduce the number of bits required to store or transmit
the image data. Traditional lossy image compression algorithms, such as the widely
adopted JPEG standard, take advantage of repetitive, redundant or imperceptible image data, and of the
limitations of the human visual system, to encode the data more efficiently and reduce the file size. While
the goal is to preserve the perceptual quality of an image, the approximation of the represented content
introduces unavoidable visual artifacts, which reduce the interpretability of the
reconstructed images. Blocking effects, ringing artifacts and blurring, which are most characteristic of
JPEG compression, appear as a consequence of the discontinuities between adjacent 8x8 pixel blocks and
the elimination of high frequencies. Because the algorithm treats all image regions equally, regardless of
their content, the compression artifacts are equally visible in the image background and in
foreground objects. In order to minimize the visibility of these unwanted artifacts in the decoded images,
numerous approaches that focus on obtaining a more accurate reconstruction of the original signal have
been proposed (Dong et al, 2015; Dong et al, 2016a; Tao et al, 2017).
Content-aware compression methods, on the other hand, encode the content in a way that corresponds
more to the manner in which the human eye interprets the image. Because the degree of human interest
in different image regions varies according to what we perceive as more relevant for understanding the
image, content-aware compression prioritizes the more important image regions, i.e. the regions of
interest (ROIs), and enables them to be preserved with less loss than the rest of the image.
Before the encoding process can be accomplished, a saliency map, corresponding to the image content,
needs to be obtained. A saliency map partitions the image into several categories, depending on the
image regions they contain, and therefore serves as the means for quantifying the relevance of individual
regions. The provided contextual information about the image is then integrated into a compression
scheme. Because the selected, more important image regions are encoded at a higher bitrate than the
image background, the compression artifacts in the ROIs of the reconstructed images are less noticeable
than those in the background.
Lastly, by using a weighted sum of the top five predictions instead of only the highest scoring class, the
obtained heatmap enables the detection of multiple regions of interest, which can include objects
belonging to different classes. The localization of multiple ROIs and the production of soft boundaries
between different image regions create coherent heatmaps and enable smooth transitions from salient
to non-salient regions in the reconstructed images.
2. METHODS
The code for the MS-ROI model and the image compression was written in Python 3. The images were
processed using CUDA on an Nvidia GPU with 8 GB of memory.
The model was trained on the Caltech-101 image dataset (Fei-Fei et al, 2006), which consists of 9144
monochromatic, greyscale and RGB images (representing the independent variables) belonging to 102
classes (representing the dependent variables). The last 15 images of each class were used for the
validation set, whereas the remaining images were used for the training set. The training set and the
validation set initially contained 7614 images (83.3%) and 1530 images (16.7%), respectively, but during the
training phase the size of the training set was increased using real-time image augmentation. The original
images were rotated up to 30 degrees, centrally scaled up to 30% and flipped horizontally and vertically.
Every iteration produced a number of transformed images from each class that was approximately equal
to the number of images that the class originally contained. Since the transformations were performed
randomly, each iteration included different variations of the input images.
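The split described above can be checked with a short sketch. The helper name split_class is hypothetical and not taken from the original code. The sketch only reproduces the reported arithmetic (102 classes, 15 validation images per class) and does not attempt the augmentation itself.

```python
def split_class(images, n_val=15):
    """Return (train, val), taking the last n_val items of a class for validation."""
    if len(images) <= n_val:
        raise ValueError("class has too few images for the requested split")
    return images[:-n_val], images[-n_val:]


# Figures reported in the text.
n_total, n_classes, n_val_per_class = 9144, 102, 15

n_val = n_classes * n_val_per_class          # validation images
n_train = n_total - n_val                    # training images before augmentation
val_share = round(100 * n_val / n_total, 1)
train_share = round(100 * n_train / n_total, 1)

print(n_train, train_share, n_val, val_share)
```

The computed counts and shares agree with the 7614 (83.3%) and 1530 (16.7%) images stated in the text.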
The implementation of the model was based on the pretrained VGG16 model (Simonyan et al, 2014), the
architecture of which was modified by removing the fully connected layers at the top of the model and
replacing them with three additional convolutional layers and a GAP layer. The 1000 nodes comprising
the Softmax layer of the standard VGG16 model were replaced with 102 nodes, corresponding to 102
classes included in the Caltech-101 image dataset. The input images were resized to a fixed input size of
224x224 pixels. 3x3 pixel convolution filters with a stride of 1 and 2x2 pixel pooling windows with a stride
of 2 were used for convolution and max-pooling. In total, the neural network consisted of 23 layers – 16
convolutional layers, each followed by a ReLU activation function, 5 max-pooling layers, following each of
the 5 blocks of convolutional layers, a GAP layer and a final Softmax layer. The combination of removing
the fully connected layers, thereby decreasing the number of parameters, and adding a GAP layer, which
in itself serves as a regularizer, reduced the risk of overfitting the model to the training data
sufficiently that dropout was not needed.
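As a rough check of the layer count and of the spatial size reaching the GAP layer, the following sketch traces a 224x224 input through the convolution and pooling stages. It assumes a padding of 1 pixel for the 3x3 convolutions, as in the standard VGG16, so that only the pooling layers change the spatial size. The function names are illustrative.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output size of max-pooling along one spatial dimension."""
    return (size - window) // stride + 1


vgg_blocks = [2, 2, 3, 3, 3]   # convolutional layers per block in standard VGG16
extra_convs = 3                # convolutional layers added in place of the dense layers

size = 224
for n_convs in vgg_blocks:
    for _ in range(n_convs):
        size = conv_out(size)  # unchanged with padding 1 (assumption)
    size = pool_out(size)      # halved after every block

n_conv = sum(vgg_blocks) + extra_convs
total_layers = n_conv + len(vgg_blocks) + 1 + 1   # + GAP + Softmax

print(size, n_conv, total_layers)
```

Under these assumptions the feature maps are 7x7 pixels after the fifth pooling layer, and the count of 16 convolutional layers and 23 layers in total matches the text.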
Three different methods were used for weight initialization. In the unmodified layers of the VGG16 model,
the weights were initialized to the constant pretrained values. In the convolutional layers added on
top of the VGG16 model, the weights were initialized from a truncated Gaussian distribution with a
standard deviation of 0.1, whereas the weights in the GAP layer were initialized using the Gaussian
distribution. The reason for using the truncated normal distribution in each of the three additional
convolutional layers was to reduce the risk of neuron saturation.
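A minimal sketch of truncated Gaussian initialization is given below. It assumes that values farther than two standard deviations from the mean are redrawn, which is a common convention but is not stated in the text. The filter shape is hypothetical.

```python
import numpy as np


def truncated_normal(shape, stddev=0.1, mean=0.0, bound=2.0, seed=0):
    """Draw Gaussian values and redraw any that fall outside mean +/- bound * stddev."""
    rng = np.random.default_rng(seed)
    values = rng.normal(mean, stddev, size=shape)
    outside = np.abs(values - mean) > bound * stddev
    while outside.any():
        values[outside] = rng.normal(mean, stddev, size=int(outside.sum()))
        outside = np.abs(values - mean) > bound * stddev
    return values


# Hypothetical filter bank: 3x3 kernels, 64 input channels, 8 output channels.
weights = truncated_normal((3, 3, 64, 8), stddev=0.1)
print(weights.shape, float(np.abs(weights).max()))
```

Because no weight can exceed 0.2 in magnitude here, the initial pre-activations are less likely to start in a saturated range, which is the motivation given above.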
The cost was calculated using the cross entropy function and minimized using the Adam optimization
algorithm with a learning rate of 0.0001 and the default values of the parameters beta 1, beta 2 and
epsilon. The learning process included 100 iterations with a batch size of 32 images.
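The optimization step can be illustrated for a single scalar parameter. The sketch below assumes the defaults proposed for Adam (beta 1 = 0.9, beta 2 = 0.999, epsilon = 1e-8). The text does not list the values, and frameworks differ in their default epsilon, so these numbers are an assumption.

```python
import math


def cross_entropy(probs, true_index):
    """Cross entropy of a predicted distribution against a one-hot target."""
    return -math.log(probs[true_index])


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# A uniform prediction over the 102 classes gives a loss of ln(102).
uniform_loss = cross_entropy([1 / 102] * 102, 0)

# The first step moves the parameter by roughly the learning rate.
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(uniform_loss, new_param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter changes by approximately the learning rate of 0.0001 regardless of the gradient magnitude.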
In order for the model to be able to identify multiple ROIs, the matrix of sorted activations for each input
image needed to be obtained. Instead of the more commonly applied argmax method, which finds the
index of the element with the maximum activation, the argsort method was used to obtain and sort all
the activations. The five highest scoring activations were picked for every input image and their weighted
sum was taken. Matplotlib's Jet colourmap was used to generate the colour scheme.
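The difference between argmax and the top-five weighted sum can be sketched as follows. The sketch assumes one activation map per class and uses the class scores as weights. Both are simplifications for illustration, and the function name multi_roi_heatmap is hypothetical.

```python
import numpy as np


def multi_roi_heatmap(class_maps, class_scores, top_k=5):
    """Weighted sum of the activation maps of the top_k scoring classes.

    class_maps has shape (H, W, C) and class_scores has shape (C,).
    """
    order = np.argsort(class_scores)[::-1][:top_k]       # highest scores first
    heatmap = class_maps[..., order] @ class_scores[order]
    heatmap = heatmap - heatmap.min()
    peak = heatmap.max()
    if peak > 0:
        heatmap = heatmap / peak                          # scale to [0, 1]
    return heatmap, order


# Toy example: two objects of different classes in opposite corners.
class_maps = np.zeros((4, 4, 6))
class_maps[0, 0, 0] = 1.0      # class 0 responds in the top-left corner
class_maps[3, 3, 3] = 1.0      # class 3 responds in the bottom-right corner
class_scores = np.array([0.60, 0.01, 0.02, 0.30, 0.03, 0.04])

heatmap, order = multi_roi_heatmap(class_maps, class_scores)
print(order[:2], heatmap[0, 0], heatmap[3, 3])
```

Using argmax alone would keep only the map of class 0 and miss the second object. With the weighted sum, the second region remains visible at a lower saliency value.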
The performance of the content-aware compression method based on the MS-ROI model was assessed
on JPEG images from the Salicon dataset (Yu et al, 2015) and on uncompressed BMP images from the
General-100 dataset (Dong et al, 2016b). The results of the MS-ROI based compression were compared to
the standard JPEG compression at Q factors of 50, 30 and 70. For an objective comparison and evaluation
of the quality of the obtained images, the mean squared error (MSE), peak signal-to-noise ratio (PSNR),
structural similarity index (SSIM) and multi-scale structural similarity index (MS-SSIM) were used. Higher
PSNR (measured in dB), SSIM and MS-SSIM values indicate a higher image quality, whereas a higher MSE
indicates a larger error in the reconstructed image. For all of the experiments, the chosen Q values for
the first encoding pass of the MS-ROI compression method ranged from Ql = 30 to Qh = 70. The Qfinal
factor of the second encoding pass depended on the selected Q factor of the JPEG compression and the
file size of the original image. When comparing the MS-ROI compression to the standard JPEG
compression at Q = 50, the average Qfinal was 57, at Q = 30 the average Qfinal was 31, and at Q = 70 the
average Qfinal was 73. The maximum difference between the file sizes of the standard JPEG images and
the images obtained using the MS-ROI model was 1%.
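The mapping from discretized saliency levels to quality factors in the first encoding pass could look like the sketch below. The number of levels (five) and the linear spacing between Ql and Qh are assumptions made for illustration, since the text gives only the range of Q values. The function names are hypothetical.

```python
def quality_levels(q_low=30, q_high=70, n_levels=5):
    """Evenly spaced quality factors from q_low to q_high (n_levels is assumed)."""
    step = (q_high - q_low) / (n_levels - 1)
    return [round(q_low + i * step) for i in range(n_levels)]


def saliency_to_quality(saliency, levels):
    """Pick the quality factor for a saliency value in [0, 1]."""
    s = min(max(saliency, 0.0), 1.0)                     # clamp to the valid range
    index = min(int(s * len(levels)), len(levels) - 1)   # discretize into levels
    return levels[index]


levels = quality_levels()
print(levels)
print([saliency_to_quality(s, levels) for s in (0.0, 0.5, 1.0)])
```

The least salient regions receive Ql = 30 and the most salient regions receive Qh = 70, with intermediate levels in between.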
3. RESULTS
Figure 2: Example of an input image with two salient regions
using the MS-ROI method are all slightly smaller than those compressed using the JPEG algorithm. The
most significant gain in PSNR, 1.61 dB, is achieved in the case of image 2 (snowboarder), the biggest
improvement in SSIM, 0.0107, in the case of image 1 (bird), and the biggest gain in MS-SSIM, 0.0044 in
the case of image 5 (elephant).
Figure 5: Five examples of images from the Salicon dataset and of their corresponding heatmaps
Table 1: Results of calculated quality metrics when comparing the MS-ROI compression to JPEG at Q = 50 – for the
images with salient regions
Image 1 (bird)
Image 2 (snowboarder)
Image 3 (truck)
Image 4 (cathedral)
Image 5 (elephant)
The results of calculated quality metrics for the set of five images, which do not contain any salient areas
or where their identification is more ambiguous (Figure 6), are displayed in Table 2. Because content-
aware compression is intended for encoding images that depict some salient objects, the model's
predictions were expected to be much less accurate in cases where the input images contained patterns,
shapes or textures. Nonetheless, the PSNR values of the MS-ROI compression method turned out to be
higher than those of the JPEG compression for all images, with the exception of the last one (corals).
Unlike the PSNR value, the SSIM and MS-SSIM indexes were improved only for image 1 (stone wall) and
image 3 (leaves).
It is worth mentioning that, even in the cases where any of the PSNR, SSIM or MS-SSIM values of the MS-
ROI compression were higher than for the JPEG compression, the improvement of the MS-ROI method
over the JPEG method is much less significant than in the case of the images that contain clearly
identifiable salient objects.
Figure 6: Five examples of images from the General-100 dataset and of their corresponding heatmaps
Table 2: Results of calculated quality metrics when comparing the MS-ROI compression to JPEG at Q = 50 – for the
images without salient regions
Image 3 (leaves)
Image 4 (pebbles)
Image 5 (corals)
The model was also evaluated on 200 randomly chosen images from the Salicon dataset, which contains a
total of 20000 images, and on 50 randomly chosen images from the General-100 dataset, which contains
a total of 100 images. The average PSNR and SSIM values for the selected images from the Salicon and
General-100 dataset are shown in Table 3 and Table 4, respectively.
Table 3: PSNR and SSIM of 200 images from the Salicon dataset
Results show that the MS-ROI compression method performs better than the standard JPEG
compression, since the former is characterized by a higher PSNR and a better visual quality of the
obtained images. As seen in Table 3, compression based on the MS-ROI model achieves an average gain
of 1.09 dB in PSNR and a 0.0077 gain in SSIM against the standard JPEG compression at Q = 50, and an
average gain of 0.97 dB in PSNR and 0.0048 in SSIM at Q = 70. Meanwhile, the improvement of the
MS-ROI compression, when compared to the JPEG compression at Q = 30, is not as substantial,
though the MS-ROI method still performs better and on average improves the PSNR by 0.61 dB and the
SSIM by 0.0030.
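To put the PSNR gains into perspective, the sketch below defines MSE and PSNR for 8-bit images and converts a gain in dB into the corresponding ratio of mean squared errors. The conversion holds for a single image. Applying it to an average gain is only an approximation, because the average of PSNR values is not the PSNR of the average MSE.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(mse_value, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given MSE (8-bit peak assumed)."""
    if mse_value == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse_value)


def mse_ratio_for_gain(gain_db):
    """Ratio of the new MSE to the old MSE implied by a PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


print(psnr(65.025))
print(mse_ratio_for_gain(1.09))
```

For a single image, a gain of 1.09 dB corresponds to an MSE that is about 22% lower, while a gain of 0.61 dB corresponds to a reduction of about 13%.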
Similarly, the MS-ROI based compression generates better results in comparison with the standard JPEG
compression at Q = 50 for the images from the General-100 dataset. On average, the MS-ROI
compression gains 0.36 dB in PSNR and 0.0036 in SSIM. However, when compared to the JPEG
compression at Q = 30 and Q = 70, the gains of the MS-ROI compression are much smaller.
Compared to the JPEG compression at Q = 70, the MS-ROI compression improves the average
PSNR by 0.07 dB and the SSIM by 0.0018, while, compared to the JPEG compression at Q = 30, the
average gain is only 0.0040 dB in PSNR and 0.0004 in SSIM.
Overall, the MS-ROI model performs better on images from the Salicon dataset. This can be explained by
the fact that the General-100 dataset contains more images depicting textures and patterns compared to
the Salicon dataset, which consists mostly of images of natural indoor and outdoor scenery with various
salient regions. Furthermore, the General-100 dataset contains images of smaller dimensions than the
Salicon dataset; the latter is therefore better suited for a content-aware compression task.
The obtained results were interpreted using a one-way analysis of variance (ANOVA). The purpose of the
test was to determine whether the MS-ROI compression and the JPEG compression actually differ
in the measured characteristics. Since the improvement of the MS-ROI compression over the JPEG
compression was larger for the images from the Salicon dataset than for the images from the
General-100 dataset, ANOVA was performed for the former dataset. An improvement of the MS-ROI
model over the standard JPEG compression is considered statistically significant if the p-value is less than the
significance level of 0.05. The test yielded p-values that were lower than the significance level, for both
the PSNR and SSIM values, when comparing the MS-ROI model to the JPEG compression at Q factors of
30, 50 and 70, thereby rejecting the null hypothesis in favour of the MS-ROI model.
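The F statistic behind a one-way analysis of variance can be computed as in the sketch below. The PSNR values are made up for illustration and are not the measurements from this study. The p-value is then obtained from the F distribution with the returned degrees of freedom, for example with scipy.stats.f_oneway, which is not used here.

```python
def one_way_anova_f(*groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative PSNR values in dB (not taken from the paper).
jpeg_psnr = [30.0, 31.0, 32.0]
msroi_psnr = [31.0, 32.0, 33.0]

f_stat, df_b, df_w = one_way_anova_f(jpeg_psnr, msroi_psnr)
print(f_stat, df_b, df_w)
```

A larger F statistic means that the difference between the group means is large relative to the variation within each group, which corresponds to a smaller p-value.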
4. CONCLUSIONS
This paper explores the content-aware compression based on saliency maps obtained using a
convolutional neural network - the MS-ROI model. We showed that by varying the quantization of
compression, the MS-ROI based encoding is capable of achieving a better visual quality of the
reconstructed images compared to the standard JPEG compression. Since the accuracy of the obtained
heatmaps is highly dependent on the content of the input image, the advantage of the MS-ROI
compression is greatest when images with clear semantic regions are used. Because the model
allows for the detection of multiple salient image regions and produces soft boundaries between them,
the transitions between regions encoded at higher and lower bitrates are smooth. Further experimentation
with the MS-ROI model based on different CNN architectures is required to better understand the effect
of the model implementation on the generated saliency maps and the resulting quality of the content-
aware compression.
5. REFERENCES
[1] Dong, C., Deng, Y., Loy, C. C., Tang, X.: “Compression artifacts reduction by a deep convolutional
network”, Proceedings of IEEE International Conference on Computer Vision (ICCV) 2015 (Santiago,
Chile, 2015), pages 576–584. doi:10.1109/ICCV.2015.73.
[2] Dong, C., Loy, C. C., He, K., Tang, X.: “Image super-resolution using deep convolutional networks”,
IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2), 295-307, 2016,
doi:10.1109/TPAMI.2015.2439281.
[3] Dong, C., Loy, C. C., Tang, X.: “Accelerating the super-resolution convolutional neural network”,
Computer Vision – ECCV 2016 (Springer International Publishing, 2016), 391-407.
doi: 10.1007/978-3-319-46475-6_25.
[4] Fei-Fei, L., Fergus, R., Perona, P.: “One-Shot learning of object categories”, IEEE Transactions on
Pattern Analysis and Machine Intelligence 28(4), 594-611, 2006. doi: 10.1109/TPAMI.2006.79.
[5] Prakash, A., Moran, N., Garber, S., Dilillo, A., Storer, J.: “Semantic perceptual image compression
using deep convolution networks”, Proceedings of Data Compression Conference (DCC) 2017
(Snowbird, UT, USA, 2017), pages 250-259. doi: 10.1109/DCC.2017.56.
[6] Simonyan, K., Zisserman, A.: “Very deep convolutional networks for large-scale image recognition”,
Proceedings of International Conference on Learning Representations 2015 (ICLR, San Diego, CA,
2015), pages 1-14.
[7] Tao, W., Jiang, F., Zhang, S., Ren, J., Shi, W., Zuo, W., Guo, X., Zhao, D.: “An end-to-end compression
framework based on convolutional neural networks”, Proceedings of Data Compression Conference
(DCC) 2017 (Snowbird, UT, USA, 2017), page 463. doi: 10.1109/DCC.2017.54.
[8] Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: “LSUN: Construction of a Large-scale
Image Dataset using Deep Learning with Humans in the Loop”, 2015, URL
https://arxiv.org/abs/1506.03365 (last request: 2018-09-20).
[9] Yu, S. X., Lisin, D. A.: “Image compression based on visual saliency at individual scales”, Proceedings
of ISVC: International Symposium on Visual Computing, Advances in Visual Computing 2009 (5th
International Symposium ISVC Las Vegas, NV, USA, 2009), pages 157-166.
doi: 10.1007/978-3-642-10331-5_15.
© 2018 Authors. Published by the University of Novi Sad, Faculty of Technical Sciences, Department of
Graphic Engineering and Design. This article is an open access article distributed under the terms and
conditions of the Creative Commons Attribution license 3.0 Serbia
(http://creativecommons.org/licenses/by/3.0/rs/).