Image-Based Service Recommendation System: A JPEG-Coefficient RFs Approach
ABSTRACT Online shopping platforms are growing at an unprecedented rate all over the world. These platforms mostly rely on search engines, which are still primarily knowledge-based and use keyword matching to find similar products. However, customers want an interactive setup that is convenient and reliable for querying related products. In this paper, we propose a novel idea of searching for products in an online shopping system using an image-based approach. A user can provide, select, or click an image, and similar image-based products are presented to the user. The proposed recommendation system is based on content-based image retrieval and is composed of two major phases: Phase 1 and Phase 2. In Phase 1, the proposed approach learns the class/type of the product. In Phase 2, the proposed recommendation system retrieves closely matched similar products. For Phase 1, the proposed approach creates a model of products using Machine Learning (ML). The model is then used to find the category of the test products. From the ML perspective, we employ the Random Forests (RF) classifier, and for feature extraction, we use the JPEG coefficients. The dataset used for proof of concept includes 20 categories of products. For image-based recommendation, the proposed RF model is evaluated for Phase 1 and Phase 2. In Phase 1, the evaluation of the proposed model yields 75% accuracy. For performance enhancement, the RF model has been integrated into a Deep Learning (DL) setup, achieving 84% accurate predictions. Based on the custom evaluation approach for Phase 2, the proposed recommendation approach achieves 98% correct recommendations, thus demonstrating its efficacy for product recommendation in practical applications.
INDEX TERMS Recommendation system, services, machine learning, random forests, deep learning, SVM.
description can be replaced or augmented by the visual search for the product recommendation system. A picture of any product should clarify the user's demands for the appearance, usage, and brand of the desired products. While the computer vision and image processing fields have matured, the applications of product retrieval based on image features using artificial intelligence in the online shopping domain remain mostly unexplored and are open to new findings.

In this paper, based on user interaction and interests, we propose a novel idea to search for products efficiently in an online shopping system using image-based searching techniques. As such, a user provides or selects an image, and similar products (images) are presented to the user. The proposed recommendation system consists of two major phases: In Phase 1, the proposed approach learns the class/type of the product based on the image characteristics. In Phase 2, the proposed recommendation system retrieves closely matched similar products.

For Phase 1, the proposed approach takes advantage of ML for learning the features of the image/product and generates a learned model. The model is then used to find the category/class of query products. Once the category/type of the product is identified, in Phase 2, the Euclidean distance between JPEG feature vectors is used to retrieve the top 20 matching products from the available items in the particular class of products. These 20 items are then further processed by the proposed ''Struct-Hist'' approach for retrieving the 10 most relevant products. The Struct-Hist uses the image features and is explained in the corresponding section.

For the ML-based class learning in Phase 1, we propose the Random Forests (RF) meta-classifier due to its generalization capabilities and excellent performance reported in the state of the art. For learning the class of the products and for feature extraction from images, we use the JPEG coefficients as image features. The dataset used for proof of concept contains product images and labels from the Amazon website. This dataset includes images of 20 categories of products. For image-based recommendation, the proposed RF model is evaluated for Phase 1 and Phase 2. In Phase 1, the evaluation of the proposed model yields 75% accuracy. For performance enhancement, the RF model is further integrated into the DL setup and achieves 84% accurate predictions. Based on the custom evaluation approach for Phase 2, the proposed recommendation approach delivers 98% correct recommendations and demonstrates its efficacy for image-based recommendation systems.

From the implementation point of view of the recommendation system, a user submits or selects an image of the product, and similar products (images) are presented to the user. The proposed approach is based on this assumption. However, this assumption also covers the use-cases in which the query image is obtained from user selection, clicking, or the history of user purchases. Thus, our proposed approach is based on learning features from user-based images and recommending similar products based on these images.

The rest of the paper is organized as follows: Section 2 discusses previous related work. Section 3 discusses the ML models and JPEG features. Section 4 explains the proposed approach in the form of Phase 1 and Phase 2. Section 5 evaluates Phase 1 and Phase 2 of the image-based recommendation approach and presents the comparative analysis. Section 6 concludes the paper.

II. LITERATURE REVIEW
The authors in [1] present a novel approach for one-class collaborative filtering that is based on estimating the users' fashion-aware personalized ranking. The recommendation system combines high-level visual features extracted by a Convolutional Neural Network (CNN), past feedback, and evolving trends. In [2], the proposed approach models the human sense of the relationships between objects based on their appearances. The approach is based on the human perception of visual connections between products, which is modeled as a network inference graph problem. It is thus capable of recommending clothes and accessories together with excellent subjective performance. For image-based recommendation, the authors in [4] aim to recommend images using Tuned Perceptual Retrieval (TPR), Complementary Nearest Neighbor Consensus (CNNC), Gaussian Mixture Models (GMM), Markov Chain (MCL), and Texture Agnostic Retrieval (TAR). The authors report that the CNNC, GMM, TAR, and TPR are easy to train; however, CNNC and GMM are complex, while the TPR, GMM, and TAR do not generalize well. In [5], the authors apply the AlexNet [32] based DL to model one thousand different categories of images. The authors in [6] use a CNN model which classifies images into their relevant classes in the ImageNet Challenge. Image similarity has also been investigated in the state of the art, as in [7] and [8]. However, most of the approaches are based on category similarity, i.e., the products are similar if they are in a similar category. Yet there is a possibility that the images might belong to a different category. In such cases, machine learning helps in finding the classes of the images. Therefore, for category retrievals, one approach is to first classify images into their respective categories. Once the category of a particular image is identified, the next step recommends the images from that category. The article [9] uses the NN to calculate the similarities within the category. In [12], the authors focus on learning similarity by using a CNN for multiple products in a single image. In [13], the authors concentrate on image similarity and image semantics. The authors in [14] adopt the image descriptions of visual denotations and propose new similarity metrics for semantic inference over event descriptions. Similarly, the article [15] proposes the semantic image browser for bridging information visualization with automated intelligent image analysis. The authors in [23] address the issue of long queries and propose a context-based image recommendation system. In [24], the authors use geo-tagged images from social media for a travel recommendation system. They integrate the time, location, and
FIGURE 4. The first phase of the recommendation process. The first phase learns the features using the ML classifier to determine the category/class of the products based on the image features.
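As a rough illustration of this first phase, the sketch below trains a Random Forest on JPEG-style features. The JPEG coefficients are approximated here with a block-wise DCT computed by SciPy rather than read from the JPEG bitstream, and the image size, number of retained coefficients per block, and the names train_paths, train_labels, and query_path are illustrative placeholders, not the authors' exact settings.

```python
# Minimal Phase 1 sketch under stated assumptions: block-DCT coefficients as a
# stand-in for JPEG coefficients, learned by a Random Forest classifier.
import numpy as np
from scipy.fft import dctn
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def jpeg_like_features(path, size=(128, 128), block=8, keep=10):
    """Approximate JPEG coefficients: 2-D DCT over 8x8 blocks of the gray image,
    keeping the first `keep` coefficients of each block (a rough low-frequency subset).
    Assumes an RGB product image."""
    img = resize(rgb2gray(imread(path)), size, anti_aliasing=True)
    feats = []
    for r in range(0, size[0], block):
        for c in range(0, size[1], block):
            coeffs = dctn(img[r:r + block, c:c + block], norm="ortho")
            feats.extend(coeffs.flatten()[:keep])
    return np.asarray(feats)

# train_paths / train_labels are placeholders for the 20-category product images
X_train = np.stack([jpeg_like_features(p) for p in train_paths])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, train_labels)

# Phase 1 prediction: category of a query product image
category = clf.predict(jpeg_like_features(query_path).reshape(1, -1))[0]
```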
C. PHASE 2: IMAGE-BASED RECOMMENDATION
Figure 7 shows the flow of Phase 2 for recommending similar products that closely match the query image. For a query image, the image category is retrieved by Phase 1. The query image is then searched in the corresponding category in Phase 2, which retrieves related images based on the similarity within that category. As shown in Figure 7, the category images are loaded with the query image. The JPEG features are extracted from all the images in the particular category, and the JPEG features are extracted for the query image as well. This puts both the category images and the query image in the same vector space.

The next step is finding the similarity between the feature vector of the query image and the feature vectors of the category images. For similar-product selection and vector matching, we use the Euclidean distance from each vector of the category images to the vector of the query image. These distances represent the similarity values of the query image with all the images in the particular category. The similarity scores are then sorted in ascending order, and the top 20 are selected as the possible candidates for the recommendations, as shown in Figure 7.

For most of the recommendation systems in the state of the art, the similarity index is the last step for recommending similar products. In the proposed approach, however, we extend the similarity by introducing image-matching steps. Although vector matching retrieves similar images, users are generally concerned not only with item matching but also with similar color matching of items. For this reason, and for robust matching, we introduce further novel steps. After the 20 most similar vectors are retrieved, the images related to these vectors are re-loaded. These 20 images are then analyzed by image-based similarity matching. For image-based similarity matching, we adopt the Structural Similarity Index (SSIM) of [28] and extend it to include color histograms in the matching process. This is represented as the ''Struct-Hist'' matching process. The SSIM is modified to include the color distribution (histogram) of the images.
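A minimal sketch of this Phase 2 flow is given below, assuming pre-computed JPEG feature vectors. The Euclidean step ranks category images by distance to the query vector and keeps the top 20; the ''Struct-Hist'' re-ranking is illustrated by averaging the SSIM score with an HSV color-histogram correlation. The equal weighting is only an illustrative stand-in, not the paper's modified-SSIM formula, and query_vec, category_vecs, category_paths, and query_img are placeholders.

```python
import numpy as np
import cv2
from skimage.metrics import structural_similarity as ssim

def top_k_by_euclidean(query_vec, category_vecs, k=20):
    # Similarity values: Euclidean distance of every category vector to the query vector
    distances = np.linalg.norm(category_vecs - query_vec, axis=1)
    return np.argsort(distances)[:k]          # ascending order, closest 20 first

def struct_hist_score(query_img, cand_img):
    """Structural similarity plus HSV color-histogram correlation (illustrative weighting)."""
    cand_img = cv2.resize(cand_img, (query_img.shape[1], query_img.shape[0]))
    s = ssim(cv2.cvtColor(query_img, cv2.COLOR_BGR2GRAY),
             cv2.cvtColor(cand_img, cv2.COLOR_BGR2GRAY), data_range=255)
    hists = []
    for img in (query_img, cand_img):
        h = cv2.calcHist([cv2.cvtColor(img, cv2.COLOR_BGR2HSV)], [0, 1], None,
                         [50, 60], [0, 180, 0, 256])
        cv2.normalize(h, h)
        hists.append(h)
    c = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return 0.5 * s + 0.5 * c                  # equal weights: an assumption, not the paper's formula

# Top 20 candidates by vector distance, then top 10 by Struct-Hist re-ranking
cand_idx = top_k_by_euclidean(query_vec, category_vecs)
scores = [struct_hist_score(query_img, cv2.imread(category_paths[i])) for i in cand_idx]
top10 = [cand_idx[i] for i in np.argsort(scores)[::-1][:10]]
```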
FIGURE 10. Performance evaluation of Phase 1 based on Recall. The X-axis shows the % recall values. ''Deep learning'' represents the approach where features are extracted by JPEG and classified by the deep learning. ''Deep learning with RF'' is the approach where the features are extracted by the deep learning and then those features are learned by the random forest.

FIGURE 11. Phase 1 F-measure evaluation. The X-axis shows the % F-measure values. ''Deep learning'' represents the approach where features are extracted by JPEG and classified by the deep learning. ''Deep learning with RF'' is the approach where the features are extracted by the deep learning and then those features are learned by the random forest.
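The ''Deep learning with RF'' configuration referenced in these figures extracts features with a deep network and then learns them with a Random Forest. The sketch below shows one common way such a hybrid can be assembled; the choice of a pretrained ResNet-18 from torchvision and the preprocessing constants are assumptions for illustration, not the paper's exact network.

```python
# Hedged sketch: pretrained CNN as feature extractor, Random Forest as classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.ensemble import RandomForestClassifier

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # drop the classifier head, keep the features
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def deep_features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return backbone(x).squeeze(0).numpy()   # 512-D feature vector

# train_paths / train_labels are placeholders for the labeled product images
X = [deep_features(p) for p in train_paths]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, train_labels)
predicted_category = rf.predict([deep_features(query_path)])[0]
```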
The ''Deep learning with RF'' in Figure 9 depicts the proposed method in which the features are extracted by the DL and then classified and learned by the RF. From Figure 9, the Logistic Regression obtains a very low precision of 38.2%. The Naïve Bayes has a slightly improved precision of 43%. The SVM, with a precision of 50.4%, is better than the Logistic Regression, Naïve Bayes, and the DL (used as the Neural Network for learning the class distribution only). The RF achieves an increased precision of 74.9%, which is further enhanced to 85% by integrating the RF with the DL paradigm, as shown in Figure 9.

Figure 10 shows the Recall evaluation of Phase 1. In Figure 10, the Logistic Regression gets a very low Recall of 37%, and the Naïve Bayes has a slightly improved Recall. The SVM Recall is better than the Logistic Regression, Naïve Bayes, and the DL. The RF achieves an increased Recall of 75%, which is further enhanced to 83% by integrating the RF with the DL approach, as shown in Figure 10.

Figure 11 shows the performance evaluation of Phase 1 in terms of F-measure. In Figure 11, for the DL, features are extracted by the JPEG approach and only classified by the DL. The ''Deep learning with RF'' in Figure 11 represents the proposed approach where the features are extracted by the DL and then learned by the RF. From Figure 11, the Logistic Regression gets a very low F-measure of 37.6%. The Naïve Bayes has a slightly improved F-measure of 41%. The SVM, with an F-measure of 47%, is better than the Logistic Regression, Naïve Bayes, and the DL. The RF achieves an increased F-measure of 75%, which is further enhanced to 84% by integrating the RF with the DL approach, as shown in Figure 11.

C. ''PHASE 2'' EVALUATION
The evaluation of Phase 2 of the recommendation system is not as straightforward as that of Phase 1. This is due to several reasons. First, the recommendation is a subjective process. Secondly, there is no baseline for comparison. Third, even if the data is labeled for Phase 2 of the recommendation step, it is still labeled by humans, and thus the results are subjective.

Therefore, we adopted the Autocorrelogram vector-based Euclidean distance for the evaluation of Phase 2. We use the Autocorrelograms of [33] as feature vectors in Phase 2 to remove the biases that can be induced by the JPEG feature vectors used in the retrieval of the top 20 images. We could also use the JPEG vectors, but this would bias the results towards our approach because the training vectors are the JPEG vectors. Secondly, the Autocorrelograms have shown good performance in terms of retrievability and generalization for image-based retrieval [33]. Third, the Autocorrelogram of an image captures the spatial correlation between similar intensities in the corresponding image.

For the comparative evaluation of Phase 2, we choose the K-Nearest Neighbor (KNN) and Search-based approaches. As such, we conducted experiments involving 100 query images. These images are randomly selected from the dataset. Every image is used as the query image for Phase 2 for the retrieval of ten similar images.

For evaluation, the Autocorrelograms of the ten similar images and the query images are calculated. Then each retrieved image (vector) is compared to the query image by three approaches. These comparative approaches are:
1. Subjective similarity (by 3 users)
2. Euclidean distance-based similarity
3. Cosine similarity

For subjective similarity, the vectors are converted to their original images for visualization purposes and presented to users. Three students (users) perform the subjective evaluation. The images are retrieved for the 100 test queries using the Struct-Hist, the KNN, and the Search-based approach. These images are then arranged with the retrieved labels removed. The labels are removed so that the user has no clue about the algorithm used for retrieval, ensuring unbiased scoring. Each user has to give a score between 0 and 10 to the retrieval performance of the particular approach for the 10 retrieved images.
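To make the objective part of this evaluation concrete, the sketch below compares each retrieved image to the query with Euclidean and Cosine measures over autocorrelogram vectors. The autocorrelogram here is a deliberately simplified approximation of [33] (4-level quantization per RGB channel, axis-aligned neighbors only, wrap-around borders via np.roll); query_img and retrieved_imgs are placeholders, and this is not the authors' implementation.

```python
import numpy as np

def autocorrelogram(img, distances=(1, 3, 5)):
    """Simplified color autocorrelogram: probability that a pixel at distance d
    (along the image axes, borders wrap via np.roll) shares the same quantized color."""
    q = img // 64                                   # 4 levels per RGB channel -> 64 colors
    colors = (q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]).astype(int)
    counts = np.bincount(colors.ravel(), minlength=64).astype(float) + 1e-9
    feats = []
    for d in distances:
        same = np.zeros(64)
        for dy, dx in ((d, 0), (-d, 0), (0, d), (0, -d)):
            shifted = np.roll(np.roll(colors, dy, axis=0), dx, axis=1)
            match = colors == shifted
            same += np.bincount(colors[match], minlength=64)
        feats.append(same / (4.0 * counts))
    return np.concatenate(feats)

def euclidean_sim(a, b):
    return -np.linalg.norm(a - b)                   # less negative = more similar

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# query_img and retrieved_imgs (the ten retrieved images) are placeholders
q_vec = autocorrelogram(query_img)
scores = [(euclidean_sim(q_vec, autocorrelogram(r)), cosine_sim(q_vec, autocorrelogram(r)))
          for r in retrieved_imgs]
```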
FIGURE 12. Subjective evaluation. The X-axis shows the average similarity score of test images.

FIGURE 14. Cosine similarity. Evaluation of the recommendation phase ''Phase 2'' Struct-Hist step of the proposed approach by the Cosine similarity. The X-axis shows the average similarity score of test images.
The combination of two steps in the proposed approach always produces accurate recommendations. The Struct-Hist approach thus outperforms the others and shows its efficacy for image-based product recommendations. Figure 15 shows some examples of retrievals.

TABLE 1. Parameters and time complexity.

D. TIME COMPLEXITY
Table 1 shows the time taken by the different modules. The time is represented in seconds and is the average of several runs. The experiments are conducted on a Core i7 machine with an NVIDIA Titan GPU. Since many routines are tested in this paper, we report the time for the different modules separately. As shown in Table 1, the average time taken for calculating and learning the JPEG features by the RF is 35 seconds. The time for finding the category of the query image by the JPEG features and the RF is 0.03 seconds. The time for calculating and learning the Deep features by the RF is 1000 seconds. The time to find the category of the query image by the Deep features and the RF is 0.3 seconds. The average time taken to find the top 20 related items in ''Phase 2'' is 1 second. The time to retrieve the 10 out of 20 associated items in ''Phase 2'' is 0.003 seconds.

VI. CONCLUSION
We presented an image-based product recommendation system based on two phases. In Phase 1, the proposed approach learns the class/type of the product. In Phase 2, the proposed recommendation system retrieves closely matched similar products. For the ML phase of a product's class learning, we used the RF classifier, and for feature extraction from images, we used the JPEG coefficients as image features. In the evaluation of Phase 1, the proposed model achieves 75% accuracy. For performance enhancement, the RF model is further integrated into the DL setup and achieves 84% correct predictions. In Phase 2, we believe that the combination of the two steps in the proposed approach, finding the 20 similar items based on the Euclidean distance and the 10 most similar out of 20 by the Struct-Hist approach, produces very accurate recommendations. The Struct-Hist approach thus outperforms the other methods and shows its efficacy for image recommendation. The article contributes not only to recommendation-based systems; the algorithm presented in this article can also be used for generic computer vision problems. In the future, we aim to merge non-image information with the images and use the Recurrent Neural Network (RNN) architecture for the recommendation process.

REFERENCES
[1] I. Kanellopoulos and G. G. Wilkinson, ''Strategies and best practice for neural network image classification,'' Int. J. Remote Sens., vol. 18, no. 4, pp. 711–725, 1997.
[2] R. He and J. McAuley, ''Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering,'' in Proc. 25th Int. Conf. World Wide Web, 2016, pp. 507–517.
[3] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, ''Image-based recommendations on styles and substitutes,'' in Proc. 38th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2015, pp. 43–52.
[4] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan, ''Large scale visual recommendations from street fashion images,'' in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 1925–1934.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ''ImageNet classification with deep convolutional neural networks,'' Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[6] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), May 2015, pp. 1–14.
[7] G. Wang, D. Hoiem, and D. Forsyth, ''Learning image similarity from Flickr groups using fast kernel machines,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2177–2188, Nov. 2012.
[8] G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus, ''Learning invariance through imitation,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 2729–2736.
[9] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, ''Learning fine-grained image similarity with deep ranking,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1386–1393.
[10] A. Dosovitskiy and T. Brox, ''Generating images with perceptual similarity metrics based on deep networks,'' in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 658–666.
[11] M. Tan, S. Yuan, and Y. Su, ''Content-based similar document image retrieval using fusion of CNN features,'' Commun. Comput. Inf. Sci., vol. 819, pp. 260–270, Mar. 2018.
[12] S. Bell and K. Bala, ''Learning visual similarity for product design with convolutional neural networks,'' ACM Trans. Graph., vol. 34, no. 4, p. 98, 2015.
[13] T. Deselaers and V. Ferrari, ''Visual and semantic similarity in ImageNet,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1777–1784.
[14] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, ''From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,'' Trans. Assoc. Comput. Linguistics, vol. 2, pp. 67–78, Feb. 2014.
[15] J. Yang, J. Fan, D. Hubball, Y. Gao, H. Luo, W. Ribarsky, and M. Ward, ''Semantic image browser: Bridging information visualization with automated intelligent image analysis,'' in Proc. IEEE Symp. Vis. Analytics Sci. Technol., Oct./Nov. 2006, pp. 191–198.
[16] C. Cortes and V. Vapnik, ''Support-vector networks,'' Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[17] L. Breiman, ''Random forests,'' Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[18] L. J. Li, H. Su, E. P. Xing, and L. Fei-Fei, ''Object bank: A high-level image representation for scene classification & semantic feature sparsification,'' in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1–9.
[19] S. McCann and D. G. Lowe, ''Local Naive Bayes nearest neighbor for image classification,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3650–3656.
[20] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. A. Benediktsson, ''SVM- and MRF-based method for accurate classification of hyperspectral images,'' IEEE Geosci. Remote Sens. Lett., vol. 7, no. 4, pp. 736–740, Oct. 2010.
[21] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, ''Deep learning-based classification of hyperspectral data,'' IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.
[22] Nielsen. (2015). Green Generation: Millennials Say Sustainability is a Shopping Priority. The Nielsen Global Survey of Corporate Social Responsibility and Sustainability. Accessed: May 3, 2019. [Online]. Available: https://www.nielsen.com/us/en/insights/article/2015/green-generation-millennials-say-sustainability-is-a-shopping-priority/
[23] L. Liu, ''Contextual topic model based image recommendation system,'' in Proc. IEEE/WIC/ACM Int. Jt. Conf. Web Intell. Intell. Agent Technol. (WI-IAT), Dec. 2015, pp. 239–240.
[24] I. Memon, L. Chen, A. Majid, M. Lv, I. Hussain, and G. Chen, ''Travel recommendation using geo-tagged photos in social media for tourist,'' Wireless Pers. Commun., vol. 80, no. 4, pp. 1347–1362, 2015.
[25] Y. Sun, H. Fan, M. Bakillah, and A. Zipf, ''Road-based travel recommendation using geo-tagged images,'' Comput. Environ. Urban Syst., vol. 53, pp. 110–122, Sep. 2015.
[26] L. Yu, F. Han, S. Huang, and Y. Luo, ''A content-based goods image recommendation system,'' Multimed. Tools Appl., vol. 77, no. 4, pp. 4155–4169, 2018.
[27] S. C. Guntuku, S. Roy, and L. Weisi, ''Personality modeling based image recommendation,'' in Proc. Int. Conf. Multimedia Modeling, 2015, pp. 171–182.
[28] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ''Image quality assessment: From error visibility to structural similarity,'' IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[29] R. Khan, A. Hanbury, J. Stöttinger, and A. Bais, ''Color based skin classification,'' Pattern Recognit. Lett., vol. 33, pp. 157–163, Jan. 2012.
[30] R. Khan, A. Hanbury, and J. Stoettinger, ''Skin detection: A random forest approach,'' in Proc. Int. Conf. Image Process. (ICIP), Sep. 2010, pp. 4613–4616.
[31] B. Xu, Y. Ye, and L. Nie, ''An improved random forest classifier for image classification,'' in Proc. IEEE Int. Conf. Inf. Automat., Jun. 2012, pp. 795–800.
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ''ImageNet classification with deep convolutional neural networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2012.
[33] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih, ''Image indexing using color correlograms,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1997, pp. 762–768.
[34] Amazon Dataset. Accessed: Sep. 3, 2019. [Online]. Available: https://drive.google.com/drive/folders/1eaCPVw-j9ORkVHA4JY-n0QFXkHmilukm

BOFENG ZHANG received the B.S., M.S., and Ph.D. degrees from the Northwestern Polytechnical University, Xi'an, China, in 1991, 1994, and 1997, respectively. From 1997 to 1999, he worked at Zhejiang University as a Postdoctoral Researcher and was promoted to an Associate Professor. Since 1999, he has been on the faculty of the School of Computer Engineering and Science, Shanghai University. From August 2006 to August 2007, he worked as a Visiting Professor at the University of Aizu, Japan. From September 2013 to September 2014, he worked as a Visiting Professor at Purdue University Calumet, USA. He is currently the Vice Dean of the School of Computer Engineering and Science, Shanghai University. He has published three books and about 150 research articles in national and international journals and major conference proceedings. His current research interests include intelligent information processing, data science and technology, and intelligent human–computer interaction. He serves as a steering committee member, a Workshop Chair, and a program committee member of several important international conferences.