Style4Rec: Enhancing Transformer-based E-commerce Recommendation Systems with Style and Shopping Cart Information

Berke Ugurlu¹, Ming-Yi Hong¹, Che Lin¹
¹National Taiwan University, Taipei, Taiwan

Abstract

Understanding users’ product preferences is essential to the efficacy of a recommendation system. Precision marketing leverages users’ historical data to discern these preferences and recommends products that align with them. However, recent browsing and purchase records might better reflect current purchasing inclinations. Transformer-based recommendation systems have made strides in sequential recommendation tasks, but they often fall short in utilizing product image style information and shopping cart data effectively. In light of this, we propose Style4Rec, a transformer-based e-commerce recommendation system that harnesses style and shopping cart information to enhance existing transformer-based sequential product recommendation systems. We tested our model on an e-commerce dataset from our partnering company, where it exceeded established transformer-based sequential recommendation benchmarks across various evaluation metrics: HR@5 increased from 0.681 to 0.735, NDCG@5 increased from 0.594 to 0.674, and MRR@5 increased from 0.559 to 0.654. Style4Rec thus represents a significant step forward in personalized e-commerce recommendation systems.

Introduction

Sequential product recommendation is a process that involves capturing information from user sessions to predict the next item they are likely to purchase. By analyzing the history of user interactions, the system aims to make accurate predictions. However, due to the intricate relationships between products, effectively capturing information from users’ past interactions poses a significant challenge.

Different approaches have been explored in the realm of sequential product recommendation to capture the dependencies between products. One simple model is Markov Chains (MCs) (Rendle, Freudenthaler, and Schmidt-Thieme 2010), which predict the subsequent product based on the previous product or a few preceding products. However, MCs do not effectively utilize long-range dependencies and focus primarily on the last few products, attempting to capture short-range product dependencies. On the other hand, Recurrent Neural Networks (RNNs) (Hidasi et al. 2016) have been employed to capture both long- and short-term dependencies between products. By utilizing a hidden state, RNNs are able to predict the subsequent product. GRU4Rec (Hidasi et al. 2016) and its improved versions have demonstrated success in the task of next-item prediction. GRU4Rec takes both the hidden state and the current product vector as input, leveraging their combination for accurate predictions. A different approach is adopted in CNN-based sequential product recommendation models (Tang and Wang 2018). Here, the embeddings of previous products are treated as an image, and convolutional operations are applied to extract relevant information for predicting the subsequent product. This approach utilizes the structural properties of CNNs to capture dependencies between products and make effective predictions.

Transformer-based sequential product recommendation models have emerged as promising alternatives (Sun et al. 2019; Kang and McAuley 2018; Wu et al. 2020) in predicting subsequent items in user sessions, showcasing state-of-the-art performance. Unlike previous algorithms, transformer-based models leverage self-attention mechanisms, which are highly efficient in training and excel at extracting patterns within user sessions. The self-attention mechanism allows the model to weigh the importance of different elements in the input sequence, capturing dependencies and relationships effectively. As a result, transformer-based models have proven to be powerful tools for sequential product recommendations, outperforming traditional approaches in terms of accuracy and efficiency.

Existing transformer-based sequential models (Zhang et al. 2018; Lin, Pan, and Ming 2020; Sun et al. 2019; Kang and McAuley 2018; Wu et al. 2020) utilize only the purchase information for predicting the next item in user sessions. Product images undoubtedly play a major role in users’ preferences. However, existing transformer-based sequential product recommendation models have no methodology for incorporating the style information of the product images. Utilizing style information can greatly improve performance in sequential recommendation tasks and allows us to evaluate user preferences more thoroughly.

In our research, we have developed a multi-layer transformer-based sequential product recommendation system that leverages style information and shopping cart data to enhance existing state-of-the-art recommendation systems. By grouping users based on their individual sessions, we could recommend multiple products for each user. The performance of our system surpassed that of previous recommendation models designed for sequential product recommendation tasks, including BERT4Rec (Sun et al. 2019) and SASRec (Kang and McAuley 2018), both of which are transformer-based. Our results demonstrate the effectiveness of our approach in providing personalized and accurate product recommendations.

In our sequential product recommendation model, we utilized a dataset provided by our partnering company, encompassing user history spanning 1.5 years. The dataset included purchase data, shopping cart data, and product images. Recognizing the crucial role of product images in influencing user preferences, we employed the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015) to extract style information from the product images. This style information was then used to create style embeddings, enhancing our recommendation system’s performance. Our decision to extract style information was driven by the fact that the product category was already known, and each product image represented a single item. Consequently, directly using the layer activations of an object recognition network would not provide us with significant insights, as such layers focus on capturing category information that is already known. Instead, we leveraged the correlation of feature maps in those layers, using the VGG-19 object recognition network (Simonyan and Zisserman 2015), to extract the style information, as described in the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015). This modified approach enabled us to create style embeddings specifically tailored for our multi-layer transformer-based sequential recommendation network.

Furthermore, we incorporated the shopping cart data of the users into our recommendation system. Recognizing that products added to the shopping cart reflect user interest, even if they were not ultimately purchased, we developed a training strategy to account for the distinction between purchase and shopping cart products. The specific utilization of shopping cart data and the extraction of style embeddings within our sequential recommendation system are detailed in the methodology section.

Existing sequential product recommendation systems, such as BERT4Rec (Sun et al. 2019), SASRec (Kang and McAuley 2018), and SSE-PT (Wu et al. 2020), have demonstrated remarkable success in predicting subsequent items for user sessions. However, these models rely solely on purchase data as their input for the transformer network. This limitation prevents them from effectively utilizing valuable information present in the product images and shopping cart data available in the e-commerce dataset provided by our partnering company.

To address this gap, we propose two methods to enhance the performance of sequential product recommendation systems. Firstly, we leverage the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015) to extract style information from product images, which is then utilized as style embeddings in our model. This allows us to incorporate important visual cues into the recommendation process. Secondly, we adopt a strategy where shopping cart data is employed exclusively during the training and validation phases and excluded during testing. This approach provides a more accurate evaluation of real-world performance.

Through extensive experiments conducted on our dataset, our proposed model has surpassed the existing state-of-the-art baselines, showcasing its effectiveness in improving sequential product recommendations. These findings validate the significance of incorporating style information from product images and leveraging shopping cart data to enhance the performance of recommendation systems in the e-commerce domain.

The main contributions of our research are as follows:
• We have designed and implemented a novel transformer-based model for sequential product recommendation. This model incorporates separate components for obtaining the product vector of user history and the learnable product vector.
• We proposed a style extraction module that effectively obtains style embeddings from product images using the neural style transfer algorithm. This algorithm was modified to ensure its compatibility with transformer-based sequential product recommendation networks.
• We developed a method to differentiate between purchase and shopping cart sessions. Specifically, we employed shopping cart sessions exclusively during the training and validation phases while excluding them during testing. This approach allowed us to effectively capture the distinction between these types of sessions and incorporate it into our recommendation model.

These contributions highlight the unique features and advancements of our transformer-based sequential product recommendation network, emphasizing its ability to effectively combine different types of embeddings for improved recommendation performance.

Related Work

AttRec (Zhang et al. 2018), SASRec (Kang and McAuley 2018), SSE-PT (Wu et al. 2020), FISSA (Lin, Pan, and Ming 2020), and BERT4Rec (Sun et al. 2019) are among the notable transformer-based models used for sequential product recommendation tasks. These models employ multi-layer transformer blocks to capture item-item relations within user sessions.

AttRec (Zhang et al. 2018) leverages the self-attention mechanism to capture both long-term and short-term interactions between items in user sessions. It considers the temporal dynamics of the interactions separately. SASRec (Kang and McAuley 2018) utilizes multiple transformer blocks that facilitate left-to-right item-item interactions. It truncates user sessions and performs separate predictions for each truncated session, allowing the model to capture sequential patterns effectively. SSE-PT (Wu et al. 2020) extends SASRec by incorporating personalized user embeddings and employs the stochastic shared embedding (SSE) regularization technique to mitigate overfitting. FISSA (Lin, Pan, and Ming 2020) introduces a global representation learning module, a local representation learning module, and a gating module to balance the impact of global and local representations using a multi-layer perceptron (MLP) layer. BERT4Rec (Sun et al. 2019) adopts a bidirectional training framework using the Cloze task. It employs input masking during training, arguing that unidirectional transformer architectures may limit the true potential of transformer-based recommendation systems.

The aforementioned transformer-based sequential recommendation models have demonstrated remarkable effectiveness in sequential product recommendation tasks. However, none of these models has addressed the utilization of product images and shopping cart data to enhance performance. It is intuitive that product images play a significant role in shaping users’ preferences, while shopping cart products offer valuable insights as they represent items of interest to users.

In our transformer-based sequential recommendation model, we addressed the incorporation of product images and shopping cart data by introducing multiple modules. One of the modules involved obtaining the product vector of historical behavior, which captured the user’s interactions with products over time. This vector represented the user’s preferences and interests based on their past choices. We used the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015) to extract style information and build the product image embeddings. This allowed us to incorporate visual style information into the recommendation process, enhancing the model’s ability to capture visual product attributes.

In summary, our transformer-based sequential recommendation model incorporated product images and shopping cart data by introducing multiple modules. This included obtaining the product vector of historical behavior, utilizing learnable product vectors, and incorporating style embeddings extracted through the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015). The neural style transfer algorithm has not been utilized in existing transformer-based sequential product recommendation systems. Our work showcases the first model incorporating style information into transformer-based sequential product recommendation models. Regarding the temporal aspect, we did not consider time-aware sequential product recommendation models due to the lack of precise time information for purchased products. We leveraged the order of the products in constructing user sessions to capture sequential patterns.

Problem Statement

Each user session can be considered as sequential data, and sequential product recommendation systems try to predict the next item that the user might buy in a session. Given a set of users $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ and a set of items $I = \{i_1, i_2, i_3, \ldots, i_{|I|}\}$, we can construct user sessions as $S_u = [i_1, \ldots, i_t, \ldots, i_{n_u}]$ in chronological order. The length of the session is $n_u$, and $i_t$ is the product that the user interacted with at time $t$. The purpose of sequential product recommendation is to predict which item the user will interact with at time $t+1$, given the user’s history $S_u$. For the final prediction, the products on the product list $I$ are sorted with respect to their relevance scores.

Methodology

Our multi-layer transformer-based sequential recommendation network consists of two parts: the first part utilizes the deep transformer encoder, and the second part utilizes a modified version of the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015). The second part was used for extracting the style information from the product images, and the style information was used for further performance increases.
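
To make the problem statement concrete, the following is a minimal sketch of how sessions could be assembled and how a ranked recommendation list follows from relevance scores. The function names and the `order` key are illustrative assumptions, not part of the Style4Rec implementation; the paper only states that product order (rather than precise timestamps) was used to build sessions.

```python
from collections import defaultdict


def build_sessions(interactions):
    """Group (user, item, order) records into per-user sessions.

    `order` is a hypothetical sortable key (for example a sequence index)
    that gives the chronological order of a user's interactions.
    """
    events_by_user = defaultdict(list)
    for user, item, order in interactions:
        events_by_user[user].append((order, item))
    return {
        user: [item for _, item in sorted(events, key=lambda e: e[0])]
        for user, events in events_by_user.items()
    }


def split_history_and_target(session):
    """Return (history, next item): the model sees the history and predicts the target."""
    return session[:-1], session[-1]


def rank_items(relevance_scores):
    """Sort item ids by descending relevance score (dict: item id -> score)."""
    return sorted(relevance_scores, key=relevance_scores.get, reverse=True)
```

Here `rank_items` stands in for the final sorting step over the product list; how the relevance scores themselves are produced is the subject of the sections that follow.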

Deep Transformer Encoder

The deep transformer encoder network consists of multiple transformer-encoder blocks stacked on top of each other. The hidden representation of the input was calculated for each transformer-encoder block and fed into the next transformer-encoder block. We utilized the multi-head self-attention mechanism and a point-wise feed-forward network for constructing the transformer-encoder blocks (Vaswani et al. 2017; Sun et al. 2019):

$$\mathrm{MH}(H^l) = [\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_h]\, W^O \quad (1)$$
$$\mathrm{head}_i = \mathrm{Attention}(H^l W_i^Q,\; H^l W_i^K,\; H^l W_i^V) \quad (2)$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \quad (3)$$

Formulas (1), (2), and (3) show the details of the multi-head attention mechanism. $H^l$ represents the stacked hidden representation for a given product sequence. $H^l$ is projected into different subspaces by using the query, key, and value matrices $W^Q$, $W^K$, and $W^V$. The subscript $i$ represents the multi-head index. The heads are concatenated and projected by the $W^O$ matrix. $W^Q$, $W^K$, $W^V$, and $W^O$ are learnable projection matrices (Vaswani et al. 2017; Kang and McAuley 2018; Sun et al. 2019; Li et al. 2018). We used scaled dot-product attention (Vaswani et al. 2017; Hinton, Vinyals, and Dean 2015), where $d_k$ is the dimension of each head's key vectors. The query $Q$, the key $K$, and the value $V$ are projected from the stacked hidden representation $H^l$. Then, we applied a point-wise feed-forward network to the result of the multi-head self-attention in order to introduce non-linearity and interactions between different dimensions. We also utilized residual connections, layer normalization, and dropout to avoid overfitting the data (Ba, Kiros, and Hinton 2016; He et al. 2016; Srivastava et al. 2014).

As shown in Figure 3, to calculate the binary cross-entropy loss, we first extracted the product vector of historical behavior for each session. We then compared this vector with both the learnable product embedding of the ground truth product and that of the negatively sampled product. Cosine similarity was utilized to compute the relevance score of these vectors, which was then converted into probabilities using the softmax function. These probabilities were subsequently used to calculate the binary cross-entropy loss.

The fundamental rationale behind this model architecture is to increase the cosine similarity between the product vector of historical behavior and the ground truth vector. Simultaneously, we aim to decrease the relevance, in terms of cosine similarity, between the product vector of historical behavior and the vector of the negatively sampled product. This dual approach optimizes the model’s ability to discern relevant and irrelevant products based on historical behavior.

Embedding Extraction Module

The embedding extraction module in Figure 3 consists of two parts:
1. Sinusoidal Positional Embeddings
2. Learnable Product Embeddings

We employed a combination of sinusoidal positional embeddings and learnable product embeddings. Our goal was twofold: firstly, to encode information about the relative positions of products within sessions, and secondly, to generate unique learnable embeddings for each product, enabling their differentiation and comparison.

Sinusoidal positional embeddings were utilized to encode the products’ relative positions within sessions, while learnable product embeddings facilitated product comparison, thus informing product recommendations. This approach integrated both positional information and product-specific embeddings simultaneously.

$$\mathrm{PE}(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right) \quad (4)$$
$$\mathrm{PE}(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right) \quad (5)$$

The formulas for sinusoidal positional embeddings are given in (4) and (5). $d_{model}$ represents the total dimension of the input features (Vaswani et al. 2017; Kang and McAuley 2018; Sun et al. 2019). For the learnable product embeddings, which are implemented as a linear layer without the bias term, the embedding parameters are initialized from $N(0,1)$. We compared the learnable product embeddings with the product vector of historical behavior, as shown in Figure 3, to make the final prediction.

The use of learnable product embeddings was pivotal in constructing a scalable model, allowing us to compare these embeddings directly with our model’s output. This implies that even if a new, unknown product is added to a session, we can still compare our model’s output with the existing learnable product vectors to generate a final product recommendation. Consequently, our model exhibits the flexibility to accommodate new and unknown products, enhancing its adaptability and utility in dynamic environments.

All of the input embeddings were concatenated. The dimension of the learnable product embeddings is 128, and the $d_{model}$ variable in the sinusoidal positional embeddings is 128.
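
As an illustration of the embedding extraction module, the NumPy sketch below builds sinusoidal positional embeddings following formulas (4) and (5) and concatenates them with product embeddings drawn from N(0, 1). It is a sketch under stated assumptions rather than the authors' code: the function names, the reserved padding row, and the use of a plain lookup table (equivalent to a bias-free linear layer applied to one-hot product ids) are choices made here for illustration, and style embeddings are left out.

```python
import numpy as np

D_MODEL = 128      # d_model of the sinusoidal positional embeddings (from the text)
PRODUCT_DIM = 128  # dimension of the learnable product embeddings (from the text)


def sinusoidal_positional_embeddings(max_len, d_model=D_MODEL):
    """Formulas (4) and (5): sine on even dimensions, cosine on odd dimensions."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(max_len)[:, None]       # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def init_product_embeddings(num_products, dim=PRODUCT_DIM, seed=0):
    """Embedding table drawn from N(0, 1); row 0 is reserved for padding (assumption)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0, size=(num_products + 1, dim))


def embed_session(product_ids, product_table):
    """Concatenate product and positional embeddings for each session position."""
    product_part = product_table[np.asarray(product_ids)]
    position_part = sinusoidal_positional_embeddings(len(product_ids))
    return np.concatenate([product_part, position_part], axis=-1)
```

With 128 dimensions for each part, every session position is represented by a 256-dimensional vector before any style embedding is appended. In a trainable model the product table would be a parameter updated by gradient descent; here it is only initialized.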

Style Embeddings

For extracting the style information, we utilized the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015). In the neural style transfer algorithm, the feature maps of a generic object recognition network are used for capturing the style information. The filter responses of each layer of a convolutional neural network are used to obtain the gram matrices, which are used for calculating the style loss (Gatys, Ecker, and Bethge 2015). There are two approaches to transferring style information with the neural style transfer algorithm.

Two images are used in the first method of the neural style transfer algorithm, as shown in Figure 1. These images are the content image and the style image. The main idea is to update the content image by using the style image so that the style of the content image becomes similar to the style of the style image. During the training of the neural style transfer algorithm, the style and content images are used for calculating the style loss and the content loss, respectively (Gatys, Ecker, and Bethge 2015). The gram matrices for both the content and style images are calculated. The neural style transfer algorithm tries to make both gram matrices similar to each other to transfer the style information between the style image and the content image.

Figure 1: The neural style transfer algorithm updates the content image with the help of gram matrices. Depending on which layers’ gram matrices are used and on the loss weighting, different final results can be obtained.

In the second method of the neural style transfer algorithm (Gatys, Ecker, and Bethge 2015), the style image and an image consisting of Gaussian noise are used, as shown in Figure 2. The main idea is to update the noisy input image so that the content and style of the noisy input image become similar to those of the style image. The gram matrices for the style and noisy input images are calculated to transfer the style information. The feature maps of the noisy input image and the style image are directly subtracted from each other to transfer the content information. In the experiment shown in Figure 2, we transferred only the style information by setting the content loss to zero to visualize style transfer more clearly.

For the style loss, the gram matrices are calculated. Note that both methods utilize the gram matrices since they capture crucial style information.

$$L_{style}^{k} = \frac{1}{4 N_k^2 M_k^2} \sum_{i,j} \left(G^{input,k}_{ij} - G^{style,k}_{ij}\right)^2 \quad (6)$$

In formula (6), $G^{input}$ represents the gram matrix of the noise image, $G^{style}$ represents the gram matrix of the style image, $k$ represents the layer number, $N_k$ represents the number of feature maps at layer $k$, and $M_k$ represents the flattened dimension of the feature maps. The gram matrices in formula (6) represent the correlation between the feature maps of a specific layer of an object recognition network. The gram matrices are calculated by using the dot products of each pair of feature maps in a particular convolutional layer. The gram matrices are obtained separately for multiple network layers, and the respective style losses are calculated. The linear combination of the style losses from different layers of the network is used as the final style loss of the algorithm (Gatys, Ecker, and Bethge 2015). Decreasing the style loss means that the style information is transferred between the two images.

Figure 2: Second method of the neural style transfer algorithm. The algorithm updates the noisy input image with the help of gram matrices.
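
The gram matrix and style loss described above can be sketched in a few lines of NumPy. This is a minimal illustration assuming the standard Gatys-style normalization from formula (6); the function names are hypothetical, and in practice the feature maps would come from the convolutional layers of a pretrained network rather than from arrays built by hand.

```python
import numpy as np


def gram_matrix(feature_maps):
    """Dot products between every pair of flattened feature maps.

    feature_maps has shape (N_k, H, W); the result has shape (N_k, N_k),
    so it depends only on the number of feature maps, not on H or W.
    """
    n_k = feature_maps.shape[0]
    flat = feature_maps.reshape(n_k, -1)  # shape (N_k, M_k)
    return flat @ flat.T


def layer_style_loss(input_maps, style_maps):
    """Style loss of one layer, as in formula (6)."""
    n_k = input_maps.shape[0]
    m_k = input_maps[0].size              # flattened size of one feature map
    diff = gram_matrix(input_maps) - gram_matrix(style_maps)
    return float(np.sum(diff ** 2) / (4.0 * n_k ** 2 * m_k ** 2))


def total_style_loss(input_layers, style_layers, weights):
    """Linear combination of the per-layer style losses."""
    return sum(
        w * layer_style_loss(a, b)
        for w, a, b in zip(weights, input_layers, style_layers)
    )
```

For a layer with 64 feature maps, `gram_matrix` returns a 64 x 64 matrix regardless of the spatial size of the maps.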

The neural style transfer algorithm can be applied using any object recognition network that utilizes convolutional layers. We picked the VGG-19 network (Simonyan and Zisserman 2015) for extracting the style embeddings from the gram matrices. In our case, we used the gram matrices of the first two layers of VGG-19 as the style embeddings of the product images. Since there are 64 feature maps in the first and second layers of VGG-19, the dimension of each gram matrix is 64 x 64, and the total dimension of the style embeddings becomes 2 x 64 x 64. Note that the dimension of the gram matrices is independent of the dimension of the feature maps. It depends only on the number of feature maps in a particular convolutional layer. To keep the training time of the multi-layer transformer encoder, which scales as $O(n^2 d)$, reasonable, we applied max-pooling on the gram matrices, so the total dimension of the style embeddings becomes 2 x 16 x 16. Evaluating the effect of gram matrices of other layers is left as future work.

Shopping Cart Data

In our dataset, we classify sessions into two types: purchase sessions and shopping cart sessions. Shopping cart sessions, which feature products added to the shopping cart but not ultimately purchased, provide a unique opportunity to enhance the performance of sequential product recommendations. Recognizing the user interest that these items represent, we have devised a strategy to differentiate between these two types of sessions. While both purchase sessions and shopping cart sessions are utilized in the training and validation stages, only purchase sessions are included in the testing phase. This approach allows us to more accurately reflect real-world performance, as the primary objective of sequential product recommendation is to predict items users are likely to purchase. Thus, we achieve a realistic gauge of the model's effectiveness by evaluating it solely with purchase sessions.

Experiments

Dataset Description

Our partnering company provided the dataset we used. The data comes from an e-commerce website that sells household goods. The dataset contains pageview, purchase, and shopping cart data of users. We discarded the sessions that consist of only pageviews. In the dataset, there are 490817 interactions (including pageviews, purchase, and shopping cart information) on 38117 user sessions, which can be seen in Table 1. Both purchase and shopping cart sessions contain the same set of products. Shopping cart sessions are generally longer than purchase sessions: the average length of purchase sessions is 11.24, and the average length of shopping cart sessions is 14.57.

Table 1: Statistics of datasets
Dataset         #Sessions   #Products   Avg. Length   #Actions
Purchase        19463       2991        11.24         218913
Shopping cart   18654       2991        14.57         271904

Preprocessing

We separated purchase and shopping cart sessions. We removed the overlapping sessions, which contain both purchase and shopping cart products, to see the effect of adding shopping cart data more clearly. We set the maximum length of the sessions to 20, and we added padding (0) if the sessions were shorter than 20 products. If the sessions were longer than 20 products, we used the last 20 products. We removed the repeated final products, as seen in Figure 4. We made that change because allowing our model to learn the repeated patterns could decrease the real-life performance significantly. We wanted our model to learn more complex patterns for better generalization in real-life conditions.
Figure 3: Model Architecture
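
The preprocessing steps can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the paper does not say on which side padding is added, so left-padding is an assumption here, and "removing the repeated final products" is interpreted as collapsing consecutive duplicates at the end of a session. The function names are hypothetical.

```python
MAX_LEN = 20  # maximum session length (from the text)
PAD_ID = 0    # padding value (from the text)


def drop_repeated_final_products(session):
    """Collapse trailing repeats so the final product appears only once at the end.

    Assumed interpretation of the repeated-final-product removal.
    """
    trimmed = list(session)
    while len(trimmed) >= 2 and trimmed[-1] == trimmed[-2]:
        trimmed.pop()
    return trimmed


def pad_or_truncate(session, max_len=MAX_LEN, pad_id=PAD_ID):
    """Keep the last max_len products; left-pad shorter sessions (padding side assumed)."""
    recent = list(session)[-max_len:]
    return [pad_id] * (max_len - len(recent)) + recent


def preprocess(session):
    """Apply both steps to a single session of product ids."""
    return pad_or_truncate(drop_repeated_final_products(session))
```

For example, `drop_repeated_final_products([3, 5, 7, 7, 7])` yields `[3, 5, 7]`, and a 25-product session is cut down to its last 20 products.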

Training Procedures

We split both the purchase data and the shopping cart data with respect to time. We used the first 14 months of data for training, the next 2 months for validation, and the last 2 months for testing. During training, validation, and testing, we predicted the last item in each user session by utilizing the previous items. We used shopping cart sessions only in training and validation but not in testing. We used purchase sessions in training, validation, and testing. We tuned the hidden dimension of the transformer encoder within the range of [8, 16, 32, 64, 128, 256] and the L2 regularization penalty within the range of [0.1, 0.001, 0.0001, 0.00001]. We set the number of transformer blocks and the number of heads to 2, for fair comparison with the benchmarks (BERT4Rec, SASRec).

For the benchmarks (BERT4Rec, SASRec), we tuned the hyperparameters according to the descriptions in the respective papers, or we used the recommended parameters. All of the models are reported under their best hyperparameter settings.

Figure 4: An example of how the repeated final products were removed. Each number represents a product.
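
The time-based split and the hyperparameter grid from the Training Procedures section can be sketched as below. The session record layout (`month` as a zero-based index over the 18 months of data, `kind` as `"purchase"` or `"cart"`) is a hypothetical representation chosen for illustration; only the 14/2/2-month boundaries, the exclusion of shopping cart sessions from testing, and the grid values come from the text.

```python
from itertools import product

HIDDEN_DIMS = [8, 16, 32, 64, 128, 256]
L2_PENALTIES = [0.1, 0.001, 0.0001, 0.00001]
# Every (hidden dimension, L2 penalty) pair in the search space.
GRID = list(product(HIDDEN_DIMS, L2_PENALTIES))


def assign_split(month_index):
    """Months 0-13 are training, 14-15 validation, 16-17 testing."""
    if month_index < 14:
        return "train"
    if month_index < 16:
        return "validation"
    return "test"


def split_sessions(sessions):
    """Split sessions by time; shopping cart sessions never enter the test set."""
    splits = {"train": [], "validation": [], "test": []}
    for session in sessions:
        split = assign_split(session["month"])
        if split == "test" and session["kind"] != "purchase":
            continue
        splits[split].append(session)
    return splits
```

Dropping cart sessions from the test split mirrors the stated goal of evaluating only on what users actually purchased.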
