Efficient Conversion Prediction
Abstract—Unsupervised machine learning has become a ubiquitous method in e-commerce solutions that strive to provide personalized recommendations for their users. Most of those solutions embrace collaborative filtering (CF) to predict conversions, which are the beneficial user events, such as a purchase. Traditionally, the predictions were made based on rating data. However, e-commerce users seldom leave ratings. Instead, we must rely on user events, such as viewing an item or adding it to the cart. The event-based approach seems counter-intuitive, because the operation time of recommender systems increases exponentially with the number of data points. One of the main contributions of this paper is the UX value function. It reduces all events between an item and a user to a single user experience number, which also depends on the sequentiality of the events. We present a method to calculate this number in linear time. Then we use a deep neural network to predict the likelihood of conversions based on this number, to prove the practical solvability of the problem in a scalable manner, with a relatively fast learning speed and good prediction accuracy. We have conducted an extensive experimental analysis on Kechinov's 'eCommerce Events History in Cosmetics Shop' dataset, containing 8,738,120 user events. The results of those experiments prove the efficiency and applicability of the developed approach.

Index Terms—machine learning, e-commerce, conversion, prediction, recommender, collaborative filtering

I. INTRODUCTION

In recent years a new research topic has emerged: the use of unsupervised machine learning to create recommendations and improve the user experience (UX) [1]. Quantifying UX is a difficult task, and often leads to inaccurate results [2]–[4]. On the other hand, it is possible to accurately record the user events which happened during the entire interaction between the user and the application. Examples of such events include viewing an item (a product selected by the user), adding or removing the product from the cart, or purchasing an item.

Many researchers considered collaborative filtering (CF) to be the best approach for recommender systems [5]–[7]. CF represents a collection of algorithms that filter items a user might like based on reactions from other users. Traditionally, CF algorithms use the existing ratings to predict the preferences of the users [8]. However, e-commerce users seldom leave ratings, yet they require shopping recommendations tailored to their needs and expectations. To address this, we present a methodology that relies on user events instead of ratings. Our event-based approach has led to a significant increase in the number of data points, which seems counter-intuitive because the operation time of recommender engines increases exponentially with the number of data points [9]. Our main contribution is a solution reducing all events between an item and a user (the event chain) to a single UX value in the [0, 1) range. This number also depends on the sequentiality of the events. After determining the UX value for all event chains, we store it in a sparse matrix. This matrix can be generated from any event dataset in linear time. Using a random split of the sparse matrix, we trained a deep neural network to predict the unknown (null) values. We have implemented a machine learning model featuring the Adam optimizer and dropout regularization.

The machine learning model was introduced to prove the practical solvability of the problem in a scalable manner, with a relatively fast learning speed and good prediction accuracy. To do so, we applied this model to Kechinov's 'eCommerce Events History in Cosmetics Shop' dataset [10] of over eight million events.

We have structured the remainder of this paper as follows: after an overview of related works in Section II, we present the developed approach in Section III, followed by Section IV, which documents the numerical evaluation of the developed approach, applied to a real-world dataset. We conclude the paper with Section V.

II. RELATED WORKS

Recently, numerous researchers adopted user events to create better experiences, usually in the form of recommendations. To this end, we mention the work of Vijayakumar et al. [11], which used a heat map of visited travel locations to create travel recommendations for tourists. Next, Szabo et al. used machine learning and behavioral analysis for user-tailored viewer experience [12]. Deng et al. proposed a unified framework of representation learning and matching function learning to pair users with items, which can also be applied to event data [13].
CF is computationally intensive, especially on large datasets. Several solutions were suggested to improve CF. The typicality-based CF determines the neighbors from user groups based on their typicality degree [5]. Another example is the demographic content-based collaborative recommendation system framework. This is a three-step process, which starts with K-means clustering based on the user's demographic information. Then, it predicts the rating using a hybrid of Pearson correlation similarity and Cosine similarity. Finally, it gives recommendations using content-based CF [6]. For real-world use cases, a weighted average of the typicality-based CF and demographic-based CF can be used to produce the best recommendation result for the user [5]. In 2019, a critical analysis of 131 CF-related articles from 36 journals concluded that recommendation systems require further research [14].

Traditional CF suffers from the 'cold-start problem,' meaning that new items do not get recommended until someone reviews them [15]. For rating-based recommenders, the cold-start problem can mean weeks or, in certain cases, months of delay. Recently this has been identified as a research topic worth exploring, and we find several suggestions on how to alleviate these issues [16]–[18]. Castillejo et al. suggest using data from the users' social network in a CF recommender system [19] to tackle the cold-start problem. Contrary to those studies, the approach developed in this paper addresses the cold-start problem by relying on user events. User events are logged as soon as the items are added to an e-commerce platform, or a new user visits it for the first time; therefore, the cold-start problem has almost zero impact on UX.

Compared to the aforementioned studies, the methodology documented in this paper distinguishes itself by tackling scalability via a new data reduction approach. Overall, it is a faster and more accurate prediction methodology, as demonstrated in Section IV.

III. DEVELOPED APPROACH

The main goal of the developed approach is to predict the probability of a conversion type event happening for an item unknown to the user. To this end, a conversion is defined as an event directly beneficial for both the user and the platform. A purchase event is the most common type of conversion, and the only conversion event discussed in this paper. But other conversion events are also possible, such as signing up for a newsletter or depositing money in a digital wallet.

We summarise the developed approach as the subsequent execution of the following main steps:
1) Calculate the likelihood of conversion for all event types.
2) Compute the User Experience (UX) value for all user-item event chains and store it as a user-item sparse matrix.
3) Train a neural network with the Adam optimizer to minimize the mean squared error.
Then, the resulting machine learning model is used to predict the likelihood of purchase for all unknown user-item pairs.

In order to formally define the approach, we first introduce a few notations. Let E denote the ordered set of all user events (the dataset), and e ∈ E an element of E. Let e_i denote the i-th user event in E. It is assumed that given two user events e_i and e_j, if i < j, then user event e_i precedes user event e_j in time.

Next, we assume that events can be categorized according to the user input. For example, if the user clicks on an item in a category listing to see the item in detail, this is called a 'view' event. An event category is, hereinafter, called an event type. Let T denote the set of event types found in E, and the event type of a user event e_i be t_i, ∀t_i ∈ T. Let c ∈ E denote a conversion event. The event type of c is denoted by t_c, such that t_c ∈ T.

A. Calculate the Probability of Conversion

Let P(t_i) be the probability of the event type in the complete sample space, and P(t_i ∩ t_c) the probability of the expected favorable outcome. We assume that the probability of conversion depends on the t_i event, i.e., ∀t_i (t_i ∈ T ⟹ P(t_c) ≠ P(t_c | t_i)). The probability of the conversion event type (t_c) given event type t_i is therefore defined as:

P(t_c | t_i) = 0                          if P(t_i) = 0,
P(t_c | t_i) = P(t_i ∩ t_c) / P(t_i)      if P(t_i) ∈ (0, 1].        (1)
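To make this step concrete, the following minimal Python sketch estimates P(t_c | t_i) from an event log with pandas. It is not the authors' published implementation: the column names (user_id, product_id, event_type) follow the Kechinov dataset schema, and interpreting P(t_i ∩ t_c) as the relative frequency of type-t_i events whose user-item chain also contains a conversion is our assumption.

import pandas as pd

def conversion_probability(events: pd.DataFrame, tc: str = "purchase") -> pd.Series:
    """Estimate P(tc | ti) for every event type ti, per Eq. (1)."""
    # Mark every event whose (user, item) chain contains a conversion.
    chains = events.groupby(["user_id", "product_id"])["event_type"]
    converted = chains.transform(lambda t: (t == tc).any())

    # P(ti): relative frequency of each event type in the whole log.
    p_ti = events["event_type"].value_counts(normalize=True)
    # P(ti ∩ tc): relative frequency of type-ti events in converting chains.
    p_ti_and_tc = events.loc[converted, "event_type"].value_counts() / len(events)

    # Eq. (1): ratio where defined; types never seen with a conversion get 0.
    return (p_ti_and_tc / p_ti).reindex(p_ti.index).fillna(0.0)

Event types that never co-occur with a conversion receive probability zero, matching the first case of Eq. (1).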
B. Compute the UX Value for an Event Chain

All events between a specific user and a specific item form the UX event chain, where events are ordered based on their recency (timestamp), such that the first element is the oldest event and the last element is the most recent event. The UX value function transforms the last element of the UX event chain into a scalar value.

Let u denote a specific user, and o a specific item, the object of the interaction between the human agent and the software. Let E^{u,o} ⊆ E denote an event chain associated to user u and specific to item o. Then, let e_i^{u,o} ∈ E^{u,o} be the i-th element of E^{u,o}. According to these definitions, the members of the E^{u,o} event chain differ only in event type. As a result, the UX function is defined as:

UX(e_i^{u,o}) = 0                                         if i = 0,
UX(e_i^{u,o}) = tanh(UX(e_{i-1}^{u,o}) + P(t_c | t_i))    if i > 0.        (2)

In the above equation, the hyperbolic tangent is used to assure that the function's return value is always in the [0, 1) range; therefore it acts as the UX value's normalizer. Lastly, we define the Y_{u,o} sparse matrix to contain the UX value for the last e^{u,o} event, as defined in the following equation:

Y_{u,o} ⇐ UX(e_{max(i)}^{u,o})        (3)

Accordingly, Y_{u,o} contains the UX value for the last e^{u,o} event. The main benefit of this approach is that it has a computational complexity of O(|E|), where |E| is the number of elements in set E; in other words, it has a linear computation time. Another benefit of the approach is that it yields a sparse user × item matrix, most of the values being zero, which can be constructed and stored efficiently.
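The recursion of Eqs. (2) and (3) can be evaluated in a single chronological pass over the log, which is where the O(|E|) bound comes from. The sketch below illustrates this under our assumptions, and is not the repository code: events is any iterable of (user index, item index, event type) triples in timestamp order, and p_conv is the P(t_c | t_i) mapping estimated above.

import numpy as np
from scipy.sparse import dok_matrix

def ux_matrix(events, p_conv, n_users, n_items):
    """Build the sparse UX matrix in one pass, per Eqs. (2)-(3)."""
    Y = dok_matrix((n_users, n_items), dtype=np.float32)
    for u, o, t in events:
        # Eq. (2): UX_i = tanh(UX_{i-1} + P(tc | ti)), with UX_0 = 0.
        Y[u, o] = np.tanh(Y.get((u, o), 0.0) + p_conv[t])
    # Eq. (3): each stored cell now holds the UX value of the chain's
    # last event; untouched cells stay implicitly zero, keeping Y sparse.
    return Y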
C. Three-way Data Split

In the next step of the developed approach, we split the Y sparse matrix into three matrices with the same dimensions as Y, using a three-way data split.

For colossal datasets, holdout validation is generally used [20], [21]. We used a special case of holdout validation, the three-way data split. With this approach, only the final, trained model is evaluated using the test set, and the validation set is used during hyperparameter optimization only.

Let p_train, p_val, and p_test be the probabilities of an event chain being in the Y_train, Y_val, and Y_test matrices, respectively. We apply the three-way data split using the following equation:

p_train + p_val + p_test = 1,
p_train Y_train + p_val Y_val + p_test Y_test = Y.        (4)
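One possible realization of Eq. (4) is to assign each stored UX value to exactly one of the three matrices at random. The helper below is a sketch of that reading; the function name and the 80/10/10 default proportions are our choices, not values given in the paper.

import numpy as np
from scipy.sparse import coo_matrix

def three_way_split(Y, p_train=0.8, p_val=0.1, p_test=0.1, seed=42):
    """Randomly partition the nonzero entries of Y into three matrices
    with the same shape as Y, per Eq. (4)."""
    assert abs(p_train + p_val + p_test - 1.0) < 1e-9
    Y = Y.tocoo()
    rng = np.random.default_rng(seed)
    # Draw one of {train, val, test} for every stored entry.
    choice = rng.choice(3, size=Y.nnz, p=[p_train, p_val, p_test])

    def part(k):
        m = choice == k
        return coo_matrix((Y.data[m], (Y.row[m], Y.col[m])), shape=Y.shape)

    return part(0), part(1), part(2)  # Y_train, Y_val, Y_test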
0. Due to the relatively high number of training examples
D. A Neural Network With Adam Optimizer and parameters, dropout regularization is a better option.
In the final step of the developed approach, we train a neural Dropout means dropping units and their connections from the
network, in order to predict the likelihood of conversion for neural network during training, with P0 chance, to prevent co-
the user-item pairs with unknown UX values. adapting [27]. Let D() be a dropout regularization function
Non-negative matrix factorization is NP-hard, as proven with P0 ∈ [0, 1) probability of eliminating units and their
by Vavasis [22]. Therefore, it is unlikely that there is an connections from the neural network during training [26].
exact algorithm that runs in polynomial time. Algorithms that Our algorithm will run until Ytrain is passed forward and
run in time exponential cannot reasonably be considered for backward through the neural network epochs times. During
Ytrain
real-world applications due to the large dimensions of the each pass, batch size number of batches are taken from Ytrain .
datasets. Fortunately, it is not critical to get perfect user-item Let Embedding(x, y) be a lookup function that retrieves
recommendations for an e-commerce site. Instead, it is enough embeddings. Let x be the size of the dictionary of embeddings,
to get recommendations with a high probability, if the solution and y be the size of each embedding vector. Let nf ∈ N>0
scales well to a large number of users and items. Therefore, be the number of factors. Let userf be the user factors, and
within the developed approach, we used a gradient descent- users are the users found in the batch, so that userf ⇐
based optimization of an objective function. Embedding(users, nf ). Let userb be the user bias, so that
To solve the problem of predicting the missing values in userb ⇐ Embedding(users, 1). Moreover let itemf be the
the sparse matrix, we define the objective function to be item factors, and items the items found in the batch, so that
minimized, known as the loss function. The data preparation itemf ⇐ Embedding(items, nf ). Let itemb be the item
procedure described in the previous section ensures that there bias, so that itemb ⇐ Embedding(items, 1).
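In PyTorch, Eq. (5) reduces to a one-liner over the q predicted entries. The helper below (our naming, not the repository's) coincides with torch.nn.functional.mse_loss for flat tensors of predicted and observed UX values.

import torch

def mspe(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (5): mean squared prediction error over the predicted points."""
    return torch.mean((y - y_hat) ** 2)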
Next, we proceed with the definition of the neural network training algorithm. Let backpropagation() be a function for backward propagation of errors [23], and let Adam(lr, β1, β2) be the Adam optimizer function used to minimize the loss function. The Adam optimizer, as introduced by Kingma et al., assumes the following hyperparameters: let lr ∈ (0, 1) be the learning rate, and β1, β2 ∈ [0, 1) be the decay rates for the moment estimates. To minimize the loss function, the Adam optimizer combines the best properties of the AdaGrad and RMSProp algorithms, and it was demonstrated that it can handle sparse gradients on noisy problems [24].

To be able to train the algorithm, let epochs ∈ N>0 denote the number of times the entire dataset Y_train is passed both forward and backward through the neural network. Let batch_size ∈ N>0 denote the number of samples evaluated before the model's internal parameters are updated.

When we want to prevent over-fitting for Adam, L2 regularization is not effective, as demonstrated by Loshchilov et al. [25]. To prevent over-fitting we could use decoupled weight decay regularization [25] or dropout [26]. Weight decay penalizes large weights, forcing all weights to be close to 0. Due to the relatively high number of training examples and parameters, dropout regularization is a better option. Dropout means dropping units and their connections from the neural network during training, with probability P0, to prevent co-adapting [27]. Let D() be a dropout regularization function with P0 ∈ [0, 1) probability of eliminating units and their connections from the neural network during training [26].

Our algorithm runs until Y_train has been passed forward and backward through the neural network epochs times. During each pass, |Y_train|/batch_size batches are taken from Y_train.

Let Embedding(x, y) be a lookup function that retrieves embeddings, where x is the size of the dictionary of embeddings and y is the size of each embedding vector. Let nf ∈ N>0 be the number of factors. Let userf be the user factors, and users the users found in the batch, so that userf ⇐ Embedding(users, nf). Let userb be the user bias, so that userb ⇐ Embedding(users, 1). Moreover, let itemf be the item factors, and items the items found in the batch, so that itemf ⇐ Embedding(items, nf). Let itemb be the item bias, so that itemb ⇐ Embedding(items, 1).
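These factor and bias lookups translate directly to torch.nn.Embedding tables. The module below is a sketch: the paper defines userf, userb, itemf, and itemb, but combining them as a biased dot product, with dropout D() applied to the factor vectors, is our assumption for the parts not spelled out in this section.

import torch
from torch import nn

class UXFactorization(nn.Module):
    """Embedding dot-product model sketch for predicting UX values."""

    def __init__(self, n_users: int, n_items: int, nf: int = 50, p0: float = 0.2):
        super().__init__()
        self.user_f = nn.Embedding(n_users, nf)  # userf <= Embedding(users, nf)
        self.user_b = nn.Embedding(n_users, 1)   # userb <= Embedding(users, 1)
        self.item_f = nn.Embedding(n_items, nf)  # itemf <= Embedding(items, nf)
        self.item_b = nn.Embedding(n_items, 1)   # itemb <= Embedding(items, 1)
        self.drop = nn.Dropout(p=p0)             # D() with probability P0

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        # Biased dot product of (dropout-regularized) user and item factors.
        dot = (self.drop(self.user_f(users)) * self.drop(self.item_f(items))).sum(dim=1)
        return dot + self.user_b(users).squeeze(1) + self.item_b(items).squeeze(1)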
In the developed algorithm, the loss is calculated using the MSPE function (Eq. (5)). Then, backpropagation() performs the backward propagation of errors. Finally, the Adam optimizer function is used to minimize the loss function. After each epoch, the validation loss is calculated and stored using MSPE(Ŷ, Y_val). A decreasing validation loss indicates that the algorithm is learning to predict the conversions. The resulting solution is summarized as Algorithm 1.
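A compact sketch of the Algorithm 1 loop, reusing the mspe helper and UXFactorization module above, might look as follows. The train_batches iterable is assumed here: it would yield (users, items, ux) tensors drawn from the nonzero entries of Y_train, |Y_train|/batch_size batches per epoch. The default Adam betas mirror common practice; the rest is illustrative rather than the authors' published code.

import torch

def train(model, train_batches, val_users, val_items, val_y,
          lr=1e-3, betas=(0.9, 0.999), epochs=500):
    """Train/validate loop per Algorithm 1: forward pass, MSPE loss,
    backpropagation, Adam step, then one validation loss per epoch."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=betas)
    for epoch in range(epochs):
        model.train()
        for users, items, ux in train_batches:
            loss = mspe(model(users, items), ux)  # Eq. (5)
            opt.zero_grad()
            loss.backward()                       # backpropagation()
            opt.step()                            # Adam(lr, β1, β2) update
        model.eval()
        with torch.no_grad():                     # MSPE(Ŷ, Y_val) per epoch
            val_loss = mspe(model(val_users, val_items), val_y)
        print(f"epoch {epoch + 1}: validation MSPE = {val_loss.item():.4f}")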
IV. EXPERIMENTAL ASSESSMENT

A. Implementation Details

We have used Python 3.7.7 to implement the solution described in the previous section. The Jupyter Notebooks and the outputs can be found on the project's GitHub repository: https://github.com/WSzP/uxml-ecommerce.

For the implementation of machine learning algorithms, we used the PyTorch open source library [28] (version 1.5) and the
Algorithm 1 The training and validation algorithm
Require: Y_train, Y_val ∈ R^(users×items)    ▷ Sparse matrices
Require: lr ∈ (0, 1)    ▷ Step size
Require: epochs ∈ N>0    ▷ Passes through Y_train

[Figure: Dataset — 'eCommerce Events History in Cosmetics Shop', 8,738,120 events (rows) × 9 columns]

Fig. 3. Users categorised by the number of events they dispatched (x-axis: number of events per user, 1 to 5+; y-axis: number of users).

Fig. 4. MSPE loss change while training Algorithm 1 for 500 epochs (1,260,998 batches) (x-axis: batch number; y-axis: MSPE loss).