Log-Based Session Profiling and Online Behavioral Prediction in E-Commerce Websites
INDEX TERMS Behavior prediction, user profiling, log analysis, clustering, neural networks, model
checking.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
171834 VOLUME 8, 2020
J. Fabra et al.: Log-Based Session Profiling and Online Behavioral Prediction in E–Commerce Websites
Achieving these goals requires that the prediction models be integrated into the company’s decision-making systems and that their predictions be validated to adapt those models to the changing conditions of the business and the evolution of customers’ habits.
In this article, existing research proposals in the field of customer behavior prediction are reviewed. This review shows the need to address the challenges of building fine-grained predictive models and applying them to real scenarios. The research presented in this article advances in this direction by focusing on the behavioral analysis of unregistered customers of an e-commerce website, and it shows that it is possible to accurately predict the customer profile of a user session while the user browses the website. With respect to existing techniques, the proposed solution makes the following contributions:
• the prediction model works with incomplete unregistered user sessions;
• the different customers’ profiles are explicitly considered as part of the predictions, thus providing a deeper understanding of users’ browsing and purchasing behaviors;
• those profiles are interpreted and validated from a business perspective to match predictions with desirable customer behaviors;
• finally, the integration process of a prototype of the solution into a website based on the Magento e-commerce technology is detailed.
The proposal requires building prediction models that are used along with clickstream techniques to analyze the customers’ behaviors at runtime. To do that, methodologies for server log processing, clustering algorithms and artificial intelligence techniques are combined in a three-phase process. First, the log files are processed to discover the customer profiles. These profiles are then validated and interpreted from a business perspective, associating each one with a set of behavioral patterns that characterize the users belonging to that profile. After that, a prediction model is created and trained to evaluate users’ pattern-based behavior and determine the customer’s profile. Then, once the models are available, predictions are conducted using the user’s clickstream. This allows the system to make, after a small number of events, precise predictions concerning the segment the session is probably going to fall into.
From this point, the results from the predictions can be used to adapt the customer’s session so as to reinforce the prediction or attempt to move the session towards a more interesting segment, according to the e-commerce website’s interests. Unlike other existing solutions, predictions are based on the current behaviors of users and are not limited to the purchasing probability.
The remainder of this article is organized as follows. Section II focuses on related work. The process for predicting the customer’s profile and a real scenario are introduced in Section III. The preprocessing of the log files is presented in Section IV. After that, the clustering process is carried out in Section V. Section VI details the customer profiling and the validation process. The behavior prediction methods used and the obtained results are presented in Section VII. The integration of the prediction system into an e-commerce platform is then detailed in Section VIII. Finally, Section IX outlines some conclusions of this article and addresses future research lines.
II. RELATED WORK
Before making predictions about customers’ future behavior, it is necessary to discover the different profiles of users that visit the e-commerce website. The process of profiling consists of two stages: the characterization of customers’ past behaviors and the grouping of customers who behave similarly. In this section, the most relevant research approaches related to these two stages are detailed and analyzed.
Regarding the first stage, most research techniques create customers’ behavioral descriptions from the website’s log files or the database of customer transactions. The contents of these descriptions can vary depending on the intended use of the analysis results. Customer personal data [11], their RFM (Recency, Frequency and Monetary) values [12]–[16], their browsing behaviors [17], [18] or purchasing habits [19]–[21], or the products they have shown interest in [22]–[25] are typically used for the creation of such descriptions. The concept of session plays a relevant role in this characterization because a description is calculated for each customer session. For this reason, the existing approaches are essentially interested in the analysis of registered users. The sessions of these users are clearly identified and directly recorded in the website’s log files. As an exception, [22] is the only work dealing with unregistered users. In this case, a process of reconstructing sessions based on clickstream analysis is required [4].
Once customers’ descriptions have been created, they are usually grouped using either clustering methods [13], [15], [16], [19], [23], [26] or classification methods [17], [18], [25]. As a result, the application of these techniques generates a set of clusters that must be subsequently interpreted to understand the particular behaviors of each class of customers. An expert-guided analysis of the computed clusters is proposed by some of the approaches [12], [20], [23], [26]. Such a task is rather complicated and time-consuming, and therefore alternatives that automatically extract knowledge from clusters’ descriptions should be studied. [11], [19] use association rules for automating these interpretations. Nevertheless, they require advance knowledge of the interesting attributes to define suitable rules. In addition, the clusters obtained must also be validated. Ideally, the clusters’ validation should consist of matching users’ future behaviors against the clusters. Some works have proposed frameworks that provide incremental clustering to dynamically maintain the customer profiles [27], [28]. Despite the efforts to interpret and validate clusters, these are still open challenges.
Customer segmentation techniques are needed to build models that help to make predictions regarding customers’ future behaviors. To give a compact view of the existing research in the field of prediction, a set of criteria that help us to classify these works has been established. The result of this classification is presented in Table 1. Let us now detail the classification criteria used.
TABLE 1. Comparative analysis of the methods for predicting the purchasing probability.
Prediction goals. Some models address the challenge of distinguishing between buying and non-buying sessions (B/NB, two possible prediction outcomes) [4], [7], [8], [29]–[35]. Alternatively, other works concentrate on calculating the probability that a customer buys either a specific product (B-Prod) [20], [36]–[38] or a class of products (B-CProd) [39], [40], makes a purchase in the next visit to the online store (Next) [41]–[43], or repurchases in a future session (ReP) [44], [45]. Time constraints have also been considered as a part of some prediction models to estimate the purchasing probability of a user for the next day (Next-D) [46], for the next year (Next-Y) [47], or over time (NoT) [40]. Likewise, customers’ profiles have been used to distinguish between VIP and non-VIP customers (V/NV) [48]. Notice that the predictions in the cited research require identified users to predict future behavior once the corresponding past behavior has been analyzed.
Data source. Customers’ past behavior is usually extracted from log files generated by Web servers (Log-based proposals, Log) or transaction data recorded in the seller’s ERP/CRM systems (database approaches, CTD). These data sources are processed to discover and select the features/attributes that will be used to create prediction models. As an exception, [34], [45], [47] propose the use of questionnaires for gathering information regarding customers’ preferences and behaviors in the hiring of (banking) services.
Types of customers. Most works make their predictions based on registered customers’ past behaviors. Only four works estimate the purchasing probability for unregistered customers [4], [29], [31], [32]. These solutions apply clickstream analysis to reconstruct users’ sessions and discover users’ behaviors during their navigation through the e-commerce website.
Selection of a predictor. The selection of features is a critical issue for the creation of an accurate prediction model. It consists of extracting/computing a set of relevant attributes from the data source. As shown in Table 1, the most common attributes are customers’ personal (P) or demographic (D) data, product interest scores (PI), customers’ navigation (NB) or purchasing behaviors (PB), or historical purchasing data (HP) (the RFM value or payments, for instance). Nevertheless, some proposals select alternative interesting attributes, such as the use of shopping carts (SC) [32], [43], seller’s reputations and facilities (SRF) [45], customers’ opinions (CO) [47], changes in user behavior (ChB) [46] or interactions of users with Web pages and their elements (Int) [7]. From a methodological point of view, two approaches should be emphasized. Firstly, [44] applies feature engineering to automate the selection and ranking of a large number of features to improve prediction tasks. Secondly, a method based on association rules is proposed in [4], [31] as an alternative to traditional techniques.
Prediction techniques. Three types of techniques have been widely applied in the prediction of customers’ purchasing behavior: classification methods, regression analysis, and algorithmic techniques. Among the classification methods, the most common are Neural Networks (NN), Decision Trees (DT), Support Vector Machines (SVM), Random Forest (RF) and Naive Bayes models (NBM). The approaches based on these methods make their predictions using a single classifier (S-Class, [33], [34], [43], [46]) or by combining multiple classifiers to improve the accuracy of the results (M-Class, [7], [8], [29]–[31], [40], [42], [44], [45]). In the latter case, a combination algorithm integrating the predictions of the different classifiers is needed. Genetic algorithms (GA) [30], the Artificial Bee Colony (ABC) algorithm [45], Bootstrap Aggregation (BA) [48] and strategies based on majority voting (MV) [20], [31] have been used as combination methods. As an exception, [44] assigns weights to models manually (MN).
Alternatively, regression analysis has been used to determine the purchasing probability using logistic regression [32], [35], [38], [41], [47]. This statistical model requires an a priori analysis of the predictors to be used and the correlations between them to make accurate predictions. In [20], a hybrid solution is presented. Logistic regression and classification methods are combined to improve the purchasing predictions of a specific e-commerce site. Although the results notably increase the prediction coverage, prediction accuracy is not clearly improved with respect to other approaches based on the use of a single technique.
Finally, different algorithms have been proposed to study specific purchasing behaviors [4], [36], [37]. These solutions define probability models that are evaluated in conjunction with association rules to extract the knowledge of interest for e-commerce managers. [36] attempts to discover the most profitable products and customers. It searches for potential customers interested in purchasing a star product in the near future and analyzes those buyers’ personal profiles. [37] determines the best time (the peak hour) for a customer to purchase a product. This time-based information is used to deliver personalized marketing messages to increase sales. [4] uses rules to estimate the purchasing probability of a user session depending on the pages that were visited in the past and the time spent on them.
The nature of approaches. Some of the research works are Application-oriented Approaches (AoA) in the sense that they apply existing prediction methods to solve some concrete problem, usually in the domain of e-commerce or e-banking services. Adopting a different point of view, some works (let us call them Methodology-oriented Approaches, abbreviated as MoA) concentrate on defining new methods/algorithms for predicting future customers’ purchasing behaviors. Generally, these types of works also validate
their solutions by applying them to real application cases (MoA/AoA).
Integration into e-commerce websites. Prediction methods help explain customers’ behaviors. This understanding can be used to improve the design and contents of websites, perform various recommendation techniques, increase the effectiveness of marketing campaigns, or customize the service for the user, for instance. In spite of these possibilities, most solutions have not been integrated into a real e-commerce system, except for [7], which developed a prototype of a system that can be installed on users’ mobile devices. Therefore, the integration of the predictions in the lifecycle of e-commerce websites is an open challenge that should be addressed.
Validation of results by experts. Prediction models and algorithms are usually trained and tested using the data recorded in server logs or transaction databases. Moreover, different metrics (recall, precision, etc.) have been defined to evaluate the quality of the predictions. Nevertheless, other supplementary methods should be applied to evaluate the real usefulness of predictions to create new business value and opportunities. These methods could consist of a qualitative validation of results based on expert opinions, for instance. In addition, because the prediction models are not usually integrated in real systems, the predictions are not validated against customers’ future behaviors. As an exception, [47] validates its predictions (the probability that a customer buys during the next year) by comparing them with the purchases made during the year following the publication of the paper.
Discovery of extra knowledge. Many of the proposals aim to classify customers’ behaviors. Nevertheless, the reasons that lead customers to exhibit that behavior are not usually studied. Some works apply association rules (AsR) to analyze the sessions with high purchasing probability to discover behavioral patterns and the reasons that lead to the purchase of some products [4], [31], [36], [48]. The knowledge discovered is limited and consists of simple relationships between pairs of navigation/purchasing events. As an alternative, [47] builds a Behavioral Scoring Model (BSC). These models have been widely used to identify frequent user behaviors in the field of financial services, but their applicability to online commerce must still be investigated.
III. A PROCESS FOR PREDICTING CUSTOMERS’ PROFILES
Our goal is to create a model that helps to predict the possible future behavior of a customer session while the customer is browsing an e-commerce website. This prediction can be used to influence customers’ actions (for example, to improve purchase intentions and/or probability) or to provide them with customized contents or products. The proposed approach consists of applying a process in three phases: preprocessing of the server log files, discovery of the customer profiles, and synthesis of the behavioral model. The final result is a prediction model that is integrated into the e-commerce site’s decision system to personalize the services it offers to customers.
The process followed in this article is similar to that used in [49] in the field of predictive business process monitoring. In that case, a two-phase approach analyzes incomplete traces of business processes to predict at runtime whether their execution outcomes will be as expected. These predictions help to minimize the likelihood of violation of business constraints specified using Linear Temporal Logic (LTL). That technique is applied over medical processes with a well-defined structure. This is a relevant difference with respect
to our approach, in which a user can navigate freely through the website’s structure.
A. BUILDING A PREDICTION MODEL
Figure 1 shows the three-phase process followed in this article. Firstly, the raw e-commerce logs are preprocessed to discard uninteresting requests, identify user sessions and prepare the log contents to enable their analysis. The result of the preprocessing phase is a collection of sessions. A session is an ordered sequence of user interactions with the system (events) that take place within a time frame. A session can contain multiple page views, events, social interactions and e-commerce transactions corresponding to actions such as visiting a page, executing a search, adding/deleting a product to/from the cart or completing the payment process, for instance. A session can be interpreted in terms of users’ behaviors. The process is designed to be useful for systems with either logged or anonymous access. In the first case, a session is clearly established in terms of a sequence of events corresponding to the logged session. In the case of anonymous access, a sessionization process is required to establish which events in a sequence can be considered as belonging to the same session, as will be detailed later.
Afterwards, the established sessions are used to determine the e-commerce site’s customer profiles. This second phase starts by computing a vector of features for each session (the features creation task). A feature provides a high-level and (usually) quantitative description of the user’s behavior during the session: the total session time, the number and type of visited pages, or the number of times that the resources were used (the search engine, the cart, the wishlist, etc.), for instance. In the case of logged e-commerce websites, features can be enhanced with demographic and geographic data, buying patterns of previous user sessions, or the purchase history, for instance. The features are then processed by the clustering task to group in the same cluster those sessions that present similar features (that are assumed to be strongly related to the user’s behavior). These first two tasks can be executed several times to improve the features’ expressivity and to find a more adequate (optimal) number of clusters as well.
Once the session clusters have been computed, they are interpreted from a business perspective. The business analyst is responsible for mapping these clusters to customer profiles. This profile discovery is complex and requires knowledge of the website’s structure, the customer’s interests and purchase habits, and the types of users that typically interact with the e-commerce site. Finally, the resulting profiles must be validated. This task consists of checking whether the behavior of a cluster’s sessions corresponds to the behavioral description of its profile. To do that, some type of study of the conformance between the sessions in a cluster and the intuitive description established for it must be carried out. For instance, if a cluster is described as corresponding to spurious website users, one can expect those sessions to enter the system from a different point than the home URL (maybe coming from an external search engine) and also be short sessions.
The aim of the prediction model synthesis phase is to generate a prediction model so as to be able to establish, after a few session events, the cluster to which a live session is probably going to belong. The inputs for this phase are the set of clusters and some behavioral indexes for each session that typically are associated with initial stages of the session. The prediction model will be used to analyze the event stream of each session and, after the considered initial stage, predict the cluster of the considered session. Different (artificial intelligence or statistical) techniques can be applied to obtain the prediction model. In this work, neural network pattern recognition techniques are applied during the process, but the process could easily be adapted to other alternatives. For that, a vector of features is obtained for the first k events of each session. The features and clusters feed an artificial neural network synthesis method, which is trained and validated (in the Model training and Quality analysis tasks, respectively). The resulting artificial neural network is the model used to predict the session behavior.
Finally, the model obtained is integrated into the e-commerce site’s prediction system. E-commerce data logs are processed during the customer’s navigation and transformed into events of interest. These events represent the customer’s actions during the browsing (visiting a product category or the product itself, using the search engine, adding/deleting a product to/from the cart, or completing the purchase, among others). The prediction system interprets the event stream and determines the most probable customer profile, which will be used by the e-commerce system to make some decisions or recommendations adapted to the session behavior.
B. THE UP&SCRAP USE CASE
The process presented to obtain a prediction model will be applied over a real e-commerce website. Specifically, it will be applied over the website of Up&Scrap¹, a scrapbooking company with more than 25,000 clients all around the world. In this subsection, the structure and contents of this website are introduced.
The structure of the website of Up&Scrap is organized around two different types of sections (main and secondary). Each section is then split into several subsections to refine the product classification. Figure 2 depicts the structure of the website. Similar taxonomies have been proposed by different authors but including only main sections [23], [50].
From the homepage (level 0), different sections can be accessed (level 1). Two different types of sections can be distinguished. Main sections organize products according to their functionality and utility. The website provides a menu to access this main categorization of products. There are eight different sections (papers, decorations, stamps, tools, project life-smash, albums, home decor-DIY, and gifts), which are divided into subcategories. Alternately, there are secondary
¹ http://www.upandscrap.com
FIGURE 1. Sketch of the process for identifying and predicting the customers’ profiles.
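
The prediction step of the process in Figure 1 can be illustrated with a minimal sketch. It builds a feature vector from the first k events of a live session and assigns the session to a cluster. The event vocabulary, the cluster labels and the centroid values are hypothetical, and a nearest-centroid rule is used here only as a simple stand-in for the trained neural network described in the text.

```python
from collections import Counter
from math import dist

# Hypothetical event vocabulary; a real deployment would use the event
# types kept after log preprocessing.
EVENT_TYPES = [
    "Visit_homepage",
    "Visit_main_section_L1",
    "Visit_product",
    "Add_product_to_the_cart",
]


def prefix_features(events, k):
    """Relative frequency of each event type among the first k events."""
    prefix = events[:k]
    counts = Counter(prefix)
    n = len(prefix) or 1  # avoid division by zero for empty sessions
    return [counts[etype] / n for etype in EVENT_TYPES]


def predict_cluster(features, centroids):
    """Label of the closest centroid (stand-in for the neural network)."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))


# Hypothetical centroids, one per customer profile.
centroids = {
    "explorer": [0.25, 0.50, 0.25, 0.00],
    "buyer": [0.00, 0.00, 0.50, 0.50],
}
live_session = ["Visit_product", "Add_product_to_the_cart", "Visit_homepage"]
profile = predict_cluster(prefix_features(live_session, k=2), centroids)
```

Because only the first k events are used, the same function can be called repeatedly while the session is still in progress.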
FIGURE 3. An extract from the raw log of the web server, where IP addresses have been anonymized.
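
As a rough illustration of how a raw request could be mapped to an event type from its HTTP method and the words between the slashes of its URL, consider the sketch below. Only the main section names and the rule that search results count as a secondary section are taken from the text. The URL slugs, the secondary section names, the cart URL and the ".html" product convention are assumptions made for the example, not the actual Up&Scrap URL scheme.

```python
from urllib.parse import urlparse

# Main section names follow the text; their URL slugs are assumed.
MAIN_SECTIONS = {"papers", "decorations", "stamps", "tools", "albums", "gifts"}
# Hypothetical secondary sections; "search" follows the rule that search
# results are treated as a secondary section.
SECONDARY_SECTIONS = {"offers", "novelties", "search"}


def classify_request(method, url):
    """Map an HTTP request to an event type using the URL path words."""
    words = [w for w in urlparse(url).path.lower().split("/") if w]
    if not words:
        return "Visit_homepage"
    if method == "POST" and words[:2] == ["checkout", "cart"]:
        return "Add_product_to_the_cart"  # assumed cart URL
    if words[-1].endswith(".html"):
        return "Visit_product"  # assumed product page convention
    if words[0] in MAIN_SECTIONS:
        kind = "main"
    elif words[0] in SECONDARY_SECTIONS:
        kind = "secondary"
    else:
        return "Other"
    level = min(len(words), 2)  # the website has two levels (N = 2)
    return f"Visit_{kind}_section_L{level}"


event = classify_request("GET", "/papers/patterned/")
```

Requests classified as "Other" in this sketch would correspond to the event types discarded in the simplification stage.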
whether it is a GET or POST request and analyzing its URL in terms of the presence and/or absence of specific keywords and resources (that is, the words between slash characters). These requests are classified based on their depth and whether they correspond to a main or a secondary section. In the case of the Up&Scrap website, its structure is organized in two levels (N = 2). The different events have then been separated into 63 different types, such as Visit main section L1, Visit secondary section L2², Visit product, Login, Logout, Add product to the wishlist, Add product to the cart, etc. These events refer to different actions and can affect different sections of the website. However, not all event types are interesting for the analysis because some of them provide superfluous information. They can, for instance, refer to user account management or legal warnings.
Therefore, in the simplification stage, some of these event types are discarded, and only the event types that are interesting for the type of analysis that is going to be conducted are considered, with the aim of reducing the amount of information included in the log by filtering the records that do not contain relevant information. To do that, the following filters are applied. First, sessions with fewer than three requests are discarded because they do not contain valuable information and mainly correspond to users that do not have an interest in the website contents.
Second, some events are discarded because they do not provide valuable information for the analysis. Because the goal of analyzing the logs is to extract information regarding users’ behaviors and preferences when buying products, there are many events that can be considered superfluous, such as events related to user account management or rating products. In this case, a set of 12 types of events that are considered relevant for the analysis has been identified, and the remaining ones have been filtered. In the following, they are detailed according to the different sets identified: Visit_main_section_L1, Visit_secondary_section_L1 (accessing the results of the search engine is considered to be visiting a secondary section), Visit_main_section_L2, Visit_secondary_section_L2, Visit_homepage (an event type representing that the homepage has been visited), Visit_product (an event type representing that the URL of a product has been visited), Add_wishlist_products_to_the_cart, Add_product_to_the_cart, Add_product_to_the_wishlist, Buy_products_in_the_cart, Delete_product_from_the_cart, and finally, Update_product_from_the_cart.
The last filter removes duplicated events, which reduced the log size to 1,331,697 records. Figure 4 shows the processed log of the extract depicted in Figure 3. A more detailed description of the entire process can be found in [52].
After that, a sessionization process was conducted to group those events that could be considered as belonging to the same session. Because this study deals with non-logged sessions, additional criteria had to be applied to define the start and end events of each session. For that, a session was defined as the ordered sequence of events from the same IP for which no more than 30 minutes passed between any two consecutive events. This is a common characterization that has been used in log file analysis to discover knowledge by several authors [53]–[55]. As a result, 138,085 anonymous sessions were identified.
V. CLUSTERING PROCESS
The next step consists of the clustering of sessions with some common characteristics. To do that, for each session, a set of global properties was extracted. The set of properties that can be interesting is strongly related to the problem domain. In the domain of this work, it has to be dependent on the website structure because the structure will constrain the types of sequences of events a user can execute. For the structure in Figure 2, the properties described in Table 2 have been considered.
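
The 30-minute sessionization rule from the preprocessing phase and the extraction of per-session properties can be sketched together as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (IP, timestamp, event type) tuples, and the properties are simplified to one counter per event type plus the session length (named LONG here), rather than the exact properties of Table 2.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity threshold stated in the text


def sessionize(events):
    """Split (ip, timestamp, event_type) tuples into per-IP sessions.

    A new session starts when more than 30 minutes pass between two
    consecutive events from the same IP.
    """
    by_ip = defaultdict(list)
    for ip, ts, etype in events:
        by_ip[ip].append((ts, etype))
    sessions = []
    for ip, items in by_ip.items():
        items.sort(key=lambda item: item[0])
        current = [items[0]]
        for prev, item in zip(items, items[1:]):
            if item[0] - prev[0] > SESSION_GAP:
                sessions.append((ip, current))
                current = []
            current.append(item)
        sessions.append((ip, current))
    return sessions


def session_properties(session_events):
    """Count events per type and record the session length (assumed names)."""
    counts = Counter(etype for _, etype in session_events)
    counts["LONG"] = len(session_events)
    return dict(counts)


t0 = datetime(2020, 1, 1, 10, 0)
log = [
    ("10.0.0.1", t0, "Visit_homepage"),
    ("10.0.0.1", t0 + timedelta(minutes=10), "Visit_product"),
    ("10.0.0.1", t0 + timedelta(minutes=45), "Visit_homepage"),
    ("10.0.0.2", t0, "Visit_product"),
]
sessions = sessionize(log)
properties = [session_properties(evts) for _, evts in sessions]
```

In the example, the third event of the first IP arrives 35 minutes after the previous one, so it opens a new session.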
² Note that L1 and L2 are related to the two levels of the Up&Scrap website
FIGURE 4. Log generated after the preprocessing phase from the web server log’s extract depicted in Figure 3.
For each session, the set of corresponding values is used to generate the vector of features. This vector is useful for providing a high-level view of the session (abstracting unimportant details) and facilitating their interpretation by the business analyst. Of the 15 properties identified in Table 2, 9 are retained, which are those most relevant for prediction issues. Of these 9 properties, some can be grouped/added for the clustering process. The other properties that have not been used in this phase can be useful and interesting later to perform certain validation processes. Specifically, the following features have been considered: MAIN, which groups visits to the main category (MAIN = ML1 + ML2); SECONDARY, which groups visits to the secondary category (SECONDARY = SL1 + SL2); MARKETING, which groups marketing-related events (MARKETING = OFFER + NOV); INTEREST, which groups events that indicate interest (INTEREST = WISH + PROD + CART); and finally, the SEARCH feature, which corresponds to the property of the same name. As a result, a feature vector is obtained for each session. As the next step, the sessions are clustered.
As the clustering technique, k-Means has been applied to the vector of features. The objective of this algorithm is to partition a set of n elements into k groups of “near” elements: each element belongs to the group whose average is closest. The algorithm requires the number of clusters, k, to be defined in advance. To find the optimal value of k, Knime [56] and R [57] were used.
In Knime, a workflow that performs an iterative process and calculates the entropy generated by the selection of different values of k has been developed. Entropy is a measure of the variation of the attributes in the data set for each cluster; the closer the value is to 0, the greater the similarity of the data is. Conversely, the further the value is from 0, the greater the differences between the data are (and thus the worse the results obtained during the clustering process). A lower entropy determines the optimal number of clusters in which the data should be grouped. In the case of the R software, the NbClust package [58] was used. This package provides 30 indexes to determine the optimal number of clusters and proposes the best grouping scheme based on the different results obtained. This includes very well-known methods such as Silhouette or Pamk (which uses the PAM or Clara algorithms, along with the Silhouette method).
The clustering process was conducted using the sessions whose lengths were greater than one event, feeding the process with the described features. One-event sessions can be considered as noise traces. From the original dataset, which contained 138,085 sessions, there are 101,917 traces whose lengths are longer than one event (LONG > 1). Both Knime and R provided us with the same optimal number of clusters, k = 4.
Table 3 provides information regarding the results of the clustering and the mean values for the features of each cluster with respect to the considered sessions. Each element in the table corresponds to the mean value in the considered set. For instance, the normalized global mean value of the MAIN feature is 0.3538, while that constrained to Cluster 1 is 0.0348. Cluster 1 contains 20,273 sessions (19.9% of the total number of sessions), cluster 2 contains 38,670 sessions (37.9%), cluster 3 contains 15,573 sessions (15.3%), and finally, cluster 4 contains the remaining 27,401 sessions (26.9%).
The analysis of the features of each cluster with respect to the initial set of data shows that there is a set of feature values that stands out for each cluster (the values have
been highlighted in bold in Table 3), which can be used to establish the users’ profiles. Cluster 1 shows normalized mean values of 260% for the SECONDARY and 409% for the MARKETING attributes, respectively. Cluster 2 shows a ratio of visits to the main page (MAIN) of 189% with respect to the global average. Cluster 3 stands out in the SECONDARY and SEARCH values, with averages of 189% and 469% with respect to the global average, respectively. The fact that the search events are part of the secondary ones is clearly reflected in the correlated values of such attributes in this cluster. Finally, cluster 4 shows that INTEREST stands out with values that represent 197% with respect to the global average. There are other coincidences regarding the outstanding features among the clusters, which will help in the characterization process.

VI. CUSTOMER PROFILING AND VALIDATION
The analysis of the data obtained from the clustering process allows us to perform an initial profiling phase [59]. Customer profiling is the subdivision of a market into discrete customer groups that share similar characteristics [60], [61]. This process allows for identification of common characteristics among different users and potential customers, as well as the proposal of retargeting strategies. Customer profiling requires, as key steps, the division of the market into meaningful and measurable segments (clusters) according to customers’ needs, past behaviors or demographic profiles (if available), as well as determination of the profit potential of each cluster by analyzing those aspects and characteristics that stand out in each one.

A. CLUSTER INTERPRETATION
Let us now perform a cluster interpretation focusing on the salient features of the clusters obtained in the previous section, as well as the results shown in Table 3. In addition, the information that the clusters offer us allows for the addition of certain characteristics based on the properties calculated previously. The values that stand out above the others for each feature appear in bold in Table 3.
As shown, cluster 1 stands out in the features of MARKETING and secondary sections (SECONDARY). These values indicate that this cluster groups the customers who usually access (or repeat their visit to) the website via a campaign or marketing source (which corresponds to secondary items). Those customers also focus on secondary items (SECONDARY feature) and do not tend to move out of the visited category or explore among different categories, instead remaining in a narrow navigation area. They have a low browsing dispersion. Analyzing session duration, it was found that these customers’ sessions are short.
The second cluster stands out in the MAIN property, which indicates that the users falling into this cluster represent first-time customers or customers that spend time browsing the website. They are probably users that land on the website for exploratory or purchasing purposes. These users have long sessions with high dispersion (low focus on the same level/category of items), and they show a moderate ratio of purchases.
A detailed view of cluster 3 emphasizes that, as in cluster 1, the secondary-section property stands out. However, this cluster also highlights the search engine property (SEARCH), which suggests that the population that falls into cluster 3 corresponds to those customers that browse using the search engine (search-based navigation). There are three main options that probably explain this behavior: possible ignorance of the website map, the aim of looking for very specific items, or a non-specific purchasing/browsing focus. Additionally, depending on the session duration, two different subclasses of customers with this profile can be distinguished: one with sessions that have a short time between events, which represent customers that usually will not purchase in the end; and another one with very specific customers who visit the product page to finally purchase it. The latter are represented by sessions that are characterized by a longer time between events.
Finally, customers grouped in cluster 4 stand out for the INTEREST property, which groups the wishlist, visits to detail pages of the product, and actions in the shopping cart. This indicates that these customers may have a clear idea about the website and the products and categories they are interested in. It can also be observed that these customers focus more on main sections than on secondary ones, visit more product pages, and spend some time on them. Usually, they enter through the homepage, and their sessions end with just purchasing products or keeping products in the cart. Sessions belonging to this profile are long and concentrated on interest-related events.
Based on the main characteristics of each cluster, let us provide a named classification. This helps to create a conceptual separation among the groups, similarly to other approaches [62]:
• customers in cluster 1 correspond to repeat or geek customers;
• customers in cluster 2 correspond to explorer customers;
• customers in cluster 3 correspond to searcher customers (or narrow searcher customers for very specific ones); and finally,
• customers in cluster 4 correspond to potential or prospective buyers.

B. CLUSTER VALIDATION
The complete data (features along with the initial properties) from the clustering process was used to validate the clusters using model checking techniques [63]. To this end, this study proposes a set of validation queries related to results obtained from the clustering process.
Events have been defined as propositional variables and grouped into formulas, as depicted in Table 4, for better understanding. The meaning of every event can be easily deduced from the event name. In turn, &, | and ! correspond, respectively, to the and, or and not logic connectives. Some of these queries are detailed below in natural language:
• How do users access the website (Q1: How many sessions directly access the website through a MARKETING event?).
• How do users access main sections (Q2: How many sessions visit neither the main L1 nor main L2 sections?).
• What is the relation between how users access the website and purchasing (Q3: How many sessions access through a MARKETING event and then have a PURCHASE event?).
• How do users use the search engine of the website (Q4: How many sessions never use the search engine? Q5,6: How many sessions use the search engine at least three (Q5) or four times (Q6)? Q7: How many sessions feature intensive use of the search engine (at least 4 times) and purchase items?).
• How do customers purchase products (Q8: How many sessions have a PURCHASE event? Q9: How many sessions operate with the cart and then checkout? Q10: How many sessions have two or more purchase events?).
• What is the relation between user navigation and purchases (Q11: How many sessions iterate at least five times between main and secondary sections and do not purchase in the end? Q12: How many sessions visit five or more product pages?).
• How do users purchase and interact with the cart (Q13: How many sessions add products to the cart but do not purchase in the end? Q14: How many sessions add at least two products to the cart? Q15,16: How many sessions have two PURCHASE events, but the cart remains untouched (Q15)/is modified (Q16) between the checkouts? Q17: How many sessions contain three or more PURCHASE events?).
Table 5 shows the results obtained for each cluster (ci), reflecting the percentage of sessions that the validation query (Qj) fulfills with respect to the total number of sessions of the cluster, as well as the queries described in the LTL version proposed in [63]. In the table, operators G, F, and X have the usual LTL interpretation: Always, Eventually and Next, respectively, with H, O, and Y being their past counterparts. In turn, x, y, z, appearing in queries Q15 and Q16, correspond to freeze operators, allowing us to talk about specific positions in the session and providing the capacity to relate attributes of different session events.
The answers to the questions are different depending on the clusters. The characteristics that stand out for a cluster with respect to the others have been highlighted in green in Table 5, while those less prominent but equally important features have been highlighted in yellow.
As shown, queries Q1 and Q2 especially highlight sessions in cluster 1, corresponding to the repeat or geek customers according to the initial interpretation. This indicates that the marketing campaigns mainly target this type of client (Q1, 24%) and that they mainly access the secondary sections of the website (Q2, 85%). Query Q2 also shows that the searcher customers (cluster 3) have a high percentage (44%) centered on the secondary sections. This makes sense because the use of the search engine allows for refinement of the navigation on the website, giving direct access to brands and products (which are located in the secondary categories). In addition, query Q2 indicates that the explorer customers (cluster 2) always visit the main categories at both the L1 and L2 levels, which means that customers within this profile explore the website through these more general categories.
The third query (Q3) tells us the impact that marketing campaigns have on purchases. As can be seen, the percentages are very low and only highlight two clusters: cluster 4, with 3% of sessions that end up buying from marketing campaigns, and, on the contrary, cluster 1, where there are no sessions in which a marketing campaign produces a purchase. This has a direct relationship with the initial interpretation of the clusters because cluster 4 corresponds to a buyer client profile (hence the highest percentage), while cluster 1 represents users who access the website, especially to browse secondary sections, but without a buyer profile.
Queries Q4 through Q7 allow us to study the use of the search engine of the website and its relationship with purchases. Query Q4 tells us that the sessions of clusters 1, 2 and 4 rarely use the search engine (between 71% and 93% of the sessions never use it), while in the sessions of cluster 3, the search engine is an event that does appear very frequently (92% of the sessions). This behavior verifies that the sessions in cluster 3 correspond to a searcher customer profile. Queries Q5 and Q6 allow us to go into detail regarding those sessions that use the search engine, noting that a high percentage (Q6, 28%) use it at least four times during a session. Finally, the query Q7 allows us to see the relationship between the use of the search engine and the purchase events. As shown in the table, the low percentages obtained (2% and 3% for clusters 3 and 4, respectively) indicate that the use of the search engine does not have an evident impact on the purchasing processes.
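To make the meaning of such queries more concrete, the following minimal sketch evaluates simplified versions of Q1, Q3, Q4 and Q5 directly over sessions represented as lists of event names. It is only an illustration in plain Python, not the model checking approach of [63]; the event names, the toy sessions and all function names are assumptions introduced here, and reading Q1 as "the first event of the session is a MARKETING event" is one possible interpretation of the natural-language query.

```python
from typing import Callable, List, Sequence

Session = Sequence[str]  # a session is an ordered sequence of event names


def eventually(session: Session, event: str, start: int = 0) -> bool:
    """LTL-style F: the event occurs at some position >= start."""
    return event in session[start:]


def never(session: Session, event: str) -> bool:
    """LTL-style G !event: the event occurs at no position."""
    return event not in session


def q1(session: Session) -> bool:
    # Assumed reading of Q1: the session starts with a MARKETING event.
    return len(session) > 0 and session[0] == "MARKETING"


def q3(session: Session) -> bool:
    # Q3: access through MARKETING and, later on, a PURCHASE event.
    return q1(session) and eventually(session, "PURCHASE", start=1)


def q4(session: Session) -> bool:
    # Q4: the search engine is never used.
    return never(session, "SEARCH")


def q5(session: Session) -> bool:
    # Q5: the search engine is used at least three times.
    return sum(1 for e in session if e == "SEARCH") >= 3


def percentage(sessions: List[Session], query: Callable[[Session], bool]) -> float:
    """Percentage of sessions of a cluster that fulfil the query."""
    if not sessions:
        return 0.0
    return 100.0 * sum(1 for s in sessions if query(s)) / len(sessions)


# Hypothetical sessions, for illustration only (not taken from the log).
toy_sessions: List[Session] = [
    ["MARKETING", "SECONDARY", "PURCHASE"],
    ["MAIN", "SEARCH", "SEARCH", "SEARCH"],
    ["MARKETING", "SECONDARY"],
    ["MAIN", "PRODUCT"],
]
```

Calling `percentage(toy_sessions, q1)` on the sessions of one cluster yields a figure comparable to one cell of Table 5; the freeze operators needed for Q15 and Q16 are not covered by this sketch.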
Queries Q8 through Q17 allow us to study the purchase process and its relationship with other events. Query Q8 tells us which clusters contain the sessions that make a purchase at some point. As can be observed, sessions in cluster 1 (repeat or geek customers) never make a purchase on the website. Customers from the other profiles do make purchases, but purchases are concentrated in cluster 4 (13% vs. 3% and 4% of clusters 2 and 3, respectively), which effectively corresponds to potential/prospective buyers. The same behavior is repeated in queries Q9 and Q10, which confirms the results obtained.
Queries Q11 and Q12 are very useful to elucidate the intention of the clients of the clusters. Query Q11 allows us to observe that cluster-2 sessions stand out for navigating between main and secondary sections but do not end up buying (46%). In addition, in this cluster, there is a significant number of sessions that perform at least five views of specific products (Q12, 15%). This corresponds to an explorer profile, which validates our initial hypothesis that the cluster-2 sessions correspond to the explorer customers. In turn, query Q12 gives us more information about the buyers (cluster 4): a high percentage (38%) of the sessions have at least five visits to the products, which is natural in a purchase profile.
Queries Q13 through Q17 are oriented toward the interaction with the cart. As shown, the greatest interaction with the cart occurs in the sessions of cluster 4 (Q13, 28%; Q14, 21%), where a purchase is not made in the end. This explains why the profile of cluster 4 includes prospective buyers: they are buyers who have not finished deciding but who probably end up buying the products in the next session. Queries Q15 and Q16 allow us to study what happens between two purchase events: as can be seen, there is a significant percentage (Q15, 7%) of sessions in cluster 4 in which the customer does not modify the cart between one checkout and another.
This indicates that the client started the checkout, decided to go back to review a product, but finally ended up not modifying the content of the cart. It is very interesting to be able to observe this behavior from anonymous sessions where there is no information associated with the detail of the checkout process.
Finally, query Q17 confirms that cluster 4 corresponds to buyers (or potential buyers) because 8% of their sessions make three or more purchase events at some time. The percentages in the other clusters are much lower or nonexistent (2% in clusters 2 and 3, and no sessions in cluster 1).
This process of validating the clusters through the use of queries with temporal logic allows us to validate the initial hypotheses about the clusters obtained, as well as providing additional information (use of website components such as the search engine, impact of marketing campaigns, details of the purchase process, etc.) that can be very valuable for the business expert.
From the validation, it can be concluded that the four clusters correspond to the interpretation given in the previous subsection.

VII. BEHAVIOR PREDICTION
Previous sections have established the interest in grouping users’ behaviors by means of clustering techniques. At this point, a few interesting questions concerning the relationship between clusters and the users’ behaviors appear. Is it possible to predict the cluster to which a session will belong by analyzing a few initial events? If so, how many events are required to get a good prediction? How accurate is that prediction?
Answering the previous questions is useful when one wants to modify the user’s behavior to reach some desired objectives. Let us consider, for instance, the case in which very few users of a given cluster buy products. Predicting the case after a few events is essential to apply recommendation policies with the aim of redirecting the user session towards a different cluster, one more related to the desired objective. On the contrary, if the user behavior observed so far predicts that the session is going to belong to a cluster strongly related with buyers, the interest will be in ensuring that she or he does not abandon the behavior associated with that cluster.
Table 6 shows some global data regarding the clusters and also the relations between clusters and buying sessions. Each row corresponds to a cluster. The columns correspond, respectively, to the number of sessions, the percentage of sessions with respect to the total number of sessions in the log, the number of buying sessions, the percentage of buying sessions with respect to the total number of sessions in the log, and finally, the percentage of buying sessions with respect to the total number of buying sessions. Notice that most of the buying sessions are concentrated in clusters 4 (64.81%) and 2 (23.63%). Let us consider that, after a few initial events of a session (5, for instance), the system is able to detect that the session is very likely going to belong to cluster 1 or 3. This means that it will be quite probable that the user will buy no product, and then some suggestions could be proposed to drive her or him towards one of the clusters with a larger buying probability (namely, 2 or 4).

TABLE 6. Some data about sessions and clusters.

With the aim of correlating initial user behavior and clusters, in this study some machine learning techniques have been applied. The process carried out is as follows:
• First, a vector of features for each session has been computed. The features are based on the same values used for clustering (MAIN, SECONDARY, MARKETING, INTEREST and SEARCH, as described in Section V), but the counting of event occurrences is constrained to the first n events, with n varying from 3 to 8. In this way, a different prediction model is obtained for each value of n, using the first 3 to 8 events.
• As the second step, a multilayer feed-forward network has been built. It is composed of 5 hidden layers, with 10 hidden neurons per layer. The learning algorithm applied is RProp [64], constraining its execution to up to 100 learning iterations. The network has been trained with the scaled conjugate gradient back-propagation method using 75% of randomly selected sessions (76,438 session features).
• Finally, the remaining 25% of the sessions (25,479 session features) have been used to test the quality of the resulting pattern recognition method, whose results are commented on in the following.
The results obtained are summarized in different tables. Table 7 corresponds to the confusion matrices of the prediction models based on 3 (left) and 4 events (right); Table 8 corresponds to the use of 5 and 6 events; and Table 9 corresponds to the cases of 7 and 8 events. As an example, let us describe the case of 5 events (left part of Table 8).

TABLE 7. Confusion matrices for the test phase when computing the feature for the 3 (left) and 4 (right) first events.
TABLE 8. Confusion matrices for the test phase when computing the feature for the 5 (left) and 6 (right) first events.
TABLE 9. Confusion matrices for the test phase when computing the feature for the 7 (left) and 8 (right) first events.

Rows AC1 through AC4 correspond to the clusters to which input sessions belong (actual clusters), while columns PC1 through PC4 correspond to the predicted clusters according to the trained neural network (predicted clusters). Concentrating on a row, the diagonal element corresponds to the correctly predicted sessions, while the rest are false negatives (i.e., input features that should be predicted as belonging to the cluster corresponding to the row but that have been predicted as belonging to a different one). For instance, row 2 in Table 8 (left) shows that 8,238 cluster-2 sessions (true positives) were properly predicted as belonging to that cluster, while 269, 254 and 906 cluster-2 sessions were predicted as belonging to clusters 1, 3 and 4, respectively (false negatives). The value in the Rec. column
corresponds to what is called the recall value, as a measure of the quality of the prediction for the considered cluster, and is computed with the following formula:

$$\mathrm{recall} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false negatives}}$$

Let us now concentrate on columns. Column values out of the diagonal correspond to false positives: sessions predicted as belonging to the cluster associated with the column while actually belonging to a different cluster. For instance, considering column 2 of the same table, 38, 236 and 1,305 sessions were predicted as belonging to cluster 2 when they actually belonged to clusters 1, 3 and 4, respectively. The value in the Prec. row corresponds to what is called precision, which is computed with the following formula:

$$\mathrm{precision} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false positives}}$$

Precision and recall provide insight into the prediction quality for each cluster. To measure the global quality, accuracy and Cohen’s kappa statistics are typically considered. Accuracy is computed, for the entirety of the data, as:

$$\mathrm{accuracy} = \frac{\#\text{true positives}}{\#\text{instances}}$$

Accuracy provides an intuitive global view of the quality of the predictions. Cohen’s kappa is used to measure to what degree two different systems of prediction are in agreement. In this case, it is used to compare the accuracy of the prediction system (observed accuracy) with respect to the accuracy of a random system (expected accuracy) [65], [66].
The kappa value has been computed according to the following formula [66]:

$$K = \frac{N \sum_{i=1}^{k} x_{ii} - \sum_{i=1}^{k} x_{i\cdot}\, x_{\cdot i}}{N^{2} - \sum_{i=1}^{k} x_{i\cdot}\, x_{\cdot i}}$$

where $x_{ii}$ is the number of cases in the i-th position of the main diagonal, N = 25,479 is the number of sessions, k = 4 is the number of clusters, and $x_{\cdot i}$, $x_{i\cdot}$ are the total number of sessions in the i-th column and row, respectively.

TABLE 10. Accuracy and Cohen’s kappa values for the predictions based on 3 to 8 events.

Table 10 shows, for the different models, the values of the accuracy and Cohen’s kappa statistics. Depending on the authors and the problem domain, there are different scales dividing the kappa value domain, from non-agreement to almost perfect agreement. What values of the kappa statistic are interesting? There are different interpretations. The scale in [65] takes negative values as indicating that there is no agreement, 0.01-0.20 as having little agreement, 0.21-0.40 as fair agreement, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00 as almost perfect agreement. In contrast, [67] considers that scale to be unacceptable for some domains (in healthcare research, for instance), proposing an alternative scale: 0-0.20 as no agreement, 0.21-0.39 as minimal, 0.40-0.59 as weak, 0.60-0.79 as moderate, 0.80-0.90 as strong, and above 0.90 as almost perfect.
As expected, the quality of the prediction improves when more initial events are considered for the prediction.
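As an illustration of the measures defined above, the following sketch computes recall, precision, accuracy and Cohen’s kappa from a confusion matrix whose rows are actual clusters and whose columns are predicted clusters. The function names and the small 2x2 matrix are assumptions made for this example only; the cluster-2 counts are those quoted in the text for the 5-event model.

```python
from typing import List

Matrix = List[List[int]]  # rows: actual clusters, columns: predicted clusters


def recall(cm: Matrix, i: int) -> float:
    """Diagonal cell of row i divided by the row total (TP + FN)."""
    row_total = sum(cm[i])
    return cm[i][i] / row_total if row_total else 0.0


def precision(cm: Matrix, i: int) -> float:
    """Diagonal cell of column i divided by the column total (TP + FP)."""
    col_total = sum(row[i] for row in cm)
    return cm[i][i] / col_total if col_total else 0.0


def accuracy(cm: Matrix) -> float:
    """Sum of the main diagonal divided by the number of instances."""
    n = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / n if n else 0.0


def cohen_kappa(cm: Matrix) -> float:
    """K = (N * sum x_ii - sum x_i. * x_.i) / (N^2 - sum x_i. * x_.i)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    diagonal = sum(cm[i][i] for i in range(k))
    # Sum over i of (row total i) * (column total i).
    chance = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(k))
    denominator = n * n - chance
    return (n * diagonal - chance) / denominator if denominator else 0.0


# Hypothetical 2x2 confusion matrix, for illustration only.
toy = [[40, 10],
       [5, 45]]

# Cluster-2 counts quoted in the text (5-event model, Table 8 left).
cluster2_recall = 8238 / (8238 + 269 + 254 + 906)
cluster2_precision = 8238 / (8238 + 38 + 236 + 1305)
```

For the toy matrix, the accuracy is 0.85 and the kappa value is 0.7; for cluster 2, the quoted counts give a recall of about 0.852 and a precision of about 0.839.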
navigation structure of the e-commerce website. Therefore, the changes are directed to improve the user experience during the e-shopping. The analytics tools integrated into the current version of the e-commerce provide high-level insights regarding the customers’ navigation habits, but they were not useful for performing fine-grained analysis of those behaviors. The techniques applied to the creation and validation of profiles have been demonstrated to be useful to extract those low-level insights from the server logs. Moreover, these techniques can be reused to validate the results of the changes proposed, analyzing the logs stored after those changes.
In the second phase, the results of the prediction system are used to turn the Up&Scrap e-commerce website into a dynamic application able to offer a more personalized service to the customers. The interest of the company is especially focused on improving the mechanisms of online marketing as a way of influencing the customers’ behavior during their navigation. The predictions are interpreted to determine the contents that are introduced in the banners and pop-up messages shown dynamically to each customer. These contents provide feedback to the user about the products, offers and services (for instance, workshops) that could be of interest to her/him. The advantage of this customized marketing strategy is that the e-commerce system does not require significant technical changes. Moreover, these types of improvements are also applied to the marketing by e-mail, sending more personalized product recommendations and offers that induce the customers to buy in future sessions.
Finally, the third phase is the most ambitious and involves a change in the Up&Scrap technological infrastructure. The goal is for the e-commerce system to dynamically adapt its contents and navigation structure to each customer during the shopping. These adaptations would be based on the results of the prediction system and directed to maximize the probability that the customer buys in that session. Unfortunately, the Up&Scrap e-commerce (and any solution based on Magento or other similar technologies) has a static nature that does not allow modification of its contents and structure at runtime and with the desired flexibility. Therefore, this phase would involve the development of a new version of the e-commerce system and a set of base technologies that support the dynamism required.

IX. CONCLUSION
This article has concentrated on the prediction of users’ behaviors in e-commerce websites. First, session traces have been grouped according to the similarity of a set of quantitative session parameters. After that, the resulting values for each cluster have been compared with the entire log dataset to define a user profile for each cluster. The profiles have been validated (or refined) by a closer inspection of the sessions in clusters so as to confirm or contradict the (intuitive) cluster profiling description. A later training phase has generated the prediction model that will be used for the online prediction. As a result, the cluster to which the session is going to belong can be predicted after a few initial session events.
Despite the process being applicable to a wide domain, each step requires specific adaptations when considering its application to specific cases: in the preprocessing phase, where some events are discarded for different reasons (for instance, because they are considered non-user or non-interesting events); in the clustering phase, where the vector of features as well as an adequate number of clusters must be established; in the profiling phase, where the ratios between global and cluster values are chosen to define the cluster profiles; in the validation phase, where the cluster-profile relations are evaluated and, perhaps, changed; and in the prediction model synthesis phase, where the number of initial events is chosen as an adequate parameter to obtain an accurate online prediction. Every decision taken is debatable. However, some of the steps are easily automated. Most clustering tools are able to find an optimal number of clusters, and this is also the case when looking for an adequate number of initial events for prediction, for instance.
Further research is required, and different techniques could be applied and adopted for the clustering and prediction model synthesis phases. With respect to the vector of features, the attributes considered are mainly quantitative but do not consider causal relations among events in a sequence. For instance, when counting the number of times two events, a and b, appear in a (partial) session, the possibility of a always appearing after b, or vice versa, is not distinguished, which could hint at very different behaviors, thus corresponding to different profiles. In this sense, the inclusion in the features of such types of relations (or more complex ones) using the answers to temporal logic formulas describing such relations, for instance, could be a way of improving the results.

REFERENCES
[1] R. Kohavi, “Mining E-commerce data: The good, the bad, and the ugly,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2001, pp. 8–13.
[2] N. Verma and J. Singh, “An intelligent approach to big data analytics for sustainable retail environment using apriori-MapReduce framework,” Ind. Manage. Data Syst., vol. 117, no. 7, pp. 1503–1520, Aug. 2017.
[3] I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam, and S. W. Kim, “A churn prediction model using random forest: Analysis of machine learning techniques for churn prediction and factor identification in telecom sector,” IEEE Access, vol. 7, pp. 60134–60149, 2019.
[4] G. Suchacka and G. Chodak, “Using association rules to assess purchase probability in online stores,” Inf. Syst. e-Bus. Manage., vol. 15, no. 3, pp. 751–780, 2017.
[5] Q. Su and L. Chen, “A method for discovering clusters of e-commerce interest patterns using click-stream data,” Electron. Commerce Res. Appl., vol. 14, no. 1, pp. 1–13, Jan. 2015.
[6] R. E. Bucklin and C. Sismeiro, “Click here for Internet insight: Advances in clickstream data analysis in marketing,” J. Interact. Marketing, vol. 23, no. 1, pp. 35–48, Feb. 2009.
[7] L. Guo, L. Hua, R. Jia, B. Zhao, X. Wang, and B. Cui, “Buying or browsing?: Predicting real-time purchasing intent using attention-based deep network with multiple behavior,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 1984–1992.
[8] D. Koehn, S. Lessmann, and M. Schaal, “Predicting online shopping behaviour from clickstream data using deep learning,” Expert Syst. Appl., vol. 150, Jul. 2020, Art. no. 113342.
[9] K. Močarníková and M. Greguš, “Conceptualization of predictive analytics by literature review,” in Data-Centric Business and Applications (Lecture Notes on Data Engineering and Communications Technologies), vol. 30, N. Kryvinska and M. Greguš, Eds. Cham, Switzerland: Springer, 2020, doi: 10.1007/978-3-030-19069-9_8.
[10] J. Chen and A. Abdul, “A session-based customer preference learning method by using the gated recurrent units with attention function,” IEEE Access, vol. 7, pp. 17750–17759, 2019.
[11] W. Niyagas, A. Srivihok, and S. Kitisin, “Clustering e-banking customer using data mining and marketing segmentation,” ECTI Trans. Comput. Inf. Technol., vol. 2, no. 1, pp. 63–69, Jan. 1970.
[12] H. C. C. Chai, “Online auction customer segmentation using a neural network model,” Int. J. Appl. Sci. Eng., vol. 3, no. 2, pp. 101–110, 2005.
[13] M. Namvar, M. R. Gholamian, and S. KhakAbi, “A two phase clustering method for intelligent customer segmentation,” in Proc. Int. Conf. Intell. Syst., Model. Simulation, Jan. 2010, pp. 215–219.
[14] L. B. Romdhane, N. Fadhel, and B. Ayeb, “An efficient approach for building customer profiles from business data,” Expert Syst. Appl., vol. 37, no. 2, pp. 1573–1585, Mar. 2010.
[15] M. Walters and J. Bekker, “Customer super-profiling demonstrator to enable efficient targeting in marketing campaigns,” South Afr. J. Ind. Eng., vol. 28, no. 3, pp. 113–127, Nov. 2017.
[16] A. Beheshtian-Ardakani, M. Fathian, and M. Gholamian, “A novel model for product bundling and direct marketing in e-commerce based on market segmentation,” Decis. Sci. Lett., vol. 7, no. 1, pp. 39–54, 2018.
[17] J. X. Yu, Y. Ou, C. Zhang, and S. Zhang, “Identifying interesting customers through Web log classification,” IEEE Intell. Syst., vol. 20, no. 3, pp. 55–59, May 2005.
[18] P.-H. Chou, P.-H. Li, K.-K. Chen, and M.-J. Wu, “Integrating Web mining and neural network for personalized e-commerce automatic service,” Expert Syst. Appl., vol. 37, no. 4, pp. 2898–2910, Apr. 2010.
[19] J. Wilson, S. Chaudhury, and B. Lall, “Clustering short temporal behaviour sequences for customer segmentation using LDA,” Expert Syst., vol. 35, no. 3, Jun. 2018, Art. no. e12250.
[20] S. Peker, A. Kocyigit, and P. E. Eren, “LRFMP model for customer segmentation in the grocery retail industry: A case study,” Marketing Intell. Planning, vol. 35, no. 4, pp. 544–559, Jun. 2017.
[21] M. R. Flores-Méndez, M. Postigo-Boix, J. L. Melús-Moreno, and B. Stiller, “A model for the mobile market based on customers profile to analyze the churning process,” Wireless Netw., vol. 24, no. 2, pp. 409–422, Feb. 2018.
[22] I. S. Y. Kwan, J. Fong, and H. K. Wong, “An e-customer behavior model with online analytical mining for Internet marketing planning,” Decis. Support Syst., vol. 41, no. 1, pp. 189–204, Nov. 2005.
[23] Q. Su and L. Chen, “A method for discovering clusters of e-commerce interest patterns using click-stream data,” Electron. Commerce Res. Appl., vol. 14, no. 1, pp. 1–13, Jan. 2015.
[24] S. Dhaliwal, N. N. Van, M. Dhaliwal, J. Rokne, R. Alhajj, and T. Ozyer, “Integrating SOM and fuzzy k-means clustering for customer classification in personalized recommendation system for non-text based transactional data,” in Proc. 8th Int. Conf. Inf. Technol. (ICIT), May 2017, pp. 901–908.
[32] J. Qiu, Z. Lin, and Y. Li, “Predicting customer purchase behavior in the e-commerce context,” Electron. Commerce Res., vol. 15, no. 4, pp. 427–452, Dec. 2015.
[33] R. Jia, R. Li, M. Yu, and S. Wang, “E-commerce purchase prediction approach by user behavior data,” in Proc. Int. Conf. Comput., Inf. Telecommun. Syst. (CITS), Jul. 2017, pp. 1–5.
[34] L. M. Badea, “Predicting consumer behavior with artificial neural networks,” Procedia Econ. Finance, vol. 15, pp. 238–246, Jan. 2014.
[35] M. Zeng, H. Cao, M. Chen, and Y. Li, “User behaviour modeling, recommendations, and purchase prediction during shopping festivals,” Electron. Markets, vol. 29, no. 2, pp. 263–274, Jun. 2019.
[36] H.-J. Chang, L.-P. Hung, and C.-L. Ho, “An anticipation model of potential customers’ purchasing behavior based on clustering analysis and association rules analysis,” Expert Syst. Appl., vol. 32, no. 3, pp. 753–764, Apr. 2007.
[37] N. Vanessa and A. Japutra, “Contextual marketing based on customer buying pattern in grocery E-Commerce: The case of Bigbasket.com (India),” ASEAN Marketing J., vol. 9, no. 1, pp. 56–67, 2018.
[38] N. Nishimura, N. Sukegawa, Y. Takano, and J. Iwanaga, “A latent-class model for estimating product-choice probabilities from clickstream data,” Inf. Sci., vol. 429, pp. 406–420, Mar. 2018.
[39] Y.-T. Wen, P.-W. Yeh, T.-H. Tsai, W.-C. Peng, and H.-H. Shuai, “Customer purchase behavior prediction from payment datasets,” in Proc. 11th ACM Int. Conf. Web Search Data Mining (WSDM), 2018, pp. 628–636.
[40] C. Huang, X. Wu, X. Zhang, C. Zhang, J. Zhao, D. Yin, and N. V. Chawla, “Online purchase prediction via multi-scale modeling of behavior dynamics,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 2613–2622.
[41] D. Van den Poel and W. Buckinx, “Predicting online-purchasing behaviour,” Eur. J. Oper. Res., vol. 166, no. 2, pp. 557–575, Oct. 2005.
[42] K. Shapoval and T. Setzer, “Next-purchase prediction using projections of discounted purchasing sequences,” Bus. Inf. Syst. Eng., vol. 60, no. 2, pp. 151–166, Apr. 2018.
[43] J. Li, L. Tang, A. Wang, and Z. Xu, “Online-purchasing behavior forecasting with a firefly algorithm-based SVM model considering shopping cart use,” EURASIA J. Math., Sci. Technol. Educ., vol. 13, no. 12, pp. 7967–7983, Nov. 2017.
[44] G. Liu, T. T. Nguyen, G. Zhao, W. Zha, J. Yang, J. Cao, M. Wu, P. Zhao, and W. Chen, “Repeat buyer prediction for E-Commerce,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 155–164.
[45] A. Kumar, G. Kabra, E. K. Mussada, M. K. Dash, and P. S. Rana, “Combined artificial bee colony algorithm and machine learning techniques for prediction of online consumer repurchase intention,” Neural Comput. Appl., vol. 31, no. S2, pp. 877–890, Feb. 2019.
[46] D. Li, G. Zhao, Z. Wang, W. Ma, and Y. Liu, “A method of purchase prediction based on user behavior log,” in Proc. IEEE Int. Conf. Data Mining Workshop (ICDMW), Nov. 2015, pp. 1031–1039.
[47] K. K. Boyer and G. T. M. Hult, “Customer behavior in an online ordering application: A decision scoring model,” Decis. Sci., vol. 36, no. 4, pp. 569–598, Nov. 2005.
[25] S. Palaniappan, A. Mustapha, C. F. Mohd Foozy, and R. Atan, ‘‘Customer [48] B. Shim, K. Choi, and Y. Suh, ‘‘CRM strategies for a small-sized online
profiling using classification approach for bank telemarketing,’’ Int. J. shopping mall based on association rules and sequential patterns,’’ Expert
Inform. Vis., vol. 1, nos. 4–2, p. 214, Nov. 2017. Syst. Appl., vol. 39, no. 9, pp. 7736–7742, Jul. 2012.
[26] K. Kalaidopoulou, S. Triantafyllou, A. Griva, and K. Pramatari, ‘‘Identi- [49] C. D. Francescomarino, M. Dumas, F. M. Maggi, and I. Teinemaa,
fying customer satisfaction patterns via data mining: The case of greek ‘‘Clustering-based predictive process monitoring,’’ IEEE Trans. Services
e-shops,’’ in Proc. 11th Medit. Conf. Inf. Syst. (MCIS), 2017, pp. 1–12. Comput., vol. 12, no. 6, pp. 896–909, Nov. 2019.
[27] C. Haruechaiyasak, C. Tipnoe, S. Kongyoung, C. Damrongrat, and [50] Y. S. Kim and B.-J. Yum, ‘‘Recommender system based on click stream
N. Angkawattanawit, ‘‘A dynamic framework for maintaining customer data using association rule mining,’’ Expert Syst. Appl., vol. 38, no. 10,
profiles in E-commerce recommender systems,’’ in Proc. IEEE Int. Conf. pp. 13320–13327, Sep. 2011.
e-Technol., e-Commerce e-Service, Mar. 2005, pp. 768–771. [51] Common Log Format (CLF). (1995). The World Wide Web
[28] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, ‘‘A Web Consortium (W3C). [Online]. Available: http://www.w3.org/
usage mining framework for mining evolving user profiles in dynamic Daemon/User/Config/Logging.html#common-logfileforma
Web sites,’’ IEEE Trans. Knowl. Data Eng., vol. 20, no. 2, pp. 202–215, [52] S. Hernandez, P. Alvarez, J. Fabra, and J. Ezpeleta, ‘‘Analysis of Users’
Feb. 2008. behavior in structured e-Commerce websites,’’ IEEE Access, vol. 5,
[29] G. Suchacka, M. Skolimowska-Kulig, and A. Potempa, ‘‘Classification pp. 11941–11958, 2017.
of E-Customer sessions based on support vector machine,’’ in Proc. Eur. [53] Google Analytics Help Center. Accessed: Sep. 2020. [Online]. Available:
Council Modelling Simulation (ECMS), May 2015, 594–600. https://support.google.com/analytics
[30] E. Kim, W. Kim, and Y. Lee, ‘‘Combination of multiple classifiers for the [54] G. Suchacka and G. Chodak, ‘‘Practical aspects of log file analysis for E-
customer’s purchase behavior prediction,’’ Decis. Support Syst., vol. 34, commerce,’’ in Proc. Int. Conf. Comput. Netw., 2013, pp. 562–572.
no. 2, pp. 167–175, Jan. 2003. [55] M. Adnan, M. Nagi, K. Kianmehr, R. Tahboub, M. Ridley, and J. Rokne,
[31] E. Suh, S. Lim, H. Hwang, and S. Kim, ‘‘A prediction model for the ‘‘Promoting where, when and what? An analysis of Web logs by integrating
purchase probability of anonymous customers to support real time Web data mining and social network techniques to guide ecommerce business
marketing: A case study,’’ Expert Syst. Appl., vol. 27, no. 2, pp. 245–255, promotions,’’ Social Netw. Anal. Mining, vol. 1, no. 3, pp. 173–185,
Aug. 2004. Jul. 2011.
JAVIER FABRA received the Ph.D. degree in computer science from the University of Zaragoza, Spain, in 2010. He has been an Associate Professor with the Department of Computer Science and Systems Engineering, University of Zaragoza, Spain, since 2008. His main research interests include data mining analysis techniques in the context of service-oriented computing and cloud architectures.

PEDRO ÁLVAREZ received the Ph.D. degree in computer science engineering from the University of Zaragoza, Zaragoza, Spain, in 2004. He has been a Lecture Professor with University of Zaragoza, since 2000. His current research interests include two main aspects on integration problems of network-based systems and the use of novel techniques and methodologies for solving them and the application of formal analysis techniques to the mining of event logs and databases.

JOAQUÍN EZPELETA received the M.S. degree in mathematics and the Ph.D. degree in computer science from the University of Zaragoza, Spain. He is currently a Professor with the Department of Computer Science and Systems Engineering, University of Zaragoza, where he conducts lectures on formal methods for sequential and concurrent programming and service-oriented architectures. His research interests include problems of modeling, analysis, and control synthesis for concurrent systems, the application of formal techniques to help in the development of correct distributed systems based on Internet and cloud technologies, and further the parallel processing of data and compute-intensive problems.