081 ICPMDemos2020 VDD AVisualDriftDetectionSystemforProcessMining POSTPRINT

This document describes a visual drift detection system called VDD for analyzing process drift in event logs. The system takes an event log as input and performs the following steps: 1. It extracts declarative process constraints from sub-logs of a sliding window over the full log. 2. It clusters the constraints into "behavior clusters" based on their confidence trends over time, identifying groups of constraints that change similarly. 3. It detects and characterizes four types of drift (sudden, gradual, oscillatory, trend) based on the constraint confidence trends. 4. It visualizes the drifts and their impact on process behavior through enhanced directed graphs and linked views at different levels of granularity.

VDD: A Visual Drift Detection System for Process Mining

Anton Yeshchenko, Jan Mendling
Vienna University of Economics and Business, Vienna, Austria

Claudio Di Ciccio
Sapienza University of Rome, Rome, Italy

Artem Polyvyanyy
The University of Melbourne, Melbourne, Australia

Abstract—Research on concept drift detection has inspired recent advancements in process mining and expanded the growing arsenal of process analysis tools. What has so far been missing in this new research stream are techniques that support comprehensive process drift analysis in terms of localizing, drilling down, quantifying, and visualizing process drifts. In our research, we build on ideas from concept drift, process mining, and visualization research and present a novel web-based software tool to analyze process drifts, called Visual Drift Detection (VDD). Addressing the comprehensive analysis requirements, our tool is of benefit to researchers and practitioners in the business intelligence and process analytics area. It constitutes a valuable aid to those who are involved in business process redesign projects.

Figure 1: Drift types, cf. [7, Fig. 2]

I. INTRODUCTION

Process mining is a research field that is concerned with leveraging real-world event data for providing transparency of how business processes operate. Process discovery is a branch of process mining that takes as input event logs, i.e., collections of event sequences (traces) wherein every event corresponds to an activity execution, and returns the model that best describes the process generating the event log. However, process discovery analyzes event logs without distinguishing executions that are recent from those that are far in the past. Therefore, it does not explicitly show the behavioral changes that occur in the time lapse during which those data are gathered.

These behavioral changes are commonplace in real-world scenarios and introduce additional challenges for existing process mining techniques, which usually assume stable patterns of behaviour. If a drift is present in the data, it affects all stages of process mining, namely discovery, conformance, and enhancement [1]. As a consequence, the discovered models are much more complex since they integrate behaviour that is present at different points in time. Using data affected by behavioral changes for process conformance also hinders the results, because behaviour that was the norm during a particular time span may be reported as non-compliant when the data are aggregated. Process enhancement based on an event log containing changes would produce process models annotated with information that is not significant at all time stages. All these issues with process mining techniques could be alleviated or turned into strengths by first analysing behaviour changes during process mining projects [2]. This benefits the process analyst, who might otherwise quickly suffer from models that are too complex and induce too high a cognitive load to be comprehended accurately [3].

Research on data mining has discussed changes over time and distinguishes different types of so-called drift. Drift analysis has also been considered in prior research on process mining. Recent contributions include Maaradji et al. [4], who use statistical tests to find sudden and gradual drifts; Zheng et al. [5], who transform event logs into relationship matrices and find sudden drifts with change point detection algorithms; and Ostovar et al. [6], who describe a sudden drift detection algorithm that relies on discovering a number of process trees from the event log and calculating the number of change operations needed to transform one tree into another. These papers focus on the identification of specific drift types, limited to sudden and gradual drifts. They also do not provide an interpretable solution for visualizing the content of the drifts.

In this paper, we present a technique for process drift detection, called Visual Drift Detection (VDD). VDD extends existing techniques with the following features. First, our technique not only finds sudden drifts but also helps the user to interpret the four different types of drifts shown in Fig. 1. Second, it facilitates the assessment of drifts through visual interpretation [8] with the help of an interactive visualization system. The VDD system is built to explain input data on different levels of granularity and supports brushing and linking of the visualization views. Its back-end builds on the formal rigor of the temporal logic of DECLARE constraints [9], [10] and on time series analysis [11]. Key strengths of VDD are the clustering of declarative behavioral constraints that exhibit similar trends of change over time, the automatic detection of drift points, and the automated characterization of

Postprint, October 2020


the drift types. We leverage this information about the trends in the data and represent the changes in the process behavior entailed by the drifts by means of enhanced Directly-Follows graphs [12], to provide further analysis features. These features allow us to detect and explain drifts that would otherwise go undetected by other techniques. We illustrate the usage of the VDD system on a real-world data set publicly available from the 4TU Data Centre.¹ The event log contains events from sepsis patients' pathways in the hospital [13]. We will henceforth refer to that data set as the Sepsis log.

¹ https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460

This is a tool demonstration paper illustrating the new software implementation of the VDD system. The theoretical design and evaluations of the presented system have been partially described in [14], [15]. We remark that our earlier work did not include the advanced features we present here for drift type characterization and for the visualization of the entailed change in the process behavior.

II. THE VDD APPROACH

Our technique takes an event log (henceforth, log for short) as input and conducts a step-by-step visual analysis of process drifts. It consists of five steps, which we explain through the application of our tool to the case study of the Sepsis log. Figure 2 depicts the visualization system with connected views, showing the results of these steps.

1) Input and setting of parameters

In the first step, the user provides an event log in the XES [16] format and sets the parameters of the technique that will influence what can be observed. In particular, the Win size parameter determines the granularity of the drift analysis, and more specifically the number of traces that will be included in each time window. Slide size describes the number of traces that should be skipped to calculate the next window. The system offers hover-on explanations for each parameter. An in-depth analysis of the parameters is described in [14]. After that, the technique calculates the event log statistics and automatically proposes default parameters, as shown in Fig. 2(h). The Sepsis log has 1050 cases and 15 214 events with 16 event variants. We chose a Win size of 50, a Slide size of 25, and a Cut threshold of 420 for our analysis.

2) Window-based constraints mining and time series clustering

This is a preprocessing step for the visual analysis. We split the log into sub-logs. From each resulting part of the log, we measure the degree to which a set of behavioral relations in the form of declarative process constraints holds true in each window. In particular, we resort to the well-known declarative language DECLARE, whose full repertoire of constraints is described in [17]. DECLARE constraints represent the behavior of a process by binding the occurrence of activities to the verification of certain conditions over other events in the trace. For example, Precedence(Release C, IV Liquid) states that IV Liquid can occur in the trace only if Release C occurred earlier. ChainPrecedence(Release C, Leucocytes) is a chaining constraint, which imposes that Leucocytes can occur only if Release C is the activity that occurs immediately before it (i.e., no other activities can occur in between). NotSuccession(ER Registration, IV Liquid) is a negative constraint, as it imposes that ER Registration cannot be followed by IV Liquid. For all constraints, we measure their support, confidence, and interest factor. Based on established metrics of association rule mining [18], they indicate the extent to which the constraints are satisfied in the log traces. A detailed explanation of how those measures are computed is out of scope for this paper. For further information on that matter, the interested reader can refer to [10].

Specifically, the VDD system runs a background process to calculate the measures of DECLARE constraints and group the resulting time series into behavior clusters. First, traces in the log are sorted by the timestamp of their respective first events. Thereupon, we extract a sub-log of the given Win size from the first traces. We let the window slide over the log at the given Slide size. From each sub-log, we mine the set of DECLARE constraints and compute their measures. In our case study, with the window size set to 50 and the sliding step to 25, we mine DECLARE constraints out of 41 sub-logs. For each sub-log, we compute the confidence of 3424 constraints. This step proceeds with the extraction of multi-variate time series that represent the trends of the constraints' confidence.

As a result of this step, we obtain numerous time series (one per constraint and measure), which we cluster into groups that exhibit similar confidence trends. Henceforth, we will refer to those groups as behavior clusters. In particular, we resort to hierarchical clustering [19] to find them. Figure 2(a) shows the values of the time series (i.e., the confidence measures) through the plasma color-blind friendly color map [8], from blue (low peak) to yellow (high peak). The y-axis lists the constraints; the starting timestamps of the sub-logs lie on the x-axis. Constraints are sorted vertically by the similarity of their measures' trends. White dotted horizontal lines visually separate the behavior clusters. On the Sepsis data set, the Drift Map shows 18 behavior clusters.

3) Visualization of drifts

In this step, we detect change points in the set of time series, both for the whole log and for each cluster separately. Those change points are what we identify as drift points. In the following, we will interchangeably name them change points or drift points, depending on the context. We plot drift points in Drift Maps (Figure 2(a)) and Drift Charts (Figure 2(b)) to effectively communicate the drifts to the user.

The Drift Map shown in Fig. 2(a) illustrates the detected drift points over time in the event log, which we shall collectively name the drift situation. We add vertical lines to mark such drift points. Drift Charts (e.g., those in Fig. 2(b)) have time on the x-axis and the average confidence of the constraints in a behavior cluster on the y-axis. We add vertical lines to denote drift points as in Drift Maps. In Fig. 2(b) we focus on behavior cluster 18 of the Sepsis log. We can observe
Figure 2: The user interface of the VDD system, running on the Sepsis event log [13]. (a) Drift Map. (b) Drift Chart. (c)
Autocorrelation plot. (d) Erratic measure. (e) Spread of constraints view. (f) Incremental drifts test. (g) Extended Directly-Follows
Graph. (i) Behavior cluster selection menu.

two drift points.

We also compute the values of two measures, called spread of constraints and erratic measure, to quantify the extent of the drifting behavior [14]. The spread of constraints (shown in Fig. 2(e)) intuitively indicates how variable and subject to change the event log is. The measure ranges from 0 to 1: the more the behavior changes over time, the higher the value gets. In the Sepsis log, the measured spread of constraints is 0.247, which indicates a relatively small rate of change in the behavior. The erratic measure (shown in Fig. 2(d)) shows how a chosen cluster (Fig. 2(i)) compares to the cluster with the maximum degree of change in the same log.

4) Drift type detection

In this step, we use a range of methods to analyze drift types (such as those shown in Fig. 1) and visualize them in the connected views. We use multi-variate time series change point detection algorithms to detect sudden drifts. In particular, we resort to the Pruned Exact Linear Time (PELT) algorithm [20] to detect change points in the whole multi-variate time series as well as within the behavior clusters. Thereupon, we make use of stationarity analysis together with the visual inspection of Drift Charts to highlight gradual and incremental drifts. With the aid of autocorrelation plots, we look for behavior clusters exposing reoccurring drifts.

To show the results of this step, we resort to a mix of graphical and numerical representations: the aforementioned Drift Map together with Drift Charts, autocorrelation plots, and stationarity tests. In the chosen cluster 18, the system automatically identifies two sudden drifts, as shown in the Drift Chart (Fig. 2(b)). To check for incremental drifts, we inspect the results of the stationarity test (shown in Fig. 2(f)). For the chosen behavior cluster, the VDD system reports no incremental drift. Figure 2(c) depicts an autocorrelation plot that shows how the time series correlates with itself with a step defined in the y-axis. The blue area on this plot shows the significant region of the analysis. Cluster 18 reveals an autocorrelation at step 2, meaning that the drift shows signs of seasonality and can thus be classified as a reoccurring drift.

5) Understanding the drift behavior

To get an understanding of the effect of drifts on the process behavior, we visually represent the general behavior found in the log, extended with the specific behavior shown in a chosen behavior cluster. In particular, we use the gathered information on the measured DECLARE constraints in a behavior cluster and draw it on top of Directly-Follows graphs [12] such as the one in Fig. 2(g). A Directly-Follows graph connects via arcs the activities (nodes) with those other activities that directly followed them at least once in a trace. Arcs are weighted by the number of such sequences. Nodes are weighted by the frequency with which the related activities occur in the log. The Directly-Follows graph depicts the behavior that is common to the entire event log. We add arcs highlighted
with different colors that represent additional, cluster-specific DECLARE constraints. Negative DECLARE constraints are colored in red. Chaining constraints are in green. All other relationships are in blue. For cluster 18, we see from Fig. 2(g) that activities Release C and Leucocytes occur in sequence, bound by the ChainPrecedence(Release C, Leucocytes) constraint. Furthermore, Precedence(Release C, IV Liquid) and Precedence(Release C, IV Antibiotics) suggest that IV Liquid and IV Antibiotics require Release C to occur before, unlike in the general behavior.

III. MATURITY, DOCUMENTATION AND SCREENCAST

We implemented the VDD system as a Python-based stand-alone program for command line execution, and as a web application with back-end and front-end parts. The algorithms are implemented using Python 3, resorting to the scipy library for time-series clustering and to the ruptures library for change point identification. We use PM4Py² [21] for the Directly-Follows Graph visualization. We use the MINERful³ Java package for the discovery and measuring of DECLARE constraints [10]. The front-end of the tool is implemented with the React JavaScript library. The back-end is implemented with the Flask Python library. We ran our experiments on a laptop equipped with an Intel Core i5 at 2.40 GHz × 2 and 8 GB of RAM. With this modest hardware, the tool was able to process the data and produce the analysis outcome in about 17 seconds using a real-size event log with 15 214 events from 16 activities over 1050 traces. This indicates that the VDD system has reached a fair degree of maturity, as it performs well in terms of scalability.

² http://pm4py.org, https://github.com/pm4py
³ https://github.com/cdc08x/MINERful

We have created a project website for the VDD system, from which it can be downloaded together with its sources, at https://github.com/yesanton/Process-Drift-Visualization-With-Declare. It is free for academic and non-commercial use under the MIT license. On the project website, we provide documentation on its installation and first run. The web tool with a graphical interface is also available at https://yesanton.github.io/driftvis, to be used for testing without the need to install the software on a local machine. A screencast documenting its usage is available at https://youtu.be/mHOgVBZ4Imc. The GitHub project page contains a step-by-step tutorial on how to use the web-based tool. It is available at https://github.com/yesanton/Process-Drift-Visualization-With-Declare/blob/master/publications/icpm-2020-demo-tutorial.pdf

In future work, we will focus on the prediction of drifts in running processes and on improving the interactivity of the visualization system. Furthermore, we will conduct user studies to assess the perceived quality of the tool.

Acknowledgements.

This work is partially funded by the EU H2020 program under MSCA-RISE agreement 645751 (RISE BPM). Artem Polyvyanyy is partly supported by the Australian Research Council Discovery Project DP180102839. Claudio Di Ciccio is partly supported by the MIUR under grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of Sapienza University of Rome. Anton Yeshchenko thanks Maryna Zadoianchuk and Oleksii Tkachenko for their assistance during the development of the web application.

REFERENCES

[1] W. M. P. van der Aalst, Process Mining - Data Science in Action. Springer, 2016.
[2] M. L. van Eck, X. Lu, S. J. J. Leemans, and W. M. P. van der Aalst, "PM²: A process mining project methodology," in CAiSE. Springer, 2015, pp. 297–313.
[3] R. Moreno and R. E. Mayer, "Visual presentations in multimedia learning: Conditions that overload visual working memory," in VISUAL, D. P. Huijsmans and A. W. M. Smeulders, Eds. Springer, 1999, pp. 793–800.
[4] A. Maaradji, M. Dumas, M. La Rosa, and A. Ostovar, "Detecting sudden and gradual drifts in business processes from execution traces," IEEE TKDE, vol. 29, no. 10, pp. 2140–2154, 2017.
[5] C. Zheng, L. Wen, and J. Wang, "Detecting process concept drifts from event logs," in OTM. Springer, 2017, pp. 524–542.
[6] A. Ostovar, S. J. J. Leemans, and M. L. Rosa, "Robust drift characterization from event streams of business processes," ACM Trans. Knowl. Discov. Data, vol. 14, no. 3, pp. 30:1–30:57, 2020. [Online]. Available: https://doi.org/10.1145/3375398
[7] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Comput. Surv., vol. 46, no. 4, pp. 44:1–44:37, 2014.
[8] C. Ware, Information visualization: perception for design. Elsevier, 2012.
[9] W. M. P. van der Aalst and M. Pesic, "DecSerFlow: Towards a truly declarative service flow language," in WS-FM, ser. Lecture Notes in Computer Science, vol. 4184. Springer, 2006, pp. 1–23.
[10] C. Di Ciccio and M. Mecella, "On the discovery of declarative control flows for artful processes," ACM TMIS, vol. 5, no. 4, pp. 24:1–24:37, 2015.
[11] G. C. Reinsel, Elements of multivariate time series analysis. Springer, 1993.
[12] S. J. Leemans, D. Fahland, and W. M. van der Aalst, "Discovering block-structured process models from event logs - A constructive approach," in PETRI NETS. Springer, 2013, pp. 311–329.
[13] F. Mannhardt and D. Blinde, "Analyzing the trajectories of patients with sepsis using process mining," in BPMDS/EMMSAD. CEUR-WS.org, 2017, pp. 72–80.
[14] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, "Comprehensive process drift detection with visual analytics," in ER. Springer, 2019, in print.
[15] A. Yeshchenko, C. D. Ciccio, J. Mendling, and A. Polyvyanyy, "Comprehensive process drift analysis with the visual drift detection tool," in ER Demos. CEUR-WS.org, 2019, pp. 108–112.
[16] "IEEE standard for extensible event stream (XES) for achieving interoperability in event logs and event streams," pp. 1–50, Nov 2016. [Online]. Available: http://dx.doi.org/10.1109/IEEESTD.2016.7740858
[17] W. M. P. van der Aalst and M. Pesic, "DecSerFlow: Towards a truly declarative service flow language," in WS-FM. Springer, 2006, pp. 1–23.
[18] J. Adamo, Data mining for association rules and sequential patterns - sequential and parallel algorithms, J. Adamo, Ed. Springer New York, 2001.
[19] S. Aghabozorgi, A. Seyed Shirkhorshidi, and T. Ying Wah, "Time-series clustering - a decade review," IS, vol. 53, no. C, pp. 16–38, Oct. 2015.
[20] R. Killick, P. Fearnhead, and I. A. Eckley, "Optimal detection of changepoints with a linear computational cost," Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012.
[21] A. Berti, S. J. van Zelst, and W. M. P. van der Aalst, "Process mining for python (pm4py): Bridging the gap between process- and data science," CoRR, vol. abs/1905.06169, 2019.
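As an illustration of step 2 of the approach (Section II), the following Python sketch shows how a time-ordered log could be cut into overlapping sub-logs and how a confidence time series for one constraint could be derived from them. It is not the VDD or MINERful implementation: the function names are hypothetical, and the confidence used here is a deliberately simplified, trace-based measure (the share of traces containing the activated activity that satisfy the constraint), whereas the measures actually used by the tool are defined in [10].

```python
from typing import Callable, List, Sequence

Trace = Sequence[str]  # a trace is an ordered sequence of activity names


def sliding_windows(traces: List[Trace], win_size: int, slide_size: int) -> List[List[Trace]]:
    """Cut a time-ordered list of traces into overlapping sub-logs."""
    last_start = len(traces) - win_size
    return [traces[s:s + win_size] for s in range(0, last_start + 1, slide_size)]


def precedence_holds(trace: Trace, a: str, b: str) -> bool:
    """Precedence(a, b): b may occur only if a occurred earlier in the trace."""
    seen_a = False
    for activity in trace:
        if activity == a:
            seen_a = True
        elif activity == b and not seen_a:
            return False
    return True


def chain_precedence_holds(trace: Trace, a: str, b: str) -> bool:
    """ChainPrecedence(a, b): every b is immediately preceded by a."""
    return all(i > 0 and trace[i - 1] == a
               for i, activity in enumerate(trace) if activity == b)


def simple_confidence(sub_log: List[Trace], check: Callable[[Trace, str, str], bool],
                      a: str, b: str) -> float:
    """Simplified confidence (an assumption of this sketch, not the measure of [10]):
    share of traces containing b that satisfy the constraint."""
    activated = [t for t in sub_log if b in t]
    if not activated:
        return 0.0
    return sum(check(t, a, b) for t in activated) / len(activated)


def confidence_series(traces: List[Trace], check: Callable[[Trace, str, str], bool],
                      a: str, b: str, win_size: int = 50, slide_size: int = 25) -> List[float]:
    """One confidence value per sub-log, i.e. the time series that is later clustered."""
    return [simple_confidence(w, check, a, b)
            for w in sliding_windows(traces, win_size, slide_size)]
```

With 1050 traces, a window of 50, and a slide of 25, `sliding_windows` yields 41 sub-logs, which matches the number reported for the Sepsis log. Computing `confidence_series` for every candidate constraint would give the multi-variate time series on which clustering and change point detection operate.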

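The Directly-Follows graph used in step 5 of the approach (Section II) can be illustrated with a few lines of Python. This is a minimal sketch based only on the description in the text (arcs weighted by the number of directly-follows occurrences, nodes weighted by activity frequency); the VDD system itself relies on PM4Py for this view. The function names are hypothetical, and the color rule assumes that negative templates start with "Not" and chaining templates with "Chain", which holds for the three templates named in this paper.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def directly_follows_graph(traces: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Return (node weights, arc weights) of the Directly-Follows graph of a log."""
    nodes: Counter = Counter()  # activity -> number of occurrences in the log
    arcs: Counter = Counter()   # (a, b) -> number of times b directly follows a
    for trace in traces:
        nodes.update(trace)
        arcs.update(zip(trace, trace[1:]))
    return nodes, arcs


def overlay_color(template: str) -> str:
    """Color of a cluster-specific constraint arc, following the scheme in the text."""
    if template.startswith("Not"):    # negative constraints, e.g. NotSuccession
        return "red"
    if template.startswith("Chain"):  # chaining constraints, e.g. ChainPrecedence
        return "green"
    return "blue"                     # all other relationships, e.g. Precedence
```

For instance, for a toy log with the traces (ER Registration, Release C, Leucocytes) and (ER Registration, Leucocytes), the node ER Registration has weight 2 and the arc from Release C to Leucocytes has weight 1.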