Multivariate Time Series Clustering and...
ABSTRACT
Multivariate Time Series (MTS) data obtained from large-scale systems carry resourceful information about the internal system status. Multivariate Time Series Clustering is one of the exploratory methods that can enable one to discover the different types of behavior manifested in different working periods of a system. This knowledge can then be used for tasks such as anomaly detection or system maintenance. In this study, we make use of a statistical method, Variable Order Markov Models (VOMMs), to model each individual MTS and employ a new metric to calculate the distances between those VOMMs. The pairwise distances are then used to accomplish the MTS Clustering task. Two other MTS Clustering methods are presented and the superiority of the proposed method is confirmed with experiments on two data sets from Cyber-Physical Systems. The computational complexity of the presented methods is also discussed.
KEYWORDS
Multivariate Data Analysis, Markov Models, Time Series Clustering, Industrial Systems
1. Introduction
Analysis of various industrial systems such as plants, large-scale machines and distributed networks is one of the biggest challenges of the current technological era. Such analysis can enable cost reduction and performance increase through the early detection of anomalies that may occur in the system. Data obtained from sensors integrated into such systems carry critical information about their status, which may be analyzed to deal with these challenges. Such systems contain a large number of sensors of interest, which provide a vast mass of data, mostly impractical for manual analysis. Machine Learning techniques that can automatically learn the characteristics of systems from the sensor data are therefore employed.
1 CONTACT Barış Gün Sürmeli. Email: [email protected]
2 CONTACT Borahan Tümer. Email: [email protected]
When the data points in the collected data set contain a time stamp and are ordered with respect to these stamps, they are referred to as Time Series. Time series analysis is a well-studied area in the literature (Box et al., 2015). Specifically, periodically recorded data of one system variable (possibly obtained from a sensor) is called a univariate time series, and if multiple variables are involved it is referred to as a multivariate time series (MTS). Often the time series data are recorded from the system at different intervals. The data corresponding to each time period can be called a time series object.
As one of the fundamental data analysis methods, clustering can be applied to time series to identify similar time series objects that characterize periods with significantly similar system behavior. Obtaining such information may allow us to reveal internal properties of the general system behavior. In the industrial domain, such cluster models can then be used for critical tasks such as mode or anomaly detection (Sürmeli et al., 2017; Niggemann et al., 2012) and root cause analysis (Niggemann et al., 2014).
MTS clustering is applied on data collected from systems in several fields such as dynamometers (Liao, 2007), earthquake analysis (Kakizawa et al., 1998) and cyber-physical systems (Sürmeli et al., 2017). While some of the methods deal directly with raw time series objects (Kakizawa et al., 1998), others work with higher-level abstractions/representations of them; these are also referred to as model-based methods (Singhal and Seborg, 2005; Ghassempour et al., 2014).
The characteristics or significant patterns of a specific system behavior can show up as certain sequential and/or cyclic orderings of data points in time series data. One preferable approach to represent such patterns is learning Markov Models, which come with the ability to represent the temporal dependencies between the system states. Standard Markov Chains and Hidden Markov Models either underfit or overfit if the length of these significant patterns does not correspond to the selected order of the model. Variable order Markov models (VOMMs) (Begleiter et al., 2004), which can represent the temporal dependence of variably long patterns, can be a good choice to tackle this problem.
Our novel contributions in this work are as given in the following:
• We propose an MTS Clustering solution which learns the models of MTS objects as VOMMs, then calculates the pairwise distances between those VOMMs using our novel metric given in (4) and learns a clustering model upon them.
• We propose a new VOMM distance metric, which is an extension of the PST Matching (Sürmeli et al., 2017) that we proposed in our previous study.
• We analyze, test and evaluate the performance of all three MTS Clustering methods on two industrial data sets, (1) the semi-physical Lego Lab Demonstrator Data and (2) the real-world Arçelik (Arçelik A.Ş.) Hydraulic Press Machine Data, and confirm the superiority of our method.
• We discuss the computational complexity of the MTS Clustering methods and show that the proposed method, whose complexity is comparable to that of the other methods, exhibits a better accuracy.
In Section 2, we summarize the MTS Clustering problem, VOMMs and HMMs. Related work is discussed in Section 3. In Section 4 we discuss how each MTS model is learned, how distances between these models are calculated, and how we perform clustering using the calculated distances. In Section 5 we give an extensive complexity analysis of the presented methods. Section 6 presents our experiments and discusses the results. We conclude the paper in Section 7.
2. Preliminaries
2.1. Multivariate Time Series Clustering
We define a Time Series (TS) as a set of data points that are indexed in equally spaced time order. A Multivariate Time Series (MTS) can then be defined as a TS where each data point is a d-dimensional vector. Consider a set of MTS M = {m_1, m_2, ..., m_n}, where each m_r is a matrix of size T_{m_r} × d and T_{m_r} refers to the number of data points, which can vary for each m_r. The Multivariate Time Series Clustering problem is defined as partitioning M such that each m is mapped to a cluster from a set of clusters C = {c_1, c_2, ..., c_l} in a way that similar m's are mapped to the same cluster. The similarity between two m's is expected to be calculated upon a well-defined distance metric. To limit the scope of this study we focus on MTSs that have the same dimensionality d.
2.2. Markov Models
Markov Models are used to model stochastic systems where future states are assumed to depend only on the current state of the modelled system. They are generalized to n-th order Markov models, where future states are assumed to depend on the n previous states up to the current state (Alpaydin, 2014). Given an alphabet Σ = {S_1, S_2, ..., S_k} containing the possible states of the modelled system, an n-th order Markov Model consists of the initial state probabilities P(S), where S ∈ Σ, and the transition probabilities conditioned on sub-sequences of length n, P(S_{t+1} | S_{t-n} S_{t-n+1} ... S_t).
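As an illustration of this definition (a toy sketch, not part of the original text; the function and variable names are ours), the transition probabilities of an n-th order model can be estimated from a symbolic sequence by counting contexts of length n:

```python
from collections import Counter, defaultdict

def fit_markov(sequence, n=2):
    """Estimate n-th order transition probabilities P(next | previous n symbols)."""
    context_counts = defaultdict(Counter)
    for i in range(n, len(sequence)):
        context = tuple(sequence[i - n:i])      # the n previous symbols
        context_counts[context][sequence[i]] += 1
    # Normalize counts into conditional probabilities.
    return {ctx: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for ctx, cnt in context_counts.items()}

# Example: P(next symbol | 'A','B') estimated from a toy sequence.
probs = fit_markov("ABCABCABD", n=2)
print(probs[("A", "B")])   # {'C': 0.666..., 'D': 0.333...}
```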
2.3. Variable Order Markov Models and Probabilistic Suffix Trees
An extension of Markov models are Variable Order Markov Models (VOMMs), where the conditioning order of the next state may vary for different sequences of previous states (Begleiter et al., 2004). This variation depends on the significance of the specific sub-sequences, which is extracted from the observations: as long as a sub-sequence is observed in the data sufficiently many times, the information about it is more likely to be kept in the model.
One convenient way to implement VOMMs is via PSTs (Ron et al., 1994), which are based on Suffix Trees (STs). Fig. 2 (left) shows an example of an ST. Given an input sequence, the ST contains all suffixes of the sequence, such that each traversal from the root to a specific leaf yields a specific suffix. Non-leaf nodes contain the common prefixes of the different sub-sequences of the input sequence.
Probabilistic Suffix Trees are a probabilistic abstraction of STs, where the nodes of the ST are pruned with respect to two parameters: t, the minimum number of occurrences of the sub-sequence of a node within the input sequence, and L, the maximum length of any sub-sequence of a node contained in the tree. Intuitively, t prunes suffixes with fewer witnesses from the tree, and L limits the level of detail represented in the tree. In addition to representing suffixes, PSTs contain concise information about the input sequence by storing probability information in each node: the probability vector at a node contains the probabilities of occurrence of each character following the sub-sequence of that node. Fig. 2 (right) shows an example PST and the sequence it represents.
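The following toy sketch illustrates the idea behind a pruned PST on a flat dictionary (context mapped to next-symbol probabilities); it is our simplification for illustration, not the construction algorithm of Ron et al. (1994) or Schulz et al. (2008):

```python
from collections import Counter, defaultdict

def build_pst_counts(seq, t=2, L=3):
    """Toy PST: keep each context of length <= L seen at least t times,
    together with the probability distribution of the symbol that follows it."""
    follow = defaultdict(Counter)
    counts = Counter()
    for length in range(1, L + 1):
        for i in range(len(seq) - length):
            ctx = seq[i:i + length]
            counts[ctx] += 1
            follow[ctx][seq[i + length]] += 1
    return {ctx: {s: c / sum(f.values()) for s, c in f.items()}
            for ctx, f in follow.items() if counts[ctx] >= t}

pst = build_pst_counts("ABACABBACBBACC$", t=2, L=3)
print(pst.get("BAC"))  # next-symbol probabilities after the context 'BAC'
```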
2.4. Hidden Markov Models
In HMMs (Alpaydin, 2014), the observable data sequence obtained from a system does not give direct information about the current state of the system, but gives clues about its characteristics, which are represented as a set of hidden system states. Given a set of hidden states Q and a set of observables O, the model is learned through three kinds of parameters: (1) the initial state vector Π, (2) the transition matrix A, which contains the probabilities of transitions from one state to another, and (3) the emission matrix B, which contains, for each state-observable pair, the probability that the specific observable is observed in the specific state. Conventionally, the parameters of an HMM are learned by the Baum-Welch algorithm (Dempster et al., 1977), which is an Expectation-Maximization procedure and will not be explained here.
3. Related Work
Time series (TS) analysis is a well studied field in the literature which has many application
areas such as medical informatics, (Tumer et al., 2003) industrial informatics (Box et al., 2015)
and bioinformatics (Bar-Joseph, 2004) . (Liao, 2005) presented an extensive survey on TS
Clustering problem, classifying the methods by many aspects; the way of representation of TS
i.e. if clustering is applied directly on raw TS data or a model based technique is preferred,
the distance calculation method used to compare the TSs, the clustering method used and the
dimensionality of TS data. Subsequently, methods that are presented in the following decade is
collected in (Aghabozorgi et al., 2015). While there are many and highly successful methods
which deal with Univariate TS Clustering problem such as the well-established SAX method
(Lin et al., 2003), there are relatively less studies on Multivariate TS Clustering problem.
Anomaly detection methods, including Markovian techniques which employ pairwise dissimilarities, are surveyed in (Chandola et al., 2012). (Bejerano and Yona, 2001) compared VOMMs and HMMs and showed that they have comparable predictive abilities while VOMMs have advantages regarding performance. VOMMs and their common implementation, PSTs, have mainly been used in bioinformatics for classification and prediction tasks (Begleiter et al., 2004; Bejerano and Yona, 2001; Oğul and Mumcuoğlu, 2006), but have also been employed for problems such as outlier detection (Sun et al., 2006).
Various other unsupervised methods have been proposed in industrial signal processing which make use of Hidden Markov Models, Bayesian Networks and Self-Organizing Maps (Owsley et al., 1997; Wunderlich and Niggemann, 2017; von Birgelen and Niggemann, 2018).
4. Methodology
Here we describe three different approaches to solve the MTS Clustering (MTSC) problem. All methods follow three main steps: (1) preprocessing, (2) model learning and comparison, and (3) model clustering, where in steps (1) and (3) the same procedures are applied for all of the methods (Sections 4.1, 4.5). Pairwise comparison of the MTS models in step (2) yields a dissimilarity matrix which is used for clustering in step (3). The different techniques of the three methods for step (2) are described in Sections 4.2, 4.3 and 4.4.
4.1. Preprocessing
The sampling rate is generally high compared to the rate of change in industrial systems. Due to the high sampling rate, instantaneous fluctuations/noise may be observed in the input signal. This might mask the characteristic behavior of the system manifested in the data. To remedy this we perform averaging on the concatenation of all MTSs, N_con, in the data set N. N_con is divided into frames of size a in time and each frame is replaced by its averaged value to form N_avg. For the VOMM and HMM MTS Clustering methods, PCA is applied for dimensionality reduction on the averaged d-dimensional data, which is transformed to the u-dimensional N_u where u < d. This helps to reduce the noise and to bring the number of dimensions down to a feasible number in data sets where the dimensionality is very high. Consequently, the following steps are applied on N_u for VOMM and HMM Clustering and on N_avg for PCA Clustering.
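A minimal sketch of this preprocessing step is given below, assuming numpy and scikit-learn; the frame size a and the target dimensionality u correspond to the parameters described above, while all function and variable names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(mts_list, a=8, u=5):
    """Average non-overlapping frames of size a, then reduce to u dimensions with PCA."""
    n_con = np.concatenate(mts_list, axis=0)          # N_con: concatenation of all MTSs
    n_frames = len(n_con) // a
    n_avg = n_con[:n_frames * a].reshape(n_frames, a, -1).mean(axis=1)   # N_avg
    pca = PCA(n_components=u).fit(n_avg)
    n_u = pca.transform(n_avg)                        # N_u, used by VOMM and HMM MTSC
    return n_avg, n_u, pca

mts_list = [np.random.rand(400, 20), np.random.rand(320, 20)]  # two toy MTSs, d = 20
n_avg, n_u, _ = preprocess(mts_list, a=8, u=5)
```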
4.2. VOMM MTS Clustering
Here we explain our method for MTS Clustering based on VOMMs. First, the multivariate data is discretized and transformed into a discrete, symbolic sequence, such that each data point is mapped to a symbol of a finite alphabet. This is done by training a clustering model on the complete data set (all the MTSs) so that the data points that fall into a specific cluster are mapped to a specific symbol. Following the reconstruction of the data sequence as a sequence of symbols of the finite alphabet, separate VOMM representations for each of the MTSs are learned. Then these VOMMs are compared pairwise with respect to the distance metrics described in Section 4.2.2.
4.2.1. Discretization and VOMM Learning
Discretization is done by clustering applied on N_u, where each of the l clusters is labeled with a unique symbol from a finite alphabet Σ of size l. Accordingly, by replacing the data points in N_u with the label of the cluster they fall in, a discrete sequence is obtained. Since we also cluster the learned VOMMs in the final step of our MTS Clustering method, to distinguish the clustering applied in this step we refer to it as data point clustering throughout this paper. We use Ward's Hierarchical Agglomerative Clustering (Murtagh and Legendre, 2014), which is a deterministic method with an acceptable time complexity of O(n²), where n is the number of data points.
The discretized sequence is split back apart in the same order in which N was concatenated to form N_con at the discretization step, and for each one of these discrete sequences a VOMM is learned, so that each m_i in N corresponds to a VOMM φ_i of a set Φ of VOMMs. We use an efficient PST construction algorithm (Schulz et al., 2008) to implement VOMMs.
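A sketch of the discretization step under the same assumptions (scikit-learn's Ward-linkage agglomerative clustering standing in for the data point clustering); the alphabet and the helper name are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def discretize(n_u, mts_lengths, l=6):
    """Ward clustering on all (reduced) data points; each point becomes one of l symbols,
    then the symbol sequence is split back into one sequence per MTS."""
    labels = AgglomerativeClustering(n_clusters=l, linkage="ward").fit_predict(n_u)
    symbols = np.array(list("ABCDEFGHIJ"))[labels]        # alphabet Sigma of size l
    sequences, start = [], 0
    for length in mts_lengths:                             # per-MTS lengths after averaging
        sequences.append("".join(symbols[start:start + length]))
        start += length
    return sequences

# e.g. two MTSs that contributed 50 and 40 averaged points to N_u
seqs = discretize(np.random.rand(90, 5), [50, 40], l=6)
```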
4.2.2. VOMM Comparison
We propose Enhanced PST Matching, which is an improved version of PST Matching (PSTM) (Sürmeli et al., 2017).
PSTM considers two types of distances between two PSTs: (1) the Probability Cost, which is related to differences in probability information, and (2) the Dissimilarity Cost, which is related to structural differences between the two PSTs.
Given trees T_1 and T_2, which are models of sequences A and B respectively, we define the set {A_1, A_2, ..., A_N} of sub-sequences appearing in the nodes of T_1, and similarly the set {B_1, B_2, ..., B_M} of sub-sequences in the nodes of T_2. Then the distance measure (matching cost) is defined as:
C_{T_1,T_2} = \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} \, \omega_{ij} \left( \frac{d_{ij}}{L_{ij}} I + (1 - I) \frac{\delta_{ij}}{2} \right).   (1)
This matching cost is calculated over all pairwise matchings of substrings of both trees.
Components of the above formula are as follows.
x_{ij} ∈ {0, 1} is 1 only if A_i is the closest node to B_j in the other tree with respect to the length of their sub-sequences, where the shorter one is a prefix of the longer one.
ω_{ij} is the average of the occurrence probabilities of the sub-sequences that the nodes represent: ω_{ij} = (P(A_i) + P(B_j))/2. It weights the dissimilarity value of each match among the nodes of the two trees proportionally to the importance of the corresponding nodes.
I allows for scaling between two types of matching costs: Dissimilarity & Probability. The
closer I is to 1, the higher the contribution of dissimilarity cost to the total cost and vice versa.
d_{ij} is the length difference between the contexts of nodes i and j:

d_{ij} = \left| \, |A_i| - |B_j| \, \right|,   (2)

where |X| indicates the length of the sequence X. The term is normalized by L_{ij}, which is the larger of the two context lengths.
δ_{ij} is the sum of the absolute values of the differences for each variable in the probability vectors of the two nodes:

\delta_{ij} = \sum_{k \in \Sigma} \left| (\vec{P}_{A_i})_k - (\vec{P}_{B_j})_k \right|.   (3)
In Enhanced PST Matching, ω_{ij} is excluded and W_{T_1,T_2} is introduced, which is the total number of node-pair matchings in the comparison of two trees that have a non-zero contribution to the total cost. The distance measure in Enhanced PSTM is defined as:

C_{T_1,T_2} = \frac{1}{W_{T_1,T_2}} \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} \left( \frac{d_{ij}}{L_{ij}} I + (1 - I) \frac{\delta_{ij}}{2} \right).   (4)
W_{T_1,T_2} normalizes the distance between two trees proportionally to the number of nodes being compared. The motivation is to bring the distance values calculated from the comparison of a pair of relatively long sequences and from the comparison of a pair of short sequences onto the same scale. This way, MTS Clustering is intended to work successfully even if the lengths of the MTSs in the data set are different.
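The sketch below illustrates Eq. (4) on the flat dictionary representation of PSTs produced, e.g., by the toy build_pst_counts above. The pairing rule is a simplification of the node matching (x_{ij}) described for PSTM, so this is an illustration of the cost computation rather than the exact implementation used in the paper:

```python
def enhanced_pstm(pst1, pst2, alphabet, I=0.5):
    """Enhanced PST Matching cost (Eq. 4) between two flat PSTs,
    each a dict mapping a context string to its next-symbol probability vector."""
    total, matches = 0.0, 0
    for a, pa in pst1.items():
        # candidate nodes in the other tree where one context is a prefix of the other
        candidates = [b for b in pst2 if a.startswith(b) or b.startswith(a)]
        if not candidates:
            continue
        b = min(candidates, key=lambda c: abs(len(c) - len(a)))      # closest match: x_ij = 1
        d = abs(len(a) - len(b))                                     # d_ij
        L = max(len(a), len(b))                                      # L_ij
        delta = sum(abs(pa.get(s, 0.0) - pst2[b].get(s, 0.0)) for s in alphabet)  # delta_ij
        total += I * d / L + (1 - I) * delta / 2
        matches += 1
    return total / matches if matches else 0.0                       # division by W_{T1,T2}

# Usage: enhanced_pstm(pst_1, pst_2, alphabet="ABC$") on two PSTs built as in the earlier sketch.
```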
4.3. HMM MTS Clustering
Several HMM-based methods have been proposed for MTS Analysis in the literature, such as (Ghassempour et al., 2014; Owsley et al., 1997). If the observable space is infinitely large, as in an MTS with d possibly continuous variables, it is impractical to learn an emission matrix which contains probabilities for each possible observable. To tackle this, a common approach, which is also used in this study, is to learn B as a set of multivariate Gaussian distributions fit on the observable space, where each element corresponds to one hidden state of the HMM (Pfundstein, 2011).
For each MTS, an HMM is learned and the pairwise distances between these HMMs are calculated to obtain a dissimilarity matrix. Here we use a distance metric based on the Frobenius norm. The idea and complexity are similar to those of the distance metric we discussed in 4.2.2. The Frobenius norm is the square root of the sum of squares of the elements of a matrix:

||M||_F = \sqrt{ \sum_{i} \sum_{j} a_{ij}^2 }.   (5)

The distance between two HMMs λ_1 and λ_2 is then the sum of the Frobenius norms of the differences between their transition and emission matrices.
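A sketch of this step, assuming the hmmlearn package for Gaussian-emission HMMs. The distance below sums the Frobenius norms of the differences of the transition matrices and of the per-state emission parameters (means and covariances), which is our reading of the verbal description above rather than a formula taken from the paper; note that states are compared in index order without any state alignment:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def learn_hmm(mts, n_states=4, n_iter=50):
    """Fit one Gaussian-emission HMM (Baum-Welch) to one (reduced) MTS."""
    return GaussianHMM(n_components=n_states, covariance_type="full",
                       n_iter=n_iter).fit(mts)

def hmm_distance(h1, h2):
    """Sum of Frobenius norms of differences of transition and emission parameters."""
    d = np.linalg.norm(h1.transmat_ - h2.transmat_, "fro")
    d += np.linalg.norm(h1.means_ - h2.means_, "fro")
    d += sum(np.linalg.norm(c1 - c2, "fro")
             for c1, c2 in zip(h1.covars_, h2.covars_))   # one covariance matrix per state
    return d

h1 = learn_hmm(np.random.rand(200, 5))
h2 = learn_hmm(np.random.rand(200, 5))
print(hmm_distance(h1, h2))
```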
4.4. PCA MTS Clustering
(Singhal and Seborg, 2005) proposed a PCA-based MTSC method which compares two MTS, MTS_1 and MTS_2, by two criteria: (1) the pairwise angles θ_{ij} between the first δ principal components and (2) the spatial distance between the data points.
The contribution of the angle between the principal components is calculated as follows:

S_{pca} = \frac{ \sum_{i=1}^{\delta} \sum_{j=1}^{\delta} \lambda_i^{(1)} \lambda_j^{(2)} \cos\theta_{ij} }{ \sum_{i=1}^{\delta} \sum_{j=1}^{\delta} \lambda_i^{(1)} \lambda_j^{(2)} }   (7)
To weight the contribution of the distance between the principal components, the PCs are weighted with their explained variance. This is done by multiplying the eigenvalues of the PCs being compared, since the explained variance of a PC is proportional to the corresponding eigenvalue. λ_i^{(1)} and λ_i^{(2)} are the i-th eigenvalues of the covariance matrices of MTS_1 and MTS_2, respectively.
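A sketch of how S_pca in Eq. (7) can be computed with numpy and scikit-learn; the λ values are the PCA eigenvalues (explained variances) and the cosines come from dot products of the unit-length principal component vectors. This is our illustration based on Eq. (7), not code from Singhal and Seborg (2005):

```python
import numpy as np
from sklearn.decomposition import PCA

def s_pca(mts1, mts2, delta=3):
    """Eigenvalue-weighted sum of cosines between the first delta PCs (Eq. 7)."""
    p1, p2 = PCA(n_components=delta).fit(mts1), PCA(n_components=delta).fit(mts2)
    lam1, lam2 = p1.explained_variance_, p2.explained_variance_     # eigenvalues
    # PCs are unit vectors, so dot products are cosines; absolute value since PC signs are arbitrary
    cos = np.abs(p1.components_ @ p2.components_.T)
    weights = np.outer(lam1, lam2)
    return np.sum(weights * cos) / np.sum(weights)

print(s_pca(np.random.rand(300, 6), np.random.rand(300, 6), delta=3))
```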
Criterion (1) only takes into account the spatial orientation of the data, but it is inadequate in situations where the spatial orientations of the data matrices are similar but the data points are located far apart in the data space. Therefore a spatial distance formulation, originally defined in (Singhal and Seborg, 2005), is given as follows:
S_{dist} = 2 \times \left[ 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\Phi} e^{-z^2/2} \, dz \right]   (8)
where x̄_1 and x̄_2 are the sample mean row vectors, Σ_1 is the covariance matrix of data set MTS_1, and Σ_1^{*-1} is its pseudo-inverse calculated using singular value decomposition.
4.5. Model Clustering
Based on the obtained dissimilarity matrix, clustering is applied to the models to classify the MTS. Experiments performed with different clustering methods showed that k-medoids is the best among them regarding performance. It is further observed that with k-medoids it is also easier to automatically determine the optimal/suboptimal clustering parameters (see Appendix A). To limit the scope of this study, a detailed description or experimental evaluation of the other clustering methods is not given in this paper.
K-medoids chooses data points as centers (medoids or exemplars) and minimizes an arbitrary metric of distance between these centers and the points assigned to the clusters. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster. We use a common realization of the k-medoids approach, the Partitioning Around Medoids algorithm (Kaufman and Rousseeuw, 1990).
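A minimal numpy sketch of k-medoids on a precomputed dissimilarity matrix is given below; it uses a simple alternating assignment/update scheme in the spirit of PAM rather than the full Kaufman and Rousseeuw algorithm:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Cluster objects given a pairwise dissimilarity matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(D[:, medoids], axis=1)            # nearest medoid per object
        # new medoid of each cluster: member with minimal total dissimilarity to the cluster
        new_medoids = np.array([
            np.flatnonzero(assign == c)[
                np.argmin(D[np.ix_(assign == c, assign == c)].sum(axis=1))]
            for c in range(k)])
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return assign, medoids

# D could be the VOMM, HMM or PCA dissimilarity matrix computed above.
D = np.random.rand(10, 10); D = (D + D.T) / 2; np.fill_diagonal(D, 0)
labels, medoids = k_medoids(D, k=3)
```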
5. Complexity Analysis
Averaging (see 4.1) and model clustering (see 4.5) steps are common to all of the MTSC methods. PCA is applied for dimensionality reduction in VOMM and HMM MTSC and for obtaining the δ principal components of each MTS in PCA MTSC. Since all these steps have the same complexity, we only analyze and compare the differing parts for each of them.
VOMM MTS Model Learning and Comparison: The Agglomerative Clustering method for data point clustering has the complexity O(d(nr)²) = O(dN²), where n is the average number of data points of the MTSs in the data set, r is the number of MTSs (Murtagh and Legendre, 2014) and d is the data dimensionality. The VOMM learning method (Schulz et al., 2008) has the complexity O(nL), where n is the size of the MTS and of the corresponding discrete sequence that the VOMM represents and L is the maximum order allowed in the VOMM. Construction of the models of all MTS therefore takes O(NL). VOMM comparison can be done by one-to-one comparison of each node pair in both trees, so the complexity can be written as O(g|Σ|), where g is the average number of nodes in one VOMM and |Σ| is the alphabet size. Considering that common sub-sequences in different locations of a sequence are represented in a single node in VOMMs, we know that the number of nodes can never exceed the size of the sequence (g ≤ n); therefore the worst case is bounded by O(n|Σ|). The calculation of the dissimilarity matrix takes O(nr²|Σ|). Since L << N, the complexity of VOMM MTS learning and comparison is O(dN² + r²n|Σ|).
HMM MTS Model Learning and Comparison: The Baum-Welch algorithm is O(n|Σ|²) per iteration (Alpaydin, 2014), where |Σ| is the number of hidden states. When an MTS is the input and the emissions are learned as multivariate Gaussian distributions, the complexity is dominated by the calculation of the covariance matrices if |Σ| < d. This is so since calculating the covariance matrix requires an m^T m operation, which has the complexity O(d²n). If we call the number of iterations needed for convergence W, then the complexity of learning the HMMs of all MTSs in the dataset is O(d²WN). Naturally, W is expected to grow with the growing size of the dataset. Comparing a pair of HMMs with the Frobenius norm is O(d²|Σ|), where |Σ| is the number of hidden states, since there is one covariance matrix of shape d × d for each state. Accordingly, calculating the dissimilarity matrix takes O(r²d²|Σ|). In total, HMM MTS learning and comparison takes O(d²WN + r²d²|Σ|).
PCA MTS Model Comparison: Calculation of S_pca between two MTS PCA models takes O(δ³), since the cosine of the angle between two vectors is computed from their dot product, which takes O(δ), and the cosine operation is done pairwise for all δ² pairs of principal components of the two MTSs. Computation of Φ in S_dist is dominated by the calculation of the pseudo-inverse of Σ, which is done by Singular Value Decomposition with complexity O(δ²d) (Holmes et al., 2007). Since δ²d > δ³, the computation of the dissimilarity matrix takes O(r²δ²d).
Among the expressions explained above for the three methods, the last one is highly unlikely to be larger than the first two unless the data dimensionality is extremely high. Therefore we can say that PCA MTSC is the most time-efficient method. The second parts of the first two expressions are expected to show no large difference, and in fact the terms are dominated by their first parts due to the presence of N. If the Baum-Welch algorithm (see Section 2.4) converges in few iterations then uW < N will hold, and we can conclude that HMM MTSC is practically more efficient than VOMM MTSC. Nevertheless, it is possible to use only a subset of the dataset for the preprocessing and data point clustering (see Section 4.2.1) steps, as explained in Appendix B, which can drastically decrease the complexity of VOMM MTSC.
6. Experimental Evaluation
6.1. Datasets
Lego Demonstrator Data³: The Lego Demonstrator is shown in Figure 3. A Lego piece is
carried from its initial position in the magazine to the conveyor belt and placed on one of the
two sticks.
One run of the demonstrator uses six input Lego pieces to produce two products on the two
sticks on the conveyor belt, by stacking three of the incoming pieces to the first stick and the
other three to the second stick on the conveyor belt. Input pieces are processed sequentially. To
simulate product variation in the system, the order of sticks used for placing the input piece is
varied. For example, if we name one of the sticks as 1 and the other one as 2, one run may have
the stick sequence 1→1→2→1→2→2 to place all six pieces.
This setup produces a sensor and actuator data sequence similar to a real industrial plant. For each run we obtain an MTS log consisting of voltage and current values of the control units, a touch sensor output and motor information (speed, angle, and motor command), which overall yields 20 signals. The data used in the experiments in this paper was obtained from 150 runs. Each run moves 6 Lego pieces and produces 2 products. Production sequences (i.e., the order of sticks) are logged and used as the true labels of the MTSs. We use three distinct product types whose underlying sequences are 2→1→2→1→2→1, 2→2→2→1→1→1 and 2→1→2→2→1→1 respectively, and we performed 50 runs for each product type.
3 The data set and detailed information about the demonstrator can be found at (MInD-NET Lab).
Arçelik Hydraulic Press Machine Data: The Arçelik Hydraulic Press Machine is one of the key components of the dishwasher plants of Arçelik. It shapes metal plates with pressure, which are then used to cover the cage of the dishwasher machine. The machine can stop for many reasons: planned stoppages such as holidays, or because there is an anomaly. Sensor data is continuously recorded with six different variables such as pressure, oil level and temperature. Many types of anomalies, such as electrical or mechanical anomalies, may occur while the machine is running. Each of the stoppages is recorded as either a planned stoppage or an anomaly with its root cause. These periods are used as MTSs, and the stoppage reason information is used as the true labels for applying MTS Clustering. The data set we used in the experiments is extracted from one week of the lifetime of the dishwasher plant. The sensor data of the periods during which the machine is running are the MTSs to be clustered, and the reasons of the stoppages after those periods are used as the labels. This setup is visualized in Figure 4.
In total there are 56 MTSs, with 51 recorded when the system behavior is expected to be normal, and 5 recorded in anomalous periods of 3 types: 1 mechanical, 1 electrical and 3 mold errors.
The parameters and their settings for each method are given in Table 1. For each method and data set, we perform experimental runs for all combinations of all parameter values shown.
As seen from Tables 2 and 3, VOMM MTSC achieves up to 0.75 Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) on the Marmara demonstrator data and up to 0.61 ARI on the Arçelik hydraulic machine data. HMM MTSC cannot achieve a significantly higher success than a random clustering on the Lego Demonstrator data, with at most 0.06 ARI, but it achieves up to 0.29 ARI on the Arçelik data, see Table 4. PCA MTSC did not perform better than a random clustering; at most 0.04 ARI on the demonstrator data and 0.03 on the Arçelik data were achieved. Since there is no significant success, we do not present result tables for PCA MTSC. Other combinations of the values either give worse results or show no significant difference from the closest value shown in the tables. Regarding the parameter settings, we can make the following observations.
VOMM MTSC: Since the Demonstrator data has three different types of MTS, the best results were naturally obtained with mk = 3. The success of VOMM with k = 2 on the Demonstrator data can be explained by the clusters consisting of the data points that correspond to placements on the two different sticks, see 6.1. The Arçelik Hydraulic Press Machine arm has 3 degrees of freedom, which can be related to the success on the Arçelik data with k values around 6. The best values of a were 6 and 7 for the Demonstrator Data, whereas for the Arçelik Data even 8, 9 and 10 achieved the highest results; no significant difference is observed with other values. PCA eliminates noise in the variance up to some degree in the demonstrator data, and therefore d values of 7, 10 and 15 give better results than d = 20. No such contribution is observed in the Arçelik data, which is possibly due to the fact that the enterprise-quality sensors used in the Arçelik production line tend to record the data with a lower noise rate. We observe that t values lower than 4 are not high enough for pruning away the noise, while using values higher than 6 can cause loss of information. Another important observation is that appropriate settings of t make the method more robust to the averaging. For both data sets, when t is higher than 6, even higher information loss occurs as a increases to values higher than 8. On the other hand, when t is lower than 4, for high values of a the significant information is mixed up with the noise. Naturally, as L increases, longer sub-sequences can be kept in the model, which allows characterization in higher detail (very high values bring a potential of overfitting). For both data sets, values higher than L = 7 give results similar to those at L = 7. I seems to have very little effect on the results on the Demonstrator data due to similar contributions of the probability information and the structural differences (see formula 4). Yet on the Arçelik data, values higher than 0.1 decrease the ARI to 0.29, which is the highest score achieved with HMM (the same clustering structure is obtained). This means the differences between the VOMM models are caught in the probability vectors. This can be related to the structural differences being low, which is expected in data with low noise.
HMM MTSC: Analyzing Table 4, we observe that the success increases with higher numbers of hidden states up to S = 4, but there is no improvement with S > 4. The best results are achieved at mk = 2, since HMM MTSC could actually distinguish only one of the anomalous MTS and the rest were clustered in the same cluster as the normal MTS. Here PCA did not have a high impact on the results. Averaging also had little impact on the success: a = 9 and a = 10 gave the best results, while other settings gave slightly worse results. Increasing S results in system characteristics being kept in the model at a higher level of detail. As expected, higher numbers of iterations (W ≥ 8) allow for more accurate modelling with higher levels of convergence. Combinations of a high S and a high W also allow the method to be robust to the different values of the other parameters.
The failure of all methods to achieve a perfect clustering of the MTS on these data sets may be due to several reasons. Calibrating the Lego Demonstrator components is a challenge. Bad calibrations may result in changes in the force applied by the motors and/or in the time needed to accomplish specific actions. Such changes might not be visible to the eye but may affect the complete cycle of the demonstrator and the corresponding recorded MTS. In the Arçelik Data, an assumption such as "anomalies manifest themselves in the MTS recorded right before the start of the anomaly" (see 6.1) might be inaccurate and the deviation from the normal behavior might start showing much sooner. Considering these facts, a perfectly accurate clustering is an extremely complex task with the available setup and data sets.
The poor performance of PCA MTSC shows the need for a more powerful method such as Markov Models, which can capture temporal dependencies of the internal states of the system. The HMM MTSC method is outperformed by the VOMM MTSC method, which also has a comparable time complexity. This may be expected, since VOMMs can fit better to the variations in the length of the patterns in the data, and such variations occur with a high frequency in data from the industrial domain. The drawback of VOMM MTSC is the high number of parameters; the desired settings of the clustering parameters can be estimated by clustering evaluation methods up to some degree.
7. Conclusion
The challenge of collecting labelled data from industrial systems is one of the critical issues for tasks such as anomaly detection or behavior identification applied to these systems. Therefore, detection of the system modes from sensory data which lacks human expert labels is a real-world instance of the well-studied Multivariate Time Series Clustering problem. We proposed a novel method, compared it with two existing MTS Clustering methods and confirmed the superiority of our method on two industrial data sets. Success in distinguishing normal and anomalous modes of the system is also tested by using one data set recorded in a period during which the system went through several normal and anomalous modes. We also presented an extensive complexity analysis and comparison of these three methods.
Expert knowledge can help with the setting of hyper-parameters in VOMM MTS Clustering up to some degree. L should be set to a small number, preferably L < 10, so that significant information is still represented in the model while time and space complexity are not overwhelming. While PCA has some contribution to the success, the method is in general robust to the parameter d, as shown in the experiments, where a setting with an explained variance proportion over 90% is appropriate. One should set a considering the sampling frequency of the sensor data: higher frequencies require higher values of a. One approach to set mk and k is using clustering validation techniques as presented in Appendix A. The experiments on the Demonstrator data with clustering validation gave promising results, while on the Arçelik data they gave worse results; this introduces the requirement of further study on this sub-task. Parameters t and I should be adjusted experimentally with the existing labelled data. An appropriate choice of t brings robustness to the method against the parameter a.
We conclude that VOMM MTS Clustering is prone to errors, but it is still significantly successful in identifying similarities and differences in the characteristics of the MTS in applications where labeled data are hard to obtain. This is experimentally shown on one physical demonstrator data set and one real-world industrial system data set. Clusters learned with the proposed method can be used as preliminary information/models for dealing with important tasks such as anomaly detection or root-cause analysis in industrial systems.
References
Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. (2015). Time-series clustering–a decade
review. Information Systems, 53:16–38.
Alpaydin, E. (2014). Introduction to Machine Learning. MIT Press.
Arçelik A.Ş. ”Homepage.”. http://www.arcelikas.com (Online; accessed 13-January-2019).
Bar-Joseph, Z. (2004). Analyzing time series gene expression data. Bioinformatics,
20(16):2493–2503.
Begleiter, R., El-Yaniv, R., and Yona, G. (2004). On prediction using variable order markov
models. Journal of Artificial Intelligence Research, 22:385–421.
Bejerano, G. and Yona, G. (2001). Variations on probabilistic suffix trees: statistical modeling
and prediction of protein families. Bioinformatics, 17(1):23–43.
Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time series analysis:
forecasting and control. John Wiley & Sons.
Chandola, V., Banerjee, A., and Kumar, V. (2012). Anomaly detection for discrete sequences:
A survey. IEEE Transactions on Knowledge and Data Engineering, 24(5):823–839.
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A., and Charrad, M. M. (2014). Package
‘nbclust’. Journal of Statistical Software, 61:1–36.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the em algorithm. Journal of the royal statistical society. Series B (methodological),
pages 1–38.
Ghassempour, S., Girosi, F., and Maeder, A. (2014). Clustering multivariate time series using
hidden markov models. International journal of environmental research and public health,
11(3):2741–2763.
Holmes, M., Gray, A., and Isbell, C. (2007). Fast svd for large-scale matrices. In Workshop on
Efficient Machine Learning at NIPS, volume 58, pages 249–252.
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1):193–
218.
Kakizawa, Y., Shumway, R. H., and Taniguchi, M. (1998). Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93(441):328–340.
Kaufman, L. and Rousseeuw, P. J. (1990). Partitioning around medoids. In Finding groups in
data: an introduction to cluster analysis. Wiley Online Library.
Liao, T. W. (2005). Clustering of time series data—a survey. Pattern recognition, 38(11):1857–
1874.
Liao, T. W. (2007). A clustering procedure for exploratory mining of vector time series. Pattern
Recognition, 40(9):2550–2562.
Lin, J., Keogh, E., Lonardi, S., and Chiu, B. (2003). A symbolic representation of time se-
ries, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD
workshop on Research issues in data mining and knowledge discovery, pages 2–11. ACM.
MInD-NET Lab. ”Lego Demonstrator.”. http://mind-rg.com/lego-demonstrator (Online; ac-
cessed 13-January-2019).
Murtagh, F. and Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method:
which algorithms implement ward’s criterion? Journal of classification, 31(3):274–295.
Niggemann, O., Stein, B., Vodencarevic, A., Maier, A., and Büning, H. K. (2012). Learning
behavior models for hybrid timed systems. In AAAI, pages 1083–1090.
Niggemann, O., Windmann, S., Volgmann, S., Bunte, A., and Stein, B. (2014). Using learned
models for the root cause analysis of cyber-physical production systems. In Int. Workshop
Principles of Diagnosis (DX).
Oğul, H. and Mumcuoğlu, E. Ü. (2006). SVM-based detection of distant protein structural re-
lationships using pairwise probabilistic suffix trees. Computational Biology and Chemistry,
30(4):292–299.
Owsley, L. M., Atlas, L. E., and Bernard, G. D. (1997). Self-organizing feature maps and hid-
den markov models for machine-tool monitoring. IEEE Transactions on Signal Processing,
45(11):2787–2798.
Pfundstein, G. (2011). Hidden Markov Models with Generalised Emission Distribution for the
Analysis of High-Dimensional, Non-Euclidean Data. PhD thesis, Institut für Statistik.
Ron, D., Singer, Y., and Tishby, N. (1994). Learning probabilistic automata with variable
memory length. In Proc. Computational Learning Theory, pages 35–46.
Schulz, M. H., Weese, D., Rausch, T., Döring, A., Reinert, K., and Vingron, M. (2008). Fast and
adaptive variable order markov chain construction. In International Workshop on Algorithms
in Bioinformatics, pages 306–317.
Singhal, A. and Seborg, D. E. (2005). Clustering multivariate time-series data. Journal of
Chemometrics: A Journal of the Chemometrics Society, 19(8):427–438.
Sun, P., Chawla, S., and Arunasalam, B. (2006). Mining for outliers in sequential databases. In Proc. SIAM International Conference on Data Mining, pages 94–105.
Sürmeli, B. G., Eksen, F., Dinç, B., Schüller, P., and Tümer, B. (2017). Unsupervised mode
detection in cyber-physical systems using variable order markov models. In Industrial Infor-
matics (INDIN), 2017 IEEE 15th International Conference on, pages 841–846. IEEE.
Tumer, M., Belfore, L. A., and Ropella, K. M. (2003). A syntactic methodology for auto-
matic diagnosis by analysis of continuous time measurements using hierarchical signal rep-
resentations. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics),
33(6):951–965.
von Birgelen, A. and Niggemann, O. (2018). Enable learning of hybrid timed automata in
absence of discrete events through self-organizing maps. In IMPROVE-Innovative Modelling
Approaches for Production Systems to Raise Validatable Efficiency, pages 37–54. Springer.
Wunderlich, P. and Niggemann, O. (2017). Structure learning methods for bayesian networks
to reduce alarm floods by identifying the root cause. In Emerging Technologies and Factory
Automation (ETFA), 2017 22nd IEEE International Conference on, pages 1–8. IEEE.
Appendix A. Estimating the Number of Clusters
Estimating the real number of clusters in the data is an important challenge. One solution commonly used in the literature is to find the number of clusters that optimizes a clustering validation metric. Commonly, two criteria are taken into account: (1) the distance between the objects within the clusters should be minimized and (2) the distance between the clusters should be maximized. (Charrad et al., 2014) proposed a method that calculates 30 different clustering validation metrics and applies a majority rule between them to determine the optimal number of clusters. By using the majority rule method for evaluating both data point clustering (see 4.2.1) and model clustering (see 4.5), up to 0.44 and 0.21 ARI is achieved on the Lego Demonstrator and Arçelik Press Machine data, respectively.
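A simplified sketch of this idea, using three validation indices available in scikit-learn instead of the 30 indices of NbClust; each index votes for the number of clusters it ranks best and the majority wins (the index choice and ranges are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def majority_rule_k(X, k_values=range(2, 7)):
    """Each validation index votes for the k it ranks best; return the majority k."""
    scores = {}
    for k in k_values:
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        scores[k] = [silhouette_score(X, labels),            # higher is better
                     calinski_harabasz_score(X, labels),     # higher is better
                     -davies_bouldin_score(X, labels)]       # lower is better, so negate
    votes = [max(k_values, key=lambda k: scores[k][i]) for i in range(3)]
    return Counter(votes).most_common(1)[0][0]

print(majority_rule_k(np.random.rand(200, 5)))
```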
Appendix B. Discretization on a Subset of the Data set and Interpolation on the Rest
To learn the data point clustering structure in VOMM MTSC, the preprocessing (see 4.1) and discretization (see 4.2.1) steps can be applied on a subset of the data set instead of the complete data set. After the clustering is constructed, a classifier can be trained using the clusters as classes and the cluster predictions as class labels. Using the learned preprocessing parameters and the classifier, the remaining data points in the data set can be classified. Our experiments show that for both data sets, exactly the same clustering structures and the same clustering results are obtained by learning a classifier (a Nearest Neighbor Classifier is used (Alpaydin, 2014)) with only 25% of the data. This can be a promising remedy to the high complexity of the discretization part of VOMM MTSC, uN² (see 5), since it is reduced by a factor of h² using an h times smaller data set. Nevertheless, the complexity of classification should be added to the overall complexity.
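A sketch of this subset-plus-classifier shortcut, assuming scikit-learn; the 25% fraction and the 1-nearest-neighbor classifier follow the text above, while the remaining names and values are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

def discretize_with_subset(n_u, l=6, fraction=0.25, seed=0):
    """Cluster only a subset of the points, then label the rest with a 1-NN classifier."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(n_u), size=int(fraction * len(n_u)), replace=False)
    subset_labels = AgglomerativeClustering(n_clusters=l, linkage="ward").fit_predict(n_u[idx])
    clf = KNeighborsClassifier(n_neighbors=1).fit(n_u[idx], subset_labels)
    return clf.predict(n_u)       # cluster labels / symbols for the complete data set

labels = discretize_with_subset(np.random.rand(1000, 5), l=6)
```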
Table 1. List of all parameters for all methods and their settings used for experiments.

Preprocessing:
  a: Averaging parameter, data frame size to be averaged. Settings: 1, 2, ..., 12 (both data sets)
  d: The number of dimensions the data is reduced to by PCA (for VOMM and HMM MTSC only). Settings: 7, 10, 15, 20⁴ (Demonstrator); 4, 5, 6⁵ (Arcelik)

VOMM MTS Modelling and Comparison:
  k: Number of clusters for data point clustering, see 4.2.1. Settings: 2, 3, ..., 10
  t: Minimum support parameter, see 2.3. Settings: 1, 2, ..., 9
  L: Maximum subsequence length parameter, see also 2.3. Settings: 1, 2, ..., 9
  I: Ratio of contributions of dissimilarity and probability distance, see 4.2.2. Settings: 0, 0.1, 0.2, ..., 1

HMM MTS Modelling and Comparison:
  S: The number of hidden states to be modelled, see 2.4. Settings: 2, 3, ..., 10
  W: Number of iterations of the Baum-Welch algorithm. It is more probable to achieve higher levels of convergence with a higher number of iterations. Settings: 2, 4, 6, 8, 10, 25, 50, 100, 1000, 1200, 1400, 1600, 1800, 2000

PCA MTS Modelling and Comparison:
  δ: The number of principal components to be compared, see 4.4. Settings: 7, 10, 15, 20 (Demonstrator); 4, 5, 6 (Arcelik)
  α: Ratio of contributions of the two distance types Spca and Sdist, see 4.4. Settings: 0, 0.1, 0.2, ..., 1

Model Clustering:
  mk: Number of clusters for model clustering, see 4.5. Settings: 2, 3, 4, 5, 6
Table 2. Results of VOMM MTS Clustering obtained with the combinations of parameters explained in Table 1 on Demonstrator Data. The Adjusted Rand Index⁶ (ARI) is used for scoring.
a= 6 7 8
d= 7 10 15 7 10 15 7 10 15
t L mk k= 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.1 0.2 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1
5 3 0.2 0.3 0.0 0.2 0.1 0.0 0.2 0.0 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.2 0.0 0.0 0.1 0.1 0.0 0.0 0.1 0.1 0.1 0.0 0.1
4 0.2 0.3 0.0 0.2 0.3 0.0 0.2 0.0 0.0 0.4 0.4 0.0 0.3 0.3 0.1 0.3 0.0 0.0 0.1 0.5 0.0 0.0 0.4 0.0 0.1 0.0 0.0
5 0.1 0.3 0.0 0.1 0.4 0.0 0.1 0.3 0.0 0.3 0.4 0.0 0.3 0.2 0.1 0.3 0.0 0.0 0.1 0.5 0.0 0.1 0.5 0.1 0.0 0.4 0.1
3 2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.1 0.2 0.0 0.0 0.4 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.1 0.1
6 3 0.3 0.3 0.0 0.3 0.1 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.4 0.0 0.1 0.1 0.1 0.0 0.0 0.1 0.0 0.1 0.0 0.1
4 0.3 0.3 0.0 0.3 0.4 0.0 0.3 0.0 0.1 0.4 0.1 0.0 0.2 0.1 0.1 0.4 0.0 0.1 0.1 0.1 0.0 0.0 0.4 0.0 0.1 0.0 0.1
5 0.4 0.4 0.0 0.4 0.5 0.1 0.4 0.1 0.1 0.3 0.5 0.0 0.3 0.5 0.1 0.3 0.0 0.1 0.1 0.2 0.0 0.1 0.4 0.0 0.1 0.2 0.1
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.6 0.0 0.0 0.1 0.0 0.0 0.2 0.0 0.0 0.1 0.1 0.1
7 3 0.5 0.1 0.0 0.5 0.1 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.1 0.5 0.0 0.0 0.1 0.1 0.0 0.2 0.1 0.0 0.2 0.1 0.1
4 0.5 0.5 0.0 0.5 0.4 0.1 0.5 0.0 0.0 0.3 0.1 0.0 0.2 0.1 0.1 0.3 0.0 0.2 0.1 0.1 0.0 0.2 0.1 0.0 0.2 0.1 0.0
5 0.4 0.5 0.0 0.4 0.4 0.1 0.4 0.1 0.1 0.3 0.2 0.0 0.3 0.3 0.1 0.3 0.0 0.1 0.1 0.1 0.0 0.3 0.0 0.0 0.2 0.1 0.1
2 0.2 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.1 0.2 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 3 0.2 0.3 0.0 0.2 0.1 0.0 0.2 0.0 0.1 0.3 0.0 0.0 0.3 0.0 0.0 0.2 0.0 0.0 0.1 0.1 0.0 0.0 0.1 0.0 0.1 0.0 0.1
4 0.2 0.3 0.0 0.2 0.3 0.0 0.2 0.0 0.0 0.4 0.4 0.1 0.3 0.3 0.0 0.3 0.0 0.0 0.1 0.5 0.0 0.0 0.4 0.0 0.1 0.0 0.0
5 0.3 0.3 0.0 0.3 0.4 0.0 0.3 0.3 0.0 0.3 0.4 0.1 0.3 0.2 0.0 0.3 0.0 0.1 0.1 0.5 0.0 0.1 0.5 0.0 0.0 0.4 0.0
4 2 0.3 0.0 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.4 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.1 0.0
6 3 0.3 0.3 0.0 0.3 0.1 0.0 0.3 0.0 0.1 0.3 0.0 0.0 0.3 0.0 0.0 0.4 0.0 0.0 0.1 0.1 0.0 0.0 0.1 0.0 0.1 0.0 0.1
4 0.4 0.3 0.0 0.4 0.4 0.0 0.4 0.0 0.0 0.4 0.1 0.1 0.2 0.1 0.0 0.4 0.0 0.0 0.1 0.1 0.0 0.0 0.4 0.0 0.1 0.0 0.0
5 0.3 0.4 0.0 0.3 0.5 0.0 0.3 0.1 0.0 0.3 0.5 0.1 0.3 0.5 0.1 0.3 0.0 0.0 0.1 0.2 0.0 0.1 0.4 0.1 0.1 0.2 0.0
2 0.5 0.0 0.0 0.5 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.6 0.0 0.0 0.1 0.0 0.0 0.2 0.0 0.0 0.1 0.1 0.0
7 3 0.5 0.1 0.0 0.5 0.1 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.5 0.0 0.0 0.1 0.1 0.0 0.2 0.0 0.0 0.1 0.1 0.1
4 0.4 0.5 0.0 0.4 0.4 0.0 0.4 0.0 0.0 0.3 0.1 0.1 0.2 0.1 0.0 0.3 0.0 0.0 0.1 0.1 0.0 0.2 0.1 0.0 0.2 0.1 0.0
5 0.3 0.5 0.0 0.3 0.4 0.0 0.3 0.1 0.0 0.3 0.2 0.1 0.3 0.3 0.1 0.3 0.0 0.0 0.1 0.1 0.0 0.3 0.1 0.1 0.2 0.1 0.0
2 0.2 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 3 0.2 0.3 0.0 0.2 0.1 0.0 0.2 0.0 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.2 0.0 0.0 0.1 0.1 0.0 0.0 0.1 0.1 0.1 0.0 0.0
4 0.2 0.3 0.0 0.2 0.3 0.0 0.2 0.0 0.0 0.4 0.4 0.0 0.3 0.3 0.0 0.3 0.0 0.0 0.1 0.5 0.0 0.0 0.4 0.1 0.1 0.0 0.0
5 0.3 0.3 0.0 0.3 0.4 0.0 0.3 0.3 0.0 0.3 0.4 0.0 0.3 0.2 0.0 0.3 0.0 0.0 0.1 0.5 0.0 0.1 0.5 0.1 0.0 0.4 0.0
5 2 0.3 0.0 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.2 0.0 0.0 0.2 0.0 0.0 0.4 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.0 0.0
6 3 0.3 0.3 0.0 0.3 0.1 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.3 0.0 0.0 0.4 0.0 0.1 0.1 0.1 0.0 0.0 0.1 0.1 0.1 0.0 0.0
4 0.4 0.3 0.0 0.4 0.4 0.0 0.4 0.0 0.0 0.4 0.1 0.0 0.2 0.1 0.0 0.4 0.0 0.1 0.1 0.1 0.0 0.0 0.4 0.1 0.1 0.0 0.0
5 0.3 0.4 0.0 0.3 0.5 0.0 0.3 0.1 0.0 0.3 0.5 0.0 0.3 0.5 0.0 0.3 0.0 0.1 0.1 0.2 0.0 0.1 0.4 0.1 0.1 0.2 0.0
2 0.5 0.0 0.0 0.5 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.6 0.0 0.0 0.1 0.0 0.0 0.2 0.0 0.1 0.1 0.0 0.0
7 3 0.5 0.1 0.0 0.5 0.1 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.5 0.0 0.1 0.1 0.1 0.0 0.2 0.0 0.1 0.1 0.1 0.1
4 0.4 0.5 0.0 0.4 0.4 0.0 0.4 0.0 0.0 0.3 0.1 0.0 0.2 0.1 0.0 0.3 0.0 0.1 0.1 0.1 0.0 0.3 0.1 0.1 0.2 0.1 0.0
5 0.3 0.5 0.0 0.3 0.4 0.0 0.3 0.1 0.0 0.3 0.2 0.0 0.3 0.3 0.0 0.3 0.0 0.1 0.1 0.1 0.0 0.3 0.1 0.1 0.2 0.1 0.0
2 0.2 0.0 0.1 0.2 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.2 0.1 0.2 0.1 0.0 0.0 0.1 0.1 0.1 0.1 0.0 0.0
5 3 0.2 0.1 0.1 0.2 0.1 0.0 0.2 0.1 0.0 0.0 0.1 0.1 0.0 0.1 0.0 0.2 0.1 0.1 0.1 0.1 0.0 0.1 0.1 0.1 0.1 0.1 0.0
4 0.2 0.1 0.1 0.2 0.1 0.1 0.2 0.0 0.1 0.0 0.2 0.1 0.0 0.1 0.1 0.3 0.2 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.0
5 0.3 0.1 0.1 0.3 0.1 0.1 0.3 0.1 0.1 0.0 0.2 0.1 0.1 0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.1 0.2 0.2 0.2 0.1 0.1 0.1
6 2 0.3 0.1 0.1 0.3 0.1 0.0 0.3 0.1 0.0 0.5 0.3 0.1 0.4 0.0 0.0 0.4 0.2 0.1 0.3 0.2 0.0 0.4 0.2 0.1 0.3 0.1 0.0
6 3 0.3 0.2 0.1 0.3 0.2 0.1 0.3 0.1 0.0 0.3 0.2 0.2 0.2 0.1 0.2 0.4 0.1 0.1 0.3 0.3 0.1 0.4 0.3 0.2 0.3 0.3 0.1
4 0.4 0.2 0.1 0.4 0.2 0.1 0.4 0.2 0.1 0.4 0.2 0.1 0.1 0.2 0.2 0.4 0.3 0.1 0.3 0.2 0.1 0.3 0.4 0.1 0.3 0.3 0.1
5 0.3 0.2 0.1 0.3 0.2 0.1 0.3 0.1 0.1 0.3 0.3 0.1 0.1 0.2 0.1 0.3 0.2 0.1 0.3 0.2 0.1 0.3 0.3 0.1 0.3 0.2 0.1
2 0.5 0.1 0.2 0.5 0.1 0.0 0.5 0.1 0.0 0.5 0.3 0.1 0.4 0.3 0.0 0.6 0.3 0.3 0.2 0.3 0.0 0.2 0.2 0.1 0.2 0.2 0.0
7 3 0.5 0.2 0.2 0.5 0.2 0.1 0.5 0.2 0.1 0.71 0.2 0.1 0.71 0.2 0.2 0.5 0.3 0.2 0.75 0.3 0.1 0.6 0.3 0.2 0.7 0.3 0.1
4 0.4 0.2 0.2 0.4 0.2 0.2 0.4 0.3 0.0 0.4 0.3 0.2 0.3 0.5 0.2 0.3 0.3 0.1 0.4 0.5 0.1 0.4 0.3 0.2 0.4 0.3 0.2
5 0.3 0.2 0.2 0.3 0.2 0.2 0.3 0.3 0.1 0.4 0.5 0.2 0.4 0.5 0.3 0.3 0.3 0.1 0.3 0.5 0.2 0.3 0.4 0.1 0.3 0.3 0.2
4 Approximately 80%, 85%, 90% and 100% of the variance is explained respectively.
5 Highest three possible values.
6 Only the highest numbers are shown in three decimal places and the rest are rounded to two.
Table 3. Results (ARI) of VOMM MTS Clustering on Arcelik Press Machine Data
a= 8 9 10
d= 4 5 6 4 5 6 4 5 6
t L mk k= 5 6 7 5 6 7 5 6 7 5 6 7 5 6 7 5 6 7 5 6 7 5 6 7 5 6 7
3 0.3 -0.1 0.1 0.1 0.0 0.1 0.2 0.1 0.1 0.1 0.1 0.0 0.2 0.2 0.1 0.2 0.0 0.1 0.2 0.1 -0.1 0.2 -0.1 -0.1 0.1 0.1 -0.1
5 4 0.2 0.0 0.1 0.1 0.0 0.0 0.2 0.2 0.1 0.0 0.1 0.0 0.1 0.2 0.1 0.1 0.0 0.1 0.3 0.0 -0.1 0.2 0.1 0.1 0.1 0.1 0.1
5 0.2 0.0 0.1 0.0 0.0 0.0 0.3 0.3 0.0 0.0 0.1 0.0 0.2 0.3 0.1 0.0 0.0 0.1 0.2 0.0 -0.1 0.3 0.1 0.1 0.1 0.1 0.0
4 3 0.53 0.1 0.2 0.1 0.4 0.1 0.1 0.4 0.1 0.1 0.2 0.1 0.2 0.1 0.1 0.1 0.0 0.1 0.4 0.1 0.0 0.4 0.2 0.61 0.0 0.2 0.2
6 4 0.4 0.1 0.2 0.0 0.3 0.3 0.1 0.5 0.0 0.1 0.2 0.1 0.2 0.1 0.1 0.0 0.0 0.0 0.4 0.1 0.0 0.3 0.3 0.53 0.0 0.2 0.3
5 0.4 0.0 0.3 0.0 0.3 0.3 0.1 0.4 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.0 0.1 0.1 0.3 0.2 0.0 0.3 0.3 0.47 0.0 0.3 0.3
3 0.4 0.1 0.1 0.0 0.4 0.4 0.1 0.4 0.0 0.1 0.2 -0.1 0.2 0.1 0.2 0.0 0.1 0.1 0.3 0.1 -0.1 0.4 0.3 0.4 -0.1 0.4 0.4
7 4 0.3 0.2 0.2 0.0 0.3 0.3 0.1 0.5 0.0 0.0 0.2 0.1 0.2 0.2 0.2 0.0 0.0 0.0 0.2 0.2 -0.1 0.3 0.3 0.3 0.0 0.3 0.3
5 0.3 0.2 0.2 0.1 0.3 0.3 0.1 0.4 0.2 0.0 0.1 0.2 0.1 0.2 0.1 0.0 0.1 0.0 0.2 0.1 -0.1 0.3 0.2 0.3 0.0 0.3 0.3
3 0.3 0.1 0.1 0.1 0.1 0.1 0.1 0.3 0.1 0.2 0.1 0.1 0.1 0.2 0.1 0.2 0.0 0.1 0.3 0.0 0.1 0.4 0.1 0.2 0.1 0.2 0.2
5 4 0.4 0.0 0.1 0.0 0.1 0.3 0.3 0.4 0.0 0.1 0.1 0.1 0.2 0.2 0.1 0.1 0.0 0.1 0.3 0.0 0.0 0.3 0.1 0.3 0.0 0.2 0.1
5 0.4 0.1 0.1 0.0 0.3 0.2 0.4 0.4 0.0 0.1 0.1 0.0 0.4 0.2 0.1 0.0 0.0 0.0 0.2 0.0 0.0 0.2 0.1 0.3 0.0 0.1 0.1
5 3 0.4 0.2 0.1 0.0 0.1 0.1 0.1 0.53 0.1 0.1 0.1 -0.1 0.2 0.2 0.1 0.1 0.0 0.1 0.2 0.2 0.1 0.1 0.4 0.3 0.0 0.4 0.1
6 4 0.3 0.1 0.3 0.0 0.1 0.1 0.1 0.4 0.0 0.1 0.1 -0.1 0.2 0.4 0.1 0.0 0.0 0.1 0.2 0.3 0.0 0.2 0.3 0.4 0.0 0.3 0.1
5 0.4 0.2 0.1 0.0 0.1 0.1 0.1 0.4 0.1 0.0 0.1 0.1 0.1 0.3 0.1 0.0 0.0 0.0 0.2 0.2 0.0 0.2 0.3 0.4 0.0 0.2 0.1
3 0.1 0.1 0.1 0.0 0.1 0.1 0.1 0.3 0.2 0.0 0.0 -0.1 0.3 0.2 0.0 0.1 0.0 0.1 0.2 0.3 0.1 0.1 0.1 0.2 0.0 0.1 0.2
7 4 0.1 0.1 0.3 0.0 0.1 0.1 0.1 0.4 0.4 0.0 0.2 -0.1 0.3 0.3 0.2 0.0 0.0 0.0 0.2 0.3 0.1 0.2 0.2 0.3 0.0 0.2 0.3
5 0.1 0.1 0.2 0.0 0.1 0.1 0.1 0.4 0.3 0.0 0.1 0.1 0.2 0.3 0.1 0.0 0.1 0.0 0.2 0.2 0.1 0.2 0.2 0.3 0.0 0.2 0.3
3 0.2 0.2 0.0 0.1 0.4 0.5 0.2 0.3 0.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.3 0.1 0.0 0.1 -0.1 -0.1 0.0 -0.1 -0.1
5 4 0.4 0.1 0.0 0.0 0.3 0.4 0.1 0.4 0.0 0.0 0.1 0.1 0.1 0.2 0.1 0.1 0.0 0.1 0.5 0.1 0.0 0.2 -0.1 0.0 0.0 -0.1 0.0
5 0.3 0.1 0.0 0.0 0.3 0.4 0.2 0.4 0.0 0.0 0.1 0.1 0.1 0.3 0.1 0.0 0.0 0.0 0.4 0.2 0.0 0.2 0.1 0.0 0.0 0.1 0.3
6 3 0.4 0.2 0.0 0.0 0.1 0.3 0.2 0.3 0.2 0.1 0.1 0.2 0.5 0.2 0.1 0.0 0.1 0.1 0.2 0.3 0.0 0.1 0.1 0.0 0.0 0.1 0.1
6 4 0.3 0.1 0.1 0.0 0.1 0.3 0.3 0.4 0.2 0.0 0.1 0.2 0.4 0.4 0.1 0.0 0.1 0.0 0.3 0.3 0.0 0.2 0.0 0.0 0.0 0.1 0.2
5 0.3 0.1 0.2 0.0 0.1 0.2 0.3 0.4 0.1 0.0 0.1 0.1 0.4 0.3 0.1 0.0 0.1 0.0 0.3 0.2 0.0 0.2 0.2 0.1 0.0 0.2 0.2
3 0.1 0.1 0.1 0.0 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.2 0.2 0.0 0.0 0.1 0.0 0.1 0.3 0.0 0.1 0.1 0.1 0.0 0.1 0.1
7 4 0.1 0.1 0.2 0.0 0.3 0.3 0.3 0.3 0.1 0.0 0.1 0.1 0.3 0.3 0.2 0.0 0.0 0.0 0.2 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.2
5 0.1 0.1 0.2 0.1 0.3 0.3 0.3 0.3 0.2 0.0 0.1 0.1 0.3 0.3 0.1 0.0 0.0 0.0 0.2 0.2 0.0 0.2 0.2 0.2 0.1 0.2 0.2
Table 4. Results (ARI) of HMM MTS Clustering on Arcelik Press Machine Data
a= 6 7 8 9 10
S W mk d= 4 5 6 4 5 6 4 5 6 4 5 6 4 5 6
2 0.02 0.03 0.03 0.02 0.02 0.02 -0.01 0.00 -0.01 0.00 0.00 0.00 -0.01 -0.01 0.00
4 3 -0.01 -0.03 -0.03 -0.02 -0.02 -0.02 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.02 0.04
4 -0.02 -0.05 -0.04 -0.01 -0.02 0.00 -0.01 0.00 -0.02 0.01 0.01 0.00 0.00 0.00 0.01
2 0.02 0.03 0.03 0.02 0.02 0.02 -0.01 0.00 -0.01 0.00 0.00 0.00 -0.01 -0.01 0.00
6 3 -0.01 -0.03 -0.03 -0.02 -0.02 -0.02 0.00 0.00 0.00 0.01 0.01 0.00 0.02 0.02 0.04
2 4 -0.02 -0.04 -0.04 -0.01 -0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.02
2 0.02 0.03 0.04 0.02 0.02 0.03 -0.01 0.00 -0.01 0.00 0.00 0.00 -0.01 -0.01 0.00
8 3 -0.01 -0.03 -0.03 -0.02 -0.02 -0.02 0.00 0.00 0.00 0.01 0.01 0.00 0.02 0.02 0.04
4 -0.02 -0.04 -0.04 -0.01 -0.02 -0.03 0.00 0.00 0.00 0.00 0.00 -0.01 0.00 0.00 0.02
2 0.08 0.08 0.12 0.01 0.02 0.01 0.01 0.01 0.01 0.02 0.02 0.01 -0.01 -0.01 0.00
10 3 -0.01 -0.01 0.00 -0.03 -0.02 -0.03 -0.01 0.00 -0.01 -0.02 -0.03 -0.03 0.03 -0.04 0.00
4 -0.03 -0.02 -0.01 -0.03 -0.01 -0.02 -0.01 -0.02 -0.01 -0.02 -0.02 -0.02 0.00 0.00 -0.01
2 -0.01 0.00 -0.01 0.04 0.06 0.01 0.04 0.29 0.29 0.02 0.14 0.29 0.02 0.29 0.29
4 3 0.00 0.02 0.02 -0.01 -0.01 0.00 0.01 0.08 0.05 0.00 0.00 0.08 0.07 0.04 0.04
4 0.00 0.01 0.00 -0.03 -0.02 -0.01 -0.01 0.03 0.02 -0.01 -0.02 0.04 0.04 0.08 0.02
2 -0.01 0.00 -0.01 0.29 0.06 0.01 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29
6 3 0.00 0.02 0.03 0.11 0.00 0.00 0.11 0.08 0.05 0.08 0.05 0.07 0.07 0.04 0.04
3 4 0.00 0.01 0.00 0.03 -0.02 0.00 0.05 0.03 0.02 0.04 0.03 0.04 0.04 0.08 0.02
2 -0.01 0.00 -0.01 0.29 0.06 0.01 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29
8 3 0.00 0.02 0.03 0.11 0.00 0.00 0.11 0.08 0.06 0.08 0.05 0.07 0.07 0.04 0.04
4 0.00 0.01 0.00 0.03 -0.02 0.00 0.05 0.03 0.00 0.04 0.03 0.04 0.05 0.09 0.02
2 0.02 0.02 0.00 0.29 0.00 0.02 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29
10 3 0.04 0.00 0.01 0.12 0.03 0.00 0.09 0.10 0.06 0.06 0.05 0.06 0.04 0.05 0.05
4 0.00 0.00 0.00 0.03 0.01 -0.01 0.04 0.01 0.01 0.01 0.01 0.04 0.03 0.03 0.06
2 0.01 0.06 0.01 0.00 -0.01 -0.01 -0.01 0.03 -0.03 0.01 0.00 0.00 -0.02 0.29 0.29
4 3 0.02 0.01 -0.02 -0.02 0.01 0.00 -0.01 0.02 0.00 -0.01 -0.02 0.05 0.01 0.04 0.05
4 -0.01 0.00 -0.03 -0.01 -0.01 -0.02 0.01 0.02 0.02 0.00 0.01 0.02 0.03 0.02 0.03
2 0.01 0.07 0.01 0.00 -0.02 -0.01 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29
6 3 0.06 0.01 -0.02 -0.02 0.01 -0.01 0.04 0.04 0.04 0.07 0.05 0.05 0.04 0.04 0.05
4 4 0.04 0.00 -0.03 0.01 -0.01 -0.02 0.03 0.04 0.03 0.03 0.01 0.02 0.05 0.03 0.03
2 0.29 0.02 0.29 0.29 -0.02 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29
8 3 0.06 0.01 0.06 0.05 0.01 0.04 0.06 0.05 0.04 0.07 0.05 0.05 0.04 0.04 0.05
4 0.04 0.00 0.04 0.02 -0.01 0.03 0.03 0.05 0.03 0.03 0.01 0.03 0.05 0.03 0.03
2 0.29 -0.01 0.29 0.29 0.01 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29 0.29
10 3 0.08 0.01 0.10 0.04 0.02 0.04 0.06 0.04 0.06 0.05 0.05 0.10 0.15 0.04 0.06
4 0.04 0.04 0.02 0.06 0.02 0.04 0.04 0.02 0.03 0.01 0.04 0.05 0.03 0.01 0.02
Figure 1. PST Matching Example: all matchings which can make a non-zero contribution to the total matching cost are shown.
Figure 2. Example Suffix Tree (left) and Probabilistic Suffix Tree (PST) (right) of the
sequence ‘ABACABBACBBACC$’. PST is created with pruning parameters t = 2 and
L = 3. Probability vectors represent next symbol probabilities for the sequence of symbols
(A, B, C, $) where $ indicates end-of-sequence.
Figure 4. Demonstration of the setup of the Arcelik Hydraulic Press Machine Data