Big Data, Data Mining and Data Science
Intelligent Computing
Edited by
Leonidas Deligiannidis
Hamid R. Arabnia
Volume 2
ISBN 9783111344072
e-ISBN (PDF) 9783111344553
e-ISBN (EPUB) 9783111344584
Bibliographic information published by the Deutsche
Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the
Deutsche Nationalbibliografie; detailed bibliographic data are
available on the Internet at http://dnb.dnb.de.
© 2025 Walter de Gruyter GmbH, Berlin/Boston
Preface
It is a pleasure to present an eclectic collection of papers on Big
Data, Data Mining, and Data Science within the Intelligent
Computing (De Gruyter) book series. In an era where data reigns
supreme, harnessing its value has become necessary in
research, business, government, and the military. The insights from
analyzing large datasets have revolutionized industries, driven
innovation, and transformed how we understand and interact
with the world and technology ecosystems.
From a computing perspective in our data-rich world, Big
Data, Data Mining, and Data Science collectively leverage data to
uncover hidden knowledge and solve complex problems. Big
Data deals with the vast volumes, velocity, and variety of
structured and unstructured data. Data Mining focuses on
extracting meaningful patterns and insights from large datasets
for predictive modeling and decision support. Data Science aims
to extract actionable insights using various techniques to solve
complex problems and drive decision-making. These techniques
are applied to diverse problems and domains, such as the
financial sector, healthcare, e-commerce, and cybersecurity.
The work presented in this book can be loosely categorized
into two distinct themes. The first theme is “methods and
instrumentation,” where authors provide insight into systematic
methods, procedures, and techniques within a research or
experimental framework and the tooling to measure, observe, or
manipulate variables of interest. In this thematic collection,
papers explore a range of topics such as hypergraph
databases, automated determination of cluster numbers for high-
dimensional big data, centrality metrics for identifying dominant
factors in datasets, machine learning-based data preprocessing
approaches, estimation of time-series outliers using multi-objective
optimization with non-stationary means, and the development of
languages for generating random data to facilitate random testing
of hardware and software applications.
The second theme is “applications and case studies,” where
authors apply and implement theories, techniques,
methodologies, and technologies in specific contexts,
showcasing their practical relevance and effectiveness. In this
thematic collection, papers explore a range of topics such as
using high-volume dynamic ensemble-based model computations in
e-commerce, deploying explainable artificial intelligence (AI) to
explain an assessment analytics algorithm for free text
exams, using graph neural networks (NN) and gene interaction
data, applying recurrent neural network (RNN) models to examine
the volatility in financial markets during a global
pandemic, using skill-centered qualification ontologies to support
data mining of human resources in knowledge-based enterprise
process-representations, extracting information from vibration
sensor data using topological data analysis, leveraging generative
AI (GenAI) and table arrangement techniques to analyze newspaper
stories for stock price insight, creating metadata schemas for data
reservoirs, and exploring the discrimination capabilities of a set of
features for road surface classification.
The book is mainly composed of selected papers that were
accepted for the 2022 and 2023 International Conferences on
Computational Science and Computational Intelligence (CSCI:
December, Las Vegas, USA) and the 2023 International
Conference on Data Science (CSCE/ICDATA: July, Las Vegas, USA).
Selected authors were given the opportunity to submit the
extended versions of their conference papers for publication
consideration in this book. An important mission of CSCI and
CSCE annual conferences includes “Providing a unique platform
for a diverse community of constituents composed of scholars,
researchers, developers, educators, and practitioners. The Congress
makes a concerted effort to reach out to participants affiliated with
diverse entities (such as universities, institutions, corporations,
government agencies, and research centers/labs) worldwide. The
Congress also attempts to connect participants from institutions
that have teaching as their main mission with those who are
affiliated with institutions that have research as their main mission.
The Congress uses a quota system to achieve its institution and
geography diversity objectives.” Since this book comprises the
extended versions of the accepted papers of CSCI and CSCE
annual conferences, it is no surprise that it has chapters from a
highly qualified and diverse group of authors.
Aside from recognizing the authors who provided their
research contributions, we are also grateful to the many
colleagues who offered their time and effort in organizing the
CSCI and CSCE conferences. Their help was instrumental in the
formation of this book. The editorial committee members
appear on the CSCI and CSCE’s websites. Finally, we want to
thank Steve Elliot (De Gruyter Editor) and Aleksandra Ślosarczyk
(De Gruyter Editorial Project Manager) for their continuous
support throughout the development and production of the
book.
We hope our readers find as much value and insight in this book as we have.
George Dimitoglou, Ph.D.
Department of Computer Science & IT,
Hood College,
401 Rosemont Ave, Frederick, MD 21701, USA
Leonidas Deligiannidis, Ph.D.
School of Computing and Data Science,
Wentworth Institute of Technology,
550 Huntington Ave, Boston, MA 02115, USA
Hamid R. Arabnia, Ph.D.
University of Georgia,
415 GSRC, Computer Science,
Athens, GA 30602-7404, USA
Methods and instrumentation
1 Identifying and estimating outliers in time series with
nonstationary mean through multiobjective optimization
method
Koki Kyo
Acknowledgment: This work was supported by a Grant-in-Aid for Scientific Research (C)
(19K01583) from the Japan Society for the Promotion of Science.
Abstract
This study introduces a method for detecting and estimating outliers in nonstationary time series
data. We begin by reviewing a decomposition technique to separate time series into trend and
stationary components with outliers. A multiobjective optimization approach that combines the
minimum ISU (index of symmetry and uniformity) and maximum likelihood AR modeling methods
through a composite objective function is proposed. The method balances outlier detection and
model selection. We demonstrate its effectiveness by applying it to analyze monthly time series
data for index of industrial production in Japan from January 1975 to December 2019. Our results
highlight its robust performance in addressing anomalies and making informed model selections,
paving the way for future extensions to various domains.
Keywords: autoregression model, outliers in variables, identification and estimation of outliers,
moving linear model approach, multiobjective optimization
1.1 Introduction
Time series modeling methods have been widely applied across various research fields, serving
diverse purposes. These applications span from extracting seismic signals in earthquake data
[→7] to conducting thorough analyses of business cycles [→10, →13].
In recent times, anomaly detection has become a crucial focus in various research domains,
placing particular emphasis on time series analysis [→12]. Anomaly detection within the scope of
time series analysis entails the identification and estimation of unexpected values or abrupt
fluctuations within temporal data.
This paper delves into the overarching theme of anomaly detection, specifically honing in on
the task of identifying and estimating outliers within time series that display a nonstationary
mean. The main challenge arises in handling time series data characterized by a nonstationary
mean, a scenario in which conventional approaches are typically tailored for stationary structures
in time series.
To address this challenge, we utilize the moving linear model approach introduced by Kyo and
Kitagawa [→9]. This approach enables the decomposition of the target time series into two
distinct components: one that captures the nonstationary mean and another that encapsulates
the stationary term housing the outliers. Subsequently, we present a novel method for identifying
and estimating outliers within the stationary components.
Autoregression (AR) modeling emerges as a valuable approach for analyzing stationary time
series data [→2]. An AR model provides a linear framework for defining a set of coefficients,
making them amenable to straightforward estimation through least squares fitting [→6].
However, the presence of outliers within a time series can significantly compromise the efficiency
of the least squares fitting process. The impact of outliers on parameter estimation has been
extensively discussed in works by Pena [→11] and Fox [→3].
To address the challenges posed by outliers, several novel outlier detection methods have
been proposed, as exemplified by Vishwakarma et al. [→14], and the references therein. While
these methods provide value, many of them are highly specialized and may lack practical
applicability.
In Kyo [→8], we introduced an innovative approach for identifying and estimating outliers
within time series data. Our approach seamlessly integrates an autoregressive (AR) modeling
framework based on the maximum likelihood method, offering several notable characteristics: (1)
When combined with the constrained-remaining component decomposition method proposed by
Kyo and Kitagawa [→9], this method can also be applied to data with trends, thus expanding its
range of applicability. (2) The model structure adopted in our approach is straightforward and
easily comprehensible, making it highly practical and applicable in real-world scenarios. (3)
Differing from many previous studies that treat outliers as concealed factors influencing variable
variances or errors through robust estimation, our approach offers a unique perspective. We treat
outliers as explicit parameters, simultaneously identifying and estimating them alongside the
model coefficients. This approach not only enables precise outlier localization within the time
series but also provides insights into their magnitudes, enhancing our understanding of the
underlying time series dynamics.
In many instances, outliers arise from sudden shocks, and their impact often persists over a
defined timeframe. The key characteristic of the AR modeling approach lies in its ability to detect
and estimate outliers based on the underlying mechanism of periodic variations in time series
data. After removing outliers, the time series retains the structural patterns of these periodic
variations. However, the approach proposed by Kyo [→8] employs the maximum likelihood
method, and the likelihood function may not always exhibit high sensitivity to outliers.
Consequently, there are cases when the AR modeling approach may not be highly efficient in
identifying and estimating outliers.
In other words, when employing the maximum likelihood of the AR modeling approach, accurately estimating outliers can be challenging even when they are detected, owing to their mutual influence. Therefore, relying solely on the maximum likelihood of the AR modeling approach may not be efficient for both identifying and estimating outliers. The prior study in Kyo [→8] revealed that the
process of identifying and estimating outliers required multiple iterations to achieve acceptable
results.
Therefore, in this paper, to overcome the limitations of the maximum likelihood of the AR
modeling approach proposed by Kyo [→8], we introduce a highly sensitive indicator referred to as
the index of symmetry and uniformity (ISU) for outlier detection and estimation. We then
integrate the ISU with the likelihood for an AR model and propose a multiobjective optimization
method using negative log-likelihood and the ISU as objective functions. This approach aims to
enhance the efficiency of identifying and estimating outliers while retaining the advantages of the
maximum likelihood of AR modeling approach.
The remainder of the chapter is structured as follows. In Section 1.2, we provide a review of
the constrained-remaining component decomposition method. In Section 1.3, we offer a detailed
review of the maximum likelihood of the AR modeling approach for the identification and
estimation of outliers. Section 1.4 presents a new proposal for identifying and estimating outliers
using a multiobjective optimization method. In Section 1.5, we illustrate the performance of the
proposed approach with an example. Finally, in Section 1.6, we summarize the study.
1.2 First review
To commence, we will provide an overview of the constrained-remaining component
decomposition method introduced by Kyo and Kitagawa [→9]. Let us consider a time series
denoted as $z_t$ comprising two unobserved components defined as follows:
$$z_t = s_t + y_t \quad (t = 1, 2, \ldots, N). \tag{1.1}$$
Here, $s_t$ and $y_t$ represent the constrained and remaining components, respectively, with $t$ and $N$ corresponding to a specific time point and the length of the time series. The constrained component is presumed to be expressible as a local linear model of time $t$ indicating long-term variations in the time series $z_t$, making it inherently smooth. In contrast, the remaining component is derived as the residual after fitting the constrained component to the time series, representing short-term variations. Additionally, no specific assumptions are made about the remaining component $y_t$, although it is assumed that the time series $y_t$ is closer to being a stationary series than $z_t$ if there are no outliers in the time series $z_t$. Further details regarding the model in →eq. (1.1) were provided by Kyo and Kitagawa [→9], and the definition of an outlier is elaborated later in this paper.
The moving linear model approach, developed by Kyo and Kitagawa [→9], can be employed to perform the constrained-remaining component decomposition. In this approach, a critical parameter called the width of the time interval (WTI) is introduced. Proper determination of the WTI allows us to execute the decomposition outlined in →eq. (1.1). In Kyo and Kitagawa [→9], a likelihood function for the WTI was derived, enabling the estimation of the WTI using the maximum likelihood method. As mentioned earlier, with a properly determined WTI and in the absence of outliers in the time series $z_t$, a constrained component with high smoothness can be estimated, and the remaining component is obtained as a stationary time series. However, the existence of outliers in the time series $z_t$ can cause the remaining component to become nonstationary; nevertheless, with a properly determined WTI, a significant portion of these outliers is located in the remaining component. This allows us to estimate the constrained component with high smoothness and acquire the remaining component containing the outliers. Consequently, the objective of this study is to separate the outliers from the estimate of the remaining component and isolate the stationary part of the remaining component. Subsequently, in the next subsection, we introduce an AR model for the time series of the remaining component with outliers.
1.3.1 Models
Consider the scenario where, as a result of the constrained-remaining component decomposition, we have successfully estimated the constrained component with high smoothness, and the remaining component $y_t$ is obtained as
$$y_t = r_t + e_t \quad (t = 1, 2, \ldots, N).$$
Here, $r_t$ represents a stationary time series, while $e_t$ is a time series containing outliers. In this setting, it is assumed that $r_t$ follows an autoregression (AR) model:
$$r_t = \sum_{i=1}^{p} \alpha_i r_{t-i} + \epsilon_t \quad (t = 1, 2, \ldots, N). \tag{1.2}$$
Here, $p$ denotes the model order, $\alpha_1, \alpha_2, \ldots, \alpha_p$ are coefficients, and $\epsilon_t \sim N(0, \sigma^2)$ represents the innovation. Furthermore, we assume that the values of almost all elements in the time series $e_1, e_2, \ldots, e_N$ are equal to zero, except for possibly $k$ nonzero parameters, denoted as the outliers $\delta$. The stationarity conditions for the model in →eq. (1.2) are as follows (as per Kitagawa [→6]): (1) the variance $\sigma^2$ of the innovation remains constant for $t = 1, 2, \ldots, N$; (2) the absolute values of all partial autocorrelation coefficients are less than 1.
In this framework, the model in →eq. (1.2) can be expressed in vector-matrix form as
$$r^{(1)}(\delta) = X^{(1)}(\delta)\,\alpha + \varepsilon^{(1)}(\delta),$$
where
$$r^{(1)}(\delta) = \begin{pmatrix} r_1(\delta) \\ r_2(\delta) \\ \vdots \\ r_N(\delta) \end{pmatrix} = \begin{pmatrix} y_1 - e_1 \\ y_2 - e_2 \\ \vdots \\ y_N - e_N \end{pmatrix}, \quad \alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix}, \quad \varepsilon^{(1)}(\delta) = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{pmatrix}, \tag{1.3}$$
$$X^{(1)}(\delta) = \begin{pmatrix} 0 & \cdots & \cdots & \cdots & 0 \\ y_1 - e_1 & 0 & & & \vdots \\ y_2 - e_2 & y_1 - e_1 & \ddots & & \vdots \\ \vdots & y_2 - e_2 & \ddots & \ddots & \vdots \\ \vdots & \vdots & & \ddots & 0 \\ \vdots & \vdots & & & y_1 - e_1 \\ \vdots & \vdots & & & \vdots \\ y_{N-1} - e_{N-1} & y_{N-2} - e_{N-2} & \cdots & \cdots & y_{N-p} - e_{N-p} \end{pmatrix}.$$
It is assumed that there are no outliers among the initial values $y_{1-p}, y_{2-p}, \ldots, y_0$ for the time series $y_t$, so the initial values $e_{1-p}, e_{2-p}, \ldots, e_0$ for the time series $e_t$ are equal to zero; i.e., $e_{1-p} = e_{2-p} = \cdots = e_0 = 0$. When the values of the outliers $\delta$ are given, the estimates of the coefficients are obtained as
$$\hat{\alpha}^{(1)}(\delta) = \left(X^{(1)}(\delta)^{T} X^{(1)}(\delta)\right)^{-1} X^{(1)}(\delta)^{T} r^{(1)}(\delta).$$
These estimates are based on the vector $r^{(1)}(\delta)$ and matrix $X^{(1)}(\delta)$, which are set in the order of the observation data. Thus, this set of estimates is referred to as the estimates in natural order (ENO).
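For illustration, this least squares computation can be sketched numerically as follows. The minimal Ruby example below uses the standard Matrix library; the helper names and the toy series are illustrative and are not taken from the original study.

```ruby
require 'matrix'

# Build the N x p design matrix X^(1)(delta) from the outlier-adjusted
# series y_t - e_t, with zeros standing in for the (outlier-free) initial values.
def build_design_matrix(adjusted, p)
  n = adjusted.length
  rows = (0...n).map do |t|
    (1..p).map { |i| t - i >= 0 ? adjusted[t - i] : 0.0 }
  end
  Matrix.rows(rows)
end

# Least squares estimate alpha_hat = (X'X)^{-1} X' r for given outlier values.
def ar_ls_estimate(y, e, p)
  adjusted = y.zip(e).map { |yt, et| yt - et }   # r_t(delta) = y_t - e_t
  x = build_design_matrix(adjusted, p)
  r = Vector.elements(adjusted)
  ((x.transpose * x).inverse * (x.transpose * r)).to_a
end

# Toy example: a short series with one injected outlier at position 5.
y = [0.2, -0.1, 0.3, 0.0, -0.2, 3.0, 0.1, -0.3, 0.2, 0.0]
e = [0.0,  0.0, 0.0, 0.0,  0.0, 2.9, 0.0,  0.0, 0.0, 0.0]
puts ar_ls_estimate(y, e, 2).inspect
```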
However, these parameter estimation formulas may depend on the location of the outliers in the time series $e_t$. Therefore, a vector and a matrix corresponding to the vector $r^{(1)}(\delta)$ and the matrix $X^{(1)}(\delta)$ are also defined in reverse order of the observations:
$$r^{(2)}(\delta) = \begin{pmatrix} y_N - e_N \\ y_{N-1} - e_{N-1} \\ \vdots \\ y_1 - e_1 \end{pmatrix}, \quad X^{(2)}(\delta) = \begin{pmatrix} 0 & \cdots & \cdots & \cdots & 0 \\ y_N - e_N & 0 & & & \vdots \\ y_{N-1} - e_{N-1} & y_N - e_N & \ddots & & \vdots \\ \vdots & \vdots & & \ddots & 0 \\ \vdots & \vdots & & & y_N - e_N \\ \vdots & \vdots & & & \vdots \\ y_2 - e_2 & y_3 - e_3 & \cdots & \cdots & y_{p+1} - e_{p+1} \end{pmatrix}.$$
Based on the reversibility of a stationary AR model, the model in →eq. (1.2) can also be expressed as
$$r^{(2)}(\delta) = X^{(2)}(\delta)\,\alpha + \varepsilon^{(2)}(\delta),$$
where $\varepsilon^{(2)}(\delta)$ is the innovation vector corresponding to $r^{(2)}(\delta)$ and $X^{(2)}(\delta)$. Correspondingly, when the values of the outliers $\delta$ are given, the estimates of the coefficients are obtained as
$$\hat{\alpha}^{(2)}(\delta) = \left(X^{(2)}(\delta)^{T} X^{(2)}(\delta)\right)^{-1} X^{(2)}(\delta)^{T} r^{(2)}(\delta).$$
Let $LL^{(1)}(\delta)$ and $LL^{(2)}(\delta)$ denote the log-likelihoods of the AR model computed from the estimates in natural order and in reverse order, respectively. A synthesized log-likelihood is then defined as
$$LL(\delta) = \log\left(\frac{1}{2}\left(\exp\left(LL^{(1)}(\delta)\right) + \exp\left(LL^{(2)}(\delta)\right)\right)\right). \tag{1.4}$$
Therefore, the outliers $\delta$ can be estimated by maximizing the value of $LL(\delta)$, with $\hat{\delta}$ denoting the estimate of $\delta$ obtained by maximizing $LL(\delta)$ in →eq. (1.4).
Then, as a function of the model order $p$ and the number of outliers $k$, the Akaike information criterion (AIC) for the synthesized model is defined by (see Akaike [→1]):
$$\mathrm{AIC}(k, p) = -2\,LL(\hat{\delta}) + 2(p + k + 1), \tag{1.5}$$
and the synthesized estimates for the coefficients are obtained as
$$\hat{\alpha} = w_1 \hat{\alpha}^{(1)}(\hat{\delta}) + w_2 \hat{\alpha}^{(2)}(\hat{\delta}),$$
with $w_1$ and $w_2$ being weights that can be determined as ratios of the likelihoods:
$$w_1 = \frac{\exp\left(LL^{(1)}(\hat{\delta})\right)}{\exp\left(LL^{(1)}(\hat{\delta})\right) + \exp\left(LL^{(2)}(\hat{\delta})\right)}, \qquad w_2 = 1 - w_1.$$
Moreover, the parameters $k$ and $p$ can be estimated by minimizing the value of $\mathrm{AIC}(k, p)$ defined by →eq. (1.5). Thus, all parameters can be estimated using the maximum likelihood method and the minimum AIC method. However, for each element in the outlier vector, the above estimation process should be repeated while fixing the other elements, because the outliers may correlate with each other. Note that in the repetition of the estimation, the AIC should be calculated for the original time series of the remaining component.
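As a small numerical illustration of →eq. (1.5) and the likelihood-ratio weights, consider the following sketch; the input values are arbitrary and the helper names are not part of the original implementation.

```ruby
# AIC for the synthesized model, eq. (1.5): k outliers, model order p,
# and maximized log-likelihood ll.
def aic(ll, k, p)
  -2.0 * ll + 2.0 * (p + k + 1)
end

# Likelihood-ratio weights used to combine the natural-order and reverse-order
# coefficient estimates; the shift by the maximum avoids overflow in exp.
def synthesis_weights(ll1, ll2)
  m = [ll1, ll2].max
  w1 = Math.exp(ll1 - m) / (Math.exp(ll1 - m) + Math.exp(ll2 - m))
  [w1, 1.0 - w1]
end

puts aic(1600.0, 3, 40)                       # => -3112.0
puts synthesis_weights(-0.2, -0.5).inspect    # => [0.574..., 0.425...]
```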
Consider an observed value $Y_t$ for the random process $R_t$, with $Y_t = R_t + \delta_t$, assuming that $\delta_t$ is a potential outlier. It is further assumed that $E\{R_t\} = 0$ and $E\{Y_t\} = \delta_t$. Moreover, assuming that $E\{R_t^2\} = C > 0$, we have
$$E\{Y_t^2\} = E\{R_t^2 + 2R_t\delta_t + \delta_t^2\} = C + \delta_t^2.$$
This shows that $E\{Y_t^2\}$ increases with $\delta_t^2$. Moreover, when $E\{Y_t^2\}$ is large, it is relatively close to $\delta_t^2$, suggesting that the absolute value of $\delta_t$ might be relatively large. Additionally, when the observations are ordered so that $y_{t_1}^2 \geq y_{t_2}^2 \geq \cdots \geq y_{t_N}^2$, one can reasonably infer that $y_{t_1}$ contains the largest outlier, $y_{t_2}$ contains the second largest outlier, and so on. In other words, when estimating $k$ outliers, one can set the order in which to search for potential outlier positions as $t_1, t_2, \ldots, t_k$. Then, the positions of the outliers can be searched and estimated in this order.
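The ordering rule described above can be sketched as follows; the method name and toy data are illustrative only.

```ruby
# Rank candidate outlier positions t_1, t_2, ..., t_k by descending squared value,
# so that the search examines the most prominent candidates first.
def candidate_outlier_positions(y, k)
  y.each_with_index                      # pair each value with its index
   .sort_by { |value, _t| -(value**2) }  # largest squared value first
   .first(k)
   .map { |_value, t| t }
end

y = [0.1, -0.2, 2.8, 0.0, -1.9, 0.3]
puts candidate_outlier_positions(y, 2).inspect  # => [2, 4]
```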
1.4.1 Motivation
In the previous section, we explored the maximum likelihood of the AR modeling approach, a
valuable tool for identifying and estimating outliers. However, it occasionally encounters efficiency
limitations. To address these constraints and enhance the outlier detection process, this paper
introduces a highly sensitive indicator known as the ISU. We integrate the ISU with the negative
log-likelihood (NLL) and propose a multiobjective optimization method by employing both NLL
and the ISU as objective functions. The primary motivation behind this proposal is to improve the
efficiency of identifying and estimating outliers while preserving the advantages of the maximum
likelihood of the AR modeling approach.
The vector $r^{(1)}(\delta)$, as defined in →eq. (1.3), can be regarded as a function of the potential outliers $\delta$. When an element of this vector satisfies $r_t(\delta) > 0$, we call it a positive element; otherwise, we call it a negative element. Assume that the number of positive elements is $N_1$ and that of the negative elements is $N_2$, with $N_1 + N_2 = N$. So, we can express the set of all positive elements as
$$R^{(+)} = \{r_1^{(+)}(\delta), r_2^{(+)}(\delta), \ldots, r_{N_1}^{(+)}(\delta)\}$$
and the set of all negative elements as
$$R^{(-)} = \{r_1^{(-)}(\delta), r_2^{(-)}(\delta), \ldots, r_{N_2}^{(-)}(\delta)\}.$$
If the time series contains no outliers, the elements of $r^{(1)}(\delta)$ should distribute symmetrically around zero and uniformly over time. This is referred to as its symmetry and uniformity.
Here, we consider the variances for the positive elements and the negative elements as follows:
$$\mathrm{Var}\{R^{(+)}\} = \frac{1}{N_1}\sum_{j=1}^{N_1}\left(r_j^{(+)}(\delta) - r^{(+)}(\delta)\right)^2, \qquad \mathrm{Var}\{R^{(-)}\} = \frac{1}{N_2}\sum_{j=1}^{N_2}\left(r_j^{(-)}(\delta) - r^{(-)}(\delta)\right)^2,$$
with $r^{(+)}(\delta)$ and $r^{(-)}(\delta)$ being the averages of the positive elements and the negative elements,
respectively. Denote the set of the absolute values of all elements in the vector $r^{(1)}(\delta)$ by $|R(\delta)|$, and let $\mathrm{Var}\{|R(\delta)|\}$ and $\mathrm{Ave}\{|R(\delta)|\}$ respectively denote the variance and average of the elements in the set $|R(\delta)|$; then, by definition, we have
$$\mathrm{Var}\{|R(\delta)|\} = \frac{1}{N}\sum_{t=1}^{N}\left(|r_t(\delta)| - \mathrm{Ave}\{|R(\delta)|\}\right)^2,$$
which can be decomposed as
$$\mathrm{Var}\{|R(\delta)|\} = \frac{N_1}{N}\mathrm{Var}\{R^{(+)}\} + \frac{N_2}{N}\mathrm{Var}\{R^{(-)}\} + \frac{N_1}{N}\left(r^{(+)}(\delta) - \mathrm{Ave}\{|R(\delta)|\}\right)^2 + \frac{N_2}{N}\left(r^{(-)}(\delta) + \mathrm{Ave}\{|R(\delta)|\}\right)^2. \tag{1.6}$$
Note that the value of $r^{(-)}(\delta)$ is negative. It can be seen that the first two terms on the right-hand side in →eq. (1.6) measure the uniformity of the positive elements and the negative elements, respectively, and the last two terms measure their symmetries. That is, large values of the first two terms mean that the elements are not distributed uniformly, and large values of the last two terms indicate that the symmetry of the positive elements and the negative elements is poor. Thus, $\mathrm{Var}\{|R(\delta)|\}$ can be used as an indicator of symmetry and uniformity.
On the other hand, while our goal is to identify and estimate outliers, we also aim to preserve the structure of the time series. In other words, we want to retain as much as possible of the variation of the time series $r_t(\delta)$ after the outliers are removed. The variation of the time series $r_t(\delta)$ with outliers removed can be measured by its standard deviation. Therefore, using the relative standard deviation (RSD) as an indicator of the prominence of outliers, we consider
$$\mathrm{RSD}(\delta, k) = \frac{\sqrt{\mathrm{Var}\{|R(\delta)|\}}}{\sqrt{\mathrm{Var}\{r^{(1)}(\delta)\}}},$$
where
$$\mathrm{Var}\{r^{(1)}(\delta)\} = \frac{1}{N}\sum_{t=1}^{N} r_t(\delta)^2$$
represents the variance of the elements in the vector $r^{(1)}(\delta)$. Since $\mathrm{RSD}(\delta, k)$ becomes smaller when $\mathrm{Var}\{|R(\delta)|\}$ decreases and $\mathrm{Var}\{r^{(1)}(\delta)\}$ increases, minimizing the RSD serves both purposes simultaneously.
Furthermore, when using $\mathrm{RSD}(\delta, k)$ in conjunction with the log-likelihood, we can refer to the logarithm of the RSD as the ISU and use it as one of the objective functions for outlier detection:
$$\mathrm{ISU}(\delta, k) = \log \mathrm{RSD}(\delta, k). \tag{1.7}$$
By minimizing $\mathrm{ISU}(\delta, k)$ with respect to $\delta$ and $k$, $\hat{\delta}$ and $\hat{k}$ can be obtained as estimates for the outliers $\delta$ and their number $k$. This method for identifying and estimating outliers is referred to as the minimum ISU method. It provides an efficient means to achieve these goals while taking advantage of the maximum likelihood of the AR modeling approach.
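For concreteness, the RSD and ISU defined above can be computed as in the following sketch, which follows the formulas in this section directly; the function names and the toy residual vectors are illustrative and not part of the original study.

```ruby
def variance(values)
  mean = values.sum / values.length.to_f
  values.sum { |v| (v - mean)**2 } / values.length.to_f
end

# RSD(delta, k): ratio of the spread of the absolute residuals to the
# overall standard deviation of the adjusted residuals r_t(delta).
def rsd(residuals)
  abs_vals = residuals.map(&:abs)
  var_r    = residuals.sum { |r| r**2 } / residuals.length.to_f
  Math.sqrt(variance(abs_vals)) / Math.sqrt(var_r)
end

# ISU is the logarithm of the RSD; smaller values indicate residuals that are
# distributed more symmetrically around zero and more uniformly.
def isu(residuals)
  Math.log(rsd(residuals))
end

balanced = [0.5, -0.5, 0.6, -0.6, 0.4, -0.4]
skewed   = [0.5, -0.5, 3.0, -0.6, 0.4, -0.4]   # one large positive spike
puts isu(balanced) < isu(skewed)                # => true
```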
We now define a composite objective function
$$f(\delta \mid \lambda) = \mathrm{ISU}(\delta, \hat{k}) + \lambda\,\mathrm{NLL}(\delta)$$
by combining the negative log-likelihood $\mathrm{NLL}(\delta) = -LL(\delta)$, with the log-likelihood $LL(\delta)$ defined in →eq. (1.4), and $\mathrm{ISU}(\delta, \hat{k})$, defined in →eq. (1.7). Here, $\lambda > 0$ is a parameter representing the relative weight of influence from $\mathrm{NLL}(\delta)$ compared to $\mathrm{ISU}(\delta, \hat{k})$. Note that $\hat{k}$ is
the estimate of k obtained using the minimum ISU method in advance. For a given value of λ , we
aim to minimize the objective functions ISU(δ, k̂) and NLL(δ) simultaneously by minimizing
$f(\delta \mid \lambda)$ with respect to $\delta$. Setting a higher value for $\lambda$ causes the search for outliers to be driven primarily by the minimization of $\mathrm{NLL}(\delta)$, while reducing it places greater emphasis on $\mathrm{ISU}(\delta, \hat{k})$ in the search for outliers.
It should be emphasized that the minimization of each function can only be achieved numerically, which makes theoretical treatments such as Pareto optimization challenging.
Therefore, an empirical approach is employed, where the value of λ is set through trial and error
to minimize the function f (δ|λ) , and the values of ISU(δ, k̂) and NLL(δ) are computed. Results
with both ISU(δ, k̂) and NLL(δ) being relatively small are considered.
A typical procedure for the proposed approach will be illustrated with an example in the
following section.
1.5 Illustrative example
For the illustrative example, we selected the analysis of the index of industrial production (IIP) in
Japan. The purpose of this example is to showcase the performance of the proposed approach
using the analysis results and to illustrate the typical procedure involved in implementing the
proposed approach.
The data for this analysis were obtained from the Japanese Cabinet Office website [→5]. These
data represent a monthly time series spanning from January 1975 to December 2019,
encompassing a total of N = 540 months.
→Figure 1.1 depicts a plot of the time series for log-IIP. It is important to note that →Figure
1.1 does not provide a unit for the vertical axis. This is because the time series displayed in this
figure is an index, and the absence of units applies to the subsequent figures as well.
Figure 1.1: Time series for log-IIP in Japan (January 1975 to December 2019).
We transformed the data by taking the logarithm, which became our target for analysis. The time
series exhibited a significant decline around February 2009, which can be attributed to the
consequences of the financial crisis that occurred between 2007 and 2008.
For the decomposition of log-IIP, we applied the moving linear model approach as proposed
by Kyo and Kitagawa [→9]. To determine the optimal value of WTI, we calculated the log-likelihood
over a range of WTI values from 3 to 72 using the model described in →eq. (1.1). →Figure 1.2
presents the log-likelihood values plotted against the corresponding WTI values. The WTI value of
57 was selected as it corresponded to the highest log-likelihood.
Figure 1.2: Log-likelihood values versus the WTI values for the model in →eq. (1.1).
→Figure 1.3 presents the results of the decomposition. In →Figure 1.3(a), the time series of the
constrained component exhibits a smooth trend, while →Figure 1.3(b) reveals that the time series
of the remaining component displays cyclical variations, suggesting the presence of business
cycles in Japan. Notably, a substantial portion of the sharp decline observed around February 2009
can be attributed to the remaining component. As a result, terms stemming from this abrupt
variation need to be identified and addressed as outliers, as they have the potential to disrupt the
analysis of business cycles.
Figure 1.4: Results of the detection and estimation of outliers using minimum ISU method.
Furthermore, based on the estimated number and locations of anomalies obtained above, we
determined the estimated order p of the AR model using the minimum AIC method within the
range of 1–45. As a result, the estimated value for $p$ is $\hat{p} = 40$, with a corresponding AIC value of −3149.12.
Therefore, we performed the estimation of anomalies using the minimum AIC method.
→Figure 1.5(b) depicts the time series containing outliers, while →Figure 1.5(a) displays the
outlier-adjusted time series. Observations from →Figure 1.5(a) reveal the following: in the areas
where anomalies occur, there is a noticeable periodic variation in the outlier-adjusted time series.
Nevertheless, due to the incomplete removal of anomalies, deep troughs are still present, causing
an asymmetric fluctuation in the time series.
Figure 1.5: Results of the detection and estimation of outliers using minimum AIC method.
Subsequently, we calculated the values of the function $f(\hat{\delta} \mid \lambda)$ for each $\lambda$ value. The values of $\lambda$ were increased in 20 steps, with each increment being $5 \times 10^{-6}$, starting from $1 \times 10^{-5}$. Several of the resulting settings appear favorable from the perspectives of the minimum ISU and the minimum NLL (which agrees with the maximum likelihood) methods. Since these results are approximately the same, we present the results corresponding to the relatively favorable $\lambda = \lambda_6 = 3.5 \times 10^{-5}$ in →Figure 1.7. As evident from →Figure 1.7, the results corresponding to $\lambda = 3.5 \times 10^{-5}$ appear to be favorable from both perspectives.
Figure 1.7: Results of the detection and estimation of outliers with $\lambda = 3.5 \times 10^{-5}$.
The results obtained with a much larger value, $\lambda = 1 \times 10^{-2}$, are shown in →Figure 1.8. As can be seen from →Figure 1.8, these results are akin to those obtained using the minimum AIC method in →Figure 1.6. In such a case, a balance between the two criteria is not achieved.
Figure 1.8: Results of the detection and estimation of outliers with $\lambda = 1 \times 10^{-2}$.
Taking into account the outcomes discussed above, the proposed method demonstrates strong performance. The results showcased in →Figure 1.7 underscore the method’s capacity to effectively identify favorable $\lambda$ values, thus optimizing the balance between the minimum ISU and minimum AIC approaches. This highlights the robustness of the proposed methodology in addressing anomalies and model selection, ultimately contributing to its high-performance capabilities.
1.6 Summary
This study begins by providing an overview of the constrained-remaining component
decomposition method introduced in Kyo and Kitagawa [→9], which is employed to separate the
original time series into a trend component and a stationary time series containing outliers. Our
aim is to develop a method for identifying and estimating these outliers from the stationary time
series. Subsequently, we perform an in-depth review of Kyo’s [→8] maximum likelihood of the AR
modeling approach, focusing on the exploration and estimation of anomalies.
The AR modeling approach, rooted in the minimum AIC method, is advantageous in
preserving periodic variations in time series while handling anomalies. However, it occasionally
faces challenges in efficiently processing anomalies. To address this, we introduce the ISU
indicator and propose the minimum ISU method, with the goal of enhancing the efficiency of
anomaly handling. It is worth noting that the minimum ISU method, while efficient in managing
anomalies, may not fully retain periodic variations in time series.
In response, we present a multiobjective optimization approach that combines the minimum
ISU and minimum AIC methods through a composite objective function. This approach strives to
strike a balance between outlier detection and model selection.
To illustrate the procedure and performance of our proposed approach, we apply it to the
analysis of monthly index of industrial production data spanning from January 1975 to December
2019. This example serves as compelling evidence of the exceptional performance of our method.
The results emphasize the robustness of our methodology in addressing anomalies and model
selection, ultimately highlighting its high-performance capabilities.
It is important to acknowledge that this paper primarily focuses on a single domain, and
ideally, should be extended to multiple domains. Additionally, the absence of an evaluation metric
in the illustrative example is recognized. The quantification of the benefits of the proposed
method for comparative analysis with other techniques is an important aspect that could not be
comprehensively addressed due to space constraints. This study is considered a preliminary
endeavor, and addressing these issues is reserved for future research.
References
[1] H. Akaike, A new look at the statistical model identification. IEEE Transactions on Automatic
Control, vol. AC-19, pp. 716–723, 1974. →
[2] P. J. Brockwell, and R. A. Davis, Time Series: Theory and Methods, 2nd Edition, New York:
Springer-Verlag, 1991. →
[3] A. J. Fox, Outliers in time series. Journal of the Royal Statistical Society B, vol. 34, pp. 350–363,
1972. →
[4] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, Bayesian model averaging: A tutorial.
Statistical Science, vol. 14, pp. 382–417, 1999. →
[5] Japanese Cabinet Office. Coincident index, In →https://www.esri.cao.go.jp/en/stat/di/di-
e.html, 2023. →
[6] G. Kitagawa, Introduction to Time Series Modeling with Application in R, 2nd Edition, New York:
Chapman and Hall, 2021. a, b
[7] G. Kitagawa, and T. Takanami, Extraction of signal by a time series model and screening out
micro-earthquakes. Signal Processing, vol. 8, pp. 303–314, 1985. →
[8] K. Kyo, An approach for the identification and estimation of outliers in a time series with a
nonstationary mean. Proceedings of The 2023 World Congress in Computer Science, Computer
Engineering, & Applied Computing (CSCE’23), Conference Publishing Services, pp. 1477–1482,
2024. a, b, c, d, e, f, g
[9] K. Kyo, and G. Kitagawa, A moving linear model approach for extracting cyclical variation from
time series data. Journal of Business Cycle Research, Vol. 19, pp. 373–397, 2023. a, b, c, d, e, f, g, h
[10] K. Kyo, H. Noda, and G. Kitagawa, Co-movement of cyclical components approach to
construct a coincident index of business cycles. Journal of Business Cycle Research, vol. 18, pp.
101–127, 2022. →
[11] D. Pena, Influential observations in time series. Journal of Business & Economic Statistics, vol.
8, pp. 235–241, 1990. →
[12] H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang, Time-
series anomaly detection service at Microsoft. →https://doi.org/10.1145/3292500.3330680, 2019.
→
[13] J. H. Stock, and M. W. Watson, A probability model of the coincident economic indicators. In:
K. Lahiri and G. Moore, (Eds.). Leading Economic Indicators: New Approaches and Forecasting
Records, pp. 63–89, Cambridge University Press, 1991. →
[14] G. K. Vishwakarma, C. Paul, and A. M. Elsawah, An algorithm for outlier detection in a time
series model using backpropagation neural network. Journal of King Saud University – Science,
vol. 32, pp. 3328–3336, 2020. →
2 Using the intentionally linked entities (ILE) database system
to create hypergraph databases with fast and reliable
relationship linking, with example applications
Vitit Kantabutra
Abstract
Hypergraphs can be a stronger data model than ordinary graphs for a database in many
applications, because hypergraphs allow complex, non-binary relationships to be represented
directly. However, when it comes to actual implementation, extant hypergraph database systems
appear to be weak because they are not implemented with sufficiently powerful, modern linked
data structures. For example, one hypergraph database was implemented based on, of all things,
an ordinary graph database system, which lacks a strong mechanism for implementing the
linkages needed by relationships of higher arities. ILE, on the contrary, was specifically designed
to handle linkages of complex, possibly high-arity relationships securely and efficiently. This
chapter shows the details involved in the implementation of a hypergraph database using ILE,
including what to use for the linkages of high-degree hyperedges. Also presented here are several
applications, including some that have never been presented anywhere else.
Keywords: ILE DBMS, hypergraph DBMS, graph DBMS, non-Relational, NoSQL, linked data,
pointers
2.1 Introduction
Around the turn of the millennium, it was widely believed that the Relational database
management system (RDBMS) [→1] was the solution to virtually all database problems. After all,
what could be better than a collection of simple 2D tables that can represent arbitrary entity sets
as well as arbitrary sets of relationships of all arities. Searches can be sped up with index
structures such as B-trees and related data structures.
However, there are significant problems with the RDBMS. In the Relational database system,
keys are equated by means of string matching. This process is error-prone because keys can easily
be misspelled, and blanks and other invisible characters that are unintentionally inserted can
easily result in the lack of a relationship between entities that are meant to be related to each
other, or unintentional relationships can be formed almost as easily. The extremely inefficient
process of table joins is another problem with the RDBMS. Robinson et al., proponents of the
graph database systems, even made the exaggerated claim that RDBMS “lacks relationships”
[→2].
Less known is the fact that database indexes for the RDBMS are typically stored in persistent
storage, and, despite clever caching, are probably not fast enough to feed information to fast
applications and devices such as video players. Yet another problem with the RDBMS is that there
is no one, well-unified place such as a single object that stores all data pertaining to a data entity
[→19].
There are, of course, very positive qualities that the RDBMS possesses, beyond its simplicity.
One such quality is the organization of entity sets into tables. Relationship sets are also organized
into tables. Such organization may seem obvious, but not all database systems are so well-
organized.
In recent years, however, non-Relational database systems have emerged to offer users
systems that don’t have the same shortcomings as the RDBMS. For example, graph database
systems are now offered by several organizations. In a graph database, data entities are linked in
a more robust way than in an RDBMS in order to represent a relationship, and traversing these
links is a fast process. However, in a graph database, there is no clear organization of like entities
into a single entity set. Likewise, relationships that should belong to the same relationship set are
not bundled together in any way.
Additionally, graph databases are made for binary relationships. Higher-arity relationships
cannot be represented well, if at all, in a graph database.
To overcome the lack of higher-arity relationships in a graph database while retaining its
strong relationship-linking capabilities, the hypergraph database has been considered. There
appear to be only three implementations of hypergraph database systems in the publicly available
literature, namely [→3]:
1. HypergraphDB,
2. TypeDB, and
3. GraphBrain.
Of these, HypergraphDB seems to be the only one with documentation that is detailed
enough to determine what data structures are used, and to determine something about the
efficiency of database operations. HypergraphDB’s documentation reveals that the B+ Tree, also
used in Relational databases, is the indexing mechanism used for accelerating database searches.
As discussed above, using such an external indexing data structure makes relationship link
traversal slow compared to using an “internal” data structure that is an integral part of the core of
the database itself.
In this chapter we will see that ILE can be used to represent any hypergraph database so that
relationships, including those with higher arities, can be represented directly using a very fast
internal data structure. Additionally, ILE represents each entity set as an object, and each
relationship set is likewise represented as an object. These objects could be added to a
hypergraph database implemented with ILE, even though the hypergraph data model itself
doesn’t have them.
Just for reference or comparison purposes, we should take note of how relationship linkage is
done in some graph databases. According to Kemnitz [→4], the popular graph database Neo4j,
which is also one of the oldest databases in this genre, stores in each node a list of records that
represent the node’s relationships with other nodes. A physical pointer seems to exist in each
such record, pointing to the other node. Some of the other graph databases, however, may store
in the node some form of identification of the destination nodes, necessitating an index search for
the actual location of such destination nodes themselves.
The rest of this chapter comprises the following sections: Section II introduces the ILE
database and explains the mechanism by which that database system can be used to represent
graphs and hypergraphs. The section includes how ILE can be extended to accommodate a better
representation of hypergraphs, including hypergraphs with hyperedges of variable arity, useful
for efficiently and robustly representing groups of nodes on social media, which is a situation
where ordinary graph database systems cannot perform well.
Section III explores the major advantages of ILE over existing databases, namely that ILE
allows more efficient and secure relationship linking and traversal.
Section IV examines ILE as a means for implementing a particular hypergraph database. The
database involves ternary relationships, making it unwieldy to represent with, say, an ordinary
graph. The example is benchmarked for its relationship linkage traversal efficiency. More
specifically, the example is a travel map consisting of a few locations in Idaho, which can obviously
be extended to a full-blown map with many more entity sets and relationship sets.
Section V compares the efficiency of relationship linkage between ILE and Neo4j in social
network graph databases. For this comparison, we use a simple model of a social network graph
found in [→5].
Sections VI and VII present results not previously published in [→6], the publication that this
current chapter extends. Section VI discusses transportation networks beyond the small example
given in Section IV. In particular, we will discuss a subway network and how it could be better
represented as a hypergraph than a graph, which is how it was represented in [→7, →8]. Section
VII discusses algorithms for Web page ranking, which are of great importance for search engines
needing to rank Web pages. Heintz and Chandra [→9] showed why a hypergraph is a superior
model to regular graphs for Web page ranking and other analytics involving social grouping.
However, they rely on regular graph and multigraph physical data representation to represent
hypergraphs. In this chapter, it will be shown that the new direct hypergraph representation can
represent social groups and general grouping of Web pages and of nodes in general much more
efficiently and more naturally, and seemingly also securely, than can graph and multigraph
software.
2.2 The ILE database system and its use for representing a hypergraph database
The ILE database idea [→17] occurred to the author of this chapter while teaching a database
course in 2007, using Garcia-Molina et al.’s Database Systems book, first edition [→10]. That
textbook contains the narrative that Chen’s ER data model [→11] is the best data model for most
situations, but when it comes to implementation, the Relational model is chosen instead. This is
because there was no implementation of the ER model, and the Relational model was chosen
because, though it predates the ER model, it is often thought of as a simple approximation to the
ER model. In the Relational database system, due to having but one kind of data structure, the
relation (ignoring external indexes), relationships must be linked by means of text matching. This
text matching is prone to errors due to misspelled or otherwise incorrect keys. It is even possible
to unintentionally relate entities that aren’t meant to be related.
Given all these pros and cons of the Relational database system, the author of the current
chapter thought that perhaps it was better to use physical pointers to do relationship linkages,
and only perform a linking when there is a verifiable entity to link with. Also, there is no real need
to limit the data structures to just tables. Back in the years leading to 1970, which was the date of
Codd’s seminal paper [→1], the simplicity of a two-dimensional array made sense. Programming
with complicated data structures, especially those involving pointers, was not commonplace.
Object-oriented programming was not well known.
To elaborate on Chen’s paper, it appears that his emphasis is to come up with a database
model that encompasses all the major extant database models, namely, the Network model, the
Relational model, and the Entity model. He critiqued all three extant models, and wrote that all of
them can be thought of as derivatives of his database model, the Entity-Relationship, or ER,
model. He then introduced a diagramming method, later known as the ER diagram method, for
designing database schemata, which is still in use today. He did not cover how to write the code
that it takes to implement his ER data model into a working or even a prototypical database
system.
As the decades went by beyond the ’70s, the popularity of Codd’s RDBMS exploded. Meanwhile,
Chen’s ER diagramming method became the standard database design method, at least for
students and for serious database designers. However, Chen’s model seemed not to have been
implemented as database software, except some people believe that the Relational system is an
implementation of the ER model, which is not quite an accurate belief. In fact, it seemed to be
widely believed that the ER model is unimplementable.
This author’s thought, in 2007, was that with all the advancements in programming languages
and techniques since 1970, there must be a way to implement a better general-purpose database
system than the RDBMS, such as something more like the ER model. There should no longer be a
need to stick with making everything (except external indexes) a two-dimensional array.
Details of the ILE
ILE was implemented in the programming language Ruby. Ruby is a very good language for
representing ILE because it is simple to create and manipulate complex data structures that
involve pointers (Ruby references). Also, garbage collection is handled by the runtime monitor.
In an ILE database, there is first an object representing the entire database. There is also an
object representing a set of databases, the DatabaseSet object, in case we need to keep an entire
set of different but related databases together. In the future, there could be search features or
other features for an entire set of related databases. A database object contains the database
name and metadata for the database, for example, some text describing the database. A database
is further constructed from the following components:
1. Entity Sets Objects. These are a natural way to represent entity sets, which are the same as
the entity sets in the ER model or the older Entity Set model [→12]. In the original paper on
ILE [→13], each entity set was represented by a table, but in subsequent papers, the table
implementation has been replaced by an entity set object implementation. The database
object mentioned above keeps track of all entity set objects by means of a Ruby hash, using
each entity set’s name as a key. An entity set object has
1. a reference back to the database object,
2. an entity set name, any string, but meant to be used as a searchable hash key,
3. a hash to external indices in case of need. This is usually not needed and so far has not
been implemented,
4. key attribute names, stored as an array,
5. key attribute types. The types can be defined in whatever way that would suit the
database being represented. Stored as a hash, with the key attribute names serving as
the hash keys,
6. nonkey attribute names, stored as an array,
7. nonkey attribute types, stored as a hash,
8. The entities belonging to this entity set are all stored as a Ruby hash, which of course,
means that only a handle to the hash is stored with the entity set object itself. Also, if
there is more than one key attribute, then the entities are stored as a multilevel hash,
with the number of levels equaling the number of key attributes. The key attributes are
used as the hash keys, and lastly,
9. each entity set has a hash of relationship sets whose relationships involve entities in this
entity set.
2. Entity Objects. These objects represent data entities. Not only are these objects referenced
by the appropriate entity set object, but each of these entity objects also has
1. a reference to the entity set object that represents the entity set this entity is part of,
2. a reference to the current database (not needed but convenient to have),
3. a hash of key values, hashed by the key field name,
4. a hash of non-key values, hashed by non-key field name, and
5. a hash to an array of relationship objects, using the relationship’s relset (relationship
set) name as the hash key. This entity may participate in more than one relationship,
and hence we need an aggregate such as an array of relationship objects.
3. Relset (Relationship Set) Objects. These objects represent relationship sets. A relationship
set is a set of like relationships with the same sequence of roles, each role played by an
entity from a specific entity set. A relset object has the following information:
1. Relset name. Any string, but meant to be used as a searchable hash key,
2. a reference back to the database,
3. a longer description of the relset than the relset name, if needed,
4. role names, implemented as a Ruby Array, but could be reimplemented as a Set,
5. a description of relationship attributes, if there are any. Could be upgraded to be a
more structured container for relationship attributes,
6. all relationships belonging to this relset, currently implemented as an Array but could
be reimplemented as a Set, and
7. an indexing aid for the relationships in the relset, in the form of a data structure called
aha_rel, which stands for an array of hashes of arrays of relationships.
4. Relationship Objects. These objects represent the relationships, containing:
1. a reference to the Relset to which this relationship belongs,
2. a reference to the database (not needed but convenient),
3. a hash of all roles (entities playing various roles), and
4. a hash of relationship attributes, if any.
Perhaps the most complicated task in setting up an ILE database is the installation of each
relationship in each entity object’s @relationship field.
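To make the object layout described above concrete, the following is a deliberately simplified, self-contained sketch of entity sets, entities, relationship sets, and relationships linked by direct Ruby references. The class and method names (and the sample Idaho cities) are illustrative and should not be read as the actual ILE API.

```ruby
# A pared-down sketch of ILE-style objects: entity sets hold entities in a hash
# keyed by key-attribute values, and relationships hold direct references to the
# entities playing each role, so traversal never goes through an external index.
class EntitySet
  attr_reader :name, :entities
  def initialize(name)
    @name = name
    @entities = {}
  end

  def add_entity(key, attrs = {})
    @entities[key] = Entity.new(self, key, attrs)
  end
end

class Entity
  attr_reader :entity_set, :key, :attrs, :relationships
  def initialize(entity_set, key, attrs)
    @entity_set = entity_set
    @key = key
    @attrs = attrs
    @relationships = Hash.new { |h, k| h[k] = [] }   # relset name => relationships
  end
end

class Relset
  attr_reader :name, :roles, :relationships
  def initialize(name, roles)
    @name = name
    @roles = roles
    @relationships = []
  end

  def add_relationship(role_bindings)   # { role name => Entity }
    rel = Relationship.new(self, role_bindings)
    @relationships << rel
    role_bindings.each_value { |entity| entity.relationships[@name] << rel }
    rel
  end
end

class Relationship
  attr_reader :relset, :roles
  def initialize(relset, roles)
    @relset = relset
    @roles = roles
  end
end

# A ternary "travel" relationship in the spirit of the Idaho travel-map example.
cities = EntitySet.new('cities')
modes  = EntitySet.new('modes')
pocatello   = cities.add_entity('Pocatello')
idaho_falls = cities.add_entity('Idaho Falls')
car         = modes.add_entity('car')

travel = Relset.new('travel', %w[origin destination mode])
travel.add_relationship('origin' => pocatello, 'destination' => idaho_falls, 'mode' => car)

# Traversal follows direct object references; no index lookup or string matching.
puts pocatello.relationships['travel'].first.roles['destination'].key  # => Idaho Falls
```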
To implement a hypergraph database in ILE, we represent each node as an entity, with an
entity set for each node type in the hypergraph. While edges could also be represented as entities
in ILE, it’s more efficient to represent them as relationships. This approach works if edges only link
nodes. However, to allow edges to link both nodes and other edges, we need a new class,
EdgeNode, which combines the attributes and methods of both entities and relationships. If ILE
were implemented in a language that supports multiple inheritance, we would use that. However,
since Ruby, the language used for ILE, doesn’t support multiple inheritance, we define EdgeNode
as a subclass of Entity and manually add the attributes and methods from the Relset class, either
directly or through Mixins.
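A minimal sketch of the EdgeNode idea follows: relationship-set behavior is factored into a module and mixed into a subclass of Entity, since Ruby lacks multiple inheritance. The module and method names here are illustrative, not the actual ILE code.

```ruby
# Relationship-set behavior factored into a Mixin, so a hyperedge class can
# inherit from Entity and still carry role bindings like a relationship does.
module RelsetBehavior
  def roles
    @roles ||= {}
  end

  def bind_role(role_name, target)   # target may be an entity or another edge
    roles[role_name] = target
  end
end

class Entity
  attr_reader :key
  def initialize(key)
    @key = key
  end
end

# An EdgeNode is an Entity (so other edges can link to it) that also behaves
# like a relationship through the mixed-in module.
class EdgeNode < Entity
  include RelsetBehavior
end

a = Entity.new('station A')
b = Entity.new('station B')
segment = EdgeNode.new('segment A-B')
segment.bind_role('from', a)
segment.bind_role('to', b)

line = EdgeNode.new('green line')
line.bind_role('includes', segment)            # an edge linking another edge
puts line.roles['includes'].roles['to'].key    # => station B
```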
Figure 2.1: Example Ruby code used to create a new database and the configuration of the
“cities” entity set.
ILE enables users to define and manage data types. In this basic example, we will rely on Ruby to
manage the data types and will not implement any specialized code to process the data entered
into the database. The following demonstrates how entities can be added to the entity set, and
consequently, to the database:
Another entity set, representing the mode of transportation, is created using the code in →Figure
2.2, while the entities representing travel modes can be included in the entity set using the code in
→Figure 2.3.
Figure 2.2: Example code of the entity set representing the mode of transportation.
Figure 2.3: Example code of entities representing travel modes being added to the entity set.
The next step in building the database is to define the single relationship set, or Relset. The
relationships within the Relset have three roles: origin, destination, and transportation mode.
These relationships correspond to hypergraph edges, where each edge connects three nodes
representing the roles. The following code (→Figure 2.4 (a)) demonstrates how these roles are
defined in the Relset construction, and →Figure 2.4 (b) shows how relationships can now be made
according to these examples.
Figure 2.4: (a) Code that defines the roles in the Relset construction. (b) Demonstrating the
establishment of relationships based on the roles defined in the Relset construction.
The final step is to assess the speed of traversing this hypergraph database. The results will reflect
the efficiency of traversing a typical hypergraph link in ILE, under ideal conditions without system
constraints like disk block swapping due to database size. Additionally, we will explore another
example – a simple graph instead of a hypergraph – and compare its performance in ILE with
Neo4J, a well-known, highly efficient graph database, in the next section.
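A traversal timing of this kind can be obtained with Ruby's standard Benchmark module, as in the following sketch; the data set, counts, and traversal body are illustrative and do not reproduce the actual measurement harness used for →Table 2.1.

```ruby
require 'benchmark'

# Time repeated traversals of relationship links, in the spirit of the
# measurements reported below. The traversal body just follows an array of
# direct object references; the data size and repetition count are arbitrary.
Node = Struct.new(:name, :neighbors)

nodes = Array.new(100_000) { |i| Node.new("n#{i}", []) }
nodes.each_with_index { |n, i| n.neighbors << nodes[(i + 1) % nodes.length] }

seconds = Benchmark.realtime do
  current = nodes.first
  1_000_000.times { current = current.neighbors.first }  # follow direct references
end
puts format('%.6f seconds for 1,000,000 link traversals', seconds)
```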
→Table 2.1 presents the execution results on a 2019 13-inch MacBook Pro equipped with a
quad-core Intel Core i5 processor, operating at a 2.4 GHz clock speed. The system features 8 GB of
LPDDR3 RAM running at 2.133 GHz, while the secondary storage consists of a 500 GB flash drive.
An attempt was made to repeat the book’s study with a database of 1,000,000 individuals, each having an average of 50 friends, but the computer slowed down significantly during the process of generating connections, preventing progress to testing traversals. Although the computer used differs from that of the book’s authors, it is likely that ILE requires more space per node and relationship, necessitating a machine with greater memory and working-set capacity to manage large datasets effectively.
→Table 2.3 presents the results from the book’s test on a dataset of 1,000,000 individuals,
each with an average of 50 friends. While the results for Neo4j seem impressive, there are
concerns regarding the 1,000,000-person experiment reported in the book and shown in →Table
2.3. Notably, the results for both MySQL and Neo4j are better for the 1,000,000-person dataset
than for the 1,000-person dataset for the depth 2 case, which seems counterintuitive. Focusing on
the MySQL result for the 1,000,000-person dataset, since MySQL is well understood, the
computation involves a self-join of a 50,000,000-row table, as each person has 50 friends. The
authors claimed this self-join takes 0.016 s on a “commodity” laptop with 8GB of RAM and an Intel
Core i7 processor. This appears unlikely. The size of the join would be 2.5e15 rows, with each row
containing not just numbers but also names and additional information. Even assuming an overly
optimistic scenario where each row contains only three numbers (the persons’ IDs), and the
laptop has 8 cores running at 4 GHz, which is highly optimistic, and further assuming all cores
simultaneously write the results into memory in 1 clock cycle, it’s still implausible. In 0.016 s, the
system would write 8 × 4e9 × 0.016 = 5.12e8 words, which is far fewer than the 7.5e15 required.
Table 2.3: Run results for social network databases with 1,000,000 persons. The tabulated runtimes are in seconds.
Depth MySQL Neo4j
2 0.016 0.01
3 30.267 0.168
4 1,453.505 1.359
5 Unfinished 2.132
While ILE may not be suitable for databases of size 1,000,000 on a small laptop, it worked well for data of size up to 100,000 and perhaps more. →Table 2.4 shows assorted run results for datasets of up to 100,000 persons.
Table 2.4: Run results for social network databases with up to 100,000 persons. The tabulated
runtimes are in seconds.
Persons Depth ILE Run Time
10,000 2 0.068654
10,000 3 0.484819
10,000 4 0.297643
10,000 5 0.312909
10,000 6 0.326356
10,000 7 0.370950
50,000 6 1.563340
100,000 5 6.482886
100,000 6 9.029269
Using the representation of the métro presented here eliminates less clean ways of dealing with
multiline routes, such as the way described in [→7], where each node (station) that serves
multiple lines must appear multiple times in the database.
Another application for which hypergraphs/ILE would be useful is automobile trip route
planning, including route planning in portable GPS units, some of which are notoriously slow at
planning non-local routes. In a typical extant GPS unit for automobiles, the routing software uses
a graph data structure. The nodes are road intersections, while the edges are initially the road
segments that connect those intersections. If computational complexity were not an obstacle, a shortest-path algorithm such as Dijkstra's could simply be applied and we would be done. However, the search space is too large, given the sheer number of nodes, for example, 15.8 million nodes in the continental US (citation needed). One way to limit the search is to precompute shortcuts: edges that do not correspond to physical road segments but instead represent shortest paths between nodes. Heuristics are used to favor highways and other major roads, and to favor the "small roads: big roads: small roads" pattern.
However, this strategy is still a very "localized" one, in that there is not enough "big picture" planning of the overall route. It is not surprising, therefore, that GPS units can have difficulty completing long-distance drive plans.
Our idea is to put more human-like “intelligence” into routing. In particular, we will
precompute routes between major (and not-so-major) cities comprising mainly expressways and
store all of them as nodes. (So our nodes are not just road intersections, which are still nodes in
our scheme.) Except for trips in remote areas, the trip planning search process for a trip from
point A to point B will start by trying one of these intercity routes R. Then we search for the best
route from point A to R, which usually means a short trip from point A to one of a handful of entry
points of the expressway or one of the expressways that make up the route R. With a backward
search, we search for a way to get from one of a few exits of R to point B. Although we have not
performed comparison experiments yet, it seems that our approach has promise in comparison
with the more conventional search in extant GPS devices.
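As a rough illustration of this two-phase idea (not the actual implementation), the following sketch uses the networkx library and assumes a road graph G with an edge attribute "length" and a hypothetical list of precomputed intercity routes, each stored as an ordered list of expressway nodes; for simplicity it treats only the first and last nodes of a route as its entry and exit points:
import networkx as nx

def route_length(G, route):
    # Sum of edge lengths along a precomputed intercity route.
    return sum(G[u][v]["length"] for u, v in zip(route, route[1:]))

def plan_trip(G, intercity_routes, src, dst):
    # Phase 1: pick the precomputed intercity route R that minimizes
    # (src -> entry of R) + (length of R) + (exit of R -> dst).
    best_total, best_plan = float("inf"), None
    for route in intercity_routes:
        entry, exit_ = route[0], route[-1]
        try:
            to_entry = nx.shortest_path_length(G, src, entry, weight="length")
            from_exit = nx.shortest_path_length(G, exit_, dst, weight="length")
        except nx.NetworkXNoPath:
            continue
        total = to_entry + route_length(G, route) + from_exit
        if total < best_total:
            best_total, best_plan = total, route
    # Phase 2: expand the winning plan into a full node sequence (or fall back to plain search).
    if best_plan is None:
        return nx.shortest_path(G, src, dst, weight="length")
    head = nx.shortest_path(G, src, best_plan[0], weight="length")
    tail = nx.shortest_path(G, best_plan[-1], dst, weight="length")
    return head + best_plan[1:-1] + tail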
2.8 Conclusions
For decades, the ER database model has been considered the best overall data model for a vast range of applications. However, ER was never implemented successfully, and so the Relational DBMS, which is based on 1960s programming technology and can be considered a rough approximation of an implementation of ER, became the dominant type of database system. This chapter presented ILE as a modern implementation of ER and showed how to extend ILE into the underlying database system for hypergraph databases. Hypergraph databases are interesting because, unlike ordinary graph databases, they can directly represent higher arity relationships, which are stronger models of real relationships in many applications [→18], [→20]. The ILE-based hypergraph database improves on extant implementations because it is a more direct implementation, using pointers to link relationships. Several applications from diverse fields were also presented in this chapter.
References
[1] E. F. Codd, A relational model of data for large shared data banks. Communications of the
ACM, 13, 6, June 1970. 377–387. a, b
[2] I. Robinson, J. Webber and E. Eifrem, Graph Databases, 2nd Edition, O’Reilly Media, Inc, 2015.
→
[3] T. Menezes, C. Roth, C. C. Chen, Y. Stein and K. Ren, Graphbrain – Language, knowledge,
cognition. Published online as →https://graphbrain.net 2021. →
[4] G. Kemnitz, Quora: Answer to the question: Technically, how are relation(ship)s stored in a
graph database? →https://www.quora.com/How-are-graph-databases-implemented, 2019. →
[5] A. Vukotić and N. Watt, Neo4j in action. Manning, 2015. a, b
[6] V. Kantabutra, Implementing hypergraph databases with fast relationship linking and traversal
using ILE, CSCE 2022. →
[7] K. Jeyapragasan, G. Krishna and Y. Maniyar, An analysis of subway networks using graph theory and graph generation with GraphRNN, Stanford Dept. CS, CS 224W class project, 2019, available online at →http://snap.stanford.edu/class/cs224w-2019/projects.html. Last Accessed Dec. 21, 2023. a, b, c
[8] X. Wu, C. K. Tse, H. Dong, I. W. H. Ho and F. C. M. Lau, A network analysis of world’s metro
systems, In: International Symposium on Nonlinear Theory and Its Applications, NOLTA2016,
Yugawara, Japan, November 27th-30th, 2016. →
[9] B. Heintz and A. Chandra, Enabling scalable social group analytics via hypergraph analysis
systems, In: Proceedings of the 7th USENIX Conference on Hot Topics in Cloud Computing, July,
2015. a, b
[10] H. Garcia-Molina, J. D. Ullman and J. Widom, Database Systems: The Complete Book, Prentice-
Hall, 2001. →
[11] P. Chen, The entity-relationship model: Toward a unified view of data. ACM Transactions on
Database Systems, 1, pp. 9–36. 1976. →
[12] M. E. Senko, E. B. Altman, M. M. Astrahan and P. L. Fehder, Data structures and accessing in
data-base systems. IBM Systems Journal, 12, 1, pp. 30–93, 1973. →
[13] V. Kantabutra, A New Type of Database System: Intentionally-Linked Entities-A Detailed
Suggestion for A Direct Way to Implement the Entity-Relationship Data Model, CSREA EEE, pp.
258–263, 2007. →
[14] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vanó, S. Gómez-Villamor, N. Martínez-Bazan
and J. L. Larriba-Pey. Survey of graph database performance on the hpc scalable graph analysis
benchmark. In: Heng Tao Shen, Jian Pei, M. Tamer Özsu, Lei Zou, Jiaheng Lu, Tok-Wang Ling, Ge
Yu, Yi Zhuang, Jie Shao, (Eds.), International Conference on Web-Age Information Management.
Springer-Verlag, Berlin, Heidelberg, usw., 2010, 37–48. →
[15] C. Kennedy and S. Derrible, Network analysis of world subway systems using updated graph
theory. Transportation Research Record: Journal of the Transportation Research Board, 2112, pp.
17–25. 2009. →
[16] J. Pung, R. D’Souza, D. Ghosal and M. Zhang, A road network simplification algorithm that
preserves topological properties. Applied Network Science, 7, pp. 79, 2022.
→https://doi.org/10.1007/s41109-022-00521-8 →
[17] V. Kantabutra and D. Ames, A suggestion for a data structure for temporal gis. In: IEEE
Workshop on e-Science, Oxford, pp. 158–161, 2009. →
[18] V. Kantabutra, J. B. Owens, D. P. Ames, C. N. Burns and B. Stephenson, Using the newly-
created ILE DBMS to better represent temporal and historical GIS data. Transactions in GIS, 14, pp.
39–58. 2010. →
[19] M. Hunger, R. Boyd and W. Lyon, The definitive guide to graph databases for the RDBMS
developer. Neo4j 2016. →
[20] V. Kantabutra, GeoDataLinks: A suggestion for a replacement for the ESRI shapefile.
Transactions on Computational Science & Computational Intelligence, 685–694, 2021. →
3 Rapid and automated determination of cluster numbers for
high-dimensional big data: a comprehensive update
Zohreh Safari
Note: This chapter extends the findings of a prior collaborative study conducted by Khalid Mursi,
Yu Zhuang, and myself. The earlier research laid the foundation for the current work, and this
paper builds upon those insights, providing further depth, improvement, and exploration into the
subject matter with new larger datasets.
Abstract
Automatically defining the optimal number of clusters is a pivotal challenge in clustering
algorithms. Striking a balance between clustering quality and algorithm efficiency in this
determination process is a crucial tradeoff that motivated our research. In our approach, we have
successfully automated the identification of the optimal number of clusters, particularly tailored
for large high-dimensional datasets. Our method addresses both the quality and efficiency
aspects of clustering. Through experimental studies on five previously explored datasets [→23] and four new, larger datasets introduced in this study, we observed that our procedure provides flexibility in selecting diverse criteria for determining the optimal K under each circumstance. Leveraging the advantages of the bisecting K-means algorithm, our approach outperforms the Ray and Turi method, showcasing higher efficiency in identifying the best number of clusters.
Keywords: clustering, K-means, bisect-K-means, big data, cluster validity,
3.1 Introduction
Clustering, an extensively utilized unsupervised learning process, has been applied to numerous
problems for uncovering insights and revealing hidden relationships among objects,
distinguishing it from supervised machine learning algorithms. Its applications span various
domains, including pattern recognition [→11], machine learning [→5, →6], image analysis [→13,
→14, →15], bioinformatics [→16], and information retrieval [→17, →18]. This procedure groups
data points based on similarities, where those within the same group exhibit similar properties,
while those in different groups are less alike. Unlike supervised learning and classification,
clustering lacks a predefined concept for determining the appropriate number of clusters for a
given dataset [→22]. Users typically depend on clustering algorithms to group data points in an
optimal manner, with the number of clusters predetermined by their knowledge. The
performance of clustering heavily relies on accurately estimating the cluster numbers.
Specifying the right number of clusters intuitively is a challenging task, as users may lack
sufficient knowledge of a dataset to define a high-probability optimal number. The absence of
information regarding the cluster number can adversely impact clustering quality, prompting
significant interest in recent years to discover the correct number of clusters through clustering
algorithms. The K-means algorithm [→19] stands out as one of the most widely recognized
clustering methods, valued for its simplicity and ease of implementation. However, a notable
challenge lies in users having to specify the number of clusters beforehand. Recent research has
endeavored to enhance the K-means algorithm to address this challenge, introducing solutions for automatically determining the number of clusters. Some notable approaches include X-means [→7], Ray and Turi (R&T) [→4], CLUSEQ [→8] designed for sequence data, IKMEANS [→9], and others [→10, →12]. K-means++ [→28] aims to improve the initialization step of K-means (the base algorithm is explained in Section 3.2): it selects initial cluster centers with probability proportional to their squared distance from the already chosen centers, which has been shown to mitigate the impact of initial center selection on algorithm performance. Furthermore,
advancements in adaptive clustering algorithms have led to the development of methods like
adaptive K-means [→29], which dynamically adjusts the number of clusters based on the data
distribution. By continuously monitoring cluster quality and adaptively refining cluster
assignments, these algorithms offer improved flexibility and efficiency, compared to traditional K-
means.
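For illustration, a minimal sketch of the K-means++ seeding step described above (a generic rendering, not tied to any particular implementation cited here) could look as follows:
import numpy as np

def kmeanspp_init(X, k, seed=0):
    # Draw each new center with probability proportional to its squared
    # distance from the nearest center chosen so far.
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)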
This chapter specifically concentrates on automatically determining the optimal number of
clusters (auto best K) for extensive datasets featuring high-dimensional data. While existing
methods like R&T and X-means focus on clustering quality during automatic cluster number
determination, they are not explicitly tailored for large datasets. Managing the efficiency of
clustering algorithms becomes a crucial challenge with big datasets, given that determining the
number of clusters can be time-consuming. Consequently, our focus is on efficiency, while
upholding clustering quality for significant data. In existing algorithms [→4, →7], determining the
number of clusters involves processes like bisecting K-means (BKMeans), where the number of
clusters is progressively increased to a predefined upper bound. The largest cluster is chosen
during the splitting process, based on varying criteria in different methods. Our emphasis is not
on selecting criteria but on reducing computation costs while adhering to the quality standards
set by existing methods. It is noteworthy that the criterion for clustering remains an ongoing
consideration, as each step involves assessing different criteria with an increasing cluster number
in existing algorithms. The structure of the paper unfolds as follows: Section 3.2 offers an
overview of two clustering algorithms, detailing their significant specifications, and provides an
explanation of the R&T method [→4]. Our proposed method is thoroughly examined in Section
3.3. Section 3.4 delves into experiments and comparisons, while the concluding remarks of the
study are presented in the last section.
Algorithm 1: K-means
Input: data set x1, x2, …, xn, the number of clusters k, and N.
1. Initialize the centroids z1, …, zk and repeat:
   a. Compute the distance from each xi to the centroids: di = min ||xi − zj||, j = 1, 2, …, k.
   b. Assign each point xi to its closest centroid zj, that is, xi ∈ Cj.
2. Recalculate the centroid zj′ of each cluster Cj.
3. Set zj = zj′.
4. Repeat steps 1–3 until the clusters no longer change.
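A compact NumPy rendering of Algorithm 1 (a sketch assuming Euclidean distance and random initialization, not the exact code used in the experiments) is:
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps a-b: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 2-3: recalculate the centroid of each cluster (keep old center if a cluster empties).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when the centroids (and hence the clusters) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers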
R&T index = Intra / Inter.    (3.1)
Indeed, the R&T index serves as a valuable metric by quantifying the ratio of intra-cluster to inter-
cluster distances, effectively considering both the compactness and separation of clusters. A lower
value of the R&T index indicates superior clustering, signifying optimal values for the number of
clusters (k). This approach aligns with the objective of achieving clusters that are internally
cohesive (intra-cluster compactness) while being well-separated from each other (inter-cluster
separation), providing a comprehensive assessment of the quality of clustering results. Therefore,
the pursuit of a minimized R&T index value aids in the identification of an optimal k for the given
dataset:
CH = (Trace B / Trace W) × (N − K) / (K − 1),    (3.2)
where B and W are the between-cluster and within-cluster scatter matrices, respectively. Trace B is the sum, over all clusters, of the cluster size (nk) multiplied by the squared distance of that cluster's center (zk) from the centroid of the whole dataset (z). Trace W is the sum of the squared distances of the data points to their respective cluster centers.
Trace B and Trace W can be written as
Trace B = Σ_{k=1}^{K} nk ||zk − z||²,    (3.3)
Trace W = Σ_{k=1}^{K} Σ_{i=1}^{nk} ||xi − zk||².    (3.4)
Therefore, the clustering that delivers the maximum value of the CH index indicates the best value of k. The core loop of the determination procedure can be summarized as:
while (the validity does not meet the criterion) and (k < kmax) do: choose the biggest cluster and split it.
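The following sketch shows a generic version of this loop using scikit-learn's KMeans and its built-in Calinski–Harabasz (CH) score; it illustrates the bisecting idea only and omits the efficiency optimizations that distinguish our procedure from the R&T baseline:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def bisecting_best_k(X, k_min=3, k_max=50, seed=0):
    labels = np.zeros(len(X), dtype=int)
    best_k, best_validity = None, -np.inf
    k = 1
    while k < k_max:
        # Choose the biggest cluster and bisect it.
        largest = np.bincount(labels).argmax()
        idx = np.where(labels == largest)[0]
        split = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[idx])
        labels[idx[split.labels_ == 1]] = k
        k += 1
        if k >= k_min:
            validity = calinski_harabasz_score(X, labels)   # CH index, eq. (3.2)
            if validity > best_validity:
                best_k, best_validity = k, validity
    return best_k, best_validity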
Table 3.4: R&T procedure's experimental results based on the R&T index.
Dataset Kmin Kmax Time Best K Best validity Total SSD
BIRCH1 3 100 9.52E+01 100 1.16E−01 8.08E+13
DIM64 3 300 1.49E+01 16 9.00E−05 5.28E+02
DIM15 3 50 8.27E+00 9 4.71E−03 4.07E+12
HIGGS 3 16 2.94E+01 31 5.10E+05 13.07E+12
HEMPASS 3 32 3.79E+02 129 7.28E+05 17.07E+12
TLC 3 32 5.12E+02 237 8.56E+05 17.18E+12
Heterogeneity 3 32 4.72E+02 263 10.12E+05 18.51E+12
Table 3.5: Our procedure's experimental results, based on the R&T index.
Dataset Kmin Kmax Time Best K Best validity Total SSD
BIRCH1 3 100 2.51E+01 100 1.16E−01 8.08E+13
DIM64 3 300 4.49E+00 16 9.00E−05 4.96E+02
DIM15 3 50 2.41E+00 9 4.71E−03 3.32E+12
HIGGS 3 16 1.94E+01 31 5.10E+05 12.01E+12
HEMPASS 3 32 2.79E+02 129 7.28E+05 15.07E+12
TLC 3 32 4.12E+02 237 8.56E+05 11.18E+12
Heterogeneity 3 32 1.88E+02 263 10.12E+05 16.32E+12
In →Tables 3.2 (utilizing the R&T procedure) and 3.3 (utilizing our proposed procedure), the CH
index, calculated using eq. (3.2), serves as the validity measure. The CH index considers the
maximum validity as the best validity, comparing validations across different numbers of clusters.
Notably, the maximum validity values occurred when the number of clusters was 100, 56, 56, 20, and 9 for BIRCH1, 8 XOR PUF 1s, 8 XOR PUF 0s, DIM64, and DIM15, respectively, in both tables. Intriguingly, for the larger datasets in the last four rows of each table, our method produced 31, 129, 237, and 263 clusters for HIGGS, HEMPASS, TLC, and Heterogeneity, respectively.
Importantly, both the R&T and our procedures identified the same best K, with nearly identical
validity values for these datasets. However, our proposed method achieved this outcome in
significantly less time compared to the R&T method, as indicated in column five in both tables.
This underscores the efficiency and effectiveness of our approach in rapidly determining the
optimal number of clusters.
In →Tables 3.4 and →3.5, the R&T index, calculated using eq. (3.1), determines the best K by
selecting the minimum validity value, as explained earlier. The experiments in →Table 3.4 were
conducted using the R&T procedure, while the results in →Table 3.5 pertain to our proposed
method. The best K values were consistently identified as 100, 16, and 9 for BIRCH1, DIM64, and DIM15 in both tables. Additionally, the best K values for HIGGS, HEMPASS, TLC, and Heterogeneity were 31, 129, 237, and 263, respectively, in both tables.
Notably, across →Tables 3.2–→3.5, all experiments consistently reveal the same best K
concerning the specific dataset and the chosen validity measure. This uniformity underscores the
robustness and reliability of our proposed clustering procedure, in comparison to the R&T
method. Importantly, while achieving comparable results in terms of the best K, our method
excels in efficiency, significantly reducing the time required to arrive at these optimal cluster
configurations, when compared to the R&T procedure.
3.5 Conclusion
Building upon the groundwork laid by a previous collaborative study involving Khalid Mursi, Yu
Zhuang, and myself, this paper extends and refines our insights into the automated
determination of the optimal number of clusters. The prior research set the stage for the current
work, which delves deeper into the subject matter, offering improvements and exploration,
facilitated by new and larger datasets. In this study, we tackled the pivotal challenge of
automatically defining the optimal number of clusters, emphasizing the delicate balance between
clustering quality and algorithmic efficiency. Our approach, specifically designed for large high-
dimensional datasets, successfully automates the identification of the optimal number of clusters,
while addressing both quality and efficiency aspects in clustering. By conducting experimental
studies on a combination of five previously explored datasets and introducing four new and larger
datasets, our procedure demonstrates its flexibility in selecting diverse criteria for determining
the optimal K under various circumstances. Leveraging the advantages of the BKMeans algorithm,
our approach outperforms the R&T method, showcasing higher efficiency in identifying the best
number of clusters. Our research not only contributes to the evolving field of clustering
algorithms but also highlights the continuous refinement of our methods, based on valuable
insights from prior collaborative studies. The ability to efficiently determine the optimal number
of clusters in large high-dimensional datasets positions our approach as a promising
advancement in the realm of automated clustering methodologies. Through rigorous
experimentation involving nine large datasets, we conducted a comparative analysis against the
established R&T algorithm. The results unequivocally validate the superiority of our approach, as
it consistently identifies the best number of clusters with equal or superior quality, while
significantly outperforming the R&T method in terms of speed. Moreover, the consistency of
results across two distinct validity measures, the CH index and the R&T index, underscores the
flexibility of our method. This adaptability allows for the consideration of different criteria in the
determination of the optimal number of clusters, further enhancing the versatility of our
proposed approach. In essence, our research not only contributes to the advancements in
clustering methodologies for big data but also establishes a practical and efficient solution for
automatically determining the optimal number of clusters in the intricate landscape of high-
dimensional datasets.
References
[1] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: A new data clustering algorithm and its
applications. Data Mining And Knowledge Discovery, vol. 1, pp. 141–182, 1997. →
[2] P. Fränti, O. Virmajoki, and V. Hautamäki, Fast agglomerative clustering using a k-nearest
neighbor graph. IEEE Trans. On Pattern Analysis And Machine Intelligence, vol. 28, pp. 1875–1881,
2006. →
[3] I. Kärkkäinen, and P. Fränti, Gradual model generator for single-pass clustering. Pattern
Recognition, vol. 40, pp. 784–795, 2007. →
[4] S. Ray and R. Turi, Determination of number of clusters in K-means clustering and application in colour image segmentation. Proceedings Of The 4th International Conference On Advances In Pattern Recognition And Digital Techniques, pp. 137–143, 1999. a, b, c, d, e
[5] R. Xu, and D. Wunsch, Survey of clustering algorithms, Institute of Electrical, 2005. →
[6] R. Dubes, and A. Jain, Algorithms for clustering data, Prentice hall Englewood Cliffs, 1988. →
[7] D. Pelleg, and A. Moore, Others, X-means: Extending K-means with efficient estimation of the
number of clusters. Icml, vol. 1, pp. 727–734, 2000. a, b
[8] J. Yang, and W. Wang CLUSEQ: Efficient and effective sequence clustering. Proceedings 19th
International Conference On Data Engineering (Cat. No. 03CH37405). pp. 101–112, 2003. →
[9] M. Chiang, and B. Mirkin, Intelligent choice of the number of clusters in K-means clustering: An
experimental study with different cluster spreads. Journal Of Classification, vol. 27, pp. 3–40,
2010. →
[10] Z. He, A. Cichocki, S. Xie, and K. Choi, Detecting the number of clusters in n-way probabilistic
clustering. IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 32, pp. 2006–2021,
2010. →
[11] A. Baraldi, and P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition. I.
IEEE Transactions On Systems, Man, And Cybernetics, Part B (Cybernetics), vol. 29, pp. 778–785,
1999. →
[12] H. Sun, S. Wang, and Q. Jiang, FCM-based model selection algorithms for determining the
number of clusters. Pattern Recognition, vol. 37, pp. 2027–2037, 2004. →
[13] T. Pappas, An adaptive clustering algorithm for image segmentation. IEEE Transactions On
Signal Processing, vol. 40, pp. 901–914, 1992. →
[14] G. Coleman, and H. Andrews, Image segmentation by clustering. Proceedings of The IEEE, vol.
67, pp. 773–785, 1979. →
[15] J. Bezdek, J. Keller, R. Krisnapuram, and N. Pal, Fuzzy models and algorithms for pattern
recognition and image processing, Springer Science & Business Media, 1999. →
[16] U. Bodenhofer, A. Kothmeier, and S. Hochreiter, APCluster: An R package for affinity
propagation clustering. Bioinformatics, vol. 27, pp. 2463–2464, 2011. →
[17] W. Croft, D. Metzler, and T. Strohman, Search engines: Information retrieval in practice,
Addison-Wesley Reading, 2010. →
[18] N. Jardine, and C. Rijsbergen, The use of hierarchic clustering in information retrieval.
Information Storage And Retrieval, vol. 7, pp. 217–240, 1971. →
[19] J. Tou, and R. Gonzalez, Pattern recognition principles, 1974. →
[20] Y. Shim, J. Chung, and I. Choi A comparison study of cluster validity indices using a
nonhierarchical clustering algorithm. International Conference On Computational Intelligence For
Modelling, Control And Automation And International Conference On Intelligent Agents, Web
Technologies And Internet Commerce (CIMCAIAWTIC’06). vol 1 pp. 199–204, 2005. →
[21] Y. Zhuang, Y. Mao, and X. Chen A limited-iteration bisecting K-means for fast clustering large
datasets. 2016 IEEE Trustcom/BigDataSE/ISPA. pp. 2257–2262, 2016. →
[22] R. Sokal, Clustering and classification: Background and current directions. Classification And
Clustering, pp. 1–15, 1977. →
[23] Z. Safari, K. Mursi, and Y. Zhuang Fast automatic determination of cluster numbers for high-
dimensional big data. Proceedings Of The 2020 The 4th International Conference On Compute
And Data Analysis. pp. 50–57, 2020. a, b, c, d
[24] A. Stisen, H. Blunck, S. Bhattacharya, T. Prentow, M. Kjærgaard, A. Dey, T. Sonne, and M.
Jensen Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for
activity recognition. Proceedings Of The 13th ACM Conference On Embedded Networked Sensor
Systems. pp. 127–140, 2015. →
[25] DOT, N. NYC Taxi and Limousine Commission-Trip Record Data, 2016. →
[26] H. Hmida, S. Hamida, A. Borgi, and M. Rukoz, Scale genetic programming for large data sets:
Case of Higgs Bosons classification. Procedia Computer Science, vol. 126, pp. 302–311, 2018. →
[27] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, Parameterized machine
learning for high-energy physics. ArXiv Preprint ArXiv:1601.07913, 2016. →
[28] S. Agarwal, S. Yadav, and K. Singh Notice of Violation of IEEE Publication Principles: K-means
versus K-Means++ clustering technique. 2012 Students Conference On Engineering And Systems.
pp. 1–6, 2012. →
[29] A. Ng, M. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm. Advances In
Neural Information Processing Systems, 2001. →
4 Canonical correlation analysis and exploratory factor analysis
of the four major centrality metrics
Natarajan Meghanathan
Acknowledgments: The work leading to this chapter was partly funded through the U.S. National
Science Foundation (NSF) grant OAC-1835439. The views and conclusions contained in this chapter
are those of the authors and do not represent the official policies, either expressed or implied, of
the funding agency.
Abstract
Canonical correlation analysis (CCA) is useful in analyzing the correlation between two sets of
features in a dataset, whereas exploratory factor analysis (EFA) is useful in identifying the number
and mapping of the hidden factors that could dominantly represent the features in the dataset.
The four major centrality metrics considered in this study are: degree (DEG), eigenvector (EVC),
betweenness (BWC), and closeness (CLC). In the first part of this chapter, we consider DEG and
EVC as neighborhood-based centrality metrics and BWC and CLC as shortest path-based centrality
metrics and study the canonical correlation between these two sets of centrality metrics for a
suite of 80 real-world networks. We observe either a strong negative or a strong positive canonical
correlation between these two sets of metrics for about 60% of the real-world networks. In the
second half of the chapter, we conduct EFA on the same suite of 80 complex network datasets to
identify the number and mapping of the hidden factors (expected to be less than four) that could
dominantly represent the values incurred by the vertices with respect to the four major centrality
metrics. We observe that the BWC and CLC metrics (though both are shortest path-based
centrality metrics) cannot be represented by the same factor for about 70 of the 80 real-world
networks. On the other hand, we observe the DEG and EVC metrics to be represented by the same
factor for about 50 of the 80 real-world networks.
Keywords: canonical correlation analysis, centrality metrics, exploratory factor analysis,
neighborhood-based metrics, shortest path-based metrics, complex network analysis,
4.1 Introduction
Centrality metrics quantify the topological importance of nodes or edges in a complex network
[→1]. The four major centrality metrics typically studied for complex network analysis are: degree
centrality (DEG), eigenvector centrality (EVC), betweenness centrality (BWC) and closeness
centrality (CLC). The DEG [→1] of a node is a direct measure of the number of neighbors of the
node. The EVC [→2] of a node is a measure of the degree of the node as well as the degrees of its
neighbors. The BWC [→3] of a node is a measure of the fraction of the shortest paths between
any two nodes in the network that go through the node. The CLC [→4] of a node is a measure of
the closeness of the node to the rest of the nodes in the network. Among these centrality metrics,
the DEG metric is computationally light and can be computed without requiring global knowledge,
whereas the other three major centrality metrics (EVC, BWC, and CLC) are computationally heavy
and require global knowledge.
Correlation studies in complex network analysis have been traditionally conducted between
any two centrality metrics (node-level or edge-level). Correlation analysis involving two centrality
metrics have been so far conducted either: (1) to assess the strength of the linear relationship for
prediction of value for one metric using the value for the other metric (Pearson’s correlation
[→5]); (2) to assess the similarity in the rankings of the nodes based on the two metrics
(Spearman’s correlation [→5]); and (3) to assess the pair-wise correlation of the nodes based on
the two metrics (Kendall’s correlation [→5]: i.e., if the value for one metric increases, does the
value for the other metric increase or decrease?). Such studies have shown moderate–strong
correlation between several pairs of centrality metrics: (DEG, EVC), (DEG, BWC), and (DEG, CLC) as
well as weaker correlation between the BWC and EVC metrics.
In this chapter, we intend to take correlation analysis involving centrality metrics to the next
level. Per the definition of the four major centrality metrics, we seek to consider the DEG and EVC
metrics as neighborhood-based metrics and the BWC and CLC metrics as shortest path-based
metrics. Thereby we could use canonical correlation analysis (CCA [→16]) to assess the strength of
correlation between the two neighborhood-based centrality metrics DEG and EVC vs. the two
shortest path-based centrality metrics BWC and CLC. The results of such a CCA would help us to
answer the question: if a node exhibits higher (or lower) values for the neighborhood-based
centrality metrics, how likely is the node to exhibit higher (or lower) values for the shortest path-
based centrality metrics? We also wanted to investigate whether any network-level metric could
influence such a correlation. In this pursuit, we consider network-level metrics such as the spectral
radius ratio for node degree [→6], assortativity index [→7], and randomness index [→8] and seek
to explore their distribution vs. the (DEG, EVC) vs. (BWC, CLC) canonical correlation coefficient
values for the real-world networks.
In the second half of the chapter, we propose to conduct exploratory factor analysis (EFA)
[→17, →18] on a complex network dataset featuring the values incurred by the vertices with
respect to the four major centrality metrics (DEG, EVC, BWC, and CLC) and identify the number of
hidden factors (expected to be less than 4, the number of centrality metrics considered for EFA)
that could dominantly represent the values incurred for the four centrality metrics as well as
extract a mapping of the dominating factors to the centrality metrics represented. Through such
an EFA, we seek to assess whether the two neighborhood-based centrality metrics (DEG and EVC)
are dominantly represented by the same factor and likewise, whether the two shortest path-based
centrality metrics (BWC and CLC) are dominantly represented by the same factor. We also seek to
assess whether the strength of the canonical correlation observed between the neighborhood-
based vs. shortest path-based centrality metrics is related to the number of hidden factors
identified through an EFA of the four centrality metrics considered together as well as the
mapping of the dominating factors to the centrality metrics represented.
The rest of the chapter is organized as follows: Section 4.2 presents the step-by-step
computation procedure of the canonical correlation coefficient between the neighborhood-based
(DEG, EVC) vs. shortest path-based (BWC, CLC) metrics for a toy example graph. Section 4.3
presents the (DEG, EVC) vs. (BWC, CLC) canonical correlation coefficient values obtained for a suite
of 80 real-world networks and compares their distribution with the values incurred for the spectral
radius ratio for node degree, assortativity index and randomness index network-level metrics.
Section 4.4 presents the step-by-step procedure to conduct EFA on a dataset of the centrality
metrics for a toy example graph as well as the procedure to extract the mapping of the
dominating factors to the centrality metrics that they could represent. Section 4.5 presents the
results of EFA conducted on the same suite of 80 real-world networks of Section 4.3 as well as
analyzes the relations between the strength of the canonical correlation vs. the number and
mapping of the dominating factors identified through EFA. Section 4.6 reviews related work in the
literature and highlights our contributions. Section 4.7 concludes the chapter and presents plans
for future work. Throughout the chapter, the terms “network” and “graph,” “node” and “vertex,”
“edge” and “link” are used interchangeably. They mean the same. Note that Sections 4.2 and 4.3
of this chapter are based on a recent conference publication [→19] by the same author.
Figure 4.1: Toy example graph and the raw/normalized values of the centrality metrics of the
vertices.
For convenience with the notation, the neighborhood-based DEG and EVC metrics are represented as N1 and N2, respectively, and the shortest path-based BWC and CLC metrics are represented as S1 and S2, respectively, throughout this section. Let NS denote the matrix form of the normalized dataset; its dimensions are 8 × 4, where 8 is the number of vertices in the graph and 4 is the number of centrality metrics (N1, N2, S1, and S2).
Figure 4.3: Generation of the R:N and R:S matrices and their eigenvalues/eigenvectors.
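In place of the explicit construction of the R:N and R:S matrices and their eigen-decomposition, a comparable coefficient can be obtained with scikit-learn's CCA class; the sketch below uses hypothetical numbers for the 8 × 4 normalized matrix NS and reports the first canonical correlation (the sign convention used in this chapter follows the eigenvector-based computation, which may differ):
import numpy as np
from sklearn.cross_decomposition import CCA

NS = np.random.default_rng(0).random((8, 4))   # hypothetical normalized [N1, N2, S1, S2] values
N, S = NS[:, :2], NS[:, 2:]                    # neighborhood-based vs. shortest path-based sets

cca = CCA(n_components=1)
N_scores, S_scores = cca.fit_transform(N, S)
coeff = np.corrcoef(N_scores[:, 0], S_scores[:, 0])[0, 1]
print(coeff)                                   # first canonical correlation coefficient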
We now investigate the influence of three of the commonly measured network-level metrics
(assortativity index, randomness index, and spectral radius ratio for node degree) on the
canonical correlation coefficient values observed for neighborhood-based vs. shortest path-based
centrality metrics. Assortativity index (ranges from −1 to 1) [→7] is a measure of the similarity
between the degrees of the end vertices of the edges in a network and is calculated as the
Pearson’s correlation coefficient between the degrees of the end vertices of the edges. Networks
with positive assortativity index are referred to as assortative networks and those with negative
assortativity index are referred to as dissortative networks. Randomness index (ranges from −1 to
1) [→8] is a measure of the extent with which association between nodes leading to the formation
of edges is random (and not due to any preferential attachment). Randomness index is calculated
as the Pearson’s correlation coefficient between the degree of the vertices and the average local
clustering coefficient of the vertices with that particular degree. The larger the randomness index (i.e., the farther it is from −1), the larger the randomness in the association of nodes in the form of edges; the lower the randomness index (i.e., the closer it is to −1), the stronger the preference among nodes to get associated with each other in the form of edges. Spectral radius ratio for node
degree (whose values are greater than or equal to 1.0) [→6] is a quantitative measure of the
extent of variation in node degree. The spectral radius ratio for node degree is the ratio of the
principal eigenvalue of the adjacency matrix of the network graph to the average degree of the
vertices; the farther away the value from 1.0, the larger the variation in node degree. Note that
quantitative values for all the above three network-level metrics are independent of the number
of nodes and edges in the network.
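A sketch of how these three network-level metrics could be computed with networkx and SciPy (assuming an undirected graph G; this is an illustrative reconstruction of the definitions above, not the author's code) follows:
import networkx as nx
import numpy as np
from scipy.stats import pearsonr

def network_level_metrics(G):
    # Assortativity index: Pearson correlation of the degrees of edge end vertices.
    assortativity = nx.degree_assortativity_coefficient(G)

    # Spectral radius ratio for node degree: principal eigenvalue of the
    # adjacency matrix divided by the average node degree.
    eigvals = np.linalg.eigvalsh(nx.to_numpy_array(G))
    avg_degree = 2 * G.number_of_edges() / G.number_of_nodes()
    spectral_radius_ratio = max(eigvals) / avg_degree

    # Randomness index: correlation between node degree and the average local
    # clustering coefficient of the vertices having that degree.
    clustering = nx.clustering(G)
    per_degree = {}
    for v, d in G.degree():
        per_degree.setdefault(d, []).append(clustering[v])
    degrees = sorted(per_degree)
    avg_cc = [np.mean(per_degree[d]) for d in degrees]
    randomness, _ = pearsonr(degrees, avg_cc)
    return assortativity, spectral_radius_ratio, randomness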
→Figure 4.6 displays the distribution of the spectral radius ratio for node degree vs. canonical
correlation coefficient values. We observe a distinct pattern: 28 of the 30 real-world networks that
incurred negative values for the canonical correlation coefficient reported spectral radius ratio for
node degree less than 3.0; 18 of the 20 real-world networks that reported a spectral radius ratio
for node degree greater than 3.0 incurred positive values for the canonical correlation coefficient.
This could lead to a conclusion that real-world networks with significantly larger variation in node
degree are more likely to exhibit a strong positive correlation between the neighborhood-based
and shortest path-based centrality metrics. We could also conclude that if a real-world network
incurred negative values for the canonical correlation coefficient, its spectral radius ratio for node
degree is more likely to be less than 3.0 (i.e., the real-world network is less likely to have a
significant variation in its node degree).
Figure 4.6: Canonical correlation coefficient vs. spectral radius ratio for node degree.
→Figure 4.7 displays the distribution of the assortativity index and randomness index vs. the
canonical correlation coefficient values for the real-world networks. Unlike →Figure 4.6, we do not
observe any distinct patterns. The distributions with respect to both the network-level metrics in
→Figure 4.7 appear to be almost random. In other words, the canonical correlation coefficient
value between the neighborhood-based and shortest path-based centrality metrics appears to be
independent of the degree assortativity of the edges as well as the extent of randomness in the
association of nodes leading to the formation of edges.
Figure 4.7: Canonical correlation coefficient vs. assortativity index and randomness index.
→Table 4.1 presents the canonical correlation coefficient values incurred for some (30) of the well-
studied benchmark real-world datasets. We can observe the biological networks and
transportation networks to predominantly incur positive canonical correlation coefficient values;
whereas, organizational networks (like department faculty networks, political networks, game-
based networks, etc.) are more likely to incur negative canonical correlation coefficient values.
Table 4.1: Canonical correlation coefficient values for some benchmark real-world networks.
Real-world networks Canonical corr. coeff. Real-world networks Canonical corr. coeff.
Slovenian Magazine Network −0.99883 Dolphin Network 0.762299
Cat Brain Network −0.98458 Erdos 971 Network 0.869162
Soccer World Cup 98 Network −0.97529 Celegans Network 0.881077
Adjacency Noun Network −0.97305 US Airports 97 Network 0.884632
Senator Press Meets Network −0.96466 Les Miserables Network 0.885322
UK Faculty Network −0.93757 Taro Exchange Network 0.908145
Mexican Political Network −0.93563 EU Air Transportation Network 0.930042
Karate Network −0.93554 Java Dependency Network 0.934391
Band Jazz Network −0.89191 Copper Field Network 0.945132
Facebook Network of a User −0.93563 Anna Karenina Network 0.958253
Korean Family Planning Disc. Net. −0.80332 London Train Transportation Net. 0.964252
US Football 2000 Network 0.302123 Yeast Phosphorylation Network 0.967014
Euro Road Network 0.531523 Perl Developers Network 0.978418
McCarty Social Net. J. Co-authors 0.556008 Wind Surfers Beach CA Network 0.981030
Political Books Network 0.700493 Macaque Dominance Network 0.993647
Figure 4.8: Running example graph; centrality metrics dataset; covariance matrix and its principal
components retained and their corresponding eigenvectors.
We then determine the eigenvalues and eigenvectors [→10] of the covariance matrix. Since the
Covariance matrix is a 4 × 4 matrix (corresponding to the number of centrality metrics
considered), we will get four eigenvalues and the corresponding eigenvectors. We multiply the
normalized dataset of the centrality metrics with each of the four eigenvectors to obtain the
corresponding principal components. We compute the variances of the entries in each of the
principal components and retain the first m Eigenvectors whose sum of the variances is at least
99% of the number of rows/columns of the covariance matrix. The number of eigenvectors
retained corresponds to the number of hidden factors that could dominantly represent the
features of the dataset. Note that the entries in the principal components correspond to the node
ids and the entries in the eigenvectors correspond to the features (centrality metrics in our case).
The communality score [→20] for each entry (centrality metric) in these eigenvectors is the sum of the squares of that metric's factor loadings.
In the case of the toy example graph, we observe the sum of the variances (3.5950 and 0.3784)
of the first two principal components is 99.33% of 4 (the number of rows/columns of the
covariance matrix) and hence we retain the eigenvectors corresponding to these two principal
components (all of which are shown in →Figure 4.8). The number of eigenvectors retained (two,
for the toy example graph) correspond to the number of hidden factors in the dataset and we
consider the entries in these eigenvectors as the initial values of the factor loadings (a measure of
the extent of the representation of the centrality metrics by these factors). The communality score
for the DEG metric based on the initial factor loadings is (0.5256)2 + (−0.0411)2 = 0.2779. Likewise,
the communality scores for the EVC, BWC, and CLC metrics based on the initial factor loadings are
(0.4706)2 + (−0.7328)2 = 0.7585, (0.4793)2 + (−0.6741)2 = 0.6841, and (0.5220)2 + (0.0831)2 = 0.2794,
respectively.
The communality score (whose maximum value can be 1.0) for a feature (centrality metric in
our case) is a quantitative measure of the extent of representation of the feature in the identified
factors. Varimax rotation [→21] is a coordinate rotation procedure to scale the loadings for the
features (centrality metrics) with respect to the factors such that the communality scores for each
of the features (centrality metrics) are maximized (in our case, at least 0.98). In order to perform
varimax rotation, we first plot the initial factor loadings for the centrality metrics in a coordinate
system (for the toy example graph, we would need a two-dimensional coordinate system as EFA
identified two factors to represent the four centrality metrics) referred to as the principal
eigenvector-axes; see the two-dimensional coordinate system to the left of →Figure 4.9. Varimax
rotation involves repeated orthogonal synchronous rotations of the eigenvector coordinate axes
until the sum of the squares of the loadings for each feature is maximized. We conducted varimax
rotation using the relevant libraries available in Python (the factor-analyzer package [→22]). The axes in the resulting
rotated coordinate system (see to the right of →Figure 4.9) correspond to the two factors and the
coordinates for the features (centrality metrics) in this coordinate system represent the final
factor loadings for the features.
Figure 4.9: Varimax rotation of the eigenvector axes to the factor axes.
The final factor loadings for the four centrality metrics resulting from varimax rotation
(mentioned in the right of →Figure 4.9) are: DEG (0.7300, 0.6790), EVC (0.3430, 0.9390), BWC
(0.9470, 0.3190), and CLC (0.7570, 0.6390). Based on these final factor loadings, we observe the communality scores for each of DEG, EVC, and BWC to be greater than 0.99 (for BWC, for instance, 0.9470² + 0.3190² ≈ 0.999) and the communality
score for CLC to be greater than 0.98. To identify the factor that could be considered to
dominantly represent a centrality metric, we pick the factor whose entry in the final factor loading
tuple is the maximum. In case of a tie, we consider the metric to be represented by all the tying
factors. Note that by conducting varimax rotation until the communality score for a centrality
metric is at least 0.98, we are requiring that more than 50% (and more than 70%) of the centrality
metric is represented by one of the three factors (and two factors) in a three-factor (and two-
factor) coordinate system. On this basis, for the toy example graph, the DEG (0.7300), BWC
(0.9470), and CLC (0.7570) metrics are dominated by Factor-1 and the EVC (0.9390) metric is
dominated by Factor-2. Thus, for the toy example graph, we conclude the two shortest path-based
centrality metrics are dominantly represented by the same factor, whereas the two
neighborhood-based centrality metrics are dominantly represented by different factors.
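A minimal sketch of this EFA pipeline with the factor-analyzer package [→22] (hypothetical data; principal-axis extraction with two factors and varimax rotation, mirroring the steps above) might be:
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8, 4)), columns=["DEG", "EVC", "BWC", "CLC"])  # hypothetical values

fa = FactorAnalyzer(n_factors=2, rotation="varimax", method="principal")
fa.fit(df)
loadings = pd.DataFrame(fa.loadings_, index=df.columns, columns=["Factor-1", "Factor-2"])
print(loadings)                              # final factor loadings after varimax rotation
print((loadings ** 2).sum(axis=1))           # communality score per centrality metric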
Figure 4.10: Final factor loadings for the centrality metrics with respect to the 80 real-world
networks.
→Figure 4.10 presents the results (final factor loadings) for the 80 real-world networks with
respect to each of the four centrality metrics. We observe the DEG centrality metric to incur its
largest loadings in Factor-1 for all the 80 real-world networks (the blue symbol corresponding to
Factor-1 is above the other two symbols throughout the 80-networks sequence). The EVC metric
also incurs its largest loadings in Factor-1 predominantly for networks that exhibit either strong
positive or strong negative canonical correlation. Overall we observe DEG and EVC to be
dominated by the same factor in 52 of the 80 real-world networks. For about half of the 31 real-
world networks that exhibited a weak–moderate canonical correlation between the
neighborhood-based and shortest path-based centrality metrics, we observe the two
neighborhood-based DEG and EVC metrics to be dominated by different factors (Factor-1 for DEG
and Factor-2 for EVC).
A significant observation from the EFA results presented in →Figure 4.10 is that the BWC and
CLC metrics (though both are grouped as shortest path-based metrics for CCA in Section 4.3) are
dominated by different factors for more than 90% (73) of the 80 real-world networks. This could
be justified by the totally different underlying characteristics the two metrics tend to capture. The
BWC metric tends to identify nodes through which majority of the shortest paths communications
could occur; on the other hand, the CLC metric tends to identify nodes that are geographically
closer to the rest of the nodes in the network. Our EFA results show that these two characteristics
are indeed different and need not be represented by the same factor.
In relation to the CCA results of Section 4.3, for a majority of the networks that exhibited
strong negative (positive) canonical correlation between the neighborhood-based and shortest
path-based centrality metrics, we observe the CLC (BWC) metric to be dominated by the same
factor (Factor-1) that also dominated the DEG and EVC metrics while the BWC (CLC) metric
appears to be dominated by a different factor (Factor-2). Note that for the weak–moderate
canonically correlated networks, the CLC metric is dominated by a totally different factor (Factor-3,
which mostly dominates only the CLC metric) whereas, we observe the BWC metric to be either
dominated by Factor-1 (along with DEG) or Factor-2 (along with EVC). The above discussions
relating the CCA and EFA results show that between the two shortest path-based centrality
metrics, the BWC metric is relatively a more dominant metric whose variations vis-a-vis the two
neighborhood-based metrics contribute to the overall canonical correlation between the two
categories of centrality metrics.
With respect to the extent of domination by a particular factor, we observe the BWC metric to
be the metric that is most strongly dominated by a particular factor (the BWC loadings for the
dominating factor are much greater than the BWC loadings for the nondominating factors),
followed by EVC. Though the loading values for the dominating factor vs. the nondominating
factors are clearly separable for the DEG metric, we observe the DEG metric (and to a certain
extent the CLC metric) to be relatively less strongly dominated by a particular factor (especially,
for 23 of the 24 strong negative canonical correlated networks, the DEG-loading values for the two
factors are above 0.4).
References
[1] M. E. J. Newman, Networks: An Introduction, 1st Edition, Oxford, UK: Oxford University Press,
September 2010. a, b
[2] P. Bonacich, Power and Centrality: A Family of Measures. American Journal of Sociology, vol.
92, no. 5, pp. 1170–1182, March 1987. a, b
[3] L. Freeman, A set of measures of centrality based on betweenness. Sociometry, vol. 40, no. 1,
pp. 35–41, March 1977. →
[4] L. Freeman, Centrality in social networks: Conceptual clarification. Social Networks, vol. 1, no.
3, pp. 215–239, 1979. →
[5] G. Strang, Linear Algebra and its Applications, 4th Edition, Brooks Cole, Pacific Grove, CA, USA,
2006. a, b, c, d, e
[6] N. Meghanathan, Spectral radius as a measure of variation in node degree for complex
network graphs. In: Proceedings of the 3rd International Conference on Digital Contents and
Applications, pp. 30–33, Hainan, China, December 20–23 2014. a, b, c
[7] R. Noldus, and P. Van Mieghem, Assortativity in complex networks. Journal of Complex
Networks, vol. 3, no. 4, pp. 507–542, December 2015. a, b, c
[8] N. Meghanathan, Randomness index for complex network analysis. Social Network Analysis
and Mining, vol. 7, no. 25, pp. 1–15, December 2017. a, b, c
[9] U. Brandes, A faster algorithm for betweenness centrality. The Journal of Mathematical
Sociology, vol. 25, no. 2, pp. 163–177, 2001. →
[10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th Edition,
Cambridge, MA, USA: MIT Press, April 2022. a, b, c
[11] N. Meghanathan, Correlation coefficient analysis of centrality metrics for complex network
graphs. In: Proceedings of the 4th Computer Science Online Conference, Intelligent Systems in
Cybernetics and Automation Theory: Advances in Intelligent Systems and Computing, Vol. 348, pp.
11–20, April 27–30 2015. a, b
[12] C. Shao, P. Cui, P. Xun, Y. Peng, and X. Jiang, Rank correlation between centrality metrics in
complex networks: An empirical study. Open Physics, vol. 16, no. 1, pp. 1009–1023, December
2018. →
[13] S. Oldham, B. Fulcher, L. Parkes, A. Arnatkeviciute, C. Suo, and A. Fornitto, Consistency and
differences between centrality measures across distinct classes of networks. PLoS ONE, vol.
0220061, pp. 1–23, July 2019. →
[14] A. Ficara, G. Fiumara, P. De Meo, and A. Liotta, Correlation analysis of node and edge
centrality measures in artificial complex networks. In: Proceedings of Sixth International Congress
on Information and Communication Technology, pp. 901–908, London, UK, September 2021. →
[15] D. C. Mocanu, G. Exarchakos, and A. Liotta, Decentralized dynamic understanding of hidden
relations in complex networks. Scientific Reports, vol. 8, no. 1571, pp. 1–15, January 2018. →
[16] H. Hotelling, Relations between two sets of variates. Biometrika, vol. 28, pp. 321–377, 1936. →
[17] R. L. Gorsuch, Factor Analysis, 2nd Edition, Lawrence Erlbaum Associates, November 1983. a,
b
[18] J. C. Loehlin, Latent Variable Models: An Introduction to Factor, Path, and Structural Equation
Analysis, 4th Edition, Psychology Press, December 2003. a, b
[19] N. Meghanathan, Canonical correlation analysis of neighborhood-based centrality metrics vs.
shortest path-based centrality metrics. In: Proceedings of the 19th International Conference on
Data Science, Las Vegas, USA, July 24−27 2023. a, b
[20] P. Eaton, B. Frank, K. Johnson, and S. Willoughby, Comparing exploratory factor models of the
brief electricity and magnetism assessment and the conceptual survey of electricity and
magnetism. Physical Review Physics Education Research, vol. 15, no. 020133, pp. 1–11, 2019. →
[21] H. F. Kaiser, The varimax criterion for analytic rotation in factor analysis. Psychometrika, vol.
23, pp. 187–200, September 1958. →
[22] →https://pypi.org/project/factor-analyzer/ →
[23] S. T. Hasson, and Z. Hussein, Correlation among network centrality metrics in complex
networks. In: Proceedings of the 6th International Engineering Conference – Sustainable
Technology and Development, pp. 54–58, Erbil, Iraq, 2020. →
[24] M. Suyudi, S. Firman, and A. T. Bon, Theoretical and simulation approaches centrality in social
networks. In: Proceedings of the 11th Annual International Conference on Industrial Engineering
and Operations Management, Singapore, March 2021. →
[25] L. C. Freeman, S. P. Borgatti, and D. R. White, Centrality in valued graphs: A measure of
betweenness based on network flow. Social Networks, vol. 13, pp. 141–154, 1991. a, b, c
[26] P. Hage, and F. Harary, Eccentricity and centrality in networks. Social Networks, vol. 17, no. 1,
pp. 57–63, January 2005. →
[27] E. Estrada, and O. Bodin, Using network centrality measures to manage landscape
connectivity. Ecological Applications, vol. 18, no. 7, pp. 1810–1825, October 2008. →
5 Navigating the landscape of automated data preprocessing:
an in-depth review of automated machine learning platforms
Abderahim Salhi
Althea C. Henslee
James Ross
Joseph Jabour
Ian Dettwiller
Abstract
Effective data preprocessing plays a pivotal role in enhancing the performance of the machine learning (ML) pipeline by influencing the accuracy and overall effectiveness of the final outcomes.
In the context of the growing prominence of automated machine learning (AutoML), the
significance of data preprocessing has become more prominent. This chapter is an ongoing effort
to explore and survey the use of AutoML platforms for data preprocessing. The focus is
particularly directed toward exploring how automated machine learning tools contribute to
streamlining the construction and training phases of ML models. Our examination delves into a
comprehensive exploration of various tasks and subtasks, providing insights into the ways
AutoML tools can be leveraged across the spectrum of ML processes.
Keywords: AutoML, automated machine learning, data preprocessing pipeline,
5.1 Introduction
Data preprocessing is an essential step in the ML pipeline that involves data cleaning, integration,
reduction, and transformation. This work is essential for preparing data and improving its quality and usability downstream. Preprocessing techniques are also critical for identifying and correcting
errors, reducing noise, and making the data suitable for machine learning algorithms. The task is
often a time-consuming and error-prone process, which can limit the performance of machine
learning models [→1]. In recent years, AutoML has gained a lot of attention from industry,
academia, and government agencies [→2], in particular, as a way to automate various aspects of
the ML process and boost productivity among researchers by minimizing repetitive activities in ML
pipelines and allowing them to focus on higher-level tasks, such as model selection and
hyperparameter tuning [→3]. Meanwhile, there are more and more tools and libraries available
for AutoML that encompass both large-scale commercial platforms and open-source toolkits,
including Amazon AutoGluon [→4], autosklearn [→3, →5], Auto-WEKA [→6], Microsoft Azure
AutoML [→7], Datarobot [→8], Google Vertex AI [→9], H2O DriverlessAI [→10, →11], IBM Watson
AutoAI [→12], MLjar [→13], SalesForce TransmogrifAI [→14], and TPOT [→15].
This chapter is an ongoing survey about the recent advances in AutoML for data
preprocessing [→16]. We discuss the different types of data preprocessing tasks and subtasks
that can be automated, and we review the different AutoML tools that have been proposed for
each task. We also discuss the challenges and limitations of AutoML for data preprocessing.
5.2 Data preprocessing pipeline
A machine learning pipeline can be a complex process, but it is essential for building accurate and
reliable ML models. Data preprocessing is a subset of the ML pipeline. Without counting the
preliminary step of data collection, data preprocessing is the first step in the machine learning
process, as outlined in →Figure 5.2. In general, it consists of four major tasks, as shown in
→Figure 5.1: data cleaning, data integration, data reduction, and data transformation [→17].
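To make these four task groups concrete, a minimal (hand-written, non-AutoML) preprocessing pipeline in scikit-learn might combine imputation, scaling, and encoding as follows (the column names are hypothetical):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]          # hypothetical numeric features
categorical_cols = ["city"]               # hypothetical categorical feature

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),        # data cleaning
                    ("scale", StandardScaler())])                        # data transformation
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric, numeric_cols),
                                ("cat", categorical, categorical_cols)])  # combines both branches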
In AutoGluon, an Amazon AutoML package, the primary goal of the platform is to automate the
machine learning process, which includes feature engineering, model selection, and
hyperparameter tweaking. However, AutoGluon has some basic data preprocessing subtasks in its
pipeline. These tasks include dealing with missing values, scaling and normalizing numerical
features, encoding categorical variables like one-hot encoding, performing automated feature
selection, and dealing with class imbalance in the dataset [→4].
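By contrast, with AutoGluon these subtasks are triggered implicitly; a typical invocation (assuming a CSV file with a column named "label") is simply:
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train.csv")                       # hypothetical training file
predictor = TabularPredictor(label="label").fit(train)    # preprocessing handled internally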
Auto-WEKA, Auto-Sklearn, and TPOT are some of the earliest AutoML tools. As open-source
frameworks, the three platforms can handle many basic and advanced data preprocessing tasks,
including missing value imputation, data normalization, feature selection, feature extraction, and
data transformation [→3, →5, →29, →30].
TransmogrifAI can handle some of the manual tasks, such as detecting data types, an important step in data preprocessing that helps ensure data is correctly interpreted by downstream algorithms. It has the ability to process both basic data types (categorical or numeric) and complicated ones (money, phone, address) [→14, →33]. However, it is unstable on many datasets [→34].
In H2O DriverlessAI, users can apply a wide range of preprocessing techniques to their data,
such as handling missing values, encoding categorical variables, scaling numerical variables, and
handling outliers. The platform also includes feature selection and feature extraction for
automatic dimensionality reduction, automatic text encoding using Word2Vec, and other data
preprocessing stages [→11].
Meanwhile, Datarobot and Azure AutoML follow the same steps as H2O DriverlessAI; both can
identify relationships between variables and automatically generate new features to improve
model performance [→32].
Recently, Google launched a new version of its Google Cloud AutoML, called Vertex AI. The
platform boasts a powerful suite of data preprocessing tools. Vertex AI can handle more advanced
data transformation techniques, such as dimensionality reduction, nonlinear feature
transformations, or handling imbalanced datasets. While it is possible to perform these tasks
using other Google Cloud tools, such as BigQuery or Dataflow, Google claims that Vertex AI can
handle them smoothly [→35].
MLjar is a relatively new platform, having been released in 2020. Despite this, it has gained
significant attention from the machine learning community due to its user-friendly interface and
powerful automated machine learning capabilities. MLjar provides various data preprocessing
options, including handling missing values, categorical encoding, feature scaling, and outlier
detection [→36].
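A minimal sketch of the open-source mljar-supervised package associated with the MLjar platform is shown below; the mode argument and the example dataset are illustrative:
from supervised.automl import AutoML          # pip install mljar-supervised
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
automl = AutoML(mode="Explain")   # "Explain" mode also produces data and model reports
automl.fit(X_train, y_train)      # imputation, encoding, and scaling are handled internally
predictions = automl.predict(X_test)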
Although Auto-Sklearn creates the final machine learning model rapidly, it cannot handle missing values. The system is built on the machine learning algorithms of the scikit-learn library, and the user needs to preprocess the input before using Auto-Sklearn to generate a model with customized parameters [→32, →36].
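In that case, a common pattern is to impute missing values manually and then pass the completed matrix to Auto-Sklearn; a minimal sketch with illustrative time budgets follows:
import numpy as np
import autosklearn.classification
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X[::50, 0] = np.nan                                    # inject a few missing values for illustration
X = SimpleImputer(strategy="mean").fit_transform(X)    # manual imputation before AutoML
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120, per_run_time_limit=30)   # illustrative time budgets
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))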
IBM Watson AutoAI offers a range of data preprocessing capabilities, such as handling
missing values, feature scaling, and feature engineering [→37]. However, the platform lacks advanced data preprocessing capabilities such as nonlinear feature transformations, outlier detection, and custom feature selection algorithms. While it is possible to perform some of these tasks using the
platform’s Python SDK, it may require additional programming and configuration.
5.4 Challenges
Surveying AutoML tools for data preprocessing presents challenges such as keeping up with the
evolving tool landscape, selecting a representative set of technologies, and achieving
comprehensive coverage of available tools. These challenges require researchers to stay updated,
carefully choose the tools to include in the survey, and be mindful of the limitations of covering all
existing tools.
While AutoML is primarily focused on the creation of reliable ML models, many AutoML tools are not yet effective at handling data preprocessing, which still requires substantial manual labor and careful evaluation and validation of the output generated by these tools [→32].
5.5 Conclusion
Data preprocessing is a critical step in the ML pipeline; it involves cleaning, integrating, reducing,
and transforming data so that it can be used to train ML models. AutoML tools are increasingly
being used for data preprocessing tasks. As these tools become more advanced, we are looking to
improve the capabilities and effectiveness of AutoML for data preprocessing by integrating
AutoML with explainability methods. As AutoML becomes more prevalent, it is crucial to ensure
that the results produced by these tools are explainable. Future research could focus on
integrating AutoML with explainability methods such as counterfactual explanations or
interpretable ML models.
References
[1] Q. Yao, et al., Taking human out of learning applications: A survey on automated machine
learning. arXiv Preprint arXiv:1810.13306, 2018. →
[2] M. Lee, et al., Current and future applications of machine learning for the US Army, Tech. Rep., US Army Research Laboratory, Aberdeen Proving Ground, United States, 2018. →
[3] M. Feurer, et al., Efficient and robust automated machine learning. In Advances in neural
information processing systems, p. 28, 2015. a, b, c, d
[4] N. Erickson, et al., Autogluon-tabular: Robust and accurate automl for structured data. arXiv
Preprint, arXiv:2003.06505, 2020. a, b, c
[5] Auto-Sklearn Sklearn AutoML Package. →https://www.automl.org/automl/auto-sklearn/.
Accessed: 2023-04-09. a, b, c
[6] C. Thornton, et al., Auto-WEKA: Combined selection and hyperparameter optimization of
classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining, p. 847–855, 2013. a, b
[7] Microsoft Azure Machine Learning Service AutoML Concept. →https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml. Accessed: 2023-04-10. →
[8] Datarobot Data preparation process. →https://www.datarobot.com/platform/prepare-modeling-data/. Accessed: 2023-04-09. →
[9] Google Vertex AI beginners guide.
→https://cloud.google.com/vertexai/docs/beginner/beginners-guide. Accessed: 2023-04-10. a, b
[10] H2O Platform H2O Driverless AI. →https://h2o.ai/platform/aicloud/make/h2o-driverless-ai/. Accessed: 2023-04-09. →
[11] E. LeDell, and S. Poirier, H2O automl: Scalable automatic machine learning. Proceedings of
the AutoML Workshop at ICML, vol. 2020, 2020. a, b, c
[12] IBM Cloud Watson Studio AutoAI. →https://www.ibm.com/cloud/watson-studio/autoai.
Accessed: 2023-04-10. a, b
[13] MLjar MLjar AutoML. →https://mljar.com/automl/. Accessed: 2023-04-09. →
[14] TransmogrifAI Automated machine learning for structured data. →https://transmogrif.ai/.
Accessed: 2023-04-10. a, b
[15] R. S. Olson, and J. H. Moore. TPOT: A tree-based pipeline optimization tool for automating
machine learning. In: Workshop on automatic machine learning. PMLR. pp. 66–74, 2016. a, b, c
[16] A. Salhi, et al., Data preprocessing using AutoML: A survey. In: Proceedings of The 2023 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'23), p. 106, 2023. →
[17] C. Fan, et al., A review on data preprocessing techniques toward efficient and reliable
knowledge discovery from building operational data. Frontiers in Energy Research, vol. 9, no.
March, pp. 1–17, 2021. →
[18] F. Neutatz, et al., Data Cleaning and AutoML: Would an optimizer choose to clean?.
Datenbank-Spektrum, vol. 22, no. 2, pp. 121–130, 2022. →
[19] M. A. Hernández, and S. J. Stolfo, Real-world data is dirty: Data cleansing and the
merge/purge problem. Data Mining and Knowledge Discovery, vol. 2, pp. 9–37, 1998. →
[20] S. Ryan, P. Djukic, and T. Morris. Unsupervised outlier detection in time-series data. US Patent
App. 16/540,414. Dec. 2020. →
[21] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, Duplicate record detection: A survey. IEEE
Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2006 →
[22] L. Yifeng, and A. Ngom. Data integration in machine learning. In: 2015 IEEE International
Conference on Bioinformatics and Biomedicine (BIBM), pp. 1665–1671, 2015, doi:
10.1109/BIBM.2015.7359925. →
[23] I. Czarnowski, and P. Jedrzejowicz. Data reduction algorithm for machine learning and data
mining. In: New Frontiers in Applied Artificial Intelligence: 21st International Conference on
Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2008
Wrocław, Poland, June 18–20, 2008 Proceedings 21. pp. 276–285. Springer, 2008. →
[24] O. Stromann, et al., Dimensionality reduction and feature selection for object-based land
cover classification based on Sentinel-1 and Sentinel-2 time series using Google Earth Engine.
Remote Sensing, vol. 12, no. 1, pp. 76, 2019. →
[25] H. Peng, F. Long, and C. Ding, Feature selection based on mutual information criteria of max
dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005 →
[26] K. Maharana, S. Mondal, and B. Nemade, A review: Data pre-processing and data
augmentation techniques. Global Transitions Proceedings, 2022. →
[27] M. Francia, J. Giovanelli, and G. Pisano, HAMLET: A framework for human-centered AutoML
via structured argumentation. Future Generation Computer Systems, vol. 142, pp. 182–194, 2023.
a, b
[28] J. Sola, and J. Sevilla, Importance of input data normalization for the application of neural
networks to complex industrial problems. IEEE Transactions on Nuclear Science, vol. 44, no. 3, pp.
1464–1468, 1997, doi: 10.1109/23.589532.
[29] M. Martin Salvador, M. Budka, and B. Gabrys, Automatic composition and optimization of
multicomponent predictive systems with an extended auto-WEKA. IEEE Transactions on
Automation Science and Engineering, vol. 16, no. 2, pp. 946–959, 2019, doi:
10.1109/TASE.2018.2876430. a, b
[30] U. Garciarena, R. Santana, and A. Mendiburu, Evolving imputation strategies for missing data
in classification problems with TPOT. arXiv Preprint arXiv:1706.01120, 2017. a, b
[31] E. Opara, H. Wimmer, and C. M. Rebman. Auto-ML cyber security data analysis using google,
azure and IBM cloud platforms. In: 2022 International Conference on Electrical, Computer and
Energy Technologies (ICECET), pp. 1–10, 2022, doi: 10.1109/ICECET55527.2022.9872782. →
[32] A. Truong, et al., Towards automated machine learning: Evaluation and comparison of
AutoML approaches and tools. In 2019 IEEE 31st international conference on tools with artificial
intelligence (ICTAI), pp. 1471–1479, IEEE, 2019. a, b, c, d, e, f
[33] TransmogrifAI. →https://readthedocs.org/projects/transmogrifai/downloads/pdf/stable/.
Accessed: 2023-05-30. a, b
[34] M. Bilal, et al., Auto-Prep: Efficient and automated data preprocessing pipeline. IEEE Access,
vol. 10, pp. 107764–107784, 2022. →
[35] S. K. Karmaker Santu, et al., AutoML to date and beyond: Challenges and opportunities. arXiv
Preprint arXiv:2010.10777, 2020. →
[36] I. Popova, G. Revunkov, and Y. Gapanyuk. AutoML: Examining existing software
implementations and determining the overall internal structure of solutions. a, b
[37] G. Kalpana, et al., Predicting the price of pre-owned cars using machine learning and data
science. International Journal for Research in Applied Science and Engineering Technology, vol. 10,
pp. VII, 2022. →
6 Generating random XML
Peter M. Maurer
John Carbone
Abstract
XML is a language for transmitting highly structured text data in a standardized form. Although
the XML language is relatively simple, creating XML documents can be a tedious and error-prone
process. The DGL XML feature addresses this problem by automatically enforcing certain rules,
and permitting XML data to be created in a simplified manner. DGL is a language designed for
generating random data, mainly for testing and simulation purposes. It can also be used to create
large volumes of synthetic data for neural network training. The output can take various formats
and can be stored in files, MySQL databases, or C++ internal variables. The generated XML data
can be stored in an .xml file, streamed into another program, or stored in a database. XML files and their DTDs can be generated separately or simultaneously. Several new DGL production types have been introduced, which help users generate syntactically correct XML data.
Keywords: random XML, random data, testing, simulation, big data
6.1 Introduction
A preliminary version of this paper appeared in CSCI 2023 [→1]. XML is a universal language for
communicating textual data in machine-readable form. Virtually anything that can be represented
in textual form can be encoded in XML. This is especially true for highly structured data with
hierarchical components. For example, one may wish to record data for a research corporation
with several research teams. Within each team are individuals with distinct roles. At the highest
level, we have the company. Next, we have multiple research teams, each with a leader, several
researchers, and several technicians. Individuals are assigned to one of these categories, making
category a level in the hierarchy. The data for each individual is further structured into name, hire
date, salary, and so forth. This structured hierarchy can be easily expressed in XML. This data
might be coded as follows.
<Company>
<AITeam>
<Leader>
<Person>
<Name>
Barnaby Jones
</Name>
<Salary>
100000
</Salary>
<Address>
1027 Alamo Ct.
Dallas, TX
</Address>
…
</Person>
</Leader>
<Engineer>
<Person>
…
</Person>
</Engineer>
…
</AITeam>
<Electronics>
…
</Electronics>
</Company>
Today, there exist many datasets that are extraordinarily large. Genetic data is one example, but
there are many others. Despite this, datasets with specific properties are difficult to find. This, in
turn, makes testing software difficult. It also makes training deep-learning software difficult. The
data generation language, DGL, can be used to generate datasets of any size with any desired
properties. The data can be simple or as complex as necessary. Once generated, further work
might be necessary to encode it in XML form. The DGL XML feature can be used to overcome this
difficulty by generating data directly in XML form.
6.2 DGL
DGL has been used to generate random data for testing hardware and software [→2, →3, →4].
The underlying principle is probabilistic context-free grammars [→5]. Recently, DGL’s capabilities
have been expanded to generate data in several different ways. DGL code may now be embedded
in C++, and the data generator can be created as a subroutine. Data can be generated in internal
C++ variables. When generating XML, it is typical to generate one or more separate files. An .xml file and its defining .dtd file can be generated simultaneously, and many files can be generated at the same time.
Originally, all data needed to be included in the DGL specification. With the upgrades, data can now be loaded from external files or MySQL databases, output can be written to multiple files at the same time, C++ variables can be accessed and modified by data-generation subroutines, and databases can be created and edited.
The basic unit of the DGL language is the production. A collection of productions representing
a complete specification is called a grammar. Each reference to the start symbol (“main” by
default) produces one data item. For data-generation subroutines, each subroutine call makes a
single reference to “main.” Embedded DGL is executed directly within the code.
Most productions have several alternatives, as shown below.
main: red,green,blue;
“main” refers to the name of the production, which offers three possible alternatives. When
“main” is invoked through the non-terminal %{main}, one of the three alternatives is selected
randomly with equal probability. The probabilities can be adjusted by applying weights, as shown
in the following example.
main: 2:red,1:green,3:blue;
These weights cause "red" to be selected twice as often as "green," "blue" three times as often as "green," and so forth. Weights are normally optional but sometimes necessary to ensure that a recursive grammar generates finite strings. The number of productions in a grammar is unlimited. Productions are referenced using
non-terminals, which have the form %{name_of_production}, as in the reference %{main} above.
Terminals and non-terminals can be intermixed arbitrarily in an alternative.
The following grammar generates strings of the form ab, aabb, aaabbb, …, a^n b^n.
main: a%{main}b,"";
The following grammar-segment can generate infinite strings, and must be enhanced with
weights.
Exp: %{var}, %{Exp} “+” %{Exp}, %{Exp} “*” %{Exp};
var: a,b,c;
If the alternatives of Exp are chosen with equal probability, the number of non-terminals has a 2/3
probability of growing and a 1/3 probability of becoming smaller. To prevent problems, the
following weights should be used.
Exp: 3:%{var}, 1:%{Exp} “+” %{Exp}, 1:%{Exp} “*” %{Exp};
Larger weights can be used. The relative sizes of the weights can be used to control the average
length of the generated strings.
Various types of productions are specified by a keyword that comes after the name. The DGL
XML feature uses many such productions. Productions without keywords are called normal or
ordinary productions. The following sequence production causes alternatives to be generated
sequentially rather than randomly.
X: sequence red,green,blue;
The unique production causes alternatives to be selected at random without replacement.
Y: unique red,green,blue;
Additional key words can be used to specify the behavior when no choices remain.
The most powerful production type is the “variable”:
V: variable;
The production “V” functions as a single-alternative production with a variable right-hand side.
The special non-terminal %{X.V} selects an option from “X” and assigns it as the right-hand side of
“V.” These non-terminals do not produce any output. If “X” is defined as described and this is the
first reference to “X,” the value “red” will be assigned to “V.” Later, the non-terminal %{V} will
output the string “red.” The string assigned to “V” can also include non-terminals, giving DGL
theoretical universal power [→6].
To produce data, the DGL compiler typically generates C++ code, which is then compiled to
create either a stand-alone data-generation program or a subroutine. Embedded DGL is
preprocessed by the DGL compiler, with the data-generation code being compiled in line with the
C++ code.
The root element of the file is <hands>. Within the <hands> </hands> tags will be one or more <hand> </hand> tags containing the lists of cards in each of the four hands. These four hands are contained in the four tag pairs <north> </north>, <east> </east>, <west> </west>, and <south> </south>. Each of the hands will contain thirteen <card> </card> tags containing the names of the cards (→Figure 6.2).
Figure 6.2: Representation of the thirteen <card> </card> tags that contain the card names.
The first five productions are for housekeeping and will be explained in later sections.
The "body" production generates the <hands> </hands> tags of the root element, and references the "hand" production to generate a single instance of <hand> </hand> tags. In practice, more elaborate mechanisms will be used to generate multiple instances of <hand> </hand> tags. The "hand" production generates the <hand> </hand> tags and the "news" production generates the four thirteen-card hands. In north, east, west, and south, the non-terminal %13{card} invokes the non-terminal %{card} thirteen times.
The "card" production generates the <card> </card> tags, and makes a selection from the
“Deck” production. The “Deck” production contains the names of all 52 cards and selects without
replacement. The alternatives of “Deck” are generated using the macros !Number and !Suit.
(When two or more macros are used in the same alternative, every unique combination of macro
values is generated. In this case, 52 alternatives are generated.)
The six productions (→Figure 6.4) will produce the element definitions in →Figure 6.5, with names
generated at random.
Figure 6.5: Examples of generated element definitions.
The first statement states that the <color> </color> tags must contain three tag pairs <red> </red>, <green> </green>, and <blue> </blue>. The second statement indicates that the <choice> </choice> tags must contain one of the three, but not all. The third statement says that the <dataitem> </dataitem> tags must contain text data. The fourth statement states that the content between <whatever> and </whatever> is arbitrary. The fifth statement states that the <NullItem> </NullItem> pair must not have any content. The final statement states that the <dataorsub> </dataorsub> pairs may either contain text or a subitem enclosed in <blue> </blue> tags.
→Figure 6.6 is the DTD file (bridge.dtd) for the bridge-hands files described previously.
The specifications card+ and hand+ state that one element of the type is required, but there may
be additional elements of the same type.
The second and third statements cancel the repetition count, which defaults to 100. Normally, the
repetition count determines the number of internal references made to the “main” production
when creating a stand-alone data generator. When invoking the data generator, the first
parameter on the command line overrides the default of 100. The “repeat” statement in the
example sets the default repetition count to 1, and the “nocount” statement prevents overrides
from the command line. The first parameter on the command line no longer refers to the number
of references to “main”. Instead, the parameter will be used to determine the number of
occurrences of <hand> </hand> that will appear between the <hands> </hands> tags. The three
productions starting with the body production implement this feature. The handlist production is
an iterator, which causes the single alternative to be invoked a specified number of times. To
accomplish this, it is necessary to use a non-terminal of the form %{20.handlist}, which will cause
“handlist” to be referenced 20 times. The “argument” production retrieves the value of a
command-line argument. In the example, argument 1 is retrieved, and if there is no argument 1,
100 is used instead. (Argument 0 is the name of the file being executed.) The non-terminal %
{a.handlist} in the body production causes the specified number of hands to be generated. The
remainder of the specifications are the same as those presented in the previous example.
6.9 Conclusion
Generating XML is greatly simplified by the DGL XML features. The wide range of productions
provided by DGL makes the generation as complex or as simple as necessary. Furthermore, the
XML features are easy to use. It took only a few minutes to convert the bridge-hands example to
XML. We believe the XML feature will significantly increase DGL’s usefulness for testing modern
software and running simulations that require XML-formatted input.
References
[1] P. Maurer, and J. Carbone, Generating Random XML Files Using DGL, CSCI, 2022. →
[2] P. Maurer, Generating test data with enhanced context free grammars. IEEE Software, 7, 4, pp.
50–56, 1990. →
[3] P. Maurer, The design and implementation of a grammar-based data generator. Software
Practice and Experience, 22, 3, pp. 223–244, 1992. →
[4] P. Maurer, DGL reference manual version 2.0.
→http://cs.baylor.edu/~maurer/dglunpublished. →
[5] T. L. Booth, and R. A. Thompson, Applying probability measures to abstract languages. In: IEEE
Trans. Computers, pp. 442–450, May 1973. →
[6] P. Maurer, Generating random bridge hands: A study in data generation. In: Proceedings of
the 2020 International Conference on Modeling, Simulation & Visualization Methods, 2020. →
[7] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, (eds.), Extensible Markup
Language (XML) 1.0, 5th Edition, W3C Recommendation, 26 November 2008. →
[8] E. T. Ray, Learning XML, 2nd Edition, Sebastopol, CA: O'Reilly Media, October 14,
2003. →
Applications and case studies
7 Exploring autism risk: a deep dive into graph
neural networks and gene interaction data
Danushka Bandara
Kyle Riccardi
Abstract
Autism spectrum disorder (ASD) has many genetic connections that can be represented as genetic association networks. These networks can be converted into graph data structures and examined with graph neural network models to assess the association individual genes have with ASD. In our experiments, GraphSAGE achieved an accuracy of 84.03% with an area under the ROC curve of 0.85, demonstrating the ability of graph neural network models to capture ASD association within gene networks. The model also achieved a sensitivity of 0.96 and a specificity of 0.94, indicating low false positive and false negative rates and therefore a reduced chance of misclassification.
Keywords: Graph Convolutional Network (GCN), GraphSAGE, Autism Spectrum Disorder (ASD)
7.1 Introduction
Autism spectrum disorder (ASD) is a deficit of social communication or sensory-
motor function based on genetic association and other causes [→21]. The
genetic association is supported by the heritability observed in the meta-analysis of twin studies by Tick et al. [→28], which estimated the heritability of ASD to range between 64% and 91%. Tick et al. also report a 0.98 correlation between genetics and neurodevelopmental disorders. De novo mutations further illustrate the role of inheritance in ASD, because these mutations arise specifically during stem cell divisions and the maturation of the female gametes within the trio (father, mother, and child) [→26]. These mutations are identified by comparing the genomes within the trio, and they carry the high inheritance seen in the Tick et al. analysis [→11, →15, →29]. In the Handbook of Clinical Neurology [→25], ASD is associated with an estimated 1,000 genes, determined based on genetic linkage between chromosome locations (loci) and possible genetic risk.
Alarcón et al. [→2] did a comprehensive study into chromosome regions 7q and
17q, finding that region 7q2-35 has a strong possibility of association with ASD
while noting that other areas, like chromosome 3q, might also have an
association with ASD. Copy number variation (CNV) is the addition, removal, or copying of portions of deoxyribonucleic acid (DNA) [→27]. CNVs show the exact genetic correlation at specific regions of the chromosome band, further exemplifying the link between loci and ASD. Another genetic association
is a common variant, which significantly affects ASD risk. Grove et al. [→4] used a
genome-wide association study that determined common variants’ strong,
robust connection with ASD. Common variants are significant factors in most
diseases as they relate to 90% of individual differences [→7]. Individually, common variants have minimal significance, but taken together they have a noticeable impact on risk [→3]. Common
variants make up 40–60% of the risk factor when evaluating ASD [→17]. Between
all of the variants and genetic mutations, we see that these mutations and
variants are connected within a network. Kolchanov et al. describe gene
networks as a group of genes functioning in a coordinated manner [→10]. As
seen, common variants, de novo variants, and CNV are all the by-products of
genes functioning in coordination with one another, and that function creates
variants with high ASD risk. These variants, when combined, create a gene
network that links all of these variants together, showing us the
association/nonassociation of a gene [→24]. Our proposed experiment uses
these gene networks and graph neural networks to determine if a gene is
associated with ASD. The specific graph neural network models we use are
GraphSAGE and graph convolutional network (GCN) [→13, →16]. Within this experiment, we determine whether a gene is associated with ASD based on the model's estimate of the likelihood that the gene contributes to autism risk.
In this chapter, a methodology and models for discovering genes that cause autism risk are described. Section 7.2 contains a literature survey of the state of the art in autism risk classification. The preprocessing and model structure are presented in Section 7.3. Section 7.4 outlines the experiments that were done to test our models, and Section 7.5 presents their results. Section 7.6 discusses the results in the context of autism risk classification. We conclude with Section 7.7, where we reiterate this study's
significant findings and implications.
7.3 Methodology
7.3.1 Dataset
The datasets used for these experiments were the Sfari dataset [→1] and protein interaction network (PIN) data [→24]. The Sfari dataset contains the gene associations and rankings used in our experiments. The rankings score the confidence of a gene being associated with ASD from 1 to 3 (1 representing a significant association and 3 representing a minor association). The Sfari dataset also contains the chromosome band location of each gene. The PIN dataset contains protein-protein interactions (PPIs) for genes both associated and not associated with ASD. The association is determined based on whether the gene is scored in the Sfari data: any gene scored from 1 to 3 is considered associated with ASD, and all other genes are considered as having no association.
7.3.2 Preprocessing
The first preprocessing step is to filter out anything in the PIN dataset that is not
specified as being a human gene interaction. Next, we add our ranking and
chromosome band location from the Sfari dataset to the same gene in the PIN
dataset to have our edges and labels in the same place. The chromosome band
locations are used as features for association risk classification for the following
experiment. Equation (7.1) shows the definition of the graph:
G = (V, E, C) (7.1)
where V is the set of nodes (genes), E is the set of edges (gene interactions), and C denotes the chromosome band features attached to the nodes.
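The chapter does not show the preprocessing code; the following minimal pandas sketch illustrates the filtering and label-merging steps described above, with hypothetical file names and column names:
import pandas as pd
# Hypothetical exports of the two datasets; the real files use their own schemas.
pin = pd.read_csv("pin_interactions.csv")    # assumed columns: gene_a, gene_b, organism
sfari = pd.read_csv("sfari_genes.csv")       # assumed columns: gene, sfari_score, chromosome_band
# 1) keep only human gene interactions
pin = pin[pin["organism"] == "human"]
# 2) build a node table of every gene that appears in an interaction
genes = pd.DataFrame({"gene": pd.unique(pin[["gene_a", "gene_b"]].values.ravel())})
# 3) attach the SFARI score and chromosome band; scores 1-3 are labeled as ASD-associated
genes = genes.merge(sfari, on="gene", how="left")
genes["label"] = genes["sfari_score"].between(1, 3).astype(int)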
7.3.3 Models
The models used for this experiment are GCN [→16] and GraphSAGE [→13]. These models contain two graph layers with a rectified linear unit (ReLU) between them, ending with a multilayer perceptron (MLP). Both models
are set up to use the adjacency matrix, feature matrix, and selected labels to
create a binary classification.
7.3.3.1 GCN
GCN takes a top-down approach, operating on the entire network and its feature matrix when performing its calculations. Rather than treating the effect of a single node in isolation, it processes all interactions together through matrix multiplication. The result is an embedding matrix that allows us to classify which class a node belongs to.
7.3.3.2 GraphSAGE
The GraphSAGE layer updates each node embedding as
h_v^(l+1) = σ( W^(l) · CONCAT( h_v^(l), AGGREGATE({ h_u^(l) : u ∈ N(v) }) ) ),
where N(v) is the set of neighboring nodes of node v, W^(l) is the weight matrix for the l-th layer, h_v^(l) is the embedding of node v at layer l, and σ is a nonlinearity.
The aggregation was a mean aggregation for this experiment to reduce the
complexity of our results. The aggregated features are then concatenated to
create new features for every node in the feature matrix. This model infers the connections of neighboring nodes, which offers a different but essential method for interpreting graph networks: instead of taking the broad view used by GCN, GraphSAGE takes a neighborhood-based approach, inferring feature relationships not only for the current node but also among its neighboring nodes.
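The chapter does not name the implementation library; the following sketch of the described two-layer architecture assumes PyTorch Geometric, with mean aggregation, ReLU activations, and a linear head standing in for the MLP (the 128-unit width follows the configuration noted in the caption of →Figure 7.3):
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
class SageClassifier(torch.nn.Module):
    def __init__(self, num_features, hidden=128, num_classes=2):
        super().__init__()
        self.conv1 = SAGEConv(num_features, hidden, aggr="mean")   # neighborhood mean aggregation
        self.conv2 = SAGEConv(hidden, hidden, aggr="mean")
        self.head = torch.nn.Linear(hidden, num_classes)           # MLP head for the two classes
    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))    # ReLU between the graph layers
        h = F.relu(self.conv2(h, edge_index))
        return self.head(h)                      # logits for associated / not associated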
7.3.4 Performance metrics
7.3.4.1 Specificity
Specificity is the ratio of true negative predictions to the total number of actual
negative instances in a binary classification problem. It quantifies the model’s
ability to correctly identify the instances that do not belong to the positive class.
A high specificity indicates that the model has a low rate of false positive errors and is effective at distinguishing negative instances from positive ones.
Mathematically, specificity can be defined as
Specificity = true negatives / (true negatives + false positives).
7.3.4.2 Sensitivity
Sensitivity is the ratio of true positive predictions to the total number of actual
positive instances in a binary classification problem. It quantifies the model’s
ability to capture and correctly identify instances that belong to the positive
class. A high sensitivity indicates that the model effectively identifies positive
instances and has a low rate of false negative errors.
Mathematically, sensitivity can be defined as
Sensitivity = true positives / (true positives + false negatives).
7.3.4.3 Area under the curve (AUC)
AUC is a scalar value that represents the area under a receiver operating
characteristic (ROC) curve, a graphical plot that illustrates the performance of a
binary classification model across different probability thresholds. The ROC
curve is created by plotting the true positive rate (sensitivity) against the false
positive rate (1 – specificity) at various threshold values.
AUC ranges from 0 to 1, where
AUC = 0 indicates that the model’s predictions are entirely incorrect.
AUC = 1 indicates that the model’s predictions are perfect.
A higher AUC value indicates that the model has better discrimination ability,
meaning it is more capable of distinguishing between the positive and negative
classes. In other words, it measures the model’s ability to rank positive instances
higher than negative instances based on their predicted probabilities.
A random binary model with no discriminatory power would have an AUC of
0.5, which is equivalent to the performance expected by chance.
7.3.4.4 Accuracy
Accuracy is the ratio of correct predictions to the total number of instances, quantifying the overall proportion of genes that the model classifies correctly.
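All four metrics can be computed directly from the predicted labels and probabilities; a minimal scikit-learn sketch with illustrative arrays follows:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # illustrative ground-truth labels
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.2, 0.6, 0.3])    # predicted P(associated with ASD)
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)
auc = roc_auc_score(y_true, y_prob)
accuracy = accuracy_score(y_true, y_pred)
print(specificity, sensitivity, auc, accuracy)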
7.4 Experiment
The experiment was conducted using three configurations: 1) two GCN layers
attached to an MLP layer; 2) two GraphSAGE layers attached to an MLP layer; and
3) three MLP layers. The models were fed the adjacency matrix, feature matrix,
and labels (not associated with ASD/associated with ASD). The adjacency matrix
considered all gene interactions within the PIN dataset that are associated with
humans and gathered their location and labels through the Sfari dataset.
→Figure 7.2 shows the structure of the models.
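A minimal illustrative training loop for this node-classification setup, reusing the SageClassifier sketch above and a small random graph in place of the real gene-interaction network, might look as follows; it is our illustration rather than the authors' code:
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
# Tiny random graph standing in for the gene-interaction network (illustrative only).
edge_index = torch.randint(0, 100, (2, 400))   # 100 nodes, 400 random directed edges
x = torch.randn(100, 24)                       # e.g., encoded chromosome-band features
y = torch.randint(0, 2, (100,))                # associated / not associated labels
data = Data(x=x, edge_index=edge_index, y=y)
model = SageClassifier(num_features=24)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
model.train()
for epoch in range(100):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out, data.y)
    loss.backward()
    optimizer.step()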
7.5 Results
As described in the methodology section, we conducted several tests to check
our models’ performance and understand their workings.
The →Table 7.2 metrics represent the average metrics for our models. Our
model achieved an accuracy of 75.91% with the GCN model and 84.03% with
GraphSAGE. When features are removed, the GraphSAGE model accuracy drops by 17.14%, and the GCN model accuracy drops by 20.78%. The AUC for GraphSAGE
is 0.85 and drops by 0.17 when run featureless. The AUC for GCN is 0.78 and
drops by 0.28 when run featureless. GraphSAGE performs better than MLP in our
testing. However, GCN’s performance was found to be lower than that of MLP.
Table 7.2: Summary of the performance metrics obtained from our experiments.
Model name AUC Accuracy
GCN 0.78 75.91
GraphSAGE 0.85 84.03
MLP 0.78 77.71
GraphSAGE featureless 0.68 66.89
GCN featureless 0.50 55.13
As shown in →Table 7.3, GraphSAGE achieved comparable specificity with GCN
and much higher sensitivity than GCN or MLP.
→Table 7.4 shows the top 10 genes (based on the confidence level of the model)
that our model predicted to belong to each class.
Table 7.4: Top 10 genes based on model confidence for each class.
Not associated with ASD risk Associated with ASD risk
ASGR1 DSCAM
C17ORF85 HLA-A
MIR181A1 RELN
YBX2 TDO2
WDR13 BCL11A
TULP4 CUL3
SHTN1 DISC1
RBMXL1 NTRK3
PRKRA DRD3
C2ORF3 FOXG1
7.6 Discussion
Our experimental results demonstrate (→Table 7.2) notable performance
variations between different graph-based models in the context of autism risk
classification. The GraphSAGE model exhibited superior accuracy compared to
the GCN model, achieving an impressive 84.03% accuracy as opposed to GCN’s
75.91%. This suggests that the GraphSAGE architecture captures essential
patterns in the gene interaction data more effectively than GCN in the context of
autism risk prediction.
A compelling aspect of our findings is the sensitivity of the GraphSAGE
model to feature removal. When features were systematically removed, the
GraphSAGE model experienced a significant accuracy drop of 17.14%, indicating
its reliance on specific features for accurate predictions. In contrast, the GCN
model showed an even larger drop of 20.78% under the same feature removal
conditions. This suggests that both models' reliance on features might make them more susceptible to variations in the input data, potentially pointing to the
importance of specific chromosome locations on autism risk. The Area Under the
Curve (AUC) values further emphasize the nuanced performance of the models.
GraphSAGE achieved an AUC of 0.85, indicating a strong ability to distinguish
between classes. However, when run in a featureless mode, the AUC dropped by
0.17, underscoring the importance of the initial features for optimal model
performance. Similarly, the GCN model exhibited an AUC of 0.78, decreasing by
0.28 when featureless. These results suggest that the discriminative power of
both models is closely tied to the availability of the chromosome location
features.
Our experiments revealed that GraphSAGE outperformed the MLP in our
testing conditions. This suggests that the graph-based architectures, particularly
GraphSAGE, are well-suited for capturing complex relationships in gene
interaction data, leading to improved classification accuracy for autism risk.
On the other hand, the GCN model’s performance was lower than that of
MLP. This unexpected result might indicate that the specific architectural choices
in the GCN model, or the nature of the gene interaction data, do not align
optimally with the learning task at hand. Further investigation into the model’s
limitations and potential areas for improvement could provide valuable insights.
The GCN model demonstrated (→Table 7.3) remarkable specificity with a value of 0.96, indicating its capability to accurately identify genes not associated with ASD. However, its sensitivity of 0.63 suggests a notable limitation in correctly classifying ASD-associated genes, potentially leading to a higher rate of false negatives. This trade-off between specificity and sensitivity highlights a challenge in the GCN model's ability to comprehensively capture the complexity of gene interaction patterns associated with autism. Conversely, the GraphSAGE model exhibited a high sensitivity of 0.96, signifying its effectiveness in correctly identifying ASD-associated genes. It also exhibits a specificity of 0.94,
suggesting a very low rate of false positives. This is further underlined in the
ROC curve shown in →Figure 7.3. Comparatively, the MLP model demonstrated a
moderate sensitivity (0.67) and relatively high specificity (0.9). While the MLP
model may not excel in all metrics, it seems to do better than GCN when it
comes to sensitivity.
Figure 7.3: Receiver operating characteristics (ROC) curve for the binary
classification using GraphSAGE (using our optimal architecture of 128 units in
each network layer).
The top 10 genes identified by the model (shown in →Table 7.4) as having no
ASD risk, with high confidence levels, are noteworthy for understanding the
genetic landscape associated with individuals without ASD. Investigating these
genes’ functions and biological roles may reveal protective factors or pathways
associated with typical neurodevelopment. A detailed examination of the
identified genes can provide insights into their functions, potential interactions,
and relevance to neurodevelopmental processes. Understanding the biological
roles of these genes may contribute to the broader understanding of factors
that confer resilience against ASD. Expert validation of the model’s predictions
on these genes is crucial. Cross-referencing with existing literature and
experimental validation can confirm the reliability of the model’s selections and
contribute to the robustness of the identified gene set.
Conversely, the top 10 genes identified by the model as having a high
confidence level associated with ASD risk offer valuable leads for further
investigation into the molecular mechanisms underlying ASD. These genes may
serve as potential biomarkers for ASD risk. Exploring their roles in
neurodevelopmental processes, synaptic function, or other relevant pathways
may provide insights into the biological underpinnings of ASD. Grouping the
identified genes into biological pathways or networks can uncover patterns of
dysregulation. Pathway analysis may reveal common molecular pathways
disrupted in individuals at risk for ASD, guiding targeted therapeutic
interventions or preventive measures. The confidence levels associated with
each gene prediction are critical for gauging the reliability of the model’s
selections. A higher confidence level suggests a stronger association between
the gene and the predicted outcome. Understanding the relationship between
model confidence and clinical relevance is pivotal for translating these findings
into actionable insights in a clinical setting.
7.7 Conclusion
The experiment shows that a graph neural network can determine whether or not a specific gene is associated with ASD. It also shows that the GraphSAGE model makes the best use of gene interactions and chromosome location features to classify genes with ASD risk.
We compared the performance of our model with the baseline MLP model and
found that it outperforms MLP in every performance criterion we measured.
Identifying potential biomarkers and protective factors using this method opens
avenues for precision medicine approaches in ASD. Tailoring interventions based
on the genetic profiles of individuals may pave the way for more effective
diagnosis and treatment strategies. Experimental validation of the identified
genes is crucial for establishing their functional roles in neurodevelopmental
processes. This step is fundamental in bridging the gap between computational
predictions and biological significance. Future research endeavors could extend
beyond gene interaction data to incorporate multi-omics information.
Integrating data on gene expression, epigenetics, and other molecular layers
may enhance the depth and accuracy of ASD risk predictions. Moving forward,
translating research findings into clinically actionable insights becomes
imperative. The model’s predictions and the identified genes should be validated
in diverse patient populations, laying the foundation for their integration into
clinical decision-making processes.
References
[1] B. Abrahams, D. Arking, D. Campbell, H. Mefford, E. Morrow, L. Weiss, I.
Menashe, et al., Sfari gene 2.0: A community-driven knowledgebase for the
autism spectrum disorders (asds). Molecular Autism, vol. 4, no. 36, pp. 10, 2013.
→
[2] M. Alarcón, B. S. Abrahams, J. L. Stone, J. A. Duvall, J. V. Perederiy, J. M. Bomar,
J. Sebat, et al., Linkage, association, and gene-expression analyses identify
cntnap2 as an autism-susceptibility gene. The American Journal of Human
Genetics, vol. 82, no. 1, pp. 150–159, 2008. →
[3] R. Anney, L. Klei, D. Pinto, J. Almeida, E. Bacchelli, G. Baird, N. Bolshakova, et
al., Individual common variants exert weak effects on the risk for autism
spectrum disorders. Human Molecular Genetics, vol. 21, no. 21, pp. 4781–4792, 07
2012. →
[4] J. Grove, S. Ripke, T. D. Als, M. Mattheisen, R. K. Walters, H. Won, et al.,
Identification of common genetic risk variants for autism spectrum disorder.
Nature Genetics, vol. 51, no. 3, pp. 431–444, March 2019. a, b
[5] I. Beyreli, O. Karakahya, and A. E. Cicek, DeepND: Deep multitask learning of
gene risk for comorbid neurodevelopmental disorders. Patterns, vol. 3, pp.
100524, 06 2022. →
[6] J. Bralten, K. J. van Hulzen, M. B. Martens, T. E. Galesloot, A. Arias Vasquez, L.
A. Kiemeney, J. K. Buitelaar, et al., Autism spectrum disorders and autistic traits
share genetics and biology. Mol. Psychiatry, vol. 23, no. 5, pp. 1205–1212, May
2018. →
[7] P. F. Bray, C. I. Jones, N. Soranzo, and W. H. Ouwehand, Chapter 4 – Platelet
genomics. In: D. M. Alan, (Ed.), Platelets, 3rd Edition, pp. 67–89, Academic Press,
2013. →
[8] J. Feng, A. Zeng, Y. Chen, P. Payne, and L. Fuhai. Signaling interaction link
prediction using deep graph neural networks integrating protein-protein
interactions and omics data. bioRxiv, 2020. →
[9] T. K. B. Gandhi, J. Zhong, S. Mathivanan, L. Karthick, K. N. Chandrika, S.
Sujatha Mohan, S. Sharma, et al., Analysis of the human protein interactome and
comparison with yeast, worm and fly interaction datasets. Nature Genetics, vol.
38, no. 3, pp. 285–293, 2006. →
[10] T. Gaugler, L. Klei, S. Sanders, C. Bodea, A. Goldberg, A. Lee, M. Mahajan, et
al., Most genetic risk for autism resides with common variation. Nature Genetics,
vol. 46, 07 2014. →
[11] J. Gratten, P. Visscher, B. Mowry, and N. Wray, Interpreting the role of de
novo protein-coding mutations in neuropsychiatric disease. Nature Genetics, vol.
45, pp. 234–238, 03 2013. →
[12] K. Riccardi, and D. Bandara, Autism risk classification using graph neural
networks applied to gene interaction data. In: 2023 World Congress in Computer
Science, Computer Engineering, and Applied Computing (CSCE), NV, USA: Las
Vegas, 2023. →
[13] W. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on
large graphs, 06 2017. a, b
[14] F. Hormozdiari, O. Penn, E. Borenstein, and E. E. Eichler, The discovery of
integrated gene networks for autism and related disorders. Genome Research,
vol. 25, pp. 1, 142–154 2015. →
[15] I. Iossifov, B. O’Roak, S. Sanders, M. Ronemus, N. Krumm, D. Levy, H.
Stessman, et al., The contribution of de novo coding mutations to autism
spectrum disorder. Nature, vol. 515, pp. 10, 2014. →
[16] T. N. Kipf, and M. Welling. Semi-supervised classification with graph
convolutional networks. In: International Conference on Learning
Representations, 2017. a, b
[17] E. A. Anan’ko, F. A. Kolpakov, N. A. Kolchanov, et al., Gene networks.
Molecular Biology, vol. 34, pp. 449–460, 2000. →
[18] N. Krumm, B. J. O’Roak, J. Shendure, and E. E. Eichler, A de novo
convergence of autism genetics and molecular neuroscience. Trends in
Neurosciences, vol. 37, pp. 2, 95–105 2014. →
[19] S. Laksshman, R. Bhat, V. Viswanath, and X. Li, Deepbipolar: Identifying
genomic mutations for bipolar disorder via deep learning. Human Mutation, vol.
38, pp. 06, 2017. →
[20] L. Liu, J. Lei, and K. Roeder, Network assisted analysis to reveal the genetic
basis of autism. The Annals of Applied Statistics, vol. 9, pp. 3, 1571 2015. →
[21] C. Lord, M. Elsabbagh, G. Baird, and J. Veenstra-Vanderweele, Autism
spectrum disorder. The Lancet, vol. 392, no. 10146, pp. 508–520, 2018. →
[22] S. Uddin, and H. Lu, A weighted patient network-based framework for
predicting chronic diseases using graph neural networks. Science Report, vol. 11,
p. 22607, 25 2021. →
[23] S. L. Lee, G. Mellick, A. A. Motsinger, et al., Gpnn: Power studies and
applications of a neural network method for detecting gene-gene interactions in
studies of human disease. BMC Bioinformatics, vol. 7, no. 39, 2006. →
[24] W. Pereanu, E. Larsen, I. Das, M. Estevez, A. Sarkar, S. Spring-Pearson, et al.,
Autdb: A platform to decode the genetic architecture of autism. Nucleic Acids
Research, vol. 46, pp. 11, 2017. a, b
[25] G. Ramaswami, and H. G. Daniel, Chapter 21 – Genetics of autism spectrum
disorder. In: H. G. Daniel, L. P. Henry, and K. Christine, (Eds.), Neurogenetics, Part
I, volume 147 of Handbook of Clinical Neurology, pp. 321–329, Elsevier, 2018. →
[26] K. Samocha, E. Robinson, S. Sanders, C. Stevens, A. Sabo, L. McGrath, J.
Kosmicki, et al., A framework for the interpretation of de novo mutation in
human disease. Nature Genetics, vol. 46, 08 2014. →
[27] T. H. Shaikh, Copy number variation disorders. Current Genetic Medicine
Reports, vol. 5, pp. 183–190, 2017. →
[28] B. Tick, P. Bolton, F. Happé, M. Rutter, and F. Rijsdijk, Heritability of autism spectrum disorders: A meta-analysis of twin studies. Journal of Child Psychology and Psychiatry, vol. 57, no. 5, pp. 585–595, 2016. →
[29] J. Veltman, and H. Brunner, De novo mutations in human genetic disease.
Nature Reviews. Genetics, vol. 13, pp. 565–575, 07 2012. →
[30] H. Wang, and P. Avillach, Genotype-based deep learning in autism spectrum
disorder: Diagnostic classification and prognostic prediction using common
genetic variants, preprint, 10 2020. →
[31] Z. Wang, Y. Xu, D. Peng, J. Gao, and L. Fengmei, Brain functional activity-
based classification of autism spectrum disorder using an attention-based graph
neural network combined with gene expression. Cerebral Cortex, vol. 33, no. 10,
pp. 6407–6419, 12 2022. →
8 Leveraging ChatGPT and table arrangement
techniques in advanced newspaper content analysis
for stock insights
Masaki Murata
Acknowledgment: This study was conducted with grant support from the Ishii
Memorial Securities Research Promotion Foundation.
Abstract
This study uses table arrangement techniques and ChatGPT to analyze articles
from the Mainichi Shimbun for stock price prediction. From the analysis, we
identified 22 primary factors that influence the Nikkei Stock Average. We also
discovered that ChatGPT could extract and present newspaper data in a tabular
format. These factors significantly impact stock price fluctuations. The Nikkei
Stock Average tends to rise with improved US trade relations, a strong domestic
economy, and events such as the Olympics. Conversely, it tends to decline
during global stock market crashes, trade tensions, and pandemics. Moreover,
we propose a highly efficient, large-scale method for creating tables by
integrating table arrangement techniques with ChatGPT. Using a lenient
criterion, the proposed method attains an accuracy rate of 0.88. Items frequently
mentioned in articles concerning the Nikkei Stock Average are systematically
presented in a table, illustrating how the index rises or falls in response to these
items. This table also delves deeper into the effects of exchange rate changes on
the Nikkei Stock Average. Our findings offer valuable insights into the movement
of the Nikkei Stock Average. Future research will further refine these techniques
to improve stock prediction accuracy.
Keywords: stock prices, newspaper data, content analysis, table arrangement,
ChatGPT
8.1 Introduction
According to Bollen et al. [→1], the prediction of stock prices can be partially
achieved through Twitter sentiment analysis. Deep learning techniques have
also been used in various studies to forecast stock prices based on information
extracted from newspapers [→2, →3, →4, →5]. Newspapers generally contain
valuable insights into the prevailing economic landscape and fluctuations in
stock prices. To improve the accuracy of stock price prediction, this study
examined sentences from newspapers that specifically relate to stock prices. We
used table arrangement technology and ChatGPT to identify the common types
of stock-related content found in newspapers. This study was conducted using
Japanese texts.
The primary contributions of this study are summarized below:
We used table arrangement technology to organize newspaper articles
about stock prices into tables. These tables consist of columns that provide
detailed information about stock prices, as presented in newspaper
articles.
The columns within the tables offer valuable information that contributes
to stock price prediction. This includes data such as the Dow Jones
Industrial Average, exchange rates, and other relevant market data.
Furthermore, we extracted content specifically related to significant events, such as the "Trump administration" and the "New Corona" (COVID-19) situation. In total, 22 content items were obtained and analyzed.
We discovered that ChatGPT could extract information from newspapers
and effectively organize it into a table. However, we observed that the
number of columns in the tables generated by ChatGPT (ranging from 4 to
5) was fewer than that in the tables produced by the table arrangement
technology (ranging from 13 to 50 columns).
In addition, we proposed a method that combined table arrangement
techniques and ChatGPT to construct large-scale tables from a collection of
documents. The proposed method achieved a high accuracy rate of 0.88,
employing a lenient evaluation criterion that allows certain acceptable
additional extractions and minor missing information.
The items in many articles on the Nikkei Stock Average were organized
into a table that shows how the Nikkei Stock Average rose or fell when
these items occurred. In addition, the table shows in more detail how the
Nikkei Stock Average rises or falls when the exchange rate changes. These
analyses are also useful when examining the movement of the Nikkei
Stock Average.
This chapter is an extended version of a previous paper [→6].
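The chapter interacts with ChatGPT through its chat interface; as an assumption on our part, the same kind of prompt can also be issued programmatically, for example through the OpenAI Python client, as sketched below (the model name and prompt wording are placeholders):
from openai import OpenAI
client = OpenAI()   # reads OPENAI_API_KEY from the environment
# Two sentences of the kind analyzed in Section 8.2; in practice, the sentences
# would come from the newspaper articles being analyzed.
article_sentences = [
    "The Nikkei 225 soared after the Bank of Japan announced its interest rate revision.",
    "The Nikkei 225 plunged in the wake of the new coronavirus.",
]
prompt = ("Organize the following newspaper sentences about the Nikkei Stock Average into a "
          "table with columns: key factors, affected industry/sector, relevant countries/regions, "
          "and tone of the article.\n\n" + "\n".join(article_sentences))
response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)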
8.2 Analysis cases using table arrangement technology
It does not have to be about the future. It can be about the past. Write 10 sentences from
actual newspaper articles when the Nikkei Stock Average rose significantly.
We are sorry, but due to copyright issues, we are unable to reproduce the exact text of specific
newspaper articles from the past. However, it is possible to create 10 exemplary fictional
newspaper article sentences based on actual past events or topics. The following are
exemplary article sentences.
1. “Nikkei average rises sharply on new trade agreement with US. The development was
well received by many industries.”
2. “The Nikkei 225 gained a big 500 yen in a single day. Many market participants are
confident that Japan’s economy will recover.”
3. “The Nikkei 225 soared after the Bank of Japan announced its interest rate revision.
Expectations for future economic policies are rising.”
4. “Renewable energy-related issues performed well. The Nikkei 225 recorded its biggest
gain in three months.”
5. “The Nikkei 225 rose above 25,000 yen on the back of aggressive purchases of Japanese
stocks by foreign investors.”
6. “A batch of major companies reported strong earnings results. Markets rejoiced and
the Nikkei 225 rose sharply.”
7. “The Tokyo Olympics were scheduled to be held in Tokyo. Sightseeing and construction-
related stocks soared and the Nikkei 225 jumped sharply.”
8. “Technology stocks continued to perform well, and the Nikkei 225 soared, riding the
wave.”
9. “Against the backdrop of the yen’s temporary stability, the Nikkei 225 made record
gains.”
10. “The Nikkei 225 also hit a new high on signs of a global economic recovery.”
These sentences are exemplary based on past market trends and events and may differ from
actual newspaper articles. Please use them for reference only.
Roger. The following table summarizes the results of our analysis of the 10 exemplary
newspaper article texts you provided from various perspectives.
Key Factors: The main reasons or underlying factors for the rise in an article.
Affected industry/sector: The industry or sector that is likely to be particularly affected
by the content of the article.
Relevant countries/regions: Key countries/regions relevant to the content of the article.
Tone of the article: The overall mood or impression of the article. In this case, we use
the term “favorable” because all articles report an increase in the Nikkei 225.
This table is based on the text of exemplary newspaper articles and may differ from actual
market conditions and analysis results.
The following was inputted into ChatGPT: “Based on the above results, please
provide a bulleted list of specific times when the Nikkei Stock Average will rise.”
The following was outputted:
Based on the analysis of the above exemplary newspaper articles, the specific circumstances
and factors that would cause the Nikkei Stock Average to rise are itemized below.
1. Progress in trade agreements:
When trade relations improve or agreements are reached with major economic
partners, such as a new trade agreement with the U.S.
2. Recovery of the domestic economy:
A situation in which the Japanese economy is performing well and market
participants are confident of its recovery.
3. Interest rate developments:
Interest rate revisions and central bank monetary policy are favorable to the
market.
4. Sector-specific strength:
When certain sectors are performing particularly well, for example, renewable
energy or technology stocks.
5. Active investment by foreign investors:
A situation in which foreign investors are actively buying Japanese stocks because
of increased inflows of foreign capital.
6. Strong corporate performance:
A major company announces better-than-expected earnings, which spreads
positive sentiment throughout the market.
7. Large-scale events:
When a major international event such as the Olympics is scheduled or decided to
be held in Japan.
8. Currency trends:
When the yen stabilizes or depreciates moderately, which will help boost export-
related stocks, etc.
9. Global economic strength:
A global economic recovery or growth is confirmed, and Japanese stock prices
benefit from this recovery or growth.
Although these factors may contribute to the rise in the Nikkei Stock Average, actual stock
price movements are determined by various factors and complex interactions beyond these
alone.
Based on the analysis of newspaper articles, several factors influence the rise in
the Nikkei Stock Average. These include improved trade relations, particularly
with the United States; a robust domestic economy; favorable interest rate and
central bank policy changes; standout performance in specific sectors such as
technology; increased foreign investments in Japanese stocks; major Japanese
companies exceeding earnings expectations; international events such as the
Olympics being held in Japan; a stable or slightly depreciating yen boosting
export-related stocks; a thriving global economy.
I am sorry, but I am not able to directly provide the text of specific newspaper articles
protected by copyright. However, I can provide information regarding the background, factors,
and historical events when the Nikkei Stock Average fell significantly.
The following events and factors related to past major stock price declines and economic crises
may have had an impact:
1. The Black Monday of 1987
2. The bursting of Japan’s bubble economy in the early 1990s
3. The Asian currency crisis of 1997–98
4. The dot-com bubble burst in 2000
5. The Lehman Shock of 2008 and the subsequent financial crisis
6. The Great East Japan Earthquake of 2011 and its subsequent economic impact
7. The economic slowdown in China in 2015–2016 and its impact
8. The U.S.–China trade war in 2018
9. The global COVID-19 pandemic in 2020 and its economic impact
10. International events such as oil shocks and geopolitical risks
For references to specific newspaper articles and analyses of these events and situations, we
recommend accessing them directly through a licensed news service or archive service.
We inputted the following into ChatGPT: “For each event, write a sentence that
sounds like an actual newspaper article on a major drop in the Nikkei 225.”
ChatGPT outputted the following.
Understood. The following is an imaginary sentence based on the above event, which I have
written as if it were an actual newspaper article. Please note that this is not the content of an
actual newspaper article, but rather a fictional content I created, and should be used only as a
reference.
1. Black Monday 1987 “On October 19, 1987, the Nikkei 225 fell sharply as a result of Black
Monday. The Tokyo stock market also suffered from general selling pressure.”
2. The bursting of Japan’s bubble economy in the early 1990s “With the end of the bubble
economy, the Nikkei 225 plummeted. Many investors experienced asset write-downs,
and real estate and stock prices fell rapidly.”
3. 1997–98 Asian Currency Crisis “The currency crisis in Asian countries led to a decline in
Japanese exports. The Nikkei average fell sharply and many companies revised their
earnings downward.”
4. The dot-com bubble burst of 2000 “Triggered by a sharp drop in IT-related stocks, the
Nikkei 225 fell sharply in value. Many startups failed and the market became
increasingly uncertain.”
5. 2008 Lehman Shock “The collapse of Lehman Brothers triggered a global financial
crisis. The Nikkei average also fell sharply and corporate earnings continued to
deteriorate.”
6. The Great East Japan Earthquake of 2011 “On March 11, 2011, the Great East Japan
Earthquake hit the Tohoku region. The Nikkei 225 fell sharply for a time and many
companies were affected.”
7. Economic slowdown in China in 2015–2016 “Fears of a slowdown in the Chinese
economy affected the market and the Nikkei 225 fell significantly. Japanese export
performance was also affected.”
8. The U.S.–China trade war in 2018 “As trade friction between the U.S. and China
intensified, the Nikkei 225 was volatile. Many companies are concerned about the
impact on their earnings.”
9. The global pandemic of COVID-19 in 2020 “The Nikkei 225 plunged in the wake of the
new coronavirus. Along with a slowdown in economic activity, many companies warned
of deteriorating business performance.”
10. Oil shocks and geopolitical risks “With the ongoing tensions in the Middle East, the
Nikkei 225 also fell sharply due to concerns about rising oil prices. Rising energy-related
costs could affect corporate earnings.”
These sentences are simplified and are shorter and more condensed than actual newspaper articles.
The following was inputted into ChatGPT: “Analyze these 10 articles from various
perspectives and organize them in a table.” ChatGPT outputted the following.
Below is a table that analyzes and organizes a summary of the 10 fictional
newspaper articles provided from various perspectives.
| Incident name/time | Major cause | Nikkei average movement | Major impact |
|---|---|---|---|
| 1987 Black Monday | International stock market crash | Major decline | Selling pressure on Tokyo stock market |
| 1990s Bubble burst | End of economic bubble | Plunge | Asset write-downs, declines in real estate and stock prices |
| 1997–98 Asian Currency Crisis | Currency crisis in Asian countries | Major decline | Decline in Japanese exports, downward revisions in corporate earnings |
| 2000 Dot-com bubble | Overheating of IT-related stocks | Major decline | Emerging company failures, market uncertainty |
| 2008 Lehman Shock | Financial crisis, Lehman bankruptcy | Major decline | Global financial crisis, deteriorating corporate performance |
| 2011 Great East Japan Earthquake | Tsunami | Temporary big drop | Many companies were affected, economic activities temporarily halted |
| 2015–2016 China's economic slowdown | Chinese economy slowdown | Major decline | Decrease in Japanese exports, impact on business performance |
| 2018 US–China trade war | Trade friction between the United States and China | Unstable developments | Impact on export performance, concerns about business performance |
| 2020 COVID-19 pandemic | Global pandemic of the new coronavirus | Plunge | Stagnation of economic activity, deteriorating business performance of many companies |
| Oil shock | Tension in the Middle East, rise in oil prices | Major decline | Higher energy costs, impact on business performance |
This table summarizes the key information from the 10 articles. This confirms
that each event and period has a different degree of impact and background.
The following was inputted into ChatGPT: “Based on the above results,
please provide a bulleted list of specific times when the Nikkei Stock Average
falls.” ChatGPT outputted the following.
Based on the above results, specific situations and factors that may cause the Nikkei Stock
Average to fall can be summarized as follows.
1. International stock price collapse: When stock prices in other countries plunge, the
Nikkei Stock Average may fall in a chain reaction (e.g., Black Monday in 1987).
2. Burst of economic bubbles: After excessive economic prosperity, stock prices can fall
sharply when bubbles burst (e.g., the bubble burst in the 1990s).
3. Economic crises in neighboring countries or trading partners: Economic instability in
Japan’s major trading partners can affect the Nikkei 225 (e.g., the Asian currency crisis
in 1997–98 and China’s economic slowdown in 2015–2016).
4. Sector-specific overheating: Overheating and subsequent adjustments in some
industries or sectors can affect overall stock prices (e.g., the dot-com bubble of 2000).
5. Major financial crisis: An event that affects the entire financial system, such as the
failure of a major financial institution or a credit crunch, can cause stock prices to fall
sharply (e.g., the Lehman Shock in 2008).
6. Natural disasters: Large-scale natural disasters can temporarily disrupt economic
activity and cause stock prices to decline (e.g., the Great East Japan Earthquake of
2011).
7. International trade frictions: Export-dependent economies could be affected by
intensified trade frictions and tariff wars among major economies (e.g., the U.S.–China
trade war in 2018).
8. Major epidemics: a global epidemic could severely limit economic activity and depress
stock prices (e.g., the COVID-19 pandemic in 2020).
9. Significant energy price volatility: fluctuations in energy-related costs, such as a
significant rise in the price of crude oil, could affect corporate performance and
economic activity.
Note that these factors can affect not only the Nikkei Stock Average but also the
economy as a whole.
We found the following. The Nikkei Stock Average may experience declines
because of several reasons: international stock market crashes, such as the 1987
Black Monday; economic bubble bursts, such as Japan’s 1990s bubble; economic
issues in key trading partners, illustrated by the Asian currency crisis in 1997–98
and China’s economic downturn in 2015–2016; industry-specific imbalances,
highlighted by the 2000 dot-com bubble; widespread financial crises, epitomized
by the 2008 Lehman shock; major natural calamities such as the 2011 Great East
Japan Earthquake; heightened global trade tensions, as seen in the 2018 US–
China trade conflict; global pandemics that limit economic activities, such as the
2020 COVID-19 outbreak; and significant fluctuations in energy prices.
A simple survey using ChatGPT in Sections 8.3.1 and 8.3.2 also reveals in
which cases the Nikkei Stock Average rises and falls. However, in the previous
study [→6], ChatGPT outputted sentences that more closely resembled actual
articles. Even using only ChatGPT, we can generate tables by creating sentences
for articles and organizing them.
Please pull out the sentences from the following text that apply to the following categories. If
none apply, please state N/A. # Classification (a) Nikkei 225, (a2) Change in Nikkei 225, (b) Dow
Jones Industrial Average, (c) Foreign Exchange, (d) Market Information, (e) World Affairs, (f)
Overseas Stock Indices, (g) Related to “since,” (h) Drop, (i) Tokyo Stock Exchange, (j)
International Information, (k) Noted, (l) Yields, (m) Monetary Easing, (n) Earnings, (o) Start
trading, (p) Related, (q) Trade, (r) Europe, (s) Caution, (t) Trump administration, (u) New
Corona, and (v) Foreign investors # text The Nikkei Stock Average temporarily dropped more
than 330 yen on the Tokyo Stock Exchange on August 28, as concerns over deteriorating
European finances spread following the downgrade of Greek and Portuguese government
bonds by the U.S. rating agency Standard & Poor’s. The previous day, European and U.S. stocks
also plunged, and there is a sense of caution in the market that a global stock market decline
originating in Europe is likely to proceed. The announcement of the credit rating downgrade
and the plunge in European stocks on May 27 prompted the 16 eurozone countries of the
European Union to hold a summit meeting on May 10 to decide on an emergency loan of 30
billion euros (approximately 3.7 trillion yen) to Greece. However, the news of the summit did
not dispel the credit concerns of European countries. The Dow Jones Industrial Average fell
sharply for the first time in seven business days on the New York Stock Exchange on May 27,
falling $213.04 from the previous day’s close to 19,991.99, the biggest drop since February 4
(down $268.37). In the foreign exchange market, the euro sold off sharply against the dollar
and the yen, temporarily falling to 1.3166 dollars per euro, the lowest level in approximately a
year since late April last year. The yen also strengthened against the dollar, temporarily hitting
92.81 yen to the dollar for the first time in three business days. In the Tokyo Stock Exchange on
August 28, selling was also prevalent right from the start of trading. The Nikkei Stock Average
fell sharply, falling below the 10,900-yen level for the first time in three business days during
trading hours. The TOPIX was down 20.04 points to 977.59. However, as the Shanghai Composite
Index temporarily rose in value on the morning of the 28th on the Shanghai Stock Exchange in
China, some analysts believed that “the tone of the economic recovery has not changed, and it
is not necessarily true that the market will continue to fall” (a major securities firm). The weak
stock market has led to an inflow of funds into the bond market. The yield on newly issued 10-
year bonds temporarily fell to 1.28%, the lowest level in four months since December 30, 2008.
(article author)
When the above was provided as input to ChatGPT (GPT-4), the following output
was obtained. “Please continue.” was also provided as input.
Table 8.2: Portion of the generated table using the fusion method.
| Article title | Nikkei | Dow Jones | Exchange | Market | … |
|---|---|---|---|---|---|
| Foreign exchange/stocks: The Tokyo Stock Exchange is about to break the 8,000 yen barrier, temporarily falling 658 yen to its lowest level since the year's commencement. | The Nikkei Stock Average temporarily fell briefly and plunged rapidly, losing more than 650 yen from the previous day's closing, and during trading hours, it reached a new low for the year (8,115.41 yen), the lowest level in five years and five months since May 2003. … | The Dow Jones Industrial Average dropped more than $500 on the New York market on the previous day. | Due to economic instability in Europe and the United States, the dollar and the euro were sold off on the foreign currency market, while the yen significantly gained in value, raising fears about the declining performance of Japanese enterprises heavily dependent on exports. | The Tokyo stock market was down across the board on the 23rd due to the sharp decline in the US and European markets the previous day and the sharp appreciation of the yen. … | … |
| Greece: Fiscal crisis, plummeting stock prices in Japan, the United States, and Europe; decision to support, adjustment on the 10th of next month. | The Nikkei Stock Average temporarily dropped more than 330 yen on the Tokyo Stock Exchange on August 28. The Nikkei Stock Average fell sharply, falling below the 10,900-yen level for the first time in three business days during trading hours. | The Dow Jones Industrial Average fell sharply for the first time in seven business days on the New York Stock Exchange on May 27, falling $213.04 from the previous day's close to 19,991.99, the biggest drop since February 4 (down $268.37). | In the foreign exchange market, the euro sold off sharply against the dollar and the yen, temporarily falling to 1.3166 dollars per euro, the lowest level in approximately a year since late April last year. The yen also strengthened against the dollar, temporarily hitting 92.81 yen to the dollar for the first time in 3 business days. | … However, as the Shanghai Composite Index temporarily rose in value on the morning of the 28th on the Shanghai Stock Exchange in China, some analysts believed that "the tone of the economic recovery has not changed, and it is not necessarily true that the market will continue to fall" (a major securities firm). | … |
| … | … | … | … | … | … |
In this section, we evaluate the performance of the extracted content from the
first 10 articles using both the table arrangement technique described in Section
8.2.1 and the fusion method introduced in this section. The table constructed in
Section 8.2.1 primarily captured information from (b) to (i). Consequently, we
evaluated the performance solely based on the information within (b)–(i).
The evaluation results are presented in →Table 8.3. In this context, we take
the correct information obtained through either the table arrangement technique
or the fusion method as the reference answers. If an input document contains no
correct information, the case is considered correct when no information is
obtained or when it is extracted as N/A. Furthermore, if the obtained
information is partially correct, despite missing or additional information, it is
still deemed correct.
The fusion method demonstrated a higher accuracy rate of 0.88 than the table
arrangement technique with an accuracy rate of 0.73. Pairwise comparisons
revealed that the fusion method outperformed the table arrangement technique
in 49 of 80 items, whereas the table arrangement technique performed better in
15 items.
In →Table 8.3, only 10 outputs from the fusion method were erroneous. The
10 outputs are in the categories of market information, world affairs, and the
Tokyo Stock Exchange.
In two cases for the category of market information, the output of the fusion
method was “the entire sentence is about market information.” The entire
article contains information about the market, and in a sense, the output of the
fusion method is correct. However, the output is judged to be erroneous
because it is too broad and ambiguous.
There were three mistakes in the Tokyo Stock Exchange category. This study
was conducted in Japanese, and the Tokyo Stock Exchange category appeared as
“Tosho” (Tokyo Stock Exchange) in the input to ChatGPT. This category refers to
the Tokyo Stock Exchange Stock Price Index (TOPIX), but ChatGPT mistakenly
outputted the Nikkei Stock Average when only “Tosho” (Tokyo Stock Exchange)
was used. Changing the input to ChatGPT from “Tosho” (Tokyo Stock Exchange)
to the Tokyo Stock Exchange Stock Price Index (TOPIX) may improve the
situation. Thus, improving the prompts for ChatGPT could improve the
performance of the fusion method.
The table arrangement technique outperformed the fusion method in some
cases, of which six cases were deemed correct even with the fusion method,
excluding the two erroneous cases mentioned earlier. In most of the cases, the
fusion method also extracted information but obtained less information than the
table arrangement technique. The fusion method occasionally settled for partial
information instead of striving for comprehensive extraction. This aspect can be
improved by refining the prompts in future studies.
The fusion method demonstrated the capability to extract information and
generate tables with many columns, exhibiting suitable performance. It offers
the advantage of using ChatGPT to retrieve specific information when the
desired columns are known. In contrast, if the desired information is unknown,
the table arrangement technique can be used to identify the relevant content,
followed by the use of ChatGPT for retrieval, as demonstrated in the fusion
method described in this study.
It may be possible to use ChatGPT to find the desired columns. For example,
we can input the original documents into ChatGPT and ask ChatGPT what the
desired columns are. However, this is difficult because ChatGPT is currently
unable to handle large input documents. Hence, the fusion method is relatively
convenient in practice.
In the fusion method, ChatGPT processes one article at a time, obtaining
information for each article. Although multiple iterations of ChatGPT are
required for multiple articles, this approach allows for smaller inputs per
iteration, enabling efficient processing by ChatGPT.
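As an illustration of this per-article loop, the sketch below shows how the extraction step could be scripted against the OpenAI API. It is a minimal sketch under our own assumptions, not the authors' implementation: the model name, the abbreviated category list, the prompt wording, and the function names are all illustrative.

```python
# Minimal sketch (not the authors' code) of the per-article loop of the fusion
# method: each article is sent to ChatGPT separately together with the category
# list, and the answers are collected row by row. Assumes the OpenAI Python SDK
# (openai>=1.0) is installed and OPENAI_API_KEY is set; the model name, prompt
# wording, and function names are illustrative.
from openai import OpenAI

client = OpenAI()

CATEGORIES = (
    "(a) Nikkei 225, (a2) Change in Nikkei 225, (b) Dow Jones Industrial Average, "
    "(c) Foreign Exchange, (d) Market Information, ..."  # abbreviated; full list in the text
)

PROMPT = (
    "Please pull out the sentences from the following text that apply to the "
    "following categories. If none apply, please state N/A.\n"
    "# Classification " + CATEGORIES + "\n# text\n{article}"
)


def extract_row(article_text: str) -> str:
    """Ask ChatGPT to assign the article's sentences to the categories."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study used GPT-4
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
    )
    return response.choices[0].message.content


def build_table(articles: list[str]) -> list[str]:
    """One ChatGPT call per article keeps each input small, as described above."""
    return [extract_row(article) for article in articles]
```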
In previous studies, we suggested the use of question-answering processing
to retrieve information for each cell in a table, completing information extraction
for all cells [→11]. (We have also highlighted that it would be good to use
question-and-answer processing to find the desired columns (the content items
we should retrieve from documents) [→11].) The current study can be thought
of as leveraging ChatGPT to accomplish this, as ChatGPT improves the
performance of question-answering processing.
Organize a table showing when the Nikkei Stock Average rises or falls based on the following:
(a) Nikkei 225, (a2) Change in Nikkei 225, (b) Dow Jones Industrial Average, (c) Foreign
Exchange, (d) Market Information, (e) World Affairs, (f) Overseas Stock Indices, (g) Related to
“since,” (h) Drop, (i) Tokyo Stock Exchange, (j) International Information, (k) Noted, (l) Yields,
(m) Monetary Easing, (n) Earnings, (o) Start trading, (p) Related, (q) Trade, (r) Europe, (s)
Caution, (t) Trump administration, (u) New Corona, and (v) Foreign investors.
This table shows how each item would cause the Nikkei 225 to rise or fall.
To gain further insight, enter the following into ChatGPT: “Devise a number of factors other
than those shown in the table above. In the table, please list in which cases the Nikkei Stock
Average rises or falls according to the new factors you have invented.”
More information was obtained, which was useful in forecasting the Nikkei 225.
Entering the following into ChatGPT yielded →Table 8.7: “In the dollar-yen
exchange rate, organize the table to show when it would be the case that the
Nikkei Stock Average would rise or fall.” By asking ChatGPT questions for each
item, more detailed findings for each can be obtained.
Table 8.7: Additional knowledge acquired using obtained categories and
ChatGPT.
| Dollar-yen exchange rate | Reasons for the rise in the Nikkei Stock Average | Reasons for the fall in the Nikkei Stock Average |
|---|---|---|
| Strong dollar/weak yen | (1) Expectations are that export companies' overseas sales will increase in yen and their business performance will improve. (2) Lower investment costs for foreign investors in Japanese stocks. (3) Prices of raw materials in yen are stable or declining, except for companies that import resources and energy. | (1) Increased costs for companies that import large amounts of energy and resources. (2) When the excessive depreciation of the yen raises concerns about the stability of the economy as a whole. |
| Weak dollar/strong yen | (1) As import costs fall, companies that rely on imports are expected to improve their performance. (2) When deflationary pressure is eased. | (1) Exporting companies' overseas sales are expected to decrease in yen terms, leading to a deterioration in business performance. (2) Investment costs for foreign investors in Japanese stocks are rising. (3) There are concerns that the excessive appreciation of the yen will impede economic growth. |
8.6 Conclusions
This study logically predicted stock prices by analyzing newspaper sentences
related to stock prices using table arrangement techniques and ChatGPT.
In Section 8.2, we used table arrangement techniques to analyze the
Mainichi Shimbun data and extract similar articles on the Nikkei Stock Average.
These articles were subsequently organized into a table using table
arrangement technology, with groups of articles arranged in rows and clusters
of similar sentences in columns. By examining the information in the columns,
we investigated the type of information in the newspaper articles. We identified
22 content items related to various aspects, namely, (a) the Dow Jones industrial
average, (b) Exchange, (c) Market information 1, (d) Market information 2, (e)
World affairs, (f) Overseas stock indices, (g) Related to “since,” (h) Drop, (i) Tokyo
Stock Exchange, (j) International information, (k) Findings, (l) Yield, (m) Monetary
easing, (n) Performance, (o) Commencement of trading, (p) Related, (q) Trade,
(r) Europe, (s) Vigilance, (t) Trump administration, (u) New Corona, and (v)
Foreign investors in the target newspapers. Furthermore, content associated
with special events such as the “Trump administration” and “New Corona” was
also obtained.
In Section 8.3, we conducted analysis using ChatGPT and discovered its
capability to extract information from newspapers and perform basic
investigations, including organizing information into a table. Based on the
results obtained from ChatGPT, we discovered the following. The Nikkei Stock
Average rises because of factors such as improved US trade relations, a strong
domestic economy, central bank policies, technology sector performance,
foreign investments, corporate earnings, events such as the Olympics, a stable
yen, and a healthy global economy. Conversely, declines may result from global
stock crashes, economic bubbles, issues with key trade partners, industry
imbalances, financial crises, natural disasters, trade tensions, pandemics, and
energy price shifts.
In Section 8.4, we proposed a fusion method for constructing a large table
with high performance by integrating table arrangement techniques and
ChatGPT. The fusion method involves identifying relevant content using table
arrangement techniques and ChatGPT. Specifically, the table arrangement
technique identifies the matters to be retrieved, retrieves them from articles
using ChatGPT, and organizes them into a table. Our experiments confirmed the
effectiveness of the fusion method, achieving an accuracy rate of 0.88 with a
lenient evaluation criterion that allows for certain additional extractions and
minor missing information.
In Section 8.5, the items extracted in Section 8.2, which appear in many
articles on the Nikkei Stock Average, are organized into a table that shows how
the Nikkei Stock Average rises or falls when these items occur. In addition, the
table shows in more detail how the Nikkei Stock Average rises or falls when the
exchange rate changes. These analyses are also useful when examining the
movement of the Nikkei Stock Average.
In future studies, we will develop and refine the proposed methods to
improve stock price prediction accuracy.
References
[1] J. Bollen, H. Mao, and X.-J. Zeng, Twitter mood predicts the stock market.
Journal of Computational Science, vol. 2, no. 1, pp. 1–8, 2011. →
[2] H. Lee, M. Surdeanu, B. MacCartney, and D. Jurafsky, On the importance of
text analysis for stock price prediction. In: Proceedings of Ninth International
Conference on Language Resources and Evaluation (LREC 2014), pp. 1170–1175,
2014. →
[3] A. K. Nassirtoussi, S. Aghabozorgi, T. Y. Wah, and D. C. L. Ngo, Text mining for
market prediction: A systematic review. Expert Systems with Applications, vol. 41,
no. 16, pp. 7653–7670, 2014. →
[4] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, Deep learning for stock
prediction using numerical and textual information. In: 2016 IEEE/ACIS 15th
International Conference on Computer and Information Science (ICIS), pp. 1–6,
2016. →
[5] M. Murata, Y. Nakahara, and Q. Ma, Stock price prediction and knowledge
acquisition through machine learning and language processing. 26th Annual
Meeting of the Association for Natural Language Processing, pp. 657–660, 2020,
(in Japanese). →
[6] M. Murata, Content analysis of items in newspaper data using table
arrangement technology and ChatGPT for stock price prediction. The 22nd
International Conference on Information & Knowledge Engineering on CSCE
2023, pp. 1–8, 2023. a, b, c, d
[7] M. Murata, K. Okazaki, and Q. Ma, Improved method for organizing
information contained in multiple documents into a table. Journal of Natural
Language Processing, vol. 28, no. 3, pp. 802–823, 2021. a, b, c, d
[8] X. Wu, J. Zhang, and H. Li, Text-to-table: A new way of information extraction.
In: Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics, pp. 2518–2533, May 2022 →
[9] OpenAI, ChatGPT. →https://openai.com/blog/chatgpt/, 2022. →
[10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang,
S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens,
A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, Training language
models to follow instructions with human feedback. arXiv:2203.02155, 2022. →
[11] M. Murata, M. Uchiyama, and H. Isahara, Information extraction using a
question and answer system. Workshop on Information Extraction for the 6th
Annual Meeting of the Association for Natural Language Processing, pp. 33–40,
2000, (in Japanese). a, b
9 An experimental study on road surface
classification
Addisson Salazar
Gonzalo Safont
Luis Vergara
Alberto Gonzalez
Abstract
This chapter presents an experimental study that explores the discrimination
capabilities of a set of features for road surface classification. The features are
extracted from signals collected by a number of sensors installed in a vehicle
traveling on three different surfaces: smooth flat asphalt, stripes, and
cobblestones. The features consist of time-, frequency-, and statistic-domain
parameters that represent the behavior of the signals under different driving
conditions. The accuracy on road classification from those features is a
determining factor in the quality of automatic power-assisted steering systems.
The experiments implemented several driving configurations, i.e., hands on or
off the wheel and constant or accelerated vehicle speed. In addition, several
setups of the classification procedure were tested varying the classifier method
(random forest, support vector machine, linear discriminant analysis, and a
decision fusion method) and the number of features (selected features by
ranking and reducing the number of features using principal component
analysis). The results show high accuracy of the proposed classification system
for road surface classification.
Keywords: road surface classification, machine learning, feature selection,
feature ranking, decision fusion, classification
9.1 Introduction
Currently, machine learning methods are being increasingly studied to solve
complex problems in several fields of application. One of these applications is
the automatic road surface classification for developing autonomous or
semiautonomous car technologies incorporated in power-assisted steering
systems. Thus, the machine learning problem consists of classifying the data from several
sensors installed in the car to determine the surface on which the car is
traveling. Existing works on road surface classification can be broadly classified
into three partially overlapping groups depending on the categories of data
(classes) to be classified: (i) estimation of road roughness profiles, (ii) detection
of weather conditions hazardous for driving, and (iii) road type detection. These
three groups typically have diverging goals. Road roughness profiles are
typically studied to obtain cost-effective solutions for supervising and planning
road maintenance and weather conditions are typically studied to increase the
safety of the driver and passengers, while road type detection studies have goals
that depend on the considered classes [→1, →2, →3, →4, →5, →6, →7, →8, →9,
→10, →11, →12, →13].
In this chapter, a road surface identification system is proposed where the
information from sensors for measuring audio, acceleration, speed, and steering
wheel signals is collected and processed. The road surface categories to be
identified are the following: smooth flat asphalt, stripes, and cobblestones. From
the measured signals, several time-, frequency-, and statistic-domain features
that represent the behavior of the signals under different driving conditions are
extracted. The accuracy on road classification from these features is a
determining factor in the quality of automatic power-assisted steering systems.
The experiments implemented several car driving configurations, i.e., hands on
or off the wheel and constant or accelerated vehicle speed.
The experiments also considered several setups of the classification
procedure, implementing the following single classifiers: random forest (RDF),
support vector machine (SVM), and linear discriminant analysis (LDA). Separated
score integration (SSI) was used as a decision fusion method to combine the
results of the single classifiers improving both accuracy and stability of the
results. In addition, considering that the number of the features is high, several
methods to reduce this number were applied including principal component
analysis (PCA), feature ranking, and ReliefF. This avoids overfitting of the
estimated model to the data and provides noise filtering. The results of the
experiments show high accuracy of the proposed classification system for road
surface classification.
The rest of the chapter is organized as follows. Section 9.2 describes the road
surface identification system implemented, including the definitions of the
features extracted from the sensor. Section 9.3 provides a short review of
methods for reduction of dimension of the data observation vector. Section 9.4
is dedicated to the experimental setup and optimization of the system, including
choosing the optimal window size and optimal dimensionality reduction. Section
9.5 contains the results and Section 9.6 comprises the conclusions of this work.
Table 9.1: Features extracted from data (x) windowed in epochs of length Δ.

| Feature | Definition |
|---|---|
| Third-order autocorrelation | $\frac{1}{\Delta-2}\sum_{n=3}^{\Delta} x(n)\,x(n-1)\,x(n-2)$ |
| Time reversibility | $\left(\frac{1}{\Delta}\sum_{n=1}^{\Delta} x^{2}(n)\right)^{-3/2}\frac{1}{\Delta-1}\sum_{n=2}^{\Delta}\left(x(n)-x(n-1)\right)^{3}$ |
| Average power | $\frac{1}{\Delta}\sum_{n=1}^{\Delta} x^{2}(n)$ |
| Centroid frequency | $\frac{f_S}{\Delta}\,\frac{\sum_{f=1}^{\Delta} f\,\lvert X(f)\rvert^{2}}{\sum_{f=1}^{\Delta}\lvert X(f)\rvert^{2}}$, where $X(f)$ is the direct Fourier transform of $x$ within the epoch taken at Δ points, and $f_S$ is the sampling rate |
| Maximum frequency | $\frac{f_S}{\Delta}\left(\underset{f}{\arg\max}\,\lvert X(f)\rvert\right)$ |
| … | $\frac{1}{f_{o2}-f_{o1}}\sum_{f=f_{o1}}^{f_{o2}}\lvert X(f)\rvert$ |
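To make these definitions concrete, the following NumPy sketch (ours, not the chapter's code) computes several of the listed features for a single epoch x of length Δ sampled at rate fs; the band limits of the last feature are left out for brevity.

```python
# Sketch (ours, not the chapter's code) of several features from Table 9.1,
# computed on one windowed epoch x of length delta sampled at rate fs.
import numpy as np

def third_order_autocorrelation(x):
    d = len(x)
    return np.sum(x[2:] * x[1:-1] * x[:-2]) / (d - 2)

def time_reversibility(x):
    d = len(x)
    power = np.mean(x ** 2)                        # (1/delta) * sum x^2(n)
    diff_cubed = np.sum(np.diff(x) ** 3) / (d - 1)
    return power ** (-1.5) * diff_cubed

def average_power(x):
    return np.mean(x ** 2)

def centroid_frequency(x, fs):
    d = len(x)
    spectrum = np.abs(np.fft.fft(x)) ** 2          # |X(f)|^2 at delta points
    f = np.arange(1, d + 1)                        # bins indexed 1..delta as in the table
    return (fs / d) * np.sum(f * spectrum) / np.sum(spectrum)

def maximum_frequency(x, fs):
    spectrum = np.abs(np.fft.fft(x))
    return (fs / len(x)) * np.argmax(spectrum)
```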
9.3 A short review of methods for observation vector dimension
reduction
Considering the high dimension (560) of the data observation vectors,
dimensionality reduction was performed as a preprocessing step before
classification. There are two main reasons for feature selection. Firstly, we would
like to reduce the number of features for computational reasons: faster
evaluation times, lower memory consumption, lower implementation cost, and
so on. Secondly, reducing the number of features might improve performance
or reduce variability.
Dimensionality reduction methods are typically classified in two categories:
feature extraction, where new features are derived from the original ones; and
feature selection, where one or more of the original features are selected and
the rest are discarded [→15]. In turn, feature selection is typically approached in
one of two ways: ranking features according to some criterion and selecting the
top Q features (feature ranking); or selecting a subset of features that keep or
improve classification performance (subset selection) [→15]. Subset selection
algorithms can automatically determine the number of selected features, while
feature ranking algorithms need to rely on a user-determined threshold (or
equivalent method) to set the number of selected features.
The simple one-variable classifier used for feature ranking is shown in →Table
9.2. The algorithm maximizes the informedness of the result when splitting the
values using two thresholds; any value $x$ is assigned to class 1 if $x_1 < x \leq x_2$
and class 0 otherwise. The search for the pair of values that optimize the
informedness of the classification is done in only one pass through the set of
values. This method is fast to compute and only considers the order of the
values of the input variable, while disregarding the actual values. Thus, the
method is robust with respect to isolated outliers, extreme values, and any data
transformation that does not affect the order of the values of the features, such
as: centering, scaling, exponentiation (when negative values of x are possible,
only for odd powers), and logarithmic transformation.
2. For $n = 1, \ldots, N$:
3. Compute the probability of detection $P_d$ and the probability of false alarm $P_f$ for values up to $x'_n$:
$$P_d(x \leq x'_n) = \frac{\sum_{i=1}^{n} y'_i}{\sum_{i=1}^{N} y'_i}, \qquad P_f(x \leq x'_n) = \frac{\sum_{i=1}^{n} \left(1 - y'_i\right)}{\sum_{i=1}^{N} \left(1 - y'_i\right)}$$
4. Compute informedness as $I(x \leq x'_n) = P_d(x \leq x'_n) - P_f(x \leq x'_n)$.
5. $n_{\min} = \underset{n}{\arg\min}\, I(x \leq x'_n)$, $n_{\max} = \underset{n}{\arg\max}\, I(x \leq x'_n)$.
6. The sub-score of the feature $x$ for class $j$ is the informedness within the range $(x'_{n_{\min}}, x'_{n_{\max}}]$:
$$I(x'_{n_{\min}} < x \leq x'_{n_{\max}}) = I(x \leq x'_{n_{\max}}) - I(x \leq x'_{n_{\min}}).$$
For multiclass problems such as the one shown in this work, the problem with K
classes is first divided into K binary one-vs.-all problems. Then, the score of the
i-th feature is obtained as the average of its scores for each of the K binary
problems.
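Under our reading of the reconstructed steps above, the ranking could be sketched as follows; the function and variable names are ours, and the single-pass search over the two thresholds follows the description in the text.

```python
# Sketch (ours) of the one-variable informedness ranking reconstructed above.
# Feature values are sorted, cumulative detection and false-alarm probabilities
# are accumulated, and the score is the best informedness obtainable with a
# two-threshold split, found in a single pass over the sorted values.
import numpy as np

def informedness_score(x, y):
    """x: 1-D feature values; y: binary labels (1 = target class)."""
    order = np.argsort(x)                    # only the order of values matters
    y_sorted = y[order].astype(float)
    pd = np.cumsum(y_sorted) / y_sorted.sum()            # P_d(x <= x'_n)
    pf = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()  # P_f(x <= x'_n)
    informedness = pd - pf                               # I(x <= x'_n)
    best, running_min = 0.0, 0.0
    for value in informedness:               # best I(n_max) - I(n_min), n_min < n_max
        running_min = min(running_min, value)
        best = max(best, value - running_min)
    return best

def rank_features(X, y, classes):
    """Average the one-vs.-all sub-scores over the K binary problems."""
    scores = np.array([
        np.mean([informedness_score(X[:, j], (y == c).astype(int)) for c in classes])
        for j in range(X.shape[1])
    ])
    return np.argsort(scores)[::-1]          # feature indices, best first
```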
An example of the results of the feature ranking process described in Section 9.3
is shown in →Figure 9.3, which shows the four best-ranking features (see
Section 9.2.1) extracted from the data displayed in →Figure 9.2.
Figure 9.3: Four best-ranking features extracted from the data shown in
→Figure 9.2. The transition at 6.50 s is marked in all four features.
The experiments described in Section 9.5 were used to optimize the capability of
the system to determine the surface the car is traversing from the considered
channels. This optimization was carried out via a series of Monte Carlo
experiments, following the steps shown in →Figure 9.4. Essentially, features are
extracted from the data and then they are randomly split into three subsets: 50%
of the features are used for training the classifiers (LDA, SVM, and RDF); 25% of
the features are used to train SSI; and the remaining 25% of the features are
used to determine classification performance. The proportion of samples of each
road surface was chosen to be as similar as possible to the proportions for the
full dataset. Also, to avoid overfitting, features extracted from the same
experiment were always assigned to the same subset. These experiments are
repeated for 100 iterations and the average and standard deviation of the results
are computed.
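A sketch of one iteration of this Monte Carlo split is given below. The 50%/25%/25% proportions follow the text, while the grouping helper and its names are ours, and balancing of the road-surface proportions is omitted for brevity.

```python
# Sketch (ours) of one Monte Carlo iteration of the 50%/25%/25% split described
# above. Feature vectors from the same experiment always fall in the same
# subset; balancing the road-surface proportions is omitted for brevity.
import numpy as np

def monte_carlo_split(experiment_ids, rng):
    """Return index arrays for classifier training, SSI training, and testing."""
    unique_ids = np.array(sorted(set(experiment_ids)))
    rng.shuffle(unique_ids)
    n = len(unique_ids)
    groups = (unique_ids[: n // 2],            # 50% of experiments -> classifiers
              unique_ids[n // 2: 3 * n // 4],  # 25% -> SSI fusion
              unique_ids[3 * n // 4:])         # 25% -> performance estimation
    ids = np.asarray(experiment_ids)
    return tuple(np.flatnonzero(np.isin(ids, g)) for g in groups)

# Repeated for 100 iterations; the mean and standard deviation of the test
# accuracy over the iterations are then reported, as in the text.
rng = np.random.default_rng(0)
```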
Figure 9.4: Diagram of the experiments used during the optimization process.
Figure 9.6: First five components (49.82% explained variance) obtained by PCA
from the features.
Figure 9.7: Five highest-ranked features returned by the considered feature
ranking method.
For PCA, the optimal number of components was 350 for LDA, 9 for SVM, and 6 for RDF. For ReliefF, improvements in accuracy
could be found for LDA and SVM, but not for RDF. The optimal number of
features was 15 for LDA, 250 for SVM, and 560 (no reduction) for RDF. For the
proposed feature ranking, as for ReliefF, improvements could only be found for
LDA and SVM. The optimal number of features was 10 for LDA, 5 for SVM, and
560 (no reduction) for RDF. PCA yielded the worst overall result, and ReliefF and
the proposed feature ranking methods yielded similar results. On average, the
result of PCA was 1.92% lower than that of ReliefF and 1.71% lower than that of
the proposed feature ranking method.
Table 9.3: Optimal feature selection and performance of the proposed system.
| | | LDA | SVM | RDF | SSI |
|---|---|---|---|---|---|
| PCA | Number of components, Q | 350 | 6 | 9 | n/a |
| | Accuracy (%) | 93.65 | 93.60 | 92.50 | 93.88 |
| | Standard error (%) | 0.38 | 0.34 | 0.37 | 0.36 |
| ReliefF | Number of features, Q | 15 | 250 | 560 | n/a |
| | Accuracy (%) | 95.22 | 94.46 | 95.53 | 96.10 |
| | Standard error (%) | 0.28 | 0.56 | 0.37 | 0.71 |
| Feature ranking | Number of features, Q | 10 | 5 | 560 | n/a |
| | Accuracy (%) | 94.99 | 93.79 | 95.73 | 95.97 |
| | Standard error (%) | 0.26 | 0.37 | 0.24 | 0.25 |
Figure 9.8: Dependence of average performance with respect to the number of
features selected using PCA, ReliefF, and the proposed feature ranking method.
The difference in behavior for feature ranking (both ReliefF and the proposed
method) between RDF and the other classifiers was owing to the built-in feature
selection stage in RDF, given that every tree in the forest only considers some of
the input features. In fact, RDF has been used for feature selection in several
applications, e.g., [→26, →27]. This selection might have conflicted with the
selection performed using our considered method, thus reducing the overall
effectiveness of RDF with both feature-ranking methods.
With respect to the computation times, PCA and the proposed method took
an average of 0.28 s to compute, while ReliefF took an average of 53.33 s to
compute. These calculations were performed in Matlab R2022b, running on a
Windows 10 machine with an Intel Xeon E3 CPU and 16 GB of RAM. Thus, the
proposed feature ranking method was much faster than ReliefF and just as fast
as PCA.
Given the results shown in →Table 9.3 and the differences in computational
costs, in the following, we perform dimensionality reduction by the proposed
feature ranking method. The number of selected features Q was taken from
→Table 9.3. The above processing is related to the reduction of dimension of the
number of columns of the data matrix, which is the input to the classification
stage. Thus, the original dimension is reduced from $N \times 560$ to $N \times Q$. As
explained above, the classification performance depends heavily on the number
of columns. However, it also depends on the number of rows of the classification
matrix, i.e., the number of samples. The estimation of the adequate sample size
for the training stage remains a critical issue that determines the quality of the
classification results. Learning curves of the methods could be estimated,
providing an estimate of the reduction in the excess of the probability of error
obtained by increasing the training set size [→28]. This is a complex task that is
outside the scope of this work. When the data sample size is scarce or the data
categories are imbalanced, classification performance may decrease. In that case,
we could resort to oversampling techniques, adding synthetic samples to avoid
biasing the classifiers. Recently, this was demonstrated for road surface
identification systems using an oversampling method based on generative
adversarial networks (GANs) [→29] and graph signal processing [→30, →31, →32,
→33].
$Q_{\mathrm{RDF}} = 560$ (no reduction).
Table 9.4: Accuracy for some of the experiments with changes in road surface.

| Transition | Experiment | LDA (%) | SVM (%) | RDF (%) | SSI (%) |
|---|---|---|---|---|---|
| Stripes to smooth flat asphalt | 1 | 99.03 | 98.08 | 90.38 | 99.04 |
| | 2 | 86.21 | 96.55 | 96.55 | 96.55 |
| Cobblestones to smooth flat asphalt | 3 | 99.20 | 99.20 | 99.23 | 99.23 |
| | 4 | 96.47 | 97.65 | 96.47 | 97.65 |
Figure 9.9: Estimation of the road surface on experiments where there are
transitions between road surfaces corresponding to experiments of →Table 9.4:
(a) exp. #1; (b) exp. #3; (c) exp. #2; and (d) exp. #4.
9.7 Conclusions
An experimental study on road surface classification has been presented. The
proposed system comprises a set of 10 sensors (3 microphones, 3
accelerometers, 2 speed sensors, 2 steering wheel sensors) and a machine
learning procedure involving feature extraction and selection, classification, and
decision fusion stages. Results show the proposed system is able to determine
the surface with high accuracy (>96%) for three types of road surfaces: smooth
flat asphalt, stripes, and cobblestones. This accuracy is enabled by the additional
sensor configuration and compares favorably with many currently available
systems based on electric power steering (EPS) signals, which reach around 80%
accuracy. There are several open lines of research from this work; for instance,
semi-supervised learning (using labeled and unlabeled data samples together)
and oversampling techniques could be explored in order to improve the
classification results.
References
[1] V. Surblys, V. Žuraulis, and E. Sokolovskij, Estimation of road roughness from
data of on-vehicle mounted sensors. Eksploat. I. Niezawodn.-Maint. Reliab, vol.
19, pp. 369–374, 2017. →
[2] J. Park, K. Min, H. Kim, W. Lee, G. Cho, and K. Huh, Road surface classification
using a deep ensemble network with sensor feature selection. Sensors, vol. 18,
pp. 4342, 2018. →
[3] G. Safont, A. Salazar, A. Rodríguez, and L. Vergara, Multichannel signal
processing for road surface identification. In: Proceedings of the 2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 3052–3056, Barcelona, Spain, 4–8 May 2020. →
[4] K. Han, M. Choi, and S. B. Choi, Estimation of the tire cornering stiffness as a
road surface classification indicator using understeering characteristics. IEEE
Transactions on Vehicular Technology, vol. 67, pp. 6851–6860, 2018. →
[5] A. Bystrov, E. Hoare, T. Y. Tran, N. Clarke, M. Gashinova, and M. Cherniakov,
Sensors for automotive remote road surface classification. In: Proceedings of the
2018 IEEE International Conference on Vehicular Electronics and Safety (ICVES),
Madrid, Spain, 12–14 September 2018. →
[6] A. Salazar, A. Rodriguez, N. Vargas, and L. Vergara, On training road surface
classifiers by data augmentation. Applied Sciences, vol. 12, pp. 3423, 2022. →
[7] S. M. Yusoff and J. Giacomin, The effect of vibrational energy distribution on
the level of driver detection. AIP Conference Proceedings, vol. 2059, p. 020032,
2019. →
[8] J. R. Ng, J. S. Wong, V. T. Goh, W. J. Yap, T. T. V. Yap, and H. Ng, Identification of
road surface conditions using IoT sensors and machine learning. In:
Computational Science and Technology, pp. 259–268, Singapore: Springer, 2019.
→
[9] T. Beilfuss, K. P. Kortmann, M. Wielitzka, C. Hansen, and T. Ortmaier, Real-time
classification of road type and condition in passenger vehicles. IFAC-PapersOnLine,
vol. 53, pp. 14254–14260, 2020. →
[10] G. Safont, A. Salazar, A. Rodríguez, and L. Vergara, Comparison of
dimensionality reduction methods for road surface identification system. In:
Proceedings of the 2020 Science and Information Conference, pp. 554–563,
London, UK, 16–17 July 2020. →
[11] T. Chugh, F. Bruzelius, M. Klomp, and B. Jacobson, Steering feedback
transparency using rack force observer. IEEE/ASME Trans. Mechatron, vol. 27, no.
5, pp. 3853–3864, 2022. →
[12] E. Bonera, M. Gadola, D. Chindamo, S. Morbioli, and P. Magri, On the
influence of suspension geometry on steering feedback. Applied Sciences, vol.
10, pp. 4297, 2020. →
[13] L. Yaohua, F. Jikang, H. Jie, N. Youfei, and F. Qianlong, Novel electric power
steering control strategies of commercial vehicles considering adhesion
coefficient. Advances in Mechanical Engineering, vol. 12, 2020. →
[14] G. Peeters, A large set of audio features for sound description (similarity
and classification) in the CUIDADO project 2004. a, b, c
[15] I. Jolliffe, Principal Component Analysis, New York, NY, USA: Springer, 2002.
a, b
[16] O. Shamir, A stochastic PCA and SVD algorithm with an exponential
convergence rate. In: International Conference on Machine Learning (ICML),
Lille, France, 2015. →
[17] G. Safont, A. Salazar, and L. Vergara, On recovering missing GPR traces by
statistical interpolation method. Remote Sensing, vol. 6, pp. 7546–7565, 2014.
[18] G. Safont, A. Salazar, L. Vergara, E. Gomez, and V. Villanueva, Mutichannel
dynamic modeling of non-Gaussian mixtures. Pattern Recognition, vol. 93, pp.
312–323, 2019.
[19] H. Liu, and H. Motoda (Eds.), Computational Methods of Feature Selection,
Boca Raton: CRC Press, 2007. a, b, c
[20] I. Kononenko, E. Šimec, and M. Robnik-Šikonja, Overcoming the myopia of
inductive learning algorithms with RELIEFF. Applied Intelligence, vol. 7, no. 1, pp.
39–55, 1997. →
[21] D. M. Powers, Evaluation: from precision, recall and F-Measure to ROC,
informedness, markedness & correlation. Journal of Machine Learning
Technologies, vol. 2, no. 1, pp. 37–63, 2011. →
[22] G. Safont, A. Salazar, and L. Vergara, Multiclass alpha integration of scores
from multiple classifiers. Neural Computation, vol. 31, no. 4, pp. 806–825, 2019.
→
[23] G. Safont, A. Salazar, and L. Vergara, Vector score alpha integration for
classifier late fusion. Pattern Recognition Letters, vol. 136, pp. 48–55, 2020.
[24] A. Salazar, G. Safont, L. Vergara, and E. Vidal, Pattern recognition techniques
for provenance classification of archaeological ceramics using ultrasounds.
Pattern Recognition Letters, vol. 135, pp. 441–450, 2020.
[25] A. Salazar, G. Safont, L. Vergara, and E. Vidal, Graph regularization methods
in soft detector fusion. IEEE Access, vol. 11, pp. 144747–144759, 2023. →
[26] R. Diaz-Uriarte, and S. Alvarez de Andres, Gene selection and classification
of microarray data using random forest. BMC Bioinformatics, vol. 7, article no. 3,
2006. →
[27] M. Hasan, M. Nasser, S. Ahmad, and K. Molla, Feature selection for intrusion
detection using random forest. Journal of Information Security, vol. 7, pp. 129–
140, 2016. →
[28] A. Salazar, L. Vergara, and E. Vidal, A proxy learning curve for the Bayes
classifier. Pattern Recognition, vol. 136, pp. 109240, 2023. →
[29] A. Salazar, G. Safont, and L. Vergara, Application of an oversampling method
for improving road surface classification. In: 2022 International Conference on
Computational Science and Computational Intelligence (CSCI), pp. 647–651, Las
Vegas, NV, USA, 2022. →
[30] L. Vergara, A. Salazar, J. Belda, G. Safont, S. Moral, and S. Iglesias, Signal
processing on graphs for improving automatic credit card fraud detection. In:
Proceedings of IEEE International Carnahan Conference on Security Technology
(ICCST), pp. 1–6, Madrid, Spain, 2017.
[31] J. Belda, L. Vergara, G. Safont, A. Salazar, and Z. Parcheta, A new surrogating
algorithm by the complex graph Fourier transform (CGFT). Entropy, vol. 21, no. 8,
article no. 759, 2019.
[32] J. Belda, L. Vergara, G. Safont, and A. Salazar, Computing the partial
correlation of ICA models for non-Gaussian graph signal processing. Entropy,
vol. 21, no. 1, article no. 22, 2019.
[33] J. Belda, L. Vergara, A. Salazar, and G. Safont, Estimating the Laplacian
matrix of Gaussian mixtures for signal processing on graphs. Signal Processing,
vol. 148, pp. 241–249, 2018.
10 RNN models for evaluating financial indices:
examining volatility and demand-supply shifts in
financial markets during COVID-19
Ray R. Hashemi
Omid M. Ardakani
Jeffrey Young
Azita G. Bahrami
Abstract
Several semi-recurrent neural networks (RNNs) were developed to explore the
impact of the COVID-19 pandemic on the financial indices representing bond
market (yield spread – Spread), energy market (crude oil prices – Oil), stock
market (volatility index – VIX and Wilshire 5,000 total market index – Wil5000),
housing market (Shiller housing price index – HPI), commodity market (Gold,
Wheat, and Soybean indices), and recession (USrec). The indices were divided
into two groups based on two properties: volatility and demand-supply shifts.
USrec was a part of both groups. Two time-series datasets (dataset 1 and
dataset 2) were obtained from the Federal Reserve Bank of St. Louis for the two
groups. The volatility and demand-supply shift groups had daily and monthly
data, respectively. Each dataset was first purified by applying derived-
breakpoints (the transition block) for each index, separately. Second, the purified
dataset was partitioned into “before” and “during” the pandemic. Semi-RNNs
were trained by the training sets for the “before” and “during”, separately, and
tested against their corresponding test sets which delivered two 5-element
counter arrays. Each array carries the overall percent of the correct
classifications of the indices for the entire records of the corresponding test sets.
A decay rate was calculated for each index in the arrays and the average of
decay rates (overall decay) is also calculated for each array to be used for
inspecting the impact of COVID-19 on indices. The positive, zero, and negative
decay rate means COVID-19 has negative, none, and positive impacts on the
financial indices, respectively. The results also revealed: (a) VIX was explained by
Wil5,000, Spread, Oil, and USrec more accurately before the pandemic, indicating
that other observed and unobserved factors arising from the COVID-19
pandemic would affect the VIX more than the other financial indices. (b) USrec is
predicted less accurately compared to other indices during the pandemic, which
shows the sensitivity of this index to health and geopolitical challenges. (c)
Effects of the indices on the bond market diminished during the pandemic. (d)
HPI declines in predictive accuracy during the pandemic indicate disruptions in
the housing market due to economic and social changes. (e) Gold price drop in
predictive accuracy during the pandemic reflects its nature as a relatively stable
safe-haven but still sensitive to COVID-19. (f) Wheat’s substantial drop in
predictive accuracy highlights its sensitivity to disrupted supply chains, global
trade dynamics, and fluctuating demands during the pandemic. (g) Similar to
wheat, soybean predictive accuracy fell during the pandemic, underscoring the
impact of supply chain disruptions on agricultural commodities. The sensitivity
analysis suggests that the results remain highly robust to the changes in the
number of records chosen for the different test sets.
Keywords: COVID-19 impact on financial indices, stock market, bond market,
energy market, commodity market, housing market, recurrent neural network,
transition boundary, robustness analysis
10.1 Introduction
The US economy is under the influence of several major financial markets
including stock, bond, energy, commodity, and housing markets. Such influences
can be more severe if they are combined with a long-lasting societal disaster
such as the COVID-19 pandemic. The reason stems from the fact that there was
a forced lack of social interactions. The COVID-19 pandemic significantly
disrupted various sectors of the economy, including supply chains, the
workforce, agriculture, and education. It also changed the demand for goods.
These disruptions had widespread effects on both the economy and financial
markets.
Our investigation for evaluating the financial indices is done by predicting
these indices using different groups of markets. Our efforts are divided into two
parts as follows. Part 1 includes the exploration of the financial indices
forecasting using the energy, stock, and bond markets. We grouped these
markets because they display a higher degree of volatility in the presence of a
long-lasting societal event like COVID-19. Part 2 includes the exploration of the
COVID-19 impact on the financial indices’ behavior using commodity and
housing markets. We group these markets because their common property is
their quick reaction to the presence of a high demand for goods by the society
and/or broken supply-chains.
For Part 1, the following financial indices are used: the interest rate spread
(Spread), oil prices (Oil), the volatility index (VIX), and the Wilshire 5,000 market
index (Wil5000). These indices represent the bond market, energy market, stock
market volatility, and volume level of the traded stocks in the stock market,
respectively [→1, →2]. The Spread is the difference between the interest rates of
long-term (10 year U.S. Treasury) and short-term (2 year U.S. Treasury)
government bonds. This difference can give us insights into economic activities
and potential recessions [→3]. The price of oil is set by a group of countries
known as OPEC-plus, and it affects both recessions [→4, →5] and the rate of
inflation [→6]. The VIX index represents how much stock market prices are
expected to fluctuate in the near future based on options traded on the stock
market [→7]. Finally, the Wilshire 5000 index measures the combined value of
stocks from over 5,000 publicly traded companies in the U.S. [→8].
For Part 2, the following indices are used: HPI (Shiller Housing Price Index),
Gold, Wheat, and Soybean indices. The first index represents the housing market
and the last three represent the heart of the commodity market. The Shiller
Housing Price Index is a leading US residential housing market measure, and it
offers insights into economic trends. Housing markets often provide early
signals of economic downturns and recoveries. As a result, housing market
trends have a strong predictive value in economic cycles [→9, →10]. A downturn
in housing prices often indicates broader economic challenges, as evidenced in
the 2008 financial crisis [→11].
The selection of commodity futures helps to understand global economic
health. Wheat and soybean futures contracts, for instance, are indicators of
agricultural market trends and global trade dynamics, as discussed in [→12].
Gold, traditionally seen as a ‘safe haven’ during economic uncertainties, has
implications for the global economy [→13]. Variations in these markets can
indicate shifts in economic activities. Gold’s role during economic uncertainties,
such as the COVID-19 period and agricultural commodities’ responses to global
supply and demand changes are critical in this context.
The impact of COVID-19 on our two groups of market is explored. This is
accomplished by dividing the datasets, used for parts 1 and 2, into “Before” and
“During” pandemic eras. The comparison of the financial indices’ accuracy
prediction for each era determines the impact of COVID-19 on our groups of
markets. It is pertinent to inquire about the relevance of this comparison to
COVID-19. The presence of chaotic behavior in data often leads to prediction
inaccuracies, which, in turn, may be indicative of a chaotic state within the
economy characterized by heightened uncertainty.
The breaking point between the “before” and “during” periods cannot be
one specific day but rather it is an interval of days (a transition block). The
reason for having a transition block is that the impact of COVID-19 on an index is
gradual and it may be different from one index to the next. Also, the time for the
COVID-19 impact to have its full effect is not the same for all the indices. The
best way to determine this boundary is by investigating the behavior of
individual indices to determine a period of transition from the “Before” era to
the “During” era.
The nature of the data for financial indices (time-series data) and the fact that
the prediction of the recession for the record of date d is influenced by the
history of the record's predecessors are properties that demand the use of
recurrent neural networks (RNNs). In this neural network, any record for which a
value of one of its designated indices is predicted is also morphed (as a part of
the history) and participates in the prediction of a value for the designated index
in the next record.
The goal of this research effort is twofold: Development of RNNs to study the
behavior of the financial indices in the context of COVID-19 using: (a) bond,
energy, and stock markets and (b) commodity and housing markets. Previously,
a preliminary investigation of the COVID-19 impacts on the financial indices VIX,
Wil5000, Spread, and Oil was completed and reported in [→14]. This chapter is an
extension of the reported paper.
The structure of the remainder of the chapter is as follows. Previous work,
methodology, empirical results and discussion, and conclusion and future
research are the subjects of Sections 10.2, 10.3, 10.4, and 10.5, respectively.
10.3 Methodology
The objectives for meeting the goal are: (a) discretization of the datasets that make up Part 1 and Part 2 [→26], (b) identification of transition blocks for each index and
for each dataset [→27], and (c) introduction of a semi-recurrent neural network
for evaluating financial indices during COVID-19 [→28]. The rest of this section
describes the above objectives in detail.
Method:
1. Select K records from D as the centers (means) for clusters CL1, …, CLK.
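Only the first step of this clustering-based discretization method survives in the extracted text. As a rough sketch under that assumption, a K-center (k-means-style) routine for turning one index's continuous values into discrete codes 1..K could look like the following; the function name, the assignment/update loop, and the stopping rule are illustrative assumptions, not the authors' exact procedure.

    import numpy as np

    def kmeans_discretize(values, K, n_iter=50, seed=0):
        # Step 1 (from the text): pick K records of D as the initial cluster centers.
        rng = np.random.default_rng(seed)
        x = np.asarray(values, dtype=float)
        centers = x[rng.choice(len(x), size=K, replace=False)]
        for _ in range(n_iter):
            # Assign each record to its nearest center, then recompute the centers (means).
            labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
            centers = np.array([x[labels == k].mean() if np.any(labels == k) else centers[k]
                                for k in range(K)])
        return labels + 1, centers  # discrete values 1..K, as in Table 10.2

    # hypothetical usage on one index series with K = 3 discrete values
    codes, centers = kmeans_discretize(np.random.default_rng(1).lognormal(3, 0.4, 500), K=3)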
The outcome from the Current output layer is the one considered as the output
of the RNN. This outcome is influenced by the Current input record and the
history of the predecessors of this record that is manifested as the Past input
record. To maintain such a history, the outcome of the neurons from the Past output layer always feeds back to the corresponding neurons in the Past input layer. The Current input record also contributes to the history of its predecessors. The key to such contribution is the connection of the neurons in
the Current input layer to each one of the neurons in the Past output layer.
When the Current input record is the first record of a dataset, there is no
history of predecessors for this record. We resolve this problem by using the first
record of the dataset as the Past input record and, thus, we feed the Current
input layer starting from the second record of the dataset. We use the Gaussian Error Linear Unit (GELU) activation function [→35] for our neural network, which is calculated by Formula (1):

Out = Net / (1 + e^(−1.702 · Net))   (1)

where Out is the output for a given node and Net is the weighted sum of inputs to the node.
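As a minimal sketch, Formula (1) (the sigmoid-based GELU approximation) and a single recurrent update that uses it can be written as follows; the weight matrices and the recurrent_step helper are illustrative placeholders, not the authors' exact semi-RNN.

    import numpy as np

    def gelu_approx(net):
        # Formula (1): Out = Net / (1 + exp(-1.702 * Net))
        return net / (1.0 + np.exp(-1.702 * net))

    def recurrent_step(x_current, h_past, W_in, W_past, b):
        # Generic recurrent update: the Current input and the fed-back Past state
        # both contribute to the weighted sum Net before the GELU activation.
        net = x_current @ W_in + h_past @ W_past + b
        return gelu_approx(net)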
Table 10.2: Discrete values of the indices, along with the range of continuous values and the number of records in each discrete value.

Daily indices (dataset 1):
VIX: 1 = [9.14, 19.6) with 4,815 records; 2 = [19.6, 34.5) with 2,768; 3 = [34.5, 82.7] with 356
Wil5000: 1 = [7.35, 69.8) with 5,785; 2 = [69.8, 150) with 1,713; 3 = [150, 234] with 441
Spread: 1 = [−0.52, 0.758) with 3,163; 2 = [0.758, 1.75) with 2,429; 3 = [1.75, 2.91] with 2,347
Oil: 1 = [9.1, 42.1) with 3,882; 2 = [42.1, 84.8) with 2,799; 3 = [84.8, 144] with 1,258

Monthly indices (dataset 2):
HPI: 1 = [137, 175) with 45 records; 2 = [175, 238) with 40; 3 = [238, 306] with 14
Gold: 1 = [1,060, 1,270) with 28; 2 = [1,270, 1,550) with 36; 3 = [1,550, 1,970) with 35
Wheat: 1 = [368, 567) with 52; 2 = [567, 750) with 31; 3 = [750, 1,040] with 16
Soybean: 1 = [840, 1,100) with 53; 2 = [1,100, 1,380) with 21; 3 = [1,380, 1,710] with 25
Table 10.3: Transition blocks of the financial indices in (a) dataset 1 and (b) dataset 2.
(a) Financial indices (daily data) (b) Housing and commodity market indices (monthly data)
Index Transition block Index Transition block
VIX [3/12/2020 to 3/27/2020] HPI [2/1/2020 to 4/1/2020]
Wil5000 [3/9/2020 to 3/23/2020] Gold [1/1/2020 to 4/1/2020]
Spread [3/16/2020 to 3/20/2020] Wheat [3/1/2020 to 4/1/2020]
Oil [3/16/2020 to 4/21/2020] Soybean [3/1/2020 to 4/1/2020]
USrec [3/2/2020 to 4/30/2020] USrec [3/1/2020 to 4/1/2020]
Dataset-1 [3/2/2020 to 4/30/2020] Dataset-2 [1/1/2020 to 4/1/2020]
Figure 10.3: Plots for indices in dataset 2 with monthly data (HPI, Gold, Wheat,
Soybean).
Figure 10.4: Plot for the binary index of USrec.
Step 10: Test the trained semi-RNN using TSm and report the percent of correct predictions of each index for the entire TSm. Save the outcome in a 5-element counter array m.
End of Step 8
Step 11: Calculate the decay rate of the predictions between the two counter
arrays.
Step 12: D = Dataset j.
End of Step 3.
End of Step 1.
End of Procedure
In this procedure, we try to predict the indices’ value for the record R of the test
set using the trained neural network that captures the history of the record R’s
predecessors. This means that the trained semi-RNN is good for predicting the
indices for a test record that timewise is next to the very last record of the
training set (i.e., a test set with one record). Therefore, if a test set has N records
(timewise, they are one after another – Time series data), then as soon as the
prediction for the first test record is finished, it must be added to the training
set, and the neural net must be re-trained for the new training set to become
ready for the prediction of the next record in the test set. However, we have
taken a different approach by answering the question that follows: How far can
one depart from the time belonging to the last record in the training set to select
a test record without damaging the indices’ prediction too much? The sure
answer is: not too far. For this reason, the number of records (λ) in our test sets is kept small (λ = 10% of the records in Fi-before and Fi-during). In justification of our
chosen approach, the reader needs to be reminded that we compare the result
of predictions for “before” and “during” eras. Therefore, the corresponding test
records in the two test sets used for different eras have the same distance
(timewise) from the last record of their training sets and it should not affect the
result of the comparison.
In the loop of Step 3, the purification of the dataset is done for one index at a
time and in the presence of the other indices. We employed individual transition
blocks for each financial index to consider the potential for abnormal shifts in
each series. This methodological choice is predicated on the statistical premise
that the abnormal behavior of one index should not distort the predictive
accuracy of another. By doing so, we mitigate the risk of spurious correlations
and enhance the robustness of our predictions.
If the number of records in the training set for a neural network increases, then the neural network, in general, is trained better. Having said that, “Fi-before-train” has more records than “Fi-during-train”, which may put the semi-RNN trained on “Fi-during-train” at a disadvantage. To remedy this disadvantage, we make the number of records in both training sets the same. This remedy is reflected in Step 7. Step 8 delivers one 5-element counter
array per era (“Before” and “During”) in reference to the Fi index. Each array
carries the overall percent of the correct classifications of the indices for the
entire records of the corresponding test sets.
Consider a purified dataset in reference to the financial index Fi that ultimately generates two 5-element counter arrays (see →Table 10.4). The decay rate for index Fa is calculated using Formula (2):

Decay_Fa = (C1a − C2a) / |Fi-before-test|   (2)
where C1a and C2a are the overall percentages of correct classifications of index Fa using all records in Fi-before-test and Fi-during-test, respectively. Since both Fi-before-test and Fi-during-test have the same cardinality, either one can be used in Formula (2).
Positive, zero, and negative decay rates for all indices (Fa to Fe) mean that COVID-19 has negative, no, and positive effects on the financial indices, respectively. The financial indices with the largest and smallest decay rates are, respectively, the most and the least vulnerable to COVID-19. The overall decay rate for Fi is the average of the decay rates for the indices Fa to Fe. If the overall decay rate is negative/positive but there are some non-negative decay rates, then we say COVID-19 has a partially negative/positive effect on the financial indices. If Fi has the maximum/minimum positive overall decay among all indices, then (a) the transition block of Fi has the minimum/maximum intersection with the transition blocks of the other indices, and (b) Fi is the most/least sensitive financial index to COVID-19.
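As a minimal sketch of how Formula (2) and the overall decay rate are computed from the two counter arrays; the numbers below are placeholders, not values from the chapter.

    import numpy as np

    # Hypothetical counter arrays (percent of correct classifications per index) for
    # the "before" and "during" test sets of one purified dataset; placeholder values.
    c_before = np.array([80.0, 75.0, 70.0, 65.0, 90.0])
    c_during = np.array([60.0, 70.0, 55.0, 60.0, 80.0])
    n_test = 43  # |Fi-before-test| = |Fi-during-test| (same cardinality)

    decay = (c_before - c_during) / n_test   # Formula (2), one decay rate per index Fa..Fe
    overall_decay = decay.mean()             # overall decay rate for Fi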
The above procedure generates the decay rates for all indices in dataset 1
and dataset 2 that are, respectively, used in Part 1 and Part 2 of our
investigations. We summarize our findings in →Tables 10.5 and →10.6.
We have also used the procedure, with some modifications, to purify each dataset by its comprehensive-transition block, as follows.
Procedure: Modified_Achievement
Step 1: Repeat for j = 1 to 2.
Step 2: Purify the dataset in reference to the comprehensive-transition block
of dataset j.
.
. (Re-numbered steps)
.
End of Step 1.
In this modified version of the procedure, Steps 2 and 3 are removed and Step 4
is changed to Step 2, as shown above. This means that, first, the records participating in the comprehensive-transition block for dataset j are removed; then the rest of the procedure generates two partitions (Compj-before and Compj-during) and delivers two 5-element counter arrays, one per partition, from which the decay rates of the predictions of the indices in the totally purified dataset j are calculated. Results are shown in →Table 10.7.
Table 10.7: Decay rates for indices in dataset 1 and dataset 2 after purification of the datasets using the comprehensive-transition blocks.

All indices in dataset 1 (λ = 43 days):
Comp-1-before counter array: VIX 74, Wil5000 77, Spread 70, Oil 58, USrec 56
Comp-1-during counter array: VIX 56, Wil5000 53, Spread 53, Oil 51, USrec 47
Decay rates: VIX 0.3, Wil5000 0.4, Spread 0.6, Oil 0.4, USrec 0.2; overall decay rate 0.3

All indices in dataset 2 (λ = 4 months):
Comp-2-before counter array: HPI 75, Gold 75, Wheat 75, Soybean 75, USrec 99
Comp-2-during counter array: HPI 50, Gold 50, Wheat 50, Soybean 50, USrec 75
Decay rates: HPI 6.25, Gold 6.25, Wheat 6.25, Soybean 6.25, USrec 6; overall decay rate 6.2
Dean Lee
Jamal Rorie
Andrew Sabater
Abstract
We explore the application of topological data analysis (TDA) to extract
information from vibration sensor data. In Section 11.1, we describe the data
embedding requirements for applying TDA and study the empirical performance
of TDA, given different parameters of data embedding as well as under
degraded data collection conditions. Furthermore, we provide empirical
evidence of the correlation between select vibration metrics and TDA. The results
stated in this section are adapted from reference [→1]. In Section 11.2, we
describe an application of TDA to detect anomalies from vibration signals. The
methods described have indications of being useful to determine remaining
useful life of components.
Keywords: machine learning, topological data analysis, vibration analysis,
11.1.1 Motivation
In [→14], a visibility graph-based topological data analysis method was
proposed and applied on the bearing vibration data from both the Society for
Machinery Failure Prevention Technology (MFPT) and the Center for Intelligent
Maintenance Systems (IMS) to extract topological information for machine
learning. It was shown that a machine learning algorithm can be built to detect
various bearing faults, and that distinct topological structures were generated
from fault-induced vibrations.
However, the visibility graph [→15] is just one method of embedding time series data, and the topological information extracted from point clouds is specific to
the embedding. Thus, an investigation of other embedding techniques and the
topological features that they generate is warranted to better understand the
efficacy of the proposed method. Moreover, other embeddings may produce
structures that provide deeper insight for fault diagnosis and potentially show
connections to various vibration metrics. This is crucial in establishing the
proposed method as a legitimate tool for vibration analysis.
11.1.2 Preliminaries
A simplicial complex is formed from a finite set of simplexes, such that the intersection of any pair of simplexes is either empty or a shared face. The boundaries of the simplicial complex K are defined by the chain groups: the k-th chain group C_k(K) is an abelian group made from all k-simplexes of the simplicial complex. The boundary operators ∂_k : C_k(K) → C_{k−1}(K) satisfy the following properties:

∂_0 = 0, and
∂_{k−1} ∂_k = 0.

The kernel of ∂_k and the image of ∂_{k+1} are the k-cycles Z_k and the k-boundaries B_k of the simplicial complex K, respectively. The k-th homology group is defined as H_k = Z_k / B_k. The rank of H_k is the number of holes in the simplicial complex K; that is, the rank of H_0 is the number of isolated vertices, the rank of H_1 is the number of loops, etc. The filtration of the simplicial complex is what allows the persistent homology to be computed.
11.1.2.2 Embeddings
Time series data are embedded in a space, from which point clouds could be
formed and, crucially, an appropriate distance metric must be selected for the
space so that the persistent homology can be calculated for the time series data.
We investigate different embeddings, as well as the distance metrics associated with them, below. Note that in each of the embeddings, a sliding window of values is generated from the time series.
1) Vector space: Each window in the time series is transformed into a
vector in vector space. It is customary to use the Euclidean distance as the
default metric to quantify the difference between vectors, and indeed, the
Euclidean distance is the default of many machine learning software
packages. However, it is not always the appropriate distance measure. It is
noted in [→22] that the Euclidean distance breaks down even in relatively
low dimensions, thus the Manhattan distance is used as the distance
metric for this vector space embedding.
2) Probability distributions: The values of each window are normalized and
binned to create a vector of frequencies, or empirical distributions of the
data. In transforming the time series data into distributions, information
can still be extracted despite deficiencies in the data, such as sparsity,
irregular sampling, etc. The Jensen-Shannon divergence metric [→23] is
used to quantify the similarity between two distributions.
3) Visibility graphs: The visibility graph [→15] provides a framework for
embedding time series data into graphs. Given a time series {(t_i, y_i)}, a vertex is created for every corresponding t_i, and an edge is created between the vertices for t_a and t_b if every intermediate sample (t_c, y_c), with t_a < t_c < t_b, satisfies

y_c < y_b + (y_a − y_b) · (t_b − t_c) / (t_b − t_a).
Intuitively, the time series data is a series of peaks and valleys: every
vertex in the graph corresponds to a sample from the time series, and an
edge is constructed between two vertices if their corresponding peaks in
the time series are visible to each other. A visualization of this is in
presented in Reference [→15].
The visibility graphs could be used to quantify local and global information
in the form of graph structures: a highly connected community of nodes in the graph indicates that the corresponding values in the time series are close in value (hence, visible to each other); the links that connect the
disparate communities are defined by the spikes in the time series signal,
which form “bridges” to the different communities. In this sense, the
visibility graphs capture interesting information from the time series
signal. The Euclidean distance is used to calculate the edge weights in the
visibility graph. The visibility graph implementation of ts2vg [→24] is used
in the analysis.
4) Takens embedding: Takens embedding [→25] embeds the time series data into higher dimensions. More formally, given a time series f(t), an embedding dimension d, a time delay τ, and a time t_i, the embedding is the d-dimensional vector (f(t_i), f(t_i + τ), …, f(t_i + (d − 1)τ)). A small delay-embedding sketch is given after this list.
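A minimal sketch of the delay embedding described in item 4, assuming the standard vector form above; the helper name and the toy signal are illustrative.

    import numpy as np

    def takens_embedding(x, d, tau):
        # Delay-embed a 1-D series: row t is (f(t), f(t + tau), ..., f(t + (d - 1) tau)).
        x = np.asarray(x, dtype=float)
        n = len(x) - (d - 1) * tau
        return np.stack([x[i:i + n] for i in range(0, d * tau, tau)], axis=1)

    # toy usage: embed a sampled sine wave into a 3-dimensional point cloud
    cloud = takens_embedding(np.sin(np.linspace(0, 20, 500)), d=3, tau=5)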
11.1.3 Methodology
The notional pipeline for generating the topological features is shown in
→Figure 11.1.
A fixed-width sliding window is created from the raw vibration data. These
windowed data are embedded into the spaces proposed in Section 11.1.2.2, from which point clouds are constructed using the appropriate distance
measures. In the topological feature extraction step, persistent homology is
computed on the point clouds, from which persistence diagrams and point
summaries are generated as features for machine learning models. We use the
persistent homology implementation of giotto-tda [→21] for all analyses in this
section, and we use the scikit-learn version 1.2.2 [→28] implementation of
Random Forest with default settings as the base machine learning model.
Finally, the data is split into train, test, and validation sets, and the validation set
is used to find optimal parameters for each of the proposed embeddings.
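A hedged end-to-end sketch of this pipeline is given below, using the vector-space embedding with the Manhattan metric, persistence entropy as the point summary, and a default Random Forest; the windowing and grouping parameters and the synthetic stand-in signals are assumptions for illustration, not the study's actual settings.

    import numpy as np
    from gtda.homology import VietorisRipsPersistence
    from gtda.diagrams import PersistenceEntropy
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def sliding_windows(x, width, step):
        # Fixed-width sliding windows over a raw vibration signal.
        return np.stack([x[i:i + width] for i in range(0, len(x) - width + 1, step)])

    def build_point_clouds(signal_per_class, width=50, step=50, group=40):
        # Each point cloud is a group of consecutive windows treated as vectors
        # (vector-space embedding); labels come from the fault category.
        clouds, labels = [], []
        for label, sig in signal_per_class.items():
            win = sliding_windows(np.asarray(sig, dtype=float), width, step)
            for i in range(0, len(win) - group + 1, group):
                clouds.append(win[i:i + group])
                labels.append(label)
        return np.stack(clouds), np.array(labels)

    # Synthetic stand-ins for the vibration signals; real use would load the recorded data.
    rng = np.random.default_rng(0)
    signal_per_class = {lab: rng.normal(size=12000) for lab in ["baseline", "inner", "ball", "outer"]}

    X_clouds, y = build_point_clouds(signal_per_class)
    vr = VietorisRipsPersistence(metric="manhattan", homology_dimensions=[0, 1])
    diagrams = vr.fit_transform(X_clouds)                    # persistent homology per point cloud
    features = PersistenceEntropy().fit_transform(diagrams)  # point summaries as ML features

    X_tr, X_te, y_tr, y_te = train_test_split(features, y, stratify=y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("test accuracy:", model.score(X_te, y_te))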
11.1.4 Data
We analyze the Case Western Reserve University (CWRU) bearing data set [→29]
to examine the effects of the proposed embeddings on the generated
topological features. The data set is made up of vibration measurements from
healthy bearings and from bearings with seeded faults. The seeded faults are introduced via electro-discharge machining at various depths.
Three types of faults are introduced: inner raceway, ball-bearing, and outer
raceway faults. The vibration measurements are collected at the drive-end and
the fan-end of the test apparatus, and the data are collected at various motor
loads. While the data set contains both drive-end and fan-end bearing fault information, we only consider the drive-end faults in the following analyses, as that portion of the data set is more complete. Furthermore, only the data collected at 12,000
samples/second are considered to ensure uniformity of analysis, as the baseline
data is only collected at 12,000 samples/second.
The main objective of the analysis is to determine the effectiveness of
topological data analysis through fault classification, and three sets of
experiments are conducted using the CWRU data set. In the first set of
experiments, models are built from topological features extracted from the
previously described embeddings. These models are trained to classify different
faults at various motor loads and fault depths. In the second set of experiments,
artificial down-sampling is applied to the data set to simulate operational
environments, where data may be stored in pared down form to reduce storage
footprint. A down-sampling factor of n, as used here, means that every n-th sample is kept from the original data. In these experiments, we investigate the
performance of the classification models at various levels of down-sampling
factors to understand how topological data analysis may perform in real-world
settings.
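For concreteness, down-sampling by a factor of n is just a stride over the signal; the array below is a synthetic stand-in for a recorded signal.

    import numpy as np

    signal = np.random.default_rng(0).normal(size=12000)  # stand-in for a raw vibration signal
    n = 4                                                  # down-sampling factor
    downsampled = signal[::n]                              # keep every n-th sample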
The connections to some classical vibration metrics are explored in the third
set of experiments. Common vibration metrics such as peak acceleration, root-
mean-square, crest factor, and kurtosis are correlated with the topological
features to show possible connections.
11.1.5 Results
The model performance, based on the topological features extracted from the
different embeddings, is shown in →Figures 11.2 and →11.3. In general, the
models perform better at 7 and 21 millimeters of fault depths. It is seen that the
distribution embedding performs the best out of the five embeddings
considered: the model performance is relatively stable and accurate across
different loads and fault depths. This suggests that by binning the data samples
appropriately, the differences in the vibrations among the various fault
categories can be quantified as distinct topological features for machine
learning.
Figure 11.2: Cross-validated training set performance of models built with
features from the various embeddings.
Figure 11.3: Test performance of models built with features from various
embeddings.
The vector space model performs relatively well at fault depth of seven
millimeters; at other fault depths, the performance starts to degrade. This may
be because the vectors are not capturing enough of the intrinsic differences
among the various fault depths.
The visibility graph model also demonstrates inconsistent performance,
which may be attributed to its sensitivity to small changes in vibration
measurements; that is, entirely new communities in the graph may appear or
disappear due to slight changes in the vibration measurements. Furthermore, it
stands to reason that if the vibration sensors are not calibrated correctly, the
visibility graph method may yield poor performance.
The SAX embedding model appears to be relatively unstable, which may be
due to the standardization of the data for forming the words, and that may mask
the subtle differences in the changes in the vibrations across the various fault
depths.
The Takens embedding model has the worst performance. This suggests
that this embedding technique does not generate features that generalize well
across different fault categories.
In general, we see that the vector space embedding model performance stays
relatively stable over various down-sampling factors, as well as across different
loads. The distribution embedding model, however, becomes more and more
unstable as the data become sparser, as seen from →Figures 11.4 and →11.6.
One reason may be that as the data become sparser, the empirical distributions
no longer represent the underlying distribution.
The SAX embedding model also suffers from the same issue that causes the
instability in the distribution embedding model, albeit the problem is worse. As
the data become sparser, the standardization of the sparse data obscures the
changes in vibrations, and SAX representations are poor features for
distinguishing among various faults.
The visibility graph model also has spots of instability as the down-sampling
factors vary. There is a dip in performance at down-sampling factor of three,
after which the performance improves until down-sampling factor of seven.
Surprisingly, the visibility graph model performance, even at extreme sparsity, is
on par with the visibility graph model on the full data set. This suggests that the
graph may be encoding intrinsic properties of the vibration signal.
The Takens embedding model again fares the worst out of all five models in
terms of its performance, although it was already shown in the previous section
that this particular embedding technique may not be suitable for generating
topological features that distinguish among various faults. Conversely, the
model performance appears to be more stable than that of the SAX embedding
model over various down-sampling factors.
Classical vibration metrics that are frequently employed for analyses in the time domain are the following:
1. Peak acceleration: max_i |x_i|,
2. Root-mean-square (RMS): √((1/n) Σ_i x_i²),
3. Crest factor: max_i |x_i| / √((1/n) Σ_i x_i²), and
4. Kurtosis: E[((x − μ)/σ)⁴].
In general, these metrics are various ways of quantifying spikes in the signal.
Coupling expert intuition with accepted engineering standards (e.g., ISO 20816-
1:2016), these values could be used to detect imminent failure.
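A minimal numpy sketch of these four metrics for a windowed signal x follows; the function name is illustrative.

    import numpy as np

    def vibration_metrics(x):
        x = np.asarray(x, dtype=float)
        peak = np.max(np.abs(x))                         # 1. peak acceleration
        rms = np.sqrt(np.mean(x ** 2))                   # 2. root-mean-square
        crest = peak / rms                               # 3. crest factor
        kurt = np.mean(((x - x.mean()) / x.std()) ** 4)  # 4. kurtosis, E[((x - mu)/sigma)^4]
        return peak, rms, crest, kurt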
We examine the correlations between these classical vibration metrics and
the topological features extracted from the embeddings. We note that only the
drive-end data at fault depth of seven millimeters and motor load of 0 HP are
shown in following correlations analyses, as the results for the other fault depths
and motor loads are similar.
In →Figure 11.8, it is seen that the homology group H_0 of the persistence
Figure 11.8: The correlation coefficient values between vibration metrics for the
drive-end data with sliding window size of 50, and the H0 homology group
generated from the persistent silhouette of the various embeddings.
If no correlation coefficient exceeds this threshold, then the row of cells is left blank.
In →Table 11.2, we show the results for window size 2000, which is the point at
which all the embeddings start to generate topological features that correlate to
some vibration metrics. It can be seen that the RMS metric is strongly correlated
to all five of the embeddings. Furthermore, the distribution and SAX embeddings
generate topological features that correlate to all four of the vibration metrics.
→Figure 11.9 provides an additional view of the correlation coefficients for the
distribution embedding and the persistence silhouette. Note the strong
correlation and anticorrelation between the distribution embedding and all four
of the vibration metrics.
Figure 11.9: The correlation coefficient values between vibration metrics for the
drive-end data with sliding window size of 2000, and the H0 homology group
generated from the persistent silhouette of the various embeddings.
We note also that the topological features from the homology groups do not seem to have a direct connection to the vibration metrics: while homology group H_1 captures higher-order topological structures than homology group H_0, the correlations are not necessarily restricted to the higher-order vibration metrics, such as crest factor.
11.1.6 Discussions
In general, the topological features provide good class separability for fault
classification. The vector space and the distribution embedding perform well
across all loads and fault conditions. On the other hand, we see that the SAX
embedding has uneven performance as a classifier. We also examined the
visibility graph as an embedding, and found that its performance is not
comparable to that of the simple and distribution embeddings. These
differences in model performances are exacerbated by the down-sampling of
the data.
We also note that SAX embedding does not seem to be an ideal topological
feature generator for fault classification. SAX is originally designed to detect
anomalies within a single time series; the poor classifier performance seems to
indicate that the vibrations encoded for a particular measurement do not
generalize to other measurements.
We also investigated the potential correlations between the topological
features and classical vibration metrics. While these correlations are sensitive to the choice of window size, the strong correlation in some cases with known vibration metrics suggests that topological data analysis may be used to automate the vibration metric selection, which is usually done based on human
intuition. Furthermore, given the good performance of the vector space and
distribution embeddings, the lack of correlations between their topological
features and the vibration metrics merely suggests that the topological features
may be better for fault diagnosis than some of the vibration metrics.
11.1.7 Postscript
In this section, we examined different embeddings for vibration signals, from
which topological features can be generated. We then used these topological
features to build models to classify faults with the CWRU data set across various
conditions as well as artificially down-sampling the data to examine the
performance of these features for fault analysis. It was found that while some
embeddings generate features that perform well, even in degraded conditions,
other embeddings suffer from poor model performance.
The potential correlations of the topological features to some classical time-
domain vibration metrics are also examined. It was found that some
embeddings have features that have consistently strong correlations to the
vibration metrics, suggesting that these embeddings can automatically generate
these vibration metrics.
We believe that TDA is a valuable addition to the world of vibration analysis.
While we have shown in this paper that some topological features provide good
class separability for fault classifiers, there may be yet undiscovered topological
features that are tuned specifically for vibration analyses that yield even better
performance. Moreover, the topological features investigated in this paper are
based on Vietoris-Rips method of simplicial complex construction; the
investigation of other simplicial complexes and additional topological features
for vibration analysis remains a topic for future work.
11.2.1.2 Methodology
More recently, Gottwald et al. [→33, →34] introduced the 0–1 test, which
produces a binary output, to indicate the presence of chaos. By removing the
need for phase space reconstruction, Gottwald’s method is more appropriate for
high-dimensional systems. The 0–1 test takes a time series signal ϕ, indexed by j for j = 1, 2, …, N, and produces a p–q projection [→34] of the data:

q_c(n) = Σ_{j=1}^{n} ϕ(j) sin(jc)   (11.2)
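A small sketch of the p–q projection for one value of c follows; Eq. (11.2) gives q_c, while the cosine sum used for p_c below, as well as the sampling interval for c, are the standard choices in the 0–1 test literature rather than details taken from this chapter.

    import numpy as np

    def pq_projection(phi, c):
        # Partial sums p_c(n), q_c(n) for n = 1..N; q_c matches Eq. (11.2).
        phi = np.asarray(phi, dtype=float)
        j = np.arange(1, len(phi) + 1)
        p = np.cumsum(phi * np.cos(j * c))
        q = np.cumsum(phi * np.sin(j * c))
        return p, q

    # e.g., 200 values of c, commonly drawn from an interval inside (0, pi)
    cs = np.random.default_rng(0).uniform(0.2, np.pi - 0.2, size=200)
    projections = [pq_projection(np.random.default_rng(1).normal(size=1000), c) for c in cs]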
11.2.3 Analysis
Since the IMS data set does not include when the bearing faults develop, the
only information we can glean from the data is when the experiments stopped
due to the faults, as well as rely on results from literature, e.g., Gousseau et al.
[→37], as baselines to detect certain faults. However, we still demonstrate how
TDA can be used to detect possible anomalies from the vibration signal; we show
one such approach in this section.
There are 200 persistence diagrams generated per time step, since 200 values of c are selected to create the p–q projections. We compute the amplitudes of the Betti curves from each persistence diagram using giotto-tda [→21], compute the mean amplitude for each time step, and set a threshold of three standard deviations to determine anomalies. The analysis results of trial one are
in →Figure 11.10. Bearings three (channels 5 and 6) and four (channels 7 and 8)
develop faults in the experiment, and we note that there are regions in the plots
for channels 5–8 where the amplitudes clearly exceed the threshold for an
extended period of time. Note that L_∞ denotes the Chebyshev distance.
Figure 11.10: Trial 1 Betti curve amplitudes computed with the L_∞ distance.
Note that the faults developed in bearings three (channels 5 and 6) and four
(channels 7 and 8), and that there are clear regions in channels 5–8 where the
amplitude values clearly exceed the threshold for an extended period of time.
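A minimal sketch of the thresholding step, assuming the per-diagram Betti-curve amplitudes have already been computed (e.g., with giotto-tda); the exact definition of the three-standard-deviation threshold below is an assumption for illustration.

    import numpy as np

    def flag_exceedances(amplitudes, n_sigma=3.0):
        # amplitudes: array of shape (n_time_steps, n_diagrams), e.g. 200 diagrams per step.
        mean_amp = amplitudes.mean(axis=1)                      # mean amplitude per time step
        threshold = mean_amp.mean() + n_sigma * mean_amp.std()  # three-sigma anomaly threshold
        return mean_amp, threshold, mean_amp > threshold

    # toy usage with synthetic amplitudes for 1,000 time steps
    amps = np.abs(np.random.default_rng(0).normal(size=(1000, 200)))
    mean_amp, thr, exceed = flag_exceedances(amps)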
To compare the results against a known method for detecting faults, we overlay
the amplitude exceedances for channels 5, 6, 7, and 8 on the standard deviations
of the raw values of those channels. It is understood that sudden changes in
standard deviations may indicate the presence of a fault and that there are
accepted thresholds, based on the class of machinery associated with standard
deviation for shutting down machinery, e.g., ISO 20816-1:2016. In →Figure 11.11,
we note that in most cases, the exceedances happen near where the rate of
change in the standard deviation accelerates; furthermore, in channels 6
(bearing three) and 8 (bearing four), the exceedances occur days before the
standard deviation values peak around day 31, suggesting that the exceedances
may be used as early indicators for faults. As noted in Gousseau, the inner race
fault is detectable using an established method like bearing envelope analysis
on channels 5 and 6 around 32 days. The grouping of Betti curve exceedances on
channel 6 is indicative that this method is able to provide early warning of this
fault. Also from Gousseau, the rolling element fault is detectable using bearing
envelope analysis on channels 7 and 8 at around 25 days. While the Betti curve
exceedances occur a few days after being detectable with bearing envelope
analysis, it is still before significant changes in standard deviation occur and before preprocessing steps like spectral pre-whitening are needed to account for baseline vibrations.
Figure 11.11: Trial 1 standard deviations of the raw vibration values for
channels 5, 6, 7, and 8, as well as the exceedances of the Betti curve amplitudes
for those channels.
In trial 2, bearing one (channel 1) develops a fault. The analysis results are
shown in →Figure 11.12. We note the spikes in values that exceed the three
standard deviation threshold in channel 1, while the other channels do not have
such exceedances, except for channel 2. Upon further examination of the data, it
was determined that the drop in amplitudes in channel 2 toward the end of the
experiment is an artifact of the subsampling method, prior to p–q projection
calculations, where broad frequency noise in the envelope signal resulted in a
large subsampling increment.
Figure 11.12: Trial 2 Betti curve amplitudes computed with the L_∞ distance.
Bearing one (channel 1) develops a fault. Note the exceedance of the threshold
toward the end of the experiment. The exceedance detected in channel 2 is an
artifact of the data preprocessing caused by insufficient data.
The exceedances are plotted against the standard deviation of the raw values for
channel 1 in →Figure 11.13. We again note that the exceedances align with the
peaking of the standard deviations, and that the exceedances also detected
anomalies well before the standard deviations start to peak. Reference [→37]
notes that the fault is physically detectable, starting around 4.5 days, so the
amplitude exceedances around day 5 are in line with the expectations.
Figure 11.13: Trial 2 standard deviations of the raw vibration values for channel
1, as well as the exceedances of the Betti curve amplitudes for that channel.
11.2.4 Discussion
As noted, the IMS data set is provided without additional information, such as
exactly when the faults develop. Therefore, we can only demonstrate the
effectiveness of the proposed topological method, knowing that the faults
presumably developed at some point close to the end of the trials. We used the
standard deviations of the raw vibration data as a baseline method to gauge the
performance of the proposed topological method. We note that the proposed
method detected anomalies where the standard deviation values were rapidly
changing, a sign that bearing faults are developing. In some cases, the
exceedances cluster before and around where the standard deviations peak.
Therefore, the proposed method may be useful as early indicators for
developing fault conditions.
On the other hand, we note that there is the possibility of false positive
detections, as with any anomaly detection techniques. In →Figure 11.10, there
were early exceedances in channels 5 and 7. While it is usually straightforward to
tune the method to minimize false positives, we cannot do so in this case as we
do not have access to the states of the bearings at the start of the trials.
Nevertheless, we have demonstrated that the proposed method is easier to automate and interpret than more established methods for detecting incipient bearing faults, such as envelope analysis.
Srividhya Sethuraman
Uma G.
Sunny Kumar
Siddhesh Thombre
Vikash Patel
Sharadha Ramanan
Abstract
This work introduces Dyna-SPECTS, a novel dynamic ensemble-based Thompson sampling model for calculating price elasticity (PE) in the e-commerce sector, focusing on kids’ clothing. The model uniquely amalgamates four distinct Thompson sampling-based algorithms – distribution approximation, XGBoost regressor, GAN, and MCMC sampling – to determine PE distribution parameters.
distribution parameters. Dyna-SPECTS’s outstanding feature is its dynamic
ensemble mechanism, which adeptly selects optimal PE values from a suite of
strategies like minimum, maximum, average, weighted MAPE, and linear
regressor, guided by cumulative rewards. This approach adapts to variable
market conditions in the constantly evolving e-commerce landscape, including
competitor pricing, seasonal changes, and customer segmentation. Empirical
tests across diverse datasets have confirmed Dyna-SPECTS’s exceptional
performance, showcasing substantial improvements in sales and margins,
especially for SKUs experiencing PE fluctuations. The model also demonstrates
efficacy in using transfer learning for PE computation, yielding a 7% margin
improvement and a remarkable 35.3% reduction in RMSE for demand forecasts.
Innovations in Cross-PE coefficient integration and price personalization have
led to further margin enhancements. Moreover, the model excels in
omnichannel pricing, achieving a 17.5% margin increase. Long-tailed product
pricing using these PE values showed an increase in sales volume and revenue
compared to traditional pricing strategies.
Dyna-SPECTS prioritizes scalability through parallelization, distributed computing, and optimized algorithms, efficiently managing a high volume of SKUs without compromising computational effectiveness and achieving rapid convergence. It
represents a significant leap in dynamic price elasticity computation, offering a
scalable, versatile, and empirically proven solution for e-commerce pricing
challenges.
Keywords: Price Elasticity, Thompson sampling, price optimization, Big Data,
fashion, Reinforcement Learning, online learning, e-commerce, ensemble model,
long-tail products, scaling, Bayesian model, dynamic ensemble, static ensemble,
personalization, dynamic pricing,
12.1 Introduction
Figure 12.1: Plot of sales and price of the Key SKU and the prices of substitute
SKUs from Dataset 1.
From →Figures 12.1 and →12.2, we infer that the sales of the Key SKUs are
affected by the prices of substitute and complement SKUs. When price changes
are introduced, they happen simultaneously for both the substitute and
complement SKUs. The sales changes corresponding to the price changes are
similar for Key SKUs and their identified substitutes and complements.
Long-tailed products constitute a considerable portion of the inventory in e-
commerce platforms and they are characterized by low demand and sales
volume [→12]. These products often have fluctuating demand, and pricing them
optimally is critical to maximizing revenue and minimizing inventory holding
costs. Traditional pricing strategies, which rely heavily on historical sales data,
are not effective for long-tailed products due to the sparse data. →Figure 12.3
shows the selling characteristics of some representative long-tailed products
(LTP) from Dataset 6.
Figure 12.2: Plot of sales and price of the Key SKU and the prices of substitute
and complement SKUs from Dataset 2.
Figure 12.3: Sales, price curves across time for LTPs selected from Dataset 6.
12.4 Methodology
The SPECTS (Sequential PE Computation using Thompson Sampling) model
[→10] comprised four methods that use Thompson Sampling to determine the
parameters of the PE distribution. Three methods, namely Distribution
approximation, XGBoost regressor, and GAN, differ in terms of the demand
forecasting component used. The fourth method uses MCMC sampling methods
on the posteriors derived via Thompson Sampling. The motivation behind using
MCMC sampling is to handle the cases wherein the posteriors cannot be
obtained in closed form. Otherwise, they will remain as flat distributions even
after considerable training. Each of these four methods computes and returns a PE value. In this work, we take the average PE value of these four methods as the output of the SPECTS model.
→Figure 12.4 maps the data flow from source files through various modules,
culminating in PE values. The training phase involves pre-processing and feature
extraction from comprehensive data sets (Big Data), including transaction,
attribute, inventory, and price data. Different bandit models utilize these
features to generate PE values, with the ensemble bandit model providing a
singular PE for the price optimization engine, thereby determining optimal
prices for maximal revenue.
In the testing phase, this model incorporates real-time, high-frequency data,
activating the trigger module under specific conditions like significant
competitor price changes, stock shortages, cost variations, inventory level drops,
forecast accuracy decline, or special events. This triggers the re-training of
bandit models and demand forecasting modules using the pre-processed data.
Upon extensive training, the model processes real-time, high-frequency data in
batches. Once the data enters the bandit models, it is used to estimate the priors and likelihoods. These estimated priors and likelihoods are then used to calculate the posteriors, where the posterior distribution corresponds to the PE distribution.
S_new,i[t] = S_pred,i[t] · ∏_j ( P_opt,j[t] / P_init,j )^(−γ_ij) · exp( lr_i · (P_opt,i[t] − P_comp,i[t]) / P_comp,i[t] )   (12.1)

L(ObsR[t]; R[t], β̂) = N(ObsR[t]; R[t], σ²)   (12.5)

L(ObsS_new,i[t]; S_new,i[t], β̂) = N(ObsS_new,i[t]; S_new,i[t], σ²)   (12.6)

Posterior_t(β̂; R[t]) ∝ Prior[t − 1](β̂) · N(ObsR[t]; R[t], σ²)   (12.7)

Posterior_t(β̂; S_new,i[t]) ∝ Prior[t − 1](β̂) · N(ObsS_new,i[t]; S_new,i[t], σ²)   (12.8)

Maximize Yield = α₁ · Sales + α₂ · Revenue + α₃ · Margin   (12.9)
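A minimal, grid-based sketch of the Bayesian update in Eqs. (12.7)–(12.8) and of the Thompson sampling draw follows; the revenue model, the candidate grid, and all numbers are placeholders, not the chapter's actual model.

    import numpy as np

    def update_posterior(betas, prior, obs_revenue, predict_revenue, sigma):
        # Eqs. (12.7)-(12.8): the posterior over beta is the prior reweighted by a
        # Gaussian likelihood of the observed revenue given the prediction under each beta.
        pred = np.array([predict_revenue(b) for b in betas])
        likelihood = np.exp(-0.5 * ((obs_revenue - pred) / sigma) ** 2)
        posterior = prior * likelihood
        return posterior / posterior.sum()

    def thompson_sample(betas, posterior, rng):
        # Thompson sampling step: draw one candidate PE value from the posterior.
        return rng.choice(betas, p=posterior)

    # toy usage with a placeholder constant-elasticity revenue model
    betas = np.linspace(-4.0, 0.0, 81)              # candidate PE coefficients
    prior = np.ones_like(betas) / len(betas)        # flat prior
    predict_revenue = lambda b: 12.0 * 100.0 * (12.0 / 10.0) ** b
    posterior = update_posterior(betas, prior, obs_revenue=950.0,
                                 predict_revenue=predict_revenue, sigma=50.0)
    sampled_pe = thompson_sample(betas, posterior, np.random.default_rng(0))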
The minimum approach for ensemble always picks the smallest coefficient in
magnitude. Hence, when combined with the sign, it results in significant
improvement in sales units. The maximum approach, on the other hand, picks
the maximum among the coefficients, in magnitude. Hence, it does not result in
a high improvement in sales units.
The drastic reduction and increase in sales units through adoption of the minimum and maximum approaches for the ensemble model might also lead to these being treated as outliers, as they are too far from the mean sales. The sales improvements based on the weighted MAPE and the linear regressor are in the same range.
The sales improvement based on the average PE is close to actuals and
therefore validates our hypothesis, wherein we claimed that the average will be
a smoothed version of the PE coefficients, out of all approaches. A similar graph
was obtained for the revenue maximization objective. In the revenue
maximization objective also, the revenue obtained by averaging the PE is close
to the actual revenue.
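A rough sketch of the dynamic-ensemble idea (choosing among candidate strategies guided by cumulative reward) is given below; only the min, max, and average strategies are shown, the best-so-far strategy is selected greedily for simplicity, and the weighted-MAPE and linear-regressor strategies and the reward definition are placeholders.

    import numpy as np

    # Candidate ensemble strategies over the four component PE values.
    strategies = {
        "min": lambda pe: pe[np.argmin(np.abs(pe))],   # smallest coefficient in magnitude, sign kept
        "max": lambda pe: pe[np.argmax(np.abs(pe))],   # largest coefficient in magnitude
        "average": lambda pe: float(np.mean(pe)),
        # "weighted_mape" and "linear_regressor" strategies omitted in this sketch
    }
    cumulative_reward = {name: 0.0 for name in strategies}

    def choose_pe(component_pes):
        # Pick the strategy with the highest cumulative reward so far.
        best = max(cumulative_reward, key=cumulative_reward.get)
        return best, strategies[best](np.asarray(component_pes, dtype=float))

    def record_reward(strategy_name, reward):
        # reward could be the realized yield of Eq. (12.9) after the pricing decision.
        cumulative_reward[strategy_name] += reward

    name, pe = choose_pe([-1.8, -2.4, -0.9, -1.5])   # placeholder component PE values
    record_reward(name, reward=1.0)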
12.5.1.1 When do static ensemble methods fail?
1. Nonstationary data:
Scenario: When the data distribution is non-stationary, meaning that the
statistical properties of the data change over time.
Dynamic advantage: Dynamic ensemble methods can adapt to these
changes by updating the ensemble with new information, adding, or
removing models, as needed. This is especially relevant in applications
where the characteristics of the sales data are not constant.
2. Concept drift:
Scenario: In tasks where the underlying concept being modeled changes
over time, the phenomenon is known as concept drift. PE coefficients
change across the life cycle of items.
Dynamic advantage: Dynamic ensembles can detect and respond to
concept drift by updating the ensemble components. This is crucial for
maintaining model accuracy when the relationships between input
features and the target variable evolve.
3. Resource efficiency:
Scenario: In resource-constrained environments, where it may be
expensive or impractical to retrain the entire ensemble, from scratch, with
each new batch of data.
Dynamic advantage: Dynamic ensemble methods allow for selective
updates, making them more resource efficient. Only the necessary
adjustments, such as adding or removing models, need to be made based
on the evolving data.
4. Adaptive learning:
Scenario: When the task requires a model to adapt quickly to new
information without discarding the knowledge accumulated from past
data.
Dynamic advantage: Dynamic ensembles support adaptive learning, where
models can be updated sequentially and selectively. This is beneficial in
online learning scenarios where the model needs to continuously learn
from incoming data.
5. Handling noisy or outlier data:
Scenario: In situations where the data contains noise or outliers that may
impact the performance of individual models.
Dynamic advantage: Dynamic ensemble methods can be more robust to
noisy data by adapting the ensemble to down-weight the influence of
poorly performing models or outliers. This adaptability helps in maintaining the model’s overall reliability.
6. Task with evolving requirements/objectives:
Scenario: In applications where the requirements of the task change over
time, and the model needs to be flexible and responsive to these changes.
Dynamic advantage: Dynamic ensembles can evolve to meet new task
requirements by incorporating models that are better suited to the
updated task, ensuring better performance in dynamic environments.
In a real-world environment, the PE of a product changes across its life cycle (Refer →Figure 12.9) for several reasons.
Figure 12.9: Drifts in price elasticity values of a SKU across its life cycle.
When the SKU undergoes changes across its life cycle, the dynamic ensemble
can adapt to these changes quickly, and hence results in higher revenue, as
opposed to static ensemble of SPECTS.
After considerable training, the component models that are part of Dyna-SPECTS approach the true PE (as in the Oracle; Refer →Figure 12.12). Hence, the rewards from the dynamic ensemble model approach the true reward.
Figure 12.12: Growth of regret across iterations.
Table 12.3: Sales and margin improvements for three different datasets.
Dataset Sales improvement Margin improvement
Dataset 1 8% 2.7%
Dataset 2 19% 8.1%
Dataset 3 10% 4.7%
12.5.4 Improving demand forecasting accuracy from PE obtained through
SPECTS model
Our objective here was to improve the demand forecast accuracy with the help
of the PE distribution as the PE distribution gets more and more refined. We
trained our PE computation models as part of SPECTS for a considerable number
of iterations until the variance associated with the PE distributions meets a
threshold. The variance threshold for this experimentation was fixed to be 0.005.
Once the PE distribution variance satisfies a threshold, we sampled a PE
value from this distribution and used it to predict demand corresponding to
newer price points. We observed that the demand predicted via the updated PE distributions is closer to actuals than the demand forecasts coming from the pre-trained ML and DL demand forecast models.
→Figure 12.13 shows the demand forecast computed for a set of styles from
Dataset 1 using PE distribution and the demand forecast from pre-trained
models such as XGBoost and the actuals.
Figure 12.13: Comparison of forecast sales units.
After accounting for all types of styles from Dataset 1, we found that the demand forecast through the PE distribution is 14% more accurate (Refer →Table 12.4) than the demand forecast obtained via ML models.
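A minimal sketch of this use of the refined PE distribution, under the simplifying assumption of a constant-elasticity demand relation; the variance threshold is the 0.005 value mentioned above, while the functional form and the numbers are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def forecast_with_pe(base_demand, base_price, new_price, pe_mean, pe_var):
        if pe_var > 0.005:
            raise ValueError("PE distribution not yet refined enough")
        pe = rng.normal(pe_mean, np.sqrt(pe_var))              # sample a PE value from the distribution
        return base_demand * (new_price / base_price) ** pe    # constant-elasticity assumption

    demand_at_new_price = forecast_with_pe(base_demand=120.0, base_price=10.0,
                                           new_price=11.5, pe_mean=-1.8, pe_var=0.004)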
From →Table 12.6, we infer that the RMSE for the set of target SKUs has improved through the transfer of model coefficients from the XGBoost models trained on the source data. There is a reduction of 35.37% in RMSE after incorporating transfer learning for demand forecasting.
Table 12.6: Validation of accuracy of demand forecasting after employing
Transfer learning.
SKU ID   RMSE before transfer learning   RMSE after transfer learning   Improvement in accuracy
SKU 1 329.49 212.47 55.04%
SKU 2 229.73 200.23 12.8%
SKU 3 225.14 120.33 46.55%
SKU 4 297.93 200.27 32.77%
SKU 5 265.95 148.93 44%
SKU 6 268.87 212.20 21.07%
The results showed an increase in the sales volume of these products by 4.2%,
and a 2.19% increase in revenue compared to traditional pricing strategies for
dataset 6. Similarly, we observe a sales volume increase of 6.3% and a revenue
increase of 3.1% for dataset 7. The implemented dynamic pricing strategy
demonstrated a significant improvement in the profitability of long-tailed
products.
12.6 Conclusion
We present Dyna-SPECTS, an innovative dynamic ensemble-based Thompson
Sampling model, specifically designed for the large-scale computation of PE in e-
commerce, particularly in the kids' clothing sector. This model expertly combines
four distinct Thompson Sampling algorithms – Distribution approximation,
XGBoost regressor, GAN, and MCMC sampling – to determine PE parameters. A
key feature of Dyna-SPECTS is its dynamic ensemble capability, which selects the
most suitable PE values from a range of ensemble strategies like min, max,
average, weighted MAPE, and linear regressor, based on cumulative rewards.
This dynamic approach adapts to non-stationary data and concept drift,
efficiently handling variables like competitor pricing, seasonal changes, and
customer segmentation. Notably, Dyna-SPECTS fulfills the nine crucial real-world
criteria outlined in Section 12.1.
Empirical validation across various datasets demonstrates Dyna-SPECTS’s
superior performance, showing significant sales and margin improvements,
particularly for SKUs with fluctuating PE values. For example, using different
datasets, we observed sales and margin improvements, ranging from 8% to 19%,
and from 2.7% to 8.1%, respectively. Further, the model’s application in transfer
learning showcased a 7% margin improvement and a notable 35.3% reduction in
RMSE in demand forecasts. Additionally, cross PE coefficient incorporation and
price personalization strategies led to margin enhancements up to 5.5%. In
omnichannel pricing, a 17.5% margin improvement was observed, while demand
forecasts through PE distribution outperformed those obtained via standard
machine learning models by 14%. For LTP SKUs, the results showed an increase
in the sales volume of 4.2% and 6.3 % for Datasets 6 and 7. Similarly, we
observed a 2.19% and 3.1% increase in revenue compared to traditional pricing
strategies for Dataset 6 and Dataset 7, respectively.
Scalability was a critical focus in our development of Dyna-SPECTS. We
implemented techniques like dimensionality reduction and approximate
inference to manage a high volume of SKUs efficiently. Our model achieved
convergence in half the time of baseline Thompson sampling models,
demonstrating both computational efficiency and scalability. Conclusively, Dyna-
SPECTS not only provides timely PE coefficients but also anticipates future
market trends, enabling retailers to optimize pricing for both short-term and
long-term objectives. It marks a significant advancement in dynamic PE
computation for e-commerce, offering a scalable, adaptable, and empirically
validated solution. In future, we aim to enhance Dyna-SPECTS by integrating
additional ensemble strategies and applying graph-based deep learning and
diffusion methods to further refine its performance.
References
[1] Z. Zhu, and B. Van Roy, Scalable neural contextual Bandit for recommender
systems. arXiv Preprint arXiv:2306, pp. 14834, 2023,
→https://arxiv.org/abs/2306.14834. →
[2] AWS IS →https://aws.amazon.com/blogs/machine-learning/power-
contextual-bandits-using-continual-learning-with-amazon-sagemaker-rl/ →
[3] Y. Zhu, et al., Contextual bandits with large action spaces: Made practical.
International Conference on Machine Learning, PMLR, 2022. →
[4] F. Guillou, R. Gaudel, and P. Preux, Large-scale bandit recommender system.
Machine Learning, Optimization, and Big Data: Second International Workshop,
MOD 2016, Volterra, Italy, August 26–29, 2016, Revised Selected Papers 2.
Springer International Publishing, 2016. →
[5] X. Chen, et al., Online statistical inference for contextual bandits via
stochastic gradient descent. arXiv Preprint arXiv:2212.14883, 2022. →
[6] K. Mahadik, et al., Fast distributed bandits for online recommendation
systems. Proceedings of the 34th ACM International Conference on
Supercomputing, 2020. →
[7] L. Song, C. Tekin, and M. Van Der Schaar, Online learning in largescale
contextual recommender systems. IEEE Transactions on Services Computing, vol.
9, no. 3, pp. 433–445, 2014. →
[8] R. Zhang, et al., Scalable Thompson sampling via optimal transport. arXiv
Preprint arXiv:1902.07239, 2019. →
[9] P. Dutta, et al., AutoML for Contextual Bandits. arXiv Preprint
arXiv:1909.03212, 2019. →
[10] S. Sethuraman, and S. Ramanan, SPECTS: Price elasticity computation model
using Thompson sampling. In: 2022 International Conference on Computational
Science and Computational Intelligence (CSCI), IEEE, pp. 640–646. a, b, c
[11] M. Mussi, et al., Dynamic pricing with volume discounts in online settings.
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 13,
2023. →
[12] M. Mussi, et al., Pricing the long tail by explainable product aggregation and
monotonic bandits. Proceedings of the 28th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining, 2022. a, b
[13] Y. Zhang, et al., A model of two tales: Dual transfer learning framework for
improved long-tail item recommendation. Proceedings of the Web Conference,
2021. →
Notes
1 →https://www.kaggle.com/competitions/h-and-m-personalized-
fashion-recommendations/data.
2 →https://www.kaggle.com/datasets/olistbr/brazilian-e-commerce.
13 Creating a metadata schema for reservoirs of
data: a systems engineering approach
Barry C. White
Rachel E. Jordan
Reena R. Patel
LaKenya K. Walker
Matthew D. Bray
Note: This work was supported through the US Army military engineering direct
Model-based Assessment of Sensors and Countermeasures program. This
chapter has been approved for public release.
Short Research Paper
Abstract
Physical modeling procedures, with intermediate data, are being developed for
the large-scale generation of synthetic imagery for automated target
recognition (ATR) machine learning (ML) algorithms. This imagery is typically
combined with field-collected data for generating robust training sets. The
management and retrieval of this data requires large-scale storage with a means
to query different types of information. Queries need to be performed for
selection of data sets to the single file. The goal of this study is to establish a
systematic approach for storing and retrieving this information using high-
performance computing resources and the Integrated Rule-Oriented Data
System (iRODS). Metadata tags, focused on location, time, and other factors, are used for search queries. Whenever possible, metadata generation is automated
based on the content of the data file. Use cases for the import and query
operations are created. Simple scalable problems have been processed and are
presented for this data set procedure. Additionally, we have outlined the
proposed system architecture. This data storage and retrieval system will serve
to provide locality-specific data for ATR ML data-sets from a large set of collected
and synthetic imagery.
Keywords: big data, data lake, infrared imagery, ATR, machine learning, iRODS,
MeScheR,
13.1 Introduction
Effective management of large data produced from multiple sources using
engineering processes requires a structured systems engineering approach.
Software components for synthetic scene generation, distribution of photonic
energy, subsurface thermal heat transfer (for soil and vegetation), and synthetic
sensor capabilities are integrated to generate robust training imagery for
automated target recognition (ATR) machine learning (ML) algorithms. Input
data sets include 3D models of vegetation, objects (e.g., vehicles), and
landscapes, which have been captured or synthesized. They also encompass
material properties for these models and intermediate software responses like
computation of incoming/outgoing radiative energy flux and surface
temperatures. Output data sets include imagery and surface fluxes for the
geometry of the entire scene. The data comes in many formats: landscape object
geometry is stored in Wavefront OBJ files, 2DM files, or direct Light Detection
and Ranging (LiDAR) data sets. Intermediate results are stored in formats
specific to the generating software. Output images may be stored in Tag Image
File Format (TIFF), Joint Photographic Experts Group (JPEG), or other image
formats [→1]. Additionally, AI ATR training data may be stored separately in
specially formatted XML files [→2].
Generated data is obtained from software written to create artificial
environments that mimic the radiative energy patterns of real environments,
especially in situations where collecting real-world data is challenging or
restricted. Several computer applications are used to: (1) construct a scene, (2)
distribute radiative flux in the scene (based on time of day, initial temperatures,
etc.), (3) accurately simulate subsurface effects of absorbed radiative energy in
vegetation and soils, and (4) generate sensor-specific imagery based on the
results. This imagery will be verified by and used with existing physical images to
inform and train artificial intelligence automated target recognition algorithms
(ATR).
In this report, we describe a system for managing and retrieving the data
needed for ATR ML. This architecture is a hybrid implementation using a data
lake and a metadata database management system. A data lake, as per
Wikipedia, is essentially a collection of mass storage devices that stores data in
its natural formats as “blobs” of information [→3]. The data lake is used with a
database management system dedicated to metadata for efficient searches of
files with certain characteristics. In the future, we plan to incorporate the
Integrated Rule-Oriented Data System (iRODS) [→4, →5, →6]. iRODS creates a
virtual file system on a distributed database file system, potentially involving
multiple servers [→4, →6, →7]. While our current system can manage basic
functionalities without iRODS, it lacks the scalability and flexibility of iRODS.
Overall, the goal of this project is to establish a system capable of storing
metadata at each stage of the process. This approach ensures that the stored
data can be easily retrieved and queried when needed. This work was originally
introduced at the CSECE ICDATA conference 2023 [8].
13.4 Metadata
Metadata, in essence, offers information about the stored data [→9]. There are
different categories of metadata: for instance, legal metadata describes the
legality of using the data and descriptive metadata has phrases or words that
describe the data in detail. Our system primarily utilizes descriptive, structural,
and administrative metadata [→10]. Descriptive metadata can be used for
categorization and identification, structural metadata maintains the organization
and connectivity of the data, and administrative metadata provides details about
who collected the data and its purpose. Any of these metadata types can be
used to search for data and uniquely identify specific data within extensive
collections.
Metadata can be generated either automatically or with user input.
Automatically generated metadata is easy to collect and is specific to the data collection process and the way it gathers information about the data. It usually
contains information about when, how, and why the data was created rather
than describing the contents of the data. This metadata is typically structural or
administrative in nature. In contrast, user-generated information is more
difficult to collect, describing the content of the file (e.g., labeling data as a
“forest” scene) in such a way that the data can be aggregated. This aggregation
(e.g., the set of files that are for “forest” scenes) enables efficient data searches
[→11]. However, user-generated metadata can be chaotic without defined
standards, raising questions such as whether to include the name of the data
generator.
For efficient organization and searchability, user-generated metadata needs
a clear format. This format should ideally use a method that organizes metadata
in a database using tags and corresponding values. This generic yet well-defined
system aims to reduce user error during metadata input and query selections,
providing a structured and reliable approach to managing data.
When working with a series of processes like the ones we are handling, there
are inherent formats and related metadata that can simplify file retrieval.
Therefore, it is logical to elicit metadata formats from the engineers who
established the processes generating the data. This approach helps outline the
necessary metadata for precise data retrieval, even down to individual files.
13.5 MeScheR
The Metadata Schema Reservoir (MeScheR) was developed to structure user
generated metadata. This system establishes a clear schema defining what
metadata is collected and the method of collection. For example, data from the
Images reservoir may have metadata storing details such as the image's height and width in pixels, information that can be extracted directly from the image. Using a schema for data collection ensures that the prescribed metadata is consistently collected for all data stored in the reservoir. It also guarantees that the user interface includes all the required input elements. In essence, MeScheR utilizes
schemas to organize the metadata stored in a reservoir.
Each data reservoir has its own specific metadata schema file tailored to the
elicited data descriptions required for the specific data group. This schema data
is stored in a JSON formatted file known as a MeScheR file. Every MeScheR file
contains information confirming its validity, the location of the automatic data
collection software, and a properties section. The properties section outlines the
tag and value sets of metadata to be input by the user as well as logical
constructs for the conditional application of tags based on user input. Metadata
tags can be configured as “one-to-one” (e.g., allowing only one value per tag) or
“one-to-many” (e.g., permitting multiple values for descriptor tags in a scene).
Each data reservoir contains a MeScheR file specific to its collection of metadata.
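To make this structure concrete, the following is a minimal, hypothetical sketch of what a MeScheR file for an Images reservoir might contain. All field names (schema_version, auto_collector, properties, cardinality, requires_if) and the output file name are illustrative assumptions rather than the project's actual schema.

```python
import json

# Hypothetical MeScheR file for an "Images" reservoir (illustrative only).
mescher_file = {
    "schema_version": "1.0",                                # validity information
    "auto_collector": "scripts/collect_image_metadata.py",  # location of the automatic collection software
    "properties": {
        "Originator": {"cardinality": "one-to-one"},         # only one value allowed per tag
        "Descriptor": {"cardinality": "one-to-many"},        # multiple descriptor values permitted
        "SceneType": {
            "cardinality": "one-to-one",
            # conditional application of the tag based on user input
            "requires_if": {"Descriptor": "forest"},
        },
    },
}

# Write the schema to disk so that a MeScheR user interface could read it.
with open("images_reservoir.mescher.json", "w") as fh:
    json.dump(mescher_file, fh, indent=2)
```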
Once a MeScheR file is prepared, the MeScheR user interface software can read the file and automatically generate an interface that collects the required metadata for a reservoir according to the properties section of the MeScheR file. The resulting user interface is
constructed using the properties section such that the interface uses each tag
from a metadata set to prompt the user for corresponding values. These tag and
value sets can include specific data (e.g., originator of the data) as well as more
general keyword data (e.g., descriptor of the data). Once the user provides all
necessary information, the software generates a JSON metadata file for each
input data file. These JSON files can be seamlessly integrated into metadata
database management systems such as iRODS. The ability to collect metadata
according to the MeScheR file’s specifications ensures that the process is driven
by the data itself.
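A minimal sketch of how such an interface could turn a schema into prompts and write one JSON metadata file per input data file is shown below; the function name, file layout, and prompting style are assumptions for illustration, not the actual MeScheR implementation.

```python
import json
from pathlib import Path

def collect_metadata(mescher_path: str, data_file: str) -> Path:
    """Prompt for each tag in the MeScheR properties section and write a
    JSON metadata file alongside the data file (illustrative sketch)."""
    schema = json.loads(Path(mescher_path).read_text())
    metadata = {"data_file": data_file}
    for tag, spec in schema["properties"].items():
        if spec.get("cardinality") == "one-to-many":
            raw = input(f"{tag} (comma-separated values): ")
            metadata[tag] = [v.strip() for v in raw.split(",") if v.strip()]
        else:
            metadata[tag] = input(f"{tag}: ").strip()
    out_path = Path(data_file).with_suffix(".metadata.json")
    out_path.write_text(json.dumps(metadata, indent=2))
    return out_path

# Example usage:
# collect_metadata("images_reservoir.mescher.json", "scene_001.tif")
```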
Routines for managing and searching through these JSON files have been
developed demonstrating the usability of the JSON files. These routines can be
applied to any reservoir chosen by the user. The management routines are
capable of listing all existing metadata tag values, adjusting singular metadata
tag values, and changing all occurrences of a metadata value based on the tag.
Additionally, the search routines are capable of searching for similar metadata
tag values. In the Direct Query routine, users input a tag and value to search for,
and the system returns a list of all data files containing the tag value pair. An
Augmented Query routine has been tested that prompts the user to enter a
value for the Descriptor tag and uses a large language model to measure
synonymity [→12] within the data file’s Descriptor metadata tags that match the
query value. The routine then lists all data files determined to be similar along
with the confidence value indicating the degree of similarity.
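For illustration, the following sketch shows how the Direct Query and Augmented Query routines might operate over the generated JSON metadata files. It assumes the file layout from the sketches above, and it uses the sentence-transformers library (cf. [→12]) with an assumed model name to measure synonymity; it is not the project's actual code.

```python
import json
from pathlib import Path
from sentence_transformers import SentenceTransformer, util

def direct_query(metadata_dir: str, tag: str, value: str) -> list:
    """Return the data files whose metadata contains the exact tag/value pair."""
    hits = []
    for path in Path(metadata_dir).glob("*.metadata.json"):
        meta = json.loads(path.read_text())
        stored = meta.get(tag)
        values = stored if isinstance(stored, list) else [stored]
        if value in values:
            hits.append(meta["data_file"])
    return hits

def augmented_query(metadata_dir: str, value: str, threshold: float = 0.6) -> list:
    """Rank files by semantic similarity between the query and their Descriptor tags."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model choice
    query_vec = model.encode(value, convert_to_tensor=True)
    results = []
    for path in Path(metadata_dir).glob("*.metadata.json"):
        meta = json.loads(path.read_text())
        for descriptor in meta.get("Descriptor", []):
            desc_vec = model.encode(descriptor, convert_to_tensor=True)
            score = float(util.cos_sim(query_vec, desc_vec))
            if score >= threshold:
                results.append((meta["data_file"], score))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Example usage:
# direct_query("images_metadata", "Descriptor", "desert")
# augmented_query("images_metadata", "forest")
```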
Figure 13.2: Ingestible JSON format representing image metadata tags and
their corresponding values (with sample values included).
These rules can retrieve metadata from other collections as data moves through
the project pipeline. For instance, an image may be a product of a particular
scene; in that case, the project title and prefix may be accessible from the
metadata for the scene. However, users must input the required metadata the first time it is needed. This requirement will be enforced in
subsequent iterations.
Users’ access privileges may be restricted to specific collections (e.g., content
providers may be limited to Source Models or Output → Images).
13.8 Results
We conducted tests for consumer user scenarios using MeScheR reservoirs.
These reservoirs were created from the metadata requirements elicited from
project developer users. A test dataset of 65 images, alongside additional images
and field collected data, was used to test the Python code for these use cases.
Each test aimed to add, modify, and remove files along with their ingestible
metadata from the Images reservoir. Various groups of files were assigned
different metadata values to evaluate query outcomes. The query use cases
were performed to determine if distinct groups, down to individual files, could
be identified. The results were compiled into a list of file names as depicted in
→Figures 13.9 and →13.10.
Figure 13.9: Results from a direct query search for the descriptor value
“desert”.
Figure 13.10: Results from an augmented query search for the descriptor value
“forest”.
13.9 Conclusions
In this chapter, we have presented a comprehensive framework for efficient
management and retrieval of large datasets on high-performance computing
systems using a metadata database management system. The proposed
architecture has been tested with metadata JSON files that are proven to be
ingestible with iRODS (or another metadata database) commands. This
implementation has been tested using a dataset of 65 synthetic infrared images.
A metadata schema was created and tested, incorporating user input. Furthermore, our approach was able to identify specific data files for retrieval through metadata queries, demonstrating the effectiveness of our
methodology in handling large-scale data management challenges in high-
performance computing environments.
References
[1] J. D. Murray, and W. VanRyper, Encyclopedia of Graphics File Formats,
Cambridge, MA: O’Reilly and Associates, Inc, 1996. →
[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman, The
PASCAL visual object classes (VOC) challenge. International Journal of Computer
Vision, vol. 88, no. 2, pp. 303–338, 2010,
→http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf. →
[3] Wikipedia. Data Lake. Wikipedia, 2023, Accessed April 2023.
→https://en.wikipedia.org/wiki/Data_lake. →
[4] M. Hedges, A. Hasan, and T. Blanke, Management and preservation of research
data with iRODS. CIMS’07: Proceedings of the ACM first workshop on
CyberInfrastructure: information management in eScience. pp. 17–22, 2007.
→https://doi.org/10.1145/1317353.1317358. a, b, c
[5] K. D. Winters, M. A. Cowan, G. E. George, M. E. Gonzales, B. Priest, O. Morris,
and J. Landrum, Analysis of ERS Use Cases for iRODs, ERDC/ITL SR-20–8,
Vicksburg, MS: US Army Engineer Research and Development Center, 2020. a, b
[6] M. Conway, R. Moore, A. Rajasekar, and J. Y. Nief, Demonstration of policy-
guided data preservation Using iRODS. 2011 IEEE International Symposium on
Policies for Distributed Systems and Networks. pp. 173–174, 2011,
→https://doi.org/10.1109/POLICY.2011.17. a, b, c
[7] K. D. Winters, M. A. Cowan, G. E. George, M. E. Gonzales, B. Priest, O. Morris,
and J. Landrum, Integrated Rule-Oriented Data System (iRODS) and High
Performance Computing (HPC) Requirements Document, ERDC/ITL SR-20–10,
Vicksburg, MS: US Army Engineer Research and Development Center, 2020. a, b
[8] B. C. White, R. R. Patel, L. K. Walker, and M. D. Bray, Management, storage, and
retrieval of complex data comprising multiple formats collected from different
sources: A systems engineering approach. CSECE 2023. 2023.
[9] Merriam-Webster. Metadata. Merriam-Webster Dictionary. 2023, Accessed
April 2023. →www.merriam-webster.com/dictionary/metadata. →
[10] J. Pomerantz., Metadata, Cambridge, MA: MIT Press, 2015. →
[11] G. P. Rodrigo, M. Henderson, G. H. Weber, C. Ophus, K. Antypas, and L. Ramakrishnan, ScienceSearch: Enabling search through automatic metadata
generation. 2018 IEEE 14th International Conference on e-Science (e-Science). pp.
93–104, 2018, →https://doi.org/10.1109/eScience.2018.00025. →
[12] N. Reimers, and I. Gurevych, Sentence-BERT: Sentence embeddings using
Siamese BERT-networks. EMNLP’19: Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing.2019,
→https://doi.org/10.48550/arXiv.1908.10084. →
[13] Library of Congress, EXIF exchangeable image file format, version 2.2., 2023,
Digital Formats, Accessed October 2023,
→https://www.loc.gov/preservation/digital/formats/fdd/fdd000146.shtml. →
[14] X. Zhou, W. Gong, W. Fu, and F. Du, Application of deep learning in object
detection, 2017 IEEE/ACIS 16th International Conference on Computer
Information Science (ICIS). pp. 631–634, 2017,
→https://doi.org/10.1109/ICIS.2017.7960069. a, b
[15] A. Tsyganov, S. Stoica, M. Babai, V. Soancatl-Aguilar, J. McFarland, G.
Strikwerda, M. Klein, et al., The Research Data Management System at the
University of Groningen (RUG RDMS): Architecture, solution engines and
challenges. iRODS UGM 2021, 2021, →https://irods.org/uploads/2021/Tsyganov-
Stoica-Strikwerda-UnivGroningen-
Research_Data_Management_System_at_Univ_of_Groningen_Architecture_Solutio
n_Engines_and_Challenges-paper.pdf. →
[16] K. D. Winters, M. A. Cowan, G. E. George, M. E. Gonzales, B. Priest, O. Morris,
and J. Landrum, Web-Enabled Interface for iRODS: Comparing Hydroshare and
Metalnx, ERDC/ITL SR-20–9, Vicksburg, MS: US Army Engineer Research and
Development Center., 2020. →
14 Implementation and evaluation of an eXplainable
artificial intelligence to explain the evaluation of an
assessment analytics algorithm for free-text exams
in psychology courses in higher education to attest
QBLM-based competencies
Ramona Srbecky
Franz Bühler
Jörg Schuljak
Simon-Alexander Wetzel
Michael Winterhagen
Wieland Fraas
Jan Dettmers
Matthias Hemmje
Abstract
In the realm of higher education, the exploration of eXplainable artificial
intelligence (XAI), elucidating the mechanisms behind artificial intelligence
systems, becomes increasingly pivotal as institutions seek transparency and
comprehension in leveraging advanced technologies for diverse academic
applications. Based on our previous research, we outlined an automated
approach for generating test and training data for assessing free-text answers in
a digital university course on work organization and design for a natural
language processing (NLP) algorithm, encompassing research methodology,
theoretical background, concept, implementation, evaluation, and future
improvements. This chapter presents the underlying NLP algorithm and the
corresponding XAI component, for which the test and training data was created.
Therefore, we will present the state-of-the-art in the area of XAI and NLP for the
higher education sector. Subsequently, we will present the current state of
technology in our educational system, the Knowledge Management Ecosystem
Portal (KM-EP), and its qualification-based learning model-related subsystems.
The corresponding concepts and proof of concept implementations and
evaluations are presented. The chapter concludes with a summary and
discussion.
Keywords: eXplainable artificial intelligence, XAI, higher education, natural
language processing, NLP, PAGEL, GALA4QBLM, QBLM, assessment analytics,
learning analytics, free-text exams,
14.1 Introduction
In the dynamic landscape of higher education, the integration of artificial
intelligence (AI) has emerged as a transformative force, offering innovative
solutions to enhance various facets of academic processes [→1]. As universities
embrace AI technologies to streamline administrative tasks, personalize learning
experiences, and optimize research endeavors, a pressing concern that arises is
the need for transparency and interpretability in AI systems. This concern has
given rise to the burgeoning field of explainable AI (XAI), aimed at demystifying
the opaque nature of AI algorithms and ensuring that their decision-making
processes are comprehensible to stakeholders, including educators,
administrators, and students [→2].
Based on the problem statements (PSs), the following research questions (RQs) can be derived:
RQ1 for PS1: How can an automated solution evaluation and attestation
of CQs for free-text answers using NLP be implemented for the
application scenario PAGEL in the KM-EP?
RQ2 for PS2: How can a result of the NLP algorithm be explained by
applying XAI in the context of the PAGEL project in the KM-EP?
RQ3 for PS3: How can the NLP and XAI components be integrated into the
KM-EP?
Each research objective (RO) is labeled RO.<phase>.RQ<number>.<consecutive number>, where <phase> is the abbreviation of the research methodology phase used; this numbering uniquely identifies each RO per phase of the research methodology and per RQ.
ROs for RQ1:
RO.OP.RQ1.1: Researching the current state-of-the-art NLP algorithms and
methods.
RO.TBP.RQ1.2: Concept for the NLP algorithm for evaluating free-text
answers in the PAGEL course in the KM-EP.
RO.SDP.RQ1.3: Realization of the concept from RO.TBP.RQ1.2.
RO.EP.RQ1.4: Evaluation of the NLP algorithm developed in RO.SDP.RQ1.3.
The structure of the chapter follows the order of the ROs. This means that in the
state-of-the-art section (Section 14.2), the ROs for the OP are described. In the
Conceptual Design section (Section 14.3), the ROs for the TBP are described. The
ROs for the SDP are explained in Section 14.4, and in Section 14.5, the ROs of the EP
are shown. The chapter concludes with a summary and discussion of further
topics.
Because the research took place at a German university, several
diagrams are partly written and presented in German.
14.3 Concepts
In this section, RO.TBP.RQ1.2, RO.TBP.RQ2.2, and RO.TBP.RQ3.3, as part of the
TBP, will be further described. Therefore, the application scenario will be initially
explained. Afterward, the use cases, data models, and a system overview, based
on the results of [→73,→74], with the to-be-developed components will be given.
The section will be closed with a short description of the potential integration
mechanism of the NLP and XAI components into the KM-EP as part of the
chapter system overview. The section concludes by describing the concepts and
pipelines for the NLP and XAI algorithms. In the remainder of this chapter, the
term solution evaluation system is used as a generic term for the NLP and XAI
components. Based on the previous OP, the NLP algorithm will be based on a
BERT model since transformer models are the current state-of-the-art. We will use SHAP for the XAI component because model-specific implementations are available for specific model types such as transformers.
Next, the data schema for the final quiz, as shown in →Figure 14.2, will be
explained. A quiz request contains several quiz answers, and a quiz answer
belongs to a quiz request. A quiz answer answers a question, whereas several
quiz answers can answer a question. A question is assigned to a game
parameter. A game parameter can be assigned to several questions. A question
is also assigned a sample solution and several keywords. A sample solution and
a keyword are only assigned to one question at a time. A T&T data set for
comparing semantic similarity also contains a sample solution and a solution for
the points achieved. A sample solution can be contained in several T&T data sets.
A game parameter labels a T&T data set for classification. A game parameter can
label several T&T data sets. A T&T data record for keyword recognition contains
a solution and a match parameter, labeled by several keywords.
Game parameters and keywords can occur in several T&T data sets for keyword
recognition. A rating is based on a rating scheme, several sample solutions, and
several keywords. A rating evaluates a quiz query. The scoring scheme, model
solution, and keyword can be used in multiple scores [→73, →74].
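As an illustration only, the core entities and relationships described above could be represented roughly as follows; the class and attribute names are our assumptions and cover only part of the schema, not the actual KM-EP data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GameParameter:
    name: str                                  # e.g., "role ambiguity"

@dataclass
class Question:
    text: str
    game_parameter: GameParameter              # each question is assigned one game parameter
    sample_solution: str                       # one sample solution per question
    keywords: List[str] = field(default_factory=list)

@dataclass
class QuizAnswer:
    question: Question                         # several answers can answer the same question
    answer_text: str

@dataclass
class QuizRequest:
    answers: List[QuizAnswer] = field(default_factory=list)   # one request contains several answers
```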
14.3.3 System overview
The solution evaluation system to be developed should interact with the PAGEL
simulation and the associated applications. All applications described here are
part of the KM-EP. The full system overview with all the components mentioned
in the text is shown in →Figure 14.3. The PAGEL simulation will be called up from
the LMS Moodle via the LTI interface. The chat tool and the library in which the
product data is stored are part of the PAGEL simulation. In the context of this
work, the product library is simulated by a file to which the solution evaluation
system has access. Later, the solution evaluation system will access the product
data via an interface of the product library. Access is required to check the
answers from the chat tool for correctness or to answer the questions from the
chat tool. Questions from the fictitious colleagues come from a database to
which the chat tool has access. The chat sends a player’s questions and solutions
to the solution evaluation system. Depending on the case, an answer to the
question or feedback on the correctness of the solution is sent back from there.
Once the PAGEL simulation has been completed, the final quiz is called up,
where free-text answers are sent to the solution evaluation system, together
with the tasks.
The primary forms are determined for the keywords and solution texts to
simplify the comparison. This is used to determine which keywords are
contained in the solution texts. If there is no sample solution for a solution, the
evaluation can then take place based on the keywords. If there is a sample
solution, sentence vectors are then created from the solutions. These are used
to carry out a semantic comparison between a student solution and a sample
solution. The resulting cosine similarity is used together with the keywords for
evaluation. A CQ can then be assigned based on the ratings. The solutions are
also categorized and assigned to one of the game parameters to make the
semantic evaluation more comprehensible.
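A minimal sketch of this pipeline is given below, combining spaCy lemmatization for the primary forms with sentence vectors and cosine similarity. The spaCy model name is an assumption, and the sentence-transformers model follows the multilingual model cited later in the chapter [→66]; this is an illustration, not the project's implementation.

```python
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("de_core_news_sm")   # assumed German pipeline for lemmatization (primary forms)
sbert = SentenceTransformer("sentence-transformers/stsb-xlm-r-multilingual")  # cf. [66]

def matched_keywords(solution: str, keywords: list) -> list:
    """Return the keywords whose primary forms all occur in the solution text."""
    solution_lemmas = {tok.lemma_.lower() for tok in nlp(solution) if not tok.is_punct}
    hits = []
    for kw in keywords:
        kw_lemmas = {tok.lemma_.lower() for tok in nlp(kw) if not tok.is_punct}
        if kw_lemmas and kw_lemmas <= solution_lemmas:
            hits.append(kw)
    return hits

def semantic_similarity(solution: str, sample_solution: str) -> float:
    """Cosine similarity between the sentence vectors of the two texts."""
    vecs = sbert.encode([solution, sample_solution], convert_to_tensor=True)
    return float(util.cos_sim(vecs[0], vecs[1]))
```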
In the next part, the evaluation scheme is designed, with which the solutions
from the final quiz are to be evaluated. The evaluation scheme is based on the
template from the OP.
A distinction is made between the two cases. In the first case, players justify
their assessment of the experienced game parameters. In this case, the game
parameters can vary with each round and be perceived differently by each
player. For example, a perceived four on a scale of one to five can be described differently by two players, for instance when one player's score is a two and the other's is a five.
For this reason, answers with very different content may have to be assessed
as correct, and no model solution can be used for assessment. Even if a model
solution is provided for each of the five possible levels of a game parameter, it is
only possible to check whether the content of an answer matches one of the
model solutions. In addition, an answer can also be a mixture of two sample
solutions, as there is no fixed definition of the characteristic levels.
Therefore, the evaluation in this case should be based on the relevant terms
for the respective game parameter. An answer should be considered entirely
correct if at least three keywords are included. One point should be awarded for
each correctly named keyword. A subdivision into individual evaluation
categories is not required, as no marks or similar are to be awarded for the
solutions. The evaluation is done by awarding points. The principle is illustrated
in →Table 14.1 using the role ambiguity game parameter example. In the second
case, players should describe how the game should be designed to create good
working conditions for the individual game parameters.
Table 14.1: Example of evaluation scheme for keywords for role ambiguity.
Keywords for role ambiguity: unclear/ contradictory/ ambiguous/ uncertain/ not certain/ not transparent/ not unambiguous/ imprecise/ not precise/ inaccurate/ not exact/ no, goal/ tasks/ evaluation/ criteria/ guidance/ instruction/ explanation/ description/ instruction.
Anchor examples and points:
"There was no briefing at all. As a result, I wasn't sure exactly what my tasks were and what criteria were used for assessment." (3 points)
"There was no description of the tasks." (2 points)
"I was unsure what I was supposed to do." (1 point)
"I didn't understand anything." (0 points)
The problem of the first case, namely that very different answers can be correct in terms of content, does not exist here. Therefore, a sample solution can be
used for evaluation in addition to the keywords. Six points should be awarded
for an entirely correct answer; a maximum of three for the keywords and three
for the semantic similarity to the sample solution. In this case, evaluation
categories are introduced to define which answer is awarded how many points,
depending on a sample solution.
These are shown in →Table 14.2, using the example of the role ambiguity
game parameter. Full points are awarded for an answer whose sentence vectors have a cosine similarity of 0.8 or higher with the sample solution. Two points are awarded for a cosine similarity of at least 0.7, and one point for a cosine similarity of at least 0.6. The threshold values for the evaluation categories are evaluated in the
EP and adjusted if necessary. The examples in the table refer to the sample
solution: “It must be clear who is to do what, when, and how. The objectives of
the computer store must be clear. At the beginning, it must be described how
the game works and what exactly it is about. It must clearly state what points are
not awarded in the game. It must also be specified who has authority and who
does not.”
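A small sketch of this point assignment follows; the function names are ours, and the thresholds mirror the evaluation categories above (which may still be adjusted during the EP).

```python
def similarity_points(cosine_similarity: float) -> int:
    """Map cosine similarity to points according to the evaluation categories."""
    if cosine_similarity >= 0.8:
        return 3
    if cosine_similarity >= 0.7:
        return 2
    if cosine_similarity >= 0.6:
        return 1
    return 0

def total_points(n_matched_keywords: int, cosine_similarity: float) -> int:
    """Second case: up to three points for keywords plus up to three for similarity."""
    return min(n_matched_keywords, 3) + similarity_points(cosine_similarity)
```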
In the next part, we will conceptualize how the result of the semantic
comparison of a solution with a model solution can be explained. The results of
the model are the sentence vectors generated from the solutions. The cosine
similarity between the sentence vectors is calculated for the semantic
comparison, and points are awarded based on this. The result of the model is,
therefore, not the score. Therefore, an XAI method such as SHAP or TED cannot
be used to explain the model and, thus, the rating. One solution would be to
train the model with data sets of solutions, sample solutions, and the points
awarded, to learn to assign the appropriate score. However, this requires large
amounts of data, for which the cosine similarity between the solution and the
sample solution must be calculated beforehand to specify the correct score.
Furthermore, it cannot be ensured how the model arrives at the corresponding score from the sentence vectors. This would make the model more error-prone merely to increase explainability.
Another approach considers the individual game parameters as classes that
the model should predict for the solutions. The data sets required for this are
more accessible, as no cosine similarity needs to be calculated. For this purpose,
a component for text categorization is to be trained based on a BERT model. This
model can then be used with SHAP to show the influence of the individual text
passages on the assignment to a specific class. The data sets used for training
must not deviate too much from the model solutions. This ensures that a section
of text that significantly influences the assignment to a class also has similarities
to the corresponding sample solution. With this approach, the evaluation is still
based on the calculated cosine similarity. This means that poorer predictions of
the model do not compromise the explainability of the system.
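For illustration, the following sketch shows how a SHAP explainer can be applied to a transformer-based text classifier to visualize the influence of individual text passages on a class assignment. It uses a Hugging Face transformers pipeline rather than the project's spaCy textcat component, and the model path and class label are placeholders.

```python
import shap
from transformers import pipeline

# Placeholder path for a fine-tuned BERT text-categorization model.
classifier = pipeline("text-classification",
                      model="path/to/finetuned-bert-textcat",
                      top_k=None)                     # return scores for all classes

explainer = shap.Explainer(classifier)
shap_values = explainer(["Es war unklar, wer welche Aufgaben übernehmen sollte."])

# Visualize the influence of each text passage on a specific class label
# (the label name depends on the trained model).
shap.plots.text(shap_values[0, :, "role ambiguity"])
```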
14.4.3 Integration of the NLP algorithm and XAI component into the KM-EP
In the root directory of the software project, there is a file called README.MD,
which describes how the solution can be installed. Firstly, the CUDA drivers, CuPy
and spaCy, will be installed. CUDA is a toolkit from NVIDIA that is used to develop high-performance, GPU-accelerated applications [→67]. CUDA requires that the computer it runs on has a CUDA-capable graphics processing unit (GPU) [→68]. CuPy is an open-source library for GPU-accelerated
computing with Python [→69]. However, these steps are skipped because the
server has no GPU. After installing these dependencies, the spacy project run
start should be executed. This should install the other dependencies and start
the application. The necessary steps are defined in the project.yaml file. If the
spacy project run start script is executed, the installation fails with an error
message. This message occurs because the installed urllib3 Python package is not compatible with the installed version of the OpenSSL program. The installed version of urllib3 is 2.0; a compatible version is 1.26.6. Another error message appears if this version is installed and the spacy project
run start command is executed again. This is because no model has been
created yet. To create the model, the script textCategorization/script/nlp_aug.py
must first be executed to generate training data for the model.
The script does not work and aborts with an error message concerning the parameter device = "cuda". If this parameter is removed, the script runs to the end, and the data is generated. To use the data to train the model, the command spacy project run all is executed in the textCategorization/ directory. However, this fails with the message Missing dependency, specified by
the command “train”:configs/config.cfg. The problem is that the file
textCategorization/configs/config.cfg is missing. Bühler describes in his work
that he used the demo pipeline as described in [→70]. A significant difference,
however, is that the /configs/config.cfg file is present in the demo project.
If this file is now added, the error message ValueError: GPU is not accessible. Was the library installed correctly? appears. This can be remedied by setting the value gpu_id to −1 in the textCategorization/project.yml file, meaning a CPU is used instead of a GPU [→71]. If spacy project run all is executed again, it generates a trained model, eliminating the error. Another error message appears after executing the spacy project run start command again: the textcat component does not exist. The
textCategorization/configs/config.cfg file contains the components tok2vec and
textcat_multilabel.
The command spacy init config --pipeline "textcat" configs/config.cfg -F creates a configuration with the textcat component. The command spacy project run all must then be executed again in the
textCategorization directory for the change to take effect. If the command spacy
project run start is executed in the project’s root directory, the application works
without error messages. The generation of another model is required. The
command python -m chatette ./assets/data_template.txt -o ./assets/generated is
executed to do this. However, no Python package named “chatette” is available
and must be installed using the pip install chatette command. After this step, the
command can be executed, and training data for an AI model is generated. To
transform this into a suitable form, the script ner/scripts/transform_ner_file.py is
executed. The model is then trained by executing spacy project run all in the ner directory. Note that the gpu_id in the project.yml is set to 0; this is changed to −1 to avoid an error message due to the missing GPU. Executing the command results in the following error message:
Missing dependency specified by command 'train': configs/config.cfg
A file is missing. One difference is that the demo project's project.yml file contains more steps, including a create-config step. If this step from the demo pipeline is inserted into Bühler's AI project in the same way, the command can be executed without an error message, and the training results can be found in the ner/training directory. If a request is now sent to the assessment interface, the following error message appears: FileNotFoundError: [Errno 2] No such file or directory: 'assets/explanations/1234_17-06-2023_14.23.23.html'
This is because the assets/explanations directory does not exist. Once the
directory has been created, a request is processed without an error message.
As the manual process is error-prone and time-consuming, all steps, from
installing the dependencies to generating the data and training, were combined
in a Python script called init_project.py. The Python language was chosen here
because it is operating system-independent, and the entire project also requires
a Python runtime environment; therefore, this dependency is already installed.
Missing information and dependencies were added to the code and the
parameters were adjusted. A start_server.py script was also written with which
the web server for the AI solution can be started. In order to install and initialize
the software on the KM-EP, the following steps must be carried out:
1. git clone <repository name>
2. cd freitextbewertung/
3. sudo git checkout fix_for_no_gpu
4. python3.9 -m venv venv
5. source venv/bin/activate
6. python init_project.py
7. nohup python start_server.py
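The following is a minimal sketch of what an init_project.py like the one described above might automate, based only on the manual steps documented in this section; the paths and the pinned urllib3 version follow that description and are not the actual script.

```python
# init_project.py -- illustrative sketch of the automated setup described above.
import subprocess
from pathlib import Path

def run(cmd: str, cwd: str = ".") -> None:
    """Run a shell command and fail loudly if it does not succeed."""
    print(f"+ {cmd} (in {cwd})")
    subprocess.run(cmd, shell=True, cwd=cwd, check=True)

# Pin the urllib3 version that is compatible with the server's OpenSSL build.
run("pip install 'urllib3==1.26.6' chatette")

# Generate training data and train the text-categorization model (CPU only, gpu_id = -1).
run("python textCategorization/script/nlp_aug.py")
run("spacy project run all", cwd="textCategorization")

# Generate and transform the NER training data, then train the NER model.
run("python -m chatette ./assets/data_template.txt -o ./assets/generated")
run("python ner/scripts/transform_ner_file.py")
run("spacy project run all", cwd="ner")

# The assessment interface expects this directory to exist.
Path("assets/explanations").mkdir(parents=True, exist_ok=True)
```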
In the first test, a text was compared with itself. A similarity value of 1 was
expected, which was also the test result. The second test calculates a similarity
value for a reference text and the comparison text. The value is then calculated
again, but the reference and comparison texts are swapped. Both similarity
values should be the same. Running the test confirms that the values are the
same. In the third test, both texts contain exactly the same words but in a different
order. A very high similarity value is expected. In the fourth test, two texts with
the same number of words but containing completely different statements were
selected. A very low similarity value was expected. The value was 0.33, which is
very low. The fifth test compares two texts, one very long and the other very
short. Both have a completely different message. A very low similarity value was
expected. The value was −0.05. The sixth test compares two texts with the same
statement but formulated differently. This test was carried out with an English
and a German text. Both similarity scores were 0.95. In the seventh test, the
comparison text is a short subtext of the reference text. A medium–high
similarity value is expected. The value was 0.8 and is, therefore, high. In the
eighth test, the comparison text is a subtext of the reference text, only worded
differently. A medium–high similarity value is expected. The value was 0.73.
The values were 1, 0.98, 0.95, 0.8, 0.73, 0.33, and −0.05. It can be seen that the values are distributed across different ranges. However, there are still shortcomings. One value was negative, which does not correspond to the expected range of values. Furthermore, it is noticeable that the similarity of texts of similar length is rated low instead of very low, even if they have no semantic match and share no words. However, if the semantics partially
match, the values are usually high or very high.
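The first two of these tests could be written roughly as follows, assuming a semantic_similarity(a, b) function like the one sketched earlier in this chapter; the import path and the example sentences are hypothetical.

```python
import math

from evaluation import semantic_similarity  # hypothetical module containing the sketch above

def test_identity():
    text = "Der Laden braucht klare Ziele."
    # Comparing a text with itself should yield a similarity of 1.
    assert math.isclose(semantic_similarity(text, text), 1.0, abs_tol=1e-5)

def test_symmetry():
    a = "Die Aufgaben waren unklar."
    b = "Niemand wusste, was zu tun ist."
    # Swapping reference and comparison text should not change the value.
    assert math.isclose(semantic_similarity(a, b), semantic_similarity(b, a), abs_tol=1e-5)
```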
During the explanation of the evaluation, it was checked whether each word
was also assigned an influence value. This test was passed. Next, pairs of two-word texts were repeatedly compared. The reference text is always "This sentence". The comparison texts are given in →Table 14.3.
Table 14.3: Table with the used values for the verification [→74].
Comparison text | Similarity | Word | Influence | Word | Influence | Influence sum
This sentence | 1.0 | This | 0.0521 | sentence | 0.259 | 0.3111
this sentence | 0.998 | this | 0.0519 | sentence | 0.2563 | 0.3082
This Sentence | 0.9877 | This | 0.0544 | Sentence | 0.2467 | 0.3011
sentence This | 0.9749 | sentence | 0.2339 | This | 0.027 | 0.2609
This phrase | 0.8585 | This | −0.0219 | phrase | 0.1174 | 0.0955
THIS SENTENCE | 0.7505 | THIS | 0.0418 | SENTENCE | 0.0047 | 0.0465
This sentencE | 0.7107 | This | −0.0003 | sentencE | −0.0303 | −0.0334
This sentenc | 0.6873 | This | −0.0086 | sentenc | −0.0537 | −0.0624
This fadadfsd | 0.6717 | This | −0.0328 | fadadfsd | −0.0693 | −0.1022
This senTence | 0.6022 | This | −0.0748 | senTence | −0.1388 | −0.2136
This dog | 0.3531 | This | −0.0298 | dog | −0.3879 | −0.4177
The table shows that the sum of the individual influence values decreases when
the similarity value decreases. It can also be seen that the capitalization of the
first letter of a word has a relatively small influence on the similarity value and
the influence value. However, the order changes the influence value of the
individual words somewhat more strongly, but the sum changes relatively little.
Using a synonym, as in line 5 of the table, also strongly affects the influence value of the other word. The similarity and influence values are strongly affected if the words are written entirely in capital letters. If a single letter that is lowercase in the reference sentence and not at the beginning of a word is capitalized, the similarity value and all influence values are significantly reduced. The same applies if a letter of a word is omitted. If the second word is replaced by a completely different word with no semantic similarity, the similarity value drops sharply, as does each influence value, but not as sharply as in the case of a character string that is not a word but just a random combination of letters. Overall, the sum of the influence values correlates with the similarity values in these trials. Words in the solution that are not included in the sample solution have a larger negative influence value than words that appear in both the sample solution and the solution. As a prototype, this
method is suitable for explaining the evaluation. However, further experiments
with more test data would have to be carried out to confirm this assumption
with more certainty. Another desirable property for the explanation would be
that the influence value of a word does not change when another word changes
in comparison to the reference text [→74].
14.5 Initial evaluation
A cognitive walkthrough (CW) was selected as the initial evaluation methodology for this chapter to reach RO.EP.RQ1.4, RO.EP.RQ2.4, and RO.EP.RQ3.4 [→72]. This task-oriented inspection method is performed without an end user. However, a CW can usefully be applied even in an early development state of software [→72].
A CW is divided into three phases. First, the system users are described in
the preparation phase, sample tasks are designed to be completed with the
system under test, and possible solutions are developed. In the analysis phase,
the actual CW takes place. In this phase, the domain expert documents each of
his actions and considers possible problems real users might have with the
system. In the Follow-Up phase, identified usage problems and their causes are
documented. In addition, possible alternatives for them are named [→72].
14.6 Conclusions
This chapter presented and realized a potential approach for an NLP algorithm
and an XAI algorithm for the PAGEL application scenario. Based on this, how
these algorithms could be integrated as components in the KM-EP and how they
were evaluated was shown.
Finally, based on the results of this work, an outlook on the possible further
developments is given. Once the interface to the CQP has been modified so that
an ID is used to identify a student instead of an e-mail address, this interface can
be linked to the solution evaluation system. This means that the CQs achieved in
the final quiz are entered directly into a student’s CQP. The interface of the
product library should also be connected to the solution evaluation system. This
allows the currently used file to be removed so that there is only one place
where products are maintained. This means that the solution evaluation system
always accesses the current products and product properties. The configuration
tool is also to be linked to the solution evaluation system. The evaluation scheme
can then be adapted in a user-friendly way. Once the authoring tool has been
implemented, its interface can be connected to the solution evaluation system.
The keywords and sample solutions are then maintained in the authoring tool
via a more user-friendly interface. As soon as new keywords are received from
the authoring tool, the script that converts them into a RegEx search pattern can
be executed automatically. This means that manual execution of the script is no
longer necessary.
To reduce the manual execution of scripts in other places, the scripts for
generating and formatting the T&T data for NER can be included as previous
steps in the workflow for training the NER model. The same applies to the script
that generates T&T data for text classification. This can be included as the first
step in the workflow for training the classification model.
The texts sent by real players can be collected to optimize the ML models
further. If they are available in sufficient quantity, they can be converted into the
required format and used to train the models. If insufficient actual data is
available, it can be used to generate new T&T data close to the actual data. The
better the T&T data represents the texts of real players, the better the model
developed with it. The solutions to the final quiz are already saved. The
prototype could be extended to store the chat messages as well.
References
[1] B. George, and O. Wooden, Managing the strategic transformation of higher
education through artificial intelligence. Administrative Sciences, vol. 13, no. 9,
pp. 196, 2023. →https://doi.org/10.3390/admsci13090196. →
[2] H. Khosravi, S. Buckingham Shum, G. Chen, C. Conati, Y.-S. Tsai, J. Kay, S.
Knight, R. Martinez-Maldonado, S. Sadiq, and D. Gašević, Explainable
Artificial Intelligence in Education, Computers and Education: Artificial
Intelligence, vol. 3, pp. 100074, 2022. ISSN 2666-920X,
→https://doi.org/10.1016/j.caeai.2022.100074. →
[3] J. Dettmers, “PAGEL Projekt” [online]. URL: →https://www.fernuni-
hagen.de/arbeitspsychologie/forschung/pagelprojekt.shtml [Zugriff am:
17.12.2023]. a, b, c
[4] B. Vu, A Taxonomy Management System Supporting Crowd-based Taxonomy
Generation, Evolution, and Management, Hagen: Fernuniversität in Hagen,
2020. a, b
[5] Moodle, „Die Moodle-Lernplattform“ [Online]. URL: →https://moodle.de/.
[Zugriff am 22 05 2022]. a, b
[6] M. Then, Supporting Qualifications-Based Learning (QBL) in a Higher
Education Institution's IT-Infrastructure, FernUniversität in Hagen, Hagen,
2020. a, b, c
[7] A. Marks, Hochschulinterne Ausschreibung „Fellowship in der digitalen
Hochschullehre (digiFellow)“ gestartet, 17 3.2022. [online]. URL:
→https://www.fernuni-hagen.de/zli/blog/neue-hochschulinterne-
ausschreibungsrunde-fellowship-in-der-digitalen-hochschullehre-digifellow-
gestartet/ [Zugriff am: 17.12.2023]. a, b
[8] R. Srbecky, et al., Concept and realization to automatically generate test and
training data for a natural language processing algorithm for the assessment of
free text answers for digital courses on work design in the context of psychology
studies in higher educational institutions. In: 2022 International Conference on
Computational Science and Computational Intelligence (CSCI), Las Vegas, pp.
1977–1982, NV, USA, 2022, doi: 10.1109/CSCI58124.2022.00355. a, b, c, d, e
[9] J. F. Nunamaker Jr., M. Chen, and T. D. Purdin, Systems development in
information systems research. Journal of Management Information Systems, vol.
7, no. 3, pp. 89–106, 1990. →
[10] D. Jurafsky, and J. H. Martin, Speech and Language Processing – An
Introduction to Natural Language Processing21 9 2021. [Online]. URL:
→https://web.stanford.edu/~jurafsky/slp3/old_sep21/ed3book_sep212021.pdf.
[Zugriff am 03 01 2022]. a, b, c, d, e, f, g, h, i, j
[11] Universal Dependencies, Universal Dependencies – This page pertains to UD
version 2, [Online]. URL: →https://universaldependencies.org/. [Zugriff am 12 05
2022]. →
[12] J. Brownlee, Deep Learning for Natural Language Processing – Develop
Deep Learning Models for Natural Language in Python, San Francisco: Jason
Brownlee, 2017. a, b, c, d, e, f, g, h
[13] J. Brownlee, A Gentle Introduction to the Bag-of-Words Model, 09 10 2017.
[Online]. URL: →https://machinelearningmastery.com/gentle-introduction-bag-
words-model/. [Zugriff am 05 01 2022]. →
[14] TensorFlow, word2vec – TensorFlow, [Online]. URL:
→https://www.tensorflow.org/tutorials/text/word2vec. [Zugriff am 12 05 2022].
→
[15] J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word
Representation, 2014. [Online]. URL: →https://nlp.stanford.edu/projects/glove/.
[Zugriff am 12 05 2022]. →
[16] A. Pai, An Essential Guide to Pretrained Word Embeddings for NLP
Practitioners, 16 03 2020. [Online]. URL:
→https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-
embeddings-nlp/. [Zugriff am 05 01 2022]. →
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, and Ł.
Kaiser, Attention Is All You Need, 6 12 2017. [Online]. URL:
→https://arxiv.org/pdf/1706.03762.pdf. [Zugriff am 04 01 2022]. a, b, c, d
[18] super.gluebenchmark.com, SuperGLUE – Learderboard Version 2.0,
[Online]. URL: →https://super.gluebenchmark.com/leaderboard. [Zugriff am 12
05 2022]. →
[19] Gluebenchmark.com, GLUE – Leaderboard, [Online]. URL:
→https://gluebenchmark.com/leaderboard. [Zugriff am 12 05 2022]. →
[20] IBM Cloud Education, Neural Networks – What are Neural Networks?, 2020.
[Online]. URL: →https://www.ibm.com/cloud/learn/neural-networks. [Zugriff am
07 05 2022]. →
[21] J. Uszkoreit, Transformer: A Novel Neural Network Architecture for
Language Understanding, 31 08 2017. [Online]. URL:
→https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
[Zugriff am 04 01 2021]. a, b
[22] J. Devlin, A. Rao, D. Markowiz, and H. Lin, BERT, 2020. [Online]. URL:
→https://github.com/google-research/bert. [Zugriff am 14 01 2022]. a, b, c, d, e,
f, g, h, i, j, k, l, m
[23] M. Katzlberger, Mit diesen Daten wurde GPT-3 trainiert – KI Sprachmodelle,
12 04 2021. [Online]. URL: →https://katzlberger.ai/2021/04/12/mit-diesen-daten-
wurde-gpt-3-trainiert/. [Zugriff am 12 05 2022]. →
[24] J. Devlin, and M.-W. Chang, Open Sourcing BERT: State-of-the-Art
Pretraining for Natural Language Processing, 02 11 2018. [Online]. URL:
→https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html.
[Zugriff am 14 01 2022]. a, b
[25] Hugging Face, Summary of the models, [Online]. URL:
→https://huggingface.co/docs/transformers/v4.15.0/en/model_summary.
[Zugriff am 15 01 2022]. a, b
[26] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, ELECTRA: Pre-training
Text Encoders as Discriminators Rather Than Generators, 2020. [Online]. URL:
→https://arxiv.org/abs/2003.10555. [Zugriff am 14 01 2022]. a, b, c, d
[27] EXPLOSION, Binary training format, [Online]. URL:
→https://spacy.io/api/data-formats. [Zugriff am 16 08 2022]. a, b, c
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, XLNet:
Generalized Autoregressive Pretraining, 2019. [Online]. URL:
→https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9e
e67cc69-Paper.pdf. [Zugriff am 12 05 2022]. →
[29] Geeks For Geeks, Overview of ROBERTa model, 26 11 2020. [Online]. URL:
→https://www.geeksforgeeks.org/overview-of-roberta-model/. [Zugriff am 12
05 2022]. a, b
[30] N. Reimers, and I. Gurevych, Sentence-BERT: Sentence Embeddings using
Siamese BERT-Networks, [Online]. URL: →https://arxiv.org/abs/1908.10084.
[Zugriff am 11 04 2022]. a, b, c, d, e, f, g, h, i
[31] Explosion, Industrial-strength natural language processing, IN PYTHON,
[Online]. URL: →https://spacy.io/. [Zugriff am 12 05 2022]. →
[32] Explosion, spaCy 101: Everything you need to know, [Online]. URL:
→https://spacy.io/usage/spacy-101. [Zugriff am 12 05 2022]. →
[33] Python Software Foundation, Python is powerful … and fast; plays well with
others; runs everywhere; is friendly & easy to learn; is Open., [Online]. URL:
→https://www.python.org/about/. →
[34] Exposion, Models & Languages, [Online]. URL:
→https://spacy.io/usage/models. [Zugriff am 12 05 2022]. a, b, c, d
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, and Ł.
Kaiser, Attention Is All You Need, 6 12 2017. [Online]. URL:
→https://arxiv.org/pdf/1706.03762.pdf. [Zugriff am 04 01 2022]. →
[36] Explosion, Facts & Figures, [Online]. URL: →https://spacy.io/usage/facts-
figures. [Zugriff am 04 03 2022]. a, b
[37] Explosion, Trained Models & Pipelines, [Online]. URL:
→https://spacy.io/models#design. [Zugriff am 04 03 2022]. a, b, c, d
[38] Explosion, Embeddings & Transformers, [Online]. URL:
→https://spacy.io/usage/embeddings-transformers. [Zugriff am 04 03 2022]. a,
b
[39] Technische Universität Darmstadt, Festlegen von Bewertungskriterien und
Erstellung eines Bewertungsrasters, [Online]. URL:
→https://www.einfachlehren.tu-
darmstadt.de/themensammlung/details_9536.de.jsp. [Zugriff am 08 05 2022]. a,
b, c, d, e, f, g, h
[40] J. Zornoza, Explainable Artificial Intelligence, 15 04 2020. [Online]. URL:
→https://towardsdatascience.com/explainable-artificial-intelligence-
14944563cc79. [Zugriff am 21 01 2022]. a, b
[41] M. T. Ribeiro, S. Singh, and C. Guestrin, Local Interpretable Model-Agnostic
Explanations (LIME): An Introduction, 12 08 2016. [Online]. URL:
→https://www.oreilly.com/content/introduction-to-local-interpretable-model-
agnostic-explanations-lime/. [Zugriff am 13 05 2022]. a, b, c, d, e
[42] K. Barkai, Interpreting an NLP model with LIME and SHAP, 09 05 2020.
[Online]. URL: →https://medium.com/@kalia_65609/interpreting-an-nlp-model-
with-lime-and-shap-834ccfa124e4. [Zugriff am 21 01 2022]. a, b, c, d
[43] scikit-learn developers, Decision Trees, [Online]. URL: →https://scikit-
learn.org/stable/modules/tree.html. [Zugriff am 17 03 2022]. →
[44] IBM, Logistic regression, [Online]. URL:
→https://www.ibm.com/topics/logistic-regression. [Zugriff am 13 05 2022]. →
[45] S. Glander, Künstliche Intelligenz und Erklärbarkeit. Informatik Aktuell, 22
05 2018. [Online]. URL: →https://www.informatik-aktuell.de/betrieb/kuenstliche-
intelligenz/kuenstliche-intelligenz-und-erklaerbarkeit.html. [Zugriff am 21 01
2022]. →
[46] S. M. Lundberg, and S.-I. Lee, A Unified Approach to Interpreting Model
Predictions, 2017. [Online]. URL:
→https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28
b67767-Paper.pdf. [Zugriff am 13 05 2022]. a, b, c, d
[47] S. Lundberg, et al., SHAP, 20 10 2021. [Online]. URL:
→https://github.com/slundberg/shap/blob/master/README.md. [Zugriff am 21
01 2022]. a, b, c, d, e, f, g, h, i
[48] K. Schroer, „Spieltheorie,“ [Online]. URL: →https://www.bwl-
lexikon.de/wiki/spieltheorie/. [Zugriff am 13 05 2022]. →
[49] M. Hind, D. Wei, M. Campbell, N. C. F. Codella, and A. Dhurandhar, TED:
Teaching AI to Explain its Decisions, 15 06 2019. [Online]. URL:
→https://arxiv.org/pdf/1811.04896.pdf. [Zugriff am 13 05 2022]. a, b, c, d, e, f
[50] IBM Research Trusted AI, Welcome to AI Explainability 360, [Online]. URL:
→http://aix360.mybluemix.net/resources. [Zugriff am 13 05 2022]. →
[51] IBM, About IBM, [Online]. URL: →https://www.ibm.com/about. [Zugriff am
13 05 2022]. →
[52] V. Arya, et al., AI Explainability 360 (v0.2.1), 31 08 2021. [Online]. URL:
→https://github.com/Trusted-AI/AIX360. [Zugriff am 21 01 2022]. a, b
[53] IBM Corporation Revision, AI Explainability 360, [Online]. URL:
→https://aix360.readthedocs.io/en/latest/index.html. [Zugriff am 13 05 2022]. a,
b
[54] B. Wallenborn, Entwicklung einer innovativen Autorenumgebung für die
universitäre Fernlehre, 2018. [Online]. URL: →https://ub-deposit.fernuni-
hagen.de/receive/mir_mods_00001428. [Zugriff am 23.12.21]. →
[55] Moodle.org, →https://rb.gy/7qhmw, [Zugriff am 30.7.23]. →
[56] Moodle Kurzanleitung für Trainer/innen,
→https://docs.moodle.org/403/de/Kurzanleitung_f%C3%BCr_Trainer/innen,
[Zugriff am 30.10.22]. →
[57] R. Srbecky, M. Winterhagen, B. Wallenborn, M. Then, B. Vu, W. Fraas, J.
Dettmers, and M. Hemmje, Towards a work task simulation supporting training
of work design skills during qualification-based learning. In: Proceedings of the
14th International Conference on Computer Supported Education, vol. 2, CSEDU,
ISBN 978-989-758-562-3, pp. 534–542, 2022, doi: 10.5220/0011072800003182. a, b
[58] M. Frangenberg, Implementierung einer Erweiterung für das Learning
Management System (LMS) Moodle zur Unterstützung von Spielen im
schulischen Bereich mit automatischer Zuweisung von Kompetenzen
entsprechend dem Spielerverhalten, Hagen: FernUniversität in Hagen, 2021. a,
b, c, d, e, f, g, h, i, j, k
[59] J. Dsouza, What is a GPU and do you need one in Deep Learning?, 25.4.2020.
[online]. URL: →https://towardsdatascience.com/what-is-a-gpu-and-do-you-
need-one-in-deep-learning-718b9597aa0d [Zugriff am 16.12.23]. →
[60] Canonical, Ltd., Ubuntu releases, 6 2023.
→https://wiki.ubuntu.com/Releases, [Zugriff am 16.06.2023]. →
[61] PCPlanet, “How to install python 3.9 on ubuntu 16.04 to 22.04.”
→https://pcplanet.ca/how-to-install-python39-on-ubuntu/, [Zugriff am
17.06.2023]. →
[62] Neo4j, Inc, Neo4j supported versions.
→https://neo4j.com/developer/kb/neo4j-supported-versions/, [Zugriff am
17.06.2023]. →
[63] Neo4j, Inc, Neo4j supported versions, →https://neo4j.com/docs/upgrade-
migration-guide/current/version-4/migration/, [Zugriff am 17.06.2023]. →
[64] L. Ogoti, “Why Python is Good for Machine Learning, “ 2021. [Online]. URL:
→https://www.section.io/engineering-education/why-python-is-good-for-
machine-learning/. [Zugriff am 21 04 2022]. →
[65] Explosion, spaCy Project: Demo NER in a new pipeline (Named Entity
Recognition), [Online]. URL:
→https://github.com/explosion/projects/tree/v3/pipelines/ner_demo. [Zugriff
am 25 08 2022]. →
[66] N. Reimers, sentence-transformers/stsb-xlm-r-multilingual, [Online]. URL:
→https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual.
[Zugriff am 09 08 2022]. a, b
[67] N. Corporation, Cuda toolkit, →https://developer.nvidia.com/cuda-toolkit,
[Zugriff am 28.05.2023]. →
[68] N. Corporation, Nvidia cuda installation guide for linux,
→https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html, zuletzt
besucht am 28.05.2023. →
[69] Preferred Networks, Inc., Cupy, →https://cupy.dev/, [Zugriff am
16.06.2023]. →
[70] ExplosionAI GmbH, spacy project: Predicting whether a github issue is
about documentation (text classification),
→https://github.com/explosion/projects/tree/v3/tutorials/textcat_docs_issues,
zuletzt besucht am 16.06.2023. →
[71] ExplosionAI GmbH, Commad line interface, →https://v2.spacy.io/api/cli,
[Zugriff am 16.06.2023]. →
[72] C. Wilson, User Interface Inspection Methods: A User-Centered Design
Method, 1st edition, Morgan Kaufmann Publishers Inc, 2013. a, b, c
[73] F. Bühler, Implementierung eines KI-basierten Algorithmus zur
automatisierten Bewertung von Freitextantworten hinsichtlich der erreichten
Kompetenzen, Hagen: Fernuniversität in Hagen, 2022. a, b, c, d, e, f, g
[74] J. Schuljak, Implementierung einer Explainable AI Komponente zur
Erklärung von automatisierten Bewertungen von Freitextantworten, Hagen:
Fernuniversität in Hagen, 2023. a, b, c, d, e, f, g, h, i, j
15 Toward a skill-centered qualification ontology
supporting data mining of human resources in
knowledge-based enterprise process representations
Adrian Vogler
Binh Vu
Tobias Vogel
Benjamin Gernhardt
Matthias Hemmje
Abstract
In the evolving context of Industry 4.0 and the Fourth Industrial Revolution,
representing and managing process-based big data sources, e.g., optimizing
human resource (HR) assignment in production planning is critical. This research
introduces a HR ontology (FPHR ontology) integrated into knowledge-based
production planning (KPP) methods and automating the utilization of employee
skills in planning, amidst vast industrial big data landscapes. Utilizing semantic
techniques and the ESCO ontology, the FPHR ontology enhances KPP, enabling a
sophisticated consideration of competences and skills. The ontology supports
broad knowledge and skill domains, addressing digitalization challenges and
aiding the upskilling and re-skilling paradigm, highlighting the significance of
semantic representation and user-centric information system design for future
intelligent production planning and the corresponding HR management.
Keywords: knowledge-based production planning (KPP), human resource
ontology (FPHR ontology), Industry 4.0, big data, employee skills and
competencies, semantic web technologies,
15.1 Introduction
In the era of Industry 4.0 and the Fourth Industrial Revolution, the exponential
growth in data generation has presented organizations across industries –
including e-commerce, healthcare, and internal organizational functions like
human resources, sales, and production – with the critical challenge of big data
management and big data mining in order to support industrial process
management. This vast influx of structured and unstructured process-based
data necessitates advanced solutions for extracting actionable insights and
optimizing decision-making processes. The imperative to sift through massive
process-based datasets to identify patterns, trends, and correlations has
propelled the need for sophisticated knowledge representation, data mining,
and information extraction methods. This chapter introduces a skill-centered
qualification ontology designed for human resource management, aiming to
make sense of big data by facilitating data mining related to human resource
skills, thereby allowing for process-oriented resource assignment, informed
decision-making, and process optimization within this context.
The Chair of Multimedia and Internet Applications at the University of
Hagen, Faculty of Mathematics and Computer Science together with its affiliated
institute FTK participated in the SHAMAN project. SHAMAN, or Sustaining
Heritage Access through Multivalent ArchiviNg [→1], was a European-funded
project that focused on the representation, documentation, management, mining, and
archiving of industrial processes. However, it offered no semantic, machine-readable
process representation and no representation of human resources in these
processes. Knowledge-based and process-oriented innovation management
(German: Wissensbasiertes und Prozessorientiertes Innovations-Management,
WPIM), as described by Vogel [→2], was one of the results of the SHAMAN
project. WPIM semantically represents and annotates process structures,
including process resources, thus providing machine-readability. However, WPIM did
not address the representation of human resources in these process
representations. Furthermore, the European Commission-funded collaborative
adaptive production process planning (CAPP-4-SMEs) project [→3] aimed to
enhance the competitiveness of European companies, particularly SMEs, in a
sustainable distributed manufacturing environment by focusing on automating
innovative process planning and control. In addition, building on the WPIM
results, Gernhardt [→4, →5, →6] initially developed knowledge-based
production planning (KPP) as a follow-up PhD project of CAPP-4-SMEs [→7]. In
CAPP-4-SMEs, too, no semantic representation of human resources within
industrial processes was addressed. KPP focuses on knowledge-based planning
of complex production processes that involve various substeps, locations, and
companies. The aim is to semantically represent and make the process-
associated knowledge machine-readable. KPP has been developed and refined
in recent years to address the challenges of modern production environments
and the collaboration of small and medium-sized enterprises (SMEs). The
foundation of KPP is semantic knowledge representation, which enables flexible,
efficient, and effective support for CAPP. The integration of CAPP knowledge into
machine-readable process formalizations is a key factor in making such
knowledge accessible in cloud-based knowledge repositories and usable for
various SMEs [→4, →5, →6]. So far, KPP has focused on technical resources in
processes, while human resources, especially human resource descriptions
related to qualifications such as certified competences and skills, have been
considered only rudimentarily.
This chapter introduces a skill-centered qualification ontology aimed at
human resource management, specifically designed to navigate and mine the
complexities of big data produced by industrial process management and
related human resource assignments. It seeks to offer a structured approach to
understanding and mining vast process-oriented datasets, enabling
organizations to optimize their human resource assignment and development
planning, including the related decision-making processes, in
alignment with the economic demands of Industry 4.0. Therefore, the motivation of
the work described in this chapter is built on the following research gaps: There
is no functional and process-oriented human resource ontology (FPHR ontology)
and no human resource qualification profile ontology (HRQP ontology) for the
semantic description of human resource functions and qualifications in
industrial processes. To support the FPHR ontology and the HRQP ontology,
three core problems exist: problem statement 1 (PS1): KPP cannot describe
human resource functions, although they are necessary for complete industrial
process planning. Problem statement 2 (PS2): KPP cannot describe human
resource qualifications, although they are necessary for complete industrial
process planning. Problem statement 3 (PS3): The system cannot automatically
use human resource functions and qualifications in the production planning and
optimization process, even if described using FPHR ontology and HRQP
ontology. The research questions (RQs) derived from the three problems are as
follows: research question 1 (RQ1): How can an FPHR ontology be derived and
represented in a machine-readable manner? Research question 2 (RQ2): How
can an HRQP ontology be derived and represented in a machine-readable
manner? Research question 3 (RQ3): How can the production planning system
be extended to automatically consider human resource functions and
qualifications as resources in the planning process? To solve these RQs, the work
of Vogler [→8] addresses the topic “Ontology creation and ontology integration
as well as exemplary evaluation in a knowledge-based software development
process.” Therefore, the remainder of this chapter is based on Vogler [→8], with
some additions related to qualification-based learning (QBL) in the area of state
of the art in Section 15.2.
The methodological approach in Vogler [→8] is based on Nunamaker et al.
[→9]. This means that, first, the so-called observation goals (OG) are derived from the
three RQs. For RQ1, which concerns the representation of the FPHR ontology,
the OGs focus on analyzing existing methods for skills and qualifications
management, encompassing classifications, taxonomies, and ontologies. In the
context of RQ2, related to the HRQP ontology, the OG is to explore concepts and
models based on qualification-driven learning. For RQ3, centered on the
incorporation of skills into production planning, the OG is to grasp the concepts
of knowledge-based planning, particularly WPIM and KPP. Regarding the so-
called theory-building goals (TG), the aim is to devise a conceptual information
model, tailored for human resource management, grounded in a thorough
analysis of user requirements. The TG is to design the FPHR ontology, drawing
upon established best practices in ontology engineering. Additionally,
conceptualizing the necessary extensions in the KPP system to assimilate the
FPHR ontology is a TG. One of the so-called systems-development goals (SG) is to
implement a prototype of the FPHR ontology, leveraging semantic web
technologies. Another SG is to enhance the KPP system at the database,
backend, and frontend strata to seamlessly integrate the FPHR ontology.
Furthermore, it is an SG to actualize key user requirements pinpointed during
the conceptual phase into the final implementation. In terms of the so-called
experimentation goals (EG), the priority lies in evaluating the user-friendliness and
utility of the FPHR ontology, along with its integration into KPP. The process will
involve pinpointing further refinements and augmentations based on expert
feedback. Conclusively, an EG is to put forth recommendations to refine the
approach for future endeavors.
Therefore, this chapter describes the results of the literature research
corresponding to the OGs of these questions, which are presented in the
state-of-the-art section (Section 15.2). This is followed by the description of the results of the
conceptual modeling of the solution based on the TGs in Section 15.3 and the
description of the results of the implementation of proof-of-concept prototypes
(SGs) in Section 15.4 as well as the description of preliminary evaluation results
(EGs) in Section 15.5, closing with an outlook in Section 15.6.
Figure 15.2: Use context model for the KPP human resource management,
including three relevant KPP use cases, exemplarily selected from [→8].
Figure 15.3: Details of the exemplarily selected use case, “Acquire Personnel”
from [→8].
Within the use context of the exemplarily selected KPP human resource
management use cases, the system necessitates an information model. This
model captures and represents real-world entities relevant to the application
scenarios, such as main processes, process instances, roles, employees, their
annotated skills, experience levels, professions with their associated skills, and
qualification measures detailing imparted skills. →Figure 15.4 shows part of
the schema of the information model, consisting of the classes Person, Process,
Profession, and QualificationMeasure, which share attributes, and Skills, which is of
type string.
Figure 15.4: Excerpt of the KPP human resource information model of the KPP
human resource management context.
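To make the information model tangible, the following minimal sketch expresses the classes from →Figure 15.4 as plain Python data classes. The class and attribute names are illustrative simplifications rather than the actual KPP schema, and the skills attributes are deliberately kept as plain strings to show the limitation that the FPHR ontology later removes; a simple skill-gap helper is included because the enhanced KPP is intended to determine discrepancies between the skills of processes and individuals.

# Minimal sketch (not the actual KPP schema): the classes from Figure 15.4
# expressed as Python dataclasses. Skills are plain strings here, which is
# exactly the limitation the FPHR ontology addresses by providing a
# controlled vocabulary for the "skills" attribute.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Profession:
    name: str
    skills: List[str] = field(default_factory=list)      # e.g. ["Java", "SCRUM"]

@dataclass
class QualificationMeasure:
    name: str
    imparted_skills: List[str] = field(default_factory=list)

@dataclass
class Person:
    name: str
    profession: Profession
    skills: List[str] = field(default_factory=list)       # annotated skills
    experience_level: str = "junior"                       # illustrative attribute

@dataclass
class Process:
    name: str
    required_skills: List[str] = field(default_factory=list)

def skill_gap(process: Process, person: Person) -> List[str]:
    """Return the skills required by the process that the person lacks."""
    return sorted(set(process.required_skills) - set(person.skills))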
Thus, for the KPP human resource (KPP HR) information model, the ontology is
needed because the FPHR ontology defines the vocabulary for the attribute
skills, which otherwise contains only plain text. Therefore, the FPHR ontology is
part of the KPP HR information model too. To ensure a robust and methodical
development of this ontology, the process is based on Noy and McGuinness
[→20]. Initially, the scope of the ontology is meticulously defined. This involves
determining the precise domain and pinpointing the target audience, which
encompasses executives, project leaders, HR managers, and specialists. Integral
to this phase is the articulation of key competency questions that the ontology
should adeptly address, offering clarity and direction. Before diving into the
creation of a new ontology, it is crucial to explore the possibility of reusing
existing ontologies. This not only offers potential time and cost savings but also
leverages the validation and tools associated with established ontologies. In this
context, the ESCO ontology emerges as a promising candidate for reuse and
further refinement. Next, the ontology’s foundational terms are enumerated.
These represent the core concepts that the ontology seeks to encapsulate,
including but not limited to professions, skills, roles, processes, and
qualifications. This list serves as the bedrock upon which subsequent steps are
built. Defining classes is an essential step in the ontology creation process. In
essence, classes are collections of objects that share common properties. The
intricacies of class definition are not just limited to identifying these classes but
also entail crafting a hierarchical relationship amongst them, ensuring clarity
and logical flow. Furthermore, properties and attributes within each class are
meticulously defined, ensuring each class is comprehensive and robust. This task
borrows insights from the previously discussed information model. For
instance, in this context, a profession might be an attribute of the Employee
class, and this profession, in turn, might possess a list of associated skills. Laying
down conditions for properties and attributes is the subsequent step. These
conditions stipulate the criteria that properties and attributes must satisfy,
adding another layer of specificity to the ontology. Finally, for the FPHR ontology
to be pragmatically useful, instances for the defined classes need to be created.
This phase is particularly labor-intensive, demanding rigorous quality assurance.
As of the research conducted in August 2022, the ESCO ontology comprised
13,890 instances of skills and 3,008 instances of professions, all intricately linked.
This ensures that user requirements, such as displaying skills associated with a
selected profession, are seamlessly met. The FPHR ontology is built based on
two main components: the ESCO ontology as upper ontology and the existing
KPP ontology itself. Within the scope of the FPHR ontology design, necessary
expansions are made, based on a detailed analysis of the existing KPP ontology.
A special focus is placed on potential heterogeneity conflicts that might arise at
different levels, such as data model level, data schema level, and data instance
level. The “Skill” class (→Figure 15.5) is then analyzed in detail to identify possible
conflicts between the KPP and ESCO ontologies. Two further classes,
“Occupation” and “Qualification,” previously only present in the ESCO ontology,
are also detailed to identify aspects of these classes that should be incorporated
into the new FPHR ontology.
Figure 15.5: The Skill class, as modeled in KPP (excerpt from [→8]).
Figure 15.6: Model of ESCO ontology with the Skill class belonging to Member
Concept from [→8].
Figure 15.7: Excerpt of the hierarchy of skills from the ESCO Ontology from
[→17].
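One of the user requirements and competency questions mentioned above is to display the skills associated with a selected profession. The following sketch illustrates, under stated assumptions, how such a question could be answered with a SPARQL query over a locally downloaded ESCO RDF dump using rdflib; the file name and the esco: class and property IRIs are assumptions for illustration and do not reproduce the authoritative ESCO data model.

# Illustrative sketch only: answering the competency question "which skills are
# associated with a selected occupation?" against a locally downloaded ESCO RDF
# dump using rdflib. The file name and the esco: class/property IRIs below are
# assumptions for illustration, not the authoritative ESCO model.
from rdflib import Graph

g = Graph()
g.parse("esco_occupations_skills.ttl", format="turtle")  # hypothetical local dump

QUERY = """
PREFIX esco: <http://data.europa.eu/esco/model#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?occupationLabel ?skillLabel WHERE {
    ?occupation a esco:Occupation ;
                skos:prefLabel ?occupationLabel ;
                esco:relatedEssentialSkill ?skill .   # assumed property name
    ?skill skos:prefLabel ?skillLabel .
    FILTER(LANG(?occupationLabel) = "en" && LANG(?skillLabel) = "en")
    FILTER(CONTAINS(LCASE(STR(?occupationLabel)), "software developer"))
}
LIMIT 20
"""

for row in g.query(QUERY):
    print(f"{row.occupationLabel}  ->  {row.skillLabel}")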
The architecture section outlines the design of use cases with respect to user
interfaces and the underlying components, particularly in the backend and data
storage layers. The existing KPP user interface raises questions about the
components necessary for implementing the human resource management use
cases, the components that require adjustments, and the ones that need to be
newly created. Furthermore, with respect to the existing KPP backend and data
storage components, there are inquiries about the services required within the
KPP’s backend API for the use cases, the services that might need modifications,
and those that should be newly developed.
Figure 15.8: Component model of the KPP skill management system.
Figure 15.9: Components that are part of the “annotate skills” use case.
Upon its enhancement, KPP will provide various tagging, searching, and
mining functionalities regarding employee skills and processes, all drawing upon
the ESCO ontology (→Figures 15.6 and →15.7). Furthermore, it will offer
capabilities to determine discrepancies between the skills required by processes
and those held by individuals.
Figure 15.10: Draft of the new component “annotate skills” for the KPP user
interface.
Figures 15.8–15.10 show the component model of the KPP skill management
system, highlighting the components that are part of the “Annotate Skills” use
case. →Figure 15.11 shows the architecture of the web application after the
integration of the skill management system (SMS). The skills ontology
(“SkillsOntology”) is newly introduced and is central to the customization of
processes, roles, and people. The new “OntologyImportComponent” component
imports data into the skills ontology. In addition, a new “Jobs” component is
added, which also originates from the skills ontology and has annotated skills.
The existing relationships are retained, although the connection to the skills is
now made via the skills ontology.
Figure 15.11: KPP architecture with the SMS extension.
Regarding the frontend, the Apache web server is also used to deliver the user
interface, and JavaScript and PHP are again used to create the user interfaces. In the
database layer, the models created in RDF/RDFS are implemented in the Neo4j Graph
Database [→37], a technology specifically chosen for the representation of
ontologies in KPP. →Figure 15.12 shows the technical architecture of the KPP
application, while →Figure 15.13 shows the communication flow in the system. To
store the ontology within KPP, an extension to the Neo4j implementation is
required. While the MySQL database [→38] stores non-domain-relevant KPP data
such as user information, the Neo4j database hosts all domain-relevant KPP
data. Adaptations are required here to meet user requirements, focusing on the
aspects essential to these requirements. To address the development of RQ1,
the FPHR ontology (→Figure 15.14) is created, based on the existing KPP ontology
and extended by the ESCO ontology around occupations, skills, and
qualifications.
In terms of classes, the FPHR ontology introduces new entities that are created
in Protégé. This concerns the classes “Occupation” and “Qualification.” In
addition, the class “esco:Skill” is defined as the equivalent of the class “Skill.”
The relationships between these classes are defined as “Object Properties.” An
example of such a relationship is the connection between “Occupation” and
“Skill.” As far as instances are concerned, entries are added for both
“Occupation” and “Skill” via the ESCO ontology. Documentation is done with the
tool “Wizard for Documenting Ontologies” (WIDOCO) [→39]. This tool creates an
online version of the FPHR ontology. After starting WIDOCO, a template for the
documentation is first selected. Then, based on the FPHR ontology created in
Protégé [→40], WIDOCO generates a series of linked HTML files containing all
the necessary information about the ontology. To address RQ3, the FPHR
ontology from the previous section was integrated into KPP, whereby special
adaptations and extensions had to be carried out.
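The class and property declarations described above can be illustrated, in a hedged way, with the following minimal sketch, which generates the corresponding OWL statements programmatically with rdflib rather than interactively in Protégé; the fphr: base IRI and the requiresSkill property name are assumptions for illustration, not the published FPHR ontology.

# Minimal sketch (assumed namespaces, not the published FPHR ontology file):
# declaring the new FPHR classes and the Occupation-Skill object property in
# OWL via rdflib, mirroring what is modeled interactively in Protégé.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

FPHR = Namespace("http://example.org/fphr#")        # hypothetical base IRI
ESCO = Namespace("http://data.europa.eu/esco/model#")

g = Graph()
g.bind("fphr", FPHR)
g.bind("esco", ESCO)

# New classes introduced by the FPHR ontology
for cls in (FPHR.Occupation, FPHR.Qualification, FPHR.Skill):
    g.add((cls, RDF.type, OWL.Class))

# "esco:Skill" is declared equivalent to the KPP/FPHR "Skill" class
g.add((FPHR.Skill, OWL.equivalentClass, ESCO.Skill))

# Object property connecting Occupation and Skill
g.add((FPHR.requiresSkill, RDF.type, OWL.ObjectProperty))   # assumed property name
g.add((FPHR.requiresSkill, RDFS.domain, FPHR.Occupation))
g.add((FPHR.requiresSkill, RDFS.range, FPHR.Skill))

print(g.serialize(format="turtle"))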
A query library will be set up to provide different queries in KPP to meet user
needs in the context of human resource data mining and human resource
assignment applications as well as upskilling and re-skilling applications related
to human resource development and overall process optimization. An evaluation
will be conducted to determine whether these implementations are user-friendly
and what further adjustments or enhancements should be made. In this section,
the necessary adaptations and extensions of KPP, based on the
conceptualization from the previous section, were described. These changes
were implemented to fulfil user requirements using the FPHR ontology. The
challenges identified and modeled in the theoretical part were addressed in the
practical part of our research. Each model discussed was translated into a
corresponding prototype to demonstrate its practical applicability and
effectiveness in our specific context. This process of prototyping not only served
to validate our theoretical assumptions, but also to broaden our understanding
of the practical requirements.
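As an illustration of what an entry in the query library mentioned above could look like, the following sketch runs a predefined Cypher query through the official Neo4j Python driver to find occupations that require a given ICT skill; the node labels, relationship type, property names, and connection credentials are assumptions rather than the actual KPP graph schema. Wrapping predefined queries behind such a library also shields end users from writing Cypher directly, which the evaluation below identifies as a potential barrier to access.

# Illustrative sketch of a query-library entry (labels, relationship types, and
# credentials are assumptions, not the actual KPP graph schema): find
# occupations that require a given ICT skill, using Cypher via the official
# Neo4j Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

FIND_OCCUPATIONS_BY_SKILL = """
MATCH (o:Occupation)-[:REQUIRES_SKILL]->(s:Skill)
WHERE toLower(s.prefLabel) CONTAINS toLower($skill)
RETURN o.prefLabel AS occupation, collect(s.prefLabel) AS matching_skills
ORDER BY occupation
"""

def occupations_requiring(skill: str):
    """Run one predefined query from the library and return plain dicts."""
    with driver.session() as session:
        result = session.run(FIND_OCCUPATIONS_BY_SKILL, skill=skill)
        return [record.data() for record in result]

if __name__ == "__main__":
    for row in occupations_requiring("Python"):
        print(row["occupation"], "->", row["matching_skills"])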
In this chapter, a careful implementation of the previously designed models
for the human resource management system was carried out, specifically
tailored to the requirements of the software industry. The FPHR ontology was
developed based on the basic use context model, which defined the area of
application and user needs. This ontology was created based on the KPP
ontology and enriched by integrating the ESCO ontology to capture occupation
and skill data (→Figure 15.16). The instance data for occupations and skills was
then imported from the ESCO ontology into the Neo4j graph database, enabling
a precise and comprehensive representation of these key elements in the
system. At the same time, the components, services, and queries required for
the integration of the FPHR ontology into the KPP system were newly created or
modified. These adaptations, as part of the KPP integration, included both
backend and frontend developments that were tailored to the specific
requirements of the system. The component model and the user interface
model played a central role here, as they guided the design of the functional
components of the system and the design of the user interfaces. The
architecture model, which describes the entire system architecture, including the
backend and data storage levels, formed the framework for these
implementations. The combination of all these elements resulted in a coherent
and functional human resource management system that fulfils the specific
requirements of the software industry. Each of these models contributed in its
own way to the realization of the prototype and made it possible to develop a
comprehensive and integrated system that could be tested and refined in the
subsequent evaluation phase. At the use case level, those use cases that lie on the
critical path of the application were implemented. Following the development of
prototypes based on our theoretical models, we are now progressing to the
evaluation phase of our research. This phase will allow us to assess the
appropriateness of the implemented solutions. The evaluation will be conducted
through expert interviews and application tests to ensure a comprehensive
assessment of the practicability of our research results.
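To indicate how the ESCO instance import mentioned above could look in principle, the following minimal sketch reads an assumed ESCO occupation-skill relations CSV and upserts it into Neo4j with MERGE statements; the file name, column names, node labels, and credentials are illustrative assumptions based on the publicly downloadable ESCO CSV files, not the actual KPP import component.

# Minimal sketch only (file name, CSV columns, and graph schema are assumptions
# based on the ESCO CSV downloads, not the actual KPP import component): loading
# ESCO occupation-skill relations into Neo4j as nodes and relationships.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MERGE_RELATION = """
MERGE (o:Occupation {conceptUri: $occupation_uri})
  ON CREATE SET o.prefLabel = $occupation_label
MERGE (s:Skill {conceptUri: $skill_uri})
  ON CREATE SET s.prefLabel = $skill_label
MERGE (o)-[:REQUIRES_SKILL]->(s)
"""

def import_occupation_skill_relations(csv_path: str) -> None:
    """Read an assumed ESCO relations CSV and upsert it into the graph."""
    with open(csv_path, newline="", encoding="utf-8") as f, driver.session() as session:
        for row in csv.DictReader(f):
            session.run(
                MERGE_RELATION,
                occupation_uri=row["occupationUri"],
                occupation_label=row["occupationLabel"],
                skill_uri=row["skillUri"],
                skill_label=row["skillLabel"],
            )

if __name__ == "__main__":
    import_occupation_skill_relations("esco_occupation_skill_relations.csv")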
15.5 Preliminary evaluation results
When it comes to the methodological approach of evaluation, different
categories of evaluation exist, including formative [→41, p. 19], quantitative
[→41, p. 15], and qualitative [→41, p. 15] approaches. Quantitative and
qualitative evaluations focus on aspects such as robustness, freedom from
errors, runtime behavior, and usability of a software system. These methods are
suitable for a more mature software version. Formative evaluation, on the other
hand, can be applied at an earlier stage in the development of a software
system. The prototypical implementation considered in this chapter is a proof-of-
concept evaluation, for which formative evaluation is particularly suitable. The
aim of this evaluation is to check whether the implementation meets the
expectations of the users. Key questions include whether the functionality of the
application is complete and useful and whether there are functionalities that the
user does not find or does not need. One method of formative evaluation is the
cognitive walkthrough [→42]. Here, the application is mentally played through,
focusing on the potential tasks of a user. There are various approaches to a
cognitive walkthrough, including the empirical study and the cognitive
walkthrough with expert users. In the empirical study, representative user
groups are interviewed, while in the expert cognitive walkthrough, domain
experts are interviewed individually or gathered in an expert roundtable. For the
current evaluation, the cognitive walkthrough with expert users was chosen.
Experts from different domains, such as process management, human
resources, project management, and informatics, were asked in interviews about
the usability of the application. For the research goals 1.4 and 2.4, interviews
were conducted with six experts from these disciplines to explore different aspects
of the prototype implementation. First, for the evaluation of the KPP integration,
the use cases that form the core functionality of the system were selected based
on a critical path analysis. These use cases are essential to identify and
understand specific functional areas and requirements of the software and serve
as the basis for the subsequent cognitive walkthrough, for example, the use case
UC1: modeling processes. This is about the software’s ability to design and
represent processes. This is fundamental to visualize and structure work
processes within the system. Another example is the use case UC2: annotate
capabilities. Here, specific skills are assigned to specific processes or tasks. This
makes it possible to define and understand the requirements and qualifications
for specific activities within a process. A knowledge-based software development
process was introduced for evaluation purposes too. This process was described
in a diploma thesis entitled “Compatibility and Extensibility of WPIM with
Methods, Models and Processes in Software Development Projects” by [Baue12].
The process presented there was used as a basis for the usability
assessment in this chapter and adapted in parts for the KPP integration.
A central element is the introduction of a new main process in KPP, which
was modeled with the integrated BPMN process editor [→43]. This knowledge-
based software development process refers to the V-Modell XT [→44], a widely
used process model for the development of software projects. The V-Modell XT
serves as a methodological framework and provides clear phases and steps for
software development. →Figure 15.17 visualizes this knowledge-based software
development process, based on the V-Modell XT. This gives users a clear idea of
the different stages and aspects of the process.
The focus of this walkthrough was on the use case “UC2: annotate capabilities,”
a core functionality of the system. This is a central process, where users can add
specific capabilities to processes or tasks and define what experience level is
needed for these capabilities. At the beginning of the walkthrough, the user
starts the KPP application and performs a login process. After a successful login,
the user navigates to the “query library,” a tool that provides an overview of the
different query options. Another important area is the process overview, where
users can create and edit processes and annotate the associated capabilities.
The next step is to use the process editor. Here, users can select a specific
process, view detailed information about it, and add the necessary capabilities.
This process is interactive and allows users to customize the capabilities as
needed. A special focus is on the integration of KPP’s “query library.” Here, users
can run specific queries, e.g., to identify occupations that require certain ICT
skills. In this way, suitable human resource assignments can be performed,
e.g., during production process planning. Furthermore, data mining can
support the identification of human resources who need upskilling or re-skilling
in order to be available for optimized human resource assignments, optimizing
process management on the one hand and human resource development on the
other. After selecting and running a
query, the results are clearly presented to the user. Finally, the KPP integration
was presented. In this section, users can view occupations and their associated
skills. By selecting a specific occupation, the user can filter and view the specific
skills associated with that occupation.
Expert evaluation was based on criteria such as the helpfulness,
comprehensibility, usability, and completeness of the respective functions.
Among the many scenarios and use cases evaluated, the “home – overview,” for
example, was consistently rated positively in all scenarios. In contrast, the
“processes – editor – annotate capabilities” function was found to be helpful and
understandable but did not score fully in other categories. The expert feedback
showed that the scenarios presented were generally well supported. However,
one standout issue was the use of the “Cypher” language for queries, which was
seen as a potential barrier to access. Nevertheless, the graphical annotation of
capabilities in a process editor was highlighted and appreciated by almost all
experts. However, they wished for better support in selecting capabilities from
an extensive list to optimize the user experience. In addition to the existing
features, the experts gave indications of other desired functions. These included
the ability to switch between different languages, specific career planning
features, and options to export search results in standard formats such as RDF.
Another interesting suggestion was the introduction of a guided mode for
related work steps, also called “wizard mode.”
In Section 15.5, the implemented models for the human resource
management, development, and optimization system were evaluated through
six cognitive walkthroughs, engaging experts from domains such as process
management, human resources, project management, and informatics. This
evaluation aimed to assess the integration and usability of the FPHR ontology
within the KPP system. Key use cases, notably “modeling processes” and
“annotating capabilities,” were scrutinized to evaluate the system’s effectiveness
in structuring work processes and assigning skills to tasks. The usability of the
query library and the process editor was thoroughly assessed, collecting
feedback on the system’s overall functionality and user interface design. Experts
appreciated the system’s strengths, especially the integration of the FPHR
ontology and the graphical representation of capabilities. However, they also
pointed out areas needing improvement, such as the use of the “Cypher”
language for queries and the necessity for more intuitive selection mechanisms
for capabilities from extensive lists. Based on the feedback from the evaluation,
certain enhancements have already been implemented. These include the
creation of additional queries and their incorporation into the query library,
thereby enriching the system’s capabilities and making it more user-friendly.
Other suggested improvements, such as integrating functionalities directly into
the user interface for a more seamless user experience, are considered future
work. Overall, the evaluation provided valuable insights into the system’s
usability, identifying areas for improvement, and confirming the effective
integration of the FPHR ontology into the KPP system.
References
[1] Cordis, Sustaining Heritage Access through Multivalent ArchiviNg | SHAMAN
Project | Fact Sheet | FP7 | CORDIS | European Commission. [Online] Available:
→https://cordis.europa.eu/project/id/216736/de. Accessed on: May 12 2023. →
[2] T. Vogel, Wissensbasiertes und Prozessorientiertes Innovationsmanagement
WPIM: Innovationsszenarien, Anforderungen, Modell und Methode,
Implementierung und Evaluierung anhand der Innovationsfähigkeit fertigender
Unternehmen. Dissertation, Hagen: FernUniversität Hagen, 2012. a, b, c
[3] Cordis, Collaborative and Adaptive Process Planning for Sustainable
Manufacturing Environments: CAPP-4-SMES Project. [Online] Available:
→https://cordis.europa.eu/project/id/314024. Accessed on: Mar. 21 2023. →
[4] B. Gernhardt, A semantic representation for process-oriented knowledge
management based on function block domain models supporting distributed
and collaborative production planning. 2015. a, b
[5] B. Gernhardt, Knowledge-based production planning within the reference
planning process supporting manufacturing change management, 2016. a, b
[6] B. Gernhardt, T. Vogel, L. Wang, and M. Hemmje, “Implementation of a
Knowledge-Based Production Planning Including a Direct Manipulative Process
Editor and a Mediator Architecture,” 2017. a, b
[7] EFFRA, CAPP-4-SMES Collaborative and Adaptive Process Planning for
Sustainable Manufacturing Environments | EFFRA Innovation Portal:
Kollaborative und adaptive Prozessplanung für nachhaltige
Fertigungsumgebungen. [Online] Available:
→https://portal.effra.eu/project/1009. Accessed on: Jan. 26 2022. a, b
[8] A. Vogler, Qualifikations-Ontologie-Erstellung und Ontologie-Integration
sowie exemplarische Evaluation an einem wissensbasierten Software-
Entwicklungs-Prozess, Master's thesis, Multimedia und Internetanwendungen,
Hagen: FernUniversität Hagen, 2022. a, b, c, d, e, f, g, h, i, j, k
[9] J. F. Nunamaker Jr., M. Chen, and T. D. M. Purdin, Systems Development in
Information Systems Research, 1990. →
[10] W3C, RDF 1.1 XML Syntax. [Online] Available: →https://www.w3.org/TR/rdf-
syntax-grammar/. Accessed on: Jan. 15 2024. →
[11] ISO, ISO 8879:1986. [Online] Available:
→https://www.iso.org/standard/16387.html. Accessed on: Jan. 15 2024. →
[12] W3C, W3C. [Online] Available: →https://www.w3.org/. Accessed on: Jan. 15
2024. →
[13] M. Hemmje, 01877 KE3: Einführung in XML, DTD und XSD: 01877 Daten-,
Dokumenten-, Informations- und Wissensmanagement im Internet, Hagen:
Kurseinheit, FernUniversität Hagen, 2021. →
[14] W3C, RDF – Semantic Web Standards. [Online] Available:
→https://www.w3.org/RDF/. Accessed on: Jan. 09 2022. →
[15] M. Hemmje, Kurseinheit 3: Konzepte von XML: 1873 Daten- und
Dokumentenmanagement im Internet, Hagen: Kurseinheit, FernUniversität
Hagen, 2018. →
[16] W3C, RDF Schema 1.1. [Online] Available: →https://www.w3.org/TR/rdf-
schema/. Accessed on: Jan. 09 2022. →
[17] W3C, SPARQL Query Language for RDF. [Online] Available:
→https://www.w3.org/TR/rdf-sparql-query/. Accessed on: Jan. 09 2022. a, b
[18] ISO, ISO/IEC 9075-1:2023(en), Information technology – Database
languages SQL – Part 1: Framework (SQL/Framework). [Online] Available:
→https://www.iso.org/obp/ui/en/#iso:std:iso-iec:9075:-1:ed-6:v1:en. Accessed
on: Jan. 15 2024. →
[19] W3C, OWL Web Ontology Language Overview. [Online] Available:
→https://www.w3.org/TR/2004/REC-owl-features-20040210/. Accessed on: Jan.
09 2022. a, b
[20] R. Kuhlen, Informationsethik: Umgang mit Wissen und Information in
elektronischen Räumen. Konstanz: UVK Verl.-Ges, 2004. a, b, c
[21] H. Sack, 3.2 Ontology Types. →
[22] ISCO, ISCO – International Standard Classification of Occupations. [Online]
Available: →https://www.ilo.org/public/english/bureau/stat/isco/. Accessed on:
Jan. 15 2024. →
[23] National Occupational Classification (NOC) 2021 Version 1.0. [Online]
Available: →https://www.statcan.gc.ca/en/subjects/standard/noc/2021/indexV1.
Accessed on: Jan. 15 2024. →
[24] DISCO II Portal. [Online] Available: →http://disco-tools.eu/disco2_portal/.
Accessed on: Jan. 15 2024. →
[25] LDCI, About – Linked Data Competency Index. [Online] Available:
→https://dcmi.github.io/ldci/. Accessed on: Jan. 15 2024. →
[26] European Commission, European e-Competence Framework (e-CF).
[Online] Available: →https://esco.ec.europa.eu/en/about-
esco/escopedia/escopedia/european-e-competence-framework-e-cf. Accessed
on: Jan. 15 2024. →
[27] European Commission, European Digital Competence Framework for
Citizens (DigComp). [Online] Available: →https://esco.ec.europa.eu/en/about-
esco/escopedia/escopedia/european-digital-competence-framework-citizens-
digcomp. Accessed on: Jan. 15 2024. →
[28] Onetcenter.org, O✶NET-SOC Taxonomy at O✶NET Resource Center. [Online]
Available: →https://www.onetcenter.org/taxonomy.html. Accessed on: Mar. 10
2022. →
[29] European Commission, The ESCO ontology. [Online] Available:
→https://webgate.acceptance.ec.europa.eu/esco/lod/static/model.html.
Accessed on: Jan. 10 2022. →
[30] M. Then, Supporting Qualifications-Based Learning (QBL) in a Higher
Education Institution’s IT-Infrastructure, Hagen: FernUniversität 2020. a, b
[31] M. Then, Supporting Qualifications-Based Learning (QBL) in a Higher
Education Institution’s IT-Infrastructure, MyCoRe Community, 2020. a, b
[32] OMG, UML Resources | Unified Modeling Language. [Online] Available:
→https://www.uml.org/resource-hub.htm. Accessed on: Nov. 07 2023. →
[33] Apache Software Foundation, The Apache HTTP Server Project. [Online]
Available: →https://httpd.apache.org/. Accessed on: Jan. 09 2022. →
[34] Apache Software Foundation, Apache Tomcat. [Online] Available:
→https://tomcat.apache.org/. Accessed on: Jan. 09 2022. →
[35] MDN Web Docs, JavaScript | MDN. [Online] Available:
→https://developer.mozilla.org/en-US/docs/Web/JavaScript. Accessed on: Jan.
16 2024. →
[36] PHP Group, PHP: Hypertext Preprocessor. [Online] Available:
→https://www.php.net/. Accessed on: Jan. 16 2024. →
[37] Graph Database & Analytics, Neo4j Graph Database. [Online] Available:
→https://neo4j.com/product/neo4j-graph-database/. Accessed on: Jan. 16
2024. →
[38] Oracle, MySQL. [Online] Available: →https://www.mysql.com/de/. Accessed
on: Jan. 16 2024. →
[39] D. Garijo, “WIDOCO: A Wizard for Documenting Ontologies,” University of
Southern California, 2017. →
[40] Stanford University, Protege Wiki. [Online] Available:
→https://protegewiki.stanford.edu/wiki/Main_Page. Accessed on: Jan. 16 2024.
→
[41] Qualitative Evaluation in sieben Schritten. In: Qualitative Evaluation: Der
Einstieg in die Praxis, U. Kuckartz (Ed.), 2nd edition, pp. 15–57, Wiesbaden: VS
Verl. für Sozialwiss, 2008. a, b, c
[42] P. G. Polson, C. Lewis, J. Rieman, and C. Wharton, Cognitive walkthroughs: A method for theory-based evaluation of user interfaces, International Journal of Man-Machine Studies, vol. 36, no. 5, pp. 741–773, 1992, PII: 0020-7373(92)90039-N. →
[43] J. Kossick, Reimplementierung, Erweiterung und exemplarische Evaluation
einer verteilten und kollaborativen Unterstützung für die Produktionsplanung,
Bachelor's thesis, Hagen: FernUniversität Hagen, 2016. →
[44] BMI, V-Modell XT Kurz und Knackig – Ein Überblick, Der Beauftragte der
Bundesregierung für Informationstechnik, 02 Nov., 2023,
→https://www.cio.bund.de/Webs/CIO/DE/digitaler-
wandel/Achitekturen_und_Standards/V_modell_xt/V_modell_xt_ueberblick/v_mo
dell_xt_ueberblick_artikel.html. →
[45] World Economic Forum, How digital transformation is driving economic
change. [Online] Available:
→https://www.weforum.org/agenda/2022/01/digital-transformation-economic-
change-technology/. Accessed on: Jan. 28 2022. →
De Gruyter Series Intelligent Computing
ISSN
e-ISSN
Leonidas Deligiannidis, George Dimitoglou, Hamid R. Arabnia
(Eds.)
Artificial Intelligence: Machine Learning, Convolutional Neural
Networks and Large Language Models, 2024
ISBN 978-3-11-054816-7, e-ISBN (PDF) 978-3-11-134412-6, e-ISBN
(EPUB) 978-3-11-134417-1
George Dimitoglou, Leonidas Deligiannidis, Hamid R. Arabnia
(Eds.)
Big Data, Data Mining and Data Science: Algorithms,
Infrastructures, Management and Security,
ISBN 978-3-11-134407-2, e-ISBN 978-3-11-134455-3, e-ISBN
(EPUB) 978-3-11-134458-4
George Dimitoglou, Leonidas Deligiannidis, Hamid R. Arabnia
(Eds.)
Cybersecurity: Cyber Defense, Privacy and Cyber Warfare,
planned 2024
ISBN 978-3-11-143641-8, e-ISBN 978-3-11-143654-8, e-ISBN
(EPUB) 978-3-11-143656-2
Leonidas Deligiannidis, George Dimitoglou, Hamid R. Arabnia
(Eds.)
Imaging Science: Computer Vision, Image and Signal Processing,
Pattern Recognition, planned 2024
ISBN 978-3-11-143638-8, e-ISBN 978-3-11-143642-5, e-ISBN
(EPUB) 978-3-11-143657-9
Index
0
2DM files 1
3D models 1
A
abelian group 1
ablation studies 1
abnormal data 1
acceleration 1, 2
accelerometer 1
accuracy 1, 2, 3, 4, 5, 6, 7, 8
accuracy rate 1, 2, 3, 4
adaptive composition 1
adaptive exploration 1
adaptive learning 1
adjacency matrix 1, 2, 3, 4
AI Explainability 360 (AIX360) 1
Akaike information criterion (AIC) 1
algorithm selection 1
anomaly detection 1, 2, 3, 4
anomaly identification 1
Applied Gaming (AG) 1
approximate inference 1, 2, 3
area under the roc curve (AUC) 1
arity 1, 2, 3, 4, 5, 6
artificial intelligence (AI) 1
ASD 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
assortative networks 1
assortativity index 1, 2, 3, 4, 5, 6, 7, 8, 9
attention mechanism 1
attributes 1, 2, 3, 4, 5
audio 1, 2, 3
audio classification 1
authoring tool 1, 2, 3, 4, 5
Authoring Tool 1
autism spectrum disorder 1
AutoGluon 1
automated machine learning (AutoML) 1
automated target recognition 1, 2, 3
automatic dimensionality reduction 1
AutoML 1
AutoML model 1
autonomous 1
Auto-Sklearn 1
Auto-WEKA 1
Azure Databricks Spark 1
B
backward SFS 1
bagging 1, 2, 3
Bag of Words 1
Bahadur representation 1
ball-bearing 1
bandpass filter 1, 2
batch processing 1
bearing faults 1, 2, 3, 4
Bellman principle 1
Betti curves 1
betweenness (BWC) 1
bidirectional encoder representations from transformers 1
binary classification problems 1
biomarkers 1, 2
bipartite graph 1, 2
bipolar 1
bisect-K-means 1, 2
bond market 1, 2, 3
boosting 1, 2, 3
Brandes’ algorithm 1
breadth-first search algorithm 1
breakpoints 1, 2, 3, 4
B-trees 1
burst of economic bubble 1
business cycles 1
C
C++ 1, 2, 3, 4
caching 1
Calinski and Harabasz index 1
canonical correlation 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72
canonical correlation analysis (CCA) 1, 2, 3
Case Western Reserve University (CWRU) bearing data set 1
categorical encoding 1
centering 1
centrality metrics 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97
Centralized File System (CFS) 1
centroid frequency 1, 2
centroids 1, 2
chaos 1, 2, 3
ChatGPT 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38
chat tool 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Chebyshev distance 1
chromatin remodeling 1
chromosome band location 1, 2
chromosome location 1, 2, 3, 4
classical nonlinear dynamical systems 1
classification 1, 2
classification matrix 1
classifications 1, 2, 3
classification system 1, 2
class imbalance 1
closeness (CLC) 1
CLUSEQ 1
clustering 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21
clustering methods 1
clustering quality 1, 2, 3, 4, 5, 6, 7, 8, 9
cluster validity 1, 2, 3, 4
cluster validity measure 1, 2
cognitive walkthrough 1, 2
cohesion 1
cold-start problems 1
commodity market 1, 2, 3
common variants 1, 2
communality score 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
compactness 1, 2
Competence and Qualification (CQ) 1
complex network analysis 1, 2, 3
composite objective function 1
computational costs 1
concept drift 1, 2, 3, 4, 5
condition information 1
constrained-remaining component decomposition 1, 2
consumers 1, 2, 3
content management environment 1
content producers 1, 2, 3, 4, 5
content providers 1, 2, 3
controlled access 1
copy number variants 1
copyright difficulties 1
correlation analysis 1
correlation coefficient 1, 2
counterfactual explanations 1
covariance matrix 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23
COVID-19 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
crude oil prices 1
CUDA 1
currency trends 1
cyclical variations 1, 2
cyclostationary noise 1
cyclostationary signal 1
D
data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49
data analysis 1, 2
data cleaning 1
data combining 1
data deduplication 1
data integration 1
data lake 1, 2, 3, 4, 5
data mining 1, 2, 3, 4
data model 1, 2, 3, 4
data normalization 1
data objects 1, 2, 3
data observation vectors 1, 2
data preprocessing 1
data reduction 1
data reservoir 1, 2, 3, 4, 5, 6, 7, 8
Datarobot 1
data scaling 1
data schema 1
data similarity 1, 2
data transformation 1
decimal scaling 1
decision fusion method 1, 2
decision trees 1
decomposition technique 1
deep learning 1, 2, 3, 4, 5
degree (DEG) 1
degree of density 1
de novo mutation 1, 2
de novo variants 1
derived imagery 1
descriptive metadata 1, 2
DGL 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
dimensionality curse 1
dimensionality reduction 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
discretization 1, 2, 3
dissortative networks 1
distance 1, 2, 3, 4
distance-based methods 1
distance function 1
distinct topological structures 1
distributed stochastic neighbor embedding (t-SNE) 1
distributed Thompson sampling 1
distribution optimization 1
distribution variance 1
DNA 1, 2
domain expertise 1
dominating factor 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Dow Jones Industrial Average 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14
DTD 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
dynamic advantage 1, 2, 3, 4, 5, 6
dynamic ensemble 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18
dynamic modeling 1
dynamic near real-time pricing 1
dynamic price elasticity computation 1, 2, 3
Dyna-SPECTS 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
E
EdgeNode 1
Efficiently Learning an Encoder that Classifies Token
Replacements Accurately (ELECTRA) 1
eigenvalue 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
eigenvector 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26
ELECTRA 1
embedding techniques 1, 2, 3
energy market 1, 2
energy price volatility 1
ensemble 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37
Ensemble 1
Entity model 1
entity set 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
epochs of length 1, 2
epsilon greedy algorithm 1, 2
ER model 1, 2, 3, 4
ESCO ontology 1, 2, 3, 4, 5, 6, 7, 8, 9
estimating outliers 1, 2
Euclidean distance 1, 2
European stocks 1
European Union 1
evaluation metrics 1, 2
exceedances 1, 2, 3, 4, 5, 6, 7
Exchangeable Image File Format (EXIF) 1
experimentation goals (EG) 1
expert evaluation 1
explainability methods 1
explainable artificial intelligence (XAI) 1, 2, 3, 4
eXplainable artificial intelligence (XAI) 1
exploratory factor analysis 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
exponentiation 1
Extensible Markup Language (XML) 1
extract information from newspapers 1, 2
F
false positive detections 1
fault analysis 1, 2
fault depths 1, 2, 3, 4, 5
faults 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
feature engineering 1, 2
feature extraction 1, 2, 3, 4, 5, 6, 7, 8
feature matrix 1, 2, 3, 4, 5
feature ranking 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
feature ranking algorithms 1
feature relevance 1
feature scaling 1
feature selection 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
file registration 1
final factor loadings 1, 2, 3, 4, 5, 6
financial indices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
finite strings 1
fixed composition 1, 2
forecast accuracy 1, 2
forecasting 1, 2, 3, 4
formal ontologies 1
forward SFS 1
Fourier transform 1
Fourth Industrial Revolution 1, 2, 3
Fréchet mean 1
functional components 1, 2
function approximation 1, 2
fusion method 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
G
Game of Thieves (GoT) node centrality metric 1
Gaming Analytics (GA) 1
gate delay 1, 2, 3, 4
GCN model 1, 2, 3, 4, 5
gene interaction data 1, 2, 3, 4, 5
genetic mutations 1
genetic risk 1
genome-wide significant loci 1
giotto-tda 1, 2, 3, 4
global trade tensions 1
GloVe 1
Google Vertex AI 1
graph database 1, 2, 3, 4, 5, 6, 7, 8, 9
graph neural networks 1, 2
graph sage 1, 2, 3, 4
Great East Japan Earthquake 1, 2, 3, 4, 5, 6
group membership 1
H
H2O DriverlessAI 1
Handbook of Clinical Neurology 1
Headers 1
high-dimensional data 1, 2, 3, 4
higher education 1, 2, 3, 4, 5
high-performance computing systems 1
high-performance table generation 1
high-risk gene variants 1
Hilbert transform 1
homology group 1, 2, 3, 4, 5, 6
housing market 1, 2, 3, 4, 5
HPI 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
human resource assignment 1, 2, 3
human resource management 1, 2, 3, 4, 5, 6, 7, 8
hypergraph 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
hypergraph database 1, 2, 3, 4, 5, 6, 7, 8, 9
hyperparameter tuning 1, 2, 3
hypervertices 1
I
IBM Watson AutoAI 1
IKMEANS 1
ILE 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29
implementation 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
incremental updates 1
independent component analysis 1
index of industrial production (IIP) 1
index of symmetry and uniformity 1, 2
industrial process management 1, 2
Industry 4.0 1, 2, 3, 4, 5
inference logging 1
infinite strings 1
information extraction 1
information gain 1
information model 1, 2, 3, 4, 5, 6, 7, 8
informedness 1, 2, 3, 4, 5
ingestible 1, 2, 3, 4, 5, 6, 7, 8
ingestible metadata 1, 2, 3, 4
inheritance 1
inner raceway 1
Integrated Rule-Oriented Data System (iRODS) 1, 2, 3
integration 1
inter-cluster 1, 2, 3
international stock price collapse 1
international trade frictions 1
interpretable ML models 1
J
Japanese Cabinet Office 1
Japanese economy 1
Japanese stocks 1, 2, 3, 4, 5
JavaScript 1, 2
Jensen-Shannon divergence metric 1, 2
Joint Photographic Experts Group (JPEG) 1
JSON 1, 2, 3, 4, 5, 6, 7
K
key competency questions 1
k-means 1
K-means clustering 1, 2, 3, 4
KNN model 1
knowledge-based and process-oriented innovation
management 1
knowledge-based software development process 1, 2, 3
Knowledge Management Ecosystem Portal (KM-EP) 1, 2, 3
knowledge representation 1, 2
KPP integration 1, 2, 3
kurtogram method 1
kurtosis 1, 2, 3
L
labeled datasets 1
label encoding 1
labels 1, 2, 3, 4, 5
large-scale events 1
large-scale language model 1
large-scale storage 1
large-scale tables 1
late fusion 1
leakage ratio coefficients 1
Leakage ratio coefficients 1
learning analytics 1
learning Analytics 1
Learning Management System (LMS) 1
least squares fitting 1
least squares method 1
lemmatization 1, 2, 3
Light Detection and Ranging (LiDAR) data sets 1
likelihood function 1
linear discriminant analysis (LDA) 1, 2
linear transformation 1
locality metadata 1
local linear model of time 1
logarithmic transformation 1
logic level circuits 1
logit model 1
log-likelihood 1, 2
long-tailed products (LTP) 1, 2
Lyapunov exponents 1, 2
M
machine learning 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
machine learning algorithms 1, 2
machine learning models 1
machine-readable 1
macroeconomic 1, 2, 3, 4, 5, 6
macros 1
Mainichi Shimbun 1, 2, 3
major epidemics 1
major financial crisis 1
Manhattan distance 1
matrix factorization 1
maximum likelihood AR modeling methods 1
maximum likelihood estimate 1
maximum likelihood method 1, 2
MCMC sampling 1, 2, 3, 4
mean amplitudes 1
mean imputation 1
memory-efficient data structures 1
metadata 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69
metadata assignment 1, 2, 3
metadata queries 1, 2
Metadata Schema Reservoir (MeScheR) 1
meta-learning pipeline 1
metric normalization 1
Microsoft Azure AutoML 1
min–max scaling 1
missing value imputation 1
missing values 1, 2
MLjar 1
ML pipeline 1
model-agnostic 1, 2
model selection 1
Monte Carlo experiment 1, 2, 3
moving linear model 1
moving linear model approach 1
multiarmed bandits 1, 2
multiclass problems 1
multigraphs 1
multilayer perceptron (MLP) 1
multiobjective optimization 1, 2
multiple imputation 1
MySQL 1, 2, 3
MySQL database 1
N
Named Entity Recognition 1
natural disasters 1, 2, 3, 4, 5
natural language processing 1, 2, 3, 4
nearest neighbors 1
neighborhood-based centrality metrics 1, 2, 3, 4, 5, 6, 7, 8, 9
Neo4j 1, 2, 3, 4, 5, 6, 7
Neo4j Graph Database 1
NEO4J Graph Database 1
network-based machine learning (NBML) 1
Network model 1
neural contextual bandit algorithm 1
neural networks 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22
neurodevelopment 1
New York Stock Exchange 1, 2, 3, 4, 5
Nikkei Stock Average 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41
nodes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
noise 1, 2
noise filtering 1
noisy 1
nondominating factor 1, 2
nonkey attribute names 1
nonlinear dynamics 1
nonlinear feature transformations 1
nonstationary data 1, 2
nonstationary mean 1
nonstationary time series data 1
non-terminal 1, 2, 3, 4, 5, 6, 7
numerical instability 1
numerical modeling 1
O
one-hot encoding 1
one-to-many relationship 1
online learning 1, 2, 3, 4, 5, 6, 7
optimal number of components 1
optimal window size 1, 2, 3, 4
optimizing prices 1
orthogonal space 1
outer raceway 1
outlier detection 1, 2, 3
outlier removal 1
outliers 1, 2, 3, 4
overfitting 1, 2
oversampling 1, 2
P
PAGEL simulation 1, 2, 3, 4, 5
PageRank 1, 2
Pandas 1
pandemic 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19
parallelization 1, 2
parameterization 1
passband 1
Pearson’s correlation coefficient 1, 2, 3, 4, 5
persistence diagrams 1, 2, 3, 4, 5
persistence landscape 1
persistence silhouettes 1
persistent homology 1, 2, 3, 4
persistent storage 1
physical modeling data 1
pipeline 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
Policy Enforcement Point (PEP) 1
positive sentiment 1
posterior approximations 1
posterior distributions 1, 2
power-assisted steering systems 1, 2, 3
power-iteration algorithm 1
p–q projection 1, 2, 3, 4, 5, 6, 7
preprocessing 1, 2
pre-trained language model 1
price elasticity (PE) 1, 2, 3, 4, 5
price personalization 1, 2, 3
principal component analysis (PCA) 1
prior distributions 1
probability distribution 1
probit model 1
process-associated knowledge 1
processes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
process-oriented resource assignment 1
production 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52
professions 1, 2
proof-of-concept evaluation 1
Protégé 1
protein–protein interaction (PPI) 1
Protocol Enforcement Point (PEP) 1
Python 1, 2, 3, 4, 5, 6, 7, 8, 9
Q
Qualification-Based Learning Model (QBLM) 1
qualifications 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
query library 1, 2, 3, 4
R
random forest 1, 2, 3
randomness index 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
rating scheme 1, 2
Ray and Turi 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
RDF Query Language 1
RDF Schema (RDFS) 1
recommender systems 1
Recurrent Neural Networks 1
recursive feature elimination (RFE) 1
recursive grammar 1
reduction 1
regression 1
regression imputation 1
regression models 1, 2
Reinforcement Learning policy 1
Relational model 1, 2
relationship attributes 1, 2
relationship objects 1
relationship set 1, 2, 3, 4, 5, 6
ReliefF 1, 2, 3, 4, 5, 6, 7
repetition count 1
resource description framework (RDF) 1
resource efficiency 1
revenue maximization objective 1
reversibility 1
risk ranking 1
road surface classification 1, 2, 3, 4, 5, 6
RoBERTa 1
robustness checks 1
roles 1, 2, 3, 4, 5, 6
root element 1, 2, 3, 4
Ruby 1, 2, 3, 4, 5, 6, 7, 8, 9
run-to-failure bearing experiments 1
S
SalesForce TransmogrifAI 1
sampled imagery 1, 2
SAX embedding model 1, 2, 3
scaling 1
scene 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Scene 1, 2, 3
Sector-specific overheating 1
secure administration tools 1
seismic signals 1
semantic comparison 1, 2, 3, 4
semantic expressiveness 1
semantic knowledge representation 1
semantic similarity 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
sensitivity 1, 2, 3, 4, 5, 6, 7, 8, 9
sensors 1, 2, 3, 4, 5, 6
sentiment analysis 1
separated score integration 1
seq2seq methods 1
sequential feature selection (SFS) 1
sequential training 1
Sfari dataset 1, 2
SHAMAN 1
Shanghai Composite Index 1, 2, 3, 4
Shapley Additive exPlanations 1
shortest path-based centrality metrics 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11
simple classifiers 1
simple taxonomies 1
simplexes 1, 2
simplicial complex 1, 2, 3, 4
single time series 1
singular cluster 1
skill-centered qualification ontology 1, 2, 3
skills 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26
Skills 1, 2, 3
solution scoring system 1
spaCy 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
SPARQL Protocol 1
specificity 1, 2, 3, 4, 5, 6, 7, 8
specifier data 1
spectral contrast 1, 2
spectral features 1
spectral flatness 1, 2
spectral radius ratio 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
spectral slope 1, 2
spectral values 1
SPECTS model 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
stacking 1
Standard & Poor 1
Standard Generalized Markup Language (SGML) 1
static ensemble 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
stationary structures 1
stationary time series 1
statistic-domain features 1
steering wheel signals 1
stochastic PCA 1
stock market 1, 2
stock prices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19
string matching 1
strongly negative canonical correlated networks 1
strongly positive canonical correlated networks 1
structured query language (SQL) 1
subgrammar 1, 2
subsampling rate 1
subset selection 1, 2, 3
Subset selection 1
subsurface effects 1
sum of squared distances (SSD) criterion 1
sum of squares 1
supervised learning 1
support vector machine 1, 2, 3
surface fluxes 1
Symbolic Aggregation approXimation (SAX) 1
symmetry 1
synaptic function 1, 2
synonymity 1
synthetic imagery 1
synthetic images 1, 2, 3, 4
synthetic scene generation 1
systems-development goals (SG) 1
T
table arrangement technology 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
tag 1
tagging 1, 2, 3
Tag Image File Format (TIFF) 1
tags 1, 2, 3, 4, 5, 6, 7, 8, 9
Takens embedding 1, 2, 3, 4, 5, 6
Takens embedding model 1, 2
target encoding 1
task simulation 1, 2
TDA 1, 2, 3, 4, 5, 6, 7, 8, 9
temporal data 1
test set 1, 2, 3
textual form 1
theory-building goals (TG) 1
third-order autocorrelation 1
Thompson sampling 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Thompson Sampling 1, 2, 3
time reversibility 1
time series data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
tokenization 1, 2, 3, 4
Tokyo Stock Exchange 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17
Tokyo stock market 1, 2, 3, 4, 5, 6
topological data analysis (TDA) 1, 2
topological methods 1
total market index 1
TPOT 1
trade agreements 1
trading partners 1, 2
traditional pricing strategies 1, 2, 3, 4
trained model 1, 2
training set 1, 2
transfer learning 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
transformation 1
transformer 1, 2, 3, 4
transformer architecture 1
transition block 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
transportation network 1
transportation networks 1
U
unfiltered data 1
uniformity 1
unit variance 1
universal language 1
unsupervised learning 1
use case 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
user-determined threshold 1
user interfaces 1, 2, 3, 4
V
varimax rotation 1, 2, 3, 4, 5, 6, 7, 8
vector of frequencies 1
vector space 1, 2, 3, 4, 5
vector space model 1
vehicle speed 1, 2, 3
verification 1
vibration 1, 2
vibration feature extraction 1, 2
vibration sensor data 1
Vietoris-Rips method 1
virtual file system 1, 2
virtual learning environment (VLE) 1
visibility graph 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
V-Modell XT 1
W
Wasserstein gradient flows 1
Wasserstein distances 1
Wavefront OBJ files 1
Web Ontology Language (OWL) 1
Web page ranking 1
weighted MAPE 1, 2, 3, 4
weighted stochastic gradient descent 1
width of the time interval (WTI) 1
window sizes 1
wizard mode 1
Word2Vec 1
word vector 1
worm-gear interface configuration 1, 2
X
xcdata 1, 2
XGBoost 1, 2, 3, 4, 5, 6, 7, 8
XGBoost regressor 1, 2, 3
XLNet 1
XML 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34
xmlalitem 1, 2, 3, 4
xmlattlist 1, 2
xmlcomm 1, 2, 3, 4, 5
xmlent 1, 2, 3, 4, 5, 6
XML files 1, 2
xmlpi 1, 2, 3
xstring 1, 2, 3
Z
zero mean 1
z-score normalization 1