Big Data, Data Mining and Data Science
Intelligent Computing

Edited by
Leonidas Deligiannidis
Hamid R. Arabnia

Volume 2
Big Data, Data Mining and Data
Science

Algorithms, Infrastructures, Management and Security


Edited by
George Dimitoglou
Leonidas Deligiannidis
Hamid R. Arabnia

ISBN 9783111344072
e-ISBN (PDF) 9783111344553
e-ISBN (EPUB) 9783111344584
Bibliographic information published by the Deutsche
Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the
Deutsche Nationalbibliografie; detailed bibliographic data are
available on the Internet at http://dnb.dnb.de.
© 2025 Walter de Gruyter GmbH, Berlin/Boston
Contents
Preface

Methods and instrumentation


Koki Kyo
1 Identifying and estimating outliers in time series with
nonstationary mean through multiobjective optimization
method
1.1 Introduction
1.2 First review
1.3 Second review
1.4 New proposal
1.5 Illustrative example
1.6 Summary
Vitit Kantabutra
2 Using the intentionally linked entities (ILE) database
system to create hypergraph databases with fast and
reliable relationship linking, with example applications
2.1 Introduction
2.2 The ILE database system and its use for representing
a hypergraph database
2.3 Why is relationship linking fast in ILE?
2.4 An example: a small hypergraph database with
ternary relations, with benchmarking
2.5 Benchmarking larger databases and comparison with
a graph database
2.6 Transportation networks
2.7 World Wide Web
2.8 Conclusions
Zohreh Safari
3 Rapid and automated determination of cluster numbers
for high-dimensional big data: a comprehensive update
3.1 Introduction
3.2 Related works
3.3 Proposed method
3.4 Experimental studies
3.5 Conclusion
Natarajan Meghanathan
4 Canonical correlation analysis and exploratory factor
analysis of the four major centrality metrics
4.1 Introduction
4.2 Canonical correlation analysis for a toy example graph
4.3 Canonical correlation analysis for real-world network
graphs
4.4 Exploratory factor analysis of centrality dataset
4.5 Exploratory factor analysis for the real-world network
graphs
4.6 Related works
4.7 Conclusions and future work
Abderahim Salhi, Althea C. Henslee, James Ross, Joseph Jabour,
Ian Dettwiller
5 Navigating the landscape of automated data
preprocessing: an in-depth review of automated machine
learning platforms
5.1 Introduction
5.2 Data preprocessing pipeline
5.3 AutoML platforms for data preprocessing
5.4 Challenges
5.5 Conclusion
Peter M. Maurer, John Carbone
6 Generating random XML
6.1 Introduction
6.2 DGL
6.3 XML file output
6.4 Additional XML file productions
6.5 DTD productions
6.6 Example of attributes
6.7 Dependent data generation
6.8 Enhanced bridge example
6.9 Conclusion

Applications and case studies


Danushka Bandara, Kyle Riccardi
7 Exploring autism risk: a deep dive into graph neural
networks and gene interaction data
7.1 Introduction
7.2 Related works
7.3 Methodology
7.4 Experiment
7.5 Results
7.6 Discussion
7.7 Conclusion
Masaki Murata
8 Leveraging ChatGPT and table arrangement techniques
in advanced newspaper content analysis for stock insights
8.1 Introduction
8.2 Analysis cases using table arrangement technology
8.3 Analysis cases using ChatGPT
8.4 Analysis cases using both table arrangement
technology and ChatGPT
8.5 Simple analysis case using obtained categories and
ChatGPT
8.6 Conclusions
Addisson Salazar, Gonzalo Safont, Luis Vergara, Alberto
Gonzalez
9 An experimental study on road surface classification
9.1 Introduction
9.2 Road surface identification system
9.3 A short review of methods for observation vector
dimension reduction
9.4 Classification and post-processing
9.5 Experimental setup and optimization of the system
9.6 Performance of the road surface identification system
9.7 Conclusions
Ray R. Hashemi, Omid M. Ardakani, Jeffrey Young, Azita G.
Bahrami
10 RNN models for evaluating financial indices: examining
volatility and demand-supply shifts in financial markets
during COVID-19
10.1 Introduction
10.2 Previous works
10.3 Methodology
10.4 Empirical results
10.5 Discussion, conclusion, and future research
Dean Lee, Jamal Rorie, Andrew Sabater
11 Topological methods for vibration feature extraction
11.1 Feasibility of TDA as a means for vibration feature
extraction
11.2 Anomaly detection
Srividhya Sethuraman, Uma G., Sunny Kumar, Siddhesh
Thombre, Vikash Patel, Sharadha Ramanan
12 Dyna-SPECTS: DYNAmic enSemble of Price Elasticity
Computation models using Thompson Sampling in e-
commerce
12.1 Introduction
12.2 Related works
12.3 Data characteristics
12.4 Methodology
12.5 Experiments and insights
12.6 Conclusion
Barry C. White, Rachel E. Jordan, Reena R. Patel, LaKenya K.
Walker, Matthew D. Bray
13 Creating a metadata schema for reservoirs of data: a
systems engineering approach
13.1 Introduction
13.2 Defining users and processes
13.3 Data collection structure
13.4 Metadata
13.5 MeScheR
13.6 Metadata attributes
13.7 Use cases
13.8 Results
13.9 Conclusions
13.10 Future work
13.11 Brief discussion of the Integrated Rule-Oriented
Data System (iRODS)
Ramona Srbecky, Franz Bühler, Jörg Schuljak, Simon-Alexander
Wetzel, Michael Winterhagen, Wieland Fraas, Jan Dettmers,
Matthias Hemmje
14 Implementation and evaluation of an eXplainable
artificial intelligence to explain the evaluation of an
assessment analytics algorithm for free-text exams in
psychology courses in higher education to attest QBLM-
based competencies
14.1 Introduction
14.2 State-of-the-Art and Technology
14.3 Concepts
14.4 Proof of concept realization
14.5 Initial evaluation
14.6 Conclusions
Adrian Vogler, Binh Vu, Tobias Vogel, Benjamin Gernhardt,
Matthias Hemmje
15 Toward a skill-centered qualification ontology
supporting data mining of human resources in knowledge-
based enterprise process representations
15.1 Introduction
15.2 State of the art
15.3 Design and conceptual modeling
15.4 Proof-of-concept implementation
15.5 Preliminary evaluation results
15.6 Summary and outlook
Index

Preface
It is a pleasure to present an eclectic collection of papers on Big
Data, Data Mining, and Data Science within the Intelligent
Computing (De Gruyter) book series. In an era where data reigns
supreme, harnessing its value has become necessary in
research, business, government, and the military. The insights from
analyzing large datasets have revolutionized industries, driven
innovation, and transformed how we understand and interact
with the world and technology ecosystems.
From a computing perspective in our data-rich world, Big
Data, Data Mining, and Data Science collectively leverage data to
uncover hidden knowledge and solve complex problems. Big
Data deals with the vast volumes, velocity, and variety of
structured and unstructured data. Data Mining focuses on
extracting meaningful patterns and insights from large datasets
for predictive modeling and decision support. Data Science aims
to extract actionable insights using various techniques to solve
complex problems and drive decision-making. These techniques
are applied to diverse problems and domains, such as the
financial sector, healthcare, e-commerce, and cybersecurity.
The work presented in this book can be loosely categorized
into two distinct themes. The first theme is “methods and
instrumentation,” where authors provide insight into systematic
methods, procedures, and techniques within a research or
experimental framework and the tooling to measure, observe, or
manipulate variables of interest. In this thematic collection,
papers explore a range of topics such as hypergraph
databases, automated determination of cluster numbers for high-
dimensional big data, centrality metrics for identifying dominant
factors in datasets, machine learning-based data preprocessing
approaches, estimation of time-series outliers using multi-objective
optimization with non-stationary means, and the development of
languages for generating random data to facilitate random testing
of hardware and software applications.
The second theme is “applications and case studies,” where
authors apply and implement theories, techniques,
methodologies, and technologies in specific contexts,
showcasing their practical relevance and effectiveness. In this
thematic collection, papers explore a range of topics such as
using high-volume dynamic ensemble-based model computations in
e-commerce, deploying explainable artificial intelligence (AI) to
explain an assessment analytics algorithm for free text
exams, using graph neural networks (NN) and gene interaction
data, applying recurrent neural network (RNN) models to examine
the volatility in financial markets during a global
pandemic, using skill-centered qualification ontologies to support
data mining of human resources in knowledge-based enterprise
process-representations, extracting information from vibration
sensor data using topological data analysis, leveraging generative
AI (GenAI) and table arrangement techniques to analyze newspaper
stories for stock price insight, creating metadata schemas for data
reservoirs, and exploring the discrimination capabilities of a set of
features for road surface classification.
The book is mainly composed of selected papers that were
accepted for the 2022 and 2023 International Conferences on
Computational Science and Computational Intelligence (CSCI:
December, Las Vegas, USA) and the 2023 International
Conference on Data Science (CSCE/ICDATA: July, Las Vegas, USA).
Selected authors were given the opportunity to submit the
extended versions of their conference papers for publication
consideration in this book. An important mission of CSCI and
CSCE annual conferences includes “Providing a unique platform
for a diverse community of constituents composed of scholars,
researchers, developers, educators, and practitioners. The Congress
makes a concerted effort to reach out to participants affiliated with
diverse entities (such as universities, institutions, corporations,
government agencies, and research centers/labs) worldwide. The
Congress also attempts to connect participants from institutions
that have teaching as their main mission with those who are
affiliated with institutions that have research as their main mission.
The Congress uses a quota system to achieve its institution and
geography diversity objectives.” Since this book comprises the
extended versions of the accepted papers of CSCI and CSCE
annual conferences, it is no surprise that it has chapters from a
highly qualified and diverse group of authors.
Aside from recognizing the authors who provided their
research contributions, we are also grateful to the many
colleagues who offered their time and effort in organizing the
CSCI and CSCE conferences. Their help was instrumental in the
formation of this book. The editorial committee members
appear on the CSCI and CSCE’s websites. Finally, we want to
thank Steve Elliot (De Gruyter Editor) and Aleksandra Ślosarczyk
(De Gruyter Editorial Project Manager) for their continuous
support throughout the development and production of the
book.
We hope our readers find as much value and gain insight
from this book as we have.
George Dimitoglou, Ph.D.
Department of Computer Science & IT,
Hood College,
401 Rosemont Ave, Frederick, MD 21701, USA
Leonidas Deligiannidis, Ph.D.
School of Computing and Data Science,
Wentworth Institute of Technology,
550 Huntington Ave, Boston, MA 02115, USA
Hamid R. Arabnia, Ph.D.
University of Georgia,
415 GSRC, Computer Science,
Athens, GA 30602-7404, USA
Methods and instrumentation
1 Identifying and estimating outliers in time series with
nonstationary mean through multiobjective optimization
method

Koki Kyo

Acknowledgment: This work was supported by a Grant-in-Aid for Scientific Research (C)
(19K01583) from the Japan Society for the Promotion of Science.

Abstract
This study introduces a method for detecting and estimating outliers in nonstationary time series
data. We begin by reviewing a decomposition technique to separate a time series into trend and
stationary components with outliers. A multiobjective optimization approach that combines the
minimum ISU (index of symmetry and uniformity) and maximum likelihood AR modeling methods
through a composite objective function is proposed. The method balances outlier detection and
model selection. We demonstrate its effectiveness by applying it to analyze monthly time series
data for index of industrial production in Japan from January 1975 to December 2019. Our results
highlight its robust performance in addressing anomalies and making informed model selections,
paving the way for future extensions to various domains.
Keywords: autoregression model, outliers in variables, identification and estimation of outliers,
moving linear model approach, multiobjective optimization

1.1 Introduction
Time series modeling methods have been widely applied across various research fields, serving
diverse purposes. These applications span from extracting seismic signals in earthquake data
[→7] to conducting thorough analyses of business cycles [→10, →13].
In recent times, anomaly detection has become a crucial focus in various research domains,
placing particular emphasis on time series analysis [→12]. Anomaly detection within the scope of
time series analysis entails the identification and estimation of unexpected values or abrupt
fluctuations within temporal data.
This paper delves into the overarching theme of anomaly detection, specifically honing in on
the task of identifying and estimating outliers within time series that display a nonstationary
mean. The main challenge arises in handling time series data characterized by a nonstationary
mean, a scenario in which conventional approaches are typically tailored for stationary structures
in time series.
To address this challenge, we utilize the moving linear model approach introduced by Kyo and
Kitagawa [→9]. This approach enables the decomposition of the target time series into two
distinct components: one that captures the nonstationary mean and another that encapsulates
the stationary term housing the outliers. Subsequently, we present a novel method for identifying
and estimating outliers within the stationary components.
Autoregression (AR) modeling emerges as a valuable approach for analyzing stationary time
series data [→2]. An AR model provides a linear framework for defining a set of coefficients,
making them amenable to straightforward estimation through least squares fitting [→6].
However, the presence of outliers within a time series can significantly compromise the efficiency
of the least squares fitting process. The impact of outliers on parameter estimation has been
extensively discussed in works by Pena [→11] and Fox [→3].
To address the challenges posed by outliers, several novel outlier detection methods have
been proposed, as exemplified by Vishwakarma et al. [→14], and the references therein. While
these methods provide value, many of them are highly specialized and may lack practical
applicability.
In Kyo [→8], we introduced an innovative approach for identifying and estimating outliers
within time series data. Our approach seamlessly integrates an autoregressive (AR) modeling
framework based on the maximum likelihood method, offering several notable characteristics: (1)
When combined with the constrained-remaining component decomposition method proposed by
Kyo and Kitagawa [→9], this method can also be applied to data with trends, thus expanding its
range of applicability. (2) The model structure adopted in our approach is straightforward and
easily comprehensible, making it highly practical and applicable in real-world scenarios. (3)
Differing from many previous studies that treat outliers as concealed factors influencing variable
variances or errors through robust estimation, our approach offers a unique perspective. We treat
outliers as explicit parameters, simultaneously identifying and estimating them alongside the
model coefficients. This approach not only enables precise outlier localization within the time
series but also provides insights into their magnitudes, enhancing our understanding of the
underlying time series dynamics.
In many instances, outliers arise from sudden shocks, and their impact often persists over a
defined timeframe. The key characteristic of the AR modeling approach lies in its ability to detect
and estimate outliers based on the underlying mechanism of periodic variations in time series
data. After removing outliers, the time series retains the structural patterns of these periodic
variations. However, the approach proposed by Kyo [→8] employs the maximum likelihood
method, and the likelihood function may not always exhibit high sensitivity to outliers.
Consequently, there are cases when the AR modeling approach may not be highly efficient in
identifying and estimating outliers.
In other words, even when outliers are detected with the maximum likelihood of the AR modeling approach, estimating them accurately can be challenging because the outliers influence one another. Therefore, relying solely on the maximum likelihood of the AR modeling approach may not be efficient for both identifying and estimating outliers. Our prior study in Kyo [→8] revealed that the process of identifying and estimating outliers required multiple iterations to achieve acceptable results.
Therefore, in this paper, to overcome the limitations of the maximum likelihood of the AR
modeling approach proposed by Kyo [→8], we introduce a highly sensitive indicator referred to as
the index of symmetry and uniformity (ISU) for outlier detection and estimation. We then
integrate the ISU with the likelihood for an AR model and propose a multiobjective optimization
method using negative log-likelihood and the ISU as objective functions. This approach aims to
enhance the efficiency of identifying and estimating outliers while retaining the advantages of the
maximum likelihood of AR modeling approach.
The remainder of the chapter is structured as follows. In Section 1.2, we provide a review of
the constrained-remaining component decomposition method. In Section 1.3, we offer a detailed
review of the maximum likelihood of the AR modeling approach for the identification and
estimation of outliers. Section 1.4 presents a new proposal for identifying and estimating outliers
using a multiobjective optimization method. In Section 1.5, we illustrate the performance of the
proposed approach with an example. Finally, in Section 1.6, we summarize the study.
1.2 First review
To commence, we will provide an overview of the constrained-remaining component decomposition method introduced by Kyo and Kitagawa [→9]. Let us consider a time series denoted as $z_t$, comprising two unobserved components defined as follows:

$$z_t = s_t + y_t \quad (t = 1, 2, \ldots, N). \tag{1.1}$$

Here, $s_t$ and $y_t$ represent the constrained and remaining components, respectively, with $t$ and $N$ corresponding to a specific time point and the length of the time series. The constrained component is presumed to be expressible as a local linear model of time $t$ indicating long-term variations in the time series $z_t$, making it inherently smooth. In contrast, the remaining component is derived as the residual after fitting the constrained component to the time series, representing short-term variations. Additionally, no specific assumptions are made about the remaining component $y_t$, although it is assumed that the time series $y_t$ is closer to being a stationary series than $z_t$ if there are no outliers in the time series $z_t$. Further details regarding the model in →eq. (1.1) were provided by Kyo and Kitagawa [→9], and the definition of an outlier is elaborated later in this paper.

The moving linear model approach, developed by Kyo and Kitagawa [→9], can be employed to perform the constrained-remaining component decomposition. In this approach, a critical parameter called the width of the time interval (WTI) is introduced. Proper determination of the WTI allows us to execute the decomposition outlined in →eq. (1.1). In Kyo and Kitagawa [→9], a likelihood function for the WTI was derived, enabling the estimation of the WTI using the maximum likelihood method. As mentioned earlier, with a properly determined WTI and in the absence of outliers in the time series $z_t$, a constrained component with high smoothness can be estimated, and the remaining component is obtained as a stationary time series. However, the existence of outliers in the time series $z_t$ can cause the remaining component to become nonstationary due to the influence of the outliers.
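The exact moving linear model estimator and the WTI likelihood are given in Kyo and Kitagawa [→9] and are not reproduced here. As a rough illustration of the decomposition idea only, the following Python sketch extracts a smooth constrained component with a centered local-linear fit over windows of width WTI; the window scheme and all names are assumptions for illustration, not the authors' estimator.

```python
# Minimal sketch: centered local-linear smoothing as a stand-in for the
# moving linear model decomposition of eq. (1.1). Not the authors' estimator.
import numpy as np

def decompose(z, wti):
    """Split z_t into a smooth constrained component s_t and a remainder y_t."""
    n = len(z)
    half = wti // 2
    s = np.empty(n)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        tt = np.arange(lo, hi)
        slope, intercept = np.polyfit(tt, z[lo:hi], deg=1)  # local line fit
        s[t] = intercept + slope * t
    return s, z - s  # constrained component, remaining component

# Example: trend plus noise with one injected spike.
rng = np.random.default_rng(0)
z = 0.01 * np.arange(540) + rng.normal(0, 0.1, 540)
z[300] += 1.5
s, y = decompose(z, wti=57)
```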


In Kyo [→8], scenarios in which the time series $z_t$ contains outliers were considered, and a significant portion of these outliers is located in the remaining component. This allows us to estimate the constrained component with high smoothness and acquire the remaining component containing the outliers. Consequently, the objective of this study is to separate the outliers from the estimation of the remaining component and isolate the stationary part of the remaining component. Subsequently, in the next subsection, we introduce an AR model for the time series of the remaining component with outliers.

1.3 Second review


Let us review the maximum likelihood of the AR modeling approach for identifying and estimating
outliers proposed by Kyo [→8].

1.3.1 Models
Consider the scenario where, as a result of the constrained-remaining component decomposition, we have successfully estimated the constrained component with high smoothness, and the remaining component $y_t$ is obtained as

$$y_t = r_t + e_t \quad (t = 1, 2, \ldots, N).$$

Here, $r_t$ represents a stationary time series, while $e_t$ is a time series containing outliers. In this context, we assume that the stationary time series $r_t$ follows an AR process:

$$r_t = \sum_{i=1}^{p} \alpha_i r_{t-i} + \epsilon_t \quad (t = 1, 2, \ldots, N). \tag{1.2}$$

Here, $p$ denotes the model order, $\alpha_1, \alpha_2, \ldots, \alpha_p$ are coefficients, and $\epsilon_t \sim N(0, \sigma^2)$ represents the innovation, assumed to be independent and identically distributed. We consider the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_p$ and the error variance $\sigma^2$ as unknown parameters.

Furthermore, we assume that the values of almost all elements in the time series $e_1, e_2, \ldots, e_N$ are equal to zero, except for possibly $k$ nonzero parameters, denoted as $\delta = (\delta_1, \delta_2, \ldots, \delta_k)^T$. It is important to note that the key characteristics of stationarity for the AR model in →eq. (1.2) are as follows (as per Kitagawa [→6]): (1) the variance $\sigma^2$ of the innovation remains constant for $t = 1, 2, \ldots, N$; (2) the absolute values of all partial autocorrelation coefficients are less than 1.

1.3.2 Basic scheme


In this section, we introduce the basic scheme for the maximum likelihood of the AR modeling approach. Several vectors and a matrix are defined as follows:

$$
r^{(1)}(\delta) =
\begin{pmatrix} r_1(\delta) \\ r_2(\delta) \\ \vdots \\ r_N(\delta) \end{pmatrix} =
\begin{pmatrix} y_1 - e_1 \\ y_2 - e_2 \\ \vdots \\ y_N - e_N \end{pmatrix}, \quad
\alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix}, \quad
\varepsilon^{(1)} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{pmatrix},
\tag{1.3}
$$

$$
X^{(1)}(\delta) =
\begin{pmatrix}
0 & \cdots & \cdots & \cdots & 0 \\
y_1 - e_1 & \ddots & & & \vdots \\
y_2 - e_2 & y_1 - e_1 & \ddots & & \vdots \\
\vdots & y_2 - e_2 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \vdots & \ddots & 0 \\
\vdots & \vdots & \vdots & \vdots & y_1 - e_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
y_{N-1} - e_{N-1} & y_{N-2} - e_{N-2} & \cdots & \cdots & y_{N-p} - e_{N-p}
\end{pmatrix}.
$$

It is assumed that there are no outliers among the initial values $y_{1-p}, y_{2-p}, \ldots, y_0$ for the time series $y_t$, so the initial values $e_{1-p}, e_{2-p}, \ldots, e_0$ for the time series $e_t$ are equal to zero; i.e., $\{r_t = y_t = 0;\ t = 1-p, 2-p, \ldots, 0\}$. The model in →eq. (1.2) is then expressed as

$$r^{(1)}(\delta) = X^{(1)}(\delta)\,\alpha + \varepsilon^{(1)}(\delta).$$

Moreover, when the outliers $\delta$ and the model order $p$ are given, then by adopting the least squares method, the coefficients are estimated as:

$$\hat{\alpha}^{(1)}(\delta) = \left(X^{(1)}(\delta)^T X^{(1)}(\delta)\right)^{-1} X^{(1)}(\delta)^T r^{(1)}(\delta),$$

and the maximum likelihood estimate for $\sigma^2$ is given by

$$\left(\hat{\sigma}^{(1)}\right)^2(\delta) = \frac{1}{N}\left(r^{(1)}(\delta) - X^{(1)}(\delta)\hat{\alpha}^{(1)}(\delta)\right)^T \left(r^{(1)}(\delta) - X^{(1)}(\delta)\hat{\alpha}^{(1)}(\delta)\right).$$

As a function of the outliers $\delta$, the maximum log-likelihood is expressed by:

$$LL^{(1)}(\delta) = -\frac{N}{2}\left(\log\left(2\pi\left(\hat{\sigma}^{(1)}\right)^2(\delta)\right) + 1\right).$$

These estimates are based on the vector $r^{(1)}(\delta)$ and matrix $X^{(1)}(\delta)$, which are set in the order of the observation data. Thus, this set of estimates is referred to as the estimates in natural order (ENO).
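As a concrete illustration of the ENO step, here is a minimal Python sketch that, given candidate outlier values $\delta$ at known positions, subtracts them from the remaining component, builds $X^{(1)}(\delta)$ with zero initial values, and computes the least-squares coefficients, the variance estimate, and $LL^{(1)}(\delta)$. The function and argument names are illustrative, not the author's code.

```python
# Minimal ENO sketch under the zero-initial-value assumption stated above.
import numpy as np

def eno_fit(y, positions, delta, p):
    n = len(y)
    e = np.zeros(n)
    e[positions] = delta                        # outlier series e_t (zero elsewhere)
    r = y - e                                   # r^(1)(delta)
    rpad = np.concatenate([np.zeros(p), r])     # initial values r_t = 0 for t <= 0
    # Row t of X holds the lagged values (r_{t-1}, ..., r_{t-p}).
    X = np.column_stack([rpad[p - i : p - i + n] for i in range(1, p + 1)])
    alpha_hat, *_ = np.linalg.lstsq(X, r, rcond=None)
    resid = r - X @ alpha_hat
    sigma2_hat = resid @ resid / n
    ll = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    return alpha_hat, sigma2_hat, ll
```

The reversed-order (ERO) estimates introduced next can be obtained by applying the same routine to the reversed series `y[::-1]`.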
However, these parameter estimation formulas may depend on the location of the outliers in the time series $e_t$. Therefore, a vector and a matrix corresponding to the vector $r^{(1)}(\delta)$ and $X^{(1)}(\delta)$ are set in reversed order as

$$
r^{(2)}(\delta) =
\begin{pmatrix} y_N - e_N \\ y_{N-1} - e_{N-1} \\ \vdots \\ y_1 - e_1 \end{pmatrix},
$$

$$
X^{(2)}(\delta) =
\begin{pmatrix}
0 & \cdots & \cdots & \cdots & 0 \\
y_N - e_N & \ddots & & & \vdots \\
y_{N-1} - e_{N-1} & y_N - e_N & \ddots & & \vdots \\
\vdots & y_{N-1} - e_{N-1} & \ddots & \ddots & \vdots \\
\vdots & \vdots & \vdots & \ddots & 0 \\
\vdots & \vdots & \vdots & \vdots & y_N - e_N \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
y_2 - e_2 & y_3 - e_3 & \cdots & \cdots & y_{p+1} - e_{p+1}
\end{pmatrix}.
$$

Based on the reversibility of a stationary AR model, the model in →eq. (1.2) can also be expressed as:

$$r^{(2)}(\delta) = X^{(2)}(\delta)\,\alpha + \varepsilon^{(2)}(\delta),$$

where $\varepsilon^{(2)}(\delta)$ is the innovation vector corresponding to $r^{(2)}(\delta)$ and $X^{(2)}(\delta)$.

Correspondingly, when the values of the outliers $\delta$ are given, the estimates of the coefficients are obtained as:

$$\hat{\alpha}^{(2)}(\delta) = \left(X^{(2)}(\delta)^T X^{(2)}(\delta)\right)^{-1} X^{(2)}(\delta)^T r^{(2)}(\delta),$$

and the maximum likelihood estimate for $\sigma^2$ is given by

$$\left(\hat{\sigma}^{(2)}\right)^2(\delta) = \frac{1}{N}\left(r^{(2)}(\delta) - X^{(2)}(\delta)\hat{\alpha}^{(2)}(\delta)\right)^T \left(r^{(2)}(\delta) - X^{(2)}(\delta)\hat{\alpha}^{(2)}(\delta)\right).$$

Finally, the maximum log-likelihood is calculated as

$$LL^{(2)}(\delta) = -\frac{N}{2}\left(\log\left(2\pi\left(\hat{\sigma}^{(2)}\right)^2(\delta)\right) + 1\right).$$

This set of estimates is referred to as the estimates in reversed order (ERO).

By adopting Bayesian model averaging (see Hoeting et al. [→4]), the results obtained using the ENO and ERO methods can be integrated as follows. First, the logarithm of the mean likelihood is calculated as

$$LL(\delta) = \log\left(\frac{1}{2}\left(\exp\left(LL^{(1)}(\delta)\right) + \exp\left(LL^{(2)}(\delta)\right)\right)\right). \tag{1.4}$$

Therefore, the outliers $\delta$ can be estimated by maximizing the value of $LL(\delta)$, with $\hat{\delta}$ denoting the estimate of $\delta$ obtained by maximizing $LL(\delta)$ in →eq. (1.4).

Then, as a function of the model order $p$ and the number of outliers $k$, the Akaike information criterion (AIC) for the synthesized model is defined by (see Akaike [→1]):

$$\mathrm{AIC}(k, p) = -2\,LL(\hat{\delta}) + 2(p + k + 1), \tag{1.5}$$

and the synthesized estimates for the coefficients are obtained as:

$$\hat{\alpha} = w_1 \hat{\alpha}^{(1)}(\hat{\delta}) + w_2 \hat{\alpha}^{(2)}(\hat{\delta}),$$

with $w_1$ and $w_2$ being weights that can be determined as ratios of the likelihoods:

$$w_1 = \frac{\exp\left(LL^{(1)}(\hat{\delta})\right)}{\exp\left(LL^{(1)}(\hat{\delta})\right) + \exp\left(LL^{(2)}(\hat{\delta})\right)}, \qquad w_2 = 1 - w_1.$$

Moreover, the parameters $k$ and $p$ can be estimated by minimizing the value of $\mathrm{AIC}(k, p)$ defined by →eq. (1.5). Thus, all parameters can be estimated using the maximum likelihood method and the minimum AIC method. However, for each element in the outlier vector, the above estimation process should be repeated while the other elements are held fixed, because the outliers may correlate with each other. Note that in the repetition of the estimation, the AIC should be calculated for the original time series of the remaining component.
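To make the synthesis step concrete, the short Python sketch below combines two maximum log-likelihoods $LL^{(1)}(\hat{\delta})$ and $LL^{(2)}(\hat{\delta})$ into $LL(\hat{\delta})$, the weights $w_1, w_2$, and $\mathrm{AIC}(k, p)$ as in →eqs. (1.4) and (1.5). The numerical inputs are placeholders, not values from this chapter.

```python
# Minimal sketch of the ENO/ERO synthesis (eqs. (1.4) and (1.5)).
import numpy as np

def combine_eno_ero(ll1, ll2, k, p):
    ll = np.logaddexp(ll1, ll2) - np.log(2.0)   # log of the mean likelihood, eq. (1.4)
    aic = -2.0 * ll + 2.0 * (p + k + 1)         # eq. (1.5)
    w1 = 1.0 / (1.0 + np.exp(ll2 - ll1))        # w1 = exp(ll1) / (exp(ll1) + exp(ll2))
    return ll, aic, (w1, 1.0 - w1)

# Placeholder log-likelihoods for illustration only.
ll, aic, (w1, w2) = combine_eno_ero(ll1=812.4, ll2=810.9, k=29, p=40)
```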

1.3.3 Rule for determining the positions of outliers


A common challenge in detecting and estimating outliers is the complex nature of the procedure, especially when the positions of the outliers are not known. Therefore, it is crucial to have some indication of the potential positions of the outliers.

Let $y_t$ represent an observation from the random process $Y_t$, and let $r_t$ be the realized value of the random process $R_t$, with $Y_t = R_t + \delta_t$, assuming that $\delta_t$ is a potential outlier. It is further assumed that $E\{R_t\} = 0$ and $E\{Y_t\} = \delta_t$. Moreover, assuming that $E\{R_t^2\} = C > 0$ and treating $\delta_t$ as a constant, $E\{Y_t^2\}$ can be expressed as follows:

$$E\{Y_t^2\} = E\{R_t^2 + 2R_t\delta_t + \delta_t^2\} = C + \delta_t^2.$$

This shows that $E\{Y_t^2\}$ increases with $\delta_t^2$. Moreover, when $E\{Y_t^2\}$ is large, it is relatively close to $\delta_t^2$, suggesting that the absolute value of $\delta_t$ might be relatively large. Additionally, when $\delta_t = 0$, $y_t^2$ provides an unbiased estimate of $E\{Y_t^2\} = E\{R_t^2\}$.

In summary, we observe that $y_t^2$ tends, with high probability, to be large when $\delta_t^2$ is large. If $y_{t_1}^2 \geq y_{t_2}^2 \geq \cdots \geq y_{t_N}^2$, one can reasonably infer that $y_{t_1}$ contains the largest outlier, $y_{t_2}$ contains the second largest outlier, and so on. In other words, when estimating $k$ outliers, one can set the order in which to search for potential outlier positions as $t_1, t_2, \ldots, t_k$. Then, the positions of outliers can be determined based on this order.
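A minimal Python sketch of this ordering rule: rank the time points of the remaining component by their squared values and take the $k$ largest as the candidate outlier positions. The names are illustrative.

```python
# Rank candidate outlier positions by y_t^2, largest first.
import numpy as np

def candidate_positions(y, k):
    return np.argsort(y ** 2)[::-1][:k]

y = np.array([0.1, -0.2, 2.5, 0.05, -1.8, 0.3])
print(candidate_positions(y, k=2))   # -> [2 4]
```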

1.4 New proposal

1.4.1 Motivation
In the previous section, we explored the maximum likelihood of the AR modeling approach, a
valuable tool for identifying and estimating outliers. However, it occasionally encounters efficiency
limitations. To address these constraints and enhance the outlier detection process, this paper
introduces a highly sensitive indicator known as the ISU. We integrate the ISU with the negative
log-likelihood (NLL) and propose a multiobjective optimization method by employing both NLL
and the ISU as objective functions. The primary motivation behind this proposal is to improve the
efficiency of identifying and estimating outliers while preserving the advantages of the maximum
likelihood of the AR modeling approach.

1.4.2 Defining index of symmetry and uniformity

The vector $r^{(1)}(\delta)$, as defined in →eq. (1.3), can be regarded as a function of the potential outliers $\delta$. For an element in $r^{(1)}(\delta)$, say $r_t(\delta)$, if it satisfies the inequality

$$r_t(\delta) > 0,$$

then we call it a positive element, and when

$$r_t(\delta) < 0,$$

we call it a negative element. Assume that the number of positive elements is $N_1$ and that of the negative elements is $N_2$, with $N_1 + N_2 = N$. So, we can express the set of all positive elements as

$$R^{(+)} = \left\{r_1^{(+)}(\delta), r_2^{(+)}(\delta), \ldots, r_{N_1}^{(+)}(\delta)\right\}$$

and the set of all negative elements as

$$R^{(-)} = \left\{r_1^{(-)}(\delta), r_2^{(-)}(\delta), \ldots, r_{N_2}^{(-)}(\delta)\right\}.$$

If $r^{(1)}(\delta)$ were not influenced by the outliers, then the elements in the vector $r^{(1)}(\delta)$ would be distributed symmetrically around zero and uniformly over time. This is referred to as its symmetry and uniformity.

Here, we consider the variances of the positive elements and the negative elements as follows:

$$\mathrm{Var}\{R^{(+)}\} = \frac{1}{N_1}\sum_{j=1}^{N_1}\left(r_j^{(+)}(\delta) - \bar{r}^{(+)}(\delta)\right)^2,$$

$$\mathrm{Var}\{R^{(-)}\} = \frac{1}{N_2}\sum_{j=1}^{N_2}\left(r_j^{(-)}(\delta) - \bar{r}^{(-)}(\delta)\right)^2,$$

with $\bar{r}^{(+)}(\delta)$ and $\bar{r}^{(-)}(\delta)$ being the averages of the positive elements and the negative elements, respectively. Define the set of the absolute values of all elements in the vector $r^{(1)}(\delta)$ as

$$|R(\delta)| = \{|r_t(\delta)|;\ t = 1, 2, \ldots, N\},$$

and let $\mathrm{Var}\{|R(\delta)|\}$ and $\mathrm{Ave}\{|R(\delta)|\}$ respectively denote the variance and average of the elements in the set $|R(\delta)|$; then by definition, we have

$$\mathrm{Var}\{|R(\delta)|\} = \frac{1}{N}\sum_{t=1}^{N}\left(|r_t(\delta)| - \mathrm{Ave}\{|R(\delta)|\}\right)^2.$$

The above equation can be rewritten as follows:

$$
\begin{aligned}
\mathrm{Var}\{|R(\delta)|\} ={}& \frac{N_1}{N}\mathrm{Var}\{R^{(+)}\} + \frac{N_2}{N}\mathrm{Var}\{R^{(-)}\} \\
&+ \frac{N_1}{N}\left(\mathrm{Ave}\{|R(\delta)|\} - \bar{r}^{(+)}(\delta)\right)^2 + \frac{N_2}{N}\left(\mathrm{Ave}\{|R(\delta)|\} + \bar{r}^{(-)}(\delta)\right)^2.
\end{aligned}
\tag{1.6}
$$

Note that the value of $\bar{r}^{(-)}(\delta)$ is negative. It can be seen that the first two terms on the right-hand side of →eq. (1.6) measure the uniformity of the positive elements and the negative elements, respectively, and the last two terms measure their symmetry. That is, large values of the first two terms mean that the elements are not distributed uniformly, and large values of the last two terms indicate that the symmetry of the positive elements and the negative elements is poor. Thus, $\mathrm{Var}\{|R(\delta)|\}$ can be used as an indicator of symmetry and uniformity.

On the other hand, while our goal is to identify and estimate outliers, we also aim to preserve the structure of the time series. In other words, we want to keep the variation of the time series $r_t(\delta)$ with outliers removed as large as possible. The variation of the time series $r_t(\delta)$ with outliers removed can be measured by its standard deviation. Therefore, using the relative standard deviation (RSD) as an indicator of the prominence of outliers, we consider

$$\mathrm{RSD}(\delta, k) = \frac{\sqrt{\mathrm{Var}\{|R(\delta)|\}}}{\sqrt{\mathrm{Var}\{r^{(1)}(\delta)\}}},$$

where

$$\mathrm{Var}\{r^{(1)}(\delta)\} = \frac{1}{N}\sum_{t=1}^{N} r_t^2(\delta)$$

represents the variance of the elements in the vector $r^{(1)}(\delta)$. Since $\mathrm{RSD}(\delta, k)$ becomes smaller when $\mathrm{Var}\{|R(\delta)|\}$ decreases and $\mathrm{Var}\{r^{(1)}(\delta)\}$ increases, minimizing $\mathrm{RSD}(\delta, k)$ can serve as a measure of the prominence of outliers.

Furthermore, when using $\mathrm{RSD}(\delta, k)$ in conjunction with the log-likelihood, we can refer to the logarithm of the RSD as the ISU and use it as one of the objective functions for outlier detection:

$$\mathrm{ISU}(\delta, k) = \log\left(\mathrm{RSD}(\delta, k)\right) = \frac{1}{2}\left(\log\left(\mathrm{Var}\{|R(\delta)|\}\right) - \log\left(\mathrm{Var}\{r^{(1)}(\delta)\}\right)\right). \tag{1.7}$$

By minimizing $\mathrm{ISU}(\delta, k)$ with respect to $\delta$ and $k$, $\hat{\delta}$ and $\hat{k}$ can be obtained as estimates of the outliers $\delta$ and their number $k$. This method for identifying and estimating outliers is referred to as the minimum ISU method. It provides an efficient means to achieve these goals while taking advantage of the maximum likelihood of the AR modeling approach.

1.4.3 Multiobjective optimization for outlier detection


In the preceding section, along with the maximum likelihood of the AR modeling approach, the minimum ISU method was introduced as one of the outlier-handling methods. However, depending solely on the minimum ISU method may carry the risk of compromising the original structure of the time series through outlier removal. For example, the elimination of outliers might substantially alter periodic variations, a concern particularly relevant in the analysis of economic cycles. This challenge motivates the exploration of a multiobjective optimization approach, combining the AR modeling method with the minimum ISU method.

Here, we introduce the following composite objective function:

$$f(\delta \mid \lambda) = \mathrm{ISU}(\delta, \hat{k}) + \lambda\,\mathrm{NLL}(\delta), \tag{1.8}$$

by combining the negative log-likelihood $\mathrm{NLL}(\delta) = -LL(\delta)$, with the log-likelihood $LL(\delta)$ defined in →eq. (1.4), and $\mathrm{ISU}(\delta, \hat{k})$, defined in →eq. (1.7). Here, $\lambda > 0$ is a parameter representing the relative weight of the influence of $\mathrm{NLL}(\delta)$ compared to $\mathrm{ISU}(\delta, \hat{k})$. Note that $\hat{k}$ is the estimate of $k$ obtained in advance using the minimum ISU method. For a given value of $\lambda$, we aim to minimize the objective functions $\mathrm{ISU}(\delta, \hat{k})$ and $\mathrm{NLL}(\delta)$ simultaneously by minimizing $f(\delta \mid \lambda)$ with respect to $\delta$. Setting a higher value for $\lambda$ causes the search for outliers to be driven primarily by the minimization of $\mathrm{NLL}(\delta)$, while reducing it places greater emphasis on $\mathrm{ISU}(\delta, \hat{k})$ in the search for outliers.

What needs to be emphasized is that the minimization of each function can only be achieved numerically, which makes theoretical treatments such as Pareto optimization challenging. Therefore, an empirical approach is employed: the value of $\lambda$ is set through trial and error to minimize the function $f(\delta \mid \lambda)$, and the values of $\mathrm{ISU}(\delta, \hat{k})$ and $\mathrm{NLL}(\delta)$ are computed. Results with both $\mathrm{ISU}(\delta, \hat{k})$ and $\mathrm{NLL}(\delta)$ being relatively small are considered.

A typical procedure for the proposed approach will be illustrated with an example in the following section.
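A minimal Python sketch of this trial-and-error scan is given below. The `isu_of` and `nll_of` functions are toy quadratic placeholders standing in for the ISU and negative log-likelihood computations described above; they, the starting point, and all names are assumptions for illustration only.

```python
# Minimal sketch: minimize f(delta | lambda) = ISU + lambda * NLL over a grid
# of lambda values and record the resulting (ISU, NLL) pairs for inspection.
import numpy as np
from scipy.optimize import minimize

def isu_of(delta):
    return 0.5 * np.sum((delta - 1.0) ** 2)    # placeholder for ISU(delta, k_hat)

def nll_of(delta):
    return 5.0e4 * np.sum((delta - 1.2) ** 2)  # placeholder for NLL(delta) = -LL(delta)

def scan_lambdas(delta0, lambdas):
    results = []
    for lam in lambdas:
        f = lambda d: isu_of(d) + lam * nll_of(d)     # composite objective, eq. (1.8)
        opt = minimize(f, delta0, method="Nelder-Mead")
        results.append((lam, isu_of(opt.x), nll_of(opt.x)))
    return results   # keep lambdas for which both ISU and NLL are relatively small

lambdas = 1e-5 + 5e-6 * np.arange(20)                 # 20-step grid as in Section 1.5
for lam, isu_val, nll_val in scan_lambdas(np.zeros(3), lambdas):
    print(f"lambda={lam:.1e}  ISU={isu_val:.3f}  NLL={nll_val:.3f}")
```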
1.5 Illustrative example
For the illustrative example, we selected the analysis of the index of industrial production (IIP) in
Japan. The purpose of this example is to showcase the performance of the proposed approach
using the analysis results and to illustrate the typical procedure involved in implementing the
proposed approach.
The data for this analysis were obtained from the Japanese Cabinet Office website [→5]. These
data represent a monthly time series spanning from January 1975 to December 2019,
encompassing a total of N = 540 months.
→Figure 1.1 depicts a plot of the time series for log-IIP. It is important to note that →Figure
1.1 does not provide a unit for the vertical axis. This is because the time series displayed in this
figure is an index, and the absence of units applies to the subsequent figures as well.

Figure 1.1: Time series for log-IIP in Japan (January 1975 to December 2019).

We transformed the data by taking the logarithm, which became our target for analysis. The time
series exhibited a significant decline around February 2009, which can be attributed to the
consequences of the financial crisis that occurred between 2007 and 2008.
For the decomposition of log-IIP, we applied the moving linear model approach as proposed
by Kyo and Kitagawa [→9]. To determine the optimal value of WTI, we calculated the log-likelihood
over a range of WTI values from 3 to 72 using the model described in →eq. (1.1). →Figure 1.2
presents the log-likelihood values plotted against the corresponding WTI values. The WTI value of
57 was selected as it corresponded to the highest log-likelihood.
Figure 1.2: Log-likelihood values versus the WTI values for the model in →eq. (1.1).

→Figure 1.3 presents the results of the decomposition. In →Figure 1.3(a), the time series of the
constrained component exhibits a smooth trend, while →Figure 1.3(b) reveals that the time series
of the remaining component displays cyclical variations, suggesting the presence of business
cycles in Japan. Notably, a substantial portion of the sharp decline observed around February 2009
can be attributed to the remaining component. As a result, terms stemming from this abrupt
variation need to be identified and addressed as outliers, as they have the potential to disrupt the
analysis of business cycles.

Figure 1.3: Results of decomposition for log-IIP in Japan.


Now, we will demonstrate the procedure and results of anomaly detection and removal within the
remaining component. Initially, we set the number of potential outliers to 40 and used the
minimum ISU method to estimate both the count k and their respective positions, following the
rule for determining the positions of outliers. As a result, with $\hat{k} = 29$, we achieved the minimum ISU value of $-0.5033$, and the identified outlier positions are 410, 411, 412, 409, 413, 414, 415, 416,
408, 435, 401, 312, 434, 398, 399, 400, 397, 433, 429, 311, 403, 432, 308, 310, 394, 396, 442, 436, and
402. Subsequent analyses were based on these results.
Furthermore, building upon the aforementioned outcomes, we conducted the estimation of
outliers using the minimum ISU method. →Figure 1.4(b) illustrates the time series containing
outliers, while the outlier-adjusted time series is displayed in →Figure 1.4(a). It is noticeable from
→Figure 1.4(a) that the outlier-adjusted time series exhibits a more uniform variation. However, it
is also evident that the cyclical variations in the time series at the locations where outliers were
identified are less discernible.

Figure 1.4: Results of the detection and estimation of outliers using minimum ISU method.

Furthermore, based on the estimated number and locations of anomalies obtained above, we
determined the estimated order p of the AR model using the minimum AIC method within the
range of 1–45. As a result, the estimated value for $p$ is $\hat{p} = 40$, with a corresponding AIC value of $-3149.12$.

Therefore, we performed the estimation of anomalies using the minimum AIC method.
→Figure 1.5(b) depicts the time series containing outliers, while →Figure 1.5(a) displays the
outlier-adjusted time series. Observations from →Figure 1.5(a) reveal the following: in the areas
where anomalies occur, there is a noticeable periodic variation in the outlier-adjusted time series.
Nevertheless, due to the incomplete removal of anomalies, deep troughs are still present, causing
an asymmetric fluctuation in the time series.
Figure 1.5: Results of the detection and estimation of outliers using minimum AIC method.

Subsequently, we calculated the values of the function $f(\hat{\delta} \mid \lambda)$ for each $\lambda$ value. The values of $\lambda$ increased in 20 steps, with each increment being $5 \times 10^{-6}$, starting from $1 \times 10^{-5}$, resulting in the values $\lambda_1 = 1 \times 10^{-5}, \lambda_2 = 1.5 \times 10^{-5}, \ldots, \lambda_{20} = 1.05 \times 10^{-4}$. These calculations were performed with $\hat{k} = 29$, $\hat{p} = 40$, and the estimated positions of outliers. Additionally, we obtained the estimates $\hat{\delta}$ of $\delta$ corresponding to the minimum values of the function $f(\delta \mid \lambda)$ defined in →eq. (1.8) for each $\lambda$ value, and computed the corresponding values of ISU and NLL.
→Figure 1.6(a) illustrates the variation of the function $f(\hat{\delta} \mid \lambda)$ for each $\lambda$ value, and →Figure 1.6(b) shows scatter plots of ISU and NLL corresponding to each $\lambda$ value (numbers indicate the index of $\lambda$). It is important to note that the points located in the lower left of →Figure 1.6(b) represent the results for $\lambda_6$ and $\lambda_{14}$, which are relatively advantageous from the perspectives of both the minimum ISU and the minimum NLL (which agrees with the maximum likelihood) methods. Since these results are approximately the same, we present the results corresponding to the relatively favorable $\lambda = \lambda_6 = 3.5 \times 10^{-5}$ in →Figure 1.7. As evident from →Figure 1.7, the results corresponding to $\lambda = 3.5 \times 10^{-5}$ appear to be favorable from both the minimum ISU method and the minimum NLL method perspectives.


Figure 1.6: (a) Variation of $f$ with respect to $\lambda$; (b) scatter plots between ISU and NLL.

Figure 1.7: Results of the detection and estimation of outliers with $\lambda = 3.5 \times 10^{-5}$.

For reference, we also showcase the results corresponding to an extreme case of $\lambda = 1 \times 10^{-2}$ in →Figure 1.8. As can be seen from →Figure 1.8, these results are akin to those obtained using the minimum AIC method in →Figure 1.5. In such a case, a balance is not achieved.

Figure 1.8: Results of the detection and estimation of outliers with $\lambda = 1 \times 10^{-2}$.

Taking into account the outcomes discussed above, the proposed method is strongly indicated to demonstrate exceptional performance. The results, as showcased in →Figure 1.7, underscore the method's capacity to effectively identify the most favorable $\lambda$ values, thus optimizing the balance between the minimum ISU and minimum AIC approaches. This highlights the robustness of the proposed methodology in addressing anomalies and model selection, ultimately contributing to its high-performance capabilities.

1.6 Summary
This study begins by providing an overview of the constrained-remaining component
decomposition method introduced in Kyo and Kitagawa [→9], which is employed to separate the
original time series into a trend component and a stationary time series containing outliers. Our
aim is to develop a method for identifying and estimating these outliers from the stationary time
series. Subsequently, we perform an in-depth review of Kyo’s [→8] maximum likelihood of the AR
modeling approach, focusing on the exploration and estimation of anomalies.
The AR modeling approach, rooted in the minimum AIC method, is advantageous in
preserving periodic variations in time series while handling anomalies. However, it occasionally
faces challenges in efficiently processing anomalies. To address this, we introduce the ISU
indicator and propose the minimum ISU method, with the goal of enhancing the efficiency of
anomaly handling. It is worth noting that the minimum ISU method, while efficient in managing
anomalies, may not fully retain periodic variations in time series.
In response, we present a multiobjective optimization approach that combines the minimum
ISU and minimum AIC methods through a composite objective function. This approach strives to
strike a balance between outlier detection and model selection.
To illustrate the procedure and performance of our proposed approach, we apply it to the
analysis of monthly index of industrial production data spanning from January 1975 to December
2019. This example serves as compelling evidence of the exceptional performance of our method.
The results emphasize the robustness of our methodology in addressing anomalies and model
selection, ultimately highlighting its high-performance capabilities.
It is important to acknowledge that this paper primarily focuses on a single domain, and
ideally, should be extended to multiple domains. Additionally, the absence of an evaluation metric
in the illustrative example is recognized. The quantification of the benefits of the proposed
method for comparative analysis with other techniques is an important aspect that could not be
comprehensively addressed due to space constraints. This study is considered a preliminary
endeavor, and addressing these issues is reserved for future research.

References
[1] H. Akaike, A new look at the statistical model identification. IEEE Transactions on Automatic Control, vol. AC-19, pp. 716–723, 1974.
[2] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods, 2nd Edition, New York: Springer-Verlag, 1991.
[3] A. J. Fox, Outliers in time series. Journal of the Royal Statistical Society B, vol. 34, pp. 350–363, 1972.
[4] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, Bayesian model averaging: A tutorial. Statistical Science, vol. 14, pp. 382–417, 1999.
[5] Japanese Cabinet Office, Coincident index. https://www.esri.cao.go.jp/en/stat/di/di-e.html, 2023.
[6] G. Kitagawa, Introduction to Time Series Modeling with Application in R, 2nd Edition, New York: Chapman and Hall, 2021.
[7] G. Kitagawa and T. Takanami, Extraction of signal by a time series model and screening out micro-earthquakes. Signal Processing, vol. 8, pp. 303–314, 1985.
[8] K. Kyo, An approach for the identification and estimation of outliers in a time series with a nonstationary mean. Proceedings of The 2023 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'23), Conference Publishing Services, pp. 1477–1482, 2024.
[9] K. Kyo and G. Kitagawa, A moving linear model approach for extracting cyclical variation from time series data. Journal of Business Cycle Research, vol. 19, pp. 373–397, 2023.
[10] K. Kyo, H. Noda, and G. Kitagawa, Co-movement of cyclical components approach to construct a coincident index of business cycles. Journal of Business Cycle Research, vol. 18, pp. 101–127, 2022.
[11] D. Pena, Influential observations in time series. Journal of Business & Economic Statistics, vol. 8, pp. 235–241, 1990.
[12] H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang, Time-series anomaly detection service at Microsoft. https://doi.org/10.1145/3292500.3330680, 2019.
[13] J. H. Stock and M. W. Watson, A probability model of the coincident economic indicators. In: K. Lahiri and G. Moore (Eds.), Leading Economic Indicators: New Approaches and Forecasting Records, pp. 63–89, Cambridge University Press, 1991.
[14] G. K. Vishwakarma, C. Paul, and A. M. Elsawah, An algorithm for outlier detection in a time series model using backpropagation neural network. Journal of King Saud University – Science, vol. 32, pp. 3328–3336, 2020.
2 Using the intentionally linked entities (ILE) database system
to create hypergraph databases with fast and reliable
relationship linking, with example applications

Vitit Kantabutra

Abstract
Hypergraphs can be a stronger data model than ordinary graphs for a database in many
applications, because hypergraphs allow complex, non-binary relationships to be represented
directly. However, when it comes to actual implementation, extant hypergraph database systems
appear to be weak because they are not implemented with sufficiently powerful, modern linked
data structures. For example, one hypergraph database was implemented based on, of all things,
an ordinary graph database system, which lacks a strong mechanism for implementing the
linkages needed by relationships of higher arities. ILE, on the contrary, was specifically designed
to handle linkages of complex, possibly high-arity relationships securely and efficiently. This
chapter shows the details involved in the implementation of a hypergraph database using ILE,
including what to use for the linkages of high-degree hyperedges. Also presented here are several
applications, including some that have never been presented anywhere else.
Keywords: ILE DBMS, hypergraph DBMS, graph DBMS, non-Relational, NoSQL, linked data,
pointers

2.1 Introduction
Around the turn of the millennium, it was widely believed that the Relational database
management system (RDBMS) [→1] was the solution to virtually all database problems. After all,
what could be better than a collection of simple 2D tables that can represent arbitrary entity sets
as well as arbitrary sets of relationships of all arities. Searches can be sped up with index
structures such as B-trees and related data structures.
However, there are significant problems with the RDBMS. In the Relational database system,
keys are equated by means of string matching. This process is error-prone because keys can easily
be misspelled, and blanks and other invisible characters that are unintentionally inserted can
easily result in the lack of a relationship between entities that are meant to be related to each
other, or unintentional relationships can be formed almost as easily. The extremely inefficient
process of table joins is another problem with the RDBMS. Robinson et al., proponents of the
graph database systems, even made the exaggerated claim that RDBMS “lacks relationships”
[→2].
Less known is the fact that database indexes for the RDBMS are typically stored in persistent
storage, and, despite clever caching, are probably not fast enough to feed information to fast
applications and devices such as video players. Yet another problem with the RDBMS is that there
is no one, well-unified place such as a single object that stores all data pertaining to a data entity
[→19].
There are, of course, very positive qualities that the RDBMS possesses, beyond its simplicity.
One such quality is the organization of entity sets into tables. Relationship sets are also organized
into tables. Such organization may seem obvious, but not all database systems are so well-
organized.
In recent years, however, non-Relational database systems have emerged to offer users
systems that don’t have the same shortcomings as the RDBMS. For example, graph database
systems are now offered by several organizations. In a graph database, data entities are linked in
a more robust way than in an RDBMS in order to represent a relationship, and traversing these
links is a fast process. However, in a graph database, there is no clear organization of like entities
into a single entity set. Likewise, relationships that should belong to the same relationship set are
not bundled together in any way.
Additionally, graph databases are made for binary relationships. Higher-arity relationships
cannot be represented well, if at all, in a graph database.
To overcome the lack of higher-arity relationships in a graph database while retaining its
strong relationship-linking capabilities, the hypergraph database has been considered. There
appear to be only three implementations of hypergraph database systems in the publicly available
literature, namely [→3]:
1. HypergraphDB,
2. TypeDB, and
3. GraphBrain.
Of these, HypergraphDB seems to be the only one to have documentation that is detailed
enough to determine what data structures are used, and to determine something about the
efficiency of database operations. HypergraphDB’s documentation reveals that the B+ Tree, also
used in Relational databases, is the indexing mechanism used for accelerating database searches.
As discussed above, using such an external indexing data structure makes relationship link
traversal slow compared to using an “internal” data structure that is an integral part of the core of
the database itself.
In this chapter we will see that ILE can be used to represent any hypergraph database so that
relationships, including those with higher arities, can be represented directly using a very fast
internal data structure. Additionally, ILE represents each entity set as an object, and each
relationship set is likewise represented as an object. These objects could be added to a
hypergraph database implemented with ILE, even though the hypergraph data model itself
doesn’t have them.
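To make the contrast concrete, here is a small conceptual sketch, written in Python for brevity (ILE itself is implemented in Ruby), of the linking idea described above: a hyperedge of arbitrary arity holds direct references to its member entities, and each entity keeps back-references to the hyperedges it participates in, so traversal follows pointers rather than matching keys or consulting an external index. The class names, attributes, and example data are illustrative and are not ILE's actual API.

```python
# Conceptual sketch of direct-reference relationship linking for hyperedges.
class Entity:
    def __init__(self, name, **attrs):
        self.name, self.attrs = name, attrs
        self.edges = []                       # back-references to hyperedges

class HyperEdge:
    def __init__(self, label, *members):      # arbitrary arity
        self.label, self.members = label, list(members)
        for m in self.members:
            m.edges.append(self)               # direct, verifiable linkage

# A ternary "drives" relationship: person drives vehicle along route.
alice = Entity("Alice")
truck = Entity("Truck 7")
route = Entity("Route A (illustrative)")
trip = HyperEdge("drives", alice, truck, route)

# Traversal is pointer-chasing, with no string matching of keys:
for edge in alice.edges:
    print(edge.label, [m.name for m in edge.members])
```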
Just for reference or comparison purposes, we should take note of how relationship linkage is
done in some graph databases. According to Kemnitz [→4], the popular graph database Neo4j,
which is also one of the oldest databases in this genre, stores in each node a list of records that
represent the node’s relationships with other nodes. A physical pointer seems to exist in each
such record, pointing to the other node. Some of the other graph databases, however, may store
in the node some form of identification of the destination nodes, necessitating an index search for
the actual location of such destination nodes themselves.
The rest of this chapter comprises the following sections: Section II introduces the ILE
database and explains the mechanism by which that database system can be used to represent
graphs and hypergraphs. The section includes how ILE can be extended to accommodate a better
representation of hypergraphs, including hypergraphs with hyperedges of variable arity, useful
for efficiently and robustly representing groups of nodes on social media, which is a situation
where ordinary graph database systems cannot perform well.
Section III explores the major advantages of ILE over existing databases, namely that ILE
allows more efficient and secure relationship linking and traversal.
Section IV examines ILE as a means for implementing a particular hypergraph database. The
database involves ternary relationships, making it unwieldy to represent with, say, an ordinary
graph. The example is benchmarked for its relationship linkage traversal efficiency. More
specifically, the example is a travel map consisting of a few locations in Idaho, which can obviously
be extended to a full-blown map with many more entity sets and relationship sets.
Section V compares the efficiency of relationship linkage between ILE and Neo4j in social
network graph databases. For this comparison, we use a simple model of a social network graph
found in [→5].
Sections VI and VII present results not previously published in [→6], the publication that this
current chapter extends. Section VI discusses transportation networks beyond the small example
given in Section IV. In particular, we will discuss a subway network and how it could be better
represented as a hypergraph than a graph, which is how it was represented in [→7, →8]. Section
VII discusses algorithms for Web page ranking, which are of great importance for search engines
needing to rank Web pages. Heintz and Chandra [→9] showed why a hypergraph is a superior
model to regular graphs for Web page ranking and other analytics involving social grouping.
However, they rely on regular graph and multigraph physical data representation to represent
hypergraphs. In this chapter, it will be shown that the new direct hypergraph representation can
represent social groups and general grouping of Web pages and of nodes in general much more
efficiently and more naturally, and seemingly also securely, than can graph and multigraph
software.

2.2 The ILE database system and its use for representing a hypergraph database
The ILE database idea [→17] occurred to the author of this chapter while teaching a database
course in 2007, using the first edition of Garcia-Molina et al.’s Database Systems book [→10]. That
textbook presents the narrative that Chen’s ER data model [→11] is the best data model for most
situations, yet when it comes to implementation, the Relational model is chosen instead. The
reason is that the ER model was never implemented, whereas the Relational model, though it
predates the ER model, is often thought of as a simple approximation to it. Because the Relational
database system has only one kind of data structure, the relation (ignoring external indexes),
relationships must be linked by matching key values stored as text. This text matching is prone to
errors caused by misspelled or otherwise incorrect keys, and it is even possible to unintentionally
relate entities that were never meant to be related.
Given all these pros and cons of the Relational database system, the author of the current
chapter thought that perhaps it was better to use physical pointers to do relationship linkages,
and only perform a linking when there is a verifiable entity to link with. Also, there is no real need
to limit the data structures to just tables. Back in the years leading to 1970, which was the date of
Codd’s seminal paper [→1], the simplicity of a two-dimensional array made sense. Programming
with complicated data structures, especially those involving pointers, was not commonplace.
Object-oriented programming was not well known.
Returning to Chen’s paper, his emphasis was to come up with a database model that
encompasses all the major extant database models, namely, the Network model, the Relational
model, and the Entity model. He critiqued all three and argued that each can be thought of as a
derivative of his own Entity-Relationship (ER) model. He then introduced a diagramming method,
later known as the ER diagram, for designing database schemata, which is still in use today. He did
not, however, cover how to write the code needed to implement the ER data model as a working,
or even prototypical, database system.
As the decades went by beyond the ’70s, the popularity of Codd’s RDBMS exploded. Meanwhile,
Chen’s ER diagramming method became the standard database design method, at least for
students and for serious database designers. However, Chen’s model was never implemented as
database software, apart from the view held by some that the Relational system is an
implementation of the ER model, which is not quite accurate. In fact, it seemed to be widely
believed that the ER model is unimplementable.
This author’s thought, in 2007, was that with all the advancements in programming languages
and techniques since 1970, there must be a way to implement a better general-purpose database
system than the RDBMS, such as something more like the ER model. There should no longer be a
need to stick with making everything (except external indexes) a two-dimensional array.
Details of the ILE
ILE was implemented in the programming language Ruby. Ruby is a very good language for
representing ILE because it is simple to create and manipulate complex data structures that
involve pointers (Ruby references). Also, garbage collection is handled by the Ruby runtime.
In an ILE database, there is first an object representing the entire database. There is also an
object representing a set of databases, the DatabaseSet object, in case we need to keep an entire
set of different but related databases together. In the future, there could be search features or
other features for an entire set of related databases. A database object contains the database
name and metadata for the database, for example, some text describing the database. A database
is further constructed from the following components:
1. Entity Sets Objects. These are a natural way to represent entity sets, which are the same as
the entity sets in the ER model or the older Entity Set model [→12]. In the original paper on
ILE [→13], each entity set was represented by a table, but in subsequent papers, the table
implementation has been replaced by an entity set object implementation. The database
object mentioned above keeps track of all entity set objects by means of a Ruby hash, using
each entity set’s name as a key. An entity set object has
1. a reference back to the database object,
2. an entity set name, any string, but meant to be used as a searchable hash key,
3. a hash to external indices in case of need. This is usually not needed and so far has not
been implemented,
4. key attribute names, stored as an array,
5. key attribute types. The types can be defined in whatever way that would suit the
database being represented. Stored as a hash, with the key attribute names serving as
the hash keys,
6. nonkey attribute names, stored as an array,
7. nonkey attribute types, stored as a hash,
8. the entities belonging to this entity set, all stored as a Ruby hash (which, of course,
means that only a handle to the hash is stored with the entity set object itself). If there
is more than one key attribute, the entities are stored as a multilevel hash, with the
number of levels equaling the number of key attributes and the key attributes used as
the hash keys, and lastly,
9. each entity set has a hash of relationship sets whose relationships involve entities in this
entity set.
2. Entity Objects. These objects represent data entities. Not only are these objects referenced
by the appropriate entity set object, but each of these entity objects also has
1. a reference to the entity set object that represents the entity set this entity is part of,
2. a reference to the current database (not needed but convenient to have),
3. a hash of key values, hashed by the key field name,
4. a hash of non-key values, hashed by non-key field name, and
5. a hash to an array of relationship objects, using the relationship’s relset (relationship
set) name as the hash key. This entity may participate in more than one relationship,
and hence we need an aggregate such as an array of relationship objects.
3. Relset (Relationship Set) Objects. These objects represent relationship sets. A relationship
set is a set of like relationships with the same sequence of roles, each role played by an
entity from a specific entity set. A relset object has the following information:
1. Relset name. Any string, but meant to be used as a searchable hash key,
2. a reference back to the database,
3. a longer description of the relset than the relset name, if needed,
4. role names, implemented as a Ruby Array but possibly reimplementable as a Set,
5. a description of relationship attributes, if there are any. Could be upgraded to be a
more structured container for relationship attributes,
6. all relationships belonging to this relset, currently implemented as an Array but could
be reimplemented as a Set, and
7. an indexing aid for the relationships in the relset, in the form of a data structure called
aha_rel, which stands for an array of hashes of arrays of relationships.
4. Relationship Objects. These objects represent the relationships, containing:
1. a reference to the Relset to which this relationship belongs,
2. a reference to the database (not needed but convenient),
3. a hash of all roles (entities playing various roles), and
4. a hash of relationship attributes, if any.
Perhaps the most complicated task in setting up an ILE database is the installation of each
relationship in each entity object’s @relationship field.
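To make this installation step concrete, the following is a minimal, self-contained Ruby sketch of the kind of bidirectional linking described above. The class names (SimpleEntity, SimpleRelationship) and the structure shown are illustrative stand-ins, not the actual ILE source code.

# Illustrative sketch only: a stripped-down entity/relationship pair showing how a
# relationship object can be installed in each participating entity's hash of relationships.
class SimpleEntity
  attr_reader :key, :relationships
  def initialize(key)
    @key = key
    @relationships = {}               # relset name => Array of relationship objects
  end
end

class SimpleRelationship
  attr_reader :relset_name, :roles
  def initialize(relset_name, roles)  # roles: role name => participating entity
    @relset_name = relset_name
    @roles = roles
    roles.each_value do |entity|      # install this relationship in every participant
      (entity.relationships[relset_name] ||= []) << self
    end
  end
end

boise = SimpleEntity.new('Boise')
poky  = SimpleEntity.new('Pocatello')
SimpleRelationship.new('travel', 'origin' => boise, 'destination' => poky)
puts boise.relationships['travel'].first.roles['destination'].key   # => Pocatello

In the actual ILE classes, the entity set, database back-references, and attribute hashes described above would of course also be populated; the sketch isolates only the relationship installation step.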
To implement a hypergraph database in ILE, we represent each node as an entity, with an
entity set for each node type in the hypergraph. While edges could also be represented as entities
in ILE, it’s more efficient to represent them as relationships. This approach works if edges only link
nodes. However, to allow edges to link both nodes and other edges, we need a new class,
EdgeNode, which combines the attributes and methods of both entities and relationships. If ILE
were implemented in a language that supports multiple inheritance, we would use that. However,
since Ruby, the language used for ILE, doesn’t support multiple inheritance, we define EdgeNode
as a subclass of Entity and manually add the attributes and methods from the Relset class, either
directly or through Mixins.
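The following minimal Ruby sketch illustrates this workaround; the simplified Entity class, the RelsetBehavior module, and its methods are illustrative only and are not the actual ILE source code.

# Illustrative sketch: giving an Entity subclass relset-like behavior via a Mixin,
# since Ruby does not support multiple inheritance.
class Entity
  attr_reader :keys
  def initialize(keys)
    @keys = keys                      # hash of key values, as in an ordinary entity
  end
end

module RelsetBehavior                 # relset/relationship-like attributes and methods
  def roles
    @roles ||= {}                     # role name => node (or another EdgeNode) being linked
  end

  def link(role_name, node)
    roles[role_name] = node
  end
end

class EdgeNode < Entity               # an edge that can itself be linked like a node
  include RelsetBehavior
end

edge = EdgeNode.new('edge_id' => 'e1')
edge.link('member', Entity.new('cityname' => 'Boise'))
puts edge.roles['member'].keys['cityname']   # => Boise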

2.3 Why is relationship linking fast in ILE?


In this section, we explain why relationship linking is fast in ILE. While graph databases typically
have rapid relationship linking (edge traversal), the existing hypergraph database, HypergraphDB,
performs poorly in this area, as demonstrated by the extensive benchmarking reported in [→14].
HypergraphDB’s slow relationship linking is caused by the sluggish process of transitioning from
an entity to a relationship.
Linking from an entity to a relationship in HypergraphDB relies on a B+ index tree, which is a
larger and slower data structure compared to the Ruby hash used in ILE or the direct Java-based
linkage employed by Neo4J.
In ILE, linking an entity to a relationship is highly efficient thanks to a Ruby hash keyed by the
relationship set (relset). All relationships in a relset share the same key and are stored in an Array
(or optionally a Ruby Set). This Array of relationships can be “indexed” by replacing it with a Hash,
using the values of any other roles as keys. Additionally, a multilevel hash (a Hash of Hashes, etc.)
could be used, with each level keyed by the value of a specific role in the relationships. Although
this multilevel hashing approach hasn’t been implemented in ILE, it remains a promising and
feasible idea.
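As a small, self-contained illustration of this hash-keyed linking (using plain Ruby hashes and arrays rather than the actual ILE classes):

# Illustrative sketch: an entity keeps a hash of relationship Arrays keyed by relset
# name, so hopping from an entity to its relationships is a single hash lookup, and
# each relationship holds direct references to the values playing its roles.
entity_relationships = Hash.new { |h, k| h[k] = [] }    # relset name => Array of relationships

travel_rel = { 'origin' => 'Boise', 'destination' => 'Pocatello', 'mode' => 'car' }
entity_relationships['travel_times'] << travel_rel

# Entity-to-relationship hop: one hash lookup, then an Array scan (or a further hash
# lookup if the Array has been replaced by a Hash keyed on the value of another role).
destinations = entity_relationships['travel_times'].map { |r| r['destination'] }
puts destinations.inspect    # => ["Pocatello"]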
2.4 An example: a small hypergraph database with ternary relations, with
benchmarking
We will now examine a basic example of modeling a Hypergraph database using ILE. In this case,
we will construct a database representing several cities in Idaho and the travel times between
them across different modes of transportation. Although ILE provides a dedicated language for
data entry and querying, Ruby code will be utilized for this demonstration. The database will
include an entity set labeled “cities,” with each entity corresponding to a city within the dataset.
The primary key attribute for this entity set is the city name, referred to as “cityname.”
Additionally, the non-key attributes will include the latitude and longitude of a specific point in the
city. The following Ruby code illustrates the creation of a new database and the configuration of
the “cities” entity set (→Figure 2.1).

Figure 2.1: Example Ruby code used to create a new database and the configuration of the
“cities” entity set.

ILE enables users to define and manage data types. In this basic example, we will rely on Ruby to
manage the data types and will not implement any specialized code to process the data entered
into the database. The following demonstrates how entities can be added to the entity set, and
consequently, to the database:

poky_entity  = Entity.new(cities_es, {'cityname' => 'Pocatello'},
                          {'lat' => 42.8621, 'long' => -112.4506})
if_entity    = Entity.new(cities_es, {'cityname' => 'Idaho Falls'},
                          {'lat' => 43.4927, 'long' => -112.0408})
boise_entity = Entity.new(cities_es, {'cityname' => 'Boise'},
                          {'lat' => 43.6150, 'long' => -116.2023})
sv_entity    = Entity.new(cities_es, {'cityname' => 'Sun Valley'},
                          {'lat' => 43.6971, 'long' => -114.3517})

Another entity set, representing the mode of transportation, is created using the code in →Figure
2.2, while the entities representing travel modes can be included in the entity set using the code in
→Figure 2.3.

Figure 2.2: Example code of the entity set representing the mode of transportation.

Figure 2.3: Example code of entities representing travel modes being added to the entity set.

The next step in building the database is to define the single relationship set, or Relset. The
relationships within the Relset have three roles: origin, destination, and transportation mode.
These relationships correspond to hypergraph edges, where each edge connects three nodes
representing the roles. The following code (→Figure 2.4 (a)) demonstrates how these roles are
defined in the Relset construction, and →Figure 2.4 (b) shows how relationships can now be made
according to these examples.
Figure 2.4: (a) Code that defines the roles in the Relset construction. (b) Demonstrating the
establishment of relationships based on the roles defined in the Relset construction.

The final step is to assess the speed of traversing this hypergraph database. The results will reflect
the efficiency of traversing a typical hypergraph link in ILE, under ideal conditions without system
constraints like disk block swapping due to database size. Additionally, we will explore another
example – a simple graph instead of a hypergraph – and compare its performance in ILE with
Neo4J, a well-known, highly efficient graph database, in the next section.
→Table 2.1 presents the execution results on a 2019 13-inch MacBook Pro equipped with a
quad-core Intel Core i5 processor, operating at a 2.4 GHz clock speed. The system features 8 GB of
LPDDR3 RAM running at 2.133 GHz, while the secondary storage consists of a 500 GB flash drive.

Table 2.1: Run results for the transportation database of this section.


Traversals from Boise to Pocatello (3 links)    User time (sec)    System time (sec)    User + system time (sec)    Time per link
1             0.000110    0.000056    0.000166    53.3 μs
10,000        0.001249    0.000086    0.001335    44.5 ns
100,000       0.009333    0.000085    0.009418    31.4 ns
1,000,000     0.090515    0.000215    0.090730    30.2 ns
The experimental results show that, as the number of traversals increases, the time per link
approaches roughly 30 nanoseconds. During each run, the system loads the caches with a working
set of disk blocks containing the links to be traversed. Since the total memory required for the
experiment is relatively small, no additional swapping of disk blocks seems to occur.
Based on this reasoning, we can express the total running time (user time + system time) as
the sum of a constant, c, representing the time to fill the caches, and the product of n (the number
of traversals from the starting node, Boise, to the final node, Pocatello) and t, the time it takes to
complete one traversal of the three-link path (so the per-link time is t/3).
Thus, we have ttotal = c + n · t. Using the n = 1 and n = 1,000,000 cases, we can solve for c and t,
obtaining approximately c = 0.0001659 s and t = 9.0564 × 10⁻⁸ s, that is, about 30.2 ns per link.
Comparing these figures, it is clear how much larger the cache-filling time is than the traversal
time. This outcome is not unexpected, even for a computer equipped with a flash-based storage
drive.
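For reference, the following short Ruby calculation reproduces these estimates from the two measured totals in →Table 2.1:

# Solve t_total = c + n * t using the n = 1 and n = 1,000,000 rows of Table 2.1.
t1  = 0.000166                        # total time for 1 traversal (s)
t1m = 0.090730                        # total time for 1,000,000 traversals (s)
t = (t1m - t1) / (1_000_000 - 1)      # time per traversal of the three-link path
c = t1 - t                            # cache-filling time
puts format('t = %.4e s per traversal (%.1f ns per link), c = %.4e s', t, t / 3 * 1e9, c)
# => t = 9.0564e-08 s per traversal (30.2 ns per link), c = 1.6591e-04 s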

2.5 Benchmarking larger databases and comparison with a graph database


While we currently have only preliminary and indirect comparison results, they are still intriguing
and offer a valuable starting point for future research.
This section is inspired by the benchmarking of Neo4j against MySQL presented in the first
chapter of [→5]. The study examined a social network database of 1,000 people, with links
between pairs representing friendships. On average, each person had 50 friends. The task was to
count friends of friends (“2 levels”), then friends of friends of friends (“3 levels”), and so on. As
expected, MySQL performed significantly worse as the levels increased due to the high cost of
relational table joins. In contrast, Neo4j (and ILE), functioning as graph databases, could traverse
edges very efficiently, with performance largely unaffected by database size unless it resulted in
disk swaps.
There are no details in the cited book on how the database was generated. As for the
computer system used in the book’s experiments, all that was revealed was this: “All experiments
were executed on an Intel i7–powered commodity laptop with 8 GB of RAM, the same computer
that was used to write this book.”
The ideal approach for comparing these databases would be to run them on the same dataset
and system, but this was not possible. Instead, a random social network dataset was created with
1,000 nodes, each connected to 50 randomly selected others. For each node, 50 random IDs were
generated, establishing two-way connections between the nodes and their randomly selected
counterparts. As a result, most nodes ended up connected to considerably more than 50 others
on average. The ILE system was run on a laptop, without running Neo4j on the same machine. The laptop used was a
standard “commodity” machine, about four years newer than the one mentioned in the book, but
with an older Intel Core i5 processor and the same 8GB of RAM.
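For concreteness, the following self-contained Ruby sketch shows the kind of generation and traversal-counting procedure described above; the names and structure are illustrative and are not the code actually used for the measurements.

require 'set'

# Illustrative sketch: build a random social network of 1,000 people, give each person
# 50 randomly chosen friends, and make every friendship two-way. Because links are
# added in both directions, most people end up with more than 50 friends.
N_PEOPLE  = 1_000
N_FRIENDS = 50

friends = Array.new(N_PEOPLE) { Set.new }    # friends[i] = IDs of person i's friends
N_PEOPLE.times do |i|
  N_FRIENDS.times do
    j = rand(N_PEOPLE)
    next if j == i
    friends[i] << j                          # two-way connection
    friends[j] << i
  end
end

# The benchmark task: count the people reachable within `depth` hops of a person.
def reachable_within(friends, start, depth)
  seen  = Set.new([start])
  layer = [start]
  depth.times do
    layer = layer.flat_map { |p| friends[p].to_a }.uniq.reject { |p| seen.include?(p) }
    layer.each { |p| seen << p }
  end
  seen.size - 1
end

puts format('average friends per person: %.1f', friends.sum(&:size).fdiv(N_PEOPLE))
puts reachable_within(friends, 0, 2)         # friends of friends of person 0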
→Table 2.2 summarizes the book’s experiments and ours on databases of 1,000 persons.
Table 2.2: Run results for social network databases with 1000 persons. The runtimes tabulated are
in seconds.
Depth MySQL Neo4J ILE
2 0.028 0.04 0.024
3 0.213 0.06 0.031
4 10.273 0.07 0.039
5 92.613 0.07 0.040
Unlimited --- --- 0.762

The book also reports the same study repeated on a database of 1,000,000 individuals, each having
an average of 50 friends. Attempts were made to run ILE on a similar database of 1,000,000 nodes,
but the computer slowed down significantly while generating the connections, preventing
progress to testing traversals. Although the computer used here differs from that of the book’s
authors, it is likely that ILE requires more space per node and relationship, necessitating a
machine with greater memory and working-set capacity to manage datasets of this size effectively.
→Table 2.3 presents the results from the book’s test on a dataset of 1,000,000 individuals,
each with an average of 50 friends. While the results for Neo4j seem impressive, there are
concerns regarding the 1,000,000-person experiment reported in the book and shown in →Table
2.3. Notably, the results for both MySQL and Neo4j are better for the 1,000,000-person dataset
than for the 1,000-person dataset for the depth 2 case, which seems counterintuitive. Focusing on
the MySQL result for the 1,000,000-person dataset, since MySQL is well understood, the
computation involves a self-join of a 50,000,000-row table, as each person has 50 friends. The
authors claimed this self-join takes 0.016 s on a “commodity” laptop with 8GB of RAM and an Intel
Core i7 processor. This appears unlikely. The size of the join would be 2.5e15 rows, with each row
containing not just numbers but also names and additional information. Even assuming an overly
optimistic scenario where each row contains only three numbers (the persons’ IDs), and the
laptop has 8 cores running at 4 GHz, which is highly optimistic, and further assuming all cores
simultaneously write the results into memory in 1 clock cycle, it’s still implausible. In 0.016 s, the
system would write 8 × 4e9 × 0.016 = 5.12e8 words, which is far fewer than the 7.5e15 required.

Table 2.3: Run results for social network databases with 1,000,000 persons. The tabulated run-
times are in seconds.
Depth MySQL Neo4j
2 0.016 0.01
3 30.267 0.168
4 1,453.505 1.359
5 Unfinished 2.132

While ILE may not be suitable for databases of 1,000,000 persons on a small laptop, it worked well
for datasets of up to 100,000 persons and perhaps more. →Table 2.4 lists assorted run results for
datasets of up to 100,000 persons.
Table 2.4: Run results for social network databases with up to 100,000 persons. The tabulated
runtimes are in seconds.
Persons Depth ILE Run Time
10,000 2 0.068654
10,000 3 0.484819
10,000 4 0.297643
10,000 5 0.312909
10,000 6 0.326356
10,000 7 0.370950
50,000 6 1.563340
100,000 5 6.482886
100,000 6 9.029269

2.6 Transportation networks


In this section, we argue why a transportation network may be better represented as a
hypergraph than a graph. Transportation networks are typically represented as a graph. In a
graph representation of a métro/subway system, for example, the nodes typically represent
significant places such as the métro/subway stations, while each edge represents a connection
between two stations [→7, →15]. In [→16], a road network is represented with a graph where the
nodes are road intersections, and the edges are the road segments connecting such intersections.
The edge weights are “some attribute of a road segment.”
Jeyapragasan et al. [→7] mentioned that it was difficult to represent which line or lines each
station belongs to in a way that lets a program easily count how many line changes are needed to
get from one station to another. These authors ultimately decided to represent each station using
as many graph nodes as there are lines serving that station.
Here, we propose representing a métro system as a hypergraph instead of a graph. The
stations are nodes, and so are track segments connecting two stations. The lines are also nodes,
and a relationship is a hyperedge that not only connects two stations but also links to a track
segment and a line. →Figure 2.5 illustrates such a relationship, which has an arity of 4.
Figure 2.5: A relationship relating a track segment to the two stations that it connects and to a train
line. Such a relationship is represented by a hyperedge, which is in turn represented by an ILE
relationship object.

Using the representation of the métro presented here eliminates less clean ways of dealing with
multiline routes, such as the way described in [→7], where each node (station) that serves
multiple lines must appear multiple times in the database.
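As a concrete illustration of the arity-4 hyperedge in →Figure 2.5, here is a minimal self-contained Ruby sketch that uses plain hashes as stand-ins for ILE entity and relationship objects; the station, segment, and line names are made up.

# Illustrative sketch: one hyperedge (relationship) of arity 4 linking two stations,
# the track segment between them, and the line that runs over that segment.
station_a = { 'name' => 'Station A' }
station_b = { 'name' => 'Station B' }
segment   = { 'id' => 'seg_042', 'length_km' => 0.7 }
line      = { 'name' => 'Line 1' }

hyperedge = {
  'station_1' => station_a,
  'station_2' => station_b,
  'segment'   => segment,
  'line'      => line
}

# A station that serves several lines appears only once; each line contributes its own
# hyperedges, so counting line changes along a route amounts to comparing the 'line'
# role of consecutive hyperedges.
puts hyperedge['line']['name']    # => Line 1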
Another application for which hypergraphs/ILE would be useful is automobile trip route
planning, including route planning in portable GPS units, some of which are notoriously slow at
planning non-local routes. In a typical extant GPS unit for automobiles, the routing software uses
a graph data structure. The nodes are road intersections, while the edges are initially the road
segments that connect those intersections. If computational complexity were not an obstacle, a
shortest path algorithm such as Dijkstra’s Algorithm could be applied, and we’d be done.
However, the search space would be too large, given that there are many nodes, for example, 15.8
million nodes in the continental US (citation needed). One way to limit the search is to
precompute shortcuts: edges that do not correspond to physical road segments but instead
represent shortest paths between pairs of nodes. Heuristics are used to favor highways and other
major roads, and to favor the pattern of taking small roads, then big roads, then small roads.
However, this strategy is still a very “localized” one in that there isn’t enough “big picture”
planning of the overall route. It isn’t surprising, therefore, that GPS units can have difficulty
completing long-distance drive plans.
Our idea is to put more human-like “intelligence” into routing. In particular, we will
precompute routes between major (and not-so-major) cities comprising mainly expressways and
store all of them as nodes. (So our nodes are not just road intersections, which are still nodes in
our scheme.) Except for trips in remote areas, the trip planning search process for a trip from
point A to point B will start by trying one of these intercity routes R. Then we search for the best
route from point A to R, which usually means a short trip from point A to one of a handful of entry
points of the expressway or one of the expressways that make up the route R. With a backward
search, we search for a way to get from one of a few exits of R to point B. Although we have not
performed comparison experiments yet, it seems that our approach has promise in comparison
with the more conventional search in extant GPS devices.

2.7 World Wide Web


The World Wide Web search engine Google ranks the importance of Web pages using an
algorithm called PageRank. In PageRank, the importance of a Web page is based on how many
other Web pages have links to it, and how important those other Web pages themselves are.
Now consider ranking social media pages. We want to consider the importance of group
membership, especially membership in groups with important members. In [→9], the authors
presented their own extension of PageRank where groups are represented as a set of vertices
that are related by a hyperedge representing group membership.
Heintz and Chandra used the Spark GraphX package emulating a hypergraph, building a
hypergraph API for it. They discussed two possibilities for the underlying data structure that uses
the GraphX framework for their system.
1. Using a bipartite graph, one partition contains the vertices of the graph that represent the
vertices (which they call “hypervertices”) of the hypergraph, and the other partition contains
the vertices that represent the hyperedges of the hypergraph. Native graph edges are used
for connecting each graph vertex representing a hyperedge to its constituent hypervertices.
Note that if the underlying graph data structure is a directed graph, and if it is desired to be
able to go from a hypervertex to another hypervertex sharing the same hyperedge, then we
need a pair of native edges, one going in each direction, to represent the connection
between a hyperedge and a constituent hypervertex.
2. If the underlying graph software permits multigraphs (an extension of a graph where
multiple edges between the same pair of vertices may exist), which GraphX does, then we
can represent a hypergraph with a multigraph. This involves representing each hypervertex
with a multigraph vertex. Hyperedges are represented as follows: if vertices v1 and v2 are
connected by hyperedge h, then the underlying graph has an edge between v1 and v2,
labeled “h”.
Representation 1, using a bipartite graph, does not waste an inordinate number of edges.
However, according to Heintz and Chandra, it can be a problem that hyperedges and hypernodes
are represented by the same kind of object, even though the hypernodes and hyperedges have
very different characteristics. Representation 2, using a multigraph, uses different kinds of objects
to represent hypernodes and hyperedges, and hence is not subject to the same problem.
However, representation 2 takes n × (n – 1)/2 edges to represent a group with n members,
and actually n × (n – 1) edges if two-way links are required. This can lead to a failure, since groups
in social media can be very large. Here in Pocatello, Idaho, a town with a population of about
57,000 (in 2021), the Facebook group called “Pocatello Trails” has 11.3 K members currently (Dec.
2023). Using representation 2, with edges in both directions between each pair of vertices, there
would be about 127,678,700 edges just to represent this group! Additionally, there would be no
single object that serves as the “center” of this group.
On the other hand, if we represent Facebook pages using the ILE-based hypergraph
implementation of this chapter, the Pocatello Trails group would only take one hyperedge
(relationship) object, with a bidirectional link to each of the 11.3 K members, which would make
22.6 K links.
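A quick arithmetic check of the two representations’ link counts for this example:

n = 11_300                          # members of the "Pocatello Trails" group
multigraph_edges = n * (n - 1)      # representation 2, with edges in both directions
hyperedge_links  = 2 * n            # one ILE hyperedge with a bidirectional link per member
puts multigraph_edges               # => 127678700 (about 127.7 million)
puts hyperedge_links                # => 22600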

2.8 Conclusions
For decades, the ER database model has been considered the best overall data model for a vast
range of applications. However, ER had not been implemented successfully, and so the Relational
DBMS, which is based on 1960s programming technology, and which can be considered a rough
approximation of an implementation of ER, became the dominant type of database system. This
chapter presents ILE as a modern implementation of ER, and also how to extend ILE to make it the
underlying database system for hypergraph databases. Hypergraph databases are interesting
because, unlike ordinary graph databases, they can directly represent higher-arity relationships,
which are stronger models of real relationships in many applications [→18, →20]. The ILE-based
hypergraph database is better than extant
implementations because the ILE-based one is a more direct implementation, using pointers to
link relationships. Several applications were also presented in this chapter from diverse fields.

References
[1] E. F. Codd, A relational model of data for large shared data banks. Communications of the
ACM, 13, 6, June 1970. 377–387. a, b
[2] I. Robinson, J. Webber and E. Eifrem, Graph Databases, 2nd Edition, O’Reilly Media, Inc, 2015.

[3] T. Menezes, C. Roth, C. C. Chen, Y. Stein and K. Ren, Graphbrain – Language, knowledge,
cognition. Published online as →https://graphbrain.net 2021. →
[4] G. Kemnitz, Quora: Answer to the question: Technically, how are relation(ship)s stored in a
graph database? →https://www.quora.com/How-are-graph-databases-implemented, 2019. →
[5] A. Vukotić and N. Watt, Neo4j in action. Manning, 2015. a, b
[6] V. Kantabutra, Implementing hypergraph databases with fast relationship linking and traversal
using ILE, CSCE 2022. →
[7] K. Jeyapragasan, G. Krishna and Y. Maniyar, An analysis of subway networks using graph
theory and graph generation with graphrnn, Stanford Dept. CS, CS 224W, 2019 class project,
available online at →http://snap.stanford.edu/class/cs224w-2019/projects.html, 2019. Last
Accessed Dec. 21, 2023. a, b, c
[8] X. Wu, C. K. Tse, H. Dong, I. W. H. Ho and F. C. M. Lau, A network analysis of world’s metro
systems, In: International Symposium on Nonlinear Theory and Its Applications, NOLTA2016,
Yugawara, Japan, November 27th-30th, 2016. →
[9] B. Heintz and A. Chandra, Enabling scalable social group analytics via hypergraph analysis
systems, In: Proceedings of the 7th USENIX Conference on Hot Topics in Cloud Computing, July,
2015. a, b
[10] H. Garcia-Molina, J. D. Ullman and J. Widom, Database Systems: The Complete Book, Prentice-
Hall, 2001. →
[11] P. Chen, The entity-relationship model: Toward a unified view of data. ACM Transactions on
Database Systems, 1, pp. 9–36. 1976. →
[12] M. E. Senko, E. B. Altman, M. M. Astrahan and P. L. Fehder, Data structures and accessing in
data-base systems. IBM Systems Journal, 12, 1, pp. 30–93, 1973. →
[13] V. Kantabutra, A New Type of Database System: Intentionally-Linked Entities-A Detailed
Suggestion for A Direct Way to Implement the Entity-Relationship Data Model, CSREA EEE, pp.
258–263, 2007. →
[14] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vanó, S. Gómez-Villamor, N. Martínez-Bazan
and J. L. Larriba-Pey. Survey of graph database performance on the hpc scalable graph analysis
benchmark. In: Heng Tao Shen, Jian Pei, M. Tamer Özsu, Lei Zou, Jiaheng Lu, Tok-Wang Ling, Ge
Yu, Yi Zhuang, Jie Shao, (Eds.), International Conference on Web-Age Information Management.
Springer-Verlag, Berlin, Heidelberg, 2010, pp. 37–48. →
[15] C. Kennedy and S. Derrible, Network analysis of world subway systems using updated graph
theory. Transportation Research Record: Journal of the Transportation Research Board, 2112, pp.
17–25. 2009. →
[16] J. Pung, R. D’Souza, D. Ghosal and M. Zhang, A road network simplification algorithm that
preserves topological properties. Applied Network Science, 7, pp. 79, 2022.
→https://doi.org/10.1007/s41109-022-00521-8 →
[17] V. Kantabutra and D. Ames, A suggestion for a data structure for temporal gis. In: IEEE
Workshop on e-Science, Oxford, pp. 158–161, 2009. →
[18] V. Kantabutra, J. B. Owens, D. P. Ames, C. N. Burns and B. Stephenson, Using the newly-
created ILE DBMS to better represent temporal and historical GIS data. Transactions in GIS, 14, pp.
39–58. 2010. →
[19] M. Hunger, R. Boyd and W. Lyon, The definitive guide to graph databases for the RDBMS
developer. Neo4j 2016. →
[20] V. Kantabutra, GeoDataLinks: A suggestion for a replacement for the ESRI shapefile.
Transactions on Computational Science & Computational Intelligence, pp. 685–694, 2021. →
3 Rapid and automated determination of cluster numbers for
high-dimensional big data: a comprehensive update

Zohreh Safari

Note: This chapter extends the findings of a prior collaborative study conducted by Khalid Mursi,
Yu Zhuang, and myself. The earlier research laid the foundation for the current work, and this
paper builds upon those insights, providing further depth, improvement, and exploration into the
subject matter with new larger datasets.

Abstract
Automatically defining the optimal number of clusters is a pivotal challenge in clustering
algorithms. Striking a balance between clustering quality and algorithm efficiency in this
determination process is a crucial tradeoff that motivated our research. In our approach, we have
successfully automated the identification of the optimal number of clusters, particularly tailored
for large high-dimensional datasets. Our method addresses both the quality and efficiency
aspects of clustering. Through experimental studies on five previously explored datasets [→23]
and four new, larger datasets introduced in this study, we have observed that our procedure
provides flexibility in selecting diverse criteria for determining the optimal K under each
circumstance. Leveraging the advantages of the bisecting K-means
algorithm, our approach outperforms the Ray and Turi method, showcasing higher efficiency in
identifying the best number of clusters.
Keywords: clustering, K-means, bisecting K-means, big data, cluster validity

3.1 Introduction
Clustering, an extensively utilized unsupervised learning process, has been applied to numerous
problems for uncovering insights and revealing hidden relationships among objects,
distinguishing it from supervised machine learning algorithms. Its applications span various
domains, including pattern recognition [→11], machine learning [→5, →6], image analysis [→13,
→14, →15], bioinformatics [→16], and information retrieval [→17, →18]. This procedure groups
data points based on similarities, where those within the same group exhibit similar properties,
while those in different groups are less alike. Unlike supervised learning and classification,
clustering lacks a predefined concept for determining the appropriate number of clusters for a
given dataset [→22]. Users typically depend on clustering algorithms to group data points in an
optimal manner, with the number of clusters predetermined by their knowledge. The
performance of clustering heavily relies on accurately estimating the cluster numbers.
Specifying the right number of clusters intuitively is a challenging task, as users may lack
sufficient knowledge of a dataset to define a high-probability optimal number. The absence of
information regarding the cluster number can adversely impact clustering quality, prompting
significant interest in recent years to discover the correct number of clusters through clustering
algorithms. The K-means algorithm [→19] stands out as one of the most widely recognized
clustering methods, valued for its simplicity and ease of implementation. However, a notable
challenge lies in users having to specify the number of clusters beforehand. Recent research has
endeavored to enhance the K-means algorithm to address this challenge, introducing solutions for
automatically determining the number of clusters. Some notable approaches include X-means
[→7], Ray and Turi (R&T) [→4], CLUSEQ [→8] designed for sequence data, IKMEANS [→9], and
others [→10, →12]. K-Means++ [→28] is an approach that aims to improve the initialization step of
K-means (the algorithm itself is explained in Section 3.2): it selects initial cluster centers with
probability proportional to their squared distance from the existing centers, which has been
shown to mitigate the impact of initial center selection on algorithm performance. Furthermore,
advancements in adaptive clustering algorithms have led to the development of methods like
adaptive K-means [→29], which dynamically adjusts the number of clusters based on the data
distribution. By continuously monitoring cluster quality and adaptively refining cluster
assignments, these algorithms offer improved flexibility and efficiency, compared to traditional K-
means.
This chapter specifically concentrates on automatically determining the optimal number of
clusters (auto best K) for extensive datasets featuring high-dimensional data. While existing
methods like R&T and X-means focus on clustering quality during automatic cluster number
determination, they are not explicitly tailored for large datasets. Managing the efficiency of
clustering algorithms becomes a crucial challenge with big datasets, given that determining the
number of clusters can be time-consuming. Consequently, our focus is on efficiency, while
upholding clustering quality for significant data. In existing algorithms [→4, →7], determining the
number of clusters involves processes like bisecting K-means (BKMeans), where the number of
clusters is progressively increased to a predefined upper bound. The largest cluster is chosen
during the splitting process, based on varying criteria in different methods. Our emphasis is not
on selecting criteria but on reducing computation costs while adhering to the quality standards
set by existing methods. It is noteworthy that the criterion for clustering remains an ongoing
consideration, as each step involves assessing different criteria with an increasing cluster number
in existing algorithms. The structure of the paper unfolds as follows: Section 3.2 offers an
overview of two clustering algorithms, detailing their significant specifications, and provides an
explanation of the R&T method [→4]. Our proposed method is thoroughly examined in Section
3.3. Section 3.4 delves into experiments and comparisons, while the concluding remarks of the
study are presented in the last section.

3.2 Related works


To reach our goal of automatically determining a suitable value of K while ensuring high
clustering quality, our auto best K procedure integrates both the K-means and BKMeans
algorithms. Leveraging the distinctive features of these algorithms, we have engineered a more
efficient process for determining the right value of K. To give readers a clear understanding of the
characteristics of these two algorithms, we present detailed descriptions in this section.
Subsequently, we describe the R&T method, against which the effectiveness of our approach is
demonstrated in the following →section.

3.2.1 K-means and bisecting K-means clustering algorithm


K-means, a widely used clustering algorithm designed to group similar objects into cohesive
clusters, operates based on Algorithm 1. The key inputs for this algorithm are K and N,
representing the number of clusters and data points, respectively. In its initial phase, K centers are
chosen randomly, a step that significantly influences both the effectiveness and efficiency of the
algorithm. However, this randomness can pose challenges to achieving optimal results. Despite its
straightforward approach, K-means has notable limitations. Primarily, it necessitates the
predefinition of the number of clusters, which can be cumbersome when determining suitable
cluster numbers for diverse datasets. Moreover, ensuring stability in cluster partitions requires
multiple iterations, resulting in high computational costs, particularly for large datasets.
Some recent studies have explored various strategies to address these limitations. For
example, research has focused on enhancing the initialization step through techniques such as
K-Means++, which selects initial cluster centers with probability proportional to their squared
distance from the existing centers. Also, adaptive K-means dynamically adjusts the number of
clusters based on the data distribution. These algorithms offer improved flexibility and efficiency
compared to traditional K-means by mitigating the impact of the initial centers. As explained, the
primary drawback of the K-means algorithm lies in its reliance on randomly selected initial
centers, which can lead to a high number of iterations and potentially unreliable clustering
outcomes. To lessen this issue, we introduced the BKMeans method in a previous version of this
paper [→23], outlined in Algorithm 2 below.
The BKMeans algorithm has exhibited superior clustering quality and efficiency compared to
K-means with randomly selected initial centers, as demonstrated in [→21]. Furthermore, BKMeans
offers the advantage of automatically determining the optimal number of clusters,
eliminating the need for specifying a predefined K value. Notably, criteria such as clustering
validity [→20] can be seamlessly integrated during the bisecting process to identify the optimal
number of clusters. Additionally, recent advancements in the field underscore the effectiveness
and adaptability of the BK-Means algorithm, positioning it as a promising alternative for
addressing the inherent limitations of the traditional K-means clustering methodologies.
These additions provide a more comprehensive overview of the recent research and
advancements related to K-means clustering:

Algorithm 1: K-means
input: data set x1, x2, …, xN; K (the number of clusters) and N (the number of points).

1. Initialize K centroids z1, z2, …, zK.
   for i = 1 to N do
      a. Compute the distance from xi to its closest centroid: di = min ||xi − zj||, j = 1, 2, …, K.
      b. Assign each point xi to the cluster Cj of its closest centroid.
   end
2. Re-calculate the centroid zj′ for each cluster Cj.
3. Set zj = zj′.
4. Repeat the assignment loop and steps 2–3 until the clusters no longer change.
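The chapter’s experiments were implemented in C; purely for illustration, here is a minimal, self-contained Ruby sketch of Algorithm 1, with random initial centroids, an optional seeding parameter, and a fixed iteration cap (all names are illustrative):

# Illustrative Ruby sketch of Algorithm 1 (basic K-means).
def dist2(a, b)                      # squared Euclidean distance
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def centroid(points)                 # mean of a set of points
  dims = points.first.size
  sums = Array.new(dims, 0.0)
  points.each { |p| dims.times { |d| sums[d] += p[d] } }
  sums.map { |s| s / points.size }
end

def kmeans(points, k, max_iters = 20, centers = nil)
  centers ||= points.sample(k)       # random initial centroids unless seeds are given
  clusters = nil
  max_iters.times do
    clusters = Array.new(k) { [] }
    points.each { |p| clusters[(0...k).min_by { |j| dist2(p, centers[j]) }] << p }   # assignment step
    new_centers = clusters.each_with_index.map { |c, j| c.empty? ? centers[j] : centroid(c) }
    break if new_centers == centers                                                  # converged
    centers = new_centers                                                            # update step
  end
  [clusters, centers]
end

points = Array.new(200) { [rand, rand] }
clusters, = kmeans(points, 3)
puts clusters.map(&:size).inspect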

3.2.2 Ray and Turi method and R&T index


Ray and Turi implemented their method on color spaces, generating segmented images for
cluster numbers, ranging from 2 up to an upper limit, defined as 25. To determine the optimal
number of clusters, they employed their validity index, the R&T index [→4], which calculates the
ratio of intra-cluster to inter-cluster distances. The K-means algorithm was iteratively applied for
each new cluster creation until the number of clusters reached Kmax. The optimal k value, as
determined by their index, corresponds to the value with the minimum index value.
This validity measure was applied to both synthetic and natural images, with the smallest
number of clusters set at four. Ray and Turi introduced their own indexes, denoted as R&T [→4],
utilizing the concepts of intra-cluster and inter-cluster distances. Intra-cluster distance was
defined as the average distance of the N data points to the centers of their respective clusters
(with K clusters in total). Inter-cluster distance was defined as the
minimum distance between cluster centers. This comprehensive approach allowed them to assess
and optimize clustering performance across a range of scenarios, including both synthetic and
real-world image datasets:

R&T index = Intra / Inter.    (3.1)

Indeed, the R&T index serves as a valuable metric by quantifying the ratio of intra-cluster to inter-
cluster distances, effectively considering both the compactness and separation of clusters. A lower
value of the R&T index indicates superior clustering, signifying optimal values for the number of
clusters (k). This approach aligns with the objective of achieving clusters that are internally
cohesive (intra-cluster compactness) while being well-separated from each other (inter-cluster
separation), providing a comprehensive assessment of the quality of clustering results. Therefore,
the pursuit of a minimized R&T index value aids in the identification of an optimal k for the given
dataset:

Algorithm 2: Bisecting K-means


input: data set x1, x2, …, xN; k and N.

1. Make a single cluster C1 containing all N points; set j = 1.
2. Split the current cluster (initially C1) into two subclusters using the basic 2-means algorithm; j += 1.
3. Choose one cluster among the existing clusters C1, …, Cj to split, based on a criterion.
4. Go to step 2 until the desired number of clusters (k) is reached.

3.3 Proposed method


The details of our proposed method are explained below, with emphasis on our changes to the
existing algorithms. Our study endeavors to substantially enhance the performance
of automatically determining the optimal number of clusters (auto best k) for large datasets
containing high-dimensional data. This enhancement is motivated by the imperative to mitigate
computational costs while upholding the standard of clustering quality. The iterative execution of
K-means, especially when employing randomly chosen initial centers, exerts a profound impact on
computation time, prompting the need for innovative approaches.
In our methodological innovation, we capitalize on the potential of BKMeans to establish
superior initial centers, thereby setting the stage for more efficient and effective clustering
outcomes. This novel approach commences by treating the entire dataset as a singular cluster,
progressively partitioning it into two subsets through the iterative application of BKMeans. Unlike
conventional methods, where centers are selected randomly, we compute them as the centroids
or averages of all data points within each cluster. This meticulous process continues iteratively
until the upper bound clusters are formed, culminating in a more robust clustering framework. To
strike an optimal balance between computational efficiency and clustering quality, we introduce
the concept of “milestones” into our methodology. These milestones serve as predefined
intervals at which the K-means algorithm is invoked, to update cluster centers and refine data
point assignments. Significantly, our approach diverges from traditional methods by executing K-
means and updating centers in each iteration of the milestone, rather than at every step of cluster
creation. The strategic selection of milestones ensures they are judiciously positioned relative to
the data points, thereby facilitating accurate assessments of cluster validity.
An additional noteworthy modification in our methodology pertains to the criterion used to
determine which cluster to split. While variance is a commonly employed criterion in some
clustering algorithms, we opt for the sum of squared distances (SSD) criterion, which imposes a
lesser computational burden, particularly for large datasets necessitating a greater number of
clusters. In the realm of parameterization, our methodology introduces two critical parameters,
namely “best K” and “best validity,” for comparison with the R&T procedure. “Best K” denotes the
number of clusters that yield the most favorable results, while “best validity” considers the value
of cluster validity measures to discern the optimal number of clusters based on clustering quality.
This adaptive framework allows for the incorporation of different criteria for determining the best
K under diverse circumstances, thereby enhancing the versatility and applicability of our
methodology.
Cluster validity measures, encompassing both internal and external perspectives, play a
pivotal role in evaluating the quality of clustering outcomes. Internally, these measures assess the
cohesion and compactness of clusters, while externally, they evaluate the separation and
distinctiveness of clusters from one another. To this end, we select two widely acknowledged
cluster validity measures, namely the Calinski index and the R&T index, for comprehensive
evaluation.
It is essential to acknowledge the abundance of cluster validity indices available, and our
selection of the Calinski and R&T measures is guided by their established performance,
particularly in conjunction with K-means clustering. Our proposed methodology undergoes
rigorous evaluation using these indices, underscoring its effectiveness in achieving significantly
improved clustering outcomes across diverse datasets and scenarios.

3.3.1 Calinski and Harabasz index


The Calinski and Harabasz (CH) index, denoted as CH, can be calculated using the following
equation when the number of data points is N and the number of clusters is K:

CH = (Trace B / Trace W) × ((N − K) / (K − 1)),    (3.2)

where B and W are the between- and within-cluster scatter matrices, respectively. Trace B is the
sum, over all clusters, of each cluster’s size (nk) multiplied by the squared distance from that
cluster’s center (zk) to the centroid of the whole dataset (z). Trace W is the sum of the squared
distances of the data points to their clusters’ centers.
Trace B and trace W can be written as
Trace B = Σ_{k=1..K} nk ||zk − z||²,    (3.3)

Trace W = Σ_{k=1..K} Σ_{i=1..nk} ||xi − zk||².    (3.4)
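To make eqs. (3.1)–(3.4) concrete, the following self-contained Ruby sketch computes the CH index and the R&T ratio for a given partition; it follows the verbal definitions given in this chapter (the chapter’s own experiments use a C implementation), and all names are illustrative.

# Illustrative Ruby sketch of the two validity measures used in this chapter.
def dist2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def centroid(points)
  dims = points.first.size
  sums = Array.new(dims, 0.0)
  points.each { |p| dims.times { |d| sums[d] += p[d] } }
  sums.map { |s| s / points.size }
end

# Calinski-Harabasz index, eq. (3.2): (Trace B / Trace W) * ((N - K) / (K - 1)); larger is better.
def ch_index(clusters)
  n = clusters.sum(&:size)
  k = clusters.size
  z = centroid(clusters.flatten(1))                                  # centroid of the whole data set
  trace_b = clusters.sum { |c| c.size * dist2(centroid(c), z) }      # eq. (3.3)
  trace_w = clusters.sum { |c| zc = centroid(c); c.sum { |x| dist2(x, zc) } }   # eq. (3.4)
  (trace_b / trace_w) * ((n - k).to_f / (k - 1))
end

# R&T index, eq. (3.1): intra-cluster over inter-cluster distance; smaller is better.
def rt_index(clusters)
  n = clusters.sum(&:size)
  centers = clusters.map { |c| centroid(c) }
  intra = clusters.each_with_index.sum { |c, j| c.sum { |x| Math.sqrt(dist2(x, centers[j])) } } / n
  inter = centers.combination(2).map { |a, b| Math.sqrt(dist2(a, b)) }.min
  intra / inter
end

clusters = [[[0.0, 0.0], [0.1, 0.2]], [[5.0, 5.0], [5.2, 4.9]], [[9.0, 0.0], [8.8, 0.3]]]
puts ch_index(clusters)
puts rt_index(clusters)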

Therefore, the clustering that delivers the maximum value for the CH index indicates the best
value of k:

Algorithm 3: Proposed method


input: data set, kmin, kmax, milestone.

1. Use BKMeans to partition the data set into kmin clusters.


2. Set k = kmin and compute the validity of the current partition.
3. while ((validity does not meet the criteria) && (k < kmax)) do
      Choose the biggest cluster.
      Partition it into 2 clusters.
      k += 1
      if (k mod milestone == 0) then K-Means(k)
      Compute the validity of the current partition.
   end
4. return the clusters (k) and their validities.
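Putting the pieces together, the following Ruby sketch outlines Algorithm 3’s milestone-driven loop. It reuses the dist2, centroid, and kmeans helpers from the sketch after Algorithm 1 and the ch_index function from the sketch in Section 3.3.1; the helper names, the SSD splitting criterion shown, and the simplification of always running up to kmax and returning the best partition found are illustrative choices rather than the chapter’s C implementation.

# Sketch of Algorithm 3. Assumes dist2, centroid, kmeans, and ch_index as defined in
# the earlier illustrative sketches in this chapter.

def ssd(cluster)                                 # splitting criterion: sum of squared distances
  zc = centroid(cluster)
  cluster.sum { |x| dist2(x, zc) }
end

def split_largest(clusters)                      # bisect the cluster with the greatest SSD
  big = clusters.max_by { |c| ssd(c) }
  halves, _ = kmeans(big, 2)                     # basic 2-means on the chosen cluster
  (clusters - [big]) + halves.reject(&:empty?)
end

def auto_best_k(points, k_min:, k_max:, milestone:)
  clusters = [points]                            # step 1: grow from one cluster to k_min
  clusters = split_largest(clusters) while clusters.size < k_min
  best_k, best_validity = clusters.size, ch_index(clusters)

  while clusters.size < k_max
    clusters = split_largest(clusters)
    if clusters.size % milestone == 0            # milestone: refresh with K-means, seeded
      refreshed, _ = kmeans(points, clusters.size, 20, clusters.map { |c| centroid(c) })
      clusters = refreshed.reject(&:empty?)
    end
    v = ch_index(clusters)                       # validity of the current partition
    best_k, best_validity = clusters.size, v if v > best_validity   # CH index: larger is better
  end
  [best_k, best_validity]
end

In the actual procedure, the loop would also stop early once the validity criterion is met, as written in Algorithm 3.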

3.4 Experimental studies


The R&T procedure and our proposed method were implemented using the C programming
language, with various parameters passed to the software. As described earlier, the R&T
procedure calls K-means in every iteration and splits the cluster with the largest variance after
each round. In contrast, our method calls K-means only every predefined number of iterations
(the milestone) and splits the cluster with the greatest SSD. The comparison between the two
procedures was conducted using the CH index and the R&T index. For both methods, the
number of iterations of K-means was predefined as 20. The tests were initiated with a minimum
number of clusters (Kmin) set to 3, while the maximum number of clusters (Kmax) varied, based on
the nature of the dataset and our prior knowledge about the data.
All tests were performed on the HPCC (high-performance computing cluster) of Texas Tech
University – RedRaider cluster, which boasts 30,720 cores with a maximum run time of 48 h.
Additionally, 32 GB of memory was allocated for the first five datasets, and 512 GB of virtual
memory was reserved for the remaining tests. This robust computing environment ensured the
execution of comprehensive tests, allowing for a thorough evaluation of the performance and
efficiency of both the R&T procedure and our proposed method across different datasets.

3.4.1 Dataset description


In the previous experiment conducted as part of our 2020 study [→23], we explored various
synthetic and real datasets featuring a variable number of points and dimensions, including
Birch1 [→1], DIM64 [→2], DIM15 [→3], and 8XOR-PUF 64bit. In the updated set of experiments, we
extended our analysis to larger datasets. As illustrated in →Table 3.1, which outlines the
parameters of the selected datasets [→23], the last four datasets – HIGGS, HEMPASS, TLC, and
Heterogeneity – are sourced from different references, as indicated in the Type/Publisher column.
Notably, the number of data points in these datasets is significantly larger than in the previous
ones. Our proposed method was evaluated using these larger datasets, and the intriguing results
are detailed in the following →section.
Table 3.1: Description of datasets.
Dataset Type/publisher No. of points No. of dim.
BIRCH1 Synthetic 100,000 2
8 XOR PUF (0s) Hardware responses 11,917 64
8 XOR PUF (1s) Hardware responses 12,083 64
DIM64 Synthetic 1,024 64
DIM15 Synthetic 10,126 15
HIGGS [→26] 2,000,000 2
HEMPASS [→27] 21,000,000 28
TLC [→25] 16,000,000 41
Heterogeneity [→24] 16,000,000 24

3.4.2 Experimental results


→Tables 3.2–→3.5 present the comprehensive results of our experiments, aiming to efficiently
determine the optimal number of clusters (best K) for various datasets. Each table provides
detailed insights into the clustering validation process (the validity index used is stated in each
table’s caption). The first column of each table specifies the dataset name; the second and third
columns delineate the experiment boundary in terms of the starting K (Kmin) and the ending K
(Kmax), respectively. The fourth column displays the elapsed time for each experiment, measured
from the initiation of clustering in the program until it concludes upon reaching Kmax. The last
three columns show the best K, best validity, and the total SSD, respectively, for the corresponding
experiment in the same row.

Table 3.2: R&T procedure’s experimental results based on the CH index.


Dataset Kmin Kmax Time Best K Best validity Total SSD
BIRCH1 3 100 9.42E + 01 100 1.53E + 05 8.08E + 13
8 XOR PUF 1s 3 200 3.55E + 02 56 1.58E + 03 1.79E + 05
8 XOR PUF 0s 3 200 3.69E + 02 56 1.51E + 03 1.76E + 05
DIM64 3 300 1.43E + 01 20 3.07E + 07 6.88E + 02
DIM15 3 50 8.31E + 00 9 2.34E + 05 4.07E + 12
HIGGS 3 16 2.93E + 01 31 5.48E + 05 14.07E + 12
HEMPASS 3 32 3.79E + 02 129 7.48E + 05 18.07E + 12
TLC 3 32 5.12E + 02 237 8.56E + 05 17.18E + 12
Heterogeneity 3 32 4.72E + 02 263 10.12E + 05 18.51E + 12
Table 3.3: Our procedure’s experimental results based on the CH index.
Dataset Kmin Kmax Time Best K Best validity Total SSD
BIRCH1 3 100 2.52E + 01 100 1.53E + 05 8.08E + 13
8 XOR PUF 1s 3 200 1.31E + 02 56 1.48E + 03 1.76E + 05
8 XOR PUF 0s 3 200 1.19E + 02 56 1.51E + 03 1.76E + 05
DIM64 3 300 3.84E + 00 20 3.03E + 07 5.28E + 02
DIM15 3 50 2.43E + 00 9 2.34E + 05 3.32E + 12
HIGGS 3 16 1.03E + 01 31 5.48E + 05 14.07E + 12
HEMPASS 3 32 1.19E + 02 129 7.47E + 05 18.07E + 12
TLC 3 32 2.62E + 02 237 8.56E + 05 17.18E + 12
Heterogeneity 3 32 1.88E + 02 263 10.12E + 05 18.51E + 12

Table 3.4: R&T procedure’s experimental results based on the R&T index.
Dataset Kmin Kmax Time Best K Best validity Total SSD
BIRCH1 3 100 9.52E + 01 100 1.16E−01 8.08E + 13
DIM64 3 300 1.49E + 01 16 9.00E−05 5.28E + 02
DIM15 3 50 8.27E + 00 9 4.71E−03 4.07E + 12
HIGGS 3 16 2.94E + 01 31 5.10E + 05 13.07E + 12
HEMPASS 3 32 3.79E + 02 129 7.28E + 05 17.07E + 12
TLC 3 32 5.12E + 02 237 8.56E + 05 17.18E + 12
Heterogeneity 3 32 4.72E + 02 263 10.12E + 05 18.51E + 12

Table 3.5: Our procedure experimental results, based on the R&T index.
Dataset Kmin Kmax Time Best K Best validity Total SSD
BIRCH1 3 100 2.51E + 01 100 1.16E−01 8.08E + 13
DIM64 3 300 4.49E + 00 16 9.00E−05 4.96E + 02
DIM15 3 50 2.41E + 00 9 4.71E−03 3.32E + 12
HIGGS 3 16 1.94E + 01 31 5.10E + 05 12.01E + 12
HEMPASS 3 32 2.79E + 02 129 7.28E + 05 15.07E + 12
TLC 3 32 4.12E + 02 237 8.56E + 05 11.18E + 12
Heterogeneity 3 32 1.88E + 02 263 10.12E + 05 16.32E + 12

In →Tables 3.2 (utilizing the R&T procedure) and 3.3 (utilizing our proposed procedure), the CH
index, calculated using eq. (3.2), serves as the validity measure. The CH index considers the
maximum validity as the best validity, comparing validations across different numbers of clusters.
Notably, the maximum validity values occurred when the number of clusters was 100, 56, 56, 20,
and 9 for BIRCH1, 8 XOR PUF 1s, 8 XOR PUF 0s, DIM64, and DIM15, respectively, in both tables.
Intriguingly, for the larger datasets in the last four rows of each table, our method produced
31, 129, 237, and 263 clusters for HIGGS, HEMPASS, TLC, and Heterogeneity, respectively.
Importantly, both the R&T and our procedures identified the same best K, with nearly identical
validity values for these datasets. However, our proposed method achieved this outcome in
significantly less time compared to the R&T method, as indicated by the Time column in both tables.
This underscores the efficiency and effectiveness of our approach in rapidly determining the
optimal number of clusters.
In →Tables 3.4 and →3.5, the R&T index, calculated using eq. (3.1), determines the best K by
selecting the minimum validity value, as explained earlier. The experiments in →Table 3.4 were
conducted using the R&T procedure, while the results in →Table 3.5 pertain to our proposed
method. The best K values were consistently identified as 100, 16, and 9 for BIRCH1, DIM64, and
DIM15 in both tables. Additionally, the best K values for HIGGS, HEMPASS, TLC, and Heterogeneity
were 31, 129, 237, and 263, respectively, in both tables.
Notably, across →Tables 3.2–→3.5, all experiments consistently reveal the same best K
concerning the specific dataset and the chosen validity measure. This uniformity underscores the
robustness and reliability of our proposed clustering procedure, in comparison to the R&T
method. Importantly, while achieving comparable results in terms of the best K, our method
excels in efficiency, significantly reducing the time required to arrive at these optimal cluster
configurations, when compared to the R&T procedure.

3.5 Conclusion
Building upon the groundwork laid by a previous collaborative study involving Khalid Mursi, Yu
Zhuang, and myself, this chapter extends and refines our insights into the automated
determination of the optimal number of clusters. The prior research set the stage for the current
work, which delves deeper into the subject matter and offers improvements and further
exploration, facilitated by new and larger datasets. In this study, we tackled the pivotal challenge
of automatically determining the optimal number of clusters, emphasizing the delicate balance
between clustering quality and algorithmic efficiency. Our approach, designed specifically for large
high-dimensional datasets, successfully automates the identification of the optimal number of
clusters while addressing both the quality and the efficiency aspects of clustering.
By conducting experimental studies on a combination of five previously explored datasets and
four new and larger datasets, our procedure demonstrates its flexibility in selecting diverse
criteria for determining the optimal K under various circumstances. Leveraging the advantages of
the BKMeans algorithm, our approach outperforms the R&T method, showcasing higher efficiency
in identifying the best number of clusters. Through rigorous experimentation involving nine large
datasets, we conducted a comparative analysis against the established R&T algorithm. The results
unequivocally validate the superiority of our approach, as it consistently identifies the best
number of clusters with equal or superior quality, while significantly outperforming the R&T
method in terms of speed. Moreover, the consistency of the results across two distinct validity
measures, the CH index and the R&T index, underscores the flexibility of our method: different
criteria can be considered when determining the optimal number of clusters, further enhancing
the versatility of the proposed approach.
In essence, our research not only contributes to the evolving field of clustering methodologies
for big data but also establishes a practical and efficient solution for automatically determining
the optimal number of clusters in the intricate landscape of high-dimensional datasets. The ability
to efficiently determine the optimal number of clusters in large high-dimensional datasets,
refined through valuable insights from the prior collaborative study, positions our approach as a
promising advancement in the realm of automated clustering methodologies.
References
[1] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: A new data clustering algorithm and its
applications. Data Mining And Knowledge Discovery, vol. 1, pp. 141–182, 1997. →
[2] P. Fränti, O. Virmajoki, and V. Hautamäki, Fast agglomerative clustering using a k-nearest
neighbor graph. IEEE Trans. On Pattern Analysis And Machine Intelligence, vol. 28, pp. 1875–1881,
2006. →
[3] I. Kärkkäinen, and P. Fränti, Gradual model generator for single-pass clustering. Pattern
Recognition, vol. 40, pp. 784–795, 2007. →
[4] S. Ray, and R. Turi, Determination of number of clusters in K-means clustering and application
in colour image segmentation. Proceedings Of The 4th International Conference On Advances In
Pattern Recognition And Digital Techniques. pp. 137–143, 1999. a, b, c, d, e
[5] R. Xu, and D. Wunsch, Survey of clustering algorithms, Institute of Electrical, 2005. →
[6] R. Dubes, and A. Jain, Algorithms for clustering data, Prentice hall Englewood Cliffs, 1988. →
[7] D. Pelleg, and A. Moore, Others, X-means: Extending K-means with efficient estimation of the
number of clusters. Icml, vol. 1, pp. 727–734, 2000. a, b
[8] J. Yang, and W. Wang CLUSEQ: Efficient and effective sequence clustering. Proceedings 19th
International Conference On Data Engineering (Cat. No. 03CH37405). pp. 101–112, 2003. →
[9] M. Chiang, and B. Mirkin, Intelligent choice of the number of clusters in K-means clustering: An
experimental study with different cluster spreads. Journal Of Classification, vol. 27, pp. 3–40,
2010. →
[10] Z. He, A. Cichocki, S. Xie, and K. Choi, Detecting the number of clusters in n-way probabilistic
clustering. IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 32, pp. 2006–2021,
2010. →
[11] A. Baraldi, and P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition. I.
IEEE Transactions On Systems, Man, And Cybernetics, Part B (Cybernetics), vol. 29, pp. 778–785,
1999. →
[12] H. Sun, S. Wang, and Q. Jiang, FCM-based model selection algorithms for determining the
number of clusters. Pattern Recognition, vol. 37, pp. 2027–2037, 2004. →
[13] T. Pappas, An adaptive clustering algorithm for image segmentation. IEEE Transactions On
Signal Processing, vol. 40, pp. 901–914, 1992. →
[14] G. Coleman, and H. Andrews, Image segmentation by clustering. Proceedings of The IEEE, vol.
67, pp. 773–785, 1979. →
[15] J. Bezdek, J. Keller, R. Krisnapuram, and N. Pal, Fuzzy models and algorithms for pattern
recognition and image processing, Springer Science & Business Media, 1999. →
[16] U. Bodenhofer, A. Kothmeier, and S. Hochreiter, APCluster: An R package for affinity
propagation clustering. Bioinformatics, vol. 27, pp. 2463–2464, 2011. →
[17] W. Croft, D. Metzler, and T. Strohman, Search engines: Information retrieval in practice,
Addison-Wesley Reading, 2010. →
[18] N. Jardine, and C. Rijsbergen, The use of hierarchic clustering in information retrieval.
Information Storage And Retrieval, vol. 7, pp. 217–240, 1971. →
[19] J. Tou, and R. Gonzalez, Pattern recognition principles, 1974. →
[20] Y. Shim, J. Chung, and I. Choi A comparison study of cluster validity indices using a
nonhierarchical clustering algorithm. International Conference On Computational Intelligence For
Modelling, Control And Automation And International Conference On Intelligent Agents, Web
Technologies And Internet Commerce (CIMCAIAWTIC’06). vol 1 pp. 199–204, 2005. →
[21] Y. Zhuang, Y. Mao, and X. Chen A limited-iteration bisecting K-means for fast clustering large
datasets. 2016 IEEE Trustcom/BigDataSE/ISPA. pp. 2257–2262, 2016. →
[22] R. Sokal, Clustering and classification: Background and current directions. Classification And
Clustering, pp. 1–15, 1977. →
[23] Z. Safari, K. Mursi, and Y. Zhuang Fast automatic determination of cluster numbers for high-
dimensional big data. Proceedings Of The 2020 The 4th International Conference On Compute
And Data Analysis. pp. 50–57, 2020. a, b, c, d
[24] A. Stisen, H. Blunck, S. Bhattacharya, T. Prentow, M. Kjærgaard, A. Dey, T. Sonne, and M.
Jensen Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for
activity recognition. Proceedings Of The 13th ACM Conference On Embedded Networked Sensor
Systems. pp. 127–140, 2015. →
[25] DOT, N. NYC Taxi and Limousine Commission-Trip Record Data, 2016. →
[26] H. Hmida, S. Hamida, A. Borgi, and M. Rukoz, Scale genetic programming for large data sets:
Case of Higgs Bosons classification. Procedia Computer Science, vol. 126, pp. 302–311, 2018. →
[27] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, Parameterized machine
learning for high-energy physics. ArXiv Preprint ArXiv:1601.07913, 2016. →
[28] S. Agarwal, S. Yadav, and K. Singh Notice of Violation of IEEE Publication Principles: K-means
versus K-Means++ clustering technique. 2012 Students Conference On Engineering And Systems.
pp. 1–6, 2012. →
[29] A. Ng, M. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm. Advances In
Neural Information Processing Systems, 2001. →
4 Canonical correlation analysis and exploratory factor analysis
of the four major centrality metrics

Natarajan Meghanathan

Acknowledgments: The work leading to this chapter was partly funded through the U.S. National
Science Foundation (NSF) grant OAC-1835439. The views and conclusions contained in this chapter
are those of the authors and do not represent the official policies, either expressed or implied, of
the funding agency.

Abstract
Canonical correlation analysis (CCA) is useful in analyzing the correlation between two sets of
features in a dataset, whereas exploratory factor analysis (EFA) is useful in identifying the number
and mapping of the hidden factors that could dominantly represent the features in the dataset.
The four major centrality metrics considered in this study are: degree (DEG), eigenvector (EVC),
betweenness (BWC), and closeness (CLC). In the first part of this chapter, we consider DEG and
EVC as neighborhood-based centrality metrics and BWC and CLC as shortest path-based centrality
metrics and study the canonical correlation between these two sets of centrality metrics for a
suite of 80 real-world networks. We observe either a strong negative or a strong positive canonical
correlation between these two sets of metrics for about 60% of the real-world networks. In the
second half of the chapter, we conduct EFA on the same suite of 80 complex network datasets to
identify the number and mapping of the hidden factors (expected to be less than four) that could
dominantly represent the values incurred by the vertices with respect to the four major centrality
metrics. We observe that the BWC and CLC metrics (though both are shortest path-based
centrality metrics) cannot be represented by the same factor for about 70 of the 80 real-world
networks. On the other hand, we observe the DEG and EVC metrics to be represented by the same
factor for about 50 of the 80 real-world networks.
Keywords: canonical correlation analysis, centrality metrics, exploratory factor analysis,
neighborhood-based metrics, shortest path-based metrics, complex network analysis

4.1 Introduction
Centrality metrics quantify the topological importance of nodes or edges in a complex network
[→1]. The four major centrality metrics typically studied for complex network analysis are: degree
centrality (DEG), eigenvector centrality (EVC), betweenness centrality (BWC) and closeness
centrality (CLC). The DEG [→1] of a node is a direct measure of the number of neighbors of the
node. The EVC [→2] of a node is a measure of the degree of the node as well as the degrees of its
neighbors. The BWC [→3] of a node is a measure of the fraction of the shortest paths between
any two nodes in the network that go through the node. The CLC [→4] of a node is a measure of
the closeness of the node to the rest of the nodes in the network. Among these centrality metrics,
the DEG metric is computationally light and can be computed without requiring global knowledge,
whereas the other three major centrality metrics (EVC, BWC, and CLC) are computationally heavy
and require global knowledge.
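For readers who wish to reproduce such metric values, the snippet below computes the four centrality metrics with networkx. The Zachary karate-club graph is only a stand-in for the networks analyzed in this chapter, and the library's normalization conventions may differ slightly from the raw definitions cited above.

```python
# Illustrative computation of the four major centrality metrics with networkx.
# The karate-club graph is only a stand-in for the real-world networks studied here.
import networkx as nx

G = nx.karate_club_graph()
deg = nx.degree_centrality(G)        # neighborhood-based (DEG)
evc = nx.eigenvector_centrality(G)   # neighborhood-based (EVC); power iteration
bwc = nx.betweenness_centrality(G)   # shortest path-based (BWC); Brandes' algorithm
clc = nx.closeness_centrality(G)     # shortest path-based (CLC); BFS-based distances

for v in sorted(G.nodes())[:5]:
    print(v, round(deg[v], 3), round(evc[v], 3), round(bwc[v], 3), round(clc[v], 3))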
Correlation studies in complex network analysis have been traditionally conducted between
any two centrality metrics (node-level or edge-level). Correlation analyses involving two centrality
metrics have so far been conducted either: (1) to assess the strength of the linear relationship for
prediction of value for one metric using the value for the other metric (Pearson’s correlation
[→5]); (2) to assess the similarity in the rankings of the nodes based on the two metrics
(Spearman’s correlation [→5]); and (3) to assess the pair-wise correlation of the nodes based on
the two metrics (Kendall’s correlation [→5]: i.e., if the value for one metric increases, does the
value for the other metric increase or decrease?). Such studies have shown moderate–strong
correlation between several pairs of centrality metrics: (DEG, EVC), (DEG, BWC), and (DEG, CLC) as
well as weaker correlation between the BWC and EVC metrics.
In this chapter, we intend to take correlation analysis involving centrality metrics to the next
level. Per the definition of the four major centrality metrics, we seek to consider the DEG and EVC
metrics as neighborhood-based metrics and the BWC and CLC metrics as shortest path-based
metrics. Thereby we could use canonical correlation analysis (CCA [→16]) to assess the strength of
correlation between the two neighborhood-based centrality metrics DEG and EVC vs. the two
shortest path-based centrality metrics BWC and CLC. The results of such a CCA would help us to
answer the question: if a node exhibits higher (or lower) values for the neighborhood-based
centrality metrics, how likely is the node to exhibit higher (or lower) values for the shortest path-
based centrality metrics? We also wanted to investigate whether any network-level metric could
influence such a correlation. In this pursuit, we consider network-level metrics such as the spectral
radius ratio for node degree [→6], assortativity index [→7], and randomness index [→8] and seek
to explore their distribution vs. the (DEG, EVC) vs. (BWC, CLC) canonical correlation coefficient
values for the real-world networks.
In the second half of the chapter, we propose to conduct exploratory factor analysis (EFA)
[→17, →18] on a complex network dataset featuring the values incurred by the vertices with
respect to the four major centrality metrics (DEG, EVC, BWC, and CLC) and identify the number of
hidden factors (expected to be less than 4, the number of centrality metrics considered for EFA)
that could dominantly represent the values incurred for the four centrality metrics as well as
extract a mapping of the dominating factors to the centrality metrics represented. Through such
an EFA, we seek to assess whether the two neighborhood-based centrality metrics (DEG and EVC)
are dominantly represented by the same factor and likewise, whether the two shortest path-based
centrality metrics (BWC and CLC) are dominantly represented by the same factor. We also seek to
assess whether the strength of the canonical correlation observed between the neighborhood-
based vs. shortest path-based centrality metrics is related to the number of hidden factors
identified through an EFA of the four centrality metrics considered together as well as the
mapping of the dominating factors to the centrality metrics represented.
The rest of the chapter is organized as follows: Section 4.2 presents the step-by-step
computation procedure of the canonical correlation coefficient between the neighborhood-based
(DEG, EVC) vs. shortest path-based (BWC, CLC) metrics for a toy example graph. Section 4.3
presents the (DEG, EVC) vs. (BWC, CLC) canonical correlation coefficient values obtained for a suite
of 80 real-world networks and compares their distribution with the values incurred for the spectral
radius ratio for node degree, assortativity index and randomness index network-level metrics.
Section 4.4 presents the step-by-step procedure to conduct EFA on a dataset of the centrality
metrics for a toy example graph as well as the procedure to extract the mapping of the
dominating factors to the centrality metrics that they could represent. Section 4.5 presents the
results of EFA conducted on the same suite of 80 real-world networks of Section 4.3 as well as
analyzes the relations between the strength of the canonical correlation vs. the number and
mapping of the dominating factors identified through EFA. Section 4.6 reviews related work in the
literature and highlights our contributions. Section 4.7 concludes the chapter and presents plans
for future work. Throughout the chapter, the terms “network” and “graph,” “node” and “vertex,”
“edge” and “link” are used interchangeably. They mean the same. Note that Sections 4.2 and 4.3
of this chapter are based on a recent conference publication [→19] by the same author.

4.2 Canonical correlation analysis for a toy example graph


In this section, we present a detailed step-by-step procedure to conduct canonical correlation
analysis for a toy example graph. The same methodology is applied for the real-world network
graphs analyzed in Section 4.3. →Figure 4.1 presents the toy example graph and the raw as well as
normalized values for the four centrality metrics DEG, EVC, BWC, and CLC of the vertices. While the
DEG metric is simply the number of neighbors for a vertex, the other three metrics are
determined using computationally intensive/heavy algorithms involving global network
knowledge. The EVC metric is computed using the power-iteration algorithm [→2]; the BWC
metric is computed using the Brandes’ algorithm [→9]; the CLC metric is computed using the
breadth-first search algorithm [→10]. While the DEG and CLC metrics can be computed
asynchronously (the algorithms are run only at the particular node of interest), the EVC and BWC
metrics are computed in a synchronous fashion (i.e., the algorithms need to be run at all the
nodes in the network, even if we are interested only in the centrality metric value for a particular
node).

Figure 4.1: Toy example graph and the raw/normalized values of the centrality metrics of the
vertices.

4.2.1 Normalization and notations


The normalization of a metric is done using the mean μ and standard deviation σ of the raw
values v of the metric such that the mean and standard deviation of the normalized metric data
((v − μ)/σ) are 0.0 and 1.0, respectively (i.e., they follow a standard normal distribution). For
notational convenience, the neighborhood-based DEG and EVC metrics are represented
as N1 and N2, respectively, and the shortest path-based BWC and CLC metrics are represented as
S1 and S2 respectively, throughout this section. Let NS denote the matrix version of the
normalized dataset and its dimensions are 8 × 4, where 8 is the number of vertices in the graph
and 4 is the number of centrality metrics (N1, N2, S1, and S2).
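A minimal sketch of this standardization step is shown below; the 8 × 4 matrix of raw values from →Figure 4.1 is not reproduced here, so a random placeholder matrix is used in its place.

```python
# Z-score standardization of the raw centrality values (columns: N1=DEG, N2=EVC, S1=BWC, S2=CLC).
import numpy as np

raw = np.random.rand(8, 4)                         # placeholder for the 8 x 4 raw values of Figure 4.1
NS = (raw - raw.mean(axis=0)) / raw.std(axis=0)    # each column now has mean 0.0 and std 1.0
```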

4.2.2 Covariance matrix of NS and its partition into submatrices


We now compute the covariance matrix (see →Figure 4.2) of the normalized NS matrix. The
entries in the covariance matrix (that is also symmetric) are the Pearson’s correlation coefficient
[→5] between the corresponding centrality metrics. For example, the entry 0.898146 in the cells
corresponding to (N1, S1) and (S1, N1) in the covariance matrix represents the Pearson’s
correlation coefficient between N1 (DEG) and S1 (BWC). We then generate four submatrices based
on this covariance matrix. The submatrices are represented with notations R:NN, R:NS, R:SN, and
R:SS corresponding to the cells of the centrality metrics (in the covariance matrix) used to form
them. That is, R:NN represents the cells in the intersection of rows and columns corresponding to
each of N1 and N2; R:NS represents the cells in the intersection of rows corresponding to N1 and
N2 and columns corresponding to S1 and S2; R:SN represents the cells in the intersection of rows
corresponding to S1 and S2 and columns corresponding to N1 and N2; and R:SS represents the
cells in the intersection of rows and columns corresponding to each of S1 and S2.

Figure 4.2: Covariance matrix of NS and its partition into submatrices.
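The covariance computation and its partition can be sketched as follows. NS is assumed to be the standardized matrix from the previous step, with columns ordered N1, N2, S1, S2; a random placeholder is regenerated here so that the snippet runs on its own.

```python
# Covariance (Pearson's correlation) matrix of NS and its partition into the four submatrices.
import numpy as np

NS = np.random.randn(8, 4)                         # placeholder for the standardized dataset
NS = (NS - NS.mean(axis=0)) / NS.std(axis=0)       # columns ordered [N1, N2, S1, S2]
R = np.corrcoef(NS, rowvar=False)                  # 4 x 4 symmetric matrix of Pearson's correlations

R_NN = R[0:2, 0:2]   # rows/cols N1, N2
R_NS = R[0:2, 2:4]   # rows N1, N2; cols S1, S2
R_SN = R[2:4, 0:2]   # rows S1, S2; cols N1, N2
R_SS = R[2:4, 2:4]   # rows/cols S1, S2
```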

4.2.3 Computation of the magnitude of the canonical correlation coefficient


The magnitude of the canonical correlation coefficient for the normalized NS matrix is the
principal eigenvalue of the R:N and the R:S matrices generated based on the submatrices of
Section 4.2.2. We generate the submatrices R:N and R:S (both are 2 × 2 matrices) using the
following formulae and this is illustrated in →Figure 4.3:

Figure 4.3: Generation of the R:N and R:S matrices and their eigenvalues/eigenvectors.

R:N = Inverse(R:NN) ✶ R:NS ✶ Inverse(R:SS) ✶ R:SN


R:S = Inverse(R:SS) ✶ R:SN ✶ Inverse(R:NN) ✶ R:NS
We also compute the two eigenvalues and the corresponding eigenvectors [→5] of the R:N and
R:S matrices. An eigenvector X (a column vector) corresponding to a scalar eigenvalue λ for a
matrix A satisfies the property that A✶X = λ✶X. Note (from →Figure 4.3) that the eigenvalues
obtained for both the R:N and R:S matrices are the same; but, the entries in their corresponding
Eigenvectors are different. The principal eigenvalue is the larger of the two eigenvalues and the
corresponding eigenvector is referred to as the principal eigenvector. In →Figure 4.3, we notice
that for both R:N and R:S, the principal eigenvalue is 0.992182; its square root (0.9961) is the
magnitude of the canonical correlation coefficient that we are interested in finding for the
neighborhood-based vs. shortest path-based centrality metrics. But, at this moment,
we cannot say anything about the sign of this canonical correlation coefficient value and this is
what we will be interested in finding in Section 4.2.4.
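A compact numpy rendering of this step, under the assumption that the four submatrices come from the partition sketched earlier, is given below; the function name is illustrative.

```python
# Form R:N and R:S from the submatrices; the square root of their (common) principal
# eigenvalue is the magnitude of the canonical correlation coefficient.
import numpy as np

def canonical_correlation_magnitude(R_NN, R_NS, R_SN, R_SS):
    R_N = np.linalg.inv(R_NN) @ R_NS @ np.linalg.inv(R_SS) @ R_SN
    R_S = np.linalg.inv(R_SS) @ R_SN @ np.linalg.inv(R_NN) @ R_NS
    evals_N, evecs_N = np.linalg.eig(R_N)          # R_S has the same eigenvalues
    evals_S, evecs_S = np.linalg.eig(R_S)
    i = int(np.argmax(evals_N.real))
    j = int(np.argmax(evals_S.real))
    magnitude = float(np.sqrt(evals_N.real[i]))    # sqrt of the principal eigenvalue
    return magnitude, evecs_N.real[:, i], evecs_S.real[:, j]
```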

4.2.4 Computation of the sign of the canonical correlation coefficient


We now seek to find a value (referred to as the canonical value) for each vertex that captures the
variations in each of the two categories of centrality metrics and across the two categories. More
specifically, we will find a canonical N value and a canonical S value for each vertex as weighted
averages using the entries in the principal eigenvectors of R:N and R:S respectively as the weights
for the centrality metrics N1 (DEG) and N2 (EVC) for the neighborhood-based centrality metrics
and the weights for the centrality metrics S1 (BWC) and S2 (CLC). For example, the canonical N
value for vertex 0 is 0.845550 ✶ DEG(0) + (−0.533896) ✶ EVC(0) = −0.2599519, where 0.845550 and
−0.533896 are the entries for DEG (also N1) and EVC (also N2) in the principal eigenvector of R:N,
respectively.
We then plot (see →Figure 4.4) the canonical N vs. canonical S values for all the vertices of the
graph and fit this distribution as a straight line; the sign of the slope of this straight line is the sign
of the canonical correlation coefficient. The R² value for this straight-line fit (0.992182) is the
square of the canonical correlation coefficient’s magnitude and should match the principal
eigenvalue determined in Section 4.2.3/→Figure 4.3. In →Figure 4.4, we notice the sign
of the slope of the straight line (that fits the canonical N vs. canonical S values for toy example
graph) to be positive. Hence, we conclude the canonical correlation coefficient for the
neighborhood-based vs. the shortest path-based centrality metrics is +0.9961. Such a large
positive value for the canonical correlation coefficient suggests that the shortest path-based
centrality metric values for the vertices in the toy example graph are very strongly and positively
correlated to the neighborhood-based centrality metric values for the vertices. In other words, if
any vertex in this graph has a high value for any neighborhood-based centrality metric or is
ranked high with respect to a neighborhood-based centrality metric, that vertex is also expected
to incur a higher value for any shortest path-based centrality metric or also ranked high with
respect to a shortest path-based centrality metric. In →Figure 4.1, we observe vertices 4 and 5 to
be ranked high with respect to both (DEG, EVC) as well as (BWC, CLC); vertex 0 has the lowest
(DEG, EVC) as well as the lowest (BWC, CLC).
Figure 4.4: Finding the sign of the canonical correlation coefficient.
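A hedged sketch of this sign determination is given below; w_N and w_S are assumed to be the principal eigenvectors of R:N and R:S from the previous step, and the straight-line fit is done with numpy.polyfit.

```python
# Canonical N and S values per vertex, and the sign of the canonical correlation
# from the slope of the straight-line fit of canonical N vs. canonical S.
import numpy as np

def signed_canonical_correlation(NS, w_N, w_S):
    """NS: standardized n x 4 matrix [N1, N2, S1, S2]; w_N, w_S: principal eigenvectors of R:N, R:S."""
    canon_N = NS[:, 0:2] @ w_N                  # weighted combination of DEG and EVC per vertex
    canon_S = NS[:, 2:4] @ w_S                  # weighted combination of BWC and CLC per vertex
    slope, _ = np.polyfit(canon_N, canon_S, 1)  # sign of the slope gives the sign of the coefficient
    r = np.corrcoef(canon_N, canon_S)[0, 1]     # |r| should match the magnitude from the previous step
    return np.sign(slope) * abs(r)
```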

4.3 Canonical correlation analysis for real-world network graphs


We now present results of the canonical correlation analysis conducted on a suite of 80 real-world
network graphs whose number of nodes ranges from 22 to 1,538 (with a median of 143 nodes)
and the number of edges ranges from 38 to 16,715 (with a median of 613 edges). We computed
the values of the four major centrality metrics (DEG, EVC, BWC, and CLC) of the nodes in each of
these networks using the appropriate algorithms mentioned earlier. We computed the canonical
correlation coefficient values for the neighborhood-based (DEG, EVC) metrics vs. the shortest
path-based (BWC, CLC) metrics as well as explored the impact of three network-level metrics (such
as spectral radius ratio for node degree [→6], assortativity index [→7], and randomness index
[→8]) on the canonical correlation coefficient values obtained.
→Figure 4.5 presents the distribution (sorted in the decreasing order) of the canonical
correlation coefficient values for the 80 real-world networks. We observe 30 and 50 of the 80 real-
world networks to incur negative and positive canonical correlation coefficient values,
respectively. The negative canonical correlation coefficient values fall in the range of [−1, …,
−0.79], indicating a strong negative correlation, while the positive canonical correlation coefficient
values fall in the range of [0.3, …, 1]. Among the 50 real-world networks incurring positive
canonical correlation coefficient values, 37 incur values above 0.79, indicating a strong positive
correlation. Only one of the 80 real-world networks incurred a canonical correlation coefficient in
the range of (−0.79, …, 0.53). Overall, 67 of the 80 real-world networks (i.e., more than four-fifths)
are observed to exhibit either a strong positive or
strong negative canonical correlation between the neighborhood-based centrality metrics and the
shortest path-based centrality metrics. This also implies that the ranking of vertices with respect
to one category of centrality metrics is more likely to be the same (for strongly positive canonical
correlated networks) or more likely the opposite (for strongly negative canonical correlated
networks) as the ranking of vertices with respect to the other category of centrality metrics.
Figure 4.5: Canonical correlation coefficient values of the real-world networks.

We now investigate the influence of three of the commonly measured network-level metrics
(assortativity index, randomness index, and spectral radius ratio for node degree) on the
canonical correlation coefficient values observed for neighborhood-based vs. shortest path-based
centrality metrics. Assortativity index (ranges from −1 to 1) [→7] is a measure of the similarity
between the degrees of the end vertices of the edges in a network and is calculated as the
Pearson’s correlation coefficient between the degrees of the end vertices of the edges. Networks
with positive assortativity index are referred to as assortative networks and those with negative
assortativity index are referred to as dissortative networks. The randomness index (ranges from −1
to 1) [→8] is a measure of the extent to which the association between nodes leading to the
formation of edges is random (and not due to any preferential attachment). The randomness index
is calculated as the Pearson’s correlation coefficient between the degree of the vertices and the
average local clustering coefficient of the vertices with that particular degree. The larger the
randomness index (i.e., the farther away from −1), the larger the randomness in the association of
nodes in the form of edges; the lower the randomness index (i.e., the closer to −1), the larger the
preference among nodes to get associated with each other in the form of edges. The spectral radius ratio for node
degree (whose values are greater than or equal to 1.0) [→6] is a quantitative measure of the
extent of variation in node degree. The spectral radius ratio for node degree is the ratio of the
principal eigenvalue of the adjacency matrix of the network graph to the average degree of the
vertices; the farther away the value from 1.0, the larger the variation in node degree. Note that
quantitative values for all the above three network-level metrics are independent of the number
of nodes and edges in the network.
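The following sketch shows one way to compute these three network-level metrics with networkx and numpy. The randomness-index function reflects the definition quoted above (correlation between degree and the average local clustering coefficient of vertices of that degree) and may differ in detail from the reference implementation in [→8]; the function names are illustrative.

```python
# Sketches of the three network-level metrics, following the definitions given above.
import networkx as nx
import numpy as np

def spectral_radius_ratio(G):
    A = nx.to_numpy_array(G)
    principal = max(np.linalg.eigvals(A).real)          # principal eigenvalue of the adjacency matrix
    avg_degree = 2.0 * G.number_of_edges() / G.number_of_nodes()
    return principal / avg_degree                       # >= 1.0; larger means more degree variation

def assortativity_index(G):
    return nx.degree_assortativity_coefficient(G)       # Pearson's correlation of end-vertex degrees

def randomness_index(G):
    # Pearson's correlation between degree d and the average local clustering
    # coefficient of the vertices having that degree d.
    deg, clus = dict(G.degree()), nx.clustering(G)
    by_degree = {}
    for v, d in deg.items():
        by_degree.setdefault(d, []).append(clus[v])
    ds = sorted(by_degree)
    avg_clus = [float(np.mean(by_degree[d])) for d in ds]
    return float(np.corrcoef(ds, avg_clus)[0, 1])
```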
→Figure 4.6 displays the distribution of the spectral radius ratio for node degree vs. canonical
correlation coefficient values. We observe a distinct pattern: 28 of the 30 real-world networks that
incurred negative values for the canonical correlation coefficient reported spectral radius ratio for
node degree less than 3.0; 18 of the 20 real-world networks that reported a spectral radius ratio
for node degree greater than 3.0 incurred positive values for the canonical correlation coefficient.
This could lead to a conclusion that real-world networks with significantly larger variation in node
degree are more likely to exhibit a strong positive correlation between the neighborhood-based
and shortest path-based centrality metrics. We could also conclude that if a real-world network
incurred negative values for the canonical correlation coefficient, its spectral radius ratio for node
degree is more likely to be less than 3.0 (i.e., the real-world network is less likely to have a
significant variation in its node degree).

Figure 4.6: Canonical correlation coefficient vs. spectral radius ratio for node degree.

→Figure 4.7 displays the distribution of the assortativity index and randomness index vs. the
canonical correlation coefficient values for the real-world networks. Unlike →Figure 4.6, we do not
observe any distinct patterns. The distributions with respect to both the network-level metrics in
→Figure 4.7 appear to be almost random. In other words, the canonical correlation coefficient
value between the neighborhood-based and shortest path-based centrality metrics appears to be
independent of the degree assortativity of the edges as well as the extent of randomness in the
association of nodes leading to the formation of edges.
Figure 4.7: Canonical correlation coefficient vs. assortativity index and randomness index.

→Table 4.1 presents the canonical correlation coefficient values incurred for some (30) of the well-
studied benchmark real-world datasets. We can observe the biological networks and
transportation networks to predominantly incur positive canonical correlation coefficient values;
whereas, organizational networks (like department faculty networks, political networks, game-
based networks, etc.) are more likely to incur negative canonical correlation coefficient values.

Table 4.1: Canonical correlation coefficient values for some benchmark real-world networks.
Real-world networks Canonical corr. coeff. Real-world networks Canonical corr. coeff.
Slovenian Magazine Network −0.99883 Dolphin Network 0.762299
Cat Brain Network −0.98458 Erdos 971 Network 0.869162
Soccer World Cup 98 Network −0.97529 Celegans Network 0.881077
Adjacency Noun Network −0.97305 US Airports 97 Network 0.884632
Senator Press Meets Network −0.96466 Les Miserables Network 0.885322
UK Faculty Network −0.93757 Taro Exchange Network 0.908145
Mexican Political Network −0.93563 EU Air Transportation Network 0.930042
Karate Network −0.93554 Java Dependency Network 0.934391
Band Jazz Network −0.89191 Copper Field Network 0.945132
Facebook Network of a User −0.93563 Anna Karenina Network 0.958253
Korean Family Planning Disc. Net. −0.80332 London Train Transportation Net. 0.964252
US Football 2000 Network 0.302123 Yeast Phosphorylation Network 0.967014
Euro Road Network 0.531523 Perl Developers Network 0.978418
McCarty Social Net. J. Co-authors 0.556008 Wind Surfers Beach CA Network 0.981030
Political Books Network 0.700493 Macaque Dominance Network 0.993647

4.4 Exploratory factor analysis of centrality dataset


EFA [→17, →18] is a widely used approach in machine learning to quantify the hidden factors that
are behind the values incurred for the different features (columns) in a dataset (matrix) of
records. For the problem in hand, the features are the four centrality metrics (DEG, EVC, BWC, and
CLC) and the records are the centrality metric values incurred for the vertices. We now explain the
procedure to conduct EFA using the same toy example graph of Section 4.2. We use the
normalized centrality metric values for the vertices as the dataset and first determine its
covariance matrix (see →Figure 4.8). The covariance matrix comprises of the Pearson’s correlation
coefficient [→10] between the centrality metrics values for any two nodes.

Figure 4.8: Running example graph; centrality metrics dataset; covariance matrix and its principal
components retained and their corresponding eigenvectors.

We then determine the eigenvalues and eigenvectors [→10] of the covariance matrix. Since the
Covariance matrix is a 4 × 4 matrix (corresponding to the number of centrality metrics
considered), we will get four eigenvalues and the corresponding eigenvectors. We multiply the
normalized dataset of the centrality metrics with each of the four eigenvectors to obtain the
corresponding principal components. We compute the variances of the entries in each of the
principal components and retain the first m eigenvectors whose sum of the variances is at least
99% of the number of rows/columns of the covariance matrix. The number of eigenvectors
retained corresponds to the number of hidden factors that could dominantly represent the
features of the dataset. Note that the entries in the principal components correspond to the node
ids and the entries in the eigenvectors correspond to the features (centrality metrics in our case).
The communality score [→20] for each entry (centrality metric) in these eigenvectors is defined as
the sum of the squares of the factor loadings for that centrality metric.
In the case of the toy example graph, we observe the sum of the variances (3.5950 and 0.3784)
of the first two principal components is 99.33% of 4 (the number of rows/columns of the
covariance matrix) and hence we retain the eigenvectors corresponding to these two principal
components (all of which are shown in →Figure 4.8). The number of eigenvectors retained (two,
for the toy example graph) correspond to the number of hidden factors in the dataset and we
consider the entries in these eigenvectors as the initial values of the factor loadings (a measure of
the extent of the representation of the centrality metrics by these factors). The communality score
for the DEG metric based on the initial factor loadings is (0.5256)² + (−0.0411)² = 0.2779. Likewise,
the communality scores for the EVC, BWC, and CLC metrics based on the initial factor loadings are
(0.4706)² + (−0.7328)² = 0.7585, (0.4793)² + (−0.6741)² = 0.6841, and (0.5220)² + (0.0831)² = 0.2794,
respectively.
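A compact sketch of this factor-count and initial-loading step is given below; it follows the 99% retention rule described above (rather than any library default), NS is assumed to be the standardized centrality dataset, and the function name is illustrative.

```python
# Eigen-decompose the covariance matrix, retain eigenvectors until 99% of the total
# variance (= number of features) is covered, and compute the initial communalities.
import numpy as np

def initial_factor_loadings(NS, threshold=0.99):
    """NS: standardized n_vertices x 4 matrix of DEG, EVC, BWC, CLC values."""
    R = np.corrcoef(NS, rowvar=False)               # 4 x 4 covariance matrix of the standardized data
    evals, evecs = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1]                 # decreasing eigenvalue order
    evals, evecs = evals[order], evecs[:, order]
    # For standardized data, the variance of each principal component (NS @ eigenvector)
    # equals the corresponding eigenvalue, and the eigenvalues sum to the number of features.
    m = int(np.searchsorted(np.cumsum(evals) / NS.shape[1], threshold)) + 1
    loadings = evecs[:, :m]                         # initial factor loadings (one row per metric)
    communalities = np.sum(loadings ** 2, axis=1)   # sum of squared loadings per metric
    return loadings, communalities, m
```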
The communality score (whose maximum value can be 1.0) for a feature (centrality metric in
our case) is a quantitative measure of the extent of representation of the feature in the identified
factors. Varimax rotation [→21] is a coordinate rotation procedure to scale the loadings for the
features (centrality metrics) with respect to the factors such that the communality scores for each
of the features (centrality metrics) are maximized (in our case, at least 0.98). In order to perform
varimax rotation, we first plot the initial factor loadings for the centrality metrics in a coordinate
system (for the toy example graph, we would need a two-dimensional coordinate system as EFA
identified two factors to represent the four centrality metrics) referred to as the principal
eigenvector-axes; see the two-dimensional coordinate system to the left of →Figure 4.9. Varimax
rotation involves repeated orthogonal synchronous rotations of the eigenvector coordinate axes
until the sum of the squares of the loadings for each feature is maximized. We conducted varimax
rotation using the relevant libraries available in Python (the factor-analyzer package [→22]). The axes in the resulting
rotated coordinate system (see to the right of →Figure 4.9) correspond to the two factors and the
coordinates for the features (centrality metrics) in this coordinate system represent the final
factor loadings for the features.

Figure 4.9: Varimax rotation of the eigenvector axes to the factor axes.

The final factor loadings for the four centrality metrics resulting from varimax rotation
(mentioned in the right of →Figure 4.9) are: DEG (0.7300, 0.6790), EVC (0.3430, 0.9390), BWC
(0.9470, 0.3190), and CLC (0.7570, 0.6390). Based on these final factor loadings, we observe the
communality scores for each of DEG, EVC, and BWC to be greater than 0.99 and the communality
score for CLC to be greater than 0.98. To identify the factor that could be considered to
dominantly represent a centrality metric, we pick the factor whose entry in the final factor loading
tuple is the maximum. In case of a tie, we consider the metric to be represented by all the tying
factors. Note that by conducting varimax rotation until the communality score for a centrality
metric is at least 0.98, we are requiring that more than 50% (and more than 70%) of the centrality
metric is represented by one of the three factors (and two factors) in a three-factor (and two-
factor) coordinate system. On this basis, for the toy example graph, the DEG (0.7300), BWC
(0.9470), and CLC (0.7570) metrics are dominated by Factor-1 and the EVC (0.9390) metric is
dominated by Factor-2. Thus, for the toy example graph, we conclude the two shortest path-based
centrality metrics are dominantly represented by the same factor, whereas the two
neighborhood-based centrality metrics are dominantly represented by different factors.
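For the rotation and mapping steps, a hedged sketch using the factor-analyzer package cited in [→22] is shown below. The extraction settings (method="principal" with varimax rotation) are my approximation of the procedure described above, not necessarily the author's exact configuration; the final mapping picks, for each metric, the factor with the largest absolute loading.

```python
# Varimax rotation and dominant-factor mapping with the factor-analyzer package cited in [22].
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

def dominant_factor_map(NS, n_factors, metric_names=("DEG", "EVC", "BWC", "CLC")):
    data = pd.DataFrame(NS, columns=list(metric_names))
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
    fa.fit(data)
    loadings = pd.DataFrame(fa.loadings_, index=list(metric_names))  # final factor loadings
    communalities = (loadings ** 2).sum(axis=1)                      # should approach 1.0 per metric
    dominant = loadings.abs().idxmax(axis=1)                         # factor with the largest loading
    return loadings, communalities, dominant
```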

4.5 Exploratory factor analysis for the real-world network graphs


We now present results of EFA conducted on the same suite of 80 real-world network graphs for
which the CCA results were presented in Section 4.3. We seek to identify the number of hidden
factors behind the values incurred by the vertices in these 80 real-world networks with respect to
the four centrality metrics DEG, EVC, BWC, and CLC as well as extract a mapping of the factors to
these centrality metrics. We want to explore whether the two neighborhood-based centrality
metrics DEG and EVC get mapped to the same dominating factor as well as whether the two
shortest path-based centrality metrics BWC and CLC get mapped to the same dominating factor.
As the number of features in the dataset is 4, we restrict our EFA to exploring whether one, two, or
three hidden factors might be behind the values incurred for the centrality metrics.
In →Figure 4.10, we number the networks from Net-1 … Net-80 corresponding to the order
the canonical correlation coefficient results (from −1 to 1) were displayed in →Figure 4.5.
Networks Net-1 … Net-24 incurred strongly negative canonical correlation (coefficient values less
than −0.90), whereas networks Net-56 … Net-80 incurred strongly positive canonical correlation
(coefficient values greater than 0.90). The remaining 31 of the 80 networks (Net-25 … Net-55)
incurred canonical correlation coefficient values in the vast range of [−0.90, …, 0.90], which we
refer to as networks exhibiting a weak–moderate canonical correlation between {DEG, EVC} vs.
{BWC, CLC}. Overall, we observe two factors to be sufficient to statistically represent (with a
communality score of 0.98 or above for each metric) the four centrality metrics for networks that
exhibited either a strong positive or negative canonical correlation between the neighborhood-
based and shortest path-based centrality metrics. On the other hand, we observe three factors
are needed to statistically represent the four centrality metrics for networks that exhibited a
weak–moderate canonical correlation between these two categories of centrality metrics.

Figure 4.10: Final factor loadings for the centrality metrics with respect to the 80 real-world
networks.

→Figure 4.10 presents the results (final factor loadings) for the 80 real-world networks with
respect to each of the four centrality metrics. We observe the DEG centrality metric to incur its
largest loadings in Factor-1 for all the 80 real-world networks (the blue symbol corresponding to
Factor-1 is above the other two symbols throughout the 80-networks sequence). The EVC metric
also incurs its largest loadings in Factor-1 predominantly for networks that exhibit either strong
positive or strong negative canonical correlation. Overall we observe DEG and EVC to be
dominated by the same factor in 52 of the 80 real-world networks. For about half of the 31 real-
world networks that exhibited a weak–moderate canonical correlation between the
neighborhood-based and shortest path-based centrality metrics, we observe the two
neighborhood-based DEG and EVC metrics to be dominated by different factors (Factor-1 for DEG
and Factor-2 for EVC).
A significant observation from the EFA results presented in →Figure 4.10 is that the BWC and
CLC metrics (though both are grouped as shortest path-based metrics for CCA in Section 4.3) are
dominated by different factors for more than 90% (73) of the 80 real-world networks. This could
be justified by the totally different underlying characteristics the two metrics tend to capture. The
BWC metric tends to identify nodes through which the majority of the shortest path communications
could occur; on the other hand, the CLC metric tends to identify nodes that are geographically
closer to the rest of the nodes in the network. Our EFA results show that these two characteristics
are indeed different and need not be represented by the same factor.
In relation to the CCA results of Section 4.3, for a majority of the networks that exhibited
strong negative (positive) canonical correlation between the neighborhood-based and shortest
path-based centrality metrics, we observe the CLC (BWC) metric to be dominated by the same
factor (Factor-1) that also dominated the DEG and EVC metrics while the BWC (CLC) metric
appears to be dominated by a different factor (Factor-2). Note that for the weak–moderate
canonically correlated networks, the CLC metric is dominated by a totally different factor (Factor-3,
which mostly dominates only the CLC metric) whereas, we observe the BWC metric to be either
dominated by Factor-1 (along with DEG) or Factor-2 (along with EVC). The above discussions
relating the CCA and EFA results show that between the two shortest path-based centrality
metrics, the BWC metric is relatively a more dominant metric whose variations vis-a-vis the two
neighborhood-based metrics contribute to the overall canonical correlation between the two
categories of centrality metrics.
With respect to the extent of domination by a particular factor, we observe the BWC metric to
be the metric that is most strongly dominated by a particular factor (the BWC loadings for the
dominating factor are much greater than the BWC loadings for the nondominating factors),
followed by EVC. Though the loading values for the dominating factor vs. the nondominating
factors are clearly separable for the DEG metric, we observe the DEG metric (and to a certain
extent the CLC metric) to be relatively less strongly dominated by a particular factor (especially,
for 23 of the 24 strong negative canonical correlated networks, the DEG-loading values for the two
factors are above 0.4).

4.6 Related works


Several works have been reported in the literature featuring correlation analysis involving
centrality metrics for real-world network datasets. However, these correlation analyses studies
involve only two individual centrality metrics. To the best of our knowledge, other than our earlier
work reported in [→19], there has been no prior work done on canonical correlation analysis
(CCA) involving centrality metrics. Also, ours is the first such work to conduct EFA involving the
four major centrality metrics (DEG, EVC, BWC, and CLC) and study the CCA-EFA results in
conjunction. For the rest of this section, we review results from some of the relevant correlation
and factor analysis studies recently reported in the literature.
In [→11], a strong correlation was observed between the two neighborhood-based DEG and
EVC metrics, while only a moderate correlation was observed between the two shortest path-
based BWC and CLC metrics (also confirmed in [→23]) for real-world network graphs of diverse
domains. The Pearson’s correlation analysis was used in [→11]. In [→12], the authors used
Spearman’s rank-based correlation analysis to quantify the extent to which the computationally
light DEG metric can be used to rank the vertices in lieu of the computationally heavy EVC, BWC,
and CLC metrics: it was observed that high-degree vertices are more likely to be high-BWC
vertices as well and high-degree vertices are more likely closer to the rest of the vertices as well;
on the other hand, high-degree vertices were observed to be not necessarily vertices with high
EVC. The centrality-based clustering results reported in [→13] state that there could be two
distinct clusters of nodes: a central core cluster in which the constituent nodes have the largest
centrality metric values with respect to several measures and a peripheral cluster in which the
constituent nodes have lower centrality metric values. In [→14], the authors explored the
correlation (with respect to Pearson’s, Spearman’s, and Kendall’s correlation analysis) between
the recently proposed O(log2n) Game of Thieves (GoT) node centrality metric [→15] to the four
major centrality metrics for synthetic networks generated using the scale-free, small-world, and
random network models: a strong negative correlation was observed for GoT vs. DEG and BWC,
while a weak negative correlation was observed for GoT vs. CLC.
The following factor analysis studies have been reported in the literature: Among the five
centrality metrics (valued centrality [→24], BWC, CLC, flow centrality [→25] and Jordan centrality
[→26]) considered in [→25], the authors use the number of principal components retained
(similar to our approach) as the basis to decide on the number of factors and conclude the entries
for these centrality metrics in the corresponding eigenvectors as the factor loadings. On the other
hand, we treat the entries in the eigenvectors as the initial factor loadings for the centrality
metrics and conduct varimax rotation to maximize the communality score as well as use the
resulting final factor loadings to identify the dominating factor for each centrality metric and
explore these results in the context of strong positive, strong negative, and weak–moderate
canonically correlated networks. Also, similar to our observation, the authors in [→25] also
observed the BWC and CLC metrics to be represented by two different factors. Likewise, the work
of [→27] on graph-theoretical models of agricultural landscapes also observes the BWC and CLC
metrics to be represented by two different factors, with a third factor to represent the degree and
subgraph centrality metrics.

4.7 Conclusions and future work


The results of canonical correlation analysis (CCA) indicate either a strong positive or negative
canonical correlation between the neighborhood-based (DEG and EVC) and shortest path-based
(BWC and CLC) centrality metrics for more than four-fifths of the 80 real-world networks. We
observe 24 and 25 of the real-world networks to incur canonical correlation coefficient values less
than −0.90 (a measure of strong negative canonical correlation) and greater than 0.90 (a measure
of strong positive canonical correlation) respectively. We conclude the rest of the networks (31 of
the 80 networks) to exhibit weak–moderate correlation with the canonical correlation coefficients
spanning over a wide range of [−0.90, …, 0.90]. We also observe the spectral radius ratio for node
degree to be an influential network-level metric that is related to the canonical correlation
between the two categories of centrality metrics: our results indicate that if a real-world network
incurs a negative canonical correlation coefficient, it is more likely to have a relatively lower
variation in node degree (spectral radius ratio for node degree less than 3.0); on the other hand, if
a real-world network exhibits a larger variation in node degree (spectral radius ratio for node
degree greater than 3.0), it is more likely to incur a positive canonical correlation coefficient. With
respect to network domains, we observe the biological networks and transportation networks to
predominantly incur positive canonical correlation coefficient values, while organizational networks
(such as game networks, political networks, departmental networks in universities, etc.) are more
likely to incur negative canonical correlation coefficient values.
The results of EFA on the same suite of 80 real-world networks indicate two hidden factors are
sufficient (with a communality score above 0.98) to represent the four centrality metrics (DEG,
EVC, BWC, and CLC) for both the strong positive and negative canonical correlated networks;
whereas, three hidden factors are needed to statistically represent the four centrality metrics for
weak–moderate canonical correlated networks. We observe the BWC and CLC metrics (though
both are shortest path-based centrality metrics) to be dominantly represented by two different
factors for more than 90% of the networks; whereas, the two neighborhood-based centrality
metrics (DEG and EVC) are dominantly represented by the same factor for both the strong positive
and negative canonical correlated networks and by two different factors for the weak–moderate
canonical correlated networks. Moreover, the BWC (CLC) metric is dominated by the same factor
as those of the DEG and EVC metrics for the strong positive (negative) canonical correlated
networks.
As part of future work, we plan to build a linear canonical correlation-based prediction model
that would use three of the four centrality metrics (such as the DEG, EVC, and CLC) as the
independent variables and predict the fourth metric (such as the BWC that could be the most
computationally intensive metric) and compare the accuracy of such a prediction with those of the
existing Pearson’s correlation-based single variable and multivariable linear prediction models.

References
[1] M. E. J. Newman, Networks: An Introduction, 1st Edition, Oxford, UK: Oxford University Press,
September 2010. a, b
[2] P. Bonacich, Power and Centrality: A Family of Measures. American Journal of Sociology, vol.
92, no. 5, pp. 1170–1182, March 1987. a, b
[3] L. Freeman, A set of measures of centrality based on betweenness. Sociometry, vol. 40, no. 1,
pp. 35–41, March 1977. →
[4] L. Freeman, Centrality in social networks: Conceptual clarification. Social Networks, vol. 1, no.
3, pp. 215–239, 1979. →
[5] G. Strang, Linear Algebra and its Applications, 4th Edition, Brooks Cole, Pacific Grove, CA, USA,
2006. a, b, c, d, e
[6] N. Meghanathan, Spectral radius as a measure of variation in node degree for complex
network graphs. In: Proceedings of the 3rd International Conference on Digital Contents and
Applications, pp. 30–33, Hainan, China, December 20–23 2014. a, b, c
[7] R. Noldus, and P. Van Mieghem, Assortativity in complex networks. Journal of Complex
Networks, vol. 3, no. 4, pp. 507–542, December 2015. a, b, c
[8] N. Meghanathan, Randomness index for complex network analysis. Social Network Analysis
and Mining, vol. 7, no. 25, pp. 1–15, December 2017. a, b, c
[9] U. Brandes, A faster algorithm for betweenness centrality. The Journal of Mathematical
Sociology, vol. 25, no. 2, pp. 163–177, 2001. →
[10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th Edition,
Cambridge, MA, USA: MIT Press, April 2022. a, b, c
[11] N. Meghanathan, Correlation coefficient analysis of centrality metrics for complex network
graphs. In: Proceedings of the 4th Computer Science Online Conference, Intelligent Systems in
Cybernetics and Automation Theory: Advances in Intelligent Systems and Computing, Vol. 348, pp.
11–20, April 27–30 2015. a, b
[12] C. Shao, P. Cui, P. Xun, Y. Peng, and X. Jiang, Rank correlation between centrality metrics in
complex networks: An empirical study. Open Physics, vol. 16, no. 1, pp. 1009–1023, December
2018. →
[13] S. Oldham, B. Fulcher, L. Parkes, A. Arnatkeviciute, C. Suo, and A. Fornitto, Consistency and
differences between centrality measures across distinct classes of networks. PLoS ONE, vol.
0220061, pp. 1–23, July 2019. →
[14] A. Ficara, G. Fiumara, P. De Meo, and A. Liotta, Correlation analysis of node and edge
centrality measures in artificial complex networks. In: Proceedings of Sixth International Congress
on Information and Communication Technology, pp. 901–908, London, UK, September 2021. →
[15] D. C. Mocanu, G. Exarchakos, and A. Liotta, Decentralized dynamic understanding of hidden
relations in complex networks. Scientific Reports, vol. 8, no. 1571, pp. 1–15, January 2018. →
[16] H. Hotelling, Relations between two sets of variates. Biometrika, vol. 28, pp. 321–377, 1936. →
[17] R. L. Gorsuch, Factor Analysis, 2nd Edition, Lawrence Erlbaum Associates, November 1983. a,
b
[18] J. C. Loehlin, Latent Variable Models: An Introduction to Factor, Path, and Structural Equation
Analysis, 4th Edition, Psychology Press, December 2003. a, b
[19] N. Meghanathan, Canonical correlation analysis of neighborhood-based centrality metrics vs.
shortest path-based centrality metrics. In: Proceedings of the 19th International Conference on
Data Science, Las Vegas, USA, July 24−27 2023. a, b
[20] P. Eaton, B. Frank, K. Johnson, and S. Willoughby, Comparing exploratory factor models of the
brief electricity and magnetism assessment and the conceptual survey of electricity and
magnetism. Physical Review Physics Education Research, vol. 15, no. 020133, pp. 1–11, 2019. →
[21] H. F. Kaiser, The varimax criterion for analytic rotation in factor analysis. Psychometrika, vol.
23, pp. 187–200, September 1958. →
[22] →https://pypi.org/project/factor-analyzer/ →
[23] S. T. Hasson, and Z. Hussein, Correlation among network centrality metrics in complex
networks. In: Proceedings of the 6th International Engineering Conference – Sustainable
Technology and Development, pp. 54–58, Erbil, Iraq, 2020. →
[24] M. Suyudi, S. Firman, and A. T. Bon, Theoretical and simulation approaches centrality in social
networks. In: Proceedings of the 11th Annual International Conference on Industrial Engineering
and Operations Management, Singapore, March 2021. →
[25] L. C. Freeman, S. P. Borgatti, and D. R. White, Centrality in valued graphs: A measure of
betweenness based on network flow. Social Networks, vol. 13, pp. 141–154, 1991. a, b, c
[26] P. Hage, and F. Harary, Eccentricity and centrality in networks. Social Networks, vol. 17, no. 1,
pp. 57–63, January 1995. →
[27] E. Estrada, and O. Bodin, Using network centrality measures to manage landscape
connectivity. Ecological Applications, vol. 18, no. 7, pp. 1810–1825, October 2008. →
5 Navigating the landscape of automated data preprocessing:
an in-depth review of automated machine learning platforms

Abderahim Salhi
Althea C. Henslee
James Ross
Joseph Jabour
Ian Dettwiller

Abstract
Effective data preprocessing plays a pivotal role in enhancing the performance of machine
learning (ML) pipelines by influencing the accuracy and overall effectiveness of the final outcomes.
In the context of the growing prominence of automated machine learning (AutoML), the
significance of data preprocessing has become more prominent. This chapter is an ongoing effort
to explore and survey the use of AutoML platforms for data preprocessing. The focus is
particularly directed toward exploring how automated machine learning tools contribute to
streamlining the construction and training phases of ML models. Our examination delves into a
comprehensive exploration of various tasks and subtasks, providing insights into the ways
AutoML tools can be leveraged across the spectrum of ML processes.
Keywords: AutoML, automated machine learning, data preprocessing pipeline

5.1 Introduction
Data preprocessing is an essential step in the ML pipeline that involves data cleaning, integration,
reduction, and transformation. This work prepares the data and improves its quality and usability
for the stages that follow. Preprocessing techniques are also critical for identifying and correcting
errors, reducing noise, and making the data suitable for machine learning algorithms. The task is
often a time-consuming and error-prone process, which can limit the performance of machine
learning models [→1]. In recent years, AutoML has gained a lot of attention from industry,
academia, and government agencies [→2], in particular, as a way to automate various aspects of
the ML process and boost productivity among researchers by minimizing repetitive activities in ML
pipelines and allowing them to focus on higher-level tasks, such as model selection and
hyperparameter tuning [→3]. Meanwhile, there are more and more tools and libraries available
for AutoML that encompass both large-scale commercial platforms and open-source toolkits,
including Amazon AutoGluon [→4], autosklearn [→3, →5], Auto-WEKA [→6], Microsoft Azure
AutoML [→7], Datarobot [→8], Google Vertex AI [→9], H2O DriverlessAI [→10, →11], IBM Watson
AutoAI [→12], MLjar [→13], SalesForce TransmogrifAI [→14], and TPOT [→15].
This chapter is an ongoing survey about the recent advances in AutoML for data
preprocessing [→16]. We discuss the different types of data preprocessing tasks and subtasks
that can be automated, and we review the different AutoML tools that have been proposed for
each task. We also discuss the challenges and limitations of AutoML for data preprocessing.
5.2 Data preprocessing pipeline
A machine learning pipeline can be a complex process, but it is essential for building accurate and
reliable ML models. Data preprocessing is a subset of the ML pipeline. Without counting the
preliminary step of data collection, data preprocessing is the first step in the machine learning
process, as outlined in →Figure 5.1. In general, it consists of four major tasks, as shown in
→Figure 5.2: data cleaning, data integration, data reduction, and data transformation [→17].

Figure 5.1: Machine learning pipeline.


Figure 5.2: Data preprocessing tasks.

5.2.1 Data cleaning


Data cleaning is the process of eliminating or changing data that is inaccurate, lacking,
unnecessary, or structured incorrectly, in order to prepare it for analysis. Cleaning data is well
known to be a tedious and repetitive, albeit necessary, stage of any data preprocessing task.
Rather than performing it manually, AutoML tools can be used to handle it automatically, freeing
researchers for other valuable work. Data cleaning is typically the first process applied to a raw,
real-world dataset: it helps in identifying and correcting errors in the data by removing duplicate
records (i.e., data deduplication), detecting and filling in missing values (i.e., MV imputation), and
handling outliers [→18]. A minimal sketch combining these three steps is given after the list below.
(1) Missing value imputation: Incomplete data poses a frequent challenge in real-world
datasets, exerting a negative impact on the efficacy of machine learning models. To address
this issue, imputation becomes a pivotal process. Conventional methods like mean
imputation or regression imputation have shortcomings, and AutoML-driven imputation
techniques have become a prevalent alternative. AutoML tools streamline the imputation
process by automatically choosing suitable models, such as regression or decision trees,
tailored to the dataset’s specific attributes. Additionally, alternatives like multiple imputation
and algorithm selection are available [→19].
(2) Outlier detection: Anomaly identification, formally known as outlier detection, plays a vital
role in both data analysis and machine learning. Its primary objective is to pinpoint data
points that significantly deviate from the overall dataset. These outliers are typically viewed
as errors or irregularities that can substantially influence the outcomes of analytical tasks.
Diverse methods are utilized for this purpose, encompassing statistical techniques that
leverage measures such as mean, median, and standard deviation. Additionally, distance-
based methods assess the proximity of data points to their neighbors, while clustering
methods group similar data points into clusters. Machine learning methods, employing
algorithms trained on labeled datasets, are also employed to identify outliers in new data.
The effective detection and management of outliers contribute to result accuracy, ensuring
that models are not swayed by abnormal data [→20]. AutoML streamlines this process by
selecting the most suitable outlier detection models from a range of options, thereby
improving accuracy. This automation not only enhances efficiency but also makes outlier
detection more accessible to users with varying levels of expertise.
(3) Data deduplication: Identifying and eliminating redundant or duplicate data from a
dataset is a crucial process that aims to enhance data quality, minimize storage needs, and
optimize the effectiveness of data analysis and machine learning procedures. Within this
realm, AutoML proves beneficial by automating the creation of models dedicated to
recognizing duplicate records in a dataset [→21]. This, in turn, streamlines the procedures
for data cleaning and preprocessing. Additionally, AutoML tools can efficiently construct
predictive models for diverse tasks, autonomously managing intricate aspects of machine
learning, such as algorithm selection and hyperparameter tuning.
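The following is a minimal sketch (not taken from any of the surveyed platforms) of how these cleaning subtasks might be performed by hand with pandas and scikit-learn; the column names and the 3-standard-deviation threshold are hypothetical choices made only for illustration.
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw table with a duplicate row and missing values
df = pd.DataFrame({
    "age": [25, 25, 41, None, 39],
    "income": [48000, 48000, 61000, 52000, None],
})

# Data deduplication: drop exact duplicate records
df = df.drop_duplicates()

# Missing value imputation: fill numeric gaps with the column mean
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Outlier handling: keep rows within 3 standard deviations of the column mean
z = (df - df.mean()) / df.std()
df_clean = df[(z.abs() <= 3).all(axis=1)]
print(df_clean)
An AutoML pipeline performs the equivalent of these steps internally, typically choosing the imputation and outlier strategies automatically.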

5.2.2 Data integration


Data integration is a critical part of the machine learning pipeline. By combining data from
multiple sources into a single dataset, ML models can be trained on a more comprehensive
dataset, which can lead to better results. The process can involve identifying common variables,
resolving inconsistencies, and merging data from different sources. The goal of data integration is
a high-quality dataset, free of mistakes and inconsistencies. This is crucial since the quality of ML
models depends on the data they are trained on; an ML model will be less accurate and
dependable if its data is unreliable or skewed:
1. Encoding categorical variables: Data comes in different forms and shapes. Variables with
nonnumeric attributes, like “color” or “category,” must be converted into numerical values
for compatibility with machine learning algorithms. Various methods exist for encoding
categorical variables, such as one-hot encoding, label encoding, and target encoding (see
the sketch after this list). AutoML plays a significant role in automating the selection of
optimal encoding techniques, along with other data preprocessing tasks, tailored to the
specific dataset and model. By automatically experimenting with different encoding
methods and model hyperparameters, these tools contribute to time savings and enhance
overall model performance [→22].
2. Data combining: Data integration, alternatively referred to as data combining, entails the
amalgamation and consolidation of data from diverse sources, resulting in the formation of
a cohesive dataset. This process serves multiple purposes, including enhancing the
precision of machine learning models, obtaining a more comprehensive understanding of a
specific phenomenon, and generating new features for analytical purposes.
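As a point of reference, the short sketch below shows one-hot and label encoding done by hand with pandas and scikit-learn; the “color” column is a made-up example of the kind of categorical attribute an AutoML tool would encode automatically.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: a single integer code per category
codes = LabelEncoder().fit_transform(df["color"])

print(one_hot)
print(codes)  # e.g., [2 1 0 1] with alphabetically ordered classes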

5.2.3 Data reduction


Data reduction is the process of reducing the size of a dataset while retaining as much
information as possible. This can be done for a variety of reasons, such as improving the speed
and performance of ML algorithms or making the data easier to visualize [→23]. There are several
techniques that can be used for data reduction; for example, if the goal is to improve the
performance of an ML algorithm by identifying and removing features that are not relevant to the
task at hand, then feature selection may be the best option. If the goal involves creating new
features from existing ones, then feature extraction may be the best option [→24].
1. Feature selection: It involves choosing the most pertinent and valuable features from a
dataset in machine learning. AutoML streamlines feature selection by automating the
evaluation and selection of features, considering their significance to the target variable.
This not only reduces the workload for data scientists but also enhances the performance of
machine learning models. Various techniques, such as correlation analysis, information
gain, and recursive feature elimination (RFE), are employed in this process (a short sketch
follows this list).
2. Feature extraction: AutoML tools often use techniques like feature selection, dimensionality
reduction, and feature engineering to optimize the feature set for model performance.
Feature extraction involves selecting and transforming relevant information from raw data
to create a set of meaningful features for machine learning [→25].
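The sketch below illustrates recursive feature elimination (RFE) with scikit-learn, the kind of feature selection step that the platforms surveyed in the next section run automatically; the synthetic dataset and the choice of keeping five features are arbitrary.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 20 features, only 5 of which are informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Recursively drop the weakest features until five remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print("Selected feature indices:", [i for i, kept in enumerate(selector.support_) if kept])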

5.2.4 Data transformation


By modifying preexisting attributes, data transformation seeks to provide a better representation
of raw data. Addressing the curse of dimensionality, improving model generalization, and
overcoming data sparsity are essential to successful machine learning models [→26]. In these
steps, AutoML can be used to convert raw data into a format that is more suitable for ML
algorithms, including normalizing the data, scaling the data [→27], generating new features from
existing ones, such as creating polynomial or interaction terms, or extracting information from
text or images. AutoML can be used as well to reduce the number of features in a dataset while
retaining the most important information.
(1) Data normalization: Data normalization is the process of standardizing data so that it falls
within a consistent and comparable range. This typically involves scaling features to have a
mean of 0 and a standard deviation of 1, or mapping them to a specific range, such as [0, 1]
[→28]. Normalization is important in machine learning to ensure that different features
contribute equally to model training and to prevent numerical instability.
(2) Data scaling: Data scaling is a preprocessing technique used in machine learning to bring
all features to a common scale, typically with zero mean and unit variance. It involves
adjusting the range or scale of features, achieved through methods like min–max scaling, z-
score normalization, or decimal scaling. Min–max scaling sets the data within a specified
range, while z-score normalization centers it on a mean of 0 with a standard deviation of 1.
Decimal scaling, on the other hand, brings values between 0 and 1 by using powers of 10
[→27]; a brief code sketch of these scalers follows below.
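For reference, the sketch below applies min–max scaling and z-score standardization with scikit-learn and decimal scaling with plain NumPy; the three-value array is hypothetical.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

# Min-max scaling: map values into the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())   # approximately [0. 0.444 1.]

# Z-score normalization: zero mean, unit variance
print(StandardScaler().fit_transform(X).ravel())

# Decimal scaling: divide by a power of 10 so that all magnitudes fall below 1
j = int(np.ceil(np.log10(np.abs(X).max() + 1)))
print((X / 10 ** j).ravel())                     # [0.01 0.05 0.1]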
5.3 AutoML platforms for data preprocessing
In this section, we survey some AutoML platforms that handle the data preprocessing tasks
discussed in the previous section; the findings are summarized in →Table 5.1.

Table 5.1: AutoML platforms for data preprocessing.


                    Data cleaning                              Data integration               Data reduction                 Data transformation
AutoML platform     MV          Outlier     Data              Encoding          Data         Feature      Feature           Data
                    imputation  detection   deduplication     categorical vars  combining    selection    extraction        normalization
AutoGluon           Y           N           Y                 Y                 Y            Y            Y                 Y
AutoSklearn         Y           Y           Y                 Y                 Y            Y            Y                 Y
Auto-WEKA           Y           Y           Y                 Y                 N            Y            Y                 Y
Azure AutoML        Y           Y           Y                 Y                 Y            Y            Y                 Y
Datarobot           Y           Y           Y                 Y                 Y            Y            Y                 Y
Google Vertex AI    Y           Y           Y                 Y                 Y            Y            Y                 Y
H2O DriverlessAI    Y           Y           Y                 Y                 Y            Y            Y                 Y
IBM Watson AutoAI   Y           N           Y                 Y                 Y            Y            Y                 Y
MLjar               Y           Y           Y                 Y                 N            Y            Y                 Y
TransmogrifAI       Y           N           Y                 Y                 Y            Y            Y                 Y
TPOT                Y           Y           Y                 Y                 N            Y            Y                 Y

AutoGluon, an Amazon AutoML package, primarily aims to automate the machine learning
process, including feature engineering, model selection, and hyperparameter tuning.
Nonetheless, AutoGluon includes some basic data preprocessing subtasks in its
pipeline. These tasks include dealing with missing values, scaling and normalizing numerical
features, encoding categorical variables like one-hot encoding, performing automated feature
selection, and dealing with class imbalance in the dataset [→4].
Auto-WEKA, Auto-Sklearn, and TPOT are some of the earliest AutoML tools. As open-source
frameworks, the three platforms can handle many basic and advanced data preprocessing tasks,
including missing value imputation, data normalization, feature selection, feature extraction, and
data transformation [→3, →5, →29, →30].
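As a rough illustration of how one of these open-source tools is driven, the sketch below runs TPOT on a small scikit-learn dataset; the tiny generation and population settings are chosen only to keep the example fast and are not recommended defaults.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT searches over preprocessing + model pipelines using genetic programming
tpot = TPOTClassifier(generations=3, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)

print("Test accuracy:", tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # writes the winning scikit-learn pipeline as Python code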
TransmogrifAI can handle some of the manual tasks, such as detecting data types, an important
step in data preprocessing that helps ensure data is correctly interpreted by downstream
algorithms. It can process basic categorical or numeric types as well as complicated
data types (money, phone, address) [→14, →33]. However, it is unstable on many datasets [→34].
In H2O DriverlessAI, users can apply a wide range of preprocessing techniques to their data,
such as handling missing values, encoding categorical variables, scaling numerical variables, and
handling outliers. The platform also includes feature selection and feature extraction for
automatic dimensionality reduction, automatic text encoding using Word2Vec, and other data
preprocessing stages [→11].
Meanwhile, Datarobot and Azure AutoML follow the same steps as H2O DriverlessAI; both can
identify relationships between variables and automatically generate new features to improve
model performance [→32].
Lately, Google launched a new version of its Google Cloud AutoML, called Vertex AI. The
platform boasts a powerful suite of data preprocessing tools. Vertex AI can handle more advanced
data transformation techniques, such as dimensionality reduction, nonlinear feature
transformations, or handling imbalanced datasets. While it is possible to perform these tasks
using other Google Cloud tools, such as BigQuery or Dataflow, Google claims that Vertex AI can
handle them smoothly [→35].
MLjar is a relatively new platform, having been released in 2020. Despite this, it has gained
significant attention from the machine learning community due to its user-friendly interface and
powerful automated machine learning capabilities. MLjar provides various data preprocessing
options, including handling missing values, categorical encoding, feature scaling, and outlier
detection [→36].
Although it creates the final machine learning model rapidly, Auto-Sklearn cannot handle
missing values. The Sklearn library’s machine learning algorithms provide the foundation of this
system. The user will need to perform preprocessing on the input before using Auto-Sklearn to
generate a model with customized parameters [→32, →36].
IBM Watson AutoAI offers a range of data preprocessing capabilities, such as handling
missing values, feature scaling, and feature engineering [→37]. However, the platform lacks
advanced data preprocessing, nonlinear feature transformations, outlier detection, or custom
feature selection algorithms. While it is possible to perform some of these tasks using the
platform’s Python SDK, it may require additional programming and configuration.

5.4 Challenges
Surveying AutoML tools for data preprocessing presents challenges such as keeping up with the
evolving tool landscape, selecting a representative set of technologies, and achieving
comprehensive coverage of available tools. These challenges require researchers to stay updated,
carefully choose the tools to include in the survey, and be mindful of the limitations of covering all
existing tools.
While AutoML is primarily focused on the creation of reliable ML models, many AutoML tools are
not yet effective at handling data preprocessing, which still necessitates a great deal of manual
labor and careful evaluation and validation of the output generated by these tools [→32].

5.5 Conclusion
Data preprocessing is a critical step in the ML pipeline; it involves cleaning, integrating, reducing,
and transforming data so that it can be used to train ML models. AutoML tools are increasingly
being used for data preprocessing tasks. As these tools become more advanced, we are looking to
improve the capabilities and effectiveness of AutoML for data preprocessing by integrating
AutoML with explainability methods. As AutoML becomes more prevalent, it is crucial to ensure
that the results produced by these tools are explainable. Future research could focus on
integrating AutoML with explainability methods such as counterfactual explanations or
interpretable ML models.

References
[1] Q. Yao, et al., Taking human out of learning applications: A survey on automated machine
learning. arXiv Preprint arXiv:1810.13306, 2018. →
[2] M. Lee, et al., Current and future applications of Machine learning for the US Army,
Tech.rep.US Army Research Laboratory Aberdeen Proving Ground United States, 2018. →
[3] M. Feurer, et al., Efficient and robust automated machine learning. In Advances in neural
information processing systems, p. 28, 2015. a, b, c, d
[4] N. Erickson, et al., Autogluon-tabular: Robust and accurate automl for structured data. arXiv
Preprint, arXiv:2003.06505, 2020. a, b, c
[5] Auto-Sklearn Sklearn AutoML Package. →https://www.automl.org/automl/auto-sklearn/.
Accessed: 2023-04-09. a, b, c
[6] C. Thornton, et al., Auto-WEKA: Combined selection and hyperparameter optimization of
classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining, p. 847–855, 2013. a, b
[7] Microsoft Azure Machine Learning Service AutoML Concept. →https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml. Accessed: 2023-04-10. →
[8] Datarobot Data preparation process. →https://www.datarobot.com/platform/prepare-modeling-data/. Accessed: 2023-04-09. →
[9] Google Vertex AI beginners guide.
→https://cloud.google.com/vertexai/docs/beginner/beginners-guide. Accessed: 2023-04-10. a, b
[10] H2O Platform H2O Driverless AI. →https://h2o.ai/platform/aicloud/make/h2o-driverless-ai/.
Accessed: 2023-04-09. →
[11] E. LeDell, and S. Poirier, H2O automl: Scalable automatic machine learning. Proceedings of
the AutoML Workshop at ICML, vol. 2020, 2020. a, b, c
[12] IBM Cloud Watson Studio AutoAI. →https://www.ibm.com/cloud/watson-studio/autoai.
Accessed: 2023-04-10. a, b
[13] MLjar MLjar AutoML. →https://mljar.com/automl/. Accessed: 2023-04-09. →
[14] TransmogrifAI Automated machine learning for structured data. →https://transmogrif.ai/.
Accessed: 2023-04-10. a, b
[15] R. S. Olson, and J. H. Moore. TPOT: A tree-based pipeline optimization tool for automating
machine learning. In: Workshop on automatic machine learning. PMLR. pp. 66–74, 2016. a, b, c
[16] A. Salhi, et al., Data preprocessing using AutoML: A survey. In: Proceedings of The 2023 World
Congress in Computer Science, Computer Engineering,& Applied Computing (CSCE’23), p. 106,
2023. →
[17] C. Fan, et al., A review on data preprocessing techniques toward efficient and reliable
knowledge discovery from building operational data. Frontiers in Energy Research, vol. 9, no.
March, pp. 1–17, 2021. →
[18] F. Neutatz, et al., Data Cleaning and AutoML: Would an optimizer choose to clean?.
Datenbank-Spektrum, vol. 22, no. 2, pp. 121–130, 2022. →
[19] M. A. Hernández, and S. J. Stolfo, Real-world data is dirty: Data cleansing and the
merge/purge problem. Data Mining and Knowledge Discovery, vol. 2, pp. 9–37, 1998. →
[20] S. Ryan, P. Djukic, and T. Morris. Unsupervised outlier detection in time-series data. US Patent
App. 16/540,414. Dec. 2020. →
[21] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, Duplicate record detection: A survey. IEEE
Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2006 →
[22] L. Yifeng, and A. Ngom. Data integration in machine learning. In: 2015 IEEE International
Conference on Bioinformatics and Biomedicine (BIBM), pp. 1665–1671, 2015, doi:
10.1109/BIBM.2015.7359925. →
[23] I. Czarnowski, and P. Jedrzejowicz. Data reduction algorithm for machine learning and data
mining. In: New Frontiers in Applied Artificial Intelligence: 21st International Conference on
Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2008
Wrocław, Poland, June 18–20, 2008 Proceedings 21. pp. 276–285. Springer, 2008. →
[24] O. Stromann, et al., Dimensionality reduction and feature selection for object-based land
cover classification based on Sentinel-1 and Sentinel-2 time series using Google Earth Engine.
Remote Sensing, vol. 12, no. 1, pp. 76, 2019. →
[25] H. Peng, F. Long, and C. Ding, Feature selection based on mutual information criteria of max
dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005 →
[26] K. Maharana, S. Mondal, and B. Nemade, A review: Data pre-processing and data
augmentation techniques. Global Transitions Proceedings, 2022. →
[27] M. Francia, J. Giovanelli, and G. Pisano, HAMLET: A framework for human-centered AutoML
via structured argumentation. Future Generation Computer Systems, vol. 142, pp. 182–194, 2023.
a, b
[28] J. Sola, and J. Sevilla, Importance of input data normalization for the application of neural
networks to complex industrial problems. IEEE Transactions on Nuclear Science, vol. 44, no. 3, pp.
1464–1468, 1997, doi: 10.1109/23.589532.
[29] M. Martin Salvador, M. Budka, and B. Gabrys, Automatic composition and optimization of
multicomponent predictive systems with an extended auto-WEKA. IEEE Transactions on
Automation Science and Engineering, vol. 16, no. 2, pp. 946–959, 2019, doi:
10.1109/TASE.2018.2876430. a, b
[30] U. Garciarena, R. Santana, and A. Mendiburu, Evolving imputation strategies for missing data
in classification problems with TPOT. arXiv Preprint arXiv:1706.01120, 2017. a, b
[31] E. Opara, H. Wimmer, and C. M. Rebman. Auto-ML cyber security data analysis using google,
azure and IBM cloud platforms. In: 2022 International Conference on Electrical, Computer and
Energy Technologies (ICECET), pp. 1–10, 2022, doi: 10.1109/ICECET55527.2022.9872782. →
[32] A. Truong, et al., Towards automated machine learning: Evaluation and comparison of
AutoML approaches and tools. In 2019 IEEE 31st international conference on tools with artificial
intelligence (ICTAI), pp. 1471–1479, IEEE, 2019. a, b, c, d, e, f
[33] TransmogrifAI. →https://readthedocs.org/projects/transmogrifai/downloads/pdf/stable/.
Accessed: 2023-05-30. a, b
[34] M. Bilal, et al., Auto-Prep: Efficient and automated data preprocessing pipeline. IEEE Access,
vol. 10, pp. 107764–107784, 2022. →
[35] S. K. Karmaker Santu, et al., AutoML to date and beyond: Challenges and opportunities. arXiv
Preprint arXiv:2010.10777, 2020. →
[36] I. Popova, G. Revunkov, and Y. Gapanyuk. AutoML: Examining existing software
implementations and determining the overall internal structure of solutions. a, b
[37] G. Kalpana, et al., Predicting the price of pre-owned cars using machine learning and data
science. International Journal for Research in Applied Science and Engineering Technology, vol. 10,
pp. VII, 2022. →
6 Generating random XML
Peter M. Maurer
John Carbone

Abstract
XML is a language for transmitting highly structured text data in a standardized form. Although
the XML language is relatively simple, creating XML documents can be a tedious and error-prone
process. The DGL XML feature addresses this problem by automatically enforcing certain rules,
and permitting XML data to be created in a simplified manner. DGL is a language designed for
generating random data, mainly for testing and simulation purposes. It can also be used to create
large volumes of synthetic data for neural network training. The output can take various formats
and can be stored in files, MySQL databases, or C++ internal variables. The generated XML data
can be stored in an .xml file or can be streamed into another program or stored in a database.
XML files and their DTDs can be generated separately or simultaneously. Several new DGL production types
have been introduced, which help users generate syntactically correct XML data.
Keywords: random XML, random data, testing, simulation, big data,

6.1 Introduction
A preliminary version of this paper appeared in CSCI 2023 [→1]. XML is a universal language for
communicating textual data in machine-readable form. Virtually anything that can be represented
in textual form can be encoded in XML. This is especially true for highly structured data with
hierarchical components. For example, one may wish to record data for a research corporation
with several research teams. Within each team are individuals with distinct roles. At the highest
level, we have the company. Next, we have multiple research teams, each with a leader, several
researchers, and several technicians. Individuals are assigned to one of these categories, making
category a level in the hierarchy. The data for each individual is further structured into name, hire
date, salary, and so forth. This structured hierarchy can be easily expressed in XML. This data
might be coded as follows.
<Company>
<AITeam>
<Leader>
<Person>
<Name>
Barnaby Jones
</Name>
<Salary>
100000
</Salary>
<Address>
1027 Alamo Ct.
Dallas, TX
</Address>

</Person>
</Leader>
<Engineer>
<Person>

</Person>
</Engineer>

</AITeam>
<Electronics>

</Electronics>
</Company>
Today, there exist many datasets that are extraordinarily large. Genetic data is one example, but
there are many others. Despite this, datasets with specific properties are difficult to find. This, in
turn, makes testing software difficult. It also makes training deep-learning software difficult. The
data generation language, DGL, can be used to generate datasets of any size with any desired
properties. The data can be simple or as complex as necessary. Once generated, further work
might be necessary to encode it in XML form. The DGL XML feature can be used to overcome this
difficulty by generating data directly in XML form.

6.2 DGL
DGL has been used to generate random data for testing hardware and software [→2, →3, →4].
The underlying principle is probabilistic context-free grammars [→5]. Recently, DGL’s capabilities
have been expanded to generate data in several different ways. DGL code may now be embedded
in C++, and the data generator can be created as a subroutine. Data can be generated in internal
C++ variables. When generating XML, it is typical to generate one or more separate files. An .xml
file and its defining .dtd file can be generated, simultaneously. It is possible to generate many files
simultaneously.
Originally, all data needed to be included in the DGL specification. The upgrades no longer
permit loading data from external files or MySQL databases. However, output can be written to
multiple files at the same time. C++ variables can be accessed and modified by data generation
subroutines, and databases can be created and edited.
The basic unit of the DGL language is the production. A collection of productions representing
a complete specification is called a grammar. Each reference to the start symbol (“main” by
default) produces one data item. For data-generation subroutines, each subroutine call makes a
single reference to “main.” Embedded DGL is executed directly within the code.
Most productions have several alternatives, as shown below.
main: red,green,blue;
“main” refers to the name of the production, which offers three possible alternatives. When
“main” is invoked through the non-terminal %{main}, one of the three alternatives is selected
randomly with equal probability. The probabilities can be adjusted by applying weights, as shown
in the following example.
main: 2:red,1:green,3:blue;
These weights cause “red” to be selected twice as often as “green,” and so forth. Weights are
normally optional but sometimes necessary to ensure that a recursive grammar generates finite
strings. The number of productions in a grammar is unlimited. Productions are referenced using
non-terminals, which have the form %{name_of_production}, as in the reference %{main} above.
Terminals and non-terminals can be intermixed arbitrarily in an alternative.
The following grammar generates strings of the form ab, aabb, aaabbb, …, a^n b^n.

main: a%{main}b,”“;
The following grammar-segment can generate infinite strings, and must be enhanced with
weights.
Exp: %{var}, %{Exp} “+” %{Exp}, %{Exp} “*” %{Exp};
var: a,b,c;
If the alternatives of Exp are chosen with equal probability, the number of non-terminals has a 2/3
probability of growing and a 1/3 probability of becoming smaller. To prevent problems, the
following weights should be used.
Exp: 3:%{var}, 1:%{Exp} “+” %{Exp}, 1:%{Exp} “*” %{Exp};
Larger weights can be used. The relative sizes of the weights can be used to control the average
length of the generated strings.
Various types of productions are specified by a keyword that comes after the name. The DGL
XML feature uses many such productions. Productions without keywords are called normal or
ordinary productions. The following sequence production causes alternatives to be generated
sequentially rather than randomly.
X: sequence red,green,blue;
The unique production causes alternatives to be selected at random without replacement.
Y: unique red,green,blue;
Additional key words can be used to specify the behavior when no choices remain.
The most powerful production type is the “variable”:
V: variable;
The production “V” functions as a single-alternative production with a variable right-hand side.
The special non-terminal %{X.V} selects an option from “X” and assigns it as the right-hand side of
“V.” These non-terminals do not produce any output. If “X” is defined as described and this is the
first reference to “X,” the value “red” will be assigned to “V.” Later, the non-terminal %{V} will
output the string “red.” The string assigned to “V” can also include non-terminals, giving DGL
theoretical universal power [→6].
To produce data, the DGL compiler typically generates C++ code, which is then compiled to
create either a stand-alone data-generation program or a subroutine. Embedded DGL is
preprocessed by the DGL compiler, with the data-generation code being compiled in line with the
C++ code.

6.3 XML file output


XML productions produce syntactically correct output but semantic considerations, such as having
a single root element, are the responsibility of the user [→7, →8]. While it’s possible to generate
properly formatted XML using regular productions, this approach is time-consuming and prone to
mistakes. The XML productions are provided to streamline the process and enhance its reliability.
The most frequently used XML productions are XMLHDR, XMLTYPE, and XMLTAG. An XMLHDR
production and an XMLTYPE production are invoked only once to create the XML header lines. The
XMLTAG production and other less frequently used productions are used to generate the
remainder of the file.

6.3.1 A. The XMLTAG production


The XMLTAG production will generate most of the XML data. The XMLTAG production generates
both beginning and ending tags, thus preventing missing-tag errors. Consider the
following example. This production will generate a < root > and a </root > tag with content in
between. The %{othercontent} non-terminal is used to generate content between these two tags.
content: xmltag root,%{othercontent};
othercontent: “Just one line.”;
The data generated by the “content” production is:
<root>Just one line.</root>
Normally, the %{othercontent} non-terminal will act as the start symbol of a complex subgrammar
that generates the content. To generate the illustrated data, it is necessary to invoke the
“content” production with the non-terminal %{content}, either directly or indirectly, from the
“main” production, as in the following example.
main: %{content};
If attributes are necessary, a third alternative can be added to generate them. The xmlattr
production can be used to simplify their generation, or plain text can be used. The following
example shows how this can be done.
content: xmltag root,%{othercontent},%{attr};
attr: type,25;
The resultant data will be:
<root type=“25”>Just one line.</root>
Now for a more complex example: a somewhat stripped-down version of a DGL
grammar used for generating bridge hands (→Figure 6.1).
Figure 6.1: Example of the format of the generated XML file.

The root element of the file is < hands > . Within the < hands > </hands > tags will be one or more <
hand > </hand > tags containing the lists of cards in each of the four hands. These four hands are
contained in the four tag pairs < north > </north >, < east > </east >, < west > </west > and < south >
</south > . Each of the hands will contain thirteen < card > </card > tags containing the names of
the cards (→Figure 6.2).
Figure 6.2: Representation of the thirteen < card > </card > tags that contain the card names.

The first five productions are for housekeeping and will be explained in later sections.
The “body” production generates the < hands > </hands > tags of the root element, and
references the “hand” production to generate a single instance of < hand > </hand > tags. In
practice, more elaborate mechanisms will be used to generate multiple instances of < hand >
</hand > tags. The “hand” production generates the < hand > </hand > tags and the “news”
production generates the four thirteen-card hands. In the north, east, west, and south
productions, the non-terminal %13{card} invokes the non-terminal %{card} thirteen times.
The “card” production generates the < card > </card > tags, and makes a selection from the
“Deck” production. The “Deck” production contains the names of all 52 cards and selects without
replacement. The alternatives of “Deck” are generated using the macros !Number and !Suit.
(When two or more macros are used in the same alternative, every unique combination of macro
values is generated. In this case, 52 alternatives are generated.)

6.3.2 B. The XMLATTR production


Tag attributes are normally created with the xmlattr production, as in the following example.
K: xmlattr Color,red;
This production produces the string:
Color=“red”
The value will be wrapped in double quotes, and XML characters will be correctly escaped, as
demonstrated in the following production.
X: xmlattr str,”&&&><“;
In turn, this production generates:
str=“&amp;&amp;&amp;&gt;&lt;”

6.3.3 C. Generating headers


The parts of the XML file, header line, type declaration, and content are produced by a grammar
such as the following. Recall that “main” is the start symbol.
main: %{header}%{typedecl}%{content};
header: xmlhdr 1.0, UTF-8;
typedecl: xmltype root,”root.dtd”;
The DGL user has full control over the XML header line. A default header can be generated using
the following production.
H: xmlhdr “”;
This production generates the following default header line.
<?xml version=“1.0” encoding=“UTF-8” ?>
Additional alternatives offer greater control. The following productions demonstrate the range of
options available.
H: xmlhdr 1.1, “UTF-16”,yes;
H: xmlhdr 1.0, “UTF-16”;
H: xmlhdr 2.2;
These productions produce the following XML.
<?xml version=“1.1” encoding=“UTF-16” standalone=“yes”>
<?xml version=“1.0” encoding=“UTF-16” >
<?xml version=“2.2” >
More than three alternatives may be specified. Alternatives beyond the third are inserted directly
into the XML following the “standalone” attribute. Although this data is syntactically incorrect, it
has been included for potential future expansion. Currently, the header contains only three
attributes.
The XMLTYPE production has either 1, 2, or 3 alternatives, the root element name, the dtd file
name, and the start symbol of a sub-grammar for generating internal dtd data. If the second
alternative is not specified, or specified as “”, there is no dtd file.
T: xmltype root, abc.dtd;
This production produces the following DOCTYPE line.
<!DOCTYPE root SYSTEM “abc.dtd” >
The xmltype production can be used to define an external DTD file, internal DTD statements, or
both, as shown in the following examples.
To begin, assume the existence of the following production, which will be explained in a later
section.
internal: xmlent author, “Peter M. Maurer”;
T: xmltype root,abc.dtd,%{internal};
T: xmltype root, ““,%{internal};
T: xmltype root;
These productions produce the following XML output.
<!DOCTYPE root SYSTEM “abc.dtd” [<!ENTITY author “Peter M. Maurer”>] >
<!DOCTYPE root [ <!ENTITY author “Peter M. Maurer”>] >
<!DOCTYPE root >
The file “abc.dtd” can either be an existing file or be generated simultaneously by DGL. It’s
important to note that the DOCTYPE statement is optional in an XML file.

6.4 Additional XML file productions


Virtually all XML data can be generated using the production types xmltag, xmlhdr, xmltype, and
xmlattr. There are additional productions that can be used to simplify the generation of certain
kinds of data. This data is generated using the production types xcdata, xstring, xmlcomm,
xmlent, and xmlpi.

6.4.1 A. The XCDATA production


The xcdata production is used to create CDATA sections, which can appear anywhere text is
allowed. It functions similarly to a standard production, generating CDATA sections. If there are
multiple alternatives, one is randomly chosen with equal probability. For more specific selections,
the xcdata production can have a single alternative that references a production with the desired
characteristics. The examples below illustrate how this production works.
A reference to “X” (%{X}) (→Figure 6.3(a)) will produce one of the character strings in →Figure
6.3(b), chosen at random.
Then, three consecutive references to “Y” (%3{Y}) will produce the strings in →Figure 6.3(c) in
the given order.

Figure 6.3: Examples of the production of CDATA sections.

6.4.2 B. The XSTRING production


The xstring production offers another method for inserting special data into an XML file. It
operates similarly to a standard production but generates double-quoted strings with XML special
characters properly escaped. The following example demonstrates this.
X: xstring abc,”<>\”\’&5”;
This production generates one of the following two strings, randomly selected with equal
probability.
“abc”
“&lt;&gt;&quot;&apos;&amp;5”
Xstring productions can be used to generate attribute values as well as arbitrary data.

6.4.3 C. The XMLCOMM production


Complex XML structures often require comments. They are especially useful when creating an
XML document for the first time. The structure of comments is complex, and may itself be the
source of error. The xmlcomm production is used to add comments to an XML document with the
correct syntax.
If an xmlcomm production contains multiple alternatives, one will be randomly selected and
enclosed within the appropriate XML comment syntax. The following example illustrates this.
C: xmlcomm apes,oranges,yellow;
This production will generate one of the following comments, randomly selected with equal
probability.
<!– apes –>
<!– oranges –>
<!– yellow –>
6.4.4 D. The XMLENT production
Some XML files contain data that is repetitive or data whose purpose is difficult to understand.
One way to make the construction of XML files simpler is to use entities. An entity is a string of
text that is given a short, identifying mnemonic name.
The following is part of a previous example:
internal: xmlent author, “Peter M. Maurer”;
This statement generates the definition of an entity named “author” that contains the data “Peter
M. Maurer”. In text, the string “&author;” will be replaced by “Peter M. Maurer”.
There are four types of entities:
a. substitution entities
b. external entities
c. parameter entities, and,
d. external parameter entities.
The xmlent, xmlentx, xmlentp, and xmlentpx productions are used to generate these entities, and they all
share the same structure. The general format is as follows:
K: xmlent saying,”When in the course of human events”;
This production generates the following entity declaration.
<!ENTITY saying “When in the course of human events” >
Once this entity is declared, the string &saying; will be replaced with “When in the course of
human events” wherever it appears in the text. If needed, the entity value will include XML
escapes. The reference &saying; cannot be used within a DTD file or internal DTD data.
The xmlentx production is used to include external files.
K: xmlentx std,std.xml;
The above production generates the following entity declaration.
<!ENTITY std SYSTEM “std.xml” >
The entity declaration will result in &std; being replaced by the contents of the file std.xml. The
reference &std; cannot be used in DTD files or internal DTD data.
Parameter entities, generated with xmlentp and xmlentpx, create entities that can be used
within DTD data and DTD files. The xmlentp and xmlentpx productions are nearly identical to
xmlent and xmlentx.
K: xmlentp saying,”When in the course of human events”;
J: xmlentpx std,std.dtd;
These two productions generate the following entity declarations.
<!ENTITY % saying “When in the course of human events” >
<!ENTITY % std SYSTEM “std.dtd” >
Parameter entities can only be referenced in a DTD file or internal DTD data. The references are
written as %saying; and %std;, respectively.

6.4.5 E. The XMLPI production


The xmlpi production generates XML processing instructions. The first alternative specifies the
target, while the second provides the parameters that will be passed to the target. The parameter
string is flexible and depends on the target application. Other XML productions can be used to
generate this data. Here is a simple example.
A: xmlpi “xml-stylesheet”, “type=\”text/xsl\” href=\”style.xsl\”“;
This production generates the following XML.
<?xml-stylesheet type=“text/xsl” href=“style.xsl” ?>

6.5 DTD productions


Besides generating .xml files, DGL can also produce .dtd data. This data can either be placed in a
DTD file or embedded within a DOCTYPE statement. The xmlcomm production inserts comments
into dtd data, and the xmlentp and xmlentpx productions can define parameter entities.

6.5.1 A. The XMLELEM production


Element declarations are created using the xmlelem production. The first alternative provides the
tag name being defined, while the second specifies the type definition. Both alternatives can
include non-terminals, either individually or together. Here are some examples.

Figure 6.4: Example use of the xmlelem production.

The six productions (→Figure 6.4) will produce the element definitions in →Figure 6.5, with names
generated at random.
Figure 6.5: Examples of generated element definitions.

The first statement states that the < color > </color > tags must contain three tag pairs < red >
</red >, < green > </green > and < blue > </blue >. The second statement indicates that the < choice
> </choice > tags must contain one of the three, but not all. The third statement says that the tags
< dataitem > </dataitem > must contain text data. The fourth statement states that the content
between < whatever > and </whatever > is arbitrary. The fifth statement states that the < NullItem
> </NullItem > pair must not have any content. The final statement states that the < dataorsub >
</dataorsub > pairs may either contain text or a subitem enclosed in < blue > </blue > tags.
→Figure 6.6 is the DTD file (bridge.dtd) for the bridge-hands files described previously.

Figure 6.6: Example of the bridge-hands DTD file.

This file is generated by the DGL productions illustrated in →Figure 6.7


Figure 6.7: Example of the DGL productions.

The specifications card + and hand + state that one element of the type is required, but there may
be additional elements of the same type.

6.5.2 B. The XMLATTLIST and XMLALITEM productions


Tag attribute definitions are created using the xmlattlist and xmlalitem productions. The xmlattlist
production generates the enclosing ATTLIST statement. Here is an example.
s: xmlattlist gate,%{body};
body: xmlalitem delay,CDATA,”#IMPLIED”;
The first alternative is the tag-name to which the attributes apply. The second is the start symbol
of a subgrammar that supplies the attribute definitions. The attribute definitions are usually
supplied by xmlalitem productions. The above productions produce the following XML data.
<!ATTLIST gate delay CDATA #IMPLIED >
The xmlalitem production includes three or four alternatives. The first defines the attribute name,
the second specifies the attribute type, and the third indicates the attribute state. If a default
value is needed, it is provided as the fourth alternative.

6.6 Example of attributes


Part of our work involves the simulation of logic-level circuits. These circuits consist of inputs,
outputs, and a collection of gates. The XML structure of a circuit is as follows:
<circuit>
<name>mycircuit</name>
<inputs>
<net>a</net><net>b</net><net>c</net>
</inputs>
<outputs> <net>q</net> </outputs>
<gates>
<gate>
<type>AND</type>
<inputs> <net>a</net><net>b</net></inputs>
<outputs> <net>x</net> </outputs>
</gate>
<gate>
<type>AND</type>
<inputs> <net>x</net><net>c</net></inputs>
<outputs> <net>q</net> </outputs>
</gate>
</gates>
</circuit>
This structure can be built using the following DGL.
Ckt: xmltag circuit,%{details};
details: %{name}%{inputs}%{outputs}%{gates};
name: xmltag name,mycircuit;
inputs: xmltag inputs,%{inlist};
outputs: xmltag outputs,%{Q};
inlist: %{A}%{B}%{C};
A: xmltag net,a;
B: xmltag net,b;
C: xmltag net,c;
Q: xmltag net,q;
X: xmltag net,x;
gates: xmltag gates,%{gate1}%{gate2};
gate1: xmltag gate,%{type}%{inputs1}%{outputs1};
gate2: xmltag gate,%{type}%{inputs2}%{outputs2};
type: xmltag type,AND;
inputs1: xmltag inputs,%{A}%{B};
outputs1: xmltag outputs,%{X};
inputs2: xmltag inputs,%{X}%{C};
outputs2: xmltag outputs,%{Q};
For certain types of simulation, it is necessary to supply a gate delay. This can be done with a tag
internal to the < gate > </gate > tags, or with an attribute. To use an attribute, replace the following
two productions:
gate1: xmltag gate,%{type}%{inputs1}%{outputs1};
gate2: xmltag gate,%{type}%{inputs2}%{outputs2};
with:
gate1: xmltag gate,%{type}%{inputs1}%{outputs1},%{d1};
gate2: xmltag gate,%{type}%{inputs2}%{outputs2},%{d2};
d1: xmlattr delay,2;
d2: xmlattr delay,3;
These changes will cause the two < gate > tags above to appear as:
<gate delay=“2”>
and:
<gate delay=“3”>
Multiple instances of the xmlattr production can be used to insert more than one attribute.

6.7 Dependent data generation


DGL allows for generating data that has many different kinds of internal dependencies. Suppose,
for example, that we have a segment of an XML that generates boxes of different colors. The
boxes are of type A or type B. Type A boxes are colored red, green, or blue. Type B boxes are
colored brown, white, or black. The XML has the following format.
<box>
<type>
A
</type>
<color>
red
</color>
</box>
The DGL for this segment would look as follows.
Box: xmltag box,%{stuff};
stuff: %{type}%{color};
colortype: variable;
type: A%{cA.colortype},B%{cB.colortype};
color: %{colortype};
ctA: red, green, blue;
ctB: brown, white, black;
cB: %%{ctB};
cA: %%{ctA};
In this example, “colortype” is a variable that will contain either the non-terminal %{ctA} or the
non-terminal %{ctB} when executed. The non-terminals represent the sets of colors for the type A
and type B boxes. Generating a box-type also stores the correct production reference in the
variable “colortype.” When “colortype” is referenced, a color from the correct production is
generated.

6.8 Enhanced bridge example


The following grammar (→Figure 6.8) is the full text of the grammar for generating bridge hands.
Figure 6.8: The complete grammar for generating bridge hands.

The second and third statements cancel the repetition count, which defaults to 100. Normally, the
repetition count determines the number of internal references made to the “main” production
when creating a stand-alone data generator. When invoking the data generator, the first
parameter on the command line overrides the default of 100. The “repeat” statement in the
example sets the default repetition count to 1, and the “nocount” statement prevents overrides
from the command line. The first parameter on the command line no longer refers to the number
of references to “main”. Instead, the parameter will be used to determine the number of
occurrences of < hand > </hand > that will appear between the < hands > </hands > tags. The three
productions starting with the body production implement this feature. The handlist production is
an iterator, which causes the single alternative to be invoked a specified number of times. To
accomplish this, it is necessary to use a non-terminal of the form %{20.handlist}, which will cause
“handlist” to be referenced 20 times. The “argument” production retrieves the value of a
command-line argument. In the example, argument 1 is retrieved, and if there is no argument 1,
100 is used instead. (Argument 0 is the name of the file being executed.) The non-terminal %
{a.handlist} in the body production causes the specified number of hands to be generated. The
remainder of the specifications are the same as those presented in the previous example.

6.9 Conclusion
Generating XML is greatly simplified by the DGL XML features. The wide range of productions
provided by DGL makes the generation as complex or as simple as necessary. Furthermore, the
XML features are easy to use. It took only a few minutes to convert the bridge-hands example to
XML. We believe the XML feature will significantly increase DGL’s usefulness for testing modern
software and running simulations that require XML-formatted input.

References
[1] P. Maurer, and J. Carbone, Generating Random XML Files Using DGL, CSCI, 2022. →
[2] P. Maurer, Generating test data with enhanced context free grammars. IEEE Software, 7, 4, pp.
50–56, 1990. →
[3] P. Maurer, The design and implementation of a grammar-based data generator. Software
Practice and Experience, 22, 3, pp. 223–244, 1992. →
[4] P. Maurer, DGL reference manual version 2.0.
→http://cs.baylor.edu/~maurer/dglunpublished. →
[5] T. L. Booth, and R. A. Thompson, Applying probability measures to abstract languages. In: IEEE
Trans. Computers, pp. 442–450, May 1973. →
[6] P. Maurer, Generating random bridge hands: A study in data generation. In: Proceedings of
the 2020 International Conference on Modeling, Simulation & Visualization Methods, 2020. →
[7] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, (eds.), Extensible Markup
Language (XML) 1.0, 5th Edition, W3C Recommendation, 26 November 2008. →
[8] E. T. Ray, Learning XML, 2nd Edition, Sebastopol, CA: O’Reilly Media, October 14,
2003. →
Applications and case studies
7 Exploring autism risk: a deep dive into graph
neural networks and gene interaction data

Danushka Bandara
Kyle Riccardi

Abstract
Autism spectrum disorder (ASD) has many genetic connections that can be
represented in genetic association networks. These networks can be converted
into a graph data structure and examined with graph neural network models to
assess the association these genes have with ASD. Our experiments show that
GraphSAGE achieves 84.03% accuracy with an area under the ROC curve of 0.85,
demonstrating the ability of graph neural network models to capture ASD
association in gene networks. The model was further evaluated with a sensitivity
of 0.96 and a specificity of 0.94, indicating low false positive and false negative
rates and thus a reduced chance of misclassification.
Keywords: Graph Convolutional Network (GCN), GraphSAGE, Autism Spectrum
Disorder (ASD)

7.1 Introduction
Autism spectrum disorder (ASD) is a deficit of social communication or sensory-
motor function based on genetic association and other causes [→21]. The
genetic association is supported by the heritability observed in the Tick et al.
[→28] meta-analysis of twin studies, which estimated the heritability of ASD to
range between 64% and 91%. Tick et al. also report a 0.98 correlation between
genetics and neurodevelopmental disorders. De novo mutations further illustrate
the role of inheritance in ASD: these mutations arise during stem cell division and
during maturation of the female gametes within a trio (father, mother, and child)
[→26]. They are identified by comparing the genomes within the trio, and they
contribute to the high inheritance seen in the Tick et al. analysis [→11, →15, →29].
In the Handbook of Clinical Neurology [→25], ASD is associated with an estimated
1,000 genes, identified based on genetic linkage between chromosome locations
(loci) and possible genetic risk.
Alarcón et al. [→2] did a comprehensive study into chromosome regions 7q and
17q, finding that region 7q2-35 has a strong possibility of association with ASD
while noting that other areas, like chromosome 3q, might also have an
association with ASD. Copy number variants (CNVs) are additions, deletions, or
duplications of portions of deoxyribonucleic acid (DNA) [→27]. CNVs tie genetic
risk to specific regions of the chromosome band, further exemplifying the link
between loci and ASD. Another genetic association comes from common variants,
which significantly affect ASD risk. Grove et al. [→4] used a genome-wide
association study to establish a strong, robust connection between common
variants and ASD. Common variants are significant factors in most diseases, as
they relate to 90% of individual differences [→7]. Individually, each common
variant has minimal significance, but taken together the variants have a
noticeable impact on risk [→3]. Common variants make up 40–60% of the risk
factor when evaluating ASD [→17]. Collectively, all of these mutations and
variants are connected within a network. Kolchanov et al. describe gene
networks as a group of genes functioning in a coordinated manner [→10]. As
seen, common variants, de novo variants, and CNV are all the by-products of
genes functioning in coordination with one another, and that function creates
variants with high ASD risk. These variants, when combined, create a gene
network that links all of these variants together, showing us the
association/nonassociation of a gene [→24]. Our proposed experiment uses
these gene networks together with graph neural networks to determine whether
a gene is associated with ASD. The specific graph neural network models we use
are GraphSAGE and the graph convolutional network (GCN) [→13, →16]. Within
this experiment, we classify each gene as associated or not associated with
autism risk.
In this chapter, we describe a methodology and models for identifying genes that
confer autism risk. Section 7.2 contains a literature survey of the state of
the art in autism risk classification. The preprocessing steps and model structure
are presented in Section 7.3. Section 7.4 outlines the experiments that were done to
test our models. Section 7.6 discusses the results in the context of autism risk
classification. We conclude with Section 7.7, where we reiterate this study’s
significant findings and implications.

7.2 Related works


Several techniques comprise the state of the art in this area. Genome-wide
association study (GWAS): It is a method used to find genetic variants associated
with a particular disease or trait. It does this by comparing the DNA of people
with the disease or trait to the DNA of people who do not have the disease or
trait. Bralten et al. [→6] used GWAS to find the connection between genetic
sharing between ASDs and the autistic traits of “childhood behavior,” “rigidity,”
and “attention to detail.” Grove et al. [→4] used this technique to find five
genome-wide significant loci associated with autism risk. Network-based
machine learning (NBML): NBML is a machine learning approach that uses the
topology of a gene network to predict autism risk. NBML models learn the
relationships between genes in a network and use this information to predict
which genes are important for autism risk. The logic behind this approach is that
genes that confer risk for ASD are likely to be functionally related and that they
thus converge on molecular networks and biological pathways implicated in
disease [→9]. Taking this idea further, Krumm et al. [→18] showed that ASD
genes with de novo mutations converged on pathways related to chromatin
remodeling and synaptic function. Some later studies showed that integrating
known risk genes using a protein–protein interaction (PPI) network can identify
novel genes involved in ASD [→14, →20]. In machine learning, many techniques
have been used to predict diseases using gene networks, risk assessment of
ASD, and overall disease risk discovery. The first is the interaction discovery
in PPI networks by Fenq et al. [→8], who showed that omic data can be used to
uncover new links in these interaction networks. Wang et al. [→31]
used attention-based graph neural networks to identify ASD based on the
activity within brain scans. Beyreli et al. [→5] created DeepND, a multitask GCN
that used an autism network and an intellectual disability network to determine
the risk for both. Uddin and Lu [→22] used a graph neural network and a patient
network to classify whether a patient suffers from a chronic illness. Wang and
Avillach [→30] used DeepAutism, a convolutional network designed to diagnose
ASD based on the presence of high-risk gene variants. Motsinger and coworkers
[→23] used gene-gene interaction networks and neural networks to classify the
risk people carry for Parkinson’s disease. Laksshman et al. [→19] created
DeepBipolar, which identifies gene mutations to determine whether somebody
has bipolar disorder.

7.3 Methodology

7.3.1 Dataset
The datasets used for these experiments were the Sfari dataset [→1] and protein
interaction network (PIN) data [→24]. The Sfari dataset contains all of our gene
associations and rankings. These rankings express the confidence of a gene being
associated with ASD on a scale from 1 to 3 (1 representing a significant
association and 3 representing a minor association).
The Sfari dataset also contains the chromosome band location of that specific
gene. The PIN dataset contains all PPIs both associated and not associated with
ASD. The association is determined based on whether it is scored on Sfari data.
Anything from 1 to 3 would be considered associated with ASD, and any other
genes would be considered as having no association.

7.3.2 Preprocessing
The first preprocessing step is to filter out anything in the PIN dataset that is not
specified as being a human gene interaction. Next, we add our ranking and
chromosome band location from the Sfari dataset to the same gene in the PIN
dataset to have our edges and labels in the same place. The chromosome band
locations are used as features for association risk classification for the following
experiment. Equation (7.1) shows the definition of the graph:
G = (V , E, C) (7.1)

G: the graph representing genes and their interactions


V: the set of nodes, where each node represents a gene and is associated
with its location (feature)
E: the set of edges, representing interactions or connections between
genes in the graph
C: the set of labels representing autism risk for each gene, where each
gene node is associated with a label indicating its classification (high
confidence, strong candidate, suggestive evidence).
The risk ranking is a confidence score of whether a gene is related to ASD (1 being high confidence, 2 a strong candidate, and 3 suggestive evidence). Genes with an association score are labeled 1, meaning they are associated with ASD, while any gene without a score is given a 0 to represent no association. The chromosome band is passed through a one-hot encoder that splits this feature into 92 binary features representing every possible location in which a gene has been observed. This yields the full feature matrix for our graph neural network models, and the edge list is then used to create an adjacency matrix. →Figure 7.1 shows this process in detail.
Figure 7.1: Preprocessing steps used to generate the graph node classification
setup.
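The sketch below illustrates this preprocessing in Python. It is only a minimal sketch under stated assumptions: the column names ("gene", "chromosome_band", "sfari_score", "gene_a", "gene_b") are hypothetical placeholders rather than the actual SFARI/PIN field names, and the human-interaction filtering is assumed to have been applied already.

```python
# Illustrative preprocessing sketch: one-hot encode chromosome band locations,
# build binary ASD-association labels, and construct an adjacency matrix.
import numpy as np
import pandas as pd

def build_graph_inputs(sfari: pd.DataFrame, pin: pd.DataFrame):
    # Collect all genes that appear in the (already human-filtered) PIN edge list.
    genes = sorted(set(pin["gene_a"]) | set(pin["gene_b"]))
    index = {g: i for i, g in enumerate(genes)}

    # Label: 1 if the gene has any Sfari score (1-3), else 0.
    scored = set(sfari.loc[sfari["sfari_score"].isin([1, 2, 3]), "gene"])
    labels = np.array([1 if g in scored else 0 for g in genes])

    # One-hot encode the chromosome band feature (92 binary columns in the chapter).
    band = sfari.set_index("gene")["chromosome_band"].reindex(genes).fillna("unknown")
    features = pd.get_dummies(band).to_numpy(dtype=float)

    # Symmetric adjacency matrix from the PIN edge list.
    adj = np.zeros((len(genes), len(genes)))
    for a, b in zip(pin["gene_a"], pin["gene_b"]):
        adj[index[a], index[b]] = adj[index[b], index[a]] = 1.0
    return features, adj, labels
```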

7.3.3 Models
The models used for this experiment are GCN [→16] and GraphSAGE [→13]. Each model contains two layers with a rectified linear unit (ReLU) between them, followed by a multilayer perceptron (MLP). Both models use the adjacency matrix, the feature matrix, and the selected labels to perform binary classification.

7.3.3.1 Graph convolutional network

GCN uses the convolutional properties of a graph network to connect both the features and the network. This is done in a layer of the GCN called the propagation layer, defined in eq. (7.2), where $l$ is the layer of the GCN operation, $h_v^{(l)}$ is the feature vector of node $v$ in the $l$-th layer, $N(v)$ is the set of neighboring nodes of node $v$, $W^{(l)}$ is the weight matrix for the $l$-th layer, $c_v$ is a normalization factor for node $v$, and $\sigma$ is an activation function (e.g., ReLU):

$h_v^{(l+1)} = \sigma\left(\sum_{u \in N(v)} \frac{1}{c_v} W^{(l)} h_u^{(l)}\right).$   (7.2)

GCN is more of a top-down approach, which looks at the entire picture of the
network and its feature matrix for performing calculations. This model ignores
the effect a singular node has on the network but instead looks at all the
interactions together through matrix multiplication. Once this is done, we will
get our embedding matrix, allowing us to classify what class a node belongs to.
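A compact model definition along these lines is shown below. This is a minimal sketch assuming PyTorch Geometric (the chapter does not name a library); it mirrors the described architecture of two GCN layers with ReLU and an MLP head, not the authors' exact code.

```python
# Illustrative two-layer GCN with an MLP head, in the spirit of eq. (7.2).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNClassifier(torch.nn.Module):
    def __init__(self, num_features: int, hidden: int = 128):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)   # first propagation layer
        self.conv2 = GCNConv(hidden, hidden)         # second propagation layer
        self.mlp = torch.nn.Linear(hidden, 1)        # MLP head for binary output

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return torch.sigmoid(self.mlp(h)).squeeze(-1)  # per-gene association score
```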

7.3.3.2 GraphSAGE

GraphSAGE takes almost the opposite approach: it aggregates the features of all neighboring nodes in the network. GraphSAGE is an inductive framework that leverages neighborhood information to derive node features. The GraphSAGE operation is defined in eq. (7.3), where $l$ is the layer of the GraphSAGE operation, $h_v^{(l)}$ is the feature vector of node $v$ in the $l$-th layer, $N(v)$ is the set of neighboring nodes of node $v$, $W^{(l)}$ is the weight matrix for the $l$-th layer, $\mathrm{AGGREGATE}$ is an aggregation function (e.g., mean, sum, or LSTM-based aggregation), and $\sigma$ is an activation function (e.g., ReLU):

$h_v^{(l+1)} = \sigma\left(W^{(l)} \cdot \mathrm{AGGREGATE}\left(h_u^{(l)}, \forall u \in N(v)\right)\right).$   (7.3)

For this experiment, mean aggregation was used to reduce the complexity of the results. The aggregated features are concatenated to create new features for every node in the feature matrix. This model infers the connections of neighboring nodes, which offers another essential way of interpreting graph networks. Instead of the broad, whole-network view taken by GCN, GraphSAGE takes a neighborhood-based approach that infers the feature relationships that the current node and its neighboring nodes have with each other.
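The GraphSAGE counterpart of the GCN sketch above is shown below, again assuming PyTorch Geometric; the mean aggregator matches the choice described in the text, while the overall layout (two layers, ReLU, MLP head) is kept identical for comparison.

```python
# Illustrative two-layer GraphSAGE model with mean aggregation, following eq. (7.3).
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGEClassifier(torch.nn.Module):
    def __init__(self, num_features: int, hidden: int = 128):
        super().__init__()
        self.conv1 = SAGEConv(num_features, hidden, aggr="mean")
        self.conv2 = SAGEConv(hidden, hidden, aggr="mean")
        self.mlp = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return torch.sigmoid(self.mlp(h)).squeeze(-1)
```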

7.3.4 Evaluation techniques


The experiment is evaluated using specificity, sensitivity, area under the ROC curve (AUC), and accuracy. These metrics measure overall accuracy and quantify false positives and false negatives. By reducing false positives and negatives, we can arrive at a more refined list of genes, reducing the number of false diagnoses while picking up diagnoses that were not detected before. Here, we further elaborate on these metrics.

7.3.4.1 Specificity

Specificity is the ratio of true negative predictions to the total number of actual
negative instances in a binary classification problem. It quantifies the model’s
ability to correctly identify the instances that do not belong to the positive class.
A high specificity indicates that the model has a low rate of false positive errors and is effective at distinguishing negative instances from positive ones.
Mathematically, specificity can be defined as
Specificity = true negatives/(true negatives + false positives).

7.3.4.2 Sensitivity

Sensitivity is the ratio of true positive predictions to the total number of actual
positive instances in a binary classification problem. It quantifies the model’s
ability to capture and correctly identify instances that belong to the positive
class. A high sensitivity indicates that the model effectively identifies positive
instances and has a low rate of false negative errors.
Mathematically, sensitivity can be defined as
Sensitivity = true positives/(true positives + false negatives).

7.3.4.3 Area under the ROC curve

AUC is a scalar value that represents the area under a receiver operating
characteristic (ROC) curve, a graphical plot that illustrates the performance of a
binary classification model across different probability thresholds. The ROC
curve is created by plotting the true positive rate (sensitivity) against the false
positive rate (1 – specificity) at various threshold values.
AUC ranges from 0 to 1, where
AUC = 0 indicates that the model’s predictions are entirely incorrect.
AUC = 1 indicates that the model’s predictions are perfect.
A higher AUC value indicates that the model has better discrimination ability,
meaning it is more capable of distinguishing between the positive and negative
classes. In other words, it measures the model’s ability to rank positive instances
higher than negative instances based on their predicted probabilities.
A random binary model with no discriminatory power would have an AUC of
0.5, which is equivalent to the performance expected by chance.

7.3.4.4 Accuracy

Accuracy is the ratio of correctly predicted instances to the total number of


instances in a classification problem. It measures the overall correctness of a
model’s predictions. A higher accuracy indicates that the model is making more
correct predictions, while a lower accuracy suggests that the model is making
more incorrect predictions.
Mathematically, accuracy can be defined as
Accuracy = number of correct predictions/total number of predictions.
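All four metrics can be computed from the model's thresholded predictions and raw scores; a brief sketch using scikit-learn (an assumed tooling choice, not named in the chapter) is given below.

```python
# Computing specificity, sensitivity, AUC, and accuracy for binary predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score

def evaluate(y_true, y_score, threshold: float = 0.5):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "auc": roc_auc_score(y_true, y_score),  # uses raw scores, not thresholded labels
        "accuracy": accuracy_score(y_true, y_pred),
    }
```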

7.3.5 Ablation studies


To test our hypothesis that the graph structure provides relevant information for autism risk classification, we used a baseline model: an MLP that uses only the chromosome location feature and does not use the graph’s adjacency matrix. This allows us to see how a model using just the node features performs.
To test the hypothesis that the chromosome band features contribute to the
autism risk classification, we tested the classification performance of the graph
neural network with and without the chromosome band features. If we get
better performance using the features, we can conclude that the features are
helpful for the classification.
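The two ablations can be sketched as follows. The MLP baseline mirrors the three-layer setup described in the experiment; the "featureless" variant replaces the chromosome-band features with a constant input so that only graph structure remains. Using a constant feature vector is an assumption about how the featureless runs were realized, not a detail given in the chapter.

```python
# Sketch of the ablation setups: an MLP that ignores the graph, and constant
# node features for the "featureless" graph-model runs.
import torch
import torch.nn.functional as F

class MLPBaseline(torch.nn.Module):
    """Uses only node features; the adjacency matrix is never consulted."""
    def __init__(self, num_features: int, hidden: int = 128):
        super().__init__()
        self.fc1 = torch.nn.Linear(num_features, hidden)
        self.fc2 = torch.nn.Linear(hidden, hidden)
        self.out = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        return torch.sigmoid(self.out(h)).squeeze(-1)

def featureless_inputs(num_nodes: int) -> torch.Tensor:
    # One constant feature per node, so a graph model can rely only on structure.
    return torch.ones((num_nodes, 1))
```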

7.4 Experiment
The experiment was conducted using three configurations: 1) two GCN layers attached to an MLP layer; 2) two GraphSAGE layers attached to an MLP layer; and
3) three MLP layers. The models were fed the adjacency matrix, feature matrix,
and labels (not associated with ASD/associated with ASD). The adjacency matrix
considered all gene interactions within the PIN dataset that are associated with
humans and gathered their location and labels through the Sfari dataset.
→Figure 7.2 shows the structure of the models.

Figure 7.2: Neural network structure used in the models.


The model results are evaluated using the above-described metrics on the test
data. The test dataset was 25% of the original dataset and was split after
upsampling to balance the labels. The experiment was conducted using the
parameters listed in →Table 7.1. In our previously published work [→12], we did
an in-depth analysis to find the optimal network layer parameter numbers for
each model. The parameters chosen are shown in →Table 7.1.

Table 7.1: Parameters used for training the models.


Parameter name Value
Learning rate 0.05
Weight decay 5e-4
Batch size 64
Epoch 5,000
Output function Sigmoid
Activation function ReLU
Optimizer Adam
Optimal units for each layer 128
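A training loop using the Table 7.1 settings might look like the sketch below. The full-batch node setup and the binary cross-entropy loss are assumptions for illustration (the chapter specifies a sigmoid output and the optimizer hyperparameters but not the exact loss or batching over nodes).

```python
# Training sketch with the Table 7.1 hyperparameters (Adam, lr 0.05, weight decay 5e-4).
import torch

def train(model, x, edge_index, labels, train_mask, epochs: int = 5000):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05, weight_decay=5e-4)
    loss_fn = torch.nn.BCELoss()  # assumed loss for the sigmoid outputs
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        scores = model(x, edge_index)                   # sigmoid outputs in [0, 1]
        loss = loss_fn(scores[train_mask], labels[train_mask].float())
        loss.backward()
        optimizer.step()
    return model
```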

7.5 Results
As described in the methodology section, we conducted several tests to check
our models’ performance and understand their workings.
The →Table 7.2 metrics represent the average metrics for our models. Our model achieved an accuracy of 75.91% with the GCN model and 84.03% with GraphSAGE. When features are removed, the GraphSAGE model accuracy drops by 17.14%, and the GCN model accuracy drops by 20.78%. The AUC for GraphSAGE is 0.85 and drops by 0.17 when run featureless. The AUC for GCN is 0.78 and drops by 0.28 when run featureless. GraphSAGE performs better than MLP in our testing. However, GCN’s performance was found to be lower than that of MLP.

Table 7.2: Summary of the performance metrics obtained from our experiments.
Model name AUC Accuracy
GCN 0.78 75.91
GraphSAGE 0.85 84.03
MLP 0.78 77.71
GraphSAGE featureless 0.68 66.89
GCN featureless 0.50 55.13
As shown in →Table 7.3, GraphSAGE achieved comparable specificity with GCN
and much higher sensitivity than GCN or MLP.

Table 7.3: Specificity and sensitivity of the three different models.


Model name Specificity Sensitivity
GCN 0.96 0.63
GraphSAGE 0.94 0.96
MLP 0.90 0.67

→Table 7.4 shows the top 10 genes (based on the confidence level of the model)
that our model predicted to belong to each class.

Table 7.4: Top 10 genes based on model confidence for each class.
Not associated with ASD risk Associated with ASD risk
ASGR1 DSCAM
C17ORF85 HLA-A
MIR181A1 RELN
YBX2 TDO2
WDR13 BCL11A
TULP4 CUL3
SHTN1 DISC1
RBMXL1 NTRK3
PRKRA DRD3
C2ORF3 FOXG1

7.6 Discussion
Our experimental results demonstrate (→Table 7.2) notable performance
variations between different graph-based models in the context of autism risk
classification. The GraphSAGE model exhibited superior accuracy compared to
the GCN model, achieving an impressive 84.03% accuracy as opposed to GCN’s
75.91%. This suggests that the GraphSAGE architecture captures essential
patterns in the gene interaction data more effectively than GCN in the context of
autism risk prediction.
A compelling aspect of our findings is the sensitivity of the GraphSAGE
model to feature removal. When features were systematically removed, the
GraphSAGE model experienced a significant accuracy drop of 17.14%, indicating
its reliance on specific features for accurate predictions. The GCN model showed an even larger drop of 20.78% under the same feature removal conditions. This suggests that both models’ reliance on features might make them more susceptible to variations in the input data, potentially pointing to the importance of specific chromosome locations for autism risk. The Area Under the
Curve (AUC) values further emphasize the nuanced performance of the models.
GraphSAGE achieved an AUC of 0.85, indicating a strong ability to distinguish
between classes. However, when run in a featureless mode, the AUC dropped by
0.17, underscoring the importance of the initial features for optimal model
performance. Similarly, the GCN model exhibited an AUC of 0.78, decreasing by
0.28 when featureless. These results suggest that the discriminative power of
both models is closely tied to the availability of the chromosome location
features.
Our experiments revealed that GraphSAGE outperformed the MLP in our
testing conditions. This suggests that the graph-based architectures, particularly
GraphSAGE, are well-suited for capturing complex relationships in gene
interaction data, leading to improved classification accuracy for autism risk.
On the other hand, the GCN model’s performance was lower than that of
MLP. This unexpected result might indicate that the specific architectural choices
in the GCN model, or the nature of the gene interaction data, do not align
optimally with the learning task at hand. Further investigation into the model’s
limitations and potential areas for improvement could provide valuable insights.
The GCN model demonstrated (→Table 7.3) remarkable specificity with a value of 0.96, indicating its capability to accurately identify genes not associated with ASD. However, its sensitivity of 0.63 suggests a notable limitation in correctly classifying ASD-associated genes, potentially leading to a higher rate of false negatives. This trade-off between specificity and sensitivity highlights a challenge in the GCN model’s ability to comprehensively capture the complexity of gene interaction patterns associated with autism. Conversely, the GraphSAGE model exhibited a high sensitivity of 0.96, signifying its effectiveness in correctly identifying ASD-associated genes. It also exhibits a specificity of 0.94,
suggesting a very low rate of false positives. This is further underlined in the
ROC curve shown in →Figure 7.3. Comparatively, the MLP model demonstrated a
moderate sensitivity (0.67) and relatively high specificity (0.9). While the MLP
model may not excel in all metrics, it seems to do better than GCN when it
comes to sensitivity.
Figure 7.3: Receiver operating characteristics (ROC) curve for the binary
classification using GraphSAGE (using our optimal architecture of 128 units in
each network layer).

The observed differences in sensitivity and specificity among the models


underscore the importance of selecting a model that aligns with the specific
goals and priorities of the autism risk prediction task. Depending on the
application, stakeholders may prioritize minimizing false positives (increasing
specificity) or false negatives (increasing sensitivity).
The clear separation observed in the t-SNE plot (→Figure 7.4) between ASD-associated and nonassociated clusters suggests that distinct patterns exist in the feature space that contribute to the classification task. The ability of t-SNE to capture complex relationships in high-dimensional data is evident, as it effectively reveals inherent structures that differentiate the two classes. The
presence of distinct clusters implies that personalized approaches to ASD
diagnosis and treatment may be warranted. Understanding the genetic and
molecular profiles associated with each cluster could lead to more targeted
interventions, moving toward a precision medicine approach tailored to the
specific characteristics of different ASD subtypes.

Figure 7.4: t-distributed stochastic neighbor embedding (t-SNE) plot depicting clusters for binary classification, distinguishing ASD (autism spectrum disorder)-associated and ASD nonassociated samples.
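A plot of this kind can be produced by projecting the learned per-gene embeddings to two dimensions; the sketch below uses scikit-learn's t-SNE and matplotlib, where "embeddings" is a hypothetical array of hidden representations taken from the trained GNN.

```python
# Projecting GNN node embeddings to 2-D with t-SNE, as in Figure 7.4.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels):
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="coolwarm", s=8)
    plt.title("t-SNE of gene embeddings (ASD-associated vs. not associated)")
    plt.show()
```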

The top 10 genes identified by the model (shown in →Table 7.4) as having no
ASD risk, with high confidence levels, are noteworthy for understanding the
genetic landscape associated with individuals without ASD. Investigating these
genes’ functions and biological roles may reveal protective factors or pathways
associated with typical neurodevelopment. A detailed examination of the
identified genes can provide insights into their functions, potential interactions,
and relevance to neurodevelopmental processes. Understanding the biological
roles of these genes may contribute to the broader understanding of factors
that confer resilience against ASD. Expert validation of the model’s predictions
on these genes is crucial. Cross-referencing with existing literature and
experimental validation can confirm the reliability of the model’s selections and
contribute to the robustness of the identified gene set.
Conversely, the top 10 genes identified by the model as having a high
confidence level associated with ASD risk offer valuable leads for further
investigation into the molecular mechanisms underlying ASD. These genes may
serve as potential biomarkers for ASD risk. Exploring their roles in
neurodevelopmental processes, synaptic function, or other relevant pathways
may provide insights into the biological underpinnings of ASD. Grouping the
identified genes into biological pathways or networks can uncover patterns of
dysregulation. Pathway analysis may reveal common molecular pathways
disrupted in individuals at risk for ASD, guiding targeted therapeutic
interventions or preventive measures. The confidence levels associated with
each gene prediction are critical for gauging the reliability of the model’s
selections. A higher confidence level suggests a stronger association between
the gene and the predicted outcome. Understanding the relationship between
model confidence and clinical relevance is pivotal for translating these findings
into actionable insights in a clinical setting.

7.7 Conclusion
The experiment shows that a graph neural network can determine whether or not a specific gene is associated with ASD.
This experiment has shown us that the GraphSAGE model best uses gene
interactions and chromosome location features to classify genes with ASD risk.
We compared the performance of our model with the baseline MLP model and
found that it outperforms MLP in every performance criterion we measured.
Identifying potential biomarkers and protective factors using this method opens
avenues for precision medicine approaches in ASD. Tailoring interventions based
on the genetic profiles of individuals may pave the way for more effective
diagnosis and treatment strategies. Experimental validation of the identified
genes is crucial for establishing their functional roles in neurodevelopmental
processes. This step is fundamental in bridging the gap between computational
predictions and biological significance. Future research endeavors could extend
beyond gene interaction data to incorporate multi-omics information.
Integrating data on gene expression, epigenetics, and other molecular layers
may enhance the depth and accuracy of ASD risk predictions. Moving forward,
translating research findings into clinically actionable insights becomes
imperative. The model’s predictions and the identified genes should be validated
in diverse patient populations, laying the foundation for their integration into
clinical decision-making processes.
References
[1] B. Abrahams, D. Arking, D. Campbell, H. Mefford, E. Morrow, L. Weiss, I.
Menashe, et al., Sfari gene 2.0: A community-driven knowledgebase for the
autism spectrum disorders (asds). Molecular Autism, vol. 4, no. 36, pp. 10, 2013.

[2] M. Alarcón, B. S. Abrahams, J. L. Stone, J. A. Duvall, J. V. Perederiy, J. M. Bomar,
J. Sebat, et al., Linkage, association, and gene-expression analyses identify
cntnap2 as an autism-susceptibility gene. The American Journal of Human
Genetics, vol. 82, no. 1, pp. 150–159, 2008. →
[3] R. Anney, L. Klei, D. Pinto, J. Almeida, E. Bacchelli, G. Baird, N. Bolshakova, et al., Individual common variants exert weak effects on the risk for autism spectrum disorders. Human Molecular Genetics, vol. 21, no. 21, pp. 4781–4792, 2012. →
[4] J. Grove, S. Ripke, T. D. Als, M. Mattheisen, R. K. Walters, H. Won, et al.,
Identification of common genetic risk variants for autism spectrum disorder.
Nature Genetics, vol. 51, no. 3, pp. 431–444, March 2019. a, b
[5] I. Beyreli, O. Karakahya, and A. E. Cicek, DeepND: Deep multitask learning of gene risk for comorbid neurodevelopmental disorders. Patterns, vol. 3, p. 100524, 2022. →
[6] J. Bralten, K. J. van Hulzen, M. B. Martens, T. E. Galesloot, A. Arias Vasquez, L.
A. Kiemeney, J. K. Buitelaar, et al., Autism spectrum disorders and autistic traits
share genetics and biology. Mol. Psychiatry, vol. 23, no. 5, pp. 1205–1212, May
2018. →
[7] P. F. Bray, C. I. Jones, N. Soranzo, and W. H. Ouwehand, Chapter 4 – Platelet
genomics. In: D. M. Alan, (Ed.), Platelets, 3rd Edition, pp. 67–89, Academic Press,
2013. →
[8] J. Feng, A. Zeng, Y. Chen, P. Payne, and L. Fuhai. Signaling interaction link
prediction using deep graph neural networks integrating protein-protein
interactions and omics data. bioRxiv, 2020. →
[9] T. K. B. Gandhi, J. Zhong, S. Mathivanan, L. Karthick, K. N. Chandrika, S.
Sujatha Mohan, S. Sharma, et al., Analysis of the human protein interactome and
comparison with yeast, worm and fly interaction datasets. Nature Genetics, vol.
38, no. 3, pp. 285–293, 2006. →
[10] T. Gaugler, L. Klei, S. Sanders, C. Bodea, A. Goldberg, A. Lee, M. Mahajan, et
al., Most genetic risk for autism resides with common variation. Nature Genetics,
vol. 46, 07 2014. →
[11] J. Gratten, P. Visscher, B. Mowry, and N. Wray, Interpreting the role of de
novo protein-coding mutations in neuropsychiatric disease. Nature Genetics, vol.
45, pp. 234–238, 03 2013. →
[12] K. Riccardi, and D. Bandara, Autism risk classification using graph neural
networks applied to gene interaction data. In: 2023 World Congress in Computer
Science, Computer Engineering, and Applied Computing (CSCE), NV, USA: Las
Vegas, 2023. →
[13] W. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on
large graphs, 06 2017. a, b
[14] F. Hormozdiari, O. Penn, E. Borenstein, and E. E. Eichler, The discovery of integrated gene networks for autism and related disorders. Genome Research, vol. 25, no. 1, pp. 142–154, 2015. →
[15] I. Iossifov, B. O’Roak, S. Sanders, M. Ronemus, N. Krumm, D. Levy, H.
Stessman, et al., The contribution of de novo coding mutations to autism
spectrum disorder. Nature, vol. 515, pp. 10, 2014. →
[16] T. N. Kipf, and M. Welling. Semi-supervised classification with graph
convolutional networks. In: International Conference on Learning
Representations, 2017. a, b
[17] E. A. Anan’ko, F. A. Kolpakov, N. A. Kolchanov, et al., Gene networks.
Molecular Biology, vol. 34, pp. 449–460, 2000. →
[18] N. Krumm, B. J. O’Roak, J. Shendure, and E. E. Eichler, A de novo convergence of autism genetics and molecular neuroscience. Trends in Neurosciences, vol. 37, no. 2, pp. 95–105, 2014. →
[19] S. Laksshman, R. Bhat, V. Viswanath, and X. Li, Deepbipolar: Identifying
genomic mutations for bipolar disorder via deep learning. Human Mutation, vol.
38, pp. 06, 2017. →
[20] L. Liu, J. Lei, and K. Roeder, Network assisted analysis to reveal the genetic basis of autism. The Annals of Applied Statistics, vol. 9, no. 3, p. 1571, 2015. →
[21] C. Lord, M. Elsabbagh, G. Baird, and J. Veenstra-Vanderweele, Autism spectrum disorder. The Lancet, vol. 392, no. 10146, pp. 508–520, 2018. →
[22] S. Uddin, and H. Lu, A weighted patient network-based framework for
predicting chronic diseases using graph neural networks. Science Report, vol. 11,
p. 22607, 25 2021. →
[23] S. L. Lee, G. Mellick, A. A. Motsinger, et al., Gpnn: Power studies and
applications of a neural network method for detecting gene-gene interactions in
studies of human disease. BMC Bioinformatics, vol. 7, no. 39, 2006. →
[24] W. Pereanu, E. Larsen, I. Das, M. Estevez, A. Sarkar, S. Spring-Pearson, et al.,
Autdb: A platform to decode the genetic architecture of autism. Nucleic Acids
Research, vol. 46, pp. 11, 2017. a, b
[25] G. Ramaswami, and H. G. Daniel, Chapter 21 – Genetics of autism spectrum
disorder. In: H. G. Daniel, L. P. Henry, and K. Christine, (Eds.), Neurogenetics, Part
I, volume 147 of Handbook of Clinical Neurology, pp. 321–329, Elsevier, 2018. →
[26] K. Samocha, E. Robinson, S. Sanders, C. Stevens, A. Sabo, L. McGrath, J. Kosmicki, et al., A framework for the interpretation of de novo mutation in human disease. Nature Genetics, vol. 46, 2014. →
[27] T. H. Shaikh, Copy number variation disorders. Current Genetic Medicine Reports, vol. 5, pp. 183–190, 2017. →
[28] T. Beata, B. Patrick, F. Happé, M. Rutter, and F. Rijsdijk, Heritability of autism spectrum disorders: A meta-analysis of twin studies. Journal of Child Psychology and Psychiatry, vol. 57, no. 5, pp. 585–595, 2016. →
[29] J. Veltman, and H. Brunner, De novo mutations in human genetic disease.
Nature Reviews. Genetics, vol. 13, pp. 565–575, 07 2012. →
[30] H. Wang, and P. Avillach, Genotype-based deep learning in autism spectrum
disorder: Diagnostic classification and prognostic prediction using common
genetic variants, preprint, 10 2020. →
[31] Z. Wang, Y. Xu, D. Peng, J. Gao, and L. Fengmei, Brain functional activity-
based classification of autism spectrum disorder using an attention-based graph
neural network combined with gene expression. Cerebral Cortex, vol. 33, no. 10,
pp. 6407–6419, 12 2022. →
8 Leveraging ChatGPT and table arrangement
techniques in advanced newspaper content analysis
for stock insights

Masaki Murata

Acknowledgment: This study was conducted with grant support from the Ishii
Memorial Securities Research Promotion Foundation.

Abstract
This study uses table arrangement techniques and ChatGPT to analyze articles
from the Mainichi Shimbun for stock price prediction. From the analysis, we
identified 22 primary factors that influence the Nikkei Stock Average. We also
discovered that ChatGPT could extract and present newspaper data in a tabular
format. These factors significantly impact stock price fluctuations. The Nikkei
Stock Average tends to rise with improved US trade relations, a strong domestic
economy, and events such as the Olympics. Conversely, it tends to decline
during global stock market crashes, trade tensions, and pandemics. Moreover,
we propose a highly efficient, large-scale method for creating tables by
integrating table arrangement techniques with ChatGPT. Using a lenient
criterion, the proposed method attains an accuracy rate of 0.88. Items frequently
mentioned in articles concerning the Nikkei Stock Average are systematically
presented in a table, illustrating how the index rises or falls in response to these
items. This table also delves deeper into the effects of exchange rate changes on
the Nikkei Stock Average. Our findings offer valuable insights into the movement
of the Nikkei Stock Average. Future research will further refine these techniques
to improve stock prediction accuracy.
Keywords: stock prices, newspaper data, content analysis, table arrangement, ChatGPT

8.1 Introduction
According to Bollen et al. [→1], the prediction of stock prices can be partially
achieved through Twitter sentiment analysis. Deep learning techniques have
also been used in various studies to forecast stock prices based on information
extracted from newspapers [→2, →3, →4, →5]. Newspapers generally contain
valuable insights into the prevailing economic landscape and fluctuations in
stock prices. To improve the accuracy of stock price prediction, this study
examined sentences from newspapers that specifically relate to stock prices. We
used table arrangement technology and ChatGPT to identify the common types
of stock-related content found in newspapers. This study was conducted using
Japanese texts.
The primary contributions of this study are summarized below:
We used table arrangement technology to organize newspaper articles
about stock prices into tables. These tables consist of columns that provide
detailed information about stock prices, as presented in newspaper
articles.
The columns within the tables offer valuable information that contributes
to stock price prediction. This includes data such as the Dow Jones
Industrial Average, exchange rates, and other relevant market data.
Furthermore, we extracted content specifically related to significant
events, such as the “Trump administration” and the “New Corona”
situation. In total, 22 content items were obtained and analyzed.
We discovered that ChatGPT could extract information from newspapers
and effectively organize it into a table. However, we observed that the
number of columns in the tables generated by ChatGPT (ranging from 4 to
5) was fewer than that in the tables produced by the table arrangement
technology (ranging from 13 to 50 columns).
In addition, we proposed a method that combined table arrangement
techniques and ChatGPT to construct large-scale tables from a collection of
documents. The proposed method achieved a high accuracy rate of 0.88,
employing a lenient evaluation criterion that allows certain acceptable
additional extractions and minor missing information.
The items in many articles on the Nikkei Stock Average were organized
into a table that shows how the Nikkei Stock Average rose or fell when
these items occurred. In addition, the table shows in more detail how the
Nikkei Stock Average rises or falls when the exchange rate changes. These
analyses are also useful when examining the movement of the Nikkei
Stock Average.
This chapter is an extended version of a previous paper [→6].
8.2 Analysis cases using table arrangement technology

8.2.1 Analysis case 1 using table arrangement technology


From the Mainichi Shimbun from 2007 to 2020, we extracted articles that included either “compared to the previous day’s closing price” or “compared to the previous day’s price,” together with both “Tokyo Stock Market” and “Nikkei Stock Average.” In total, 2,084 articles were extracted.
Next, these articles were clustered. Clustering was performed by specifying
100 clusters using the K-means clustering method, and clusters with a high
density and a moderately large number of cases were selected as experimental
data. Then, a cluster containing 21 articles with a density of 0.989 (maximum
value is 1) was extracted and used as experimental data. Note that the degree of
density represents the degree of association of information within a given
cluster, and a cluster filled with similar information has a high degree of density.
A word vector was then created using FastText. A document vector was created
based on the word vector, and the degree of similarity between documents was
calculated. Clustering was performed using the K-means clustering method. In
addition, the density was calculated using the document vectors.
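A sketch of this pipeline is given below. Library choices (gensim for FastText, scikit-learn for K-means) are assumptions, and because the chapter does not give the exact density formula, mean pairwise cosine similarity within a cluster is used here as a stand-in.

```python
# Sketch: FastText word vectors -> averaged document vectors -> K-means clusters,
# with a per-cluster "density" approximated by mean pairwise cosine similarity.
import numpy as np
from gensim.models import FastText
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def cluster_articles(tokenized_articles, n_clusters=100):
    ft = FastText(sentences=tokenized_articles, vector_size=100, epochs=10)
    doc_vecs = np.array([
        np.mean([ft.wv[w] for w in doc], axis=0) for doc in tokenized_articles
    ])
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(doc_vecs)

    densities = {}
    for c in range(n_clusters):
        members = doc_vecs[cluster_ids == c]
        if len(members) > 1:
            sims = cosine_similarity(members)
            densities[c] = sims[np.triu_indices(len(members), k=1)].mean()
    return cluster_ids, densities
```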
These 21 articles were organized into a table using table arrangement
technology [→7]. (Some studies use seq2seq methods to generate tables from
sentences [→8]. These methods are different from the method used in the
literature [→7].) With this table arrangement technology, when multiple articles
are input, the sentences in the articles are clustered, and a table is generated in
which the inputted articles are arranged in rows and the clustered sentences in
columns. Moreover, this technique can be used to generate a table in which
sentences with similar content are arranged in the same column. In this study,
13 columns were generated using this technique. In the generated table, items
with higher importance were arranged in the left column; the importance was
determined based on the product of the density and filling rate of the table (the
fewer empty columns, the better). Details about this procedure can be found in
the literature [→7]. A portion of the generated table is presented in →Table 8.1.
We found that the last three columns were fragmentary and unnecessary
because they contained information about the author of the corresponding
article. The content of the remaining 10 columns was as follows in order of
importance. An example sentence in that column is also presented.
Table 8.1: Portion of the generated table using table arrangement technology.
Columns: Article title | Dow Jones Industrial Average | Exchange | Market information 1 | …

Row 1:
Article title: Foreign exchange/stocks: Tokyo Stock Exchange on the verge of breaking 8,000 yen, temporarily depreciating 658 yen, the lowest price since the beginning of the year.
Dow Jones Industrial Average: The Nikkei Stock Average on the Tokyo Stock Exchange dropped for the second day in a row on January 23, temporarily dropping to 8,016.61 yen, down 658.08 yen from the previous day’s closing price. The Dow Jones Industrial Average dropped more than $500 on the New York market on the previous day.
Exchange: It was down 531.27 yen to 8,143.42 yen as of 0:45 p.m.
Market information 1: The Tokyo stock market was down across the board on the 23rd due to the sharp decline in the US and European markets the previous day and the sharp appreciation of the yen. With the global stock market in a state of unabated decline, there are developing expectations that the market will fall below the post-bubble low (closing price of 7,607.88 yen), which was hit on April 28, 2003.

Row 2:
Article title: Greece: Fiscal crisis plummeting stock prices in Japan, the US and Europe; decision to support, adjustment on the 10th of next month.
Dow Jones Industrial Average: The Dow Jones Industrial Average fell sharply for the first time in seven business days on the New York Stock Exchange on May 27, falling $213.04 from the previous day’s close to 19,991.99, the biggest drop since February 4 (down $268.37). The Nikkei Stock Average fell sharply, falling below the 10,900-yen level for the first time in three business days during trading hours.
Exchange: … The yen also strengthened against the dollar, temporarily hitting 92.81 yen to the dollar for the first time in three business days.
Market information 1: The previous day, European and US stocks also plunged, and there is a sense of caution in the market that a global stock market decline originating in Europe is likely to proceed. In the Tokyo Stock Exchange on August 28, selling was also prevalent right from the start of trading.

…

a. Dow Jones Industrial Average


The Tokyo stock market on the 9th fell sharply to the 8,600 yen level, more
than 400 yen lower than the previous day’s closing price, as the stock price
plunged by more than $600 in the New York stock market.
b. Exchange
On the same day, in the Tokyo foreign exchange market, speculation that
the dollar would appreciate further because of the devaluation of the yen
led to the prevailing trend of selling the yen and buying the dollar. Since
then, the yen has depreciated for the first time in approximately two
months.
c. Market information 1
The Hong Kong and Taiwan markets have also posted new lows since the
beginning of the year.
d. Market information 2
A wide range of stocks are sold, focusing on stocks sensitive to economic
trends, such as steel and shipping, and export-related stocks, such as
automobiles and electronics. Kazuki Miyazawa, a market analyst at Daiwa
Securities SMBC, said, “The market is dominated by uncertainty, and the
price movements cannot be explained by past experience or corporate
strength.”
e. World affairs
Fears of a simultaneous global recession have surged, pushing stocks
lower in Asia and Europe. In the foreign exchange market, the US dollar
and the euro were sold off because of economic uncertainty in Europe and
the United States, and the yen appreciated sharply.
f. Overseas Stock Indices
In the Asian market, major stock indexes have fallen one after another,
with the Korean Composite Index temporarily dropping by more than 9%.
g. Related to “since”
Immediately after the Great East Japan Earthquake, it was the first time in
approximately four and a half months since March 17 that the 8,000-yen
level was reached during trading hours.
h. Drop
This was the ninth largest drop in the past for a single day.
i. Tokyo Stock Exchange
Tokyo Stock Price Index (TOPIX) also fell for two days in a row, trading at
841.73, down by 47.50 points from the same period last year. The trading
volume of the first section of the Tokyo Stock Exchange in the morning was
1,228 million shares.
j. International information
Iceland and Hungary were pushed to the brink of default on their debts
one after another, and Pakistan requested assistance from the
International Monetary Fund.
Because the above results were processed automatically, there were places
where the data were not divided neatly; however, note that the above
information was frequently written at the same time when describing the Nikkei
average in the target newspapers.
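To illustrate the general idea of the table arrangement step, a simplified sketch is shown below: sentences from all input articles are clustered into a fixed number of "columns," and each article (row) collects its own sentences under the column of the cluster they fall into. This is only a rough illustration under assumed inputs (a sentence-vectorization function), not the algorithm of reference [→7].

```python
# Simplified illustration of arranging article sentences into table columns.
import numpy as np
from sklearn.cluster import KMeans

def arrange_table(article_sentences, sentence_vector, n_columns=13):
    # article_sentences: list of lists of sentences; sentence_vector: callable -> vector.
    flat = [(i, s) for i, sents in enumerate(article_sentences) for s in sents]
    vecs = np.array([sentence_vector(s) for _, s in flat])
    cols = KMeans(n_clusters=n_columns, random_state=0).fit_predict(vecs)

    table = [[""] * n_columns for _ in article_sentences]   # rows = articles
    for (row, sentence), col in zip(flat, cols):
        table[row][col] = (table[row][col] + " " + sentence).strip()
    return table
```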

8.2.2 Analysis case 2 using table arrangement technology


Here, we discuss another analysis case. In this example, 2,084 extracted articles
(Section 8.2.1) were clustered using the K-means clustering method, with the
number of clusters set to 20. Among these 20 clusters, those with a high density
and a moderately large number of articles were selected as the experimental
data. In this case, a cluster containing 79 articles with a density of 0.983 was
extracted and used as the experimental data.
We then generated a table using a technique similar to that described in the
literature [→7]. The number of columns was fixed at 20, and the sentences were
clustered using the K-means clustering method to organize the table. The rows
of the table represented 79 different articles, and the columns represented the
sentences clustered into the 20 clusters. Among these 20 columns of information, one additional column beyond (a)–(j) (Section 8.2.1) is reasonably well organized, and it is shown below with an example sentence.
k. Findings
Kazuhiko Nagami, the director of the investment information department
at Okasan Securities, points out that “Even though the global stock market
has fallen at the same time, the factors are different this year from last
year, and there is also a big difference in the market’s willingness to
invest.”
In addition, for the same group of 79 articles, the number of columns was fixed
at 50, and the sentences were clustered using the K-means clustering method to
organize the table. Here, the rows of the table are the 79 articles, and the
columns are the sentences clustered into 50 clusters. Among the 50 columns of information, 11 additional columns beyond (a)–(k) (Sections 8.2.1 and 8.2.2) are reasonably well organized, and they are shown below with example sentences.
l. Yield
As funds from stocks and commodities flow into government bonds, the
“flight to quality” progresses, and in the Japanese government bond
market, the yield on newly issued 10-year government bonds, which serve
as a benchmark for long-term interest rates, dropped to 0.055% from the
previous day to 1.19%.
m. Monetary easing
Nevertheless, it is the first time in approximately three months since the
Bank of Japan decided to introduce additional monetary easing and de
facto “inflation targeting” on February 14 that it broke below 9,000 yen
during trading hours.
n. Performance
Regarding the future outlook for stock prices, there is also a view that it is
difficult to imagine that the strong performance of Japanese companies
will support buying and will lead to a one-sided decline.
o. Commencement of trading
The Tokyo market also took over this trend, and from the start of trading, it
became an overall decline.
p. Related
In the stock market, the sale of export-related stocks such as automobiles
and electronics accelerated.
q. Trade
Fears of a trade war originating in the United States are rocking global
financial markets.
r. Europe
On the previous day, in overseas markets, euro selling accelerated because
of the view that European financial institutions would find it difficult to
raise funds.
s. Vigilance
However, there is a deep-rooted sense of caution regarding the recent
surge.
t. Trump administration
Earlier this year, the Trump administration implemented protectionist
policies, such as imposing emergency import restrictions on solar panels,
mainly targeting China.
u. New Corona
US stock prices continued to fall on the 25th because of the new variant of
coronavirus, which was seen as “a fire on the other side of the river,” such
as the US Department of Health and Human Services Secretary Azar, said,
“There is a strong possibility that the new type of pneumonia will spread in
the United States.” Concerns about the spread of infection in the United
States have increased.
v. Foreign investors
Some concerned people said, “The only way out of the slump is to wait for
foreign investors to come back.”
When the newspapers reported on the Nikkei 225, they frequently wrote about
the above information simultaneously. Note that we can acquire additional
content by increasing the number of columns in the table. Furthermore, content
related to special events, e.g., the “Trump administration” and “New Corona,”
was also obtained.

8.3 Analysis cases using ChatGPT


The field of natural language processing is advancing rapidly, with advances in
deep learning playing a major role. Among these innovations is ChatGPT, a chat
system that uses a large-scale language model and has demonstrated high
performance in processing natural language [→9, →10]. In this section, we used
ChatGPT (GPT-4) to analyze newspaper content to predict stock prices.

8.3.1 Analysis case 1 using ChatGPT


ChatGPT is trained on a large amount of data. When we used ChatGPT (GPT-3.5) in March 2023 in our previous study [→6], we had ChatGPT output articles about the Nikkei Stock Average to acquire knowledge about stock prices. However, as of October 2023, ChatGPT (GPT-4) does not output articles on the Nikkei Stock Average because of copyright difficulties when given prompts such as “Write a sentence like an actual newspaper article on the Nikkei Stock Average.” In some cases, the output can still be obtained by asking ChatGPT to output a “sentence similar to an actual sentence.”
We input the following into ChatGPT:

It does not have to be about the future. It can be about the past. Write 10 sentences from
actual newspaper articles when the Nikkei Stock Average rose significantly.

The following was outputted:

We are sorry, but due to copyright issues, we are unable to reproduce the exact text of specific
newspaper articles from the past. However, it is possible to create 10 exemplary fictional
newspaper article sentences based on actual past events or topics. The following are
exemplary article sentences.
1. “Nikkei average rises sharply on new trade agreement with US. The development was
well received by many industries.”
2. “The Nikkei 225 gained a big 500 yen in a single day. Many market participants are
confident that Japan’s economy will recover.”
3. “The Nikkei 225 soared after the Bank of Japan announced its interest rate revision.
Expectations for future economic policies are rising.”
4. “Renewable energy-related issues performed well. The Nikkei 225 recorded its biggest
gain in three months.”
5. “The Nikkei 225 rose above 25,000 yen on the back of aggressive purchases of Japanese
stocks by foreign investors.”
6. “A batch of major companies reported strong earnings results. Markets rejoiced and
the Nikkei 225 rose sharply.”
7. “The Tokyo Olympics were scheduled to be held in Tokyo. Sightseeing and construction-
related stocks soared and the Nikkei 225 jumped sharply.”
8. “Technology stocks continued to perform well, and the Nikkei 225 soared, riding the
wave.”
9. “Against the backdrop of the yen’s temporary stability, the Nikkei 225 made record
gains.”
10. “The Nikkei 225 also hit a new high on signs of a global economic recovery.”
These sentences are exemplary based on past market trends and events and may differ from
actual newspaper articles. Please use them for reference only.

Next, we entered the following: “Analyze these 10 articles from various perspectives and organize them into a table.” The following was outputted:

Roger. The following table summarizes the results of our analysis of the 10 exemplary
newspaper article texts you provided from various perspectives.

Article Number | Key Factor | Affected Industry/Sector | Relevant country or region | Article tone
1 | Trade agreement | In general | U.S. and Japan | Favorable
2 | Economic recovery | In general | Japan | Favorable
3 | Interest rate revision | Finance, real estate | Japan | Favorable
4 | Renewable energy | Energy, technology | Japan | Favorable
5 | Purchases by foreign investors | General | Japan, international investors | Favorable
6 | Announcement of corporate earnings | General | Japan | Favorable
7 | Olympic Games | Tourism, construction, sports | | Favorable
8 | Favorable performance of technology stocks | Technology | |
9 | Stability of the yen | Export industry | | Favorable
10 | Global economic recovery | General | |

Key Factors: The main reasons or underlying factors for the rise in an article.
Affected industry/sector: The industry or sector that is likely to be particularly affected
by the content of the article.
Relevant countries/regions: Key countries/regions relevant to the content of the article.
Tone of the article: The overall mood or impression of the article. In this case, we use
the term “favorable” because all articles report an increase in the Nikkei 225.
This table is based on the text of exemplary newspaper articles and may differ from actual
market conditions and analysis results.

The following was inputted into ChatGPT: “Based on the above results, please
provide a bulleted list of specific times when the Nikkei Stock Average will rise.”
The following was outputted:

Based on the analysis of the above exemplary newspaper articles, the specific circumstances
and factors that would cause the Nikkei Stock Average to rise are itemized below.
1. Progress in trade agreements:
When trade relations improve or agreements are reached with major economic
partners, such as a new trade agreement with the U.S.
2. Recovery of the domestic economy:
A situation in which the Japanese economy is performing well and market
participants are confident of its recovery.
3. Interest rate developments:
Interest rate revisions and central bank monetary policy are favorable to the
market.
4. Sector-specific strength:
When certain sectors are performing particularly well, for example, renewable
energy or technology stocks.
5. Active investment by foreign investors:
A situation in which foreign investors are actively buying Japanese stocks because
of increased inflows of foreign capital.
6. Strong corporate performance:
A major company announces better-than-expected earnings, which spreads
positive sentiment throughout the market.
7. Large-scale events:
When a major international event such as the Olympics is scheduled or decided to
be held in Japan.
8. Currency trends:
When the yen stabilizes or depreciates moderately, which will help boost export-
related stocks, etc.
9. Global economic strength:
A global economic recovery or growth is confirmed, and Japanese stock prices
benefit from this recovery or growth.
Although these factors may contribute to the rise in the Nikkei Stock Average, actual stock
price movements are determined by various factors and complex interactions beyond these
alone.

Based on the analysis of newspaper articles, several factors influence the rise in
the Nikkei Stock Average. These include improved trade relations, particularly
with the United States; a robust domestic economy; favorable interest rate and
central bank policy changes; standout performance in specific sectors such as
technology; increased foreign investments in Japanese stocks; major Japanese
companies exceeding earnings expectations; international events such as the Olympics being held in Japan; a stable or slightly depreciating yen boosting export-related stocks; and a thriving global economy.

8.3.2 Analysis case 2 using ChatGPT


In the previous section, we investigated the phenomenon of the Nikkei Stock
Average rising. In this section, we investigate the phenomenon of the Nikkei
Stock Average falling.
The following was entered into ChatGPT: “It doesn’t have to be about the
future. It can be about the past. Write 10 sentences from actual newspaper
articles when the Nikkei Stock Average fell significantly.” ChatGPT outputted the
following.

I am sorry, but I am not able to directly provide the text of specific newspaper articles
protected by copyright. However, I can provide information regarding the background, factors,
and historical events when the Nikkei Stock Average fell significantly.
The following events and factors related to past major stock price declines and economic crises
may have had an impact:
1. The Black Monday of 1987
2. The bursting of Japan’s bubble economy in the early 1990s
3. The Asian currency crisis of 1997–98
4. The dot-com bubble burst in 2000
5. The Lehman Shock of 2008 and the subsequent financial crisis
6. The Great East Japan Earthquake of 2011 and its subsequent economic impact
7. The economic slowdown in China in 2015–2016 and its impact
8. The U.S.–China trade war in 2018
9. The global COVID-19 pandemic in 2020 and its economic impact
10. International events such as oil shocks and geopolitical risks
For references to specific newspaper articles and analyses of these events and situations, we
recommend accessing them directly through a licensed news service or archive service.

We inputted the following into ChatGPT: “For each event, write a sentence that
sounds like an actual newspaper article on a major drop in the Nikkei 225.”
ChatGPT outputted the following.

Understood. The following is an imaginary sentence based on the above event, which I have
written as if it were an actual newspaper article. Please note that this is not the content of an
actual newspaper article, but rather a fictional content I created, and should be used only as a
reference.
1. Black Monday 1987 “On October 19, 1987, the Nikkei 225 fell sharply as a result of Black
Monday. The Tokyo stock market also suffered from general selling pressure.”
2. The bursting of Japan’s bubble economy in the early 1990s “With the end of the bubble
economy, the Nikkei 225 plummeted. Many investors experienced asset write-downs,
and real estate and stock prices fell rapidly.”
3. 1997–98 Asian Currency Crisis “The currency crisis in Asian countries led to a decline in
Japanese exports. The Nikkei average fell sharply and many companies revised their
earnings downward.”
4. The dot-com bubble burst of 2000 “Triggered by a sharp drop in IT-related stocks, the
Nikkei 225 fell sharply in value. Many startups failed and the market became
increasingly uncertain.”
5. 2008 Lehman Shock “The collapse of Lehman Brothers triggered a global financial
crisis. The Nikkei average also fell sharply and corporate earnings continued to
deteriorate.”
6. The Great East Japan Earthquake of 2011 “On March 11, 2011, the Great East Japan
Earthquake hit the Tohoku region. The Nikkei 225 fell sharply for a time and many
companies were affected.”
7. Economic slowdown in China in 2015–2016 “Fears of a slowdown in the Chinese
economy affected the market and the Nikkei 225 fell significantly. Japanese export
performance was also affected.”
8. The U.S.–China trade war in 2018 “As trade friction between the U.S. and China
intensified, the Nikkei 225 was volatile. Many companies are concerned about the
impact on their earnings.”
9. The global pandemic of COVID-19 in 2020 “The Nikkei 225 plunged in the wake of the
new coronavirus. Along with a slowdown in economic activity, many companies warned
of deteriorating business performance.”
10. Oil shocks and geopolitical risks “With the ongoing tensions in the Middle East, the
Nikkei 225 also fell sharply due to concerns about rising oil prices. Rising energy-related
costs could affect corporate earnings.”
These sentences are simplified and are shorter and more summary than the actual newspaper
article.

The following was inputted into ChatGPT: “Analyze these 10 articles from various
perspectives and organize them in a table.” ChatGPT outputted the following.
Below is a table that analyzes and organizes a summary of the 10 fictional
newspaper articles provided from various perspectives.
Incident name/time | Major cause | Nikkei average movement | Major impact
1987 Black Monday | International stock market crash | Major decline | Selling pressure on Tokyo stock market
1990s Bubble burst | End of economic bubble | Plunge | Asset write-downs, declines in real estate and stock prices
1997–98 Asian Currency Crisis | Currency crisis in Asian countries | Major decline | Decline in Japanese exports, downward revisions in corporate earnings
2000 Dot-com bubble | Overheating of IT-related stocks | Major decline | Emerging company failures, market uncertainty
2008 Lehman Shock | Financial crisis, Lehman bankruptcy | Major decline | Global financial crisis, deteriorating corporate performance
2011 Great East Japan Earthquake | Tsunami | Temporary big drop | Many companies were affected, economic activities temporarily halted
2015–2016 China’s economic slowdown | Chinese economy slowdown | Major decline | Decrease in Japanese exports, impact on business performance
2018 US–China trade war | Trade friction between the United States and China | Unstable developments | Impact on export performance, concerns about business performance
2020 COVID-19 pandemic | Global pandemic of the new coronavirus | Plunge | Stagnation of economic activity, deteriorating business performance of many companies
Oil shock | Tension in the Middle East, rise in oil prices | Major decline | Higher energy costs, impact on business performance

This table summarizes the key information from the 10 articles. This confirms
that each event and period has a different degree of impact and background.
The following was inputted into ChatGPT: “Based on the above results,
please provide a bulleted list of specific times when the Nikkei Stock Average
falls.” ChatGPT outputted the following.

Based on the above results, specific situations and factors that may cause the Nikkei Stock
Average to fall can be summarized as follows.
1. International stock price collapse: When stock prices in other countries plunge, the
Nikkei Stock Average may fall in a chain reaction (e.g., Black Monday in 1987).
2. Burst of economic bubbles: After excessive economic prosperity, stock prices can fall
sharply when bubbles burst (e.g., the bubble burst in the 1990s).
3. Economic crises in neighboring countries or trading partners: Economic instability in
Japan’s major trading partners can affect the Nikkei 225 (e.g., the Asian currency crisis
in 1997–98 and China’s economic slowdown in 2015–2016).
4. Sector-specific overheating: Overheating and subsequent adjustments in some
industries or sectors can affect overall stock prices (e.g., the dot-com bubble of 2000).
5. Major financial crisis: An event that affects the entire financial system, such as the
failure of a major financial institution or a credit crunch, can cause stock prices to fall
sharply (e.g., the Lehman Shock in 2008).
6. Natural disasters: Large-scale natural disasters can temporarily disrupt economic
activity and cause stock prices to decline (e.g., the Great East Japan Earthquake of
2011).
7. International trade frictions: Export-dependent economies could be affected by
intensified trade frictions and tariff wars among major economies (e.g., the U.S.–China
trade war in 2018).
8. Major epidemics: a global epidemic could severely limit economic activity and depress
stock prices (e.g., the COVID-19 pandemic in 2020).
9. Significant energy price volatility: fluctuations in energy-related costs, such as a
significant rise in the price of crude oil, could affect corporate performance and
economic activity.

Note that these factors can affect not only the Nikkei Stock Average but also the
economy as a whole.
We found the following. The Nikkei Stock Average may experience declines
because of several reasons: international stock market crashes, such as the 1987
Black Monday; economic bubble bursts, such as Japan’s 1990s bubble; economic
issues in key trading partners, illustrated by the Asian currency crisis in 1997–98
and China’s economic downturn in 2015–2016; industry-specific imbalances,
highlighted by the 2000 dot-com bubble; widespread financial crises, epitomized
by the 2008 Lehman shock; major natural calamities such as the 2011 Great East
Japan Earthquake; heightened global trade tensions, as seen in the 2018 US–
China trade conflict; global pandemics that limit economic activities, such as the
2020 COVID-19 outbreak; significant fluctuations in energy prices.
A simple survey using ChatGPT in Sections 8.3.1 and 8.3.2 also reveals in
which cases the Nikkei Stock Average rises and falls. However, in a previous
study [→6], ChatGPT outputted sentences that more closely resembled actual
articles. See also the previous study [→6]. Even using only ChatGPT, we can
generate tables by creating sentences for articles and organizing them.

8.4 Analysis cases using both table arrangement technology and


ChatGPT
The tables generated using ChatGPT had fewer columns than those generated using the table arrangement technology. In this section, our objective is to generate tables
with many columns by combining both table arrangement techniques and
ChatGPT. Notably, ChatGPT is a high-performance technology. Using both
approaches, we anticipate achieving high-performance table generation. To
evaluate the performance of this fusion method, we conduct a simple
performance evaluation of the table generation process.
The procedure involves using the items obtained as columns by the table
arrangement technology, which serve as the targeted extracted items.
Subsequently, ChatGPT is used to extract the information associated with these
items from each document. This process is executed across multiple documents,
and the results are structured into a table with numerous columns and rows. In
this fusion method, the columns represent the content to be extracted, whereas
the rows correspond to multiple documents. This approach enables the
generation of tables with many columns while maintaining high performance.
Unlike the table arrangement technique, which does not divide sentences,
ChatGPT divides sentences and extracts information, allowing for the extraction
of more suitable information in the fusion method.
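A minimal Python sketch of this per-article loop is given below. The small CATEGORIES list, the ask_chatgpt helper, and the per-category prompting are illustrative assumptions (the actual study sends all categories (a)–(v) to ChatGPT in a single Japanese prompt per article, as shown next); only the overall structure, with the table-arrangement items as target columns and one row per article, follows the description above.

import pandas as pd

# Illustrative subset of the target items obtained from the table arrangement
# technology; the actual study uses the full set of categories (a)-(v).
CATEGORIES = ["Nikkei 225", "Dow Jones Industrial Average", "Foreign Exchange"]

def ask_chatgpt(prompt):
    # Placeholder for a call to the ChatGPT API (or any other LLM client).
    # Replace this stub with a real request; here it simply returns N/A.
    return "N/A"

def extract_row(article_text):
    # Ask the model, for each target item, which sentences of the article apply.
    row = {}
    for name in CATEGORIES:
        prompt = ("Please pull out the sentences from the following text that apply "
                  f"to the category {name}. If none apply, please state N/A.\n"
                  f"# text\n{article_text}")
        row[name] = ask_chatgpt(prompt)
    return row

def build_table(articles):
    # Rows correspond to the input articles, columns to the extracted items.
    return pd.DataFrame([extract_row(text) for text in articles])

print(build_table(["The Nikkei Stock Average temporarily dropped more than 330 yen ..."]))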
For input into ChatGPT, we create classifications labeled as (a)–(v) by further
refining the classifications obtained using the table arrangement technology
manually. We retrieve the corresponding content for these classifications using
ChatGPT:

Please pull out the sentences from the following text that apply to the following categories. If
none apply, please state N/A. # Classification (a) Nikkei 225, (a2) Change in Nikkei 225, (b) Dow
Jones Industrial Average, (c) Foreign Exchange, (d) Market Information, (e) World Affairs, (f)
Overseas Stock Indices, (g) Related to “since,” (h) Drop, (i) Tokyo Stock Exchange, (j)
International Information, (k) Noted, (l) Yields, (m) Monetary Easing, (n) Earnings, (o) Start
trading, (p) Related, (q) Trade, (r) Europe, (s) Caution, (t) Trump administration, (u) New
Corona, and (v) Foreign investors # text The Nikkei Stock Average temporarily dropped more
than 330 yen on the Tokyo Stock Exchange on August 28, as concerns over deteriorating
European finances spread following the downgrade of Greek and Portuguese government
bonds by the U.S. rating agency Standard & Poor’s. The previous day, European and U.S. stocks
also plunged, and there is a sense of caution in the market that a global stock market decline
originating in Europe is likely to proceed. The announcement of the credit rating downgrade
and the plunge in European stocks on May 27 prompted the 16 eurozone countries of the
European Union to hold a summit meeting on May 10 to decide on an emergency loan of 30
billion euros (approximately 3.7 trillion yen) to Greece. However, the news of the summit did
not dispel the credit concerns of European countries. The Dow Jones Industrial Average fell
sharply for the first time in seven business days on the New York Stock Exchange on May 27,
falling $213.04 from the previous day’s close to 19,991.99, the biggest drop since February 4
(down $268.37). In the foreign exchange market, the euro sold off sharply against the dollar
and the yen, temporarily falling to 1.3166 dollars per euro, the lowest level in approximately a
year since late April last year. The yen also strengthened against the dollar, temporarily hitting
92.81 yen to the dollar for the first time in three business days. In the Tokyo Stock Exchange on
August 28, selling was also prevalent right from the start of trading. The Nikkei Stock Average
fell sharply, falling below the 10,900-yen level for the first time in three business days during
trading hours. The was down 20.04 points to 977.59. However, as the Shanghai Composite
Index temporarily rose in value on the morning of the 28th on the Shanghai Stock Exchange in
China, some analysts believed that “the tone of the economic recovery has not changed, and it
is not necessarily true that the market will continue to fall” (a major securities firm). The weak
stock market has led to an inflow of funds into the bond market. The yield on newly issued 10-
year bonds temporarily fell to 1.28%, the lowest level in four months since December 30, 2008.
(article author)
When the above was provided as input to ChatGPT (GPT-4), the following output
was obtained. “Please continue.” was also provided as input.

We extracted the statements that apply to each category as follows:


(a) Nikkei Stock Average
“The Nikkei Stock Average temporarily dropped more than 330 yen on the Tokyo
Stock Exchange on August 28.“
“The Nikkei Stock Average fell sharply, falling below the 10,900-yen level for the
first time in three business days during trading hours.”
(a2) Change in Nikkei 225
“The Nikkei Stock Average temporarily dropped more than 330 yen on the Tokyo
Stock Exchange on August 28.“
“The Nikkei Stock Average fell sharply, falling below the 10,900-yen level for the
first time in three business days during trading hours.”
(b) Dow Jones Industrial Average
“The Dow Jones Industrial Average fell sharply for the first time in seven business
days on the New York Stock Exchange on May 27, falling $213.04 from the
previous day’s close to 19,991.99, the biggest drop since February 4 (down
$268.37).”
(c) Foreign Exchange
“In the foreign exchange market, the euro sold off sharply against the dollar and
the yen, temporarily falling to 1.3166 dollars per euro, the lowest level in about a
year since late April last year. The yen also strengthened against the dollar,
temporarily hitting 92.81 yen to the dollar for the first time in three business
days.”
(d) Market Information
“The Nikkei Stock Average temporarily dropped more than 330 yen on the Tokyo
Stock Exchange on August 28.”
“The Dow Jones Industrial Average fell sharply for the first time in seven business
days on the New York Stock Exchange on May 27, falling $213.04 from the
previous day’s close to 19,991.99, the biggest drop since February 4 (down
$268.37).”
“In the Tokyo Stock Exchange on August 28, selling was also prevalent right from
the start of trading.”
“However, as the Shanghai Composite Index temporarily rose in value on the
morning of the 28th on the Shanghai Stock Exchange in China, some analysts
believed that “the tone of the economic recovery has not changed, and it is not
necessarily true that the market will continue to fall” (a major securities firm).”
(e) World Affairs
“The Nikkei Stock Average temporarily dropped more than 330 yen on the Tokyo
Stock Exchange on August 28, as concerns over deteriorating European finances
spread.”
(f) Overseas Stock Indices
“However, as the Shanghai Composite Index temporarily rose in value on the
morning of the 28th on the Shanghai Stock Exchange in China, some analysts
believed that “the tone of the economic recovery has not changed, and it is not
necessarily true that the market will continue to fall” (a major securities firm).”
(Omitted hereafter.)

The experiment used the 21 articles mentioned in Section 8.2.1, and the same
procedure was applied to each of them. By organizing the results into a table, a
table similar to that presented in Section 8.2.1 was created, demonstrating
improved performance (→Table 8.2).

Table 8.2: Portion of the generated table using the fusion method (columns: Article title, Nikkei, Dow Jones, Exchange, Market, …).

Article title: Foreign exchange/stocks: The Tokyo Stock Exchange is about to break the 8,000 yen barrier, temporarily falling 658 yen to its lowest level since the year’s commencement.
Nikkei: The Nikkei Stock Average temporarily fell briefly and plunged rapidly, losing more than 650 yen from the previous day’s closing, and during trading hours, it reached a new low for the year (8,115.41 yen), the lowest level in five years and five months since May 2003. …
Dow Jones: The Dow Jones Industrial Average dropped more than $500 on the New York market on the previous day.
Exchange: Due to economic instability in Europe and the United States, the dollar and the euro were sold off on the foreign currency market, while the yen significantly gained in value, raising fears about the declining performance of Japanese enterprises heavily dependent on exports.
Market: The Tokyo stock market was down across the board on the 23rd due to the sharp decline in the US and European markets the previous day and the sharp appreciation of the yen. …

Article title: Greece: Fiscal crisis plummeting stock prices in Japan, the United States, and Europe decision to support, adjustment on the 10th of next month.
Nikkei: The Nikkei Stock Average temporarily dropped more than 330 yen on the Tokyo Stock Exchange on August 28. The Nikkei Stock Average fell sharply, falling below the 10,900-yen level for the first time in three business days during trading hours.
Dow Jones: The Dow Jones Industrial Average fell sharply for the first time in seven business days on the New York Stock Exchange on May 27, falling $213.04 from the previous day’s close to 19,991.99, the biggest drop since February 4 (down $268.37).
Exchange: In the foreign exchange market, the euro sold off sharply against the dollar and the yen, temporarily falling to 1.3166 dollars per euro, the lowest level in approximately a year since late April last year. The yen also strengthened against the dollar, temporarily hitting 92.81 yen to the dollar for the first time in 3 business days.
Market: … However, as the Shanghai Composite Index temporarily rose in value on the morning of the 28th on the Shanghai Stock Exchange in China, some analysts believed that “the tone of the economic recovery has not changed, and it is not necessarily true that the market will continue to fall” (a major securities firm).

… (further rows and columns omitted)
In this section, we evaluate the performance of the extracted content from the
first 10 articles using both the table arrangement technique described in Section
8.2.1 and the fusion method introduced in this section. The table constructed in
Section 8.2.1 primarily captured information from (b) to (i). Consequently, we
evaluated the performance solely based on the information within (b)–(i).
The evaluation results are presented in →Table 8.3. In this context, we
treat the correct information obtained through either the table arrangement
technique or the fusion method as the reference. If an input document contains
no correct information for an item, the case is considered correct when no
information is obtained or when the item is extracted as N/A. Furthermore, if
the obtained information is partially correct, despite missing or additional
information, it is still deemed correct.

Table 8.3: Accuracy of content extraction.


Method Accuracy
Table arrangement technique 0.73 (58/80)
Fusion 0.88 (70/80)

Additional experiments were conducted to compare the results obtained using
the table arrangement technique and the fusion method. The results of this
comparison are presented in →Table 8.4, which illustrates the number of cases
where one extraction method outperformed the other.

Table 8.4: Pairwise comparison of content extraction.


Comparison of methods Number of cases
Table arrangement technique > Fusion 16
Table arrangement technique < Fusion 49
Table arrangement technique = Fusion 15

The fusion method demonstrated a higher accuracy rate (0.88) than the table
arrangement technique (0.73). Pairwise comparisons revealed that the fusion
method outperformed the table arrangement technique in 49 of the 80 items,
whereas the table arrangement technique performed better in 16 items
(→Table 8.4).
In →Table 8.3, only 10 outputs from the fusion method were erroneous. The
10 outputs are in the categories of market information, world affairs, and the
Tokyo Stock Exchange.
In two cases for the category of market information, the output of the fusion
method was “the entire sentence is about market information.” The entire
article contains information about the market, and in a sense, the output of the
fusion method is correct. However, the output is judged to be erroneous
because it is too broad and ambiguous.
There were three instances of mistakes in the classification of the Tokyo
Stock Exchange category. This study was conducted in Japanese, and the
category was given to ChatGPT as “Tosho” (Tokyo Stock Exchange). The intended
content for this category was the Tokyo Stock Exchange Stock Price Index
(TOPIX), but ChatGPT mistakenly outputted the Nikkei Stock Average when only
“Tosho” was used. Changing the input to ChatGPT from “Tosho” to the Tokyo Stock
Exchange Stock Price Index (TOPIX) may improve the situation. Improving the
prompts for ChatGPT could thus improve the performance of the fusion method.
The table arrangement technique outperformed the fusion method in some
cases; excluding the two erroneous cases mentioned earlier, six of these cases
were deemed correct even with the fusion method. In most of these cases, the
fusion method also extracted information but obtained less information than the
table arrangement technique. The fusion method occasionally settled for partial
information instead of striving for comprehensive extraction. This aspect can be
improved by refining the prompts in future studies.
The fusion method demonstrated the capability to extract information and
generate tables with many columns, exhibiting suitable performance. It offers
the advantage of using ChatGPT to retrieve specific information when the
desired columns are known. In contrast, if the desired information is unknown,
the table arrangement technique can be used to identify the relevant content,
followed by the use of ChatGPT for retrieval, as demonstrated in the fusion
method described in this study.
It may be possible to use ChatGPT to find the desired columns. For example,
we can input the original documents into ChatGPT and ask ChatGPT what the
desired columns are. However, this is difficult because ChatGPT is currently
unable to handle large input documents. Hence, the fusion method is relatively
convenient in practice.
In the fusion method, ChatGPT processes one article at a time, obtaining
information for each article. Although multiple iterations of ChatGPT are
required for multiple articles, this approach allows for smaller inputs per
iteration, enabling efficient processing by ChatGPT.
In previous studies, we suggested the use of question-answering processing
to retrieve information for each cell in a table, completing information extraction
for all cells [→11]. (We have also highlighted that it would be good to use
question-and-answer processing to find the desired columns (the content items
we should retrieve from documents) [→11].) The current study can be thought
of as leveraging ChatGPT to accomplish this, as ChatGPT improves the
performance of question-answering processing.

8.5 Simple analysis case using obtained categories and ChatGPT


Section 8.2 provides items useful for analysis. A brief overview of knowledge
acquisition using this classification is provided. We asked ChatGPT (GPT-4) to
analyze what would happen to the Nikkei Stock Average when these items occur.
The following was inputted into ChatGPT:

Organize a table showing when the Nikkei Stock Average rises or falls based on the following:
(a) Nikkei 225, (a2) Change in Nikkei 225, (b) Dow Jones Industrial Average, (c) Foreign
Exchange, (d) Market Information, (e) World Affairs, (f) Overseas Stock Indices, (g) Related to
“since,” (h) Drop, (i) Tokyo Stock Exchange, (j) International Information, (k) Noted, (l) Yields,
(m) Monetary Easing, (n) Earnings, (o) Start trading, (p) Related, (q) Trade, (r) Europe, (s)
Caution, (t) Trump administration, (u) New Corona, and (v) Foreign investors.

ChatGPT outputs →Table 8.5.


Table 8.5: Knowledge acquired using obtained categories and ChatGPT.
Factors | Possible upside | Possible downside
(a) Nikkei 225 | – | –
(a2) Change in Nikkei 225 | – | –
(b) Dow Jones industrial average | Dow Jones up | Dow Jones down
(c) Foreign exchange | – | –
(d) Market information | Favorable news in | Unfavorable news out
(e) World affairs | Stable | Tense
(f) Overseas Stock Indices | Indexes of other countries rise | Indexes of other countries fall
(g) Related to “since” | Unknown | Unknown
(h) Drop | Limited | Expansion
(i) Tokyo Stock Exchange | Trading | Slump
(j) International information | Favorable news coming in | Unfavorable news coming in
(k) Noted | Limited, temporary | Major problems
(l) Yields | High | Low
(m) Monetary easing | Easing measures implemented | Easing measures withdrawn
(n) Earnings | Good business performance | Bad business performance
(o) Start trading | Newly listed companies popular | Concerns about newly listed companies
(p) Related | Highly interlinked markets performing well | Highly interlinked markets underperforming
(q) Trade | Exports up | Exports down
(r) Europe | European economy stable | European economy unstable
(s) Caution | Unknown | Market participants becoming more vigilant
(t) Trump administration | Policies favorable to business | Policies unfavorable to business
(u) New Corona | Signs of convergence | Expansion, flare-up
(v) Foreign investors | Inflows | Outflows

This table shows how each item would cause the Nikkei 225 to rise or fall.

To gain further insight, the following was entered into ChatGPT: “Devise a number of factors other
than those shown in the table above. In the table, please list in which cases the Nikkei Stock
Average rises or falls according to the new factors you have invented.”

ChatGPT outputs →Table 8.6.


Table 8.6: Additional knowledge acquired using obtained categories and ChatGPT.
Factors | Likelihood of an increase | Likelihood of a decrease
(w) Central bank interest rate policy | Interest rate cuts | Interest rate hikes
(x) Domestic politics | Political stability | Political instability
(y) Weather and natural disasters | Less natural disasters | Major natural disasters
(z) Technological progress | Favorable impact of technological innovation on the economy | Negative impact on existing industries
(aa) Consumer confidence | Consumer confidence rises | Consumer confidence falls
(bb) Business capital expenditure | Increase in capital expenditure | Decrease in capital expenditure
(cc) Oil prices | Oil prices stable | Oil prices rise/slide
(dd) Inflation/deflation | Mild inflation | High inflation/deflation
(ee) Economic growth rate of other competing countries | Economic growth of competing countries slows | Economic growth of competing countries rapid
(ff) Overseas expansion of domestic companies | Many successful cases of overseas expansion | Many failed cases of overseas expansion
(gg) Regulatory and legal system | Reforms favorable to business | Introduction of strict regulations
(hh) Employment situation | Increase in employment and low unemployment | Decrease in employment and high unemployment
(ii) Consumption tax rate | Tax rate unchanged/reduced | Consumption tax increased
(jj) Corporate dividends | Dividend increase | Dividend decrease

More information was obtained, which was useful in forecasting the Nikkei 225.
Entering the following into ChatGPT yielded →Table 8.7: “In the dollar-yen
exchange rate, organize the table to show when it would be the case that the
Nikkei Stock Average would rise or fall.” By asking ChatGPT questions for each
item, more detailed findings for each can be obtained.
Table 8.7: Additional knowledge acquired using obtained categories and ChatGPT.

Strong dollar/weak yen
Reasons for the rise in the Nikkei Stock Average: 1. Expectations are that export companies’ overseas sales will increase in yen and their business performance will improve. 2. Lower investment costs for foreign investors in Japanese stocks. 3. Prices of raw materials in yen are stable or declining, except for companies that import resources and energy.
Reasons for the fall in the Nikkei Stock Average: 1. Increased costs for companies that import large amounts of energy and resources. 2. When the excessive depreciation of the yen raises concerns about the stability of the economy as a whole.

Weak dollar/strong yen
Reasons for the rise in the Nikkei Stock Average: 1. As import costs fall, companies that rely on imports are expected to improve their performance. 2. When deflationary pressure is eased.
Reasons for the fall in the Nikkei Stock Average: 1. Exporting companies’ overseas sales are expected to decrease in yen terms, leading to a deterioration in business performance. 2. Investment costs for foreign investors in Japanese stocks are rising. 3. There are concerns that the excessive appreciation of the yen will impede economic growth.

8.6 Conclusions
This study logically predicted stock prices by analyzing newspaper sentences
related to stock prices using table arrangement techniques and ChatGPT.
In Section 8.2, we used table arrangement techniques to analyze the
Mainichi Shimbun data and extract similar articles on the Nikkei Stock Average.
These articles were subsequently organized into a table using table
arrangement technology, with groups of articles arranged in rows and clusters
of similar sentences in columns. By examining the information in the columns,
we investigated the type of information in the newspaper articles. We identified
22 content items related to various aspects, namely, (a) the Dow Jones industrial
average, (b) Exchange, (c) Market information 1, (d) Market information 2, (e)
World affairs, (f) Overseas stock indices, (g) Related to “since,” (h) Drop, (i) Tokyo
Stock Exchange, (j) International information, (k) Findings, (l) Yield, (m) Monetary
easing, (n) Performance, (o) Commencement of trading, (p) Related, (q) Trade,
(r) Europe, (s) Vigilance, (t) Trump administration, (u) New Corona, and (v)
Foreign investors in the target newspapers. Furthermore, content associated
with special events such as the “Trump administration” and “New Corona” was
also obtained.
In Section 8.3, we conducted analysis using ChatGPT and discovered its
capability to extract information from newspapers and perform basic
investigations, including organizing information into a table. Based on the
results obtained from ChatGPT, we discovered the following. The Nikkei Stock
Average rises because of factors such as improved US trade relations, a strong
domestic economy, central bank policies, technology sector performance,
foreign investments, corporate earnings, events such as the Olympics, a stable
yen, and a healthy global economy. Conversely, declines may result from global
stock crashes, economic bubbles, issues with key trade partners, industry
imbalances, financial crises, natural disasters, trade tensions, pandemics, and
energy price shifts.
In Section 8.4, we proposed a fusion method for constructing a large table
with high performance by integrating table arrangement techniques and
ChatGPT. The fusion method involves identifying relevant content using table
arrangement techniques and ChatGPT. Specifically, the table arrangement
technique identifies the matters to be retrieved, retrieves them from articles
using ChatGPT, and organizes them into a table. Our experiments confirmed the
effectiveness of the fusion method, achieving an accuracy rate of 0.88 with a
lenient evaluation criterion that allows for certain additional extractions and
minor missing information.
In Section 8.5, the items extracted in Section 8.2, which appear in many
articles on the Nikkei Stock Average, are organized into a table that shows how
the Nikkei Stock Average rises or falls when these items occur. In addition, the
table shows in more detail how the Nikkei Stock Average rises or falls when the
exchange rate changes. These analyses are also useful when examining the
movement of the Nikkei Stock Average.
In future studies, we will develop and refine the proposed methods to
improve stock price prediction accuracy.

References
[1] J. Bollen, H. Mao, and X.-J. Zeng, Twitter mood predicts the stock market.
Journal of Computational Science, vol. 2, no. 1, pp. 1–8, 2011. →
[2] H. Lee, M. Surdeanu, B. MacCartney, and D. Jurafsky, On the importance of
text analysis for stock price prediction. In: Proceedings of Ninth International
Conference on Language Resources and Evaluation (LREC 2014), pp. 1170–1175,
2014. →
[3] A. K. Nassirtoussi, S. Aghabozorgi, T. Y. Wah, and D. C. L. Ngo, Text mining for
market prediction: A systematic review. Expert Systems with Applications, vol. 41,
no. 16, pp. 7653–7670, 2014. →
[4] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, Deep learning for stock
prediction using numerical and textual information. In: 2016 IEEE/ACIS 15th
International Conference on Computer and Information Science (ICIS), pp. 1–6,
2016. →
[5] M. Murata, Y. Nakahara, and Q. Ma, Stock price prediction and knowledge
acquisition through machine learning and language processing. 26th Annual
Meeting of the Association for Natural Language Processing, pp. 657–660, 2020,
(in Japanese). →
[6] M. Murata, Content analysis of items in newspaper data using table
arrangement technology and ChatGPT for stock price prediction. The 22nd
International Conference on Information & Knowledge Engineering on CSCE
2023, pp. 1–8, 2023. a, b, c, d
[7] M. Murata, K. Okazaki, and Q. Ma, Improved method for organizing
information contained in multiple documents into a table. Journal of Natural
Language Processing, vol. 28, no. 3, pp. 802–823, 2021. a, b, c, d
[8] X. Wu, J. Zhang, and H. Li, Text-to-table: A new way of information extraction.
In: Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics, pp. 2518–2533, May 2022 →
[9] OpenAI, ChatGPT. →https://openai.com/blog/chatgpt/, 2022. →
[10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang,
S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens,
A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, Training language
models to follow instructions with human feedback. arXiv:2203.02155, 2022. →
[11] M. Murata, M. Uchiyama, and H. Isahara, Information extraction using a
question and answer system. Workshop on Information Extraction for the 6th
Annual Meeting of the Association for Natural Language Processing, pp. 33–40,
2000, (in Japanese). a, b
9 An experimental study on road surface
classification

Addisson Salazar
Gonzalo Safont
Luis Vergara
Alberto Gonzalez

Acknowledgment: This work was supported in part by Generalitat Valenciana under Grant CIPROM/2022/20.

Abstract
This chapter presents an experimental study that explores the discrimination
capabilities of a set of features for road surface classification. The features are
extracted from signals collected by a number of sensors installed in a vehicle
traveling on three different surfaces: smooth flat asphalt, stripes, and
cobblestones. The features consist of time-, frequency-, and statistic-domain
parameters that represent the behavior of the signals under different driving
conditions. The accuracy on road classification from those features is a
determining factor in the quality of automatic power-assisted steering systems.
The experiments implemented several driving configurations, i.e., hands on or
off the wheel and constant or accelerated vehicle speed. In addition, several
setups of the classification procedure were tested varying the classifier method
(random forest, support vector machine, linear discriminant analysis, and a
decision fusion method) and the number of features (selected features by
ranking and reducing the number of features using principal component
analysis). The results show high accuracy of the proposed classification system
for road surface classification.
Keywords: road surface classification, machine learning, feature selection,
feature ranking, decision fusion, classification

9.1 Introduction
Currently, machine learning methods are being increasingly studied to solve
complex problems in several fields of application. One of these applications is
the automatic road surface classification for developing autonomous or
semiautonomous car technologies incorporated in power-assisted steering
systems. Thus, the machine learning problem consists of classifying the data from several
sensors installed in the car to determine the surface on which the car is
traveling. Existing works on road surface classification can be broadly classified
into three partially overlapping groups depending on the categories of data
(classes) to be classified: (i) estimation of road roughness profiles, (ii) detection
of weather conditions hazardous for driving, and (iii) road type detection. These
three groups typically have diverging goals. Road roughness profiles are
typically studied to obtain cost-effective solutions for supervising and planning
road maintenance and weather conditions are typically studied to increase the
safety of the driver and passengers, while road type detection studies have goals
that depend on the considered classes [→1, →2, →3, →4, →5, →6, →7, →8, →9,
→10, →11, →12, →13].
In this chapter, a road surface identification system is proposed where the
information from sensors for measuring audio, acceleration, speed, and steering
wheel signals is collected and processed. The road surface categories to be
identified are the following: smooth flat asphalt, stripes, and cobblestones. From
the measured signals, several time-, frequency-, and statistic-domain features
that represent the behavior of the signals under different driving conditions are
extracted. The accuracy on road classification from these features is a
determining factor in the quality of automatic power-assisted steering systems.
The experiments implemented several car driving configurations, i.e., hands on
or off the wheel and constant or accelerated vehicle speed.
The experiments also considered several setups of the classification
procedure, implementing the following single classifiers: random forest (RDF),
support vector machine (SVM), and linear discriminant analysis (LDA). Separated
score integration (SSI) was used as a decision fusion method to combine the
results of the single classifiers improving both accuracy and stability of the
results. In addition, considering that the number of the features is high, several
methods to reduce this number were applied including principal component
analysis (PCA), feature ranking, and ReliefF. This avoids overfitting of the
estimated model to the data and provides noise filtering. The results of the
experiments show high accuracy of the proposed classification system for road
surface classification.
The rest of the chapter is organized as follows. Section 9.2 describes the road
surface identification system implemented, including the definitions of the
features extracted from the sensors. Section 9.3 provides a short review of
methods for reduction of the dimension of the data observation vector. Section
9.4 describes the classification and post-processing stages. Section 9.5 is
dedicated to the experimental setup and optimization of the system, including
choosing the optimal window size and optimal dimensionality reduction. Section
9.6 contains the results, and Section 9.7 comprises the conclusions of this work.

9.2 Road surface identification system


The proposed surface identification system is shown in →Figure 9.1. Features are
extracted separately from the four considered sensor types (three microphones,
three accelerometers, two speed sensors, and two steering wheel sensors),
joined together, and filtered using feature selection as preprocessing before
classification. The output of this classifier could then be used as feedback for the
car, controlling the suspension and other car systems in order to improve the
safety of the driver.

Figure 9.1: Diagram of the proposed road surface identification system.

9.2.1 Feature extraction


Prior to classification, the data were centered and scaled to ensure they ranged
within [−1, 1] . After that, we performed feature extraction. Given that most of
the channels are audio channels or vibration channels, our tests considered
several spectral features that are commonly used for audio classification [→14]:
Average power across all frequency bands.
Centroid frequency.
Maximum frequency.
Spectral contrast: ratio between the minimum and maximum spectral
values in each octave. As in most applications on audio processing, the
octaves are referenced to 440 Hz. Given the sampling frequency, we had 10
octaves. Therefore, we computed 10 values per channel.
Spectral slope: first-order polynomial trend of the power spectrum,
assuming that the spectrum follows a power law of the frequency. We
compute one value per channel.
Spectral flatness: measure of the variability of the spectrum in a given
band, obtained as the ratio of the geometric and arithmetic means of the
power spectrum. As per the MPEG7 standard, we considered bands of one
quarter of an octave. Therefore, we computed 40 values per channel.
We also considered the following high-order statistics:
third-order autocorrelation
time reversibility
This resulted in a total of 56 features extracted from each of the 10 available
channels, thus resulting in a total of 560 features for classification. Each signal
was divided into a number (N) of epochs of length Δ. For each of the epochs, the
features were estimated using the definitions in →Table 9.1. Thus, a matrix of
N × 560 was formed as input for the dimensionality reduction stage. The rows in
that matrix are called data observation vectors.

Table 9.1: Features extracted from data (x) windowed in epochs of length Δ.

Third-order autocorrelation: $\frac{1}{\Delta-2}\sum_{n=3}^{\Delta} x(n)\,x(n-1)\,x(n-2)$

Time reversibility: $\left(\frac{1}{\Delta}\sum_{n=1}^{\Delta} x^{2}(n)\right)^{-3/2}\frac{1}{\Delta-1}\sum_{n=2}^{\Delta}\left(x(n)-x(n-1)\right)^{3}$

Average power: $\frac{1}{\Delta}\sum_{n=1}^{\Delta} x^{2}(n)$

Centroid frequency: $\frac{f_{s}}{\Delta}\,\frac{\sum_{f=1}^{\Delta} f\,|X(f)|^{2}}{\sum_{f=1}^{\Delta} |X(f)|^{2}}$, where $X(f)$ is the direct Fourier transform of x within the epoch taken at Δ points, and $f_{s}$ is the sampling rate.

Maximum frequency: $\frac{f_{s}}{\Delta}\left(\operatorname{argmax}_{f} |X(f)|\right)$

Spectral contrast: $\frac{\max_{f} |X(f)|}{\min_{f} |X(f)|}$, $f_{o1} \le f \le f_{o2}$, where $f_{o1}$ and $f_{o2}$ are, respectively, the start and end indices of the oth octave, taking 440 Hz as the reference (440 Hz is the end limit of the 4th octave) [→14]. Ten octaves were considered.

Spectral slope: trend $a$ of the model $\log|X(f)| = a\,\log f + b$

Spectral flatness: $\frac{\left(\prod_{f=f_{o1}}^{f_{o2}} |X(f)|\right)^{1/(f_{o2}-f_{o1})}}{\frac{1}{f_{o2}-f_{o1}}\sum_{f=f_{o1}}^{f_{o2}} |X(f)|}$, where $f_{o1}$ and $f_{o2}$ are, respectively, the start and end indices of the oth quarter of an octave [→14].
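As an illustration of these definitions, the following NumPy sketch computes a few of the features in →Table 9.1 for one epoch. The one-sided FFT, the zero-based frequency-bin indexing, and the function name are simplifying assumptions rather than the exact implementation used in the chapter; the normalization constants follow the definitions as reconstructed in →Table 9.1.

import numpy as np

def epoch_features(x, fs):
    # Compute a subset of the Table 9.1 features for one epoch x of length D.
    x = np.asarray(x, dtype=float)
    D = len(x)
    # Third-order autocorrelation: (1/(D-2)) * sum_{n=3}^{D} x(n) x(n-1) x(n-2)
    third_acorr = np.sum(x[2:] * x[1:-1] * x[:-2]) / (D - 2)
    # Time reversibility: (mean(x^2))^(-3/2) * (1/(D-1)) * sum (x(n) - x(n-1))^3
    time_rev = np.mean(x ** 2) ** (-1.5) * np.sum(np.diff(x) ** 3) / (D - 1)
    # Average power over the epoch
    avg_power = np.mean(x ** 2)
    # Magnitude spectrum (one-sided FFT is an assumption; bins indexed from 0)
    X = np.abs(np.fft.rfft(x))
    f = np.arange(len(X))
    # Centroid and maximum frequency, converted to Hz with fs/D
    centroid = (fs / D) * np.sum(f * X ** 2) / np.sum(X ** 2)
    max_freq = (fs / D) * np.argmax(X)
    return {"third_order_autocorrelation": third_acorr,
            "time_reversibility": time_rev,
            "average_power": avg_power,
            "centroid_frequency_hz": centroid,
            "maximum_frequency_hz": max_freq}

rng = np.random.default_rng(0)
print(epoch_features(rng.standard_normal(48000), fs=48000))  # one 1-s epoch at 48 kHz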
9.3 A short review of methods for observation vector dimension
reduction
Considering the high dimension (560) of the data observation vectors,
dimensionality reduction was performed as a preprocessing step before
classification. There are two main reasons for feature selection. Firstly, we would
like to reduce the number of features for computational reasons: faster
evaluation times, lower memory consumption, lower implementation cost, and
so on. Secondly, reducing the number of features might improve performance
or reduce variability.
Dimensionality reduction methods are typically classified in two categories:
feature extraction, where new features are derived from the original ones; and
feature selection, where one or more of the original features are selected and
the rest are discarded [→15]. In turn, feature selection is typically approached in
one of two ways: ranking features according to some criterion and selecting the
top Q features (feature ranking); or selecting a subset of features that keep or
improve classification performance (subset selection) [→15]. Subset selection
algorithms can automatically determine the number of selected features, while
feature ranking algorithms need to rely on a user-determined threshold (or
equivalent method) to set the number of selected features.

9.3.1 Principal component analysis (PCA)


PCA is a linear transformation of the data that yields mutually uncorrelated
transformed components. The original features are projected onto an
orthogonal space where the projections (components) are sorted by the amount
of variance of the original data they explain. In many applications, the first few
components contain most of the original variance, thus the rest can be
discarded to reduce the dimensionality of the problem. PCA has been heavily
used as a feature extraction method and to investigate data structure [→16].
However, unlike feature selection methods, PCA requires all the original features
to compute the projected components. The consideration of more advanced
feature extraction methods such as stochastic PCA or independent component
analysis [→17, →18, →19] will not be approached in this chapter.
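A minimal scikit-learn sketch of this step is shown below; the synthetic data and the choice of Q are placeholders, and only the fit-on-training / project-test pattern reflects the procedure described above.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the N x 560 matrices of extracted feature vectors.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 560))
X_test = rng.standard_normal((100, 560))

Q = 9                                       # number of retained components (placeholder)
pca = PCA(n_components=Q)
Z_train = pca.fit_transform(X_train)        # learn the projection on the training data only
Z_test = pca.transform(X_test)              # project the test data with the same basis
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept by the Q components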

9.3.2 Sequential feature selection (SFS) and feature ranking


SFS is a feature subset selection method that performs a greedy search of the
optimal subset of features [→19]. There are two main types of SFS: forward,
where the subset of selected features is iteratively grown from the empty set;
and backward, where the subset of features is iteratively reduced from the full
set. Basically, at each iteration, the method determines the effect of adding one
of the unselected features to the subset (forward SFS) or removing one of the
selected features from the subset (backward SFS) using a wrapped classification
method. For each iteration, forward SFS adds the feature that would increase
performance the most if added to the current subset. Conversely, backward SFS
removes the feature that would increase performance the most if removed from
the current subset. In both cases, iterations continue until performance cannot
be improved any more. In a trained system, discarded features can be omitted
from the feature extraction stage, thus lightening the load of the system. In this
work, we considered forward SFS, with each of the classifiers as the wrapped
method.
Feature ranking methods assign a score or grade to each feature and then
rank them by score [→19]. This ranked list can then be used to perform feature
selection, e.g., by choosing the Q best-ranked features. As with feature subset
selection, the features that are not selected are never used and, in a trained
system, could be omitted from the feature extraction stage. One of the most
commonly used feature ranking methods is ReliefF. This method considers the
interactions between features, returning a score that can be used later for
feature ranking [→20]. The feature scores computed by ReliefF are based on the
distances between nearest neighbors. Essentially, the distances between nearest
neighbors of the same class decrease the score of the feature, and the distances
between nearest neighbors of different classes increase the score of the feature.
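The following is a minimal sketch of forward SFS with a wrapped classifier, using cross-validated accuracy as the selection criterion; the wrapped classifier (LDA), the stopping rule, the maximum number of features, and the synthetic data are illustrative assumptions rather than the exact configuration used in the chapter.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_sfs(X, y, max_features=10, cv=5):
    # Greedy forward selection: at each step add the feature that most improves
    # the cross-validated accuracy of the wrapped classifier; stop when no
    # candidate improves the current subset.
    selected, remaining = [], list(range(X.shape[1]))
    best = -np.inf
    while remaining and len(selected) < max_features:
        trials = []
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(LinearDiscriminantAnalysis(), X[:, cols], y, cv=cv).mean()
            trials.append((score, f))
        score, f = max(trials)
        if score <= best:
            break
        best = score
        selected.append(f)
        remaining.remove(f)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 20))
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)   # labels driven by features 3 and 7
print(forward_sfs(X, y))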

9.3.3 Feature ranking method applied


In this work, we propose to apply a feature ranking method based on simple
classifiers. The method is robust, quick to compute, and does not require a lot of
memory. For binary classification problems, the score of the i-th feature is
computed as the informedness [→21] of a simple classifier fit to the feature:

informedness = specificity + recall − 1

The simple one-variable classifier used for feature ranking is shown in →Table
9.2. The algorithm maximizes the informedness of the result when splitting the
values using two thresholds; any value x is assigned to class 1 if x_1 < x ≤ x_2
and to class 0 otherwise. The search for the pair of values that optimize the
informedness of the classification is done in only one pass through the set of
values. This method is fast to compute and only considers the order of the
values of the input variable, while disregarding the actual values. Thus, the
method is robust with respect to isolated outliers, extreme values, and any data
transformation that does not affect the order of the values of the features, such
as: centering, scaling, exponentiation (when negative values of x are possible,
only for odd powers), and logarithmic transformation.

Table 9.2: Feature ranking method based on informedness.

0. Given input feature values $x_n$ and binary labels $y_n$ ($y_n = 1$ if n belongs to class j, and 0 otherwise), $n = 1, \ldots, N$.
1. Sort the values $x_n$ in ascending order, then arrange the binary labels $y_n$ in the same order; call the sorted values $x'_n$ and the sorted labels $y'_n$.
2. For $n = 1, \ldots, N$:
3. Compute the probability of detection $P_d$ and the probability of false alarm $P_f$ for values up to $x'_n$: $P_d(x \le x'_n) = \frac{\sum_{i=1}^{n} y'_i}{\sum_{i=1}^{N} y'_i}$ and $P_f(x \le x'_n) = \frac{\sum_{i=1}^{n} (1-y'_i)}{\sum_{i=1}^{N} (1-y'_i)}$.
4. Compute informedness as $I(x \le x'_n) = P_d(x \le x'_n) - P_f(x \le x'_n)$.
5. Find the indices of the maximum and minimum values of informedness, $n_{max} = \operatorname{argmax}_n I(x \le x'_n)$ and $n_{min} = \operatorname{argmin}_n I(x \le x'_n)$.
6. The sub-score of the feature x for class j is the informedness within the range $(x'_{n_{min}}, x'_{n_{max}}]$: $I(x'_{n_{min}} < x \le x'_{n_{max}}) = I(x \le x'_{n_{max}}) - I(x \le x'_{n_{min}})$.

For multiclass problems such as the one shown in this work, the problem with K
classes is first divided into K binary 1-vs.-all problems. Then, the score of the
i-th feature is obtained as the average of its scores for each of the K binary
problems.
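A compact NumPy sketch of this ranking procedure (the binary sub-score of →Table 9.2 plus the 1-vs.-all averaging) is given below; the function names and the synthetic data are assumptions, and degenerate cases (e.g., a class with no samples) are not handled.

import numpy as np

def informedness_subscore(x, y):
    # One-pass sub-score of a single feature x for binary labels y (0/1),
    # following the steps of Table 9.2.
    order = np.argsort(x, kind="stable")           # step 1: sort feature values
    ys = y[order].astype(float)
    pd_curve = np.cumsum(ys) / ys.sum()            # step 3: P_d(x <= x'_n)
    pf_curve = np.cumsum(1 - ys) / (1 - ys).sum()  #         P_f(x <= x'_n)
    inf_curve = pd_curve - pf_curve                # step 4: informedness
    n_max, n_min = np.argmax(inf_curve), np.argmin(inf_curve)   # step 5
    return inf_curve[n_max] - inf_curve[n_min]     # step 6: sub-score

def rank_features(X, y):
    # Multiclass score: average the 1-vs.-all sub-scores over the classes,
    # then rank the features by decreasing score.
    classes = np.unique(y)
    scores = np.array([np.mean([informedness_subscore(X[:, i], (y == c).astype(int))
                                for c in classes])
                       for i in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))
y = (X[:, 2] > 0).astype(int) + (X[:, 5] > 1).astype(int)  # three classes driven by features 2 and 5
ranking, scores = rank_features(X, y)
print(ranking[:3])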

9.4 Classification and post-processing


We considered the following classifiers: linear discriminant analysis (LDA),
support vector machines with linear kernel (SVM), and random forests with 50
trees (RDF). These classifiers were chosen because of their widespread
application in machine learning problems. Furthermore, while SVM and RDF
might be computationally expensive to train, the trained model is fast to
evaluate, making them appropriate for the task. LDA included a regularization
term equal to γ = 0.01, in order to compensate for the large number of
features.
Aside from the classifiers, we also considered late fusion using separated
score integration (SSI) [→22]. SSI is a method based on alpha integration that
optimally combines the scores from several classifiers into one result
[→23, →24, →25]. The parameters of SSI were obtained by the least mean square
error (LMSE) criterion.
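A minimal scikit-learn sketch of this stage is shown below. The synthetic data are placeholders, LDA shrinkage is used as a stand-in for the regularization term γ, and SSI is replaced by a plain average of the class scores, which only illustrates the late-fusion structure and not the LMSE-trained alpha integration itself.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((300, 30)), rng.integers(0, 3, 300)
X_test = rng.standard_normal((60, 30))

# The three single classifiers considered in the chapter.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.01)   # shrinkage as a proxy for gamma
svm = SVC(kernel="linear", probability=True)                      # linear-kernel SVM with scores
rdf = RandomForestClassifier(n_estimators=50)                     # random forest with 50 trees

scores = []
for clf in (lda, svm, rdf):
    clf.fit(X_train, y_train)
    scores.append(clf.predict_proba(X_test))        # per-class scores of each classifier

fused = np.mean(scores, axis=0)                      # simplified late fusion (stand-in for SSI)
y_pred = fused.argmax(axis=1)
print(y_pred[:10])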
In an actual setting, the system will not be required to classify epochs in
isolation; more likely, the road surface type will be classified continuously,
and the type of surface will change, at most, once every few seconds. The
results of the classification were therefore post-processed to consider time
dependencies in the result: any change in road surface that was sustained for
less than 1 full second was removed and replaced with the previous road
surface. This post-processing produces more stable outputs that consider the
temporal dependencies in the data.
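The following is a minimal sketch of this post-processing rule; the function name and the assumption that consecutive decisions are spaced by a fixed hop (passed here as hop_s) are illustrative choices rather than the exact implementation.

import numpy as np

def remove_short_changes(labels, hop_s, min_duration_s=1.0):
    # Replace any run of identical labels shorter than min_duration_s with the
    # label that preceded it, so a new surface is only accepted once it has
    # persisted for at least one full second.
    labels = np.asarray(labels).copy()
    min_len = int(np.ceil(min_duration_s / hop_s))
    i = 1
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1                                   # j marks the end of the current run
        if labels[i] != labels[i - 1] and (j - i) < min_len:
            labels[i:j] = labels[i - 1]              # run too short: keep the previous surface
        i = j
    return labels

print(remove_short_changes([0, 0, 0, 1, 0, 0, 2, 2, 2, 2], hop_s=0.5))
# the isolated "1" is removed, while the sustained run of "2" is kept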

9.5 Experimental setup and optimization of the system


The data analyzed in this work were obtained using a specially converted
passenger car that was fitted with an accelerometer on the intermediate shaft
and three microphones: two directional microphones (left/right) on the driver’s
head, pointing forward, and one microphone pointing to the upper side of the
electric power steering (EPS) system column. The signals from the car’s EPS
system were also considered, resulting in 10 channels available for processing:
three channels from the accelerometer (acceleration in the X-, Y-, and Z-axes),
three microphone channels, two speed channels, and the torque and position of
the handwheel. All of them were sampled at the same frequency, f_s = 48 kHz.
Three different worm-gear interface configurations were tested, in order to
simulate different noise and vibration conditions on the driver.
The signals were taken on a closed course over three different types of
surfaces: smooth flat asphalt, cobblestones, and stripes. A total of 63
experiments were taken, with an average length of 14.59 s. Each experiment
corresponded to a different configuration of the following properties: worm-
gear interface configuration (three configurations); hands on or off the wheel;
vehicle speed (constant speed of 10, 15, 20, 30 km/h, or accelerating from 0 to
30 km/h); and road surface (smooth flat asphalt, stripes, or cobblestones). In all
cases, the vehicle drove in a straight line during the capture of the data.
One of the captured set of signals of an experiment is shown in →Figure 9.2.
There was a transition from cobblestones to smooth flat asphalt at 6.50 s
approximately. This transition was more noticeable in the microphone channels
and the handwheel channels, which are much noisier for cobblestones than they
are for smooth flat asphalt.
Figure 9.2: Example of the captured signals from four sensors: three
accelerometers (acc.), three microphones (mic.), two speed channels (speed),
and two handwheel signals (wheel).

An example of the results of the feature ranking process described in Section 9.3
is shown in →Figure 9.3, which shows the four best-ranking features (see
Section 9.2.1) extracted from the data displayed in →Figure 9.2.
Figure 9.3: Four best-ranking features extracted from the data shown in
→Figure 9.2. The transition at 6.50 s is marked in all four features.

The experiments described above were used to optimize the capability of
the system to determine the surface the car is traversing from the considered
channels. This optimization was carried out via a series of Monte Carlo
experiments, following the steps shown in →Figure 9.4. Essentially, feature
vectors are extracted from the data and then randomly split into three subsets:
50% of the feature vectors are used for training the classifiers (LDA, SVM, and
RDF); 25% are used to train SSI; and the remaining 25% are used to determine
classification performance. The proportion of samples of each road surface was
chosen to be as similar as possible to the proportions for the full dataset.
Also, to avoid overfitting, feature vectors extracted from the same experiment
were always assigned to the same subset. These experiments are repeated for 100
iterations, and the average and standard deviation of the results are computed.
Figure 9.4: Diagram of the experiments used during the optimization process.
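A minimal sketch of this per-experiment split is given below; the function name is an assumption, and the class-proportion balancing mentioned above is omitted for brevity (only the constraint that all epochs of one experiment share the same subset is enforced).

import numpy as np

def split_by_experiment(exp_ids, fractions=(0.5, 0.25, 0.25), seed=0):
    # Randomly assign whole experiments to the classifier-training, SSI-training,
    # and test subsets, so epochs from one experiment never cross subsets.
    rng = np.random.default_rng(seed)
    uniq = rng.permutation(np.unique(exp_ids))
    n1 = int(round(fractions[0] * len(uniq)))
    n2 = n1 + int(round(fractions[1] * len(uniq)))
    groups = (uniq[:n1], uniq[n1:n2], uniq[n2:])
    return [np.isin(exp_ids, g) for g in groups]    # boolean masks over the epochs

exp_ids = np.repeat(np.arange(63), 10)              # e.g., 63 experiments with 10 epochs each
train_mask, fusion_mask, test_mask = split_by_experiment(exp_ids)
print(train_mask.sum(), fusion_mask.sum(), test_mask.sum())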

9.5.1 Choosing the optimal window size


The first experiment attempted to determine the optimal window size for
classification. A large window size will provide more data for classification, thus
potentially improving classification performance and stability. However, a large
window size would also mean that the system would spend a long time to decide
on the road surface type, thus delaying the EPS correction and the associated
improvement of the driver’s experience and safety. Furthermore, large window
sizes will reduce the number of windows available for training. Therefore, the
optimal window size will be a compromise between classification performance
and response latency.
The selection of the optimal window size was carried out via a series of
Monte Carlo experiments, following the steps shown in →Figure 9.4. In order to
isolate the effect of the window size on the result, for this experiment, no feature
selection was performed (all features were used for all classifiers) and the
output of the classifiers was not post-processed. The window size was changed
from 0.1 to 3 s, and results were then averaged across 100 iterations per value of
window size.
The results of the experiment are shown in →Figure 9.5. The classification
proved to be sensitive with respect to changes in epoch size, with performance
increasing with window size. In all cases, the maximum performance was
reached for the longest window size. The worst overall performance was yielded
by SVM, with a maximum accuracy of 93.16%, and LDA yielded a very similar
result, with a maximum accuracy of 93.92%. SSI yielded an intermediate result of
95.94%, and RDF yielded the best maximum performance, with an accuracy of
96.33%. However, SSI yielded a generally better result than RDF. Given these
results, in the following, a window size of 1.5 s was used as a compromise
between window size and classification performance.
Figure 9.5: Dependence of average performance with respect to epoch size.

9.5.2 Optimal dimensionality reduction


After deciding on the optimal epoch size, the next experiment adjusted the
behavior of the dimensionality reduction stage of the system. As mentioned in
Section 9.3, three dimensionality reduction methods were initially considered:
PCA, ReliefF, and the proposed feature ranking method. An example of the
results of both methods is shown in →Figure 9.6 (PCA) and →Figure 9.7
(proposed feature ranking method); the results for ReliefF were essentially
identical to those of the proposed feature ranking method. →Figure 9.6 shows
the five PCA components with the highest explained variance (total explained
variance of 49.82%), and →Figure 9.7 shows the five highest-ranking features
(out of the original 560). The (i, j) square of the figures displays the scatter plot
of component/feature i against component/feature j , with the diagonal
displaying the histograms of the components/features for each class. The
components obtained by PCA showed a high degree of overlap between the
smooth flat asphalt and cobblestones classes, with only stripes being separable
in some cases (see the combination of component #1 and component #3 in
→Figure 9.6). Conversely, the five best-ranking features showed a low degree of
overlap between classes and, depending on the requirements of the system,
might be enough to perform classification; for instance, the combination of
features 4 and 5 yields very separable features for all three classes.

Figure 9.6: First five components (49.82% explained variance) obtained by PCA
from the features.
Figure 9.7: Five highest-ranked features returned by the considered feature
ranking method.

None of the considered methods estimate the optimal number of components
or features to choose. In order to choose the optimal number of remaining
features for each method and each classifier, we performed a series of Monte
Carlo experiments following the diagram shown in →Figure 9.4. In order to
optimize the effect of the feature selection stage on the result, for this
experiment, the output of the classifiers was not post-processed. The number of
remaining components/features after selection was changed from Q = 2
features to Q = 560 features (all features, i.e., no feature selection). Results
were then averaged across 100 iterations per value of Q .
The results of the experiment are shown in →Figure 9.8, and the optimal
values for each curve are shown in →Table 9.3. SSI was not included in →Figure
9.8 because it is used to combine the optimal scores of LDA, SVM, and RDF,
regardless of the number of features used for those classifiers (as shown in
→Figure 9.4). The effect of the dimensionality reduction stage was complex and
dependent on the classifier and the number of selected features. For PCA, all
methods showed drops in accuracy for very low (Q < 5) or very high
(Q > 400) numbers of selected components. The optimal number of features

was 350 for LDA, 9 for SVM, and 6 for RDF. For ReliefF, improvements in accuracy
could be found for LDA and SVM, but not for RDF. The optimal number of
features was 15 for LDA, 250 for SVM, and 560 (no reduction) for RDF. For the
proposed feature ranking, as for ReliefF, improvements could only be found for
LDA and SVM. The optimal number of features was 10 for LDA, 5 for SVM, and
560 (no reduction) for RDF. PCA yielded the worst overall result, and ReliefF and
the proposed feature ranking methods yielded similar results. On average, the
result of PCA was 1.92% lower than that of ReliefF and 1.71% lower than that of
the proposed feature ranking method.

Table 9.3: Optimal feature selection and performance of the proposed system.
Method | Measure | LDA | SVM | RDF | SSI
PCA | Number of components, Q | 350 | 6 | 9 | n/a
PCA | Accuracy (%) | 93.65 | 93.60 | 92.50 | 93.88
PCA | Standard error (%) | 0.38 | 0.34 | 0.37 | 0.36
ReliefF | Number of features, Q | 15 | 250 | 560 | n/a
ReliefF | Accuracy (%) | 95.22 | 94.46 | 95.53 | 96.10
ReliefF | Standard error (%) | 0.28 | 0.56 | 0.37 | 0.71
Feature ranking | Number of features, Q | 10 | 5 | 560 | n/a
Feature ranking | Accuracy (%) | 94.99 | 93.79 | 95.73 | 95.97
Feature ranking | Standard error (%) | 0.26 | 0.37 | 0.24 | 0.25
Figure 9.8: Dependence of average performance with respect to the number of
features selected using PCA, ReliefF, and the proposed feature ranking method.

The difference in behavior for feature ranking (both ReliefF and the proposed
method) between RDF and the other classifiers was owing to the built-in feature
selection stage in RDF, given that every tree in the forest only considers some of
the input features. In fact, RDF has been used for feature selection in several
applications, e.g., [→26, →27]. This selection might have conflicted with the
selection performed using our considered method, thus reducing the overall
effectiveness of RDF with both feature-ranking methods.
With respect to the computation times, PCA and the proposed method took
an average of 0.28 s to compute, while ReliefF took an average of 53.33 s to
compute. These calculations were performed in Matlab R2022b, running on a
Windows 10 machine with an Intel Xeon E3 CPU and 16 GB of RAM. Thus, the
proposed feature ranking method was much faster than ReliefF and just as fast
as PCA.
Given the results shown in →Table 9.3 and the differences in computational
costs, in the following, we perform dimensionality reduction by the proposed
feature ranking method. The number of selected features Q was taken from
→Table 9.3. The above processing is related to the reduction of the number of
columns of the data matrix, which is the input to the classification stage.
Thus, the original dimension is reduced from N × 560 to N × Q. As
explained above, the classification performance depends heavily on the number
of columns. However, it also depends on the number of rows of the classification
matrix, i.e., the number of samples. The estimation of the adequate sample size
for the training stage remains a critical issue that determines the quality of the
classification results. Learning curves of the methods could be estimated,
providing an estimate of the reduction in the excess of the probability of error
obtained by increasing the training set size [→28]. This is a complex task that is
outside the scope of this work. When the data sample size is scarce or the data
categories are imbalanced, classification performance may decrease. Thus, we
could resort to oversampling techniques, adding synthetic samples to avoid
biasing the classifiers. Recently, this was demonstrated for road surface
identification systems using oversampling methods based on generative
adversarial networks (GANs) [→29] and graph signal processing [→30, →31, →32,
→33].

9.6 Performance of the road surface identification system


Once the window size and the dimensionality reduction stage have been set, we
performed a further experiment to determine the final performance of the
system. For this experiment, the proposed system was trained using all the
experiments captured on a single road surface and then tested on all the
experiments with multiple road surfaces. Using the results shown in Sections
9.5.1 and 9.5.2, we set the window size to 1.5 s and the number of selected
features to the values shown in →Table 9.3: Q_SVM = 4, Q_LDA = 9, and Q_RDF = 560.

As described in Section 9.4, the results of the classification were
post-processed to consider time dependencies in the result: any change in road
surface that was sustained for less than 1 full second was removed and replaced
with the previous road surface. Aside from the effect on the stability of the
output, this post-processing increased accuracy slightly (1–2%).
The results of some examples of the experiments are shown in →Table 9.4
and →Figure 9.9. In accordance with the results in →Table 9.3, all methods
yielded a very accurate classification, following closely the changes in road
surface. The average accuracies for all the experiments were 97.21%, 97.55%,
97.12%, and 97.73% respectively for LDA, SVM, RDF, and SSI. Almost all the errors
were located during the transitions between road surfaces, which were confused
with other classes (e.g., LDA in →Figure 9.9(c)) or experienced slight delays in
the recognition of the correct road surface (e.g., all methods in →Figure 9.9(d)).

Table 9.4: Accuracy for some of the experiments with changes in road surface.
Transition | Experiment | LDA | SVM | RDF | SSI
Stripes to smooth flat asphalt | 1 | 99.03% | 98.08% | 90.38% | 99.04%
Stripes to smooth flat asphalt | 2 | 86.21% | 96.55% | 96.55% | 96.55%
Cobblestones to smooth flat asphalt | 3 | 99.20% | 99.20% | 99.23% | 99.23%
Cobblestones to smooth flat asphalt | 4 | 96.47% | 97.65% | 96.47% | 97.65%
Figure 9.9: Estimation of the road surface on experiments where there are
transitions between road surfaces corresponding to experiments of →Table 9.4:
(a) exp. #1; (b) exp. #3; (c) exp. #2; and (d) exp. #4.

9.7 Conclusions
An experimental study on road surface classification has been presented. The
proposed system comprises a set of 10 sensors (3 microphones, 3
accelerometers, 2 speed sensors, 2 steering wheel sensors) and a machine
learning procedure involving feature extraction and selection, classification, and
decision fusion stages. Results show the proposed system is able to determine
the surface with high accuracy (>96%) for three types of road surfaces: smooth
flat asphalt, stripes, and cobblestones. The extended sensor configuration enables this accuracy compared with many of the currently available systems (based on electric power steering, EPS, signals), which reach around 80% accuracy. There are several open lines of research from this work; for instance, semi-supervised learning (using labeled and unlabeled data samples together) and oversampling techniques could be explored to improve the classification results.
References
[1] V. Surblys, V. Žuraulis, and E. Sokolovskij, Estimation of road roughness from
data of on-vehicle mounted sensors. Eksploat. I. Niezawodn.-Maint. Reliab, vol.
19, pp. 369–374, 2017. →
[2] J. Park, K. Min, H. Kim, W. Lee, G. Cho, and K. Huh, Road surface classification
using a deep ensemble network with sensor feature selection. Sensors, vol. 18,
pp. 4342, 2018. →
[3] G. Safont, A. Salazar, A. Rodríguez, and L. Vergara, Multichannel signal
processing for road surface identification. In: Proceedings of the 2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 3052–3056, Barcelona, Spain, 4–8 May 2020. →
[4] K. Han, M. Choi, and S. B. Choi, Estimation of the tire cornering stiffness as a
road surface classification indicator using understeering characteristics. IEEE
Transactions on Vehicular Technology, vol. 67, pp. 6851–6860, 2018. →
[5] A. Bystrov, E. Hoare, T. Y. Tran, N. Clarke, M. Gashinova, and M. Cherniakov,
Sensors for automotive remote road surface classification. In: Proceedings of the
2018 IEEE International Conference on Vehicular Electronics and Safety (ICVES),
Madrid, Spain, 12–14 September 2018. →
[6] A. Salazar, A. Rodriguez, N. Vargas, and L. Vergara, On training road surface
classifiers by data augmentation. Applied Sciences, vol. 12, pp. 3423, 2022. →
[7] S. M. Yusoff and J. Giacomin, The effect of vibrational energy distribution on the level of driver detection. AIP Conference Proceedings, vol. 2059, p. 020032, 2019.
[8] J. R. Ng, J. S. Wong, V. T. Goh, W. J. Yap, T. T. V. Yap, and H. Ng, Identification of
road surface conditions using IoT sensors and machine learning. In:
Computational Science and Technology, pp. 259–268, Singapore: Springer, 2019.

[9] T. Beilfuss, K. P. Kortmann, M. Wielitzka, C. Hansen, and T. Ortmaier, Real
time classification of road type and condition in passenger vehicles. IFAC-Papers,
vol. 53, pp. 14254–14260, 2020. →
[10] G. Safont, A. Salazar, A. Rodríguez, and L. Vergara, Comparison of
dimensionality reduction methods for road surface identification system. In:
Proceedings of the 2020 Science and Information Conference, pp. 554–563,
London, UK, 16–17 July 2020. →
[11] T. Chugh, F. Bruzelius, M. Klomp, and B. Jacobson, Steering feedback
transparency using rack force observer. IEEE/ASME Trans. Mechatron, vol. 27, no.
5, pp. 3853–3864, 2022. →
[12] E. Bonera, M. Gadola, D. Chindamo, S. Morbioli, and P. Magri, On the
influence of suspension geometry on steering feedback. Applied Sciences, vol.
10, pp. 4297, 2020. →
[13] L. Yaohua, F. Jikang, H. Jie, N. Youfei, and F. Qianlong, Novel electric power
steering control strategies of commercial vehicles considering adhesion
coefficient. Advances in Mechanical Engineering, vol. 12, 2020. →
[14] G. Peeters, A large set of audio features for sound description (similarity
and classification) in the CUIDADO project 2004. a, b, c
[15] I. Jolliffe, Principal Component Analysis, New York, NY, USA: Springer, 2002.
a, b
[16] O. Shamir, A stochastic PCA and SVD algorithm with an exponential convergence rate. In: International Conference on Machine Learning (ICML), Lille, France, 2015. →
[17] G. Safont, A. Salazar, and L. Vergara, On recovering missing GPR traces by
statistical interpolation method. Remote Sensing, vol. 6, pp. 7546–7565, 2014.
[18] G. Safont, A. Salazar, L. Vergara, E. Gomez, and V. Villanueva, Mutichannel
dynamic modeling of non-Gaussian mixtures. Pattern Recognition, vol. 93, pp.
312–323, 2019.
[19] H. Liu and H. Motoda (Eds.), Computational Methods of Feature Selection, Boca Raton: CRC Press, 2007. a, b, c
[20] I. Kononenko, E. Šimec, and M. Robnik-Šikonja, Overcoming the myopia of
inductive learning algorithms with RELIEFF. Applied Intelligence, vol. 7, no. 1, pp.
39–55, 1997. →
[21] D. M. Powers, Evaluation: from precision, recall and F-Measure to ROC,
informedness, markedness & correlation. Journal of Machine Learning
Technologies, vol. 2, no. 1, pp. 37–63, 2011. →
[22] G. Safont, A. Salazar, and L. Vergara, Multiclass alpha integration of scores
from multiple classifiers. Neural Computation, vol. 31, no. 4, pp. 806–825, 2019.

[23] G. Safont, A. Salazar, and L. Vergara, Vector score alpha integration for
classifier late fusion. Pattern Recognition Letters, vol. 136, pp. 48–55, 2020.
[24] A. Salazar, G. Safont, L. Vergara, and E. Vidal, Pattern recognition techniques
for provenance classification of archaeological ceramics using ultrasounds.
Pattern Recognition Letters, vol. 135, pp. 441–450, 2020.
[25] A. Salazar, G. Safont, L. Vergara, and E. Vidal, Graph regularization methods
in soft detector fusion. IEEE Access, vol. 11, pp. 144747–144759, 2023. →
[26] R. Diaz-Uriarte and S. Alvarez de Andres, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, vol. 7, article no. 3, 2006. →
[27] M. Hasan, M. Nasser, S. Ahmad, and K. Molla, Feature selection for intrusion
detection using random forest. Journal of Information Security, vol. 7, pp. 129–
140, 2016. →
[28] A. Salazar, L. Vergara, and E. Vidal, A proxy learning curve for the Bayes
classifier. Pattern Recognition, vol. 136, pp. 109240, 2023. →
[29] A. Salazar, G. Safont, and L. Vergara, Application of an oversampling method
for improving road surface classification. In: 2022 International Conference on
Computational Science and Computational Intelligence (CSCI), pp. 647–651, Las
Vegas, NV, USA, 2022. →
[30] L. Vergara, A. Salazar, J. Belda, G. Safont, S. Moral, and S. Iglesias, Signal
processing on graphs for improving automatic credit card fraud detection. In:
Proceedings of IEEE International Carnahan Conference on Security Technology
(ICCST), pp. 1–6, Madrid, Spain, 2017.
[31] J. Belda, L. Vergara, G. Safont, A. Salazar, and Z. Parcheta, A new surrogating
algorithm by the complex graph Fourier transform (CGFT). Entropy, vol. 21, no. 8,
article no. 759, 2019.
[32] J. Belda, L. Vergara, G. Safont, and A. Salazar, Computing the partial
correlation of ICA models for non-Gaussian graph signal processing. Entropy,
vol. 21, no. 1, article no. 22, 2019.
[33] J. Belda, L. Vergara, A. Salazar, and G. Safont, Estimating the Laplacian
matrix of Gaussian mixtures for signal processing on graphs. Signal Processing,
vol. 148, pp. 241–249, 2018.
10 RNN models for evaluating financial indices:
examining volatility and demand-supply shifts in
financial markets during COVID-19

Ray R. Hashemi
Omid M. Ardakani
Jeffrey Young
Azita G. Bahrami

Abstract
Several semi-recurrent neural networks (RNNs) were developed to explore the
impact of the COVID-19 pandemic on the financial indices representing bond
market (yield spread – Spread), energy market (crude oil prices – Oil), stock
market (volatility index – VIX and Wilshire 5,000 total market index – Wil5,000),
housing market (Shiller housing price index – HPI), commodity market (Gold,
Wheat, and Soybean indices), and recession (USrec). The indices were divided
into two groups based on two properties: volatility and demand-supply shifts.
USrec was a part of both groups. Two time-series datasets (dataset 1 and
dataset 2) were obtained from the Federal Reserve Bank of St. Louis for the two
groups. The volatility and demand-supply shift groups had daily and monthly
data, respectively. Each dataset was first purified by applying derived-
breakpoints (the transition block) for each index, separately. Second, the purified
dataset was partitioned into “before” and “during” the pandemic. Semi-RNNs
were trained by the training sets for the “before” and “during”, separately, and
tested against their corresponding test sets which delivered two 5-element
counter arrays. Each array carries the overall percent of the correct
classifications of the indices for the entire records of the corresponding test sets.
A decay rate was calculated for each index in the arrays, and the average of the decay rates (the overall decay) was also calculated for each array to be used for inspecting the impact of COVID-19 on the indices. A positive, zero, or negative decay rate means that COVID-19 has a negative, no, or positive impact on the financial indices, respectively. The results also revealed: (a) VIX was explained by
Wil5,000, Spread, Oil, and USrec more accurately before the pandemic, indicating
that other observed and unobserved factors arising from the COVID-19
pandemic would affect the VIX more than the other financial indices. (b) USrec is
predicted less accurately compared to other indices during the pandemic, which
shows the sensitivity of this index to health and geopolitical challenges. (c)
Effects of the indices on the bond market diminished during the pandemic. (d)
HPI declines in predictive accuracy during the pandemic indicate disruptions in
the housing market due to economic and social changes. (e) Gold price drop in
predictive accuracy during the pandemic reflects its nature as a relatively stable
safe-haven but still sensitive to COVID-19. (f) Wheat’s substantial drop in
predictive accuracy highlights its sensitivity to disrupted supply chains, global
trade dynamics, and fluctuating demands during the pandemic. (g) Similar to
wheat, soybean predictive accuracy fell during the pandemic, underscoring the
impact of supply chain disruptions on agricultural commodities. The sensitivity
analysis suggests that the results remain highly robust to the changes in the
number of records chosen for the different test sets.
Keywords: COVID-19 impact on financial indices, stock market, bond market, energy market, commodity market, housing market, recurrent neural network, transition boundary, robustness analysis

10.1 Introduction
The US economy is under the influence of several major financial markets
including stock, bond, energy, commodity, and housing markets. Such influences
can be more severe if they are combined with a long-lasting societal disaster
such as the COVID-19 pandemic. The reason stems from the fact that there was
a forced lack of social interactions. The COVID-19 pandemic significantly
disrupted various sectors of the economy, including supply chains, the
workforce, agriculture, and education. It also changed the demand for goods.
These disruptions had widespread effects on both the economy and financial
markets.
Our investigation for evaluating the financial indices is done by predicting
these indices using different groups of markets. Our efforts are divided into two
parts as follows. Part 1 includes the exploration of the financial indices
forecasting using the energy, stock, and bond markets. We grouped these
markets because they display a higher degree of volatility in the presence of a
long-lasting societal event like COVID-19. Part 2 includes the exploration of the
COVID-19 impact on the financial indices’ behavior using commodity and
housing markets. We group these markets because their common property is
their quick reaction to the presence of a high demand for goods by the society
and/or broken supply-chains.
For Part 1, the following financial indices are used: the interest rate spread
(Spread), oil prices (Oil), the volatility index (VIX), and the Wilshire 5,000 market
index (Wil5000). These indices represent the bond market, energy market, stock
market volatility, and volume level of the traded stocks in the stock market,
respectively [→1, →2]. The Spread is the difference between the interest rates of
long-term (10 year U.S. Treasury) and short-term (2 year U.S. Treasury)
government bonds. This difference can give us insights into economic activities
and potential recessions [→3]. The price of oil is set by a group of countries
known as OPEC-plus, and it affects both recessions [→4, →5] and the rate of
inflation [→6]. The VIX index represents how much stock market prices are
expected to fluctuate in the near future based on options traded on the stock
market [→7]. Finally, the Wilshire 5000 index measures the combined value of
stocks from over 5,000 publicly traded companies in the U.S. [→8].
For Part 2, the following indices are used: HPI (Shiller Housing Price Index),
Gold, Wheat, and Soybean indices. The first index represents the housing market
and the last three represent the heart of the commodity market. The Shiller
Housing Price Index is a leading US residential housing market measure, and it
offers insights into economic trends. Housing markets often provide early
signals of economic downturns and recoveries. As a result, housing market
trends have a strong predictive value in economic cycles [→9, →10]. A downturn
in housing prices often indicates broader economic challenges, as evidenced in
the 2008 financial crisis [→11].
The selection of commodity futures helps to understand global economic
health. Wheat and soybean futures contracts, for instance, are indicators of
agricultural market trends and global trade dynamics, as discussed in [→12].
Gold, traditionally seen as a ‘safe haven’ during economic uncertainties, has
implications for the global economy [→13]. Variations in these markets can
indicate shifts in economic activities. Gold’s role during economic uncertainties,
such as the COVID-19 period and agricultural commodities’ responses to global
supply and demand changes are critical in this context.
The impact of COVID-19 on our two groups of markets is explored. This is accomplished by dividing the datasets used for Parts 1 and 2 into "before" and "during" pandemic eras. Comparing the prediction accuracy of the financial indices in each era determines the impact of COVID-19 on our groups of markets. It is pertinent to inquire about the relevance of this comparison to
markets. It is pertinent to inquire about the relevance of this comparison to
COVID-19. The presence of chaotic behavior in data often leads to prediction
inaccuracies, which, in turn, may be indicative of a chaotic state within the
economy characterized by heightened uncertainty.
The breaking point between the “before” and “during” periods cannot be
one specific day but rather it is an interval of days (a transition block). The
reason for having a transition block is that the impact of COVID-19 on an index is
gradual and it may be different from one index to the next. Also, the time for the
COVID-19 impact to have its full effect is not the same for all the indices. The
best way to determine this boundary is by investigating the behavior of the individual indices to determine a period of transition from the "before" era to the "during" era.
The nature of the data for financial indices (time-series data) and the fact that the prediction of the recession for the record of date d is influenced by the history of the record's predecessors are properties that demand the use of recurrent neural networks (RNNs). In this neural network, any record seeking the prediction of a value for one of its designated indices is also morphed (as a part of the history) and participates in the prediction of a value for the designated index in the next record.
The goal of this research effort is twofold: the development of RNNs to study the behavior of the financial indices in the context of COVID-19 using (a) bond, energy, and stock markets, and (b) commodity and housing markets. Previously, a preliminary investigation of the COVID-19 impact on the financial indices VIX, Wil5000, Spread, and Oil was completed and reported in [→14]. This chapter is an extension of that paper.
The structure of the remainder of the chapter is as follows. Previous works, methodology, empirical results, and discussion, conclusion, and future research are the subjects of Sections 2, 3, 4, and 5, respectively.

10.2 Previous works


Forecasting the phases of economic business cycles is increasingly done by
machine learning approaches in general and by neural networks in particular.
Joseph et al. demonstrated that neural networks outperform conventional statistical models, such as vector autoregressions and autoregressive integrated moving average models, when predicting macroeconomic developments based on the interest rate spread. Their work emphasizes the superior capacity of neural networks to capture patterns in time-series data [→16]. Qi highlighted the effectiveness of neural networks in predicting recessions by employing both financial and macroeconomic indices [→15].
Vidal and Kristjanpoller recently introduced a deep learning architecture that
integrates convolutional neural networks with long short-term memory
networks to analyze economic trends by addressing both spatial and temporal
features of economic data [→17]. Lazcano et al. further explored the use of
Graph Neural Networks to model the dependencies between economic
indicators and their collective impact on economic cycles [→18].
Further elaborating on this, Puglia and Tucker introduced a methodology for
cross-validation and performed statistical analysis on machine learning
classifiers to assess the efficacy of yield spread and other macroeconomic
indicators in forecasting U.S. recessions [→19]. Their findings indicate that
neural network classifiers excel at identifying critical attributes of recession
distribution, which are not as readily captured by traditional probit and logit
models. Also, Longo et al. have advanced an ensemble learning strategy that
integrates a recurrent neural network with a dynamic factor model to project
economic growth [→20]. Wang et al. also examined machine learning strategies
to predict recessions and their effects on financial markets [→21].
Machine learning has also been used to enhance the predictive modeling of
economic indicators such as HPI and commodity prices, which have been
influential during periods of economic volatility such as those induced by the
COVID-19 pandemic. Zhan et al. demonstrated the potential of deep learning
techniques in forecasting housing prices [→22]. Lorenz et al. also studied the use
of machine learning in analyzing the real estate market by showing how these
models signify housing market dynamics [→23].
COVID-19 also impacted commodity markets such as gold, wheat, and soybeans, with price variations due to problems in supply chains and shifts in demand during the pandemic. Kamdem et al. employed deep learning to
analyze the implications of COVID-19 on the volatility of commodity markets and
study the pandemic’s disruptive impact [→24]. Ben Ameur et al. and Oktoviany
et al. provided evidence on the efficacy of deep learning in forecasting
commodity prices [→25, →26].
We extend these methodologies by employing semi-recurrent neural
networks to specifically analyze how different financial indices, including those
for the housing and commodity markets, have responded to the economic
disruptions caused by the COVID-19 pandemic. By comparing daily and monthly
data for these indices, partitioned into “before” and “during” pandemic periods,
our models provide a robust analysis of the pandemic’s impact on economic
activities.

10.3 Methodology
The objectives for meeting the goal are: (a) discretization of the datasets that make up Part 1 and Part 2 [→26], (b) identification of transition blocks for each index and
for each dataset [→27], and (c) introduction of a semi-recurrent neural network
for evaluating financial indices during COVID-19 [→28]. The rest of this section
describes the above objectives in detail.

10.3.1 Data discretization


Before the data discretization process, imputing the missing data is necessary.
Fortunately, we do not have any missing data in our datasets. We use the k-
means clustering approach that is encapsulated by the following algorithm
[→29].
Algorithm K-Means
Input: A dataset, D, and an integer K > 1;
Output: K clusters (CL1, …, CLk) out of the data in D;

Method:
1- Select K records from D as the centers (means) of the clusters CL1, …, CLk.

2- Repeat for i = 1 to |D|

   a. Calculate the Euclidean distance (e) of record ri of the dataset D from
      the centers of the K clusters, (e1, …, ek).
   b. Assign ri to cluster CLm (where 1 ≤ m ≤ K) such that em = Min(e1, …, ek);
   c. Update the center of CLm by averaging its members;

End //End of Repeat;

End //End of the algorithm
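For illustration, a minimal Python sketch of this discretization using scikit-learn's KMeans follows; the choice of library, the relabeling convention (label 1 = lowest cluster center), and the toy data are assumptions, not the chapter's implementation.

import numpy as np
from sklearn.cluster import KMeans

def discretize_index(values, k=3, seed=0):
    # Cluster a 1-D series of index values into k groups and map each record
    # to a discrete label 1..k, ordered so that label 1 has the lowest center.
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x)
    order = np.argsort(km.cluster_centers_.ravel())       # centers, low to high
    relabel = {int(old): new + 1 for new, old in enumerate(order)}
    return np.array([relabel[int(c)] for c in km.labels_])

# Toy example: a VIX-like series discretized into three levels
print(discretize_index([12.0, 15.0, 18.0, 25.0, 30.0, 60.0, 70.0]))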

10.3.2 Identification of the transition blocks


We first define the term “transition block,” and then present a method that
identifies the starting and ending dates of the transition block for a given
financial index. A transition block, for a financial index, represents a time period
in which the influence of an event on the index is sensed very lightly (starting
point of the transition block) and then grows to a noticeable influence (the
ending point of the transition block). The tolerance levels of different financial
indices to the influence of the event are not the same. For this reason, the width
of the transition block may be different from one index to the next. The datasets
in Part 1 and Part 2 have their own comprehensive-transition blocks. The
minimum starting date and maximum ending date among the transition blocks
of a dataset’s indices make the starting and ending dates of its comprehensive-
transition block, respectively. The records in the transition block of an index or a dataset cannot be a part of either the "before" or "during" eras for an event. The reason is that a transition block includes records for a period that does not certainly belong to either the "before" or "during" eras.
The identification of a transition block for a given index is done by detecting breakpoints in the index values. A breakpoint is a sudden change in the index values, which can be visualized in its plot. To detect breakpoints, we use the following method, which we refer to as derived-breakpoints. We first assume the
possibility of m breakpoints in the dataset, where the coefficients shift, resulting
in m + 1 segments with constant coefficients. We estimate these breakpoints by
minimizing the residual sum of squares (RSS). The key aspect of this
methodology is the estimation of breakpoints, which are not given exogenously
but are derived from the data.
Following the Bellman principle, this implementation utilizes a dynamic
programming approach to compute the optimal breakpoints for a given number
of breaks [→30]. The computed triangular RSS matrix is central to this process,
providing RSS for a segment starting and ending at specified observations. The
methodology is grounded in [→31‒→33] and is illustrated in [→34].
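As a rough illustration of this derived-breakpoints idea, the Python sketch below uses the ruptures library's dynamic-programming detector with a least-squares cost on the level of the series; this is a simplified stand-in (the chapter's breakpoints follow the regression-based approach of [→31‒→33], with plots produced in R), and the parameters and toy series are illustrative.

import numpy as np
import ruptures as rpt

def derived_breakpoints(series, n_breaks):
    # Estimate n_breaks breakpoints by minimizing the residual sum of squares
    # over all segmentations, via dynamic programming (Bellman principle).
    signal = np.asarray(series, dtype=float).reshape(-1, 1)
    algo = rpt.Dynp(model="l2", min_size=5, jump=1).fit(signal)
    # predict() returns segment end indices, the last one being len(signal)
    return algo.predict(n_bkps=n_breaks)[:-1]

# Toy example: a level shift around index 100 should be recovered as one break
rng = np.random.default_rng(0)
toy = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
print(derived_breakpoints(toy, n_breaks=1))   # approximately [100]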
In the context of this research effort, the transition block separates the records of a dataset into "before" and "during" eras with the best estimation. By doing so, one can say that the probability of a record belonging to the wrong era is close to zero; therefore, the records of the dataset are purified in reference to the index whose transition block is employed. It is clear that purifying a dataset in reference to different indices results in different outcomes.

10.3.3 Semi-recurrent neural network


The RNN version that we develop has two components “Current” and “Past” as
shown in →Figure 10.1. (To be precise, this version is a semi-RNN [→28].) Both
components have a pair of input and output layers that are referred to as the
input/output Current layer and input/output Past layer. Thus, records feeding
these input layers are “Current input” and “Past input” records, respectively. The
number of neurons in Current layers is the same as the number of neurons in
Past layers and they are equal to the number of financial indices in either
dataset of Part 1 or dataset of Part 2. The Current input and Past input layers
collectively make the input layer of the RNN. A bias neuron is also added to this layer.
Figure 10.1: The semi-RNN architecture.

The outcome from the Current output layer is the one considered as the output
of the RNN. This outcome is influenced by the Current input record and the
history of the predecessors of this record that is manifested as the Past input
record. For maintaining such a record, the outcome of the neurons from the Past
output layer always feed back to the corresponding neurons in the Past input
layer. The Current input record also contributes into the history of its
predecessors. The key to such contribution is the connection of the neurons in
the Current input layer to each one of the neurons in the Past output layer.
When the Current input record is the first record of a dataset, there is no
history of predecessors for this record. We resolve this problem by using the first
record of the dataset as the Past input record and, thus, we feed the Current
input layer starting from the second record of the dataset. We use the Gaussian Error Linear Unit (GELU) activation function [→35] for our neural network, which is calculated by Formula (1):

Out = Net / (1 + e^(−1.702·Net))     (1)

where Out is the output of a given node and Net is the weighted sum of the inputs to the node.
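To make Formula (1) and the wiring of →Figure 10.1 concrete, here is a minimal numpy sketch of the activation and of one forward step through a single-layer network with a Current input, a Past input, and a bias node. The layer sizes match the five indices per dataset, but the weight matrices, their initialization, and the absence of training are hypothetical simplifications, not the authors' implementation.

import numpy as np

def gelu(net):
    # Formula (1): Out = Net / (1 + exp(-1.702 * Net))
    return net / (1.0 + np.exp(-1.702 * net))

n_idx = 5                                   # five financial indices per dataset
rng = np.random.default_rng(0)
# Hypothetical weights: input layer = Current (5) + Past (5) + bias (1) neurons
W_current = rng.normal(size=(2 * n_idx + 1, n_idx))   # to the Current output layer
W_past = rng.normal(size=(2 * n_idx + 1, n_idx))      # to the Past output layer

def forward(current_record, past_state):
    # One step of the semi-RNN: the Current output is the prediction, while the
    # Past output is fed back as the Past input (the "history") at the next step.
    x = np.concatenate([current_record, past_state, [1.0]])
    return gelu(x @ W_current), gelu(x @ W_past)

# The first record seeds the Past input; prediction starts at the second record.
records = rng.uniform(1, 3, size=(4, n_idx))
past = records[0]
for record in records[1:]:
    prediction, past = forward(record, past)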

10.4 Empirical results


The three objectives, for meeting our goal, were the discretization of data,
identification of the transition blocks, and development of a semi-RNN, and they
were introduced and discussed previously. In this section, we describe the
results of implementing these objectives on data in hand which, is composed of
two time-series datasets (named Dataset 1 and Dataset 2). These datasets have
been obtained from FRED, Federal Reserve Bank of St. Louis, for use in Part 1
and Part 2 of our investigation, respectively. For each dataset, the list of indices
and their market representation are given in →Table 10.1. Dataset 1 has 7,939 records and dataset 2 has 99 records. One of the reasons for the much higher record count of dataset 1 is that dataset 1 includes daily data, whereas dataset 2 includes monthly data. The period for which data were extracted from FRED is (Jan. 2, 1990, Feb. 2, 2020).

Table 10.1: List of indices for dataset 1 and dataset 2.


Dataset 1 (daily data)                  Dataset 2 (monthly data)
Representative       Index              Representative       Index
Bond market          Spread             Housing market       HPI
Energy market        Oil                Commodity market     Gold
Stock market         VIX                                     Wheat
                     Wil5000                                 Soybean
State of economy     USrec              State of economy     USrec
Time (daily)         Date               Time (monthly)       Date
10.4.1 Objective 1: data discretization
The K-means algorithm is used (for k = 3) to discretize indices into three discrete
values of 1, 2, and 3. In general, the number of records in discrete values of 1
and 3 are the highest and lowest, respectively. For each discrete value of an
index, the range of its continuous values along with the number of records is
given in →Table 10.2. The reader needs to be reminded that the USrec index is
an exception in this process because it is a binary indicator.

Table 10.2: Discrete value of indices along with the number of records in the
discrete value and the range of its continuous values.
Daily data
Index      1                        2                        3
VIX        [9.14, 19.6)    4,815    [19.6, 34.5)    2,768    [34.5, 82.7]    356
Wil5000    [7.35, 69.8)    5,785    [69.8, 150)     1,713    [150, 234]      441
Spread     [−0.52, 0.758)  3,163    [0.758, 1.75)   2,429    [1.75, 2.91]    2,347
Oil        [9.1, 42.1)     3,882    [42.1, 84.8)    2,799    [84.8, 144]     1,258

Monthly data
Index      1                        2                        3
HPI        [137, 175)      45       [175, 238)      40       [238, 306]      14
Gold       [1,060, 1,270)  28       [1,270, 1,550)  36       [1,550, 1,970)  35
Wheat      [368, 567)      52       [567, 750)      31       [750, 1,040]    16
Soybean    [840, 1,100)    53       [1,100, 1,380)  21       [1,380, 1,710]  25

10.4.2 Objective 2: identification of the transition blocks


The transition blocks need to be identified for each financial and macroeconomic
index individually and also for each dataset. We apply the derived-breakpoints
approach to determine the boundaries of the transition block for each index.
The outcomes are shown in →Table 10.3. Also, the comprehensive-transition
block for each dataset is calculated using the transition blocks of its indices as
described before. The transition blocks for the indices are supported by their
plotted values illustrated in →Figures 10.2–→10.4. These plots are generated
using the R statistical package.
Figure 10.2: Plots for indices in dataset 1 with daily data (Oil, Spread, Vix,
Wil5000).

Table 10.3: Transition blocks of the financial indices in (a) dataset 1 and (b) dataset 2.
(a) Financial indices (daily data)        (b) Housing and commodity market indices (monthly data)
Index       Transition block              Index       Transition block
VIX [3/12/2020 to 3/27/2020] HPI [2/1/2020 to 4/1/2020]
Wil5000 [3/9/2020 to 3/23/2020] Gold [1/1/2020 to 4/1/2020]
Spread [3/16/2020 to 3/20/2020] Wheat [3/1/2020 to 4/1/2020]
Oil [3/16/2020 to 4/21/2020] Soybean [3/1/2020 to 4/1/2020]
USrec [3/2/2020 to 4/30/2020] USrec [3/1/2020 to 4/1/2020]
Dataset-1 [3/2/2020 to 4/30/2020] Dataset-2 [1/1/2020 to 4/1/2020]
Figure 10.3: Plots for indices in dataset 2 with monthly data (HPI, Gold, Wheat,
Soybean).
Figure 10.4: Plot for the binary index of USrec.

10.4.3 Objective 3: development and use of semi-RNN


A semi-RNN was developed with 11 neurons in the Input layer (five for the
Current input, five for the Past input, and one for the bias node). The number of
neurons in the Current output layer and the Past output layer was also five for each. The procedure Achievement, given below, is employed for the investigations of Part 1 and Part 2; an explanation follows the procedure.
Procedure: Achievement
Step 1: Repeat for j = 1 to 2.
Step 2: D = Dataset j.
Step 3: Repeat for i = 1 to 5.
Step 4: Purify the dataset D in reference to the transition block of the
financial index Fi.
Step 5: Divide the purified D into “Fi-before” and “Fi-during” partitions.
(The number of indices in each partition is the same as the number of
indices in dataset j).
Step 6: Divide each partition into a pair of training and test sets ((Fi-before-
train, Fi-before-test) and (Fi-during-train, Fi-during-test)).

Step 7: Balance the training sets.


Step 8: Repeat for each pair of the training and test set (TRm and TSm) //m =
1 to 2
Step 9: Train the semi-RNN using TRm

Step 10: Test the trained semi-RNN using TSm and report percent of the
correct prediction of each index for the entire TSm. Save the outcome in a 5-
element counter arraym.

End of Step 8
Step 11: Calculate the decay rate of the predictions between the two counter
arrays.
Step 12: D = Dataset j.
End of Step 3.
End of Step 1.

End of Procedure

In this procedure, we try to predict the indices’ value for the record R of the test
set using the trained neural network that captures the history of the record R’s
predecessors. This means that the trained semi-RNN is good for predicting the
indices for a test record that timewise is next to the very last record of the
training set (i.e., a test set with one record). Therefore, if a test set has N records
(timewise, they are one after another – Time series data), then as soon as the
prediction for the first test record is finished, it must be added to the training
set, and the neural net must be re-trained for the new training set to become
ready for the prediction of the next record in the test set. However, we have
taken a different approach by answering the question that follows: How far can
one depart from the time belonging to the last record in the training set to select
a test record without damaging the indices’ prediction too much? The sure
answer is not too far. For this reason, the number of records (λ) in our test sets
are small (λ = 10% of the records in Fi-before and Fi-during). In justification of our
chosen approach, the reader needs to be reminded that we compare the result
of predictions for “before” and “during” eras. Therefore, the corresponding test
records in the two test sets used for different eras have the same distance
(timewise) from the last record of their training sets and it should not affect the
result of the comparison.
In the loop of Step 3, the purification of the dataset is done for one index at a
time and in the presence of the other indices. We employed individual transition
blocks for each financial index to consider the potential for abnormal shifts in
each series. This methodological choice is predicated on the statistical premise
that the abnormal behavior of one index should not distort the predictive
accuracy of another. By doing so, we mitigate the risk of spurious correlations
and enhance the robustness of our predictions.
If the number of records in the training set for a neural network increases, then the neural network, in general, is trained better. Having said that, "Fi-before-train" has more records than "Fi-during-train", which may put the semi-RNN trained with "Fi-during-train" at a disadvantage. To remedy this disadvantage, we make the number of records in both training sets the same. This remedy is reflected in Step 7. Step 8 delivers one 5-element counter
array per era (“Before” and “During”) in reference to the Fi index. Each array
carries the overall percent of the correct classifications of the indices for the
entire records of the corresponding test sets.
Consider a dataset purified in reference to the financial index Fi, which ultimately generates two 5-element counter arrays; see →Table 10.4. The decay rate for index Fa is calculated using Formula (2).

Table 10.4: A sample of the two 5-element counter arrays.

Counters                               Financial indices
                                       Fa     Fb     Fc     Fd     Fe
5-element counter array-1              C1a    C1b    C1c    C1d    C1e
(for Fi-before-test)
5-element counter array-2              C2a    C2b    C2c    C2d    C2e
(for Fi-during-test)

Decay_Fa = (C1a − C2a) / |Fi-before-test|     (2)

where C1a and C2a are the overall percentages of correct classifications of index Fa using all records in Fi-before-test and Fi-during-test, respectively. Since both Fi-before-test and Fi-during-test have the same cardinality, either one can be used in Formula (2).
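As a small worked example, the following lines reproduce the Vix-purification row of →Table 10.5 from its counter arrays using Formula (2); the helper function is hypothetical, and the rounding matches the one-decimal reporting of the tables.

import numpy as np

def decay_rates(before_counts, during_counts, n_test_records):
    # Formula (2) applied element-wise, plus the overall decay (their average).
    rates = (np.asarray(before_counts, float) - np.asarray(during_counts, float)) / n_test_records
    return rates, rates.mean()

# Vix-purified dataset 1 (Table 10.5): lambda = 45 records per test set
rates, overall = decay_rates([84, 78, 80, 71, 67], [69, 71, 67, 56, 58], 45)
print(np.round(rates, 1), round(overall, 1))   # [0.3 0.2 0.3 0.3 0.2] 0.3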
Positive, zero, and negative decay rates for the indices (Fa to Fe) mean that COVID-19 has negative, no, and positive effects on the financial indices, respectively. The financial indices with the largest and smallest decay rates are the most and the least vulnerable to COVID-19. The overall decay rate for Fi is the average of the decay rates of the indices Fa to Fe. If
the overall decay rate is negative/positive but there are some non-negative
decay rates, then we say COVID-19 has a partially negative/positive effect on the
financial indices. If Fi has the maximum/minimum positive overall decay among
all indices, then (a) the transition block of Fi has the minimum/maximum
intersection with the transition blocks of the other indices and (b) Fi is the
most/least sensitive financial index to COVID-19.
The above procedure generates the decay rates for all indices in dataset 1
and dataset 2 that are, respectively, used in Part 1 and Part 2 of our
investigations. We summarize our findings in →Tables 10.5 and →10.6.

Table 10.5: Decay rates for indices in dataset 1.


Index of        λ           Test sets         5-element counter arrays                   Overall
purification    (in days)                     Vix    Wil5000  Spread  Oil     USrec      decay rate
Vix             45          Vix-before        84     78       80      71      67         0.3
                45          Vix-during        69     71       67      56      58
                            Decay rate        0.3    0.2      0.3     0.3     0.2
Wil5000         45          Wil5000-before    78     73       76      67      62         0.1
                45          Wil5000-during    73     69       71      69      62
                            Decay rate        0.1    0.1      0.11    −0.04   0
Spread          46          Spread-before     87     74       70      76      63         0.3
                46          Spread-during     67     65       65      61      48
                            Decay rate        0.5    0.2      0.1     0.3     0.3
Oil             44          Oil-before        77     73       77      70      66         0.2
                44          Oil-during        70     68       68      64      60
                            Decay rate        0.2    0.1      0.2     0.1     0.1
USrec           43          USrec-before      74     77       70      58      56         0.3
                43          USrec-during      56     53       53      51      47
                            Decay rate        0.4    0.6      0.4     0.2     0.2
Table 10.6: Decay rates for indices in dataset 2.

Index of        λ            Test sets         5-element counter arrays                     Overall
purification    (in months)                    HPI     Gold    Wheat   Soybean  USrec       decay rate
HPI             4            HPI-before        99      75      75      50       99          6.2
                4            HPI-during        50      50      50      50       75
                             Decay rate        12.25   6.25    6.25    0        6
Gold            4            Gold-before       75      99      50      50       75          3.7
                4            Gold-during       50      75      50      50       50
                             Decay rate        6.25    6       0       0        6.25
Wheat           4            Wheat-before      75      50      99      50       75          3.7
                4            Wheat-during      50      50      75      50       50
                             Decay rate        6.25    0       6       0        6.25
Soybean         4            Soybean-before    75      50      50      99       75          3.7
                4            Soybean-during    50      50      50      75       50
                             Decay rate        6.25    0       0       6        6.25
USrec           4            USrec-before      99      75      75      75       99          7.9
                4            USrec-during      50      50      40      50       75
                             Decay rate        12.25   6.25    8.75    6.25     6

We have also used the procedure with some modifications to purify each dataset
by their comprehensive-transition blocks as follows.
Procedure: Modified_Achievement
Step 1: Repeat for j = 1 to 2.
Step 2: Purify the dataset in reference to the comprehensive-transition block
of dataset j.
.
. (Re-numbered steps)
.
End of Step 1.
In this modified version of the procedure, Steps 2 and 3 are removed and Step 4 is changed to Step 2, as shown above. This means that, first, the records participating in the comprehensive-transition block of dataset j are removed; the rest of the procedure then generates two partitions (Compj-before and Compj-during) and delivers a 5-element counter array for each partition, from which the decay rates of the predictions of the indices in the fully purified dataset j are calculated. Results are shown in →Table 10.7.

Table 10.7: Decay rates for indices in dataset 1 and dataset 2 after purification of
the datasets using the comprehensive-transition blocks.

Index of          λ            Test sets        5-element counter arrays                    Overall
purification                                                                                decay rate
                                                Vix    Wil5000  Spread  Oil      USrec
All indices in    43 (days)    Comp-1-before    74     77       70      58       56         0.3
dataset 1         43 (days)    Comp-1-during    56     53       53      51       47
                               Decay rate       0.3    0.4      0.6     0.4      0.2
                                                HPI    Gold     Wheat   Soybean  USrec
All indices in    4 (months)   Comp-2-before    75     75       75      75       99         6.2
dataset 2         4 (months)   Comp-2-during    50     50       50      50       75
                               Decay rate       6.25   6.25     6.25    6.25     6

10.5 Discussion, conclusion, and future research


Consider a dataset that is purified in reference to the index Fi and delivers two test sets, TSi1 and TSi2, for the "before" and "during" eras. Assume a semi-RNN which, when trained by the corresponding training sets separately, produces overall decay rates of δ1 and δ2 for TSi1 and TSi2, respectively. If δ2 is significantly less than δ1, it suggests that the data for the "before" period is less chaotic than the data for the "during" period.
→Tables 10.5–→10.7 detail the correct classification percentages for the pre-pandemic and during-pandemic eras along with the decay rates of the indices and the overall decay rates using the different purified datasets. In general, the prediction accuracy of the indices in dataset 1 (the volatility group) is better than the prediction accuracy of the indices in dataset 2 (the demand-supply shift group).
Key observations from comparing the “before” and “during” pandemic
results include:
a. The lowest overall decay rate belongs to the Wil5000 index in the volatility
group and the highest overall decay rate belongs to the US recession index
in the demand-supply shift group. The minimal decay in the Wil5000 index
suggests that broader market indices may exhibit resilience by diversifying
across many assets, which will mitigate the impact of volatile events.
Conversely, the significant decay observed in the US recession index
highlights the sensitivity of recession indicators to sudden economic
disruption.
Considering the purified dataset 1 in reference to the Wil5000 index suggests
that COVID-19 had a positive impact on the oil price. The justification may stem
from the fact that the initial sharp decrease in oil demand due to lockdowns led
to drastic price drops and cuts in oil production. As economies began to recover,
these production cuts and a gradual return in demand likely caused oil prices to
rebound faster than anticipated.
Overall, the decay rate for indices in dataset 2 is higher, which means the
demand-supply shift group indices have been more impacted by the COVID-19
pandemic than the volatility indices.
a. On average, the predictive accuracy for indices during the pandemic is
approximately 10% lower than pre-pandemic predictions, indicating
increased macroeconomic and financial index volatility.
The simultaneity in the influence of macroeconomic and financial indices on one
another is critical for predictions. Our semi-RNN model accounts for this
interconnectedness as outlined in Section 3.
a. The VIX index is better predicted using variables like Wil5000, Spread, Oil,
and USrec before the pandemic. This highlights the additional impact of
various pandemic-related factors on the VIX beyond standard financial and
economic indicators.
b. USrec shows reduced predictive accuracy post-pandemic, reflecting its
susceptibility to health and geopolitical disturbances.
c. USrec shows the lowest predictive accuracies of 56% pre-pandemic and
47% during the pandemic, unlike VIX and Wil5000, which exhibit higher
accuracies in these periods. These results corroborate Malladi’s [→36]
findings, who employed machine learning to predict recessions and
market downturns.
d. The Spread pre-pandemic predictive accuracy with VIX, Wil5000, Oil, and
USrec was around 80%, which dropped to 67% during the pandemic,
illustrating the diminished influence of financial and macroeconomic
factors on the bond market. These findings are consistent with Zaremba et
al. [→37].
e. Housing prices showed higher accuracy pre-pandemic due to the housing
market responsiveness to economic shifts. The notable accuracy decline
during the pandemic reflects increased economic uncertainty, disruptions
in real estate, and shifts in housing demand.
f. Gold prices maintained high predictive accuracy pre-pandemic, aligning
with its status as a stable “safe-haven” asset. The slight accuracy reduction
during the pandemic underscores the increased global economic
uncertainty.
g. Wheat and soybean prices exhibited moderate prediction accuracy before
the pandemic, influenced by stable global agricultural trends and trade
dynamics. Their predictive accuracy declined during the pandemic due to
disrupted supply chains and altered global trade and demand.
h. The general decline in predictive accuracy across all indices during the
pandemic underscores the heightened uncertainty and market volatility
induced by COVID-19.
i. The similarity of the decay rate between (Comp-1-before and Comp-1-
during) and (USrec-before and USrec-during in →Table 10.5) is because the
transition block for USrec in →Table 10.5 is the same as the
comprehensive-transition block of the dataset 1.
Robustness checks varied λ values from 7% to 12%, affecting training and test
set sizes. Despite larger training sets improving classification accuracy, the
predictive patterns for both pre-pandemic and during-pandemic periods
remained consistent. This sensitivity analysis underscores the robustness of our
findings.
Future research is twofold: (a) expansion of the methodology: in this research effort, we purified each dataset in reference to one index's transition block and to the comprehensive-transition block; we will expand the methodology to include purification of the datasets using all possible combinations of the indices; and (b) development of a hybrid Markov model to enhance predictions of financial index behavior N days ahead, establishing a maximum value for N and facilitating user-driven scenario analysis.
References
[1] P.-R. Agénor and L. A. Pereira da Silva, Macroeconomic stability, financial
stability, and monetary policy rules. International Finance, 15, pp. 205–224,
2012. a, b
[2] C. Blot, J. Creel, P. Hubert, F. Labondance and F. Saraceno, Assessing the link
between price and financial stability. Journal of Financial Stability, 16, pp. 71–88,
2015. →
[3] J. D. Hamilton and D. H. Kim, A re-examination of the predictability of
economic activity using the yield spread, NBER Working Paper Series 7954,
2000. →
[4] D. S Bremmer and R. G. Kesselring, The relationship between US retail
gasoline and crude oil prices during the great recession: “Rockets and feathers”
or “balloons and rocks” behavior?. Energy Economics, 55, pp. 200–210, 2016. →
[5] N. Nonejad, Crude oil price volatility dynamics and the great recession.
Applied Economics Letters, 26, 8, pp. 622–627, 2019. →
[6] L. Kilian and X. Zhou, The impact of rising oil prices on US inflation and
inflation expectations. Energy Economics, 113, pp. 106228, 2022. →
[7] R. E. Whaley, Understanding the VIX. Journal of Portfolio Management, 35, 3,
pp. 98–105, 2009. →
[8] J. Driffill, Z. Rotondi, P. Savona and C. Zazzara, Monetary policy and financial
stability: What role for the futures market?. Journal of Financial Stability, 2, 1, pp.
95–112, 2006. →
[9] K. E. Case and R. J. Shiller, Is there a bubble in the housing market?.
Brookings Papers on Economic Activity, 2003, 2, pp. 299–362, 2003. →
[10] A. Fouejieu, A. Popescu and P. Villieu, Trade-offs between macroeconomic
and financial stability objectives. Economic Modelling, 81, pp. 621–639, 2012. →
[11] E. E. Leamer, Housing is the business cycle. In: Federal Reserve Bank of
Kansas City Economic Symposium Conference Proceedings, pp. 149–233, 2007.

[12] C. L. Gilbert, How to understand high food prices. Journal of Agricultural
Economics, 61, 2, pp. 398–425, 2010. →
[13] D. G. Baur and B. M. Lucey, Is gold a hedge or a safe haven? An analysis of
stocks, bonds and gold. Financial Review, 45, 2, pp. 217–229, 2010. →
[14] R. Hashemi, O. Ardakani, J. Young and A. Bahrami, An RNN model for
exploring the macroeconomic and financial indicators in the context of the
COVID-19 pandemic. In: The 2022 International Conference on Computational
Science and Computational Intelligence (CSCI), Las Vegas, Nevada, pp.653–658,
December 2022. →
[15] M. Qi, Predicting US recessions with leading indicators via neural network
models. International Journal of Forecasting, 17, 3, pp. 383–401, 2001. →
[16] A. Joseph, M. Larrain and E. Singh, Predictive ability of the interest rate
spread using neural networks. Procedia Computer Science, 6, pp. 207–212,
2011. →
[17] A. Vidal and W. Kristjanpoller, Gold volatility prediction using a CNN-LSTM
approach. Expert Systems with Applications, 157, pp. 113481, 2020. →
[18] A. Lazcano, P. J. Herrera and M. Monge, A combined model based on
recurrent neural networks and graph convolutional networks for financial time
series forecasting. Mathematics, 11, 1, pp. 224, 2023. →
[19] M. Puglia and A. Tucker, Neural networks, the treasury yield curve, and
recession forecasting. The Journal of Financial Data Science, 3, 2, pp. 149–175,
2021. →
[20] L. Longo, M. Riccaboni and A. Rungi, A neural network ensemble approach
for GDP forecasting. Journal of Economic Dynamics and Control, 134, pp. 104278,
2022. →
[21] Z. Wang, K. Li, S. Q. Xia and H. Liu, Economic recession prediction using deep
neural network. The Journal of Financial Data Science, 4, 3, pp. 108–127, 2022. →
[22] C. Zhan, Z. Wu, Y. Liu, Z. Xie and W. Chen, Housing prices prediction with
deep learning: An application for the real estate market in Taiwan. In: 2020 IEEE
18th International Conference on Industrial Informatics, IEEE, (INDIN)(Vol. 1, pp.
719–724), 2020, July. →
[23] F. Lorenz, J. Willwersch, M. Cajias and F. Fuerst, Interpretable machine
learning for real estate market analysis. Real Estate Economics, 51, 5, pp. 1178–
1208, 2023. →
[24] J. S. Kamdem, R. B. Essomba and J. N. Berinyuy, Deep learning models for
forecasting and analyzing the implications of COVID-19 spread on some
commodities markets volatilities. Chaos, Solitons & Fractals, 140, pp. 110215,
2020. →
[25] H. Ben Ameur, S. Boubaker, Z. Ftiti, et al., Forecasting commodity prices: Empirical evidence using deep learning tools. Annals of Operations Research, 339, pp. 349–367, 2024. →https://doi.org/10.1007/s10479-022-05076-6. →
[26] P. Oktoviany, R. Knobloch and R. Korn, A machine learning-based price state
prediction model for agricultural commodities using external factors. Decisions
in Economics and Finance, 44, 2, pp. 1063–1085, 2021. a, b
[27] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques,
Publisher: Morgan Kaufmann, 2012. →
[28] S. Haykin, Neural Networks: A Comprehensive Foundation, Publisher: IEEE
Press, 1994. a, b
[29] J. Bai and P. Perron, Computation and analysis of multiple structural change
models. Journal of Applied Econometrics, 18, 1, pp. 1–22, 2003. →
[30] J. Bai and P. Perron, Computation and analysis of multiple structural change
models. Journal of Applied Econometrics, 18, 1, pp. 1–22, 2003. →
[31] J. Bai, Least squares estimation of a shift in linear processes. Journal of Time
Series Analysis, 15, pp. 453–472, 1994. →
[32] J. Bai, Estimating multiple breaks one at a time. Econometric Theory, 13, pp.
315–352, 1997a.
[33] J. Bai and P. Perron, Estimating and testing linear models with multiple
structural changes. Econometrica, 66, pp. 47–78, 1998. →
[34] A. Zeileis, C. Kleiber, W. Krämer and K. Hornik, Testing and dating of
structural changes in practice. Computational Statistics & Data Analysis, 44, 1–2,
pp. 109–123, 2003. →
[35] D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016. →
[36] R. K. Malladi, Application of supervised machine learning techniques to forecast the COVID-19 U.S. recession and stock market crash. Computational Economics, 63, pp. 1021–1045, 2024. →https://doi.org/10.1007/s10614-022-10333-8. →
[37] A. Zaremba, R. Kizys, D. Y. Aharon and Z. Umar, Term spreads and the
COVID-19 pandemic: Evidence from international sovereign bond markets.
Finance Research Letters, 44, pp. 102042, 2022. →
11 Topological methods for vibration feature
extraction

Dean Lee
Jamal Rorie
Andrew Sabater

Abstract
We explore the application of topological data analysis (TDA) to extract
information from vibration sensor data. In Section 11.1, we describe the data
embedding requirements for applying TDA and study the empirical performance
of TDA, given different parameters of data embedding as well as under
degraded data collection conditions. Furthermore, we provide empirical
evidence of the correlation between select vibration metrics and TDA. The results
stated in this section are adapted from reference [→1]. In Section 11.2, we
describe an application of TDA to detect anomalies from vibration signals. The
methods described have indications of being useful to determine remaining
useful life of components.
Keywords: machine learning, topological data analysis, vibration analysis

11.1 Feasibility of TDA as a means for vibration feature extraction


Modern mechanical systems are instrumented with sensors that provide
information for system health diagnostics. Such information could be used to
infer health state of components in situ and online, so that appropriate
maintenance actions can be performed to minimize maintenance cost and
maximize system uptime. Traditionally, algorithms developed for these systems
have relied on expert intuition and signals processing techniques [→2, →3, →4,
→5] to develop features for fault diagnosis. These methods rely on subjective
opinions instead of data-driven metrics, however, and may bias the conclusions.
More recently, with the advent of deep learning, the industry has seen the
development of sophisticated machine learning algorithms for fault analysis
[→6, →7, →8, →9, →10]. Deep learning techniques, however, do not come
equipped with explanation for the results, and may require other means, such as
with methods from eXplainable artificial intelligence (XAI) [→11, →12], to provide
interpretation of the results. It is also unclear if deep learning algorithms can be
applied directly if the input sensor data are low quality, e.g., when the data are
made up of sparsely sampled data. Furthermore, due to a combination of large
model parameters and limited available training data generated from hardware
experiments used specifically to simulate fault development, deep learning
algorithms may perform poorly.
In contrast, topological data analysis (TDA) provides a new set of tools for
information extraction that is robust against noise, and often provides more
intuitive interpretation of results. Persistent homology [→13] in particular is a
technique used to transform the latent topological information from the data
into features for machine learning. In previous work [→14], we demonstrated
the feasibility of using TDA to determine faults in roller element bearings and
showed that the visualization of the topological features of vibration signals
convey important information about the health of the component. The
prerequisite step of TDA application, however, is the selection of an appropriate
embedding. In particular, the visibility graph [→15] was used in [→14] to embed
raw vibration data, from which point clouds are formed; the topological features
extracted from the point clouds are then used to train machine learning models
to classify faults. While the approach creates models with high accuracy, it leaves
open the question whether there are other embedding techniques that produce
even better results.
In this section, we explore different embedding techniques to transform
time series data for topological data analysis. The raw vibration data from a
rolling element bearing data set are embedded into a metric space, from which
topological features are extracted. Machine learning models are then built with
the features extracted from each embedding to classify faults. Additionally, the
models are also evaluated against artificially down-sampled data; such data are
common in certain domains where data storage capacity is limited, to determine
the feasibility of TDA in a degraded operational environment. Finally, the
correlation of the topological features to vibration metrics in the time domain
are investigated to determine the connection between the topological features
extracted from the various embeddings to the classical signals processing
methods. These correlations show that TDA is automating the creation of
features that are analogous to the vibration metrics.
In this report, we investigate the feasibility of TDA as a new framework for
extracting features from vibration data that is data-driven, where human biases
are minimized, and is robust on sparsely sampled data.

11.1.1 Motivation
In [→14], a visibility graph-based topological data analysis method was
proposed and applied on the bearing vibration data from both the Society for
Machinery Failure Prevention Technology (MFPT) and the Center for Intelligent
Maintenance Systems (IMS) to extract topological information for machine
learning. It was shown that a machine learning algorithm can be built to detect
various bearing faults, and that distinct topological structures were generated
from fault-induced vibrations.
However, the visibility graph [→15] is just one method to embed time series data, and the topological information extracted from point clouds is specific to
the embedding. Thus, an investigation of other embedding techniques and the
topological features that they generate is warranted to better understand the
efficacy of the proposed method. Moreover, other embeddings may produce
structures that provide deeper insight for fault diagnosis and potentially show
connections to various vibration metrics. This is crucial in establishing the
proposed method as a legitimate tool for vibration analysis.

11.1.2 Preliminaries

11.1.2.1 Persistent homology

The shapes of data are characterized through structures known as simplexes. A k-simplex σ_k = {v_0, v_1, …, v_k} is a convex hull formed by k + 1 vertices. A face of a simplex σ_k is a convex hull of a non-empty subset of σ_k. A simplicial complex is formed from a finite set of simplexes, such that the intersection of any pair of simplexes is either empty or a shared face. The boundaries of the simplicial complex K are defined by the chain groups: the k-th chain group C_k(K) is an abelian group made from all k-simplexes of the simplicial complex K. Formally, the boundary operator ∂_k : C_k → C_{k−1} satisfies the following properties: ∂_0 = 0, and ∂_{k−1} ∂_k = 0. A cycle is a chain with boundary of zero. Z_k = Ker ∂_k and B_k = Im ∂_{k+1} define the k-cycles and the k-boundaries of the simplicial complex K, respectively. The k-th homology group is defined as H_k = Z_k / B_k. The rank of H_k is the number of k-dimensional holes in the simplicial complex K; that is, the rank of H_0 is the number of connected components, the rank of H_1 is the number of loops, etc. The filtration of the simplicial complex K is a nested family of subcomplexes (K_r)_{r∈T}, where T ⊆ R, such that for any r, r′ ∈ T, if r ≤ r′ then K_r ⊆ K_{r′}, and K = ∪_{r∈T} K_r. Given a set of points in some metric space, the Vietoris-Rips complex at scale r is the set of simplexes [x_0, x_1, …, x_k] such that d(x_i, x_j) ≤ r, for some distance function d [→16, →40].


Persistent homology [→13] tracks the changes in topological structures at
various filtration parameters, and a persistence diagram provides a summary of
the filtration process by capturing the formation and destruction of simplicial
complexes throughout the process. More precisely, each topological feature $i$ in a homology group is denoted by a pair $(b_i, d_i)$, where $b_i$ and $d_i$ denote the birth and death times of $i$, respectively. The persistence diagram is then a multiset of birth and death times for each homology group. Various point summaries can
be extracted from persistence diagrams [→17, →18, →19] and used as features
for machine learning algorithms. In this section, we consider the weighted
persistence silhouettes [→17], persistence landscape [→17], and the Betti curve
[→20] as features for machine learning models, and use the implementation of
the python package giotto-tda [→21] to compute these point summaries.
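To make this concrete, the following minimal sketch (assuming the giotto-tda package, and using small random point clouds in place of embedded vibration data) computes Vietoris-Rips persistence diagrams and the three point summaries used in this section.

```python
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import Silhouette, PersistenceLandscape, BettiCurve

# Toy collection of two point clouds, shape (n_samples, n_points, n_dims);
# in this section each point cloud would come from embedded vibration data.
rng = np.random.default_rng(0)
point_clouds = rng.normal(size=(2, 100, 3))

# Vietoris-Rips filtration tracking connected components (H0) and loops (H1)
vr = VietorisRipsPersistence(homology_dimensions=(0, 1))
diagrams = vr.fit_transform(point_clouds)   # one persistence diagram per point cloud

# Point summaries used as machine learning features in this section
silhouette = Silhouette(power=1.0).fit_transform(diagrams)
landscape = PersistenceLandscape().fit_transform(diagrams)
betti = BettiCurve().fit_transform(diagrams)
print(silhouette.shape, landscape.shape, betti.shape)
```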

11.1.2.2 Embeddings

Time series data are embedded in a space from which point clouds can be formed; crucially, an appropriate distance metric must be selected for the space so that persistent homology can be calculated for the time series data. We investigate different embeddings, as well as the distance metrics associated with the embeddings, below. Note that in each of the embeddings, a sliding window of values is generated from the time series.
1) Vector space: Each window in the time series is transformed into a
vector in vector space. It is customary to use the Euclidean distance as the
default metric to quantify the difference between vectors, and indeed, the
Euclidean distance is the default of many machine learning software
packages. However, it is not always the appropriate distance measure. It is
noted in [→22] that the Euclidean distance breaks down even in relatively
low dimensions; thus, the Manhattan distance is used as the distance metric for this vector space embedding.
2) Probability distributions: The values of each window are normalized and
binned to create a vector of frequencies, or empirical distributions of the
data. In transforming the time series data into distributions, information
can still be extracted despite deficiencies in the data, such as sparsity,
irregular sampling, etc. The Jensen-Shannon divergence metric [→23] is
used to quantify the similarity between two distributions.
3) Visibility graphs: The visibility graph [→15] provides a framework for
embedding time series data into graphs. Given a time series $\{(t_i, y_i)\}$, a vertex is created for every corresponding $t_i$, and an edge is created between vertices $t_a$ and $t_b$ if, for any $(t_c, y_c)$ with $t_a < t_c < t_b$,

$$y_c < y_b + (y_a - y_b)\frac{t_b - t_c}{t_b - t_a}.$$
Intuitively, the time series data is a series of peaks and valleys: every
vertex in the graph corresponds to a sample from the time series, and an
edge is constructed between two vertices if their corresponding peaks in
the time series are visible to each other. A visualization of this is presented in Reference [→15].
The visibility graphs can be used to quantify local and global information in the form of graph structures: a highly connected community of nodes in the graph indicates that the corresponding values in the time series are close in value (hence, visible to each other); the links that connect the
disparate communities are defined by the spikes in the time series signal,
which form “bridges” to the different communities. In this sense, the
visibility graphs capture interesting information from the time series
signal. The Euclidean distance is used to calculate the edge weights in the
visibility graph. The visibility graph implementation of ts2vg [→24] is used
in the analysis.
4) Takens embedding: Takens embedding [→25] embeds the time series data into higher dimensions. More formally, given a time series $f(t)$, an embedding dimension $d$, a time delay $\tau$, and time $t_i$, the embedding produces vectors of the form

$$(f(t_i), f(t_i + \tau), f(t_i + 2\tau), \ldots, f(t_i + (d-1)\tau)).$$

These vectors form point clouds, to which persistent homology is applied to extract topological information. We use the giotto-tda [→21] implementation of Takens embedding in this section.
5) Symbolic Aggregation approXimation (SAX): The SAX algorithm is a time
series dimensionality reduction method introduced in [→26]. The method
transforms a time series into a series of symbols, which are concatenated
into words. The SAX transformation is of interest as it has been shown to
detect anomalies in time series data [→27].
SAX starts by standardizing the time series data. Sliding windows of size w are created from the data, and within each window, m equal-sized segments are created; the mean value of each segment is then computed. The value of m is the word size output by SAX. An alphabet
of size n is also chosen; the value of n determines the number of equal-
sized bins under the N (0, 1) distribution, and each bin is assigned a
letter. Finally, the mean values computed for each of the m segments are
converted to letters according to where the values fall in the bins defined
for N (0, 1) ; the concatenation of the m letters forms individual words.
We define an embedding by leveraging the SAX transformation of time
series data. For each of the windows of data, a series of words is formed
via SAX, and the empirical distributions of the words are computed. In
other words, vectors of word counts are generated from each window of
data. The Jensen-Shannon divergence metric is again used to quantify
pairwise distances among word distributions.
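A minimal sketch of this SAX-based embedding, written directly from the description above rather than from the reference implementation of [→26], is given below: a window is standardized, its sub-windows are converted to SAX words, and the resulting word-count distributions are compared with the Jensen-Shannon distance. The window, word, and alphabet sizes are illustrative choices.

```python
import numpy as np
from itertools import product
from scipy.stats import norm
from scipy.spatial.distance import jensenshannon

ALPHABET = "abcd"        # alphabet size n = 4 (illustrative)
WORD_SIZE = 3            # m segments per word (illustrative)

def sax_word(segment_means, breakpoints):
    """Map segment means to letters by their bin under the N(0, 1) breakpoints."""
    return "".join(ALPHABET[i] for i in np.digitize(segment_means, breakpoints))

def sax_word_distribution(window, subwindow=30):
    """Standardize a window, SAX-transform its sub-windows, and return word frequencies."""
    x = (window - window.mean()) / (window.std() + 1e-12)
    # Breakpoints splitting N(0, 1) into len(ALPHABET) equiprobable bins
    breakpoints = norm.ppf(np.linspace(0, 1, len(ALPHABET) + 1)[1:-1])
    words = []
    for start in range(0, len(x) - subwindow + 1, subwindow):
        seg = x[start:start + subwindow]
        # m equal-sized contiguous segments, one mean per segment
        means = seg[: (subwindow // WORD_SIZE) * WORD_SIZE].reshape(WORD_SIZE, -1).mean(axis=1)
        words.append(sax_word(means, breakpoints))
    vocab = ["".join(w) for w in product(ALPHABET, repeat=WORD_SIZE)]
    counts = np.array([words.count(w) for w in vocab], dtype=float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
w1 = rng.normal(size=600)
w2 = rng.normal(size=600) + np.sin(np.arange(600) / 5)
d = jensenshannon(sax_word_distribution(w1), sax_word_distribution(w2))
print(f"Jensen-Shannon distance between word distributions: {d:.3f}")
```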

11.1.3 Methodology
The notional pipeline for generating the topological features is shown in
→Figure 11.1.
A fixed-width sliding window is created from the raw vibration data. These windowed data are embedded into the spaces proposed in Section 11.1.2.2, from which point clouds are constructed using the appropriate distance measures. In the topological feature extraction step, persistent homology is
computed on the point clouds, from which persistence diagrams and point
summaries are generated as features for machine learning models. We use the
persistent homology implementation of giotto-tda [→21] for all analyses in this
section, and we use the scikit-learn version 1.2.2 [→28] implementation of
Random Forest with default settings as the base machine learning model.
Finally, the data is split into train, test, and validation sets, and the validation set
is used to find optimal parameters for each of the proposed embeddings.
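As a simplified, illustrative rendering of this pipeline (not the exact configuration used in the experiments), the sketch below uses the distribution embedding: consecutive windows are binned into empirical distributions, groups of consecutive windows form point clouds under the Jensen-Shannon metric, and scalar amplitudes of the resulting persistence diagrams feed a Random Forest classifier. The window, bin, and group sizes, and the synthetic signals, are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import Amplitude
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def window_histograms(signal, window=200, bins=16):
    """Slice the signal into fixed-width windows and bin each window into a distribution."""
    n = (len(signal) - window) // window + 1
    wins = np.stack([signal[i * window:i * window + window] for i in range(n)])
    wins = (wins - wins.mean(axis=1, keepdims=True)) / (wins.std(axis=1, keepdims=True) + 1e-12)
    return np.stack([np.histogram(w, bins=bins, range=(-4, 4), density=True)[0] + 1e-12
                     for w in wins])

def point_cloud_distance_matrices(hists, group=20):
    """Group consecutive window-distributions into point clouds; return JS distance matrices."""
    mats = []
    for i in range(0, len(hists) - group + 1, group):
        block = hists[i:i + group]
        block = block / block.sum(axis=1, keepdims=True)
        mats.append(squareform(pdist(block, metric="jensenshannon")))
    return np.stack(mats)

# Two synthetic "fault classes" stand in for the vibration recordings
rng = np.random.default_rng(1)
healthy = rng.normal(size=200_000)
faulty = rng.normal(size=200_000) + 0.5 * np.sign(np.sin(np.arange(200_000) / 20.0))

X, y = [], []
for label, sig in enumerate([healthy, faulty]):
    D = point_cloud_distance_matrices(window_histograms(sig))
    X.append(D)
    y.extend([label] * len(D))
X = np.concatenate(X)

# Persistent homology on precomputed distance matrices, then scalar diagram amplitudes
diagrams = VietorisRipsPersistence(metric="precomputed",
                                   homology_dimensions=(0, 1)).fit_transform(X)
features = Amplitude(metric="betti").fit_transform(diagrams)

X_tr, X_te, y_tr, y_te = train_test_split(features, np.array(y), random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```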

Figure 11.1: Notional topological data analysis pipeline.

11.1.4 Data
We analyze the Case Western Reserve University (CWRU) bearing data set [→29]
to examine the effects of the proposed embeddings on the generated
topological features. The data set is made up of vibration measurements from
healthy bearings and from bearings with seeded faults. The seeded faults are
introduced via the electro-discharging machining methods at various depths.
Three types of faults are introduced: inner raceway, ball-bearing, and outer
raceway faults. The vibration measurements are collected at the drive-end and
the fan-end of the test apparatus, and the data are collected at various motor
loads. While the data set contains drive-end and fan-end bearing fault information, we only consider the drive-end faults for the following analyses, as that portion of the data set is more complete. Furthermore, only the data collected at 12,000 samples/second are considered to ensure uniformity of analysis, as the baseline data are only collected at 12,000 samples/second.
The main objective of the analysis is to determine the effectiveness of
topological data analysis through fault classification, and three sets of
experiments are conducted using the CWRU data set. In the first set of
experiments, models are built from topological features extracted from the
previously described embeddings. These models are trained to classify different
faults at various motor loads and fault depths. In the second set of experiments,
artificial down-sampling is applied to the data set to simulate operational
environments, where data may be stored in pared down form to reduce storage
footprint. A down-sampling factor of n used here means that every nth sample
is kept from the original data. In these experiments, we investigate the
performance of the classification models at various levels of down-sampling
factors to understand how topological data analysis may perform in real-world
settings.
The connections to some classical vibration metrics are explored in the third
set of experiments. Common vibration metrics such as peak acceleration, root-
mean-square, crest factor, and kurtosis are correlated with the topological
features to show possible connections.

11.1.5 Results

11.1.5.1 Model performance on full data

The model performance, based on the topological features extracted from the
different embeddings, is shown in →Figures 11.2 and →11.3. In general, the
models perform better at fault depths of 7 and 21 millimeters. It is seen that the
distribution embedding performs the best out of the five embeddings
considered: the model performance is relatively stable and accurate across
different loads and fault depths. This suggests that by binning the data samples
appropriately, the differences in the vibrations among the various fault
categories can be quantified as distinct topological features for machine
learning.
Figure 11.2: Cross-validated training set performance of models built with
features from the various embeddings.
Figure 11.3: Test performance of models built with features from various
embeddings.

The vector space model performs relatively well at fault depth of seven
millimeters; at other fault depths, the performance starts to degrade. This may
be because the vectors are not capturing enough of the intrinsic differences
among the various fault depths.
The visibility graph model also demonstrates inconsistent performance,
which may be attributed to its sensitivity to small changes in vibration
measurements; that is, entirely new communities in the graph may appear or
disappear due to slight changes in the vibration measurements. Furthermore, it
stands to reason that if the vibration sensors are not calibrated correctly, the
visibility graph method may yield poor performance.
The SAX embedding model appears to be relatively unstable, which may be
due to the standardization of the data for forming the words, and that may mask
the subtle differences in the changes in the vibrations across the various fault
depths.
The Takens embedding model has the worst performance. This suggests
that this embedding technique does not generate features that generalize well
across different fault categories.

11.1.5.2 Model performance on down-sampled data

The model performance at various levels of down-sampling of the data is shown in →Figures 11.4 and →11.5 for motor loads at 0 HP, and in →Figures 11.6 and →11.7 for motor loads at 3 HP, to highlight changes in performance at the extreme ends of motor load values; results for the other motor loads are not shown for brevity.
Figure 11.4: Cross-validated performance of models built with features from
various embeddings, and with data collected at motor load 0 HP and down-
sampled across various factors.
Figure 11.5: Test performance of models built with features from various
embeddings, and with data collected at motor load 0 HP and down-sampled
across various factors.
Figure 11.6: Cross-validated performance of models built with features from
various embeddings, and with data collected at motor load 3 HP and down-
sampled across various factors.
Figure 11.7: Test performance of models built with features from various
embeddings, and with data collected at motor load 3 HP and down-sampled
across various factors.

In general, we see that the vector space embedding model performance stays
relatively stable over various down-sampling factors, as well as across different
loads. The distribution embedding model, however, becomes more and more
unstable as the data become sparser, as seen from →Figures 11.4 and →11.6.
One reason may be that as the data become sparser, the empirical distributions
no longer represent the underlying distribution.
The SAX embedding model also suffers from the same issue that causes the
instability in the distribution embedding model, albeit the problem is worse. As
the data become sparser, the standardization of the sparse data obscures the
changes in vibrations, and SAX representations are poor features for
distinguishing among various faults.
The visibility graph model also has spots of instability as the down-sampling
factors vary. There is a dip in performance at down-sampling factor of three,
after which the performance improves until down-sampling factor of seven.
Surprisingly, the visibility graph model performance, even at extreme sparsity, is
on par with the visibility graph model on the full data set. This suggests that the
graph may be encoding intrinsic properties of the vibration signal.
The Takens embedding model again fares the worst out of all five models in
terms of its performance, although it was already shown in the previous section
that this particular embedding technique may not be suitable for generating
topological features that distinguish among various faults. Nevertheless, the
model performance appears to be more stable than that of the SAX embedding
model over various down-sampling factors.

11.1.5.3 Correlation to vibration analysis metrics

Given a window of vibration signals $x = (x_0, x_1, \ldots, x_{n-1})$, some vibration metrics that are frequently employed for analyses in the time domain are the following:
1. Peak acceleration: $\max_i |x_i|$,
2. Root mean square (RMS): $\sqrt{\frac{1}{n}\sum_i x_i^2}$,
3. Crest factor: $\frac{\max_i |x_i|}{\sqrt{\frac{1}{n}\sum_i x_i^2}}$, and
4. Kurtosis: $E\!\left[\left(\frac{x-\mu}{\sigma}\right)^4\right]$.

In general, these metrics are various ways of quantifying spikes in the signal.
Coupling expert intuition with accepted engineering standards (e.g., ISO 20816-
1:2016), these values could be used to detect imminent failure.
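For reference, these four time-domain metrics can be computed directly from a window of samples, as in the following minimal sketch.

```python
import numpy as np

def vibration_metrics(x):
    """Classical time-domain metrics for a window of vibration samples x."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))              # peak acceleration
    rms = np.sqrt(np.mean(x ** 2))        # root mean square
    crest = peak / rms                    # crest factor
    z = (x - x.mean()) / x.std()
    kurtosis = np.mean(z ** 4)            # kurtosis (non-excess)
    return {"peak": peak, "rms": rms, "crest": crest, "kurtosis": kurtosis}

print(vibration_metrics(np.sin(np.linspace(0, 20 * np.pi, 2048))))
```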
We examine the correlations between these classical vibration metrics and
the topological features extracted from the embeddings. We note that only the
drive-end data at fault depth of seven millimeters and motor load of 0 HP are
shown in following correlations analyses, as the results for the other fault depths
and motor loads are similar.
In →Figure 11.8, it is seen that the homology group $H_0$ of the persistence silhouette generated from the visibility graph embeddings has a strong correlation to the RMS metric. Furthermore, the visibility graph embedding and the Takens embedding have relatively strong correlations to the peak acceleration and RMS metrics, respectively.

Figure 11.8: The correlation coefficient values between vibration metrics for the
drive-end data with sliding window size of 50, and the H0 homology group
generated from the persistent silhouette of the various embeddings.

We examine the correlations systematically by varying the sliding window size from 50 to 2000 and looking for absolute values of the correlation coefficients that are greater than the threshold value of 0.8. At one extreme, →Table 11.1 shows that only the visibility graph and the Takens embeddings have generated topological features that correlate to some vibration metrics, namely RMS and peak acceleration.

Table 11.1: The correlative relationships between topological features and vibration metrics at window size 50, where the absolute values of the correlation coefficient exceed 0.8.

Embedding        | Topological features         | Homology group | Vibration metric
Vector space     |                              |                |
Distribution     |                              |                |
Visibility graph | Silhouette, landscape, Betti | H0             | RMS, peak acceleration
SAX              |                              |                |
Takens           | Silhouette, landscape, Betti | H0             | RMS, peak acceleration

If no correlation coefficient exceeds this threshold, then the row of cells is left blank.

In →Table 11.2, we show the results for window size 2000, which is the point at
which all the embeddings start to generate topological features that correlate to
some vibration metrics. It can be seen that the RMS metric is strongly correlated
to all five of the embeddings. Furthermore, the distribution and SAX embeddings
generate topological features that correlate to all four of the vibration metrics.

Table 11.2: The correlative relationships between topological features and vibration metrics at window size 2000, where the absolute values of the correlation coefficient exceed 0.8.

Embedding        | Topological features         | Homology group | Vibration metric
Vector space     | Silhouette, landscape, Betti | H0             | RMS
Distribution     | Silhouette, landscape, Betti | H0, H1         | RMS, peak acceleration, crest factor, kurtosis
Visibility graph | Silhouette, Betti            | H0             | RMS
SAX              | Silhouette, landscape, Betti | H0, H1         | RMS, peak acceleration, crest factor, kurtosis
Takens           | Silhouette                   | H0             | RMS

→Figure 11.9 provides an additional view of the correlation coefficients for the
distribution embedding and the persistence silhouette. Note the strong
correlation and anticorrelation between the distribution embedding and all four
of the vibration metrics.

Figure 11.9: The correlation coefficient values between vibration metrics for the
drive-end data with sliding window size of 2000, and the H0 homology group
generated from the persistent silhouette of the various embeddings.

We note also that the topological features from the homology groups do not
seem to have a direct connection to the vibration metrics: while homology group
$H_1$ captures higher order topological structures than homology group $H_0$, the
correlations are not necessarily restricted to the higher order vibration metrics,
such as crest factor.
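A minimal sketch of this screening step, using placeholder arrays in place of the actual per-window topological features and vibration metrics, is the following: compute Pearson correlations for every feature-metric pair and report those whose absolute value exceeds 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)
n_windows = 500
# Placeholder series; in the analysis these are one value per sliding window,
# from the TDA pipeline and from the time-domain vibration metrics, respectively.
rms = rng.normal(size=n_windows)
peak = rng.normal(size=n_windows)
topo_features = {"silhouette_H0": 0.9 * rms + 0.3 * rng.normal(size=n_windows),
                 "betti_H0": rng.normal(size=n_windows)}
vib_metrics = {"rms": rms, "peak": peak}

threshold = 0.8
for f_name, f in topo_features.items():
    for m_name, m in vib_metrics.items():
        r = np.corrcoef(f, m)[0, 1]       # Pearson correlation coefficient
        if abs(r) > threshold:
            print(f"{f_name} vs {m_name}: r = {r:.2f}")
```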

11.1.6 Discussions
In general, the topological features provide good class separability for fault
classification. The vector space and the distribution embedding perform well
across all loads and fault conditions. On the other hand, we see that the SAX
embedding has uneven performance as a classifier. We also examined the
visibility graph as an embedding, and found that its performance is not comparable to that of the vector space and distribution embeddings. These
differences in model performances are exacerbated by the down-sampling of
the data.
We also note that SAX embedding does not seem to be an ideal topological
feature generator for fault classification. SAX is originally designed to detect
anomalies within a single time series; the poor classifier performance seems to
indicate that the vibrations encoded for a particular measurement do not
generalize to other measurements.
We also investigated the potential correlations between the topological
features and classical vibration metrics. While these correlations are sensitive to
the choice of window sizes, the strong correlation in some cases to known vibration metrics suggests that topological data analysis may be used to automate vibration metric selection, which is usually done based on human
intuition. Furthermore, given the good performance of the vector space and
distribution embeddings, the lack of correlations between their topological
features and the vibration metrics merely suggests that the topological features
may be better for fault diagnosis than some of the vibration metrics.

11.1.7 Postscript
In this section, we examined different embeddings for vibration signals, from
which topological features can be generated. We then used these topological
features to build models to classify faults with the CWRU data set across various
conditions as well as artificially down-sampling the data to examine the
performance of these features for fault analysis. It was found that while some
embeddings generate features that perform well, even in degraded conditions,
other embeddings suffer from poor model performance.
The potential correlations of the topological features to some classical time-
domain vibration metrics are also examined. It was found that some
embeddings have features that have consistently strong correlations to the
vibration metrics, suggesting that these embeddings can automatically generate
these vibration metrics.
We believe that TDA is a valuable addition to the world of vibration analysis.
While we have shown in this section that some topological features provide good
class separability for fault classifiers, there may be yet undiscovered topological
features that are tuned specifically for vibration analyses that yield even better
performance. Moreover, the topological features investigated in this section are based on the Vietoris-Rips method of simplicial complex construction; the
investigation of other simplicial complexes and additional topological features
for vibration analysis remains a topic for future work.

11.2 Anomaly detection


In the previous section, we empirically demonstrated the utility of TDA for
extracting useful information and for classifying component health states. In
practice, it is often desirable to have a capability to detect anomalies before they
become serious problems. In this section, we investigate the use of TDA as an
anomaly detection tool.

11.2.1 Methodology: a chaotic approach


The field of nonlinear dynamics is rife with tools for detecting changes in state.
In the case of cyclostationary signals, such as bearing vibration, we are
interested in modeling the signal so that changes in state, e.g., from chaotic to
periodic, are detected. These changes may then be used to find anomalies in the
signal.

11.2.1.1 Chaos and Lyapunov exponents

The discrimination between chaotic and stochastic processes can be challenging


as the metrics used to quantify chaos are sensitive to noise. Chaotic and
stochastic time series also have similar properties. For example, over a sufficient
time period, the current and future state of both chaotic and stochastic time
series may be uncorrelated. Thus, one of the central features of chaotic
phenomena that differentiates it from stochastic time series is the chaotic
phenomena’s dependence on initial conditions, which implies that trajectories
separate exponentially fast. The quantification of this separation is done with
Lyapunov exponents, which are determined for each dimension of the system.
The largest Lyapunov exponent provides information about the stability of the
system.
Two classic methods to estimate the largest Lyapunov exponent are Wolf's method [→30] and Rosenstein's method [→31]. Given an infinite
amount of noise-free data, Wolf’s method can recover all of the Lyapunov
exponents. While, in theory, Wolf’s method is expected to work well, in practice,
the values of embedding dimension and delays must be carefully selected. It is
also known that with sufficient levels of additive noise, Wolf’s method does not
accurately predict the largest Lyapunov exponents, and that the method is
limited to low-dimensional dynamical systems.
Rosenstein’s method is similar to Wolf’s method with key differences with
Rosenstein’s methods being that all available data are used and the divergence
distance is determined over many different time scales. The largest Lyapunov
exponent is then determined where the logarithm of the divergence distance
grows linearly with the changing time scales. As with Wolf’s method,
Rosenstein’s method is sensitive to embedding dimension, delay value, and the
appropriate selection of the correct time scale to estimate divergence. In the
presence of moderate signal-to-noise ratios in the range of 100 to 1,000, errors
on the order of 25% are typical [→32]. For lower signal-to-noise ratios, it is
difficult to guarantee the accuracy of Rosenstein’s method.

11.2.1.2 Methodology

More recently, Gottwald et al. [→33, →34] introduced the 0–1 test, which
produces a binary output, to indicate the presence of chaos. By removing the
need for phase space reconstruction, Gottwald’s method is more appropriate for
high-dimensional systems. The 0–1 test takes a time series signal $\phi$, indexed by $j$, for $j = 1, 2, \ldots, N$, and produces a p–q projection [→34] of the data according to the following equations:

$$p_c(n) = \sum_{j=1}^{n} \phi(j)\cos(jc), \qquad (11.1)$$

$$q_c(n) = \sum_{j=1}^{n} \phi(j)\sin(jc), \qquad (11.2)$$

for $n = 1, 2, \ldots, N$, and $c \sim U(0, \pi)$. These projections are then either used in
a regression test or a correlation test for chaos. There are two significant
limitations with Gottwald’s methods. The first is that while the method
converges in the infinite limit for continuous systems, a signal may be incorrectly
classified as periodic if a finite segment with discrete sampling is oversampled.
The second issue is the incorrect identification of a time series as chaotic when a
periodic time series is contaminated with substantial noise.
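Equations (11.1) and (11.2) translate directly into code; the sketch below computes the p–q projection of a signal for one randomly drawn value of $c$ (the signal here is synthetic).

```python
import numpy as np

def pq_projection(phi, c):
    """0-1 test p-q projection of a signal phi for a fixed angle c (eqs. 11.1-11.2)."""
    j = np.arange(1, len(phi) + 1)
    p = np.cumsum(phi * np.cos(j * c))   # p_c(n) = sum_{j<=n} phi(j) cos(jc)
    q = np.cumsum(phi * np.sin(j * c))   # q_c(n) = sum_{j<=n} phi(j) sin(jc)
    return p, q

rng = np.random.default_rng(0)
phi = np.sin(0.3 * np.arange(2000)) + 0.1 * rng.normal(size=2000)
c = rng.uniform(0, np.pi)                # c ~ U(0, pi)
p, q = pq_projection(phi, c)
```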
To address the aforementioned issues, Tempelman and Khasawneh studied the topological properties of the p–q projections for several classical nonlinear dynamical systems in [→35], through a TDA pipeline similar to the one we studied in Section 11.1. The main difference is that an explicit embedding is not constructed from the data; rather, an estimated function is created from the p–q projections, from which the persistent homology is computed by way of sublevel set filtration.
In the following analysis, we take a general approach similar to [→35] and
identify possible anomalies in the system: the sensor data from the system is
converted to its p–q projections, followed by sublevel set filtrations on the
projections, and various summary metrics of the resulting persistence diagrams
are investigated to determine the presence of anomalies.

11.2.2 Data preprocessing


We consider the bearing vibration data from the Center for Intelligent
Maintenance Systems (IMS), University of Cincinnati [→36]. The data set is made
up of three run-to-failure bearing experiments. The rig consists of an AC motor
that is coupled to a shaft that holds four bearings, and the motor is operated at
a constant speed. The bearings in the center of the shaft are loaded and the rig
is operated until the accumulation of debris on a magnetic plug exceeds a
predetermined level that is indicative of imminent failure. The experiment was
repeated three times with the following results: in the first experiment, bearing
three experienced an inner race fault and bearing four experienced a rolling
element fault; in the second experiment, bearing one experienced an outer race
fault; in the third experiment, bearing three experienced an outer race fault.
Gousseau et al. [→37] noted that the vibration data for the third experiment is
problematic and also noted the absence of its use from literature; thus, we will
not consider the data from experiment three in our analysis.
Bearing faults tend to give rise to cyclostationary noise [→2], and so the
envelope of the signal is used to extract periodic signals. An established method
to compute the envelope is to bandpass filter the signal such that the kurtosis is
maximized within the passband [→2]. After demodulating the signal within this
band with the maximum kurtosis, the signal is first decimated, then the analytic
signal is estimated using a Hilbert transform, and finally the decimated signal
along with its associated analytic signal is used to estimate the envelope.
In order to avoid generating different lengths of signals due to decimation,
for a given experiment and channel, the maximum kurtosis value was
determined for all the 1 s windows of data. From this window, the kurtogram
method was used to determine the bandpass filter parameters that are applied
to all the windows of data. This ensures that after filtering and decimation, the
resulting lengths of the envelope signals are the same for that experiment and
channel.
After the envelope is estimated, the time series is subsampled based on the
method presented by Melosik and Marszalek, where the subsampling rate is
selected to be approximately three times higher than the highest frequency in the time series [→38].
Lastly, the p–q projections of the subsampled time series are calculated. To
minimize potential resonances associated with particular choices of c from eqs.
(11.1) and (11.2), 200 c values are randomly sampled and the p–q projections are
computed for each value. Mirroring [→35], the projections are converted to a
density estimate using Gaussian smoothing.
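The envelope estimation described above can be sketched as follows. The passband is a fixed, hypothetical choice standing in for the kurtogram-selected band, and the sampling rate, decimation factor, and signal are likewise illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, decimate

def envelope(signal, fs, band=(2000.0, 4000.0), decimation=4):
    """Bandpass-filter, decimate, and Hilbert-demodulate a vibration signal.

    `band` is a placeholder; in the analysis it is chosen via the kurtogram so
    that kurtosis is maximized within the passband.
    """
    b, a = butter(4, band, btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, signal)
    reduced = decimate(filtered, decimation)    # decimate after demodulation band
    analytic = hilbert(reduced)                 # analytic signal via Hilbert transform
    return np.abs(analytic)                     # envelope estimate

fs = 20_000.0                                   # sampling rate (assumed for this sketch)
rng = np.random.default_rng(0)
x = rng.normal(size=int(fs))                    # one 1 s window of synthetic vibration
env = envelope(x, fs)
```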

11.2.3 Analysis
Since the IMS data set does not include when the bearing faults develop, the
only information we can glean from the data is when the experiments stopped
due to the faults, as well as rely on results from literature, e.g., Gousseau et al.
[→37], as baselines to detect certain faults. However, we still demonstrate how
TDA can be used to detect possible anomalies from the vibration signal; we show
one such approach in this section.

11.2.3.1 Extraction of features from persistence diagrams

There are 200 persistence diagrams generated per time step, since 200 values of $c$ are selected to create the p–q projections. We compute the amplitudes of the Betti curves from each persistence diagram using giotto-tda [→21], compute the mean amplitude for each time step, and set a threshold of three standard deviations to determine anomalies. The analysis results of trial one are in →Figure 11.10. Bearings three (channels 5 and 6) and four (channels 7 and 8) develop faults in the experiment, and we note that there are regions in the plots for channels 5–8 where the amplitudes clearly exceed the threshold for an extended period of time. Note that $L_\infty$ denotes the Chebyshev distance.

Figure 11.10: Trial 1 Betti curve amplitudes computed with the $L_\infty$ distance.

Note that the faults developed in bearings three (channels 5 and 6) and four
(channels 7 and 8), and that there are clear regions in channels 5–8 where the
amplitude values clearly exceed the threshold for an extended period of time.
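A minimal sketch of the amplitude-and-threshold step follows: the maximum of each Betti curve is taken as its $L_\infty$ (Chebyshev) amplitude, averaged over the 200 values of $c$ at each time step, and time steps exceeding the mean plus three standard deviations are flagged. The Betti curves here are random placeholders, and the threshold is computed over the whole run for brevity (a healthy baseline period could be used instead).

```python
import numpy as np

# Placeholder Betti curves standing in for the curves of the 200 persistence
# diagrams computed at each time step: shape (time steps, c values, filtration bins).
rng = np.random.default_rng(0)
n_steps, n_c, n_bins = 300, 200, 100
betti_curves = rng.poisson(3.0, size=(n_steps, n_c, n_bins)).astype(float)
betti_curves[290:] += 6.0                       # synthetic "fault" late in the run

amplitudes = betti_curves.max(axis=2)           # L-infinity amplitude of each curve
mean_amp = amplitudes.mean(axis=1)              # mean over the 200 values of c

threshold = mean_amp.mean() + 3.0 * mean_amp.std()   # three-standard-deviation threshold
print("time steps flagged as anomalous:", np.where(mean_amp > threshold)[0])
```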

To compare the results against a known method for detecting faults, we overlay
the amplitude exceedances for channels 5, 6, 7, and 8 on the standard deviations
of the raw values of those channels. It is understood that sudden changes in
standard deviations may indicate the presence of a fault and that there are accepted standard deviation thresholds for shutting down machinery, based on the class of machinery, e.g., ISO 20816-1:2016. In →Figure 11.11,
we note that in most cases, the exceedances happen near where the rate of
change in the standard deviation accelerates; furthermore, in channels 6
(bearing three) and 8 (bearing four), the exceedances occur days before the
standard deviation values peak around day 31, suggesting that the exceedances
may be used as early indicators for faults. As noted in Gousseau, the inner race
fault is detectable using an established method like bearing envelope analysis
on channels 5 and 6 around 32 days. The grouping of Betti curve exceedances on
channel 6 is indicative that this method is able to provide early warning of this
fault. Also from Gousseau, the rolling element fault is detectable using bearing
envelope analysis on channels 7 and 8 at around 25 days. While the Betti curve exceedances occur a few days after the fault becomes detectable with bearing envelope analysis, they occur before significant changes in the standard deviation and before preprocessing steps like spectral pre-whitening are applied to account for baseline vibrations.

Figure 11.11: Trial 1 standard deviations of the raw vibration values for
channels 5, 6, 7, and 8, as well as the exceedances of the Betti curve amplitudes
for those channels.

In trial 2, bearing one (channel 1) develops a fault. The analysis results are
shown in →Figure 11.12. We note the spikes in values that exceed the three
standard deviation threshold in channel 1, while the other channels do not have
such exceedances, except for channel 2. Upon further examination of the data, it
was determined that the drop in amplitudes in channel 2 toward the end of the
experiment is an artifact of the subsampling method, prior to p–q projection
calculations, where broad frequency noise in the envelope signal resulted in a
large subsampling increment.
Figure 11.12: Trial 2 Betti curve amplitudes computed with the $L_\infty$ distance.

Bearing one (channel 1) develops a fault. Note the exceedance of the threshold
toward the end of the experiment. The exceedance detected in channel 2 is an
artifact of the data preprocessing caused by insufficient data.

The exceedances are plotted against the standard deviation of the raw values for
channel 1 in →Figure 11.13. We again note that the exceedances align with the
peaking of the standard deviations, and that the exceedances also detected
anomalies well before the standard deviations start to peak. Reference [→37]
notes that the fault is physically detectable, starting around 4.5 days, so the
amplitude exceedances around day 5 are in line with the expectations.

Figure 11.13: Trial 2 standard deviations of the raw vibration values for channel
1, as well as the exceedances of the Betti curve amplitudes for that channel.
11.2.4 Discussion
As noted, the IMS data set is provided without additional information, such as
exactly when the faults develop. Therefore, we can only demonstrate the
effectiveness of the proposed topological method, knowing that the faults
presumably developed at some point close to the end of the trials. We used the
standard deviations of the raw vibration data as a baseline method to gauge the
performance of the proposed topological method. We note that the proposed
method detected anomalies where the standard deviation values were rapidly
changing, a sign that bearing faults are developing. In some cases, the
exceedances cluster before and around where the standard deviations peak.
Therefore, the proposed method may be useful as early indicators for
developing fault conditions.
On the other hand, we note that there is the possibility of false positive
detections, as with any anomaly detection techniques. In →Figure 11.10, there
were early exceedances in channels 5 and 7. While it is usually straightforward to
tune the method to minimize false positives, we cannot do so in this case as we
do not have access to the states of the bearings at the start of the trials.
Nevertheless, we have demonstrated that the proposed method is easier to automate and interpret than more established methods for detecting incipient bearing faults, such as envelope analysis.

11.2.5 Future work


Extracting summaries from persistence diagrams is only one technique to
quantify persistence diagrams into features. It is also possible to quantify
pairwise distances directly on the persistence diagrams. The bottleneck and
Wasserstein distances are popular choices in the literature [→39]. The anomaly
detection problem then reduces to the comparison of a base persistence
diagram, which could be fixed in time or extracted from a rolling window, to a
moving persistence diagram.
There are several possibilities to detect anomalies from pairwise persistence
diagram distances, in the context of p–q projections. Recall that every value of
the random variable c produces a persistence diagram for every time step. It is
reasonable to compare pairwise distances for a fixed value of c and then
compute the mean of the distances for every time step to summarize the
vibration signals across the entire time period. It may also be reasonable to
compute a mean representation of all the persistence diagrams for each time
step, via the Fréchet mean, and compute the pairwise distances to summarize
the vibration signals to detect anomalies. The investigation of the effectiveness
of these approaches remains a direction for future work.
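As a sketch of this direction, assuming giotto-tda's PairwiseDistance transformer (which, with its default settings, aggregates across homology dimensions into a square matrix of diagram-to-diagram distances) and a time-ordered collection of persistence diagrams (random placeholders here), one can compare every diagram to a fixed baseline diagram and flag unusually large Wasserstein distances.

```python
import numpy as np
from gtda.diagrams import PairwiseDistance

# Placeholder persistence diagrams, shape (n_samples, n_points, 3): columns are
# birth, death, homology dimension. In practice these come from the p-q pipeline.
rng = np.random.default_rng(0)
births = rng.uniform(0, 1, size=(50, 30, 1))
deaths = births + rng.uniform(0, 1, size=(50, 30, 1))
dims = np.zeros((50, 30, 1))
diagrams = np.concatenate([births, deaths, dims], axis=2)

# Pairwise Wasserstein distances between all diagrams in the collection
D = PairwiseDistance(metric="wasserstein").fit_transform(diagrams)

baseline_idx = 0                      # fixed baseline; a rolling window is another option
dist_to_baseline = D[baseline_idx]
threshold = dist_to_baseline.mean() + 3 * dist_to_baseline.std()
print("anomalous time steps:", np.where(dist_to_baseline > threshold)[0])
```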
References
[1] D. Lee, J. Rorie, and A. Sabater, An investigation of time series embeddings
and topological data analysis for fault analysis. In: The 19th International
Conference on Data Science, Las Vegas, NV, 2023. →
[2] R. Randall, Vibration-based Condition Monitoring: Industrial, Aerospace and
Automotive Applications, Wiley, 2011. a, b, c
[3] J. I. Taylor, The Vibration Analysis Handbook, Vibration Consultants, 2003. →
[4] E. Mendel, T. W. Rauber, F. M. Varejão, and R. J. Batista, Rolling element
bearing fault diagnosis in rotating machines of oil extraction rigs. In: 2009 17th
European Signal Processing Conference, Glasgow, UK, 2009. →
[5] S. J. Lacey, An overview of bearing vibration analysis. Maintenance and Asset
Management Journal, vol. 23, no. 6, pp. 32–42, 2008. →
[6] R. Accorsi, R. Manzini, P. Pascarella, M. Patella, and S. Sassi, Data mining and
machine learning for condition-based maintenance. Procedia Manufacturing,
vol. 11, pp. 1153–1161, 2017. →
[7] P. K. Kankar, S. Sharma, and S. P. Harsha, Rolling element bearing fault
diagnosis using wavelet transform. Neurocomputing, vol. 74, pp. 1638–1645,
2011. →
[8] S. Tao, T. Zhang, J. Yang, and X. Wang, Bearing fault diagnosis method based
on stacked autoencoder and softmax regression. In: 34th Chinese Control
Conference (CCC), Hangzhou, China, 2015. →
[9] W. Yan, and L. Yu, On accurate and reliable anomaly detection for gas turbine
combustors: A deep learning approach. In: Annual Conference of the
Prognostics and Health Management Society, 2015. →
[10] Z. Chen, C. Li, and R.-V. Sanchez, Gearbox fault identification and
classification with convolutional neural networks. Shock and Vibration, vol. 10,
pp. 1–10, 2015. →
[11] S. M. Lundberg, and S.-I. Lee, A unified approach to interpreting model
predictions. Advances in Neural Information Processing Systems, 2017. →
[12] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?":
Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and data mining,
2016. →
[13] G. Carlsson, Topology and data. Bulletin of the American Mathematical
Society, vol. 46, pp. 255–308, 2009. a, b
[14] D. Lee, and A. Sabater, Visibility graphs, persistent homology, and rolling
element bearing fault detection. In: IEEE International Conference on
Prognostics and Health Management, Detroit, MI, 2022. a, b, c
[15] L. Lacasa, B. Luque, F. Ballesteros, and J. C. Nuño, From time series to
complex networks: The visibility graph. Proceedings of the National Academy of
Sciences, vol. 105, pp. 4972–4975, 2008. a, b, c, d
[16] F. Chazal, and M. Bertrand, An introduction to topological data analysis:
Fundamental and practical aspects for data scientists. Frontiers in Artificial
Intelligence, vol. 4, 2021. →
[17] F. Chazal, B. T. Fasy, F. Lecci, A. Rinaldo, and L. Wasserman, Stochastic
convergence of persistence landscapes and silhouettes. In: SOCG'14:
Proceedings of the Thirtieth Annual Symposium on Computational Geometry,
2014. a, b, c
[18] T. Bonis, M. Ovsjanikov, S. Oudot, and F. Chazal, Persistence-based pooling
for shape pose recognition. In: CTIC 2016: Proceedings of the 6th International
Workshop on Computational Topology in Image Context, 2016. →
[19] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt, A stable multi-scale kernel
for topological machine learning. In: 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2014. →
[20] H. Edelsbrunner, and J. L. Harer, Computational Topology: An Introduction,
American Mathematical Society, 2009. →
[21] G. Tauzin, U. Lupo, L. Tunstall, J. B. Pérez, M. Caorsi, W. Reise, A. Medina-
Mardones, A. Dassatti, and K. Hess, Giotto-tda: A topological data analysis toolkit
for machine learning and data exploration. Journal of Machine Learning
Research, vol. 22, pp. 1–6, 2021. a, b, c, d
[22] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, On the surprising behavior of
distance metrics in high dimensional space. In: Database Theory – ICDT 2001: 8th
International Conference, London, UK, 2001. →
[23] S.-H. Cha, Comprehensive survey on distance/similarity measures between
probability density functions. International Journal of Mathematical Models and
Methods in Applied Sciences, vol. 1, 2007. →
[24] C. Bergillos, ts2vg: Time series to visibility graphs [Online]. Available:
→https://cbergillos.com/ts2vg/. →
[25] F. Takens, Detecting strange attractors in turbulence. In: Dynamical Systems
and Turbulence, Warwick 1980, Berlin, Heidelberg, 1981. →
[26] J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A symbolic representation of time
series, with implications for streaming algorithms. In: Proceedings of the 8th
ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, 2003. →
[27] E. Keogh, J. Lin, and A. Fu, HOT SAX: Efficiently finding the most unusual
time series subsequence. In: 5th IEEE International Conference on Data Mining
(ICDM), 2005. →
[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.
Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D.
Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-Learn: Machine
learning in Python. Journal of Machine Learning Research, vol. 12, pp. 2825–2830,
2011. →
[29] Case Western Reserve University Bearing Data Center, Seeded fault test
data. [Online]. Available: →https://engineering.case.edu/bearingdatacenter. →
[30] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano, Determining Lyapunov exponents from a time series. Physica D: Nonlinear Phenomena, vol. 16, no. 3, 1985. →
[31] M. T. Rosenstein, J. J. Collins, and C. J. De Luca, A practical method for
calculating largest Lyapunov exponents from small data sets. Physica D:
Nonlinear Phenomena, vol. 65, no. 1, 1993. →
[32] S. H. Strogatz, Nonlinear Dynamics and Chaos: With Applications to Physics,
Biology, Chemistry, and Engineering, CRC Press, 2015. →
[33] G. A. Gottwald, and I. Melbourne, Testing for chaos in deterministic systems
with noise. Physica D: Nonlinear Phenomena, vol. 212, pp. 100–110, 2005. →
[34] G. A. Gottwald, and I. Melbourne, On the implementation of the 0–1 test for
chaos. SIAM Journal on Applied Dynamical Systems, vol. 8, no. 1, pp. 129–145,
2009. a, b
[35] J. R. Tempelman, and F. A. Khasawneh, A look into chaos detection through
topological data analysis. Physica D: Nonlinear Phenomena, vol. 406, 2020. a, b, c
[36] J. Lee, H. Qiu, G. Yu, and J. Lin, Rexnord Technical Services, Bearing data set.
In: NASA Prognostics Data Repository, NASA Ames Research Center, Moffett
Field, CA: Center for Intelligent Maintenance Systems, University of Cincinnati,
2007. →
[37] W. Gousseau, J. Antoni, F. Girardin, and J. Griffaton, Analysis of the rolling
element bearing data set of the center for intelligent maintenance systems of
the university of Cincinnati. In: CM2016, Charenton, France, 2016. a, b, c
[38] M. Melosik, and W. Marszalek, On the 0/1 test for chaos in continuous
systems. Bulletin of the Polish Academy of Sciences, Technical Sciences, vol. 64,
no. 3, pp. 521–528, 2016. →
[39] S. Agami, Comparison of persistence diagrams. Communications in
Statistics – Simulation and Computation, vol. 52, no. 5, pp. 1948–1961, 2023. →
[40] H. Edelsbrunner, and J. L. Harer, Persistent homology – A survey. Surveys on
Discrete and Computational Geometry, vol. 453, 2008. →
12 Dyna-SPECTS: DYNAmic enSemble of Price
Elasticity Computation models using Thompson
Sampling in e-commerce

Srividhya Sethuraman
Uma G.
Sunny Kumar
Siddhesh Thombre
Vikash Patel
Sharadha Ramanan

Abstract
This work introduces Dyna-SPECTS, a novel dynamic ensemble-based Thompson sampling model for calculating price elasticity (PE) in the e-commerce sector, focusing on kids' clothing. The model uniquely amalgamates four distinct Thompson sampling-based algorithms – distribution approximation, XGBoost regressor, GAN, and MCMC sampling – to determine PE
distribution parameters. Dyna-SPECTS’s outstanding feature is its dynamic
ensemble mechanism, which adeptly selects optimal PE values from a suite of
strategies like minimum, maximum, average, weighted MAPE, and linear
regressor, guided by cumulative rewards. This approach adapts to variable
market conditions in the constantly evolving e-commerce landscape, including
competitor pricing, seasonal changes, and customer segmentation. Empirical
tests across diverse datasets have confirmed Dyna-SPECTS’s exceptional
performance, showcasing substantial improvements in sales and margins,
especially for SKUs experiencing PE fluctuations. The model also demonstrates
efficacy in using transfer learning for PE computation, yielding a 7% margin
improvement and a remarkable 35.3% reduction in RMSE for demand forecasts.
Innovations in Cross-PE coefficient integration and price personalization have
led to further margin enhancements. Moreover, the model excels in
omnichannel pricing, achieving a 17.5% margin increase. Long-tailed product
pricing using these PE values showed an increase in sales volume and revenue
compared to traditional pricing strategies.
Dyna-SPECTS prioritizes scalability using parallelization, distributed computing,
and optimizing algorithms, efficiently managing a high volume of SKUs without
compromising computational effectiveness and achieving rapid convergence. It
represents a significant leap in dynamic price elasticity computation, offering a
scalable, versatile, and empirically proven solution for e-commerce pricing
challenges.
Keywords: Price Elasticity, Thompson sampling, price optimization, Big Data,
fashion, Reinforcement Learning, online learning, e-commerce, ensemble model,
long-tail products, scaling, Bayesian model, dynamic ensemble, static ensemble,
personalization, dynamic pricing.

12.1 Introduction

12.1.1 What is the problem?


Accurately computing price elasticity (PE) values is imperative for retailers to
make optimal pricing decisions. PE, defined as the percentage change in sales
due to a percentage change in price, informs retailers’ regular, promotional, and
markdown pricing recommendations to maximize sales, margins, and revenue.
It is also essential for inventory planning, allocation, and re-distribution.
However, accurately computing PE in the real world poses significant
challenges. For items with few price changes or new products without sales
history, PE values are unreliable. Additionally, PE must account for historical
sales trends, current market conditions, and future expectations to be
representative of dynamic e-commerce pricing. Furthermore, scaling PE
computations for vast online retail data presents difficulties. Dynamic near real-
time pricing also requires dynamic PE re-calculation as influencing factors
change. Current models provide only static PE values, but retailers require
different models addressing different data aspects. The ensemble, combining
multiple PE models, must also dynamically adjust to evolving contexts and new
information. Critically, the objective is optimizing cumulative revenue over time,
not just immediate revenue, necessitating PE re-calculation triggers.
Our solution, a DYNAmic enSemble-based sequential PE Computation using
Thompson Sampling (Dyna-SPECTS) system, amalgamates various
computationally balanced algorithms to produce reliable PE coefficients
underpinned by a cumulative reward metric. Based on predefined ensemble
methods, it calculates intermediate rewards from the cumulative metric to
optimize dynamic pricing for maximum profits. In this work, we address the
pressing need for reliably computing PE values across product types to
recommend optimal dynamic prices that maximizes margins.
12.1.2 How is price elasticity computed in the real world?
In practice, real-world PE computation typically applies various regression
models to historical sales and price data. However, computed PE values are only
accurate and representative for items with sufficient price changes and sales
history.
In reality, many items have few price changes and new products lack history.
Additionally, PE fluctuates significantly for seasonal and fashion items within and
across life cycles. Existing regression methods do not address these variations
and uncertainties. As a result, similar item PE values or broad estimated ranges,
like (−3) to (−1), are often substituted, yielding suboptimal recommended prices.
Existing approaches do not incorporate key financial goals, are not capable of exploring untested prices, and often do not address scalability.
Consequently, current real-world PE computation yields approximate values,
hindering optimal dynamic pricing.
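For concreteness, the textbook regression approach referred to above fits a constant-elasticity model, sales = a * price^e, so that the slope of log(sales) on log(price) is the PE estimate; a minimal sketch on made-up weekly observations for one SKU follows. It is exactly this kind of estimate that becomes unreliable when there are few price changes or no sales history.

```python
import numpy as np

# Hypothetical weekly observations for one SKU
price = np.array([24.99, 24.99, 19.99, 19.99, 22.49, 17.99, 24.99, 19.99])
sales = np.array([110, 98, 170, 162, 130, 210, 105, 158])

# Log-log regression: log(sales) = log(a) + e * log(price), where e is the PE
slope, intercept = np.polyfit(np.log(price), np.log(sales), deg=1)
print(f"estimated price elasticity: {slope:.2f}")   # typically negative, e.g., around -2
```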

12.1.3 The nine real-world criteria needed for PE computation


A robust PE computation model in e-commerce must:
1. Reflect life cycle changes impacting demand
2. Account for time-varying performance, especially seasonality
3. Model non-linear demand curves
4. Leverage accurate demand forecasts
5. Include historical and current sales trends
6. Incorporate the most recent performance data
7. Offer capability to test explored prices
8. Be oriented towards a financial metric of interest
9. Offer scalability
Satisfying these nine criteria ensures computing representative PE values for
optimizing prices via dynamic modeling.

12.1.4 Static vs. Dynamic ensemble


Our prior SPECTS work applied a static averaging technique to PE coefficients
from four Thompson Sampling-based models [→10]. However, evolving e-
commerce data violates static assumptions, mandating adaptive techniques.
Dynamic ensemble methods better suit pricing with changing characteristics.
Their flexibility maintains performance and relevance in non-stationary
environments.
12.1.5 Overview of our solution
We develop Dyna-SPECTS, a dynamic ensemble for sequentially computing PE in
retail via Thompson sampling. Dyna-SPECTS combines four sampling algorithms
to determine PE distribution parameters. The dynamic ensemble adapts more
quickly than static SPECTS to item life cycles, improving revenue. We apply Dyna-
SPECTS to clothing datasets and pricing applications, quantifying sales and
margin gains. Challenges and scaling for vast data are addressed via distributed
computing frameworks, online learning, hyper-parameter tuning, reward
scaling, early stopping, memory efficient data structures, dimensionality
reduction, and approximate inference.

12.1.6 What is our contribution?


Dyna-SPECTS reliably computes optimal PE values for dynamic pricing,
maximizing profits. Our reinforcement learning approach mitigates limited price
variations by exploring untested prices for more representative estimations.
The trigger module facilitates real-time re-calculation as inputs change. The
recommended PE value incorporates trends for past, present, and future. Our
solution satisfies the nine criteria for representative values in dynamic
environments. Whereas current methods provide only approximate PE, ours
ensures optimal dynamic pricing and cumulative revenue growth over time and
contexts. We offer a comprehensive, data-driven solution for dynamic e-
commerce pricing.

12.2 Related works


In [→1], the authors propose a scalable sample-efficient neural contextual
bandit algorithm for recommender systems. They designed an epistemic neural
network architecture, Epistemic Neural Recommendation (ENR) that enables
Thompson sampling at a large scale. While accomplishing performance
improvements, ENR demands orders of magnitude fewer computational
resources than neural contextual bandit baseline algorithms. For scaling
contextual bandit algorithms in cloud infrastructure for Big Data, Amazon
SageMaker RL offers a comprehensive solution. It includes pre-built
reinforcement learning libraries and algorithms, and supports the
implementation of contextual multiarmed bandits for personalization in various
applications. The system developed in Amazon SageMaker RL involves an
iterative training and deployment loop, capturing user interactions and re-
training the model with incremental data. It provides a library of contextual
bandit algorithms, support for hosting trained models, and features such as
warm start, inference logging, and offline model evaluation and visualization
[→2]. Zhu et al. [→3] present the first efficient, general-purpose algorithm for
contextual bandits with continuous, linearly structured action spaces. Their algorithm makes use of computational oracles for (i) supervised learning and (ii) optimization over the action space, and achieves sample complexity, runtime, and memory that are independent of the size of the action space.
Guillou et al. [→4] present a collaborative filtering recommender system method based on matrix factorization and multi-armed bandits. Their approach aims at good recommendations with a narrow computation time, performing personalized recommendations in less than a millisecond per recommendation. Chen et al. [→5] propose a general framework for online and adaptive data collection environments that can update decision rules via weighted stochastic gradient descent. They allow different weighting schemes of the stochastic gradient and establish the asymptotic normality of the parameter estimator. Their proposed estimator significantly improves the asymptotic
efficiency over the previous averaged SGD approach via inverse probability
weights. They also conduct an optimal analysis on the weights in a linear
regression setting. They provide a Bahadur representation of the proposed
estimator and show that the remainder term in the Bahadur representation
entails a slower convergence rate, compared to classical SGD due to the adaptive
data collection. Mahadik et al. [→6] propose a novel distributed bandit-based
algorithm called DistCLUB. This algorithm lazily creates clusters in a distributed
manner, and dramatically reduces the network data sharing requirement,
achieving high scalability. Additionally, DistCLUB finds clusters much faster,
achieving better accuracy than the state-of-the-art algorithms like DCCB. The
recommendation system proposed in [→7] relies on a novel algorithm, which
learns online the item preferences of users based on their click behavior, and
constructs online item-cluster trees. The recommendations are then made by
choosing an item-cluster level and then selecting an item within that cluster as a
recommendation for the user. Their approach can significantly improve the
learning speed when the number of users and items is large, while still providing
high recommendation accuracy and addressing cold-start problems. In [→8], the
authors use distribution optimization techniques to approximate the posterior
distribution, solved via Wasserstein gradient flows. Based on the framework, a
principled particle-optimization algorithm is developed for TS to approximate
the posterior efficiently. The approach is scalable and does not make explicit
distribution assumptions on posterior approximations. An end-to-end
automated meta-learning pipeline to approximate the optimal Q function for
contextual bandits’ problems is presented in [→9]. The AutoML model for
contextual bandits outperforms or performs comparably with other models while requiring no tuning or feature engineering.
12.3 Data characteristics
In our study, we employ seven distinct but complementary datasets to deepen
the understanding of the impact of dynamic PE computation on dynamic pricing
and its different applications in the e-commerce space (see →Table 12.1). Our
primary dataset, referred to as Dataset 1, is synthetically generated based on
synthetic data generation models on data derived from a large kids’ clothing e-
commerce retailer. This dataset is meticulously structured to encompass various
metrics, including performance, attributes, pricing, and inventory across 14
categories. It features 1,905 styles and 9,257 products, focusing particularly on
styles associated with the Fall season, from August to late February in the years
2019–2020. These styles, comprising over 500 articles each, are grouped under
the same merchandise hierarchy – like color–pattern combinations. Throughout
our study, terms such as “articles,” “products,” “SKUs,” and “items” are used
interchangeably, underscoring their conceptual similarity.

Table 12.1: Key statistics of the datasets used in this study.

Dataset                 | Train period (year to week) | Test period (year to week) | Price range (in USD)    | Sales (in units)
Dataset 1               | 201936 to 201949            | 202036 to 202049           | 4.7–62.7                | 2–1,100
Dataset 2 (H&M Regular) | 201944 to 202004            | 202005 to 202009           | 1.4–49.99               | 1–2,492
Dataset 3 (H&M Fashion) | 201918 to 201927            | 202018 to 202027           | 6–35                    | 1–543
Dataset 4               | 201936 to 201949            | 202036 to 202049           | 6.0–52.9                | 12–900
Dataset 5               | 201705 to 201818            | 201819 to 201834           | 38–160 (Brazilian Real) | 1–10
Dataset 6               | 201936 to 202044            | 202045 to 202048           | 18–44                   | 1–54
Dataset 7               | 201944 to 202005            | 202018 to 202027           | 5–37.5                  | 1–216

Dataset 2¹ originates from H&M and is publicly available on Kaggle. It
encompasses transactional, product attribute, and customer data, providing a
comprehensive look at online retail operations. Dataset 3 originates from the
same H&M source as Dataset 2, but specifically features Fashion SKUs,
complementing our primary dataset. Dataset 4 is a synthetically derived version
of our primary dataset. It consists of the transaction and product attribute data
corresponding to a group of styles identified as similar styles. Dataset 5² consists
of a subset of SKUs identified as long-tailed products from an open-source
dataset on Kaggle. Dataset 6 consists of a subset of SKUs identified as long-tailed
products from our primary dataset, Dataset 1. Dataset 7 consists of transactions
corresponding to an online retailer selling women’s wear. It contains customer
attributes, transaction details, and product attributes, corresponding to a set of
apparel items over two years.
Based on EDA, we observe that the sales of Key SKUs are not just impacted by
the price of the Key SKUs but are also affected by the prices of the related SKUs
that are substitutes and complements for the Key SKU, as evidenced by →Figures
12.1 and →12.2.

Figure 12.1: Plot of sales and price of the Key SKU and the prices of substitute
SKUs from Dataset 1.

From →Figures 12.1 and →12.2, we infer that the sales of the Key SKUs are
affected by the prices of substitute and complement SKUs. When price changes
are introduced, they happen simultaneously for both the substitute and
complement SKUs. The sales changes corresponding to the price changes are
similar for Key SKUs and their identified substitutes and complements.
Long-tailed products constitute a considerable portion of the inventory in e-
commerce platforms and they are characterized by low demand and sales
volume [→12]. These products often have fluctuating demand, and pricing them
optimally is critical to maximizing revenue and minimizing inventory holding
costs. Traditional pricing strategies, which rely heavily on historical sales data,
are not effective for long-tailed products due to the sparse data. →Figure 12.3
shows the selling characteristics of some representative long-tailed products
(LTP) from Dataset 6.

Figure 12.2: Plot of sales and price of the Key SKU and the prices of substitute
and complement SKUs from Dataset 2.

Figure 12.3: Sales, price curves across time for LTPs selected from Dataset 6.

12.4 Methodology
The SPECTS (Sequential PE Computation using Thompson Sampling) model
[→10] comprised four methods that use Thompson Sampling to determine the
parameters of the PE distribution. Three methods, namely Distribution
approximation, XGBoost regressor, and GAN, differ in terms of the demand
forecasting component used. The fourth method uses MCMC sampling methods
on the posteriors derived via Thompson Sampling. The motivation behind using
MCMC sampling is to handle the cases wherein the posteriors cannot be
obtained in closed form. Otherwise, they will remain as flat distributions even
after considerable training. Each of these four methods computes and outputs a PE
value. In that work, we took the average PE value of these four methods as the
output of the SPECTS model.

12.4.1 Architecture of Dyna-SPECTS


In the current work, we first highlight how the average PE resulted in sales units
close to actuals (Section 12.5.1). We then study the shortcomings of using static
ensemble learning approaches (refer to →Figure 12.5) in the e-commerce scenario.
We address these shortcomings by providing dynamic ensemble learning
approaches to combine PE coefficients from multiple approaches and highlight
how they can be future proof. The architecture of our Dynamic–SPECTS (Dyna-
SPECTS) ensemble model is illustrated in →Figure 12.4.

Figure 12.4: Architecture diagram for the proposed Dyna-SPECTS model.


Figure 12.5: Static ensemble method pseudocode.

→Figure 12.4 maps the data flow from source files through various modules,
culminating in PE values. The training phase involves pre-processing and feature
extraction from comprehensive data sets (Big Data), including transaction,
attribute, inventory, and price data. Different bandit models utilize these
features to generate PE values, with the ensemble bandit model providing a
singular PE for the price optimization engine, thereby determining optimal
prices for maximal revenue.
In the testing phase, this model incorporates real-time, high-frequency data,
activating the trigger module under specific conditions like significant
competitor price changes, stock shortages, cost variations, inventory level drops,
forecast accuracy decline, or special events. This triggers the re-training of
bandit models and demand forecasting modules using the pre-processed data.
After extensive training, the model processes real-time, high-frequency data in
batches. Once the data enters the bandit models, it is used to estimate the priors
and likelihoods. These estimated priors and likelihoods are then used to calculate
the posteriors, where the posterior distribution corresponds to the PE distribution.
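
To illustrate the trigger module, the following is a minimal sketch; the field names and threshold values are hypothetical assumptions, since the chapter does not specify exact trigger thresholds.

```python
from dataclasses import dataclass

@dataclass
class MarketState:
    # All field names and thresholds below are illustrative assumptions.
    competitor_price_change: float  # relative change, e.g., 0.12 = 12%
    stock_out: bool
    cost_change: float              # relative change in unit cost
    inventory_drop: float           # relative drop in inventory level
    forecast_mape_increase: float   # decline in forecast accuracy
    special_event: bool

def should_retrain(state: MarketState) -> bool:
    """Return True if any trigger condition for re-training is met."""
    return (
        abs(state.competitor_price_change) > 0.10
        or state.stock_out
        or abs(state.cost_change) > 0.05
        or state.inventory_drop > 0.30
        or state.forecast_mape_increase > 0.05
        or state.special_event
    )
```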

12.4.2 Mathematical formulation of Dyna-SPECTS solution


The set of equations governing the modification of the posterior distribution is
given below. Refer →Table 12.2 for description of notations:
Table 12.2: Table of notations.

S. no. | Notation | Description
1 | P_init,j | Initial/previously updated price forecast for style "j"
2 | S_pred,i[t] | Sales forecast at the price P_init for round "t" for style "i"
3 | γ_ij | Cross-price elasticity between style "i" and style "j"
4 | P_opt,i[t] | Optimal price for round "t" for style "i"
5 | S_new,i[t] | Sales corresponding to the optimal price for round "t" for style "i," considering inter-item and competitor effects
6 | cp_i | Unit cost price of style "i"
7 | α1, α2, α3 | Weights for the objectives, i.e., sales, revenue, and margin, such that α1 + α2 + α3 = 1
8 | P_min,i, P_max,i | Min–max bounds for the price of style "i"
9 | lr_i | Competitor leakage coefficient for style "i"
10 | P_comp,i[t] | Competitor price for round "t" for style "i"
11 | β̂ | Price elasticity coefficient for the time period
12 | R[t] | Revenue at time "t" in the regular period
13 | Prior[0](β̂) | Prior for the price elasticity distribution in the markdown (regular) period
14 | μ0, Σ0 | Parameters of the initial prior distribution for the markdown (regular) period
15 | L(ObsR[t]; R[t], β̂) | Likelihood model fit based on observed revenue in the regular period
16 | L(ObsS_new,i[t]; S_new,i[t], β̂) | Likelihood model fit based on observed sales in the markdown period
17 | ObsS_new,i[t]; S_new,i[t], σ² | Parameters of the likelihood model fit on sales for the markdown period
18 | Prior[t − 1](β̂) | Prior distribution at time "t − 1" in the markdown (regular) period
19 | Posterior_t(β̂; R[t]) | Posterior distribution at time "t" in the regular period
20 | Posterior_t(β̂; S_new,i[t]) | Posterior distribution at time "t" in the markdown period

S_{new,i}[t] = S_{pred,i}[t] \prod_{j} \left( \frac{P_{opt,j}[t]}{P_{init,j}} \right)^{-\gamma_{ij}} \exp\left( lr_i \, \frac{P_{opt,i}[t] - P_{comp,i}[t]}{P_{comp,i}[t]} \right)   (12.1)

R[t] = P_{opt,i}[t] \, S_{new,i}[t]   (12.2)

Margin[t] = S_{new,i}[t] \, (P_{opt,i}[t] - cp_i)   (12.3)

Prior[0](\hat{\beta}) = N(\mu_0, \Sigma_0)   (12.4)

L(ObsR[t]; R[t], \hat{\beta}) = N(ObsR[t]; R[t], \sigma^2)   (12.5)

L(ObsS_{new,i}[t]; S_{new,i}[t], \hat{\beta}) = N(ObsS_{new,i}[t]; S_{new,i}[t], \sigma^2)   (12.6)

Posterior_t(\hat{\beta}; R[t]) \propto Prior[t-1](\hat{\beta}) \, N(ObsR[t]; R[t], \sigma^2)   (12.7)

Posterior_t(\hat{\beta}; S_{new,i}[t]) \propto Prior[t-1](\hat{\beta}) \, N(ObsS_{new,i}[t]; S_{new,i}[t], \sigma^2)   (12.8)

Maximize\ Yield = \alpha_1 \, Sales + \alpha_2 \, Revenue + \alpha_3 \, Margin   (12.9)

P_{min,i} \le P_{opt,i}[t] \le P_{max,i}   (12.10)

Cumulative\ reward = Yield + \sum_{n=1}^{T} \gamma^{n} \, reward(t+n)   (12.11)
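
As a reading aid for eqs. (12.1)–(12.3) and (12.9), the following minimal Python sketch evaluates them for a single style; the variable names mirror →Table 12.2, and the toy inputs are assumed values rather than data from the study.

```python
import numpy as np

def new_sales(s_pred, p_opt_related, p_init_related, gamma, p_opt_i, p_comp_i, lr_i):
    """Eq. (12.1): forecast sales adjusted for cross-price and competitor (leakage) effects."""
    cross_term = np.prod((p_opt_related / p_init_related) ** (-gamma))
    leakage_term = np.exp(lr_i * (p_opt_i - p_comp_i) / p_comp_i)
    return s_pred * cross_term * leakage_term

def yield_objective(sales, revenue, margin, a1, a2, a3):
    """Eq. (12.9): weighted objective with a1 + a2 + a3 = 1."""
    assert abs(a1 + a2 + a3 - 1.0) < 1e-9
    return a1 * sales + a2 * revenue + a3 * margin

# Toy inputs (assumed values, not from the datasets).
s_pred = 120.0                            # S_pred,i[t]
p_opt_related = np.array([19.0, 24.0])    # P_opt,j[t] for related styles j
p_init_related = np.array([20.0, 25.0])   # P_init,j
gamma = np.array([0.8, -0.3])             # cross-price elasticities gamma_ij (assumed)
p_opt_i, p_comp_i, lr_i, cp_i = 21.0, 22.0, 0.5, 9.0

s_new = new_sales(s_pred, p_opt_related, p_init_related, gamma, p_opt_i, p_comp_i, lr_i)
revenue = p_opt_i * s_new                 # Eq. (12.2)
margin = s_new * (p_opt_i - cp_i)         # Eq. (12.3)
print(s_new, revenue, margin, yield_objective(s_new, revenue, margin, 0.2, 0.5, 0.3))
```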

Figure 12.6: Dynamic ensemble method pseudocode.


12.4.3 Dyna-SPECTS model
The Dyna-SPECTS model specializes in amalgamating various Thompson
sampling (TS)-based algorithms to produce PE coefficients, underpinned by a
cumulative reward metric, as expressed in eq. (12.11). Based on a set of
predefined ensemble methods (average, minimum, maximum, weighted MAPE,
and linear regression), an intermediate reward is calculated based on the
cumulative reward metric.
In our framework, an epsilon greedy algorithm is employed to finally pick a
PE coefficient based on the cumulative reward metric. Here, the discount factor γ
in eq. (12.11), with γ < 1, is a hyperparameter that weighs immediate rewards
against future rewards and can be suitably tuned as per the retailer's requirements.
second level of bandit algorithms to supervise these recommendations. We also
offer the flexibility to employ any reinforcement learning policy as an alternative.
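
A minimal sketch of this second-level epsilon greedy selection over the predefined ensemble strategies is given below; the epsilon value, discount factor, and reward numbers are illustrative assumptions, not values from the chapter.

```python
import random

STRATEGIES = ["average", "minimum", "maximum", "weighted_mape", "linear_regression"]

def epsilon_greedy_pick(cumulative_rewards, epsilon=0.1):
    """Pick an ensemble strategy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(STRATEGIES)
    return max(STRATEGIES, key=lambda s: cumulative_rewards.get(s, 0.0))

def update_reward(cumulative_rewards, strategy, reward, discount=0.9, step=1):
    """Accumulate a discounted reward for the chosen strategy, in the spirit of eq. (12.11)."""
    cumulative_rewards[strategy] = (
        cumulative_rewards.get(strategy, 0.0) + (discount ** step) * reward
    )

rewards = {}
choice = epsilon_greedy_pick(rewards)        # strategy whose PE value is used this round
update_reward(rewards, choice, reward=42.0)  # observed yield fed back (assumed number)
```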
In summary, our Dyna-SPECTS model, showcased in →Figure 12.4, ensures PE
coefficients are not just timely but also future-proof, based on real-time and
forecasted data. Thus, the Dyna-SPECTS model computes representative PE
values in near real-time that leads to optimal, dynamic pricing in the real-world
e-commerce scenario, resulting in revenue maximization. It considers not just
the current market context but also projects future trends, thereby enabling
retailers to make pricing decisions that are optimized for both immediate and
extended timeframes. In doing so, we offer an innovative, data driven, and
comprehensive solution for dynamic e-commerce pricing.
From eq. (12.1), we see that our current formulation accounts for the presence of
active competitors and for the impact of the prices of related SKUs on the sales of
the Key SKU; in particular, leakage ratio coefficients are included to capture the
competitor effect. Competitor price distance is calculated as the difference
between the retailer's own price and the price of the competitor.
Leakage ratio coefficients are derived from historical sales data via a two-step
regression. In the first step, all features affecting the sales of the product,
excluding the competitor price distances, are regressed against the sales of the
product. We obtain residuals based on the regression model fit in the first step.
In the second step, the residuals are regressed against competitor price
distances. The regression model coefficients corresponding to these competitor
price distances are called leakage ratio coefficients and these coefficients are
used to capture the competitor price impact on the sales of the retailer’s
products. The PE coefficient from the dynamic ensemble model then enters a
price optimization engine. The price optimization engine then predicts the final
optimal price that maximizes the revenue, which is then posted on the e-
commerce website. The sales units corresponding to the posted prices are
captured as feedback and fed into our models.
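
As an illustration of the two-step regression described above, the following is a minimal sketch using scikit-learn on synthetic data; the feature matrices and generating coefficients are assumed placeholders, not the chapter's actual features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X_base: features affecting sales, excluding competitor price distances (assumed).
# X_comp: competitor price distances (own price minus competitor price).
rng = np.random.default_rng(0)
X_base = rng.normal(size=(200, 5))
X_comp = rng.normal(size=(200, 2))
sales = X_base @ rng.normal(size=5) - 0.4 * X_comp[:, 0] + rng.normal(scale=0.1, size=200)

# Step 1: regress sales on the base features and take the residuals.
step1 = LinearRegression().fit(X_base, sales)
residuals = sales - step1.predict(X_base)

# Step 2: regress the residuals on the competitor price distances;
# the fitted coefficients are the leakage ratio coefficients lr_i.
step2 = LinearRegression().fit(X_comp, residuals)
leakage_ratios = step2.coef_
print(leakage_ratios)
```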

12.5 Experiments and insights


For all our experimentation, we leveraged the Azure Databricks Spark
environment, which could seamlessly integrate with Azure cloud storage for all
our data analytics and data engineering tasks. The distributed cluster used for
our experimentation consisted of worker nodes with 14 GB of memory and 4 cores
each, and a driver node with the same 14 GB, 4-core configuration. The worker
nodes can auto-scale from 1 to 3 nodes, based on workload requirements. This
configuration allowed us to apply our proposed solution to many products through
distributed computing.
In this work, we study the variability in PE coefficients obtained through
various static ensemble approaches and highlight the practical infeasibility of
using them. We therefore transition toward dynamic ensemble approaches for PE
coefficients and discuss the benefits they offer in this section.

12.5.1 PE computation using static ensemble approaches for online retail


In our previous work [→10], we employed an average method to finally compute
the PE values, based on the coefficients arrived from: TS with distribution
approximation, TS with XGBoost demand forecasting engine, TS with GAN-based
demand forecasting engine, and TS with MCMC sampling engine. The final
output of our SPECTS model, based on our previous work, is a PE obtained from
a static ensemble method. →Figure 12.5 gives the pseudocode for the static
ensemble algorithm. In general, static ensemble methods consist of the
following techniques:
1. Fixed composition: In static ensemble methods, the composition of the
ensemble (i.e., the number and types of base models) is fixed before
training begins and remains constant throughout the training process.
2. Homogeneous or Heterogeneous: The base models in a static ensemble
can be homogeneous (e.g., all decision trees) or heterogeneous (e.g., a
combination of decision trees, support vector machines, and neural
networks).
3. Independence: The base models are typically trained independent of each
other, and their predictions are combined using a predefined aggregation
strategy such as averaging or voting.
In summary, static ensemble methods are relatively simple to implement, and
the fixed composition can make them easier to interpret. They are suitable for
scenarios where the underlying data distribution and model requirements do
not change significantly.
Examples of other static ensemble techniques include random forests,
bagging, and boosting (e.g., AdaBoost and gradient boosting machines). In the
current work, we compare the PE values, based on different ensemble
approaches.
In general, all ensemble methods can be classified into three broad
categories: stacking, bagging, and boosting. In the stacking approach, different
models are employed to come up with predictions and finally one more model is
employed to decide on the final set of predictions. In the bagging approach, the
models are trained on different samples from the input dataset with
replacement, and the final predictions are decided by techniques such as
voting or averaging. In the boosting approach, the predictions of the different
models trained on the input dataset are sequentially corrected and the result is a
weighted set of predictions. In this work, we employ the following ensemble
techniques to come up with the final PE, and have compared the respective sales
units and revenue improvements for the same.
The techniques employed are: 1. minimum of the coefficients; 2. maximum of the
coefficients; 3. weighted sum of the coefficients, based on (1 − MAPE) scores from
the first three approaches; and 4. a linear regressor, trained on a set of SKUs for
which the true PE has already been ascertained.
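
To make the static combination rules concrete, here is a minimal Python sketch of the minimum, maximum, MAPE-weighted, and average combiners; the linear-regressor combiner is omitted, the exact weighting scheme is an assumption, and the example coefficients and MAPE scores are illustrative rather than results from the chapter.

```python
import numpy as np

def combine_pe(coeffs, mapes, method="average"):
    """Statically combine PE coefficients from the four TS-based methods."""
    coeffs, mapes = np.asarray(coeffs, float), np.asarray(mapes, float)
    if method == "average":
        return float(coeffs.mean())
    if method == "minimum":          # smallest in magnitude, sign preserved
        return float(coeffs[np.argmin(np.abs(coeffs))])
    if method == "maximum":          # largest in magnitude, sign preserved
        return float(coeffs[np.argmax(np.abs(coeffs))])
    if method == "weighted_mape":    # weights proportional to (1 - MAPE)
        w = 1.0 - mapes
        return float(np.dot(w, coeffs) / w.sum())
    raise ValueError(f"unknown method: {method}")

pe = combine_pe([-1.4, -0.9, -1.1, -0.6], [0.22, 0.15, 0.18, 0.30], "weighted_mape")
print(pe)
```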
Inferences from →Figures 12.7 and →12.8:
Figure 12.7: SPECTS for sales maximization objective.
Figure 12.8: Sales units obtained through different static ensemble approaches.

The minimum approach for ensemble always picks the smallest coefficient in
magnitude. Hence, when combined with the sign, it results in significant
improvement in sales units. The maximum approach, on the other hand, picks
the maximum among the coefficients, in magnitude. Hence, it does not result in
a high improvement in sales units.
The drastic changes in sales units obtained through the minimum and maximum
ensemble approaches might also lead to these results being treated as outliers, as
they are too far from the mean sales. The sales improvements based on the
weighted MAPE and the linear regressor fall within the same range.
The sales improvement based on the average PE is close to actuals and
therefore validates our hypothesis that the average is a smoothed version of the
PE coefficients from all approaches. A similar graph was obtained for the revenue
maximization objective; there too, the revenue obtained by averaging the PE is
close to the actual revenue.
12.5.1.1 When do static ensemble methods fail?

1. Nonstationary data:
Scenario: When the data distribution is non-stationary, meaning that the
statistical properties of the data change over time.
Dynamic advantage: Dynamic ensemble methods can adapt to these
changes by updating the ensemble with new information, adding, or
removing models, as needed. This is especially relevant in applications
where the characteristics of the sales data are not constant.
2. Concept drift:
Scenario: In tasks where the underlying concept being modeled changes
over time, the phenomenon is known as concept drift. PE coefficients
change across the life cycle of items.
Dynamic advantage: Dynamic ensembles can detect and respond to
concept drift by updating the ensemble components. This is crucial for
maintaining model accuracy when the relationships between input
features and the target variable evolve.
3. Resource efficiency:
Scenario: In resource-constrained environments, where it may be
expensive or impractical to retrain the entire ensemble, from scratch, with
each new batch of data.
Dynamic advantage: Dynamic ensemble methods allow for selective
updates, making them more resource efficient. Only the necessary
adjustments, such as adding or removing models, need to be made based
on the evolving data.
4. Adaptive learning:
Scenario: When the task requires a model to adapt quickly to new
information without discarding the knowledge accumulated from past
data.
Dynamic advantage: Dynamic ensembles support adaptive learning, where
models can be updated sequentially and selectively. This is beneficial in
online learning scenarios where the model needs to continuously learn
from incoming data.
5. Handling noisy or outlier data:
Scenario: In situations where the data contains noise or outliers that may
impact the performance of individual models.
Dynamic advantage: Dynamic ensemble methods can be more robust to
noisy data by adapting the ensemble to down-weight the influence of
poorly performing models or outliers. This adaptability helps in
maintaining the model's overall reliability.
6. Task with evolving requirements/objectives:
Scenario: In applications where the requirements of the task change over
time, and the model needs to be flexible and responsive to these changes.
Dynamic advantage: Dynamic ensembles can evolve to meet new task
requirements by incorporating models that are better suited to the
updated task, ensuring better performance in dynamic environments.
In a real-world environment, the PE of a product changes across its life cycle
(refer to →Figure 12.9) for the following reasons:

Figure 12.9: Drifts in price elasticity values of a SKU across its life cycle.

a. Loss of freshness – Item no longer perceived as new.


b. Quality degradation – Especially prevalent in the case of perishables.
c. New competitor products corresponding to this product become
dominant in the market.
d. The product sees an increase in terms of substitute and complement
products in the retailer’s own store.
e. Other external factors – seasonality, weather, social media trends, etc.
Product life cycle pricing is about how to adjust prices over the short product life
cycles, especially for seasonal and fashion products. Typically, retailers price new
products with introductory prices that are based on the computed optimal entry
point. Next, dynamic price adjustments are made to optimize total revenue and
margins, based on demand levels and competitor prices. Also, seasonal price
adjustments are made to lower prices during peak shopping periods and
holidays to stimulate sales. Promotional pricing is used to stimulate sales of
slow-moving products and to clear stock, especially if inventory starts to
accumulate. Markdown pricing is end-of-life discounting, especially for
technology and fashion items. →Figure 12.9 illustrates the drift in PE values for
an SKU as the product undergoes a shift from regular to promotional to
markdown phases.

12.5.1.2 Moving toward dynamic ensemble methods

1. Adaptive composition: In dynamic ensemble methods, the composition of the
ensemble can change during the training process, based on the
performance of individual models. New models can be added, and existing
models can be removed or updated dynamically.
2. Online learning: Dynamic ensembles are often used in online learning
scenarios, where the model needs to adapt to changing data over time.
New data points can be used to update the ensemble and improve its
performance continuously.
3. Sequential training: Base models in dynamic ensembles can be trained
sequentially, and the decision to add or remove a model is based on the
evolving performance of the ensemble on new data.
Dynamic ensemble benefits: They are more adaptive to changes in data
distribution over time, making them suitable for tasks with evolving patterns.
They can also be more resource efficient as they allow for selective updates to
the ensemble. →Figure 12.6 gives the pseudocode for the dynamic ensemble
learning algorithm.
Examples of Dynamic ensemble approaches include Online Bagging, Online
Boosting, and methods with adaptive model selection. In summary, dynamic
ensemble methods are particularly effective in scenarios where the data is
nonstationary, exhibits concept drift, or where there are constraints on
resources. Their adaptability allows them to maintain performance and
relevance in dynamic and changing environments, making them a suitable
choice in applications with evolving data characteristics.
We applied the dynamic ensemble approach to our SPECTS model, wherein
the ensemble strategy adapts to temporal variations in how each SKU's sales
respond to price changes. In the dynamic ensemble approach, another bandit
algorithm, such as an epsilon greedy algorithm, is employed to choose the
ensemble approach (minimum, maximum, weighted, linear regressor) that finally
decides on the PE coefficient from the four TS-based bandit models.
From →Figure 12.10, we infer that the dynamic ensemble provides sales units
closer to actuals than the static ensemble or other individual methods. For SKUs
with drift in PE values, dynamic ensemble methods result in a 3.5% increase in
sales units over static ensemble methods.

Figure 12.10: Comparison of static and dynamic ensemble approaches employing SPECTS.

12.5.2 Validation of our Dyna-SPECTS model


We validate our proposed Dyna-SPECTS model with respect to the static
ensemble approach through the computation of reward across iterations and
the growth of regret across iterations. →Figure 12.11 below shows the revenue
increase via dynamic ensemble of SPECTS, as opposed to the static ensemble of
SPECTS.
Figure 12.11: Illustration of average reward via static and dynamic ensemble
approaches across time.

When the SKU undergoes changes across its life cycle, the dynamic ensemble
can adapt to these changes quickly, and hence results in higher revenue, as
opposed to static ensemble of SPECTS.
After considerable training, the component models that are part of the Dyna-SPECTS
model approach the true PE (as given by the Oracle); refer to →Figure 12.12.
Hence, the rewards from the dynamic ensemble model approach the true reward.
Figure 12.12: Growth of regret across iterations.

12.5.3 Dyna-SPECTS for three different datasets


Our dynamic ensemble approach for SPECTS was employed for three different
clothing datasets. The sales and margin improvements with respect to actuals
are summarized in the table below (Refer →Table 12.3). The results highlight the
power of dynamic ensemble approaches when the underlying data is
nonstationary.

Table 12.3: Sales and margin improvements for three different datasets.

Dataset | Sales improvement | Margin improvement
Dataset 1 | 8% | 2.7%
Dataset 2 | 19% | 8.1%
Dataset 3 | 10% | 4.7%
12.5.4 Improving demand forecasting accuracy from PE obtained through
SPECTS model
Our objective here was to improve demand forecast accuracy with the help of the
PE distribution as it becomes progressively more refined. We trained our PE
computation models as part of SPECTS for a considerable number of iterations
until the variance associated with the PE distributions met a threshold. The
variance threshold for this experimentation was fixed at 0.005.
Once the PE distribution variance satisfied the threshold, we sampled a PE
value from this distribution and used it to predict demand at newer price points.
We observed that the demand predicted via the updated PE distributions is closer
to actuals, as opposed to the demand forecasts coming from the pre-trained
demand forecast ML and DL models.
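
The following minimal sketch shows one way a sampled PE value can be turned into a demand prediction at a new price point, assuming a constant-elasticity demand form; the chapter does not prescribe this exact functional form, and the numbers are assumed.

```python
import numpy as np

def demand_at_new_price(base_demand, base_price, new_price, pe_mean, pe_var, rng=None):
    """Sample a PE value beta from N(pe_mean, pe_var) and apply a constant-elasticity
    demand form: D(p) = D0 * (p / p0) ** beta, with beta the (negative) price elasticity."""
    rng = rng or np.random.default_rng()
    beta = rng.normal(pe_mean, np.sqrt(pe_var))
    return base_demand * (new_price / base_price) ** beta

# Assumed values; the variance is below the 0.005 threshold mentioned in the text.
pred = demand_at_new_price(base_demand=150, base_price=20.0, new_price=18.0,
                           pe_mean=-1.2, pe_var=0.004)
print(pred)
```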
→Figure 12.13 shows the demand forecast computed for a set of styles from
Dataset 1 using PE distribution and the demand forecast from pre-trained
models such as XGBoost and the actuals.
Figure 12.13: Comparison of forecast sales units.

After accounting for all types of styles from Dataset 1, we found that the demand
forecast through PE distribution is 14% more accurate (Refer →Table 12.4) than
the demand forecast obtained via ML models.

Table 12.4: RMSE metric.

Model | RMSE
Forecast through PE distribution | 36.83
Forecast through XGBoost model | 312.14

12.5.5 Price personalization via customer segment-level PE from SPECTS


To offer more personalized prices for different customer segments
corresponding to the products sold in online retail, we grouped customers
based on their attributes such as age, loyalty status, frequency of buying, the
category of items they mostly buy, and their annual spend. Based on our
analysis, we could group customers into two categories: Category A and
Category B. Category A consisted of 80% loyal customers whose shopping
frequency was three times greater than that of Category B customers; Category A
customers also had a mean age between 20 and 35 years. Category B consisted of
only 20% loyal customers whose buying frequency was three times lower than that
of Category A customers; Category B customers had a mean age between 36 and
62 years. Based on this
insight, we computed PE for SKUs separately at each customer segment level.
After our experimentation, we observed that the PE computed for SKUs based on
Category A customer data was more elastic than that computed on Category B
customer data. The PE computation at the customer segment level resulted in
higher margins than the PE computed for SKUs using the entire dataset. →Figure 12.14
shows the increase in margins before and after computation of PE at the
customer segment level for a set of SKUs from Dataset 8.

Figure 12.14: Comparison of customer segment-level PE for different articles.

The new margin after incorporation of customer segment-level PE from our
model is 5.5% higher than the old margin.

12.5.6 Transfer learning approaches for PE computation in SPECTS model


In addition to our ensemble approaches, we have also employed transfer
learning to re-use the coefficients in the following cases: (1) new products, for
which very limited historical sales data is available; and (2) products with very few
or no price changes. For case 2, we borrow coefficients from products within the
same category that have a similar demand distribution and a similar correlation
between price changes and sales changes. Transfer learning for the computation
of PE coefficients saves considerable computation cost. In this work, we have
employed transfer learning algorithms for a subset of source and target SKUs that
fulfill certain threshold criteria.
The thresholds fixed for analyzing the feasibility for transfer learning are
based on (1) KS test statistics between sales distributions of source and target
and (2) correlation coefficient between price changes and sales changes for
source and target.
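
A minimal sketch of such a feasibility check, using scipy's two-sample KS test and a correlation-gap comparison, is shown below; the cut-off values are illustrative assumptions rather than the thresholds used in the study.

```python
import numpy as np
from scipy.stats import ks_2samp

def transfer_feasible(src_sales, tgt_sales, src_price_sales_corr, tgt_price_sales_corr,
                      ks_pvalue_min=0.05, corr_gap_max=0.2):
    """Return True if the source and target look similar enough for transfer.
    Thresholds are illustrative; the chapter does not publish its exact cut-offs."""
    ks_stat, p_value = ks_2samp(src_sales, tgt_sales)
    similar_distributions = p_value >= ks_pvalue_min
    similar_response = abs(src_price_sales_corr - tgt_price_sales_corr) <= corr_gap_max
    return similar_distributions and similar_response

rng = np.random.default_rng(1)
ok = transfer_feasible(rng.poisson(5, 300), rng.poisson(5, 120), -0.55, -0.48)
print(ok)
```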

12.5.6.1 When is transfer learning feasible?

When considering transfer learning for the use case of transferring PE coefficients
from one category to another, several criteria should be evaluated to determine its
feasibility and effectiveness.
Some of the criteria we have used as part of our modeling are as follows:
1. Data similarity: Assess whether the data distribution of the source (e.g.,
sales data for popular products) is similar to that of the target (e.g., sales
data for less popular products).
2. Feature relevance: Ensure that the features used in the source task are
relevant and meaningful for the target task. If the features differ
significantly, transfer learning might be less effective. For the PE use case,
we make sure that the source and target price ranges and sales ranges are
similar and the factors impacting the sales of the product in the source
and the target are the same (same set of significant features in source and
target).
3. Evaluation metrics: Define the evaluation metrics for transfer of PE
coefficients (e.g., acceptable variance) and assess whether the pretrained
model’s performance is acceptable for the target task.
4. Domain expertise: Assess whether domain-specific knowledge is crucial
for the target task. If the source data and target data are in the same
domain, transfer learning is more likely to be effective.
→Figures 12.15 and →12.16 highlight cases wherein transfer learning is
infeasible and feasible, respectively.
Figure 12.15: Infeasible case – transfer learning.

Figure 12.16: Feasible case – transfer learning.

12.5.6.2 How was transfer learning incorporated via SPECTS?

1. SPECTS makes use of prior distributions to fine-tune posterior distributions.
2. For the identified target data (group of SKUs), the distribution parameters
from the source data (group of SKUs) were suitably plugged in as the
parameters of the prior distribution (see the sketch after this list).
3. From thereon, the transaction details of the source data (group of SKUs)
were used to fine-tune the posterior distributions corresponding to the
target data (group of SKUs).
4. For demand forecasting also, the model fit on the source data was used to
produce forecasts, corresponding to the target group.
5. Thus, both demand forecasts and PE coefficients from the source data
were suitably transferred to the target data to arrive at more meaningful
price recommendations.
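
A minimal sketch of steps 2 and 3 is given below, assuming Normal priors as in eq. (12.4); the covariance inflation factor is an assumption, not a value from the chapter.

```python
import numpy as np

def transferred_prior(source_posterior_mean, source_posterior_cov, inflate=2.0):
    """Use the source group's posterior as the target group's prior.
    The covariance is inflated to reflect extra uncertainty about the target;
    the inflation factor is an illustrative assumption."""
    mu0 = np.asarray(source_posterior_mean, float)
    sigma0 = inflate * np.asarray(source_posterior_cov, float)
    return mu0, sigma0

# Plug (mu0, sigma0) in as the Prior[0](beta_hat) parameters of eq. (12.4)
# before fine-tuning the posterior with new transactions.
mu0, sigma0 = transferred_prior([-0.7], [[0.01]])
print(mu0, sigma0)
```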
From →Table 12.5, we infer that the demand predictions based on the PE
borrowed via transfer learning are closer to actuals than that of the demand
predictions via ML models such as XGBoost. This also helped us validate the
goodness of the PE coefficients absorbed from the source data to the target
data.

Table 12.5: Validation of the transfer learning method using Dataset 4.

SKU ID | PE coefficient based on regression | PE coefficient from transfer learning incorporated via SPECTS | Actual demand through forward-testing | Demand predictions via ML models for future (forward-testing) | Demand computed from the PE coefficient via transfer learning
SKU 1 | 1.23 | −0.7 | 200 | 112 | 197
SKU 2 | 1.11 | −0.4 | 135 | 98 | 129
SKU 3 | −0.01 | −0.5 | 445 | 417 | 421
SKU 4 | −0.01 | −0.3 | 620 | 554 | 612
SKU 5 | 0.2 | −0.6 | 259 | 312 | 250
SKU 6 | 0.11 | −0.45 | 175 | 198 | 172

From →Table 12.6, we infer that the RMSE for the set of target SKUs has
improved based on the transfer of the model coefficients, based on the XGBoost
models trained on the source data. There is a reduction of 35.37% in RMSE after
incorporating transfer learning for demand forecasting.
Table 12.6: Validation of the accuracy of demand forecasting after employing transfer learning.

SKU ID | RMSE before transfer learning | RMSE after transfer learning | Improvement in accuracy
SKU 1 | 329.49 | 212.47 | 55.04%
SKU 2 | 229.73 | 200.23 | 12.8%
SKU 3 | 225.14 | 120.33 | 46.55%
SKU 4 | 297.93 | 200.27 | 32.77%
SKU 5 | 265.95 | 148.93 | 44%
SKU 6 | 268.87 | 212.20 | 21.07%

12.5.7 Cross-price elasticity computation through SPECTS model


In this subsection, we highlight the approach for identification of substitute and
complement SKUs:
Substitutes were identified based on transaction and attribute data similarity,
wherein a KNN (K-Nearest Neighbours) model was employed to find the top five
similar SKUs corresponding to a particular Key SKU. Similarly, complements were
identified through market basket analysis, and the SKU pairs satisfying a support
threshold of at least 0.01 were filtered, after which the following steps are carried
out:
1. Correlation is studied between the prices of the related SKUs and the sales
of the Key SKU.
2. If the correlation coefficient satisfies a threshold of 0.4, they are taken for
the final cross-PE calculation.
3. Regression models are run for baseline result comparison. Finally, SPECTS
model is run for the set of inter-related styles.
Approach for identification of cross-PE effects is highlighted below:
The demand forecasting engine includes prices of related styles as features
and comes up with predictions. The reward designed for this experiment
includes the predictions after accounting for inter-item effects.
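
A minimal sketch of the substitute identification and the correlation filter described above is given below, using scikit-learn's NearestNeighbors; the attribute encoding and data are assumed placeholders, not the study's actual features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_substitutes(attr_matrix, key_index, k=5):
    """Find the top-k attribute-similar SKUs for a Key SKU (substitute candidates)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(attr_matrix)
    _, idx = nn.kneighbors(attr_matrix[key_index:key_index + 1])
    return [i for i in idx[0] if i != key_index][:k]

def passes_cross_pe_filter(related_prices, key_sales, threshold=0.4):
    """Keep a related SKU only if |corr(price of related SKU, sales of Key SKU)| >= threshold."""
    corr = np.corrcoef(related_prices, key_sales)[0, 1]
    return abs(corr) >= threshold

rng = np.random.default_rng(2)
attrs = rng.normal(size=(50, 8))          # assumed encoded attribute vectors
subs = top_substitutes(attrs, key_index=0)
prices_sub = rng.normal(20, 2, size=60)
sales_key = -0.8 * prices_sub + rng.normal(size=60)
print(subs, passes_cross_pe_filter(prices_sub, sales_key))
```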
From →Figure 12.17 and →Table 12.7, we infer that the new margin after
inclusion of cross-PE is higher than the actual/old margin.
Table 12.7: Margin improvement.

Dataset | Margin increase
Dataset 1 | 2.4%
Dataset 2 | 1.2%
Dataset 3 | 2.6%

Figure 12.17: Comparison of new and old margins.

12.5.8 Comparison of PE coefficients from the SPECTS model across the two sales channels
We analyzed data corresponding to two different sales channels as part of
Dataset 3. Based on our analysis, we found that the price range for the SKUs was
from $4 to $20 for both channel 1 and channel 2, and the sales range was between
1 and 247 units for channel 1 and between 1 and 354 units for channel 2 (→Figure
12.17). We observe that the SKUs on channel 2 (online) had more sales and lower
prices compared to those on channel 1. Inferences based on regression-based
methods for computation of PE coefficients:
1. For about 43% of SKUs, the signs for the PE coefficients are different across
the sales channels 1 and 2.
2. For 50% of SKUs that have the same sign of PE coefficients across both
channels, the PE coefficients are close to each other, wherein the epsilon
distance (for measuring closeness between PE values) is chosen to be 0.5.
Other insights:
1. SKUs are more price elastic in sales channel 2 as compared to that in sales
channel 1.
2. This could be because online e-commerce as a distribution channel is more
competitive and it provides higher capabilities to customers to compare
similar product offers.
→Figure 12.18 shows that the PE of SKUs in sales channel 2 shows more elastic
behavior than that of sales channel 1.

Figure 12.18: Comparison of PE coefficients between sales channels 1 and 2.

From →Table 12.8, we observe a 17.5% improvement in margin after considering
sales channel 2 prices as features in both demand forecast and PE estimation
through SPECTS.

Table 12.8: Margin improvement by including sales channel 2 prices as features.

Setting | Margin
Without considering sales channel 2 price as a feature in demand forecast and PE estimation | 12,296
After considering sales channel 2 price as a feature in demand forecast and PE estimation | 14,449
12.5.9 Long-tailed products pricing through Dyna-SPECTS
The paper [→11] provides a novel online learning algorithm for dynamic pricing
that deals with non-stationary settings due to, e.g., the seasonality or adaptive
competitors, and is very efficient in terms of the need for data. The dynamic
pricing algorithm is paired with a clustering algorithm for the long-tailed
products, which aggregates similar products such that the data of all the
products of the same cluster are merged and used to choose their best price.
The work [→13] conducts research to transfer knowledge from head items to tail
items, leveraging the rich user feedback in head items and the semantic
connections between the head and tail items. A novel dual transfer learning
framework is proposed that jointly learns the knowledge transfer from both
model-level and item-level, building a generic meta mapping of model
parameters from few-shot to many-shot model and through item-level features.
In e-commerce, pricing long-tailed products (LTP) [→12], which are less
popular and have lower sales, is challenging. Traditional pricing methods falter
due to LTPs' sparse data. We propose a dynamic, data-driven pricing approach
to enhance profitability and customer satisfaction. This involves deriving
representative PE values for LTPs, a task complicated by their limited sales and
lack of price changes. Our Dyna-SPECTS solution effectively computes these PE
values, leading to optimal dynamic pricing and revenue maximization.
Problem statement: Long-tailed products form a significant part of e-
commerce inventory but suffer from low demand and sales. Their
unpredictable demand complicates pricing. Since traditional strategies fail
due to inadequate data, a new, data-centric strategy is essential for setting
LTP prices. The goal is to identify representative PE coefficients for these
products.
Exploratory data analysis for LTPs: Our analysis of LTPs encompassed
sales history, pricing, frequency, costs, margins, inventory, and competitor
pricing. This helped in calculating total revenue and margins per product.
Solution via Dyna-SPECTS: We derived representative PE coefficients for
LTPs using the Dyna-SPECTS model. Initially, individual LTPs, for which PE
values could be computed, were identified. For the remaining, we grouped
similar items based on performance and attributes, and computed group-
level PE values. We validated these PE values through various methods,
including comparisons with actuals, integrating them into demand
forecasts, and assessing sales and revenue improvements. This
multifaceted validation confirms the robustness and representativeness of
our computed PE values. We show the results of the PE computation for
the LTP SKUs for the two datasets in →Table 12.9.
Table 12.9: Sales and margin improvement via SPECTS.

Dataset | % increase in sales | % increase in revenue
Dataset 6 | 4.2% | 2.19%
Dataset 7 | 6.3% | 3.1%

The results showed an increase in the sales volume of these products by 4.2%,
and a 2.19% increase in revenue compared to traditional pricing strategies for
dataset 6. Similarly, we observe a sales volume increase of 6.3% and a revenue
increase of 3.1% for dataset 7. The implemented dynamic pricing strategy
demonstrated a significant improvement in the profitability of long-tailed
products.

12.5.10 Scaling methods for contextual bandits with Thompson sampling


Scaling contextual bandits using Thompson sampling involves extending the
algorithm to handle larger and more complex environments efficiently. Scaling
contextual bandits often involves a combination of the following techniques (a
minimal parallelization sketch follows the list).
1. Parallelization: Distribute the computation of Thompson sampling across
multiple processors or machines to handle a larger number of contexts
simultaneously. This can be achieved through parallel computing
frameworks.
2. Batch processing: Process multiple contexts in batches rather than
individually. This can improve computational efficiency by leveraging
vectorized operations and reducing the overhead associated with
individual predictions.
3. Function approximation: Use function approximation techniques, such as
neural networks, to represent the underlying model. This can help handle
high-dimensional contexts and capture complex relationships between
contexts and rewards.
4. Adaptive exploration: Dynamically adjust the exploration-exploitation
trade-off based on the uncertainty in the estimated reward distribution.
This can be done by incorporating adaptive strategies that increase or
decrease exploration as needed.
5. Distributed Thompson sampling: Implement distributed versions of
Thompson sampling, where different agents handle subsets of the
contexts. Global knowledge can be updated by periodically sharing
information among agents.
6. Memory-efficient data structures: Optimize the data structures used to
store and update information about contexts and rewards to minimize
memory usage and enhance scalability.
7. Approximate inference: Use approximate methods for sampling from the
posterior distribution in Thompson sampling. This is especially relevant
when dealing with large datasets or complex models.
8. Incremental updates: Update the model parameters incrementally as new
data arrives, allowing for online learning and avoiding the need to
reprocess the entire dataset. This enables the system to adapt quickly to
changing market conditions.
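
As an illustration of batch processing combined with parallelization, the following minimal sketch draws batched Thompson samples for partitions of SKUs using a Python process pool; it is not the Azure Databricks implementation used in our experiments, and the posterior parameters are assumed.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def thompson_step(batch):
    """Batched TS draw for one partition of SKUs: sample a PE value from each
    SKU's Normal posterior (posteriors summarized by their means and variances)."""
    means, variances = batch
    rng = np.random.default_rng()
    return rng.normal(means, np.sqrt(variances))

if __name__ == "__main__":
    # Assumed posterior parameters for three partitions of 1,000 SKUs each.
    partitions = [(np.full(1000, -1.0), np.full(1000, 0.01)) for _ in range(3)]
    with ProcessPoolExecutor(max_workers=3) as pool:
        samples = list(pool.map(thompson_step, partitions))
    print(len(samples), samples[0].shape)
```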

Experimentation

The choice of strategy depends on the specific characteristics of the problem,
the available computational resources, and the desired level of scalability.
Experimentation and testing are crucial to finding the most effective scaling
approach for a particular contextual bandit scenario. Convergence time for
training using baseline TS and distributed TS for Dataset 1 using early stopping
criterion for 500 iterations is shown in →Table 12.10.

Table 12.10: Time for convergence.

Strategy | Time for convergence
TS | 36.23 min
Distributed TS | 15.47 min

12.6 Conclusion
We present Dyna-SPECTS, an innovative dynamic ensemble-based Thompson
Sampling model, specifically designed for the large-scale computation of PE in e-
commerce, particularly in the kids' clothing sector. This model expertly combines
four distinct Thompson Sampling algorithms – Distribution approximation,
XGBoost regressor, GAN, and MCMC sampling – to determine PE parameters. A
key feature of Dyna-SPECTS is its dynamic ensemble capability, which selects the
most suitable PE values from a range of ensemble strategies like min, max,
average, weighted MAPE, and linear regressor, based on cumulative rewards.
This dynamic approach adapts to non-stationary data and concept drift,
efficiently handling variables like competitor pricing, seasonal changes, and
customer segmentation. Notably, Dyna-SPECTS fulfills the nine crucial real-world
criteria outlined in Section 12.1.
Empirical validation across various datasets demonstrates Dyna-SPECTS’s
superior performance, showing significant sales and margin improvements,
particularly for SKUs with fluctuating PE values. For example, using different
datasets, we observed sales and margin improvements, ranging from 8% to 19%,
and from 2.7% to 8.1%, respectively. Further, the model’s application in transfer
learning showcased a 7% margin improvement and a notable 35.3% reduction in
RMSE in demand forecasts. Additionally, cross PE coefficient incorporation and
price personalization strategies led to margin enhancements up to 5.5%. In
omnichannel pricing, a 17.5% margin improvement was observed, while demand
forecasts through PE distribution outperformed those obtained via standard
machine learning models by 14%. For LTP SKUs, the results showed an increase
in the sales volume of 4.2% and 6.3% for Datasets 6 and 7, respectively. Similarly, we
observed a 2.19% and 3.1% increase in revenue compared to traditional pricing
strategies for Dataset 6 and Dataset 7, respectively.
Scalability was a critical focus in our development of Dyna-SPECTS. We
implemented techniques like dimensionality reduction and approximate
inference to manage a high volume of SKUs efficiently. Our model achieved
convergence in half the time of baseline Thompson sampling models,
demonstrating both computational efficiency and scalability. Conclusively, Dyna-
SPECTS not only provides timely PE coefficients but also anticipates future
market trends, enabling retailers to optimize pricing for both short-term and
long-term objectives. It marks a significant advancement in dynamic PE
computation for e-commerce, offering a scalable, adaptable, and empirically
validated solution. In future, we aim to enhance Dyna-SPECTS by integrating
additional ensemble strategies and applying graph-based deep learning and
diffusion methods to further refine its performance.

References
[1] Z. Zhu, and B. Van Roy, Scalable neural contextual Bandit for recommender
systems. arXiv Preprint arXiv:2306, pp. 14834, 2023,
→https://arxiv.org/abs/2306.14834. →
[2] AWS IS →https://aws.amazon.com/blogs/machine-learning/power-
contextual-bandits-using-continual-learning-with-amazon-sagemaker-rl/ →
[3] Y. Zhu, et al., Contextual bandits with large action spaces: Made practical.
International Conference on Machine Learning, PMLR, 2022. →
[4] F. Guillou, R. Gaudel, and P. Preux, Large-scale bandit recommender system.
Machine Learning, Optimization, and Big Data: Second International Workshop,
MOD 2016, Volterra, Italy, August 26–29, 2016, Revised Selected Papers 2.
Springer International Publishing, 2016. →
[5] X. Chen, et al., Online statistical inference for contextual bandits via
stochastic gradient descent. arXiv Preprint arXiv:2212.14883, 2022. →
[6] K. Mahadik, et al., Fast distributed bandits for online recommendation
systems. Proceedings of the 34th ACM International Conference on
Supercomputing, 2020. →
[7] L. Song, C. Tekin, and M. Van Der Schaar, Online learning in largescale
contextual recommender systems. IEEE Transactions on Services Computing, vol.
9, no. 3, pp. 433–445, 2014. →
[8] R. Zhang, et al., Scalable Thompson sampling via optimal transport. arXiv
Preprint arXiv:1902.07239, 2019. →
[9] P. Dutta, et al., AutoML for Contextual Bandits. arXiv Preprint
arXiv:1909.03212, 2019. →
[10] S. Sethuraman, and S. Ramanan, SPECTS: Price elasticity computation model
using Thompson sampling. In: 2022 International Conference on Computational
Science and Computational Intelligence (CSCI), IEEE, pp. 640–646.
[11] M. Mussi, et al., Dynamic pricing with volume discounts in online settings.
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 13,
2023. →
[12] M. Mussi, et al., Pricing the long tail by explainable product aggregation and
monotonic bandits. Proceedings of the 28th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining, 2022.
[13] Y. Zhang, et al., A model of two tales: Dual transfer learning framework for
improved long-tail item recommendation. Proceedings of the Web Conference,
2021. →
Notes
1 →https://www.kaggle.com/competitions/h-and-m-personalized-
fashion-recommendations/data.
2 →https://www.kaggle.com/datasets/olistbr/brazilian-e-commerce.
13 Creating a metadata schema for reservoirs of
data: a systems engineering approach

Barry C. White
Rachel E. Jordan
Reena R. Patel
LaKenya K. Walker
Matthew D. Bray

Note: This work was supported through the US Army military engineering direct
Model-based Assessment of Sensors and Countermeasures program. This
chapter has been approved for public release.
Short Research Paper

Acknowledgments: This research was conducted on behalf of the US Army


Engineer Research and Development Center’s Information Technology
Laboratory. The support and resources from the ERDC DSRC (Engineer Research
and Development Center Department of Defense Supercomputing Resource
Center) under the subproject Environmental quality modeling and simulation
are gratefully acknowledged. Permission was granted by the director,
Information Technology Laboratory, to publish this information.

Abstract
Physical modeling procedures, with intermediate data, are being developed for
the large-scale generation of synthetic imagery for automated target
recognition (ATR) machine learning (ML) algorithms. This imagery is typically
combined with field-collected data for generating robust training sets. The
management and retrieval of this data requires large-scale storage with a means
to query different types of information. Queries need to be performed for
selection of data sets to the single file. The goal of this study is to establish a
systematic approach for storing and retrieving this information using high-
performance computing resources and the Integrated Rule-Oriented Data
System (iRODS). Metadata tags, focused on location, time, and other factors are
used for search queries. Whenever possible, metadata generation is automated
based on the content of the data file. Use cases for the import and query
operations are created. Simple scalable problems have been processed and are
presented for this data set procedure. Additionally, we have outlined the
proposed system architecture. This data storage and retrieval system will serve
to provide locality-specific data for ATR ML data-sets from a large set of collected
and synthetic imagery.
Keywords: big data, data lake, infrared imagery, ATR, machine learning, iRODS,
MeScheR,

13.1 Introduction
Effective management of large data produced from multiple sources using
engineering processes requires a structured systems engineering approach.
Software components for synthetic scene generation, distribution of photonic
energy, subsurface thermal heat transfer (for soil and vegetation), and synthetic
sensor capabilities are integrated to generate robust training imagery for
automated target recognition (ATR) machine learning (ML) algorithms. Input
data sets include 3D models of vegetation, objects (e.g., vehicles), and
landscapes, which have been captured or synthesized. They also encompass
material properties for these models and intermediate software responses like
computation of incoming/outgoing radiative energy flux and surface
temperatures. Output data sets include imagery and surface fluxes for the
geometry of the entire scene. The data comes in many formats: landscape object
geometry is stored in Wavefront OBJ files, 2DM files, or direct Light Detection
and Ranging (LiDAR) data sets. Intermediate results are stored in formats
specific to the generating software. Output images may be stored in Tag Image
File Format (TIFF), Joint Photographic Experts Group (JPEG), or other image
formats [→1]. Additionally, AI ATR training data may be stored separately in
specially formatted XML files [→2].
Generated data is obtained from software written to create artificial
environments that mimic the radiative energy patterns of real environments,
especially in situations where collecting real-world data is challenging or
restricted. Several computer applications are used to: (1) construct a scene, (2)
distribute radiative flux in the scene (based on time of day, initial temperatures,
etc.), (3) accurately simulate subsurface effects of absorbed radiative energy in
vegetation and soils, and (4) generate sensor-specific imagery based on the
results. This imagery will be verified by and used with existing physical images to
inform and train artificial intelligence automated target recognition algorithms
(ATR).
In this report, we describe a system for managing and retrieving the data
needed for ATR ML. This architecture is a hybrid implementation using a data
lake and a metadata database management system. A data lake, as per
Wikipedia, is essentially a collection of mass storage devices that stores data in
its natural formats as “blobs” of information [→3]. The data lake is used with a
database management system dedicated to metadata for efficient searches of
files with certain characteristics. In the future, we plan to incorporate the
Integrated Rule-Oriented Data System (iRODS) [→4, →5, →6]. iRODS creates a
virtual file system on a distributed database file system, potentially involving
multiple servers [→4, →6, →7]. While our current system can manage basic
functionalities without iRODS, it lacks the scalability and flexibility of iRODS.
Overall, the goal of this project is to establish a system capable of storing
metadata at each stage of the process. This approach ensures that the stored
data can be easily retrieved and queried when needed. This work was originally
introduced at the CSECE ICDATA conference 2023 [→8].

13.2 Defining users and processes


The system has five main user groups:
1. Research scientists: They use the system to develop, test, and validate their
code. The research scientist, as the initial user type, focuses on building
mathematical models to process input data and generate desired outputs.
This involves debugging individual code runs and workflow processes. For
testing purposes, the scientist can retrieve data from previous scenes.
Comparisons with past outputs aid in evaluating new functionalities. For
their queries, the scientists are primarily interested in specific inputs
required for their software and the corresponding outputs. Alternatively,
they may need inputs that allow a comprehensive examination of the
workflow of the entire project. These are the key requirements for the
research scientists using the system. Research scientists want to review
the data comprehensively, starting from its origin to its final output.
2. Content producers: Content producers are responsible for creating input
files for generating synthetic images. This involves gathering data related
to landscapes, vegetation, objects, material properties, and weather
conditions. Once a synthetic scene is developed, it is added to the
database for future use. Content producers also determine parameters
such as sensor positions and orientations, for generating synthetic images.
This functionality will involve pulling scene geometry and temperature files
for visualization purposes.
3. Content providers: Content providers contribute real-world data, such as
material properties, weather data, and infrared images. Integrating this
data improves numerical modeling and enriches content for machine
learning algorithms, improving their accuracy. These users usually supply
data for specific collections within the system.
4. Consumers: Consumers access output data to provide images for ML ATR
algorithms or for visualization purposes, including geometry and suitable
texture based on wavelengths. These users concentrate on output
collections from the workflow and prefer selecting data using specific
keywords, such as “forest” and “arid,” to find the information they need.
5. Administrators: Administrators oversee the system and are responsible for
safeguarding the data rather than actively using it. Their purpose is to
maintain the security, integrity, and reliability of the system. Administrator
duties depend heavily upon the exact implementation of the metadata
database management system; thus, the specific duties of an administrator
are outside the scope of this study and are not discussed further in this
report.
Having different types of users leads to the creation of distinct sets of usage
scenarios. The data and metadata need to be processed in collections because
different users will require data from different stages of the process. Collections
of data are stored for each step of the process with identifying metadata
(specific to the collection). This metadata may be transferred to subsequent
collections, culminating in a diverse set of infrared images for training purposes.
The detailed usage scenarios, including the generalized use cases for
administrators, will be discussed later in this report.

13.3 Data collection structure


A well-organized data structure needs to be established to efficiently manage
the large data generated and collected for developing robust training sets for
ATR algorithms. Although the concept of a data lake implies a repository of
unfiltered data that can be sifted through to find information, the data for this
research is logical and organized in its construction, following a series of
processes to collect, assimilate, and generate specific data. Storing the unfiltered
data in a single place would eliminate the advantage of having a structured
process for searching and manipulating the resultant data. Distributing the data,
with metadata, in specific groupings would preserve the logical flow of the data
ensuring it aligns with the established processes.
To avoid the inefficiency of disorganized data lakes, we propose the concept
of “data reservoirs,” or collections of similar data in a subdirectory tree. As part
of this data reservoir concept, each subdirectory storing the collected data can
be referred to as a reservoir. Each of these reservoirs will contain all the
developed and collected infrared data required for each data category. Each
category will be differentiated from the next according to their stored metadata
tags. Metadata tags provide additional organization for the data reservoirs and
will be further discussed later in this report.
The reservoirs have been set up to collect data from the various stages of
data collection and generation. To ensure a reliable, secure storage location for
these reservoirs, we have identified a location on a Centralized File System (CFS)
in the High Performance Computing Center (HPCC). Within this system,
subfolders that hold the reservoirs are further categorized into logical process
folders named: Source_Models, Scene_Info, and Output. This CFS is managed
and backed up regularly for reliability.
The Source_Models folder stores the input data necessary to initiate an
analysis and create a scene. This data includes geometric information for
vegetation, objects, and landscapes. Additionally, it includes quantitative data for
materials that can be applied to these geometric models.
The Scene_Info folder holds details about a scene, including its geometry,
materials, boundary conditions, etc. This includes meteorological data
(Met_Data), which includes time of day, humidity, and other conditions. The
Scene_File has information on combined geometry as well as additional data
useful for ML ATR algorithms, such as bounding boxes for target data.
The Output folder contains the finalized data obtained from simulations of
physical events in the Images subdirectory and the Temperature_Data
subdirectory for efficient rendering of geometry temperatures on different
platforms. This data includes sampled imagery, derived imagery, and the ML ATR
information used for training purposes. Sampled imagery may have additional
metadata that has been included by the sampling device (e.g., FLIR camera
system) such as the Exchangeable Image File Format (EXIF). EXIF data can
contain when, where, and how an image was obtained (e.g., focal length). The
Exifread library for Python can obtain this metadata.
Within the process folders, there are several reservoirs, each storing a
specific category of data. Each reservoir within a category shares the same
metadata tags. For example, within the Source_Models process folder, there are
separate Vegetation and Objects reservoirs. These reservoirs serve as storage
locations for vegetation and object data, respectively. It is important to note that
the Vegetation reservoir uses a specific set of metadata tags, while the Objects
reservoir employs an entirely different set of metadata tags. The overall layout of
this data structure is illustrated in →Figure 13.1.
Figure 13.1: Organizing the data reservoirs for the collected infrared data.

13.4 Metadata
Metadata, in essence, offers information about the stored data [→9]. There are
different categories of metadata: for instance, legal metadata describes the
legality of using the data and descriptive metadata has phrases or words that
describe the data in detail. Our system primarily utilizes descriptive, structural,
and administrative metadata [→10]. Descriptive metadata can be used for
categorization and identification, structural metadata maintains the organization
and connectivity of the data, and administrative metadata provides details about
who collected the data and its purpose. Any of these metadata types can be
used to search for data and uniquely identify specific data within extensive
collections.
Metadata can be generated either automatically or with user input.
Automatically generated metadata is easy to collect and uniquely specific to the
data collection process and how it gathers information about the data. It usually
contains information about when, how, and why the data was created rather
than describing the contents of the data. This metadata is typically structural or
administrative in nature. In contrast, user-generated information is more
difficult to collect, describing the content of the file (e.g., labeling data as a
“forest” scene) in such a way that the data can be aggregated. This aggregation
(e.g., the set of files that are for “forest” scenes) enables efficient data searches
[→11]. However, user-generated metadata can be chaotic without defined
standards, raising questions such as whether to include the name of the data
generator.
For efficient organization and searchability, user-generated metadata needs
a clear format. This format should ideally use a method that organizes metadata
in a database using tags and corresponding values. This generic yet well-defined
system aims to reduce user error during metadata input and query selections,
providing a structured and reliable approach to managing data.
When working with a series of processes like the ones we are handling, there
are inherent formats and related metadata that can simplify file retrieval.
Therefore, it is logical to elicit metadata formats from the engineers who
established the processes generating the data. This approach helps outline the
necessary metadata for precise data retrieval, even down to individual files.

13.5 MeScheR
The Metadata Schema Reservoir (MeScheR) was developed to structure user
generated metadata. This system establishes a clear schema defining what
metadata is collected and the method of collection. For example, data from the
Image reservoir may have metadata storing details such as the image’s height
and width in pixels, information that can be extracted directly from the image.
Using a schema for data collection ensures that the prescribed metadata is
consistently collected for all data stored in the reservoir. It also guarantees that
the user interface includes all the required metadata elements. In essence, MeScheR utilizes
schemas to organize the metadata stored in a reservoir.
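For illustration, such directly extractable metadata could be gathered with a short script like the following sketch; the use of the Pillow library and the tag names ImageWidth and ImageHeight are assumptions made for this example, not the prescribed schema:

from PIL import Image

def image_dimension_tags(image_path):
    # Extract the pixel dimensions directly from the image file.
    with Image.open(image_path) as img:
        width, height = img.size
    return {"ImageWidth": width, "ImageHeight": height}  # tag names are illustrative only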
Each data reservoir has its own specific metadata schema file tailored to the
elicited data descriptions required for the specific data group. This schema data
is stored in a JSON formatted file known as a MeScheR file. Every MeScheR file
contains information confirming its validity, the location of the automatic data
collection software, and a properties section. The properties section outlines the
tag and value sets of metadata to be input by the user as well as logical
constructs for the conditional application of tags based on user input. Metadata
tags can be configured as “one-to-one” (e.g., allowing only one value per tag) or
“one-to-many” (e.g., permitting multiple values for descriptor tags in a scene).
Each data reservoir contains a MeScheR file specific to their collection of
metadata.
Once a MeScheR file is prepared, a MeScheR user interface software can be
used to read the MeScheR file, collect the required metadata for a reservoir, and
automatically generate an interface to collect that metadata according to the
properties section of the MeScheR file. The resulting user interface is
constructed using the properties section such that the interface uses each tag
from a metadata set to prompt the user for corresponding values. These tag and
value sets can include specific data (e.g., originator of the data) as well as more
general keyword data (e.g., descriptor of the data). Once the user provides all
necessary information, the software generates a JSON metadata file for each
input data file. These JSON files can be seamlessly integrated into metadata
database management systems such as iRODS. The ability to collect metadata
according to the MeScheR file’s specifications ensures that the process is driven
by the data itself.
Routines for managing and searching through these JSON files have been
developed demonstrating the usability of the JSON files. These routines can be
applied to any reservoir chosen by the user. The management routines are
capable of listing all existing metadata tag values, adjusting singular metadata
tag values, and changing all occurrences of a metadata value based on the tag.
Additionally, the search routines are capable of searching for similar metadata
tag values. In the Direct Query routine, users input a tag and value to search for,
and the system returns a list of all data files containing the tag value pair. An
Augmented Query routine has been tested that prompts the user to enter a
value for the Descriptor tag and uses a large language model to measure
synonymity [→12] within the data file’s Descriptor metadata tags that match the
query value. The routine then lists all data files determined to be similar along
with the confidence value indicating the degree of similarity.
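A minimal sketch of such a Direct Query routine over the per-file JSON metadata is shown below; the directory layout, file naming, and the fallback to the DataPath tag are assumptions made for this example:

import json
from pathlib import Path

def direct_query(reservoir, tag, value):
    # Return the data files in a reservoir whose metadata contains the tag/value pair.
    matches = []
    for meta_file in Path(reservoir).glob("*.json"):
        metadata = json.loads(meta_file.read_text())
        stored = metadata.get(tag)
        # Tags may be one-to-one (scalar) or one-to-many (list of values).
        if stored == value or (isinstance(stored, list) and value in stored):
            matches.append(metadata.get("DataPath", meta_file.stem))
    return matches

# Hypothetical usage: direct_query("Output/Images", "Descriptor", "forest")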

13.6 Metadata attributes


The types of metadata created are different based on the reservoir they are
placed in. For the physical modeling data, metadata includes details about the
process, locality, condition, and specifier information. Process information
includes the metadata tags like ProjectName (name of the current project),
Originator (name of the person who entered the data), and OriginationDate
(when the project data was created) among others. Locality metadata includes
geolocation information such as latitude, longitude, and elevation for a scene
along with tags that provide information about the data’s boundaries. Condition
information includes meteorological (e.g., season, “rain”) and rendering time-
based data for a scene. Specifier data includes information about vegetation and
object types used in the scene, as well as the generic Descriptor tag that allows
for selection of larger groups of data (e.g., “forest” or “tundra”).
In some cases, data can be derived from preexisting data in other reservoirs.
Metadata tags with values that reference the prior data are used to
automatically transfer these tags. For instance, a scene that uses creosote
bushes may inherit the “Descriptor” = “forest” tag or other project information
from the preexisting data.
In certain situations, data can be collected from imported files directly. For
instance, the axis-aligned bounding box of geometry data for vegetation,
objects, and landscapes can be automatically extracted from the file, and this
metadata can then be assigned to the respective data.
→Figure 13.2 displays the format of the metadata information that can be
incorporated, as demonstrated in use case 4. The metadata is stored in JSON
format, where data tags containing multiple values (e.g., Descriptor tag), are
stored within a list.

Figure 13.2: Ingestible JSON format representing image metadata tags and
their corresponding values (with sample values included).
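A record in this format might, for instance, look like the following sketch; the tag names are taken from this chapter, while the values are invented sample data:

import json

record = {
    "ProjectName": "ExampleProject",       # process information
    "Originator": "J. Doe",                # who entered the data
    "OriginationDate": "2023-04-01",       # when the data was created
    "DataPath": "Output/Images/Inst00026_TemperatureImg.jpg",
    "Descriptor": ["forest", "ravine"],    # one-to-many tag stored as a list
}

with open("Inst00026_TemperatureImg.json", "w") as f:
    json.dump(record, f, indent=2)  # produces an ingestible JSON metadata file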

13.7 Use cases


Once the user groups are identified, specific scenarios can be outlined for each
group. These scenarios are primarily designed to align with iRODS but are also
adaptable to accommodate other functionally similar systems until iRODS is fully
implemented as the metadata database management system. Therefore, some
of the terminology used is closely associated with iRODS. In these use cases, the
term reservoir is interchangeable with iRODS “landing zone” (LZ) and “metadata
database” corresponds to iRODS. The Policy Enforcement Point (PEP), similarly,
can be related to a scheduled task enforcing metadata storage policies. The use
cases draw parallels with previous work performed [→5].

13.7.1 Case 1: administrator landing zones/data reservoirs


Administrator(s) have the ability to create new collections within the HPCC’s file
system (CFS). To do this, they must create a MeScheR schema file and establish
corresponding automation rules within the data reservoir. Additionally,
administrators are required to formulate rules in iRODS rule (.re) or Python (.py)
format. These rules serve the purpose of registering data in iRODS and applying
metadata from ingestible metadata files generated by the MeScheR processes.
To facilitate this, administrators need to design a web dashboard for importing
the data and requesting/collecting valid metadata. The ingestible metadata will
be generated by the web application. Subsequently, a scheduled process will be
assigned to the new collection to ingest the data and the metadata. The
ingestible metadata file can be removed after successful processing into iRODS.
This process might not be possible through a web interface.

13.7.2 Case 2: administrator users


Administrators holding group privileges are tasked with setting up new user
accounts, managing user and group privileges, and deactivating old user
accounts. A user-friendly web interface is adequate for performing these tasks.

13.7.3 Case 3: administrator archive data


Data checks can be performed by administrators to determine the last
access date of stored data. If the time since that access exceeds a limit, the data can be
transferred to long-term storage. This may not require direct administrator
intervention and can be automated through a scheduled task.

13.7.4 Case 4: user/iRODS ingest data (content producers/content providers)

Users initiate a secure login to access the site dashboard. Subsequently, they can
navigate to collections they have permissions to access. Users are then able to
select a data file (or files) for ingestion from their local storage. Depending on
the collection, different metadata will need to be entered using the MeScheR
data format. Default values for metadata will be provided (although this
approach does NOT support individual file discovery).
The dashboard will upload the chosen data to the selected LZ and generate a
metadata file tailored to that data. A scheduled task will execute the necessary
iRODS commands to upload and register the data, triggering the PEP rule. This
process enables data registration and metadata application in the iRODS
databases (→Figure 13.3).

Figure 13.3: Use case 4 – uploading data and metadata to iRODS.

These rules can retrieve metadata from other collections as data moves through
the project pipeline. For instance, an image may be a product of a particular
scene; in that case, the project title and prefix may be accessible from the
metadata for the scene. However, it is necessary that users input the necessary
metadata the first time it is required. This requirement will be enforced in
subsequent iterations.
Users’ access privileges may be restricted to specific collections (e.g., content
providers may be limited to Source_Models or Output -> Images).

13.7.5 Case 5: user/iRODS retrieve data (researchers/content producers/consumers)

The user performs a secure login to the site dashboard and navigates to the
appropriate collection(s). They initiate queries based on a “metadata tag:
metadata value” format to select data they need.
The dashboard activates the PEP for data retrieval, identifies the files to
collect on the CFS, and compiles them into a new tar file for user retrieval
through the dashboard (→Figure 13.4). The tar file can be removed from the
CFS after delivery, likely through a scheduled task to allow time for external
retrieval, if necessary. The specific query format in the dashboard is still under
design.

Figure 13.4: Use case 5 – downloading data referencing metadata in iRODS.

Users’ access permissions may be restricted to specific collections (e.g., content
producers may be limited to Source_Models and Scene_Info while consumers
may be limited to Output).
13.7.6 Case 6: users change or add metadata values
This use case poses challenges as it necessitates user awareness of registered
data and how it can be uniquely identified, usually through its data path or file
name. This information can be cross-referenced with the DataPath metadata tag
linked to each LZ’s metadata. This data can be identified if the user has
information about identifying tag/value pairs for the specific file.
Once the data has been uniquely identified, an attribute/value pair is
submitted to either reset the attribute values or add a new attribute value for
attributes that allow for multiple values. This modification is instantaneous and
does not require a scheduled task or ingestible files (→Figures 13.5 and →13.6).

Figure 13.5: Use case 6 – changing metadata values in iRODS.


Figure 13.6: Use case 6 – adding metadata values in iRODS.

13.7.7 Case 7: users remove metadata attribute values


This use case also requires the user to be aware of the registered data and be
able to identify it uniquely, typically using its data path or file name. This
information can be compared to the DataPath metadata tag associated with
each LZ’s metadata.
Once the data has been uniquely identified, an attribute/value pair can be
submitted by users. For attributes that can only have one value, the value is set
to NULL. For attributes with a list of values, the matching value is removed from
the list. This modification is instantaneous and does not require a scheduled task
or ingestible files (→Figure 13.7).
Figure 13.7: Use case 7 – removing metadata tags from iRODS.

13.7.8 Case 8: users remove data and its associated metadata


This use case is developed for users to remove data from a reservoir particularly
if it was placed there erroneously. The user will need to provide a valid file name
to initiate the process. If the file name is unknown, a search using unique
metadata tag and values can help identify it. The software searches for the
specified file, if it exists. When the file is located, all metadata references that
apply to that file are removed from the metadata database and/or the ingestible
metadata files associated with it. After all referencing metadata has been
completely removed, the file is deleted from the reservoir (→Figure 13.8).
Figure 13.8: Use case 8 – removing data and associated metadata from iRODS.

13.7.9 Case 9: query list of existing metadata values


An ancillary function is needed to provide user-friendly search query options.
Searching based on descriptive and administrative metadata can be challenging,
especially as more data providers are added or new descriptive values are
introduced (e.g., Descriptor = “arctic”). In these situations, providing a list of
existing metadata tag values would be beneficial to the user who needs to query the
data by aiding rapid evaluation of ambiguous search terms. While the list of
metadata tags can be identified by looking at the MeScheR file associated with
the data reservoir, a method to query and return the existing values assigned to
a metadata tag would be invaluable.
In this use case, the user will specify a data reservoir to search and the
metadata tag that will be searched for queryable values. The software will
respond by providing a Python dictionary of existing metadata tags paired with
lists of corresponding values. This outcome streamlines selection of data by
users and facilitates efficient AI searches using ambiguous search terms.
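A possible sketch of this helper, reusing the per-file JSON layout assumed in the earlier sketches, is shown below:

import json
from pathlib import Path

def list_existing_values(reservoir):
    # Build a dictionary of metadata tags paired with the values currently in use.
    values = {}
    for meta_file in Path(reservoir).glob("*.json"):
        for tag, stored in json.loads(meta_file.read_text()).items():
            entries = stored if isinstance(stored, list) else [stored]
            values.setdefault(tag, set()).update(str(v) for v in entries)
    return {tag: sorted(vals) for tag, vals in values.items()}

# Hypothetical usage: list_existing_values("Output/Images").get("Descriptor")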

13.8 Results
We conducted tests for consumer user scenarios using MeScheR reservoirs.
These reservoirs were created from the metadata requirements elicited from
project developer users. A test dataset of 65 images, alongside additional images
and field collected data, was used to test the Python code for these use cases.
Each test aimed to add, modify, and remove files along with their ingestible
metadata from the Images reservoir. Various groups of files were assigned
different metadata values to evaluate query outcomes. The query use cases
were performed to determine if distinct groups, down to individual files, could
be identified. The results were compiled into a list of file names as depicted in
→Figures 13.9 and →13.10.

Figure 13.9: Results from a direct query search for the descriptor value
“desert”.
Figure 13.10: Results from an augmented query search for the descriptor value
“forest”.

The results from these procedures were verified by cross-referencing the
metadata files and the corresponding data files. Additionally, metadata that was
automatically collected from the files, including EXIF data [→13], was also
validated.
The accessibility features of iRODS metadata were evaluated in the following
manner at the HPCC: preliminary tests of these use cases and of the automated data
gathering and metadata assignment features were performed on a high-
performance computing system with extensive storage capabilities. A dedicated
area of this storage space has been designated as a permanent data lake for the
collected data. A cluster of servers is employed in conjunction with the HPCC CFS
iRODS information catalogs. The data collection structure has been established
within the data reservoir system, and an in-house command line interface script
has been developed to prototype import and search functionalities within this
data reservoir system. This script creates ingestible metadata files in a JSON
format. Routines to read the JSON data and import the metadata into iRODS
have also been developed. The subsequent subsections discuss the assignment
of metadata to synthetic images, testing the metadata assignment on these
synthetic images, and an illustrative query example executed through the
command line to demonstrate efficient data retrieval process.

13.8.1 Assignment of metadata


The systems engineering approach proposed in this study has been
demonstrated by using a set of 65 synthetic image files, identical versions of
these files featuring axis-aligned bounding boxes indicating target locations, and
corresponding 65 visual object class (VOC) XML files [→14] to provide support for
ATR ML. Also included in this data set is the formatted information regarding the
synthetic camera setup for the images. →Figure 13.11 depicts one of the 65
synthetic images used for demonstration purposes. Metadata can be manually
assigned for an image file using the following command:

Figure 13.11: Synthetic image generated from high-fidelity physics simulations.

imeta set -d Inst00026_TemperatureImg.jpg Descriptor "forest"

13.8.2 Testing assignment of metadata


This section discusses the testing of metadata assignment on the 65 synthetic
images. →Figure 13.11 is an example image depicting a ravine with trees. It is
important to test the correct assignment of metadata to ensure that it correctly
reflects the relevant information and adheres to the metadata standards used in
the study. The image dataset is deliberately kept small because smaller datasets
facilitate a faster testing phase. This can be important during the early stages of
development when multiple iterations of testing and refinement are necessary.
Additionally, smaller datasets also make visual inspection manageable, aiding in
the assessment of metadata assignment accuracy. The following command was
executed on the HPC (Cray XC40/50 with a LUSTRE file system) to validate the
assignment of metadata:
imeta ls -d $HOME/DirName/ProjectName/Inst00026_TemperatureImg.jpg
Upon execution of the above command, the system will display the metadata
that was previously assigned to the image file in the preceding subsection. This
information can either be stored in a file for future reference or displayed
directly on the screen.

13.8.3 Query example with iRODS


A query example is executed through the command line to demonstrate an
efficient procedure for retrieving data. Data can be queried by using various
criteria (e.g., geographical area, author name, time of collection). The following
query is used to retrieve data with the descriptor “forest”:
imeta qu -d Descriptor = 'forest'
Upon execution of the above command on iRODS, a list of files matching the
query is displayed on the screen. Visual inspection of the data showed that the
files displayed were appropriately labeled with the assigned metadata “forest.”
The list includes complete path information for selection and downloading.
Executing the query through the command line allows for an efficient retrieval process,
suitable for automation and implementation on a large dataset of images. The
results of the query can either be saved to a file or displayed directly on the
screen.
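As a hedged sketch of such an automated retrieval step, the command shown above could be wrapped in Python using the standard subprocess module; the function name and error handling below are our own additions for illustration:

import subprocess

def query_by_descriptor(value):
    # Run the iRODS metadata query shown above and return its raw text output.
    result = subprocess.run(
        ["imeta", "qu", "-d", "Descriptor", "=", value],
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # list of matching collections/data objects

# Hypothetical usage: print(query_by_descriptor("forest"))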

13.9 Conclusions
In this chapter, we have presented a comprehensive framework for efficient
management and retrieval of large datasets on high-performance computing
systems using a metadata database management system. The proposed
architecture has been tested with metadata JSON files that are proven to be
ingestible with iRODS (or another metadata database) commands. This
implementation has been tested using a dataset of 65 synthetic infrared images.
A metadata schema was created and tested, incorporating user input.
Furthermore, our approach was able to identify specific data files
for retrieval through metadata queries, demonstrating the effectiveness of our
methodology in handling large-scale data management challenges in high-
performance computing environments.

13.10 Future work


Our current work has been focused on the consumer user tools; future work will
include an expansion of these capabilities for secure administration tools. This
expansion will involve the development of a web-based GUI tailored to specific
metadata requirements of different landing zones (LZs). For security, the
administrative tools will be kept separate from the user tools, enabling tasks
such as LZ creation, management, backup, and security. Local implementations
of Kibana [→15] and iRODS will be explored to assess their functionality. Our in-
house developed routines will be combined with user rules in iRODS for tasks
like file registration, metadata assignment, and metadata queries. Moreover, we
aim to extend our testing to diverse data types, evaluating the effectiveness and
applicability of the metadata assignment and retrieval process with larger and
more varied datasets. Additionally, our web-enabled interface for data
management will incorporate Common Access Card (CAC) authentication
capability, and controlled access based on the user type (for example, the field
data collector will have a different level of access than a project
manager).

13.11 Brief discussion of the Integrated Rule-Oriented Data System (iRODS)

As stated previously, iRODS leverages a distributed database file system,
potentially involving multiple servers, to implement a virtual file system [→4, →6,
→7]. A service application on the primary server maintains a database for
“collections” and “data objects.” In this context, a collection corresponds to a
folder or directory in a typical file system, representing a reservoir in our case. A
data object refers to the actual data file (or reference to a data file) along with
associated metadata for improved search capabilities.
Metadata, associated with the file or its reference, is structured in the format
shown in →Table 13.1 (illustrated with an example).
Table 13.1: Table of metadata format with an example for iRODS.
Data tag Data value Units
TimeOfDay 1330 24 hour clock; local time zone

Metadata tags in iRODS, whether created by the user or automatically
generated, are customized based on specific project requirements. These tags
may have multiple values associated with them, allowing for a flexible one-to-
many relationship. For instance, a data tag named “keywords” may be repeated
to contain all of the keywords needed for a specific dataset.
The primary service also watches for events that can trigger a Policy
Enforcement Point (PEP), such as importing a file. When these events occur,
default or user-written rules are applied, affecting both the process and data
objects. For instance, if a user-defined rule is assigned to a PEP for adding a file
to a specific collection, the rule may collect metadata related to that file, and
replicas of the file may be stored for security purposes. This process transforms
the collection, or reservoir, into a “Landing Zone” (LZ) for data.
Queries can be performed on data objects according to their metadata such
that the file can either be uniquely identified (e.g., a file named
“Vegetation\Creosote.obj”) or filtered using keywords in metadata (e.g.,
“TimeOfDay > 0500 and TimeOfDay < 1200” or “Descriptor = ‘forest’ and
Descriptor = ‘road’”).
The iRODS consortium offers a command line interface (CLI), but there are
also clients with graphical interfaces. Different web-based graphical interfaces
exist, such as Hydroshare and MetaLnx [→16]. Dashboards for adding,
attributing, and querying data are accessible, using auditing plugins for iRODS
and software for Elasticsearch (e.g., Kibana) [→14].

References
[1] J. D. Murray, and W. VanRyper, Encyclopedia of Graphics File Formats,
Cambridge, MA: O’Reilly and Associates, Inc, 1996. →
[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman, The
PASCAL visual object classes (VOC) challenge. International Journal of Computer
Vision, vol. 88, no. 2, pp. 303–338, 2010,
→http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf. →
[3] Wikipedia. Data Lake. Wikipedia, 2023, Accessed April 2023.
→https://en.wikipedia.org/wiki/Data_lake. →
[4] M. Hedges, A. Hasan, and T. Blanke, Management and preservation of research
data with iRODS. CIMS’07: Proceedings of the ACM first workshop on
CyberInfrastructure: information management in eScience. pp. 17–22, 2007.
→https://doi.org/10.1145/1317353.1317358. a, b, c
[5] K. D. Winters, M. A. Cowan, G. E. George, M. E. Gonzales, B. Priest, O. Morris,
and J. Landrum, Analysis of ERS Use Cases for iRODs, ERDC/ITL SR-20–8,
Vicksburg, MS: US Army Engineer Research and Development Center, 2020. a, b
[6] M. Conway, R. Moore, A. Rajasekar, and J. Y. Nief, Demonstration of policy-
guided data preservation Using iRODS. 2011 IEEE International Symposium on
Policies for Distributed Systems and Networks. pp. 173–174, 2011,
→https://doi.org/10.1109/POLICY.2011.17. a, b, c
[7] K. D. Winters, M. A. Cowan, G. E. George, M. E. Gonzales, B. Priest, O. Morris,
and J. Landrum, Integrated Rule-Oriented Data System (iRODS) and High
Performance Computing (HPC) Requirements Document, ERDC/ITL SR-20–10,
Vicksburg, MS: US Army Engineer Research and Development Center, 2020. a, b
[8] B. C. White, R. R. Patel, L. K. Walker, and M. D. Bray, Management, storage, and
retrieval of complex data comprising multiple formats collected from different
sources: A systems engineering approach. CSECE 2023. 2023.
[9] Merriam-Webster. Metadata. Merriam-Webster Dictionary. 2023, Accessed
April 2023. →www.merriam-webster.com/dictionary/metadata. →
[10] J. Pomerantz., Metadata, Cambridge, MA: MIT Press, 2015. →
[11] G. P. Rodrigo, M. Henderson, G. H. Weber, C. Ophus, K. Antypas, and
L. Ramakrishnan, ScienceSearch: Enabling search through automatic metadata
generation. 2018 IEEE 14th International Conference on e-Science (e-Science). pp.
93–104, 2018, →https://doi.org/10.1109/eScience.2018.00025. →
[12] N. Reimers and I. Gurevych, Sentence-BERT: Sentence embeddings using
Siamese BERT-networks. EMNLP’19: Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing, 2019,
→https://doi.org/10.48550/arXiv.1908.10084. →
[13] Library of Congress, EXIF exchangeable image file format, version 2.2., 2023,
Digital Formats, Accessed October 2023,
→https://www.loc.gov/preservation/digital/formats/fdd/fdd000146.shtml. →
[14] X. Zhou, W. Gong, W. Fu, and F. Du, Application of deep learning in object
detection, 2017 IEEE/ACIS 16th International Conference on Computer
Information Science (ICIS). pp. 631–634, 2017,
→https://doi.org/10.1109/ICIS.2017.7960069. a, b
[15] A. Tsyganov, S. Stoica, M. Babai, V. Soancatl-Aguilar, J. McFarland, G.
Strikwerda, M. Klein, et al., The Research Data Management System at the
University of Groningen (RUG RDMS): Architecture, solution engines and
challenges. iRODS UGM 2021, 2021, →https://irods.org/uploads/2021/Tsyganov-
Stoica-Strikwerda-UnivGroningen-
Research_Data_Management_System_at_Univ_of_Groningen_Architecture_Solutio
n_Engines_and_Challenges-paper.pdf. →
[16] K. D. Winters, M. A. Cowan, G. E. George, M. E. Gonzales, B. Priest, O. Morris,
and J. Landrum, Web-Enabled Interface for iRODS: Comparing Hydroshare and
Metalnx, ERDC/ITL SR-20–9, Vicksburg, MS: US Army Engineer Research and
Development Center., 2020. →
14 Implementation and evaluation of an eXplainable
artificial intelligence to explain the evaluation of an
assessment analytics algorithm for free-text exams
in psychology courses in higher education to attest
QBLM-based competencies

Ramona Srbecky
Franz Bühler
Jörg Schuljak
Simon-Alexander Wetzel
Michael Winterhagen
Wieland Fraas
Jan Dettmers
Matthias Hemmje

Abstract
In the realm of higher education, the exploration of eXplainable artificial
intelligence (XAI), elucidating the mechanisms behind artificial intelligence
systems, becomes increasingly pivotal as institutions seek transparency and
comprehension in leveraging advanced technologies for diverse academic
applications. Based on our previous research, we outlined an automated
approach for generating test and training data for assessing free-text answers in
a digital university course on work organization and design for a natural
language processing (NLP) algorithm, encompassing research methodology,
theoretical background, concept, implementation, evaluation, and future
improvements. This chapter presents the underlying NLP algorithm and the
corresponding XAI component, for which the test and training data was created.
Therefore, we will present the state-of-the-art in the area of XAI and NLP for the
higher education sector. Subsequently, we will present the current state of
technology in our educational system, the Knowledge Management Ecosystem
Portal (KM-EP), and its qualification-based learning model-related subsystems.
The corresponding concepts and proof of concept implementations and
evaluations are presented. The chapter concludes with a summary and
discussion.
Keywords: eXplainable artificial intelligence, XAI, higher education, natural
language processing, NLP, PAGEL, GALA4QBLM, QBLM, assessment analytics,
learning analytics, free-text exams

14.1 Introduction
In the dynamic landscape of higher education, the integration of artificial
intelligence (AI) has emerged as a transformative force, offering innovative
solutions to enhance various facets of academic processes [→1]. As universities
embrace AI technologies to streamline administrative tasks, personalize learning
experiences, and optimize research endeavors, a pressing concern that arises is
the need for transparency and interpretability in AI systems. This concern has
given rise to the burgeoning field of explainable AI (XAI), aimed at demystifying
the opaque nature of AI algorithms and ensuring that their decision-making
processes are comprehensible to stakeholders, including educators,
administrators, and students [→2].

14.1.1 Background and motivation


As part of the “Psychological ArbeitsGestaltung ErLeben” (PAGEL) project [→3], a
digital work task simulation (PAGEL simulation) is to be developed, with the help
of which psychologically relevant characteristics of work design (psychological
stress) can be made tangible [→3]. Building on existing task simulations, a
complex, generalizable work task that is suitable for most working people is to
be digitally simulated within an applied gaming environment, enabling students
to experience and reflect on content-related, organizational, work process-
related, and social characteristics of work design in a playful way when
completing the tasks, to develop proposed solutions, to test their own design
variants, and to analyze the respective effects [→3].
Within the virtual learning environment (VLE) of the Knowledge
Management Ecosystem Portal (KM-EP) [→4] based on the Learning
Management System (LMS) Moodle [→5] and integrated components of Applied
Gaming (AG), an environment for the creation and implementation of the PAGEL
simulation is to be developed, with the help of which the learning and training of
skills can be made tangible as part of a Competence and Qualification (CQ) that
can be certified at skill level. Building on existing digital frameworks of the
Qualification-Based Learning Model (QBLM) [→6], a hybrid, generalizable, and
connectable learning, simulation, training, and assessment environment for the
performance of work tasks and the evaluation of the skill levels achieved in the
process will be researched in the tandem-project “Gaming Analytics and
Learning Analytics for Qualifications-Based Learning Model” (GALA4QBLM) and
developed, based on methods of Gaming Analytics (GA) and Learning Analytics
(LA) [→7]. The training and assessment of the skills required to perform the
work tasks are simulated within a hybrid combination of LMS and AG
environment, which enables learners to train on skills playfully while completing
the tasks and to have the achievement of learning and training goals evaluated
online and finally assessed. This work task simulation process, including
evaluation, assessment, reflection, and feedback, creates a context in which
otherwise passive, declarative knowledge can be further developed into action
skills through personal experience and application of this knowledge. At the
same time, understanding how to achieve learning and training objectives can
be deepened in a way that is hardly possible with existing written or audio-visual
formats [→7].
The research project aims to use machine learning (ML) methods such as
natural language processing (NLP) and XAI to certify Competencies and
Qualifications (CQs) in the context of free-text responses in higher education for
the PAGEL application scenario. Preparatory work for XAI has already been
carried out in the KM-EP as part of earlier work [→8]. In our previous work, we
presented an approach on how to generate test and training data for an NLP
algorithm to asses free text answers for a digital university course on the topic
of work organization and work design in an automated approach, without actual
data from the learners’ as test and training data. This chapter focuses on
selecting the NLP algorithm and the appropriate XAI component for our
application area, the PAGEL course. In this chapter, we also describe the
evaluation of the NLP algorithm using the test and training data, as well as the
evaluation of the XAI algorithm.

14.1.2 Problem statements and research questions


The following problem statements (PSs) can be derived from the introduction
and background.

PS1: Currently, there is no possibility of automatically
evaluating free-text responses recorded as part of the PAGEL course and
attesting certain CQs. There is no corresponding evaluation and
attestation component in the KM-EP.
PS2: Instructors may need to be able to understand and explain the
assessment in the context of student complaints. Therefore, instructors
must be able to understand the assessments from the automated
assessment system and explain the solution. Currently, however, no
corresponding XAI component is available in the KM-EP
to explain assessments from an NLP algorithm.
PS3: Currently, the KM-EP does
not provide the necessary hardware resources for NLP and XAI
algorithms. Furthermore, no software solution exists for
integrating the NLP and XAI algorithms into the KM-EP.

Based on the PSs, the following research questions (RQs) can be derived:

RQ1 for PS1: How can an automated solution evaluation and attestation
of CQs for free-text answers using NLP be implemented for the
application scenario PAGEL in the KM-EP?
RQ2 for PS2: How can a result of the NLP algorithm be explained by
applying XAI in the context of the PAGEL project in the KM-EP?
RQ3 for PS3: How can the NLP and XAI components be integrated into the
KM-EP?

14.1.3 Research approach


The Nunamaker [→9] methodology was selected for the research, comprising
four key phases: the Observation Phase (OP), which involves gaining knowledge
in an unfamiliar research domain using case studies and surveys. The theory
building phase (TBP) follows, where new ideas and conceptual frameworks are
developed. The third phase is the system development phase (SDP), including
prototyping to demonstrate feasibility, leading to technology transfer, if
successful. The fourth phase, experimentation (EP), validates the system
through laboratory and field experiments to enhance its performance. Based on
the RQs and the research methodology, corresponding research objectives (ROs)
are subsequently derived for each RQ and each phase from the research
methodology. These ROs are numbered consecutively according to the following
notation and referenced in the consecutive text:
RO.<phase>.RQ<number>.<consecutive numbering>

Here, <phase> is the abbreviation of the respective phase of the research
methodology, <number> is the number of the respective RQ, and <consecutive
numbering> uniquely identifies each RO per phase of the research methodology
and RQ.
ROs for RQ1:
RO.OP.RQ1.1: Researching the current state-of-the-art NLP algorithms and
methods.
RO.TBP.RQ1.2: Concept for the NLP algorithm for evaluating free-text
answers in the PAGEL course in the KM-EP.
RO.SDP.RQ1.3: Realization of the concept from RO.TBP.RQ1.2.
RO.EP.RQ1.4: Evaluation of the NLP algorithm developed in RO.SDP.RQ1.3.

ROs for RQ2:


RO.OP.RQ2.1: Researching the current state-of-the-art XAI methods and
tools.
RO.TBP.RQ2.2: Concept for the XAI component for the results of the NLP
algorithm for RQ1 in the KM-EP.
RO.SDP.RQ2.3: Realization of the concept from RO.TBP.RQ2.2.
RO.EP.RQ2.4: Evaluation of the XAI algorithm developed in RO.SDP.RQ2.3.

ROs for RQ3:


RO.OP.RQ3.1: Research the current state-of-the-art and technology of KM-
EP concerning using NLP and XAI algorithms.
RO.TBP.RQ3.2: Concept for integrating the XAI and NLP components into
the KM-EP.
RO.SDP.RQ3.3: Realization of the concept from RO.TBP.RQ3.2.
RO.EP.RQ3.4: Evaluation of the development of RO.SDP.RQ3.3.

The structure of the chapter follows the order of the ROs. This means that in the
state-of-the-art section (Section 14.2), the ROs for the OP are described. In the
Conceptual Design section (Section 14.3), the ROs for the TBP are described. The
ROs for the SDP are explained in Section 14.4, and in Section 14.5, the ROs of the EP
are shown. The chapter concludes with a summary and discussion of further
topics.
Because the research took place at a German university, several diagrams are
partly written and presented in German.

14.2 State-of-the-art and technology


In this part of the chapter, RO.OP.RQ1.1, RO.OP.RQ2.1, and RO.OP.RQ3.1, as part
of the OP, will be further described.

14.2.1 NLP algorithms


In this subsection, we will introduce different approaches, tools, and
methods in the area of ML and especially NLP. In order to be able to process a
text automatically, it must be converted into an appropriate format. This process
is called normalization. Normalizing texts usually involves at least three
activities. These are tokenization, normalization of words, and segmentation of
sentences [→10].
Part of speech (POS) tagging [→10] is about finding a suitable tag for each
token, also taking neighboring words into account [→10]. A tag is additional information,
for example, that a word is an adjective [→10]. The tags indicate the function of a word in a
sentence. They can be used to improve the analysis of texts, for example, in
lemmatization [→10]. The tags to be used are defined in a tag set [→10]. The
Universal Dependencies (UD) tagset [→11] is widespread and relatively
straightforward. With POS tagging, it can be determined that words such as
Peter, Berlin, and Apple are nouns. However, they are also a name, a place, and
an organization. Such a distinction is often relevant, especially when recognizing
specific terms, such as product names, in an application. This is why the area of
Named Entity Recognition (NER) [→10] exists. This allows key information to be
extracted from a text to understand it better. Everything that has a name by
which it can be addressed is referred to as a named entity (NE). Common types of NE are
person, organization, place, date, time, amount of money, percentage, building,
and geopolitical entity [→10]. However, other types can also be defined,
depending on the intended use, such as disease for a medical application.
Tokens are tagged with a specific type to identify them as NE [→10].
Most ML algorithms cannot work directly with text [→10]. Text data must,
therefore, be converted into numbers [→12]. A simple approach for this is called
Bag of Words [→13]. Each word is assigned a fixed number, and the number of
times a word or term occurs is counted in vectors (term frequency) [→12]. A
vector is created for each text or document to be checked for similarity [→12].
This approach is suitable for classifying a document as a salary statement or
invoice, for example [→12]. The similarity of two text vectors can be determined
using cosine similarity. The word embedding approach [→12] is a way to encode
the meaning of a word [→12]. For this purpose, a vector is created for each
word. This has a fixed size and contains the context of a word [→12]. The context
contains, for example, synonyms, variations of a word, and words with a close
connection, but it also contains opposites or words that explicitly have nothing
to do with the source word [→12]. The individual words are each given a
weighting. Words that have nothing to do with the source word are given a
negative value. Pretrained word embeddings, such as Word2Vec [→14] from
Google and Stanford’s GloVe [→15], which were trained on a large amount of
data [→16], can be used for this purpose. These word vectors can be used in a
variety of ways. For example, if the vector for the word man is subtracted from
the vector for the word king and the vector for the word woman is added, the
result is close to the vector for the word queen.
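As a small, non-authoritative illustration of the bag-of-words representation and cosine similarity described above, using scikit-learn and invented example documents:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The invoice lists the amount due for the delivered goods.",
    "This invoice states the total amount payable.",
    "The salary statement shows the monthly gross pay.",
]
vectors = CountVectorizer().fit_transform(docs)    # term-frequency vectors, one per document
print(cosine_similarity(vectors[0], vectors[1]))   # higher: both documents are invoices
print(cosine_similarity(vectors[0], vectors[2]))   # lower: different document type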
At the time of writing, the best results in NLP are achieved with models
based on the transformer architecture [→17, →18, →19]. This is the architecture
of a neural network (NN) [→20] that was developed specifically for the field of
NLP [→21]. This architecture makes it possible to include the context in which a
word is used when creating word vectors and to assign a corresponding vector
depending on this [→17]. Based on an attention mechanism [→17], essential
parts of an input sequence can be emphasized and the rest hidden [→17]. One
of these models is called bidirectional encoder representations from
transformers (BERT) [→22]. It is one of the pretrained language models [→23]
trained on large amounts of data, with much computing power [→21]. They can
be used in various ways, sometimes even for NLP tasks, for which they were not
explicitly trained (transfer learning) [→24]. Alternatively, they can be fine-tuned
for tasks with little effort [→24]. BERT is the first unsupervised, fully bidirectional
system for NLP pretraining [→22]. Unsupervised here means that the text used
does not have to be prepared beforehand and is an essential feature of BERT
due to the large amount of data with which it was trained [→22]. Bidirectional
here means that not only the words to the left of a word in the sentence are
considered as context, as is the case with predecessor models. Instead, the
words to the right of a word are also considered as its context [→22]. For
training, 15% of the words are masked, i.e., replaced by a placeholder [→22]. An
attempt is then made to predict the missing words using the bidirectional
context [→22]. The system also learns how individual sentences relate to each
other [→22]. For two sentences, A and B, the system tries to predict whether
sentence B directly follows sentence A or whether sentence B is a randomly
selected sentence from the text corpus [→22]. In this way, coherence can be
traced in different sentences. For example, if the name Peter appears in the first
sentence and the person is referred to in the second sentence using “he” [→22].
After the model has been trained and provided, it can be adjusted for the
respective purpose (e.g., in the medical field), with significantly less effort in the
second phase of fine-tuning [→22]. The obvious application of a model such as
BERT is the classification of sentences and tokens [→22]. However, after fine-
tuning, it can also be used for many other tasks [→25]. For example, for
answering questions based on a text or for summarizing texts [→25].
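A brief, hedged illustration of BERT's masked-word prediction with the Hugging Face transformers pipeline follows; the model name and example sentence are only illustrative choices:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# Predict plausible replacements for the masked token from its bidirectional context.
for candidate in fill_mask("Peter lives in Berlin, so [MASK] knows the city well."):
    print(candidate["token_str"], round(candidate["score"], 3))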
The model’s performance can be further improved by extending BERT with
more parameters and a larger training corpus. However, this further increases the
model size and the computing power required. The approach of
Efficiently Learning an Encoder that Classifies Token Replacements Accurately
(ELECTRA) therefore takes a different route: building on BERT, it attempts to improve its efficiency [→26].
In the approach from [→26], individual words are replaced by plausible
alternatives for training instead of masking them. A small language model is
used to generate these words [→27]. In the next step, the model is trained by
predicting whether each word has been replaced. Thus, all words are used for
training, not only 15%, as in BERT [→22]. An ELECTRA model that delivers
comparable results to XLNet [→28] and RoBERTa [→29] (state-of-the-art at the
time of publication of ELECTRA) requires only 25% of the computing power for
training [→26]. The increased computational efficiency of ELECTRA lowers the
costs for NLP applications and thus also enables its use in projects with a small
budget or little computing power [→26]. With the language models presented,
the meaning of individual words can be coded, depending on the context. This
also enables a semantic comparison of different words. However, a semantic
comparison of longer sequences, such as sentences, can be very time-
consuming [→30]. According to [→30], it takes about 65 h to find the most
similar pair in a collection of 10,000 sentences with BERT. With Sentence-BERT
(SBERT) [→30], the time required for this can be reduced to 5 s, without
sacrificing accuracy [→30]. The idea is to go one abstraction level further up and
map the meaning of sentences instead of individual words in vectors [→30].
SBERT is a modification and fine-tuning of the BERT model [→22]. For this
purpose, two identical BERT models are combined to form a so-called Siamese
network [→30]. The BERT layer receives a sentence as input, processed as
before. The resulting word vectors are combined in a unifying layer to form a
fixed-size sentence vector [→30]. There are several strategies for this. For
example, the mean value of the word vectors can be used [→30]. During fine-
tuning with a corresponding data set, the individual weights are adjusted to
optimize the sentence vectors [→30]. Subsequently, the cosine similarity
between two sentence vectors can be calculated to determine the semantic
similarity.
In NLP, repetitive activities are performed, independent of the use case, such
as the normalization of texts. Some tools provide these functionalities to avoid
re-implementing these activities for each use case [→31]. For our case, we are
already using spaCy. The open-source library spaCy [→32] makes advanced NLP
methods available in Python [→33]. Over 64 languages are supported, and getting started
is easier with pretrained pipelines and example projects [→34]. Pretrained word
vectors can be used as the basis for the pipelines [→34]. Alternatively, pretrained
transformer models such as BERT can also be used [→34]. Although these are
more computationally intensive, they are also more accurate [→34]. This is
because they can also consider the context in which a word is used [→35].
Components for NER, POS tagging, lemmatization, segmentation of sentences,
and tokenization are available for pipelines [→36]. In addition, custom
components can be developed and added to a pipeline [→36]. In a typical
pipeline diagram, arrows indicate that individual components build on the results of the previous
components [→37]. The order of the components is, therefore, important [→37].
Separate word vectors can also be used for the NER component
[→37]. This makes the component independent of the rest of the pipeline [→37].
This increases the modularity of the pipeline but makes the model larger and
slower [→38]. Therefore, the advantages and disadvantages must be weighed
based on the intended use [→38].
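A short, hedged spaCy example covering the pipeline steps mentioned above (tokenization, POS tagging, lemmatization, and NER); the model name and input sentence are only examples:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Peter moved to Berlin and joined Apple in 2021.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)   # POS tag and lemma for each token
for ent in doc.ents:
    print(ent.text, ent.label_)                   # e.g., Peter PERSON, Berlin GPE, Apple ORG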

14.2.2 Explainable Artificial Intelligence (XAI) in educational context


In order to achieve RO.OP.RQ2.1, this chapter examines how the solution
evaluation can be implemented comprehensibly using an evaluation system. To
this end, we first introduce how an evaluation scheme can be created. Various
XAI methods are then described, which should make it possible to understand
how a text-based black box model works.
Assessment/evaluation schemes are used in the manual solution
assessment of performance assessments to ensure consistency and objectivity
[→39]. An evaluation scheme aims to determine which aspects of a solution are
included in the evaluation and to what extent [→39]. Therefore, using an
evaluation scheme increases the comprehensibility and consistency of an
evaluation, as the evaluation was determined using objective criteria [→39]. The
instructions from [→39] are presented for creating an evaluation scheme. First,
an optimal sample solution for a question is created. Then, it is determined how
many evaluation categories there should be for a task, for example, correct and
incorrect [→39]. This depends on how many categories can be meaningfully
differentiated [→39]. Next, the conditions that need to be fulfilled to achieve the
individual levels are determined. For example, a certain minimum number of
points may be required for each level. Anchor examples can then be created for
the individual levels to concretize the assignment rules [→39]. Finally, rules for
awarding points are defined [→39].
In large ML models with many parameters and layers, as presented, it is not
easy to understand how a prediction was made [→40]. The models become a
black box that is impossible to interpret [→40]. However, it is essential to
understand how a model works in order to understand the results. This makes it
easier to optimize a model. Comprehensible decisions also help to create trust
among end users.
For this reason, this chapter presents various methods for XAI that should
make it possible to understand how a text-based black box model works. One
XAI method that can be used to explain black box models is called LIME [→41].
LIME stands for “Local Interpretable Model-Agnostic Explanation” [→41]. Here,
local means that the method is used to explain individual predictions and not the
functioning of the entire model (global) [→42]. It is a mathematical approach to
explain predictions of classification models approximately. It does not matter
which ML algorithm (model-agnostic) was used [→42]. LIME works as follows:
Different permutations are created from the data of a selected prediction, which
are evaluated by the black box [→41]. A preselected number of features is then
weighted, depending on how close the results are to the initial prediction [→41].
Using these weights, an attempt is made to find a simple model that delivers
similar results [→41]. For example, decision trees [→43] or logistic regression
[→44] are used. These results can then be visualized in a comprehensible way
[→45]. Consider, for example, a model that is used to determine whether a text
describes a disaster. In the resulting LIME visualization, it can be seen how individual
words are weighted and how strongly they influence the final
result. According to the model, words marked in blue speak against
classification as a disaster, and words marked in orange speak in favor of
classification as a disaster. For example, you can see that the word “killed” was
given the highest weighting of 0.14 by the model. Based on the weighting of the
individual words, the model predicts that the text is about a disaster (89%
probability) [→42].
However, LIME can provide different explanations for the same case
depending on the permutations generated [→42].
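The following sketch shows how such a local LIME explanation for a text classifier could be produced; the stand-in scikit-learn classifier and the tiny training set are assumptions made purely for illustration, not the assessment model of this project:

from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in black-box classifier (disaster vs. no disaster) trained on toy data.
train_texts = ["thousands killed by the earthquake", "what a lovely sunny day"]
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, [1, 0])  # 1 = disaster, 0 = no disaster

explainer = LimeTextExplainer(class_names=["no disaster", "disaster"])
explanation = explainer.explain_instance(
    "many people were killed in the flood",
    classifier.predict_proba,   # black-box prediction function
    num_features=5,
)
print(explanation.as_list())    # word weights for/against the predicted class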
Another XAI method is SHAP [→46], which stands for Shapley Additive
exPlanations [→47]. The approach is also model-agnostic [→46]. However,
model-specific implementations exist for specific areas, such as transformers
[→47]. These have a lower runtime than the general implementation and are
preferable, if possible [→47]. SHAP is based on game theory [→48]. Predictions
are calculated with all possible combinations of features in order to determine
their influence on the result [→46]. This enables a consistent local explanation of
individual results [→46]. To this end, SHAP offers various options for graphically
displaying the influence of one or more features [→47]. A typical example is the explanation
of a sentiment classification of a film review, where the review text is shown
in the lower section of the visualization [→47]. Text passages marked in red indicate a classification
as positive [→47]. A darker color indicates that this section has a stronger
influence on the decision than a lighter one [→47]. Areas marked in blue are
rated as factors for a negative evaluation [→47]. The individual text sections are
listed according to their influence [→47]. The approach of Teaching AI to Explain
its Decisions (TED) [→49] also explains a model at the local level [→49]. However,
in contrast to LIME and SHAP, it is a self-explanatory method
[→49]. A model is, therefore, trained in such a way that, in addition to a
prediction for a data set, it also explains it [→49]. For this purpose, an
explanation is added to the training data, in addition to parameters and a
solution value [→49]. However, this approach is only suitable for classification
problems with a fixed set of predictable classes and, therefore, also for
explanations [→49]. The open-source library AI Explainability 360 (AIX360) [→50]
is a collection of methods developed by IBM [→51] for the explainability and
interpretability of ML models [→52]. The library also contains the methods
presented above [→53]. AIX360 can be used to explain different dimensions of a
model [→53]. This allows visualizations to be created for different use cases
[→52].

14.2.3 Learning environment of the knowledge management ecosystem portal and QBLM-related components

In the context of our research, we initially limit the scope to the VLE Knowledge
Management Ecosystem Portal (KM-EP) [→4]. Therefore, the KM-EP and its VLE
components are described in the following.
The KM-EP is a knowledge and content management environment with five
subsystems. The relevant subsystem for this work is the education system,
consisting of LMS Moodle and Course Authoring Tool (CAT) [→54]. The CAT is
used for creating Moodle-based courses, activities, and work materials. Courses
are subdivided into sections, and within each of them, individual activities are
located. Possible activities include file downloads, multiple-choice quizzes,
forums, and surveys. Another part of the KM-EP subsystem is the Qualifications-Based Learning Model (QBLM) [→6], which allows machine-readable documentation of competencies and qualifications.
Acquired competencies and qualifications can be displayed to students as a
Competencies and Qualifications Profile (CQP). For this purpose, a
corresponding Moodle plugin for managing QBLM CQs was implemented as
part of the work [→6]. Since students primarily interact with the LMS Moodle and
behavior-based data can be tracked there [→29], the possibilities of learner
interaction in the LMS Moodle will be discussed below. The Moodle LMS defines
itself as a “software package to develop and deliver Internet-based courses. It is
a global software development project, supporting a constructivist teaching and
learning approach” [→5]. Regarding its functionalities, Moodle defines itself as a
“free online learning management system that allows instructors to provide web
pages with dynamic learning content, enabling time and location-independent
learning” [→55].
Instructors use Moodle to create and manage courses with static and
dynamic learning content. Dynamic elements, known as activities, facilitate
interactions like chats and discussions. Currently, 23 activities and materials are
available. Commonly used activities include forums, glossaries, and tests.
Moodle offers a variety of activities and materials, including quizzes with
automatic scoring and feedback [→56].
After completing tasks in the PAGEL course, a CQ should be stored in the
player’s CQP [→57], for example, after completing the final quiz [→57]. In [→58],
a possibility for mapping between player successes and the achieved CQs was
developed. This will be explained using an example from [→58]. In this example,
the area of the blue square is to be calculated [→58]. After this solution has been
submitted, the procedure for achieving the solution should be indicated [→58]. If
the area is calculated correctly, the CQ is achieved [→58]. The proficiency level
(PL) with which the CQ is achieved depends on the answer to the second
question. The selection of option 2 indicates that the learner cannot apply the
learning content (calculation of area) [→58].
Even if this method can lead to the correct result, it is not flexible and should,
therefore, be rated lower than the other options [→58]. If the learner chooses
option one instead, then it can be concluded that he recognizes familiar facts in
the representation and can apply them to the figure [→58]. Option 3 also shows
that the learner can mentally expand the figure into a rectangle [→58]. Option 3,
therefore, represents the highest level [→58]. Accordingly, the individual options
can be assigned to different PLs of a CQ [→58].
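To make this mapping concrete, the following is a minimal sketch of how the selected option could be mapped to a proficiency level; the option numbers and PL values are illustrative assumptions and are not taken from [→58]:

from typing import Optional

# Illustrative mapping of the follow-up answer (how the solution was obtained)
# to the proficiency level (PL) of the achieved CQ.
OPTION_TO_PROFICIENCY = {
    2: 1,  # method works but is inflexible -> lowest PL
    1: 2,  # learner recognizes familiar facts and applies them to the figure
    3: 3,  # learner mentally extends the figure to a rectangle -> highest PL
}

def assign_proficiency(area_correct: bool, chosen_option: int) -> Optional[int]:
    """Return the PL of the achieved CQ, or None if the CQ was not achieved."""
    if not area_correct:
        return None
    return OPTION_TO_PROFICIENCY.get(chosen_option)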
A server instance with the current version of the KM-EP was made available
to investigate the hardware requirements for the NLP and XAI algorithms. This
server instance is a virtual machine. The computer on which the virtual machine
is operated does not have a graphical processing unit (GPU). However, a GPU is
typically needed for more prominent ML algorithms [→59]. This server instance
can be logged into via Secure Shell (SSH). After logging in, an output shows that
this is the Ubuntu operating system, version 16.04.7. The standard support for
this version is until April 2021. With a subscription, security updates are still
available for up to five years [→60]. The command apt list --installed | grep python lists all installed Python packages. The highest installed version is Python 3.5. The command apt search python3.9 searches the configured package sources for a Python 3.9 installation; no Python 3.9 package is available there. A possible
solution to the problem is to download the source files for Python and compile
them. Instructions for this are available in [→61]. However, installing packages on the system is not possible because a dependency conflict exists with the installed Neo4j software, which is present in version 1.x, while the latest version is 5.9 [→62].
Official instructions for migrating from version 1.x to a higher version no longer
exist [→63]. After removing the Neo4j program, no errors were found in the
installed applications.

14.3 Concepts
In this section, RO.TBP.RQ1.2, RO.TBP.RQ2.2, and RO.TBP.RQ3.3, as part of the
TBP, will be further described. Therefore, the application scenario will be initially
explained. Afterward, the use cases, data models, and a system overview, based
on the results of [→73,→74], with the to-be-developed components will be given.
The section will be closed with a short description of the potential integration
mechanism of the NLP and XAI components into the KM-EP as part of the
chapter system overview. The section concludes by describing the concepts and
pipelines for the NLP and XAI algorithms. In the remainder of this chapter, the
term solution evaluation system is used as a generic term for the NLP and XAI
components. Based on the previous OP, the NLP algorithm will be based on a
BERT model since transformer models are the current state-of-the-art. We will use SHAP for the XAI component because model-specific implementations for specific areas, such as transformers, are available.

14.3.1 Application scenario and use cases


The student is the end user of the PAGEL simulation. He asks questions to
fictitious colleagues via the chat tool. These questions are sent to the solution
evaluation system, where the content of the question is determined. A suitable
answer is determined and sent back using the product information from the
product library. The student can also use the chat tool to answer questions from
fictitious colleagues. The answer is sent with the question to the solution
evaluation system, which checks it. A response is then generated, which can be
used to select a suitable response in the chat tool. The student also answers the
questions in the final quiz. One use case here is to justify the personal
assessment of the game parameters. Another use case describes which aspects
are essential for a game parameter to contribute to good working conditions.
Both cases evaluate the solutions using sample solutions and search terms
(keywords) from the authoring tool. Depending on the evaluation, a CQ is then
assigned.
The lecturer maintains the required texts to use the PAGEL simulation. They
maintain the individual product information in the product library, which the
solution evaluation system retrieves as required. The authoring tool is used to maintain the sample solutions and the keywords that are used to evaluate the solutions from the quiz. The evaluation scheme for the final quiz solutions is maintained in the configuration tool. The lecturer also creates sample data that serves as the basis
for creating T&T data for the solution evaluation system.
A developer is responsible for implementing the solution evaluation system. He generates Test & Training (T&T) data based on the sample data. The solution
evaluation system is then trained and tested with this data. The developer also
implements the evaluation scheme that determines how the quiz solutions are
evaluated. Both the developer and the lecturer look at the explanations for
individual assessments. This allows developers to understand whether the
solution evaluation system works as intended. In other words, whether the
evaluations are consistent, follow the evaluation scheme, and are
comprehensible. The lecturer can use it to check the correctness of individual
students’ assessments.
For the ROs for the TBP, the following use cases are considered. First, we will
consider the use cases and users that are relevant for the evaluation by the
solution evaluation system. The student can use the chat tool to ask questions to
fictitious colleagues or answer questions from fictitious colleagues. In the first
case, the solution evaluation system should determine the answers to a
question; in the second case, it should check whether the answers are correct. In
both cases, NEs are to be recognized in a question. This allows the required
information to be read from the product library to generate suitable feedback. In
the final quiz, the student is asked to justify their assessment of the game
parameters. There can be no standard solution as many different solutions can
be correct. The system should recognize how many searched terms (keywords)
are contained in a solution. In the final quiz, the student should also describe
how the game parameters should be developed to contribute to optimal
working conditions. The solution evaluation system should also recognize the
keywords in a solution and perform a semantic comparison with a sample
solution. The sample solution and keywords can be retrieved from the authoring
tool. The following use cases are considered explicitly for the explainability of the
solution evaluation. The lecturer maintains an evaluation scheme in the
configuration tool. He can also track the evaluations of the solution evaluation
system and thus check the correctness of the evaluations of individual students.
A developer implements the evaluation scheme in the solution evaluation
system to carry out the evaluation accordingly. By explaining the assessments,
the developer can check whether the solution evaluation system works as
intended. In other words, whether the evaluations are consistent,
comprehensible, and carried out according to the evaluation scheme.
14.3.2 Data model
The relationship between the data used in this work is explained in this section.
First, the data model for the chat tool as shown in →figure 14.1 is described. A
chat request can contain several NEs, and NEs can occur in several chat requests.
An NE can label several T&T data records for NER. A T&T NER record can contain
multiple NEs. An NE identifies a product detail, which several NEs can identify. A
product consists of several product details. A product detail can also be present
in several products. A chat response contains product details, which can be
present in several responses.

Figure 14.1: Data model for the chat tool.

Next, the data schema for the final quiz, as shown in →figure 14.2, will be
explained. A quiz request contains several quiz answers, and a quiz answer
belongs to a quiz request. A quiz answer answers a question, whereas several
quiz answers can answer a question. A question is assigned to a game
parameter. A game parameter can be assigned to several questions. A question
is also assigned a sample solution and several keywords. A sample solution and
a keyword are only assigned to one question at a time. A T&T data set for
comparing semantic similarity also contains a sample solution and a solution for
the points achieved. A sample solution can be contained in several T&T data sets.
A game parameter labels a T&T data set for classification. A game parameter can
label several T&T data sets. A T&T data record for keyword recognition contains a solution and a game parameter and is labeled by several keywords.

Figure 14.2: Final quiz data scheme.

Game parameters and keywords can occur in several T&T data sets for keyword
recognition. A rating is based on a rating scheme, several sample solutions, and
several keywords. A rating evaluates a quiz query. The scoring scheme, model
solution, and keyword can be used in multiple scores [→73, →74].
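To illustrate the final quiz data model, the following is a minimal sketch using Python dataclasses; the class and attribute names are chosen for illustration and do not mirror the exact entities of the KM-EP implementation:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Question:
    question_id: str
    game_parameter: str            # each question is assigned to one game parameter
    sample_solution: str = ""      # optional sample solution
    keywords: List[str] = field(default_factory=list)

@dataclass
class QuizAnswer:
    question_id: str
    text: str                      # the student's free-text solution

@dataclass
class QuizRequest:
    user_id: str
    answers: List[QuizAnswer]      # a quiz request contains several quiz answers

@dataclass
class Rating:
    question_id: str
    found_keywords: List[str]
    keyword_points: int
    similarity_score: float
    similarity_points: int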
14.3.3 System overview
The solution evaluation system to be developed should interact with the PAGEL
simulation and the associated applications. All applications described here are
part of the KM-EP. The full system overview with all the components mentioned
in the text is shown in →figure 14.3. The PAGEL simulation will be called up from
the LMS Moodle via the LTI interface. The chat tool and the library in which the
product data is stored are part of the PAGEL simulation. In the context of this
work, the product library is simulated by a file to which the solution evaluation
system has access. Later, the solution evaluation system will access the product
data via an interface of the product library. Access is required to check the
answers from the chat tool for correctness or to answer the questions from the
chat tool. Questions from the fictitious colleagues come from a database to
which the chat tool has access. The chat sends a player’s questions and solutions
to the solution evaluation system. Depending on the case, an answer to the
question or feedback on the correctness of the solution is sent back from there.
Once the PAGEL simulation has been completed, the final quiz is called up,
where free-text answers are sent to the solution evaluation system, together
with the tasks.

Figure 14.3: System overview.


These are evaluated with the sample solutions and keywords from the authoring
tool. As the authoring tool has not yet been implemented, the sample solutions
and keywords are stored in a file to which the free-text analysis has access. How
the evaluation is carried out is defined in an evaluation scheme. This can be
maintained via the configuration tool. As the configuration tool has not yet been
implemented during this work, an evaluation scheme is created in the solution
evaluation system. After the solutions have been evaluated, a CQ mapping is
carried out. The corresponding CQ can then be stored in the player’s CQP. The
CQ is not yet transmitted to the CQP. The interface to the CQP currently uses a
player’s e-mail address to identify them. As this is not necessarily unique, the CQP interface will be adapted, as part of another final thesis, to use a unique ID for the assignment instead.
The various ML models are trained and evaluated using the T&T data. These
models are used in the pipelines. A pipeline can use several models, and a model
can be used in several pipelines. There is a chat pipeline for processing chat
requests and a quiz pipeline for processing quiz requests. A chat request is
processed by a chat pipeline, which can process several requests. The chat
pipeline determines what was requested and queries the corresponding product
detail from the product library. A chat response with the requested product
detail is then created. Quiz requests are processed in the quiz pipeline. A quiz
request can contain several solutions, each containing an answer to a question.
A quiz solution is assigned to precisely one quiz request. A scoring scheme is
applied to the keywords, and similarity scores are determined via the quiz
pipeline. A CQ is assigned to a rating in a CQ mapping. An ExplainService then
explains how a rating was arrived at.
A new Python version must be installed to run the NLP and XAI solution in
KM-EP. To do this, the dependencies analyzed from the OP must first be
removed. A new Python version should be installed, and the remaining tools
should be updated and reinstalled.

14.3.4 Design of the NLP and XAI algorithms


In this section, we will first describe the design of the NLP algorithm, and
afterward, we describe our conceptual considerations for the XAI algorithm. This
part of the subchapter describes how the NLP pipelines for solution evaluation
should be structured. Since spaCy was selected as a suitable NLP tool in the OP,
the pipelines are designed based on its structure. There are different
requirements for the solution evaluation system to process the texts from the
chat tool and the final quiz. Therefore, separate pipelines are developed for
each.
For queries from the chat tool, a question is used to determine what was
asked. The correct solution is then determined from a product library and either
returned or compared with a player’s solution. To do this, it is sufficient to check
a question’s content for the occurrence of predefined words. These can be the
individual product names and the names of the product features. For example, if
a question contains the product name “Spector” and the product property
“Product number,” the matching value can be searched in the product library. A
small pipeline is necessary for processing. First, the text is converted into
individual tokens. The extracted tokens are then used in an NER component to
identify the entities being searched for. The NER component is first trained with
the entities being searched for to do this. These NEs can read the required
product information from the product library. If there is no answer in the chat
input, i.e., it is a question, a response with the product information searched for
can be generated directly. The answer is compared with the product information
to determine an available answer. A response is then generated, indicating
whether the answer is correct.
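A minimal sketch of this chat pipeline idea with spaCy is shown below; the entity labels (PRODUCT, PROPERTY) and the product library structure are illustrative assumptions, and the model path follows the project layout described later:

import spacy

# NER model trained on product names and product properties.
nlp = spacy.load("ner/training/model-best")

# Illustrative stand-in for the product library file.
product_library = {
    ("Spector", "Product number"): "PN-4711",
}

def answer_chat_question(question: str) -> str:
    doc = nlp(question)
    # Collect recognized entities, e.g. product name and product property.
    entities = {ent.label_: ent.text for ent in doc.ents}
    key = (entities.get("PRODUCT"), entities.get("PROPERTY"))
    value = product_library.get(key)
    if value is None:
        return "Sorry, I could not find that product detail."
    return f"The {key[1]} of {key[0]} is {value}."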
For the pipeline for the final quiz, as shown in →figure 14.4, the text is first
divided into tokens again. The text is then checked for the presence of search
terms. However, the keywords in the final quiz are not NE. For example, the
following keywords are searched for the game parameter role ambiguity:
Unclear✶/ contradictory✶/ ambiguous✶/ uncertain✶/ not specific/ not
transparent/ not unambiguous✶/ imprecise/ not precise/ imprecise/ not exact/
none✶, goal/ tasks/ evaluation/ criteria/ guidance/ instruction/ explanation/
description/ instruction. They are, therefore, not objects with a unique name.
Keywords can occur in different forms, for example, through differently
conjugated verbs. This is indicated in the terms by the symbol ✶. Due to this
variability, NER cannot be used sensibly. Therefore, POS tagging takes place
after tokenization to add POS annotations to the tokens as additional attributes.
These POS tags are used for lemmatization to find the correct basic form of the
tokens.
Figure 14.4: Concept for the final quiz pipeline.

The primary forms are determined for the keywords and solution texts to
simplify the comparison. This is used to determine which keywords are
contained in the solution texts. If there is no sample solution for a solution, the
evaluation can then take place based on the keywords. If there is a sample
solution, sentence vectors are then created from the solutions. These are used
to carry out a semantic comparison between a student solution and a sample
solution. The resulting cosine similarity is used together with the keywords for
evaluation. A CQ can then be assigned based on the ratings. The solutions are
also categorized and assigned to one of the game parameters to make the
semantic evaluation more comprehensible.
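A minimal sketch of the lemmatization-based keyword check described above could look as follows; the spaCy model name and the example keywords are illustrative assumptions:

import spacy

# A German pipeline with tagger and lemmatizer (model name is an assumption).
nlp = spacy.load("de_core_news_sm")

def count_keywords(solution_text: str, keyword_lemmas: set) -> int:
    """Count how many of the searched keyword lemmas occur in the solution."""
    doc = nlp(solution_text)
    # POS tags are assigned by the tagger and used by the lemmatizer, so
    # comparing lemmas abstracts away differently inflected word forms.
    solution_lemmas = {token.lemma_.lower() for token in doc}
    return len(keyword_lemmas & solution_lemmas)

# Illustrative usage with made-up keywords for role ambiguity:
found = count_keywords("Die Aufgaben waren unklar beschrieben.",
                       {"unklar", "aufgabe", "beschreibung"})
print(found)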
In the next part, the evaluation scheme is designed, with which the solutions
from the final quiz are to be evaluated. The evaluation scheme is based on the
template from the OP.
A distinction is made between the two cases. In the first case, players justify
their assessment of the experienced game parameters. In this case, the game
parameters can vary with each round and be perceived differently by each
player. For example, on a scale of one to five, the same game parameter can be described differently by two players, for instance, if one player perceives it as a two and the other as a five.
For this reason, answers with very different content may have to be assessed
as correct, and no model solution can be used for assessment. Even if a model
solution is provided for each of the five possible levels of a game parameter, it is
only possible to check whether the content of an answer matches one of the
model solutions. In addition, an answer can also be a mixture of two sample
solutions, as there is no fixed definition of the characteristic levels.
Therefore, the evaluation in this case should be based on the relevant terms
for the respective game parameter. An answer should be considered entirely
correct if at least three keywords are included. One point should be awarded for
each correctly named keyword. A subdivision into individual evaluation
categories is not required, as no marks or similar are to be awarded for the
solutions. The evaluation is done by awarding points. The principle is illustrated
in →Table 14.1 using the role ambiguity game parameter example. In the second
case, players should describe how the game should be designed to create good
working conditions for the individual game parameters.

Table 14.1: Example of evaluation scheme for keywords for role ambiguity.
Keywords for role ambiguity: Unclear/ contradictory/ ambiguous/ uncertain/ not certain/ not transparent/ not unambiguous/ imprecise/ not precise/ inaccurate/ not exact/ no, goal/ tasks/ evaluation/ criteria/ guidance/ instruction/ explanation/ description/ instruction.
Anchor examples and points:
3 points: There was no briefing at all. As a result, I wasn’t sure exactly what my tasks were and what criteria were used for assessment.
2 points: There was no description of the tasks.
1 point: I was unsure what I was supposed to do.
0 points: I didn’t understand anything.

The problem of the first case, namely that answers with very different content can be correct, does not exist here. Therefore, a sample solution can be used for evaluation in addition to the keywords. Six points should be awarded
for an entirely correct answer; a maximum of three for the keywords and three
for the semantic similarity to the sample solution. In this case, evaluation
categories are introduced to define which answer is awarded how many points,
depending on a sample solution.
These are shown in →Table 14.2, using the example of the role ambiguity
game parameter. Full points are awarded for an answer whose sentence vectors have a cosine similarity of 0.8 or higher with the sample solution. Two points are awarded from a value of 0.7, and from a cosine similarity of 0.6, one point is awarded. The threshold values for the evaluation categories are evaluated in the
EP and adjusted if necessary. The examples in the table refer to the sample
solution: “It must be clear who is to do what, when, and how. The objectives of
the computer store must be clear. At the beginning, it must be described how
the game works and what exactly it is about. It must clearly state what points are
not awarded in the game. It must also be specified who has authority and who
does not.”

Table 14.2: Evaluation scheme for semantic similarity.

Anchor examples for the role clarity parameter by evaluation category (cosine similarity):
Cosine similarity ≥ 0.8 (3 points): There must be a briefing or instructions so that you know what to do. This should define the objectives of the computer store. It must also explain precisely what the game is about and what points are awarded for. In addition, it should be defined who has authority to issue instructions.
Cosine similarity ≥ 0.7 (2 points): There must be a briefing or instructions so that people know what to do. It must be precisely explained what the game is about and what points are awarded for. It should also be defined who has authority to issue instructions.
Cosine similarity ≥ 0.6 (1 point): It should be explained what needs to be done, what the objectives of the store are and how the game is played.
Cosine similarity < 0.6 (0 points): It should be explained what needs to be done.
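As a sketch, the scheme from Tables 14.1 and 14.2 could be applied as follows; the function and parameter names are illustrative and not taken from the implementation:

def keyword_points(num_keywords_found: int, max_points: int = 3) -> int:
    """One point per correctly named keyword, capped at max_points (Table 14.1)."""
    return min(num_keywords_found, max_points)

def similarity_points(cosine_similarity: float) -> int:
    """Points for semantic similarity to the sample solution (Table 14.2)."""
    if cosine_similarity >= 0.8:
        return 3
    if cosine_similarity >= 0.7:
        return 2
    if cosine_similarity >= 0.6:
        return 1
    return 0

def total_points(num_keywords_found: int, cosine_similarity: float) -> int:
    """Second case: up to three points for keywords plus up to three for similarity."""
    return keyword_points(num_keywords_found) + similarity_points(cosine_similarity)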

In the next part, we will conceptualize how the result of the semantic
comparison of a solution with a model solution can be explained. The results of
the model are the sentence vectors generated from the solutions. The cosine
similarity between the sentence vectors is calculated for the semantic
comparison, and points are awarded based on this. The result of the model is,
therefore, not the score. Therefore, an XAI method such as SHAP or TED cannot
be used to explain the model and, thus, the rating. One solution would be to
train the model with data sets of solutions, sample solutions, and the points
awarded, to learn to assign the appropriate score. However, this requires large
amounts of data, for which the cosine similarity between the solution and the
sample solution must be calculated beforehand to specify the correct score.
Furthermore, it is not possible to ensure how the model arrives at the corresponding score from the sentence vectors. This approach would therefore make the model more error-prone merely to increase explainability.
Another approach considers the individual game parameters as classes that
the model should predict for the solutions. The data sets required for this are
more accessible, as no cosine similarity needs to be calculated. For this purpose,
a component for text categorization is to be trained based on a BERT model. This
model can then be used with SHAP to show the influence of the individual text
passages on the assignment to a specific class. The data sets used for training
must not deviate too much from the model solutions. This ensures that a section
of text that significantly influences the assignment to a class also has similarities
to the corresponding sample solution. With this approach, the evaluation is still
based on the calculated cosine similarity. This means that poorer predictions of
the model do not compromise the explainability of the system.

14.4 Proof of concept realization


Based on the previously described TBP, the SDP will be described in this section.
Therefore, the results for the ROs RO.SDP.RQ1.3, RO.SDP.RQ2.3, and
RO.SDP.RQ3.3, based on [→73,→74], will be explained in more detail. The
contents of the following sections are based on the implementation results of
[→73,→74].
Python version 3.9 is used as the programming language. Python is the most
widely used programming language in ML. The language has a simple syntax
and is very versatile. Many libraries and frameworks also use Python for ML
[→64]. The NLP library spaCy presented in the OP implements the solution
evaluation system.

14.4.1 Realization of the NLP algorithm


The spaCy demo example [→65] for a NER pipeline is used to implement the chat
pipeline. The required spaCy components are already preconfigured there. The
project.yml file contained in the project defines commands and workflows. These
can be executed via spacy project run <workflow/command> under the path of
the project.yml file. The workflow executes the commands to convert, train, and
evaluate. First, the T&T data is converted from JSON to the binary data format
spaCy works with [→27]. This data is then used to train the NER model. Once the
model has been saved, it is evaluated in the final workflow step using the test
data. The training is configured in another file (ner/configs/config.cfg). This
defines which components are used and how they should be configured: for example, which models or algorithms a component is based on, whether it uses the values of its predecessors in the pipeline, and how much data is processed before the model weights are adjusted. A tokenizer and a NER component of spaCy are
used for the chat pipeline. The tokenizer component creates tokens from a text
that can be accessed in other components. The tokens are used in the NER
component to train the NER. Apart from the specification of the components, the
standard configuration of spaCy is used in this work. After training, the result is
saved under ner/training/model-best or ner/training/model-last and can be
used. The result of the evaluation can be viewed under
ner/training/metrics.json. For example, for each NE category, such as PRINTER,
you can see how often it was recognized correctly and incorrectly. It is, therefore,
sufficient to generate the T&T data, as shown in [→8], and convert it into the
required format to train a new model with additional products or product
properties. Once the workflow has been executed, a new model is available. The
model developed this way is then used in services/ChatService.py to recognize
the NE in a chat message. The recognized NE reads the required product
property from the product library. Next, this information is compared with the
answer from the chat tool, and feedback is given on the correctness of the
answer. If only a question and no answer are sent via the chat tool, the product
property read from the library is returned as feedback.
Within the quiz pipeline, solutions should be checked for contained keywords
and, if available, checked for semantic similarity with a sample solution. In order
to recognize the keywords using RegEx, they must first be transformed into a
suitable form. The script keywordDetection/scripts/transform_keyword_file.py is
available for this purpose. This adapts the keywords so that the initial letter is
case-insensitive. For example, for the keyword “support✶” both the term
“support” and the term “supporting” can be found. The script also replaces
individual characters. For example, the ✶ character is replaced by the RegEx string (\w+)?. This indicates that everything that comes after “support” and is part of the word matches the search pattern. The resulting RegEx can be used to
search texts for keywords. When a request with solutions for the quiz is sent via
the API, the QuizService is called from there. The get_keywords method in the
keyword_detection script is then called for each task in the request. The
appropriate RegEx is determined there using the questionId sent along. This is then used to search the text for keywords, and the results are returned to the QuizService.
The script keywordDetection/scripts/keyword_detection_test.py was developed
to test the keyword detection with the test data from [→8]. It checks how many
keywords are specified in the test data and whether others are recognized. The
QuizService then checks whether a sample solution exists for the individual
questions. If so, the get_similarity_score method from the semantic_similarity
script is called. The appropriate sample solution is determined there using the
questionId. This and the player’s solution are then each converted into sentence
vectors. For this purpose, a multilingual SBERT model [→66], which has been
pretrained for tasks related to semantic similarity [→66], is used. The two
vectors’ cosine similarity is determined to arrive at a similarity value between 0
and 1. This value is sent back to the quiz service. There, the evaluation is carried
out using the evaluation scheme. Finally, the answers, the scores, and the achieved CQ are saved with a time stamp. The CQ is not yet transmitted to the CQP.
The interface to the CQP currently uses a player’s e-mail address to identify
them.
Once the interface has been adapted, the CQs achieved can be transferred
from the solution evaluation system to the corresponding CQP. The script
semanticSimilarity/scripts/semantic_similarity_test.py was developed to test the
semantic similarity evaluation. It performs a semantic similarity evaluation with
the test data from [→8] and returns the number of cases in which the evaluation
result corresponds to the expectations.
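Below is a minimal sketch of the two steps described above: transforming a keyword with the ✶ placeholder into a RegEx pattern, and computing the cosine similarity of sentence vectors with an SBERT model. The model name is a placeholder, since the chapter does not name the exact multilingual model, and the case handling is simplified:

import re
from sentence_transformers import SentenceTransformer, util

def keyword_to_regex(keyword: str) -> re.Pattern:
    """Turn e.g. 'support✶' into a pattern matching 'support', 'supporting', ..."""
    escaped = re.escape(keyword).replace(re.escape("✶"), r"(\w+)?")
    # For simplicity, the whole pattern is made case-insensitive here; the
    # original script only adjusts the initial letter.
    return re.compile(escaped, flags=re.IGNORECASE)

print(bool(keyword_to_regex("support✶").search("They were supporting me.")))  # True

# Placeholder for a multilingual SBERT model pretrained for semantic similarity.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def similarity_score(solution: str, sample_solution: str) -> float:
    vectors = model.encode([solution, sample_solution])
    return float(util.cos_sim(vectors[0], vectors[1]))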

14.4.2 Realization of the XAI component


The entities/RatingScheme.py class was introduced to implement the rating
scheme. This is a simple class without logic that only contains attributes. The two
designed rating schemes from the TBP are combined and implemented here.
The number of points per keyword and the maximum number of points can be
defined as an integer for evaluating the keywords. A dictionary is used to
evaluate semantic similarity. This is a storage form for data in Python. A
dictionary is enclosed in curly brackets and can consist of several pairs with keys
and values. Since Python 3.7, dictionaries preserve the insertion order of their keys, and duplicate keys are not allowed.
In this example, there is one point per keyword and a maximum of 3 points for the keywords. One point is awarded if at least 60% semantic similarity was
determined for two texts. Two points are awarded for at least 70% and three for
80% or more. A score is then created in services/QuizService.py based on the
scoring scheme. The overall rating is an instance of the entities/QuizRating.py
class.
In addition to the userId, it also contains a timestamp to distinguish different
rounds of the same player. The list of ExerciseRatings contains the individual
questions, answers, and the corresponding ratings. A path is also stored under
which an explanation of the semantic similarity is saved. In addition, the total
points achieved and the possible points are included in the overall rating. An
individual rating is an instance of the entities/Rating.py class. For a submitted
answer to a question, it records which keywords were found, how many points
there are for them, and the maximum number of points that would have been
possible. The determined similarity value and the achieved and maximum
possible points are included for semantic similarity. A completed evaluation is
saved in the assets/competencies.json file so that all evaluations can be tracked.
The CQ achieved is also saved here. This is later transferred to the player’s CQP.
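A minimal sketch of what such an overall rating object and its persistence could look like is given below; the attribute names and the JSON handling are illustrative and need not match entities/QuizRating.py exactly:

import json
import time

class QuizRating:
    """Overall rating of one quiz round (illustrative sketch)."""
    def __init__(self, user_id, exercise_ratings, explanation_path,
                 achieved_points, possible_points, achieved_cq):
        self.user_id = user_id
        self.timestamp = time.strftime("%d-%m-%Y_%H.%M.%S")
        self.exercise_ratings = exercise_ratings   # per-question ratings as plain dicts
        self.explanation_path = explanation_path   # HTML file with the SHAP output
        self.achieved_points = achieved_points
        self.possible_points = possible_points
        self.achieved_cq = achieved_cq

    def append_to_file(self, path="assets/competencies.json"):
        # Persist the completed evaluation so that all ratings remain traceable.
        try:
            with open(path, "r", encoding="utf-8") as f:
                ratings = json.load(f)
        except FileNotFoundError:
            ratings = []
        ratings.append(vars(self))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(ratings, f, ensure_ascii=False, indent=2)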
A text classifier is trained to understand better how the semantic similarity
evaluation is achieved. This assigns a player’s solution to one of the questions.
This result and the weighting that the model assigns to individual words can
then be used to visualize the influence of individual words on the result using
SHAP. This functionality is implemented in the ExplainService from the
information model. An example project for text classification with spaCy is used
as the basis for the text classifier. It contains the required spaCy components
and a standard configuration for their training. Commands and workflows are
also defined in the textCategorization/project.yml file, which can be executed via
spacy project run <workflow/command>. The workflow executes the commands
to convert, train, and evaluate. First, the T&T data is converted from JSON to the
binary data format spaCy works with [→27]. This data is then used to train a
classification model. Once the model has been saved, it is evaluated in the final
workflow step using the test data.
The training is configured in the textCategorization/configs/config.cfg file,
where the components used and their parameters are defined. A transformer
and a text categorization component are used as pipeline components. The
transformer creates vectors from a text. These vectors are then used to
categorize the texts. The other values from the configuration file are default
values from spaCy. After training, the best checkpoint is saved under textCategorization/training/model-best, while the final state is saved under textCategorization/training/model-last, even if it does not provide better results than the previous best model. The result
of the evaluation can be viewed under textCategorization/training/metrics.json.
The texts under textCategorization/assets/sample_texts_textcat.json must be
adapted or extended to train a new model. The script
textCategorization/scripts/nlp_aug.py must then be executed to generate T&T
data. By executing the workflow, a new model can then be trained. A finished
model is used in the services/ExplainService.py class.
There, the trained model is first loaded as a spaCy pipeline. The tokenizer
and the categories are extracted from the pipeline. In the explain method, the
explanation takes place using SHAP. To do this, the texts are read from the exercises provided when the method is called. The SHAP Explainer is then
initialized. It is given access to the predictions of the pipeline using the predict
method. The tok_wrapper method returns the tokenizer in the form required by
SHAP. It is required to split the text into tokens. The tokenization must work in
the same way as the prediction. Otherwise, different tokens may be used in
prediction and explanation, making the explanation worthless. The categories
that are to be predicted are also given to the Explainer.
The results are then prepared graphically using SHAP methods and saved in
an HTML file. The name of the HTML file is made up of the userId and the date
and time of the evaluation. The HTML file is stored under the path
assets/explanations. An example is shown in →Figure 14.5. For each submitted
response, the ten categories are listed first. The one to which a text fits best
according to the model is marked in red. The bar below shows the influence that
individual words had on the decision. Red words have a positive influence, and
blue words have a negative influence. Blue words ensure that a text is assigned
to a different category. The longer the bar of a word, the stronger is its
weighting. The words are also color-coded in the text below. The darker the
color, the stronger is the influence on the assignment to the selected category.
The exact value for a word can be displayed by placing the mouse over it. The
individual categories can also be clicked on so that the values of other categories
can be displayed [→73,→74].
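A minimal sketch of how such an ExplainService could combine the spaCy text classifier with SHAP is shown below; it uses a simple regex-based text masker instead of the pipeline's own tokenizer and is therefore only an approximation of the implementation described above:

import os
import numpy as np
import shap
import spacy

# Trained spaCy text classification pipeline (path follows the project layout).
nlp = spacy.load("textCategorization/training/model-best")
categories = list(nlp("dummy").cats.keys())

def predict(texts):
    """Return category scores with shape (n_texts, n_categories)."""
    return np.array([[doc.cats[c] for c in categories] for doc in nlp.pipe(texts)])

# Regex-based text masker; the actual service wraps the pipeline's tokenizer so
# that explanation and prediction use identical tokens.
masker = shap.maskers.Text(r"\W+")
explainer = shap.Explainer(predict, masker, output_names=categories)

shap_values = explainer(["Die Aufgaben und Ziele waren nicht klar beschrieben."])
html = shap.plots.text(shap_values, display=False)  # return the plot as HTML

os.makedirs("assets/explanations", exist_ok=True)
with open("assets/explanations/example.html", "w", encoding="utf-8") as f:
    f.write(html)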

Figure 14.5: XAI output (in German).

14.4.3 Integration of the NLP algorithm and XAI component into the KM-EP
In the root directory of the software project, there is a file called README.MD,
which describes how the solution can be installed. Firstly, the CUDA drivers, CuPy
and spaCy, will be installed. CUDA is a toolkit from the company NVIDIA, which is
used to develop high-performance graphics processing units (GPUs) and
accelerated applications [→67]. CUDA requires that the computer it is to run has
a CUDA-capable GPU [→68]. CuPy is an open-source library for GPU-accelerated
computing with Python [→69]. However, these steps are skipped because the
server has no GPU. After installing these dependencies, the spacy project run
start should be executed. This should install the other dependencies and start
the application. The necessary steps are defined in the project.yaml file. If the
spacy project run start script is executed, the installation fails with an error
message. This message occurs because the installed urllib3 Python package is not compatible with the installed version of the OpenSSL program. The installed version of urllib3 is 2.0; a compatible version is 1.26.6. Another error message appears if this version is installed and the spacy project
run start command is executed again. This is because no model has been
created yet. To create the model, the script textCategorization/script/nlp_aug.py
must first be executed to generate training data for the model.
The script does not work and aborts with an error message referring to the parameter device = “cuda”. If this parameter is removed, the script is
executed to the end, and the data is generated. To use the data to train the
model, the command spacy project run all is executed in the textCategorization/ directory. However, this fails with the message Missing dependency, specified by
the command “train”:configs/config.cfg. The problem is that the file
textCategorization/configs/config.cfg is missing. Bühler describes in his work
that he used the demo pipeline as described in [→70]. A significant difference,
however, is that the /configs/config.cfg file is present in the demo project.
If this file is now added, the error message ValueError: GPU is not accessible. Was the library installed correctly? appears. This can be remedied by setting the value gpu_id to −1 in the textCategorization/project.yml file, meaning a CPU is used instead of a GPU [→71]. If spacy project run all is then executed again, it generates a trained model, eliminating the error. After executing the spacy project run start command again, another error message appears, showing that the textcat component does not exist. The
textCategorization/configs/config.cfg file contains the components tok2vec and
textcat_multilabel.
The command spacy init config --pipeline “textcat” configs/config.cfg -F creates a configuration with the textcat component. Subsequently, the command spacy project run all must be executed again in the
textCategorization directory for the change to take effect. If the command spacy
project run start is executed in the project’s root directory, the application works
without error messages. The generation of another model is required. The
command python -m chatette ./assets/data_template.txt -o ./assets/generated is
executed to do this. However, no Python package named “chatette” is available
and must be installed using the pip install chatette command. After this step, the
command can be executed, and training data for an AI model is generated. To
transform this into a suitable form, the script ner/scripts/transform_ner_file.py is
executed. The model is then trained by executing spacy project run all in the ner directory. It is noticeable that the gpu_id in the project.yml is set to 0. This is changed to −1 to avoid an error message due to the missing GPU. Executing the command results in the following error message:
Missing dependency specified by command ‘train’: configs/config.cfg
A file is missing. One difference is that the project.yml file of the demo pipeline contains more steps, including a create-config step. If this step from the demo pipeline is inserted into the Bühler AI project in the same way, the command can be executed without an error message, and the training results can be found in the ner/training directory. If a request is now sent to the assessment interface, the
following error message appears: FileNotFoundError: [Errno 2] No such file or directory: ‘assets/explanations/1234_17-06-2023_14.23.23.html’
This is because the assets/explanations directory does not exist. Once the
directory has been created, a request is processed without an error message.
As the manual process is error-prone and time-consuming, all steps, from
installing the dependencies to generating the data and training, were combined
in a Python script called init_project.py. The Python language was chosen here
because it is operating system-independent, and the entire project also requires
a Python runtime environment; therefore, this dependency is already installed.
Missing information and dependencies were added to the code and the
parameters were adjusted. A start_server.py script was also written with which
the web server for the AI solution can be started. In order to install and initialize
the software on the KM-EP, the following steps must be carried out:
1. git clone <repository name>
2. cd freitextbewertung/
3. sudo git checkout fix_for_no_gpu
4. python3.9 -m venv venv
5. source venv/bin/activate
6. python init_project.py
7. nohup python start_server.py
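A minimal sketch of what such an init_project.py could look like is shown below; the exact commands, their order, and the pinned versions are assumptions derived from the steps described in this section:

"""Illustrative sketch of init_project.py: automate the manual setup steps."""
import subprocess
import sys
from pathlib import Path

def run(command, cwd="."):
    print(f"+ {command} (in {cwd})")
    subprocess.run(command, shell=True, check=True, cwd=cwd)

# Pin a urllib3 version compatible with the system's OpenSSL build and
# install the remaining tooling.
run(f"{sys.executable} -m pip install 'urllib3==1.26.6' chatette spacy")

# Generate T&T data and train the text classification model.
run(f"{sys.executable} textCategorization/scripts/nlp_aug.py")
run("spacy project run all", cwd="textCategorization")

# Generate, transform, and train on the NER data.
run(f"{sys.executable} -m chatette ./assets/data_template.txt -o ./assets/generated")
run(f"{sys.executable} ner/scripts/transform_ner_file.py")
run("spacy project run all", cwd="ner")

# Ensure the directory for the SHAP explanation files exists.
Path("assets/explanations").mkdir(parents=True, exist_ok=True)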

14.4.4 Verification of the implementation of the XAI component


Unit tests were developed for verification purposes [→74]. These can be divided
into two subgroups. On the one hand, the similarity values between the different texts were tested; on the other hand, the explanation of the evaluation was tested. Eight
tests were developed with nine data sets to verify the similarity values. Five areas
were defined for this purpose; the scoring of these areas is shown in →Figure 14.6.

Figure 14.6: Scoring for the area.

In the first test, a text was compared with itself. A similarity value of 1 was
expected, which was also the test result. The second test calculates a similarity
value for a reference text and the comparison text. The value is then calculated
again, but the reference and comparison texts are swapped. Both similarity
values should be the same. Running the test confirms that the values are the
same. In the third test, both texts contain exactly the same words but in a different
order. A very high similarity value is expected. In the fourth test, two texts with
the same number of words but containing completely different statements were
selected. A very low similarity value was expected. The value was 0.33, which is
very low. The fifth test compares two texts, one very long and the other very
short. Both have a completely different message. A very low similarity value was
expected. The value was −0.05. The sixth test compares two texts with the same
statement but formulated differently. This test was carried out with an English
and a German text. Both similarity scores were 0.95. In the seventh test, the
comparison text is a short subtext of the reference text. A medium–high
similarity value is expected. The value was 0.8 and is, therefore, high. In the
eighth test, the comparison text is a subtext of the reference text, only worded
differently. A medium–high similarity value is expected. The value was 0.73.
The values were 1, 0.98, 0.95, 0.8, 0.73, 0.33, and −0.05. It can be seen that
the values are distributed in different ranges. However, there are still
shortcomings. One value was negative, which does not correspond to the
expected range of values. Furthermore, it is noticeable that the similarity of texts of similar length is rated as low instead of very low, even if they have no semantic match and do not contain the same words. However, if the semantics partially
match, the values are usually high or very high.
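Two of the similarity tests described above could, for instance, be sketched with pytest as follows; the import path and the two-text signature of get_similarity_score are simplifying assumptions, since the actual method looks up the sample solution via a questionId:

import pytest

# Assumed import path and simplified two-text signature (illustrative only).
from semantic_similarity import get_similarity_score

def test_identical_text_has_similarity_one():
    text = "There must be a briefing so that you know what to do."
    assert get_similarity_score(text, text) == pytest.approx(1.0, abs=1e-3)

def test_similarity_is_symmetric():
    reference = "The objectives of the computer store must be clear."
    comparison = "It should be explained what needs to be done."
    assert get_similarity_score(reference, comparison) == pytest.approx(
        get_similarity_score(comparison, reference), abs=1e-6
    )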
During the explanation of the evaluation, it was checked whether each word
was also assigned an influence value. This test was passed. Next, two texts
consisting of two words were constantly compared. The reference text is always
This sentence. The comparison texts are given in →Table 14.3.
Table 14.3: Values used for the verification [→74].
Comparison text | Similarity | Word | Influence | Word | Influence | Influence sum
This sentence | 1.0 | This | 0.0521 | sentence | 0.259 | 0.3111
this sentence | 0.998 | this | 0.0519 | sentence | 0.2563 | 0.3082
This Sentence | 0.9877 | This | 0.0544 | Sentence | 0.2467 | 0.3011
sentence This | 0.9749 | sentence | 0.2339 | This | 0.027 | 0.2609
This phrase | 0.8585 | This | −0.0219 | phrase | 0.1174 | 0.0955
THIS SENTENCE | 0.7505 | THIS | 0.0418 | SENTENCE | 0.0047 | 0.0465
This sentencE | 0.7107 | This | −0.0003 | sentencE | −0.0303 | −0.0334
This sentenc | 0.6873 | This | −0.0086 | sentenc | −0.0537 | −0.0624
This fadadfsd | 0.6717 | This | −0.0328 | fadadfsd | −0.0693 | −0.1022
This senTence | 0.6022 | This | −0.0748 | senTence | −0.1388 | −0.2136
This dog | 0.3531 | This | −0.0298 | dog | −0.3879 | −0.4177

The table shows that the sum of the individual influence values decreases when
the similarity value decreases. It can also be seen that the capitalization of the
first letter of a word has a relatively small influence on the similarity value and
the influence value. However, the order changes the influence value of the
individual words somewhat more strongly, but the sum changes relatively little.
Using a synonym, as in line 5 of the table, also strongly influences the influence
value of the other word. The similarity and influence values are relatively
strongly influenced if the words are capitalized entirely. If a single letter that is lowercase in the reference sentence and not at the beginning of the word is capitalized, the similarity value and all influence values are significantly reduced. The same applies if a letter of the word is omitted. If the second word is replaced by an utterly different word with no semantic similarity, the similarity value drops sharply, as does each influence value, but not as sharply as in the case of a character string that does not form a word but is just a random combination of letters. Overall, it can be seen that the sum of the influences correlates with the similarity values in these trials. Words in the solution that are not included in the sample solution have a stronger negative influence value than words that occur in both the sample solution and the solution. As a prototype, this
method is suitable for explaining the evaluation. However, further experiments
with more test data would have to be carried out to confirm this assumption
with more certainty. Another desirable property for the explanation would be
that the influence value of a word does not change when another word changes
in comparison to the reference text [→74].
14.5 Initial evaluation
A cognitive walkthrough (CW) was selected as the initial evaluation methodology for this chapter to reach RO.EP.RQ1.4, RO.EP.RQ2.4, and RO.EP.RQ3.4 [→72]. This task-oriented inspection method is performed without end users; however, a CW can already be usefully applied in an early development state of software [→72].
A CW is divided into three phases. First, the system users are described in
the preparation phase, sample tasks are designed to be completed with the
system under test, and possible solutions are developed. In the analysis phase,
the actual CW takes place. In this phase, the domain expert documents each of
his actions and considers possible problems real users might have with the
system. In the Follow-Up phase, identified usage problems and their causes are
documented. In addition, possible alternatives for them are named [→72].

14.5.1 Evaluation scenario and tasks


During the CW, as shown in [→73, →74], the following tasks are considered:
1. Generate T&T data for the NER in the chat pipeline.
2. Train a NER model for the chat pipeline.
3. Ask questions to a fictitious colleague.
4. Answer questions from a fictitious colleague.
5. Maintain keywords for the quiz pipeline.
6. Maintain sample solutions for the quiz pipeline.
7. Submit answers to the final quiz.
8. Generate T&T data for the classification of solutions from the final quiz.
9. Train a classification model for the ExplainService.
10. Track the scores of the final quiz.
The first four tasks can be thematically assigned to processing messages from
the chat tool. They are, therefore, considered together in a pretest and CW. Task
1 tests the usability of the generation of T&T data for the NER. Task 2 checks how user-friendly the training of a new NER model with the previously generated T&T data is. This is part of the chat pipeline component. Tasks 3 and 4 test the chat
pipeline and the T&T data generated for the NER. In addition, the chat tool
response interface and the chat tool evaluation interface are evaluated. Tasks 5–
10 are thematically assigned to the processing of solutions for the final quiz.
They are, therefore, also considered together in a pretest and the subsequent
CW. Tasks 5–7 test the quiz pipeline component. Task 7 also tests the generated
T&T data and the quiz interface. Task 8 evaluates how user-friendly the generation of T&T data for text classification is. Tasks 9 and 10 evaluate the
evaluation scheme and the ExplainService. The evaluation of the T&T data has
already been described in [→8]. In this chapter, the T&T data and test cases are
used for the evaluation of the XAI and NLP algorithms.

14.5.2 Evaluation results


During the CW evaluation, experts provided valuable insights across various
tasks. They recommended streamlining the manual execution of scripts for
generating and transforming T&T data into a cohesive spaCy workflow,
encompassing data generation, transformation, model training, and evaluation
(Task 1). Feedback on the prototype (Task 2) indicated that no additional training
is required, emphasizing training only to adapt the model to changing or
accurate data. Task 3 involved discussions on incorporating social stressors into
student feedback, while Tasks 4 to 6 received no specific comments. In Task 7,
experts highlighted the need to optimize keyword recognition and discussed
potential interface adaptations for future quizzes.
Additionally, concerns were raised about the system architecture and data
standards in the ongoing GALA4QBLM project. Task 8 focused on integrating the
script for T&T data generation into the model training workflow. Task 9 had no
comments, and Task 10 emphasized the inconvenience of tracking assessments
through JSON files, suggesting the implementation of a user-friendly interface
for displaying results. The feedback covered process optimization, social
considerations, keyword recognition, interface adaptation, and user interface
enhancement [→73, →74].

14.5.3 Further developments based on the evaluation


The CW confirmed the basic usability of the prototype. No errors were
discovered. However, potential for improvement was uncovered. This mainly
concerned increasing user-friendliness through the automated execution of
scripts and a user interface for tracking solution evaluations. Potential for
improvement was also identified in the recognition of keywords.

14.6 Conclusions
This chapter presented and realized a potential approach for an NLP algorithm
and an XAI algorithm for the PAGEL application scenario. Based on this, how
these algorithms could be integrated as components in the KM-EP and how they
were evaluated was shown.
Finally, based on the results of this work, an outlook on the possible further
developments is given. Once the interface to the CQP has been modified so that
an ID is used to identify a student instead of an e-mail address, this interface can
be linked to the solution evaluation system. This means that the CQs achieved in
the final quiz are entered directly into a student’s CQP. The interface of the
product library should also be connected to the solution evaluation system. This
allows the currently used file to be removed so that there is only one place
where products are maintained. This means that the solution evaluation system
always accesses the current products and product properties. The configuration
tool is also to be linked to the solution evaluation system. The evaluation scheme
can then be adapted in a user-friendly way. Once the authoring tool has been
implemented, its interface can be connected to the solution evaluation system.
The keywords and sample solutions are then maintained in the authoring tool
via a more user-friendly interface. As soon as new keywords are received from
the authoring tool, the script that converts them into a RegEx search pattern can
be executed automatically. This means that manual execution of the script is no
longer necessary.
To reduce the manual execution of scripts in other places, the scripts for
generating and formatting the T&T data for NER can be included as previous
steps in the workflow for training the NER model. The same applies to the script
that generates T&T data for text classification. This can be included as the first
step in the workflow for training the classification model.
The texts sent by real players can be collected to optimize the ML models
further. If they are available in sufficient quantity, they can be converted into the
required format and used to train the models. If insufficient actual data is
available, it can be used to generate new T&T data close to the actual data. The
better the T&T data represents the texts of real players, the better the model
developed with it. The solutions to the final quiz are already saved. The
prototype could be extended to store the chat messages as well.

15 Toward a skill-centered qualification ontology
supporting data mining of human resources in
knowledge-based enterprise process representations

Adrian Vogler
Binh Vu
Tobias Vogel
Benjamin Gernhardt
Matthias Hemmje

Abstract
In the evolving context of Industry 4.0 and the Fourth Industrial Revolution, representing and managing process-based big data sources, for example, to optimize human resource (HR) assignment in production planning, is critical. This research introduces an HR ontology (the FPHR ontology) that is integrated into knowledge-based production planning (KPP) methods and automates the utilization of employee skills in planning amidst vast industrial big data landscapes. Utilizing semantic techniques and the ESCO ontology, the FPHR ontology enhances KPP, enabling a sophisticated consideration of competences and skills. The ontology supports broad knowledge and skill domains, addresses digitalization challenges, and aids the upskilling and re-skilling paradigm, highlighting the significance of semantic representation and user-centric information system design for future intelligent production planning and the corresponding HR management.
Keywords: knowledge-based production planning (KPP), human resource
ontology (FPHR ontology), Industry 4.0, big data, employee skills and
competencies, semantic web technologies

15.1 Introduction
In the era of Industry 4.0 and the Fourth Industrial Revolution, the exponential
growth in data generation has presented organizations across industries –
including e-commerce, healthcare, and internal organizational functions like
human resources, sales, and production – with the critical challenge of big data
management and big data mining in order to support industrial process
management. This vast influx of structured and unstructured process-based
data necessitates advanced solutions for extracting actionable insights and
optimizing decision-making processes. The imperative to sift through massive
process-based datasets to identify patterns, trends, and correlations has
propelled the need for sophisticated knowledge representation, data mining,
and information extraction methods. This chapter introduces a skill-centered
qualification ontology designed for human resource management, aiming to make sense of big data by facilitating data mining related to human resource skills, in order to allow for process-oriented resource assignment and optimization in support of informed decisions and process improvements within this context.
The Chair of Multimedia and Internet Applications at the University of Hagen, Faculty of Mathematics and Computer Science, together with its affiliated institute FTK, participated in the SHAMAN project. SHAMAN, or Sustaining Heritage Access through Multivalent ArchiviNg [→1], was a European-funded project that focused on the representation, documentation, management, mining, and archiving of industrial processes. However, there was no semantic, machine-readable process representation and no representation of human resources in these processes. Knowledge-based and process-oriented innovation management (German: Wissensbasiertes und Prozessorientiertes Innovations-Management, WPIM), as described by Vogel [→2], was one of the results of the SHAMAN project. WPIM semantically represents and annotates process structures, including process resources, thus providing machine-readability. However, WPIM also did not address the issue of representing human resources in these process representations. Furthermore, the European Commission-funded Collaborative Adaptive Production Process Planning (CAPP)-4-SMEs project [→3] aimed to enhance the competitiveness of European companies, particularly SMEs, in a sustainable distributed manufacturing environment by focusing on automating innovative process planning and control. In addition, building on the WPIM results, Gernhardt [→4, →5, →6] initially developed knowledge-based production planning (KPP) as a follow-up PhD project of CAPP-4-SMEs [→7]. Also, in CAPP-4-SMEs, no semantic representation of human resources within the
industrial processes was addressed. KPP focuses on knowledge-based planning
of complex production processes that involve various substeps, locations, and
companies. The aim is to semantically represent and make the process-
associated knowledge machine-readable. KPP has been developed and refined
in recent years to address the challenges of modern production environments
and the collaboration of small and medium-sized enterprises (SMEs). The
foundation of KPP is semantic knowledge representation, which enables flexible,
efficient, and effective support for CAPP. The integration of CAPP knowledge into
machine-readable process formalizations is a key factor in making such
knowledge accessible in cloud-based knowledge repositories and usable for
various SMEs [→4, →5, →6]. So far, KPP has focused on technical resources in
processes, while human resources, especially human resource descriptions
related to qualifications such as certified competences and skills, have been only rudimentarily considered.
This chapter introduces a skill-centered qualification ontology aimed at
human resource management, specifically designed to navigate and mine the
complexities of big data produced by industrial process management and
related human resource assignments. It seeks to offer a structured approach to
understanding and mining vast process-oriented datasets, enabling
organizations to optimize their human resource assignment and development
planning, including the related optimization and decision-making processes, in
alignment with Industry 4.0 economic demands. Therefore, the motivation of
the work described in this chapter is built on the following research gaps: There
is no functional and process-oriented human resource ontology (FPHR ontology)
and no human resource qualification profile ontology (HRQP ontology) for the
semantic description of human resource functions and qualifications in
industrial processes. Underlying the need for the FPHR ontology and the HRQP ontology, three core problems exist. Problem statement 1 (PS1): KPP cannot describe human resource functions, although they are necessary for complete industrial process planning. Problem statement 2 (PS2): KPP cannot describe human resource qualifications, although they are necessary for complete industrial process planning. Problem statement 3 (PS3): The system cannot automatically use human resource functions and qualifications in the production planning and optimization process, even if they are described using the FPHR ontology and the HRQP ontology. The research questions (RQs) derived from the three problems are as
follows: research question 1 (RQ1): How can an FPHR ontology be derived and
represented in a machine-readable manner? Research question 2 (RQ2): How
can an HRQP ontology be derived and represented in a machine-readable
manner? Research question 3 (RQ3): How can the production planning system
be extended to automatically consider human resource functions and
qualifications as resources in the planning process? To solve these RQs, the work
of Vogler [→8] addresses the topic “Ontology creation and ontology integration
as well as exemplary evaluation in a knowledge-based software development
process.” Therefore, the remainder of this chapter is based on Vogler [→8], with
some additions related to qualification-based learning (QBL) in the area of state
of the art in Section 15.2.
The methodological approach in Vogler [→8] is based on Nunamaker et al.
[→9]. This means that, first, the so-called observation goals (OGs) are derived from the three RQs. For RQ1, which concerns the representation of the FPHR ontology,
the OGs focus on analyzing existing methods for skills and qualifications
management, encompassing classifications, taxonomies, and ontologies. In the
context of RQ2, related to the HRQP ontology, the OG is to explore concepts and
models based on qualification-driven learning. For RQ3, centered on the
incorporation of skills into production planning, the OG is to grasp the concepts
of knowledge-based planning, particularly WPIM and KPP. Regarding the so-
called theory-building goals (TG), the aim is to devise a conceptual information
model, tailored for human resource management, grounded in a thorough
analysis of user requirements. The TG is to design the FPHR ontology, drawing
upon established best practices in ontology engineering. Additionally,
conceptualizing the necessary extensions in the KPP system to assimilate the
FPHR ontology is a TG. One of the so-called systems-development goals (SG) is to
implement a prototype of the FPHR ontology, leveraging semantic web
technologies. Another SG is to enhance the KPP system at the database,
backend, and frontend strata to seamlessly integrate the FPHR ontology.
Furthermore, it is an SG to actualize key user requirements pinpointed during
the conceptual phase into the final implementation. In terms of the so-called
experimentation goals (EG), the priority lies in evaluating the user-friendliness and
utility of the FPHR ontology, along with its integration into KPP. The process will
involve pinpointing further refinements and augmentations based on expert
feedback. Conclusively, an EG is to put forth recommendations to refine the
approach for future endeavors.
Therefore, this chapter presents the results of the literature research corresponding to the OGs of these questions in the state-of-the-art section. This is followed by the results of the conceptual modeling of the solution based on the TGs in Section 15.3, the implementation of proof-of-concept prototypes (SGs) in Section 15.4, and preliminary evaluation results (EGs) in Section 15.5, closing with an outlook in Section 15.6.

15.2 State of the art


The structure of the state-of-the-art description aligns with the OGs, i. e.,
literature research OGs related to the RQs, beginning with the basic literature
review dealing with fundamental technologies. These foundational concepts
include semantic web technologies such as the Extensible Markup Language
(XML) [→10], which is a subset of the Standard Generalized Markup Language
(SGML) [→11] for hierarchical structuring of information, developed by the
World Wide Web Consortium (W3C) [→12], and primarily used as a meta mark-
up language for defining other data exchange formats [→13], the resource
description framework (RDF) [→14] as a basic building block for the description
of “resources” [→15], RDF Schema (RDFS) [→16], which is the simplest modeling
language for RDF that is used to semantically classify terms [→2], the SPARQL Protocol and RDF Query Language (SPARQL) [→17], which is a graph-based query language for RDF and RDFS, comparable to the Structured Query Language (SQL) [→18] in
the field of relational databases, and the Web Ontology Language (OWL) [→19],
with which it is possible to formalize the terms in web documents through
ontologies. These technologies provide the backbone for representing, querying,
and manipulating data on the web, serving as the groundwork for the further
exploration of our RQs. The technology stack of the semantic web, as defined by
the W3C, provides a standard for creating semantic models and using the data
from these models. This standard has proven itself for many years, so that it can
be used as a formal basis for the modeling and data querying in this elaboration
without any adaptations. The building blocks, XML, RDF, RDFS, and SPARQL, can
be used as they were presented. Thus, there are no remaining challenges in the
semantic web for achieving the goals of this work. In addition to these
technologies, the state-of-the-art review also draws from the computer science
field of knowledge and process management. In particular, the perspective from
information science of Kuhlen [→20] is presented. Kuhlen’s view provides an
integrative approach to knowledge management, considering not only the
technical aspects of information processing but also the social and
organizational contexts in which knowledge is used. This comprehensive
perspective on knowledge management serves as a crucial foundation for
understanding and managing skills and qualifications within the scope of the
research. This core scientific concept can be used without adaptation, so that there
are no remaining challenges here. Building on these descriptions of core
technologies and core scientific concepts, the computer science field of the
management of skills and qualifications is reviewed next. A distinction was made
between classifications, simple taxonomies, and formal ontologies, based on the
semantic expressiveness of ontologies described in Sack [→21]. Classifications
include the International Standard Classification of Occupations (ISCO) [→22]
and the National Occupational Classification (NOC) [→23]. Taxonomies
encompass the European Dictionary of Skills and Competences (DISCO) [→24],
the Linked Data Competency Index (LDCI) [→25], the e-Competence Framework
(e-CF) [→26], the European Digital Competence Framework for Citizens
(DigComp) [→27], and others. Regarding ontologies, notable examples are the
Occupational Information Network (O*NET) [→28] and the European Skills, Competences, Qualifications and Occupations (ESCO) framework [→29].
In the area of skills and qualifications management, an overview was given of
solutions characterized by a growing degree of semantic expressivity. ESCO is an
ontology with three pillars: occupations, skills, and qualifications. ESCO has a
cross-industry approach. It is available free of charge and supports the
standards of the semantic web. It is currently available in twenty-six languages,
including German. It integrates most of the previously presented European
standards in classifications, taxonomies, and other frameworks. In this respect,
this ontology will be used further in the following. In
summary, ESCO appears to be the best building block for the field of ontology
creation for further use in theory building and implementation. In the sense of a
top-level ontology (upper ontology), ESCO can be used directly. Next, QBL [→30], an approach inspired by competence-based learning (CBL) [→31], is reviewed, together with the QBL model (QBLM) [→30], a comprehensive framework consisting of a domain model, an architectural model, and various service distribution models [→31]. Finally, the research focused on WPIM [→2] and
KPP [→7]. Through an understanding of WPIM in the context of theory building,
one also gains an understanding of the building blocks of KPP that are important
in the context of integrating the FPHR ontology. KPP is useful but must be adapted with respect to the ontology, the processes, the application, and the database. The FPHR ontology must be integrated into the application. Because KPP is only useful in this adapted form, theory building, implementation, and evaluation are necessary. In addition, it must be checked to what extent the SEP already
integrated in WPIM was taken over when creating KPP. If this is the case, the SEP
may have to be optimized as a module. If this is not the case, the SEP must be
modeled and implemented anew in KPP.
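To make the role of RDF and SPARQL in this technology stack more tangible, the following minimal sketch builds a small ESCO-style graph with the Python library rdflib and queries which skills are attached to an occupation. The namespace, the triples, and the property name hasEssentialSkill are illustrative assumptions that merely mimic the ESCO vocabulary rather than reproducing it.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/esco/")  # illustrative namespace, not the real ESCO URIs

g = Graph()
occupation = EX["occupation/software-developer"]
skill = EX["skill/sparql"]

# A tiny ESCO-style graph: one occupation linked to one essential skill
g.add((occupation, RDF.type, EX.Occupation))
g.add((occupation, SKOS.prefLabel, Literal("software developer", lang="en")))
g.add((skill, RDF.type, EX.Skill))
g.add((skill, SKOS.prefLabel, Literal("SPARQL", lang="en")))
g.add((occupation, EX.hasEssentialSkill, skill))

# SPARQL query: which skills does which occupation require?
query = """
    SELECT ?occLabel ?skillLabel WHERE {
        ?occ a ex:Occupation ;
             skos:prefLabel ?occLabel ;
             ex:hasEssentialSkill ?skill .
        ?skill skos:prefLabel ?skillLabel .
    }
"""
for row in g.query(query, initNs={"ex": EX, "skos": SKOS}):
    print(f"{row.occLabel} -> {row.skillLabel}")
```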
The following diagram →Figure 15.1 shows the architecture of the KPP
multitier web application that exists before the extension. The main components
are “Persons,” “Skills,” “Roles,” “Machines,” “Processes,” “Functions,” and
“Queries.” Each process, each person, each role, and each machine is annotated
with skills, which is represented by the connections. The relationships between
people and roles and between machines and processes are also shown.

Figure 15.1: Architecture of the KPP application before the extension.


In our review of the relevant state of the art, we have explored in depth key
technologies such as XML, SGML, RDF, SPARQL, and the ESCO ontology, all of
which are central building blocks for our remaining research approach. These
building blocks allow us to start with the conceptual design and modeling of our
solution in the next section.
Kuhlen’s perspective from information science has provided us with a clear
understanding of how data, knowledge, and information are related to each
other. This understanding is particularly relevant as it forms the basis for the
modeling of our solution, in which data is not only seen as syntactic elements
but also as a representation of knowledge. The research and analysis of these
building blocks have led us to important insights that support us in the
conceptual design and modeling of our solution. We discovered how to develop
a vocabulary for skills and competences using the ESCO ontology. Furthermore,
we found out that adaptations and extensions in the KPP system are necessary
to fulfil our specific requirements. The observational goals of our research have
been achieved by identifying tools and methods that enable the representation
and understanding of data and knowledge. This leads us to the open challenges
and remaining research gaps. We now need to explore how we can integrate
these building blocks into a coherent system design that is not only effective but
also cognitively efficient. This integration will allow us to optimize and extend
knowledge representation and processing of knowledge resources in our field of
research.

15.3 Design and conceptual modeling


This section focuses on theory building, so the structure of sections about user
requirements, information model, and architecture aligns with the TGs. The User-
Centered System Design Methodology [→19] is used. The first step is to define the
Use Context to clarify the application domain and the needs of the users. Within
this context, there are use cases that describe individual application situations.
For these use cases, an information model is required that maps the relevant
aspects of the real world to a representation for storing relevant information of
the use case. Finally, the architecture describes the user interfaces and the
functional components of the targeted system. UML [→32] is used as the
modeling language.
The section on user requirements analysis delves deeply into understanding
the users’ needs, emphasizing both the application context and selected
scenarios from this domain. From these discussions, the definition of system
requirements, the information model, and the overarching architecture emerge.
In our modeling approach, the software industry is targeted as the exemplary
application domain, with IT firms taking center stage. These firms craft software
products tailored to customer requirements, utilizing a blend of individual talent
and software development processes. Several distinct user groups play pivotal
roles in this domain, from executives and project managers to specialists and HR
managers. Key scenarios unfold, painting a picture of staff procurement, where
executive decisions shape the mix of internal and external resources. Project
managers, on the other hand, fine-tune staffing specifics for projects, assessing the intricate skill needs. Another scenario sheds light on skills-centered staff development. From these scenarios, specific exemplary use cases
are selected, intricately linked to the activities of various user groups. These use
cases then form the exemplary foundation for a detailed list of user
requirements, covering a spectrum, from skill searches to plotting career paths.
System needs, derived from these user-centric requirements, are presented,
highlighting functionalities like employee skill searches and career path
determinations. →Figure 15.2 and →Figure 15.3 show examples of some of these
use cases, in this case shared between the roles Executive and HR Manager.

Figure 15.2: Use context model for the KPP human resource management,
including three relevant KPP use cases, exemplarily selected from [→8].
Figure 15.3: Details of the exemplarily selected use case, “Acquire Personnel”
from [→8].

Within the use context of the exemplarily selected KPP human resource
management Use Cases, the system necessitates an information model. This
model captures and represents real-world entities relevant to the application
scenarios, such as main processes, process instances, roles, employees, their
annotated skills, experience levels, professions with their associated skills, and
qualification measures detailing imparted skills. →Figure 15.4 shows part of the schema of the information model, consisting of the classes Person, Process, Profession, and QualificationMeasure, which share attributes, among them the attribute Skills of type string.
Figure 15.4: Excerpt of the KPP human resource information model of the KPP
human resource management context.
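Read as code, the excerpt in →Figure 15.4 could be approximated by the following minimal sketch; the attribute names are simplified assumptions, and skills are deliberately kept as plain strings, which is exactly the limitation the FPHR ontology is meant to remove.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Person:
    name: str
    skills: List[str] = field(default_factory=list)   # plain text, no controlled vocabulary

@dataclass
class Process:
    name: str
    skills: List[str] = field(default_factory=list)   # skills required by the process

@dataclass
class Profession:
    name: str
    skills: List[str] = field(default_factory=list)   # skills associated with the profession

@dataclass
class QualificationMeasure:
    name: str
    skills: List[str] = field(default_factory=list)   # skills imparted by the measure
```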

Thus, for the KPP human resource (KPP HR) information model, the ontology is
needed because the FPHR ontology defines the vocabulary for the attribute
skills, which otherwise contains only plain text. Therefore, the FPHR ontology is
part of the KPP HR information model too. To ensure a robust and methodical
development of this ontology, the process is based on Noy and McGuinness
[→20]. Initially, the scope of the ontology is meticulously defined. This involves
determining the precise domain and pinpointing the target audience, which
encompasses executives, project leaders, HR managers, and specialists. Integral
to this phase is the articulation of key competency questions that the ontology
should adeptly address, offering clarity and direction. Before diving into the
creation of a new ontology, it is crucial to explore the possibility of reusing
existing ontologies. This not only offers potential time and cost savings but also
leverages the validation and tools associated with established ontologies. In this
context, the ESCO ontology emerges as a promising candidate for reuse and
further refinement. Next, the ontology’s foundational terms are enumerated.
These represent the core concepts that the ontology seeks to encapsulate,
including but not limited to professions, skills, roles, processes, and
qualifications. This list serves as the bedrock upon which subsequent steps are
built. Defining classes is an essential step in the ontology creation process. In
essence, classes are collections of objects that share common properties. The
intricacies of class definition are not just limited to identifying these classes but
also entail crafting a hierarchical relationship amongst them, ensuring clarity
and logical flow. Furthermore, properties and attributes within each class are
meticulously defined, ensuring each class is comprehensive and robust. This task
borrows insights from the previously discussed informational model. For
instance, in this context, a profession might be an attribute of the Employee
class, and this profession, in turn, might possess a list of associated skills. Laying
down conditions for properties and attributes is the subsequent step. These
conditions stipulate the criteria that properties and attributes must satisfy,
adding another layer of specificity to the ontology. Finally, for the FPHR ontology
to be pragmatically useful, instances for the defined classes need to be created.
This phase is particularly labor-intensive, demanding rigorous quality assurance.
As of the research conducted in August 2022, the ESCO ontology comprised
13,890 instances of skills and 3,008 instances of professions, all intricately linked.
This ensures that user requirements, such as displaying skills associated with a
selected profession, are seamlessly met. The FPHR ontology is built based on
two main components: the ESCO ontology as upper ontology and the existing
KPP ontology itself. Within the scope of the FPHR ontology design, necessary
expansions are made, based on a detailed analysis of the existing KPP ontology.
A special focus is placed on potential heterogeneity conflicts that might arise at
different levels, such as data model level, data schema level, and data instance
level. The “Skill” class (→Figure 15.5) is then analyzed in detail to identify possible
conflicts between the KPP and ESCO ontologies. Two further classes,
“Occupation” and “Qualification,” previously only present in the ESCO ontology,
are also detailed to identify aspects of these classes that should be incorporated
into the new FPHR ontology.
Figure 15.5: The Skill class, as modeled in KPP (excerpt from [→8]).
Figure 15.6: Model of ESCO ontology with the Skill class belonging to Member
Concept from [→8].
Figure 15.7: Excerpt of the hierarchy of skills from the ESCO Ontology from
[→17].
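A condensed impression of what these modeling steps amount to is given by the following hedged sketch, which declares the new classes and an object property with rdflib and states the equivalence to the ESCO skill class. The URIs are placeholders, and the actual FPHR ontology was built in Protégé rather than generated by such a script.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

FPHR = Namespace("http://example.org/fphr#")            # placeholder namespace
ESCO = Namespace("http://data.europa.eu/esco/model#")   # assumed ESCO model namespace

g = Graph()
g.bind("fphr", FPHR)
g.bind("esco", ESCO)

# Classes introduced for the FPHR ontology
for cls in (FPHR.Occupation, FPHR.Qualification, FPHR.Skill):
    g.add((cls, RDF.type, OWL.Class))

# The local Skill class is declared equivalent to the ESCO skill class
g.add((FPHR.Skill, OWL.equivalentClass, ESCO.Skill))

# Object property linking occupations to the skills they require
g.add((FPHR.hasSkill, RDF.type, OWL.ObjectProperty))
g.add((FPHR.hasSkill, RDFS.domain, FPHR.Occupation))
g.add((FPHR.hasSkill, RDFS.range, FPHR.Skill))

print(g.serialize(format="turtle"))
```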

The architecture section outlines the design of use cases with respect to user
interfaces and the underlying components, particularly in the backend and data
storage layers. The existing KPP user interface raises questions about the
components necessary for implementing the human resource management use
cases, the components that require adjustments, and the ones that need to be
newly created. Furthermore, with respect to the existing KPP backend and data
storage components, there are inquiries about the services required within the
KPP’s backend API for the use cases, the services that might need modifications,
and those that should be newly developed.
Figure 15.8: Component model of the KPP skill management system.
Figure 15.9: Components that are part of the “annotate skills” use case.

Upon its enhancement, KPP will be capable of various tagging, searching, and
mining functionalities regarding employee skills and processes, all drawing upon
the ESCO ontology (→Figures 15.6 and →15.7). Furthermore, it will offer
capabilities to determine discrepancies between the skills of processes and
individuals.
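In its simplest form, such a discrepancy check can be pictured as a set difference, assuming that both the process requirements and the person's annotations have already been resolved to skill labels from the ontology; the data below is invented for illustration.

```python
def skill_gap(required, available):
    """Return the required skills that a person does not yet have."""
    return sorted(set(required) - set(available))

process_skills = {"requirements engineering", "SPARQL", "project management"}
person_skills = {"SPARQL", "Java"}

print(skill_gap(process_skills, person_skills))
# ['project management', 'requirements engineering']
```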
Figure 15.10: Draft of the new component “annotate skills” for the KPP user
interface.

Figures 15.8–15.10 show the component model of the KPP skill management
system with a highlight on components that are part of the “Annotate Skills” use
case. →Figure 15.11 shows the architecture of the web application after the
integration of the skill management system (SMS). The skills ontology
(“SkillsOntology”) is newly introduced and is central to the customization of
processes, roles, and people. The new “OntologyImportComponent” component
imports data into the skills ontology. In addition, a new “Jobs” component is
added, which also originates from the skills ontology and has annotated skills.
The existing relationships are retained, although the connection to the skills is
now made via the skills ontology.
Figure 15.11: KPP architecture with the SMS extension.

In this section, a comprehensive framework of models was meticulously crafted to underpin the development of a human resource information management
system, particularly tailored for the software industry. The foundational model,
the Use Context Model, delineated the application domain and user needs,
setting the stage for the subsequent models. Building upon this, the use case
model emerged, detailing individual application scenarios within the defined
context. These scenarios were then intricately mapped in the information model,
which captured and represented real-world entities relevant to the application,
such as roles, employees, and their skills. The component model was introduced
next, focusing on the functional components of the system, including the user
interface model, which outlined the design of user interactions with the system.
The architecture model tied these elements together, describing the overall
system architecture, including backend and data storage layers. These models,
each serving a distinct yet interconnected purpose, paved the way for the next
section, where a prototypical implementation of each model is developed.

15.4 Proof-of-concept implementation


Based on the Nunamaker research method, this section deals with the system
development activities, i.e., the research objectives related to system
development, addressing RQ1 and RQ3. Due to time constraints, all models were
implemented, but not all user requirements for all models from [→8] were realized. The base
technologies for implementation that are used for KPP are fully sufficient to
implement the results of design and conceptual modeling. In the backend,
the Apache web server [→33], in combination with Apache Tomcat [→34], serves as the web server. No adjustments to the existing KPP configuration are necessary here. The scripting languages JavaScript (JS) [→35] and PHP [→36] are used for the
backend API development.
Figure 15.12: Technical architecture of the KPP application from [→8].

Figure 15.13: Communication flow in multitier architecture.

Regarding the frontend, the Apache web server is also used to deliver the user interface. JavaScript and PHP are again used to create the user interfaces. In the database layer, the models created in RDF/RDFS are implemented in the Neo4j Graph Database [→37], a technology specifically chosen for the representation of
ontologies in KPP. →Figure 15.12 shows the technical architecture of the KPP
application, while →Figure 15.13 shows the communication flow in the system. To
store the ontology within KPP, an extension to the Neo4J implementation is
required. While the MySQL database [→38] stores non-domain-relevant KPP data
such as user information, the Neo4j database hosts all domain-relevant KPP
data. Adaptations are required here to meet user requirements, focusing on the
aspects essential to these requirements. To address the development of RQ1,
the FPHR ontology (→Figure 15.14) is created based on the existing KPP ontology and extended by the ESCO ontology with occupations, skills, and qualifications.

Figure 15.14: Implementation of the FPHR ontology.

In terms of classes, the FPHR ontology introduces new entities that are created
in Protégé. This concerns the classes “Occupation” and “Qualification.” In
addition, the class “esco:Skill” is defined as the equivalent of the class “Skill.”
The relationships between these classes are defined as “Object Properties.” An
example of such a relationship is the connection between “Occupation” and
“Skill.” As far as instances are concerned, entries are added for both
“Occupation” and “Skill” via the ESCO ontology. Documentation is done with the
tool “Wizard for Documenting Ontologies” (WIDOCO) [→39]. This tool creates an
online version of the FPHR ontology. After starting WIDOCO, a template for the
documentation is first selected. Then, based on the FPHR ontology created in
Protégé [→40], WIDOCO generates a series of linked HTML files containing all
the necessary information about the ontology. To address the development of
RQ3, the FPHR ontology from the previous section was integrated into KPP,
whereby special adaptations and extensions had to be carried out.

Figure 15.15: Implementation of the KPP integration.

The KPP application is a distributed, multitier web application, whose architecture does not need to be changed. Nevertheless, changes are necessary
at the database, backend, and frontend levels. Two database systems, MySQL
and Neo4j, are used in KPP. While MySQL is responsible for non-domain-relevant
data, Neo4j hosts all domain-relevant data. Adaptations in Neo4j are necessary
to meet user requirements, and a special script is used to extend the Neo4j database (→Figure 15.15). In the backend, the classes “Occupation” and
“Qualification” are newly implemented, and existing classes are modified. The
frontend adaptations include the extension for the new classes “Occupation”
and “Qualification.” Additional overviews have been added to show the skills
that belong to a specific occupation. More overviews can be added to meet
different user needs.
Figure 15.16: Enhancing the NEO4J database layer for the ESCO classes (excerpt
from [→8]).
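The import of the ESCO instance data can be pictured along the lines of the following sketch, which uses the official Neo4j Python driver. The connection details, node labels, relationship type, and CSV layout are assumptions made for illustration and do not reproduce the project's actual import script.

```python
import csv
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # assumed local Neo4j instance
AUTH = ("neo4j", "password")         # placeholder credentials

def import_pair(tx, occupation, skill):
    # MERGE keeps the import idempotent: nodes and the relationship are created only once.
    tx.run(
        "MERGE (o:Occupation {label: $occupation}) "
        "MERGE (s:Skill {label: $skill}) "
        "MERGE (o)-[:HAS_SKILL]->(s)",
        occupation=occupation, skill=skill,
    )

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session() as session:
        # Hypothetical CSV export with one occupation-skill pair per row
        with open("esco_occupation_skills.csv", newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                session.execute_write(import_pair, row["occupation"], row["skill"])
```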

A query library will be set up to provide different queries in KPP to meet user
needs in the context of human resource data mining and human resource
assignment applications as well as upskilling and re-skilling applications related
to human resource development and overall process optimization. An evaluation
will be conducted to determine whether these implementations are user-friendly
and what further adjustments or enhancements should be made. In this section,
the necessary adaptations and extensions of KPP, based on the
conceptualization from the previous section, were described. These changes
were implemented to fulfil user requirements using the FPHR ontology. The
challenges identified and modeled in the theoretical part were addressed in the
practical part of our research. Each model discussed was translated into a
corresponding prototype to demonstrate its practical applicability and
effectiveness in our specific context. This process of prototyping not only served
to validate our theoretical assumptions, but also to broaden our understanding
of the practical requirements.
In this section, a careful implementation of the previously designed models
for the human resource management system was carried out, specifically
tailored to the requirements of the software industry. The FPHR ontology was
developed based on the basic use context model, which defined the area of
application and user needs. This ontology was created based on the KPP
ontology and enriched by integrating the ESCO ontology to capture occupation
and skill data (→Figure 15.16). The instance data for occupations and skills was
then imported from the ESCO ontology into the NEO4J graph database, enabling
a precise and comprehensive representation of these key elements in the
system. At the same time, the components, services, and queries required for
the integration of the FPHR ontology into the KPP system were newly created or
modified. These adaptations, as part of the KPP integration, included both
backend and frontend developments that were tailored to the specific
requirements of the system. The component model and the user interface
model played a central role here, as they guided the design of the functional
components of the system and the design of the user interfaces. The
architecture model, which describes the entire system architecture, including the
backend and data storage levels, formed the framework for these
implementations. The combination of all these elements resulted in a coherent
and functional human resource management system that fulfils the specific
requirements of the software industry. Each of these models contributed in its
own way to the realization of the prototype and made it possible to develop a
comprehensive and integrated system that could be tested and refined in the
subsequent evaluation phase. At the use case level, the use cases that lie on the critical path of the application were implemented. Following the development of
prototypes based on our theoretical models, we are now progressing to the
evaluation phase of our research. This phase will allow us to assess the
appropriateness of the implemented solutions. The evaluation will be conducted
through expert interviews and application tests to ensure a comprehensive
assessment of the practicability of our research results.
15.5 Preliminary evaluation results
When it comes to the methodological approach of evaluation, different
categories of evaluation exist, including formative [→41, p. 19], quantitative
[→41, p. 15], and qualitative [→41, p. 15] approaches. Quantitative and
qualitative evaluations focus on aspects such as robustness, freedom from
errors, runtime behavior, and usability of a software system. These methods are
suitable for a more mature software version. Formative evaluation, on the other
hand, can be applied at an earlier stage in the development of a software
system. The prototypical implementation considered in this chapter is subject to a proof-of-concept evaluation, for which formative evaluation is particularly suitable. The
aim of this evaluation is to check whether the implementation meets the
expectations of the users. Key questions include whether the functionality of the
application is complete and useful and whether there are functionalities that the
user does not find or does not need. One method of formative evaluation is the
cognitive walkthrough [→42]. Here, the application is mentally played through,
focusing on the potential tasks of a user. There are various approaches to a
cognitive walkthrough, including the empirical study and the cognitive
walkthrough with expert users. In the empirical study, representative user
groups are interviewed, while in the expert cognitive walkthrough, domain
experts are interviewed individually or gathered in an expert roundtable. For the
current evaluation, the cognitive walkthrough with expert users was chosen.
Experts from different domains, such as process management, human
resources, project management, and informatics, were asked in interviews about
the usability of the application. For the research goals 1.4 and 2.4, interviews
were conducted with six experts from these disciplines to explore different aspects
of the prototype implementation. First, for the evaluation of the KPP integration,
the use cases that form the core functionality of the system were selected based
on a critical path analysis. These use cases are essential to identify and
understand specific functional areas and requirements of the software and serve
as the basis for the subsequent cognitive walkthrough, for example, the use case
UC1: modeling processes. This is about the software’s ability to design and
represent processes. This is fundamental to visualize and structure work
processes within the system. Another example is the use case UC2: annotate
capabilities. Here, specific skills are assigned to specific processes or tasks. This
makes it possible to define and understand the requirements and qualifications
for specific activities within a process. A knowledge-based software development
process was introduced for evaluation purposes too. This process was described
in a diploma thesis entitled “Compatibility and Extensibility of WPIM with
Methods, Models and Processes in Software Development Projects” by [Baue12].
The process presented there was used as a basis for the usability assessment in this chapter and adapted in parts for the KPP integration.
A central element is the introduction of a new main process in KPP, which
was modeled with the integrated BPMN process editor [→43]. This knowledge-
based software development process refers to the V-Modell XT [→44], a widely
used process model for the development of software projects. The V-Modell XT
serves as a methodological framework and provides clear phases and steps for
software development. →Figure 15.17 visualizes this knowledge-based software
development process, based on the V-Modell XT. This gives users a clear idea of
the different stages and aspects of the process.

Figure 15.17: The process in the process editor [→8].

The focus of this walkthrough was on the use case “UC2: annotate capabilities,”
a core functionality of the system. This is a central process, where users can add
specific capabilities to processes or tasks and define what experience level is
needed for these capabilities. At the beginning of the walkthrough, the user
starts the KPP application and performs a login process. After a successful login,
the user navigates to the “query library,” a tool that provides an overview of the
different query options. Another important area is the process overview, where
users can create and edit processes and annotate the associated capabilities.
The next step is to use the process editor. Here, users can select a specific
process, view detailed information about it, and add the necessary capabilities.
This process is interactive and allows users to customize the capabilities as
needed. A special focus is on the integration of KPP’s “query library.” Here, users
can run specific queries, e.g., to identify occupations that require certain ICT
skills. In this way, suitable human resource assignments can be performed, e.g., during production process planning. Furthermore, data mining can support the identification of human resources who need upskilling or re-skilling in order to be available for optimized human resource assignments, which allows process management to be optimized on the one hand and human resource development on the other. After selecting and running a
query, the results are clearly presented to the user. Finally, the KPP integration
was presented. In this section, users can view occupations and their associated
skills. By selecting a specific occupation, the user can filter and view the specific
skills associated with that occupation.
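A query of the kind mentioned above could, for instance, look like the following sketch; the node labels, property names, and the filter on the skill label are illustrative assumptions about the graph schema rather than an exact entry of the KPP query library, whose queries are likewise formulated in Cypher.

```python
from neo4j import GraphDatabase

# Illustrative Cypher query: occupations that require a skill whose label mentions ICT
QUERY = """
MATCH (o:Occupation)-[:HAS_SKILL]->(s:Skill)
WHERE toLower(s.label) CONTAINS 'ict'
RETURN o.label AS occupation, collect(s.label) AS ict_skills
ORDER BY occupation
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver, \
        driver.session() as session:
    for record in session.run(QUERY):
        print(record["occupation"], "->", record["ict_skills"])
```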
Expert evaluation was based on criteria such as the helpfulness,
comprehensibility, usability, and completeness of the respective functions.
Among the many scenarios and use cases evaluated, the “home – overview,” for
example, was consistently rated positively in all scenarios. In contrast, the
“processes – editor – annotate capabilities” function was found to be helpful and
understandable but did not score fully in other categories. The expert feedback
showed that the scenarios presented were generally well supported. However,
one standout issue was the use of the “Cypher” language for queries, which was
seen as a potential barrier to access. Nevertheless, the graphical annotation of
capabilities in a process editor was highlighted and appreciated by almost all
experts. However, they wished for better support in selecting capabilities from
an extensive list to optimize the user experience. In addition to the existing
features, the experts gave indications of other desired functions. These included
the ability to switch between different languages, specific career planning
features, and options to export search results in standard formats such as RDF.
Another interesting suggestion was the introduction of a guided mode for
related work steps, also called “wizard mode.”
In Section 15.5, the implemented models for the human resource
management, development, and optimization system were evaluated through
six cognitive walkthroughs, engaging experts from domains such as process
management, human resources, project management, and informatics. This
evaluation aimed to assess the integration and usability of the FPHR ontology
within the KPP system. Key use cases, notably “modeling processes” and
“annotating capabilities,” were scrutinized to evaluate the system’s effectiveness
in structuring work processes and assigning skills to tasks. The usability of the
query library and the process editor was thoroughly assessed, collecting
feedback on the system’s overall functionality and user interface design. Experts
appreciated the system’s strengths, especially the integration of the FPHR
ontology and the graphical representation of capabilities. However, they also
pointed out areas needing improvement, such as the use of the “Cypher”
language for queries and the necessity for more intuitive selection mechanisms
for capabilities from extensive lists. Based on the feedback from the evaluation,
certain enhancements have already been implemented. These include the
creation of additional queries and their incorporation into the query library,
thereby enriching the system’s capabilities and making it more user-friendly.
Other suggested improvements, such as integrating functionalities directly into
the user interface for a more seamless user experience, are considered future
work. Overall, the evaluation provided valuable insights into the system’s
usability, identified areas for improvement, and confirmed the effective
integration of the FPHR ontology into the KPP system.

15.6 Summary and outlook


In this research, we have underscored the necessity for an FPHR ontology and an
HRQP ontology within the KPP system. This endeavor directly addresses the
challenges of semantic process representation and the automated utilization of
human resources in the face of industrial process-based big data, which
inundates modern production planning, including human resource
management, mining, development, and optimization. The implementation of
the FPHR ontology, validated through proof-of-concept and preliminary
evaluations, has enhanced KPP’s capabilities, offering a robust response to the
intricacies of industrial process-based big data by semantically enriching human
resource management, mining, development, and optimization functions and
qualifications. This advancement not only streamlines decision-making
processes but also optimizes human resource management in the context of
Industry 4.0.
Therefore, in Section 15.1, we established the necessity for a functional and
process-oriented human resource (FPHR) ontology and an HRQP ontology to
improve the KPP system, addressing the lack of semantic process
representation and automated utilization of human resources in production
planning.
Section 15.2 provided an extensive state-of-the-art review, detailing
foundational technologies and concepts crucial for the project. It emphasized
the importance of semantic web technologies and the role of classifications,
taxonomies, and particularly the ESCO ontology, in managing skills and
qualifications.
Section 15.3 outlined the design and conceptual modeling process, utilizing
the user-centered system design methodology. It focused on defining the
application domain, user needs, and the essential components of the system,
paving the way for the development of the FPHR and HRQP ontologies.
Section 15.4 detailed the proof-of-concept implementation, demonstrating
the practical realization of the theoretical models. The FPHR ontology was
methodically integrated into the KPP system, enhancing its capabilities to
encompass human resource functions and qualifications. The implementation
process, anchored in the Nunamaker research method, utilized a combination of
technologies and addressed user requirements by refining the system’s
backend, frontend, and database layers.
Section 15.5 presented the preliminary evaluation results, employing
formative evaluation and cognitive walkthrough methods to assess the usability
and integration of the newly developed ontologies into the KPP system. Experts
from various domains provided critical feedback on use cases. The section
underlined the significance of expert feedback in refining the system and
confirmed the effective integration of the FPHR ontology into the KPP system.

Figure 15.18: Workforce and employment: reskilling from [→20].

Looking forward, the prototype presents opportunities for further development,
especially in refining functionalities to meet evolving digitalization challenges.
The broader applicability of the FPHR ontology, as demonstrated, extends
beyond this work, offering solutions to industries grappling with the Fourth
Industrial Revolution. Our alignment with global re-skilling initiatives (→Figure
15.18), as highlighted by the World Economic Forum [→45], signifies our
contribution toward addressing workforce retraining needs and emphasizes the
potential of our ontologies for navigating the big data landscape effectively.

References
[1] Cordis, Sustaining Heritage Access through Multivalent ArchiviNg | SHAMAN
Project | Fact Sheet | FP7 | CORDIS | European Commission. [Online] Available:
→https://cordis.europa.eu/project/id/216736/de. Accessed on: May 12 2023. →
[2] T. Vogel, Wissensbasiertes und Prozessorientiertes Innovationsmanagement
WPIM: Innovationsszenarien, Anforderungen, Modell und Methode,
Implementierung und Evaluierung anhand der Innovationsfähigkeit fertigender
Unternehmen. Dissertation, Hagen: FernUniversität Hagen, 2012. a, b, c
[3] Cordis, Collaborative and Adaptive Process Planning for Sustainable
Manufacturing Environments: CAPP-4-SMES Project. [Online] Available:
→https://cordis.europa.eu/project/id/314024. Accessed on: Mar. 21 2023. →
[4] B. Gernhardt, A semantic representation for process-oriented knowledge
management based on function block domain models supporting distributed
and collaborative production planning. 2015. a, b
[5] B. Gernhardt, Knowledge-based production planning within the reference
planning process supporting manufacturing change management, 2016. a, b
[6] B. Gernhardt, T. Vogel, L. Wang, and M. Hemmje, “Implementation of a
Knowledge-Based Production Planning Including a Direct Manipulative Process
Editor and a Mediator Architecture,” 2017. a, b
[7] EFFRA, CAPP-4-SMES Collaborative and Adaptive Process Planning for
Sustainable Manufacturing Environments | EFFRA Innovation Portal:
‎Kollaborative und adaptive Prozessplanung für nachhaltige
Fertigungsumgebungen‎. [Online] Available:
→https://portal.effra.eu/project/1009. Accessed on: Jan. 26 2022. a, b
[8] A. Vogler, Qualifikations-ontologie-erstellung und ontologie-integration
sowie exemplarische evaluation an einem wissensbasierten software-
entwicklungs-prozess. In: Masterarbeit, Multimedia und Internetanwendungen,
Hagen: FernUniversität Hagen, 2022. a, b, c, d, e, f, g, h, i, j, k
[9] J. F. Nunamaker Jr., M. Chen, and T. D. M. Purdin, Systems Development in
Information Systems Research, 1990. →
[10] W3C, RDF 1.1 XML Syntax. [Online] Available: →https://www.w3.org/TR/rdf-
syntax-grammar/. Accessed on: Jan. 15 2024. →
[11] ISO, ISO 8879:1986. [Online] Available:
→https://www.iso.org/standard/16387.html. Accessed on: Jan. 15 2024. →
[12] W3C, W3C. [Online] Available: →https://www.w3.org/. Accessed on: Jan. 15
2024. →
[13] M. Hemmje, 01877 KE3: Einführung in XML, DTD und XSD: 01877 Daten-,
Dokumenten-, Informations- und Wissensmanagement im Internet, Hagen:
Kurseinheit, FernUniversität Hagen, 2021. →
[14] W3C, RDF – Semantic Web Standards. [Online] Available:
→https://www.w3.org/RDF/. Accessed on: Jan. 09 2022. →
[15] M. Hemmje, Kurseinheit 3: Konzepte von XML: 1873 Daten- und
Dokumentenmanagement im Internet, Hagen: Kurseinheit, FernUniversität
Hagen, 2018. →
[16] W3C, RDF Schema 1.1. [Online] Available: →https://www.w3.org/TR/rdf-
schema/. Accessed on: Jan. 09 2022. →
[17] W3C, SPARQL Query Language for RDF. [Online] Available:
→https://www.w3.org/TR/rdf-sparql-query/. Accessed on: Jan. 09 2022. a, b
[18] ISO, ISO/IEC 9075-1:2023(en), Information technology – Database
languages SQL – Part 1: Framework (SQL/Framework). [Online] Available:
→https://www.iso.org/obp/ui/en/#iso:std:iso-iec:9075:-1:ed-6:v1:en. Accessed
on: Jan. 15 2024. →
[19] W3C, OWL Web Ontology Language Overview. [Online] Available:
→https://www.w3.org/TR/2004/REC-owl-features-20040210/. Accessed on: Jan.
09 2022. a, b
[20] R. Kuhlen, Informationsethik: Umgang mit Wissen und Information in
elektronischen Räumen. Konstanz: UVK Verl.-Ges, 2004. a, b, c
[21] H. Sack, 3.2 Ontology Types. →
[22] ISCO, ISCO – International Standard Classification of Occupations. [Online]
Available: →https://www.ilo.org/public/english/bureau/stat/isco/. Accessed on:
Jan. 15 2024. →
[23] National Occupational Classification (NOC) 2021 Version 1.0. [Online]
Available: →https://www.statcan.gc.ca/en/subjects/standard/noc/2021/indexV1.
Accessed on: Jan. 15 2024. →
[24] DISCO II Portal. [Online] Available: →http://disco-tools.eu/disco2_portal/.
Accessed on: Jan. 15 2024. →
[25] LDCI, About – Linked Data Competency Index. [Online] Available:
→https://dcmi.github.io/ldci/. Accessed on: Jan. 15 2024. →
[26] Europäische Kommission, European e-Competence Framework (e-CF).
[Online] Available: →https://esco.ec.europa.eu/en/about-
esco/escopedia/escopedia/european-e-competence-framework-e-cf. Accessed
on: Jan. 15 2024. →
[27] Europäische Kommission, European Digital Competence Framework for
Citizens (DigComp). [Online] Available: →https://esco.ec.europa.eu/en/about-
esco/escopedia/escopedia/european-digital-competence-framework-citizens-
digcomp. Accessed on: Jan. 15 2024. →
[28] Onetcenter.org, O✶NET-SOC Taxonomy at O✶NET Resource Center. [Online]
Available: →https://www.onetcenter.org/taxonomy.html. Accessed on: Mar. 10
2022. →
[29] Europäische Kommission, The ESCO ontology. [Online] Available:
→https://webgate.acceptance.ec.europa.eu/esco/lod/static/model.html.
Accessed on: Jan. 10 2022. →
[30] M. Then, Supporting Qualifications-Based Learning (QBL) in a Higher
Education Institution’s IT-Infrastructure, Hagen: FernUniversität 2020. a, b
[31] M. Then, Supporting Qualifications-Based Learning (QBL) in a Higher
Education Institution’s IT-Infrastructure, MyCoRe Community, 2020. a, b
[32] OMG, UML Resources | Unified Modeling Language. [Online] Available:
→https://www.uml.org/resource-hub.htm. Accessed on: Nov. 07 2023. →
[33] Apache Software Foundation, The Apache HTTP Server Project. [Online]
Available: →https://httpd.apache.org/. Accessed on: Jan. 09 2022. →
[34] Apache Software Foundation, Apache Tomcat. [Online] Available:
→https://tomcat.apache.org/. Accessed on: Jan. 09 2022. →
[35] MDN Web Docs, JavaScript | MDN. [Online] Available:
→https://developer.mozilla.org/en-US/docs/Web/JavaScript. Accessed on: Jan.
16 2024. →
[36] PHP Group, PHP: Hypertext Preprocessor. [Online] Available:
→https://www.php.net/. Accessed on: Jan. 16 2024. →
[37] Graph Database & Analytics, Neo4j Graph Database. [Online] Available:
→https://neo4j.com/product/neo4j-graph-database/. Accessed on: Jan. 16
2024. →
[38] Oracle, MySQL. [Online] Available: →https://www.mysql.com/de/. Accessed
on: Jan. 16 2024. →
[39] D. Garijo, “WIDOCO: A Wizard for Documenting Ontologies,” University of
Southern California, 2017. →
[40] Stanford University, Protege Wiki. [Online] Available:
→https://protegewiki.stanford.edu/wiki/Main_Page. Accessed on: Jan. 16 2024.

[41] Qualitative Evaluation in sieben Schritten. In: Qualitative Evaluation: Der
Einstieg in die Praxis, U. Kuckartz (Ed.), 2nd edition, pp. 15–57, Wiesbaden: VS
Verl. für Sozialwiss., 2008. a, b, c
[42] PII: 0020-7373(92)90039-N, →
[43] J. Kossick, Reimplementierung, Erweiterung und exemplarische Evaluation
einer verteilten und kollaborativen Unterstützung für die Produktionsplanung,
Hagen: Bachelorarbeit, FernUniversität Hagen, 2016. →
[44] BMI, V-Modell XT Kurz und Knackig – Ein Überblick, Der Beauftragte der
Bundesregierung für Informationstechnik, 02 Nov., 2023,
→https://www.cio.bund.de/Webs/CIO/DE/digitaler-
wandel/Achitekturen_und_Standards/V_modell_xt/V_modell_xt_ueberblick/v_mo
dell_xt_ueberblick_artikel.html. →
[45] World Economic Forum, How digital transformation is driving economic
change. [Online] Available:
→https://www.weforum.org/agenda/2022/01/digital-transformation-economic-
change-technology/. Accessed on: Jan. 28 2022. →
De Gruyter Series Intelligent Computing
ISSN
e-ISSN
Leonidas Deligiannidis, George Dimitoglou, Hamid R. Arabnia
(Eds.)
Artificial Intelligence: Machine Learning, Convolutional Neural
Networks and Large Language Models, 2024
ISBN 978-3-11-054816-7, e-ISBN (PDF) 978-3-11-134412-6, e-ISBN
(EPUB) 978-3-11-134417-1
George Dimitoglou, Leonidas Deligiannidis, Hamid R. Arabnia
(Eds.)
Big Data, Data Mining and Data Science: Algorithms,
Infrastructures, Management and Security,
ISBN 978-3-11-134407-2, e-ISBN 978-3-11-134455-3, e-ISBN
(EPUB) 978-3-11-134458-4
George Dimitoglou, Leonidas Deligiannidis, Hamid R. Arabnia
(Eds.)
Cybersecurity: Cyber Defense, Privacy and Cyber Warfare,
planned 2024
ISBN 978-3-11-143641-8, e-ISBN 978-3-11-143654-8, e-ISBN
(EPUB) 978-3-11-143656-2
Leonidas Deligiannidis, George Dimitoglou, Hamid R. Arabnia
(Eds.)
Imaging Science: Computer Vision, Image and Signal Processing,
Pattern Recognition, planned 2024
ISBN 978-3-11-143638-8, e-ISBN 978-3-11-143642-5, e-ISBN
(EPUB) 978-3-11-143657-9

Index
0
2DM files 1
3D models 1

A
abelian group 1
ablation studies 1
abnormal data 1
acceleration 1, 2
accelerometer 1
accuracy 1, 2, 3, 4, 5, 6, 7, 8
accuracy rate 1, 2, 3, 4
adaptive composition 1
adaptive exploration 1
adaptive learning 1
adjacency matrix 1, 2, 3, 4
AI Explainability 360 (AIX360) 1
Akaike information criterion (AIC) 1
algorithm selection 1
anomaly detection 1, 2, 3, 4
anomaly identification 1
Applied Gaming (AG) 1
approximate inference 1, 2, 3
area under the roc curve (AUC) 1
arity 1, 2, 3, 4, 5, 6
artificial intelligence (AI) 1
ASD 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
assortative networks 1
assortativity index 1, 2, 3, 4, 5, 6, 7, 8, 9
attention mechanism 1
attributes 1, 2, 3, 4, 5
audio 1, 2, 3
audio classification 1
authoring tool 1, 2, 3, 4, 5
Authoring Tool 1
autism spectrum disorder 1
AutoGluon 1
automated machine learning (AutoML) 1
automated target recognition 1, 2, 3
automatic dimensionality reduction 1
AutoML 1
AutoML model 1
autonomous 1
Auto-Sklearn 1
Auto-WEKA 1
Azure Databricks Spark 1

B
backward SFS 1
bagging 1, 2, 3
Bag of Words 1
Bahadur representation 1
ball-bearing 1
bandpass filter 1, 2
batch processing 1
bearing faults 1, 2, 3, 4
Bellman principle 1
Betti curves 1
betweenness (BWC) 1
bidirectional encoder representations from transformers 1
binary classification problems 1
biomarkers 1, 2
bipartite graph 1, 2
bipolar 1
bisect-K-means 1, 2
bond market 1, 2, 3
boosting 1, 2, 3
Brandes’ algorithm 1
breadth-first search algorithm 1
breakpoints 1, 2, 3, 4
B-trees 1
burst of economic bubble 1
business cycles 1

C
C++ 1, 2, 3, 4
caching 1
Calinski and Harabasz index 1
canonical correlation 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72
canonical correlation analysis (CCA) 1, 2, 3
Case Western Reserve University (CWRU) bearing data set 1
categorical encoding 1
centering 1
centrality metrics 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97
Centralized File System (CFS) 1
centroid frequency 1, 2
centroids 1, 2
chaos 1, 2, 3
ChatGPT 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38
chat tool 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Chebyshev distance 1
chromatin remodeling 1
chromosome band location 1, 2
chromosome location 1, 2, 3, 4
classical nonlinear dynamical systems 1
classification 1, 2
classification matrix 1
classifications 1, 2, 3
classification system 1, 2
class imbalance 1
closeness (CLC) 1
CLUSEQ 1
clustering 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21
clustering methods 1
clustering quality 1, 2, 3, 4, 5, 6, 7, 8, 9
cluster validity 1, 2, 3, 4
cluster validity measure 1, 2
cognitive walkthrough 1, 2
cohesion 1
cold-start problems 1
commodity market 1, 2, 3
common variants 1, 2
communality score 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
compactness 1, 2
Competence and Qualification (CQ) 1
complex network analysis 1, 2, 3
composite objective function 1
computational costs 1
concept drift 1, 2, 3, 4, 5
condition information 1
constrained-remaining component decomposition 1, 2
consumers 1, 2, 3
content management environment 1
content producers 1, 2, 3, 4, 5
content providers 1, 2, 3
controlled access 1
copy number variants 1
copyright difficulties 1
correlation analysis 1
correlation coefficient 1, 2
counterfactual explanations 1
covariance matrix 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23
COVID-19 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
crude oil prices 1
CUDA 1
currency trends 1
cyclical variations 1, 2
cyclostationary noise 1
cyclostationary signal 1
D
data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49
data analysis 1, 2
data cleaning 1
data combining 1
data deduplication 1
data integration 1
data lake 1, 2, 3, 4, 5
data mining 1, 2, 3, 4
data model 1, 2, 3, 4
data normalization 1
data objects 1, 2, 3
data observation vectors 1, 2
data preprocessing 1
data reduction 1
data reservoir 1, 2, 3, 4, 5, 6, 7, 8
Datarobot 1
data scaling 1
data schema 1
data similarity 1, 2
data transformation 1
decimal scaling 1
decision fusion method 1, 2
decision trees 1
decomposition technique 1
deep learning 1, 2, 3, 4, 5
degree (DEG) 1
degree of density 1
de novo mutation 1, 2
de novo variants 1
derived imagery 1
descriptive metadata 1, 2
DGL 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
dimensionality curse 1
dimensionality reduction 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
discretization 1, 2, 3
dissortative networks 1
distance 1, 2, 3, 4
distance-based methods 1
distance function 1
distinct topological structures 1
distributed stochastic neighbor embedding (t-SNE) 1
distributed Thompson sampling 1
distribution optimization 1
distribution variance 1
DNA 1, 2
domain expertise 1
dominating factor 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Dow Jones Industrial Average 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14
DTD 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
dynamic advantage 1, 2, 3, 4, 5, 6
dynamic ensemble 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18
dynamic modeling 1
dynamic near real-time pricing 1
dynamic price elasticity computation 1, 2, 3
Dyna-SPECTS 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16

E
EdgeNode 1
Efficiently Learning an Encoder that Classifies Token
Replacements Accurately (ELECTRA) 1
eigenvalue 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
eigenvector 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26
ELECTRA 1
embedding techniques 1, 2, 3
energy market 1, 2
energy price volatility 1
ensemble 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37
Ensemble 1
Entity model 1
entity set 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
epochs of length 1, 2
epsilon greedy algorithm 1, 2
ER model 1, 2, 3, 4
ESCO ontology 1, 2, 3, 4, 5, 6, 7, 8, 9
estimating outliers 1, 2
Euclidean distance 1, 2
European stocks 1
European Union 1
evaluation metrics 1, 2
exceedances 1, 2, 3, 4, 5, 6, 7
Exchangeable Image File Format (EXIF) 1
experimentation goals (EG) 1
expert evaluation 1
explainability methods 1
explainable artificial intelligence (XAI) 1, 2, 3, 4
eXplainable artificial intelligence (XAI) 1
exploratory factor analysis 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
exponentiation 1
Extensible Markup Language (XML) 1
extract information from newspapers 1, 2

F
false positive detections 1
fault analysis 1, 2
fault depths 1, 2, 3, 4, 5
faults 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
feature engineering 1, 2
feature extraction 1, 2, 3, 4, 5, 6, 7, 8
feature matrix 1, 2, 3, 4, 5
feature ranking 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
feature ranking algorithms 1
feature relevance 1
feature scaling 1
feature selection 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
file registration 1
final factor loadings 1, 2, 3, 4, 5, 6
financial indices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
finite strings 1
fixed composition 1, 2
forecast accuracy 1, 2
forecasting 1, 2, 3, 4
formal ontologies 1
forward SFS 1
Fourier transform 1
Fourth Industrial Revolution 1, 2, 3
Fréchet mean 1
functional components 1, 2
function approximation 1, 2
fusion method 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
G
Game of Thieves (GoT) node centrality metric 1
Gaming Analytics (GA) 1
gate delay 1, 2, 3, 4
GCN model 1, 2, 3, 4, 5
gene interaction data 1, 2, 3, 4, 5
genetic mutations 1
genetic risk 1
genome-wide significant loci 1
giotto-tda 1, 2, 3, 4
global trade tensions 1
GloVe 1
Google Vertex AI 1
graph database 1, 2, 3, 4, 5, 6, 7, 8, 9
graph neural networks 1, 2
graph sage 1, 2, 3, 4
Great East Japan Earthquake 1, 2, 3, 4, 5, 6
group membership 1

H
H2O DriverlessAI 1
Handbook of Clinical Neurology 1
Headers 1
high-dimensional data 1, 2, 3, 4
higher education 1, 2, 3, 4, 5
high-performance computing systems 1
high-performance table generation 1
high-risk gene variants 1
Hilbert transform 1
homology group 1, 2, 3, 4, 5, 6
housing market 1, 2, 3, 4, 5
HPI 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
human resource assignment 1, 2, 3
human resource management 1, 2, 3, 4, 5, 6, 7, 8
hypergraph 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
hypergraph database 1, 2, 3, 4, 5, 6, 7, 8, 9
hyperparameter tuning 1, 2, 3
hypervertices 1

I
IBM Watson AutoAI 1
IKMEANS 1
ILE 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29
implementation 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
incremental updates 1
independent component analysis 1
index of industrial production (IIP) 1
index of symmetry and uniformity 1, 2
industrial process management 1, 2
Industry 4.0 1, 2, 3, 4, 5
inference logging 1
infinite strings 1
information extraction 1
information gain 1
information model 1, 2, 3, 4, 5, 6, 7, 8
informedness 1, 2, 3, 4, 5
ingestible 1, 2, 3, 4, 5, 6, 7, 8
ingestible metadata 1, 2, 3, 4
inheritance 1
inner raceway 1
Integrated Rule-Oriented Data System (iRODS) 1, 2, 3
integration 1
inter-cluster 1, 2, 3
international stock price collapse 1
international trade frictions 1
interpretable ML models 1

J
Japanese Cabinet Office 1
Japanese economy 1
Japanese stocks 1, 2, 3, 4, 5
JavaScript 1, 2
Jensen-Shannon divergence metric 1, 2
Joint Photographic Experts Group (JPEG) 1
JSON 1, 2, 3, 4, 5, 6, 7

K
key competency questions 1
k-means 1
K-means clustering 1, 2, 3, 4
KNN model 1
knowledge-based and process-oriented innovation
management 1
knowledge-based software development process 1, 2, 3
Knowledge Management Ecosystem Portal (KM-EP) 1, 2, 3
knowledge representation 1, 2
KPP integration 1, 2, 3
kurtogram method 1
kurtosis 1, 2, 3
L
labeled datasets 1
label encoding 1
labels 1, 2, 3, 4, 5
large-scale events 1
large-scale language model 1
large-scale storage 1
large-scale tables 1
late fusion 1
leakage ratio coefficients 1
Leakage ratio coefficients 1
learning analytics 1
learning Analytics 1
Learning Management System (LMS) 1
least squares fitting 1
least squares method 1
lemmatization 1, 2, 3
Light Detection and Ranging (LiDAR) data sets 1
likelihood function 1
linear discriminant analysis (LDA) 1, 2
linear transformation 1
locality metadata 1
local linear model of time 1
logarithmic transformation 1
logic level circuits 1
logit model 1
log-likelihood 1, 2
long-tailed products (LTP) 1, 2
Lyapunov exponents 1, 2
M
machine learning 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
machine learning algorithms 1, 2
machine learning models 1
machine-readable 1
macroeconomic 1, 2, 3, 4, 5, 6
macros 1
Mainichi Shimbun 1, 2, 3
major epidemics 1
major financial crisis 1
Manhattan distance 1
matrix factorization 1
maximum likelihood AR modeling methods 1
maximum likelihood estimate 1
maximum likelihood method 1, 2
MCMC sampling 1, 2, 3, 4
mean amplitudes 1
mean imputation 1
memory-efficient data structures 1
metadata 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69
metadata assignment 1, 2, 3
metadata queries 1, 2
Metadata Schema Reservoir (MeScheR) 1
meta-learning pipeline 1
metric normalization 1
Microsoft Azure AutoML 1
min–max scaling 1
missing value imputation 1
missing values 1, 2
MLjar 1
ML pipeline 1
model-agnostic 1, 2
model selection 1
Monte Carlo experiment 1, 2, 3
moving linear model 1
moving linear model approach 1
multiarmed bandits 1, 2
multiclass problems 1
multigraphs 1
multilayer perceptron (MLP) 1
multiobjective optimization 1, 2
multiple imputation 1
MySQL 1, 2, 3
MySQL database 1

N
Named Entity Recognition 1
natural disasters 1, 2, 3, 4, 5
natural language processing 1, 2, 3, 4
nearest neighbors 1
neighborhood-based centrality metrics 1, 2, 3, 4, 5, 6, 7, 8, 9
Neo4j 1, 2, 3, 4, 5, 6, 7
Neo4j Graph Database 1
NEO4J Graph Database 1
network-based machine learning (NBML) 1
Network model 1
neural contextual bandit algorithm 1
neural networks 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22
neurodevelopment 1
New York Stock Exchange 1, 2, 3, 4, 5
Nikkei Stock Average 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41
nodes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
noise 1, 2
noise filtering 1
noisy 1
nondominating factor 1, 2
nonkey attribute names 1
nonlinear dynamics 1
nonlinear feature transformations 1
nonstationary data 1, 2
nonstationary mean 1
nonstationary time series data 1
non-terminal 1, 2, 3, 4, 5, 6, 7
numerical instability 1
numerical modeling 1

O
one-hot encoding 1
one-to-many relationship 1
online learning 1, 2, 3, 4, 5, 6, 7
optimal number of components 1
optimal window size 1, 2, 3, 4
optimizing prices 1
orthogonal space 1
outer raceway 1
outlier detection 1, 2, 3
outlier removal 1
outliers 1, 2, 3, 4
overfitting 1, 2
oversampling 1, 2

P
PAGEL simulation 1, 2, 3, 4, 5
PageRank 1, 2
Pandas 1
pandemic 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19
parallelization 1, 2
parameterization 1
passband 1
Pearson’s correlation coefficient 1, 2, 3, 4, 5
persistence diagrams 1, 2, 3, 4, 5
persistence landscape 1
persistence silhouettes 1
persistent homology 1, 2, 3, 4
persistent storage 1
physical modeling data 1
pipeline 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
Policy Enforcement Point (PEP) 1
positive sentiment 1
posterior approximations 1
posterior distributions 1, 2
power-assisted steering systems 1, 2, 3
power-iteration algorithm 1
p–q projection 1, 2, 3, 4, 5, 6, 7
preprocessing 1, 2
pre-trained language model 1
price elasticity (PE) 1, 2, 3, 4, 5
price personalization 1, 2, 3
principal component analysis (PCA) 1
prior distributions 1
probability distribution 1
probit model 1
process-associated knowledge 1
processes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
process-oriented resource assignment 1
production 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52
professions 1, 2
proof-of-concept evaluation 1
Protégé 1
protein–protein interaction (PPI) 1
Protocol Enforcement Point (PEP) 1
Python 1, 2, 3, 4, 5, 6, 7, 8, 9

Q
Qualification-Based Learning Model (QBLM) 1
qualifications 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
query library 1, 2, 3, 4

R
random forest 1, 2, 3
randomness index 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
rating scheme 1, 2
Ray and Turi 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
RDF Query Language 1
RDF Schema (RDFS) 1
recommender systems 1
Recurrent Neural Networks 1
recursive feature elimination (RFE) 1
recursive grammar 1
reduction 1
regression 1
regression imputation 1
regression models 1, 2
Reinforcement Learning policy 1
Relational model 1, 2
relationship attributes 1, 2
relationship objects 1
relationship set 1, 2, 3, 4, 5, 6
ReliefF 1, 2, 3, 4, 5, 6, 7
repetition count 1
resource description framework (RDF) 1
resource efficiency 1
revenue maximization objective 1
reversibility 1
risk ranking 1
road surface classification 1, 2, 3, 4, 5, 6
RoBERTa 1
robustness checks 1
roles 1, 2, 3, 4, 5, 6
root element 1, 2, 3, 4
Ruby 1, 2, 3, 4, 5, 6, 7, 8, 9
run-to-failure bearing experiments 1

S
SalesForce TransmogrifAI 1
sampled imagery 1, 2
SAX embedding model 1, 2, 3
scaling 1
scene 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Scene 1, 2, 3
Sector-specific overheating 1
secure administration tools 1
seismic signals 1
semantic comparison 1, 2, 3, 4
semantic expressiveness 1
semantic knowledge representation 1
semantic similarity 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
sensitivity 1, 2, 3, 4, 5, 6, 7, 8, 9
sensors 1, 2, 3, 4, 5, 6
sentiment analysis 1
separated score integration 1
seq2seq methods 1
sequential feature selection (SFS) 1
sequential training 1
Sfari dataset 1, 2
SHAMAN 1
Shanghai Composite Index 1, 2, 3, 4
Shapley Additive exPlanations 1
shortest path-based centrality metrics 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11
simple classifiers 1
simple taxonomies 1
simplexes 1, 2
simplicial complex 1, 2, 3, 4
single time series 1
singular cluster 1
skill-centered qualification ontology 1, 2, 3
skills 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26
Skills 1, 2, 3
solution scoring system 1
spaCy 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
SPARQL Protocol 1
specificity 1, 2, 3, 4, 5, 6, 7, 8
specifier data 1
spectral contrast 1, 2
spectral features 1
spectral flatness 1, 2
spectral radius ratio 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
spectral slope 1, 2
spectral values 1
SPECTS model 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
stacking 1
Standard & Poor 1
Standard Generalized Markup Language (SGML) 1
static ensemble 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
stationary structures 1
stationary time series 1
statistic-domain features 1
steering wheel signals 1
stochastic PCA 1
stock market 1, 2
stock prices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19
string matching 1
strongly negative canonical correlated networks 1
strongly positive canonical correlated networks 1
structured query language (SQL) 1
subgrammar 1, 2
subsampling rate 1
subset selection 1, 2, 3
Subset selection 1
subsurface effects 1
sum of squared distances (SSD) criterion 1
sum of squares 1
supervised learning 1
support vector machine 1, 2, 3
surface fluxes 1
Symbolic Aggregation approXimation (SAX) 1
symmetry 1
synaptic function 1, 2
synonymity 1
synthetic imagery 1
synthetic images 1, 2, 3, 4
synthetic scene generation 1
systems-development goals (SG) 1

T
table arrangement technology 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
tag 1
tagging 1, 2, 3
Tag Image File Format (TIFF) 1
tags 1, 2, 3, 4, 5, 6, 7, 8, 9
Takens embedding 1, 2, 3, 4, 5, 6
Takens embedding model 1, 2
target encoding 1
task simulation 1, 2
TDA 1, 2, 3, 4, 5, 6, 7, 8, 9
temporal data 1
test set 1, 2, 3
textual form 1
theory-building goals (TG) 1
third-order autocorrelation 1
Thompson sampling 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Thompson Sampling 1, 2, 3
time reversibility 1
time series data 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
tokenization 1, 2, 3, 4
Tokyo Stock Exchange 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17
Tokyo stock market 1, 2, 3, 4, 5, 6
topological data analysis (TDA) 1, 2
topological methods 1
total market index 1
TPOT 1
trade agreements 1
trading partners 1, 2
traditional pricing strategies 1, 2, 3, 4
trained model 1, 2
training set 1, 2
transfer learning 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
transformation 1
transformer 1, 2, 3, 4
transformer architecture 1
transition block 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
transportation network 1
transportation networks 1

U
unfiltered data 1
uniformity 1
unit variance 1
universal language 1
unsupervised learning 1
use case 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
user-determined threshold 1
user interfaces 1, 2, 3, 4

V
varimax rotation 1, 2, 3, 4, 5, 6, 7, 8
vector of frequencies 1
vector space 1, 2, 3, 4, 5
vector space model 1
vehicle speed 1, 2, 3
verification 1
vibration 1, 2
vibration feature extraction 1, 2
vibration sensor data 1
Vietoris-Rips method 1
virtual file system 1, 2
virtual learning environment (VLE) 1
visibility graph 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
V-Modell XT 1

W
Wasserstein gradient flows 1
Wassertein distances 1
Wavefront OBJ files 1
Web Ontology Language (OWL) 1
Web page ranking 1
weighted MAPE 1, 2, 3, 4
weighted stochastic gradient descent 1
width of the time interval (WTI) 1
window sizes 1
wizard mode 1
Word2Vec 1
word vector 1
worm-gear interface configuration 1, 2

X
xcdata 1, 2
XGBoost 1, 2, 3, 4, 5, 6, 7, 8
XGBoost regressor 1, 2, 3
XLNet 1
XML 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34
xmlalitem 1, 2, 3, 4
xmlattlist 1, 2
xmlcomm 1, 2, 3, 4, 5
xmlent 1, 2, 3, 4, 5, 6
XML files 1, 2
xmlpi 1, 2, 3
xstring 1, 2, 3

Z
zero mean 1
z-score normalization 1