The Essentials of Machine Learning in Finance and Accounting
— Dr. Zamir Iqbal, VP Finance and Chief Financial Officer (CFO),
Islamic Development Bank (IsDB)
“An essential resource for financial accounting managers and students of financial management.”
— Professor Mehmet Huseyin Bilgin, Istanbul Medeniyet University, Turkey
This book introduces machine learning in finance and illustrates how computational tools from numerical finance can be used in real-world contexts. These computational techniques are particularly useful in financial risk management, corporate bankruptcy prediction, stock price prediction, and portfolio management. The book also discusses the practical and managerial implications of financial and managerial decision support systems and how these systems capture vast amounts of financial data.
Business risk and uncertainty are two of the toughest challenges in the financial industry. This
book will be a useful guide to the use of machine learning in forecasting, modeling, trading, risk
management, economics, credit risk, and portfolio management.
Mohammad Zoynul Abedin is an associate professor of Finance at the Hajee Mohammad Danesh Science and Technology University, Bangladesh. Dr. Abedin publishes regularly in refereed academic journals and has served as an ad hoc reviewer for many of them. His research interests include data analytics and business intelligence.
M. Kabir Hassan is a professor of Finance at the University of New Orleans, USA. Prof. Hassan has over 350 papers (225 SCOPUS, 108 SSCI, 58 ESCI, 227 ABDC, 161 ABS) published as book chapters and in top refereed academic journals. According to an article published in the Journal of Finance, this publication record places Prof. Hassan in the top 1% of peers who continue to publish one refereed article per year over a long period of time.
Petr Hajek is currently an associate professor with the Institute of System Engineering and Informat-
ics, University of Pardubice, Czech Republic. He is the author or co-author of four books and more
than 60 articles in leading journals. His current research interests include business decision making,
soft computing, text mining, and knowledge-based systems.
Mohammed Mohi Uddin is an assistant professor of Accounting at the University of Illinois Spring-
field, USA. His primary research interests concern accountability, performance management, cor-
porate social responsibility, and accounting data analytics. Dr. Uddin has published scholarly articles in reputable academic and practitioners’ journals.
Routledge Advanced Texts in Economics and Finance
24. Empirical Development Economics
Måns Söderbom and Francis Teal with Markus Eberhardt,
Simon Quinn and Andrew Zeitlin
25. Strategic Entrepreneurial Finance
From Value Creation to Realization
Darek Klonowski
26. Computational Economics
A Concise Introduction
Oscar Afonso and Paulo Vasconcelos
27. Regional Economics, Second Edition
Roberta Capello
28. Game Theory and Exercises
Gisèle Umbhauer
29. Innovation and Technology
Business and Economics Approaches
Nikos Vernardakis
30. Behavioral Economics, Third Edition
Edward Cartwright
31. Applied Econometrics
A Practical Guide
Chung-ki Min
32. The Economics of Transition
Developing and Reforming Emerging Economies
Edited by Ichiro Iwasaki
33. Applied Spatial Statistics and Econometrics
Data Analysis in R
Edited by Katarzyna Kopczewska
34. Spatial Microeconometrics
Giuseppe Arbia, Giuseppe Espa and Diego Giuliani
35. Financial Risk Management and Derivative Instruments
Michael Dempsey
36. The Essentials of Machine Learning in Finance and Accounting
Edited by Mohammad Zoynul Abedin, M. Kabir Hassan,
Petr Hajek, and Mohammed Mohi Uddin
Edited by
Mohammad Zoynul Abedin,
M. Kabir Hassan,
Petr Hajek, and
Mohammed Mohi Uddin
First published 2021
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
605 Third Avenue, New York, NY 10158
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2021 selection and editorial matter, Mohammad Zoynul Abedin, M. Kabir Hassan, Petr Hajek, and Mohammed Mohi Uddin;
individual chapters, the contributors
The right of Mohammad Zoynul Abedin, M. Kabir Hassan, Petr Hajek, and Mohammed Mohi Uddin to be identified as the authors
of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of
the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or
other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval
system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and
explanation without intent to infringe.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record has been requested for this book
7.4 Data contextuality: machine learning-based entity analytics across the enterprise 127
7.5 Identifying Central Counterparty (CCP) risk using ABM simulations 131
7.6 Systemic risk and cloud concentration risk exposures 134
7.7 How should regulators address these challenges? 137
Notes 137
References 138
8 Prospects and challenges of using artificial intelligence in the audit process 139
EMON KALYAN CHOWDHURY
8.1 Introduction 139
8.1.1 Background and relevant aspect of auditing 140
8.2 Literature review 141
8.3 Artificial intelligence in auditing 142
8.3.1 Artificial intelligence 142
8.3.2 Use of expert systems in auditing 143
8.3.3 Use of neural network in auditing 143
8.4 Framework for including AI in auditing 143
8.4.1 Components 144
8.4.1.1 AI strategy 144
8.4.1.2 Governance 144
8.4.1.3 Human factor 144
8.4.2 Elements 145
8.4.2.1 Cyber resilience 145
8.4.2.2 AI competencies 145
8.4.2.3 Data quality 145
8.4.2.4 Data architecture and infrastructure 145
8.4.2.5 Measuring performance 145
8.4.2.6 Ethics 145
8.4.2.7 Black box 146
8.5 Transformation of the audit process 146
8.5.1 Impact of digitalization on audit quality 147
8.5.2 Impact of digitalization on audit firms 147
8.5.3 Steps to transform manual audit operations to AI-based 148
8.6 Applications of artificial intelligence in auditing – few examples 149
8.6.1 KPMG 149
8.6.2 Deloitte 149
8.6.3 PwC 149
8.6.4 Ernst and Young (EY) 150
8.6.5 K.Coe Isom 150
8.6.6 Doeren Mayhew 150
8.6.7 CohnReznick 150
8.6.8 The Association of Certified Fraud Examiners (ACFE) 150
8.7 Prospects of an AI-based audit process in Bangladesh 150
8.7.1 General aspects 151
9 Web usage analysis: pillar 3 information assessment in turbulent times 157
ANNA PILKOVA, MICHAL MUNK, PETRA BLAZEKOVA AND LUBOMIR BENKO
9.1 Introduction 157
9.2 Related work 158
9.3 Research methodology 161
9.4 Results 164
9.5 Discussion and conclusion 172
Acknowledgements 175
Disclosure statement 175
References 175
2.1 Structure of a decision tree, highlighting root node (diamond), internal nodes (circles),
branches and leaves (squares) 8
2.2 Examples of alternative splitting rules giving rise to partitions characterized by (a) high
and (b) low impurity level. Each symbol represents an individual, G indicates
green eyes and B brown eyes 10
2.4 Log-returns on ICE BofA US Corporate Index BofA: residuals on the training
sample (panel a, vertical axis), prediction errors on the validation sample
(panel b, vertical axis) over time (horizontal axis), mean decrease accuracy
(panel c), and mean decrease impurity (panel d) variable importance
analysis 26
2.5 Log-returns on ICE BofA US Corporate Index BofA analysis: two-way partial
dependence plot involving Moody’s Seasoned BAA Corporate Bond Yield and
the Ten-Year Treasury Rate (panel a), static partial dependence plots of BAA,
AAA, and GS10 (panel b), sequential partial dependence plots of BAA (panel c),
and GS10 (panel d) corresponding to the three different periods: Jun. 1973–Sep.
1976, Jan. 1998–Apr. 2001, Sep. 2014–Dec. 2017 28
2.6 Default risk analysis: variable importance following the mean decrease accuracy
(panel a) and mean decrease impurity (panel b) score variables’ rankings 30
2.7 Default risk analysis: two-way partial dependence of default on net capital
and net capital to total assets ratio (panel c); scatterplot of the net capital and
net capital to total assets ratio in the test set (panel d, default cases in red,
active cases in black) 31
3.1 Survival probabilities t px . Models LC- and LC- -spl. Ages 40–100 and years
2015–2074 45
3.2 Reserves for term life insurance with single and periodic premium for the
models LC, LC- , and LC- -spl 47
3.3 Reserves for pure endowment contract with single premium for the models LC,
LC- , and LC- -spl 47
4.1 Scatter plots of original data and fitted curves of kernel ridge regression,
KRR and support vector regression, SVR (green curve is for the KRR and orange
curve is for the SVR) 58
4.2 Scatter plots of the original data and fitted curves of the proposed method,
KSRR, kernel ridge regression, KRR, and support vector regression, SVR
(blue and red curves are for the KSSR, green curve is for the KRR, and orange
curve is for the SVR) 69
6.1 When the dimension of the space (d-sphere) increases from d = 2 (2-sphere on the left) to d = 3 (3-sphere on the right), the proportion of random points (black dots) falling into the internal sphere (red area) of radius 1 , with 0.6, decreases, whereas the proportion of points (empty circles) in the complement of the internal sphere (gray area) increases 99
6.2 Dimension k of the projection sub-space given by the JL lemma (vertical axis,
left) and the upper bound in the norm preservation theorem (vertical axis, right)
as a function of the number of observations n (horizontal axes) 107
6.4 Log-returns of the S&P500 index (red line) from 1st August 2019 to
31st December 2019 and log-returns of all S&PHealthcare components
(colored lines) 110
6.5 Results for Gaussian random projection model estimated on whole sample. Top:
true values (black), and OLS (blue dashed) and random projection OLS
(red dashed) fitted values. Bottom: mean absolute error for OLS (blue dashed)
and random projection OLS (red dashed). Gray lines represent the results of each
one of the 1,000 random projection estimates 112
6.7 Returns on S&P500 index (black solid), OLS-based forecasting (dashed, top) and
(OLS+RP)-based forecasting (dashed, bottom) 115
6.8 Daily trading volumes of electricity in thousands of Megawatt (MW) (black) from
7th March 2010 to 28th March 2010. The vertical dashed line indicates the
end of the week 116
7.6 Bank of England’s IAAS cloud market share survey (January 2020) 135
9.3 Logit visualization of the model for the year 2012 with error data 168
9.4 Logit visualization of the model for the year 2012 with corrected data 169
2.2 Out-of-the-box (left) and test set (right) predictions and classification error
with a training sample size of 10344 (first line) and of 688 (second line) 30
3.2 Out-of-sample test results: RMSE for the models LC, LC- , and LC- -spl.
Ages 40–100 and years 1999–2014 44
3.3 The single (U) and period (P) premium of the term life insurance and pure
endowment for the models LC, LC- and LC- -spl 47
4.1 The mean error of 100 samples of two data sets: SD-1 and SD-2 68
4.2 The mean error of 100 samples of the synthetic data sets (SD-3) 69
4.3 Training and test error over 100 samples, n = 400, the proposed method KSRR,
kernel ridge regression, KRR, and support vector regression, SVR 69
5.3 Worth of the variables obtained using the Relief feature selection method 88
6.2 Mean square error for the different estimation strategies (columns) and
experimental settings (rows) given in Table 6.1 108
6.3 Mean Absolute Error (MAE), Tracking Error (TE) and Tracking Error with
normalized weights (TE-n) for different strategies. A strategy is given by
a combination of estimation method (OLS without Random Projection (OLS)
and OLS with Gaussian Random Projection (RP)) and training sample size
(percentage of the original samples (P) equal to nsub 104) 111
Petra Blazekova is currently an external PhD student at the Faculty of Management, Comenius
University in Bratislava. She works at a commercial bank in Slovakia as a risk management
specialist. Her research interests are in the field of risk management, reporting, and regulation.
Roberto Casarin is currently a Professor of Econometrics at Ca’ Foscari University of Venice
researching on Bayesian inference, computational methods, and time series analysis. He is the
director of the Venice Center for Risk Analytics and a member of ECLT, GRETA, Econometrics
Society, and International Society for Bayesian Analysis. He was an Assistant Professor at Uni-
versity of Brescia and a research assistant at University of Padova. He received a PhD degree in
Mathematics (2007) from University Paris Dauphine, a PhD degree in Economics (2003) from
the University of Venice, an MSc degree in Applied Mathematics from ENSAE-University Paris
Dauphine, and an MSc degree in Economics from the University of Venice.
Emon Kalyan Chowdhury is an Associate Professor of Accounting at Chittagong Independent
University, Bangladesh and the head of the Department of Accounting. His research focuses on
stock market risk, capital asset pricing model, financial reporting, and corporate governance.
Professor Chowdhury earned a PhD degree in stock market risk in 2016 from the University of
Chittagong, Bangladesh, where he had received an MBA degree in 2007 and a BBA degree in
2006. He received a second MBA degree in Finance and Human Resource Management from the
University of Bangalore, India in 2009. He has more than 25 scholarly publications in different
renowned national and international peer-reviewed journals.
Salma Drissi received a PhD degree in Management Science from the Laboratory of Research
in Performance Management of Public, Private Organizations and Social Economy, University
Ibn Zohr, Morocco. Dr. Drissi works as an Assistant Professor at the EMAA Business School and as a lecturer at the National School of Commerce and Management in Agadir. She has published several articles in indexed international journals.
Alessandro Facchinetti received his bachelor degree in Foreign Languages and Literature from
Catholic University of Milan in 2014. He then shifted his academic interests to economics and
received a second bachelor degree in Business and Administration from Ca’ Foscari University of
Venice in 2017 and an MSc degree in Finance in 2020 with a thesis on machine learning methods.
After a short internship in the data production team of HypoVereinsbank-UniCredit, he obtained
a fixed-term position as a research assistant at the Venice Centre for Risk Analytics on machine
learning methods.
Petr Hajek is an Associate Professor with the Institute of System Engineering and Informatics,
Faculty of Economics and Administration, University of Pardubice. Dr. Petr received the PhD
degree in system engineering and informatics from the University of Pardubice, Pardubice, Czech
Republic, in 2006. He is the author of three books and more than 90 articles. His research interests
include soft computing, machine learning, and economic modeling. He was a recipient of the
Rector Award for Scientific Excellence both in 2018 and 2019, and six best paper awards by the
international scientific conferences.
Richard L. Harmon is the Managing Director of Cloudera’s Financial Services Industry Ver-
tical. He joined Cloudera in May 2016 and has over 25 years of experience in capital markets
with specializations in risk management, advance analytics, fixed income research, and simula-
tion analysis. Prior to Cloudera, Dr. Harmon was the Director of SAP’s EMEA Capital Markets
group for six years and also held senior positions at Citibank, JP Morgan, BlackRock, and Bank
of America/Countrywide Capital Markets. Dr. Harmon holds a PhD degree in Economics with
specialization in Econometrics from Georgetown University.
M. Kabir Hassan is Professor of Finance in the Department of Economics and Finance at the
University of New Orleans. He currently holds three endowed Chairs – Hibernia Professor of
Economics and Finance, Hancock Whitney Chair Professor in Business, and Bank One Professor
in Business – in the University of New Orleans. Professor Hassan is the winner of the 2016 Islamic
Development Bank (IDB) Prize in Islamic Banking and Finance. He received his BA degree in
Economics and Mathematics from Gustavus Adolphus College, Minnesota, USA, and an MA
degree in Economics and a PhD degree in Finance from the University of Nebraska-Lincoln,
USA.
Md. Shah Azizul Hoque is currently pursuing his MBA, major in Human Resource Management
(HRM), from the University of Chittagong, Bangladesh. Earlier, he received his BBA degree from
the same institution. Apart from this, he has completed a postgraduate diploma in HRM from
the BGMEA Institute of Fashion & Technology, Bangladesh. Mr. Hoque has a keen interest in the technological and environmental aspects of HRM practices.
Md. Kaosar Hossain is a graduate in Human Resource Management (HRM) from the Depart-
ment of Human Resource Management, University of Chittagong, Bangladesh, and he is pursu-
ing an MBA in the same field. His research interests lie in the adoption of technology in HRM,
AI, and green HRM.
Tarikul Islam is a graduate in Human Resource Management from the University of Chittagong,
Bangladesh. His research interests include the technology-enabled workplace, gender, leadership style, and conflict resolution at work.
Osamu Komori received his PhD degree from the Graduate University for Advanced Studies in
2010. He is an Associate Professor at Seikei University. His research field includes bioinformatics,
machine learning, and linguistics.
Susanna Levantesi is Associate Professor in Mathematical Methods of Economy and Actuarial
and Financial Science in the Department of Statistics, Sapienza University of Rome, Italy. She
obtained her PhD degree in Actuarial Science from the Sapienza University of Rome. She is a Fully Qualified Actuary, a member of the professional association of Italian actuaries, a member of the Italian Institute of Actuaries, and a member of the Working Group on the Mortality of Pensioners and Annuitants in Italy. Her areas of expertise include longevity risk modeling and management,
actuarial models for health insurance, and machine learning and deep learning in insurance and
finance risk.
Michal Munk received the MS degree in Mathematics and Informatics and the PhD degree
in Mathematics in 2003 and 2007, respectively, from Constantine the Philosopher University,
Nitra, Slovakia. In 2018, he was appointed as a Professor in System Engineering and Informatics
at the Faculty of Informatics and Management, University of Hradec Kralove, Czechia. He is
currently a Professor in the Department of Computer Science in Constantine the Philosopher
University, Nitra, Slovakia. His research interests include data analysis, web mining, and natural
language processing.
Renata Myskova received her PhD degree in Economics and Management from the Faculty
of Business and Management, Brno University of Technology, Czech Republic, in 2003. Since
2007, she has been working as an Associate Professor at the Institute of Business Economics
and Management, Faculty of Economics and Administration, University of Pardubice. She has
been working with strategic management, management analysis, financial reporting, and financial
management. She has published a number of papers concerning economics and finance. She
serves as the associate editor of journal Economics and Management.
Andrea Nigri is a PhD student in Statistics at the Sapienza University of Rome. Since 2015, he
has been working as a biostatistician and co-author of publications in medical journals. In 2016,
Andrea was a research fellow at the Local Health Department of Padua (epidemiological surveil-
lance of mesothelioma). In 2017, during his first year of Doctoral School, he attended the EDSD
program (European Doctoral School of Demography) at the CPop in Odense. His research inter-
ests include mortality forecasting using statistical learning and deep learning methods and the
evolution of life expectancy and lifespan inequality.
Vladimir Olej was born in Poprad, Slovakia. In 2001, he worked as a Professor in the branch of
technical cybernetics at the Technical University of Ostrava. Since 2002, he has been working
as a Professor with Institute of System Engineering and Informatics, Faculty of Economics and
Administration, University of Pardubice, Czech Republic. His research interests include artificial
and computational intelligence. He has published a number of papers concerning fuzzy logic,
neural networks, and genetic algorithms.
Anna Pilkova is currently Professor of Management at Faculty of Management at Comenius
University, Bratislava, Slovakia. Earlier, she worked at top managerial positions in a commercial
bank in Slovakia. Her research interests are focused on the banking regulation, risk management,
and strategic management at commercial banking in developing countries. In addition, she has
conducted research on entrepreneurial activities and entrepreneurship inclusivity as a national
team leader for Global Entrepreneurship Monitor. She is a recipient of a few awards including
the Green Group Award of Computational Finance and Business Intelligence (Best paper) of
the International Conference on Computational Science 2013, the Workshop on Computational
Finance and Business Intelligence (Barcelona, 2013).
Gabriella Piscopo is Associate Professor in Financial Mathematics and Actuarial Science at Uni-
versity of Naples Federico II. She obtained her European PhD from University of Naples Federico
II and Cass Business School of London. She has been Visiting Professor at Renmin University of
China, and Visiting Researcher at Cass Business School of London and at the Université catholique de Louvain. Her areas of expertise are longevity risk modeling, life insurance valuation, machine learning, and neural networks in actuarial science. She is a Fully Qualified Actuary, a member of the professional association of Italian actuaries, and a member of the Italian Institute of Actuaries.
Andrew Psaltis previously served as the Cloudera APAC CTO, joining the company through the
Hortonworks merger where he was the APAC CTO starting in 2016. He has spent most of his 20+
year career leading from the intersection of business and technology to drive strategic planning,
tactical development, and implementation of leading-edge technology solutions across enterprises
globally. He’s recognized for being an industry thought leader in big data, data analytics, and
streaming systems.
Md. Aftab Uddin has been teaching as an Associate Professor in the Department of Human
Resource Management, University of Chittagong, Bangladesh. In addition to earning BBA and MBA degrees from the University of Chittagong, Dr. Uddin was awarded an MBA and a PhD by the Wuhan University of Technology, China. He researches corporate greenization, creative
engagement, innovative behavior, intelligence, leadership, positive psychology, technology-driven
workplace, etc.
Mohammed Mohi Uddin, PhD, is an Assistant Professor of Accounting at the University of
Illinois Springfield, USA. He received a Bachelor of Commerce and a Master of Commerce from
the University of Chittagong, an MBA degree from the University of Leeds, and a PhD degree
in Accounting from Aston University. His primary research interests concern accountability, per-
formance management, corporate social responsibility, and accounting data analytics. He secured
external research funding, and published scholarly articles in reputable academic and practition-
ers’ journals. Dr. Uddin served as an ad hoc reviewer of internationally reputable accounting
journals such as Accounting, Auditing and Accountability Journal, and Journal of Accounting in
Emerging Economies. He is a fellow of Higher Education Academy, United Kingdom. In the past,
Dr. Uddin held academic positions at Queen’s University Belfast, Aston University, and University
of Chittagong.
Veronica Veggente received her bachelor degree in Economics and Finance from Roma Tre University in 2018 with a thesis on an econometric analysis of the investment choices of households. She then moved to Ca’ Foscari University of Venice and completed her MSc in Finance in 2020, defending a master’s thesis on random projection methods in statistics. During her studies she did a short internship in the Financial and Credit Risk unit of Generali Italia, and she obtained a fixed-term position as a research assistant at the Venice Centre for Risk Analytics. Her current project focuses on dimensionality reduction techniques for entropy estimation.
Chapter 1
Machine learning in finance and accounting
1.1 Introduction
Machine learning (ML) is a type of applied artificial intelligence (AI) that enables computer systems to learn from data or observations and to improve their predictive performance automatically through ongoing learning. It is generally situated within the computer science discipline but can be applied in fields such as the social sciences, finance, accounting and banking, marketing research, operations research, and the applied sciences. It utilizes computationally intensive techniques, such as cluster analysis, dimensionality reduction, and support vector analysis. ML has gained recognition among academics, researchers, and practitioners over the last couple of decades because of its ability to produce more accurate predictions. Many applications of ML across diverse fields have emerged. In particular, its applications in major disciplines such as finance, accounting, information systems, statistics, economics, and operations research are noteworthy. In the context of rapid innovations in computer science and the availability of big data, ML can change the way practitioners make predictions and the way researchers collect and analyze data.
Computational finance is an interdisciplinary area that integrates computing tools with numerical finance. By utilizing computer algorithms, it can contribute to the advancement of financial data modeling systems. These computational techniques can be successfully applied in important areas of finance, such as financial risk management, corporate bankruptcy prediction, stock price prediction, and portfolio management. For example, ML can be used to prevent and detect credit card fraud. Given the availability of huge volumes of unstructured data, such as customer reviews, social media posts, and news data, ML can provide “new insights into business performance, risks and opportunities” (Cockcroft & Russell, 2018, p. 324). ML tools can be used to process these unstructured data in order to make better business decisions. Managers can also use ML tools to prevent and detect accounting fraud (see Cockcroft & Russell, 2018). ML can likewise be applied in accounting areas such as auditing, income tax, and managerial accounting.
ML can also be successfully utilized in accounting and finance research. For example, content analysis is a widely used research method in accounting research. ML algorithms can provide “reliability, stability, reproductivity and accuracy” (see Bogaerd & Aerts, 2011, p. 13414) in data processing. Accounting researchers in corporate social responsibility (CSR) frequently use textual analysis, a sub-category of content analysis, to identify themes. ML algorithms can successfully be used to classify texts or generate themes (see Bogaerd & Aerts, 2011). Accounting scholars (Deegan & Rankin, 1996; Gray, Kouhy & Lavers, 1995; Neu, Warsame & Pedwell, 1998) have used algorithms to classify texts from corporate social and environmental reports. Bogaerd and Aerts (2011) used the LPU (learning from positive and unlabeled data) ML method to classify texts with 90% accuracy. Behavioral finance researchers can use unstructured newspaper and social media data to understand market sentiment and utilize these data to develop models that predict the prices of financial products.
The next two sections highlight the motivation for the book and provide brief overviews of the chapters that follow.
1.2 Motivation
ML is an emerging computing tool that can be successfully applied in large and complex data settings. For example, researchers from Warwick University, UK, recently discovered 50 new planets in existing NASA data by using ML algorithms (Yeung, 2020). These types of opportunities were not available in the past due to the absence of large datasets and the processing limitations of computers. Thanks to advances in computing technology, “big data” are now readily available in various business areas. ML can be used to exploit the immense potential of “big data” for making improved business decisions. Research on utilizing big data (and ML) has the potential to improve “industry practices” and “cross-disciplinary research” (see Cockcroft & Russell, 2018, p. 323). Despite this tremendous potential, however, Goes (2014) argues that the finance industry does not have sufficient expertise to exploit the benefits of big data. This book is interdisciplinary in nature. It aims to contribute to the emerging machine learning area and its applications in business.
This book presents 12 chapters covering machine learning concepts, algorithms, and their applications. More specifically, it introduces methods such as kernel switching ridge regression, sentiment analysis, decision trees, and random forests. It also presents empirical studies applying ML in multiple finance and accounting areas, such as forecasting mortality for insurance product pricing, using kernel switching ridge regression to improve prediction models, managing risk and financial crime, and predicting stock return volatilities. Given the lack of sufficient books in this area, this book will be useful to researchers, including academics and research students, who are interested in advanced machine learning tools and their applications. The contents of this book are also expected to benefit practitioners who are involved in forecasting and modeling, stock-trading risk management, bankruptcy prediction, accounting and banking fraud detection, insurance product pricing, credit risk management, and portfolio management. We believe the findings presented in this book will add new insights to the stream of computational finance and accounting research.
depositors on the requirements of Pillar 3 disclosures and Pillar 3 related information” during the
credit crunch-related financial crisis in 2009 (Pilkova, Munk, Blazekova & Benko, 2021, p. 1).
Chapter 10 introduces various ML concepts and algorithms, and applications of ML in
accounting, finance, and economics. Particularly, it highlights the importance of using ML in
“algorithmic trading and portfolio management, risk management and credit scoring, insurance
pricing and detection of accounting and financial fraud” (Radwan, Drissi & Secinaro, 2021, p. 2).
Chapter 11 discusses the challenges of applying classification techniques to highly class-imbalanced datasets. Selected techniques for addressing class imbalance, including oversampling, undersampling, SMOTE, and borderline-SMOTE, are presented. The chapter also presents different metrics for evaluating the performance of classification techniques applied to imbalanced datasets.
Chapter 12 is about the applications of AI in staff recruitment. The chapter applies the lens of a combined system of acceptance and usage of technology. In doing so, it highlights the importance of various antecedents of AI acceptance among HR experts hiring talent in Bangladesh and identifies the determinants of AI adoption. By employing a deductive reasoning approach, the authors make some interesting empirical contributions. The chapter also provides insightful comments on opportunities for further research in this area.
References
Alam, M. A., Komori, O., & Rahman, M. F. (2021). Kernel switching ridge regression in business intelli-
gence system. In M. Z. Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M. (Eds.), The Essentials of
Machine Learning in Finance and Accounting (pp. 17–45). Oxford: Taylor and Francis.
Bogaerd, M. V., & Aerts, W. (2011). Applying machine learning in accounting research. Expert Systems with
Applications, 38, 13414–13424.
Casarin, R., Facchinetti, A., Sorice, D., & Tonellato, S. (2021). Decision trees and random forests. In M. Z.
Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M. (Eds.), The Essentials of Machine Learning in
Finance and Accounting (pp. 17–45). Oxford: Taylor and Francis.
Casarin, R., & Veggente, V. (2021). Random projection methods in economics and finance. In M. Z.
Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M. (Eds.), The Essentials of Machine Learning in
Finance and Accounting (pp. 17–45). Oxford: Taylor and Francis.
Cockcroft, S., & Russell, M. (2018). Big data opportunities for accounting and finance practice and research.
Australian Accounting Review, 28(3), 323–333.
Deegan, C., & Rankin, M. (1996). Do Australian companies report environmental news objectively? An
analysis of environmental disclosures by firms prosecuted successfully by the Environmental Protec-
tion Authority. Accounting, Auditing and Accountability Journal, 9(2), 50–67.
Goes, P. B. (2014). Big Data and IS Research. Minneapolis: Carlson School of Management.
Gray, R., Kouhy, R., & Lavers, S. (1995). Corporate social and environmental reporting: A review of the
literature and a longitudinal study of UK disclosure. Accounting, Auditing and Accountability Journal,
8(2), 47–77.
Hajek, P., Myskova, R., & Olej, V. (2021). Predicting stock return volatility using sentiment analysis of
corporate annual reports. In M. Z. Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M. (Eds.), The
Essentials of Machine Learning in Finance and Accounting (pp. 17–45). Oxford: Taylor and Francis.
Harmon, R., & Psaltis, A. (2021). The future of cloud computing in financial services: A machine learning
and artificial intelligence perspective. In M. Z. Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M.
(Eds.), The Essentials of Machine Learning in Finance and Accounting (pp. 17–45). Oxford: Taylor and
Francis.
Levantesi, S., Nigri, A., & Piscopo, G. (2021). Improving longevity risk management through machine
learning. In M. Z. Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M. (Eds.), The Essentials of
Machine Learning in Finance and Accounting (pp. 17–45). Oxford: Taylor and Francis.
Neu, D., Warsame, H., & Pedwell, K. (1998). Managing public impressions: Environmental disclosures in
annual reports. Accounting, Organizations and Society, 23(3), 265–282.
Pilkova, A., Munk, M., Blazekova, P., & Benko, L. (2021). Web usage analysis: Pillar 3 information assess-
ment in turbulent times. In M. Z. Abedin, Hassan, M. K., Hajek, P., and Uddin, M. M. (Eds.), The
Essentials of Machine Learning in Finance and Accounting (pp. 17–45). Oxford: Taylor and Francis.
Radwan, M., Drissi, S., & Secinaro, S. (2021). Machine learning in the fields of accounting, economics and
finance: The emergence of new strategies. In M. Z. Abedin, Hassan, M. K., Hajek, P., and Uddin, M.
M. (Eds.), The Essentials of Machine Learning in Finance and Accounting (pp. 17–45). Oxford: Taylor
and Francis.
Yeung, J. (2020). Breakthrough AI identifies 50 new planets from old NASA data. CNN Business, available
at: https://www.cnn.com/2020/08/26/tech/ai-new-planets-confirmed-intl-hnk-scli-scn/index.html,
date accessed: August 30, 2020.
Chapter 2
Decision trees and random forests
2.1 Introduction
A decision tree is a non-parametric supervised learning tool for extracting knowledge from avail-
able data. It is non-parametric, since it does not require any assumptions regarding the distribu-
tion of the dependent variable, the explanatory variables, and the functional form of the relation-
ships between them. It is supervised, since it is trained over a labeled dataset, L, in which each
observation is identified by a series of features, or explanatory variables, X1 , . . . , Xp , belonging to
the sample space X , and by a target, or response variable, Y . A generic observation is denoted
as oi = (x1i , x2i , ..., xpi , yi ), where yi represents the measurement of the response variable on the
ith sample item, whereas xji represents the measurement of the jth feature on the same item, with
j = 1, . . . , p and i = 1, . . . , N . The symbol x denotes a generic value of the p-dimensional
feature vector.
Based on the evidence provided by the data, a decision tree provides a prediction of the target
and a relationship between the features and such target.
The engine that powers a decision tree is recursive binary partitioning of the sample space
X as represented in Figure 2.1. The diamond-shaped box identifies the root node, which is asso-
ciated with observations, L. The circles identify the internal nodes. At each internal node, a test is performed over some features, and an arbitrary observation is associated either with the left or with the right child node, according to the result of that test. The terminal nodes, called leaves, determine a partition of L by means of a sequence of splitting decisions.
∗ This research used the SCSCF multiprocessor cluster system and is part of the project Venice Center for Risk Analytics (VERA) at Ca’ Foscari University of Venice.
Figure 2.1 Structure of a decision tree, highlighting the root node (diamond), internal nodes (circles), branches, and leaves (squares).
Decision trees are discriminated according to the nature of the target they have to predict. A
classification tree is characterized by the fact that it predicts a categorical response, as opposed to
a quantitative and, generally, continuous one in the regression case. In the following two sections,
the Classification And Regression Tree (CART) algorithm (Breiman, Friedman, Olshen, & Stone, 1984) will be illustrated. CART is one of the most popular algorithms used in the construction of
decision trees. Other valuable algorithms that will not be treated in this chapter are ID3 (Quinlan,
1986) and C4.5 (Quinlan, 1993). Section 3 will illustrate some common features of classification
and regression trees, with particular emphasis on surrogate splitting, handling of missing values,
and feature ranking. Section 4 provides a short discussion of the advantages and disadvantages of
decision trees. In Section 5, random forests are presented. Some applications of random forests
in classification, regression, and time series analysis are produced in Section 6. The R code used
in the applications is reported in the appendix.
them with distinct response labels. The final tree, trained on the learning sample, can be used to
classify any future measurement vector x assigning the correct label with the highest accuracy.
The problem of generating a tree consists in the following choices (Breiman et al., 1984):
1. how to select the splits at each internal node;
2. when to declare a node terminal, i.e., when to stop splitting;
3. how to assign a class label to each terminal node.
A split is beneficial whenever it creates two descendant subsets that are indeed more homogeneous
with respect to the class distribution, and a way to judge the goodness of a split is to compute the
change in impurity that it produces:
\Delta\iota(s, t) = \iota(t) - p_L\,\iota(t_L) - p_R\,\iota(t_R), \qquad (2.1)
where s represents a possible split of node t into the left and right children nodes, t_L and t_R; \iota(t), \iota(t_L), and \iota(t_R) denote the impurity levels of nodes t, t_L, and t_R, respectively. The children impurities are weighted by the proportions of instances that they collect, p_L and p_R respectively.
Figure 2.2 Examples of alternative splitting rules giving rise to partitions characterized by (a) high and (b) low impurity level. Each symbol represents an individual; G indicates green eyes and B brown eyes.
The impurity level of a tree T is defined as
I(T) = \sum_{t \in \tilde{T}} p(t)\,\iota(t),
a weighted average of the leaves' impurity levels, where \tilde{T} identifies the set of terminal nodes of T.
Let us assume we wanted to proceed with one further split at an arbitrary terminal node t. This
step would generate a different tree, where the single leaf t would make room for two new leaves
tL and tR . The overall change in impurity level between the original tree and the newly produced
one could be quantified as in Eq. (2.1).
The error-based index is the most intuitive criterion: the best split is the one with the smallest number of misclassifications in the children nodes. From a mathematical perspective, we can define the impurity of node t as the error index in t, i.e.,
\iota_{E}(t) = 1 - \max_{c} p(c \mid t),
where p(c \mid t) denotes the proportion of class-c observations collected by node t. A second, widely used impurity function is based on the Gini index,
\iota_{GI}(t) = 1 - \sum_{c} p(c \mid t)^{2}.
It is straightforward to verify that if p(c|t) is close to 1 for some c, the impurity measure tends to 0, implying that the more the observations tend to belong to a single class, the higher the value of the information conveyed by the split. From Eq. (2.1), it follows that
\Delta\iota_{GI}(s, t) = \Big[1 - \sum_{c} p(c \mid t)^{2}\Big] - p_L\Big[1 - \sum_{c} p(c \mid s, t_L)^{2}\Big] - p_R\Big[1 - \sum_{c} p(c \mid s, t_R)^{2}\Big].
Another popular definition of the impurity function is based on the Shannon entropy index (Shannon, 1948):
\iota_{HI}(t) = -\sum_{c} p(c \mid t)\,\log_2 p(c \mid t).
The behavior of this function is similar to that of the one based on the Gini index. From Eq. (2.1) we have
\Delta\iota_{HI}(s, t) = -\sum_{c} p(c \mid t)\,\log_2 p(c \mid t) - p_L\Big[-\sum_{c} p(c \mid s, t_L)\,\log_2 p(c \mid s, t_L)\Big] - p_R\Big[-\sum_{c} p(c \mid s, t_R)\,\log_2 p(c \mid s, t_R)\Big].
Once the tree is grown, each terminal node t is assigned the class
c^{*}(t) = \arg\max_{c \in C} p(c \mid t),
where C represents the support of the categorical response variable. Thus, c^{*} is the Y-category that most frequently occurs in the terminal node t. This assignment rule is known as the majority vote method.
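To make the splitting criteria above concrete, the following R sketch (illustrative only: the function names, the toy data, and the candidate split are ours, not taken from the chapter's appendix) computes the Gini and entropy impurities of a node and the impurity decrease of Eq. (2.1) for a candidate split.

```r
# Impurity measures for a vector of class labels collected by a node
gini_impurity <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}
entropy_impurity <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                 # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}

# Impurity decrease of Eq. (2.1) for a split encoded by a logical vector
# 'left' (TRUE = observation sent to the left child)
impurity_decrease <- function(y, left, impurity = gini_impurity) {
  p_left <- mean(left)
  impurity(y) - p_left * impurity(y[left]) - (1 - p_left) * impurity(y[!left])
}

# Toy example: eye colour (G/B) and one numeric feature
set.seed(1)
y <- factor(sample(c("G", "B"), 100, replace = TRUE, prob = c(0.6, 0.4)))
x <- rnorm(100) + ifelse(y == "G", 0.8, -0.8)
impurity_decrease(y, left = x < 0)                                   # Gini
impurity_decrease(y, left = x < 0, impurity = entropy_impurity)      # entropy
```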
A node is declared terminal, and no further splits are attempted, when:
1. the frequency distribution of either the response or the features is degenerate in that node (these are also called mandatory conditions);
2. any of its possible children would contain a number of instances lower than a given threshold, or no further split determines a decrease in impurity higher than a given level, or the maximum admissible depth of the tree is reached, i.e., the maximum number of splits has been achieved.
Post-pruning still resorts to a series of stopping criteria, just very loose ones: the aim is to let an excessively large tree grow with few restrictions and then progressively cut off parts of it, monitoring how the performance is affected. The available literature proposes a number of different post-pruning approaches, such as reduced error pruning, pessimistic error pruning, minimum error pruning, critical value pruning, error-based pruning, and cost-complexity pruning (Esposito, Malerba, Semeraro, & Kay, 1997). The last one is employed in the Classification and Regression Trees (CART) algorithm implemented in the rpart package (Therneau & Atkinson, 2019). The pruning process consists in starting from the overly large tree, call it T_max, and trimming its branches, thus generating a sequence of progressively smaller subtrees, down to the smallest one, the root node t_0 itself. The optimal subtree is the one that minimizes the following loss function:
L(T) = MR(T) + \alpha(T),
where MR(T) represents the misclassification rate, i.e., the proportion of misclassified items in the training set, and \alpha(T), with \alpha(T) \geq 0, denotes the cost associated with the size of the subtree.
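As a rough illustration of how loose pre-pruning controls and cost-complexity post-pruning are exposed in practice, the sketch below uses the rpart package cited above; the built-in iris data and the tuning values are placeholders rather than the chapter's applications.

```r
library(rpart)

# Grow a deliberately large tree by relaxing the pre-pruning controls
fit_large <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2, xval = 10))

# Cost-complexity table: cross-validated error for each candidate subtree
printcp(fit_large)

# Prune back to the subtree whose complexity parameter minimizes the
# cross-validated error (the xerror column of the cp table)
best_cp <- fit_large$cptable[which.min(fit_large$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_large, cp = best_cp)
```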
MR(T) = \sum_{t \in \tilde{T}} p(t)\, MR(t), \qquad (2.4)
meaning that the overall tree misclassification rate is simply the weighted average of the single leaves' misclassification rates. We must highlight a flaw of this method: the performance of a tree is measured only in terms of how well it fits the dataset on which it has been trained. In fact, if we used a single set of observations for both training and validation, we would obtain a biased measure of the performance, specifically an overly optimistic one. Intuitively, given the nature of Eq. (2.4), the more we split, the finer the partitions and the lower the misclassification rate estimate. In other words, following this logic, we would be persuaded to keep on splitting until we were forced to stop by a mandatory condition. This is the reason why alternative approaches should be looked for.
One such approach consists in randomly splitting the sample L into a learning set L^{nls}, on which the tree is grown, and a test set L^{ts} containing N_{ts} observations, on which the misclassification rate is estimated as
MR^{ts}(T) = \frac{1}{N_{ts}} \sum_{c=1}^{C} \sum_{d=1,\, d \neq c}^{C} N_{dc}, \qquad (2.5)
where N_{dc} denotes the number of test-set observations of class c that the tree assigns to class d.
The remarkable feature of this approach is that the random selection ensures that the disjoint
sets Lts and Lnls are truly independent. As a rule of thumb, usually 70% of the observations are
reserved for the tree building process, and the rest are employed for validation. Of course, in
order to proceed in this way a sufficiently large sample size is required. A quantification of the
uncertainty in the estimation of the misclassification rate is given the standard error of MR ts :
14 Roberto Casarin et al.
s
ts MR ts (T ) (1 − MR ts (T ))
se MR (T ) = .
Nts
As the size of the test set Lts increases, the standard deviation decreases, coherently with the fact
that the estimation of the misclassification rate becomes increasingly reliable.
This approach can be further generalized by splitting the sample L into G disjoint training sets and estimating the misclassification rate of a tree using cross-validation.
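A minimal sketch of the hold-out estimate of Eq. (2.5) and of its standard error, again in R with placeholder data; the 70/30 split follows the rule of thumb mentioned above.

```r
library(rpart)
set.seed(42)

# 70% of the labelled sample for growing the tree, 30% for validation
n         <- nrow(iris)
idx_train <- sample(seq_len(n), size = round(0.7 * n))
train     <- iris[idx_train, ]
test      <- iris[-idx_train, ]

fit <- rpart(Species ~ ., data = train, method = "class")

# Test-set misclassification rate, Eq. (2.5): off-diagonal mass of the
# confusion matrix divided by the test sample size
pred  <- predict(fit, newdata = test, type = "class")
conf  <- table(predicted = pred, actual = test$Species)
mr_ts <- 1 - sum(diag(conf)) / nrow(test)

# Standard error of the hold-out estimate
se_ts <- sqrt(mr_ts * (1 - mr_ts) / nrow(test))
c(misclassification_rate = mr_ts, standard_error = se_ts)
```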
L_f = \{\, i \in L : (x_{1i}, \ldots, x_{pi}) \in R_f \,\}.
Our aim is to find the partition PL that minimizes Eq. 2.6. Unfortunately, this optimization
problem cannot be solved when the number of features is moderately high. Therefore, decision
trees based on a binary splitting analogous to the one illustrated in the previous section are built
in order to approximately minimize S(PL ) in Eq. (2.6).
We move from the majority vote rationale to a least squares approach: the prediction attached
to a terminal node is simply the average of the response variable on the cases it collects. Then, the
prediction produced by node t is
\bar{y}(t) = \frac{1}{N(t)} \sum_{i \in t} y_i.
With N(t) we denote the number of instances collected by the node t; the summation concerns only those sample items that are allocated to t.
It is worth noticing that the average minimizes a squared loss function. Hence, a suitable impurity function for the generic node t in a regression tree can be defined as
\iota(t) = \sum_{i \in t} (y_i - \bar{y}_t)^2.
Once we have defined an impurity function, we can grow the tree by looking for those splits that
maximize Eq. (2.1).
Suppose we want to predict the value taken by the response y when the feature vector takes the
value x. The prediction produced by the regression tree trained on the sample L will be denoted
by
\phi_L(x) = \bar{y}_{\tilde{t}_x},
where \tilde{t}_x is the leaf to which sample items with feature vector equal to x are allocated, i.e., x \in \mathcal{X}_{\tilde{t}_x}.
\mathrm{MSE}(T) = \mathbb{E}\big[(Y - \phi_L(x))^2\big].
The use of this estimate in the assessment of the performance of T encourages the algorithm to
keep on splitting, often raising overfitting issues.
\mathrm{MSE}^{ts}(T) = \frac{1}{N_{ts}} \sum_{i=1}^{N_{ts}} \big(y_i - \phi_L(x_i)\big)^2,
where the same computation of Eq. (2.7) is limited to the N_{ts} observations of the test sample, compared with the predictions provided by the tree that has been trained on the complementary subset L^{nls}. The standard error associated with this estimate can be easily computed:
\mathrm{sd}\big(\mathrm{MSE}^{ts}(T)\big) = \frac{1}{\sqrt{N_{ts}}} \left[ \frac{1}{N_{ts}} \sum_{i=1}^{N_{ts}} \big(y_i - \phi_L(x_i)\big)^4 - \big(\mathrm{MSE}^{ts}(T)\big)^2 \right]^{1/2}.
Again, further refined estimates of the mean squared prediction error can be obtained through
cross-validation.
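The analogous hold-out computation for a regression tree can be sketched as follows; the built-in mtcars data and the control values are arbitrary, and the last two lines reproduce the test-set MSE and its standard deviation given above.

```r
library(rpart)
set.seed(7)

# Hold-out split of the built-in mtcars data; mpg is the numeric response
n     <- nrow(mtcars)
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Regression tree: method = "anova" splits by reducing within-node sums of squares
fit  <- rpart(mpg ~ ., data = train, method = "anova",
              control = rpart.control(minsplit = 5, cp = 0.01))
pred <- predict(fit, newdata = test)

# Test-set mean squared error and its standard deviation
n_ts   <- nrow(test)
err    <- test$mpg - pred
mse_ts <- mean(err^2)
sd_ts  <- sqrt(mean(err^4) - mse_ts^2) / sqrt(n_ts)
c(MSE = mse_ts, sd = sd_ts)
```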
for the right one. By N_c(t) we denote the number of sample items of class c allocated to node t, and by N_c(LL) or N_c(RR) the number of those elements that are sent left or right, respectively, by both splits. The summation \sum_c accounts for all classes; hence, at the numerator \sum_c N_c(LL) and \sum_c N_c(RR) identify all the observations whose direction the two splits agree upon, regardless of class membership. At the denominator, N(t) corresponds to the total number of observations allocated to node t. These ratios quantify the probability that an observation temporarily stored in t will be directed toward the same child by both splits, therefore revealing the correspondence with the ratios of probabilities in the intermediate step. The counterpart estimates in the regression case are
p_{LL}(s^*, s_j) = \frac{p\big(t_L \cap t_L^{\,j}\big)}{p(t)} = \frac{N(LL)}{N(t)}
and
p_{RR}(s^*, s_j) = \frac{p\big(t_R \cap t_R^{\,j}\big)}{p(t)} = \frac{N(RR)}{N(t)}.
Again, N (LL) and N (RR) identify the sets of observations that are sent either left or right by
both splits. The overall performance of a surrogate, regardless of the fact that we are dealing with
a classification or a regression tree, is given by
p(s^*, s_j) = p_{LL}(s^*, s_j) + p_{RR}(s^*, s_j). \qquad (2.9)
Therefore, given a specific alternative input, the best surrogate that it can provide is the one sj∗ ∈ Sj
that maximizes the function in Eq. (2.9). This approach creates a sequence of the best surrogates
that each input can offer. However, regardless of the “best” status, it does not necessarily mean
that a specific surrogate split is worth being taken into consideration. In other words, we need
a systematic measure to determine whether a split could indeed act as replacement or should,
instead, be discarded. Given the globally optimal split s∗ of a node t, we denote by pL and pR the
probabilities by which it directs the instances it contains toward the left or the right child. If a new
observation is fed to the node, then our prediction is that it will be collected by tL if max(pL , pR ) =
pL , or that it will end up in tR otherwise; the probability of formulating a wrong prediction about
the destination of an observation that is missing the primary variable value, based on the primary
variable behavior, thus corresponds to min(pL , pR ). From the surrogate perspective, Eq. (2.9)
estimates the probability that a new observation is directed toward the same node by both the
primary and surrogate splits. Hence, we deduce that the estimated probability that the two might
instead disagree corresponds to 1 − p(s∗ , sj ). The so-called predictive association between the two
splits
\lambda(s^* \mid s_j) = \frac{\min(p_L, p_R) - \big(1 - p(s^*, s_j)\big)}{\min(p_L, p_R)} \qquad (2.10)
gives a systematic answer to the issue of whether the surrogate split sj could be worthy of replacing
the primary s∗ .
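The rpart package implements this surrogate mechanism; the sketch below (with illustrative settings) shows how surrogate splits are controlled and inspected, and how observations missing the primary splitting variable are still routed down the tree.

```r
library(rpart)
set.seed(3)

# Introduce some missing values in one predictor
dat <- iris
dat$Petal.Length[sample(nrow(dat), 20)] <- NA

# usesurrogate = 2: observations missing the primary variable follow the best
# surrogate split; maxsurrogate limits how many surrogates are kept per node
fit <- rpart(Species ~ ., data = dat, method = "class",
             control = rpart.control(maxsurrogate = 3, usesurrogate = 2))

# summary() reports, node by node, the primary split together with its
# surrogates and their (adjusted) agreement with the primary split
summary(fit)

# Observations with a missing Petal.Length still receive a prediction
predict(fit, newdata = dat[is.na(dat$Petal.Length), ], type = "class")
```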
The importance of variable X_j can then be measured as
VI(X_j) = \sum_{t \in T} \Delta\iota\big(s_j^*, t\big),
where \Delta\iota refers to the decrease in impurity (and not the measure of impurity itself), which is a function of the specific parent node t and the best surrogate split s_j^* that X_j can provide in that circumstance. Intuitively, the larger the decreases in impurity that the splits over X_j can guarantee, the higher the importance of the variable in the model. This measure evaluates the systematic capacity of an input variable to provide surrogate splits: the clever implication is that it quantifies the importance of a variable even though it may or may not have appeared among the selected splits in the optimal subtree, removing the masking effect due to the presence of other, more recurring inputs. In order to generate a ranking and make it as visually appealing as possible, another useful step is to normalize the measure as follows:
\frac{VI(X_j)}{\max_j VI(X_j)} \cdot 100,
where the denominator is the value of the most important variable and the ratio is multiplied by a hundred, hence ranging from 0 to 100.
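In rpart, a surrogate-based variable importance of this kind is stored in the fitted object; the brief sketch below extracts it and rescales it to the 0–100 range described above (the variable names in the output depend, of course, on the data used).

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Raw importance: goodness-of-split contributions accumulated over primary
# and surrogate splits, one entry per feature
vi <- fit$variable.importance

# Normalize so that the most important variable scores 100
round(100 * vi / max(vi), 1)
```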
Decision trees are a non-parametric inferential technique; hence, there is no need to make any
a priori assumptions regarding the distribution of either the input or the output space. They are
particularly fit for modeling complex relationships between predictors and target, just by formu-
lating the appropriate sequence of binary questions (Breiman et al., 1984, ch. 2.7). Moreover, they
can easily account for missing values and are flexible enough to handle features of heterogeneous
nature, such as categorical and numerical variables.
Finally, decision trees are computationally efficient (Gupta, Rawat, Jain, Arora, & Dhami, 2017); nevertheless, they suffer from some limitations, such as overfitting and instability. The performance of decision trees tends to be good on the training sample, but it deteriorates
on out-of-sample observations. This overfitting problem can be partially mitigated by both pre-
pruning and post-pruning. As far as instability is concerned, even a small change in the training
data might produce a large change in the optimal tree structure. Random forests, introduced in
the next section, use simultaneously different training sets, thus providing a remedy to instability.
The risk associated with the predictor \phi_L is
R(\phi_L) = \mathbb{E}_{X,Y}\big[L\big(Y - \phi_L(X)\big)\big], \qquad (2.11)
where the expectation is taken with respect to the joint probability distribution of X and Y, with (X, Y) being independent of L. Analogously, if the prediction is conditioned on X = x,
R\big(\phi_L(x)\big) = \mathbb{E}_{Y \mid X = x}\big[L\big(Y - \phi_L(x)\big)\big], \qquad (2.12)
where the expectation is taken with respect to the probability distribution of Y conditioned on X = x. When L is assumed random as well, the above two equations are written respectively as
R(\phi_L) = \mathbb{E}_{L}\,\mathbb{E}_{X,Y}\big[L\big(Y - \phi_L(X)\big)\big] \qquad (2.13)
and
R\big(\phi_L(x)\big) = \mathbb{E}_{L}\,\mathbb{E}_{Y \mid X = x}\big[L\big(Y - \phi_L(x)\big)\big], \qquad (2.14)
where X_L represents the multivariate random variable generating the feature values in L.
L\big(\phi_L(X)\big) = 1 - \mathbf{1}_{Y}\big(\phi_L(X)\big),
\phi_B(x) = \mathbb{E}_{Y \mid X = x}(Y).
\mathrm{noise}(x) = \mathbb{E}_{Y \mid X = x}\big[(Y - \phi_B(x))^2\big] \qquad (2.17)
and represents the lowest risk level that can be achieved by the optimal \phi_B(x). The second term in (2.16) is the square of the bias of \phi_L(x), which is given by the difference between the prediction produced by the tree trained on L and the one provided by the optimal predictor:
\mathrm{bias}(x) = \phi_B(x) - \phi_L(x).
If the training sample L is random, we need to consider the risk function introduced in (2.14),
which can be rewritten as
The term bias(x) has been redefined in order to take into account the randomness of L, whereas
the definition of noise(x) is unchanged.
When dealing with complex datasets it is very difficult to identify the optimal predictor φB (x),
since this would require a precise knowledge of the data distribution. However, regression trees
provide predictors with relatively low bias, at the cost of high prediction error variance, which is
related to var(x). In the following, we shall introduce some tools that allow us to reduce var(x).
Both bias(x) and var(x) take into account the randomness of θ , whereas noise(x) is still defined
as in (2.17).
The risk in (2.18) is generally higher than the risk in (2.14); however, it can be substantially
reduced by averaging over an ensemble of random trees. Let θm , m = 1, . . . , M , be M indepen-
dently and identically distributed hyperparameters that give rise to M random trees. We can then
define the ensemble predictor
Ψ_L(x, θ_1, . . . , θ_M) = (1/M) Σ_{m=1}^{M} φ_L(x, θ_m).   (2.19)
Notice that the individual predictors φL (x, θm ) are identically distributed. Hence, we can write
σ²_{L,θ}(x) = V_{L,θ_m}[φ_L(x, θ_m)],   m = 1, . . . , M.   (2.21)
And from (2.20), it follows that
var(x) = [ρ(x) + (1 − ρ(x))/M] σ²_{L,θ}(x),   (2.25)

with

ρ(x) = Corr_{L,θ_m,θ_m′}(φ_L(x, θ_m), φ_L(x, θ_m′)).   (2.26)
Equations (2.22)–(2.26) explain the usefulness of the ensemble predictor. The squared bias term
in (2.23) does not play an important role, since it is usually low for the predictors produced
by regression trees. The quantity that plays a crucial role is var(x) in (2.23). It is closely linked
to ρ(x), the correlation coefficient between the predictors associated with an arbitrary pair of
random trees in the ensemble. It can be shown that 0 ≤ ρ(x) ≤ 1 (see Louppe (2014)). Then,
equation (2.25) shows that var(x) decreases as ρ(x) decreases and the size of the ensemble, M ,
increases. This result makes ensemble predictors preferable to predictors associated with single trees and paves the way for the definition of random forests.
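A quick way to appreciate equation (2.25) is to evaluate it numerically for a few values of M and ρ(x); the short base-R sketch below does this with σ²_{L,θ}(x) normalized to one (the grids of values are illustrative choices):

# var(x) = [rho + (1 - rho)/M] * sigma2, see equation (2.25).
ensemble_var <- function(rho, M, sigma2 = 1) (rho + (1 - rho) / M) * sigma2

rho <- c(0.9, 0.5, 0.1)          # correlation between pairs of random trees
M   <- c(1, 10, 100, 500)        # ensemble size
outer(rho, M, ensemble_var)      # rows: rho, columns: M

The variance shrinks as M grows, but only down to the floor ρ(x)·σ²_{L,θ}(x), which is why reducing the correlation between trees matters as much as adding trees.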
It is worth noticing that, since the θ_m's are independent, the individual predictors φ_L(x, θ_m) are weakly correlated. The reason why this happens is quite simple. Suppose that any tree in the
forest was grown using the CART algorithm and that there exists one feature, call it X1 , strongly
correlated with the response Y . In such a situation, X1 would be selected as the variable on which
the first split in any tree is produced. This would make the output of the trees in the forest strongly
correlated. Since at any step of any tree's growth only a subset of features is randomly chosen as candidates for splitting, such correlation is considerably reduced. Consequently, the risk associated
with Ψ_L^RF(x, θ_1, . . . , θ_M),

R(Ψ_L^RF(x, θ_1, . . . , θ_M)) = E_{Y,L,θ_1,...,θ_M}[L(Y − Ψ_L^RF(x, θ_1, . . . , θ_M))],
is generally lower than the risk associated with the predictor produced by an individual regression
tree.
Some words need to be spent on the resampling of L_m, m = 1, . . . , M, the n-dimensional training samples of the trees in the forest. Originally, Breiman (2001) fixed n equal to the sample size, N. Under this circumstance, the L_m's are bootstrap samples from the original dataset, and the forest predictor is obtained through a bootstrap aggregating scheme that Breiman called bagging. It can be shown that, on average, each tree is trained on about two-thirds of the observations in L. The remaining observations are referred to as out-of-bag (OOB) observations. We can predict the response for the ith observation in L by averaging the predictions of all trees for which that observation was OOB. This can be done for each observation in L, allowing us to estimate the risk associated with the random forest predictor without resorting to any test set independent of L (Breiman, 1996, 2001). If N is large, we can fix n ≪ N, subsample the L_m's independently, and proceed analogously.
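A minimal sketch of the OOB risk estimate with the randomForest package; the data are synthetic and the model is purely illustrative:

library(randomForest)

set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- df$x1 - 0.5 * df$x2 + rnorm(n)

rf <- randomForest(y ~ ., data = df, ntree = 500)

# rf$predicted contains, for each observation, the average prediction of the
# trees for which that observation was out-of-bag, so this mean squared error
# is an OOB estimate of the forest's risk, with no separate test set needed.
mean((df$y - rf$predicted)^2)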
Random forests are less interpretable than individual regression trees. However, they allow
us to quantify the importance of each feature in predicting the response. The two main tools
available are the mean decrease impurity (MDI) and the mean decrease accuracy (MDA). For each
feature, say Xj , we can compute the total decrease in node impurity when splitting on that variable
and averaging over all trees in the forest. This average is the MDI of Xj . Features can be ranked in
terms of the MDI: the most relevant variables will be the ones with high MDI. Alternatively, we
can estimate the mean square prediction error on the OOB prediction. Let us call such estimate
R̂. If we are interested in the relevance of feature X_j, we can proceed as follows. Permute the
values of Xj over the OOB items and estimate the mean square prediction error on the perturbed
OOB. Let us call this estimate R̂ j . Then the MDA is defined as
MDA(Xj ) = R̂ j − R̂.
A value of MDA(X_j) close to zero means that X_j does not affect the prediction, whereas a high value of MDA(X_j) indicates that X_j has relevant predictive power.
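In R, both scores are returned by the randomForest package once the forest is grown with importance = TRUE; the sketch below uses synthetic data, and note that randomForest reports MDA as a scaled percentage increase in MSE, so the scale differs from R̂_j − R̂ even though the ranking logic is the same:

library(randomForest)

set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
df$y <- 2 * df$x1 + df$x2 + rnorm(n)

# importance = TRUE asks randomForest to compute the permutation-based score.
rf <- randomForest(y ~ ., data = df, ntree = 500, importance = TRUE)

importance(rf, type = 1)  # mean decrease accuracy (permutation importance)
importance(rf, type = 2)  # mean decrease impurity (total decrease in node purity)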
where L_{x_j} is the subset of L containing all the items on which the jth component of x is equal to x_j, N_{x_j} refers to the number of items in L_{x_j}, and x_j is the specific value of the jth feature at which the function is computed. The partial dependence plot is the graphical representation of such a function, where the feature under investigation is given on the horizontal axis and the average prediction of the forest on the vertical one.
Similarly, a two-way partial dependence function can be obtained by computing the average prediction of the forest while keeping the values of two features fixed. This function allows for studying the joint effect of two features of interest on the prediction, marginalizing out the remaining explanatory variables.
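A minimal sketch of one-way and two-way partial dependence with the pdp package (which the chapter's appendix also loads); the data and variable names are synthetic placeholders:

library(randomForest)
library(pdp)

set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- sin(df$x1) + 0.5 * df$x2 + rnorm(n, sd = 0.2)

rf <- randomForest(y ~ ., data = df, ntree = 300)

# One-way partial dependence of the forest prediction on x1.
pd1 <- partial(rf, pred.var = "x1", train = df)
plotPartial(pd1)

# Two-way partial dependence on (x1, x2), marginalizing out the other inputs.
pd2 <- partial(rf, pred.var = c("x1", "x2"), train = df, chull = TRUE)
plotPartial(pd2)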
Figure 2.3 compares the in- and out-of-sample predictions produced by the model to the
observed values. The training set residuals in Figure 2.4a exhibit higher dispersion in the periods
1975–1985 and 2001–2003. In the former the American bond market was under pressure in the
aftermath of the 1973–1974 oil crisis and the consequent deterioration of the trade balance and
deficit conditions. In the latter the dot-com bubble burst gave rise to higher uncertainty and
volatility levels in the financial markets. As far as the test set is concerned, in Figure 2.4b we
observe a similar behavior of the prediction errors, due to the financial crisis of 2008/2009.
Panels (c) and (d) of Figure 2.4 show the MDA and MDI variable importance scores. Both
scores agree in selecting the most relevant variables in the groups of interest and exchange rates,
prices, and stock market. The first five variables among them, with some differences in their
order, represent interest rates and are described in detail in Table 2.1. The partial dependence
plots, shown in Figure 2.5a, provide some interesting insights: the predicted return of the ICE
BofA index is negatively correlated with the Moody’s Seasoned BAA and AAA Corporate Bond
Yield and the Ten-Year Treasury Rate. Some of our findings are in line with the in-sample analysis
obtained by Ludvigson and Ng (2009) with a factor model approach. Returns on government and
corporate bonds are the main drivers of the bond market returns. In our findings, longer maturities (from 1 to 10 years) are more relevant than shorter maturities. Real variables on employment and production conditions are important according to MDI and MDA, respectively. Differently from
Figure 2.3 Actual and predicted log-returns on ICE BofA US Corporate Index (vertical axis)
over time (horizontal axis).
Figure 2.4 Log-returns on ICE BofA US Corporate Index: residuals on the training sample (panel a, vertical axis) and prediction errors on the validation sample (panel b, vertical axis) over time (horizontal axis); mean decrease accuracy (panel c) and mean decrease impurity (panel d) variable importance rankings. The five top-ranked variables under both scores are GS10, AAA, BAA, GS5, and GS1.
Ludvigson and Ng (2009), consumption variables (changes in expenditure for durable goods)
are relevant according to MDI, whereas inflation and price changes are not important. Stock market
factors (such as S&P Dividend Yield) are less relevant than real variables and interest rates.
The non-parametric nature of the random forest regression allows us to provide evidence of
a nonlinear relationship between the predictors and the response variable (e.g., see the S-shaped
curves in Figure 2.5a). The nonlinearity of the partial dependence function is confirmed by the
two-way partial dependence plot for GS10 and BAA (Figure 2.5b). The plot obtained using the
randomForestExplainer package (Paluszynska, Biecek, & Jiang, 2019) suggests that a joint increase of the two yields negatively affects the prediction. It is also worth noticing that the GS10 variable has an overwhelming effect relative to BAA: when keeping BAA fixed, variations in GS10 produce strong changes in predictions, whereas evidence of a symmetric effect is much weaker.
In the rolling regression, the variables that most often appeared among the first five positions according to MDA and MDI are AAA, BAA, GS10, GS5, and GS1, which confirms the results of the static analysis. Furthermore, the sequential analysis (see Figure 2.5c,d) provides further
evidence of nonlinearities and new evidence of substantial changes in the shapes of the partial
dependence function over the different subperiods considered (Jun. 1973–Sep. 1976, Jan. 1998–
Apr. 2001, Sep. 2014–Dec. 2017).
Figure 2.5 Log-returns on ICE BofA US Corporate Index analysis: static partial dependence plots of BAA, AAA, and GS10 (panel a); two-way partial dependence plot involving Moody's Seasoned BAA Corporate Bond Yield and the Ten-Year Treasury Rate (panel b); sequential partial dependence plots of BAA (panel c) and GS10 (panel d) corresponding to the three different periods: Jun. 1973–Sep. 1976, Jan. 1998–Apr. 2001, Sep. 2014–Dec. 2017.
Small and medium-sized enterprises have received attention in the credit risk literature (Calabrese, Marra, & Osmetti, 2016; Calabrese & Osmetti, 2013) due to the acute challenges they face during periods of economic and financial distress.
We propose an original application of random forest classification to 109,836 firms located in
the North-Eastern Italy regions (Veneto, Trentino Alto Adige, and Friuli Venezia Giulia) which
published their last financial statements in 2018 or 2019. The analysis is particularly impor-
tant since these regions are the first ones planning and starting reopening after the lockdown
due to the COVID-19 pandemic. An analysis of the weakest companies and of the determi-
nants of their frailty is crucial for studying early warning indicators and making correct policy
interventions.
The response variable is the legal status of the companies, which is a categorical variable labeled Default or Active with some abuse of terminology: under the Default category we include all the enterprises that have been declared bankrupt, are insolvent, or are under receivership proceedings. Our purpose is to predict the Default state of an
enterprise, given the information provided by some salient features of the company, its financial
statements, and the main financial and profitability ratios. The number of inputs so defined is
104. Missing values have been estimated through the proximity measure (Breiman, 2003). One important feature of the population is that only 689 enterprises out of 109,836 are classified as
Default. This implies that any sample should be strongly unbalanced (i.e., it should contain
a very small proportion of Default cases) in order to represent the whole population. Such
class imbalance problem is common to many classification problems (Lemaître, Nogueira, & Ari-
das, 2017; Li, Bellotti, & Adams, 2019), and it determines an important undesired effect: while
Active cases would be correctly classified with high probability, the Default ones would
often be misclassified. This would happen independently of sample size: on large samples we
would correctly identify the Active cases with probability close to one, but we would pre-
dict as Active a high proportion of defaults. We assume this type of classification error is most
costly and randomly undersample the majority class to reduce the misclassification rate within the
Default cases.
We built two random forests with different size and composition of the training set to study
the effect of the undersampling rate. In the first one, the training set has size 10,344, with 10,000
Active cases, whereas in the second one, the training set has size 688, with 344 Active
enterprises. Clearly, the latter is far from representing the population, but it is perfectly balanced
in terms of legal status categories. Each random forest counts 500 trees, the number of features
chosen at each split is the default value of 10 (the square root of input variables), the minimum
number of observations in a terminal node in order for a split to be attempted is 1, and there
are no restrictions regarding the number of leaves in a single tree; the bootstrap samples have the
same size of the training set.
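A minimal sketch of the undersampling scheme and forest settings just described, run on synthetic data; the data frame firms, its variables, and the decision rule generating defaults are invented for illustration (the chapter's analysis uses 104 balance sheet inputs and sets mtry = 10, roughly the square root of the input count):

library(randomForest)

set.seed(1)
n <- 20000
firms <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
firms$LegalStatus <- factor(ifelse(firms$x1 + 2 * firms$x2 + rnorm(n) > 4.5,
                                   "Default", "Active"))   # rare Default class

# Undersample the majority class to obtain a balanced training set.
def_id <- which(firms$LegalStatus == "Default")
act_id <- sample(which(firms$LegalStatus == "Active"), length(def_id))
train  <- firms[c(def_id, act_id), ]

# 500 trees, bootstrap samples of the training-set size, minimum node size 1.
rf_bal <- randomForest(LegalStatus ~ ., data = train,
                       ntree = 500, nodesize = 1, replace = TRUE)
rf_bal$confusion   # OOB confusion matrix with per-class error rates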
Comparing Table 2.2a,b with Table 2.2c,d, we can notice that when we use the larger training set the overall misclassification rate is 0.015 in both the training and the test set, whereas it increases to 0.096 and 0.054, respectively, in the training and test sets when we use the smaller one. However, the misclassification rate within the Default class decreases from about 0.345 to 0.137 in the training set and from 0.316 to 0.128 in the test set when we move from the former to the latter. Figure 2.6a,b
show the ranking of the first 16 variables in terms of MDA and MDI scores. We can notice that
overwhelming importance is attributed to the amount of net capital and to the degree of finan-
cial dependence on third parties. Furthermore, a joint increase in the two variables produces an
Table 2.2 Out-of-bag (left) and test set (right) predictions and classification errors with a training sample size of 10,344 (panels a and b) and of 688 (panels c and d)

(a)                                          (b)
            Active   Default   Class. err.              Active   Default   Class. err.
Active        9961        39         0.004   Active       9956        44         0.004
Default        120       224         0.345   Default       109       236         0.316

(c)                                          (d)
            Active   Default   Class. err.              Active   Default   Class. err.
Active         325        19         0.055   Active       9484       516         0.052
Default         47       297         0.137   Default        44       301         0.128
Figure 2.6 Default risk analysis: variable importance following the mean decrease accuracy (panel a) and mean decrease impurity (panel b) variable rankings. Net_Capital ranks first under both scores, followed by the ratios of net capital to total assets and to debts, leverage, and their lagged values.
increase in the default probability (see the PDP in Figure 2.7a). Figure 2.7b shows the substantial agreement between what is predicted by the PDP and what is observed in the test set.
Figure 2.7 Default risk analysis: two-way partial dependence of default on net capital and the net capital to total assets ratio (panel a); scatterplot of net capital and the net capital to total assets ratio in the test set (panel b, default cases in red, active cases in black).
library(randomForest)
library(iml)
library(caret)
library(randomForestExplainer)
library(Metrics)
require(caTools)
library(ggplot2)
library(MASS)
library(pdp)
library(plotly)
library(plyr)
#----------------------------------------
# DATA IMPORT AND PREPARATION -----------
#----------------------------------------
data <- read.csv("Bofa_Fred.csv", header = TRUE, sep = ";")
bofa <- ts(data$BAMLCC0A0CMTRIV, start = c(1973, 1), end = c(2020, 1),
           frequency = 12)
bofa <- log(bofa)
data$BAMLCC0A0CMTRIV <- NULL
data$DATE <- NULL
#----------------------------------------
# DATASET -------------------------------
#----------------------------------------
load("SMETrainC.RData")
load("SMETestC.RData")

nc <- ncol(SMETrain)
#----------------------------------------
# DEFAULT RANDOM FOREST -----------------
#----------------------------------------
SMETrain = rfImpute(SMETrain[, 2:nc], SMETrain[, 1], iter = 5, ntree = 50)
SMETest  = rfImpute(SMETest[, 2:nc],  SMETest[, 1],  iter = 5, ntree = 50)

# random forest default setting
class_randomForest = randomForest(LegalStatus ~ ., data = SMETrain,
    replace = TRUE, nPerm = 4, importance = TRUE, proximity = TRUE,
    oob.prox = TRUE, keep.inbag = TRUE)
print(class_randomForest)

# predictions on the test set
SMETest$RF_predictions = predict(class_randomForest, newdata = SMETest)

# predictions on the training set
SMETrain$RF_predictions = predict(class_randomForest, newdata = SMETrain)

# Confusion matrix and error rates
conf_matrix = table(SMETest$LegalStatus, SMETest$RF_predictions)
ErrRates <- c(conf_matrix[1, 2] / sum(conf_matrix[1, ]),
              conf_matrix[2, 1] / sum(conf_matrix[2, ]))
conf_matrix <- cbind(conf_matrix, ErrRates)
colnames(conf_matrix)[3] <- "class.error"
print(conf_matrix)
#----------------------------------------
# FEATURES IMPORTANCE -------------------
#----------------------------------------
library(vip)   # vi_model() and vip() used below are provided by the vip package

# mean decrease accuracy
importance_MDA = vi_model(class_randomForest, type = 1, scale = TRUE)
importance_MDA = importance_MDA[order(-importance_MDA$Importance), ]
# mean decrease impurity
importance_MDI = vi_model(class_randomForest, type = 2, scale = TRUE)
importance_MDI = importance_MDI[order(-importance_MDI$Importance), ]

# partial dependence importance

# plotting all variable importance measures
MDA = vip(importance_MDA, num_features = 16, geom = "point", horiz = TRUE,
          aesthetics = list(size = 3, shape = 16), main = "MDA")
MDI = vip(importance_MDI, num_features = 16, geom = "point", horiz = TRUE,
          aesthetics = list(size = 3, shape = 16))

MDA + theme(text = element_text(size = 15)) + ggtitle("MDA")

MDI + theme(text = element_text(size = 15)) + ggtitle("MDI")
#----------------------------------------
# FEATURES INTERACTION ------------------
#----------------------------------------
# 2-way partial dependence plot
plot_predict_interaction(class_randomForest, SMETrain, names(SMETrain)[9],
    names(SMETrain)[60], main = "Joint effect on predicted default probability") +
    theme(legend.position = "bottom") +
    geom_hline(yintercept = 2, linetype = "longdash") +
    geom_vline(xintercept = 140, linetype = "longdash")

aa <- names(SMETest)[9]    # axis labels (variable names) for the scatterplot
bb <- names(SMETest)[60]
plot(SMETest[, 9], SMETest[, 60], col = SMETest[, 1],
     xlim = c(-100000, 80000), xlab = aa, ylab = bb, pch = 18)
legend("topleft", legend = c("default", "active"), col = c("red", "black"),
       pch = rep(18, 2))
References
Bianchi, D., Büchner, M., & Tamoni, A. (2020). Bond risk premia with machine learning. Review of Financial Studies, forthcoming.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L. (2003). Setting up, using, and understanding random forests V4.0. Retrieved from https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. New York: Chapman & Hall.
Calabrese, R., Marra, G., & Osmetti, S. A. (2016). Bankruptcy prediction of small and medium enterprises using a flexible binary generalized extreme value model. Journal of the Operational Research Society, 67(4), 604–615.
Calabrese, R., & Osmetti, S. A. (2013). Modelling small and medium enterprise loan defaults as rare events: The generalized extreme value regression model. Journal of Applied Statistics, 40(6), 1172–1188.
Cochrane, J. H., & Piazzesi, M. (2005). Bond risk premia. American Economic Review, 95(1), 138–160.
Esposito, F., Malerba, D., Semeraro, G., & Kay, J. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476–491.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Gupta, B., Rawat, A., Jain, A., Arora, A., & Dhami, N. (2017). Analysis of various decision tree algorithms for classification in data mining. International Journal of Computer Applications, 163, 15–19.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. Springer.
Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17), 1–5.
Li, Y., Bellotti, T., & Adams, N. (2019). Issues using logistic regression with class imbalance, with a case study from credit risk modelling. Foundations of Data Science, 1(4), 389–417.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from https://CRAN.R-project.org/doc/Rnews/
Louppe, G. (2014). Understanding random forests: From theory to practice. Retrieved from https://arxiv.org/abs/1407.7502
Ludvigson, S., & Ng, S. (2009). Macro factors in bond risk premia. Review of Financial Studies, 22(12), 5027–5067.
Maimon, O., & Rokach, L. (2005). Data mining and knowledge discovery handbook. Springer-Verlag.
McCracken, M. W., & Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34(4), 574–589.
Paluszynska, A., Biecek, P., & Jiang, Y. (2019). randomForestExplainer: Explaining and visualizing random forests in terms of variable importance [Computer software manual]. Retrieved from https://cran.r-project.org/web/packages/randomForestExplainer/randomForestExplainer.pdf
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
Therneau, T., & Atkinson, B. (2019). rpart: Recursive partitioning and regression trees [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=rpart (R package version 4.1-15)
Chapter 3
Improving longevity risk management via machine learning
3.1 Introduction
Over the last century, human mortality has declined globally. Living longer is a good thing, but it needs to be combined with a satisfactory standard of living in retirement. Before 2000, high financial returns sustained consumption and other expenses of the elderly. Later, limited equity market performance and low interest rates, together with life expectancy improvements, have represented a challenge for the pension industry and the insurance sector. It is clear that the increase in life expectancy poses many questions for individuals approaching retirement, for insurance companies offering life products, for pension plans, and for governments facing rising pension and healthcare costs.
The changes in mortality trends strongly impact the pricing and reserve allocation of life annuities and the sustainability of social security systems. In 2012, the International Monetary Fund estimated that each additional year of life expectancy added about 3%–4% to the present value of the liabilities of a typical defined benefit pension fund. To manage the uncertainty related to future life expectancy, the agents involved are trying to transfer such risk to capital markets. Due to the long-term nature of the risks, accurate longevity projections are delicate, and the longevity risk transfer is a difficult process to realize without a theoretical framework recognized among actuaries and the industry. At the same time, investors are looking for alternative investment assets for diversification purposes. The capital markets offer a complementary channel for distributing longevity risk, and the players involved try to develop financial instruments indexed to the longevity of the population. Longevity bonds and longevity derivatives are designed to transfer the risk of higher life expectancy to investors. Clearly, a correct quantification of longevity risk is necessary on the one hand to offer an adequate risk premium to investors and on the other hand to evaluate insurance liabilities as close as possible to the real obligations and to define appropriate pricing policies. As insurers used to say, there are no bad risks, only bad pricing, meaning that companies are able to offer protection against risks as long as those risks are properly priced.
Nowadays, many insurers still rely on traditional methods when evaluating risk. As regards longevity risk, the majority of actuarial researchers and practitioners make predictions resorting to classical demographic frameworks based on traditional extrapolative models. Among the
stochastic models for mortality, the widely used Lee and Carter (1992) model introduces a frame-
work for central mortality rates involving both age- and time-dependent terms. The Renshaw-
Haberman model (Renshaw and Haberman, 2003, 2006) was one of the first to incorporate a
cohort effect parameter to characterize the observed variations in mortality among individuals
from differing cohorts. Other approaches make use of penalized splines to smooth mortality and
derive future mortality patterns (Currie et al., 2004).
Recently, Artificial Intelligence (AI) in general and Machine Learning (ML) in particular have been appearing on the landscape of actuarial research and practice, albeit belatedly and slowly with respect to other areas such as medicine, industry, and finance. During the last three years, the actuarial literature has presented the first results of the application of ML techniques to mortality forecasting and longevity management. Some early unsupervised methods were proposed in the context of mortality analysis with applications to different fields of medicine, but only lately have they been exploited by demographers (Carracedo et al., 2018) and actuaries (Piscopo and Resta, 2017). As regards supervised approaches, Deprez et al. (2017) use some ML techniques to improve the estimation of log mortality rates. The model has been extended by Levantesi and Pizzorusso (2019), who take advantage of ML to improve the mortality forecasts in the Lee-Carter framework. Deep Learning (DL) techniques have been proposed by Hainaut (2018), who employs a neural network to model mortality rates. Richman and Wüthrich (2018) have proposed a multiple-population Lee-Carter model where parameters are estimated using neural networks. Nigri et al. (2019) have integrated the original Lee-Carter formulation with a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) architecture to produce mortality forecasts more coherent with the observed mortality dynamics, also in cases of nonlinear mortality trends. In the remainder of the chapter, we focus on the application of ML to longevity data. Our original contribution lies in quantifying the impact of longevity risk modeled with ML on two insurance products; to the best of our knowledge, the literature contains no empirical analyses applied to real insurance policies. We show the progress of results in terms of better fitting and projections of mortality with respect to the classical approaches.
This chapter illustrates how ML can be used to improve both the fitting and the forecasting of traditional stochastic mortality models, taking advantage of AI to better understand processes that are not fully identifiable by standard models. It is organized as follows. Section 2 introduces the framework of the generalized age-period-cohort models and the main accuracy measures. Section 3 summarizes the literature on mortality modeling with ML, also providing a brief overview of the Classification And Regression Trees (CART) approach that has been used until now in mortality modeling. It discusses the improvement obtained by ML in the mortality fitting provided
η_x = α_a + Σ_i β_a^(i) κ_t^(i) + β_a^(0) γ_{t−a},

where α_a is the age-specific parameter giving the average age profile of mortality, κ_t^(i) is the time index and β_a^(i) modifies its effect across ages, so that their products are the age-period terms describing the mortality trends, γ_{t−a} is the cohort parameter, and β_a^(0) modifies its effect across ages (c = t − a is the year of birth), so that β_a^(0) γ_{t−a} is the cohort effect. The predictor η_x is linked to a function g as follows: η_x = g(E(D_x/E_x)). The models here mentioned consider the log
link function and assume that the numbers of deaths Dx follow a Poisson distribution. All the
analyses reported in this chapter consider the Lee-Carter model as proposed by Brouhns et al.
(2005). In addition, Deprez et al. (2017) analyze the Renshaw-Haberman model, and Levantesi and Pizzorusso (2019) also consider the latter and the Plat model.
Following the GAPC framework, the Lee-Carter model is specified by

log(m_x) = α_a + β_a^(1) κ_t^(1)   (3.2)

under the constraints Σ_{t∈T} κ_t^(1) = 0 and Σ_{a∈A} β_a^(1) = 1, which allow us to avoid identifiability problems with the parameters. The forecasted probabilities are obtained by modeling the time index κ_t^(1) with an autoregressive integrated moving average (ARIMA) process. The random walk with drift properly fits the data: κ_t^(1) = κ_{t−1}^(1) + δ + ε_t, with ε_t ∼ N(0, σ_κ²), where δ is the drift parameter and the ε_t are the error terms, normally distributed with zero mean and variance σ_κ². Moreover, we briefly introduce the Renshaw-Haberman model, which extends the Lee-Carter model by including a cohort effect:¹ log(m_x) = α_a + β_a^(1) κ_t^(1) + γ_{t−a}, subject to the constraints Σ_{t∈T} κ_t^(1) = 0, Σ_{a∈A} β_a^(1) = 1, and Σ_{c∈C} γ_c = 0, where c = t − a; and the Plat model: log(m_x) = α_a + κ_t^(1) + κ_t^(2)(ā − a) + κ_t^(3)(ā − a)⁺ + γ_{t−a}, where (ā − a)⁺ = max(ā − a, 0).
are reasonable in a demographic perspective, and a sensitivity analysis on the age range in order to verify the level of the improvement provided by the model on a reduced dataset. The case study is based on the mortality data of Australia, France, Italy, Spain, the UK, and the USA, on the set of variables A = {20, . . . , 100}, T = {1947, . . . , 2014}, and C = {1847, . . . , 1994}, and on both genders. The authors obtained significant improvements in the mortality projection of the Lee-Carter model by applying the RF estimator. These results hold for all the analyzed countries.
Note that the condition ψ_x ≡ 1 is equivalent to stating that the mortality model completely fits the crude rates. This is an ideal condition, as in the real world the model might overestimate (ψ_x ≤ 1) or underestimate (ψ_x ≥ 1) the crude rates. The aim of the approach used in Deprez et al. (2017), Levantesi and Pizzorusso (2019), and Levantesi and Nigri (2020) is essentially to calibrate the parameter ψ_x according to an ML algorithm in order to improve the fitting accuracy of the mortality model. The estimator ψ_x is the solution of a tree-based algorithm applied to the ratio between the observed deaths and the corresponding values estimated by the specified mortality model:

D_x / d_x^j ∼ gender + age + year + cohort   (3.6)

Let ψ̂_x^{j,ML} denote the ML estimator that is the solution of equation (3.6), where j is the mortality model and ML the ML technique we refer to. In order to reach a better fit of the observed data, the central death rates of the mortality model, m_x^j, are adjusted through ψ̂_x^{j,ML} as follows:

m_x^{j,ML} = ψ̂_x^{j,ML} m_x^j,  ∀x ∈ X   (3.7)
The mortality improvement reached by the ML algorithm can also be measured through the relative changes of central death rates, Δm_x^{j,ML} = ψ̂_x^{j,ML} − 1. The values of ψ̂_x^{j,ML} are then obtained by applying an ML algorithm. As shown in Levantesi and Nigri (2020), this approach can also be used to diagnose the limits of the traditional mortality models.
The ML algorithms used in the applications provided by the specific literature on mortality modeling belong to the CART family, namely Decision Tree (DT), Random Forest (RF), and Gradient Boosting Machine (GBM), and are summarized in the following. DT operates a partition of a predictor space by a sequence of binary splits, giving rise to a tree (Hastie et al., 2016). Following a hierarchical structure, the predictor space is recursively split into simple regions, and the response for
chical structure, the predictor space is recursively split into simple regions, and the response for
a given observation is predicted by the mean of the training observations in the region to which
that observation belongs (James et al., 2017).
Let (X_τ)_{τ∈T} be the partition of X into J distinct and non-overlapping regions. The DT estimator, given a set of variables x, is defined as ψ̂^{DT}(x) = Σ_{τ∈T} ψ̂_τ 1_{x∈X_τ}, where 1_{·} is the indicator function. The regions (X_τ)_{τ∈T} are found by minimizing the residual sum of squares.
RF aggregates many DTs, obtained by generating bootstrap training samples from the original dataset (Breiman, 2001). The main characteristic of this algorithm is to select a random subset of predictors at each split, thus preventing the dominance of strong predictors in the splits of each tree (James et al., 2017). The RF estimator is calculated as ψ̂^{RF}(x) = (1/B) Σ_{b=1}^{B} ψ̂^{(DT)}(x | b), where B is the number of bootstrap samples and ψ̂^{(DT)}(x | b) is the DT estimator on the sample b.
GBM considers a sequential approach in which each DT uses the information from the previous one in order to improve the current fit (Friedman, 2001). Given the current fit ψ̂_{i−1}(x), at each stage i (for i = 1, 2, . . . , N), the GBM algorithm provides a new model by adding an estimator h to improve the fit: ψ̂_i(x) = ψ̂_{i−1}(x) + λ_i h_i(x), where h_i ∈ H is a base learner function (H is the set of arbitrary differentiable functions) and λ_i is a multiplier obtained by solving an optimization problem.
Furthermore, going beyond the logic of ML, Levantesi and Pizzorusso (2019) propose to forecast the ML estimator using the same framework of the original mortality model. This approach is tested on the Lee-Carter model (j = LC), where the ML estimator is modeled as follows:

log(ψ̂_x^{LC,ML}) = α_a^ψ + β_a^{(1,ψ)} κ_t^{(1,ψ)}   (3.8)
The use of the same framework of the original mortality model to fit and forecast the ML esti-
mators ψ̂x , as proposed by Levantesi and Nigri (2020), allows both to improve the mortality
projections’ accuracy and to analyze the effect of such improvements directly on the model’s
parameters.
The forecasting performance of the Lee-Carter model can also be improved by following the methodology suggested by Levantesi and Nigri (2020), which finds the future values of the ML estimator through the extrapolation of the ψ̂'s, previously smoothed with two-dimensional P-splines (Eilers and Marx, 1996). In this method, forecasting is a natural consequence of the smoothing process: future values are treated as missing values that are estimated by the two-dimensional P-splines (Currie et al., 2006). The form of the forecast is then determined by the penalty function, which matters more in the extrapolation of future values than in the smoothing of the data.
Step 1: fitting the Lee-Carter model (denoted by "LC") with the StMoMo package (Villegas et al., 2015);
Step 2: fitting the ML estimator ψ_x^{LC,ML}, which is the solution of equation (3.6), using the random forest algorithm, thus obtaining ψ̂_x^{LC,RF};
Step 3: adjusting the central death rates of the Lee-Carter model through the RF estimator (see equation (3.7)): m_x^{LC,RF} = ψ̂_x^{LC,RF} m_x^{LC};
Step 4: modeling the RF estimator with two different approaches, the first one proposed by Levantesi and Pizzorusso (2019) (denoted by "LC-ψ") and the second one by Levantesi and Nigri (2020) (denoted by "LC-ψ-spl"):
   LC-ψ model: the RF estimator is modeled with the Lee-Carter model as described in equation (3.8);
   LC-ψ-spl model: the RF estimator is smoothed using two-dimensional (age and time) P-splines;
Step 5: forecasting the LC and LC-ψ models, and extrapolating the LC-ψ-spl model. The forecast period is set to 2015–2074. A minimal R sketch of Steps 1–3 is given after this list.
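A minimal sketch of Steps 1–3, assuming the StMoMo and randomForest packages behave as documented; it uses StMoMo's bundled England and Wales male dataset (EWMaleData) in place of the HMD series analyzed in the chapter, so the gender covariate of equation (3.6) is omitted and the numbers differ from those reported here:

library(StMoMo)
library(randomForest)

# Step 1: fit the Lee-Carter model with a log link (Poisson deaths).
LCfit <- fit(lc(link = "log"), data = EWMaleData)

# Step 2: fit the psi estimator of equation (3.6) with a random forest,
# using the ratio between observed and fitted central death rates.
m_hat <- fitted(LCfit, type = "rates")            # fitted central death rates
m_obs <- EWMaleData$Dxt / EWMaleData$Ext          # crude central death rates
grid  <- expand.grid(age = EWMaleData$ages, year = EWMaleData$years)
grid$cohort <- grid$year - grid$age
grid$psi    <- as.vector(m_obs / m_hat)

rfPsi <- randomForest(psi ~ age + year + cohort, data = grid, ntree = 500)

# Step 3: adjust the Lee-Carter rates as in equation (3.7).
psi_hat <- matrix(predict(rfPsi, grid), nrow = length(EWMaleData$ages))
m_adj   <- psi_hat * m_hat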
The goodness of fit is measured by the MAPE. The results are shown in Table 3.1 and evidence a large improvement in the Lee-Carter model fitting provided by the RF estimator in all the countries considered and for both genders.
To test the ability of the models, we develop an out-of-sample test splitting the dataset into two parts: data from 1947 to 1998 (fitting period) and data from 1999 to 2014 (forecasting period). The goodness of the out-of-sample test is evaluated by the RMSE. The results reported in Table 3.2 show that the RF estimator provides strong improvements over the canonical Lee-Carter model. The LC-ψ model turns out to be the best one.
Table 3.2 Out-of-sample test results: RMSE for the models LC, LC-ψ, and LC-ψ-spl. Ages 40–100 and years 1999–2014

Model        FRA               ITA               UK
             M       F         M       F         M       F
LC           0.0444  0.0352    0.0410  0.0272    0.0364  0.0222
LC-ψ         0.0159  0.0142    0.0148  0.0085    0.0123  0.0077
LC-ψ-spl     0.0232  0.0199    0.0403  0.0164    0.0129  0.0103
Now, we forecast the models over the period 2015–2074. To appreciate the differences between the models, we show in Figure 3.1 the survival probabilities _t p_x of a cohort of individuals aged 40 in 2015 over the years 2015–2074 for each model. Both LC-ψ and LC-ψ-spl lower the LC model survival probabilities.
Figure 3.1 Survival probabilities t px . Models LC-ψ and LC-ψ-spl. Ages 40–100 and years 2015–
2074.
For a term life insurance providing the beneficiary with a lump sum benefit C in case of death of the insured within n years, the single premium is

U = C · _nA_x   (3.10)

where _nA_x is the expected value at t = 0 of a random variable whose outcome is the discounted unitary sum at the technical rate i insured in the case of death:

_nA_x = Σ_{h=0}^{n−1} (1 + i)^{−(h+1)} _{h/}q_x   (3.11)

with _{h/}q_x being the probability that an insured aged x will die between ages x + h and x + h + 1.
The pure endowment insurance provides the beneficiary with a lump sum benefit C at time n if the insured is still alive. The single premium is given by

U = C · _nE_x = C (1 + i)^{−n} _np_x,

where _np_x is the probability that the insured aged x at the inception of the contract will be alive at age x + n.
For both contracts the single premium can be converted into a sequence of m premiums P paid at the policy anniversaries through the following formula:

U = P · Σ_{h=1}^{m} _{h−1}E_x
The mathematical reserve is a technical tool for assessing the insurer's debt during the period in which the contract is in force. The insurer has to register in the balance sheet the amount of the mathematical reserve to ensure the ability to meet the obligations assumed whenever the insured event occurs before the expiration date n. At time t the prospective reserve is defined as

V(t) = E[Y(t, n)] − E[X(t, n)],

where E[Y(t, n)] and E[X(t, n)] are respectively the expected actuarial present value at time t of the benefits which fall due in the period (t, n) and the expected actuarial present value of the premiums paid in the same interval.
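As a worked illustration of how the premium formulas above translate into code, here is a small base-R sketch; the function name and the flat mortality assumption are inventions for illustration, not part of the chapter's analysis:

# qx is a vector of projected one-year death probabilities for an insured aged x
# (qx[h+1] = probability of dying between ages x+h and x+h+1); C, i, n are the
# benefit, the technical rate, and the term of the contract.
single_premiums <- function(qx, C = 1000, i = 0.02, n = 25) {
  v   <- (1 + i)^-(1:n)                       # discount factors (1+i)^-(h+1)
  px  <- cumprod(1 - qx[1:n])                 # survival probabilities (h+1)_p_x
  hqx <- c(qx[1], px[1:(n - 1)] * qx[2:n])    # deferred death probabilities h/_q_x
  nAx <- sum(v * hqx)                         # term insurance factor, eq. (3.11)
  nEx <- (1 + i)^-n * px[n]                   # pure endowment factor
  c(term = C * nAx, pure_endowment = C * nEx)
}

# Example with a flat 1% mortality assumption (illustration only).
single_premiums(qx = rep(0.01, 25))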
Let us consider an Italian male insured and set C = 1000, x = 40, i = 2%, and n = m = 25 for both the term insurance contract and the pure endowment. In Table 3.3, we report the single and the periodic premiums calculated by projecting the relevant death and survival probabilities according to the three models previously described, LC, LC-ψ, and LC-ψ-spl. Figures 3.2 and 3.3 show the trends of the reserves for both contracts from issue to expiration.
Table 3.3 The single (U) and periodic (P) premiums of the term life insurance and pure endowment for the models LC, LC-ψ, and LC-ψ-spl

      Term life insurance                      Pure endowment
      LC        LC-ψ      LC-ψ-spl             LC        LC-ψ      LC-ψ-spl
U     290.8485  301.9316  311.2901             359.4046  350.4961  341.7526
P     16.30582  17.03309  17.59216             20.14928  19.77279  19.3137
Figure 3.2 Reserves for term life insurance with single and periodic premium for the models
LC, LC-ψ, and LC-ψ-spl.
Figure 3.3 Reserves for pure endowment contract with single premium for the models LC,
LC-ψ, and LC-ψ-spl.
Clearly, the longevity projections impact the pricing and reserving policies. The traditional LC model produces lower projected probabilities of death and, correspondingly, higher survival probabilities than those obtained by applying the ML models. Consequently, on the one hand, in the case of a death benefit the LC model leads to lower prices and underestimated reserves, with a negative impact on the solvency of the insurance company. On the other hand, in the case of a life benefit it leads to overestimated prices and reserves, with a potential negative impact on the competitiveness and attractiveness of the products.
3.5 Conclusions
In this chapter, we have illustrated the potential of an ML model applied to the quantification of longevity risk in the management of real insurance products. The numerical application presented aimed to highlight the impact of longevity risk on two life insurance policies, the pure endowment and the term insurance, which are the basis for more complex portfolios. The choice of one mortality projection model over another leads to different actuarial valuations in the pricing and reserving policies. The impact of longevity projections depends on the features of the insurance product involved. A realized mortality lower than projected increases the insurer's profit when death benefits are concerned.
As ML is largely changing the way in which society operates and the economy grows, in the insurance sector it can be used to manage and extract important information from the very large datasets available, increase competitiveness, reduce risk exposures, and improve profits through automated and efficient pricing policies. An important aspect is that ML can reduce one point of weakness of the insurance business, the information asymmetry between insurer and policyholder, allowing for a better understanding and quantification of the specific risk of each policyholder. This appears fundamental in the longevity risk management of life products with long-term duration, as well as of some lifelong guaranteed options embedded in most insurance and pension contracts.
The accurate assessment of the impact of longevity risk on the balance sheets of insurance companies has until now been based on the choice of stochastic models and on scenario testing; the results of the projections and the strategies implemented are model dependent. Instead, ML techniques permit the integration of a stochastic model with a data-driven approach. These tools, by improving the quantification of longevity risk, can also support its transfer through both reinsurance and the longevity capital market. As regards the first aspect, there is the perception that reinsurers are reluctant to take on this "toxic" risk (Blake et al., 2006), but these concerns could be reduced thanks to a precise understanding of the risk, based on the information extracted from large datasets. As regards the second aspect, the longevity risk market plays a role in the risk management of longevity risk. Another crucial point is the improvement of longevity risk management thanks to the appropriate exploitation of the available medical and socio-economic data. Traditional risk models require a very long time to process the huge datasets available and are often incapable of fully exploiting the information they contain. The advances of microeconomic models are favored by AI, and risk management is shifting from statistical methods such as principal component analysis to ML, which selects important variables on the basis of supervised tree algorithms that automatically identify the variables of interest for building predictive models. A life insurance segment where longevity risk has a strong impact is health insurance and long-term care for the elderly. Thanks to AI, smart sensor technologies are available for insurers to improve policyholders' health monitoring and encourage a healthier lifestyle; they are improving older people's quality of life and reducing health costs at older ages. Finally, through ML the output from internal risk models can be more accurate, and the validation process can be improved by running it on a continuous basis. In the light of the improvements achieved, we are convinced that ML can offer insurance companies and pension fund managers new tools and methods supporting actuaries in classifying longevity risks, offering accurate predictive pricing models, and reducing losses.
3.6 Appendix
# ------------ DATA ------------ #
# hmd.mx() is provided by the demography package, StMoMoData() by StMoMo.
library(demography)
library(StMoMo)

for (country in c("FRATNP", "ITA", "GBR_NP")) {
  Data = hmd.mx(country = country, username = "insert username",
                password = "insert password", label = country)
  Data.F = StMoMoData(Data, series = "female")
  Data.M = StMoMoData(Data, series = "male")
  Data.M$Dxt = round(Data.M$Dxt)
  Data.M$Ext = round(Data.M$Ext)
  Data.F$Dxt = round(Data.F$Dxt)
  Data.F$Ext = round(Data.F$Ext)
  ages   = 40:100
  years0 = 1947:2014
  years  = years0
  lim1.y = years[1] + 1 - Data.M$years[1]
  limn.y = tail(years, 1) + 1 - Data.M$years[1]
  lim1.a = ages[1] + 1
  limn.a = tail(ages, 1) + 1
  n.data = length(years) * length(ages)
  Years  = rep(rep(years[1]:tail(years, 1), each = length(ages)), 2)
  Ages   = rep(rep(ages, length(years)), 2)
  Cohort = Years - Ages
  Gender = c(rep("F", length(years) * length(ages)),
             rep("M", length(years) * length(ages)))
  Dxt.C = c(as.vector(Data.F$Dxt[lim1.a:limn.a, lim1.y:limn.y]),
            as.vector(Data.M$Dxt[lim1.a:limn.a, lim1.y:limn.y]))
  Ext.C = c(as.vector(Data.F$Ext[lim1.a:limn.a, lim1.y:limn.y]),
            as.vector(Data.M$Ext[lim1.a:limn.a, lim1.y:limn.y]))
  Data.ML.C = data.frame(Year = Years, Age = Ages, Cohort = Cohort,
                         Gender = Gender, Dxt = Dxt.C, Ext = Ext.C)
  # crude mortality rates
  CR.M = Data.M$Dxt[ages[2]:(tail(ages, 1) + 1), lim1.y:limn.y] /
         Data.M$Ext[ages[2]:(tail(ages, 1) + 1), lim1.y:limn.y]
  CR.F = Data.F$Dxt[ages[2]:(tail(ages, 1) + 1), lim1.y:limn.y] /
         Data.F$Ext[ages[2]:(tail(ages, 1) + 1), lim1.y:limn.y]
  write.table(CR.M, file = paste("CR.M", "_", country, ".txt", sep = ""))
  write.table(CR.F, file = paste("CR.F", "_", country, ".txt", sep = ""))
  log.CR.F = log(CR.F)
  log.CR.M = log(CR.M)
}
for (country in c("FRATNP", "ITA", "GBR_NP")) {
  for (sex0 in c("M", "F")) {
    if (sex0 == "M") {
      qx <- as.matrix(read.table(paste("LC.fore.mM", "_", country, ".txt", sep = "")))
      psi_LP <- as.matrix(read.table(paste("corfit", "_", name, "_", country, ".txt", sep = "")))
      psi_LN <- as.matrix(read.table(paste("fit2D_corrf_LC_M", "_", country, ".txt", sep = "")))
      qxcor.LC <- qx * psi_LP
      qxcor <- qx * psi_LN
      write.table(qx, file = paste("qx", "_", sex0, "_", country, ".txt", sep = ""))
      write.table(qxcor.LC, file = paste("qxcor.LC", "_", sex0, "_", country, ".txt", sep = ""))
      write.table(qxcor, file = paste("qxcor", "_", sex0, "_", country, ".txt", sep = ""))
      write.table(cbind(diag(qx), diag(qxcor.LC), diag(qxcor)),
                  file = paste("qx.coh", "_", sex0, "_", country, ".txt", sep = ""))
    }
    if (sex0 == "F") {
      qx <- as.matrix(read.table(paste("LC.fore.mF", "_", country, ".txt", sep = "")))
      psi_LP <- as.matrix(read.table(paste("corfit", "_", name, "_", country, ".txt", sep = "")))
      psi_LN <- as.matrix(read.table(paste("fit2D_corrf_LC_F", "_", country, ".txt", sep = "")))
      qxcor.LC <- qx * psi_LP
      qxcor <- qx * psi_LN
      # same outputs as for males, written with the "F" label
      write.table(qx, file = paste("qx", "_", sex0, "_", country, ".txt", sep = ""))
      write.table(qxcor.LC, file = paste("qxcor.LC", "_", sex0, "_", country, ".txt", sep = ""))
      write.table(qxcor, file = paste("qxcor", "_", sex0, "_", country, ".txt", sep = ""))
      write.table(cbind(diag(qx), diag(qxcor.LC), diag(qxcor)),
                  file = paste("qx.coh", "_", sex0, "_", country, ".txt", sep = ""))
    }
  }
}
Note
1 The setting has been proposed by Haberman and Renshaw (2011) for reaching a better stability with respect to
the original version.
References
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brouhns, N., Denuit, M., & van Keilegom, I. (2005). Bootstrapping the Poisson logbilinear model for
mortality forecasting. Scandinavian Actuarial Journal, 3, 212–224.
Carracedo, P., Debón, A., Iftimi, A., et al. (2018). Detecting spatio-temporal mortality clusters of European countries by sex and age. International Journal for Equity in Health, 17, 38.
Currie, I. D., Durban, M., & Eilers P. H. C. (2004). Smoothing and forecasting mortality rates. Statistical
Modelling, 4, 279–298.
Currie, I. D., Durban, M., & Eilers, P. H. C. (2006). Generalized linear array models with applications to
multidimentional smoothing. Journal of the Royal Statistical Society B, 68, 259–280.
Deprez, P., Shevchenko, P. V., & Wüthrich, M. (2017). Machine learning techniques for mortality modeling.
European Actuarial Journal, 7 (2), 337–352.
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science,
11, 89–102.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics,
29, 1189–1232.
Hainaut, D. (2018). A neural-network analyzer for mortality forecast. Astin Bulletin, 48, 481–508.
Hastie, T., Tibshirani, R., & Friedman, J. (2016). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An introduction to statistical learning: with appli-
cations in R. New York: Springer.
Lee, R. D., & Carter, L. R. (1992). Modeling and forecasting US mortality. Journal of the American Statistical
Association, 87, 659–671.
Levantesi, S., & Nigri, A. (2020). A random forest algorithm to improve the Lee-Carter mortality forecast-
ing: Impact on q-forward. Soft Computing, 24, 8553–8567.
Levantesi, S., & Pizzorusso, V. (2019). Application of machine learning to mortality modeling and fore-
casting. Risks, 7 (1), 26.
Li, N., Lee, R., & Gerland, P. (2013). Extending the Lee-Carter method to model the rotation of age
patterns of mortality decline for long-term projections. Demography, 50(6), 2037–2051.
Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S., & Perla, F. (2019). A deep learning integrated Lee-
Carter model. Risks, 7 (1), 33.
Piscopo, G., & Resta, M.(2017). Applying spectral biclustering to mortality data. Risks, 5, 24.
Renshaw A. E., & Haberman S. (2003). On the forecasting of mortality reduction factors. Insurance:
Mathematics and Economics, 32(3), 379–401.
Renshaw, A. E., & Haberman, S. (2006). A cohort-based extension to the Lee-Carter model for mortality
reduction factors. Insurance: Mathematics and Economics, 38, 556–570.
Richman, R., & Wüthrich, M. (2018). A neural network extension of the Lee-Carter model to multiple
populations. SSRN manuscript, id 3270877.
Villegas, A. M., Kaishev, V. K., & Millossovich, P. (2015). StMoMo: An R package for stochastic mortality modelling. Available at: https://cran.r-project.org/web/packages/StMoMo/vignettes/StMoMoVignette.pdf.
Chapter 4
Kernel switching ridge regression
4.1 Introduction
Business intelligence measurements depend on a variety of advanced statistical technologies used to enable the collection, analysis, and dissemination of internal and external business information. The key goal of business intelligence is to support well-informed business decisions. Statistical machine learning-based business intelligence systems (e.g., supervised learning) prioritize a comprehensive analysis to empower businesses (Brannon, 2010). Many supervised learning algorithms have been developed to improve business decisions. Among the most effective algorithms are support vector machines (SVM) and support vector regression with a positive-definite kernel (Wang et al., 2005). Regression analysis is a classical yet highly useful approach in supervised learning to identify trends and establish relationships in business data (Duan and Xu, 2012).
Regression techniques tell us the degree of change of a response variable caused by one or more predictor variables, using the conditional mean of the response variable given the predictor variables instead of the mean of the response variable. Let Y be a response variable and X a vector of predictor variables with finite mean; the mean of the conditional mean of a random variable given the other variable equals the mean of that random variable, i.e., E_X[E_Y(Y | X)] = E_Y(Y). Regression analysis deals with the conditional mean of Y given X, E(Y | X = x), and the data are assumed to come from a single class or regime, so that the mean or the conditional mean works well. But neither the mean nor the conditional mean works properly for data collected from different and mixed classes. For example, to fit a nonlinear data set with two classes, we apply two nonlinear regression approaches: kernel ridge regression (KRR) and support vector regression (SVR) (Saunders and Gammerman, 1998; Shawe-Taylor and Cristianini, 2004). Figure 4.1 presents a scatter plot of the original data and the fitted curves of both methods. It is clearly shown that both methods fail to fit the data properly. To overcome the problem, the switching regression model (Quandt, 1972) is an appropriate choice. The switching regression
the switching regression model is appropriate to study (Quandt, 1972). The switching regression
model forms a suitable model class for the regression problems with unknown heterogeneity.
In supervised learning, the switching regression approach has become a well-known tool to deal with more than one regime at known or unknown switch points (Dannemann and Holzmann, 2010; Liu and Lin, 2013; Malikov and Kumbhakar, 2014; Souza and Heckman, 2014). However, with a large number of explanatory variables, it has been shown that switching regression parameter estimates have a high probability of producing unsatisfactory results when the predictor vectors are non-orthogonal. Ridge trace-based estimation has been found to be a procedure that can be used to overcome these estimation difficulties in supervised learning.
Ridge regression is a standard statistical technique that aims to address the bias-variance trade-off in the design of a linear regression model. This procedure plays a central role in supervised learning and has become a very popular approach to estimating complex functions as the solution to a minimization problem (Cawley and Talbot, 2003; Hastie et al., 2009). This problem also has a dual form. In the dual ridge regression model, the original input data are mapped into some high-dimensional space called a "feature space." To represent a nonlinear regression, the algorithms construct a linear regression function in the feature space. We look at several complex data sets with a large number of parameters, and this leads to serious computational complexity that can be hard to overcome. To this end, we can deal with a dual version of the ridge regression algorithm based on a positive-definite kernel function, as a kernel method (Alam, 2014; Alam and Fukumizu, 2013; Ham et al., 2004; Saunders and Gammerman, 1998).
Since the beginning of the 21st century, positive-definite kernels have had a rich history of successful applications in nonlinear data analysis. A number of methods have been proposed based on reproducing kernel Hilbert spaces (RKHS) using positive-definite kernels, including kernel supervised and unsupervised dimensionality reduction techniques, the support vector machine (SVM), kernel ridge regression, kernel Fisher's discriminant analysis, and kernel canonical correlation analysis (Akaho, 2001; Alam, 2014; Alam and Fukumizu, 2013; Alam et al., 2019; Bach and Jordan, 2002; Fukumizu et al., 2009; Hofmann et al., 2008).
Figure 4.1 Scatter plot of the original data and the fitted curves of kernel ridge regression (KRR, green curve) and support vector regression (SVR, orange curve).
Kernel ridge regression and support vector regression do not consider any regime structure in the data, so they are not sufficient for data with two or more regimes in statistical machine learning. However, to the best of our knowledge, few well-founded methods have been established for penalized, nonlinear switching regression in an RKHS (Alam, 2014; Richfield et al., 2017).
The goal of this chapter is to propose a method called "kernel switching ridge regression" (KSRR) for a comprehensive analysis of business data. The proposed method is able to overcome unstable solutions and the curse of dimensionality, and it allows a large, even infinite, number of explanatory variables. The experimental results on synthesized data sets demonstrate that the proposed method outperforms state-of-the-art methods in supervised machine learning.
The rest of the chapter is structured as follows. In Section 2, we discuss the notion of switching regression, switching ridge regression, the dual form of ridge regression, the basic notion of kernel methods, ridge regression in the feature space, kernel ridge regression, the duality of kernel ridge regression, and kernel switching ridge regression. Experimental results on different synthetic data sets measuring the performance of the proposed method are presented in Section 3. Section 4 presents a discussion of the results. We draw conclusions and outline directions for future work in Section 5. R code for the proposed method can be found in Appendix A.
4.2 Method
In the following subsections, we present the notion of switching regression, switching ridge regres-
sion, dual form of ridge regression, basic notion of kernel methods, ridge regression in the fea-
ture space, kernel ridge regression, duality of kernel ridge regression, and kernel switching ridge
regression.
variables. The $\beta_{1j}$ and $\beta_{2j}$ are the regression coefficients of the two regimes, and the errors $\epsilon_{1i}$ and $\epsilon_{2i}$ are normally distributed with mean zero and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. The conditional density of the $i$th response $y_i$ given $x_{1i}, x_{2i}, \ldots, x_{di}$ is given by
$$
g(y_i \mid x_{1i}, x_{2i}, \ldots, x_{di}) = \frac{\lambda}{\sqrt{2\pi\sigma_1^2}}\exp\left[-\frac{1}{2\sigma_1^2}\Big(y_i - \beta_{10} - \sum_{j=1}^{d}\beta_{1j}x_{ij}\Big)^{2}\right] + \frac{1-\lambda}{\sqrt{2\pi\sigma_2^2}}\exp\left[-\frac{1}{2\sigma_2^2}\Big(y_i - \beta_{20} - \sum_{j=1}^{d}\beta_{2j}x_{ij}\Big)^{2}\right] \qquad (4.1)
$$
where $\lambda \in [0, 1]$ is the probability that observation $i$ comes from the first regime.
The corresponding log-likelihood for a sample of size $n$ is
$$
L = \sum_{i=1}^{n}\log\left\{\frac{\lambda}{\sqrt{2\pi\sigma_1^2}}\exp\left[-\frac{1}{2\sigma_1^2}\Big(y_i - \sum_{j=0}^{d}\beta_{1j}x_{ij}\Big)^{2}\right] + \frac{1-\lambda}{\sqrt{2\pi\sigma_2^2}}\exp\left[-\frac{1}{2\sigma_2^2}\Big(y_i - \sum_{j=0}^{d}\beta_{2j}x_{ij}\Big)^{2}\right]\right\} \qquad (4.2)
$$
where $x_{i0} = 1$. We need to maximize this nonlinear function with respect to $\beta_{1j}$, $\beta_{2j}$, $\sigma_1^2 > 0$, $\sigma_2^2 > 0$, and $0 \le \lambda \le 1$.
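As a concrete illustration, the log-likelihood in Eq. (4.2) can be coded and maximized numerically in a few lines of R. The sketch below is ours: the function name, the use of optim(), and the toy data and coefficients are illustrative assumptions, not the authors' Appendix A implementation (which uses EM and SGD updates).

# Log-likelihood of the two-regime switching regression model, Eq. (4.2).
# X: n x (d + 1) design matrix whose first column is 1 (so x_i0 = 1);
# beta1, beta2: regime coefficient vectors; sigma1sq, sigma2sq: regime
# variances; lambda: mixing probability of the first regime.
switch_loglik <- function(y, X, beta1, beta2, sigma1sq, sigma2sq, lambda) {
  dens1 <- lambda / sqrt(2 * pi * sigma1sq) *
    exp(-(y - X %*% beta1)^2 / (2 * sigma1sq))
  dens2 <- (1 - lambda) / sqrt(2 * pi * sigma2sq) *
    exp(-(y - X %*% beta2)^2 / (2 * sigma2sq))
  sum(log(dens1 + dens2))
}

# Toy example: maximize (4.2) with optim(); variances and lambda are
# reparameterized so the constraints sigma^2 > 0 and 0 < lambda < 1 hold.
set.seed(1)
n <- 200
x <- rnorm(n, 10, 1)
regime <- rbinom(n, 1, 0.5)
y <- ifelse(regime == 1, 3 + 2 * x, 3 - 2 * x) + rnorm(n)  # illustrative coefficients
X <- cbind(1, x)
negll <- function(p) -switch_loglik(y, X, p[1:2], p[3:4],
                                    exp(p[5]), exp(p[6]), plogis(p[7]))
fit <- optim(c(0, 1, 0, -1, 0, 0, 0), negll, method = "BFGS")
fit$par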
To stabilize the estimates, a ridge penalty can be added to (4.2), giving the switching ridge regression objective
$$
L_{\mathrm{ridge}} = \sum_{i=1}^{n}\log\left\{\frac{\lambda}{\sqrt{2\pi\sigma_1^2}}\exp\left[-\frac{1}{2\sigma_1^2}\Big(y_i - \sum_{j=0}^{d}\beta_{1j}x_{ij}\Big)^{2}\right] + \frac{1-\lambda}{\sqrt{2\pi\sigma_2^2}}\exp\left[-\frac{1}{2\sigma_2^2}\Big(y_i - \sum_{j=0}^{d}\beta_{2j}x_{ij}\Big)^{2}\right]\right\} - \frac{\eta_1}{2}\|\beta_1\|^{2} - \frac{\eta_2}{2}\|\beta_2\|^{2},
$$
where $\eta_1 \ge 0$ and $\eta_2 \ge 0$ are the ridge regularization parameters.
The ridge regression estimator can be written as
$$
\hat{\beta} = \big(X^{\top}X + \eta I_m\big)^{-1}X^{\top}y.
$$
Rearranging $X^{\top}X\beta + \eta\beta = X^{\top}y$ gives
$$
\beta = \eta^{-1}\big(X^{\top}y - X^{\top}X\beta\big) = X^{\top}\alpha, \qquad \alpha = \eta^{-1}\big(y - X\beta\big), \qquad (4.3)
$$
so that $\beta$ is a linear combination of the data points. We need to compute the Gram matrix $G = XX^{\top}$ in Eq. (4.4), where $G_{ij} = \langle x_i, x_j\rangle$: ridge regression requires only inner products between data points.
function that models the dependencies between the response and predictor variables. The classical way to do that is to minimize the quadratic cost
$$
L(\beta) = \frac{1}{2}\sum_{i=1}^{n}\big(y_i - \beta^{\top}x_i\big)^{2}.
$$
However, if we are going to work in a feature space, where we replace $x_i$ by $\phi(x_i)$, there is a clear danger that we overfit. Hence, we need to regularize. A simple yet effective way to regularize is to penalize the norm of $\beta$; this is sometimes called "weight decay." It remains to be determined how to choose the regularization parameter $\eta$; the most common approach is to use cross-validation or leave-one-out estimates. The total cost function hence becomes
$$
L(\beta) = \frac{1}{2}\sum_{i=1}^{n}\big(y_i - \beta^{\top}\phi(x_i)\big)^{2} + \frac{\eta}{2}\|\beta\|^{2}, \qquad (4.6)
$$
which needs to be minimized. Taking derivatives of Eq. (4.6) and equating them to zero gives
$$
-\sum_{i=1}^{n}\big(y_i - \beta^{\top}\phi(x_i)\big)\phi(x_i) + \eta\beta = 0
\quad\Longrightarrow\quad
\beta = \Big(\sum_{i=1}^{n}\phi(x_i)\phi(x_i)^{\top} + \eta I\Big)^{-1}\sum_{i=1}^{n}y_i\,\phi(x_i).
$$
We see that the regularization term helps to stabilize the inverse numerically by bounding the
smallest eigenvalues away from zero.
The two inverses are performed in spaces of different dimensionality. Consider the feature matrix $\Phi$ whose $i$th row is $\phi(x_i)^{\top}$ and the response vector $y = (y_1, \ldots, y_n)^{\top}$. The solution can then be written as
$$
\beta = \big(\Phi^{\top}\Phi + \eta I_d\big)^{-1}\Phi^{\top}y = \Phi^{\top}\big(\Phi\Phi^{\top} + \eta I_n\big)^{-1}y, \qquad (4.7)
$$
where $d$ is the dimension of the feature space and $n$ is the number of data points. Using Eq. (4.7), we can write
$$
\beta = \sum_{i=1}^{n}\alpha_i\,\phi(x_i),
$$
with $\alpha = (\Phi\Phi^{\top} + \eta I_n)^{-1}y$. This is an equation that will be a recurrent theme, and it can be interpreted as follows: the solution $\beta$ must lie in the span of the data points, even if the dimensionality
of the feature space is much larger than the number of data points. This seems intuitively clear,
since the algorithm is linear in feature space.
We finally need to show that we never actually need to access the feature vectors, which could be infinitely long. The prediction at a new point $x$ is computed by projecting $\phi(x)$ onto the solution $\beta$,
$$
\hat{y}(x) = \beta^{\top}\phi(x) = y^{\top}\big(\Phi\Phi^{\top} + \eta I_n\big)^{-1}\Phi\,\phi(x) = y^{\top}\big(K + \eta I_n\big)^{-1}k(x), \qquad (4.8)
$$
where $K_{ij} = k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle$ and $k(x) = \big(k(x_1, x), \ldots, k(x_n, x)\big)^{\top}$. Equivalently, in the primal form, $\beta = \big(\sum_{i=1}^{n}\phi(x_i)\phi(x_i)^{\top} + \eta I\big)^{-1}\sum_{i=1}^{n}\phi(x_i)\,y_i$.
Now let us consider a different derivation, making use of some Lagrange duality. If we introduce a new variable $w_i$ and constrain it to be the difference between $y_i$ and $\beta^{\top}x_i$, we have
$$
\min_{\beta, w}\; \frac{1}{2}w^{\top}w + \frac{\eta}{2}\beta^{\top}\beta \quad \text{s.t.} \quad w_i = y_i - \beta^{\top}x_i, \; i = 1, \ldots, n. \qquad (4.10)
$$
Using $\alpha_i$ to denote the Lagrange multipliers, this has the Lagrangian
$$
\mathcal{L}(\beta, w, \alpha) = \frac{1}{2}w^{\top}w + \frac{\eta}{2}\beta^{\top}\beta + \sum_{i=1}^{n}\alpha_i\big(y_i - \beta^{\top}x_i - w_i\big).
$$
Recall the foray into Lagrange duality: we can solve the original problem by solving
$$
\max_{\alpha}\;\min_{\beta, w}\;\mathcal{L}(\beta, w, \alpha).
$$
First, we take the inner minimization: fixing $\alpha$, we solve for the minimizing $\beta$ and $w$ by setting the derivatives of $\mathcal{L}$ with respect to $w_i$ and $\beta$ to zero. Doing this, we have
$$
\frac{\partial\mathcal{L}}{\partial w_i} = w_i - \alpha_i = 0 \quad\Longrightarrow\quad w_i = \alpha_i
$$
and
$$
\frac{\partial\mathcal{L}}{\partial \beta} = \eta\beta - \sum_{i=1}^{n}\alpha_i x_i = 0 \quad\Longrightarrow\quad \beta = \frac{1}{\eta}\sum_{i=1}^{n}\alpha_i x_i.
$$
So, we can solve the problem by maximizing the Lagrangian with respect to $\alpha$, where we substitute the above expressions for $w_i$ and $\beta$. Thus, we have an unconstrained maximization
$$
\max_{\alpha}\;\mathcal{L}\big(\beta(\alpha), w(\alpha), \alpha\big).
$$
Substituting and simplifying, we obtain
$$
\max_{\alpha}\min_{\beta, w}\mathcal{L}
= \max_{\alpha}\left[\frac{1}{2}\sum_{i=1}^{n}\alpha_i^{2}
+ \frac{1}{2\eta}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\langle x_i, x_j\rangle
+ \sum_{i=1}^{n}\alpha_i\Big(y_i - \frac{1}{\eta}\sum_{j=1}^{n}\alpha_j\langle x_i, x_j\rangle - \alpha_i\Big)\right]
$$
$$
= \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{n}\alpha_i^{2}
- \frac{1}{2\eta}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\langle x_i, x_j\rangle
+ \sum_{i=1}^{n}\alpha_i y_i\right]
= \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{n}\alpha_i^{2}
- \frac{1}{2\eta}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,k(x_i, x_j)
+ \sum_{i=1}^{n}\alpha_i y_i\right],
$$
where $k(x_i, x_j) = \langle x_i, x_j\rangle$ is the kernel function. Again, we only need inner products. If we define the matrix $K$ by $K_{ij} = k(x_i, x_j)$, then we can rewrite this in a punchier vector notation as
$$
\max_{\alpha}\min_{\beta, w}\mathcal{L} = \max_{\alpha}\; -\frac{1}{2}\alpha^{\top}\alpha - \frac{1}{2\eta}\alpha^{\top}K\alpha + y^{\top}\alpha.
$$
The expression on the right is just a quadratic in $\alpha$, so we can find the optimum as the solution of a linear system. What is important is the observation that, again, we only need the inner products of the data, $k(x_i, x_j) = \langle x_i, x_j\rangle$, to do the optimization over $\alpha$. Then, once we have solved for $\alpha$, we can predict $f(x)$ for a new $x$ using only inner products; if someone tells us all the inner products, we do not need the original data $x_i$ at all. Setting the derivative with respect to $\alpha$ to zero,
$$
0 = \frac{\partial}{\partial\alpha}\Big(-\frac{1}{2}\alpha^{\top}\alpha - \frac{1}{2\eta}\alpha^{\top}K\alpha + y^{\top}\alpha\Big) = -\alpha - \frac{1}{\eta}K\alpha + y.
$$
Thus,
$$
\alpha = \eta\,\big(K + \eta I_n\big)^{-1}y.
$$
Since $\beta$ is given by a sum of the input vectors $x_i$ weighted by $\alpha_i/\eta$, we could, if we were so inclined, avoid explicitly computing $\beta$ and predict a new point $x$ directly from the data as
$$
f(x) = \beta^{\top}x = \frac{1}{\eta}\sum_{i=1}^{n}\alpha_i\,\langle x_i, x\rangle. \qquad (4.11)
$$
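To make the dual solution concrete, the following minimal R sketch implements kernel ridge regression with a Gaussian kernel, using the predictor of Eq. (4.8), $\hat{y}(x) = y^{\top}(K + \eta I_n)^{-1}k(x)$. The helper names, the toy data, and the kernel parameter values are our own illustrative assumptions.

# Gaussian (RBF) kernel matrix between the rows of X1 and X2
gauss_kernel <- function(X1, X2, gamma = 1) {
  d2 <- outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * X1 %*% t(X2)
  exp(-gamma * d2)
}

# Dual kernel ridge regression: alpha solves (K + eta I) alpha = y
krr_fit <- function(X, y, eta = 1e-4, gamma = 1) {
  K <- gauss_kernel(X, X, gamma)
  alpha <- solve(K + eta * diag(nrow(X)), y)
  list(X = X, alpha = alpha, gamma = gamma)
}

# Prediction at new points: f(x) = k(x)' alpha = y'(K + eta I)^{-1} k(x)
krr_predict <- function(fit, Xnew) {
  gauss_kernel(Xnew, fit$X, fit$gamma) %*% fit$alpha
}

# Toy nonlinear example
set.seed(1)
x <- matrix(seq(-3, 3, length.out = 100), ncol = 1)
y <- sin(2 * x[, 1]) + rnorm(100, sd = 0.2)
fit <- krr_fit(x, y, eta = 0.1, gamma = 1)
yhat <- krr_predict(fit, x)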
To extend switching ridge regression to the nonlinear case, the two regime-specific regression functions $f_1$ and $f_2$ are taken to lie in an RKHS and are penalized by their RKHS norms, giving the penalized log-likelihood
$$
L = \sum_{i=1}^{n}\log\left\{\frac{\lambda}{\sqrt{2\pi\sigma_1^2}}\exp\Big[-\frac{1}{2\sigma_1^2}\big(y_i - f_1(x_i)\big)^{2}\Big] + \frac{1-\lambda}{\sqrt{2\pi\sigma_2^2}}\exp\Big[-\frac{1}{2\sigma_2^2}\big(y_i - f_2(x_i)\big)^{2}\Big]\right\} - \frac{\eta_1}{2}\|f_1\|^{2} - \frac{\eta_2}{2}\|f_2\|^{2}.
$$
Representing each function by a kernel expansion, $f_l(x_i) = \sum_{j=0}^{n}\alpha_{lj}\,k(x_i, x_j)$, $l = 1, 2$, this becomes
$$
L = \sum_{i=1}^{n}\log\left\{\frac{\lambda}{\sqrt{2\pi\sigma_1^2}}\exp\Big[-\frac{1}{2\sigma_1^2}\Big(y_i - \sum_{j=0}^{n}\alpha_{1j}\,k(x_i, x_j)\Big)^{2}\Big] + \frac{1-\lambda}{\sqrt{2\pi\sigma_2^2}}\exp\Big[-\frac{1}{2\sigma_2^2}\Big(y_i - \sum_{j=0}^{n}\alpha_{2j}\,k(x_i, x_j)\Big)^{2}\Big]\right\} - \frac{\eta_1}{2}\alpha_1^{\top}K\alpha_1 - \frac{\eta_2}{2}\alpha_2^{\top}K\alpha_2, \qquad (4.12)
$$
where $k(x_i, x_0) = 1$, and $\alpha_1$ and $\alpha_2$ are the vectors of coefficients. For simplicity, in our experiments we set $\eta_1 = \eta_2 = 10^{-4}$. We need to maximize the above nonlinear function with respect to the free parameters.
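As a hedged sketch of how the objective in Eq. (4.12) can be evaluated in practice, the R function below computes the penalized log-likelihood from a precomputed kernel matrix (it reuses the gauss_kernel helper sketched above). The intercept handling, argument names, and default penalties are our illustrative choices, not necessarily those of the authors' implementation; the chapter maximizes the objective via stochastic gradient descent, while for small problems the negative of this function could equally be passed to optim().

# Penalized log-likelihood of kernel switching ridge regression, Eq. (4.12).
# K: n x n kernel matrix; alpha1, alpha2: coefficient vectors of length n + 1,
# whose first entry multiplies the constant term k(x_i, x_0) = 1.
ksrr_loglik <- function(y, K, alpha1, alpha2, sigma1sq, sigma2sq, lambda,
                        eta1 = 1e-4, eta2 = 1e-4) {
  Kc <- cbind(1, K)                      # prepend the constant column
  f1 <- as.vector(Kc %*% alpha1)         # regime 1 regression function at the data
  f2 <- as.vector(Kc %*% alpha2)         # regime 2 regression function at the data
  dens1 <- lambda / sqrt(2 * pi * sigma1sq) * exp(-(y - f1)^2 / (2 * sigma1sq))
  dens2 <- (1 - lambda) / sqrt(2 * pi * sigma2sq) * exp(-(y - f2)^2 / (2 * sigma2sq))
  pen1 <- eta1 / 2 * sum(alpha1[-1] * (K %*% alpha1[-1]))   # (eta1/2) alpha1' K alpha1
  pen2 <- eta2 / 2 * sum(alpha2[-1] * (K %*% alpha2[-1]))   # (eta2/2) alpha2' K alpha2
  sum(log(dens1 + dens2)) - pen1 - pen2
}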
4.3.1 Simulation
Synthetic data-1 (SD-1): The dependent variable $y_i$ is generated by one of the following two models, each with probability 0.5:
$$y_i = 3 + 15x_i + \epsilon_{1i} \quad \text{and} \quad y_i = 3 + x_i + \epsilon_{2i},$$
where $x_i \sim N(10, 1)$, $\epsilon_{1i} \sim N(0, 2)$, and $\epsilon_{2i} \sim N(0, 2.5)$.
Synthetic data-2 (SD-2): The dependent variable $y_i$ is generated by one of the following two models, each with probability 0.5:
$$y_i = 1 + 2x_i + \epsilon_{1i} \quad \text{and} \quad y_i = 1 + x_i + \epsilon_{2i},$$
where $x_i \sim N(10, 1)$, $\epsilon_{1i} \sim N(0, 1.5)$, and $\epsilon_{2i} \sim N(0, 2)$.
Synthetic data-3 (SD-3): The dependent variable $y_i$ is generated by one of the following two models, each with probability 0.5:
$$y_i = x_i^{\top}\beta_1 + \epsilon_{1i} \quad \text{and} \quad y_i = x_i^{\top}\beta_2 + \epsilon_{2i},$$
where $x_i \sim U(\cdot, \cdot)$, $\epsilon_{1i} \sim N(0, 0.1)$, $\epsilon_{2i} \sim N(0, 0.01)$, and $i = 1, 2, \ldots, 400$.
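For readers who want to reproduce this kind of two-regime design, the sketch below generates data in the spirit of SD-1; the specific intercepts, slopes, and error variances are illustrative placeholders rather than the exact simulation settings.

# Two-regime data in the spirit of SD-1: each observation comes from one of
# two linear models with probability 0.5 (coefficients are placeholders).
gen_switch_data <- function(n) {
  x <- rnorm(n, mean = 10, sd = 1)
  regime <- rbinom(n, 1, 0.5)
  y <- ifelse(regime == 1,
              3 + 15 * x + rnorm(n, sd = sqrt(2)),
              3 + x + rnorm(n, sd = sqrt(2.5)))
  data.frame(x = x, y = y, regime = regime)
}

set.seed(123)
sd1 <- gen_switch_data(400)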
First, we compared the EM and SGD algorithms for the switching regression using different sample sizes, $n = 100, 250, 500, 750, 1000, 2500, 5000$, for SD-1 and SD-2. For each size, we repeated the experiment 100 times and calculated the mean error over the 100 samples. Table 4.1 presents the mean errors of the intercept ($\beta_1$) and regression coefficient ($\beta_2$) of the switching regression for both algorithms. We observed that both algorithms give almost similar results, especially when the sample size increases.
Second, we compared the EM and SGD algorithms for the switching regression using different dimensions, $d = 2, 5, 10, 25, 50, 100$, for SD-3. For each dimension, we repeated the experiment 100 times and calculated the mean error over the 100 samples. Table 4.2 presents the mean errors of the intercept ($\beta_1$) and regression coefficient ($\beta_2$) of the switching regression for both algorithms. We observed that the EM algorithm does not work properly for high-dimensional data, whereas the SGD algorithm does. We may conclude that the SGD algorithm is a good choice for high-dimensional data sets.
Finally, the proposed method, KSRR via the SGD algorithm, was applied to SD-4. To measure the performance of the proposed method against KRR and SVR, training and ten-fold cross-validation test errors were used. For a stability check, we repeated the experiment over 100 samples with $n = 400$, using the Gaussian kernel with inverse bandwidth equal to 1. We also obtained standard errors for the proposed parameter estimators. Table 4.3 presents the training errors and ten-fold cross-validation test errors. Figure 4.2 shows the scatter plots of the original data and the fitted curves of the proposed method, KSRR, and of KRR and SVR (blue and red curves for the KSRR, green curve for the KRR, and orange curve for the SVR). The figure makes clear that, for two-group data, the KSRR fits each of the two groups properly and separately, whereas KRR and SVR do not fit the data properly: their fitted curves lie between the two groups.
Table 4.1 The mean error (standard deviation in parentheses) over 100 samples of two data sets, SD-1 and SD-2

              n = 100                        n = 250                        n = 500                        n = 750
              β1             β2              β1             β2              β1             β2              β1             β2
SD-1  EM      2.154 (1.755)  2.829 (2.218)   1.444 (1.047)  1.9 (1.337)     1.082 (0.896)  1.334 (1.047)   0.893 (0.675)  1.013 (0.867)
      SGD     1.628 (1.110)  1.525 (1.128)   1.027 (0.796)  0.932 (0.683)   0.732 (0.492)  0.862 (0.534)   0.546 (0.448)  0.578 (0.4415)
SD-2  EM      1.573 (1.000)  1.708 (1.177)   1.095 (0.759)  1.174 (0.759)   0.683 (0.475)  0.959 (0.615)   0.612 (0.399)  0.822 (0.574)
      SGD     1.235 (0.825)  1.255 (0.783)   0.919 (0.681)  0.934 (0.715)   0.669 (0.465)  0.715 (0.479)   0.543 (0.421)  0.543 (0.406)

              n = 1000                       n = 2500                       n = 5000
              β1             β2              β1             β2              β1             β2
SD-1  EM      0.697 (0.550)  0.927 (0.699)   0.423 (0.309)  0.524 (0.379)   0.305 (0.264)  0.441 (0.328)
      SGD     0.468 (0.333)  0.472 (0.305)   0.271 (0.192)  0.344 (0.239)   0.227 (0.17)   0.211 (0.124)
SD-2  EM      0.546 (0.415)  0.794 (0.591)   0.296 (0.227)  0.431 (0.329)   0.229 (0.1781) 0.331 (0.242)
      SGD     0.453 (0.335)  0.428 (0.329)   0.297 (0.195)  0.322 (0.192)   0.229 (0.146)  0.242 (0.141)
Table 4.2 The mean error (standard deviation in parentheses) over 100 samples of the synthetic data set SD-3

        d = 2            d = 5            d = 10           d = 25           d = 50    d = 100
EM      3.4059 (0.699)   4.279 (0.777)    4.0662 (0.527)   3.526 (0.586)    2.268     315
SGD     4.974 (0.628)    4.3938 (0.924)   4.798 (0.844)    5.725 (1.392)    5.708 (0.923)   5.419 (0.605)
Table 4.3 Training and ten-fold cross-validation test error (standard deviation in parentheses) over 100 samples, n = 400, for the proposed method KSRR, kernel ridge regression (KRR), and support vector regression (SVR)

                  KSRR            KRR             SVR
Training error    1.475 (0.666)   2.201 (0.036)   2.273 (0.024)
10-CV error       1.587 (0.749)   2.362 (0.049)   2.362 (0.049)
Figure 4.2 Scatter plots of the original data and fitted curves of the proposed method KSRR, kernel ridge regression (KRR), and support vector regression (SVR) (blue and red curves are for the KSRR, green curve for the KRR, and orange curve for the SVR).
In order to verify the applicability of the proposed method in a real-world problem, we use the
motorcycle data set. The motorcycle data set is a well-known and widely used data set, especially
throughout the areas of statistical machine learning, data mining, and nonparametric regression
analysis. The data set consists of n = 133 measurements of head acceleration (in g) taken over time (in milliseconds) after impact in simulated motorcycle accidents. The data are available in the R "switchnpreg" package, and many different methodologies have been applied to the motorcycle data (de Souza et al., 2014).
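For readers who want to experiment with this benchmark, a widely used copy of the motorcycle data also ships with the R package MASS as the data set mcycle (times in milliseconds, accel in g). The short sketch below loads it and fits the single-regime kernel ridge regression helper sketched earlier, purely as an illustration of the kind of baseline fit the chapter compares against; the bandwidth and penalty values are arbitrary choices.

# Load the motorcycle impact data and fit a single-regime KRR baseline
library(MASS)                      # provides the mcycle data set
data(mcycle)
x <- matrix(mcycle$times, ncol = 1)
y <- mcycle$accel
fit <- krr_fit(x, y, eta = 0.1, gamma = 0.1)   # helpers defined in the earlier sketch
plot(mcycle$times, mcycle$accel,
     xlab = "Time after impact (ms)", ylab = "Head acceleration (g)")
lines(mcycle$times, krr_predict(fit, x), col = "green")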
More recently, the motorcycle data set has become a benchmark data set for machine learning
techniques involving mixtures of Gaussian processes. In this section, we have used our proposed
method along with state-of-the-art methods to discover the average prediction error. We have
conducted five-fold and ten-fold cross-validation approaches to discover the average prediction
error of the response variable. Table 4.4 presents the prediction error using five-fold and ten-fold cross-validation. From these results, it is evident that the prediction based on the proposed method is significantly more accurate than that of the two state-of-the-art supervised learning methods (KRR and SVR).
4.4 Discussion
In supervised machine learning tasks, switching regression is becoming an increasingly common
component in the age of big business data. The state-of-the-art regression approaches including
kernel ridge regression or support vector regression only work for a single regime of the data. In
order to make a comprehensive prediction for two or more regimes of data set, we can use switch-
ing regression for the business intelligence system. The standard switching regression approach
suffers from the curse of dimensionality as well as non-linearity of the data.
In this chapter, we have proposed a kernel-based switching regression model to study the effect
of explanatory variables on two regimes’ outcomes of interest. The performance of the proposed
method has been compared over the state-of-the-art methods in supervised machine learning:
KRR and SVR. To this end, we have conducted several experiments using synthetic examples and a real-world problem. In these experiments, we observed that parameter estimation via the EM algorithm fails for high-dimensional data sets, whereas for low-dimensional data sets parameter estimation via the EM and SGD algorithms performs similarly. Parameter estimation via the SGD algorithm is therefore used for the proposed method instead of the EM algorithm. From the experimental results, we also observed that the KRR and SVR methods fail to fit the data properly when the data have two classes or regimes: their fitted curves lie between the two groups rather than on the original data. In contrast, it is clear that the KSRR fits the data properly in the two groups (see Figure 4.2).
    # Stochastic gradient updates for the regime variances
    sigma1.new <- sigma1.old +
      eta * Q1 / 2 * (as.vector((y[i] - X[i, ] %*% beta1.old)^2) * (1 / sigma1.old^2) - 1 / sigma1.old)
    sigma2.new <- sigma2.old +
      eta * Q2 / 2 * (as.vector((y[i] - X[i, ] %*% beta2.old)^2) * (1 / sigma2.old^2) - 1 / sigma2.old)
  }
  # Mixture component densities and log-likelihood at the updated parameters
  q1 <- (lam1.new / sqrt(2 * pi * sigma1.new)) * exp(-(y - X %*% beta1.new)^2 / (2 * sigma1.new))
  q2 <- (lam2.new / sqrt(2 * pi * sigma2.new)) * exp(-(y - X %*% beta2.new)^2 / (2 * sigma2.new))
  logL <- sum(log(q1 + q2))
  cat("ctr =", ctr, "\n")
  cat("Log L =", logL, "\n")
  cat("Beta1 =", beta1.new, "Beta2 =", beta2.new, "\n")
  cat("lam1 =", lam1.new, "lam2 =", lam2.new, "\n")
  cat("Sigma1 =", sigma1.new, "Sigma2 =", sigma2.new, "\n")
  ctr <- ctr + 1
}
# Assign each observation to the regime with the smaller squared residual
yhat12 <- cbind((y - X %*% beta1.new)^2, (y - X %*% beta2.new)^2)
yhat <- apply(yhat12, 1, min)
MSE <- mean(yhat)
return(list(error = MSE, "log-likelihood" = logL, Beta1 = beta1.new, Beta2 = beta2.new,
            Par1 = lam1.new, Par2 = lam2.new, Var1 = sigma1.new, Var2 = sigma2.new))
}
References
Akaho, S. (2001). A kernel method for canonical correlation analysis, in Proceedings of the International
Meeting of Psychometric Society, Japan, 35, 321–377.
Alam, M. A. (2014). Kernel Choice for Unsupervised Kernel Methods, Ph.D. Dissertation, The Graduate
University for Advanced Studies, Japan.
Alam, M. A., Calhoun, V., and Wang, Y. P. (2018). Identifying outliers using multiple kernel canonical
correlation analysis with application to imaging genetics, Computational Statistics and Data Analysis,
125, 70–85.
Alam, M. A. and Fukumizu, K. (2013). Higher-order regularized kernel CCA. 12th International Conference on Machine Learning and Applications, Miami, Florida, USA, 374–377.
Alam, M. A., Komori, O., Deng, H.-W., Calhoun, V. D., and Wang, Y.-P. (2020). Robust kernel canonical correlation analysis to detect gene-gene co-associations: A case study in genetics. Journal of Bioinformatics and Computational Biology, 17(4), 1950028.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society,
337–404.
Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning
Research, 3, 1–48.
Brannon, N. (2010). Business intelligence and E-discovery. Intellectual Property & Technology Law Journal, 22(7), 1–5.
Cawley, G. C. and Talbot, N. L. C. (2003). Reduced rank kernel ridge regression. Neural Processing Letters,
16, 293–302.
Dannemann, J. and Holzmann, H. (2010). Testing for two components in a switching regression model.
Computation Statistics and Data Analysis, 54, 1592–1604.
Duan, L. and Xu, L. D. (2011). Business intelligence for enterprise systems: A survey. IEEE Transactions on
Industrial Informatics, 8, 679–687.
Fukumizu, K. Bach, F. R., and Jordan, M. I. (2009). Kernel dimension reduction in regression. Annals of
Statistics, 37, 1871–1905.
Ham, J., Lee, D. D., Mika, S., and Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. Proceedings of the 21st International Conference on Machine Learning, 47.
Hastie, T, Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York,
2nd ed.
Hofmann, T., Schölkopf, B. and Smola J. A. (2008). Kernel methods in machine learning. Annals of
Statistics, 36, 1171–1220.
Huang, S. Y., Yeh, Y. R. and Eguchi, S. (2009). Robust kernel principal component analysis. Neural Com-
putation, 21(11), 3179–3213.
Liu M. and Lin, T. (2013). A skew-normal mixture regression model. Educational and Psychological Mea-
surement, XX (X), 1–24.
Malikov, E. and Kumbhakar, S. C. (2014). A generalized panel data switching regression model. Economics
Letters, 124, 353–357.
Quandt, R. E. (1972). A new approach to estimating switching regressions. Journal of the American Statistical Association, 67(338), 306–310.
Richfield, O., Alam, M. A., Calhoun, V., and Wang, Y. P. (2017). Learning schizophrenia imaging genetics data via multiple kernel canonical correlation analysis. Proceedings - 2016 IEEE International Conference on Bioinformatics and Biomedicine, Shenzhen, China, 5, 507–511.
Saunders, C., Gammerman, A., and Vovk, V. (1998). Ridge regression learning algorithm in dual variables. Proceedings of the 15th International Conference on Machine Learning, ICML'98, Madison, Wisconsin, USA, 24–27.
Shawe-Taylor J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press,
Cambridge, 1st ed.
Souza, C. P. E. de. and Heckman, N. E. (2014). Switching nonparametric regression models. Journal of
Nonparametric Statistics, 26 (4), 617–637.
Steinwart, I. and Christmann A. (2008). Support Vector Machines. Springer, New York, 1st ed.
Wang, J. Wu, X. and Zhang C. (2005). Support vector machines based on K-mean clustering for real-time
business intelligence systems. International Journal of Business Intelligence and Data Mining, 1(1),
54–64.
Chapter 5
5.1 Introduction
It is well known that stock markets show certain periods of volatility caused by investors’ reaction
to economic and political changes. For this reason, monitoring and evaluation of volatility can
be considered an integral part of the investment decision making.
In this chapter, we focus on the short-term volatility of corporate stocks and the causes that
triggered it, and we derive implied (future) volatility based on historical volatility. As volatility
increases, the flexibility of active investment can be very lucrative compared to passive investment,
but it should also be noted that the risk increases with higher volatility. This is closely related to
the expectations, behavior, and interactions of various market participants, which are not only
informed investors but also noise traders. It is these investors who make their investment deci-
sions without rational use of fundamental data and overreact to good or bad news. According to
Staritz (2012), these investors may cause significant deviations in stock prices from their values
obtained using fundamental analysis. In addition, analysts now point out that the capital markets
are growing nervous and individual investors are very cautious about possible price slumps and
react faster than usual.
Text mining tools for automated analysis of company-related documents (e.g., Internet
postings, financial news, and corporate annual reports) have become increasingly important for
stock market investors because information hidden in those documents may indicate future
changes in stock prices (Hajek, 2018). In previous studies, it has been shown that investors’
behavior is affected by the dissemination of those documents (Bicudo et al., 2019). The quality
of financial reporting, including additional commentary, is associated with a reduction in infor-
mation asymmetry among stakeholders, particularly between firm management and investors
(Biddle et al., 2009). Therefore, the qualitative information disclosed by management is increas-
ingly recognized as an essential supplement to accounting information (Kothari et al., 2009;
Kumar and Vadlamani, 2016). According to Seng and Yang (2017), who examined the volatility
of the capital market over months, quarters, half-years, and a year, positive (negative) reports are positively (negatively) correlated with stock returns. Chin et al. (2018) examined the
links among stock market volatility, market sentiment, macroeconomic indicators, and spread
volatility over the persisting long-term component and the temporary short-term component.
They found no empirical evidence of the link between the volatile component and macroeco-
nomic indicators but found that the intermediate component was linked to variations in market
sentiment.
Significant effects of financial news and social media on stock return volatility have been
reported in earlier studies (Groth and Muntermann, 2011; Hajek, 2018; Oliveira et al., 2017).
However, little attention has been paid to the effect of other important textual disclosures. Here,
we aim to (1) propose a machine learning-based model for predicting short-term stock return
volatility and (2) study the effect of annual report filing on abnormal changes in firms’ stock
returns using the proposed model. In summary, we demonstrate that mining corporate annual
reports can be effective in predicting short-term stock return volatility using a three-day event
window.
stakeholders (Biddle et al., 2009). Text mining has therefore found a number of applications in
different domains, especially in the financial field, but also in stock market prediction.
According to Seng and Yang (2017), who investigated the volatility of the capital market over
months, quarters, half-years, and a year, positive (negative) reports and a high (low) occurrence of scores in reporting are positively (negatively) correlated with stock returns. In terms of implied
volatility, ex-ante information should also be considered, subject to different expected market
scenarios and relating to different expected prices of financial instruments, including corporate
shares (Kaeck, 2018).
Table 5.1 summarizes the key findings of previous studies on stock return volatility prediction
using text mining. Different sources of text data have been reported as important indicators of
stock return volatility. These sources include (a) Internet message postings, (b) financial news,
(c) analyst reports, and (d) corporate annual reports. In other words, information extracted from
both outsiders (investors’ postings, news, and analyst reports) and insiders (managerial comments
in annual reports) has been shown to be effective in predicting future stock return volatility. Antweiler and Frank (2004) investigated the role of two different sources of text data, namely Internet message postings and news articles. They found that a higher number of postings and news items indicates greater market volatility on the next day. The authors also demonstrated the significant effects of
other financial indicators, such as trading value and stock market index. Tetlock (2007) investi-
gated the effect of word categories obtained from the General Inquirer lexicon on stock market
volatility. From those word categories, pessimism was the most informative indicator of increased
stock market volatility. Kothari et al. (2009) used the General Inquirer to identify favorable/unfa-
vorable messages and demonstrated that favorable news decreased volatility whereas unfavorable
news increased stock return volatility. Loughran and McDonald (2011) were the first to investigate the long-term impact of opinion in corporate annual reports on stock return volatility.
Unlike earlier studies, these authors created specific financial word lists to evaluate various senti-
ment categories in financial texts, including uncertainty and word modality. Significant long-term
effects were observed for the sentiment polarity and modality indicators. Groth and Munter-
mann (2011) used a different approach based on machine learning. First, frequent words and
word sequences were extracted from the news corpus and then machine learning methods were
used to perform the forecasting of abnormal stock return volatility. This approach was more
accurate than those based on word lexicons but this is achieved at the cost of decreased model
transparency. Kim and Kim (2014) investigated the effect of investor sentiment as expressed in
message board postings on stock return volatility of 91 companies. The NB classification method
was used to identify the overall sentiment polarity in the messages but no evidence was found
for the existence of significant effect of the sentiment polarity on future stock return volatility.
On the contrary, previous stock price performance was found to be an important determinant of investor sentiment. A similar sentiment indicator was proposed by See-To and Yang (2017) to analyze sentiment in Twitter postings. Using the Support Vector Regression (SVR) machine learning model, it was demonstrated that individual sentiment dispersion represents an informative measure of realized stock volatility. A more in-depth investigation was conducted by Shi et al. (2016)
in order to investigate the news sentiment effect across different industries, firm size, and low-
/high-volatility states. The news sentiment indicator appeared to be particularly important in the
calm scenario of low volatility. Ensemble-based machine learning methods such as Bagging and
Boosting were used by Myskova et al. (2018) to show that more negative sentiment and more
frequent news imply abnormally high stock return volatility. Chen et al. (2018) examined how
single-stock options respond to contemporaneous sentiment in news articles. The authors found
that sentiment indicators provide additional information for stock return prediction. The different topics of overnight news appear to be the reason for their positive effect beyond market volatility. A novel stock return volatility prediction model was employed by Xing et al. (2019) to demonstrate the dominance of deep recurrent neural networks over traditional machine learning methods. This is attributed to their capacity to capture the bi-directional effects between stock price and market sentiment.
The above literature suggests that information obtained from financial markets and financial
statements provides important support to manage financial risks and decrease a firm’s exposure to
such risk. However, it is generally accepted that this information is insufficient to provide accu-
rate early warning signals of abnormal stock price volatility. In fact, most firm-related information
pertaining to stock market risk comes in linguistic, rather than numerical form. Notably, corpo-
rate annual reports offer a detailed, linguistic communication of the financial risks the company
faces. In these reports, management discusses the most important risks that apply to the com-
pany, including how the company plans to handle those risks. However, only long-term effects
of linguistic variables in corporate annual reports on stock price volatility have been examined
in prior studies (Loughran and McDonald, 2011). Indeed, the short-term effects have only been
demonstrated for the information obtained from outsiders, not the insiders. In this chapter, we
overcome this problem by developing a novel prediction model utilizing managerial sentiment
extracted from corporate annual reports.
from the MarketWatch database, and (3) historical stock return volatility. This research framework
enables analysis of the effect of managerial comments in annual reports on stock return volatility
by considering the effects of financial indicators. To achieve a high accuracy of the prediction sys-
tem, several machine learning methods were examined, including REPTree, Bagging, Random
Forest (RF), Multilayer Perceptron neural network (MLP), Support Vector Machine (SVM), and
Deep Feed-Forward Neural Network (DFFNN).
The short-term stock return volatility is measured as
$$
\sigma_{t+3} = \sqrt{\frac{1}{3-1}\sum_{t=1}^{3}\big(R_t - \bar{R}\big)^{2}} \qquad (5.1)
$$
where
$$
R_t = \frac{P_{t+1} - P_{t-1}}{P_{t-1}} \times 100 \qquad (5.2)
$$
with t being the filing day of annual report (10-K), Pt the closing stock price at time t, Rt stock
return (rate of change) at time t, and R the mean value of stock price return for three consecutive
days. This indicator was chosen because it represents the variance of stock return over a short
period of time. Thus, the risk of the investment is taken into consideration. In general, more
volatile stock returns indicate higher financial risk. Note that the three-day event window was
adopted from previous related studies (Loughran and McDonald, 2011).
To reflect the systematic (market) risk, we also considered market return volatility in target
variable calculation. Specifically, the standard deviation of stock return $\sigma_{t+3}$ was compared with
that of the stock market index and the instances were categorized into two classes: namely class 0
(negative abnormal volatility) was assigned to the less volatile stocks than the stock market index
and class 1 (positive abnormal volatility) otherwise. To consider historical market and stock return
volatilities as important indicators of future volatility, the standard deviations were also calculated
for the three-day historical event window (t-4 to t-2).
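The construction of the target variable can be sketched in R as follows; the exact return definition and window alignment follow our reading of Eqs. (5.1)–(5.2), so the toy prices, the filing-day index, and the windowing details should be read as assumptions rather than the authors' exact procedure.

# Toy daily closing prices for a stock and for the market index
set.seed(1)
price <- cumprod(c(100, 1 + rnorm(59, 0, 0.02)))
index <- cumprod(c(1000, 1 + rnorm(59, 0, 0.015)))
t <- 30                                   # hypothetical 10-K filing day

# Percentage rate of change around day tau, in the spirit of Eq. (5.2)
rate <- function(tau, p) 100 * (p[tau + 1] - p[tau - 1]) / p[tau - 1]

event_days <- (t + 1):(t + 3)             # three-day event window after filing
hist_days  <- (t - 4):(t - 2)             # three-day historical window

sigma_event  <- sd(sapply(event_days, rate, p = price))   # Eq. (5.1)
sigma_market <- sd(sapply(event_days, rate, p = index))
sigma_hist   <- sd(sapply(hist_days,  rate, p = price))   # historical volatility feature

# Class 1 = positive abnormal volatility (more volatile than the market index)
label <- as.integer(sigma_event > sigma_market)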
In addition, we controlled for the effect of other factors of stock return volatility to avoid draw-
ing erroneous conclusions. Therefore, we considered the following financial indicators adopted
from previous research (Hajek, 2018):
1. company size (measured by market capitalization (MC), given as the market value of outstanding shares, i.e., MC = P × shares outstanding),
2. liquidity ratio (defined as the daily dollar volume of shares, calculated as trading volume per
day/shares outstanding),
3. beta coefficient (sensitivity of a share to movements in the overall market, beta = cov(Re ,
Rm )/var(Rm ), where Re denotes stock return and Rm is market return),
4. price-to-earnings ratio (stock price to earnings per share (EPS), price-to-earnings ratio
P/E = P/EPS),
80 Petr Hajek et al.
5. book-to-market ratio (firm’s market capitalization to its book value, PBV = market price per
share/book price per share),
6. return on equity (net income to shareholder’s equity, ROE = net income/shareholder’s
equity), and
7. debt to equity (a measure of company’s financial leverage, firm’s total liabilities to share-
holder’s equity, D/E = total liabilities/shareholder’s equity).
The data for the financial indicators were obtained from the freely available MarketWatch
database. Those indicators reflect the risk effects of company size (small companies are more
risky), market expectations (higher PBV and lower P/E indicate higher risk), and financial ratios
(higher leverage and lower profitability indicate higher risk). Moreover, higher historical stock
return volatility and beta coefficient also serve as risk indicators.
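To make the indicator definitions concrete, the following short R sketch computes them for a hypothetical firm; all input values and variable names are illustrative and are not column names from the MarketWatch database.

# Hypothetical inputs for one firm (illustrative values only)
P <- 42.5; shares_out <- 1.2e8; volume <- 3.5e6
EPS <- 3.1; book_per_share <- 18.0
net_income <- 2.4e8; equity <- 1.6e9; total_liab <- 2.1e9
stock_ret <- rnorm(250, 0, 0.02); market_ret <- rnorm(250, 0, 0.015)

MC   <- P * shares_out                                 # 1. market capitalization
liq  <- volume / shares_out                            # 2. liquidity ratio
beta <- cov(stock_ret, market_ret) / var(market_ret)   # 3. beta coefficient
PE   <- P / EPS                                        # 4. price-to-earnings ratio
PBV  <- P / book_per_share                             # 5. price-to-book ratio
ROE  <- net_income / equity                            # 6. return on equity
DE   <- total_liab / equity                            # 7. debt to equity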
1. REPTree (Quinlan, 1999). The REPTree (Reduced Error Pruning Tree) method generates
a decision tree based on information gain with entropy and the pruning of the decision
trees is performed using reduced-error pruning with backfitting. Thus, the error stemming
from the variance is minimized. We opted for this decision tree classifier because traditional
decision tree classifiers (C4.5, CART, ID3) may suffer from overfitting due to the generation
of large decision trees. The pruning methods developed by Quinlan (1999) overcome the
overfitting issue by replacing the internal nodes of the tree with the most frequent category.
This pruning procedure is performed for the nodes in the tree only when the prediction
accuracy is not affected. In our experiments, we used the REPTree implementation in the
Weka 3.8.4 program environment. The REPTree classification model was trained using the
following setting: the minimum total weight of instances per leaf was 2, the maximum depth
of the decision tree was not limited, and the minimum proportion of variance at a tree node
was set to 0.001. It should also be noted that this classifier is considered the fastest one among
the methods used in this study.
2. Bagging (Breiman, 1996). It is an ensemble strategy based on generating multiple decision
trees and aggregating their class predictions using a plurality vote. The multiple decision trees (base learners) are produced on different bootstrap replicates of the training set. Note that these replicates are produced randomly with replacement; hence, each training sample may appear in a replicate training set several times or not at all. In fact, bagging is effective only when the produced replicate training sets differ from each other (i.e., the bootstrap procedure is unstable). In our experiments, we used REPTree (with the same settings as presented above) as the base learner because it is prone to overfitting. Bagging was trained with ten iterations and a bag size equal to 50% of the training set.

Table 5.1 Summary of previous studies on predicting stock return volatility using text mining

Antweiler and Frank (2004). Source: Internet message postings from Yahoo!Finance; financial news from the Wall Street Journal. Linguistic variables: number of messages, bullishness and agreement indexes. Method: GARCH. Key finding: a higher number of messages indicates greater next-day market volatility.
Tetlock (2007). Source: financial news from the Wall Street Journal. Linguistic variables: pessimism, weak, and negative word categories from the General Inquirer lexicon. Method: OLS. Key finding: pessimism indicates increases in stock market volatility.
Kothari et al. (2009). Source: financial news from Factiva/Dow Jones Interactive; company's 10-K reports; analyst disclosures from Factiva/Investext. Linguistic variables: favorable and unfavorable word categories from the General Inquirer lexicon. Method: Fama-MacBeth regression. Key finding: favorable and unfavorable words have, respectively, significantly negative and positive impacts on stock return volatility.
Loughran and McDonald (2011). Source: company's 10-K reports. Linguistic variables: positive, negative, modal, litigious, and uncertainty word categories. Method: Fama-MacBeth regression. Key finding: sentiment polarity and modal words in corporate reports indicate greater stock return volatility in the next year.
Groth and Muntermann (2011). Source: financial news from corporate disclosures. Linguistic variables: bag-of-words with significant discriminative power. Method: NB, k-NN, SVM, MLP. Key finding: bag-of-words features improve the performance of financial risk prediction methods.
Kim and Kim (2014). Source: Internet message postings from Yahoo!Finance. Linguistic variables: overall investor sentiment. Method: NB. Key finding: investor sentiment is not a significant predictor of stock return volatility.
(Continued)
3. RF (Liaw and Wiener, 2002). It extends the Bagging algorithm by adding a random feature selection component into the ensemble learning. More precisely, a sub-set of predictors is randomly chosen at each node to improve the generalizability of the ensemble model. In contrast to standard decision trees, the split criterion is applied not to the best of all variables but only to the best predictor in the sub-set. As a result, the model is robust to overfitting, and only two parameters need to be set, namely the number of trees to be generated (100 in our experiments) and the number of predictors used as candidates at each node (we used the heuristic $\lfloor \log_2(\#\text{features}) + 1 \rfloor$). In this study, this robustness is important also because no feature selection needs to be performed to decrease the dimensionality of the dataset. Again, the aggregation of the predictors was performed using the majority vote. As for the above machine learning methods, the Weka 3.8.4 implementation was employed to train RF.
4. SVM (Keerthi et al., 2001). SVM is a kernel-based method based on the generation of the
decision hyperplane separating classes in order to maximize the margin between the samples
from different classes. In other words, in contrast to the remaining machine learning methods
used here, SVM minimizes structural risk, rather than training error only (empirical risk).
It is important to note that the optimal separating hyperplane is not produced in the input
data space, but in the multidimensional space obtained using a non-linear projection. To find
the optimal separating hyperplane in SVM, we used the sequential minimal optimization
(SMO) algorithm that solves the large quadratic programming optimization problem by
breaking it into a set of small quadratic programming problems. This method was used
in this study due to its scalability, fast training, and effectiveness in handling sparse text
datasets. SVM was trained with the following parameters: the complexity parameter was tested in the range of $C = \{2^{-2}, 2^{-1}, 2^{0}, 2^{1}, \ldots, 2^{6}\}$, and polynomial kernel functions were employed to map the training instances from the input space into a new feature space of higher dimensionality. By controlling the complexity of the SVM model, the risk of overfitting was reduced for the data used in this study. The SMO algorithm implemented in Weka 3.8.4 was used for predicting stock return volatility.
5. Neural networks MLP and DFFNN (Glorot and Bengio, 2010). The MLP model used here
consisted of three layers of neurons, namely input layer representing the independent vari-
ables, hidden layer modeling the non-linear relationships between inputs and outputs, and
output layer representing the predicted stock return volatility. In the DFFNN model, an additional hidden layer was used to extract higher-order features from the data. Both neural
network models were trained using the gradient descent algorithm with mini batches, learn-
ing rate of 0.01 and 1,000 iterations. For MLP, we used the MultilayerPerceptron algorithm
implemented in Weka 3.8.4, while for DFFNN, we employed the DeepLearning4J dis-
tributed deep learning library for Java and Scala. Distributed GPUs were used to perform
the prediction task. To find the optimal structures of the models, we tested different numbers of units in the hidden layer(s) in the range of $\{2^{3}, 2^{4}, \ldots, 2^{7}\}$. For MLP, we used one
hidden layer, while two hidden layers were used for the DFFNN model. To avoid overfitting,
we applied dropout for both input and hidden layer(s) with dropout rates of 0.2 and 0.4,
respectively.
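As a rough R analogue of this experimental setup (the chapter itself uses Weka 3.8.4 and DeepLearning4J), the sketch below runs a ten-fold cross-validation of a random forest and a polynomial-kernel SVM on a generic feature matrix. The package choices, parameter values, and toy data are our own and do not reproduce the authors' Weka configuration.

library(randomForest)   # R analogue of the Weka RandomForest classifier
library(e1071)          # svm() with a polynomial kernel, analogue of Weka SMO

set.seed(1)
n <- 300; p <- 12
X <- matrix(rnorm(n * p), n, p)                                 # toy predictors
y <- factor(as.integer(X[, 1] + 0.5 * X[, 2] + rnorm(n) > 0))   # toy binary target

folds <- sample(rep(1:10, length.out = n))                      # ten-fold CV assignment
acc_rf <- acc_svm <- numeric(10)
for (k in 1:10) {
  tr <- folds != k
  rf <- randomForest(X[tr, ], y[tr], ntree = 100, mtry = floor(log2(p)) + 1)
  sv <- svm(X[tr, ], y[tr], kernel = "polynomial", cost = 1)
  acc_rf[k]  <- mean(predict(rf, X[!tr, ]) == y[!tr])
  acc_svm[k] <- mean(predict(sv, X[!tr, ]) == y[!tr])
}
c(RF = mean(acc_rf), SVM = mean(acc_svm))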
The prediction performance was evaluated using the following measures:
$$
\mathrm{Acc} = \frac{TP + TN}{P + N} \qquad (5.8)
$$
$$
\mathrm{TP\ rate} = \frac{TP}{P} \qquad (5.9)
$$
$$
\mathrm{TN\ rate} = \frac{TN}{N} \qquad (5.10)
$$
$$
\mathrm{AUC} = \int_{0}^{1}\mathrm{TPR}(T)\,\frac{d\,\mathrm{FPR}(T)}{dT}\,dT \qquad (5.11)
$$
where $P$ and $N$ are the numbers of samples with positive and negative abnormal volatility, respectively, $TP$ and $TN$ are the numbers of samples correctly classified as positive and negative abnormal volatility, respectively, and $T$ is the cut-off point.
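The sketch below computes these measures in R from the actual classes and predicted scores; the AUC is obtained via the rank-based (Mann-Whitney) formula, which yields the same quantity as the integral in Eq. (5.11), and the function name and cutoff default are our own choices.

# Accuracy, TP rate, TN rate, and AUC for a binary problem.
# truth: 0/1 vector of actual classes; score: predicted probability of class 1.
eval_metrics <- function(truth, score, cutoff = 0.5) {
  pred <- as.integer(score > cutoff)
  P <- sum(truth == 1); N <- sum(truth == 0)
  TP <- sum(pred == 1 & truth == 1)
  TN <- sum(pred == 0 & truth == 0)
  # Rank-based AUC: probability that a random positive scores above a random negative
  r <- rank(score)
  auc <- (sum(r[truth == 1]) - P * (P + 1) / 2) / (P * N)
  c(Acc = (TP + TN) / (P + N), TPrate = TP / P, TNrate = TN / N, AUC = auc)
}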
To further demonstrate the value of the proposed variables, we examined their merits using the
Relief feature ranking algorithm (Kira and Rendell, 1992). In Table 5.3, the average worth of vari-
ables is presented over ten experiments (obtained using ten-fold cross-validation). In addition, the
overall ranks are provided to show that historical stock and market volatilities represent the most
important input variables in the data. However, linguistic variables ranked just after them, with
commonality, certainty, and uncertainty ranked among top five variables. These results suggest
that linguistic information is a relevant determinant of stock return volatility.
With respect to these evaluation measures, Bagging and RF achieved a more balanced perfor-
mance, with acceptable accuracy on both classes (Table 5.4). By contrast, DFFNN and SVM
performed best on the majority class only (TP rate). However, DFFNN was not capable of
detecting TN instances at all. In other words, the DFFNN model performed poorly for the
imbalanced dataset. Overall, the ensemble algorithms outperformed the individual machine
learning algorithms in terms of AUC. Regarding accuracy, REPTree performed best, with balanced accuracy on both classes (84.7% and 86.7% for the negative and positive abnormal stock return volatility, respectively). To detect significant differences compared with the best-performing machine learning method, Student's paired t-test was carried out, indicating that SVM and DFFNN were significantly outperformed in terms of both Acc and AUC.
In two further runs of experiments, sensitivity to linguistic indicators was tested. In Table 5.5,
the results are presented for the experiment without the word categories from the general
dictionary, while Table 5.6 shows those obtained without the finance-specific dictionary. Most
importantly, the performance of the prediction models improved in terms of Acc for most
methods when the finance-specific dictionaries were removed, suggesting that the sentiment cat-
egories obtained using the general dictionaries were more informative for abnormal stock return
volatility prediction. No other significant change was observed in the results presented in Tables
5.5 and 5.6 compared with those from Table 5.4, suggesting that general dictionaries may signif-
icantly enhance the balance in performance across both target classes. This provides investors with a more reliable prediction tool, enabling a wider range of investment strategies.
In the final run of experiments, only financial variables were employed for predicting abnor-
mal stock return volatility. From the results presented in Table 5.7, it can be observed that the
machine learning models were outperformed by their counterparts reported in previous experi-
ments, indicating that linguistic information extracted from annual reports increases prediction
performance regardless of the machine learning method used. Although Student's paired t-tests performed to compare these approaches did not indicate significant differences, note that the potential reduction in investors' financial risk may have a huge financial impact on the value of their portfolios.

Table 5.4 Prediction performance for all variables

            REPTree          Bagging          RF               SVM              MLP              DFFNN
Acc [%]     86.15 ± 2.54     86.00 ± 2.24*    85.35 ± 1.91*    82.74 ± 2.06     83.83 ± 3.63*    72.01 ± 0.35
AUC         0.906 ± 0.024*   0.925 ± 0.019    0.924 ± 0.018*   0.730 ± 0.043    0.899 ± 0.034*   0.485 ± 0.095
TN rate     0.847 ± 0.107    0.783 ± 0.040    0.757 ± 0.063    0.507 ± 0.101    0.662 ± 0.222    0.000 ± 0.000
TP rate     0.867 ± 0.031    0.890 ± 0.019    0.891 ± 0.025    0.952 ± 0.024    0.906 ± 0.046    1.000 ± 0.000
* significant difference at p = 0.05.
5.5 Conclusions
This study has shown that a more balanced performance of the prediction methods can be
achieved when incorporating the linguistic indicators from annual reports into the volatility pre-
diction models. More precisely, our results indicate that general dictionaries are more discrimi-
native than their finance-specific counterparts. Slightly more accurate machine learning models
were obtained using the linguistic indicators compared with the models trained using only finan-
cial indicators. From the theoretical point of view, the current findings add substantially to our
understanding of the effect of annual reports’ release on short-term stock return volatility. An
important practical implication is that investors can use the proposed tool to reduce their finan-
cial risk. Notably, we found that uncertainty in managerial comments indicates higher future
stock return volatility.
The main limitation of the current investigation is that it has only examined a one-year period.
To further our research we intend to extend the monitored period. Future trials should also assess
the model for predicting stock return volatility for different prediction horizons and focus on
specific industries. Additional linguistic indicators can also be incorporated, such as those based
on semantic and syntactic features of words and phrases.
Acknowledgments
We gratefully acknowledge the support of the Czech Sciences Foundation under Grant No.
19-15498S. We also appreciate the comments and suggestions of reviewers and ICABL 2019
conference discussants on research methodology for this chapter.
References
Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of Internet stock
message boards. The Journal of Finance, 59(3), 1259–1294.
Bicudo de Castro, V., Gul, F. A., Muttakin, M. B., & Mihret, D. G. (2019). Optimistic tone and audit fees:
Some Australian evidence. International Journal of Auditing, 23(2), 352–364.
Biddle, G. C., Hilary, G., & Verdi, R. S. (2009). How does financial reporting quality relate to investment
efficiency?. Journal of Accounting and Economics, 48(2–3), 112–131.
Black, F. (1976). The pricing of commodity contracts. Journal of Financial Economics, 3(1), 167–179.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Carr, P., & Wu, L. (2009). Variance risk premiums. The Review of Financial Studies, 22, 1311–1341.
Chen, C., Fengler, M. R., Härdle, W. K., & Liu, Y. (2018). Textual sentiment, option characteristics, and
stock return predictability. Economics Working Paper Series, 1808, University of St. Gallen, School of
Economics and Political Science.
Chin, C. W., Harris, R. D. F., Stoja, E., & Chin, M. (2018). Financial market volatility, macroeconomic
fundamentals and investor sentiment. Journal of Banking & Finance, 92, 130–145.
Feldman, R., & Sanger, J. (2007). The text mining handbook: Advanced approaches in analyzing unstructured
data. Cambridge: Cambridge University Press.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Sardinia,
249–256.
Groth, S. S., & Muntermann, J. (2011). An intraday market risk management approach based on textual
analysis. Decision Support Systems, 50(4), 680–691.
Hajek, P. (2018). Combining bag-of-words and sentiment features of annual reports to predict abnormal
stock returns. Neural Computing and Applications, 29(7), 343–358.
Hajek, P., & Henriques, R. (2017). Mining corporate annual reports for intelligent detection of financial
statement fraud - A comparative study of machine learning methods. Knowledge-Based Systems, 128,
139–152.
Hajek, P., Olej, V., & Myskova, R. (2014). Forecasting corporate financial performance using sentiment in
annual reports for stakeholders’ decision-making. Technological and Economic Development of Econ-
omy, 20(4), 721–738.
Hart, R. P. (2001). Redeveloping DICTION: Theoretical considerations. Progress in Communication Sci-
ences, 16, 43–60.
Kaeck, A. (2018). Variance-of-variance risk premium. Review of Finance, 22, 1549–1579.
Kaeck, A., & Alexander, C. (2013). Continuous-time VIX dynamics: On the role of stochastic volatility of
volatility. International Review of Financial Analysis, 28, 46–56.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt’s SMO
algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
Kim, S. H., & Kim, D. (2014). Investor sentiment from Internet message postings and the predictability
of stock returns. Journal of Economic Behavior & Organization, 107, 708–729.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Proceedings of the 9th
International Workshop on Machine Learning, Aberdeen, 249–256.
Knittel, Ch. R., & Pindyck, R. S. (2016). The simple economics of commodity price speculation. American
Economic Journal: Macroeconomics, American Economic Association, 8(2), 85–110.
Kothari, S. P., Xu, L., & Short, J. E. (2009). The effect of disclosures by management, analysts, and business
press on cost of capital, return volatility, and analyst forecasts: A study using content analysis. The
Accounting Review, 84(5), 1639–1670.
Kumar, B. S., & Vadlamani, R. (2016). A survey of the applications of text mining in financial domain.
Knowledge-Based Systems, 114, 128–147.
Liaw, A., & Wiener, M. (2002). Classification and regression by RandomForest. R News, 2(3), 18–22.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and
10-Ks. The Journal of Finance, 66(1), 35–65.
Manning, C. (1999). Foundations of statistical natural language processing. Cambridge: MIT Press.
Myskova, R., Hajek, P., & Olej, V. (2018). Predicting abnormal stock return volatility using textual analysis
of news - A meta-learning approach. Amfiteatru Economic, 20(47), 185–201.
Oliveira, N., Cortez, P., & Areal, N. (2017). The impact of microblogging data for stock market predic-
tion: Using Twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert
Systems with Applications, 73, 125–144.
Quinlan, J. R. (1999). Simplifying decision trees. International Journal of Human-Computer Studies, 51(2),
497–510.
See-To, E. W., & Yang, Y. (2017). Market sentiment dispersion and its effects on stock return and volatility.
Electronic Markets, 27 (3), 283–296.
Seng, J., & Yang, H. (2017), The association between stock price volatility and financial news – A sentiment
analysis approach. Kybernetes, 46 (8), 1341–1365.
Shi, Y., Ho, K. Y., & Liu, W. M. (2016). Public information arrival and stock return volatility: Evidence
from news sentiment and Markov regime-switching approach. International Review of Economics &
Finance, 42, 291–312.
Staritz, C. (2012). Financial markets and the commodity price boom: Causes and implications for developing
countries. Working Paper, Austrian Foundation for Development Research (ÖFSE).
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The
Journal of Finance, 62(3), 1139–1168.
Xing, F. Z., Cambria, E., & Zhang, Y. (2019). Sentiment-aware volatility forecasting. Knowledge-Based
Systems, 176, 68–76.
Zhao, L. T., Liu, L. N., Wang, Z. J., & He, L. Y. (2019). Forecasting oil price volatility in the era of big
data: A text mining for VaR approach. Sustainability, 11(14), 3892.
Chapter 6
6.1 Introduction
Nowadays, huge datasets are available in every field of science due to the possibility of recording and storing every piece of relevant and apparently irrelevant information. Not only structured data
but also unstructured ones are available in different forms such as images, histograms and text
documents. All of them can be used as inputs in algorithms in order to improve the efficiency of
decision processes, and to discover new relationships in nature, financial markets and society. The
power of these data is potentially infinite and in recent years an active area of research emerged
in economics and finance, which aims at handling inference on high-dimensional and complex
data by combining machine learning methods with econometric models. Chinese Restaurant
Processes and Bayesian nonparametric inference have been considered in Bassetti et al. (2018b),
Billio et al. (2019), Bassetti et al. (2018a) and Bassetti et al. (2014); graphical models for time
series analysis are developed in Bianchi et al. (2019), Ahelegbey et al. (2016a) and Ahelegbey et al.
(2016b); random forest and nonparametric inference have been studied in Athey et al. (2019) and
Wager and Athey (2018).
Nevertheless, it should not be forgotten that an algorithm or inference procedure might not
work properly if all input data are not carefully refined. For this purpose, data preprocessing tech-
niques are exploited to ensure that the input dataset is suitable for a data analysis process so
that its efficiency is increased (see García et al. (2015) for further details). Data preprocessing
is organized into two main steps: data preparation and data reduction. Data preparation consists
in data cleaning, which includes correction of bad data, filtering of incorrect data and reduction of unnecessary details, and data transformation, which includes data conversion, missing value imputation and data normalization. Data reduction techniques, which are the main focus of this chapter, allow for reducing the sample size, cardinality, and dimensionality of the original data.
∗ This research used the SCSCF multiprocessor cluster system and is part of the project Venice Center for Risk Analytics (VERA) at Ca' Foscari University of Venice.
Sample size reduction ensures a data reduction through the estimation of parametric or
nonparametric models which preserve some data properties. Cardinality reduction includes for
example binning processes which divide data into intervals (bins) and identify a representative
feature value for each bin. Dimensionality reduction techniques allow for reducing the number of
variables and can be divided into three types of approaches. The first is feature selection, which identifies
the best subset of variables to represent the original data. The second is feature construction which
compounds new features through the application to the original data of constructive operators
such as logical conjunction, string concatenation and numerical average (Sondhi, 2009). The
third is feature extraction which defines new features through a mapping of the original variables
into a lower dimensional space. Within this approach, we will review Principal Component Anal-
ysis, Factor Analysis and Projection Pursuit (Section 6.2) and focus on random projection methods
(Section 6.3) which have recently been applied to statistics and machine learning (Breger et al.,
2019; Fard et al., 2012; Guhaniyogi and Dunson, 2015; Kabán, 2014; Li et al., 2019; Maillard
and Munos, 2009; Thanei et al., 2017) and econometrics (Koop et al., 2019). Also, see Boot and
Nibbering (2019) for a review of sub-space projection methods in macro-econometrics.
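Since random projections are the focus of Section 6.3, a minimal R sketch in the spirit of the Johnson-Lindenstrauss lemma is given below: points are projected with a Gaussian random matrix and the distortion of pairwise distances is inspected. The dimensions and the 1/sqrt(k) scaling are our own conventional choices.

# Project n points from dimension d down to k with a Gaussian random matrix
# whose entries are N(0, 1/k), and compare pairwise distances before and after.
set.seed(1)
n <- 100; d <- 1000; k <- 50
X <- matrix(rnorm(n * d), n, d)                # original high-dimensional data
R <- matrix(rnorm(d * k, sd = 1 / sqrt(k)), d, k)
Z <- X %*% R                                   # k-dimensional random projection

ratio <- dist(Z) / dist(X)                     # distance distortion factors
summary(as.vector(ratio))                      # concentrated around 1 for moderate k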
Data reduction is a necessary preprocessing step for high-dimensional data since such data raise
a number of modeling, inference and prediction issues. In statistical models, the number
of parameters increases exponentially with the number of variables (over-parametrization),
which typically yields good in-sample fit but poor out-of-sample prediction performance (overfit-
ting). Also, the number of observations needed to achieve an effective model fit is large and
is not reached in many practical situations (inefficiency issues). Similar issues arise in machine
learning. A learning algorithm should be trained on a large dataset in order to achieve good
fitting or forecasting performance. When the number of covariates (features) increases, the
algorithm needs a very large training set and is prone to overfitting the training data if the
number of observations is not large enough. All the issues mentioned
above originate from the curse of dimensionality (Bellman, 1961) and multicollinearity problems
in high-dimensional spaces (Vershynin, 2019; Wang, 2012).
An intuitive explanation of the curse of dimensionality is that in high-dimensional spaces there
is exponentially more room than in low-dimensional spaces. For example, a cube with side 2 in
$\mathbb{R}^3$ has a volume $2^3 = 8$ times larger than that of the unit cube. The same cube of side 2 in $\mathbb{R}^d$ has a
volume $2^d$ times larger than the unit cube lying in the same d-dimensional space. The larger volume
available in high-dimensional spaces makes it less likely that random observations fall in the
“center” of the distribution: the tails of high-dimensional probability distributions are much more
important than the center.
Two similar situations are the following: a hypersphere inscribed in a cube and two embedded
hyperspheres. It can be shown that, as the dimension increases, data points drawn randomly
in the hypercube are more likely to fall in the complement of the hypersphere inscribed in the
hypercube; similarly, data drawn randomly in the larger hypersphere are more likely to fall in
the complement of the smaller hypersphere. In high-dimensional spaces, data concentrate in
unexpected parts of those spaces. This mathematical fact is referred to in the literature as the
empty space phenomenon and is illustrated in Figure 6.1.
Figure 6.1 When the dimension of the space (d-sphere) increases from d = 2 (2-sphere on the
left) to d = 3 (3-sphere on the right), the proportion of random points (black dots) falling into
the internal sphere (red area) of radius 1 − ε, with ε = 0.6, decreases, whereas the proportion
of points (empty circles) in the complement of the internal sphere (gray area) increases.
Given 1,000 points randomly generated in a unit d-sphere, the percentage of random points (black dots)
in the inscribed d-sphere of radius 1 − ε (red area), with ε ∈ (0, 1), decreases when d increases
from 2 (left plot) to 3 (right plot).
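To make the empty space phenomenon concrete, the following short MATLAB sketch (ours, not part of the original chapter) draws points uniformly in the unit d-ball using a standard sampler (Gaussian direction and radius distributed as $U^{1/d}$) and reports the share falling inside the concentric ball of radius 1 − ε; the theoretical share is $(1-\varepsilon)^d$.

rng(1); npoints = 1000; epsl = 0.6;
for d = [2 3 10]
    g = randn(npoints, d);                       % random directions
    dirs = g ./ sqrt(sum(g.^2, 2));              % unit vectors on the d-sphere
    r = rand(npoints, 1).^(1/d);                 % radii giving a uniform draw in the ball
    x = dirs .* r;                               % points uniform in the unit d-ball
    share = mean(sqrt(sum(x.^2, 2)) <= 1 - epsl);
    fprintf('d = %2d: share inside radius %.1f = %.3f (theory %.3f)\n', ...
            d, 1 - epsl, share, (1 - epsl)^d);
end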
As in the examples of the hypercube and hyperspheres, consider a multidimensional Gaussian
distribution. In dimension d = 1 the probability mass of this distribution contained in a 1-sphere
(or interval) of radius 1.65 is about 0.9; it can be proved that as the dimension of the space
increases, the mass contained in that sphere decreases and becomes almost 0 in dimension
10. This shows that as d goes to infinity, the tails of a distribution become much more important
than its center. For a wide and comprehensive discussion of this topic, the reader can refer to
Verleysen and François (2005).
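The Gaussian example can be checked numerically: for a standard d-dimensional Gaussian, the probability mass inside the sphere of radius 1.65 equals the chi-square CDF with d degrees of freedom evaluated at $1.65^2$. The sketch below is our own; it uses the base-MATLAB incomplete gamma function, since chi2cdf(x, d) = gammainc(x/2, d/2).

r = 1.65;
for d = [1 2 5 10]
    mass = gammainc(r^2 / 2, d / 2);   % P(||X|| <= r) for X ~ N(0, I_d)
    fprintf('d = %2d: P(||X|| <= %.2f) = %.4f\n', d, r, mass);
end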
This result leads to a further observation: since in high-dimensional spaces data concentrate in
extreme areas of the distribution, as d increases the distance between the data points and the center
becomes larger; consequently, all points have a similar distance from the center. Hence, substituting
the Euclidean norm $\|x\| = \sqrt{x_1^2 + \dots + x_d^2}$ of each point $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ with the
mean of the norms $\mu_{\|x\|}$ computed over a subset of points $x_1, \dots, x_k \in \mathbb{R}^d$ leads to a negligible
error in computing relative distances from the center of the distribution (Verleysen and François,
2005). This fact opens the way to the application of dimensionality reduction techniques when
learning or making inference on high-dimensional distributions.
Another issue in modeling high-dimensional data is collinearity, also called multicollinearity
in the multivariate case. It is a phenomenon that arises when one or more variables in a regres-
sion model can be nearly perfectly predicted by a linear combination of the other covariates
(near-collinearity). As a result, this phenomenon creates substantial variability in the coefficient
estimate of each variable. The predictive effectiveness of the model as a whole is not affected;
what loses meaning is the value of each regression coefficient, which can no longer be interpreted
in the usual way. In practice, this means that the effect of a unit change in a variable, controlling
for the others, cannot be well estimated. In addition, the standard errors of these estimates become
larger and a small change in the input data can make them vary substantially.
In the case of perfect collinearity, which is much more unusual than near-collinearity, a variable
is an exact linear combination of the others. In this case, the input matrix does not have full rank and
the model is not well specified (Stock and Watson, 2015). As the space dimension increases, the
possibility that some variables are linearly correlated with the others increases substantially and
the model is more likely to suffer from the multicollinearity issues described above. Nevertheless,
it is possible to exploit multicollinearity to reduce the dimensionality of an input dataset. In
the following, we will call extrinsic dimension the dimension of the original dataset and intrinsic
dimension the number of independent variables in the reduced dataset.
The remainder of this chapter is organized as follows. In Section 6.2, dimensionality reduction
techniques are presented as a way to overcome the issues discussed in the introduction. Section
6.3 introduces random projection and Section 6.4 provides some illustrations on simulated time
series data and three original applications to tracking and forecasting a financial index and to
predicting electricity trading volumes.
$$A = U \Lambda U'$$
where $U$ is the orthogonal matrix with the eigenvectors of $A$ in its columns, i.e. $U = (u_1, u_2, \dots, u_d)$, $U'$ is its transpose, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$ is the diagonal matrix of the eigenvalues corresponding to the eigenvectors of $A$, arranged in the same order.
Consider a normed linear combination of the components $x_i \in \mathbb{R}^n$:
$$C = \sum_{i=1}^{d} \alpha_i x_i$$
where $\|\alpha\| = 1$.
The combination $C$ with the maximum variance will be the first principal component, the
combination with the second highest variance and orthogonal to the first will be the second principal
component, and so on. Formally, this is an optimization problem of the type
$$\max_{\alpha \in \mathbb{R}^d} \; \mathrm{Var}(C) = \alpha' A \alpha \quad \text{subject to} \quad \|\alpha\| = 1.$$
It can be shown that the vector $\alpha$ which maximizes the variance of the normed linear combination
is $u_1$, that is, the eigenvector corresponding to the largest eigenvalue ($\lambda_1$). This vector is called the
first principal axis and the projection $u_1'X$ on this axis is called the first principal component.
The following principal components are found by applying the same procedure repeatedly. Importantly,
each principal component explains a percentage of the total variance and the share of variance
due to the i-th principal component is equal to
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_d}.$$
This quantity decreases for successive principal components, the variance due to the first principal
component being the largest.
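A minimal PCA sketch in MATLAB is given below; it is our own illustration with assumed variable names, where $A$ is taken to be the sample covariance matrix of a data matrix X (n observations in rows, d variables in columns).

rng(1); n = 500; d = 5;
X = randn(n, d); X(:, 2) = 0.8 * X(:, 1) + 0.2 * X(:, 2);   % induce some correlation
Xc = X - mean(X);                  % center the data
A = cov(Xc);                       % sample covariance matrix
[U, L] = eig(A); L = diag(L);      % spectral decomposition A = U*diag(L)*U'
[L, idx] = sort(L, 'descend');     % order eigenvalues in decreasing order
U = U(:, idx);                     % principal axes u_1, ..., u_d
scores = Xc * U;                   % principal components (projections on the axes)
explained = L / sum(L);            % share of total variance per component
disp(explained');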
The main drawbacks of PCA are the following.
- Even if the principal components are all linear combinations of the original variables, it is sometimes difficult to interpret them in terms of the initial problem.
- There is no unique rule to decide the number k of components to include in the reduced matrix, that is, the dimension of the projection sub-space. Generally, a rule of thumb is used: components are added until a satisfactory percentage of the original variance is explained.
- It may not be effective when the original data lie on non-linear manifolds,1 because PCA can only project onto a linear sub-space.2
In factor analysis, the data matrix is assumed to satisfy the factor model
$$X' = \Upsilon F + E \qquad (6.2)$$
where $\Upsilon$ is the matrix of factor loadings, $F$ the matrix of common factors and $E$ the matrix of idiosyncratic (specific) terms, under the following assumptions:
1. all factors are standardized; that is, their expected value is null (i.e. $E(f_j) = 0$) and their variance-covariance matrix is equal to the identity (i.e. $E(f_j f_j') = I$);
2. all specific factors have null expected value (i.e. $E(e_i) = 0$);
3. idiosyncratic terms are mutually independent (i.e. $\mathrm{Cov}(e_i, e_j) = 0$, $i \neq j$);
4. idiosyncratic terms and random factors are mutually independent (i.e. $\mathrm{Cov}(f_i, e_j) = 0$).
Since all columns of $F$ and $E$ are independent, the variance $\mathrm{Var}(X')$ is equal to $\mathrm{Var}(\Upsilon F + E) =
\mathrm{Var}(\Upsilon F) + \mathrm{Var}(E)$ and, according to the fundamental theorem of factor analysis (Bartholomew,
1984), the variance-covariance matrix can be decomposed as
$$\Sigma = \Upsilon \Upsilon' + \Psi$$
where $\Psi$ is the variance-covariance matrix of the idiosyncratic term $E$. From the specification
of the factor model in Equation 6.2, each element $x_{j,i}$, $j = 1, \dots, d$, $i = 1, \dots, n$, satisfies
$$x_{j,i} = \sum_{l=1}^{k} \upsilon_{j,l} f_{l,i} + e_{j,i}$$
The first term in Equation 6.3, $h_i^2 = \sum_{j=1}^{k} \upsilon_{i,j}^2$, is called the communality and is the part of
the variance due to the common factors, while the second term, $\Psi_{i,i}$, is called the specific or
unique variance, and it is the part of the variability due to the idiosyncratic term $e_i$. If many $x_i$'s
have high coefficients ($\upsilon_{i,j}$) on the same factor $f$, they depend heavily on this same unknown
factor, so they are probably redundant. Exploiting this feature of the data, the dimensionality of a
dataset can be considerably reduced by including in the analysis only non-redundant factors.
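For completeness, the decomposition into communalities and unique variances can be illustrated with MATLAB's factoran (Statistics and Machine Learning Toolbox); the data, the number of factors k and all variable names below are our own choices, not the chapter's.

rng(1); n = 400; d = 6; k = 2;
F   = randn(n, k);                  % common factors
Ups = randn(d, k);                  % loading matrix (Upsilon)
E   = 0.5 * randn(n, d);            % idiosyncratic terms
X   = F * Ups' + E;                 % data generated from the factor model
[lambda, psi] = factoran(X, k);     % estimated loadings and specific variances
communality = sum(lambda.^2, 2);    % h_i^2 for each variable
disp([communality, psi]);           % communality vs. unique variance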
Theorem 6.3.2 (Norm Preservation). Let $x \in \mathbb{R}^d$ and let $A$ be a $k \times d$ matrix with entries $a_{i,j}$ sampled
independently from a standard normal distribution $N(0,1)$; then
$$P\left( (1-\varepsilon)\,\|x\|^2 \;\le\; \left\| \frac{1}{\sqrt{k}} A x \right\|^2 \;\le\; (1+\varepsilon)\,\|x\|^2 \right) \;\ge\; 1 - 2\,e^{-(\varepsilon^2 - \varepsilon^3)k/4} \qquad (6.4)$$
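The bound in Theorem 6.3.2 can be checked by simulation. The sketch below is ours: it projects a fixed vector with many independent Gaussian matrices and compares the empirical frequency of norm preservation with the theoretical lower bound.

rng(1); d = 1000; k = 200; epsl = 0.2; ntrial = 2000;
x = randn(d, 1);
ok = 0;
for t = 1:ntrial
    A = randn(k, d);                                  % Gaussian random projection matrix
    ratio = norm(A * x / sqrt(k))^2 / norm(x)^2;      % ||Ax/sqrt(k)||^2 / ||x||^2
    ok = ok + (ratio >= 1 - epsl && ratio <= 1 + epsl);
end
bound = 1 - 2 * exp(-(epsl^2 - epsl^3) * k / 4);
fprintf('empirical frequency %.4f vs. theoretical lower bound %.4f\n', ok / ntrial, bound);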
Given that the norm preservation theorem has been proved for the Gaussian random projection
function (Dasgupta and Gupta, 1999), a last issue must be solved: the existence of pairs
$u, v \in Q$ for which the JL lemma holds under the Gaussian random projection.
Set $f$ in Equation 6.4 equal to the Gaussian random projection function, so that the norm
preservation theorem holds,
$$f(x) = \frac{1}{\sqrt{k}} A x \qquad (6.5)$$
with $k = 20\log(n)/\varepsilon^2$, and consider pairs $u, v \in Q \subset \mathbb{R}^d$; the number of such pairs is of
order $O(n^2)$. Using probabilistic arguments, it can be shown that the probability that a pair of
vectors $u$ and $v$ exists for which the JL lemma does not hold is lower than one. Equivalently,
under proper assumptions, the probability that the lemma holds for every pair $u$ and $v$ is greater
than zero. This ensures that the JL lemma is valid from a probabilistic point of view.
The first type allows for a higher degree of sparsity in the projection matrix. This characteristic
has two main advantages: first, it strongly reduces the computational costs as two-thirds of the
entries are zeros; moreover, this feature reflects the fact that economic and financial data are sparse.
Take as an example financial returns; following the literature, they are generally well described by
distributions with null mean and fat tails.
In the Bayesian literature, random projection was implemented to cope with problems of
large-scale regressions (Guhaniyogi and Dunson, 2015) and large-scale vector autoregressions (Koop
et al., 2019). In both these works, the authors used a projection matrix A with entries
$$\Phi_{i,j} = \begin{cases} +\dfrac{1}{\sqrt{\phi}} & \text{with probability } \phi^2 \\[4pt] 0 & \text{with probability } 2(1-\phi)\phi \\[4pt] -\dfrac{1}{\sqrt{\phi}} & \text{with probability } (1-\phi)^2 \end{cases}$$
In the simulation experiments below, the data are generated from the linear regression model
$$Y_j = \beta_0 + x_j'\beta + \eta_j, \qquad \eta_j \overset{iid}{\sim} N(0, \sigma_\eta^2)$$
with i = 1, . . . , d . The Bernoulli random variable si allows for setting the coefficient βi at zero
randomly with probability 1 − p, thus excluding some covariates from the model which is gener-
ating the data. We apply a random projection approach to reduce the dimensionality of a linear
regression model. The dimension of the projection sub-space is given by the error bound estimates
of the JL lemma, $k = 20\log(n)/\varepsilon^2$.
The numerical illustration in Figure 6.2 shows that in our experiment settings with n = 2,000
and n = 1,100 observations, the optimal sub-space dimensions are k = 634 and k = 588,
respectively. We fit the following random projection (RP) regression model:
$$Y_j = \beta_0 + w_j(A)'\beta + \eta_j, \qquad \eta_j \overset{iid}{\sim} N(0, \sigma_\eta^2)$$
$$w_j(A) = \frac{1}{\sqrt{k}} A x_j, \qquad j = 1, \dots, n$$
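A compact sketch of this RP regression is given below; it is our own illustration, with an assumed number of covariates d, inclusion probability p and value of ε, and it uses the JL-based rule $k = 20\log(n)/\varepsilon^2$ to choose the sub-space dimension.

rng(1); n = 2000; d = 1500; epsl = 0.49; p = 0.1;
beta = (rand(d, 1) < p) .* randn(d, 1);        % sparse true coefficients
X = randn(n, d);
y = 1 + X * beta + 3 * randn(n, 1);            % data generating process
k = ceil(20 * log(n) / epsl^2);                % JL-based sub-space dimension
A = randn(k, d);                               % Gaussian random projection matrix
W = X * A' / sqrt(k);                          % compressed regressors w_j(A)' in rows
bc = [ones(n, 1), W] \ y;                      % OLS on the compressed model
beta_rp = A' * bc(2:end) / sqrt(k);            % implied coefficients in dimension d
fprintf('MSE of the mapped-back coefficients: %.3f\n', mean((beta_rp - beta).^2));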
Figure 6.2 Dimension k of the projection sub-space given by the JL lemma (vertical axis, left)
and the upper bound in the norm preservation theorem (vertical axis, right) as a function of
the number of observations n (horizontal axes).
Table 6.2 Mean square error for the different estimation strategies (columns) and experimental settings (rows) given in Table 6.1

Experiment   OLS (n = 2,000)   OLS (n = 1,100)   RP OLS (n = 2,000)   RP OLS (n = 1,100)
1            27.98             221.04            115.53               213.17
2            3.38              44.95             117.17               167.40
3            29.95             209.51            468.81               943.79
4            36.24             326.48            22.95                41.42
5            24.97             304.51            39.59                76.09
process is lower (see Figure 6.3). This is to say that when data are sparser, the gain in effectiveness
using random projection instead of OLS is substantial.
$$R_{SP,t} = R_{HC,t}'\,\frac{1}{\sqrt{k}}\,A\,\beta^c + \varepsilon_t, \qquad t = 1, \dots, n$$
where the matrix $A$ is a Gaussian random projection matrix, used to efficiently determine an optimal
investment strategy.
Figure 6.3 Results of experiment 4, where the probability of inclusion of a covariate in the
data generating process is small, that is p = 0.01. In each plot: true parameter values (blue
dots) and estimated parameter values (red dots) for a random projection OLS on the entire
sample of 2,000 observations (top) and on a subsample of 1,100 observations (bottom).
In all analyses, the number of observations is n = 104 and the projection sub-space has
dimension k = 15. We followed the suggestion in Guhaniyogi and Dunson (2015) and chose
the new dimension in the interval $[2\log(d), \min(n, d)]$. Random projection is performed 1,000
times, generating independent projection matrices $A_i$, $i = 1, \dots, 1{,}000$. The elements of the
projection $R_{HC,t}' A_i$ can be interpreted as random portfolios and the optimal coefficients $\hat\beta^c_i$ are
portfolio weights conditional on the projection $A_i$. To recover the regression coefficients in the
original dimension d, we perform a reverse random projection of each $\hat\beta^c_i$ vector as follows:
$$\hat\beta_i = A_i \hat\beta^c_i$$
Figure 6.4 Log-returns of the S&P500 index (red line) from 1st August 2019 to 31st December
2019 and log-returns of all S&P Healthcare components (colored lines).
Finally, we apply Bayesian model averaging (Hoeting et al., 1999) and obtain a d-dimensional
vector of portfolio weights:
$$\bar\beta^* = \frac{1}{1000}\sum_{i=1}^{1000} \hat\beta^*_i \qquad (6.7)$$
The results in this section are averages over the 1,000 independent random projections.
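The averaging step in Equation 6.7 can be sketched as follows; the data here are simulated placeholders (in the chapter, R_HC holds the S&P Healthcare component returns and r_SP the S&P500 returns), and all variable names are ours.

rng(1); n = 104; d = 60; k = 15; nproj = 1000;
R_HC = 0.01 * randn(n, d);                            % placeholder component returns
r_SP = R_HC * (ones(d, 1) / d) + 0.001 * randn(n, 1); % placeholder index returns
beta_sum = zeros(d, 1);
for i = 1:nproj
    A  = randn(d, k);                  % d x k Gaussian projection matrix
    W  = R_HC * A / sqrt(k);           % random portfolios (compressed regressors)
    bc = W \ r_SP;                     % weights conditional on the projection A_i
    beta_sum = beta_sum + A * bc / sqrt(k);   % map back to dimension d, keeping the scaling above
end
beta_bar = beta_sum / nproj;           % averaged portfolio weights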
Different portfolio strategies were set up according to different weight calculation methods
and different training sample sizes. Weight calculation is based alternatively on the coefficients
produced by the simple OLS regression (OLS method in Table 6.3), or on the Gaussian random
projection coefficients of Equation 6.7 (GRP method in Table 6.3).
Three different sizes of the training sample are considered: 80.77%, 67.31% and 57.69%
of the original sample size, which corresponds to the first 84, 70 and 60 observations of the
original sample, respectively. In the experiments with 57.69% of the original sample, the number
of observations available equals the number of coefficients to estimate plus one, which makes
the OLS estimates highly inefficient. These experiments mimic scenarios where there is a large
number of replicating assets and a few temporal observations available to determine the optimal
weights. We expect random projection techniques to help deal with this inefficiency issue.
Figure 6.6 presents the log-returns of the S&P500 index and of the replicating strategies. The
corresponding average tracking errors are given in Table 6.3. Results in Table 6.3 and Figure 6.5
show that random projection outperforms OLS in all of our experiments. Random projection
displays a smaller mean absolute error and is more accurate in backtesting. When the subsample
is chosen very small and the number of observations is close to the number of regressors, OLS-
based portfolios have a large variability (bottom plot of Figure 6.5) and random projection OLS
strongly outperforms the simple OLS method in terms of the mean absolute error (MAE), computed as
$$MAE = \frac{1}{n}\sum_{t=1}^{n} \left| \hat{R}_{SP,t} - R_{SP,t} \right|$$
where $\hat{R}_{SP,t} = R_{HC,t}'\,\bar{\beta}^*$. We evaluate the performance of the investment strategies by computing
the Tracking Error (TE), which is a measure of the deviation of a tracking portfolio from the target
index:
$$TE = \sqrt{ \frac{1}{n-1} \sum_{t=1}^{n} \left( (\hat{R}_{SP,t} - \bar{\hat{R}}_{SP}) - (R_{SP,t} - \bar{R}_{SP}) \right)^2 }$$
where $\bar{\hat{R}}_{SP}$ and $\bar{R}_{SP}$ are the average returns of the replicating and target portfolio, respectively
(Basak et al., 2009). The TE is computed as the standard deviation of the difference between the index
and portfolio returns. The TE values confirm that random projection has a better performance
in terms of portfolio replication. The results are confirmed when the regression coefficients are
normalized to obtain a replicating portfolio with self-financing constraint and initial capital equal
to 1 (last column of Table 6.3).
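The two evaluation measures can be computed in a few lines; rhat and r below stand for the replicating and target return series (simulated placeholders with our own names).

rng(1); n = 104;
r    = 0.01 * randn(n, 1);                        % placeholder target index returns
rhat = r + 0.002 * randn(n, 1);                   % placeholder replicating portfolio returns
MAE = mean(abs(rhat - r));                        % mean absolute error
TE  = std((rhat - mean(rhat)) - (r - mean(r)));   % tracking error (std uses 1/(n-1))
fprintf('MAE = %.5f, TE = %.5f\n', MAE, TE);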
Figure 6.5 Results for Gaussian random projection model estimated on whole sample. Top:
true values (black), and OLS (blue dashed) and random projection OLS (red dashed) fitted
values. Bottom: mean absolute error for OLS (blue dashed) and random projection OLS (red
dashed). Gray lines represent the results of each one of the 1,000 random projection estimates.
$$R_{SP,t+1} = R_{HC,t}'\,\beta + \varepsilon_t$$
Figure 6.6 Performance of S&P500 index (blue), OLS-based portfolio (yellow) and (RP+OLS)-
based portfolio (red). Weights computed using the whole sample (top), a subsample of 70
observations (middle) and a subsample of 60 observations (bottom).
by the compressed regressions are stored. The aim of this application is to forecast the values of the
test set (51 observations) through a multiple regression, performing a weighted model averaging
of these ten different trials. In particular, the mean absolute error is computed for each of these models
applied to the training set in order to derive a vector of weights for the model averaging. To
compute it, a vector of absolute errors is produced for each Monte Carlo simulation:
$$AE_i = \left| R_{SP} - \hat{R}_{SP,i} \right| \qquad (6.8)$$
where $AE_i$ is a $52 \times 1$ vector of absolute errors for the i-th fitted model and $i = 1, \dots, 10$.
Averaging all the values in each vector gives a measure of the error committed by each regression
($MAE_i$). Then, its reciprocal is used to calculate the weights
$$P_i = \frac{MAE_i^{-1}}{\sum_{j=1}^{10} MAE_j^{-1}} \qquad (6.9)$$
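A small sketch of the inverse-MAE weighting in Equations 6.8–6.9 follows; Rhat collects the fitted values of the ten compressed regressions in its columns and r the observed returns (placeholder data, our own names).

rng(1); m = 10; T = 52;
r    = 0.01 * randn(T, 1);                 % placeholder observed returns
Rhat = r + 0.005 * randn(T, m);            % placeholder fits of the ten models
mae  = mean(abs(Rhat - r), 1);             % MAE_i for each model (Eq. 6.8)
w    = (1 ./ mae) / sum(1 ./ mae);         % weights P_i (Eq. 6.9)
forecast = Rhat * w';                      % weighted model average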
For the OLS estimates, the MAPE is equal to 658.93, while preprocessing the data using random projection allows
for an appreciable reduction of the MAPE to a value of 3.67.
Figure 6.7 Returns on S&P500 index (black solid), OLS-based forecasting (dashed, top) and
(OLS+RP)-based forecasting (dashed, bottom).
the electricity volumes traded during a period of 21 days. Every Sunday there is a decrease in the
trading volumes, and every Tuesday and Wednesday there are peaks. These seasonal patterns in
the data call for the use of forecasting models with periodic components such as seasonal step-wise
dummies, and sine and cosine functions (Pedregal and Trapero, 2007). In this application, we consider
a Fourier regression model with the aim of forecasting trading volumes in Queensland at a
daily frequency.
We propose an augmented Fourier regression model
$$Y_t = \phi_t + \mu_t + \xi_t + \varepsilon_t \qquad (6.10)$$
with $\varepsilon_t$ being the idiosyncratic component and $\phi_t$ the Fourier periodic component
$$\phi_t = \sum_{j=1}^{F} \alpha_j \cos(2\pi f_j t) + \sum_{j=1}^{F} \beta_j \sin(2\pi f_j t)$$
Figure 6.8 Daily trading volumes of electricity in thousands of Megawatt (MW) (black)
from 7th March 2010 to 28th March 2010. The vertical dashed line indicates the end of the
week.
where $f_j = j/n$, $j = 1, \dots, n/2$, are the Fourier frequencies and $\alpha_j$ and $\beta_j$ the Fourier
coefficients. The augmentation terms are the quadratic time trend component
$$\mu_t = \alpha + \beta t + \gamma t^2$$
for the long-term dynamics in the trading volumes, with coefficients $\alpha$, $\beta$ and $\gamma$, and the seasonal
component
$$\xi_t = \sum_{j=1}^{6} \gamma_j D^W_{j,t}$$
has in its columns the intercept, the linear and quadratic components of the deterministic trend
and the dummy variables, the regression matrix
$$X_2 = \begin{pmatrix} \sin(1\cdot 2\pi f_1) & \cos(1\cdot 2\pi f_1) & \dots & \sin(1\cdot 2\pi f_F) & \cos(1\cdot 2\pi f_F) \\ \sin(2\cdot 2\pi f_1) & \cos(2\cdot 2\pi f_1) & \dots & \sin(2\cdot 2\pi f_F) & \cos(2\cdot 2\pi f_F) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \sin(n\cdot 2\pi f_1) & \cos(n\cdot 2\pi f_1) & \dots & \sin(n\cdot 2\pi f_F) & \cos(n\cdot 2\pi f_F) \end{pmatrix}$$
has the periodic components at different frequencies in the different columns and the vector ε
contains the error terms. The dimension of the original dataset is n = 4,132 temporal observa-
tions, d = 1 + 2 + 6 + 2F covariates which is equal to 4,141 when all Fourier frequencies are
considered, i.e. F = 2,066. We apply data preprocessing to a subset of covariates. The column
vectors in the regression matrix $X_1$ are not compressed, whereas we apply the $(d-9)\times k$ random
projection matrix $A$ of the type shown in Equation 6.6 to the columns of $X_2$. The resulting
partially compressed regression model is
$$y = X_1\beta_1 + X_2 A\,\beta^c + \varepsilon$$
where $\beta^c$ is a k-dimensional vector of coefficients associated with the compressed regression matrix
$X_2 A$, and $\varepsilon$ is an $n \times 1$ vector of idiosyncratic errors.
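A hedged sketch of the partially compressed Fourier regression is given below: it builds X1 (intercept, trend and weekday dummies) and X2 (Fourier terms), compresses only X2 with a three-point sparse projection of the kind displayed earlier, and fits the model by OLS. The data are simulated, F is reduced for speed, and whether the three-point matrix coincides exactly with Equation 6.6 (which falls outside this excerpt) is our assumption.

rng(1); n = 4132; F = 200; k = 150; phi = 0.5;      % F reduced for a quick run
t   = (1:n)';
dow = mod(t - 1, 7) + 1;                            % day-of-week index 1..7
D   = double(dow == 1:6);                           % six weekday dummies
X1  = [ones(n, 1), t, t.^2, D];                     % intercept, trend, dummies
f   = (1:F) / n;                                    % Fourier frequencies f_j = j/n
X2  = [sin(2*pi*t*f), cos(2*pi*t*f)];               % n x 2F periodic regressors
y   = 6 + 0.2 * sin(2*pi*t/7) - 0.1 * D(:, 1) + 0.05 * randn(n, 1);  % placeholder volumes
U   = rand(2*F, k);                                 % build the (2F) x k sparse projection
A   = (U < (1 - phi)^2) * (-1/sqrt(phi)) + (U > 1 - phi^2) * (1/sqrt(phi));
Xc   = [X1, X2 * A];                                % partially compressed design matrix
bhat = Xc \ y;                                      % OLS on the compressed model
yhat = Xc * bhat;
fprintf('in-sample MAE: %.4f\n', mean(abs(y - yhat)));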
The model in Equation 6.10 is fitted 100 times to the training set using different random
draws of the projection matrix. For each draw, the mean absolute error MAEi is computed and
the coefficients are used to predict the observations in the second part of the sample (validation set)
conditioning on the random projection matrix. The unconditional out-of-sample predictions are
obtained by averaging the conditional predictions over all draws of the random projection matrix.
The combination weights of the Bayesian model averaging are proportional to $MAE_i^{-1}$, that is,
the inverse of the mean absolute error computed at each simulation step.
Figure 6.9 Electricity trading volumes in thousands of Megawatt (MW) (black) for Queensland
and RP combined with OLS forecasting (red) in sample (from 7th December 1998 to 22nd
February 2007) and out of sample (from 23rd February 2007 to 31st March 2010). The vertical
dashed line indicates the end of the in-sample analysis.
In Figure 6.9, the results of the forecasting performance are shown for the RP combined with OLS. Since
the number of covariates is very large and very close to the sample size, OLS without data preprocessing
is not efficient. In our application, the standard deviation of the coefficients estimated with the OLS method
takes values in the range [0, 22.19] with an average value of 0.062, whereas preprocessing the data
allows for reducing the coefficient standard deviation to the range [0, 0.019] and to an average
value of 0.0003.
% Experiment Number 4
p = 0.01;
sigmaeta = 3;
sigmaeps = 1;
sigmaz = 0.01;

% Data generation (n, ntest, d and the remaining settings are defined earlier in the script)
eps = randn(n + ntest, d);
eta = randn(n + ntest, 1);
u = zeros(n + ntest, d);
y = zeros(n + ntest, 1);
Use the following code to perform Ordinary Least Squares (OLS) and OLS with random
projection (RP). The analyses are performed on the whole sample and on a smaller subsample.
% Whole-sample OLS analysis (Xtest and ytest are constructed earlier in the script)
X1 = [ones(n, 1), u(1:n, :)];                                       % design matrix with intercept
bethat1 = inv(X1(1:n, :)' * X1(1:n, :)) * (X1(1:n, :)' * y(1:n));   % OLS coefficient estimates
yhat1 = Xtest * bethat1;                                            % test set prediction
mse1 = mean((ytest - yhat1).^2);                                    % test set mean square error
% Plot true vs. RP-estimated coefficients (bet, bethat3, A and ka are defined earlier)
sz0 = 5;
scatter((1:d)', bet, sz0, 'MarkerEdgeColor', 'none', 'MarkerFaceColor', [0 0 1]);                 % true values (blue)
hold on;
scatter((1:d)', A * bethat3(2:end), sz0, 'MarkerEdgeColor', 'none', 'MarkerFaceColor', [1 0 0]);  % RP estimates (red)
hold off;
legend('true', ['Estim. RP (k = ', num2str(ka), ', n = ', num2str(n), ')'], 'location', 'northwest');
ylim([-1 + min(bet), max(bet)]);
Notes
1 A manifold is a topological space that locally resembles a Euclidean space near each point. This means that each point of an n-dimensional manifold has a neighborhood that is homeomorphic to an open subset of the n-dimensional Euclidean space.
2 However, when data lie on a non-linear manifold, it is possible to apply PCA locally (Kambhatla and Leen, 1997).
3 A is composed of orthonormal column vectors ai , subject to Cov(ai , aj ) = 0 ∀i 6= j.
4 Nevertheless, Zhao and Mao (2015) propose to build projection matrices after a preliminary analysis of the
features of the original dataset.
5 For example, a function f whose first derivative is bounded is Lipschitz.
References
Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins.
Journal of Computer and System Sciences, 66(4):671–687.
Ahelegbey, D. F., Billio, M., and Casarin, R. (2016a). Bayesian graphical models for structural vector
autoregressive processes. Journal of Applied Econometrics, 31(2):357–386.
Ahelegbey, D. F., Billio, M., and Casarin, R. (2016b). Sparse graphical multivariate autoregression: A
Bayesian approach. Annals of Economics and Statistics, 123/124:1–30.
Athey, S., Tibshirani, J., Wager, S., et al. (2019). Generalized random forests. The Annals of Statistics,
47(2):1148–1178.
Bartholomew, D. J. (1984). The foundations of factor analysis. Biometrika, 71(2):221–232.
Basak, G. K., Jagannathan, R., and Ma, T. (2009). Jackknife estimator for tracking error variance of optimal
portfolios. Management Science, 55(6):990–1002.
Bassetti, F., Casarin, R., and Leisen, F. (2014). Beta-product dependent Pitman–Yor processes for Bayesian
inference. Journal of Econometrics, 180(1):49–72.
Bassetti, F., Casarin, R., and Ravazzolo, F. (2018). Bayesian nonparametric calibration and com-
bination of predictive distributions. Journal of the American Statistical Association, 113(522):
675–685.
Bassetti, F., Casarin, R., Rossini, L., et al. (2020). Hierarchical species sampling models. Bayesian Analysis,
15(3), 809–838.
Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Chapter 5, pages 94–95. Princeton.
Bianchi, D., Billio, M., Casarin, R., and Guidolin, M. (2019). Modeling systemic risk with Markov
switching graphical sur models. Journal of Econometrics, 210(1):58–74.
Billio, M., Casarin, R., and Rossini, L. (2019). Bayesian nonparametric sparse var models. Journal of
Econometrics, 212(1):97–115.
Boot, T. and Nibbering, D. (2019). Forecasting using random subspace methods. Journal of Econometrics,
209(2):391–406.
Breger, A., Orlando, J., Harar, P., Dörfler, M., Klimscha, S., Grechenig, C., Gerendas, B., Schmidt-Erfurth,
U., and Ehler, M. (2019). On orthogonal projections for dimension reduction and applications in
augmented target loss functions for learning problems. Journal of Mathematical Imaging and Vision,
62:376–394.
Carreira-Perpinán, M. A. (1997). A review of dimension reduction techniques. Department of Computer
Science. University of Sheffield. Tech. Rep. CS-96-09, 9:1–69.
Chen, G. and Tuo, R. (2020). Projection pursuit Gaussian process regression. arXiv preprint
arXiv:2004.00667.
Corielli, F. and Marcellino, M. (2006). Factor based index tracking. Journal of Banking & Finance,
30(8):2215–2233.
Dasgupta, S. and Gupta, A. (1999). An elementary proof of the Johnson-Lindenstrauss lemma. International
Computer Science Institute, Technical Report, 22(1):1–5.
Dunteman, G. H. (1989). Principal components analysis. Number 69. Sage.
Fard, M. M., Grinberg, Y., Pineau, J., and Precup, D. (2012). Compressed least-squares regression on sparse
spaces. In AAAI’12: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. Toronto,
Ontario, Canada.
Fodor, I. K. (2002). A survey of dimension reduction techniques. Technical report, Lawrence Livermore
National Lab., CA (US).
Fu, J., Zhou, Q., Liu, Y., and Wu, X. (2020). Predicting stock market crises using daily stock market
valuation and investor sentiment indicators. The North American Journal of Economics and Finance,
51:100905.
García, S., Luengo, J., and Herrera, F. (2015). Data preprocessing in data mining. Springer.
Guhaniyogi, R. and Dunson, D. B. (2015). Bayesian compressed regression. Journal of the American
Statistical Association, 110(512):1500–1514.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: A
tutorial. Statistical Science, 14(4):382–401.
Huber, P. J. (1985). Projection pursuit. The Annals of Statistics, 13(2):435–475.
Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space.
Contemporary Mathematics, 26:189–206.
Jolliffe, I. T. and Cadima, J. (2016). Principal component analysis: A review and recent developments.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences,
374(2065):20150202.
Kabán, A. (2014). New bounds on compressive linear least squares regression. Artificial Intelligence and
Statistics, 33:448–456.
Kambhatla, N. and Leen, T. K. (1997). Dimension reduction by local principal component analysis. Neural
Computation, 9(7):1493–1516.
Kim, S. and Kim, S. (2020). Index tracking through deep latent representation learning. Quantitative
Finance, 20(4):639–652.
Koop, G., Korobilis, D., and Pettenuzzo, D. (2019). Bayesian compressed vector autoregressions. Journal
of Econometrics, 210(1):135–154.
Lee, J. A. and Verleysen, M. (2007). Nonlinear dimensionality reduction. Springer Science & Business
Media.
Li, L., Vidyashankar, A. N., Diao, G., and Ahmed, E. (2019). Robust inference after random projections
via Hellinger distance for location-scale family. Entropy, 21(4):348.
Maillard, O. and Munos, R. (2009). Compressed least-squares regression. Advances in neural information
processing systems, volume 22, pages 1213–1221.
Pedregal, D. J. and Trapero, J. R. (2007). Electricity prices forecasting by automatic dynamic harmonic
regression models. Energy Conversion and Management, 48(5):1710–1719.
Raviv, E., Bouwman, K. E., and Van Dijk, D. (2015). Forecasting day-ahead electricity prices: Utilizing
hourly prices. Energy Economics, 50:227–239. Elsevier.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326.
Ruppert, D. (2011). Statistics and data analysis for financial engineering, volume 13, chapter 18, pages
517–527. Springer.
Schölkopf, B., Smola, A., and Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue
problem. Neural Computation, 10(5):1299–1319.
Serneels, S. (2019). Projection pursuit based generalized betas accounting for higher order co-moment
effects in financial market analysis. arXiv preprint arXiv:1908.00141.
Sondhi, P. (2009). Feature construction methods: A survey. sifaka.cs.uiuc.edu, 69:70–71.
Stock, J. H. and Watson, M. W. (2002a). Forecasting using principal components from a large number of
predictors. Journal of the American Statistical Association, 97(460):1167–1179.
Stock, J. H. and Watson, M. W. (2002b). Macroeconomic forecasting using diffusion indexes. Journal of
Business & Economic Statistics, 20(2):147–162.
Stock, J. H. and Watson, M. W. (2015). Introduction to econometrics. Pearson.
Thanei, G. A., Heinze, C., and Meinshausen, N. (2017). Random projections for large-scale regression. In
Ahmed, S. Ejaz (ed.), Big and complex data analysis, pages 51–68. Springer.
Verleysen, M. and François, D. (2005). The curse of dimensionality in data mining and time series predic-
tion. In International Work-Conference on Artificial Neural Networks, pages 758–770. Springer.
Vershynin, R. (2019). High-dimensional probability, chapter 3, pages 42–43. Cambridge University Press.
Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random
forests. Journal of the American Statistical Association, 113(523):1228–1242.
Wang, J. (2012). Geometric structure of high-dimensional data and dimensionality reduction. Springer.
Weron, R. (2014). Electricity price forecasting: A review of the state-of-the-art with a look into the future.
International Journal of Forecasting, 30(4):1030–1081.
Woodruff, D. P. (2014). Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357.
Zhao, R. and Mao, K. (2015). Semi-random projection for dimensionality reduction and extreme learning
machine in high-dimensional space. IEEE Computational Intelligence Magazine, 10(3):30–41.
Chapter 7
7.1 Introduction
The age of cloud computing is upon us. In the Financial Services Industry, we will see a signifi-
cant acceleration in the migration of core banking applications and related workloads to public
and private clouds over the next three years. This will help drive innovation, agility and digital
transformations but it will also contribute to enhanced complexity and potential operational risk
concerns if institutions build out a siloed cloud environment.
This chapter will focus on the future of cloud computing in financial services and highlight
a few examples where machine learning (ML) and artificial intelligence (AI) are transforming
how the financial services industry is driving innovation by leveraging data to improve products
and services, and reduce risk. We also highlight the foundational role cloud computing provides
by enabling institutions to have an enterprise-grade hybrid and multi-cloud architecture that
provides full portability of data and applications while having a single data management, data
governance and data security capability across all environments. This allows for the full life cycle
industrialisation of ML and AI that provides capabilities to support real-time data ingestion, data
analysis, model development, validation and deployment with ongoing model management and
monitoring.
We showcase two examples where ML and AI provide innovative capabilities to address critical
business problems. We then briefly touch on regulatory concerns around cloud computing at a
firm and industry level and explain how the next generation of cloud computing platforms when
used in conjunction with a modern data management, ML and AI platform addresses many
of these firm-specific risk concerns but not the industry-level financial stability concerns which
require regulatory involvement.
Applications of AI and machine learning may enhance the interconnections between finan-
cial markets and institutions in unexpected ways. Institutions’ ability to make use of big
data from new sources may lead to greater dependencies on previously unrelated macroeco-
nomic variables and financial market prices, including from various nonfinancial corporate
sectors (e-commerce, sharing economy, etc.). As institutions find algorithms that generate
uncorrelated profits or returns, there is a risk these will be exploited on a sufficiently wide
scale that correlations actually increase.2
A related aspect of these concerns about ML and AI for the Financial Services industry is a require-
ment for Model Risk Management (MRM). This is a regulatory requirement as well as a core
aspect of a bank’s operational risk management framework. A recent McKinsey paper3 highlights
how the Coronavirus pandemic has identified shortcomings of existing MRM capabilities:
... the real failure is not that banks used models which failed in this crisis but rather that
they did not have fallback plans to manage when the crisis did come.
It highlights several reasons why models have failed during the COVID-19 crisis:
1. Model assumptions and boundaries defined at the design stage were developed in a pre-
COVID-19 world.
2. Most models draw on historical data without the access to high-frequency data that would
enable recalibration.
3. While access to the needed alternative data is theoretically possible, models would not be able
to integrate the new information in an agile manner, because the systems and infrastructure
on which they are built lack the necessary flexibility.
The last point, regarding the lack of flexibility of ML and AI systems, is critical to the global efforts
of many firms and analytic vendors to develop what is widely called the industrialisation of ML
and AI. At Cloudera we have been working for years to enable the modern machine learning life
cycle for large teams across massive data and heterogeneous compute environments with an open
platform built for enterprise scale that runs anywhere. We call this Cloudera Machine Learning
(CML). CML provides a range of key features and benefits.4
While the McKinsey paper provides a clear outline of the shortcomings of existing Model Risk
Management (MRM) capabilities, we think a more critical element is missing in their analysis.
This would be the lack of alternative or challenger approaches to the historical data-dependent
algorithms that are foundational to most ML and AI approaches. What is required within this
context is a more out-of-the-box approach that incorporates simulation-based frameworks, such
as Agent-Based Models (ABMs), that fall within the wider AI group of capabilities. We will illus-
trate below (Section 7.4) how ABMs can play a critical supportive role in modelling situations
where traditional ML algorithms fail.
The manifestation of these shifts has resulted in what Gartner5 has termed the “Cloud Data
Ecosystem” where data management in the cloud has shifted from a developer-oriented “some
assembly required” focus to an enterprise-oriented, solution-based focus where enterprises may
run workloads across multiple Cloud Service Providers (CSPs), on-premise or in a hybrid model
utilising on-premise resources and cloud computing. Coupling these industry shifts with the
Gartner research, one arrives at the following core principles that a modern data architecture must
fulfil:
- Data must reside where it is best suited – on-premises or in the cloud – and workloads must be processed in the most efficient compute pools.
- There must be the ability to move workloads and data between compute environments – CSPs and on-prem.
- Management, security and governance for the entire multi-cloud ecosystem must be delivered through one management console.
Cloudera has continued to evolve its data management platform with these technological changes,
customer demands, and emerging “new” realities of increased data and workloads across on-prem,
hybrid and cloud in mind. The culmination of this work has been the creation of the Enterprise
Data Cloud industry segment.6
The Cloudera Data Platform (CDP) is an integrated platform allowing businesses of all sizes
to realise the benefits of an Enterprise Data Cloud. By simplifying operations, CDP reduces the
time to onboard new use cases across the organisation. It uses machine learning to intelligently
autoscale workloads up and down for a more cost-effective use of cloud infrastructure. CDP
manages data in any environment, including multiple public clouds, bare metal, private cloud
and hybrid cloud. With Cloudera’s Shared Data Experience (SDX), security and governance
capabilities are ensured across CDP, so IT can confidently deliver secure analytics, ML and AI
running against data anywhere. CDP is a new approach to enterprise data, anywhere from the
Edge to AI. The high-level architecture is depicted in Figure 7.1.7
This entire platform is delivered as a service, often referred to as a Platform as a Service (PaaS).
Our focus throughout this chapter will be on ML and AI and where appropriate we will identify
aspects of CDP supporting the discussed use cases.
the data is correct. Bringing together entities from multiple internal and external data sources in
real time or batch allows Quantexa to create a single entity view across an enterprise.
Quantexa is able to do this by using dynamic entity resolution. Dynamic entity resolution
relies on the ability to look at all of the data possible in a single relationship view. This involves
aggregating potentially billions of data points to assemble an investigative, predictive, preventative
and analytical single view. Quantexa does this in real time, making it accessible and applicable
for any use case.8
Figure 7.2 provides an example of how Quantexa is able to empirically identify entity-based
relationships. In this case, the single view of the customer is built with all the relevant and impor-
tant data parts that lead to a complete contextual understanding. This provides the ability to see
opportunities, threats and risks that each party presents to an institution in a quick and clear
manner. Being able to do this at scale requires architecting the business to run at a
microservices level, enabling data and analytics to be rapidly deployed to address an institution’s
most immediate or evolving needs.9
As described in Section 7.2, the agility and scalability of the Big Data platform (CDP) provide
this capability to make dynamic entity resolution at scale a reality. This is done by first having all
of the data accessible within a single source and then being able to leverage the scalable computa-
tional capabilities of CDP to support the analytics. Figure 7.3 provides a high-level architecture
overview of the Quantexa application running on the Cloudera platform.
Let’s take a moment to discuss the key aspects of the high-level architecture in Figure 7.3.10
Starting at the bottom, we have the data layer, where we leverage the Hadoop Distributed File
System (HDFS) to store the massive amounts of data required to have a complete understanding
of entity relationships; this may be a mix of structured and unstructured data. At this layer you
also see a variety of tools used to allow ad hoc discovery and investigation of the raw and processed
data: Hive (a data warehousing tool), Spark for data engineering and Oracle for traditional SQL-
like queries and transformations, along with Elasticsearch for a Google-like search experience over
the unstructured data.
Moving up the diagram, in the middle we see the Batch Processing and Mid Dynamic Tiers
and above them what is commonly referred to as the Access Layer. The Batch Processing and Mid
Dynamic Tiers are both representations of the types of technologies that are indirectly in use to
help satisfy the use cases of the Access Layer. For example, on the top left is Cloudera Machine
Learning, which is part of CDP. This provides a data scientist with an enterprise-grade advanced
machine learning and artificial intelligence environment utilising the latest technologies. On the
opposite side, the top-right corner, we see Cloudera Services with Apache Kafka; this is also part of
the CDP and is the de facto tool of choice across all industries for exposing data and/or ingesting
data in a stream.
Today, many financial institutions find that their traditional anti-money laundering (AML)
transaction monitoring systems are insufficient when relied on to detect risk in financial markets.
In short, they fail to enrich, connect or operationalise the various forms of data associated with
such markets, often causing investigators to miss suspicious patterns of behaviour that may exist
within this data. Similarly, these legacy approaches overburden investigators with exceptionally
high volumes of false-positive alerts, making the process of detecting risks related to financial
crime both inefficient and ineffective.11
The Quantexa solution outlined above offers an innovative ML- and AI-driven approach that
allows financial institutions to resolve related entities into a single view and make connections
between distinct entities through interactions and relationships to derive context, and this context
gives investigators a far more unified and accurate picture of where true risk lies. Furthermore,
advanced analytics at the transaction, entity and network level can support the generation of
risk-scored alerts, allowing analysts to prioritise and focus their efforts.
Step 1: Identifying hypothetical default and non-default loss scenarios (and a combination
of them) that may lead to resolution;
Step 2: Conducting a qualitative and quantitative evaluation of existing resources and tools
available in resolution;
Step 3: Assessing potential resolution costs;
Step 4: Comparing existing resources and tools to resolution costs and identifying any gaps;
Step 5: Evaluating the availability, costs and benefits of potential means of addressing any
identified gaps.
These five steps are to be utilised to outline the minimum requirements that a CCP must address
as part of a regular supervisory review process.
Two Cloudera partners, Deloitte and Simudyne, have developed a cloud-enabled ABM sim-
ulation model of CCPs on the CDP.15 As illustrated in Figure 7.4, CCPs have clearing members
which not only transact but are responsible for replenishing a default fund in the case of clearing
member defaults.
This model is able to capture many of the risk exposures of CCPs (e.g., concentration risk, liq-
uidity risk and wrong-way risk) which can have wide financial stability implications. This model
will also be able to directly support the FSB’s five-step risk and resiliency evaluation requirements.
The Deloitte-Simudyne ABM-based CCP model records all state changes for each agent and
the environment at every time step, as well as associated behaviour and interaction changes for
visualisation, analysis and quantification. This model captures three key components relevant to
the CCP structure.
1. Financial market simulator: It simulates a market environment for different types of clear-
ing members and end users, with evolving states in response to participants’ idiosyncratic
behaviours and interactions over time.
2. Margin call framework: It calculates initial margin and variation margin requirements for
clearing members portfolios and makes associated margin calls.
3. Default management framework: It simulates the default management protocols triggered
by a clearing member’s default.
Within this ABM structure, the market simulator, the margin call and default management pro-
cesses will synchronously lead to changes in the states of the Clearing Members (CMs) and the
environment. By simulating the path of these agents over time, the model captures emergent
behaviour and the impact of adaptive agents responding to different financial market environ-
ments. From hundreds of thousands of simulation runs, institutions can analyse the effects of
any potential changes in regulatory policy, CCP risk management practices, CCP default fund
changes, stressed market conditions or other circumstances on the CCP, the CMs, other CCPs as
well as the wider financial system.16
The model is built on the Simudyne platform and runs on Cloudera’s CDP. The Simudyne
platform provides an easy-to-use, highly customisable SDK, together with libraries of support
functions specifically optimised for ABMs. Architecturally, the Simudyne SDK uses Apache
Spark to distribute ABM simulations by performing calculations in parallel across a large num-
ber of processor nodes that are managed by CDP. It also has a user-friendly front-end console
built into the SDK, which greatly simplifies data and process visualisation. This is illustrated in
Figure 7.5.
The Cloudera platform provides a fast, easy and secure data platform that supports the models
to be run in a hybrid or multi-cloud environment. CDP is a key enabler for storing the massive
amount of data generated by the simulations. Taken together, the combined technologies of Simu-
dyne and Cloudera provide a highly scalable simulation environment, enabling seamless scaling
to millions of simulations, as well as storage of all information required for making decisions
based on deeper insights into the dynamics of the underlying behavioural relationships.
The Deloitte-Simudyne ABM CCP cloud-based solution is a notable example for regulators
and financial institutions to consider the use of ABMs as a complementary approach to traditional
approaches for quantifying potential future systemic risk events.
Figure 7.6 Bank of England’s IaaS cloud market share survey (January 2020).
We will use the results of the survey to inform and adjust our supervisory approach to cloud
oversight.
While a diverse list of operational resiliency concerns has been identified across many regulatory
publications, we believe the following six items reflect the most critical factors in evaluating
future systemic risk exposures.
1. Lack of unified data security and governance: Each cloud native product re-creates its own
silo of metadata making data management, security and governance much more complex.
Without a unified security and governance framework, institutions will be challenged to
identify, monitor and address crucial issues in data management that are critical for the
proper measurement of risk exposures across different platforms. This is especially true for
hybrid or multi-cloud environments.
2. Cyber attack resiliency: The consolidation of multiple organisations within one CSP
presents a more attractive target for cyber criminals than a single organisation.23 A fur-
ther complication is that Cloud security is a shared responsibility between the CSP and the
institution.
3. Vendor lock-in: The market share concentration of a small group of CSPs can result in
significant lock-in effects, whereby an institution is unable to easily change its cloud provider
either due to the terms of a contract, a lack of feasible alternatives, proprietary technical
features or high switching costs.
4. Operational resiliency: Much of the operational resiliency concern among regulators stems from the
“shared responsibility” model inherent in the relationship between a Cloud customer and
the CSP. Regulators have consistently made it clear that institutions at all times remain fully
responsible for all the operational functions they outsource to third-party providers. This
addresses the liability aspect but does not address the fundamental risk exposure that still
exists.
5. Lack of transparency: A CSP is unlikely to share detailed information about its processes,
operations and controls. This restricts not only an individual institution but also the regulator
from being able to fully ensure sufficient oversight. From a reporting perspective, the UK and
Luxembourg regulators require institutions to periodically report all functions outsourced
to the Cloud, alongside requiring pre-authorisation for migration of critical applications.
136 Richard L. Harmon and Andrew Psaltis
6. Cloud concentration risk: Regulators are concerned about institutions’ over reliance on
one service provider to support their banking services. This not only presents cloud oper-
ational risks for individual institutions but creates financial stability risks for the financial
system within a single country as well as globally. Concentration risks also arise if a sig-
nificant number of institutions have a key operational or market infrastructure capability
(e.g., payment, settlement and clearing systems) in a single CSP. For instance, there is abun-
dant research on the potential systemic risk exposures from CCPs and their default fund
structures but little discussion among regulators on Cloud Concentration Risk in these risk
assessments.
Specifically, with regard to the issue of Cloud Concentration Risk, we can segment this into two
distinct categories: Firm-Specific and Systemic Concentration Risk.24
Firm-specific concentration risks: These consist of risks due to cloud lock-in, a lack of uni-
fied data security and governance across CSPs, third-party operational resiliency concerns
such as auditability, multi-cloud controls and cyber security exposures.
Systemic concentration risks: These consist of risks that affect the stability of the financial
system. This includes a lack of transparency on what critical applications currently have
or will be migrated to a specific CSP. Regulators are also concerned about the systemic
risk of having a concentration of many large financial service firms’ critical application(s)
all residing on the same CSP. These include applications such as payment, settlement and
clearing systems.
This bifurcation of oversight complexities of Cloud Concentration Risk highlights the need for
the Financial Services Industry, the CSPs and Regulators to collaboratively work towards resolv-
ing these issues. Fortunately, as illustrated in Section 7.2, recent innovations in developing a com-
prehensive hybrid, multi-cloud architecture directly eliminate many of the regulatory concerns
around Vendor Lock-in dangers as well as the lack of a unified multi-cloud data security and
governance capability that help to address firm-specific Cloud Concentration Risks.
In addressing regulators’ overall concerns around operational resilience, institutions must
first specify the most important business functions that can impact financial stability risks. This
requires a careful mapping of the systems, facilities, people, processes and third parties that sup-
port those business services. From this, institutions need to identify how the failure of an individ-
ual system or process running in the cloud environment could impact the operations of a specific
business function and assess to what extent these systems or processes are capable of being sub-
stituted during disruption so that business services can continue to be delivered. Only when this
thorough mapping has been completed can the institution begin to assess the vulnerabilities and
concentration risk exposures that might result.
But this only addresses the operational risks that are specific to each institution. With the
current high level of CSP vendor concentration, any disruption of a key CSP has the poten-
tial under certain circumstances to trigger wider systemic impacts. For instance, the European
Systemic Risk Board’s (ESRB) systemic cyber risk study25 highlights a prominent type of inci-
dent effect whereby a “systemic cyber incident” could threaten financial stability. The key tipping
point in these circumstances would occur when confidence in the financial system was so severely
weakened that important financial institutions would cease all lending activity because they were
ML and AI perspective 137
no longer willing to lend, as opposed to being (technically) unable to lend. This is reflective of
the Lehman Brothers collapse on September 15, 2008 and the resulting impact across the wider
financial system.
Notes
References
Bank of England (2020, January). How reliant are banks and insurers on cloud outsourcing? Bank Over-
ground, January 17, 2020.
Basel Committee on Banking Supervision (2020, April). Progress in adopting the principles for effective
risk data aggregation and risk reporting. Bank for International Settlements.
Bookstaber, R., Paddrik, M. & Tivnan, B. (2018). An agent-based model for financial vulnerability. Journal
of Economic Interaction and Coordination, vol. 13, issue 2, 433–466.
Calvano, E., Calzolari, G., Denicolò, V. & Pastorello, S. (2019, December). Artificial intelligence, algorithmic pricing and collusion. SSRN Working Paper.
Cloudera (2020a). Cloudera machine learning. Cloudera Product Assets.
Cloudera (2020b). Cloudera data platform. Cloudera Product Assets.
European Systemic Risk Board (2020, February). Systemic cyber risk. ESRB.
Financial Stability Board (2017, November). Artificial intelligence and machine learning in financial ser-
vices: Market developments and financial stability implications. Financial Stability Board.
Financial Stability Board (2018, November). FSB 2018 resolution report: “Keeping the pressure up”. Finan-
cial Stability Board.
Financial Stability Board (2020, May). Guidance on financial resources to support CCP resolution and on
the treatment of CCP equity in resolution. Financial Stability Board Consultative Paper.
Gartner (2020, January). Cloud data ecosystems emerge as the new data and analytics battleground.
Gartner.
Harmon, R. (2018, May). Cloud concentration risk: Will this be our next systemic risk event? Cloudera
White Paper.
Harmon, R. (2020, June). Cloud concentration risk II: What has changed in the past 2 years? Cloudera
White Paper.
Hoque, I. (2019, November). The AI revolution is now: Why you should embrace AI decisioning. Quantexa
Blog.
InsideBigData (2019, August). What good is data without context? InsideBigData.Com.
Laurent, M., Plantefeve, O., Tejada, M. & Weyenbergh, F. (2020, May). Banking models after COVID-19:
Taking model-risk management to the next level. McKinsey Paper.
Psaltis, A. (2017). Streaming data understanding the real-time pipeline. Manning Publications.
Quantexa (2020, April). Situational awareness: Creating a contextual view to understand your commercial
customers in a changing world. Quantexa White Paper.
Strachan, D. (2019, September). Financial services on the cloud: The regulatory approach. Deloitte Blog.
Westin, S. & Zhang, T. (2020, April). Agent-based modelling for central counterparty clearing risk: CCP
Resilience - from one crisis to the next. Deloitte.
Chapter 8
8.1 Introduction
In the field of business, accounting is the area where the Information and Communication
Technology (ICT) tools and latest technologies were used for the first time (Kwilinski, 2019).
In accounting, the ICT tools and techniques were used to transform the manual accounting
process to automation. Later on, the analysis and interpretation parts were also included (Carr,
1985). At the initial stage, the adoption rate of ICT in accounting was very slow due to the
conservative attitude of accounting practitioners (Barras and Swann, 1984). In the early 1990s, as
the competition was skyrocketing, organizations were trying to improve operational efficiency and
to reduce expenses by replacing manual book-keeping and accounting process with automation
(Manson, 2001). In our daily life, we seldom find an area where ICT is not used. Organizations use ICT tools for simple tasks like arithmetic calculations and for complex tasks like programming, logit modeling and Enterprise Resource Planning (ERP). Deloitte uses ICT tools for its Visual Assurance system, while PricewaterhouseCoopers (PwC) uses them for its Risk Control Workbench.
Most large audit firms use computer technologies in their audit process, such as image processing, Electronic Funds Transfer (EFT) and Electronic Data Interchange (EDI), to ensure the accuracy of their data assessment and improve the quality of audit reports, audit judgment and overall audit services (Bell et al., 1998).
1. Audit of financial statements: auditors examine financial statements to verify the true and
fair view of financial position, results and cash flows.
2. Operational audits: auditors measure the performance of a unit or department of an
organization.
3. Compliance audits: auditors assess whether an organization is complying with the specific rules, regulations and procedures set by the relevant higher authority.
The success of an audit depends on appropriate planning, the participation of the concerned parties and smooth communication between the client and the auditor. The audit process begins when an auditor accepts an organization's engagement offer, made either to acquire a new client or to continue with an existing one. During the second stage, the auditor determines the volume of evidence and review required to provide assurance that the client's financial statements are free from material misstatement. During the third stage, the auditor tests the necessary supporting documents to evaluate internal control and the fairness of the financial statements. Finally, the auditor issues an opinion on the financial position after completing a rigorous audit procedure. The manual audit process is lengthy, time consuming, tedious and costly; moreover, there is a possibility of human error. Due to a lack of security, data may be copied, stolen or destroyed.
In an organization, decisions are taken using pieces of information which are processed through ICT-based tools. Nowadays, business organizations generate information using Artificial Intelligence, where no human intervention is required. AI is computer software that can behave like a human being and has the capacity to plan, learn and solve problems (Bakarich and O'Brien, 2020). In our daily life, we interact with AI in some form or another. Owing to its rapidly growing capabilities, most businesses adopt AI to stay competitive in today's challenging world. As audit is an indispensable part of business, the application of AI in the audit process has become an important research topic. This study extensively covers different aspects of AI in the audit process, such as the importance of using AI in the audit process; the steps to convert a manual audit process to an AI-based one; and the impact of an AI-based audit process on audit firms, businesses,
customers, users of financial reports and other relevant stakeholders. To date, no study has been found that comprehensively covers the implications of AI in the audit process for these different parties. This study bridges that gap and advances the existing literature by extending the scope to broader areas with relevant examples. Furthermore, the empirical interview-based study on the current inadequacies and future prospects of applying AI in the audit process in Bangladesh opens another avenue for future research.
The chapter continues as follows. Section 2 focuses on the literature review. Section 3 overviews the application of AI in auditing. Section 4 presents the framework required to implement AI in the audit process. Section 5 illustrates the digitalization process of audit activities and its impact on audit quality and audit firms. Section 6 cites a few real examples of AI applications in the audit process. Section 7 shows the current scenario of the application of AI in Bangladesh and Section 8 concludes the chapter.
to plan audit process and to manage risk assessment (Bell et al., 2002; Brown, 1991; Zhang and
Zhou, 2004). Auditors have become more concerned about audit quality and the acceptability of
their audit reports. Due to numerous accounting scandals in the last few decades, audit firms now face a serious image crisis, and it is expected that the application of blockchain and AI may be a way to restore lost confidence and reliability. This chapter focuses on how to apply AI in the audit process to ensure more transparency and to take every document into account so as to avoid unexpected errors.
Image recognition: This is a system that can recognize the age and the mental and physical condition of a person. In the healthcare industry, using computer vision and image recognition technology, robots record an informative timeline of patients to analyze their emotional state as they are admitted to, stay in and are discharged from hospital. Google uses this technology to search a person's records through its image search option (Choudhury, 2019).
The AI-based framework has three major components – namely, AI strategy, governance and the human factor – and seven elements – namely, cyber resilience, AI competencies, measuring performance, data quality, data architecture and infrastructure, ethics and the black box. The following section contains a brief introduction to the AI-based audit framework.
8.4.1 Components
8.4.1.1 AI strategy
An organization designs its own AI strategy according to its own requirements to exploit the
best benefits that AI can offer. Every organization has its own digital data strategy, and the AI
strategy will be an important addition to it. Organizations can perform much better if they tie
the AI strategy with their existing digital capability (Chui, 2017). An organization must fulfill the
following criteria to enable a successful AI strategy:
8.4.1.2 Governance
AI governance includes the procedures, structures and processes used to manage, direct and monitor an organization's AI-related activities, and it plays a vital role in achieving the desired goals. The structure and scope of governance depend on the nature and type of the organization. AI governance should be result-oriented and should comply with ethical, legal and social requirements. An organization should have a sufficient number of skilled and experienced people to oversee day-to-day AI-based activities.
8.4.2 Elements
8.4.2.1 Cyber resilience
Cyber resilience refers to the capacity of an organization to stay prepared for, respond to and recover from any unexpected cyber-attack. It is a continuous process and helps the organization keep the risk of losses due to cyber-attacks as low as possible. Cyber resilience broadly covers the following issues (EY, 2017):
8.4.2.2 AI competencies
AI competency means the ability of an organization to adopt AI-based operations. It includes software, hardware, skilled people, budget and the necessary support from peers and the approving authority.
8.4.2.6 Ethics
Designing AI for an organization should adhere to a few ethical principles. The people behind the design, development, decision processes and outcomes of AI should remain accountable. The values and norms of the user group should be given due importance. The AI process should be easily understandable and free from any sort of bias.
1. In the monitoring role, auditors reduce agency costs by controlling and improving the quality of the accounting information produced by managers, who act as agents, and hence limit managers' discretionary powers.
2. Shareholders and other stakeholders take important decisions based on the information generated by the managers. Auditors ensure the reliability and fairness of this information by discharging their information role.
3. In the insurance role, auditors ensure that the information which is transferred to other organizations truly expresses the financial position of the company and that there is nothing to hide. However, this cannot solve the problem of information asymmetry between shareholders and managers.
Under these circumstances, the principal role of an auditor is to assure investors that there is no risk in relying on the organization's information. Technology can help auditors make this job easier by accelerating the data processing rate and reducing human errors. A comprehensive and integrated Accounting Information System reduces the information asymmetry between managers and stakeholders and the risk associated with the transfer of information. Consequently, parts of the auditors' role may be taken over by technological developments. Jeacle (2017) and Andon et al. (2014) suggest that a new audit role will emerge and will expand into other areas of assurance such as e-commerce, social and environmental responsibilities, reliability of information systems, cyber security and performance measurement, although this depends on the legitimacy and ability of audit firms to adopt these new changes.
quality services to the clients. It helps to predict the possibility of bankruptcy and to detect alarming issues at clients (Cao et al., 2015). Although there are many benefits of using artificial intelligence or big data in an audit firm, safety and privacy should be given due importance to mitigate the risks related to cyber security.
Step 6: Installation
The process for implementing the AI model in practice should be specified, and the necessary guidelines on the involvement of different parties should be framed. The model should be rigorously reviewed by a qualified third party before final installation, and the final results need to be reconciled with the objectives very carefully.
Step 7: Monitoring
Monitoring measures the capacity of the organization to implement, maintain and supervise the AI model. It also compels the concerned parties to discharge their respective responsibilities and to maintain regulatory compliance while achieving the organization's goals in a socially and ethically responsible way.
8.6.1 KPMG
KPMG uses AI in its audit process for its reasoning competences. It executed a contract with IBM's Watson in March 2016 to provide this service. The technology uses Application Programming Interfaces (APIs) for a wide range of purposes, from document entity extraction to facial recognition (Lee, 2016). KPMG also draws on the advanced forecasting capabilities of auto racing firms like McLaren Applied Technologies (Sinclair, 2015).
8.6.2 Deloitte
Deloitte uses AI to enhance its cognitive capabilities in the audit process and create a base to
receive services from different vendors. According to the Chief Innovation Officer, Deloitte
applies AI for document review, inventory counts, confirmations, predictive risk analytics,
disclosure research and preparation of client request lists (Kokina and Davenport, 2017).
Deloitte partnered with Kira Systems to review complicated documents such as leases, contracts, employment agreements, invoices and other legal documents. The system can interact with human beings, and its capacity to extract relevant information improves over time (Whitehouse, 2015).
8.6.3 PwC
PwC uses “Halo” to analyze accounting journals. While a few journals are processed using human-supported business intelligence, the rest are handled by automated algorithms (PricewaterhouseCoopers, 2016).
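As an illustration of what “automated algorithms” applied to accounting journals can look like in practice, the following Python sketch flags journal entries with characteristics auditors commonly test for: postings at weekends or outside business hours, round amounts and amounts just below an assumed approval threshold. It is a minimal, hypothetical example and does not reproduce the actual logic of PwC's Halo or any other vendor tool; the field names, threshold and sample data are assumptions.

import pandas as pd

# Hypothetical journal-entry data; all field names and values are assumptions.
journals = pd.DataFrame({
    "entry_id": [1, 2, 3, 4],
    "posted_at": pd.to_datetime(["2016-03-07 10:15", "2016-03-12 23:40",
                                 "2016-03-14 09:05", "2016-03-19 02:30"]),
    "amount": [1200.55, 50000.00, 987.10, 9999.99],
    "posted_by": ["clerk_a", "cfo", "clerk_b", "clerk_a"],
})

# Simple red-flag tests: weekend or late-night postings, round amounts,
# and amounts just below an assumed 10,000 approval threshold.
weekend = journals["posted_at"].dt.dayofweek >= 5
after_hours = ~journals["posted_at"].dt.hour.between(8, 18)
round_amount = journals["amount"] % 1000 == 0
near_threshold = journals["amount"].between(9900, 9999.99)

journals["flagged"] = weekend | after_hours | round_amount | near_threshold
print(journals[journals["flagged"]])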
8.6.7 CohnReznick
CohnReznick is an advisory, assurance and tax consultancy firm that currently enjoys a competitive advantage by adopting AI technology to safeguard public funds at the local, state and federal levels. It also assists government agencies in enhancing their operating efficiency, improving service quality and ensuring the job satisfaction of employees. By applying AI, both CohnReznick and government agencies are improving public trust and ensuring transparency of responsibilities at different managerial and operational levels, taking every piece of data into consideration.
The Bangladesh government is moving ahead with its “Digital Bangladesh” mission to ensure transparency, justice, accountability, human rights, democracy and the delivery of government services to the people of Bangladesh by utilizing modern technology to improve day-to-day life. In line with this mission, different organizations are gradually adopting and implementing AI in their operations. Bangladesh has a population of more than 160 million, of which 34% are young and highly technology-oriented (Deowan, 2020). Bangladesh is striving to keep pace with the wave of technological development in different industrial sectors. Recent concepts – namely Big Data, Artificial Intelligence, the Internet of Things and Blockchain – are attracting considerable attention and becoming popular among Bangladeshi people; therefore, AI has an incredible future in Bangladesh.
To understand the current scenario and future prospects of applying AI in the audit process in Bangladesh, ten renowned accounting professionals and ten key officials from reputed organizations were interviewed. The summary of the interviews is presented below in three parts. The first part characterizes issues common to both audit firms and business organizations. The second part deals with issues related to audit firms and the third part focuses on business organizations.
8.8 Conclusion
Artificial intelligence is now extensively used in different areas of accounting and finance. Surprisingly, the implementation rate of this technology is very sluggish in the field of auditing. Big data technology has tremendous potential to change the way auditors now work. It can fundamentally change the mindset of users of financial statements by ensuring reliability, confidence, stability and fairness. The slow adoption rate of big data technology in auditing may put this sector under serious challenge. To familiarize the different parties with the concept of using big data in auditing, massive training programs need to be launched and the curricula of different courses and programs should be updated. Accounting and auditing standards should also be revised with the necessary guidelines on how to adopt and use technology in the auditing process. Combining big data technology with traditional audit techniques and expert judgment ensures rigorous analytical procedures and helps to prepare quality reports. Despite the many benefits, a lack of objectivity, insufficient studies on the existing system and human bias may create obstacles to achieving the desired goals. There should be sufficient transparency in developing an AI-based system, as “black box” processing remains hidden from auditors, the regulatory authority, the government and even the person who inputs the data into the system. A transparent process is a pre-requisite for taking sensitive decisions and giving judgments on different audit affairs. This chapter comprehensively focused on the technical aspects of AI, the benefits and challenges of using AI in the audit process, the conversion of a manual to an AI-based audit process in a few simple steps and the impact of the digitalization of the audit process on the different concerned parties, citing a few real examples. It also delineated the prospects and challenges of applying AI in a
developing market economy with reference to Bangladesh. In the opinion of expert professionals and successful business people, the use of AI in the audit process is stalled by the absence of sufficient groundwork and by the orthodox mindset of the top management of business organizations, and its adoption may take several years. This study used the interview method to collect data because of the nationwide lockdown during the COVID-19 pandemic. In future, researchers may apply other methods of data collection and can examine the cost-benefit of implementing AI in the audit process for both audit firms and business organizations.
Bibliography
Abdolmohammadi, M. J. (1999). A comprehensive taxonomy of audit task structure, professional rank and
decision aids for behavioral research. Behavioral Research in Accounting, 11, 51.
Agnew, H. (2016). Auditing: Pitch battle. Financial Times. Retrieved from https://www.ft.com/content/
268637f6-15c8-11e6-9d9800386a18e39d
Andon, P., Free, C., & Sivabalan, P. (2014). The legitimacy of new assurance providers: Making the cap fit.
Accounting, Organizations and Society, 39(2), 75–96.
Arnold, V., Collier, P. A., Leech, S. A., & Sutton, S. G. (2004). Impact of intelligent decision aids on expert
and novice decision-makers’ judgements. Accounting and Finance, 44, 1–26.
Bakarich, K. M., & O’Brien, P. (2020). The robots are coming... But aren’t here yet: The use of artifi-
cial intelligence technologies in the public accounting profession. Journal of Emerging Technologies in
Accounting. doi:10.2308/JETA-19-11-20-47
Baldwin, A. A., Brown, C. E., & Trinkle, B. S. (2006). Opportunities for artificial intelligence develop-
ment in the accounting domain: The case for auditing. Intelligent Systems in Accounting, Finance &
Management: International Journal, 14(3), 77–86. doi:10.1002/isaf.277
Baldwin-Morgan, A. A., & Stone, M. F. (1995). A matrix model of expert systems impacts. Expert Systems
with Applications, 9(4), 599–608.
Barras, R., & Swann, J. (1984). The adoption and impact of information technology in the UK accountancy
profession. Technical Change Centre.
Bell, T. B., Bedard, J. C., Johnstone, K. M., & Smith, E. F. (2002). KRiskSM: A computerized decision
aid for client acceptance and continuance risk assessments. Auditing: A Journal of Practice & Theory,
21(2), 97–113.
Bell, T. B., & Carcello, J. V. (2000). A decision aid for assessing the likelihood of fraudulent financial
reporting. Auditing: A Journal of Practice and Theory, 19(1), 169–182.
Bell, T. B., Knechel, W. R., Payne, J. L., & Willingham, J. J. (1998). An empirical investigation of the
relationship between the computerisation of accounting system and the incidence and size of audit
differences. Auditing: A Journal of Practice and Theory, 17 (1), 13–26.
Brennan, B., Baccala, M., & Flynn, M. (2017). Artificial intelligence comes to financial statement audits. CFO.
com (February 2). Retrieved from http://ww2.cfo.com/auditing/2017/02/artificial-intelligence-audits
Brown, C. E. (1991). Expert systems in public accounting: Current practice and future directions. Expert
Systems with Applications, 3(1), 3–18.
Brown-Liburd, H., Issa, H., & Lombardi, D. (2015). Behavioral implications of Big Data’s impact on audit
judgment and decision making and future research directions. Accounting Horizons, 29(2), 451–468.
Cao, M., Chychyla, R., & Stewart, T. (2015). Big Data analytics in financial statement audits. Accounting
Horizons, 29(2), 423–429.
Carlson, E. D. (1978). An approach for designing decision support systems. ACM SIGMIS Database: The
DATABASE for Advances in Information Systems, 10(3), 3–15. doi:10.1145/1040730.1040731
Carr, J. G. (1985). Summary and conclusions. IT and the accountant. Aldershort: Gower Publishing Company
Ltd/ACCA.
Chan, T. R. (2018). Chinese police are using facial-recognition glasses to scan travelers. Independent.
Retrieved from https://www.independent.co.uk/news/world/asia/china-police-facial-recognition-
sunglasses-security-smart-tech-travellers-criminals-a8206491.html
Chen, Q., Jiang, X., & Zhang, Y. (2019). The effects of audit quality disclosure on audit effort and invest-
ment efficiency. The Accounting Review, 94(4), 189–214.
Choudhury, A. (2019), 8 uses cases of image recognition that we see in our daily lives. Analytics India Mag-
azine. Retrieved from https://analyticsindiamag.com/8-uses-cases-of-image-recognition-that-we-see-
in-our-daily-lives/
Chui, M. (2017). Artificial intelligence: The next digital frontier? McKinsey and Company Global Institute,
47, 3–6.
Deowan, S. A. (2020). Artificial intelligence: Bangladesh perspective, The Business Standard. Retrieved from
https://tbsnews.net/tech/artificial-intelligence-bangladesh-perspective-44017
Dillard, J. F., & Yuthas, K. (2001). A responsibility ethic for audit expert systems. Journal of Business Ethics,
30(4), 337.
EY. (2017). Cyber resiliency: Evidencing a well-thought-out strategy. EYGM Limited. Retrieved
from https://www.ey.com/Publication/vwLUAssets/EY-cyber-resiliency-evidencing-a-well-thought-
out-strategy/$FILE/EY-cyber-resiliency-evidencing-a-well-thought-out-strategy.pdf
Ferreira, C., & Morais, A. I. (2020). Analysis of the relationship between company characteristics and key
audit matters disclosed. Revista Contabilidade & Finanças, 31(83), 262–274.
Fowler, N. (2013) Samwer brothers’ Global Founders Capital co-invests millions in Wonga rival Kred-
itech. Retrieved from https://web.archive.org/web/20130423160012/http://venturevillage.eu/oliver-
samwers-global-founders-capital-kreditech
Gillett, P. R. (1993). Automated dynamic audit programme tailoring: An expert system approach. Auditing: A Journal of Practice and Theory, 12(2), 173–189.
Green, B. P., & Choi, J. H. (1997). Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice and Theory, 16(1), 14–28.
Hayes, R., Wallage, P., & Gortemaker, H. (2014). Principles of auditing: An introduction to international
standards on auditing. Pearson Higher Ed.
Issa, H., Sun, T., & Vasarhelyi, M. A. (2016). Research ideas for artificial intelligence in auditing: The
formalization of audit and workforce supplementation. Journal of Emerging Technologies in Accounting,
13(2), 1–20. doi:10.2308/jeta-10511
Jeacle, I. (2017). Constructing audit society in the virtual world: The case of the online reviewer. Accounting,
Auditing & Accountability Journal, 30(1), 18–37.
Kazim, E., & Koshiyama, A. (2020). A Review of the ICO’s Draft Guidance on the AI Auditing Framework.
Available at SSRN: https://ssrn.com/abstract=3599226
Kim, D., Song, S., & Choi, B. Y. (2017). Data deduplication for data optimization for storage and network
systems. Springer International Publishing.
Kogan, A., Alles, M. G., Vasarhelyi, M. A., & Wu, J. (2014). Design and evaluation of a continuous data
level auditing system. Auditing: A Journal of Practice & Theory, 33(4), 221–245.
Kokina, J., & Davenport, T. H. (2017). The emergence of artificial intelligence: How automation is chang-
ing auditing. Journal of Emerging Technologies in Accounting, 14(1), 115–122.
Krahel, J. P., & Titera, W. R. (2015). Consequences of Big Data and formalization on accounting and
auditing standards. Accounting Horizons, 29(2), 409–422.
Kwilinski, A. (2019). Implementation of blockchain technology in accounting sphere. Academy of Account-
ing and Financial Studies Journal, 23, 1–6.
Lee, D. (2016). KPMG recruits IBM Watson for cognitive tech audits, insights. Accounting Today
(March 8). Retrieved from http://www.accountingtoday.com/news/kpmg-recruits-ibm-watson-for-cognitive-tech-audits-insights
Macaulay, M. T. (2016). Financial Executive, How cognitive tech is revolutionizing the audit. Business Source
Premier, 32, 18–24.
Manson, S., McCartney, S., & Sherer, M. (2001). Audit automation as control within audit firms. Account-
ing, Auditing and Accountability Journal, 14(1), 109–130.
Montes, G. A., & Goertzel, B. (2019). Distributed, decentralized, and democratized artificial intelligence.
Technological Forecasting and Social Change, 141, 354–358.
Nigrini, M. (2019). The patterns of the numbers used in occupational fraud schemes. Managerial Auditing
Journal, 34(5), 606–626. doi:10.1108/MAJ-11-2017-1717
Porter, M. E., & Heppelmann, J. E. (2014). How smart, connected products are transforming competition.
Harvard Business Review, 92(11), 64–88.
PricewaterhouseCoopers. (2016). Halo for Journals. Retrieved from http://halo.pwc.com/
Rapoport, M. (2016). Auditing firms count on technology for backup. Wall Street Journal.
Samarajiva, R. (2020). Is it time for a policy on artificial intelligence? The Daily Star. Retrieved from
https://www.thedailystar.net/opinion/perspective/news/it-time-policy-artificial-intelligence-1733191
Sinclair, N. (2015). How KPMG is using Formula 1 to transform audit. CA Today (October 27). Available
at: https://www.icas.com/catoday-news/kpmg-and-formula-one-big-data
Srinivasan, V. (2016). Will financial auditors become extinct?. In Venkat Srinivasan (ed.), The Intelligent
Enterprise in the Era of Big Data, 171–183. New York: Wiley. doi:10.1002/9781118834725.ch7
Van den Broek, T., & van Veenstra, A. F. (2018). Governance of big data collaborations: How to balance
regulatory compliance and disruptive innovation. Technological Forecasting and Social Change, 129,
330–338.
Viaene, S., Derrig, R. A., Baesens, B., & Dedene, G. (2002). A comparison of state-of-the-art
classification techniques for expert automobile insurance claim fraud detection. Journal of Risk and
Insurance, 69(3), 373–421.
Wallace, W. A. (2004). The economic role of the audit in free and regulated markets: A look back and a
look forward. Research in Accounting Regulation, 17, 267–298.
Whitehouse, T. (2015). The technology transforming your annual audit. Compliance Week (December 1).
Retrieved from https://www.complianceweek.com/news/news-article/the-technology-transforming-your-annual-audit#.WGQkebYrLPA
Yoon, K., Hoogduin, L., & Zhang, L. (2015). Big Data as complementary audit evidence. Accounting
Horizons, 29(2), 431–438.
Zhang, J., Yang, X., & Appelbaum, D. (2015). Toward effective Big Data analysis in continuous auditing.
Accounting Horizons, 29(2), 469–476.
Zhang, Y., Xiong, F., Xie, Y., Fan, X., & Gu, H. (2020). The impact of artificial intelligence and blockchain
on the accounting profession. IEEE Access, 8, 110461–110477.
Zhao, N., Yen, D. C., & Chang, I. C. (2004). Auditing in the e-commerce era. Information Management & Computer Security, 12(5), 389–400.
Chapter 9
9.1 Introduction
The stability of the financial system is a long-term challenge for governments, regulators and aca-
demics. The 2007–2009 financial crisis proved stability’s importance and highlighted the weak-
nesses of the financial regulatory system around the world which was not able to avoid the failures
and losses generated by the turbulence in the banking industry. Regulators, policy makers and
academics learnt many lessons from this period and have attempted to fix the identified weak-
nesses. One of these areas is market discipline. There are different definitions of this term and one
of them is a mechanism to use by market participants to discipline risk-taking by financial institu-
tions. Basel supervisors stressed this mechanism’s importance in the Basel II architecture (BCBS,
2006) by the introduction of Pillar 3 components that complemented the other two pillars: Pil-
lar 1 – minimum risk-based capital requirements and other quantitative requirements; Pillar 2 –
supervisory review processes. At that time, it aimed to provide meaningful regulatory informa-
tion to market participants on a consistent basis and to be able to assess a bank’s risk. However,
the 2007–2009 financial crisis additionally exposed some weaknesses in Basel II and also with
Pillar 3. In Basel III (a revision of Basel II after the financial crisis), Pillar 3 started to become
more detailed, structured and with information more frequently disclosed. Lengthy discussion
among regulators and stakeholders on the new content, structure and frequency of information
disclosure just confirmed some authors’ opinions that it would have been more accurate to label
Pillar 3 “information disclosure” rather than “market discipline” (Flannery & Bliss, 2019). Above
all, in Europe the goals of global financial stability and cross-border banking are to be achieved through centralization in the European Central Bank (Miklaszewska & Pawłowska, 2014). All in all, the current version of Pillar 3 is highly standardized, with limited room for banks regarding the flexibility of the content, structure and frequency of reporting. Due to the high standardization and low differentiation of Pillar 3 disclosure reporting, for some banks it can be a very costly process that does not incentivize their stakeholders to behave in a manner that would discipline the banks. This is the case for banks whose liabilities consist mainly of insured deposits. According to research findings (Flannery & Bliss, 2019), insured depositors have no incentive to spend resources on monitoring. In addition, they are probably not sophisticated enough to interpret bank results correctly. Other specific cases are banks that are majority-owned (often more than 90%) by a single foreign bank or financial group and whose key stakeholders are uninsured depositors. These stakeholders have an interest in information about
banks to be able to monitor the bank’s risk. To cover their requirements, it is important to estab-
lish which information they are particularly interested in and furthermore to design the most
effective structure regarding both content and time points of disclosed information. This study
analyses the interest in information disclosures aimed at a specific type of stakeholders (deposi-
tors) in foreign-owned commercial banks, not traded on the capital market. These types of banks
can represent a “model” in CEE countries where many commercial banks have similar ownership
and liability structures. Stakeholders’ behaviour related to Pillar 3 disclosed information can be
expected to be similar in these countries.
The main goal is twofold: first, to assess depositors' interest in two groups of disclosed information (Pillar 3 disclosure requirements and Pillar 3 related information) during the period 2009–2012, i.e. the crisis year and the subsequent years; second, to conduct robustness checks by verifying the results with two approaches based on different time variables (week and quarter during 2009–2012).
The study has the following structure: the first section reviews the state of market discipline research; the methodology of the research is introduced in the second section; the results are dealt with in the third section, where the outcomes are compared based on the different time frameworks; finally, the discussion and conclusion are presented in the last section.
the financial markets. However, Cubillas et al. (2012) conclude that while market discipline is weakened by banking crises and their policy implications, regulations and interventions strengthen market discipline. Although the outcomes of disclosures as a market disciplining tool can differ, they are important for the functioning of market discipline mechanisms and as a background for market discipline enhancement. Furthermore, recapitalization and forbearance have negative effects on market discipline, whereas less supervisory power and more private ownership and supervision of banks have the opposite effect. Generally, the efficiency of market discipline is important, mainly during crises, due to the stronger risk-taking incentives, which can be eliminated by market discipline (Nier & Baumann, 2006).
The market discipline framework is an important part of the concept of market discipline. This framework may represent a functional system and is a key component of modern banking regulation (Bartlett, 2012). However, this is true only when the following four blocks are in perfect
coherence (Stephanou, 2010): information disclosure, market participants, discipline mechanism
and internal governance. Moreover, the interaction of these four blocks influences the market
discipline effectiveness, which depends on the enhancement of accurate and timely financial dis-
closures (Jagtiani & Lemieux, 2001) and market disclosure of private information that penetrates
the market (Berger & Davies, 1998).
There is no doubt that disclosures are one of the most effective tools for the enhancement of
market discipline (Fonseca & González, 2010) and serve as a macro-prudential tool in reducing
uncertainty in the capital markets during a financial crisis (Ellahie, 2012; Peristiani, Morgan &
Savino, 2010). Moreover, Sowerbutts et al. (2013) conclude that disclosures’ mechanism failure
contributed to the last financial crisis, because of inadequate public disclosure that was followed
by the inability of investors to judge risk and the withdrawal of lending in times of systemic stress.
Therefore, a few studies have been reviewed in which the authors concentrate on the factors of
disclosures, which contribute to the market discipline efficiency. Most of the authors concen-
trate on the impact of the increase of information disclosures on commercial banks. According
to Bouaiss et al. (2017), an increase in disclosure levels enhances transparency and efficient market discipline and facilitates the supervision of excessive risk-taking. It positively influences investors' attitudes to banks' risk profiles and actively increases banks' value (Zer, 2015). Furthermore, it increases the
ability to attract interbank funding (Guillemin & Semenova, 2018), boosts depositors’ sensitiv-
ity to equity levels (Kozłowski, 2016), improves sensitivity to risk-taking (Goldstein & Sapra,
2014) and prevents market breakdown. Additionally, an increase in disclosures is connected with
a reduction of risk-taking by commercial banks (Naz & Ayub, 2017) and with a lower probability
of default (Li, Li & Gao, 2020). However, it also implies the potential threat of disclosing too
much information, which destroys risk-sharing opportunities (Goldstein & Leitner, 2015).
The nature of the disclosure content as a base for adequate, accurate and timely disclosures
depends on the level of transparency, which is connected to the enhancement of market dis-
cipline. Accordingly, the bank stability and the probability of falling into crisis are influenced
by transparency that increases accountability and leads to greater market efficiency (Gandrud &
Hallerberg, 2014; Nier, 2005) and enables banks to raise cheaper capital (Frolov, 2007). However,
Moreno and Takalo (2016) conclude that only an intermediate level of transparency is socially optimal and effective (Bouvard, Chaigneau & Motta, 2015). The key explanation for this finding is that more transparency can decrease efficient liquidity and increase rollover risk. This development has a negative impact on the share prices of banks, and the cost of trading corporate bonds can decrease (Parwada, Lau & Ruenzi, 2015). This is in line with Iren et al. (2014),
who conclude that transparency can have a positive impact on bank performance only up to a certain level.
After the last financial crisis, market discipline became a foundation for stable financial markets, and its implementation through regulatory requirements was preceded by a complex discussion process, which has led to significant changes and improvements to Pillar 3. The Pillar 3 disclosure requirements enhance the efficiency of market discipline in order to achieve a resilient banking
system. Numerous studies evaluate Pillar 3 disclosures as a market disciplining tool and the major-
ity of authors concentrate on the benefits of Pillar 3 disclosures as an effective market disciplining
tool. First, Pillar 3 improves the safety of the banking system (Vauhkonen, 2012) and decreases
information asymmetry (Niessen-Ruenzi, Parwada & Ruenzi, 2015), and its quarterly reporting is
useful to investors (Parwada, Ruenzi & Sahgal, 2013). Second, banks’ adequate disclosures (Pillar
3 and annual reports) have a significant effect on market risk-taking behaviour and can minimize
risks (Sarker & Sharif, 2020). Third, the similarity of Pillar 3 regulation and COREP (Common
Reporting Standards (COREP) used in Europe) reports leads to a positive relationship between
these regulations and market discipline effectiveness (Yang & Koshiyama, 2019). However, a few authors concentrate on Pillar 3 weaknesses. Concretely, the implementation of additional regulations is effective in improving financial reporting quality, but differences arise for smaller banks, which face disproportionately higher increases in the costs of compliance, while big banks can engage in higher risk acceptance (Poshakwale, Aghanya & Agarwal, 2020). Correspondingly, Freixas and Laux (2011) cast doubt on the transparency of Pillar 3 reports, mainly because of events that occurred during the financial crisis. Moreover, Pillar 3 involves an excessive superstructure and monitoring costs (Benli, 2015), which can bring a range of issues.
The content of Pillar 3 disclosures is also an important factor in any assessment of the effi-
ciency of Pillar 3 as a market disciplining tool. According to Giner et al. (2020), the most relevant
categories of Pillar 3 disclosures are credit risk, liquidity risk and market risk category (Scannella,
2018). This is in conjunction with Bischof et al. (2016), who conclude that the improved content
of Pillar 3 disclosures translates into higher market liquidity. Scannella and Polizzi (2019) concen-
trated on improvement in the disclosure of derivatives and hedging strategies, which is important
for the enhancement of the interest of the market participants. Additionally, IT risk has emerged as one of the key risk categories that are positively correlated with a future decrease in a firm's stock price (Song, Cavusoglu, Lee & Ma, 2020).
However, the review of related work suggests a lack of studies assessing stakeholders' interest in the content of Pillar 3 information disclosures in commercial banks, which is crucial in order to implement effective supervisory market discipline.
First, the sensitivity of stakeholders to negative content in disclosures is validated by a few authors: such content can trigger inefficient bank runs (Faria-e-Castro, Martinez & Philippon, 2017), and qualitative information disclosed in a negative way contributes to explaining bank insolvency risk and increases the probability of default (Del Gaudio, Megaravalli, Sampagnaro & Verdoliva, 2020). Second, Araujo and Leyshon (2016) revealed a window into the content relevancy of disclosures in relation to banks' risk profiles, because stakeholders are most responsive to information related to the value of the bank's assets, off-balance sheet items and ratings. These authors also highlight important factors influencing the efficiency of Pillar 3 content, such as the overall quality of risk disclosures and the fact that quantitative information is valued more than qualitative information.
Research in the CEE region focused on the analysis of disclosures is rare. A few authors
concentrate on their weaknesses (Bartulovic & Pervan, 2012). According to Arsov and Bucen-
ska (2017), disclosures in CEE countries lack transparency and independent verification of the
data in the presented reports (Habek, 2017). Moreover, for some financial institutions in Poland,
there is a lack of resources strongly related to disclosures (Fijalkowska, Zyznarska-Dworczak &
Garsztka, 2017). But it is clear that research in the field of Pillar 3 disclosures in CEE countries
is even rarer. Despite Matuszak and Rozanska's interesting study (Matuszak & Rozanska, 2017), which points out the benefits of Pillar 3 for the performance of commercial banks in Poland (a positive relationship between Pillar 3 disclosure and banks' profitability measured by ROA and ROE), research that would evaluate the content relevancy of these disclosures is insufficient.
As has been already stated in the introduction, depository markets are important in CEE
countries. Research studies identify a few challenges these markets cope with. According to Dis-
tinguin et al. (2012), one of them is a high correlation between the level of interbank deposits and the risk of the bank: a higher proportion of interbank deposits in the bank's balance sheet means a lower level of risk in this region. They also conclude that explicit deposit insurance (implemented in the 1990s) contributed to effective market discipline in CEE. Moreover, Karas et al. (2013) discovered that in Russia the introduction of deposit insurance for households made the flows of insured deposits less sensitive than those of uninsured ones. According to the researchers, uninsured depositors exercise stronger market discipline than insured ones. However, Lapteacru (2019) suggests that external support has no impact on non-deposit funding for any type of bank ownership. Hasan et al. (2013) present a relevant finding for banks owned by foreign investors (which is the case for many CEE banks): depositors react more strongly to negative rumours about the banks' parent companies than to the banks' own disclosures. This is in line with Accornero and Moscatelli (2018), who concluded that depositors of threatened banks tend to be more sensitive than other depositors to negative rumours. Substantially, it can be agreed with Berger and Bouwman (2013) that depository discipline research in Europe is insufficient and, based on the findings, is even rarer in CEE countries. It can also be agreed with the authors (Miklaszewska & Pawłowska, 2014) who questioned the post-crisis regulatory architecture, especially for CEE banks in a competitive and unstable environment, because it may produce negative effects on relatively stable CEE banks. Based on these findings, this study aims to cover the gap in the field of adequate and relevant Pillar 3 disclosures, which would also contribute to higher stakeholder interest and enhance the efficiency of market discipline.
better understand the risk. The log file consists of around two million logged accesses that were
obtained after data preparation. Data preparation is very important, as was proved in previous experiments (Drlik & Munk, 2019; Munk, Benko, Gangur & Turčáni, 2015; Munk, Drlik & Vrabelova, 2011; Munk, Kapusta & Švec, 2010; Munk, Kapusta, Švec & Turčáni, 2010), where poor data preparation was shown to lead to different results from the analysis.
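As a rough illustration of what such data preparation can involve, the Python sketch below parses simplified access-log lines, removes requests for static resources and robot traffic, and groups the remaining accesses into user sessions using a 30-minute inactivity timeout. The log format, example paths and timeout are assumptions made for the sake of the example; they do not reproduce the bank's actual log structure or the authors' exact preparation steps.

import re
from datetime import datetime, timedelta

LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) HTTP/\d\.\d"')
STATIC = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

def parse(lines):
    """Yield (ip, timestamp, path), skipping static resources and robots."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue
        ip, ts, path = m.groups()
        if path.lower().endswith(STATIC) or path == "/robots.txt":
            continue
        yield ip, datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z"), path

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group each IP's accesses into sessions separated by the inactivity timeout."""
    sessions, last_seen = {}, {}
    for ip, ts, path in sorted(records, key=lambda r: r[1]):
        if ip not in last_seen or ts - last_seen[ip] > timeout:
            sessions.setdefault(ip, []).append([])   # start a new session
        sessions[ip][-1].append(path)
        last_seen[ip] = ts
    return sessions

sample = ['1.2.3.4 - - [05/Jan/2009:10:01:00 +0100] "GET /pillar3/quarterly HTTP/1.1"',
          '1.2.3.4 - - [05/Jan/2009:10:05:00 +0100] "GET /annual-reports HTTP/1.1"']
print(sessionize(parse(sample)))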
The chapter deals with a comparison of two different approaches to model the behaviour
of web users. Both approaches deal with the time variable as an indicator of the analysis. The
first approach deals with the analysis of weekly accesses to web categories of banking portal. The
multinomial logit model is used to analyse the data. The second approach deals with the evalua-
tion of frequent itemsets based on quantity. The itemsets were evaluated based on quarters. Both
approaches deal with the period 2009–2012. The year 2009 represents the year of the financial crisis, whereas the years 2010–2012 represent the years after the financial crisis. The investigated
categorical dependent variable is a variable category that represents a group of web parts that deal
with a similar issue. The variable contains these categories of the web content: Business Condi-
tions, Pricing List, Pillar3 related, Reputation, Pillar3 disclosure requirements and We support. In this
experiment, the focus will be on two of these categories that are related to the topic of Pillar3:
Pillar3 related and Pillar3 disclosure requirements. The Pillar3 related category consists of parts:
Rating, Group, Information for Banks, Annual Reports, General Shareholder Meeting, Finan-
cial Reports and Emitent Prospects. The Pillar3 disclosure requirements category consists of the parts Pillar3 Semiannually Info and Pillar3 Quarterly Info. A detailed analysis of the web part frequent itemsets of
the Pillar3 category (Pillar3 disclosure requirements, Pillar3 related) in the respective quarters in
the examined period was carried out in Munk, Pilkova, Benko & Blažeková (2017). The applied
methodology has a similar data preparation phase for both approaches:
created by binarization. The next nominal variable quarterYear served to make the quarters
and specific years distinct. Similarly, dummy variables representing the specific quarter and
year (2009Q1, 2009Q2, etc.) were created.
After data preparation, the experiment, divided into two approaches, is conducted. One method is focused on the time variable week and the other on the time variable quarter. The first analysis is conducted using the multinomial logit model. After determining the model, it is necessary to identify the type of dependence in order to determine the degree of the polynomial and to select the predictors, including the dummy variables. The first approach was carried out as follows:
6. the estimation of the model parameters $a_j$ and $\mathbf{b}_j$ by maximizing the logarithm of the likelihood function. The STATISTICA Generalized Linear Models module was used to estimate the parameters for the individual values.
7. the estimation of the logits $\ell_{ij}$ for all predictor values: $\ell_{ij} = a_j + \mathbf{x}_i^T \mathbf{b}_j$, $j = 1, 2, \ldots, J-1$.
8. the estimation of the probability of accesses $\pi_{iJ}$ in time $i$ for the reference web category $J$: $\pi_{iJ} = \frac{1}{1 + \sum_{j=1}^{J-1} e^{\ell_{ij}}}$.
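A minimal sketch of this estimation in Python is given below, using the open-source statsmodels library in place of STATISTICA and randomly generated accesses in place of the real log data. The data layout (one row per logged access with the week index, the year and the visited category) and the rescaling of the week variable for numerical stability are assumptions made purely for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "week": rng.integers(0, 54, n),
    "year": rng.choice([2009, 2010, 2011, 2012], n),
    "category": rng.choice(["Pillar3_related", "Pillar3_disclosure",
                            "Pricing_List", "Other"], n),
})

# Week rescaled to [0, 1] for numerical stability, plus its square and cube,
# and dummy variables for the examined years (2012 is the reference year).
w = df["week"] / 53.0
X = sm.add_constant(pd.DataFrame({
    "week": w, "week2": w ** 2, "week3": w ** 3,
    "y2009": (df["year"] == 2009).astype(float),
    "y2010": (df["year"] == 2010).astype(float),
    "y2011": (df["year"] == 2011).astype(float),
}))

# Step 6: maximum-likelihood estimation of the multinomial logit parameters.
y = pd.Categorical(df["category"]).codes      # category with code 0 = reference
model = sm.MNLogit(y, X).fit(disp=False)

# Step 7: logits for every observation; step 8: access probabilities.
logits = X.values @ np.asarray(model.params)  # one column per non-reference category
probs = model.predict(X)                      # rows sum to 1 across all categories
print(model.summary())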
The model was afterwards evaluated using alternative methods of assessing the estimated model (Munk, Drlik et al., 2011; Munk, Vrábelová & Kapusta, 2011). The evaluation consisted of the visualization of the differences between observed and expected counts, the identification of extremes, the comparison of the distribution of the observed relative counts of accesses with the estimated probabilities of the examined web part j in time i, and the visualization of the observed and expected logits of each web part except the reference web part. If the model is suitable at all of these levels, then it is a suitable model for the analysed data.
The second approach dealt with discovering the behaviour patterns of web users during the quarters of the examined period. The results were processed by association rule analysis using STATISTICA Sequence and Association Analysis, which is an implementation of the a priori algorithm together with a tree-structured procedure that requires only one pass through the data (Hill & Lewicki, 2013). The aim was to extract frequent itemsets with a minimum support of 0.01 (Pilkova, Munk, Švec & Medo, 2015). After extracting the frequent itemsets of web parts in the identified sessions, the interest lies in comparing the proportion of their incidence in the quarters of the examined years. This can be summed up as follows:
The results of the two different approaches will be compared based on the specified time variable.
It can be assumed that the results should be similar and that the weekly analysis should offer a
more detailed look at the behaviour of the web users in comparison to the quarterly analysis.
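For readers who want to reproduce the second approach with open-source tools, the sketch below extracts frequent itemsets with the same minimum support of 0.01 using the a priori implementation in the mlxtend library instead of STATISTICA Sequence and Association Analysis. The session data and web part names are invented for illustration.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Each identified session is treated as a transaction of visited web parts.
sessions = [
    ["Pillar3 Quarterly Info", "Annual Reports"],
    ["Rating", "Financial Reports", "Pillar3 Semiannually Info"],
    ["Annual Reports", "Pillar3 Quarterly Info"],
    ["Pricing List"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(sessions), columns=te.columns_)

# Extract frequent itemsets with the minimum support of 0.01 used in the chapter.
frequent = apriori(onehot, min_support=0.01, use_colnames=True)
print(frequent.sort_values("support", ascending=False))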
9.4 Results
Frequent itemsets (a) were extracted and probabilities of the accesses (b) were examined of web
portal categories (category) based on time where time was represented by variables: (a) quarter and
(b) week. The data saved in the log file originated from a significant domestic commercial bank
operating in Slovakia. The examined log file was pre-processed, and variables were created that
represented the analysed factors. First, it was important to determine whether it was significant to
distinguish the individual years (variable year). In the case of the nominal variable year, a moderate
degree of dependency with the variable category (Chi-square = 389 844.7; df = 15; p = 0.000;
Contingency coefficient C = 0.4; Cramer V = 0.3) was identified. The contingency coefficient can
obtain values from 0 (represents no dependence between variables) to 1 (represents the perfect
dependence between variables). The contingency coefficient is statistically significant. Based on
these results, dummy variables were created representing the examined years (2009, 2010 and 2011). These variables take only two values, 0 or 1, indicating whether the access was made in the specific year. A dummy variable for the year 2012 is not needed, as accesses from this year have all the other year dummies set to 0.
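The two preparatory steps just described can be sketched in Python as follows, with randomly generated accesses standing in for the real log file: a chi-square test of the dependence between year and category (with the contingency coefficient and Cramer's V derived from it) and year dummies with 2012 as the reference year. The sketch is illustrative only and will not reproduce the statistics reported above.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "year": rng.choice([2009, 2010, 2011, 2012], 1000),
    "category": rng.choice(["Pillar3_related", "Pillar3_disclosure",
                            "Pricing_List", "Other"], 1000),
})

table = pd.crosstab(df["year"], df["category"])
chi2, p, dof, _ = chi2_contingency(table)
n = table.values.sum()
contingency_c = np.sqrt(chi2 / (chi2 + n))                # 0 = no dependence
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.1f}, df={dof}, p={p:.3f}, C={contingency_c:.2f}, V={cramers_v:.2f}")

# Year dummies: 2012 is the reference, so its accesses have all dummies equal to 0.
dummies = pd.get_dummies(df["year"], prefix="y").drop(columns=["y_2012"])
print(dummies.head())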
Based on the Likelihood-ratio (LR) test (Table 9.1), estimates of the theoretical counts of
accesses were compared with the empirical counts of accesses. The results of the LR test helped
to identify the appropriate polynomial model of the third degree for the time variable week. The value of the Pearson Chi-square statistic relative to its degrees of freedom approximates 1, which means that the chosen model is suitable. The maximized logarithm of the likelihood function also helps in choosing the appropriate model, the model with the smallest negative log-likelihood being the best.
The STATISTICA Generalized Linear Models module was used to estimate the parameters for the individual data. The significance of the parameters was examined using the Wald test. The probability of access to the web portal categories was modelled as depending on the time (week) of access and the years. Time was represented by the predictor week and its transformations based on the degree of the polynomial (week² and week³), together with the dummy variables of the examined years (2009, 2010 and 2011).
Based on the results of the test of all effects for the model (Table 9.2), the parameters are statistically significant. In the created model, all of the years, represented by the dummy variables, are statistically significant features. The weeks of the year, represented by the variable week and its transformations based on the degree of the polynomial, are also statistically significant. The estimated parameters for both categories were significantly dependent on the week of access and its transformations as well (Table 9.3). The values of the logits were significantly influenced by the examined years. The logit model provides a probability estimate at the output. The absolute size of a parameter reflects the influence of the corresponding predictor on the examined variable: a high absolute value indicates a strong dependency, and a negative value indicates an inversely proportional dependence.
Using the estimated parameters, it was possible to evaluate the logits for each category j in
time i. The third-degree polynomial model is:
$\ell_{ij} = a_j + \beta_{1j}\,week_i + \beta_{2j}\,week_i^2 + \beta_{3j}\,week_i^3 + \gamma_j\,year_i, \qquad i = 0, 1, 2, \ldots, 53, \quad j = 1, 2, \ldots, J-1$
The evaluation of the suitability of the model was conducted. The importance of thorough data preparation can be shown in the following example. The log file contained a large amount of unnecessary data that was not discovered during the data preparation phase. The evaluation of the theoretical and empirical counts of accesses helped to identify this issue. A systematic error was identified as having occurred during a specific week in 2012: an automated script that could be related to maintenance, backup, etc. This was identified by examining the extreme values of
differences between the theoretical and empirical counts of accesses. As can be seen in Figure 9.1,
during the 21st week of the year 2012, there was a high extreme value (extreme value border is
depicted using the dashed line). This led to a more detailed analysis of the analysed log file and
finding the issue.
After cleaning the unnecessary data from the log file, the evaluation was repeated, and the difference is shown in Figure 9.2. All of the estimated parameters mentioned above were obtained using the corrected log file, but it is worth mentioning that issues with the data preparation phase can sometimes be discovered only at the end of the model evaluation.
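A rough sketch of this kind of check is shown below: observed weekly counts are compared with the counts expected under the model, and weeks whose differences lie beyond a simple interquartile-range boundary are flagged as extremes. The counts and the 1.5-IQR rule are assumptions made for illustration, not the exact extreme-identification procedure used by the authors.

import numpy as np

observed = np.array([410, 395, 388, 402, 399, 2950, 405, 391])   # counts per week
expected = np.array([400, 400, 400, 400, 400, 400, 400, 400])    # model estimates

diff = observed - expected
q1, q3 = np.percentile(diff, [25, 75])
iqr = q3 - q1
upper, lower = q3 + 1.5 * iqr, q1 - 1.5 * iqr

extreme_weeks = np.where((diff > upper) | (diff < lower))[0]
print("weeks with extreme differences:", extreme_weeks)   # flags the spike at index 5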
A way to show the suitability of the model is also to evaluate theoretical and empirical logits.
The idea is whether the estimated theoretical logits fit (model) the empirical logits calculated from the empirical relative counts of accesses, $h_{ij} = \ln\frac{p_{ij}}{p_{iJ}}$, $j = 1, 2, \ldots, J-1$, where $p_{ij}$ is the empirical relative count of access to the web category $j$ in time $i$ and $p_{iJ}$ is the empirical relative
count of access to the referential web category J in time i. The visualization of observed and
expected logits of each of the examined categories (except the referential web category) can show
how the theoretical logits model the empirical logits. Based on the visualization for the critical
year 2012 (Figures 9.3 and 9.4), it was seen that after the new data cleansing the theoretical logits
fit the empirical logits better.
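The comparison of empirical and theoretical logits can be sketched as follows; the values are invented and serve only to show how the empirical logits h_ij = ln(p_ij / p_iJ) would be computed from the relative counts and plotted against the logits produced by the model.

import numpy as np
import matplotlib.pyplot as plt

# Observed relative counts of accesses for one category j and the reference J.
p_ij = np.array([0.19, 0.18, 0.17, 0.16, 0.17, 0.18])
p_iJ = np.array([0.25, 0.26, 0.27, 0.27, 0.26, 0.25])
empirical_logit = np.log(p_ij / p_iJ)

# Theoretical logits from the estimated third-degree polynomial model (invented).
theoretical_logit = np.array([-0.28, -0.36, -0.45, -0.50, -0.42, -0.33])

weeks = np.arange(len(p_ij))
plt.plot(weeks, empirical_logit, "o", label="empirical")
plt.plot(weeks, theoretical_logit, "-", label="theoretical")
plt.xlabel("week"); plt.ylabel("logit"); plt.legend(); plt.show()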
The plot (Figure 9.5) shows the visualization of probabilities of access to the market discipline-
related web categories (Pillar3) during the year 2009. This year is taken as the year of financial
crisis. It can be seen that the highest access during this year was to the web category Pillar3 related
at the beginning of the year (the 0th week has the value 0.193; note that this week contains days from both the previous year and the current year – the turn of the years). The lowest estimated values were identified later in the year (the 38th week has the value 0.160). The most interest in the category Pillar3 disclosure requirements was again at the turn of the years, but this time at the end of the year 2009 (the 52nd week has the value 0.100). For this category, the lowest access occurred in the same week as for the other category (the 38th week has the value 0.050). By studying both categories, it can be observed that both are interesting for stakeholders
at the beginning of the year and then the interest lowers, whereas at the end of the year it starts
to rise again. This can be analysed in more detail also using the other method by extracting the
frequent itemsets of the web categories. The asterisks contained in the plot (Figure 9.5) represent
the homogenous groups for occurrence of frequent itemsets of the web categories for the year
2009.
Figure 9.3 Logit visualization of the model for the year 2012 with error data.
The null hypothesis is rejected at the 5% significance level (df = 3, Q = 8.258, p < 0.05) for
the quarters of the year 2009. The most frequent itemsets were identified in the first quarter (63.64%) and the fewest in the third quarter (38.64%). In the year 2009, two homogenous groups (2009 Q3, 2009 Q4, 2009 Q2) and (2009 Q4, 2009 Q2, 2009 Q1) were identified (Figure 9.5) based on the average occurrence of the extracted frequent itemsets of the web parts. These results confirm the weekly analysis for this year, where the web users' interest in the Pillar3 categories was highest at the beginning of the year and lowest in the third quarter (the 38th week falls at the end of the third quarter, and a statistically significant difference at the 5% significance level was identified between 2009 Q1 and 2009 Q3).
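The quarterly comparisons reported here appear to be Cochran's Q tests applied to the binary occurrence of each extracted itemset across the four quarters; this interpretation is an assumption rather than something stated explicitly above. A small self-contained Python sketch with made-up occurrence indicators is given below.

import numpy as np
from scipy.stats import chi2

def cochrans_q(x):
    """Cochran's Q test for k related binary samples; x has shape (n_items, k)."""
    x = np.asarray(x)
    k = x.shape[1]
    col, row, total = x.sum(axis=0), x.sum(axis=1), x.sum()
    q = (k - 1) * (k * (col ** 2).sum() - total ** 2) / (k * total - (row ** 2).sum())
    df = k - 1
    return q, df, chi2.sf(q, df)

rng = np.random.default_rng(3)
# Rows = extracted itemsets, columns = quarters Q1..Q4; 1 means the itemset
# reached the minimum support in that quarter (values are invented).
occurrence = rng.integers(0, 2, size=(44, 4))
q, df, p = cochrans_q(occurrence)
print(f"Q = {q:.3f}, df = {df}, p = {p:.3f}")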
The plot (Figure 9.6) shows the visualization of probabilities of access to the market discipline-related web categories (Pillar3) during the year 2010. This year is taken as the year after the financial crisis. The highest access during this year to the web category Pillar3 related was at the beginning of the year (the third week has the value 0.211). The lowest estimated values were identified at the end of the year, in the last quarter (the 43rd week has the value 0.161). The greatest interest in the category Pillar3 disclosure requirements was towards the end of the first quarter (the 11th week has the value 0.116). The lowest access for this category was later in the year (the 40th week has the value 0.058). Studying both categories, it can be observed that both are of interest to the stakeholders at the beginning of the year, after which the interest lowers, whereas at the end of the year it starts to rise again.
Figure 9.4 Logit visualization of the model for the year 2012 with corrected data.
Now the weekly results
will be compared with the frequent itemsets. The asterisks contained in the plot (Figure 9.6)
represent the homogenous groups for occurrence of frequent itemsets of the web categories for
the year 2010. The null hypothesis is rejected at the 1% significance level (df = 3, Q = 12.581, p < 0.01) for the quarters of the year 2010. The highest occurrence of frequent itemsets was identified in the first quarter (40.91%) and the lowest in the third and fourth quarters (20.45%–22.73%). In the year 2010, three homogenous groups (2010 Q4, 2010 Q3), (2010 Q3, 2010 Q2) and (2010 Q2, 2010 Q1) were identified (Figure 9.6) based on the average occurrence of extracted frequent itemsets of the web parts. The greatest interest of the web users in the Pillar3 categories was at the beginning of the year and the lowest in the last quarters; statistically significant differences at the 5% significance level were identified between 2010 Q1 and 2010 Q3/2010 Q4 and between 2010 Q2 and 2010 Q4.
The plot (Figure 9.7) shows the visualization of probabilities of access to the market discipline-
related web categories (Pillar3) during the year 2011. This year can be taken as the second year
after the financial crisis. The highest access during this year to the web category Pillar3 related was again at the beginning of the year (the fifth week has the value 0.211). The lowest estimated values were identified in the same week as in the previous year, with an even lower value (the 43rd week has the value 0.151).
Figure 9.5 Probability visualization of market discipline-related categories during the year 2009.
The greatest interest in the category Pillar3 disclosure requirements also fell in the same week as in the previous year, but with a slightly higher value (the 11th week has the value 0.132). The lowest access for this category was also in the same week as in the previous year, with an almost identical value (the 40th week has the value 0.059). Studying both categories, it can be observed that the behaviour is similar to the previous year with only small deviations. Now the
weekly results will be compared with the frequent itemsets. The asterisks contained in the plot
(Figure 9.7) represent the homogenous groups for occurrence of frequent itemsets of the web
categories for the year 2011. The null hypothesis is rejected at the 1% significance level (df = 3, Q = 11.539, p < 0.01) for the quarters of the year 2011. The second quarter contained the highest occurrence of frequent itemsets (38.64%) and the first and third quarters the lowest (18.18%–20.45%). In the year 2011, two homogenous groups (2011 Q1, 2011 Q3, 2011 Q4) and (2011 Q4, 2011 Q2) were identified (Figure 9.7) based on the average occurrence of extracted frequent itemsets of the web parts. There is a small difference from the previous years: based on the quarters, the highest interest is now in the second quarter, but the weekly analysis shows that it lies at the boundary between the periods.
The plot (Figure 9.8) shows the visualization of probabilities of access to the market discipline-
related web categories (Pillar3) during the year 2012. This year can be taken as one of the years after
the financial crisis. The highest access during this year was to the web category Pillar3 related and
has shifted more towards the end of the first quarter of the year (the 11th week has the value 0.135).
Figure 9.6 Probability visualization of market discipline-related categories during the year 2010.
The lowest estimated values were identified in the same period as in the previous two years (the 44th week has the value 0.067). The greatest interest in the category Pillar3 disclosure requirements was almost the same as in the previous year (the 12th week has the value 0.107). The lowest access for this category stabilized in the same period (the 41st week has the value 0.034). Studying both categories, it can be observed that the behaviour has stabilized: interest rises at the beginning of the year, with the highest interest in the Pillar3 information at the end of the first quarter; subsequently, interest decreases to its lowest at the beginning of the fourth quarter and then starts to rise again. It can also be observed that this year shows lower interest in both categories in comparison to the previous years. In the case of frequent itemsets, statistically significant differences for the year 2012 were not found (df = 3, Q = 4.154, p = 0.2453). Statistically significant differences were also not found for any of the following years (2013: df = 3, Q = 3.255, p = 0.3539; 2014: df = 3, Q = 4.565, p = 0.2066; 2015: df = 3, Q = 3.001, p = 0.3916), so the weekly analysis for these years was not performed. It can be said that the trend for the years after the crisis is similar to the years 2010, 2011 and 2012.
The first quarters of the years 2009–2010, during the global financial crisis, had a significant impact on the quantity of identified frequent itemsets of the web parts. This can also be seen in the weekly analysis, where the first quarters attracted the highest interest in the examined Pillar3 web categories. On this basis, it can be concluded that the required quarterly publication of the results is not necessary for market discipline.
Figure 9.7 Probability visualization of market discipline-related categories during the year 2011.
It would be enough to publish this
information annually, ideally in the early weeks of the year. To obtain these results, two different
approaches with various time variables were used. Both approaches evaluate the behaviour of the
users in time (mainly seasonality): (1) modelling the probabilities of access to the portal in time,
and (2) quantitative evaluation of the incidence of frequent itemsets in time. The results of the two approaches match and can therefore be regarded as robust. The combination of these methods strengthened the results and helped to better understand the behaviour of the stakeholders with respect to the Pillar3 information in the web categories.
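The quarterly comparisons above report a Q statistic with df = 3; one test of this form is Cochran's Q test on the binary incidence of frequent itemsets across the four quarters. The minimal Python sketch below illustrates such a test on invented incidence data, as an assumption about the kind of test involved rather than a reproduction of the authors' procedure.

```python
import numpy as np
from scipy.stats import chi2

# Invented incidence matrix: rows = candidate itemsets, columns = the four quarters
# (1 = the itemset was found frequent in that quarter, 0 = it was not).
rng = np.random.default_rng(1)
X = (rng.random((44, 4)) < [0.64, 0.50, 0.39, 0.45]).astype(int)

k = X.shape[1]              # number of quarters (treatments)
col_totals = X.sum(axis=0)  # occurrences per quarter
row_totals = X.sum(axis=1)  # occurrences per itemset
N = X.sum()

# Cochran's Q statistic, approximately chi-square with k - 1 degrees of freedom
Q = k * (k - 1) * np.sum((col_totals - N / k) ** 2) / (k * N - np.sum(row_totals ** 2))
p_value = chi2.sf(Q, df=k - 1)
print(f"Q = {Q:.3f}, df = {k - 1}, p = {p_value:.4f}")
```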
Figure 9.8 Probability visualization of market discipline-related categories during the year 2012.
far as their ownership structure, business model and funding are concerned. Usually, the impact of this situation is, on one side, very low interest of stakeholders in information that is not in line with their needs and, on the other side, high costs for banks related to the preparation of enormous amounts of disclosed but unused information. One market that might suffer due to the current Pillar 3 architecture is the depositors' market, specifically in CEE countries, where ownership by a single foreign bank or group is very frequent. Therefore, in this study the interest of depositors in two groups of disclosed information was assessed: Pillar 3 disclosure requirements and Pillar 3 related information, during the period 2009–2012 (the year of the crisis and subsequent years). The analysis was based on visits of the stakeholders to the web portal of commercial banks, analysing their interest in relation to time spent on the web pages (time variables) and in relation to the events of the financial crisis in 2009. The analysis of time spent on categories of the web portal was based on weekly accesses and on the quantity of frequent itemsets (based on quarters). The findings are as follows:
1. The results of the analysis during and after the crisis suggest that stakeholders have expressed
higher interest in Pillar 3 related information (such as annual reports, financial reports, rating, group, general shareholder meeting, issuer prospectuses) than in Pillar 3 disclosure requirements.
2. The highest interest of stakeholders in disclosed information was in the year of the crisis and
subsequently steadily decreased.
3. The results of the analysis on the year of the financial crisis (weekly and quarterly analysis)
have shown that in 2009 the highest interest was in the first quarter, at the beginning of the
year (exceptionally 52nd week for Pillar 3 disclosure requirements in 2009), and was lowest
in the third quarter for both categories.
4. In the analysis of the years 2010–2012, the years after the financial crisis, similar results have been identified, with the highest interest being at the end of the first quarter, at the boundary between periods (exceptionally the third week for Pillar 3 related in 2010 and the fifth week for Pillar 3 related in 2011), and the lowest interest being identified in the fourth quarter.
It is important to note that interest decreased generally during 2010–2012 in comparison
to 2009.
5. The results are in line with Munk et al. (2017), whose results show that the studied CEE commercial bank's stakeholders are particularly interested in Pillar 3 disclosures in the first quarter and that interest in disclosures decreased after the turbulence of 2009.
6. The results also suggest that, given the significant impact of the first quarter, in which stakeholder interest was highest (as also validated by the weekly analysis), quarterly disclosures seem less important for market discipline effects than annual disclosures. Annual disclosures attract higher interest than Pillar 3 disclosures and ideally should be published at the beginning of the year.
The results presented above suggest that changes in the design of information disclosures in commercial banks operating according to the analysed model are necessary to enhance the efficiency of market discipline mechanisms and to add value for key stakeholders (depositors).
This chapter's conclusion is consistent with the study by Miklaszewska and Pawłowska (2014, p. 264), which analysed CEE banks' perspectives in depth. Despite the complex regulatory and supervisory model currently applied in the EU, their conclusion is that it may not have produced the more efficient and stable banking system required, particularly in CEE countries, which have very competitive banking environments.
Moreover, we agree with Kuranchie-Pong, Bokpin and Andoh (2016) that stakeholders in the banking industry are expected to use market discipline to make risk management more effective, but that to assess a bank's risk profile they need sufficient and relevant information disclosures.
Finally, according to the European Banking Authority (EBA), disclosing a sufficient risk profile of financial institutions to the markets is essential to ensure their correct functioning, to create trust between market participants and to support the efficiency of market discipline. The principles for adequate disclosures are clarity, meaningfulness, consistency over time and comparability across institutions, also in times of stress. Open issues remain about the nature of, and potential impediments to, disclosures that fulfil these principles, and the authors hope that these findings and conclusions may also contribute to resolving these issues.
Due to some limitations of this research, and according to the findings, an analysis of the
interest of stakeholders in the content of Pillar 3 disclosures has been identified as a future research
topic.
Acknowledgements
This work was supported by the Scientific Grant Agency of the Ministry of Education of the
Slovak Republic (ME SR) and Slovak Academy of Sciences (SAS) under the contract no. VEGA-
1/0776/18 and by the scientific research project of the Czech Sciences Foundation Grant No.
19-15498S.
Disclosure statement
The authors declare that they have no conflict of interest.
References
Accornero, M., & Moscatelli, M. (2018). Listening to the Buzz: Social media sentiment and retail depositors’
trust. SSRN Electronic Journal. doi:10.2139/ssrn.3160570
Arsov, S., & Bucevska, V. (2017). Determinants of transparency and disclosure – evidence
from post-transition economies. Economic Research-Ekonomska Istraživanja, 30(1), 745–760.
doi:10.1080/1331677X.2017.1314818
Bartlett, R. (2012). Making banks transparent. Vanderbilt Law Review, 65, 293–386. Retrieved from
http://scholarship.law.berkeley.edu/facpubs/1824
Bartulovic, M., & Pervan, I. (2012). Comparative analysis of voluntary internet financial reporting for
selected CEE countries. Recent Researches in Applied Economics and Management, 1(1), 296–301.
Benli, V. F. (2015). Basel’s forgotten pillar: The myth of market discipline on the forefront of Basel III.
Financial Internet Quarterly, 11(3), 70–91.
Berger, A. N., & Bouwman, C. H. S. (2013). How does capital affect bank performance during financial
crises? Journal of Financial Economics, 109(1), 146–176. doi:10.1016/j.jfineco.2013.02.008
Berger, A. N., & Davies, S. M. (1998). The information content of bank examinations. Journal of Financial
Services Research, 14(2), 117–144. doi:10.1023/A:1008011312729
Bischof, J., Daske, H., Elfers, F., & Hail, L. (2016). A tale of two regulators: Risk disclosures, liquidity, and
enforcement in the banking sector. SSRN Electronic Journal. doi:10.2139/ssrn.2580569
Bouaiss, K., Refait-Alexandre, C., & Alexandre, H. (2017). Will bank transparency really help financial
markets and regulators? Retrieved from https://hal.archives-ouvertes.fr/hal-01637917
Bouvard, M., Chaigneau, P., & Motta, A. de. (2015). Transparency in the financial system: Rollover risk
and crises. The Journal of Finance, 70(4), 1805–1837. doi:10.1111/jofi.12270
Calomiris, C. W. (2009). Bank regulatory reform in the wake of the financial crisis.
Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining world wide web browsing
patterns. Knowledge and Information Systems, 1(1), 5–32.
Cubillas, E., Fonseca, A. R., & González, F. (2012). Banking crises and market discipline: International
evidence. Journal of Banking & Finance, 36 (8), 2285–2298. doi:10.1016/J.JBANKFIN.2012.04.011
de Araujo, P., & Leyshon, K. I. (2016). The impact of international information disclosure requirements
on market discipline. Applied Economics, 49(10), 954–971. doi:10.1080/00036846.2016.1208361
Del Gaudio, B. L., Megaravalli, A. V., Sampagnaro, G., & Verdoliva, V. (2020). Mandatory disclosure tone
and bank risk-taking: Evidence from Europe. Economics Letters, 186, 108531. doi:10.1016/j.econlet.
2019.108531
Distinguin, I. (2008). Market discipline and banking supervision: The role of subordinated debt. SSRN
Electronic Journal. doi:10.2139/ssrn.1098252
Distinguin, I., Kouassi, T., & Tarazi, A. (2012). Interbank deposits and market discipline: Evidence from
Central and Eastern Europe. SSRN Electronic Journal, 41(2), 544–560. doi:10.2139/ssrn.2119956
Distinguin, I., Rous, P., & Tarazi, A. (2006). Market discipline and the use of stock market data to predict
bank financial distress. Journal of Financial Services Research, 30(2), 151–176. doi:10.1007/s10693-
0016-6
Drlik, M., & Munk, M. (2019). Understanding time-based trends in stakeholders’ choice of learning activity
type using predictive models. IEEE Access, 7, 3106–3121. doi:10.1109/ACCESS.2018.2887057
Ellahie, A. (2012). Capital market consequences of EU bank stress tests. SSRN Electronic Journal.
doi:10.2139/ssrn.2157715
Evanoff, D. D., & Wall, L. D. (2000). Subordinated debt as bank capital: A proposal for regulatory
reform. Economic Perspectives, (Q II), 40–53. Retrieved from https://ideas.repec.org/a/fip/fedhep/
y2000iqiip40-53nv.25no.2.html
Faria-e-Castro, M., Martinez, J., & Philippon, T. (2017). Runs versus lemons: Information disclosure and fiscal
capacity. Cambridge, MA. doi:10.3386/w21201
Fijalkowska, J., Zyznarska-Dworczak, B., & Garsztka, P. (2017). The relation between the CSR and the
accounting information system data in Central and Eastern European (CEE) countries – The evidence
of the polish financial institutions. Journal of Accounting and Management Information Systems, 16 (4),
490–521.
Flannery, M. J., & Bliss, R. R. (2019). Market discipline in regulation: Pre-and post-crisis. Forthcoming,
Oxford Handbook of Banking 3e.
Fonseca, A. R., & González, F. (2010). How bank capital buffers vary across countries: The influence of
cost of deposits, market power and bank regulation. Journal of Banking & Finance, 34(4), 892–902.
doi:10.1016/J.JBANKFIN.2009.09.020
Freixas, X., & Laux, C. (2011). Disclosure, transparency and market discipline. CFS Working Paper, 11, 1–39.
Retrieved from https://www.ifk-cfs.de/fileadmin/downloads/publications/wp/2011/11_11.pdf
Frolov, M. (2007). Why do we need mandated rules of public disclosure for banks? Journal of Banking
Regulation, 8(2), 177–191. doi:10.1057/palgrave.jbr.2350045
Gandrud, C., & Hallerberg, M. (2014). Supervisory transparency in the European banking union. Bruegel
Policy Contribution, (2014/01). Retrieved from https://www.econstor.eu/handle/10419/106314
Giner, B., Allini, A., & Zampella, A. (2020). The value relevance of risk disclosure: An analysis of the
banking sector. Accounting in Europe. doi:10.1080/17449480.2020.1730921
Goldstein, I., & Leitner, Y. (2015). Stress tests and information disclosure. No 15-10, Working Papers. Fed-
eral Reserve Bank of Philadelphia. Retrieved from https://econpapers.repec.org/paper/fipfedpwp/15-
10.htm
Goldstein, I., & Sapra, H. (2014). Should banks’ stress test results be disclosed? An analysis of the costs and
benefits. Foundations and Trends® in Finance, 8(1), 1–54. doi:10.1561/0500000038
Guillemin, F., & Semenova, M. (2018). Transparency and market discipline: Evidence from the Rus-
sian Interbank Market. Higher School of Economics Research Paper No. WP BRP 67/FE/2018, 32.
doi:10.2139/ssrn.3225061
Habek, P. (2017). CSR reporting practices in Visegrad group countries and the quality of disclosure. Sus-
tainability, 9(12), 1–18.
Hadad, M. D., Agusman, A., Monroe, G. S., Gasbarro, D., & Zumwalt, J. K. (2011). Market disci-
pline, financial crisis and regulatory changes: Evidence from Indonesian banks. Journal of Banking
& Finance, 35(6), 1552–1562. doi:10.1016/j.jbankfin.2010.11.003
Hasan, I., Jackowicz, K., Kowalewski, O., & Kozłowski, Ł. (2013). Market discipline during crisis: Evi-
dence from bank depositors in transition countries. Journal of Banking & Finance, 37 (12), 5436–5451.
doi:10.1016/j.jbankfin.2013.06.007
Hill, T., & Lewicki, P. (2013). Electronic Statistics Textbook. StatSoft Inc. Retrieved from
http://www.statsoft.com/textbook/
Iren, P., Reichert, A. K., & Gramlich, D. (2014). Information disclosure, bank performance and
bank stability. International Journal Banking, Accounting and Finance, 5(4), 39. Retrieved from
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2874144
Jagtiani, J., Kaufman, G., & Lemieux, C. (1999). Do markets discipline banks and bank holding compa-
nies? Evidence from debt pricing. Emerging Issues, (Jun). Retrieved from http://econpapers.repec.org/
article/fipfedhei/y_3a1999_3ai_3ajun_3an_3asr-99-3r.htm
Jagtiani, J., & Lemieux, C. (2001). Market discipline prior to bank failure. Journal of Economics and
Business, 53(2), 313–324. Retrieved from http://econ.tu.ac.th/archan/Chalotorn/on%mkt%failure/
jagtiani.pdf
Jordan, J. S., Peek, J., & Rosengren, E. S. (2000). The market reaction to the disclosure of supervi-
sory actions: Implications for bank transparency. Journal of Financial Intermediation, 9(3), 298–319.
doi:10.1006/jfin.2000.0292
Kapusta, J., Munk, M., & Drlik, M. (2012a). Cut-off time calculation for user session identification by ref-
erence length. In 2012 6th International Conference on Application of Information and Communication
Technologies, AICT 2012 - Proceedings. doi:10.1109/ICAICT.2012.6398500
Kapusta, J., Munk, M., & Drlik, M. (2012b). User session identification using reference length. In Capay,
M., Mesarosova, M., & Palmarova, V. (Eds.), DIVAI 2012: 9th International Scientific Conference on
Distance Learning in Applied Informatics: Conference Proceedings, Sturovo, Slovakia (pp. 175–184).
Karas, A., Pyle, W., & Schoors, K. (2013). Deposit insurance, banking crises, and market discipline: Evi-
dence from a natural experiment on deposit flows and rates. Journal of Money, Credit and Banking,
45(1), 179–200. doi:10.1111/j.1538-4616.2012.00566.x
Kozłowski, Ł. (2016). Cooperative banks, the internet and market discipline. Journal of Co-Operative Orga-
nization and Management, 4(2), 76–84. doi:10.13140/RG.2.1.3768.6809
Kuranchie-Pong, L., Bokpin, G. A., & Andoh, C. (2016). Empirical evidence on disclosure and risk-
taking of banks in Ghana. Journal of Financial Regulation and Compliance, 24(2), 197–212.
doi:10.1108/JFRC-05-2015-0025
Lapteacru, I. (2019). Do bank activities and funding strategies of foreign and state-owned banks have a dif-
ferential effect on risk-taking in Central and Eastern Europe? Economics of Transition and Institutional
Change, 27 (2), 541–576. doi:10.1111/ecot.12185
Li, Y., Li, C., & Gao, Y. (2020). Voluntary disclosures and peer-to-peer lending decisions: Evidence from the
repeated game. Frontiers of Business Research in China, 14(1), 1–26. doi:10.1186/s11782-020-00075-5
Matuszak, L., & Rozanska, E. (2017). An examination of the relationship between CSR disclosure and
financial performance: The case of polish banks. Journal of Accounting and Management Information
Systems, 16 (4), 522–533. Retrieved from https://econpapers.repec.org/RePEc:ami:journl:v:16:y:2017:
i:4:p:522–533
Miklaszewska, E., & Pawłowska, M. (2014). Do safe banks create safe systems? Central and Eastern Euro-
pean banks’ perspective. Revue de l’OFCE, 132(1), 243–267. doi:10.3917/reof.132.0243
Moreno, D., & Takalo, T. (2016). Optimal bank transparency. Journal of Money, Credit and Banking, 48(1),
203–231. doi:10.1111/jmcb.12295
Munk, M., Benko, L., Gangur, M., & Turčáni, M. (2015). Influence of ratio of auxiliary pages on
the pre-processing phase of Web Usage Mining. E+M Ekonomie a Management, 18(3), 144–159.
doi:10.15240/tul/001/2015-3-013
Munk, M., Drlik, M., & Vrabelova, M. (2011). Probability modelling of accesses to the course activities
in the web-based educational system. In Computational Science and Its Applications - Iccsa 2011, Pt V
(Vol. 6786, pp. 485–499).
Munk, M., Kapusta, J., & Švec, P. (2010). Data preprocessing evaluation for web log mining: Recon-
struction of activities of a web visitor. In Procedia Computer Science (Vol. 1, pp. 2273–2280).
doi:10.1016/j.procs.2010.04.255
Munk, M., Kapusta, J., Švec, P., & Turčáni, M. (2010). Data advance preparation factors affecting results
of sequence rule analysis in web log mining. E+M Ekonomie a Management, 13(4), 143–160.
Munk, M., Pilkova, A., Benko, L., & Blažeková, P. (2017). Pillar 3: Market discipline of the key stakeholders
in CEE commercial bank and turbulent times. Journal of Business Economics and Management, 18(5),
954–973. doi:10.3846/16111699.2017.1360388
Munk, M., Pilková, A., Drlik, M., Kapusta, J., & Švec, P. (2012). Verification of the fulfilment of the
purposes of Basel II, Pillar 3 through application of the web log mining methods. Acta Universitatis
Agriculturae et Silviculturae Mendelianae Brunensis, 60(2), 217–222.
Munk, M., Vrábelová, M., & Kapusta, J. (2011). Probability modeling of accesses to the web parts of portal.
Procedia Computer Science, 3, 677–683. doi:10.1016/j.procs.2010.12.113
Naz, M., & Ayub, H. (2017). Impact of risk-related disclosure on the risk-taking behavior of commer-
cial banks in Pakistan. Journal of Independent Studies and Research-Management, Social Sciences and
Economics, 15. doi:10.31384/jisrmsse/2017.15.2.9
Nier, E. W. (2005). Bank stability and transparency. Journal of Financial Stability, 1(3), 342–354.
doi:10.1016/J.JFS.2005.02.007
Nier, E. W., & Baumann, U. (2006). Market discipline, disclosure and moral hazard in banking. Journal
of Financial Intermediation, 15(3), 332–361. doi:10.1016/j.jfi.2006.03.001
Niessen-Ruenzi, A., Parwada, J. T., & Ruenzi, S. (2015). Information effects of the Basel Bank capital
and Risk Pillar 3 disclosures on equity analyst research an exploratory examination. SSRN Electronic
Journal. doi:10.2139/ssrn.2670418
Parwada, J. T., Lau, K., & Ruenzi, S. (2015). The impact of Pillar 3 disclosures on asymmetric
information and liquidity in bank stocks: Multi-country evidence. CIFR Paper No. 82/2015, 27.
doi:10.2139/ssrn.2670403
Parwada, J. T., Ruenzi, S., & Sahgal, S. (2013). Market discipline and Basel Pillar 3 reporting. SSRN
Electronic Journal. doi:10.2139/ssrn.2443189
Peristiani, S., Morgan, D. P., & Savino, V. (2010). The information value of the stress test and bank opacity.
Journal of Money, Credit and Banking, 46 (7), 1479–1500.
Pilkova, A., Munk, M., Švec, P., & Medo, M. (2015). Assessment of the Pillar 3 financial and risk infor-
mation disclosures usefulness to the commercial banks users. Lecture Notes in Artificial Intelligence,
9227, 429–440.
Poshakwale, S., Aghanya, D., & Agarwal, V. (2020). The impact of regulations on compliance costs, risk-
taking, and reporting quality of the EU banks. International Review of Financial Analysis, 68, 101431.
doi:10.1016/j.irfa.2019.101431
Sarker, N., & Sharif, J. (2020). Simultaneity among market risk taking, bank disclosures and corporate
governance: Empirical evidence from the banking sector of Bangladesh. Academy of Accounting and
Financial Studies Journal, 24(1), 1–21.
Scannella, E. (2018). Market risk disclosure in banks’ balance sheet and Pillar 3 report: The case of Italian
banks. In Myriam García-Olalla & Judith Clifton (Eds.), Contemporary Issues in Banking, chapter 3
(pp. 53–90). Palgrave Macmillan, Cham.
Scannella, E., & Polizzi, S. (2019). Do large European Banks differ in their derivative disclosure prac-
tices? A cross-country empirical study. Journal of Corporate Accounting & Finance, 30(1), 14–35.
doi:10.1002/jcaf.22373
Sironi, A. (2003). Testing for market discipline in the European banking industry: Evidence from sub-
ordinated debt issues. Journal of Money, Credit and Banking, 35(3), 443–472. Retrieved from
http://econpapers.repec.org/article/mcbjmoncb/v_3a35_3ay_3a2003_3ai_3a3_3ap_3a443-72.htm
Song, V., Cavusoglu, H., Lee, G. M., & Ma, L. (2020). IT risk factor disclosure and stock price crashes.
doi:10.24251/HICSS.2020.738
Sowerbutts, R., Zer, I., & Zimmerman, P. (2013). Bank disclosure and financial stability. Bank of England
Quarterly Bulletin, Bank of England, 53 (4), 326–335.
Stephanou, C. (2010). Rethinking market discipline in banking: Lessons learned from the Financial Crisis.
Policy Research Working Paper, The World Bank, 5227, 1–37.
Vauhkonen, J. (2012). The impact of Pillar 3 disclosure requirements on bank safety. Journal of Financial
Services Research, 41(1–2), 37–49. doi:10.1007/s10693-011-0107-x
Yang, W., & Koshiyama, A. S. (2019). Assessing qualitative similarities between financial reporting frame-
works using visualization and rules: COREP vs. Pillar 3. Intelligent Systems in Accounting, Finance and
Management, 26 (1), 16–31. doi:10.1002/isaf.1441
Zer, I. (2015). Information disclosures, default risk, and bank value. Finance and Economics Discussion Series,
2015(104), 1–43. doi:10.17016/FEDS.2015.104
Chapter 10
10.1 Introduction
In recent years, machine learning (ML) has become a fascinating field experiencing renewed interest. Drawing on many disciplines such as statistics, optimization, algorithmics and signal processing, it is a field of study in constant evolution which has now established itself in our society. ML has already been used for decades in automatic character recognition and anti-spam filters; its real-life uses are so numerous that it has conquered many different fields, from designing an algorithm to estimate the value of an asset (the price of a house, the expected earnings of a shop, etc.) based on previous observations, to analysing the composition of an email (Azencott, 2019), in particular its content and the number of occurrences of the words constituting it.
Finance is no exception, since many financial activities use solutions developed with ML. Among the multiple uses of ML techniques in finance we find, for example, the prediction of the probability of repayment of a loan according to the borrower's track record, or the prediction of stock market fluctuations in future years on the basis of past fluctuations (Heaton et al., 2017).
In this context, ML is present everywhere: banks use it to assess the creditworthiness of a borrower, or to analyse all available company data (not only financial reports but also press releases, news and even transcribed sound recordings or videos) in order to identify the most interesting investments. Another field of application of ML in the financial sector is risk control and compliance (Ding et al., 2019).
In addition to its use for investment decisions, risk control and compliance, marketing represents another field of application for ML. The use of deep mining, for example, opens up immense potential for additional sales and customer conversion, ensuring the best possible customer service and countering the loss of loyalty (Singh et al., 2019).
Finally, ML can be used to reduce costs and increase the productivity of financial institu-
tions. One example in particular is the use of artificial intelligence to automate highly technical
translations of financial documents, thereby saving production time and resources.
This chapter is intended as an introduction to the concepts, algorithms and uses of ML, with a particular focus on the different applications of ML techniques in the fields of economics, accounting and finance. The chapter provides insights for researchers, business experts and readers who seek to understand, in a simplified way, the foundations of the main algorithms used in ML as well as its applications in the fields of finance, economics and accounting, through a review of the existing literature in this area.
The chapter is structured as follows: the first and second sections present an overview of ML in order to establish the concept, its genesis and its main models; the third section develops the main applications of ML, particularly in the fields of algorithmic trading and portfolio management, risk management and credit scoring, insurance pricing and, finally, the detection of accounting and financial fraud.
To fully understand this definition, suppose a business wants to know the total amount spent by a customer from their invoices. It is sufficient to apply a conventional algorithm, namely a simple addition; a learning algorithm is not necessary. However, suppose the company this time seeks to use these invoices to determine which products the customer is most likely to purchase in a month. While the invoices are clearly related to this question, the company does not have all of the information it needs to answer it directly. However, if it has the purchase history of a large number of individuals, it becomes possible to use an ML algorithm to derive a predictive model that provides an answer to this question.
In terms of its context of appearance, ML emerged in the 1980s and can be considered a branch of artificial intelligence (AI), whose beginnings date back to the Second World War era. ML is intimately linked to data analysis and decision algorithms, which have their origins in statistics.
Until the 18th century, statistics, understood as "state science", was only descriptive (Desrosières, 2016). Only a century later was probability theory linked to statistics, with, among other things, the notion of extrapolating from the observation of a sample to the characteristics of a population. It was not until the beginning of the 20th century that statistics organized itself as a separate science and was subdivided into two disciplines: descriptive statistics and inferential statistics.
Thus, the notion of ML first refers to the understanding of human thought, which was studied by Descartes and then by G. Leibniz in his work "De Arte Combinatoria" in 1666. The philosopher sought to define the simplest elements of reasoning (like an alphabet) which, when combined, can formulate very complex thoughts (Horton et al., 2009). These ideas were later formalized by G. Boole in 1854 in his work An Investigation of the Laws of Thought on Which Are Founded the Mathematical Theories of Logic and Probability.
It was not until 1956 that ML took a remarkable turn with the Dartmouth conference organized by M. Minsky and J. McCarthy, two of the great emblematic figures of Artificial Intelligence. However, it was necessary to wait for the 1980s and the development of computing capacities for ML to take on its full extent.
Nevertheless, the most striking application of ML remains the victory of Deep Blue, the IBM supercomputer, over the world chess champion Garry Kasparov in 1997 (Campbell et al., 2002). In 2014, a Russian artificial intelligence program dubbed "Eugene Goostman" became the first "robot" to pass the famous Turing test. Although the scientific community remains divided about this experiment, it nevertheless remains an important stage for Artificial Intelligence, and more particularly for ML (Vardi, 2014).
In general, a data analysis project proceeds through five main phases (a minimal code sketch is given after the list):
1. Defining the problem: Data analysis is the process of examining and interpreting data in order to develop answers to questions or problems; the first step is therefore to translate the business question into such a problem.
2. Data collection: Once the problem is determined and translated in terms of data analysis, it
is necessary to collect data in sufficient quantity, but also take into consideration the quality
of the data to be collected.
3. Preparation of the data: In this phase, it is necessary to clean up the acquired data.
4. Modeling and evaluation: Once the data is collected and the cleaning is established, it is
possible to apply different algorithms for the same problem.
5. Deployment: One of the challenges of data analysis is to be able to provide dynamic and
economically usable results. For this, it is necessary to deploy not only a solution in partner-
ship with the departments responsible for information technology (IT) but also an interface
(data visualization).
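As an illustration of these five phases, the following minimal Python sketch strings the steps together with pandas and scikit-learn; the file name credit_data.csv and the column 'default' are hypothetical placeholders, not data from the chapter.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Defining the problem: predict whether a client defaults (binary column 'default').
# 2. Data collection: load the raw data (hypothetical file and column names).
data = pd.read_csv("credit_data.csv")

# 3. Preparation of the data: basic cleaning of missing values.
data = data.dropna()
X, y = data.drop(columns=["default"]), data["default"]

# 4. Modeling and evaluation: fit a model and measure accuracy on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Deployment: persist the fitted model so it can be served behind an interface.
joblib.dump(model, "default_model.joblib")
```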
With regard to the main algorithms deployed, there are four main families of models: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.
In this two-dimensional space, the "border" is the black line, the "support vectors" are the circled points (those closest to the border) and the "margin" is the distance between the border and the support vectors.
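To make this geometry concrete, the following minimal scikit-learn sketch (synthetic two-dimensional data, not an example from the chapter) fits a linear SVM and reports the support vectors and the margin width.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clouds of points in a two-dimensional space
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)

# Linear SVM: the decision border is the line w.x + b = 0
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)          # distance between the two margin lines
print("Number of support vectors:", len(clf.support_vectors_))
print("Margin width:", round(margin, 3))
```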
(b) Decision tree (DT) algorithms
Decision trees (DT) are one of the ML techniques used both in data mining and in business intelligence. The use of a hierarchical representation of the data structure in the form of decision sequences (tests) allows us to predict a result or a class. Thus, each individual (or observation) whose assignment to a well-defined class is sought is described by a set of variables, which are then tested at the nodes of the tree (Yang and Chen, 2020). Hence, the tests are carried out at the internal nodes and the decisions are taken at the leaf nodes.
Take the case of a board of directors having to decide on the definition of a strategy to develop the
turnover of their company (see Figure 10.2). Several options are possible: focus on the national
market by developing new product ranges or by intensifying prospecting to gain new customers.
Another alternative is possible: to develop internationally, either through a direct presence or by
establishing a local partnership.
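The short scikit-learn sketch below (a generic classification example on synthetic data rather than the board-of-directors scenario of Figure 10.2) shows how a decision tree tests variables at internal nodes and assigns a class at the leaf nodes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic observations described by four variables and a binary class
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree: tests at internal nodes, class decisions at leaf nodes
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))                      # the learned sequence of tests
print("Test accuracy:", tree.score(X_test, y_test))
```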
(c) Neural networks (ANN)
A neural network is a software and/or hardware system that mimics the functioning of biological neurons. In other words, it is a densely connected network of elementary processors operating in parallel (Audevart and Alonzo, 2019). Each elementary processor (artificial neuron) calculates a single output from the information it receives, the processors being organized in parallel within successive layers (see Figure 10.3). The first layer receives raw information as input, like the optic nerve, which processes human visual data.
Figure 10.2 Illustration of the application of decision tree algorithms to determine possible scenarios for developing a business strategy.
The circles represent the neurons arranged in layers. The network represented here comprises three layers: the input layer, receiving information on five neurons; the output layer, comprising a single neuron and giving the result of the internal calculation; and, between these two, a layer called the "hidden layer", which is not visible from the outside and is used to perform intermediate calculations.
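A minimal NumPy sketch of the architecture just described (five input neurons, one hidden layer, here assumed to contain three neurons, and a single output neuron; the weights are random placeholders rather than a trained model) makes the layer-by-layer calculation explicit.

```python
import numpy as np

rng = np.random.default_rng(42)

# Architecture from the figure: 5 inputs -> hidden layer (assumed 3 neurons) -> 1 output
W1, b1 = rng.normal(size=(5, 3)), np.zeros(3)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output weights and biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """One forward pass: each neuron combines its inputs and applies an activation."""
    hidden = sigmoid(x @ W1 + b1)               # intermediate calculations (hidden layer)
    return sigmoid(hidden @ W2 + b2)            # single output neuron

x = rng.normal(size=5)                          # raw information entering the input layer
print("Network output:", float(forward(x)))
```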
Figure 10.4 Comparative analysis between supervised learning and unsupervised learning.
Figure 10.5 Comparative overview between the different algorithmic models of machine learning.
This form of learning is frequently used with methods such as classification, regression and
prediction. Thus, it seems that semi-supervised learning is preferred when the cost of labeling is
too high to justify a fully labeled learning process.
Regarding the second form of algorithmic trading, automated trading is also known as high-frequency trading or black-box trading. Its users are mainly hedge funds and investment banks trading for their own account, which have become major players in the financial markets in the United States and Europe (Kim, 2010). This technique relies on computing resources and mathematical models that can occupy an important place in the market (Biais, 2011), and represents the "hard" version of the system.
With high-frequency trading, automation extends to the decision process itself, but at a different level. The algorithm analyzes market data in real time, identifies imbalances or inefficiencies in terms of liquidity or price and then translates them into trading opportunities which are then implemented (Aldridge, 2013). Automation and very short response times make it possible to take advantage of minimal and very short-lived variations which a human operator would not have been able to exploit or even detect (Henrique et al., 2019).
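As a deliberately simplified illustration of translating a detected price imbalance into a trading signal (a toy moving-average rule on an invented price series, far removed from production high-frequency systems), consider the following pandas sketch.

```python
import numpy as np
import pandas as pd

# Invented intraday mid-price series; a real system would stream market data
rng = np.random.default_rng(7)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 0.05, 500)), name="mid")

fast = prices.rolling(10).mean()       # short-horizon average of recent prices
slow = prices.rolling(60).mean()       # longer-horizon reference level

# Signal: +1 (buy) when the fast average sits above the slow one, -1 (sell) otherwise
signal = np.sign(fast - slow).fillna(0)

# Naive evaluation: apply the previous period's signal to the next price change
pnl = (signal.shift(1) * prices.diff()).fillna(0).cumsum()
print("Final cumulative P&L of the toy rule:", round(float(pnl.iloc[-1]), 3))
```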
majority of AAA ratings were awarded to these securities, yet their value plummeted as property prices fell, triggering the financial crisis (Coffinet and Kien, 2020). Such a situation led to sharp criticism of the credit rating process from investors and bond issuers, especially after a subjective component had been introduced into the model used for credit ratings (De Moor et al., 2018).
However, the use of ML to manage bank credit risk should remain limited and certainly very tightly controlled. Indeed, bank supervisors generally require that risk models be clear and simple in order to be understandable, verifiable and suitable for validation (Yu et al., 2008).
Whatever the case, these techniques generate significant productivity gains as soon as they render a number of data preprocessing steps obsolete; the essential advantage of ML is that it allows the use of new data likely to better reflect credit risk (Khandani et al., 2010).
Moreover, the innovative aspect here lies in going beyond the customer data usually used in scoring models, such as payment history and income. However, the central question is whether this diversity of data widens access to credit for individuals or companies previously considered too risky on the basis of traditional databases (Kruppa et al., 2013). In addition, these new data can be considered Big Data if the number of predictors observed for each individual is very large (Yue et al., 2007).
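A minimal credit-scoring sketch in scikit-learn (with synthetic borrowers and an invented "web_activity" feature standing in for such new data; it is not any of the models discussed in the cited studies) illustrates the basic setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic borrowers: traditional variables plus one invented "new data" feature
rng = np.random.default_rng(3)
n = 1000
features = pd.DataFrame({
    "payment_delays": rng.poisson(1.0, n),        # past payment incidents
    "income": rng.normal(50.0, 15.0, n),          # income in thousands
    "web_activity": rng.normal(0.0, 1.0, n),      # hypothetical alternative-data predictor
})
# Simulated default outcome loosely driven by the three variables
score = 0.8 * features["payment_delays"] - 0.03 * features["income"] + 0.4 * features["web_activity"]
default = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(features, default, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob_default = model.predict_proba(X_test)[:, 1]  # estimated probability of default
print("Test AUC:", round(roc_auc_score(y_test, prob_default), 3))
```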
Although it is often difficult to find a complete definition, financial fraud presents itself in several forms, ranging from the crude forgery of an identity card to the use of very sophisticated social engineering techniques. Moreover, the fraud mechanism is "adversarial": fraudsters constantly try to bypass existing detection procedures and systems in order to exploit any potential weakness.
The majority of anti-fraud systems currently in use are based on rules set by a human actor, for the simple reason that the results obtained are relatively simple to understand and are deemed transparent by the profession (Ahmed et al., 2016). Taking financial fraud as an example, effective fraud detection has always been an important but also complex task for accounting professionals (Ngai et al., 2011).
Moreover, the use of traditional internal audit techniques to detect accounting fraud seems to be becoming obsolete (Fanning et al., 1995). First, auditors generally do not have the requisite knowledge regarding accounting fraud. Second, the lack of experience and expertise on the part of auditors in the detection and prevention of fraud negatively impacts their mission and can generate a very significant operational risk. Finally, while admitting the limits of an audit mission, financial and accounting managers affirm that traditional and standard audit procedures are certainly easy to set up and effective; however, they become very difficult to maintain as the number of rules to be applied increases. This situation has prompted several researchers to develop and empirically test predictive fraud detection audit models with practical applications for audit operations (Hajek and Henriques, 2017). By analyzing actual accounting data, such a model can identify abnormal transactions, allowing auditors to focus directly on exceptions for a more thorough investigation in real time, which naturally leads to a significant reduction in manual intervention and processing time in audit assignments (Singh et al., 2019).
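As a sketch of this exception-oriented analysis, the example below applies an Isolation Forest, one possible anomaly-detection technique chosen here purely for illustration, to invented journal entries in order to flag unusual transactions for closer review.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic journal entries: amount and posting hour, with a few injected anomalies
rng = np.random.default_rng(11)
normal = pd.DataFrame({"amount": rng.gamma(2.0, 500.0, 995),
                       "hour": rng.integers(8, 18, 995)})
odd = pd.DataFrame({"amount": [95000, 120000, 87000, 99999, 150000],
                    "hour": [2, 3, 23, 1, 0]})          # large, out-of-hours postings
entries = pd.concat([normal, odd], ignore_index=True)

# Flag roughly the most isolated 1% of entries as exceptions for the auditor
model = IsolationForest(contamination=0.01, random_state=0).fit(entries)
entries["flagged"] = model.predict(entries) == -1
print(entries[entries["flagged"]].head(10))
```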
In addition, due to the asymmetry of information between corporate insiders and external
investors, the uncertainty of the financial market as to the existence of fraud can hamper the
normal functioning of the financial markets and the economic growth of a country (Li and Hoi,
2014).
The term "Data Mining" refers to the analysis and identification of interesting patterns in data from different perspectives and the transformation of these data into useful and statistically reliable information by establishing previously unknown and exploitable relationships (Grossrieder et al., 2013).
The convergence between Data Mining and the detection of accounting and financial fraud lies in the fact that data mining, as an advanced analytical tool, can help auditors make decisions and detect fraud (Sharma and Panigrahi, 2013).
Indeed, Data Mining techniques have the potential to resolve the tension between the effectiveness and the efficiency of fraud detection (Wang, 2010). Exploration, or data mining, allows users to analyze data from different angles, to categorize them and to synthesize the relationships identified in order to extract and discover the hidden patterns in a very large collection of data. In traditional control models, an auditor can never be sure of the legitimacy and the intention behind a fraudulent transaction. To address this situation, the most cost-effective option is to find sufficient evidence of fraud in the available data, using specialized, complex and sophisticated mathematical and computer algorithms to segment the data and assess future probabilities (Wang, 2010).
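A brief sketch of the segmentation step (k-means clustering on invented transaction features; one of many possible algorithms, used here only as an example) shows how the data can be grouped before deciding which segments deserve closer scrutiny.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented transaction features: average amount and number of transactions per customer
rng = np.random.default_rng(5)
X = np.column_stack([rng.gamma(2.0, 300.0, 400), rng.poisson(12, 400)])

# Segment the customers into a handful of groups of similar behaviour
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

for k in range(4):
    seg = X[labels == k]
    print(f"Segment {k}: {len(seg)} customers, mean amount {seg[:, 0].mean():.0f}")
```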
10.5 Conclusions
If artificial intelligence (AI) broadly designates a science aimed at imitating human capabilities, ML is the subset of AI that consists of training a machine to learn by itself. Thanks to new computer technologies, ML has experienced remarkable growth in recent years. This renewed interest can be explained by pattern recognition and the idea that computers can learn without being programmed to perform specific tasks. While many ML algorithms have been around for a long time, the computing capabilities and new data available today allow us to improve the accuracy of predictive analyses.
This renewed interest in ML is explained by the combination of several factors which also
contributed to its incredible popularity – in particular Data Mining as well as the multiplication
and diversification of available data.
In this chapter, the objective was to review the growing role, as well as the functioning, of the main algorithms that constitute the engines of ML. Two main types of ML algorithm dominate today, supervised learning and unsupervised learning, and their fields of application are varied, particularly in finance, which obviously does not escape this revolutionary science.
ML thus becomes a competitive advantage in various fields such as pricing, default risk, fraud detection and portfolio management. As such, it is essential for those involved in finance to understand the principles and applications of ML in order to have a clear and complete vision of the models deployed within their organization and, subsequently, to benefit from the enormous opportunities they offer to stand out from the competition and reinvent their business.
References
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Paper Pre-
sented at the Proceedings of the Twenty-First International Conference on Machine Learning, Banff,
Canada.
Ahmed, M., Mahmood, A. N., & Islam, M. R. (2016). A survey of anomaly detection techniques in financial
domain. Future Generation Computer Systems, 55, 278–288.
Ahmed, S. R. (2004). Applications of data mining in retail business. In International Conference on Infor-
mation Technology: Coding and Computing 2004. Proceedings. ITCC 2004, 455–459.
Aldridge, I. (2013). High-frequency trading: A practical guide to algorithmic strategies and trading systems, Vol.
604, John Wiley & Sons.
Artís, M., Ayuso, M., & Guillén, M. (2002). Detection of automobile insurance fraud with discrete
choice models and misclassified claims. Journal of Risk and Insurance, 69(3), 325–340.
Asimit, V., Kyriakou, I., & Nielsen, J. P. (2020). Special issue “Machine Learning in Insurance”. Risks, 8(2),
54.
Audevart, A., & Alonzo, M. (2019). Apprendre demain: Quand intelligence artificielle et neurosciences révolu-
tionnent l’apprentissage. Dunod.
Azencott, C.-A. (2019). Introduction au machine learning. Dunod.
Aziz, S., & Dowling, M. (2019). Machine learning and AI for risk management. In Disrupting Finance,
Lynn, T., Mooney, J.G., Rosati, P., & Cummins, M. (Eds.), 33–50.
Barra, V., Miclet, L., & Cornuéjols, A. (2018). Apprentissage artificiel: Concepts et algorithmes. Ed. 3: Eyrolles.
Basak, S., Kar, S., Saha, S., Khaidem, L., & Dey, S. R. (2019). Predicting the direction of stock market prices
using tree-based classifiers. The North American Journal of Economics and Finance, 47, 552–567.
Baudry, M., & Robert, C. Y. (2019). A machine learning approach for individual claims reserving in
insurance. Applied Stochastic Models in Business and Industry, 35(5), 1127–1155.
Benureau, F. (2015). Self exploration of sensorimotor spaces in robots. Université de Bordeaux. Available at
https://tel.archives-ouvertes.fr/tel-01251324/document.
Biais, B. (2011). High frequency trading. European Institute of Financial Regulation. Available
at https://repository.nu.edu.sa/bitstream/123456789/2206/1/High frequency trading Bruno Biais
(Toulouse School of Economics).pdf.
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational
Science, 2(1), 1–8.
Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 17 (3), 235–249.
Bouyala, R. (2016). La révolution FinTech. RB édition.
Boyer, J.-M. (2015). La tarification et le big data: quelles opportunités? Revue d’Economie Financière, 4,
81–92.
Bunker, R. P., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied
Computing and Informatics, 15(1), 27–33.
Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep blue. Artificial Intelligence, 134(1–2), 57–83.
Cao, A., He, H., Chen, Z., & Zhang, W. (2018). Performance evaluation of machine learning approaches
for credit scoring. International Journal of Economics, Finance and Management Sciences, 6 (6),
255–260.
Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010). Making words work: Using financial text as
a predictor of financial events. Decision Support Systems, 50(1), 164–175.
Coffinet, J., & Kien, J.-N. (2020). Detection of rare events: A Machine Learning toolkit with an application
to banking crises. The Journal of Finance and Data Science, 5(4), 183–207.
Corlosquet-Habart, M., & Janssen, J. (2017). Le big data pour les compagnies d’assurance. Vol. 1, ISTE Group.
De Moor, L., Luitel, P., Sercu, P., & Vanpée, R. (2018). Subjectivity in sovereign credit ratings. Journal of
Banking & Finance, 88, 366–392.
Desrosières, A. (2016). La politique des grands nombres: histoire de la raison statistique. La découverte.
Ding, K., Lev, B., Peng, X., Sun, T., & Vasarhelyi, M. A. (2019). Machine learning improves accounting
estimates. Available at SSRN 3253220.
Fanning, K., Cogger, K. O., & Srivastava, R. (1995). Detection of management fraud: A neural network
approach. Intelligent Systems in Accounting, Finance and Management, 4(2), 113–126.
Galland, M. (2012). La régulation du trading haute fréquence. Bull. Joly Bourse mars.
Géron, A. (2017). Machine Learning avec Scikit-Learn. Dunod.
Ghoddusi, H., Creamer, G. G., & Rafizadeh, N. (2019). Machine learning in energy economics and finance:
A review. Energy Economics, 81, 709–727.
Grossrieder, L., Albertetti, F., Stoffel, K., & Ribaux, O. (2013). Des données aux connaissances, un chemin
difficile: réflexion sur la place du data mining en analyse criminelle. Revue Internationale de Crimi-
nologie et de Police Technique et Scientifique, 66 (1), 99–116.
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via Machine Learning. The Review of Financial
Studies, 33(5), 2223–2273.
Hajek, P., & Henriques, R. (2017). Mining corporate annual reports for intelligent detection of financial
statement fraud–A comparative study of machine learning methods. Knowledge-Based Systems, 128,
139–152.
Hajek, P., & Olej, V. (2014). Predicting firms’ credit ratings using ensembles of artificial immune systems
and machine learning – An over-sampling approach. In IFIP International Conference on Artificial
Intelligence Applications and Innovations, 29–38. Berlin: Springer.
Heaton, J., Polson, N., & Witte, J. H. (2017). Deep learning for finance: deep portfolios. Applied Stochastic
Models in Business and Industry, 33(1), 3–12.
Hendershott, T., & Riordan, R. (2013). Algorithmic trading and the market for liquidity. Journal of Finan-
cial and Quantitative Analysis, 48(4), 1001–1024.
Henrique, B. M., Sobreiro, V. A., & Kimura, H. (2019). Literature review: Machine learning techniques
applied to financial market prediction. Expert Systems with Applications, 124, 226–251.
Horton, R., Morrissey, R., Olsen, M., Roe, G., & Voyer, R. (2009). Mining eighteenth century ontologies:
Machine learning and knowledge classification in the Encyclopédie. Digital Humanities Quarterly,
3(2), 1–15.
Huang, G., Song, S., Gupta, J. N., & Wu, C. (2014). Semi-supervised and unsupervised extreme learning
machines. IEEE Transactions on Cybernetics, 44(12), 2405–2417.
Jain, M., Narayan, S., Balaji, P., Bhowmick, A., & Muthu, R. K. (2020). Speech emotion recognition using
support vector machine. arXiv preprint arXiv:2002.07590.
Joudaki, H., Rashidian, A., Minaei-Bidgoli, B., Mahmoodi, M., Geraili, B., Nasiri, M., & Arab, M. (2015).
Using data mining to detect health care fraud and abuse: a review of literature. Global Journal of Health
Science, 7 (1), 194–202.
Kaur, J., & Dharni, K. (2019). Predicting daily returns of global stocks indices: Neural networks vs support
vector machines. Journal of Economics, Management and Trade, 24(6), 1–13.
Khandani, A. E., Kim, A. J., & Lo, A. W. (2010). Consumer credit-risk models via machine-learning
algorithms. Journal of Banking & Finance, 34(11), 2767–2787.
Kim, A., Yang, Y., Lessmann, S., Ma, T., Sung, M.-C., & Johnson, J. E. (2020). Can deep learning pre-
dict risky retail investors? A case study in financial risk behavior forecasting. European Journal of
Operational Research, 283(1), 217–234.
Kim, K. (2010). Electronic and algorithmic trading technology: The complete guide. Academic Press.
Kirchner, T. (2018). Predicting your casualties – How machine learning is revolutionizing insurance pricing
at AXA, Available at https://digital.hbs.edu/platform-rctom/submission/predicting-your-casualties-
how-machine-learning-is-revolutionizing-insurance-pricing-at-axa/.
Kou, Y., Lu, C. T., Sirwongwattana, S., & Huang, Y. P. (2004). Survey of fraud detection techniques. In
IEEE International Conference on Networking, Sensing and Control, Vol. 2, 749–754. IEEE.
Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability
estimates using machine learning. Expert Systems with Applications, 40(13), 5125–5131.
Kumar, G., Muckley, C. B., Pham, L., & Ryan, D. (2019). Can alert models for fraud protect the elderly
clients of a financial institution? The European Journal of Finance, 25(17), 1683–1707.
Kumar, M., & Thenmozhi, M. (2006). Forecasting stock index movement: A comparison of support vector
machines and random forest. In Indian Institute of Capital Markets 9th Capital Markets Conference
Paper, Mumbai, India, 1–16.
Kuo, R. J., Chen, C., & Hwang, Y. (2001). An intelligent stock trading decision support system through
integration of genetic algorithm based fuzzy neural network and artificial neural network. Fuzzy Sets
and Systems, 118(1), 21–45.
Lallich, S., Lenca, P., & Vaillant, B. (2007). Construction d’une entropie décentrée pour l’apprentissage
supervisé. Available at https://hal.archives-ouvertes.fr/hal-02121562/.
Leung, C. K.-S., MacKinnon, R. K., & Wang, Y. (2014). A machine learning approach for stock price
prediction. In 18th International Database Engineering & Applications Symposium, Porto Portugal,
274–277.
Li, B., & Hoi, S. C. (2014). Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46 (3),
1–36.
Li, B., Yu, J., Zhang, J., & Ke, B. (2016). Detecting accounting frauds in publicly traded US firms: A
machine learning approach. In Asian Conference on Machine Learning.
López de Prado, M., & Lewis, M. J. (2019). Detection of false investment strategies using unsupervised
learning methods. Quantitative Finance, 19(9), 1555–1565.
Love, B. C. (2002). Comparing supervised and unsupervised category learning. Psychonomic Bulletin &
Review, 9(4), 829–835.
Lutz, M., & Biernat, E. (2015). Data science: fondamentaux et études de cas: Machine Learning avec Python
et R. Editions Eyrolles.
Ly, A. (2019). Algorithmes de Machine Learning en assurance: solvabilité, textmining, anonymisation et trans-
parence. Université Paris-Est.
Majumder, M., & Hussian, M. A. (2007). Forecasting of Indian stock market index using artificial neural
network. Information Science, 98–105.
McLean, I., & McMillan, A. (2009). The concise Oxford dictionary of politics. OUP Oxford.
Mittal, A., & Goel, A. (2012). Stock prediction using twitter sentiment analysis. Stanford University,
CS229, available at http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsing
TwitterSentimentAnalysis.pdf.
Mizuno, H., Kosaka, M., Yajima, H., & Komoda, N. (1998). Application of neural network to technical
analysis of stock market prediction. Studies in Informatic and Control, 7 (3), 111–120.
Moualek, D. Y. (2017). Deep Learning pour la classification des images. 07-03-2017. Available at
http://dspace.univ-tlemcen.dz/handle/112/12583.
Ngai, E. W., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining techniques
in financial fraud detection: A classification framework and an academic review of literature. Decision
Support Systems, 50(3), 559–569.
Nuti, G., Mirghaemi, M., Treleaven, P., & Yingsaeree, C. (2011). Algorithmic trading. Computer, 44(11),
61–69.
Oxford, O. E. (2009). Oxford English Dictionary. Oxford University Press.
Phua, C., Lee, V., Smith, K., & Gayler, R. (2010). A comprehensive survey of data mining-based fraud detection
research. arXiv preprint arXiv:1009.6119.
PÅĆawiak, P., Abdar, M., & Acharya, U. R. (2019). Application of new deep genetic cascade ensemble of
SVM classifiers to predict the Australian credit scoring. Applied Soft Computing, 84, 105740.
Sánchez, D., Vila, M., Cerda, L., & Serrano, J.-M. (2009). Association rules applied to credit card fraud
detection. Expert Systems with Applications, 36 (2), 3630–3640.
Sang, C., & Di Pierro, M. (2019). Improving trading technical analysis with tensorflow long short-term
memory (LSTM) neural network. The Journal of Finance and Data Science, 5(1), 1–11.
Sathya, R., & Abraham, A. (2013). Comparison of supervised and unsupervised learning algorithms for
pattern classification. International Journal of Advanced Research in Artificial Intelligence, 2(2), 34–38.
Schnaubelt, M., Fischer, T., & Krauss, C. (2020). Separating the signal from the noiseâĂŞfinancial machine
learning for Twitter. Journal of Economic Dynamics and Control, 103895.
198 Maha Radwan et al.
Sharma, A., & Panigrahi, P. K. (2013). A review of financial accounting fraud detection based on data mining
techniques. arXiv preprint arXiv:1309.3944.
Singh, N., Lai, K. h., Vejvar, M., & Cheng, T. E. (2019). Data-driven auditing: A predictive modeling
approach to fraud detection and classification. Journal of Corporate Accounting & Finance, 30(3),
64–82.
Smith, K. A., Willis, R. J., & Brooks, M. (2000). An analysis of customer retention and insurance claim
patterns using data mining: A case study. Journal of the Operational Research Society, 51(5), 532–541.
Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and
Machine Learning, 4(1), 1–103.
Thomas, T. P. V., Vijayaraghavan, A. P., & Emmanuel, S. (2020). Machine learning approaches in cyber
security analytics. Springer.
Tjung, L. C., Kwon, O., Tseng, K., & Bradley-Geist, J. (2010). Forecasting financial stocks using data
mining. Global Economy and Finance Journal, 3(2), 13–26.
Vandewalle, V. (2009). Estimation et sélection en classification semi-supervisée. Available at
https://tel.archives-ouvertes.fr/tel-00447141/.
Vardi, M. Y. (2014). Would Turing have passed the Turing test? Communications of the ACM, 57 (9), 5–5.
Verbelen, R., Antonio, K., & Claeskens, G. (2018). Unravelling the predictive power of telematics data
in car insurance pricing. Journal of the Royal Statistical Society: Series C (Applied Statistics), 67 (5),
1275–1304.
Wang, J.-H., Liao, Y.-L., Tsai, T.-M., & Hung, G. (2006). Technology-based financial frauds in Taiwan:
issues and approaches. In 2006 IEEE International Conference on Systems, Man and Cybernetics, 1120–
1124. IEEE.
Wang, S. (2010). A comprehensive survey of data mining-based accounting-fraud detection research.
In 2010 International Conference on Intelligent Computation Technology and Automation, Changsha,
China, 50–53.
Yafooz, W. M., Bakar, Z. B. A., Fahad, S. A., & Mithon, A. M. (2020). Business intelligence through
big data analytics, data mining and machine learning data management, analytics and innovation.
Sharma N., Chakrabarti A., & Balas V. (Eds.), Data Management, Analytics and Innovation. Advances
in Intelligent Systems and Computing, vol. 1016, Springer, Singapore, 217–230.
Yang, J.-C., Chuang, H.-C., & Kuan, C.-M. (2019). Double machine learning with gradient boosting and
its application to the big n audit quality effect. USC-INET Research Paper(19-05).
Yang, S.-B., & Chen, T.-L. (2020). Uncertain decisiontree for bank marketing classification. Journal of
Computational and Applied Mathematics, 371, 112710.
Yu, L., Wang, S., & Lai, K. K. (2008). Credit risk assessment with a multistage neural network ensemble
learning approach. Expert Systems with Applications, 34(2), 1434–1444.
Yue, D., Wu, X., Wang, Y., Li, Y., & Chu, C.-H. (2007). A review of data mining-based financial fraud
detection research. In 2007 International Conference on Wireless Communications, Networking and
Mobile Computing, 5519–5522. IEEE.
Zhang, D., & Zhou, L. (2004). Discovering golden nuggets: Data mining in financial application. IEEE
Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(4), 513–522.
Chapter 11
Handling class imbalance data
11.1 Introduction
Due to recent advances in computer science and communication technology, the availability of raw data has increased immensely over the last decade. As a result, a huge opportunity has been created to mine new patterns and knowledge from these data (He and Garcia, 2009; Van Hulse, Khoshgoftaar, and Napolitano, 2007). At the same time, however, this abundance of data has created major problems such as data imbalance (Aurelio et al., 2019). In classification tasks especially, imbalanced data is a curse: even well-established algorithms may fail to classify with acceptable accuracy. The main reason is that the number of labeled data points for one class is far higher than the number of labeled data points for the other class or classes (Han, Wang, and Mao, 2005; Lemaître, Nogueira, and Aridas, 2017). Conventionally, the class with more data points is called the majority class and the other class (or classes) is known as the minority. As a result, an algorithm may learn the patterns of the majority class with very high accuracy, often above 99%, while its performance on the minority class can be extremely poor, sometimes below 10%.
The data imbalance problem is a common scenario in several well-known business research domains such as credit card fraud detection (Wei et al., 2013) and product backorder prediction (de Santis, de Aguiar, and Goliatt, 2017; Hajek and Abedin, 2020). It is also common in other fields, including information retrieval (Chang et al., 2003; Chen et al., 2011; Zheng, Wu, and Srihari, 2004), detecting unreliable telecommunication customers (Burez and Van den Poel, 2009), learning word pronunciations, network intrusion detection (Cieslak, Chawla, and Striegel, 2006; Rodda and Erothi, 2016) and oil-spill detection (De Maio et al., 2016). Researchers have tried different approaches to solving the data imbalance problem, which are broadly categorized as data-level approaches and algorithm-level approaches.
Among the many business applications, credit card fraud detection has been taken as the study example in this chapter. For banks and card issuers, credit card fraud is an alarming event: financial institutions suffer huge monetary losses every year because of credit card anomalies. Credit card fraud detection is therefore an integral function of banks that screens fake transactions before they are authorized by card issuers. Although credit card anomalies happen infrequently, their impact is substantial because most fraudulent transactions involve large values. Owing to privacy issues, there is a lack of empirical research based on real-world transaction data. It is therefore important to classify and predict whether a transaction is fraudulent; accurate prediction of fraudulent transactions allows stakeholders to take timely actions that can prevent further fraud or monetary losses.
In this chapter, we mainly discuss the data-level approaches for detecting credit card fraud. Oversampling duplicates minority class examples, whereas undersampling removes some majority class examples; in some cases, a combination of both sampling techniques may be an effective solution. Other data-level methods generate new minority class samples by exploiting data patterns and hypotheses. SMOTE (Chawla et al., 2002) and borderline-SMOTE (Han, Wang, and Mao, 2005; Nguyen, Cooper, and Kamei, 2009) are popular examples of such approaches. The evaluation metrics suited to judging the performance of classification methods on such datasets include the AUC (Yang et al., 2017), the ROC curve and the precision-recall curve (He and Garcia, 2009).
The rest of the chapter is organized as follows: we first define the data imbalance problem in Section 11.2. Then, we discuss possible solutions for imbalanced datasets in Section 11.3. Section 11.4 presents the evaluation metrics used to measure the performance of methods under the imbalance problem. A case study on a credit card fraud detection dataset is presented in Section 11.5. Finally, we conclude the chapter with some future research directions in Section 11.6.
Figure 11.1 The number of samples per class in an imbalanced dataset (majority vs. minority class).
In the credit card fraud detection dataset, only 0.17% of the data points are labeled as fraud and the rest of the examples (99.83%) are valid transactions. Hence, an algorithm needs particular capability to learn the pattern of credit card fraud, which is next to impossible for any classical method due to the lack of data points for the minority class. The minority class examples should therefore be increased in some way so that the algorithm can learn the minority class and eventually predict it correctly. This can be done in two ways: one is to duplicate existing examples and the other is to generate examples synthetically by considering the data patterns, as sketched below.
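The two data-level options just described can be tried directly with the imbalanced-learn toolbox cited above (Lemaître, Nogueira, and Aridas, 2017). The following is only a minimal sketch, not code from this chapter; the toy data, class sizes and random seeds are illustrative assumptions.

# A minimal sketch: random duplication of minority examples and random removal
# of majority examples using the imbalanced-learn toolbox.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(42)
# Toy imbalanced data: 950 majority (label 0) vs. 50 minority (label 1) samples.
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

# Duplicate minority examples until both classes have the same size.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# Alternatively, drop majority examples until both classes have the same size.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_over), np.bincount(y_under))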
To create minority class examples artificially, a widely used technique called SMOTE (Synthetic Minority Over-sampling TEchnique) was proposed by Chawla et al. (2002). Unlike random oversampling, SMOTE creates synthetic examples of the minority class rather than duplicating existing ones (Last, Douzas, and Bacao, 2017).
SMOTE generates synthetic examples along the line segments that join a minority class example to its nearest minority neighbors (Sharma et al., 2018). First, for each considered minority sample, k nearest minority neighbors are identified; k is chosen based on the desired number of oversampled examples. The difference between the feature vector of the considered sample and that of a selected neighbor is then used to create a synthetic minority example: a random number in [0, 1] is multiplied by the difference and the result is added to the feature vector of the considered sample. The synthetic point therefore lies at a random position on the line segment between the two samples, which tends to produce a more effective decision region.
The oversampling percentage determines how many synthetic samples SMOTE generates. For example, if the oversampling percentage is 100%, one new example is generated for each minority instance, so the size of the minority class doubles. Let the percentage be n, with n ≥ 100; the working steps of the SMOTE oversampling technique are as follows:
1. SMOTE first identifies the k nearest neighbors, with k = ⌈n/100⌉, of each sample in the minority class X by using the traditional Euclidean distance (Guo et al., 2003):

   d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

   where P and Q indicate the feature vectors of a particular minority sample and one of its neighbors, respectively.
2. Then, the difference between the feature vectors of the minority class example and each neighbor is calculated, giving k feature-difference vectors in total.
3. Each feature value of a difference vector is multiplied by a random number on the scale [0, 1].
4. Finally, the scaled difference vector is added to the feature vector of the considered sample, and the resulting vector is used as a new minority class example.

A code sketch of these steps is given below.
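The following is a minimal NumPy sketch of the SMOTE-style interpolation just described, not the reference implementation: it uses a fixed number of nearest neighbours k and generates ⌈n/100⌉ synthetic points per minority sample, which is the common formulation; the helper name smote_oversample and the toy minority matrix are illustrative assumptions.

# SMOTE-style interpolation: for each minority sample, pick one of its k nearest
# minority neighbours and create a synthetic point on the line segment between them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_percent=100, k=5, seed=0):
    """Return synthetic minority samples; n_percent=100 doubles the class."""
    rng = np.random.default_rng(seed)
    n_new_per_sample = int(np.ceil(n_percent / 100))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: the sample itself
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        for _ in range(n_new_per_sample):
            j = rng.choice(idx[i][1:])                   # a random minority neighbour
            gap = rng.random()                           # random number in [0, 1]
            synthetic.append(x + gap * (X_min[j] - x))   # point on the segment
    return np.asarray(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 3))    # toy minority class
print(smote_oversample(X_min, n_percent=200).shape)      # (40, 3)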
11.3.3 Borderline-SMOTE
Most classification algorithms try to learn the borderline examples of each class, because the rate of misclassification often depends on how well the algorithms learn those borderline examples; it is common for any method to classify borderline data points into the wrong class. Therefore a special version of SMOTE, widely known as borderline-SMOTE, has been introduced for this scenario (Han, Wang, and Mao, 2005). Whereas SMOTE selects k nearest neighbors within the entire minority class and generates synthetic examples for it, borderline-SMOTE creates and oversamples only the borderline data points of the minority class (Nguyen, Cooper, and Kamei, 2009). In summary, borderline-SMOTE first finds the borderline examples, then generates synthetic data points and adds them to the training set as minority class examples.
training set as minority class examples. Given a labeled training dataset, let p1 p2 pa and
Handling class imbalance data 203
n1 n2 nb be the minority and majority class examples having the size a and b, respectively.
Then the steps of borderline-SMOTE are described as follows:
1. Calculate the m nearest neighbors of each minority class example p_i in the whole dataset. Let m′ be the number of those neighbors that belong to the majority class, so that 0 ≤ m′ ≤ m.
2. When m′ = m, all the nearest neighbors of p_i come from the majority class and p_i is considered noise. If the number of nearest neighbors of p_i from the majority class is larger than the number from the minority class, such that m/2 ≤ m′ < m, then p_i could easily be misclassified; such a p_i is put into the DANGER set. These are the borderline examples of the minority class.
3. The members of the DANGER set are minority examples and they are the borderline data points. Mathematically,

   DANGER = {p′_1, p′_2, p′_3, ..., p′_H}

   where H is the number of borderline minority examples. A sketch of this oversampling step, based on an existing implementation, follows.
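In practice, borderline-SMOTE does not need to be re-implemented by hand; the imbalanced-learn toolbox cited earlier (Lemaître, Nogueira, and Aridas, 2017) provides a BorderlineSMOTE class. The sketch below assumes synthetic toy data, and the k_neighbors and m_neighbors settings are illustrative choices.

# Borderline-SMOTE with imbalanced-learn on toy overlapping classes.
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1, size=(950, 2)),   # majority class
               rng.normal(1.0, 1, size=(50, 2))])   # minority class
y = np.array([0] * 950 + [1] * 50)

# k_neighbors: neighbours used to build synthetic points;
# m_neighbors: neighbours used to decide whether a minority point is in DANGER.
sampler = BorderlineSMOTE(k_neighbors=5, m_neighbors=10, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))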
                      Predicted negative        Predicted positive
Actual Negative       True Negative (TN)        False Positive (FP)
Actual Positive       False Negative (FN)       True Positive (TP)

Figure 11.2 Confusion matrix (see Moula, Guotai, and Abedin, 2017, p. 165).
True Negative (TN): How many negative examples are predicted appropriately as negative
examples?
False Positive (FP): How many negative examples are predicted inappropriately as positive
examples?
False Negative (FN): How many positive examples are predicted inappropriately as nega-
tive examples?
True Positive (TP): How many positive examples are predicted appropriately as positive
examples?
Accuracy is the most widely used metric (Moula, Guotai, and Abedin, 2017) to judge the performance of any classifier and can be defined as follows:

   Acc = \frac{TP + TN}{TP + TN + FP + FN}
Besides the accuracy (Acc) metric, the loss, defined as 1 - Acc, is another metric for estimating a system's performance. Because of the imbalance in the class distribution, the losses are also unequal across the classes; therefore, accuracy alone is not sufficient to measure the performance of the classifier (Chawla et al., 2002). Moreover, accuracy is a global metric that does not indicate how many samples of a particular class are classified correctly. For imbalanced datasets, the ROC curve is widely employed as a better way to evaluate the performance of a classification algorithm (Chawla et al., 2002). The false positive rate F is defined as:

   F = \frac{FP}{FP + TN}
The precision P is another evaluation metric; it measures how accurate the classifier is on the examples it classifies as positive. Precision is calculated as follows:

   P = \frac{TP}{TP + FP}

The ability of a classifier to identify the positive samples is measured by the recall R:

   R = \frac{TP}{TP + FN}
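As a quick numerical check of the definitions above, the following sketch computes TN, FP, FN and TP and then Acc, P and R with scikit-learn; the two label vectors are hypothetical.

# Confusion-matrix-based metrics on hypothetical labels (1 = positive, e.g., fraud).
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)     # same as accuracy_score
p = tp / (tp + fp)                        # precision
r = tp / (tp + fn)                        # recall
print(tn, fp, fn, tp)                     # 7 1 1 1
print(acc, accuracy_score(y_true, y_pred))
print(p, precision_score(y_true, y_pred), r, recall_score(y_true, y_pred))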
Receiver operating characteristic (ROC) curves: The ROC curve is widely used to measure the performance of a classifier applied to an imbalanced dataset (Yang et al., 2017). The curve is constructed from two measures, the true positive rate and the false positive rate, which are calculated by the following equations (see Moula, Guotai, and Abedin, 2017, p. 165):

   TP\_rate = \frac{TP}{TP + FN}, \quad FP\_rate = \frac{FP}{FP + TN}

The ROC curve is drawn by plotting the TP_rate against the FP_rate. Any point on the curve represents the performance of the classifier under a certain class distribution. The curve also represents the trade-off between true positives and false positives, which correspond to the benefit and the cost of the classifier, respectively.
As an example, an ROC curve is illustrated in Figure 11.3. The points on the diagonal BD represent random guessing of the class labels, that is, a random classifier. The blue and green curves in Figure 11.3 are typical ROC curves. The closer a curve is to point A, the better the performance; here, the method with the blue ROC curve performs better than the classifier with the green ROC curve. According to the curve, the classifier can obtain a higher number of true positives only at the cost of more false positives. However, any ROC curve passing through the shaded part of the figure corresponds to a classifier that performs worse than a random classifier.
Precision-recall curve: The precision-recall curve is quite similar to the ROC curve and can be used for highly imbalanced datasets (He and Garcia, 2009). In the same fashion, it is generated by plotting the precision against the recall of the classifier, and its characteristics are similar to those of the ROC curve. For a severely imbalanced dataset, where the number of negative examples is far larger than the number of positive examples, the ROC curve should not be the only curve used to measure performance; if both the ROC and the precision-recall curves tell a similar story, the performance can be considered validated.
Area Under the Curve (AUC): The most important evaluation metric for testing the efficiency of a classifier under data imbalance is the AUC (Area Under the Curve) (Chawla et al., 2002; Yang et al., 2017). The AUC is estimated from the ROC curve or the precision-recall curve, and on its own it is sufficient to indicate the performance of a classifier under the data imbalance problem. This evaluation metric is estimated as follows:
   AUC = \frac{1 + P - F}{2}

Figure 11.3 Example ROC curves, plotting the true positive rate against the false positive rate; the corners of the plot are labeled A, B, C and D.
The area under the blue curve illustrates the AUC of the classifier with blue ROC in Figure 11.3.
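A minimal sketch of how the ROC curve, the precision-recall curve and their areas can be computed with scikit-learn is shown below; the logistic regression classifier, the synthetic imbalanced data and the train/test split are illustrative assumptions rather than the experiment reported in this chapter.

# ROC curve, precision-recall curve and the corresponding areas on synthetic
# imbalanced data (about 2% positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the minority class

fpr, tpr, _ = roc_curve(y_te, scores)       # FP_rate vs. TP_rate
prec, rec, _ = precision_recall_curve(y_te, scores)

print("ROC AUC:", roc_auc_score(y_te, scores))
print("PR  AUC:", auc(rec, prec))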
Figure 11.4 The distribution of valid transactions (284,807 negative examples) and fraud transactions (492 positive examples) in the credit card fraud detection dataset.
Figure 11.5 Performance of different data balancing techniques (class weight boosting, random oversampling, SMOTE and borderline-SMOTE) on the credit card fraud detection dataset in terms of AUC (AUC axis range approximately 0.90 to 0.96).
Figure 11.6 Performance of different data balancing techniques on the credit card fraud detection dataset in terms of the ROC curve.
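The comparison summarized in Figures 11.5 and 11.6 can be reproduced in spirit with a sketch like the one below, which scores class weight boosting, random oversampling, SMOTE and borderline-SMOTE with the same classifier by AUC; the logistic regression baseline and the synthetic data are assumptions, not the setup actually used for the figures.

# Compare data balancing techniques by AUC with one shared classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=20000, weights=[0.995, 0.005], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def auc_of(model, X_fit, y_fit):
    scores = model.fit(X_fit, y_fit).predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, scores)

# Class weight boosting: reweight classes instead of resampling.
results = {"class weight": auc_of(LogisticRegression(max_iter=1000,
                                                     class_weight="balanced"),
                                  X_tr, y_tr)}
# Data-level techniques: resample the training set, then fit the same model.
for name, sampler in [("random oversampling", RandomOverSampler(random_state=0)),
                      ("SMOTE", SMOTE(random_state=0)),
                      ("borderline-SMOTE", BorderlineSMOTE(random_state=0))]:
    X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
    results[name] = auc_of(LogisticRegression(max_iter=1000), X_bal, y_bal)

for name, score in results.items():
    print(f"{name:>20s}: AUC = {score:.3f}")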
11.6 Conclusion
In summary, a class-imbalanced dataset has a skewed distribution of data points across the classes: in a highly imbalanced dataset, the number of majority class examples can be far higher than the number of minority class examples. In this chapter, we presented the data-level challenges and solutions for overcoming the imbalance problem. The techniques for addressing it include random oversampling and undersampling, SMOTE and borderline-SMOTE synthetic sampling, and class weight boosting. To measure the performance of a classifier on such data, several effective evaluation metrics are available, such as the ROC curve, the precision-recall curve and the AUC. Future research can use larger datasets from a multi-country perspective to test the effectiveness of the various techniques, and research on innovative methods for tackling classification challenges with imbalanced data is warranted.
References
Aurelio, Yuri Sousa, Gustavo Matheus de Almeida, Cristiano Leite de Castro, and Antonio Padua Braga.
2019. “Learning from imbalanced data sets with weighted cross-entropy function.” Neural Processing
Letters 50 (2): 1937–1949.
Burez, Jonathan, and Dirk Van den Poel. 2009. “Handling class imbalance in customer churn prediction.”
Expert Systems with Applications 36 (3): 4626–4636.
Cao, Hong, Xiao-Li Li, David Yew-Kwong Woon, and See-Kiong Ng. 2013. “Integrated oversampling for
imbalanced time series classification.” IEEE Transactions on Knowledge and Data Engineering 25 (12):
2809–2822.
Chang, Edward Y, Beitao Li, Gang Wu, and Kingshy Goh. 2003. “Statistical learning for effective visual
information retrieval.” In Proceedings 2003 International Conference on Image Processing (Cat. No.
03CH37429), Vol. 3, III–609. IEEE.
Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. “SMOTE:
Synthetic minority over-sampling technique.” Journal of Artificial Intelligence Research 16: 321–357.
Chen, Enhong, Yanggang Lin, Hui Xiong, Qiming Luo, and Haiping Ma. 2011. “Exploiting probabilistic
topic models to improve text categorization under class imbalance.” Information Processing &
Management 47 (2): 202–214.
Cieslak, David A, Nitesh V Chawla, and Aaron Striegel. 2006. “Combating imbalance in network intrusion
datasets.” In 2006 IEEE International Conference on Granular Computing, 732–737. Atlanta.
De Maio, Antonio, Danilo Orlando, Luca Pallotta, and Carmine Clemente. 2016. “A multifamily GLRT
for oil spill detection.” IEEE Transactions on Geoscience and Remote Sensing 55 (1): 63–79.
de Santis, Rodrigo Barbosa, Eduardo Pestana de Aguiar, and Leonardo Goliatt. 2017. “Predicting mate-
rial backorders in inventory management using machine learning.” In 2017 IEEE Latin American
Conference on Computational Intelligence (LA-CCI), 1–6. IEEE.
Douzas, Georgios, and Fernando Bacao. 2017. “Geometric SMOTE: Effective oversampling for imbalanced
learning through a geometric extension of SMOTE.” arXiv preprint arXiv:1709.07377.
Guo, Gongde, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. 2003. “KNN model-based approach in
classification.” In OTM Confederated International Conferences “On the Move to Meaningful Internet
Systems", 986–996. Springer.
Hajek, Petr, and Mohammad Zoynul Abedin. 2020. “A profit function-maximizing inventory backorder
prediction system using big data analytics.” IEEE Access 8: 58982–58994.
Han, Hui, Wen-Yuan Wang, and Bing-Huan Mao. 2005. “Borderline-SMOTE: A new over-sampling
method in imbalanced data sets learning.” In International Conference on Intelligent Computing,
878–887. Springer.
He, Haibo, and Edwardo A Garcia. 2009. “Learning from imbalanced data.” IEEE Transactions on
Knowledge and Data Engineering 21 (9): 1263–1284.
Kraus, Mathias, Stefan Feuerriegel, and Asil Oztekin. 2020. “Deep learning in business analytics
and operations research: Models, applications and managerial implications.” European Journal of
Operational Research 281 (3): 628–641.
Krawczyk, Bartosz. 2016. “Learning from imbalanced data: Open challenges and future directions.” Progress
in Artificial Intelligence 5 (4): 221–232.
Last, Felix, Georgios Douzas, and Fernando Bacao. 2017. “Oversampling for imbalanced learning based on
k-means and SMOTE.” arXiv preprint arXiv:1711.00837.
Lemaître, Guillaume, Fernando Nogueira, and Christos K Aridas. 2017. “Imbalanced-learn: A python
toolbox to tackle the curse of imbalanced datasets in machine learning.” The Journal of Machine
Learning Research 18 (1): 559–563.
Moula, Fahmida E, Chi Guotai, and Mohammad Zoynul Abedin. 2017. “Credit default prediction
modeling: An application of support vector machine.” Risk Management 19 (2): 158–187.
Nguyen, Hien M, Eric W Cooper, and Katsuari Kamei. 2009. “Borderline over-sampling for imbalanced
data classification.” In Proceedings: Fifth International Workshop on Computational Intelligence &
Applications, Vol. 2009, 24–29. IEEE SMC Hiroshima Chapter.
Rodda, Sireesha, and Uma Shankar Rao Erothi. 2016. “Class imbalance problem in the network intrusion
detection systems.” In 2016 International Conference on Electrical, Electronics, and Optimization
Techniques (ICEEOT), 2685–2688. IEEE.
Sharma, Shiven, Colin Bellinger, Bartosz Krawczyk, Osmar Zaiane, and Nathalie Japkowicz. 2018.
“Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance.”
In 2018 IEEE International Conference on Data Mining (ICDM), 447–456. IEEE.
Sun, Yanmin, Mohamed S Kamel, and Yang Wang. 2006. “Boosting for learning multiple classes with
imbalanced class distribution.” In Sixth International Conference on Data Mining (ICDM’06),
592–602. IEEE.
Van Hulse, Jason, Taghi M Khoshgoftaar, and Amri Napolitano. 2007. “Experimental perspectives on
learning from imbalanced data.” In Proceedings of the 24th International Conference on Machine
Learning, 935–942. New York, USA.
Wei, Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. “Effective detection of
sophisticated online banking fraud on extremely imbalanced data.” World Wide Web 16 (4): 449–475.
Yang, Zhiyong, Taohong Zhang, Jingcheng Lu, Dezheng Zhang, and Dorothy Kalui. 2017. “Optimizing
area under the ROC curve via extreme learning machines.” Knowledge-Based Systems 130: 74–89.
Zheng, Zhaohui, Xiaoyun Wu, and Rohini Srihari. 2004. “Feature selection for text categorization on
imbalanced data.” ACM Sigkdd Explorations Newsletter 6 (1): 80–89.
Chapter 12
AI in recruiting talents
12.1 Introduction
Artificial intelligence (AI) is a fast-growing field of study in the computer science domain that emulates specific human characteristics to keep pace with the rapid advances in human life (Hmoud and Laszlo, 2019; Tambe et al., 2019; Zhou and Shen, 2018). AI can be defined as the ability of computer technology to model human behavior with minimal human intervention (Black and van Esch, 2020; Hamet and Tremblay, 2017). Unlike human intelligence, AI research has focused on only a few components of intelligence, such as learning, knowledge, perception, problem-solving, planning, the ability to manipulate and move objects, and the advanced and correct use of language. Since its beginning in 1921 in the form of a robot (the earliest form of AI), as mentioned in Hamet and Tremblay (2017), AI has contributed to numerous fields including business, education, transportation, agriculture, entertainment and engineering (Popa, 2011; Ramesh et al., 2004). AI is used to accomplish administrative tasks faster, better, and more cheaply with high accuracy (Bhalgat, 2019; Kolbjornsrud et al., 2016). AI is becoming more popular day by day because it supports recruitment and onboarding, internal mobility and employee retention, and the automation of administrative tasks in a cheaper, quicker, and smoother manner (Iqbal, 2018; O'Connor, 2020).
Nowadays, human resources are treated as a strategic weapon for businesses to outperform competitors, and it is believed that the knowledge, skills, abilities, and other characteristics embodied in human capital can yield sustainable competitive advantage (Black and van Esch, 2020; Patel et al., 2019). However, acquiring the right talent is considered one of the toughest issues managers face in a competitive global marketplace (Park et al., 2004). The talent acquisition function within HR is assigned the responsibility of acquiring people with the highest possible potential (Palshikar et al., 2019). Recruiting talent in the traditional mode is very challenging, particularly screening the right resumes, sorting the desired skill sets objectively, and identifying the profile that matches the firm (Wilfred, 2018). In particular, recruiters need to scan and analyze complete resumes/curricula vitae, profiles, and other referrals to screen the right candidate, which hinders effective recruitment because of human limitations caused by unconscious biases, favoritism, preconceived perceptions, and time constraints (Hmoud and Laszlo, 2019; Johansson and Herranen, 2019).
In the current tech-based workplace, hiring managers constantly strive to apply new techniques to source the right talent and ensure person-job fit (Dennis, 2018; Gupta et al., 2018a; Wilfred, 2018). Survey evidence suggests that there are tremendous opportunities for HR professionals to adopt AI in their processes and reap the benefits of this advanced technology (Deloitte, 2019; Iqbal, 2018). Entelo, Harver, and HireVue are examples of AI-based recruitment applications used by thousands of organizations (Vedapradha, 2019). Bersin (2017) revealed that around 96% of recruiters believe that AI facilitates talent acquisition and retention significantly, and 13% of HR managers already see evidence of AI becoming a regular part of HR.
By using such software, HR professionals gain advantages in tasks such as viewing and updating employee information, accessing HR business transaction data, team training, automating repetitive low-value tasks, and running the hiring process without human intervention (McGovern et al., 2018; Vardarlier and Zafer, 2020). Hence, these automated recruitment systems are proposed to overcome the problems of manual processing and analysis, saving the time and cost of recruiting the right talent (Tambe et al., 2019; van Esch et al., 2019; Vedapradha, 2019). Organizations use AI (such as data mining, artificial neural networks, expert systems, machine learning, and knowledge-based search engines) for numerous purposes. AI is a time- and cost-saving option for organizations that rapidly replaces tiresome manual processes (Hmoud and Laszlo, 2019; Wilfred, 2018). It has been employed to generate the candidate pool, screen candidates initially, evaluate them, and finally select the best possible ones (Geetha and Bhanu, 2018; Nawaz, 2019); even rejected candidates are tracked as potential future employees. Investment in AI for HR functions can be justified by the reduction in HR professionals' time spent on administrative tasks; by easing the load on shared service centers through automated HR services and answers for routine work; by supporting recruiting, selecting, and retaining; and by reducing bias in decision-making on HR-related issues (Bhalgat, 2019).
Although a good number of studies have examined different recruitment tools in numerous countries (Ball, 2001; Bamel et al., 2014; Boudreau and Cascio, 2017; Kmail et al., 2015; Mehrabad and Brojeny, 2007; Palshikar et al., 2019; Quaosar, 2018), there are few studies on the use of AI technology in talent acquisition; in particular, studies on recruiters' and human resource professionals' acceptance of AI-directed talent acquisition tools are absent in the Bangladesh context. Other estimates, reported in Tambe et al. (2019) on a Wharton School survey, show that 41% of CEOs are not prepared to make proper use of new data analytic tools and only 4% claim that they are "to a large extent" prepared.
In many countries, including Bangladesh, HR professionals use different brands of AI software to conduct their day-to-day activities, such as the HR cloud solution Success Factors (SF) with conversational AI capabilities through IBM Watson, SAP Leonardo ML Foundation, Recast.AI, ServiceNow, and Microsoft Azure/Skype. AI adoption is nevertheless slow in Bangladesh, as in the rest of the world, because of the talent gap, concerns over privacy, ongoing maintenance, integration capabilities, and limited proven applications (McGovern et al., 2018). The existing literature shows that AI is increasingly used for talent recruitment; however, little documented evidence exists on its user acceptance. The research questions in this project were based on the essence of the UTAUT, which blends other fragmented theories such as the diffusion of innovation theory (DOI), the theory of planned behavior (TPB), the social cognitive theory (SCT), and the technology acceptance model (TAM) (Uddin et al., 2020). Hence, this study seeks to answer the following research questions:
RQ1. What factors drive the recruiters/HR professionals to use or implement AI in recruiting
talents in Bangladeshi business organizations?
RQ2. What significance does UTAUT provide in the implementation of AI at home and
abroad?
The study contributes to advancing theory and knowledge by providing new insights and empirical evidence in several ways. First, it attempts to identify the influential factors associated with the adoption of AI-based software for talent acquisition, and the magnitude of their impacts on the adoption of AI technology, using UTAUT (Khanam et al., 2015). We noticed that the majority of studies on the use and implementation of AI based on the UTAUT model were conducted in advanced countries; this study is therefore among the first from a South Asian perspective, particularly in Bangladesh, which will directly and indirectly help to advance prior knowledge. Second, we add two new variables, technology anxiety (TA) and resistance to change (RC), to the original UTAUT model, which may add fresh evidence to previously held knowledge. Finally, we investigate the moderating effect of the age of HR professionals and recruiters, because older adults commonly resist new technology more than younger adults do. Since AI adoption is novel in the Bangladesh context, older adults might be reluctant to adopt AI-based technology to recruit talent. This effect may also provide new dimensions for understanding AI adoption from the age perspective of recruiters and HR professionals.
UTAUT has shown considerable progress compared with other models, accounting for nearly 69% of the variance in behavioral intention (BI), which is the strongest predictor of actual use (AU). However, AI is currently a novel technology in Bangladesh and most HR professionals use AI only to a limited extent in their work. Therefore, considering actual use alone may lead to an imprecise conclusion about HR professionals' AI adoption behavior. The extent of HR professionals' willingness to adopt AI tools can be defined as their BI to use AI in recruiting talent. Specifically, we used the potential factors included in the original UTAUT model along with two additional variables and one moderating variable to provide an accurate picture of AI adoption for recruiting talent by HR professionals in Bangladesh.
2019; Uddin et al., 2020; Wrycza et al., 2017). Therefore, we expect that EE is vital for HR professionals when using AI to recruit talent. Subsequently, we developed the
following hypothesis:
H3. EE influences BI to use AI.
Comparing young adults and old adults, Zajicek and Hall (2000) and Arning and Ziefle (2007) found that older adults display more inertia to change and greater fear of failure than their counterparts because of their lower level of technical efficacy. Based on these prior observations, we expect that the age difference between young adults and old adults moderates the influence of the predictor variables on BI to use AI in recruiting talent. In consequence, the hypotheses are:
H8. The influence of TA on the intention to use AI would be more substantial for young
adults than old adults.
H9. The impact of PE on the intention to use AI would be stronger for young adults than
old adults.
H10. The effect of EE on the intention to use AI would be stronger for young adults than old
adults.
H11. The influence of SI on the intention to use AI would be more substantial for young
adults than old adults.
H12. The impact of RC on the intention to use AI would be stronger for young adults than
old adults.
H13. The influence of FC on the intention to use AI would be stronger for young adults than
old adults (Figure 12.1).
generate accurate replies. We adopted the items for the constructs representing PE, EE, FC, SI, TA, and BI to use from the scales developed and refined by Venkatesh et al. (2003) and Venkatesh et al. (2012). Finally, we adopted items from Rajan and Baral (2015) and Laumer et al. (2016) to measure AU and RC, respectively.
Convergent validity refers to the clustering of a construct's items into the same construct, whereas discriminant validity indicates the construct's distinctiveness from other constructs (Hair et al., 2017a, 2014a; Hair, 2017b). According to Hair (2017b), convergent validity is achieved when a construct's average variance extracted (AVE) exceeds 0.50. Table 12.2 reports that the AVE ranged from 0.656 to 0.882, which demonstrates that the constructs' convergent validity is maintained. We measured discriminant validity using the approach of Fornell and Larcker (1981), which posits that the square root of each construct's AVE needs to be larger than its correlations with the other constructs. Table 12.2 also shows that each diagonal italicized score (the square root of the related construct's AVE) is higher than the scores beneath it. Thus, there is no concern about reliability and validity. A small numerical sketch of these checks is given below.
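For readers who want to reproduce these checks, the following is a minimal numerical sketch of the AVE and Fornell-Larcker computations; the loadings and the construct correlation matrix are hypothetical numbers, not the values reported in Table 12.2.

# AVE (convergent validity) and Fornell-Larcker (discriminant validity) checks.
import numpy as np

# Standardised indicator loadings per construct (hypothetical).
loadings = {
    "PE": [0.84, 0.88, 0.81],
    "EE": [0.79, 0.86, 0.90],
}
# Hypothetical latent-construct correlation matrix (order: PE, EE).
corr = np.array([[1.00, 0.46],
                 [0.46, 1.00]])

# Convergent validity: AVE = mean of squared loadings, should exceed 0.50.
ave = {c: np.mean(np.square(l)) for c, l in loadings.items()}

# Discriminant validity: sqrt(AVE) of each construct should exceed its
# correlations with every other construct.
names = list(loadings)
for i, c in enumerate(names):
    others = np.delete(corr[i], i)
    ok = np.sqrt(ave[c]) > others.max() and ave[c] > 0.5
    print(c, "AVE =", round(ave[c], 3),
          "sqrt(AVE) =", round(np.sqrt(ave[c]), 3),
          "max corr =", round(float(others.max()), 2),
          "OK" if ok else "check")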
The influence of TA on IU among old adults is higher than among young adults. However, the difference between old and young adults is not significant (Δβ = -0.175, p = 0.123). Likewise, the impacts of PE, EE, and SI on IU also differ between old and young adults, but, as with H8, the differences in the influences of PE (Δβ = 0.217, p = 0.122), EE (Δβ = -0.070, p = 0.555), and SI (Δβ = 0.081, p = 0.462) on IU are not significant.
Furthermore, the effect of RC on IU is found to be insignificant among old adults but significant among young adults, and the difference is significant (Δβ = 0.296, p = 0.007). Finally, for H13, the influence of FC on IU also differs between old and young adults, and the difference is significant (Δβ = -0.269, p = 0.021). Therefore, Table 12.4 indicates that hypotheses 8-11 are not supported, whereas hypotheses 12 and 13 are supported.
Table 12.4 Moderating effect of age (old vs. young adults)

                              Old adults                  Young adults                Difference
Hypothesis      Path          β       STER    p-Value     β       STER    p-Value     Δβ       p-Value    Decision
Hypothesis 8    TA -> IU      -0.348  0.089   0.000       -0.174  0.069   0.012       -0.175   0.123      Not supported
Hypothesis 9    PE -> IU       0.249  0.119   0.037        0.032  0.072   0.656        0.217   0.122      Not supported
Hypothesis 10   EE -> IU       0.165  0.089   0.064        0.235  0.078   0.003       -0.070   0.555      Not supported
Hypothesis 11   SI -> IU       0.205  0.081   0.011        0.124  0.077   0.108        0.081   0.462      Not supported
Hypothesis 12   RC -> IU       0.092  0.083   0.268       -0.204  0.064   0.002        0.296   0.007      Supported
Hypothesis 13   FC -> IU       0.155  0.081   0.055        0.424  0.081   0.000       -0.269   0.021      Supported

TA, technology anxiety; PE, performance expectancy; EE, effort expectancy; SI, social influence; RC, resistance to change; FC, facilitating conditions; STER, standard error.
talents, in consequence, no relevance was noticed. However, the results are consistent with the findings of UTAUT studies in other settings (Alam and Uddin, 2019; Kwateng et al., 2019; Rozmi et al., 2019; Uddin et al., 2020; Venkatesh et al., 2003; Wrycza et al., 2017). This study contributes unique insights into AI adoption and its AU, from the conceptualization of UTAUT, in acquiring talent. It also provides an understanding of the processes and mechanisms of AI adoption and its AU by recruiters and HR professionals in recruiting talent.
Using the premise of the UTAUT model, this study examined several factors influencing the adoption and the AU of AI in the context of Bangladesh. The observed results show that as the TA, PE, EE, SI, and FC associated with AI increase, actual users' BI to use AI changes accordingly, which is consistent with the UTAUT model (Asamoah and Andoh-Baidoo, 2018). By contrast, the impact of RC on the BI to use AI is not found to be significant. The HR professionals studied are relatively young (33.40 years on average) and highly educated (79.80% hold master's degrees), which signifies that they have less inertia to change.
We examined the moderating effects of age following the ideation of the original study (Venkatesh et al., 2003). The results of the moderating effects of age associated with AI indicate that age plays an insignificant role regarding the influence of EE, TA, SI, and PE on the BI to use AI, because the age differences among recruiting professionals in Bangladesh are very small. This result is consistent with the findings of Venkatesh et al. (2012) and Laguna and Babcock (1997). Similar to the observations of Arning and Ziefle (2007), Soliman et al. (2019), Kwateng et al. (2019), Venkatesh et al. (2003), and Zajicek and Hall (2000), the study shows that young and old adults demonstrated different results regarding the influence of RC and FC on the BI to use AI (Hall and Mansfield, 1975; Kwateng et al., 2019; Soliman et al., 2019).
References
Aggelidis, V. P., & Chatzoglou, P. D. (2009). Using a modified technology acceptance model in hospitals.
International Journal of Medical Informatics, 78(2), 115–126. doi:10.1016/j.ijmedinf.2008.06.006
Alam, M. S., Uddin, K. M. K., & Uddin, M. A. (2018). End users’ behavioral intention to use an enter-
prise resource planning (ERP) system: An empirical explanation of the UTAUT Model. The Comilla
University Journal of Business Studies, 5(1), 73–86.
Alam, M. S., & Uddin, M. A. (2019). Adoption and implementation of Enterprise Resource Planning
(ERP): An empirical study. Journal of Management and Research, 6 (1), 84–116.
Arning, K., & Ziefle, M. (2007). Understanding age differences in PDA acceptance and performance.
Computers in Human Behavior, 23(6), 2904–2927. doi:10.1016/j.chb.2006.06.005
Asamoah, D., & Andoh-Baidoo, F. K. (2018). Antecedents and outcomes of extent of ERP systems imple-
mentation in the Sub-Saharan Africa context: A panoptic perspective. Communications of the Associ-
ation for Information Systems, 42, 581–601. doi:10.17705/1cais.04222
Audia, P. G., & Brion, S. (2007). Reluctant to change: Self-enhancing responses to diverging per-
formance measures. Organizational Behavior and Human Decision Processes, 102(2), 255–269.
doi:10.1016/j.obhdp.2006.01.007
Azim, M. T., Fan, L., Uddin, M. A., Jilani, M. M. A. K., & Begum, S. (2019). Linking transformational
leadership with employees’ engagement in the creative process. Management Research Review, 42(7),
837–858. doi:10.1108/MRR-08-2018-0286
Ball, K. S. (2001). The use of human resource information systems: A survey. Personnel Review, 30(6),
677–693. doi:10.1108/eum0000000005979
Bamel, N., Bamel, U. K., Sahay, V., & Thite, M. (2014). Usage, benefits and barriers of human resource
information system in universities. VINE: The Journal of Information and Knowledge Management
Systems, 44(4), 519–536. doi:10.1108/VINE-04-2013-0024
Barrane, F. Z., Karuranga, G. E., & Poulin, D. (2018). Technology adoption and diffusion: A new applica-
tion of the UTAUT model. International Journal of Innovation and Technology Management, 15(06),
1950004–1950023. doi:10.1142/s0219877019500044
Bersin, J. (2017). Robotics, AI and Cognitive Computing Are Changing Organizations Even Faster than We
Thought. Forbes.
Bhalgat, K. H. (2019). An Exploration of How Artificial Intelligence Is Impacting Recruitment and Selection
Process. Dublin Business School. Retrieved from https://esource.dbs.ie/handle/10788/3956
Black, J. S., & van Esch, P. (2020). AI-enabled recruiting: What is it and how should a manager use it?
Business Horizons, 63(2), 215–226. doi:10.1016/j.bushor.2019.12.001
Boudreau, J., & Cascio, W. (2017). Human capital analytics: Why are we not there? Journal of Organiza-
tional Effectiveness: People and Performance, 4(2), 119–126. doi:10.1108/JOEPP-03-2017-0021
Brislin, R. W. (1970). Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3),
185–216. doi:10.1177/135910457000100301
Cohen, J. (1977). Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.
Czaja, S. J., Charness, N., Fisk, A. D., Hertzog, C., Nair, S. N., Rogers, W. A., & Sharit, J. (2006).
Factors predicting the use of technology: Findings from the center for research and education on
aging and technology enhancement (create). Psychology and Aging, 21(2), 333–352. doi:10.1037/0882-
7974.21.2.333
Deloitte. (2019). Deloitte’s 2019 Global Human Capital Trends survey. Accessed on July 6, 2020,
Retrieved from https://www2.deloitte.com/content/dam/Deloitte/cz/Documents/human-capital/cz-
hc-trends-reinvent-with-human-focus.pdf
Dennis, M. J. (2018). Artificial intelligence and recruitment, admission, progression, and retention. Enroll-
ment Management Report, 22(9), 1–3. doi:10.1002/emt.30479
Dong, J. Q. (2009). User acceptance of information technology innovations in the Chinese cultural context.
Asian Journal of Technology Innovation, 17 (2), 129–149. doi:10.1080/19761597.2009.9668676
Dwivedi, Y. K., Rana, N. P., Jeyaraj, A., Clement, M., & Williams, M. D. (2019). Re-examining the Uni-
fied Theory of Acceptance and Use of Technology (UTAUT): Towards a revised theoretical model.
Information Systems Frontiers, 21(3), 719–734. doi:10.1007/s10796-017-9774-y
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables
and measurement error. Journal of Marketing Research, 18(1), 39–50. doi:10.1177/002224378101
800104
Geetha, R., & Bhanu, S. R. D. (2018). Recruitment through artificial intelligence: A conceptual study.
International Journal of Mechanical Engineering and Technology, 9(7), 63–70.
Guo, X., Sun, Y., Wang, N., Peng, Z., & Yan, Z. (2013). The dark side of elderly acceptance of preventive
mobile health services in China. Electronic Markets, 23(1), 49–61. doi:10.1007/s12525-012-0112-4
Gupta, P., Fernandes, S. F., & Jain, M. (2018). Automation in recruitment: A new frontier. Journal of
Information Technology Teaching Cases, 8(2), 118–125. doi:10.1057/s41266-018-0042-x
Gupta, S., Misra Subhas, C., Kock, N., & Roubaud, D. (2018). Organizational, technological and extrinsic
factors in the implementation of cloud ERP in SMEs. Journal of Organizational Change Management,
31(1), 83–102. doi:10.1108/JOCM-06-2017-0230
Hair, J., Hollingsworth, C. L., Randolph, A. B., & Chong, A. Y. L. (2017). An updated and expanded
assessment of PLS-SEM in information systems research. Industrial Management & Data Systems,
117 (3), 442–458. doi:10.1108/IMDS-04-2016-0130
Hair Jr, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014). Multivariate Data Analysis: A Global
Perspective. 7 ed. London: Pearson.
Hair Jr., J. F., Hult, G. T. M., Ringle, C. M., & Sarstedt, M. (2017). A Primer on Partial Least Squares
Structural Equation Modeling (PLS-SEM). Los Angeles: Sage Publication.
Hair Jr, J. F., Sarstedt, M., Hopkins, L., & Kuppelwieser, V. G. (2014). Partial least squares structural
equation modeling (PLS-SEM). European Business Review, 26 (2), 106–121. doi:10.1108/ebr-10-2013-
0128
Hall, D. T., & Mansfield, R. (1975). Relationships of age and seniority with career variables of engineers
and scientists. Journal of Applied Psychology, 60(2), 201–210. doi:10.1037/h0076549
Hamet, P., & Tremblay, J. (2017). Artificial intelligence in medicine. Metabolism, 69, S36–S40.
doi:10.1016/j.metabol.2017.01.011
Hasan, M. S., Ebrahim, Z., Mahmood, W. H. W., & Rahmanm, M. N. A. (2018). Factors influencing
enterprise resource planning system: A review. Journal of Advanced Manufacturing Technology, 12(1),
247–258.
Hmoud, B., & Laszlo, V. (2019). Will artificial intelligence take over human resources recruitment and
selection. Network Intelligence Studies, 7 (13), 21–30.
Huy, Q. N. (1999). Emotional capability, emotional intelligence, and radical change. Academy of Manage-
ment Review, 24(2), 325–345. doi:10.5465/amr.1999.1893939
Iqbal, F. M. (2018). Can artificial intelligence change the way in which companies recruit, train, develop
and manage human resources in workplace? Asian Journal of Social Sciences and Management Studies,
5(3), 102–104.
Johansson, J., & Herranen, S. (2019). The application of Artificial Intelligence (AI) in human resource manage-
ment: Current state of AI and its impact on the traditional recruitment process. Bachelor Degree Thesis,
Jönköping University, Sweden.
Khanam, L., Uddin, M. A., & Mahfuz, M. A. (2015). Students’ behavioral intention and acceptance of
e-recruitment system: A Bangladesh perspective. In 12th International Conference on Innovation and
Management, ICIM 2015, 1297–1303. Wuhan, China.
Kmail, A. B., Maree, M., Belkhatir, M., & Alhashmi, S. M. (2015). An automatic online recruitment system
based on exploiting multiple semantic resources and concept-relatedness measures. In 2015 IEEE 27th
International Conference on Tools with Artificial Intelligence (ICTAI), 620–627. IEEE.
Kolbjørnsrud, V., Amico, R., & Thomas, R. J. (2016). How artificial intelligence will redefine manage-
ment. Harvard Business Review, 2, 1–6.
Kwateng, K. O., Atiemo, K. A. O., & Appiah, C. (2019). Acceptance and use of mobile banking: An appli-
cation of UTAUT2. Journal of Enterprise Information Management, 32(1), 118–151. doi:10.1108/JEIM-
03-2018-0055
Laguna, K., & Babcock, R. L. (1997). Computer anxiety in young and older adults: Implications for
human-computer interactions in older populations. Computers in Human Behavior, 13(3), 317–326.
doi:10.1016/S0747-5632(97)00012-5
Latané, B. (1981). The psychology of social impact. American Psychologist, 36 (4), 343–356. doi:10.1037/
0003-066X.36.4.343
Laumer, S., Maier, C., Eckhardt, A., & Weitzel, T. (2016). User personality and resistance to manda-
tory information systems in organizations: A theoretical model and empirical test of dispositional
resistance to change. Journal of Information Technology, 31(1), 67–82. doi:10.1057/jit.2015.17
Lowry, P. B., & Gaskin, J. (2014). Partial least squares (PLS) structural equation modeling (SEM) for build-
ing and testing behavioral causal theory: When to choose it and how to use it. IEEE Transactions on
Professional Communication, 57 (2), 123–146. doi:10.1109/TPC.2014.2312452
Lu, J., Yao, J. E., & Yu, C.-S. (2005). Personal innovativeness, social influences and adoption of wireless
Internet services via mobile technology. The Journal of Strategic Information Systems, 14(3), 245–268.
doi:10.1016/j.jsis.2005.07.003
Mahmood, M., Uddin, M. A., & Luo, F. (2019). Influence of transformational leadership on employ-
ees’ creative process engagement: A multi-level analysis. Management Decision, 57 (3), 741–764.
doi:10.1108/MD-07-2017-0707
McGovern, S. L. P., Vinod, Gill, S., Aldrich, T., Myers, C., Desai, C., Gera, M., & Balasubrama-
nian, V. (2018). The new age: Artificial intelligence for human resource opportunities and functions.
Retrieved from https://www.ey.com/Publication/vwLUAssets/EY-the-new-age-artificial-intelligence-
for-human-resource-opportunities-and-functions/
Mehrabad, M. S., & Brojeny, M. F. (2007). The development of an expert system for effective selection
and appointment of the jobs applicants in human resource management. Computers & Industrial
Engineering, 53(2), 306–312. doi:10.1016/j.cie.2007.06.023
Meuter, M. L., Ostrom, A. L., Bitner, M. J., & Roundtree, R. (2003). The influence of technology anxiety
on consumer use and experiences with self-service technologies. Journal of Business Research, 56 (11),
899–906. doi:10.1016/S0148-2963(01)00276-4
Morris, M. G., & Venkatesh, V. (2000). Age differences in technology adoption decisions: Implications for
a changing work force. Personnel Psychology, 53(2), 375–403.
Nawaz, N. (2019). How far have we come with the study of artificial intelligence for recruitment process.
International Journal of Scientific and Technology Research, 8(7), 488–493.
Niharika Reddy, M., Mamatha, T., & Balaram, A. (2019). Analysis of e-recruitment systems and detecting
e-recruitment fraud. In International Conference on Communications and Cyber Physical Engineering
2018, 411–417. Singapore: Springer.
Nuq, P. A., & Aubert, B. (2013). Towards a better understanding of the intention to use eHealth ser-
vices by medical professionals: The case of developing countries. International Journal of Healthcare
Management, 6 (4), 217–236. doi:10.1179/2047971913Y.0000000033
O’Connor, S. W. (2020). Artificial Intelligence in Human Resource Management. Accessed on July 6,
2020, Retrieved from https://www.northeastern.edu/graduate/blog/artificial-intelligence-in-human-
resource-management/
Oh, J.-C., & Yoon, S.-J. (2014). Predicting the use of online information services based
on a modified UTAUT model. Behaviour & Information Technology, 33(7), 716–729.
doi:10.1080/0144929X.2013.872187
Palshikar, G. K., Srivastava, R., Pawar, S., Hingmire, S., Jain, A., Chourasia, S., & Shah, M. (2019).
Analytics-led talent acquisition for improving efficiency and effectiveness. In A. K. Laha (Ed.),
Advances in Analytics and Applications, 141–160. Singapore: Springer.
Park, H. J., Gardner, T. M., & Wright, P. M. (2004). HR practices or HR capabilities: which mat-
ters? Insights from the Asia Pacific region. Asia Pacific Journal of Human Resources, 42(3), 260–273.
doi:10.1177/1038411104045394
Patel, C., Budhwar, P., Witzemann, A., & Katou, A. (2019). HR outsourcing: The impact on HR’s
strategic role and remaining in-house HR function. Journal of Business Research, 103, 397–406.
doi:10.1016/j.jbusres.2017.11.007
Popa, C. (2011). Adoption of Artificial Intelligence in Agriculture. Bulletin of the University of Agricultural
Sciences and Veterinary Medicine Cluj-Napoca Agriculture, 68(1), 284–293. doi:10.15835/buasvmcn-
agr:6454
Powell, A. L. (2013). Computer anxiety: Comparison of research from the 1990s and 2000s. Computers in
Human Behavior, 29(6), 2337–2381. doi:10.1016/j.chb.2013.05.012
Quaosar, G. M. A. A. (2018). Adoption of human resource information systems in developing countries:
An empirical study. International Business Research, 11(4), 133–141. doi:10.5539/ibr.v11n4p133
Rahi, S., Mansour, M. M. O., Alghizzawi, M., & Alnaser, F. M. (2019). Integration of UTAUT model
in internet banking adoption context. Journal of Research in Interactive Marketing, 13(3), 411–435.
doi:10.1108/JRIM-02-2018-0032
Rahman, M. M., Lesch, M. F., Horrey, W. J., & Strawderman, L. (2017). Assessing the utility of TAM, TPB,
and UTAUT for advanced driver assistance systems. Accident Analysis & Prevention, 108, 361–373.
doi:10.1016/j.aap.2017.09.011
Rajan, C. A., & Baral, R. (2015). Adoption of ERP system: An empirical study of factors influenc-
ing the usage of ERP and its impact on end user. IIMB Management Review, 27 (2), 105–117.
doi:10.1016/j.iimb.2015.04.008
Ramesh, A. N., Kambhampati, C., Monson, J. R. T., & Drew, P. J. (2004). Artificial intel-
ligence in medicine. Annals of the Royal College of Surgeons of England, 86 (5), 334–338.
doi:10.1308/147870804290
Raza, S. A., Shah, N., & Ali, M. (2019). Acceptance of mobile banking in Islamic banks: evidence from
modified UTAUT model. Journal of Islamic Marketing, 10(1), 357–376. doi:10.1108/JIMA-04-2017-
0038
Rhodes, S. R. (1983). Age-related differences in work attitudes and behavior: A review and conceptual
analysis. Psychological Bulletin, 93(2), 328–367. doi:10.1037/0033-2909.93.2.328
Rouhani, S., & Mehri, M. (2018). Empowering benefits of ERP systems implementation: empirical study
of industrial firms. Journal of Systems and Information Technology, 20(1), 54–72. doi:10.1108/JSIT-05-
2017-0038
Rozmi, A. N. A., Bakar, M. I. A., Abdul Hadi, A. R., & Imran Nordin, A. (2019). Investigating the Inten-
tions to Adopt ICT in Malaysian SMEs Using the UTAUT Model. In International Visual Informatics
Conference, 477–487. Cham: Springer.
Sarstedt, M., Henseler, J., & Ringle, C. M. (2011). Multigroup analysis in partial least squares (PLS) path
modeling: Alternative methods and empirical results. Measurement and Research Methods in Interna-
tional Marketing, 22, 195–218.
Slade, E. L., Williams, M. D., & Dwivedi, Y. (2013). An extension of the UTAUT 2 in a healthcare context.
Paper presented at the UK Academy for Information Systems Conference, UK.
Soliman, M. S. M., Karia, N., Moeinzadeh, S., Islam, M. S., & Mahmud, I. (2019). Modelling intention to
use ERP systems among higher education institutions in Egypt: UTAUT perspective. International
Journal of Supply Chain Management, 8(2), 429–440.
Tambe, P., Cappelli, P., & Yakubovich, V. (2019). Artificial intelligence in human resources
management: Challenges and a path forward. California Management Review, 61(4), 15–42.
doi:10.1177/0008125619867910
Tarhini, A., El-Masri, M., Ali, M., & Serrano, A. (2016). Extending the UTAUT model to understand
the customers’ acceptance and use of internet banking in Lebanon: A structural equation modeling
approach. Information Technology and People, 29(4), 23–38.
Uddin, M. A., Alam, M. S., Mamun, A. A., Khan, T.-U.-Z., & Akter, A. (2020). A study of the adoption and
implementation of enterprise resource planning (ERP): Identification of moderators and mediator.
Journal of Open Innovation: Technology, Market, and Complexity, 6 (1), 2–19.
Uddin, M. A., Mahmood, M., & Fan, L. (2019). Why individual employee engagement matters for team
performance? Mediating effects of employee commitment and Organizational citizenship behaviour.
Team Performance Management: An International Journal, 25(1/2), 47–68. doi:10.1108/TPM-12-
2017-0078
Uddin, M. A., Priyankara, H. P. R., & Mahmood, M. (2019). Does a creative identity encourage innovative
behaviour? Evidence from knowledge-intensive IT service firms. European Journal of Innovation
Management. doi:10.1108/EJIM-06-2019-0168
Upadhyay, A. K., & Khandelwal, K. (2018). Applying artificial intelligence: implications for recruitment.
Strategic HR Review, 17 (5), 255–258. doi:10.1108/SHR-07-2018-0051
van Esch, P., Black, J. S., & Ferolie, J. (2019). Marketing AI recruitment: The next phase in job application
and selection. Computers in Human Behavior, 90, 215–222. doi:10.1016/j.chb.2018.09.009
Vardarlier, P., & Zafer, C. (2020). Use of artificial intelligence as business strategy in recruitment pro-
cess and social perspective. In U. Hacioglu (Ed.), Digital Business Strategies in Blockchain Ecosystems:
Transformational Design and Future of Global Business, 355–373. Cham: Springer.
Vedapradha, R., Hariharan, R., & Shivakami, R. (2019). Artificial intelligence: A technological prototype
in recruitment. Journal of Service Science and Management, 12(3), 382–390.
Venkatesh, V. (2000). Determinants of perceived ease of use: Integrating control, intrinsic motivation, and
emotion into the technology acceptance model. Information Systems Research, 11(4), 342–365.
Venkatesh, V., Morris, M. G., Davis, G. B., & Davis, F. D. (2003). User acceptance of information tech-
nology: Toward a unified view. MIS Quarterly, 27 (3), 425–478.
Venkatesh, V., Thong, J. Y., & Xu, X. (2012). Consumer acceptance and use of information technology:
extending the unified theory of acceptance and use of technology. MIS Quarterly, 36 (1), 157–178.
Wilfred, D. (2018). AI in recruitment. NHRD Network Journal, 11(2), 15–18. doi:10.1177/0974173920
180204
Wrycza, S., Marcinkowski, B., & Gajda, D. (2017). The enriched UTAUT model for the acceptance of
software engineering tools in academic education. Information Systems Management, 34(1), 38–49.
doi:10.1080/10580530.2017.1254446
Yi, M. Y., Jackson, J. D., Park, J. S., & Probst, J. C. (2006). Understanding information technology accep-
tance by individual professionals: Toward an integrative view. Information & Management, 43(3),
350–363. doi:10.1016/j.im.2005.08.006
Zajicek, M., & Hall, S. (2000). Solutions for Elderly Visually Impaired People Using the Internet. London:
Springer.
Zhou, J., & Shen, M. (2018). When human intelligence meets artificial intelligence. PsyCh Journal, 7 (3),
156–157. doi:10.1002/pchj.216