
Data-Intensive Research

Parikshit N. Mahalle
Gitanjali R. Shinde
Yashwant S. Ingle
Namrata N. Wasatkar

Data Centric Artificial Intelligence: A Beginner’s Guide
Data-Intensive Research

Series Editors
Nilanjan Dey, Techno International New Town, Kolkata, West Bengal, India
Bijaya Ketan Panigrahi, Indian Institute of Technology Delhi, New Delhi, India
Vincenzo Piuri, University of Milan, Milano, Italy
This book series provides a comprehensive and up-to-date collection of research
and experimental works, summarizing state-of-the-art developments in the fields
of data science and engineering. The trends, technologies, and state-of-the-art
research related to data collection, storage, representation, visualization, processing,
interpretation, analysis, and management are to be covered, along with related
concepts, taxonomy, techniques, designs, approaches, systems, algorithms, tools,
engines, applications, best practices, bottlenecks, perspectives, policies, properties,
practicalities, quality control, usage, validation, workflows, assessment, evaluation,
metrics, and many more.
The series will publish monographs, edited volumes, textbooks and proceedings
of important conferences, symposia and meetings in the field of autonomic and
data-driven computing.
Parikshit N. Mahalle · Gitanjali R. Shinde ·
Yashwant S. Ingle · Namrata N. Wasatkar

Data Centric Artificial Intelligence: A Beginner’s Guide
Parikshit N. Mahalle
Department of Artificial Intelligence and Data Science
Vishwakarma Institute of Information Technology
Pune, Maharashtra, India

Gitanjali R. Shinde
Vishwakarma Institute of Information Technology
Pune, Maharashtra, India

Yashwant S. Ingle
Department of Artificial Intelligence and Data Science
Vishwakarma Institute of Information Technology
Pune, Maharashtra, India

Namrata N. Wasatkar
Department of Computer Engineering
Vishwakarma Institute of Information Technology
Pune, Maharashtra, India

ISSN 2731-555X ISSN 2731-5568 (electronic)


Data-Intensive Research
ISBN 978-981-99-6352-2 ISBN 978-981-99-6353-9 (eBook)
https://doi.org/10.1007/978-981-99-6353-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

Paper in this product is recyclable.


Preface

One, who sees inaction in action, and action in inaction, is
intelligent among all.
—Bhagwad Gita

The deployment of artificial intelligence (AI) and deep learning-based solutions in
computer vision and many other areas has improved outcomes for businesses from a
variety of sectors, such as automotive, electronics, and medical device manufacturing,
when compared to conventional, rules-based implementations. Today’s AI is largely
model-centric: the data is essentially a fixed input, and the majority of development
effort is spent on refining the model. However, there is a need for a data-centric
approach. In a data-centric AI approach, the data serves as the primary object that
is iteratively developed. This means more time is invested in labeling, managing,
slicing, supplementing, and curating the data, while the model itself is kept relatively
static. The adoption of a data-centric strategy has resulted in advances that potentially
make the benefits of AI available to most businesses.
Data-centric AI focuses on comprehending, utilizing, and reaching conclusions
from data. Before becoming data-centric, AI was heavily dependent on rules and
heuristics. These could be helpful in some circumstances, but when applied to fresh
datasets, they frequently produced less-than-ideal outcomes or even errors. By adding
machine learning and big data analytics tools, data-centric AI changes this, enabling
systems to learn from data rather than depending on hand-crafted rules. They can
therefore make wiser choices and deliver more precise outcomes. Additionally, the
approach has the potential to be significantly more scalable than conventional AI
methods. As datasets get bigger and more complicated, data-centric AI will probably
become more and more significant in the future. This book mainly focuses on recent
developments in data-centric AI, how data-centric AI works in various business use
cases, building AI applications with quality data for multidisciplinary domains, and
the importance of the data-centric approach in edge AI and data security.


The key objectives of this book include presenting the need for data-centric AI as
compared to the model-centric approach, the transformation from model-centric to
data-centric AI, methodologies to achieve accurate results by improving the quality
of data, challenges in improving the quality of data-centric models, challenges in
dataset generation, synthetic datasets, and analysis and prediction algorithms in
stochastic ways, etc.
The main characteristics of this book are:
1. Helpful for users to understand the transformation from model-centric to data-
centric AI.
2. A concise and summarized description of all the topics.
3. Use case and scenario-based descriptions.
4. A concise and crisp book for the novice reader, from the introduction through
building basic data-centric AI applications.
5. Numerous examples, technical descriptions, and real-world scenarios.
6. Simple and easy language so that it can be useful to a wide range of stakeholders,
from lay users to educated users, from villages to metros, and from national to
global levels.
In a nutshell, this book puts forward the best research roadmaps, strategies, and
challenges to design and develop data-centric AI applications. The book motivates
the use of technology for better analysis, addressing the needs of both lay and educated
users in designing various use cases in data-centric AI. The book also contributes to
social responsibility by exploring different ways to cater to the requirements of
government and manufacturing communities. The book is useful for undergraduates,
postgraduates, industry, researchers, and research scholars in ICT, AI, and machine
learning, and we are sure that it will be well-received by all stakeholders.

Pune, India Parikshit N. Mahalle


Gitanjali R. Shinde
Yashwant S. Ingle
Namrata N. Wasatkar
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Building Blocks of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 AI Current State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Need for Paradigm Shift from Model-Centric AI
to Data-Centric AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Model-Centric AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Working Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Supervised Machine Learning Algorithms . . . . . . . . . . . . . . . 14
2.2.2 Unsupervised Machine Learning Algorithms . . . . . . . . . . . . . 18
2.2.3 Deep Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Model Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Model Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Use Cases: Model-Centric AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Data-Centric Principles for AI Engineering . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 AI Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


3.4 Data-Centric Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 Mathematical Foundation for Data-Centric AI . . . . . . . . . . . . . . . . . . . . 47
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.1 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.2 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.3 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.4 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.5 Multivariate Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.6 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Statistical Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Data Tendency and Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.1 Data Tendency/Measure of Central Tendency . . . . . . . . . . . . 53
4.3.2 Measure of Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 Data Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Data Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5 Data-Centric AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.1 The Data Acquisition Process . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.2 Key Insights for Big Data Acquisition . . . . . . . . . . . . . . . . . . . 68
5.1.3 Case Study: Data Acquisition for Retail Company . . . . . . . . 69
5.2 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.1 How Does Data Labeling Work? . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.2 Data Labeling Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.3 Importance of Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.4 Case Study: Data Labeling for Autonomous Vehicle
Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Data Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Types of Data Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Case Study on Data Annotation . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4.1 How Does Data Augmentation Work? . . . . . . . . . . . . . . . . . . . 79
5.4.2 Case Study on Data Augmentation . . . . . . . . . . . . . . . . . . . . . 80
5.5 Data Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.1 Case Study on Data Deployment . . . . . . . . . . . . . . . . . . . . . . . 81
5.6 Data-Centric AI Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6.1 Case Study: Predicting Customer Churn
for a Telecommunications Company . . . . . . . . . . . . . . . . . . . . 84
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6 Data-Centric AI in Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Need and Challenges of Data-Centric Approach . . . . . . . . . . . . . . . . 89
6.3 Application Implementation in Data-Centric Approach . . . . . . . . . . 91
6.4 Application Implementation in Model-Centric Approach . . . . . . . . . 91
6.5 Comparison of Model-Centric AI and Data-Centric AI . . . . . . . . . . . 93
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7 Data-Centric AI in Mechanical Engineering . . . . . . . . . . . . . . . . . . . . . . 97
7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.2 Need and Challenges of Data-Centric Approach . . . . . . . . . . . . . . . . 98
7.3 Application Implementation in Data-Centric Approach . . . . . . . . . . 100
7.4 Application Implementation in Model-Centric Approach . . . . . . . . . 102
7.5 Comparison of Model-Centric AI and Data-Centric AI . . . . . . . . . . . 104
7.6 Case Study: Mechanical Tools Classification . . . . . . . . . . . . . . . . . . . 106
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8 Data-Centric AI in Information, Communication
and Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2 Need and Challenges of Data-Centric Approach . . . . . . . . . . . . . . . . 110
8.3 Application Implementation in Data-Centric Approach . . . . . . . . . . 113
8.4 Application Implementation in Model-Centric Approach . . . . . . . . . 115
8.5 Comparison of Model-Centric AI and Data-Centric AI . . . . . . . . . . . 118
8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.2 Research Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
About the Authors

Dr. Parikshit N. Mahalle is a Senior Member of IEEE and is Professor, Dean of Research
and Development, and Head of the Department of Artificial Intelligence and Data Science
at Vishwakarma Institute of Information Technology, Pune, India. He completed his
Ph.D. from Aalborg University, Denmark, and continued as Postdoc Researcher at
CMI, Copenhagen, Denmark. He has 23+ years of teaching and research experience.
He is Ex-Member of the Board of Studies in Computer Engineering and Ex-Chairman
of Information Technology, Savitribai Phule Pune University and various universi-
ties and autonomous colleges across India. He has 15 patents and 200+ research
publications (Google Scholar citations-2900 plus, H index-25, and Scopus Cita-
tions are 1400 plus with H index-18, and Web of Science citations are 438 with
H index-10) and authored/edited 54 books with Springer, CRC Press, Cambridge
University Press, etc. He is Editor-in-Chief for IGI Global–International Journal
of Rough Sets and Data Analysis, Inter-science International Journal of Grid and
Utility Computing; Member in Editorial Review Board for IGI Global–International
Journal of Ambient Computing and Intelligence; and Reviewer for various journals
and conferences of the repute. His research interests are Machine Learning, Data
Science, Algorithms, Internet of Things, Identity Management, and Security. He is
guiding eight Ph.D. students in the area of IoT and machine learning, and six students
have successfully defended their Ph.D. under his supervision from SPPU. He is also
the recipient of “Best Faculty Award” by Sinhgad Institutes and Cognizant Tech-
nologies Solutions. He has delivered 200+ lectures at the national and international
levels.

Dr. Gitanjali R. Shinde has 15 years of overall experience and is presently working as
Head and Associate Professor in the Department of Computer Science and Engineering
(AI and ML), Vishwakarma Institute of Information Technology, Pune, India. She
has done Ph.D. in Wireless Communication from CMI, Aalborg University, Copen-
hagen, Denmark, on Research Problem Statement “Cluster Framework for Internet
of People, Things and Services”—Ph.D. awarded on May 8, 2018. She obtained
M.E. (Computer Engineering) degree from the University of Pune, Pune, in 2012
and B.E. (Computer Engineering) degree from the University of Pune, Pune, in 2006.


She has received research funding for the project “Lightweight group authentication
for IoT” by SPPU, Pune. She has presented a research article in the World Wireless
Research Forum (WWRF) meeting, Beijing, China. She has published 50+ papers
in national and international conferences and journals. She is the author of 10+ books
with the publishers Springer and CRC Taylor & Francis Group, and she is also an editor
of books. Her book “Data Analytics for Pandemics: A COVID-19 Case Study” was
awarded outstanding book of the year 2020.

Mr. Yashwant S. Ingle is presently working at VIIT, Pune, as Assistant Professor in
the Department of Artificial Intelligence and Data Science. The author has 12 years’
experience in teaching and three years’ industry experience. The author is pursuing
his Ph.D. from Savitribai Phule Pune University, completed his M.Tech. CSE from
Visvesvaraya National Institute of Technology, Nagpur, and B.E. CSE from Amravati
University. Author has one US patent published, 15 Indian utility patents published,
three design patents granted, three important software copyrights, and two literary
research copyrights registered. He also has 25+ publications in Scopus and Web
of Science journals, IEEE, and Springer international conferences and received
four best paper awards in RACE National Conference 2021. The author is Life
Member of ISTE, the VNIT Alumni Association, and Member of ACM. The author
regularly teaches the AI honors course and the regular AI course, soft computing, and
optimization algorithms to third-year and final-year Computer Engineering students.
The author has undergone rigorous training in FDPs and professional courses on AI
subject matter.

Dr. Namrata N. Wasatkar has 10 years of overall experience and is presently working as
Assistant Professor in the Department of Computer Engineering, Vishwakarma Institute
of Information Technology, Pune, India. She has done Ph.D. in Computer Engi-
neering from Savitribai Phule Pune University, Pune, India, on Research Problem
Statement “Rule based Machine translation of simple Marathi sentences to English
sentences”—Ph.D. awarded on November 17, 2022. She obtained M.E. (Computer
Engineering) degree from the University of Pune, Pune, in 2014 and B.E. (Computer
Engineering) degree from the University of Pune, Pune, in 2012. She has received
research funding for the project “SPPU online chatbot” by SPPU, Pune. She has
published 10+ papers in national and international conferences and journals.
Chapter 1
Introduction

1.1 Building Blocks of AI

Artificial Intelligence (AI) is a multidisciplinary field that aims to create intelligent
systems which are able to perform tasks that typically require human intelligence.
These systems rely on a variety of building blocks that work together to enable AI
capabilities. In this explanation, we will explore the fundamental building blocks of
AI, including machine learning (ML), Natural Language Processing (NLP), computer
vision, and robotics [1] (Fig. 1.1).
. Machine Learning: Machine learning is a key component of AI that focuses on
the development of algorithms and models that allow computers to learn from
and make predictions or take decisions based on data. It enables systems to auto-
matically improve their performance through experience without being explicitly
programmed. Machine learning algorithms can be classified into three types:
supervised learning, unsupervised learning, and reinforcement learning. Super-
vised learning involves training models using labeled examples, while unsuper-
vised learning discovers patterns and structures in unlabeled data. Reinforcement
learning is a technique where an agent learns to interact with an environment by
receiving punishments or rewards based on its actions [1].
. Natural Language Processing (NLP): Natural Language Processing enables
computers to understand, interpret, and generate human language. It involves
the development of models and algorithms that can process and analyze text,
speech, and other forms of natural language data. NLP techniques include tasks
like language translation, sentiment analysis, text summarization, and question-
answering systems. NLP algorithms utilize techniques from machine learning,
computational linguistics, and linguistics to extract meaning and context from
human language [2].
. Computer Vision: Computer vision focuses on enabling computers to gain high-
level understanding from visual data, such as images and videos. It involves the
development of models and algorithms that can analyze, interpret, and extract
information from visual inputs. Computer vision tasks include image segmen-
tation, image classification, object detection, and facial recognition. These tasks
are achieved through techniques like image feature extraction, pattern recogni-
tion, and deep learning algorithms, which mimic the structure and function of the
human visual system [3].

Fig. 1.1 Building block of AI
. Robotics: Robotics is a field that combines AI, mechanical engineering, and elec-
tronics to create physical machines (robots) capable of interacting with the phys-
ical world. AI plays a vital role in robotics by providing intelligent capabilities
to robots, allowing them to perceive, reason, and make decisions autonomously.
AI-powered robots can perform complex tasks such as autonomous navigation,
object manipulation, and human–robot interaction. These robots utilize sensor
data, machine learning algorithms, and control systems to sense and understand
the environment, plan actions, and execute them with precision [4].
. Data: Data is a fundamental building block of AI. The performance and effec-
tiveness of AI systems heavily rely on the quality, quantity, and diversity of the
data they are trained on. Large volumes of labeled data are often required to
train machine learning models effectively. Data collection methods, data prepro-
cessing, and data augmentation techniques have a crucial role to play in preparing
and curating the data for AI applications. Additionally, the ethical consider-
ations surrounding data privacy and bias are essential to ensure fairness and
accountability in AI systems [5].
. Algorithms and Models: Algorithms and models are the mathematical and compu-
tational frameworks that underpin AI systems. They define the behavior and func-
tionality of AI systems. Machine learning algorithms, like deep neural networks,
support vector machines, and decision trees, form the backbone of many AI appli-
cations. These algorithms learn from data to make predictions, classify inputs, or
optimize decisions. Models represent the learned knowledge of the algorithms and
can be used for inference and prediction. Model architectures, hyperparameter
tuning, and optimization techniques are crucial in building accurate and efficient
AI systems [5].
. Computing Power: The computational power of modern hardware infrastruc-
ture, such as Graphics Processing Units (GPUs) and specialized AI accelera-
tors, has played a significant role in the advancement of AI. These hardware
components enable the efficient training and execution of complex AI models.
High-performance computing clusters and cloud-based infrastructures provide
the computational resources required for large-scale AI projects. Furthermore,
advancements in distributed computing and parallel processing have accelerated
the development and deployment of AI systems [6].
. Ethical Considerations: As AI continues to advance, ethical considerations
become increasingly important. Ensuring that AI systems are fair, transparent,
and unbiased is crucial to avoid perpetuating existing social biases or discrimi-
nations. Ethical AI frameworks promote accountability, privacy protection, and
the responsible use of AI technologies. The development and deployment of AI
systems should consider the potential impact on society, human rights, and indi-
vidual privacy. Collaboration between policymakers, AI researchers, and industry
stakeholders is necessary to establish ethical guidelines and regulations for the
responsible development and use of AI [6].
The building blocks of artificial intelligence encompass machine learning, NLP,
computer vision, robotics, data, algorithms and models, computing power, and ethical
considerations. These components form the foundation of AI systems and enable the
development of intelligent applications that can perceive, reason, and make decisions
in various domains. By understanding these building blocks, we can appreciate the
complexity and potential of AI and its impact on society.

1.2 AI Current State

As of 2023, significant progress and advancements in AI have been made across
various domains. Here are some key highlights:
. Machine Learning and Deep Learning: Machine learning techniques, particularly
deep learning, have seen tremendous growth and success. Deep neural networks
have performed remarkably well in tasks such as image recognition, object detection,
NLP, and speech synthesis. Pretrained models, like OpenAI’s GPT-3, have show-
cased the ability to generate coherent and contextually relevant text. Transfer
learning, where models trained on one task can be fine-tuned for another, has
become a popular approach for leveraging existing knowledge [6].
. Natural Language Processing (NLP): NLP has made significant strides in
understanding and generating human language. Chatbots and virtual assistants
have become increasingly sophisticated, providing more accurate and context-
aware responses. Language translation systems, such as Google Translate, have
improved, although challenges in capturing subtle nuances and context remain.
Sentiment analysis algorithms can analyze social media data to gauge public
opinion and trends [7].
. Computer Vision: Computer vision technologies have advanced, enabling
machines to perceive and understand visual information. Object detection and
recognition algorithms have become more accurate and robust, contributing to
applications like autonomous vehicles, surveillance systems, and augmented
reality. Facial recognition technology has gained widespread use in various
industries, raising privacy and ethical concerns [7].
. Robotics and Automation: AI has had a significant impact on robotics and
automation. Robots equipped with AI capabilities are being employed in manu-
facturing, logistics, healthcare, and domestic settings. Collaborative robots, or
cobots, can work safely alongside humans, enhancing productivity and efficiency.
Autonomous drones are being used for delivery services, inspection, and moni-
toring. AI-powered robotic systems are being developed for complex tasks like
surgery and caregiving [7].
. AI in Healthcare: AI applications in healthcare have seen promising advance-
ments. Machine learning algorithms can assist in diagnosing diseases, analyzing
medical images, predicting patient outcomes, and designing personalized treat-
ment plans. Natural Language Processing is being used to extract valuable
information from medical records and scientific literature, aiding in research
and clinical decision-making. AI-enabled wearable devices and remote moni-
toring systems are improving patient care and enabling early detection of health
issues [7].
. AI in Finance: AI technologies are increasingly utilized in the financial sector.
Fraud detection algorithms can identify suspicious transactions, while machine
learning models help predict market trends and optimize trading strategies. Robo-
advisors leverage AI algorithms to provide personalized investment advice to indi-
viduals. Natural Language Processing is used for sentiment analysis of financial
news and reports [7].
. Ethical Considerations: As AI becomes more pervasive, ethical considerations
have gained attention. Issues such as bias in data, transparency, privacy, and
accountability are being addressed to ensure the responsible development and use
of AI technologies. Regulatory frameworks and guidelines are being developed
to govern AI applications and protect individual rights [8].
While AI has made significant progress, challenges and limitations remain. AI
systems can be susceptible to adversarial attacks, and ensuring fairness and mitigating
bias in algorithms and data is an ongoing concern. Continued research, collaboration,
and responsible deployment of AI technologies are crucial to maximize their benefits
while addressing ethical, societal, and technical challenges.

1.3 Motivation

The motivation for data-centric AI stems from the recognition that data is a crucial
and valuable asset in building effective and accurate AI systems. Data-centric AI
approaches prioritize the collection, curation, and utilization of high-quality data to
drive AI model development and decision-making. Here are some key motivations
for adopting a data-centric approach in AI:
. Performance Improvement: Data-centric AI recognizes that the performance of
AI models heavily depends upon the quality and quantity of the data they are
trained on. By focusing on collecting diverse and representative data, AI models
can better generalize to real-world scenarios, leading to improved accuracy and
reliability in their predictions and decisions [9].
. Addressing Bias and Fairness: Biases present in data can be reflected in AI models,
potentially leading to discriminatory or unfair outcomes. A data-centric approach
allows for the identification and mitigation of biases by thoroughly analyzing the
data, ensuring proper representation and inclusivity across various demographics.
By prioritizing unbiased and fair data, AI systems have a higher chance of making
equitable decisions [9].
. Robustness and Generalization: AI models trained on limited or biased data
may exhibit poor performance in unseen or unfamiliar situations. By adopting
a data-centric approach, AI developers can collect and incorporate a wide range
of relevant data, enabling models to better understand and adapt to different
contexts. This leads to increased robustness, generalization, and the ability to
handle variations and edge cases [9].
. Domain-Specific Expertise: Data-centric AI recognizes the value of domain-
specific knowledge and expertise embedded in data. By leveraging domain knowl-
edge, including expert annotations and labels, AI models can capture intri-
cate patterns and relationships specific to the target application. This exper-
tise enhances the accuracy and effectiveness of AI systems in specialized
domains [9].
. Real-Time Adaptation: Data-centric AI systems are designed to continuously
learn and adapt in real time. By collecting and analyzing new data as it becomes
available, AI models can update and refine their understanding, allowing them to
stay up-to-date with evolving environments and user needs. This adaptability is
particularly important in dynamic and rapidly changing domains [9].
. Enhanced User Experience: A data-centric approach in AI aims to improve
the user experience by leveraging data-driven insights. By understanding
user preferences, behavior, and context, AI systems can provide personalized
and tailored recommendations, services, and interactions. This leads to more
engaging and satisfying user experiences, ultimately increasing user adoption and
satisfaction [9].
. Decision Support and Insights: Data-centric AI empowers decision-making by
providing actionable insights and recommendations based on data analysis. By
leveraging large datasets, AI systems can identify patterns, correlations, and
trends that may not be readily apparent to humans. These insights can inform
strategic business decisions, optimize processes, and unlock new opportunities
for innovation and growth [9].
Hence, the motivation for data-centric AI is driven by the recognition that high-
quality and diverse data is essential for developing accurate, unbiased, and robust
AI systems. By prioritizing data collection, curation, and utilization, data-centric
approaches aim to improve performance, address bias and fairness concerns, enhance
user experiences, and provide valuable decision support and insights in various
domains.

1.4 Need for Paradigm Shift from Model-Centric AI to Data-Centric AI

The paradigm shift from model-centric AI to data-centric AI represents a change
in focus in the development and deployment of artificial intelligence systems. In
the traditional model-centric approach, the emphasis was primarily on designing
complex algorithms and models to solve specific tasks. However, with the increase
in the availability of data and advancements in machine learning, the data-centric
approach has gained prominence. Let’s explore this shift in more detail:
Model-Centric AI: In the model-centric paradigm, the primary focus is on
designing and optimizing sophisticated algorithms and models. This approach often
involves extensive manual feature engineering, where domain experts identify and
handcraft relevant features for the model. The model is then trained on the avail-
able data, typically with a specific task or problem in mind. The goal is to create a
highly performant and accurate model by refining its architecture and optimizing its
parameters.
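To make this model-centric loop concrete, here is a minimal, hedged sketch in which the dataset is treated as a fixed input and all effort goes into searching over the model's hyperparameters. It uses scikit-learn, which this chapter does not prescribe; the iris dataset, the SVM model, and the parameter grid are assumptions chosen purely for illustration.

# Model-centric loop in miniature: the data stays fixed while effort is spent
# refining the model. scikit-learn, the dataset, and the grid below are
# illustrative assumptions, not part of the book's text.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # the data is treated as a fixed input

# All refinement happens on the model side: kernel choice and regularization strength.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))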
Limitations of Model-Centric AI:
While model-centric AI has been successful in many applications, it has several
limitations:
. Dependence on Labeled Data: Model-centric approaches often require large
amounts of labeled data for training. Labeled data is laborious to prepare,
time-consuming, and very expensive to obtain, especially for complex tasks or
specialized domains. This reliance on labeled data can limit the scalability and
applicability of AI systems [9].
. Generalization Challenges: Models developed using a model-centric approach
may struggle to generalize well to unseen or unfamiliar data. The models are
typically optimized for the specific training data and may not capture the full
complexity of real-world scenarios. As a result, they may exhibit poor performance
in practical situations [9].
. Lack of Adaptability: Model-centric AI systems can be static and inflexible,
unable to adapt to evolving data distributions or changing environments. When
faced with new data or unexpected scenarios, these models may require significant
retraining or modifications, making them less agile and responsive.
Data-Centric AI: Data-centric AI, on the other hand, places the primary emphasis
on the quality, diversity, and quantity of data. Instead of relying solely on complex
models, data-centric approaches recognize that more significant improvements in
AI can be achieved by leveraging vast amounts of relevant data.
Key aspects of Data-Centric AI include:
. Data Collection and Curation: Data-centric AI prioritizes the collection of compre-
hensive and diverse datasets. It involves strategies for acquiring relevant data,
including crowd-sourcing, data partnerships, and collaborations. Data curation
techniques, such as data cleaning, labeling, and augmentation, ensure the quality
and usability of the collected data (a brief illustrative sketch of such curation steps
follows this list).
. Training on Large-Scale Data: Data-centric AI leverages large-scale datasets to
train models. With more data, models can capture complex patterns and variations,
improving their generalization capabilities. Techniques like transfer learning,
where pretrained models are fine-tuned on specific tasks, reduce the need for
extensive labeled data.
. Continuous Learning and Adaptation: Data-centric AI systems are designed to
continuously learn from new data. They can adapt and update their models in real
time as more information becomes available. This enables them to stay current
with changing data distributions and user preferences.
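To give a flavor of what such curation steps can look like in practice, here is a small, hedged sketch using pandas and NumPy; the column names, the toy records, and the noise-based augmentation are assumptions made only for illustration, since real curation pipelines are domain-specific.

# Illustrative data-curation sketch: basic cleaning and a simple augmentation step.
# The column names and the noise-based augmentation are assumptions for illustration.
import numpy as np
import pandas as pd

# A tiny raw dataset with typical quality problems (a missing value, a duplicate row).
raw = pd.DataFrame({
    "age":    [25, 32, None, 40, 40],
    "income": [40000, 52000, 61000, 75000, 75000],
    "label":  [0, 1, 1, 1, 1],
})

# Cleaning: drop exact duplicates and fill the missing age with the median.
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Augmentation: create extra training rows by adding small Gaussian noise
# to the numeric features (one simple way to enlarge a small dataset).
rng = np.random.default_rng(0)
augmented = clean.copy()
augmented[["age", "income"]] += rng.normal(0, [1.0, 500.0], size=(len(clean), 2))

curated = pd.concat([clean, augmented], ignore_index=True)
print(curated)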
Benefits of Data-Centric AI: The shift toward data-centric AI offers several
advantages:
. Improved Accuracy and Performance: By leveraging vast and diverse datasets,
data-centric AI can achieve higher accuracy and performance compared to model-
centric approaches. The models are more representative of real-world scenarios,
enabling better generalization.
. Flexibility and Adaptability: Data-centric AI systems can adapt to new data distri-
butions and evolving environments, making them more versatile and adaptable.
They can handle variations, edge cases, and new contexts more effectively.
. Reduced Dependency on Manual Feature Engineering: Data-centric AI reduces
the reliance on manual feature engineering, allowing the models to automatically
learn relevant features directly from the data. This saves time and effort in devel-
oping complex models and enables the system to discover hidden patterns and
insights.
. Scalability: With the availability of large-scale datasets, data-centric AI can scale
to handle complex tasks and specialized domains. It enables AI systems to tackle
real-world problems that require a significant amount of data for training.
Hence, as we can understand, the shift from model-centric AI to data-centric AI
represents a move toward leveraging large-scale, diverse, and high-quality data to
drive the development and performance of AI systems. By prioritizing data collection,
curation, and utilization, data-centric approaches overcome the limitations of model-
centric AI and enable more accurate, adaptable, and scalable AI systems.

1.5 Summary

The paradigm shift from model-centric AI to data-centric AI represents a significant
transformation in the approach to artificial intelligence. In the traditional model-
centric paradigm, the focus was primarily on designing complex algorithms and
models to solve specific tasks. However, with the growing availability of data and
advancements in machine learning, the data-centric approach has gained prominence.
This shift recognizes the crucial role of high-quality data in building effective and
accurate AI systems.
The model-centric AI approach relied heavily on manual feature engineering,
where domain experts identified and handcrafted relevant features for the models.
These models were then trained on the available data, typically with a specific task
or problem in mind. The goal was to create highly performant and accurate models
by refining their architectures and optimizing their parameters. While this approach
achieved success in many applications, it had limitations that prompted the shift to
data-centric AI.
One of the main motivations for adopting a data-centric AI approach is the under-
standing that the performance of AI models heavily relies on the quality and quan-
tity of data they are trained on. Data-centric AI recognizes that more significant
improvements in AI can be achieved by leveraging vast amounts of relevant data.
This approach prioritizes the collection, curation, and utilization of high-quality data
to drive AI model development and decision-making [6–9].
Data-centric AI involves several key aspects. First, it emphasizes the collection and
curation of comprehensive and diverse datasets. Strategies such as crowd-sourcing,
data partnerships, and collaborations are employed to acquire relevant data. Data
curation techniques, including data cleaning, labeling, and augmentation, ensure the
quality and usability of the collected data.
In the data-centric paradigm, models are trained on large-scale datasets to capture
complex patterns and variations. With more data, models can better generalize to real-
world scenarios, leading to improved accuracy and reliability in their predictions and
decisions. Techniques like transfer learning, where pretrained models are fine-tuned
on specific tasks, reduce the need for extensive labeled data.
Another significant aspect of data-centric AI is its focus on continuous learning
and adaptation. Data-centric AI systems are designed to continuously learn from new
data, updating and refining their models in real time. This adaptability enables the
systems to stay current with evolving data distributions and user needs, making them
more responsive to changing environments.
The shift to data-centric AI offers several benefits. Firstly, it improves the
accuracy and performance of AI systems by leveraging vast and diverse datasets.
Models developed through data-centric approaches are more representative of real-
world scenarios, enabling better generalization. Additionally, data-centric AI systems
exhibit flexibility and adaptability, allowing them to handle variations, edge cases,
and new contexts more effectively.
Moreover, the data-centric approach reduces the dependence on manual feature
engineering, enabling models to automatically learn relevant features directly from
the data. This saves time and effort in developing complex models and allows the
system to discover hidden patterns and insights.
The scalability of data-centric AI is another advantage. With the availability
of large-scale datasets, data-centric AI can tackle complex tasks and specialized
domains that require a significant amount of data for training.
However, the paradigm shift from model-centric to data-centric AI also brings
challenges. The availability of high-quality data remains a crucial factor, and data
privacy and security concerns need to be addressed. Handling biases and ensuring
fairness in the data and models is another challenge that requires careful consid-
eration. Additionally, the computational and storage requirements for large-scale
datasets pose technical challenges that need to be overcome.
So to summarize the shift from model-centric AI to data-centric AI represents
a fundamental change in focus, recognizing the critical role of high-quality data
in building effective AI systems. By prioritizing data collection, curation, and
utilization, data-centric AI approaches overcome the limitations of model-centric
AI and enable more accurate, adaptable, and scalable AI systems. While challenges
remain, the ongoing development and responsible deployment of data-centric AI will
continue to drive advancements and unlock the full potential of artificial intelligence.

References

1. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://
doi.org/10.1038/nature14539
2. Caruana, R. (2006). Empirical methods for AI and machine learning. Carnegie Mellon
University, School of Computer Science, Technical Report CMU-CS-06-181
3. Zou, J., & Schiebinger, L. (2018). AI can be biased: Here’s what we should do about it. Science,
366(6464), 421–422. https://doi.org/10.1126/science.aaz1107
4. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., et al. (2019).
Model cards for model reporting. In Proceedings of the conference on fairness, accountability,
and transparency (pp. 220–229). https://doi.org/10.1145/3287560.3287596
5. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., &
Crawford, K. (2020). Datasheets for datasets. arXiv preprint arXiv:1803.09010
6. Wu, L., Wu, H., Wu, J., & Wei, Y. (2020). Towards data-centric AI. In Proceedings of the IEEE/
CVF conference on computer vision and pattern recognition (pp. 4388–4397). https://doi.org/
10.1109/CVPR42600.2020.00445
7. Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications,
19(2), 171–209. https://doi.org/10.1007/s11036-013-0489-0
8. Li, Y., Li, J., Duan, Y., Zhang, S., Huang, T., & Gao, W. (2020). A survey on deep learning for
big data. Information Fusion, 57, 42–56. https://doi.org/10.1016/j.inffus.2019.10.003
9. Chandre, P., Mahalle, P., & Shinde, G. (2022). Intrusion prevention system using convolutional
neural network for wireless sensor network. IAES International Journal of Artificial Intelligence,
11(2), 504–515.
Chapter 2
Model-Centric AI

2.1 Working Principle

Machine learning is a way for computers to learn to perform tasks without explicit
programming. It is like teaching a child something new: we show them how to do
it and ask them to repeat it. Depending on how the child performs the task, we give
them suggestions.
When it comes to machine learning, computers are provided with massive amounts
of data and taught to detect patterns. Once a pattern is detected, estimations are made
on the data. In other words, if you want to teach a computer how to recognize images
of dogs, you need to show it a large number of images of dogs and label them as
“dogs.” By doing so, the computer grasps the fundamental characteristics of dog
images, including their fur, ears, and eyes. Once these features are extracted, they can
be utilized to recognize novel images of dogs that have never been seen before.
This concept can be implemented in many existing domains, like stock markets
(to make stock predictions), hospitals (to assist in disease diagnosis), weather fore-
casting, etc. If we provide sufficient amounts of data to computers, then computers
will be able to make good predictions. In some cases, computers can even surpass
humans. Nonetheless, the quality of predictions will depend on two things: first, the
quality of the data used in training, and second, the algorithm used for the analysis.
Machine learning does not need manual programming. Computers can learn and
perform better through experience [1, 2].
In traditional computer programming, developers write precise instructions for
computers to follow. However, in machine learning, computers use algorithms to
analyze data and acquire information. As a result, machine learning systems can
identify various patterns in the data and make predictions.

The machine learning techniques can be classified into three categories as follows:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
In supervised machine learning, the model is trained on labeled data. Unsupervised
learning involves training on unlabeled data and requires that the algorithm identify
the underlying patterns in the data. In reinforcement learning, training relies on
feedback from the environment.
Machine learning applications are deployed in various fields, such as healthcare,
weather prediction, finance, transport, retail, etc. Machine learning has the ability to
automate processes and enhance accuracy [3, 4].

2.1.1 Supervised Learning

Supervised learning involves teaching computers using labeled data, where the
correct answer is already known. The computer is given input data and corresponding
output or target variables [5, 6]. The aim is to design a model that can predict the
correct output for new input data.
The process of supervised machine learning has two important phases:
1. Training
2. Testing
In the training process, the computer utilizes labeled data to understand the corre-
lations and patterns between input data and output variables. The objective is to
develop a model that can provide precise predictions for fresh input data.
A different labeled dataset that wasn’t used during training is used to test the model
after it has been trained. This aids in assessing how well the model generalizes to
new and untested data. The ultimate aim is to create a model that can make precise
predictions not only on the data used for training but also on new and unfamiliar
data.
Applications of supervised machine learning can be seen in speech recognition,
image classification, fraud detection, etc. When you have large amounts of labeled
data available, supervised learning can be very helpful in these situations. Supervised
learning can also work with small datasets. The quality and quantity of data are the
biggest deciding factors in the performance of supervised machine learning models.
Along with data quality and quantity, the selection of the algorithm and hyperparameter
tuning are also important.
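To make the training and testing phases concrete, the short sketch below fits a model on one split of a labeled dataset and evaluates it on held-out data. It relies on scikit-learn, which the chapter does not mandate; the wine dataset and the decision-tree model are assumptions chosen only for illustration.

# Minimal supervised learning sketch: train on labeled data, test on unseen data.
# scikit-learn, the wine dataset, and the decision tree are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: inputs X with known output labels y.
X, y = load_wine(return_X_y=True)

# The training phase uses one portion of the data; testing uses data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)              # learn patterns from labeled examples

y_pred = model.predict(X_test)           # predict labels for new, unseen inputs
print("Test accuracy:", accuracy_score(y_test, y_pred))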

2.1.2 Unsupervised Learning

Unsupervised machine learning is a form of machine learning that uses unlabeled
data to detect patterns or relationships in the underlying data. Unlike supervised
learning, which relies on labeled output values, unsupervised learning tends to
identify underlying patterns in the data without any prior knowledge.
An unsupervised learning algorithm analyzes the data without any supervision or
guidance. The algorithm identifies patterns, relationships, and clusters in the data
based on its inherent properties. Clustering is a popular unsupervised learning tech-
nique that groups similar data points in the same cluster and separates dissim-
ilar ones into different clusters. Other unsupervised learning techniques include
dimensionality reduction, anomaly detection, and association rule learning.
Unsupervised learning is frequently applied in exploratory data analysis to acquire
an understanding of the data structure and detect concealed patterns or relationships.
It is employed in multiple applications, including customer profiling, market segmen-
tation, and image and text analysis, as well as recommendation systems. However,
evaluating the algorithm’s performance can be more difficult in unsupervised learning
because there are no labeled output values, unlike supervised learning [5].
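As a concrete illustration of clustering on unlabeled data, the sketch below groups synthetic points with k-means; scikit-learn, the generated data, and the choice of three clusters are assumptions made only for this example.

# Minimal clustering sketch: grouping unlabeled data with k-means.
# scikit-learn and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: no output values are provided.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ask k-means to discover three groups purely from the data's structure.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", np.bincount(labels))
print("Cluster centers:\n", kmeans.cluster_centers_)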

2.1.3 Reinforcement Learning

Reinforcement learning is a technique in which a computer program learns to make
decisions by trial and error. The program is rewarded when it makes the correct
decision, and over time, the computer learns to make better decisions. It’s similar
to training a puppy to do tricks
by rewarding it with treats. In reinforcement learning, the program is like a puppy
that receives rewards for making the right moves in a virtual environment. By using
this technique, the program can learn which actions lead to the most rewards and
make better decisions in the future [7, 8].
Computers have been trained using this method of learning to operate robots,
play video games, and even make financial judgments. It is a potent tool for building
intelligent systems that can learn and adjust to changing circumstances.
In the reinforcement learning subfield of machine learning, an agent is taught to
make decisions by being rewarded. The agent picks up new skills through trial
and error and receives feedback from its environment in the form of rewards or
penalties. The primary goal of reinforcement learning is to teach the agent how to
behave in a way that maximizes long-term cumulative rewards.
The agent learns by interacting with the environment in a sort of machine learning
called reinforcement learning. The reward an agent receives for each action influences
the decisions it makes. Contrary to supervised learning, in which the agent is trained
on labeled data, or unsupervised learning, in which the agent learns from unstructured
data, reinforcement learning does not provide explicit instructions on how to carry
out a task.

Reinforcement learning can be used to train agents to play games, control bots,
assist in financial decision-making, etc. It is an effective tool for forming intelligent
systems that could examine and adapt to new situations.

2.2 Learning Methods

2.2.1 Supervised Machine Learning Algorithms

. Linear regression

Linear regression is a popular supervised machine learning algorithm that allows predicting a continuous numerical output based on one or several input variables.
The algorithm looks for the optimal linear relationship between the input and output
variables. Simple linear regression is the most basic form of this algorithm that
deals with only one input variable. It tries to find the best-fitting line on a 2D plot,
where the x-axis represents the input variable, and the y-axis represents the output
variable [9].
The equation of the line is given as follows:

y = mx + b.

where m is the slope of the line and b is the y-intercept.


In multiple linear regression, as the name suggests there are multiple input vari-
ables. Linear regression algorithm finds the best hyperplane (high dimension) for the
data points. A hyperplane is defined by an equation of the form

y = b + m1x1 + m2x2 + · · · + mnxn.

where b is the y-intercept and m1, m2, …, mn are the coefficients of the input variables x1, x2, …, xn.
Linear regression is a commonly used technique to estimate values, including
house prices, stock prices, and sales by analyzing past data. It can also help in
feature selection by checking the significance of each input variable through regres-
sion coefficients. However, it’s important to note that linear regression has certain
assumptions about the data, such as linearity, homoscedasticity, and normality of
residuals. If these assumptions are violated, it can adversely affect the accuracy of
the model, and therefore, it’s essential to consider them while interpreting the results.
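As a quick illustration, the following sketch fits a simple linear regression with scikit-learn; the feature and the numbers are hypothetical, and the snippet assumes scikit-learn and NumPy are available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house area (square feet) versus sale price.
X = np.array([[750], [900], [1100], [1300], [1600]])        # input variable
y = np.array([150000, 175000, 210000, 245000, 290000])      # output variable

model = LinearRegression()
model.fit(X, y)  # learns the slope (m) and intercept (b) of the best-fitting line

print("slope m:", model.coef_[0])
print("intercept b:", model.intercept_)
print("predicted price for 1000 sq ft:", model.predict([[1000]])[0])
```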
. Logistic regression
The outcome (dependent variable) of logistic regression is binary. It means it can have one of two values, such as yes/no or 1/0. Logistic regression can have one or more independent variables.

Logistic regression involves modeling the dependent variable as a function of independent variables, utilizing the logistic function (also called the sigmoid func-
tion). The logistic function maps real-valued input to a value between 0 and 1. It
represents the probability of the dependent variable having a value of 1 [10, 11].
Logistic regression is most commonly used for classification tasks. The main aim
of the algorithm is to predict the classes based on the underlying features. Logistic
regression is a useful tool that can be utilized to anticipate whether a customer is likely
to buy a product based on their demographic and behavioral features. Additionally,
it can be adapted to tackle multiclass classification problems where the dependent
variable can have more than two values. This is achieved through a technique referred
to as “multinomial logistic regression” or “softmax regression.”
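A minimal sketch of logistic regression for the customer-purchase scenario, assuming scikit-learn is available; the two features and the labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [age, monthly site visits] and whether the customer bought (1) or not (0).
X = np.array([[22, 1], [25, 3], [47, 8], [52, 6], [46, 2], [56, 9], [23, 2], [60, 10]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the sigmoid output: the probability of each class.
print(clf.predict_proba([[30, 4]]))   # [[P(no buy), P(buy)]]
print(clf.predict([[30, 4]]))         # predicted class label
```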
. Naive Bayes
Naive Bayes is a frequently used machine learning algorithm that helps with clas-
sifications. It operates on the principles of Bayes’ theorem, which calculates the
probability of an event based on past information about the factors that might be
linked to the event [12, 13].
In the case of classification, Naive Bayes assumes that certain traits in a class exist independently of the presence of other traits. This assumption is known as “naive” because, in real-world cases, it fails most of the time. Still, the algorithm performs well in many cases.
Each class has base characteristic values. Naive Bayes uses those values to calculate the probability of each class. Based on the evidence given by the features, Bayes' theorem updates the class prior probabilities. The class having the maximum probability is selected as the predicted class.
The three types of Naive Bayesian algorithms are as follows:
1. Gaussian Naive Bayes
2. Multinomial Naive Bayes
3. Bernoulli Naive Bayes.
When the feature is in the form of continuous variables, Gaussian Naive Bayes is
used. When the feature is in the form of discrete variables, multinomial Naive Bayes
and Bernoulli Naive Bayes are used.
The Naive Bayesian algorithm has the following advantages:
1. It is simple and efficient in nature.
2. Training time required for larger datasets is considerably less.
Naive Bayesian algorithm can be implemented for spam filtering, text classifica-
tion, sentiment analysis, etc.
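For illustration, a small spam-filtering sketch using multinomial Naive Bayes on word counts, assuming scikit-learn is installed; the example texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature spam-filtering dataset.
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()       # discrete word counts -> multinomial Naive Bayes
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

test = vectorizer.transform(["free prize tomorrow"])
print(clf.predict(test))        # predicted class
print(clf.predict_proba(test))  # posterior probability of each class
```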
. Support Vector Machine (SVM)
SVM is a machine learning algorithm. This algorithm can be used to perform tasks
such as classification, regression, and outlier detection.

The main aim of SVM is to find the optimal hyperplane that separates the classes with clear margins. The margin is measured as the distance between the hyperplane and the closest points of both classes. SVM attempts to give a clear classification with maximized margins [14].
SVM kernels can be helpful when the underlying data is not linearly separable.
SVM kernels transform low-dimensional data into high-dimensional data so data can
be easily separated. Frequently used SVM kernels are as follows:
1. Linear kernel
2. Polynomial kernel
3. Radial basis kernel.
SVM can easily handle binary as well as multiclass classification tasks. In the case of multiclass classification, the one-versus-one or one-versus-all method is used. In the one-versus-one approach, binary classifiers are trained for each pair of classes, and their results are combined. In contrast, in the one-versus-all strategy, a binary classifier is trained for each class, and the class with the greatest score is selected as the final prediction. SVMs are superior to conventional classification algorithms in a number of ways, including the handling of high-dimensional data and robustness to outliers. Because the underlying optimization problem is convex, they also yield a unique solution. However, choosing
the right kernel function and its parameters is important as SVMs can be sensitive
to their selection. Additionally, SVMs can be computationally expensive on large
datasets.
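A short sketch of an SVM with an RBF kernel on a synthetic, nonlinearly separable dataset, assuming scikit-learn is installed.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A two-class dataset that is not linearly separable.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The RBF kernel maps the data into a higher-dimensional space where it becomes separable.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```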
. Decision Tree
Decision tree is a machine learning algorithm that can be used to perform tasks such as classification and regression. As the name suggests, a decision tree creates a tree-like structure in order to make predictions [15].
Formation of a decision tree starts with a single node which constitutes the entire dataset. The next step is the selection of features that can be used to split the dataset. The splitting criterion can be based on measures such as entropy or the Gini index. The feature having the highest information gain (or lowest Gini index) is used to split the dataset into two or more subsets. The process is repeated until the stopping criterion is met.
The developed tree is then used for making predictions. Predictions are made by traversing the tree from the root node to a leaf node. Every inner node makes a decision depending on a feature, and the leaf node gives us a class prediction or value. Decision trees can be viewed as flow charts, and this flow-chart-like representation makes the results easier to understand.
Decision trees have following advantages:
1. It can handle categorical and continuous data
2. Easy interpretability
3. It can handle nonlinear relationship between target and features.

Decision trees can result in overfitting if the underlying tree is very large or has many features. There are ways to avoid overfitting, such as tree pruning or limiting the maximum tree depth. We can also use random forests or gradient boosting to avoid overfitting.
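A brief sketch, assuming a recent scikit-learn is installed, that trains a depth-limited decision tree on the built-in Iris dataset and prints its flow-chart-like structure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth limits tree growth, which is one way to reduce overfitting.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf))  # textual, flow-chart-like view of the learned tree
```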
. K-Nearest Neighbors (KNN)
KNN can be implemented for regression and classification tasks. This algorithm is simple but very effective in the field of machine learning.
KNN classifies a data point by finding the K points nearest to it according to a distance metric such as Euclidean distance. New class predictions are made by a majority vote of the K-nearest neighbors. For regression, the mean value of the neighbors is taken into account [16].
The selection of the K value is a crucial part of the KNN algorithm because this value has a major effect on the bias-variance trade-off. A small value of K results in low bias and high variance, while a large value of K results in higher bias and lower variance.
KNN has the following advantages:
1. KNN is simple and flexible in nature.
2. It can handle a nonlinear relationship between the target and the features.
KNN is sensitive to the distance metric used and to feature scaling, and its performance can suffer if the underlying dataset is very large.
KNN can be beneficial when the decision boundaries between classes are nonlinear and complex in nature. KNN can be implemented in recommender systems, anomaly detection, etc.
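A minimal KNN sketch, assuming scikit-learn is installed, showing the effect of feature scaling and of different K values on the Iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Feature scaling matters because KNN relies on a distance metric.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Try a few values of K to see the effect on accuracy.
for k in (1, 3, 5, 11):
    clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    clf.fit(X_train, y_train)
    print(f"K={k}: test accuracy = {clf.score(X_test, y_test):.3f}")
```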
. Random Forest
Random forest is one of the most in-demand ensemble learning algorithms used to
perform classification as well as regression. In random forest, multiple decision trees
are formed. The results from all the created decision trees are combined in order
to deliver a prediction. This process is done in order to achieve better accuracy and
reduce overfitting of the underlying data [17].
In a random forest, in order to form a tree, a random subset of data and features are
selected from the underlying data. The random selection aims to achieve diversity
and reduce correlation among trees. Every formed tree produces a prediction, and all the predictions are combined using aggregation methods: for classification, a voting method is used, and for regression, the mean of the predictions is taken. The random forest offers several benefits over a single decision tree, as follows:
1. Capable of handling high-dimensional data.
2. Capable of dealing with outliers and noisy data.
3. The ability to detect the nonlinear relationships between the features and target.
4. Less susceptible to overfitting because of the nature of the algorithm.

Tuning of various hyperparameters can improve the performance of random forest.


The hyperparameters are as follows:
1. The number of trees formed
2. The max depth for tree formation
3. The size of the subset used for training
4. The feature that impacts the formation of the tree and frequency of its use.
Random forest can be implemented for classification, feature selection, and anomaly detection tasks. The algorithm becomes especially useful when class boundaries are not clear enough or the dependency between features and target is nonlinear.
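A short random forest sketch on the built-in breast cancer dataset, assuming scikit-learn is installed; the hyperparameter values are illustrative, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The listed hyperparameters: number of trees, maximum depth, and features per split.
clf = RandomForestClassifier(n_estimators=200, max_depth=8,
                             max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("first few feature importances:", clf.feature_importances_[:5])
```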

2.2.2 Unsupervised Machine Learning Algorithms

. K-means

K-means is the most commonly used and the favored clustering algorithm in the field
of unsupervised machine learning. Based on similarity measures, data is divided into
k-clusters. It is also known as the partition-based clustering technique [18].
In K-means, the number of clusters k is given by the user, and the algorithm finds an optimized way to divide the data into k clusters. The first step of the K-means algorithm is to randomly select k points from the given data; these points are considered the initial cluster centroids. The next step is to allocate the available data points to the nearest centroid on the basis of some distance metric, e.g., Euclidean distance. After the allocation, each centroid is updated to the average of the data points assigned to its cluster. The allocation of data points and the update of centroids are repeated until there is no change in the centroids or a terminating condition is met.
The k-means algorithm aims to minimize the sum of squared distances between
data points and centroids. This is also called inertia of clusters. This aim helps k-
means to form well separated and compact clusters. It can also be used to assess the
overall quality of clustering.
The advantages of K-means are its efficiency, scalability, and simplicity. However, it can be very sensitive to the initially chosen centroids, and it is often difficult to find the optimal k value. The distance metric used and feature scaling also influence the quality of the clusters.
K-means is frequently used in applications like anomaly detection, image segmen-
tation, customer segmentation, etc. The algorithm can be very helpful when we do
not know what kind of data we are dealing with. The aim of the algorithm is to find
natural groups and detect hidden patterns in the data.
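A minimal K-means sketch, assuming scikit-learn and NumPy are installed; the customer data below is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, number of visits].
X = np.array([[200, 4], [220, 5], [800, 20], [850, 22],
              [210, 6], [790, 18], [50, 1], [60, 2]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster labels:", labels)
print("centroids:\n", kmeans.cluster_centers_)
print("inertia (sum of squared distances):", kmeans.inertia_)
```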

. K-medoids
K-medoids is a clustering technique that shares similarities with K-means algorithm.
However, unlike K-means, K-medoids uses a medoid as a centroid instead of the
mean of data points in each cluster. A medoid is identified as a data point that is
nearest to the center of the cluster, which may not necessarily be the mean of data
points in the cluster.
K-medoids is a commonly used clustering algorithm when the data does not follow
a normal distribution or when the mean is not an accurate representation of the cluster
center. Compared to K-means, K-medoids is less affected by outliers because a single
outlier is less likely to move the medoid as compared to the mean [19].
The K-medoids algorithm is a process that begins by randomly selecting K data
points to serve as the initial medoids. After that, each data point is assigned to the
closest medoid based on a distance metric, like Euclidean distance. Following that,
the medoids are updated to be the data point with the smallest average distance to
all other data points in the cluster. This process of assigning points and updating
medoids is repeated until the medoids remain unchanged or a stopping criterion is
reached.
K-medoids requires more computational power than K-means. This is because identifying the medoid of each cluster involves calculating distances between each pair of data points in the cluster. However, it has the advantage of being more
resistant to noise and outliers. It can produce better clustering results for clusters that
are not convex or have irregular shapes.
K-medoids is frequently used in applications such as image segmentation, expression analysis, etc. It is especially useful when the data points contain a lot of noise or outliers or when the source of the data is unknown.
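A sketch of K-medoids clustering, assuming the optional scikit-learn-extra package is installed (it provides a KMedoids estimator); the data points, including the outlier, are hypothetical.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # provided by the scikit-learn-extra package

# Small hypothetical dataset with an outlier that would pull a K-means centroid away.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
              [50.0, 50.0]])  # outlier

kmedoids = KMedoids(n_clusters=2, metric="euclidean", random_state=0)
labels = kmedoids.fit_predict(X)

print("cluster labels:", labels)
# Medoids are actual data points, which makes them less sensitive to the outlier.
print("medoids:\n", kmedoids.cluster_centers_)
```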

2.2.3 Deep Learning Algorithms

The inspiration for neural networks is derived from the structure and function of the human brain. A neural network includes multiple layers of interconnected nodes, also known as neurons. These neurons handle incoming information by applying mathematical functions to it.
The critical part of the neural network is the neuron. The neurons can gather
input from other neurons or some other external sources. A mathematical function
is implemented on the gathered input which in turn gives the output. The output
is then transferred to the other neurons, or it can be used as the final outcome of
the network. Each neuron includes a series of weights that are used to boost the
input signals. The weights are tuned during the training phase so as to achieve the
optimized performance of the network.

Neural networks are commonly used while performing regression, classification, object recognition in an image, etc. These networks typically perform well in applica-
tions where underlying data has complex relationships or manual feature extraction
is tough. Some commonly used neural networks include the following types:
. Feedforward Neural Networks
In feedforward neural networks, multiple layers of neurons are used to process
information in the forward direction. They are commonly used in tasks such as
classification or regression.
A feedforward neural network is an artificial neural network that contains multiple layers of interconnected neurons (nodes) in which information is processed in a feedforward way from the input to the output. The input to the network is a vector (a collection of features), and the output is a set of predictions or class labels. The middle layers, also known as hidden layers, are connected to the neurons of the previous layers [20].
The core concept in a neural network is a feedforward network. In each layer,
multiple neurons execute nonlinear operation on input data which is then further
processed in successive layers. This enables networks to process and learn complex
relationships and patterns in the underlying data. The linear models may not be able
to process this type of information.
Feedforward neural networks are commonly used in applications like speech
recognition applications, NLP applications, computer vision applications, etc. These
applications are very helpful in providing adaptable models which will help in recog-
nizing the complex patterns in the underlying data. The one disadvantage is that these models are typically computationally expensive and hard to train.
To curtail the difference between predicted value and actual value, weights and
biases of the neural networks are updated with the help of an optimization algorithm.
This step is known as backpropagation. It enables the network to learn from the
underlying labeled data and improve performance.
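As an illustration, here is a minimal feedforward network (multilayer perceptron) in Keras, assuming TensorFlow is installed; the tabular data is synthetic.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Synthetic tabular data: 8 input features, binary output label.
X = np.random.rand(500, 8)
y = (X.sum(axis=1) > 4).astype(int)

model = Sequential([
    Dense(16, activation="relu", input_shape=(8,)),  # hidden layer 1
    Dense(8, activation="relu"),                     # hidden layer 2
    Dense(1, activation="sigmoid"),                  # output layer
])

# Backpropagation with an optimizer adjusts the weights to minimize the loss.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)
print(model.evaluate(X, y, verbose=0))
```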
Some popular types of feedforward neural networks include:
. Multilayer perceptron (MLP)
This is a simple feedforward neural network which contains multiple neurons
interconnected to each other.
. Convolutional neural network (CNN)
CNN can also be termed as a feedforward neural network, which is constructed to
process information which is in the form of images (grids). The network uses the
convolution to get the features from input and pooling layers to reduce the size of
the feature map [21, 22].
Convolutional neural networks (CNNs) are a specific type of artificial neural
network that is intended to process data with a grid-like topology, such as images.
These networks employ convolutional layers to extract relevant features from the
input data and pooling layers to downsample the feature maps. CNNs are extensively
utilized in image and video processing applications, as they can learn features and
patterns in the input data that are important for the given task automatically.
The presence of the convolutional layer, which is designed to extract spatial features from the underlying data, is the main difference between CNNs and plain feedforward neural networks. A convolutional layer is a group of filters that move over the input data
and execute a convolution operation. The convolution operation involves computing
a dot product between the weights of the filter and a small section of the input data.
This process is done across the entire input, creating a collection of feature maps that
emphasize various components of the input data.
After the convolutional layers, the feature maps are usually passed through one or
more fully connected layers to accomplish the ultimate classification or regression
task. Moreover, CNNs usually employ pooling layers to compress the size of the
feature maps and enhance the network’s computational efficiency.
Some popular types of convolutional neural networks include:
. LeNet-5: One of the simplest and earliest CNN architectures, aimed at handwritten digit recognition.
. AlexNet: A deep CNN architecture that won the ImageNet Large Scale Visual Recognition Challenge in 2012, a hallmark for image classification tasks.
. VGG: A deep CNN architecture that achieved high accuracy on the ImageNet challenge in 2014 with the help of a simple and uniform architecture.
. ResNet: A deep CNN architecture that introduced the concept of residual connections, which made deeper networks easier to optimize.
CNNs perform exceptionally well on image and video processing tasks, including object detection, semantic segmentation, and action recognition. The main requirement for a CNN is a large amount of labeled data. Along with that, careful tuning of hyperparameters is necessary for better performance. CNNs can be effectively trained with the help of GPUs and distributed computing frameworks.
Convolutional neural networks (CNNs) consist of a number of operations that convert input data into feature maps that present different aspects of the input. The key operations involved in CNNs are the convolution operation, the pooling operation, and a set of fully connected layers.
Here is a brief overview of how CNNs work:
. Convolutional Layers: The convolution layer has a number of filters which are
used to slide over the image (input data given) and do the convolution operation.
In this operation, calculation of a dot product between weights of filters and
input data is done. The described operation is performed repetitively over the
input, which gives multiple feature maps. These feature maps represent different
features of the input data.
. Activation Function: Once a convolution operation is performed, the activation
function is implemented feature wise to the output provided by the convolution
layer. The activation function is used to create nonlinearity in the network, which
enables the network to capture complex features and relationships in the given
data.

. Pooling Layers: A pooling layer helps the network become more efficient by reducing the size of the feature maps. The most commonly used pooling operation is max pooling, in which the maximum value within each region of the feature map is extracted and provided as the output.
. Fully Connected Layers: The series of convolution and pooling operations are
performed to get the feature map. The feature map is then flattened to create
a vector and provided as an input to one or more fully connected layers. The
final task of these layers is to accomplish classification or regression. The fully
connected network can be viewed as a traditional feedforward neural network.
. Loss Function and Optimization: The output generated by the fully connected
layer is matched to the label of the input data. A loss function is used to calculate the
differentiation between the actual label provided with input data and the predicted
label. After the difference calculation, the weights in the network are efficiently
updated with the help of the optimization algorithm such as stochastic gradient
descent (used to minimize the loss).
In the training process of a CNN, weights and biases are tuned to achieve the max
accuracy on labeled data. While training, the network slowly learns to capture features
and patterns of the input data which will help in tasks such as object recognition in
given images etc.
When the training of the network is done, it is used to predict the unlabeled data
with the help of trained network. The predicted values are assessed with the help of
different test set.
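A compact Keras sketch of the convolution, pooling, and dense pipeline described above, assuming TensorFlow is installed; the 28 x 28 grayscale input shape and the 10 output classes are illustrative.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Convolution -> activation -> pooling -> fully connected layers, as described above.
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                        # feature maps flattened into a vector
    Dense(64, activation="relu"),
    Dense(10, activation="softmax"),  # 10-class prediction (e.g., digits)
])

# Loss function and optimization: cross-entropy minimized by a gradient-based optimizer.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```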
. Recurrent neural networks (RNNs)
RNN layers have been created to process sequential data like text or speech. RNNs
use recurrent layers to hold memory of earlier input in order to model temporal
dependencies.
Recurrent neural networks (RNNs) are a specialized type of artificial neural
network that can effectively handle sequential data, ranging from time series and
text to speech. Unlike feedforward neural networks, which can only process a fixed-
size input and output, RNNs can handle input sequences of varying lengths and
produce output sequences of any length.
Recurrent neural networks (RNNs) are characterized by their ability to use recur-
rent connections to maintain an internal state that captures the input sequence’s
context. This internal state is updated at each time step by combining the current
input with the previous state and passing the result through an activation function
[23, 24].
The long short-term memory (LSTM) network is the most widely used type of
RNN. It utilizes specialized gating mechanisms to regulate the flow of information in
and out of the internal state. LSTMs are ideal for managing long-term dependencies
in sequential data since they can remember or forget information based on the context.
Here is a brief overview of how RNNs work:
. Recurrent Connections: At each time step, the current input is combined with the previous state, and the result is passed through an activation function to update the current state.

. Internal State: The internal state of the RNN captures the context of the input
sequence up to the current time step.
. Gating Mechanisms: LSTMs incorporate specific gating mechanisms that regulate
the flow of information to and from the internal state. This enables the network
to accurately retain or discard relevant data based on the given context.
. Output: The RNN generates its output at each time step through a fully connected
layer that takes the current state as input. The resulting vector has the same
dimensionality as the desired output.
. Loss Function and Optimization: At the last step of the process, the result is
matched with the actual label of the given input, and a loss function is applied
to determine how much difference exists between the predicted output and the
real label. After that, the network’s weights are adjusted using an optimization
algorithm like stochastic gradient descent to decrease the loss function.
When training a recurrent neural network (RNN), the network’s weights and biases
are adjusted to improve its accuracy on a training set of sequential data. Through
this process, the network learns to understand the context of the input sequence and
generate the desired output. Once the network is trained, it can be utilized to make
predictions on new and unlabeled sequential data by inputting it into the network and
obtaining a predicted output. The predictions of the network can be evaluated using
a separate test set of sequential data.
. Long Short-Term Memory (LSTM)
LSTM is a type of RNN or recurrent neural network that is intended to manage
long-term dependencies in sequential data. The primary advantage of LSTMs is
the implementation of specialized gating mechanisms that enable the network to
selectively retain or discard information based on the context [25].
In a standard LSTM, the internal state comprises two segments:
1. The “cell state”
2. The “hidden state.”
The cell state preserves the network’s long-term memory, while the hidden state
embodies the LSTM’s output at each time step. Sigmoid and tanh activation functions
are utilized to regulate the information flow in and out of the cell state through gating
mechanisms.
Here is a more detailed overview of how LSTMs work:
. Input and Previous Hidden State: Each time step in an LSTM takes the input vector and the previous hidden state as inputs.
. Gates: The combination of the current input and the previous state passes through gating mechanisms, namely the input gate, the forget gate, and the output gate.
. Input Gate: The input gate decides which information from the input vector will become part of the cell state.
. Forget Gate: The forget gate decides which information from earlier states should be remembered and which should be forgotten.

. Update Cell State: The outputs of the input gate and forget gate are combined to update the cell state, which then goes through the tanh activation function.
. Output Gate: The output gate decides which part of the updated cell state will be
part of the output.
. Hidden State and Output: The updated cell state goes through a tanh activation
function to form the hidden state. It is also the output of the LSTM at that time
step.
. Loss Function and Optimization: During the final time step, the predicted output
is compared to the actual label of the input. To quantify the difference between
the two, a loss function is applied. The weights of the network are then adjusted
using an optimization algorithm such as stochastic gradient descent to minimize
the loss function.
To train an LSTM, the weights and biases of the network are adjusted for better
accuracy on a training set of sequential data. The network learns to selectively
remember or forget information based on the input sequence and context during
the training process. Once trained, the network can predict new sequential data by
feeding the input through the network. The predictions can be evaluated using a
separate test set of sequential data.
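A minimal many-to-one LSTM sketch in Keras, assuming TensorFlow is installed; the sequences and the target are synthetic.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Synthetic sequential data: 1000 sequences of 20 time steps with 1 feature each,
# labeled with a single value to predict (a many-to-one setup).
X = np.random.rand(1000, 20, 1)
y = X.mean(axis=(1, 2))  # synthetic target: mean of each sequence

model = Sequential([
    LSTM(32, input_shape=(20, 1)),  # cell state and hidden state are handled internally
    Dense(1),                       # regression output at the final time step
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print(model.predict(X[:2], verbose=0))
```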
. Autoencoder
Autoencoders are feedforward neural networks which are created to learn a compressed representation of the input data. An autoencoder includes an encoder network that maps the input into a low-dimensional latent space and a decoder network that maps the latent space back to the input.
. Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are a type of deep learning model that use
two neural networks namely:
1. A generator
2. A discriminator
They are used to create new data that resembles a given dataset. The generator
and discriminator compete with each other to produce high-quality generated data.
The main objective of a GAN is to understand the inherent distribution of a given
set of input data and then generate new data that is similar to the original input data
[26, 27].
Here is a high-level overview of how GANs work:
. Generator: The generator is a neural network that takes a random noise vector as its input and creates a fresh sample of data. Its main aim is to generate samples that are indistinguishable from the original data.
. Discriminator: The discriminator is a neural network that receives an input and predicts whether it is a real sample or a fake sample created by the generator. The main aim of the discriminator is to precisely differentiate between real and fake samples.

. Training: The generator and discriminator are taught to work against each other,
where the generator creates samples that appear real to the discriminator, and the
discriminator tries to differentiate between real and fake samples accurately. The
training process involves feeding both types of samples repeatedly through the
discriminator, and using the error signal to adjust the weights of the generator and
discriminator networks.
. Evaluation: After training of GAN, the generator can generate data similar to
input. This is achieved by feeding random noise vectors through the generator
and getting the generated output.
GANs have found applications in creating lifelike visual art, music, and textual
content. A significant benefit of GANs is their potential to produce fresh data exam-
ples that are both varied and authentic. Nevertheless, training GANs can be a complex
task and requires precise adjustments to hyperparameters to guarantee consistent
convergence.

2.3 Model Building

Model building in machine learning involves the development of a mathematical model that depicts the relationship between variables in a dataset. The process involves choosing the right algorithm, specifying the input variables, and then optimizing the model to achieve the desired level of accuracy and performance.
The model-building process typically involves the following steps [28]:
. Data collection and preprocessing: Collecting relevant information and converting
it into a format that is appropriate for analysis, which may require refining,
cleaning, scaling, and normalization.
. Feature selection and engineering: Recognizing the most relevant features or input
variables to include in the model. Transforming or creating new features as needed.
. Algorithm selection: Selecting an appropriate algorithm or combination of
algorithms on the basis of type of problem at hand.
. Training and validation: Dividing the data into training and testing sets. Training
set is used to train the model. The testing set is used to assess the model and
optimize its performance.
. Hyperparameter tuning: Adjusting the parameters of the model and algorithm to
optimize performance and avoid overfitting.
. Evaluation and deployment: Assessing the performance of the model on a different
dataset, and implementing the model for use in real-world applications.
To achieve accurate and strong predictions in machine learning, it's crucial to have a solid model-building process. This involves having a thorough knowledge of the algorithms, data, and problem domain. Along with that, constant refinement and experimentation with the model are necessary to optimize its performance.

2.4 Model Training

Model training is a crucial step in machine learning that involves adjusting the internal
parameters of an algorithm to make accurate predictions. This is done by providing
the algorithm with input features and output labels, and then fine-tuning its parameters
through iterations until it can accurately predict output labels for new input data.
Here’s an example of how model training works.
Let us consider there is a dataset containing housing prices. Each data point
represents a house and provides information about its features such as number of
bedrooms, lot size, year of construction, and sale price. The objective is to create
a machine learning model that can predict the price of a new house based on its
features. To do this, the dataset needs to be divided into two parts.
. Training set
. Testing set
The first part is the training set, which will be used to train the model. The second
part is the testing set, which will be utilized to evaluate the model's accuracy. A common split is 80/20, where 80% of the data is used for training and 20% is used for testing.
Next, we will select an appropriate machine learning algorithm for the task, such
as linear regression, decision tree, or SVM. The next step is to input the training
set into the algorithm. Internal parameters will be adjusted based on the difference
between its predicted output labels and the true labels in the training set.
During the training process, the algorithm iteratively adjusts its parameters to minimize the difference between its predicted outputs and the true labels in the training set. This process is commonly carried out by gradient descent, which computes the gradients of the loss function with respect to the model parameters and adjusts the parameters in the direction of the negative gradient.
After the algorithm is trained on the training set, we will assess its performance
on the testing set to determine how well it adapts to new, unseen data. If the model
performs well on the testing set, we can utilize it to make predictions on new, unseen
data. In short, model training is an essential phase in the machine learning process
since it empowers us to construct precise predictive models that have various practical
applications.
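A compact sketch of this workflow, assuming scikit-learn is installed; it downloads the public California housing dataset as a stand-in for the housing-price example.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Housing-price data stands in for the example described above.
X, y = fetch_california_housing(return_X_y=True)

# 80/20 split: 80% of the data for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)             # parameters fitted on the training set

predictions = model.predict(X_test)     # generalization checked on unseen data
print("mean absolute error:", mean_absolute_error(y_test, predictions))
```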

2.5 Model Testing

Model testing is a way to assess the performance of the fully trained model on a different dataset that was not used in the training phase. This step is very important to make sure that the model generalizes and can make accurate predictions even on unseen data.

Here’s an example of how model testing works.


Imagine a scenario where we have created a machine learning model that predicts
the likelihood of a bank loan applicant defaulting on their loan. The model was
trained on a historical dataset of loan applications, which includes important factors
such as income, credit score, employment status, and other relevant information.
The outcome of each loan application, be it default or not, was also included in the
dataset.
In order to test the model, we will divide the dataset into two parts, namely:
. A training set
. A testing set
We will use training set to train the model and after that we will assess the
performance of the model with the help of testing set.
We will input the test set into the model and determine its accuracy by cross-
checking the predicted outcomes with the actual outcomes in the test set. To evaluate
classification models, we may use standard metrics such as precision, recall, and F1
score.
If the model's performance is good, then we can safely say that the underlying model generalizes well and can handle new and unseen data, and it can be used for making predictions on new loan applications. If the model's performance is not good, then we will need to adjust the hyperparameters, revisit the selection of features, or select a new machine learning algorithm.
Model testing is a vital step in the process of machine learning. It helps us make sure our models are robust and precise so that we can use them in real-world scenarios.

2.6 Model Tuning

Adjusting hyperparameters in order to achieve optimized performance is referred to as model tuning. Hyperparameters should be set before the training phase begins; they cannot be adjusted during the training phase. Examples of hyperparameters are the learning rate, regularization strength, the number of hidden layers in a neural network, etc.
Here’s an example of how model tuning works.
Let’s say we built a machine learning model to predict the likelihood of a customer
clicking on an online ad. We trained the model on a dataset of historical online ad
impressions. Each data point contains information about the size of the ad, its location
on the webpage, other relevant factors, and whether the customer clicked on the ad.

For tuning the model, we will divide the dataset into three parts:
. Training set
. Testing set
. Validation set
We will use the training set for training the model, the validation set for hyperparameter tuning, and the testing set for the final assessment of performance.
To improve the model’s performance during training, we can begin by adjusting the
learning rate. This rate determines how fast the model adjusts its internal parameters.
We can evaluate the model’s performance on the validation set by training it with
different learning rates. If we find a learning rate that yields good results on the
validation set, we can use it to train the model on the entire training set.
We can proceed by fine-tuning the regularization strength, which regulates the
extent to which the model punishes high parameter values while being trained. Simi-
larly, we would retrain the model using different regularization strengths and measure
its performance on the validation set. Additionally, we could experiment with varying
the number of hidden layers in the model, which can impact its capability to capture
intricate correlations within the data.
After fine-tuning the model on the validation set, we need to assess its performance on the test set to ensure its ability to handle novel, unseen data. Optimizing model performance through tuning is a critical step in the machine learning process, ensuring that our models can be effectively implemented in real-world applications.
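One common way to automate such a search is grid search with cross-validation, which replaces the single fixed validation set described above; a sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate values for the regularization strength hyperparameter C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
search.fit(X_train, y_train)      # tuning uses cross-validation on the training data

print("best hyperparameters:", search.best_params_)
print("held-out test accuracy:", search.score(X_test, y_test))
```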

2.7 Use Cases: Model-Centric AI

. Define the task


This is a crucial step for any application. The aim of the application should be clear. After that, the kind of data needed for the application and the expected output should be decided.
Aim of the application => Face mask recognition
. Collect the data
This is the most important step in any machine learning project. The quality of your data will directly impact the quality of your model. If the data is clean, then the model will perform well; otherwise, noisy data can have negative consequences. You will need a dataset of images that contain faces with and without masks. You can find such datasets online, or you can create your own by collecting images from the web.

. Prepare the data. This includes cleaning the data, removing outliers, and transforming it into a format that can be used by your model. It also involves resizing the images to a fixed size, converting them to grayscale, normalizing the pixel values, etc.

. Choose a CNN model. There are many different types of deep learning models available. The best model for your project will depend on the specific problem you are trying to solve. For face mask recognition, VGG16 is used here.

. Train the model. This is the stage where you utilize the data you have gathered: you feed the images in your dataset to the model and enable it to learn how to distinguish between masked and unmasked faces. The duration of the training process will depend on the size of your dataset and the complexity of your model (a minimal transfer-learning sketch is shown after these steps).

. Evaluate the model. Once your model is trained, you need to evaluate its perfor-
mance on a different dataset. This will help you to determine how well the model
will generalize to new data.

. Deploy the model. Once you are satisfied with the performance of your model,
you can deploy it to production. This means making it available to users so that
they can use it to make predictions.
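A minimal transfer-learning sketch for the face mask use case, assuming TensorFlow/Keras is installed and that the images are arranged in a hypothetical data/train/<class> directory layout; a pretrained VGG16 is used as a frozen feature extractor.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Pretrained VGG16 as a frozen feature extractor; only the new head is trained.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

x = Flatten()(base.output)
x = Dense(128, activation="relu")(x)
output = Dense(1, activation="sigmoid")(x)   # mask / no mask
model = Model(inputs=base.input, outputs=output)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical directory layout: data/train/<class_name>/*.jpg
train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="binary")

model.fit(train_gen, epochs=5)
```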
Here are some additional tips for success in deep learning projects:
. Use a good deep learning framework. There are many different deep learning frameworks available, such as TensorFlow, PyTorch, and Keras. Select a framework that suits the requirements of your project.
. Use a GPU. Using a GPU will tremendously speed up the training time of models.
. Be patient. Training a deep learning model can be a time-consuming process
hence be patient for results.
. Experiment. When it comes to deep learning, there’s no universal solution that
works for everyone. To determine the best approach for your project, it’s important
to try out various models, hyperparameters, and data preprocessing techniques
through experimentation.

2.8 Summary

Model-centric machine learning is a method in the field of machine learning that places a strong emphasis on the creation, enhancement, and use of machine learning models. Models serve as the central focus of the machine learning process, which is referred to as being “model-centric” in this context. In this chapter, the working and algorithms of model-centric machine learning are presented along with a use-case implementation.

References

1. Mahesh, B. (2020). Machine learning algorithms: A review. International Journal of Science and Research (IJSR), 9, 381–386.
2. Mohammed, M., Khan, M. B., & Bashier, E. B. M. (2016). Machine learning: Algorithms and
applications. CRC Press
3. El Naqa, I., & Murphy, M. J. (2015). What is machine learning? (pp. 3–11). Springer
International Publishing
4. Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research
directions. SN Computer Science, 2(3), 160.
5. Alloghani, M., Al-Jumeily, D., Mustafina, J., Hussain, A., & Aljaaf, A. J. (2020). A system-
atic review on supervised and unsupervised machine learning algorithms for data science.
Supervised and Unsupervised Learning for Data Science, 14, 3–21.
6. Jiang, T., Gradus, J. L., & Rosellini, A. J. (2020). Supervised machine learning: A brief primer.
Behavior Therapy, 51(5), 675–687.
7. Mason, K., & Grijalva, S. (2019). A review of reinforcement learning for autonomous building
energy management. Computers and Electrical Engineering, 78, 300–312.
8. Liu, R., Nageotte, F., Zanne, P., de Mathelin, M., & Dresp-Langley, B. (2021). Deep reinforce-
ment learning for the control of robotic manipulation: A focussed mini-review. Robotics, 10(1),
22.
9. Maulud, D., & Abdulazeez, A. M. (2020). A review on linear regression comprehensive in
machine learning. Journal of Applied Science and Technology Trends, 1(4), 140–147.
10. Shipe, M. E., Deppen, S. A., Farjah, F., & Grogan, E. L. (2019). Developing prediction models
for clinical use using logistic regression: An overview. Journal of Thoracic Disease, 11, S574.
11. Bisong, E., & Bisong, E. (2019). Logistic regression. In Building machine learning and deep
learning models on google cloud platform: A comprehensive guide for beginners (pp. 243–250)
12. Chen, S., Webb, G. I., Liu, L., & Ma, X. (2020). A novel selective naïve Bayes algorithm.
Knowledge-Based Systems, 192, 105361.
13. Wongkar, M., & Angdresey, A. (2019, October). Sentiment analysis using Naive Bayes algo-
rithm of the data crawler: Twitter. In Proceedings of the 2019 Fourth International Conference
on Informatics and Computing (ICIC) (pp. 1–5). IEEE
14. Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & Lopez, A. (2020). A compre-
hensive survey on support vector machine classification: Applications, challenges and trends.
Neurocomputing, 408, 189–215.
15. Charbuty, B., & Abdulazeez, A. (2021). Classification based on decision tree algorithm for
machine learning. Journal of Applied Science and Technology Trends, 2(01), 20–28.
16. Lubis, A. R., & Lubis, M. (2020). Optimization of distance formula in K-nearest neighbor
method. Bulletin of Electrical Engineering and Informatics, 9(1), 326–338.
17. Schonlau, M., & Zou, R. Y. (2020). The random forest algorithm for statistical learning. The
Stata Journal, 20(1), 3–29.
18. Sinaga, K. P., & Yang, M. S. (2020). Unsupervised K-means clustering algorithm. IEEE Access,
8, 80716–80727.
19. Ushakov, A. V., & Vasilyev, I. (2021). Near-optimal large-scale k-medoids clustering.
Information Sciences, 545, 344–362.
20. Baldi, P., & Vershynin, R. (2019). The capacity of feedforward neural networks. Neural
Networks, 116, 288–311.

21. Lee, H., & Song, J. (2019). Introduction to convolutional neural network using Keras; An
understanding from a statistician. Communications for Statistical Applications and Methods,
26(6), 591–610.
22. Ghosh, A., Sufian, A., Sultana, F., Chakrabarti, A., & De, D. (2020). Fundamental concepts
of convolutional neural network. Recent Trends and Advances in Artificial Intelligence and
Internet of Things, 128, 519–567.
23. Sherstinsky, A. (2020). Fundamentals of recurrent neural network (RNN) and long short-term
memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306.
24. Schmidt, R. M. (2019). Recurrent neural networks (RNNS): A gentle introduction and overview.
arXiv preprint arXiv:1912.05911
25. Song, X., Liu, Y., Xue, L., Wang, J., Zhang, J., Wang, J., et al. (2020). Time-series well
performance prediction based on long short-term memory (LSTM) neural network model.
Journal of Petroleum Science and Engineering, 186, 106682.
26. Lan, L., You, L., Zhang, Z., Fan, Z., Zhao, W., Zeng, N., et al. (2020). Generative adversarial
networks and its applications in biomedical informatics. Frontiers in Public Health, 8, 164.
27. Yinka-Banjo, C., & Ugot, O. A. (2020). A review of generative adversarial networks and its
application in cybersecurity. Artificial Intelligence Review, 53, 1721–1736.
28. Mahalle, P. N., Shinde, G. R., Pise, P. D., & Deshmukh, J. Y. (2022). Foundations of data
science for engineering problem solving. Springer.
Chapter 3
Data-Centric Principles for AI Engineering

3.1 Overview

AI has become an interdisciplinary field and has applications in all verticals across
day-to-day routines. There has been a lot of development in this field, and the current
AI as a black box is becoming AI as a white box where a new category of AI algo-
rithms called Explainable AI is emerging in which the focus is more on the data than
models. Today, we are surrounded by digital transformation where every application
is driven by the Internet of Everything (IoE) [1]. In IoE, numerous sensors are installed
in the environment as well as commissioned in smartphones and smartwatches. These
sensors generate a huge amount of data, and a rich set of data analytics algorithms [2]
are helping us to proactively track the activities carried out by users. These activity
recognition and tracking algorithms are empowered by emerging machine and deep
learning techniques and thus enable a human-centered design approach [3]. Appli-
cation of AI algorithms on the underlined data includes various operations which
include data engineering, data analytics, model building, data science, and busi-
ness intelligence. The model-building stage decides which algorithm from machine
learning or deep learning is more appropriate to apply for developing the model based
on the given data and the questions to be posted on this data. However, data-centric
AI is more crucial than the selection of algorithms for the model building which is
nothing but model-centric AI as discussed in Chap. 2 of this book.
Consider the use case which is to be designed and developed for audiences who
are visually impaired. The devices required are mobile or any other equivalent device
with a speaker, microphone, and camera. The objectives of this use case are to provide
commands to this device using voice recognition and perform different actions which
include reading a bible book, showing obstacles on the way of the road using a camera,
and providing responses in the voice as required in the specific language. There is
a need to use a camera to detect who the person in front of the plug-and-play camera is and to store the details of the person who was met, so that if the same person comes again, the device can predict the nature and details of that person.


Consider the use case which is to be designed and developed for audiences who
are visually impaired. The devices required are like Alexa with camera, and the
features to be incorporated with this device include asking a question using voice
and getting a response to anything that is on Google, the device should detect the
human’s emotions and react accordingly, and a person with the device will be treated
as the master, and the device should have a reset button to learn the behavior of the master again. In
addition to this, if a new person comes to meet with the master, the details should be
learned by the model in relationship with the master and if the same person comes
again, a prediction should be made on his behavior using the deep learning model.
Model-centric AI first looks into the technology, device, team requirements, and modules needed to build these use cases, as follows:
Technology
. Natural Language Processing—Voice recognition, NLU, and sentiment analysis
. Deep Learning—Model to learn the sentiments of the master, emotion analysis,
answering any questions asked by the master
. Flask Web services to integrate with UI/android app
Devices
. Device with microphone and camera
Team
. Minimum 3–4 people in AI-ML
. Expert in Android/IoT device
. Web designer
Modules
. Voice and image recognition, question-answering module
. Detection of facial recognition of user ethnicity, age, etc.
This model-centric AI technological suite is depicted in Fig. 3.1.
When we consider this use case, there are two perspectives to look at it: the user perspective and the developer perspective, each with a different set of requirements, challenges, and design issues. In addition to this, there are many AI engineering concerns that are to be taken care of during the development of this use case as well as while it is being used by the end user.

3.2 AI Engineering

AI engineering is a converged field of research that integrates computer science with human-centered design and the development of applications. AI engineering deals with the application of AI to the real world in order to design, build, and develop AI-enabled tools, products, applications, and services.

Fig. 3.1 Model-centric AI technological suite

Everything in nature is a system: it can be small, like an atomic system in which electrons orbit around the nucleus, or big, like a whole universal system where stars and planets continuously orbit around each other.
loop, and it is represented as a black box having Inputs (X) and Outputs (Y ). These
two explicit sets Inputs (X) and Outputs (Y ) precisely differentiate model-centric AI
from data-centric AI. In general, the aim of any AI or machine learning, or deep
learning algorithm is to replicate the relationship between observed X and Y; while
the aim of statistics is to model this relationship between X and Y. Empirical risk
and its minimization is the main goal of AI, and structural risk minimization is
the key objective of statistics. However, it should be noted that every algorithm in
AI is based on the theory of statistics, and the two main aspects, i.e., error mini-
mization and hypothesis approximation, go hand in hand while building any AI
application. In view of this, Probably Approximately Correct (PAC) learning [4] is
an interesting type of learning which is aimed at finding the upper or lower bounds
of learning algorithms using concentration inequalities. Generally, the relationship
between empirical risk minimization and structural risk minimization is not linear,
and the intersection between AI and statistics optimizes the tradeoff between empir-
ical risk minimization (accuracy) and structural risk minimization (approximation)
for maximum confidence interval (probability) and is referred to as PAC learning.

The model debugging strategies and iterations are very critical to AI engineering.
It is very important to understand how the debugging of AI models is different from
the debugging of software models. In debugging an AI model, more emphasis is on
the reasons for the underperformance of the model than code. It is very important to
understand the reasons for the model’s poor performance which include data, model
building, hyperparameters, etc. The appropriate strategy needs to be investigated
for debugging of the AI model. Assume that you are given many datasets of varying descriptions, which include small/large dataset sizes, datasets with a small/large number of features, outliers in datasets, noise in datasets, linearly separable scatter of datasets, nonlinearly separable datasets, overlapped and non-separable datasets, time series datasets, etc. In each case, or a combination thereof, the main task is to recommend, with justification, a proper classifier model.
A few common bugs in the AI model are listed and discussed below:
. Small data
As we know that AI or machine learning or deep learning is made for big data.
However, there are many use cases where there is not enough data available, and
it results in low accuracy and other performance metrics. Augmentation, generative
models, generative adversarial networks, and auto-encoders are some techniques to
address this issue. So in the case of small data, making it high quality is the best
approach to training the model.
. Logging mechanism
A proper logging mechanism must be in place to log useful information rather than useless information. Decision logging and information exchange logging, along with scenario mapping based on the description, are also important functions to be taken care of.
. Model confirmation
This is a post-model-building issue which includes confirming that the model has not been tampered with, i.e., ensuring the integrity of the model, as well as proof of correctness and reliability of the model. Popular methods for model confirmation are varied analysis with multiple inputs, auto-generation of test data, etc.
. Data dimension
Major bugs in model building are caused when the input data is outside the required dimensions. Despite the constraints of linear algebra and the availability of a rich set of libraries to detect inconsistent behavior in the data, mismatched data dimensions remain a major source of bugs in model building.
. Data inconsistency
Inconsistencies and flaws in the input data contribute significantly to model bugs. If the set of questions to be posted on the data and the type of outcomes expected from the AI model are not clear, then it is possible that the model will lose accuracy.

. Hyperparameter tuning
Model performance can be improved by tuning the hyperparameters properly in order to control the behavior and performance of any AI model. Descriptive statistics of the dataset also support feature scaling and outlier detection, and in turn, the model performance can be improved.
AI models are never perfect; however, the effort is to build an AI model which is close to perfection and performs well across varied classes of datasets. The
accuracy of any AI model is evaluated with the help of several performance metrics
[5, 6] like confusion matrix, receiver operating characteristics, the area under curves,
data models, etc. The key parameters for AI model evaluation are listed below:
1. Receiver Operating Characteristics
Receiver operating characteristic (ROC) curves are very useful and important
parameters used in machine learning applications for the assessment of classifiers
and the main objective of classifier assessment is to decide a cutoff value for the
underline dataset.
Consider the healthcare application, particularly in the context of pathological
tests. Generally, in medical biochemistry, for vitamin B12, 200 is considered the
cutoff value. The patient population having a B12 value below 200 is grouped
into the deficient patients, and the patient population having a B1 value above
200 is grouped into the normal patients. However, clinically it is also possible
that there are false positives, i.e., the patients having a B12 value more than 200
are deficient and there are false negatives, i.e., the patients having a B12 value
< 200 are normal.
ROC, or ROC curve, is a statistical plot in the graphical representation that
justifies and explains the performance of binary classifiers. As its discrimination
threshold is varied, the curve is created by plotting the true positive rate (TPR)
against the false positive rate (FPR) at varied threshold settings. The false positive
rate (fall-out) can be calculated as (1—specificity) and the ROC curve plots
sensitivity (true positive rate) against the false positive rate (fall-out). When the
probability distributions for both detection and false alarm are known, the ROC
curve can be generated by plotting the cumulative distribution function of the
detection probability on the y-axis and the cumulative distribution function of
the false-alarm probability on the x-axis.
ROC curves are commonly used in medicine and engineering to determine a
cutoff value for a test, basic experimentation, or any diagnostic in the healthcare
domain. For example, the threshold value of a 4.0 ng/ml for the prostate-specific
antigen (PSA) test in prostate cancer is determined using ROC curve analysis,
and there are many such similar applications in the diagnostic. A test value below
4.0 is considered normal, and above 4.0 is considered abnormal. However, it is
important to note that there will always be some patients with PSA values below
38 3 Data-Centric Principles for AI Engineering

4.0 that are abnormal (false negatives) and those above 4.0 that are normal (false
positives) regardless of the cutoff value chosen. The goal of ROC curve analysis
is to determine the optimal cutoff value that balances the tradeoff between true
positives and false positives to maximize the overall diagnostic performance of
the test.
2. Sensitivity and Specificity
Sensitivity and specificity are very important statistical measures of the perfor-
mance of a binary classification test or any binary classifiers in the context of
machine learning.
. Sensitivity which is also referred as true positive rate measures the proportion
of positives that are correctly identified as such (e.g., the percentage of sick
people who are correctly identified as having the condition).
. Specificity which is also referred as true negative rate measures the proportion
of negatives that are correctly identified as such (e.g., the percentage of healthy
people who are correctly identified as not having the condition).
3. Precision and Recall
. In the context of pattern recognition and information retrieval with binary
classification, precision (also called positive predictive value) is a measure of
how many of the retrieved instances are actually relevant.
. Recall (also known as sensitivity) is a measure of how many of the relevant
instances were retrieved.
Both precision and recall are based on an understanding and measure of relevance.
4. Cross-Validation
. Cross-validation (also known as rotation estimation) is a technique used to
assess how the results of a statistical analysis will generalize to an independent
dataset
. It is mainly used in the applications as well as system settings where the key
objective is prediction. The cross-validation is mainly used to define a dataset
for testing the model in the training phase, and it is also known as the process
of the validation of the dataset. This helps to limit the problems like overfitting
and to give insight on how the model will generalize to an independent dataset
(i.e., an unknown dataset, for instance from a real-world problem).
5. Ensemble Methods
. Integration and combination of multiple learning algorithms to improve the
predictive performance is known as the ensemble approach. In statistics and
machine learning applications, ensemble methods are becoming more popular
for enhanced performance as compared to any other existing algorithms.
. However, there can be computational overhead, and there is always a tradeoff
between this overhead and performance.
3.3 Challenges 39

• It is what if analysis based on the statistical technique


Sensitivity
Analysis • Useful for variable analysis to improve on the predictions

• It uses numerical methods to calculate the model errors


Residual
Analysis • Useful to validate the quality of model

• It uses comparison method with the benchmark model


Benchmarking • It is useful to check performance on different data classes

• It uses security assessment of the model for attacks


Security
Audits • It is useful when the attacker can access training data set

• New training samples can be added to fill the gaps


Augmentation • It is useful to augment audio, text, images etc.

Fig. 3.2 Model debugging strategies

6. Bagging
. Bagging is a process of bootstrap aggregating where a meta-algorithm using an
ensemble approach is designed in order to improve the accuracy and stability
of any AI algorithm.
. It is mainly used in classification and regression techniques and helps to reduce
variance and also avoid overfitting.
In addition to these issues causing bugs in the model, the developer also uses model
debugging strategies which are depicted in Fig. 3.2.

3.3 Challenges

Modern AI engineering is facing several challenges and design issues, particularly


in business intelligence and application development. Due to the increasing number
of users and devices connected to the Internet, scalability is becoming one of the
major issues, and this section focuses more on the key challenges for modern AI
engineering. These challenges are listed below:
40 3 Data-Centric Principles for AI Engineering

. Current state of the data


More than 80% of the data coming from different organizational use cases today is
unstructured and unlabeled. This data cannot be taken readily for AI processing. The
data coming from contracts and healthcare data are uncurated, and labeling this data
is a big challenge. In this case, manual data labeling, preprocessing, and processing
approaches are required.
. Training data development
Many AI models are built using the developed training data at one time. However, the
training data development is an iterative process and not a one-time process. Model
build using iterative training data gives better accuracy. This iterative training data
development process is presented in Fig. 3.3.
. Application-specific models

In an actual production environment, applications and the concerned datasets are


more important than the models. All AI applications actually have data operations
on the varied classes of datasets to get to production. These operations include
data engineering, cleaning and data immigration, postprocessing and business logic,
data scaling for enhanced predictions, prediction postprocessing, etc. As stated in
[7], deep learning is not yet completely can be considered the only final solution
for most real-world automation. We need considerable attention to the prior injec-
tion, postprocessing, and other engineering operations in addition. In the sequel, the
companies dealing in AI model development are now becoming consulting shops

Model Updation

Define Deploy
Problem Solution

Data Updation

• New data
. Fix model bugs
. Drift adaptation
. Schema updation

Fig. 3.3 Iterative training data development


3.4 Data-Centric Principles 41

and service providers. This is how AI engineering is now fundamentally shifting


from a model-centric to a data-centric approach.
. Paradigm shift
A paradigm shift from model-centric to data-centric is one of the key challenges for
today’s AI engineering. Making AI models more trustworthy with better performance
is the main objective of any AI-driven application development, and data-centric
AI is becoming more useful on this front. Focusing on more quality data than the
applications and making this quality data available for model building is the key
challenge for data-centric AI. Lack of uniformity in data labeling, building models
even for small data for the use cases where big data is not available, and more effective
no-code, low-code tools are some issues to be addressed for this paradigm shift.

3.4 Data-Centric Principles

Formulation and iteration of a more accurate AI model for application development as


well as prototyping as a service, there are a few guiding principles that are presented
and discussed in this section. Solution designing, technical architecting and model
building, and data science are the main functional components of data-centric AI.
The emerging trends include cultivation as a service, prototyping as a service, agile
data science, and model as a service where the theory of constraints plays a crucial
role in end-to-end workflow from data sources, data engineering, and data science
to data visualization.
. End-to-end modeling
Efficient debugging and introspection is the key requirement for any AI model, and
in view of this training, the model as a black box will not suffice. To address this
issue, model training as a white box is required and it also makes debugging and
introspection more easy and clear. End-to-end modeling also helps to improve the
fine grain performance by doing error analysis, mitigating these errors, decomposing
complex models into pipelines of submodules, and in turn resulting in the debug-
gable building blocks, version control, etc. This scenario of end-to-end modeling is
presented in Fig. 3.4.
. Evaluation and iteration

There should be proper architecture in place so that we can pick the data at different
parts to evaluate a piece of components and enable incremental data training, and it
also enables end-to-end evaluation and iteration of the AI model. It is also important
to decide the performance metrics at the different stages of the pipeline to improve
the performance of AI models. Different operators at different stages of the pipeline
and global applications along with their scores can also be customized with the help
42 3 Data-Centric Principles for AI Engineering

Positive
Negative

75%F1
Input Data End-to-end Model Classified, clustered,
linked entities

Fig. 3.4 End-to-end modeling

Entity Sentiment Entity Reducer Positive


Trigger Classifier Linker Negative

52% F1 87%F1 90%F1 94%F1 75%F1


Data Local Operator Scores Global Application Score

Fig. 3.5 End-to-end evaluation and iteration

of per component and end-to-end evaluation and iteration. The scenario depicted in
Fig. 3.5 throws more light on this design principle.
. Selection of appropriate technique

Each building block in the AI engineering process performs a data frame transforma-
tion. For example, a document data frame is transformed into a classified document
data frame after the application of a particular classifier. However, the selection of
a classifier plays important role in the entire process, and the task can be done by
applying a heuristic classifier or it can be also done with a machine learning classifier.
In addition to this and based on the data size, we can also use a deep learning classifier
for better prediction and higher accuracy. It is recommended that one should start
with the simple model and eventually we can add complexity to the model building
depending on the requirements.
. Iteration with programmatic labeling
In the AI model-building process, programmatic labeling is very much useful in
bringing rapid iteration. Traditionally, in the process of pipeline model building,
manual data labeling is used which is the major bottleneck in iteration. The existing
approaches being adopted are listed below along with their limitations:
3.4 Data-Centric Principles 43

. Outsource to labeling vendors


– Privacy challenges
– Lagging in domain expertise
– Difficult for audit and governance
– More difficult to adapt
. Label with in-house experts
– Slow
– More cost
– Difficult for audit and governance
– More difficult to adapt
Labeling functions, code, or more efficient automated methods can be created to
label the data instead of labeling them manually, and it also enables rapid iterations.
This programmatic labeling and its advantages are presented in Fig. 3.6.
Along with these data-centric principles of AI engineering, software engineering
principles for data-centric AI are also important for more accurate modeling. The
software engineering principles are listed below:
. Single responsibility principle/modularity
This principle states that there should be prioritization of natural modularity ensuring
the single responsibility principle in the end-to-end AI model-building process.
. Debuggability and introspection
Debuggability and introspection should be the part of end-to-end evaluation and
iteration process.

Work Supervision
Model

Label
Training
Programmatically
Data Model Iteration

Analyze
Data Iteration

Fig. 3.6 Programmatic labeling


44 3 Data-Centric Principles for AI Engineering

. Premature optimization
Machine learning is a very complex set of code and functions and while integrating
it into the pipeline, premature optimization should be avoided.
. Incremental development
Change management is the routine process in the software development life cycle
wherein the updates and change requirements always come from the clients, and it is
to be incorporated in every phase of the software development life cycle. Anticipating
change management and enabling incremental development is an important principle
of data-centric model development.

3.5 Summary

Data-centric AI is becoming a more popular trend now which has more emphasis on
the data than the model. This chapter first presents the reference use case to understand
how the technological and design requirements change from model-centric AI to
data-centric AI. These differentiating parameters include technology, devices, team,
and modules. In the next part of this chapter, AI engineering aspects are presented
and discussed in the view of model-centric and data-centric AI model building with
examples. A few important bugs in the AI model-building process are also discussed
in this section. The key parameters for AI model evaluation are also elaborated which
include ROC curves, sensitivity, and specificity, precision and recall, cross-validation,
ensemble methods bagging, boosting, etc. Finally, this chapter concludes with the
key challenges as well as important data-centric principles which are recommended
to follow in the data-centric AI-building process.

References

1. Dey, N., Shinde, G., Mahalle, P., & Olesen, H. (2019). The internet of everything: Advances,
challenges, and applications. De Gruyter. https://doi.org/10.1515/9783110628517
2. Mahalle, P. N., Gitanjali, R. S., Shinde, G. R., Pise, P. D., Deshmukh, J. Y., & Jyoti, Y. D. (2022).
Data collection and preparation. In Foundations of data science for engineering problem solving
(pp. 15–31). Springer
3. Boy, G. (2017). The handbook of human-machine interaction: A human-centered design
approach. CRC Press
4. Pydi, M. S., Jog, V. (2020). Adversarial risk via optimal transport and optimal couplings. In
Proceedings of the 37th international conference on machine learning, in proceedings of machine
learning research (Vol. 119, pp. 7814–7823). https://proceedings.mlr.press/v119/pydi20a.html
References 45

5. Gupta, A., Parmar, R., Suri, P., & Kumar, R. (2021). Determining accuracy rate of artifi-
cial intelligence models using Python and R-studio. In Proceedings of the 2021 3rd interna-
tional conference on advances in computing, communication control and networking (ICAC3N)
(pp. 889–894)
6. Patalas-Maliszewska, J., Paj˛ak, I., & Skrzeszewska, M. (2020). AI-based decision-making model
for the development of a manufacturing company in the context of industry 4.0. In Proceedings
of the 2020 IEEE international conference on fuzzy systems (FUZZ-IEEE) (pp. 1–7)
7. Paszke, A. et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In
H. Wallach (Ed.), Advances in neural information processing systems (Vol. 32, pp. 8024–8035)
Chapter 4
Mathematical Foundation
for Data-Centric AI

4.1 Overview

Mathematics provide better tools in order to understand and analyze the underlying
data. In data analysis, mathematics is crucial. Some of the fundamental quantitative
ideas and methods that are applied in data analysis are listed below.

4.1.1 Statistics

Statistics is defined as the part of mathematics which deals with data collection,
analysis, interpretation along with presentation of the data. It provides foundation for
concepts like regression analysis, probability, variance, hypothesis testing, etc. It also
includes concepts of descriptive analytics like central tendency measure, variability,
inferential statistics [1], etc. To recognize patterns, trends, and relationships in the
data is the crucial part of statistical data analysis.

4.1.2 Linear Algebra

This part of mathematics includes concepts like vector spaces, linear equations,
matrices, etc. In the data analysis, techniques like dimensionality reduction, clas-
sification, clustering, etc. are used for better results. Linear algebra is a part of
mathematics which provides powerful tools for data analysis. It permits analysts
to represent and manipulate data in a such a way that hidden pattern along with
relationships in the data can be discovered. This will allow analysts to make precise
predictions. To find the relationship between two variables linear regression can be
very helpful. Linear regression uses a line to fit the underlying datapoints. To reduce
dimensionality of underlying data, principal component analysis (PCA) is used. It

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 47
P. N. Mahalle et al., Data Centric Artificial Intelligence: A Beginner’s Guide,
Data-Intensive Research, https://doi.org/10.1007/978-981-99-6353-9_4
48 4 Mathematical Foundation for Data-Centric AI

helps in discovery of linear combination of variables which represent variance in


the data. For this we need the eigenvectors and eigenvalues of covariance metrics of
underlying data [2].

4.1.3 Calculus

To observe growth rate and change calculus can be used. Calculus can be used in
tasks such as optimization, fitting of the curve, and integration with data analysis.
The rate of change at any point is given by the calculating the derivative of function
at that point. The same calculations can be used for finding the optimal values of
parameters in the model. To get the optimized functions, calculus finds maximum or
minimum value of a function. Optimization is a crucial part of data analysis as it finds
the best variables for model which helps in prediction and minimize errors. To design
models for systems that change over time differential systems are used. Differential
equations can be implemented to design systems such as population growth, disease
spread, and latest economic trends. Calculus gives us very powerful tools to assess
and analyze the complex data [3].

4.1.4 Probability Theory

The study of likelihood random occurrences comes under the branch of mathe-
matics known as probability theory [1]. This theory is often used in data analysis for
modeling uncertainty and makes the accurate predictions. It can be used tasks such
as hypothesis testing, Monte Carlo simulations, and Bayesian inference.
. Probability Distribution: Probability distributions give the likelihood of occur-
rence of a random variable. For modeling and analysis of data distribution, proba-
bility distribution is used. It can also be used to alter parameters used in statistical
modeling.
. Bayes Theorem: Bayes theorem is a keystone of probability theory. It gives us
the way to adjust the hypothesis probabilities if any or new information comes
to light. This theorem is mostly used in the field of data analysis in order to alter
previously formed assumptions in the presence of newly surfaced information.
. P-value: The p-value is used for hypothesis testing. It gives us information whether
underlying data supports null hypothesis or supports alternate hypothesis. Null
hypothesis means there are no difference in groups or absence of an effect vice
versa for alternate hypothesis. P-value is calculated by comparing it to predeter-
mined significance levels (often denoted as α-alpha). If p-value is smaller than
significance level, then we can reject null hypothesis and support the alternate
4.1 Overview 49

hypothesis. Significance level is probability threshold which gives us baseline for


rejection of null hypothesis. Most commonly used significance levels are 0.05
(5%) and 0.01 (1%). A significance level 0.05 does not mean 5% chance of null
hypothesis being true or false. It just defines the threshold for consideration against
null hypothesis.
We can take an example of a coin. We have to determine whether a coin is fair or
not.
Your null hypothesis and alternate hypothesis will be as follows:
Null hypothesis (h0): Coin is fair
Alternate hypothesis (h1): Coin is not fair
Next step will be to set up an experiment. We will toss coin 100 times and observe
the possibilities. Suppose we get 40 heads and 60 tails. Now based on these obser-
vations, we have to decide whether the coin is fair or not. For p-value calculation,
we can use binomial test as we have two distinct outcomes. The probability of each
outcome is 0.5. Let us say significance level is 0.05. The p-value is most probably
calculated by the domain experts.
The Python implementation will be as follows:

The p-value is calculated with this test is 0.05.

P-value is equal to significance value; then the null hypothesis is true. It means a
coin can get 40 heads which means a coin is fair.
If calculated p-value comes out to be 0.02, then we will have to reject the null
hypothesis as p-value is less than significance value. It will mean the coin is unfair.
50 4 Mathematical Foundation for Data-Centric AI

4.1.5 Multivariate Calculus

Multiple-factor analysis is the basis for mathematics behind multivariate calculus. It is


implemented in multiple areas of data analysis such as gradient descent, optimization,
and partial differential equations [3].
i. Gradient Descent: It is most commonly used optimization technique for data
analysis and machine learning. For iterative updating of parameters in the direc-
tion of the sharp descent, we need to computation of a gradient of a cost function
with respect to the model parameters.
ii. Multivariate Regression Analysis: To model the relationship between multiple
independent variables and dependent variable, multivariate regression analysis
is used. Multivariate calculations are used to assess the importance of variables
and finding the regression coefficients needed in regression equations.
iii. Optimization of Multivariate Functions: Frequent optimization of multi-
variate functions like probability functions, objective functions, or cost functions
is necessary while performing data analysis tasks. Main parts of the functions
are identified with the help of multivariate analysis. It also helps in identification
of function like maxima, minima, or saddle points for further use.

4.1.6 Graph Theory

The branch of mathematics that deals with study of networks and their properties
is referred as graph theory. It can be implemented in data analysis tasks such as
community detection, clustering, and social network analysis. Graph theory can be
used for data analysis to study and analyze the relationships between things or entities
represented as nodes on the graph and the connections and interactions between them
represented as edges [4].
i. Link Prediction: On the basis of the node properties or structure of graph,
process of determining the probability of connection formation among nodes
is done. This can be implemented in recommendation systems, social media
analysis, or drug discovery, etc. For implementation of graph theory, multiple
techniques can be applied for link prediction such as common neighbors, Jaccard
similarity, preferential attachment, or random walk.
ii. Network Clustering: The process of arranging network (nodes) into clusters
or communities on the basis of underlying similarity or connectivity is known
as network clustering. The concept of network clustering can be implemented
in scenarios such as image segmentation, text clustering, or gene expression
analysis. For graph clustering, many techniques are offered by graph theory such
as modularity optimization, spectral clustering, and hierarchical clustering.
iii. Visualization: Graph theory can be implemented in visualization of high-
dimensional data. It can be done by presenting every data point as a node
and connection between them can be made on the basis of distance from
4.2 Statistical Data Analysis 51

each other (how far or near data points are). The difficult to find patterns or
clusters in high0dimensional space can be easily recognized with the help of
high-dimensional visualization (two or three dimension).

4.2 Statistical Data Analysis

Statistical data analysis is referred as the collection of tasks such as collecting,


assessing, and interpreting data with the help of statistical techniques. To obtain
meaningful insights and interpretation, statistical data analysis is used.
The procedure usually consists of several steps, such as:
a. Formation of Hypothesis or Research Query: It is vital in research process.
This involves identifying the exact topic or problem the research should address
and preparing a brief and concise hypothesis or question to guide the research.
When a researcher wants to investigate or discover something first a research
question is formed. It should be clear, concise, and focused on the main topic or
issue being studied.
The research question should be formed in a way that it can be solved by
collection and analysis of data. A hypothesis is defined as the theoretical or
hypothetical interpretation or forecast made regarding event that is still being
studied. It also identifies the connection between variables along with their rela-
tionship to each other. A hypothesis should be verified or invalidated with the
help of empirical data in order to have perfect conclusion.
In both situations, defining the research question or hypothesis is a crucial first
stage in the research process because it enables the researcher to concentrate on
the main issue or problem being studied and to create a well-organized research
plan.
b. Gathering of Data from Various Sources: It is the initial step in statistical data
analysis. It consists of multiple methods such as surveys, experiments, obser-
vations, etc. After the collection of data is done, the data has to be arranged in
such a way that analysis can be performed on it. This may include cleaning and
preprocessing data to eliminate errors and inconsistencies.
When you understand the query or underlying issue the next step is to locate
the data source that will help in the research. Databases and openly accessible
datasets are used for the purpose.
The best technique for data collection will depend on the source of the data you
are looking for. This can be done using techniques such as surveys, interviews,
observations, tests, the use of preexisting datasets, etc.
. Designing the data gathering tools: If you want to collect data through surveys
or interviews, you’ll need to design tools such as surveys and interview scripts.
It is important to ensure that the tools are understandable, objective, and able
gather the necessary data.
52 4 Mathematical Foundation for Data-Centric AI

. Once tool designing is done, the gathering of data can be initiated. Data
gathering should be done in methodical and concise manner.
. After data collection is done, data should be prepared for preprocessing.
This includes assessing the data for underlying errors, contradictions, missing
values, and transforming it into the structure that is fit for analysis.
c. Analysis of the Data: Analysis of data should be done with the help of statis-
tical techniques. This can include techniques like hypothesis testing, regression
analysis, etc. or descriptive statistics like measures of central tendency, vari-
ability, etc. Descriptive statistics includes collection, organization, and analysis
of underlying data so researchers can make accurate interpretations. The main
aim of the statistical data is to identify hidden patterns, trends, and relationships
in the data. This involves using numerical and graphical techniques to summarize
and describe key elements of the dataset. Central tendency, variability, and shape
of the distribution are the techniques using descriptive statistics. Mean, median,
and mode are examples of measures of central tendency that provide details about
the normal or mean value of a dataset. To disclose the information about how
data is being distributed, measures of uncertainty include range, variance, and
standard deviation.
Commonly used descriptive statistics can include frequency distributions,
histograms, box plots, scatter plots, correlation factors, etc. These tools can be
very helpful to observe the data distribution of underlying data, detecting outliers,
identification of hidden pattern in the data, relationship between the variables in
the dataset, etc. This can applied in the fields like business, economic, social
sciences, and natural sciences.
With the tools and techniques, you can make informed decisions based on
empirical evidence. It can also help in summarization of data which can be
shared in a clear and concise manner. The primary purpose of inferential statistics
is to determine whether observed differences or similarities between groups or
variables are genuine or a coincidence.
Various techniques such as regression analysis, confidence intervals, and
hypothesis testing are used in inference statistics to analyze data and derive infor-
mation about communities from samples. Many fields such as economics, social
sciences, engineering, and health sciences rely heavily on inferential statistics.
This permits researchers to make decisions or predictions on the basis of data
they have collected with higher accuracy.
d. Presenting the Results: Giving clear and concise representations of the findings
is vital task in data analysis process. It can be done with the help of report,
slideshows, and data visualization tools. Visualization includes graphs, charts,
histograms, and diagrams which helps in effective communication of your main
research. For example scatter plot or line chart helps in visual representation of
two variables changing over certain time period. The underlying data and the
findings will impact the visualization you are going to use.
4.3 Data Tendency and Distribution 53

4.3 Data Tendency and Distribution

Data tendency and distribution are crucial topics in statistics that explains central
tendency and distribution of the dataset [1]. First we will see the concept of data
tendency in detail.

4.3.1 Data Tendency/Measure of Central Tendency

Data tendency represents single value (average value) that explains dataset by iden-
tifying central tendency of the dataset. It gives insights about representative value
given by data. Data tendency measures are given by:
. Mean: It is a basic concept in the statistics. It is given by addition of all the values
in the dataset and dividing it by total number of values in the dataset. Mean is
easily affected by the outliers in the dataset.
Mean can have significant implementations in the field of data analysis and its
interpretation. In the field of economics or finances, mean represents the performance
of the investments done by investors. In research fields, mean represents abstraction
or summarization of experimental data.
. Median: The median value is the middle value when we sort the entire dataset
in ascending or descending order. If dataset entries are in odd in number, then
middle value is easily recognized. But if dataset entries are even in nature, then
middle value is given by calculating mean of middle two values. Median is less
affected by outliers in the dataset.
For the analysis of skewed distributions (in which outliers can have great impact
on mean) like income, house prices, etc., median can be used.
. Mode: The mode is given by the number which repetitively occurred in the dataset.
We can have more than one modes or in some cases no mode.
For the analysis of categorical data mode is used. For pointing out most common
purchase in the general store, most commonly used vehicle, common response given
in survey, etc. mode can be used. Also for identification of outliers mode can be used
(when mode is not there or possibility of multiple modes are there).
Consider the following dataset and calculate mean, median, and mode.
9, 11, 14, 10, 41, 35, 22, 32, 28, 8
Mean of the dataset can be calculated as follows:

Mean = (9 + 11 + 14 + 10 + 41 + 35 + 22 + 32 + 28 + 8)/10
Mean = 21
54 4 Mathematical Foundation for Data-Centric AI

Median of the dataset can be calculated as follows:

Original dataset = 9, 11, 14, 10, 41, 35, 22, 32, 28, 8

In median calculation, first step is to sort the entire dataset in ascending or


descending order.

Sorted dataset = 8, 9, 10, 11, 14, 22,28, 32, 35, 41

As entries are even in nature, then middle value is given by calculating mean of
middle two values.

Median = (14 + 22)/2


Median = 18

Mode of the dataset can be calculated as follows.


As the mode is given by the number which repetitively occurred in the dataset,
there is no number in the dataset that is frequently occurring. There is no mode for
this dataset.
Python implementation of the mean, mode, and median is as follows:

We will consider the same dataset used before.


9, 11, 14, 10, 41, 35, 22, 32, 28, 8
The output of the above code will be as follows:
4.3 Data Tendency and Distribution 55

4.3.2 Measure of Dispersion

It is also called as a measure of variability. It gives us the spread of data points and
also provides extent to which value differs from each other [1]. Commonly known
measures of dispersion are as follows:
i. Range
ii. Variance
iii. Standard deviation
iv. Interquartile range
v. Percentile quartile

. Range
It is simplest form of measure of dispersion. It can be calculated as difference
between maximum value in dataset and minimum value in the dataset. It gives us
idea about the spread of data points. Range is very sensitive to the outliers.
The Python implementation of range is as follows:

The outcome is as follows:

. Variance
It is defined as average squared difference between each data point and the mean.
Higher the variance, higher dispersion in the data is observed.
Sample is nothing but sample of data, whereas population is an entire dataset.
This is the main difference between sample and variance.
If we are calculating the variance for a sample, then the formula is as follows:
∑n
i=1 (x i
− x)2
s2 =
n−1
56 4 Mathematical Foundation for Data-Centric AI

where
S 2 is sample variance.
x i is individual data point in sample.
x is mean of data points.
n is sample size.
If we are calculating the variance for a population, then formula is as follows:
∑N
i=1 (x i − μ)2
σ =2
N
where
σ 2 is population variance.
x i is individual data point in sample.
u is population mean.
N is population size.
The Python implementation of variance and standard deviation is as follows:

The outcome is as follows:


4.3 Data Tendency and Distribution 57

. Standard Deviation
The square root of variance is known as the standard deviation. It gives us average
deviation from mean. Just like variance, the higher the standard deviation spread
will be higher.
. Interquartile Range
It is range between first quartile (25th percentile) and the third quartile (75th
percentile) in a dataset. It gives us spread of data in middle 50%. It is less affected
from the outliers.
The Python implementation of standard deviation and interquartile range is as
follows:

The outcome is as follows:

. Percentile Quartile
It is range between given two percentiles. It gives spread between given two
percentiles.
The Python implementation of percentile quartile is as follows:
58 4 Mathematical Foundation for Data-Centric AI

The outcome of percentile quartile is as follows:

4.3.3 Data Distribution

Data distribution means how the data is distributed or spread across values present
in the dataset. It gives us vital information such as range and variability of the data.
The data distribution consists of:
i. Normal distribution
ii. Skewed distribution
iii. Uniform distribution
iv. Bimodal distribution

. Normal distribution

This distribution is also called as Gaussian distribution. It is symmetric, and most of


the values in the dataset are distributed around the mean. The representation of the
normal distribution is bell-shaped.
We can distinguish it using mean and standard deviation.
Standard deviation gives spread or variation of data within a dataset. It calculates
amount by which the data has deviated from the mean.
If we have to identify the normal distribution, then look for certain properties
that give clear indications. A normal distribution is symmetric and unimodal (curve
having single peak) in nature.
4.3 Data Tendency and Distribution 59

Normal distribution can be depicted with diagram as shown in following figure


(Fig. 4.1).
. Skewed Distribution

The measure of asymmetry in the distribution is given by asymmetric distribution.


The longer right tail means positive skewness and the longer the left tail means
negative skewness. Skewed distributions tend to be asymmetric in nature. As shown
in figure, the datapoints are not spread around the mean which results in the longer
tail.
Skewed distribution can be depicted with diagram as shown in following figure
(Figs. 4.2, 4.3).

Fig. 4.1 Normal distribution

Fig. 4.2 Left-skewed distribution


60 4 Mathematical Foundation for Data-Centric AI

Fig. 4.3 Right-skewed distribution

Skewed distributions can be used to detect potential outliers as data distribution


is not symmetric in nature. Skewed distributions can be seen in real-life scenarios
like financial data, income distributions, marketing, and consumer behavior.
. Uniform Distribution
When all the data points in the dataset have equal probability and distribution is even
across the range, then it is referred as uniform distribution. This distribution gives
us randomness and fairness which can be used as a baseline for statistical analysis
as each data point has equal probability.
Uniform distributions can have applications such as random number generator,
random sampling, optimization algorithms, etc.
. Bimodal Distribution
As the name suggests, the distribution of underlying datapoints has two modes with
more than one central tendencies.
Bimodal distributions can help in revealing the underlying subpopulation which
can turn out to be crucial for decision-making process, understanding of complex
systems, etc. Bimodal distribution can have applications in fields such as biomedical
research, i.e., in a clinical trial, patients can have different response to a treatment.
It can help doctors to assess and adjust treatment protocols.
Along with these distributions, there are some distributions that are used such as
exponential, log-normal, gamma distributions each distribution has its own properties
with it.
In order to perform accurate data analysis, understanding of underlying data
distribution and tendency is vital. These concepts are basics in statistical analysis,
hypothesis testing. Also drawing the inferences from sample about population can
be done.
4.4 Data Models 61

4.4 Data Models

Data models are representations of data structures, relationships, and rules that define
how data is organized, stored, and accessed within a system or database. ML algo-
rithms work on various types of dataset; it may be structured, unstructured, or semi-
structured in nature [5]. On the basis of organization and its underlying format of the
data, the following models have been created:
. Structured data
. Unstructured data
. Semi-structured data
They provide a conceptual framework for understanding and working with data.
The data models are explained in detail.
. Structured Data: The model that has precise organization for its data in a prede-
fined format is known as structured data. It has its own predefined structure and
is usually stored in relational databases or tabular data formats. We can recognize
structured data with following properties:
i. Consistent format: Every data element has to be adhered to specific data
format defined by schema or tabular structure.
ii. Predefined schema: Everything is predefined including underlying structure,
data types, and relationships between data elements.
iii. Organized in rows and columns: It has a tabular structure consisting of rows
and columns. Rows give us specific records, and columns give us specific
attributes or fields of underlying data.
iv. Easy to query and analyze: The organization of structured data is so well
defined that underlying data can be easily be used for analysis purpose using
queries with the help of Structured Query Language (SQL) or other database
tools.
The well-known examples of structured data can be seen in financial records,
sales transactions, employee records, sensor data, etc.
. Unstructured Data: The data that does not have a predefined structure or orga-
nization is known as unstructured data. It does not have any specific structure or
schema. It typically has human readable format. Unstructured data has following
characteristics:
i. No predefined structure: It does not follow any predefined model or schema.
ii. Varied formats: It can have different formats. It can exist in various formats,
such as text documents, emails, social media posts, images, audio files,
videos, and web pages.
iii. Difficult to analyze: As unstructured data does not have any specific format,
the analysis of such a data is quiet challenging.
62 4 Mathematical Foundation for Data-Centric AI

iv. Requires specialized techniques: Techniques like NLP, image recognition,


and machine learning are implemented to draw out meaningful information
from unstructured data.
The well-known examples of structured data can be seen in emails, social media
posts, customer reviews, images, audio recordings, videos, and documents like
PDF files or Word documents.
. Semi-structured Data: It is a combination of structured and unstructured data.
It has some kind of structure but does not follow any specific schema. It follows
some form of format such as tags, labels, or attributes which gives us partial details
about underlying data. Unstructured data has following characteristics:
i. Partial structure: It supports both structured and unstructured data.
ii. Flexible schema: Half-defined schema for underlying data allows variations
in organization and structure of data.
iii. Combination of formats: It supports combination of formats such as XML,
JSON, or key-value pairs.
iv. More accessible than unstructured data: Because of the partial structure, this
type of data can be used for query and analysis purposes.
The well-known examples of semi-structured data can be seen in XML files, JSON
documents, log files, web server logs, and NoSQL databases.
Structured data follows very rigid format along with schema, unstructured data
does not have any organization or structure, while semi-structured data has some
kind of structure but not as strict as structured data. For efficient data management,
storage, and analysis purpose the understanding of the underlying data becomes
crucial.

4.5 Optimization Techniques

Optimization techniques are used to enhance the performance of the underlying


system, process, or algorithm. As per the given requirement, optimization techniques
tend to find best possible solution [6]. Optimization techniques are used in fields such
as mathematics, computer science, engineering, economics, and data science some
commonly used optimization techniques in machine learning are listed as follows:
. Gradient Descent (GD)
In order to find the global minimum or maximum, this iterative optimization algo-
rithm is used. You can find implementation of gradient descent algorithm in the
field of machine learning. The loss function is used by algorithm in order to update
the model parameters. Stochastic gradient descent (SGD) and mini-batch gradient
descent are very popular variation of gradient descent.
4.5 Optimization Techniques 63

. Stochastic Gradient Descent (SGD)


In this variant, the model parameters are updated for each training instance rather
than the entire dataset. SGD is efficient in terms of calculation, particularly for large
datasets, and it can avoid local optima more easily because of the stochastic nature
of updates.
. Mini-batch Gradient Descent
This variant is combination of batch gradient descent and stochastic gradient descent.
In mini-batch gradient descent, the model parameters are updated using small batch
of training examples rather than individual example or entire dataset. It gives best of
computational efficiency and noise reduction.
. Adaptive Learning Rate Methods
These optimization techniques are dynamic in nature as learning rate is updated
in order to boost up the convergence, speed, and stability. The implementation of
adaptive learning rate methods is AdaGrad, RMSprop, Adam, etc. Historical gradient
information is taken into consideration while updating learning rate.
. RMSprop
Adaptation of learning rate for each parameter is done by Root Mean Square Prop-
agation (RMSprop) optimization algorithm. RMSprop is implemented by taking
the moving mean of gradients that are squared for normalization of learning rates.
RMSprop tries to solve the problem caused in deep neural network due to vanishing
learning rate problems.
. Adam
Adam (Adaptive Moment Estimation) is an amalgamation of two commonly used
algorithms, namely momentum and RMSprop. On the basis of mean (first-order
moment) and second-order moment (non-centered variance), new learning rate
parameters are adapted. The dynamic adjustment of the learning rate while training
is good for variety of ML tasks.
. Adagrad
Adagrad (Adaptive Gradient) optimization algorithm in which learning rate for each
parameter is adjusted on the basis of gradients that are collected throughout history.
Whenever there are a little to no updates in parameters, larger learning rates are
given by the Adagrad. If learning rates are too small, then there are considered to be
frequent updates in parameters. Sparse data management is smartly done by Adagrad.
. L-BFGS
If there are many parameters involved in the dataset, then Limited-memory Broyden–
Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm can be used. It uses a
quasi-Newton method that estimates the inverse Hessian matrix for search direction.
L-BFGS is used in cases where the main concern is memory requirement like deep
neural network training.
64 4 Mathematical Foundation for Data-Centric AI

. Momentum
Momentum is an optimization technique that speeds up the optimization process by
adding a fraction of the previous settings update to the current update. It helps in
elimination of noise from the gradients. Also it enhances the convergence when there
is flat or plateau regions in the field.
. Convex Optimization
The focus is on solving optimization problems with convex objective functions and
boundary conditions specified. These optimization techniques promise to find global
optimum efficiently. The implementation of convex optimization can be seen in the
fields of machine learning, signal processing, control systems, etc.
Along with these techniques, some more techniques are implemented that have
performed well with the data.
. Linear Programming
When the linear objective function and constraints are specified without any ambi-
guity, linear programming optimization can be used. The implementation of this
technique can be seen in the field of operation research, resource allocation problems,
etc.
. Genetic Algorithms
These are evolutionary algorithms inspired by natural selection process as well as
the genetics present in the nature. The individual in the populations is used to find
the solution to the problems. Algorithms iteratively evolve over the population using
operations of selection, crossover, and mutation to find the solution optimal in nature.
. Simulated Annealing
This technique is probabilistic in nature. It is an optimization algorithm which simu-
lates annealing process used in metallurgy field. It works by iterative exploration of
the solution space beginning from an initial solution, allowing upward movement
(worse solutions) based on a probability distribution of underlying space. In order to
avoid local minima, the probability of accepting uphill moves is gradually decreased.
. Particle Swarm Optimization
The inspiration for particle swarm optimization is taken from the behavior of bird
flocking or fish schooling. It handles a population of particles that explore the solution
space by updating their positions in accordance with their own best solution and the
best solution found by the whole population.
. Ant Colony Optimization
As the name suggests, the inspiration for ant colony optimization is taken by behavior
of ants. The ants use pheromones for tracking each other. The algorithm uses
same simulation for finding out the best possible path depending on the pheromone
concentration. Most commonly seen implementation is traveling salesman problem.
4.5 Optimization Techniques 65

. Constraint Programming
For solving combinatorial problems along with specified constraints, constraint
programming optimization is used. The problem is represented as a set of vari-
ables, domains, and constraints. Afterward, it searches for valid assignments to the
variables which will satisfy all the constraints specified by the problem.
These are commonly used optimization techniques as per the requirements in the
hand. Each technique has its pros and cons, but the choice of technique depends on
the problem at hand and the available resources to solve them.
For improving performance and efficiency of machine learning models, opti-
mization techniques are used. Some of the reasons of importance of optimization
techniques are listed below:
. Model Performance Improvement: The main goal of optimization techniques is to
minimize the error or loss of model. By searching for optimal values for parameters
of model, its performance can be improved. It works well in case of unseen data
and gives us more accurate predictions.
. Efficient Resource Utilization: Optimization techniques use computational
resources such as memory, processing power, and time. The main goal for opti-
mization techniques is efficient utilization of resources that are available. Efficient
utilization becomes very crucial when you are dealing with the large datasets.
. Handling Complex Models and High-Dimensional Data: The machine learning
models are very complex in nature along with number of parameter optimization.
Optimization techniques provide optimal space in which optimal solution to the
problem at hand can be found. This part becomes crucial in case of deep learning
models where millions of parameters are there. In case of high-dimensional data,
it is handled by reducing the dimensions which enhances efficiency.
. Overcoming Non-Convexity: Non-convex optimization includes objective func-
tion with presence of multiple local optima. In these cases, optimization tech-
niques are used to find the good solution in non-convex spaces. Techniques like
stochastic gradient descent (SGD), Adam, and conjugate gradient methods, etc.
are used.
. Regularization and Generalization: In order to prevent overfitting and mini-
mize complexity of underlying model, optimization techniques provide us a
concept called as regularization methods. Regularization techniques include L1
and L2 regularization which aids in minimizing model’s sensitivity to noise along
with outliers in underlying data. With creation of simpler models regularization
enhances generalization to unseen data.
. Hyperparameter Tuning: Hyperparameter tuning includes finding optimal values
for parameters through learning process. Techniques such as grid search, random
search, and Bayesian optimization are used for hyperparameter tuning. Accurate
hyperparameter tuning enhances the model performance along with issues like
underfitting or overfitting are solved.
. Optimization Across Different Algorithms: These optimization techniques can be
applied to wide variety of learning algorithms, including linear regression, neural
networks, support vector machines, decision trees, and more.
66 4 Mathematical Foundation for Data-Centric AI

Optimization techniques are crucial in machine learning for improving model


performance, efficient resource utilization, handling complex models and data, over-
coming non-convexity, regularization, hyperparameter tuning, and enhancing perfor-
mance across different algorithms. They play vital part in achieving accurate and
efficient models that works very well with unseen data.

4.6 Summary

In order to comprehend and use machine learning algorithms, one must have a solid
background in statistics and mathematics. Key mathematical and statistical ideas that
are pertinent to machine learning are outlined here.

References

1. DasGupta, A. (2011). Probability for statistics and machine learning: Fundamentals and
advanced topics (p. 566). Springer.
2. Brownlee, J. (2018). Basics of linear algebra for machine learning. Machine Learning Mastery
3. Brownlee, J., Cristina, S., & Saeed, M. (2022). Calculus for machine learning. Machine Learning
Mastery
4. Patil, P., Wu, C. S. M., Potika, K., & Orang, M. (2020). Stock market prediction using ensemble of
graph theory, machine learning and deep learning models. In Proceedings of the 3rd international
conference on software engineering and information management (pp. 85–92)
5. Jain, A., Patel, H., Nagalapatti, L., Gupta, N., Mehta, S., Guttula, S., et al. (2020). Overview and
importance of data quality for machine learning tasks. In Proceedings of the 26th ACM SIGKDD
international conference on knowledge discovery and data mining (pp. 3561–3562)
6. Sra, S., Nowozin, S., & Wright, S. J. (Eds.). (2012). Optimization for machine learning. MIT
Press
Chapter 5
Data-Centric AI

Data-centric AI refers to an approach to artificial intelligence (AI) that places a


strong emphasis on the use of data. This approach involves collecting, processing,
and analyzing large amounts of data to extract insights, identify patterns, and make
predictions. Data-centric AI is often used in machine learning, where algorithms are
trained on large datasets to learn how to perform specific tasks.
One of the key advantages of a data-centric approach to AI is that it allows for more
accurate predictions and decision-making. AI systems can find patterns and links in
vast volumes of data that may not be immediately obvious to human researchers.
Better forecasts and judgments may result from this.
However, data-centric AI also presents some challenges. For example, collecting
and processing large amounts of data can be time-consuming and expensive. Addi-
tionally, there are concerns around privacy and security when it comes to handling
sensitive data.
Despite these challenges, data-centric AI is becoming increasingly important in
many industries, from healthcare and finance to marketing and manufacturing. As
more and more organizations adopt this approach, it is likely that we will see even
more innovative applications of data-centric AI in the years to come.

5.1 Data Acquisition

Data acquisition in a data-centric approach refers to the process of collecting,


preparing, and storing data for use in artificial intelligence (AI) systems. It is a
critical aspect of the data-centric approach, as the quality and quantity of data avail-
able can directly impact the performance of AI models. Dataset discovery [1], a job
that seeks to find the most pertinent datasets from a data lake (a repository of data
saved in its raw format), is an example of this.
In data acquisition, raw data is collected from various sources and converted into
a format that can be used by AI models. This process involves cleaning the data

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 67
P. N. Mahalle et al., Data Centric Artificial Intelligence: A Beginner’s Guide,
Data-Intensive Research, https://doi.org/10.1007/978-981-99-6353-9_5
68 5 Data-Centric AI

to remove any errors or inconsistencies, transforming the data into a more useful
format, and storing the data in a centralized location for easy access by AI systems.
Data acquisition can involve various types of data, including structured data
(such as databases and spread sheets) and unstructured data (such as text, images,
and videos). The data may be sourced from various internal and external sources,
including sensors, social media platforms, customer feedback, and other data
repositories.
The goal of data acquisition in the data-centric approach is to ensure that the AI
models are trained on a diverse and representative dataset that can produce accu-
rate and meaningful insights. The process requires careful planning, execution, and
ongoing maintenance to ensure that the data remains relevant and up to date.

5.1.1 The Data Acquisition Process

The data acquisition process refers to the set of activities involved in collecting data
from various sources, such as sensors, databases, or manual input, for the purpose of
analysis or processing. The process typically involves several stages, starting with
planning and design, followed by data collection, cleaning, and storage.
During the planning stage, the data acquisition team defines the goals and objec-
tives of the project and determines the data sources and collection methods that will
be used. This stage also involves identifying any potential risks or issues that may
arise during the data acquisition process. Once the planning stage is complete, data
collection can begin. During this phase, data is gathered from a variety of sources,
including surveys, experiments, and automated sensors. The collected data may be in
different formats and may require cleaning or transformation to ensure consistency
and accuracy.
The next stage involves cleaning and preparing the data for analysis. In this phase,
duplicates are eliminated, mistakes are fixed, and data is transformed into an analysis-
ready format. Once the data is cleaned and prepared, it is typically stored in a database
or other storage system. Finally, the data can be analyzed and processed to extract
insights or information. This stage involves using statistical and analytical techniques
to uncover patterns or relationships within the data. The results of the analysis can
then be used to make decisions, inform policies, or guide further research.

5.1.2 Key Insights for Big Data Acquisition

The process of gathering significant volumes of data from many sources, including
social media, sensors, and other digital platforms, is referred to as big data acquisition.
Here are some essential tips for collecting huge data.
5.1 Data Acquisition 69

1. Understand the sources: It is important to have a clear understanding of the


sources from which you will be collecting data. This covers the kind of data,
how often it is collected, and the level of data quality.
2. Data quality matters: Collecting large amounts of data is useless if the quality of
the data is poor. Ensure that the data you are collecting is accurate, relevant, and
up to date.
3. Select the appropriate tools: A range of technologies are available to gather and
handle large data. Depending on the amount, velocity, and diversity of data you
need to capture, pick the tools that will work best for you.
4. Have a data governance plan: Data governance refers to the policies, procedures,
and guidelines for managing data. Having a plan in place will help ensure that
data is collected, stored, and used in a way that is ethical and complies with
relevant regulations.
5. Consider scalability: As the amount of data you collect grows, you will need to
ensure that your data acquisition processes can scale to meet your needs. This
may require investing in more powerful hardware or cloud-based solutions.
6. Data security is critical: Big data can include sensitive information; therefore,
it is necessary to have strong security measures in place to guard against data
breaches and cyber-attacks. Data security is essential.
7. Data integration is key: Big data is often collected from multiple sources, so it
is important to have a plan for integrating the data into a unified database or
data warehouse. This will help ensure that the data is accessible and useful for
analysis.

5.1.3 Case Study: Data Acquisition for Retail Company

Background: A retail company with multiple stores across the country wants to
improve its sales by understanding customer behavior and preferences. The company
wants to analyze its sales data to identify patterns and trends that can help them make
data-driven decisions to improve their sales.
Challenge: The company’s sales data is stored in different formats, including Excel
spreadsheets, text files, and databases, making it difficult to analyze the data. The data
is also located in different systems across the company’s stores and offices, making
it hard to aggregate the data. The company needs to acquire the data and centralize
it in a single repository to facilitate data analysis.
Solution: To address this challenge, the retail company implemented a data
acquisition solution that involved the following steps:
1. Identify Data Sources: The first step was to identify all the data sources that
contained the company’s sales data. The sources included sales databases,
customer databases, inventory management systems, and employee performance
data.
70 5 Data-Centric AI

2. Data Extraction: The company used a data extraction tool to extract data from
each source. The tool was configured to extract data in the required format and
structure.
3. Data Cleansing: Once the data was extracted, it was cleaned to remove duplicates,
inconsistencies, and errors. The data was standardized to ensure that it was in a
consistent format and that there were no discrepancies.
4. Data Integration: The cleaned data was then integrated into a centralized data
repository using a data integration tool. The tool was configured to merge the
data from different sources, ensure data consistency, and maintain data quality.
5. Data Validation: The company performed data validation to ensure that the data
was accurate and complete. The validation process involved checking the data
for errors, discrepancies, and inconsistencies.
6. Data Analysis: With the data centralized and validated, the company was able
to analyze its sales data to identify patterns and trends. The analysis provided
insights into customer behavior, product preferences, and sales performance.
Results: The data acquisition solution provided the following benefits to the retail
company:
1. Centralized Data: The solution enabled the company to centralize its sales data,
making it easy to access and analyze.
2. Improved Data Quality: The data cleansing and validation process improved data
quality, reducing errors and inconsistencies.
3. Data-Driven Decisions: The analysis of the sales data enabled the company to
make data-driven decisions to improve sales.
4. Increased Efficiency: The data acquisition solution reduced the time and effort
required to extract, clean, and integrate data, improving operational efficiency.
Conclusion: Implementing a data acquisition solution enabled the retail company
to centralize its sales data, improve data quality, and make data-driven decisions to
improve sales. The solution increased efficiency and provided insights into customer
behavior and preferences, which helped the company to make better decisions to
improve sales.

5.2 Data Labeling

The practice of manually adding one or more descriptive tags or labels to a dataset
is known as data labeling. In a data-centric approach, data labeling is a crucial step
that involves annotating raw data with relevant metadata to help machine learning
models learn from it.
For example, in image recognition, data labeling may involve identifying objects
or people within an image and tagging them with descriptive labels. Similarly, in
Natural Language Processing, data labeling may involve identifying and tagging
specific parts of speech or sentiments within a text.
5.2 Data Labeling 71

The quality and accuracy of data labeling can significantly impact the performance
of machine learning models, making it an essential step in the data-centric approach.
A class of methods known as semi-supervised labeling infers the labels of unla-
beled data from a limited amount of labeled data [2, 3]. (e.g., using unlabeled data
to train a model to generate predictions, for instance). The most important unla-
beled examples are chosen in each iteration of active learning, an iterative labeling
technique [4–6]. The tagging process in contexts with limited supervision has been
redefined by other studies [7, 8]. For instance, domain-specific heuristics are used as
input in data programming [7] to infer labels. Deep learning is mostly made possible
by large, labeled datasets. We anticipate the development of more effective tagging
techniques using different forms of human participation for a range of data kinds.

5.2.1 How Does Data Labeling Work?

Data labeling can be performed in different ways depending on the nature of the data
and the specific task at hand. However, the general process typically involves the
following steps.
Data Collection: The first step is to collect the raw data that needs to be labeled.
This data could be in various formats, including images, audio recordings, text
documents, or structured data.
Annotation Guidelines: After data collection, the next step is to create annotation
guidelines or a labeling scheme that defines the labels to be used and how they
should be applied to the data. The guidelines ensure consistency and accuracy across
all labeled data.
Labeling Tools: After creating the annotation guidelines, labeling tools are used
to annotate the data. These tools can be customized to the specific data type and
labeling task.
Human Labelers: Data labeling is typically performed by human labelers who are
trained to apply the annotation guidelines correctly. The number of labelers used can
vary depending on the size and complexity of the dataset.
Quality Control: To ensure the accuracy and consistency of the labeled data,
a quality control process is often implemented. This involves randomly sampling
labeled data and checking it for errors or inconsistencies.
Iterative Improvement: As labeled data is reviewed, errors are corrected, and the
annotation guidelines are updated based on feedback, creating a cycle of iterative
improvement. This process helps to improve the quality and accuracy of the labeled
data over time.
Overall, data labeling is a critical step in machine learning, as it provides the
labeled data needed to train models accurately. By using human labelers and iterative
improvement, the data can be accurately and consistently labeled to create high-
quality training data.
72 5 Data-Centric AI

5.2.2 Data Labeling Approaches

There are several approaches to data labeling, and the choice of approach depends on
the specific data type, labeling task, and available resources. Here are some common
approaches:
1. Manual Labeling: This approach involves human labelers manually reviewing
and annotating each data point according to the established annotation guidelines.
Manual labeling can be time-consuming and expensive, but it provides the highest
accuracy and flexibility.
2. Active Learning: In this approach, a machine learning model is used to auto-
matically label a subset of the data, and human labelers review and correct the
labels. The model is then retrained on the newly labeled data, and the process is
repeated. Active learning can significantly reduce the time and cost of labeling
while maintaining high accuracy.
3. Semi-supervised Learning: This approach combines manually labeled data with
unlabeled data to train a machine learning model. The labeled data serves as
the first training set for the model, which is subsequently applied to label the
remaining unlabeled data. The process is then repeated using the freshly labeled
data and the training set. Compared to manual labeling alone, semi-supervised
learning may be more cost-effective.
4. Crowdsourcing: This approach involves using many non-expert human labelers
to annotate data. Crowdsourcing platforms such as Amazon Mechanical Turk can
be used to distribute labeling tasks to a large pool of workers. Crowdsourcing
can be cost-effective but may result in lower accuracy due to the variability in
labeling quality.
5. Hybrid Approaches: Different labeling approaches can be combined to create
a hybrid approach that leverages the strengths of each method. For example,
a semi-supervised learning approach can be combined with manual labeling to
improve the accuracy of the final labeled dataset.
Overall, the selection of a labeling strategy is influenced by a number of variables,
such as the kind and complexity of the data, the size of the dataset, the required
accuracy, and the resources that are available.
• Make the labels consistent.
• Use multiple labelers spot inconsistencies.
• Clarify label instructions by tracking down ambiguous example.
• Add some distracting instances. More information is always preferable.
• To improve, employ error analysis with a focus on a subset of data.
These tips a little bit more applicable maybe for unstructured data applications
like images text and audio.
Recommendation 1: Make the labels Y consistent.
It turns out that when developing a learning algorithm, especially for a small
dataset, the ideal situation is when there is a deterministic, non-random function that
5.2 Data Labeling 73

Fig. 5.1 Pills dataset image

maps the inputs x to the outputs y, and the labels are consistent with this function.
However, this ideal situation may be less realistic when the labels are generated.
By humans but let us take an example as shown in Fig. 5.1. Consider evaluating
manufactured pills while working for a pharmaceutical company. The label datasheet
can ask, “What when is the pill scratch or defective?” as an example.
If we plot the length of a scratch against the label, is it defective or not zero. In
this case, as shown in Fig. 5.2, it is not really consistent; it sort of goes up and down
right; some longer scratches are not defective compared to other shorter ones.
If you are able to sort the images at the bottom in increasing order of the length
of the scratch. The short scratches smaller than a specific length can be determined
by looking at the labels and making a choice.
Here, we are considering two and a half millimeters call that defective what that
corresponds to it editing dataset, so the labels now become like this as shown in
below Fig. 5.3 and much more consistent.

Fig. 5.2 Graph for pills scratch length versus defect

Fig. 5.3 Pills updated dataset image


74 5 Data-Centric AI

In√some cases, if we have an inherently noisy dataset, then error will decrease
O(1/ m) where m is raining set size. In some situations, error can decline on the
order of (1/m) and one over m goes down much quicker than one over the square root
of m if we acquire consistent labels such that there is some clean learnable consistent
function where error like this curve your x-axis and y-axis is just a basic threshold
learning.
Recommendation 2: Use multiple labels to spawn inconsistencies.
If labels y are inconsistent, use multiple labels to spawn inconsistencies, so here
are some examples of inconsistencies. Suppose one label is to label this maybe it
looks like this pill has a chip on it and maybe label two says it has a scratch and
actually we do not know who is right but when we see cases like these is any consistent
standard is probably better than an inconsistent standard. So decide if stuff like this
is a chip or a scratch, make best decision, and reduce the inconsistency by using
multi-labeling.

5.2.3 Importance of Data Labeling

Data labeling is the process of assigning meaningful and relevant tags or labels to
data to make it more usable and understandable for machine learning algorithms.
It is an important aspect of machine learning as it plays a key role in ensuring the
accuracy and efficiency of models.
The following are some of the main justifications why data labeling is crucial:
1. Improved accuracy: Data labeling helps to ensure that machine learning models
are accurately trained. When data is properly labeled, it provides a clear under-
standing of the features and characteristics of the data. This enables algorithms
to make more accurate predictions and classifications.
2. Enhanced efficiency: Data labeling also improves the efficiency of machine
learning algorithms. Labeled data allows algorithms to quickly process and
analyze large amounts of data, leading to faster and more efficient decision-
making.
3. Better quality data: When data is labeled, it is easier to identify and remove
irrelevant or incorrect data. This leads to higher quality data that is more relevant
and useful for machine learning algorithms.
4. Increased productivity: Data labeling also increases productivity by reducing the
time and effort required to manually sort through and label data. This frees up
valuable resources, allowing organizations to focus on other important tasks.
5. More personalized experiences: Data labeling enables machine learning algo-
rithms to make more personalized recommendations and predictions. This can
lead to improved customer experiences and increased engagement.
5.2 Data Labeling 75

Overall, data labeling is a crucial step in the machine learning process that helps
to ensure accuracy, efficiency, and high-quality data.

5.2.4 Case Study: Data Labeling for Autonomous Vehicle


Training

Background: A tech company is developing autonomous vehicles and requires a


large amount of labeled data to train their algorithms. The company collects data
from various sources such as LiDAR, radar, and cameras to enable the vehicle to
sense its environment, identify objects, and make decisions accordingly.
Challenge: The company needs to label the collected data to train their autonomous
vehicle algorithms. The data is a combination of images, videos, and sensor data from
various sources, and the labeling must be accurate and consistent for the algorithms
to learn effectively. The company has a large amount of data to label, and the process
can be time-consuming, expensive, and prone to errors.
Solution: To tackle this challenge, the company outsources the data labeling task
to a third-party service provider that specializes in data annotation. The service
provider uses a combination of human and machine-based annotation methods to
ensure accurate and consistent labeling.
The human annotators are trained and have a background in computer vision,
enabling them to understand the different objects and scenarios present in the data.
They use a variety of tools and techniques to annotate the data, such as bounding
boxes, polygons, and semantic segmentation. The annotators are also supervised by
quality assurance teams to ensure accuracy and consistency in the labeling process.
In addition to human annotation, the service provider also employs machine
learning models to automate some of the labeling tasks. The models are trained on
a subset of the data and can then predict the labels for similar data. The predictions
are reviewed and corrected by human annotators to ensure accuracy.
Results: By outsourcing the data labeling task, the company was able to signifi-
cantly reduce the time and cost associated with the process. The third-party service
provider was able to label the data accurately and consistently, which improved the
effectiveness of the autonomous vehicle algorithms.
The use of machine learning models also helped to speed up the labeling process
while maintaining accuracy.
76 5 Data-Centric AI

5.3 Data Annotation

Data annotation refers to the process of labeling or tagging data with metadata, which
makes it easier to analyze and use for machine learning and other data-centric tasks.
Data annotation is a crucial part of many machine learning applications because it
helps algorithms to recognize patterns and relationships within large datasets.
The process of data annotation typically involves human annotators who manually
review and label each piece of data with relevant tags, such as categories, keywords,
or descriptive labels. This labeling process can be time-consuming and costly, but it
is essential for building accurate machine learning models and improving the quality
of data-driven insights.

5.3.1 Types of Data Annotation

There are several types of data annotation techniques used in data-centric applica-
tions; some of the most common ones are:
1. Image Annotation: This type of data annotation involves identifying and labeling
objects within an image, such as classifying them into different categories,
detecting their location, and outlining their boundaries. Image annotation is often
used in applications such as autonomous vehicles, object recognition, and facial
recognition as shown in Fig. 5.4.
2. Text Annotation: This type of data annotation as shown in Fig. 5.5 involves
tagging or labeling text data with relevant metadata such as keywords, topics, or
named entities. Text annotation is used in applications such as sentiment analysis,
text classification, and chatbots.

Fig. 5.4 Image annotation


example
5.3 Data Annotation 77

Fig. 5.5 Text annotation example

3. Audio Annotation: Audio annotation involves tagging or labeling audio data


with relevant metadata, such as speaker identification, emotions, or speech
recognition. Audio annotation is used in applications such as virtual assistants,
speech recognition, and audio transcription. Video Annotation: Video annota-
tion involves identifying and labeling objects or events within a video, such as
identifying specific actions or objects, tracking movements, or annotating video
frames. Video annotation is used in applications such as security surveillance,
autonomous vehicles, and video analysis.
4. Sensor Data Annotation: Sensor data annotation involves labeling data from
sensors such as accelerometers, GPS, or temperature sensors. This type of anno-
tation is often used in applications such as sports analytics, fitness tracking, and
industrial automation.
These are just a few examples of the types of data annotation techniques used in
data-centric applications. The specific type of annotation used will depend on the
nature of the data and the requirements of the application.

5.3.2 Case Study on Data Annotation

Background: A startup company was working on developing a machine learning


algorithm for autonomous driving vehicles. They needed to train their model to
recognize objects on the road, such as pedestrians, cars, traffic lights, and road signs.
They had a large dataset of images, but they needed to label the objects in the images
to train their model.
78 5 Data-Centric AI

The startup decided to outsource the data annotation process to a third-party


company specializing in data labeling services. The third-party company provided a
team of trained annotators who could label the images with high accuracy and speed.
The startup provided the third-party company with a detailed guideline for data
labeling, including instructions for identifying objects, labeling, and quality control.
The data labeling process involved the following steps:
1. Image preprocessing: The startup provided raw images to the third-party
company, which then preprocessed the images to enhance their quality and
resolution.
2. Object identification: The annotators were trained to identify and label different
objects on the road, such as pedestrians, cars, traffic lights, and road signs.
3. Data labeling: The annotators used a software tool to label the objects in the
images with specific tags and annotations.
4. Quality control: The third-party company had a quality control team that reviewed
the labeled data to ensure accuracy and consistency.
5. Data delivery: The labeled data was delivered back to the startup in a format that
was compatible with their machine learning algorithm.
The startup was able to train their machine learning algorithm using the annotated
data provided by the third-party company. They achieved high accuracy in object
recognition, which was crucial for the success of their autonomous driving project.
Conclusion: Data annotation is a crucial step in training machine learning models.
It requires accuracy, consistency, and quality control to achieve the desired results.
Outsourcing the data annotation process to a third-party company can be an effective
solution for startups and companies who do not have the expertise or resources to
perform the task in-house.

5.4 Data Augmentation

Data augmentation is a technique used in data-centric approaches to increase the size


and diversity of a dataset by creating new training examples from existing data. This
technique involves applying transformations or modifications to the original data in
order to create new, artificial examples that are similar to the original data but have
slight variations. Data augmentation can help improve the performance of machine
learning models by providing more training examples and reducing overfitting.
In creating fresh and diverse instances to train datasets, data augmentation is
beneficial to enhance the performance and results of machine learning models. A
machine learning model operates more effectively and correctly when the dataset is
large and sufficient.
Data collection and labeling for machine learning models may be time-consuming
and expensive activities. Using data augmentation approaches, firms may transform
datasets to save these operating expenses.
5.4 Data Augmentation 79

Fig. 5.6 Example of basic


data augmentation
techniques: horizontal flip,
random rotate, and partial
occlusion

Cleaning data is one of the processes of a data model, which is essential for
high-accuracy models. However, the model cannot make accurate predictions for
inputs from the actual world if cleaning decreases the representing ability of the
data. By producing variables that the model could encounter in the real world,
data augmentation approaches might help machine learning models become more
resilient.
A brief illustration makes things clearer: Consider that we are teaching a model
to recognize birds. The picture of a bird on the left in the example below in Fig. 5.6
is obtained from our initial dataset. On the right, you can see three variations of the
original image that our model would still likely interpret as depicting a bird. The
first two are easy to understand: Whether a bird is flying upward or downward, east,
or west, it is still a bird. The third illustration shows a bird that has had its head and
body artificially obscured. Thus, this image would prompt our model to pay attention
to feathered wings as a characteristic bird.

5.4.1 How Does Data Augmentation Work?

Data augmentation is a technique used in machine learning and deep learning to


increase the amount of training data available for a model, by generating new data
samples from existing ones, without collecting more data. This helps to improve
the performance of the model, by making it more robust to variations in the input
data [9].
The basic idea behind data augmentation is to apply a set of random transforma-
tions to the original data samples, such as scaling, rotation, flipping, cropping, and
adding noise. These transformations preserve the underlying information in the data
but create new variations that the model can learn from.
For example, in image recognition, data augmentation can be used to generate
new images by randomly cropping, flipping or rotating the original images, changing
the brightness and contrast, or adding noise or distortions. In text classification, data
80 5 Data-Centric AI

augmentation can be used to generate new text samples by replacing words with
synonyms, adding noise, or spelling errors, or shuffling the order of words in a
sentence.
Data augmentation can be performed online, during training, by randomly
applying transformations to each batch of data, or offline.

5.4.2 Case Study on Data Augmentation

Background: A company was working on developing a machine learning model to


identify fraudulent transactions in their financial system. They had a large dataset
of transaction records, but the dataset was highly imbalanced with only a small
percentage of fraudulent transactions.
To address this issue, the company decided to use data augmentation tech-
niques to create additional data for the minority class. They used the following
data augmentation techniques:
1. Random Oversampling: The company randomly sampled the minority class to
create additional data points for the fraudulent transactions.
2. Synthetic Minority Oversampling Technique (SMOTE): SMOTE is a popular
data augmentation technique that generates synthetic data points by interpolating
between existing data points.
3. The company used SMOTE to create additional data points for the minority class.
4. Random Rotation and Translation: The company applied random rotation and
translation to the images to create variations in the data.
5. Image Flipping: The company flipped the images horizontally and vertically to
create mirror images.
6. Color Jittering: The company added random noise to the images to change their
colors and create variations.
The company used these data augmentation techniques to create additional data
points for the fraudulent transactions. They then trained their machine learning model
on the augmented dataset and evaluated its performance.
The results showed that the performance of the machine learning model had signif-
icantly improved after data augmentation. The accuracy of the model had increased,
and it was able to correctly identify more fraudulent transactions.
Conclusion: Data augmentation is a powerful technique to improve the performance
of machine learning models, especially when working with imbalanced datasets. In
this case study, the company used various data augmentation techniques to create
additional data points for the minority class and improve the performance of their
fraud detection model. Data augmentation can be a cost-effective way to increase
the amount of data available for training a model and improve its accuracy.
5.5 Data Deployment 81

5.5 Data Deployment

In a data-centric context, data deployment refers to the process of making data acces-
sible and available to users or systems that need it. This involves various steps, such
as selecting the appropriate data storage and management systems, designing the
data architecture, and establishing secure and efficient data access mechanisms.
Data deployment typically involves the following steps:
Data storage: Data deployment begins with selecting the appropriate data storage
systems, such as databases, data warehouses, or data lakes. The choice of storage
system depends on the type and volume of data, the desired level of data processing,
and the budget available.
Data architecture: After selecting the data storage systems, the next step is to
design the data architecture. This involves defining the data structure, including
tables, fields, and relationships, and deciding how data will be organized, classified,
and accessed.
Data integration: Data integration involves combining data from different sources
and formats and transforming it into a unified format. This may involve data cleaning,
data normalization, and data enrichment.
Data access: Once the data is stored and integrated, the next step is to estab-
lish secure and efficient data access mechanisms. This may involve setting up user
accounts and permissions, establishing API endpoints, or creating data dashboards
and visualizations.
Data governance: Finally, data deployment involves establishing data governance
policies and procedures to ensure that data is secure, accurate, and compliant with
regulatory requirements.
Overall, data deployment is a critical step in making data available and accessible
to users and systems, enabling organizations to leverage their data assets to make
better-informed decisions and drive business success.

5.5.1 Case Study on Data Deployment

Background: Company X is a large e-commerce company that sells a variety of


products online. They have a massive amount of data collected from their website,
including user behavior, product information, and transaction history. The company
wants to deploy this data to improve their business operations, such as increasing
sales, reducing churn, and improving customer satisfaction.
To achieve this, the company decided to use a cloud-based data warehouse solu-
tion. They evaluated various options and ultimately chose Amazon Redshift because
of its scalability, performance, and cost-effectiveness.
The data deployment process involved the following steps:
1. Data Extraction: The first step was to extract the data from various sources such
as web analytics tools, customer relationship management (CRM) systems, and
82 5 Data-Centric AI

sales databases. The company used various tools and scripts to extract the data,
including Apache Spark, AWS Glue, and AWS Data Pipeline.
2. Data Transformation: Once the data was extracted, it needed to be transformed
into a usable format for analysis. The company used various tools and scripts
to transform the data, including AWS Glue, AWS Lambda, and Apache Spark.
They also created data pipelines to automate the transformation process.
3. Data Loading: The transformed data was then loaded into Amazon Redshift. The
company used various tools and scripts to load the data, including AWS Glue,
AWS Data Pipeline, and Amazon Redshift’s COPY command.
4. Data Analysis: With the data now in Amazon Redshift, the company was able
to perform various types of data analysis, including data mining, predictive
modeling, and machine learning. They used various tools and libraries, including
Python, SQL, and Amazon Machine Learning.
5. Visualization and Reporting: To share the insights gained from the data analysis,
the company used various tools to create visualizations and reports, including
Amazon QuickSight, Tableau, and Power BI. These reports were shared with
various teams within the company, including marketing, sales, and product
development.
The data deployment process has helped the company in various ways. For
example, they were able to identify which products were most likely to lead to
repeat purchases, which helped them optimize their marketing campaigns. They
were also able to identify which products were most likely to lead to customer
churn, which helped them improve their product offerings. Overall, the data deploy-
ment process has helped Company X improve its business operations and gain a
competitive advantage in the e-commerce industry.

5.6 Data-Centric AI Tools

Data-centric AI tools are software applications that are designed to help users work
with and analyze large volumes of data, with the goal of discovering insights, making
predictions, or improving decision-making. These tools use artificial intelligence (AI)
and machine learning (ML) techniques to automate data analysis, reduce manual data
processing tasks, and improve the accuracy and efficiency of data-driven tasks.
Some examples of data-centric AI tools include:
1. Data visualization tools: These tools help users visualize data and identify
patterns and trends. They allow users to create charts, graphs, and other visual
representations of data, making it easier to understand and communicate insights.
2. Predictive analytics tools: These technologies analyze previous data and forecast
future results using machine learning algorithms. They can be used for a range
of applications, from forecasting sales revenue to predicting equipment failures.
5.6 Data-Centric AI Tools 83

3. Natural Language Processing (NLP) tools: These tools use AI and ML to analyze
and understand human language. They can be used to analyze customer feedback,
automate customer service interactions, or identify trends in social media.
4. Recommendation engines: These tools use data analysis to make personal-
ized recommendations to users, such as product recommendations or content
recommendations on a website.
5. Data preparation and cleaning tools: These tools help users prepare and clean data
for analysis. They can automate tasks such as data normalization, data cleaning,
and data transformation, making it easier to work with data and reducing the risk
of errors.
6. Data integration tools: These tools help users combine data from different sources
into a unified format. They can be used to integrate data from databases, data
warehouses, or other data sources.
Overall, data-centric AI tools are designed to help users work more effectively with
data, enabling them to discover insights, make predictions, and improve decision-
making.
There are several data-centric AI tools available in the market, some of which are:
1. TensorFlow: An open-source platform developed by Google, TensorFlow is
widely used for machine learning and deep learning applications.
2. Keras: A high-level neural networks API, Keras is built on top of TensorFlow.
It provides an easy-to-use interface to build and train deep learning models.
3. PyTorch: Developed by Facebook, PyTorch is an open-source machine learning
library used for building and training deep learning models.
4. Scikit-learn: A popular Python library for machine learning, Scikit-learn
provides tools for data preprocessing, feature selection, and model evaluation.
5. Apache Spark: A distributed computing platform, Apache Spark is used for big
data processing and machine learning applications.
6. Hadoop: An open-source distributed computing platform, Hadoop is used for
storing and processing large datasets.
7. IBM Watson: A cloud-based AI platform, IBM Watson provides tools for
building and deploying AI models.
8. Amazon SageMaker: A cloud-based platform, Amazon SageMaker provides
tools for building, training, and deploying machine learning models.
9. Microsoft Azure Machine Learning: A cloud-based platform, Microsoft Azure
Machine Learning provides tools for building, deploying, and managing
machine learning models.
10. Google Cloud AI Platform: A cloud-based platform, Google Cloud AI Platform
provides tools for building, training, and deploying machine learning models.
11. Pandas: A library for open-source data analysis and manipulation that offers
data structures for effectively storing and processing huge datasets.
12. NumPy: A Python open-source toolkit for numerical computing that supports
huge, multi-dimensional arrays and matrices.
84 5 Data-Centric AI

13. Apache Flink: An open-source framework for streaming data processing that
supports both batch processing and real-time data streaming.
14. Apache Kafka: A distributed streaming platform that is open source and can be
used to create streaming apps and real-time data pipelines.

5.6.1 Case Study: Predicting Customer Churn


for a Telecommunications Company

A telecommunications company wants to reduce customer churn by predicting which


customers are most likely to cancel their service. To do this, they collect a large
amount of customer data, including demographic information, usage patterns, and
customer service interactions [10].
The company decides to use data-centric AI tools to build a predictive model that
can identify customers who are at risk of churning. They use the following tools.
Apache Spark: To process and analyze the large amount of customer data, the
company uses Apache Spark to perform distributed processing and data preparation.
Scikit-learn: The company uses Scikit-learn to build a machine learning model that
can predict customer churn based on the data they have collected. They experiment
with different algorithms and parameters to find the best performing model.
TensorFlow: Once they have selected a machine learning model, the company
uses TensorFlow to train the model on their data and deploy it into production.
Kibana: Finally, the company uses Kibana to monitor the performance of
their predictive model in production, visualize the results, and identify areas for
improvement.
By using data-centric AI tools, the telecommunications company is able to build
an accurate predictive model that can identify customers who are most likely to
churn. They can then take proactive steps to retain these customers and reduce their
overall churn rate, resulting in increased customer satisfaction and revenue.

5.7 Summary

Data-centric AI involves a series of steps to develop and deploy AI models that are
data-driven and effective. Overall, data-centric AI is an iterative and ongoing process
that requires careful attention to each step to ensure the development of high-quality,
reliable AI models that deliver accurate and actionable insights.
References 85

References

1. Bogatu, A., Fernandes, A. A., Paton, N. W., & Konstantinou, N. (2020). Dataset discovery in
data lakes. In ICDE
2. Xu, Y., Ding, J., Zhang, L., & Zhou, S. (2021). Dp-ssl: Towards robust semi-supervised learning
with a few labeled samples. In NeurIPS
3. Karamanolakis, G., Mukherjee, S., Zheng, G., & Hassan, A. (2021). Self-training with weak
supervision. In NAACL
4. Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., & Wang, X. (2021).
A survey of deep active learning. ACM Computing Surveys, 54, 1–40.
5. Zha, D., Lai, K.-H., Wan, M., & Hu, X. (2020). Metaaad: Active anomaly detection with deep
reinforcement learning. In ICDM
6. Dong, J., Zhang, Q., Huang, X., Tan, Q., Zha, D., & Zihao, Z. (2023). Active ensemble learning
for knowledge graph error detection. In WSDM
7. Ratner, J., De Sa, C. M., Wu, S., Selsam, D., & Re, C. (2016). Data programming: Creating
large training sets, quickly. In NeurIPS
8. Zha, D., & Li, C. (2019). Multi-label dataless text classification with topic modeling. Knowledge
and Information Systems, 61, 137–160.
9. Kharate, N. G., & Patil, V. H. (2019). Challenges in rule based machine translation from
Marathi to English. In Proceedings of the 5th international conference on advances in computer
science and information technology (ACSTY-2019) (pp. 45–54). https://doi.org/10.5121/csit.
2019.91005
10. Woody, A. (2013). A data-centric approach to securing the enterprise. Packt Publishing
Chapter 6
Data-Centric AI in Healthcare

6.1 Overview

The predominant paradigm for AI development over the last few decades has been a
model- or software-centric approach, in which building a machine learning system
requires writing code to implement algorithms, models, as well as taking that code
and training it on data. In the last few years, there has been tremendous progress
in neural networks and other algorithms, and the code is essentially a good open
source issue that you can download from GitHub today for numerous applications.
Historically, most of us know how to download the dataset, hold the dataset as fixed,
and then modify the software’s code to get to do well on the data. Therefore, it is
not always more beneficial to use a data-centric strategy where we may even hold
the code patch; instead, focus on collecting or creating the correct data to feed the
learning algorithm.
A data-centric approach is one that focuses on the importance of data in decision-
making and problem-solving. In various industries, including healthcare, data is
generated at an unprecedented rate, and utilizing this data effectively can lead to
better outcomes. A data-centric approach prioritizes data management and analysis,
with the goal of leveraging insights from data to inform decision-making.
A campaign started by Ng et al. [1] that promotes a machine learning strategy that
is more data centric than model centric, or a fundamental shift from model creation
to data quality and dependability, is largely responsible for the growth of DCAI.
As a result, in order to seek data excellence, academics’ and practitioners’
focus has steadily switched to data-centric AI. Artificial Intelligence (AI) and
machine learning (ML) techniques are used in the healthcare industry to evaluate
and make sense of the enormous volumes of data created every day. This informa-
tion includes wearable technology, genetics, medical imaging, and electronic health
records (EHRs).
The goal of trends and AI in healthcare is to improve patient outcomes, reduce
costs, and increase efficiency by using data to inform clinical decision-making, iden-
tify patterns and trends, and develop predictive models. For example, AI algorithms

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 87
P. N. Mahalle et al., Data Centric Artificial Intelligence: A Beginner’s Guide,
Data-Intensive Research, https://doi.org/10.1007/978-981-99-6353-9_6
88 6 Data-Centric AI in Healthcare

can help identify patients at high risk for certain diseases, predict treatment outcomes,
and identify opportunities for personalized medicine.
However, privacy, security, and bias issues are also problems with data-centric AI
in the healthcare industry. It is crucial to guarantee the security of patient data and
the accountability and transparency of AI algorithms. Additionally, biases in the data
can lead to biased algorithms, which can perpetuate health disparities. Overall, data-
centric AI has the potential to revolutionize healthcare by unlocking new insights
and improving patient care. However, it is important to approach these technologies
with caution and to prioritize patient privacy and equity.
In order to generate insights and enhance patient care, clinical research, and health-
care operations, data-centric AI in healthcare refers to a method that places a strong
priority on the gathering, curation, and analysis of high-quality healthcare data. How
data-centric AI is used in healthcare is as follows:
. Data Collection and Integration
Data-Centric AI in healthcare starts with the collection and integration of diverse
healthcare data from various sources such as electronic health records (EHRs),
medical imaging, wearable devices, genomics, and patient-reported data. This
involves developing robust data infrastructure and interoperability standards to ensure
data can be effectively integrated and accessed.
. Data Preprocessing and Cleaning
Healthcare data often contains missing values, errors, inconsistencies, and noise.
Data preprocessing techniques are applied to clean and transform the data into a
usable format. Preprocessing steps may include handling missing data, normalizing
values, resolving data inconsistencies, and anonymizing sensitive information to
protect patient privacy.
. Data Governance and Security
Given the sensitivity of healthcare data, data-centric AI in healthcare places an
emphasis on sound data governance procedures and complies with privacy laws like
Health Insurance Portability and Accountability Act (HIPAA) to safeguard patient
privacy and preserve data security.
. Data Analysis and Insights
Advanced analytics and machine learning techniques are applied to the curated
healthcare data to extract meaningful insights and patterns.
Data-centric AI models can help identify disease patterns, predict patient
outcomes, personalize treatment plans, and support clinical decision-making.
These models can also be used to discover new biomarkers, identify population
health trends, and support public health initiatives.
. Clinical Decision Support Systems (CDSS):
Data-centric AI is used to develop and deploy clinical decision support systems
that provide healthcare professionals with real-time insights and recommendations
6.2 Need and Challenges of Data-Centric Approach 89

based on patient-specific data. CDSS can assist in diagnosing diseases, suggesting


treatment options, predicting patient outcomes, and improving overall patient safety
and care quality.
. Healthcare Operations and Resource Optimization
Data-centric AI is employed to optimize healthcare operations by analyzing
data related to patient flow, resource utilization, scheduling, and supply chain
management.
Predictive models and optimization algorithms can help hospitals and healthcare
systems improve efficiency, reduce wait times, allocate resources effectively, and
streamline processes.
. Research and Drug Development
Data-centric AI plays a significant role in accelerating clinical research and drug
development processes. By analyzing large-scale clinical trial data, genomics data,
and other biomedical data, AI models can identify potential drug targets, aid in drug
discovery, optimize clinical trial design, and enable precision medicine approaches.
Data-centric AI in healthcare holds immense potential to improve patient
outcomes, enhance clinical decision-making, drive research advancements, and opti-
mize healthcare operations. However, it also requires careful consideration of ethical,
legal, and privacy aspects to ensure the responsible and secure use of healthcare data.

6.2 Need and Challenges of Data-Centric Approach

In the past, AI was frequently thought of as a subject that was model centric and
focused on improving model designs with respect to specified datasets. The excessive
dependence on predefined datasets, however, neglects the scope, complexity, and
fidelity of the data to the underlying issue, which may result in worse model behavior
in real-world applications [2]. Clinicians, clinical researchers, and scientists make
judgements in the healthcare field based on data. Excellent data supports excellent
judgements, whereas bad data encourages bad decisions.
Since the models are so highly specialized and adapted to certain situations, it is
sometimes challenging to apply them to different challenges. Underestimating data
quality might also result in data cascades [3], which could have detrimental impacts
including lower accuracy and enduring biases [4]. This can seriously limit the use of
AI systems, especially in high-stakes fields.
There are several reasons why a data-centric approach is necessary in healthcare:
1. Data is central to healthcare decision-making: In healthcare, decisions are made
based on data, whether it is patient history, laboratory results, or imaging data. A
data-centric approach ensures that all relevant data is considered when making
decisions, leading to better outcomes for patients.
90 6 Data-Centric AI in Healthcare

2. Improved efficiency and accuracy: With the large amount of data generated in
healthcare, it can be difficult for clinicians to manually analyze it all. By using
data-centric AI tools, clinicians can quickly and accurately analyze large amounts
of data, leading to faster diagnosis and treatment decisions.
3. Personalized medicine: Data-centric AI can help identify patterns and trends in
large datasets, allowing clinicians to develop personalized treatment plans for
patients based on their unique medical history, genetics, and lifestyle factors.
4. Research and development: Data-centric AI can also aid in the development of
new treatments and medications by helping researchers identify new targets and
potential drug candidates.
Overall, a data-centric approach in healthcare is essential to improve patient
outcomes, increase efficiency, and advance medical research. Healthcare profes-
sionals can make better decisions and give patients better treatment by utilizing the
potential of AI and machine learning to evaluate massive information.
Challenges
The process of gathering data is quite difficult and demands careful planning. Tech-
nically speaking, datasets are frequently heterogeneous and poorly matched with one
another, making it difficult to quantify their relatedness or properly integrate them.
It might also be challenging to successfully synthesize data from the current dataset
because it significantly depends on subject expertise [5]. Additionally, several crucial
problems that arise during data collecting cannot be handled exclusively from a tech-
nological standpoint. For instance, in many real-world scenarios, we might not be
able to find a publicly accessible dataset that matches our needs; thus, we still need to
gather data from scratch. However, for logistical, ethical, or legal reasons, it may be
challenging to access some data sources. Additionally, there are ethical issues while
gathering new data, particularly in relation to informed permission, data privacy, and
data security. These difficulties in analyzing and carrying out data collecting must
be taken into consideration by researchers and practitioners. Data leakage, reporting
only average loss, and incorrect labels are a few of the common mistakes people
make when evaluating models with data. Other common mistakes include validating
data that is not representative of the deployment environment and failing to use truly
held out data.
While there are many potential benefits to a data-centric approach in healthcare,
there are also several challenges that need to be addressed. These challenges include:
1. Privacy and security: To ensure patient privacy, extremely sensitive patient data
must be secured. A data-centric approach requires strong security measures to
ensure that patient data is not compromised.
2. Data quality and accuracy: Data quality and accuracy are critical to the success
of data-centric AI. If the data is incomplete, inaccurate, or biased, it can lead to
flawed algorithms and inaccurate predictions.
3. Bias: Data-centric AI algorithms can perpetuate prejudices and cause health
disparities if they are trained on biased data since they are only as good as the
data they are provided.
6.4 Application Implementation in Model-Centric Approach 91

4. Integration and interoperability: In healthcare, data is often stored in different


systems that may not be compatible with each other. Integrating these systems
and ensuring interoperability is critical to a data-centric approach.
5. Adoption and acceptance: Healthcare providers may be resistant to adopting new
technologies, especially if they are not familiar with how they work. It is important
to provide education and training to ensure that providers are comfortable with
the tools and understand how to use them effectively.
Overall, a data-centric approach in healthcare requires addressing these challenges
to ensure that patient data is protected, algorithms are accurate and unbiased, and
healthcare providers are equipped to use the tools effectively.

6.3 Application Implementation in Data-Centric Approach

As is common practice when working with data, data cleaning is a preprocessing


step. However, with a data-centric approach to AI, improving the data is an integral
part of the iterative process of model development rather than a step performed once
before the real work of training and learning algorithms.
Noise in a cancer dataset refers to any inaccuracies or inconsistencies in the data
that may affect the performance of a machine learning algorithm. In the context of
cancer datasets, noise may arise due to a variety of reasons, such as measurement
errors, mislabeling of data points, or inconsistencies in the data collection process.
The presence of noise in a cancer dataset can have a significant impact on the
performance of a machine learning algorithm. If the noise is too high, it may lead
to overfitting, where the algorithm learns to fit the noise in the data rather than the
underlying patterns. This can result in poor generalization performance, where the
algorithm does not perform well on new, unseen data.
To handle noise in a cancer dataset, there are several techniques that can be used.
One common approach is to preprocess the data to remove or reduce the effect of the
noise. This can involve techniques such as feature scaling, normalization, or outlier
removal. Additionally, using a more robust machine learning algorithm, such as a
random forest or a neural network with regularization, can also help to mitigate the
effect of noise in the data. A screenshot of the corresponding source code is given in Fig. 6.1.
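As a complement to Fig. 6.1, the following is a minimal sketch of such a data-centric
cleaning step. It assumes the public scikit-learn breast-cancer dataset as a stand-in for
the cancer data used in the chapter, and the z-score cut-off of 4 is an illustrative
choice, not a recommendation.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a public cancer dataset as a stand-in for the dataset used in the chapter
X, y = load_breast_cancer(return_X_y=True)

# Data-centric step 1: remove obvious outliers (rows with extreme feature values)
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
mask = (z_scores < 4).all(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Data-centric step 2: scale features so no single noisy feature dominates
X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=42, stratify=y_clean)
scaler = StandardScaler().fit(X_train)

# Train a simple model on the cleaned, scaled data
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(scaler.transform(X_test))))

The point of the sketch is that the cleaning and scaling steps, not the model choice,
carry most of the effort; the same simple model trained on uncleaned data would inherit
whatever noise the dataset contains.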

6.4 Application Implementation in Model-Centric Approach

In the model-centric approach, the focus is on trying different models, and a model is
selected based on its performance. There are various machine learning models that can
be used for cancer data analysis [6]. Here are some commonly used models in the
context of cancer data [7]:

Fig. 6.1 Screenshot of the source code of data-centric model with preprocessing of data and model
training on cancer dataset

Logistic Regression
In the analysis of cancer data, logistic regression is a widely used model for binary
classification problems. Based on the input characteristics, it calculates the likelihood
that a given instance belongs to a specific class (such as cancer vs. non-cancer). It
presupposes a linear relationship between the characteristics and the target class’s
log-odds.
Support Vector Machine (SVM)
SVM is a flexible model that may be applied to both regression and classification
applications. To divide instances of various classes in a high-dimensional space,
it generates a hyperplane or collection of hyperplanes. To increase generalization,
SVM seeks to maximize the margin between the hyperplane and the examples that
are closest to it.
Random Forest
Random forest is an ensemble model that combines many decision trees to produce
predictions. To create the result, it builds a forest of decision trees and averages each
tree’s predictions. High-dimensional data is easily handled by random forest, and it
can capture intricate feature–feature relationships.

Gradient Boosting Methods


Gradient boosting methods, such as gradient boosting machine (GBM) and XGBoost,
are powerful ensemble models that sequentially train weak models to improve predic-
tive performance. These models iteratively fit new models to the residual errors of
the previous models, minimizing the overall prediction errors.
Gradient boosting methods are known for their high accuracy and ability to handle
complex patterns in the data.
Artificial Neural Networks (ANNs)
ANNs, which underpin deep learning models, are particularly effective for cancer
data analysis due to their ability to learn intricate patterns from large datasets.
An ANN consists of multiple layers of interconnected nodes (neurons) that process
and transform the input data. Deep learning models often require large amounts of
data and computational resources for training, but they can achieve state-of-the-art
performance.
Convolutional Neural Networks (CNNs)
CNN is a type of deep learning model specifically designed for image analysis.
CNN learns hierarchical representations of images by convolving filters over the
input image and pooling the results. CNN has been widely used in cancer imaging
tasks, such as tumor detection, segmentation, and classification.
These are just a few examples of machine learning models used in cancer data
analysis. The choice of model depends on various factors, including the type and
size of the dataset, the nature of the problem (classification, regression, etc.), and
the desired performance [8]. It is often recommended to experiment with multiple
models and evaluate their performance to select the most suitable one for a specific
task. In the code shown in Fig. 6.2, a random forest classifier is used to train the model.
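As a complement to the screenshot in Fig. 6.2, the following is a minimal model-centric
sketch, again assuming the scikit-learn breast-cancer dataset as a stand-in. Here the
dataset is left untouched and the effort goes into selecting and tuning the model; the
hyperparameter grid shown is purely illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Model-centric step: keep the data fixed and search over model hyperparameters
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))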

6.5 Comparison of Model-Centric AI and Data-Centric AI

A learning classifier can learn pure randomness: if you give it garbage data with
completely random labels, it will learn to map images or text to those arbitrary labels.
In other words, if the data is badly wrong, the model will faithfully reproduce exactly
what it was taught. Traditional machine learning is very model-centric: good models are
developed on highly curated data, yet real-world data is usually messy, so it often makes
more sense to focus on fixing the issues in the data. Indeed, some of the most cited and
most used test sets in the field of machine learning contain wrong labels. Data-centric
AI often takes one of two forms. In the first form, an AI algorithm understands something
about the data and uses that new information to help a model train better, in much the
same way a teacher orders material for students learning addition.

Fig. 6.2 Screenshot of the source code of model centric with random forest model training on
cancer noisy dataset

Should the teacher start with 10,051 plus 1,042, or with 1 plus 2? Should the very first
examples be incredibly difficult ones? We know the answer because we have all learned
addition, but a machine learning model starts from scratch and does not know which
examples are easy. There are therefore data-centric AI approaches that estimate which
examples are easiest, start training with those, and then introduce slightly harder and
harder ones, much like a curriculum. The second common form of data-centric AI is to
modify the dataset itself to directly improve the performance of a model. Which
instructor would you learn data-centric AI from better: one who teaches you the wrong
thing 30% of the time, or another who tells you the right thing 100% of the time? The
idea is to find those 30% of wrong examples and remove them, so that learning can
proceed as if they had never happened. The classical, model-centric way of thinking, by
contrast, is to keep the data fixed and change the model to improve performance on an
AI task.
In data-centric AI, by contrast, some model is given, which may be fixed or may still be
changed, and the aim is to improve that model by improving the dataset. Both approaches
can be effective in cancer detection, and the choice between them will depend on the
specific requirements of the task and the available data. A data-centric approach can be
particularly useful when the data is noisy or contains biases, while
a model-centric approach may be more suitable when the focus is on achieving the
highest possible performance on a well-defined task.
There are several common reasons why a model gets a particular prediction wrong, each
with a recommended remedy:
1. The given label is incorrect. The recommended action is to correct the label.
2. The example does not belong to any of the K classes, or it is fundamentally
unpredictable (for example, a blurred image). Recommended actions are to remove the
example from the dataset, or to consider adding another class if there are many such
examples.
3. The example is an outlier. Remove it if similar examples would never be seen in
deployment; otherwise, collect additional, similar training data if possible. An
alternative is to apply a data transformation that makes the outlier's features more
like those of other examples, or to upweight it or duplicate it multiple times.
4. The type of model being used is suboptimal for such examples. The remedy is to
retrain the model, or to upweight similar examples or duplicate them many times in the
dataset.
5. The dataset has other examples with identical features but different labels. In that
case, define the classes more distinctly or measure extra features to enrich the data.
Working through these reasons is a practical way to boost model performance with a
data-centric approach; a minimal sketch of how the first reason (incorrect labels) can
be detected automatically is given below.
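The sketch below flags candidate label errors using cross-validated predictions. It
assumes the scikit-learn breast-cancer dataset and a simple self-confidence rule;
dedicated confident-learning tools implement more refined versions of the same idea, so
this should be read as an illustration rather than a complete method.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)

# Out-of-sample predicted probabilities: each example is scored by a model
# that never saw it during training.
proba = cross_val_predict(RandomForestClassifier(random_state=0),
                          X, y, cv=5, method="predict_proba")

# Self-confidence: the probability the model assigns to the GIVEN label.
self_confidence = proba[np.arange(len(y)), y]

# Examples whose given label receives very low probability are candidates
# for relabeling or removal (reason 1 in the list above).
suspect = np.argsort(self_confidence)[:10]
print("Indices of the 10 most suspicious labels:", suspect)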

6.6 Summary

Currently, many AI applications are model-centric. One possible reason is that the AI
sector pays careful attention to academic research on models, and it is difficult to
create large datasets that can become generally recognized standards; as a result, much
of the AI community regards model-centric machine learning as more promising. In today's
machine learning, data is crucial, yet it is often overlooked and mishandled in AI
initiatives. Consequently, hundreds of hours are wasted fine-tuning a model on faulty
data, which may well be the fundamental cause of a model's lower accuracy and have
nothing to do with model optimization.
The data-centric approach allows for continuous improvement in healthcare systems.
The data-centric approach plays a crucial role in the healthcare domain by offering
several benefits and advancements. Data-centric approach in the healthcare domain
leads to improved decision-making, personalized medicine, early detection, remote
monitoring, research advancements, population health management, and continuous
improvement. By harnessing the power of data, healthcare providers can deliver more
effective, precise, and patient-centered care.

References

1. Ng, A., Laird, D., & He, L. (2021). Data-centric AI competition. DeepLearning.AI. Available
online: https://github.com/Nov05/deeplearningai-data-centric-competition. Accessed on Dec 9, 2021.
2. Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W. G., Diamos, S., Diamos, G., He, L.,
Kiela, D., Jurado, D., et al. (2022). DataPerf: Benchmarks for data-centric AI development.
arXiv preprint arXiv:2207.10062.
3. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021).
"Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. In CHI.
4. Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in
commercial gender classification. In FAccT.
5. Aroyo, L., Lease, M., Paritosh, P., & Schaekermann, M. (2022). Data excellence for AI: Why
should you care? Interactions, 29(2), 66–69.
6. Prabadevi, B., Deepa, N., Krithika, L. B., & Vinod, V. (2020). Analysis of machine learning
algorithms on cancer dataset. In 2020 International Conference on Emerging Trends in Information
Technology and Engineering (ic-ETITE) (pp. 1–10). https://doi.org/10.1109/ic-ETITE47903.2020.36.
7. Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research
directions. SN Computer Science, 2, 160.
8. Khadse, V., Mahalle, P. N., & Biraris, S. V. (2018). An empirical comparison of supervised
machine learning algorithms for Internet of Things data. In 2018 Fourth International Conference
on Computing Communication Control and Automation (ICCUBEA) (pp. 1–6).
https://doi.org/10.1109/ICCUBEA.2018.8697476.
Chapter 7
Data-Centric AI in Mechanical
Engineering

7.1 Overview

The data-centric AI approach focuses on techniques for gathering data from multiple
sources, managing it, and utilizing it. This approach recognizes the value of
high-quality data: well-curated data in turn leads to better model development and more
accurate outcomes.
In mechanical engineering, this approach applies AI and data analysis techniques to
enhance many parts of mechanical engineering processes, systems, and designs. It takes
advantage of large volumes of data in order to gain insights, which in turn optimize
performance and efficiency and help in the maintenance of mechanical systems.
In data-centric AI, the focus is on understanding the data and its characteristics,
ensuring its quality and reliability, and using it to extract meaningful insights.
Here are some key aspects of data-centric AI:
. Data Collection: For training any AI model, the collection of relevant and useful
data is vital. This process includes recognizing potential sources of data and
defining the parameters of the data collection process. We also have to ensure
that the collected data is unbiased, diverse, and comprehensive [1].
. Data Preprocessing: Before analysis, the underlying data should go through a
preprocessing stage. Cleaning the data and transforming it into a form suitable for
analysis are vital parts of preprocessing. This stage involves steps such as removing
noise from the data, handling missing values, and standardizing or normalizing the
data into one format [2].
. Data Integration: In real-life scenarios, data can come from multiple sources.
Hence, an integration process is needed to create one single dataset. This process
involves combining multiple datasets, resolving inconsistencies throughout the data,
and merging information to create one comprehensive view.
. Data Quality Assurance: To obtain dependable and accurate AI results, we have to
ensure that the underlying data is of high quality. This process includes error
detection and correction and verification of data accuracy, along with a complete
assessment of data quality. It also addresses issues such as data redundancy,
inconsistency, and noise.
. Feature Engineering: Feature engineering involves selecting appropriate input
features from the underlying data that best represent the given problem. This step
needs a domain expert to creatively extract meaningful information from the
underlying data.
. Model Training and Evaluation: With the collected and preprocessed data, training
machine learning models becomes straightforward. Models are assessed using
appropriate metrics, and enhancements are made on an as-needed basis.
. Continuous Learning: Data-centric AI recognizes the need to keep learning from new
raw data, which is essential for enhancing AI models. This task includes monitoring
model performance, collecting new raw data, and maintaining the models so that new
information can easily be incorporated.
. Privacy and Ethics: Responsible and ethical use of data-centric AI is a major goal.
While collecting data, privacy concerns and ethical considerations must be addressed.
These include two tasks, namely preserving data anonymity and protecting sensitive
information.
With the help of a data-centric approach, mechanical systems can be made more accurate,
robust, and efficient. It helps organizations in the decision-making process based on
patterns and insights discovered in the data, making applications more reliable and
effective across multiple disciplines.

7.2 Need and Challenges of Data-Centric Approach

High-quality data is crucial for the success of any AI model and application. Here are
the key needs and challenges associated with the data-centric approach [3–5]:
Needs of Data-Centric Approach
. Accuracy and Reliability: To generate accurate and reliable predictions and
decisions, AI models must be provided with high-quality data. A data-centric AI
approach makes sure that the data used for training is clean and unbiased and can
therefore support sound outcomes. In mechanical engineering, an accurate outcome is
crucial for building the next part of the system; if one part of the system fails,
the entire system can experience a domino effect.
. Insights and Discoveries: Data-centric AI supports a two-step process: extracting
meaningful information from raw data and analyzing it to discover further insights
from large datasets. For the decision-making process, it is vital to maintain data
quality and to understand the underlying data, which helps in recognizing patterns
and correlations in the data. In mechanical engineering,
data collection from sensors supports data acquisition, but analyzing the collected data
in order to obtain insights is even more important.
. Adaptability and Flexibility: In real-world scenarios, environments are continuously
changing, and data-centric AI recognizes this. It helps keep up with changes in the
data, and the models are evolved with the help of new data. Because the models are
trained on newly updated data, they stay relevant for a longer period of time. In
mechanical engineering, change can occur at any stage, so the developed model should
be able to adjust to these changes.
. Decision Support: For making informed decisions, data-centric AI provides
evidence-based insights to decision-makers. Accurate and effective analysis of the
underlying data aids the decision-making process, the optimization of systems, and so
on. Mechanical systems can gain a solid support system through a data-centric
approach.
Challenges of Data-Centric Approach
. Data Quality and Availability
Ensuring the quality and availability of data is one of the major challenges faced by
data-centric AI. Data can be dirty, have missing values, contain errors or biases, or be
scattered across various platforms. Cleaning, integrating, and curating data takes a
great deal of effort. Data collection for mechanical systems is particularly critical;
one common approach is to use sensors to collect the data.
. Data Privacy and Security
As data becomes the core of everything, privacy and security concerns are becoming more
prominent. It is vital to protect sensitive and personal information at all costs; if
this data is breached, the consequences can be catastrophic. Hence, data privacy and
security remain a significant challenge for data-centric AI. Mechanical systems handle
critical data and heavy machinery, so it becomes imperative to protect sensitive
information about these systems.
. Scalability and Infrastructure
Handling this huge amount of data is a great challenge for data-centric AI. There is a
need to create robust and scalable systems that efficiently handle the storage,
processing, and analysis of big data.
. Data Governance and Ethics
Data-centric AI raises vital questions about data governance, data ownership, and
ethics. Clear and detailed policies and frameworks are needed for the secure handling of
data, the removal of biases, and related concerns. Ethical issues around data collection
and usage must be handled properly. Sensitive information about mechanical systems
should be protected at all times; if leaked, it can create a catastrophe.
. Interdisciplinary Collaboration
Data-centric AI requires collaboration among domain experts, data scientists, and IT
professionals. Effective communication across all disciplines helps in identifying the
right insights and requirements. This collaboration is vital for data-centric AI
implementation.
A combination of technical expertise, strong data management practices, ethical
considerations, and a commitment to continuous improvement can address these challenges.
Mechanical and computer engineers together can create systems that are more efficient
and reliable. With the help of the data-centric approach, an organization can also
unlock significant opportunities using high-quality data.

7.3 Application Implementation in Data-Centric Approach

By leveraging the principles and techniques of the data-centric approach, AI
applications can be developed and deployed that take advantage of the high-quality data
this approach produces. The steps involved in implementing the data-centric approach
are as follows [6]:
. Identify Business Objectives: The goals and objectives to be achieved by the
application should be clearly defined. This requires a clear understanding of the
problem, along with a deep understanding of the related use cases and the expected
results from the AI application.
. Data Requirements: Identifying the data requirements vital for your application is
essential. First, determine what kind of data the application needs, such as
structured, unstructured, semi-structured, or sensor data. It is also necessary to
identify the volume, variety, and velocity of data needed for training.
. Data Collection and Integration: Collect data from various sources such as
databases, APIs, or sensors. The collected data should be unbiased, diverse, and of
high quality. The data collected from all sources should be integrated into one
dataset, which will be used for training and testing the AI models (a small sketch of
such integration is given after this list).
. Data Preprocessing: Preprocessing the data involves tasks such as data cleaning,
removing noise, handling missing values, and standardizing or normalizing the data.
The next step is to perform feature engineering in order to extract meaningful
information from the raw data.
. Model Selection and Training: Select the AI model(s) that suit the application's
objectives and data characteristics. The preprocessed data from the earlier step is
used to train the model(s). To improve performance, optimization is performed,
including hyperparameter tuning and assessment of the model(s) for optimal
performance.
. Model Evaluation and Validation: Trained model(s) are assessed using evaluation
metrics such as accuracy, precision, recall, or F1-score. Validation of these
model(s) is done on a separate dataset or with the help of cross-validation.
. Application Development: The next step is to develop an application that can host
the trained model. Create user interface (UI) and user experience (UX) components to
provide the best experience to end-users.
. Deployment and Monitoring: Deploy the application in a real-time environment so
that users can access it. Set up a monitoring system to track application
performance, user data, and user feedback. These help in making better improvements
to the application from the user's perspective.
. Continuous Learning and Iteration: Consistently monitor and assess the
application's performance and gather new raw data for training the model; this helps
in making more efficient enhancements. Use user feedback to recognize and address the
limitations and issues the application is facing, which will make the application
more efficient and effective.
. Data Governance and Compliance: Make sure data governance policies, privacy
regulations, and ethical considerations are followed throughout the application's
development and implementation. Data protection rules have to be followed in order to
protect sensitive information and to secure data access.
A multidisciplinary team including data scientists, domain experts, software engineers,
and UX designers is needed for the data-centric approach. Collaboration and iteration
are vital factors for the development of a successful AI application, which makes use of
high-quality data for accurate AI predictions.
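As referenced in the data collection and integration step above, the following is a
minimal sketch of merging data from two sources into one dataset and applying simple
preprocessing. The table names, column names, and join key are purely illustrative
assumptions made for the sketch.

import pandas as pd

# Illustrative sources: a sensor log and a maintenance record, both keyed by machine_id.
sensor_log = pd.DataFrame({
    "machine_id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"]),
    "vibration_mm_s": [2.1, 2.4, 5.8, None],   # one missing value to clean
})
maintenance = pd.DataFrame({
    "machine_id": [1, 2],
    "last_service_days": [30, 210],
})

# Integration: combine the two sources into one comprehensive view
dataset = sensor_log.merge(maintenance, on="machine_id", how="left")

# Preprocessing: handle missing values and normalize the sensor reading
dataset["vibration_mm_s"] = dataset["vibration_mm_s"].fillna(dataset["vibration_mm_s"].mean())
dataset["vibration_norm"] = (
    (dataset["vibration_mm_s"] - dataset["vibration_mm_s"].mean())
    / dataset["vibration_mm_s"].std()
)
print(dataset)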
Here are some parts of the mechanical engineering process where the data-centric
AI approach can be implemented.
. Predictive Maintenance
Data collected from sensors in real time can help predict when a particular component of
a system is likely to fail, so that the system can be maintained before the failure
occurs. This reduces downtime and optimizes maintenance schedules (a minimal sketch is
given after this list).
. Performance Optimization
AI algorithms can analyze data from multiple sensors to recognize patterns that help
optimize a mechanical process. In a manufacturing process, for example, AI can optimize
parameters such as cutting speeds, tool paths, or material selection, improving the
efficiency and quality of mechanical systems.
. Design Optimization
AI techniques like genetic algorithms, machine learning, etc. can help in the anal-
ysis of large datasets. This analysis will identify optimal designs for components
or systems. This can enhance performance, reduce weight, maximize durability,
increase efficiency, and more.
. Energy Efficiency
AI algorithms can study historical energy consumption in order to recognize
opportunities for energy optimization. This involves adjusting operating parameters,
optimizing control strategies, or identifying areas of energy waste.
. Quality Control
AI algorithms can use data from sensors and cameras to identify defects or anomalies in
manufactured parts. This can enhance the quality control process, reduce scrap rates,
and ensure consistent product quality.
. Simulation and Virtual Testing
AI techniques can also improve virtual testing and simulation capabilities in mechanical
engineering. By analyzing huge amounts of data, AI algorithms can improve the accuracy
of simulations, reduce the need for physical prototypes, and shorten the design
iteration process.
. Supply Chain Optimization
AI can analyze supply chain data such as demand forecasts, inventory levels, and
supplier performance to optimize procurement, reduce costs, and improve overall supply
chain efficiency in mechanical engineering industries.
Data-centric AI in mechanical engineering takes advantage of large volumes of data in
order to optimize performance, improve efficiency, enhance designs, enable predictive
maintenance, and improve various other aspects of mechanical systems and processes. This
approach has the potential to revolutionize the field through more accurate
decision-making, while also enhancing performance and sustainability.

7.4 Application Implementation in Model-Centric Approach

Application implementation in a model-centric approach involves developing and deploying
AI applications in which the major focus is on the design and optimization of the AI
model itself. The steps involved are broadly similar to those of the data-centric
approach:
. Define application objectives.
. Data collection and preparation.
. Model selection.
. Model training.
. Model evaluation.
. Application development.
. Deployment and Testing: When an application is deployed in the production
environment, users gain easy access to it. Detailed and thorough testing of the
application should be conducted to make sure that all functions execute as per the
requirements; the application should handle a wide range of scenarios and provide
reliable outcomes.
. Monitoring and Performance Optimization: Consistently monitor and assess the
application's performance and gather new raw data for training the model; this helps
in making more efficient enhancements. Use user feedback to recognize and address the
limitations and issues the application is facing, which will make the application
more efficient and effective.
. Iteration and Enhancement: Collect feedback from users so that valuable feedback can
be incorporated into the application's improvement cycle. Consistent updates will
enhance the application's performance while maintaining the main objectives.
. Governance and Compliance: Make sure that data governance policies, privacy
regulations, and ethical considerations are followed throughout the application’s
development and implementation. Data protection rules have to be followed in
order to protect the sensitive information along with secure data access.
Model-centric AI in mechanical engineering uses AI techniques to create and
utilize models that can accurately represent the mechanical systems. These models
can be used in situations such as simulation, optimization, control, and decision-
making.
Here are some examples of model-centric AI in mechanical engineering [7–10]:
. Simulation and Virtual Prototyping
Using AI technology, you can develop accurate and reliable models that simulate
the behavior of mechanical systems. These models can be used to predict outcomes,
assess designs, and assess the impact of various parameters and operating condi-
tions before building physical prototypes. AI can improve simulation capabilities by
enhancing accuracy, reducing computation time, and automating model generation.
. Optimization and Design Exploration
AI algorithms help optimize mechanical design by exploring large design spaces
and finding the best solution. These algorithms can work in conjunction with a model of
the machine to iteratively search for design parameters that meet specific goals
such as enhancing performance, minimizing weight, or reducing cost.
. Control and Automation
AI-based systems use models to predict and monitor the behavior of a mechanical system.
By combining real-time data with the model, AI algorithms can make intelligent
decisions. They can also adjust control parameters to enhance performance, improve
stability, and deal with changing conditions.
. Digital Twin Technology
Digital twin technology involves building a virtual replica, or simulation, of a
physical mechanical system. AI techniques can be used to build the twin and keep it
updated with the help of sensor data from the physical process. This supports real-time
monitoring, maintenance, and performance optimization.
. Fault Detection and Diagnosis
AI algorithms can be trained to detect faults and anomalies in mechanical systems by
comparing real-time sensor data with the expected behavior of the system. The algorithms
can recognize deviations and give early warnings about potential failures, allowing
timely maintenance and repair of the mechanical system (a small residual-based sketch is
given after this list).
. Decision Support Systems
Model-centric AI can assist in developing decision support systems that help engineers
make informed choices. If AI models are integrated with real-time data, the algorithms
can provide accurate and reliable predictions, which helps in optimizing processes such
as production planning, scheduling, and quality control.
. Sensitivity Analysis and What-If Scenarios
AI models can be used to perform sensitivity analysis and explore what-if scenarios. By
varying the input parameters, the model's behavior can be observed, giving engineers
insight into which factors affect system performance and by how much. An informed
decision can then be made based on this detailed analysis.
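As referenced in the fault detection item above, the following is a minimal
residual-based sketch: a simple model of expected behavior is fitted on healthy
operating data, and large deviations between measured and predicted values are flagged.
The data is simulated, and the relationship, units, and threshold are illustrative
assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Healthy operating data: bearing temperature rises roughly linearly with load
load = rng.uniform(10, 100, size=500).reshape(-1, 1)
temp = 25 + 0.4 * load.ravel() + rng.normal(0, 1.0, size=500)

# "Digital twin" stand-in: a model of the expected temperature for a given load
twin = LinearRegression().fit(load, temp)

# New measurements: the last one is anomalously hot for its load
new_load = np.array([[40.0], [70.0], [55.0]])
new_temp = np.array([41.0, 53.5, 70.0])

residuals = new_temp - twin.predict(new_load)
threshold = 5.0   # degrees above expectation treated as a fault indicator
for r, flagged in zip(residuals, residuals > threshold):
    print(f"residual = {r:+.1f} degrees -> {'possible fault' if flagged else 'normal'}")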
Model-centric AI in mechanical engineering can assist engineers with advanced modeling
capabilities. It also enhances the design process, optimization, control, and
decision-making. It takes advantage of powerful AI algorithms to create and utilize
accurate models, which in turn provide enhanced performance, reduced cost, and
innovation in the mechanical engineering field.

7.5 Comparison of Model-Centric AI and Data-Centric AI

Model-centric AI and data-centric AI are two different approaches to artificial
intelligence that focus on different parts of the AI development process. Let us compare
the two approaches:

The comparison can be summarized along the following aspects; for each aspect, the
model-centric view is given first and the data-centric view second.
1. Focus: Model-centric AI majorly focuses on the design and optimization of the model
itself; the focus is on selecting the right model, architecture, and algorithm to
achieve high performance. Data-centric AI has a strong focus on data quality, unbiased
data collection, and management of this data; the focus is on data collection and
preprocessing, and this high-quality data is then used for training models.
2. Importance of data: In model-centric AI, data is an essential part, but the main
focus is on algorithms and models; if a sufficient amount of data is provided, the model
can do well. In data-centric AI, data is the critical asset; the main focus is on
understanding the data and its quality, since high-quality data ensures better AI
outcomes.
3. Iteration and improvement: In model-centric AI, the iteration and improvement process
mainly focuses on optimizing the model itself; hyperparameter tuning, architecture
modifications, and algorithm enhancements are vital and help in enhancing the model. In
data-centric AI, iteration mainly focuses on consistently updating and enhancing the
data; data quality, feature engineering, and incorporating new data are vital and help
in improving the accuracy and reliability of AI models.
4. Domain experts and skillset: Model-centric AI needs experts in domains such as
machine learning algorithms, model architectures, and optimization techniques; model
training, hyperparameter tuning, and understanding complex algorithms are critical for
this approach. Data-centric AI needs experts in data engineering, data preprocessing,
and data analysis; data collection, cleaning, integration, and feature engineering are
vital skills, and domain knowledge is also helpful for understanding the data, its
characteristics, and its relevance.
5. Adaptability: Model-centric AI faces difficulties when adapting to new environments
and newly gathered data; the model can become outdated as time passes, and significant
modifications are needed to include new data. Data-centric AI focuses on continuous
learning and adaptation to new changes; its main focus is on using new data for further
enhancement of the models, which makes them more robust and accurate over time.
6. Decision-making: In model-centric AI, the decision-making process aims to deliver
accurate and reliable outcomes driven by the model's predictions, with the main focus on
model optimization to achieve the required outcomes. In data-centric AI, decision-making
is influenced by patterns and insights extracted from the data, with the focus on
understanding the data and then making better use of it.

Both model-centric and data-centric approaches have their pros and cons, and the
selection depends on the requirements of the AI application. In practice, a combination
of the two approaches creates a more effective AI system.

7.6 Case Study: Mechanical Tools Classification

The dataset contains images of tools such as hammers, wrenches, pliers, and ropes, as
shown in Fig. 7.1. The aim of the case study is to classify a given mechanical tool
image correctly; some of the images from the dataset are shown in the figure. By
developing a deep learning model, we can classify the images.

Fig. 7.1 Mechanical toolset
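A minimal sketch of such a deep learning classifier is given below. It assumes
TensorFlow/Keras, a directory of tool images organized into one sub-folder per class, an
image size of 128 x 128, and four classes; all of these are illustrative assumptions, so
the actual model used in the case study may differ.

import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (128, 128)
NUM_CLASSES = 4                    # e.g., hammer, wrench, pliers, rope (assumption)
DATA_DIR = "mechanical_tools/"     # one sub-folder of images per tool class (assumption)

# Load and split the image folder into training and validation sets
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.2, subset="training", seed=42,
    image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.2, subset="validation", seed=42,
    image_size=IMG_SIZE, batch_size=32)

# A small convolutional network for tool classification
model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)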



The accuracy achieved with the model developed for this case study is as follows:

Accuracy : 90.00%

In mechanical engineering, a data-centric approach means using clean data, and the
insights derived from it, to make informed decisions, build products, and so on. If we
use data that contains a lot of noise and missing information, it creates a number of
problems for developers: the resulting system will not be efficient, and it carries
considerable risk because the data is dirty. Hence, it is imperative to use a
data-centric approach for good and reliable outcomes.

7.7 Summary

In data-centric AI, the focus is on understanding the data and its characteristics. It will
ensure data quality and reliability and use it to extract meaningful insights. In mechan-
ical engineering, we can detect probable system component failures using real-time
data gathered from sensors, enabling preventive maintenance prior to any breakdown.
This strategy reduces downtime and streamlines maintenance plans. AI algorithms
are also capable of pattern recognition and process optimization for mechanical
processes, including production parameters like cutting rates, tool trajectories, and
material selection. The use of AI in performance optimization hence improves the
general effectiveness and standard of mechanical systems.

References

1. Gallagher, M. (2009). Data collection and analysis. Researching with Children and Young
People: Research Design, Methods and Analysis, 65–127.
2. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data
preprocessing: Methods and prospects. Big Data Analytics, 1(1), 1–22.
3. Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., & Hu, X. (2023). Data-centric AI: Perspectives and
challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM)
(pp. 945–948). Society for Industrial and Applied Mathematics.
4. Polyzotis, N., & Zaharia, M. (2021). What can data-centric AI learn from data and ML
engineering? arXiv preprint arXiv:2112.06439.
5. Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., Jiang, Z., Zhong, S., & Hu, X. (2023). Data-centric
artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.
6. Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W. G., Diamos, S., & Reddi, V. J.
(2022). Dataperf: Benchmarks for data-centric ai development. arXiv preprint arXiv:2207.
10062.
7. Amini, M., Sharifani, K., & Rahmani, A. (2023). Machine learning model towards evaluating
data gathering methods in manufacturing and mechanical engineering. International Journal
of Applied Science and Engineering Research, 15(2023), 349–362.
8. Patel, A. R., Ramaiya, K. K., Bhatia, C. V., Shah, H. N., & Bhavsar, S. N. (2021). Artificial
intelligence: Prospect in mechanical engineering field—a review. Data Science and Intelligent
Applications: Proceedings of ICDSIA, 2020, 267–282.
9. Razvi, S. S., Feng, S., Narayanan, A., Lee, Y. T. T., & Witherell, P. (2019, August). A review of
machine learning applications in additive manufacturing. In International design engineering
technical conferences and computers and information in engineering conference (Vol. 59179,
p. V001T02A040). American Society of Mechanical Engineers.
10. Huang, Q. (2016, July). Application of artificial intelligence in mechanical engineering. In
2nd International conference on computer engineering, information science & application
technology (ICCIA 2017) (pp. 882–887). Atlantis Press.
Chapter 8
Data-Centric AI in Information,
Communication and Technology

8.1 Overview

Data-centric AI has gained significant traction in the field of Information and
Communication Technology (ICT) due to its ability to harness the power of data for
enhanced decision-making, automation, and optimization. In the ICT domain, data-
centric AI focuses on leveraging large volumes of data to drive innovation, improve
system performance, and deliver personalized user experiences. Here is an overview
of how data-centric AI is transforming ICT:
. Data-Driven Decision-Making: Data-centric AI enables organizations to make
data-driven decisions by analyzing and extracting insights from vast amounts
of structured and unstructured data. By employing machine learning algorithms
and statistical models, ICT systems can automatically process and interpret data,
leading to more informed decision-making processes across various areas such
as resource allocation, customer targeting, and service optimization [1].
. Predictive Analytics: Data-centric AI enables the development of predictive
analytics models that can anticipate future events and trends. In the ICT sector,
this capability is particularly useful for demand forecasting, network optimization,
and resource management. By analyzing historical data, ICT systems can predict
system failures, network congestion, or customer demand patterns, allowing orga-
nizations to proactively address potential issues and optimize their operations
[1].
. Personalized User Experiences: Data-centric AI empowers ICT systems to
provide personalized user experiences. By leveraging user data, such as pref-
erences, behaviors, and past interactions, AI algorithms can tailor recommenda-
tions, content, and services to individual users. This personalization enhances user
satisfaction, engagement, and loyalty, leading to improved customer experiences
in areas such as e-commerce platforms, content streaming services, and social
media platforms [1].


. Intelligent Automation: Data-centric AI plays a pivotal role in automating various
ICT processes and workflows. Machine learning algorithms can be trained on large
datasets to automate tasks such as data processing, system monitoring, network
management, and cybersecurity. Through intelligent automation, ICT systems
can optimize resource utilization, reduce human errors, and enhance operational
efficiency [2].
. Network Optimization: Data-centric AI enables advanced network optimization
in the ICT domain. By analyzing network traffic, performance metrics, and user
behavior, AI algorithms can dynamically optimize network routing, bandwidth
allocation, and quality of service. This optimization ensures efficient and reliable
network connectivity, especially in complex ICT infrastructures such as cloud
computing, Internet of Things (IoT) systems, and 5G networks [2].
. Cybersecurity and Anomaly Detection: Data-centric AI plays a critical role in
cybersecurity by identifying anomalies, detecting threats, and mitigating risks. AI
algorithms can analyze large-scale network traffic, user behavior, and system logs
to detect patterns indicative of malicious activities or security breaches. By lever-
aging real-time data analysis and machine learning, ICT systems can proactively
respond to cybersecurity threats and protect sensitive information [2] (a minimal
anomaly-detection sketch is given after this list).
. Intelligent Virtual Assistants: Data-centric AI enables the development of intelli-
gent virtual assistants or chatbots that interact with users in natural language.
These AI-powered assistants can provide personalized support, answer user
queries, and perform tasks in various ICT domains, such as customer service, tech-
nical support, and information retrieval. By continuously learning from user inter-
actions and data, virtual assistants become more proficient over time, delivering
enhanced user experiences [3]
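As referenced in the cybersecurity item above, the following is a minimal sketch of
unsupervised anomaly detection on network-traffic features. The traffic is simulated,
and the features and contamination rate are illustrative assumptions, not a production
configuration.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Illustrative per-connection features: [bytes sent, bytes received, duration in s]
normal_traffic = rng.normal(loc=[500, 1500, 2.0], scale=[100, 300, 0.5], size=(1000, 3))
suspicious = np.array([[50_000, 200, 0.1],      # large upload, tiny response
                       [400, 90_000, 30.0]])    # unusually large download
traffic = np.vstack([normal_traffic, suspicious])

# Unsupervised anomaly detector trained on the observed traffic
detector = IsolationForest(contamination=0.01, random_state=7)
labels = detector.fit_predict(traffic)          # -1 = anomaly, +1 = normal

print("Flagged connections:", np.where(labels == -1)[0])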
Thus, data-centric AI is revolutionizing the ICT domain by leveraging the
power of data to drive informed decision-making, personalized experiences, intel-
ligent automation, network optimization, cybersecurity, and virtual assistance. By
harnessing the vast amount of data generated in ICT systems, organizations can
unlock new insights, improve efficiency, and deliver enhanced services to users. As
technology advances and datasets continue to grow, the potential for data-centric AI
in ICT is poised to expand further, driving innovation and transforming the industry
[1].

8.2 Need and Challenges of Data-Centric Approach

The need for a data-centric approach in various fields arises from the increasing avail-
ability of data and the recognition of its crucial role in driving innovation, improving
decision-making, and enhancing system performance. The data-centric approach
prioritizes the collection, curation, and utilization of high-quality data to inform and
optimize processes, systems, and strategies. However, this approach also comes with
several challenges that need to be addressed. In this response, we will delve into the
need for a data-centric approach and discuss the key challenges associated with it
[4].
Need for a Data-Centric Approach:
. Enhanced Decision-Making: Data-centric approaches enable organizations to
make informed and data-driven decisions. By analyzing large volumes of data,
organizations can gain valuable insights into customer behavior, market trends,
operational efficiency, and other critical factors. These insights help in identi-
fying patterns, predicting outcomes, and making better strategic and operational
decisions [4].
. Innovation and Product Development: Data-centric approaches foster innovation
by providing valuable information for product development and optimization.
By collecting and analyzing customer feedback, usage patterns, and market data,
organizations can identify gaps, uncover new opportunities, and design products
and services that cater to specific customer needs. Data-centricity enables iterative
improvement and innovation cycles by continuously gathering user feedback and
incorporating it into product development processes [4].
. Personalized Experiences: In today’s digital age, customers expect personal-
ized experiences. Data-centric approaches allow organizations to collect and
analyze customer data to gain insights into preferences, behaviors, and needs.
This information enables the delivery of tailored experiences, recommendations,
and services, enhancing customer satisfaction, engagement, and loyalty [4].
. Process Optimization and Efficiency: Data-centric approaches help organiza-
tions optimize processes and improve operational efficiency. By collecting data
on process performance, resource utilization, and bottlenecks, organizations can
identify areas of improvement, eliminate inefficiencies, and streamline operations.
This leads to cost savings, increased productivity, and enhanced competitiveness
[4].
. Performance Monitoring and Maintenance: Data-centric approaches enable
continuous monitoring and maintenance of systems, equipment, and infrastruc-
ture. By collecting and analyzing real-time operational data, organizations can
detect anomalies, predict failures, and proactively address issues before they
escalate. This improves system reliability, minimizes downtime, and reduces
maintenance costs [4].
Challenges of a Data-Centric Approach [4]:
. Data Quality and Reliability: One of the primary challenges of a data-centric
approach is ensuring the quality and reliability of the data used for decision-
making. Data may be incomplete, inaccurate, or biased, leading to incorrect
insights and flawed decision-making. Data cleansing, validation, and verification
processes are necessary to ensure data quality and reliability [4].
. Data Privacy and Security: With the increasing emphasis on data-centric
approaches, data privacy and security become critical concerns. Organizations
need to implement robust data protection measures to safeguard sensitive and
personal information. Compliance with privacy regulations, encryption, access
controls, and secure data storage are essential to maintain data privacy and protect
against cyber threats [4].
. Data Collection and Integration: Collecting relevant and comprehensive data
can be a complex and resource-intensive task. Different data sources may have
varying formats, structures, and semantics, making integration and consolida-
tion challenging. Organizations need to establish effective data collection mecha-
nisms, employ data governance frameworks, and implement data integration and
transformation processes to ensure seamless data flow and interoperability [4].
. Scalability and Infrastructure: Data-centric approaches require robust and scalable
infrastructure to handle large volumes of data. Processing, storing, and analyzing
massive datasets can strain existing IT infrastructure. Organizations need to invest
in scalable storage solutions, high-performance computing resources, and efficient
data processing frameworks to handle the data-centric demands of their operations
[4].
. Data Interpretation and Analysis: Making sense of the vast amount of data
collected is a complex task. Data-centric approaches require skilled data scientists,
analysts, and domain experts who can interpret and analyze the data effectively.
Organizations need to invest in talent acquisition, training, and the development
of data analytics capabilities to extract meaningful insights from the data [4].
. Ethical and Bias Concerns: Data-centric approaches raise ethical concerns
regarding data usage and potential biases. Biases can emerge from biased data
collection, algorithmic biases, or unfair data representations. Organizations must
be vigilant in identifying and addressing biases to ensure fairness, transparency,
and ethical use of data in decision-making processes [4].
. Data Governance and Compliance: With the increased reliance on data, organiza-
tions need to establish robust data governance frameworks to ensure responsible
data management. Data-centric approaches require organizations to comply with
regulations, industry standards, and ethical guidelines regarding data collection,
storage, processing, and sharing. Implementing data governance frameworks and
ensuring compliance can be complex and resource-intensive [4].
. Cultural and Organizational Shifts: Adopting a data-centric approach often
requires a cultural and organizational shift within an organization. It involves
changing mindsets, embracing data-driven decision-making, and fostering a data-
centric culture. Organizations need to invest in change management, provide
training and education on data literacy, and promote a data-driven mindset across
all levels of the organization [4].
In conclusion, the need for a data-centric approach arises from the growing
importance of data in driving innovation, decision-making, and system optimiza-
tion. However, this approach is not without its challenges. Addressing issues related
to data quality, privacy, security, integration, scalability, interpretation, ethics, gover-
nance, and organizational readiness is crucial for organizations to fully leverage
the potential of a data-centric approach. Overcoming these challenges requires a
comprehensive and holistic strategy that encompasses technological, organizational,
and cultural aspects to harness the power of data effectively.

8.3 Application Implementation in Data-Centric Approach

Implementing ICT applications using a data-centric approach has the potential
to transform industries, optimize processes, and enhance user experiences. This
approach leverages the power of data to drive innovation, decision-making, and
system optimization (Fig. 8.1).
In this response, we will explore the implementation of ICT applications through
a data-centric approach and discuss its benefits and challenges [5, 6].
. Customer Relationship Management (CRM): Implementing a data-centric
approach in CRM enables organizations to gain insights into customer behavior,
preferences, and needs. By collecting and analyzing customer data from various
touchpoints, such as interactions, transactions, and feedback, organizations can
personalize their marketing campaigns, improve customer engagement, and tailor
their products or services to meet individual customer requirements [5].
. Business Intelligence (BI) and Analytics: A data-centric approach in BI and
analytics empowers organizations to extract actionable insights from vast amounts
of data. By implementing robust data collection mechanisms, data integration
processes, and advanced analytics tools, organizations can gain a deeper under-
standing of their operations, market trends, and customer behavior. This enables
informed decision-making, identification of new business opportunities, and
optimization of business processes [5].
Fig. 8.1 Data-centric AI development

. Supply Chain Management (SCM): Implementing a data-centric approach in
SCM helps organizations optimize their supply chain operations, reduce costs,
and improve efficiency. By collecting and analyzing real-time data on inventory
levels, demand patterns, and supplier performance, organizations can make
data-driven decisions related to procurement, production planning, and logistics.
This enables organizations to streamline their supply chain processes, reduce lead
times, and improve overall customer satisfaction [6].
. Internet of Things (IoT): The data-centric approach is essential in implementing
IoT applications. By connecting various devices and sensors, IoT generates
massive amounts of data. Leveraging this data through a data-centric approach
allows organizations to monitor and control IoT devices, analyze sensor data
for insights, and optimize operations. For example, in smart cities, data-centric
IoT applications can enable efficient traffic management, energy consumption
optimization, and waste management [5].
. Predictive Maintenance: By implementing a data-centric approach in predictive
maintenance, organizations can monitor and analyze real-time sensor data from
machines and equipment. This data is used to detect anomalies, predict failures,
and schedule maintenance activities proactively. Predictive maintenance helps
organizations minimize downtime, reduce maintenance costs, and optimize asset
utilization [6].
. Healthcare: Data-centric approaches are transforming the healthcare industry by
improving patient care, clinical decision-making, and resource allocation. By
collecting and analyzing patient health records, medical imaging data, and clinical
research data, healthcare providers can identify patterns, predict disease progres-
sion, and personalize treatment plans. This enables better patient outcomes,
reduces medical errors, and enhances healthcare delivery [5].
. Smart Energy Management: A data-centric approach is crucial for implementing
smart energy management systems. By collecting data on energy consumption,
weather patterns, and user behavior, organizations can optimize energy usage,
predict peak demand periods, and implement energy-saving measures. This not
only reduces energy costs but also contributes to environmental sustainability [5]
(a minimal forecasting sketch is given after this list).
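As referenced in the smart energy management item above, the following is a minimal
sketch of forecasting the next hour's energy consumption from recent history using lag
features. The synthetic load profile and the use of plain linear regression are
simplifying assumptions; real systems typically use richer features and models.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Synthetic hourly energy consumption with a daily cycle plus noise (in kWh)
hours = np.arange(24 * 60)                       # 60 days of hourly readings
load = 100 + 30 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, size=hours.size)

# Build lag features: predict the next hour from the previous 24 hours
LAGS = 24
X = np.array([load[t - LAGS:t] for t in range(LAGS, len(load))])
y = load[LAGS:]

# Train on the first 50 days, forecast the rest
split = 24 * 50 - LAGS
model = LinearRegression().fit(X[:split], y[:split])
pred = model.predict(X[split:])
mae = np.mean(np.abs(pred - y[split:]))
print(f"Mean absolute error on held-out hours: {mae:.2f} kWh")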
Benefits of Data-Centric Approach in ICT Application Implementation:
. Improved Decision-Making: Data-centric approaches provide organizations with
timely and accurate information for decision-making, leading to better outcomes
and improved business performance [6].
. Personalization: By leveraging customer data, organizations can deliver person-
alized experiences, recommendations, and services, enhancing customer satisfac-
tion and loyalty [6].
. Optimization and Efficiency: Data-centric approaches enable organizations to
optimize processes, resources, and operations, resulting in improved efficiency
and cost savings [6].
. Enhanced User Experiences: By analyzing user data, organizations can tailor their
products, services, and interfaces to meet user expectations and provide seamless
and engaging experiences [6].

. Innovation and New Opportunities: Data-centric approaches provide valuable
insights into market trends, user needs, and emerging opportunities, driving
innovation and fostering a competitive edge [5].
. Proactive Maintenance and Risk Management: By leveraging real-time data, orga-
nizations can detect anomalies, predict failures, and mitigate risks proactively,
leading to improved system reliability and reduced downtime [6].
Challenges of Data-Centric Approach in ICT Application Implementation:
. Data Quality and Integration: Ensuring data quality, accuracy, and consistency
across different data sources can be challenging. Organizations need to invest in
data cleansing, integration, and data governance processes.
. Data Security and Privacy: Protecting sensitive data and ensuring compliance
with privacy regulations are critical challenges in data-centric implementations.
Organizations need to implement robust security measures and establish data
privacy frameworks.
. Scalability and Infrastructure: Implementing data-centric applications often
requires scalable infrastructure and high-performance computing resources to
handle large volumes of data and ensure efficient processing.
. Skillset and Talent: Data-centric implementations require skilled data scientists,
analysts, and IT professionals who can collect, analyze, and interpret data effec-
tively. Organizations may face challenges in attracting and retaining the right
talent.
. Change Management: Adopting a data-centric approach often necessitates
cultural and organizational shifts. Organizations need to invest in change manage-
ment efforts to foster a data-driven culture and promote data literacy across the
organization.
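As a small illustration of the data cleansing mentioned under the first challenge
above, the following Python sketch (using pandas; the customer table, column names,
and validity rules are hypothetical) removes duplicate rows, normalizes inconsistent
category values, and flags records that fail basic checks:

import pandas as pd

# Hypothetical customer records pulled from two overlapping source systems.
records = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 104],
    "country": ["IN", "IN", "India", "us", None],
    "monthly_spend": [450.0, 450.0, -20.0, 300.0, 125.5],
})

# 1. Remove exact duplicate rows coming from overlapping extracts.
records = records.drop_duplicates()

# 2. Normalize inconsistent categorical values to one canonical form.
country_map = {"in": "IN", "india": "IN", "us": "US"}
records["country"] = (records["country"].str.strip().str.lower()
                      .map(country_map).fillna(records["country"]))

# 3. Flag records that violate simple validity rules for manual review.
records["valid"] = records["country"].notna() & (records["monthly_spend"] >= 0)
print(records)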
So, implementing ICT applications through a data-centric approach offers
numerous benefits, including improved decision-making, personalization, optimiza-
tion, and enhanced user experiences. However, organizations must address challenges
related to data quality, security, scalability, talent, and change management to fully
leverage the potential of a data-centric approach. By overcoming these challenges,
organizations can unlock the transformative power of data and drive innovation in
their respective industries.

8.4 Application Implementation in Model-Centric Approach

Information and Communication Technology (ICT) applications have traditionally


been implemented using a model-centric approach, where the focus is on designing
and developing robust software models and algorithms. However, with the increasing
availability of data and the advancements in artificial intelligence and machine
learning, there has been a paradigm shift toward a data-centric approach in ICT appli-
cation implementation. The data-centric approach emphasizes the collection, anal-
ysis, and utilization of large volumes of data to drive innovation, optimize processes,
and deliver enhanced user experiences (Fig. 8.2).
In this section, we will explore the implementation of ICT applications through
a model-centric approach and discuss its benefits and challenges.
. Software Development: In the model-centric approach, software development
revolves around designing and implementing software models and algorithms.
Developers focus on creating logical models, architectural designs, and algorithms
that define the behavior and functionality of the application. The model-centric
approach emphasizes the software development life cycle, including requirements
gathering, design, coding, testing, and deployment. The goal is to create efficient
and scalable software models that can handle various tasks and deliver the desired
functionalities [5].
. Algorithm Design and Optimization: Model-centric ICT application implementa-
tion emphasizes algorithm design and optimization. Developers and data scientists
create algorithms that perform specific tasks or solve particular problems. These
algorithms are often designed using mathematical and computational models,
which are then implemented in software systems. The model-centric approach
focuses on optimizing algorithms for efficiency, accuracy, and scalability to handle
large-scale data processing and analysis [5].

Fig. 8.2 Model-centric AI development

. Performance Optimization: In the model-centric approach, performance optimiza-
tion plays a crucial role. Developers focus on optimizing the performance of soft-
ware models and algorithms to ensure efficient execution and response times.
Techniques such as algorithmic optimization, parallel processing, caching, and
load balancing are employed to enhance performance and scalability. The model-
centric approach aims to deliver high-performance applications that can handle
large workloads and provide quick responses to user requests [5] (a brief caching and parallelism sketch follows this list).
. User Interface Design: User interface design is an essential aspect of ICT appli-
cation implementation in the model-centric approach. Developers create user
interfaces that interact with the underlying software models and algorithms. The
focus is on designing intuitive and user-friendly interfaces that provide a seamless
user experience. User interface design involves understanding user requirements,
conducting usability testing, and iterating the design based on user feedback [6].
. Code Reusability: Model-centric ICT application implementation promotes code
reusability. Developers create modular and reusable components that can be
used across different applications or modules. This approach enhances devel-
opment efficiency, reduces duplication of effort, and simplifies maintenance and
updates. Code reusability allows developers to leverage existing software models,
algorithms, and libraries, accelerating the application development process [6].
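The sketch below illustrates two of the performance techniques mentioned above,
caching and parallel processing, in Python. The workload functions are placeholders
invented for the example rather than part of any specific application.

from functools import lru_cache
from concurrent.futures import ProcessPoolExecutor

@lru_cache(maxsize=1024)
def cached_lookup(key: int) -> float:
    # Placeholder for an expensive computation; repeated calls with the same
    # key are answered from the in-memory cache instead of being recomputed.
    return sum(i * i for i in range(key)) / 1e6

def heavy_task(n: int) -> int:
    # Placeholder for an independent, CPU-bound unit of work.
    return sum(i % 7 for i in range(n))

if __name__ == "__main__":
    # Caching: the second call with the same argument is effectively free.
    print(cached_lookup(50_000), cached_lookup(50_000))
    # Parallel processing: independent tasks are spread across CPU cores.
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(heavy_task, [100_000, 200_000, 300_000])))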
Benefits of Model-Centric Approach in ICT Application Implementation:
. Emphasis on Software Design: The model-centric approach ensures robust soft-
ware design, with a focus on logical models and architectural patterns. This leads
to well-structured and maintainable codebases [5].
. Performance Optimization: The model-centric approach allows for fine-tuning
and optimizing algorithms and software models for improved performance and
efficiency [5].
. Algorithmic Expertise: The model-centric approach requires expertise in
designing and implementing algorithms, which can lead to the development of
sophisticated and specialized applications [5].
. Modularity and Reusability: By designing modular components, the model-
centric approach promotes code reusability, reducing development time and effort
[5].
. Clear Separation of Concerns: The model-centric approach facilitates the sepa-
ration of different concerns, such as data processing, algorithm implementation,
and user interface design, making the development process more manageable and
maintainable [5].
Challenges of Model-Centric Approach in ICT Application Implementation:
. Limited Scalability: The model-centric approach may face challenges in handling
large volumes of data and scaling applications to meet growing demands [6].
. Lack of Adaptability: Traditional model-centric approaches may struggle to adapt
to dynamic and evolving data sources, as they may require manual updates and
modifications to accommodate new data formats or structures [6].
. Data Integration: Model-centric approaches often require well-defined data struc-
tures and formats, which can make data integration and interoperability with
diverse data sources complex [6].
. Limited Flexibility: The model-centric approach may have limitations in adapting
to changing business requirements and evolving user needs, as it relies on
predefined models and algorithms [6].
Hence, the model-centric approach to ICT application implementation emphasizes
software design, algorithm development, and performance optimization. While it
offers benefits such as robust software architecture, performance optimization, and
code reusability, it may face challenges in scalability, adaptability to dynamic data
sources, and flexibility. The model-centric approach can still be effective in certain
domains and applications where the focus is on algorithmic expertise and efficient
software models. However, with the increasing availability of data and the rise of data-
centric approaches, organizations are increasingly adopting a data-centric paradigm
to leverage the power of data for innovation, optimization, and personalized user
experiences.

8.5 Comparison of Model-Centric AI and Data-Centric AI

Model-centric AI and data-centric AI are two distinct paradigms that approach artifi-
cial intelligence from different angles. In this section, we will compare and contrast
these two approaches, highlighting their differences, strengths, and weaknesses.
1. Focus and Emphasis
. Model-Centric AI: Model-centric AI places a strong emphasis on designing
and developing sophisticated algorithms and models. The focus is on creating
intelligent systems through the use of well-defined models and algorithms that
capture knowledge and decision-making processes. The models are designed
to solve specific tasks and make accurate predictions based on the given inputs
[7].
. Data-Centric AI: Data-centric AI, on the other hand, focuses on the collec-
tion, analysis, and utilization of large volumes of data. The emphasis is on
extracting meaningful insights, patterns, and correlations from the data to drive
decision-making and innovation. Data-centric AI leverages machine learning
techniques to learn from data and improve performance over time [6–8].
2. Approach to Problem-Solving
. Model-Centric AI: Model-centric AI approaches problem-solving through
the lens of designing and optimizing algorithms and models. The focus is on
creating intelligent systems that rely on predefined rules, logic, and repre-
sentations. These models are trained on specific datasets and are expected to
generalize well to new inputs [6].
. Data-Centric AI: Data-centric AI, on the other hand, approaches problem-
solving by learning from data. It uses machine learning algorithms to analyze
large datasets and extract valuable insights. Data-centric AI systems can adapt
and evolve based on the data they encounter, making them more flexible in
handling diverse scenarios [6].
3. Data Requirements
. Model-Centric AI: Model-centric AI typically requires labeled or structured
data that aligns with the predefined models and algorithms. The training data
must be carefully curated and annotated to match the specific requirements of
the model. The performance of model-centric AI heavily relies on the quality
and representativeness of the training data [6].
. Data-Centric AI: Data-centric AI thrives on large volumes of data, whether
labeled or unlabeled. It leverages the abundance of data to identify patterns,
correlations, and trends. Data-centric AI systems can handle unstructured and
diverse datasets, allowing for a wider range of applications [6–8].
4. Performance and Accuracy
. Model-Centric AI: Model-centric AI can achieve high levels of accuracy and
performance on tasks that align with the predefined models. These models are
designed to optimize specific metrics and can deliver precise and consistent
results within their domain of expertise. However, they may struggle when
faced with unfamiliar or evolving scenarios [6].
. Data-Centric AI: Data-centric AI systems can adapt to different domains and
evolve as they encounter new data. They can learn from diverse datasets and
improve performance over time. While they may not always achieve the same
level of accuracy as model-centric approaches, data-centric AI systems can
provide valuable insights and handle complex, real-world scenarios [6].
5. Interpretability and Explainability
. Model-Centric AI: Model-centric AI often relies on interpretable models that
can provide insights into the decision-making process. With explicit rules
and representations, it is easier to understand how the models arrive at their
outputs. This interpretability can be crucial in domains where transparency
and explainability are required [6].
. Data-Centric AI: Data-centric AI, particularly deep learning models, can
be less interpretable due to their complex architecture and the lack of
explicit rules. While they excel at pattern recognition and can deliver
impressive results, understanding the decision-making process and providing
explanations can be challenging [8]. A brief sketch contrasting the two styles appears at the end of this comparison.
6. Adaptability and Generalization
. Model-Centric AI: Model-centric AI systems excel at tasks for which they
were specifically designed and trained. They can generalize well within the
defined problem domain but may struggle with tasks that deviate from the
original design. Adapting model-centric AI to new scenarios often requires
significant effort and retraining [8].
. Data-Centric AI: Data-centric AI systems have the advantage of adaptability
and generalization. They can learn from diverse datasets and handle different
scenarios, making them more flexible in addressing new challenges. Data-
centric AI systems can generalize well beyond their original training data,
allowing for wider application and scalability [8].
7. Scalability and Resource Requirements
. Model-Centric AI: Model-centric AI may require substantial computational
resources for training and inference, particularly for complex models and
large datasets. Training models can be computationally intensive and time-
consuming, requiring powerful hardware and significant memory capacity
[8].
. Data-Centric AI: Data-centric AI can benefit from distributed computing and
parallel processing to handle large-scale datasets efficiently. While the initial
data processing and preparation stages may require computational resources,
data-centric AI systems can often scale well and leverage cloud-based
infrastructure for processing and storage [8].
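The brief sketch below illustrates the interpretability contrast from point 5: a
shallow decision tree whose rules can be printed directly versus an ensemble that is
explained only indirectly through permutation importance. It assumes scikit-learn is
available, and the synthetic data and feature names are invented for the example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
feature_names = ["usage", "latency", "tenure"]

# Interpretable flavour: a shallow tree whose decision rules can be printed.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=feature_names))

# Black-box flavour: a forest explained only indirectly, here via the
# permutation importance of each input feature.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
for name, score in zip(feature_names, imp.importances_mean):
    print(f"{name}: {score:.3f}")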
Therefore, as we can see, both model-centric AI and data-centric AI have their
strengths and weaknesses. Model-centric AI offers precise and interpretable solutions
within well-defined problem domains, while data-centric AI provides adaptability,
scalability, and the ability to handle diverse scenarios. The choice between these
approaches depends on the specific problem, available data, interpretability require-
ments, and the desired trade-off between accuracy and flexibility. The future of AI is
likely to involve a combination of both approaches, leveraging the strengths of each
to drive innovation and address complex challenges.

8.6 Summary

Data-centric AI in Information and Communication Technology (ICT) is a paradigm


shift that emphasizes the collection, analysis, and utilization of large volumes of
data to drive innovation, enhance decision-making, optimize operations, and deliver
personalized user experiences. In this summary, we will highlight the key points
about data-centric AI in ICT.
Data-centric AI focuses on leveraging data as a strategic asset to gain insights,
make informed decisions, and create value. It recognizes that the abundance of data in
today’s digital world holds immense potential for organizations to gain a competitive
edge and improve their products and services.
In data-centric AI, the process begins with data collection from various sources,
including sensors, devices, social media, and other digital platforms. This data is then
stored, processed, and analyzed using machine learning algorithms and techniques.
The key benefits of data-centric AI in ICT are as follows [1]:
. Improved Decision-Making: Data-centric AI enables organizations to make data-
driven decisions by extracting meaningful insights and patterns from large
datasets. It provides a holistic view of the business, customer behavior, and market
trends, enabling organizations to make informed decisions that align with their
goals and objectives [1].
. Personalization and Enhanced User Experiences: By leveraging data, organiza-
tions can personalize their products, services, and user experiences. Data-centric
AI enables the analysis of user preferences, behaviors, and feedback, allowing
organizations to deliver tailored recommendations, content, and interactions [1].
. Optimization and Efficiency: Data-centric AI helps organizations optimize their
processes, workflows, and resource allocation. By analyzing data, organizations
can identify bottlenecks, inefficiencies, and areas for improvement, leading to
enhanced productivity, cost savings, and streamlined operations [1].
. Predictive and Preventive Capabilities: Data-centric AI enables predictive
analytics and proactive decision-making. By analyzing historical data and
patterns, organizations can anticipate future events, trends, and risks. This allows
them to take preventive measures, mitigate risks, and optimize resource allocation
[1].
. Automation and Intelligent Systems: Data-centric AI enables the development
of intelligent systems that can automate tasks, processes, and decision-making.
By leveraging machine learning algorithms, organizations can build systems that
learn from data and adapt to changing conditions, leading to increased efficiency
and productivity [1].
However, data-centric AI in ICT also presents challenges that need to be addressed:
. Data Quality and Integration: Ensuring data quality, accuracy, and consistency
across different data sources can be challenging. Organizations need to invest in
data cleansing, integration, and data governance processes to ensure the reliability
of the insights derived from the data [1].
. Data Security and Privacy: Protecting sensitive data and ensuring compliance with
privacy regulations are critical challenges in data-centric AI implementations.
Organizations must implement robust security measures and establish data privacy
frameworks to safeguard data and maintain user trust [9].
. Scalability and Infrastructure: Implementing data-centric AI applications often
requires scalable infrastructure and high-performance computing resources to
handle large volumes of data and ensure efficient processing. Organizations
need to invest in the appropriate infrastructure to support data-intensive AI
initiatives [10].
. Skillset and Talent: Data-centric AI implementations require skilled data scien-
tists, analysts, and IT professionals who can collect, analyze, and interpret data
effectively. Organizations may face challenges in attracting and retaining the right
talent with expertise in data analytics and machine learning [1].
In summary, data-centric AI in ICT offers numerous benefits, including improved
decision-making, personalized user experiences, optimization, and predictive capa-
bilities. However, organizations must address challenges related to data quality, secu-
rity, scalability, and talent to fully leverage the potential of data-centric AI and drive
innovation in their respective industries [1].
Case Study: Data-Centric AI Platform for ICT Services
Introduction
As businesses continue their digital transformation journey, data has become a crit-
ical strategic asset. It provides insights into customer behavior, market trends, and
business processes. However, the sheer volume of data can make it difficult to extract
meaningful insights. To address this challenge, many organizations are turning to arti-
ficial intelligence (AI) and machine learning (ML) to automate data analysis. This
case study examines how a data-centric AI platform has helped a telecommunications
company improve its ICT services [1].
Background
The telecommunications industry is highly competitive. Providers must constantly
innovate to differentiate themselves and provide better services to customers. One
major area of focus for telecom companies is Information and Communications
Technology (ICT) services [1]. These include:
. Cloud services.
. Internet of Things (IoT) connectivity.
. Managed IT services.
. Cybersecurity.
ICT services provide businesses with the tools they need to operate efficiently
and securely in the digital age. However, telecom companies face several chal-
lenges in delivering these services. For example, customers may have different
needs and expectations depending on their industry, size, and location. Additionally,
ICT services are often complex and require specialized technical expertise. Finally,
telecom providers must keep pace with rapidly evolving technology and security
threats [1, 11].
To address these challenges, the telecom company in this case study decided to
invest in a data-centric AI platform. The goal was to improve the delivery of their
ICT services by using machine learning and other advanced analytics techniques to
extract insights from large volumes of data [1].
Implementation
The implementation of the data-centric AI platform involved several phases:
1. Data Collection and Integration
The first phase involved collecting data from various sources, such as customer
interactions, service usage, and network performance. The data was then integrated
into a central repository to create a unified view of the customer and their ICT
services. The use of application programming interfaces (APIs) and cloud-based
data warehousing solutions helped to simplify the integration process [1].
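A minimal sketch of this kind of collection and integration step is shown below,
using only Python's standard library. The API endpoint, the field names, and the use
of SQLite as a stand-in for the central repository are all assumptions made for
illustration, not details taken from the case study.

import json
import sqlite3
import urllib.request

API_URL = "https://example.com/api/service-usage"  # hypothetical endpoint

def fetch_usage_records(url: str) -> list[dict]:
    # Pull one page of usage records from a source system's REST API.
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def load_into_repository(records: list[dict], db_path: str = "unified.db") -> None:
    # Land the records in a central store to build a unified customer view.
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS usage (
                       customer_id TEXT, service TEXT, gb_used REAL)""")
    con.executemany(
        "INSERT INTO usage VALUES (:customer_id, :service, :gb_used)", records)
    con.commit()
    con.close()

# records = fetch_usage_records(API_URL)   # requires the (hypothetical) API
# load_into_repository(records)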
2. Data Analysis and Modeling [11]
The second phase involved using machine learning and other advanced analytics
techniques to analyze the data and create models. The models were used to iden-
tify patterns, predict outcomes, and provide recommendations for improving ICT
services. Specific techniques used included [1]:
. Natural Language Processing (NLP) to analyze customer feedback and sentiment.
. Predictive modeling to anticipate network performance issues.
. Time series forecasting to predict future service demand.
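As a toy illustration of the last technique, the following Python fragment applies
simple exponential smoothing to a short demand series; the smoothing factor and the
demand values are invented for the example and are not drawn from the case study data.

def exponential_smoothing(series, alpha=0.3):
    # One-step-ahead forecasts: each forecast blends the latest observation
    # with the previous forecast, weighted by the smoothing factor alpha.
    forecasts = [series[0]]
    for value in series[1:]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts

# Illustrative monthly demand for a managed service (arbitrary units).
demand = [120, 132, 128, 140, 151, 149, 160]
smoothed = exponential_smoothing(demand, alpha=0.3)
print("next-period forecast:", round(smoothed[-1], 1))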
3. Platform Deployment and Integration with ICT Services [11]
The final phase involved deploying the platform and integrating it with ICT
services. This enabled the telecom company to use the insights generated by the
data-centric AI platform to improve service delivery, increase efficiency, and reduce
costs [1]. For example:
. Insights from NLP analysis of customer feedback were used to identify areas for
improvement in customer service processes.
. Predictive modeling was used to proactively address network performance issues
before they impacted service quality.
. Time series forecasting was used to optimize resource allocation based on
anticipated service demand.
Results
The implementation of the data-centric AI platform has helped the telecom company
achieve several key outcomes [6–9]:
1. Improved Customer Satisfaction
The platform has enabled the telecom company to gain a deeper understanding of
customer needs and preferences. This has helped to improve the quality of their ICT
services and increase customer satisfaction.
2. Increased Efficiency
By using insights from the data-centric AI platform, the telecom company has
been able to streamline service delivery processes and reduce costs. For example,
predictive modeling has enabled the company to proactively address network perfor-
mance issues before they impact service quality, reducing the need for costly service
disruptions.
3. Enhanced Security
The data-centric AI platform has helped the telecom company improve its cybersecu-
rity posture. By analyzing network activity and identifying anomalies, the platform
provides early warnings of potential cyber threats. Additionally, by using machine
learning to analyze network traffic patterns, the platform can identify and respond to
potential attacks in real time.
4. Improved Competitive Position
By delivering high-quality, efficient, and secure ICT services, the telecom company
has been able to differentiate itself from competitors and gain market share.
Conclusion
The implementation of a data-centric AI platform has helped the telecom company
in this case study improve the delivery of its ICT services. By using machine learning
and other advanced analytics techniques to extract insights from large volumes of
data, the company has been able to achieve improved customer satisfaction, increased
efficiency, enhanced security, and an improved competitive position [1, 6–9].
As the telecom industry continues to evolve, the use of AI and machine learning
will likely become even more critical. Telecom companies that invest in data-centric
AI platforms today will be better positioned to succeed in the future.

References

1. Kim, H. S., Jee, H. K., & Lee, H. J. (2020). Data-centric AI platform for ICT services.
International Journal of Advanced Science and Technology, 29(7), 513–520.
2. Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2017). Data-centric systems and
applications: An overview. ACM Transactions on Internet Technology, 17(4), 1–19.
3. Furuhata, T. et al. (2020). Data-centric approach to AI/ML for advanced information systems.
In IEEE International Conference on Information Reuse and Integration (pp. 159–166).
4. Yao, L. et al. (2020). Data-centric AI systems: challenges and research directions. In IEEE
International Conference on Big Data (pp. 5563–5572).
5. Maier, A. L., et al. (2021). Towards data-centric artificial intelligence. IEEE Transactions on
Knowledge and Data Engineering, 33(3), 1235–1248.
6. Chen, M., et al. (2022). Data-centric AI systems: A survey. IEEE Transactions on Knowledge
and Data Engineering, 34(5), 937–959.
7. Zaki, M., et al. (2021). Data-centric AI systems: opportunities, challenges, and research
directions. Future Generation Computer Systems, 115, 105–119.
8. Yaghmaie, A. D., & Datta, A. (2021). Data-centric AI: concepts, challenges, and opportunities.
International Journal of Distributed Sensor Networks, 17(3), 15501477211002688.
9. Singh, D., et al. (2020). Data-centric AI approaches for internet of things: A review. IEEE
Access, 8, 137939–137958.
10. Farber, M. et al. (2019). Data-centric artificial intelligence: The next revolution in digital
transformation. Technical Report, McKinsey & Company.
11. Mahalle, P. N., Anggorojati, B., Prasad, N. R., & Prasad, R. (2013). Identity Authentication and
capability based access control (IACAC) for the internet of things. Journal of Cyber Security
and Mobility, 1, 309–348.
Chapter 9
Conclusion

9.1 Summary

AI has become an interdisciplinary field with applications in verticals across our
day-to-day routines. There has been a great deal of development in this field, and AI
as a black box is gradually becoming AI as a white box: a new category of algorithms,
known as explainable AI, is emerging in which the focus is more on the data than on
the models. Model-centric AI decides which machine learning or deep learning
algorithm is most appropriate for developing a model, given the data and the questions
to be posed on that data. However, there is a need for data-centric AI, which focuses
more on the data than on the selection of algorithms. This book presents prominent
use cases of data-centric AI together with its need and benefits [1], and it aims to
provide insight into both data-centric AI and model-centric AI and to make the case
for a paradigm shift in AI.
The main focus of data-centric AI is on quality data, in contrast to model-centric
AI, which is concerned with developing and enhancing models and algorithms to obtain
greater performance on a particular task. Data-centric AI views models as static
artifacts and places more emphasis on improving data quality, whereas model-centric
AI regards data as a fixed artifact [2]. Quality over quantity is the top priority for
data-centric AI. A data-centric strategy helps to resolve many of the difficulties that
can occur while implementing AI infrastructure, as opposed to model-centric AI, which
aims to engineer performance advantages by enlarging datasets. This book focuses on
the need for a data-centric approach and presents prominent use cases of data-centric
AI in various domains. A basic understanding of model building, training, and testing
is required for the development of any AI application; in view of this, the
implementation details and the usage of the relevant tools are also presented in this
book.

9.2 Research Areas

Data-centric AI is a rapidly evolving field that encompasses a wide range of research
areas. Some of the key research areas in data-centric AI include:
1. Data Preprocessing: This research area focuses on developing techniques for
cleaning, transforming, and augmenting data to improve the quality of data used
by AI models.
2. Data Fusion: Data fusion is the process of combining multiple data sources to
create a unified dataset that can be used to train AI models. Research in this area
focuses on developing techniques for effectively integrating data from diverse
sources.
3. Data Representation: This research area involves developing techniques for repre-
senting complex data types, such as text, images, audio, and video, in a way that
can be processed by AI models.
4. Data Labeling: Data labeling research focuses on developing techniques for accu-
rately and efficiently labeling data, which is essential for creating high-quality
training datasets for AI models.
5. Active Learning: Active learning is a machine learning technique that involves
selecting the most informative data points from a large dataset to label manually.
Research in this area focuses on developing effective strategies for selecting data
points to label (a minimal selection sketch is given after this list).
6. Transfer Learning: Transfer learning is a technique that involves using knowledge
gained from training one AI model to improve the performance of a different,
related model. Research in this area focuses on developing effective transfer
learning techniques.
7. Model Interpretability: Model interpretability research focuses on developing
techniques for understanding how AI models make predictions, which is
important for ensuring that AI systems are transparent and trustworthy.
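As one concrete illustration of active learning (item 5 above), the following Python
sketch selects the unlabeled examples about which a classifier is least confident. It
assumes a model exposing a scikit-learn-style predict_proba method, and all names are
illustrative rather than part of a specific toolkit.

import numpy as np

def select_most_informative(model, unlabeled_pool: np.ndarray, k: int = 10):
    # Score each unlabeled example by how uncertain the current model is
    # about it and return the indices of the k most uncertain examples,
    # which are then sent to human annotators for labeling.
    proba = model.predict_proba(unlabeled_pool)
    uncertainty = 1.0 - proba.max(axis=1)        # high = least confident
    return np.argsort(uncertainty)[-k:]

# Typical loop (in comments):
#   1. Train the model on the small labeled seed set.
#   2. idx = select_most_informative(model, pool, k=10)
#   3. Ask annotators to label pool[idx] and move them to the labeled set.
#   4. Retrain and repeat until the labeling budget is exhausted.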
These are just a few examples of the research areas in data-centric AI. As the
field continues to evolve, new areas of research are likely to emerge. In addition to
this, the recognized value of data and increasing data literacy are changing what it
means to be data-driven, and data-centric AI is disrupting traditional data manage-
ment. Organizations that invest in AI at scale will evolve to preserve evergreen clas-
sical data management ideas and extend them to AI, adding capabilities necessary
for convenient AI development by an AI-focused audience. Additionally, industry
analysts predict that by 2025 context-driven analytics and AI models will replace 60%
of traditional data management and response approaches, which indicates that
data-centric AI will continue to play an increasingly important role in the future.
References

1. Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., Jiang, Z., Zhong, S., & Hu, X. (2023). Data-centric
artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.
2. Whang, S. E., Roh, Y., Song, H., Lee, J. G. (2021). Data collection and quality challenges in
deep learning: A data-centric AI perspective. arXiv preprint arXiv:2112.06409.
