Data Centric Artificial Intelligence: A Beginner's Guide
Parikshit N. Mahalle
Gitanjali R. Shinde
Yashwant S. Ingle
Namrata N. Wasatkar
Data-Intensive Research
Series Editors
Nilanjan Dey, Techno International New Town, Kolkata, West Bengal, India
Bijaya Ketan Panigrahi, Indian Institute of Technology Delhi, New Delhi, India
Vincenzo Piuri, University of Milan, Milano, Italy
This book series provides a comprehensive and up-to-date collection of research
and experimental works, summarizing state-of-the-art developments in the fields
of data science and engineering. The trends, technologies, and state-of-the-art
research related to data collection, storage, representation, visualization, processing,
interpretation, analysis, and management related concepts, taxonomy, techniques,
designs, approaches, systems, algorithms, tools, engines, applications, best prac-
tices, bottlenecks, perspectives, policies, properties, practicalities, quality control,
usage, validation, workflows, assessment, evaluation, metrics, and many more are to
be covered.
The series will publish monographs, edited volumes, textbooks and proceedings
of important conferences, symposia and meetings in the field of autonomic and
data-driven computing.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
The key objectives of this book include presenting the need for data-centric AI as
compared to the model-centric approach, the transformation from model-centric to
data-centric AI, methodologies to achieve accurate results by improving the quality of
data, challenges in improving the quality of data-centric models, challenges in dataset
generation, synthetic datasets, and analysis and prediction algorithms in stochastic
settings.
The main characteristics of this book are:
1. Helps the reader understand the transformation from model-centric to data-centric AI.
2. A concise and summarized description of all the topics.
3. Use-case and scenario-based descriptions.
4. A concise and crisp guide that takes the novice reader from an introduction to building
basic data-centric AI applications.
5. Numerous examples, technical descriptions, and real-world scenarios.
6. Simple and easy language, so that the book is useful to a wide range of stakeholders,
from lay readers to expert users, from villages to metros, and from national to global levels.
In a nutshell, this book puts forward research roadmaps, strategies, and challenges
for designing and developing data-centric AI applications. The book motivates readers,
from lay users to experts, to use the technology for better analysis and to design various
use cases in data-centric AI. The book also contributes to social responsibility by exploring
different ways to cater to the requirements of government and manufacturing communities.
The book is useful for undergraduates, postgraduates, industry, researchers, and research
scholars in ICT, AI, and machine learning, and we are sure that it will be well received
by all stakeholders.
Contents

1 Introduction
1.1 Building Blocks of AI
1.2 AI Current State
1.3 Motivation
1.4 Need for Paradigm Shift from Model-Centric AI to Data-Centric AI
1.5 Summary
References
2 Model-Centric AI
2.1 Working Principle
2.1.1 Supervised Learning
2.1.2 Unsupervised Learning
2.1.3 Reinforcement Learning
2.2 Learning Methods
2.2.1 Supervised Machine Learning Algorithms
2.2.2 Unsupervised Machine Learning Algorithms
2.2.3 Deep Learning Algorithms
2.3 Model Building
2.4 Model Training
2.5 Model Testing
2.6 Model Tuning
2.7 Use Cases: Model-Centric AI
2.8 Summary
References
3 Data-Centric Principles for AI Engineering
3.1 Overview
3.2 AI Engineering
3.3 Challenges
6 Data-Centric AI in Healthcare
6.1 Overview
6.2 Need and Challenges of Data-Centric Approach
6.3 Application Implementation in Data-Centric Approach
6.4 Application Implementation in Model-Centric Approach
6.5 Comparison of Model-Centric AI and Data-Centric AI
6.6 Summary
References
7 Data-Centric AI in Mechanical Engineering
7.1 Overview
7.2 Need and Challenges of Data-Centric Approach
7.3 Application Implementation in Data-Centric Approach
7.4 Application Implementation in Model-Centric Approach
7.5 Comparison of Model-Centric AI and Data-Centric AI
7.6 Case Study: Mechanical Tools Classification
7.7 Summary
References
8 Data-Centric AI in Information, Communication and Technology
8.1 Overview
8.2 Need and Challenges of Data-Centric Approach
8.3 Application Implementation in Data-Centric Approach
8.4 Application Implementation in Model-Centric Approach
8.5 Comparison of Model-Centric AI and Data-Centric AI
8.6 Summary
References
9 Conclusion
9.1 Summary
9.2 Research Areas
References
About the Authors
Dr. Parikshit N. Mahalle is a Senior Member of IEEE and is Professor, Dean of Research
and Development, and Head of the Department of Artificial Intelligence and Data Science
at Vishwakarma Institute of Information Technology, Pune, India. He completed his
Ph.D. at Aalborg University, Denmark, and continued as a Postdoc Researcher at
CMI, Copenhagen, Denmark. He has 23+ years of teaching and research experience.
He is an ex-member of the Board of Studies in Computer Engineering and ex-chairman
of the Board of Studies in Information Technology at Savitribai Phule Pune University and
at various universities and autonomous colleges across India. He has 15 patents and 200+
research publications (Google Scholar: 2900+ citations, h-index 25; Scopus: 1400+ citations,
h-index 18; Web of Science: 438 citations, h-index 10) and has authored/edited 54 books with
Springer, CRC Press, Cambridge University Press, and others. He is Editor-in-Chief of the
IGI Global International Journal of Rough Sets and Data Analysis and of the Inderscience
International Journal of Grid and Utility Computing, a member of the Editorial Review Board
of the IGI Global International Journal of Ambient Computing and Intelligence, and a reviewer
for various journals and conferences of repute. His research interests are machine learning,
data science, algorithms, the Internet of Things, identity management, and security. He is
guiding eight Ph.D. students in the area of IoT and machine learning, and six students
have successfully defended their Ph.D. under his supervision at SPPU. He is also the
recipient of the "Best Faculty Award" from Sinhgad Institutes and Cognizant Technology
Solutions. He has delivered 200+ lectures at the national and international levels.
Dr. Gitanjali R. Shinde has over 15 years of experience and is presently working as
Head and Associate Professor in the Department of Computer Science and Engineering
(AI and ML), Vishwakarma Institute of Information Technology, Pune, India. She
completed her Ph.D. in Wireless Communication at CMI, Aalborg University, Copenhagen,
Denmark, on the research problem statement "Cluster Framework for Internet of People,
Things and Services"; the Ph.D. was awarded on May 8, 2018. She obtained her M.E.
(Computer Engineering) degree from the University of Pune, Pune, in 2012 and her B.E.
(Computer Engineering) degree from the University of Pune, Pune, in 2006.
She has received research funding for the project "Lightweight group authentication
for IoT" from SPPU, Pune. She has presented a research article at the World Wireless
Research Forum (WWRF) meeting, Beijing, China. She has published 50+ papers
in national and international conferences and journals. She is the author of 10+ books
with Springer and CRC Press (Taylor & Francis Group), and she is also an editor
of books. Her book "Data Analytics for Pandemics: A COVID-19 Case Study" was
awarded outstanding book of the year 2020.
Dr. Namrata N. Wasatkar has over 10 years of experience and is presently working as
Assistant Professor in the Department of Computer Engineering, Vishwakarma Institute
of Information Technology, Pune, India. She completed her Ph.D. in Computer Engineering
at Savitribai Phule Pune University, Pune, India, on the research problem statement
"Rule based Machine translation of simple Marathi sentences to English sentences"; the
Ph.D. was awarded on November 17, 2022. She obtained her M.E. (Computer Engineering)
degree from the University of Pune, Pune, in 2014 and her B.E. (Computer Engineering)
degree from the University of Pune, Pune, in 2012. She has received research funding for
the project "SPPU online chatbot" from SPPU, Pune. She has published 10+ papers in
national and international conferences and journals.
Chapter 1
Introduction
. Computer Vision: Computer vision enables machines to derive meaningful
information from visual inputs. Computer vision tasks include image segmentation,
image classification, object detection, and facial recognition. These tasks
are achieved through techniques like image feature extraction, pattern recogni-
tion, and deep learning algorithms, which mimic the structure and function of the
human visual system [3].
. Robotics: Robotics is a field that combines AI, mechanical engineering, and elec-
tronics to create physical machines (robots) capable of interacting with the phys-
ical world. AI plays a vital role in robotics by providing intelligent capabilities
to robots, allowing them to perceive, reason, and make decisions autonomously.
AI-powered robots can perform complex tasks such as autonomous navigation,
object manipulation, and human–robot interaction. These robots utilize sensor
data, machine learning algorithms, and control systems to sense and understand
the environment, plan actions, and execute them with precision [4].
. Data: Data is a fundamental building block of AI. The performance and effec-
tiveness of AI systems heavily rely on the quality, quantity, and diversity of the
data they are trained on. Large volumes of labeled data are often required to
train machine learning models effectively. Data collection methods, data prepro-
cessing, and data augmentation techniques have a crucial role to play in preparing
and curating the data for AI applications. Additionally, the ethical consider-
ations surrounding data privacy and bias are essential to ensure fairness and
accountability in AI systems [5].
. Algorithms and Models: Algorithms and models are the mathematical and compu-
tational frameworks that underpin AI systems. They define the behavior and func-
tionality of AI systems. Machine learning algorithms, like deep neural networks,
support vector machines, and decision trees, form the backbone of many AI appli-
cations. These algorithms learn from data to make predictions, classify inputs, or
optimize decisions. Models represent the learned knowledge of the algorithms and
can be used for inference and prediction. Model architectures, hyperparameter
tuning, and optimization techniques are crucial in building accurate and efficient
AI systems [5].
. Computing Power: The computational power of modern hardware infrastruc-
ture, such as Graphics Processing Units (GPUs) and specialized AI accelera-
tors, has played a significant role in the advancement of AI. These hardware
components enable the efficient training and execution of complex AI models.
High-performance computing clusters and cloud-based infrastructures provide
the computational resources required for large-scale AI projects. Furthermore,
advancements in distributed computing and parallel processing have accelerated
the development and deployment of AI systems [6].
. Ethical Considerations: As AI continues to advance, ethical considerations
become increasingly important. Ensuring that AI systems are fair, transparent,
and unbiased is crucial to avoid perpetuating existing social biases or discrimi-
nations. Ethical AI frameworks promote accountability, privacy protection, and
the responsible use of AI technologies. The development and deployment of AI
systems should consider the potential impact on society, human rights, and indi-
vidual privacy. Collaboration between policymakers, AI researchers, and industry
stakeholders is necessary to establish ethical guidelines and regulations for the
responsible development and use of AI [6].
The building blocks of artificial intelligence encompass machine learning, NLP,
computer vision, robotics, data, algorithms and models, computing power, and ethical
considerations. These components form the foundation of AI systems and enable the
development of intelligent applications that can perceive, reason, and make decisions
in various domains. By understanding these building blocks, we can appreciate the
complexity and potential of AI and its impact on society.
1.3 Motivation
The motivation for data-centric AI stems from the recognition that data is a crucial
and valuable asset in building effective and accurate AI systems. Data-centric AI
approaches prioritize the collection, curation, and utilization of high-quality data to
drive AI model development and decision-making. Here are some key motivations
for adopting a data-centric approach in AI:
. Performance Improvement: Data-centric AI recognizes that the performance of
AI models heavily depends upon the quality and quantity of the data they are
trained on. By focusing on collecting diverse and representative data, AI models
can better generalize to real-world scenarios, leading to improved accuracy and
reliability in their predictions and decisions [9].
. Addressing Bias and Fairness: Biases present in data can be reflected in AI models,
potentially leading to discriminatory or unfair outcomes. A data-centric approach
allows for the identification and mitigation of biases by thoroughly analyzing the
data, ensuring proper representation and inclusivity across various demographics.
By prioritizing unbiased and fair data, AI systems have a higher chance of making
equitable decisions [9].
. Robustness and Generalization: AI models trained on limited or biased data
may exhibit poor performance in unseen or unfamiliar situations. By adopting
a data-centric approach, AI developers can collect and incorporate a wide range
of relevant data, enabling models to better understand and adapt to different
contexts. This leads to increased robustness, generalization, and the ability to
handle variations and edge cases [9].
. Domain-Specific Expertise: Data-centric AI recognizes the value of domain-
specific knowledge and expertise embedded in data. By leveraging domain knowl-
edge, including expert annotations and labels, AI models can capture intri-
cate patterns and relationships specific to the target application. This exper-
tise enhances the accuracy and effectiveness of AI systems in specialized
domains [9].
. Real-Time Adaptation: Data-centric AI systems are designed to continuously
learn and adapt in real time. By collecting and analyzing new data as it becomes
available, AI models can update and refine their understanding, allowing them to
stay up-to-date with evolving environments and user needs. This adaptability is
particularly important in dynamic and rapidly changing domains [9].
. Enhanced User Experience: A data-centric approach in AI aims to improve
the user experience by leveraging data-driven insights. By understanding
user preferences, behavior, and context, AI systems can provide personalized
and tailored recommendations, services, and interactions. This leads to more
engaging and satisfying user experiences, ultimately increasing user adoption and
satisfaction [9].
. Decision Support and Insights: Data-centric AI empowers decision-making by
providing actionable insights and recommendations based on data analysis. By
leveraging large datasets, AI systems can identify patterns, correlations, and
trends that may not be apparent to humans. These insights can inform
strategic business decisions, optimize processes, and unlock new opportunities
for innovation and growth [9].
Hence, the motivation for data-centric AI is driven by the recognition that high-
quality and diverse data is essential for developing accurate, unbiased, and robust
AI systems. By prioritizing data collection, curation, and utilization, data-centric
approaches aim to improve performance, address bias and fairness concerns, enhance
user experiences, and provide valuable decision support and insights in various
domains.
1.5 Summary
References
1. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://
doi.org/10.1038/nature14539
2. Caruana, R. (2006). Empirical methods for AI and machine learning. Carnegie Mellon
University, School of Computer Science, Technical Report CMU-CS-06-181
3. Zou, J., & Schiebinger, L. (2018). AI can be biased: Here’s what we should do about it. Science,
366(6464), 421–422. https://doi.org/10.1126/science.aaz1107
4. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., et al. (2019).
Model cards for model reporting. In Proceedings of the conference on fairness, accountability,
and transparency (pp. 220–229). https://doi.org/10.1145/3287560.3287596
5. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., &
Crawford, K. (2020). Datasheets for datasets. arXiv preprint arXiv:1803.09010
6. Wu, L., Wu, H., Wu, J., & Wei, Y. (2020). Towards data-centric AI. In Proceedings of the IEEE/
CVF conference on computer vision and pattern recognition (pp. 4388–4397). https://doi.org/
10.1109/CVPR42600.2020.00445
7. Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications,
19(2), 171–209. https://doi.org/10.1007/s11036-013-0489-0
8. Li, Y., Li, J., Duan, Y., Zhang, S., Huang, T., & Gao, W. (2020). A survey on deep learning for
big data. Information Fusion, 57, 42–56. https://doi.org/10.1016/j.inffus.2019.10.003
9. Chandre, P., Mahalle, P., & Shinde, G. (2022). Intrusion prevention system using convolutional
neural network for wireless sensor network. IAES International Journal of Artificial Intelligence,
11(2), 504–515.
Chapter 2
Model-Centric AI
Machine learning is a way for computers to learn to perform tasks without explicit
programming. It is similar to teaching a child something new: we show the child how
to do it, ask them to repeat it, and, depending on how they perform the task, give them
suggestions.
In machine learning, computers are provided with massive amounts of data and taught
to detect patterns. Once patterns are detected, estimations are made on the data. In other
words, if you want to teach a computer how to recognize images of dogs, you need to
show it a large number of images of dogs labeled as "dogs." By doing so, the computer
grasps the fundamental characteristics of dog images, including fur, ears, and eyes. Once
these features are extracted, they can be utilized to recognize new images of dogs that
have never been seen before.
This concept can be implemented in many existing domains, like stock markets
(to make stock predictions), hospitals (to assist in disease diagnosis), weather fore-
casting, etc. If we provide sufficient amounts of data to computers, then computers
will be able to make good predictions. In some cases, computers can even surpass
humans. Nonetheless, the quality of predictions depends on two things: first, the
quality of the data used in training, and second, the algorithm used for the analysis.
Machine learning does not need manual programming; computers can learn and
perform better through experience [1, 2].
In traditional computer programming, developers write precise instructions for
computers to follow. In machine learning, however, computers use algorithms to
analyze data and acquire information, so machine learning systems can identify
various patterns in the data and make predictions.
The machine learning techniques can be classified into three categories as follows:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised machine learning models are trained on labeled data. Unsupervised
learning involves training on unlabeled data and requires the algorithm to identify
the underlying patterns in the data. In reinforcement learning, training relies on
feedback from the environment.
Machine learning applications are deployed in various fields, such as healthcare,
weather prediction, finance, transport, retail, etc. Machine learning has the ability to
automate processes and enhance accuracy [3, 4].
Supervised learning involves teaching computers using labeled data, where the
correct answer is already known. The computer is given input data and corresponding
output or target variables [5, 6]. The aim is to design a model that can predict the
correct output for new input data.
The process of supervised machine learning has two important phases:
1. Training
2. Testing
In the training process, the computer utilizes labeled data to understand the corre-
lations and patterns between input data and output variables. The objective is to
develop a model that can provide precise predictions for fresh input data.
A different labeled dataset that wasn’t used during training is used to test the model
after it has been trained. This aids in assessing how well the model generalizes to
new and untested data. The ultimate aim is to create a model that can make precise
predictions not only on the data used for training but also on new and unfamiliar
data.
Applications of supervised machine learning can be seen in speech recognition,
image classification, fraud detection, etc. Supervised learning is most helpful when
large amounts of labeled data are available, although it can also work with small
datasets. The quality and quantity of data are the biggest deciding factors in the
performance of supervised machine learning models. Along with data quality and
quantity, the choice of algorithm and hyperparameter tuning are also important.
Reinforcement learning can be used to train agents to play games, control bots,
assist in financial decision-making, etc. It is an effective tool for forming intelligent
systems that could examine and adapt to new situations.
. Linear regression
Linear regression models the relationship between one or more input variables and a
continuous output variable. For a single input variable x, the model is

y = mx + b,

where m is the slope and b is the y-intercept. With multiple input variables x1, x2, …, xn,
the model generalizes to

y = m1x1 + m2x2 + … + mnxn + b,

where b is the y-intercept and m1, m2, …, mn are the coefficients of the input variables
x1, x2, …, xn.
Linear regression is a commonly used technique to estimate values, including
house prices, stock prices, and sales by analyzing past data. It can also help in
feature selection by checking the significance of each input variable through regres-
sion coefficients. However, it’s important to note that linear regression has certain
assumptions about the data, such as linearity, homoscedasticity, and normality of
residuals. If these assumptions are violated, it can adversely affect the accuracy of
the model, and therefore, it’s essential to consider them while interpreting the results.
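To make this concrete, here is a minimal sketch of fitting a linear regression model, assuming scikit-learn is available (the library, the feature columns, and the price values are illustrative choices, not taken from this book):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: area (sq. ft) and number of bedrooms as inputs, price as the output
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 4]])
y = np.array([200000, 270000, 330000, 410000, 480000])

model = LinearRegression()
model.fit(X, y)                    # learns the coefficients m1, m2 and the intercept b

print(model.coef_)                 # [m1, m2]
print(model.intercept_)            # b
print(model.predict([[1800, 3]]))  # estimated price for a new house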
. Logistic regression
The outcome of logistic regression is a binary dependent variable, meaning it can take
one of two values, such as yes/no or 1/0. Logistic regression can have one or more
independent variables.
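As a similarly hedged illustration (again assuming scikit-learn; the synthetic data stands in for any binary outcome such as yes/no), a logistic regression classifier can be fit and used to produce class probabilities:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data: 100 samples, 4 features, labels 0 or 1
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict(X[:5]))        # predicted classes (0 or 1)
print(clf.predict_proba(X[:5]))  # predicted probability of each class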
. Support Vector Machine (SVM)
The main aim of SVM is to find the hyperplane that separates the classes with clear
margins. The margin is measured as the distance between the hyperplane and the closest
points of both classes. SVM attempts to give a clear classification with maximized
margins [14].
SVM kernels can be helpful when the underlying data is not linearly separable.
SVM kernels transform low-dimensional data into high-dimensional data so that the data
can be easily separated. Frequently used SVM kernels are as follows:
1. Linear kernel
2. Polynomial kernel
3. Radial basis function (RBF) kernel.
SVM can handle binary as well as multiclass classification tasks. For multiclass
classification, either a one-versus-one or a one-versus-all approach is used. In the
one-versus-one approach, binary classifiers are trained for each pair of classes and
their results are combined. In the one-versus-all approach, a binary classifier is trained
for each class, and the class with the greatest score is selected as the final prediction.
SVMs offer advantages over conventional classification algorithms in a
number of ways, including the handling of high-dimensional data and robustness to
outliers. They also yield a unique solution, since the underlying optimization problem
is convex. However, choosing the right kernel function and its parameters is important,
as SVMs can be sensitive to this selection. Additionally, SVMs can be computationally
expensive on large datasets.
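The following sketch, assuming scikit-learn, trains an SVM classifier with an RBF kernel; the kernel choice and the regularization parameter C are exactly the settings the text warns the model is sensitive to. Note that scikit-learn's SVC handles multiclass problems internally with the one-versus-one scheme described above.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The RBF kernel handles data that is not linearly separable; C controls margin softness
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))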
. Decision Tree
Decision tree is a machine learning algorithm that can be used to perform tasks such as
classification and regression. As the name suggests, a decision tree creates a tree-like
structure in order to make predictions [15].
Formation of a decision tree starts with a single root node that represents the entire
dataset. The next step is the selection of the feature used to split the dataset. The
splitting criterion can be based on measures such as entropy or the Gini index. The
feature with the highest information gain (or lowest Gini index) is used to split the
dataset into two or more subsets. The process is repeated until the stopping criterion
is met.
The developed tree is then used for making predictions. Predictions are made by
traversing the tree from the root node to a leaf node: every inner node makes a decision
based on a feature value, and each leaf node gives a class prediction or a value.
Decision trees can be viewed as flow charts, and this flow-chart-like representation
makes the results easy to understand.
Decision trees have the following advantages:
1. It can handle categorical and continuous data
2. Easy interpretability
3. It can handle nonlinear relationship between target and features.
Decision trees can overfit if the underlying tree is very large or uses many features.
There are ways to avoid overfitting, such as tree pruning and limiting the maximum
tree depth. We can also use random forests or gradient boosting to reduce
overfitting.
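A hedged illustration with scikit-learn: the sketch below fits a decision tree using entropy (information gain) as the splitting criterion and limits its depth, one of the pruning-style controls mentioned above for avoiding overfitting.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# criterion="entropy" uses information gain; max_depth limits tree size to reduce overfitting
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # flow-chart-like textual view of the learned tree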
. K-Nearest Neighbors (KNN)
KNN can be implemented for regression and classification tasks. This algorithm
is simple but very effective in the field of machine learning.
KNN classifies a data point by finding the k points nearest to it according to a
distance metric such as Euclidean distance. New class predictions are made by a
majority vote of the k nearest neighbors; for regression, the mean of the neighbors'
values is used [16].
The selection of the k value is a crucial part of the KNN algorithm because this
value has a major effect on the bias-variance trade-off. A small value of k results in
low bias and high variance, while a large value of k results in higher bias and lower
variance.
KNN has the following advantages:
1. KNN is simple and flexible in nature.
2. It can handle a nonlinear relationship between the target and the features.
KNN is sensitive to the choice of distance metric and to feature scaling, and its
performance can suffer when the underlying dataset is very large.
KNN can be beneficial when the boundary between classes is nonlinear and complex
in nature. KNN can be implemented in recommender systems, anomaly detection, etc.
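A small sketch, assuming scikit-learn, showing the role of the k value and of feature scaling discussed above (the dataset and the tried k values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling matters because KNN relies on distances; different k values trade bias against variance
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    print("k =", k, "test accuracy =", knn.score(X_test, y_test))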
. Random Forest
Random forest is one of the most in-demand ensemble learning algorithms, used to
perform classification as well as regression. In a random forest, multiple decision trees
are formed, and the results from all the trees are combined to deliver a prediction.
This is done to achieve better accuracy and reduce overfitting of the underlying
data [17].
In a random forest, each tree is formed from a random subset of the data and features
selected from the underlying dataset. The random selection aims to achieve diversity
and reduce correlation among the trees. Every tree produces a prediction, and the
predictions are combined using an aggregation method: for classification a voting
method is used, and for regression the mean is calculated. As the name suggests, the
random forest offers more benefits than a single decision tree, as follows:
1. Capable of handling high-dimensional data.
2. Capable of dealing with outliers and noisy data.
3. The ability to detect the nonlinear relationships between the features and target.
4. Less susceptible to overfitting because of the nature of the algorithm.
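As an illustrative sketch (assuming scikit-learn), a random forest builds many trees on random subsets of samples and features and aggregates their predictions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators = number of trees; max_features controls the random feature subset per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances (first five):", forest.feature_importances_[:5])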
. K-means
K-means is the most commonly used and favored clustering algorithm in the field of
unsupervised machine learning. Based on a similarity measure, the data is divided into
k clusters. It is also known as a partition-based clustering technique [18].
In K-means, the number of clusters k is given by the user, and the algorithm finds an
optimized way to divide the data into k clusters. The first step of the K-means algorithm
is to randomly select k points from the given data; these points serve as the initial
cluster centroids. The next step is to allocate each data point to the nearest centroid on
the basis of a distance metric, e.g., Euclidean distance. After the assignment, each
centroid is updated to the average of the data points allocated to its cluster. The
allocation of data points and the update of centroids are repeated until the centroids no
longer change or a terminating condition is met.
The K-means algorithm aims to minimize the sum of squared distances between data
points and their centroids, also called the inertia of the clustering. This objective helps
K-means form well-separated and compact clusters, and it can also be used to assess
the overall quality of the clustering.
The advantages of K-means are its efficiency, scalability, and simplicity. However, it
can be very sensitive to the initially chosen centroids, and it is often difficult to find
the optimal k value. The distance metric used and feature scaling also influence the
quality of the clusters.
K-means is frequently used in applications like anomaly detection, image segmen-
tation, customer segmentation, etc. The algorithm can be very helpful when we do
not know what kind of data we are dealing with. The aim of the algorithm is to find
natural groups and detect hidden patterns in the data.
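A minimal sketch of the loop described above, assuming scikit-learn; the user supplies the number of clusters k, and the reported inertia is the sum of squared distances the algorithm minimizes. The blob data is synthetic and purely illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster centroids:", kmeans.cluster_centers_)
print("inertia (sum of squared distances):", kmeans.inertia_)
print("labels of the first 10 points:", labels[:10])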
. K-medoids
K-medoids is a clustering technique that shares similarities with K-means algorithm.
However, unlike K-means, K-medoids uses a medoid as a centroid instead of the
mean of data points in each cluster. A medoid is identified as a data point that is
nearest to the center of the cluster, which may not necessarily be the mean of data
points in the cluster.
K-medoids is a commonly used clustering algorithm when the data does not follow
a normal distribution or when the mean is not an accurate representation of the cluster
center. Compared to K-means, K-medoids is less affected by outliers because a single
outlier is less likely to move the medoid as compared to the mean [19].
The K-medoids algorithm is a process that begins by randomly selecting K data
points to serve as the initial medoids. After that, each data point is assigned to the
closest medoid based on a distance metric, like Euclidean distance. Following that,
the medoids are updated to be the data point with the smallest average distance to
all other data points in the cluster. This process of assigning points and updating
medoids is repeated until the medoids remain unchanged or a stopping criterion is
reached.
K-medoids requires more computational power than K-means because identifying the
medoid of each cluster involves calculating pairwise distances between the data points
in the cluster. However, it has the advantage of being more resistant to noise and
outliers, and it can produce better clustering results for clusters that are not convex or
have irregular shapes.
K-medoids is frequently used in applications such as image segmentation,
expression analysis, etc. It is especially useful when the data contains a lot of noise
or outliers or when the source of the data is unknown.
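scikit-learn itself does not ship a K-medoids estimator, so the following is a small NumPy sketch of the assign-and-update loop described above; it is an illustration under those assumptions, not a production implementation.

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=k, replace=False)]  # random initial medoids
    for _ in range(n_iter):
        # Assign each point to the nearest medoid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # The medoid is the member with the smallest average distance to the others
            pair_d = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
            new_medoids[j] = members[pair_d.mean(axis=1).argmin()]
        if np.allclose(new_medoids, medoids):  # stop when the medoids no longer move
            break
        medoids = new_medoids
    return medoids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
medoids, labels = k_medoids(X, k=2)
print(medoids)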
The inspiration for neural networks is derived from the structure and function of
the human brain. A neural network includes multiple layers of interconnected nodes,
also known as neurons. These neurons handle incoming information by applying
mathematical functions to it.
The critical part of the neural network is the neuron. A neuron gathers input from
other neurons or from external sources, applies a mathematical function to the gathered
input, and produces an output. The output is then transferred to other neurons, or it can
be used as the final outcome of the network. Each neuron includes a set of weights that
are used to scale the input signals. The weights are tuned during the training phase so
as to achieve optimized performance of the network.
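As a small hedged illustration, assuming TensorFlow/Keras is available, the sketch below builds a tiny feedforward network of densely connected neurons whose weights are tuned during training; the layer sizes and the toy data are arbitrary.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy binary classification data: 8 input features, labels 0/1
X = np.random.rand(200, 8)
y = (X.sum(axis=1) > 4).astype(int)

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),    # hidden layer of 16 neurons
    layers.Dense(1, activation="sigmoid"),  # output neuron
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)  # weights are adjusted during training
print(model.predict(X[:3]))  # outputs between 0 and 1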
. Convolutional neural networks (CNNs)
Convolutional neural networks (CNNs) are widely utilized in image and video processing
applications, as they can automatically learn features and patterns in the input data that
are important for the given task.
The main difference between CNNs and feedforward neural networks is the presence of
convolutional layers, which are designed to extract spatial features from the underlying
data. A convolutional layer is a group of filters that move over the input data
and execute a convolution operation. The convolution operation involves computing
a dot product between the weights of the filter and a small section of the input data.
This process is repeated across the entire input, creating a collection of feature maps that
emphasize various components of the input data.
After the convolutional layers, the feature maps are usually passed through one or
more fully connected layers to accomplish the ultimate classification or regression
task. Moreover, CNNs usually employ pooling layers to compress the size of the
feature maps and enhance the network’s computational efficiency.
Some popular types of convolutional neural networks include:
. LeNet-5: One of the simplest and earliest CNN architectures, designed to perform
handwritten digit recognition.
. AlexNet: A deep CNN architecture that won the ImageNet Large Scale Visual
Recognition Challenge in 2012, a hallmark for image classification tasks.
. VGG: A deep CNN architecture that achieved high accuracy on the ImageNet
challenge in 2014 with the help of a simple and uniform architecture.
. ResNet: A deep CNN architecture that introduced the concept of residual
connections, which made it possible to optimize much deeper networks.
CNNs perform exceptionally well on image and video processing tasks, including
object detection, semantic segmentation, and action recognition. The main requirement
for a CNN is a large amount of labeled data; along with that, careful tuning of
hyperparameters is necessary for good performance. CNNs can be trained effectively
with the help of GPUs and distributed computing frameworks.
Convolutional neural networks (CNNs) consist of a number of operations that
convert input data into feature maps that represent different aspects of the input. The
key operations involved in CNNs are the convolution operation, the pooling operation,
and a set of fully connected layers.
Here is a brief overview of how CNNs work:
. Convolutional Layers: The convolution layer has a number of filters which are
used to slide over the image (input data given) and do the convolution operation.
In this operation, calculation of a dot product between weights of filters and
input data is done. The described operation is performed repetitively over the
input, which gives multiple feature maps. These feature maps represent different
features of the input data.
. Activation Function: Once the convolution operation is performed, the activation
function is applied element-wise to the output of the convolution layer. The activation
function introduces nonlinearity into the network, which enables the network to
capture complex features and relationships in the given data.
. Pooling Layers: A pooling layer helps the network become more efficient by
reducing the size of the feature maps. The most commonly used pooling operation is
max pooling, in which the maximum value within each region of the feature map is
extracted and provided as the output.
. Fully Connected Layers: The series of convolution and pooling operations are
performed to get the feature map. The feature map is then flattened to create
a vector and provided as an input to one or more fully connected layers. The
final task of these layers is to accomplish classification or regression. The fully
connected network can be viewed as a traditional feedforward neural network.
. Loss Function and Optimization: The output generated by the fully connected
layer is compared to the label of the input data. A loss function is used to calculate
the difference between the actual label provided with the input data and the predicted
label. After the difference is calculated, the weights in the network are updated with
the help of an optimization algorithm such as stochastic gradient descent, which is
used to minimize the loss.
In the training process of a CNN, the weights and biases are tuned to achieve maximum
accuracy on labeled data. During training, the network gradually learns to capture features
and patterns of the input data that help in tasks such as object recognition in images.
When the training of the network is done, the trained network is used to predict labels
for unlabeled data. The predictions are assessed with the help of a separate test set.
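The operations listed above can be arranged into a small Keras model, as in the following sketch; the layer sizes and input shape are arbitrary illustrative choices, not a specific architecture from this book.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                # e.g., grayscale 28x28 images
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + activation
    layers.MaxPooling2D((2, 2)),                   # pooling shrinks the feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # flatten the feature maps to a vector
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # class scores
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # loss function
              metrics=["accuracy"])
model.summary()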
. Recurrent neural networks (RNNs)
RNN layers have been created to process sequential data like text or speech. RNNs
use recurrent layers to hold memory of earlier input in order to model temporal
dependencies.
Recurrent neural networks (RNNs) are a specialized type of artificial neural
network that can effectively handle sequential data, ranging from time series and
text to speech. Unlike feedforward neural networks, which can only process a fixed-
size input and output, RNNs can handle input sequences of varying lengths and
produce output sequences of any length.
Recurrent neural networks (RNNs) are characterized by their ability to use recur-
rent connections to maintain an internal state that captures the input sequence’s
context. This internal state is updated at each time step by combining the current
input with the previous state and passing the result through an activation function
[23, 24].
The long short-term memory (LSTM) network is the most widely used type of
RNN. It utilizes specialized gating mechanisms to regulate the flow of information in
and out of the internal state. LSTMs are ideal for managing long-term dependencies
in sequential data since they can remember or forget information based on the context.
Here is a brief overview of how RNNs work:
. Recurrent Connections: At each time step, the current input is combined with the
previous state, and the result is passed through an activation function to update the
current state.
. Internal State: The internal state of the RNN captures the context of the input
sequence up to the current time step.
. Gating Mechanisms: LSTMs incorporate specific gating mechanisms that regulate
the flow of information to and from the internal state. This enables the network
to accurately retain or discard relevant data based on the given context.
. Output: The RNN generates its output at each time step through a fully connected
layer that takes the current state as input. The resulting vector has the same
dimensionality as the desired output.
. Loss Function and Optimization: At the last step of the process, the result is
matched with the actual label of the given input, and a loss function is applied
to determine how much difference exists between the predicted output and the
real label. After that, the network’s weights are adjusted using an optimization
algorithm like stochastic gradient descent to decrease the loss function.
When training a recurrent neural network (RNN), the network’s weights and biases
are adjusted to improve its accuracy on a training set of sequential data. Through
this process, the network learns to understand the context of the input sequence and
generate the desired output. Once the network is trained, it can be utilized to make
predictions on new and unlabeled sequential data by inputting it into the network and
obtaining a predicted output. The predictions of the network can be evaluated using
a separate test set of sequential data.
. Long Short-Term Memory (LSTM)
LSTM is a type of RNN or recurrent neural network that is intended to manage
long-term dependencies in sequential data. The primary advantage of LSTMs is
the implementation of specialized gating mechanisms that enable the network to
selectively retain or discard information based on the context [25].
In a standard LSTM, the internal state comprises two segments:
1. The “cell state”
2. The “hidden state.”
The cell state preserves the network’s long-term memory, while the hidden state
embodies the LSTM’s output at each time step. Sigmoid and tanh activation functions
are utilized to regulate the information flow in and out of the cell state through gating
mechanisms.
Here is a more detailed overview of how LSTMs work:
. Input and Previous Hidden State: At each time step, the LSTM takes an input vector
and the previous hidden state as inputs.
. Gates: The combination of the input and the previous state passes through the gating
mechanisms, namely the input gate, the forget gate, and the output gate.
. Input Gate: The input gate decides which information from the input vector will
become part of the cell state.
. Forget Gate: The forget gate decides which information from the earlier state should
be remembered and which should be forgotten.
. Update Cell State: The outputs of the input gate and the forget gate are combined to
update the cell state, which then passes through the tanh activation function.
. Output Gate: The output gate decides which part of the updated cell state will be
part of the output.
. Hidden State and Output: The updated cell state passes through a tanh activation
function to form the hidden state, which is also the output of the LSTM at that time
step.
. Loss Function and Optimization: During the final time step, the predicted output
is compared to the actual label of the input. To quantify the difference between
the two, a loss function is applied. The weights of the network are then adjusted
using an optimization algorithm such as stochastic gradient descent to minimize
the loss function.
To train an LSTM, the weights and biases of the network are adjusted for better
accuracy on a training set of sequential data. The network learns to selectively
remember or forget information based on the input sequence and context during
the training process. Once trained, the network can predict new sequential data by
feeding the input through the network. The predictions can be evaluated using a
separate test set of sequential data.
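A minimal Keras sketch of an LSTM classifier for sequences, assuming integer-encoded text; the vocabulary size, sequence length, and layer sizes are placeholders rather than values from this book.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 10000, 100  # illustrative values

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),       # map word indices to dense vectors
    layers.LSTM(32),                        # cell state and hidden state with gating
    layers.Dense(1, activation="sigmoid"),  # e.g., a binary label per sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=3)  # X_train: padded integer sequences of length seq_len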
. Autoencoder
Autoencoders are feedforward neural networks that are trained to learn a compressed
representation of the input data. An autoencoder includes an encoder network that maps
the input into a low-dimensional latent space and a decoder network that maps the
latent space back to the input.
. Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are a type of deep learning model that use
two neural networks namely:
1. A generator
2. A discriminator
They are used to create new data that resembles a given dataset. The generator
and discriminator compete with each other to produce high-quality generated data.
The main objective of a GAN is to understand the inherent distribution of a given
set of input data and then generate new data that is similar to the original input data
[26, 27].
Here is a high-level overview of how GANs work:
. Generator: The generator is a type of neural network that utilizes a random noise
vector as its input to create a fresh sample of data. Its main aim is to generate
samples that can be compared to the original data.
. Discriminator: The discriminator is a neural network that receives an input and
predicts whether it is a real sample or a fake sample created by the generator. The
main aim of the discriminator is to precisely differentiate between real and fake
samples.
. Training: The generator and discriminator are taught to work against each other,
where the generator creates samples that appear real to the discriminator, and the
discriminator tries to differentiate between real and fake samples accurately. The
training process involves feeding both types of samples repeatedly through the
discriminator, and using the error signal to adjust the weights of the generator and
discriminator networks.
. Evaluation: After training of GAN, the generator can generate data similar to
input. This is achieved by feeding random noise vectors through the generator
and getting the generated output.
GANs have found applications in creating lifelike visual art, music, and textual
content. A significant benefit of GANs is their potential to produce fresh data exam-
ples that are both varied and authentic. Nevertheless, training GANs can be a complex
task and requires precise adjustments to hyperparameters to guarantee consistent
convergence.
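A compact, hedged Keras sketch of the two networks only; the layer sizes and image shape are arbitrary, and the full adversarial training loop, which alternately updates the discriminator and the generator, is omitted for brevity.

from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 100  # size of the random noise vector

# Generator: noise vector -> fake 28x28 image
generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28, 1)),
])

# Discriminator: image -> probability that the image is real
discriminator = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")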
Model training is a crucial step in machine learning that involves adjusting the internal
parameters of an algorithm to make accurate predictions. This is done by providing
the algorithm with input features and output labels, and then fine-tuning its parameters
through iterations until it can accurately predict output labels for new input data.
Here’s an example of how model training works.
Let us consider there is a dataset containing housing prices. Each data point
represents a house and provides information about its features such as number of
bedrooms, lot size, year of construction, and sale price. The objective is to create
a machine learning model that can predict the price of a new house based on its
features. To do this, the dataset needs to be divided into two parts.
. Training set
. Testing set
The first part is the training set, which will be used to train the model. The second
part is the testing set, which will be utilized to evaluate the model's accuracy. A
common split is 80/20, where 80% of the data is used for training and 20% is used
for testing.
Next, we will select an appropriate machine learning algorithm for the task, such
as linear regression, decision tree, or SVM. The next step is to input the training
set into the algorithm. Internal parameters will be adjusted based on the difference
between its predicted output labels and the true labels in the training set.
During the training process, the algorithm iteratively adjusts its parameters to
minimize the difference between its predicted outputs and the true labels in the
training set. A common way to do this is gradient descent, which computes the
gradients of the loss function with respect to the model parameters and adjusts the
parameters in the direction of the negative gradient.
After the algorithm is trained on the training set, we will assess its performance
on the testing set to determine how well it adapts to new, unseen data. If the model
performs well on the testing set, we can utilize it to make predictions on new, unseen
data. In short, model training is an essential phase in the machine learning process
since it empowers us to construct precise predictive models that have various practical
applications.
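The housing example above can be sketched as follows, assuming scikit-learn; the feature values and prices are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Columns: bedrooms, lot size (sq. ft), year built; target: sale price (illustrative values)
X = np.array([[2, 4000, 1995], [3, 6000, 2005], [4, 7500, 2010],
              [3, 5000, 2000], [5, 9000, 2015], [2, 3500, 1990]])
y = np.array([210000, 320000, 420000, 280000, 550000, 190000])

# 80/20 split: 80% of the data for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)  # parameters adjusted to fit the training set
preds = model.predict(X_test)
print("mean absolute error on the test set:", mean_absolute_error(y_test, preds))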
Model testing is a way to assess the performance of the fully trained model on a
different dataset that was not used in the training phase. This step is very important
to make sure that the model generalizes and can make accurate predictions even
on unseen data.
For tuning the model, we will divide the dataset into three parts:
. Training set
. Validation set
. Testing set
We will use the training set for training the model, the validation set for hyperparameter
tuning, and the testing set for the final assessment of performance.
To improve the model’s performance during training, we can begin by adjusting the
learning rate. This rate determines how fast the model adjusts its internal parameters.
We can evaluate the model’s performance on the validation set by training it with
different learning rates. If we find a learning rate that yields good results on the
validation set, we can use it to train the model on the entire training set.
We can proceed by fine-tuning the regularization strength, which regulates the
extent to which the model punishes high parameter values while being trained. Simi-
larly, we would retrain the model using different regularization strengths and measure
its performance on the validation set. Additionally, we could experiment with varying
the number of hidden layers in the model, which can impact its capability to capture
intricate correlations within the data.
After fine-tuning the model on the validation set, we need to assess its performance
on the test set to ensure its ability to handle novel, unseen data. Optimizing model
performance through tuning is a critical step in the machine learning process, ensuring
that our models can be effectively implemented in real-world applications.
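A hedged sketch of the tuning loop described above, assuming scikit-learn: GridSearchCV cross-validates over the training portion (playing the role of the validation step) while searching over the learning rate, the regularization strength, and the hidden-layer sizes of a small neural network; the grid values are arbitrary.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Search over learning rate, L2 regularization strength (alpha), and hidden-layer sizes
param_grid = {
    "learning_rate_init": [0.001, 0.01],
    "alpha": [0.0001, 0.01],
    "hidden_layer_sizes": [(32,), (32, 32)],
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))  # final check on held-out data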
. Prepare the data. This includes cleaning the data, removing outliers, and transforming
it into a format that can be used by your model. For images, it involves resizing them
to a fixed size, converting them to grayscale, normalizing the pixel values, etc.
. Choose a CNN model. There are many different types of deep learning models
available; the best model for your project will depend on the specific problem
you are trying to handle. For face mask recognition, VGG16 is used (see the sketch
after this list).
. Train the model. This is the stage where you utilize the data you have gathered.
Essentially, you feed the images in your dataset to the model and enable it to learn
how to distinguish between masked and unmasked faces. The duration of the training
process will depend on the size of your dataset and the complexity of your model.
. Evaluate the model. Once your model is trained, you need to evaluate its perfor-
mance on a different dataset. This will help you to determine how well the model
will generalize to new data.
. Deploy the model. Once you are satisfied with the performance of your model,
you can deploy it to production. This means making it available to users so that
they can use it to make predictions. A minimal code sketch of these steps is given below.
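A minimal transfer-learning sketch of these steps with Keras; the directory names data/train and data/val, the image size, and the number of epochs are hypothetical choices for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (224, 224)

# Prepare the data: load images from folders and resize them
train_ds = tf.keras.utils.image_dataset_from_directory("data/train", image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory("data/val", image_size=IMG_SIZE, batch_size=32)

# Choose the model: VGG16 as a frozen feature extractor
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,))
base.trainable = False

model = models.Sequential([
    layers.Rescaling(1.0 / 255),             # normalize pixel values to [0, 1]
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),   # masked vs. unmasked
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model and evaluate it on a separate validation split
model.fit(train_ds, validation_data=val_ds, epochs=5)
```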
Here are some additional tips for success in deep learning projects:
. Use a good deep learning framework. There are many different deep learning
frameworks available, such as TensorFlow, PyTorch, and Keras. Select a framework
that suits the requirements of your project.
. Use a GPU. Using a GPU will dramatically speed up the training of your models.
. Be patient. Training a deep learning model can be a time-consuming process, so
be patient while waiting for results.
. Experiment. When it comes to deep learning, there’s no universal solution that
works for everyone. To determine the best approach for your project, it’s important
to try out various models, hyperparameters, and data preprocessing techniques
through experimentation.
2.8 Summary
Chapter 3
Data-Centric Principles for AI
Engineering
3.1 Overview
AI has become an interdisciplinary field and has applications in all verticals across
day-to-day routines. There has been a lot of development in this field, and the current
AI as a black box is becoming AI as a white box where a new category of AI algo-
rithms called Explainable AI is emerging in which the focus is more on the data than
models. Today, we are surrounded by digital transformation where every application
is driven by the Internet of Everything (IoE) [1]. In IoE, numerous sensors are installed
in the environment as well as commissioned in smartphones and smartwatches. These
sensors generate a huge amount of data, and a rich set of data analytics algorithms [2]
are helping us to proactively track the activities carried out by users. These activity
recognition and tracking algorithms are empowered by emerging machine and deep
learning techniques and thus enable a human-centered design approach [3]. Applying
AI algorithms to the underlying data involves various operations, including data
engineering, data analytics, model building, data science, and business intelligence.
The model-building stage decides which machine learning or deep learning algorithm
is most appropriate for developing the model, based on the given data and the questions
to be posed on this data. However, data-centric AI is more crucial than the selection
of algorithms for model building, which is nothing but model-centric AI as discussed
in Chap. 2 of this book.
Consider a use case to be designed and developed for audiences who are visually
impaired. The devices required are a mobile phone or any other equivalent device
with a speaker, microphone, and camera. The objectives of this use case are to control
the device using voice recognition and to perform different actions, which include
reading a book such as the Bible aloud, detecting obstacles on the road using the
camera, and providing spoken responses in the required language. The camera is also
needed to detect the person standing in front of the device (a plug-and-play camera)
and to store that person's details, so that if the same person comes again, the device
can predict the nature and details of that person.
Consider another use case designed for visually impaired audiences, this time with
an Alexa-like device that has a camera. The features to be incorporated include asking
a question by voice and getting a response to anything available on Google, and the
device should detect the user's emotions and react accordingly. The person carrying
the device is treated as the master, and the device should have a reset button so that
it can learn the master's behavior again. In addition, if a new person comes to meet
the master, the model should learn that person's details in relation to the master, and
if the same person comes again, a prediction should be made about his behavior using
the deep learning model.
Model-centric AI first looks into the technology, device, team requirements, and
modules needed to build these use cases, as follows:
Technology
. Natural Language Processing—Voice recognition, NLU, and sentiment analysis
. Deep Learning—Model to learn the sentiments of the master, emotion analysis,
answering any questions asked by the master
. Flask Web services to integrate with UI/android app
Devices
. Device with microphone and camera
Team
. Minimum 3–4 people in AI-ML
. Expert in Android/IoT device
. Web designer
Modules
. Voice and image recognition, question-answering module
. Facial recognition and detection of user attributes such as ethnicity, age, etc.
This model-centric AI technological suite is depicted in Fig. 3.1.
When we consider this use case, there are two perspectives to look at it: the user
perspective and the developer perspective, each with a different set of requirements,
challenges, and design issues. In addition to this, there are many
AI engineering concerns that are to be taken care of during the development of this
use case as well as while using it by the end user.
3.2 AI Engineering
A system can be small, like an atomic system in which electrons orbit around
the nucleus, or big, like the whole universe, where stars and planets continuously
orbit around each other. In AI, the system can be an open loop or a closed loop, and
it is represented as a black box having inputs (X) and outputs (Y). These two explicit
sets, inputs (X) and outputs (Y), precisely differentiate model-centric AI
from data-centric AI. In general, the aim of any AI or machine learning, or deep
learning algorithm is to replicate the relationship between observed X and Y; while
the aim of statistics is to model this relationship between X and Y. Empirical risk
and its minimization is the main goal of AI, and structural risk minimization is
the key objective of statistics. However, it should be noted that every algorithm in
AI is based on the theory of statistics, and the two main aspects, i.e., error mini-
mization and hypothesis approximation, go hand in hand while building any AI
application. In view of this, Probably Approximately Correct (PAC) learning [4] is
an interesting type of learning aimed at finding the upper or lower bounds of learning
algorithms using concentration inequalities. Generally, the relationship between
empirical risk minimization and structural risk minimization is not linear, and the
intersection between AI and statistics optimizes the tradeoff between empirical risk
minimization (accuracy) and structural risk minimization (approximation) for maximum
confidence (probability); this is referred to as PAC learning.
The model debugging strategies and iterations are very critical to AI engineering.
It is very important to understand how the debugging of AI models is different from
the debugging of traditional software. In debugging an AI model, more emphasis is
placed on the reasons for the model's underperformance than on the code. It is very
important to understand the reasons for the model's poor performance, which include
the data, the model building, the hyperparameters, etc. An appropriate strategy needs to be investigated
for debugging the AI model. Assume that you are given many datasets of varying
descriptions: small or large dataset sizes, datasets with few or many features, outliers
in datasets, noise in datasets, linearly separable datasets, nonlinearly separable
datasets, overlapped and non-separable datasets, time series datasets, and so on. In
each case, or a combination thereof, the main task is to recommend, with justification,
a proper classifier model.
A few common bugs in the AI model are listed and discussed below:
. Small data
AI, machine learning, and deep learning generally assume the availability of big data.
However, there are many use cases where not enough data is available, which results
in low accuracy and poor values for other performance metrics. Augmentation, generative
models, generative adversarial networks, and auto-encoders are some techniques to
address this issue. So, in the case of small data, making it high quality is the best
approach to training the model.
. Logging mechanism
A proper logging mechanism must be in place so that useful, rather than useless,
information is logged. Decision logging and information-exchange logging, along with
scenario mapping based on the description, are also important functions to be taken care of.
. Model confirmation
This is a post-model-building issue which includes confirming that the model has
not been tampered with, i.e., ensuring the integrity of the model, as well as proof of
its correctness and reliability. Popular methods for model confirmation include varied
analysis with multiple inputs, auto-generation of test data, etc.
. Data dimension
A major class of bugs in model building is caused when the input data does not have
the required dimensions. Linear algebra operations require consistent dimensions, so
even with a rich set of libraries available to detect inconsistent behavior in the data,
dimension mismatches remain a major source of bugs in model building.
. Data inconsistency
Inconsistencies and flaws in the input data contribute heavily to model bugs. If the
set of questions to be posed on the data and the type of outcomes expected from
the AI model are not clear, then the model is likely to lose accuracy.
. Hyperparameter tuning
Hyperparameters control the behavior and performance of any AI model, and performance
can be improved by tuning them properly. Descriptive statistics of the dataset also
inform feature scaling and outlier detection, which in turn can improve model
performance.
AI models are never perfect; however, the effort is to build an AI model that is
close to perfection and performs well across varied classes of datasets. The
accuracy of any AI model is evaluated with the help of several performance metrics
[5, 6] like the confusion matrix, receiver operating characteristics, the area under the
curve, data models, etc. The key parameters for AI model evaluation are listed below
(a combined code sketch follows this list):
1. Receiver Operating Characteristics
Receiver operating characteristic (ROC) curves are very useful and important
tools in machine learning applications for the assessment of classifiers; the main
objective of classifier assessment is to decide a cutoff value for the underlying
dataset.
Consider a healthcare application, particularly in the context of pathological
tests. Generally, in medical biochemistry, 200 is considered the cutoff value for
vitamin B12. The patient population having a B12 value below 200 is grouped
into the deficient patients, and the patient population having a B12 value above
200 is grouped into the normal patients. However, clinically it is also possible
that there are false negatives, i.e., patients having a B12 value above 200 who
are actually deficient, and false positives, i.e., patients having a B12 value
below 200 who are actually normal.
The ROC curve is a graphical statistical plot that explains the performance of a
binary classifier as its discrimination threshold is varied. The curve is created by
plotting the true positive rate (TPR) against the false positive rate (FPR) at varied
threshold settings. The false positive rate (fall-out) can be calculated as
(1 − specificity), so the ROC curve plots sensitivity (true positive rate) against
the false positive rate (fall-out). When the
probability distributions for both detection and false alarm are known, the ROC
curve can be generated by plotting the cumulative distribution function of the
detection probability on the y-axis and the cumulative distribution function of
the false-alarm probability on the x-axis.
ROC curves are commonly used in medicine and engineering to determine a
cutoff value for a test, basic experimentation, or any diagnostic in the healthcare
domain. For example, the threshold value of a 4.0 ng/ml for the prostate-specific
antigen (PSA) test in prostate cancer is determined using ROC curve analysis,
and there are many such similar applications in the diagnostic. A test value below
4.0 is considered normal, and above 4.0 is considered abnormal. However, it is
important to note that there will always be some patients with PSA values below
4.0 that are abnormal (false negatives) and those above 4.0 that are normal (false
positives) regardless of the cutoff value chosen. The goal of ROC curve analysis
is to determine the optimal cutoff value that balances the tradeoff between true
positives and false positives to maximize the overall diagnostic performance of
the test.
2. Sensitivity and Specificity
Sensitivity and specificity are very important statistical measures of the perfor-
mance of a binary classification test or any binary classifiers in the context of
machine learning.
. Sensitivity, also referred to as the true positive rate, measures the proportion
of positives that are correctly identified as such (e.g., the percentage of sick
people who are correctly identified as having the condition).
. Specificity, also referred to as the true negative rate, measures the proportion
of negatives that are correctly identified as such (e.g., the percentage of healthy
people who are correctly identified as not having the condition).
3. Precision and Recall
. In the context of pattern recognition and information retrieval with binary
classification, precision (also called positive predictive value) is a measure of
how many of the retrieved instances are actually relevant.
. Recall (also known as sensitivity) is a measure of how many of the relevant
instances were retrieved.
Both precision and recall are based on an understanding and measure of relevance.
4. Cross-Validation
. Cross-validation (also known as rotation estimation) is a technique used to
assess how the results of a statistical analysis will generalize to an independent
dataset
. It is mainly used in applications and system settings where the key objective is
prediction. Cross-validation is mainly used to define a dataset for validating the
model during the training phase (the validation of the dataset). This helps to limit
problems like overfitting
and to give insight on how the model will generalize to an independent dataset
(i.e., an unknown dataset, for instance from a real-world problem).
5. Ensemble Methods
. Integration and combination of multiple learning algorithms to improve the
predictive performance is known as the ensemble approach. In statistics and
machine learning applications, ensemble methods are becoming more popular
for enhanced performance as compared to any other existing algorithms.
. However, there can be computational overhead, and there is always a tradeoff
between this overhead and performance.
6. Bagging
. Bagging is a process of bootstrap aggregating where a meta-algorithm using an
ensemble approach is designed in order to improve the accuracy and stability
of any AI algorithm.
. It is mainly used in classification and regression techniques and helps to reduce
variance and also avoid overfitting.
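As a combined, illustrative sketch of the evaluation ideas above, the following scikit-learn snippet trains a bagged ensemble of decision trees, estimates generalization with 5-fold cross-validation, derives sensitivity, specificity, and precision from the confusion matrix, and computes the ROC AUC; the dataset and ensemble size are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: bootstrap-aggregated decision trees (the default base estimator)
model = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Cross-validation: how well does this model generalize across independent folds?
cv_scores = cross_val_score(BaggingClassifier(n_estimators=50, random_state=0), X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())

# Confusion-matrix-based metrics on the held-out test set
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn))   # true positive rate (recall)
print("specificity:", tn / (tn + fp))   # true negative rate
print("precision:  ", tp / (tp + fp))   # positive predictive value

# ROC curve and AUC from predicted probabilities; one common cutoff choice
# is the threshold that maximizes TPR - FPR.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores), "cutoff:", thresholds[np.argmax(tpr - fpr)])
```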
In addition to these issues causing bugs in the model, the developer also uses model
debugging strategies which are depicted in Fig. 3.2.
3.3 Challenges
Figure: the debugging loop iterates between defining the problem and deploying the solution, driven by data updation (new data) and model updation (fixing model bugs, drift adaptation, and schema updation).
There should be a proper architecture in place so that we can pick up the data at different
parts of the pipeline to evaluate individual components and enable incremental data
training; this also enables end-to-end evaluation and iteration of the AI model. It is also important
to decide the performance metrics at the different stages of the pipeline to improve
the performance of AI models. Different operators at different stages of the pipeline
and global applications along with their scores can also be customized with the help
of per component and end-to-end evaluation and iteration. The scenario depicted in
Fig. 3.5 (input data flowing through an end-to-end model into classified, clustered,
and linked entities) throws more light on this design principle.
. Selection of appropriate technique
Each building block in the AI engineering process performs a data frame transforma-
tion. For example, a document data frame is transformed into a classified document
data frame after the application of a particular classifier. However, the selection of
a classifier plays an important role in the entire process; the task can be done by
applying a heuristic classifier or a machine learning classifier, and, depending on the
data size, a deep learning classifier can also be used for better prediction and higher
accuracy. It is recommended to start with a simple model and add complexity later,
depending on the requirements.
. Iteration with programmatic labeling
In the AI model-building process, programmatic labeling is very much useful in
bringing rapid iteration. Traditionally, in the process of pipeline model building,
manual data labeling is used which is the major bottleneck in iteration. The existing
approaches being adopted are listed below along with their limitations:
. Premature optimization
Machine learning involves a very complex set of code and functions, and while
integrating it into the pipeline, premature optimization should be avoided.
. Incremental development
Change management is the routine process in the software development life cycle
wherein the updates and change requirements always come from the clients, and it is
to be incorporated in every phase of the software development life cycle. Anticipating
change management and enabling incremental development is an important principle
of data-centric model development.
3.5 Summary
Data-centric AI is becoming a more popular trend now which has more emphasis on
the data than the model. This chapter first presents the reference use case to understand
how the technological and design requirements change from model-centric AI to
data-centric AI. These differentiating parameters include technology, devices, team,
and modules. In the next part of this chapter, AI engineering aspects are presented
and discussed in the view of model-centric and data-centric AI model building with
examples. A few important bugs in the AI model-building process are also discussed
in this section. The key parameters for AI model evaluation are also elaborated,
including ROC curves, sensitivity and specificity, precision and recall, cross-validation,
and ensemble methods such as bagging and boosting. Finally, this chapter concludes with the
key challenges as well as important data-centric principles which are recommended
to follow in the data-centric AI-building process.
References
1. Dey, N., Shinde, G., Mahalle, P., & Olesen, H. (2019). The internet of everything: Advances,
challenges, and applications. De Gruyter. https://doi.org/10.1515/9783110628517
2. Mahalle, P. N., Gitanjali, R. S., Shinde, G. R., Pise, P. D., Deshmukh, J. Y., & Jyoti, Y. D. (2022).
Data collection and preparation. In Foundations of data science for engineering problem solving
(pp. 15–31). Springer
3. Boy, G. (2017). The handbook of human-machine interaction: A human-centered design
approach. CRC Press
4. Pydi, M. S., Jog, V. (2020). Adversarial risk via optimal transport and optimal couplings. In
Proceedings of the 37th international conference on machine learning, in proceedings of machine
learning research (Vol. 119, pp. 7814–7823). https://proceedings.mlr.press/v119/pydi20a.html
5. Gupta, A., Parmar, R., Suri, P., & Kumar, R. (2021). Determining accuracy rate of artifi-
cial intelligence models using Python and R-studio. In Proceedings of the 2021 3rd interna-
tional conference on advances in computing, communication control and networking (ICAC3N)
(pp. 889–894)
6. Patalas-Maliszewska, J., Paj˛ak, I., & Skrzeszewska, M. (2020). AI-based decision-making model
for the development of a manufacturing company in the context of industry 4.0. In Proceedings
of the 2020 IEEE international conference on fuzzy systems (FUZZ-IEEE) (pp. 1–7)
7. Paszke, A. et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In
H. Wallach (Ed.), Advances in neural information processing systems (Vol. 32, pp. 8024–8035)
Chapter 4
Mathematical Foundation
for Data-Centric AI
4.1 Overview
Mathematics provides the tools needed to understand and analyze the underlying
data, and it is crucial in data analysis. Some of the fundamental quantitative
ideas and methods that are applied in data analysis are listed below.
4.1.1 Statistics
Statistics is the part of mathematics that deals with data collection, analysis, and
interpretation, along with presentation of the data. It provides the foundation for
concepts like regression analysis, probability, variance, hypothesis testing, etc. It also
includes concepts of descriptive analytics such as measures of central tendency and
variability, as well as inferential statistics [1]. Recognizing patterns, trends, and
relationships in the data is the crucial part of statistical data analysis.
4.1.2 Linear Algebra
This part of mathematics includes concepts like vector spaces, linear equations,
matrices, etc. In data analysis, techniques like dimensionality reduction, classification,
and clustering build on it for better results. Linear algebra provides powerful tools for
data analysis: it permits analysts to represent and manipulate data in such a way that
hidden patterns and relationships in the data can be discovered, which allows analysts
to make precise predictions. To find the relationship between two variables, linear
regression can be very helpful; it fits a line to the underlying data points. To reduce
the dimensionality of the underlying data, principal component analysis (PCA) is used.
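A small illustrative sketch of both ideas with scikit-learn; the synthetic five-dimensional data is an assumption made only for demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                    # 5-dimensional data
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Linear regression: fit a line (hyperplane) relating the variables
reg = LinearRegression().fit(X, y)
print(reg.coef_)          # recovers roughly [3, -2, 0, 0, 0]

# PCA: project the 5 dimensions onto the 2 directions of largest variance
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)           # (200, 2)
```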
4.1.3 Calculus
Calculus can be used to study rates of change and growth, and it supports tasks
such as optimization, curve fitting, and integration in data analysis. The rate of
change at any point is given by calculating the derivative of a function at that point,
and the same calculation can be used to find optimal values of the parameters in a
model. To optimize a function, calculus finds its maximum or minimum value.
Optimization is a crucial part of data analysis, as it finds the best parameter values
for the model, which helps in prediction and minimizes errors. Differential equations
are used to design models for systems that change over time, such as population
growth, disease spread, and economic trends. Calculus gives us very powerful tools
to assess and analyze complex data [3].
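As a one-line, generic illustration of using the derivative to locate an optimal parameter value (not tied to any particular model):

$$f(w) = (w - 3)^2 + 1, \qquad f'(w) = 2(w - 3) = 0 \;\Rightarrow\; w = 3,$$

so the function is minimized at w = 3; gradient-based optimizers reach the same point by repeatedly stepping against the sign of f'(w).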
The study of the likelihood of random occurrences comes under the branch of mathematics
known as probability theory [1]. This theory is often used in data analysis for
modeling uncertainty and making accurate predictions. It can be used in tasks such
as hypothesis testing, Monte Carlo simulations, and Bayesian inference.
. Probability Distribution: Probability distributions give the likelihood of occur-
rence of a random variable. For modeling and analysis of data distribution, proba-
bility distribution is used. It can also be used to alter parameters used in statistical
modeling.
. Bayes Theorem: Bayes theorem is a keystone of probability theory. It gives us
the way to adjust the hypothesis probabilities if any or new information comes
to light. This theorem is mostly used in the field of data analysis in order to alter
previously formed assumptions in the presence of newly surfaced information.
. P-value: The p-value is used for hypothesis testing. It tells us whether the
underlying data supports the null hypothesis or the alternate hypothesis. The null
hypothesis states that there is no difference between groups or no effect, and the
alternate hypothesis states the opposite. The p-value is compared against a
predetermined significance level (often denoted as α). If the p-value is smaller
than the significance level, we reject the null hypothesis and support the alternate
hypothesis; if the p-value is greater than or equal to the significance level, we
fail to reject the null hypothesis.
For example, suppose we are testing whether a coin is fair after observing 40 heads.
If the p-value is not below the significance level, the null hypothesis is retained,
meaning that getting 40 heads is still consistent with a fair coin. If the calculated
p-value comes out to be 0.02, which is less than the significance level, we reject
the null hypothesis and conclude that the coin is unfair.
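A small sketch of this computation with SciPy, under the illustrative assumptions of 40 heads in 100 tosses and a significance level of 0.05:

```python
from scipy.stats import binomtest

result = binomtest(k=40, n=100, p=0.5, alternative="two-sided")
print(result.pvalue)   # about 0.057 for 40 heads in 100 tosses of a fair coin

alpha = 0.05           # assumed significance level
if result.pvalue < alpha:
    print("Reject the null hypothesis: the coin looks unfair.")
else:
    print("Fail to reject the null hypothesis: the coin may be fair.")
```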
The branch of mathematics that deals with the study of networks and their properties
is referred to as graph theory. It can be applied to data analysis tasks such as
community detection, clustering, and social network analysis. In graph-based data
analysis, things or entities are represented as nodes of a graph, and the connections
and interactions between them are represented as edges [4].
i. Link Prediction: Link prediction is the process of determining the probability
that a connection will form between nodes, on the basis of node properties or the
structure of the graph. It can be applied in recommendation systems, social media
analysis, drug discovery, etc. Multiple techniques can be applied for link prediction,
such as common neighbors, Jaccard similarity, preferential attachment, or random
walks (a small sketch follows this list).
ii. Network Clustering: The process of arranging network (nodes) into clusters
or communities on the basis of underlying similarity or connectivity is known
as network clustering. The concept of network clustering can be implemented
in scenarios such as image segmentation, text clustering, or gene expression
analysis. For graph clustering, many techniques are offered by graph theory such
as modularity optimization, spectral clustering, and hierarchical clustering.
iii. Visualization: Graph theory can also be used for visualizing high-dimensional
data. Every data point is represented as a node, and connections between nodes are
made on the basis of how far or near the data points are from each other. Patterns
or clusters that are difficult to find in a high-dimensional space can then be
recognized more easily in a two- or three-dimensional visualization.
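A minimal link-prediction sketch with NetworkX, using Jaccard similarity on a small made-up graph:

```python
import networkx as nx

# Toy graph: edges represent existing relationships between entities
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

# Jaccard similarity over currently unconnected node pairs; a higher score
# suggests a higher chance that a link will form between them.
for u, v, score in nx.jaccard_coefficient(G):
    print(u, v, round(score, 2))
```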
. Once tool designing is done, the gathering of data can be initiated. Data
gathering should be done in a methodical and concise manner.
. After data collection is done, the data should be prepared through preprocessing.
This includes checking the data for errors, contradictions, and missing values,
and transforming it into a structure that is fit for analysis.
c. Analysis of the Data: Analysis of data should be done with the help of statis-
tical techniques. This can include techniques like hypothesis testing, regression
analysis, etc. or descriptive statistics like measures of central tendency, vari-
ability, etc. Descriptive statistics includes collection, organization, and analysis
of underlying data so researchers can make accurate interpretations. The main
aim of the statistical data is to identify hidden patterns, trends, and relationships
in the data. This involves using numerical and graphical techniques to summarize
and describe key elements of the dataset. Central tendency, variability, and the shape
of the distribution are what descriptive statistics summarize. Mean, median, and mode
are examples of measures of central tendency that provide details about the typical or
average value of a dataset. To describe how the data is distributed, measures of
variability include the range, variance, and standard deviation.
Commonly used descriptive statistics can include frequency distributions,
histograms, box plots, scatter plots, correlation factors, etc. These tools can be
very helpful to observe the data distribution of underlying data, detecting outliers,
identification of hidden pattern in the data, relationship between the variables in
the dataset, etc. These can be applied in fields like business, economics, the social
sciences, and the natural sciences.
With the tools and techniques, you can make informed decisions based on
empirical evidence. It can also help in summarization of data which can be
shared in a clear and concise manner. The primary purpose of inferential statistics
is to determine whether observed differences or similarities between groups or
variables are genuine or a coincidence.
Various techniques such as regression analysis, confidence intervals, and
hypothesis testing are used in inference statistics to analyze data and derive infor-
mation about communities from samples. Many fields such as economics, social
sciences, engineering, and health sciences rely heavily on inferential statistics.
This permits researchers to make decisions or predictions on the basis of data
they have collected with higher accuracy.
d. Presenting the Results: Giving clear and concise representations of the findings
is a vital task in the data analysis process. It can be done with the help of reports,
slideshows, and data visualization tools. Visualization includes graphs, charts,
histograms, and diagrams, which help in effective communication of your main
research. For example, a scatter plot or line chart helps visualize how two variables
change over a certain time period. The underlying data and the findings will
determine which visualization you should use.
4.3 Data Tendency and Distribution
Data tendency and distribution are crucial topics in statistics that explain the central
tendency and distribution of a dataset [1]. First we will look at the concept of data
tendency in detail.
Data tendency is a single representative (average) value that summarizes a dataset
by identifying its central tendency. It gives insight into the typical value of the data.
The measures of data tendency are:
. Mean: It is a basic concept in the statistics. It is given by addition of all the values
in the dataset and dividing it by total number of values in the dataset. Mean is
easily affected by the outliers in the dataset.
Mean can have significant implementations in the field of data analysis and its
interpretation. In the field of economics or finances, mean represents the performance
of the investments done by investors. In research fields, mean represents abstraction
or summarization of experimental data.
. Median: The median is the middle value when we sort the entire dataset in
ascending or descending order. If the number of entries is odd, the middle value
is easily identified; if the number of entries is even, the median is the mean of
the middle two values. The median is less affected by outliers in the dataset.
For the analysis of skewed distributions (in which outliers can have great impact
on mean) like income, house prices, etc., median can be used.
. Mode: The mode is the value that occurs most often in the dataset. A dataset can
have more than one mode or, in some cases, no mode.
For the analysis of categorical data mode is used. For pointing out most common
purchase in the general store, most commonly used vehicle, common response given
in survey, etc. mode can be used. Also for identification of outliers mode can be used
(when mode is not there or possibility of multiple modes are there).
Consider the following dataset and calculate mean, median, and mode.
9, 11, 14, 10, 41, 35, 22, 32, 28, 8
Mean of the dataset can be calculated as follows:
Mean = (9 + 11 + 14 + 10 + 41 + 35 + 22 + 32 + 28 + 8)/10
Mean = 21
To find the median, we sort the original dataset 9, 11, 14, 10, 41, 35, 22, 32, 28, 8
in ascending order: 8, 9, 10, 11, 14, 22, 28, 32, 35, 41. As the number of entries is
even, the median is the mean of the middle two values, (14 + 22)/2 = 18. Every value
in this dataset occurs only once, so there is no mode.
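These values can be checked with a short sketch using Python's built-in statistics module (multimode requires Python 3.8 or later):

```python
import statistics

data = [9, 11, 14, 10, 41, 35, 22, 32, 28, 8]

print(statistics.mean(data))       # 21
print(statistics.median(data))     # 18.0, the mean of the middle values 14 and 22
print(statistics.multimode(data))  # every value occurs once, so no single mode stands out
```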
Dispersion, also called a measure of variability, describes the spread of the data
points and the extent to which the values differ from each other [1]. Commonly used
measures of dispersion are as follows:
i. Range
ii. Variance
iii. Standard deviation
iv. Interquartile range
v. Percentile quartile
. Range
It is the simplest measure of dispersion, calculated as the difference between the
maximum and minimum values in the dataset. It gives us an idea about the spread of
the data points, but it is very sensitive to outliers.
The Python implementation of range is as follows:
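One minimal way to compute it in Python, reusing the example dataset from above:

```python
data = [9, 11, 14, 10, 41, 35, 22, 32, 28, 8]

data_range = max(data) - min(data)   # maximum value minus minimum value
print(data_range)                    # 41 - 8 = 33
```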
. Variance
Variance is defined as the average squared difference between each data point and
the mean. The higher the variance, the higher the dispersion observed in the data.
A sample is a subset of the data, whereas the population is the entire dataset; this
distinction is what separates the sample variance from the population variance.
If we are calculating the variance for a sample, then the formula is as follows:

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
where
s² is the sample variance,
xᵢ is an individual data point in the sample,
x̄ is the sample mean, and
n is the sample size.
If we are calculating the variance for a population, then the formula is as follows:

\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
where
σ² is the population variance,
xᵢ is an individual data point,
μ is the population mean, and
N is the population size.
The Python implementation of variance and standard deviation is as follows:
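One minimal way to compute these with Python's statistics module, again on the example dataset:

```python
import statistics

data = [9, 11, 14, 10, 41, 35, 22, 32, 28, 8]

print(statistics.variance(data))    # sample variance (divides by n - 1)
print(statistics.pvariance(data))   # population variance (divides by N)
print(statistics.stdev(data))       # sample standard deviation
```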
. Standard Deviation
The square root of the variance is known as the standard deviation. It gives us the
average deviation from the mean; as with variance, the higher the standard deviation,
the greater the spread.
. Interquartile Range
It is the range between the first quartile (25th percentile) and the third quartile
(75th percentile) of a dataset. It gives the spread of the middle 50% of the data
and is less affected by outliers.
The Python implementation of standard deviation and interquartile range is as
follows:
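A minimal sketch with NumPy (an illustrative choice of library):

```python
import numpy as np

data = [9, 11, 14, 10, 41, 35, 22, 32, 28, 8]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                        # spread of the middle 50% of the data
print(np.std(data, ddof=1), iqr)     # sample standard deviation and interquartile range
```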
. Percentile Quartile
It is the range between two given percentiles, and it describes the spread of the
data between those two percentiles.
The Python implementation of percentile quartile is as follows:
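A minimal sketch with NumPy, taking the 10th and 90th percentiles as an illustrative choice of the two percentiles:

```python
import numpy as np

data = [9, 11, 14, 10, 41, 35, 22, 32, 28, 8]

low, high = np.percentile(data, [10, 90])   # any two percentiles of interest
print(high - low)                           # spread between the chosen percentiles
```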
Data distribution means how the data is distributed or spread across values present
in the dataset. It gives us vital information such as range and variability of the data.
The data distribution consists of:
i. Normal distribution
ii. Skewed distribution
iii. Uniform distribution
iv. Bimodal distribution
. Normal distribution
Data models are representations of data structures, relationships, and rules that define
how data is organized, stored, and accessed within a system or database. ML algo-
rithms work on various types of datasets, which may be structured, unstructured, or
semi-structured in nature [5]. On the basis of the organization and underlying format
of the data, the following models have been created:
. Structured data
. Unstructured data
. Semi-structured data
They provide a conceptual framework for understanding and working with data.
The data models are explained in detail.
. Structured Data: The model that has precise organization for its data in a prede-
fined format is known as structured data. It has its own predefined structure and
is usually stored in relational databases or tabular data formats. We can recognize
structured data with following properties:
i. Consistent format: Every data element has to adhere to a specific data
format defined by the schema or tabular structure.
ii. Predefined schema: Everything is predefined including underlying structure,
data types, and relationships between data elements.
iii. Organized in rows and columns: It has a tabular structure consisting of rows
and columns. Rows give us specific records, and columns give us specific
attributes or fields of underlying data.
iv. Easy to query and analyze: The organization of structured data is so well
defined that the underlying data can easily be queried and analyzed with the
help of Structured Query Language (SQL) or other database tools.
The well-known examples of structured data can be seen in financial records,
sales transactions, employee records, sensor data, etc.
. Unstructured Data: The data that does not have a predefined structure or orga-
nization is known as unstructured data. It does not have any specific structure or
schema. It typically has human readable format. Unstructured data has following
characteristics:
i. No predefined structure: It does not follow any predefined model or schema.
ii. Varied formats: It can have different formats. It can exist in various formats,
such as text documents, emails, social media posts, images, audio files,
videos, and web pages.
iii. Difficult to analyze: As unstructured data does not have any specific format,
the analysis of such data is quite challenging.
. Momentum
Momentum is an optimization technique that speeds up the optimization process by
adding a fraction of the previous parameter update to the current update. It helps to
dampen noise in the gradients and improves convergence when there are flat or
plateau regions in the loss surface (a small sketch is given after this list of techniques).
. Convex Optimization
The focus is on solving optimization problems with convex objective functions and
boundary conditions specified. These optimization techniques promise to find global
optimum efficiently. The implementation of convex optimization can be seen in the
fields of machine learning, signal processing, control systems, etc.
Along with these techniques, some more techniques are implemented that have
performed well with the data.
. Linear Programming
When the linear objective function and constraints are specified without any ambi-
guity, linear programming optimization can be used. The implementation of this
technique can be seen in the fields of operations research, resource allocation problems,
etc.
. Genetic Algorithms
These are evolutionary algorithms inspired by the natural selection process and by
genetics as found in nature. Candidate solutions are represented as individuals in a
population, and the algorithm iteratively evolves the population using selection,
crossover, and mutation operations to find an optimal solution.
. Simulated Annealing
This probabilistic optimization algorithm simulates the annealing process used in
metallurgy. It iteratively explores the solution space starting from an initial solution,
occasionally accepting uphill moves (worse solutions) according to a probability
distribution so that it can escape local minima; this acceptance probability is
gradually decreased so that the search eventually settles on a good solution.
. Particle Swarm Optimization
The inspiration for particle swarm optimization is taken from the behavior of bird
flocking or fish schooling. It handles a population of particles that explore the solution
space by updating their positions in accordance with their own best solution and the
best solution found by the whole population.
. Ant Colony Optimization
As the name suggests, the inspiration for ant colony optimization is taken by behavior
of ants. The ants use pheromones for tracking each other. The algorithm uses
same simulation for finding out the best possible path depending on the pheromone
concentration. Most commonly seen implementation is traveling salesman problem.
. Constraint Programming
For solving combinatorial problems along with specified constraints, constraint
programming optimization is used. The problem is represented as a set of vari-
ables, domains, and constraints. Afterward, it searches for valid assignments to the
variables which will satisfy all the constraints specified by the problem.
These are commonly used optimization techniques as per the requirements in the
hand. Each technique has its pros and cons, but the choice of technique depends on
the problem at hand and the available resources to solve them.
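As a concrete illustration of one technique from this list, the following NumPy sketch implements the momentum update described earlier; the learning rate, momentum coefficient, and example gradient are assumptions chosen only for demonstration.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One momentum update: keep a fraction (beta) of the previous update and
    add the current gradient step, which smooths out noisy gradients."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Example: a single step on a two-parameter model
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
g = np.array([0.5, -0.3])          # gradient of the loss at w (made-up values)
w, v = momentum_step(w, g, v)
print(w, v)
```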
Optimization techniques are used to improve the performance and efficiency of machine
learning models. Some of the reasons why they are important are listed below:
. Model Performance Improvement: The main goal of optimization techniques is to
minimize the error or loss of model. By searching for optimal values for parameters
of model, its performance can be improved. It works well in case of unseen data
and gives us more accurate predictions.
. Efficient Resource Utilization: Optimization techniques use computational
resources such as memory, processing power, and time. The main goal for opti-
mization techniques is efficient utilization of resources that are available. Efficient
utilization becomes very crucial when you are dealing with the large datasets.
. Handling Complex Models and High-Dimensional Data: Machine learning models can
be very complex and have many parameters to optimize. Optimization techniques
provide a way to search the parameter space for a good solution to the problem at
hand, which becomes crucial for deep learning models with millions of parameters.
High-dimensional data can additionally be handled by reducing its dimensions, which
enhances efficiency.
. Overcoming Non-Convexity: Non-convex optimization includes objective func-
tion with presence of multiple local optima. In these cases, optimization tech-
niques are used to find the good solution in non-convex spaces. Techniques like
stochastic gradient descent (SGD), Adam, and conjugate gradient methods, etc.
are used.
. Regularization and Generalization: In order to prevent overfitting and mini-
mize complexity of underlying model, optimization techniques provide us a
concept called as regularization methods. Regularization techniques include L1
and L2 regularization which aids in minimizing model’s sensitivity to noise along
with outliers in underlying data. With creation of simpler models regularization
enhances generalization to unseen data.
. Hyperparameter Tuning: Hyperparameter tuning involves finding optimal values for
the parameters that control the learning process. Techniques such as grid search,
random search, and Bayesian optimization are used for hyperparameter tuning.
Accurate hyperparameter tuning enhances model performance and helps resolve issues
like underfitting or overfitting.
. Optimization Across Different Algorithms: These optimization techniques can be
applied to a wide variety of learning algorithms, including linear regression, neural
networks, support vector machines, decision trees, and more.
4.6 Summary
In order to comprehend and use machine learning algorithms, one must have a solid
background in statistics and mathematics. Key mathematical and statistical ideas that
are pertinent to machine learning are outlined here.
References
1. DasGupta, A. (2011). Probability for statistics and machine learning: Fundamentals and
advanced topics (p. 566). Springer.
2. Brownlee, J. (2018). Basics of linear algebra for machine learning. Machine Learning Mastery
3. Brownlee, J., Cristina, S., & Saeed, M. (2022). Calculus for machine learning. Machine Learning
Mastery
4. Patil, P., Wu, C. S. M., Potika, K., & Orang, M. (2020). Stock market prediction using ensemble of
graph theory, machine learning and deep learning models. In Proceedings of the 3rd international
conference on software engineering and information management (pp. 85–92)
5. Jain, A., Patel, H., Nagalapatti, L., Gupta, N., Mehta, S., Guttula, S., et al. (2020). Overview and
importance of data quality for machine learning tasks. In Proceedings of the 26th ACM SIGKDD
international conference on knowledge discovery and data mining (pp. 3561–3562)
6. Sra, S., Nowozin, S., & Wright, S. J. (Eds.). (2012). Optimization for machine learning. MIT
Press
Chapter 5
Data-Centric AI
to remove any errors or inconsistencies, transforming the data into a more useful
format, and storing the data in a centralized location for easy access by AI systems.
Data acquisition can involve various types of data, including structured data
(such as databases and spreadsheets) and unstructured data (such as text, images,
and videos). The data may be sourced from various internal and external sources,
including sensors, social media platforms, customer feedback, and other data
repositories.
The goal of data acquisition in the data-centric approach is to ensure that the AI
models are trained on a diverse and representative dataset that can produce accu-
rate and meaningful insights. The process requires careful planning, execution, and
ongoing maintenance to ensure that the data remains relevant and up to date.
The data acquisition process refers to the set of activities involved in collecting data
from various sources, such as sensors, databases, or manual input, for the purpose of
analysis or processing. The process typically involves several stages, starting with
planning and design, followed by data collection, cleaning, and storage.
During the planning stage, the data acquisition team defines the goals and objec-
tives of the project and determines the data sources and collection methods that will
be used. This stage also involves identifying any potential risks or issues that may
arise during the data acquisition process. Once the planning stage is complete, data
collection can begin. During this phase, data is gathered from a variety of sources,
including surveys, experiments, and automated sensors. The collected data may be in
different formats and may require cleaning or transformation to ensure consistency
and accuracy.
The next stage involves cleaning and preparing the data for analysis. In this phase,
duplicates are eliminated, mistakes are fixed, and data is transformed into an analysis-
ready format. Once the data is cleaned and prepared, it is typically stored in a database
or other storage system. Finally, the data can be analyzed and processed to extract
insights or information. This stage involves using statistical and analytical techniques
to uncover patterns or relationships within the data. The results of the analysis can
then be used to make decisions, inform policies, or guide further research.
The process of gathering significant volumes of data from many sources, including
social media, sensors, and other digital platforms, is referred to as big data acquisition.
Here are some essential tips for collecting huge data.
Background: A retail company with multiple stores across the country wants to
improve its sales by understanding customer behavior and preferences. The company
wants to analyze its sales data to identify patterns and trends that can help them make
data-driven decisions to improve their sales.
Challenge: The company’s sales data is stored in different formats, including Excel
spreadsheets, text files, and databases, making it difficult to analyze the data. The data
is also located in different systems across the company’s stores and offices, making
it hard to aggregate the data. The company needs to acquire the data and centralize
it in a single repository to facilitate data analysis.
Solution: To address this challenge, the retail company implemented a data
acquisition solution that involved the following steps:
1. Identify Data Sources: The first step was to identify all the data sources that
contained the company’s sales data. The sources included sales databases,
customer databases, inventory management systems, and employee performance
data.
2. Data Extraction: The company used a data extraction tool to extract data from
each source. The tool was configured to extract data in the required format and
structure.
3. Data Cleansing: Once the data was extracted, it was cleaned to remove duplicates,
inconsistencies, and errors. The data was standardized to ensure that it was in a
consistent format and that there were no discrepancies.
4. Data Integration: The cleaned data was then integrated into a centralized data
repository using a data integration tool. The tool was configured to merge the
data from different sources, ensure data consistency, and maintain data quality.
5. Data Validation: The company performed data validation to ensure that the data
was accurate and complete. The validation process involved checking the data
for errors, discrepancies, and inconsistencies.
6. Data Analysis: With the data centralized and validated, the company was able
to analyze its sales data to identify patterns and trends. The analysis provided
insights into customer behavior, product preferences, and sales performance.
Results: The data acquisition solution provided the following benefits to the retail
company:
1. Centralized Data: The solution enabled the company to centralize its sales data,
making it easy to access and analyze.
2. Improved Data Quality: The data cleansing and validation process improved data
quality, reducing errors and inconsistencies.
3. Data-Driven Decisions: The analysis of the sales data enabled the company to
make data-driven decisions to improve sales.
4. Increased Efficiency: The data acquisition solution reduced the time and effort
required to extract, clean, and integrate data, improving operational efficiency.
Conclusion: Implementing a data acquisition solution enabled the retail company
to centralize its sales data, improve data quality, and make data-driven decisions.
The solution increased operational efficiency and provided insights into customer
behavior and preferences, helping the company make better decisions to improve sales.
5.2 Data Labeling
The practice of manually adding one or more descriptive tags or labels to a dataset
is known as data labeling. In a data-centric approach, data labeling is a crucial step
that involves annotating raw data with relevant metadata to help machine learning
models learn from it.
For example, in image recognition, data labeling may involve identifying objects
or people within an image and tagging them with descriptive labels. Similarly, in
Natural Language Processing, data labeling may involve identifying and tagging
specific parts of speech or sentiments within a text.
The quality and accuracy of data labeling can significantly impact the performance
of machine learning models, making it an essential step in the data-centric approach.
A class of methods known as semi-supervised labeling infers the labels of unlabeled
data from a limited amount of labeled data [2, 3], for example by training a model on
the labeled data and using it to generate predictions for the rest. In active learning,
an iterative labeling technique, the most informative unlabeled examples are chosen
for labeling in each iteration [4–6]. Other studies have redefined the labeling process
in settings with limited supervision [7, 8]; for instance, data programming [7] uses
domain-specific heuristics as input to infer labels. Deep learning is largely made
possible by large, labeled datasets, and we anticipate the development of more
effective labeling techniques that use different forms of human participation for a
range of data kinds.
Data labeling can be performed in different ways depending on the nature of the data
and the specific task at hand. However, the general process typically involves the
following steps.
Data Collection: The first step is to collect the raw data that needs to be labeled.
This data could be in various formats, including images, audio recordings, text
documents, or structured data.
Annotation Guidelines: After data collection, the next step is to create annotation
guidelines or a labeling scheme that defines the labels to be used and how they
should be applied to the data. The guidelines ensure consistency and accuracy across
all labeled data.
Labeling Tools: After creating the annotation guidelines, labeling tools are used
to annotate the data. These tools can be customized to the specific data type and
labeling task.
Human Labelers: Data labeling is typically performed by human labelers who are
trained to apply the annotation guidelines correctly. The number of labelers used can
vary depending on the size and complexity of the dataset.
Quality Control: To ensure the accuracy and consistency of the labeled data,
a quality control process is often implemented. This involves randomly sampling
labeled data and checking it for errors or inconsistencies.
Iterative Improvement: As labeled data is reviewed, errors are corrected, and the
annotation guidelines are updated based on feedback, creating a cycle of iterative
improvement. This process helps to improve the quality and accuracy of the labeled
data over time.
Overall, data labeling is a critical step in machine learning, as it provides the
labeled data needed to train models accurately. By using human labelers and iterative
improvement, the data can be accurately and consistently labeled to create high-
quality training data.
There are several approaches to data labeling, and the choice of approach depends on
the specific data type, labeling task, and available resources. Here are some common
approaches:
1. Manual Labeling: This approach involves human labelers manually reviewing
and annotating each data point according to the established annotation guidelines.
Manual labeling can be time-consuming and expensive, but it provides the highest
accuracy and flexibility.
2. Active Learning: In this approach, a machine learning model is used to automatically
label a subset of the data, and human labelers review and correct the labels. The model
is then retrained on the newly labeled data, and the process is repeated. Active
learning can significantly reduce the time and cost of labeling while maintaining high
accuracy (a small sketch of such a loop is given after this section).
3. Semi-supervised Learning: This approach combines manually labeled data with
unlabeled data to train a machine learning model. The labeled data serves as
the first training set for the model, which is subsequently applied to label the
remaining unlabeled data. The process is then repeated using the freshly labeled
data and the training set. Compared to manual labeling alone, semi-supervised
learning may be more cost-effective.
4. Crowdsourcing: This approach involves using many non-expert human labelers
to annotate data. Crowdsourcing platforms such as Amazon Mechanical Turk can
be used to distribute labeling tasks to a large pool of workers. Crowdsourcing
can be cost-effective but may result in lower accuracy due to the variability in
labeling quality.
5. Hybrid Approaches: Different labeling approaches can be combined to create
a hybrid approach that leverages the strengths of each method. For example,
a semi-supervised learning approach can be combined with manual labeling to
improve the accuracy of the final labeled dataset.
Overall, the selection of a labeling strategy is influenced by a number of variables,
such as the kind and complexity of the data, the size of the dataset, the required
accuracy, and the resources that are available.
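To make the semi-supervised and active-learning ideas above concrete, here is a minimal, hedged sketch using scikit-learn's SelfTrainingClassifier on synthetic data; the dataset, base model, and confidence threshold are illustrative choices, not a prescription.

```python
# Semi-supervised labeling sketch: a model trained on a few labeled points
# propagates labels to the unlabeled pool (marked with -1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pretend only ~10% of the data is labeled; the rest gets the "unlabeled" marker -1.
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled = rng.random(len(y)) > 0.10
y_partial[unlabeled] = -1

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

# Inferred labels for the formerly unlabeled examples, checked against the held-back truth.
inferred = model.predict(X[unlabeled])
print("Agreement with held-back ground truth:", (inferred == y[unlabeled]).mean())
```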
Some practical tips for improving label quality are:
• Make the labels consistent.
• Use multiple labelers to spot inconsistencies.
• Clarify the labeling instructions by tracking down ambiguous examples.
• Toss out noisy examples; more data is not always better.
• To improve, use error analysis focused on a subset of the data.
These tips are generally more applicable to unstructured data applications such as images, text, and audio.
Recommendation 1: Make the labels Y consistent.
It turns out that when developing a learning algorithm, especially on a small dataset, the ideal situation is one in which there is a deterministic, non-random function that maps the inputs x to the outputs y, and the labels are consistent with this function. This ideal is less realistic when the labels are generated by humans, so let us take the example shown in Fig. 5.1. Consider evaluating manufactured pills while working for a pharmaceutical company; the labeling instructions might ask, "When is a pill scratched or defective?"
If we plot the length of a scratch against the label (defective or not), the labels are not really consistent, as shown in Fig. 5.2: they go up and down, and some longer scratches are labeled non-defective while shorter ones are labeled defective.
If we sort the images in increasing order of scratch length, we can look at the labels and choose a single length below which a scratch does not count as a defect.
Here, we decide that any scratch longer than two and a half millimeters is defective and edit the dataset accordingly, so the labels become much more consistent, as shown in Fig. 5.3.
In some cases, if we have an inherently noisy dataset, the error decreases on the order of O(1/√m), where m is the training set size. If we instead acquire consistent labels, so that there is a clean, learnable, consistent function (here just a simple threshold on scratch length), the error can decline on the order of O(1/m), and 1/m goes down much more quickly than 1/√m.
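As a toy illustration of Recommendation 1 (not code from the book), the snippet below applies a single agreed threshold to a small, made-up table of scratch lengths so that the label becomes a deterministic function of the input; the column names and the 2.5 mm cutoff simply mirror the pill example above.

```python
# Illustrative only: re-label the (hypothetical) pill-scratch data with one
# length threshold so that y becomes a consistent function of x.
import pandas as pd

df = pd.DataFrame({
    "scratch_length_mm": [0.4, 1.1, 1.8, 2.6, 3.0, 4.2],
    "defective":         [0,   1,   0,   1,   0,   1],   # noisy, inconsistent labels
})

THRESHOLD_MM = 2.5  # threshold agreed with the labeling team after reviewing examples
df["defective_consistent"] = (df["scratch_length_mm"] > THRESHOLD_MM).astype(int)
print(df)
```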
Recommendation 2: Use multiple labelers to spot inconsistencies.
If the labels y are inconsistent, use multiple labelers to spot the inconsistencies. Here is an example: suppose labeler one says a pill has a chip on it, while labeler two says it has a scratch. We may not know who is right, but when we see cases like these, any consistent standard is probably better than an inconsistent one. So decide whether defects like this count as a chip or a scratch, make the best decision you can, and reduce the inconsistency by using multiple labelers.
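A minimal sketch of how multiple labelers can be used to spot inconsistencies follows; the annotator columns and label values are invented for illustration. Rows where annotators disagree are flagged for review, and a majority vote gives one possible consistent standard.

```python
# Flag disagreements between annotators and resolve them by majority vote.
import pandas as pd

labels = pd.DataFrame({
    "labeler_1": ["scratch", "chip",    "scratch", "none"],
    "labeler_2": ["scratch", "scratch", "scratch", "none"],
    "labeler_3": ["chip",    "scratch", "scratch", "none"],
})

majority = labels.mode(axis=1)[0]          # most frequent label per example
disagreement = labels.nunique(axis=1) > 1  # examples where annotators differ
print(pd.DataFrame({"majority": majority, "needs_review": disagreement}))
```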
Data labeling is the process of assigning meaningful and relevant tags or labels to
data to make it more usable and understandable for machine learning algorithms.
It is an important aspect of machine learning as it plays a key role in ensuring the
accuracy and efficiency of models.
The following are some of the main reasons why data labeling is crucial:
1. Improved accuracy: Data labeling helps to ensure that machine learning models
are accurately trained. When data is properly labeled, it provides a clear under-
standing of the features and characteristics of the data. This enables algorithms
to make more accurate predictions and classifications.
2. Enhanced efficiency: Data labeling also improves the efficiency of machine
learning algorithms. Labeled data allows algorithms to quickly process and
analyze large amounts of data, leading to faster and more efficient decision-
making.
3. Better quality data: When data is labeled, it is easier to identify and remove
irrelevant or incorrect data. This leads to higher quality data that is more relevant
and useful for machine learning algorithms.
4. Increased productivity: Data labeling also increases productivity by reducing the
time and effort required to manually sort through and label data. This frees up
valuable resources, allowing organizations to focus on other important tasks.
5. More personalized experiences: Data labeling enables machine learning algo-
rithms to make more personalized recommendations and predictions. This can
lead to improved customer experiences and increased engagement.
Overall, data labeling is a crucial step in the machine learning process that helps
to ensure accuracy, efficiency, and high-quality data.
Data annotation refers to the process of labeling or tagging data with metadata, which
makes it easier to analyze and use for machine learning and other data-centric tasks.
Data annotation is a crucial part of many machine learning applications because it
helps algorithms to recognize patterns and relationships within large datasets.
The process of data annotation typically involves human annotators who manually
review and label each piece of data with relevant tags, such as categories, keywords,
or descriptive labels. This labeling process can be time-consuming and costly, but it
is essential for building accurate machine learning models and improving the quality
of data-driven insights.
There are several types of data annotation techniques used in data-centric applica-
tions; some of the most common ones are:
1. Image Annotation: This type of data annotation involves identifying and labeling
objects within an image, such as classifying them into different categories,
detecting their location, and outlining their boundaries. Image annotation is often
used in applications such as autonomous vehicles, object recognition, and facial
recognition as shown in Fig. 5.4.
2. Text Annotation: This type of data annotation as shown in Fig. 5.5 involves
tagging or labeling text data with relevant metadata such as keywords, topics, or
named entities. Text annotation is used in applications such as sentiment analysis,
text classification, and chatbots.
Cleaning data is one of the steps of building a data-centric model and is essential for high-accuracy models. However, if cleaning decreases the representativeness of the data, the model cannot make accurate predictions on inputs from the real world. By producing variations that the model could encounter in the real world, data augmentation approaches can help machine learning models become more robust.
A brief illustration makes things clearer. Consider that we are teaching a model to recognize birds. The picture of a bird on the left in Fig. 5.6 is obtained from our initial dataset. On the right are three variations of the original image that our model should still interpret as depicting a bird. The first two are easy to understand: whether a bird is flying upward or downward, east or west, it is still a bird. The third shows a bird whose head and body have been artificially obscured; this image prompts our model to pay attention to feathered wings as a characteristic of birds.
Text augmentation can be used to generate new text samples by replacing words with synonyms, adding noise or spelling errors, or shuffling the order of words in a sentence.
Data augmentation can be performed online, during training, by randomly applying transformations to each batch of data, or offline, by generating and storing augmented copies of the dataset before training.
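As an illustrative sketch of online augmentation (assuming a PyTorch/torchvision setup, which is one of several possible toolkits), the pipeline below produces flipped, rotated, color-jittered, and partially erased variants of each training image, similar to the bird variations discussed above.

```python
# A minimal torchvision augmentation pipeline applied online during training.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),  # occludes a random patch, like the obscured bird
])

# Used online: each time an image is drawn during training it is transformed
# differently, e.g. augmented = augment(pil_image).
```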
In a data-centric context, data deployment refers to the process of making data acces-
sible and available to users or systems that need it. This involves various steps, such
as selecting the appropriate data storage and management systems, designing the
data architecture, and establishing secure and efficient data access mechanisms.
Data deployment typically involves the following steps:
Data storage: Data deployment begins with selecting the appropriate data storage
systems, such as databases, data warehouses, or data lakes. The choice of storage
system depends on the type and volume of data, the desired level of data processing,
and the budget available.
Data architecture: After selecting the data storage systems, the next step is to
design the data architecture. This involves defining the data structure, including
tables, fields, and relationships, and deciding how data will be organized, classified,
and accessed.
Data integration: Data integration involves combining data from different sources
and formats and transforming it into a unified format. This may involve data cleaning,
data normalization, and data enrichment.
Data access: Once the data is stored and integrated, the next step is to estab-
lish secure and efficient data access mechanisms. This may involve setting up user
accounts and permissions, establishing API endpoints, or creating data dashboards
and visualizations.
Data governance: Finally, data deployment involves establishing data governance
policies and procedures to ensure that data is secure, accurate, and compliant with
regulatory requirements.
Overall, data deployment is a critical step in making data available and accessible
to users and systems, enabling organizations to leverage their data assets to make
better-informed decisions and drive business success.
sales databases. The company used various tools and scripts to extract the data,
including Apache Spark, AWS Glue, and AWS Data Pipeline.
2. Data Transformation: Once the data was extracted, it needed to be transformed
into a usable format for analysis. The company used various tools and scripts
to transform the data, including AWS Glue, AWS Lambda, and Apache Spark.
They also created data pipelines to automate the transformation process.
3. Data Loading: The transformed data was then loaded into Amazon Redshift. The
company used various tools and scripts to load the data, including AWS Glue,
AWS Data Pipeline, and Amazon Redshift’s COPY command.
4. Data Analysis: With the data now in Amazon Redshift, the company was able
to perform various types of data analysis, including data mining, predictive
modeling, and machine learning. They used various tools and libraries, including
Python, SQL, and Amazon Machine Learning.
5. Visualization and Reporting: To share the insights gained from the data analysis,
the company used various tools to create visualizations and reports, including
Amazon QuickSight, Tableau, and Power BI. These reports were shared with
various teams within the company, including marketing, sales, and product
development.
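The exact scripts used by the company are not reproduced here; the following is only a rough PySpark sketch of what the transformation and staging steps (2 and 3) might look like, with placeholder bucket paths and column names.

```python
# Hedged ETL sketch: read raw extracted data, apply basic transformations,
# and write a cleaned copy to a staging area for loading into the warehouse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/orders/")      # extracted data (placeholder path)
clean = (raw
         .dropDuplicates(["order_id"])
         .withColumn("order_date", F.to_date("order_timestamp"))
         .filter(F.col("amount") > 0))                         # basic transformation rules

# Write to a staging location; a Redshift COPY (or a Glue job) would then load it.
clean.write.mode("overwrite").parquet("s3://example-bucket/staging/orders/")
```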
The data deployment process has helped the company in various ways. For
example, they were able to identify which products were most likely to lead to
repeat purchases, which helped them optimize their marketing campaigns. They
were also able to identify which products were most likely to lead to customer
churn, which helped them improve their product offerings. Overall, the data deploy-
ment process has helped Company X improve its business operations and gain a
competitive advantage in the e-commerce industry.
Data-centric AI tools are software applications that are designed to help users work
with and analyze large volumes of data, with the goal of discovering insights, making
predictions, or improving decision-making. These tools use artificial intelligence (AI)
and machine learning (ML) techniques to automate data analysis, reduce manual data
processing tasks, and improve the accuracy and efficiency of data-driven tasks.
Some examples of data-centric AI tools include:
1. Data visualization tools: These tools help users visualize data and identify
patterns and trends. They allow users to create charts, graphs, and other visual
representations of data, making it easier to understand and communicate insights.
2. Predictive analytics tools: These technologies analyze previous data and forecast
future results using machine learning algorithms. They can be used for a range
of applications, from forecasting sales revenue to predicting equipment failures.
3. Natural Language Processing (NLP) tools: These tools use AI and ML to analyze
and understand human language. They can be used to analyze customer feedback,
automate customer service interactions, or identify trends in social media.
4. Recommendation engines: These tools use data analysis to make personal-
ized recommendations to users, such as product recommendations or content
recommendations on a website.
5. Data preparation and cleaning tools: These tools help users prepare and clean data
for analysis. They can automate tasks such as data normalization, data cleaning,
and data transformation, making it easier to work with data and reducing the risk
of errors.
6. Data integration tools: These tools help users combine data from different sources
into a unified format. They can be used to integrate data from databases, data
warehouses, or other data sources.
Overall, data-centric AI tools are designed to help users work more effectively with
data, enabling them to discover insights, make predictions, and improve decision-
making.
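As a small, hedged example of the "data preparation and cleaning" category above, the following scikit-learn pipeline imputes missing values and normalizes numeric features; the toy array stands in for real data.

```python
# A data preparation and cleaning step as a reusable scikit-learn pipeline.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # normalize features
])

X_raw = np.array([[1.0, 200.0], [np.nan, 180.0], [3.0, np.nan]])
X_clean = prep.fit_transform(X_raw)
print(X_clean)
```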
There are several data-centric AI tools available in the market, some of which are:
1. TensorFlow: An open-source platform developed by Google, TensorFlow is
widely used for machine learning and deep learning applications.
2. Keras: A high-level neural networks API, Keras is built on top of TensorFlow.
It provides an easy-to-use interface to build and train deep learning models.
3. PyTorch: Developed by Facebook, PyTorch is an open-source machine learning
library used for building and training deep learning models.
4. Scikit-learn: A popular Python library for machine learning, Scikit-learn
provides tools for data preprocessing, feature selection, and model evaluation.
5. Apache Spark: A distributed computing platform, Apache Spark is used for big
data processing and machine learning applications.
6. Hadoop: An open-source distributed computing platform, Hadoop is used for
storing and processing large datasets.
7. IBM Watson: A cloud-based AI platform, IBM Watson provides tools for
building and deploying AI models.
8. Amazon SageMaker: A cloud-based platform, Amazon SageMaker provides
tools for building, training, and deploying machine learning models.
9. Microsoft Azure Machine Learning: A cloud-based platform, Microsoft Azure
Machine Learning provides tools for building, deploying, and managing
machine learning models.
10. Google Cloud AI Platform: A cloud-based platform, Google Cloud AI Platform
provides tools for building, training, and deploying machine learning models.
11. Pandas: An open-source library for data analysis and manipulation that offers
data structures for efficiently storing and processing large datasets.
12. NumPy: An open-source Python library for numerical computing that supports
large, multi-dimensional arrays and matrices.
13. Apache Flink: An open-source framework for streaming data processing that
supports both batch processing and real-time data streaming.
14. Apache Kafka: A distributed streaming platform that is open source and can be
used to create streaming apps and real-time data pipelines.
5.7 Summary
Data-centric AI involves a series of steps to develop and deploy AI models that are
data-driven and effective. Overall, data-centric AI is an iterative and ongoing process
that requires careful attention to each step to ensure the development of high-quality,
reliable AI models that deliver accurate and actionable insights.
References
1. Bogatu, A., Fernandes, A. A., Paton, N. W., & Konstantinou, N. (2020). Dataset discovery in
data lakes. In ICDE
2. Xu, Y., Ding, J., Zhang, L., & Zhou, S. (2021). Dp-ssl: Towards robust semi-supervised learning
with a few labeled samples. In NeurIPS
3. Karamanolakis, G., Mukherjee, S., Zheng, G., & Hassan, A. (2021). Self-training with weak
supervision. In NAACL
4. Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., & Wang, X. (2021).
A survey of deep active learning. ACM Computing Surveys, 54, 1–40.
5. Zha, D., Lai, K.-H., Wan, M., & Hu, X. (2020). Metaaad: Active anomaly detection with deep
reinforcement learning. In ICDM
6. Dong, J., Zhang, Q., Huang, X., Tan, Q., Zha, D., & Zihao, Z. (2023). Active ensemble learning
for knowledge graph error detection. In WSDM
7. Ratner, J., De Sa, C. M., Wu, S., Selsam, D., & Re, C. (2016). Data programming: Creating
large training sets, quickly. In NeurIPS
8. Zha, D., & Li, C. (2019). Multi-label dataless text classification with topic modeling. Knowledge
and Information Systems, 61, 137–160.
9. Kharate, N. G., & Patil, V. H. (2019). Challenges in rule based machine translation from
Marathi to English. In Proceedings of the 5th international conference on advances in computer
science and information technology (ACSTY-2019) (pp. 45–54). https://doi.org/10.5121/csit.2019.91005
10. Woody, A. (2013). A data-centric approach to securing the enterprise. Packt Publishing
Chapter 6
Data-Centric AI in Healthcare
6.1 Overview
The predominant paradigm for AI development over the last few decades has been a
model- or software-centric approach, in which building a machine learning system
requires writing code to implement algorithms, models, as well as taking that code
and training it on data. In the last few years, there has been tremendous progress
in neural networks and other algorithms, and the code is essentially a good open
source issue that you can download from GitHub today for numerous applications.
Historically, most of us know how to download the dataset, hold the dataset as fixed,
and then modify the software’s code to get to do well on the data. Therefore, it is
not always more beneficial to use a data-centric strategy where we may even hold
the code patch; instead, focus on collecting or creating the correct data to feed the
learning algorithm.
A data-centric approach is one that focuses on the importance of data in decision-
making and problem-solving. In various industries, including healthcare, data is
generated at an unprecedented rate, and utilizing this data effectively can lead to
better outcomes. A data-centric approach prioritizes data management and analysis,
with the goal of leveraging insights from data to inform decision-making.
A campaign started by Ng et al. [1] that promotes a machine learning strategy that
is more data centric than model centric, or a fundamental shift from model creation
to data quality and dependability, is largely responsible for the growth of DCAI.
As a result, in order to seek data excellence, academics’ and practitioners’
focus has steadily switched to data-centric AI. Artificial Intelligence (AI) and
machine learning (ML) techniques are used in the healthcare industry to evaluate
and make sense of the enormous volumes of data created every day. This informa-
tion includes data from wearable devices, genomics, medical imaging, and electronic health records (EHRs).
The goal of these AI trends in healthcare is to improve patient outcomes, reduce
costs, and increase efficiency by using data to inform clinical decision-making, iden-
tify patterns and trends, and develop predictive models. For example, AI algorithms
can help identify patients at high risk for certain diseases, predict treatment outcomes,
and identify opportunities for personalized medicine.
However, privacy, security, and bias issues are also problems with data-centric AI
in the healthcare industry. It is crucial to guarantee the security of patient data and
the accountability and transparency of AI algorithms. Additionally, biases in the data
can lead to biased algorithms, which can perpetuate health disparities. Overall, data-
centric AI has the potential to revolutionize healthcare by unlocking new insights
and improving patient care. However, it is important to approach these technologies
with caution and to prioritize patient privacy and equity.
In order to generate insights and enhance patient care, clinical research, and health-
care operations, data-centric AI in healthcare refers to a method that places a strong
priority on the gathering, curation, and analysis of high-quality healthcare data. How
data-centric AI is used in healthcare is as follows:
. Data Collection and Integration
Data-Centric AI in healthcare starts with the collection and integration of diverse
healthcare data from various sources such as electronic health records (EHRs),
medical imaging, wearable devices, genomics, and patient-reported data. This
involves developing robust data infrastructure and interoperability standards to ensure
data can be effectively integrated and accessed.
. Data Preprocessing and Cleaning
Healthcare data often contains missing values, errors, inconsistencies, and noise.
Data preprocessing techniques are applied to clean and transform the data into a
usable format. Preprocessing steps may include handling missing data, normalizing
values, resolving data inconsistencies, and anonymizing sensitive information to
protect patient privacy.
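A brief, illustrative sketch of such preprocessing on a toy EHR-style table follows; the column names, the salted-hash pseudonymization, and the imputation choices are assumptions for the example, not a compliance recipe.

```python
# Toy EHR-style preprocessing: handle missing values, normalize a lab value,
# and pseudonymize the patient identifier.
import hashlib
import pandas as pd

ehr = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "glucose_mg_dl": [95.0, None, 142.0],
    "smoker": ["no", "yes", None],
})

ehr["glucose_mg_dl"] = ehr["glucose_mg_dl"].fillna(ehr["glucose_mg_dl"].median())
ehr["smoker"] = ehr["smoker"].fillna("unknown")
ehr["glucose_z"] = (ehr["glucose_mg_dl"] - ehr["glucose_mg_dl"].mean()) / ehr["glucose_mg_dl"].std()

# Replace the raw identifier with a salted hash (one simple pseudonymization step).
SALT = "replace-with-a-secret-salt"
ehr["patient_id"] = ehr["patient_id"].apply(
    lambda pid: hashlib.sha256((SALT + pid).encode()).hexdigest()[:12])
print(ehr)
```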
. Data Governance and Security
Given the sensitivity of healthcare data, data-centric AI in healthcare places an
emphasis on sound data governance procedures and complies with privacy laws like
Health Insurance Portability and Accountability Act (HIPAA) to safeguard patient
privacy and preserve data security.
. Data Analysis and Insights
Advanced analytics and machine learning techniques are applied to the curated
healthcare data to extract meaningful insights and patterns.
Data-centric AI models can help identify disease patterns, predict patient
outcomes, personalize treatment plans, and support clinical decision-making.
These models can also be used to discover new biomarkers, identify population
health trends, and support public health initiatives.
. Clinical Decision Support Systems (CDSS):
Data-centric AI is used to develop and deploy clinical decision support systems
that provide healthcare professionals with real-time insights and recommendations.
6.2 Need and Challenges of Data-Centric Approach
In the past, AI was frequently thought of as a subject that was model centric and
focused on improving model designs with respect to specified datasets. The excessive
dependence on predefined datasets, however, neglects the scope, complexity, and
fidelity of the data to the underlying issue, which may result in worse model behavior
in real-world applications [2]. Clinicians, clinical researchers, and scientists make
judgements in the healthcare field based on data. Excellent data supports excellent
judgements, whereas bad data encourages bad decisions.
Since the models are so highly specialized and adapted to certain situations, it is
sometimes challenging to apply them to different challenges. Underestimating data
quality might also result in data cascades [3], which could have detrimental impacts
including lower accuracy and enduring biases [4]. This can seriously limit the use of
AI systems, especially in high-stakes fields.
There are several reasons why a data-centric approach is necessary in healthcare:
1. Data is central to healthcare decision-making: In healthcare, decisions are made
based on data, whether it is patient history, laboratory results, or imaging data. A
data-centric approach ensures that all relevant data is considered when making
decisions, leading to better outcomes for patients.
2. Improved efficiency and accuracy: With the large amount of data generated in
healthcare, it can be difficult for clinicians to manually analyze it all. By using
data-centric AI tools, clinicians can quickly and accurately analyze large amounts
of data, leading to faster diagnosis and treatment decisions.
3. Personalized medicine: Data-centric AI can help identify patterns and trends in
large datasets, allowing clinicians to develop personalized treatment plans for
patients based on their unique medical history, genetics, and lifestyle factors.
4. Research and development: Data-centric AI can also aid in the development of
new treatments and medications by helping researchers identify new targets and
potential drug candidates.
Overall, a data-centric approach in healthcare is essential to improve patient
outcomes, increase efficiency, and advance medical research. Healthcare professionals can make better decisions and give patients better treatment by utilizing the potential of AI and machine learning to evaluate massive amounts of information.
Challenges
The process of gathering data is quite difficult and demands careful planning. Tech-
nically speaking, datasets are frequently heterogeneous and poorly matched with one
another, making it difficult to quantify their relatedness or properly integrate them.
It might also be challenging to successfully synthesize data from the current dataset
because it significantly depends on subject expertise [5]. Additionally, several crucial problems that arise during data collection cannot be handled exclusively from a technological standpoint. For instance, in many real-world scenarios, we might not be able to find a publicly accessible dataset that matches our needs; thus, we still need to gather data from scratch. However, for logistical, ethical, or legal reasons, it may be challenging to access some data sources. There are also ethical issues when gathering new data, particularly in relation to informed consent, data privacy, and data security. Researchers and practitioners must take these difficulties into consideration when planning and carrying out data collection. Data leakage, reporting
only average loss, and incorrect labels are a few of the common mistakes people
make when evaluating models with data. Other common mistakes include validating
data that is not representative of the deployment environment and failing to use truly
held out data.
While there are many potential benefits to a data-centric approach in healthcare,
there are also several challenges that need to be addressed. These challenges include:
1. Privacy and security: To ensure patient privacy, extremely sensitive patient data
must be secured. A data-centric approach requires strong security measures to
ensure that patient data is not compromised.
2. Data quality and accuracy: Data quality and accuracy are critical to the success
of data-centric AI. If the data is incomplete, inaccurate, or biased, it can lead to
flawed algorithms and inaccurate predictions.
3. Bias: Data-centric AI algorithms can perpetuate prejudices and cause health
disparities if they are trained on biased data since they are only as good as the
data they are provided.
6.4 Application Implementation in Model-Centric Approach
Fig. 6.1 Screenshot of the source code of data-centric model with preprocessing of data and model
training on cancer dataset
Logistic Regression
In the study of cancer data, logistic regression is a well-liked model for binary
classification problems. Based on the input characteristics, it calculates the likelihood
that a given instance belongs to a specific class (such as cancer vs. non-cancer). It
presupposes a linear relationship between the characteristics and the target class’s
log-odds.
Support Vector Machine (SVM)
SVM is a flexible model that may be applied to both regression and classification
applications. To divide instances of various classes in a high-dimensional space,
it generates a hyperplane or collection of hyperplanes. To increase generalization,
SVM seeks to maximize the margin between the hyperplane and the examples that
are closest to it.
Random Forest
An ensemble model called random forest combines many decision trees to produce predictions. To create the result, it builds a forest of decision trees and averages each
tree’s predictions. High-dimensional data is easily handled by random forest, and it
can capture intricate feature–feature relationships.
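The source code in Fig. 6.1 is shown only as a screenshot, so the snippet below is a hedged re-creation of this kind of model-centric experiment, using scikit-learn's built-in breast-cancer dataset rather than the book's exact dataset and code.

```python
# Train the three models discussed above on a standard cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "svm": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    print(name, "test accuracy:", round(model.score(X_test_s, y_test), 3))
```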
6.5 Comparison of Model-Centric AI and Data-Centric AI
A learning classifier can learn total randomness: if you give it garbage data with completely random labels, it can learn to map images or text to completely arbitrary labels. In other words, if we give the model really bad data, it will faithfully reproduce whatever it learns, even if the data is completely wrong. Traditional machine learning is very model centric; if you have good machine learning models that work on highly curated data, but the real-world data is actually messy, then it makes sense to focus on fixing the issues in the data. Indeed, the most cited and most used test sets in the field of machine learning all contain wrong labels. Data-centric AI often takes one of two forms. In the first form, AI algorithms understand something about the data and then use this new information to help a model train better. For instance, in a classroom, if you are studying addition, should your teacher start with 10,051 plus 1042, or would they start with 1 plus 2? Should your teacher give you incredibly difficult examples as the very first instances?
Fig. 6.2 Screenshot of the source code of model-centric training with a random forest model on a noisy cancer dataset
We know the answer because we have all learned addition, but a machine learning model starts from scratch, so it does not know from the beginning which examples are easiest. Some data-centric AI approaches therefore estimate which examples are easiest and, when training the model, start with those and then present progressively harder ones, like a curriculum. The second common form of data-centric AI is to modify the dataset directly to improve the performance of a model. Which instructor would you learn better from: one who teaches you the wrong thing 30% of the time, or one who tells you the right thing 100% of the time? The idea is to find those 30% of wrong examples and remove them, so that the model can, in effect, redo its learning as if those errors had never happened. Model-centric AI, in contrast, is the classical way of thinking: given a dataset, find the best model, and change the model to improve performance on the AI task.
In data-centric AI, by contrast, you are given some model, which may be fixed or which you may change, and you improve that model by improving your dataset. Both approaches can be effective in cancer detection, and the choice between them will depend on the specific requirements of the task and the available data. A data-centric approach can be particularly useful when the data is noisy or contains biases, while a model-centric approach may be more suitable when the focus is on achieving the highest possible performance on a well-defined task.
There are several reasons why a model gets a particular prediction wrong.
1. The given label is incorrect. The recommended action is to correct the label.
2. The example does not belong to any of the K classes, or it is fundamentally unpredictable (for example, a blurry image). The recommended actions are to toss the example from the dataset, or to consider adding another class if there are many such examples.
3. The example is an outlier. The suggested actions are to toss the example if similar examples would never be seen in deployment; otherwise, collect additional training data that looks similar if possible. A second alternative is to apply a data transformation that makes the outlier's features more like other examples', or to upweight it or duplicate it multiple times.
4. The type of model being used is suboptimal for such examples. The remedy is to retrain the model, or to upweight similar examples or duplicate them many times in the dataset.
5. The dataset has other examples with identical features but different labels. In that case, define the classes more distinctly or measure extra features to enrich the data.
These are ways to boost the performance of a model using a data-centric approach.
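One simple, illustrative way to act on reason 1 (incorrect given labels) is to flag examples whose given label receives low cross-validated predicted probability; the model, dataset, and cutoff of ten examples below are arbitrary choices for the sketch.

```python
# Flag likely label errors: examples where the model assigns low probability
# to the label they were given are candidates for re-labeling or removal.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
proba = cross_val_predict(LogisticRegression(max_iter=5000), X, y,
                          cv=5, method="predict_proba")

# Probability the model assigns to each example's *given* label.
given_label_proba = proba[np.arange(len(y)), y]
suspect = np.argsort(given_label_proba)[:10]   # ten most suspicious examples
print("Indices to review for possible label errors:", suspect)
```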
6.6 Summary
Currently, many AI applications are model centric; one possible reason behind this is that the AI sector pays careful attention to academic research on models, since it is difficult to create large datasets that can become generally recognized standards. As a result, much of the AI community believes that model-centric machine learning is more promising. In today's machine learning, data is crucial, yet it is often overlooked and mishandled in AI initiatives. As a result, hundreds of hours are wasted fine-tuning a model based on faulty data; that can very well be the fundamental cause of a model's lower accuracy, and it has nothing to do with model optimization.
The data-centric approach allows for continuous improvement in healthcare systems.
The data-centric approach plays a crucial role in the healthcare domain by offering
several benefits and advancements. Data-centric approach in the healthcare domain
leads to improved decision-making, personalized medicine, early detection, remote
monitoring, research advancements, population health management, and continuous
improvement. By harnessing the power of data, healthcare providers can deliver more
effective, precise, and patient-centered care.
References
1. Ng, A., Laird, D. & He, L. (2021). Data-centric AI competition, DeepLearning.AI. Available
online: https://github.com/Nov05/deeplearningai-data-centric-competition. Accessed on Dec 9,
2021.
2. Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W. G., Diamos, S., Diamos, G., He,
L., Kiela, D., Jurado, D., et al. (2022). Dataperf: benchmarks for data-centric ai development.
arXiv preprint arXiv:2207.10062
3. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021).
Everyone wants to do the model work, not the data work: Data cascades in high-stakes ai. In
CHI.
4. Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in
commercial gender classification. In FAccT
5. Aroyo, L., Lease, M., Paritosh, P., & Schaekermann, M. (2022). Data excellence for ai: Why
should you care? Interactions, 29(2), 66–69.
6. Prabadevi, B., Deepa, N., Krithika, L. B. & Vinod, V. (2020). Analysis of machine learning
algorithms on cancer dataset. In: 2020 International conference on emerging trends in infor-
mation technology and engineering (ic-ETITE) (pp. 1–10). https://doi.org/10.1109/ic-ETITE47903.2020.36.
7. Sarker, I. H. (2021). Machine learning: algorithms, real-world applications and research
directions. SN Computer Science, 2, 160.
8. Khadse, V., Mahalle, P. N. & Biraris, S. V. (2018). An empirical comparison of supervised
machine learning algorithms for internet of things data. In: 2018 fourth international conference
on computing communication control and automation (ICCUBEA) (pp. 1–6). https://doi.org/10.1109/ICCUBEA.2018.8697476.
Chapter 7
Data-Centric AI in Mechanical
Engineering
7.1 Overview
data assessment for quality. It also addresses issues such as data redundancy,
inconsistency, and noise detection.
. Feature Engineering: Feature engineering includes selection of appropriate
features (input features) from underlying data that can represent given problem.
This step needs a domain expert to creatively extract meaningful information from
underlying data.
. Model Training and Evaluation: With the help of collected and preprocessed data
in data-centric AI, training machine learning models becomes easy. Assessment
of models is done through appropriate metrics. Enhancements to the models are made on an as-needed basis.
. Continuous Learning: Data-centric AI identifies the need to learn from raw data
that is essential for enhancement of AI models. This task includes model perfor-
mance monitoring, raw data collection, and maintaining the models to easily
update new information to the model.
. Privacy and Ethics: Responsible and ethical use of data-centric AI is one of the
major goals for us. While collecting data, privacy concerns and ethical consid-
erations should be ensured. These include two tasks, namely data anonymity
consideration and sensitive information protection.
With the help of a data-centric approach, mechanical systems can be made more accurate, robust, and efficient. It helps organizations in the decision-making process based on patterns and insights discovered in the data, which makes an application more reliable and effective across multiple disciplines.
High-quality data is crucial for success of any AI model and application. Here are
the key needs and challenges associated with the data-centric approach [3–5]:
Needs of Data-Centric Approach
. Accuracy and Reliability: To generate accurate and reliable predictions and decisions, AI models must be provided with high-quality data. A data-centric AI approach makes sure that the data used for training the AI model is clean, unbiased, and able to produce solid outcomes. In mechanical engineering, it becomes crucial to have an accurate
outcome in order to build the next part of the system. If one part of the system
fails, then the entire system can experience a domino effect.
. Insights and Discoveries: Data-centric AI is responsible for a two-step process,
namely extraction of meaningful information from raw data and performing anal-
ysis to discover more insights from big datasets. For decision-making process,
it is vital to maintain the data quality and understand the underlying data. It can
be helpful in recognizing patterns and correlations in the data. In mechanical engineering, data collection from sensors helps with data acquisition, but analyzing that collected data to obtain insight is even more important.
. Adaptability and Flexibility: In real-world scenario, the environments are contin-
uously changing and data-centric AI understands and recognizes this need. Data-
centric AI also helps in keeping up with these changes in data, and the models
are evolved with the help of new data. Since the models are trained on newly updated data, they can stay relevant for a longer period of time. In mechanical engineering, change can occur at any stage; hence, the developed model should be able to adjust to these changes.
. Decision Support: For making an informed decision, data-centric AI provides
evidence-based insights to the decision-makers. The accurate and effective anal-
ysis of underlying data aids in decision-making process, optimization of systems,
etc. Mechanical systems can have a solid support system with data-centric
approach.
Challenges of Data-Centric Approach
. Data Quality and Availability
Ensuring data quality and availability is one of the major challenges faced by data-centric AI. Data can be dirty, contain missing values, errors, or biases, or be scattered across various platforms. Data cleaning, integration, and curation require a lot of effort. Data collection for mechanical systems is critical, and one common way to collect it is through sensors.
. Data Privacy and Security
As data becomes central to everything, privacy and security concerns are becoming more prominent. It is vital to protect sensitive and personal information at all costs; if this data is breached, the consequences can be catastrophic. Hence, data privacy and security remain a significant challenge for data-centric AI. Mechanical systems handle critical data and heavy machinery, so it is imperative to protect sensitive information about these systems.
. Scalability and Infrastructure
As data is becoming core of everything, handling this huge amount of data is a great
challenge for data-centric AI. There is a need to create robust and scalable system
that will efficiently handle data storage, processing, and analysis part of this big data.
. Data Governance and Ethics
Data-centric AI gives rise to vital questions such as data governance, data ownership,
and ethics. There is a need to form clear and detailed policies and frameworks for secure handling of data, removing biases, and so on. The ethical issues surrounding data collection and its usage should be handled in the right way. The sensitive information about
mechanical systems should be protected at all times. This information, if leaked, can
create a catastrophe.
. Interdisciplinary Collaboration
Data-centric AI requires collaboration among domain experts, data scientists, and IT professionals. Effective communication across all disciplines will help
in proper insights and requirements identification. This collaboration is vital for
data-centric AI implementation.
Combination of technical expertise, strong data management practices, ethical
considerations, and a commitment to continuous improvement will be able to address
these challenges. Mechanical and computer engineers together can create systems
that are more efficient and reliable. Also, with the help of data-centric approach, an
organization can unlock significant opportunities using high-quality data.
. Application development.
. Deployment and Testing: When an application is deployed in production envi-
ronment, it gives users of the application an easy access. A detailed and thorough
testing of application should be conducted in order to make sure that all the func-
tions are executing as per the requirements. Application should be able to handle
wide range of scenarios and provide reliable outcome.
. Monitoring and Performance Optimization: Consistently monitor and assess the application's performance and gather raw data for training the model. This will help
in making more efficient enhancements. Take the user feedback to recognize and
address the limitations and issues the application is facing, which will make the
application more efficient and effective.
. Iteration and Enhancement: Collect feedback from users so that it can be included in the application's improvement cycle. Consistent updates will enhance the application's performance while maintaining the main objectives.
. Governance and Compliance: Make sure that data governance policies, privacy
regulations, and ethical considerations are followed throughout the application’s
development and implementation. Data protection rules have to be followed in
order to protect the sensitive information along with secure data access.
Model-centric AI in mechanical engineering uses AI techniques to create and
utilize models that can accurately represent the mechanical systems. These models
can be used in situations such as simulation, optimization, control, and decision-
making.
Here are some examples of model-centric AI in mechanical engineering [7–10]:
. Simulation and Virtual Prototyping
Using AI technology, you can develop accurate and reliable models that simulate
the behavior of mechanical systems. These models can be used to predict outcomes,
assess designs, and assess the impact of various parameters and operating condi-
tions before building physical prototypes. AI can improve simulation capabilities by
enhancing accuracy, reducing computation time, and automating model generation.
. Optimization and Design Exploration
AI algorithms help optimize mechanical design by exploring large design spaces
and finding the best solution. These algorithms can work in conjunction with the
machine model to iteratively search for design parameters that meet specific goals
such as enhancing performance, minimizing weight, or reducing cost.
. Control and Automation
AI-based systems use models to predict and monitor the behavior of a mechanical system. By combining real-time data with the model, AI algorithms can make intelligent decisions and adjust control parameters to enhance performance, improve stability, and deal with changing conditions.
Model-centric and data-centric approaches each have their pros and cons, and the selection depends on the requirements of the AI application. In practice, a combination of the two approaches will create a more effective AI system.
The dataset contains images of tools like hammers, wrenches, pliers, ropes, etc., as shown in Fig. 7.1. The aim of the case study is to classify a given mechanical tool image correctly; some of the images from the dataset are shown in that figure.
By developing a deep learning model, we can classify the images; in this case study the model reaches an accuracy of 90.00%.
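The case study's own code is not reproduced here; the following is a minimal Keras sketch of a small CNN for this kind of tool-image classification, where the directory name, image size, and architecture are assumptions and the final accuracy depends entirely on the data.

```python
# Small CNN sketch for classifying tool images stored in class-named subfolders.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "tools_dataset/", image_size=(128, 128), batch_size=32)

num_classes = len(train_ds.class_names)
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=10)  # the reported ~90% accuracy depends on the data
```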
In mechanical engineering, a data-centric approach means using clean data and its outcomes to make informed decisions, build products, and so on. If we use data that contains a lot of noise and missing information, it will create a number of problems for developers: the resulting system will not be efficient, and it will carry considerable risk because the data is dirty. Hence, it is imperative to use a data-centric approach for good and reliable outcomes.
7.7 Summary
In data-centric AI, the focus is on understanding the data and its characteristics. It will
ensure data quality and reliability and use it to extract meaningful insights. In mechan-
ical engineering, we can detect probable system component failures using real-time
data gathered from sensors, enabling preventive maintenance prior to any breakdown.
This strategy reduces downtime and streamlines maintenance plans. AI algorithms
are also capable of pattern recognition and process optimization for mechanical
processes, including production parameters like cutting rates, tool trajectories, and
material selection. The use of AI in performance optimization hence improves the
general effectiveness and standard of mechanical systems.
References
1. Gallagher, M. (2009). Data collection and analysis. Researching With Children And Young
People: Research Design, Methods And Analysis, 65–127.
2. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data
preprocessing: Methods and prospects. Big Data Analytics, 1(1), 1–22.
3. Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., & Hu, X. (2023). Data-centric ai: Perspectives and
challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM)
(pp. 945–948). Society for Industrial and Applied Mathematics.
4. Polyzotis, N., & Zaharia, M. (2021). What can data-centric ai learn from data and ml
engineering?. arXiv preprint arXiv:2112.06439.
5. Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., Jiang, Z., Zhong, S., & Hu, X. (2023). Data-centric
artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.
6. Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W. G., Diamos, S., & Reddi, V. J.
(2022). Dataperf: Benchmarks for data-centric ai development. arXiv preprint arXiv:2207.
10062.
7. Amini, M., Sharifani, K., & Rahmani, A. (2023). Machine learning model towards evaluating
data gathering methods in manufacturing and mechanical engineering. International Journal
of Applied Science and Engineering Research, 15(2023), 349–362.
8. Patel, A. R., Ramaiya, K. K., Bhatia, C. V., Shah, H. N., & Bhavsar, S. N. (2021). Artificial
intelligence: Prospect in mechanical engineering field—a review. Data Science and Intelligent
Applications: Proceedings of ICDSIA, 2020, 267–282.
9. Razvi, S. S., Feng, S., Narayanan, A., Lee, Y. T. T., & Witherell, P. (2019, August). A review of
machine learning applications in additive manufacturing. In International design engineering
technical conferences and computers and information in engineering conference (Vol. 59179,
p. V001T02A040). American Society of Mechanical Engineers.
10. Huang, Q. (2016, July). Application of artificial intelligence in mechanical engineering. In
2nd International conference on computer engineering, information science & application
technology (ICCIA 2017) (pp. 882–887). Atlantis Press.
Chapter 8
Data-Centric AI in Information,
Communication and Technology
8.1 Overview
The need for a data-centric approach in various fields arises from the increasing avail-
ability of data and the recognition of its crucial role in driving innovation, improving
decision-making, and enhancing system performance. The data-centric approach
prioritizes the collection, curation, and utilization of high-quality data to inform and
optimize processes, systems, and strategies. However, this approach also comes with several challenges that need to be addressed. In this section, we delve into the need for a data-centric approach and discuss the key challenges associated with it [4].
Need for a Data-Centric Approach:
. Enhanced Decision-Making: Data-centric approaches enable organizations to
make informed and data-driven decisions. By analyzing large volumes of data,
organizations can gain valuable insights into customer behavior, market trends,
operational efficiency, and other critical factors. These insights help in identi-
fying patterns, predicting outcomes, and making better strategic and operational
decisions [4].
. Innovation and Product Development: Data-centric approaches foster innovation
by providing valuable information for product development and optimization.
By collecting and analyzing customer feedback, usage patterns, and market data,
organizations can identify gaps, uncover new opportunities, and design products
and services that cater to specific customer needs. Data-centricity enables iterative
improvement and innovation cycles by continuously gathering user feedback and
incorporating it into product development processes [4].
. Personalized Experiences: In today’s digital age, customers expect personal-
ized experiences. Data-centric approaches allow organizations to collect and
analyze customer data to gain insights into preferences, behaviors, and needs.
This information enables the delivery of tailored experiences, recommendations,
and services, enhancing customer satisfaction, engagement, and loyalty [4].
. Process Optimization and Efficiency: Data-centric approaches help organiza-
tions optimize processes and improve operational efficiency. By collecting data
on process performance, resource utilization, and bottlenecks, organizations can
identify areas of improvement, eliminate inefficiencies, and streamline operations.
This leads to cost savings, increased productivity, and enhanced competitiveness
[4].
. Performance Monitoring and Maintenance: Data-centric approaches enable
continuous monitoring and maintenance of systems, equipment, and infrastruc-
ture. By collecting and analyzing real-time operational data, organizations can
detect anomalies, predict failures, and proactively address issues before they
escalate. This improves system reliability, minimizes downtime, and reduces
maintenance costs [4].
Challenges of a Data-Centric Approach [4]:
. Data Quality and Reliability: One of the primary challenges of a data-centric
approach is ensuring the quality and reliability of the data used for decision-
making. Data may be incomplete, inaccurate, or biased, leading to incorrect
insights and flawed decision-making. Data cleansing, validation, and verification
processes are necessary to ensure data quality and reliability [4].
. Data Privacy and Security: With the increasing emphasis on data-centric
approaches, data privacy and security become critical concerns. Organizations
need to implement robust data protection measures to safeguard sensitive and
personal information. Compliance with privacy regulations, encryption, access
controls, and secure data storage are essential to maintain data privacy and protect
against cyber threats [4].
. Data Collection and Integration: Collecting relevant and comprehensive data
can be a complex and resource-intensive task. Different data sources may have
varying formats, structures, and semantics, making integration and consolida-
tion challenging. Organizations need to establish effective data collection mecha-
nisms, employ data governance frameworks, and implement data integration and
transformation processes to ensure seamless data flow and interoperability [4].
. Scalability and Infrastructure: Data-centric approaches require robust and scalable
infrastructure to handle large volumes of data. Processing, storing, and analyzing
massive datasets can strain existing IT infrastructure. Organizations need to invest
in scalable storage solutions, high-performance computing resources, and efficient
data processing frameworks to handle the data-centric demands of their operations
[4].
. Data Interpretation and Analysis: Making sense of the vast amount of data
collected is a complex task. Data-centric approaches require skilled data scientists,
analysts, and domain experts who can interpret and analyze the data effectively.
Organizations need to invest in talent acquisition, training, and the development
of data analytics capabilities to extract meaningful insights from the data [4].
. Ethical and Bias Concerns: Data-centric approaches raise ethical concerns
regarding data usage and potential biases. Biases can emerge from biased data
collection, algorithmic biases, or unfair data representations. Organizations must
be vigilant in identifying and addressing biases to ensure fairness, transparency,
and ethical use of data in decision-making processes [4].
. Data Governance and Compliance: With the increased reliance on data, organiza-
tions need to establish robust data governance frameworks to ensure responsible
data management. Data-centric approaches require organizations to comply with
regulations, industry standards, and ethical guidelines regarding data collection,
storage, processing, and sharing. Implementing data governance frameworks and
ensuring compliance can be complex and resource-intensive [4].
. Cultural and Organizational Shifts: Adopting a data-centric approach often
requires a cultural and organizational shift within an organization. It involves
changing mindsets, embracing data-driven decision-making, and fostering a data-
centric culture. Organizations need to invest in change management, provide
training and education on data literacy, and promote a data-driven mindset across
all levels of the organization [4].
In conclusion, the need for a data-centric approach arises from the growing
importance of data in driving innovation, decision-making, and system optimiza-
tion. However, this approach is not without its challenges. Addressing issues related
to data quality, privacy, security, integration, scalability, interpretation, ethics, gover-
nance, and organizational readiness is crucial for organizations to fully leverage
the potential of a data-centric approach. Overcoming these challenges requires a
comprehensive and holistic strategy that encompasses technological, organizational,
and cultural aspects to harness the power of data effectively.
8.3 Application Implementation in Data-Centric Approach
With the advent of big data and advances in machine learning, there has been a
paradigm shift toward a data-centric approach in ICT appli-
cation implementation. The data-centric approach emphasizes the collection, anal-
ysis, and utilization of large volumes of data to drive innovation, optimize processes,
and deliver enhanced user experiences (Fig. 8.2).
In this section, we explore the implementation of ICT applications through a
model-centric approach and discuss its benefits and challenges.
. Software Development: In the model-centric approach, software development
revolves around designing and implementing software models and algorithms.
Developers focus on creating logical models, architectural designs, and algorithms
that define the behavior and functionality of the application. The model-centric
approach emphasizes the software development life cycle, including requirements
gathering, design, coding, testing, and deployment. The goal is to create efficient
and scalable software models that can handle various tasks and deliver the desired
functionalities [5].
. Algorithm Design and Optimization: Model-centric ICT application implementation
emphasizes algorithm design and optimization. Developers and data scientists
create algorithms that perform specific tasks or solve particular problems. These
algorithms are often designed using mathematical and computational models,
which are then implemented in software systems. The model-centric approach
focuses on optimizing algorithms for efficiency, accuracy, and scalability to handle
large-scale data processing and analysis (a minimal tuning sketch follows this
list) [5].
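To make the model-centric emphasis on algorithm optimization concrete, the sketch
below (Python with scikit-learn) holds the dataset fixed and searches over a
classifier's hyperparameters. The dataset and the parameter grid are illustrative
assumptions, not details taken from the text.

# Model-centric sketch: the data is held fixed and performance is sought by
# searching over model hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_test, y_test))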
8.5 Comparison of Model-Centric AI and Data-Centric AI
Model-centric AI and data-centric AI are two distinct paradigms that approach
artificial intelligence from different angles. In this section, we compare and contrast
these two approaches, highlighting their differences, strengths, and weaknesses.
1. Focus and Emphasis
. Model-Centric AI: Model-centric AI places a strong emphasis on designing
and developing sophisticated algorithms and models. The focus is on creating
intelligent systems through the use of well-defined models and algorithms that
capture knowledge and decision-making processes. The models are designed
to solve specific tasks and make accurate predictions based on the given inputs
[7].
. Data-Centric AI: Data-centric AI, on the other hand, focuses on the collec-
tion, analysis, and utilization of large volumes of data. The emphasis is on
extracting meaningful insights, patterns, and correlations from the data to drive
decision-making and innovation. Data-centric AI leverages machine learning
techniques to learn from data and improve performance over time [6–8].
2. Approach to Problem-Solving
. Model-Centric AI: Model-centric AI approaches problem-solving through
the lens of designing and optimizing algorithms and models. The focus is on
creating intelligent systems that rely on predefined rules, logic, and representations.
These models are trained on specific datasets and are expected to
generalize well to new inputs (a contrasting data-centric sketch follows this
comparison) [6].
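To contrast with the model-centric tuning sketch shown earlier, the sketch below
keeps the model fixed and tries to improve performance by improving the data:
labels that a cross-validated model assigns very low confidence are treated as
suspect and removed before retraining. The simulated label noise, the confidence
threshold, and the choice of classifier are assumptions made for illustration.

# Data-centric sketch: the model is held fixed and performance is sought by
# improving the training data (here, filtering suspected label noise).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Simulate label noise: flip 10% of the training labels.
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_train), size=len(y_train) // 10, replace=False)
y_noisy = y_train.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Baseline: the same model trained directly on the noisy labels.
print("Accuracy with noisy labels:",
      model.fit(X_train, y_noisy).score(X_test, y_test))

# Data-centric step: flag examples whose given label receives very low
# out-of-fold probability, drop them, and retrain the SAME model.
proba = cross_val_predict(model, X_train, y_noisy, cv=5, method="predict_proba")
confidence_in_given_label = proba[np.arange(len(y_noisy)), y_noisy]
keep = confidence_in_given_label > 0.2   # threshold is an illustrative assumption
print("Accuracy after filtering suspect labels:",
      model.fit(X_train[keep], y_noisy[keep]).score(X_test, y_test))

The model and its hyperparameters never change; only the data does, which is the
essence of the data-centric view described above.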
8.6 Summary
The first phase involved collecting and integrating data from multiple sources
into a central repository to create a unified view of the customer and their ICT
services. The use of application programming interfaces (APIs) and cloud-based
data warehousing solutions helped to simplify the integration process [1].
2. Data Analysis and Modeling [11]
The second phase involved using machine learning and other advanced analytics
techniques to analyze the data and create models. The models were used to iden-
tify patterns, predict outcomes, and provide recommendations for improving ICT
services. Specific techniques used included [1]:
. Natural Language Processing (NLP) to analyze customer feedback and sentiment.
. Predictive modeling to anticipate network performance issues.
. Time series forecasting to predict future service demand (a minimal forecasting
sketch follows this list).
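As an illustration of the forecasting step, the sketch below builds a simple
lag-feature regression on synthetic daily demand data using pandas and
scikit-learn. The synthetic series, the seven-day lag window, and the 30-day
hold-out are assumptions for the example; the case study itself does not specify
the techniques beyond what is listed above [1].

# Minimal time-series forecasting sketch for service demand (synthetic data;
# in the case study this role was played by the operator's real demand history).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily demand with a weekly pattern and a slow upward trend.
days = pd.date_range("2022-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
demand = (100 + 0.1 * np.arange(365)
          + 10 * np.sin(2 * np.pi * np.arange(365) / 7)
          + rng.normal(0, 3, 365))
series = pd.Series(demand, index=days, name="demand")

# Lag features: predict today's demand from the previous seven days.
df = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, 8)})
df["target"] = series
df = df.dropna()

train, test = df.iloc[:-30], df.iloc[-30:]   # hold out the last 30 days
model = LinearRegression().fit(train.drop(columns="target"), train["target"])
pred = model.predict(test.drop(columns="target"))

mae = np.mean(np.abs(pred - test["target"].values))
print(f"Mean absolute error over the last 30 days: {mae:.2f}")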
3. Platform Deployment and Integration with ICT Services [11]
The final phase involved deploying the platform and integrating it with ICT
services. This enabled the telecom company to use the insights generated by the
data-centric AI platform to improve service delivery, increase efficiency, and reduce
costs [1]. For example:
. Insights from NLP analysis of customer feedback were used to identify areas for
improvement in customer service processes.
. Predictive modeling was used to proactively address network performance issues
before they impacted service quality.
. Time series forecasting was used to optimize resource allocation based on
anticipated service demand.
Results
The implementation of the data-centric AI platform has helped the telecom company
achieve several key outcomes [6–9]:
1. Improved Customer Satisfaction
The platform has enabled the telecom company to gain a deeper understanding of
customer needs and preferences. This has helped to improve the quality of their ICT
services and increase customer satisfaction.
2. Increased Efficiency
By using insights from the data-centric AI platform, the telecom company has
been able to streamline service delivery processes and reduce costs. For example,
predictive modeling has enabled the company to proactively address network perfor-
mance issues before they impact service quality, reducing costly service
disruptions.
3. Enhanced Security
The data-centric AI platform has helped the telecom company improve its cyber-
security posture. By analyzing network activity and identifying anomalies, the
platform helps flag potential security threats early, as the sketch below illustrates.
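The anomaly-detection idea can be illustrated with a minimal sketch using
scikit-learn's IsolationForest on synthetic network-activity features. The two
features (bytes transferred and connection count), the contamination rate, and the
injected anomalies are assumptions for the example, not details from the case
study [1].

# Minimal anomaly-detection sketch for network activity (synthetic features;
# the real platform would use the operator's own telemetry).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Normal traffic: bytes transferred and connection counts around typical values.
normal = rng.normal(loc=[500, 50], scale=[50, 5], size=(1000, 2))
# A few anomalous records: unusually heavy transfers with many connections.
anomalies = rng.normal(loc=[5000, 500], scale=[200, 20], size=(10, 2))
activity = np.vstack([normal, anomalies])

detector = IsolationForest(contamination=0.01, random_state=0).fit(activity)
labels = detector.predict(activity)   # -1 marks suspected anomalies

print("Records flagged as anomalous:", int((labels == -1).sum()))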
References
1. Kim, H. S., Jee, H. K., & Lee, H. J. (2020). Data-centric AI platform for ICT services.
International Journal of Advanced Science and Technology, 29(7), 513–520.
2. Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2017). Data-centric systems and
applications: An overview. ACM Transactions on Internet Technology, 17(4), 1–19.
3. Furuhata, T., et al. (2020). Data-centric approach to AI/ML for advanced information systems.
In IEEE International Conference on Information Reuse and Integration (pp. 159–166).
4. Yao, L., et al. (2020). Data-centric AI systems: Challenges and research directions. In IEEE
International Conference on Big Data (pp. 5563–5572).
5. Maier, A. L., et al. (2021). Towards data-centric artificial intelligence. IEEE Transactions on
Knowledge and Data Engineering, 33(3), 1235–1248.
6. Chen, M., et al. (2022). Data-centric AI systems: A survey. IEEE Transactions on Knowledge
and Data Engineering, 34(5), 937–959.
7. Zaki, M., et al. (2021). Data-centric AI systems: Opportunities, challenges, and research
directions. Future Generation Computer Systems, 115, 105–119.
8. Yaghmaie, A. D., & Datta, A. (2021). Data-centric AI: Concepts, challenges, and opportunities.
International Journal of Distributed Sensor Networks, 17(3), 15501477211002688.
9. Singh, D., et al. (2020). Data-centric AI approaches for internet of things: A review. IEEE
Access, 8, 137939–137958.
10. Farber, M., et al. (2019). Data-centric artificial intelligence: The next revolution in digital
transformation. Technical Report, McKinsey & Company.
11. Mahalle, P. N., Anggorojati, B., Prasad, N. R., & Prasad, R. (2013). Identity authentication and
capability based access control (IACAC) for the Internet of Things. Journal of Cyber Security
and Mobility, 1, 309–348.
Chapter 9
Conclusion
9.1 Summary
AI has become an interdisciplinary field, with applications across all verticals of
our day-to-day routine. There has been a lot of development in this field, and AI is
moving from a black box toward a white box: a new category of AI algorithms,
called explainable AI, is emerging in which the focus is more on the data than on
the models. Model-centric AI decides which machine learning or deep learning
algorithm is most appropriate for developing the model, based on the given data and
the questions to be posed on that data. However, there is a need for data-centric AI,
which focuses more on the data than on the selection of algorithms. This book
presents prominent use cases of data-centric AI along with its need and benefits [1].
It aims to provide insight into both data-centric AI and model-centric AI and to
express the need for a paradigm shift in AI.
The main focus of data-centric AI is on quality data, in contrast to model-centric
AI, which is about developing and enhancing models and algorithms to obtain
greater performance on a particular task. Data-centric AI views models as static
artifacts and places more emphasis on increasing data quality, whereas model-centric
AI regards data as a fixed artifact [2]. Quality over quantity is the top priority for
data-centric AI. A data-centric strategy helps to resolve many of the difficulties that
can occur while implementing AI infrastructure, as opposed to model-centric AI,
which aims to engineer performance advantages by enlarging datasets. This book
focuses on the need for a data-centric approach and presents prominent use cases of
data-centric AI in various domains. A basic understanding of model building,
training, and testing is required for the development of any AI application; in view
of this, the implementation details and the usage of the relevant tools are also
presented in this book.
References
1. Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., Jiang, Z., Zhong, S., & Hu, X. (2023). Data-centric
artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.
2. Whang, S. E., Roh, Y., Song, H., & Lee, J. G. (2021). Data collection and quality challenges in
deep learning: A data-centric AI perspective. arXiv preprint arXiv:2112.06409.